---
license: apache-2.0
pipeline_tag: feature-extraction
tags:
- clip
- zh
- image-text
- feature-extraction
---

# Model Details

This model is a Chinese CLIP model trained on the [Noah-Wukong Dataset](https://wukong-dataset.github.io/wukong-dataset/), which contains about 100M Chinese image-text pairs. We use ViT-B-32 from [OpenAI](https://github.com/openai/CLIP) as the image encoder and the Chinese pre-trained language model [chinese-roberta-wwm](https://huggingface.co/hfl/chinese-roberta-wwm-ext) as the text encoder. We freeze the image encoder and fine-tune only the text encoder. The model was trained for 20 epochs, which took about 10 days on 8 A100 GPUs.

# Taiyi (太乙)

Taiyi models are a branch of the Fengshenbang (封神榜) series of models. The Taiyi models are pre-trained with multimodal pre-training strategies. We will release more image-text models trained on Chinese datasets to benefit the Chinese community.

# Usage

```python
from PIL import Image
import requests
import clip
import torch
from transformers import BertForSequenceClassification, BertTokenizer
import numpy as np

# Load the Taiyi Chinese text encoder
text_tokenizer = BertTokenizer.from_pretrained("IDEA-CCNL/TaiYi-CLIP-Roberta-102M-Chinese")
text_encoder = BertForSequenceClassification.from_pretrained("IDEA-CCNL/TaiYi-CLIP-Roberta-102M-Chinese").eval()
text = text_tokenizer(["一只猫", "一只狗", "两只猫", "两只老虎", "一只老虎"], return_tensors='pt', padding=True)['input_ids']

# Load the CLIP image encoder
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
clip_model, preprocess = clip.load("ViT-B/32", device='cpu')
image = preprocess(Image.open(requests.get(url, stream=True).raw)).unsqueeze(0)

with torch.no_grad():
    image_features = clip_model.encode_image(image)
    text_features = text_encoder(text).logits
    # Normalize the features
    image_features = image_features / image_features.norm(dim=1, keepdim=True)
    text_features = text_features / text_features.norm(dim=1, keepdim=True)
    # Compute cosine similarity; logit_scale is the learned scale factor
    logit_scale = clip_model.logit_scale.exp()
    logits_per_image = logit_scale * image_features @ text_features.t()
    logits_per_text = logits_per_image.t()
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print(np.around(probs, 3))
```

# Evaluation

### Zero-Shot Classification

| model | dataset | Top1 | Top5 |
| ---- | ---- | ---- | ---- |
| TaiYi-CLIP-ViT-B-32-Roberta-Chinese | ImageNet1k-CN | 41.00% | 69.19% |

### Zero-Shot Text-to-Image Retrieval

| model | dataset | Top1 | Top5 | Top10 |
| ---- | ---- | ---- | ---- | ---- |
| TaiYi-CLIP-ViT-B-32-Roberta-Chinese | COCO-CN | 25.47% | 51.70% | 63.07% |
| TaiYi-CLIP-ViT-B-32-Roberta-Chinese | wukong50k | 48.67% | 81.77% | 90.09% |

A minimal text-to-image retrieval sketch is provided at the end of this card.

# Citation

If you find this resource useful, please cite the following website in your paper.

```
@misc{Fengshenbang-LM,
  title={Fengshenbang-LM},
  author={IDEA-CCNL},
  year={2022},
  howpublished={\url{https://github.com/IDEA-CCNL/Fengshenbang-LM}},
}
```
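
For the text-to-image retrieval setting reported in the Evaluation section, the same encoders can be used to rank a gallery of images against a Chinese query. The sketch below is a minimal illustration that reuses the loading code from the Usage section; the gallery URLs and the query string are placeholder examples, not part of the official evaluation pipeline.

```python
from PIL import Image
import requests
import clip
import torch
from transformers import BertForSequenceClassification, BertTokenizer

# Load the Taiyi Chinese text encoder and the CLIP image encoder, as in the Usage section.
text_tokenizer = BertTokenizer.from_pretrained("IDEA-CCNL/TaiYi-CLIP-Roberta-102M-Chinese")
text_encoder = BertForSequenceClassification.from_pretrained("IDEA-CCNL/TaiYi-CLIP-Roberta-102M-Chinese").eval()
clip_model, preprocess = clip.load("ViT-B/32", device='cpu')

# A small gallery of candidate images (placeholder URLs, not the evaluation data).
image_urls = [
    "http://images.cocodataset.org/val2017/000000039769.jpg",
    "http://images.cocodataset.org/val2017/000000000139.jpg",
]
images = torch.stack([
    preprocess(Image.open(requests.get(u, stream=True).raw)) for u in image_urls
])

# Encode a Chinese query ("two cats lying on a sofa", a placeholder example).
query = text_tokenizer(["两只猫躺在沙发上"], return_tensors='pt', padding=True)['input_ids']

with torch.no_grad():
    image_features = clip_model.encode_image(images)
    text_features = text_encoder(query).logits
    # Normalize and rank the gallery images by cosine similarity to the query.
    image_features = image_features / image_features.norm(dim=1, keepdim=True)
    text_features = text_features / text_features.norm(dim=1, keepdim=True)
    similarity = (text_features @ image_features.t()).squeeze(0)
    ranking = similarity.argsort(descending=True)

for rank, idx in enumerate(ranking.tolist(), start=1):
    print(f"rank {rank}: {image_urls[idx]} (score={similarity[idx].item():.3f})")
```

The COCO-CN and wukong50k numbers above correspond to running this kind of ranking over the full test galleries and measuring Top-1/5/10 recall.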