---
license: apache-2.0
pipeline_tag: zero-shot-image-classification
tags:
- clip
- zh
- image-text
---

# Model Details

This model is a Chinese CLIP model trained on the [Noah-Wukong Dataset](https://wukong-dataset.github.io/wukong-dataset/), which contains about 100M Chinese image-text pairs. We pair the ViT-B-32 image encoder from [OpenAI's CLIP](https://github.com/openai/CLIP) with the Chinese pre-trained language model [chinese-roberta-wwm](https://huggingface.co/hfl/chinese-roberta-wwm-ext) and train them with a contrastive objective. The image encoder is frozen and only the language model is fine-tuned. Training took about 10 days for 20 epochs on 8 A100 GPUs.
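
The card does not spell out the loss, but a standard CLIP-style contrastive objective with a frozen image tower would look roughly like the sketch below; the symmetric InfoNCE formulation and the learned temperature `logit_scale` are assumptions carried over from the original CLIP recipe, not details confirmed by this card.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, logit_scale):
    # Project both towers' outputs onto the unit sphere.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    # Scaled pairwise cosine similarities; matching pairs lie on the diagonal.
    logits_per_image = logit_scale * image_features @ text_features.t()
    labels = torch.arange(image_features.size(0), device=image_features.device)
    # Symmetric cross-entropy: images -> texts and texts -> images.
    return (F.cross_entropy(logits_per_image, labels) +
            F.cross_entropy(logits_per_image.t(), labels)) / 2
```

With the image encoder frozen, gradients from this loss only update the text tower, which is what lets the Chinese RoBERTa align to CLIP's existing image embedding space.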

# Taiyi (太乙)

Taiyi models are a branch of the Fengshenbang (封神榜) series of models. The Taiyi models are pre-trained with multimodal pre-training strategies. We will release more image-text models trained on Chinese datasets to benefit the Chinese community.

# Usage

```python
from PIL import Image
import requests
import clip
import torch
from transformers import BertForSequenceClassification, BertTokenizer
import numpy as np

# Load the Taiyi Chinese text encoder
text_tokenizer = BertTokenizer.from_pretrained("wf-genius/TaiYi-CLIP-ViT-B-32-Roberta-Chinese")
text_encoder = BertForSequenceClassification.from_pretrained("wf-genius/TaiYi-CLIP-ViT-B-32-Roberta-Chinese").eval()
# Candidate captions: "a cat", "a dog", "two cats", "two tigers", "a tiger"
text = text_tokenizer(["一只猫", "一只狗", "两只猫", "两只老虎", "一只老虎"], return_tensors='pt', padding=True)['input_ids']

# Load the CLIP image encoder
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
clip_model, preprocess = clip.load("ViT-B/32", device='cpu')
image = preprocess(Image.open(requests.get(url, stream=True).raw)).unsqueeze(0)

with torch.no_grad():
    image_features = clip_model.encode_image(image)
    text_features = text_encoder(text).logits
    # Normalize the embeddings
    image_features = image_features / image_features.norm(dim=1, keepdim=True)
    text_features = text_features / text_features.norm(dim=1, keepdim=True)
    # Cosine similarity; logit_scale is the learned temperature
    logit_scale = clip_model.logit_scale.exp()
    logits_per_image = logit_scale * image_features @ text_features.t()
    logits_per_text = logits_per_image.t()
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()
    print(np.around(probs, 3))
```
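
Note that the `clip` import above refers to OpenAI's reference CLIP implementation, which is typically installed with `pip install git+https://github.com/openai/CLIP.git` rather than from PyPI.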

# Evaluation

### Zero-Shot Classification

| Model | Dataset | Top-1 | Top-5 |
| ---- | ---- | ---- | ---- |
| TaiYi-CLIP-ViT-B-32-Roberta-Chinese | ImageNet-CN | 40.64% | 69.16% |
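
Zero-shot classification treats each label as a caption and picks the class whose text embedding is most similar to the image embedding. Below is a minimal sketch of that protocol, reusing `text_tokenizer`, `text_encoder`, `clip_model`, and `image_features` from the Usage snippet; the Chinese prompt template and the three class names are illustrative stand-ins for the full ImageNet-CN label set, not the exact prompts behind the numbers above.

```python
import torch

# Hypothetical label subset; the real ImageNet-CN evaluation uses 1000 classes.
class_names = ["猫", "狗", "老虎"]  # cat, dog, tiger
prompts = [f"一张{name}的照片" for name in class_names]  # "a photo of a <name>"

inputs = text_tokenizer(prompts, return_tensors='pt', padding=True)['input_ids']
with torch.no_grad():
    class_features = text_encoder(inputs).logits
    class_features = class_features / class_features.norm(dim=1, keepdim=True)
    # image_features: the normalized image embedding from the Usage snippet.
    logits = clip_model.logit_scale.exp() * image_features @ class_features.t()
    pred = logits.argmax(dim=-1)
print("predicted class:", class_names[pred.item()])
```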

### Text-to-Image Retrieval

| Model | Dataset | Top-1 | Top-5 | Top-10 |
| ---- | ---- | ---- | ---- | ---- |
| TaiYi-CLIP-ViT-B-32-Roberta-Chinese | COCO-CN | 25.47% | 51.70% | 63.07% |
| TaiYi-CLIP-ViT-B-32-Roberta-Chinese | wukong50k | 47.64% | 80.97% | 89.51% |
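
Top-K here is recall@K: the fraction of text queries whose ground-truth image lands among the K highest-scoring gallery images. A small sketch of the metric, assuming L2-normalized embedding matrices `text_embs` and `image_embs` whose rows are aligned ground-truth pairs (an assumption about the evaluation setup, not code from this repository):

```python
import torch

def recall_at_k(text_embs: torch.Tensor, image_embs: torch.Tensor, k: int) -> float:
    # Similarity of every text query against the whole image gallery.
    sims = text_embs @ image_embs.t()                       # (N, N)
    topk = sims.topk(k, dim=1).indices                      # (N, k) gallery indices
    targets = torch.arange(text_embs.size(0)).unsqueeze(1)  # ground truth is row i
    return (topk == targets).any(dim=1).float().mean().item()
```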

# Citation

If you find this resource useful, please cite the following website in your paper.

```
@misc{Fengshenbang-LM,
  title={Fengshenbang-LM},
  author={IDEA-CCNL},
  year={2022},
  howpublished={\url{https://github.com/IDEA-CCNL/Fengshenbang-LM}},
}
```