---
license: apache-2.0
pipeline_tag: zero-shot-image-classification
tags:
- clip
- zh
- image-text
---

# Model Details

This model is a Chinese CLIP model trained on the [Noah-Wukong Dataset](https://wukong-dataset.github.io/wukong-dataset/), which contains about 100M Chinese image-text pairs. We pair the ViT-B-32 image encoder from [OpenAI's CLIP](https://github.com/openai/CLIP) with the Chinese pre-trained language model [chinese-roberta-wwm](https://huggingface.co/hfl/chinese-roberta-wwm-ext) and train them with a contrastive objective. The image encoder is frozen and only the language model is fine-tuned. Training took about 10 days for 20 epochs on 8 A100 GPUs.
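
The card does not spell out the loss, but a standard CLIP-style contrastive objective with a frozen image tower would look roughly like the sketch below; the symmetric InfoNCE formulation and the learned temperature `logit_scale` are assumptions carried over from the original CLIP recipe, not details confirmed by this card.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, logit_scale):
    # Project both towers' outputs onto the unit sphere.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    # Scaled pairwise cosine similarities; matching pairs lie on the diagonal.
    logits_per_image = logit_scale * image_features @ text_features.t()
    labels = torch.arange(image_features.size(0), device=image_features.device)
    # Symmetric cross-entropy: images -> texts and texts -> images.
    return (F.cross_entropy(logits_per_image, labels) +
            F.cross_entropy(logits_per_image.t(), labels)) / 2
```

With the image encoder frozen, gradients from this loss only update the text tower, which is what lets the Chinese RoBERTa align to CLIP's existing image embedding space.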

# Taiyi (太乙)

Taiyi models are a branch of the Fengshenbang (封神榜) series of models. The Taiyi models are pre-trained with multimodal pre-training strategies. We will release more image-text models trained on Chinese datasets to benefit the Chinese community.

# Usage

```python
from PIL import Image
import requests
import clip
import torch
from transformers import BertForSequenceClassification, BertTokenizer
import numpy as np

# Load the Taiyi Chinese text encoder
text_tokenizer = BertTokenizer.from_pretrained("wf-genius/TaiYi-CLIP-ViT-B-32-Roberta-Chinese")
text_encoder = BertForSequenceClassification.from_pretrained("wf-genius/TaiYi-CLIP-ViT-B-32-Roberta-Chinese").eval()
# Candidate captions: "a cat", "a dog", "two cats", "two tigers", "a tiger"
text = text_tokenizer(["一只猫", "一只狗", "两只猫", "两只老虎", "一只老虎"], return_tensors='pt', padding=True)['input_ids']

# Load the CLIP image encoder
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
clip_model, preprocess = clip.load("ViT-B/32", device='cpu')
image = preprocess(Image.open(requests.get(url, stream=True).raw)).unsqueeze(0)

with torch.no_grad():
    image_features = clip_model.encode_image(image)
    text_features = text_encoder(text).logits
    # Normalize the embeddings
    image_features = image_features / image_features.norm(dim=1, keepdim=True)
    text_features = text_features / text_features.norm(dim=1, keepdim=True)
    # Cosine similarity; logit_scale is the learned temperature
    logit_scale = clip_model.logit_scale.exp()
    logits_per_image = logit_scale * image_features @ text_features.t()
    logits_per_text = logits_per_image.t()
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()
    print(np.around(probs, 3))
```
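
Note that the `clip` import above refers to OpenAI's reference CLIP implementation, which is typically installed with `pip install git+https://github.com/openai/CLIP.git` rather than from PyPI.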

# Evaluation

### Zero-Shot Classification

| Model | Dataset | Top-1 | Top-5 |
| ---- | ---- | ---- | ---- |
| TaiYi-CLIP-ViT-B-32-Roberta-Chinese | ImageNet-CN | 40.64% | 69.16% |
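
Zero-shot classification treats each label as a caption and picks the class whose text embedding is most similar to the image embedding. Below is a minimal sketch of that protocol, reusing `text_tokenizer`, `text_encoder`, `clip_model`, and `image_features` from the Usage snippet; the Chinese prompt template and the three class names are illustrative stand-ins for the full ImageNet-CN label set, not the exact prompts behind the numbers above.

```python
import torch

# Hypothetical label subset; the real ImageNet-CN evaluation uses 1000 classes.
class_names = ["猫", "狗", "老虎"]  # cat, dog, tiger
prompts = [f"一张{name}的照片" for name in class_names]  # "a photo of a <name>"

inputs = text_tokenizer(prompts, return_tensors='pt', padding=True)['input_ids']
with torch.no_grad():
    class_features = text_encoder(inputs).logits
    class_features = class_features / class_features.norm(dim=1, keepdim=True)
    # image_features: the normalized image embedding from the Usage snippet.
    logits = clip_model.logit_scale.exp() * image_features @ class_features.t()
    pred = logits.argmax(dim=-1)
print("predicted class:", class_names[pred.item()])
```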

### Text-to-Image Retrieval

| Model | Dataset | Top-1 | Top-5 | Top-10 |
| ---- | ---- | ---- | ---- | ---- |
| TaiYi-CLIP-ViT-B-32-Roberta-Chinese | COCO-CN | 25.47% | 51.70% | 63.07% |
| TaiYi-CLIP-ViT-B-32-Roberta-Chinese | wukong50k | 47.64% | 80.97% | 89.51% |
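
Top-K here is recall@K: the fraction of text queries whose ground-truth image lands among the K highest-scoring gallery images. A small sketch of the metric, assuming L2-normalized embedding matrices `text_embs` and `image_embs` whose rows are aligned ground-truth pairs (an assumption about the evaluation setup, not code from this repository):

```python
import torch

def recall_at_k(text_embs: torch.Tensor, image_embs: torch.Tensor, k: int) -> float:
    # Similarity of every text query against the whole image gallery.
    sims = text_embs @ image_embs.t()                       # (N, N)
    topk = sims.topk(k, dim=1).indices                      # (N, k) gallery indices
    targets = torch.arange(text_embs.size(0)).unsqueeze(1)  # ground truth is row i
    return (topk == targets).any(dim=1).float().mean().item()
```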

# Citation

If you find this resource useful, please cite the following website in your paper.

```
@misc{Fengshenbang-LM,
  title={Fengshenbang-LM},
  author={IDEA-CCNL},
  year={2022},
  howpublished={\url{https://github.com/IDEA-CCNL/Fengshenbang-LM}},
}
```