---
license: apache-2.0
pipeline_tag: feature-extraction
tags:
- clip
- zh
- image-text
- feature-extraction
---

# Model Details

This model is a Chinese CLIP model trained on the [Noah-Wukong Dataset](https://wukong-dataset.github.io/wukong-dataset/), which contains about 100M Chinese image-text pairs. We use ViT-B-32 from [OpenAI](https://github.com/openai/CLIP) as the image encoder and the Chinese pre-trained language model [chinese-roberta-wwm](https://huggingface.co/hfl/chinese-roberta-wwm-ext) as the text encoder. We freeze the image encoder and fine-tune only the text encoder. The model was trained for 20 epochs, which took about 10 days on 8 A100 GPUs.
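
The setup described above (frozen image encoder, trainable text encoder) can be sketched roughly as follows. This is a minimal illustration, not the authors' training code: the two `nn.Linear` modules are hypothetical stand-ins for the real encoders, the toy data is random, and the symmetric contrastive loss with a fixed `logit_scale` is the usual CLIP objective, assumed here because this card does not state the loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical stand-ins for the real encoders (ViT-B-32 and chinese-roberta-wwm).
image_encoder = nn.Linear(16, 8)
text_encoder = nn.Linear(12, 8)

# Freeze the image encoder: its parameters no longer require gradients.
image_encoder.requires_grad_(False)
image_encoder.eval()

# Only the text encoder's parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(text_encoder.parameters(), lr=1e-3)

def clip_loss(image_emb, text_emb, logit_scale=100.0):
    """Symmetric contrastive loss over a batch of matched image-text pairs."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = logit_scale * image_emb @ text_emb.t()
    targets = torch.arange(logits.size(0))  # pair i matches pair i
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

# One training step on random toy data (a batch of 4 pairs).
images, texts = torch.randn(4, 16), torch.randn(4, 12)
image_weight_before = image_encoder.weight.clone()
text_weight_before = text_encoder.weight.clone()

with torch.no_grad():
    image_emb = image_encoder(images)  # frozen branch, no autograd graph needed
loss = clip_loss(image_emb, text_encoder(texts))
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

After the step, the image encoder's weights are unchanged while the text encoder's weights have moved, which is the behavior the paragraph above describes.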

# Taiyi (太乙)

Taiyi models are a branch of the Fengshenbang (封神榜) series of models. The models in Taiyi are pre-trained with multimodal pre-training strategies. We will release more image-text models trained on Chinese datasets to benefit the Chinese community.

# Usage

```python3
from PIL import Image
import requests
import torch
from transformers import BertForSequenceClassification, BertTokenizer
from transformers import CLIPProcessor, CLIPModel
import numpy as np

# Input texts; replace them with any Chinese phrases you like.
# (a cat, a dog, two cats, two tigers, a tiger)
query_texts = ["一只猫", "一只狗", "两只猫", "两只老虎", "一只老虎"]

# Load the Taiyi Chinese text encoder.
text_tokenizer = BertTokenizer.from_pretrained("IDEA-CCNL/Taiyi-CLIP-Roberta-102M-Chinese")
text_encoder = BertForSequenceClassification.from_pretrained("IDEA-CCNL/Taiyi-CLIP-Roberta-102M-Chinese").eval()
text = text_tokenizer(query_texts, return_tensors='pt', padding=True)['input_ids']

# Any image URL can be used here.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"

# Load the CLIP image encoder.
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
image = processor(images=Image.open(requests.get(url, stream=True).raw), return_tensors="pt")

with torch.no_grad():
    image_features = clip_model.get_image_features(**image)
    text_features = text_encoder(text).logits
    # Normalize the features to unit length.
    image_features = image_features / image_features.norm(dim=1, keepdim=True)
    text_features = text_features / text_features.norm(dim=1, keepdim=True)
    # Compute the cosine similarity; logit_scale is the scaling factor.
    logit_scale = clip_model.logit_scale.exp()
    logits_per_image = logit_scale * image_features @ text_features.t()
    logits_per_text = logits_per_image.t()
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()
    print(np.around(probs, 3))
```
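
The scoring step in the snippet above is a scaled cosine similarity followed by a softmax over the candidate texts. A minimal NumPy sketch shows that arithmetic in isolation; the 3-dimensional features are made-up values rather than real model outputs, `match_probs` is a hypothetical helper, and the `logit_scale` of 100 is an assumed value for illustration.

```python
import numpy as np

def match_probs(image_features, text_features, logit_scale=100.0):
    """Softmax over texts of the scaled cosine similarity, one row per image."""
    img = image_features / np.linalg.norm(image_features, axis=1, keepdims=True)
    txt = text_features / np.linalg.norm(text_features, axis=1, keepdims=True)
    logits = logit_scale * img @ txt.T
    logits -= logits.max(axis=1, keepdims=True)  # subtract the row max for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

# Toy features: one image and three candidate texts.
image_features = np.array([[1.0, 0.0, 0.0]])
text_features = np.array([
    [2.0, 0.1, 0.0],  # points almost the same way as the image
    [0.0, 3.0, 0.0],  # orthogonal to the image
    [1.0, 1.0, 0.0],  # 45 degrees away from the image
])

probs = match_probs(image_features, text_features)
best = int(probs.argmax(axis=1)[0])  # index of the best-matching text
```

Because the features are normalized first, only the direction of each vector matters, not its length. The large scale factor makes the softmax sharply peaked, so nearly all of the probability mass lands on the closest text.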

# Evaluation

### Zero-Shot Classification

| model | dataset | Top1 | Top5 |
| ---- | ---- | ---- | ---- |
| Taiyi-CLIP-Roberta-102M-Chinese | ImageNet1k-CN | 41.00% | 69.19% |

### Zero-Shot Text-to-Image Retrieval

| model | dataset | Top1 | Top5 | Top10 |
| ---- | ---- | ---- | ---- | ---- |
| Taiyi-CLIP-Roberta-102M-Chinese | Flickr30k-CNA-test | 44.06% | 71.42% | 80.84% |
| Taiyi-CLIP-Roberta-102M-Chinese | COCO-CN-test | 46.24% | 78.06% | 88.88% |
| Taiyi-CLIP-Roberta-102M-Chinese | wukong50k | 48.67% | 81.77% | 90.09% |
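
The Top1/Top5/Top10 columns are presumably recall@k scores: the fraction of text queries whose ground-truth image ranks among the k most similar images. This card does not describe the evaluation script, so the following is only a sketch under that assumption; `recall_at_k` and the toy similarity matrix are hypothetical.

```python
import numpy as np

def recall_at_k(similarity, gt_image, ks=(1, 5, 10)):
    """similarity[i, j] is the score of text i against image j;
    gt_image[i] is the index of the correct image for text i."""
    # Image indices sorted from most to least similar, per text query.
    ranking = np.argsort(-similarity, axis=1)
    # 0-based rank position of the ground-truth image for each query.
    gt_rank = np.argmax(ranking == np.asarray(gt_image)[:, None], axis=1)
    return {k: float(np.mean(gt_rank < k)) for k in ks}

# Toy example: 4 text queries scored against 3 images.
similarity = np.array([
    [0.9, 0.1, 0.3],  # correct image 0 is ranked 1st
    [0.2, 0.4, 0.8],  # correct image 1 is ranked 2nd
    [0.1, 0.2, 0.7],  # correct image 2 is ranked 1st
    [0.5, 0.6, 0.4],  # correct image 2 is ranked 3rd
])
gt_image = [0, 1, 2, 2]

scores = recall_at_k(similarity, gt_image, ks=(1, 2, 3))
print(scores)  # {1: 0.5, 2: 0.75, 3: 1.0}
```

The same function would give zero-shot classification top-k accuracy if the rows were images, the columns were class prompts, and `gt_image` held the true class index of each image.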

# Citation

If you find this resource useful, please cite the following website in your paper.

```
@misc{Fengshenbang-LM,
  title={Fengshenbang-LM},
  author={IDEA-CCNL},
  year={2022},
  howpublished={\url{https://github.com/IDEA-CCNL/Fengshenbang-LM}},
}
```