	---
	license: apache-2.0
	# inference: false
	# pipeline_tag: zero-shot-image-classification
	pipeline_tag: feature-extraction

	# inference:
	# parameters:
	tags:
	- clip
	- zh
	- image-text
	- feature-extraction
	---

	# Model Details

	This is a Chinese CLIP model trained on the [Noah-Wukong Dataset](https://wukong-dataset.github.io/wukong-dataset/), which contains about 100M Chinese image-text pairs. We use ViT-B-32 from [OpenAI](https://github.com/openai/CLIP) as the image encoder and the Chinese pre-trained language model [chinese-roberta-wwm](https://huggingface.co/hfl/chinese-roberta-wwm-ext) as the text encoder. The image encoder is frozen and only the text encoder is fine-tuned. Training ran for 20 epochs and took about 10 days on 8 A100 GPUs.
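The frozen-image-encoder setup described above can be sketched in PyTorch as follows. This is a minimal illustration, not the authors' training code: the two `nn.Linear` layers are hypothetical stand-ins for the ViT-B-32 and RoBERTa towers, and the optimizer and learning rate are assumptions.

```python
import torch
from torch import nn

# Hypothetical stand-ins for the two towers (the real model uses
# ViT-B-32 for images and chinese-roberta-wwm for text).
image_encoder = nn.Linear(8, 4)
text_encoder = nn.Linear(6, 4)

# Freeze the image tower: no gradients are computed for its weights,
# and eval mode disables dropout and similar training-only behaviour.
for param in image_encoder.parameters():
    param.requires_grad = False
image_encoder.eval()

# Only the text tower's parameters are handed to the optimizer.
# AdamW and lr=1e-5 are assumptions, not values from this model card.
optimizer = torch.optim.AdamW(text_encoder.parameters(), lr=1e-5)

trainable = sum(p.numel() for p in text_encoder.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in image_encoder.parameters() if not p.requires_grad)
print(f"trainable text params: {trainable}, frozen image params: {frozen}")
```

Because the image tower never changes, the fine-tuned text encoder stays compatible with the original `openai/clip-vit-base-patch32` image features.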

	# Taiyi (太乙)
	Taiyi models are a branch of the Fengshenbang (封神榜) series of models. The models in Taiyi are pre-trained with multimodal pre-training strategies. We will release more image-text models trained on Chinese datasets to benefit the Chinese community.



	# Usage

	```python3
	from PIL import Image
	import requests
	import torch
	from transformers import BertForSequenceClassification, BertTokenizer
	from transformers import CLIPProcessor, CLIPModel
	import numpy as np

	query_texts = ["一只猫", "一只狗", "两只猫", "两只老虎", "一只老虎"]  # Input texts; replace with any queries.
	# Load the Taiyi Chinese text encoder
	text_tokenizer = BertTokenizer.from_pretrained("IDEA-CCNL/Taiyi-CLIP-Roberta-102M-Chinese")
	text_encoder = BertForSequenceClassification.from_pretrained("IDEA-CCNL/Taiyi-CLIP-Roberta-102M-Chinese").eval()
	text = text_tokenizer(query_texts, return_tensors='pt', padding=True)['input_ids']

	url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # Replace with any image URL.
	# Load the CLIP image encoder
	clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
	processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
	image = processor(images=Image.open(requests.get(url, stream=True).raw), return_tensors="pt")

	with torch.no_grad():
	    image_features = clip_model.get_image_features(**image)
	    # The classification head's logits serve as the text embedding
	    text_features = text_encoder(text).logits
	    # Normalize both embeddings to unit length
	    image_features = image_features / image_features.norm(dim=1, keepdim=True)
	    text_features = text_features / text_features.norm(dim=1, keepdim=True)
	    # Compute cosine similarity; logit_scale is the learned scaling factor
	    logit_scale = clip_model.logit_scale.exp()
	    logits_per_image = logit_scale * image_features @ text_features.t()
	    logits_per_text = logits_per_image.t()
	    # Probability of each query text for the image
	    probs = logits_per_image.softmax(dim=-1).cpu().numpy()
	    print(np.around(probs, 3))
	```

	# Evaluation

	### Zero-Shot Classification
	| model | dataset | Top1 | Top5 |
	| ---- | ---- | ---- | ---- |
	| Taiyi-CLIP-Roberta-102M-Chinese | ImageNet1k-CN | 41.00% | 69.19% |
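For readers unfamiliar with the metric, Top-k accuracy on image-to-class similarity scores can be computed roughly as below. This is an illustrative sketch with toy numbers, not the evaluation script used for the table; the helper name `topk_accuracy` is hypothetical.

```python
import numpy as np

def topk_accuracy(logits: np.ndarray, labels: np.ndarray, k: int) -> float:
    """Fraction of rows whose true label is among the k highest-scoring classes."""
    # Indices of the k largest scores in each row
    topk = np.argsort(-logits, axis=1)[:, :k]
    hits = (topk == labels[:, None]).any(axis=1)
    return float(hits.mean())

# Toy similarity scores: 3 images (rows) against 4 class prompts (columns)
logits = np.array([
    [0.9, 0.1, 0.3, 0.2],  # true class 0 is ranked 1st
    [0.2, 0.5, 0.7, 0.1],  # true class 1 is ranked 2nd
    [0.6, 0.4, 0.3, 0.1],  # true class 3 is ranked 4th
])
labels = np.array([0, 1, 3])

top1 = topk_accuracy(logits, labels, k=1)
top2 = topk_accuracy(logits, labels, k=2)
print(f"Top1: {top1:.2%}, Top2: {top2:.2%}")
```

In a zero-shot setting, each column would correspond to the text embedding of one class name, so no classifier is trained on the target dataset.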

	### Zero-Shot Text-to-Image Retrieval

	| model | dataset | Top1 | Top5 | Top10 |
	| ---- | ---- | ---- | ---- | ---- |
	| Taiyi-CLIP-Roberta-102M-Chinese | Flickr30k-CNA-test | 44.06% | 71.42% | 80.84% |
	| Taiyi-CLIP-Roberta-102M-Chinese | COCO-CN-test | 46.24% | 78.06% | 88.88% |
	| Taiyi-CLIP-Roberta-102M-Chinese | wukong50k | 48.67% | 81.77% | 90.09% |
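As a rough guide to how such retrieval numbers are typically obtained, the sketch below computes Recall@K from text and image embeddings. It assumes the Top-K columns mean Recall@K with exactly one ground-truth image per text query, which this card does not state explicitly; the function name and toy embeddings are hypothetical.

```python
import numpy as np

def recall_at_k(text_emb, image_emb, gt_image, ks=(1, 5, 10)):
    """Recall@K for text-to-image retrieval with one ground-truth image per text."""
    # L2-normalize so the dot product equals cosine similarity
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    sims = text_emb @ image_emb.T  # shape: (num_texts, num_images)
    # Rank of the ground-truth image = number of images scoring strictly higher
    gt_scores = sims[np.arange(len(gt_image)), gt_image]
    ranks = (sims > gt_scores[:, None]).sum(axis=1)
    return {k: float((ranks < k).mean()) for k in ks}

# Toy 2-D embeddings: 3 candidate images and 3 text queries
image_emb = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
text_emb = np.array([[2.0, 0.1], [0.1, 3.0], [1.0, 0.0]])
gt_image = np.array([0, 1, 2])  # the third query's true image is only ranked 2nd

recall = recall_at_k(text_emb, image_emb, gt_image, ks=(1, 2))
print(recall)
```

With real data, `text_emb` would come from the Taiyi text encoder and `image_emb` from the CLIP image encoder, as in the Usage section.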


	# Citation

	If you find this resource useful, please cite the following repository in your paper.

	```
	@misc{Fengshenbang-LM,
	title={Fengshenbang-LM},
	author={IDEA-CCNL},
	year={2022},
	howpublished={\url{https://github.com/IDEA-CCNL/Fengshenbang-LM}},
	}
	```