license: apache-2.0
pipeline_tag: feature-extraction
tags:
  - clip
  - zh
  - image-text
  - feature-extraction

Model Details

This model is a Chinese CLIP model trained on the Noah-Wukong dataset, which contains about 100M Chinese image-text pairs. We use ViT-B-32 from OpenAI as the image encoder and the Chinese pre-trained language model chinese-roberta-wwm as the text encoder. We freeze the image encoder and fine-tune only the text encoder. The model was trained for 20 epochs, which took about 10 days on 8 A100 GPUs.
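The freeze-and-fine-tune setup described above can be sketched in PyTorch as follows. This is only an illustration under stated assumptions: the two `nn.Linear` layers are hypothetical stand-ins for the real ViT-B-32 and chinese-roberta-wwm encoders, and the optimizer and learning rate are placeholders, not the settings used to train this model.

```python
import torch
from torch import nn

# Hypothetical stand-ins for the real encoders (tiny layers, illustration only).
image_encoder = nn.Linear(8, 4)  # plays the role of the frozen image encoder
text_encoder = nn.Linear(6, 4)   # plays the role of the trainable text encoder

# Freeze the image encoder: no gradients are computed for its weights.
for param in image_encoder.parameters():
    param.requires_grad = False
image_encoder.eval()

# Only the text encoder's parameters are handed to the optimizer.
# AdamW and lr=1e-5 are assumed values, not taken from the model card.
optimizer = torch.optim.AdamW(text_encoder.parameters(), lr=1e-5)

trainable = sum(p.numel() for p in text_encoder.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in image_encoder.parameters() if not p.requires_grad)
print(f"trainable text parameters: {trainable}, frozen image parameters: {frozen}")
```

Because the image encoder stays fixed, the text encoder learns to map Chinese text into the embedding space of the original CLIP image encoder, which is why the usage example can load the image side directly from `openai/clip-vit-base-patch32`.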

Taiyi (太乙)

Taiyi models are a branch of the Fengshenbang (封神榜) series of models. The models in Taiyi are pre-trained with multimodal pre-training strategies. We will release more image-text models trained on Chinese datasets to benefit the Chinese community.

Usage

import numpy as np
import requests
import torch
from PIL import Image
from transformers import BertForSequenceClassification, BertTokenizer
from transformers import CLIPModel, CLIPProcessor

query_texts = ["一只猫", "一只狗", "两只猫", "两只老虎", "一只老虎"]  # Input texts (a cat, a dog, two cats, two tigers, a tiger); replace freely.
# Load the Taiyi Chinese text encoder
text_tokenizer = BertTokenizer.from_pretrained("IDEA-CCNL/Taiyi-CLIP-Roberta-102M-Chinese")
text_encoder = BertForSequenceClassification.from_pretrained("IDEA-CCNL/Taiyi-CLIP-Roberta-102M-Chinese").eval()
text = text_tokenizer(query_texts, return_tensors='pt', padding=True)['input_ids']

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # Any image URL can be used here
# Load the CLIP image encoder
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
image = processor(images=Image.open(requests.get(url, stream=True).raw), return_tensors="pt")

with torch.no_grad():
    image_features = clip_model.get_image_features(**image)
    text_features = text_encoder(text).logits
    # L2-normalize the features
    image_features = image_features / image_features.norm(dim=1, keepdim=True)
    text_features = text_features / text_features.norm(dim=1, keepdim=True)
    # Compute cosine similarity; logit_scale is the learned scaling factor
    logit_scale = clip_model.logit_scale.exp()
    logits_per_image = logit_scale * image_features @ text_features.t()
    logits_per_text = logits_per_image.t()
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()
    print(np.around(probs, 3))
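To make the scoring step easier to follow, here is a dependency-free sketch of the same arithmetic on toy vectors: L2-normalize both sides, take the dot product (cosine similarity), multiply by a scale, and apply a softmax over the candidate texts. The function names, the toy vectors, and the default `logit_scale=100.0` are illustrative assumptions; in the real pipeline the scale comes from `clip_model.logit_scale.exp()`.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def image_text_probs(image_vec, text_vecs, logit_scale=100.0):
    """Probability of each text matching the image (hypothetical helper)."""
    image_unit = l2_normalize(image_vec)
    logits = []
    for text_vec in text_vecs:
        text_unit = l2_normalize(text_vec)
        cosine = sum(a * b for a, b in zip(image_unit, text_unit))
        logits.append(logit_scale * cosine)
    return softmax(logits)


# Toy features standing in for real encoder outputs.
toy_image = [1.0, 0.0, 0.0]
toy_texts = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 0.0]]

probs = image_text_probs(toy_image, toy_texts)
best = max(range(len(probs)), key=probs.__getitem__)
print("best match index:", best)
print("probabilities:", [round(p, 3) for p in probs])
```

A large scale sharpens the softmax, so even small differences in cosine similarity produce a confident ranking among the candidate texts.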

Evaluation

Zero-Shot Classification

| Model | Dataset | Top-1 | Top-5 |
| --- | --- | --- | --- |
| Taiyi-CLIP-Roberta-102M-Chinese | ImageNet1k-CN | 41.00% | 69.19% |

Zero-Shot Text-to-Image Retrieval

| Model | Dataset | Top-1 | Top-5 | Top-10 |
| --- | --- | --- | --- | --- |
| Taiyi-CLIP-Roberta-102M-Chinese | Flickr30k-CNA-test | 44.06% | 71.42% | 80.84% |
| Taiyi-CLIP-Roberta-102M-Chinese | COCO-CN-test | 46.24% | 78.06% | 88.88% |
| Taiyi-CLIP-Roberta-102M-Chinese | wukong50k | 48.67% | 81.77% | 90.09% |
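The model card does not describe the evaluation protocol, so the following is only a sketch of how Top-k retrieval numbers like those above are commonly computed. It assumes that Top-k means recall@k and that each text query has exactly one correct image. The function name `recall_at_k` and the toy similarity matrix are hypothetical.

```python
def recall_at_k(similarity, gold_image, k):
    """Fraction of text queries whose correct image is among the top-k results.

    similarity[i][j] is the score of text i against image j;
    gold_image[i] is the index of the correct image for text i.
    """
    hits = 0
    for scores, gold in zip(similarity, gold_image):
        # Rank image indices from highest to lowest score for this query.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if gold in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy text-to-image similarity matrix (3 texts x 3 images).
toy_similarity = [
    [0.9, 0.1, 0.3],  # text 0: correct image 0 is ranked first
    [0.2, 0.4, 0.8],  # text 1: correct image 1 is ranked second
    [0.7, 0.6, 0.1],  # text 2: correct image 2 is ranked third
]
toy_gold = [0, 1, 2]

for k in (1, 2, 3):
    print(f"recall@{k} = {recall_at_k(toy_similarity, toy_gold, k):.2%}")
```

In a real evaluation, the similarity matrix would be built from the normalized text and image features of the whole test set, in the same way the usage example computes `logits_per_image`.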

Citation

If you find this resource useful, please cite the following website in your paper.

@misc{Fengshenbang-LM,
  title={Fengshenbang-LM},
  author={IDEA-CCNL},
  year={2022},
  howpublished={\url{https://github.com/IDEA-CCNL/Fengshenbang-LM}},
}