QA-CLIP-ViT-B-16 / README_CN.md
Kunyi's picture
Update README_CN.md
8f09600

中文说明 | English

项目介绍

本项目旨在提供更好的中文CLIP模型。该项目使用的训练数据均为公开可访问的图像URL及相关中文文本描述,总量达到400M。经过筛选后,我们最终使用了100M的数据进行训练。 本项目于QQ-ARC Joint Lab, Tencent PCG完成。 更详细的信息可以参考QA-CLIP项目的主页面。我们也在github上开源了模型,QA-CLIP,welcome to star!

实验结果

针对图文检索任务,我们在MUGE RetrievalFlickr30K-CNCOCO-CN上进行了zero-shot测试。 针对图像零样本分类任务,我们在ImageNet数据集上进行了测试。测试结果见下表:

Flickr30K-CN Zero-shot Retrieval (Official Test Set):

TaskText-to-ImageImage-to-Text
MetricR@1R@5R@10R@1R@5R@10
CN-CLIPRN5048.876.084.660.085.992.0
QA-CLIPRN5050.577.486.167.187.993.2
CN-CLIPViT-B/1662.786.992.874.693.597.1
QA-CLIPViT-B/1663.888.093.278.496.198.5
CN-CLIPViT-L/1468.089.794.480.296.698.2
AltClipViT-L/1469.790.194.884.897.799.1
QA-CLIPViT-L/1469.390.394.785.397.999.2

MUGE Zero-shot Retrieval (Official Validation Set):

TaskText-to-ImageImage-to-Text
MetricR@1R@5R@10R@1R@5R@10
CN-CLIPRN5042.668.578.030.056.266.9
QA-CLIPRN5044.069.979.532.459.570.3
CN-CLIPViT-B/1652.176.784.438.765.675.1
QA-CLIPViT-B/1653.277.785.140.768.277.2
CN-CLIPViT-L/1456.479.886.242.669.878.6
AltClipViT-L/1429.649.958.821.442.051.9
QA-CLIPViT-L/1457.481.087.745.573.081.4

COCO-CN Zero-shot Retrieval (Official Test Set):

TaskText-to-ImageImage-to-Text
MetricR@1R@5R@10R@1R@5R@10
CN-CLIPRN5048.181.390.550.981.190.5
QA-CLIPRN5050.182.591.756.785.292.9
CN-CLIPViT-B/1662.287.194.956.384.093.3
QA-CLIPViT-B/1662.987.794.761.587.694.8
CN-CLIPViT-L/1464.988.894.260.684.493.1
AltClipViT-L/1463.587.693.562.688.595.9
QA-CLIPViT-L/1465.790.295.064.588.395.1

Zero-shot Image Classification on ImageNet:

TaskImageNet
CN-CLIPRN5033.5
QA-CLIPRN5035.5
CN-CLIPViT-B/1648.4
QA-CLIPViT-B/1649.7
CN-CLIPViT-L/1454.7
QA-CLIPViT-L/1455.8



使用教程

推理代码

推理代码示例:

from PIL import Image
import requests
from transformers import ChineseCLIPProcessor, ChineseCLIPModel

model = ChineseCLIPModel.from_pretrained("TencentARC/QA-CLIP-ViT-B-16")
processor = ChineseCLIPProcessor.from_pretrained("TencentARC/QA-CLIP-ViT-B-16")

url = "https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/pokemon.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
# Squirtle, Bulbasaur, Charmander, Pikachu in English
texts = ["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"]

# compute image feature
inputs = processor(images=image, return_tensors="pt")
image_features = model.get_image_features(**inputs)
image_features = image_features / image_features.norm(p=2, dim=-1, keepdim=True)  # normalize

# compute text features
inputs = processor(text=texts, padding=True, return_tensors="pt")
text_features = model.get_text_features(**inputs)
text_features = text_features / text_features.norm(p=2, dim=-1, keepdim=True)  # normalize

# compute image-text similarity scores
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1)



致谢

项目代码基于Chinese-CLIP实现,非常感谢他们优秀的开源工作。