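[**中文说明**](README_CN.md) | [**English**](README.md)

# Introduction

This project aims to provide a better Chinese CLIP model. The training data consists of publicly accessible image URLs and their associated Chinese text descriptions, totaling 400M. After filtering, we ultimately used 100M of them for training.

This project was completed at QQ-ARC Joint Lab, Tencent PCG. For more details, see the [QA-CLIP project page](https://huggingface.co/TencentARC/QA-CLIP). The model is also open-sourced on GitHub at [QA-CLIP](https://github.com/TencentARC-QQ/QA-CLIP); stars are welcome!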

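## Experimental Results

For image-text retrieval, we ran zero-shot evaluations on [MUGE Retrieval](https://tianchi.aliyun.com/muge), [Flickr30K-CN](https://github.com/li-xirong/cross-lingual-cap), and [COCO-CN](https://github.com/li-xirong/coco-cn). For zero-shot image classification, we evaluated on ImageNet. The results are shown in the tables below.

**Flickr30K-CN Zero-shot Retrieval (Official Test Set)**:

| Model | Text-to-Image R@1 | Text-to-Image R@5 | Text-to-Image R@10 | Image-to-Text R@1 | Image-to-Text R@5 | Image-to-Text R@10 |
|---|---|---|---|---|---|---|
| CN-CLIP<sub>RN50</sub> | 48.8 | 76.0 | 84.6 | 60.0 | 85.9 | 92.0 |
| QA-CLIP<sub>RN50</sub> | 50.5 | 77.4 | 86.1 | 67.1 | 87.9 | 93.2 |
| CN-CLIP<sub>ViT-B/16</sub> | 62.7 | 86.9 | 92.8 | 74.6 | 93.5 | 97.1 |
| QA-CLIP<sub>ViT-B/16</sub> | 63.8 | 88.0 | 93.2 | 78.4 | 96.1 | 98.5 |
| CN-CLIP<sub>ViT-L/14</sub> | 68.0 | 89.7 | 94.4 | 80.2 | 96.6 | 98.2 |
| AltClip<sub>ViT-L/14</sub> | 69.7 | 90.1 | 94.8 | 84.8 | 97.7 | 99.1 |
| QA-CLIP<sub>ViT-L/14</sub> | 69.3 | 90.3 | 94.7 | 85.3 | 97.9 | 99.2 |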

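**MUGE Zero-shot Retrieval (Official Validation Set)**:

| Model | Text-to-Image R@1 | Text-to-Image R@5 | Text-to-Image R@10 | Image-to-Text R@1 | Image-to-Text R@5 | Image-to-Text R@10 |
|---|---|---|---|---|---|---|
| CN-CLIP<sub>RN50</sub> | 42.6 | 68.5 | 78.0 | 30.0 | 56.2 | 66.9 |
| QA-CLIP<sub>RN50</sub> | 44.0 | 69.9 | 79.5 | 32.4 | 59.5 | 70.3 |
| CN-CLIP<sub>ViT-B/16</sub> | 52.1 | 76.7 | 84.4 | 38.7 | 65.6 | 75.1 |
| QA-CLIP<sub>ViT-B/16</sub> | 53.2 | 77.7 | 85.1 | 40.7 | 68.2 | 77.2 |
| CN-CLIP<sub>ViT-L/14</sub> | 56.4 | 79.8 | 86.2 | 42.6 | 69.8 | 78.6 |
| AltClip<sub>ViT-L/14</sub> | 29.6 | 49.9 | 58.8 | 21.4 | 42.0 | 51.9 |
| QA-CLIP<sub>ViT-L/14</sub> | 57.4 | 81.0 | 87.7 | 45.5 | 73.0 | 81.4 |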

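**COCO-CN Zero-shot Retrieval (Official Test Set)**:

| Model | Text-to-Image R@1 | Text-to-Image R@5 | Text-to-Image R@10 | Image-to-Text R@1 | Image-to-Text R@5 | Image-to-Text R@10 |
|---|---|---|---|---|---|---|
| CN-CLIP<sub>RN50</sub> | 48.1 | 81.3 | 90.5 | 50.9 | 81.1 | 90.5 |
| QA-CLIP<sub>RN50</sub> | 50.1 | 82.5 | 91.7 | 56.7 | 85.2 | 92.9 |
| CN-CLIP<sub>ViT-B/16</sub> | 62.2 | 87.1 | 94.9 | 56.3 | 84.0 | 93.3 |
| QA-CLIP<sub>ViT-B/16</sub> | 62.9 | 87.7 | 94.7 | 61.5 | 87.6 | 94.8 |
| CN-CLIP<sub>ViT-L/14</sub> | 64.9 | 88.8 | 94.2 | 60.6 | 84.4 | 93.1 |
| AltClip<sub>ViT-L/14</sub> | 63.5 | 87.6 | 93.5 | 62.6 | 88.5 | 95.9 |
| QA-CLIP<sub>ViT-L/14</sub> | 65.7 | 90.2 | 95.0 | 64.5 | 88.3 | 95.1 |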

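The retrieval tables above report Recall@K (R@1, R@5, R@10). The README does not spell out how the metric is computed, so the following is only a minimal sketch of the usual definition: a query counts as a hit if at least one correct candidate appears among its K highest-scoring results. The function name `recall_at_k`, the toy similarity matrix, and the ground-truth sets are all illustrative assumptions, not part of the QA-CLIP evaluation code.

```python
import numpy as np

def recall_at_k(similarity, ground_truth, k):
    """Fraction of queries whose top-k candidates contain a correct match.

    similarity: (num_queries, num_candidates) array, higher means more similar.
    ground_truth: list with one set of correct candidate indices per query.
    """
    # Indices of the k highest-scoring candidates for every query.
    top_k = np.argsort(-similarity, axis=1)[:, :k]
    hits = [bool(set(row.tolist()) & gt) for row, gt in zip(top_k, ground_truth)]
    return float(np.mean(hits))

# Toy data (made up): 3 text queries scored against 4 candidate images.
sim = np.array([
    [0.9, 0.1, 0.3, 0.2],  # correct image 0 is ranked first
    [0.2, 0.4, 0.8, 0.1],  # correct image 1 is ranked second
    [0.5, 0.6, 0.1, 0.7],  # correct image 2 is ranked last
])
gt = [{0}, {1}, {2}]

r1 = recall_at_k(sim, gt, 1)
r2 = recall_at_k(sim, gt, 2)
print(f"R@1 = {r1:.3f}, R@2 = {r2:.3f}")
```

Using a set of correct indices per query lets the same function cover image-to-text retrieval on datasets where one image has several valid captions.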
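**Zero-shot Image Classification on ImageNet**:

| Model | ImageNet |
|---|---|
| CN-CLIP<sub>RN50</sub> | 33.5 |
| QA-CLIP<sub>RN50</sub> | 35.5 |
| CN-CLIP<sub>ViT-B/16</sub> | 48.4 |
| QA-CLIP<sub>ViT-B/16</sub> | 49.7 |
| CN-CLIP<sub>ViT-L/14</sub> | 54.7 |
| QA-CLIP<sub>ViT-L/14</sub> | 55.8 |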



# Usage

## Inference Code

Example inference code:

```python
from PIL import Image
import requests
from transformers import ChineseCLIPProcessor, ChineseCLIPModel

model = ChineseCLIPModel.from_pretrained("TencentARC/QA-CLIP-ViT-B-16")
processor = ChineseCLIPProcessor.from_pretrained("TencentARC/QA-CLIP-ViT-B-16")

url = "https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/pokemon.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
# Squirtle, Bulbasaur, Charmander, Pikachu in English
texts = ["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"]

# compute image feature
inputs = processor(images=image, return_tensors="pt")
image_features = model.get_image_features(**inputs)
image_features = image_features / image_features.norm(p=2, dim=-1, keepdim=True)  # normalize

# compute text features
inputs = processor(text=texts, padding=True, return_tensors="pt")
text_features = model.get_text_features(**inputs)
text_features = text_features / text_features.norm(p=2, dim=-1, keepdim=True)  # normalize

# compute image-text similarity scores
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1)
```
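The example above L2-normalizes the image and text features, so their dot product is a cosine similarity. As a rough illustration of how such normalized features can be turned into a ranking over the candidate texts, here is a small NumPy sketch. It uses random stand-in vectors instead of real model outputs, and the `logit_scale` value is an assumption chosen for illustration, not the value learned by any QA-CLIP checkpoint.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row vector to unit L2 norm."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

texts = ["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"]

# Random stand-ins for model outputs (assumption: 8-dim embeddings for brevity).
rng = np.random.default_rng(0)
image_features = l2_normalize(rng.normal(size=(1, 8)))
text_features = l2_normalize(rng.normal(size=(len(texts), 8)))

logit_scale = 100.0  # illustrative temperature, not taken from the model

# Cosine similarity between the image and every text, shape (1, 4).
cosine = image_features @ text_features.T
probs = softmax(logit_scale * cosine, axis=-1)

best = int(probs.argmax(axis=-1)[0])
print("best match:", texts[best], "with probability", float(probs[0, best]))
```

Because softmax is monotonic, the scale changes how peaked the probabilities are but not which text ranks first.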

# Acknowledgments

The project code is based on [Chinese-CLIP](https://github.com/OFA-Sys/Chinese-CLIP). We are very grateful for their excellent open-source work.