[**中文说明**](README_CN.md) | [**English**](README.md)

# Introduction

This project aims to provide a better Chinese CLIP model. All training data consists of publicly accessible image URLs and their associated Chinese text descriptions, totaling 400M pairs. After filtering, we used 100M pairs for training.

This project was completed at the QQ-ARC Joint Lab, Tencent PCG.

# Models and Experiments

## Model Sizes & Download Links

QA-CLIP currently releases three models of different sizes. Their details and download links are listed in the table below:
| Model | Download | #Params | Vision Backbone | Vision #Params | Text Backbone | Text #Params | Resolution |
| :---- | :------: | :-----: | :-------------: | :------------: | :-----------: | :----------: | :--------: |
| QA-CLIP<sub>RN50</sub> | Download | 77M | ResNet50 | 38M | RBT3 | 39M | 224 |
| QA-CLIP<sub>ViT-B/16</sub> | Download | 188M | ViT-B/16 | 86M | RoBERTa-wwm-Base | 102M | 224 |
| QA-CLIP<sub>ViT-L/14</sub> | Download | 406M | ViT-L/14 | 304M | RoBERTa-wwm-Base | 102M | 224 |

## Experimental Results

For image-text retrieval, we ran zero-shot evaluations on [MUGE Retrieval](https://tianchi.aliyun.com/muge), [Flickr30K-CN](https://github.com/li-xirong/cross-lingual-cap), and [COCO-CN](https://github.com/li-xirong/coco-cn). For zero-shot image classification, we evaluated on the ImageNet dataset. The results are shown in the tables below.

**Flickr30K-CN Zero-shot Retrieval (Official Test Set)**:
| Task | Text-to-Image | | | Image-to-Text | | |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
| Metric | R@1 | R@5 | R@10 | R@1 | R@5 | R@10 |
| CN-CLIP<sub>RN50</sub> | 48.8 | 76.0 | 84.6 | 60.0 | 85.9 | 92.0 |
| QA-CLIP<sub>RN50</sub> | 50.5 | 77.4 | 86.1 | 67.1 | 87.9 | 93.2 |
| CN-CLIP<sub>ViT-B/16</sub> | 62.7 | 86.9 | 92.8 | 74.6 | 93.5 | 97.1 |
| QA-CLIP<sub>ViT-B/16</sub> | 63.8 | 88.0 | 93.2 | 78.4 | 96.1 | 98.5 |
| CN-CLIP<sub>ViT-L/14</sub> | 68.0 | 89.7 | 94.4 | 80.2 | 96.6 | 98.2 |
| AltClip<sub>ViT-L/14</sub> | 69.7 | 90.1 | 94.8 | 84.8 | 97.7 | 99.1 |
| QA-CLIP<sub>ViT-L/14</sub> | 69.3 | 90.3 | 94.7 | 85.3 | 97.9 | 99.2 |

**MUGE Zero-shot Retrieval (Official Validation Set)**:
| Task | Text-to-Image | | | Image-to-Text | | |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
| Metric | R@1 | R@5 | R@10 | R@1 | R@5 | R@10 |
| CN-CLIP<sub>RN50</sub> | 42.6 | 68.5 | 78.0 | 30.0 | 56.2 | 66.9 |
| QA-CLIP<sub>RN50</sub> | 44.0 | 69.9 | 79.5 | 32.4 | 59.5 | 70.3 |
| CN-CLIP<sub>ViT-B/16</sub> | 52.1 | 76.7 | 84.4 | 38.7 | 65.6 | 75.1 |
| QA-CLIP<sub>ViT-B/16</sub> | 53.2 | 77.7 | 85.1 | 40.7 | 68.2 | 77.2 |
| CN-CLIP<sub>ViT-L/14</sub> | 56.4 | 79.8 | 86.2 | 42.6 | 69.8 | 78.6 |
| AltClip<sub>ViT-L/14</sub> | 29.6 | 49.9 | 58.8 | 21.4 | 42.0 | 51.9 |
| QA-CLIP<sub>ViT-L/14</sub> | 57.4 | 81.0 | 87.7 | 45.5 | 73.0 | 81.4 |

**COCO-CN Zero-shot Retrieval (Official Test Set)**:
| Task | Text-to-Image | | | Image-to-Text | | |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
| Metric | R@1 | R@5 | R@10 | R@1 | R@5 | R@10 |
| CN-CLIP<sub>RN50</sub> | 48.1 | 81.3 | 90.5 | 50.9 | 81.1 | 90.5 |
| QA-CLIP<sub>RN50</sub> | 50.1 | 82.5 | 91.7 | 56.7 | 85.2 | 92.9 |
| CN-CLIP<sub>ViT-B/16</sub> | 62.2 | 87.1 | 94.9 | 56.3 | 84.0 | 93.3 |
| QA-CLIP<sub>ViT-B/16</sub> | 62.9 | 87.7 | 94.7 | 61.5 | 87.6 | 94.8 |
| CN-CLIP<sub>ViT-L/14</sub> | 64.9 | 88.8 | 94.2 | 60.6 | 84.4 | 93.1 |
| AltClip<sub>ViT-L/14</sub> | 63.5 | 87.6 | 93.5 | 62.6 | 88.5 | 95.9 |
| QA-CLIP<sub>ViT-L/14</sub> | 65.7 | 90.2 | 95.0 | 64.5 | 88.3 | 95.1 |

**Zero-shot Image Classification on ImageNet**:
| Task | ImageNet |
| :--- | :---: |
| CN-CLIP<sub>RN50</sub> | 33.5 |
| QA-CLIP<sub>RN50</sub> | 35.5 |
| CN-CLIP<sub>ViT-B/16</sub> | 48.4 |
| QA-CLIP<sub>ViT-B/16</sub> | 49.7 |
| CN-CLIP<sub>ViT-L/14</sub> | 54.7 |
| QA-CLIP<sub>ViT-L/14</sub> | 55.8 |



# Getting Started

## Installation Requirements

Environment requirements:

* python >= 3.6.4
* pytorch >= 1.8.0 (with torchvision >= 0.9.0)
* CUDA Version >= 10.2

Install the packages required by this project:

```bash
cd /yourpath/QA-CLIP-main
pip install -r requirements.txt
```

## Inference Code

```bash
export PYTHONPATH=/yourpath/QA-CLIP-main
```

Inference code example:

```python
import torch
from PIL import Image

import clip as clip
from clip import load_from_name, available_models
print("Available models:", available_models())
# Available models: ['ViT-B-16', 'ViT-L-14', 'RN50']

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = load_from_name("ViT-B-16", device=device, download_root='./')
model.eval()
image = preprocess(Image.open("examples/pokemon.jpeg")).unsqueeze(0).to(device)
text = clip.tokenize(["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Normalize the features; use the normalized image/text features for downstream tasks
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    logits_per_image, logits_per_text = model.get_similarity(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probs:", probs)
```
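If you prefer to work with the normalized features directly (e.g. for downstream tasks) instead of calling `model.get_similarity`, a minimal sketch is shown below. It reuses `image_features` and `text_features` from the snippet above and assumes the model exposes its learned temperature as `model.logit_scale`, as in standard CLIP-style implementations; verify the attribute name in this repo before relying on it.

```python
# Sketch: recompute image-to-text probabilities from the normalized features.
# Assumption: `model.logit_scale` holds the learned log-temperature (standard CLIP).
with torch.no_grad():
    logit_scale = model.logit_scale.exp()
    # Cosine similarities scaled by the temperature.
    logits_per_image = logit_scale * image_features @ text_features.t()
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()
print("Label probs:", probs)
```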

## Prediction and Evaluation

### Downloading the Image-Text Retrieval Test Sets

The [Chinese-CLIP](https://github.com/OFA-Sys/Chinese-CLIP) project provides preprocessed test sets; their download links are:

MUGE data: [download link](https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/datasets/MUGE.zip)

Flickr30K-CN data: [download link](https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/datasets/Flickr30k-CN.zip)

For the [COCO-CN](https://github.com/li-xirong/coco-cn) data, please request access from the original authors.

### Downloading the ImageNet Dataset

Please download the raw images yourself. The [Chinese labels](http://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/datasets/ImageNet-1K/label_cn.txt) and [English labels](http://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/datasets/ImageNet-1K/label.txt) are also provided by the [Chinese-CLIP](https://github.com/OFA-Sys/Chinese-CLIP) project.

### Image-Text Retrieval Evaluation

The image-text retrieval evaluation can be run as follows:

```bash
split=test # compute features for the valid or test split
resume=your_ckp_path
DATAPATH=your_DATAPATH
dataset_name=Flickr30k-CN
# dataset_name=MUGE

python -u eval/extract_features.py \
    --extract-image-feats \
    --extract-text-feats \
    --image-data="${DATAPATH}/datasets/${dataset_name}/lmdb/${split}/imgs" \
    --text-data="${DATAPATH}/datasets/${dataset_name}/${split}_texts.jsonl" \
    --img-batch-size=32 \
    --text-batch-size=32 \
    --context-length=52 \
    --resume=${resume} \
    --vision-model=ViT-B-16 \
    --text-model=RoBERTa-wwm-ext-base-chinese

python -u eval/make_topk_predictions.py \
    --image-feats="${DATAPATH}/datasets/${dataset_name}/${split}_imgs.img_feat.jsonl" \
    --text-feats="${DATAPATH}/datasets/${dataset_name}/${split}_texts.txt_feat.jsonl" \
    --top-k=10 \
    --eval-batch-size=32768 \
    --output="${DATAPATH}/datasets/${dataset_name}/${split}_predictions.jsonl"

python -u eval/make_topk_predictions_tr.py \
    --image-feats="${DATAPATH}/datasets/${dataset_name}/${split}_imgs.img_feat.jsonl" \
    --text-feats="${DATAPATH}/datasets/${dataset_name}/${split}_texts.txt_feat.jsonl" \
    --top-k=10 \
    --eval-batch-size=32768 \
    --output="${DATAPATH}/datasets/${dataset_name}/${split}_tr_predictions.jsonl"

python eval/evaluation.py \
    ${DATAPATH}/datasets/${dataset_name}/${split}_texts.jsonl \
    ${DATAPATH}/datasets/${dataset_name}/${split}_predictions.jsonl \
    ${DATAPATH}/datasets/${dataset_name}/output1.json
cat ${DATAPATH}/datasets/${dataset_name}/output1.json

python eval/transform_ir_annotation_to_tr.py \
    --input ${DATAPATH}/datasets/${dataset_name}/${split}_texts.jsonl

python eval/evaluation_tr.py \
    ${DATAPATH}/datasets/${dataset_name}/${split}_texts.tr.jsonl \
    ${DATAPATH}/datasets/${dataset_name}/${split}_tr_predictions.jsonl \
    ${DATAPATH}/datasets/${dataset_name}/output2.json
cat ${DATAPATH}/datasets/${dataset_name}/output2.json
```

### ImageNet Zero-shot Classification

The ImageNet zero-shot classification can be run as follows:

```bash
bash scripts/zeroshot_eval.sh 0 \
    ${DATAPATH} imagenet \
    ViT-B-16 RoBERTa-wwm-ext-base-chinese \
    ./pretrained_weights/QA-CLIP-base.pt
```
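Conceptually, zero-shot classification follows the standard CLIP recipe: each class name is wrapped in a text prompt, all prompts are encoded once, and an image is assigned to the label with the highest cosine similarity. The sketch below illustrates that idea with the API shown earlier; the label file path, image path, and single prompt template are illustrative assumptions, and the actual `scripts/zeroshot_eval.sh` pipeline may differ (e.g. multiple prompt templates).

```python
import torch
from PIL import Image
import clip as clip
from clip import load_from_name

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = load_from_name("ViT-B-16", device=device, download_root='./')
model.eval()

# Hypothetical label file: one Chinese class name per line (e.g. the label_cn.txt linked above).
with open("label_cn.txt", encoding="utf-8") as f:
    labels = [line.strip() for line in f if line.strip()]

# One simple prompt per class; a single illustrative template is used here for brevity.
text = clip.tokenize([f"一张{label}的照片" for label in labels]).to(device)

with torch.no_grad():
    text_features = model.encode_text(text)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    # "your_image.jpg" is a placeholder for the image to classify.
    image = preprocess(Image.open("your_image.jpg")).unsqueeze(0).to(device)
    image_features = model.encode_image(image)
    image_features /= image_features.norm(dim=-1, keepdim=True)

    # Predicted class = label with the highest cosine similarity to the image.
    pred = (image_features @ text_features.t()).argmax(dim=-1).item()

print("Predicted label:", labels[pred])
```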

# Acknowledgments

The project code is built on [Chinese-CLIP](https://github.com/OFA-Sys/Chinese-CLIP); many thanks to the authors for their excellent open-source work.