QA-CLIP / README_CN.md
Kunyi's picture
Update README_CN.md
48c326f

中文说明 | English

项目介绍

本项目旨在提供更好的中文CLIP模型。该项目使用的训练数据均为公开可访问的图像URL及相关中文文本描述,总量达到400M。经过筛选后,我们最终使用了100M的数据进行训练。 本项目于QQ-ARC Joint Lab, Tencent PCG完成。我们也在github上开源了模型,QA-CLIP,welcome to star!

模型及实验

模型规模 & 下载链接

QA-CLIP目前开源3个不同规模,其模型信息和下载方式见下表:

模型规模下载链接参数量视觉侧骨架视觉侧参数量文本侧骨架文本侧参数量分辨率
QA-CLIPRN50Download77MResNet5038MRBT339M224
QA-CLIPViT-B/16Download188MViT-B/1686MRoBERTa-wwm-Base102M224
QA-CLIPViT-L/14Download406MViT-L/14304MRoBERTa-wwm-Base102M224

实验结果

针对图文检索任务,我们在MUGE RetrievalFlickr30K-CNCOCO-CN上进行了zero-shot测试。 针对图像零样本分类任务,我们在ImageNet数据集上进行了测试。测试结果见下表:

Flickr30K-CN Zero-shot Retrieval (Official Test Set):

TaskText-to-ImageImage-to-Text
MetricR@1R@5R@10R@1R@5R@10
CN-CLIPRN5048.876.084.660.085.992.0
QA-CLIPRN5050.577.486.167.187.993.2
CN-CLIPViT-B/1662.786.992.874.693.597.1
QA-CLIPViT-B/1663.888.093.278.496.198.5
CN-CLIPViT-L/1468.089.794.480.296.698.2
AltClipViT-L/1469.790.194.884.897.799.1
QA-CLIPViT-L/1469.390.394.785.397.999.2

MUGE Zero-shot Retrieval (Official Validation Set):

TaskText-to-ImageImage-to-Text
MetricR@1R@5R@10R@1R@5R@10
CN-CLIPRN5042.668.578.030.056.266.9
QA-CLIPRN5044.069.979.532.459.570.3
CN-CLIPViT-B/1652.176.784.438.765.675.1
QA-CLIPViT-B/1653.277.785.140.768.277.2
CN-CLIPViT-L/1456.479.886.242.669.878.6
AltClipViT-L/1429.649.958.821.442.051.9
QA-CLIPViT-L/1457.481.087.745.573.081.4

COCO-CN Zero-shot Retrieval (Official Test Set):

TaskText-to-ImageImage-to-Text
MetricR@1R@5R@10R@1R@5R@10
CN-CLIPRN5048.181.390.550.981.190.5
QA-CLIPRN5050.182.591.756.785.292.9
CN-CLIPViT-B/1662.287.194.956.384.093.3
QA-CLIPViT-B/1662.987.794.761.587.694.8
CN-CLIPViT-L/1464.988.894.260.684.493.1
AltClipViT-L/1463.587.693.562.688.595.9
QA-CLIPViT-L/1465.790.295.064.588.395.1

Zero-shot Image Classification on ImageNet:

TaskImageNet
CN-CLIPRN5033.5
QA-CLIPRN5035.5
CN-CLIPViT-B/1648.4
QA-CLIPViT-B/1649.7
CN-CLIPViT-L/1454.7
QA-CLIPViT-L/1455.8



使用教程

安装要求

环境配置要求:

  • python >= 3.6.4
  • pytorch >= 1.8.0 (with torchvision >= 0.9.0)
  • CUDA Version >= 10.2

安装本项目所需库

cd /yourpath/QA-CLIP-main
pip install --upgrade pip
pip install -r requirements.txt

推理代码

export PYTHONPATH=/yourpath/QA-CLIP-main

推理代码示例:

import torch 
from PIL import Image

import clip as clip
from clip import load_from_name, available_models
print("Available models:", available_models())  
# Available models: ['ViT-B-16', 'ViT-L-14', 'RN50']

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = load_from_name("ViT-B-16", device=device, download_root='./')
model.eval()
image = preprocess(Image.open("examples/pokemon.jpeg")).unsqueeze(0).to(device)
text = clip.tokenize(["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # 对特征进行归一化,请使用归一化后的图文特征用于下游任务
    image_features /= image_features.norm(dim=-1, keepdim=True) 
    text_features /= text_features.norm(dim=-1, keepdim=True)    

    logits_per_image, logits_per_text = model.get_similarity(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probs:", probs)



预测及评估

图文检索测试数据集下载

Chinese-CLIP项目中已经预处理好测试集,这是他们提供的下载链接:

MUGE数据:下载链接

Flickr30K-CN数据:下载链接

另外COCO-CN数据的获取需要向原作者进行申请

ImageNet数据集下载

原始数据请自行下载,中文标签英文标签同样由Chinese-CLIP项目提供

图文检索评估

图文检索评估代码可以参考如下:

split=test # 指定计算valid或test集特征
resume=your_ckp_path
DATAPATH=your_DATAPATH
dataset_name=Flickr30k-CN
# dataset_name=MUGE

python -u eval/extract_features.py \
    --extract-image-feats \
    --extract-text-feats \
    --image-data="${DATAPATH}/datasets/${dataset_name}/lmdb/${split}/imgs" \
    --text-data="${DATAPATH}/datasets/${dataset_name}/${split}_texts.jsonl" \
    --img-batch-size=32 \
    --text-batch-size=32 \
    --context-length=52 \
    --resume=${resume} \
    --vision-model=ViT-B-16 \
    --text-model=RoBERTa-wwm-ext-base-chinese

python -u eval/make_topk_predictions.py \
    --image-feats="${DATAPATH}/datasets/${dataset_name}/${split}_imgs.img_feat.jsonl" \
    --text-feats="${DATAPATH}/datasets/${dataset_name}/${split}_texts.txt_feat.jsonl" \
    --top-k=10 \
    --eval-batch-size=32768 \
    --output="${DATAPATH}/datasets/${dataset_name}/${split}_predictions.jsonl"

python -u eval/make_topk_predictions_tr.py \
    --image-feats="${DATAPATH}/datasets/${dataset_name}/${split}_imgs.img_feat.jsonl" \
    --text-feats="${DATAPATH}/datasets/${dataset_name}/${split}_texts.txt_feat.jsonl" \
    --top-k=10 \
    --eval-batch-size=32768 \
    --output="${DATAPATH}/datasets/${dataset_name}/${split}_tr_predictions.jsonl"

python eval/evaluation.py \
    ${DATAPATH}/datasets/${dataset_name}/${split}_texts.jsonl \
    ${DATAPATH}/datasets/${dataset_name}/${split}_predictions.jsonl \
    ${DATAPATH}/datasets/${dataset_name}/output1.json
cat  ${DATAPATH}/datasets/${dataset_name}/output1.json

python eval/transform_ir_annotation_to_tr.py \
    --input ${DATAPATH}/datasets/${dataset_name}/${split}_texts.jsonl

python eval/evaluation_tr.py \
    ${DATAPATH}/datasets/${dataset_name}/${split}_texts.tr.jsonl \
    ${DATAPATH}/datasets/${dataset_name}/${split}_tr_predictions.jsonl \
    ${DATAPATH}/datasets/${dataset_name}/output2.json
cat ${DATAPATH}/datasets/${dataset_name}/output2.json

ImageNet零样本分类

ImageNet零样本分类的代码参考如下

bash scripts/zeroshot_eval.sh 0 \
    ${DATAPATH} imagenet \
    ViT-B-16 RoBERTa-wwm-ext-base-chinese \
    ./pretrained_weights/QA-CLIP-base.pt



Huggingface模型及在线Demo

我们在huggingface网站上也开源了我们的模型,可以更方便的使用。并且也准备了一个简单的零样本分类的在线Demo供大家体验,欢迎大家来多多尝试!

⭐️QA-CLIP-ViT-B-16⭐️

⭐️QA-CLIP-ViT-L-14⭐️

下面是一些展示的例子: sample

致谢

项目代码基于Chinese-CLIP实现,非常感谢他们优秀的开源工作。