Chinese-CLIP-ViT-Large-Patch14-336px

Introduction

This is the large version of Chinese CLIP, with ViT-L/14@336px as the image encoder and RoBERTa-wwm-base as the text encoder. Chinese CLIP is a simple implementation of CLIP trained on a large-scale dataset of around 200 million Chinese image-text pairs. For more details, please refer to our technical report (https://arxiv.org/abs/2211.01335) and our official GitHub repo (https://github.com/OFA-Sys/Chinese-CLIP). Stars are welcome! 🔥🔥

Use with the official API

The following snippet shows how to use the Chinese-CLIP API to compute image and text embeddings and their similarities.

from PIL import Image
import requests
from transformers import ChineseCLIPProcessor, ChineseCLIPModel

model = ChineseCLIPModel.from_pretrained("OFA-Sys/chinese-clip-vit-large-patch14-336px")
processor = ChineseCLIPProcessor.from_pretrained("OFA-Sys/chinese-clip-vit-large-patch14-336px")

url = "https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/pokemon.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
# Candidate labels in Chinese: Squirtle, Bulbasaur, Charmander, Pikachu
texts = ["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"]

# compute image features
inputs = processor(images=image, return_tensors="pt")
image_features = model.get_image_features(**inputs)
image_features = image_features / image_features.norm(p=2, dim=-1, keepdim=True)  # normalize

# compute text features
inputs = processor(text=texts, padding=True, return_tensors="pt")
text_features = model.get_text_features(**inputs)
text_features = text_features / text_features.norm(p=2, dim=-1, keepdim=True)  # normalize

# compute image-text similarity scores
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1)  # probs: [[0.0219, 0.0316, 0.0043, 0.9423]]
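
The probabilities in the snippet above come from scaled cosine similarities between the normalized image and text features. The following is a minimal sketch of that arithmetic in plain Python, with no model required: L2-normalize the vectors, take dot products, scale by a temperature, and apply a softmax. The function names, the toy vectors, and the `logit_scale=100.0` default are illustrative assumptions, not values taken from this model.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def image_text_probs(image_vec, text_vecs, logit_scale=100.0):
    """Softmax over scaled cosine similarities of one image against several texts.

    logit_scale is a hypothetical temperature; a trained model learns its own value.
    """
    img = l2_normalize(image_vec)
    logits = []
    for text_vec in text_vecs:
        txt = l2_normalize(text_vec)
        cosine = sum(a * b for a, b in zip(img, txt))
        logits.append(logit_scale * cosine)
    return softmax(logits)


# Toy 2-d "embeddings": the first text points almost the same way as the image.
toy_image = [1.0, 0.0]
toy_texts = [[1.0, 0.1], [0.0, 1.0], [-1.0, 0.0]]
toy_probs = image_text_probs(toy_image, toy_texts)
print(toy_probs.index(max(toy_probs)))  # index of the best-matching text
```

With the tensors from the snippet above, the equivalent would presumably be `(model.logit_scale.exp() * image_features @ text_features.t()).softmax(dim=-1)`, assuming the model exposes its learned temperature as `logit_scale` the way CLIP-style models in transformers usually do.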

If you need more than the API offers, see our GitHub repo https://github.com/OFA-Sys/Chinese-CLIP for details on training and inference.

Results

MUGE Text-to-Image Retrieval:

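| Setup | Zero-shot | | | | Finetune | | | |
|---|---|---|---|---|---|---|---|---|
| Metric | R@1 | R@5 | R@10 | MR | R@1 | R@5 | R@10 | MR |
| Wukong | 42.7 | 69.0 | 78.0 | 63.2 | 52.7 | 77.9 | 85.6 | 72.1 |
| R2D2 | 49.5 | 75.7 | 83.2 | 69.5 | 60.1 | 82.9 | 89.4 | 77.5 |
| CN-CLIP | 63.0 | 84.1 | 89.2 | 78.8 | 68.9 | 88.7 | 93.1 | 83.6 |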
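
In every row of the MUGE results, the MR column matches the mean of R@1, R@5, and R@10 (for example, (42.7 + 69.0 + 78.0) / 3 ≈ 63.2 for zero-shot Wukong). The sketch below shows how such numbers could be computed from ranked retrieval results. It assumes a query counts as a hit when any relevant item appears in its top k. The function names and toy data are hypothetical, and this is not the official evaluation script.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Percentage of queries whose top-k results contain at least one relevant item.

    ranked_ids:   one ranked list of candidate ids per query (best first)
    relevant_ids: one list of ground-truth ids per query
    """
    hits = sum(
        1
        for ranked, relevant in zip(ranked_ids, relevant_ids)
        if set(ranked[:k]) & set(relevant)
    )
    return 100.0 * hits / len(ranked_ids)


def mean_recall(recalls):
    """Average of several recall values, e.g. R@1, R@5, and R@10."""
    return sum(recalls) / len(recalls)


# Toy example: three queries, each with item 1 as its only relevant result.
toy_ranked = [[3, 1, 2], [2, 3, 1], [1, 2, 3]]
toy_relevant = [[1], [1], [1]]
toy_recalls = [recall_at_k(toy_ranked, toy_relevant, k) for k in (1, 2, 3)]
print(toy_recalls, mean_recall(toy_recalls))

# Reproducing the zero-shot Wukong MR entry from its three recall values.
wukong_mr = round(mean_recall([42.7, 69.0, 78.0]), 1)
print(wukong_mr)
```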

Flickr30K-CN Retrieval:

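| Task | Text-to-Image | | | | | | Image-to-Text | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Setup | Zero-shot | | | Finetune | | | Zero-shot | | | Finetune | | |
| Metric | R@1 | R@5 | R@10 | R@1 | R@5 | R@10 | R@1 | R@5 | R@10 | R@1 | R@5 | R@10 |
| Wukong | 51.7 | 78.9 | 86.3 | 77.4 | 94.5 | 97.0 | 76.1 | 94.8 | 97.5 | 92.7 | 99.1 | 99.6 |
| R2D2 | 60.9 | 86.8 | 92.7 | 84.4 | 96.7 | 98.4 | 77.6 | 96.7 | 98.9 | 95.6 | 99.8 | 100.0 |
| CN-CLIP | 71.2 | 91.4 | 95.5 | 83.8 | 96.9 | 98.6 | 81.6 | 97.5 | 98.8 | 95.3 | 99.7 | 100.0 |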

COCO-CN Retrieval:

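| Task | Text-to-Image | | | | | | Image-to-Text | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Setup | Zero-shot | | | Finetune | | | Zero-shot | | | Finetune | | |
| Metric | R@1 | R@5 | R@10 | R@1 | R@5 | R@10 | R@1 | R@5 | R@10 | R@1 | R@5 | R@10 |
| Wukong | 53.4 | 80.2 | 90.1 | 74.0 | 94.4 | 98.1 | 55.2 | 81.0 | 90.6 | 73.3 | 94.0 | 98.0 |
| R2D2 | 56.4 | 85.0 | 93.1 | 79.1 | 96.5 | 98.9 | 63.3 | 89.3 | 95.7 | 79.3 | 97.1 | 98.7 |
| CN-CLIP | 69.2 | 89.9 | 96.1 | 81.5 | 96.9 | 99.1 | 63.0 | 86.6 | 92.9 | 83.5 | 97.3 | 99.2 |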

Zero-shot Image Classification:

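| Task | CIFAR10 | CIFAR100 | DTD | EuroSAT | FER | FGVC | KITTI | MNIST | PC | VOC |
|---|---|---|---|---|---|---|---|---|---|---|
| GIT | 88.5 | 61.1 | 42.9 | 43.4 | 41.4 | 6.7 | 22.1 | 68.9 | 50.0 | 80.2 |
| ALIGN | 94.9 | 76.8 | 66.1 | 52.1 | 50.8 | 25.0 | 41.2 | 74.0 | 55.2 | 83.0 |
| CLIP | 94.9 | 77.0 | 56.0 | 63.0 | 48.3 | 33.3 | 11.5 | 79.0 | 62.3 | 84.0 |
| Wukong | 95.4 | 77.1 | 40.9 | 50.3 | - | - | - | - | - | - |
| CN-CLIP | 96.0 | 79.7 | 51.2 | 52.0 | 55.1 | 26.2 | 49.9 | 79.4 | 63.5 | 84.9 |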

Citation

If you find Chinese CLIP helpful, feel free to cite our paper. Thanks for your support!

@article{chinese-clip,
  title={Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese},
  author={Yang, An and Pan, Junshu and Lin, Junyang and Men, Rui and Zhang, Yichang and Zhou, Jingren and Zhou, Chang},
  journal={arXiv preprint arXiv:2211.01335},
  year={2022}
}
