Paper: Learning Transferable Visual Models From Natural Language Supervision (arXiv:2103.00020)
This is a Japanese CLIP (Contrastive Language-Image Pre-training) model developed by LY Corporation. It is an updated version of line-corporation/clip-japanese-base: the training data is increased to approximately 2B image–text pairs, and model distillation is applied to improve overall performance.
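The card does not describe the distillation setup itself. Purely as an illustration (not the recipe used for this model), one common approach for CLIP-style models is to match the student's image–text similarity distribution to a teacher's; every name below (`distillation_loss`, `student_img`, `teacher_txt`, `tau`) is hypothetical:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_img, student_txt, teacher_img, teacher_txt, tau=1.0):
    # Hypothetical sketch only; NOT the recipe used for clip-japanese-base-v2.
    # Normalize embeddings so dot products are cosine similarities.
    student_img = F.normalize(student_img, dim=-1)
    student_txt = F.normalize(student_txt, dim=-1)
    teacher_img = F.normalize(teacher_img, dim=-1)
    teacher_txt = F.normalize(teacher_txt, dim=-1)

    # Image-to-text similarity logits for a batch of paired examples.
    s_logits = student_img @ student_txt.T / tau
    t_logits = teacher_img @ teacher_txt.T / tau

    # Match the student's similarity distribution to the teacher's.
    return F.kl_div(F.log_softmax(s_logits, dim=-1),
                    F.softmax(t_logits, dim=-1),
                    reduction="batchmean")
```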
```bash
pip install pillow requests sentencepiece transformers torch timm
```
```python
import io
import requests
from PIL import Image
import torch
from transformers import AutoImageProcessor, AutoModel, AutoTokenizer

HF_MODEL_PATH = 'line-corporation/clip-japanese-base-v2'
device = "cuda" if torch.cuda.is_available() else "cpu"

# The repository ships custom model code, so trust_remote_code=True is required.
tokenizer = AutoTokenizer.from_pretrained(HF_MODEL_PATH, trust_remote_code=True)
processor = AutoImageProcessor.from_pretrained(HF_MODEL_PATH, trust_remote_code=True)
model = AutoModel.from_pretrained(HF_MODEL_PATH, trust_remote_code=True).to(device)

# Download a sample photo and preprocess it into pixel tensors.
image = Image.open(io.BytesIO(requests.get('https://images.pexels.com/photos/2253275/pexels-photo-2253275.jpeg?auto=compress&cs=tinysrgb&dpr=3&h=750&w=1260').content))
image = processor(image, return_tensors="pt").to(device)

# Candidate labels: "dog", "cat", "elephant" in Japanese.
text = tokenizer(["犬", "猫", "象"]).to(device)

with torch.no_grad():
    image_features = model.get_image_features(**image)
    text_features = model.get_text_features(**text)
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
# [[1., 0., 0.]]
```
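Beyond zero-shot classification, the same embeddings can rank images against a text query. The sketch below is an assumption-laden extension of the snippet above: it reuses `model`, `tokenizer`, `processor`, and `device`, assumes the custom image processor accepts a list of PIL images like standard Hugging Face image processors, and `candidate_images` / `rank_images` are hypothetical names:

```python
import torch
import torch.nn.functional as F

def rank_images(query: str, candidate_images):
    # `candidate_images` is a hypothetical list of PIL.Image objects you supply.
    with torch.no_grad():
        text_inputs = tokenizer([query]).to(device)
        text_emb = F.normalize(model.get_text_features(**text_inputs), dim=-1)

        image_inputs = processor(candidate_images, return_tensors="pt").to(device)
        image_embs = F.normalize(model.get_image_features(**image_inputs), dim=-1)

    # Cosine similarity of the query against every candidate image, best first.
    sims = (image_embs @ text_emb.T).squeeze(-1)
    return sims.argsort(descending=True).tolist()

# Example: indices of candidate_images most similar to the Japanese query "dog".
# ranking = rank_images("犬", candidate_images)
```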
The model uses an Eva02-B Transformer architecture as the image encoder and a 12-layer BERT as the text encoder. The text encoder was initialized from rinna/japanese-clip-vit-b-16.
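The parameter counts in the evaluation table below can be sanity-checked against the loaded model; a minimal sketch, assuming `model` from the usage example above:

```python
# Count parameters of the loaded model (roughly the 196M reported below).
total_params = sum(p.numel() for p in model.parameters())
print(f"total parameters: {total_params / 1e6:.0f}M")
```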
| Model | Params | Avg. | ImageNet-1k (acc@1) | Recruit Datasets (acc@1) | WAON (acc@1) | STAIR Captions (R@1) |
|---|---|---|---|---|---|---|
| clip-japanese-base-v2 | 196M | 0.708 | 0.666 | 0.913 | 0.975 | 0.277 |
| clip-japanese-base | 196M | 0.673 | 0.580 | 0.884 | 0.934 | 0.293 |
| llm-jp/waon-siglip2-base-patch16-256 | 375M | 0.664 | 0.555 | 0.872 | 0.951 | 0.276 |
| google/siglip2-base-patch16-224 | 375M | 0.517 | 0.579 | 0.802 | 0.871 | 0.126 |
| google/siglip2-so400m-patch14-224 | 1135M | 0.642 | 0.643 | 0.837 | 0.925 | 0.163 |
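The acc@1 numbers above follow the usual zero-shot protocol: every class name is encoded with the text encoder, and a prediction counts as correct when the image embedding is closest to the ground-truth class embedding. The exact prompts and data loading are not part of this card; the sketch below only illustrates the general procedure, with `samples`, `class_names`, and `zero_shot_accuracy` as hypothetical names and `model` / `tokenizer` / `processor` / `device` reused from the usage example:

```python
import torch
import torch.nn.functional as F

def zero_shot_accuracy(samples, class_names):
    # `samples` is a hypothetical iterable of (PIL.Image, class_index) pairs.
    with torch.no_grad():
        text_inputs = tokenizer(class_names).to(device)
        class_embs = F.normalize(model.get_text_features(**text_inputs), dim=-1)

        correct = 0
        total = 0
        for img, label in samples:
            pixels = processor(img, return_tensors="pt").to(device)
            img_emb = F.normalize(model.get_image_features(**pixels), dim=-1)
            pred = (img_emb @ class_embs.T).argmax(dim=-1).item()
            correct += int(pred == label)
            total += 1
    return correct / total
```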
This model is released under the Apache License, Version 2.0.
```bibtex
@misc{clip-japanese-base-v2,
    title = {CLIP Japanese Base V2},
    author = {Shuntaro Okada, Shuhei Yokoo, Kei Mukaiyama, Peifei Zhu and Shuhei Nishimura},
    url = {https://huggingface.co/line-corporation/clip-japanese-base-v2},
}
```