clip-japanese-base / README.md
pfzhu's picture
Upload folder using huggingface_hub
071945c verified
|
raw
history blame
4.14 kB
metadata
language: ja
license: apache-2.0
tags:
  - clip
  - japanese-clip
pipeline_tag: feature-extraction

clip-japanese-base

This is a Japanese CLIP (Contrastive Language-Image Pre-training) model developed by LY Corporation. This model was trained on ~1B web-collected image-text pairs, and it is applicable to various visual tasks including zero-shot image classification, text-to-image or image-to-text retrieval.

How to use

  1. Install packages
pip install pillow requests sentencepiece transformers torch timm
  1. Run
import io
import requests
from PIL import Image
import torch
from transformers import AutoImageProcessor, AutoModel, AutoTokenizer

HF_MODEL_PATH = 'line-corporation/clip-japanese-base'
tokenizer = AutoTokenizer.from_pretrained(HF_MODEL_PATH, trust_remote_code=True)
processor = AutoImageProcessor.from_pretrained(HF_MODEL_PATH, trust_remote_code=True)
model = AutoModel.from_pretrained(HF_MODEL_PATH, trust_remote_code=True)
device = "cuda" if torch.cuda.is_available() else "cpu"

image = Image.open(io.BytesIO(requests.get('https://images.pexels.com/photos/2253275/pexels-photo-2253275.jpeg?auto=compress&cs=tinysrgb&dpr=3&h=750&w=1260').content))
image = processor(image, return_tensors="pt")
text = tokenizer(["犬", "猫", "象"])

with torch.no_grad():
    image_features = model.get_image_features(**image)
    text_features = model.get_text_features(**text)
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
# [[1., 0., 0.]]

Model architecture

The model uses an Eva02-B Transformer architecture as the image encoder and a 12-layer BERT as the text encoder. The text encoder was initialized from rinna/japanese-clip-vit-b-16.

Evaluation

Dataset

  • STAIR Captions (v2014 val set of MSCOCO) for image-to-text (i2t) and text-to-image (t2i) retrieval. We measure performance using R@1, which is the average recall of i2t and t2i retrieval.
  • Recruit Datasets for image classification.
  • ImageNet-1K for image classification. We translated all classnames into Japanese. The classnames and templates can be found in ja-imagenet-1k-classnames.txt and ja-imagenet-1k-templates.txt.

Result

Model Image Encoder Params Text Encoder params STAIR Captions (R@1) Recruit Datasets (acc@1) ImageNet-1K (acc@1)
Ours 86M(Eva02-B) 100M(BERT) 0.30 0.89 0.58
Stable-ja-clip 307M(ViT-L) 100M(BERT) 0.24 0.77 0.68
Rinna-ja-clip 86M(ViT-B) 100M(BERT) 0.13 0.54 0.56
Laion-clip 632M(ViT-H) 561M(XLM-RoBERTa) 0.30 0.83 0.58
Hakuhodo-ja-clip 632M(ViT-H) 100M(BERT) 0.21 0.82 0.46

Licenses

The Apache License, Version 2.0

Citation

@misc{clip-japanese-base,
    title = {CLIP Japanese Base},
    author={Shuhei Yokoo, Shuntaro Okada, Peifei Zhu, Shuhei Nishimura and Naoki Takayama}
    url = {https://huggingface.co/line-corporation/clip-japanese-base},
}