


This is a Japanese CLOOB (Contrastive Leave One Out Boost) model trained by rinna Co., Ltd.

Please see japanese-clip for the other available models.

How to use the model

  1. Install the package

```sh
$ pip install git+https://github.com/rinnakk/japanese-clip.git
```
  2. Run

```python
import io
import requests
from PIL import Image
import torch
import japanese_clip as ja_clip

device = "cuda" if torch.cuda.is_available() else "cpu"

model, preprocess = ja_clip.load("rinna/japanese-cloob-vit-b-16", device=device)
tokenizer = ja_clip.load_tokenizer()

img = Image.open(io.BytesIO(requests.get('https://images.pexels.com/photos/2253275/pexels-photo-2253275.jpeg?auto=compress&cs=tinysrgb&dpr=3&h=750&w=1260').content))
image = preprocess(img).unsqueeze(0).to(device)
encodings = ja_clip.tokenize(
    texts=["犬", "猫", "象"],  # "dog", "cat", "elephant"
    device=device,
    tokenizer=tokenizer,  # optional; if omitted, the tokenizer is loaded on every call
)

with torch.no_grad():
    image_features = model.get_image_features(image)
    text_features = model.get_text_features(**encodings)
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)  # prints: [[1.0, 0.0, 0.0]]
```

Model architecture

The model uses a ViT-B/16 Transformer as its image encoder and a 12-layer RoBERTa as its text encoder. The image encoder was initialized from google/vit-base-patch16-224, and the text encoder from the Japanese pre-trained RoBERTa model rinna/japanese-roberta-base, reusing its sentencepiece tokenizer.
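At inference time, the two encoders' outputs are compared by cosine similarity and converted to label probabilities with a softmax, as in the usage example above. A minimal NumPy sketch of that scoring step, using random toy vectors in place of real encoder outputs (the function name and temperature scale here are illustrative, not the library's API):

```python
import numpy as np

def zero_shot_probs(image_features, text_features, scale=100.0):
    """CLIP/CLOOB-style scoring: softmax over cosine similarities to each caption."""
    # L2-normalize so the dot product equals cosine similarity
    img = image_features / np.linalg.norm(image_features, axis=-1, keepdims=True)
    txt = text_features / np.linalg.norm(text_features, axis=-1, keepdims=True)
    logits = scale * img @ txt.T  # shape: (n_images, n_texts)
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))  # stable softmax
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
image_features = rng.normal(size=(1, 512))  # stand-in for one ViT-B/16 embedding
text_features = rng.normal(size=(3, 512))   # stand-ins for three caption embeddings
probs = zero_shot_probs(image_features, text_features)
print(probs.shape)  # (1, 3); each row sums to 1
```

With real embeddings from `model.get_image_features` and `model.get_text_features`, the same computation yields the `text_probs` shown in the usage example.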


The model was trained on CC12M, with the captions translated into Japanese.


The Apache 2.0 license
