


This is a Japanese CLOOB (Contrastive Leave One Out Boost) model trained by rinna Co., Ltd.

Please see japanese-clip for the other available models.

How to use the model

  1. Install package
$ pip install git+https://github.com/rinnakk/japanese-clip.git
  2. Run
import io
import requests
from PIL import Image
import torch
import japanese_clip as ja_clip

device = "cuda" if torch.cuda.is_available() else "cpu"

model, preprocess = ja_clip.load("rinna/japanese-cloob-vit-b-16", device=device)
tokenizer = ja_clip.load_tokenizer()

img = Image.open(io.BytesIO(requests.get('https://images.pexels.com/photos/2253275/pexels-photo-2253275.jpeg?auto=compress&cs=tinysrgb&dpr=3&h=750&w=1260').content))
image = preprocess(img).unsqueeze(0).to(device)
encodings = ja_clip.tokenize(
    texts=["犬", "猫", "象"],
    tokenizer=tokenizer,  # optional; if omitted, the tokenizer is loaded on every call
)

with torch.no_grad():
    image_features = model.get_image_features(image)
    text_features = model.get_text_features(**encodings)
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)  # prints: [[1.0, 0.0, 0.0]]
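
The last step of the snippet above turns image-text similarities into label probabilities. The following is a minimal, dependency-free sketch of that computation, not the library's implementation. It assumes the features are L2-normalized before the dot product (so the dot product is a cosine similarity) and uses hypothetical 3-dimensional toy vectors in place of real model outputs. The helper names `l2_normalize` and `label_probs` are illustrative only.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def label_probs(image_vec, text_vecs, scale=100.0):
    """Softmax over scaled cosine similarities between one image and several texts."""
    image_unit = l2_normalize(image_vec)
    logits = [
        scale * sum(a * b for a, b in zip(image_unit, l2_normalize(t)))
        for t in text_vecs
    ]
    # Subtract the largest logit before exponentiating for numerical stability.
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy embeddings (hypothetical values, NOT outputs of the model).
labels = ["犬", "猫", "象"]
image_vec = [0.9, 0.1, 0.0]
text_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

probs = label_probs(image_vec, text_vecs)
best = labels[probs.index(max(probs))]
print(best, [round(p, 3) for p in probs])
```

With a scale of 100, even a moderate gap in cosine similarity pushes the softmax close to one-hot, which is why the example output above looks like a hard classification rather than a spread of probabilities.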

Model architecture

The model uses a ViT-B/16 Transformer architecture as an image encoder and a 12-layer BERT as a text encoder. The image encoder was initialized from the AugReg vit-base-patch16-224 model.


The model was trained on CC12M, with the captions translated into Japanese.


The Apache 2.0 license
