
Japanese CLIP ViT-H/14 (Wider)

Table of Contents

  1. Overview
  2. Usage
  3. Model Details
  4. Evaluation
  5. Limitations and Biases
  6. Citation
  7. See Also
  8. Contact Information


Overview

This is a Japanese CLIP (Contrastive Language-Image Pre-training) model that maps Japanese text and images into a shared embedding space. It supports multimodal tasks such as zero-shot image classification, text-to-image retrieval, and image-to-text retrieval. Combined with other components, it can also serve as a building block for generative models, such as image-to-text and text-to-image generation.



Usage

Install the dependencies:

python3 -m pip install pillow sentencepiece torch torchvision transformers


The API is similar to that of CLIPModel and VisionTextDualEncoderModel in transformers.

import requests
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor, BatchEncoding

# Download
model_name = "hakuhodo-tech/japanese-clip-vit-h-14-bert-wider"
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModel.from_pretrained(model_name, trust_remote_code=True).to(device)
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)

# Prepare raw inputs
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Process inputs
inputs = processor(
    text=["犬", "猫", "象"],  # candidate labels: dog, cat, elephant
    images=image,
    padding=True,
    return_tensors="pt",
)

# Infer and output
outputs = model(**BatchEncoding(inputs).to(device))
probs = outputs.logits_per_image.softmax(dim=1)
print([f"{x:.2f}" for x in probs.flatten().tolist()])  # ['0.00', '1.00', '0.00']
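The probabilities printed above come from a softmax over image-text similarity scores. The following is a minimal sketch of that step in plain Python, assuming the usual CLIP recipe (L2-normalized embeddings, cosine similarity, a temperature-like scale). The 3-dimensional embeddings and the `logit_scale` of 100 are hypothetical values chosen for illustration, not outputs or parameters of this model.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """Softmax over scaled cosine similarities between one image and N texts.

    logit_scale is an assumed value; the real model stores its own scale.
    """
    image_emb = l2_normalize(image_emb)
    logits = []
    for text_emb in text_embs:
        text_emb = l2_normalize(text_emb)
        cosine = sum(a * b for a, b in zip(image_emb, text_emb))
        logits.append(logit_scale * cosine)
    return softmax(logits)


# Toy embeddings (hypothetical, not real model outputs).
image_emb = [0.1, 0.9, 0.2]
text_embs = [
    [0.9, 0.1, 0.0],  # "犬" (dog)
    [0.1, 0.8, 0.3],  # "猫" (cat)
    [0.0, 0.2, 0.9],  # "象" (elephant)
]
probs = zero_shot_probs(image_emb, text_embs)
print([f"{p:.2f}" for p in probs])
```

Because the scale multiplies cosine similarities before the softmax, even modest differences in similarity produce near one-hot probabilities, which is consistent with the sharply peaked output in the usage example.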

Model Details


The model consists of a frozen ViT-H image encoder from laion/CLIP-ViT-H-14-laion2B-s32B-b79K and a 12-layer 24-head BERT text encoder initialized from hakuhodo-tech/japanese-clip-vit-h-14-bert-base with Model Fusion.


The model was trained by Zhi Wang on 8 A100 (80 GB) GPUs using Locked-image Tuning (LiT), in which the image encoder stays frozen and only the text encoder is updated. See the paper for more details.
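The paper is the authority on the training recipe. As a rough illustration only, the sketch below assumes the standard CLIP-style symmetric contrastive objective that LiT-style training typically optimizes: given an N x N matrix of image-text logits for a batch, matching pairs sit on the diagonal and the loss is the average of the image-to-text and text-to-image cross-entropies. The function names and toy matrices are hypothetical; under LiT the image embeddings would come from the frozen encoder, so only the text side would receive gradient updates.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    log_sum = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_sum for x in row]


def contrastive_loss(logits):
    """Symmetric cross-entropy over an N x N image-text logit matrix.

    logits[i][j] scores image i against text j; diagonal entries are
    the matching pairs.
    """
    n = len(logits)
    # Image-to-text: each image (row) should select its own text (column).
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image: the same computation on the transposed matrix.
    transposed = [list(col) for col in zip(*logits)]
    loss_t2i = -sum(log_softmax(transposed[j])[j] for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2


# Toy batches (hypothetical logits): well-aligned vs. mismatched pairs.
aligned = [[10.0, 0.0], [0.0, 10.0]]
shuffled = [[0.0, 10.0], [10.0, 0.0]]
print(contrastive_loss(aligned) < contrastive_loss(shuffled))
```

The loss is close to zero when each image scores highest against its own caption and grows when the pairing is wrong, which is the signal that pulls the text encoder toward the fixed image embedding space.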


Training Data

The Japanese subset of the laion2B-multi dataset, containing ~120M image-text pairs.


Evaluation

Testing Data

The 5K evaluation set (val2017) of MS-COCO with STAIR Captions.


Metrics: zero-shot image-to-text (text retrieval) and text-to-image (image retrieval) Recall@1, Recall@5, and Recall@10.
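The evaluation script is not part of this card, so the following is only a sketch of how Recall@K is commonly computed for retrieval. It assumes the usual convention that a query counts as a hit when at least one of its ground-truth candidates appears in the top K (relevant for MS-COCO, where each image has several captions). The function name `recall_at_k` and the toy similarity matrix are hypothetical.

```python
def recall_at_k(similarity, relevant, k):
    """Percentage of queries with at least one correct candidate in the top k.

    similarity[q][c] is the score of candidate c for query q.
    relevant[q] is the set of candidate indices that are correct for q.
    """
    hits = 0
    for scores, targets in zip(similarity, relevant):
        # Rank candidate indices by descending similarity.
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if targets & set(ranked[:k]):
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy example: 3 image queries scored against 4 captions (hypothetical values).
similarity = [
    [0.9, 0.2, 0.1, 0.3],  # best caption is index 0
    [0.4, 0.1, 0.5, 0.6],  # ranking: 3, 2, 0, 1
    [0.8, 0.7, 0.2, 0.1],  # ranking: 0, 1, 2, 3
]
relevant = [{0, 1}, {2}, {3}]

for k in (1, 2, 4):
    print(f"R@{k}: {recall_at_k(similarity, relevant, k):.1f}")
```

For image retrieval the roles are swapped: captions are the queries, images are the candidates, and each caption has exactly one relevant image.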


| Model | Text Retrieval R@1 | Text Retrieval R@5 | Text Retrieval R@10 | Image Retrieval R@1 | Image Retrieval R@5 | Image Retrieval R@10 |
|---|---|---|---|---|---|---|
| recruit-jp/japanese-clip-vit-b-32-roberta-base | 23.0 | 46.1 | 57.4 | 16.1 | 35.4 | 46.3 |
| rinna/japanese-cloob-vit-b-16 | 37.1 | 63.7 | 74.2 | 25.1 | 48.0 | 58.8 |
| rinna/japanese-clip-vit-b-16 | 36.9 | 64.3 | 74.3 | 24.8 | 48.8 | 60.0 |
| Japanese CLIP ViT-H/14 (Base) | 39.2 | 66.3 | 76.6 | 28.9 | 53.3 | 63.9 |
| Japanese CLIP ViT-H/14 (Deeper) | 48.7 | 74.0 | 82.4 | 36.5 | 61.5 | 71.8 |
| Japanese CLIP ViT-H/14 (Wider) | 47.9 | 74.2 | 83.2 | 37.3 | 62.8 | 72.7 |

* Japanese Stable CLIP ViT-L/16 is excluded from the zero-shot retrieval evaluation because that model was partially pre-trained on MS-COCO.

Limitations and Biases

Despite our data filtering, the training dataset may still contain offensive or inappropriate content. When deploying the model in production systems, users should consider the potential societal impact and ethical implications of its outputs. We recommend not using the model for applications that could cause harm or distress to individuals or groups.


Citation

If you find this model useful, please consider citing:

 author = {王 直 and 細野 健人 and 石塚 湖太 and 奥田 悠太 and 川上 孝介},
 journal = {言語処理学会年次大会発表論文集},
 month = {Mar},
 pages = {1547--1552},
 title = {日本語特化の視覚と言語を組み合わせた事前学習モデルの開発 Developing Vision-Language Pre-Trained Models for {J}apanese},
 volume = {30},
 year = {2024}

See Also

Contact Information

Please contact hr-koho@hakuhodo-technologies.co.jp for questions and comments about the model, and/or for business and partnership inquiries.


Model size: 910M params