Update README.md
README.md CHANGED
@@ -11,6 +11,7 @@ tags:
 - vision
 - japanese-clip
 - japanese
+pipeline_tag: zero-shot-image-classification
 ---
 
 # Mitsua Japanese CLIP ViT-B-16
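The `pipeline_tag` added in the hunk above surfaces the model under the transformers `zero-shot-image-classification` pipeline. A minimal usage sketch, assuming the repository loads through the standard pipeline API; `trust_remote_code=True`, the image path, and the Japanese labels are illustrative assumptions:

```python
# Minimal sketch: zero-shot classification through the transformers pipeline.
# trust_remote_code=True is an assumption; the repo may ship custom modeling code.
from transformers import pipeline
from PIL import Image

classifier = pipeline(
    "zero-shot-image-classification",
    model="Mitsua/mitsua-japanese-clip-vit-b-16",
    trust_remote_code=True,
)

image = Image.open("example.jpg")  # any local image (hypothetical path)
# Japanese candidate labels; the model supports Japanese and English.
print(classifier(image, candidate_labels=["犬", "猫", "鳥"]))
```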
@@ -169,14 +170,28 @@ As mentioned above, this model does not use any pretrained model and is trained
   - The sentencepiece tokenizer was trained on the licensed corpus with a 64k vocabulary
   - The training corpus was extracted from the image-text training dataset listed above.
 3. Train CLIP model
-  - Then, the CLIP model is trained on the licensed + openly licensed + public domain dataset.
+  - Then, the CLIP model is trained on the licensed + openly licensed + public domain dataset. The standard contrastive loss is used.
   - Image Encoder: ViT-B-16 initialized with the fractal pretrained weights from step 1
   - Text Encoder: 12-layer masked text transformer with the 64k sentencepiece tokenizer
   - The training dataset consists of approx. 30M images, which is relatively small for CLIP training
   - Training took approx. 400 H100 GPU hours for 64 epochs.
 
+### Implementation Notes
+- For HF-compatible CLIP modeling, `SiglipTextModel` is used for the text encoder because it provides better compatibility with our sentencepiece tokenizer.
+- This CLIP model is trained with the standard contrastive loss, not the SigLIP loss, since we did not see any improvement from the SigLIP loss over the CLIP loss in our internal ablation study.
+
 ## Evaluation
-
+We evaluated Japanese zero-shot classification accuracy.
+### Dataset
+- [japanese-image-classification-evaluation-dataset](https://huggingface.co/datasets/recruit-jp/japanese-image-classification-evaluation-dataset) (CC BY 4.0, developed by Recruit Co., Ltd.)
+
+### Result
+| **Model** | **Training Data** | **Supported Language** | **jafood101** | **jaflower30** | **jafacility20** | **jalandmark10** |
+|:---|:---|:---|---:|---:|---:|---:|
+| **Mitsua/mitsua-japanese-clip-vit-b-16** | **Opt-in / Openly Licensed + PD** | Japanese and English | 0.297 | 0.707 | 0.676 | 0.769 |
+| rinna/japanese-clip-vit-b-16 | CC12M | Japanese | 0.235 | 0.513 | 0.614 | 0.625 |
+| recruit-jp/japanese-clip-vit-b-32-roberta-base | Ja subset of LAION2B-multi | Japanese | 0.502 | 0.556 | 0.647 | **0.803** |
+| google/siglip-base-patch16-256-multilingual | WebLI | Multilingual | **0.776** | **0.928** | **0.692** | 0.762 |
 
 ## Disclaimer
 - The recognition results may be highly incorrect, harmful, or biased. The model was developed to investigate the performance achievable with only a relatively small amount of licensed data, and it is not suitable for use cases requiring high recognition accuracy. Under Section 5 of the CC BY-SA 4.0 License, ELAN MITSUA Project / Abstract Engine is not responsible for any direct or indirect loss caused by the use of this model.
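Step 2 of the training recipe above trains a 64k-vocabulary sentencepiece tokenizer on captions extracted from the training set. A minimal sketch of that step; the corpus file name, the `unigram` model type, and the coverage setting are assumptions, not the project's published settings:

```python
# Sketch of step 2: training a 64k-vocabulary sentencepiece tokenizer.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",         # hypothetical file: one extracted caption per line
    model_prefix="ja_clip_sp",  # writes ja_clip_sp.model / ja_clip_sp.vocab
    vocab_size=64000,           # the 64k vocabulary stated in the card
    character_coverage=0.9995,  # assumption: a common setting for Japanese text
    model_type="unigram",       # assumption: sentencepiece's default algorithm
)
```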
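Step 3 and the Implementation Notes state that training uses the standard contrastive (InfoNCE) loss rather than the SigLIP sigmoid loss. A minimal PyTorch sketch of that symmetric loss over a batch of paired embeddings:

```python
# Minimal sketch of the standard CLIP contrastive (InfoNCE) loss, as opposed
# to the pairwise sigmoid loss used by SigLIP.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, logit_scale):
    # image_emb, text_emb: (batch, dim); row i of each comes from the same pair.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = logit_scale * image_emb @ text_emb.t()  # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy: each image must match its own caption and vice versa.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```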
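The Implementation Notes pair a ViT-B/16 vision tower with a `SiglipTextModel` text tower. A hypothetical sketch of how such a dual encoder could be assembled in transformers; apart from the 64k vocabulary, the 12 text layers, and the patch size, every config value is an assumption rather than the repository's actual configuration:

```python
# Hypothetical assembly of the dual encoder described in the Implementation
# Notes; this is not the repo's actual modeling code.
from transformers import SiglipTextConfig, SiglipTextModel, ViTConfig, ViTModel

text_encoder = SiglipTextModel(SiglipTextConfig(
    vocab_size=64000,      # 64k sentencepiece vocabulary (from the card)
    num_hidden_layers=12,  # 12-layer text transformer (from the card)
))

vision_encoder = ViTModel(ViTConfig(
    image_size=224,        # assumption: a common ViT-B/16 input size
    patch_size=16,         # the B/16 patch size
))
```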
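The accuracies in the Result table follow the usual zero-shot protocol: embed each class name as a Japanese prompt, embed each image, and pick the nearest class. A sketch of that procedure; the prompt template, the `get_text_features`/`get_image_features` accessors, and the integer labels are assumptions about this model's interface:

```python
# Sketch of zero-shot top-1 accuracy, the metric behind the Result table.
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_accuracy(model, processor, images, labels, class_names):
    # Embed every class name once, using a simple Japanese prompt (assumption).
    prompts = [f"{name}の写真" for name in class_names]
    text_in = processor(text=prompts, padding=True, return_tensors="pt")
    text_emb = F.normalize(model.get_text_features(**text_in), dim=-1)

    correct = 0
    for image, label in zip(images, labels):
        img_in = processor(images=image, return_tensors="pt")
        img_emb = F.normalize(model.get_image_features(**img_in), dim=-1)
        pred = (img_emb @ text_emb.t()).argmax(dim=-1).item()  # nearest class
        correct += int(pred == label)
    return correct / len(images)
```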