Mitsua committed
Commit 255034d
1 parent: 07e5ffb

Update README.md

Files changed (1): README.md (+17, -2)
README.md CHANGED
@@ -11,6 +11,7 @@ tags:
  - vision
  - japanese-clip
  - japanese
+ pipeline_tag: zero-shot-image-classification
  ---

  # Mitsua Japanese CLIP ViT-B-16
@@ -169,14 +170,28 @@ As mentioned above, this model does not use any pretrained model and is trained
  - The sentencepiece tokenizer was trained on the licensed corpus with a 64k vocabulary
  - The training corpus was extracted from the image-text training dataset listed above.
  3. Train CLIP model
- - Then, CLIP model is trained on licensed + openly-licensed + public domain dataset.
+ - Then, the CLIP model is trained on the licensed + openly licensed + public domain dataset, using the standard contrastive loss.
  - Image Encoder: ViT-B-16 initialized with the fractal pretrained weights from step 1
  - Text Encoder: 12-layer masked text transformer with the 64k sentencepiece tokenizer
  - The training dataset consists of approx. 30M images, which is relatively small for CLIP training
  - Training took approx. 400 H100 GPU hours for 64 epochs.

+ ### Implementation Notes
+ - For HF-compatible CLIP modeling, `SiglipTextModel` is used for the text encoder because it provides better compatibility with our sentencepiece tokenizer.
+ - This CLIP model is trained with the standard contrastive (CLIP) loss, not the SigLIP loss, since we did not see any improvement from the SigLIP loss over the CLIP loss in our internal ablation study.
+
  ## Evaluation
- - TBD
+ We evaluated Japanese zero-shot classification accuracy.
+ ### Dataset
+ - [japanese-image-classification-evaluation-dataset](https://huggingface.co/datasets/recruit-jp/japanese-image-classification-evaluation-dataset) (CC BY 4.0, developed by Recruit Co., Ltd.)
+
+ ### Result
+ | **Model** | **Training Data** | **Supported Languages** | **jafood101** | **jaflower30** | **jafacility20** | **jalandmark10** |
+ |:---|:---|:---|---:|---:|---:|---:|
+ | **Mitsua/mitsua-japanese-clip-vit-b-16** | **Opt-in / Openly Licensed + PD** | Japanese and English | 0.297 | 0.707 | 0.676 | 0.769 |
+ | rinna/japanese-clip-vit-b-16 | CC12M | Japanese | 0.235 | 0.513 | 0.614 | 0.625 |
+ | recruit-jp/japanese-clip-vit-b-32-roberta-base | Ja subset of LAION2B-multi | Japanese | 0.502 | 0.556 | 0.647 | **0.803** |
+ | google/siglip-base-patch16-256-multilingual | WebLI | Multilingual | **0.776** | **0.928** | **0.692** | 0.762 |

  ## Disclaimer
  - Recognition results may be highly inaccurate, harmful, or biased. The model was developed to investigate the performance achievable with only a relatively small amount of licensed data, and it is not suitable for use cases requiring high recognition accuracy. Under Section 5 of the CC BY-SA 4.0 License, ELAN MITSUA Project / Abstract Engine is not responsible for any direct or indirect loss caused by the use of the model.
 
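The updated training notes state that the model is optimized with the standard contrastive (CLIP) loss rather than the SigLIP loss. As a reference, here is a minimal PyTorch sketch of that symmetric contrastive objective; it is a generic illustration, not this project's actual training code, and the function name, batch shapes, and fixed temperature are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          logit_scale: torch.Tensor) -> torch.Tensor:
    """Standard symmetric contrastive (CLIP) loss over a batch of paired
    image/text embeddings of shape (batch, dim)."""
    # L2-normalize so dot products become cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; the diagonal holds the true pairs.
    logits_per_image = logit_scale * image_emb @ text_emb.t()
    logits_per_text = logits_per_image.t()

    # Each image must pick out its own caption, and vice versa.
    targets = torch.arange(image_emb.size(0), device=image_emb.device)
    loss_i = F.cross_entropy(logits_per_image, targets)
    loss_t = F.cross_entropy(logits_per_text, targets)
    return (loss_i + loss_t) / 2

# Example with random embeddings and a fixed temperature:
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512),
                             torch.tensor(100.0))
```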
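The new `pipeline_tag: zero-shot-image-classification` and the implementation notes describe an HF-compatible CLIP model with a `SiglipTextModel` text encoder. The sketch below shows the generic zero-shot recipe: embed candidate Japanese labels, embed the image, and pick the highest cosine similarity. The `AutoModel`/`AutoProcessor` loading, the `trust_remote_code=True` flag, the `get_text_features`/`get_image_features` methods, and the bare class-name prompts are assumptions borrowed from standard CLIP/SigLIP checkpoints; check the model card for the exact interface and recommended prompt format.

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

# Assumption: the checkpoint loads via the Auto classes (possibly with remote
# code) and exposes CLIP/SigLIP-style feature extraction methods.
model_id = "Mitsua/mitsua-japanese-clip-vit-b-16"
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

labels = ["犬", "猫", "ラーメン", "桜"]   # candidate Japanese class names (illustrative)
image = Image.open("example.jpg")        # hypothetical local image

with torch.no_grad():
    text_inputs = processor(text=labels, padding=True, return_tensors="pt")
    image_inputs = processor(images=image, return_tensors="pt")
    text_emb = model.get_text_features(**text_inputs)     # assumed method
    image_emb = model.get_image_features(**image_inputs)  # assumed method

# Rank candidate labels by cosine similarity to the image embedding.
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
scores = (image_emb @ text_emb.T).squeeze(0)
print(labels[scores.argmax().item()])
```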
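The results table reports zero-shot top-1 accuracy on the recruit-jp benchmarks. Below is a sketch of how such a number can be computed with the `datasets` library and the transformers `zero-shot-image-classification` pipeline. The subset name, split, column names, and the use of raw class names as prompts are assumptions to be checked against the dataset card, and whether this checkpoint runs under the stock pipeline is likewise an assumption.

```python
from datasets import load_dataset
from transformers import pipeline

# Assumed subset, split, and column names; consult the dataset card for the real ones.
ds = load_dataset(
    "recruit-jp/japanese-image-classification-evaluation-dataset",
    name="jaflower30",   # assumed subset name
    split="test",        # assumed split
)
class_names = ds.features["label"].names  # assumes a ClassLabel column named "label"

clf = pipeline(
    "zero-shot-image-classification",
    model="Mitsua/mitsua-japanese-clip-vit-b-16",
    trust_remote_code=True,  # assumed to be needed for the custom modeling code
)

correct = 0
for example in ds:
    preds = clf(example["image"], candidate_labels=class_names)  # assumes an "image" column
    correct += int(preds[0]["label"] == class_names[example["label"]])

print(f"zero-shot top-1 accuracy: {correct / len(ds):.3f}")
```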