Update README.md
README.md CHANGED
@@ -11,6 +11,7 @@ tags:
 - vision
 - japanese-clip
 - japanese
+pipeline_tag: zero-shot-image-classification
 ---
 
 # Mitsua Japanese CLIP ViT-B-16
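The `pipeline_tag` added in the hunk above surfaces the model under the transformers `zero-shot-image-classification` pipeline. A minimal usage sketch, assuming the repository loads through the standard pipeline API; `trust_remote_code=True`, the image path, and the Japanese labels are illustrative assumptions:

```python
# Minimal sketch: zero-shot classification through the transformers pipeline.
# trust_remote_code=True is an assumption; the repo may ship custom modeling code.
from transformers import pipeline
from PIL import Image

classifier = pipeline(
    "zero-shot-image-classification",
    model="Mitsua/mitsua-japanese-clip-vit-b-16",
    trust_remote_code=True,
)

image = Image.open("example.jpg")  # any local image (hypothetical path)
# Japanese candidate labels; the model supports Japanese and English.
print(classifier(image, candidate_labels=["犬", "猫", "鳥"]))
```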
@@ -169,14 +170,28 @@ As mentioned above, this model does not use any pretrained model and is trained
   - The sentencepiece tokenizer was trained on the licensed corpus with a 64k vocabulary
   - The training corpus was extracted from the image-text training dataset listed above.
 3. Train CLIP model
-  - Then, the CLIP model is trained on the licensed + openly licensed + public domain dataset.
+  - Then, the CLIP model is trained on the licensed + openly licensed + public domain dataset. The standard contrastive loss is used.
   - Image Encoder: ViT-B-16 initialized with the fractal pretrained weights from step 1
   - Text Encoder: 12-layer masked text transformer with the 64k sentencepiece tokenizer
   - The training dataset consists of approx. 30M images, which is relatively small for CLIP training
   - Training took approx. 400 H100 GPU hours for 64 epochs.
 
+### Implementation Notes
+- For HF-compatible CLIP modeling, `SiglipTextModel` is used for the text encoder because it provides better compatibility with our sentencepiece tokenizer.
+- This CLIP model is trained with the standard contrastive loss, not the SigLIP loss, since we did not see any improvement from the SigLIP loss over the CLIP loss in our internal ablation study.
+
 ## Evaluation
-
+We evaluated Japanese zero-shot classification accuracy.
+### Dataset
+- [japanese-image-classification-evaluation-dataset](https://huggingface.co/datasets/recruit-jp/japanese-image-classification-evaluation-dataset) (CC BY 4.0, developed by Recruit Co., Ltd.)
+
+### Result
+| **Model** | **Training Data** | **Supported Language** | **jafood101** | **jaflower30** | **jafacility20** | **jalandmark10** |
+|:---|:---|:---|---:|---:|---:|---:|
+| **Mitsua/mitsua-japanese-clip-vit-b-16** | **Opt-in / Openly Licensed + PD** | Japanese and English | 0.297 | 0.707 | 0.676 | 0.769 |
+| rinna/japanese-clip-vit-b-16 | CC12M | Japanese | 0.235 | 0.513 | 0.614 | 0.625 |
+| recruit-jp/japanese-clip-vit-b-32-roberta-base | Ja subset of LAION2B-multi | Japanese | 0.502 | 0.556 | 0.647 | **0.803** |
+| google/siglip-base-patch16-256-multilingual | WebLI | Multilingual | **0.776** | **0.928** | **0.692** | 0.762 |
 
 ## Disclaimer
 - The recognition results may be highly incorrect, harmful, or biased. The model was developed to investigate the performance achievable with only a relatively small amount of licensed data, and it is not suitable for use cases requiring high recognition accuracy. Under Section 5 of the CC BY-SA 4.0 License, ELAN MITSUA Project / Abstract Engine is not responsible for any direct or indirect loss caused by the use of this model.
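Step 2 of the training recipe above trains a 64k-vocabulary sentencepiece tokenizer on captions extracted from the training set. A minimal sketch of that step; the corpus file name, the `unigram` model type, and the coverage setting are assumptions, not the project's published settings:

```python
# Sketch of step 2: training a 64k-vocabulary sentencepiece tokenizer.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",         # hypothetical file: one extracted caption per line
    model_prefix="ja_clip_sp",  # writes ja_clip_sp.model / ja_clip_sp.vocab
    vocab_size=64000,           # the 64k vocabulary stated in the card
    character_coverage=0.9995,  # assumption: a common setting for Japanese text
    model_type="unigram",       # assumption: sentencepiece's default algorithm
)
```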
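Step 3 and the Implementation Notes state that training uses the standard contrastive (InfoNCE) loss rather than the SigLIP sigmoid loss. A minimal PyTorch sketch of that symmetric loss over a batch of paired embeddings:

```python
# Minimal sketch of the standard CLIP contrastive (InfoNCE) loss, as opposed
# to the pairwise sigmoid loss used by SigLIP.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, logit_scale):
    # image_emb, text_emb: (batch, dim); row i of each comes from the same pair.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = logit_scale * image_emb @ text_emb.t()  # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy: each image must match its own caption and vice versa.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```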
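The Implementation Notes pair a ViT-B/16 vision tower with a `SiglipTextModel` text tower. A hypothetical sketch of how such a dual encoder could be assembled in transformers; apart from the 64k vocabulary, the 12 text layers, and the patch size, every config value is an assumption rather than the repository's actual configuration:

```python
# Hypothetical assembly of the dual encoder described in the Implementation
# Notes; this is not the repo's actual modeling code.
from transformers import SiglipTextConfig, SiglipTextModel, ViTConfig, ViTModel

text_encoder = SiglipTextModel(SiglipTextConfig(
    vocab_size=64000,      # 64k sentencepiece vocabulary (from the card)
    num_hidden_layers=12,  # 12-layer text transformer (from the card)
))

vision_encoder = ViTModel(ViTConfig(
    image_size=224,        # assumption: a common ViT-B/16 input size
    patch_size=16,         # the B/16 patch size
))
```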
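The accuracies in the Result table follow the usual zero-shot protocol: embed each class name as a Japanese prompt, embed each image, and pick the nearest class. A sketch of that procedure; the prompt template, the `get_text_features`/`get_image_features` accessors, and the integer labels are assumptions about this model's interface:

```python
# Sketch of zero-shot top-1 accuracy, the metric behind the Result table.
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_accuracy(model, processor, images, labels, class_names):
    # Embed every class name once, using a simple Japanese prompt (assumption).
    prompts = [f"{name}の写真" for name in class_names]
    text_in = processor(text=prompts, padding=True, return_tensors="pt")
    text_emb = F.normalize(model.get_text_features(**text_in), dim=-1)

    correct = 0
    for image, label in zip(images, labels):
        img_in = processor(images=image, return_tensors="pt")
        img_emb = F.normalize(model.get_image_features(**img_in), dim=-1)
        pred = (img_emb @ text_emb.t()).argmax(dim=-1).item()  # nearest class
        correct += int(pred == label)
    return correct / len(images)
```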