# KoCLIP

KoCLIP is a Korean port of OpenAI's CLIP.

## Models
We trained a total of two models, `koclip-base` and `koclip-large`. Both models use RoBERTa-large, a fairly large language model. This decision was motivated by the intuition that annotated Korean datasets are scarce, so a well-trained, performant LM would be key to producing a performant multimodal pipeline given limited data.
| KoCLIP         | LM                   | ViT                            |
|----------------|----------------------|--------------------------------|
| `koclip-base`  | `klue/roberta-large` | `openai/clip-vit-base-patch32` |
| `koclip-large` | `klue/roberta-large` | `google/vit-large-patch16-224` |
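
Architecturally, KoCLIP is a CLIP-style dual encoder: the text and image backbones are trained so that embeddings of matching image-caption pairs align. Below is a minimal sketch of how such a dual encoder can be assembled from the backbones in the table using Hugging Face's `VisionTextDualEncoderModel`; it only illustrates the architecture and is not our exact training code.

```python
from transformers import (
    AutoImageProcessor,
    AutoTokenizer,
    VisionTextDualEncoderModel,
    VisionTextDualEncoderProcessor,
)

# Pair the two backbones from the table above into a CLIP-style dual encoder.
# Note: this creates freshly initialized projection heads, so the resulting
# model is untrained; it is shown only to illustrate the architecture.
model = VisionTextDualEncoderModel.from_vision_text_pretrained(
    "openai/clip-vit-base-patch32",  # image encoder (koclip-base configuration)
    "klue/roberta-large",            # Korean text encoder
)

# A single processor wraps the image processor and the Korean tokenizer.
tokenizer = AutoTokenizer.from_pretrained("klue/roberta-large")
image_processor = AutoImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
processor = VisionTextDualEncoderProcessor(image_processor, tokenizer)
```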
## Data
KoCLIP was fine-tuned using 82,783 images from the [MSCOCO](https://cocodataset.org/#home) 2014 image captioning dataset. Korean translations of image captions were obtained from [AI Hub](https://aihub.or.kr/keti_data_board/visual_intelligence), an open database maintained by subsidiaries of the Korean Ministry of Science and ICT. Validation metrics were monitored using approximately 40,000 images from the validation set of the aforementioned dataset.
While we also considered alternative multilingual image captioning datasets, notably the Wikipedia-based Image Text (WiT) dataset, we found non-trivial discrepancies in the way captions were curated in WiT and MSCOCO, and eventually decided to train the model on the relatively cleaner captions of MSCOCO instead of introducing more noise.
## Demo
We present three demos, each of which illustrates a different use case of KoCLIP.
* *Image to Text*: This is essentially a zero-shot image classification task. Given an input image, the model finds the most likely caption among the text labels provided (see the sketch after this list).
* *Text to Image*: This is essentially an image retrieval task. Given a text query, the model looks up a database of pre-computed image embeddings to retrieve the image that best matches the query.
* *Text to Patch*: This is also a variant of zero-shot image classification. Given a text and an image, the image is partitioned into subsections, and the model ranks them based on their relevance to the text query.
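
As an illustration of the first demo, the sketch below runs zero-shot classification with a CLIP-style `model` and `processor` such as the ones assembled above (a trained KoCLIP checkpoint is assumed; the image path and candidate labels are placeholders):

```python
import torch
from PIL import Image

# Candidate text labels for zero-shot classification (placeholders).
labels = ["고양이", "강아지", "자동차"]  # cat, dog, car

image = Image.open("example.jpg")  # placeholder input image
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores;
# softmax turns them into a distribution over the candidate labels.
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```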
## Prompting
We found that KoCLIP performs better when prompting is used to induce zero-shot behavior. Namely, instead of feeding it a single word or short phrase, wrapping the query in a template such as
```
이것은 {{}} 이다 (This is {{}}.)
```
noticeably helped the model. We hypothesize that this is due to the nature of captions in the MSCOCO dataset, which are most often full sentences, albeit sometimes short in length.
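
Concretely, a hedged sketch of wrapping bare labels in the template before encoding them (the helper name is ours, not part of any library):

```python
# Hypothetical helper: wrap a bare label in the prompt template before encoding it.
def to_prompt(label: str) -> str:
    return f"이것은 {label} 이다."  # "This is {label}."

labels = ["강아지", "바다", "자전거"]  # dog, sea, bicycle (placeholder labels)
prompted = [to_prompt(label) for label in labels]
# ['이것은 강아지 이다.', '이것은 바다 이다.', '이것은 자전거 이다.']
```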
## Future Work
Due to time and resource constraints, we have yet to compare KoCLIP to other open-source baselines, such as [M-CLIP](https://huggingface.co/M-CLIP). We hope to benchmark KoCLIP on various metrics and evaluation datasets to further determine its performance and reliability. In addition, given that prompting is somewhat of a mysterious trick and an active area of ongoing research, we hope to explore more scientific approaches to prompt engineering.
---
We thank the teams at Hugging Face and Google for arranging this wonderful opportunity. It has been a busy yet enormously rewarding week for all of us. Hope you enjoy the demo!