# KoCLIP

KoCLIP is a Korean port of OpenAI's CLIP.

## Models

We trained a total of two models, `koclip-base` and `koclip-large`. Both models use RoBERTa-large, a fairly large language model. This decision was motivated by the intuition that annotated Korean datasets are rare; a well-trained, performant LM would be key to producing a performant multimodal pipeline given limited data.

| KoCLIP         | LM                   | ViT                            |
|----------------|----------------------|--------------------------------|
| `koclip-base`  | `klue/roberta-large` | `openai/clip-vit-base-patch32` |
| `koclip-large` | `klue/roberta-large` | `google/vit-large-patch16-224` |

## Data

KoCLIP was fine-tuned using 82,783 images from the [MSCOCO](https://cocodataset.org/#home) 2014 image captioning dataset. Korean translations of the image captions were obtained from [AI Hub](https://aihub.or.kr/keti_data_board/visual_intelligence), an open database maintained by subsidiaries of the Korean Ministry of Science and ICT. Validation metrics were monitored using approximately 40,000 images from the validation set of the same dataset. While we also considered alternative multilingual image captioning datasets, notably the Wikipedia-based Image Text (WiT) dataset, we found non-trivial discrepancies in the way captions were curated in WiT and MSCOCO, and ultimately decided to train the model on the relatively cleaner MSCOCO captions instead of introducing more noise.

## Demo

We present three demos, each of which illustrates a different use case of KoCLIP.

* *Image to Text*: This is essentially a zero-shot image classification task. Given an input image, the model finds the most likely caption among the provided text labels.
* *Text to Image*: This is essentially an image retrieval task. Given a text query, the model looks up a database of pre-computed image embeddings to retrieve the image that best matches the query.
* *Text to Patch*: This is also a variant of zero-shot image classification. Given a text and an image, the image is partitioned into subsections, and the model ranks them based on their relevance to the text query.

All three demos share the same scoring step, comparing text and image embeddings in a common space; a minimal sketch of this step is included at the end of this README.

---

We thank the teams at Hugging Face and Google for arranging this wonderful opportunity. It has been a busy yet enormously rewarding week for all of us. Hope you enjoy the demo!
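
---

Below is a minimal sketch of the scoring step referenced in the Demo section. It is not the exact code used by the demos: it assumes you have already obtained L2-unnormalized image and text embeddings from the KoCLIP encoders as NumPy arrays (the encoder wrappers themselves are not shown here), and the temperature value is an illustrative choice rather than the model's learned logit scale. Given one image embedding and a set of candidate caption embeddings, it ranks the captions by cosine similarity, softmax-normalized into probabilities.

```python
import numpy as np


def zero_shot_classify(image_embedding, text_embeddings, labels, temperature=100.0):
    """Rank candidate captions for a single image, CLIP-style.

    image_embedding: array of shape (d,) from the image encoder.
    text_embeddings: array of shape (n, d) from the text encoder.
    labels: list of n candidate captions (e.g., Korean text labels).
    """
    # L2-normalize so that dot products equal cosine similarities.
    image_embedding = image_embedding / np.linalg.norm(image_embedding)
    text_embeddings = text_embeddings / np.linalg.norm(
        text_embeddings, axis=-1, keepdims=True
    )

    # Scaled similarities, turned into a probability distribution over labels.
    logits = temperature * text_embeddings @ image_embedding
    probs = np.exp(logits - logits.max())
    probs = probs / probs.sum()

    # Return (label, probability) pairs from most to least likely.
    order = np.argsort(-probs)
    return [(labels[i], float(probs[i])) for i in order]


# Example with dummy embeddings; replace with real KoCLIP encoder outputs.
labels = ["고양이", "강아지", "자동차"]
ranked = zero_shot_classify(np.random.randn(512), np.random.randn(3, 512), labels)
print(ranked)
```

The *Text to Image* and *Text to Patch* demos reuse the same idea with the roles reversed: one text embedding is scored against a database of pre-computed image embeddings or against the embeddings of image subsections, respectively.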