# KoCLIP

KoCLIP is a Korean port of OpenAI's CLIP.

## Models
We trained a total of two models, `koclip-base` and `koclip-large`. Both models use RoBERTa-large, a fairly large language model. This decision was motivated by the intuition that annotated Korean datasets are scarce, so a well-trained, performant LM would be key to producing a performant multimodal pipeline given limited data.
| KoCLIP         | LM                   | ViT                            |
|----------------|----------------------|--------------------------------|
| `koclip-base`  | `klue/roberta-large` | `openai/clip-vit-base-patch32` |
| `koclip-large` | `klue/roberta-large` | `google/vit-large-patch16-224` |
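
Architecturally, KoCLIP is a CLIP-style dual encoder: the text and image backbones are trained so that embeddings of matching image-caption pairs align. Below is a minimal sketch of how such a dual encoder can be assembled from the backbones in the table using Hugging Face's `VisionTextDualEncoderModel`; it only illustrates the architecture and is not our exact training code.

```python
from transformers import (
    AutoImageProcessor,
    AutoTokenizer,
    VisionTextDualEncoderModel,
    VisionTextDualEncoderProcessor,
)

# Pair the two backbones from the table above into a CLIP-style dual encoder.
# Note: this creates freshly initialized projection heads, so the resulting
# model is untrained; it is shown only to illustrate the architecture.
model = VisionTextDualEncoderModel.from_vision_text_pretrained(
    "openai/clip-vit-base-patch32",  # image encoder (koclip-base configuration)
    "klue/roberta-large",            # Korean text encoder
)

# A single processor wraps the image processor and the Korean tokenizer.
tokenizer = AutoTokenizer.from_pretrained("klue/roberta-large")
image_processor = AutoImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
processor = VisionTextDualEncoderProcessor(image_processor, tokenizer)
```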
## Data
KoCLIP was fine-tuned using 82,783 images from the [MSCOCO](https://cocodataset.org/#home) 2014 image captioning dataset. Korean translations of image captions were obtained from [AI Hub](https://aihub.or.kr/keti_data_board/visual_intelligence), an open database maintained by subsidiaries of the Korean Ministry of Science and ICT. Validation metrics were monitored using approximately 40,000 images from the validation set of the aforementioned dataset.
While we also considered alternative multilingual image captioning datasets, notably the Wikipedia-based Image Text (WiT) dataset, we found non-trivial discrepancies in the way captions were curated in WiT and MSCOCO, and eventually decided to train the model on the relatively cleaner captions of MSCOCO instead of introducing more noise.
## Demo
We present three demos, each of which illustrates a different use case of KoCLIP.
* *Image to Text*: This is essentially a zero-shot image classification task. Given an input image, the model finds the most likely caption among the text labels provided (see the sketch after this list).
* *Text to Image*: This is essentially an image retrieval task. Given a text query, the model looks up a database of pre-computed image embeddings to retrieve the image that best matches the query.
* *Text to Patch*: This is also a variant of zero-shot image classification. Given a text and an image, the image is partitioned into subsections, and the model ranks them based on their relevance to the text query.
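
As an illustration of the first demo, the sketch below runs zero-shot classification with a CLIP-style `model` and `processor` such as the ones assembled above (a trained KoCLIP checkpoint is assumed; the image path and candidate labels are placeholders):

```python
import torch
from PIL import Image

# Candidate text labels for zero-shot classification (placeholders).
labels = ["고양이", "강아지", "자동차"]  # cat, dog, car

image = Image.open("example.jpg")  # placeholder input image
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores;
# softmax turns them into a distribution over the candidate labels.
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```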
## Prompting
We found that KoCLIP performs better when prompting is used to induce zero-shot behavior. Namely, instead of feeding it a single word or short phrase, wrapping the query in a template such as
```
이것은 {{}} 이다 (This is {{}}.)
```
noticeably helped the model. We hypothesize that this is due to the nature of captions in the MSCOCO dataset, which are most often full sentences, albeit sometimes short in length.
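
Concretely, a hedged sketch of wrapping bare labels in the template before encoding them (the helper name is ours, not part of any library):

```python
# Hypothetical helper: wrap a bare label in the prompt template before encoding it.
def to_prompt(label: str) -> str:
    return f"이것은 {label} 이다."  # "This is {label}."

labels = ["강아지", "바다", "자전거"]  # dog, sea, bicycle (placeholder labels)
prompted = [to_prompt(label) for label in labels]
# ['이것은 강아지 이다.', '이것은 바다 이다.', '이것은 자전거 이다.']
```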
## Future Work
Due to time and resource constraints, we have yet to compare KoCLIP to other open-source baselines, such as [M-CLIP](https://huggingface.co/M-CLIP). We hope to benchmark KoCLIP on various metrics and evaluation datasets to further determine its performance and reliability. In addition, given that prompting is somewhat of a mysterious trick and an active area of ongoing research, we hope to explore more scientific approaches to prompt engineering.
---
We thank the teams at Hugging Face and Google for arranging this wonderful opportunity. It has been a busy yet enormously rewarding week for all of us. Hope you enjoy the demo!