jaketae committed on
Commit
59d1d42
β€’
1 Parent(s): 317f44a

docs: fix minor typo, cleanup grammar

Files changed (1)
  1. intro.md +8 -6
intro.md CHANGED
@@ -4,7 +4,7 @@ KoCLIP is a Korean port of OpenAI's CLIP.

## Models

- We trained a total of two models, `koclip-base` and `koclip-large`. Both models use RoBERTa-large, a fairly large language model. This decision was motivated by the intuition that annotated Korean datasets are rare; a well-trained, performant LM would be key to producing a performant multimodal pipeline given limited data.
+ We trained a total of two models, `koclip-base` and `koclip-large`. Both models use RoBERTa-large. The decision to use a somewhat large language model was motivated by the intuition that annotated Korean datasets are rare; a well-trained, performant LM would be key to a good multimodal pipeline given limited data.

| KoCLIP | LM | ViT |
|----------------|----------------------|--------------------------------|
@@ -13,7 +13,7 @@ We trained a total of two models, `koclip-base` and `koclip-large`. Both models

## Data

- KoCLIP was fine-tuned using 82,783 images from the [MSCOCO](https://cocodataset.org/#home) 2014 image captioning dataset. Korean translations of image captions were obtained from [AI Hub](https://aihub.or.kr/keti_data_board/visual_intelligence), an open database maintained by subsidiaries of the Korean Ministry of Science and ICT. Validation metrics were monitored using approximately 40000 images from the validation set of the aforementioned dataset.
+ KoCLIP was fine-tuned using 82,783 images from the [MSCOCO](https://cocodataset.org/#home) 2014 image captioning dataset. Korean translations of image captions were obtained from [AI Hub](https://aihub.or.kr/keti_data_board/visual_intelligence), an open database maintained by subsidiaries of the Korean Ministry of Science and ICT. Validation metrics were monitored using approximately 40,000 images from the validation set of the aforementioned dataset.

While we also considered alternative multilingual image captioning datasets, notably the Wikipedia-based Image Text dataset (WiT), we found non-trivial discrepancies in the way captions were curated in WiT and MSCOCO, and eventually decided to train the model on the relatively cleaner captions of MSCOCO instead of introducing more noise.

@@ -34,22 +34,24 @@ In this section, we detail some interesting findings we made throughout the project.
We found that KoCLIP performs better when prompting is used to induce zero-shot behavior. Namely, instead of feeding it a single word or short phrase, casting the query into a template such as

```
- 이것은 {{}} 이다 (This is {{}}.)
+ 이것은 {{}} 이다 (EN: This is {{}}.)
```

noticeably helped the model. We hypothesize that this is due to the nature of captions in the MSCOCO dataset, which are most often full sentences, albeit sometimes short in length.

### Multilinguality

- Although KoCLIP was trained exclusively on a Korean dataset, we found that English queries also work surprisingl well for simple words (e.g. "dog"). This could be due to one of two reasons, or a combination thereof:
+ Although KoCLIP was trained exclusively on a Korean dataset, we found that English queries also work surprisingly well for simple words (e.g. "dog", "car"). This could be due to one of two reasons, or a combination thereof:

- * *ViT Pretraining*: The ViT backbone for `koclip-base`, `openai/clip-vit-base-patch32`, was already pretrained on an English image captioning dataset. Hence, it is possible that its embeddings still lie in a latent space where vector arithmetic can be performed with English text embeddings. One reason against this hypothesis is the fact that `koclip-large` also demonstrates limited multilingual behavior.
+ * *ViT Pretraining*: The ViT backbone for `koclip-base`, `openai/clip-vit-base-patch32`, was already pretrained on an English dataset. Hence, it is possible that its embeddings still lie in a latent space where vector arithmetic can be performed with English text embeddings. One reason against this hypothesis is that `koclip-large` also demonstrates similar multilingual behavior.

* *LM Knowledge Bleed*: `klue/roberta-large` was trained on a large corpus of Korean text in a self-supervised fashion. One might reasonably suspect that English words were included in parts of the corpus, especially given the high frequency of English word transliterations in contemporary conversational Korean. This might also explain why English queries work for both `koclip-base` and `koclip-large`. One reason against this hypothesis is that the authors of KLUE explicitly state in their paper that one criterion for text selection was that "the corpus must be written in contemporary Korean."

+ At the end of the day, we still found it intriguing that a model fine-tuned exclusively on Korean managed to produce semantic embeddings that worked well with the ViT image encoder.

## Future Work

- Due to time and resource constraints, we have yet to compare KoCLIP to other open-source baselines, such as [M-CLIP](https://huggingface.co/M-CLIP). We hope to benchmark KoCLIP on various metrics and evaluation datasets to further determine its performance and reliability. In addition, given that prompt engineering is somewhat of a mystery and an active area of ongoing research, we hope to explore more scientific approaches to the topic.
+ Due to time and resource constraints, we have yet to compare KoCLIP to other open-source baselines, such as [M-CLIP](https://huggingface.co/M-CLIP). We hope to benchmark KoCLIP on various metrics and evaluation datasets to further measure its performance and reliability. In addition, given that prompt engineering is somewhat of a mystery and an active area of ongoing research, we hope to explore more scientific approaches to this topic.

## References
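
To make the prompting recipe in the diff above concrete, here is a minimal zero-shot classification sketch. It assumes the checkpoint is published under a Hub ID like `koclip/koclip-base` and that it loads through transformers' `VisionTextDualEncoderModel`/`VisionTextDualEncoderProcessor`; both are assumptions made purely for illustration, and the loading code in the KoCLIP repository itself is authoritative.

```python
# Illustrative sketch only. Assumptions not taken from the commit above:
# the Hub ID "koclip/koclip-base" and loading via VisionTextDualEncoderModel;
# defer to the KoCLIP repository for the canonical loading code.
import requests
from PIL import Image
from transformers import VisionTextDualEncoderModel, VisionTextDualEncoderProcessor

model_id = "koclip/koclip-base"  # hypothetical checkpoint ID
model = VisionTextDualEncoderModel.from_pretrained(model_id)
processor = VisionTextDualEncoderProcessor.from_pretrained(model_id)

# Wrap each candidate label in the prompt template from the section above,
# "이것은 {} 이다." ("This is {}."), so each query reads like an MSCOCO-style caption.
labels = ["강아지", "자동차", "고양이"]  # dog, car, cat
texts = [f"이것은 {label} 이다." for label in labels]

# Any test image works here; the URL is a placeholder.
image = Image.open(requests.get("https://example.com/test.jpg", stream=True).raw)

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image holds the temperature-scaled image-text similarities;
# a softmax over the candidate prompts gives zero-shot class scores.
probs = outputs.logits_per_image.softmax(dim=-1)
for label, prob in zip(labels, probs[0].tolist()):
    print(f"{label}: {prob:.3f}")
```

Swapping an English word such as "dog" into the same template is one way to probe the multilingual behavior discussed in the diff; per the findings above, the full-sentence template tends to matter more than the exact wording of the label.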