jaketae committed on
Commit 145ffed
1 Parent(s): 6458346

docs: add multilinguality to findings section

Files changed (1):
1. intro.md (+38, -2)
intro.md CHANGED
@@ -25,7 +25,11 @@ We present three demos, which each illustrate different use cases of KoCLIP.
 * *Text to Image*: This is essentially an image retrieval task. Given a text, the model looks up a database of pre-computed image embeddings to retrieve the image that best matches the given text.
 * *Text to Patch*: This is also a variant of zero-shot image classification. Given a text and an image, the image is partitioned into subsections, and the model ranks them based on their relevance to the text query.
 
-## Prompting
+## Findings
+
+In this section, we detail some interesting findings we made throughout the project.
+
+### Prompting
 
 We found that KoCLIP performs better when prompting is used to induce zero-shot behavior. Namely, instead of feeding it a single word or short phrase, casting a template such as
 
@@ -35,9 +39,41 @@ We found that KoCLIP performs better when prompting is used to induce zero-shot
 
 noticeably helped the model. We hypothesize that this is due to the nature of captions in the MSCOCO dataset, which are most often full sentences, albeit sometimes short in length.
 
+### Multilinguality
+
+Although KoCLIP was trained exclusively on a Korean dataset, we found that English queries also work surprisingly well for simple words (e.g. "dog"). This could be due to one of two reasons, or a combination thereof:
+
+* *ViT Pretraining*: The ViT backbone for `koclip-base`, `openai/clip-vit-base-patch32`, was already pretrained on an English image captioning dataset. Hence, it is possible that its embeddings still lie in a latent space where vector arithmetic can be performed with English text embeddings. One reason against this hypothesis is that `koclip-large` also demonstrates limited multilingual behavior.
+
+* *LM Knowledge Bleed*: `klue/roberta-large` was trained on a large corpus of Korean text in a self-supervised fashion. One might reasonably suspect that English words were included in parts of the corpus, especially given the high frequency of English word transliterations in contemporary conversational Korean. This might also explain why English queries work for both `koclip-base` and `koclip-large`. One reason against this hypothesis is that the authors of KLUE explicitly state in their paper that one criterion for text selection was that "the corpus must be written in contemporary Korean."
+
 ## Future Work
 
-Due to time and resource constraints, we have yet to compare KoCLIP to other open-source baselines, such as [M-CLIP](https://huggingface.co/M-CLIP). We hope to benchmark KoCLIP on various metrics and evaluation datasets to further determine its performance and reliability. In addition, given that prompting is somewhat of a mysterious trick and an active area of ongoing research, we hope to explore ways to take a more scientific approach on prompt engineering.
+Due to time and resource constraints, we have yet to compare KoCLIP to other open-source baselines, such as [M-CLIP](https://huggingface.co/M-CLIP). We hope to benchmark KoCLIP on various metrics and evaluation datasets to further determine its performance and reliability. In addition, given that prompt engineering is somewhat of a mystery and an active area of ongoing research, we hope to explore more scientific approaches to the topic.
+
+## References
+
+```bibtex
+@misc{park2021klue,
+  title={KLUE: Korean Language Understanding Evaluation},
+  author={Sungjoon Park and Jihyung Moon and Sungdong Kim and Won Ik Cho and Jiyoon Han and Jangwon Park and Chisung Song and Junseong Kim and Yongsook Song and Taehwan Oh and Joohong Lee and Juhyun Oh and Sungwon Lyu and Younghoon Jeong and Inkwon Lee and Sangwoo Seo and Dongjun Lee and Hyunwoo Kim and Myeonghwa Lee and Seongbo Jang and Seungwon Do and Sunkyoung Kim and Kyungtae Lim and Jongwon Lee and Kyumin Park and Jamin Shin and Seonghyun Kim and Lucy Park and Alice Oh and Jung-Woo Ha and Kyunghyun Cho},
+  year={2021},
+  eprint={2105.09680},
+  archivePrefix={arXiv},
+  primaryClass={cs.CL}
+}
+```
+
+```bibtex
+@misc{radford2021learning,
+  title={Learning Transferable Visual Models From Natural Language Supervision},
+  author={Alec Radford and Jong Wook Kim and Chris Hallacy and Aditya Ramesh and Gabriel Goh and Sandhini Agarwal and Girish Sastry and Amanda Askell and Pamela Mishkin and Jack Clark and Gretchen Krueger and Ilya Sutskever},
+  year={2021},
+  eprint={2103.00020},
+  archivePrefix={arXiv},
+  primaryClass={cs.CV}
+}
+```
 
 ---
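
For reference, the *Text to Image* demo described in the diff above amounts to a cosine-similarity search over pre-computed image embeddings. The sketch below is a minimal, framework-agnostic version of that lookup step; `encode_text` and `precomputed_image_embeddings` are hypothetical placeholders, not names from the KoCLIP codebase.

```python
import numpy as np

def retrieve(text_embedding: np.ndarray, image_embeddings: np.ndarray, k: int = 5) -> np.ndarray:
    """Return the indices of the k images whose embeddings best match the text query.

    text_embedding: (d,) embedding of the query produced by the text encoder.
    image_embeddings: (n, d) matrix of pre-computed image embeddings.
    """
    # Normalize both sides so the dot product equals cosine similarity.
    q = text_embedding / np.linalg.norm(text_embedding)
    imgs = image_embeddings / np.linalg.norm(image_embeddings, axis=1, keepdims=True)
    scores = imgs @ q                  # (n,) cosine similarity of each image to the query
    return np.argsort(-scores)[:k]     # indices of the top-k matches, best first

# Example call (encode_text is a hypothetical stand-in for the text encoder):
# top_k = retrieve(encode_text("바닷가에서 달리는 강아지"), precomputed_image_embeddings, k=3)
```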
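
The prompting trick from the Findings section can likewise be sketched as a small zero-shot classification routine: each bare label is wrapped in a full-sentence template before being scored against the image embedding. The Korean template string below is only an illustrative placeholder (the project's actual prompt is not shown in this diff), and `encode_text` again stands in for the text encoder.

```python
import numpy as np

def zero_shot_classify(image_embedding, labels, encode_text,
                       template="이것은 {}의 사진이다."):  # placeholder, roughly "This is a photo of a {}."
    """Score an image embedding against templated label sentences.

    Returns a dict mapping each label to a softmax probability.
    """
    # Wrap each bare label in a full-sentence prompt before encoding it.
    prompts = [template.format(label) for label in labels]
    text_embeddings = np.stack([encode_text(p) for p in prompts])  # (n_labels, d)

    # Cosine similarity between the image and each templated prompt.
    img = image_embedding / np.linalg.norm(image_embedding)
    txt = text_embeddings / np.linalg.norm(text_embeddings, axis=1, keepdims=True)
    logits = txt @ img  # (n_labels,)

    # Softmax over the label scores.
    exp = np.exp(logits - logits.max())
    return dict(zip(labels, exp / exp.sum()))

# Example call (image_embedding and encode_text would come from the image and text encoders):
# probs = zero_shot_classify(image_embedding, ["고양이", "강아지", "자동차"], encode_text)
```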