jaketae committed
Commit a811816 • 1 Parent(s): 48a1fa8

feature: add intro page, cleanup descriptions

Files changed (6)
  1. app.py +4 -2
  2. image2text.py +12 -6
  3. intro.md +32 -0
  4. intro.py +6 -0
  5. text2image.py +2 -10
  6. text2patch.py +4 -2
app.py CHANGED
@@ -1,17 +1,19 @@
 import streamlit as st
 
 import image2text
+import intro
 import text2image
 import text2patch
 
 PAGES = {
+    "Introduction": intro,
     "Text to Image": text2image,
     "Image to Text": image2text,
-    "Patch Importance Ranking": text2patch,
+    "Text to Patch": text2patch,
 }
 
 st.sidebar.title("Navigation")
 model = st.sidebar.selectbox("Choose a model", ["koclip-base", "koclip-large"])
-page = st.sidebar.selectbox("Choose a task", list(PAGES.keys()))
+page = st.sidebar.selectbox("Navigate to...", list(PAGES.keys()))
 
 PAGES[page].app(model)
image2text.py CHANGED
@@ -14,9 +14,9 @@ def app(model_name):
     st.title("Zero-shot Image Classification")
     st.markdown(
         """
-        This demonstration explores capability of KoCLIP in the field of Zero-Shot Prediction. This demo takes a set of image and captions from the user, and predicts the most likely label among the different captions given.
-
-        KoCLIP is a retraining of OpenAI's CLIP model using 82,783 images from [MSCOCO](https://cocodataset.org/#home) dataset and Korean caption annotations. Korean translation of caption annotations were obtained from [AI Hub](https://aihub.or.kr/keti_data_board/visual_intelligence). Base model `koclip` uses `klue/roberta` as text encoder and `openai/clip-vit-base-patch32` as image encoder. Larger model `koclip-large` uses `klue/roberta` as text encoder and bigger `google/vit-large-patch16-224` as image encoder.
+        This demo explores KoCLIP's zero-shot prediction capabilities. The model takes an image and a list of candidate captions from the user and predicts the most likely caption that best describes the given image.
+
+        ---
         """
     )
 
@@ -30,6 +30,7 @@ def app(model_name):
 
     with col2:
         captions_count = st.selectbox("Number of labels", options=range(1, 6), index=2)
+        normalize = st.checkbox("Apply Softmax")
         compute = st.button("Classify")
 
     with col1:
@@ -37,7 +38,7 @@ def app(model_name):
         defaults = ["κ·€μ—¬μš΄ 고양이", "λ©‹μžˆλŠ” 강아지", "ν¬λ™ν¬λ™ν•œ ν–„μŠ€ν„°"]
         for idx in range(captions_count):
             value = defaults[idx] if idx < len(defaults) else ""
-            captions.append(st.text_input(f"Insert label {idx+1}", value=value))
+            captions.append(st.text_input(f"Insert caption {idx+1}", value=value))
 
     if compute:
         if not any([query1, query2]):
@@ -61,8 +62,13 @@ def app(model_name):
            inputs["pixel_values"], axes=[0, 2, 3, 1]
        )
        outputs = model(**inputs)
-        probs = jax.nn.softmax(outputs.logits_per_image, axis=1)
-        chart_data = pd.Series(probs[0], index=captions)
+        if normalize:
+            name = "normalized prob"
+            probs = jax.nn.softmax(outputs.logits_per_image, axis=1)
+        else:
+            name = "cosine sim"
+            probs = outputs.logits_per_image
+        chart_data = pd.Series(probs[0], index=captions, name=name)
 
     col1, col2 = st.beta_columns(2)
     with col1:
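
The `normalize` checkbox added above only changes how scores are displayed: `outputs.logits_per_image` already holds one (temperature-scaled) similarity score per caption, and the softmax merely turns that row into probabilities. A minimal numpy sketch of the same toggle, with illustrative values that are not part of this commit:

```python
import numpy as np

def chart_values(logits_per_image: np.ndarray, normalize: bool) -> np.ndarray:
    """Mirror the demo's toggle: raw similarity scores vs. softmax probabilities."""
    row = logits_per_image[0]        # scores for the single input image
    if not normalize:
        return row                   # "cosine sim": temperature-scaled similarities
    shifted = row - row.max()        # subtract max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()           # "normalized prob": sums to 1 over the captions

print(chart_values(np.array([[21.3, 18.7, 15.2]]), normalize=True))
```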
intro.md ADDED
@@ -0,0 +1,32 @@
+# KoCLIP
+
+KoCLIP is a Korean port of OpenAI's CLIP.
+
+## Models
+
+We trained a total of two models, `koclip-base` and `koclip-large`. Both models use RoBERTa-large, a fairly large language model. This decision was motivated by the intuition that annotated Korean datasets are rare; a well-trained, performant LM would be key to producing a performant multimodal pipeline given limited data.
+
+| KoCLIP         | LM                   | ViT                            |
+|----------------|----------------------|--------------------------------|
+| `koclip-base`  | `klue/roberta-large` | `openai/clip-vit-base-patch32` |
+| `koclip-large` | `klue/roberta-large` | `google/vit-large-patch16-224` |
+
+## Data
+
+KoCLIP was fine-tuned using 82,783 images from the [MSCOCO](https://cocodataset.org/#home) 2014 image captioning dataset. Korean translations of image captions were obtained from [AI Hub](https://aihub.or.kr/keti_data_board/visual_intelligence), an open database maintained by subsidiaries of the Korean Ministry of Science and ICT. Validation metrics were monitored using approximately 40,000 images from the validation set of the aforementioned dataset.
+
+While we also considered alternative multilingual image captioning datasets, notably the Wikipedia-based Image Text (WiT) dataset, we found non-trivial discrepancies in the way captions were curated in WiT and MSCOCO, and eventually decided to train the model on the relatively cleaner captions of MSCOCO instead of introducing more noise.
+
+## Demo
+
+We present three demos, each of which illustrates a different use case of KoCLIP.
+
+* *Image to Text*: This is essentially a zero-shot image classification task. Given an input image, the model finds the most likely caption among the text labels provided.
+* *Text to Image*: This is essentially an image retrieval task. Given a text query, the model looks up a database of pre-computed image embeddings to retrieve the images that best match the given text.
+* *Text to Patch*: This is also a variant of zero-shot image classification. Given a text and an image, the image is partitioned into subsections, and the model ranks them based on their relevance to the text query.
+
+---
+
+We thank the teams at Hugging Face and Google for arranging this wonderful opportunity. It has been a busy yet enormously rewarding week for all of us. We hope you enjoy the demo!
+
+
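
The Models table above pairs a Korean language model with a vision transformer as CLIP-style dual encoders: each maps its input into a shared embedding space, and relevance is scored by cosine similarity between the two embeddings. A minimal, self-contained sketch of that scoring step, using random vectors as stand-ins for the actual encoders (all names and dimensions here are illustrative, not taken from this commit):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 512                                    # illustrative shared embedding size

# Stand-ins for encoder outputs: the text encoder (klue/roberta-large) would
# produce one vector per caption, the image encoder (a ViT) one per image.
text_embeddings = rng.normal(size=(3, dim))  # 3 candidate Korean captions
image_embedding = rng.normal(size=(1, dim))  # 1 input image

def l2_normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Cosine similarity between the image and every caption; in CLIP this is scaled
# by a learned temperature before a softmax, analogous to logits_per_image.
sims = l2_normalize(image_embedding) @ l2_normalize(text_embeddings).T
print(sims.shape, sims)                      # shape (1, 3)
```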
intro.py ADDED
@@ -0,0 +1,6 @@
+import streamlit as st
+
+
+def app(*args):
+    with open("intro.md") as f:
+        st.markdown(f.read())
text2image.py CHANGED
@@ -17,17 +17,9 @@ def app(model_name):
     st.title("Text to Image Search Engine")
     st.markdown(
         """
-        This demonstration explores capability of KoCLIP as a Korean-language Image search engine. Embeddings for each of
-        5000 images from [MSCOCO](https://cocodataset.org/#home) 2017 validation set was generated using trained KoCLIP
-        vision model. They are ranked based on cosine similarity distance from input Text query embeddings and top 10 images
-        are displayed below.
+        This demo explores KoCLIP's use case as a Korean image search engine. We pre-computed embeddings of 5,000 images from the [MSCOCO](https://cocodataset.org/#home) 2017 validation set using KoCLIP's ViT backbone. Given a text query from the user, these image embeddings are ranked by cosine similarity, and the top matches are displayed below.
 
-        KoCLIP is a retraining of OpenAI's CLIP model using 82,783 images from [MSCOCO](https://cocodataset.org/#home) dataset and
-        Korean caption annotations. Korean translation of caption annotations were obtained from [AI Hub](https://aihub.or.kr/keti_data_board/visual_intelligence).
-        Base model `koclip` uses `klue/roberta` as text encoder and `openai/clip-vit-base-patch32` as image encoder.
-        Larger model `koclip-large` uses `klue/roberta` as text encoder and bigger `google/vit-large-patch16-224` as image encoder.
-
-        Example Queries : μ»΄ν“¨ν„°ν•˜λŠ” 고양이(Cat playing on a computer), κΈΈ μœ„μ—μ„œ λ‹¬λ¦¬λŠ” μžλ™μ°¨(Car running on the road)
+        Example Queries: μ»΄ν“¨ν„°ν•˜λŠ” 고양이 (Cat playing on a computer), κΈΈ μœ„μ—μ„œ λ‹¬λ¦¬λŠ” μžλ™μ°¨ (Car on the road)
         """
     )
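
The new description above says image embeddings are pre-computed once and then ranked by cosine similarity against the text query. A minimal sketch of that ranking step, with random arrays standing in for the 5,000 pre-computed MSCOCO embeddings (the function name and shapes are illustrative, not part of text2image.py):

```python
import numpy as np

def rank_images(text_emb: np.ndarray, image_embs: np.ndarray, top_k: int = 10):
    """Return indices of the top_k images by cosine similarity to the text query."""
    text_emb = text_emb / np.linalg.norm(text_emb)
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sims = image_embs @ text_emb             # one cosine similarity per image
    top_idx = np.argsort(-sims)[:top_k]      # highest similarity first
    return top_idx, sims[top_idx]

# Placeholder data: 5,000 image embeddings and one text query embedding.
rng = np.random.default_rng(0)
image_embs = rng.normal(size=(5000, 512))
text_emb = rng.normal(size=512)
print(rank_images(text_emb, image_embs))
```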
 
text2patch.py CHANGED
@@ -25,7 +25,7 @@ def split_image(im, num_rows=3, num_cols=3):
 def app(model_name):
     model, processor = load_model(f"koclip/{model_name}")
 
-    st.title("Patch-based Relevance Retrieval")
+    st.title("Patch-based Relevance Ranking")
     st.markdown(
         """
         Given a piece of text, the CLIP model finds the part of an image that best explains the text.
@@ -37,6 +37,8 @@ def app(model_name):
         which will yield the most relevant image tile from a grid of the image. You can specify how
         granular you want to be with your search by specifying the number of rows and columns that
         make up the image grid.
+
+        ---
         """
     )
 
@@ -46,7 +48,7 @@ def app(model_name):
     )
     query2 = st.file_uploader("or upload an image...", type=["jpg", "jpeg", "png"])
     captions = st.text_input(
-        "Enter query to find most relevant part of image ",
+        "Enter a prompt to query the image.",
         value="이건 μ„œμšΈμ˜ 경볡ꢁ 사진이닀.",
     )
 
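
The hunk header shows that text2patch.py relies on a `split_image(im, num_rows=3, num_cols=3)` helper whose body is not part of this diff. The sketch below is a hypothetical stand-in for that partitioning step, not the repository's implementation; each resulting tile can then be scored against the text query just like a whole image, and the tiles ranked by their logits.

```python
import numpy as np
from PIL import Image

def split_image_sketch(im: Image.Image, num_rows: int = 3, num_cols: int = 3):
    """Partition an image into a num_rows x num_cols grid of tiles (illustrative)."""
    arr = np.asarray(im)
    tile_h = arr.shape[0] // num_rows
    tile_w = arr.shape[1] // num_cols
    tiles = []
    for r in range(num_rows):
        for c in range(num_cols):
            tile = arr[r * tile_h:(r + 1) * tile_h, c * tile_w:(c + 1) * tile_w]
            tiles.append(Image.fromarray(tile))
    return tiles
```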