4rtemi5 committed
Commit: 3f830ea
Parents (2): a9e905c, 944631b

Merge branch 'main' of https://huggingface.co/spaces/clip-italian/clip-italian-demo


# Conflicts:
# examples.py
# home.py
# image2text.py
# introduction.md
# text2image.py

Files changed (3)
  1. examples.py +13 -14
  2. introduction.md +12 -9
  3. static/img/table_captions.png +0 -0
examples.py CHANGED
@@ -20,37 +20,36 @@ def app():
     st.markdown("### 1. Actors in Scenes")
     st.markdown("These examples were taken from the CC dataset")
 
-    st.subheader("una coppia")
-    st.markdown("*a couple*")
+    st.subheader("Una coppia")
+    st.markdown("*A couple*")
     st.image("static/img/examples/couple_0.jpeg")
 
     col1, col2 = st.beta_columns(2)
-    col1.subheader("una coppia con il tramonto sullo sfondo")
-    col1.markdown("*a couple with the sunset in the background*")
+    col1.subheader("Una coppia con il tramonto sullo sfondo")
+    col1.markdown("*A couple with the sunset in the background*")
     col1.image("static/img/examples/couple_1.jpeg")
 
-    col2.subheader("una coppia che passeggia sulla spiaggia")
-    col2.markdown("*a couple walking on the beach*")
+    col2.subheader("Una coppia che passeggia sulla spiaggia")
+    col2.markdown("*A couple walking on the beach*")
     col2.image("static/img/examples/couple_2.jpeg")
 
-    st.subheader("una coppia che passeggia sulla spiaggia al tramonto")
-    st.markdown("*a couple walking on the beach at sunset*")
+    st.subheader("Una coppia che passeggia sulla spiaggia al tramonto")
+    st.markdown("*A couple walking on the beach at sunset*")
     st.image("static/img/examples/couple_3.jpeg")
 
     st.markdown("### 2. Dresses")
     st.markdown("These examples were taken from the Unsplash dataset")
 
     col1, col2 = st.beta_columns(2)
-    col1.subheader("un vestito primavrile")
-    col1.markdown("*a dress for the spring*")
+    col1.subheader("Un vestito primaverile")
+    col1.markdown("*A dress for the spring*")
     col1.image("static/img/examples/vestito1.png")
 
-    col2.subheader("un vestito autunnale")
-    col2.markdown("*a dress for the autumn*")
+    col2.subheader("Un vestito autunnale")
+    col2.markdown("*A dress for the autumn*")
     col2.image("static/img/examples/vestito_autunnale.png")
 
-    #st.markdown("## Image Classification")
-    st.markdown("<h2 style='text-align: center; color: #008C45; font-weight:bold;'> Zero Shot Image Classification </h2>", unsafe_allow_html=True)
+    st.markdown("## Image Classification")
     st.markdown("We report this cool example provided by the "
                 "[DALLE-mini team](https://github.com/borisdayma/dalle-mini). "
                 "Is the DALLE-mini logo an *avocado* or an armchair (*poltrona*)?")
introduction.md CHANGED
@@ -36,6 +36,7 @@ different applications that can start from here.
 The original CLIP model was trained on 400 million image-text pairs; this amount of data is currently not available for Italian.
 We indeed worked in a **low-resource setting**. The only datasets for Italian captioning in the literature are MSCOCO-IT (a translated version of MSCOCO) and WIT.
 To get competitive results, we followed three strategies:
+
 1. more and better data;
 2. better augmentations;
 3. better training strategies.
@@ -82,7 +83,7 @@ Each photo comes along with an Italian caption.
 
 Instead of relying on open-source translators, we decided to use DeepL. **Translation quality** of the data was the main
 reason of this choice. With the few images (wrt OpenAI) that we have, we cannot risk polluting our own data. CC is a great resource,
-but the captions have to be handled accordingly. We translated 700K captions and we evaluated their quality:
+but the captions have to be handled accordingly. We translated 700K captions and we evaluated their quality.
 
 Three of us looked at a sample of 100 of the translations and rated them with scores from 1 to 4.
 The meaning of the value is as follows: 1, the sentence has lost is meaning, or it's not possible to understand it; 2, it is possible to get the idea
@@ -97,6 +98,8 @@ weighting - of 0.858 (great agreement!).
 | person walking down the aisle | persona che cammina lungo la navata |
 | popular rides at night at the county fair | giostre popolari di notte alla fiera della contea |
 
+_If the table above doesn't show, you can have a look at it [here](https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/table_captions.png)._
+
 We know that we annotated our own data; in the spirit of fairness we also share the annotations and the captions so
 that those interested can check the quality. The Google Sheet is [here](https://docs.google.com/spreadsheets/d/1m6TkcpJbmJlEygL7SXURIq2w8ZHuVvsmdEuCIH0VENk/edit?usp=sharing).
 
@@ -192,7 +195,7 @@ described by the original caption. As evaluation metrics we use the MRR@K.
 | MRR@5 | **0.5039** | 0.3957|
 | MRR@10 | **0.5204** | 0.4129|
 
-_If the table above does not show, you can have a look at it [here](https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/table_imagenet.png)._
+_If the table above doesn't show, you can have a look at it [here](https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/table_imagenet.png)._
 
 It is true that we used the training set of MSCOCO-IT in training, and this might give us an advantage. However, the original CLIP model was trained
 on 400million images (and some of them might have been from MSCOCO).
@@ -210,7 +213,7 @@ We evaluate the models computing the accuracy at different levels.
 | Accuracy@10 | **52.55** | 42.91 |
 | Accuracy@100 | **81.08** | 67.11 |
 
-_If the table above doesn not show, you can have a look at it [here](https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/table_IR.png)._
+_If the table above doesn't show, you can have a look at it [here](https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/table_IR.png)._
 
 ### Discussion
 
@@ -233,24 +236,24 @@ Look at the following - slightly cherry picked - examples:
 
 ### Colors
 Here's "a yellow flower"
-<img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/fiore_giallo.png" alt="drawing" width="600"/>
+<img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/fiore_giallo.png" alt="drawing" width="500"/>
 
 And here's "a blue flower"
-<img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/fiore_blu.png" alt="drawing" width="600"/>
+<img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/fiore_blu.png" alt="drawing" width="500"/>
 
 ### Counting
 What about "one cat"?
-<img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/gatto.png" alt="drawing" width="600"/>
+<img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/gatto.png" alt="drawing" width="500"/>
 
 And what about "two cats"?
-<img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/due_gatti.png" alt="drawing" width="600"/>
+<img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/due_gatti.png" alt="drawing" width="500"/>
 
 ### Complex Queries
 Have you ever seen "two brown horses"?
-<img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/due_cavalli_marroni.png" alt="drawing" width="600"/>
+<img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/due_cavalli_marroni.png" alt="drawing" width="500"/>
 
 And finally, here's a very nice "cat on a chair"
-<img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/gatto_su_sedia.png" alt="drawing" width="600"/>
+<img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/gatto_su_sedia.png" alt="drawing" width="500"/>
 
 
 # Broader Outlook
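The two retrieval hunks above report MRR@1, MRR@5, and MRR@10. For reference, here is a minimal, self-contained sketch of how MRR@K is typically computed from ranked retrieval results; the function and variable names are illustrative and not taken from the repository.

```python
# Illustrative MRR@K computation (names are not from the clip-italian codebase).
# ranked_ids[q] is the list of image ids returned for query q, best match first;
# relevant_id[q] is the single ground-truth image for that query.
def mrr_at_k(ranked_ids, relevant_id, k):
    total = 0.0
    for q, ranking in enumerate(ranked_ids):
        top_k = ranking[:k]
        if relevant_id[q] in top_k:
            rank = top_k.index(relevant_id[q]) + 1  # 1-based rank of the true image
            total += 1.0 / rank
        # queries whose true image falls outside the top k contribute 0
    return total / len(ranked_ids)

# Toy example: two queries whose true images are "a" and "b"
print(mrr_at_k([["a", "c", "d"], ["c", "b", "d"]], ["a", "b"], k=5))  # (1/1 + 1/2) / 2 = 0.75
```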
static/img/table_captions.png ADDED
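The newly added static/img/table_captions.png is the fallback image for the caption-quality table patched in above. That part of introduction.md describes three annotators scoring 100 translated captions from 1 to 4 and reports a weighted agreement of 0.858. The diff does not show which statistic was used; a quadratic-weighted Cohen's kappa averaged over annotator pairs is one common choice for this kind of ordinal annotation, sketched below with made-up ratings.

```python
# Hypothetical agreement computation for ordinal ratings (1-4). The exact statistic
# used in introduction.md is not shown in this diff; a quadratic-weighted Cohen's
# kappa averaged over annotator pairs is one common option for ordinal scores.
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

# Made-up ratings: three annotators, one score per translated caption.
ratings = {
    "annotator_1": [4, 3, 4, 2, 4],
    "annotator_2": [4, 3, 3, 2, 4],
    "annotator_3": [4, 4, 4, 2, 3],
}

# Average the pairwise quadratic-weighted kappas over all annotator pairs.
pairs = list(combinations(ratings.values(), 2))
kappas = [cohen_kappa_score(a, b, weights="quadratic") for a, b in pairs]
print(sum(kappas) / len(kappas))
```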