Spaces: clip-italian/clip-italian-demo

Silvia Terragni committed on Jul 25, 2021

Commit 6c1a3f9 • 2 Parent(s): 47ab34b 35772cf

Merge remote-tracking branch 'origin/main' into main

Files changed (11)

app.py +3 -1
examples.py +15 -1
introduction.md +14 -5
localization.py +178 -0
requirements.txt +2 -1
static/img/examples/child_on_slide.png +0 -0
static/img/examples/due_gatti.png +0 -0
static/img/examples/un_gatto.png +0 -0
static/img/gatto_cane.png +0 -0
static/img/image_to_text.png +0 -0
static/img/text_to_image.png +0 -0

app.py CHANGED

@@ -1,6 +1,7 @@
 import streamlit as st
 import image2text
 import text2image
 import home
 import examples
 from PIL import Image
@@ -9,7 +10,8 @@ PAGES = {
     "Introduction": home,
     "Text to Image": text2image,
     "Image to Text": image2text,
-    "Examples & Applications": examples,
 }
 st.sidebar.title("Explore our CLIP-Italian demo")

 import streamlit as st
 import image2text
 import text2image
+import localization
 import home
 import examples
 from PIL import Image
     "Introduction": home,
     "Text to Image": text2image,
     "Image to Text": image2text,
+    "Localization": localization,
+    "Gallery": examples,
 }
 st.sidebar.title("Explore our CLIP-Italian demo")
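
The `PAGES` dictionary maps a sidebar label to a page module. The rest of `app.py` is not shown in this diff, so the following is only a minimal sketch of how such a mapping is typically dispatched. It assumes each page module exposes an `app()` function (as `examples.py` and `localization.py` do); the stand-in modules and the `render` helper are hypothetical, and in the real app the selected label would presumably come from a Streamlit sidebar widget.

```python
from types import SimpleNamespace

# Hypothetical stand-ins for the page modules; each exposes app(), like examples.py.
home = SimpleNamespace(app=lambda: "home page")
localization = SimpleNamespace(app=lambda: "localization page")

PAGES = {
    "Introduction": home,
    "Localization": localization,
}


def render(selection):
    """Look up the selected page and call its app() entry point."""
    return PAGES[selection].app()


print(render("Localization"))
```

With this layout, adding a page only needs a new import and a new dictionary entry, which is all this commit changes in `app.py`.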

examples.py CHANGED

@@ -3,7 +3,7 @@ import streamlit as st
 def app():
-    st.title("Examples & Applications")
     st.write(
         """
@@ -81,6 +81,20 @@ def app():
     col2.markdown("*A rustic chair*")
     col2.image("static/img/examples/sedia_rustica.jpeg", use_column_width=True)
     st.markdown("## Image Classification")
     st.markdown(
         "We report this cool example provided by the "

 def app():
+    st.title("Gallery")
     st.write(
         """
     col2.markdown("*A rustic chair*")
     col2.image("static/img/examples/sedia_rustica.jpeg", use_column_width=True)
+    st.markdown("## Localization")
+    st.subheader("Un gatto")
+    st.markdown("*A cat*")
+    st.image("static/img/examples/un_gatto.png", use_column_width=True)
+    st.subheader("Due gatti")
+    st.markdown("*Two cats*")
+    st.image("static/img/examples/due_gatti.png", use_column_width=True)
+    st.subheader("Un bambino")
+    st.markdown("*A child*")
+    st.image("static/img/examples/child_on_slide.png", use_column_width=True)
     st.markdown("## Image Classification")
     st.markdown(
         "We report this cool example provided by the "

introduction.md CHANGED

@@ -9,7 +9,7 @@ is built upon the pre-trained [Italian BERT](https://huggingface.co/dbmdz/bert-b
 In building this project we kept in mind the following principles:
-+ **Novel Contributions**: We created a dataset of ~1.4 million Italian image-text pairs (**that we will share with the community**) and, to the best of our knowledge, we trained the best Italian CLIP model currently in existence;
 + **Scientific Validity**: Claims are easy, facts are hard. That's why validation is important to assess the real impact of a model. We thoroughly evaluated our models on two tasks and made the validation reproducible for everybody.
 + **Broader Outlook**: We always kept in mind which are the possible usages and limitations of this model.
@@ -21,14 +21,23 @@ Thank you for this amazing opportunity, we hope you will like the results! :hear
 In this demo, we present two tasks:
-+ *Text to Image*: This task is essentially an image retrieval task. The user is asked to input a string of text and CLIP is going to
 compute the similarity between this string of text with respect to a set of images. The webapp is going to display the images that
 have the highest similarity with the text query.
-+ *Image to Text*: This task is essentially a zero-shot image classification task. The user is asked for an image and for a set of captions/labels and CLIP
 is going to compute the similarity between the image and each label. The webapp is going to display a probability distribution over the captions.
-+ *Examples & Applications*: This page showcases some interesting results we got from the model, we believe that there are
 different applications that can start from here.
 # Novel Contributions
@@ -247,7 +256,7 @@ labels most probably had an impact on the final scores.
 We hereby show some interesting properties of the model. One is its ability to detect colors,
 then there is its (partial) counting ability and finally the ability of understanding more complex queries. You can find
-more examples in the "*Examples & Applications*" section of this demo.
 To our own surprise, many of the answers the model gives make a lot of sense! Note that the model, in this case,
 is searching the right image from a set of 25K images from an Unsplash dataset.

 In building this project we kept in mind the following principles:
++ **Novel Contributions**: We created an impressive dataset of ~1.4 million Italian image-text pairs (**that we will share with the community**) and, to the best of our knowledge, we trained the best Italian CLIP model currently in existence;
 + **Scientific Validity**: Claims are easy, facts are hard. That's why validation is important to assess the real impact of a model. We thoroughly evaluated our models on two tasks and made the validation reproducible for everybody.
 + **Broader Outlook**: We always kept in mind which are the possible usages and limitations of this model.
 In this demo, we present two tasks:
++ **Text to Image**: This task is essentially an image retrieval task. The user is asked to input a string of text and CLIP is going to
 compute the similarity between this string of text with respect to a set of images. The webapp is going to display the images that
 have the highest similarity with the text query.
+<img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/text_to_image.png" alt="drawing" width="95%"/>
++ **Image to Text**: This task is essentially a zero-shot image classification task. The user is asked for an image and for a set of captions/labels and CLIP
 is going to compute the similarity between the image and each label. The webapp is going to display a probability distribution over the captions.
+<img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/image_to_text.png" alt="drawing" width="95%"/>
++ **Localization**: This is a **very cool** feature :sunglasses: and, to the best of our knowledge, it is a novel contribution. We can use CLIP
+to find where "something" (like a "cat") is in an image. The location of the object is computed by masking different areas of the image and looking at how the similarity to the image description changes.
+<img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/gatto_cane.png" alt="drawing" width="95%"/>
++ **Gallery**: This page showcases some interesting results we got from the model, we believe that there are
 different applications that can start from here.
 # Novel Contributions
 We hereby show some interesting properties of the model. One is its ability to detect colors,
 then there is its (partial) counting ability and finally the ability of understanding more complex queries. You can find
+more examples in the "*Gallery*" section of this demo.
 To our own surprise, many of the answers the model gives make a lot of sense! Note that the model, in this case,
 is searching the right image from a set of 25K images from an Unsplash dataset.
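
The *Text to Image* and *Image to Text* tasks described in `introduction.md` both reduce to comparing embeddings. The sketch below illustrates the idea in plain Python with toy 2-d vectors in place of real CLIP outputs. The helper names (`retrieve`, `classify`) and the `scale` factor of 100 are assumptions for illustration, not taken from the demo's code.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity of two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def retrieve(text_emb, image_embs, k=2):
    """Text to Image: indices of the k images most similar to the text."""
    ranked = sorted(
        range(len(image_embs)),
        key=lambda i: cosine(text_emb, image_embs[i]),
        reverse=True,
    )
    return ranked[:k]


def classify(image_emb, label_embs, scale=100.0):
    """Image to Text: softmax over scaled image-label similarities."""
    logits = [scale * cosine(image_emb, e) for e in label_embs]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy embeddings standing in for real CLIP outputs.
images = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
print(retrieve([1.0, 0.1], images))
probs = classify([0.0, 1.0], [[1.0, 0.0], [0.1, 1.0]])
print(probs)
```

Retrieval ranks a fixed image set against one text query, while classification turns one image's similarities to several labels into a probability distribution.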

localization.py ADDED

@@ -0,0 +1,178 @@
+import streamlit as st
+from text2image import get_model, get_tokenizer, get_image_transform
+from utils import text_encoder
+from torchvision import transforms
+from PIL import Image
+from jax import numpy as jnp
+import pandas as pd
+import numpy as np
+import requests
+import psutil
+import time
+import jax
+import gc
+preprocess = transforms.Compose([
+    transforms.ToTensor(),
+    transforms.Normalize((0.48145466, 0.4578275, 0.40821073), (0.26862954, 0.26130258, 0.27577711)),
+])
+def pad_to_square(image, size=224):
+    ratio = float(size) / max(image.size)
+    new_size = tuple([int(x * ratio) for x in image.size])
+    image = image.resize(new_size, Image.LANCZOS)
+    new_image = Image.new("RGB", size=(size, size), color=(128, 128, 128))
+    new_image.paste(image, ((size - new_size[0]) // 2, (size - new_size[1]) // 2))
+    return new_image
+def image_encoder(image, model):
+    image = np.transpose(image, (0, 2, 3, 1))
+    features = model.get_image_features(image)
+    features /= jnp.linalg.norm(features, axis=-1, keepdims=True)
+    return features
+def gen_image_batch(image_url, image_size=224, pixel_size=10):
+    n_pixels = image_size // pixel_size + 1
+    image_batch = []
+    masks = []
+    image_raw = requests.get(image_url, stream=True).raw
+    image = Image.open(image_raw).convert("RGB")
+    image = pad_to_square(image, size=image_size)
+    gray = np.ones_like(image) * 128
+    mask = np.ones_like(image)
+    image_batch.append(image)
+    masks.append(mask)
+    for i in range(0, n_pixels):
+        for j in range(i+1, n_pixels):
+            m = mask.copy()
+            m[:min(i*pixel_size, image_size) + 1, :] = 0
+            m[min(j*pixel_size, image_size) + 1:, :] = 0
+            neg_m = 1 - m
+            image_batch.append(image * m + gray * neg_m)
+            masks.append(m)
+    for i in range(0, n_pixels+1):
+        for j in range(i+1, n_pixels+1):
+            m = mask.copy()
+            m[:, :min(i*pixel_size + 1, image_size)] = 0
+            m[:, min(j*pixel_size + 1, image_size):] = 0
+            neg_m = 1 - m
+            image_batch.append(image * m + gray * neg_m)
+            masks.append(m)
+    return image_batch, masks
+def get_heatmap(image_url, text, pixel_size=10, iterations=3):
+    tokenizer = get_tokenizer()
+    model = get_model()
+    image_size = model.config.vision_config.image_size
+    text_embedding = text_encoder(text, model, tokenizer)
+    images, masks = gen_image_batch(image_url, image_size=image_size, pixel_size=pixel_size)
+    input_image = images[0].copy()
+    images = np.stack([preprocess(image) for image in images], axis=0)
+    image_embeddings = jnp.asarray(image_encoder(images, model))
+    sims = []
+    scores = []
+    mask_val = jnp.zeros_like(masks[0])
+    for e, m in zip(image_embeddings, masks):
+        sim = jnp.matmul(e, text_embedding.T)
+        sims.append(sim)
+        if len(sims) > 1:
+            scores.append(sim * m)
+            mask_val += 1 - m
+    score = jnp.mean(jnp.clip(jnp.array(scores) - sims[0], 0, jnp.inf), axis=0)
+    for i in range(iterations):
+        score = jnp.clip(score - jnp.mean(score), 0, jnp.inf)
+    score = (score - jnp.min(score)) / (jnp.max(score) - jnp.min(score))
+    return np.asarray(score), input_image
+def app():
+    st.title("Zero-Shot Localization")
+    st.markdown(
+        """
+        ### 👋 Ciao!
+        Here you can find an example of zero-shot localization that shows you where in an image the model sees an object.
+        The location of the object is computed by masking different areas of the image and looking at
+        how the similarity to the image description changes. If you want to look at the implementation in detail,
+        you can find it in [this Colab](https://colab.research.google.com/drive/10neENr1DEAFq_GzsLqBDo0gZ50hOhkOr?usp=sharing).
+        On the two parameters: the pixel size defines the resolution of the localization map. A pixel size of 15 means
+        that 15 pixels in the original image will form 1 pixel in the heatmap. The refinement
+        iterations are just a cheap operation to reduce background noise. Too few iterations will leave a lot of noise.
+        Too many will shrink the heatmap too much.
+        🤌 Italian mode on! 🤌
+        For example, try typing "gatto" (cat) or "cane" (dog) in the space for label and click "locate"!
+        """
+    )
+    image_url = st.text_input(
+        "You can input the URL of an image here...",
+        value="https://www.tuttosuigatti.it/files/styles/full_width/public/images/featured/205/cani-e-gatti.jpg?itok=WAAiTGS6",
+    )
+    MAX_ITER = 1
+    col1, col2 = st.columns([3, 1])
+    with col2:
+        pixel_size = st.selectbox(
+            "Pixel Size", options=range(10, 21, 5), index=0
+        )
+        iterations = st.selectbox(
+            "Refinement Steps", options=range(3, 30, 3), index=0
+        )
+        compute = st.button("LOCATE")
+    with col1:
+        caption = st.text_input("Insert label...")
+    if compute:
+        with st.spinner('Waiting for resources...'):
+            sleep_time = 5
+            print('CPU_load', psutil.cpu_percent())
+            while psutil.cpu_percent() > 60:
+                time.sleep(sleep_time)
+        if not caption or not image_url:
+            st.error("Please choose one image and at least one label")
+        else:
+            with st.spinner("Computing... This might take up to a few minutes depending on the current load 😕  \n"
+                            "[Colab Link](https://colab.research.google.com/drive/10neENr1DEAFq_GzsLqBDo0gZ50hOhkOr?usp=sharing)"):
+                heatmap, image = get_heatmap(image_url, caption, pixel_size, iterations)
+                with col1:
+                    st.image(image, use_column_width=True)
+                    st.image(heatmap, use_column_width=True)
+                    st.image(np.asarray(image) / 255.0 * heatmap, use_column_width=True)
+        gc.collect()
+    elif image_url:
+        image_raw = requests.get(image_url, stream=True, ).raw
+        image = Image.open(image_raw).convert("RGB")
+        with col1:
+            st.image(image)

requirements.txt CHANGED

@@ -6,4 +6,5 @@ torchvision
 natsort
 stqdm
 pandas
-requests

 natsort
 stqdm
 pandas
+requests
+psutil

static/img/examples/child_on_slide.png ADDED

static/img/examples/due_gatti.png ADDED

static/img/examples/un_gatto.png ADDED

static/img/gatto_cane.png ADDED

static/img/image_to_text.png ADDED

static/img/text_to_image.png ADDED