import streamlit as st
from PIL import Image

st.title('Text-Image Matching and Animal Classification via CLIP')

st.markdown("## Overview")

st.markdown("### Problem") 

st.markdown("In this project, we use CLIP for two tasks: text-image matching and animal classification.")

st.markdown("As children, we all solved matching exercises like the one in Figure 1, connecting pictures to the right words. The text-image matching task works the same way: we give the model several candidate descriptions and let it identify the most likely one.")

figure1 = Image.open('img1.png')
st.image(figure1, caption='The text-image matching example. (1)')

st.markdown("For animal classification, we use two different datasets. The first contains four kinds of animals: elephant, buffalo, rhino, and zebra. The second adds dog, cat, cow, sheep, chicken, and horse, expanding the set to ten animals. At the same time, the number of images per species is unbalanced. We use CLIP to classify the animals in both settings.")

st.markdown("### Approach")

figure2 = Image.open('img2.png')
st.image(figure2, caption='Figure source: https://github.com/openai/CLIP (2)')

st.markdown("1. During training, each batch consists of N image-text pairs. The N images are passed through the image encoder to obtain N image feature vectors (I1, I2, ..., In), and the N texts are passed through the text encoder to obtain N text feature vectors (T1, T2, ..., Tn). In the resulting N x N similarity matrix, only the pairs on the diagonal match, so CLIP is trained to make the similarity of each matching image-text pair as high as possible while keeping it low for all mismatched combinations (a code sketch follows this list).")

st.markdown("2. We used the CLIP model released by OpenAI for both tasks. CLIP is a zero-shot, multimodal neural network. Zero-shot means the model can handle categories it never saw labeled examples of during training, by comparing an image against text descriptions of the candidate classes. As a multimodal network, CLIP maps images and text into a shared embedding space, which makes it well suited for image-text matching: it can find images that fit a text description and pick the text that best describes an image.")
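
st.markdown("The sketch below illustrates this contrastive objective. It follows the pseudocode in the CLIP paper and is illustrative only, not the exact training code behind the released model:")

st.code('''# A minimal sketch of CLIP's symmetric contrastive loss (illustrative).
# image_features / text_features are the N encoder outputs for one batch.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    # L2-normalize so that dot products become cosine similarities
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # N x N similarity matrix: entry (i, j) compares image i with text j
    logits = image_features @ text_features.t() / temperature

    # Only the diagonal entries are matching pairs
    targets = torch.arange(logits.size(0), device=logits.device)

    # Classify the right text for each image and the right image for each text
    loss_images = F.cross_entropy(logits, targets)
    loss_texts = F.cross_entropy(logits.t(), targets)
    return (loss_images + loss_texts) / 2
''', language='python')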

st.markdown("## Text-Image Matching")

st.markdown("This is our example image for the text-image matching task.")

figure3 = Image.open('Golden_Retriever.jpeg')
st.image(figure3, caption='How would you describe the picture?')
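
st.markdown("Below is a minimal sketch of how an image can be matched against a list of options with the `clip` package from the OpenAI repository; the image path and option list here are placeholders:")

st.code('''import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder image and candidate descriptions
image = preprocess(Image.open("Golden_Retriever.jpeg")).unsqueeze(0).to(device)
options = ["dog", "cat", "rabbit", "squirrel"]
text = clip.tokenize(options).to(device)

with torch.no_grad():
    # Similarity of the image to each candidate option
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

for option, p in zip(options, probs[0]):
    print(f"{option}: {p:.4f}")
''', language='python')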

st.markdown("1. When giving the options [dog, cat, rabbit, squirrel]")

figure4 = Image.open('Result1.png')
st.image(figure4, caption='When giving the options [dog, cat, rabbit, squirrel]')

st.markdown("2. When giving the options [Labrador, Siberian Husky, Boxer, Golden Retriever]")

figure5 = Image.open('Result2.png')
st.image(figure5, caption='When giving the options [Labrador, Siberian Husky, Boxer, Golden Retriever]')

st.markdown("3. When giving the options [Labrador, Siberian Husky, Boxer, German Shepherd]")
figure6 = Image.open('Result3.png')
st.image(figure6, caption='When giving the options [Labrador, Siberian Husky, Boxer, German Shepherd]')

st.markdown("4. When giving the options ['Golden Retriever is running', 'Golden Retriever is playing', 'Golden Retriever is sitting', 'Golden Retriever is sleeping', 'Siberian Husky is running', 'Siberian Husky is playing', 'Siberian Husky is sitting', 'Siberian Husky is sleeping', 'Labrador is running', 'Labrador is playing', 'Labrador is sitting', 'Labrador is sleeping', 'Boxer is running', 'Boxer is playing', 'Boxer is sitting', 'Boxer is sleeping']")
figure7 = Image.open('Result4.png')
st.image(figure7, caption="When giving the options ['Golden Retriever is running', 'Golden Retriever is playing', 'Golden Retriever is sitting', 'Golden Retriever is sleeping', 'Siberian Husky is running', 'Siberian Husky is playing', 'Siberian Husky is sitting', 'Siberian Husky is sleeping', 'Labrador is running', 'Labrador is playing', 'Labrador is sitting', 'Labrador is sleeping', 'Boxer is running', 'Boxer is playing', 'Boxer is sitting', 'Boxer is sleeping']")

st.markdown("## Animal Classification")

st.markdown("1. We first classified four kinds of animals: buffalo, elephant, rhino, and zebra, with 1,000 images per class. Using CLIP on these four classes, the accuracy is 0.9830.")

figure8 = Image.open('conf1.png')
st.image(figure8, caption='The confusion matrix of four kinds of animal classification')

st.markdown("2. To test CLIP on a fine-grained classification task, we added six more animals to the original four. The resulting dataset is unbalanced: some species have more than 4,000 images while others have only 1,000. On these ten classes, the accuracy is 0.9592.")

figure9 = Image.open('conf2.png')
st.image(figure9, caption='The confusion matrix of ten kinds of animal classification')
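
st.markdown("A minimal sketch of how such a zero-shot classification loop can be implemented is shown below. The class list, prompt template, and dataset variables are illustrative placeholders, not the exact experiment code:")

st.code('''import torch
import clip
from PIL import Image
from sklearn.metrics import accuracy_score, confusion_matrix

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

classes = ["buffalo", "elephant", "rhino", "zebra"]
# A simple prompt template; CLIP is sensitive to this wording
text = clip.tokenize([f"a photo of a {c}" for c in classes]).to(device)

with torch.no_grad():
    text_features = model.encode_text(text)
    text_features /= text_features.norm(dim=-1, keepdim=True)

def predict(path):
    # Encode one image and pick the closest class embedding
    image = preprocess(Image.open(path)).unsqueeze(0).to(device)
    with torch.no_grad():
        image_features = model.encode_image(image)
        image_features /= image_features.norm(dim=-1, keepdim=True)
        sims = (image_features @ text_features.T).squeeze(0)
    return classes[sims.argmax().item()]

# image_paths and true_labels are placeholder dataset lists
# preds = [predict(p) for p in image_paths]
# print(accuracy_score(true_labels, preds))
# print(confusion_matrix(true_labels, preds, labels=classes))
''', language='python')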

st.markdown("## Critical Thinking")

st.markdown("1. CLIP's performance depends significantly on class design: the wording and granularity of the candidate categories have a large influence on the final result.")

st.markdown("2. The model can infer attributes such as race, gender, age, and appearance, and it may describe certain groups in derogatory terms (for example, 'short'). Such characterizations carry a risk of discrimination.")

st.markdown("3. The model still requires a set of candidate options; it cannot generate a description directly from the picture. In addition, its accuracy on fine-grained classification did not meet our expectations.")

st.markdown("## References")

st.markdown("1. https://www.kaggle.com/datasets/alessiocorrado99/animals10")

st.markdown("2. https://www.kaggle.com/datasets/ayushv322/animal-classification")

st.markdown("3. https://github.com/OpenAI/CLIP")