from io import BytesIO
import streamlit as st
import pandas as pd
import json
import os
import numpy as np
from model.flax_clip_vision_bert.modeling_clip_vision_bert import FlaxCLIPVisionBertForSequenceClassification
from utils import get_transformed_image, get_text_attributes, get_top_5_predictions, plotly_express_horizontal_bar_plot, translate_labels
import matplotlib.pyplot as plt
from mtranslate import translate
from PIL import Image


from session import _get_state

state = _get_state()

@st.cache
def load_model(ckpt):
    return FlaxCLIPVisionBertForSequenceClassification.from_pretrained(ckpt)

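def softmax(logits):
    # Subtract the max logit before exponentiating to avoid overflow;
    # the result is mathematically identical to the naive formula.
    exp_logits = np.exp(logits - np.max(logits, axis=0))
    return exp_logits / np.sum(exp_logits, axis=0)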

checkpoints = ['./ckpt/ckpt-60k-5999'] # TODO: Maybe add more checkpoints?
dummy_data = pd.read_csv('dummy_vqa_multilingual.tsv', sep='\t')
with open('answer_reverse_mapping.json') as f:
    answer_reverse_mapping = json.load(f)


st.set_page_config(
    page_title="Multilingual VQA",
    layout="wide",
    initial_sidebar_state="collapsed",
    page_icon="./misc/mvqa-logo.png",
)

st.title("Multilingual Visual Question Answering")



with st.beta_expander("About"):
    st.write("This project focuses on Multilingual Visual Question Answering. Most existing datasets and models for this task work with English-only image-text pairs. Our intention here is to provide a proof of concept with our simple ViT+BERT model, which can be trained from multilingual text checkpoints with pre-trained image encoders and perform well enough. Due to the lack of good-quality multilingual data, we translate subsets of the Conceptual 12M dataset into French, German and Spanish (the English subset is kept as is) using the Marian models. We achieved 0.49 accuracy on the multilingual validation set we created. With better captions and hyperparameter tuning, we expect to see higher performance.")
with st.beta_expander("Method"):
    col1, col2 = st.beta_columns([5,4])
    col1.image("./misc/Multilingual-VQA.png")
    col2.markdown("""
## Pretraining
We follow an approach similar to [VisualBERT](https://arxiv.org/abs/1908.03557). Instead of using a FasterRCNN to get image features, we use a ViT encoder.
The task is text-only MLM (Masked Language Modeling). We mask only the text tokens and try to predict the masked tokens. The VisualBERT authors also use a sentence-image matching task where two captions are matched against an image, but we skip this for the sake of simplicity.
### Dataset
The dataset we use for pre-training is a cleaned version of [Conceptual 12M](https://github.com/google-research-datasets/conceptual-12m). The dataset is downloaded and broken images are removed, which gives us about 10M images. Then we use the MBart50 `mbart-large-50-one-to-many-mmt` checkpoint to translate the dataset into four different languages - English, French, German, and Spanish, keeping 2.5 million examples of each language.
""")

    st.markdown("""
### Model
The model is shown in the image above. We create a custom model in Flax which integrates the ViT model inside BERT embeddings. We also use custom configs and modules in order to accommodate these changes and to allow loading from BERT and ViT checkpoints. The image is fed to the ViT encoder and the text is fed to the word-embedding layers of the BERT model. We use the `bert-base-multilingual-uncased` and `openai/clip-vit-base-patch32` checkpoints for BERT and ViT (actually CLIPVision) models, respectively. All our code is available on [GitHub](https://github.com/gchhablani/multilingual-vqa).
## Fine-tuning

### Dataset
For fine-tuning, we use the [VQA 2.0](https://visualqa.org/) dataset - particularly, the `train` and `validation` sets. We translate all the questions into the four languages specified above using language-specific MarianMT models. We use MarianMT here because its models return better labels and are faster, which makes them better suited for fine-tuning. This gives us 4x the number of examples in each subset.
### Model
We use the `SequenceClassification` model as reference to create our own sequence classification model. 3129 answer labels are chosen, as is the convention for the English VQA task, which can be found [here](https://github.com/gchhablani/multilingual-vqa/blob/main/answer_mapping.json). These are the same labels used in fine-tuning of the VisualBERT models. The outputs shown here have been translated using the [`mtranslate`](https://github.com/mouuff/mtranslate) Google Translate API library. Then we use various pre-trained checkpoints and train the sequence classification model for various steps.

Checkpoints:
- Pre-trained checkpoint: [multilingual-vqa](https://huggingface.co/flax-community/multilingual-vqa)
- Fine-tuned on 45k pretrained checkpoint: [multilingual-vqa-pt-45k-ft](https://huggingface.co/flax-community/multilingual-vqa-pt-45k-ft)
- Fine-tuned on 45k pretrained checkpoint with AdaFactor (others use AdamW): [multilingual-vqa-pt-45k-ft-adf](https://huggingface.co/flax-community/multilingual-vqa-pt-45k-ft-adf)
- Fine-tuned on 60k pretrained checkpoint: [multilingual-vqa-pt-60k-ft](https://huggingface.co/flax-community/multilingual-vqa-pt-60k-ft)
- Fine-tuned on 70k pretrained checkpoint: [multilingual-vqa-pt-70k-ft](https://huggingface.co/flax-community/multilingual-vqa-pt-70k-ft)
- From scratch (without pre-training) model: [multilingual-vqa-ft](https://huggingface.co/flax-community/multilingual-vqa-ft)

**Caveat**: The best fine-tuned model only achieves 0.49 accuracy on the multilingual validation data that we created. This could be due to imperfect translation quality, sub-optimal hyperparameters and insufficient training. In the future, we hope to improve this model by addressing these concerns.
""")

with st.beta_expander("Cherry-Picked Results"):
    pass

with st.beta_expander("Conclusion"):
    pass

with st.beta_expander("Usage"):
    pass

# Init Session State
if state.image_file is None:
    state.image_file = dummy_data.loc[0,'image_file']
    state.question = dummy_data.loc[0,'question'].strip('- ')
    state.answer_label = dummy_data.loc[0,'answer_label']
    state.question_lang_id = dummy_data.loc[0, 'lang_id']
    state.answer_lang_id = dummy_data.loc[0, 'lang_id']
    
    image_path = os.path.join('images',state.image_file)
    image = plt.imread(image_path)
    state.image = image

col1, col2 = st.beta_columns([5,5])

# Display Image
col1.image(state.image, use_column_width='always')

if col2.button('Get a random example'):
    sample = dummy_data.sample(1).reset_index()
    state.image_file = sample.loc[0,'image_file']
    state.question = sample.loc[0,'question'].strip('- ')
    state.answer_label = sample.loc[0,'answer_label']
    state.question_lang_id = sample.loc[0, 'lang_id']
    state.answer_lang_id = sample.loc[0, 'lang_id']

    image_path = os.path.join('images',state.image_file)
    image = plt.imread(image_path)
    state.image = image

# Keep the separator in the same column as the button and the uploader
col2.write("OR")

uploaded_file = col2.file_uploader('Upload your image', type=['png','jpg','jpeg'])
if uploaded_file is not None:
    state.image_file = os.path.join('images/val2014',uploaded_file.name)
    state.image = np.array(Image.open(uploaded_file))


transformed_image = get_transformed_image(state.image)

# Display Question
question = st.text_input(label="Question", value=state.question)
st.markdown(f"""**English Translation**: {question if state.question_lang_id == "en" else translate(question, 'en')}""")
question_inputs = get_text_attributes(question)

# Select Language
options = ['en', 'de', 'es', 'fr']
state.answer_lang_id = st.selectbox('Answer Language', index=options.index(state.answer_lang_id), options=options)
# Display Top-5 Predictions
with st.spinner('Loading model...'):
    model = load_model(checkpoints[0])
with st.spinner('Predicting...'):
    predictions = model(pixel_values = transformed_image, **question_inputs)
logits = np.array(predictions[0][0])
logits = softmax(logits)
labels, values = get_top_5_predictions(logits, answer_reverse_mapping)
translated_labels = translate_labels(labels, state.answer_lang_id)
fig = plotly_express_horizontal_bar_plot(values, translated_labels)
st.plotly_chart(fig, use_container_width = True)