gchhablani committed on
Commit 547e7ab
1 Parent(s): 50811dd

Fix image display issue.

Files changed (3)
  1. app.py +3 -2
  2. sections/abstract.md +2 -2
  3. sections/pretraining.md +10 -0
app.py CHANGED
@@ -110,8 +110,6 @@ if state.image_file is None:
 transformed_image = get_transformed_image(state.image)
 
 new_col1, new_col2 = st.beta_columns([5,5])
-# Display Image
-new_col1.image(state.image, use_column_width="always")
 
 if new_col2.button("Get a random example", help="Get a random example from one of the seeded examples."):
     sample = dummy_data.sample(1).reset_index()
@@ -122,6 +120,9 @@ if new_col2.button("Get a random example", help="Get a random example from one o
     image = plt.imread(image_path)
     state.image = image
 
+# Display Image
+new_col1.image(state.image, use_column_width="always")
+
 # Display Reference Caption
 new_col2.write("**Reference Caption**: " + state.caption)
 new_col2.markdown(
sections/abstract.md CHANGED
@@ -1,4 +1,4 @@
 ## Abstract
-This project is focused on Mutilingual Image Captioning. Most of the existing datasets and models on this task work with English-only image-text pairs. Our intention here is to provide a Proof-of-Concept with our CLIP Vision + mBART-50 model can be trained on multilingual textual checkpoints with pre-trained image encoders and made to perform well enough.
+This project is focused on Spanish Image Captioning. Most of the existing datasets and models on this task work with English-only image-text pairs. Our intention here is to show that a CLIP Vision + Marian model, starting from a pre-trained image encoder and a Spanish-translation text checkpoint, can be trained to perform well enough on this task.
 
-Due to lack of good-quality multilingual data, we translate subsets of the Conceptual 12M dataset into English (no translation needed), French, German and Spanish using the mBART large `one-to-many` model. With better translated captions, and hyperparameter-tuning, we expect to see higher performance.
+Due to the lack of good-quality Spanish data, we translate subsets of the Conceptual 12M dataset into Spanish using the Marian MT `Helsinki-NLP/opus-mt-en-es` model. With better-translated captions and hyperparameter tuning, we expect to see higher performance.
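The translation step described in the updated abstract (and in the pretraining notes below) can be sketched with the standard Hugging Face `transformers` Marian API. This is a minimal illustration of batching English captions through `Helsinki-NLP/opus-mt-en-es`, not the project's actual translation script; the helper name and batch size are illustrative.

```python
# Hedged sketch: translate English Conceptual 12M captions to Spanish with Marian MT.
# Uses the public transformers API; `translate_captions` is an illustrative helper,
# not code from this repository.
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-es"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

def translate_captions(captions, batch_size=32):
    """Translate a list of English captions into Spanish, batch by batch."""
    translations = []
    for start in range(0, len(captions), batch_size):
        batch = captions[start:start + batch_size]
        inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
        generated_ids = model.generate(**inputs)
        translations.extend(tokenizer.batch_decode(generated_ids, skip_special_tokens=True))
    return translations

print(translate_captions(["a dog playing in the park"]))
```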
sections/pretraining.md CHANGED
@@ -0,0 +1,10 @@
+### Pretraining
+We follow an encoder-decoder approach for image captioning, where the image encoder is the CLIP Vision model (a ViT transformer). The pre-training task is image-to-text generation. We shift the input tokens one position to the right using a `<bos>` token to create the decoder inputs, while the original input tokens become the labels. The model is trained on the dataset in an end-to-end fashion.
+
+**Dataset**
+
+The dataset we use for pre-training is a cleaned version of Conceptual 12M. The dataset is downloaded and broken images are removed, which gives us about 10M images. To save time, we use 2.5M of these image-text pairs. We then use the Marian `Helsinki-NLP/opus-mt-en-es` checkpoint to translate the captions into Spanish.
+
+**Model**
+
+The model is shown in the image above. We create a custom model in Flax which integrates the CLIP Vision model as an encoder inside the Marian model. We also use custom configs and modules to accommodate these changes and to allow loading from Marian and CLIP Vision checkpoints. The image is fed to the CLIP Vision encoder, and the shifted token ids are fed to the Marian decoder. We use the `Helsinki-NLP/opus-mt-en-es` and `openai/clip-vit-base-patch32` checkpoints for the Marian and CLIP Vision models, respectively. All our code is available on [GitHub](https://github.com/gchhablani/spanish-image-captioning).
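To make the shifted-input setup described above concrete, here is a minimal sketch of how decoder inputs are typically derived from the label token ids in this kind of encoder-decoder training. The helper name and token ids are illustrative assumptions, not the project's actual Flax module; in the real model the CLIP Vision encoder consumes the image and the Marian decoder cross-attends to its output.

```python
# Hedged sketch: build decoder inputs by shifting the Spanish caption token ids
# one position to the right and prefixing a start token; the unshifted ids are the labels.
import numpy as np

def shift_tokens_right(labels: np.ndarray, pad_token_id: int, decoder_start_token_id: int) -> np.ndarray:
    """Shift label token ids one position to the right to form decoder input ids."""
    shifted = np.zeros_like(labels)
    shifted[:, 1:] = labels[:, :-1]
    shifted[:, 0] = decoder_start_token_id
    # Any -100 positions used to mask the loss are replaced with the real pad id.
    return np.where(shifted == -100, pad_token_id, shifted)

# Illustrative ids only; real values come from the Marian tokenizer.
labels = np.array([[421, 87, 912, 3, 65000]])
decoder_input_ids = shift_tokens_right(labels, pad_token_id=65000, decoder_start_token_id=65000)
# Forward pass (conceptually): CLIP Vision encodes the image into hidden states,
# the Marian decoder consumes decoder_input_ids and cross-attends to those states,
# and the loss compares the decoder logits against `labels`.
```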