bhavitvyamalik committed
Commit 86a4b01 · 1 Parent(s): 652973c

update sections

sections/abstract.md CHANGED
@@ -1,4 +1,4 @@
 ## Abstract
- This project focuses on Multilingual Image Captioning. Most existing datasets and models for this task work with English-only image-text pairs. Our intention here is to provide a Proof-of-Concept that our CLIP Vision + mBART-50 model, which pairs a pre-trained image encoder with a multilingual text checkpoint, can be trained to perform reasonably well on this task.
+ This project focuses on Multilingual Image Captioning. Most existing datasets and models for this task work with English-only image-text pairs. Our intention here is to provide a Proof-of-Concept that our CLIP Vision + mBART-50 model, which pairs a pre-trained image encoder with a multilingual text checkpoint, can be trained to perform reasonably well on this task.

- Due to the lack of good-quality multilingual data, we translate subsets of the Conceptual 12M dataset into English (no translation needed), French, German and Spanish using the mBART large `one-to-many` model. With better-translated captions and hyperparameter tuning, we expect to see higher performance.
+ Due to the lack of good-quality multilingual data, we translate subsets of the Conceptual 12M dataset into English (no translation needed), French, German and Spanish using the MarianMT model for the respective language (see the sketch below). With better-translated captions and hyperparameter tuning, we expect to see higher performance.
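As an illustration of the translation step mentioned above, here is a minimal sketch of translating captions with a MarianMT checkpoint from the `Helsinki-NLP/opus-mt-{src}-{tgt}` family; the checkpoint name and example captions are placeholders, and the project's actual batching and data-loading code is not shown.

```python
# Minimal sketch: translating English captions to German with a MarianMT checkpoint.
# The example captions are made up; the real pipeline batches millions of captions.
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-de"  # one checkpoint per language pair
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

captions = ["a dog playing in the snow", "a plate of fruit on a wooden table"]
batch = tokenizer(captions, return_tensors="pt", padding=True, truncation=True)
translated_ids = model.generate(**batch)
print(tokenizer.batch_decode(translated_ids, skip_special_tokens=True))
```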
sections/challenges.md CHANGED
@@ -1,10 +1,10 @@
 ## Challenges and Technical Difficulties
- Training an image-captioning model, and a multilingual one at that, was a daunting task, and we faced challenges at almost every point of the process.
+ Training an image-captioning model, and a multilingual one at that, was a difficult task, and we faced challenges at almost every point of the process.

- - Dataset: Our initial plan was to translate Conceptual 12M using mTranslate or Yandex, but they turned out to be too slow even with multiprocessing, and poor translations could lead to poor performance of the trained image-captioning model. We therefore translated the whole dataset using mBART-50 for all languages, which took around 3-4 days. An ideal approach would have been to use a model trained for each specific language, but no such models were available at the time (Marian models are now available for this).
+ - Dataset: Our initial plan was to translate Conceptual 12M using mTranslate or Yandex, but they turned out to be too slow even with multiprocessing, and poor translations could lead to poor performance of the trained image-captioning model. We therefore translated the whole dataset using mBART-50 for all languages, which took around 3-4 days. Later, we realised that the mBART captions were not good enough and the model was not converging because of them, so we re-translated the captions with [Marian](https://huggingface.co/transformers/model_doc/marian.html).

 - We prepared the model and config classes for our model from scratch, basing them on the `CLIP model based on ViT-B/32 Image Transformer` and `mBART50` implementations in Flax. The CLIP embeddings had to be used inside the mBART50 embeddings class, which was the major challenge here.

- - RAM issues: Loading and training on the 10M image-caption dataset led to a huge amount of RAM consumption on the TPU host (~200 GB in the first few steps), because of which we had to optimize the script, use less data, and use fewer `num_workers` to avoid the issue.
+ - RAM issues: Loading and training on the 10M image-caption dataset led to a huge amount of RAM consumption on the TPU host (~200 GB in the first few steps), because of which we had to optimize the script, use less data, and use fewer `num_workers` to avoid the issue (see the sketch after this section).

- - We were only able to get around 2 days of training time on TPUs due to the aforementioned challenges, so we were unable to perform hyperparameter tuning. Our [loss curves on the pre-training model](https://huggingface.co/flax-community/multilingual-image-captioning-5M/tensorboard) show that training hasn't converged, and we could see further improvement in the loss.
+ - We were only able to get around 2-3 days of training time on TPUs due to the aforementioned challenges, so we were unable to perform hyperparameter tuning. Our [loss curves on the pre-training model](https://huggingface.co/flax-community/clip-vit-base-patch32_mbart-large-50/tensorboard) show that training hasn't converged, and we could see further improvement in the loss.
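For the RAM issue above, here is a minimal sketch of the kind of data-loading change involved, assuming a PyTorch-style `DataLoader` feeds the training loop; the dataset class, shapes, and numbers are placeholders rather than the project's actual input pipeline.

```python
# Minimal sketch: decode one example at a time and keep num_workers small so the
# TPU host does not hold too many prefetched batches in RAM. All shapes/sizes are
# placeholders.
import numpy as np
from torch.utils.data import DataLoader, Dataset

class CaptionDataset(Dataset):
    """Hypothetical dataset that loads one image-caption pair per __getitem__
    instead of caching the whole dataset in memory."""
    def __len__(self):
        return 1000
    def __getitem__(self, idx):
        pixel_values = np.zeros((3, 224, 224), dtype=np.float32)  # stand-in for a decoded image
        input_ids = np.zeros(64, dtype=np.int64)                  # stand-in for a tokenized caption
        return pixel_values, input_ids

loader = DataLoader(
    CaptionDataset(),
    batch_size=64,
    shuffle=True,
    num_workers=2,   # fewer workers means fewer in-flight decoded batches in host RAM
    drop_last=True,
)
```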
sections/future_scope.md CHANGED
@@ -1,5 +1,5 @@
 ## Future scope of work
 We hope to improve this project in the future in the following ways:
 - Better translation options: Translation has a huge impact on how the end model performs. Better translators (e.g. the Google Translate API) and language-specific seq2seq translation models can generate better data, both for high-resource and low-resource languages.
- - More training time: We found that training the image-captioning model for even a single epoch takes a lot of compute time, and if we want to replicate the setup, the training time goes up manyfold for the same number of samples.
- - Accessibility: Make the model deployable on hand-held devices to make it more accessible. Currently, our model is too large to fit on mobile/edge devices, which means not many people will be able to access it. However, our final goal is to ensure everyone can access it without any computation barriers. JAX has an experimental converter, `jax2tf`, to convert JAX functions to TensorFlow. I hope we'll be able to support TFLite for our model as well in the future.
+ - More training time: We found that training the image-captioning model for even a single epoch takes a lot of compute time, and if we want to replicate the setup, the training time goes up manyfold for the same number of samples.
+ - Accessibility: Make the model deployable on hand-held devices to make it more accessible. Currently, our model is too large to fit on mobile/edge devices, which means not many people will be able to access it. However, our final goal is to ensure everyone can access it without any computation barriers. JAX has an experimental converter, `jax2tf`, to convert JAX functions to TensorFlow (see the sketch below). Hopefully we'll be able to support TFLite for our model as well in the future.
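To make the `jax2tf` idea above concrete, here is a minimal sketch of exporting a JAX function to TFLite. `predict_fn` and `params` are stand-ins for the real captioning model, and a full export would likely need additional converter options (e.g. for ops TFLite does not support natively).

```python
# Minimal sketch: convert a JAX function to TensorFlow with jax2tf, then to TFLite.
# predict_fn/params are placeholders for the real CLIP-mBART captioning model.
import jax.numpy as jnp
import tensorflow as tf
from jax.experimental import jax2tf

def predict_fn(params, pixel_values):
    # Placeholder for the model's apply function.
    return jnp.tanh(pixel_values * params["scale"])

params = {"scale": jnp.float32(0.5)}

tf_fn = tf.function(
    jax2tf.convert(lambda px: predict_fn(params, px)),
    input_signature=[tf.TensorSpec([1, 3, 224, 224], tf.float32)],
    autograph=False,
)

converter = tf.lite.TFLiteConverter.from_concrete_functions([tf_fn.get_concrete_function()])
tflite_model = converter.convert()
with open("captioner.tflite", "wb") as f:
    f.write(tflite_model)
```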
sections/intro.md CHANGED
@@ -1,5 +1,3 @@
- This demo uses the [CLIP-mBART50 model checkpoint](https://huggingface.co/flax-community/multilingual-image-captioning-5M/) to predict a caption for a given image in 4 languages (English, French, German, Spanish). Training was done with an image encoder (CLIP-ViT) and a text decoder (mBART50) on approximately 5 million image-text pairs taken from the [Conceptual 12M dataset](https://github.com/google-research-datasets/conceptual-12m), translated using [MBart50](https://huggingface.co/transformers/model_doc/mbart50.html).
-
- New demo coming soon 🤗
+ This demo uses the [CLIP-mBART50 model checkpoint](https://huggingface.co/flax-community/multilingual-image-captioning-5M/) to predict a caption for a given image in 4 languages (English, French, German, Spanish); see the generation sketch below. Training was done with an image encoder (CLIP-ViT) and a text decoder (mBART50) on approximately 5 million image-text pairs taken from the [Conceptual 12M dataset](https://github.com/google-research-datasets/conceptual-12m), translated using [MarianMT](https://huggingface.co/transformers/model_doc/marian.html).

 For more details, click on `Usage` or `Article` 🤗 below.
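As a rough illustration of how the target language is selected at generation time, here is a sketch using mBART-50 language codes. The `model.generate` call is left commented out because the custom CLIP-mBART class and its loading code live in this repository's `model` directory and are not reproduced here; only the tokenizer usage below is standard `transformers`.

```python
# Minimal sketch: picking the caption language via mBART-50 language codes.
from transformers import MBart50TokenizerFast

tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50")
lang_codes = {"English": "en_XX", "French": "fr_XX", "German": "de_DE", "Spanish": "es_XX"}

# Hypothetical generation call with the project's custom CLIP-mBART model:
# generated_ids = model.generate(
#     pixel_values=pixel_values,  # pre-processed image
#     forced_bos_token_id=tokenizer.lang_code_to_id[lang_codes["French"]],
#     num_beams=4,
#     max_length=64,
# )
# caption = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

print(tokenizer.lang_code_to_id[lang_codes["German"]])  # token id used to force German output
```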
sections/pretraining.md CHANGED
@@ -3,8 +3,19 @@ We follow an encoder-decoder approach for image captioning, where the image enco

 **Dataset**

- The dataset we use for pre-training is a cleaned version of Conceptual 12M. The dataset is downloaded and broken images are removed, which gives us about 10M images. To save time, we use 5M of these image-text pairs. We then use the mBART-50 `mbart-large-50-one-to-many-mmt` checkpoint to translate the dataset into four different languages - English, French, German, and Spanish - keeping approximately 1.25 million examples of each language.
+ The dataset we use for pre-training is a cleaned version of Conceptual 12M. The dataset is downloaded and broken images are removed, which gives us about 10M images. To save time, we use 2.5M of these image-text pairs. We then use the MarianMT `Helsinki-NLP/opus-mt-{src}-{tgt}` checkpoints to translate the dataset into four different languages - English, French, German, and Spanish - keeping approximately 2.5M examples of each language.

 **Model**

- The model is shown in the image above. We create a custom model in Flax which integrates the CLIP Vision model as an encoder inside the mBART model. We also use custom configs and modules in order to accommodate these changes and allow loading from mBART and CLIP Vision checkpoints. The image is fed to the CLIP Vision encoder and the shifted token ids are fed to the mBART decoder. We use the `facebook/mbart-large-50` and `openai/clip-vit-base-patch32` checkpoints for the mBART and CLIP Vision models, respectively. All our code is available on [GitHub](https://github.com/gchhablani/multilingual-image-captioning).
+ The model is shown in the image above. We create a custom model in Flax which integrates the CLIP Vision model as an encoder inside the mBART model. We also use custom configs and modules in order to accommodate these changes and allow loading from mBART and CLIP Vision checkpoints. The image is fed to the CLIP Vision encoder and the shifted token ids are fed to the mBART decoder. We use the `facebook/mbart-large-50` and `openai/clip-vit-base-patch32` checkpoints for the mBART and CLIP Vision models, respectively. All our code is available on [GitHub](https://github.com/gchhablani/multilingual-image-captioning).
+
+ Our model reached an **eval loss of ~2.6** at around 70K steps. Here are the BLEU scores^ for the different languages:
+
+ |Language |BLEU-1|BLEU-2|BLEU-3|BLEU-4|
+ |---------|------|------|------|------|
+ |English  | 0.163| 0.127| 0.100| 0.081|
+ |Spanish  | 0.171| 0.133| 0.114| 0.082|
+ |German   | 0.165| 0.129| 0.103| 0.077|
+ |French   | 0.162| 0.124| 0.104| 0.073|
+
+ ^BLEU scores are reported on a 0-1 scale (a scoring sketch follows after this section).
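For reference, here is a minimal sketch of how BLEU-1 through BLEU-4 can be computed on a 0-1 scale with NLTK; the tokenized sentences are made-up examples and the project's actual evaluation script may differ.

```python
# Minimal sketch: corpus-level BLEU-1..4 on a 0-1 scale with NLTK.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [[["a", "dog", "runs", "on", "the", "beach"]]]         # list of reference captions per example
hypotheses = [["a", "dog", "is", "running", "on", "the", "beach"]]  # one generated caption per example

smooth = SmoothingFunction().method1
for n in range(1, 5):
    # Cumulative n-gram weights: (1,0,0,0) for BLEU-1, (0.5,0.5,0,0) for BLEU-2, etc.
    weights = tuple(1.0 / n for _ in range(n)) + (0.0,) * (4 - n)
    score = corpus_bleu(references, hypotheses, weights=weights, smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.3f}")
```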
sections/usage.md CHANGED
@@ -1,4 +1,4 @@
- - This demo loads the `FlaxCLIPVisionMBartforConditionlGeneration` model present in the `model` directory of this repository. The checkpoint is loaded from `ckpt/ckpt-17499`, which is a pre-trained checkpoint at 17.5k steps. 100 random validation-set examples are present in `references.tsv`, with the corresponding images in the `images` directory.
+ - This demo loads the `FlaxCLIPVisionMBartforConditionlGeneration` model present in the `model` directory of this repository. The checkpoint is loaded from `ckpt/ckpt-49499`, which is a pre-trained checkpoint at 70k steps. 100 random validation-set examples are present in `references.tsv`, with the corresponding images in the `images` directory.

 - We provide an `English Translation` of the generated caption and the reference captions for users who are not well acquainted with the other languages. This is done using `mtranslate` to keep things flexible, and it needs an internet connection as it uses the Google Translate API (see the sketch below). We will also add the original captions soon.

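Here is a minimal sketch of the on-the-fly English translation step described above, assuming `mtranslate` is installed; the German caption is a made-up example, and the call needs an internet connection since it goes through the Google Translate API.

```python
# Minimal sketch: translating a generated (e.g. German) caption to English with mtranslate.
from mtranslate import translate

generated_caption = "ein Hund spielt im Schnee"             # hypothetical German caption
english_caption = translate(generated_caption, "en", "de")  # (text, target_lang, source_lang)
print(english_caption)
```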