gchhablani committed on
Commit
1519f87
1 Parent(s): 58582da

Update sections

sections/acknowledgements.md CHANGED
@@ -1,4 +1,4 @@
- # Acknowledgements
+ ## Acknowledgements
  We'd like to thank [Abheesht Sharma](https://huggingface.co/abheesht) for helping with the discussions in the initial phases. [Luke Melas](https://github.com/lukemelas) helped us get the cleaned CC-12M data onto our TPU-VMs, and we are very grateful to him.
 
  This project would not have been possible without the help of [Patrick](https://huggingface.co/patrickvonplaten) and [Suraj](https://huggingface.co/valhalla), who met with us, helped us review our approach, and guided us throughout the project. We especially thank Patrick for going out of his way to allow us extra TPU time so that we could work on this project.
sections/challenges.md CHANGED
@@ -1,4 +1,4 @@
- # Challenges and Technical Difficulties
+ ## Challenges and Technical Difficulties
  Training a multilingual image-captioning model was a daunting task, and we faced challenges at almost every step of the process.
 
  - Dataset: Our initial plan was to translate Conceptual 12M using mTranslate or Yandex, but both turned out to be too slow even with multiprocessing, and poor translations would have hurt the performance of the trained image-captioning model. Instead, we translated the whole dataset using mBART-50 for all languages, which took around 3-4 days. Ideally we would have used a separate translation model per target language, but at that time no such models were available for our languages (Marian checkpoints now cover this use case; see the sketch below).
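+
+ As a rough illustration of that per-language alternative, a Marian checkpoint can translate a caption as in the sketch below. This is only an illustration: `Helsinki-NLP/opus-mt-en-de` is one example English-to-German checkpoint from the `transformers` hub, not a model we used in this project.
+
+ ```python
+ from transformers import MarianMTModel, MarianTokenizer
+
+ # One Marian checkpoint per target language (English -> German shown here).
+ model_name = "Helsinki-NLP/opus-mt-en-de"
+ tokenizer = MarianTokenizer.from_pretrained(model_name)
+ model = MarianMTModel.from_pretrained(model_name)
+
+ # Translate a single example caption.
+ batch = tokenizer(["A dog playing in the snow."], return_tensors="pt", padding=True)
+ generated = model.generate(**batch)
+ print(tokenizer.batch_decode(generated, skip_special_tokens=True))
+ ```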
sections/future_scope.md CHANGED
@@ -1,4 +1,4 @@
- # Future scope of work
+ ## Future scope of work
  We hope to improve this project in the future by using:
  - Better translation options: Translation has a huge impact on how the end model performs. Better translators (e.g. the Google Translate API) and language-specific seq2seq translation models can generate better data, especially for low-resource languages.
  - More training time: We found that training even a single image-captioning model takes a lot of compute time, and replicating this for more models multiplies the training time for the same number of samples.
sections/pretraining.md CHANGED
@@ -0,0 +1,10 @@
+ ### Pretraining
+ We follow an encoder-decoder approach for image captioning, where the image encoder is the CLIP Vision model (a ViT transformer). The pre-training task is image-to-text generation. We take the input tokens and shift them to the right using an `<eos>` token to create the decoder inputs, while the original input tokens become the labels. The model is trained on the dataset in an end-to-end fashion.
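+
+ As a rough illustration, the shifting step can be sketched as below. This is a minimal NumPy example of the idea, not the exact function from our codebase, which works on batched, padded arrays inside the Flax data pipeline.
+
+ ```python
+ import numpy as np
+
+ def shift_tokens_right(labels: np.ndarray, eos_token_id: int) -> np.ndarray:
+     """Build decoder inputs by shifting the labels one position to the right.
+
+     The first decoder position is filled with the `<eos>` token, as described
+     above; the unshifted `labels` are what the model is trained to predict.
+     """
+     shifted = np.zeros_like(labels)
+     shifted[:, 1:] = labels[:, :-1]
+     shifted[:, 0] = eos_token_id
+     return shifted
+
+ # Toy example: labels [[5, 6, 7]] become decoder inputs [[2, 5, 6]] if <eos> is 2.
+ labels = np.array([[5, 6, 7]])
+ decoder_input_ids = shift_tokens_right(labels, eos_token_id=2)
+ ```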
+
+ **Dataset**
+
+ The dataset we use for pre-training is a cleaned version of Conceptual 12M. After downloading the dataset and removing broken images, we are left with about 10M images. To save time, we use 5M of these image-text pairs. We then use the mBART-50 `mbart-large-50-one-to-many-mmt` checkpoint to translate the dataset into four languages: English, French, German, and Spanish, keeping approximately 1.25 million examples per language.
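+
+ For illustration, translating a single caption with this checkpoint through `transformers` looks roughly like the snippet below; the full translation job processes millions of captions, so treat this single-example snippet only as a sketch of the API.
+
+ ```python
+ from transformers import MBart50TokenizerFast, MBartForConditionalGeneration
+
+ checkpoint = "facebook/mbart-large-50-one-to-many-mmt"
+ tokenizer = MBart50TokenizerFast.from_pretrained(checkpoint, src_lang="en_XX")
+ model = MBartForConditionalGeneration.from_pretrained(checkpoint)
+
+ caption = "A dog playing in the snow."
+ inputs = tokenizer(caption, return_tensors="pt")
+
+ # Generate French, German, and Spanish versions of the English caption by
+ # forcing the target-language code as the first generated token.
+ for lang in ["fr_XX", "de_DE", "es_XX"]:
+     generated = model.generate(
+         **inputs, forced_bos_token_id=tokenizer.lang_code_to_id[lang]
+     )
+     print(lang, tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
+ ```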
+
+ **Model**
+
+ The model is shown in the image above. We create a custom model in Flax which integrates the CLIP Vision model as an encoder inside the mBART model. We also use custom configs and modules to accommodate these changes and to allow loading from mBART and CLIP Vision checkpoints. The image is fed to the CLIP Vision encoder and the shifted token ids are fed to the mBART decoder. We use the `facebook/mbart-large-50` and `openai/clip-vit-base-patch32` checkpoints for the mBART and CLIP Vision models, respectively. All our code is available on [GitHub](https://github.com/gchhablani/multilingual-image-captioning).
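+
+ To make the wiring concrete, the sketch below loads the two pretrained halves with the `transformers` Flax classes and pushes an image representation through the mBART decoder. It is a simplified, hypothetical illustration, not our actual module: in particular, the `flax.linen.Dense` projection from CLIP's hidden size to mBART's `d_model` is freshly initialised here, whereas the real custom model defines its own config and is trained end to end.
+
+ ```python
+ import jax
+ import jax.numpy as jnp
+ import flax.linen as nn
+ from transformers import FlaxCLIPVisionModel, FlaxMBartForConditionalGeneration
+
+ # Pretrained halves: CLIP ViT-B/32 as the image encoder, mBART-50 as the decoder.
+ clip = FlaxCLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
+ mbart = FlaxMBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50")
+
+ # Encode a dummy image; CLIP ViT-B/32 yields (batch, 50, 768) hidden states.
+ pixel_values = jnp.zeros((1, 3, 224, 224))
+ image_states = clip(pixel_values=pixel_values).last_hidden_state
+
+ # Hypothetical projection from CLIP's hidden size (768) to mBART's d_model (1024),
+ # randomly initialised for this sketch.
+ proj = nn.Dense(mbart.config.d_model)
+ proj_params = proj.init(jax.random.PRNGKey(0), image_states)
+ encoder_states = proj.apply(proj_params, image_states)
+
+ # Feed the projected image features to the mBART decoder as "encoder outputs";
+ # the decoder attends to them through its cross-attention layers.
+ decoder_input_ids = jnp.array([[mbart.config.eos_token_id]])
+ outputs = mbart.decode(decoder_input_ids, encoder_outputs=(encoder_states,))
+ print(outputs.logits.shape)  # (1, 1, vocab_size)
+ ```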
sections/references.md CHANGED
@@ -1,4 +1,4 @@
- # References
+ ## References
  - [Conceptual 12M Dataset](https://github.com/google-research-datasets/conceptual-12m)
 
  - [Hybrid CLIP Example](https://github.com/huggingface/transformers/blob/master/src/transformers/models/clip/modeling_flax_clip.py)
sections/social_impact.md CHANGED
@@ -1,4 +1,4 @@
- # Social Impact
+ ## Social Impact
  Being able to automatically describe the content of an image using properly formed sentences in any language is a challenging task, but it could have a great impact by helping visually impaired people better understand their surroundings.
 
  Our initial plan was to include 4 high-resource and 4 low-resource languages (Marathi, Bengali, Urdu, Telugu) in our training data. However, the existing translations for the low-resource languages do not perform as well, which would have given us poor labels, not to mention a longer training time.