gchhablani committed on
Commit
185a893
•
1 Parent(s): d065a7f

Update article

sections/abstract.md CHANGED
@@ -0,0 +1,4 @@
1
+ ## Abstract
2
+ This project is focused on Multilingual Image Captioning. Most existing datasets and models for this task work with English-only image-text pairs. Our intention here is to provide a proof of concept that our CLIP Vision + mBART-50 model, which pairs a pre-trained image encoder with a multilingual text-decoder checkpoint, can be trained to perform reasonably well on this task.
3
+
4
+ Due to the lack of good-quality multilingual data, we translate subsets of the Conceptual 12M dataset into English (no translation needed), French, German, and Spanish using the mBART-50 `one-to-many` model (a minimal translation sketch is shown below). With better-translated captions and hyperparameter tuning, we expect to see higher performance.
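
As a rough illustration of this translation step (not the project's actual pipeline), here is a minimal sketch using the public mBART-50 one-to-many checkpoint from 🤗 Transformers; the example captions, batching, and target-language choice are assumptions for the example.

```python
# Minimal sketch: translating English captions into Spanish with the mBART-50
# one-to-many model (facebook/mbart-large-50-one-to-many-mmt). The batching and
# file handling used for Conceptual 12M in the project itself are not shown here.
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model_name = "facebook/mbart-large-50-one-to-many-mmt"
model = MBartForConditionalGeneration.from_pretrained(model_name)
tokenizer = MBart50TokenizerFast.from_pretrained(model_name, src_lang="en_XX")

captions = ["a dog playing in a park", "a plate of food on a wooden table"]
batch = tokenizer(captions, return_tensors="pt", padding=True)
generated = model.generate(
    **batch,
    forced_bos_token_id=tokenizer.lang_code_to_id["es_XX"],  # target language: Spanish
    num_beams=4,
    max_length=64,
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```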
sections/acknowledgements.md CHANGED
@@ -1,4 +1,4 @@
1
- # Acknowledgements
2
  We'd like to thank [Abheesht Sharma](https://huggingface.co/abheesht) for helping in the discussions in the initial phases. [Luke Melas](https://github.com/lukemelas) helped us get the cleaned CC-12M data on our TPU-VMs and we are very grateful to him.
3
 
4
  This project would not have been possible without the help of [Patrick](https://huggingface.co/patrickvonplaten) and [Suraj](https://huggingface.co/valhalla), who met with us, helped us review our approach, and guided us throughout the project. We especially thank Patrick for going out of his way to allow us extra TPU time so that we could work on this project.
 
1
+ ## Acknowledgements
2
  We'd like to thank [Abheesht Sharma](https://huggingface.co/abheesht) for helping in the discussions in the initial phases. [Luke Melas](https://github.com/lukemelas) helped us get the cleaned CC-12M data on our TPU-VMs and we are very grateful to him.
3
 
4
  This project would not have been possible without the help of [Patrick](https://huggingface.co/patrickvonplaten) and [Suraj](https://huggingface.co/valhalla), who met with us, helped us review our approach, and guided us throughout the project. We especially thank Patrick for going out of his way to allow us extra TPU time so that we could work on this project.
sections/challenges.md CHANGED
@@ -1 +1,10 @@
1
- # Challenges and Technical Difficulties
1
+ ## Challenges and Technical Difficulties
2
+ We faced challenges at every step of the way, despite the 🤗 team having some example scripts and models ready in Flax.
3
+
4
+ - The dataset we used, Conceptual 12M, took 2-3 days to translate using mBART (since we didn't have Marian at the time). The major bottleneck was implementing the translation efficiently. We tried using `mtranslate` first, but it turned out to be too slow, even with multiprocessing.
5
+
6
+ - Translations produced by deep learning models aren't as "perfect" as those from translation APIs like Google and Yandex, which could lead to poorer downstream performance.
7
+
8
+ - We prepared the model and config classes for our model from scratch, basing them on the `CLIP Vision` and `mBART` implementations in Flax. Feeding the ViT embeddings from the vision encoder into the text-decoder side of the model was the major challenge here (a toy sketch of the idea follows this list).
9
+
10
+ - We were only able to get around 1.5 days of training time on TPUs due to the above-mentioned challenges, and we were unable to perform hyperparameter tuning. Our [loss curves from pre-training](https://huggingface.co/flax-community/spanish-image-captioning/tensorboard) show that training hasn't converged, so we expect further improvement in the BLEU scores with more training.
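
To make the model-composition challenge above concrete, here is a conceptual Flax sketch (not the project's actual CLIP Vision + mBART/Marian classes) of how projected CLIP Vision features can stand in for text-encoder outputs in a text decoder's cross-attention; all sizes and the single attention block are illustrative.

```python
# Toy Flax sketch of the CLIP-Vision-encoder + text-decoder idea. This is NOT the
# project's real model class; it only illustrates how image features can be
# projected and consumed via cross-attention in place of text-encoder outputs.
import jax
import jax.numpy as jnp
import flax.linen as nn


class ToyCLIPVisionTextDecoder(nn.Module):
    decoder_hidden_size: int = 1024  # e.g. mBART/Marian d_model
    vocab_size: int = 32000          # illustrative; mBART-50 actually uses ~250k tokens

    @nn.compact
    def __call__(self, image_features, decoder_input_ids):
        # Project CLIP Vision (ViT) features to the decoder's hidden size so they
        # can act as "encoder hidden states" for cross-attention.
        encoder_hidden_states = nn.Dense(self.decoder_hidden_size)(image_features)

        # Embed the (shifted) caption tokens for the decoder.
        token_embeds = nn.Embed(self.vocab_size, self.decoder_hidden_size)(decoder_input_ids)

        # A single cross-attention block stands in for the full decoder stack.
        attended = nn.MultiHeadDotProductAttention(num_heads=16)(
            token_embeds, encoder_hidden_states
        )
        return nn.Dense(self.vocab_size)(attended)  # per-token vocabulary logits


model = ToyCLIPVisionTextDecoder()
image_features = jnp.ones((1, 50, 768))                # (batch, patches + CLS, ViT hidden)
decoder_input_ids = jnp.ones((1, 8), dtype=jnp.int32)  # (batch, target length)
params = model.init(jax.random.PRNGKey(0), image_features, decoder_input_ids)
logits = model.apply(params, image_features, decoder_input_ids)
print(logits.shape)  # (1, 8, 32000)
```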
sections/intro.md CHANGED
@@ -1,5 +1,4 @@
1
-
2
- This demo uses [CLIP-Vision-Marian model checkpoint](https://huggingface.co/flax-community/spanish-image-captioninh/) to predict caption for a given image in Spanish. Training was done using image encoder and text decoder with approximately 2.5 million image-text pairs taken from the [Conceptual 12M dataset](https://github.com/google-research-datasets/conceptual-12m) with captions translated using [Marian](https://huggingface.co/transformers/model_doc/marian.html).
3
 
4
 
5
  For more details, click on `Usage` or `Article` 🤗 below.
 
1
+ This demo uses the [CLIP-Vision-Marian model checkpoint](https://huggingface.co/flax-community/spanish-image-captioning/) to predict a caption for a given image in Spanish. Training was done with a CLIP Vision image encoder and a Marian text decoder on approximately 2.5 million image-text pairs taken from the [Conceptual 12M dataset](https://github.com/google-research-datasets/conceptual-12m), with captions translated using [Marian](https://huggingface.co/transformers/model_doc/marian.html).
 
2
 
3
 
4
  For more details, click on `Usage` or `Article` 🤗 below.
sections/references.md CHANGED
@@ -1,4 +1,4 @@
1
- # References
2
  - [Conceptual 12M Dataset](https://github.com/google-research-datasets/conceptual-12m)
3
 
4
  - [Hybrid CLIP Example](https://github.com/huggingface/transformers/blob/master/src/transformers/models/clip/modeling_flax_clip.py)
 
1
+ ## References
2
  - [Conceptual 12M Dataset](https://github.com/google-research-datasets/conceptual-12m)
3
 
4
  - [Hybrid CLIP Example](https://github.com/huggingface/transformers/blob/master/src/transformers/models/clip/modeling_flax_clip.py)
sections/social_impact.md CHANGED
@@ -1,4 +1,4 @@
1
- # Social Impact
2
  Being able to automatically describe the content of an image using properly formed sentences in any language is a challenging task, but it could have great impact by helping visually impaired people better understand their surroundings.
3
 
4
  Our initial plan was to work with a low-resource language, Marathi. However, the existing translation models do not perform as well for it, and we would have ended up with poor-quality labels, so we did not pursue this further.
 
1
+ ## Social Impact
2
  Being able to automatically describe the content of an image using properly formed sentences in any language is a challenging task, but it could have great impact by helping visually impaired people better understand their surroundings.
3
 
4
  Our initial plan was to work with a low-resource language, Marathi. However, the existing translation models do not perform as well for it, and we would have ended up with poor-quality labels, so we did not pursue this further.
sections/usage.md CHANGED
@@ -0,0 +1,7 @@
1
+ - This demo loads the `FlaxCLIPVisionMarianMT` model present in the `model` directory of this repository. The checkpoint is loaded from `ckpt/ckpt-23999`, which was pre-trained for roughly 24k steps. 100 random validation-set examples are listed in `references.tsv`, with the corresponding images in the `images` directory.
2
+
3
+ - We provide an `English Translation` of the generated caption and the reference captions for users who are not well-acquainted with Spanish. This is done using `mtranslate` to keep things flexible, and it requires an internet connection since it uses the Google Translate API. We will also add the original captions soon.
4
+
5
+ - The sidebar contains generation parameters such as `Number of Beams`, `Top-P`, and `Temperature`, which are used when generating the caption (see the sketch after this list).
6
+
7
+ - Clicking on `Generate Caption` will generate the caption in Spanish.
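
For orientation, here is a minimal sketch of what happens when the button is clicked. `FlaxCLIPVisionMarianMT`, the `ckpt/ckpt-23999` path, and `mtranslate` come from the description above, but the loading API, preprocessing, tokenizer checkpoint, and image path below are assumptions, not the demo's actual code.

```python
# Hypothetical sketch of the demo's caption-generation flow; calls on the custom
# model class are assumptions based on the repository description above.
from PIL import Image
from transformers import CLIPFeatureExtractor, MarianTokenizer
from mtranslate import translate  # English back-translation via the Google Translate API

from model import FlaxCLIPVisionMarianMT  # assumption: exposes an HF-style generate()

feature_extractor = CLIPFeatureExtractor.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-es")  # assumed tokenizer
model = FlaxCLIPVisionMarianMT.from_pretrained("ckpt/ckpt-23999")          # assumed API

image = Image.open("images/example.jpg")  # illustrative path
pixel_values = feature_extractor(images=image, return_tensors="np").pixel_values

# The sidebar parameters map onto standard generate() arguments.
output = model.generate(
    pixel_values,
    num_beams=4,       # "Number of Beams"
    do_sample=True,
    top_p=0.95,        # "Top-P"
    temperature=0.7,   # "Temperature"
    max_length=64,
)
# Assuming a Flax-style generate output with a `sequences` field.
caption_es = tokenizer.batch_decode(output.sequences, skip_special_tokens=True)[0]
caption_en = translate(caption_es, "en")  # the `English Translation` shown in the demo
print(caption_es, "->", caption_en)
```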