bhavitvyamalik committed on
Commit c62e9c5 • 1 Parent(s): 8678313

update sections

sections/challenges.md CHANGED
@@ -5,6 +5,4 @@ We faced challenges at every step of the way, despite having some example script
 
  - The translations with deep learning models aren't as "perfect" as translation APIs like Google and Yandex. This could lead to poor performance.
 
- - We prepared the model and config classes for our model from scratch, basing it on `CLIP Vision` and `mBART` implementations in Flax. The ViT embeddings should be used inside the BERT embeddings class, which was the major challenge here.
-
- - We were only able to get around 1.5 days of training time on TPUs due to above mentioned challenges. We were unable to perform hyperparameter tuning. Our [loss curves on the pre-training model](https://huggingface.co/flax-community/spanish-image-captioning/tensorboard) show that the training hasn't converged, and we could see further improvement in the BLEU scores.
+ - We prepared the model and config classes for our model from scratch, basing it on `CLIP Vision` and `Marian` implementations in Flax; a sketch of the two pretrained halves follows this diff.
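
The added bullet describes composing the `CLIP Vision` and `Marian` Flax implementations into a single encoder-decoder. Below is a minimal sketch of instantiating the two pretrained halves with the standard `transformers` Flax classes; the repository's actual `FlaxCLIPVisionMarianMT` config/model classes (including how CLIP hidden states are routed into Marian's cross-attention) are custom code, and the checkpoint names here are assumptions.

```python
# Sketch only: the two pretrained halves a CLIP-Vision + Marian encoder-decoder
# would wrap. The repo's FlaxCLIPVisionMarianMT class itself is custom code;
# the checkpoint names below are assumptions.
from transformers import FlaxCLIPVisionModel, FlaxMarianMTModel

# Vision encoder: turns an image into a sequence of patch hidden states.
vision_encoder = FlaxCLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")

# Marian seq2seq model: its decoder generates the Spanish caption. A combined
# model feeds the CLIP hidden states to the decoder's cross-attention in place
# of Marian's own text-encoder outputs.
# (Add from_pt=True if the checkpoint only ships PyTorch weights.)
marian = FlaxMarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-en-es")
```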
 
 
sections/intro.md CHANGED
@@ -1,4 +1,4 @@
- This demo uses [CLIP-Vision-Marian model checkpoint](https://huggingface.co/flax-community/spanish-image-captioning/) to predict caption for a given image in Spanish. Training was done using image encoder and text decoder with approximately 2.5 million image-text pairs taken from the [Conceptual 12M dataset](https://github.com/google-research-datasets/conceptual-12m) with captions translated using [Marian](https://huggingface.co/transformers/model_doc/marian.html).
+ This demo uses the [CLIP-Vision-Marian model checkpoint](https://huggingface.co/flax-community/clip-vit-base-patch32_marian-es) to predict a caption for a given image in Spanish. Training was done using an image encoder and a text decoder with approximately 2.5 million image-text pairs taken from the [Conceptual 12M dataset](https://github.com/google-research-datasets/conceptual-12m), with captions translated using [MarianMT English to Spanish](https://huggingface.co/transformers/model_doc/marian.html).
 
 
  For more details, click on `Usage` or `Article` 🤗 below.
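
The caption-translation step described in the added line can be reproduced in outline with the standard `transformers` MarianMT API; the exact checkpoint and the batching used for the ~2.5 million captions are assumptions here.

```python
# Hedged sketch of EN->ES caption translation with MarianMT. The checkpoint
# name is an assumption (a standard Helsinki-NLP English-to-Spanish model).
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-es"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

captions = ["A dog playing in the park."]  # hypothetical English caption
inputs = tokenizer(captions, return_tensors="pt", padding=True)
outputs = model.generate(**inputs)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
# e.g. ['Un perro jugando en el parque.']
```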
sections/social_impact.md CHANGED
@@ -1,4 +1,4 @@
 ## Social Impact
 Being able to automatically describe the content of an image using properly formed sentences in any language is a challenging task, but it could have great impact by helping visually impaired people better understand their surroundings.
 
- Our initial plan was to work with a low-resource language - Marathi. However, the existing translations do not perform as well and we would have received poor labels and hence we did not pursue this further.
+ Our initial plan was to work only with a low-resource language. However, existing translation models do not perform as well on low-resource languages and would have given us poor labels, so we did not pursue this further.
sections/usage.md CHANGED
@@ -1,4 +1,4 @@
- - This demo loads the `FlaxCLIPVisionMarianMT` present in the `model` directory of this repository. The checkpoint is loaded from `ckpt/ckpt-23999` which is pre-trained checkpoint with 24kk steps. 100 random validation set examples are present in the `references.tsv` with respective images in the `images` directory.
+ - This demo loads the `FlaxCLIPVisionMarianMT` model present in the `model` directory of this repository. The checkpoint is loaded from `ckpt/ckpt-23999`, which was pre-trained for 24k steps. 100 random validation-set examples are listed in `references.tsv`, with the corresponding images in the `images` directory.
 
  - We provide `English Translation` of the generated caption and reference captions for users who are not well-acquainted with Spanish. This is done using `mtranslate` to keep things flexible enough and needs internet connection as it uses the Google Translate API. We will also add the original captions soon.
 
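
The `mtranslate`-based English back-translation mentioned in the context line amounts to a single call. A minimal sketch follows; it needs an internet connection, since `mtranslate` calls the Google Translate web endpoint, and the caption string is hypothetical.

```python
# Sketch of the demo's English back-translation via mtranslate.
# Argument order: translate(text, to_language, from_language).
from mtranslate import translate

spanish_caption = "Un perro jugando en el parque."  # hypothetical generated caption
english_caption = translate(spanish_caption, "en", "es")
print(english_caption)  # e.g. "A dog playing in the park."
```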