bhavitvyamalik committed
Commit
cbc5727
1 Parent(s): ff13355

readme for impact

app.py CHANGED
@@ -88,7 +88,7 @@ with st.beta_expander("Article"):
     st.write(read_markdown("caveats.md"))
     # st.write("# Methodology")
     st.image(
-        "./misc/Multilingual-IC.png", caption="Seq2Seq model for Image-text Captioning."
+        "./misc/Multilingual-IC.png"
     )
     st.markdown(read_markdown("pretraining.md"))
     st.write(read_markdown("challenges.md"))
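The change above simply drops the optional `caption` argument from `st.image`, so the figure now renders without a caption underneath. The snippet also assumes a `read_markdown` helper that loads each section's markdown file; a minimal sketch of such a helper (the `sections/` path convention and function body are assumptions for illustration, not the repo's actual code):

```python
from pathlib import Path

def read_markdown(filename: str, base_dir: str = "sections") -> str:
    """Return the raw markdown text of a section file, e.g. 'caveats.md'."""
    # Assumed layout: section files live in a top-level `sections/` directory,
    # matching the files changed in this commit.
    return Path(base_dir, filename).read_text(encoding="utf-8")
```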
sections/challenges.md CHANGED
@@ -1,4 +1,4 @@
-## Challenges and Technical Difficulties
+# Challenges and Technical Difficulties
 Training an image captioning model, and a multilingual one at that, was a daunting task, and we faced challenges at almost every step of the process.
 
 - Dataset: Our initial plan was to translate Conceptual Captions 12M using mTranslate or Yandex, but they turned out to be too slow even with multiprocessing, and poor translations would have hurt the trained image-captioning model. We therefore translated the whole dataset for all languages using MBart50, which took around 3-4 days. Ideally we would have used a translation model trained for each specific language, but at the time no such models were available (Marian models are now available for this).
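The dataset bullet above describes translating the captions with MBart50. A minimal sketch of that kind of batch translation using the public `facebook/mbart-large-50-many-to-many-mmt` checkpoint (the checkpoint choice, language codes, and generation settings are assumptions; the project's actual pipeline may differ):

```python
from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

model_name = "facebook/mbart-large-50-many-to-many-mmt"
tokenizer = MBart50TokenizerFast.from_pretrained(model_name, src_lang="en_XX")
model = MBartForConditionalGeneration.from_pretrained(model_name)

captions = ["a dog catching a frisbee in the park"]  # toy English captions
for target_lang in ["fr_XX", "de_DE", "es_XX"]:
    batch = tokenizer(captions, return_tensors="pt", padding=True)
    generated = model.generate(
        **batch,
        # Force the decoder to start with the target-language token.
        forced_bos_token_id=tokenizer.lang_code_to_id[target_lang],
        max_length=64,
        num_beams=4,
    )
    print(target_lang, tokenizer.batch_decode(generated, skip_special_tokens=True))
```

Even batched on accelerators, running generation like this over millions of captions in several target languages is slow, which is consistent with the 3-4 day figure mentioned above.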
sections/future_scope.md CHANGED
@@ -2,4 +2,4 @@
 We hope to improve this project in the future by using:
 - Better translation options: Translation has a huge impact on how the final model performs. Better translators (e.g. the Google Translate API) and language-specific seq2seq translation models can generate better data, especially for low-resource languages.
 - More training time: We found that training even a single image captioning model takes a lot of compute time, and replicating the run multiplies the training time for the same number of samples.
-- Accessibility: Make the model deployable on hand-held devices to make it more accessible. Currently, our model is too large, so not many people will be able to run it. However, our final goal is to ensure everyone can access it without any computation barriers. JAX has an experimental converter, `jax2tf`, for converting JAX functions to TensorFlow, and we hope to add TFLite support for our model in the future.
+- Accessibility: Make the model deployable on hand-held devices to make it more accessible. Currently, our model is too large to fit on mobile/edge devices, so not many people will be able to run it. However, our final goal is to ensure everyone can access it without any computation barriers. JAX has an experimental converter, `jax2tf`, for converting JAX functions to TensorFlow, and we hope to add TFLite support for our model in the future.
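On the `jax2tf`/TFLite point, here is a minimal sketch of the conversion path (the `generate_caption` stand-in, its input shapes, and the converter flags are assumptions for illustration; the project's real JAX captioning function would be wired in here):

```python
import jax.numpy as jnp
import tensorflow as tf
from jax.experimental import jax2tf

def generate_caption(pixel_values):
    # Stand-in for the real JAX captioning function (image encoder + text decoder);
    # returns dummy token ids just so the sketch is self-contained.
    return jnp.zeros((pixel_values.shape[0], 64), dtype=jnp.int32)

# Wrap the JAX function as a TensorFlow function with a fixed input signature.
# enable_xla=False steers jax2tf toward ops the TFLite converter can handle.
tf_fn = tf.function(
    jax2tf.convert(generate_caption, enable_xla=False),
    input_signature=[tf.TensorSpec([1, 224, 224, 3], tf.float32)],
    autograph=False,
)

# Convert the concrete function to a TFLite flatbuffer.
converter = tf.lite.TFLiteConverter.from_concrete_functions(
    [tf_fn.get_concrete_function()]
)
tflite_model = converter.convert()
with open("captioner.tflite", "wb") as f:
    f.write(tflite_model)
```

In practice the harder part is expressing the full generation loop (beam search, tokenization) in a TFLite-compatible way, which is why on-device support remains future work.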
sections/intro.md CHANGED
@@ -1,6 +1,6 @@
 
-This demo uses the [CLIP-mBART50 model checkpoint](https://huggingface.co/flax-community/multilingual-image-captioning-5M/) to predict a caption for a given image in 4 languages (English, French, German, Spanish). Training used an image encoder and a text decoder on approximately 5 million image-text pairs taken from the [Conceptual 12M dataset](https://github.com/google-research-datasets/conceptual-12m), translated using [MBart](https://huggingface.co/transformers/model_doc/mbart.html).
+This demo uses the [CLIP-mBART50 model checkpoint](https://huggingface.co/flax-community/multilingual-image-captioning-5M/) to predict a caption for a given image in 4 languages (English, French, German, Spanish). Training used an image encoder and a text decoder on approximately 5 million image-text pairs taken from the [Conceptual 12M dataset](https://github.com/google-research-datasets/conceptual-12m), translated using [MBart50](https://huggingface.co/transformers/model_doc/mbart50.html).
 
-The model predicts one out of 3129 classes in English, which can be found [here](https://huggingface.co/spaces/flax-community/Multilingual-VQA/blob/main/answer_reverse_mapping.json), and the translated versions are then provided based on the language chosen as `Answer Language`. The question can be written in any of the following: English, French, German and Spanish.
+New demo coming soon 🤗
 
 For more details, click on `Usage` or `Article` 🤗 below.
sections/social_impact.md CHANGED
@@ -1,4 +1,4 @@
-## Social Impact
+# Social Impact
 Being able to automatically describe the content of an image using properly formed sentences in any language is a challenging task, but it could have great impact by helping visually impaired people better understand their surroundings.
 
-Our plan was to include 4 high-resource and 4 low-resource languages (Marathi, Bengali, Urdu, Telugu) in our training data. However, the existing translation models do not perform as well for these languages, so we would have received poor labels, not to mention a longer training time. We strongly believe that there should be no barriers in accessing
+Our initial plan was to include 4 high-resource and 4 low-resource languages (Marathi, Bengali, Urdu, Telugu) in our training data. However, the existing translation models do not perform as well for these languages, so we would have received poor labels, not to mention a longer training time.