bhavitvyamalik committed
Commit a82b5ff · 1 Parent(s): 35975d8

update README

apps/article.py CHANGED
@@ -53,8 +53,8 @@ def app(state=None):
     toc.header("Challenges and Technical Difficulties")
     st.write(read_markdown("challenges.md"))
 
-    toc.header("Limitations")
-    st.write(read_markdown("limitations.md"))
+    toc.header("Limitations and Biases")
+    st.write(read_markdown("bias.md"))
 
     toc.header("Conclusion, Future Work, and Social Impact")
     toc.subheader("Conclusion")
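
The diff above relies on a `read_markdown` helper and a `toc` object defined elsewhere in the repository. The snippet below is a purely illustrative stand-in for how such helpers are commonly wired up in a Streamlit app; it is not the repo's actual implementation.

```python
# Hypothetical stand-ins for the helpers used in apps/article.py above; the
# real repository's read_markdown() and toc implementation may differ.
from pathlib import Path

import streamlit as st


def read_markdown(filename: str, base_dir: str = "sections") -> str:
    """Load a markdown section such as 'bias.md' from the sections directory."""
    return Path(base_dir, filename).read_text(encoding="utf-8")


class Toc:
    """Renders headers and records them for a sidebar table of contents."""

    def __init__(self) -> None:
        self._entries: list[tuple[int, str]] = []

    def header(self, text: str) -> None:
        self._entries.append((1, text))
        st.header(text)

    def subheader(self, text: str) -> None:
        self._entries.append((2, text))
        st.subheader(text)

    def sidebar(self) -> None:
        # Render collected headers as an indented bullet list in the sidebar.
        for level, text in self._entries:
            st.sidebar.markdown(("  " * (level - 1)) + f"- {text}")
```
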
sections/abstract.md DELETED
@@ -1,4 +0,0 @@
-## Abstract
-This project is focused on Mutilingual Image Captioning. Most of the existing datasets and models on this task work with English-only image-text pairs. Our intention here is to provide a Proof-of-Concept with our CLIP Vision + mBART-50 model can be trained on multilingual textual checkpoints with pre-trained image encoders and made to perform well enough.
-
-Due to lack of good-quality multilingual data, we translate subsets of the Conceptual 12M dataset into English (no translation needed), French, German and Spanish using the MarianMT model belonging to the respective language. With better translated captions, and hyperparameter-tuning, we expect to see higher performance.

sections/bias.md CHANGED
@@ -1,4 +1,3 @@
-## Bias Analysis
 Due to the gender bias in data, gender identification by an image captioning model suffers. Also, the gender-activity bias, owing to the word-by-word prediction, influences other words in the caption prediction, resulting in the well-known problem of label bias.
 
-One of the reasons why we chose Conceptual 12M over COCO captioning dataset for training our Multi-lingual Image Captioning model was that in former all named entities of type Person were substituted by a special token <PERSON>. Because of this, the gendered terms in our captions became quite infrequent. We'll present a few captions from our model to analyse how our model performed on different images on which different pre-trained image captioning model usually gives gender prediction biases
+One of the reasons why we chose Conceptual 12M over the COCO captions dataset for training our multilingual image captioning model was that, in the former, all named entities of type Person were substituted by a special token <PERSON>. Because of this, gendered terms in our captions became quite infrequent. We present a few captions from our model to analyse how it performs on images for which other pre-trained image captioning models usually show gender prediction biases.
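
The <PERSON> substitution described above is a property of the Conceptual 12M preprocessing, not something implemented in this repo; the hypothetical sketch below only illustrates the idea, assuming spaCy's `en_core_web_sm` NER model and a `mask_person_entities` helper of our own naming.

```python
# Illustrative sketch only: Conceptual 12M ships with Person entities already
# replaced; this shows the kind of <PERSON> substitution described above.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed small English NER model

def mask_person_entities(caption: str) -> str:
    """Replace every PERSON entity in a caption with the token <PERSON>."""
    doc = nlp(caption)
    masked = caption
    # Walk entities right to left so earlier character offsets stay valid.
    for ent in reversed(doc.ents):
        if ent.label_ == "PERSON":
            masked = masked[:ent.start_char] + "<PERSON>" + masked[ent.end_char:]
    return masked

print(mask_person_entities("Serena Williams serves during the final."))
# -> "<PERSON> serves during the final."
```
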
sections/conclusion_future_work/conclusion.md CHANGED
@@ -0,0 +1 @@
+In this project, we presented a proof of concept with our CLIP Vision + mBART-50 baseline, which leverages a multilingual checkpoint together with a pre-trained image encoder and supports four languages: **English, French, German, and Spanish**. We intend to extend this project to more languages with better translations and to improve our work based on the observations made.

sections/conclusion_future_work/future_scope.md CHANGED
@@ -1,3 +1,6 @@
 We hope to improve this project in the future by using:
-- Better options for data translation: Translation has a very huge impact on how the end model would perform. Better translators (for e.g. Google Translate API) and language specific seq2seq models for translation are able to generate better data, both for high-resource and low-resource languages.
-- Accessibility: Make model deployable on hand-held devices to make it more accessible. Currently, our model is too large to fit on mobile/edge devices because of which not many will be able to access it. However, our final goal is ensure everyone can access it without any computation barriers. We got to know that JAX has an experimental converter `jax2tf`to convert JAX functions to TF. Hopefully we'll be able to support TFLite for our model as well in future.
+- Superior translation models: Translation has a huge impact on how the final model performs. Better translators (e.g. the Google Translate API) and language-specific seq2seq translation models can generate better data for both high-resource and low-resource languages.
+- Checking translation quality: Inspecting the quality of the translated data is as important as the translation model itself. For this we will either need native speakers to manually inspect a sample of the translated data or devise unsupervised translation-quality metrics.
+- More data: Currently we use only 2.5M images from Conceptual 12M for image captioning. We plan to include other datasets such as Conceptual Captions 3M and a subset of the YFCC100M dataset.
+- Low-resource languages: With better translation tools we also wish to train our model on low-resource languages, which would further democratize image captioning and help people realise the potential of language systems.
+- Accessibility: Making the model deployable on hand-held devices to make it more accessible. Currently, our model is too large to fit on mobile/edge devices, so not many people will be able to access it. However, our final goal is to ensure everyone can access it without any computation barriers. Hopefully we will be able to support TFLite for our model in the future as well (see the sketch after this diff).
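
The removed bullet above mentioned JAX's experimental `jax2tf` converter as a route to TFLite. Below is a minimal sketch of that route, assuming a toy `toy_predict` function and a fixed `[1, 8]` input shape in place of the real captioning model.

```python
# Sketch of the jax2tf -> TFLite route mentioned above; a toy function stands
# in for the real CLIP Vision + mBART-50 model, which is far larger.
import jax.numpy as jnp
import tensorflow as tf
from jax.experimental import jax2tf

def toy_predict(x):
    # Stand-in for the model's apply function (e.g. model.apply(params, x)).
    return jnp.tanh(x) * 2.0

# Wrap the converted function in a tf.function with a fixed input signature.
tf_predict = tf.function(
    jax2tf.convert(toy_predict),
    input_signature=[tf.TensorSpec(shape=[1, 8], dtype=tf.float32)],
    autograph=False,
)

# Convert the concrete function to a TFLite flatbuffer.
converter = tf.lite.TFLiteConverter.from_concrete_functions(
    [tf_predict.get_concrete_function()]
)
tflite_model = converter.convert()

with open("toy_model.tflite", "wb") as f:
    f.write(tflite_model)
```
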
sections/conclusion_future_work/social_impact.md CHANGED
@@ -1,3 +1,5 @@
+Our initial plan was to include 4 high-resource and 4 low-resource languages (Marathi, Bengali, Urdu, Telugu) in our training data. However, the existing translation models do not perform as well for these languages, so we would have received poor labels, not to mention a longer training time.
+
 Being able to automatically describe the content of an image using properly formed sentences in any language is a challenging task, but it could have great impact by helping visually impaired people better understand their surroundings.
 
-Our initial plan was to include 4 high-resource and 4 low-resource languages (Marathi, Bengali, Urdu, Telegu) in our training data. However, the existing translations do not perform as well and we would have received poor labels, not to mention, with a longer training time.
+A slightly longer-term use case would be describing what happens in a video, frame by frame, and a more recent application of the same idea is generating surgical instructions. Since our model is multilingual, such instructions would not be limited to regions where English is spoken; they could also be used where Spanish, French and German are spoken. If we further extend this project to low-resource languages, its impact could be manifold.

sections/intro/intro.md CHANGED
@@ -0,0 +1,3 @@
+This project is focused on Multilingual Image Captioning, a task that has attracted increasing attention in the last decade due to its potential applications. Most of the existing datasets and models for this task work with English-only image-text pairs. Generating captions with proper linguistic properties in different languages is challenging, as it requires an advanced level of image understanding. Our intention here is to provide a proof of concept with our CLIP Vision + mBART-50 baseline, which leverages a multilingual checkpoint together with a pre-trained image encoder. Our model currently supports four languages: **English, French, German, and Spanish**.
+
+Due to the lack of good-quality multilingual data, we translate subsets of the Conceptual 12M dataset into English (no translation needed), French, German and Spanish using the MarianMT model for the respective language. With better-translated captions and hyperparameter tuning, we expect to see higher performance.
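
The translation setup described above can be sketched with the MarianMT checkpoints in Hugging Face Transformers. The snippet assumes the `Helsinki-NLP/opus-mt-en-de` checkpoint for the English-to-German direction and a single toy caption; the French and Spanish directions would use the corresponding `opus-mt-en-*` checkpoints.

```python
# Minimal sketch of caption translation with MarianMT, assuming the
# Helsinki-NLP/opus-mt-en-de checkpoint for English -> German.
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-de"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

captions = ["a person playing guitar on the beach at sunset"]
batch = tokenizer(captions, return_tensors="pt", padding=True)
translated = model.generate(**batch)
print(tokenizer.batch_decode(translated, skip_special_tokens=True))
```
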
sections/references.md CHANGED
@@ -1,3 +1,17 @@
+```
+@inproceedings{NIPS2017_3f5ee243,
+ author = {Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, \L ukasz and Polosukhin, Illia},
+ booktitle = {Advances in Neural Information Processing Systems},
+ editor = {I. Guyon and U. V. Luxburg and S. Bengio and H. Wallach and R. Fergus and S. Vishwanathan and R. Garnett},
+ pages = {},
+ publisher = {Curran Associates, Inc.},
+ title = {Attention is All you Need},
+ url = {https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf},
+ volume = {30},
+ year = {2017}
+}
+```
+
 ```
 @inproceedings{wolf-etal-2020-transformers,
 title = "Transformers: State-of-the-Art Natural Language Processing",