gchhablani committed on
Commit
d1e7790
1 Parent(s): d825f6f

Update sections

Files changed (2)
  1. sections/abstract.md +4 -0
  2. sections/future_scope.md +2 -2
sections/abstract.md CHANGED
@@ -0,0 +1,4 @@
+ # Abstract
+ This project focuses on Multilingual Image Captioning. Most existing datasets and models for this task work with English-only image-text pairs. Our intention here is to provide a proof of concept that our CLIP Vision + mBART-50 model, which combines a pre-trained image encoder with multilingual text checkpoints, can be trained to perform well on this task.
+
+ Due to the lack of good-quality multilingual data, we translate subsets of the Conceptual 12M dataset into English (no translation needed), French, German, and Spanish using the mBART-large `one-to-many` model. With better-translated captions and hyperparameter tuning, we expect to see higher performance.
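
The abstract above describes translating Conceptual 12M captions with the mBART-50 one-to-many model. As a minimal sketch of what that step could look like with the Hugging Face `transformers` checkpoint `facebook/mbart-large-50-one-to-many-mmt` (the example caption, batching, and target-language choice are assumptions for illustration, not the project's actual preprocessing code):

```python
# Illustrative sketch only: translate English captions to French with the
# mBART-50 one-to-many checkpoint; not the project's actual preprocessing code.
from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

model_name = "facebook/mbart-large-50-one-to-many-mmt"
tokenizer = MBart50TokenizerFast.from_pretrained(model_name, src_lang="en_XX")
model = MBartForConditionalGeneration.from_pretrained(model_name)

captions = ["A dog plays with a ball in the park."]  # hypothetical example caption

# Tokenize the English source captions.
inputs = tokenizer(captions, return_tensors="pt", padding=True, truncation=True)

# Force the decoder to start with the French language token; German would use
# "de_DE" and Spanish "es_XX".
generated = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.lang_code_to_id["fr_XX"],
    max_length=64,
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```

In practice, the quality of these machine-translated captions bounds the quality of the downstream captioning model, which is why the future-scope section below lists better translation options first.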
sections/future_scope.md CHANGED
@@ -1,5 +1,5 @@
  ## Future scope of work
  We hope to improve this project in the future by using:
- - Better translating options: Translation has a very huge impact on how the end model would perform. Better translators (for e.g. Google Translate API) and language specific seq2seq models for translation are able to generate better data, especially in low-resource languages.
- - More training time: We found that training image captioning model for a single model takes a lot of compute time and if we want to replicate the same then the training time goes up manifold for the same number of samples.
+ - Better translation options: Translation has a huge impact on how the final model performs. Better translators (e.g. the Google Translate API) and language-specific seq2seq translation models can generate better data, both for high-resource and low-resource languages.
+ - More training time: We found that training the image-captioning model for even a single epoch takes a lot of compute time, and replicating it drives the training time up manyfold for the same number of samples.
  - Accessibility: Make the model deployable on hand-held devices to make it more accessible. Currently, our model is too large to fit on mobile/edge devices, so not many people can use it; our final goal, however, is to ensure everyone can access it without any computation barriers. JAX has an experimental converter, `jax2tf`, that converts JAX functions to TensorFlow, and we hope to support TFLite for our model as well in the future.
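
The accessibility item above mentions `jax2tf` and a possible TFLite path. Below is a minimal sketch, under the assumption that the captioning model's inference can be expressed as a pure JAX function; `apply_fn`, `params`, and the input shape are placeholders for illustration, not names from this repository.

```python
# Minimal sketch (assumptions: `apply_fn(params, pixel_values)` is a pure JAX
# inference function and `params` are its weights; both are placeholders,
# not identifiers from this repository).
import tensorflow as tf
from jax.experimental import jax2tf

def make_tflite_model(apply_fn, params, input_shape):
    # Convert the JAX function into a TensorFlow-compatible callable.
    # enable_xla=False lowers to standard TF ops that TFLite can consume.
    tf_fn = jax2tf.convert(lambda x: apply_fn(params, x), enable_xla=False)

    # Wrap it as a tf.function with a fixed input signature.
    tf_graph = tf.function(
        tf_fn,
        input_signature=[tf.TensorSpec(input_shape, tf.float32)],
        autograph=False,
    )
    concrete_fn = tf_graph.get_concrete_function()

    # Run the standard TFLite converter on the concrete function.
    converter = tf.lite.TFLiteConverter.from_concrete_functions([concrete_fn])
    converter.target_spec.supported_ops = [
        tf.lite.OpsSet.TFLITE_BUILTINS,  # prefer native TFLite ops
        tf.lite.OpsSet.SELECT_TF_OPS,    # fall back to TF ops where needed
    ]
    return converter.convert()
```

The returned bytes are a TFLite flatbuffer that could be written to disk and loaded with the TFLite interpreter; given the model's current size, additional quantization or distillation would likely still be needed before it fits on mobile/edge devices.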