bhavitvyamalik committed
Commit 68fa3ca
1 Parent(s): 312f3f7

add references, blue, bias section

Files changed (3)
  1. app.py +22 -1
  2. sections/pretraining.md +5 -7
  3. sections/references.md +58 -7
app.py CHANGED
@@ -72,7 +72,8 @@ st.write(
)

st.sidebar.title("Generation Parameters")
- max_length = st.sidebar.number_input("Max Length", min_value=16, max_value=128, value=64, step=1, help="The maximum length of sequence to be generated.")
+ # max_length = st.sidebar.number_input("Max Length", min_value=16, max_value=128, value=64, step=1, help="The maximum length of sequence to be generated.")
+ max_length = 64
do_sample = st.sidebar.checkbox("Sample", value=False, help="Sample from the model instead of using beam search.")
top_k = st.sidebar.number_input("Top K", min_value=10, max_value=200, value=50, step=1, help="The number of highest probability vocabulary tokens to keep for top-k-filtering.")
num_beams = st.sidebar.number_input(label="Number of Beams", min_value=2, max_value=10, value=4, step=1, help="Number of beams to be used in beam search.")
@@ -97,6 +98,26 @@ with st.beta_expander("Article"):
    st.markdown(read_markdown("pretraining.md"))
    st.write(read_markdown("challenges.md"))
    st.write(read_markdown("social_impact.md"))
+     st.write(read_markdown("bias.md"))
+
+     col1, col2, col3, col4 = st.beta_columns([0.5,2.5,2.5,0.5])
+     with col2:
+         st.image("./misc/examples/female_dev_1.jpg", width=350, caption = 'German Caption: <PERSON> arbeitet an einem Computer.', use_column_width='always')
+     with col3:
+         st.image("./misc/examples/female_doctor.jpg", width=350, caption = 'English Caption: A portrait of <PERSON>, a doctor who specializes in health care.', use_column_width='always')
+
+     col1, col2, col3, col4 = st.beta_columns([0.5,2.5,2.5,0.5])
+     with col2:
+         st.image("./misc/examples/female_doctor_1.jpg", width=350, caption = 'Spanish Caption: El Dr. <PERSON> es un estudiante de posgrado.', use_column_width='always')
+     with col3:
+         st.image("./misc/examples/women_cricket.jpg", width=350, caption = 'English Caption: <PERSON> of India bats against <PERSON> of Australia during the first Twenty20 match between India and Australia at Indian Bowl Stadium in New Delhi on Friday. - PTI', use_column_width='always')
+
+     col1, col2, col3, col4 = st.beta_columns([0.5,2.5,2.5,0.5])
+     with col2:
+         st.image("./misc/examples/female_dev_2.jpg", width=350, caption = "French Caption: Un écran d'ordinateur avec un écran d'ordinateur ouvert.", use_column_width='always')
+     with col3:
+         st.image("./misc/examples/female_biker_resized.jpg", width=350, caption = 'German Caption: <PERSON> auf dem Motorrad von <PERSON>.', use_column_width='always')
+
    st.write(read_markdown("future_scope.md"))
    st.write(read_markdown("references.md"))
    # st.write(read_markdown("checkpoints.md"))
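
The sidebar hunk above pins `max_length` to 64 and keeps the sampling and beam-search controls. The `generate()` call that consumes these values is outside the hunks shown in this commit; as a rough, hypothetical sketch (the names `model`, `input_ids`, and `tokenizer` are assumptions, not part of the diff), such sidebar values are typically bundled and forwarded like this:

```python
# Hypothetical sketch: the app's actual generation code is not in this diff.
# It only illustrates the usual way Streamlit sidebar values are forwarded to a
# transformers-style generate() call; `model`, `input_ids`, and `tokenizer` are assumed names.
gen_kwargs = {
    "max_length": max_length,  # fixed to 64 by this commit
    "num_beams": num_beams,    # beam width from the sidebar
    "do_sample": do_sample,    # toggle between sampling and beam search
    "top_k": top_k,            # top-k filtering used when sampling
}
# output_ids = model.generate(input_ids, **gen_kwargs).sequences
# captions = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
```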
sections/pretraining.md CHANGED
@@ -9,13 +9,11 @@ The dataset we use for pre-training is a cleaned version of Conceptual 12M. The

The model is shown in the image above. We create a custom model in Flax which integrates the CLIP Vision model as an encoder inside the mBART model. We also use custom configs and modules in order to accommodate these changes and to allow loading from mBART and CLIP Vision checkpoints. The image is fed to the CLIP Vision encoder and the shifted token ids are fed to the mBART decoder. We use the `facebook/mbart-large-50` and `openai/clip-vit-base-patch32` checkpoints for the mBART and CLIP Vision models, respectively. All our code is available on [GitHub](https://github.com/gchhablani/multilingual-image-captioning).

- Our model reached **eval loss of ~2.6** around ~70K steps. Here are the BLEU^ scores for different languages:
+ Our model reached an **eval loss of ~2.6** around ~70K steps. Here are the BLEU scores (out of 1) for different languages:

|Language |BLEU-1|BLEU-2|BLEU-3|BLEU-4|
|--------------------------|------|------|------|------|
- |English | 0.163| 0.127| 0.10 | 0.081|
- |Spanish | 0.171| 0.133| 0.114| 0.082|
- |German | 0.165| 0.129| 0.103| 0.077|
- |French | 0.162| 0.124| 0.104| 0.073|
-
- ^BLEU scores are out of 1
+ |English | 0.13083| 0.08887| 0.06681| 0.04899|
+ |Spanish | 0.15981| 0.09858| 0.06918| 0.04776|
+ |German | 0.14234| 0.09817| 0.07405| 0.0515|
+ |French | 0.13021| 0.08862| 0.06598| 0.04647|
 
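The evaluation script that produced the table is not part of this commit. For readers unfamiliar with the 0-1 scale used above, here is a small, generic sketch of computing BLEU-1 through BLEU-4 with NLTK on toy data (the references and hypotheses are placeholders, not the project's validation set):

```python
# Generic sketch on toy data: one common way to obtain BLEU-1..BLEU-4 in the 0-1 range,
# as reported in the table above. This is not the project's evaluation code.
from nltk.translate.bleu_score import SmoothingFunction, corpus_bleu

references = [[["a", "dog", "runs", "on", "the", "beach"]]]         # one list of references per hypothesis
hypotheses = [["a", "dog", "is", "running", "on", "the", "beach"]]  # tokenized model outputs

smooth = SmoothingFunction().method1  # avoids zero scores on short toy examples
for n in range(1, 5):
    # Uniform weights over 1..n-grams, zero weight on the higher orders.
    weights = tuple(1.0 / n for _ in range(n)) + (0.0,) * (4 - n)
    score = corpus_bleu(references, hypotheses, weights=weights, smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.5f}")
```
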
sections/references.md CHANGED
@@ -1,12 +1,63 @@
## References
- [Conceptual 12M Dataset](https://github.com/google-research-datasets/conceptual-12m)

- [Hybrid CLIP Example](https://github.com/huggingface/transformers/blob/master/src/transformers/models/clip/modeling_flax_clip.py)
+ ```
+ @inproceedings{wolf-etal-2020-transformers,
+     title = "Transformers: State-of-the-Art Natural Language Processing",
+     author = "Thomas Wolf and Lysandre Debut and Victor Sanh and Julien Chaumond and Clement Delangue and Anthony Moi and Pierric Cistac and Tim Rault and Rémi Louf and Morgan Funtowicz and Joe Davison and Sam Shleifer and Patrick von Platen and Clara Ma and Yacine Jernite and Julien Plu and Canwen Xu and Teven Le Scao and Sylvain Gugger and Mariama Drame and Quentin Lhoest and Alexander M. Rush",
+     booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
+     month = oct,
+     year = "2020",
+     address = "Online",
+     publisher = "Association for Computational Linguistics",
+     url = "https://www.aclweb.org/anthology/2020.emnlp-demos.6",
+     pages = "38--45"
+ }
+ ```

- [mBART50 Modeling File](https://github.com/huggingface/transformers/blob/master/src/transformers/models/mbart/modeling_flax_mbart.py)
+ ```
+ @inproceedings{changpinyo2021cc12m,
+     title = {{Conceptual 12M}: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts},
+     author = {Changpinyo, Soravit and Sharma, Piyush and Ding, Nan and Soricut, Radu},
+     booktitle = {CVPR},
+     year = {2021},
+ }
+ ```
+ ```
+ @InProceedings{mariannmt,
+     title = {Marian: Fast Neural Machine Translation in {C++}},
+     author = {Junczys-Dowmunt, Marcin and Grundkiewicz, Roman and
+               Dwojak, Tomasz and Hoang, Hieu and Heafield, Kenneth and
+               Neckermann, Tom and Seide, Frank and Germann, Ulrich and
+               Fikri Aji, Alham and Bogoychev, Nikolay and
+               Martins, Andr\'{e} F. T. and Birch, Alexandra},
+     booktitle = {Proceedings of ACL 2018, System Demonstrations},
+     pages = {116--121},
+     publisher = {Association for Computational Linguistics},
+     year = {2018},
+     month = {July},
+     address = {Melbourne, Australia},
+     url = {http://www.aclweb.org/anthology/P18-4020}
+ }
+ ```

- [CLIP Modeling File](https://github.com/huggingface/transformers/blob/master/src/transformers/models/clip/modeling_flax_clip.py)
+ ```
+ @article{liu2020multilingual,
+     title={Multilingual Denoising Pre-training for Neural Machine Translation},
+     author={Yinhan Liu and Jiatao Gu and Naman Goyal and Xian Li and Sergey Edunov and Marjan Ghazvininejad and Mike Lewis and Luke Zettlemoyer},
+     year={2020},
+     eprint={2001.08210},
+     archivePrefix={arXiv},
+     primaryClass={cs.CL}
+ }
+ ```

- [Hybrid CLIP Training Script](https://github.com/huggingface/transformers/blob/master/examples/research_projects/jax-projects/hybrid_clip/run_hybrid_clip.py)
-
- [Summarization Training Script](https://github.com/huggingface/transformers/blob/master/examples/flax/summarization/run_summarization_flax.py)
+ ```
+ @misc{radford2021learning,
+     title={Learning Transferable Visual Models From Natural Language Supervision},
+     author={Alec Radford and Jong Wook Kim and Chris Hallacy and Aditya Ramesh and Gabriel Goh and Sandhini Agarwal and Girish Sastry and Amanda Askell and Pamela Mishkin and Jack Clark and Gretchen Krueger and Ilya Sutskever},
+     year={2021},
+     eprint={2103.00020},
+     archivePrefix={arXiv},
+     primaryClass={cs.CV}
+ }
+ ```