michelecafagna26 committed
Commit 6d1c75e
1 Parent(s): 084ecaa

Update README.md

Files changed (1): README.md +13 -12
README.md CHANGED
@@ -100,6 +100,19 @@ VinVL model finetuned on scenes descriptions:
   abstract = "Image captioning models tend to describe images in an object-centric way, emphasising visible objects. But image descriptions can also abstract away from objects and describe the type of scene depicted. In this paper, we explore the potential of a state of the art Vision and Language model, VinVL, to caption images at the scene level using (1) a novel dataset which pairs images with both object-centric and scene descriptions. Through (2) an in-depth analysis of the effect of the fine-tuning, we show (3) that a small amount of curated data suffices to generate scene descriptions without losing the capability to identify object-level concepts in the scene; the model acquires a more holistic view of the image compared to when object-centric descriptions are generated. We discuss the parallels between these results and insights from computational and cognitive science research on scene perception.",
  }
 
+ HL Dataset paper:
+
+ ```BibTeX
+ @inproceedings{cafagna2023hl,
+ title={{HL} {D}ataset: {V}isually-grounded {D}escription of {S}cenes, {A}ctions and
+ {R}ationales},
+ author={Cafagna, Michele and van Deemter, Kees and Gatt, Albert},
+ booktitle={Proceedings of the 16th International Natural Language Generation Conference (INLG'23)},
+ address = {Prague, Czech Republic},
+ year={2023}
+ }
+ ```
+
 
 ```
 
@@ -123,16 +136,4 @@ Please consider citing the original project and the VinVL paper
   pages={5579--5588},
   year={2021}
  }
- ```
- And the HL Dataset paper
-
- ```BibTeX
- @inproceedings{cafagna2023hl,
- title={{HL} {D}ataset: {V}isually-grounded {D}escription of {S}cenes, {A}ctions and
- {R}ationales},
- author={Cafagna, Michele and van Deemter, Kees and Gatt, Albert},
- booktitle={Proceedings of the 16th International Natural Language Generation Conference (INLG'23)},
- address = {Prague, Czech Republic},
- year={2023}
- }
 ```
 