michelecafagna26 committed
Commit 99aa060
Parent(s): 6d1c75e
Update README.md

README.md CHANGED
````diff
@@ -99,6 +99,7 @@ VinVL model finetuned on scenes descriptions:
 pages = "56--72",
 abstract = "Image captioning models tend to describe images in an object-centric way, emphasising visible objects. But image descriptions can also abstract away from objects and describe the type of scene depicted. In this paper, we explore the potential of a state of the art Vision and Language model, VinVL, to caption images at the scene level using (1) a novel dataset which pairs images with both object-centric and scene descriptions. Through (2) an in-depth analysis of the effect of the fine-tuning, we show (3) that a small amount of curated data suffices to generate scene descriptions without losing the capability to identify object-level concepts in the scene; the model acquires a more holistic view of the image compared to when object-centric descriptions are generated. We discuss the parallels between these results and insights from computational and cognitive science research on scene perception.",
 }
+```
 
 HL Dataset paper:
 
@@ -113,9 +114,6 @@ address = {Prague, Czech Republic},
 }
 ```
 
-
-```
-
 Please consider citing the original project and the VinVL paper
 
 ```BibTeX
````