michelecafagna26 committed
Commit 6d1c75e
1 Parent(s): 084ecaa

Update README.md

Files changed (1): README.md +13 -12
README.md CHANGED
@@ -100,6 +100,19 @@ VinVL model finetuned on scenes descriptions:
   abstract = "Image captioning models tend to describe images in an object-centric way, emphasising visible objects. But image descriptions can also abstract away from objects and describe the type of scene depicted. In this paper, we explore the potential of a state of the art Vision and Language model, VinVL, to caption images at the scene level using (1) a novel dataset which pairs images with both object-centric and scene descriptions. Through (2) an in-depth analysis of the effect of the fine-tuning, we show (3) that a small amount of curated data suffices to generate scene descriptions without losing the capability to identify object-level concepts in the scene; the model acquires a more holistic view of the image compared to when object-centric descriptions are generated. We discuss the parallels between these results and insights from computational and cognitive science research on scene perception.",
  }
 
+ HL Dataset paper:
+
+ ```BibTeX
+ @inproceedings{cafagna2023hl,
+ title={{HL} {D}ataset: {V}isually-grounded {D}escription of {S}cenes, {A}ctions and
+ {R}ationales},
+ author={Cafagna, Michele and van Deemter, Kees and Gatt, Albert},
+ booktitle={Proceedings of the 16th International Natural Language Generation Conference (INLG'23)},
+ address = {Prague, Czech Republic},
+ year={2023}
+ }
+ ```
+
 
 ```
 
@@ -123,16 +136,4 @@ Please consider citing the original project and the VinVL paper
   pages={5579--5588},
   year={2021}
  }
- ```
- And the HL Dataset paper
-
- ```BibTeX
- @inproceedings{cafagna2023hl,
- title={{HL} {D}ataset: {V}isually-grounded {D}escription of {S}cenes, {A}ctions and
- {R}ationales},
- author={Cafagna, Michele and van Deemter, Kees and Gatt, Albert},
- booktitle={Proceedings of the 16th International Natural Language Generation Conference (INLG'23)},
- address = {Prague, Czech Republic},
- year={2023}
- }
 ```
 