michelecafagna26 committed on
Commit
99aa060
1 Parent(s): 6d1c75e

Update README.md

Files changed (1): README.md (+1 -3)
README.md CHANGED
@@ -99,6 +99,7 @@ VinVL model finetuned on scenes descriptions:
 pages = "56--72",
 abstract = "Image captioning models tend to describe images in an object-centric way, emphasising visible objects. But image descriptions can also abstract away from objects and describe the type of scene depicted. In this paper, we explore the potential of a state of the art Vision and Language model, VinVL, to caption images at the scene level using (1) a novel dataset which pairs images with both object-centric and scene descriptions. Through (2) an in-depth analysis of the effect of the fine-tuning, we show (3) that a small amount of curated data suffices to generate scene descriptions without losing the capability to identify object-level concepts in the scene; the model acquires a more holistic view of the image compared to when object-centric descriptions are generated. We discuss the parallels between these results and insights from computational and cognitive science research on scene perception.",
 }
+```
 
 HL Dataset paper:
 
@@ -113,9 +114,6 @@ address = {Prague, Czech Republic},
 }
 ```
 
-
-```
-
 Please consider citing the original project and the VinVL paper
 
 ```BibTeX