michelecafagna26 committed
Commit 99aa060
Parent(s): 6d1c75e
Update README.md

README.md CHANGED
````diff
@@ -99,6 +99,7 @@ VinVL model finetuned on scenes descriptions:
 pages = "56--72",
 abstract = "Image captioning models tend to describe images in an object-centric way, emphasising visible objects. But image descriptions can also abstract away from objects and describe the type of scene depicted. In this paper, we explore the potential of a state of the art Vision and Language model, VinVL, to caption images at the scene level using (1) a novel dataset which pairs images with both object-centric and scene descriptions. Through (2) an in-depth analysis of the effect of the fine-tuning, we show (3) that a small amount of curated data suffices to generate scene descriptions without losing the capability to identify object-level concepts in the scene; the model acquires a more holistic view of the image compared to when object-centric descriptions are generated. We discuss the parallels between these results and insights from computational and cognitive science research on scene perception.",
 }
+```
 
 HL Dataset paper:
 
@@ -113,9 +114,6 @@ address = {Prague, Czech Republic},
 }
 ```
 
-
-```
-
 Please consider citing the original project and the VinVL paper
 
 ```BibTeX
````