gitlost-murali committed
Commit: 51f5bd1
Parent(s): c41f111
Update README.md

README.md CHANGED
@@ -27,10 +27,15 @@ tags:
 
 # TL;DR
 
-__Pix2Struct__ is an image encoder - text decoder model that is trained on image-text pairs for various tasks, including image captioning and visual question answering. The full list of available models can be found in Table 1 of the paper:
+## Details for Pix2Struct-RefExp: (Based on their [pre-processing](https://github.com/google-research/pix2struct/blob/main/pix2struct/preprocessing/convert_refexp.py))
 
-![Table 1 - paper](https://s3.amazonaws.com/moonup/production/uploads/1678712985040-62441d1d9fdefb55a0b7d12c.png)
+-> __Input__: An image with a bounding box drawn on it around a candidate object and a header containing the referring expression (stored in the image feature).
+
+-> __Output__: A boolean flag (parse feature) indicating whether the candidate object is the correct referent of the referring expression.
 
+__Pix2Struct__ is an image encoder - text decoder model that is trained on image-text pairs for various tasks, including image captioning and visual question answering. The full list of available models can be found in Table 1 of the paper:
+
+![Table 1 - paper](https://s3.amazonaws.com/moonup/production/uploads/1678712985040-62441d1d9fdefb55a0b7d12c.png)
 
 The abstract of the model states that:
 > Visually-situated language is ubiquitous—sources range from textbooks with diagrams to web pages with images and tables, to mobile apps with buttons and
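To make the Input/Output contract described in the added lines concrete, here is a minimal usage sketch with the Hugging Face `transformers` Pix2Struct classes. It is not part of this commit: the checkpoint name and the image URL are placeholders, and the image is assumed to already contain the candidate bounding box and the referring-expression header, as produced by `convert_refexp.py`.

```python
import requests
from PIL import Image
from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

# Placeholder checkpoint name -- substitute the RefExp checkpoint you are using.
model_name = "google/pix2struct-refexp-base"
processor = Pix2StructProcessor.from_pretrained(model_name)
model = Pix2StructForConditionalGeneration.from_pretrained(model_name)

# Input (per the README): an image that already has the candidate bounding box
# drawn on it and the referring expression rendered as a header.
url = "https://example.com/refexp_candidate.png"  # placeholder image URL
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt")

# Output (per the README): a boolean "parse" -- the decoder generates true/false
# indicating whether the candidate box matches the referring expression.
generated_ids = model.generate(**inputs, max_new_tokens=5)
print(processor.batch_decode(generated_ids, skip_special_tokens=True))
```

Because the referring expression and the candidate box are baked into the image during pre-processing, no separate text prompt is passed to the processor; the decoder simply generates the boolean parse.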