gitlost-murali committed on
Commit
51f5bd1
1 Parent(s): c41f111

Update README.md

Files changed (1)
  1. README.md +7 -2
README.md CHANGED
@@ -27,10 +27,15 @@ tags:
 
  # TL;DR
 
- Pix2Struct is an image encoder - text decoder model that is trained on image-text pairs for various tasks, including image captioning and visual question answering. The full list of available models can be found in Table 1 of the paper:
+ ## Details for Pix2Struct-RefExp (based on their [pre-processing](https://github.com/google-research/pix2struct/blob/main/pix2struct/preprocessing/convert_refexp.py))
 
- ![Table 1 - paper](https://s3.amazonaws.com/moonup/production/uploads/1678712985040-62441d1d9fdefb55a0b7d12c.png)
+ -> __Input__: An image with a bounding box drawn on it around a candidate object and a header containing the referring expression (stored in the image feature).
+
+ -> __Output__: A boolean flag (parse feature) indicating whether the candidate object is the correct referent of the referring expression.
 
+ __Pix2Struct__ is an image encoder - text decoder model that is trained on image-text pairs for various tasks, including image captioning and visual question answering. The full list of available models can be found in Table 1 of the paper:
+
+ ![Table 1 - paper](https://s3.amazonaws.com/moonup/production/uploads/1678712985040-62441d1d9fdefb55a0b7d12c.png)
 
  The abstract of the paper states that:
  > Visually-situated language is ubiquitous—sources range from textbooks with diagrams to web pages with images and tables, to mobile apps with buttons and
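
The Input/Output description in the added README lines suggests that each candidate object becomes its own example: the candidate's box is drawn on the image, the referring expression goes into the header, and the label says whether that candidate is the referent. The sketch below illustrates that framing with a toy character grid standing in for the screenshot. It is not the code from `convert_refexp.py`; the names `RefExpExample`, `draw_box`, and `make_examples`, and the `"true"`/`"false"` string encoding of the flag, are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List, Tuple

# (left, top, right, bottom) in inclusive grid coordinates.
Box = Tuple[int, int, int, int]


@dataclass
class RefExpExample:
    """Hypothetical container mirroring the 'image' and 'parse' features."""
    image: List[List[str]]  # toy stand-in for the image with the box drawn on it
    header: str             # referring expression (rendered into the image in the real pipeline)
    parse: str              # assumed "true"/"false" encoding of the boolean flag


def draw_box(grid: List[List[str]], box: Box, mark: str = "#") -> List[List[str]]:
    """Return a copy of the grid with the outline of the box drawn on it."""
    left, top, right, bottom = box
    out = [row[:] for row in grid]  # copy so the source grid is untouched
    for x in range(left, right + 1):
        out[top][x] = mark
        out[bottom][x] = mark
    for y in range(top, bottom + 1):
        out[y][left] = mark
        out[y][right] = mark
    return out


def make_examples(grid: List[List[str]], expression: str,
                  candidates: List[Box], target_index: int) -> List[RefExpExample]:
    """Build one example per candidate; only the target candidate is labelled true."""
    return [
        RefExpExample(
            image=draw_box(grid, box),
            header=expression,
            parse="true" if i == target_index else "false",
        )
        for i, box in enumerate(candidates)
    ]


blank = [["." for _ in range(8)] for _ in range(4)]
examples = make_examples(
    blank, "the search button", [(0, 0, 2, 2), (4, 1, 7, 3)], target_index=1
)
for ex in examples:
    print(ex.header, "->", ex.parse)
    print("\n".join("".join(row) for row in ex.image))
```

At inference time the same framing would mean scoring every candidate box separately and keeping the one the model labels as the referent.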