gitlost-murali committed
Commit: 51f5bd1
Parent(s): c41f111
Update README.md

README.md CHANGED
@@ -27,10 +27,15 @@ tags:
 
 # TL;DR
 
-__Pix2Struct__ is an image encoder - text decoder model that is trained on image-text pairs for various tasks, including image captioning and visual question answering. The full list of available models can be found in Table 1 of the paper:
+## Details for Pix2Struct-RefExp: (Based on their [pre-processing](https://github.com/google-research/pix2struct/blob/main/pix2struct/preprocessing/convert_refexp.py))
 
-![Table 1 - paper](https://s3.amazonaws.com/moonup/production/uploads/1678712985040-62441d1d9fdefb55a0b7d12c.png)
+-> __Input__: An image with a bounding box drawn on it around a candidate object and a header containing the referring expression (stored in the image feature).
+
+-> __Output__: A boolean flag (parse feature) indicating whether the candidate object is the correct referent of the referring expression.
 
+__Pix2Struct__ is an image encoder - text decoder model that is trained on image-text pairs for various tasks, including image captioning and visual question answering. The full list of available models can be found in Table 1 of the paper:
+
+![Table 1 - paper](https://s3.amazonaws.com/moonup/production/uploads/1678712985040-62441d1d9fdefb55a0b7d12c.png)
 
 The abstract of the model states that:
 > Visually-situated language is ubiquitous—sources range from textbooks with diagrams to web pages with images and tables, to mobile apps with buttons and
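To make the Input/Output contract described in the added lines concrete, here is a minimal usage sketch with the Hugging Face `transformers` Pix2Struct classes. It is not part of this commit: the checkpoint name and the image URL are placeholders, and the image is assumed to already contain the candidate bounding box and the referring-expression header, as produced by `convert_refexp.py`.

```python
import requests
from PIL import Image
from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

# Placeholder checkpoint name -- substitute the RefExp checkpoint you are using.
model_name = "google/pix2struct-refexp-base"
processor = Pix2StructProcessor.from_pretrained(model_name)
model = Pix2StructForConditionalGeneration.from_pretrained(model_name)

# Input (per the README): an image that already has the candidate bounding box
# drawn on it and the referring expression rendered as a header.
url = "https://example.com/refexp_candidate.png"  # placeholder image URL
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt")

# Output (per the README): a boolean "parse" -- the decoder generates true/false
# indicating whether the candidate box matches the referring expression.
generated_ids = model.generate(**inputs, max_new_tokens=5)
print(processor.batch_decode(generated_ids, skip_special_tokens=True))
```

Because the referring expression and the candidate box are baked into the image during pre-processing, no separate text prompt is passed to the processor; the decoder simply generates the boolean parse.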