Update README.md
README.md
CHANGED
````diff
@@ -27,8 +27,7 @@ It achieves the following results on the evaluation set:
 
 **Outputs:**
 - **Bounding Boxes:** The model outputs the location for the bounding box coordinates in the form of special <loc[value]> tokens, where value is a number that represents a normalized coordinate. Each detection is represented by four location coordinates in the order y_min, x_min, y_max, x_max, followed by the label that was detected in that box. To convert values to coordinates, you first need to divide the numbers by 1024, then multiply y by the image height and x by its width. This will give you the coordinates of the bounding boxes, relative to the original image size.
-If everything goes smoothly, the model will output a text similar to "<loc[value]><loc[value]><loc[value]><loc[value]> table; <loc[value]><loc[value]><loc[value]><loc[value]> table" depending on the number of tables detected in the image. Then, you can use the following script to convert the text output into PASCAL VOC
-
+If everything goes smoothly, the model will output a text similar to "<loc[value]><loc[value]><loc[value]><loc[value]> table; <loc[value]><loc[value]><loc[value]><loc[value]> table" depending on the number of tables detected in the image. Then, you can use the following script to convert the text output into PASCAL VOC formatted bounding boxes.
 ```python
 import re
 
@@ -42,7 +41,7 @@ def post_process(bbox_text, image_width, image_height):
     loc_values = loc_values[:4]
 
     loc_values = [value/1024 for value in loc_values]
-    #
+    # convert to (xmin, ymin, xmax, ymax)
     loc_values = [
         int(loc_values[1]*image_width), int(loc_values[0]*image_height),
         int(loc_values[3]*image_width), int(loc_values[2]*image_height),
````
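For reference, the conversion described in the hunks above can be sketched as a minimal, self-contained function. Only the last three statements appear in the diff; the regex and the surrounding scaffolding are assumptions about what the full README script looks like, not the script itself.

```python
import re

def post_process(bbox_text, image_width, image_height):
    """Convert one detection's <loc[value]> tokens to a PASCAL VOC box.

    Hypothetical completion of the diff snippet: only the slicing,
    normalization, and reordering lines come from the source.
    """
    # extract the numeric part of each <loc[value]> token, e.g. <loc0256> -> 256
    loc_values = [int(m) for m in re.findall(r"<loc(\d+)>", bbox_text)]
    loc_values = loc_values[:4]  # model order: y_min, x_min, y_max, x_max

    loc_values = [value / 1024 for value in loc_values]  # normalize to [0, 1]

    # convert to (xmin, ymin, xmax, ymax)
    loc_values = [
        int(loc_values[1] * image_width), int(loc_values[0] * image_height),
        int(loc_values[3] * image_width), int(loc_values[2] * image_height),
    ]
    return loc_values

# One detection from a string like "<loc...> table; <loc...> table"
# (multiple detections could be split on "; " and converted one by one):
box = post_process("<loc0256><loc0128><loc0512><loc0896> table", 1024, 768)
print(box)  # [128, 192, 896, 384]
```

The x/y swap in the final list is the whole point of the snippet: the model emits y before x, while PASCAL VOC expects `(xmin, ymin, xmax, ymax)` in pixel units.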