
# SoM-Bench: Evaluating Visual Grounding with Visual Prompting

We build a new benchmark called SoM-Bench to evaluate the visual grounding capability of LLMs with visual prompting.

## Dataset

| Vision Task | Source | #Images | #Instances | Marks | Metric | Data |
|---|---|---|---|---|---|---|
| Open-Vocab Segmentation | COCO | 100 | 567 | Numeric IDs and Masks | Precision | Download |
| Open-Vocab Segmentation | ADE20K | 100 | 488 | Numeric IDs and Masks | Precision | Download |
| Phrase Grounding | Flickr30K | 100 | 274 | Numeric IDs, Masks, and Boxes | Recall @ 1 | Download |
| Referring Comprehension | RefCOCO | 100 | 177 | Numeric IDs and Masks | ACC @ 0.5 | Download |
| Referring Segmentation | RefCOCO | 100 | 177 | Numeric IDs and Masks | mIoU | Download |

## Dataset Structure

### Open-Vocab Segmentation on COCO

We provide COCO in the following structure:

coco_ovseg
β”œβ”€β”€ som_images
    β”œβ”€β”€ 000000000285_0.jpg
    β”œβ”€β”€ 000000000872_0.jpg
    β”œβ”€β”€ 000000000872_5.jpg
    β”œβ”€β”€ ...
    β”œβ”€β”€ 000000002153_5.jpg
    └── 000000002261_0.jpg

For some samples, the regions are very dense, so we split the regions into groups of 5. For example, 000000000872_0.jpg contains the first 5 regions and 000000000872_5.jpg contains the next 5. Note that you can use the image_id to track the original image.
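Because a densely annotated image is split across several files, downstream evaluation needs to regroup the files by their original image. A minimal sketch of that regrouping, assuming the `<image_id>_<offset>.jpg` naming shown in the tree above (the helper name is ours, not part of the dataset):

```python
from collections import defaultdict
from pathlib import Path

def group_by_image_id(filenames):
    """Group split SoM images (e.g. 000000000872_0.jpg and
    000000000872_5.jpg) back under their original image_id.

    The suffix after the underscore is the index of the first
    region in that group (groups of 5 regions)."""
    groups = defaultdict(list)
    for name in filenames:
        stem = Path(name).stem                 # e.g. "000000000872_5"
        image_id, offset = stem.rsplit("_", 1)
        groups[image_id].append((int(offset), name))
    # Sort each image's files by region offset.
    return {k: [n for _, n in sorted(v)] for k, v in groups.items()}

files = ["000000000872_5.jpg", "000000000872_0.jpg", "000000000285_0.jpg"]
print(group_by_image_id(files))
```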

We used the following language prompt for the task:

I have labeled a bright numeric ID at the center for each visual object in the image. Please enumerate their names. You must answer by selecting from the following names: [COCO Vocabulary]

### Open-Vocab Segmentation on ADE20K

ade20k_ovseg
β”œβ”€β”€ som_images
    β”œβ”€β”€ ADE_val_00000001_0.jpg
    β”œβ”€β”€ ADE_val_00000001_5.jpg
    β”œβ”€β”€ ADE_val_00000011_5.jpg
    β”œβ”€β”€ ...
    β”œβ”€β”€ ADE_val_00000039_5.jpg
    └── ADE_val_00000040_0.jpg

As with COCO, the regions in ADE20K are very dense, so we split the regions into groups of 5. For example, ADE_val_00000001_0.jpg contains the first 5 regions and ADE_val_00000001_5.jpg contains the next 5. Note that you can use the image_id to track the original image.

We used the following language prompt for the task:

I have labeled a bright numeric ID at the center for each visual object in the image. Please enumerate their names. You must answer by selecting from the following names: [ADE20K Vocabulary]

### Phrase Grounding on Flickr30K

flickr30k_grounding
β”œβ”€β”€ som_images
    β”œβ”€β”€ 14868339.jpg
    β”œβ”€β”€ 14868339_wbox.jpg
    β”œβ”€β”€ 14868339.json
    β”œβ”€β”€ ...
    β”œβ”€β”€ 302740416.jpg
    β”œβ”€β”€ 319185571_wbox.jpg
    └── 302740416.json

For Flickr30K, we provide the image with numeric IDs and masks, as well as a version with additional bounding boxes (suffixed `_wbox`). The JSON file contains the ground-truth bounding boxes and their corresponding phrases. Note that the bounding boxes are in [x1, y1, x2, y2] format.
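With boxes in [x1, y1, x2, y2] format, box-level scoring (Recall @ 1 here, ACC @ 0.5 for referring comprehension) reduces to an IoU computation. A minimal IoU sketch for this coordinate convention:

```python
def box_iou(a, b):
    """IoU of two boxes given as [x1, y1, x2, y2]."""
    # Intersection rectangle (empty if the boxes do not overlap).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(box_iou([0, 0, 10, 10], [5, 5, 15, 15]))  # 25 / 175 β‰ˆ 0.1429
```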

We used the following language prompt for the task:

I have labeled a bright numeric ID at the center for each visual object in the image. Given the image showing a man in glasses holding a piece of paper, find the corresponding regions for a man in glasses, a piece of paper.

### Referring Expression Comprehension and Segmentation on RefCOCOg

refcocog_refseg
β”œβ”€β”€ som_images
    β”œβ”€β”€ 000000000795.jpg
    β”œβ”€β”€ 000000000795.json
    β”œβ”€β”€ ...
    β”œβ”€β”€ 000000007852.jpg
    └── 000000007852.json

For RefCOCOg, we provide the image with numeric IDs and masks, along with a JSON file containing the referring expressions and their corresponding referring IDs.
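Since each image sits next to a same-named JSON file, iterating over the dataset is just a matter of pairing the two. A minimal loader sketch; the JSON field names are not specified in this README, so the annotation dict is passed through as-is:

```python
import json
from pathlib import Path

def load_refcocog_samples(root):
    """Yield (image_path, annotation_dict) pairs from
    refcocog_refseg/som_images, pairing each XXXX.json with
    its sibling XXXX.jpg."""
    som_dir = Path(root) / "som_images"
    for ann_path in sorted(som_dir.glob("*.json")):
        img_path = ann_path.with_suffix(".jpg")
        with open(ann_path) as f:
            ann = json.load(f)
        yield img_path, ann
```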

We used the following language prompt for the task:

I have labeled a bright numeric ID at the center for each visual object in the image. Please tell me the IDs for: The laptop behind the beer bottle; Laptop turned on.
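Scoring responses to a prompt like this requires pulling the predicted mark IDs out of free-form model text. A simple heuristic sketch (the function name is ours, and real model output may need more careful handling):

```python
import re

def parse_mark_ids(answer):
    """Extract the numeric mark IDs mentioned in an LLM answer,
    preserving first-mention order and dropping duplicates."""
    seen, ids = set(), []
    for tok in re.findall(r"\d+", answer):
        n = int(tok)
        if n not in seen:
            seen.add(n)
            ids.append(n)
    return ids

print(parse_mark_ids(
    "The laptop behind the beer bottle is 3; the laptop turned on is 7."
))  # [3, 7]
```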