Commit 46861c5 (verified) by jbilcke-hf · 1 Parent(s): a1232f4

Upload core files for paper 2510.18876

This view is limited to 50 files because it contains too many changes; see the raw diff for the complete change set.
Files changed (50)
  1. .gitignore +35 -0
  2. CLAUDE.md +304 -0
  3. GRADIO_APP_SUMMARY.md +180 -0
  4. LICENSE +201 -0
  5. README.md +49 -5
  6. README_original.md +208 -0
  7. app.py +442 -0
  8. demo/gar_relationship.py +143 -0
  9. demo/gar_with_mask.py +132 -0
  10. demo/gar_with_sam.py +272 -0
  11. demo/gradio/.gradio/certificate.pem +31 -0
  12. demo/gradio/README.md +11 -0
  13. demo/gradio/app.py +267 -0
  14. demo/gradio/frontend/README.md +126 -0
  15. demo/gradio/frontend/configs/webpack/common.js +85 -0
  16. demo/gradio/frontend/configs/webpack/dev.js +25 -0
  17. demo/gradio/frontend/configs/webpack/prod.js +22 -0
  18. demo/gradio/frontend/package.json +64 -0
  19. demo/gradio/frontend/postcss.config.js +10 -0
  20. demo/gradio/frontend/src/App.tsx +306 -0
  21. demo/gradio/frontend/src/components/ErrorModal.tsx +32 -0
  22. demo/gradio/frontend/src/components/LoadingOverlay.tsx +30 -0
  23. demo/gradio/frontend/src/components/QueueStatusIndicator.tsx +29 -0
  24. demo/gradio/frontend/src/components/Stage.tsx +343 -0
  25. demo/gradio/frontend/src/components/Tool.tsx +182 -0
  26. demo/gradio/frontend/src/components/helpers/Interfaces.tsx +47 -0
  27. demo/gradio/frontend/src/components/helpers/imageUtils.tsx +21 -0
  28. demo/gradio/frontend/src/components/helpers/maskUtils.tsx +65 -0
  29. demo/gradio/frontend/src/components/helpers/onnxModelAPI.tsx +71 -0
  30. demo/gradio/frontend/src/components/helpers/scaleHelper.tsx +18 -0
  31. demo/gradio/frontend/src/components/hooks/context.tsx +35 -0
  32. demo/gradio/frontend/src/components/hooks/createContext.tsx +35 -0
  33. demo/gradio/frontend/src/index.tsx +17 -0
  34. demo/gradio/frontend/src/services/maskApi.tsx +211 -0
  35. demo/gradio/frontend/tailwind.config.js +12 -0
  36. demo/gradio/frontend/tsconfig.json +24 -0
  37. demo/gradio/frontend/yarn.lock +0 -0
  38. demo/gradio/requirements.txt +15 -0
  39. evaluation/DLC-Bench/annotations/annotations.json +0 -0
  40. evaluation/DLC-Bench/annotations/class_names.json +102 -0
  41. evaluation/DLC-Bench/annotations/qa.json +0 -0
  42. evaluation/DLC-Bench/eval_gpt_with_image.py +483 -0
  43. evaluation/DLC-Bench/eval_llama_without_image.py +503 -0
  44. evaluation/DLC-Bench/inference.py +173 -0
  45. evaluation/DLC-Bench/model_outputs/gar_1b.json +102 -0
  46. evaluation/DLC-Bench/model_outputs/gar_1b_eval.json +0 -0
  47. evaluation/DLC-Bench/model_outputs/gar_1b_eval_gpt.json +0 -0
  48. evaluation/DLC-Bench/model_outputs/gar_8b.json +102 -0
  49. evaluation/DLC-Bench/model_outputs/gar_8b_eval.json +0 -0
  50. evaluation/DLC-Bench/model_outputs/gar_8b_eval_gpt.json +0 -0
.gitignore ADDED
@@ -0,0 +1,35 @@
1
+ # Python
2
+ __pycache__
3
+ *.pyc
4
+ *.egg-info
5
+
6
+ # Log
7
+ *.log
8
+ *.log.*
9
+ # *.json
10
+ # *.jsonl
11
+
12
+ # Data
13
+ !**/alpaca-data-conversation.json
14
+
15
+ # Editor
16
+ .idea
17
+ *.swp
18
+
19
+ # Other
20
+ .DS_Store
21
+ wandb
22
+ # output
23
+
24
+ checkpoints
25
+ ckpts*
26
+ pretrained*
27
+
28
+ .ipynb_checkpoints
29
+ *.ipynb
30
+
31
+ # DevContainer
32
+ !.devcontainer/*
33
+
34
+ # Demo
35
+ serve_images/
CLAUDE.md ADDED
@@ -0,0 +1,304 @@
1
+ # CLAUDE.md
2
+
3
+ This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
4
+
5
+ ## Project Overview
6
+
7
+ **Grasp Any Region (GAR)** is a research project for region-level multimodal understanding in vision-language models. It enables:
8
+
9
+ 1. **Single Region Understanding**: Detailed description of specific image/video regions via points/boxes/scribbles/masks
10
+ 2. **Multi-Region Reasoning**: Complex relationship modeling and reasoning across multiple regions simultaneously
11
+ 3. **Advanced Compositional Reasoning**: Active dialogue about regions rather than passive description
12
+
13
+ The model is built on top of Facebook's Perception-LM architecture and uses the xTuner training framework with PyTorch distributed training.
14
+
15
+ ## Architecture
16
+
17
+ ### Core Components
18
+
19
+ **Model Architecture** (`projects/grasp_any_region/models/grasp_any_region.py:GraspAnyRegion`):
20
+ - Wraps `PerceptionLMForConditionalGeneration` from HuggingFace
21
+ - Key innovation: **RoI-aligned feature replay technique** using `torchvision.ops.roi_align`
22
+ - Adds `mask_patch_embedding` layer (Conv2d) for region mask encoding
23
+ - Supports 15 visual prompt tokens (`<Prompt0>` through `<Prompt14>`) plus `<NO_Prompt>`
24
+ - Forward pass implements feature replay mechanism at grasp_any_region.py:291-377
25
+
26
+ **Visual Prompt System**:
27
+ - Masks are encoded with prompt IDs (0-14) where each ID represents a different region
28
+ - Special value (15 = `<NO_Prompt>`) indicates background/non-region areas
29
+ - RoI features are extracted using bounding boxes and replayed into the sequence at crop token positions
30
+
31
+ **Training Pipeline**:
32
+ - Uses xTuner framework (built on MMEngine)
33
+ - Dataset: Arrow format with three subsets (Seed, Fine-Grained, Relation)
34
+ - Custom collate function handles variable-length sequences and multi-region inputs
35
+ - Flash Attention 2 required for efficiency
36
+
37
+ ### Directory Structure
38
+
39
+ ```
40
+ projects/grasp_any_region/ # Main model code
41
+ ├── configs/ # Training configs (gar_1b.py, gar_8b.py)
42
+ ├── models/
43
+ │ ├── grasp_any_region.py # Main model wrapper
44
+ │ └── modeling/ # Custom PerceptionLM implementations
45
+ ├── datasets/ # Dataset and data loading
46
+ └── hf_models/ # HuggingFace conversion utilities
47
+
48
+ demo/ # Inference demos
49
+ ├── gar_with_mask.py # Direct mask input
50
+ ├── gar_with_sam.py # SAM-based region selection
51
+ ├── gar_relationship.py # Multi-region reasoning
52
+ └── gradio/ # Web demo
53
+
54
+ evaluation/ # Benchmarks
55
+ ├── GAR-Bench/ # Custom benchmark (Caption-Simple, Caption-Detailed, VQA)
56
+ ├── DLC-Bench/ # Detailed localized captioning
57
+ ├── Ferret-Bench/ # Region description
58
+ └── MDVP-Bench/ # Multi-domain visual perception
59
+
60
+ tools/
61
+ ├── train.py # Training entry point
62
+ ├── test.py # Testing entry point
63
+ └── dist.sh # Distributed training launcher
64
+ ```
65
+
66
+ ## Common Commands
67
+
68
+ ### Environment Setup
69
+
70
+ ```bash
71
+ # Create environment (requires Python 3.11.2)
72
+ conda create -n gar python=3.11.2 -y
73
+ conda activate gar
74
+
75
+ # Install dependencies
76
+ pip3 install xtuner==0.2.0rc0
77
+ pip3 install -r requirements.txt
78
+ pip3 install flash-attn==2.7.4.post1 --no-build-isolation -v
79
+ ```
80
+
81
+ ### Training
82
+
83
+ ```bash
84
+ # Single-node distributed training (8 GPUs)
85
+ bash tools/dist.sh train projects/grasp_any_region/configs/gar_1b.py 8
86
+
87
+ # The dist.sh script uses torchrun with:
88
+ # - Configurable MASTER_ADDR, PORT, NNODES, NODE_RANK
89
+ # - DeepSpeed Zero2 by default (set DEEPSPEED env var to override)
90
+ # - 5-hour timeout (TORCHELASTIC_TIMEOUT=18000)
91
+ ```
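+
+ For a multi-node run, the same script would be launched once per node with the environment variables listed above set accordingly. This is a hedged sketch: the variable names come from the comments above, the values are placeholders.
+
+ ```bash
+ # Node 0 of 2 (run the same command on node 1 with NODE_RANK=1)
+ MASTER_ADDR=10.0.0.1 PORT=29500 NNODES=2 NODE_RANK=0 \
+   bash tools/dist.sh train projects/grasp_any_region/configs/gar_8b.py 8
+ ```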
92
+
93
+ **Config Files**:
94
+ - `projects/grasp_any_region/configs/gar_1b.py` - 1B model
95
+ - `projects/grasp_any_region/configs/gar_8b.py` - 8B model
96
+
97
+ Key training settings (gar_1b.py):
98
+ - Base model: `facebook/Perception-LM-1B`
99
+ - Batch size: 1 per device × 2 accumulation × 32 GPUs = 64 global
100
+ - Learning rate: 1e-5 (AdamW), warmup: 3%, cosine annealing
101
+ - Max length: 16384 tokens
102
+ - Saves every 5000 steps, keeps last 2 checkpoints
103
+
104
+ ### Dataset Preparation
105
+
106
+ ```bash
107
+ # Download dataset from HuggingFace
108
+ hf download HaochenWang/Grasp-Any-Region-Dataset --local-dir data --repo-type dataset
109
+
110
+ # Expected structure:
111
+ # data/
112
+ # ├── Seed-Dataset/data-*.arrow
113
+ # ├── Fine-Grained-Dataset/data-*.arrow
114
+ # └── Relation-Dataset/data-*.arrow
115
+ ```
116
+
117
+ ### Inference Demos
118
+
119
+ **Single Region with Mask**:
120
+ ```bash
121
+ torchrun --nproc-per-node=1 --master-port=8119 \
122
+ demo/gar_with_mask.py \
123
+ --image_path assets/demo_image_1.png \
124
+ --mask_path assets/demo_mask_1.png
125
+ ```
126
+
127
+ **Single Region with SAM** (points or box):
128
+ ```bash
129
+ # Using points
130
+ torchrun --nproc-per-node=1 --master-port=8119 \
131
+ demo/gar_with_sam.py \
132
+ --image_path assets/demo_image_2.jpg \
133
+ --points '[[1172, 812], [1572, 800]]'
134
+
135
+ # Using bounding box
136
+ torchrun --nproc-per-node=1 --master-port=8119 \
137
+ demo/gar_with_sam.py \
138
+ --image_path assets/demo_image_2.jpg \
139
+ --box '[800, 500, 1800, 1000]' \
140
+ --use_box
141
+ ```
142
+
143
+ **Multi-Region Relationship**:
144
+ ```bash
145
+ torchrun --nproc-per-node=1 --master-port=8119 \
146
+ demo/gar_relationship.py \
147
+ --image_path assets/demo_image_3.png \
148
+ --mask_paths "['assets/demo_mask_3_0.png', 'assets/demo_mask_3_1.png', 'assets/demo_mask_3_2.png']" \
149
+ --question_str 'Question: What is the relationship between <Prompt0>, <Prompt1>, and <Prompt2>?'
150
+ ```
151
+
152
+ **Gradio Demo**:
153
+ ```bash
154
+ cd demo/gradio
155
+ pip install -r requirements.txt
156
+ python app.py
157
+ ```
158
+
159
+ ### Evaluation
160
+
161
+ All evaluation scripts follow the same pattern: inference → evaluation with LLM judge (GPT-4o or Llama).
162
+
163
+ **GARBench-Caption-Simple**:
164
+ ```bash
165
+ # Inference
166
+ torchrun --nproc-per-node=1 --master-port=9811 \
167
+ evaluation/GAR-Bench/inference.py \
168
+ --model_name_or_path HaochenWang/GAR-8B \
169
+ --anno_file evaluation/GAR-Bench/annotations/GAR-Bench-Caption-Simple.json \
170
+ --mode simple \
171
+ --cache_name my_test \
172
+ --data_type bf16 \
173
+ --seed 42
174
+
175
+ # Evaluation (requires Azure OpenAI)
176
+ export AZURE_OPENAI_ENDPOINT=YOUR_ENDPOINT
177
+ export AZURE_OPENAI_KEY=YOUR_KEY
178
+ python3 evaluation/GAR-Bench/eval_simple.py \
179
+ --pred evaluation/GAR-Bench/model_outputs/my_test_simple.json
180
+ ```
181
+
182
+ **GARBench-VQA** (multi-region reasoning):
183
+ ```bash
184
+ torchrun --nproc-per-node=1 --master-port=9811 \
185
+ evaluation/GAR-Bench/inference.py \
186
+ --model_name_or_path HaochenWang/GAR-8B \
187
+ --anno_file evaluation/GAR-Bench/annotations/GAR-Bench-VQA.json \
188
+ --mode vqa \
189
+ --cache_name my_test \
190
+ --data_type bf16
191
+ # VQA evaluation is automatic (no LLM judge)
192
+ ```
193
+
194
+ **DLC-Bench** (detailed localized captioning):
195
+ ```bash
196
+ # Download images first
197
+ cd evaluation/DLC-Bench/annotations
198
+ hf download nvidia/DLC-Bench --repo-type dataset --include "images/*" --local-dir ./
199
+ cd ../../..
200
+
201
+ # Inference
202
+ torchrun --nproc-per-node=1 --master-port=8841 \
203
+ evaluation/DLC-Bench/inference.py \
204
+ --model_name_or_path HaochenWang/GAR-8B \
205
+ --cache_name my_test \
206
+ --data_type bf16
207
+
208
+ # Evaluation with GPT-4o
209
+ python3 evaluation/DLC-Bench/eval_gpt_with_image.py \
210
+ --pred evaluation/DLC-Bench/model_outputs/my_test.json
211
+
212
+ # Alternative: Evaluation with Llama3.1-8B (requires vLLM server)
213
+ bash evaluation/DLC-Bench/serve_judge.sh # in one terminal
214
+ python3 evaluation/DLC-Bench/eval_llama_without_image.py \
215
+ --pred evaluation/DLC-Bench/model_outputs/my_test.json \
216
+ --base_url http://localhost:8007/v1
217
+ ```
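+
+ The Llama judge is served as an OpenAI-compatible endpoint by vLLM, so a quick sanity check that the server is reachable can be done with the standard OpenAI client. This is illustrative only; the base URL matches the command above, and the model listed depends on what `serve_judge.sh` launches.
+
+ ```python
+ from openai import OpenAI
+
+ # Assumes the vLLM judge server started by serve_judge.sh is running locally
+ client = OpenAI(base_url="http://localhost:8007/v1", api_key="EMPTY")
+ print([m.id for m in client.models.list().data])  # should list the served judge model
+ ```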
218
+
219
+ ### Model Conversion
220
+
221
+ ```bash
222
+ # Convert trained checkpoint to HuggingFace format
223
+ python3 projects/grasp_any_region/hf_models/convert_to_hf.py \
224
+ projects/grasp_any_region/configs/gar_1b.py \
225
+ --pth-model PATH_TO_PTH_MODEL \
226
+ --save-path PATH_TO_SAVE_FOLDER
227
+
228
+ # Note: Manually copy required .py files to save folder after conversion
229
+ ```
230
+
231
+ ## Key Implementation Details
232
+
233
+ ### RoI Feature Replay Mechanism
234
+
235
+ The core innovation is at `grasp_any_region.py:291-377`:
236
+
237
+ 1. Image features are extracted as tiles (16×16 patches per tile)
238
+ 2. Tiles are merged into full-resolution feature map
239
+ 3. For each `<PromptN>` token in input:
240
+ - Extract RoI bounding box from `data["bboxes"]`
241
+ - Apply `torchvision.ops.roi_align` to extract 16×16 features
242
+ - Replace prompt tokens in sequence with RoI features
243
+ 4. This allows attending to region-specific features with global context
244
+
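+ A minimal sketch of steps 2-3, assuming a merged feature map `feat_map` of shape (1, C, H, W), one normalized bbox, and the positions of the corresponding crop tokens in the embedding sequence (variable names are illustrative, not the repository's):
+
+ ```python
+ import torch
+ from torchvision.ops import roi_align
+
+ def replay_roi_features(feat_map, norm_bbox, inputs_embeds, crop_positions, out_size=16):
+     """Extract a 16x16 RoI from the merged feature map and write it into the
+     token sequence at the crop-token positions (illustrative sketch)."""
+     _, c, h, w = feat_map.shape
+     x1, y1, x2, y2 = norm_bbox
+     # roi_align expects (batch_index, x1, y1, x2, y2) in feature-map coordinates
+     boxes = torch.tensor([[0, x1 * w, y1 * h, x2 * w, y2 * h]],
+                          dtype=feat_map.dtype, device=feat_map.device)
+     roi = roi_align(feat_map, boxes, output_size=(out_size, out_size), aligned=True)
+     # (1, C, 16, 16) -> (256, C): one feature vector per crop token
+     roi_tokens = roi.flatten(2).permute(0, 2, 1).squeeze(0)
+     inputs_embeds[0, crop_positions] = roi_tokens
+     return inputs_embeds
+ ```
+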
245
+ ### Mask Encoding
246
+
247
+ Masks are provided as 3-channel images where pixel values encode prompt IDs:
248
+ - Values 0-14: Different region prompts
249
+ - Value 15 (or `prompt_numbers`): Background (no prompt)
250
+ - `mask_patch_embedding` (Conv2d) encodes binary masks into feature space
251
+ - Masks are processed at patch level matching vision encoder stride
252
+
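+ A rough sketch of how a prompt-ID mask can be split into per-region binary masks and embedded at patch level; the hidden size and patch size below are assumptions, not the actual model values:
+
+ ```python
+ import torch
+ import torch.nn as nn
+
+ PATCH_SIZE = 14      # assumed vision-encoder patch stride
+ NUM_PROMPTS = 15     # <Prompt0> ... <Prompt14>; value 15 means <NO_Prompt>
+ HIDDEN_SIZE = 1024   # assumed embedding width
+
+ # One Conv2d turns each binary region mask into patch-level mask features
+ mask_patch_embedding = nn.Conv2d(1, HIDDEN_SIZE, kernel_size=PATCH_SIZE, stride=PATCH_SIZE)
+
+ def encode_region_masks(prompt_id_map: torch.Tensor) -> dict:
+     """prompt_id_map: (H, W) integer map, each pixel holding a prompt ID or 15."""
+     region_embeds = {}
+     for pid in range(NUM_PROMPTS):
+         binary = (prompt_id_map == pid).float()
+         if binary.sum() == 0:
+             continue  # prompt ID unused in this image
+         # (1, 1, H, W) -> (1, HIDDEN_SIZE, H // PATCH_SIZE, W // PATCH_SIZE)
+         region_embeds[pid] = mask_patch_embedding(binary[None, None])
+     return region_embeds
+ ```
+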
253
+ ### Data Format
254
+
255
+ Dataset uses Arrow format with fields:
256
+ - `pixel_values`: (num_tiles, 3, H, W) image tiles
257
+ - `input_ids`: Token sequence with special image/prompt tokens
258
+ - `labels`: Target sequence (-100 for non-loss positions)
259
+ - `global_mask_values`: Region masks with prompt IDs
260
+ - `aspect_ratios`: (ncw, nch) tile arrangement
261
+ - `bboxes`: Dict mapping crop tokens to normalized bbox coordinates
262
+
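+ A sample and its fields can be inspected directly from an Arrow shard with the `datasets` library; the shard filename below is a placeholder, actual names follow the `data-*-of-*.arrow` pattern from the download step:
+
+ ```python
+ from datasets import Dataset
+
+ # Load a single shard of the Seed subset downloaded earlier (filename is a placeholder)
+ ds = Dataset.from_file("data/Seed-Dataset/data-00000-of-00064.arrow")
+ sample = ds[0]
+
+ print(sample.keys())             # pixel_values, input_ids, labels, global_mask_values, ...
+ print(len(sample["input_ids"]))  # token sequence length (up to the 16384 max)
+ print(sample["aspect_ratios"])   # (ncw, nch) tile arrangement
+ ```
+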
263
+ ### Special Tokens
264
+
265
+ The model extends base tokenizer with:
266
+ - `<Prompt0>` through `<Prompt14>`: Region identifiers in text
267
+ - `<NO_Prompt>`: Background/non-region marker
268
+ - `<|reserved_special_token_{pid+2}|>`: Internal crop tokens for feature replay
269
+
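+ The exact registration happens inside the model and config code; in generic HuggingFace terms the extension looks roughly like the sketch below, assuming the base tokenizer loads without extra setup:
+
+ ```python
+ from transformers import AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("facebook/Perception-LM-1B")
+
+ prompt_tokens = [f"<Prompt{i}>" for i in range(15)] + ["<NO_Prompt>"]
+ num_added = tokenizer.add_tokens(prompt_tokens, special_tokens=True)
+ print(f"added {num_added} tokens; vocab size is now {len(tokenizer)}")
+
+ # The input embedding matrix must be resized to match afterwards:
+ # model.resize_token_embeddings(len(tokenizer))
+ ```
+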
270
+ ## Important Notes
271
+
272
+ - **Flash Attention 2 is required** - training will fail without it
273
+ - **Python 3.11.2 specifically** - later versions may have compatibility issues
274
+ - **Single batch size only** - code asserts `batch_size=1` at grasp_any_region.py:270
275
+ - **Distributed training required** - single-GPU training not well supported
276
+ - **DeepSpeed Zero2** - default optimization for memory efficiency
277
+ - **torchrun vs torch.distributed.launch** - dist.sh tries torchrun first, falls back to launch
278
+ - **xTuner framework** - all training uses xTuner's runner, not native PyTorch
279
+ - **Evaluation randomness** - LLM judges have variance even with temperature=0
280
+
281
+ ## HuggingFace Models
282
+
283
+ Pre-trained models available:
284
+ - `HaochenWang/GAR-1B` - 1 billion parameter model
285
+ - `HaochenWang/GAR-8B` - 8 billion parameter model
286
+
287
+ Base architecture:
288
+ - `facebook/Perception-LM-1B` - Base vision-language model
289
+ - `facebook/Perception-LM-8B` - Larger variant
290
+
291
+ ## Citation
292
+
293
+ ```bibtex
294
+ @article{wang2025grasp,
295
+ title={Grasp Any Region: Prompting MLLM to Understand the Dense World},
296
+ author={Haochen Wang and Yuhao Wang and Tao Zhang and Yikang Zhou and Yanwei Li and Jiacong Wang and Ye Tian and Jiahao Meng and Zilong Huang and Guangcan Mai and Anran Wang and Yunhai Tong and Zhuochen Wang and Xiangtai Li and Zhaoxiang Zhang},
297
+ journal={arXiv preprint arXiv:2510.18876},
298
+ year={2025}
299
+ }
300
+ ```
301
+
302
+ ## License
303
+
304
+ Apache-2.0 License
GRADIO_APP_SUMMARY.md ADDED
@@ -0,0 +1,180 @@
1
+ # Gradio App Summary for Grasp Any Region (GAR)
2
+
3
+ ## ✅ Completion Status
4
+
5
+ Successfully created a comprehensive Gradio demo for the Grasp Any Region (GAR) project.
6
+
7
+ ## 📁 Files Created/Modified
8
+
9
+ ### 1. **app.py** (NEW)
10
+ - Complete Gradio interface with 3 tabs:
11
+ - **Points → Describe**: Interactive point-based segmentation with SAM
12
+ - **Box → Describe**: Bounding box-based segmentation
13
+ - **Mask → Describe**: Direct mask upload for region description
14
+ - Features:
15
+ - ZeroGPU integration with `@spaces.GPU` decorator
16
+ - Proper import order (spaces first, then CUDA packages)
17
+ - SAM (Segment Anything Model) integration for interactive segmentation
18
+ - GAR-1B model for detailed region descriptions
19
+ - Visualization with contours and input annotations
20
+ - Example images and clear instructions
21
+ - Error handling and status messages
22
+
23
+ ### 2. **requirements.txt** (UPDATED)
24
+ - Gradio 5.49.1 (required version)
25
+ - httpx version fixed to >=0.24.1,<1.0 (Gradio compatibility)
26
+ - PyTorch 2.8.0 (pinned for FlashAttention compatibility)
27
+ - FlashAttention 2.8.3 prebuilt wheel (PyTorch 2.8, Python 3.10, CUDA 12, abiFALSE)
28
+ - spaces==0.30.4 for ZeroGPU
29
+ - All original dependencies preserved
30
+ - Segment Anything from GitHub
31
+ - Vision libraries (opencv-python, pillow, pycocotools)
32
+ - Transformers 4.56.2 and supporting ML libraries
33
+
34
+ ## 🎯 Key Features
35
+
36
+ 1. **Three Interaction Modes**:
37
+ - Points: Click or enter coordinates to segment regions
38
+ - Box: Draw or enter bounding boxes
39
+ - Mask: Upload pre-made masks directly
40
+
41
+ 2. **Model Integration**:
42
+ - GAR-1B for region understanding (1 billion parameters)
43
+ - SAM ViT-Huge for automatic segmentation
44
+ - Both models loaded once at startup for efficiency
45
+
46
+ 3. **ZeroGPU Optimization**:
47
+ - Proper `@spaces.GPU(duration=120)` decorator usage
48
+ - 2-minute GPU allocation per function call
49
+ - NVIDIA H200 with 70GB VRAM available
50
+ - Critical import order: `spaces` imported before torch
51
+
52
+ 4. **User Experience**:
53
+ - Clear step-by-step instructions
54
+ - Example images included
55
+ - Real-time visualization with overlays
56
+ - Comprehensive error handling
57
+ - Professional UI with Gradio 5.x Soft theme
58
+
59
+ ## 🔧 Technical Details
60
+
61
+ ### Import Order (CRITICAL)
62
+ ```python
63
+ # 🚨 spaces MUST be imported FIRST
64
+ import spaces
65
+
66
+ # Then import CUDA packages
67
+ import torch
68
+ from transformers import AutoModel, AutoProcessor
69
+ ```
70
+
71
+ This prevents the "CUDA has been initialized" error.
72
+
73
+ ### FlashAttention Configuration
74
+ - Using prebuilt wheel for PyTorch 2.8.0
75
+ - Python 3.10 (cp310)
76
+ - CUDA 12 (cu12)
77
+ - abiFALSE (REQUIRED - never use abiTRUE)
78
+ - URL: https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
79
+
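+ Installing from that prebuilt wheel (same URL as above) avoids compiling FlashAttention from source:
+
+ ```bash
+ pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
+ ```
+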
80
+ ### Model Loading Strategy
81
+ - Models loaded once at startup (outside decorated functions)
82
+ - Moved to CUDA device after loading
83
+ - GPU-decorated functions only handle inference
84
+ - Efficient memory usage
85
+
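+ Schematically, the loading pattern looks like the sketch below; the model ID comes from this repo, everything else is illustrative (app.py itself uses `device_map="auto"` rather than an explicit device move):
+
+ ```python
+ import spaces                      # must come before any CUDA-touching import
+ import torch
+ from transformers import AutoModel, AutoProcessor
+
+ # Loaded once at import time, outside any GPU-decorated function
+ model = AutoModel.from_pretrained(
+     "HaochenWang/GAR-1B", trust_remote_code=True, torch_dtype=torch.bfloat16
+ ).eval()
+ processor = AutoProcessor.from_pretrained("HaochenWang/GAR-1B", trust_remote_code=True)
+
+ @spaces.GPU(duration=120)          # GPU is attached only while this function runs
+ def describe(sample):
+     with torch.no_grad():
+         return model.generate(**sample)
+ ```
+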
86
+ ## 📋 Dependencies Highlights
87
+
88
+ **Core:**
89
+ - gradio==5.49.1
90
+ - torch==2.8.0
91
+ - spaces==0.30.4
92
+ - flash-attn (prebuilt wheel)
93
+
94
+ **AI/ML:**
95
+ - transformers==4.56.2
96
+ - accelerate>=0.28.0
97
+ - timm==1.0.19
98
+ - peft==0.15.2
99
+
100
+ **Vision:**
101
+ - opencv-python
102
+ - pillow>=9.4.0
103
+ - segment-anything (from GitHub)
104
+ - pycocotools
105
+
106
+ ## 🎨 UI Structure
107
+
108
+ ```
109
+ Grasp Any Region (GAR) Demo
110
+ ├── Introduction & Links
111
+ ├── Tab 1: Points → Describe
112
+ │ ├── Image upload + points input
113
+ │ ├── Generate Mask button
114
+ │ ├── Describe Region button
115
+ │ └── Outputs: mask, visualization, description
116
+ ├── Tab 2: Box → Describe
117
+ │ ├── Image upload + box input
118
+ │ ├── Generate Mask button
119
+ │ ├── Describe Region button
120
+ │ └── Outputs: mask, visualization, description
121
+ ├── Tab 3: Mask → Describe
122
+ │ ├── Image upload + mask upload
123
+ │ ├── Describe Region button
124
+ │ └── Outputs: visualization, description
125
+ └── Documentation & Citation
126
+ ```
127
+
128
+ ## 🚀 How to Run
129
+
130
+ ```bash
131
+ # Install dependencies
132
+ pip install -r requirements.txt
133
+
134
+ # Run the app
135
+ python app.py
136
+ ```
137
+
138
+ The app will automatically:
139
+ 1. Load GAR-1B and SAM models
140
+ 2. Launch Gradio interface
141
+ 3. Allocate GPU on-demand with ZeroGPU
142
+
143
+ ## 📊 Expected Performance
144
+
145
+ - **Model**: GAR-1B (lightweight, fast inference)
146
+ - **GPU**: NVIDIA H200, 70GB VRAM
147
+ - **Inference Time**: ~10-30 seconds per region (depending on complexity)
148
+ - **Max New Tokens**: 1024 (configurable)
149
+
150
+ ## ⚠️ Important Notes
151
+
152
+ 1. **Import Order**: Always import `spaces` before torch/CUDA packages
153
+ 2. **Python Version**: Requires Python 3.10 (for FlashAttention wheel)
154
+ 3. **FlashAttention**: Uses prebuilt wheel (no compilation needed)
155
+ 4. **Asset Files**: Demo expects images in `assets/` directory
156
+ 5. **SingleRegionCaptionDataset**: Required from evaluation module
157
+
158
+ ## 🔗 References
159
+
160
+ - **Paper**: https://arxiv.org/abs/2510.18876
161
+ - **GitHub**: https://github.com/Haochen-Wang409/Grasp-Any-Region
162
+ - **Model**: https://huggingface.co/HaochenWang/GAR-1B
163
+ - **SAM**: https://github.com/facebookresearch/segment-anything
164
+
165
+ ## 📝 Citation
166
+
167
+ ```bibtex
168
+ @article{wang2025grasp,
169
+ title={Grasp Any Region: Prompting MLLM to Understand the Dense World},
170
+ author={Haochen Wang et al.},
171
+ journal={arXiv preprint arXiv:2510.18876},
172
+ year={2025}
173
+ }
174
+ ```
175
+
176
+ ---
177
+
178
+ **Created**: 2025-10-25
179
+ **Status**: ✅ Ready for deployment
180
+ **Hardware**: zerogpu (NVIDIA H200, 70GB VRAM)
LICENSE ADDED
@@ -0,0 +1,201 @@
1
+ Apache License
2
+ Version 2.0, January 2004
3
+ http://www.apache.org/licenses/
4
+
5
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
6
+
7
+ 1. Definitions.
8
+
9
+ "License" shall mean the terms and conditions for use, reproduction,
10
+ and distribution as defined by Sections 1 through 9 of this document.
11
+
12
+ "Licensor" shall mean the copyright owner or entity authorized by
13
+ the copyright owner that is granting the License.
14
+
15
+ "Legal Entity" shall mean the union of the acting entity and all
16
+ other entities that control, are controlled by, or are under common
17
+ control with that entity. For the purposes of this definition,
18
+ "control" means (i) the power, direct or indirect, to cause the
19
+ direction or management of such entity, whether by contract or
20
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
21
+ outstanding shares, or (iii) beneficial ownership of such entity.
22
+
23
+ "You" (or "Your") shall mean an individual or Legal Entity
24
+ exercising permissions granted by this License.
25
+
26
+ "Source" form shall mean the preferred form for making modifications,
27
+ including but not limited to software source code, documentation
28
+ source, and configuration files.
29
+
30
+ "Object" form shall mean any form resulting from mechanical
31
+ transformation or translation of a Source form, including but
32
+ not limited to compiled object code, generated documentation,
33
+ and conversions to other media types.
34
+
35
+ "Work" shall mean the work of authorship, whether in Source or
36
+ Object form, made available under the License, as indicated by a
37
+ copyright notice that is included in or attached to the work
38
+ (an example is provided in the Appendix below).
39
+
40
+ "Derivative Works" shall mean any work, whether in Source or Object
41
+ form, that is based on (or derived from) the Work and for which the
42
+ editorial revisions, annotations, elaborations, or other modifications
43
+ represent, as a whole, an original work of authorship. For the purposes
44
+ of this License, Derivative Works shall not include works that remain
45
+ separable from, or merely link (or bind by name) to the interfaces of,
46
+ the Work and Derivative Works thereof.
47
+
48
+ "Contribution" shall mean any work of authorship, including
49
+ the original version of the Work and any modifications or additions
50
+ to that Work or Derivative Works thereof, that is intentionally
51
+ submitted to Licensor for inclusion in the Work by the copyright owner
52
+ or by an individual or Legal Entity authorized to submit on behalf of
53
+ the copyright owner. For the purposes of this definition, "submitted"
54
+ means any form of electronic, verbal, or written communication sent
55
+ to the Licensor or its representatives, including but not limited to
56
+ communication on electronic mailing lists, source code control systems,
57
+ and issue tracking systems that are managed by, or on behalf of, the
58
+ Licensor for the purpose of discussing and improving the Work, but
59
+ excluding communication that is conspicuously marked or otherwise
60
+ designated in writing by the copyright owner as "Not a Contribution."
61
+
62
+ "Contributor" shall mean Licensor and any individual or Legal Entity
63
+ on behalf of whom a Contribution has been received by Licensor and
64
+ subsequently incorporated within the Work.
65
+
66
+ 2. Grant of Copyright License. Subject to the terms and conditions of
67
+ this License, each Contributor hereby grants to You a perpetual,
68
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
69
+ copyright license to reproduce, prepare Derivative Works of,
70
+ publicly display, publicly perform, sublicense, and distribute the
71
+ Work and such Derivative Works in Source or Object form.
72
+
73
+ 3. Grant of Patent License. Subject to the terms and conditions of
74
+ this License, each Contributor hereby grants to You a perpetual,
75
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
76
+ (except as stated in this section) patent license to make, have made,
77
+ use, offer to sell, sell, import, and otherwise transfer the Work,
78
+ where such license applies only to those patent claims licensable
79
+ by such Contributor that are necessarily infringed by their
80
+ Contribution(s) alone or by combination of their Contribution(s)
81
+ with the Work to which such Contribution(s) was submitted. If You
82
+ institute patent litigation against any entity (including a
83
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
84
+ or a Contribution incorporated within the Work constitutes direct
85
+ or contributory patent infringement, then any patent licenses
86
+ granted to You under this License for that Work shall terminate
87
+ as of the date such litigation is filed.
88
+
89
+ 4. Redistribution. You may reproduce and distribute copies of the
90
+ Work or Derivative Works thereof in any medium, with or without
91
+ modifications, and in Source or Object form, provided that You
92
+ meet the following conditions:
93
+
94
+ (a) You must give any other recipients of the Work or
95
+ Derivative Works a copy of this License; and
96
+
97
+ (b) You must cause any modified files to carry prominent notices
98
+ stating that You changed the files; and
99
+
100
+ (c) You must retain, in the Source form of any Derivative Works
101
+ that You distribute, all copyright, patent, trademark, and
102
+ attribution notices from the Source form of the Work,
103
+ excluding those notices that do not pertain to any part of
104
+ the Derivative Works; and
105
+
106
+ (d) If the Work includes a "NOTICE" text file as part of its
107
+ distribution, then any Derivative Works that You distribute must
108
+ include a readable copy of the attribution notices contained
109
+ within such NOTICE file, excluding those notices that do not
110
+ pertain to any part of the Derivative Works, in at least one
111
+ of the following places: within a NOTICE text file distributed
112
+ as part of the Derivative Works; within the Source form or
113
+ documentation, if provided along with the Derivative Works; or,
114
+ within a display generated by the Derivative Works, if and
115
+ wherever such third-party notices normally appear. The contents
116
+ of the NOTICE file are for informational purposes only and
117
+ do not modify the License. You may add Your own attribution
118
+ notices within Derivative Works that You distribute, alongside
119
+ or as an addendum to the NOTICE text from the Work, provided
120
+ that such additional attribution notices cannot be construed
121
+ as modifying the License.
122
+
123
+ You may add Your own copyright statement to Your modifications and
124
+ may provide additional or different license terms and conditions
125
+ for use, reproduction, or distribution of Your modifications, or
126
+ for any such Derivative Works as a whole, provided Your use,
127
+ reproduction, and distribution of the Work otherwise complies with
128
+ the conditions stated in this License.
129
+
130
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
131
+ any Contribution intentionally submitted for inclusion in the Work
132
+ by You to the Licensor shall be under the terms and conditions of
133
+ this License, without any additional terms or conditions.
134
+ Notwithstanding the above, nothing herein shall supersede or modify
135
+ the terms of any separate license agreement you may have executed
136
+ with Licensor regarding such Contributions.
137
+
138
+ 6. Trademarks. This License does not grant permission to use the trade
139
+ names, trademarks, service marks, or product names of the Licensor,
140
+ except as required for reasonable and customary use in describing the
141
+ origin of the Work and reproducing the content of the NOTICE file.
142
+
143
+ 7. Disclaimer of Warranty. Unless required by applicable law or
144
+ agreed to in writing, Licensor provides the Work (and each
145
+ Contributor provides its Contributions) on an "AS IS" BASIS,
146
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
147
+ implied, including, without limitation, any warranties or conditions
148
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
149
+ PARTICULAR PURPOSE. You are solely responsible for determining the
150
+ appropriateness of using or redistributing the Work and assume any
151
+ risks associated with Your exercise of permissions under this License.
152
+
153
+ 8. Limitation of Liability. In no event and under no legal theory,
154
+ whether in tort (including negligence), contract, or otherwise,
155
+ unless required by applicable law (such as deliberate and grossly
156
+ negligent acts) or agreed to in writing, shall any Contributor be
157
+ liable to You for damages, including any direct, indirect, special,
158
+ incidental, or consequential damages of any character arising as a
159
+ result of this License or out of the use or inability to use the
160
+ Work (including but not limited to damages for loss of goodwill,
161
+ work stoppage, computer failure or malfunction, or any and all
162
+ other commercial damages or losses), even if such Contributor
163
+ has been advised of the possibility of such damages.
164
+
165
+ 9. Accepting Warranty or Additional Liability. While redistributing
166
+ the Work or Derivative Works thereof, You may choose to offer,
167
+ and charge a fee for, acceptance of support, warranty, indemnity,
168
+ or other liability obligations and/or rights consistent with this
169
+ License. However, in accepting such obligations, You may act only
170
+ on Your own behalf and on Your sole responsibility, not on behalf
171
+ of any other Contributor, and only if You agree to indemnify,
172
+ defend, and hold each Contributor harmless for any liability
173
+ incurred by, or claims asserted against, such Contributor by reason
174
+ of your accepting any such warranty or additional liability.
175
+
176
+ END OF TERMS AND CONDITIONS
177
+
178
+ APPENDIX: How to apply the Apache License to your work.
179
+
180
+ To apply the Apache License to your work, attach the following
181
+ boilerplate notice, with the fields enclosed by brackets "[]"
182
+ replaced with your own identifying information. (Don't include
183
+ the brackets!) The text should be enclosed in the appropriate
184
+ comment syntax for the file format. We also recommend that a
185
+ file or class name and description of purpose be included on the
186
+ same "printed page" as the copyright notice for easier
187
+ identification within third-party archives.
188
+
189
+ Copyright [yyyy] [name of copyright owner]
190
+
191
+ Licensed under the Apache License, Version 2.0 (the "License");
192
+ you may not use this file except in compliance with the License.
193
+ You may obtain a copy of the License at
194
+
195
+ http://www.apache.org/licenses/LICENSE-2.0
196
+
197
+ Unless required by applicable law or agreed to in writing, software
198
+ distributed under the License is distributed on an "AS IS" BASIS,
199
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
200
+ See the License for the specific language governing permissions and
201
+ limitations under the License.
README.md CHANGED
@@ -1,12 +1,56 @@
1
  ---
2
- title: SNIPED Grasp-any-region
3
- emoji:
4
- colorFrom: green
5
- colorTo: purple
6
  sdk: gradio
7
  sdk_version: 5.49.1
8
  app_file: app.py
9
  pinned: false
10
  ---
11
 
12
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
1
  ---
2
+ title: "Grasp-Any-Region"
3
+ emoji: 🤖
4
+ colorFrom: yellow
5
+ colorTo: blue
6
  sdk: gradio
7
  sdk_version: 5.49.1
8
  app_file: app.py
9
  pinned: false
10
+ short_description: "Manual Entry: https://huggingface.co/papers/2510.18876"
11
+ hardware: zerogpu
12
+ tags:
13
+ - research
14
+ - paper
15
+ - code
16
+ - cheatcode
17
+ license: mit
18
  ---
19
 
20
+ # Grasp-Any-Region
21
+
22
+ **Automated upload by CheatCode** 🚀
23
+
24
+ ## 📄 Paper Information
25
+
26
+ - **Paper ID**: 2510.18876
27
+ - **Title**: Manual Entry: https://huggingface.co/papers/2510.18876
28
+ - **Original Repository**: [https://github.com/Haochen-Wang409/Grasp-Any-Region](https://github.com/Haochen-Wang409/Grasp-Any-Region)
29
+
30
+ ## 🛠️ Repository Information
31
+
32
+ - **Languages**: JavaScript, Python, Shell, TypeScript
33
+ - **Gradio App**: ✅ Generated by CheatCode
34
+
35
+ ## 🤖 About CheatCode
36
+
37
+ This Space was automatically created by [CheatCode](https://github.com/jbilcke-hf/CheatCode),
38
+ an AI-powered tool that:
39
+
40
+ 1. Discovers research papers from HuggingFace
41
+ 2. Extracts and analyzes linked repositories
42
+ 3. Generates Gradio demo applications
43
+ 4. Uploads everything to HuggingFace Spaces
44
+
45
+ ## 📝 Usage
46
+
47
+ This Space includes a Gradio app that was automatically generated from the repository code.
48
+
49
+ ## ⚠️ Disclaimer
50
+
51
+ This is an automated upload. The code comes from the original repository and may require
52
+ additional configuration or dependencies to run properly.
53
+
54
+ ## 📜 License
55
+
56
+ Please refer to the original repository for licensing information: https://github.com/Haochen-Wang409/Grasp-Any-Region
README_original.md ADDED
@@ -0,0 +1,208 @@
1
+ ---
2
+ title: Grasp Any Region - Region-Level Visual Understanding
3
+ emoji: 🎯
4
+ colorFrom: blue
5
+ colorTo: purple
6
+ sdk: gradio
7
+ sdk_version: 5.49.1
8
+ app_file: app.py
9
+ pinned: false
10
+ short_description: A multimodal model for precise region-level understanding and reasoning in images and videos
11
+ hardware: zerogpu
12
+ ---
13
+
14
+ # Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs
15
+
16
+ by
17
+ [Haochen Wang](https://haochen-wang409.github.io),
18
+ Yuhao Wang,
19
+ [Tao Zhang](https://scholar.google.com/citations?user=3xu4a5oAAAAJ),
20
+ [Yikang Zhou](https://scholar.google.com/citations?user=dZikW2YAAAAJ),
21
+ [Yanwei Li](https://yanwei-li.com/),
22
+ [Jiacong Wang](https://scholar.google.com/citations?user=rzYgLkgAAAAJ),
23
+ [Ye Tian](https://scholar.google.com/citations?user=vUY_PIUAAAAJ),
24
+ [Jiahao Meng](https://scholar.google.com/citations?user=NJfjvfIAAAAJ),
25
+ [Zilong Huang](https://speedinghzl.github.io/),
26
+ [Guangcan Mai](https://scholar.google.com/citations?user=739cUNMAAAAJ),
27
+ [Anran Wang](https://sites.google.com/view/anranwang/home),
28
+ [Yunhai Tong](https://scholar.google.com/citations?user=T4gqdPkAAAAJ),
29
+ Zhuochen Wang,
30
+ [Xiangtai Li](https://lxtgh.github.io/), and
31
+ [Zhaoxiang Zhang](https://scholar.google.com/citations?user=qxWfV6cAAAAJ).
32
+
33
+ [[Paper](https://arxiv.org/abs/2510.18876)] | [[HuggingFace](https://huggingface.co/collections/HaochenWang/grasp-any-region-68f7433671030d6ea682f692)] | [[Citation](#citation)]
34
+
35
+ **TL;DR**: Our Grasp Any Region (GAR) supports both (1) describing a *single* region of an image or a video in detail, specified via points/boxes/scribbles/masks, and (2) understanding *multiple* regions, such as modeling interactions and performing complex reasoning. We also release a new benchmark, GARBench, to evaluate models on advanced region-level understanding tasks.
36
+
37
+ ![](./assets/teaser.png)
38
+
39
+ > **Abstract.** While Multimodal Large Language Models (MLLMs) excel at holistic understanding, they struggle
40
+ > in capturing the dense world with complex scenes, requiring fine-grained analysis of intricate
41
+ > details and object inter-relationships. Region-level MLLMs have been a promising step. However,
42
+ > previous attempts are generally optimized to understand given regions in isolation, neglecting
43
+ > crucial global contexts. To address this, we introduce Grasp Any Region (GAR) for comprehensive
44
+ > region-level visual understanding. Empowered by an effective RoI-aligned feature replay
45
+ > technique, GAR supports (1) precise perception by leveraging necessary global contexts, and (2)
46
+ > modeling interactions between multiple prompts. Together, it then naturally achieves (3) advanced
47
+ > compositional reasoning to answer specific free-form questions about any region, shifting the
48
+ > paradigm from passive description to active dialogue. Moreover, we construct GARBench, which
49
+ > not only provides a more accurate evaluation of single-region comprehension, but also, more
50
+ > importantly, measures interactions and complex reasoning across multiple regions. Extensive
51
+ > experiments have demonstrated that GAR-1B not only maintains the state-of-the-art captioning
52
+ > capabilities, e.g., outperforming DAM-3B +4.5 on DLC-Bench, but also excels at modeling rela-
53
+ > tionships between multiple prompts with advanced comprehension capabilities, even surpassing
54
+ > InternVL3-78B on GARBench-VQA. More importantly, our zero-shot GAR-8B even outperforms
55
+ > in-domain VideoRefer-7B on VideoRefer-BenchQ, indicating its strong capabilities can be easily
56
+ > transferred to videos.
57
+
58
+ # Installation
59
+
60
+ ```bash
61
+ conda create -n gar python=3.11.2 -y
62
+ conda activate gar
63
+
64
+ pip3 install xtuner==0.2.0rc0
65
+ pip3 install -r requirements.txt
66
+ pip3 install flash-attn==2.7.4.post1 --no-build-isolation -v
67
+ ```
68
+
69
+ # Demos
70
+
71
+ ## Gradio Demo
72
+
73
+ Please refer to [`demo/gradio/README.md`](demo/gradio/README.md) for serving an online captioning demo using gradio.
74
+
75
+ ## Examples
76
+
77
+ ### Detailed Localized Image Descriptions with Masks
78
+
79
+ - [`demo/gar_with_mask.py`](demo/gar_with_mask.py) - Command-line tool for processing single images, allowing users to specify the region-of-interest using its segmentation mask.
80
+
81
+ <details>
82
+ <summary>Expand to see example commands</summary>
83
+
84
+ <img src="assets/1_output_visualization.png" width="400">
85
+
86
+ ```bash
87
+ torchrun --nproc-per-node=1 --master-port=8119 demo/gar_with_mask.py --image_path assets/demo_image_1.png --mask_path assets/demo_mask_1.png
88
+ ```
89
+
90
+ **Input instruction:** Describe the masked region in detail.
91
+
92
+ **Output answer:** A bright green, **frog-shaped slipper** with a smooth, rounded body and a wide, open mouth. The slipper has a small, raised bump on the top of its head, resembling a frog's eye.
93
+
94
+ </details>
95
+
96
+ ### Detailed Localized Image Descriptions with SAM
97
+
98
+ - [`demo/gar_with_sam.py`](demo/gar_with_sam.py) - Command-line tool for processing single images using SAM v1, allowing users to specify points or bounding boxes for mask generation
99
+
100
+ <details>
101
+ <summary>Expand to see example commands</summary>
102
+
103
+ <img src="assets/2_output_visualization.png" width="400">
104
+
105
+ ```bash
106
+ # You can use it with points or a bounding box for the region of interest.
107
+ # SAM is used to turn points or a bounding box into a mask.
108
+ # You can also use a mask directly, see `demo/gar_with_mask.py`.
109
+ torchrun --nproc-per-node=1 --master-port=8119 demo/gar_with_sam.py --image_path assets/demo_image_2.jpg --points '[[1172, 812], [1572, 800]]' --output_image_path output_visualization.png
110
+ torchrun --nproc-per-node=1 --master-port=8119 demo/gar_with_sam.py --image_path assets/demo_image_2.jpg --box '[800, 500, 1800, 1000]' --use_box --output_image_path output_visualization.png
111
+ ```
112
+
113
+ **Input instruction:** Describe the masked region in detail.
114
+
115
+ **Output answer:** A medium-sized, short-haired dog with a predominantly tan coat featuring white markings on its face, chest, and paws. The dog has a white stripe running down the center of its face, extending from the forehead to the nose. Its ears are large, pointed, and stand erect. The dog is wearing a red collar with a visible tag. Its mouth is open, revealing its tongue and teeth, and it appears to be in mid-leap with its front legs extended forward and hind legs stretched out behind.
116
+
117
+ </details>
118
+
119
+ ### Modeling Complex Relationship between Multiple Regions
120
+
121
+ - [`demo/gar_relationship.py`](demo/gar_relationship.py) - Command-line tool for processing single images with multiple regions-of-interest, allowing users to specify each region-of-interest using its segmentation mask
122
+
123
+ <details>
124
+ <summary>Expand to see example commands</summary>
125
+
126
+ <img src="assets/3_output_visualization.png" width="400">
127
+
128
+ ```bash
129
+ torchrun --nproc-per-node=1 --master-port=8119 demo/gar_relationship.py --image_path assets/demo_image_3.png --mask_paths "['assets/demo_mask_3_0.png', 'assets/demo_mask_3_1.png', 'assets/demo_mask_3_2.png']" --question_str 'Question: What is the relationship between <Prompt0>, <Prompt1>, and <Prompt2>?\nOptions:\nA. <Prompt0> is using <Prompt2> to point at <Prompt1>\nB. <Prompt0> has already hit <Prompt1> with <Prompt2>\nC. <Prompt0> is swinging <Prompt2> and is about to hit <Prompt1>\nD. <Prompt0> is holding <Prompt2> while looking away from <Prompt1>'
130
+ ```
131
+
132
+ **Input instruction:**
133
+
134
+ ```
135
+ Question: What is the relationship between <Prompt0>, <Prompt1>, and <Prompt2>?
136
+ Options:
137
+ A. <Prompt0> is using <Prompt2> to point at <Prompt1>
138
+ B. <Prompt0> has already hit <Prompt1> with <Prompt2>
139
+ C. <Prompt0> is swinging <Prompt2> and is about to hit <Prompt1>
140
+ D. <Prompt0> is holding <Prompt2> while looking away from <Prompt1>
141
+ Answer with the correct option's letter directly.
142
+ ```
143
+
144
+ **Output answer:** C
145
+
146
+ Note that `<Prompt0>`, `<Prompt1>`, and `<Prompt2>` are illustrated in <span style="color:#C00000;">red</span>, <span style="color:#00B050;">green</span>, and <span style="color:#0000FF;">blue</span>, respectively.
147
+
148
+ </details>
149
+
150
+ # Training
151
+
152
+ **1. Dataset Preparation**
153
+
154
+ First, download the dataset:
155
+
156
+ `hf download HaochenWang/Grasp-Any-Region-Dataset --local-dir data --repo-type dataset`
157
+
158
+ The overall data structure should be:
159
+ ```sh
160
+ data
161
+ ├── Fine-Grained-Dataset
162
+ │ └── data-*-of-*.arrow
163
+ ├── Relation-Dataset
164
+ │ └── data-*-of-*.arrow
165
+ └── Seed-Dataset
166
+ └── data-*-of-*.arrow
167
+ ```
168
+
169
+ **2. Launch Training**
170
+
171
+ Next, run the following script to train using 8 GPUs:
172
+
173
+ `bash tools/dist.sh train projects/grasp_any_region/configs/gar_1b.py 8`
174
+
175
+ **3. Convert to HuggingFace Format**
176
+
177
+ ```python3 projects/grasp_any_region/hf_models/convert_to_hf.py projects/grasp_any_region/configs/gar_1b.py --pth-model PATH_TO_PTH_MODEL --save-path PATH_TO_SAVE_FOLDER```
178
+
179
+ Note that this script only converts the checkpoint; some `*.py` files must be copied manually to `${PATH_TO_SAVE_FOLDER}`.
180
+
181
+ # Evaluation
182
+
183
+ Please refer to [`evaluation/EVALUATION.md`](evaluation/EVALUATION.md).
184
+
185
+ # License
186
+
187
+ This project is licensed under the [Apache-2.0 License](LICENSE).
188
+
189
+ # Citation
190
+
191
+ If you use our work or our implementation in this repo, or find them helpful, please consider giving a citation in the following format.
192
+
193
+ ```
194
+ @article{wang2025grasp,
195
+ title={Grasp Any Region: Prompting MLLM to Understand the Dense World},
196
+ author={Haochen Wang and Yuhao Wang and Tao Zhang and Yikang Zhou and Yanwei Li and Jiacong Wang and Ye Tian and Jiahao Meng and Zilong Huang and Guangcan Mai and Anran Wang and Yunhai Tong and Zhuochen Wang and Xiangtai Li and Zhaoxiang Zhang},
197
+ journal={arXiv preprint arXiv:2510.18876},
198
+ year={2025}
199
+ }
200
+ ```
201
+
202
+ # Acknowledgements
203
+
204
+ We would like to thank the following projects for their contributions to this work:
205
+
206
+ - [SAM](https://github.com/facebookresearch/segment-anything)
207
+ - [DAM](https://github.com/NVlabs/describe-anything)
208
+ - [Sa2VA](https://github.com/bytedance/Sa2VA)
app.py ADDED
@@ -0,0 +1,442 @@
1
+ # *************************************************************************
2
+ # Grasp Any Region (GAR) - Gradio Demo
3
+ # Region-level Multimodal Understanding for Vision-Language Models
4
+ # *************************************************************************
5
+
6
+ # 🚨 CRITICAL: Import spaces FIRST before any CUDA-related packages
7
+ import spaces
8
+
9
+ # Now import CUDA-related packages
10
+ import torch
11
+ import numpy as np
12
+ from PIL import Image
13
+ import gradio as gr
14
+ from transformers import (
15
+ AutoModel,
16
+ AutoProcessor,
17
+ GenerationConfig,
18
+ SamModel,
19
+ SamProcessor,
20
+ )
21
+ import cv2
22
+ import sys
23
+ import os
24
+
25
+ # Add project root to path for imports
26
+ sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
27
+
28
+ try:
29
+ from evaluation.eval_dataset import SingleRegionCaptionDataset
30
+ except ImportError:
31
+ print("Warning: Could not import SingleRegionCaptionDataset. Using simplified version.")
32
+ SingleRegionCaptionDataset = None
33
+
34
+ # Initialize device
35
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
36
+
37
+ # Global model variables (loaded once)
38
+ gar_model = None
39
+ gar_processor = None
40
+ sam_model = None
41
+ sam_processor = None
42
+
43
+ def load_models():
44
+ """Load models once at startup"""
45
+ global gar_model, gar_processor, sam_model, sam_processor
46
+
47
+ if gar_model is None:
48
+ print("Loading GAR model...")
49
+ model_path = "HaochenWang/GAR-1B"
50
+ gar_model = AutoModel.from_pretrained(
51
+ model_path,
52
+ trust_remote_code=True,
53
+ torch_dtype=torch.bfloat16,
54
+ device_map="auto",
55
+ ).eval()
56
+
57
+ gar_processor = AutoProcessor.from_pretrained(
58
+ model_path,
59
+ trust_remote_code=True,
60
+ )
61
+ print("GAR model loaded successfully!")
62
+
63
+ if sam_model is None:
64
+ print("Loading SAM model...")
65
+ sam_model = SamModel.from_pretrained("facebook/sam-vit-huge").to(device)
66
+ sam_processor = SamProcessor.from_pretrained("facebook/sam-vit-huge")
67
+ print("SAM model loaded successfully!")
68
+
69
+ @spaces.GPU(duration=120)
70
+ def generate_mask_from_points(image, points_str):
71
+ """Generate mask using SAM from point coordinates"""
72
+ try:
73
+ load_models()
74
+
75
+ if not points_str or points_str.strip() == "":
76
+ return None, "Please provide points in format: x1,y1;x2,y2"
77
+
78
+ # Parse points
79
+ points = []
80
+ labels = []
81
+ for point in points_str.split(';'):
82
+ point = point.strip()
83
+ if point:
84
+ x, y = map(float, point.split(','))
85
+ points.append([x, y])
86
+ labels.append(1) # Foreground point
87
+
88
+ if not points:
89
+ return None, "No valid points provided"
90
+
91
+ # Apply SAM
92
+ inputs = sam_processor(
93
+ image,
94
+ input_points=[points],
95
+ input_labels=[labels],
96
+ return_tensors="pt",
97
+ ).to(device)
98
+
99
+ with torch.no_grad():
100
+ outputs = sam_model(**inputs)
101
+
102
+ masks = sam_processor.image_processor.post_process_masks(
103
+ outputs.pred_masks.cpu(),
104
+ inputs["original_sizes"].cpu(),
105
+ inputs["reshaped_input_sizes"].cpu(),
106
+ )[0][0]
107
+
108
+ scores = outputs.iou_scores[0, 0]
109
+ mask_selection_index = scores.argmax()
110
+ mask_np = masks[mask_selection_index].numpy()
111
+
112
+ # Visualize mask
113
+ mask_img = (mask_np * 255).astype(np.uint8)
114
+
115
+ return Image.fromarray(mask_img), "Mask generated successfully!"
116
+
117
+ except Exception as e:
118
+ return None, f"Error generating mask: {str(e)}"
119
+
120
+ @spaces.GPU(duration=120)
121
+ def generate_mask_from_box(image, box_str):
122
+ """Generate mask using SAM from bounding box"""
123
+ try:
124
+ load_models()
125
+
126
+ if not box_str or box_str.strip() == "":
127
+ return None, "Please provide box in format: x1,y1,x2,y2"
128
+
129
+ # Parse box
130
+ box = list(map(float, box_str.split(',')))
131
+ if len(box) != 4:
132
+ return None, "Box must have 4 coordinates: x1,y1,x2,y2"
133
+
134
+ # Apply SAM
135
+ inputs = sam_processor(
136
+ image,
137
+ input_boxes=[[box]],
138
+ return_tensors="pt",
139
+ ).to(device)
140
+
141
+ with torch.no_grad():
142
+ outputs = sam_model(**inputs)
143
+
144
+ masks = sam_processor.image_processor.post_process_masks(
145
+ outputs.pred_masks.cpu(),
146
+ inputs["original_sizes"].cpu(),
147
+ inputs["reshaped_input_sizes"].cpu(),
148
+ )[0][0]
149
+
150
+ scores = outputs.iou_scores[0, 0]
151
+ mask_selection_index = scores.argmax()
152
+ mask_np = masks[mask_selection_index].numpy()
153
+
154
+ # Visualize mask
155
+ mask_img = (mask_np * 255).astype(np.uint8)
156
+
157
+ return Image.fromarray(mask_img), "Mask generated successfully!"
158
+
159
+ except Exception as e:
160
+ return None, f"Error generating mask: {str(e)}"
161
+
162
+ @spaces.GPU(duration=120)
163
+ def describe_region(image, mask):
164
+ """Generate description for a region defined by a mask"""
165
+ try:
166
+ load_models()
167
+
168
+ if image is None:
169
+ return "Please provide an image"
170
+
171
+ if mask is None:
172
+ return "Please provide a mask (upload or generate using SAM)"
173
+
174
+ # Convert mask to numpy
175
+ if isinstance(mask, Image.Image):
176
+ mask_np = np.array(mask.convert("L"))
177
+ else:
178
+ mask_np = np.array(mask)
179
+
180
+ # Ensure mask is binary
181
+ mask_np = (mask_np > 127).astype(np.uint8)
182
+
183
+ # Prepare data
184
+ prompt_number = gar_model.config.prompt_numbers
185
+ prompt_tokens = [f"<Prompt{i_p}>" for i_p in range(prompt_number)] + ["<NO_Prompt>"]
186
+
187
+ if SingleRegionCaptionDataset is not None:
188
+ dataset = SingleRegionCaptionDataset(
189
+ image=image,
190
+ mask=mask_np,
191
+ processor=gar_processor,
192
+ prompt_number=prompt_number,
193
+ visual_prompt_tokens=prompt_tokens,
194
+ data_dtype=torch.bfloat16,
195
+ )
196
+ data_sample = dataset[0]
197
+ else:
198
+ # Simplified processing if dataset class not available
199
+ # This is a fallback - the actual implementation requires SingleRegionCaptionDataset
200
+ return "Error: SingleRegionCaptionDataset not available. Please check installation."
201
+
202
+ # Generate description
203
+ with torch.no_grad():
204
+ generate_ids = gar_model.generate(
205
+ **data_sample,
206
+ generation_config=GenerationConfig(
207
+ max_new_tokens=1024,
208
+ do_sample=False,
209
+ eos_token_id=gar_processor.tokenizer.eos_token_id,
210
+ pad_token_id=gar_processor.tokenizer.pad_token_id,
211
+ ),
212
+ return_dict=True,
213
+ )
214
+
215
+ output_caption = gar_processor.tokenizer.decode(
216
+ generate_ids.sequences[0], skip_special_tokens=True
217
+ ).strip()
218
+
219
+ return output_caption
220
+
221
+ except Exception as e:
222
+ return f"Error generating description: {str(e)}"
223
+
224
+ def create_visualization(image, mask, points_str=None, box_str=None):
225
+ """Create visualization with mask overlay"""
226
+ try:
227
+ if image is None or mask is None:
228
+ return None
229
+
230
+ img_np = np.array(image).astype(float) / 255.0
231
+ if isinstance(mask, Image.Image):
232
+ mask_np = np.array(mask.convert("L")) > 127
233
+ else:
234
+ mask_np = np.array(mask) > 127
235
+
236
+ # Draw contour
237
+ mask_uint8 = mask_np.astype(np.uint8) * 255
238
+ contours, _ = cv2.findContours(mask_uint8, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
239
+ img_vis = img_np.copy()
240
+ cv2.drawContours(img_vis, contours, -1, (1.0, 1.0, 0.0), thickness=3)
241
+
242
+ # Draw points if provided
243
+ if points_str:
244
+ for point in points_str.split(';'):
245
+ point = point.strip()
246
+ if point:
247
+ x, y = map(float, point.split(','))
248
+ cv2.circle(img_vis, (int(x), int(y)), radius=8, color=(1.0, 0.0, 0.0), thickness=-1)
249
+ cv2.circle(img_vis, (int(x), int(y)), radius=8, color=(1.0, 1.0, 1.0), thickness=2)
250
+
251
+ # Draw box if provided
252
+ if box_str:
253
+ coords = list(map(float, box_str.split(',')))
254
+ if len(coords) == 4:
255
+ x1, y1, x2, y2 = map(int, coords)
256
+ cv2.rectangle(img_vis, (x1, y1), (x2, y2), color=(1.0, 1.0, 1.0), thickness=3)
257
+ cv2.rectangle(img_vis, (x1, y1), (x2, y2), color=(1.0, 0.0, 0.0), thickness=1)
258
+
259
+ img_pil = Image.fromarray((img_vis * 255.0).astype(np.uint8))
260
+ return img_pil
261
+
262
+ except Exception as e:
263
+ print(f"Error creating visualization: {str(e)}")
264
+ return None
265
+
266
+ # Create Gradio interface
267
+ with gr.Blocks(title="Grasp Any Region (GAR) Demo", theme=gr.themes.Soft()) as demo:
268
+ gr.Markdown("""
269
+ # 🎯 Grasp Any Region (GAR)
270
+
271
+ **Region-level Multimodal Understanding for Vision-Language Models**
272
+
273
+ This demo showcases GAR's ability to understand and describe specific regions in images:
274
+ - 🎨 **Single Region Understanding**: Describe specific areas using points, boxes, or masks
275
+ - 🔍 **SAM Integration**: Generate masks interactively using Segment Anything Model
276
+ - 💡 **Detailed Descriptions**: Get comprehensive descriptions of any region
277
+
278
+ Built on top of Perception-LM with RoI-aligned feature replay technique.
279
+
280
+ 📄 [Paper](https://arxiv.org/abs/2510.18876) | 💻 [GitHub](https://github.com/Haochen-Wang409/Grasp-Any-Region) | 🤗 [Model](https://huggingface.co/HaochenWang/GAR-1B)
281
+ """)
282
+
283
+ with gr.Tabs():
284
+ # Tab 1: Points-based segmentation
285
+ with gr.Tab("🎯 Points → Describe"):
286
+ gr.Markdown("### Click points on the image or enter coordinates to segment and describe a region")
287
+ with gr.Row():
288
+ with gr.Column():
289
+ img_points = gr.Image(label="Input Image", type="pil")
290
+ points_input = gr.Textbox(
291
+ label="Points (format: x1,y1;x2,y2;...)",
292
+ placeholder="e.g., 1172,812;1572,800",
293
+ value="1172,812;1572,800"
294
+ )
295
+ with gr.Row():
296
+ gen_mask_points_btn = gr.Button("Generate Mask", variant="primary")
297
+ describe_points_btn = gr.Button("Describe Region", variant="secondary")
298
+
299
+ with gr.Column():
300
+ mask_points = gr.Image(label="Generated Mask", type="pil")
301
+ vis_points = gr.Image(label="Visualization")
302
+ desc_points = gr.Textbox(label="Region Description", lines=5)
303
+
304
+ points_status = gr.Textbox(label="Status", visible=False)
305
+
306
+ gen_mask_points_btn.click(
307
+ fn=generate_mask_from_points,
308
+ inputs=[img_points, points_input],
309
+ outputs=[mask_points, points_status]
310
+ )
311
+
312
+ describe_points_btn.click(
313
+ fn=describe_region,
314
+ inputs=[img_points, mask_points],
315
+ outputs=desc_points
316
+ ).then(
317
+ fn=create_visualization,
318
+ inputs=[img_points, mask_points, points_input, gr.Textbox(visible=False)],
319
+ outputs=vis_points
320
+ )
321
+
322
+ gr.Examples(
323
+ examples=[
324
+ ["assets/demo_image_2.jpg", "1172,812;1572,800"],
325
+ ],
326
+ inputs=[img_points, points_input],
327
+ label="Example Images"
328
+ )
329
+
330
+ # Tab 2: Box-based segmentation
331
+ with gr.Tab("📦 Box → Describe"):
332
+ gr.Markdown("### Draw a bounding box or enter coordinates to segment and describe a region")
333
+ with gr.Row():
334
+ with gr.Column():
335
+ img_box = gr.Image(label="Input Image", type="pil")
336
+ box_input = gr.Textbox(
337
+ label="Bounding Box (format: x1,y1,x2,y2)",
338
+ placeholder="e.g., 800,500,1800,1000",
339
+ value="800,500,1800,1000"
340
+ )
341
+ with gr.Row():
342
+ gen_mask_box_btn = gr.Button("Generate Mask", variant="primary")
343
+ describe_box_btn = gr.Button("Describe Region", variant="secondary")
344
+
345
+ with gr.Column():
346
+ mask_box = gr.Image(label="Generated Mask", type="pil")
347
+ vis_box = gr.Image(label="Visualization")
348
+ desc_box = gr.Textbox(label="Region Description", lines=5)
349
+
350
+ box_status = gr.Textbox(label="Status", visible=False)
351
+
352
+ gen_mask_box_btn.click(
353
+ fn=generate_mask_from_box,
354
+ inputs=[img_box, box_input],
355
+ outputs=[mask_box, box_status]
356
+ )
357
+
358
+ describe_box_btn.click(
359
+ fn=describe_region,
360
+ inputs=[img_box, mask_box],
361
+ outputs=desc_box
362
+ ).then(
363
+ fn=create_visualization,
364
+ inputs=[img_box, mask_box, gr.Textbox(visible=False), box_input],
365
+ outputs=vis_box
366
+ )
367
+
368
+ gr.Examples(
369
+ examples=[
370
+ ["assets/demo_image_2.jpg", "800,500,1800,1000"],
371
+ ],
372
+ inputs=[img_box, box_input],
373
+ label="Example Images"
374
+ )
375
+
376
+ # Tab 3: Direct mask upload
377
+ with gr.Tab("🎭 Mask → Describe"):
378
+ gr.Markdown("### Upload a pre-made mask to describe a region")
379
+ with gr.Row():
380
+ with gr.Column():
381
+ img_mask = gr.Image(label="Input Image", type="pil")
382
+ mask_upload = gr.Image(label="Upload Mask", type="pil")
383
+ describe_mask_btn = gr.Button("Describe Region", variant="primary")
384
+
385
+ with gr.Column():
386
+ vis_mask = gr.Image(label="Visualization")
387
+ desc_mask = gr.Textbox(label="Region Description", lines=5)
388
+
389
+ describe_mask_btn.click(
390
+ fn=describe_region,
391
+ inputs=[img_mask, mask_upload],
392
+ outputs=desc_mask
393
+ ).then(
394
+ fn=create_visualization,
395
+ inputs=[img_mask, mask_upload, gr.Textbox(visible=False), gr.Textbox(visible=False)],
396
+ outputs=vis_mask
397
+ )
398
+
399
+ gr.Examples(
400
+ examples=[
401
+ ["assets/demo_image_1.png", "assets/demo_mask_1.png"],
402
+ ],
403
+ inputs=[img_mask, mask_upload],
404
+ label="Example Images"
405
+ )
406
+
407
+ gr.Markdown("""
408
+ ---
409
+ ### 📖 How to Use:
410
+
411
+ 1. **Points → Describe**: Click or enter point coordinates, generate mask, then describe
412
+ 2. **Box → Describe**: Draw or enter a bounding box, generate mask, then describe
413
+ 3. **Mask → Describe**: Upload a pre-made mask directly and describe
414
+
415
+ ### 🔧 Technical Details:
416
+
417
+ - **Model**: GAR-1B (1 billion parameters)
418
+ - **Base**: Facebook Perception-LM with RoI-aligned feature replay
419
+ - **Segmentation**: Segment Anything Model (SAM ViT-Huge)
420
+ - **Hardware**: Powered by ZeroGPU (NVIDIA H200, 70GB VRAM)
421
+
422
+ ### 📚 Citation:
423
+
424
+ ```bibtex
425
+ @article{wang2025grasp,
426
+ title={Grasp Any Region: Prompting MLLM to Understand the Dense World},
427
+ author={Wang, Haochen and others},
428
+ journal={arXiv preprint arXiv:2510.18876},
429
+ year={2025}
430
+ }
431
+ ```
432
+ """)
433
+
434
+ # Load models on startup
435
+ try:
436
+ load_models()
437
+ except Exception as e:
438
+ print(f"Warning: Could not pre-load models: {e}")
439
+ print("Models will be loaded on first use.")
440
+
441
+ if __name__ == "__main__":
442
+ demo.launch()
demo/gar_relationship.py ADDED
@@ -0,0 +1,143 @@
1
+ # --------------------------------------------------------
2
+ # Copyright (2025) Bytedance Ltd. and/or its affiliates
3
+ # Licensed under the Apache License, Version 2.0 (the "License")
4
+ # Grasp Any Region Project
5
+ # Written by Haochen Wang
6
+ # --------------------------------------------------------
7
+
8
+ import argparse
9
+ import ast
10
+
11
+ import numpy as np
12
+ import torch
13
+ from PIL import Image
14
+ from transformers import AutoModel, AutoProcessor, GenerationConfig
15
+
16
+ from evaluation.eval_dataset import MultiRegionDataset
17
+
18
+ TORCH_DTYPE_MAP = dict(fp16=torch.float16, bf16=torch.bfloat16, fp32=torch.float32)
19
+
20
+
21
+ def parse_args():
22
+ parser = argparse.ArgumentParser(
23
+ description="Inference of Grasp Any Region models on DLC-Bench."
24
+ )
25
+
26
+ parser.add_argument(
27
+ "--model_name_or_path",
28
+ help="HF model name or path",
29
+ default="HaochenWang/GAR-8B",
30
+ )
31
+ parser.add_argument(
32
+ "--image_path",
33
+ help="image path",
34
+ required=True,
35
+ )
36
+ parser.add_argument(
37
+ "--mask_paths",
38
+ help="mask path",
39
+ required=True,
40
+ )
41
+ parser.add_argument(
42
+ "--question_str",
43
+ help="input instructions",
44
+ required=True,
45
+ )
46
+ parser.add_argument(
47
+ "--data_type",
48
+ help="data dtype",
49
+ type=str,
50
+ choices=["fp16", "bf16", "fp32"],
51
+ default="bf16",
52
+ )
53
+ parser.add_argument(
54
+ "--seed",
55
+ type=int,
56
+ default=0,
57
+ help="Random seed for reproducible text generation",
58
+ )
59
+ args = parser.parse_args()
60
+ return args
61
+
62
+
63
+ def select_ann(coco, img_id, area_min=None, area_max=None):
64
+ cat_ids = coco.getCatIds()
65
+ ann_ids = coco.getAnnIds(imgIds=[img_id], catIds=cat_ids, iscrowd=None)
66
+
67
+ if area_min is not None:
68
+ ann_ids = [
69
+ ann_id for ann_id in ann_ids if coco.anns[ann_id]["area"] >= area_min
70
+ ]
71
+
72
+ if area_max is not None:
73
+ ann_ids = [
74
+ ann_id for ann_id in ann_ids if coco.anns[ann_id]["area"] <= area_max
75
+ ]
76
+
77
+ return ann_ids
78
+
79
+
80
+ def main():
81
+ args = parse_args()
82
+ data_dtype = TORCH_DTYPE_MAP[args.data_type]
83
+ torch.manual_seed(args.seed)
84
+
85
+ # init distribution for dispatch_modules in the LLM
86
+ torch.cuda.set_device(0)
87
+ torch.distributed.init_process_group(backend="nccl")
88
+
89
+ # build HF model
90
+ model = AutoModel.from_pretrained(
91
+ args.model_name_or_path,
92
+ trust_remote_code=True,
93
+ torch_dtype=data_dtype,
94
+ device_map="cuda:0",
95
+ ).eval()
96
+
97
+ processor = AutoProcessor.from_pretrained(
98
+ args.model_name_or_path,
99
+ trust_remote_code=True,
100
+ )
101
+
102
+ img = Image.open(args.image_path)
103
+ masks = []
104
+ for mask_path in ast.literal_eval(args.mask_paths):
105
+ mask = np.array(Image.open(mask_path).convert("L")).astype(bool)
106
+ masks.append(mask)
107
+
108
+ prompt_number = model.config.prompt_numbers
109
+ prompt_tokens = [f"<Prompt{i_p}>" for i_p in range(prompt_number)] + ["<NO_Prompt>"]
110
+ dataset = MultiRegionDataset(
111
+ image=img,
112
+ masks=masks,
113
+ question_str=args.question_str
114
+ + "\nAnswer with the correct option's letter directly.",
115
+ processor=processor,
116
+ prompt_number=prompt_number,
117
+ visual_prompt_tokens=prompt_tokens,
118
+ data_dtype=data_dtype,
119
+ )
120
+
121
+ data_sample = dataset[0]
122
+
123
+ with torch.no_grad():
124
+ generate_ids = model.generate(
125
+ **data_sample,
126
+ generation_config=GenerationConfig(
127
+ max_new_tokens=1024,
128
+ do_sample=False,
129
+ eos_token_id=processor.tokenizer.eos_token_id,
130
+ pad_token_id=processor.tokenizer.pad_token_id,
131
+ ),
132
+ return_dict=True,
133
+ )
134
+
135
+ outputs = processor.tokenizer.decode(
136
+ generate_ids.sequences[0], skip_special_tokens=True
137
+ ).strip()
138
+
139
+ print(outputs) # Print model output for this image
140
+
141
+
142
+ if __name__ == "__main__":
143
+ main()
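+
+ # Example invocation (illustrative paths and question; --mask_paths takes a Python list
+ # literal, and the question should be multiple-choice since the script appends
+ # "Answer with the correct option's letter directly."; PYTHONPATH=. and the single-process
+ # torchrun launch are assumptions so that the evaluation import resolves and the
+ # init_process_group call above can initialize):
+ #   PYTHONPATH=. torchrun --nproc_per_node=1 demo/gar_relationship.py \
+ #       --image_path assets/demo_image_1.png \
+ #       --mask_paths "['assets/demo_mask_1.png', 'assets/demo_mask_2.png']" \
+ #       --question_str "What is the relationship between the highlighted regions? A. ... B. ..."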
demo/gar_with_mask.py ADDED
@@ -0,0 +1,132 @@
1
+ # --------------------------------------------------------
2
+ # Copyright (2025) Bytedance Ltd. and/or its affiliates
3
+ # Licensed under the Apache License, Version 2.0 (the "License")
4
+ # Grasp Any Region Project
5
+ # Written by Haochen Wang
6
+ # --------------------------------------------------------
7
+
8
+ import argparse
9
+
10
+ import numpy as np
11
+ import torch
12
+ from PIL import Image
13
+ from transformers import AutoModel, AutoProcessor, GenerationConfig
14
+
15
+ from evaluation.eval_dataset import SingleRegionCaptionDataset
16
+
17
+ TORCH_DTYPE_MAP = dict(fp16=torch.float16, bf16=torch.bfloat16, fp32=torch.float32)
18
+
19
+
20
+ def parse_args():
21
+ parser = argparse.ArgumentParser(
22
+ description="Inference demo of Grasp Any Region models."
23
+ )
24
+
25
+ parser.add_argument(
26
+ "--model_name_or_path",
27
+ help="HF model name or path",
28
+ default="HaochenWang/GAR-8B",
29
+ )
30
+ parser.add_argument(
31
+ "--image_path",
32
+ help="image path",
33
+ required=True,
34
+ )
35
+ parser.add_argument(
36
+ "--mask_path",
37
+ help="mask path",
38
+ required=True,
39
+ )
40
+ parser.add_argument(
41
+ "--data_type",
42
+ help="data dtype",
43
+ type=str,
44
+ choices=["fp16", "bf16", "fp32"],
45
+ default="bf16",
46
+ )
47
+ parser.add_argument(
48
+ "--seed",
49
+ type=int,
50
+ default=0,
51
+ help="Random seed for reproducible text generation",
52
+ )
53
+ args = parser.parse_args()
54
+ return args
55
+
56
+
57
+ def select_ann(coco, img_id, area_min=None, area_max=None):
58
+ cat_ids = coco.getCatIds()
59
+ ann_ids = coco.getAnnIds(imgIds=[img_id], catIds=cat_ids, iscrowd=None)
60
+
61
+ if area_min is not None:
62
+ ann_ids = [
63
+ ann_id for ann_id in ann_ids if coco.anns[ann_id]["area"] >= area_min
64
+ ]
65
+
66
+ if area_max is not None:
67
+ ann_ids = [
68
+ ann_id for ann_id in ann_ids if coco.anns[ann_id]["area"] <= area_max
69
+ ]
70
+
71
+ return ann_ids
72
+
73
+
74
+ def main():
75
+ args = parse_args()
76
+ data_dtype = TORCH_DTYPE_MAP[args.data_type]
77
+ torch.manual_seed(args.seed)
78
+
79
+ # init distribution for dispatch_modules in the LLM
80
+ torch.cuda.set_device(0)
81
+ torch.distributed.init_process_group(backend="nccl")
82
+
83
+ # build HF model
84
+ model = AutoModel.from_pretrained(
85
+ args.model_name_or_path,
86
+ trust_remote_code=True,
87
+ torch_dtype=data_dtype,
88
+ device_map="cuda:0",
89
+ ).eval()
90
+
91
+ processor = AutoProcessor.from_pretrained(
92
+ args.model_name_or_path,
93
+ trust_remote_code=True,
94
+ )
95
+
96
+ img = Image.open(args.image_path)
97
+ mask = np.array(Image.open(args.mask_path).convert("L")).astype(bool)
98
+
99
+ prompt_number = model.config.prompt_numbers
100
+ prompt_tokens = [f"<Prompt{i_p}>" for i_p in range(prompt_number)] + ["<NO_Prompt>"]
101
+ dataset = SingleRegionCaptionDataset(
102
+ image=img,
103
+ mask=mask,
104
+ processor=processor,
105
+ prompt_number=prompt_number,
106
+ visual_prompt_tokens=prompt_tokens,
107
+ data_dtype=data_dtype,
108
+ )
109
+
110
+ data_sample = dataset[0]
111
+
112
+ with torch.no_grad():
113
+ generate_ids = model.generate(
114
+ **data_sample,
115
+ generation_config=GenerationConfig(
116
+ max_new_tokens=1024,
117
+ do_sample=False,
118
+ eos_token_id=processor.tokenizer.eos_token_id,
119
+ pad_token_id=processor.tokenizer.pad_token_id,
120
+ ),
121
+ return_dict=True,
122
+ )
123
+
124
+ outputs = processor.tokenizer.decode(
125
+ generate_ids.sequences[0], skip_special_tokens=True
126
+ ).strip()
127
+
128
+ print(outputs) # Print model output for this image
129
+
130
+
131
+ if __name__ == "__main__":
132
+ main()
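+
+ # Example invocation (illustrative; PYTHONPATH=. and the single-process torchrun launch
+ # are assumptions so that the evaluation import resolves and the init_process_group call
+ # above can initialize; the asset pair below is the one used by the Gradio demo examples):
+ #   PYTHONPATH=. torchrun --nproc_per_node=1 demo/gar_with_mask.py \
+ #       --image_path assets/demo_image_1.png \
+ #       --mask_path assets/demo_mask_1.png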
demo/gar_with_sam.py ADDED
@@ -0,0 +1,272 @@
1
+ # *************************************************************************
2
+ # This file may have been modified by Bytedance Inc. (“Bytedance Inc.'s Mo-
3
+ # difications”). All Bytedance Inc.'s Modifications are Copyright (2025) B-
4
+ # ytedance Inc..
5
+ # *************************************************************************
6
+
7
+ # Adapted from https://github.com/NVlabs/describe-anything/blob/main/examples/dam_with_sam.py
8
+
9
+ # Copyright 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
10
+ #
11
+ # Licensed under the Apache License, Version 2.0 (the "License");
12
+ # you may not use this file except in compliance with the License.
13
+ # You may obtain a copy of the License at
14
+ #
15
+ # http://www.apache.org/licenses/LICENSE-2.0
16
+ #
17
+ # Unless required by applicable law or agreed to in writing, software
18
+ # distributed under the License is distributed on an "AS IS" BASIS,
19
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
20
+ # See the License for the specific language governing permissions and
21
+ # limitations under the License.
22
+ #
23
+ # SPDX-License-Identifier: Apache-2.0
24
+
25
+ import argparse
26
+ import ast
27
+
28
+ import cv2
29
+ import numpy as np
30
+ import torch
31
+ from PIL import Image
32
+ from transformers import (
33
+ AutoModel,
34
+ AutoProcessor,
35
+ GenerationConfig,
36
+ SamModel,
37
+ SamProcessor,
38
+ )
39
+
40
+ from evaluation.eval_dataset import SingleRegionCaptionDataset
41
+
42
+ TORCH_DTYPE_MAP = dict(fp16=torch.float16, bf16=torch.bfloat16, fp32=torch.float32)
43
+
44
+
45
+ def apply_sam(image, input_points=None, input_boxes=None, input_labels=None):
46
+ inputs = sam_processor(
47
+ image,
48
+ input_points=input_points,
49
+ input_boxes=input_boxes,
50
+ input_labels=input_labels,
51
+ return_tensors="pt",
52
+ ).to(device)
53
+
54
+ with torch.no_grad():
55
+ outputs = sam_model(**inputs)
56
+
57
+ masks = sam_processor.image_processor.post_process_masks(
58
+ outputs.pred_masks.cpu(),
59
+ inputs["original_sizes"].cpu(),
60
+ inputs["reshaped_input_sizes"].cpu(),
61
+ )[0][0]
62
+ scores = outputs.iou_scores[0, 0]
63
+
64
+ mask_selection_index = scores.argmax()
65
+
66
+ mask_np = masks[mask_selection_index].numpy()
67
+
68
+ return mask_np
69
+
70
+
71
+ def add_contour(img, mask, input_points=None, input_boxes=None):
72
+ img = img.copy()
73
+
74
+ # Draw contour
75
+ mask = mask.astype(np.uint8) * 255
76
+ contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
77
+ cv2.drawContours(img, contours, -1, (1.0, 1.0, 1.0), thickness=6)
78
+
79
+ # Draw points if provided
80
+ if input_points is not None:
81
+ for points in input_points: # Handle batch of points
82
+ for x, y in points:
83
+ # Draw a filled circle for each point
84
+ cv2.circle(
85
+ img,
86
+ (int(x), int(y)),
87
+ radius=10,
88
+ color=(1.0, 0.0, 0.0),
89
+ thickness=-1,
90
+ )
91
+ # Draw a white border around the circle
92
+ cv2.circle(
93
+ img, (int(x), int(y)), radius=10, color=(1.0, 1.0, 1.0), thickness=2
94
+ )
95
+
96
+ # Draw boxes if provided
97
+ if input_boxes is not None:
98
+ for box_batch in input_boxes: # Handle batch of boxes
99
+ for box in box_batch: # Iterate through boxes in the batch
100
+ x1, y1, x2, y2 = map(int, box)
101
+ # Draw rectangle with white color
102
+ cv2.rectangle(
103
+ img, (x1, y1), (x2, y2), color=(1.0, 1.0, 1.0), thickness=4
104
+ )
105
+ # Draw inner rectangle with red color
106
+ cv2.rectangle(
107
+ img, (x1, y1), (x2, y2), color=(1.0, 0.0, 0.0), thickness=2
108
+ )
109
+
110
+ return img
111
+
112
+
113
+ def denormalize_coordinates(coords, image_size, is_box=False):
114
+ """Convert normalized coordinates (0-1) to pixel coordinates."""
115
+ width, height = image_size
116
+ if is_box:
117
+ # For boxes: [x1, y1, x2, y2]
118
+ x1, y1, x2, y2 = coords
119
+ return [int(x1 * width), int(y1 * height), int(x2 * width), int(y2 * height)]
120
+ else:
121
+ # For points: [x, y]
122
+ x, y = coords
123
+ return [int(x * width), int(y * height)]
124
+
125
+
126
+ def print_streaming(text):
127
+ """Helper function to print streaming text with flush"""
128
+ print(text, end="", flush=True)
129
+
130
+
131
+ if __name__ == "__main__":
132
+ parser = argparse.ArgumentParser(
133
+ description="Detailed Localized Image Descriptions with SAM"
134
+ )
135
+ parser.add_argument(
136
+ "--model_name_or_path",
137
+ help="HF model name or path",
138
+ default="HaochenWang/GAR-8B",
139
+ )
140
+ parser.add_argument(
141
+ "--image_path", type=str, required=True, help="Path to the image file"
142
+ )
143
+ parser.add_argument(
144
+ "--points",
145
+ type=str,
146
+ default="[[1172, 812], [1572, 800]]",
147
+ help="List of points for SAM input",
148
+ )
149
+ parser.add_argument(
150
+ "--box",
151
+ type=str,
152
+ default="[773, 518, 1172, 812]",
153
+ help="Bounding box for SAM input (x1, y1, x2, y2)",
154
+ )
155
+ parser.add_argument(
156
+ "--use_box",
157
+ action="store_true",
158
+ help="Use box instead of points for SAM input (default: use points)",
159
+ )
160
+ parser.add_argument(
161
+ "--normalized_coords",
162
+ action="store_true",
163
+ help="Interpret coordinates as normalized (0-1) values",
164
+ )
165
+ parser.add_argument(
166
+ "--output_image_path",
167
+ type=str,
168
+ default=None,
169
+ help="Path to save the output image with contour",
170
+ )
171
+ parser.add_argument(
172
+ "--data_type",
173
+ help="data dtype",
174
+ type=str,
175
+ choices=["fp16", "bf16", "fp32"],
176
+ default="bf16",
177
+ )
178
+
179
+ args = parser.parse_args()
180
+ data_dtype = TORCH_DTYPE_MAP[args.data_type]
181
+
182
+ # Load the image
183
+ img = Image.open(args.image_path).convert("RGB")
184
+
185
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
186
+ sam_model = SamModel.from_pretrained("facebook/sam-vit-huge").to(device)
187
+ sam_processor = SamProcessor.from_pretrained("facebook/sam-vit-huge")
188
+
189
+ image_size = img.size # (width, height)
190
+
191
+ # Prepare input_points or input_boxes
192
+ if args.use_box:
193
+ input_boxes = ast.literal_eval(args.box)
194
+ if args.normalized_coords:
195
+ input_boxes = denormalize_coordinates(input_boxes, image_size, is_box=True)
196
+ input_boxes = [[input_boxes]] # Add an extra level of nesting
197
+ print(f"Using input_boxes: {input_boxes}")
198
+ mask_np = apply_sam(img, input_boxes=input_boxes)
199
+ else:
200
+ input_points = ast.literal_eval(args.points)
201
+ if args.normalized_coords:
202
+ input_points = [
203
+ denormalize_coordinates(point, image_size) for point in input_points
204
+ ]
205
+ # Assume all points are foreground
206
+ input_labels = [1] * len(input_points)
207
+ input_points = [[x, y] for x, y in input_points] # Convert to list of lists
208
+ input_points = [input_points] # Wrap in outer list
209
+ input_labels = [input_labels] # Wrap labels in list
210
+ print(f"Using input_points: {input_points}")
211
+ mask_np = apply_sam(img, input_points=input_points, input_labels=input_labels)
212
+
213
+ # build HF model
214
+ model = AutoModel.from_pretrained(
215
+ args.model_name_or_path,
216
+ trust_remote_code=True,
217
+ torch_dtype=data_dtype,
218
+ device_map="cuda:0",
219
+ ).eval()
220
+
221
+ processor = AutoProcessor.from_pretrained(
222
+ args.model_name_or_path,
223
+ trust_remote_code=True,
224
+ )
225
+
226
+ # Get description
227
+ prompt_number = model.config.prompt_numbers
228
+ prompt_tokens = [f"<Prompt{i_p}>" for i_p in range(prompt_number)] + ["<NO_Prompt>"]
229
+ dataset = SingleRegionCaptionDataset(
230
+ image=img,
231
+ mask=mask_np,
232
+ processor=processor,
233
+ prompt_number=prompt_number,
234
+ visual_prompt_tokens=prompt_tokens,
235
+ data_dtype=data_dtype,
236
+ )
237
+
238
+ data_sample = dataset[0]
239
+
240
+ with torch.no_grad():
241
+ generate_ids = model.generate(
242
+ **data_sample,
243
+ generation_config=GenerationConfig(
244
+ max_new_tokens=1024,
245
+ do_sample=False,
246
+ eos_token_id=processor.tokenizer.eos_token_id,
247
+ pad_token_id=processor.tokenizer.pad_token_id,
248
+ ),
249
+ return_dict=True,
250
+ )
251
+
252
+ outputs = processor.tokenizer.decode(
253
+ generate_ids.sequences[0], skip_special_tokens=True
254
+ ).strip()
255
+
256
+ print(outputs) # Print model output for this image
257
+
258
+ if args.output_image_path:
259
+ img_np = np.asarray(img).astype(float) / 255.0
260
+
261
+ # Prepare visualization inputs
262
+ vis_points = input_points if not args.use_box else None
263
+ vis_boxes = input_boxes if args.use_box else None
264
+
265
+ img_with_contour_np = add_contour(
266
+ img_np, mask_np, input_points=vis_points, input_boxes=vis_boxes
267
+ )
268
+ img_with_contour_pil = Image.fromarray(
269
+ (img_with_contour_np * 255.0).astype(np.uint8)
270
+ )
271
+ img_with_contour_pil.save(args.output_image_path)
272
+ print(f"Output image with contour saved as {args.output_image_path}")
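+
+ # Example invocation (illustrative; PYTHONPATH=. is an assumption so that the
+ # evaluation.eval_dataset import resolves when running from the repository root,
+ # and the --points shown are simply the script defaults):
+ #   PYTHONPATH=. python demo/gar_with_sam.py \
+ #       --image_path assets/demo_image_2.jpg \
+ #       --points "[[1172, 812], [1572, 800]]" \
+ #       --output_image_path demo_image_2_vis.png
+ #   Pass --use_box --box "[x1, y1, x2, y2]" to prompt SAM with a box instead of points.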
demo/gradio/.gradio/certificate.pem ADDED
@@ -0,0 +1,31 @@
1
+ -----BEGIN CERTIFICATE-----
2
+ MIIFazCCA1OgAwIBAgIRAIIQz7DSQONZRGPgu2OCiwAwDQYJKoZIhvcNAQELBQAw
3
+ TzELMAkGA1UEBhMCVVMxKTAnBgNVBAoTIEludGVybmV0IFNlY3VyaXR5IFJlc2Vh
4
+ cmNoIEdyb3VwMRUwEwYDVQQDEwxJU1JHIFJvb3QgWDEwHhcNMTUwNjA0MTEwNDM4
5
+ WhcNMzUwNjA0MTEwNDM4WjBPMQswCQYDVQQGEwJVUzEpMCcGA1UEChMgSW50ZXJu
6
+ ZXQgU2VjdXJpdHkgUmVzZWFyY2ggR3JvdXAxFTATBgNVBAMTDElTUkcgUm9vdCBY
7
+ MTCCAiIwDQYJKoZIhvcNAQEBBQADggIPADCCAgoCggIBAK3oJHP0FDfzm54rVygc
8
+ h77ct984kIxuPOZXoHj3dcKi/vVqbvYATyjb3miGbESTtrFj/RQSa78f0uoxmyF+
9
+ 0TM8ukj13Xnfs7j/EvEhmkvBioZxaUpmZmyPfjxwv60pIgbz5MDmgK7iS4+3mX6U
10
+ A5/TR5d8mUgjU+g4rk8Kb4Mu0UlXjIB0ttov0DiNewNwIRt18jA8+o+u3dpjq+sW
11
+ T8KOEUt+zwvo/7V3LvSye0rgTBIlDHCNAymg4VMk7BPZ7hm/ELNKjD+Jo2FR3qyH
12
+ B5T0Y3HsLuJvW5iB4YlcNHlsdu87kGJ55tukmi8mxdAQ4Q7e2RCOFvu396j3x+UC
13
+ B5iPNgiV5+I3lg02dZ77DnKxHZu8A/lJBdiB3QW0KtZB6awBdpUKD9jf1b0SHzUv
14
+ KBds0pjBqAlkd25HN7rOrFleaJ1/ctaJxQZBKT5ZPt0m9STJEadao0xAH0ahmbWn
15
+ OlFuhjuefXKnEgV4We0+UXgVCwOPjdAvBbI+e0ocS3MFEvzG6uBQE3xDk3SzynTn
16
+ jh8BCNAw1FtxNrQHusEwMFxIt4I7mKZ9YIqioymCzLq9gwQbooMDQaHWBfEbwrbw
17
+ qHyGO0aoSCqI3Haadr8faqU9GY/rOPNk3sgrDQoo//fb4hVC1CLQJ13hef4Y53CI
18
+ rU7m2Ys6xt0nUW7/vGT1M0NPAgMBAAGjQjBAMA4GA1UdDwEB/wQEAwIBBjAPBgNV
19
+ HRMBAf8EBTADAQH/MB0GA1UdDgQWBBR5tFnme7bl5AFzgAiIyBpY9umbbjANBgkq
20
+ hkiG9w0BAQsFAAOCAgEAVR9YqbyyqFDQDLHYGmkgJykIrGF1XIpu+ILlaS/V9lZL
21
+ ubhzEFnTIZd+50xx+7LSYK05qAvqFyFWhfFQDlnrzuBZ6brJFe+GnY+EgPbk6ZGQ
22
+ 3BebYhtF8GaV0nxvwuo77x/Py9auJ/GpsMiu/X1+mvoiBOv/2X/qkSsisRcOj/KK
23
+ NFtY2PwByVS5uCbMiogziUwthDyC3+6WVwW6LLv3xLfHTjuCvjHIInNzktHCgKQ5
24
+ ORAzI4JMPJ+GslWYHb4phowim57iaztXOoJwTdwJx4nLCgdNbOhdjsnvzqvHu7Ur
25
+ TkXWStAmzOVyyghqpZXjFaH3pO3JLF+l+/+sKAIuvtd7u+Nxe5AW0wdeRlN8NwdC
26
+ jNPElpzVmbUq4JUagEiuTDkHzsxHpFKVK7q4+63SM1N95R1NbdWhscdCb+ZAJzVc
27
+ oyi3B43njTOQ5yOf+1CceWxG1bQVs5ZufpsMljq4Ui0/1lvh+wjChP4kqKOJ2qxq
28
+ 4RgqsahDYVvTH9w7jXbyLeiNdd8XM2w9U/t7y0Ff/9yi0GE44Za4rF2LN9d11TPA
29
+ mRGunUHBcnWEvgJBQl9nJEiU0Zsnvgc/ubhPgXRR4Xq37Z0j4r7g1SgEEzwxA57d
30
+ emyPxgcYxn/eR44/KJ4EBs+lVDR3veyJm+kXQ99b21/+jh5Xos1AnX5iItreGCc=
31
+ -----END CERTIFICATE-----
demo/gradio/README.md ADDED
@@ -0,0 +1,11 @@
1
+ Please install the segment-anything package via:
2
+ ```
3
+ pip install git+https://github.com/facebookresearch/segment-anything.git
4
+ ```
5
+
6
+ This demo is based on the Segment Anything demo, released under the Apache 2.0 license. Please refer to the [Segment Anything LICENSE](https://github.com/facebookresearch/segment-anything/blob/main/LICENSE) for more details.
7
+
8
+ ## Run the demo
9
+ ```
10
+ python demo/gradio/app.py
11
+ ```
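+
+ ## Query the demo programmatically
+
+ `demo/gradio/app.py` registers its functions as named Gradio API endpoints (`image_to_sam_embedding`, `describe`, and `describe_without_streaming`). Below is a minimal sketch of calling the non-streaming endpoint with `gradio_client` (install it if needed); the local URL and the image/mask file names are assumptions, and both inputs are sent as base64 strings:
+
+ ```python
+ import base64
+
+ from gradio_client import Client
+
+
+ def to_b64(path):
+     # Read a local file and encode it as a base64 string
+     with open(path, "rb") as f:
+         return base64.b64encode(f.read()).decode("utf-8")
+
+
+ client = Client("http://localhost:7860")  # assumed local address of the running demo
+ caption = client.predict(
+     to_b64("demo_image.jpg"),  # image as base64 (hypothetical file)
+     to_b64("demo_mask.png"),   # binary region mask as base64 (hypothetical file)
+     "",                        # prompt text (not used by the current endpoint)
+     api_name="/describe_without_streaming",
+ )
+ print(caption)
+ ```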
demo/gradio/app.py ADDED
@@ -0,0 +1,267 @@
1
+ # *************************************************************************
2
+ # This file may have been modified by Bytedance Inc. (“Bytedance Inc.'s Mo-
3
+ # difications”). All Bytedance Inc.'s Modifications are Copyright (2025) B-
4
+ # ytedance Inc..
5
+ # *************************************************************************
6
+
7
+ # Adapted from https://github.com/NVlabs/describe-anything/blob/main/examples/dam_with_sam.py
8
+
9
+ # Copyright 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
10
+ #
11
+ # Licensed under the Apache License, Version 2.0 (the "License");
12
+ # you may not use this file except in compliance with the License.
13
+ # You may obtain a copy of the License at
14
+ #
15
+ # http://www.apache.org/licenses/LICENSE-2.0
16
+ #
17
+ # Unless required by applicable law or agreed to in writing, software
18
+ # distributed under the License is distributed on an "AS IS" BASIS,
19
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
20
+ # See the License for the specific language governing permissions and
21
+ # limitations under the License.
22
+ #
23
+ # SPDX-License-Identifier: Apache-2.0
24
+
25
+ import argparse
26
+ import base64
27
+ import io
28
+
29
+ import cv2
30
+ import gradio as gr
31
+ import numpy as np
32
+ import torch
33
+ from fastapi import FastAPI
34
+ from fastapi.staticfiles import StaticFiles
35
+ from PIL import Image
36
+ from segment_anything import SamPredictor, sam_model_registry
37
+ from transformers import (
38
+ AutoModel,
39
+ AutoProcessor,
40
+ GenerationConfig,
41
+ SamModel,
42
+ SamProcessor,
43
+ )
44
+
45
+ try:
46
+ from spaces import GPU
47
+ except ImportError:
48
+ print("Spaces not installed, using dummy GPU decorator")
49
+
50
+ def GPU(*args, **kwargs):
51
+ def decorator(fn):
52
+ return fn
53
+
54
+ return decorator
55
+
56
+
57
+ from evaluation.eval_dataset import SingleRegionCaptionDataset
58
+
59
+ # Load SAM model
60
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
61
+ sam_model = SamModel.from_pretrained("facebook/sam-vit-huge").to(device)
62
+ sam_processor = SamProcessor.from_pretrained("facebook/sam-vit-huge")
63
+
64
+ # Initialize the captioning model and processor
65
+ model_path = "HaochenWang/GAR-1B"
66
+ model = AutoModel.from_pretrained(
67
+ model_path,
68
+ trust_remote_code=True,
69
+ torch_dtype=torch.bfloat16,
70
+ device_map="cuda:0",
71
+ ).eval()
72
+
73
+ processor = AutoProcessor.from_pretrained(
74
+ model_path,
75
+ trust_remote_code=True,
76
+ )
77
+
78
+
79
+ @GPU(duration=75)
80
+ def image_to_sam_embedding(base64_image):
81
+ try:
82
+ # Decode base64 string to bytes
83
+ image_bytes = base64.b64decode(base64_image)
84
+
85
+ # Convert bytes to PIL Image
86
+ image = Image.open(io.BytesIO(image_bytes))
87
+
88
+ # Process image with SAM processor
89
+ inputs = sam_processor(image, return_tensors="pt").to(device)
90
+
91
+ # Get image embedding
92
+ with torch.no_grad():
93
+ image_embedding = sam_model.get_image_embeddings(inputs["pixel_values"])
94
+
95
+ # Convert to CPU and numpy
96
+ image_embedding = image_embedding.cpu().numpy()
97
+
98
+ # Encode the embedding as base64
99
+ embedding_bytes = image_embedding.tobytes()
100
+ embedding_base64 = base64.b64encode(embedding_bytes).decode("utf-8")
101
+
102
+ return embedding_base64
103
+ except Exception as e:
104
+ print(f"Error processing image: {str(e)}")
105
+ raise gr.Error(f"Failed to process image: {str(e)}")
106
+
107
+
108
+ @GPU(duration=75)
109
+ def describe(image_base64: str, mask_base64: str, query: str):
110
+ # Convert base64 to PIL Image
111
+ image_bytes = base64.b64decode(
112
+ image_base64.split(",")[1] if "," in image_base64 else image_base64
113
+ )
114
+ img = Image.open(io.BytesIO(image_bytes))
115
+ mask_bytes = base64.b64decode(
116
+ mask_base64.split(",")[1] if "," in mask_base64 else mask_base64
117
+ )
118
+ mask = Image.open(io.BytesIO(mask_bytes))
119
+ mask = np.array(mask.convert("L"))
120
+
121
+ prompt_number = model.config.prompt_numbers
122
+ prompt_tokens = [f"<Prompt{i_p}>" for i_p in range(prompt_number)] + ["<NO_Prompt>"]
123
+
124
+ # Assuming mask is given as a numpy array and the image is a PIL image
125
+ dataset = SingleRegionCaptionDataset(
126
+ image=img,
127
+ mask=mask,
128
+ processor=processor,
129
+ prompt_number=prompt_number,
130
+ visual_prompt_tokens=prompt_tokens,
131
+ data_dtype=torch.bfloat16,
132
+ )
133
+
134
+ data_sample = dataset[0]
135
+
136
+ # Generate the caption
137
+ with torch.no_grad():
138
+ generate_ids = model.generate(
139
+ **data_sample,
140
+ generation_config=GenerationConfig(
141
+ max_new_tokens=1024,
142
+ eos_token_id=processor.tokenizer.eos_token_id,
143
+ pad_token_id=processor.tokenizer.pad_token_id,
144
+ ),
145
+ return_dict=True,
146
+ )
147
+
148
+ output_caption = processor.tokenizer.decode(
149
+ generate_ids.sequences[0], skip_special_tokens=True
150
+ ).strip()
151
+
152
+ # Pseudo-streaming: yield the fully generated caption one character at a time
153
+ text = ""
154
+ for token in output_caption:
155
+ text += token
156
+ yield text
157
+
158
+
159
+ @GPU(duration=75)
160
+ def describe_without_streaming(image_base64: str, mask_base64: str, query: str):
161
+ # Convert base64 to PIL Image
162
+ image_bytes = base64.b64decode(
163
+ image_base64.split(",")[1] if "," in image_base64 else image_base64
164
+ )
165
+ img = Image.open(io.BytesIO(image_bytes))
166
+ mask_bytes = base64.b64decode(
167
+ mask_base64.split(",")[1] if "," in mask_base64 else mask_base64
168
+ )
169
+ mask = Image.open(io.BytesIO(mask_bytes))
170
+ mask = np.array(mask.convert("L"))
171
+ prompt_number = model.config.prompt_numbers
172
+ prompt_tokens = [f"<Prompt{i_p}>" for i_p in range(prompt_number)] + ["<NO_Prompt>"]
173
+
174
+ # Assuming mask is given as a numpy array and the image is a PIL image
175
+ dataset = SingleRegionCaptionDataset(
176
+ image=img,
177
+ mask=mask,
178
+ processor=processor,
179
+ prompt_number=prompt_number,
180
+ visual_prompt_tokens=prompt_tokens,
181
+ data_dtype=torch.bfloat16,
182
+ )
183
+
184
+ data_sample = dataset[0]
185
+
186
+ # Generate the caption
187
+ with torch.no_grad():
188
+ generate_ids = model.generate(
189
+ **data_sample,
190
+ generation_config=GenerationConfig(
191
+ max_new_tokens=1024,
192
+ # do_sample=False,
193
+ eos_token_id=processor.tokenizer.eos_token_id,
194
+ pad_token_id=processor.tokenizer.pad_token_id,
195
+ ),
196
+ return_dict=True,
197
+ )
198
+
199
+ output_caption = processor.tokenizer.decode(
200
+ generate_ids.sequences[0], skip_special_tokens=True
201
+ ).strip()
202
+
203
+ return output_caption
204
+
205
+
206
+ if __name__ == "__main__":
207
+ parser = argparse.ArgumentParser(description="Grasp Any Region gradio demo")
208
+ parser.add_argument(
209
+ "--server_addr",
210
+ "--host",
211
+ type=str,
212
+ default=None,
213
+ help="The server address to listen on.",
214
+ )
215
+ parser.add_argument(
216
+ "--server_port", "--port", type=int, default=None, help="The port to listen on."
217
+ )
218
+
219
+ args = parser.parse_args()
220
+
221
+ # Create Gradio interface
222
+ with gr.Blocks() as demo:
223
+ gr.Interface(
224
+ fn=image_to_sam_embedding,
225
+ inputs=gr.Textbox(label="Image Base64"),
226
+ outputs=gr.Textbox(label="Embedding Base64"),
227
+ title="Image Embedding Generator",
228
+ api_name="image_to_sam_embedding",
229
+ )
230
+ gr.Interface(
231
+ fn=describe,
232
+ inputs=[
233
+ gr.Textbox(label="Image Base64"),
234
+ gr.Text(label="Mask Base64"),
235
+ gr.Text(label="Prompt"),
236
+ ],
237
+ outputs=[gr.Text(label="Description")],
238
+ title="Mask Description Generator",
239
+ api_name="describe",
240
+ )
241
+ gr.Interface(
242
+ fn=describe_without_streaming,
243
+ inputs=[
244
+ gr.Textbox(label="Image Base64"),
245
+ gr.Text(label="Mask Base64"),
246
+ gr.Text(label="Prompt"),
247
+ ],
248
+ outputs=[gr.Text(label="Description")],
249
+ title="Mask Description Generator (Non-Streaming)",
250
+ api_name="describe_without_streaming",
251
+ )
252
+
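+ # Temporarily replace block_thread with a no-op so that launch() returns immediately;
+ # the saved _block_thread() is invoked at the end, after removing Gradio's root route
+ # and mounting the built frontend at "/".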
253
+ demo._block_thread = demo.block_thread
254
+ demo.block_thread = lambda: None
255
+ demo.launch(
256
+ share=True,
257
+ server_name=args.server_addr,
258
+ server_port=args.server_port,
259
+ ssr_mode=False,
260
+ )
261
+
262
+ for route in demo.app.routes:
263
+ if route.path == "/":
264
+ demo.app.routes.remove(route)
265
+ demo.app.mount("/", StaticFiles(directory="dist", html=True), name="demo")
266
+
267
+ demo._block_thread()
demo/gradio/frontend/README.md ADDED
@@ -0,0 +1,126 @@
1
+ ## Segment Anything Simple Web demo
2
+
3
+ This **front-end only** React-based web demo shows how to load a fixed image and the corresponding `.npy` file of the SAM image embedding, and run the SAM ONNX model in the browser using WebAssembly with multithreading enabled by `SharedArrayBuffer`, Web Worker, and SIMD128.
4
+
5
+ <img src="https://github.com/facebookresearch/segment-anything/raw/main/assets/minidemo.gif" width="500"/>
6
+
7
+ ## Run the app
8
+
9
+ Install Yarn
10
+
11
+ ```
12
+ npm install --g yarn
13
+ ```
14
+
15
+ Build and run:
16
+
17
+ ```
18
+ yarn && yarn start
19
+ ```
20
+
21
+ Navigate to [`http://localhost:8081/`](http://localhost:8081/)
22
+
23
+ Move your cursor around to see the mask prediction update in real time.
24
+
25
+ ## Export the image embedding
26
+
27
+ In the [ONNX Model Example notebook](https://github.com/facebookresearch/segment-anything/blob/main/notebooks/onnx_model_example.ipynb), upload the image of your choice, then generate and save the corresponding embedding.
28
+
29
+ Initialize the predictor:
30
+
31
+ ```python
32
+ checkpoint = "sam_vit_h_4b8939.pth"
33
+ model_type = "vit_h"
34
+ sam = sam_model_registry[model_type](checkpoint=checkpoint)
35
+ sam.to(device='cuda')
36
+ predictor = SamPredictor(sam)
37
+ ```
38
+
39
+ Set the new image and export the embedding:
40
+
41
+ ```
42
+ image = cv2.imread('src/assets/dogs.jpg')
43
+ predictor.set_image(image)
44
+ image_embedding = predictor.get_image_embedding().cpu().numpy()
45
+ np.save("dogs_embedding.npy", image_embedding)
46
+ ```
47
+
48
+ Save the new image and embedding in `src/assets/data`.
49
+
50
+ ## Export the ONNX model
51
+
52
+ You also need to export the quantized ONNX model from the [ONNX Model Example notebook](https://github.com/facebookresearch/segment-anything/blob/main/notebooks/onnx_model_example.ipynb).
53
+
54
+ Run the cell in the notebook which saves the `sam_onnx_quantized_example.onnx` file, download it and copy it to the path `/model/sam_onnx_quantized_example.onnx`.
55
+
56
+ Here is a snippet of the export/quantization code:
57
+
58
+ ```
59
+ onnx_model_path = "sam_onnx_example.onnx"
60
+ onnx_model_quantized_path = "sam_onnx_quantized_example.onnx"
61
+ quantize_dynamic(
62
+ model_input=onnx_model_path,
63
+ model_output=onnx_model_quantized_path,
64
+ optimize_model=True,
65
+ per_channel=False,
66
+ reduce_range=False,
67
+ weight_type=QuantType.QUInt8,
68
+ )
69
+ ```
70
+
71
+ **NOTE: if you change the ONNX model by using a new checkpoint you need to also re-export the embedding.**
72
+
73
+ ## Update the image, embedding, model in the app
74
+
75
+ Update the following file paths at the top of `App.tsx`:
76
+
77
+ ```ts
78
+ const IMAGE_PATH = "/assets/data/dogs.jpg";
79
+ const IMAGE_EMBEDDING = "/assets/data/dogs_embedding.npy";
80
+ const MODEL_DIR = "/model/sam_onnx_quantized_example.onnx";
81
+ ```
82
+
83
+ ## ONNX multithreading with SharedArrayBuffer
84
+
85
+ To use multithreading, the appropriate headers need to be set to create a cross-origin isolation state, which enables the use of `SharedArrayBuffer` (see this [blog post](https://cloudblogs.microsoft.com/opensource/2021/09/02/onnx-runtime-web-running-your-machine-learning-model-in-browser/) for more details).
86
+
87
+ The headers below are set in `configs/webpack/dev.js`:
88
+
89
+ ```js
90
+ headers: {
91
+ "Cross-Origin-Opener-Policy": "same-origin",
92
+ "Cross-Origin-Embedder-Policy": "credentialless",
93
+ }
94
+ ```
95
+
96
+ ## Structure of the app
97
+
98
+ **`App.tsx`**
99
+
100
+ - Initializes ONNX model
101
+ - Loads image embedding and image
102
+ - Runs the ONNX model based on input prompts
103
+
104
+ **`Stage.tsx`**
105
+
106
+ - Handles mouse move interaction to update the ONNX model prompt
107
+
108
+ **`Tool.tsx`**
109
+
110
+ - Renders the image and the mask prediction
111
+
112
+ **`helpers/maskUtils.tsx`**
113
+
114
+ - Conversion of ONNX model output from array to an HTMLImageElement
115
+
116
+ **`helpers/onnxModelAPI.tsx`**
117
+
118
+ - Formats the inputs for the ONNX model
119
+
120
+ **`helpers/scaleHelper.tsx`**
121
+
122
+ - Handles image scaling logic for SAM (longest size 1024)
123
+
124
+ **`hooks/`**
125
+
126
+ - Handle shared state for the app
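+
+ ## Build for the Gradio app
+
+ `demo/gradio/app.py` serves a built copy of this frontend by mounting a `dist` directory at `/`. To produce it, run the production build; the `build` script in `package.json` also syncs the bundle one level up into `demo/gradio/dist`:
+
+ ```
+ yarn build
+ ```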
demo/gradio/frontend/configs/webpack/common.js ADDED
@@ -0,0 +1,85 @@
1
+ // Copyright (c) Meta Platforms, Inc. and affiliates.
2
+ // All rights reserved.
3
+
4
+ // This source code is licensed under the license found in the
5
+ // LICENSE file in the root directory of this source tree.
6
+
7
+ const { resolve } = require("path");
8
+ const HtmlWebpackPlugin = require("html-webpack-plugin");
9
+ const FriendlyErrorsWebpackPlugin = require("friendly-errors-webpack-plugin");
10
+ const CopyPlugin = require("copy-webpack-plugin");
11
+ const webpack = require("webpack");
12
+
13
+ module.exports = {
14
+ entry: "./src/index.tsx",
15
+ resolve: {
16
+ extensions: [".js", ".jsx", ".ts", ".tsx"],
17
+ fallback: { 'process/browser': require.resolve('process/browser'), }
18
+ },
19
+ output: {
20
+ path: resolve(__dirname, "dist"),
21
+ },
22
+ module: {
23
+ rules: [
24
+ {
25
+ test: /\.mjs$/,
26
+ include: /node_modules/,
27
+ type: "javascript/auto",
28
+ resolve: {
29
+ fullySpecified: false,
30
+ },
31
+ },
32
+ {
33
+ test: [/\.jsx?$/, /\.tsx?$/],
34
+ use: ["ts-loader"],
35
+ exclude: /node_modules/,
36
+ },
37
+ {
38
+ test: /\.css$/,
39
+ use: ["style-loader", "css-loader"],
40
+ },
41
+ {
42
+ test: /\.(scss|sass)$/,
43
+ use: ["style-loader", "css-loader", "postcss-loader"],
44
+ },
45
+ {
46
+ test: /\.(jpe?g|png|gif|svg)$/i,
47
+ use: [
48
+ "file-loader?hash=sha512&digest=hex&name=img/[contenthash].[ext]",
49
+ "image-webpack-loader?bypassOnDebug&optipng.optimizationLevel=7&gifsicle.interlaced=false",
50
+ ],
51
+ },
52
+ {
53
+ test: /\.(woff|woff2|ttf)$/,
54
+ use: {
55
+ loader: "url-loader",
56
+ },
57
+ },
58
+ ],
59
+ },
60
+ plugins: [
61
+ new CopyPlugin({
62
+ patterns: [
63
+ {
64
+ from: "node_modules/onnxruntime-web/dist/*.wasm",
65
+ to: "[name][ext]",
66
+ },
67
+ {
68
+ from: "model",
69
+ to: "model",
70
+ },
71
+ {
72
+ from: "src/assets/examples",
73
+ to: "examples",
74
+ },
75
+ ],
76
+ }),
77
+ new HtmlWebpackPlugin({
78
+ template: "./src/assets/index.html",
79
+ }),
80
+ new FriendlyErrorsWebpackPlugin(),
81
+ new webpack.ProvidePlugin({
82
+ process: "process/browser",
83
+ }),
84
+ ],
85
+ };
demo/gradio/frontend/configs/webpack/dev.js ADDED
@@ -0,0 +1,25 @@
1
+ // Copyright (c) Meta Platforms, Inc. and affiliates.
2
+ // All rights reserved.
3
+
4
+ // This source code is licensed under the license found in the
5
+ // LICENSE file in the root directory of this source tree.
6
+
7
+ // development config
8
+ const { merge } = require("webpack-merge");
9
+ const commonConfig = require("./common");
10
+
11
+ module.exports = merge(commonConfig, {
12
+ mode: "development",
13
+ devServer: {
14
+ hot: true, // enable HMR on the server
15
+ open: true,
16
+ // These headers enable the cross origin isolation state
17
+ // needed to enable use of SharedArrayBuffer for ONNX
18
+ // multithreading.
19
+ headers: {
20
+ "Cross-Origin-Opener-Policy": "same-origin",
21
+ "Cross-Origin-Embedder-Policy": "credentialless",
22
+ },
23
+ },
24
+ devtool: "cheap-module-source-map",
25
+ });
demo/gradio/frontend/configs/webpack/prod.js ADDED
@@ -0,0 +1,22 @@
1
+ // Copyright (c) Meta Platforms, Inc. and affiliates.
2
+ // All rights reserved.
3
+
4
+ // This source code is licensed under the license found in the
5
+ // LICENSE file in the root directory of this source tree.
6
+
7
+ // production config
8
+ const { merge } = require("webpack-merge");
9
+ const { resolve } = require("path");
10
+ const Dotenv = require("dotenv-webpack");
11
+ const commonConfig = require("./common");
12
+
13
+ module.exports = merge(commonConfig, {
14
+ mode: "production",
15
+ output: {
16
+ filename: "js/bundle.[contenthash].min.js",
17
+ path: resolve(__dirname, "../../dist"),
18
+ publicPath: "/",
19
+ },
20
+ devtool: "source-map",
21
+ plugins: [new Dotenv()],
22
+ });
demo/gradio/frontend/package.json ADDED
@@ -0,0 +1,64 @@
1
+ {
2
+ "name": "segment-anything-mini-demo",
3
+ "version": "0.1.0",
4
+ "license": "MIT",
5
+ "scripts": {
6
+ "build": "yarn run clean-dist && webpack --config=configs/webpack/prod.js && mv dist/*.wasm dist/js && rsync -r --delete dist ../",
7
+ "clean-dist": "rimraf dist/*",
8
+ "lint": "eslint './src/**/*.{js,ts,tsx}' --quiet",
9
+ "start": "yarn run start-dev",
10
+ "test": "yarn run start-model-test",
11
+ "start-dev": "webpack serve --config=configs/webpack/dev.js"
12
+ },
13
+ "devDependencies": {
14
+ "@babel/core": "^7.18.13",
15
+ "@babel/preset-env": "^7.18.10",
16
+ "@babel/preset-react": "^7.18.6",
17
+ "@babel/preset-typescript": "^7.18.6",
18
+ "@pmmmwh/react-refresh-webpack-plugin": "^0.5.7",
19
+ "@testing-library/react": "^13.3.0",
20
+ "@types/node": "^18.7.13",
21
+ "@types/react": "^18.0.17",
22
+ "@types/react-dom": "^18.0.6",
23
+ "@types/underscore": "^1.11.4",
24
+ "@typescript-eslint/eslint-plugin": "^5.35.1",
25
+ "@typescript-eslint/parser": "^5.35.1",
26
+ "babel-loader": "^8.2.5",
27
+ "copy-webpack-plugin": "^11.0.0",
28
+ "css-loader": "^6.7.1",
29
+ "dotenv": "^16.0.2",
30
+ "dotenv-webpack": "^8.0.1",
31
+ "eslint": "^8.22.0",
32
+ "eslint-plugin-react": "^7.31.0",
33
+ "file-loader": "^6.2.0",
34
+ "fork-ts-checker-webpack-plugin": "^7.2.13",
35
+ "friendly-errors-webpack-plugin": "^1.7.0",
36
+ "html-webpack-plugin": "^5.5.0",
37
+ "image-webpack-loader": "^8.1.0",
38
+ "postcss-loader": "^7.0.1",
39
+ "postcss-preset-env": "^7.8.0",
40
+ "process": "^0.11.10",
41
+ "rimraf": "^3.0.2",
42
+ "sass": "^1.54.5",
43
+ "sass-loader": "^13.0.2",
44
+ "style-loader": "^3.3.1",
45
+ "tailwindcss": "^3.1.8",
46
+ "ts-loader": "^9.3.1",
47
+ "typescript": "^4.8.2",
48
+ "webpack": "^5.74.0",
49
+ "webpack-cli": "^4.10.0",
50
+ "webpack-dev-server": "^4.10.0",
51
+ "webpack-dotenv-plugin": "^2.1.0",
52
+ "webpack-merge": "^5.8.0"
53
+ },
54
+ "dependencies": {
55
+ "@gradio/client": "^1.7.1",
56
+ "npyjs": "^0.4.0",
57
+ "onnxruntime-web": "1.14.0",
58
+ "react": "^18.2.0",
59
+ "react-dom": "^18.2.0",
60
+ "react-refresh": "^0.14.0",
61
+ "underscore": "^1.13.6",
62
+ "axios": "^1.6.7"
63
+ }
64
+ }
demo/gradio/frontend/postcss.config.js ADDED
@@ -0,0 +1,10 @@
1
+ // Copyright (c) Meta Platforms, Inc. and affiliates.
2
+ // All rights reserved.
3
+
4
+ // This source code is licensed under the license found in the
5
+ // LICENSE file in the root directory of this source tree.
6
+
7
+ const tailwindcss = require("tailwindcss");
8
+ module.exports = {
9
+ plugins: ["postcss-preset-env", 'tailwindcss/nesting', tailwindcss],
10
+ };
demo/gradio/frontend/src/App.tsx ADDED
@@ -0,0 +1,306 @@
1
+ // Copyright (c) Meta Platforms, Inc. and affiliates.
2
+ // All rights reserved.
3
+
4
+ // This source code is licensed under the license found in the
5
+ // LICENSE file in the root directory of this source tree.
6
+
7
+ import { InferenceSession, Tensor } from "onnxruntime-web";
8
+ import React, { useContext, useEffect, useState, useRef } from "react";
9
+ import axios from "axios";
10
+ import "./assets/scss/App.scss";
11
+ import { handleImageScale } from "./components/helpers/scaleHelper";
12
+ import { modelScaleProps, QueueStatus } from "./components/helpers/Interfaces";
13
+ import { onnxMaskToImage, arrayToImageData, imageDataToURL } from "./components/helpers/maskUtils";
14
+ import { modelData } from "./components/helpers/onnxModelAPI";
15
+ import Stage, { DescriptionState } from "./components/Stage";
16
+ import AppContext from "./components/hooks/createContext";
17
+ import { imageToSamEmbedding } from "./services/maskApi";
18
+ import LoadingOverlay from "./components/LoadingOverlay";
19
+ import ErrorModal from './components/ErrorModal';
20
+ import QueueStatusIndicator from "./components/QueueStatusIndicator";
21
+
22
+ const ort = require("onnxruntime-web");
23
+
24
+ // Define image and model paths
25
+ const MODEL_DIR = "/model/sam_onnx_quantized_example.onnx";
26
+
27
+ const App = () => {
28
+ const {
29
+ clicks: [clicks, setClicks],
30
+ image: [image, setImage],
31
+ maskImg: [maskImg, setMaskImg],
32
+ maskImgData: [maskImgData, setMaskImgData],
33
+ isClicked: [isClicked, setIsClicked]
34
+ } = useContext(AppContext)!;
35
+ const [model, setModel] = useState<InferenceSession | null>(null);
36
+ const [tensor, setTensor] = useState<Tensor | null>(null);
37
+ const [modelScale, setModelScale] = useState<modelScaleProps | null>(null);
38
+ const [isLoading, setIsLoading] = useState<boolean>(false);
39
+ const [error, setError] = useState<string | null>(null);
40
+ const [descriptionState, setDescriptionState] = useState<DescriptionState>({
41
+ state: 'ready',
42
+ description: ''
43
+ });
44
+ const [queueStatus, setQueueStatus] = useState<QueueStatus>({ inQueue: false });
45
+
46
+ // Initialize the ONNX model
47
+ useEffect(() => {
48
+ const initModel = async () => {
49
+ try {
50
+ if (MODEL_DIR === undefined) return;
51
+ const URL: string = MODEL_DIR;
52
+ const model = await InferenceSession.create(URL);
53
+ setModel(model);
54
+ } catch (e) {
55
+ console.log(e);
56
+ }
57
+ };
58
+ initModel();
59
+ }, []);
60
+
61
+ const handleImageUpload = async (event: React.ChangeEvent<HTMLInputElement>) => {
62
+ const file = event.target.files?.[0];
63
+ if (!file) return;
64
+
65
+ try {
66
+ const url = URL.createObjectURL(file);
67
+ await loadImage(new URL(url));
68
+ } catch (error) {
69
+ setError('Failed to load image. Please try again with a different image.');
70
+ console.error('Error loading image:', error);
71
+ }
72
+ };
73
+
74
+ const loadImage = async (url: URL) => {
75
+ try {
76
+ setIsLoading(true);
77
+ const img = new Image();
78
+ img.src = url.href;
79
+ img.onload = async () => {
80
+ const { height, width, samScale } = handleImageScale(img);
81
+ setModelScale({
82
+ height: height,
83
+ width: width,
84
+ samScale: samScale,
85
+ });
86
+ img.width = width;
87
+ img.height = height;
88
+ setImage(img);
89
+
90
+ // After image is loaded, fetch its embedding from Gradio
91
+ await fetchImageEmbedding(img);
92
+ setIsLoading(false);
93
+ };
94
+ } catch (error) {
95
+ console.log(error);
96
+ setIsLoading(false);
97
+ }
98
+ };
99
+
100
+ const fetchImageEmbedding = async (img: HTMLImageElement) => {
101
+ try {
102
+ // Create a canvas to convert the image to base64
103
+ const canvas = document.createElement('canvas');
104
+ canvas.width = img.width;
105
+ canvas.height = img.height;
106
+ const ctx = canvas.getContext('2d');
107
+ ctx?.drawImage(img, 0, 0);
108
+
109
+ // Convert image to base64 data URL and extract the base64 string
110
+ const base64Image = canvas.toDataURL('image/jpeg').split(',')[1];
111
+
112
+ // Make request to Gradio API
113
+ const samEmbedding = await imageToSamEmbedding(
114
+ base64Image,
115
+ (status: QueueStatus) => {
116
+ setQueueStatus(status);
117
+ }
118
+ );
119
+
120
+ // Convert base64 embedding back to array buffer
121
+ const binaryString = window.atob(samEmbedding);
122
+ const len = binaryString.length;
123
+ const bytes = new Uint8Array(len);
124
+ for (let i = 0; i < len; i++) {
125
+ bytes[i] = binaryString.charCodeAt(i);
126
+ }
127
+
128
+ // Create tensor from the embedding
129
+ const embedding = new ort.Tensor(
130
+ 'float32',
131
+ new Float32Array(bytes.buffer), // Convert to Float32Array
132
+ [1, 256, 64, 64] // SAM embedding shape
133
+ );
134
+ setTensor(embedding);
135
+ } catch (error) {
136
+ setQueueStatus({ inQueue: false }); // Reset queue status on error
137
+ let errorMessage = 'Failed to process image. Please try again.';
138
+ if (axios.isAxiosError(error)) {
139
+ errorMessage = error.response?.data?.message || errorMessage;
140
+ }
141
+ setError(errorMessage);
142
+ console.error('Error fetching embedding:', error);
143
+ }
144
+ };
145
+
146
+ useEffect(() => {
147
+ const handleMaskUpdate = async () => {
148
+ await runONNX();
149
+ };
150
+ handleMaskUpdate();
151
+ }, [clicks]);
152
+
153
+ const runONNX = async () => {
154
+ try {
155
+ // Don't run if already described or is describing
156
+ if (descriptionState.state !== 'ready') return;
157
+
158
+ console.log('Running ONNX model with:', {
159
+ modelLoaded: model !== null,
160
+ hasClicks: clicks !== null,
161
+ hasTensor: tensor !== null,
162
+ hasModelScale: modelScale !== null
163
+ });
164
+
165
+ if (
166
+ model === null ||
167
+ clicks === null ||
168
+ tensor === null ||
169
+ modelScale === null
170
+ ) {
171
+ console.log('Missing required inputs, returning early');
172
+ return;
173
+ }
174
+ else {
175
+ console.log('Preparing model feeds with:', {
176
+ clicks,
177
+ tensorShape: tensor.dims,
178
+ modelScale
179
+ });
180
+
181
+ const feeds = modelData({
182
+ clicks,
183
+ tensor,
184
+ modelScale,
185
+ });
186
+
187
+ if (feeds === undefined) {
188
+ console.log('Model feeds undefined, returning early');
189
+ return;
190
+ }
191
+
192
+ console.log('Running model with feeds:', feeds);
193
+ const results = await model.run(feeds);
194
+ console.log('Model run complete, got results:', results);
195
+
196
+ const output = results[model.outputNames[0]];
197
+ console.log('Processing output with dims:', output.dims);
198
+
199
+ // Calculate and log the mask area (number of non-zero values)
200
+ const maskArray = Array.from(output.data as Uint8Array);
201
+ const maskArea = maskArray.filter(val => val > 0).length;
202
+ console.log('Mask area (number of non-zero pixels):', maskArea);
203
+
204
+ // Double check that the state is ready before processing the mask since the state may have changed
205
+ if (descriptionState.state !== 'ready') return;
206
+ // After a click, only the first mask is handled: maskImgData is cleared right after the click, so a non-null value here means this click has already produced a mask.
207
+ if (isClicked && maskImgData != null) return;
208
+ if (maskArea > 0) {
209
+ setMaskImg(onnxMaskToImage(output.data, output.dims[2], output.dims[3], false));
210
+ setMaskImgData(imageDataToURL(arrayToImageData(output.data, output.dims[2], output.dims[3], true)));
211
+ } else {
212
+ console.warn('No mask area detected, clearing mask');
213
+ setMaskImg(null);
214
+ // setMaskImgData(null);
215
+ }
216
+
217
+ console.log('Mask processing complete');
218
+ }
219
+ } catch (e) {
220
+ setError('Failed to process the image. Please try again.');
221
+ console.error('Error running ONNX model:', e);
222
+ }
223
+ };
224
+
225
+ const handleNewRegion = () => {
226
+ setDescriptionState({
227
+ state: 'ready',
228
+ description: ''
229
+ } as DescriptionState);
230
+ setMaskImg(null);
231
+ // setMaskImgData(null);
232
+ setIsClicked(false);
233
+ };
234
+
235
+ const handleCopyDescription = () => {
236
+ navigator.clipboard.writeText(descriptionState.description);
237
+ };
238
+
239
+ const handleReset = () => {
240
+ // Clear all states
241
+ setDescriptionState({
242
+ state: 'ready',
243
+ description: ''
244
+ } as DescriptionState);
245
+ setMaskImg(null);
246
+ // setMaskImgData(null);
247
+ setImage(null);
248
+ setClicks(null);
249
+ setIsClicked(false);
250
+ };
251
+
252
+ return (
253
+ <div className="flex flex-col h-screen">
254
+ {isLoading && <LoadingOverlay />}
255
+ {error && <ErrorModal message={error} onClose={() => setError(null)} />}
256
+ <QueueStatusIndicator queueStatus={queueStatus} />
257
+ <div className="flex-1">
258
+ <Stage
259
+ onImageUpload={handleImageUpload}
260
+ descriptionState={descriptionState}
261
+ setDescriptionState={setDescriptionState}
262
+ queueStatus={queueStatus}
263
+ setQueueStatus={setQueueStatus}
264
+ />
265
+ </div>
266
+ <div className="description-container">
267
+ <div className={`description-box ${descriptionState.state !== 'described' ? descriptionState.state : ''}`}>
268
+ {descriptionState.description ? (
269
+ descriptionState.description + (descriptionState.state === 'describing' ? '...' : '')
270
+ ) : descriptionState.state === 'describing' ? (
271
+ <em>Describing the region... (this may take a while if compute resources are busy)</em>
272
+ ) : (
273
+ image ? (
274
+ <em>Click on the image to describe the region</em>
275
+ ) : (
276
+ <em>Upload an image to describe the region</em>
277
+ )
278
+ )}
279
+ </div>
280
+ <div className="description-controls">
281
+ <button
282
+ onClick={handleCopyDescription}
283
+ disabled={descriptionState.state !== 'described'}
284
+ >
285
+ Copy description
286
+ </button>
287
+ <button
288
+ onClick={handleNewRegion}
289
+ disabled={descriptionState.state !== 'described'}
290
+ >
291
+ Describe a new region
292
+ </button>
293
+ <button
294
+ onClick={handleReset}
295
+ className="reset-button"
296
+ disabled={descriptionState.state === 'describing' || !image}
297
+ >
298
+ Try a new image
299
+ </button>
300
+ </div>
301
+ </div>
302
+ </div>
303
+ );
304
+ };
305
+
306
+ export default App;
demo/gradio/frontend/src/components/ErrorModal.tsx ADDED
@@ -0,0 +1,32 @@
1
+ import React from 'react';
2
+
3
+ interface ErrorModalProps {
4
+ message: string;
5
+ onClose: () => void;
6
+ }
7
+
8
+ const ErrorModal: React.FC<ErrorModalProps> = ({ message, onClose }) => {
9
+ return (
10
+ <div className="fixed inset-0 bg-black bg-opacity-50 flex items-center justify-center z-50">
11
+ <div className="bg-white p-6 rounded-lg shadow-xl max-w-md w-full mx-4">
12
+ <div className="flex flex-col items-center">
13
+ <div className="bg-red-100 p-4 rounded-full mb-4">
14
+ <svg className="w-6 h-6 text-red-600" fill="none" stroke="currentColor" viewBox="0 0 24 24">
15
+ <path strokeLinecap="round" strokeLinejoin="round" strokeWidth="2" d="M6 18L18 6M6 6l12 12" />
16
+ </svg>
17
+ </div>
18
+ <h3 className="text-lg font-semibold text-gray-900 mb-2">Error</h3>
19
+ <p className="text-gray-600 text-center mb-6">{message}</p>
20
+ <button
21
+ onClick={onClose}
22
+ className="bg-red-600 text-white px-4 py-2 rounded hover:bg-red-700 transition-colors"
23
+ >
24
+ Close
25
+ </button>
26
+ </div>
27
+ </div>
28
+ </div>
29
+ );
30
+ };
31
+
32
+ export default ErrorModal;
demo/gradio/frontend/src/components/LoadingOverlay.tsx ADDED
@@ -0,0 +1,30 @@
1
+ import React from 'react';
2
+
3
+ const LoadingOverlay: React.FC = () => {
4
+ return (
5
+ <div className="fixed inset-0 bg-gray-500 bg-opacity-75 flex items-center justify-center z-50">
6
+ <div className="bg-white p-8 rounded-lg shadow-xl flex flex-col items-center">
7
+ <svg width="54" height="54" viewBox="0 0 54 54" fill="none" xmlns="http://www.w3.org/2000/svg" className="w-16 h-16 mb-4">
8
+ <path d="M5.92017 41.0562L27.0002 48.0802" stroke="#1C2B33" strokeWidth="2" strokeMiterlimit="10"/>
9
+ <path d="M5.92017 12.9438L27.0002 26.9998" stroke="#1C2B33" strokeWidth="2" strokeMiterlimit="10"/>
10
+ <path d="M27 5.91992L48.08 26.9999" stroke="#1C2B33" strokeWidth="2" strokeMiterlimit="10"/>
11
+ <path d="M5.92017 41.0559L27.0002 5.91992" stroke="#1C2B33" strokeWidth="2" strokeMiterlimit="10"/>
12
+ <path d="M27 48.08L48.08 27" stroke="#1C2B33" strokeWidth="2" strokeMiterlimit="10"/>
13
+ <path d="M27 27H48.08" stroke="#1C2B33" strokeWidth="2" strokeMiterlimit="10"/>
14
+ <path d="M5.92017 12.9439L27.0002 5.91992" stroke="#1C2B33" strokeWidth="2" strokeMiterlimit="10"/>
15
+ <path d="M5.92017 41.056L27.0002 27" stroke="#1C2B33" strokeWidth="2" strokeMiterlimit="10"/>
16
+ <path d="M5.92017 12.9438L27.0002 48.0798" stroke="#1C2B33" strokeWidth="2" strokeMiterlimit="10"/>
17
+ <path d="M26.9998 31.9201C29.7171 31.9201 31.9198 29.7173 31.9198 27.0001C31.9198 24.2828 29.7171 22.0801 26.9998 22.0801C24.2826 22.0801 22.0798 24.2828 22.0798 27.0001C22.0798 29.7173 24.2826 31.9201 26.9998 31.9201Z" fill="white" stroke="#1C2B33" strokeWidth="2" strokeMiterlimit="10"/>
18
+ <path d="M5.92 17.8639C8.63724 17.8639 10.84 15.6612 10.84 12.9439C10.84 10.2267 8.63724 8.02393 5.92 8.02393C3.20276 8.02393 1 10.2267 1 12.9439C1 15.6612 3.20276 17.8639 5.92 17.8639Z" fill="white" stroke="#1C2B33" strokeWidth="2" strokeMiterlimit="10"/>
19
+ <path d="M5.92 45.9757C8.63724 45.9757 10.84 43.773 10.84 41.0557C10.84 38.3385 8.63724 36.1357 5.92 36.1357C3.20276 36.1357 1 38.3385 1 41.0557C1 43.773 3.20276 45.9757 5.92 45.9757Z" fill="white" stroke="#1C2B33" strokeWidth="2" strokeMiterlimit="10"/>
20
+ <path d="M48.0806 31.9201C50.7979 31.9201 53.0006 29.7173 53.0006 27.0001C53.0006 24.2828 50.7979 22.0801 48.0806 22.0801C45.3634 22.0801 43.1606 24.2828 43.1606 27.0001C43.1606 29.7173 45.3634 31.9201 48.0806 31.9201Z" fill="white" stroke="#1C2B33" strokeWidth="2" strokeMiterlimit="10"/>
21
+ <path d="M26.9998 53.0002C29.7171 53.0002 31.9198 50.7974 31.9198 48.0802C31.9198 45.3629 29.7171 43.1602 26.9998 43.1602C24.2826 43.1602 22.0798 45.3629 22.0798 48.0802C22.0798 50.7974 24.2826 53.0002 26.9998 53.0002Z" fill="white" stroke="#1C2B33" strokeWidth="2" strokeMiterlimit="10"/>
22
+ <path d="M26.9998 10.84C29.7171 10.84 31.9198 8.63724 31.9198 5.92C31.9198 3.20276 29.7171 1 26.9998 1C24.2826 1 22.0798 3.20276 22.0798 5.92C22.0798 8.63724 24.2826 10.84 26.9998 10.84Z" fill="white" stroke="#1C2B33" strokeWidth="2" strokeMiterlimit="10"/>
23
+ </svg>
24
+ <p className="text-lg font-semibold text-gray-800">Loading image embedding...</p>
25
+ </div>
26
+ </div>
27
+ );
28
+ };
29
+
30
+ export default LoadingOverlay;
demo/gradio/frontend/src/components/QueueStatusIndicator.tsx ADDED
@@ -0,0 +1,29 @@
1
+ import React from 'react';
2
+ import { QueueStatus } from './helpers/Interfaces';
3
+
4
+ interface QueueStatusIndicatorProps {
5
+ queueStatus: QueueStatus;
6
+ }
7
+
8
+ const QueueStatusIndicator: React.FC<QueueStatusIndicatorProps> = ({ queueStatus }) => {
9
+ if (!queueStatus.inQueue) return null;
10
+
11
+ return (
12
+ <div className="fixed top-4 right-4 bg-white rounded-lg shadow-lg p-4 z-50">
13
+ <div className="flex flex-col gap-2">
14
+ {queueStatus.rank === 0 ? (
15
+ <p className="text-sm">You're next in line! ({queueStatus.queueSize} total in queue)</p>
16
+ ) : (
17
+ <p className="text-sm">Queue position: {queueStatus.rank! + 1} of {queueStatus.queueSize}</p>
18
+ )}
19
+ {queueStatus.rankEta && (
20
+ <p className="text-sm text-gray-600">
21
+ Estimated wait: {Math.ceil(queueStatus.rankEta)} seconds
22
+ </p>
23
+ )}
24
+ </div>
25
+ </div>
26
+ );
27
+ };
28
+
29
+ export default QueueStatusIndicator;
demo/gradio/frontend/src/components/Stage.tsx ADDED
@@ -0,0 +1,343 @@
1
+ // Copyright (c) Meta Platforms, Inc. and affiliates.
2
+ // All rights reserved.
3
+
4
+ // This source code is licensed under the license found in the
5
+ // LICENSE file in the root directory of this source tree.
6
+
7
+ import React, { useContext, useState, useEffect } from "react";
8
+ import * as _ from "underscore";
9
+ import Tool from "./Tool";
10
+ import { modelInputProps, QueueStatus } from "./helpers/Interfaces";
11
+ import AppContext from "./hooks/createContext";
12
+ // import { describeMask } from '../services/maskApi';
13
+
14
+ interface DescriptionState {
15
+ state: string; // 'ready', 'describing', 'described'
16
+ description: string;
17
+ }
18
+
19
+ interface StageProps {
20
+ onImageUpload: (event: React.ChangeEvent<HTMLInputElement>) => Promise<void>;
21
+ descriptionState: DescriptionState;
22
+ setDescriptionState: React.Dispatch<React.SetStateAction<DescriptionState>>;
23
+ queueStatus: QueueStatus;
24
+ setQueueStatus: (status: QueueStatus) => void;
25
+ }
26
+
27
+ const EXAMPLE_IMAGES = Array.from({ length: 21 }, (_, i) => `/examples/${i + 1}.jpg`);
28
+ const BREAKPOINT_MEDIUM = 2100;
29
+ const BREAKPOINT_SMALL = 1100;
30
+
31
+ const Stage = ({ onImageUpload, descriptionState, setDescriptionState, queueStatus, setQueueStatus }: StageProps) => {
32
+ const {
33
+ clicks: [, setClicks],
34
+ image: [image],
35
+ maskImg: [maskImg],
36
+ maskImgData: [maskImgData]
37
+ } = useContext(AppContext)!;
38
+
39
+ const [isDragging, setIsDragging] = useState(false);
40
+ const [currentPage, setCurrentPage] = useState(1);
41
+ const [imagesPerPage, setImagesPerPage] = useState(8);
42
+
43
+ useEffect(() => {
44
+ const handleResize = () => {
45
+ if (window.innerWidth < BREAKPOINT_SMALL) {
46
+ setImagesPerPage(1);
47
+ } else if (window.innerWidth < BREAKPOINT_MEDIUM) {
48
+ setImagesPerPage(4);
49
+ } else {
50
+ setImagesPerPage(8);
51
+ }
52
+ };
53
+
54
+ // Set initial value
55
+ handleResize();
56
+
57
+ // Add event listener
58
+ window.addEventListener('resize', handleResize);
59
+
60
+ // Cleanup
61
+ return () => window.removeEventListener('resize', handleResize);
62
+ }, []);
63
+
64
+ const getClick = (x: number, y: number): modelInputProps => {
65
+ const clickType = 1;
66
+ return { x, y, clickType };
67
+ };
68
+
69
+ const handleMouseMove = _.throttle((e: any) => {
70
+ if (descriptionState.state !== 'ready') return;
71
+ if (e.clientX === undefined || e.clientY === undefined) {
72
+ console.warn('Mouse move event does not contain clientX or clientY');
73
+ return;
74
+ }
75
+ let el = e.nativeEvent.target;
76
+ const rect = el.getBoundingClientRect();
77
+
78
+ // Calculate the actual dimensions of the contained image
79
+ const containerAspectRatio = el.offsetWidth / el.offsetHeight;
80
+ const imageAspectRatio = image ? image.width / image.height : 1;
81
+
82
+ let renderedWidth, renderedHeight;
83
+ if (containerAspectRatio > imageAspectRatio) {
84
+ // Image is constrained by height
85
+ renderedHeight = el.offsetHeight;
86
+ renderedWidth = renderedHeight * imageAspectRatio;
87
+ } else {
88
+ // Image is constrained by width
89
+ renderedWidth = el.offsetWidth;
90
+ renderedHeight = renderedWidth / imageAspectRatio;
91
+ }
92
+
93
+ // Calculate the empty space offset
94
+ const offsetX = (el.offsetWidth - renderedWidth) / 2;
95
+ const offsetY = (el.offsetHeight - renderedHeight) / 2;
96
+
97
+ // Get click position relative to the actual image
98
+ let x = e.clientX - rect.left - offsetX;
99
+ let y = e.clientY - rect.top - offsetY;
100
+
101
+ // Convert to original image coordinates
102
+ const scaleX = image ? image.width / renderedWidth : 1;
103
+ const scaleY = image ? image.height / renderedHeight : 1;
104
+ x *= scaleX;
105
+ y *= scaleY;
106
+
107
+ // Ensure coordinates are within bounds
108
+ if (image) {
109
+ x = Math.max(0, Math.min(x, image.width));
110
+ y = Math.max(0, Math.min(y, image.height));
111
+ }
112
+
113
+ const click = getClick(x, y);
114
+ if (click) {
115
+ setClicks([click]);
116
+ }
117
+ }, 15);
118
+
119
+ const handleDragEnter = (e: React.DragEvent) => {
120
+ e.preventDefault();
121
+ e.stopPropagation();
122
+ setIsDragging(true);
123
+ };
124
+
125
+ const handleDragLeave = (e: React.DragEvent) => {
126
+ e.preventDefault();
127
+ e.stopPropagation();
128
+ setIsDragging(false);
129
+ };
130
+
131
+ const handleDragOver = (e: React.DragEvent) => {
132
+ e.preventDefault();
133
+ e.stopPropagation();
134
+ };
135
+
136
+ const handleDrop = async (e: React.DragEvent) => {
137
+ e.preventDefault();
138
+ e.stopPropagation();
139
+ setIsDragging(false);
140
+
141
+ const files = e.dataTransfer.files;
142
+ if (files && files[0]) {
143
+ const file = files[0];
144
+ // Cast to unknown first, then to the desired type
145
+ const syntheticEvent = {
146
+ target: {
147
+ files: [file]
148
+ }
149
+ } as unknown as React.ChangeEvent<HTMLInputElement>;
150
+
151
+ onImageUpload(syntheticEvent);
152
+ }
153
+ };
154
+
155
+ const flexCenterClasses = "flex items-center justify-center";
156
+
157
+ // const handleDescribeMask = async () => {
158
+ // if (!maskImg || !maskImgData || !image) {
159
+ // console.warn('No mask or image available to describe');
160
+ // return;
161
+ // }
162
+
163
+ // try {
164
+ // const canvas = document.createElement('canvas');
165
+ // canvas.width = image.width;
166
+ // canvas.height = image.height;
167
+ // const ctx = canvas.getContext('2d');
168
+ // ctx?.drawImage(image, 0, 0);
169
+ // const imageBase64 = canvas.toDataURL('image/jpeg').split(',')[1];
170
+ // const maskBase64 = maskImgData.split(',')[1];
171
+
172
+ // const result = await describeMask(maskBase64, imageBase64);
173
+ // console.log('Mask description:', result.description);
174
+
175
+ // alert("Mask description: " + result.description);
176
+ // } catch (error) {
177
+ // console.error('Failed to describe mask:', error);
178
+ // }
179
+ // };
180
+
181
+ return (
182
+ <div
183
+ className={`flex flex-col w-full h-full relative`}
184
+ onDragEnter={handleDragEnter}
185
+ onDragOver={handleDragOver}
186
+ onDragLeave={handleDragLeave}
187
+ onDrop={handleDrop}
188
+ >
189
+ {/* Title and Description */}
190
+ <div className="w-full px-8 mb-8 flex flex-col justify-center mt-4">
191
+ <div className="flex flex-col sm:flex-row justify-between items-center gap-4">
192
+ <h1 className="text-3xl font-bold text-center sm:text-left"><a href="/">Describe Anything Model Demo</a></h1>
193
+ <div className="flex flex-wrap justify-center gap-4 sm:space-x-8 text-lg font-medium">
194
+ <a href="https://describe-anything.github.io/" target="_blank" rel="noopener noreferrer" className="text-gray-600 hover:text-gray-800">Project Page</a>
195
+ <a href="https://github.com/NVlabs/describe-anything?tab=readme-ov-file#simple-gradio-demo-for-detailed-localized-video-descriptions" target="_blank" rel="noopener noreferrer" className="text-gray-600 hover:text-gray-800">DAM for video</a>
196
+ </div>
197
+ </div>
198
+ <div className="border-b border-gray-300 mt-4 mb-4"></div>
199
+ {!image && <div className="space-y-4 text-gray-600 text-left">
200
+           <p>The Describe Anything Model (DAM) takes in a region of an image or a video in the form of points/boxes/scribbles/masks and outputs a detailed description of the region. For videos, it is sufficient to supply the annotation on any single frame.</p>
201
+           <p>This demo supports the DAM model that takes points on images as queries. For other use cases, please refer to the <a href="" className="text-gray-600 hover:text-gray-800 underline">inference scripts and video demo</a> for more details.</p>
202
+ </div>}
203
+ </div>
204
+
205
+ {/* Main Content Area */}
206
+ <div className={`flex items-center justify-center flex-grow overflow-hidden`}>
207
+ {/* Main Stage */}
208
+ <div
209
+ className={`${flexCenterClasses} relative w-full h-full max-h-[calc(100vh-300px)] px-8 ${
210
+ isDragging ? 'border-4 border-dashed border-blue-500 bg-blue-50' : ''
211
+ }`}
212
+ >
213
+ {image ? (
214
+ <>
215
+ <Tool
216
+ handleMouseMove={handleMouseMove}
217
+ descriptionState={descriptionState}
218
+ setDescriptionState={setDescriptionState}
219
+ queueStatus={queueStatus}
220
+ setQueueStatus={setQueueStatus}
221
+ />
222
+ </>
223
+ ) : (
224
+ <>
225
+ <div className="flex flex-col items-center gap-6 w-full h-full">
226
+ <div className="flex-1" />
227
+
228
+ <div className="text-gray-500 text-lg">
229
+ {isDragging ? 'Drop image here' : 'Upload your own image'}
230
+ </div>
231
+ <div className="flex gap-4 mb-8">
232
+ <input
233
+ type="file"
234
+ id="imageUpload"
235
+ accept="image/*"
236
+ onChange={onImageUpload}
237
+ className="hidden"
238
+ />
239
+ <label
240
+ htmlFor="imageUpload"
241
+ className="bg-blue-500 hover:bg-blue-700 text-white font-bold py-2 px-4 rounded cursor-pointer"
242
+ >
243
+ Upload Image
244
+ </label>
245
+ </div>
246
+
247
+ <div className="text-gray-500 text-lg">
248
+ or choose an example image below
249
+ </div>
250
+
251
+ <div className="relative w-full max-w-[2200px]">
252
+ {/* Left Arrow */}
253
+ <button
254
+ onClick={() => setCurrentPage(prev => Math.max(prev - 1, 1))}
255
+ disabled={currentPage === 1}
256
+ className={`absolute left-0 top-1/2 -translate-y-1/2 z-10 p-4 ${
257
+ currentPage === 1
258
+ ? 'text-gray-300 cursor-not-allowed'
259
+ : 'text-gray-600 hover:text-gray-800'
260
+ }`}
261
+ >
262
+ <svg xmlns="http://www.w3.org/2000/svg" className="h-8 w-8" fill="none" viewBox="0 0 24 24" stroke="currentColor">
263
+ <path strokeLinecap="round" strokeLinejoin="round" strokeWidth={2} d="M15 19l-7-7 7-7" />
264
+ </svg>
265
+ </button>
266
+
267
+ {/* Example Images */}
268
+ <div className="flex flex-wrap justify-center gap-8 px-16">
269
+ {EXAMPLE_IMAGES.slice(
270
+ (currentPage - 1) * imagesPerPage,
271
+ currentPage * imagesPerPage
272
+ ).map((src, index) => (
273
+ <img
274
+ key={index}
275
+ src={src}
276
+ alt={`Example ${index + 1}`}
277
+ className="w-[200px] h-[150px] object-cover rounded-sm cursor-pointer hover:opacity-80 transition-opacity"
278
+ onClick={() => {
279
+ fetch(src)
280
+ .then(res => res.blob())
281
+ .then(blob => {
282
+ const file = new File([blob], `example-${index + 1}.jpg`, { type: 'image/jpeg' });
283
+ const syntheticEvent = {
284
+ target: {
285
+ files: [file]
286
+ }
287
+ } as unknown as React.ChangeEvent<HTMLInputElement>;
288
+
289
+ onImageUpload(syntheticEvent);
290
+ });
291
+ }}
292
+ />
293
+ ))}
294
+ </div>
295
+
296
+ {/* Right Arrow */}
297
+ <button
298
+ onClick={() => setCurrentPage(prev => Math.min(prev + 1, Math.ceil(EXAMPLE_IMAGES.length / imagesPerPage)))}
299
+ disabled={currentPage === Math.ceil(EXAMPLE_IMAGES.length / imagesPerPage)}
300
+ className={`absolute right-0 top-1/2 -translate-y-1/2 z-10 p-4 ${
301
+ currentPage === Math.ceil(EXAMPLE_IMAGES.length / imagesPerPage)
302
+ ? 'text-gray-300 cursor-not-allowed'
303
+ : 'text-gray-600 hover:text-gray-800'
304
+ }`}
305
+ >
306
+ <svg xmlns="http://www.w3.org/2000/svg" className="h-8 w-8" fill="none" viewBox="0 0 24 24" stroke="currentColor">
307
+ <path strokeLinecap="round" strokeLinejoin="round" strokeWidth={2} d="M9 5l7 7-7 7" />
308
+ </svg>
309
+ </button>
310
+
311
+ {/* Page Indicator */}
312
+ {/* <div className="w-full text-center mt-4 text-gray-600">
313
+ Page {currentPage} of {Math.ceil(EXAMPLE_IMAGES.length / imagesPerPage)}
314
+ </div> */}
315
+ </div>
316
+
317
+ <div className="flex-1" /> {/* Bottom spacer */}
318
+ {/* Image Credits */}
319
+ {!image && (
320
+ <div className="pl-5 pr-5 text-gray-500 text-sm">
321
+ Image credit for example images: {' '}
322
+ <a
323
+ href="https://segment-anything.com/terms"
324
+ target="_blank"
325
+ className="text-gray-600 hover:text-gray-800 underline"
326
+ >
327
+ Segment Anything Materials
328
+ </a>
329
+ {' '}(CC BY-SA 4.0)
330
+ </div>
331
+ )}
332
+ </div>
333
+ </>
334
+ )}
335
+ </div>
336
+ </div>
337
+
338
+ </div>
339
+ );
340
+ };
341
+
342
+ export default Stage;
343
+ export type { DescriptionState };
demo/gradio/frontend/src/components/Tool.tsx ADDED
@@ -0,0 +1,182 @@
1
+ import React, { useContext, useEffect, useState } from "react";
2
+ import AppContext from "./hooks/createContext";
3
+ import { ToolProps, QueueStatus } from "./helpers/Interfaces";
4
+ import * as _ from "underscore";
5
+ import { describeMask, describeMaskWithoutStreaming } from "../services/maskApi";
6
+ import ErrorModal from './ErrorModal';
7
+ import { DescriptionState } from "./Stage";
8
+
9
+ const prompt = "<image>\nDescribe the masked region in detail.";
10
+
11
+ const Tool = ({
12
+ handleMouseMove,
13
+ descriptionState,
14
+ setDescriptionState,
15
+ queueStatus,
16
+ setQueueStatus
17
+ }: ToolProps) => {
18
+ console.log("Tool handleMouseMove");
19
+ const {
20
+ image: [image],
21
+ maskImg: [maskImg, setMaskImg],
22
+ maskImgData: [maskImgData, setMaskImgData],
23
+ isClicked: [isClicked, setIsClicked]
24
+ } = useContext(AppContext)!;
25
+
26
+ const [shouldFitToWidth, setShouldFitToWidth] = useState(true);
27
+ const bodyEl = document.body;
28
+ const fitToPage = () => {
29
+ if (!image) return;
30
+ const maxWidth = window.innerWidth - 64; // Account for padding (32px on each side)
31
+ const maxHeight = window.innerHeight - 200; // Account for header and some padding
32
+ const imageAspectRatio = image.width / image.height;
33
+ const containerAspectRatio = maxWidth / maxHeight;
34
+
35
+ setShouldFitToWidth(
36
+ imageAspectRatio > containerAspectRatio ||
37
+ image.width > maxWidth
38
+ );
39
+ };
40
+ const resizeObserver = new ResizeObserver((entries) => {
41
+ for (const entry of entries) {
42
+ if (entry.target === bodyEl) {
43
+ fitToPage();
44
+ }
45
+ }
46
+ });
47
+ useEffect(() => {
48
+ fitToPage();
49
+ resizeObserver.observe(bodyEl);
50
+ return () => {
51
+ resizeObserver.unobserve(bodyEl);
52
+ };
53
+ }, [image]);
54
+
55
+ const imageClasses = "";
56
+ const maskImageClasses = `absolute opacity-40 pointer-events-none`;
57
+
58
+ const [error, setError] = useState<string | null>(null);
59
+ const [useStreaming, setUseStreaming] = useState(true);
60
+
61
+ useEffect(() => {
62
+ if (!isClicked || !maskImg || !maskImgData || !image || descriptionState.state !== 'ready') {
63
+ console.log("Not ready to call model, isClicked:", isClicked, "maskImg:", maskImg !== null, "maskImgData:", maskImgData !== null, "image:", image !== null, "descriptionState.state:", descriptionState.state);
64
+ return;
65
+ }
66
+
67
+ try {
68
+ setDescriptionState({
69
+ state: 'describing',
70
+ description: ''
71
+ } as DescriptionState);
72
+
73
+ const canvas = document.createElement('canvas');
74
+ canvas.width = image.width;
75
+ canvas.height = image.height;
76
+ const ctx = canvas.getContext('2d');
77
+ ctx?.drawImage(image, 0, 0);
78
+ const imageBase64 = canvas.toDataURL('image/jpeg').split(',')[1];
79
+ const maskBase64 = maskImgData.split(',')[1];
80
+
81
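+       // Try the SSE streaming endpoint first; if it fails, fall back once to the blocking endpoint and remember the choice in useStreaming for later requests.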
+ const describeMaskWithFallback = async (useStreamingInFunction: boolean) => {
82
+ try {
83
+ let result;
84
+ console.log("useStreaming", useStreaming, "useStreamingInFunction", useStreamingInFunction);
85
+ if (useStreamingInFunction) {
86
+ result = await describeMask(
87
+ maskBase64,
88
+ imageBase64,
89
+ prompt,
90
+ (streamResult: string) => {
91
+ setDescriptionState({
92
+ state: 'describing',
93
+ description: streamResult
94
+ } as DescriptionState);
95
+ },
96
+ (status: QueueStatus) => {
97
+ setQueueStatus(status);
98
+ }
99
+ );
100
+ } else {
101
+ result = await describeMaskWithoutStreaming(
102
+ maskBase64,
103
+ imageBase64,
104
+ prompt
105
+ );
106
+ }
107
+
108
+ setDescriptionState({
109
+ state: 'described',
110
+ description: result
111
+ } as DescriptionState);
112
+ setQueueStatus({ inQueue: false });
113
+ setIsClicked(false);
114
+ } catch (error) {
115
+           if (useStreamingInFunction) { // only fall back if this attempt used streaming; otherwise surface the error
116
+ console.log("Error describing mask, switching to non-streaming", error);
117
+ setUseStreaming(false);
118
+ describeMaskWithFallback(false);
119
+ } else {
120
+ setError('Failed to generate description. Please try again.');
121
+ setDescriptionState({
122
+ state: 'ready',
123
+ description: ''
124
+ } as DescriptionState);
125
+ setIsClicked(false);
126
+ console.error('Failed to describe mask:', error);
127
+ }
128
+ }
129
+ };
130
+
131
+ describeMaskWithFallback(useStreaming);
132
+
133
+ } catch (error) {
134
+ setIsClicked(false);
135
+ setError('Failed to generate description. Please try again.');
136
+ setDescriptionState({
137
+ state: 'ready',
138
+ description: ''
139
+ } as DescriptionState);
140
+ console.error('Failed to describe mask:', error);
141
+ }
142
+ }, [maskImgData]);
143
+
144
+ const handleClick = async (e: React.MouseEvent<HTMLImageElement>) => {
145
+ if (descriptionState.state !== 'ready') return;
146
+
147
+ setMaskImg(null);
148
+ setMaskImgData(null);
149
+ setIsClicked(true);
150
+ handleMouseMove(e);
151
+ };
152
+
153
+ return (
154
+ <>
155
+ {error && <ErrorModal message={error} onClose={() => setError(null)} />}
156
+ <div className="relative flex items-center justify-center w-full h-full">
157
+ {image && (
158
+ <img
159
+ onMouseMove={handleMouseMove}
160
+ onMouseLeave={() => _.defer(() => (descriptionState.state === 'ready' && !isClicked) ? setMaskImg(null) : undefined)}
161
+ onTouchStart={handleMouseMove}
162
+ onClick={handleClick}
163
+ src={image.src}
164
+ className={`${
165
+ shouldFitToWidth ? "w-full" : "h-full"
166
+ } ${imageClasses} object-contain max-h-full max-w-full`}
167
+ ></img>
168
+ )}
169
+ {maskImg && (
170
+ <img
171
+ src={maskImg.src}
172
+ className={`${
173
+ shouldFitToWidth ? "w-full" : "h-full"
174
+ } ${maskImageClasses} object-contain max-h-full max-w-full`}
175
+ ></img>
176
+ )}
177
+ </div>
178
+ </>
179
+ );
180
+ };
181
+
182
+ export default Tool;
demo/gradio/frontend/src/components/helpers/Interfaces.tsx ADDED
@@ -0,0 +1,47 @@
1
+ // Copyright (c) Meta Platforms, Inc. and affiliates.
2
+ // All rights reserved.
3
+
4
+ // This source code is licensed under the license found in the
5
+ // LICENSE file in the root directory of this source tree.
6
+
7
+ import { Tensor } from "onnxruntime-web";
8
+ import { DescriptionState } from "../Stage";
9
+
10
+ export interface modelScaleProps {
11
+ samScale: number;
12
+ height: number;
13
+ width: number;
14
+ }
15
+
16
+ export interface modelInputProps {
17
+ x: number;
18
+ y: number;
19
+ clickType: number;
20
+ }
21
+
22
+ export interface modeDataProps {
23
+ clicks?: Array<modelInputProps>;
24
+ tensor: Tensor;
25
+ modelScale: modelScaleProps;
26
+ }
27
+
28
+ export interface ToolProps {
29
+ handleMouseMove: (e: any) => void;
30
+ descriptionState: DescriptionState;
31
+ setDescriptionState: (value: DescriptionState) => void;
32
+ queueStatus: QueueStatus;
33
+ setQueueStatus: (value: QueueStatus) => void;
34
+ }
35
+
36
+ export interface StageProps {
37
+ onImageUpload: (event: React.ChangeEvent<HTMLInputElement>) => void;
38
+ descriptionState: DescriptionState;
39
+ setDescriptionState: (value: DescriptionState) => void;
40
+ }
41
+
42
+ export interface QueueStatus {
43
+ inQueue: boolean;
44
+ rank?: number;
45
+ queueSize?: number;
46
+ rankEta?: number | null;
47
+ }
demo/gradio/frontend/src/components/helpers/imageUtils.tsx ADDED
@@ -0,0 +1,21 @@
1
+ import { Buffer } from 'buffer';
2
+
3
+ export const base64ToImage = async (base64String: string): Promise<HTMLImageElement> => {
4
+ return new Promise((resolve, reject) => {
5
+ const img = new Image();
6
+ img.onload = () => resolve(img);
7
+ img.onerror = reject;
8
+ img.src = base64String.startsWith('data:') ?
9
+ base64String :
10
+ `data:image/png;base64,${base64String}`;
11
+ });
12
+ };
13
+
14
+ export const imageToBase64 = (img: HTMLImageElement): string => {
15
+ const canvas = document.createElement('canvas');
16
+ canvas.width = img.width;
17
+ canvas.height = img.height;
18
+ const ctx = canvas.getContext('2d');
19
+ ctx?.drawImage(img, 0, 0);
20
+ return canvas.toDataURL('image/png');
21
+ };
demo/gradio/frontend/src/components/helpers/maskUtils.tsx ADDED
@@ -0,0 +1,65 @@
1
+ // Copyright (c) Meta Platforms, Inc. and affiliates.
2
+ // All rights reserved.
3
+
4
+ // This source code is licensed under the license found in the
5
+ // LICENSE file in the root directory of this source tree.
6
+
7
+ // Convert the onnx model mask prediction to ImageData
8
+ function arrayToImageData(input: any, width: number, height: number, binary: boolean) {
9
+   let [r, g, b, a] = [0, 114, 189, 255]; // the mask's blue color
10
+   let [r_bg, g_bg, b_bg, a_bg] = [0, 0, 0, 0]; // transparent background
11
+   if (binary) {
12
+     [r, g, b, a] = [255, 255, 255, 255]; // binary mode: white foreground
13
+     [r_bg, g_bg, b_bg, a_bg] = [0, 0, 0, 255]; // binary mode: opaque black background
14
+ }
15
+
16
+ const arr = new Uint8ClampedArray(4 * width * height).fill(0);
17
+ for (let i = 0; i < input.length; i++) {
18
+
19
+ // Threshold the onnx model mask prediction at 0.0
20
+ // This is equivalent to thresholding the mask using predictor.model.mask_threshold
21
+ // in python
22
+ if (input[i] > 0.0) {
23
+ arr[4 * i + 0] = r;
24
+ arr[4 * i + 1] = g;
25
+ arr[4 * i + 2] = b;
26
+ arr[4 * i + 3] = a;
27
+ } else if (binary){
28
+ arr[4 * i + 0] = r_bg;
29
+ arr[4 * i + 1] = g_bg;
30
+ arr[4 * i + 2] = b_bg;
31
+ arr[4 * i + 3] = a_bg;
32
+ }
33
+ }
34
+ return new ImageData(arr, height, width);
35
+ }
36
+
37
+ // Use a Canvas element to produce an image from ImageData
38
+ function imageDataToImage(imageData: ImageData) {
39
+ const canvas = imageDataToCanvas(imageData);
40
+ const image = new Image();
41
+ image.src = canvas.toDataURL();
42
+ return image;
43
+ }
44
+
45
+ function imageDataToURL(imageData: ImageData) {
46
+ const canvas = imageDataToCanvas(imageData);
47
+ return canvas.toDataURL();
48
+ }
49
+
50
+ // Canvas elements can be created from ImageData
51
+ function imageDataToCanvas(imageData: ImageData) {
52
+ const canvas = document.createElement("canvas");
53
+ const ctx = canvas.getContext("2d");
54
+ canvas.width = imageData.width;
55
+ canvas.height = imageData.height;
56
+ ctx?.putImageData(imageData, 0, 0);
57
+ return canvas;
58
+ }
59
+
60
+ // Convert the onnx model mask output to an HTMLImageElement
61
+ function onnxMaskToImage(input: any, width: number, height: number, binary: boolean) {
62
+ return imageDataToImage(arrayToImageData(input, width, height, binary));
63
+ }
64
+
65
+ export { arrayToImageData, imageDataToImage, onnxMaskToImage, imageDataToURL };
demo/gradio/frontend/src/components/helpers/onnxModelAPI.tsx ADDED
@@ -0,0 +1,71 @@
1
+ // Copyright (c) Meta Platforms, Inc. and affiliates.
2
+ // All rights reserved.
3
+
4
+ // This source code is licensed under the license found in the
5
+ // LICENSE file in the root directory of this source tree.
6
+
7
+ import { Tensor } from "onnxruntime-web";
8
+ import { modeDataProps } from "./Interfaces";
9
+
10
+ const modelData = ({ clicks, tensor, modelScale }: modeDataProps) => {
11
+ const imageEmbedding = tensor;
12
+ let pointCoords;
13
+ let pointLabels;
14
+ let pointCoordsTensor;
15
+ let pointLabelsTensor;
16
+
17
+ // Check there are input click prompts
18
+ if (clicks) {
19
+ let n = clicks.length;
20
+
21
+ // If there is no box input, a single padding point with
22
+ // label -1 and coordinates (0.0, 0.0) should be concatenated
23
+ // so initialize the array to support (n + 1) points.
24
+ pointCoords = new Float32Array(2 * (n + 1));
25
+ pointLabels = new Float32Array(n + 1);
26
+
27
+ // Add clicks and scale to what SAM expects
28
+ for (let i = 0; i < n; i++) {
29
+ pointCoords[2 * i] = clicks[i].x * modelScale.samScale;
30
+ pointCoords[2 * i + 1] = clicks[i].y * modelScale.samScale;
31
+ pointLabels[i] = clicks[i].clickType;
32
+ }
33
+
34
+ // Add in the extra point/label when only clicks and no box
35
+ // The extra point is at (0, 0) with label -1
36
+ pointCoords[2 * n] = 0.0;
37
+ pointCoords[2 * n + 1] = 0.0;
38
+ pointLabels[n] = -1.0;
39
+
40
+ // Create the tensor
41
+ pointCoordsTensor = new Tensor("float32", pointCoords, [1, n + 1, 2]);
42
+ pointLabelsTensor = new Tensor("float32", pointLabels, [1, n + 1]);
43
+ }
44
+ const imageSizeTensor = new Tensor("float32", [
45
+ modelScale.height,
46
+ modelScale.width,
47
+ ]);
48
+
49
+ if (pointCoordsTensor === undefined || pointLabelsTensor === undefined)
50
+ return;
51
+
52
+ // There is no previous mask, so default to an empty tensor
53
+ const maskInput = new Tensor(
54
+ "float32",
55
+ new Float32Array(256 * 256),
56
+ [1, 1, 256, 256]
57
+ );
58
+ // There is no previous mask, so default to 0
59
+ const hasMaskInput = new Tensor("float32", [0]);
60
+
61
+ return {
62
+ image_embeddings: imageEmbedding,
63
+ point_coords: pointCoordsTensor,
64
+ point_labels: pointLabelsTensor,
65
+ orig_im_size: imageSizeTensor,
66
+ mask_input: maskInput,
67
+ has_mask_input: hasMaskInput,
68
+ };
69
+ };
70
+
71
+ export { modelData };
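For reference, a minimal sketch of how the feeds built by modelData() are consumed, mirroring the model.run(feeds) and model.outputNames[0] usage in App.tsx above. The runDecoder wrapper and the decoder path "/model/sam_decoder_example.onnx" are illustrative assumptions, not part of the uploaded files:

    import { InferenceSession, Tensor } from "onnxruntime-web";
    import { modelData } from "./onnxModelAPI";
    import { modelInputProps, modelScaleProps } from "./Interfaces";

    // Run the SAM decoder for a set of click prompts. The real session is created
    // and owned by App.tsx; the path below is an assumed placeholder.
    async function runDecoder(
      tensor: Tensor,
      modelScale: modelScaleProps,
      clicks: modelInputProps[]
    ) {
      const model = await InferenceSession.create("/model/sam_decoder_example.onnx"); // assumed path
      const feeds = modelData({ clicks, tensor, modelScale });
      if (feeds === undefined) return null; // no click prompts were provided
      const results = await model.run(feeds);
      // The first output holds mask logits; values > 0 are rendered as foreground in maskUtils.tsx.
      return results[model.outputNames[0]];
    }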
demo/gradio/frontend/src/components/helpers/scaleHelper.tsx ADDED
@@ -0,0 +1,18 @@
1
+ // Copyright (c) Meta Platforms, Inc. and affiliates.
2
+ // All rights reserved.
3
+
4
+ // This source code is licensed under the license found in the
5
+ // LICENSE file in the root directory of this source tree.
6
+
7
+
8
+ // Helper function for handling image scaling needed for SAM
9
+ const handleImageScale = (image: HTMLImageElement) => {
10
+ // Input images to SAM must be resized so the longest side is 1024
11
+ const LONG_SIDE_LENGTH = 1024;
12
+ let w = image.naturalWidth;
13
+ let h = image.naturalHeight;
14
+ const samScale = LONG_SIDE_LENGTH / Math.max(h, w);
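+   // e.g. a 2048x1536 image gives samScale = 0.5; onnxModelAPI.tsx multiplies click coordinates by this factor before sending them to the decoder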
15
+ return { height: h, width: w, samScale };
16
+ };
17
+
18
+ export { handleImageScale };
demo/gradio/frontend/src/components/hooks/context.tsx ADDED
@@ -0,0 +1,35 @@
1
+ // Copyright (c) Meta Platforms, Inc. and affiliates.
2
+ // All rights reserved.
3
+
4
+ // This source code is licensed under the license found in the
5
+ // LICENSE file in the root directory of this source tree.
6
+
7
+ import React, { useState } from "react";
8
+ import { modelInputProps } from "../helpers/Interfaces";
9
+ import AppContext from "./createContext";
10
+
11
+ const AppContextProvider = (props: {
12
+ children: React.ReactElement<any, string | React.JSXElementConstructor<any>>;
13
+ }) => {
14
+ const [clicks, setClicks] = useState<Array<modelInputProps> | null>(null);
15
+ const [image, setImage] = useState<HTMLImageElement | null>(null);
16
+ const [maskImg, setMaskImg] = useState<HTMLImageElement | null>(null);
17
+ const [maskImgData, setMaskImgData] = useState<string | null>(null);
18
+ const [isClicked, setIsClicked] = useState<boolean>(false);
19
+
20
+ return (
21
+ <AppContext.Provider
22
+ value={{
23
+ clicks: [clicks, setClicks],
24
+ image: [image, setImage],
25
+ maskImg: [maskImg, setMaskImg],
26
+ maskImgData: [maskImgData, setMaskImgData],
27
+ isClicked: [isClicked, setIsClicked],
28
+ }}
29
+ >
30
+ {props.children}
31
+ </AppContext.Provider>
32
+ );
33
+ };
34
+
35
+ export default AppContextProvider;
demo/gradio/frontend/src/components/hooks/createContext.tsx ADDED
@@ -0,0 +1,35 @@
1
+ // Copyright (c) Meta Platforms, Inc. and affiliates.
2
+ // All rights reserved.
3
+
4
+ // This source code is licensed under the license found in the
5
+ // LICENSE file in the root directory of this source tree.
6
+
7
+ import { createContext } from "react";
8
+ import { modelInputProps } from "../helpers/Interfaces";
9
+
10
+ interface contextProps {
11
+ clicks: [
12
+ clicks: modelInputProps[] | null,
13
+ setClicks: (e: modelInputProps[] | null) => void
14
+ ];
15
+ image: [
16
+ image: HTMLImageElement | null,
17
+ setImage: (e: HTMLImageElement | null) => void
18
+ ];
19
+ maskImg: [
20
+ maskImg: HTMLImageElement | null,
21
+ setMaskImg: (e: HTMLImageElement | null) => void
22
+ ];
23
+ maskImgData: [
24
+ maskImgData: string | null,
25
+ setMaskImgData: (e: string | null) => void
26
+ ];
27
+ isClicked: [
28
+ isClicked: boolean,
29
+ setIsClicked: (e: boolean) => void
30
+ ];
31
+ }
32
+
33
+ const AppContext = createContext<contextProps | null>(null);
34
+
35
+ export default AppContext;
demo/gradio/frontend/src/index.tsx ADDED
@@ -0,0 +1,17 @@
1
+ // Copyright (c) Meta Platforms, Inc. and affiliates.
2
+ // All rights reserved.
3
+
4
+ // This source code is licensed under the license found in the
5
+ // LICENSE file in the root directory of this source tree.
6
+
7
+ import * as React from "react";
8
+ import { createRoot } from "react-dom/client";
9
+ import AppContextProvider from "./components/hooks/context";
10
+ import App from "./App";
11
+ const container = document.getElementById("root");
12
+ const root = createRoot(container!);
13
+ root.render(
14
+ <AppContextProvider>
15
+ <App/>
16
+ </AppContextProvider>
17
+ );
demo/gradio/frontend/src/services/maskApi.tsx ADDED
@@ -0,0 +1,211 @@
1
+ import axios from 'axios';
2
+ import * as _ from 'underscore';
3
+
4
+ const API_URL = process.env.NODE_ENV === 'development' ? 'http://localhost:7860/gradio_api' : '/gradio_api';
5
+
6
+ export const describeMaskWithoutStreaming = _.throttle(async (
7
+ maskBase64: string,
8
+ imageBase64: string,
9
+ query: string
10
+ ): Promise<string> => {
11
+ try {
12
+ const response = await axios.post(`${API_URL}/run/describe_without_streaming`, {
13
+ data: [imageBase64, maskBase64, query],
14
+ });
15
+
16
+ console.log("response", response.data);
17
+ return response.data.data[0];
18
+ } catch (error) {
19
+ console.error('Error describing mask:', error);
20
+ throw error;
21
+ }
22
+ }, 100);
23
+
24
+ export const describeMask = _.throttle(async (
25
+ maskBase64: string,
26
+ imageBase64: string,
27
+ query: string,
28
+ onStreamUpdate: (token: string) => void,
29
+ onQueueUpdate?: (status: {
30
+ inQueue: boolean,
31
+ rank?: number,
32
+ queueSize?: number,
33
+ rankEta?: number | null
34
+ }) => void
35
+ ): Promise<string> => {
36
+ console.log("describeMask");
37
+ const initiateResponse = await axios.post(`${API_URL}/call/describe`, {
38
+ data: [imageBase64, maskBase64, query],
39
+ });
40
+
41
+ const eventId = initiateResponse.data.event_id;
42
+
43
+ const response = await axios.get(`${API_URL}/queue/data?session_hash=${eventId}`, {
44
+ headers: {
45
+ 'Accept': 'text/event-stream',
46
+ },
47
+ responseType: 'stream',
48
+ adapter: 'fetch',
49
+ });
50
+
51
+ const stream = response.data;
52
+ const reader = stream.pipeThrough(new TextDecoderStream()).getReader();
53
+
54
+ let result = '';
55
+ let partialMessage = '';
56
+
57
+ while (true) {
58
+ const { value, done } = await reader.read();
59
+ if (done) {
60
+ return result;
61
+ }
62
+
63
+ // Concatenate with any previous partial message
64
+ const currentData = partialMessage + value;
65
+ const lines = currentData.split('\n');
66
+
67
+ // Save the last line if it's incomplete
68
+ partialMessage = lines[lines.length - 1];
69
+
70
+ // Process all complete lines except the last one
71
+ let eventType = '';
72
+ for (let i = 0; i < lines.length - 1; i++) {
73
+ const line = lines[i];
74
+ if (line.startsWith('event: ')) {
75
+ eventType = line.slice(7); // Remove 'event: ' prefix
76
+ console.log('Event message', line);
77
+ } else if (line.startsWith('data: ')) {
78
+ const eventData = line.slice(6); // Remove 'data: ' prefix
79
+ try {
80
+ let data = JSON.parse(eventData);
81
+ if (data['msg']) {
82
+ eventType = data['msg'];
83
+ if (eventType === 'process_generating') {
84
+ eventType = 'generating';
85
+ data = data['output']['data'];
86
+ } else if (eventType === 'process_completed') {
87
+ eventType = 'complete';
88
+ data = data['output']['data'];
89
+ }
90
+ }
91
+
92
+ if (eventType === 'estimation' && onQueueUpdate) {
93
+ onQueueUpdate({
94
+ inQueue: true,
95
+ rank: data.rank,
96
+ queueSize: data.queue_size,
97
+ rankEta: data.rank_eta
98
+ });
99
+ } else if (eventType === 'process_starts' && onQueueUpdate) {
100
+ onQueueUpdate({
101
+ inQueue: false
102
+ });
103
+ } else if ((eventType === 'generating' || eventType === 'complete') && data[0]) {
104
+ result = data[0];
105
+ onStreamUpdate(data[0]);
106
+
107
+ if (eventType === 'complete') {
108
+ return result;
109
+ }
110
+ }
111
+ } catch (e) {
112
+ console.log('Error parsing SSE message:', e);
113
+ }
114
+ } else if (line !== '') {
115
+ console.log('Unknown message', line);
116
+ }
117
+ }
118
+ }
119
+ }, 100);
120
+
121
+ export const imageToSamEmbedding = _.throttle(async (
122
+ imageBase64: string,
123
+ onQueueUpdate?: (status: {
124
+ inQueue: boolean,
125
+ rank?: number,
126
+ queueSize?: number,
127
+ rankEta?: number | null
128
+ }) => void
129
+ ): Promise<string> => {
130
+ // First call to initiate the process
131
+ const initiateResponse = await axios.post(`${API_URL}/call/image_to_sam_embedding`, {
132
+ data: [imageBase64]
133
+ });
134
+
135
+ const eventId = initiateResponse.data.event_id;
136
+
137
+ // Get the stream for queue updates and results
138
+ const response = await axios.get(`${API_URL}/queue/data?session_hash=${eventId}`, {
139
+ headers: {
140
+ 'Accept': 'text/event-stream',
141
+ },
142
+ responseType: 'stream',
143
+ adapter: 'fetch',
144
+ });
145
+
146
+ const stream = response.data;
147
+ const reader = stream.pipeThrough(new TextDecoderStream()).getReader();
148
+
149
+ let result = '';
150
+ let partialMessage = '';
151
+
152
+ while (true) {
153
+ const { value, done } = await reader.read();
154
+ if (done) {
155
+ return result;
156
+ }
157
+
158
+ // Concatenate with any previous partial message
159
+ const currentData = partialMessage + value;
160
+ const lines = currentData.split('\n');
161
+
162
+ // Save the last line if it's incomplete (doesn't end with \n)
163
+ // The endpoint will send an empty line to indicate the end of a message, so it's ok to not process the partial message.
164
+ partialMessage = lines[lines.length - 1];
165
+
166
+ // Process all complete lines except the last one
167
+ let eventType = '';
168
+ for (let i = 0; i < lines.length - 1; i++) {
169
+ const line = lines[i];
170
+ if (line.startsWith('event: ')) {
171
+ eventType = line.slice(7);
172
+ } else if (line.startsWith('data: ')) {
173
+ const eventData = line.slice(6);
174
+ try {
175
+ let data = JSON.parse(eventData);
176
+ if (data['msg']) {
177
+ eventType = data['msg'];
178
+ console.log("Event type:", eventType);
179
+ if (eventType === 'process_completed') {
180
+ eventType = 'complete';
181
+ data = data['output']['data'];
182
+ }
183
+ }
184
+
185
+ if (eventType === 'estimation' && onQueueUpdate) {
186
+ onQueueUpdate({
187
+ inQueue: true,
188
+ rank: data.rank,
189
+ queueSize: data.queue_size,
190
+ rankEta: data.rank_eta
191
+ });
192
+ } else if (eventType === 'process_starts' && onQueueUpdate) {
193
+ onQueueUpdate({
194
+ inQueue: false
195
+ });
196
+ } else if (eventType === 'complete' && data[0]) {
197
+ result = data[0];
198
+ console.log("Result for image to sam embedding:", result);
199
+ return result;
200
+ } else {
201
+ console.log("Unknown event type:", eventType);
202
+ }
203
+ } catch (e) {
204
+ console.log('Error parsing SSE message:', e, 'Raw data:', eventData);
205
+ }
206
+ }
207
+ }
208
+ }
209
+ }, 100);
210
+
211
+ export { API_URL };
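For reference, a minimal sketch of how the streaming client above could be invoked from a component. The prompt string and the base64 conventions mirror Tool.tsx; the describeRegion wrapper itself is illustrative and not part of the uploaded files:

    import { describeMask } from "../services/maskApi";

    // imageBase64 and maskBase64 are raw base64 payloads without the
    // "data:image/...;base64," prefix, as produced by Tool.tsx before calling the API.
    async function describeRegion(imageBase64: string, maskBase64: string) {
      const finalText = await describeMask(
        maskBase64,
        imageBase64,
        "<image>\nDescribe the masked region in detail.",
        (partial) => console.log("partial description:", partial), // one call per 'generating' event
        (status) => console.log("queue status:", status) // 'estimation' / 'process_starts' updates
      );
      console.log("final description:", finalText);
    }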
demo/gradio/frontend/tailwind.config.js ADDED
@@ -0,0 +1,12 @@
1
+ // Copyright (c) Meta Platforms, Inc. and affiliates.
2
+ // All rights reserved.
3
+
4
+ // This source code is licensed under the license found in the
5
+ // LICENSE file in the root directory of this source tree.
6
+
7
+ /** @type {import('tailwindcss').Config} */
8
+ module.exports = {
9
+ content: ["./src/**/*.{html,js,tsx}"],
10
+ theme: {},
11
+ plugins: [],
12
+ };
demo/gradio/frontend/tsconfig.json ADDED
@@ -0,0 +1,24 @@
1
+ {
2
+ "compilerOptions": {
3
+ "lib": ["dom", "dom.iterable", "esnext"],
4
+ "allowJs": true,
5
+ "skipLibCheck": true,
6
+ "strict": true,
7
+ "forceConsistentCasingInFileNames": true,
8
+ "noEmit": false,
9
+ "esModuleInterop": true,
10
+ "module": "esnext",
11
+ "moduleResolution": "node",
12
+ "resolveJsonModule": true,
13
+ "isolatedModules": true,
14
+ "jsx": "react",
15
+ "incremental": true,
16
+ "target": "ESNext",
17
+ "useDefineForClassFields": true,
18
+ "allowSyntheticDefaultImports": true,
19
+ "outDir": "./dist/",
20
+ "sourceMap": true
21
+ },
22
+ "include": ["next-env.d.ts", "**/*.ts", "**/*.tsx", "src"],
23
+ "exclude": ["node_modules"]
24
+ }
demo/gradio/frontend/yarn.lock ADDED
The diff for this file is too large to render. See raw diff
 
demo/gradio/requirements.txt ADDED
@@ -0,0 +1,15 @@
1
+ sentencepiece
2
+ accelerate>=0.28.0
3
+ pydantic>=2.10.1
4
+ numpy>=1.23.5,<2.0.0
5
+ pillow>=9.4.0
6
+ gradio>=5.5.0
7
+ requests
8
+ httpx
9
+ uvicorn
10
+ fastapi
11
+ protobuf
12
+ opencv-python
13
+ openai>=1.55.0
14
+ spaces==0.30.4
15
+ git+https://github.com/facebookresearch/segment-anything.git
evaluation/DLC-Bench/annotations/annotations.json ADDED
The diff for this file is too large to render. See raw diff
 
evaluation/DLC-Bench/annotations/class_names.json ADDED
@@ -0,0 +1,102 @@
1
+ {
2
+ "2391781": "wild bird",
3
+ "2580323": "picture/frame",
4
+ "4782942": "megaphone/speaker",
5
+ "6037269": "showerhead",
6
+ "7050495": "handbag",
7
+ "8331699": "computer box",
8
+ "8556676": "apple",
9
+ "11012500": "taco",
10
+ "12348080": "scissors",
11
+ "16951734": "potato",
12
+ "17265254": "rickshaw",
13
+ "18845103": "spoon",
14
+ "20993402": "tape",
15
+ "21529954": "can/container",
16
+ "22879790": "garlic",
17
+ "24010373": "guitar",
18
+ "24694197": "avocado",
19
+ "279135": "ski",
20
+ "622329": "eraser",
21
+ "622332": "stapler",
22
+ "1075308": "monitor/tv",
23
+ "1770866": "sign/banner",
24
+ "2391761": "boat",
25
+ "2580318": "mouse",
26
+ "2588513": "wood block",
27
+ "3993075": "marker",
28
+ "4027486": "truck",
29
+ "4243725": "soap",
30
+ "4781902": "stool",
31
+ "4782949": "drum",
32
+ "5211280": "rice cooker",
33
+ "5718392": "storage box",
34
+ "6037272": "bottle",
35
+ "6820594": "cat",
36
+ "5718424": "sneakers",
37
+ "6055310": "tape measure/ruler",
38
+ "8201777": "van",
39
+ "8331685": "headphone",
40
+ "8331718": "notebook",
41
+ "8557176": "watch",
42
+ "8557195": "toaster",
43
+ "9766617": "duck/goose",
44
+ "11021544": "faucet",
45
+ "11775390": "sandals",
46
+ "11950619": "table tennis paddle",
47
+ "12178946": "bottle",
48
+ "12348079": "scale",
49
+ "14832137": "barrel/bucket",
50
+ "15050320": "wine glass",
51
+ "16957916": "lettuce",
52
+ "17385866": "ice cream",
53
+ "17404769": "suv",
54
+ "18217373": "glasses",
55
+ "19455186": "cart/trolley",
56
+ "19610023": "slippers",
57
+ "19610025": "rabbit",
58
+ "20568676": "pot",
59
+ "21107974": "gavel/mallet",
60
+ "22064315": "antelope",
61
+ "22107522": "bow tie",
62
+ "24017816": "car",
63
+ "24498027": "street lights",
64
+ "24581953": "dog",
65
+ "24786060": "towel",
66
+ "25054869": "toilet",
67
+ "25273553": "tripod",
68
+ "25419495": "tong",
69
+ "25419516": "stuffed toy",
70
+ "25579493": "bowl",
71
+ "297718": "sushi",
72
+ "361105": "herb",
73
+ "1196168": "air conditioner",
74
+ "1894089": "screwdriver",
75
+ "2391780": "wild bird",
76
+ "4502267": "green bean",
77
+ "4604873": "crane",
78
+ "4916799": "globe",
79
+ "5718415": "tent",
80
+ "6012878": "traffic light",
81
+ "6820595": "cat",
82
+ "8556674": "orange/tangerine",
83
+ "8906172": "earphone",
84
+ "10666665": "clock",
85
+ "10811497": "key",
86
+ "11021562": "microwave",
87
+ "11021563": "stove",
88
+ "12348078": "person",
89
+ "13138178": "stool",
90
+ "13187927": "motorcycle",
91
+ "14490578": "seal",
92
+ "14640483": "cutting/chopping board",
93
+ "16010041": "chopsticks",
94
+ "17072759": "belt",
95
+ "17072764": "pear",
96
+ "18301585": "bench",
97
+ "18680641": "carpet",
98
+ "25273528": "hot air balloon",
99
+ "25419509": "fork",
100
+ "25612310": "basket",
101
+ "17265253": "rickshaw"
102
+ }
evaluation/DLC-Bench/annotations/qa.json ADDED
The diff for this file is too large to render. See raw diff
 
evaluation/DLC-Bench/eval_gpt_with_image.py ADDED
@@ -0,0 +1,483 @@
1
+ # *************************************************************************
2
+ # This file may have been modified by Bytedance Inc. (“Bytedance Inc.'s Mo-
3
+ # difications”). All Bytedance Inc.'s Modifications are Copyright (2025) B-
4
+ # ytedance Inc..
5
+ # *************************************************************************
6
+
7
+ # Adapted from https://github.com/NVlabs/describe-anything/blob/main/evaluation/eval_model_outputs.py
8
+
9
+ # Copyright 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
10
+ #
11
+ # Licensed under the Apache License, Version 2.0 (the "License");
12
+ # you may not use this file except in compliance with the License.
13
+ # You may obtain a copy of the License at
14
+ #
15
+ # http://www.apache.org/licenses/LICENSE-2.0
16
+ #
17
+ # Unless required by applicable law or agreed to in writing, software
18
+ # distributed under the License is distributed on an "AS IS" BASIS,
19
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
20
+ # See the License for the specific language governing permissions and
21
+ # limitations under the License.
22
+ #
23
+ # SPDX-License-Identifier: Apache-2.0
24
+
25
+ import argparse
26
+ import base64
27
+ import io
28
+ import json
29
+ import os
30
+
31
+ import inflect
32
+ import numpy as np
33
+ import openai
34
+ from PIL import Image
35
+ from pycocotools.coco import COCO
36
+ from tqdm import tqdm
37
+
38
+ # Define Azure OpenAI details
39
+ model_name = "gpt-4o-2024-11-20"
40
+ max_tokens = 1000 # range: [1, 4095]
41
+
42
+ # Initialize the Azure client
43
+ client = openai.AzureOpenAI(
44
+ azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
45
+ api_key=os.getenv("AZURE_OPENAI_KEY"),
46
+ api_version="2024-03-01-preview",
47
+ )
48
+
49
+ prompt_eval = """Answer the multiple-choice question based on the text description of an object in this image. You need to follow these rules:
50
+ 1. Do not output any reasoning. Do not perform correction. Please output exactly one answer from the choices for each question. Do not repeat the question.
51
+ 2. There is no need for exact matching. Please choose the closest option based on the description.
52
+
53
+ The description is:
54
+ {pred_caption}
55
+
56
+ From the description above, please answer the following question with one of the choices:
57
+ {question_text_str}
58
+ """
59
+
60
+ api_call_count = 0
61
+
62
+
63
+ def query(prompt, images, temperature, max_tokens):
64
+ global api_call_count
65
+ if api_call_count >= args.api_call_limit:
66
+ raise Exception("API call limit reached")
67
+
68
+ api_call_count += 1
69
+ content = [
70
+ {"type": "text", "text": "The image:\n"},
71
+ {
72
+ "type": "image_url",
73
+ "image_url": {"url": f"data:image/jpeg;base64,{images[0]}"},
74
+ },
75
+ {"type": "text", "text": "\nThe mask of the image:\n"},
76
+ {
77
+ "type": "image_url",
78
+ "image_url": {"url": f"data:image/jpeg;base64,{images[1]}"},
79
+ },
80
+ {"type": "text", "text": f"\n{prompt}\n"},
81
+ ]
82
+
83
+ # Adjusted to use the Azure OpenAI client with the specified parameters
84
+ response = client.chat.completions.create(
85
+ model=model_name,
86
+ messages=[{"role": "user", "content": content}],
87
+ max_tokens=max_tokens,
88
+ temperature=temperature,
89
+ top_p=1,
90
+ frequency_penalty=0,
91
+ presence_penalty=0,
92
+ )
93
+
94
+ message = response.choices[0].message.content
95
+ return message
96
+
97
+
98
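+ # Map a free-form model answer onto a choice index: exact matches first (the choice text,
+ # "a. <choice>", or the bare letter), then substring matches of a choice inside the answer
+ # (preferring the later and longer match), then the answer inside a choice, and finally a
+ # bare leading letter. Returns None if nothing matches.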
+ def parse_pred(pred, choices, key):
99
+ pred = pred.strip().lower()
100
+ substr_indices = []
101
+ for index, choice in enumerate(choices):
102
+ choice = choice.strip().lower()
103
+ prefix = "abcde"[index]
104
+ if choice == pred or pred == f"{prefix}. {choice}" or pred == prefix:
105
+ return index
106
+ if choice in pred:
107
+ substr_indices.append((index, pred.index(choice), len(choice)))
108
+
109
+ if len(substr_indices) == 1:
110
+ return substr_indices[0][0]
111
+
112
+ choices_label = "abcde"
113
+ if pred[0] in choices_label and pred[1] == ".":
114
+ ret = choices_label.index(pred[0])
115
+ return ret
116
+
117
+ if substr_indices:
118
+ if len(substr_indices) > 1:
119
+ ret, ret_pos, _ = max(substr_indices, key=lambda x: x[1])
120
+ max_items = [item for item in substr_indices if item[1] == ret_pos]
121
+ if len(max_items) > 1:
122
+ ret = max(max_items, key=lambda x: x[2])[0]
123
+ return ret
124
+ else:
125
+ ret = substr_indices[0][0]
126
+ return ret
127
+
128
+ match_lengths = []
129
+ for index, choice in enumerate(choices):
130
+ choice = choice.strip().lower()
131
+ if pred in choice:
132
+ match_lengths.append((index, len(choice)))
133
+ if match_lengths:
134
+ if len(match_lengths) > 1:
135
+ ret = max(match_lengths, key=lambda x: x[1])[0]
136
+ else:
137
+ ret = match_lengths[0][0]
138
+ return ret
139
+
140
+ if pred[0] in "abcde" and (len(pred.strip()) == 1 or pred[1] == "\n"):
141
+ ret = "abcde".index(pred[0])
142
+ return ret
143
+
144
+ return None
145
+
146
+
147
+ def evaluate(
148
+ question_dicts,
149
+ pred_caption,
150
+ temperature,
151
+ max_tokens,
152
+ images,
153
+ *,
154
+ response_override=None,
155
+ key,
156
+ verbose=False,
157
+ ) -> dict:
158
+ pred_answers = []
159
+ prompt = []
160
+ response = []
161
+ for index, question_dict in enumerate(question_dicts):
162
+ question_text_str = f"{question_dict['question']}\n"
163
+ choices_text = ""
164
+ for choice_index, (choice, score) in enumerate(question_dict["choices"]):
165
+ choice_index = "ABCDE"[choice_index]
166
+ choices_text += f"{choice_index}. {choice}\n"
167
+ question_text_str += choices_text
168
+ prompt_item = prompt_eval.format(
169
+ pred_caption=pred_caption, question_text_str=question_text_str.strip()
170
+ )
171
+
172
+ if (
173
+ response_override is None
174
+ or len(response_override) < index
175
+ or response_override[index] is None
176
+ ):
177
+ response_item = query(prompt_item, images, temperature, max_tokens)
178
+ else:
179
+ response_item = response_override[index]
180
+
181
+ pred_answer = response_item.strip()
182
+ pred_answers.append(pred_answer)
183
+ prompt.append(prompt_item)
184
+ response.append(response_item)
185
+
186
+ pred_indices = [
187
+ parse_pred(
188
+ pred_answer, [choice for choice, score in question_dict["choices"]], key
189
+ )
190
+ for pred_answer, question_dict in zip(pred_answers, question_dicts)
191
+ ]
192
+ parsed_eval_results = [
193
+ question_dict["choices"][pred_index][1] if pred_index is not None else 0
194
+ for pred_index, question_dict in zip(pred_indices, question_dicts)
195
+ ]
196
+
197
+ parsed_eval_results_positives = []
198
+ parsed_eval_results_negatives = []
199
+ details_positives = []
200
+ details_negatives = []
201
+ details_recognition = []
202
+ recognition_result = None
203
+ for question_index, (parsed_eval_result, question_dict) in enumerate(
204
+ zip(parsed_eval_results, question_dicts)
205
+ ):
206
+ if question_dict["type"] == "recognition":
207
+ if parsed_eval_result == "correct":
208
+ recognition_result = True
209
+ elif parsed_eval_result == "incorrect":
210
+ recognition_result = False
211
+ print(
212
+ f"Recognition is incorrect for key {key}, setting score to at most 0 for all questions"
213
+ )
214
+ else:
215
+ raise ValueError(f"Invalid recognition result: {parsed_eval_result}")
216
+ details_recognition.append(
217
+ {
218
+ **question_dict,
219
+ "pred_answer": pred_answers[question_index],
220
+ "pred_index": pred_indices[question_index],
221
+ "eval_result": parsed_eval_result,
222
+ }
223
+ )
224
+ elif question_dict["type"] == "negative":
225
+ if recognition_result is False:
226
+ parsed_eval_result = min(0, parsed_eval_result)
227
+ parsed_eval_results_negatives.append(parsed_eval_result)
228
+
229
+ details_negatives.append(
230
+ {
231
+ **question_dict,
232
+ "pred_answer": pred_answers[question_index],
233
+ "pred_index": pred_indices[question_index],
234
+ "eval_result": parsed_eval_result,
235
+ }
236
+ )
237
+ elif question_dict["type"] == "positive":
238
+ if recognition_result is False:
239
+ parsed_eval_result = min(0, parsed_eval_result)
240
+ parsed_eval_results_positives.append(parsed_eval_result)
241
+
242
+ details_positives.append(
243
+ {
244
+ **question_dict,
245
+ "pred_answer": pred_answers[question_index],
246
+ "pred_index": pred_indices[question_index],
247
+ "eval_result": parsed_eval_result,
248
+ }
249
+ )
250
+
251
+ score_pos = sum(parsed_eval_results_positives) / len(parsed_eval_results_positives)
252
+ score_neg = (
253
+ sum(parsed_eval_results_negatives) / len(parsed_eval_results_negatives)
254
+ if parsed_eval_results_negatives
255
+ else None
256
+ )
257
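+ # Overall score pools all positive and negative questions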
+ score = (
258
+ sum(parsed_eval_results_positives) + sum(parsed_eval_results_negatives)
259
+ ) / (len(parsed_eval_results_positives) + len(parsed_eval_results_negatives))
260
+
261
+ info = dict(
262
+ details_positives=details_positives,
263
+ details_negatives=details_negatives,
264
+ details_recognition=details_recognition,
265
+ prompt=prompt,
266
+ response=response,
267
+ score=score,
268
+ score_pos=score_pos,
269
+ score_neg=score_neg,
270
+ recognition_result=recognition_result,
271
+ )
272
+
273
+ return info
274
+
275
+
276
+ def is_plural(string):
277
+ if string == "bus":
278
+ return False
279
+ return p.singular_noun(string) is not False
280
+
281
+
282
+ def select_ann(img_id, area_min=None, area_max=None):
283
+ cat_ids = coco.getCatIds()
284
+ ann_ids = coco.getAnnIds(imgIds=[img_id], catIds=cat_ids, iscrowd=None)
285
+
286
+ if area_min is not None:
287
+ ann_ids = [
288
+ ann_id for ann_id in ann_ids if coco.anns[ann_id]["area"] >= area_min
289
+ ]
290
+
291
+ if area_max is not None:
292
+ ann_ids = [
293
+ ann_id for ann_id in ann_ids if coco.anns[ann_id]["area"] <= area_max
294
+ ]
295
+
296
+ return ann_ids
297
+
298
+
299
+ def mask_to_box(mask_np):
300
+ mask_coords = np.argwhere(mask_np)
301
+ y0, x0 = mask_coords.min(axis=0)
302
+ y1, x1 = mask_coords.max(axis=0) + 1
303
+
304
+ h = y1 - y0
305
+ w = x1 - x0
306
+
307
+ return x0, y0, w, h
308
+
309
+
310
+ def encode_pil_image_to_base64(pil_image):
311
+ buffered = io.BytesIO()
312
+ pil_image.save(buffered, format="PNG")
313
+ img_str = base64.b64encode(buffered.getvalue()).decode("utf-8")
314
+ return img_str
315
+
316
+
317
+ if __name__ == "__main__":
318
+ parser = argparse.ArgumentParser(description="Evaluate model outputs")
319
+ parser.add_argument(
320
+ "--pred", type=str, help="Path to the prediction JSON file", required=True
321
+ )
322
+ parser.add_argument(
323
+ "--qa",
324
+ type=str,
325
+ help="Path to the reference QA file",
326
+ default="evaluation/DLC-Bench/annotations/qa.json",
327
+ )
328
+ parser.add_argument(
329
+ "--class-names",
330
+ type=str,
331
+ help="Path to the class names JSON file",
332
+ default="evaluation/DLC-Bench/annotations/class_names.json",
333
+ )
334
+ parser.add_argument(
+ "--api-call-limit", type=int, default=1000, help="API call limit"
+ )
+ parser.add_argument(
+ "--default-prediction", type=str, default=None,
+ help="Default prediction if key is not present in the prediction file",
+ )
337
+ parser.add_argument(
338
+ "--suffix", type=str, default="", help="Suffix for the evaluation file"
339
+ )
340
+ parser.add_argument("--verbose", action="store_true", help="Enable verbose mode")
341
+ parser.add_argument(
342
+ "--quiet", action="store_true", help="Enable quiet mode (result only)"
343
+ )
344
+ parser.add_argument("--csv", action="store_true", help="Output results as CSV only")
345
+ parser.add_argument(
346
+ "--data-root", type=str, default="evaluation/DLC-Bench/annotations"
347
+ )
348
+
349
+ args = parser.parse_args()
350
+
351
+ eval_file = os.path.splitext(args.pred)[0] + f"_eval_gpt{args.suffix}.json"
352
+
353
+ eval_results = {}
354
+
355
+ if os.path.exists(eval_file):
356
+ with open(eval_file) as f:
357
+ eval_results = json.load(f)
358
+
359
+ with open(args.pred) as f:
360
+ data_pred = json.load(f)
361
+
362
+ with open(args.qa) as f:
363
+ data_qa = json.load(f)
364
+
365
+ with open(args.class_names) as f:
366
+ data_class_names = json.load(f)
367
+
368
+ scores = {}
369
+ scores_pos = {}
370
+ scores_neg = {}
371
+
372
+ keys = list(data_qa.keys())
373
+ p = inflect.engine()
374
+
375
+ annotations_file = os.path.join(args.data_root, "annotations.json")
376
+ coco = COCO(annotations_file)
377
+
378
+ with open(annotations_file, "r") as f:
379
+ data = json.load(f)
380
+
381
+ missing_key_count = 0
382
+ for key in tqdm(keys, disable=args.quiet):
383
+ key = str(key)
384
+ for item in data["annotations"]:
+ if int(item["id"]) == int(key):
+ img_id = item["image_id"]
+ break
387
+
388
+ img_info = coco.loadImgs(img_id)[0]
389
+ img_path = os.path.join(args.data_root, "images", img_info["file_name"])
390
+ img = Image.open(img_path)
391
+
392
+ anns = coco.loadAnns([int(key)])
393
+ mask_np = coco.annToMask(anns[0]).astype(bool)
394
+
395
+ img_np = np.array(img)
396
+ pil_mask = Image.fromarray((mask_np * 255).astype(np.uint8))
397
+
398
+ assert (
399
+ img_np.shape[:2] == mask_np.shape
400
+ ), f"image shape mismatches with mask shape: {img_np.shape}, {mask_np.shape}"
401
+ img_h, img_w = img_np.shape[:2]
402
+
403
+ x0, y0, w, h = mask_to_box(mask_np)
404
+ xc, yc = x0 + w / 2, y0 + h / 2
405
+
406
+ # focal_crop: need to have at least min_box_w and min_box_h pixels, otherwise resizing to (384, 384) leads to artifacts that may be OOD
407
+ w, h = max(w, 56), max(h, 56)
408
+ x0, y0 = int(xc - w / 2), int(yc - h / 2)
409
+
410
+ # focal crop
411
+ cropped_img_np = img_np[
412
+ max(y0 - h, 0) : min(y0 + 2 * h, img_h),
413
+ max(x0 - w, 0) : min(x0 + 2 * w, img_w),
414
+ ]
415
+ cropped_mask_np = mask_np[
416
+ max(y0 - h, 0) : min(y0 + 2 * h, img_h),
417
+ max(x0 - w, 0) : min(x0 + 2 * w, img_w),
418
+ ]
419
+
420
+ cropped_pil_img = Image.fromarray(cropped_img_np)
421
+ cropped_pil_mask = Image.fromarray((cropped_mask_np * 255).astype(np.uint8))
422
+
423
+ base64_image = encode_pil_image_to_base64(img)
424
+ base64_mask = encode_pil_image_to_base64(pil_mask)
425
+ base64_cropped_image = encode_pil_image_to_base64(cropped_pil_img)
426
+ base64_cropped_mask = encode_pil_image_to_base64(cropped_pil_mask)
427
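+ # Only the focal crop and its mask are passed to the judge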
+ images = [base64_cropped_image, base64_cropped_mask]
428
+
429
+ if key in eval_results:
430
+ response_override = eval_results[key]["response"]
431
+ else:
432
+ response_override = None
433
+
434
+ if key not in data_pred:
435
+ if args.default_prediction is None:
436
+ raise ValueError(f"Key {key} not found in prediction data")
437
+ else:
438
+ pred_value = args.default_prediction
439
+ missing_key_count += 1
440
+ else:
441
+ pred_value = data_pred[key]
442
+
443
+ class_name = data_class_names[key]
444
+ recognition_question = f"The object in the image is {class_name}. Based on the image, is it likely that the object in the description is the given class ({class_name}) or an object of a similar type?"
445
+ recognition_question_dict = {
446
+ "question": recognition_question,
447
+ "choices": [("Yes", "correct"), ("No", "incorrect")],
448
+ "type": "recognition",
449
+ }
450
+
451
+ question_dicts = [recognition_question_dict, *data_qa[key]]
452
+ info = evaluate(
453
+ question_dicts=question_dicts,
454
+ pred_caption=pred_value,
455
+ images=images,
456
+ temperature=0.0,
457
+ max_tokens=300,
458
+ response_override=response_override,
459
+ key=key,
460
+ )
461
+ score = info["score"]
462
+ scores[key] = score
463
+ scores_pos[key] = info["score_pos"]
464
+ scores_neg[key] = info["score_neg"]
465
+ eval_results[key] = {"pred": pred_value, **info}
466
+
467
+ avg_score_pos = sum(scores_pos.values()) / len(scores_pos)
468
+ scores_neg_valid_only = [item for item in scores_neg.values() if item is not None]
+ avg_score_neg = sum(scores_neg_valid_only) / len(scores_neg_valid_only)
471
+ eval_results["avg_pos"] = avg_score_pos
472
+ eval_results["avg_neg"] = avg_score_neg
473
+
474
+ with open(eval_file, "w") as f:
475
+ json.dump(eval_results, f, indent=4)
476
+
477
+ print(f"Average Positive Score: {avg_score_pos:.3f}")
478
+ print(f"Average Negative Score: {avg_score_neg:.3f}")
479
+ print(
480
+ f"Summary (Pos\tNeg\tAvg(Pos, Neg)):\t{avg_score_pos:.3f},\t{avg_score_neg:.3f},\t{(avg_score_pos + avg_score_neg) / 2:.3f}"
481
+ )
482
+ print(f"QA Scores: {scores}")
483
+ print(f"Evaluation data saved to {eval_file}")
evaluation/DLC-Bench/eval_llama_without_image.py ADDED
@@ -0,0 +1,503 @@
1
+ # Copyright 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
2
+ #
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ #
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+ #
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+ #
15
+ # SPDX-License-Identifier: Apache-2.0
16
+
17
+ import argparse
18
+ import json
19
+ import os
20
+
21
+ import inflect
22
+ from openai import OpenAI
23
+ from tqdm import tqdm
24
+
25
+ prompt_eval = """Answer the multiple-choice question based on the text description of an object in an image. You need to follow these rules:
26
+ 1. Do not output any reasoning. Do not perform correction. Please output exactly one answer from the choices for each question. Do not repeat the question.
27
+ 2. There is no need for exact matching. Please choose the closest option based on the description.
28
+
29
+ The description is:
30
+ {pred_caption}
31
+
32
+ From the description above, please answer the following question with one of the choices:
33
+ {question_text_str}
34
+ """
35
+
36
+ api_call_count = 0
37
+
38
+
39
+ def query(prompt, temperature, max_tokens, model):
40
+ global api_call_count
41
+ if api_call_count >= args.api_call_limit:
42
+ raise Exception("API call limit reached")
43
+
44
+ api_call_count += 1
45
+ response = client.chat.completions.create(
46
+ model=model,
47
+ messages=[{"role": "user", "content": prompt}],
48
+ temperature=temperature,
49
+ max_tokens=max_tokens,
50
+ top_p=1,
51
+ frequency_penalty=0,
52
+ presence_penalty=0,
53
+ )
54
+
55
+ message = response.choices[0].message.content
56
+ return message
57
+
58
+
59
+ def parse_pred(pred, choices, key):
60
+ pred = pred.strip().lower()
61
+ substr_indices = []
62
+ for index, choice in enumerate(choices):
63
+ choice = choice.strip().lower()
64
+ prefix = "abcde"[index]
65
+ if choice == pred or pred == f"{prefix}. {choice}" or pred == prefix:
66
+ return index
67
+ if choice in pred:
68
+ substr_indices.append((index, pred.index(choice), len(choice)))
69
+
70
+ # Only one match (choice in prediction)
71
+ if len(substr_indices) == 1:
72
+ return substr_indices[0][0]
73
+
74
+ # Prefix match
75
+ choices_label = "abcde"
76
+ if len(pred) > 1 and pred[0] in choices_label and pred[1] == ".":
77
+ ret = choices_label.index(pred[0])
78
+ # print(f"{key}: Chosen {ret} for pred: {pred}, choices: {choices}")
79
+ # print(f"{key}: More than one occurrence found or no substr of choice in pred: pred {pred}, choices {choices}, substr indices: {substr_indices}, returning {ret} (choice {choices_label})")
80
+ return ret
81
+
82
+ # More than one match
83
+ if substr_indices:
84
+ # Return the last occurrence if there are multiple matches (referenced from MMMU): https://github.com/MMMU-Benchmark/MMMU/blob/b119c944a15c145c10d52a58e841c5b9cb6a535e/eval/utils/eval_utils.py#L57
85
+ if len(substr_indices) > 1:
86
+ ret, ret_pos, _ = max(substr_indices, key=lambda x: x[1])
87
+ max_items = [item for item in substr_indices if item[1] == ret_pos]
88
+ if len(max_items) > 1:
89
+ # select the item with the longest match if there are multiple occurrence at the same place
90
+ ret = max(max_items, key=lambda x: x[2])[0]
91
+ print(
92
+ f"{key}: More than one occurrence found: pred {pred}, choices {choices}, {substr_indices}, returning {ret} (choice {choices_label})"
93
+ )
94
+ else:
95
+ ret = substr_indices[0][0]
96
+ return ret
97
+
98
+ # Parse the case where pred is a substr of choice
99
+ match_lengths = []
100
+ for index, choice in enumerate(choices):
101
+ choice = choice.strip().lower()
102
+ if pred in choice:
103
+ match_lengths.append((index, len(choice)))
104
+ if match_lengths:
105
+ # Return the longest matched substring if there are multiple matches
106
+ if len(match_lengths) > 1:
107
+ ret = max(match_lengths, key=lambda x: x[1])[0]
108
+ print(
109
+ f"{key}: More than one occurrence found: pred {pred}, choices {choices}, {match_lengths}, returning {ret}"
110
+ )
111
+ else:
112
+ ret = match_lengths[0][0]
113
+ return ret
114
+
115
+ if pred[0] in "abcde" and (len(pred.strip()) == 1 or pred[1] == "\n"):
116
+ ret = "abcde".index(pred[0])
117
+ print(f"{key}: Chosen {ret} for pred: {pred}, choices: {choices}")
118
+ return ret
119
+
120
+ print(f"*WARNING*: {key}: No match found. Pred: {pred}, choices: {choices}")
121
+
122
+ # If no matching choice is found, raise an error.
123
+ # raise ValueError(f"No match found. Pred: {pred}, Choices: {choices}")
124
+ # If no matching choice is found, return None (treat as no mention, score 0).
125
+ return None
126
+
127
+
128
+ def evaluate(
129
+ question_dicts,
130
+ pred_caption,
131
+ temperature,
132
+ max_tokens,
133
+ model,
134
+ *,
135
+ response_override=None,
136
+ key,
137
+ verbose=False,
138
+ ) -> dict:
139
+ pred_answers = []
140
+ prompt = []
141
+ response = []
142
+ for index, question_dict in enumerate(question_dicts):
143
+ question_text_str = f"{question_dict['question']}\n"
144
+ choices_text = ""
145
+ for choice_index, (choice, score) in enumerate(question_dict["choices"]):
146
+ choice_index = "ABCDE"[choice_index]
147
+ choices_text += f"{choice_index}. {choice}\n"
148
+ question_text_str += choices_text
149
+ prompt_item = prompt_eval.format(
150
+ pred_caption=pred_caption, question_text_str=question_text_str.strip()
151
+ )
152
+
153
+ if (
154
+ response_override is None
155
+ or len(response_override) <= index
156
+ or response_override[index] is None
157
+ ):
158
+ response_item = query(prompt_item, temperature, max_tokens, model)
159
+ # print(f"Prompt:\n{prompt_item}")
160
+ # print(f"Output: {response_item}")
161
+ else:
162
+ response_item = response_override[index]
163
+
164
+ pred_answer = response_item.strip()
165
+ pred_answers.append(pred_answer)
166
+ prompt.append(prompt_item)
167
+ response.append(response_item)
168
+
169
+ assert len(pred_answers) == len(
170
+ question_dicts
171
+ ), f"Length mismatch for key {key} question {index}: pred: {len(pred_answers)} vs question: {len(question_dicts)}"
172
+ pred_indices = [
173
+ parse_pred(
174
+ pred_answer, [choice for choice, score in question_dict["choices"]], key
175
+ )
176
+ for pred_answer, question_dict in zip(pred_answers, question_dicts)
177
+ ]
178
+
179
+ assert len(pred_indices) == len(
180
+ question_dicts
181
+ ), f"Length mismatch for key {key} question {index}: pred: {len(pred_indices)} vs question: {len(question_dicts)}"
182
+
183
+ # If no matching, treat as no mention.
184
+ try:
185
+ parsed_eval_results = [
186
+ question_dict["choices"][pred_index][1] if pred_index is not None else 0
187
+ for pred_index, question_dict in zip(pred_indices, question_dicts)
188
+ ]
189
+ except IndexError as e:
190
+ print(
191
+ f"Error: {e}, key: {key}, pred_indices: {pred_indices}, question_dicts: {question_dicts}"
192
+ )
193
+ raise e
194
+
195
+ parsed_eval_results_positives = []
196
+ parsed_eval_results_negatives = []
197
+
198
+ details_positives = []
199
+ details_negatives = []
200
+ details_recognition = []
201
+ recognition_result = None
202
+ for question_index, (parsed_eval_result, question_dict) in enumerate(
203
+ zip(parsed_eval_results, question_dicts)
204
+ ):
205
+ if question_dict["type"] == "recognition":
206
+ # If the type is recognition, it's the recognition question.
207
+ if parsed_eval_result == "correct":
208
+ recognition_result = True
209
+ elif parsed_eval_result == "incorrect":
210
+ recognition_result = False
211
+ print(
212
+ f"Recognition is incorrect for key {key}, setting score to at most 0 for all questions"
213
+ )
214
+ else:
215
+ raise ValueError(f"Invalid recognition result: {parsed_eval_result}")
216
+ details_recognition.append(
217
+ {
218
+ **question_dict,
219
+ "pred_answer": pred_answers[question_index],
220
+ "pred_index": pred_indices[question_index],
221
+ "eval_result": parsed_eval_result,
222
+ }
223
+ )
224
+ elif question_dict["type"] == "negative":
225
+ assert (
226
+ recognition_result is not None
227
+ ), f"Negative questions come before recognition question in {key}, {question_dicts}"
228
+ if recognition_result is False:
229
+ if verbose:
230
+ print(
231
+ f"Processing negative question {question_index} for key {key}, setting score to at most 0 since recognition is incorrect"
232
+ )
233
+ parsed_eval_result = min(0, parsed_eval_result)
234
+ # If the type is negative, it's one of the negatives.
235
+ parsed_eval_results_negatives.append(parsed_eval_result)
236
+ details_negatives.append(
237
+ {
238
+ **question_dict,
239
+ "pred_answer": pred_answers[question_index],
240
+ "pred_index": pred_indices[question_index],
241
+ # Subtract 1 to get the index in the original question list (excluding the recognition question)
242
+ "question_index": question_index - 1,
243
+ "eval_result": parsed_eval_result,
244
+ }
245
+ )
246
+ elif question_dict["type"] == "positive":
247
+ assert (
248
+ recognition_result is not None
249
+ ), f"Positive questions come before recognition question in {key}, {question_dicts}"
250
+ if recognition_result is False:
251
+ if verbose:
252
+ print(
253
+ f"Processing positive question {question_index} for key {key}, setting score to at most 0 since recognition is incorrect"
254
+ )
255
+ parsed_eval_result = min(0, parsed_eval_result)
256
+ parsed_eval_results_positives.append(parsed_eval_result)
257
+ details_positives.append(
258
+ {
259
+ **question_dict,
260
+ "pred_answer": pred_answers[question_index],
261
+ "pred_index": pred_indices[question_index],
262
+ # Subtract 1 to get the index in the original question list (excluding the recognition question)
263
+ "question_index": question_index - 1,
264
+ "eval_result": parsed_eval_result,
265
+ }
266
+ )
267
+ else:
268
+ raise ValueError(f"Invalid question type: {question_dict['type']}")
269
+
270
+ score_pos = sum(parsed_eval_results_positives) / len(parsed_eval_results_positives)
271
+ # It's possible that we don't have negatives for an instance. For this case, we skip over the instance for negative score calculation.
272
+ if len(parsed_eval_results_negatives):
273
+ score_neg = sum(parsed_eval_results_negatives) / len(
274
+ parsed_eval_results_negatives
275
+ )
276
+ else:
277
+ score_neg = None
278
+
279
+ # Overall score is the average of the positive and negative scores
280
+ info = dict(
281
+ details_positives=details_positives,
282
+ details_negatives=details_negatives,
283
+ details_recognition=details_recognition,
284
+ prompt=prompt,
285
+ response=response,
286
+ score=(sum(parsed_eval_results_positives) + sum(parsed_eval_results_negatives))
287
+ / (len(parsed_eval_results_positives) + len(parsed_eval_results_negatives)),
288
+ score_pos=score_pos,
289
+ score_neg=score_neg,
290
+ neg_valid_num=len(parsed_eval_results_negatives),
291
+ recognition_result=recognition_result,
292
+ )
293
+
294
+ return info
295
+
296
+
297
+ def is_plural(string):
298
+ # A case that the inflect library does not handle
299
+ if string == "bus":
300
+ return False
301
+ # singular_noun returns False if the word is already singular (otherwise it returns the singular form)
302
+ return p.singular_noun(string) is not False
303
+
304
+
305
+ if __name__ == "__main__":
306
+ # Example:
307
+ # python eval_model_outputs.py --pred model_outputs_cache/dam_3b_v1.json --base-url "http://localhost:9100/v1"
308
+
309
+ parser = argparse.ArgumentParser(description="Evaluate model outputs")
310
+ parser.add_argument(
311
+ "--pred", type=str, help="Path to the prediction JSON file", required=True
312
+ )
313
+ parser.add_argument(
314
+ "--qa",
315
+ type=str,
316
+ help="Path to the reference QA file",
317
+ default="evaluation/DLC-Bench/annotations/qa.json",
318
+ )
319
+ parser.add_argument(
320
+ "--class-names",
321
+ type=str,
322
+ help="Path to the class names JSON file",
323
+ default="evaluation/DLC-Bench/annotations/class_names.json",
324
+ )
325
+ parser.add_argument(
326
+ "--default-prediction",
327
+ type=str,
328
+ default=None,
329
+ help="Default prediction if key is not present in the prediction file",
330
+ )
331
+ parser.add_argument(
332
+ "--api-call-limit", type=int, default=1000, help="API call limit"
333
+ )
334
+ parser.add_argument(
335
+ "--api-key", type=str, default=None, help="Path to the OpenAI API key file"
336
+ )
337
+ parser.add_argument(
338
+ "--suffix", type=str, default="", help="Suffix for the evaluation file"
339
+ )
340
+ parser.add_argument("--model", type=str, default="llama3.1-8b", help="Model name")
341
+ parser.add_argument("--verbose", action="store_true", help="Enable verbose mode")
342
+ parser.add_argument(
343
+ "--quiet", action="store_true", help="Enable quiet mode (result only)"
344
+ )
345
+ parser.add_argument("--csv", action="store_true", help="Output results as CSV only")
346
+
347
+ parser.add_argument(
348
+ "--base-url",
349
+ type=str,
350
+ default="http://localhost:8007/v1",
351
+ help="Base URL for the API call",
352
+ )
353
+ args = parser.parse_args()
354
+
355
+ always_print = print
356
+ if args.quiet:
357
+ print = lambda *args, **kwargs: None
358
+
359
+ # v3 is from v2.1
360
+ eval_file = os.path.splitext(args.pred)[0] + f"_eval{args.suffix}.json"
361
+ eval_results = {}
362
+
363
+ if os.path.exists(eval_file):
367
+ print(f"Loading existing evaluation data from {eval_file}")
368
+ try:
369
+ with open(eval_file) as f:
370
+ eval_results = json.load(f)
371
+ except Exception as e:
372
+ always_print(f"Error loading evaluation data {eval_file}: {e}")
373
+ raise e
374
+
375
+ if args.api_key:
376
+ with open(args.api_key) as f:
377
+ client = OpenAI(api_key=f.read().strip(), base_url=args.base_url)
378
+ else:
379
+ client = OpenAI(api_key="sk-abc123", base_url=args.base_url)
380
+
381
+ with open(args.pred) as f:
382
+ data_pred = json.load(f)
383
+
384
+ with open(args.qa) as f:
385
+ data_qa = json.load(f)
386
+
387
+ with open(args.class_names) as f:
388
+ data_class_names = json.load(f)
389
+
390
+ scores = {}
391
+ scores_pos = {}
392
+ scores_neg = {}
393
+
394
+ keys = list(data_qa.keys())
395
+
396
+ p = inflect.engine()
397
+
398
+ print(f"Using model {args.model}")
399
+
400
+ missing_key_count = 0
401
+ for key in tqdm(keys, disable=args.quiet):
402
+ key = str(key)
403
+ if key in eval_results:
404
+ if args.verbose:
405
+ print(f"Skipping {key}")
406
+ response_override = eval_results[key]["response"]
407
+ else:
408
+ response_override = None
409
+
410
+ if key not in data_pred:
411
+ if args.default_prediction is None:
412
+ raise ValueError(
413
+ f"Key {key} not found in prediction data, and no default prediction provided"
414
+ )
415
+ else:
416
+ print(
417
+ f"Key {key} not found in prediction data, using default prediction {args.default_prediction}"
418
+ )
419
+ pred_value = args.default_prediction
420
+ missing_key_count += 1
421
+ elif data_pred[key].startswith("Error:"):
422
+ if args.default_prediction is None:
423
+ raise ValueError(
424
+ f"Key {key} has an error in prediction data, and no default prediction provided: {data_pred[key]}"
425
+ )
426
+ else:
427
+ print(
428
+ f"Key {key} has an error in prediction: {data_pred[key]}, using default prediction {args.default_prediction}"
429
+ )
430
+ pred_value = args.default_prediction
431
+ missing_key_count += 1
432
+ else:
433
+ pred_value = data_pred[key]
434
+
435
+ # print(f"Evaluating {key}")
436
+ class_name = data_class_names[key]
437
+
438
+ if is_plural(class_name):
439
+ recognition_question = f"Is it likely that the objects in the description are {class_name} or objects of a similar type? Again, it does not have to be an exact match."
440
+ else:
441
+ recognition_question = f"Is it likely that the object in the description is {p.a(class_name)} or an object of a similar type? Again, it does not have to be an exact match."
442
+ recognition_question_dict = {
443
+ "question": recognition_question,
444
+ "choices": [("Yes", "correct"), ("No", "incorrect")],
445
+ "type": "recognition",
446
+ }
447
+
448
+ # Add the recognition question to the beginning of the list
449
+ question_dicts = [recognition_question_dict, *data_qa[key]]
450
+ info = evaluate(
451
+ question_dicts=question_dicts,
452
+ pred_caption=pred_value,
453
+ model=args.model,
454
+ temperature=0.0,
455
+ max_tokens=300,
456
+ response_override=response_override,
457
+ key=key,
458
+ verbose=args.verbose,
459
+ )
460
+ score = info["score"]
461
+ scores[key] = score
462
+ scores_pos[key] = info["score_pos"]
463
+ scores_neg[key] = info["score_neg"]
464
+ eval_results[key] = {"pred": pred_value, **info}
465
+
466
+ if args.verbose:
467
+ print(f"Score: {score}")
468
+
469
+ with open(eval_file, "w") as f:
470
+ json.dump(eval_results, f, indent=4)
471
+
472
+ avg_score_pos = sum(scores_pos.values()) / len(scores_pos)
473
+ scores_neg_valid_only = [item for item in scores_neg.values() if item is not None]
474
+ avg_score_neg = sum(scores_neg_valid_only) / len(scores_neg_valid_only)
475
+
476
+ if args.csv:
477
+ # Print comma-separated values directly to stdout
478
+ always_print(
479
+ f"{avg_score_pos:.3f},{avg_score_neg:.3f},{(avg_score_pos + avg_score_neg) / 2:.3f}"
480
+ )
481
+ else:
482
+ always_print(f"Result for {args.pred}")
483
+ always_print(f"Average Positive Score: {avg_score_pos:.3f}")
484
+ always_print(f"Average Negative Score: {avg_score_neg:.3f}")
485
+ always_print(
486
+ f"Average of Positive and Negative Scores: {(avg_score_pos + avg_score_neg) / 2:.3f}"
487
+ )
488
+ always_print(
489
+ f"Summary (Pos\tNeg\tAvg(Pos, Neg)):\t{avg_score_pos:.3f},\t{avg_score_neg:.3f},\t{(avg_score_pos + avg_score_neg) / 2:.3f}"
490
+ )
491
+ print(f"QA Scores: {scores}")
492
+
493
+ if missing_key_count:
494
+ print(
495
+ f"Note: Missing {missing_key_count} keys, using default prediction {args.default_prediction}"
496
+ )
497
+
498
+ eval_results["avg_pos"] = avg_score_pos
499
+ eval_results["avg_neg"] = avg_score_neg
500
+ with open(eval_file, "w") as f:
501
+ json.dump(eval_results, f, indent=4)
502
+
503
+ print(f"Evaluation data saved to {eval_file}")
evaluation/DLC-Bench/inference.py ADDED
@@ -0,0 +1,173 @@
1
+ # --------------------------------------------------------
2
+ # Copyright (2025) Bytedance Ltd. and/or its affiliates
3
+ # Licensed under the Apache License, Version 2.0 (the "License")
4
+ # Grasp Any Region Project
5
+ # Written by Haochen Wang
6
+ # --------------------------------------------------------
7
+
8
+ import argparse
9
+ import json
10
+ import os
11
+
12
+ import numpy as np
13
+ import torch
14
+ from PIL import Image
15
+ from pycocotools import mask as mask_utils
16
+ from pycocotools.coco import COCO
17
+ from tqdm import tqdm
18
+ from transformers import AutoModel, AutoProcessor, GenerationConfig
19
+
20
+ from evaluation.eval_dataset import SingleRegionCaptionDataset
21
+
22
+ TORCH_DTYPE_MAP = dict(fp16=torch.float16, bf16=torch.bfloat16, fp32=torch.float32)
23
+
24
+
25
+ def parse_args():
26
+ parser = argparse.ArgumentParser(
27
+ description="Inference of Grasp Any Region models on DLC-Bench."
28
+ )
29
+
30
+ parser.add_argument(
31
+ "--model_name_or_path",
32
+ help="HF model name or path",
33
+ default="HaochenWang/GAR-1B",
34
+ )
35
+ parser.add_argument(
36
+ "--cache_name",
37
+ help="cache name to save model outputs.",
38
+ default="gar_1b",
39
+ )
40
+ parser.add_argument(
41
+ "--data_type",
42
+ help="data dtype",
43
+ type=str,
44
+ choices=["fp16", "bf16", "fp32"],
45
+ default="bf16",
46
+ )
47
+ parser.add_argument(
48
+ "--anno_file",
49
+ help="path to the annotation file.",
50
+ default="evaluation/DLC-Bench/annotations/annotations.json",
51
+ )
52
+ parser.add_argument(
53
+ "--image_folder",
54
+ help="the folder of images",
55
+ default="evaluation/DLC-Bench/annotations",
56
+ )
57
+ parser.add_argument(
58
+ "--seed",
59
+ type=int,
60
+ default=0,
61
+ help="Random seed for reproducible text generation",
62
+ )
63
+ args = parser.parse_args()
64
+ return args
65
+
66
+
67
+ def select_ann(coco, img_id, area_min=None, area_max=None):
68
+ cat_ids = coco.getCatIds()
69
+ ann_ids = coco.getAnnIds(imgIds=[img_id], catIds=cat_ids, iscrowd=None)
70
+
71
+ if area_min is not None:
72
+ ann_ids = [
73
+ ann_id for ann_id in ann_ids if coco.anns[ann_id]["area"] >= area_min
74
+ ]
75
+
76
+ if area_max is not None:
77
+ ann_ids = [
78
+ ann_id for ann_id in ann_ids if coco.anns[ann_id]["area"] <= area_max
79
+ ]
80
+
81
+ return ann_ids
82
+
83
+
84
+ def main():
85
+ args = parse_args()
86
+ data_dtype = TORCH_DTYPE_MAP[args.data_type]
87
+ torch.manual_seed(args.seed)
88
+
89
+ # init distributed process group for dispatch_modules in LLM
90
+ torch.cuda.set_device(0)
91
+ torch.distributed.init_process_group(backend="nccl")
92
+
93
+ # build HF model
94
+ model = AutoModel.from_pretrained(
95
+ args.model_name_or_path,
96
+ trust_remote_code=True,
97
+ torch_dtype=data_dtype,
98
+ )
99
+ model.cuda()
100
+ model.eval()
101
+
102
+ processor = AutoProcessor.from_pretrained(
103
+ args.model_name_or_path,
104
+ trust_remote_code=True,
105
+ )
106
+ model_outputs = {}
107
+ cache_name = args.cache_name
108
+
109
+ # This coco instance is actually an o365 subset. This is for code reuse.
110
+ coco = COCO(args.anno_file)
111
+ img_ids = list(coco.imgs.keys())
112
+ num_anns = len(coco.anns)
113
+ pbar = tqdm(total=num_anns)
114
+
115
+ for img_id in img_ids:
116
+ ann_ids = select_ann(coco, img_id)
117
+ img_info = coco.loadImgs(img_id)[0]
118
+
119
+ for i, ann_id in enumerate(ann_ids):
120
+ if ann_id in model_outputs.keys():
121
+ pbar.update(1)
122
+ continue
123
+
124
+ anns = coco.loadAnns([ann_id])
125
+ mask = coco.annToMask(anns[0])
126
+
127
+ img_path = os.path.join(args.image_folder, "images", img_info["file_name"])
128
+ img = Image.open(img_path)
129
+
130
+ prompt_number = model.config.prompt_numbers
131
+ prompt_tokens = [f"<Prompt{i_p}>" for i_p in range(prompt_number)] + [
132
+ "<NO_Prompt>"
133
+ ]
134
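+ # Build single-region caption inputs (image plus target mask as visual prompt) for this annotation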
+ dataset = SingleRegionCaptionDataset(
135
+ image=img,
136
+ mask=mask,
137
+ processor=processor,
138
+ prompt_number=prompt_number,
139
+ visual_prompt_tokens=prompt_tokens,
140
+ data_dtype=data_dtype,
141
+ )
142
+ data_sample = dataset[0]
143
+
144
+ with torch.no_grad():
145
+ generate_ids = model.generate(
146
+ **data_sample,
147
+ generation_config=GenerationConfig(
148
+ max_new_tokens=1024,
149
+ do_sample=False,
150
+ eos_token_id=processor.tokenizer.eos_token_id,
151
+ pad_token_id=processor.tokenizer.pad_token_id,
152
+ ),
153
+ return_dict=True,
154
+ )
155
+
156
+ outputs = processor.tokenizer.decode(
157
+ generate_ids.sequences[0], skip_special_tokens=True
158
+ ).strip()
159
+
160
+ print(outputs) # Print model output for this image
161
+
162
+ model_outputs[ann_id] = outputs
163
+ pbar.update(1)
164
+ pbar.close()
165
+
166
+ with open(f"evaluation/DLC-Bench/model_outputs/{cache_name}.json", "w") as file:
167
+ json.dump(model_outputs, file, indent=4, ensure_ascii=False)
168
+
169
+ print(f"Cache name: {cache_name}")
170
+
171
+
172
+ if __name__ == "__main__":
173
+ main()
evaluation/DLC-Bench/model_outputs/gar_1b.json ADDED
@@ -0,0 +1,102 @@
1
+ {
2
+ "279135": "The ski features a predominantly black surface with intricate orange and white geometric patterns along its length. It is equipped with a black binding system, including a silver and black mechanism, and an orange adjustment lever. The tip of the ski has a similar pattern to the rest of the ski.",
3
+ "297718": "A piece of sushi with a bed of white rice, topped with a layer of black seaweed, and filled with a mixture of pink and white fish roe. The top is garnished with a sprinkle of sesame seeds.",
4
+ "361105": "A small cluster of fresh, vibrant green leaves with a smooth texture, attached to a slender, slightly curved stem. The leaves are elongated with pointed tips and a glossy surface, showing a few small brown spots.",
5
+ "622329": "A rectangular, flat, beige eraser with rounded corners and a slightly textured surface.",
6
+ "622332": "A black rectangular stapler with a glossy finish, featuring a silver brand logo on the top right corner. The stapler has a visible metal stapling mechanism on the right side.",
7
+ "1075308": "A vintage-style television set with a boxy, black plastic casing. The front features a large, square screen with a slightly curved surface. The top of the television has a series of buttons and dials, and there is a small, rectangular display area above the screen.",
8
+ "1196168": "A rectangular, white air conditioner unit with a large circular fan grille on the left side. The grille has a grid pattern with multiple blades visible. To the right of the grille, there is a small rectangular panel with a circular emblem and some text.",
9
+ "1770866": "A white price tag with handwritten text in blue and red marker. The text reads \"Libra\" in blue at the top, followed by \"Lb\" in blue, \"per\" in blue, \"lb\" in blue, and \"950\" in red.",
10
+ "1894089": "A metallic screwdriver with a long, slender shaft and a flat, rectangular head. The shaft is smooth and tapers slightly towards the head, which is flat and has a small, circular indentation near the tip.",
11
+ "2391761": "The canoe has a wooden hull with horizontal planks and a blue tarpaulin cover draped over it. The tarpaulin is secured with ropes and has some white markings on it. The canoe also features a small outboard motor mounted on the stern.",
12
+ "2391780": "A bird with a long, slender neck and a pointed beak. Its plumage is predominantly brown with lighter, almost white, streaks on the wings and back. The bird's legs are thin and dark, and it has a small, rounded tail.",
13
+ "2391781": "The bird has a predominantly dark brown body with lighter brown and white markings on its wings and back. Its wings are outstretched, showing a mix of dark and light feathers. The bird's head is slightly turned, with a visible beak and a hint of white feathers around the neck area.",
14
+ "2580318": "The mouse has a sleek, metallic silver body with a smooth, reflective surface. The visible part of the mouse is triangular in shape, with a slightly curved edge and a subtle gradient of light reflecting off its surface.",
15
+ "2580323": "A rectangular wooden frame encloses a detailed architectural blueprint with various lines, symbols, and text. The frame has a natural wood finish and is mounted on a wall.",
16
+ "2588513": "A rectangular wooden block with a light beige top surface and a black bottom surface. The top surface has a smooth texture with visible wood grain patterns, while the bottom surface is solid black.",
17
+ "3993075": "A cylindrical marker with a white body featuring a colorful design, including a blue and green pattern near the middle and a red cap.",
18
+ "4027486": "The bus is predominantly blue with a white section on the right side. It has a black horizontal stripe running along the middle, with a green stripe above it. The rear of the bus features a white license plate with black text. There is a small, white, triangular logo with a black design on the blue section near the rear.",
19
+ "4243725": "The soap is a rectangular, slightly curved bar with a smooth, creamy beige surface.",
20
+ "4502267": "A green bean with a smooth, slightly curved surface, tapering to a point at one end and having a broader, rounded base at the other. The bean has a consistent green color with subtle variations in shading.",
21
+ "4604873": "A large, industrial crane with a lattice structure, featuring a long, horizontal boom extending from a vertical mast. The boom is supported by a series of diagonal cross-bracing and has a hook at the end. The mast is equipped with various mechanical components and a counterweight at the base.",
22
+ "4781902": "A dark brown, wooden ladder with evenly spaced, flat rungs and side rails that taper slightly towards the top.",
23
+ "4782942": "A large, dark-colored, conical-shaped horn with a wide, flared opening and a narrow, cylindrical body.",
24
+ "4782949": "The drum has a circular shape with a red body and a black rim. The drumhead is a light brown color with a blue circular patch in the center.",
25
+ "4916799": "A spherical sculpture with a textured surface composed of small, raised, silver-colored elements. The sphere is adorned with blue, three-dimensional letters spelling \"Reve\" and is mounted on a black, cylindrical base. A green band encircles the sphere, and there are colorful, abstract shapes and patterns on the left side.",
26
+ "5211280": "A stainless steel rice cooker with a black handle on top. The front features a digital display screen in the center, surrounded by various buttons and controls. The buttons are arranged in a circular pattern around the display, with additional buttons below. The cooker has a sleek, modern design with a smooth, reflective surface.",
27
+ "5718392": "A woven basket with a dark brown color and a pattern of interlocking diamond shapes, featuring a sturdy, slightly curved handle.",
28
+ "5718415": "The tent has a yellow canopy with a dark brown edge. The visible part of the tent includes a metal pole with a rusted section near the bottom.",
29
+ "5718424": "A black athletic shoe with a textured surface, featuring a prominent yellow swoosh logo on the side. The shoe has a low-top design with a padded collar and a lace-up closure. The sole is thick and rugged, designed for traction.",
30
+ "6012878": "A square pedestrian traffic light with a black background, featuring a red illuminated hand symbol on the left side.",
31
+ "6037269": "A metallic shower head with a curved, elongated handle. The handle is cylindrical and appears to be made of a light-colored material. The shower head itself is conical with a rounded tip and a slightly wider base, featuring a reflective surface.",
32
+ "6037272": "A green shampoo bottle with a slightly curved shape, featuring a white label with text and a small orange logo.",
33
+ "6055310": "A golden ruler with a series of evenly spaced, small, rectangular notches along its length.",
34
+ "6820594": "A medium-sized cat with a predominantly white face and chest, featuring a mix of brown and black tabby markings on its back and sides. The cat has large, expressive green eyes and a pink nose. Its ears are pointed and have a light brown color with darker tips.",
35
+ "6820595": "A cat with a white face and ears, featuring a mix of black and brown fur on its back and tail. The cat's body is predominantly white with black patches, and it has a short, sleek coat.",
36
+ "7050495": "A black leather handbag with a smooth texture and a slightly curved bottom edge.",
37
+ "8201777": "A black van with a rear window displaying the word \"TAXI\" in large yellow letters. The van has a yellow license plate with black text and a small white sticker on the lower left side of the rear bumper. The rear lights are vertically aligned on both sides of the van.",
38
+ "8331685": "The earphone features a sleek, curved design with a dark gray color. The earpiece is circular and appears to be cushioned for comfort. The headband is also dark gray and has a smooth, slightly glossy finish.",
39
+ "8331699": "The visible part of the printer is black with a smooth, curved surface.",
40
+ "8331718": "A black spiral-bound notebook with a white cover and the word \"Xtreme\" written in white on the cover.",
41
+ "8556674": "A round, orange fruit with a smooth, glossy surface. The fruit has a gradient of colors, transitioning from a deep orange at the bottom to a lighter, almost yellowish hue at the top. There is a small, white, irregularly shaped patch near the top center.",
42
+ "8556676": "A deep red apple with a glossy surface reflecting light, showcasing a smooth curvature and a small, visible portion of the stem.",
43
+ "8557176": "The watch features a rectangular gold-toned case with a black dial. It has a black leather strap with white stitching and a small metallic buckle.",
44
+ "8557195": "The microwave oven features a smooth, curved, off-white exterior with a slightly reflective surface. The visible part of the microwave includes a rounded edge and a small, dark-colored component at the top.",
45
+ "8906172": "A black, in-ear headphone with a sleek, curved design.",
46
+ "9766617": "The goose has a predominantly brown body with a pattern of darker brown and black feathers on its back. Its head is black with a white patch on the side of its neck. The underbelly is white, and the legs and feet are greenish.",
47
+ "10666665": "A round wall clock with a black frame and a white face. The clock features black Arabic numerals for each hour, with the numbers 1 through 12 clearly visible. The hour and minute hands are black and pointed, while the second hand is thin and black. The clock has a simple, minimalist design.",
48
+ "10811497": "A dark green, oval-shaped key with a smooth surface and a small, circular indentation near the bottom.",
49
+ "11012500": "A soft, round tortilla filled with fresh arugula, a slice of ripe tomato, shredded lettuce, and a dollop of creamy sauce.",
50
+ "11021544": "The faucet features a sleek, curved design with a polished chrome finish. It has a single lever handle on the right side for controlling water flow and temperature. The spout is slightly arched, extending outward with a smooth, flowing curve.",
51
+ "11021562": "The microwave oven features a white exterior with a prominent vertical handle on the left side of the door. The door has a series of horizontal ventilation slits near the top.",
52
+ "11021563": "A white stove with four black burners, featuring a control panel with knobs on the back.",
53
+ "11775390": "A green rubber shoe with a thick, textured sole and multiple circular holes on the side. The shoe features a black and white design on the side, with a prominent black section and white accents. The upper part of the shoe has a smooth, rounded shape with a slight curve at the top.",
54
+ "11950619": "The racket has a light-colored wooden handle with a smooth finish. The head of the racket is covered with a transparent protective guard, revealing a blue and white string bed. The guard has a rectangular shape with rounded edges and is secured to the head of the racket.",
55
+ "12178946": "A cylindrical bottle with a yellow cap and a blue label featuring white text.",
56
+ "12348078": "A woman with dark hair tied back in a bun, wearing a white t-shirt with red text and graphics on the front, and black pants. She is seated and holding a baby close to her chest.",
57
+ "12348079": "A digital weighing scale with a rectangular, blue weighing platform on top. The scale has a white base with a control panel on the left side, featuring several buttons and a display screen. The right side of the scale has a series of blue and black buttons.",
58
+ "12348080": "A pair of scissors with bright red handles, each handle forming a loop with a smooth, rounded edge. The blades are metallic and converge at a central pivot point, with one blade partially visible.",
59
+ "13138178": "The stool has a deep blue, glossy finish with a smooth, curved design. The visible part includes a rounded, arch-like structure with a slight indentation in the middle, creating a sleek and modern appearance.",
60
+ "13187927": "The motorcycle is a white scooter with a black seat and a rear storage compartment. It features a rear red tail light and a license plate mounted below the tail light. The handlebars are equipped with rearview mirrors, and the body has a sleek, modern design with a slightly curved front.",
61
+ "14490578": "The harbor seal has a sleek, elongated body with a dark brown to black coloration. Its skin appears smooth and slightly shiny, with a few lighter patches scattered across its back. The seal's head is rounded, and its body tapers towards the tail.",
62
+ "14640483": "A rectangular wooden chopping board with a smooth surface and rounded edges. The board has a natural wood grain pattern and a warm, honey-brown color.",
63
+ "14832137": "A cylindrical, dark blue plastic bucket with a smooth surface and a slightly flared rim. The bucket has a handle attached to the top edge, which is also dark blue.",
64
+ "15050320": "A dark brown wine glass with a wide, shallow bowl and a short stem. The glass has a smooth, reflective surface with a few light reflections visible on the bowl.",
65
+ "16010041": "A pair of light-colored chopsticks with a smooth, slightly tapered design, featuring a subtle gradient from a pale yellow to a soft orange hue at the tips.",
66
+ "16951734": "A triangular slice of mango with a smooth, light orange flesh and a slightly darker orange edge.",
67
+ "16957916": "A piece of green lettuce with a slightly curled edge and a mix of light and dark green hues, featuring a few small brown spots and a hint of red at the base.",
68
+ "17072759": "A black belt with a smooth texture, featuring a silver rectangular buckle.",
69
+ "17072764": "A partially visible pear with a smooth, light green skin transitioning to a yellowish hue towards the top. The pear has a small, brown stem protruding from its top left side.",
70
+ "17265253": "A black rickshaw with a single visible wheel featuring a silver rim and black tire. The wheel is attached to a black frame with a visible axle and a small, round, orange reflector on the side. The rickshaw has a black canopy with a slightly curved top edge.",
71
+ "17265254": "A traditional rickshaw with a black frame and a red seat cushion. It features a single large spoked wheel on each side, connected by a horizontal axle. The rickshaw has a curved handlebar at the front, and a footrest is visible beneath the seat. The wheels are equipped with black tires and silver rims.",
72
+ "17385866": "A scoop of vanilla ice cream with a swirl of red and yellow fruit toppings, possibly strawberry and lemon, on a light green and yellow marbled base.",
73
+ "17404769": "The car is a white SUV with a rear hatchback design. It features a rear window with a slight tint and a small, square fuel cap on the right side of the rear door. The taillights are vertically aligned and wrap around the side of the vehicle. The rear bumper is slightly curved, and the car has a visible rear wheel with a five-spoke alloy rim.",
74
+ "18217373": "The spectacles feature a thin, dark brown frame with a slightly curved bridge. The lenses are rectangular with rounded edges, and the frame has a subtle metallic sheen.",
75
+ "18301585": "The bench features a black metal frame with horizontal slats forming the backrest and seat. The backrest slats are evenly spaced and supported by white, rectangular concrete supports. The seat slats are also black and run parallel to the backrest. The bench has a sturdy, industrial design with a solid, robust appearance.",
76
+ "18680641": "A rectangular, plush, red carpet with a slightly uneven surface and a subtle gradient of darker red in the middle. The edges are bordered by a thin, dark gray trim.",
77
+ "18845103": "A metallic spoon with a slightly curved, elongated handle and a shallow, oval-shaped bowl. The handle has a smooth, polished finish, and the bowl is also metallic with a reflective surface.",
78
+ "19455186": "A blue metal handcart with two horizontal bars and two vertical supports. The cart has two black wheels at the bottom.",
79
+ "19610023": "A bright green croc-style shoe with a thick, textured sole and a wide, open toe design. The shoe features a smooth, rounded toe and a slightly raised heel.",
80
+ "19610025": "A white rabbit with large, upright ears and a red backpack. It is wearing a yellow shirt and blue pants. The rabbit has a playful expression with its mouth open and eyes wide.",
81
+ "20568676": "A stainless steel cooking pot with a rounded bottom and a rolled edge, featuring two riveted handles on opposite sides.",
82
+ "20993402": "A roll of white adhesive tape with a smooth, glossy surface and a slightly reflective sheen. The tape is wound tightly around a cylindrical core, with the outer edge appearing clean and unblemished.",
83
+ "21107974": "A wooden gavel with a cylindrical head featuring three evenly spaced, horizontal grooves. The handle is smooth and tapers slightly towards the end.",
84
+ "21529954": "A cylindrical can with a predominantly green label featuring the word \"Pepsi\" in white, bold letters. The top of the can is orange with a white cap.",
85
+ "22064315": "The visible part of the antelope shows two long, curved horns with a dark, almost black coloration, tapering to a point. The horns are covered in a pattern of ridges and grooves, giving them a textured appearance.",
86
+ "22107522": "A black bow tie with a smooth, satin-like finish, featuring a classic butterfly shape with pointed tips. The bow tie has a symmetrical design with a central knot and two loops that are slightly curved outward.",
87
+ "22879790": "A single, partially peeled white onion with a smooth, slightly shiny surface. The onion has a bulbous shape with a visible root end that is dry and brownish. The layers are tightly packed, and the outer skin is mostly intact, showing a few small, white root remnants.",
88
+ "24010373": "The guitar has a dark, glossy finish with a cutaway body design. It features a white pickguard and a circular soundhole with a simple rosette pattern. The fretboard is dark with white dot inlays, and the headstock is equipped with tuning pegs.",
89
+ "24017816": "The car features a dark-tinted side window with a black frame, and a portion of the front windshield is visible, also with a black frame.",
90
+ "24498027": "A tall, slender black pole with a decorative, ornate top featuring a small, pointed finial. The pole has a horizontal arm extending from the middle, supporting a lantern-style light fixture with a glass enclosure and a metal frame.",
91
+ "24581953": "A large, light gray dog with a short, smooth coat is lying down with its body stretched out. The dog has a long, slender tail that extends straight out behind it. Its legs are extended, with the front legs slightly bent and the hind legs stretched out. The dog's head is resting on the ground, and its ears are relaxed and folded back.",
92
+ "24694197": "A ripe avocado with a bumpy, dark green to almost black skin and a large, round, red to yellow-green pit nestled in the center.",
93
+ "24786060": "A light gray towel with a soft, plush texture, featuring a slightly wrinkled appearance. The towel has a rectangular shape with a visible fold running vertically down the center.",
94
+ "25054869": "A beige toilet cistern with a smooth, curved top surface and a slightly protruding front edge.",
95
+ "25273528": "A hot air balloon with a vibrant pattern of alternating vertical stripes in dark blue, red, and yellow. The balloon has a teardrop shape with a small basket attached at the bottom.",
96
+ "25273553": "A black tripod with three legs, each leg featuring a rubber foot for stability. The legs are connected at the top by a central column, which supports a mounting platform with a quick-release plate. The tripod has a telescopic head with a pan handle for adjusting the angle of the head.",
97
+ "25419495": "The tongs have a dark green handle with a black grip at the end. The metal arms are slightly curved and have a dark, matte finish. The tips of the arms are pointed and designed for precise gripping.",
98
+ "25419509": "A metallic fork with a slightly curved handle and four evenly spaced tines. The handle has a smooth, reflective surface with a subtle gradient from light to dark.",
99
+ "25419516": "The toy is a plush, blue creature with large, expressive eyes and prominent, pointed ears. It has a small tuft of hair on top of its head and a light blue underbelly.",
100
+ "25579493": "A square-shaped mixing bowl with rounded edges, featuring a light blue exterior and a cream-colored interior. The bowl contains a mixture of white and yellow ingredients, with a small piece of red garnish on top.",
101
+ "25612310": "A woven wicker basket with a dark brown hue, featuring a series of horizontal slats and a slightly curved edge."
102
+ }
evaluation/DLC-Bench/model_outputs/gar_1b_eval.json ADDED
The diff for this file is too large to render. See raw diff
 
evaluation/DLC-Bench/model_outputs/gar_1b_eval_gpt.json ADDED
The diff for this file is too large to render. See raw diff
 
evaluation/DLC-Bench/model_outputs/gar_8b.json ADDED
@@ -0,0 +1,102 @@
1
+ {
2
+ "279135": "The ski features a predominantly black base with intricate orange and white geometric patterns. The design includes a series of interconnected shapes and lines, creating a dynamic and modern appearance. The tip of the ski is slightly curved and tapers to a point, with the pattern continuing seamlessly along the length of the ski.",
3
+ "297718": "A piece of sushi with a bed of white rice wrapped in a dark seaweed sheet, filled with a generous portion of pink and white crab meat. The top is sprinkled with sesame seeds and a light drizzle of soy sauce.",
4
+ "361105": "A small cluster of fresh, vibrant green leaves with a smooth texture, attached to a thin, green stem. The leaves are broad and slightly serrated at the edges, with a glossy surface.",
5
+ "622329": "A rectangular, flat, beige-colored eraser with a slightly rough texture and rounded edges.",
6
+ "622332": "A black, rectangular stapler with a glossy finish. The top surface features a white logo and text. The front edge has a slightly raised, horizontal groove.",
7
+ "1075308": "A vintage-style television set with a boxy, black frame and a slightly curved screen. The top of the television features a series of control buttons and a small display screen.",
8
+ "1196168": "A rectangular, wall-mounted air conditioner with a large circular vent on the left side, featuring a grid pattern. The right side of the unit has a smooth surface with a small rectangular panel and a few visible screws.",
9
+ "1770866": "A white tag with handwritten text in blue and red marker. The blue text reads \"LIBRA\" and \"my tabouts\" in a cursive style. Below, in red marker, the text \"Add $50\" is written in a bold, sans-serif font.",
10
+ "1894089": "A metallic screwdriver with a flathead tip and a cylindrical shaft. The handle is textured for grip and has a slight taper towards the tip.",
11
+ "2391761": "The canoe features a blue tarpaulin cover secured over its wooden frame. The visible part of the canoe's hull is made of wooden planks, with a natural brown finish. The canoe has a pointed bow and a slightly raised stern. A white fender is attached to the side, and a red and white lifebuoy is visible inside the canoe.",
12
+ "2391780": "The bird has a long, slender neck and a pointed beak. Its body is covered in brown feathers with a slightly lighter underbelly. The wings are outstretched, showing a mix of brown and white feathers with a distinct pattern. The tail feathers are short and pointed.",
13
+ "2391781": "The bird has a predominantly white body with a mix of gray and brown feathers on its wings and back. Its wings are outstretched, showing a gradient from white at the base to darker shades towards the tips. The bird's head is slightly turned, with a small, pointed beak visible.",
14
+ "2580318": "The mouse has a smooth, metallic surface with a slightly curved, ergonomic shape. The visible part is a triangular section with a gradient of light and dark shades, giving it a sleek and modern appearance.",
15
+ "2580323": "A rectangular wooden picture frame with a light brown finish, containing a detailed architectural floor plan and elevation drawings. The drawings are monochrome and feature various rooms, furniture, and structural elements. The frame has a simple, smooth design with slightly rounded edges.",
16
+ "2588513": "A rectangular wooden block with a light beige color and visible wood grain texture. The block has a black base and a white band wrapped around its middle.",
17
+ "3993075": "A white pen with a red cap and a green and blue design on the barrel.",
18
+ "4027486": "The bus is predominantly blue with a white section near the bottom. It has a rectangular window with a black frame and a visible license plate that reads \"SABF.\" The bus features a sleek, modern design with a slightly curved roof and a small, white, triangular logo near the bottom.",
19
+ "4243725": "A curved, elongated object with a gradient of colors ranging from light yellow to dark brown, featuring a smooth, glossy surface.",
20
+ "4502267": "A green bean with a smooth, slightly curved surface, featuring a gradient of light to dark green hues. The bean has a tapered end and a small, pointed tip.",
21
+ "4604873": "A tall, lattice-style mobile crane with a long, horizontal boom extending to the left. The crane has a rectangular base and a vertical mast with a series of diagonal cross-bracing. The boom is supported by a series of cables and pulleys, and there is a hook at the end of the boom.",
22
+ "4781902": "A dark brown wooden stool with a triangular seat and four legs, each leg angling outward and connected by a lower horizontal support beam.",
23
+ "4782942": "A dark-colored, conical-shaped horn with a wide, flared opening and a smooth, cylindrical body.",
24
+ "4782949": "A cylindrical drum with a dark brown, textured surface and a metallic rim. The drum has a blue and white striped pattern on the side.",
25
+ "4916799": "A spherical sculpture composed of numerous small, white, dome-shaped elements arranged in a grid pattern. The sphere is mounted on a cylindrical base and features a blue band with the word \"Pune\" in blue letters. There are also green and yellow accents on the sphere.",
26
+ "5211280": "A stainless steel crock pot with a curved, dark gray handle on top. The control panel features a digital display in the center, surrounded by various buttons and indicators. The buttons are arranged in a semi-circular pattern around the display, with labels in both English and another language. The crock pot has a smooth, reflective surface and a slightly tapered design towards the base.",
27
+ "5718392": "The box is a rectangular prism with a woven pattern of interlocking dark brown and light brown strips. The surface has a textured appearance, with the weave creating a series of small, diamond-shaped openings.",
28
+ "5718415": "The tent has a yellow canopy with a slightly curved edge. The visible part of the tent includes a vertical metal pole supporting the canopy.",
29
+ "5718424": "A rugged, dark-colored shoe with a thick, textured sole and a prominent, rounded toe. The shoe features a light-colored trim around the opening and a visible lace-up design.",
30
+ "6012878": "A square traffic light with a black background and a red illuminated hand symbol on the left side.",
31
+ "6037269": "A vintage-style shower head with a curved, metallic arm and a cylindrical, cream-colored handle. The shower head itself is round and metallic, with a slightly domed top and a flat bottom.",
32
+ "6037272": "A green, cylindrical shampoo bottle with a slightly tapered end. The bottle has a smooth surface with a small, circular, orange and white label near the top.",
33
+ "6055310": "A wooden measuring stick with a natural finish, featuring black measurement markings in centimeters and millimeters. The stick has a slightly tapered end and a metal tip at the opposite end.",
34
+ "6820594": "A medium-sized cat with a predominantly white face and underbelly, featuring a mix of dark brown and black patches on its back and sides. The cat has large, round, light green eyes and a pink nose. Its ears are upright, with the left ear having a light brown patch and the right ear being mostly white. The cat's fur is short and smooth.",
35
+ "6820595": "A cat with a white face, black ears, and a black patch over its left eye. The body is predominantly black with a white underbelly and a white patch on its right side. The tail is black.",
36
+ "7050495": "A black leather handbag with a smooth, slightly glossy finish. The visible part shows a rectangular shape with a subtle seam along the bottom edge.",
37
+ "8201777": "A black van with a rear window displaying the word \"TAXI\" in yellow letters. The van has a yellow license plate with black text and a small white sticker below it. The rear lights are vertically aligned on both sides, and the van has a small emblem above the license plate.",
38
+ "8331685": "A black over-ear headphone with a curved headband and a cushioned earcup. The earcup has a circular shape with a smooth, matte finish. The headband is attached to the earcup with a visible hinge mechanism.",
39
+ "8331699": "The visible part of the waste container is black with a smooth surface and a slightly curved edge.",
40
+ "8331718": "A black spiral-bound notebook with a white cover featuring the word \"Xtreme\" in a stylized font.",
41
+ "8556674": "A single, round orange with a smooth, glossy surface. The orange has a vibrant, bright orange color with a small, lighter patch near the top left.",
42
+ "8556676": "A deep red apple with a smooth, glossy surface. The apple has a slightly irregular shape with a prominent bulge on the left side and a smaller bulge on the right side. The bottom part of the apple is slightly darker, almost black, with a few small, reflective spots.",
43
+ "8557176": "The watch features a rectangular gold case with a white dial. The strap is black with a textured pattern and a gold buckle.",
44
+ "8557195": "A beige, rectangular bread maker with a smooth surface and slightly rounded edges. The top edge has a small, dark opening.",
45
+ "8906172": "A black, curved earphone with a smooth, glossy finish.",
46
+ "9766617": "The goose has a predominantly brown body with a pattern of darker brown and black feathers on its back. Its head is black with a white patch on the side of its neck. The beak is black, and the legs and feet are also black. The underbelly is white, and the tail feathers are black with a white tip.",
47
+ "10666665": "A round wall clock with a black frame and a white face. The clock features black Arabic numerals at each hour mark, with the numbers 12, 3, 6, and 9 in larger font. The clock has three black hands: an hour hand, a minute hand, and a second hand. The hour hand is pointing between the 10 and 11, the minute hand is pointing at 12, and the second hand is pointing at 6.",
48
+ "10811497": "The mouse is a dark green, oval-shaped device with a smooth surface. It has a small, circular indentation near the bottom edge.",
49
+ "11012500": "A burrito filled with fresh green arugula, a slice of ripe tomato, shredded lettuce, and a layer of seasoned ground meat, all wrapped in a soft, lightly toasted tortilla.",
50
+ "11021544": "A metallic, curved faucet with a polished finish, featuring a single lever handle and a long, slender spout.",
51
+ "11021562": "The microwave oven has a white exterior with a rectangular shape. It features a prominent, curved handle on the front door, which is also white. The control panel is located on the right side of the door, with a series of buttons and a small display screen. The top of the microwave has a vented section for ventilation.",
52
+ "11021563": "A stainless steel gas stove with a black control panel featuring four knobs. The stove has a rectangular shape with a slightly raised back panel. The control panel is positioned at the back, and the stove has a smooth, reflective surface.",
53
+ "11775390": "A green rubber shoe with a textured sole and multiple circular holes on the side. The shoe features a black and white design on the upper part, with green laces threaded through the eyelets.",
54
+ "11950619": "The dumbbell features a white, rectangular handle with rounded edges and a smooth surface. The handle is attached to a metallic, rectangular weight plate with a series of evenly spaced, vertical slots. The weight plate is secured to the handle with a visible screw.",
55
+ "12178946": "A yellow bottle with a blue label featuring white text.",
56
+ "12348078": "A woman with dark hair tied up in a bun, wearing a white t-shirt with red text and graphics on the front, and black pants. She is holding a baby in her arms.",
57
+ "12348079": "A rectangular digital weighing scale with a metallic blue weighing platform. The scale has a white base with a control panel on the left side, featuring several buttons and a small display screen. The edges of the scale are slightly rounded, and the weighing platform has a textured surface.",
58
+ "12348080": "A pair of scissors with bright red plastic handles and metallic blades. The handles are oval-shaped with a smooth, glossy finish. The blades are straight and sharp, with a slight taper towards the tips.",
59
+ "13138178": "A blue plastic stool with a smooth, curved seat and rounded legs. The stool has a simple, sturdy design with a slightly glossy finish.",
60
+ "13187927": "The motorcycle is a white scooter with a sleek, modern design. It features a black seat and a rear storage compartment with a red reflector. The rear light is integrated into the storage compartment, and the scooter has a visible license plate mounted below the light. The handlebars are equipped with rearview mirrors, and the front section includes a headlight and a windshield.",
61
+ "14490578": "The harbor seal has a sleek, elongated body with a dark, almost black coloration. Its skin appears smooth and slightly glossy, with a subtle gradient of lighter shades along its back. The seal's head is rounded, and its body tapers towards the tail.",
62
+ "14640483": "A rectangular wooden chopping board with a smooth surface and a natural wood grain pattern. The board has a slightly rounded edge and a visible handle on one side.",
63
+ "14832137": "A cylindrical, light purple plastic bucket with a smooth surface and a slightly flared rim. The bucket has a small, curved handle attached near the top.",
64
+ "15050320": "A dark brown wine glass with a wide, flat base and a slender stem.",
65
+ "16010041": "A pair of light-colored wooden chopsticks with a smooth, polished surface. The tips of the chopsticks are slightly tapered and have a subtle orange hue.",
66
+ "16951734": "A wedge of cantaloupe with a smooth, light orange flesh and a thin, pale rind.",
67
+ "16957916": "Fresh green lettuce leaves with ruffled edges and a crisp texture, exhibiting a gradient of color from pale green at the base to a darker green towards the tips.",
68
+ "17072759": "A black belt with a smooth texture, featuring a silver rectangular buckle. The belt has a single prong and a loop near the buckle for securing the tail end.",
69
+ "17072764": "A pear with a smooth, light green skin, featuring a slight yellowish hue on the upper right side. The pear has a short, brown stem attached to its top.",
70
+ "17265253": "A black rickshaw with a black canopy, featuring a single visible wheel with a silver rim and black tire. The wheel is attached to a black frame with a visible pedal mechanism.",
71
+ "17265254": "A traditional rickshaw with a black frame and a red seat, featuring a curved handlebar and a single front wheel with spokes and a rubber tire.",
72
+ "17385866": "A scoop of vanilla ice cream topped with a slice of red strawberry, resting on a bed of green mint leaves.",
73
+ "17404769": "The car is a white minivan with a rear design featuring a large, dark-tinted rear window and a smaller, rectangular window on the side. The rear lights are vertically aligned and wrap around the side of the vehicle. The car has a visible rear wheel with a five-spoke alloy rim. There is a small, square fuel cap located on the side panel near the rear wheel.",
74
+ "18217373": "The spectacles feature a round, gold-colored frame with a thin, dark brown temple arm. The lens is a light, translucent yellow.",
75
+ "18301585": "The bench features a black metal frame with horizontal slats forming the backrest and seat. The backrest consists of three horizontal slats, while the seat has two horizontal slats. The bench is supported by white concrete legs that are rectangular in shape and have a slightly tapered design.",
76
+ "18680641": "A rectangular, plush, red carpet with a slightly textured surface and a dark gray border along the edges.",
77
+ "18845103": "A metallic spoon with a slightly curved handle and a shallow, oval-shaped bowl. The handle has a smooth, reflective surface with a subtle taper towards the bowl.",
78
+ "19455186": "A blue metal cart with a rectangular frame and four black wheels. The cart has two horizontal blue bars across the front, with a small white label affixed to the upper bar.",
79
+ "19610023": "A bright green, frog-shaped slipper with a smooth, rounded body and a wide, open mouth. The slipper has a small, raised bump on the top of its head, resembling an eye.",
80
+ "19610025": "A white rabbit with upright ears, wearing a yellow shirt and blue pants, is holding a brown basket on its back.",
81
+ "20568676": "A stainless steel bowl filled with a mixture of chopped nuts and a yellow spatula resting on top.",
82
+ "20993402": "A roll of translucent adhesive tape with a smooth, glossy surface and a slightly reflective finish. The tape is wound tightly around a central cardboard core, which is visible at the top.",
83
+ "21107974": "A wooden gavel with a cylindrical head and a smooth, slightly tapered handle. The head features a prominent, rounded end and a series of horizontal grooves near the top. The handle is uniformly cylindrical and extends straight from the head.",
84
+ "21529954": "A cylindrical can with a white cap, featuring a vibrant design. The top half is orange with a small white logo, while the bottom half is green with a large, stylized white text. The can has a slightly curved shape and a glossy finish.",
85
+ "22064315": "The visible part of the gazelle shows a pair of long, curved horns with a dark, almost black coloration. The horns are smooth and taper to a point. The base of the horns is attached to a light brown, slightly textured head.",
86
+ "22107522": "A black bow tie with a smooth, satin-like finish, featuring a classic butterfly shape with pointed tips.",
87
+ "22879790": "A single, large, white onion with a smooth, slightly shiny surface. The onion has a bulbous shape with a few thin, papery layers visible near the top. The root end is dark brown and slightly shriveled, with a few small roots extending from it.",
88
+ "24010373": "The guitar has a dark, glossy body with a cutaway design. The neck is dark with white dot inlays on the fretboard. The headstock is also dark, matching the neck, and features tuning pegs. The body has a circular soundhole and a bridge with white bridge pins.",
89
+ "24017816": "The van is white with a large, rectangular side window and a side mirror. The window is tinted, and the side mirror is black. The van has a sleek, modern design with smooth lines and a slightly curved roof.",
90
+ "24498027": "A tall, slender black pole with a decorative, ornate top featuring a small, pointed finial. The pole has a rectangular, box-like structure attached near the top, and a smaller, horizontal arm extending from the middle section.",
91
+ "24581953": "A large, light-colored carnivore with a robust, muscular body and a thick, short coat. It has a broad head with small, rounded ears and a long, tapering tail. The legs are sturdy and strong, with large paws.",
92
+ "24694197": "A ripe avocado with a bumpy, dark green skin and a central pit cavity filled with a reddish-brown, creamy substance.",
93
+ "24786060": "A light gray towel with a soft, slightly wrinkled texture, hanging loosely with a gentle curve.",
94
+ "25054869": "The toilet features a smooth, rounded lid with a glossy finish, seamlessly integrated into the tank. The tank has a slightly curved, angular design with a uniform, light beige color.",
95
+ "25273528": "The balloon features a vibrant pattern with alternating vertical stripes of red, yellow, and green. The red stripes are the most prominent, with yellow and green stripes creating a striking contrast. The balloon has a teardrop shape with a small black basket attached at the bottom.",
96
+ "25273553": "A black tripod with a central column and three legs, each leg featuring a rubber foot for stability. The legs are connected to a central hub, which is part of the tripod's support structure.",
97
+ "25419495": "The tongs have a metallic, slightly curved arm with a black rubberized grip handle. The handle is ergonomically designed with a smooth, matte finish. The tongs are open, showing the inner surfaces of the arms, which are also metallic and slightly curved.",
98
+ "25419509": "A metallic fork with a slightly curved handle and four evenly spaced tines. The handle has a smooth, reflective surface with a gentle upward curve near the tines.",
99
+ "25419516": "A plush toy with a blue face, large white eyes with black pupils, and two pointed ears.",
100
+ "25579493": "A small, square-shaped bowl with rounded edges, featuring a light blue exterior and a white interior. The bowl contains a mixture of white rice and a small piece of red food item in the center.",
101
+ "25612310": "A dark brown wicker basket with a woven pattern, featuring a slightly curved edge and a visible portion of the basket's side."
102
+ }
evaluation/DLC-Bench/model_outputs/gar_8b_eval.json ADDED
The diff for this file is too large to render. See raw diff
 
evaluation/DLC-Bench/model_outputs/gar_8b_eval_gpt.json ADDED
The diff for this file is too large to render. See raw diff