Instructions to use Datawall/brend-2b-260602 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use Datawall/brend-2b-260602 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="Datawall/brend-2b-260602")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("Datawall/brend-2b-260602")
model = AutoModelForImageTextToText.from_pretrained("Datawall/brend-2b-260602")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))


vLLM

How to use Datawall/brend-2b-260602 with vLLM:

Install from pip and serve model

# Install vLLM from pip (this model card pins 0.17.0; see "Runtime requirements"):
pip install vllm==0.17.0
# Start the vLLM server:
vllm serve "Datawall/brend-2b-260602"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Datawall/brend-2b-260602",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/Datawall/brend-2b-260602

SGLang

How to use Datawall/brend-2b-260602 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Datawall/brend-2b-260602" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Datawall/brend-2b-260602",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Datawall/brend-2b-260602" \
        --host 0.0.0.0 \
        --port 30000

Docker Model Runner
How to use Datawall/brend-2b-260602 with Docker Model Runner:
```
docker model run hf.co/Datawall/brend-2b-260602
```

brend-2b-260602

GRPO fine-tune of Qwen3-VL-2B-Instruct for GUI element grounding, trained with a click-in-bbox reward. Targeted at ScreenSpot-Pro — high-resolution professional software screenshots (IDEs, CAD, DAWs, scientific tools, office suites, OS chrome).

Setting	Score
ScreenSpot-Pro (full 1581 samples, single-pass)	48.64%
Base Qwen3-VL-2B-Instruct (same harness)	43.26%
Δ from GRPO	+5.38 pp

Runtime requirements — read first

The evaluation that produced 48.64% used vLLM 0.17.0 with very specific flags. Two things will silently give you wrong answers if you skip them:

vllm==0.17.0 exactly. Newer vLLM releases preprocess Qwen3-VL images differently and return coordinates in the wrong space (we've reproduced the regression with vllm > 0.17.0). Pin the version.
--mm-processor-kwargs '{"min_pixels": 1024, "max_pixels": 99999999}'. Without this, vLLM's default max_pixels downsamples the 4K–6K ScreenSpot-Pro screenshots to ~1280-wide, and the tiny widget targets become invisible to the model. Accuracy collapses.

Both of these are environmental, not model issues, but they're load-bearing for getting the published number.
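
To make the pixel-budget point concrete, the sketch below mirrors the smart-resize rule published with the Qwen2-VL preprocessing utilities: snap each side to a multiple of a patch factor and, if the area exceeds max_pixels, shrink both sides by the same ratio. The factor of 32 and the 1280x720 "tight" budget are assumptions chosen for illustration, not values read from vLLM; check your installed preprocessor for the real ones.

```python
import math

def smart_resize(height, width, factor=32, min_pixels=1024, max_pixels=99_999_999):
    """Return (height, width) after a Qwen-style smart resize (illustrative sketch)."""
    # Snap each side to the nearest multiple of `factor`, never below one patch.
    h_bar = max(factor, round(height / factor) * factor)
    w_bar = max(factor, round(width / factor) * factor)
    if h_bar * w_bar > max_pixels:
        # Over budget: shrink both sides by the same ratio, rounding down.
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = max(factor, math.floor(height / beta / factor) * factor)
        w_bar = max(factor, math.floor(width / beta / factor) * factor)
    elif h_bar * w_bar < min_pixels:
        # Under budget: grow both sides by the same ratio, rounding up.
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return h_bar, w_bar

# A 3840x2160 screenshot under a hypothetical tight budget vs. the budget from this card.
tight = smart_resize(2160, 3840, max_pixels=1280 * 720)
roomy = smart_resize(2160, 3840)
print("tight budget:", tight)
print("card budget: ", roomy)
```

Under the tight budget in this sketch the screenshot is scaled down 3x per side, so a 24-pixel-wide icon would span about 8 pixels; with the large budget the image keeps its native width.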

Install

Create a fresh conda env and install the pinned stack:

conda create -n vllm011 python=3.11 -y
conda activate vllm011

# Pinned vLLM (DO NOT upgrade)
python -m uv pip install vllm==0.17.0

# Transformers + Pillow + requests for the client
python -m uv pip install transformers==4.57.6 pillow requests

(If you don't have uv: pip install uv first, or just use pip install directly — uv is a speed optimization, not a correctness one.)
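
Because the pin is easy to lose during a later upgrade, a small pre-flight check can help. This is a minimal sketch using the standard library's importlib.metadata; the helper name vllm_pin_ok is hypothetical, and dropping a local build suffix such as "+cu129" before comparing is an assumption about how your wheel reports its version.

```python
from importlib.metadata import PackageNotFoundError, version

REQUIRED_VLLM = "0.17.0"  # pin stated in this model card

def vllm_pin_ok(installed=None, required=REQUIRED_VLLM):
    """Return True only if the installed vLLM release matches the pin."""
    if installed is None:
        try:
            installed = version("vllm")
        except PackageNotFoundError:
            return False  # vLLM is not installed in this environment
    # Assumption: ignore a local build tag (text after '+') when comparing.
    return installed.split("+")[0] == required

if __name__ == "__main__":
    print("vLLM pin OK" if vllm_pin_ok() else f"Expected vllm=={REQUIRED_VLLM}")
```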

GPU: any NVIDIA card with ≥10 GB VRAM and a CUDA 12.x or newer software stack. Tested on RTX PRO 6000 Blackwell (sm_120, cu130); it should work unchanged on H100 / A100 / RTX 4090.

Serve

conda activate vllm011

CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server \
  --model Datawall/brend-2b-260602 \
  --served-model-name brend-2b \
  --port 8003 \
  --gpu-memory-utilization 0.4 \
  --max-model-len 16384 \
  --max-num-seqs 32 \
  --limit-mm-per-prompt '{"image": 1}' \
  --mm-processor-kwargs '{"min_pixels": 1024, "max_pixels": 99999999}'

The 2B model's BF16 weights take ~5 GB; the rest of the memory reserved by --gpu-memory-utilization 0.4 goes to KV cache for batched serving. Raise --gpu-memory-utilization and --max-num-seqs if you have headroom.
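
As a rough way to choose the utilization value for your own card, the arithmetic can be sketched as follows. The helper name is hypothetical, the ~5 GB weight figure comes from the paragraph above, and the estimate ignores activations and runtime overhead, so treat the result as an upper bound.

```python
def kv_cache_budget_gb(total_vram_gb, gpu_memory_utilization=0.4, weights_gb=5.0):
    """Rough memory (GB) left for KV cache after loading the weights."""
    return total_vram_gb * gpu_memory_utilization - weights_gb

# Compare a 96 GB card with smaller cards at the same 0.4 setting.
for vram in (96, 24, 10):
    budget = kv_cache_budget_gb(vram)
    note = "" if budget > 0 else "  <- raise --gpu-memory-utilization"
    print(f"{vram:>3} GB card: {budget:5.1f} GB for KV cache{note}")
```

On this estimate a 10 GB card cannot even hold the weights at 0.4, so smaller cards likely need a higher --gpu-memory-utilization than the command above uses.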

Use (OpenAI-compatible client)

import base64, re
from io import BytesIO
from PIL import Image
import requests

VLLM_URL = "http://localhost:8003/v1/chat/completions"
MODEL    = "brend-2b"

SYSTEM_PROMPT = """You are a helpful assistant. The user will give you an instruction, and you MUST left click on the corresponding UI element via tool call. If you are not sure about where to click, guess a most likely one.

# Tools

You may call one or more functions to assist with the user query.

You are provided with function signatures within <tools></tools> XML tags:
<tools>
{"type": "function", "function": {"name": "computer_use", "description": "Use a mouse to interact with a computer.\\n* The screen's resolution is 1000x1000.\\n* Make sure to click any buttons, links, icons, etc with the cursor tip in the center of the element. \\n* You can only use the left_click action to interact with the computer.", "parameters": {"properties": {"action": {"description": "The action to perform. The available actions are:\\n* `left_click`: Click the left mouse button with coordinate (x, y).", "enum": ["left_click"], "type": "string"}, "coordinate": {"description": "(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=left_click`.", "type": "array"}, "required": ["action"], "type": "object"}}}
</tools>

For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>"""

def img_to_data_url(img):
    # Encode the PIL image as a base64 PNG data URL for the image_url field.
    buf = BytesIO()
    img.save(buf, format="PNG")
    return "data:image/png;base64," + base64.b64encode(buf.getvalue()).decode()

def ground(image_path, instruction):
    img = Image.open(image_path).convert("RGB")
    payload = {
        "model": MODEL,
        "temperature": 0.0,
        "max_tokens": 64,
        "messages": [
            {"role": "system", "content": [{"type": "text", "text": SYSTEM_PROMPT}]},
            {"role": "user", "content": [
                {"type": "image_url", "image_url": {"url": img_to_data_url(img)}},
                {"type": "text", "text": instruction},
            ]},
        ],
    }
    r = requests.post(VLLM_URL, json=payload, timeout=60)
    r.raise_for_status()
    text = r.json()["choices"][0]["message"]["content"]

    # Model emits a tool_call with coordinates in [0, 1000] relative space.
    m = re.search(r'"coordinate"\s*:\s*\[\s*(-?\d+\.?\d*)\s*,\s*(-?\d+\.?\d*)\s*\]', text)
    if not m:
        return None  # no parsable coordinate in the response
    x_rel, y_rel = float(m.group(1)) / 1000.0, float(m.group(2)) / 1000.0
    # Scale to original image pixels:
    return (x_rel * img.width, y_rel * img.height)

print(ground("screenshot.png", "the save button in the top toolbar"))

Coordinate convention

The model emits (x, y) in [0, 1000] relative space (the computer_use tool prompt declares a fake 1000x1000 screen, and Qwen3-VL is trained to honor that). Divide by 1000 to get normalized [0, 1] coordinates, then multiply by the original image's width/height to get pixels.

Do not pre-resize the image client-side. vLLM's image preprocessor handles smart-resize internally given the mm-processor-kwargs flags above. Client-side resizing throws off the model.
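
A worked example of the conversion, as a sketch: it parses the <tool_call> block as JSON instead of using a regex and then maps the point to pixels. The sample response string and the helper names are hypothetical, and the parser assumes the model emitted a complete, well-formed tool call including the closing tag.

```python
import json
import re

def parse_click(text):
    """Extract (x, y) in [0, 1000] space from a <tool_call> block, or None."""
    m = re.search(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", text, re.DOTALL)
    if not m:
        return None
    try:
        x, y = json.loads(m.group(1))["arguments"]["coordinate"]
        return float(x), float(y)
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return None  # malformed JSON or missing/invalid coordinate

def to_pixels(point, width, height):
    """Map a [0, 1000] relative point onto an image of the given size."""
    x, y = point
    return x / 1000.0 * width, y / 1000.0 * height

# Hypothetical model response for a 3840x2160 screenshot.
sample = (
    "<tool_call>\n"
    '{"name": "computer_use", "arguments": '
    '{"action": "left_click", "coordinate": [500, 250]}}\n'
    "</tool_call>"
)
print(to_pixels(parse_click(sample), 3840, 2160))  # (1920.0, 540.0)
```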

Eval breakdown (ScreenSpot-Pro, full test set, single-pass inference)

Section	Avg	Text	Icon
Development	48.49	70.13	25.52
Creative	45.45	61.62	23.08
CAD	32.95	38.07	17.19
Scientific	49.21	65.28	28.18
Office	70.00	80.23	35.85
Operating Systems	47.45	62.62	29.21
Overall	48.39	62.23	25.99

(48.39% is the micro-average across all 1581 samples. The model-index figure of 48.64% comes from the same eval at the peak checkpoint; the small mismatch is a known Creative-group accounting discrepancy in the eval harness.)

Text grounding is meaningfully stronger than icon grounding across every category — typical for 2B-class grounders.
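
The table can be checked with a few lines of arithmetic. The sketch below computes the unweighted (macro) mean of the section averages, the text-minus-icon gap per section, and the number of correct samples implied by each headline percentage. The numbers are copied from this card; the per-section sample counts are not given here, so the micro-average itself cannot be recomputed.

```python
# (avg, text, icon) accuracy in percent, copied from the breakdown table.
SECTIONS = {
    "Development": (48.49, 70.13, 25.52),
    "Creative": (45.45, 61.62, 23.08),
    "CAD": (32.95, 38.07, 17.19),
    "Scientific": (49.21, 65.28, 28.18),
    "Office": (70.00, 80.23, 35.85),
    "Operating Systems": (47.45, 62.62, 29.21),
}
TOTAL_SAMPLES = 1581

# Unweighted mean over sections; differs from the micro-average because
# sections contain different numbers of samples.
macro_avg = sum(avg for avg, _, _ in SECTIONS.values()) / len(SECTIONS)

# How far text grounding leads icon grounding in each section.
gaps = {name: round(text - icon, 2) for name, (_, text, icon) in SECTIONS.items()}

# Correct-sample counts implied by the two headline percentages.
implied = {pct: round(pct / 100 * TOTAL_SAMPLES) for pct in (48.39, 48.64)}

print(f"macro average: {macro_avg:.3f}")
print("text - icon gaps:", gaps)
print("implied correct samples:", implied)
```

Read this way, the gap between 48.39% and 48.64% corresponds to roughly four samples out of 1581, assuming both figures are rounded from whole-sample counts.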

Training details

Base model: Qwen/Qwen3-VL-2B-Instruct
Method: GRPO with click-in-bbox reward
Hardware: 1× NVIDIA RTX PRO 6000 Blackwell (96 GB GDDR7)
Precision: BF16, no DeepSpeed (single GPU), sdpa attention
Effective batch size: 64 (per-device 2 × grad-accum 32)
Completions per prompt: 2
Max completion length: 32 tokens
Wall clock: ~17 hours for 2 epochs (~1875 steps)
Checkpoint published: step 1350 (peak; 1400/1450 plateau or regress slightly)
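
GRPO scores each completion relative to the other completions sampled for the same prompt. The following is a minimal sketch of that group-relative advantage for the two-completions-per-prompt setting listed above. The use of the population standard deviation and the eps value are assumptions; the trainer used for this model may normalize differently.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-4):
    """Normalize rewards within one prompt's group of completions."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]

print(group_advantages([1.0, 0.0]))  # one hit, one miss: opposite-sign advantages
print(group_advantages([1.0, 1.0]))  # both hit: zero advantage for both
print(group_advantages([0.0, 0.0]))  # both miss: zero advantage for both
```

With only two completions per prompt, a group contributes a learning signal in this sketch only when the two completions receive different rewards.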

Reward function

Coordinates are scored in the [0, 1000] relative space that Qwen3-VL natively emits — matching the space the model is trained to output in.
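
A click-in-bbox reward in that space might look like the sketch below. This is not the authors' training code: the binary 1.0/0.0 scoring, the pixel-space (x1, y1, x2, y2) box format, and the function name are all assumptions, and the coordinate regex is reused from the client example in this card.

```python
import re

COORD_RE = re.compile(
    r'"coordinate"\s*:\s*\[\s*(-?\d+\.?\d*)\s*,\s*(-?\d+\.?\d*)\s*\]'
)

def click_in_bbox_reward(completion, bbox_px, image_size):
    """Return 1.0 if the predicted click lands inside the target box, else 0.0."""
    m = COORD_RE.search(completion)
    if not m:
        return 0.0  # unparsable output earns no reward
    x, y = float(m.group(1)), float(m.group(2))
    width, height = image_size
    x1, y1, x2, y2 = bbox_px
    # Map the pixel-space box into the same [0, 1000] space as the prediction.
    bx1, bx2 = x1 / width * 1000.0, x2 / width * 1000.0
    by1, by2 = y1 / height * 1000.0, y2 / height * 1000.0
    return 1.0 if bx1 <= x <= bx2 and by1 <= y <= by2 else 0.0

# Hypothetical completion and target box on a 3840x2160 screenshot.
hit = '{"name": "computer_use", "arguments": {"action": "left_click", "coordinate": [500, 250]}}'
print(click_in_bbox_reward(hit, (1900, 520, 1960, 560), (3840, 2160)))
```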

Eval methodology

ScreenSpot-Pro test set, all 1581 instruction-style positive samples, English. Single-pass inference — no zoom-in, no agentic loop, no refiner, no consistency router. Eval harness: likaixin2000/ScreenSpot-Pro-GUI-Grounding, adapter: qwen3vl_official_vllm (vLLM-backed, official Qwen team prompt).

Comparison to other 2B models

Model	Inference	Avg
MAI-UI-2B	Zoom In	62.81
UI-Venus-1-5-2B	Single-pass	57.75
brend-2b-260602	Single-pass	48.64
Qwen3-VL-2B-Instruct (base)	Single-pass	43.26

MAI-UI-2B uses inference-time crop/re-query (zoom-in), so it isn't an apples-to-apples comparison with this model. UI-Venus-1-5-2B is the like-for-like single-pass 2B comparison.

Citation

@misc{chen2026brend2b260602,
  title  = {brend-2b-260602: GRPO fine-tune of Qwen3-VL-2B for GUI grounding},
  author = {Kenneth Chen and Sheldon Zhu and Jiabao Zhang},
  year   = {2026},
  howpublished = {\url{https://huggingface.co/Datawall/brend-2b-260602}},
}

License

Apache-2.0, inheriting the base model's license. Training data and eval benchmark are subject to their own upstream licenses.
