brend-2b-260602
GRPO fine-tune of Qwen3-VL-2B-Instruct for GUI element grounding, trained with a click-in-bbox reward. Targeted at ScreenSpot-Pro — high-resolution professional software screenshots (IDEs, CAD, DAWs, scientific tools, office suites, OS chrome).
| | Score |
|---|---|
| ScreenSpot-Pro (full 1581 samples, single-pass) | 48.64% |
| Base Qwen3-VL-2B-Instruct (same harness) | 43.26% |
| Δ from GRPO | +5.38 pp |
Runtime requirements — read first
The evaluation that produced 48.64% used vLLM 0.17.0 with very specific flags. Two things will silently give you wrong answers if you skip them:
- vllm==0.17.0 exactly. Newer vLLM releases handle Qwen3-VL image preprocessing differently and return coordinates in the wrong space (we've reproduced the regression with vllm >0.17). Pin the version.
- --mm-processor-kwargs '{"min_pixels": 1024, "max_pixels": 99999999}'. Without this, vLLM's default max_pixels downsamples the 4K–6K ScreenSpot-Pro screenshots to ~1280-wide, and the tiny widget targets become invisible to the model. Accuracy collapses.
Both of these are environmental, not model issues, but they're load-bearing for getting the published number.
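To get a feel for why the max_pixels cap matters, here is a rough back-of-the-envelope sketch. It models only a uniform scale to a fixed width, not vLLM's actual smart-resize logic. The 3840-pixel source width and the 20-pixel icon are illustrative assumptions rather than measurements from ScreenSpot-Pro; the 1280-wide target is the approximate figure quoted above.

```python
def downsampled_size(target_px: float, src_width: int, dst_width: int) -> float:
    """Size of a UI target after the screenshot is uniformly scaled to dst_width."""
    return target_px * dst_width / src_width


# Hypothetical numbers: a 3840-wide screenshot containing a 20 px toolbar icon,
# scaled down to roughly 1280 px wide.
icon_after = downsampled_size(20, src_width=3840, dst_width=1280)
print(f"20 px icon -> {icon_after:.1f} px after downsampling")
```

Under these assumptions the icon shrinks to a third of its original size, which is consistent with the small widget targets becoming hard to resolve.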
Install
Create a fresh conda env and install the pinned stack:
conda create -n vllm011 python=3.11 -y
conda activate vllm011
# Pinned vLLM (DO NOT upgrade)
python -m uv pip install vllm==0.17.0
# Transformers + Pillow + requests for the client
python -m uv pip install transformers==4.57.6 pillow requests
(If you don't have uv: pip install uv first, or just use pip install
directly — uv is a speed optimization, not a correctness one.)
GPU: any CUDA 12.x card with ≥10 GB VRAM. Tested on RTX PRO 6000 Blackwell (sm_120, cu130) and should work on H100 / A100 / RTX 4090 unchanged.
Serve
conda activate vllm011
CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server \
--model Datawall/brend-2b-260602 \
--served-model-name brend-2b \
--port 8003 \
--gpu-memory-utilization 0.4 \
--max-model-len 16384 \
--max-num-seqs 32 \
--limit-mm-per-prompt '{"image": 1}' \
--mm-processor-kwargs '{"min_pixels": 1024, "max_pixels": 99999999}'
A 2B BF16 model fits in ~5 GB; the rest of the 0.4 utilization budget is
KV cache for batched serving. Bump --gpu-memory-utilization and
--max-num-seqs if you have headroom.
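Before sending grounding requests, it can help to confirm that the server is up and exposing the expected model name. The sketch below queries the OpenAI-compatible /v1/models endpoint using only the standard library. It assumes the port and --served-model-name from the serve command above; the helper names are hypothetical.

```python
import json
from urllib.request import urlopen

BASE_URL = "http://localhost:8003"  # matches --port 8003 in the serve command
EXPECTED = "brend-2b"               # matches --served-model-name


def served_model_ids(models_response: dict) -> list:
    """Extract model ids from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in models_response.get("data", [])]


def server_has_model(base_url: str = BASE_URL, name: str = EXPECTED) -> bool:
    """Query the running server and report whether `name` is being served."""
    with urlopen(f"{base_url}/v1/models", timeout=10) as resp:
        return name in served_model_ids(json.load(resp))
```

Call server_has_model() once the server has finished loading; a connection error means the server is not reachable yet.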
Use (OpenAI-compatible client)
import base64, re
from io import BytesIO
from PIL import Image
import requests
VLLM_URL = "http://localhost:8003/v1/chat/completions"
MODEL = "brend-2b"
SYSTEM_PROMPT = """You are a helpful assistant. The user will give you an instruction, and you MUST left click on the corresponding UI element via tool call. If you are not sure about where to click, guess a most likely one.
# Tools
You may call one or more functions to assist with the user query.
You are provided with function signatures within <tools></tools> XML tags:
<tools>
{"type": "function", "function": {"name": "computer_use", "description": "Use a mouse to interact with a computer.\\n* The screen's resolution is 1000x1000.\\n* Make sure to click any buttons, links, icons, etc with the cursor tip in the center of the element. \\n* You can only use the left_click action to interact with the computer.", "parameters": {"properties": {"action": {"description": "The action to perform. The available actions are:\\n* `left_click`: Click the left mouse button with coordinate (x, y).", "enum": ["left_click"], "type": "string"}, "coordinate": {"description": "(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=left_click`.", "type": "array"}, "required": ["action"], "type": "object"}}}
</tools>
For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>"""
def img_to_data_url(img):
    # Encode a PIL image as a base64 PNG data URL for the image_url field.
    buf = BytesIO()
    img.save(buf, format="PNG")
    return "data:image/png;base64," + base64.b64encode(buf.getvalue()).decode()


def ground(image_path, instruction):
    # Returns the predicted click point in original-image pixels,
    # or None if no coordinate could be parsed from the response.
    img = Image.open(image_path).convert("RGB")
    payload = {
        "model": MODEL,
        "temperature": 0.0,
        "max_tokens": 64,
        "messages": [
            {"role": "system", "content": [{"type": "text", "text": SYSTEM_PROMPT}]},
            {"role": "user", "content": [
                {"type": "image_url", "image_url": {"url": img_to_data_url(img)}},
                {"type": "text", "text": instruction},
            ]},
        ],
    }
    r = requests.post(VLLM_URL, json=payload, timeout=60)
    r.raise_for_status()
    text = r.json()["choices"][0]["message"]["content"]
    # Model emits a tool_call with coordinates in [0, 1000] relative space.
    m = re.search(r'"coordinate"\s*:\s*\[\s*(-?\d+\.?\d*)\s*,\s*(-?\d+\.?\d*)\s*\]', text)
    if not m:
        return None
    x_rel, y_rel = float(m.group(1)) / 1000.0, float(m.group(2)) / 1000.0
    # Scale to original image pixels:
    return (x_rel * img.width, y_rel * img.height)


print(ground("screenshot.png", "the save button in the top toolbar"))
Coordinate convention
The model emits (x, y) in [0, 1000] relative space (the computer_use
tool prompt declares a fake 1000x1000 screen, and Qwen3-VL is trained to
honor that). Divide by 1000 to get normalized [0, 1] coordinates, then
multiply by the original image's width/height to get pixels.
Do not pre-resize the image client-side. vLLM's image preprocessor
handles smart-resize internally given the mm-processor-kwargs flags
above. Client-side resizing throws off the model.
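As a worked example of the convention described above, the following sketch maps a model-emitted coordinate to pixels. The 3840x2160 screenshot size and the (500, 250) model output are made-up values for illustration, and to_pixels is a hypothetical helper name.

```python
def to_pixels(x_1000: float, y_1000: float, width: int, height: int):
    """Map a model-emitted (x, y) in [0, 1000] space to original-image pixels."""
    return (x_1000 / 1000.0 * width, y_1000 / 1000.0 * height)


# Hypothetical 3840x2160 screenshot; the model output (500, 250) is made up.
print(to_pixels(500, 250, 3840, 2160))  # (1920.0, 540.0)
```

Note that the width and height passed in are those of the original, unresized image, since the relative space is independent of whatever resolution the preprocessor uses internally.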
Eval breakdown (ScreenSpot-Pro, full test set, single-pass inference)
| Section | Avg | Text | Icon |
|---|---|---|---|
| Development | 48.49 | 70.13 | 25.52 |
| Creative | 45.45 | 61.62 | 23.08 |
| CAD | 32.95 | 38.07 | 17.19 |
| Scientific | 49.21 | 65.28 | 28.18 |
| Office | 70.00 | 80.23 | 35.85 |
| Operating Systems | 47.45 | 62.62 | 29.21 |
| Overall | 48.39 | 62.23 | 25.99 |
(48.39% is the micro-average across all 1581 samples; the model-index 48.64% figure is the same eval at the peak checkpoint — small mismatch is a known Creative-group accounting discrepancy in the eval harness.)
Text grounding is meaningfully stronger than icon grounding across every category — typical for 2B-class grounders.
Training details
- Base model: Qwen/Qwen3-VL-2B-Instruct
- Method: GRPO with click-in-bbox reward
- Hardware: 1× NVIDIA RTX PRO 6000 Blackwell (96 GB GDDR7)
- Precision: BF16, no DeepSpeed (single GPU), sdpa attention
- Effective batch size: 64 (per-device 2 × grad-accum 32)
- Completions per prompt: 2
- Max completion length: 32 tokens
- Wall clock: 17 hours for 2 epochs (1875 steps)
- Checkpoint published: step 1350 (peak; 1400/1450 plateau or regress slightly)
Reward function
Coordinates are scored in the [0, 1000] relative space that Qwen3-VL
natively emits — matching the space the model is trained to output in.
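The training code is not part of this card, so the following is only a minimal sketch of what a click-in-bbox reward could look like. It assumes a binary reward, a ground-truth box converted into the same [0, 1000] space, a zero reward for unparseable completions, and the coordinate regex from the client example. The function names and the (x1, y1, x2, y2) box layout are hypothetical.

```python
import re

# Same pattern the client example uses to pull the click out of a tool_call.
COORD_RE = re.compile(
    r'"coordinate"\s*:\s*\[\s*(-?\d+\.?\d*)\s*,\s*(-?\d+\.?\d*)\s*\]'
)


def to_relative_bbox(bbox_px, width, height):
    """Convert a pixel-space (x1, y1, x2, y2) box to [0, 1000] space."""
    x1, y1, x2, y2 = bbox_px
    return (x1 / width * 1000, y1 / height * 1000,
            x2 / width * 1000, y2 / height * 1000)


def click_in_bbox_reward(completion: str, bbox_1000) -> float:
    """Return 1.0 if the predicted click lies inside bbox_1000, else 0.0.

    Completions without a parseable coordinate score 0.0 (an assumption).
    """
    m = COORD_RE.search(completion)
    if m is None:
        return 0.0
    x, y = float(m.group(1)), float(m.group(2))
    x1, y1, x2, y2 = bbox_1000
    return 1.0 if (x1 <= x <= x2 and y1 <= y <= y2) else 0.0


# Hypothetical target: a 60x60 px button on a 3840x2160 screenshot.
bbox = to_relative_bbox((1900, 500, 1960, 560), width=3840, height=2160)
hit = '{"name": "computer_use", "arguments": {"action": "left_click", "coordinate": [500, 250]}}'
miss = '{"name": "computer_use", "arguments": {"action": "left_click", "coordinate": [100, 100]}}'
print(click_in_bbox_reward(hit, bbox), click_in_bbox_reward(miss, bbox))
```

Scoring in relative space means the reward never depends on the resolution the image was resized to during training, only on the original box geometry.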
Eval methodology
ScreenSpot-Pro test set, all 1581 instruction-style positive samples, English. Single-pass
inference: no zoom-in, no agentic loop, no refiner, no consistency router.
Eval harness: likaixin2000/ScreenSpot-Pro-GUI-Grounding,
adapter: qwen3vl_official_vllm (vLLM-backed, official Qwen team prompt).
Comparison to other 2B models
| Model | Inference | Avg |
|---|---|---|
| MAI-UI-2B | Zoom In | 62.81 |
| UI-Venus-1-5-2B | Single-pass | 57.75 |
| brend-2b-260602 | Single-pass | 48.64 |
| Qwen3-VL-2B-Instruct (base) | Single-pass | 43.26 |
MAI-UI-2B uses inference-time crop/re-query (zoom-in), so it isn't an apples-to-apples comparison with this model. UI-Venus-1-5-2B is the fair single-pass 2B comparison.
Citation
@misc{chen2026brend2b260602,
  title        = {brend-2b-260602: GRPO fine-tune of Qwen3-VL-2B for GUI grounding},
  author       = {Kenneth Chen and Sheldon Zhu and Jiabao Zhang},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/Datawall/brend-2b-260602}},
}
License
Apache-2.0, inheriting the base model's license. Training data and eval benchmark are subject to their own upstream licenses.