brend-2b-260602

GRPO fine-tune of Qwen3-VL-2B-Instruct for GUI element grounding, trained with a click-in-bbox reward. Targeted at ScreenSpot-Pro — high-resolution professional software screenshots (IDEs, CAD, DAWs, scientific tools, office suites, OS chrome).

Score
ScreenSpot-Pro (full 1581 samples, single-pass)   48.64%
Base Qwen3-VL-2B-Instruct (same harness)          43.26%
Δ from GRPO                                       +5.38 pp

Runtime requirements — read first

The evaluation that produced 48.64% used vLLM 0.17.0 with very specific flags. Two things will silently give you wrong answers if you skip them:

  1. vllm==0.17.0 exactly. Newer vLLM releases handle Qwen3-VL image preprocessing differently and return coordinates in the wrong space (we've reproduced the regression with vllm >0.17). Pin the version.
  2. --mm-processor-kwargs '{"min_pixels": 1024, "max_pixels": 99999999}'. Without this, vLLM's default max_pixels downsamples the 4K–6K ScreenSpot-Pro screenshots to roughly 1280 px wide, and the tiny widget targets become invisible to the model. Accuracy collapses.

Both are environmental issues rather than model issues, but reproducing the published number depends on them.
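To make the max_pixels point concrete, here is a minimal sketch of an area cap applied to a 4K screenshot. It is a simplification and not vLLM's or Qwen3-VL's actual resize code (which also snaps each side to a patch multiple). The function name `capped_size`, the 1280×720 cap, and the 24 px icon size are illustrative assumptions chosen to match the "roughly 1280 px wide" behaviour described above.

```python
import math

def capped_size(width, height, max_pixels):
    """Scale (width, height) down, preserving aspect ratio, to fit an area cap.

    Simplified illustration only: real preprocessors also round each side
    to a multiple of the vision patch size.
    """
    area = width * height
    if area <= max_pixels:
        return width, height
    scale = math.sqrt(max_pixels / area)
    return round(width * scale), round(height * scale)

# Hypothetical numbers: a 3840x2160 screenshot under a 1280*720 pixel cap.
w, h = capped_size(3840, 2160, max_pixels=1280 * 720)
icon_px = 24                      # assumed on-screen width of a small toolbar icon
shrunk_icon = icon_px * w / 3840  # width of the same icon after downsampling
print((w, h), shrunk_icon)

# With the cap from the flag above, a 4K screenshot's area is well under the
# limit, so the image keeps its native resolution.
print(capped_size(3840, 2160, max_pixels=99_999_999))
```

Under these assumed numbers the icon shrinks to a third of its original width, which is the effect the flag is there to prevent.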

Install

Create a fresh conda env and install the pinned stack:

conda create -n vllm011 python=3.11 -y
conda activate vllm011

# Pinned vLLM (DO NOT upgrade)
python -m uv pip install vllm==0.17.0

# Transformers + Pillow + requests for the client
python -m uv pip install transformers==4.57.6 pillow requests

(If you don't have uv: pip install uv first, or just use pip install directly — uv is a speed optimization, not a correctness one.)
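Because a wrong vLLM version fails silently, it can help to check the pin before serving. The snippet below is a small sketch using only the standard library; `matches_pin` and `installed_vllm_version` are hypothetical helper names, and the exact-string comparison is an assumption (a build with a local version suffix would need a looser check).

```python
from importlib.metadata import PackageNotFoundError, version

PINNED_VLLM = "0.17.0"  # the version this README says to pin

def matches_pin(installed, pinned=PINNED_VLLM):
    """True only for an exact match; None means vLLM is not installed."""
    return installed is not None and installed == pinned

def installed_vllm_version():
    """Return the installed vLLM version string, or None if it is missing."""
    try:
        return version("vllm")
    except PackageNotFoundError:
        return None

found = installed_vllm_version()
if matches_pin(found):
    print(f"vllm {found}: OK")
else:
    print(f"vllm {found!r} does not match the pinned {PINNED_VLLM}")
```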

GPU: any CUDA 12.x card with ≥10 GB VRAM. Tested on RTX PRO 6000 Blackwell (sm_120, cu130) and should work on H100 / A100 / RTX 4090 unchanged.

Serve

conda activate vllm011

CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server \
  --model Datawall/brend-2b-260602 \
  --served-model-name brend-2b \
  --port 8003 \
  --gpu-memory-utilization 0.4 \
  --max-model-len 16384 \
  --max-num-seqs 32 \
  --limit-mm-per-prompt '{"image": 1}' \
  --mm-processor-kwargs '{"min_pixels": 1024, "max_pixels": 99999999}'

A 2B BF16 model fits in ~5 GB; the rest of the 0.4 utilization budget is KV cache for batched serving. Bump --gpu-memory-utilization and --max-num-seqs if you have headroom.
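The memory split can be estimated with simple arithmetic. This is a rough sketch that uses the ~5 GB weight figure from above and ignores activation and workspace memory, so the result is an upper bound; the function name and the card sizes are illustrative assumptions.

```python
def kv_cache_budget_gb(total_vram_gb, gpu_memory_utilization, weights_gb=5.0):
    """Rough memory left for KV cache after loading the weights.

    Ignores activation/workspace memory, so treat the result as an upper bound.
    """
    budget = total_vram_gb * gpu_memory_utilization
    return max(budget - weights_gb, 0.0)

for vram in (10, 24, 96):  # hypothetical card sizes in GB
    left = round(kv_cache_budget_gb(vram, 0.4), 1)
    print(f"{vram} GB card at 0.4 utilization -> about {left} GB for KV cache")
```

By this estimate, 0.4 on a 10 GB card leaves less than the weights need, so smaller cards would need a higher --gpu-memory-utilization.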

Use (OpenAI-compatible client)

import base64, re
from io import BytesIO
from PIL import Image
import requests

VLLM_URL = "http://localhost:8003/v1/chat/completions"
MODEL    = "brend-2b"

SYSTEM_PROMPT = """You are a helpful assistant. The user will give you an instruction, and you MUST left click on the corresponding UI element via tool call. If you are not sure about where to click, guess a most likely one.

# Tools

You may call one or more functions to assist with the user query.

You are provided with function signatures within <tools></tools> XML tags:
<tools>
{"type": "function", "function": {"name": "computer_use", "description": "Use a mouse to interact with a computer.\\n* The screen's resolution is 1000x1000.\\n* Make sure to click any buttons, links, icons, etc with the cursor tip in the center of the element. \\n* You can only use the left_click action to interact with the computer.", "parameters": {"properties": {"action": {"description": "The action to perform. The available actions are:\\n* `left_click`: Click the left mouse button with coordinate (x, y).", "enum": ["left_click"], "type": "string"}, "coordinate": {"description": "(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=left_click`.", "type": "array"}, "required": ["action"], "type": "object"}}}
</tools>

For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>"""

def img_to_data_url(img):
    # Encode the PIL image as a base64 PNG data URL for the image_url field.
    buf = BytesIO()
    img.save(buf, format="PNG")
    return "data:image/png;base64," + base64.b64encode(buf.getvalue()).decode()

def ground(image_path, instruction):
    img = Image.open(image_path).convert("RGB")
    payload = {
        "model": MODEL,
        "temperature": 0.0,
        "max_tokens": 64,
        "messages": [
            {"role": "system", "content": [{"type": "text", "text": SYSTEM_PROMPT}]},
            {"role": "user", "content": [
                {"type": "image_url", "image_url": {"url": img_to_data_url(img)}},
                {"type": "text", "text": instruction},
            ]},
        ],
    }
    r = requests.post(VLLM_URL, json=payload, timeout=60)
    r.raise_for_status()
    text = r.json()["choices"][0]["message"]["content"]

    # Model emits a tool_call with coordinates in [0, 1000] relative space.
    m = re.search(r'"coordinate"\s*:\s*\[\s*(-?\d+\.?\d*)\s*,\s*(-?\d+\.?\d*)\s*\]', text)
    if not m:
        return None  # no coordinate found in the reply
    x_rel, y_rel = float(m.group(1)) / 1000.0, float(m.group(2)) / 1000.0
    # Scale to original image pixels:
    return (x_rel * img.width, y_rel * img.height)

print(ground("screenshot.png", "the save button in the top toolbar"))

Coordinate convention

The model emits (x, y) in [0, 1000] relative space (the computer_use tool prompt declares a fake 1000x1000 screen, and Qwen3-VL is trained to honor that). Divide by 1000 to get normalized [0, 1] coordinates, then multiply by the original image's width/height to get pixels.

Do not pre-resize the image client-side. vLLM's image preprocessor handles smart-resize internally given the mm-processor-kwargs flags above. Client-side resizing throws off the model.
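As a worked example of the convention, the sketch below parses a reply by decoding the tool-call JSON instead of using a regex, then converts to pixels. The sample reply string and the helper names `parse_click` and `to_pixels` are assumptions for illustration; the real model output may differ in whitespace or field order.

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def parse_click(text):
    """Return (x, y) in [0, 1000] space from a tool_call reply, or None."""
    m = TOOL_CALL_RE.search(text)
    if not m:
        return None
    try:
        call = json.loads(m.group(1))
    except json.JSONDecodeError:
        return None
    args = call.get("arguments")
    if not isinstance(args, dict):
        return None
    coord = args.get("coordinate")
    if not (isinstance(coord, list) and len(coord) == 2):
        return None
    if not all(isinstance(v, (int, float)) for v in coord):
        return None
    return float(coord[0]), float(coord[1])

def to_pixels(x_rel, y_rel, width, height):
    """Map [0, 1000] relative coordinates onto the original image size."""
    return x_rel / 1000.0 * width, y_rel / 1000.0 * height

# Hypothetical reply for a 3840x2160 screenshot.
reply = ('<tool_call>\n{"name": "computer_use", "arguments": '
         '{"action": "left_click", "coordinate": [500, 250]}}\n</tool_call>')
click = parse_click(reply)
print(click, to_pixels(*click, width=3840, height=2160))
```

Decoding the JSON rejects malformed replies outright, whereas the regex in the client above matches any coordinate-shaped text in the reply.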

Eval breakdown (ScreenSpot-Pro, full test set, single-pass inference)

Section            Avg (%)  Text (%)  Icon (%)
Development        48.49    70.13     25.52
Creative           45.45    61.62     23.08
CAD                32.95    38.07     17.19
Scientific         49.21    65.28     28.18
Office             70.00    80.23     35.85
Operating Systems  47.45    62.62     29.21
Overall            48.39    62.23     25.99

(48.39% is the micro-average across all 1581 samples. The model-index figure of 48.64% comes from the same eval at the peak checkpoint; the small mismatch is a known Creative-group accounting discrepancy in the eval harness.)

Text grounding is meaningfully stronger than icon grounding across every category — typical for 2B-class grounders.
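The text/icon gap can be read straight off the table. The sketch below computes it per category and, for contrast with the micro-average, the unweighted (macro) mean of the six category averages. Per-category sample counts are not listed here, so the micro-average itself is not recomputed.

```python
# (avg, text, icon) accuracy in percent, copied from the table above.
rows = {
    "Development":       (48.49, 70.13, 25.52),
    "Creative":          (45.45, 61.62, 23.08),
    "CAD":               (32.95, 38.07, 17.19),
    "Scientific":        (49.21, 65.28, 28.18),
    "Office":            (70.00, 80.23, 35.85),
    "Operating Systems": (47.45, 62.62, 29.21),
}

# Text-minus-icon gap in percentage points for each category.
gaps = {name: round(text - icon, 2) for name, (_, text, icon) in rows.items()}

# Unweighted mean of category averages; differs from the micro-average
# because the categories do not all contain the same number of samples.
macro_avg = sum(avg for avg, _, _ in rows.values()) / len(rows)

for name, gap in gaps.items():
    print(f"{name:<18} text - icon = {gap:.2f} pp")
print(f"macro average = {macro_avg:.2f}%")
```

The gap is positive in every category and smallest for CAD, where text accuracy is also at its lowest.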

Training details

  • Base model: Qwen/Qwen3-VL-2B-Instruct
  • Method: GRPO with click-in-bbox reward
  • Hardware: 1× NVIDIA RTX PRO 6000 Blackwell (96 GB GDDR7)
  • Precision: BF16, no DeepSpeed (single GPU), sdpa attention
  • Effective batch size: 64 (per-device 2 × grad-accum 32)
  • Completions per prompt: 2
  • Max completion length: 32 tokens
  • Wall clock: 17 hours for 2 epochs (1875 steps)
  • Checkpoint published: step 1350 (peak; 1400/1450 plateau or regress slightly)
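The figures above are mutually consistent, as a quick back-of-the-envelope check shows. The time-per-step and time-to-checkpoint values are derived here under the assumption of a constant step time; they are not measurements reported by the authors.

```python
per_device_batch = 2
grad_accum = 32
num_gpus = 1
total_steps = 1875
wall_clock_hours = 17
published_step = 1350

effective_batch = per_device_batch * grad_accum * num_gpus
seconds_per_step = wall_clock_hours * 3600 / total_steps
progress = published_step / total_steps           # fraction of the run completed
hours_to_checkpoint = wall_clock_hours * progress  # assumes constant step time

print(f"effective batch size: {effective_batch}")
print(f"average step time: {seconds_per_step:.2f} s")
print(f"published checkpoint: {progress:.0%} of the run, "
      f"about {hours_to_checkpoint:.1f} h in")
```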

Reward function

The click-in-bbox reward scores coordinates in the [0, 1000] relative space that Qwen3-VL natively emits, so the reward is computed in the same space as the model's output.
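A click-in-bbox reward in that space could look like the sketch below. This is an illustration, not the training code: the binary 0/1 reward, the inclusive box edges, the population standard deviation, and all function names are assumptions. The second half shows the group-relative advantage that GRPO-style methods compute, here for the group size of 2 listed under Training details.

```python
from statistics import mean, pstdev

def pixel_bbox_to_rel(bbox_px, width, height):
    """Convert a pixel bbox (x1, y1, x2, y2) to [0, 1000] relative space."""
    x1, y1, x2, y2 = bbox_px
    return (x1 / width * 1000, y1 / height * 1000,
            x2 / width * 1000, y2 / height * 1000)

def click_in_bbox_reward(x, y, bbox_rel):
    """1.0 if the click lies inside the box (edges inclusive), else 0.0."""
    x1, y1, x2, y2 = bbox_rel
    return 1.0 if (x1 <= x <= x2 and y1 <= y <= y2) else 0.0

def group_advantages(rewards, eps=1e-4):
    """Normalize rewards within one prompt's group of completions."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical target on a 3840x2160 screenshot.
bbox = pixel_bbox_to_rel((1900, 520, 1960, 560), width=3840, height=2160)
rewards = [click_in_bbox_reward(500, 250, bbox),   # lands inside the box
           click_in_bbox_reward(700, 250, bbox)]   # misses to the right
print(rewards, group_advantages(rewards))
print(group_advantages([1.0, 1.0]))  # both hit: no learning signal
```

With a binary reward and two completions per prompt, only prompts where the two completions disagree produce a non-zero advantage.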

Eval methodology

ScreenSpot-Pro test set, all 1581 instruction-style positive samples, English. Single-pass inference — no zoom-in, no agentic loop, no refiner, no consistency router. Eval harness: likaixin2000/ScreenSpot-Pro-GUI-Grounding, adapter: qwen3vl_official_vllm (vLLM-backed, official Qwen team prompt).
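For orientation, single-pass scoring reduces to one point-in-box test per sample. The sketch below is not the harness's code: it assumes ground-truth boxes are (x1, y1, x2, y2) in pixels, treats an unparseable reply (None) as a miss, and uses made-up samples. The last lines derive the hit count implied by the headline score.

```python
def is_hit(pred_px, bbox_px):
    """True if the predicted pixel point falls inside the ground-truth box."""
    if pred_px is None:  # no parseable click counts as a miss
        return False
    x, y = pred_px
    x1, y1, x2, y2 = bbox_px
    return x1 <= x <= x2 and y1 <= y <= y2

def accuracy(samples):
    """samples: list of (pred_px or None, bbox_px). Returns percent correct."""
    if not samples:
        return 0.0
    return 100.0 * sum(is_hit(p, b) for p, b in samples) / len(samples)

# Made-up samples: two hits, one miss, one unparseable reply.
toy = [
    ((1920.0, 540.0), (1900, 520, 1960, 560)),
    ((100.0, 100.0),  (90, 90, 130, 110)),
    ((3000.0, 200.0), (1900, 520, 1960, 560)),
    (None,            (10, 10, 40, 40)),
]
print(accuracy(toy))

# Hit count implied by 48.64% on 1581 samples (derived, not reported).
implied_hits = round(48.64 / 100 * 1581)
print(implied_hits, round(100 * implied_hits / 1581, 2))
```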

Comparison to other 2B models

Model                        Inference    Avg (%)
MAI-UI-2B                    Zoom In      62.81
UI-Venus-1-5-2B              Single-pass  57.75
brend-2b-260602              Single-pass  48.64
Qwen3-VL-2B-Instruct (base)  Single-pass  43.26

MAI-UI-2B uses inference-time crop/re-query (zoom-in), so it isn't an apples-to-apples comparison with this model. UI-Venus-1-5-2B is the like-for-like single-pass 2B comparison.

Citation

@misc{chen2026brend2b260602,
  title  = {brend-2b-260602: GRPO fine-tune of Qwen3-VL-2B for GUI grounding},
  author = {Kenneth Chen and Sheldon Zhu and Jiabao Zhang},
  year   = {2026},
  howpublished = {\url{https://huggingface.co/Datawall/brend-2b-260602}},
}

License

Apache-2.0, inheriting the base model's license. Training data and eval benchmark are subject to their own upstream licenses.
