browserground — Qwen3-VL-2B LoRA for hybrid AI agents (v0.1)
The local UI-grounding specialist for hybrid AI agents. Drop in a screenshot + text target, get a strict JSON bbox. 2B params. MLX-native. Apache 2.0.
Why this exists — the hybrid AI argument
Today, most AI agents route every screenshot to a cloud frontier model (GPT-4V, Claude Vision, Gemini) just to find click coordinates. That's a $0.01–0.05 multimodal call adding 800ms–2s of latency, repeated 20–50× per agent run. Cost and latency compound. Screenshots full of private UI leave your machine.
A general 200B-parameter LLM is overkill for "where is the Submit button?" — that's a narrow vision task. The right shape is a hybrid one: cheap fast specialist local models for the dedicated tasks they handle better, and the cloud LLM only for the planning and reasoning it's uniquely good at.
That's exactly what browserground is — the click-grounding specialist.
| | Pure-cloud (status quo) | Hybrid (+ browserground) |
|---|---|---|
| Per-screenshot cost | $0.01–0.05 | $0 |
| Latency | 800ms–2s round-trip | ~1.8s local |
| Tokens billed by cloud | 1500+ multimodal | ~40 text |
| Screenshots leave machine | yes | no |
| Rate limits | yes | no |
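To make the compounding concrete, here is a minimal sketch that multiplies the per-call figures quoted above by the 20 to 50 grounding calls of one agent run. The function name `per_run_range` is illustrative and not part of browserground; the inputs are the numbers from this section, and real bills depend on the provider and image size.

```python
def per_run_range(per_call_low, per_call_high, calls_low=20, calls_high=50):
    """Return (best case, worst case) totals for one agent run."""
    return per_call_low * calls_low, per_call_high * calls_high

# Per-call figures quoted in this section for a cloud multimodal call.
cost_low, cost_high = per_run_range(0.01, 0.05)   # USD
lat_low, lat_high = per_run_range(0.8, 2.0)       # seconds

print(f"cloud grounding cost per run: ${cost_low:.2f} to ${cost_high:.2f}")
print(f"cloud grounding latency per run: {lat_low:.0f}s to {lat_high:.0f}s")
```

Under these assumptions a single run spends roughly $0.20 to $2.50 and 16 to 100 seconds on grounding calls alone.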
What it does
Given a screenshot and a target description ("submit form button", "the red Sign Up link", "the second profile picture from the left"), this LoRA-fine-tuned Qwen3-VL-2B emits a strict JSON object:
{"bbox_2d": [x1, y1, x2, y2]}
— the pixel coordinates of the element to click. 100% format compliance on the held-out evaluation. Drop it into any browser-agent / screen-automation pipeline that needs to ground language → click target.
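A pipeline usually needs a single click point rather than a box. The sketch below parses the JSON and returns the box center; clicking the center is an assumption made here (it matches how the evaluation scores a hit), and `click_point` is a hypothetical helper, not part of the browserground package.

```python
import json

def click_point(model_output: str) -> tuple[int, int]:
    """Parse '{"bbox_2d": [x1, y1, x2, y2]}' and return the center of the box."""
    x1, y1, x2, y2 = json.loads(model_output)["bbox_2d"]
    return round((x1 + x2) / 2), round((y1 + y2) / 2)

print(click_point('{"bbox_2d": [344, 612, 478, 658]}'))  # (411, 635)
```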
Results on ScreenSpot-v2
Point-grounding accuracy, 300 held-out items (100 per split: mobile / desktop / web). A hit = predicted bbox center falls inside the ground-truth bbox.
| Model | Params | Overall | Mobile | Desktop | Web | Format-OK |
|---|---|---|---|---|---|---|
| GPT-4o (cloud) | — | 18.3% | — | — | — | — |
| SeeClick (Qwen-VL-Chat) | 9.6B | 55.1% | — | — | — | — |
| ShowUI-2B | 2B | 75.5% | — | — | — | — |
| UI-TARS-2B-SFT (ByteDance) | 2B | 89.5% | — | — | — | — |
| OS-Atlas-Base-7B | 7B | ~91% | — | — | — | — |
| browserground v0.1 (this model) | 2B | 45.3% | 64.0% | 28.0% | 44.0% | 100% |
| Qwen3-VL-2B-Instruct (zero-shot baseline) | 2B | 6.3% | 7.0% | 6.0% | 6.0% | 100% |
- Beats GPT-4o by 2.5× and zero-shot Qwen3-VL by 7× on the same benchmark
- 100% strict-JSON format compliance — no markdown fences, no commentary
- Sits below SeeClick, ShowUI, UI-TARS, and OS-Atlas at v0.1; v0.2 (Tier 2, target ≥ 60%) is on the roadmap
Numbers for SeeClick / ShowUI / UI-TARS / OS-Atlas are from the OS-Atlas paper's reported ScreenSpot-v2 leaderboard.
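The hit criterion described above can be sketched in a few lines. This is an illustration, not the evaluation script used for the table: treating the box edges as inclusive is an assumption, and the two example pairs are made-up boxes. The last line only checks that the reported overall score equals the mean of the three equally sized splits.

```python
def is_hit(pred, gt):
    """True if the center of the predicted box lies inside the ground-truth box."""
    cx = (pred[0] + pred[2]) / 2
    cy = (pred[1] + pred[3]) / 2
    return gt[0] <= cx <= gt[2] and gt[1] <= cy <= gt[3]

def accuracy(pairs):
    """Fraction of (predicted, ground-truth) box pairs that count as hits."""
    return sum(is_hit(p, g) for p, g in pairs) / len(pairs)

# Made-up boxes for illustration only.
pairs = [
    ([344, 612, 478, 658], [340, 600, 480, 660]),  # center (411, 635) is inside: hit
    ([10, 10, 50, 50], [100, 100, 200, 200]),      # center (30, 30) is outside: miss
]
print(accuracy(pairs))  # 0.5

# Mean of the mobile / desktop / web scores reported for browserground v0.1.
print(round((64.0 + 28.0 + 44.0) / 3, 1))  # 45.3
```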
Quick start
npm install -g browserground
browserground parse screenshot.png --target "Submit button"
# {"bbox_2d": [344, 612, 478, 658]}
Full install + agent-stack integration: github.com/renezander030/browserground.
Use from Python directly
# Load the base model and processor, then attach and merge the LoRA adapter.
from transformers import AutoProcessor, Qwen3VLForConditionalGeneration
from peft import PeftModel
import torch
from PIL import Image

processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-2B-Instruct")
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-VL-2B-Instruct", dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(model, "renezander030/browserground")
model = model.merge_and_unload()  # fold the LoRA weights into the base model
model.eval()

img = Image.open("screenshot.png").convert("RGB")
messages = [
    {"role": "system", "content": [{"type": "text", "text":
        'You are a UI-grounding model. Given a screenshot and a target description, '
        'output the bounding box of the SINGLE UI element to click. Output ONLY a JSON '
        'object: {"bbox_2d": [x1, y1, x2, y2]} with pixel coordinates, origin at top-left.'}]},
    {"role": "user", "content": [
        {"type": "image", "image": img},
        {"type": "text", "text": "Locate the element described: Submit button"},
    ]},
]

# Render the chat template to text, then encode the text and image together.
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[[img]], return_tensors="pt").to(model.device)

# Greedy decoding; print only the newly generated tokens (the JSON bbox).
out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(processor.tokenizer.decode(out[0, inputs.input_ids.shape[1]:], skip_special_tokens=True))
Training recipe
- Base: Qwen/Qwen3-VL-2B-Instruct
- Method: LoRA rank 16, alpha 32, dropout 0.05, on all 7 linear modules of the LM (q/k/v/o/gate/up/down)
- Trainable params: 17.4 M (0.81% of base)
- Data mix (12k examples):
  - OS-Atlas-Data desktop_domain (macOS): 4k
  - OS-Atlas-Data mobile_domain (aw_mobile, Android): 4k
  - OS-Atlas-Data mobile_domain (UIBert): 4k
- Hyperparams: bf16, LR 1e-4, cosine schedule, batch 1 × grad-accum 8 (effective batch 8), 1 epoch, gradient checkpointing on
- Hardware: 1× L40S 48 GB (RunPod Secure Cloud)
- Compute cost: ~$2 training + ~$0.50 eval
- Wall time: ~2 hr total
Full training scripts (private repo, request access): renezander030/imgparse-tier1.
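As a sanity check on the recipe, the trainable-parameter count can be reproduced by hand. The sketch below is not the training script. The layer dimensions are an assumption (Qwen3-1.7B-style language-model sizes, which this card does not state); with them, rank-16 LoRA on the 7 linear modules lands on the 17.4 M figure above. The step count assumes every one of the 12k examples is used exactly once.

```python
# ASSUMPTION: language-model dimensions, not stated in this card.
HIDDEN, INTERMEDIATE, LAYERS = 2048, 6144, 28
Q_OUT = 16 * 128   # 16 query heads x head_dim 128
KV_OUT = 8 * 128   # 8 key/value heads x head_dim 128
RANK = 16          # LoRA rank from the recipe

# (in_features, out_features) of the 7 adapted linear modules in each layer.
modules = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

# LoRA adds A (rank x in) and B (out x rank) per module: rank * (in + out) weights.
per_layer = sum(RANK * (n_in + n_out) for n_in, n_out in modules.values())
total = per_layer * LAYERS
print(f"{total:,} trainable params")  # 17,432,576

# One epoch over 12k examples at batch 1 x grad-accum 8.
optimizer_steps = 12_000 // (1 * 8)
print(optimizer_steps)  # 1500
```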
Output format
{"bbox_2d": [x1, y1, x2, y2]}
— a single-line JSON object with pixel coordinates (top-left origin). No markdown fences, no commentary, no <ref> tokens. Verified 100% parseable on the eval set.
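A format-compliance check along these lines is easy to run in your own harness. The sketch below is one possible strictness level, not the check used for the reported numbers; `is_format_ok` and the sample strings are hypothetical.

```python
import json

def is_format_ok(text: str) -> bool:
    """True if text is one single-line JSON object {"bbox_2d": [x1, y1, x2, y2]}."""
    text = text.strip()
    if "\n" in text:
        return False
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return False  # commentary, fences, or <ref> tokens make the string unparseable
    if not isinstance(obj, dict) or set(obj) != {"bbox_2d"}:
        return False
    box = obj["bbox_2d"]
    return (
        isinstance(box, list)
        and len(box) == 4
        and all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in box)
    )

outputs = [
    '{"bbox_2d": [344, 612, 478, 658]}',
    '<ref>Submit button</ref>{"bbox_2d": [1, 2, 3, 4]}',
    'Here is the box: {"bbox_2d": [1, 2, 3, 4]}',
    '{"bbox_2d": [1, 2, 3]}',
]
print([is_format_ok(o) for o in outputs])  # [True, False, False, False]
```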
Limitations & next
- Web and desktop accuracy lag mobile (we trained primarily on macOS + mobile UI). v0.2 adds 8k+ web records and ~2× total data.
- Long-tail icon recognition is weaker than text grounding.
- No mouse-action prediction: this model only locates the element; it does not decide between click, hover, and type. Pair it with an action predictor for full computer-use loops.
- English-only training data.
Use cases (what this is a drop-in for)
- Claude Computer Use / Claude Code screen-grounding tool calls
- OpenAI Codex CLI screen-grounding extension
- browser-use / Skyvern click-targeting (Python adapter in the GitHub repo)
- Custom agent stacks that need a $0/call grounding step instead of GPT-4V per screenshot
- Self-hosted compound-AI systems with a routing layer (specialist model for grounding, general LLM for planning)
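The routing idea behind these use cases can be sketched as a function that takes the planner and the grounder as plain callables. Everything named here (`next_click`, `plan_target`, `locate`, and the two stubs) is hypothetical and stands in for whatever cloud client and local inference wrapper your stack uses. The point of the design is that the planner sees only text, and the screenshot goes only to the local grounder.

```python
from typing import Callable, Optional

def next_click(
    screenshot_path: str,
    goal: str,
    plan_target: Callable[[str], Optional[str]],  # hypothetical cloud planner: goal -> target text
    locate: Callable[[str, str], list],           # hypothetical local grounder: (image, target) -> bbox
) -> Optional[dict]:
    """Route planning to the general LLM and grounding to the local specialist."""
    target = plan_target(goal)  # text-only call, no screenshot sent to the cloud
    if target is None:
        return None             # planner decided there is nothing left to click
    return {"target": target, "bbox_2d": locate(screenshot_path, target)}

# Stubs so the sketch runs without any model or network access.
def fake_planner(goal):
    return "Submit button"

def fake_grounder(path, target):
    return [344, 612, 478, 658]

print(next_click("screenshot.png", "submit the form", fake_planner, fake_grounder))
```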
Work with me
This adapter is a public reference of the recipe I deliver to freelance clients: small, fast, structured-output local specialists that slot into compound-AI agent stacks and cut cloud-LLM bills without losing capability.
If you need one of these, I can build it:
- a UI-grounding model trained on your own product's screenshots — your dashboard, your app, your customer interfaces — for higher recall on the elements your agents actually click
- a hybrid agent architecture that routes narrow tasks (grounding, OCR, classification, embedding, extraction) to local specialist models and reserves cloud frontier LLMs for the reasoning that actually needs them
- an on-prem agent deployment — Apple Silicon (MLX), CUDA box, or your existing K8s — with no screenshots leaving your infrastructure
- a structured-output evaluation harness that tells you when the local model is actually good enough to replace the cloud call in production
Reach out: https://renezander.com
Citation
@misc{browserground-2026,
title = {browserground: Qwen3-VL-2B LoRA for hybrid AI agent UI grounding},
author = {Zander, René},
year = {2026},
url = {https://huggingface.co/renezander030/browserground}
}
License
Apache 2.0, same as the base model Qwen/Qwen3-VL-2B-Instruct.
Acknowledgements
- Qwen/Qwen3-VL-2B-Instruct: base
- OS-Copilot/OS-Atlas-Data: training data
- agentsea/wave-ui (for the upcoming v0.2 web slice)
- OS-Copilot/ScreenSpot-v2: evaluation set