Multimodal Vision Agent — LoRA Adapter

QLoRA fine-tuned adapter for Qwen2.5-7B-Instruct that converts natural language desktop UI instructions into structured browser automation actions.

Model Details
Supported Actions
Quick Start
Real-World Test Results
v2 Improvements (vs v1)
Training Details
Full Project Structure
Dependencies
License
Citation

Model Details

This adapter fine-tunes Qwen2.5-7B-Instruct using QLoRA (4-bit NF4 quantization) to produce structured UI actions (click, type, navigate, scroll, wait, done) from natural language instructions. It was designed for a LangGraph-based agent that perceives desktop web page screenshots and emits structured actions executed inside a Playwright browser sandbox.

Why Qwen2.5-7B-Instruct?

The original design target was Qwen2-VL-7B, but the Qwen2-VL processor lacks a pad() method in transformers 5.x, which caused data collator failures during training. Qwen2.5-7B-Instruct has the same scale (7B parameters) and a mature, well-supported tokenizer, making it the more practical choice for predicting UI actions from text instructions. Because Qwen2.5-7B-Instruct is a text-only model, the adapter predicts actions from the instruction text rather than from screenshot pixels.

Architecture Overview

The agent framework operates as a LangGraph state machine with three nodes:

Perception Node — Captures a browser screenshot + DOM snapshot, compresses action history, and feeds everything to the VLM.
Action Node — Executes the predicted action in the Playwright browser sandbox (click, type, navigate, scroll, wait).
Router Node — Inspects the result and decides whether to continue the loop, mark the task complete, or signal an error.

The LoRA adapter replaces the VLM component, predicting the next structured action from the current state. The full framework is available on GitHub.
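The three-node loop described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the repository's LangGraph code: `LoopState`, `perceive`, `act`, `route`, `run_loop`, and the scripted `fake_predict` are hypothetical names standing in for the real nodes and the model call.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LoopState:
    """Hypothetical stand-in for the state passed between nodes."""
    instruction: str
    history: list = field(default_factory=list)  # actions executed so far
    pending: Optional[dict] = None               # action chosen by perception
    status: str = "running"                      # "running", "done", or "error"


def perceive(state, predict):
    # Perception: ask the predictor for the next structured action.
    state.pending = predict(state.instruction, state.history)
    return state


def act(state):
    # Action: a real node would drive the browser; here we only record the action.
    state.history.append(state.pending)
    return state


def route(state, max_steps):
    # Router: finish on "done", signal an error once the step budget is spent.
    if state.pending.get("action") == "done":
        state.status = "done"
    elif len(state.history) >= max_steps:
        state.status = "error"
    return state


def run_loop(instruction, predict, max_steps=10):
    state = LoopState(instruction)
    while state.status == "running":
        state = route(act(perceive(state, predict)), max_steps)
    return state


def fake_predict(instruction, history):
    # Scripted predictor used only for this demo: scroll once, then finish.
    script = [{"action": "scroll", "direction": "down"}, {"action": "done"}]
    return script[min(len(history), len(script) - 1)]


final = run_loop("Scroll down the page", fake_predict)
print(final.status, len(final.history))  # done 2
```

The step budget in `route` is one simple way to keep a predictor that never emits `done` from looping forever.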

Model Card

Property	Value
Base Model	Qwen/Qwen2.5-7B-Instruct
Adapter Architecture	LoRA (Low-Rank Adaptation)
Adapter Size	~20 MB (adapter weights only; the base model is loaded separately in 4-bit NF4)
Quantization	`bitsandbytes` NF4 — double quant, float16 compute dtype
LoRA Rank	`r=16`, `lora_alpha=32`, `dropout=0.05`
Target Modules	`q_proj`, `v_proj`
Training Data	28 instruction-action pairs
Training Epochs	10
Optimizer	AdamW (peak learning rate 2e-4)
Final Loss	0.033
Hardware	NVIDIA GeForce RTX 4090 (25.3 GB VRAM)
Training Time	~79 seconds
Framework	Hugging Face Transformers + PEFT + bitsandbytes

Supported Actions

The model outputs structured JSON inside <action> tags. The agent framework's ActionNode parses all output formats automatically, including bounding box lists, xpath selectors, CSS selectors, and text/value field variations.

Action	Description	Input Fields	Example Output (v2)
`click`	Click a UI element	`bbox` `[x, y, w, h]`, or `selector` (CSS), or `xpath`	`{"action":"click","selector":"a[href='/signup']"}`
`type`	Type text into an input field	`bbox` + `text`, or `selector` + `text`, or `xpath` + `text`	`{"action":"type","xpath":"//input[@name='email']","text":"user@example.com"}`
`navigate`	Navigate to a URL (absolute or relative)	`url`	`{"action":"navigate","url":"/settings"}`
`scroll`	Scroll the page up or down	`direction` (`"up"` or `"down"`)	`{"action":"scroll","direction":"down"}`
`wait`	Pause execution briefly	(none)	`{"action":"wait"}`
`done`	Signal task completion	(none)	`{"action":"done"}`
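The field requirements in the table above can be checked before an action is executed. The following is a minimal sketch, assuming the table is the full contract; `validate_action` and `TARGET_FIELDS` are hypothetical names, not part of the published framework.

```python
SUPPORTED_ACTIONS = {"click", "type", "navigate", "scroll", "wait", "done"}
TARGET_FIELDS = ("bbox", "selector", "xpath")


def validate_action(action: dict) -> list:
    """Return a list of problems; an empty list means the action looks well-formed."""
    name = action.get("action")
    if name not in SUPPORTED_ACTIONS:
        return [f"unknown action: {name!r}"]
    problems = []
    # click and type both need some way to locate the element.
    if name in {"click", "type"} and not any(f in action for f in TARGET_FIELDS):
        problems.append("missing target: expected bbox, selector, or xpath")
    if name == "type" and "text" not in action:
        problems.append("missing text")
    if name == "navigate" and not action.get("url"):
        problems.append("missing url")
    if name == "scroll" and action.get("direction") not in {"up", "down"}:
        problems.append("direction must be 'up' or 'down'")
    return problems


print(validate_action({"action": "click", "selector": "a[href='/signup']"}))  # []
print(validate_action({"action": "type", "xpath": "//input[@name='email']"}))  # ['missing text']
```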

Output Format Details

The model can produce bounding boxes in two formats:

List format (most common): "bbox": [x, y, width, height]
Object format: "bbox": {"x": ..., "y": ..., "width": ..., "height": ...}

The model also supports element targeting via:

XPath selectors: "xpath": "//input[@name='username']"
CSS selectors: "selector": "a[href='/signup']" or "selector": "#login_field"
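Both bounding-box forms and the selector variants above can be reduced to one internal shape before execution. The sketch below is illustrative only: `normalize_bbox`, `click_point`, and `pick_target` are hypothetical helpers. The priority order in `pick_target`, and the reading of `x, y` as the top-left corner, are assumptions that this card does not confirm.

```python
def normalize_bbox(bbox):
    """Return (x, y, width, height) from either the list or the object form."""
    if isinstance(bbox, dict):
        return (bbox["x"], bbox["y"], bbox["width"], bbox["height"])
    x, y, w, h = bbox
    return (x, y, w, h)


def click_point(bbox):
    """Center of the box, assuming (x, y) is its top-left corner."""
    x, y, w, h = normalize_bbox(bbox)
    return (x + w / 2, y + h / 2)


def pick_target(action):
    """Choose one targeting strategy; this priority order is an assumption."""
    for key in ("selector", "xpath", "bbox"):
        if key in action:
            return key, action[key]
    return None, None


print(click_point([450, 380, 120, 40]))  # (510.0, 400.0)
print(click_point({"x": 450, "y": 380, "width": 120, "height": 40}))  # (510.0, 400.0)
print(pick_target({"action": "click", "selector": "#login_field"}))  # ('selector', '#login_field')
```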

Quick Start

Installation

pip install torch transformers peft bitsandbytes accelerate sentencepiece

Inference

import torch
import json
import re
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

# --- Step 1: Configure 4-bit quantization ---
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

# --- Step 2: Load base model with quantization ---
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.float16,
    trust_remote_code=True,
)

# --- Step 3: Load LoRA adapter ---
model = PeftModel.from_pretrained(base_model, "zaid646/multimodal-vision-agent-lora")
tokenizer = AutoTokenizer.from_pretrained("zaid646/multimodal-vision-agent-lora")


# --- Step 4: Define prediction function ---
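def predict_action(instruction: str) -> dict:
    # Use the same prompt template the adapter was trained on.
    prompt = f"### Human: {instruction}\n### Assistant:"
    # Send inputs to the model's device instead of hard-coding "cuda".
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=80,
            temperature=0.1,
            do_sample=True,
        )
    # Decode only the newly generated tokens, not the prompt.
    response = tokenizer.decode(
        outputs[0][inputs.input_ids.shape[1]:],
        skip_special_tokens=True,
    ).strip()
    print(f"Raw model output: {response}")
    match = re.search(r"<action>(.*?)</action>", response, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(1))
        except json.JSONDecodeError:
            pass  # malformed JSON inside the tags: fall through to the default
    # No parseable action found: treat it as task completion.
    return {"action": "done"}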


# --- Step 5: Test with various instructions ---
print(predict_action("Click the login button"))
# Expected: {'action': 'click', 'bbox': [450, 380, 120, 40]}

print(predict_action("Type email into the email field"))
# Expected: {'action': 'type', 'xpath': "//input[@name='email']", 'text': 'user@example.com'}

print(predict_action("Navigate to settings"))
# Expected: {'action': 'navigate', 'url': '/settings'}

print(predict_action("Scroll down the page"))
# Expected: {'action': 'scroll', 'direction': 'down'}

print(predict_action("Stop"))
# Expected: {'action': 'done'}

Full Agent Integration

For the complete agent loop with Playwright browser sandbox, LangGraph state machine, and evaluation harness, clone the GitHub repository:

git clone https://github.com/ZAID646/qwen2.5-vl-7b-playwright-desktop-lora.git
cd qwen2.5-vl-7b-playwright-desktop-lora

# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install --upgrade pip
pip install torch --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt

# Install Playwright browsers
playwright install chromium
playwright install-deps chromium

# Run all unit tests (no GPU required for MockVLM mode)
pytest -v

Real-World Test Results

The v2 adapter was tested against 4 real-world scenarios on actual websites using Playwright in headless Chromium mode on an RTX 4090. Each test captured before/after screenshots.

Test 1: GitHub Login

The model was instructed to fill the username and password fields on the GitHub login page (https://github.com/login).

Stage	Description	Result
Instruction 1	"Type username into the username field"	Model predicted `xpath: //input[@name='username']`, filled field
Instruction 2	"Type password into the password field"	Model predicted `xpath: //input[@name='password']`, filled field
Verification	`page.input_value("#login_field")` and `#password`	Both fields verified non-empty

Model output format: The v2 adapter produces semantic XPath selectors (//input[@name='username']) instead of brittle raw paths seen in v1 (/html/body/div/div/form/div[1]/input).
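The difference between the two XPath styles can be reproduced offline. The sketch below uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath, and two made-up markup strings: one with the layout the raw path expects, and one where a `<section>` wrapper has been added around the form. The markup is an assumption for illustration and is not taken from the GitHub login page.

```python
import xml.etree.ElementTree as ET

ORIGINAL = (
    "<html><body><div><div><form><div>"
    "<input name='username'/>"
    "</div></form></div></div></body></html>"
)
# Same form after a layout change: a <section> wrapper now surrounds it.
CHANGED = (
    "<html><body><div><div><section><form><div>"
    "<input name='username'/>"
    "</div></form></section></div></div></body></html>"
)

# ElementTree paths are relative to the root <html> element, so the raw path
# /html/body/div/div/form/div[1]/input is written without its leading /html.
RAW_PATH = "./body/div/div/form/div[1]/input"
SEMANTIC_PATH = ".//input[@name='username']"


def count_matches(markup, path):
    """Number of elements in `markup` matched by `path`."""
    return len(ET.fromstring(markup).findall(path))


for label, markup in (("original", ORIGINAL), ("changed", CHANGED)):
    print(label, count_matches(markup, RAW_PATH), count_matches(markup, SEMANTIC_PATH))
# original 1 1
# changed 0 1
```

The raw path stops matching as soon as one ancestor changes, while the attribute-based path still finds the field.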

Test 2: HTTPBin Form

The model was instructed to fill name and email fields on https://httpbin.org/forms/post.

Stage	Description	Result
Instruction 1	"Type name into the name field"	Model predicted `bbox: [200, 200, 300, 40]`, filled field
Instruction 2	"Type email into the email field"	Model predicted `xpath: //input[@name='email']`, filled field
Verification	`input[name='custname']` and `input[name='custemail']`	Both fields verified non-empty

Test 3: Scroll

The model was instructed to scroll down on a long GitHub README page.

Stage	Description	Result
Before	`window.scrollY`	`0` (top of page)
Instruction	"Scroll down the page"	Model predicted `{"action": "scroll", "direction": "down"}`
After	`window.scrollY`	`500` (scrolled 500 pixels down)

Test 4: Click Link

The model was instructed to click a link on https://example.com.

Stage	Description	Result
Before	Page URL	`https://example.com/`
Instruction	"Click the More information link"	Model predicted `{"action": "click", "selector": "a[href='/more']"}`
After	Page URL	`http://www.iana.org/help/example-domains`

The model correctly identified the action type as click and attempted a CSS selector. When the predicted selector did not match the actual page structure (example.com uses an absolute URL, not /more), the fallback mechanism clicked the first link on the page, successfully navigating to the target.
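A fallback of the kind described above might look like the following sketch. The repository's actual fallback logic is not shown in this card, so `click_with_fallback` and its return labels are hypothetical. The only Playwright calls assumed are `page.query_selector(...)` and `element.click()`; `FakePage` and `FakeElement` are test doubles so the example runs without a browser.

```python
def click_with_fallback(page, selector, fallback_selector="a"):
    """Click `selector` if it matches; otherwise click the first fallback match.

    Returns "predicted", "fallback", or "none" so the caller can log which path ran.
    """
    element = page.query_selector(selector)
    if element is not None:
        element.click()
        return "predicted"
    element = page.query_selector(fallback_selector)
    if element is not None:
        element.click()
        return "fallback"
    return "none"


class FakeElement:
    """Test double that records clicks in place of a Playwright ElementHandle."""
    def __init__(self):
        self.clicked = False

    def click(self):
        self.clicked = True


class FakePage:
    """Test double exposing only query_selector, keyed by selector string."""
    def __init__(self, elements):
        self.elements = elements

    def query_selector(self, selector):
        return self.elements.get(selector)


link = FakeElement()
page = FakePage({"a": link})
print(click_with_fallback(page, "a[href='/more']"))  # fallback
```

Returning which path ran lets a test report separate clicks driven by the model's selector from clicks that succeeded only through the fallback.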

v2 Improvements (vs v1)

Area	v1	v2
Training Data Size	15 examples	28 examples (87% increase)
Output Formats	`bbox` only	`bbox` + `xpath` + CSS `selector`
XPath Quality	Raw paths (`/html/body/.../input`)	Semantic (`//input[@name='username']`)
Click Targeting	`bbox` only	`bbox` + CSS selectors
Action Coverage	click, type, navigate, scroll	click, type, navigate, scroll, wait, done
Scroll Directions	down only	up and down
Browser Detection	None (blocked by sites like HN)	User-agent spoof + `navigator.webdriver` override
Agent Robustness	Single format, crashes on unexpected output	Graceful fallbacks for all formats
Final Training Loss	0.056	0.033

Key Behavioral Changes

Semantic XPath Output: v1 produced rigid paths like /html/body/div/div/form/div[1]/input that break on any DOM change. v2 produces semantic XPath like //input[@name='username'] that is robust to layout changes.
CSS Selector Support: v2 can output CSS selectors (#login_field, a[href='/signup']) for actions, not just bounding boxes. This enables more precise element targeting.
Browser Stealth: The Playwright BrowserManager now launches Chromium with --disable-blink-features=AutomationControlled and registers an init script (add_init_script in Playwright's Python API, addInitScript in JavaScript) that hides the navigator.webdriver property. This is intended to keep sites such as Hacker News, and services such as Cloudflare, from detecting headless automation.
ActionNode Robustness: The agent's ActionNode now handles all output formats: bbox as list [x, y, w, h] or object {x, y, width, height}, xpath string, CSS selector string, text/value field variations, and scroll_direction/direction field name variations.
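The stealth setup in the list above can be approximated with Playwright's Python sync API. This is a sketch under assumptions: `open_stealth_page` is a hypothetical helper, not the repository's BrowserManager, and the init script overrides `navigator.webdriver` so it reads as `undefined`, which may differ from what the project actually injects.

```python
STEALTH_ARGS = ["--disable-blink-features=AutomationControlled"]

# JavaScript evaluated before any page script runs; it makes
# navigator.webdriver read as undefined.
STEALTH_INIT_SCRIPT = (
    "Object.defineProperty(navigator, 'webdriver', { get: () => undefined });"
)


def open_stealth_page(playwright, user_agent=None):
    """Launch headless Chromium with the flag and init script applied.

    `playwright` is the object yielded by `sync_playwright()`.
    """
    browser = playwright.chromium.launch(headless=True, args=STEALTH_ARGS)
    context = browser.new_context(user_agent=user_agent)
    context.add_init_script(STEALTH_INIT_SCRIPT)
    return browser, context.new_page()


# Usage (requires `pip install playwright` and `playwright install chromium`):
#
#     from playwright.sync_api import sync_playwright
#     with sync_playwright() as p:
#         browser, page = open_stealth_page(p)
#         page.goto("https://example.com")
#         browser.close()
```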

Training Details

Dataset

The training dataset consists of 28 instruction-output pairs covering all 6 supported actions with diverse output formats:

#	Instruction	Action	Output Format
1	Click the login button	`click`	`bbox: [450, 380, 120, 40]`
2	Click submit	`click`	`bbox: [500, 600, 100, 40]`
3	Click first result	`click`	`bbox: [100, 250, 800, 60]`
4	Click the sign up link	`click`	`selector: "a[href='/signup']"`
5	Select dropdown	`click`	`bbox: [300, 400, 200, 40]`
6	Submit form	`click`	`bbox: [450, 700, 120, 40]`
7	Check checkbox	`click`	`bbox: [350, 500, 20, 20]`
8	Close modal	`click`	`selector: ".modal-close"`
9	Click next page	`click`	`selector: "a.pagination-next"`
10	Type email into the field	`type`	`bbox + text: "user@example.com"`
11	Search for AI news	`type`	`bbox + text: "AI news"`
12	Fill search box	`type`	`bbox + text: "query"`
13	Type password	`type`	`bbox + text: "********"`
14	Enter username	`type`	`bbox + text: "admin"`
15	Type message in chat	`type`	`selector + text: "Hello!"`
16	Enter coupon code	`type`	`bbox + text: "SAVE20"`
17	Type username into the username field	`type`	`xpath + text: "testuser"`
18	Type email into the email field	`type`	`xpath + text: "user@example.com"`
19	Navigate to settings	`navigate`	`url: "/settings"`
20	Go to dashboard	`navigate`	`url: "/dashboard"`
21	Open profile	`navigate`	`url: "/profile"`
22	Go to home page	`navigate`	`url: "https://example.com"`
23	Scroll down	`scroll`	`direction: "down"`
24	Scroll up	`scroll`	`direction: "up"`
25	Scroll down the page	`scroll`	`direction: "down"`
26	Wait for results to load	`wait`	(no parameters)
27	Stop	`done`	(no parameters)
28	Finish	`done`	(no parameters)

Each example is formatted as a text prompt:

### Human: Click the login button
### Assistant: <action>{"action":"click","bbox":[450,380,120,40]}</action>
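A small helper can render instruction-action pairs into this template. The sketch below is an assumption about how such a formatter could look; `format_example` and `build_prompt` are hypothetical names, and the compact JSON separators are chosen to match the example above.

```python
import json


def build_prompt(instruction: str) -> str:
    """Inference-time prompt: the template up to the assistant turn."""
    return f"### Human: {instruction}\n### Assistant:"


def format_example(instruction: str, action: dict) -> str:
    """Training text: prompt plus the action serialized inside <action> tags."""
    payload = json.dumps(action, separators=(",", ":"))  # compact, no spaces
    return f"{build_prompt(instruction)} <action>{payload}</action>"


print(format_example("Click the login button",
                     {"action": "click", "bbox": [450, 380, 120, 40]}))
```

Sharing `build_prompt` between training and inference keeps the two templates from drifting apart.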

Quantization

The base model is loaded in 4-bit NormalFloat4 (NF4) precision using BitsAndBytesConfig:

BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

This reduces the base model memory footprint from approximately 14 GB (FP16) to approximately 4 GB (NF4), enabling training on consumer GPUs with 24 GB VRAM.
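The memory figures can be checked with back-of-the-envelope arithmetic. The sketch below counts raw weight storage only, using the total parameter count reported in this card. It ignores quantization constants, any layers left unquantized, activations, and optimizer state, so real usage will be higher.

```python
TOTAL_PARAMS = 7_620_662_784  # total parameter count reported in this card


def weight_bytes(params: int, bits_per_param: float) -> float:
    """Raw weight storage in bytes for a given precision."""
    return params * bits_per_param / 8


for label, bits in (("FP16", 16), ("NF4", 4)):
    size = weight_bytes(TOTAL_PARAMS, bits)
    print(f"{label}: {size / 1e9:.1f} GB ({size / 2**30:.1f} GiB)")
# FP16: 15.2 GB (14.2 GiB)
# NF4: 3.8 GB (3.5 GiB)
```

The card's "approximately 14 GB" corresponds to the binary (GiB) figure for FP16, and "approximately 4 GB" to the decimal figure for NF4.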

LoRA Configuration

Parameter	Value
Rank (`r`)	16
Alpha (`lora_alpha`)	32
Dropout	0.05
Target modules	`q_proj`, `v_proj`
Bias	`none`
Task type	`CAUSAL_LM`

Trainable parameters: 5,046,272 out of 7,620,662,784 total (0.0662%).
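The trainable-parameter count can be reproduced by hand. The sketch below assumes the published Qwen2.5-7B dimensions (hidden size 3584, 28 layers, 4 key/value heads with head dimension 128), which are not stated in this card. LoRA adds a rank-by-input matrix and an output-by-rank matrix beside each targeted weight.

```python
# Assumed Qwen2.5-7B dimensions (taken from the base model's config, not this card).
HIDDEN_SIZE = 3584
NUM_LAYERS = 28
NUM_KV_HEADS = 4
HEAD_DIM = 128

RANK = 16                     # LoRA rank from the table above
TOTAL_PARAMS = 7_620_662_784  # total reported in this card


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    # A is (rank x in_features), B is (out_features x rank).
    return rank * (in_features + out_features)


q_proj = lora_params(HIDDEN_SIZE, HIDDEN_SIZE, RANK)              # 114,688 per layer
v_proj = lora_params(HIDDEN_SIZE, NUM_KV_HEADS * HEAD_DIM, RANK)  # 65,536 per layer
trainable = NUM_LAYERS * (q_proj + v_proj)

print(trainable)                                   # 5046272
print(f"{100 * trainable / TOTAL_PARAMS:.4f}%")    # 0.0662%
print(f"{trainable * 4 / 1e6:.1f} MB if stored as float32")  # 20.2 MB
```

The last line suggests the roughly 20 MB adapter file is consistent with float32 storage, though the storage dtype is an assumption here.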

Training Results

Training was conducted on an NVIDIA GeForce RTX 4090 (25.3 GB VRAM) with CUDA, PyTorch 2.6.0, and Hugging Face Transformers.

Step	Loss	Grad Norm	Learning Rate	Epoch
5	15.34	10.65	1.957e-04	0.36
10	12.06	26.03	1.886e-04	0.71
15	5.399	23.31	1.814e-04	1.07
20	0.5386	2.356	1.743e-04	1.43
25	0.2402	0.6346	1.671e-04	1.79
30	0.1843	0.5288	1.600e-04	2.14
35	0.1319	0.3665	1.529e-04	2.50
40	0.09393	0.3279	1.457e-04	2.86
45	0.07736	0.2292	1.386e-04	3.21
50	0.07643	0.3647	1.314e-04	3.57
55	0.06076	0.3630	1.243e-04	3.93
60	0.06466	0.3370	1.171e-04	4.29
65	0.05192	0.4162	1.100e-04	4.64
70	0.05431	0.4836	1.029e-04	5.00
75	0.04319	0.2446	9.571e-05	5.36
80	0.04658	0.4294	8.857e-05	5.71
85	0.05086	0.2943	8.143e-05	6.07
90	0.04453	0.2923	7.429e-05	6.43
95	0.04564	0.4350	6.714e-05	6.79
100	0.03816	0.1997	6.000e-05	7.14
105	0.03836	0.4261	5.286e-05	7.50
110	0.04136	0.3450	4.571e-05	7.86
115	0.03368	0.2899	3.857e-05	8.21
120	0.03895	0.5276	3.143e-05	8.57
125	0.03497	0.3903	2.429e-05	8.93
130	0.03757	0.3689	1.714e-05	9.29
135	0.03311	0.4284	1.000e-05	9.64
140	0.03383	0.3776	2.857e-06	10.00

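Final training loss: 0.033. The model reproduces the structured actions for the 28 training examples with high confidence; because no held-out set is reported, this figure measures fit to the training data rather than generalization.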

Training throughput: 1.76 steps/second, 3.52 samples/second, 79.49 seconds total for 140 steps (28 examples x 10 epochs / 2 batch size).
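The step count and throughput figures follow directly from the dataset size, epochs, batch size, and wall-clock time. The short check below assumes no gradient accumulation, which matches the 14 steps per epoch implied by the Epoch column in the table above.

```python
examples, epochs, batch_size = 28, 10, 2
seconds = 79.49  # total training time reported above

steps_per_epoch = examples // batch_size   # 14
total_steps = steps_per_epoch * epochs     # 140

print(total_steps)                                 # 140
print(round(total_steps / seconds, 2))             # 1.76 steps/second
print(round(examples * epochs / seconds, 2))       # 3.52 samples/second
print(round(5 / steps_per_epoch, 2))               # 0.36, the epoch logged at step 5
```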

Full Project Structure

The complete agent framework is available on GitHub at ZAID646/qwen2.5-vl-7b-playwright-desktop-lora.

qwen2.5-vl-7b-playwright-desktop-lora/
├── LICENSE                     # Apache 2.0
├── README.md                   # Full project documentation
├── CONTRIBUTING.md             # Contribution guidelines
├── pyproject.toml              # Project metadata and dependencies
├── requirements.txt            # Pip dependencies
├── setup.sh                    # Vast.ai environment setup
│
├── config/
│   ├── model.yaml              # Model selection, quantization, LoRA params
│   ├── sandbox.yaml            # Browser viewport, timeouts, concurrency
│   └── mock_scenarios.json     # Mock VLM scenario definitions
│
├── scripts/
│   ├── run_agent.py            # Single-task agent runner
│   ├── run_harness.py          # Full evaluation harness runner
│   └── train_lora.py           # QLoRA training script
│
├── src/
│   ├── agent/
│   │   ├── state.py            # AgentState, VisionOutput, StepRecord
│   │   ├── graph.py            # LangGraph state machine builder
│   │   ├── nodes.py            # PerceptionNode, ActionNode, RouterNode
│   │   └── prompts.py          # System prompt templates
│   │
│   ├── vision/
│   │   ├── model.py            # Model loader with quantization
│   │   ├── processor.py        # Screenshot preprocessing
│   │   ├── quant.py            # Quantization configuration
│   │   └── mock.py             # MockVLM for offline testing
│   │
│   ├── sandbox/
│   │   ├── browser.py          # Playwright BrowserManager singleton
│   │   ├── actions.py          # Atomic browser actions
│   │   └── recorder.py         # Screenshot + DOM capture
│   │
│   ├── memory/
│   │   ├── context.py          # ContextCompressor
│   │   └── history.py          # Step history summarizer
│   │
│   ├── harness/
│   │   ├── scenarios.py        # Benchmark scenario definitions
│   │   ├── runner.py           # Async scenario executor
│   │   └── metrics.py          # TCR, SER, TFI, SCRR computation
│   │
│   └── training/
│       ├── dataset.py          # UIExample dataclass
│       └── lora.py             # LoRA configuration builder
│
└── tests/
    ├── test_agent.py           # Agent graph and nodes tests
    ├── test_vision.py          # MockVLM and processor tests
    ├── test_harness.py         # Metrics computation tests
    └── test_memory.py          # Context compression tests

Dependencies

Core dependencies for loading and using this adapter:

Package	Minimum Version	Purpose
`torch`	2.4	GPU tensor operations
`transformers`	4.44	Model loading, tokenizer, Trainer API
`accelerate`	0.33	Multi-device model sharding
`bitsandbytes`	0.43	4-bit quantization (NF4)
`peft`	0.12	LoRA adapter configuration
`sentencepiece`	(latest)	Tokenization support

Optional dependencies for the full agent framework:

Package	Purpose
`langgraph`	Agent state graph (state machine)
`langchain-core`	LangChain integration
`playwright`	Browser automation sandbox
`datasets`	Dataset loading and mapping
`pyyaml`	YAML configuration parsing
`pillow`	Image processing
`huggingface_hub`	Hub model push/download

Repository Contents

File	Size	Description
`adapter_model.safetensors`	20.2 MB	Trained LoRA adapter weights (q_proj, v_proj)
`adapter_config.json`	1 KB	LoRA hyperparameters (r=16, alpha=32, dropout=0.05)
`tokenizer.json`	11.4 MB	Qwen2.5 tokenizer
`tokenizer_config.json`	691 B	Tokenizer configuration
`chat_template.jinja`	5 KB	Jinja chat template for Qwen2.5
`README.md`	This file	Hub model card
`data.json`	5 KB	Training examples used for fine-tuning

License

This adapter is released under the Apache License 2.0. See the LICENSE file for the full text.

The base model Qwen/Qwen2.5-7B-Instruct is governed by its own license (Apache License 2.0 at the time of writing; see the base model card for the current terms).

Citation

If you use this adapter in your research or work, please cite:

@software{multimodal_vision_agent_lora,
  author = {Zaid},
  title = {Multimodal Vision Agent -- LoRA Adapter for Desktop UI Automation},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/zaid646/multimodal-vision-agent-lora}
}

Built with Hugging Face Transformers, PEFT, bitsandbytes, LangGraph, and Playwright.
