Multimodal Vision Agent: LoRA Adapter

QLoRA fine-tuned adapter for Qwen2.5-7B-Instruct that converts natural language desktop UI instructions into structured browser automation actions


Model Details

This adapter fine-tunes Qwen2.5-7B-Instruct using QLoRA (4-bit NF4 quantization) to produce structured UI actions (click, type, navigate, scroll, wait, done) from natural language instructions. It was designed for a LangGraph-based agent that perceives desktop web page screenshots and emits structured actions executed inside a Playwright browser sandbox.

Why Qwen2.5-7B-Instruct?

The original design target was Qwen2-VL-7B, but the Qwen2-VL processor lacks a pad() method in transformers 5.x, causing data collator failures during training. Qwen2.5-7B-Instruct offers the same model scale (7B parameters) with a mature, well-supported tokenizer, which made it the more practical choice for text-instruction-based UI action prediction.

Architecture Overview

The agent framework operates as a LangGraph state machine with three nodes:

  1. Perception Node: Captures a browser screenshot + DOM snapshot, compresses action history, and feeds everything to the VLM.
  2. Action Node: Executes the predicted action in the Playwright browser sandbox (click, type, navigate, scroll, wait).
  3. Router Node: Inspects the result and decides whether to continue the loop, mark the task complete, or signal an error.

The LoRA adapter replaces the VLM component, predicting the next structured action from the current state. The full framework is available on GitHub.
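The three-node loop described above can be sketched in plain Python. This is a minimal illustration only, not the repository's LangGraph implementation: the names LoopState, run_agent, predict, and execute are hypothetical, and the max_steps guard is an assumption added so the sketch always terminates.

```python
from dataclasses import dataclass, field


@dataclass
class LoopState:
    """Hypothetical loop state; the real framework defines its own state type."""
    instruction: str
    history: list = field(default_factory=list)  # actions executed so far
    status: str = "running"                      # "running", "done", or "error"


def perception_node(state, predict):
    # A real agent would also capture a screenshot and DOM snapshot here.
    return predict(state.instruction, state.history)


def action_node(state, action, execute):
    # "done" needs no browser call; every other action is passed to `execute`.
    ok = True if action.get("action") == "done" else execute(action)
    state.history.append(action)
    return ok


def router_node(state, action, ok, max_steps):
    # Decide whether to continue, finish, or signal an error.
    if not ok:
        state.status = "error"
    elif action.get("action") == "done":
        state.status = "done"
    elif len(state.history) >= max_steps:
        state.status = "error"  # assumed safety limit, not from the source
    return state.status


def run_agent(instruction, predict, execute, max_steps=10):
    state = LoopState(instruction)
    while state.status == "running":
        action = perception_node(state, predict)
        ok = action_node(state, action, execute)
        router_node(state, action, ok, max_steps)
    return state
```

Here predict stands in for the adapter (for example, a wrapper around the predict_action function from the Quick Start) and execute stands in for the Playwright sandbox; both are passed in so the loop can be exercised without a GPU or a browser.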

Model Card

| Property | Value |
|---|---|
| Base Model | Qwen/Qwen2.5-7B-Instruct |
| Adapter Architecture | LoRA (Low-Rank Adaptation) |
| Adapter Size | ~20 MB (adapter weights only; the base model is loaded separately in 4-bit NF4) |
| Quantization | bitsandbytes NF4, double quantization, float16 compute dtype |
| LoRA Hyperparameters | r=16, lora_alpha=32, dropout=0.05 |
| Target Modules | q_proj, v_proj |
| Training Data | 28 instruction-action pairs |
| Training Epochs | 10 |
| Optimizer | AdamW (peak learning rate 2e-4) |
| Final Loss | 0.033 |
| Hardware | NVIDIA GeForce RTX 4090 (25.3 GB VRAM) |
| Training Time | ~79 seconds |
| Framework | Hugging Face Transformers + PEFT + bitsandbytes |

Supported Actions

The model outputs structured JSON inside <action> tags. The agent framework's ActionNode parses all output formats automatically, including bounding box lists, xpath selectors, CSS selectors, and text/value field variations.

| Action | Description | Input Fields | Example Output (v2) |
|---|---|---|---|
| click | Click a UI element | bbox [x, y, w, h], or selector (CSS), or xpath | {"action":"click","selector":"a[href='/signup']"} |
| type | Type text into an input field | bbox + text, or selector + text, or xpath + text | {"action":"type","xpath":"//input[@name='email']","text":"user@example.com"} |
| navigate | Navigate to a URL (absolute or relative) | url | {"action":"navigate","url":"/settings"} |
| scroll | Scroll the page up or down | direction ("up" or "down") | {"action":"scroll","direction":"down"} |
| wait | Pause execution briefly | (none) | {"action":"wait"} |
| done | Signal task completion | (none) | {"action":"done"} |
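The field requirements in the table above can be checked before an action reaches the browser. The following is a minimal sketch, assuming the canonical field names shown in the table; validate_action is a hypothetical helper, not a function from the agent framework.

```python
VALID_ACTIONS = {"click", "type", "navigate", "scroll", "wait", "done"}
TARGET_FIELDS = ("bbox", "selector", "xpath")  # any one of these locates an element


def validate_action(action: dict) -> list:
    """Return a list of problems; an empty list means the action looks well-formed."""
    kind = action.get("action")
    if kind not in VALID_ACTIONS:
        return [f"unknown action: {kind!r}"]

    problems = []
    # click and type both need some way to locate the element.
    if kind in ("click", "type") and not any(k in action for k in TARGET_FIELDS):
        problems.append(f"{kind} needs one of {TARGET_FIELDS}")
    if kind == "type" and "text" not in action:
        problems.append("type needs a 'text' field")
    if kind == "navigate" and not isinstance(action.get("url"), str):
        problems.append("navigate needs a string 'url' field")
    if kind == "scroll" and action.get("direction") not in ("up", "down"):
        problems.append("scroll needs 'direction' set to 'up' or 'down'")
    return problems
```

Returning a list of problems instead of raising lets a caller decide whether to retry the prediction, fall back to another target, or stop.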

Output Format Details

The model can produce bounding boxes in two formats:

  • List format (most common): "bbox": [x, y, width, height]
  • Object format: "bbox": {"x": ..., "y": ..., "width": ..., "height": ...}

The model also supports element targeting via:

  • XPath selectors: "xpath": "//input[@name='username']"
  • CSS selectors: "selector": "a[href='/signup']" or "selector": "#login_field"

Quick Start

Installation

pip install torch transformers peft bitsandbytes accelerate sentencepiece

Inference

import torch
import json
import re
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

# --- Step 1: Configure 4-bit quantization ---
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

# --- Step 2: Load base model with quantization ---
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.float16,
    trust_remote_code=True,
)

# --- Step 3: Load LoRA adapter ---
model = PeftModel.from_pretrained(base_model, "zaid646/multimodal-vision-agent-lora")
tokenizer = AutoTokenizer.from_pretrained("zaid646/multimodal-vision-agent-lora")


# --- Step 4: Define prediction function ---
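def predict_action(instruction: str) -> dict:
    """Predict one structured UI action for a natural-language instruction."""
    # Same prompt template as the training examples.
    prompt = f"### Human: {instruction}\n### Assistant:"
    # With device_map="auto", send inputs to wherever the model was placed.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=80,
            temperature=0.1,
            do_sample=True,
        )
    # Decode only the newly generated tokens, not the prompt.
    response = tokenizer.decode(
        outputs[0][inputs.input_ids.shape[1]:],
        skip_special_tokens=True,
    ).strip()
    print(f"Raw model output: {response}")
    match = re.search(r"<action>(.*?)</action>", response, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(1))
        except json.JSONDecodeError:
            pass  # malformed JSON inside the tags; fall through to the default
    # No parseable action: fall back to "done" so the agent loop stops.
    return {"action": "done"}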


# --- Step 5: Test with various instructions ---
print(predict_action("Click the login button"))
# Expected: {'action': 'click', 'bbox': [450, 380, 120, 40]}

print(predict_action("Type email into the email field"))
# Expected: {'action': 'type', 'xpath': '//input[@name="email"]', 'text': 'user@example.com'}

print(predict_action("Navigate to settings"))
# Expected: {'action': 'navigate', 'url': '/settings'}

print(predict_action("Scroll down the page"))
# Expected: {'action': 'scroll', 'direction': 'down'}

print(predict_action("Stop"))
# Expected: {'action': 'done'}

Full Agent Integration

For the complete agent loop with Playwright browser sandbox, LangGraph state machine, and evaluation harness, clone the GitHub repository:

git clone https://github.com/ZAID646/qwen2.5-vl-7b-playwright-desktop-lora.git
cd qwen2.5-vl-7b-playwright-desktop-lora

# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install --upgrade pip
pip install torch --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt

# Install Playwright browsers
playwright install chromium
playwright install-deps chromium

# Run all unit tests (no GPU required for MockVLM mode)
pytest -v

Real-World Test Results

The v2 adapter was tested against 4 real-world scenarios on actual websites using Playwright in headless Chromium mode on an RTX 4090. Each test captured before/after screenshots.

Test 1: GitHub Login

The model was instructed to fill the username and password fields on the GitHub login page (https://github.com/login).

| Stage | Description | Result |
|---|---|---|
| Instruction 1 | "Type username into the username field" | Model predicted xpath: //input[@name='username'], filled field |
| Instruction 2 | "Type password into the password field" | Model predicted xpath: //input[@name='password'], filled field |
| Verification | page.input_value("#login_field") and #password | Both fields verified non-empty |

Model output format: The v2 adapter produces semantic XPath selectors (//input[@name='username']) instead of brittle raw paths seen in v1 (/html/body/div/div/form/div[1]/input).

Test 2: HTTPBin Form

The model was instructed to fill name and email fields on https://httpbin.org/forms/post.

| Stage | Description | Result |
|---|---|---|
| Instruction 1 | "Type name into the name field" | Model predicted bbox: [200, 200, 300, 40], filled field |
| Instruction 2 | "Type email into the email field" | Model predicted xpath: //input[@name='email'], filled field |
| Verification | input[name='custname'] and input[name='custemail'] | Both fields verified non-empty |

Test 3: Scroll

The model was instructed to scroll down on a long GitHub README page.

| Stage | Description | Result |
|---|---|---|
| Before | window.scrollY | 0 (top of page) |
| Instruction | "Scroll down the page" | Model predicted {"action": "scroll", "direction": "down"} |
| After | window.scrollY | 500 (scrolled 500 pixels down) |

Test 4: Click Link

The model was instructed to click a link on https://example.com.

| Stage | Description | Result |
|---|---|---|
| Before | Page URL | https://example.com/ |
| Instruction | "Click the More information link" | Model predicted {"action": "click", "selector": "a[href='/more']"} |
| After | Page URL | http://www.iana.org/help/example-domains |

The model correctly identified the action type as click and attempted a CSS selector. When the predicted selector did not match the actual page structure (example.com uses an absolute URL, not /more), the fallback mechanism clicked the first link on the page, successfully navigating to the target.
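A fallback of this kind might look like the sketch below. It is an illustration under stated assumptions, not the framework's ActionNode code: click_with_fallback is a hypothetical name, and page is assumed to be a Playwright sync API Page (only locator(), count(), first, and click() are used).

```python
def click_with_fallback(page, selector: str) -> str:
    """Click `selector` if it matches an element; otherwise click the first link.

    Returns a short label describing which path was taken.
    """
    target = page.locator(selector)
    if target.count() > 0:
        target.first.click()
        return "selector"

    # The predicted selector matched nothing: fall back to the first <a> element.
    links = page.locator("a")
    if links.count() > 0:
        links.first.click()
        return "fallback:first-link"

    return "no-target"
```

Clicking the first link works on a single-link page such as example.com, but on pages with many links it can select an unrelated element, so the returned label is worth logging alongside the result.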


v2 Improvements (vs v1)

| Area | v1 | v2 |
|---|---|---|
| Training Data Size | 15 examples | 28 examples (87% increase) |
| Output Formats | bbox only | bbox + xpath + CSS selector |
| XPath Quality | Raw paths (/html/body/.../input) | Semantic (//input[@name='username']) |
| Click Targeting | bbox only | bbox + CSS selectors |
| Action Coverage | click, type, navigate, scroll | click, type, navigate, scroll, wait, done |
| Scroll Directions | down only | up and down |
| Browser Detection | None (blocked by sites like HN) | User-agent spoof + navigator.webdriver override |
| Agent Robustness | Single format, crashes on unexpected output | Graceful fallbacks for all formats |
| Final Training Loss | 0.056 | 0.033 |

Key Behavioral Changes

  1. Semantic XPath Output: v1 produced rigid paths like /html/body/div/div/form/div[1]/input that break on any DOM change. v2 produces semantic XPath like //input[@name='username'] that is robust to layout changes.

  2. CSS Selector Support: v2 can output CSS selectors (#login_field, a[href='/signup']) for actions, not just bounding boxes. This enables more precise element targeting.

  3. Browser Stealth: The Playwright BrowserManager now launches Chromium with --disable-blink-features=AutomationControlled and registers an init script (addInitScript) that removes the navigator.webdriver property. This makes headless automation harder to detect on sites such as Hacker News and behind services such as Cloudflare.

  4. ActionNode Robustness: The agent's ActionNode now handles all output formats: bbox as list [x, y, w, h] or object {x, y, width, height}, xpath string, CSS selector string, text/value field variations, and scroll_direction/direction field name variations.
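The browser stealth setup in item 3 can be approximated with the Playwright sync API as follows. This is a hedged sketch, not the BrowserManager source: the function name, the user-agent string, and the exact init script are placeholders, and the script here overrides the navigator.webdriver getter to return undefined instead of deleting the property.

```python
# Launch flag named in the text above.
STEALTH_ARGS = ["--disable-blink-features=AutomationControlled"]

# Assumed init script: make navigator.webdriver read as undefined.
HIDE_WEBDRIVER_JS = (
    "Object.defineProperty(navigator, 'webdriver', { get: () => undefined });"
)

# Placeholder desktop user agent; substitute one that matches your Chromium build.
USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
)


def read_webdriver_flag(url: str):
    """Open `url` in headless Chromium and return the page's navigator.webdriver value."""
    # Imported lazily so the constants above can be inspected without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True, args=STEALTH_ARGS)
        context = browser.new_context(user_agent=USER_AGENT)
        # Runs before any page script, in every page of this context.
        context.add_init_script(HIDE_WEBDRIVER_JS)
        page = context.new_page()
        page.goto(url)
        flag = page.evaluate("navigator.webdriver")
        browser.close()
        return flag
```

Calling read_webdriver_flag on any reachable page gives a quick check that the override took effect before running a full scenario.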


Training Details

Dataset

The training dataset consists of 28 instruction-output pairs covering all 6 supported actions with diverse output formats:

| # | Instruction | Action | Output Format |
|---|---|---|---|
| 1 | Click the login button | click | bbox: [450, 380, 120, 40] |
| 2 | Click submit | click | bbox: [500, 600, 100, 40] |
| 3 | Click first result | click | bbox: [100, 250, 800, 60] |
| 4 | Click the sign up link | click | selector: "a[href='/signup']" |
| 5 | Select dropdown | click | bbox: [300, 400, 200, 40] |
| 6 | Submit form | click | bbox: [450, 700, 120, 40] |
| 7 | Check checkbox | click | bbox: [350, 500, 20, 20] |
| 8 | Close modal | click | selector: ".modal-close" |
| 9 | Click next page | click | selector: "a.pagination-next" |
| 10 | Type email into the field | type | bbox + text: "user@example.com" |
| 11 | Search for AI news | type | bbox + text: "AI news" |
| 12 | Fill search box | type | bbox + text: "query" |
| 13 | Type password | type | bbox + text: "********" |
| 14 | Enter username | type | bbox + text: "admin" |
| 15 | Type message in chat | type | selector + text: "Hello!" |
| 16 | Enter coupon code | type | bbox + text: "SAVE20" |
| 17 | Type username into the username field | type | xpath + text: "testuser" |
| 18 | Type email into the email field | type | xpath + text: "user@example.com" |
| 19 | Navigate to settings | navigate | url: "/settings" |
| 20 | Go to dashboard | navigate | url: "/dashboard" |
| 21 | Open profile | navigate | url: "/profile" |
| 22 | Go to home page | navigate | url: "https://example.com" |
| 23 | Scroll down | scroll | direction: "down" |
| 24 | Scroll up | scroll | direction: "up" |
| 25 | Scroll down the page | scroll | direction: "down" |
| 26 | Wait for results to load | wait | (no parameters) |
| 27 | Stop | done | (no parameters) |
| 28 | Finish | done | (no parameters) |

Each example is formatted as a text prompt:

### Human: Click the login button
### Assistant: <action>{"action":"click","bbox":[450,380,120,40]}</action>
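A small helper can produce this text from an instruction-action pair. The sketch below assumes compact JSON (no spaces after separators), as in the example above; format_example and build_prompt are hypothetical names, not functions from the training script.

```python
import json


def format_example(instruction: str, action: dict) -> str:
    """Render one training example in the '### Human / ### Assistant' template."""
    # Compact separators reproduce the spacing-free JSON shown in the example.
    payload = json.dumps(action, separators=(",", ":"))
    return f"### Human: {instruction}\n### Assistant: <action>{payload}</action>"


def build_prompt(instruction: str) -> str:
    """Inference-time prompt: the same template, stopped before the answer."""
    return f"### Human: {instruction}\n### Assistant:"


print(format_example("Click the login button",
                     {"action": "click", "bbox": [450, 380, 120, 40]}))
```

Keeping the training and inference templates in one place avoids a prompt mismatch between the two.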

Quantization

The base model is loaded in 4-bit NormalFloat4 (NF4) precision using BitsAndBytesConfig:

BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

This reduces the base model memory footprint from approximately 14 GB (FP16) to approximately 4 GB (NF4), enabling training on consumer GPUs with 24 GB VRAM.

LoRA Configuration

| Parameter | Value |
|---|---|
| Rank (r) | 16 |
| Alpha (lora_alpha) | 32 |
| Dropout | 0.05 |
| Target modules | q_proj, v_proj |
| Bias | none |
| Task type | CAUSAL_LM |

Trainable parameters: 5,046,272 out of 7,620,662,784 total (0.0662%).
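The trainable-parameter count can be reproduced by hand. The sketch below assumes the Qwen2.5-7B dimensions (hidden size 3584, 28 decoder layers, 4 key/value heads with head dimension 128, so v_proj maps 3584 to 512); these values are not stated in this card and should be checked against the base model's config.json. The size estimate further assumes the adapter is stored in float32.

```python
HIDDEN = 3584        # assumed hidden size of Qwen2.5-7B
NUM_LAYERS = 28      # assumed number of decoder layers
KV_DIM = 4 * 128     # assumed v_proj output: 4 KV heads x head dim 128
RANK = 16            # LoRA rank from the configuration table


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """LoRA adds A (rank x in_features) and B (out_features x rank) per target module."""
    return rank * (in_features + out_features)


per_layer = lora_params(HIDDEN, HIDDEN, RANK) + lora_params(HIDDEN, KV_DIM, RANK)
trainable = NUM_LAYERS * per_layer
total = 7_620_662_784  # total reported above (base model plus adapter)

print(trainable)                             # 5046272
print(f"{100 * trainable / total:.4f}%")     # 0.0662%
print(f"{trainable * 4 / 1e6:.1f} MB")       # 20.2 MB if stored as float32
```

Under these assumptions the result matches both the reported parameter count and the 20.2 MB size of adapter_model.safetensors.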

Training Results

Training was conducted on an NVIDIA GeForce RTX 4090 (25.3 GB VRAM) with CUDA, PyTorch 2.6.0, and Hugging Face Transformers.

| Step | Loss | Grad Norm | Learning Rate | Epoch |
|---|---|---|---|---|
| 5 | 15.34 | 10.65 | 1.957e-04 | 0.36 |
| 10 | 12.06 | 26.03 | 1.886e-04 | 0.71 |
| 15 | 5.399 | 23.31 | 1.814e-04 | 1.07 |
| 20 | 0.5386 | 2.356 | 1.743e-04 | 1.43 |
| 25 | 0.2402 | 0.6346 | 1.671e-04 | 1.79 |
| 30 | 0.1843 | 0.5288 | 1.600e-04 | 2.14 |
| 35 | 0.1319 | 0.3665 | 1.529e-04 | 2.50 |
| 40 | 0.09393 | 0.3279 | 1.457e-04 | 2.86 |
| 45 | 0.07736 | 0.2292 | 1.386e-04 | 3.21 |
| 50 | 0.07643 | 0.3647 | 1.314e-04 | 3.57 |
| 55 | 0.06076 | 0.3630 | 1.243e-04 | 3.93 |
| 60 | 0.06466 | 0.3370 | 1.171e-04 | 4.29 |
| 65 | 0.05192 | 0.4162 | 1.100e-04 | 4.64 |
| 70 | 0.05431 | 0.4836 | 1.029e-04 | 5.00 |
| 75 | 0.04319 | 0.2446 | 9.571e-05 | 5.36 |
| 80 | 0.04658 | 0.4294 | 8.857e-05 | 5.71 |
| 85 | 0.05086 | 0.2943 | 8.143e-05 | 6.07 |
| 90 | 0.04453 | 0.2923 | 7.429e-05 | 6.43 |
| 95 | 0.04564 | 0.4350 | 6.714e-05 | 6.79 |
| 100 | 0.03816 | 0.1997 | 6.000e-05 | 7.14 |
| 105 | 0.03836 | 0.4261 | 5.286e-05 | 7.50 |
| 110 | 0.04136 | 0.3450 | 4.571e-05 | 7.86 |
| 115 | 0.03368 | 0.2899 | 3.857e-05 | 8.21 |
| 120 | 0.03895 | 0.5276 | 3.143e-05 | 8.57 |
| 125 | 0.03497 | 0.3903 | 2.429e-05 | 8.93 |
| 130 | 0.03757 | 0.3689 | 1.714e-05 | 9.29 |
| 135 | 0.03311 | 0.4284 | 1.000e-05 | 9.64 |
| 140 | 0.03383 | 0.3776 | 2.857e-06 | 10.00 |

Final training loss: 0.033. Over the last six logged steps the loss stays between 0.033 and 0.039, which indicates that the model reproduces the structured actions for the 28 training examples closely.

Training throughput: 1.76 steps/second, 3.52 samples/second, 79.49 seconds total for 140 steps (28 examples x 10 epochs / 2 batch size).
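The step count and throughput figures follow directly from the dataset size, epoch count, and batch size. A short check, using only numbers reported in this section (the variable names are illustrative):

```python
examples, epochs, batch_size = 28, 10, 2
seconds = 79.49  # total training time reported above

steps_per_epoch = examples // batch_size   # 14 optimizer steps per epoch
total_steps = steps_per_epoch * epochs     # 140 steps in total
steps_per_second = total_steps / seconds
samples_per_second = examples * epochs / seconds

print(total_steps)                         # 140
print(round(steps_per_second, 2))          # 1.76
print(round(samples_per_second, 2))        # 3.52
print(round(5 / steps_per_epoch, 2))       # 0.36, the epoch value logged at step 5
```

The same steps_per_epoch value explains the Epoch column of the training log, which advances by 5/14 (about 0.36) between logged steps.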


Full Project Structure

The complete agent framework is available on GitHub at ZAID646/qwen2.5-vl-7b-playwright-desktop-lora.

qwen2.5-vl-7b-playwright-desktop-lora/
├── LICENSE                     # Apache 2.0
├── README.md                   # Full project documentation
├── CONTRIBUTING.md             # Contribution guidelines
├── pyproject.toml              # Project metadata and dependencies
├── requirements.txt            # Pip dependencies
├── setup.sh                    # Vast.ai environment setup
│
├── config/
│   ├── model.yaml              # Model selection, quantization, LoRA params
│   ├── sandbox.yaml            # Browser viewport, timeouts, concurrency
│   └── mock_scenarios.json     # Mock VLM scenario definitions
│
├── scripts/
│   ├── run_agent.py            # Single-task agent runner
│   ├── run_harness.py          # Full evaluation harness runner
│   └── train_lora.py           # QLoRA training script
│
├── src/
│   ├── agent/
│   │   ├── state.py            # AgentState, VisionOutput, StepRecord
│   │   ├── graph.py            # LangGraph state machine builder
│   │   ├── nodes.py            # PerceptionNode, ActionNode, RouterNode
│   │   └── prompts.py          # System prompt templates
│   │
│   ├── vision/
│   │   ├── model.py            # Model loader with quantization
│   │   ├── processor.py        # Screenshot preprocessing
│   │   ├── quant.py            # Quantization configuration
│   │   └── mock.py             # MockVLM for offline testing
│   │
│   ├── sandbox/
│   │   ├── browser.py          # Playwright BrowserManager singleton
│   │   ├── actions.py          # Atomic browser actions
│   │   └── recorder.py         # Screenshot + DOM capture
│   │
│   ├── memory/
│   │   ├── context.py          # ContextCompressor
│   │   └── history.py          # Step history summarizer
│   │
│   ├── harness/
│   │   ├── scenarios.py        # Benchmark scenario definitions
│   │   ├── runner.py           # Async scenario executor
│   │   └── metrics.py          # TCR, SER, TFI, SCRR computation
│   │
│   └── training/
│       ├── dataset.py          # UIExample dataclass
│       └── lora.py             # LoRA configuration builder
│
└── tests/
    ├── test_agent.py           # Agent graph and nodes tests
    ├── test_vision.py          # MockVLM and processor tests
    ├── test_harness.py         # Metrics computation tests
    └── test_memory.py          # Context compression tests

Dependencies

Core dependencies for loading and using this adapter:

| Package | Minimum Version | Purpose |
|---|---|---|
| torch | 2.4 | GPU tensor operations |
| transformers | 4.44 | Model loading, tokenizer, Trainer API |
| accelerate | 0.33 | Multi-device model sharding |
| bitsandbytes | 0.43 | 4-bit quantization (NF4) |
| peft | 0.12 | LoRA adapter configuration |
| sentencepiece | (latest) | Tokenizer support |

Optional dependencies for the full agent framework:

| Package | Purpose |
|---|---|
| langgraph | Agent state machine (state graph) |
| langchain-core | LangChain integration |
| playwright | Browser automation sandbox |
| datasets | Dataset loading and mapping |
| pyyaml | YAML configuration parsing |
| pillow | Image processing |
| huggingface_hub | Hub model push/download |

Repository Contents

| File | Size | Description |
|---|---|---|
| adapter_model.safetensors | 20.2 MB | Trained LoRA adapter weights (q_proj, v_proj) |
| adapter_config.json | 1 KB | LoRA hyperparameters (r=16, alpha=32, dropout=0.05) |
| tokenizer.json | 11.4 MB | Qwen2.5 tokenizer |
| tokenizer_config.json | 691 B | Tokenizer configuration |
| chat_template.jinja | 5 KB | Jinja chat template for Qwen2.5 |
| README.md | This file | Hub model card |
| data.json | 5 KB | Training examples used for fine-tuning |

License

This adapter is released under the Apache License 2.0. See the LICENSE file for the full text.

The base model Qwen/Qwen2.5-7B-Instruct is governed by its own license. Its Hugging Face model card lists Apache 2.0 for the 7B variant; check that card for the current terms.


Citation

If you use this adapter in your research or work, please cite:

@software{multimodal_vision_agent_lora,
  author = {Zaid},
  title = {Multimodal Vision Agent -- LoRA Adapter for Desktop UI Automation},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/zaid646/multimodal-vision-agent-lora}
}

Built with Hugging Face Transformers, PEFT, bitsandbytes, LangGraph, and Playwright.