Multimodal Vision Agent: LoRA Adapter
QLoRA fine-tuned adapter for Qwen2.5-7B-Instruct that converts natural language desktop UI instructions into structured browser automation actions
Table of Contents
- Model Details
- Supported Actions
- Quick Start
- Real-World Test Results
- v2 Improvements (vs v1)
- Training Details
- Full Project Structure
- Dependencies
- License
- Citation
Model Details
This adapter fine-tunes Qwen2.5-7B-Instruct using QLoRA (4-bit NF4 quantization) to produce structured UI actions (click, type, navigate, scroll, wait, done) from natural language instructions. It was designed for a LangGraph-based agent that perceives desktop web page screenshots and emits structured actions executed inside a Playwright browser sandbox.
Why Qwen2.5-7B-Instruct?
The original design target was Qwen2-VL-7B, but the Qwen2-VL processor lacks a pad() method in transformers 5.x, causing data collator failures during training. Qwen2.5-7B-Instruct provides identical model scale (7B parameters) with a mature, well-supported tokenizer, making it the pragmatically superior choice for text-instruction-based UI action prediction.
Architecture Overview
The agent framework operates as a LangGraph state machine with three nodes:
- Perception Node: Captures a browser screenshot + DOM snapshot, compresses action history, and feeds everything to the VLM.
- Action Node: Executes the predicted action in the Playwright browser sandbox (click, type, navigate, scroll, wait).
- Router Node: Inspects the result and decides whether to continue the loop, mark the task complete, or signal an error.
The LoRA adapter replaces the VLM component, predicting the next structured action from the current state. The full framework is available on GitHub.
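The control flow of the three nodes can be sketched in plain Python, without LangGraph. This is only an illustration of the loop described above: `LoopState`, `run_loop`, the `predict` and `execute` callables, and the `max_steps` budget are hypothetical names, not the repository's actual API.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class LoopState:
    """Hypothetical, simplified state (not the repository's AgentState)."""
    instruction: str
    history: list = field(default_factory=list)
    status: str = "running"  # "running", "done", or "error"


def run_loop(
    state: LoopState,
    predict: Callable[[LoopState], dict],
    execute: Callable[[dict], None],
    max_steps: int = 10,
) -> LoopState:
    for _ in range(max_steps):
        # Perception: ask the model for the next structured action.
        action = predict(state)
        # Router: a "done" action ends the task without executing anything.
        if action.get("action") == "done":
            state.status = "done"
            return state
        # Action: run the step in the sandbox; any failure ends the loop.
        try:
            execute(action)
        except Exception as exc:
            state.history.append({"action": action, "error": str(exc)})
            state.status = "error"
            return state
        state.history.append({"action": action, "error": None})
    # Step budget exhausted without a "done" action.
    state.status = "error"
    return state
```

With a scripted `predict` that yields a scroll action followed by `done`, the loop records one step and finishes with status `"done"`.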
Model Card
| Property | Value |
|---|---|
| Base Model | Qwen/Qwen2.5-7B-Instruct |
| Adapter Architecture | LoRA (Low-Rank Adaptation) |
| Adapter Size | ~20 MB (adapter weights only; the base model is loaded separately in 4-bit NF4) |
| Quantization | bitsandbytes NF4 with double quantization, float16 compute dtype |
| LoRA Rank | r=16, lora_alpha=32, dropout=0.05 |
| Target Modules | q_proj, v_proj |
| Training Data | 28 instruction-action pairs |
| Training Epochs | 10 |
| Optimizer | AdamW (peak learning rate 2e-4) |
| Final Loss | 0.033 |
| Hardware | NVIDIA GeForce RTX 4090 (25.3 GB VRAM) |
| Training Time | ~79 seconds |
| Framework | Hugging Face Transformers + PEFT + bitsandbytes |
Supported Actions
The model outputs structured JSON inside <action> tags. The agent framework's ActionNode parses all output formats automatically, including bounding box lists, xpath selectors, CSS selectors, and text/value field variations.
| Action | Description | Input Fields | Example Output (v2) |
|---|---|---|---|
| click | Click a UI element | bbox [x, y, w, h], or selector (CSS), or xpath | {"action":"click","selector":"a[href='/signup']"} |
| type | Type text into an input field | bbox + text, or selector + text, or xpath + text | {"action":"type","xpath":"//input[@name='email']","text":"user@example.com"} |
| navigate | Navigate to a URL (absolute or relative) | url | {"action":"navigate","url":"/settings"} |
| scroll | Scroll the page up or down | direction ("up" or "down") | {"action":"scroll","direction":"down"} |
| wait | Pause execution briefly | (none) | {"action":"wait"} |
| done | Signal task completion | (none) | {"action":"done"} |
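As a rough illustration of how the `<action>` wrapper and the fields in the table above could be checked before execution, here is a minimal sketch. The `parse_action` function and its validation rules are assumptions made for this example; the repository's ActionNode may behave differently, and field-name variants such as `value` are deliberately not handled here.

```python
import json
import re

# Targeting fields accepted for click/type, per the table above.
TARGET_FIELDS = ("bbox", "selector", "xpath")
# Required fields for the remaining actions.
REQUIRED = {"navigate": ("url",), "scroll": ("direction",), "wait": (), "done": ()}


def parse_action(raw: str) -> dict:
    """Extract the JSON inside <action>...</action> and validate its fields."""
    match = re.search(r"<action>(.*?)</action>", raw, re.DOTALL)
    if match is None:
        raise ValueError("no <action>...</action> block found")
    action = json.loads(match.group(1))  # JSONDecodeError is a ValueError
    name = action.get("action")
    if name in ("click", "type"):
        if not any(key in action for key in TARGET_FIELDS):
            raise ValueError(f"{name} needs one of {TARGET_FIELDS}")
        if name == "type" and "text" not in action:
            raise ValueError("type needs a 'text' field")
    elif name in REQUIRED:
        missing = [key for key in REQUIRED[name] if key not in action]
        if missing:
            raise ValueError(f"{name} is missing {missing}")
    else:
        raise ValueError(f"unknown action: {name!r}")
    return action
```

Raising on malformed output, rather than silently defaulting to `done`, keeps parsing failures visible to whatever component decides how to recover.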
Output Format Details
The model can produce bounding boxes in two formats:
- List format (most common): "bbox": [x, y, width, height]
- Object format: "bbox": {"x": ..., "y": ..., "width": ..., "height": ...}
The model also supports element targeting via:
- XPath selectors: "xpath": "//input[@name='username']"
- CSS selectors: "selector": "a[href='/signup']" or "selector": "#login_field"
Quick Start
Installation
pip install torch transformers peft bitsandbytes accelerate sentencepiece
Inference
import torch
import json
import re
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
# --- Step 1: Configure 4-bit quantization ---
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)
# --- Step 2: Load base model with quantization ---
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.float16,
    trust_remote_code=True,
)
# --- Step 3: Load LoRA adapter ---
model = PeftModel.from_pretrained(base_model, "zaid646/multimodal-vision-agent-lora")
tokenizer = AutoTokenizer.from_pretrained("zaid646/multimodal-vision-agent-lora")
# --- Step 4: Define prediction function ---
def predict_action(instruction: str) -> dict:
    # Same "### Human / ### Assistant" prompt format used during training
    prompt = f"### Human: {instruction}\n### Assistant:"
    # Send inputs to the device the model was placed on instead of hard-coding "cuda"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=80,
            temperature=0.1,
            do_sample=True,
        )
    # Decode only the newly generated tokens, skipping the prompt
    response = tokenizer.decode(
        outputs[0][inputs.input_ids.shape[1]:],
        skip_special_tokens=True,
    ).strip()
    print(f"Raw model output: {response}")
    # Extract the JSON payload between <action> tags
    match = re.search(r"<action>(.*?)</action>", response, re.DOTALL)
    if match:
        return json.loads(match.group(1))
    # No <action> block found: treat the task as finished
    return {"action": "done"}
# --- Step 5: Test with various instructions ---
print(predict_action("Click the login button"))
# Expected: {'action': 'click', 'bbox': [450, 380, 120, 40]}
print(predict_action("Type email into the email field"))
# Expected: {'action': 'type', 'xpath': '//input[@name="email"]', 'text': 'user@example.com'}
print(predict_action("Navigate to settings"))
# Expected: {'action': 'navigate', 'url': '/settings'}
print(predict_action("Scroll down the page"))
# Expected: {'action': 'scroll', 'direction': 'down'}
print(predict_action("Stop"))
# Expected: {'action': 'done'}
Full Agent Integration
For the complete agent loop with Playwright browser sandbox, LangGraph state machine, and evaluation harness, clone the GitHub repository:
git clone https://github.com/ZAID646/qwen2.5-vl-7b-playwright-desktop-lora.git
cd qwen2.5-vl-7b-playwright-desktop-lora
# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate
# Install dependencies
pip install --upgrade pip
pip install torch --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt
# Install Playwright browsers
playwright install chromium
playwright install-deps chromium
# Run all unit tests (no GPU required for MockVLM mode)
pytest -v
Real-World Test Results
The v2 adapter was tested against 4 real-world scenarios on actual websites using Playwright in headless Chromium mode on an RTX 4090. Each test captured before/after screenshots.
Test 1: GitHub Login
The model was instructed to fill the username and password fields on the GitHub login page (https://github.com/login).
| Stage | Description | Result |
|---|---|---|
| Instruction 1 | "Type username into the username field" | Model predicted xpath: //input[@name='username'], filled field |
| Instruction 2 | "Type password into the password field" | Model predicted xpath: //input[@name='password'], filled field |
| Verification | page.input_value("#login_field") and #password | Both fields verified non-empty |
Model output format: The v2 adapter produces semantic XPath selectors (//input[@name='username']) instead of brittle raw paths seen in v1 (/html/body/div/div/form/div[1]/input).
Test 2: HTTPBin Form
The model was instructed to fill name and email fields on https://httpbin.org/forms/post.
| Stage | Description | Result |
|---|---|---|
| Instruction 1 | "Type name into the name field" | Model predicted bbox: [200, 200, 300, 40], filled field |
| Instruction 2 | "Type email into the email field" | Model predicted xpath: //input[@name='email'], filled field |
| Verification | input[name='custname'] and input[name='custemail'] | Both fields verified non-empty |
Test 3: Scroll
The model was instructed to scroll down on a long GitHub README page.
| Stage | Description | Result |
|---|---|---|
| Before | window.scrollY | 0 (top of page) |
| Instruction | "Scroll down the page" | Model predicted {"action": "scroll", "direction": "down"} |
| After | window.scrollY | 500 (scrolled 500 pixels down) |
Test 4: Click Link
The model was instructed to click a link on https://example.com.
| Stage | Description | Result |
|---|---|---|
| Before | Page URL | https://example.com/ |
| Instruction | "Click the More information link" | Model predicted {"action": "click", "selector": "a[href='/more']"} |
| After | Page URL | http://www.iana.org/help/example-domains |
The model correctly identified the action type as click and attempted a CSS selector. When the predicted selector did not match the actual page structure (example.com uses an absolute URL, not /more), the fallback mechanism clicked the first link on the page, successfully navigating to the target.
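The fallback behavior described above might look roughly like the following. This is a sketch under stated assumptions, not the repository's code: `click_with_fallback` is a hypothetical helper, and `page` is assumed to behave like a Playwright sync-API `Page` (only `page.locator`, `Locator.count`, `Locator.first`, and `Locator.click` are used).

```python
def click_with_fallback(page, selector: str) -> str:
    """Click `selector` if it matches; otherwise click the first link on the page.

    Returns a label describing which path was taken.
    """
    target = page.locator(selector)
    if target.count() > 0:
        target.first.click()
        return "selector"
    # Predicted selector matched nothing: fall back to the first <a> element.
    links = page.locator("a")
    if links.count() > 0:
        links.first.click()
        return "fallback_first_link"
    raise LookupError(f"no element matched {selector!r} and the page has no links")
```

Returning the path taken lets a caller log fallback clicks separately, since a fallback click can succeed without the prediction having been correct.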
v2 Improvements (vs v1)
| Area | v1 | v2 |
|---|---|---|
| Training Data Size | 15 examples | 28 examples (87% increase) |
| Output Formats | bbox only | bbox + xpath + CSS selector |
| XPath Quality | Raw paths (/html/body/.../input) | Semantic (//input[@name='username']) |
| Click Targeting | bbox only | bbox + CSS selectors |
| Action Coverage | click, type, navigate, scroll | click, type, navigate, scroll, wait, done |
| Scroll Directions | down only | up and down |
| Browser Detection | None (blocked by sites like HN) | User-agent spoof + navigator.webdriver override |
| Agent Robustness | Single format, crashes on unexpected output | Graceful fallbacks for all formats |
| Final Training Loss | 0.056 | 0.033 |
Key Behavioral Changes
- Semantic XPath Output: v1 produced rigid paths like /html/body/div/div/form/div[1]/input that break on any DOM change. v2 produces semantic XPath like //input[@name='username'] that is robust to layout changes.
- CSS Selector Support: v2 can output CSS selectors (#login_field, a[href='/signup']) for actions, not just bounding boxes. This enables more precise element targeting.
- Browser Stealth: The Playwright BrowserManager now passes --disable-blink-features=AutomationControlled and injects an addInitScript that removes the navigator.webdriver property. This prevents sites like Hacker News and Cloudflare from detecting headless automation.
- ActionNode Robustness: The agent's ActionNode now handles all output formats: bbox as list [x, y, w, h] or object {x, y, width, height}, xpath string, CSS selector string, text/value field variations, and scroll_direction/direction field name variations.
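One way to handle the format variations listed for ActionNode is to normalize every action into a single canonical shape before executing it. The sketch below is illustrative only: `normalize_action` and `bbox_center` are hypothetical helpers, and treating (x, y) as the top-left corner of the box is an assumption that this card does not confirm.

```python
def normalize_action(action: dict) -> dict:
    """Return a copy of `action` using canonical field names and a list-form bbox."""
    out = dict(action)
    bbox = out.get("bbox")
    if isinstance(bbox, dict):
        # Object format -> list format [x, y, width, height]
        out["bbox"] = [bbox["x"], bbox["y"], bbox["width"], bbox["height"]]
    if "text" not in out and "value" in out:
        out["text"] = out.pop("value")
    if "direction" not in out and "scroll_direction" in out:
        out["direction"] = out.pop("scroll_direction")
    return out


def bbox_center(bbox: list) -> tuple:
    """Center point of [x, y, width, height], assuming (x, y) is the top-left corner."""
    x, y, w, h = bbox
    return (x + w / 2, y + h / 2)
```

After normalization, downstream code only needs to handle `bbox` as a list, `text`, and `direction`.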
Training Details
Dataset
The training dataset consists of 28 instruction-output pairs covering all 6 supported actions with diverse output formats:
| # | Instruction | Action | Output Format |
|---|---|---|---|
| 1 | Click the login button | click | bbox: [450, 380, 120, 40] |
| 2 | Click submit | click | bbox: [500, 600, 100, 40] |
| 3 | Click first result | click | bbox: [100, 250, 800, 60] |
| 4 | Click the sign up link | click | selector: "a[href='/signup']" |
| 5 | Select dropdown | click | bbox: [300, 400, 200, 40] |
| 6 | Submit form | click | bbox: [450, 700, 120, 40] |
| 7 | Check checkbox | click | bbox: [350, 500, 20, 20] |
| 8 | Close modal | click | selector: ".modal-close" |
| 9 | Click next page | click | selector: "a.pagination-next" |
| 10 | Type email into the field | type | bbox + text: "user@example.com" |
| 11 | Search for AI news | type | bbox + text: "AI news" |
| 12 | Fill search box | type | bbox + text: "query" |
| 13 | Type password | type | bbox + text: "********" |
| 14 | Enter username | type | bbox + text: "admin" |
| 15 | Type message in chat | type | selector + text: "Hello!" |
| 16 | Enter coupon code | type | bbox + text: "SAVE20" |
| 17 | Type username into the username field | type | xpath + text: "testuser" |
| 18 | Type email into the email field | type | xpath + text: "user@example.com" |
| 19 | Navigate to settings | navigate | url: "/settings" |
| 20 | Go to dashboard | navigate | url: "/dashboard" |
| 21 | Open profile | navigate | url: "/profile" |
| 22 | Go to home page | navigate | url: "https://example.com" |
| 23 | Scroll down | scroll | direction: "down" |
| 24 | Scroll up | scroll | direction: "up" |
| 25 | Scroll down the page | scroll | direction: "down" |
| 26 | Wait for results to load | wait | (no parameters) |
| 27 | Stop | done | (no parameters) |
| 28 | Finish | done | (no parameters) |
Each example is formatted as a text prompt:
### Human: Click the login button
### Assistant: <action>{"action":"click","bbox":[450,380,120,40]}</action>
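A small helper can produce this prompt format from an instruction and an action dictionary. The `format_example` name is hypothetical, and the compact JSON separators are chosen to match the example above; this is not necessarily how the repository's training script builds its data.

```python
import json


def format_example(instruction: str, action: dict) -> str:
    """Render one training example in the '### Human / ### Assistant' format."""
    # Compact separators reproduce the no-whitespace JSON shown in the example.
    payload = json.dumps(action, separators=(",", ":"))
    return f"### Human: {instruction}\n### Assistant: <action>{payload}</action>"


print(format_example("Click the login button", {"action": "click", "bbox": [450, 380, 120, 40]}))
```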
Quantization
The base model is loaded in 4-bit NormalFloat4 (NF4) precision using BitsAndBytesConfig:
BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)
This reduces the base model memory footprint from approximately 14 GB (FP16) to approximately 4 GB (NF4), enabling training on consumer GPUs with 24 GB VRAM.
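The approximate figures can be checked with back-of-the-envelope arithmetic. The sketch below counts raw weight storage only and takes the base parameter count from the totals reported elsewhere in this card; it ignores quantization constants and any layers that stay unquantized, so a real NF4 footprint will be somewhat larger than the number it prints.

```python
# Base parameters = reported total minus reported LoRA parameters (both from this card).
BASE_PARAMS = 7_620_662_784 - 5_046_272


def weight_gib(params: int, bits_per_param: int) -> float:
    """Raw weight storage in GiB for a given precision."""
    return params * bits_per_param / 8 / 2**30


fp16_gib = weight_gib(BASE_PARAMS, 16)
nf4_gib = weight_gib(BASE_PARAMS, 4)
print(f"FP16: {fp16_gib:.2f} GiB, NF4: {nf4_gib:.2f} GiB")
```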
LoRA Configuration
| Parameter | Value |
|---|---|
| Rank (r) | 16 |
| Alpha (lora_alpha) | 32 |
| Dropout | 0.05 |
| Target modules | q_proj, v_proj |
| Bias | none |
| Task type | CAUSAL_LM |
Trainable parameters: 5,046,272 out of 7,620,662,784 total (0.0662%).
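The trainable parameter count can be reproduced from the LoRA rank and the projection shapes. The dimensions below (hidden size 3584, 28 layers, 4 key/value heads of dimension 128) are assumptions taken from the published Qwen2.5-7B configuration rather than from this card; each adapted linear layer adds r * (d_in + d_out) parameters.

```python
# Assumed Qwen2.5-7B dimensions (from the base model's config, not this card).
HIDDEN = 3584
LAYERS = 28
KV_HEADS = 4
HEAD_DIM = 128
RANK = 16  # LoRA rank from the table above


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters in one LoRA pair: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


q_proj = lora_params(HIDDEN, HIDDEN, RANK)               # full-width query projection
v_proj = lora_params(HIDDEN, KV_HEADS * HEAD_DIM, RANK)  # grouped-query value projection
trainable = LAYERS * (q_proj + v_proj)
percent = 100 * trainable / 7_620_662_784

print(trainable, round(percent, 4))
```

Under these assumptions the result matches the reported 5,046,272 trainable parameters. At 4 bytes per parameter that is about 20.2 MB, consistent with the adapter file size if the weights are stored in float32.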
Training Results
Training was conducted on an NVIDIA GeForce RTX 4090 (25.3 GB VRAM) with CUDA, PyTorch 2.6.0, and Hugging Face Transformers.
| Step | Loss | Grad Norm | Learning Rate | Epoch |
|---|---|---|---|---|
| 5 | 15.34 | 10.65 | 1.957e-04 | 0.36 |
| 10 | 12.06 | 26.03 | 1.886e-04 | 0.71 |
| 15 | 5.399 | 23.31 | 1.814e-04 | 1.07 |
| 20 | 0.5386 | 2.356 | 1.743e-04 | 1.43 |
| 25 | 0.2402 | 0.6346 | 1.671e-04 | 1.79 |
| 30 | 0.1843 | 0.5288 | 1.600e-04 | 2.14 |
| 35 | 0.1319 | 0.3665 | 1.529e-04 | 2.50 |
| 40 | 0.09393 | 0.3279 | 1.457e-04 | 2.86 |
| 45 | 0.07736 | 0.2292 | 1.386e-04 | 3.21 |
| 50 | 0.07643 | 0.3647 | 1.314e-04 | 3.57 |
| 55 | 0.06076 | 0.3630 | 1.243e-04 | 3.93 |
| 60 | 0.06466 | 0.3370 | 1.171e-04 | 4.29 |
| 65 | 0.05192 | 0.4162 | 1.100e-04 | 4.64 |
| 70 | 0.05431 | 0.4836 | 1.029e-04 | 5.00 |
| 75 | 0.04319 | 0.2446 | 9.571e-05 | 5.36 |
| 80 | 0.04658 | 0.4294 | 8.857e-05 | 5.71 |
| 85 | 0.05086 | 0.2943 | 8.143e-05 | 6.07 |
| 90 | 0.04453 | 0.2923 | 7.429e-05 | 6.43 |
| 95 | 0.04564 | 0.4350 | 6.714e-05 | 6.79 |
| 100 | 0.03816 | 0.1997 | 6.000e-05 | 7.14 |
| 105 | 0.03836 | 0.4261 | 5.286e-05 | 7.50 |
| 110 | 0.04136 | 0.3450 | 4.571e-05 | 7.86 |
| 115 | 0.03368 | 0.2899 | 3.857e-05 | 8.21 |
| 120 | 0.03895 | 0.5276 | 3.143e-05 | 8.57 |
| 125 | 0.03497 | 0.3903 | 2.429e-05 | 8.93 |
| 130 | 0.03757 | 0.3689 | 1.714e-05 | 9.29 |
| 135 | 0.03311 | 0.4284 | 1.000e-05 | 9.64 |
| 140 | 0.03383 | 0.3776 | 2.857e-06 | 10.00 |
Final training loss: 0.033. The model learns to emit the correct structured actions for the 28 training examples with high confidence.
Training throughput: 1.76 steps/second, 3.52 samples/second, 79.49 seconds total for 140 steps (28 examples x 10 epochs / 2 batch size).
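The step count and throughput figures follow directly from the dataset size, batch size, and wall-clock time. The check below assumes no gradient accumulation, which this card does not state explicitly.

```python
import math

examples, epochs, batch_size = 28, 10, 2
seconds = 79.49  # reported wall-clock training time

steps_per_epoch = math.ceil(examples / batch_size)  # optimizer steps per epoch
total_steps = steps_per_epoch * epochs
steps_per_second = total_steps / seconds
samples_per_second = examples * epochs / seconds

print(total_steps, round(steps_per_second, 2), round(samples_per_second, 2))
```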
Full Project Structure
The complete agent framework is available on GitHub at ZAID646/qwen2.5-vl-7b-playwright-desktop-lora.
qwen2.5-vl-7b-playwright-desktop-lora/
├── LICENSE                     # Apache 2.0
├── README.md                   # Full project documentation
├── CONTRIBUTING.md             # Contribution guidelines
├── pyproject.toml              # Project metadata and dependencies
├── requirements.txt            # Pip dependencies
├── setup.sh                    # Vast.ai environment setup
│
├── config/
│   ├── model.yaml              # Model selection, quantization, LoRA params
│   ├── sandbox.yaml            # Browser viewport, timeouts, concurrency
│   └── mock_scenarios.json     # Mock VLM scenario definitions
│
├── scripts/
│   ├── run_agent.py            # Single-task agent runner
│   ├── run_harness.py          # Full evaluation harness runner
│   └── train_lora.py           # QLoRA training script
│
├── src/
│   ├── agent/
│   │   ├── state.py            # AgentState, VisionOutput, StepRecord
│   │   ├── graph.py            # LangGraph state machine builder
│   │   ├── nodes.py            # PerceptionNode, ActionNode, RouterNode
│   │   └── prompts.py          # System prompt templates
│   │
│   ├── vision/
│   │   ├── model.py            # Model loader with quantization
│   │   ├── processor.py        # Screenshot preprocessing
│   │   ├── quant.py            # Quantization configuration
│   │   └── mock.py             # MockVLM for offline testing
│   │
│   ├── sandbox/
│   │   ├── browser.py          # Playwright BrowserManager singleton
│   │   ├── actions.py          # Atomic browser actions
│   │   └── recorder.py         # Screenshot + DOM capture
│   │
│   ├── memory/
│   │   ├── context.py          # ContextCompressor
│   │   └── history.py          # Step history summarizer
│   │
│   ├── harness/
│   │   ├── scenarios.py        # Benchmark scenario definitions
│   │   ├── runner.py           # Async scenario executor
│   │   └── metrics.py          # TCR, SER, TFI, SCRR computation
│   │
│   └── training/
│       ├── dataset.py          # UIExample dataclass
│       └── lora.py             # LoRA configuration builder
│
└── tests/
    ├── test_agent.py           # Agent graph and nodes tests
    ├── test_vision.py          # MockVLM and processor tests
    ├── test_harness.py         # Metrics computation tests
    └── test_memory.py          # Context compression tests
Dependencies
Core dependencies for loading and using this adapter:
| Package | Minimum Version | Purpose |
|---|---|---|
| torch | 2.4 | GPU tensor operations |
| transformers | 4.44 | Model loading, tokenizer, Trainer API |
| accelerate | 0.33 | Multi-device model sharding |
| bitsandbytes | 0.43 | 4-bit quantization (NF4) |
| peft | 0.12 | LoRA adapter configuration |
| sentencepiece | (latest) | Tokenizer support |
Optional dependencies for the full agent framework:
| Package | Purpose |
|---|---|
| langgraph | Agent state machine (state graph) |
| langchain-core | LangChain integration |
| playwright | Browser automation sandbox |
| datasets | Dataset loading and mapping |
| pyyaml | YAML configuration parsing |
| pillow | Image processing |
| huggingface_hub | Hub model push/download |
Repository Contents
| File | Size | Description |
|---|---|---|
| adapter_model.safetensors | 20.2 MB | Trained LoRA adapter weights (q_proj, v_proj) |
| adapter_config.json | 1 KB | LoRA hyperparameters (r=16, alpha=32, dropout=0.05) |
| tokenizer.json | 11.4 MB | Qwen2.5 tokenizer |
| tokenizer_config.json | 691 B | Tokenizer configuration |
| chat_template.jinja | 5 KB | Jinja chat template for Qwen2.5 |
| README.md | This file | Hub model card |
| data.json | 5 KB | Training examples used for fine-tuning |
License
This adapter is released under the Apache License 2.0. See the LICENSE file for the full text.
The base model Qwen/Qwen2.5-7B-Instruct is governed by its own license (Apache License 2.0, as stated on its model card).
Citation
If you use this adapter in your research or work, please cite:
@software{multimodal_vision_agent_lora,
author = {Zaid},
title = {Multimodal Vision Agent -- LoRA Adapter for Desktop UI Automation},
year = {2025},
publisher = {Hugging Face},
url = {https://huggingface.co/zaid646/multimodal-vision-agent-lora}
}