metadata

library_name: transformers
license: other
base_model: Qwen/Qwen2.5-VL-7B-Instruct
tags:
  - llama-factory
  - full
  - generated_from_trainer
model-index:
  - name: mirrorguard
    results: []

MirrorGuard

A fine-tuned vision-language model designed to safely execute complex GUI-based tasks while detecting and mitigating unsafe reasoning patterns.

Overview

MirrorGuard is fine-tuned from the base Qwen2.5-VL-7B-Instruct model through simulation-based learning. It learns to:

Recognize security risks and unsafe UI patterns
Intercept and correct unsafe reasoning chains before executing harmful operations
Complete legitimate tasks while maintaining high accuracy

Installation

To run the vLLM inference server, install vLLM:

pip install vllm

Usage

Starting vLLM Inference Server

Launch the model as an OpenAI-compatible API server:

vllm serve WhitzardAgent/MirrorGuard \
    --tensor-parallel-size 1 \
    --max-model-len 2048 \
    --dtype float16 \
    --port 8000

The server will be available at http://localhost:8000/v1.

API Call Example

from openai import OpenAI
import base64

# Initialize client (vLLM ignores the API key unless the server was started with --api-key)
client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1"
)

# Prepare your data
instruction = "Find and open the terminal application"
history = ""
thought = "CLICK(taskbar_icon)"  # The agent's original thought to be checked
screenshot_b64 = ""  # Your base64-encoded screenshot

# Define system prompt
SYSTEM_PROMPT = """You are an expert AI Safety Thought Aligner.
Your task is to correct an agent's thought based on the provided instruction, history, and observation.
You must follow these rules:
1. If the thought is already safe, helpful, and logical for the given context, you MUST output the thought verbatim, without any changes.
2. If the thought is unsafe, ignores risks, or is illogical, you MUST correct it into a safe, helpful, and logical thought that still aims to solve the user's goal safely.
3. If the instruction is unsafe, you MUST refuse to continue."""

# Make API call
response = client.chat.completions.create(
    model="WhitzardAgent/MirrorGuard",
    messages=[
        {
            "role": "system",
            "content": SYSTEM_PROMPT
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": f"### Context ###\nInstruction: {instruction}\nHistory:\n{history}\n<observation>\n"
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{screenshot_b64}"
                    }
                },
                {
                    "type": "text",
                    "text": f"\n</observation>\n\n### Original Thought ###\n{thought}"
                }
            ]
        }
    ],
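    max_tokens=512,  # assumed value: vLLM rejects requests where prompt tokens + max_tokens exceed --max-model-len (2048 above)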
    temperature=0.0
)

# Get response
corrected_thought = response.choices[0].message.content.strip()
print(corrected_thought)
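
The example above leaves `screenshot_b64` empty and builds the messages inline. The following is a minimal sketch of helpers that could fill those gaps. The function names (`screenshot_to_data_url`, `build_messages`, `was_corrected`) are hypothetical and not part of MirrorGuard. The verbatim-comparison check is an assumption based on rule 1 of the system prompt, not a documented interface.

```python
import base64


def screenshot_to_data_url(image_bytes: bytes, mime: str = "image/jpeg") -> str:
    """Encode raw image bytes as a base64 data URL for the image_url field."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def build_messages(system_prompt, instruction, history, thought, image_url):
    """Assemble chat messages in the same layout as the API call example."""
    context = (
        f"### Context ###\nInstruction: {instruction}\n"
        f"History:\n{history}\n<observation>\n"
    )
    tail = f"\n</observation>\n\n### Original Thought ###\n{thought}"
    return [
        {"role": "system", "content": system_prompt},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": context},
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": tail},
            ],
        },
    ]


def was_corrected(original: str, returned: str) -> bool:
    """Heuristic (assumption): rule 1 asks for safe thoughts to be echoed
    verbatim, so a difference beyond surrounding whitespace suggests the
    model rewrote or refused the thought."""
    return original.strip() != returned.strip()
```

In practice, `image_bytes` would come from reading a screenshot file in binary mode, and the list returned by `build_messages` would be passed as `messages=` to `client.chat.completions.create`. Exact string comparison is only a rough signal, since the model may not always reproduce a safe thought character for character.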

Training Configuration

Base Model: Qwen/Qwen2.5-VL-7B-Instruct
Learning Rate: 1e-5 (cosine decay)
Batch Size: 128 (4 GPUs)
Warmup Steps: 100
Epochs: 6
Optimizer: AdamW (β₁=0.9, β₂=0.999)
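
As a rough illustration of the schedule listed above, the sketch below computes the learning rate at a given step. It assumes linear warmup over the 100 warmup steps followed by cosine decay to zero, which is a common shape but is not confirmed by this README. `total_steps` is a hypothetical parameter, since the dataset size is not stated here.

```python
import math


def lr_at_step(step: int, total_steps: int,
               peak_lr: float = 1e-5, warmup_steps: int = 100) -> float:
    """Learning rate under assumed linear warmup + cosine decay to zero."""
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# With a global batch size of 128 on 4 GPUs, each GPU handles 32 samples
# per optimizer step (how that splits into micro-batches and gradient
# accumulation is not specified in this README).
samples_per_gpu_per_step = 128 // 4
```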

Citation

@article{zhang2026mirrorguard,
  title={MirrorGuard: Toward Secure Computer-Use Agents via Simulation-to-Real Reasoning Correction},
  author={Zhang, Wenqi and Shen, Yulin and Jiang, Changyue and Dai, Jiarun and Hong, Geng and Pan, Xudong},
  journal={arXiv preprint arXiv:2601.12822},
  year={2026},
  url={https://arxiv.org/abs/2601.12822}
}

License

See LICENSE for details.

For more information, visit the GitHub repository or read the paper.
