AgentHijack-Agent

AgentHijack-Agent is the action-generation model released with the paper AgentHijack: Benchmarking Computer Use Agent Robustness to Common Environment Corruptions (ICML 2026).

It is fine-tuned from UI-TARS-1.5-7B (Qwen2.5-VL architecture) using Data-Augmented Group Relative Policy Optimization (DA-GRPO) on the AgentHijack benchmark, with the goal of producing a computer-use agent that remains reliable under common environment corruptions (pop-ups, resolution changes, UI marks, subtitles, multi-apps, accidental touches, app minimization, network errors, and verification prompts).

The same checkpoint serves a dual role in the AgentHijack-Agent framework:

  1. Action generator — produces the next GUI action from screenshots + history.
  2. Onlooker — summarizes behavioral changes between consecutive screenshots and performs an initial environment check before execution.
  • 📄 Paper: AgentHijack: Benchmarking Computer Use Agent Robustness to Common Environment Corruptions (ICML 2026)
  • 🌐 Project page: https://AgentHijack.github.io
  • 🧩 Base model: ByteDance-Seed/UI-TARS-1.5-7B (Qwen2.5-VL-7B architecture)
  • 🏛️ Affiliations: TMLR Group, Hong Kong Baptist University

Highlights

Compared with the base UI-TARS-1.5-7B, AgentHijack-Agent:

  • Improves average task success rate on the AgentHijack benchmark by +4.15% (and a larger margin on UI-TARS-7B-DPO baseline).
  • Maintains accurate grounding under visual disruptors (pop-ups, resolution change, marks, subtitle, multi-apps).
  • Recovers from unexpected operations (accidental touch, app minimization) via behavioral summarization.
  • Detects environment errors (network failure, login/verification prompts) up-front instead of looping on meaningless attempts.

See Table 2 and Figure 8 of the paper for full results and qualitative trajectories.


Model details

Field Value
Architecture Qwen2_5_VLForConditionalGeneration
Parameters ~7B
Precision bfloat16
Context length 128k tokens
Image resolution 1920 × 1080 (native, paper default)
Sharding 4 × safetensors shards
Tokenizer Inherited from UI-TARS-1.5-7B / Qwen2.5-VL

Training

  • Algorithm: Data-Augmented GRPO (DA-GRPO), an extension of GRPO that rolls out the same instruction across different corrupted environments drawn from a corruption set C, instead of a single clean environment.
  • Framework: VERL.
  • Data: 128 tasks sampled from the AgentHijack benchmark (built on top of OSWorld with 9 configurable corruption types, 3,321 tasks total).
  • Schedule: 15 epochs.
  • Reward: r = r_success + r_format, with an experience-replay buffer (following ARPO) to mitigate sparse-reward batches.
  • Optimization: clip range [0.2, 0.3], KL loss disabled to encourage exploration.

Usage

The model uses the standard Qwen2.5-VL / UI-TARS interface and is compatible with transformers and vllm.

Action space

AgentHijack-Agent uses the same action space as UI-TARS-1.5-7B:

click(start_box='<|box_start|>(x1,y1)<|box_end|>')
left_double(start_box='<|box_start|>(x1,y1)<|box_end|>')
right_single(start_box='<|box_start|>(x1,y1)<|box_end|>')
drag(start_box='<|box_start|>(x1,y1)<|box_end|>', end_box='<|box_start|>(x3,y3)<|box_end|>')
hotkey(key='')
type(content='xxx')
scroll(start_box='<|box_start|>(x1,y1)<|box_end|>', direction='down or up or right or left')
wait()
finished(content='xxx')

Prompt template (action generator)

You are a GUI agent. You are given a task and your action history, with
screenshots. You need to perform the next action to complete the task.

## Output Format

Thought: ... Action: ...


## Action Space

{action_space}

## Note
- Use {language} in `Thought` part.
- Write a small plan and finally summarize your next action (with its target
  element) in one sentence in `Thought` part.

## User Instruction
{instruction}

Minimal inference example

from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

model_id = "<your-username>/AgentHijack-Agent"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Build a chat with screenshot(s) + the action-generator prompt above,
# then run model.generate(...) as usual.

For the full agent framework (action generator + onlooker + environment checking), please refer to the code at AgentHijack.github.io.


Citation

If you use this model or the AgentHijack benchmark, please cite:

@inproceedings{sun2026agenthijack,
  title     = {AgentHijack: Benchmarking Computer Use Agent Robustness to Common Environment Corruptions},
  author    = {Jingwei Sun and Jianing Zhu and Yuanyi Li and Tongliang Liu and Xia Hu and Bo Han},
  booktitle = {Forty-third International Conference on Machine Learning},
  year      = {2026},
  url       = {https://openreview.net/forum?id=0H5Im3Xvuf}
}

Acknowledgements

This model is built on top of UI-TARS-1.5-7B and the Qwen2.5-VL family, with training infrastructure based on VERL. The benchmark environment extends OSWorld.

Downloads last month
4
Safetensors
Model size
8B params
Tensor type
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for TMLR-Group-HF/AgentHijack-Agent

Finetuned
(10)
this model
Quantizations
1 model