AgentHijack-Agent
AgentHijack-Agent is the action-generation model released with the paper AgentHijack: Benchmarking Computer Use Agent Robustness to Common Environment Corruptions (ICML 2026).
It is fine-tuned from UI-TARS-1.5-7B (Qwen2.5-VL architecture) using Data-Augmented Group Relative Policy Optimization (DA-GRPO) on the AgentHijack benchmark, with the goal of producing a computer-use agent that remains reliable under common environment corruptions (pop-ups, resolution changes, UI marks, subtitles, multi-apps, accidental touches, app minimization, network errors, and verification prompts).
The same checkpoint serves a dual role in the AgentHijack-Agent framework:
- Action generator — produces the next GUI action from screenshots + history.
- Onlooker — summarizes behavioral changes between consecutive screenshots and performs an initial environment check before execution.
- 📄 Paper: AgentHijack: Benchmarking Computer Use Agent Robustness to Common Environment Corruptions (ICML 2026)
- 🌐 Project page: https://AgentHijack.github.io
- 🧩 Base model: ByteDance-Seed/UI-TARS-1.5-7B (Qwen2.5-VL-7B architecture)
- 🏛️ Affiliations: TMLR Group, Hong Kong Baptist University
Highlights
Compared with the base UI-TARS-1.5-7B, AgentHijack-Agent:
- Improves the average task success rate on the AgentHijack benchmark by +4.15% (and by a larger margin over the UI-TARS-7B-DPO baseline).
- Maintains accurate grounding under visual disruptors (pop-ups, resolution changes, UI marks, subtitles, multi-apps).
- Recovers from unexpected operations (accidental touch, app minimization) via behavioral summarization.
- Detects environment errors (network failure, login/verification prompts) up-front instead of looping on meaningless attempts.
See Table 2 and Figure 8 of the paper for full results and qualitative trajectories.
Model details
| Field | Value |
|---|---|
| Architecture | Qwen2_5_VLForConditionalGeneration |
| Parameters | ~7B |
| Precision | bfloat16 |
| Context length | 128k tokens |
| Image resolution | 1920 × 1080 (native, paper default) |
| Sharding | 4 × safetensors shards |
| Tokenizer | Inherited from UI-TARS-1.5-7B / Qwen2.5-VL |
Training
- Algorithm: Data-Augmented GRPO (DA-GRPO), an extension of GRPO that rolls out the same instruction across different corrupted environments drawn from a corruption set C, instead of a single clean environment.
- Framework: VERL.
- Data: 128 tasks sampled from the AgentHijack benchmark (built on top of OSWorld with 9 configurable corruption types, 3,321 tasks total).
- Schedule: 15 epochs.
- Reward: `r = r_success + r_format`, with an experience-replay buffer (following ARPO) to mitigate sparse-reward batches.
- Optimization: clip range [0.2, 0.3], KL loss disabled to encourage exploration.
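The group-relative part of the training recipe can be illustrated with a small sketch. It assumes the standard GRPO advantage (reward minus the group mean, divided by the group standard deviation) and treats each rollout of one instruction in a different corrupted environment as a group member. The function names, the corruption labels, and the reward values are hypothetical. Reading "clip range [0.2, 0.3]" as asymmetric lower and upper clipping bounds is also an assumption, so the paper's exact formulation may differ.

```python
from statistics import mean, pstdev

def total_reward(r_success, r_format):
    # Reward as stated in the card: r = r_success + r_format.
    return r_success + r_format

def group_relative_advantages(rewards, eps=1e-6):
    # Standard GRPO-style normalization within one group of rollouts.
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def clip_ratio(ratio, clip_low=0.2, clip_high=0.3):
    # ASSUMPTION: "clip range [0.2, 0.3]" means the policy probability ratio
    # is clipped to [1 - 0.2, 1 + 0.3], as in asymmetric PPO-style clipping.
    return min(max(ratio, 1.0 - clip_low), 1.0 + clip_high)

# Hypothetical group: the same instruction rolled out in four corrupted
# environments. The (r_success, r_format) pairs are made up for illustration.
group = {
    "popup": (1.0, 1.0),
    "resolution_change": (0.0, 1.0),
    "network_error": (0.0, 1.0),
    "app_minimization": (0.0, 0.0),
}
rewards = [total_reward(s, f) for s, f in group.values()]
advantages = dict(zip(group, group_relative_advantages(rewards)))

for name, adv in advantages.items():
    print(f"{name}: advantage = {adv:+.3f}")
```

Because the advantages are normalized within the group, a rollout is rewarded only for doing better than the other corrupted variants of the same instruction, which is one way to read the motivation for sampling the group across the corruption set.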
Usage
The model uses the standard Qwen2.5-VL / UI-TARS interface and is compatible with transformers and vllm.
Action space
AgentHijack-Agent uses the same action space as UI-TARS-1.5-7B:
click(start_box='<|box_start|>(x1,y1)<|box_end|>')
left_double(start_box='<|box_start|>(x1,y1)<|box_end|>')
right_single(start_box='<|box_start|>(x1,y1)<|box_end|>')
drag(start_box='<|box_start|>(x1,y1)<|box_end|>', end_box='<|box_start|>(x3,y3)<|box_end|>')
hotkey(key='')
type(content='xxx')
scroll(start_box='<|box_start|>(x1,y1)<|box_end|>', direction='down or up or right or left')
wait()
finished(content='xxx')
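To execute a predicted action, the action string has to be turned into a structured call. The following is a minimal parsing sketch, not the parser used by the authors: `parse_action` and the regular expressions are hypothetical helpers, and the sketch assumes non-negative integer pixel coordinates and single-quoted arguments exactly as in the list above.

```python
import re

# Matches a whole call such as: click(start_box='<|box_start|>(12,34)<|box_end|>')
ACTION_RE = re.compile(r"^\s*(?P<name>\w+)\((?P<args>.*)\)\s*$", re.DOTALL)
# Matches one keyword argument with a single-quoted value.
ARG_RE = re.compile(r"(?P<key>\w+)='(?P<value>(?:[^'\\]|\\.)*)'")
# Matches a box value; ASSUMPTION: coordinates are non-negative integers.
BOX_RE = re.compile(r"<\|box_start\|>\((\d+),\s*(\d+)\)<\|box_end\|>")

def parse_action(text):
    """Parse one action string into {'name': ..., 'args': {...}}."""
    match = ACTION_RE.match(text)
    if match is None:
        raise ValueError(f"not an action call: {text!r}")
    args = {}
    for arg in ARG_RE.finditer(match.group("args")):
        value = arg.group("value")
        box = BOX_RE.fullmatch(value)
        # Box arguments become (x, y) tuples; other values stay strings.
        args[arg.group("key")] = (int(box.group(1)), int(box.group(2))) if box else value
    return {"name": match.group("name"), "args": args}

print(parse_action("click(start_box='<|box_start|>(120,340)<|box_end|>')"))
print(parse_action("type(content='hello world')"))
print(parse_action("wait()"))
```

The parser expects only the action call itself; any surrounding text in the model output would have to be stripped off first.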
Prompt template (action generator)
You are a GUI agent. You are given a task and your action history, with
screenshots. You need to perform the next action to complete the task.
## Output Format
Thought: ... Action: ...
## Action Space
{action_space}
## Note
- Use {language} in `Thought` part.
- Write a small plan and finally summarize your next action (with its target
element) in one sentence in `Thought` part.
## User Instruction
{instruction}
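The placeholders `{action_space}`, `{language}`, and `{instruction}` can be filled with ordinary string formatting. The sketch below assumes Python `str.format` semantics for the placeholders; `build_prompt`, the constant names, and the default language `"English"` are hypothetical choices for illustration.

```python
PROMPT_TEMPLATE = """You are a GUI agent. You are given a task and your action history, with
screenshots. You need to perform the next action to complete the task.
## Output Format
Thought: ... Action: ...
## Action Space
{action_space}
## Note
- Use {language} in `Thought` part.
- Write a small plan and finally summarize your next action (with its target
element) in one sentence in `Thought` part.
## User Instruction
{instruction}
"""

# Action space copied from the list in this card.
ACTION_SPACE = "\n".join([
    "click(start_box='<|box_start|>(x1,y1)<|box_end|>')",
    "left_double(start_box='<|box_start|>(x1,y1)<|box_end|>')",
    "right_single(start_box='<|box_start|>(x1,y1)<|box_end|>')",
    "drag(start_box='<|box_start|>(x1,y1)<|box_end|>', end_box='<|box_start|>(x3,y3)<|box_end|>')",
    "hotkey(key='')",
    "type(content='xxx')",
    "scroll(start_box='<|box_start|>(x1,y1)<|box_end|>', direction='down or up or right or left')",
    "wait()",
    "finished(content='xxx')",
])

def build_prompt(instruction, language="English"):
    # ASSUMPTION: the default language is a placeholder choice, not a card setting.
    return PROMPT_TEMPLATE.format(
        action_space=ACTION_SPACE, language=language, instruction=instruction
    )

print(build_prompt("Open the Settings app and enable dark mode."))
```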
Minimal inference example
from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

model_id = "TMLR-Group-HF/AgentHijack-Agent"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Build a chat with screenshot(s) + the action-generator prompt above,
# then run model.generate(...) as usual.
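The remaining steps, building the chat and generating, might look like the sketch below. `build_messages` and `generate_action` are hypothetical helpers, and placing the text prompt before the screenshots is an assumption rather than a documented requirement. The `processor` and `model` arguments are the objects loaded in the example above, so nothing here triggers a download until `generate_action` is called.

```python
def build_messages(prompt, screenshot_urls):
    """Build a single-turn chat: the filled prompt followed by screenshots."""
    # ASSUMPTION: text first, then images; the card does not fix the order.
    content = [{"type": "text", "text": prompt}]
    content += [{"type": "image", "url": url} for url in screenshot_urls]
    return [{"role": "user", "content": content}]

def generate_action(model, processor, messages, max_new_tokens=256):
    """Run one generation step and return only the newly generated text."""
    inputs = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Slice off the prompt tokens so that only the model's reply is decoded.
    new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
    return processor.decode(new_tokens, skip_special_tokens=True)

# Hypothetical usage with a placeholder screenshot URL.
messages = build_messages(
    "...filled action-generator prompt...",
    ["https://example.com/screenshot.png"],
)
print(messages[0]["content"][0]["type"], len(messages[0]["content"]))
```

The value `max_new_tokens=256` is an arbitrary illustrative budget; the reply has to fit both the `Thought` and the `Action` parts of the output format.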
For the full agent framework (action generator + onlooker + environment checking), please refer to the code at AgentHijack.github.io.
Citation
If you use this model or the AgentHijack benchmark, please cite:
@inproceedings{sun2026agenthijack,
title = {AgentHijack: Benchmarking Computer Use Agent Robustness to Common Environment Corruptions},
author = {Jingwei Sun and Jianing Zhu and Yuanyi Li and Tongliang Liu and Xia Hu and Bo Han},
booktitle = {Forty-third International Conference on Machine Learning},
year = {2026},
url = {https://openreview.net/forum?id=0H5Im3Xvuf}
}
Acknowledgements
This model is built on top of UI-TARS-1.5-7B and the Qwen2.5-VL family, with training infrastructure based on VERL. The benchmark environment extends OSWorld.