OmniAgent-RL-7B

OmniAgent-RL-7B is the final checkpoint of OmniAgent (ICML 2026), the first native omni-modal agent for active perception in video understanding. It is built on Qwen2.5-Omni-7B and trained in two stages: Agentic SFT, followed by Agentic RL with TAURA. It is the recommended checkpoint and reproduces the main results in the paper.

What it does

Instead of consuming every frame, OmniAgent runs an Observation–Thought–Action (OTA) loop: a single omni model decides what to look at (get_frames), listen to (get_audio), or watch (get_clip) on demand, distills each percept into a compact textual memory, and answers when it has enough evidence. The environment only returns raw media — all perception and reasoning are done by this model.
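The loop described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the repository's implementation: only the tool names get_frames, get_audio, and get_clip come from this card, while Action, ToyEnvironment, ScriptedPolicy, run_ota_loop, the start/end arguments, and the step budget are hypothetical names chosen for the example. A scripted policy stands in for the omni model.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Tool names are taken from the model card; their arguments are hypothetical.
TOOLS = ("get_frames", "get_audio", "get_clip")


@dataclass
class Action:
    name: str                                  # one of TOOLS, or "answer"
    args: dict = field(default_factory=dict)


class ToyEnvironment:
    """Hypothetical environment: hands back raw media only, does no reasoning."""

    def step(self, action: Action) -> str:
        if action.name not in TOOLS:
            raise ValueError(f"unknown tool: {action.name}")
        start, end = action.args["start"], action.args["end"]
        # A placeholder string stands in for the real frames / audio / clip.
        return f"<{action.name} media {start}-{end}s>"


class ScriptedPolicy:
    """Hypothetical stand-in for the omni model: follows a fixed plan."""

    def __init__(self, plan: List[Action], final_answer: str):
        self.plan = list(plan)
        self.final_answer = final_answer

    def decide(self, question: str, memory: List[str]) -> Action:
        # Thought + Action: ask for more evidence until the plan is exhausted.
        if len(memory) < len(self.plan):
            return self.plan[len(memory)]
        return Action("answer", {"text": self.final_answer})

    def distill(self, media: str) -> str:
        # Observation: compress the raw percept into a short text note.
        return f"note on {media}"


def run_ota_loop(policy, env, question: str,
                 max_steps: int = 8) -> Tuple[Optional[str], List[str]]:
    memory: List[str] = []                     # compact textual memory
    for _ in range(max_steps):
        action = policy.decide(question, memory)
        if action.name == "answer":
            return action.args["text"], memory
        media = env.step(action)               # environment returns raw media
        memory.append(policy.distill(media))   # the model does the perception
    return None, memory                        # budget spent without an answer


policy = ScriptedPolicy(
    plan=[Action("get_frames", {"start": 0, "end": 10}),
          Action("get_audio", {"start": 40, "end": 50})],
    final_answer="B",
)
answer, memory = run_ota_loop(policy, ToyEnvironment(), "Who speaks first?")
print(answer, memory)
```

In this sketch the same object both chooses the action and distills the returned media, which mirrors the card's point that the environment contributes no reasoning of its own.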

How to use

⚠️ This is an agent checkpoint. The weights follow the Qwen2.5-Omni-7B architecture, but to reproduce OmniAgent's active perception you must run it inside the OTA environment in the GitHub repo, not via a plain transformers call.

# After setup (see the repo README):
bash demo/launch_inference.sh checkpoints/OmniAgent-RL-7B assets/example_video_mcq.mp4
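
For scripted setups, the same two steps (fetch the weights, then call the demo script) could be wrapped in Python. This is a minimal sketch, assuming the huggingface_hub package is installed, that the weights live under the repo id harryhsing/OmniAgent-RL-7B shown on the model page, and that the script is run from the root of the cloned GitHub repo after setup. The helper names download_checkpoint, launch_command, and run_demo are hypothetical; the script path and arguments are copied from the command above.

```python
import subprocess

REPO_ID = "harryhsing/OmniAgent-RL-7B"       # repo id as listed on the model page
LOCAL_DIR = "checkpoints/OmniAgent-RL-7B"    # same path as in the command above


def download_checkpoint(local_dir: str = LOCAL_DIR) -> str:
    """Fetch the checkpoint files from the Hugging Face Hub into local_dir."""
    # Imported lazily so the rest of the module works without the package.
    from huggingface_hub import snapshot_download
    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)


def launch_command(checkpoint_dir: str, video: str) -> list:
    """Build the argument list for the repo's demo script."""
    return ["bash", "demo/launch_inference.sh", checkpoint_dir, video]


def run_demo(video: str = "assets/example_video_mcq.mp4") -> None:
    """Download the weights, then run the demo (call from the repo root)."""
    download_checkpoint()
    subprocess.run(launch_command(LOCAL_DIR, video), check=True)
```

Calling run_demo() downloads the full checkpoint, so it is left uncalled here.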

Model size & components

OmniAgent uses only the Qwen2.5-Omni thinker for agent reasoning: during both training and inference, only the thinker is loaded, and the talker (audio generation) is never used. The text-to-audio tag on the model page is auto-derived from the Qwen2.5-Omni architecture declared in config.json and does not reflect how OmniAgent is used.

Which checkpoint should I use?

  • OmniAgent-RL-7B (this model) — best performance; use for inference, evaluation, and deployment.
  • OmniAgent-SFT-7B — the cold-start checkpoint; use it to re-run Agentic RL yourself or to study the SFT-only stage.

Results (highlights)

  • Open-source SoTA across 10 video / audio-visual / temporal-grounding benchmarks.
  • On LVBench, this 7B agent outperforms the 10× larger Qwen2.5-VL-72B (50.5 vs. 47.3) with ~73% fewer frames.
  • Large temporal-grounding gains over the Qwen2.5-Omni-7B base (+33 IoU on LongVALE and VUE-TR).

See the paper for the full tables.
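
For readers unfamiliar with the temporal-grounding metric, the snippet below shows how a temporal IoU between a predicted and a ground-truth segment is commonly computed. It is a sketch of the standard definition, assumed here for illustration; the exact metric variant and averaging used on LongVALE and VUE-TR are specified in the paper, and the function name temporal_iou is hypothetical.

```python
from typing import Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds)


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection over union of two time segments, in [0, 1]."""
    (ps, pe), (gs, ge) = pred, gt
    if pe < ps or ge < gs:
        raise ValueError("segment end must not precede its start")
    # Overlap is zero when the segments are disjoint.
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0


# Prediction 10-30 s against ground truth 20-40 s:
# overlap is 10 s, union is 30 s, so the IoU is 1/3.
print(temporal_iou((10.0, 30.0), (20.0, 40.0)))
```

Benchmarks usually report this value averaged over queries and scaled to 0-100, which is presumably the scale behind the "+33 IoU" figure.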

Limitations

The sequential OTA loop adds inference latency compared to a single forward pass; reducing this via parallel exploration is left to future work.

Citation

@inproceedings{xing2026omniagent,
  title={Native Active Perception as Reasoning for Omni-Modal Understanding},
  author={Zhenghao Xing and Ruiyang Xu and Yuxuan Wang and Jinzheng He and Ziyang Ma and Qize Yang and Yunfei Chu and Jin Xu and Junyang Lin and Chi-Wing Fu and Pheng-Ann Heng},
  booktitle={International Conference on Machine Learning (ICML)},
  year={2026}
}