OmniAgent-RL-7B

OmniAgent-RL-7B is the final checkpoint of OmniAgent, the first native omni-modal agent for active perception in video understanding (ICML 2026). It is built on Qwen2.5-Omni-7B and trained in two stages: Agentic SFT, followed by Agentic RL with TAURA. This is the recommended checkpoint, and it reproduces the main results in the paper.

📄 Paper: Native Active Perception as Reasoning for Omni-Modal Understanding
💻 Code: https://github.com/HarryHsing/OmniAgent
🤗 Models: OmniAgent-RL-7B · OmniAgent-SFT-7B

What it does

Instead of consuming every frame, OmniAgent runs an Observation–Thought–Action (OTA) loop: a single omni model decides on demand what to look at (get_frames), listen to (get_audio), or watch (get_clip), distills each percept into a compact textual memory, and answers once it has enough evidence. The environment returns only raw media; all perception and reasoning are done by this model.
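The control flow of the loop described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the repo's implementation: the tool names get_frames, get_audio, and get_clip come from this card, while Action, AgentState, run_ota_loop, the time-window arguments, and the toy_* stand-ins for the model and environment are hypothetical names introduced for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Tool names come from the model card; their argument format here is assumed.
TOOLS = ("get_frames", "get_audio", "get_clip")


@dataclass
class Action:
    name: str           # one of TOOLS, or "answer"
    start: float = 0.0  # hypothetical time window, in seconds
    end: float = 0.0
    text: str = ""      # final answer when name == "answer"


@dataclass
class AgentState:
    question: str
    memory: List[str] = field(default_factory=list)  # compact textual memory


def run_ota_loop(
    question: str,
    policy: Callable[[AgentState], Action],   # model: thought -> action
    environment: Callable[[Action], str],     # returns raw media only
    summarize: Callable[[str], str],          # model: raw percept -> text note
    max_steps: int = 8,
) -> Tuple[str, List[str]]:
    state = AgentState(question)
    observation = None
    for _ in range(max_steps):
        # Observation: distill the latest raw percept into textual memory.
        if observation is not None:
            state.memory.append(summarize(observation))
        # Thought + Action: request more evidence, or answer.
        action = policy(state)
        if action.name == "answer":
            return action.text, state.memory
        if action.name not in TOOLS:
            raise ValueError(f"unknown tool: {action.name}")
        observation = environment(action)
    return "", state.memory  # step budget exhausted without an answer


# Toy stand-ins so the sketch runs without a model or a video.
def toy_environment(action: Action) -> str:
    return f"<raw {action.name} {action.start:.0f}-{action.end:.0f}s>"


def toy_summarize(raw: str) -> str:
    return f"note on {raw}"


def toy_policy(state: AgentState) -> Action:
    if len(state.memory) == 0:
        return Action("get_frames", 0.0, 10.0)
    if len(state.memory) == 1:
        return Action("get_audio", 0.0, 10.0)
    return Action("answer", text="B")


answer, memory = run_ota_loop(
    "Which option is correct?", toy_policy, toy_environment, toy_summarize
)
print(answer, memory)
```

In the real system, the policy and the summarizer would both be the same OmniAgent model, and the environment would return actual frames, audio, or clips rather than placeholder strings. The sketch only illustrates why the loop is sequential: each action depends on the memory built from earlier observations.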

How to use

⚠️ This is an agent checkpoint. The weights follow the Qwen2.5-Omni-7B architecture, but to reproduce OmniAgent's active perception you must run it inside the OTA environment in the GitHub repo, not via a plain transformers call.

# After setup (see the repo README):
bash demo/launch_inference.sh checkpoints/OmniAgent-RL-7B assets/example_video_mcq.mp4

Model size & components

OmniAgent uses only the Qwen2.5-Omni thinker for agent reasoning: only the thinker is loaded during training and inference, and the talker (audio generation) is never used. The text-to-audio tag is auto-derived from the Qwen2.5-Omni architecture in config.json and does not reflect how OmniAgent is used.

Which checkpoint should I use?

OmniAgent-RL-7B (this model) — best performance; use for inference, evaluation, and deployment.
OmniAgent-SFT-7B — the cold-start checkpoint; use it to re-run Agentic RL yourself or to study the SFT-only stage.

Results (highlights)

Open-source SoTA across 10 video / audio-visual / temporal-grounding benchmarks.
On LVBench, this 7B agent outperforms the 10× larger Qwen2.5-VL-72B (50.5 vs. 47.3) with ~73% fewer frames.
Large temporal-grounding gains over the Qwen2.5-Omni-7B base (+33 IoU on LongVALE and VUE-TR).

See the paper for the full tables.

Limitations

The sequential OTA loop adds inference latency compared to a single forward pass; reducing this via parallel exploration is left to future work.

Citation

@inproceedings{xing2026omniagent,
  title={Native Active Perception as Reasoning for Omni-Modal Understanding},
  author={Zhenghao Xing and Ruiyang Xu and Yuxuan Wang and Jinzheng He and Ziyang Ma and Qize Yang and Yunfei Chu and Jin Xu and Junyang Lin and Chi-Wing Fu and Pheng-Ann Heng},
  booktitle={International Conference on Machine Learning (ICML)},
  year={2026}
}