OmniAgent-RL-7B
OmniAgent-RL-7B is the final checkpoint of OmniAgent, the first native omni-modal agent for active perception in video understanding (ICML 2026). It is built on Qwen2.5-Omni-7B and trained in two stages — Agentic SFT, then Agentic RL with TAURA. This is the recommended checkpoint and reproduces the main results in the paper.
- 📄 Paper: Native Active Perception as Reasoning for Omni-Modal Understanding
- 💻 Code: https://github.com/HarryHsing/OmniAgent
- 🤗 Models: OmniAgent-RL-7B · OmniAgent-SFT-7B
What it does
Instead of consuming every frame, OmniAgent runs an Observation–Thought–Action (OTA) loop: a single omni model decides what to look at (get_frames), listen to (get_audio), or watch (get_clip) on demand, distills each percept into a compact textual memory, and answers when it has enough evidence. The environment only returns raw media — all perception and reasoning are done by this model.
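To make the control flow concrete, the loop described above can be sketched in a few lines of Python. This is only an illustration, not the repository's implementation: the tool names `get_frames`, `get_audio`, and `get_clip` come from this card, while `Action`, `AgentState`, `run_ota_loop`, the `(start_s, end_s)` argument shape, the step budget, and the scripted stand-ins for the model and environment are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Tool names are taken from the model card; everything else here is assumed.
TOOLS = ("get_frames", "get_audio", "get_clip")


@dataclass
class Action:
    name: str                               # one of TOOLS, or "answer"
    args: Tuple[float, float] = (0.0, 0.0)  # hypothetical (start_s, end_s) window
    text: str = ""                          # final answer when name == "answer"


@dataclass
class AgentState:
    question: str
    memory: List[str] = field(default_factory=list)  # compact textual notes


def fake_environment(action: Action) -> str:
    """Stand-in environment: returns a placeholder for raw media, nothing more."""
    start, end = action.args
    return f"<raw {action.name} media {start:.0f}-{end:.0f}s>"


def run_ota_loop(
    state: AgentState,
    policy: Callable[[AgentState], Action],
    summarize: Callable[[Action, str], str],
    env: Callable[[Action], str],
    max_steps: int = 8,  # assumed budget, not a value from the paper
) -> str:
    """Alternate thought/action and observation until the policy answers."""
    for _ in range(max_steps):
        action = policy(state)             # thought + action (the model's job)
        if action.name == "answer":
            return action.text
        if action.name not in TOOLS:
            raise ValueError(f"unknown tool: {action.name}")
        observation = env(action)          # environment returns raw media only
        state.memory.append(summarize(action, observation))  # distill to text
    return "no answer within step budget"


def scripted_policy(state: AgentState) -> Action:
    """Toy stand-in for the model: look, then listen, then answer."""
    if len(state.memory) == 0:
        return Action("get_frames", (0.0, 60.0))
    if len(state.memory) == 1:
        return Action("get_audio", (20.0, 30.0))
    return Action("answer", text="B")


def toy_summarize(action: Action, observation: str) -> str:
    """Toy stand-in for the model distilling a percept into a memory note."""
    start, end = action.args
    return f"{action.name} {start:.0f}-{end:.0f}s: note"
```

Calling `run_ota_loop(AgentState("Which option is correct?"), scripted_policy, toy_summarize, fake_environment)` returns `"B"` after two tool calls, leaving two notes in `state.memory`. In the real system both `policy` and `summarize` are the same omni model; they are split into two callables here only to keep the sketch readable.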
How to use
⚠️ This is an agent checkpoint. The weights follow the Qwen2.5-Omni-7B architecture, but to reproduce OmniAgent's active perception you must run it inside the OTA environment in the GitHub repo, not via a plain transformers call.
# After setup (see the repo README):
bash demo/launch_inference.sh checkpoints/OmniAgent-RL-7B assets/example_video_mcq.mp4
Model size & components
OmniAgent uses only the Qwen2.5-Omni thinker for agent reasoning: only the thinker is loaded during training and inference, and the talker (audio generation) is never used. The text-to-audio tag on this page is derived automatically from the Qwen2.5-Omni architecture in config.json and does not reflect how OmniAgent is used.
Which checkpoint should I use?
- OmniAgent-RL-7B (this model) — best performance; use for inference, evaluation, and deployment.
- OmniAgent-SFT-7B — the cold-start checkpoint; use it to re-run Agentic RL yourself or to study the SFT-only stage.
Results (highlights)
- Open-source SoTA across 10 video / audio-visual / temporal-grounding benchmarks.
- On LVBench, this 7B agent outperforms the roughly 10× larger Qwen2.5-VL-72B (50.5 vs. 47.3) while using ~73% fewer frames.
- Large temporal-grounding gains over the Qwen2.5-Omni-7B base (+33 IoU on LongVALE and VUE-TR).
See the paper for the full tables.
Limitations
The sequential OTA loop adds inference latency compared to a single forward pass; reducing this via parallel exploration is left to future work.
Citation
@inproceedings{xing2026omniagent,
title={Native Active Perception as Reasoning for Omni-Modal Understanding},
author={Zhenghao Xing and Ruiyang Xu and Yuxuan Wang and Jinzheng He and Ziyang Ma and Qize Yang and Yunfei Chu and Jin Xu and Junyang Lin and Chi-Wing Fu and Pheng-Ann Heng},
booktitle={International Conference on Machine Learning (ICML)},
year={2026}
}