arxiv:2505.09439

Omni-R1: Do You Really Need Audio to Fine-Tune Your Audio LLM?

Published on May 14, 2025 · Submitted by Andrew Rouditchenko on May 15, 2025

Abstract

Omni-R1 fine-tunes Qwen2.5-Omni with GRPO on an audio question answering dataset, achieving state-of-the-art accuracy on the MMAU benchmark in the sounds, music, speech, and overall average categories.

We propose Omni-R1, which fine-tunes a recent multi-modal LLM, Qwen2.5-Omni, on an audio question answering dataset with the reinforcement learning method GRPO. This yields new state-of-the-art performance on the recent MMAU benchmark: Omni-R1 achieves the highest accuracies in the sounds, music, speech, and overall average categories on both the Test-mini and Test-full splits. To understand the improvement, we tested models both with and without audio input and found that much of the gain from GRPO could be attributed to better text-based reasoning. We also made a surprising discovery: fine-tuning without audio on a text-only dataset was effective at improving audio-based performance.
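For readers unfamiliar with GRPO, the core idea is that several answers are sampled for the same question and each answer's reward is normalized against the rest of its group, so no separate value model is needed. The sketch below illustrates only that normalization step. It is not the authors' training code: the exact-match `accuracy_reward`, the use of population standard deviation, the `eps` constant, and the sample answers are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def accuracy_reward(predicted: str, correct: str) -> float:
    """Hypothetical reward: 1.0 if the predicted choice matches, else 0.0."""
    return 1.0 if predicted.strip().lower() == correct.strip().lower() else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward by the mean and std of its own sampling group.

    Population std is an assumption here; some implementations use sample std.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One multiple-choice question with four sampled answers (illustrative only).
samples = ["B", "A", "B", "C"]
rewards = [accuracy_reward(s, "B") for s in samples]
advantages = group_relative_advantages(rewards)
```

Correct answers receive a positive advantage and incorrect ones a negative advantage. When every sample in a group earns the same reward, all advantages are zero and that group contributes no learning signal.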


