Universal Audio Tokenizer: Empowering Semantic Speech Tokenizers with General Audio Perception

Universal Audio Tokenizer is a compact single-codebook audio tokenizer that unifies general audio perception and linguistic alignment for downstream Audio-LLMs.

📄 Paper | 💻 GitHub

For code and more detailed information, please refer to the corresponding GitHub repository.

💡 Highlights

Existing semantic speech tokenizers often suffer from acoustic blindness, while acoustic tokenizers typically lack linguistic alignment.

Universal Audio Tokenizer bridges this gap through:

  • 🧩 Semantic-Acoustic Primitives (SAP) supervision that decomposes raw audio into fundamental linguistic content, vocal attributes, and auditory-scene primitives
  • ⚖️ Semantic-Acoustic Equilibrium (SAE) mechanism that adaptively injects fine-grained acoustic details from shallow encoder layers into deep semantic streams

This results in a compact single-codebook audio tokenizer that simultaneously enables:

  • 🧠 Seamless LLM Integration: A unified audio input/output interface in Audio-LLMs
  • 🗣️ Linguistic Alignment: Superior performance on speech reconstruction and TTS synthesis tasks
  • 🎯 General Audio Perception: Discriminative representations for diverse audio events and strong performance on downstream audio understanding benchmarks

📌 Model Details

| Attribute | Value |
| --- | --- |
| Frame Rate | 25 Hz |
| Codebook Size | 8,192 |
| Bits Per Second (BPS) | 325 |
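
The bitrate follows from the other two values: a single codebook with 8,192 entries needs log2(8192) = 13 bits per token, and 25 tokens per second gives 13 × 25 = 325 bits per second. The sketch below reproduces that arithmetic. The helper names are illustrative and not part of the repository, and the token-count estimate assumes no padding or chunking, which the actual tokenizer may apply.

```python
import math


def bits_per_second(frame_rate_hz: float, codebook_size: int, num_codebooks: int = 1) -> float:
    """Bitrate of a discrete tokenizer: tokens per second times bits per token."""
    return frame_rate_hz * num_codebooks * math.log2(codebook_size)


def approx_token_count(duration_s: float, frame_rate_hz: float = 25.0) -> int:
    """Rough number of tokens for a clip (assumption: no padding or chunking)."""
    return int(duration_s * frame_rate_hz)


print(bits_per_second(25, 8192))   # single codebook of 8,192 entries at 25 Hz
print(approx_token_count(10.0))    # rough token budget for a 10-second clip
```

The same formula is consistent with the 25 Hz / 325 BPS entries reported for other single-codebook tokenizers in the reconstruction table, if they also use 13-bit codebooks.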

🚀 Quick Start

To use Universal Audio Tokenizer, please clone the official repository and install dependencies.

Installation

# 1. Clone the repository with all submodules
git clone --recursive https://github.com/Tencent/Universal_Audio_Tokenizer.git
cd Universal_Audio_Tokenizer

# If you have already cloned the repository without --recursive,
# initialize submodules with:
git submodule update --init --recursive

# 2. Create a conda environment
conda create -n universal-audio-tokenizer python=3.10.13 -y
conda activate universal-audio-tokenizer

# 3. Install dependencies
conda install -c conda-forge libsndfile -y
pip install -r requirements.txt

Download Pretrained Checkpoints

Using huggingface-cli:

huggingface-cli download tencent/Universal_Audio_Tokenizer \
  --local-dir checkpoints/Universal_Audio_Tokenizer

Or using Python:

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="tencent/Universal_Audio_Tokenizer", 
    local_dir="checkpoints/Universal_Audio_Tokenizer"
)

Run Inference

We provide a simple inference demo in example_usage.py.

python example_usage.py \
  --device auto \
  --model_path checkpoints/Universal_Audio_Tokenizer \
  --audio_path /path/to/audio.wav

The script will:

  • load the tokenizer and feature extractor;
  • extract discrete audio tokens from input audio clips;
  • reconstruct waveforms from the tokens and save reconstructed audio under reconstruction/.

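Alternatively, run the inference snippet below directly:

# Run from the repository root so that the `src` package is importable.
import os
import torch
from huggingface_hub import snapshot_download  # required by snapshot_download() below
from transformers import WhisperFeatureExtractor
from src.model.modeling_whisper import WhisperVQEncoder
from src.model.flow_inference import AudioDecoder
from src.model.utils import extract_audio_token, speech_token_to_wav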

# 1. Download & Load Models
model_dir = snapshot_download("tencent/Universal_Audio_Tokenizer")

# Load tokenizer and feature extractor
tokenizer_path = os.path.join(model_dir, "tokenizer")
tokenizer = WhisperVQEncoder.from_pretrained(tokenizer_path).eval().cuda()
feature_extractor = WhisperFeatureExtractor.from_pretrained(tokenizer_path)

# Load decoder
decoder_path = os.path.join(model_dir, "decoder")
decoder = AudioDecoder(
    config_path=os.path.join(decoder_path, "config.yaml"),
    flow_ckpt_path=os.path.join(decoder_path, "flow.pt"),
    hift_ckpt_path=os.path.join(decoder_path, "hift.pt"),
    device="cuda"
)

# 2. Tokenize
tokens = extract_audio_token(tokenizer, feature_extractor, ["/path/to/audio.wav"], device="cuda")[0]

# 3. Reconstruct
tts_speech, sampling_rate = speech_token_to_wav(decoder, tokens)

📊 Performance

Universal Audio Tokenizer learns discriminative representations for diverse audio events, and achieves strong performance on speech reconstruction, downstream audio understanding, and TTS synthesis tasks.

Latent Space Disentanglement

We use high-dimensional token histogram vectors for cluster analysis. The results (Silhouette Score and Cluster Purity) show that our model effectively encodes general audio, with clearer cluster separation in the latent space.

| Model | ESC-10 Sil. (↑) | ESC-10 Purity (↑) | ESC-50 Sil. (↑) | ESC-50 Purity (↑) |
| --- | --- | --- | --- | --- |
| WavTokenizer | -0.030 | 0.450 | -0.108 | 0.215 |
| GLM-4-Voice-Tokenizer | -0.182 | 0.373 | -0.304 | 0.133 |
| CosyVoice2 | -0.016 | 0.413 | -0.100 | 0.216 |
| StableToken | -0.035 | 0.468 | -0.096 | 0.174 |
| Ours | 0.091 | 0.730 | 0.023 | 0.390 |
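
As a rough illustration of the analysis described above, the sketch below turns a token sequence into a histogram vector over the codebook and computes cluster purity from cluster assignments. It is a sketch under stated assumptions, not the authors' evaluation code: the normalization of the histogram and the clustering algorithm that produces `cluster_ids` are not specified in this card, and the function names are hypothetical.

```python
from collections import Counter


def token_histogram(tokens, codebook_size=8192):
    """Normalized count of each codebook index in one clip (normalization is an assumption)."""
    if not tokens:
        raise ValueError("tokens must be non-empty")
    counts = Counter(tokens)
    total = len(tokens)
    return [counts.get(i, 0) / total for i in range(codebook_size)]


def cluster_purity(cluster_ids, class_labels):
    """Fraction of samples that carry the majority class label of their cluster."""
    if len(cluster_ids) != len(class_labels) or not class_labels:
        raise ValueError("inputs must be non-empty and of equal length")
    per_cluster = {}
    for cluster, label in zip(cluster_ids, class_labels):
        per_cluster.setdefault(cluster, Counter())[label] += 1
    majority_total = sum(max(c.values()) for c in per_cluster.values())
    return majority_total / len(class_labels)


# Toy example with a 4-entry codebook and two clusters.
print(token_histogram([1, 1, 3], codebook_size=4))
print(cluster_purity([0, 0, 0, 1, 1, 1], ["dog", "dog", "rain", "rain", "rain", "dog"]))
```

For the Silhouette Score, an existing implementation such as scikit-learn's `silhouette_score` can be applied to the same histogram vectors and cluster assignments.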

High-Quality Speech Reconstruction

Our Universal Audio Tokenizer achieves high-quality speech reconstruction with a compact single-codebook design, significantly improving Word Error Rate (WER) and Mean Opinion Score (MOS) compared to existing supervised semantic tokenizers.

| Model | Frame Rate | BPS | WER (↓) LS-clean | WER (↓) LS-other | WER (↓) SEED-en | WER (↓) SEED-zh | MOS (↑) LS-clean | MOS (↑) LS-other | MOS (↑) SEED-en | MOS (↑) SEED-zh |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| WavTokenizer | 75 Hz | 900 | 5.07 | 13.09 | 5.60 | 4.02 | 3.37 | 3.09 | 3.01 | 3.13 |
| GLM-4-Voice-Tokenizer | 12.5 Hz | 175 | 4.04 | 9.33 | 3.54 | 3.23 | 4.07 | 3.99 | 4.16 | 4.10 |
| CosyVoice2 | 25 Hz | 325 | 4.25 | 9.68 | 4.34 | 2.75 | 3.36 | 3.25 | 3.31 | 3.58 |
| StableToken | 25 Hz | 325 | 3.84 | 7.99 | 3.44 | 2.62 | 4.09 | 3.83 | 4.01 | 4.18 |
| Ours | 25 Hz | 325 | 3.47 | 6.79 | 2.55 | 1.90 | 4.19 | 4.18 | 4.13 | 4.25 |
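
For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The sketch below is a generic implementation, not the evaluation pipeline behind the table: how the transcripts of reconstructed audio are obtained is not described in this card, the table values appear to be percentages, and whitespace splitting would not suit Chinese text, where a character-level variant is typically used instead.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One deleted word out of six reference words.
print(100 * word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```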

Superior Downstream Audio-LLM Performance

When integrated with the Qwen2.5 LLM backbone, our Universal Audio Tokenizer yields superior performance on a wide range of downstream audio understanding benchmarks and controllable TTS synthesis tasks, demonstrating its effectiveness as a unified audio input/output interface for Audio-LLMs.

Audio Understanding

Accuracy on audio understanding benchmarks:

| Tokenizer | MMAU (Speech) | MMAU (Sound) | MMAU (Music) | MMAU (Overall) | MMAR (Speech) | MMAR (Sound) | MMAR (Music) | MMAR (Overall) | MMSU (Perception) | MMSU (Reasoning) | MMSU (Overall) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| WavTokenizer | 36.94 | 60.36 | 57.78 | 51.70 | 39.80 | 31.52 | 29.61 | 36.30 | 32.83 | 45.37 | 38.90 |
| CosyVoice2 | 39.94 | 61.56 | 62.57 | 54.70 | 41.50 | 35.76 | 30.58 | 38.10 | 27.44 | 45.83 | 36.34 |
| GLM-4-Voice-Tokenizer | 43.24 | 60.06 | 62.28 | 55.20 | 39.46 | 40.00 | 36.89 | 40.10 | 32.40 | 47.64 | 39.78 |
| StableToken | 45.05 | 58.56 | 55.99 | 53.20 | 42.18 | 39.39 | 31.07 | 39.10 | 31.98 | 49.71 | 40.56 |
| Ours | 45.05 | 70.27 | 67.96 | 61.10 (+5.90) | 45.24 | 43.64 | 40.29 | 45.80 (+5.70) | 35.54 | 52.07 | 43.54 (+2.98) |
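
The parenthesized values in the last row are not defined in this card, but they match the margin of our overall score over the best baseline on each benchmark. The short check below recomputes them from the overall columns of the table, under that reading; the variable and function names are illustrative.

```python
# Overall accuracies copied from the table above.
baselines = {
    "WavTokenizer": {"MMAU": 51.70, "MMAR": 36.30, "MMSU": 38.90},
    "CosyVoice2": {"MMAU": 54.70, "MMAR": 38.10, "MMSU": 36.34},
    "GLM-4-Voice-Tokenizer": {"MMAU": 55.20, "MMAR": 40.10, "MMSU": 39.78},
    "StableToken": {"MMAU": 53.20, "MMAR": 39.10, "MMSU": 40.56},
}
ours = {"MMAU": 61.10, "MMAR": 45.80, "MMSU": 43.54}


def margin_over_best_baseline(benchmark: str) -> float:
    """Our overall score minus the strongest baseline's overall score."""
    best = max(scores[benchmark] for scores in baselines.values())
    return round(ours[benchmark] - best, 2)


for name in ours:
    print(name, margin_over_best_baseline(name))
```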

Controllable TTS Synthesis

Results on SEED-TTS, measured by speaker similarity (SIM), word error rate (WER), and mean opinion score (MOS).

Tokenizer SIM (↑) WER (↓) MOS (↑)
CosyVoice2 .758 | .762 | .760 2.71 | 1.39 | 2.05 3.75 | 3.37 | 3.56
Ours .792 | .742 | .767 1.78 | 1.29 | 1.54 4.07 | 3.68 | 3.88

Citation

If you find our code or model useful for your research, please cite:

@misc{song2026uniaudiotokenempoweringsemanticspeech,
      title={UniAudio-Token: Empowering Semantic Speech Tokenizers with General Audio Perception}, 
      author={Yuhan Song and Linhao Zhang and Aiwei Liu and Chuhan Wu and Sijun Zhang and Wei Jia and Yuan Liu and Houfeng Wang and Xiao Zhou},
      year={2026},
      eprint={2605.31521},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2605.31521}, 
}

License

This project is licensed under the License Term of Universal_Audio_Tokenizer.
