---
language:
  - en
  - zh
license: other
license_name: license-term-of-stabletoken
tags:
  - speech tokenizer
pipeline_tag: audio-to-audio
---

# StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs (ICLR 2026)

StableToken is a noise-robust semantic speech tokenizer that performs discrete speech representation learning, achieving state-of-the-art stability in noisy environments.

[📄 Paper](https://arxiv.org/abs/2509.22220) | [💻 GitHub](https://github.com/Tencent/StableToken)

For code and more detailed information, please refer to the corresponding GitHub repository.

## Model Details

| Attribute | Value |
|---|---|
| Frame Rate | 25 Hz |
| Codebook Size | 8,192 |
| BPS (Bits Per Second) | 325 |
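The bit rate follows directly from the two numbers above: each token indexes one of 8,192 codebook entries, i.e. log2(8192) = 13 bits per token, emitted at 25 tokens per second. A quick sanity check:

```python
import math

frame_rate = 25                             # tokens per second
codebook_size = 8192                        # 2**13 codebook entries
bits_per_token = math.log2(codebook_size)   # 13.0 bits to index one entry
bps = frame_rate * bits_per_token
print(bps)  # 325.0
```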

## Quick Start

To use StableToken, please clone the official repository and install dependencies.

### Installation

```bash
git clone --recursive https://github.com/Tencent/StableToken.git
cd StableToken && pip install -r requirements.txt
```

### Inference

```python
import os
from huggingface_hub import snapshot_download
from transformers import WhisperFeatureExtractor
from src.model.modeling_whisper import WhisperLFQEncoder
from src.utils.flow_inference import AudioDecoder
from src.utils.utils import extract_speech_token, speech_token_to_wav

# 1. Download & Load Models
model_dir = snapshot_download("tencent/StableToken")

# Load Tokenizer
tokenizer = WhisperLFQEncoder.from_pretrained(os.path.join(model_dir, "tokenizer")).eval().cuda()
feature_extractor = WhisperFeatureExtractor.from_pretrained(os.path.join(model_dir, "tokenizer"))

# Load Decoder
decoder = AudioDecoder(
    config_path=os.path.join(model_dir, "decoder", "config.yaml"),
    flow_ckpt_path=os.path.join(model_dir, "decoder", "flow.pt"),
    hift_ckpt_path=os.path.join(model_dir, "decoder", "hift.pt"),
    device="cuda"
)

# 2. Tokenize
tokens = extract_speech_token(tokenizer, feature_extractor, ["/path/to/audio.wav"], device="cuda")[0]

# 3. Reconstruct
tts_speech, sampling_rate = speech_token_to_wav(decoder, tokens)
```
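The reconstructed `tts_speech` can then be written to disk. Below is a minimal sketch using only the standard library, assuming the waveform is a sequence of floats in [-1, 1]; `save_wav` is a hypothetical helper, not part of the StableToken API, and the repository may ship its own saving utility:

```python
import struct
import wave

def save_wav(path, samples, sampling_rate):
    """Clip float samples to [-1, 1], quantize to 16-bit PCM, and write a mono WAV."""
    with wave.open(path, "wb") as f:
        f.setnchannels(1)   # mono
        f.setsampwidth(2)   # 16-bit PCM
        f.setframerate(sampling_rate)
        clipped = (max(-1.0, min(1.0, float(s))) for s in samples)
        f.writeframes(b"".join(struct.pack("<h", int(s * 32767)) for s in clipped))

# e.g. save_wav("reconstruction.wav", tts_speech.squeeze().cpu().numpy(), sampling_rate)
```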

## Performance

StableToken achieves roughly 60% lower UED (Unit Edit Distance) than the best existing supervised semantic tokenizers.

### Noise Robustness (UED ↓)

| Model | Frame Rate | Codebook Size | UED (%, ↓) |
|---|---|---|---|
| GLM-4-Voice-Tokenizer | 12.5 Hz | 16,384 | 31.10 |
| S3 Tokenizer | 25 Hz | 4,096 | 26.17 |
| CosyVoice2 | 25 Hz | 6,561 | 38.66 |
| StableToken | 25 Hz | 8,192 | **10.17** 🏆 |
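UED measures how much an utterance's token sequence changes when noise is added to the audio. A common formulation (a sketch only; the paper's exact evaluation protocol may differ) is the Levenshtein edit distance between the clean and noisy token sequences, normalized by the clean sequence length:

```python
def unit_edit_distance(clean, noisy):
    """Levenshtein distance between two token sequences, normalized by len(clean)."""
    m, n = len(clean), len(noisy)
    prev = list(range(n + 1))  # prev[j]: distance between clean[:i-1] and noisy[:j]
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if clean[i - 1] == noisy[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution (or match)
        prev = curr
    return prev[n] / m

# One substituted token out of four -> 0.25
print(unit_edit_distance([1, 2, 3, 4], [1, 2, 9, 4]))  # 0.25
```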

### Reconstruction Quality

Measurements on LibriSpeech (LS) and SEED benchmarks.

| Model | Frame Rate | BPS | WER (↓) LS-clean | WER (↓) LS-other | WER (↓) SEED-en | WER (↓) SEED-zh | MOS (↑) LS-clean | MOS (↑) LS-other | MOS (↑) SEED-en | MOS (↑) SEED-zh |
|---|---|---|---|---|---|---|---|---|---|---|
| GLM-4-Voice-Tokenizer | 12.5 Hz | 175 | 4.04 | 9.33 | 3.54 | 3.23 | 4.07 | 3.99 | 4.16 | 4.10 |
| S3 Tokenizer | 25 Hz | 300 | 5.78 | 13.38 | 5.91 | 4.26 | 3.40 | 3.31 | 3.40 | 3.31 |
| CosyVoice2 | 25 Hz | 325 | 4.25 | 9.68 | 4.34 | 2.75 | 3.36 | 3.25 | 3.31 | 3.58 |
| StableToken | 25 Hz | 325 | 3.84 | 7.99 | 3.44 | 2.62 | 4.09 | 3.83 | 4.01 | 4.18 |
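For reference, WER (word error rate) is the word-level edit distance of the hypothesis transcript against the reference, divided by the number of reference words. A minimal reference implementation follows (in practice, toolkits such as jiwer are typically used):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # word-level Levenshtein DP row
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
        prev = curr
    return prev[-1] / len(ref)

# One substituted word out of six ("the" -> "a")
print(wer("the cat sat on the mat", "the cat sat on a mat"))
```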

## Citation

```bibtex
@article{song2025stabletoken,
  title={StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs},
  author={Song, Yuhan and Zhang, Linhao and Wu, Chuhan and Liu, Aiwei and Jia, Wei and Wang, Houfeng and Zhou, Xiao},
  journal={arXiv preprint arXiv:2509.22220},
  year={2025}
}
```

## License

This project is licensed under the License Term of StableToken.