---
language:
  - en
  - zh
license: other
license_name: license-term-of-stabletoken
tags:
  - speech tokenizer
pipeline_tag: audio-to-audio
---

# StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs (ICLR 2026)

StableToken is a noise-robust semantic speech tokenizer that performs discrete speech representation learning, achieving state-of-the-art stability in noisy environments.

[📄 Paper](https://arxiv.org/abs/2509.22220) | [💻 GitHub](https://github.com/Tencent/StableToken)

For code and more detailed information, please refer to the corresponding GitHub repository.

## Model Details

| Attribute | Value |
|---|---|
| Frame Rate | 25 Hz |
| Codebook Size | 8,192 |
| BPS (Bits Per Second) | 325 |
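The bit rate follows directly from the two numbers above: each token indexes one of 8,192 codebook entries, i.e. log2(8192) = 13 bits per token, emitted at 25 tokens per second. A quick sanity check:

```python
import math

frame_rate = 25                             # tokens per second
codebook_size = 8192                        # 2**13 codebook entries
bits_per_token = math.log2(codebook_size)   # 13.0 bits to index one entry
bps = frame_rate * bits_per_token
print(bps)  # 325.0
```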

## Quick Start

To use StableToken, please clone the official repository and install dependencies.

### Installation

```bash
git clone --recursive https://github.com/Tencent/StableToken.git
cd StableToken && pip install -r requirements.txt
```

### Inference

```python
import os
from huggingface_hub import snapshot_download
from transformers import WhisperFeatureExtractor
from src.model.modeling_whisper import WhisperLFQEncoder
from src.utils.flow_inference import AudioDecoder
from src.utils.utils import extract_speech_token, speech_token_to_wav

# 1. Download & Load Models
model_dir = snapshot_download("tencent/StableToken")

# Load Tokenizer
tokenizer = WhisperLFQEncoder.from_pretrained(os.path.join(model_dir, "tokenizer")).eval().cuda()
feature_extractor = WhisperFeatureExtractor.from_pretrained(os.path.join(model_dir, "tokenizer"))

# Load Decoder
decoder = AudioDecoder(
    config_path=os.path.join(model_dir, "decoder", "config.yaml"),
    flow_ckpt_path=os.path.join(model_dir, "decoder", "flow.pt"),
    hift_ckpt_path=os.path.join(model_dir, "decoder", "hift.pt"),
    device="cuda"
)

# 2. Tokenize
tokens = extract_speech_token(tokenizer, feature_extractor, ["/path/to/audio.wav"], device="cuda")[0]

# 3. Reconstruct
tts_speech, sampling_rate = speech_token_to_wav(decoder, tokens)
```
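The reconstructed `tts_speech` can then be written to disk. Below is a minimal sketch using only the standard library, assuming the waveform is a sequence of floats in [-1, 1]; `save_wav` is a hypothetical helper, not part of the StableToken API, and the repository may ship its own saving utility:

```python
import struct
import wave

def save_wav(path, samples, sampling_rate):
    """Clip float samples to [-1, 1], quantize to 16-bit PCM, and write a mono WAV."""
    with wave.open(path, "wb") as f:
        f.setnchannels(1)   # mono
        f.setsampwidth(2)   # 16-bit PCM
        f.setframerate(sampling_rate)
        clipped = (max(-1.0, min(1.0, float(s))) for s in samples)
        f.writeframes(b"".join(struct.pack("<h", int(s * 32767)) for s in clipped))

# e.g. save_wav("reconstruction.wav", tts_speech.squeeze().cpu().numpy(), sampling_rate)
```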

## Performance

StableToken achieves roughly 60% lower UED (Unit Edit Distance) than the best existing supervised semantic tokenizers.

### Noise Robustness (UED ↓)

| Model | Frame Rate | Codebook Size | UED (%, ↓) |
|---|---|---|---|
| GLM-4-Voice-Tokenizer | 12.5 Hz | 16,384 | 31.10 |
| S3 Tokenizer | 25 Hz | 4,096 | 26.17 |
| CosyVoice2 | 25 Hz | 6,561 | 38.66 |
| StableToken | 25 Hz | 8,192 | **10.17** 🏆 |
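UED measures how much an utterance's token sequence changes when noise is added to the audio. A common formulation (a sketch only; the paper's exact evaluation protocol may differ) is the Levenshtein edit distance between the clean and noisy token sequences, normalized by the clean sequence length:

```python
def unit_edit_distance(clean, noisy):
    """Levenshtein distance between two token sequences, normalized by len(clean)."""
    m, n = len(clean), len(noisy)
    prev = list(range(n + 1))  # prev[j]: distance between clean[:i-1] and noisy[:j]
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if clean[i - 1] == noisy[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution (or match)
        prev = curr
    return prev[n] / m

# One substituted token out of four -> 0.25
print(unit_edit_distance([1, 2, 3, 4], [1, 2, 9, 4]))  # 0.25
```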

### Reconstruction Quality

Measurements on LibriSpeech (LS) and SEED benchmarks.

| Model | Frame Rate | BPS | WER (↓) LS-clean | WER (↓) LS-other | WER (↓) SEED-en | WER (↓) SEED-zh | MOS (↑) LS-clean | MOS (↑) LS-other | MOS (↑) SEED-en | MOS (↑) SEED-zh |
|---|---|---|---|---|---|---|---|---|---|---|
| GLM-4-Voice-Tokenizer | 12.5 Hz | 175 | 4.04 | 9.33 | 3.54 | 3.23 | 4.07 | 3.99 | 4.16 | 4.10 |
| S3 Tokenizer | 25 Hz | 300 | 5.78 | 13.38 | 5.91 | 4.26 | 3.40 | 3.31 | 3.40 | 3.31 |
| CosyVoice2 | 25 Hz | 325 | 4.25 | 9.68 | 4.34 | 2.75 | 3.36 | 3.25 | 3.31 | 3.58 |
| StableToken | 25 Hz | 325 | 3.84 | 7.99 | 3.44 | 2.62 | 4.09 | 3.83 | 4.01 | 4.18 |
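For reference, WER (word error rate) is the word-level edit distance of the hypothesis transcript against the reference, divided by the number of reference words. A minimal reference implementation follows (in practice, toolkits such as jiwer are typically used):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # word-level Levenshtein DP row
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
        prev = curr
    return prev[-1] / len(ref)

# One substituted word out of six ("the" -> "a")
print(wer("the cat sat on the mat", "the cat sat on a mat"))
```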

## Citation

```bibtex
@article{song2025stabletoken,
  title={StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs},
  author={Song, Yuhan and Zhang, Linhao and Wu, Chuhan and Liu, Aiwei and Jia, Wei and Wang, Houfeng and Zhou, Xiao},
  journal={arXiv preprint arXiv:2509.22220},
  year={2025}
}
```

## License

This project is licensed under the License Term of StableToken.