HoliTok: A Continuous Holistic Tokenization with Robust Dual Capabilities of Speech Generation and Understanding
Paper: arXiv:2605.29948
HoliTok is a continuous, holistic speech tokenization model designed for unified generation-understanding modeling. It encodes 48 kHz speech into a compact 25 Hz sequence of 128-dimensional latents, aiming to preserve signal-level fidelity, incorporate semantic information, and maintain strong latent learnability.
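The rates quoted above imply a fixed ratio between waveform samples and latent frames. The following is a minimal sketch of that arithmetic only; the helper names are hypothetical, and the floor-division frame count assumes no padding or edge handling, which the actual model may do differently.

```python
# Figures taken from the description above.
SAMPLE_RATE_HZ = 48_000   # input speech sample rate
LATENT_RATE_HZ = 25       # latent frame rate
LATENT_DIM = 128          # channels per latent frame

# Waveform samples covered by one latent frame (48000 / 25).
SAMPLES_PER_FRAME = SAMPLE_RATE_HZ // LATENT_RATE_HZ


def expected_latent_frames(num_samples: int) -> int:
    """Approximate latent frame count T for a waveform of num_samples.

    Assumption: plain floor division with no padding; the real encoder
    may round or pad differently at the edges.
    """
    return num_samples // SAMPLES_PER_FRAME


def latent_values_per_second() -> int:
    """Number of scalar latent values produced per second of audio."""
    return LATENT_RATE_HZ * LATENT_DIM


if __name__ == "__main__":
    print(SAMPLES_PER_FRAME)               # samples per latent frame
    print(expected_latent_frames(48_000))  # frames for one second of audio
    print(latent_values_per_second())      # latent scalars per second
```

Under these assumptions, one second of audio (48,000 samples) maps to 25 frames of 128 values each, or 3,200 latent values per second.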
This repository hosts runtime-only HoliTok checkpoints for inference. The checkpoints are loaded by the HoliTok inference package through the public aliases HoliTok-Base and HoliTok-Unite.
You can use the HoliTok Python API for audio tokenization, reconstruction, and semantic feature extraction.
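import torch
from holitok import HoliTok, SemanticModule

# Load the tokenizer checkpoint by its public alias.
model = HoliTok.from_pretrained("HoliTok-Unite", device="cuda:0")

# Dummy input: one second of single-channel 48 kHz audio, shaped [B, 1, samples].
audio = torch.randn(1, 1, 48000, device="cuda:0")

# Posterior parameters: [B, 2 * latent_dim, T], concat(mu, log_std).
posterior = model.encode_posterior(audio)

# Posterior mean, used as the continuous latents: [B, latent_dim, T].
latents = model.posterior_mean(posterior)

# Reconstructed waveform: [B, 1, samples].
recon = model.decode(latents)

# Semantic features from the latents, with time moved to dim 1: [B, T, 1536].
semantic = SemanticModule.from_pretrained("HoliTok-Unite", device="cuda:0")
features = semantic(latents.transpose(1, 2))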
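The example above uses the posterior mean as the latents. Since the posterior is described as `concat(mu, log_std)` along the channel axis, a stochastic latent could in principle be drawn with the usual reparameterization step. The sketch below uses plain PyTorch only; `sample_from_posterior` is a hypothetical helper, not part of the HoliTok API, and it assumes that `mu` occupies the first half of the channels and `log_std` the second.

```python
from typing import Optional

import torch


def sample_from_posterior(
    posterior: torch.Tensor,
    generator: Optional[torch.Generator] = None,
) -> torch.Tensor:
    """Draw z = mu + exp(log_std) * eps from a diagonal Gaussian posterior.

    Hypothetical helper. Assumes posterior is [B, 2 * latent_dim, T] with
    mu in the first half of dim 1 and log_std in the second half.
    """
    # Split the channel axis into the two parameter halves.
    mu, log_std = posterior.chunk(2, dim=1)
    # Standard normal noise with the same shape, dtype, and device as mu.
    eps = torch.randn(
        mu.shape, generator=generator, dtype=mu.dtype, device=mu.device
    )
    return mu + log_std.exp() * eps


if __name__ == "__main__":
    # Stand-in posterior for 25 frames of 128-dimensional latents.
    dummy_posterior = torch.zeros(1, 256, 25)
    z = sample_from_posterior(dummy_posterior)
    print(z.shape)
```

With a very negative `log_std` the sample collapses to `mu`, which matches the deterministic path shown in the example above, provided the channel-order assumption holds.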
@misc{li2026holitokacoutinuousholistictokenization,
title={HoliTok:A Coutinuous Holistic Tokenization with Robust Dual Capabilities of Speech Generation and Understanding},
author={Bohan Li and Shi Lian and Hankun Wang and Yiwei Guo and Yu Xi and Zhihan Li and Da Zheng and Colin Zhang and Kai Yu},
year={2026},
eprint={2605.29948},
archivePrefix={arXiv},
primaryClass={cs.SD},
url={https://arxiv.org/abs/2605.29948},
}