HoliTok

HoliTok is a continuous, holistic speech tokenization model for unified generation and understanding modeling. It encodes 48 kHz speech into a compact 25 Hz sequence of 128-dimensional latents, and aims to preserve signal-level fidelity, incorporate semantic information, and maintain strong latent learnability.
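
As a quick check of the rates quoted above, the following sketch works out how many input samples each latent frame covers and roughly how many frames a clip of a given length yields. The helper `expected_latent_frames` is hypothetical and not part of the HoliTok package. The rounding-up rule is an assumption, since the model's padding behaviour is not described here, so the frame count the model actually returns may differ slightly.

```python
# Figures taken from the description above.
SAMPLE_RATE_HZ = 48_000   # input audio sample rate
FRAME_RATE_HZ = 25        # latent frame rate
LATENT_DIM = 128          # channels per latent frame

# Each latent frame summarizes this many input samples.
SAMPLES_PER_FRAME = SAMPLE_RATE_HZ // FRAME_RATE_HZ


def expected_latent_frames(num_samples: int) -> int:
    """Approximate latent frame count for a clip.

    Assumption: partial frames are rounded up. The real model may pad
    or truncate differently.
    """
    return -(-num_samples // SAMPLES_PER_FRAME)  # ceiling division


if __name__ == "__main__":
    frames = expected_latent_frames(SAMPLE_RATE_HZ)  # a one-second clip
    print("samples per frame:", SAMPLES_PER_FRAME)
    print("frames per second:", frames)
    print("latent values per second:", frames * LATENT_DIM)
```

Under these figures, one second of audio (48,000 samples) maps to 25 frames of 128 values each, which is 3,200 latent values.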

Paper: HoliTok: A Continuous Holistic Tokenization with Robust Dual Capabilities of Speech Generation and Understanding
Code: GitHub Repository

This repository hosts runtime-only HoliTok checkpoints for inference. The checkpoints are loaded by the HoliTok inference package through the public aliases HoliTok-Base and HoliTok-Unite.

Sample Usage

You can use the HoliTok Python API for audio tokenization, reconstruction, and semantic feature extraction.

import torch
from holitok import HoliTok, SemanticModule

model = HoliTok.from_pretrained("HoliTok-Unite", device="cuda:0")
# Placeholder input: one second of random mono audio at 48 kHz, shaped [B, 1, samples]
audio = torch.randn(1, 1, 48000, device="cuda:0")

# [B, 2 * latent_dim, T], concat(mu, log_std)
posterior = model.encode_posterior(audio)

# [B, latent_dim, T]
latents = model.posterior_mean(posterior)

# [B, 1, samples]
recon = model.decode(latents)

# Semantic feature extraction: the module takes latents as [B, T, latent_dim]
semantic = SemanticModule.from_pretrained("HoliTok-Unite", device="cuda:0")
features = semantic(latents.transpose(1, 2))  # [B, T, 1536]
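
The shape comment on `encode_posterior` says the posterior concatenates the mean and log standard deviation along the channel axis. Assuming that layout (mean first, then log-std), the posterior can be split and sampled with the standard reparameterization trick in plain PyTorch. The helpers `split_posterior` and `sample_posterior` are hypothetical illustrations and not part of the HoliTok API. A random tensor stands in for a real posterior so the sketch runs without a checkpoint.

```python
import torch


def split_posterior(posterior: torch.Tensor):
    """Split [B, 2 * latent_dim, T] into (mu, log_std), each [B, latent_dim, T].

    Assumption: channels are ordered concat(mu, log_std).
    """
    mu, log_std = torch.chunk(posterior, 2, dim=1)
    return mu, log_std


def sample_posterior(posterior: torch.Tensor, generator=None) -> torch.Tensor:
    """Draw z = mu + exp(log_std) * eps with eps ~ N(0, I)."""
    mu, log_std = split_posterior(posterior)
    eps = torch.randn(mu.shape, generator=generator, dtype=mu.dtype, device=mu.device)
    return mu + torch.exp(log_std) * eps


if __name__ == "__main__":
    # Stand-in for model.encode_posterior(audio): 1 clip, 2 * 128 channels, 25 frames.
    posterior = torch.randn(1, 256, 25)
    mu, log_std = split_posterior(posterior)
    z = sample_posterior(posterior)
    print(mu.shape, log_std.shape, z.shape)
```

Using the mean, as `posterior_mean` does in the usage above, gives a deterministic latent, while sampling adds noise scaled by the predicted standard deviation. Which of the two suits a given downstream task is not stated here.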

Citation

@misc{li2026holitokacoutinuousholistictokenization,
      title={HoliTok: A Continuous Holistic Tokenization with Robust Dual Capabilities of Speech Generation and Understanding},
      author={Bohan Li and Shi Lian and Hankun Wang and Yiwei Guo and Yu Xi and Zhihan Li and Da Zheng and Colin Zhang and Kai Yu},
      year={2026},
      eprint={2605.29948},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2605.29948}, 
}
