HoLo-ToLk (TTS): a tokenizer-free HSL text-to-speech feasibility model
A works demonstration that the zero-parameter byte-signal substrate hsl-embedding-zero enables
tokenizer-free text-to-speech: UTF-8 text bytes → frozen HSL 27-D features (no tokenizer / vocab /
learned input door) → Pre-LN transformer → AR mel decoder + guided attention → HiFi-GAN → a
natural-sounding voice.
- Code / reproduce: https://github.com/Woojiggun/HoLo-ToLk-TTS
- Live demo (combined line Space): https://huggingface.co/spaces/ggunio/HoLo-ToLk
- Author: Jinhyun Woo (ggunio5782@gmail.com) · License: CC BY-NC 4.0 (non-commercial; the HSL substrate is separately MIT)
The honest claim
Tokenizer-free TTS on the HSL substrate: text UTF-8 bytes → frozen HSL 27-D (no tokenizer / vocab / learned input door) → Pre-LN transformer → AR mel decoder + guided attention → HiFi-GAN → a natural-sounding voice. Held-out teacher-forced mel-L1 0.296, multi-seed confirmed (seeds 0–3 = 0.296 / 0.293 / 0.292 / 0.290, range 0.006). Single-speaker (LJSpeech), English. A feasibility / works demonstration that HSL works as a tokenizer-free TTS input; NOT a multi-speaker or production TTS system, and no SOTA claim.
| seed | held-out mel-L1 (teacher-forced) |
|---|---|
| 0 | 0.296 |
| 1 | 0.293 |
| 2 | 0.292 |
| 3 | 0.290 |
| mean | 0.293 (range 0.006) |
The genuine strength is the natural voice: guided attention drives a monotonic alignment, the stop token fires, and the HiFi-GAN output is intelligible and natural to listen to. But it is still one speaker and a feasibility demo.
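The summary row of the table can be rechecked with a few lines of arithmetic. This is only a sanity-check sketch: the per-seed values are copied from the table, and the half-up rounding mode is an assumption about how the reported mean was rounded (the exact mean sits on a rounding boundary).

```python
from decimal import Decimal, ROUND_HALF_UP

# Held-out teacher-forced mel-L1 per seed, copied from the table above.
mel_l1_by_seed = {
    0: Decimal("0.296"),
    1: Decimal("0.293"),
    2: Decimal("0.292"),
    3: Decimal("0.290"),
}

values = list(mel_l1_by_seed.values())
seed_mean = sum(values) / len(values)    # exact mean, before rounding
seed_range = max(values) - min(values)   # spread between best and worst seed

# Assumption: the reported mean uses half-up rounding to three decimals.
rounded_mean = seed_mean.quantize(Decimal("0.001"), rounding=ROUND_HALF_UP)

print("exact mean:", seed_mean)
print("rounded mean:", rounded_mean)
print("range:", seed_range)
```

Using `Decimal` rather than floats keeps the boundary case exact, so the rounding step does not depend on binary floating-point representation.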
Intended use
- Research / educational demonstration that a lossless byte substrate (hsl-embedding-zero) works as a tokenizer-free / vocab-free TTS input: no learned input embedding, just raw UTF-8 bytes → frozen HSL → transformer.
- A worked example of an AR mel-decoder + guided-attention TTS pipeline driven entirely by byte features, vocoded with a pretrained HiFi-GAN.
Not intended for production speech synthesis, multi-speaker / voice-cloning use, or any setting needing a specific or controllable voice.
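To make "raw UTF-8 bytes" concrete, the sketch below shows the only text preprocessing a byte-level front end needs: encoding a string to byte values, with a lossless round trip and no vocabulary file. The helper names are hypothetical, and the frozen HSL feature step is deliberately not shown because the hsl-embedding-zero API is not described here.

```python
def text_to_byte_ids(text: str) -> list[int]:
    """Encode text as UTF-8 and return the raw byte values (each 0-255)."""
    return list(text.encode("utf-8"))


def byte_ids_to_text(byte_ids: list[int]) -> str:
    """Invert text_to_byte_ids; the mapping is lossless."""
    return bytes(byte_ids).decode("utf-8")


# ASCII characters take one byte each; other characters take several.
ascii_ids = text_to_byte_ids("voice")
accented_ids = text_to_byte_ids("café")

print(ascii_ids)     # [118, 111, 105, 99, 101]
print(accented_ids)  # [99, 97, 102, 195, 169]

# No vocabulary lookup is involved, and nothing is lost on the way back.
assert byte_ids_to_text(accented_ids) == "café"
```

Because every possible input maps to values in 0-255, there is no out-of-vocabulary case to handle at the input.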
How it works
UTF-8 text bytes → frozen 27-D hsl-embedding-zero features (no lens; text is HSL's home) →
LayerNorm + linear projection → Pre-LN transformer text encoder (4 layers, dim 384) → AR mel decoder
(prenet → causal self-attention → cross-attention to text) with a guided-attention loss (monotonic
alignment) → stop token → Postnet (5× conv1d residual) → 80-bin log-mel (22.05 kHz, jik876/HiFi-GAN-exact
filterbank) → speechbrain/tts-hifigan-ljspeech vocoder → 22050 Hz waveform. See the repo README and
tts_lens.py for the exact architecture.
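The guided-attention term can be illustrated with the commonly used diagonal-penalty formulation, in which attention mass far from the text/frame diagonal is penalised. This is a minimal pure-Python sketch under that assumption; the function names are hypothetical, the default sigma of 0.2 mirrors the --ga-sigma value used in training, and the actual implementation in tts_lens.py may differ (for example in masking or normalisation).

```python
import math


def guided_attention_weights(n_text: int, n_frames: int, sigma: float = 0.2):
    """Penalty matrix W[t][n]: 0 on the diagonal, approaching 1 far from it."""
    weights = []
    for t in range(n_frames):
        row = []
        for n in range(n_text):
            # Distance between relative text position and relative frame position.
            offset = n / n_text - t / n_frames
            row.append(1.0 - math.exp(-(offset ** 2) / (2.0 * sigma ** 2)))
        weights.append(row)
    return weights


def guided_attention_loss(attention, sigma: float = 0.2) -> float:
    """Mean of attention * penalty over all (frame, text) cells."""
    n_frames, n_text = len(attention), len(attention[0])
    weights = guided_attention_weights(n_text, n_frames, sigma)
    total = sum(
        attention[t][n] * weights[t][n]
        for t in range(n_frames)
        for n in range(n_text)
    )
    return total / (n_frames * n_text)


# Toy alignments: one monotonic (diagonal), one running backwards.
size = 8
diagonal = [[1.0 if n == t else 0.0 for n in range(size)] for t in range(size)]
backwards = [[1.0 if n == size - 1 - t else 0.0 for n in range(size)] for t in range(size)]

print(guided_attention_loss(diagonal))
print(guided_attention_loss(backwards))
```

A perfectly diagonal alignment incurs no penalty, while the backwards alignment is penalised, which is the pressure toward monotonic alignment that the pipeline description refers to.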
Training data
LJSpeech 1.1: a single-speaker, public-domain English read-speech corpus (~24h, one female
narrator). ~12.8k clips for training; ~256 held out (every 50th clip, deterministic). Audio stored at
22.05 kHz mono FLAC; mel targets computed on the fly. The dataset is not redistributed; rebuild it
with collect_ljspeech.py (streams from MikhailT/lj-speech, split full).
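A deterministic "every 50th clip" hold-out can be sketched as follows. The clip ids are hypothetical placeholders, and holding out the clips whose index is a multiple of 50 is an assumption; collect_ljspeech.py may use a different offset or filter clips first, so the counts here only approximate the figures quoted above.

```python
def split_every_nth(clip_ids, every: int = 50):
    """Hold out every `every`-th clip by position; the rest go to training."""
    train, held_out = [], []
    for index, clip_id in enumerate(clip_ids):
        if index % every == 0:
            held_out.append(clip_id)
        else:
            train.append(clip_id)
    return train, held_out


# Hypothetical ids; LJSpeech 1.1 contains 13,100 clips in total.
clip_ids = [f"clip_{i:05d}" for i in range(13_100)]
train_ids, held_out_ids = split_every_nth(clip_ids)

print(len(train_ids), len(held_out_ids))

# The split depends only on clip order, so rerunning it gives the same result.
assert split_every_nth(clip_ids) == (train_ids, held_out_ids)
```

Because no random number generator is involved, the held-out set stays the same across seeds, which keeps multi-seed mel-L1 numbers comparable.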
Training procedure
AR transformer TTS with teacher forcing. dim 384 / enc-layers 4 / dec-layers 4 / heads 6 / ff 1536,
batch 24, AdamW lr 3e-4 (cosine, 2000-step warmup), 40k steps, grad-clip 1.0, guided-attention loss
(--ga-weight 1.0 --ga-sigma 0.2), SortaGrad short-text-first, seed 0. Loss = L1(mel_pre) + L1(mel_post) +
BCE(stop) + guided-attention. The reported checkpoint is the best held-out mel-L1 over training.
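The learning-rate schedule named above (peak 3e-4, 2000-step warmup, cosine decay over 40k steps) can be sketched as a plain function of the step. The linear warmup shape and the decay to exactly zero are assumptions; the training script may use a different warmup curve or a non-zero floor.

```python
import math


def learning_rate(step: int, peak_lr: float = 3e-4,
                  warmup_steps: int = 2000, total_steps: int = 40_000) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero at total_steps."""
    if step < warmup_steps:
        # Assumed linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


for step in (0, 1000, 2000, 21_000, 40_000):
    print(step, learning_rate(step))
```

Under these assumptions the rate peaks at the end of warmup, is half the peak at the midpoint of the decay phase (step 21,000), and reaches zero at step 40,000.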
Evaluation
Held-out teacher-forced mel-L1 (L1 between predicted and target log-mel) on the LJSpeech held-out split. Reported: mel-L1 0.296 (seed 0); multi-seed seeds 0–3 = 0.296 / 0.293 / 0.292 / 0.290 (mean 0.293, range 0.006). Free-run (autoregressive) synthesis is the qualitative test: the voice is natural and intelligible with the HiFi-GAN vocoder; per-clip free-run quality varies and is best on clear, sentence-length English input.
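As a rough illustration of the metric, mel-L1 can be computed as the mean absolute difference between predicted and target log-mel values. The sketch below assumes a plain mean over every (frame, mel-bin) cell with no padding mask; the reduction actually used in tts_lens.py is not specified here and may differ. The function name and toy values are hypothetical.

```python
def mel_l1(predicted, target) -> float:
    """Mean absolute difference over all (frame, mel-bin) cells."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same number of frames")
    total, count = 0.0, 0
    for pred_frame, target_frame in zip(predicted, target):
        if len(pred_frame) != len(target_frame):
            raise ValueError("frames must have the same number of mel bins")
        for p, t in zip(pred_frame, target_frame):
            total += abs(p - t)
            count += 1
    return total / count


# Toy log-mel "spectrograms": 2 frames x 3 bins (the real model uses 80 bins).
predicted = [[0.0, 1.0, 2.0], [1.0, 1.0, 1.0]]
target = [[0.5, 1.0, 1.0], [1.0, 2.0, 1.0]]

score = mel_l1(predicted, target)
print(score)
```

Because the decoder is teacher-forced during this evaluation, predicted and target frame counts match by construction, which is why no alignment step is needed before taking the difference.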
Limitations
- Single-speaker (LJSpeech, one narrator): no speaker control or voice cloning; it only speaks in that one voice.
- English only (LJSpeech English read speech): not trained or evaluated on other languages.
- The headline number is a teacher-forced mel-L1, not a naturalness benchmark (no MOS); free-run quality varies clip to clip.
- A feasibility / works demonstration, not state-of-the-art TTS or a claim of general superiority over dedicated TTS systems.
Files / checkpoint to upload
Upload the frozen seed-0 model to this model repo:
tts_lens_ga_best.pt: guided-attention AR TTS, 40k steps, seed 0, held-out mel-L1 0.296 (~254 MB). Self-contained: stores its own config, so tts_lens.py rebuilds the model on load.
Synthesize (free-run, natural voice; --vocoder hifigan is required):
pip install hsl-embedding-zero zstandard soundfile librosa speechbrain torch torchaudio
python tts_lens.py --synth --synth-n 5 --vocoder hifigan \
--data ./data/ljspeech --synth-ckpt tts_lens_ga_best.pt
Acknowledgments
Independent research, developed in collaboration with AI assistants: Claude Code (Anthropic) and Codex. The HSL work and experimental direction are the author's; the tools assisted with engineering and review.
Citation
@software{woo_holotolk_tts_2026,
author = {Jinhyun Woo},
title = {HoLo-ToLk (TTS): a tokenizer-free HSL text-to-speech feasibility model},
year = {2026},
doi = {10.5281/zenodo.21007475},
url = {https://github.com/Woojiggun/HoLo-ToLk-TTS}
}