HoLo-ToLk (TTS) — a tokenizer-free HSL text-to-speech feasibility model

A feasibility demonstration that the zero-parameter byte-signal substrate hsl-embedding-zero enables tokenizer-free text-to-speech. Pipeline: UTF-8 text bytes → frozen HSL 27-D features (no tokenizer, vocabulary, or learned input embedding) → Pre-LN transformer → AR mel decoder with guided attention → HiFi-GAN vocoder → natural-sounding speech.

Code / reproduce: https://github.com/Woojiggun/HoLo-ToLk-TTS
Live demo (combined line Space): https://huggingface.co/spaces/ggunio/HoLo-ToLk
Author: Jinhyun Woo (ggunio5782@gmail.com) · License: CC BY-NC 4.0 (non-commercial; the HSL substrate is separately MIT)

The honest claim

HSL works as a tokenizer-free TTS input: raw UTF-8 bytes are mapped to frozen 27-D HSL features, with no tokenizer, vocabulary, or learned input embedding, and fed to a Pre-LN transformer with an AR mel decoder, guided attention, and a HiFi-GAN vocoder. The result is a natural-sounding voice. Held-out teacher-forced mel-L1 is 0.296 for seed 0, and the result holds across seeds 0–3 (0.296 / 0.293 / 0.292 / 0.290, range 0.006). The model is single-speaker (LJSpeech) and English-only. This is a feasibility demonstration, NOT a multi-speaker or production TTS system, and it makes no SOTA claim.

seed	held-out mel-L1 (teacher-forced)
0	0.296
1	0.293
2	0.292
3	0.290
mean	0.293 (range 0.006)

The genuine strength is the natural voice — guided attention drives a monotonic alignment, the stop token fires, and the HiFi-GAN output is intelligible and natural to listen to. But it is still one speaker and a feasibility demo.

Intended use

Research / educational demonstration that a lossless byte substrate (hsl-embedding-zero) works as a tokenizer-free / vocab-free TTS input — no learned input embedding, just raw UTF-8 bytes → frozen HSL → transformer.
A worked example of an AR mel-decoder + guided-attention TTS pipeline driven entirely by byte features, vocoded with a pretrained HiFi-GAN.

Not intended for production speech synthesis, multi-speaker / voice-cloning use, or any setting needing a specific or controllable voice.

How it works

UTF-8 text bytes → frozen 27-D hsl-embedding-zero features (no lens — text is HSL's home) → LayerNorm + linear projection → Pre-LN transformer text encoder (4 layers, dim 384) → AR mel decoder (prenet → causal self-attention → cross-attention to text) with a guided-attention loss (monotonic alignment) → stop token → Postnet (5× conv1d residual) → 80-bin log-mel (22.05 kHz, jik876/HiFi-GAN-exact filterbank) → speechbrain/tts-hifigan-ljspeech vocoder → 22050 Hz waveform. See the repo README and tts_lens.py for the exact architecture.
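The guided-attention term can be illustrated with a minimal sketch. It assumes the common penalty form W[t][n] = 1 - exp(-((n/N - t/T)^2) / (2σ^2)), where N is the number of text positions (here, UTF-8 bytes) and T the number of mel frames; the exact formulation in tts_lens.py may differ. The function names and the toy attention matrices are hypothetical, and only σ = 0.2 comes from the training flags on this card.

```python
import math

def guided_attention_weights(n_text, n_frames, sigma=0.2):
    """Penalty matrix W[t][n]: near 0 on the diagonal, near 1 far from it.

    Assumed form (not confirmed against tts_lens.py):
        W[t][n] = 1 - exp(-((n/N - t/T)^2) / (2 * sigma^2))
    """
    return [
        [1.0 - math.exp(-((n / n_text - t / n_frames) ** 2) / (2.0 * sigma ** 2))
         for n in range(n_text)]
        for t in range(n_frames)
    ]

def guided_attention_loss(attn, sigma=0.2):
    """Mean of attention * penalty over all (frame, text-position) pairs."""
    n_frames, n_text = len(attn), len(attn[0])
    w = guided_attention_weights(n_text, n_frames, sigma)
    total = sum(attn[t][n] * w[t][n]
                for t in range(n_frames) for n in range(n_text))
    return total / (n_frames * n_text)

# Text positions are raw UTF-8 bytes, so N is simply the byte count.
text_bytes = list("Hello.".encode("utf-8"))
N, T = len(text_bytes), 12  # T = 12 mel frames is an arbitrary toy value

# Toy one-hot attention maps: one monotonic (diagonal), one reversed.
diagonal = [[1.0 if n == t * N // T else 0.0 for n in range(N)] for t in range(T)]
reversed_ = [[1.0 if n == N - 1 - t * N // T else 0.0 for n in range(N)] for t in range(T)]

diag_loss = guided_attention_loss(diagonal)
rev_loss = guided_attention_loss(reversed_)
print(f"diagonal: {diag_loss:.4f}  reversed: {rev_loss:.4f}")
```

A monotonic alignment incurs a much smaller penalty than a reversed one, which is how this term pushes the cross-attention toward the diagonal during training.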

Training data

LJSpeech 1.1 — a single-speaker, public-domain English read-speech corpus (~24h, one female narrator). ~12.8k clips for training; ~256 held out (every 50th clip, deterministic). Audio stored at 22.05 kHz mono FLAC; mel targets computed on the fly. The dataset is not redistributed; rebuild it with collect_ljspeech.py (streams from MikhailT/lj-speech, split full).
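A deterministic "every 50th clip" hold-out can be sketched as below. This is an illustration only: the starting offset, the clip ordering, and any cap on the held-out size are assumptions, so the counts it produces need not match those of collect_ljspeech.py. The clip IDs and the function name are hypothetical.

```python
def split_every_kth(items, k=50):
    """Send every k-th item (indices 0, k, 2k, ...) to the held-out set."""
    train, held_out = [], []
    for index, item in enumerate(items):
        if index % k == 0:
            held_out.append(item)
        else:
            train.append(item)
    return train, held_out

# Hypothetical clip IDs; a real run would list the corpus in a fixed order.
clip_ids = [f"clip-{i:04d}" for i in range(1000)]
train_ids, held_out_ids = split_every_kth(clip_ids, k=50)
print(len(train_ids), len(held_out_ids))
```

Because the split depends only on the position of each clip in a fixed ordering, it is reproducible across runs without a random seed.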

Training procedure

AR transformer TTS with teacher forcing. dim 384 / enc-layers 4 / dec-layers 4 / heads 6 / ff 1536, batch 24, AdamW lr 3e-4 (cosine, 2000-step warmup), 40k steps, grad-clip 1.0, guided-attention loss (--ga-weight 1.0 --ga-sigma 0.2), SortaGrad short-text-first, seed 0. Loss = L1(mel_pre) + L1(mel_post) + BCE(stop) + guided-attention. The reported checkpoint is the best held-out mel-L1 over training.
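The learning-rate schedule and the loss sum can be sketched as follows. The sketch assumes a linear warmup and a cosine decay to zero at the final step; the card states only "cosine, 2000-step warmup", so the warmup shape and the floor are assumptions. The function names and the example loss values are hypothetical.

```python
import math

def lr_at(step, base_lr=3e-4, warmup=2000, total=40000):
    """Learning rate at a given optimizer step.

    Assumed shape: linear warmup to base_lr, then cosine decay to 0.
    """
    if step < warmup:
        return base_lr * (step + 1) / warmup
    progress = min(1.0, (step - warmup) / max(1, total - warmup))
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

def total_loss(l1_pre, l1_post, bce_stop, guided_attn, ga_weight=1.0):
    """Loss = L1(mel_pre) + L1(mel_post) + BCE(stop) + ga_weight * guided-attention."""
    return l1_pre + l1_post + bce_stop + ga_weight * guided_attn

for step in (0, 1999, 2000, 21000, 40000):
    print(step, f"{lr_at(step):.2e}")

# Placeholder component values, for illustration only.
example = total_loss(l1_pre=0.5, l1_post=0.25, bce_stop=0.125, guided_attn=0.125)
```

With --ga-weight 1.0 the guided-attention term enters the sum unscaled, which matches the loss as written on this card.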

Evaluation

Held-out teacher-forced mel-L1 (L1 between predicted and target log-mel) on the LJSpeech held-out split. Reported: mel-L1 0.296 (seed 0); multi-seed seeds 0–3 = 0.296 / 0.293 / 0.292 / 0.290 (mean 0.293, range 0.006). Free-run (autoregressive) synthesis is the qualitative test — the voice is natural and intelligible with the HiFi-GAN vocoder; per-clip free-run quality varies and is best on clear, sentence-length English input.
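As a small illustration of the metric, mel-L1 is the mean absolute difference between predicted and target log-mel values, and the multi-seed summary follows directly from the four reported scores. The mel_l1 helper and its toy 2×2 inputs are hypothetical; whether the evaluation script masks padded frames or averages per clip is not stated here, so the plain mean below is an assumption.

```python
def mel_l1(pred, target):
    """Mean absolute error over all (frame, mel-bin) entries."""
    total, count = 0.0, 0
    for pred_frame, target_frame in zip(pred, target):
        for p, t in zip(pred_frame, target_frame):
            total += abs(p - t)
            count += 1
    return total / count

# Toy 2-frame, 2-bin example (real mels have 80 bins per frame).
toy = mel_l1([[0.0, 1.0], [2.0, 3.0]], [[0.5, 1.0], [2.0, 2.0]])

# Held-out scores reported on this card, keyed by seed.
seed_scores = {0: 0.296, 1: 0.293, 2: 0.292, 3: 0.290}
values = list(seed_scores.values())
mean_score = sum(values) / len(values)
score_range = max(values) - min(values)
print(f"mean {mean_score:.3f}, range {score_range:.3f}")
```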

Limitations

Single-speaker (LJSpeech, one narrator) — no speaker control or voice cloning; it only speaks in that one voice.
English only (LJSpeech English read speech) — not trained or evaluated on other languages.
The headline number is a teacher-forced mel-L1, not a naturalness benchmark (no MOS); free-run quality varies clip to clip.
A feasibility / works demonstration, not state-of-the-art TTS or a claim of general superiority over dedicated TTS systems.

Files / checkpoint to upload

Upload the frozen seed-0 model to this model repo:

tts_lens_ga_best.pt — guided-attention AR TTS, 40k steps, seed 0, held-out mel-L1 0.296 (~254 MB). Self-contained: stores its own config, so tts_lens.py rebuilds the model on load.

Synthesize (free-run, natural voice; the --vocoder hifigan flag is required):

pip install hsl-embedding-zero zstandard soundfile librosa speechbrain torch torchaudio
python tts_lens.py --synth --synth-n 5 --vocoder hifigan \
  --data ./data/ljspeech --synth-ckpt tts_lens_ga_best.pt

Acknowledgments

Independent research, developed in collaboration with AI assistants — Claude Code (Anthropic) and Codex. The HSL work and experimental direction are the author's; the tools assisted with engineering and review.

Citation

@software{woo_holotolk_tts_2026,
  author = {Jinhyun Woo},
  title  = {HoLo-ToLk (TTS): a tokenizer-free HSL text-to-speech feasibility model},
  year   = {2026},
  doi    = {10.5281/zenodo.21007475},
  url    = {https://github.com/Woojiggun/HoLo-ToLk-TTS}
}
