voxtream2-he
Streaming Hebrew text-to-speech. A Hebrew (and mixed Hebrew/English) finetune of
herimor/voxtream2, packaged as a clean,
installable Python inference library.
- Base model: herimor/voxtream2
- Code / GitHub repo: flowtyone/voxtream2-he
This repository hosts the model weights. The code that loads them lives in the GitHub
repo above; weights are fetched automatically the first time you construct SpeechGenerator().
Because the model is small and fully streaming, outputs are not always stable. Commercial use is allowed but not recommended.
What it does
Voxtream2 is a zero-shot, full-stream text-to-speech model: you give it a few seconds of any voice plus some text, and it speaks that text in that voice, with no per-voice training needed.
- Built for streaming. It produces audio in small chunks (one ~80 ms frame at a time) while it reads the text. The first chunk comes back in tens of milliseconds, so it suits live, interactive, real-time use (voice agents, dubbing, assistants).
- Faster than real-time. On a consumer GPU it generates audio several times faster than the audio takes to play back.
- Voice cloning from a short sample. Point it at a 3 to 10 second clip of a target voice and it mimics that speaker's accent, tone, and timbre without any fine-tuning.
- Adjustable speaking rate. You can ask for faster or slower speech, even mid-sentence.
This is the Hebrew finetune. It speaks Hebrew (with English mixed in) and keeps all of the
streaming behavior above. The generator consumes phonemes (a Hebrew G2P front-end with
espeak-ng as the English fallback) and emits 24 kHz mono audio frames.
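As a rough sanity check on the figures above, the frame duration and the sample rate together fix how many samples each streamed frame carries. The sketch below assumes frames are exactly 80 ms (the text says roughly 80 ms); the helper names are illustrative and not part of the library.

```python
SAMPLE_RATE_HZ = 24_000  # output rate stated above
FRAME_MS = 80            # assumed exact; the text says "~80 ms"


def samples_per_frame(sample_rate_hz: int = SAMPLE_RATE_HZ, frame_ms: int = FRAME_MS) -> int:
    """Number of mono samples in one streamed frame."""
    return sample_rate_hz * frame_ms // 1000


def frames_for_duration(seconds: float, frame_ms: int = FRAME_MS) -> int:
    """Frames needed to cover `seconds` of audio, rounding up."""
    total_ms = round(seconds * 1000)
    return -(-total_ms // frame_ms)  # ceiling division on integers


print(samples_per_frame())       # 1920
print(frames_for_duration(1.0))  # 13
```

Under these assumptions, one second of speech arrives as 13 frames of 1920 samples each, the last one only partly needed.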
Usage
Install the library (see the GitHub repo for full
instructions, including the espeak-ng system dependency):
uv pip install "git+https://github.com/flowtyone/voxtream2-he.git"
Weights download from this repo automatically on first use:
import numpy as np
import soundfile as sf
from voxtream2_he import SpeechGenerator

gen = SpeechGenerator()  # weights auto-download from this repo
phonemizer = gen.make_phonemizer()
prompt = gen.prepare_prompt_from_path("voices/shaul.wav")
phonemes = phonemizer.phonemize("שלום, מה שלומך היום?", language="he")

# Collect the streamed audio frames; the second item of each pair is unused here
frames = [
    frame
    for frame, _ in gen.generate_stream(
        text=phonemes,
        prepared_prompt=prompt,
        language="he",
        normalize_text=False,
        input_is_phonemes=True,
    )
]
sf.write("out.wav", np.concatenate(frames), gen.config.mimi_sr)  # 24 kHz mono
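To check the streaming behavior on your own hardware, one option is to time the generator as it yields. The sketch below is not part of the library: `measure_stream` is a hypothetical helper that accepts any iterable of `(frame, extra)` pairs, which is the shape `generate_stream` yields in the example above, and it assumes each frame is a 1-D sequence of samples at 24 kHz.

```python
import time
from typing import Iterable, Sequence, Tuple


def measure_stream(
    stream: Iterable[Tuple[Sequence[float], object]],
    sample_rate: int = 24_000,
) -> dict:
    """Consume a frame stream and report first-chunk latency and speedup."""
    start = time.perf_counter()
    first_chunk_s = None
    total_samples = 0
    n_frames = 0
    for frame, _ in stream:
        if first_chunk_s is None:
            # Time from starting iteration to receiving the first frame
            first_chunk_s = time.perf_counter() - start
        total_samples += len(frame)
        n_frames += 1
    elapsed_s = time.perf_counter() - start
    audio_s = total_samples / sample_rate
    return {
        "frames": n_frames,
        "audio_seconds": audio_s,
        "first_chunk_seconds": first_chunk_s,
        # Above 1.0 means audio is produced faster than it plays back
        "speedup": audio_s / elapsed_s if elapsed_s > 0 else float("inf"),
    }
```

With the objects from the example above, this could be called as `measure_stream(gen.generate_stream(text=phonemes, prepared_prompt=prompt, language="he", normalize_text=False, input_is_phonemes=True))`. The timings include any one-time warm-up on the first call, so a second run is likely a fairer measurement.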
Pre-download ahead of time (e.g. to bake into an image or warm the cache):
hf download flowty1/voxtream2-he
Files in this repo
| File | What it is |
|---|---|
| model.safetensors | Full model weights (phone/temp/dep formers + heads) |
| config.json | Model architecture configuration |
| phoneme_to_token.json | Phoneme-to-token vocabulary used by the generator |
| renikud/model.onnx | Hebrew grapheme-to-phoneme (G2P) front-end (ReNikud, ONNX) |
The Mimi codec (kyutai/moshiko-pytorch-bf16), the ReDimNet speaker encoder
(IDRnD/ReDimNet, via torch.hub), and the optional Sidon prompt enhancer
(sarulab-speech/sidon-v0.1) are pulled from their own upstream repos, not from here.
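When the weights are baked into an image, it can help to confirm that the download is complete before first use. The following is a minimal sketch that checks a local directory for the files listed in the table above. It assumes the directory mirrors this repo's layout (for example, one populated with the `--local-dir` option of `hf download`); `missing_files` is an illustrative helper, not a library function.

```python
from pathlib import Path

# File names taken from the "Files in this repo" table.
EXPECTED_FILES = (
    "model.safetensors",
    "config.json",
    "phoneme_to_token.json",
    "renikud/model.onnx",
)


def missing_files(local_dir: str) -> list[str]:
    """Return the expected files that are absent from `local_dir`."""
    root = Path(local_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

An empty list means all four files are present. The check does not cover the Mimi, ReDimNet, or Sidon components, which come from their own upstream repos.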
Acknowledgements
- Voxtream2: the upstream full-stream TTS model. herimor/voxtream
- ReNikud: audio-supervised Hebrew grapheme-to-phoneme (G2P) model used as the Hebrew front-end. renikud/renikud
- Mimi audio codec, from Kyutai's Moshi.
- ReDimNet speaker encoder (IDRnD/ReDimNet).
- Sidon speech enhancement (sarulab-speech/Sidon), optional prompt enhancement.
License
GPL-3.0. Bundled third-party components (Moshi/Mimi, ReDimNet, Sidon) retain their respective upstream licenses.