voxtream2-he

Streaming Hebrew text-to-speech. A Hebrew (and mixed Hebrew/English) finetune of herimor/voxtream2, packaged as a clean, installable Python inference library.

Base model: herimor/voxtream2
Code / GitHub repo: flowtyone/voxtream2-he

This repository hosts the model weights. The code that loads them lives in the GitHub repo above; weights are fetched automatically the first time you construct SpeechGenerator().

Because the model is small and fully streaming, outputs are not always stable. Commercial use is allowed but not recommended.

What it does

Voxtream2 is a zero-shot, full-stream text-to-speech model: you give it a few seconds of any voice plus some text, and it speaks that text in that voice — no per-voice training needed.

Built for streaming. It produces audio in small chunks (one ~80 ms frame at a time) while it reads the text. The first chunk comes back in tens of milliseconds, so it suits live, interactive, real-time use (voice agents, dubbing, assistants).
Faster than real-time. On a consumer GPU it generates audio several times faster than playback speed.
Voice cloning from a short sample. Point it at a 3–10 second clip of a target voice and it mimics that speaker's accent, tone, and timbre without any fine-tuning.
Adjustable speaking rate. You can ask for faster or slower speech, even mid-sentence.
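
To check the latency and speed claims above on your own hardware, you can time any frame stream. The sketch below is not part of the library: `measure_stream` and `fake_stream` are hypothetical helpers, and it assumes each frame is a 1-D sequence of samples at the model's 24 kHz output rate. The stand-in generator only simulates a model so that the example runs on its own.

```python
import time
from typing import Iterable, Iterator, List, Sequence, Tuple

SAMPLE_RATE = 24_000  # Hz; assumed to match the model's output rate


def measure_stream(
    frames: Iterable[Sequence[float]], sample_rate: int = SAMPLE_RATE
) -> Tuple[float, float]:
    """Return (seconds until the first frame, speed-up over playback)."""
    start = time.perf_counter()
    first_frame_latency = None
    total_samples = 0
    for frame in frames:
        if first_frame_latency is None:
            first_frame_latency = time.perf_counter() - start
        total_samples += len(frame)
    if first_frame_latency is None:
        raise ValueError("stream produced no frames")
    # Guard against a zero elapsed time on very coarse clocks.
    elapsed = max(time.perf_counter() - start, 1e-9)
    audio_seconds = total_samples / sample_rate
    return first_frame_latency, audio_seconds / elapsed


def fake_stream(n_frames: int = 5, frame_samples: int = 1920) -> Iterator[List[float]]:
    """Stand-in for a model: yields silent 80 ms frames after a short delay."""
    for _ in range(n_frames):
        time.sleep(0.001)
        yield [0.0] * frame_samples


latency, speedup = measure_stream(fake_stream())
print(f"first frame after {latency * 1000:.1f} ms, {speedup:.1f}x faster than playback")
```

A speed-up above 1.0 means audio is produced faster than it plays. With the real generator you would pass something like `(frame for frame, _ in gen.generate_stream(...))` instead of `fake_stream()`.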

This is the Hebrew finetune. It speaks Hebrew (with English mixed in) and keeps all of the streaming behavior above. The generator consumes phonemes, produced by a Hebrew G2P front-end with espeak-ng as the English fallback, and emits 24 kHz mono audio frames.
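
To make the idea of a Hebrew front-end with an English fallback concrete, here is a minimal sketch of splitting mixed text into Hebrew and Latin runs by Unicode range. It illustrates the general technique only and is not how the library's phonemizer routes text; `script_of` and `split_by_script` are hypothetical names, and the choice to attach spaces, digits, and punctuation to the preceding run is an assumption.

```python
from typing import List, Optional, Tuple


def script_of(ch: str) -> str:
    """Classify a character as Hebrew, Latin, or neutral (space, digit, punctuation)."""
    if "\u0590" <= ch <= "\u05FF":  # Unicode Hebrew block
        return "he"
    if ch.isascii() and ch.isalpha():
        return "en"
    return "neutral"


def split_by_script(text: str, default: str = "he") -> List[Tuple[str, str]]:
    """Split text into (language, run) pairs; neutral characters stay with the current run."""
    runs: List[Tuple[str, str]] = []
    current: Optional[str] = None
    buffer = ""
    for ch in text:
        lang = script_of(ch)
        if lang == "neutral" or lang == current:
            buffer += ch
        elif current is None:
            # First scripted character: leading neutrals join this run.
            current = lang
            buffer += ch
        else:
            # Script changed: close the current run and start a new one.
            runs.append((current, buffer))
            current, buffer = lang, ch
    if buffer:
        runs.append((current or default, buffer))
    return runs


for lang, run in split_by_script("שלום world, מה נשמע?"):
    print(lang, repr(run))
```

Each run could then be handed to the matching phonemizer. Text with no letters at all falls back to the `default` language.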

Usage

Install the library (see the GitHub repo for full instructions, including the espeak-ng system dependency):

uv pip install "git+https://github.com/flowtyone/voxtream2-he.git"

Weights download from this repo automatically on first use:

import numpy as np
import soundfile as sf
from voxtream2_he import SpeechGenerator

gen = SpeechGenerator()                          # weights auto-download from this repo
phonemizer = gen.make_phonemizer()

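# Encode a 3-10 second reference clip of the target voice.
prompt = gen.prepare_prompt_from_path("voices/shaul.wav")
# Phonemize up front, so generation can run with input_is_phonemes=True.
phonemes = phonemizer.phonemize("שלום, מה שלומך היום?", language="he")

# Collect the audio frames as they stream out; the second tuple item is unused here.
frames = [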
    frame
    for frame, _ in gen.generate_stream(
        text=phonemes,
        prepared_prompt=prompt,
        language="he",
        normalize_text=False,
        input_is_phonemes=True,
    )
]
sf.write("out.wav", np.concatenate(frames), gen.config.mimi_sr)  # 24 kHz mono
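
The example above buffers every frame before writing. For long or live output you may prefer to write each frame as it arrives. The following is a sketch using only the standard library; `write_stream_to_wav` is a hypothetical helper, and it assumes each frame is a 1-D sequence of floats in the range [-1, 1], which it converts to 16-bit PCM.

```python
import sys
import wave
from array import array
from typing import Iterable, Sequence


def write_stream_to_wav(
    path: str, frames: Iterable[Sequence[float]], sample_rate: int = 24_000
) -> int:
    """Write float frames to a 16-bit mono WAV as they arrive; return samples written."""
    written = 0
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 16-bit PCM
        wav.setframerate(sample_rate)
        for frame in frames:
            # Clip to [-1, 1] and scale to the signed 16-bit range.
            pcm = array("h", (int(max(-1.0, min(1.0, float(s))) * 32767) for s in frame))
            if sys.byteorder == "big":
                pcm.byteswap()  # WAV data is little-endian
            wav.writeframes(pcm.tobytes())
            written += len(pcm)
    return written
```

With the generator from the example, this could be called as `write_stream_to_wav("out.wav", (frame for frame, _ in gen.generate_stream(...)), gen.config.mimi_sr)`, reusing the same arguments as above.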

Pre-download ahead of time (e.g. to bake into an image or warm the cache):

hf download flowty1/voxtream2-he
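
The same pre-download can be done from Python, which is handy inside a build script. This sketch uses `snapshot_download` from the `huggingface_hub` package; `prefetch_weights` is a hypothetical wrapper. It assumes the library reads weights from the default Hugging Face cache, so `cache_dir` is left unset unless you know otherwise.

```python
from typing import Optional

REPO_ID = "flowty1/voxtream2-he"  # the model repo named in the `hf download` command


def prefetch_weights(cache_dir: Optional[str] = None) -> str:
    """Download every file in the model repo and return the local snapshot path."""
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, cache_dir=cache_dir)
```

Calling `prefetch_weights()` once at image build time should leave later `SpeechGenerator()` constructions with nothing to fetch from this repo, under the cache assumption stated above.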

Files in this repo

File	What it is
`model.safetensors`	Full model weights (phone/temp/dep formers + heads)
`config.json`	Model architecture configuration
`phoneme_to_token.json`	Phoneme → token vocabulary used by the generator
`renikud/model.onnx`	Hebrew grapheme-to-phoneme (G2P) front-end (ReNikud, ONNX)
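
If you bake the weights into an image, a quick check that all four files are present can catch a partial download early. This is a minimal sketch: `missing_files` is a hypothetical helper, and the directory you pass is whatever local folder holds your copy of this repo.

```python
from pathlib import Path
from typing import List

# File names taken from the table above.
EXPECTED_FILES = [
    "model.safetensors",
    "config.json",
    "phoneme_to_token.json",
    "renikud/model.onnx",
]


def missing_files(snapshot_dir: str) -> List[str]:
    """Return the expected files that are absent from snapshot_dir."""
    root = Path(snapshot_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

An empty list means the local copy is complete as far as this repo is concerned; the upstream components listed below are fetched separately and are not covered by this check.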

The Mimi codec (kyutai/moshiko-pytorch-bf16), the ReDimNet speaker encoder (IDRnD/ReDimNet, via torch.hub), and the optional Sidon prompt enhancer (sarulab-speech/sidon-v0.1) are pulled from their own upstream repos, not from here.

Acknowledgements

Voxtream2 — the upstream full-stream TTS model. herimor/voxtream
ReNikud — audio-supervised Hebrew grapheme-to-phoneme (G2P) model used as the Hebrew front-end. renikud/renikud
Mimi audio codec, from Kyutai's Moshi.
ReDimNet speaker encoder (IDRnD/ReDimNet).
Sidon speech enhancement (sarulab-speech/Sidon), optional prompt enhancement.

License

GPL-3.0. Bundled third-party components (Moshi/Mimi, ReDimNet, Sidon) retain their respective upstream licenses.

Model size: 0.5B params (Safetensors, tensor type F32)
