voxtream2-he

Streaming Hebrew text-to-speech. A Hebrew (and mixed Hebrew/English) finetune of herimor/voxtream2, packaged as a clean, installable Python inference library.

This repository hosts the model weights. The code that loads them lives in the GitHub repo (github.com/flowtyone/voxtream2-he); weights are fetched automatically the first time you construct SpeechGenerator().

Because the model is small and fully streaming, outputs are not always stable. Commercial use is allowed but not recommended.

What it does

Voxtream2 is a zero-shot, full-stream text-to-speech model: you give it a few seconds of any voice plus some text, and it speaks that text in that voice, with no per-voice training needed.

  • Built for streaming. It produces audio in small chunks (one ~80 ms frame at a time) while it reads the text. The first chunk comes back in tens of milliseconds, so it suits live, interactive, real-time use (voice agents, dubbing, assistants).
  • Faster than real-time. On a consumer GPU it generates audio several times faster than it takes to play.
  • Voice cloning from a short sample. Point it at a 3 to 10 second clip of a target voice and it mimics that speaker (accent, tone, and timbre) without any fine-tuning.
  • Adjustable speaking rate. You can ask for faster or slower speech, even mid-sentence.

This is the Hebrew finetune. It speaks Hebrew (with English mixed in) and keeps all of the streaming behavior above. The generator consumes phonemes (a Hebrew G2P front-end with espeak-ng as the English fallback) and emits 24 kHz mono audio frames.
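
The frame figures above reduce to simple arithmetic. The sketch below is illustrative only: it assumes each frame is exactly 80 ms (the text says roughly 80 ms) at the stated 24 kHz sample rate, and the helper names are made up for this example rather than taken from the library.

```python
import math

SAMPLE_RATE_HZ = 24_000   # stated output rate: 24 kHz mono
FRAME_MS = 80             # assumption: exactly 80 ms per frame (the text says ~80 ms)


def samples_per_frame(sample_rate_hz: int = SAMPLE_RATE_HZ, frame_ms: int = FRAME_MS) -> int:
    """Number of audio samples carried by one frame."""
    return sample_rate_hz * frame_ms // 1000


def frames_for_duration(seconds: float, frame_ms: int = FRAME_MS) -> int:
    """Frames needed to cover `seconds` of audio, rounding up to whole frames."""
    return math.ceil(seconds * 1000 / frame_ms)


print(samples_per_frame())        # 1920
print(frames_for_duration(3.0))   # 38
```

Under these assumptions the generator emits 12.5 frames per second of speech, so a consumer that buffers audio can size its queue in multiples of 1920 samples.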

Usage

Install the library (see the GitHub repo for full instructions, including the espeak-ng system dependency):

uv pip install "git+https://github.com/flowtyone/voxtream2-he.git"

Weights download from this repo automatically on first use:

import numpy as np
import soundfile as sf
from voxtream2_he import SpeechGenerator

gen = SpeechGenerator()                          # weights auto-download from this repo
phonemizer = gen.make_phonemizer()

prompt = gen.prepare_prompt_from_path("voices/shaul.wav")
phonemes = phonemizer.phonemize("שלום, מה שלומך היום?", language="he")  # "Hello, how are you today?"

frames = [
    frame
    for frame, _ in gen.generate_stream(
        text=phonemes,
        prepared_prompt=prompt,
        language="he",
        normalize_text=False,
        input_is_phonemes=True,
    )
]
sf.write("out.wav", np.concatenate(frames), gen.config.mimi_sr)  # 24 kHz mono
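
To check the latency and faster-than-real-time claims on your own hardware, you can time the frame iterator instead of only collecting it. This is a minimal sketch, not part of the library: `measure_stream` is a hypothetical helper. It assumes each yielded frame is a one-dimensional sequence of samples (as the np.concatenate call in the usage code suggests), and it uses a stand-in generator so that it runs without the model.

```python
import time
from typing import Iterable, Sequence, Tuple

SAMPLE_RATE_HZ = 24_000  # stated output rate; the usage code reads it from gen.config.mimi_sr


def measure_stream(
    frames: Iterable[Sequence[float]],
    sample_rate_hz: int = SAMPLE_RATE_HZ,
) -> Tuple[float, float, float]:
    """Consume a frame iterator and return (first_frame_latency_s, audio_s, speed_factor).

    speed_factor = seconds of audio produced / wall-clock seconds spent,
    so values above 1.0 mean generation outpaces playback.
    """
    start = time.perf_counter()
    first_latency = None
    total_samples = 0
    for frame in frames:
        if first_latency is None:
            # Time until the first frame arrives from the iterator.
            first_latency = time.perf_counter() - start
        total_samples += len(frame)
    elapsed = time.perf_counter() - start
    audio_s = total_samples / sample_rate_hz
    speed_factor = audio_s / elapsed if elapsed > 0 else float("inf")
    if first_latency is None:
        first_latency = float("nan")  # the iterator produced no frames
    return first_latency, audio_s, speed_factor


# Stand-in for the model: 5 silent frames of 1920 samples each.
fake_frames = ([0.0] * 1920 for _ in range(5))
latency, audio_s, speed = measure_stream(fake_frames)
print(f"first frame after {latency * 1000:.2f} ms, {audio_s:.2f} s of audio, {speed:.1f}x real time")
```

With the real generator, the argument would be something like `(frame for frame, _ in gen.generate_stream(...))`, using the same keyword arguments as in the usage code.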

Pre-download ahead of time (e.g. to bake into an image or warm the cache):

hf download flowty1/voxtream2-he
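
The same pre-download can be done from Python, which may be more convenient in a build script. This is a minimal sketch under two assumptions: the `huggingface_hub` package is installed, and the library reads weights from the default Hugging Face cache (this card does not say where SpeechGenerator looks). `prefetch_weights` is a hypothetical helper name, not part of the library.

```python
from typing import Optional

REPO_ID = "flowty1/voxtream2-he"  # repo id taken from the `hf download` command


def prefetch_weights(repo_id: str = REPO_ID, cache_dir: Optional[str] = None) -> str:
    """Download every file in the repo into the Hugging Face cache and return the local path."""
    # Imported lazily so this module can be imported even if huggingface_hub is missing.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, cache_dir=cache_dir)
```

Calling `prefetch_weights()` once during an image build should leave the files cached for later runs, provided the runtime uses the same cache directory.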

Files in this repo

File                     What it is
model.safetensors        Full model weights (phone/temp/dep formers + heads)
config.json              Model architecture configuration
phoneme_to_token.json    Phoneme → token vocabulary used by the generator
renikud/model.onnx       Hebrew grapheme-to-phoneme (G2P) front-end (ReNikud, ONNX)

The Mimi codec (kyutai/moshiko-pytorch-bf16), the ReDimNet speaker encoder (IDRnD/ReDimNet, via torch.hub), and the optional Sidon prompt enhancer (sarulab-speech/sidon-v0.1) are pulled from their own upstream repos, not from here.

Acknowledgements

  • Voxtream2: the upstream full-stream TTS model. herimor/voxtream
  • ReNikud: audio-supervised Hebrew grapheme-to-phoneme (G2P) model used as the Hebrew front-end. renikud/renikud
  • Mimi audio codec, from Kyutai's Moshi.
  • ReDimNet speaker encoder (IDRnD/ReDimNet).
  • Sidon speech enhancement (sarulab-speech/Sidon), optional prompt enhancement.

License

GPL-3.0. Bundled third-party components (Moshi/Mimi, ReDimNet, Sidon) retain their respective upstream licenses.

Model size: 0.5B params (Safetensors, tensor type F32).