chatterbox-tts-et-lobiseja
Estonian text-to-speech. A hobby fine-tune of Chatterbox Multilingual V3 that teaches the model to read Estonian legibly while keeping zero-shot voice cloning intact. lobiseja is Estonian for "chatterbox."
The goal was "intelligible Estonian", and I think we got more than that.
Mostly inspired by the work of TartuNLP and their Neurokõne synthesiser. Where Neurokõne is convolutional (DeepVoice 3), this model builds on an autoregressive base. Training the adapter took 80 minutes on a single RTX 3090.
I am not educated in this area. This is purely a curiosity that turned out to be quite a fun adventure.
Samples
Mari, Albert, Vesta and Kalev were part of the original training set.
Dwight and Samantha are zero-shot English references.
| Voice | Complex numbers | Phonological length (välde) |
|---|---|---|
| Mari | ||
| Albert | ||
| Vesta | ||
| Kalev | ||
| Dwight | ||
| Samantha | ||
Loudness varies between samples because the reference recordings differ in volume.
Setup
git clone https://huggingface.co/Mamsu/chatterbox-tts-et-lobiseja
cd chatterbox-tts-et-lobiseja
# recommended to use uv
uv python pin 3.11
uv sync
# or, with pip (use Python 3.11 yourself):
pip install -r requirements.txt
Usage
# Tested on Python 3.11
from chatterbox_et import EstonianTTS # in this repo, required for preprocessing (see Files)
# Load straight from the cloned repo dir — the weights are already here, no re-download.
tts = EstonianTTS.from_local(".")
# Built-in default voice:
wav, sr = tts.synth("Tere! Kell on 15 ja täna on 28. mai 2016.")
# Your own reference voice (ideally in Estonian):
wav, sr = tts.synth("Linna tänavad viivad linna.", audio_prompt_path="ref.wav")
# With generation controls (defaults shown):
wav, sr = tts.synth(
"Linna tänavad viivad linna.",
temperature=0.8,
cfg_weight=0.5,
exaggeration=0.5,
repetition_penalty=1.2,
)
import soundfile as sf
sf.write("out.wav", wav, sr)
synth() applies Estonian text normalisation (numbers/dates/abbreviations → spoken words),
splits the input into clauses, and synthesises + stitches them. The generation knobs
(temperature, cfg_weight, exaggeration, repetition_penalty) are keyword arguments to
synth(), with the defaults shown above.
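The split-and-stitch step described above can be pictured with a minimal sketch. This is not the code in `chatterbox_et`: the names `split_clauses`, `synth_stitched`, `synth_clause` and `pause_s` are hypothetical, the punctuation-based split rule is an assumption, and the audio is modelled as a plain list of samples so the example runs without any model.

```python
import re
from typing import Callable, List, Tuple


def split_clauses(text: str) -> List[str]:
    """Split after clause/sentence punctuation, keeping the punctuation.

    Assumes the text is already normalised, so a dot no longer belongs
    to a number or an abbreviation.
    """
    parts = re.split(r"(?<=[.!?,;:])\s+", text.strip())
    return [p for p in parts if p]


def synth_stitched(
    text: str,
    synth_clause: Callable[[str], List[float]],  # hypothetical per-clause synthesiser
    sr: int,
    pause_s: float = 0.1,  # hypothetical gap between clips, in seconds
) -> Tuple[List[float], int]:
    """Synthesise each clause separately and join the clips with short silences."""
    silence = [0.0] * round(sr * pause_s)
    out: List[float] = []
    for i, clause in enumerate(split_clauses(text)):
        if i > 0:
            out.extend(silence)
        out.extend(synth_clause(clause))
    return out, sr


if __name__ == "__main__":
    # Stand-in synthesiser: one sample per character, just to show the plumbing.
    fake = lambda clause: [1.0] * len(clause)
    print(split_clauses("Linna tänavad viivad linna, ja tagasi. Tere!"))
    wav, sr = synth_stitched("Linna tänavad viivad linna, ja tagasi.", fake, sr=1000)
    print(len(wav), sr)
```

Because each clause is a separate generation call, a fixed seed does not reproduce the result of one long generation, which matches the caveat under Limitations.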
Text normalisation
The model was trained on text passed through TartuNLP/tts_preprocess_et, applying transformations that convert numbers and abbreviations to full spoken words.
e.g. "1995" is transformed to "tuhat üheksasada üheksakümmend viis" for the model.
The normalisation retains the special words found in the vocabulary, such as [laughter] or [gasp]. These have their own issues; read more under Limitations.
If you side-step my preprocessor, it is still strongly recommended to normalise the text the same way. Behaviour on un-normalised text is undefined.
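To make the "1995" example concrete, here is a toy number speller. It is only an illustration and not how tts_preprocess_et works: the function name `number_to_estonian` is hypothetical, and it covers only nominative cardinals from 0 to 9999. It ignores ordinals, grammatical case, dates and abbreviations, all of which the real normaliser has to deal with.

```python
ONES = ["null", "üks", "kaks", "kolm", "neli", "viis",
        "kuus", "seitse", "kaheksa", "üheksa"]


def number_to_estonian(n: int) -> str:
    """Spell a cardinal number 0..9999 in nominative Estonian (toy version)."""
    if not 0 <= n <= 9999:
        raise ValueError("this sketch only handles 0..9999")
    if n < 10:
        return ONES[n]

    words = []
    thousands, rest = divmod(n, 1000)
    if thousands:
        words.append("tuhat" if thousands == 1 else f"{ONES[thousands]} tuhat")

    hundreds, rest = divmod(rest, 100)
    if hundreds:
        words.append("sada" if hundreds == 1 else f"{ONES[hundreds]}sada")

    if rest == 10:
        words.append("kümme")
    elif 10 < rest < 20:
        words.append(f"{ONES[rest - 10]}teist")
    elif rest >= 20:
        tens, ones = divmod(rest, 10)
        words.append(f"{ONES[tens]}kümmend")
        if ones:
            words.append(ONES[ones])
    elif rest:
        words.append(ONES[rest])

    return " ".join(words)


if __name__ == "__main__":
    print(number_to_estonian(1995))  # tuhat üheksasada üheksakümmend viis
```

For real input, stick with the bundled preprocessing: a year, an ordinal such as "28." and a plain count can all need different spoken forms.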
Limitations
- Lacks Estonglish. As the training data was news-focused, the fine-tune is unable to pronounce foreign words correctly without workarounds. While 'correct' Estonian does not mix English into the vocabulary, our everyday language does, for better and for worse (e.g. "Fire" should be written as "Faier" for the model).
- Phonological length (välde) can be hit-or-miss. I'm actually rather surprised by how well the model performs, but don't be surprised when Q2 and Q3 get mixed up. This is a well-known issue for any speech synthesiser handling Estonian. Vabamorf as another preprocessor could be useful here, but I have not tested it yet.
- Premature halting by the upstream analyser. It doesn't like repeating tokens, which turn out to be fairly common. This causes generations to end early, usually after commas or periods where the speaker takes a short pause. The included toolkit works around this by splitting the text into clauses, generating a clip for each, and concatenating them. This kind of messes with the seed and is an ugly hack in general. Future improvements!
- Special words inherited from the base model, such as [laughter] or [gasp], straight up do not work even though we retain them through normalisation. Since the training data had none of these, I imagine the weights were polished out during fine-tuning. You may try them, but usually you'll just hear "õhh!".
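One possible workaround for the foreign-word limitation is to respell such words before synthesis. The sketch below is an assumption about how that could be done, not part of this repo: `respell_foreign_words` and the `RESPELLINGS` table are hypothetical, and only the "Fire" to "Faier" pair comes from this card. Any other entries would have to be written by ear.

```python
import re
from typing import Dict

# Hypothetical lookup table; only this pair is taken from the model card.
RESPELLINGS: Dict[str, str] = {"fire": "faier"}


def respell_foreign_words(text: str, respellings: Dict[str, str] = RESPELLINGS) -> str:
    """Replace known foreign words with Estonian-style phonetic spellings."""

    def swap(match: "re.Match[str]") -> str:
        word = match.group(0)
        replacement = respellings.get(word.lower())
        if replacement is None:
            return word  # unknown word: leave untouched
        # Keep the capitalisation of the first letter.
        return replacement.capitalize() if word[0].isupper() else replacement

    # [^\W\d_]+ matches runs of letters, including õ, ä, ö, ü.
    return re.sub(r"[^\W\d_]+", swap, text)


if __name__ == "__main__":
    print(respell_foreign_words("See on Fire!"))  # See on Faier!
```

The respelled string can then be passed to `synth()` as usual.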
Watermarking
Output is watermarked with Resemble AI's Perth.
Perth enables you to embed imperceptible watermarks in audio files and later detect them, even after the audio has undergone various transformations or manipulations. The library implements multiple watermarking techniques including neural network-based approaches.
Training data
TartuNLP news-sentence speech corpus: ~65.9h, ~36k sentences, 4 speakers, CC-BY. All credit to the Institute of Computer Science of Tartu Ülikool.
Only about 40.5h of this was actually used: abbreviation-heavy sentences and sentences longer than 10 seconds were dropped. The data is Estonian read speech (TartuNLP konekorpus, 4 speakers) with normalised text.
The released weights are T3 with the adapter merged in (r=128, alpha=256), plus full-rank embeddings for the new vocabulary; ve and s3gen were kept frozen.
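For readers unfamiliar with the notation: assuming r and alpha are the usual LoRA rank and scaling factor (the card does not name the adapter type), merging folds the low-rank update into each adapted weight matrix. Under the standard LoRA convention this is:

$$
W_{\text{merged}} = W_{\text{base}} + \frac{\alpha}{r}\, B A,
\qquad B \in \mathbb{R}^{d_{\text{out}} \times r},\;
A \in \mathbb{R}^{r \times d_{\text{in}}},
\qquad \frac{\alpha}{r} = \frac{256}{128} = 2 .
$$

After merging, no adapter is needed at inference time; the checkpoint loads like an ordinary set of T3 weights.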
Acknowledgements & Credits
- Chatterbox TTS (Resemble AI) - base model
- Perth (Resemble AI) - audio watermarking
- chatterbox-finetuning - initial testing, extensively modified
- tts_preprocess_et + EstNLTK - Estonian text normalisation
- Institute of Computer Science of University of Tartu and the Institute of the Estonian Language (Eesti Keele Instituut) - for the speech corpus and tts_preprocess_et, and for the decades of Estonian language-tech (EstNLTK, vabamorf) that make any of this possible.
- The report "Närvivõrgu põhise kõnesünteesi arendamine" (2020): Liisa Rätsep, Liisi Piits, Hille Pajupuu, Indrek Hein, Mark Fišel.
- The four (anonymised) voice talents whose readings make up the corpus.
License
The merged weights and configurations in this repository are released under the MIT License.
- The base model, Chatterbox, is licensed under MIT by Resemble AI.
- The training data, TartuNLP news-sentence speech corpus, is licensed by University of Tartu under CC-BY; the corpus does not specify which version of the licence applies.