chatterbox-tts-et-lobiseja

Estonian text-to-speech. A hobby fine-tune of Chatterbox Multilingual V3 that teaches the model to speak Estonian intelligibly while keeping zero-shot voice cloning intact. lobiseja is Estonian for "chatterbox."

The goal was "intelligible Estonian", and I think we got more than that.

Mostly inspired by the work of TartuNLP and their Neurokõne synthesiser. Where Neurokõne is convolutional (DeepVoice 3), this is an autoregressive base. Training the adapter took 80 minutes on a single RTX 3090.

I am not educated in this area. This is purely a curiosity that turned out to be quite a fun adventure.

Samples

Mari, Albert, Vesta and Kalev were part of the original training set.

Dwight and Samantha are zero-shot English references.

[Audio sample table: each voice (Mari, Albert, Vesta, Kalev, Dwight, Samantha) has two clips, one reading complex numbers and one demonstrating phonological length (välde).]

Loudness varies between samples because the reference audio clips differ in volume.

Setup

git clone https://huggingface.co/Mamsu/chatterbox-tts-et-lobiseja
cd chatterbox-tts-et-lobiseja

# recommended to use uv
uv python pin 3.11
uv sync
# or, with pip (use Python 3.11 yourself):
pip install -r requirements.txt

Usage

# Tested on Python 3.11

from chatterbox_et import EstonianTTS   # in this repo, required for preprocessing (see Files)

# Load straight from the cloned repo dir — the weights are already here, no re-download.
tts = EstonianTTS.from_local(".")

# Built-in default voice:
wav, sr = tts.synth("Tere! Kell on 15 ja täna on 28. mai 2016.")

# Your own reference voice (ideally in Estonian):
wav, sr = tts.synth("Linna tänavad viivad linna.", audio_prompt_path="ref.wav")

# With generation controls (defaults shown):
wav, sr = tts.synth(
    "Linna tänavad viivad linna.",
    temperature=0.8,
    cfg_weight=0.5,
    exaggeration=0.5,
    repetition_penalty=1.2,
)

import soundfile as sf
sf.write("out.wav", wav, sr)

synth() applies Estonian text normalisation (numbers/dates/abbreviations → spoken words), splits the input into clauses, and synthesises + stitches them. The generation knobs (temperature, cfg_weight, exaggeration, repetition_penalty) are keyword arguments to synth(), with the defaults shown above.
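The split-and-stitch step described above can be pictured with a minimal sketch. This is not the code in chatterbox_et: split_clauses, stitch, fake_synth and the 150 ms pause are hypothetical names and values, and plain Python lists stand in for audio arrays.

```python
import re

def split_clauses(text):
    """Split after clause-ending punctuation, keeping the punctuation.

    Assumes the text is already normalised, so a period after a digit
    (as in "28. mai") has been spelled out before this runs.
    """
    parts = re.split(r"(?<=[.,;:!?])\s+", text.strip())
    return [p for p in parts if p]

def stitch(clips, sr, pause_ms=150):
    """Concatenate clips with pause_ms of silence between neighbours."""
    silence = [0.0] * (sr * pause_ms // 1000)
    out = []
    for i, clip in enumerate(clips):
        if i > 0:
            out.extend(silence)
        out.extend(clip)
    return out

def fake_synth(clause, sr):
    """Stand-in for the model: 10 ms of silence per character."""
    return [0.0] * (len(clause) * sr // 100)

sr = 24000
clauses = split_clauses("Tere! Linna tänavad viivad linna, aga mitte täna.")
wav = stitch([fake_synth(c, sr) for c in clauses], sr)
```

Generating each clause separately keeps every generation short, which is the point of the workaround, at the cost of prosody that no longer spans the whole sentence.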

Text normalisation

The model was trained on text passed through TartuNLP/tts_preprocess_et, applying transformations that convert numbers and abbreviations to full spoken words.

e.g. "1995" is transformed to "tuhat üheksasada üheksakümmend viis" for the model.
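To make the number expansion concrete, here is a toy sketch that spells out nominative cardinals from 0 to 9999. It is only an illustration: cardinal_et is a hypothetical helper, not part of tts_preprocess_et, which also handles case inflection, ordinals, dates and abbreviations.

```python
ONES = ["null", "üks", "kaks", "kolm", "neli", "viis",
        "kuus", "seitse", "kaheksa", "üheksa"]

def cardinal_et(n):
    """Spell out 0..9999 as an Estonian nominative cardinal (toy example)."""
    if not 0 <= n <= 9999:
        raise ValueError("toy example only covers 0..9999")
    if n < 10:
        return ONES[n]
    words = []
    thousands, rest = divmod(n, 1000)
    hundreds, rest = divmod(rest, 100)
    if thousands:
        words.append("tuhat" if thousands == 1 else ONES[thousands] + " tuhat")
    if hundreds:
        words.append("sada" if hundreds == 1 else ONES[hundreds] + "sada")
    if rest == 10:
        words.append("kümme")
    elif 10 < rest < 20:
        # 11..19 are formed as <unit> + "teist"
        words.append(ONES[rest - 10] + "teist")
    else:
        tens, ones = divmod(rest, 10)
        if tens:
            words.append(ONES[tens] + "kümmend")
        if ones:
            words.append(ONES[ones])
    return " ".join(words)

print(cardinal_et(1995))  # tuhat üheksasada üheksakümmend viis
```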

The normalisation retains the special words found in the vocabulary, such as [laughter] or [gasp]. These have their own issues; read more under Limitations.
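One way to keep bracketed tags intact while the rest of the text is rewritten is to split on the tags and normalise only the pieces in between. The sketch below assumes tags look like [word]; normalise_keeping_tags and the stand-in normaliser are hypothetical and do not reflect how the preprocessor in this repo does it.

```python
import re

# Assumed tag shape: lowercase letters or underscores in square brackets.
TAG = re.compile(r"(\[[a-z_]+\])")

def normalise_keeping_tags(text, normalise):
    """Apply normalise() to everything except bracketed tags."""
    # The capturing group makes re.split return the tags as list items too.
    pieces = TAG.split(text)
    return "".join(p if TAG.fullmatch(p) else normalise(p) for p in pieces)

# Stand-in normaliser for the demo: spell out the digit 5 only.
demo = normalise_keeping_tags("Kell on 5 [laughter] ja nii edasi.",
                              lambda s: s.replace("5", "viis"))
```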

If you side-step my preprocessor, it is still strongly recommended to normalise the text in the same way. Behaviour on unnormalised text is undefined.

Limitations

Lacks Estonglish. The training data was news-focused, so the fine-tune cannot pronounce foreign words correctly without workarounds. 'Correct' Estonian does not mix English into its vocabulary, but our everyday language does, for better and for worse. (e.g. "Fire" should be written as "Faier" for the model)
Phonological length (välde) can be hit-or-miss. I'm actually rather surprised by how well the model performs, but don't be surprised when Q2 and Q3 get mixed up. This is a well-known issue for any speech synthesiser handling Estonian. Vabamorf as another preprocessor could be useful here, but I have not yet tested it.
Premature halting by the upstream analyser. It doesn't like repeating tokens, which turn out to occur fairly often. This causes generations to end earlier than they should, usually after commas or periods where the speaker takes a short pause. The included toolkit works around this by splitting sentences into clauses, generating a clip for each, and concatenating the clips. This kind of messes with the seed and is an ugly hack in general. Future improvements!
Special words inherited from the base, such as [laughter] or [gasp], straight up do not work even though normalisation retains them. Since the training data had none of these, I imagine the weights were polished out during fine-tuning. You may try them, but usually you'll just hear "õhh!".
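The foreign-word workaround from the first item can be automated with a small respelling pass before synthesis. This is a sketch under assumptions: respell and the RESPELL table are hypothetical, and the only entry taken from the text above is "fire" to "faier".

```python
import re

# Hypothetical respelling table; extend it with your own entries.
RESPELL = {"fire": "faier"}

def respell(text, table=RESPELL):
    """Replace whole words found in the table, preserving a leading capital."""
    def swap(match):
        word = match.group(0)
        repl = table.get(word.lower())
        if repl is None:
            return word
        return repl.capitalize() if word[0].isupper() else repl
    return re.sub(r"\w+", swap, text)

print(respell("See on fire."))  # See on faier.
```

Running this before the text reaches synth() leaves ordinary Estonian words untouched, since only exact whole-word matches are replaced.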

Watermarking

Output is watermarked with Resemble AI's Perth.

Perth enables you to embed imperceptible watermarks in audio files and later detect them, even after the audio has undergone various transformations or manipulations. The library implements multiple watermarking techniques including neural network-based approaches.

Training data

TartuNLP news-sentence speech corpus: ~65.9h, ~36k sentences, 4 speakers, CC-BY. All credit to the Institute of Computer Science of Tartu Ülikool.

Of this, ~40.5h was actually used: abbreviation-heavy sentences and sentences longer than 10 seconds were dropped. The data is Estonian read speech (TartuNLP konekorpus, 4 speakers) with normalised text. The released weights are the merged T3 weights (r=128, alpha=256) plus full-rank embeddings for the new vocabulary; ve and s3gen were frozen.
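The filtering step can be sketched as follows. Only the 10-second cutoff comes from the text above. The abbreviation heuristic, the threshold of 2, the keep helper and the sample rows are all hypothetical stand-ins, not the criteria actually used.

```python
import re

MAX_SECONDS = 10.0       # from the text: longer sentences were dropped
MAX_ABBREVIATIONS = 2    # hypothetical threshold for "abbreviation-heavy"

# Crude stand-in for abbreviation detection: runs of 2+ capital letters.
ABBREVIATION = re.compile(r"\b[A-ZÕÄÖÜŠŽ]{2,}\b")

def keep(duration_s, text):
    """Return True if a corpus row passes both filters."""
    if duration_s > MAX_SECONDS:
        return False
    return len(ABBREVIATION.findall(text)) <= MAX_ABBREVIATIONS

# Made-up rows for illustration only.
corpus = [
    {"duration_s": 4.2, "text": "Tallinnas sadas eile vihma."},
    {"duration_s": 12.0, "text": "Tallinnas sadas eile vihma."},
    {"duration_s": 3.0, "text": "EKI, ERR ja TTÜ teatasid uudisest."},
]
kept = [row for row in corpus if keep(row["duration_s"], row["text"])]
```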

Acknowledgements & Credits

Chatterbox TTS (Resemble AI) - base model
Perth (Resemble AI) - audio watermarking
chatterbox-finetuning - initial testing, extensively modified
tts_preprocess_et + EstNLTK - Estonian text normalisation
Institute of Computer Science of University of Tartu and the Institute of the Estonian Language (Eesti Keele Instituut) — for the speech corpus and tts_preprocess_et, and for the decades of Estonian language-tech (EstNLTK, vabamorf) that make any of this possible.
The report "Närvivõrgu põhise kõnesünteesi arendamine" (2020): Liisa Rätsep, Liisi Piits, Hille Pajupuu, Indrek Hein, Mark Fišel.
The four (anonymised) voice talents whose readings make up the corpus.

License

The merged weights and configurations in this repository are released under the MIT License.

The base model, Chatterbox, is licensed under MIT by Resemble AI.
The training data, TartuNLP news-sentence speech corpus, is licensed by University of Tartu under CC-BY; the corpus does not specify a version.


Paper: Neural Speech Synthesis for Estonian (arXiv:2010.02636, published Oct 6, 2020)