Permissively licensed transcription to International Phonetic Alphabet (IPA) in Python
First off, this is such a great project, congratulations!
I'm working to integrate this model as the primary TTS model for txtai. I notice that tokenization is handled by espeak-ng, which is GPL licensed. Additionally, it seems like installing espeak isn't straightforward for everyone.
With that, I've added support to ttstokenizer for transcribing text to the International Phonetic Alphabet. This is a drop-in replacement for espeak (for English) and generates token ids that can be consumed by this model.
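For reference, here's a minimal usage sketch. It assumes the package exposes an `IPATokenizer` class for this mode; the exact class name and output format may differ from the released API:

```python
# Minimal sketch: tokenize text to IPA-based token ids with ttstokenizer.
# Assumes an IPATokenizer class as described above; names may differ.
from ttstokenizer import IPATokenizer

tokenizer = IPATokenizer()

# Returns token ids intended to be fed to the model in place of espeak output
print(tokenizer("Text to speech models are neat"))
```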
A provider implementation where you can choose the phonemizer would be a good solution, but I won't drop espeak from the library because it's the only phonemizer that's actually usable for multilingual purposes and may fit many people's needs.
That's your call. If the eSpeak GPL license works for you, that's great. If you're building commercial software and don't intend to open source your work, then it could be problematic.
Hi, I'm aware of the GPL-ness of espeak-ng. More importantly, the performance can be lacking sometimes.
To that end, the next version of the model will use https://github.com/hexgrad/misaki for English: a simple, dictionary-first + fallback approach to G2P.
The fallback there is still espeak-ng, but one of the TODOs there is:
Fallbacks: Train seq2seq fallback models on dictionaries using this notebook.
You can also find a demo here: https://hf.co/spaces/hexgrad/Misaki-G2P, and any gold/silver (or diamond) result is NOT hitting the espeak fallback.
In general, the G2P method seems relatively flexible and can be airdropped into a model, as long as you continue training on the new G2P scheme.
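The general shape of that dictionary-first + fallback approach looks roughly like the sketch below. The `GOLD`/`SILVER` tables, the normalization, and the `fallback_g2p` hook are all placeholders for illustration, not Misaki's actual data or API:

```python
# Illustrative sketch of dictionary-first G2P with a fallback for misses.
# GOLD/SILVER and fallback_g2p are placeholders, not Misaki's real tables or API.
import re

GOLD = {"the": "ðə", "cat": "kæt"}      # hand-checked entries (placeholder)
SILVER = {"tokenizer": "ˈtoʊkənaɪzɚ"}   # lower-confidence entries (placeholder)

def fallback_g2p(word: str) -> str:
    # In practice this would be espeak-ng or a trained seq2seq model
    return "?" * len(word)

def g2p(text: str) -> list[str]:
    phonemes = []
    for word in re.findall(r"[a-z']+", text.lower()):
        phonemes.append(GOLD.get(word) or SILVER.get(word) or fallback_g2p(word))
    return phonemes

print(g2p("The cat tokenizer purrs"))  # last word misses both dictionaries
```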
@hexgrad Glad to hear it!
I'll keep an eye on this project. The ttstokenizer package I mentioned is a fork of g2p_en, which includes the model trained with the notebook you mentioned. It might be of some use to you.
Ultimately, someone has to do the work to collect a dictionary dataset for each language and then build an out-of-vocab model. Sounds like you're signing up for the task!
Misaki should now intercept, conservatively, like 85%+ of English words before touching any fallback.
Between the gold and silver dictionaries, there are about 170k+ entries each; round that down to the nearest OOM: 100k.
"The" should be the most common word at 7%, and Zipf's law says the second word goes to 3.5%, third 2.33%, fourth 1.75%, etc. This is a partial sum of the harmonic series, and math says a dictionary with the top 100k words should get you just shy of 85% coverage, on average. In practice, because of capitalization and basic rules, it's likely upwards of 90% or 95%, although really esoteric text could send this percentage the other way instead.
Still have not gotten around to training this elusive non-espeak fallback model, since other things take priority, although the TODOs give a pretty good hint on how that would be done.
It's also possible to use public Wiktionary dumps and/or large-scale LLM prompting—ideally DeepSeek with its 14T+ training tokens—to beef up dictionary size, although filtering would probably be needed to maintain quality.
Very cool, thank you for the update!
ttstokenizer is a fork of g2p_en, which uses the same notebook you reference in the Misaki repo for out-of-vocab words. The main difference is that ttstokenizer uses ARPABET and translates to IPA, so it was able to use the existing pre-trained model, for better or worse.
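As an illustration of that translation step, here's a tiny excerpt-style mapping; it is not ttstokenizer's actual table, just a sketch of the idea:

```python
# Tiny ARPABET -> IPA mapping to illustrate the translation step;
# not ttstokenizer's actual table, just a sketch.
ARPABET_TO_IPA = {
    "HH": "h", "AH0": "ə", "L": "l", "OW1": "oʊ",
    "W": "w", "ER1": "ɝ", "D": "d",
}

def arpabet_to_ipa(phones: list[str]) -> str:
    return "".join(ARPABET_TO_IPA.get(p, p) for p in phones)

# "hello world" as produced by a g2p_en-style ARPABET G2P
print(arpabet_to_ipa(["HH", "AH0", "L", "OW1", "W", "ER1", "L", "D"]))
```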