This is a multi-speaker piper model containing 22 speakers for my worldbuilding project Trinsfer/The Dimensional Stack. Most characters in it have "voice claims" based on YouTubers' voices, which were concatenated and repitched before being fed into AI to mix them into new voices.
This model is not particularly natural, but it is comprehensible. It's only been trained for ~35 epochs, and it hasn't improved much since epoch ~20 or so. The law of diminishing returns seems quite strong when finetuning piper: if you're satisfied with mere comprehensibility and timbre accuracy like I am, you really don't need to finetune your piper model for more than a few hours on a medium-grade GPU (I have a 4060 Ti; I used 300 max phoneme ids and batch size 16).
It is finetuned from the ljspeech-medium checkpoint, so it is not affected by the whole lessac license problem, where almost all models are legally restricted to "research purposes only". In fact, I think it might be the first medium-quality multi-speaker English model without this problem (the libritts model is high quality, which is a bit slower even on a beefy gaming laptop).
To train this, I actually downgraded to the old rhasspy piper implementation to allow finetuning a single-speaker model into a multi-speaker one; the OHF-voice implementation doesn't have that option even though it really should (and in rhasspy the option is EXTREMELY simple, just a few lines of code).
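For anyone trying to reproduce the setup described above, here is a rough sketch of how the training command could be assembled in Python. The flag names are my recollection of the rhasspy `piper_train` module (in particular `--resume_from_single_speaker_checkpoint`), so treat them as assumptions and check them against the training docs of the version you have installed. Both paths are hypothetical placeholders, and the helper `build_finetune_command` is made up for illustration.

```python
import shlex


def build_finetune_command(dataset_dir, checkpoint_path,
                           batch_size=16, max_phoneme_ids=300):
    """Assemble (but do not run) a piper_train command line.

    Assumption: flag names follow the old rhasspy piper training docs.
    """
    return [
        "python3", "-m", "piper_train",
        "--dataset-dir", dataset_dir,
        "--accelerator", "gpu",
        "--devices", "1",
        # Values mentioned in this model card.
        "--batch-size", str(batch_size),
        "--max-phoneme-ids", str(max_phoneme_ids),
        "--quality", "medium",
        # Assumed flag: converts a single-speaker checkpoint into a
        # multi-speaker one before resuming training.
        "--resume_from_single_speaker_checkpoint", checkpoint_path,
    ]


# Hypothetical paths; replace with your own dataset and checkpoint.
cmd = build_finetune_command("path/to/dataset", "path/to/ljspeech-medium.ckpt")
print(shlex.join(cmd))
```

The function only builds the argument list, so you can inspect the printed command before handing it to `subprocess.run` or pasting it into a shell.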
The speakers:
- 0 - m_tovmeth - Inaccurate depiction of my voice, clear and medium-low with pretty average timbre
- 1 - a_kyrannikalx - Sounds like a middle-aged woman
- 2 - a_typhumebiek - Weird ransom-note voice with rapidly darting pitch and gender
- 3 - f_banqrrougt - Aggressive feminine voice that sounds like it's coming out of a radio speaker
- 4 - f_lexanephaong - Natural low-pitched feminine voice
- 5 - f_thea - Natural high-pitched feminine voice
- 6 - m_alanite - Low-pitched male voice, like a movie announcer
- 7 - m_alexander - Mutters everything with a sheepish, quivering kind of intonation. Probably the least intelligible, but still mostly intelligible
- 8 - m_arctakkurus - Very clear, cheerful male voice
- 9 - m_axtrad - Complex purring timbre, sounds a little like Kinger from TADC but not exactly
- 10 - m_ievokt - Gravelly, militaristic tone
- 11 - m_macrelydve - Cheerful "wide" tone with light gravel
- 12 - m_outzcradien - Grating and somewhat annoying but also a very clear timbre. Dull scientist.
- 13 - m_stellantrythe - "Mocking", silly, extremely gravelly timbre
- 14 - m_taylor - Almost exactly the same as m_alanite with slightly more varied intonation
- 15 - m_temuontetxecgen_aa - Unusual, low-frequency, high-resonance timbre
- 16 - m_temuontetxecgen_c - High-pitched annoying guy ready to tell you when you made a minor spelling mistake. Has a terrible microphone.
- 17 - m_thaneophyros_arra - Low-pitched, calm, and clear
- 18 - m_thaneophyros_post - Audiobook-like voice, complex but clear and reminiscent (in my opinion) of the timbres of less AI-oriented text-to-speech voices from the 2000s/10s
- 19 - m_thaneophyros_pre - YELLS EVERYTHING LIKE A YOUTUBER TRYING TO FARM ENGAGEMENT!!!!
- 20 - m_uncovesseltuxe - Very nerdy, "dragging" voice
- 21 - m_vethendaosphone - High-pitched male voice. Probably the most natural here.
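To pick one of the speakers above by name rather than by number, a small lookup like the following might help. It is a sketch under assumptions: piper voice configs (the `.onnx.json` file next to the model) normally carry a `speaker_id_map` object, and I am assuming its keys match the names in the list. The inline config is a three-entry excerpt mirroring the first rows of the list, not the real file, and the model filename and `speaker_id_for` helper are hypothetical.

```python
import json


def speaker_id_for(config_text, name):
    """Return the numeric speaker id for a speaker name.

    Assumption: the config JSON has a "speaker_id_map" of name -> id.
    """
    config = json.loads(config_text)
    speaker_map = config.get("speaker_id_map", {})
    if name not in speaker_map:
        raise KeyError(f"unknown speaker: {name}")
    return speaker_map[name]


# Illustrative excerpt only; the real config lists all 22 speakers.
example_config = json.dumps({
    "num_speakers": 3,
    "speaker_id_map": {"m_tovmeth": 0, "a_kyrannikalx": 1, "a_typhumebiek": 2},
})

sid = speaker_id_for(example_config, "a_kyrannikalx")

# Hypothetical model filename; --speaker takes the numeric id.
command = ["piper", "--model", "en_US-trinsfer-medium.onnx",
           "--speaker", str(sid), "--output_file", "out.wav"]
print(" ".join(command))
```

In practice you would read the config text from the actual `.onnx.json` file instead of the inline excerpt, then pipe your text into the resulting command.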
Base model: rhasspy/piper-voices