Parler-TTS
High-fidelity Text-To-Speech
If you want to find out more about how these models were trained, or even fine-tune them yourself, check out the Parler-TTS repository on GitHub.
Note Parler-TTS Mini v1.1 is a 938M-parameter Parler checkpoint, trained on 45K hours of audio data. The only change from v1 is the use of a better prompt tokenizer. This tokenizer has a larger vocabulary and handles byte fallback, which simplifies multilingual training.
Note Parler-TTS Large is a 2.2B-parameter Parler checkpoint, trained on 45K hours of audio data.
Note Parler-TTS Mini is an 880M-parameter Parler checkpoint, trained on 45K hours of audio data.
Note Parler-TTS v0.1 is a lightweight text-to-speech (TTS) model, trained on 10.5K hours of audio data, that can generate high-quality, natural-sounding speech with features that can be controlled using a simple text prompt (e.g. gender, background noise, speaking rate, pitch and reverberation). It is the first model released by the Parler-TTS project, which aims to provide the community with TTS training resources and dataset pre-processing code. V1 coming soon!
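To illustrate prompt-based control, here is a minimal generation sketch following the usage shown in the Parler-TTS repository. It assumes the parler-tts package is installed and uses the Mini v1 checkpoint (parler-tts/parler-tts-mini-v1) as an example; the text to be spoken and the free-form voice description are tokenized and passed to the model separately.

```python
import torch
import soundfile as sf
from transformers import AutoTokenizer
from parler_tts import ParlerTTSForConditionalGeneration

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Load a Parler-TTS checkpoint and its prompt tokenizer (Mini v1 used here as an example).
model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler-tts-mini-v1").to(device)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-mini-v1")

# The text to be spoken, plus a plain-English description that steers the voice
# (gender, speaking rate, pitch, reverberation, background noise, ...).
prompt = "Hey, how are you doing today?"
description = (
    "A female speaker delivers a slightly expressive and animated speech with a moderate "
    "speed and pitch. The recording is of very high quality, with the speaker's voice "
    "sounding clear and very close up."
)

input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

# Generate audio tokens and decode them to a waveform.
generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio_arr = generation.cpu().numpy().squeeze()
sf.write("parler_tts_out.wav", audio_arr, model.config.sampling_rate)
```

Changing the description (e.g. "a male speaker with a low-pitched voice in a noisy environment") is how the controllable features listed above are expressed.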
Note Used to recover the audio waveform from the audio tokens predicted by the decoder. We use the DAC model from Descript, although other codec models, such as EnCodec, can also be used.
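For intuition on the codec's role, here is a small round-trip sketch using the standalone descript-audio-codec (dac) package, separate from the Parler-TTS generation pipeline; the model variant and input file name are illustrative assumptions.

```python
import dac
from audiotools import AudioSignal

# Download and load a pretrained DAC codec (44.1 kHz variant used here as an example).
model_path = dac.utils.download(model_type="44khz")
model = dac.DAC.load(model_path)
model.to("cpu")

# Encode a waveform into discrete audio tokens, then decode back to audio.
signal = AudioSignal("input.wav")  # hypothetical input file
x = model.preprocess(signal.audio_data, signal.sample_rate)
z, codes, latents, _, _ = model.encode(x)
reconstructed = model.decode(z)  # waveform recovered from the codec latents/tokens
```

In Parler-TTS, the decoder predicts such audio tokens, and the codec's decoder turns them back into a waveform.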
Note Used to encode text descriptions.