---
license: mit
tags:
- text-to-speech
---

# WhisperSpeech

[Open in Colab](https://colab.research.google.com/drive/1xxGlTbwBmaY6GKA24strRixTXGBOlyiw)
[Join us on Discord](https://discord.gg/FANw4rHD5E)

*If you have questions or you want to help you can find us in the #audio-generation channel on the LAION Discord server.*

An Open Source text-to-speech system built by inverting Whisper. Previously known as **spear-tts-pytorch**.

We want this model to be like Stable Diffusion but for speech – both powerful and easily customizable. We are working only with properly licensed speech recordings and all the code is Open Source, so the model will always be safe to use for commercial applications.

Currently the models are trained on the English LibriLight dataset. In the next release we want to target multiple languages (Whisper and EnCodec are both multilingual).

Sample of the synthesized voice:

https://github.com/collabora/WhisperSpeech/assets/107984/aa5a1e7e-dc94-481f-8863-b022c7fd7434

## Progress update \[2024-01-29\]

We successfully trained a `tiny` S2A model on an en+pl+fr dataset and it can do voice cloning in French:

https://github.com/collabora/WhisperSpeech/assets/107984/267f2602-7eec-4646-a43b-059ff91b574e

https://github.com/collabora/WhisperSpeech/assets/107984/fbf08e8e-0f9a-4b0d-ab5e-747ffba2ccb9

We were able to do this with frozen semantic tokens that were only trained on English and Polish. This supports the idea that we will be able to train a single semantic token model covering all the languages in the world – quite likely even ones that are not currently well supported by the Whisper model. Stay tuned for more updates on this front. :)

## Progress update \[2024-01-18\]

We spent the last week optimizing inference performance. We integrated `torch.compile`, added kv-caching and tuned some of the layers – we now run more than 12x faster than real-time on a consumer 4090!

We can mix languages in a single sentence (here the highlighted English project names are seamlessly mixed into Polish speech):

> To jest pierwszy test wielojęzycznego `Whisper Speech` modelu zamieniającego tekst na mowę, który `Collabora` i `Laion` nauczyli na superkomputerze `Jewels`.

(In English: "This is the first test of the multilingual `Whisper Speech` text-to-speech model, which `Collabora` and `Laion` trained on the `Jewels` supercomputer.")

https://github.com/collabora/WhisperSpeech/assets/107984/d7092ef1-9df7-40e3-a07e-fdc7a090ae9e

We also added an easy way to test voice cloning. Here is a sample voice cloned from [a famous speech by Winston Churchill](https://en.wikipedia.org/wiki/File:Winston_Churchill_-_Be_Ye_Men_of_Valour.ogg) (the radio static is a feature, not a bug ;) – it is part of the reference recording):

https://github.com/collabora/WhisperSpeech/assets/107984/bd28110b-31fb-4d61-83f6-c997f560bc26

You can [test all of these on Colab](https://colab.research.google.com/drive/1xxGlTbwBmaY6GKA24strRixTXGBOlyiw) (we optimized the dependencies so now it takes less than 30 seconds to install). A Hugging Face Space is coming soon.

## Progress update \[2024-01-10\]

We’ve pushed a new SD S2A model that is a lot faster while still generating high-quality speech. We’ve also added an example of voice cloning based on a reference audio file.

As always, you can [check out our Colab](https://colab.research.google.com/drive/1xxGlTbwBmaY6GKA24strRixTXGBOlyiw) to try it yourself!

## Progress update \[2023-12-10\]

Another trio of models, this time with support for multiple languages (English and Polish). Here are two new samples for a sneak peek. You can [check out our Colab](https://colab.research.google.com/drive/1xxGlTbwBmaY6GKA24strRixTXGBOlyiw) to try it yourself!

English speech, female voice (transferred from a Polish language dataset):

https://github.com/collabora/WhisperSpeech/assets/107984/aa5a1e7e-dc94-481f-8863-b022c7fd7434

A Polish sample, male voice:

https://github.com/collabora/WhisperSpeech/assets/107984/4da14b03-33f9-4e2d-be42-f0fcf1d4a6ec

[Older progress updates are archived here](https://github.com/collabora/WhisperSpeech/issues/23)

## Downloads

We encourage you to start with the Google Colab link above or run the provided notebook locally. If you want to download the models manually or train them from scratch, both [the WhisperSpeech pre-trained models](https://huggingface.co/collabora/whisperspeech) and [the converted datasets](https://huggingface.co/datasets/collabora/whisperspeech) are available on Hugging Face.
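If you prefer to mirror the model files yourself rather than letting the notebook fetch them, the stock `huggingface_hub` client works with the repositories linked above. A minimal sketch (only the repository id comes from this README; the rest is the standard `huggingface_hub` API):

```python
# Minimal sketch: mirror the pre-trained WhisperSpeech checkpoints locally
# using the standard huggingface_hub client.
from huggingface_hub import snapshot_download

# Downloads every file from the model repository into the local
# Hugging Face cache and returns the path to the cached snapshot.
models_path = snapshot_download("collabora/whisperspeech")
print(models_path)
```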
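For a quick local test of synthesis itself, a sketch along these lines should work after `pip install whisperspeech`. It assumes the `whisperspeech.pipeline.Pipeline` interface used in the Colab notebook; method names and default model references may change between releases, so treat the notebook as the authoritative example:

```python
# Hedged sketch, assuming the Pipeline interface from the Colab notebook.
from whisperspeech.pipeline import Pipeline

# On first use this downloads the default pre-trained T2S and S2A models
# from the Hugging Face Hub and caches them locally.
pipe = Pipeline()

# Synthesize a short English sentence straight to a WAV file.
pipe.generate_to_file(
    "hello.wav",
    "WhisperSpeech is an open text-to-speech system built by inverting Whisper.",
)
```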
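Voice cloning, as shown in the progress updates above, follows the same pattern: you provide a short reference recording and generation is conditioned on that speaker. Another hedged sketch – the `speaker` argument (a path or URL to a reference audio file) is an assumption here, so check the current notebook for the exact signature:

```python
# Hedged sketch: voice cloning from a reference recording. The `speaker`
# keyword is assumed to accept a local path or URL pointing to a few
# seconds of clean speech from the target voice.
from whisperspeech.pipeline import Pipeline

pipe = Pipeline()
pipe.generate_to_file(
    "cloned.wav",
    "This voice was cloned from a short reference recording.",
    speaker="reference_speaker.wav",  # hypothetical example file
)
```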
## Roadmap

- [ ] [Gather a bigger emotive speech dataset](https://github.com/collabora/spear-tts-pytorch/issues/11)
- [ ] Figure out a way to condition the generation on emotions and prosody
- [ ] Create a community effort to gather freely licensed speech in multiple languages
- [ ] [Train final multi-language models](https://github.com/collabora/spear-tts-pytorch/issues/12)

## Architecture

The general architecture is similar to [AudioLM](https://google-research.github.io/seanet/audiolm/examples/), [SPEAR TTS](https://google-research.github.io/seanet/speartts/examples/) from Google and [MusicGen](https://ai.honu.io/papers/musicgen/) from Meta. We avoided the NIH syndrome and built it on top of powerful Open Source models: [Whisper](https://github.com/openai/whisper) from OpenAI to generate semantic tokens and perform transcription, [EnCodec](https://github.com/facebookresearch/encodec) from Meta for acoustic modeling, and [Vocos](https://github.com/charactr-platform/vocos) from Charactr Inc as the high-quality vocoder.

We gave two presentations diving deeper into WhisperSpeech. The first one talks about the challenges of large-scale training: