---
license: cc-by-nc-4.0
---
# lina-speech (beta)

Exploring "linear attention" for text-to-speech. It predicts audio codec tokens "à la" [MusicGen](https://arxiv.org/abs/2306.05284): delayed residual vector quantizers, so that multiple models are not needed.

Featuring [RWKV](https://github.com/BlinkDL/RWKV-LM), [Mamba](https://github.com/state-spaces/mamba), and [Gated Linear Attention](https://github.com/sustcsonglin/flash-linear-attention).

Compared to other LM-based TTS models:
- Can be easily pretrained and finetuned on midrange GPUs.
- Tiny memory footprint.
- Trained on long contexts (up to 2000 tokens: ~27s).

### Models

| Model | #Params | Dataset | Checkpoint | Steps | Note |
| :---: | :---: | :---: | :---: | :---: | :---: |
| GLA | 60M, 130M | Librilight-medium | [Download](https://nubo.ircam.fr/index.php/s/wjNYLb54m7L8xf9) | 300k | GPU inference only |
| Mamba | 60M | Librilight-medium | [Download](https://nubo.ircam.fr/index.php/s/wjNYLb54m7L8xf9) | 300k | GPU inference only |
| RWKV v6 | 60M | LibriTTS | [Download](https://nubo.ircam.fr/index.php/s/wjNYLb54m7L8xf9) | 150k | GPU inference only |

### Installation

Depending on the linear-complexity LM you choose, follow the respective instructions first:
- For Mamba, check the [official repo](https://github.com/state-spaces/mamba).
- For GLA/RWKV inference, check [flash-linear-attention](https://github.com/sustcsonglin/flash-linear-attention).
- For RWKV training, check [RWKV-LM](https://github.com/BlinkDL/RWKV-LM).

### Acknowledgment

- The RWKV authors and the community around them, for carrying out high-level, truly open-source research.
- @SmerkyG for making it easy to test cutting-edge language models.
- @lucidrains for their huge codebase.
- @sustcsonglin, who made [GLA and FLA](https://github.com/sustcsonglin/flash-linear-attention).
- @harrisonvanderbyl for fixing RWKV inference.
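As a minimal sketch of the MusicGen-style delayed codebook pattern mentioned above — level *q* of the residual vector quantizer is shifted right by *q* steps so one autoregressive model can predict all levels jointly. The `PAD` placeholder and plain Python lists are illustrative assumptions, not the repo's actual implementation:

```python
PAD = -1  # hypothetical placeholder token for positions with no codec frame

def delay_pattern(codes):
    """codes: list of Q lists, one per RVQ level, each of length T.
    Returns Q lists of length T + Q - 1, with level q delayed by q steps."""
    Q = len(codes)
    # pad level q with q placeholders on the left, Q - 1 - q on the right
    return [[PAD] * q + codes[q] + [PAD] * (Q - 1 - q) for q in range(Q)]

codes = [[1, 2, 3],   # level 0 (coarse)
         [4, 5, 6],   # level 1
         [7, 8, 9]]   # level 2 (fine)
print(delay_pattern(codes))
# → [[1, 2, 3, -1, -1], [-1, 4, 5, 6, -1], [-1, -1, 7, 8, 9]]
```

At each decoding step the model then emits one token per level, each level one frame behind the previous, which is what removes the need for a second model over the residual codebooks.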
### Cite

```bib
@software{lemerle2024linaspeech,
  title = {LinaSpeech: Exploring "linear attention" for text-to-speech.},
  author = {Lemerle, Théodor},
  url = {https://github.com/theodorblackbird/lina-speech},
  month = apr,
  year = {2024}
}
```

### IRCAM

This work takes place at IRCAM and is part of the following project: [ANR Exovoices](https://anr.fr/Projet-ANR-21-CE23-0040).