Create README.md
README.md
ADDED
@@ -0,0 +1,62 @@
# lina-speech (beta)

Exploring "linear attention" for text-to-speech.

It predicts audio codec tokens "à la" [MusicGen](https://arxiv.org/abs/2306.05284): the residual vector quantizer codebooks are delayed relative to one another, so a single model can predict all of them.

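To illustrate the delayed-codebook idea, here is a minimal sketch of a MusicGen-style delay pattern on plain Python lists. It is not taken from this repository: the function names `apply_delay_pattern` and `undo_delay_pattern` and the padding id `PAD` are hypothetical, and the actual implementation may differ.

```python
from typing import List

PAD = -1  # hypothetical padding id, chosen only for this illustration


def apply_delay_pattern(codes: List[List[int]], pad: int = PAD) -> List[List[int]]:
    """Shift codebook k to the right by k steps.

    `codes` holds one list of token ids per codebook, all of the same length.
    After shifting, every time step carries one token per codebook, so a single
    autoregressive model can predict all codebooks step by step.
    """
    num_codebooks = len(codes)
    num_frames = len(codes[0])
    total_steps = num_frames + num_codebooks - 1
    delayed = []
    for k, row in enumerate(codes):
        right_pad = total_steps - num_frames - k
        delayed.append([pad] * k + list(row) + [pad] * right_pad)
    return delayed


def undo_delay_pattern(delayed: List[List[int]]) -> List[List[int]]:
    """Invert `apply_delay_pattern` by dropping the padding of each codebook."""
    num_codebooks = len(delayed)
    num_frames = len(delayed[0]) - num_codebooks + 1
    return [row[k:k + num_frames] for k, row in enumerate(delayed)]


# Three codebooks, three frames each.
codes = [[11, 12, 13], [21, 22, 23], [31, 32, 33]]
delayed = apply_delay_pattern(codes)
for row in delayed:
    print(row)
```

With three codebooks the sequence grows by two steps; in this sketch, coarser codebooks are emitted first and finer ones follow with a one-step lag each.
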
Featuring [RWKV](https://github.com/BlinkDL/RWKV-LM), [Mamba](https://github.com/state-spaces/mamba), [Gated Linear Attention](https://github.com/sustcsonglin/flash-linear-attention).

Compared to other LM-based TTS models:
- Can be easily pretrained and finetuned on midrange GPUs.
- Tiny memory footprint.
- Trained on long context (up to 2000 tokens, about 27 s).

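The relation between context length and duration in the list above can be checked with a little arithmetic. The sketch below assumes one token per codec frame and a codec frame rate of 75 Hz; neither value is stated in this README, they are only chosen because they are consistent with the quoted figure.

```python
FRAME_RATE_HZ = 75.0  # assumption: codec frame rate, not stated in this README


def tokens_to_seconds(num_tokens: int, frame_rate_hz: float = FRAME_RATE_HZ) -> float:
    """Convert a number of codec frames (one token per frame) to seconds."""
    return num_tokens / frame_rate_hz


duration = tokens_to_seconds(2000)
print(f"{duration:.1f} s")  # prints "26.7 s"
```
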
### Models

| Model | #Params | Dataset | Checkpoint | Steps | Note |
| :---: | :---: | :---: | :---: | :---: | :---: |
| GLA | 60M, 130M | Librilight-medium | [Download](https://nubo.ircam.fr/index.php/s/wjNYLb54m7L8xf9) | 300k | GPU inference only |
| Mamba | 60M | Librilight-medium | [Download](https://nubo.ircam.fr/index.php/s/wjNYLb54m7L8xf9) | 300k | GPU inference only |
| RWKV v6 | 60M | LibriTTS | [Download](https://nubo.ircam.fr/index.php/s/wjNYLb54m7L8xf9) | 150k | GPU inference only |

### Installation
Depending on the linear-complexity LM you choose, follow the corresponding installation instructions first:
- For Mamba, check the [official repo](https://github.com/state-spaces/mamba).
- For GLA/RWKV inference, check [flash-linear-attention](https://github.com/sustcsonglin/flash-linear-attention).
- For RWKV training, check [RWKV-LM](https://github.com/BlinkDL/RWKV-LM).

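Once the dependencies above are installed, a quick check like the following can confirm which backends are importable. This is a minimal sketch, not part of the repository: the helper `available_backends` is hypothetical, and the module names `mamba_ssm` and `fla` are assumptions about how the Mamba and flash-linear-attention packages are imported.

```python
import importlib.util
from typing import Dict

# Assumed import names for each backend; adjust if your installation differs.
BACKEND_MODULES = {
    "Mamba": "mamba_ssm",
    "GLA/RWKV (flash-linear-attention)": "fla",
}


def available_backends(modules: Dict[str, str] = BACKEND_MODULES) -> Dict[str, bool]:
    """Report, for each backend, whether its top-level module can be found."""
    return {
        name: importlib.util.find_spec(module) is not None
        for name, module in modules.items()
    }


for name, found in available_backends().items():
    print(f"{name}: {'installed' if found else 'missing'}")
```
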
### Inference

Download the configuration and weights listed above, then open `Inference.ipynb`.

### TODO

- [x] Fix RWKV6 inference and/or switch to the FLA implementation.
- [ ] Provide a Datamodule for training (_lhotse_ might also work well).
- [ ] Implement CFG.
- [ ] Scale up.

### Acknowledgment

- The RWKV authors and the surrounding community for carrying out high-level, truly open-source research.
- @SmerkyG for making it easy for me to test cutting-edge language models.
- @lucidrains for their huge codebase.
- @sustcsonglin who made [GLA and FLA](https://github.com/sustcsonglin/flash-linear-attention).
- @harrisonvanderbyl for fixing RWKV inference.

### Cite
```bib
@software{lemerle2024linaspeech,
  title  = {LinaSpeech: Exploring "linear attention" for text-to-speech.},
  author = {Lemerle, Théodor},
  url    = {https://github.com/theodorblackbird/lina-speech},
  month  = apr,
  year   = {2024}
}
```
### IRCAM

This work takes place at IRCAM and is part of the following project:
[ANR Exovoices](https://anr.fr/Projet-ANR-21-CE23-0040)

<img align="left" width="200" height="200" src="logo_ircam.jpeg">