wetdog AlexK-PL committed on
Commit
96c711d
0 Parent(s):

Duplicate from BSC-LT/matcha-tts-cat-multispeaker

Co-authored-by: Alex Peiró Lilja <AlexK-PL@users.noreply.huggingface.co>

.gitattributes ADDED
@@ -0,0 +1,35 @@
1
+ *.7z filter=lfs diff=lfs merge=lfs -text
2
+ *.arrow filter=lfs diff=lfs merge=lfs -text
3
+ *.bin filter=lfs diff=lfs merge=lfs -text
4
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
5
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
6
+ *.ftz filter=lfs diff=lfs merge=lfs -text
7
+ *.gz filter=lfs diff=lfs merge=lfs -text
8
+ *.h5 filter=lfs diff=lfs merge=lfs -text
9
+ *.joblib filter=lfs diff=lfs merge=lfs -text
10
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
11
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
12
+ *.model filter=lfs diff=lfs merge=lfs -text
13
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
14
+ *.npy filter=lfs diff=lfs merge=lfs -text
15
+ *.npz filter=lfs diff=lfs merge=lfs -text
16
+ *.onnx filter=lfs diff=lfs merge=lfs -text
17
+ *.ot filter=lfs diff=lfs merge=lfs -text
18
+ *.parquet filter=lfs diff=lfs merge=lfs -text
19
+ *.pb filter=lfs diff=lfs merge=lfs -text
20
+ *.pickle filter=lfs diff=lfs merge=lfs -text
21
+ *.pkl filter=lfs diff=lfs merge=lfs -text
22
+ *.pt filter=lfs diff=lfs merge=lfs -text
23
+ *.pth filter=lfs diff=lfs merge=lfs -text
24
+ *.rar filter=lfs diff=lfs merge=lfs -text
25
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
26
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
27
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
28
+ *.tar filter=lfs diff=lfs merge=lfs -text
29
+ *.tflite filter=lfs diff=lfs merge=lfs -text
30
+ *.tgz filter=lfs diff=lfs merge=lfs -text
31
+ *.wasm filter=lfs diff=lfs merge=lfs -text
32
+ *.xz filter=lfs diff=lfs merge=lfs -text
33
+ *.zip filter=lfs diff=lfs merge=lfs -text
34
+ *.zst filter=lfs diff=lfs merge=lfs -text
35
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,209 @@
1
+ ---
2
+ language:
3
+ - ca
4
+ license:
5
+ - apache-2.0
6
+ tags:
7
+ - matcha-tts
8
+ - acoustic modelling
9
+ - speech
10
+ - multispeaker
11
+ pipeline_tag: text-to-speech
12
+ datasets:
13
+ - projecte-aina/festcat_trimmed_denoised
14
+ - projecte-aina/openslr-slr69-ca-trimmed-denoised
15
+ ---
16
+
17
+ # Matcha-TTS Catalan Multispeaker
18
+
19
+ ## Table of Contents
20
+ <details>
21
+ <summary>Click to expand</summary>
22
+
23
+ - [Model description](#model-description)
24
+ - [Intended uses and limitations](#intended-uses-and-limitations)
25
+ - [How to get started with the model](#how-to-get-started-with-the-model)
26
+ - [Training details](#training-details)
27
+ - [Evaluation](#evaluation)
28
+ - [Citation](#citation)
29
+ - [Additional information](#additional-information)
30
+
31
+ </details>
32
+
33
+ ## Model Description
34
+
35
+ **Matcha-TTS** is an encoder-decoder architecture designed for fast acoustic modelling in TTS.
36
+ The encoder consists of a text encoder and a phoneme duration predictor that together predict averaged acoustic features.
37
+ The decoder is essentially a U-Net backbone inspired by [Grad-TTS](https://arxiv.org/pdf/2105.06337.pdf) and built on the Transformer architecture.
38
+ In the decoder, replacing 2D CNNs with 1D CNNs yields a large reduction in memory consumption and faster synthesis.
39
+
40
+ **Matcha-TTS** is a non-autoregressive model trained with optimal-transport conditional flow matching (OT-CFM).
41
+ This yields an ODE-based decoder capable of producing high-quality output in fewer synthesis steps than models trained using score matching.
42
+
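To make the "fewer synthesis steps" idea concrete, here is a minimal sketch of fixed-step Euler integration along an OT-CFM path. It is a toy and not the model's decoder: the learned network is replaced by the closed-form conditional vector field for a known target `x1`, and the names `ot_cfm_vector_field` and `euler_solve` are hypothetical. `SIGMA_MIN` mirrors `cfm.sigma_min` in `config.yaml`, and the Euler solver mirrors `cfm.solver`.

```python
SIGMA_MIN = 1e-4  # value of cfm.sigma_min in config.yaml


def ot_cfm_vector_field(x, t, x1, sigma_min=SIGMA_MIN):
    """Closed-form conditional OT-CFM field that moves x towards the target x1.

    In the real model this role is played by a neural network; using the
    known target here is an illustrative assumption.
    """
    scale = 1.0 - (1.0 - sigma_min) * t
    return [(b - (1.0 - sigma_min) * a) / scale for a, b in zip(x, x1)]


def euler_solve(x0, x1, n_steps=10):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-size Euler steps."""
    x = list(x0)
    h = 1.0 / n_steps
    for k in range(n_steps):
        t = k * h
        v = ot_cfm_vector_field(x, t, x1)
        x = [xi + h * vi for xi, vi in zip(x, v)]
    return x


noise = [0.5, -1.2, 0.3]    # stand-in for a Gaussian noise sample
target = [2.0, 0.0, -1.0]   # stand-in for mel-spectrogram values
result = euler_solve(noise, target, n_steps=10)
```

In this toy the conditional path is a straight line, so Euler integration ends at `x1 + sigma_min * x0` (up to floating-point error) whatever the step count. A learned field is only approximately straight, which is one intuition for why a small number of steps can be enough.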
43
+ ## Intended Uses and Limitations
44
+
45
+ This model is intended to serve as an acoustic feature generator for multispeaker text-to-speech systems for the Catalan language.
46
+ It has been finetuned using a Catalan phonemizer, so if the model is used for other languages it will likely not produce intelligible samples after mapping
47
+ its output into a speech waveform.
48
+
49
+ The quality of the samples can vary depending on the speaker.
50
+ This may be due to the sensitivity of the model in learning specific frequencies and also due to the quality of samples for each speaker.
51
+
52
+ ## How to Get Started with the Model
53
+
54
+ ### Installation
55
+
56
+ This model has been trained using the espeak-ng open source text-to-speech software.
57
+ The espeak-ng version containing the Catalan phonemizer can be found [here](https://github.com/projecte-aina/espeak-ng).
58
+
59
+ Create a virtual environment:
60
+ ```bash
61
+ python -m venv /path/to/venv
62
+ ```
63
+ ```bash
64
+ source /path/to/venv/bin/activate
65
+ ```
66
+
67
+ For training and inference with Catalan Matcha-TTS, you need to compile the provided espeak-ng with the Catalan phonemizer:
68
+ ```bash
69
+ git clone https://github.com/projecte-aina/espeak-ng.git
70
+
71
+ export PYTHON=/path/to/env/<env_name>/bin/python
72
+ cd /path/to/espeak-ng
73
+ ./autogen.sh
74
+ ./configure --prefix=/path/to/espeak-ng
75
+ make
76
+ make install
77
+
78
+ pip cache purge
79
+ pip install mecab-python3
80
+ pip install unidic-lite
81
+ ```
82
+ Clone the repository:
83
+
84
+ ```bash
85
+ git clone -b dev-cat https://github.com/langtech-bsc/Matcha-TTS.git
86
+ cd Matcha-TTS
87
+
88
+ ```
89
+ Install the package from source:
90
+ ```bash
91
+ pip install -e .
92
+
93
+ ```
94
+
95
+
96
+ ### For Inference
97
+
98
+ #### PyTorch
99
+
100
+ End-to-end speech inference can be done by pairing **Catalan Matcha-TTS** with the Vocos vocoder.
101
+ Both models (Catalan Matcha-TTS and Vocos) are loaded remotely from the HF hub.
102
+
103
+ First, export the following environment variables to include the installed espeak-ng version:
104
+
105
+ ```bash
106
+ export PYTHON=/path/to/your/venv/bin/python
107
+ export ESPEAK_DATA_PATH=/path/to/espeak-ng/espeak-ng-data
108
+ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/path/to/espeak-ng/lib
109
+ export PATH="/path/to/espeak-ng/bin:$PATH"
110
+
111
+ ```
112
+ Then you can run the inference script:
113
+ ```bash
114
+ cd Matcha-TTS
115
+ python3 matcha_vocos_inference.py --output_path=/output/path --text_input="Bon dia Manel, avui anem a la muntanya."
116
+
117
+ ```
118
+ You can also modify the length scale (speech rate) and the temperature of the generated sample:
119
+ ```bash
120
+ python3 matcha_vocos_inference.py --output_path=/output/path --text_input="Bon dia Manel, avui anem a la muntanya." --length_scale=0.8 --temperature=0.7
121
+
122
+ ```
123
+
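If the script needs to be driven from Python, for example to synthesize several sentences in a loop, the command lines above can be assembled programmatically. The sketch below only uses the four flags shown in this README; the helper name `build_inference_command` is hypothetical and not part of the repository.

```python
import shlex


def build_inference_command(text, output_path, length_scale=None, temperature=None):
    """Return the argv list for matcha_vocos_inference.py using only the flags shown above."""
    cmd = [
        "python3",
        "matcha_vocos_inference.py",
        f"--output_path={output_path}",
        f"--text_input={text}",
    ]
    # Optional flags are appended only when the caller sets them.
    if length_scale is not None:
        cmd.append(f"--length_scale={length_scale}")
    if temperature is not None:
        cmd.append(f"--temperature={temperature}")
    return cmd


cmd = build_inference_command(
    "Bon dia Manel, avui anem a la muntanya.",
    "/output/path",
    length_scale=0.8,
    temperature=0.7,
)
print(shlex.join(cmd))  # shell-quoted form, useful for logging
# To actually run it from the Matcha-TTS checkout (with the espeak-ng
# environment variables exported as described above):
# import subprocess; subprocess.run(cmd, check=True, cwd="Matcha-TTS")
```

Passing an argument list rather than a single shell string avoids quoting problems when the input text contains commas, quotes or other shell-sensitive characters.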
124
+ #### ONNX
125
+
126
+ We also release an ONNX version of the model (`matcha_multispeaker_cat_opset_15_10_steps_2399.onnx` in this repository).
127
+
128
+ ### For Training
129
+
130
+ The full checkpoint is also released so that training or finetuning can be continued.
131
+ See the [repo instructions](https://github.com/langtech-bsc/Matcha-TTS/tree/dev-cat).
132
+
133
+
134
+ ## Training Details
135
+
136
+ ### Training data
137
+
138
+ The model was trained on two **Catalan** speech datasets:
139
+
140
+ | Dataset | Language | Hours | Num. Speakers |
141
+ |---------------------|----------|---------|-----------------|
142
+ | [Festcat](https://huggingface.co/datasets/projecte-aina/festcat_trimmed_denoised) | ca | 22 | 11 |
143
+ | [OpenSLR69](https://huggingface.co/datasets/projecte-aina/openslr-slr69-ca-trimmed-denoised) | ca | 5 | 36 |
144
+
145
+ ### Training procedure
146
+
147
+ ***Catalan Matcha-TTS*** was finetuned from the English multispeaker checkpoint,
148
+ which was trained with the [VCTK dataset](https://huggingface.co/datasets/vctk) and provided by the model authors.
149
+
150
+ The embedding layer was initialized with the number of Catalan speakers (47), and the original hyperparameters were kept.
151
+
152
+ ### Training Hyperparameters
153
+
154
+ * batch size: 32 (x2 GPUs)
155
+ * learning rate: 1e-4
156
+ * number of speakers: 47
157
+ * n_fft: 1024
158
+ * n_feats: 80
159
+ * sample_rate: 22050
160
+ * hop_length: 256
161
+ * win_length: 1024
162
+ * f_min: 0
163
+ * f_max: 8000
164
+ * data_statistics:
165
+   * mel_mean: -6.578195
166
+   * mel_std: 2.538758
167
+ * number of samples: 13340
168
+
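As a worked example of what the audio hyperparameters above imply, the sketch below derives the mel frame rate and window durations and shows how the data statistics could be applied. It assumes the statistics are used for standard `(x - mean) / std` scaling; the values of `MEL_MEAN` and `MEL_STD` are taken from `config.yaml`, and the function names are illustrative.

```python
SAMPLE_RATE = 22050
HOP_LENGTH = 256
WIN_LENGTH = 1024
MEL_MEAN = -6.578195  # data_statistics.mel_mean from config.yaml
MEL_STD = 2.538758    # data_statistics.mel_std from config.yaml

frames_per_second = SAMPLE_RATE / HOP_LENGTH   # mel frames per second of audio
hop_ms = 1000 * HOP_LENGTH / SAMPLE_RATE       # time shift between frames
window_ms = 1000 * WIN_LENGTH / SAMPLE_RATE    # duration of one analysis window


def normalize(mel_value):
    """Scale a log-mel value to roughly zero mean and unit variance."""
    return (mel_value - MEL_MEAN) / MEL_STD


def denormalize(norm_value):
    """Invert normalize(), e.g. before passing a mel-spectrogram to a vocoder."""
    return norm_value * MEL_STD + MEL_MEAN


def num_frames(duration_seconds):
    """Approximate number of mel frames for a clip (ignores padding conventions)."""
    return int(duration_seconds * SAMPLE_RATE) // HOP_LENGTH
```

With these settings each second of audio corresponds to about 86 mel frames, each frame advances by about 11.6 ms, and each analysis window spans about 46.4 ms.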
169
+ ## Evaluation
170
+
171
+ Validation values obtained from TensorBoard at epoch 2399*:
172
+
173
+ * val_dur_loss_epoch: 0.38
174
+ * val_prior_loss_epoch: 0.97
175
+ * val_diff_loss_epoch: 2.195
176
+
177
+ (Note that finetuning started from epoch 1864; the earlier epochs were trained on the VCTK dataset.)
178
+
179
+ ## Citation
180
+
181
+ If this code contributes to your research, please cite the work:
182
+
183
+ ```
184
+ @misc{mehta2024matchatts,
185
+ title={Matcha-TTS: A fast TTS architecture with conditional flow matching},
186
+ author={Shivam Mehta and Ruibo Tu and Jonas Beskow and Éva Székely and Gustav Eje Henter},
187
+ year={2024},
188
+ eprint={2309.03199},
189
+ archivePrefix={arXiv},
190
+ primaryClass={eess.AS}
191
+ }
192
+ ```
193
+
194
+ ## Additional Information
195
+
196
+ ### Author
197
+ The Language Technologies Unit from Barcelona Supercomputing Center.
198
+
199
+ ### Contact
200
+ For further information, please send an email to <langtech@bsc.es>.
201
+
202
+ ### Copyright
203
+ Copyright (c) 2023 by Language Technologies Unit, Barcelona Supercomputing Center.
204
+
205
+ ### License
206
+ [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
207
+
208
+ ### Funding
209
+ This work has been promoted and financed by the Generalitat de Catalunya through the [Aina project](https://projecteaina.cat/).
checkpoint_epoch=2399.ckpt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:352de98d658b7096f2b270da79e398217045a566c33e0496be3df21efd217b55
3
+ size 250638851
config.yaml ADDED
@@ -0,0 +1,43 @@
1
+ cfm:
2
+   name: CFM
3
+   sigma_min: 0.0001
4
+   solver: euler
5
+ data_statistics:
6
+   mel_mean: -6.578195
7
+   mel_std: 2.538758
8
+ decoder:
9
+   act_fn: snakebeta
10
+   attention_head_dim: 64
11
+   channels:
12
+   - 256
13
+   - 256
14
+   dropout: 0.05
15
+   n_blocks: 1
16
+   num_heads: 2
17
+   num_mid_blocks: 2
18
+ encoder:
19
+   duration_predictor_params:
20
+     filter_channels_dp: 256
21
+     kernel_size: 3
22
+     p_dropout: 0.1
23
+   encoder_params:
24
+     filter_channels: 768
25
+     filter_channels_dp: 256
26
+     kernel_size: 3
27
+     n_channels: 192
28
+     n_feats: 80
29
+     n_heads: 2
30
+     n_layers: 6
31
+     n_spks: 47
32
+     p_dropout: 0.1
33
+     prenet: true
34
+     spk_emb_dim: 64
35
+   encoder_type: RoPE Encoder
36
+ n_feats: 80
37
+ n_spks: 47
38
+ n_vocab: 178
39
+ optimizer: null
40
+ out_size: null
41
+ prior_loss: true
42
+ scheduler: null
43
+ spk_emb_dim: 64
matcha_multispeaker_cat_opset_15_10_steps_2399.onnx ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:87faec2df4837126ca324a72b16b53cf378a230dc0dc86f1781431388e714a94
3
+ size 86049399
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:44b9640678f0d3be86a09484bbcf2cd55c9c4d2a92fc0eb3fb193ada6b5d01aa
3
+ size 83535314
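
The three binary files added in this commit are stored as Git LFS pointers, each carrying a `version`, an `oid` (SHA-256 digest) and a `size`. As a rough sketch, a downloaded file could be checked against its pointer as follows; the helper names `parse_lfs_pointer` and `matches_pointer` are hypothetical, and the pointer text is the one shown for `pytorch_model.bin`.

```python
import hashlib


def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(data, pointer):
    """True if the bytes have the size and SHA-256 digest recorded in the pointer."""
    if pointer["algo"] != "sha256" or len(data) != pointer["size"]:
        return False
    return hashlib.sha256(data).hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:44b9640678f0d3be86a09484bbcf2cd55c9c4d2a92fc0eb3fb193ada6b5d01aa\n"
    "size 83535314\n"
)
```

For files of this size it would be preferable to hash the file in chunks rather than load it fully into memory; the in-memory version is kept here only for brevity.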