Update README.md
datasets:
- projecte-aina/openslr-slr69-ca-trimmed-denoised
---

# Vocos-mel-22khz-cat

<!-- Provide a quick summary of what the model is/does. -->
Unlike other typical GAN-based vocoders, Vocos does not model audio samples in the time domain.
Instead, it generates spectral coefficients, facilitating rapid audio reconstruction through
inverse Fourier transform.
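To make the reconstruction step concrete, here is a minimal numpy sketch: given complex spectral coefficients (the kind of output Vocos predicts), an overlap-add inverse STFT turns them back into a waveform. The window and hop sizes below are illustrative assumptions, not the model's actual configuration.

```python
import numpy as np

def istft(coeffs, n_fft=1024, hop=256):
    """Overlap-add inverse STFT: turn complex spectral frames into a waveform."""
    win = np.hanning(n_fft)
    n_frames = coeffs.shape[1]
    out = np.zeros(hop * (n_frames - 1) + n_fft)
    norm = np.zeros_like(out)
    for t in range(n_frames):
        # Each frame becomes n_fft time-domain samples, windowed and summed in place.
        frame = np.fft.irfft(coeffs[:, t], n=n_fft)
        out[t * hop : t * hop + n_fft] += frame * win
        norm[t * hop : t * hop + n_fft] += win ** 2
    # Normalize by the accumulated squared window to undo the analysis/synthesis windowing.
    return out / np.maximum(norm, 1e-8)

# Round-trip check: analyze a 440 Hz test tone, then resynthesize it.
sr, n_fft, hop = 22050, 1024, 256
x = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
frames = np.stack([np.fft.rfft(x[i : i + n_fft] * np.hanning(n_fft))
                   for i in range(0, len(x) - n_fft, hop)], axis=1)
y = istft(frames, n_fft, hop)
```

Because the ISTFT is a fixed linear operation, synthesis is a single cheap transform per frame, which is where the speed advantage over sample-by-sample time-domain generation comes from.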
This version of **Vocos** uses 80-bin mel spectrograms as acoustic features, which have been widespread
in the TTS domain since the introduction of [hifi-gan](https://github.com/jik876/hifi-gan/blob/master/meldataset.py).
The goal of this model is to provide an alternative to hifi-gan that is faster and compatible with the
acoustic output of several TTS models. This version is tailored for the Catalan language,
as it was trained only on Catalan speech datasets.

We are grateful to the authors for open-sourcing the code, allowing us to modify and train this version.
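For reference, an 80-band mel projection at 22.05 kHz can be built from scratch. The sketch below uses the HTK mel formula and an assumed `n_fft` of 1024; hifi-gan-style pipelines typically use librosa's Slaney-scale filters instead, so treat the exact numbers as assumptions rather than this model's feature extractor.

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel scale (an assumption; librosa defaults to the Slaney variant)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr=22050, n_fft=1024, n_mels=80, fmin=0.0, fmax=None):
    """Triangular mel filters mapping an FFT magnitude spectrum to n_mels bands."""
    fmax = fmax or sr / 2
    # n_mels + 2 equally spaced points on the mel scale give the triangle edges.
    mel_pts = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2)
    hz_pts = mel_to_hz(mel_pts)
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, ctr, hi = bins[i], bins[i + 1], bins[i + 2]
        for k in range(lo, ctr):          # rising edge of triangle i
            fb[i, k] = (k - lo) / max(ctr - lo, 1)
        for k in range(ctr, hi):          # falling edge of triangle i
            fb[i, k] = (hi - k) / max(hi - ctr, 1)
    return fb

fb = mel_filterbank()  # shape (80, 513): one row of triangle weights per mel band
```

Multiplying this matrix with a magnitude spectrogram (shape `(513, frames)`) yields the 80-bin mel features the model consumes.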

## Intended Uses and Limitations

We also release an ONNX version of the model, which you can try in Colab:

<a target="_blank" href="https://colab.research.google.com/github/langtech-bsc/vocos/blob/matcha/notebooks/vocos_22khz_onnx_inference.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

## Training Details

### Training Data

The model was trained on 3 Catalan speech datasets.
### Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

The model was trained for 1.5M steps (about 1.3k epochs) with a batch size of 16 for stability. We used a cosine scheduler with an initial learning rate of 5e-4.
We also modified the mel spectrogram loss to use 128 bins and an fmax of 11025 Hz instead of the same configuration as the input mel spectrogram.
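Assuming a plain cosine decay with no warmup (the scheduler settings beyond the initial rate are not stated here), the learning-rate curve can be sketched as:

```python
import math

def cosine_lr(step, total_steps, lr0=5e-4, lr_min=0.0):
    """Cosine-annealed LR: lr0 at step 0, decaying smoothly to lr_min at total_steps."""
    progress = min(step / total_steps, 1.0)
    return lr_min + 0.5 * (lr0 - lr_min) * (1.0 + math.cos(math.pi * progress))

total = 1_500_000                   # 1.5M optimizer steps, as stated above
start = cosine_lr(0, total)         # 5e-4 at the start
mid = cosine_lr(total // 2, total)  # roughly half the initial rate midway
end = cosine_lr(total, total)       # decays all the way to lr_min (0.0 here)
```

Framework schedulers such as PyTorch's `CosineAnnealingLR` implement the same curve; the sketch just makes the shape of the decay explicit.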
<!-- This section describes the evaluation protocols and provides the results. -->

Evaluation was done using the metrics from the [original repo](https://github.com/gemelo-ai/vocos); after ~1000 epochs we achieve:

* val_loss: 3.57
* f1_score: 0.95