---
license: mit
---

### Model Description

**Vocos** is a fast neural vocoder designed to synthesize audio waveforms from acoustic features.
Unlike typical GAN-based vocoders, Vocos does not model audio samples in the time domain.
Instead, it generates spectral coefficients, enabling rapid audio reconstruction through the
inverse Fourier transform.

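To make the idea concrete, here is a minimal sketch (not the model's actual head) of how predicted spectral coefficients become a waveform through a single inverse STFT in PyTorch; the FFT size, hop length, and frame count below are illustrative assumptions, not this checkpoint's configuration:

```python
import torch

# Illustrative STFT geometry; the actual Vocos configuration may differ.
n_fft, hop_length = 1024, 256
frames = 64

# Pretend these are predicted spectral coefficients: magnitude and phase.
magnitude = torch.rand(1, n_fft // 2 + 1, frames)
phase = torch.rand(1, n_fft // 2 + 1, frames) * 2 * torch.pi
spec = magnitude * torch.exp(1j * phase)  # complex spectrogram

# One inverse Fourier transform pass yields the whole waveform at once,
# which is why this is faster than sample-by-sample time-domain generation.
audio = torch.istft(spec, n_fft=n_fft, hop_length=hop_length,
                    window=torch.hann_window(n_fft))
```

The waveform length is determined by the frame count and hop length, so generation cost scales with the number of spectrogram frames rather than the number of audio samples.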
This version of Vocos uses 80-bin mel spectrograms as acoustic features, which have been widespread
in the TTS domain since the introduction of [hifi-gan](https://github.com/jik876/hifi-gan/blob/master/meldataset.py).
The goal of this model is to provide an alternative to hifi-gan that is faster and compatible with the
acoustic output of several TTS models.

## Intended Uses and Limitations

The model is intended to serve as a vocoder that synthesizes audio waveforms from mel spectrograms. It was trained to generate speech; if it is used in other audio
domains, it may not produce high-quality samples.

### Installation

To use Vocos only in inference mode, install it using:

```bash
pip install git+https://github.com/langtech-bsc/vocos.git@matcha
```

### Reconstruct audio from mel-spectrogram

```python
import torch

from vocos import Vocos

# Load the pretrained vocoder from the Hugging Face Hub
vocos = Vocos.from_pretrained("patriotyk/vocos-mel-hifigan-compat-44100khz")

# A random 80-bin mel spectrogram stands in for real acoustic features
mel = torch.randn(1, 80, 256)  # B, C, T
audio = vocos.decode(mel)
```

### Training Data

The model was trained on a private dataset of 800+ hours of Ukrainian audio books, prepared with the [narizaka](https://github.com/patriotyk/narizaka) tool.

### Training Procedure

The model was trained for 2.0M steps (210 epochs) with a batch size of 20. We used a cosine scheduler with an initial learning rate of 3e-4.

#### Training Hyperparameters

* initial_learning_rate: 3e-4
* scheduler: cosine without warmup or restarts
* mel_loss_coeff: 45
* mrd_loss_coeff: 1.0
* batch_size: 20
* num_samples: 32768
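As a hedged illustration of the schedule above (not the training repository's exact code), PyTorch's `CosineAnnealingLR` reproduces a cosine decay without warmup or restarts; the optimizer choice and toy parameter here are assumptions for the sketch:

```python
import torch

# Toy parameter and optimizer; T_max matches the reported 2.0M training steps.
param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.AdamW([param], lr=3e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=2_000_000)

lrs = []
for _ in range(3):
    optimizer.step()
    scheduler.step()
    lrs.append(scheduler.get_last_lr()[0])

# The learning rate decays smoothly from 3e-4 toward 0 over T_max steps,
# with no warmup phase and no restarts.
```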