---
license: mit
---

### Model Description

**Vocos** is a fast neural vocoder designed to synthesize audio waveforms from acoustic features.
Unlike typical GAN-based vocoders, Vocos does not model audio samples in the time domain.
Instead, it generates spectral coefficients, enabling rapid audio reconstruction through the
inverse Fourier transform.

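To make the idea concrete, here is a minimal sketch (not the model's actual head) of how predicted spectral coefficients become a waveform through a single inverse STFT in PyTorch; the FFT size, hop length, and frame count below are illustrative assumptions, not this checkpoint's configuration:

```python
import torch

# Illustrative STFT geometry; the actual Vocos configuration may differ.
n_fft, hop_length = 1024, 256
frames = 64

# Pretend these are predicted spectral coefficients: magnitude and phase.
magnitude = torch.rand(1, n_fft // 2 + 1, frames)
phase = torch.rand(1, n_fft // 2 + 1, frames) * 2 * torch.pi
spec = magnitude * torch.exp(1j * phase)  # complex spectrogram

# One inverse Fourier transform pass yields the whole waveform at once,
# which is why this is faster than sample-by-sample time-domain generation.
audio = torch.istft(spec, n_fft=n_fft, hop_length=hop_length,
                    window=torch.hann_window(n_fft))
```

The waveform length is determined by the frame count and hop length, so generation cost scales with the number of spectrogram frames rather than the number of audio samples.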
This version of Vocos uses 80-bin mel spectrograms as acoustic features, which have been widespread
in the TTS domain since the introduction of [hifi-gan](https://github.com/jik876/hifi-gan/blob/master/meldataset.py).
The goal of this model is to provide an alternative to hifi-gan that is faster and compatible with the
acoustic output of several TTS models.

## Intended Uses and Limitations

The model is intended to serve as a vocoder that synthesizes audio waveforms from mel spectrograms. It was trained to generate speech; if it is used in other audio
domains, it may not produce high-quality samples.

### Installation

To use Vocos only in inference mode, install it using:

```bash
pip install git+https://github.com/langtech-bsc/vocos.git@matcha
```

### Reconstruct audio from mel-spectrogram

```python
import torch

from vocos import Vocos

# Load the pretrained vocoder from the Hugging Face Hub
vocos = Vocos.from_pretrained("patriotyk/vocos-mel-hifigan-compat-44100khz")

# A random 80-bin mel spectrogram stands in for real acoustic features
mel = torch.randn(1, 80, 256)  # B, C, T
audio = vocos.decode(mel)
```

### Training Data

The model was trained on a private dataset of 800+ hours of Ukrainian audio books, prepared with the [narizaka](https://github.com/patriotyk/narizaka) tool.

### Training Procedure

The model was trained for 2.0M steps (210 epochs) with a batch size of 20. We used a cosine scheduler with an initial learning rate of 3e-4.

#### Training Hyperparameters

* initial_learning_rate: 3e-4
* scheduler: cosine without warmup or restarts
* mel_loss_coeff: 45
* mrd_loss_coeff: 1.0
* batch_size: 20
* num_samples: 32768
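As a hedged illustration of the schedule above (not the training repository's exact code), PyTorch's `CosineAnnealingLR` reproduces a cosine decay without warmup or restarts; the optimizer choice and toy parameter here are assumptions for the sketch:

```python
import torch

# Toy parameter and optimizer; T_max matches the reported 2.0M training steps.
param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.AdamW([param], lr=3e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=2_000_000)

lrs = []
for _ in range(3):
    optimizer.step()
    scheduler.step()
    lrs.append(scheduler.get_last_lr()[0])

# The learning rate decays smoothly from 3e-4 toward 0 over T_max steps,
# with no warmup phase and no restarts.
```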