Upload folder using huggingface_hub
- README.md +80 -0
- config.yaml +40 -0
- pytorch_model.bin +3 -0
README.md
ADDED
@@ -0,0 +1,80 @@
---
license: mit
tags:
- audio
library_name: pytorch
---

# Vocos

Note: This repo has no affiliation with the author of Vocos.

## What is this?

This is a pretrained Vocos model similar to the official ones, except that it was trained to reconstruct audio at 48 kHz rather than 24 kHz.

It is intended as a general-purpose, high-quality vocoder and as a building block for TTS models.

## Usage
Make sure the Vocos library is installed:

```bash
pip install vocos
```

Then load the model as usual:

```python
from vocos import Vocos
vocos = Vocos.from_pretrained("kittn/vocos-mel-48khz-alpha1")
```

For more detailed examples, see [github.com/charactr-platform/vocos#usage](https://github.com/charactr-platform/vocos#usage).


## What is Vocos?

Here's a summary from the official repo [[link](https://github.com/charactr-platform/vocos)]:

> Vocos is a fast neural vocoder designed to synthesize audio waveforms from acoustic features. Trained using a Generative Adversarial Network (GAN) objective, Vocos can generate waveforms in a single forward pass. Unlike other typical GAN-based vocoders, Vocos does not model audio samples in the time domain. Instead, it generates spectral coefficients, facilitating rapid audio reconstruction through inverse Fourier transform.

For more details and other variants, check out the repo link above.

## Model summary
```text
=================================================================
Layer (type:depth-idx)                   Param #
=================================================================
Vocos                                    --
├─MelSpectrogramFeatures: 1-1            --
│    └─MelSpectrogram: 2-1               --
│    │    └─Spectrogram: 3-1             --
│    │    └─MelScale: 3-2                --
├─VocosBackbone: 1-2                     --
│    └─Conv1d: 2-2                       918,528
│    └─LayerNorm: 2-3                    2,048
│    └─ModuleList: 2-4                   --
│    │    └─ConvNeXtBlock: 3-3           4,208,640
│    │    └─ConvNeXtBlock: 3-4           4,208,640
│    │    └─ConvNeXtBlock: 3-5           4,208,640
│    │    └─ConvNeXtBlock: 3-6           4,208,640
│    │    └─ConvNeXtBlock: 3-7           4,208,640
│    │    └─ConvNeXtBlock: 3-8           4,208,640
│    │    └─ConvNeXtBlock: 3-9           4,208,640
│    │    └─ConvNeXtBlock: 3-10          4,208,640
│    └─LayerNorm: 2-5                    2,048
├─ISTFTHead: 1-3                         --
│    └─Linear: 2-6                       2,101,250
│    └─ISTFT: 2-7                        --
=================================================================
Total params: 36,692,994
Trainable params: 36,692,994
Non-trainable params: 0
=================================================================
```
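
The per-layer counts in the summary can be reproduced from the layer shapes alone, which is a quick way to sanity-check a config against a checkpoint. The sketch below is illustrative and not part of the Vocos library: the helper names are hypothetical, and the layer layout (a kernel-size-7 input convolution, ConvNeXt blocks made of a depthwise kernel-size-7 convolution, a LayerNorm, two pointwise linear layers and a per-channel scale vector, and a head projecting to `n_fft + 2` outputs) is an assumption about the upstream implementation. The dimensions come from `config.yaml` in this commit.

```python
# Hypothetical helpers: they count weights and biases of standard layers only.

def conv1d_params(c_in, c_out, kernel, groups=1):
    """Weights plus bias of a 1-D convolution."""
    return (c_in // groups) * c_out * kernel + c_out


def linear_params(d_in, d_out):
    """Weights plus bias of a fully connected layer."""
    return d_in * d_out + d_out


def layernorm_params(dim):
    """Scale and shift vectors of a LayerNorm."""
    return 2 * dim


def vocos_param_counts(input_channels, dim, intermediate_dim, num_layers, n_fft, kernel=7):
    """Parameter counts under the assumed layer layout described above."""
    embed = conv1d_params(input_channels, dim, kernel)
    block = (
        conv1d_params(dim, dim, kernel, groups=dim)  # depthwise convolution
        + layernorm_params(dim)
        + linear_params(dim, intermediate_dim)       # pointwise expansion
        + linear_params(intermediate_dim, dim)       # pointwise projection
        + dim                                        # per-channel scale (assumed present)
    )
    head = linear_params(dim, n_fft + 2)             # magnitude and phase per frequency bin
    total = embed + 2 * layernorm_params(dim) + num_layers * block + head
    return {"embed": embed, "block": block, "head": head, "total": total}


# Values taken from config.yaml in this commit.
counts = vocos_param_counts(
    input_channels=128, dim=1024, intermediate_dim=2048, num_layers=8, n_fft=2048
)
print(counts)
```

Under these assumptions the numbers match the `Conv1d`, `ConvNeXtBlock`, `Linear`, and `Total params` rows of the summary.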

## Evals
TODO

## Training details
TODO
config.yaml
ADDED
@@ -0,0 +1,40 @@
backbone:
  class_path: vocos.models.VocosBackbone
  init_args:
    adanorm_num_embeddings: null
    dim: 1024
    input_channels: 128
    intermediate_dim: 2048
    layer_scale_init_value: null
    num_layers: 8
decay_mel_coeff: false
enable_discriminator: true
evaluate_periodicty: true
evaluate_pesq: true
evaluate_utmos: true
feature_extractor:
  class_path: vocos.feature_extractors.MelSpectrogramFeatures
  init_args:
    hop_length: 256
    n_fft: 2048
    n_mels: 128
    padding: center
    sample_rate: 48000
generator_period: 3
grad_acc: 1
head:
  class_path: vocos.heads.ISTFTHead
  init_args:
    dim: 1024
    hop_length: 256
    n_fft: 2048
    padding: center
initial_learning_rate: 0.0003
mel_loss_coeff: 15.0
mrd_loss_coeff: 0.1
num_warmup_steps: 500
pretrain_decoupled_steps: 0
pretrain_disc_steps: 500
pretrain_mel_steps: 0
pretrained_ckpt: null
sample_rate: 48000
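
The feature-extractor settings in this config fix the time and frequency resolution the model works at. Here is a minimal sketch of that arithmetic, using only `sample_rate`, `hop_length`, and `n_fft` from the config. The function name `stft_resolution` is hypothetical, and the `n_fft // 2 + 1` bin count assumes a one-sided STFT whose window length equals `n_fft`.

```python
def stft_resolution(sample_rate, hop_length, n_fft):
    """Derive frame timing and frequency resolution from STFT settings.

    Assumes a one-sided STFT with window length equal to n_fft.
    """
    return {
        "frames_per_second": sample_rate / hop_length,
        "hop_ms": 1000.0 * hop_length / sample_rate,
        "window_ms": 1000.0 * n_fft / sample_rate,
        "freq_bins": n_fft // 2 + 1,
        "bin_width_hz": sample_rate / n_fft,
        "nyquist_hz": sample_rate / 2,
    }


# Values taken from the feature_extractor section of config.yaml.
res = stft_resolution(sample_rate=48000, hop_length=256, n_fft=2048)
for name, value in res.items():
    print(f"{name}: {value}")
```

Any model that produces mel frames for this vocoder (for example a TTS acoustic model) would need to use the same hop length and sample rate, otherwise the decoded audio would come out at the wrong speed.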
pytorch_model.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:3315c87d130922dff1c4c0cfd153ac3ef037950ac0eba13f355bb38cbda46fc2
size 147342055
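
The three lines committed as `pytorch_model.bin` are a Git LFS pointer, not the weights themselves: they record the hash and byte size of the real file. A minimal sketch of reading such a pointer, assuming the `key value` layout shown in this commit, follows. `parse_lfs_pointer` is a hypothetical helper, and the comparison against the parameter count assumes float32 storage, which the repo does not state.

```python
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:3315c87d130922dff1c4c0cfd153ac3ef037950ac0eba13f355bb38cbda46fc2
size 147342055
"""


def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into a dict (hypothetical helper)."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, value = line.split(" ", 1)
        fields[key] = value
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


pointer = parse_lfs_pointer(POINTER_TEXT)

# Assumption: parameters are stored as 4-byte float32 values.
total_params = 36_692_994  # "Total params" from the README model summary
param_bytes = total_params * 4
remainder = pointer["size_bytes"] - param_bytes

print(pointer["hash_algo"], pointer["size_bytes"], param_bytes, remainder)
```

Under the float32 assumption, the parameters account for almost all of the file; the small remainder would be consistent with non-parameter buffers and serialization overhead, though that is not verified here.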