---
license: mit
---

# Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis

[Audio samples](https://charactr-platform.github.io/vocos/) | Paper [[abs]](https://arxiv.org/abs/2306.00814) [[pdf]](https://arxiv.org/pdf/2306.00814.pdf)

Vocos is a fast neural vocoder designed to synthesize audio waveforms from acoustic features. Trained using a Generative Adversarial Network (GAN) objective, Vocos can generate waveforms in a single forward pass. Unlike other typical GAN-based vocoders, Vocos does not model audio samples in the time domain. Instead, it generates spectral coefficients, facilitating rapid audio reconstruction through inverse Fourier transform.
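
That last point can be illustrated with plain PyTorch: a model that predicts STFT magnitudes and phases needs only a single `torch.istft` call to produce a waveform. This is a conceptual sketch with random stand-in predictions, not the actual Vocos head:

```python
import torch

# Conceptual sketch: spectral coefficients -> waveform in one inverse STFT.
# The magnitude/phase tensors stand in for what a network would predict.
n_fft, hop_length, frames = 1024, 256, 200
magnitude = torch.rand(1, n_fft // 2 + 1, frames)
phase = (torch.rand(1, n_fft // 2 + 1, frames) * 2 - 1) * torch.pi
spec = magnitude * torch.exp(1j * phase)  # complex spectrogram
audio = torch.istft(spec, n_fft=n_fft, hop_length=hop_length,
                    window=torch.hann_window(n_fft))
```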
## Installation

To use Vocos only in inference mode, install it using:

```bash
pip install vocos
```

If you wish to train the model, install it with additional dependencies:

```bash
pip install vocos[train]
```
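
In shells that treat square brackets as glob patterns (zsh, for example), quote the extra:

```bash
pip install "vocos[train]"
```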
## Usage

### Reconstruct audio from mel-spectrogram

```python
import torch

from vocos import Vocos

vocos = Vocos.from_pretrained("charactr/vocos-mel-24khz")

mel = torch.randn(1, 100, 256)  # B, C, T
audio = vocos.decode(mel)
```
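
The random tensor above is only a shape placeholder. To decode real audio, the features must match the checkpoint's mel configuration; the simplest way is to reuse the model's own extractor. A minimal sketch, assuming the loaded `Vocos` module exposes it as `feature_extractor` (check the source if the attribute name differs):

```python
# Sketch: reuse the checkpoint's mel extractor so its parameters match exactly.
y = torch.randn(1, 24000)  # stand-in for one second of 24 kHz audio, shape (B, T)
with torch.no_grad():
    mel = vocos.feature_extractor(y)  # assumed attribute name
    audio = vocos.decode(mel)
```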
Copy-synthesis from a file:

```python
import torchaudio

y, sr = torchaudio.load(YOUR_AUDIO_FILE)
if y.size(0) > 1:  # mix to mono
    y = y.mean(dim=0, keepdim=True)
y = torchaudio.functional.resample(y, orig_freq=sr, new_freq=24000)
y_hat = vocos(y)
```
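
To write the reconstruction to disk (the output filename here is illustrative):

```python
# y_hat has shape (channels, samples), which is what torchaudio.save expects.
torchaudio.save("reconstruction.wav", y_hat, sample_rate=24000)
```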
### Reconstruct audio from EnCodec tokens

Additionally, you need to provide a `bandwidth_id` which corresponds to the embedding for bandwidth from the list: `[1.5, 3.0, 6.0, 12.0]`.

```python
vocos = Vocos.from_pretrained("charactr/vocos-encodec-24khz")

audio_tokens = torch.randint(low=0, high=1024, size=(8, 200))  # 8 codebooks, 200 frames
features = vocos.codes_to_features(audio_tokens)
bandwidth_id = torch.tensor([2])  # 6 kbps

audio = vocos.decode(features, bandwidth_id=bandwidth_id)
```
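
`bandwidth_id` indexes into the list above, so `torch.tensor([2])` selects 6 kbps. The first dimension of the token tensor has to agree with that choice: EnCodec at 24 kHz uses 2, 4, 8, and 16 codebooks at 1.5, 3, 6, and 12 kbps respectively. A small hypothetical helper for picking the index:

```python
# Hypothetical helper (not part of the vocos API): map a target bitrate
# in kbps to the corresponding bandwidth_id tensor.
BANDWIDTHS = [1.5, 3.0, 6.0, 12.0]

def bandwidth_id_for(kbps: float) -> torch.Tensor:
    return torch.tensor([BANDWIDTHS.index(kbps)])

bandwidth_id = bandwidth_id_for(6.0)  # tensor([2])
```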
Copy-synthesis from a file: this extracts and quantizes features with EnCodec, then reconstructs them with Vocos in a single forward pass.

```python
y, sr = torchaudio.load(YOUR_AUDIO_FILE)
if y.size(0) > 1:  # mix to mono
    y = y.mean(dim=0, keepdim=True)
y = torchaudio.functional.resample(y, orig_freq=sr, new_freq=24000)

y_hat = vocos(y, bandwidth_id=bandwidth_id)
```
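
Inference runs on the CPU by default. Since the loaded model is a regular `torch.nn.Module`, standard PyTorch device handling applies; a sketch, assuming a CUDA device is available:

```python
# Move model and inputs to the GPU, then bring the waveform back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
vocos = vocos.to(device)
y_hat = vocos(y.to(device), bandwidth_id=bandwidth_id.to(device)).cpu()
```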
## Citation

If this code contributes to your research, please cite our work:

```
@article{siuzdak2023vocos,
  title={Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis},
  author={Siuzdak, Hubert},
  journal={arXiv preprint arXiv:2306.00814},
  year={2023}
}
```

## License

The code in this repository is released under the MIT license.