Create README.md
README.md
ADDED
@@ -0,0 +1,146 @@
---
license: cc-by-4.0
tags:
- encodec
- audio
- music
- audiocraft
---

<a target="_blank" href="https://colab.research.google.com/drive/1JlTOjB-G0A2Hz3h8PK63vLZk4xdCI5QB?usp=sharing">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>
<br>

# MultiBand Diffusion

This repository contains the weights for Meta's MultiBand Diffusion models, described in the research paper [From Discrete Tokens to High Fidelity Audio using MultiBand Diffusion][arxiv].

MultiBand Diffusion is a collection of 4 models that decode tokens from the <a href="https://github.com/facebookresearch/encodec">EnCodec tokenizer</a> into waveform audio.

## Model Details

### Model Description

- **Developed by:** Meta
- **Model type:** Diffusion Models
- **License:** The model weights in this repository are released under the CC-BY-NC 4.0 license.

### Model Sources

- **Repository:** [AudioCraft repo](https://github.com/facebookresearch/audiocraft/tree/main)
- **Paper:** [From Discrete Tokens to High Fidelity Audio using MultiBand Diffusion](https://dl.fbaipublicfiles.com/encodec/Diffusion/paper.pdf)

## Installation

Please follow the installation instructions in the [AudioCraft README](https://github.com/facebookresearch/audiocraft/blob/main/README.md).

## Usage

The [AudioCraft library](https://github.com/facebookresearch/audiocraft/tree/main) offers a number of ways to use MultiBand Diffusion:
1. A MusicGen demo includes a toggle to try the diffusion decoder. You can run the demo locally with [`python -m demos.musicgen_app --share`](https://github.com/facebookresearch/audiocraft/tree/main/demos/musicgen_app.py), or through a [MusicGen Colab](https://colab.research.google.com/drive/1JlTOjB-G0A2Hz3h8PK63vLZk4xdCI5QB?usp=sharing).
2. You can play with MusicGen by running the Jupyter notebook at [`demos/musicgen_demo.ipynb`](https://github.com/facebookresearch/audiocraft/tree/main/demos/musicgen_demo.ipynb) locally (if you have a GPU).

## API

The [AudioCraft library](https://github.com/facebookresearch/audiocraft/tree/main) provides a simple API and pre-trained models for MusicGen and for EnCodec at 24 kHz for 3 bitrates (1.5 kbps, 3 kbps and 6 kbps).

Below is a quick example of using MultiBandDiffusion with the MusicGen API:

```python
import torchaudio
from audiocraft.models import MusicGen, MultiBandDiffusion
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained('facebook/musicgen-melody')
mbd = MultiBandDiffusion.get_mbd_musicgen()
model.set_generation_params(duration=8)  # generate 8 seconds.

# Generate 4 unconditional audio samples and keep the tokens for MBD generation.
wav, tokens = model.generate_unconditional(4, return_tokens=True)
wav_diffusion = mbd.tokens_to_wav(tokens)

# Generate 3 text-conditioned samples and keep the tokens.
descriptions = ['happy rock', 'energetic EDM', 'sad jazz']
wav, tokens = model.generate(descriptions, return_tokens=True)
wav_diffusion = mbd.tokens_to_wav(tokens)

# Generate using the melody from the given audio and the provided descriptions;
# returns audio and audio tokens.
melody, sr = torchaudio.load('./assets/bach.mp3')
wav, tokens = model.generate_with_chroma(descriptions, melody[None].expand(3, -1, -1), sr, return_tokens=True)
wav_diffusion = mbd.tokens_to_wav(tokens)

# Only the outputs of the last generation call above are saved here.
for idx, one_wav in enumerate(wav):
    # Will save under {idx}.wav and {idx}_diffusion.wav, with loudness normalization at -14 dB LUFS for comparing the methods.
    audio_write(f'{idx}', one_wav.cpu(), model.sample_rate, strategy="loudness", loudness_compressor=True)
    audio_write(f'{idx}_diffusion', wav_diffusion[idx].cpu(), model.sample_rate, strategy="loudness", loudness_compressor=True)
```
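
The comment in the loop above mentions loudness normalization at -14 dB LUFS. As a rough illustration of what that target means, the sketch below computes the gain needed to move a clip from a measured integrated loudness to the target. The helper name `gain_to_target` and the example loudness value are hypothetical, and this is not AudioCraft's implementation: `audio_write` handles the normalization internally when `strategy="loudness"` is passed.

```python
def gain_to_target(measured_lufs: float, target_lufs: float = -14.0):
    """Return (gain in dB, linear amplitude factor) needed to reach the target loudness."""
    gain_db = target_lufs - measured_lufs
    # An amplitude factor relates to a dB gain as 10^(dB / 20).
    return gain_db, 10.0 ** (gain_db / 20.0)


# Hypothetical clip measured at -20 LUFS: it needs +6 dB, roughly a 2x amplitude scale.
gain_db, factor = gain_to_target(-20.0)
print(f"{gain_db:+.1f} dB -> x{factor:.3f}")  # prints "+6.0 dB -> x1.995"
```

Normalizing both outputs to the same target keeps a louder file from sounding "better" simply because of its level when comparing the two decoders.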

For the compression task (and to compare with [EnCodec](https://github.com/facebookresearch/encodec)):

```python
import torch
from audiocraft.models import MultiBandDiffusion
from encodec import EncodecModel
from audiocraft.data.audio import audio_read, audio_write

bandwidth = 3.0  # 1.5, 3.0, 6.0
mbd = MultiBandDiffusion.get_mbd_24khz(bw=bandwidth)
encodec = EncodecModel.encodec_model_24khz()
encodec.set_target_bandwidth(bandwidth)  # compare both decoders at the same bitrate

somepath = ''  # path to the audio file to compress (EnCodec expects 24 kHz mono input)
wav, sr = audio_read(somepath)
wav = wav.unsqueeze(0)  # add a batch dimension: [1, C, T]
with torch.no_grad():
    compressed_encodec = encodec(wav)
    compressed_diffusion = mbd.regenerate(wav, sample_rate=sr)

audio_write('sample_encodec', compressed_encodec.squeeze(0).cpu(), mbd.sample_rate, strategy="loudness", loudness_compressor=True)
audio_write('sample_diffusion', compressed_diffusion.squeeze(0).cpu(), mbd.sample_rate, strategy="loudness", loudness_compressor=True)
```
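
The three supported bandwidths correspond to different numbers of EnCodec codebooks. The arithmetic below is a sketch, assuming the 24 kHz EnCodec configuration reported in the EnCodec paper (75 token frames per second, 1024 entries per codebook). These constants are not stated in this document, and the helper name `codebooks_for_bandwidth` is hypothetical.

```python
import math

FRAME_RATE_HZ = 75    # assumed: token frames per second for 24 kHz EnCodec
CODEBOOK_SIZE = 1024  # assumed: entries per codebook, i.e. 10 bits per token


def codebooks_for_bandwidth(kbps: float) -> int:
    """Number of residual codebooks whose tokens fit within `kbps`."""
    bits_per_codebook_per_second = FRAME_RATE_HZ * math.log2(CODEBOOK_SIZE)  # 750.0
    return int(kbps * 1000 // bits_per_codebook_per_second)


for bw in (1.5, 3.0, 6.0):
    print(bw, "kbps ->", codebooks_for_bandwidth(bw), "codebooks")
# 1.5 kbps -> 2 codebooks
# 3.0 kbps -> 4 codebooks
# 6.0 kbps -> 8 codebooks
```

Under these assumptions, each doubling of the bandwidth doubles the number of token streams available to the decoder.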

## Training

The [DiffusionSolver](https://github.com/facebookresearch/audiocraft/tree/main/audiocraft/solvers/diffusion.py) implements Meta's diffusion training pipeline.
It generates waveform audio conditioned on the embeddings extracted from a pre-trained EnCodec model
(see the [EnCodec documentation from the AudioCraft library](https://github.com/facebookresearch/audiocraft/tree/main/ENCODEC.md) for more details on how to train such a model).

Note that **the library does NOT provide any of the datasets** used for training our diffusion models.
We provide a dummy dataset containing just a few examples for illustrative purposes.

### Example configurations and grids

One can train diffusion models as described in the paper by using this [dora grid](https://github.com/facebookresearch/audiocraft/tree/main/audiocraft/grids/diffusion/4_bands_base_32khz.py).
```shell
# 4 bands MBD training
dora grid diffusion.4_bands_base_32khz
```

### Learn more

Learn more about AudioCraft training pipelines in the [dedicated section](https://github.com/facebookresearch/audiocraft/tree/main/TRAINING.md).

## Citation

```
@article{sanroman2023fromdi,
  title={From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion},
  author={San Roman, Robin and Adi, Yossi and Deleforge, Antoine and Serizel, Romain and Synnaeve, Gabriel and Défossez, Alexandre},
  journal={arXiv preprint arXiv:},
  year={2023}
}
```

[arxiv]: https://dl.fbaipublicfiles.com/encodec/Diffusion/paper.pdf
[mbd_samples]: https://ai.honu.io/papers/mbd/