Spaces:
Runtime error
Runtime error
File size: 5,291 Bytes
71df811 05be1df 71df811 02a7301 825c8bf ea68dfd b2f1783 ea68dfd b2f1783 869c0ac 825c8bf 869c0ac 825c8bf aedf71e 869c0ac b2f1783 8bc5536 869c0ac dbe37b6 825c8bf 7c89b23 825c8bf 1dea888 3526755 825c8bf 869c0ac 825c8bf 65fa65c 825c8bf 7c89b23 825c8bf 65fa65c 3526755 65fa65c 825c8bf 760dafb 825c8bf 7c89b23 825c8bf 3526755 825c8bf d73aa64 825c8bf 3526755 825c8bf 633cada 825c8bf d73aa64 0a4662e d73aa64 1dea888 825c8bf 7c89b23 825c8bf 1dea888 825c8bf 3526755 825c8bf 65fa65c 1dea888 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 |
---
title: Audio Diffusion
emoji: 🎵
colorFrom: pink
colorTo: blue
sdk: gradio
sdk_version: 3.1.4
app_file: app.py
pinned: false
license: gpl-3.0
---
# audio-diffusion [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/teticio/audio-diffusion/blob/master/notebooks/gradio_app.ipynb)
### Apply [Denoising Diffusion Probabilistic Models](https://arxiv.org/abs/2006.11239) using the new Hugging Face [diffusers](https://github.com/huggingface/diffusers) package to synthesize music instead of images.
---
**UPDATES**:
4/10/2022
It is now possible to mask parts of the input audio during generation which means you can stitch several samples together (think "out-painting").
27/9/2022
You can now generate an audio based on a previous one. You can use this to generate variations of the same audio or even to "remix" a track (via a sort of "style transfer"). You can find examples of how to do this in the [`test_model.ipynb`](https://colab.research.google.com/github/teticio/audio-diffusion/blob/master/notebooks/test_model.ipynb) notebook.
---
![mel spectrogram](mel.png)
---
Audio can be represented as images by transforming to a [mel spectrogram](https://en.wikipedia.org/wiki/Mel-frequency_cepstrum), such as the one shown above. The class `Mel` in `mel.py` can convert a slice of audio into a mel spectrogram of `x_res` x `y_res` and vice versa. The higher the resolution, the less audio information will be lost. You can see how this works in the [`test_mel.ipynb`](https://github.com/teticio/audio-diffusion/blob/main/notebooks/test_mel.ipynb) notebook.
A DDPM model is trained on a set of mel spectrograms that have been generated from a directory of audio files. It is then used to synthesize similar mel spectrograms, which are then converted back into audio.
You can play around with some pretrained models on [Google Colab](https://colab.research.google.com/github/teticio/audio-diffusion/blob/master/notebooks/test_model.ipynb) or [Hugging Face spaces](https://huggingface.co/spaces/teticio/audio-diffusion). Check out some automatically generated loops [here](https://soundcloud.com/teticio2/sets/audio-diffusion-loops).
| Model | Dataset | Description |
|-------|---------|-------------|
| [teticio/audio-diffusion-256](https://huggingface.co/teticio/audio-diffusion-256) | [teticio/audio-diffusion-256](https://huggingface.co/datasets/teticio/audio-diffusion-256) | My "liked" Spotify playlist |
| [teticio/audio-diffusion-breaks-256](https://huggingface.co/teticio/audio-diffusion-breaks-256) | [teticio/audio-diffusion-breaks-256](https://huggingface.co/datasets/teticio/audio-diffusion-breaks-256) | Samples that have been used in music, sourced from [WhoSampled](https://whosampled.com) and [YouTube](https://youtube.com) |
| [teticio/audio-diffusion-instrumental-hiphop-256](https://huggingface.co/teticio/audio-diffusion-instrumental-hiphop-256) | [teticio/audio-diffusion-instrumental-hiphop-256](https://huggingface.co/datasets/teticio/audio-diffusion-instrumental-hiphop-256) | Instrumental Hip Hop music |
---
## Generate Mel spectrogram dataset from directory of audio files
#### Training can be run with Mel spectrograms of resolution 64x64 on a single commercial grade GPU (e.g. RTX 2080 Ti). The `hop_length` should be set to 1024 for better results.
```bash
python audio_to_images.py \
--resolution 64 \
--hop_length 1024 \
--input_dir path-to-audio-files \
--output_dir data-test
```
#### Generate dataset of 256x256 Mel spectrograms and push to hub (you will need to be authenticated with `huggingface-cli login`).
```bash
python audio_to_images.py \
--resolution 256 \
--input_dir path-to-audio-files \
--output_dir data-256 \
--push_to_hub teticio/audio-diffusion-256
```
## Train model
#### Run training on local machine.
```bash
accelerate launch --config_file accelerate_local.yaml \
train_unconditional.py \
--dataset_name data-64 \
--resolution 64 \
--hop_length 1024 \
--output_dir ddpm-ema-audio-64 \
--train_batch_size 16 \
--num_epochs 100 \
--gradient_accumulation_steps 1 \
--learning_rate 1e-4 \
--lr_warmup_steps 500 \
--mixed_precision no
```
#### Run training on local machine with `batch_size` of 2 and `gradient_accumulation_steps` 8 to compensate, so that 256x256 resolution model fits on commercial grade GPU and push to hub.
```bash
accelerate launch --config_file accelerate_local.yaml \
train_unconditional.py \
--dataset_name teticio/audio-diffusion-256 \
--resolution 256 \
--output_dir ddpm-ema-audio-256 \
--num_epochs 100 \
--train_batch_size 2 \
--eval_batch_size 2 \
--gradient_accumulation_steps 8 \
--learning_rate 1e-4 \
--lr_warmup_steps 500 \
--mixed_precision no \
--push_to_hub True \
--hub_model_id audio-diffusion-256 \
--hub_token $(cat $HOME/.huggingface/token)
```
#### Run training on SageMaker.
```bash
accelerate launch --config_file accelerate_sagemaker.yaml \
strain_unconditional.py \
--dataset_name teticio/audio-diffusion-256 \
--resolution 256 \
--output_dir ddpm-ema-audio-256 \
--train_batch_size 16 \
--num_epochs 100 \
--gradient_accumulation_steps 1 \
--learning_rate 1e-4 \
--lr_warmup_steps 500 \
--mixed_precision no
```
|