metadata

title: Audio Diffusion
emoji: 🎵
colorFrom: pink
colorTo: blue
sdk: gradio
sdk_version: 3.1.4
app_file: app.py
pinned: false
license: gpl-3.0

audio-diffusion

Apply Denoising Diffusion Probabilistic Models using the new Hugging Face diffusers package to synthesize music instead of images.

Audio can be represented as images by transforming it into a mel spectrogram, such as the one shown above. The class `Mel` in `mel.py` can convert a slice of audio into a mel spectrogram of `x_res` x `y_res` and vice versa. The higher the resolution, the less audio information is lost. You can see how this works in the `test-mel.ipynb` notebook.
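
To give a feel for what the vertical axis of such an image means, here is a minimal sketch of how mel bands are spaced. It is not the code in `mel.py`: it assumes the common HTK-style mel formula and a sample rate of 22,050 Hz, and the helper names are made up for illustration. Libraries differ in the exact mel variant they use.

```python
import math


def hz_to_mel(freq_hz):
    """HTK-style conversion from Hz to mel (an assumption, see text)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_centers(n_mels, sample_rate=22050):
    """Center frequencies (Hz) of n_mels bands spaced evenly on the mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    # n_mels + 2 evenly spaced points: the two outer edges plus n_mels centers.
    points = [top * i / (n_mels + 1) for i in range(n_mels + 2)]
    return [mel_to_hz(m) for m in points[1:-1]]


# Hypothetical y_res values of 64 and 256 rows.
centers_64 = mel_band_centers(64)
centers_256 = mel_band_centers(256)

# Distance between the two highest band centers, in Hz.
spacing_64 = centers_64[-1] - centers_64[-2]
spacing_256 = centers_256[-1] - centers_256[-2]
print(f"top band spacing: {spacing_64:.0f} Hz (64 rows), {spacing_256:.0f} Hz (256 rows)")
```

With more rows, each band covers a narrower range of frequencies, which is one way to read the remark that higher resolution loses less audio information.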

A DDPM model is trained on a set of mel spectrograms that have been generated from a directory of audio files. It is then used to synthesize similar mel spectrograms, which are then converted back into audio. See the test-model.ipynb notebook for an example.
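
For readers new to DDPMs, the sketch below shows the forward (noising) half of the process on a toy list of pixel values. It is a generic illustration of the textbook formulation with a linear beta schedule, not the training code in this repository, which relies on the diffusers package. The schedule values and function names are assumptions chosen for the example.

```python
import math
import random


def linear_betas(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linearly increasing noise variances, one per timestep (assumed schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """Running product of (1 - beta_t), usually written alpha_bar_t."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def add_noise(x0, t, alphas_cumprod, rng):
    """Sample x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise."""
    a_bar = alphas_cumprod[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e
           for x, e in zip(x0, noise)]
    return x_t, noise


rng = random.Random(0)
alphas_cumprod = cumulative_alphas(linear_betas())

# Toy "spectrogram" pixels scaled to [-1, 1].
pixels = [-1.0, -0.5, 0.0, 0.5, 1.0]
slightly_noisy, _ = add_noise(pixels, 0, alphas_cumprod, rng)
almost_pure_noise, target_noise = add_noise(pixels, 999, alphas_cumprod, rng)
```

During training, the network sees a noisy input such as `almost_pure_noise` together with the timestep and learns to predict `target_noise`. Synthesis runs the process in reverse, starting from pure noise.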

You can play around with the model, which I trained on about 500 songs from my Spotify "liked" playlist, on Google Colab or Hugging Face Spaces. Check out some samples I generated here.

Generate Mel spectrogram dataset from directory of audio files

Training can be run with Mel spectrograms of resolution 64x64 on a single consumer-grade GPU (e.g. RTX 2080 Ti). The `hop_length` should be set to 1024 for better results.

python audiodiffusion/audio_to_images.py \
  --resolution 64 \
  --hop_length 1024 \
  --input_dir path-to-audio-files \
  --output_dir data-64
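
A plausible reason a larger `hop_length` helps at low resolution is that each spectrogram column then summarizes more audio, so a 64-column image spans a longer slice. The sketch below works this out. It assumes the usual centered-STFT frame count (one frame per `hop_length` samples, plus one) and a sample rate of 22,050 Hz. Neither is stated here, and the function names are hypothetical.

```python
def slice_samples(x_res, hop_length):
    """Samples needed so that a centered STFT yields exactly x_res frames."""
    return x_res * hop_length - 1


def n_frames(n_samples, hop_length):
    """Frame count of a centered STFT (assumed convention)."""
    return 1 + n_samples // hop_length


def slice_seconds(x_res, hop_length, sample_rate=22050):
    """Duration of audio covered by one spectrogram image."""
    return slice_samples(x_res, hop_length) / sample_rate


short = slice_seconds(64, 512)    # illustrative smaller hop
longer = slice_seconds(64, 1024)  # the hop_length recommended above
print(f"64 columns: {short:.2f} s at hop 512, {longer:.2f} s at hop 1024")
```

Under these assumptions, doubling `hop_length` roughly doubles the length of audio each 64x64 image represents, at the cost of coarser time resolution.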

Generate a dataset of 256x256 Mel spectrograms and push it to the hub (you need to be authenticated with `huggingface-cli login`).

python audiodiffusion/audio_to_images.py \
  --resolution 256 \
  --input_dir path-to-audio-files \
  --output_dir data-256 \
  --push_to_hub teticio/audio-diffusion-256

Train model

Run training on local machine.

accelerate launch --config_file accelerate_local.yaml \
  audiodiffusion/train_unconditional.py \
  --dataset_name data-64 \
  --resolution 64 \
  --hop_length 1024 \
  --output_dir ddpm-ema-audio-64 \
  --train_batch_size 16 \
  --num_epochs 100 \
  --gradient_accumulation_steps 1 \
  --learning_rate 1e-4 \
  --lr_warmup_steps 500 \
  --mixed_precision no

Run training on a local machine with a `train_batch_size` of 2 and `gradient_accumulation_steps` of 8 to compensate (an effective batch size of 16), so that the 256x256 resolution model fits on a consumer-grade GPU, and push the result to the hub.

accelerate launch --config_file accelerate_local.yaml \
  audiodiffusion/train_unconditional.py \
  --dataset_name teticio/audio-diffusion-256 \
  --resolution 256 \
  --output_dir ddpm-ema-audio-256 \
  --num_epochs 100 \
  --train_batch_size 2 \
  --eval_batch_size 2 \
  --gradient_accumulation_steps 8 \
  --learning_rate 1e-4 \
  --lr_warmup_steps 500 \
  --mixed_precision no \
  --push_to_hub True \
  --hub_model_id audio-diffusion-256 \
  --hub_token $(cat $HOME/.huggingface/token)
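
To see why accumulating gradients compensates for a small batch, here is a toy sketch in plain Python, unrelated to the actual training script. It uses a made-up one-parameter model and checks that averaging the gradients of 8 micro-batches of 2 gives the same result as one batch of 16.

```python
def mse_gradient(weight, batch):
    """Gradient of the mean squared error for the toy model y = weight * x."""
    return sum(2.0 * (weight * x - y) * x for x, y in batch) / len(batch)


# 16 toy (x, y) pairs and an arbitrary starting weight.
data = [(float(i), 3.0 * i + 1.0) for i in range(16)]
weight = 0.5

# Gradient computed on all 16 examples at once.
full_batch_grad = mse_gradient(weight, data)

# Same gradient built from 8 micro-batches of 2 examples each.
train_batch_size = 2
gradient_accumulation_steps = 8
accumulated = 0.0
for step in range(gradient_accumulation_steps):
    start = step * train_batch_size
    micro_batch = data[start:start + train_batch_size]
    # Scale each micro-batch gradient so the sum is an average over all steps.
    accumulated += mse_gradient(weight, micro_batch) / gradient_accumulation_steps

effective_batch_size = train_batch_size * gradient_accumulation_steps
print(effective_batch_size, full_batch_grad, accumulated)
```

Only one micro-batch needs to be in memory at a time, which is what lets the larger model fit. The trade-off is more forward and backward passes per optimizer step.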

Run training on SageMaker.

accelerate launch --config_file accelerate_sagemaker.yaml \
  audiodiffusion/train_unconditional.py \
  --dataset_name teticio/audio-diffusion-256 \
  --resolution 256 \
  --output_dir ddpm-ema-audio-256 \
  --train_batch_size 16 \
  --num_epochs 100 \
  --gradient_accumulation_steps 1 \
  --learning_rate 1e-4 \
  --lr_warmup_steps 500 \
  --mixed_precision no