teticio committed
Commit bf017fc
1 Parent(s): 0f3ac5f

release to pypi

Files changed (3)
  1. .gitignore +1 -0
  2. README.md +74 -60
  3. setup.cfg +1 -0
.gitignore CHANGED
@@ -5,6 +5,7 @@ data
 models
 flagged
 build
+ dist
 audiodiffusion.egg-info
 lightning_logs
 taming
README.md CHANGED
@@ -11,7 +11,7 @@ license: gpl-3.0
 ---
 # audio-diffusion [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/teticio/audio-diffusion/blob/master/notebooks/gradio_app.ipynb)
 
- ### Apply diffusion models to synthesize music instead of images using the new Hugging Face [diffusers](https://github.com/huggingface/diffusers) package.
+ ## Apply diffusion models to synthesize music instead of images using the new Hugging Face [diffusers](https://github.com/huggingface/diffusers) package
 
 ---
 
 
@@ -41,7 +41,6 @@ A DDPM is trained on a set of mel spectrograms that have been generated from a d
 
 You can play around with some pre-trained models on [Google Colab](https://colab.research.google.com/github/teticio/audio-diffusion/blob/master/notebooks/test_model.ipynb) or [Hugging Face spaces](https://huggingface.co/spaces/teticio/audio-diffusion). Check out some automatically generated loops [here](https://soundcloud.com/teticio2/sets/audio-diffusion-loops).
 
-
 | Model | Dataset | Description |
 |-------|---------|-------------|
 | [teticio/audio-diffusion-256](https://huggingface.co/teticio/audio-diffusion-256) | [teticio/audio-diffusion-256](https://huggingface.co/datasets/teticio/audio-diffusion-256) | My "liked" Spotify playlist |
 
@@ -54,117 +53,132 @@ You can play around with some pre-trained models on [Google Colab](https://colab
 ---
 
 ## Generate Mel spectrogram dataset from directory of audio files
+
 #### Install
+
 ```bash
 pip install .
 ```
 
+ #### Training can be run with Mel spectrograms of resolution 64x64 on a single commercial grade GPU (e.g. RTX 2080 Ti). The `hop_length` should be set to 1024 for better results
+
 ```bash
 python scripts/audio_to_images.py \
+ --resolution 64,64 \
+ --hop_length 1024 \
+ --input_dir path-to-audio-files \
+ --output_dir path-to-output-data
 ```
 
+ #### Generate dataset of 256x256 Mel spectrograms and push to hub (you will need to be authenticated with `huggingface-cli login`)
+
 ```bash
 python scripts/audio_to_images.py \
+ --resolution 256 \
+ --input_dir path-to-audio-files \
+ --output_dir data/audio-diffusion-256 \
+ --push_to_hub teticio/audio-diffusion-256
 ```
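
Once the dataset is on the hub, it can be pulled back down with the `datasets` library for a quick sanity check. This is a minimal sketch; the `image` column name and the `train` split are assumptions about how `audio_to_images.py` stores the spectrograms, so adjust them to whatever `print(ds)` reports.

```python
from datasets import load_dataset

# Load the spectrogram dataset pushed by audio_to_images.py.
ds = load_dataset("teticio/audio-diffusion-256", split="train")
print(ds)  # lists the actual feature names and number of rows

# Assumed column name: inspect one spectrogram if it is stored as "image".
ds[0]["image"].show()
```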
 
 Note that the default `sample_rate` is 22050 and audios will be resampled if they are at a different rate. If you change this value, you may find that the results in the `test_mel.ipynb` notebook are not good (for example, if `sample_rate` is 48000) and that it is necessary to adjust `n_fft` (for example, to 2000 instead of the default value of 2048; alternatively, you can resample to a `sample_rate` of 44100). Make sure you use the same parameters for training and inference. You should also bear in mind that not all resolutions work with the neural network architecture as currently configured - you should be safe if you stick to powers of 2.
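
For intuition about how these parameters interact, here is roughly what the underlying computation looks like with `librosa`. This is an illustration only, not the repository's implementation: the file name is a placeholder, and the `hop_length` and `n_mels` values are examples rather than the script's defaults.

```python
import librosa
import numpy as np

sample_rate = 22050  # default; audio at other rates is resampled to this
n_fft = 2048         # may need lowering (e.g. to 2000) at higher sample rates
hop_length = 512     # the README recommends 1024 for 64x64 spectrograms
n_mels = 256         # vertical resolution of the resulting spectrogram image

y, sr = librosa.load("example.wav", sr=sample_rate)
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels
)
mel_db = librosa.power_to_db(mel, ref=np.max)

# Each column covers hop_length samples, so a 256-frame (256x256) image
# spans roughly 256 * 512 / 22050, i.e. about 6 seconds of audio.
print(mel_db.shape)
```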
 
 ## Train model
+
+ #### Run training on local machine
+
 ```bash
 accelerate launch --config_file config/accelerate_local.yaml \
+ scripts/train_unconditional.py \
+ --dataset_name data/audio-diffusion-64 \
+ --hop_length 1024 \
+ --output_dir models/ddpm-ema-audio-64 \
+ --train_batch_size 16 \
+ --num_epochs 100 \
+ --gradient_accumulation_steps 1 \
+ --learning_rate 1e-4 \
+ --lr_warmup_steps 500 \
+ --mixed_precision no
 ```
 
+ #### Run training on local machine with `batch_size` of 2 and `gradient_accumulation_steps` 8 to compensate, so that the 256x256 resolution model fits on a commercial grade GPU, and push to hub
+
 ```bash
 accelerate launch --config_file config/accelerate_local.yaml \
+ scripts/train_unconditional.py \
+ --dataset_name teticio/audio-diffusion-256 \
+ --output_dir models/audio-diffusion-256 \
+ --num_epochs 100 \
+ --train_batch_size 2 \
+ --eval_batch_size 2 \
+ --gradient_accumulation_steps 8 \
+ --learning_rate 1e-4 \
+ --lr_warmup_steps 500 \
+ --mixed_precision no \
+ --push_to_hub True \
+ --hub_model_id audio-diffusion-256 \
+ --hub_token $(cat $HOME/.huggingface/token)
 ```
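
As a quick check on the arithmetic in the heading, gradient accumulation keeps the effective batch size of the 64x64 run while only ever holding two samples in GPU memory (plain Python, just spelling out the numbers):

```python
# Gradients from 8 micro-batches of 2 are accumulated before each optimizer
# step, so the update is equivalent to a batch of 16 at the memory cost of 2.
train_batch_size = 2
gradient_accumulation_steps = 8
effective_batch_size = train_batch_size * gradient_accumulation_steps
print(effective_batch_size)  # 16
```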
 
+ #### Run training on SageMaker
+
 ```bash
 accelerate launch --config_file config/accelerate_sagemaker.yaml \
+ scripts/train_unconditional.py \
+ --dataset_name teticio/audio-diffusion-256 \
+ --output_dir models/ddpm-ema-audio-256 \
+ --train_batch_size 16 \
+ --num_epochs 100 \
+ --gradient_accumulation_steps 1 \
+ --learning_rate 1e-4 \
+ --lr_warmup_steps 500 \
+ --mixed_precision no
 ```
 
 ## DDIM ([De-noising Diffusion Implicit Models](https://arxiv.org/pdf/2010.02502.pdf))
+
 #### A DDIM can be trained by adding the parameter
+
 ```bash
+ --scheduler ddim
 ```
 
 Inference can then be run with far fewer steps than the number used for training (e.g., ~50), allowing for much faster generation. Without retraining, the parameter `eta` can be used to replicate a DDPM if it is set to 1 or a DDIM if it is set to 0, with all values in between being valid. When `eta` is 0 (the default value), the de-noising procedure is deterministic, which means that it can be run in reverse as a kind of encoder that recovers the original noise used in generation. A function `encode` has been added to `AudioDiffusionPipeline` for this purpose. It is then possible to interpolate between audios in the latent "noise" space using the function `slerp` (Spherical Linear intERPolation).
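
A minimal, self-contained sketch of spherical linear interpolation between two noise tensors, assuming PyTorch; the repository's own `slerp` helper may differ in details (for example, how it handles nearly parallel vectors), and the tensor shape below is only a placeholder.

```python
import torch

def slerp(x0: torch.Tensor, x1: torch.Tensor, alpha: float) -> torch.Tensor:
    """Spherical linear interpolation between two tensors of the same shape."""
    # Angle between the tensors, treated as high-dimensional vectors.
    theta = torch.acos(
        torch.dot(x0.flatten(), x1.flatten()) / (torch.norm(x0) * torch.norm(x1))
    )
    return (
        torch.sin((1 - alpha) * theta) * x0 + torch.sin(alpha * theta) * x1
    ) / torch.sin(theta)

# Interpolate halfway between the recovered "noise" encodings of two audios.
noise_a = torch.randn(1, 1, 256, 256)  # stand-ins for outputs of encode()
noise_b = torch.randn(1, 1, 256, 256)
noise_mid = slerp(noise_a, noise_b, 0.5)
```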
 
 ## Latent Audio Diffusion
+
 Rather than de-noising images directly, it is interesting to work in the "latent space" after first encoding images using an autoencoder. This has a number of advantages. Firstly, the information in the images is compressed into a latent space of a much lower dimension, so it is much faster to train de-noising diffusion models and run inference with them. Secondly, similar images tend to be clustered together and interpolating between two images in latent space can produce meaningful combinations.
 
 At the time of writing, the Hugging Face `diffusers` library is geared towards inference and lacking in training functionality (rather like its cousin `transformers` in the early days of development). In order to train a VAE (Variational AutoEncoder), I use the [stable-diffusion](https://github.com/CompVis/stable-diffusion) repo from CompVis and convert the checkpoints to `diffusers` format. Note that it uses a perceptual loss function for images; it would be nice to try a perceptual *audio* loss function.
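
To make the "much lower dimension" point concrete, here is the bookkeeping for a typical KL autoencoder with a spatial downsampling factor of 8 and 4 latent channels; both figures are assumptions chosen for illustration, not values read from this repository's configuration.

```python
# A 256x256 single-channel mel spectrogram versus a typical latent encoding.
image_elements = 256 * 256 * 1           # 65,536 values to de-noise directly
latent_elements = (256 // 8) ** 2 * 4    # 32x32 spatial grid, 4 channels = 4,096
print(image_elements / latent_elements)  # 16.0x fewer values per de-noising step
```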
 
+ #### Train latent diffusion model using pre-trained VAE
+
 ```bash
 accelerate launch ...
+ ...
+ --vae teticio/latent-audio-diffusion-256
 ```
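
At the tensor level, plugging in a pre-trained VAE amounts to encoding spectrograms into latents before diffusion and decoding generated latents back into spectrograms. Below is a rough sketch using the `diffusers` `AutoencoderKL` class; the `subfolder` argument, the channel count and the latent shape are assumptions about how the hub repository is laid out, not facts taken from it.

```python
import torch
from diffusers import AutoencoderKL

# Assumed layout: drop subfolder="vae" if the weights live at the repo root.
vae = AutoencoderKL.from_pretrained(
    "teticio/latent-audio-diffusion-256", subfolder="vae"
)

spectrograms = torch.randn(1, 3, 256, 256)  # dummy batch; channel count assumed
latents = vae.encode(spectrograms).latent_dist.sample()  # e.g. (1, 4, 32, 32)
reconstruction = vae.decode(latents).sample              # back to (1, 3, 256, 256)
```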
 
+ #### Install dependencies to train with Stable Diffusion
+
+ ```bash
 pip install omegaconf pytorch_lightning
 pip install -e git+https://github.com/CompVis/stable-diffusion.git@main#egg=latent-diffusion
 pip install -e git+https://github.com/CompVis/taming-transformers.git@master#egg=taming-transformers
 ```
 
+ #### Train an autoencoder
+
 ```bash
 python scripts/train_vae.py \
+ --dataset_name teticio/audio-diffusion-256 \
+ --batch_size 2 \
+ --gradient_accumulation_steps 12
 ```
 
+ #### Train latent diffusion model
+
 ```bash
 accelerate launch ...
+ ...
+ --vae models/autoencoder-kl
 ```
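
Once a trained model is on the hub, a spectrogram can be sampled with the generic `diffusers` pipeline API. This assumes the checkpoint loads as a standard unconditional pipeline; the repository's own `AudioDiffusionPipeline`, which also handles the conversion back to audio, is likely the more convenient route in practice.

```python
from diffusers import DiffusionPipeline

# Assumption: the published checkpoint is loadable as an unconditional pipeline.
pipe = DiffusionPipeline.from_pretrained("teticio/audio-diffusion-256")

# Sample one mel spectrogram image and save it for inspection.
images = pipe(batch_size=1).images
images[0].save("generated_spectrogram.png")
```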
setup.cfg CHANGED
@@ -3,6 +3,7 @@ name = audiodiffusion
 version = attr: audiodiffusion.VERSION
 description = Generate Mel spectrogram dataset from directory of audio files.
 long_description = file: README.md
+ long_description_content_type = text/markdown
 license = GPL3
 classifiers =
     Programming Language :: Python :: 3