Robert Smith committed on
Commit
43ebb3b
1 Parent(s): f34a81b
Files changed (4)
  1. .gitignore +2 -0
  2. README.md +19 -12
  3. audiodiffusion/__init__.py +33 -24
  4. notebooks/test_model.ipynb +164 -17
.gitignore CHANGED
@@ -9,3 +9,5 @@ audiodiffusion.egg-info
9
  lightning_logs
10
  taming
11
  checkpoints
 
 
 
9
  lightning_logs
10
  taming
11
  checkpoints
12
+ Pipfile
13
+ Pipfile.lock
README.md CHANGED
@@ -11,20 +11,19 @@ license: gpl-3.0
11
  ---
12
  # audio-diffusion [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/teticio/audio-diffusion/blob/master/notebooks/gradio_app.ipynb)
13
 
14
- ### Apply [Denoising Diffusion Probabilistic Models](https://arxiv.org/abs/2006.11239) using the new Hugging Face [diffusers](https://github.com/huggingface/diffusers) package to synthesize music instead of images.
15
 
16
  ---
17
 
18
  **UPDATES**:
19
 
20
- 15/10/2022
21
- Added latent audio diffusion (see below). Also added the possibility to train a model to use DDIM ([Denoising Diffusion Implicit Models](https://arxiv.org/pdf/2010.02502.pdf)) by setting `--scheduler ddim`. These have the benefit that samples can be generated with much fewer steps (~50) than used in training.
22
 
23
- 4/10/2022
24
- It is now possible to mask parts of the input audio during generation which means you can stitch several samples together (think "out-painting").
25
 
26
- 27/9/2022
27
- You can now generate an audio based on a previous one. You can use this to generate variations of the same audio or even to "remix" a track (via a sort of "style transfer"). You can find examples of how to do this in the [`test_model.ipynb`](https://colab.research.google.com/github/teticio/audio-diffusion/blob/master/notebooks/test_model.ipynb) notebook.
 
28
 
29
  ---
30
 
@@ -32,11 +31,13 @@ You can now generate an audio based on a previous one. You can use this to gener
32
 
33
  ---
34
 
 
 
35
  Audio can be represented as images by transforming to a [mel spectrogram](https://en.wikipedia.org/wiki/Mel-frequency_cepstrum), such as the one shown above. The class `Mel` in `mel.py` can convert a slice of audio into a mel spectrogram of `x_res` x `y_res` and vice versa. The higher the resolution, the less audio information will be lost. You can see how this works in the [`test_mel.ipynb`](https://github.com/teticio/audio-diffusion/blob/main/notebooks/test_mel.ipynb) notebook.
36
 
37
- A DDPM model is trained on a set of mel spectrograms that have been generated from a directory of audio files. It is then used to synthesize similar mel spectrograms, which are then converted back into audio.
38
 
39
- You can play around with some pretrained models on [Google Colab](https://colab.research.google.com/github/teticio/audio-diffusion/blob/master/notebooks/test_model.ipynb) or [Hugging Face spaces](https://huggingface.co/spaces/teticio/audio-diffusion). Check out some automatically generated loops [here](https://soundcloud.com/teticio2/sets/audio-diffusion-loops).
40
 
41
 
42
  | Model | Dataset | Description |
@@ -54,7 +55,6 @@ pip install .
54
  ```
55
 
56
  #### Training can be run with Mel spectrograms of resolution 64x64 on a single commercial grade GPU (e.g. RTX 2080 Ti). The `hop_length` should be set to 1024 for better results.
57
-
58
  ```bash
59
  python scripts/audio_to_images.py \
60
  --resolution 64,64 \
@@ -119,10 +119,17 @@ accelerate launch --config_file config/accelerate_sagemaker.yaml \
119
  --lr_warmup_steps 500 \
120
  --mixed_precision no
121
  ```
122
  ## Latent Audio Diffusion
123
- Rather than denoising images directly, it is interesting to work in the "latent space" after first encoding images using an autoencoder. This has a number of advantages. Firstly, the information in the images is compressed into a latent space of a much lower dimension, so it is much faster to train denoising diffusion models and run inference with them. Secondly, similar images tend to be clustered together and interpolating between two images in latent space can produce meaningful combinations.
124
 
125
- At the time of writing, the Hugging Face `diffusers` library is geared towards inference and lacking in training functionality, rather like its cousin `transformers` in the early days of development. In order to train a VAE (Variational Autoencoder), I use the [stable-diffusion](https://github.com/CompVis/stable-diffusion) repo from CompVis and convert the checkpoints to `diffusers` format. Note that it uses a perceptual loss function for images; it would be nice to try a perceptual *audio* loss function.
126
 
127
  #### Install dependencies to train with Stable Diffusion
128
  ```
 
11
  ---
12
  # audio-diffusion [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/teticio/audio-diffusion/blob/master/notebooks/gradio_app.ipynb)
13
 
14
+ ### Apply diffusion models to synthesize music instead of images, using the new Hugging Face [diffusers](https://github.com/huggingface/diffusers) package.
15
 
16
  ---
17
 
18
  **UPDATES**:
19
 
20
+ **22/10/2022**. Added DDIM encoder and ability to interpolate between audios in latent "noise" space. Mel spectrograms no longer have to be square (thanks to Tristan for this one), so you can set the vertical (frequency) and horizontal (time) resolutions independently.
 
21
 
22
+ **15/10/2022**. Added latent audio diffusion (see below). Also added the possibility to train a DDIM ([Denoising Diffusion Implicit Models](https://arxiv.org/pdf/2010.02502.pdf)). These have the benefit that samples can be generated with far fewer steps (~50) than were used in training.
 
23
 
24
+ **4/10/2022**. It is now possible to mask parts of the input audio during generation, which means you can stitch several samples together (think "out-painting").
25
+
26
+ **27/9/2022**. You can now generate an audio based on a previous one. You can use this to generate variations of the same audio or even to "remix" a track (via a sort of "style transfer"). You can find examples of how to do this in the [`test_model.ipynb`](https://colab.research.google.com/github/teticio/audio-diffusion/blob/master/notebooks/test_model.ipynb) notebook.
27
 
28
  ---
29
 
 
31
 
32
  ---
33
 
34
+ ## DDPM ([De-noising Diffusion Probabilistic Models](https://arxiv.org/abs/2006.11239))
35
+
36
  Audio can be represented as images by transforming to a [mel spectrogram](https://en.wikipedia.org/wiki/Mel-frequency_cepstrum), such as the one shown above. The class `Mel` in `mel.py` can convert a slice of audio into a mel spectrogram of `x_res` x `y_res` and vice versa. The higher the resolution, the less audio information will be lost. You can see how this works in the [`test_mel.ipynb`](https://github.com/teticio/audio-diffusion/blob/main/notebooks/test_mel.ipynb) notebook.
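For illustration, a round trip through `Mel` might look like the sketch below. This is not part of the repo; the `audio_slice_to_image` helper name is an assumption, so check `mel.py` and the notebooks for the exact API:

```python
from audiodiffusion import Mel  # Mel is re-exported by audiodiffusion/__init__.py

# Slice an audio file into mel spectrogram images and convert one slice back to audio.
mel = Mel(x_res=256, y_res=256)
mel.load_audio("track.wav")              # or pass raw_audio=<np.ndarray>
image = mel.audio_slice_to_image(0)      # assumed helper name: slice -> PIL image
audio = mel.image_to_audio(image)        # lossy reconstruction of that slice
print(mel.get_sample_rate(), len(audio))
```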
37
 
38
+ A DDPM is trained on a set of mel spectrograms that have been generated from a directory of audio files. It is then used to synthesize similar mel spectrograms, which are then converted back into audio.
39
 
40
+ You can play around with some pre-trained models on [Google Colab](https://colab.research.google.com/github/teticio/audio-diffusion/blob/master/notebooks/test_model.ipynb) or [Hugging Face spaces](https://huggingface.co/spaces/teticio/audio-diffusion). Check out some automatically generated loops [here](https://soundcloud.com/teticio2/sets/audio-diffusion-loops).
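Generating a clip locally from one of these models boils down to the following sketch (condensed from `test_model.ipynb`):

```python
import torch
from IPython.display import Audio, display
from audiodiffusion import AudioDiffusion

# Load a pre-trained model from the Hugging Face Hub and synthesize one spectrogram plus audio clip.
audio_diffusion = AudioDiffusion(model_id="teticio/audio-diffusion-256")
generator = torch.Generator()
image, (sample_rate, audio) = audio_diffusion.generate_spectrogram_and_audio(
    generator=generator)
display(image)
display(Audio(audio, rate=sample_rate))
```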
41
 
42
 
43
  | Model | Dataset | Description |
 
55
  ```
56
 
57
  #### Training can be run with Mel spectrograms of resolution 64x64 on a single commercial grade GPU (e.g. RTX 2080 Ti). The `hop_length` should be set to 1024 for better results.
 
58
  ```bash
59
  python scripts/audio_to_images.py \
60
  --resolution 64,64 \
 
119
  --lr_warmup_steps 500 \
120
  --mixed_precision no
121
  ```
122
+ ## DDIM ([De-noising Diffusion Implicit Models](https://arxiv.org/pdf/2010.02502.pdf))
123
+ #### A DDIM can be trained by adding the parameter
124
+ ```bash
125
+ --scheduler ddim
126
+ ```
127
+ Inference can then be run with far fewer steps than the number used for training (e.g., ~50), allowing for much faster generation. Without retraining, the parameter `eta` can be used to replicate a DDPM if it is set to 1 or a DDIM if it is set to 0, with all values in between being valid. When `eta` is 0 (the default value), the de-noising procedure is deterministic, which means that it can be run in reverse as a kind of encoder that recovers the original noise used in generation. A function `encode` has been added to `AudioDiffusionPipeline` for this purpose. It is then possible to interpolate between audios in the latent "noise" space using the function `slerp` (Spherical Linear intERPolation).
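Condensed from `test_model.ipynb`, encoding two spectrogram images and interpolating between them looks roughly like this (`slerp` here is the standard spherical linear interpolation between the two noise tensors):

```python
import torch
from datasets import load_dataset
from audiodiffusion import AudioDiffusion

audio_diffusion = AudioDiffusion(model_id="teticio/audio-diffusion-ddim-256")
generator = torch.Generator()

# Any two mel spectrogram images will do; these come from the training set for convenience.
ds = load_dataset("teticio/audio-diffusion-256")
image, image2 = ds["train"][264]["image"], ds["train"][15978]["image"]

# With eta=0 the process is deterministic, so encode recovers the "noise" behind each image...
noise = audio_diffusion.pipe.encode([image], steps=50)
noise2 = audio_diffusion.pipe.encode([image2], steps=50)

# ...and we can interpolate between the two audios in that latent noise space.
_, (sample_rate, audio) = audio_diffusion.generate_spectrogram_and_audio(
    noise=audio_diffusion.pipe.slerp(noise, noise2, 0.5),
    steps=50,
    generator=generator)
```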
128
+
129
  ## Latent Audio Diffusion
130
+ Rather than de-noising images directly, it is interesting to work in the "latent space" after first encoding images using an autoencoder. This has a number of advantages. Firstly, the information in the images is compressed into a latent space of a much lower dimension, so it is much faster to train de-noising diffusion models and run inference with them. Secondly, similar images tend to be clustered together and interpolating between two images in latent space can produce meaningful combinations.
131
 
132
+ At the time of writing, the Hugging Face `diffusers` library is geared towards inference and lacking in training functionality (rather like its cousin `transformers` in the early days of development). In order to train a VAE (Variational AutoEncoder), I use the [stable-diffusion](https://github.com/CompVis/stable-diffusion) repo from CompVis and convert the checkpoints to `diffusers` format. Note that it uses a perceptual loss function for images; it would be nice to try a perceptual *audio* loss function.
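As an illustrative sketch of that latent round trip with a VAE in `diffusers` format (the checkpoint path and tensor shape below are placeholders, not something provided by this repo, and the exact return types may vary with the `diffusers` version):

```python
import torch
from diffusers import AutoencoderKL

# Placeholder path: a CompVis checkpoint converted to diffusers format
vae = AutoencoderKL.from_pretrained("path/to/converted/vae")

# A batch of spectrogram images scaled to [-1, 1]; the shape depends on the VAE config
images = torch.randn(1, 3, 256, 256)
with torch.no_grad():
    latents = vae.encode(images).latent_dist.sample()  # much lower-dimensional representation
    reconstruction = vae.decode(latents).sample        # decode back to image space
print(latents.shape, reconstruction.shape)
```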
133
 
134
  #### Install dependencies to train with Stable Diffusion
135
  ```
audiodiffusion/__init__.py CHANGED
@@ -6,12 +6,12 @@ import numpy as np
6
  from PIL import Image
7
  from tqdm.auto import tqdm
8
  from librosa.beat import beat_track
9
- from diffusers import (DiffusionPipeline, DDPMPipeline, UNet2DConditionModel,
10
- DDIMScheduler, DDPMScheduler, AutoencoderKL)
11
 
12
  from .mel import Mel
13
 
14
- VERSION = "1.2.2"
15
 
16
 
17
  class AudioDiffusion:
@@ -24,7 +24,7 @@ class AudioDiffusion:
24
  top_db: int = 80,
25
  cuda: bool = torch.cuda.is_available(),
26
  progress_bar: Iterable = tqdm):
27
- """Class for generating audio using Denoising Diffusion Probabilistic Models.
28
 
29
  Args:
30
  model_id (String): name of model (local directory or Hugging Face Hub)
@@ -60,18 +60,21 @@ class AudioDiffusion:
60
  top_db=top_db)
61
 
62
  def generate_spectrogram_and_audio(
63
- self,
64
- steps: int = 1000,
65
- generator: torch.Generator = None,
66
- step_generator: torch.Generator = None,
67
- eta: float = 0) -> Tuple[Image.Image, Tuple[int, np.ndarray]]:
 
 
68
  """Generate random mel spectrogram and convert to audio.
69
 
70
  Args:
71
- steps (int): number of de-noising steps to perform (defaults to num_train_timesteps)
72
  generator (torch.Generator): random number generator or None
73
- step_generator (torch.Generator): random number generator used to denoise or None
74
  eta (float): parameter between 0 and 1 used with DDIM scheduler
 
75
 
76
  Returns:
77
  PIL Image: mel spectrogram
@@ -83,7 +86,8 @@ class AudioDiffusion:
83
  steps=steps,
84
  generator=generator,
85
  step_generator=step_generator,
86
- eta=eta)
 
87
  return images[0], (sample_rate, audios[0])
88
 
89
  def generate_spectrogram_and_audio_from_audio(
@@ -92,7 +96,7 @@ class AudioDiffusion:
92
  raw_audio: np.ndarray = None,
93
  slice: int = 0,
94
  start_step: int = 0,
95
- steps: int = 1000,
96
  generator: torch.Generator = None,
97
  mask_start_secs: float = 0,
98
  mask_end_secs: float = 0,
@@ -107,11 +111,11 @@ class AudioDiffusion:
107
  raw_audio (np.ndarray): audio as numpy array
108
  slice (int): slice number of audio to convert
109
  start_step (int): step to start from
110
- steps (int): number of de-noising steps to perform (defaults to num_train_timesteps)
111
  generator (torch.Generator): random number generator or None
112
  mask_start_secs (float): number of seconds of audio to mask (not generate) at start
113
  mask_end_secs (float): number of seconds of audio to mask (not generate) at end
114
- step_generator (torch.Generator): random number generator used to denoise or None
115
  eta (float): parameter between 0 and 1 used with DDIM scheduler
116
  noise (torch.Tensor): noisy image or None
117
 
@@ -173,7 +177,7 @@ class AudioDiffusionPipeline(DiffusionPipeline):
173
  raw_audio: np.ndarray = None,
174
  slice: int = 0,
175
  start_step: int = 0,
176
- steps: int = 1000,
177
  generator: torch.Generator = None,
178
  mask_start_secs: float = 0,
179
  mask_end_secs: float = 0,
@@ -190,23 +194,24 @@ class AudioDiffusionPipeline(DiffusionPipeline):
190
  raw_audio (np.ndarray): audio as numpy array
191
  slice (int): slice number of audio to convert
192
  start_step (int): step to start from
193
- steps (int): number of de-noising steps to perform (defaults to num_train_timesteps)
194
  generator (torch.Generator): random number generator or None
195
  mask_start_secs (float): number of seconds of audio to mask (not generate) at start
196
  mask_end_secs (float): number of seconds of audio to mask (not generate) at end
197
- step_generator (torch.Generator): random number generator used to denoise or None
198
  eta (float): parameter between 0 and 1 used with DDIM scheduler
199
- noise (torch.Tensor): noisy image or None
200
 
201
  Returns:
202
  List[PIL Image]: mel spectrograms
203
  (float, List[np.ndarray]): sample rate and raw audios
204
  """
205
 
 
 
206
  self.scheduler.set_timesteps(steps)
207
  step_generator = step_generator or generator
208
- mask = None
209
- # For backwards compatiibility
210
  if type(self.unet.sample_size) == int:
211
  self.unet.sample_size = (self.unet.sample_size,
212
  self.unet.sample_size)
@@ -215,6 +220,7 @@ class AudioDiffusionPipeline(DiffusionPipeline):
215
  (batch_size, self.unet.in_channels) + self.unet.sample_size,
216
  generator=generator)
217
  images = noise
 
218
 
219
  if audio_file is not None or raw_audio is not None:
220
  mel.load_audio(audio_file, raw_audio)
@@ -289,11 +295,12 @@ class AudioDiffusionPipeline(DiffusionPipeline):
289
  return images, (mel.get_sample_rate(), audios)
290
 
291
  @torch.no_grad()
292
- def encode(self, images: List[Image.Image]) -> np.ndarray:
293
  """Reverse step process: recover noisy image from generated image.
294
 
295
  Args:
296
  images (List[PIL Image]): list of images to encode
 
297
 
298
  Returns:
299
  np.ndarray: noise tensor of shape (batch_size, 1, height, width)
@@ -301,6 +308,7 @@ class AudioDiffusionPipeline(DiffusionPipeline):
301
 
302
  # Only works with DDIM as this method is deterministic
303
  assert isinstance(self.scheduler, DDIMScheduler)
 
304
  sample = np.array([
305
  np.frombuffer(image.tobytes(), dtype="uint8").reshape(
306
  (1, image.height, image.width)) for image in images
@@ -308,7 +316,8 @@ class AudioDiffusionPipeline(DiffusionPipeline):
308
  sample = ((sample / 255) * 2 - 1)
309
  sample = torch.Tensor(sample).to(self.device)
310
 
311
- for t in torch.flip(self.scheduler.timesteps, (0, )):
 
312
  prev_timestep = (t - self.scheduler.num_train_timesteps //
313
  self.scheduler.num_inference_steps)
314
  alpha_prod_t = self.scheduler.alphas_cumprod[t]
@@ -334,7 +343,7 @@ class AudioDiffusionPipeline(DiffusionPipeline):
334
  Args:
335
  x0 (torch.Tensor): first tensor to interpolate between
336
  x1 (torch.Tensor): second tensor to interpolate between
337
- alpha (float): interpolation betwen 0 and 1
338
 
339
  Returns:
340
  torch.Tensor: interpolated tensor
 
6
  from PIL import Image
7
  from tqdm.auto import tqdm
8
  from librosa.beat import beat_track
9
+ from diffusers import (DiffusionPipeline, UNet2DConditionModel, DDIMScheduler,
10
+ DDPMScheduler, AutoencoderKL)
11
 
12
  from .mel import Mel
13
 
14
+ VERSION = "1.2.3"
15
 
16
 
17
  class AudioDiffusion:
 
24
  top_db: int = 80,
25
  cuda: bool = torch.cuda.is_available(),
26
  progress_bar: Iterable = tqdm):
27
+ """Class for generating audio using De-noising Diffusion Probabilistic Models.
28
 
29
  Args:
30
  model_id (String): name of model (local directory or Hugging Face Hub)
 
60
  top_db=top_db)
61
 
62
  def generate_spectrogram_and_audio(
63
+ self,
64
+ steps: int = None,
65
+ generator: torch.Generator = None,
66
+ step_generator: torch.Generator = None,
67
+ eta: float = 0,
68
+ noise: torch.Tensor = None
69
+ ) -> Tuple[Image.Image, Tuple[int, np.ndarray]]:
70
  """Generate random mel spectrogram and convert to audio.
71
 
72
  Args:
73
+ steps (int): number of de-noising steps (defaults to 50 for DDIM, 1000 for DDPM)
74
  generator (torch.Generator): random number generator or None
75
+ step_generator (torch.Generator): random number generator used to de-noise or None
76
  eta (float): parameter between 0 and 1 used with DDIM scheduler
77
+ noise (torch.Tensor): noisy image or None
78
 
79
  Returns:
80
  PIL Image: mel spectrogram
 
86
  steps=steps,
87
  generator=generator,
88
  step_generator=step_generator,
89
+ eta=eta,
90
+ noise=noise)
91
  return images[0], (sample_rate, audios[0])
92
 
93
  def generate_spectrogram_and_audio_from_audio(
 
96
  raw_audio: np.ndarray = None,
97
  slice: int = 0,
98
  start_step: int = 0,
99
+ steps: int = None,
100
  generator: torch.Generator = None,
101
  mask_start_secs: float = 0,
102
  mask_end_secs: float = 0,
 
111
  raw_audio (np.ndarray): audio as numpy array
112
  slice (int): slice number of audio to convert
113
  start_step (int): step to start from
114
+ steps (int): number of de-noising steps (defaults to 50 for DDIM, 1000 for DDPM)
115
  generator (torch.Generator): random number generator or None
116
  mask_start_secs (float): number of seconds of audio to mask (not generate) at start
117
  mask_end_secs (float): number of seconds of audio to mask (not generate) at end
118
+ step_generator (torch.Generator): random number generator used to de-noise or None
119
  eta (float): parameter between 0 and 1 used with DDIM scheduler
120
  noise (torch.Tensor): noisy image or None
121
 
 
177
  raw_audio: np.ndarray = None,
178
  slice: int = 0,
179
  start_step: int = 0,
180
+ steps: int = None,
181
  generator: torch.Generator = None,
182
  mask_start_secs: float = 0,
183
  mask_end_secs: float = 0,
 
194
  raw_audio (np.ndarray): audio as numpy array
195
  slice (int): slice number of audio to convert
196
  start_step (int): step to start from
197
+ steps (int): number of de-noising steps (defaults to 50 for DDIM, 1000 for DDPM)
198
  generator (torch.Generator): random number generator or None
199
  mask_start_secs (float): number of seconds of audio to mask (not generate) at start
200
  mask_end_secs (float): number of seconds of audio to mask (not generate) at end
201
+ step_generator (torch.Generator): random number generator used to de-noise or None
202
  eta (float): parameter between 0 and 1 used with DDIM scheduler
203
+ noise (torch.Tensor): noise tensor of shape (batch_size, 1, height, width) or None
204
 
205
  Returns:
206
  List[PIL Image]: mel spectrograms
207
  (float, List[np.ndarray]): sample rate and raw audios
208
  """
209
 
210
+ steps = steps or (50 if isinstance(self.scheduler,
211
+ DDIMScheduler) else 1000)
212
  self.scheduler.set_timesteps(steps)
213
  step_generator = step_generator or generator
214
+ # For backwards compatibility
 
215
  if type(self.unet.sample_size) == int:
216
  self.unet.sample_size = (self.unet.sample_size,
217
  self.unet.sample_size)
 
220
  (batch_size, self.unet.in_channels) + self.unet.sample_size,
221
  generator=generator)
222
  images = noise
223
+ mask = None
224
 
225
  if audio_file is not None or raw_audio is not None:
226
  mel.load_audio(audio_file, raw_audio)
 
295
  return images, (mel.get_sample_rate(), audios)
296
 
297
  @torch.no_grad()
298
+ def encode(self, images: List[Image.Image], steps: int = 50) -> np.ndarray:
299
  """Reverse step process: recover noisy image from generated image.
300
 
301
  Args:
302
  images (List[PIL Image]): list of images to encode
303
+ steps (int): number of encoding steps to perform (defaults to 50)
304
 
305
  Returns:
306
  np.ndarray: noise tensor of shape (batch_size, 1, height, width)
 
308
 
309
  # Only works with DDIM as this method is deterministic
310
  assert isinstance(self.scheduler, DDIMScheduler)
311
+ self.scheduler.set_timesteps(steps)
312
  sample = np.array([
313
  np.frombuffer(image.tobytes(), dtype="uint8").reshape(
314
  (1, image.height, image.width)) for image in images
 
316
  sample = ((sample / 255) * 2 - 1)
317
  sample = torch.Tensor(sample).to(self.device)
318
 
319
+ for t in self.progress_bar(torch.flip(self.scheduler.timesteps,
320
+ (0, ))):
321
  prev_timestep = (t - self.scheduler.num_train_timesteps //
322
  self.scheduler.num_inference_steps)
323
  alpha_prod_t = self.scheduler.alphas_cumprod[t]
 
343
  Args:
344
  x0 (torch.Tensor): first tensor to interpolate between
345
  x1 (torch.Tensor): second tensor to interpolate between
346
+ alpha (float): interpolation between 0 and 1
347
 
348
  Returns:
349
  torch.Tensor: interpolated tensor
notebooks/test_model.ipynb CHANGED
@@ -53,6 +53,25 @@
53
  "from audiodiffusion import AudioDiffusion"
54
  ]
55
  },
56
  {
57
  "cell_type": "markdown",
58
  "id": "7fd945bb",
@@ -74,8 +93,6 @@
74
  "\n",
75
  "#@markdown teticio/audio-diffusion-instrumental-hiphop-256 - trained on instrumental hiphop\n",
76
  "\n",
77
- "#@markdown teticio/audio-diffusion-ddim-256 - DDIM model trained on my Spotify \"liked\" playlist\n",
78
- "\n",
79
  "model_id = \"teticio/audio-diffusion-256\" #@param [\"teticio/audio-diffusion-256\", \"teticio/audio-diffusion-breaks-256\", \"audio-diffusion-instrumenal-hiphop-256\", \"teticio/audio-diffusion-ddim-256\"]"
80
  ]
81
  },
@@ -86,9 +103,7 @@
86
  "metadata": {},
87
  "outputs": [],
88
  "source": [
89
- "audio_diffusion = AudioDiffusion(model_id=model_id)\n",
90
- "mel = Mel(x_res=256, y_res=256)\n",
91
- "generator = torch.Generator()"
92
  ]
93
  },
94
  {
@@ -299,17 +314,90 @@
299
  " audio2) = audio_diffusion.generate_spectrogram_and_audio_from_audio(\n",
300
  " raw_audio=mel.get_audio_slice(slice),\n",
301
  " mask_start_secs=1,\n",
302
- " mask_end_secs=1, step_generator=torch.Generator())\n",
 
303
  "display(Audio(audio, rate=sample_rate))\n",
304
  "display(Audio(audio2, rate=sample_rate))"
305
  ]
306
  },
307
  {
308
  "cell_type": "markdown",
309
- "id": "ef54cef3",
310
  "metadata": {},
311
  "source": [
312
- "### Compare results with random sample from training set"
313
  ]
314
  },
315
  {
@@ -319,35 +407,94 @@
319
  "metadata": {},
320
  "outputs": [],
321
  "source": [
322
- "ds = load_dataset(model_id)"
 
323
  ]
324
  },
325
  {
326
  "cell_type": "code",
327
  "execution_count": null,
328
- "id": "b9023846",
329
  "metadata": {},
330
  "outputs": [],
331
  "source": [
332
- "image = random.choice(ds['train'])['image']\n",
333
- "image"
334
  ]
335
  },
336
  {
337
  "cell_type": "code",
338
  "execution_count": null,
339
- "id": "492e2334",
340
  "metadata": {},
341
  "outputs": [],
342
  "source": [
343
- "audio = mel.image_to_audio(image)\n",
344
- "Audio(data=audio, rate=mel.get_sample_rate())"
345
  ]
346
  },
347
  {
348
  "cell_type": "code",
349
  "execution_count": null,
350
- "id": "4deb47f4",
351
  "metadata": {},
352
  "outputs": [],
353
  "source": []
@@ -374,7 +521,7 @@
374
  "name": "python",
375
  "nbconvert_exporter": "python",
376
  "pygments_lexer": "ipython3",
377
- "version": "3.10.6"
378
  },
379
  "toc": {
380
  "base_numbering": 1,
 
53
  "from audiodiffusion import AudioDiffusion"
54
  ]
55
  },
56
+ {
57
+ "cell_type": "code",
58
+ "execution_count": null,
59
+ "id": "b294a94a",
60
+ "metadata": {},
61
+ "outputs": [],
62
+ "source": [
63
+ "mel = Mel(x_res=256, y_res=256)\n",
64
+ "generator = torch.Generator()"
65
+ ]
66
+ },
67
+ {
68
+ "cell_type": "markdown",
69
+ "id": "f3feb265",
70
+ "metadata": {},
71
+ "source": [
72
+ "## DDPM (Denoising Diffusion Probabilistic Models)"
73
+ ]
74
+ },
75
  {
76
  "cell_type": "markdown",
77
  "id": "7fd945bb",
 
93
  "\n",
94
  "#@markdown teticio/audio-diffusion-instrumental-hiphop-256 - trained on instrumental hiphop\n",
95
  "\n",
 
 
96
  "model_id = \"teticio/audio-diffusion-256\" #@param [\"teticio/audio-diffusion-256\", \"teticio/audio-diffusion-breaks-256\", \"audio-diffusion-instrumenal-hiphop-256\", \"teticio/audio-diffusion-ddim-256\"]"
97
  ]
98
  },
 
103
  "metadata": {},
104
  "outputs": [],
105
  "source": [
106
+ "audio_diffusion = AudioDiffusion(model_id=model_id)"
 
 
107
  ]
108
  },
109
  {
 
314
  " audio2) = audio_diffusion.generate_spectrogram_and_audio_from_audio(\n",
315
  " raw_audio=mel.get_audio_slice(slice),\n",
316
  " mask_start_secs=1,\n",
317
+ " mask_end_secs=1,\n",
318
+ " step_generator=torch.Generator())\n",
319
  "display(Audio(audio, rate=sample_rate))\n",
320
  "display(Audio(audio2, rate=sample_rate))"
321
  ]
322
  },
323
  {
324
  "cell_type": "markdown",
325
+ "id": "efc32dae",
326
+ "metadata": {},
327
+ "source": [
328
+ "## DDIM (Denoising Diffusion Implicit Models)"
329
+ ]
330
+ },
331
+ {
332
+ "cell_type": "code",
333
+ "execution_count": null,
334
+ "id": "a021f78a",
335
+ "metadata": {},
336
+ "outputs": [],
337
+ "source": [
338
+ "audio_diffusion = AudioDiffusion(model_id='teticio/audio-diffusion-ddim-256')"
339
+ ]
340
+ },
341
+ {
342
+ "cell_type": "markdown",
343
+ "id": "deb23339",
344
  "metadata": {},
345
  "source": [
346
+ "### Generation can be done in many fewer steps with DDIMs"
347
+ ]
348
+ },
349
+ {
350
+ "cell_type": "code",
351
+ "execution_count": null,
352
+ "id": "c105a497",
353
+ "metadata": {},
354
+ "outputs": [],
355
+ "source": [
356
+ "for _ in range(10):\n",
357
+ " seed = generator.seed()\n",
358
+ " print(f'Seed = {seed}')\n",
359
+ " generator.manual_seed(seed)\n",
360
+ " image, (sample_rate,\n",
361
+ " audio) = audio_diffusion.generate_spectrogram_and_audio(\n",
362
+ " generator=generator)\n",
363
+ " display(image)\n",
364
+ " display(Audio(audio, rate=sample_rate))\n",
365
+ " loop = AudioDiffusion.loop_it(audio, sample_rate)\n",
366
+ " if loop is not None:\n",
367
+ " display(Audio(loop, rate=sample_rate))\n",
368
+ " else:\n",
369
+ " print(\"Unable to determine loop points\")"
370
+ ]
371
+ },
372
+ {
373
+ "cell_type": "markdown",
374
+ "id": "cab4692c",
375
+ "metadata": {},
376
+ "source": [
377
+ "The parameter eta controls the variance:\n",
378
+ "* 0 - DDIM (deterministic)\n",
379
+ "* 1 - DDPM (Denoising Diffusion Probabilistic Models)"
380
+ ]
381
+ },
382
+ {
383
+ "cell_type": "code",
384
+ "execution_count": null,
385
+ "id": "72bdd207",
386
+ "metadata": {},
387
+ "outputs": [],
388
+ "source": [
389
+ "image, (sample_rate, audio) = audio_diffusion.generate_spectrogram_and_audio(\n",
390
+ " steps=1000, generator=generator, eta=1)\n",
391
+ "display(image)\n",
392
+ "display(Audio(audio, rate=sample_rate))"
393
+ ]
394
+ },
395
+ {
396
+ "cell_type": "markdown",
397
+ "id": "b8d5442c",
398
+ "metadata": {},
399
+ "source": [
400
+ "### DDIMs can be used as encoders..."
401
  ]
402
  },
403
  {
 
407
  "metadata": {},
408
  "outputs": [],
409
  "source": [
410
+ "# Doesn't have to be an audio from the train dataset, this is just for convenience\n",
411
+ "ds = load_dataset('teticio/audio-diffusion-256')"
412
  ]
413
  },
414
  {
415
  "cell_type": "code",
416
  "execution_count": null,
417
+ "id": "278d1d80",
418
  "metadata": {},
419
  "outputs": [],
420
  "source": [
421
+ "image = ds['train'][264]['image']\n",
422
+ "display(Audio(mel.image_to_audio(image), rate=mel.get_sample_rate()))"
423
  ]
424
  },
425
  {
426
  "cell_type": "code",
427
  "execution_count": null,
428
+ "id": "912b54e4",
429
  "metadata": {},
430
  "outputs": [],
431
  "source": [
432
+ "noise = audio_diffusion.pipe.encode([image], steps=50)"
433
+ ]
434
+ },
435
+ {
436
+ "cell_type": "code",
437
+ "execution_count": null,
438
+ "id": "c7b31f97",
439
+ "metadata": {},
440
+ "outputs": [],
441
+ "source": [
442
+ "# Reconstruct original audio from noise\n",
443
+ "_, (sample_rate, audio) = audio_diffusion.generate_spectrogram_and_audio(\n",
444
+ " noise=noise, generator=generator)\n",
445
+ "display(Audio(audio, rate=sample_rate))"
446
+ ]
447
+ },
448
+ {
449
+ "cell_type": "markdown",
450
+ "id": "998c776b",
451
+ "metadata": {},
452
+ "source": [
453
+ "### ...or to interpolate between audios"
454
+ ]
455
+ },
456
+ {
457
+ "cell_type": "code",
458
+ "execution_count": null,
459
+ "id": "33f82367",
460
+ "metadata": {},
461
+ "outputs": [],
462
+ "source": [
463
+ "image2 = ds['train'][15978]['image']\n",
464
+ "display(Audio(mel.image_to_audio(image2), rate=mel.get_sample_rate()))"
465
+ ]
466
+ },
467
+ {
468
+ "cell_type": "code",
469
+ "execution_count": null,
470
+ "id": "f93fb6c0",
471
+ "metadata": {},
472
+ "outputs": [],
473
+ "source": [
474
+ "noise2 = audio_diffusion.pipe.encode([image2], steps=50)"
475
+ ]
476
+ },
477
+ {
478
+ "cell_type": "code",
479
+ "execution_count": null,
480
+ "id": "a4190563",
481
+ "metadata": {},
482
+ "outputs": [],
483
+ "source": [
484
+ "alpha = 0.5 #@param {type:\"slider\", min:0, max:1, step:.1}\n",
485
+ "_, (sample_rate, audio) = audio_diffusion.generate_spectrogram_and_audio(\n",
486
+ " noise=audio_diffusion.pipe.slerp(noise, noise2, alpha),\n",
487
+ " steps=50,\n",
488
+ " generator=generator)\n",
489
+ "display(Audio(mel.image_to_audio(image), rate=mel.get_sample_rate()))\n",
490
+ "display(Audio(mel.image_to_audio(image2), rate=mel.get_sample_rate()))\n",
491
+ "display(Audio(audio, rate=sample_rate))"
492
  ]
493
  },
494
  {
495
  "cell_type": "code",
496
  "execution_count": null,
497
+ "id": "0b05539f",
498
  "metadata": {},
499
  "outputs": [],
500
  "source": []
 
521
  "name": "python",
522
  "nbconvert_exporter": "python",
523
  "pygments_lexer": "ipython3",
524
+ "version": "3.8.9"
525
  },
526
  "toc": {
527
  "base_numbering": 1,