Diffusers documentation

Audio Diffusion

You are viewing version v0.18.2. A newer version, v0.32.1, is available.

Overview

Audio Diffusion by Robert Dargavel Smith.

Audio Diffusion leverages the recent advances in image generation using diffusion models by converting audio samples to and from mel spectrogram images.

The original codebase of this implementation can be found here, including training scripts and example notebooks.

Available Pipelines:

Pipeline | Tasks | Colab
pipeline_audio_diffusion.py | Unconditional Audio Generation | Open In Colab

Examples:

Audio Diffusion

import torch
from IPython.display import Audio, display
from diffusers import DiffusionPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = DiffusionPipeline.from_pretrained("teticio/audio-diffusion-256").to(device)

output = pipe()
display(output.images[0])
display(Audio(output.audios[0], rate=pipe.mel.get_sample_rate()))

Latent Audio Diffusion

import torch
from IPython.display import Audio, display
from diffusers import DiffusionPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = DiffusionPipeline.from_pretrained("teticio/latent-audio-diffusion-256").to(device)

output = pipe()
display(output.images[0])
display(Audio(output.audios[0], rate=pipe.mel.get_sample_rate()))

Audio Diffusion with DDIM (faster)

import torch
from IPython.display import Audio, display
from diffusers import DiffusionPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = DiffusionPipeline.from_pretrained("teticio/audio-diffusion-ddim-256").to(device)

output = pipe()
display(output.images[0])
display(Audio(output.audios[0], rate=pipe.mel.get_sample_rate()))

Variations, in-painting, out-painting etc.

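# Reuses `pipe` and `output` from one of the examples above.
# Start halfway through the de-noising schedule and keep the first and last second of audio.
output = pipe(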
    raw_audio=output.audios[0, 0],
    start_step=int(pipe.get_default_steps() / 2),
    mask_start_secs=1,
    mask_end_secs=1,
)
display(output.images[0])
display(Audio(output.audios[0], rate=pipe.mel.get_sample_rate()))

AudioDiffusionPipeline

class diffusers.AudioDiffusionPipeline

( vqvae: AutoencoderKL, unet: UNet2DConditionModel, mel: Mel, scheduler: typing.Union[diffusers.schedulers.scheduling_ddim.DDIMScheduler, diffusers.schedulers.scheduling_ddpm.DDPMScheduler] )

Parameters

  • vqvae (AutoencoderKL) — Variational AutoEncoder for Latent Audio Diffusion, or None
  • unet (UNet2DConditionModel) — U-Net model
  • mel (Mel) — transforms audio to a spectrogram and back
  • scheduler (DDIMScheduler or DDPMScheduler) — de-noising scheduler

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)

__call__

( batch_size: int = 1, audio_file: str = None, raw_audio: ndarray = None, slice: int = 0, start_step: int = 0, steps: int = None, generator: Generator = None, mask_start_secs: float = 0, mask_end_secs: float = 0, step_generator: Generator = None, eta: float = 0, noise: Tensor = None, encoding: Tensor = None, return_dict = True ) → List[PIL Image]

Parameters

  • batch_size (int) — number of samples to generate
  • audio_file (str) — path to an audio file; it must be a file on disk due to a Librosa limitation. Pass either this or raw_audio
  • raw_audio (np.ndarray) — audio as numpy array
  • slice (int) — slice number of audio to convert
  • start_step (int) — step to start from
  • steps (int) — number of de-noising steps (defaults to 50 for DDIM, 1000 for DDPM)
  • generator (torch.Generator) — random number generator or None
  • mask_start_secs (float) — number of seconds of audio to mask (not generate) at start
  • mask_end_secs (float) — number of seconds of audio to mask (not generate) at end
  • step_generator (torch.Generator) — random number generator used to de-noise or None
  • eta (float) — parameter between 0 and 1 used with DDIM scheduler
  • noise (torch.Tensor) — noise tensor of shape (batch_size, 1, height, width) or None
  • encoding (torch.Tensor) — for UNet2DConditionModel shape (batch_size, seq_length, cross_attention_dim)
  • return_dict (bool) — if True, return AudioPipelineOutput, ImagePipelineOutput; otherwise return a tuple

Returns

List[PIL Image]

mel spectrograms, and (float, List[np.ndarray]): sample rate and raw audios

Generate random mel spectrogram from audio input and convert to audio.
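
The mask_start_secs and mask_end_secs arguments are given in seconds, while the spectrogram is indexed in pixel columns. The following is a minimal sketch of that conversion, assuming one spectrogram column per hop_length audio samples (so sample_rate / hop_length columns per second) and the Mel defaults listed on this page. The helper name secs_to_mask_columns is hypothetical and not part of the library, and the pipeline's own arithmetic may differ in detail.

```python
def secs_to_mask_columns(secs, sample_rate=22050, hop_length=512):
    """Hypothetical helper: number of spectrogram columns covering `secs` seconds."""
    # Assumption: one spectrogram column per hop of `hop_length` samples.
    columns_per_second = sample_rate / hop_length
    return int(secs * columns_per_second)


x_res = 256  # default spectrogram width (time axis)
mask_start_cols = secs_to_mask_columns(1.0)
mask_end_cols = secs_to_mask_columns(1.0)

# Columns between the two masks are the ones left to be regenerated.
regenerated_cols = x_res - mask_start_cols - mask_end_cols
print(mask_start_cols, mask_end_cols, regenerated_cols)
```

Under these assumptions, masking one second at each end preserves about 43 columns on either side and leaves roughly 170 of the 256 columns to be regenerated.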

encode

( images: typing.List[PIL.Image.Image], steps: int = 50 ) → np.ndarray

Parameters

  • images (List[PIL Image]) — list of images to encode
  • steps (int) — number of encoding steps to perform (defaults to 50)

Returns

np.ndarray

noise tensor of shape (batch_size, 1, height, width)

Reverse step process: recover noisy image from generated image.

get_default_steps

( ) → int

Returns

int

number of steps

Returns default number of steps recommended for inference

slerp

( x0: Tensor, x1: Tensor, alpha: float ) → torch.Tensor

Parameters

  • x0 (torch.Tensor) — first tensor to interpolate between
  • x1 (torch.Tensor) — second tensor to interpolate between
  • alpha (float) — interpolation between 0 and 1

Returns

torch.Tensor

interpolated tensor

Spherical Linear intERPolation
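
To illustrate what spherical linear interpolation computes, here is a minimal pure-Python sketch that works on flat lists of floats rather than torch tensors. It is an illustration of the general technique, not the pipeline's implementation; the fallback for nearly parallel vectors is an addition of this sketch, and anti-parallel inputs are not handled.

```python
import math


def slerp(x0, x1, alpha):
    """Spherical linear interpolation between two equal-length vectors (flat lists)."""
    dot = sum(a * b for a, b in zip(x0, x1))
    norm0 = math.sqrt(sum(a * a for a in x0))
    norm1 = math.sqrt(sum(b * b for b in x1))
    # Angle between the two vectors; clamp the cosine to guard against rounding error.
    cos_theta = max(-1.0, min(1.0, dot / (norm0 * norm1)))
    theta = math.acos(cos_theta)
    if theta < 1e-8:
        # Nearly parallel vectors: fall back to plain linear interpolation.
        return [(1 - alpha) * a + alpha * b for a, b in zip(x0, x1)]
    # Weights follow the arc between x0 and x1 instead of the straight chord.
    w0 = math.sin((1 - alpha) * theta) / math.sin(theta)
    w1 = math.sin(alpha * theta) / math.sin(theta)
    return [w0 * a + w1 * b for a, b in zip(x0, x1)]


# Halfway between two orthogonal unit vectors.
midpoint = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
print(midpoint)
```

Unlike plain linear interpolation, the midpoint of two unit vectors keeps unit length, which is why this form is commonly preferred when blending Gaussian noise tensors.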

Mel

class diffusers.Mel

( x_res: int = 256, y_res: int = 256, sample_rate: int = 22050, n_fft: int = 2048, hop_length: int = 512, top_db: int = 80, n_iter: int = 32 )

Parameters

  • x_res (int) — x resolution of spectrogram (time)
  • y_res (int) — y resolution of spectrogram (frequency bins)
  • sample_rate (int) — sample rate of audio
  • n_fft (int) — size of the FFT window, in samples
  • hop_length (int) — hop length (a higher number is recommended when y_res is below 256)
  • top_db (int) — loudest decibel value (the decibel range represented in the spectrogram)
  • n_iter (int) — number of iterations for Griffin-Lim mel inversion

audio_slice_to_image

( slice: int ) → PIL Image

Parameters

  • slice (int) — slice number of audio to convert (out of get_number_of_slices())

Returns

PIL Image

grayscale image of x_res x y_res

Convert slice of audio to spectrogram.
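
Storing a spectrogram as a grayscale image means quantizing decibel values to 8-bit pixel levels, and converting back loses a little precision. The following is a minimal sketch of one such mapping, assuming the log-mel values lie in the range [-top_db, 0] dB relative to the peak. The function names db_to_pixel and pixel_to_db are hypothetical, and the exact scaling used by Mel may differ.

```python
def db_to_pixel(db, top_db=80):
    """Map a decibel value in [-top_db, 0] to an 8-bit grayscale level (0..255)."""
    scaled = (db + top_db) * 255 / top_db
    clipped = max(0.0, min(255.0, scaled))  # values outside the range saturate
    return int(clipped + 0.5)  # round to the nearest integer level


def pixel_to_db(pixel, top_db=80):
    """Inverse mapping: recover an approximate decibel value from a pixel level."""
    return pixel * top_db / 255 - top_db


loudest = db_to_pixel(0.0)       # peak maps to white
quietest = db_to_pixel(-80.0)    # floor maps to black
below_floor = db_to_pixel(-100.0)  # anything quieter is clipped to black
round_trip = pixel_to_db(db_to_pixel(-40.0))
print(loudest, quietest, below_floor, round_trip)
```

With top_db = 80, one gray level corresponds to about 0.31 dB, so a round trip through the image recovers each value to within a fraction of a decibel.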

get_audio_slice

( slice: int = 0 ) → np.ndarray

Parameters

  • slice (int) — slice number of audio (out of get_number_of_slices())

Returns

np.ndarray

audio as numpy array

Get slice of audio.

get_number_of_slices

( ) → int

Returns

int

number of spectrograms audio can be sliced into

Get number of slices in audio.
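
The number of slices follows from the spectrogram width and the hop length. The following is a minimal sketch of that arithmetic, assuming each slice spans about x_res * hop_length audio samples (one spectrogram column per hop). The helper slice_stats is hypothetical, and the library's exact slice size may differ by a sample or so.

```python
def slice_stats(num_samples, x_res=256, hop_length=512, sample_rate=22050):
    """Hypothetical helper: (number of whole slices, seconds of audio per slice)."""
    # Assumption: one slice fills the full spectrogram width, one column per hop.
    slice_size = x_res * hop_length
    number_of_slices = num_samples // slice_size
    seconds_per_slice = slice_size / sample_rate
    return number_of_slices, seconds_per_slice


# One minute of audio at the default sample rate.
n_slices, secs_per_slice = slice_stats(60 * 22050)
print(n_slices, round(secs_per_slice, 3))
```

Under these assumptions, a 256-column spectrogram at the default settings covers a little under six seconds of audio, so one minute of audio yields ten whole slices.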

get_sample_rate

( ) → int

Returns

int

sample rate of audio

Get sample rate.

image_to_audio

( image: Image ) → audio (np.ndarray)

Parameters

  • image (PIL Image) — x_res x y_res grayscale image

Returns

audio (np.ndarray)

raw audio

Converts spectrogram to audio.

load_audio

( audio_file: str = None, raw_audio: ndarray = None )

Parameters

  • audio_file (str) — path to an audio file; it must be a file on disk due to a Librosa limitation. Pass either this or raw_audio
  • raw_audio (np.ndarray) — audio as numpy array

Load audio.

set_resolution

( x_res: int, y_res: int )

Parameters

  • x_res (int) — x resolution of spectrogram (time)
  • y_res (int) — y resolution of spectrogram (frequency bins)

Set resolution.