MasonCrinr committed on
Commit a316a0b
1 Parent(s): b4d9186

Upload 6 files

Files changed (6)
  1. .gitignore +2 -0
  2. .pre-commit-config.yaml +41 -0
  3. LICENSE +21 -0
  4. LOGO.png +0 -0
  5. README.md +251 -0
  6. setup.py +29 -0
.gitignore ADDED
@@ -0,0 +1,2 @@
+ __pycache__
+ .mypy_cache
.pre-commit-config.yaml ADDED
@@ -0,0 +1,41 @@
+ repos:
+ -   repo: https://github.com/pre-commit/pre-commit-hooks
+     rev: v2.3.0
+     hooks:
+     -   id: end-of-file-fixer
+     -   id: trailing-whitespace
+
+ # Formats code correctly
+ -   repo: https://github.com/psf/black
+     rev: 21.12b0
+     hooks:
+     -   id: black
+         args: [
+             '--experimental-string-processing'
+         ]
+
+ # Sorts imports
+ -   repo: https://github.com/pycqa/isort
+     rev: 5.10.1
+     hooks:
+     -   id: isort
+         name: isort (python)
+         args: ["--profile", "black"]
+
+ # Checks for unused imports, line lengths, etc.
+ -   repo: https://gitlab.com/pycqa/flake8
+     rev: 4.0.0
+     hooks:
+     -   id: flake8
+         args: [
+             '--per-file-ignores=__init__.py:F401',
+             '--max-line-length=88',
+             '--ignore=E203,W503'
+         ]
+
+ # Checks types
+ -   repo: https://github.com/pre-commit/mirrors-mypy
+     rev: 'v0.971'
+     hooks:
+     -   id: mypy
+         additional_dependencies: [data-science-types>=0.2, torch>=1.6]
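These are standard pre-commit hooks; a typical way to activate and run them locally (standard `pre-commit` CLI usage, not specific to this repository) is:

```bash
pip install pre-commit
pre-commit install          # run the hooks automatically on each git commit
pre-commit run --all-files  # or run them once over the whole repository
```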
LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2022 archinet.ai
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
LOGO.png ADDED
README.md ADDED
@@ -0,0 +1,251 @@
+ <img src="./LOGO.png"></img>
+
+ A fully featured audio diffusion library for PyTorch. Includes models for unconditional audio generation, text-conditional audio generation, diffusion autoencoding, upsampling, and vocoding. The provided models are waveform-based; however, the U-Net (built using [`a-unet`](https://github.com/archinetai/a-unet)), `DiffusionModel`, diffusion method, and diffusion samplers are all generic to any dimension and highly customizable to work on other formats. **Notes: (1) no pre-trained models are provided here, (2) the configs shown are indicative and untested; see [Moûsai](https://arxiv.org/abs/2301.11757) for the configs used in the paper.**
+
+
+ ## Install
+
+ ```bash
+ pip install audio-diffusion-pytorch
+ ```
+
+ [![PyPI - Python Version](https://img.shields.io/pypi/v/audio-diffusion-pytorch?style=flat&colorA=black&colorB=black)](https://pypi.org/project/audio-diffusion-pytorch/)
+ [![Downloads](https://static.pepy.tech/personalized-badge/audio-diffusion-pytorch?period=total&units=international_system&left_color=black&right_color=black&left_text=Downloads)](https://pepy.tech/project/audio-diffusion-pytorch)
+
+
+ ## Usage
+
+ ### Unconditional Generator
+
+ ```py
+ import torch
+ from audio_diffusion_pytorch import DiffusionModel, UNetV0, VDiffusion, VSampler
+
+ model = DiffusionModel(
+     net_t=UNetV0, # The model type used for diffusion (U-Net V0 in this case)
+     in_channels=2, # U-Net: number of input/output (audio) channels
+     channels=[8, 32, 64, 128, 256, 512, 512, 1024, 1024], # U-Net: channels at each layer
+     factors=[1, 4, 4, 4, 2, 2, 2, 2, 2], # U-Net: downsampling and upsampling factors at each layer
+     items=[1, 2, 2, 2, 2, 2, 2, 4, 4], # U-Net: number of repeating items at each layer
+     attentions=[0, 0, 0, 0, 0, 1, 1, 1, 1], # U-Net: attention enabled/disabled at each layer
+     attention_heads=8, # U-Net: number of attention heads per attention item
+     attention_features=64, # U-Net: number of attention features per attention item
+     diffusion_t=VDiffusion, # The diffusion method used
+     sampler_t=VSampler, # The diffusion sampler used
+ )
+
+ # Train model with audio waveforms
+ audio = torch.randn(1, 2, 2**18) # [batch_size, in_channels, length]
+ loss = model(audio)
+ loss.backward()
+
+ # Turn noise into new audio sample with diffusion
+ noise = torch.randn(1, 2, 2**18) # [batch_size, in_channels, length]
+ sample = model.sample(noise, num_steps=10) # Suggested num_steps 10-100
+ ```
+
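+ The forward call above returns the diffusion loss for one batch; a minimal training-loop sketch around the `model` defined above is shown below. The optimizer choice, learning rate, and random stand-in batches are illustrative assumptions, not settings from this repository.
+
+ ```py
+ # Minimal training-loop sketch; AdamW, the learning rate, and the random
+ # batches are assumptions for illustration only.
+ optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
+
+ for step in range(100):  # replace with an epoch loop over a real dataloader
+     batch = torch.randn(2, 2, 2**18)  # stand-in for real waveforms [batch, channels, length]
+     loss = model(batch)  # diffusion loss on this batch
+     optimizer.zero_grad()
+     loss.backward()
+     optimizer.step()
+ ```
+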
+ ### Text-Conditional Generator
+ A text-to-audio diffusion model that conditions generation on `t5-base` text embeddings; requires `pip install transformers`.
+ ```py
+ import torch
+ from audio_diffusion_pytorch import DiffusionModel, UNetV0, VDiffusion, VSampler
+
+ model = DiffusionModel(
+     # ... same as unconditional model
+     use_text_conditioning=True, # U-Net: enables text conditioning (default T5-base)
+     use_embedding_cfg=True, # U-Net: enables classifier-free guidance
+     embedding_max_length=64, # U-Net: text embedding maximum length (default for T5-base)
+     embedding_features=768, # U-Net: text embedding features (default for T5-base)
+     cross_attentions=[0, 0, 0, 1, 1, 1, 1, 1, 1], # U-Net: cross-attention enabled/disabled at each layer
+ )
+
+ # Train model with audio waveforms
+ audio_wave = torch.randn(1, 2, 2**18) # [batch, in_channels, length]
+ loss = model(
+     audio_wave,
+     text=['The audio description'], # Text conditioning, one element per batch
+     embedding_mask_proba=0.1 # Probability of masking text with learned embedding (classifier-free guidance mask)
+ )
+ loss.backward()
+
+ # Turn noise into new audio sample with diffusion
+ noise = torch.randn(1, 2, 2**18)
+ sample = model.sample(
+     noise,
+     text=['The audio description'],
+     embedding_scale=5.0, # Higher for more text importance, suggested range: 1-15 (classifier-free guidance scale)
+     num_steps=2 # Higher for better quality, suggested num_steps: 10-100
+ )
+ ```
+
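+ To listen to a generated sample it can be written to disk with `torchaudio` (already a dependency of this package). The output path and the 48 kHz sample rate below are assumptions for illustration; use the sample rate of your own training data.
+
+ ```py
+ import torchaudio
+
+ # `sample` from model.sample(...) has shape [batch, channels, length];
+ # torchaudio.save expects a [channels, length] tensor.
+ torchaudio.save("sample.wav", sample[0].detach().cpu(), sample_rate=48_000)  # 48 kHz is an assumption
+ ```
+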
+ ### Diffusion Upsampler
+ Upsample audio from a lower sample rate to a higher sample rate using diffusion, e.g. 3kHz to 48kHz.
+ ```py
+ import torch
+ from audio_diffusion_pytorch import DiffusionUpsampler, UNetV0, VDiffusion, VSampler
+
+ upsampler = DiffusionUpsampler(
+     net_t=UNetV0, # The model type used for diffusion
+     upsample_factor=16, # The upsample factor (e.g. 16 can be used for 3kHz to 48kHz)
+     in_channels=2, # U-Net: number of input/output (audio) channels
+     channels=[8, 32, 64, 128, 256, 512, 512, 1024, 1024], # U-Net: channels at each layer
+     factors=[1, 4, 4, 4, 2, 2, 2, 2, 2], # U-Net: downsampling and upsampling factors at each layer
+     items=[1, 2, 2, 2, 2, 2, 2, 4, 4], # U-Net: number of repeating items at each layer
+     diffusion_t=VDiffusion, # The diffusion method used
+     sampler_t=VSampler, # The diffusion sampler used
+ )
+
+ # Train model with high sample rate audio waveforms
+ audio = torch.randn(1, 2, 2**18) # [batch, in_channels, length]
+ loss = upsampler(audio)
+ loss.backward()
+
+ # Turn low sample rate audio into high sample rate
+ downsampled_audio = torch.randn(1, 2, 2**14) # [batch, in_channels, length]
+ sample = upsampler.sample(downsampled_audio, num_steps=10) # Output has shape: [1, 2, 2**18]
+ ```
+
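+ At inference time the low sample rate input would come from a real recording rather than random noise; one way to produce it is `torchaudio.functional.resample`. The 3 kHz target rate mirrors the example above, while loading a stereo file named `input.wav` is an assumption.
+
+ ```py
+ import torchaudio
+ import torchaudio.functional as AF
+
+ waveform, sr = torchaudio.load("input.wav")  # [channels, length]; assumed stereo 48 kHz recording
+ low_rate = AF.resample(waveform, orig_freq=sr, new_freq=3_000)  # downsample to 3 kHz
+ sample = upsampler.sample(low_rate.unsqueeze(0), num_steps=10)  # add a batch dim, then upsample by 16x
+ ```
+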
+ ### Diffusion Vocoder
+ Convert a mel spectrogram to a waveform using diffusion.
+ ```py
+ import torch
+ from audio_diffusion_pytorch import DiffusionVocoder, UNetV0, VDiffusion, VSampler
+
+ vocoder = DiffusionVocoder(
+     mel_n_fft=1024, # Mel-spectrogram n_fft
+     mel_channels=80, # Mel-spectrogram channels
+     mel_sample_rate=48000, # Mel-spectrogram sample rate
+     mel_normalize_log=True, # Mel-spectrogram log normalization (alternative is mel_normalize=True for [-1,1] power normalization)
+     net_t=UNetV0, # The model type used for diffusion vocoding
+     channels=[8, 32, 64, 128, 256, 512, 512, 1024, 1024], # U-Net: channels at each layer
+     factors=[1, 4, 4, 4, 2, 2, 2, 2, 2], # U-Net: downsampling and upsampling factors at each layer
+     items=[1, 2, 2, 2, 2, 2, 2, 4, 4], # U-Net: number of repeating items at each layer
+     diffusion_t=VDiffusion, # The diffusion method used
+     sampler_t=VSampler, # The diffusion sampler used
+ )
+
+ # Train model on waveforms (automatically converted to mel internally)
+ audio = torch.randn(1, 2, 2**18) # [batch, in_channels, length]
+ loss = vocoder(audio)
+ loss.backward()
+
+ # Turn mel spectrogram into waveform
+ mel_spectrogram = torch.randn(1, 2, 80, 1024) # [batch, in_channels, mel_channels, mel_length]
+ sample = vocoder.sample(mel_spectrogram, num_steps=10) # Output has shape: [1, 2, 2**18]
+ ```
+
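+ In practice the mel spectrogram passed to `sample` would come from a real waveform or an upstream model rather than from noise. The sketch below uses `torchaudio.transforms.MelSpectrogram` purely as a shape illustration; the hop length and log normalization are assumptions and may not match the transform the vocoder applies internally.
+
+ ```py
+ import torch
+ import torchaudio
+
+ # n_fft / n_mels / sample_rate mirror the vocoder config above; hop_length is an assumption
+ to_mel = torchaudio.transforms.MelSpectrogram(sample_rate=48000, n_fft=1024, hop_length=256, n_mels=80)
+ audio = torch.randn(1, 2, 2**18)           # [batch, channels, length]
+ mel = torch.log(to_mel(audio) + 1e-5)      # rough log normalization (assumed)
+ mel = mel[..., :1024]                      # crop to the [batch, channels, 80, 1024] shape used above
+ sample = vocoder.sample(mel, num_steps=10)
+ ```
+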
133
+ ### Diffusion Autoencoder
134
+ Autoencode audio into a compressed latent using diffusion. Any encoder can be provided as long as it subclasses the `EncoderBase` class or contains an `out_channels` and `downsample_factor` field.
135
+ ```py
136
+ from audio_diffusion_pytorch import DiffusionAE, UNetV0, VDiffusion, VSampler
137
+ from audio_encoders_pytorch import MelE1d, TanhBottleneck
138
+
139
+ autoencoder = DiffusionAE(
140
+ encoder=MelE1d( # The encoder used, in this case a mel-spectrogram encoder
141
+ in_channels=2,
142
+ channels=512,
143
+ multipliers=[1, 1],
144
+ factors=[2],
145
+ num_blocks=[12],
146
+ out_channels=32,
147
+ mel_channels=80,
148
+ mel_sample_rate=48000,
149
+ mel_normalize_log=True,
150
+ bottleneck=TanhBottleneck(),
151
+ ),
152
+ inject_depth=6,
153
+ net_t=UNetV0, # The model type used for diffusion upsampling
154
+ in_channels=2, # U-Net: number of input/output (audio) channels
155
+ channels=[8, 32, 64, 128, 256, 512, 512, 1024, 1024], # U-Net: channels at each layer
156
+ factors=[1, 4, 4, 4, 2, 2, 2, 2, 2], # U-Net: downsampling and upsampling factors at each layer
157
+ items=[1, 2, 2, 2, 2, 2, 2, 4, 4], # U-Net: number of repeating items at each layer
158
+ diffusion_t=VDiffusion, # The diffusion method used
159
+ sampler_t=VSampler, # The diffusion sampler used
160
+ )
161
+
162
+ # Train autoencoder with audio samples
163
+ audio = torch.randn(1, 2, 2**18) # [batch, in_channels, length]
164
+ loss = autoencoder(audio)
165
+ loss.backward()
166
+
167
+ # Encode/decode audio
168
+ audio = torch.randn(1, 2, 2**18) # [batch, in_channels, length]
169
+ latent = autoencoder.encode(audio) # Encode
170
+ sample = autoencoder.decode(latent, num_steps=10) # Decode by sampling diffusion model conditioning on latent
171
+ ```
172
+
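+ A quick sanity check of the autoencoder is a round trip: encode, decode, and compare shapes and a rough reconstruction error. This reuses the `autoencoder` defined above; the error is only meaningful once the model has actually been trained.
+
+ ```py
+ audio = torch.randn(1, 2, 2**18)                  # [batch, in_channels, length]
+ latent = autoencoder.encode(audio)                # compressed latent, channel count == out_channels above
+ recon = autoencoder.decode(latent, num_steps=10)  # diffusion decoding conditioned on the latent
+
+ print(latent.shape, recon.shape)                  # recon should match the input shape
+ print(torch.mean((audio - recon) ** 2).item())    # rough reconstruction error
+ ```
+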
173
+ ## Other
174
+
175
+ ### Inpainting
176
+ ```py
177
+ from audio_diffusion_pytorch import UNetV0, VInpainter
178
+
179
+ # The diffusion UNetV0 (this is an example, the net must be trained to work)
180
+ net = UNetV0(
181
+ dim=1,
182
+ in_channels=2, # U-Net: number of input/output (audio) channels
183
+ channels=[8, 32, 64, 128, 256, 512, 512, 1024, 1024], # U-Net: channels at each layer
184
+ factors=[1, 4, 4, 4, 2, 2, 2, 2, 2], # U-Net: downsampling and upsampling factors at each layer
185
+ items=[1, 2, 2, 2, 2, 2, 2, 4, 4], # U-Net: number of repeating items at each layer
186
+ attentions=[0, 0, 0, 0, 0, 1, 1, 1, 1], # U-Net: attention enabled/disabled at each layer
187
+ attention_heads=8, # U-Net: number of attention heads per attention block
188
+ attention_features=64, # U-Net: number of attention features per attention block,
189
+ )
190
+
191
+ # Instantiate inpainter with trained net
192
+ inpainter = VInpainter(net=net)
193
+
194
+ # Inpaint source
195
+ y = inpainter(
196
+ source=torch.randn(1, 2, 2**18), # Start source
197
+ mask=torch.randint(0, 2, (1, 2, 2 ** 18), dtype=torch.bool), # Set to `True` the parts you want to keep
198
+ num_steps=10, # Number of inpainting steps
199
+ num_resamples=2, # Number of resampling steps
200
+ show_progress=True,
201
+ ) # [1, 2, 2 ** 18]
202
+ ```
203
+
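+ A more typical mask keeps part of a real recording fixed and regenerates the rest. The sketch below keeps the first half and inpaints the second half; the random source tensor stands in for a real waveform.
+
+ ```py
+ length = 2**18
+ source = torch.randn(1, 2, length)                 # placeholder for a real recording
+ mask = torch.zeros(1, 2, length, dtype=torch.bool)
+ mask[..., : length // 2] = True                    # keep the first half, regenerate the second
+
+ y = inpainter(source=source, mask=mask, num_steps=10, num_resamples=2, show_progress=True)
+ ```
+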
204
+ ## Appreciation
205
+
206
+ * [StabilityAI](https://stability.ai/) for the compute, [Zach Evans](https://github.com/zqevans) and everyone else from [HarmonAI](https://www.harmonai.org/) for the interesting research discussions.
207
+ * [ETH Zurich](https://inf.ethz.ch/) for the resources, [Zhijing Jin](https://zhijing-jin.com/), [Bernhard Schoelkopf](https://is.mpg.de/~bs), and [Mrinmaya Sachan](http://www.mrinmaya.io/) for supervising this Thesis.
208
+ * [Phil Wang](https://github.com/lucidrains) for the beautiful open source contributions on [diffusion](https://github.com/lucidrains/denoising-diffusion-pytorch) and [Imagen](https://github.com/lucidrains/imagen-pytorch).
209
+ * [Katherine Crowson](https://github.com/crowsonkb) for the experiments with [k-diffusion](https://github.com/crowsonkb/k-diffusion) and the insane collection of samplers.
210
+
211
+ ## Citations
212
+
213
+ DDPM Diffusion
214
+ ```bibtex
215
+ @misc{2006.11239,
216
+ Author = {Jonathan Ho and Ajay Jain and Pieter Abbeel},
217
+ Title = {Denoising Diffusion Probabilistic Models},
218
+ Year = {2020},
219
+ Eprint = {arXiv:2006.11239},
220
+ }
221
+ ```
222
+
223
+ DDIM (V-Sampler)
224
+ ```bibtex
225
+ @misc{2010.02502,
226
+ Author = {Jiaming Song and Chenlin Meng and Stefano Ermon},
227
+ Title = {Denoising Diffusion Implicit Models},
228
+ Year = {2020},
229
+ Eprint = {arXiv:2010.02502},
230
+ }
231
+ ```
232
+
233
+ V-Diffusion
234
+ ```bibtex
235
+ @misc{2202.00512,
236
+ Author = {Tim Salimans and Jonathan Ho},
237
+ Title = {Progressive Distillation for Fast Sampling of Diffusion Models},
238
+ Year = {2022},
239
+ Eprint = {arXiv:2202.00512},
240
+ }
241
+ ```
242
+
243
+ Imagen (T5 Text Conditioning)
244
+ ```bibtex
245
+ @misc{2205.11487,
246
+ Author = {Chitwan Saharia and William Chan and Saurabh Saxena and Lala Li and Jay Whang and Emily Denton and Seyed Kamyar Seyed Ghasemipour and Burcu Karagol Ayan and S. Sara Mahdavi and Rapha Gontijo Lopes and Tim Salimans and Jonathan Ho and David J Fleet and Mohammad Norouzi},
247
+ Title = {Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding},
248
+ Year = {2022},
249
+ Eprint = {arXiv:2205.11487},
250
+ }
251
+ ```
setup.py ADDED
@@ -0,0 +1,29 @@
+ from setuptools import find_packages, setup
+
+ setup(
+     name="audio-diffusion-pytorch",
+     packages=find_packages(exclude=[]),
+     version="0.1.3",
+     license="MIT",
+     description="Audio Diffusion - PyTorch",
+     long_description_content_type="text/markdown",
+     author="Flavio Schneider",
+     author_email="archinetai@protonmail.com",
+     url="https://github.com/archinetai/audio-diffusion-pytorch",
+     keywords=["artificial intelligence", "deep learning", "audio generation"],
+     install_requires=[
+         "tqdm",
+         "torch>=1.6",
+         "torchaudio",
+         "data-science-types>=0.2",
+         "einops>=0.6",
+         "a-unet",
+     ],
+     classifiers=[
+         "Development Status :: 4 - Beta",
+         "Intended Audience :: Developers",
+         "Topic :: Scientific/Engineering :: Artificial Intelligence",
+         "License :: OSI Approved :: MIT License",
+         "Programming Language :: Python :: 3.6",
+     ],
+ )
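For local development against this `setup.py`, the usual route is an editable install (standard pip usage, not specific to this package):

```bash
git clone https://github.com/archinetai/audio-diffusion-pytorch.git
cd audio-diffusion-pytorch
pip install -e .
```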