Latent Diffusion Models

arXiv | BibTeX

High-Resolution Image Synthesis with Latent Diffusion Models
Robin Rombach*, Andreas Blattmann*, Dominik Lorenz, Patrick Esser, Björn Ommer
* equal contribution

News

April 2022

More pre-trained LDMs are available:
- A 1.45B model trained on the LAION-400M database.
- A class-conditional model on ImageNet, achieving a FID of 3.6 when using classifier-free guidance Available via a colab notebook .

Requirements

A suitable conda environment named ldm can be created and activated with:

conda env create -f environment.yaml
conda activate ldm

Pretrained Models

A general list of all available checkpoints is available in via our model zoo. If you use any of these models in your work, we are always happy to receive a citation.

Text-to-Image

Download the pre-trained weights (5.7GB)

mkdir -p models/ldm/text2img-large/
wget -O models/ldm/text2img-large/model.ckpt https://ommer-lab.com/files/latent-diffusion/nitro/txt2img-f8-large/model.ckpt

and sample with

python scripts/txt2img.py --prompt "a virus monster is playing guitar, oil on canvas" --ddim_eta 0.0 --n_samples 4 --n_iter 4 --scale 5.0  --ddim_steps 50

This will save each sample individually as well as a grid of size n_iter x n_samples at the specified output location (default: outputs/txt2img-samples). Quality, sampling speed and diversity are best controlled via the scale, ddim_steps and ddim_eta arguments. As a rule of thumb, higher values of scale produce better samples at the cost of a reduced output diversity.
Furthermore, increasing ddim_steps generally also gives higher quality samples, but returns are diminishing for values > 250. Fast sampling (i.e. low values of ddim_steps) while retaining good quality can be achieved by using --ddim_eta 0.0.
Faster sampling (i.e. even lower values of ddim_steps) while retaining good quality can be achieved by using --ddim_eta 0.0 and --plms (added by Katherine Crowson, see Pseudo Numerical Methods for Diffusion Models on Manifolds).

Beyond 256²

For certain inputs, simply running the model in a convolutional fashion on larger features than it was trained on can sometimes result in interesting results. To try it out, tune the H and W arguments (which will be integer-divided by 8 in order to calculate the corresponding latent size), e.g. run

python scripts/txt2img.py --prompt "a sunset behind a mountain range, vector image" --ddim_eta 1.0 --n_samples 1 --n_iter 1 --H 384 --W 1024 --scale 5.0

to create a sample of size 384x1024. Note, however, that controllability is reduced compared to the 256x256 setting.

The example below was generated using the above command.

Inpainting

Download the pre-trained weights

wget -O models/ldm/inpainting_big/last.ckpt https://heibox.uni-heidelberg.de/f/4d9ac7ea40c64582b7c9/?dl=1

and sample with

python scripts/inpaint.py --indir data/inpainting_examples/ --outdir outputs/inpainting_results

indir should contain images *.png and masks <image_fname>_mask.png like the examples provided in data/inpainting_examples.

Class-Conditional ImageNet

Available via a notebook .

Unconditional Models

We also provide a script for sampling from unconditional LDMs (e.g. LSUN, FFHQ, ...). Start it via

CUDA_VISIBLE_DEVICES=<GPU_ID> python scripts/sample_diffusion.py -r models/ldm/<model_spec>/model.ckpt -l <logdir> -n <\#samples> --batch_size <batch_size> -c <\#ddim steps> -e <\#eta>

Train your own LDMs

Data preparation

Faces

For downloading the CelebA-HQ and FFHQ datasets, proceed as described in the taming-transformers repository.

LSUN

The LSUN datasets can be conveniently downloaded via the script available here. We performed a custom split into training and validation images, and provide the corresponding filenames at https://ommer-lab.com/files/lsun.zip. After downloading, extract them to ./data/lsun. The beds/cats/churches subsets should also be placed/symlinked at ./data/lsun/bedrooms/./data/lsun/cats/./data/lsun/churches, respectively.

ImageNet

The code will try to download (through Academic Torrents) and prepare ImageNet the first time it is used. However, since ImageNet is quite large, this requires a lot of disk space and time. If you already have ImageNet on your disk, you can speed things up by putting the data into ${XDG_CACHE}/autoencoders/data/ILSVRC2012_{split}/data/ (which defaults to ~/.cache/autoencoders/data/ILSVRC2012_{split}/data/), where {split} is one of train/validation. It should have the following structure:

${XDG_CACHE}/autoencoders/data/ILSVRC2012_{split}/data/
├── n01440764
│   ├── n01440764_10026.JPEG
│   ├── n01440764_10027.JPEG
│   ├── ...
├── n01443537
│   ├── n01443537_10007.JPEG
│   ├── n01443537_10014.JPEG
│   ├── ...
├── ...

If you haven't extracted the data, you can also place ILSVRC2012_img_train.tar/ILSVRC2012_img_val.tar (or symlinks to them) into ${XDG_CACHE}/autoencoders/data/ILSVRC2012_train/ / ${XDG_CACHE}/autoencoders/data/ILSVRC2012_validation/, which will then be extracted into above structure without downloading it again. Note that this will only happen if neither a folder ${XDG_CACHE}/autoencoders/data/ILSVRC2012_{split}/data/ nor a file ${XDG_CACHE}/autoencoders/data/ILSVRC2012_{split}/.ready exist. Remove them if you want to force running the dataset preparation again.

Model Training

Logs and checkpoints for trained models are saved to logs/<START_DATE_AND_TIME>_<config_spec>.

Training autoencoder models

Configs for training a KL-regularized autoencoder on ImageNet are provided at configs/autoencoder. Training can be started by running

CUDA_VISIBLE_DEVICES=<GPU_ID> python main.py --base configs/autoencoder/<config_spec>.yaml -t --gpus 0,

where config_spec is one of {autoencoder_kl_8x8x64(f=32, d=64), autoencoder_kl_16x16x16(f=16, d=16), autoencoder_kl_32x32x4(f=8, d=4), autoencoder_kl_64x64x3(f=4, d=3)}.

For training VQ-regularized models, see the taming-transformers repository.

Training LDMs

In configs/latent-diffusion/ we provide configs for training LDMs on the LSUN-, CelebA-HQ, FFHQ and ImageNet datasets. Training can be started by running

CUDA_VISIBLE_DEVICES=<GPU_ID> python main.py --base configs/latent-diffusion/<config_spec>.yaml -t --gpus 0,

where <config_spec> is one of {celebahq-ldm-vq-4(f=4, VQ-reg. autoencoder, spatial size 64x64x3),ffhq-ldm-vq-4(f=4, VQ-reg. autoencoder, spatial size 64x64x3), lsun_bedrooms-ldm-vq-4(f=4, VQ-reg. autoencoder, spatial size 64x64x3), lsun_churches-ldm-vq-4(f=8, KL-reg. autoencoder, spatial size 32x32x4),cin-ldm-vq-8(f=8, VQ-reg. autoencoder, spatial size 32x32x4)}.

Model Zoo

Pretrained Autoencoding Models

All models were trained until convergence (no further substantial improvement in rFID).

Model	rFID vs val	train steps	PSNR	PSIM	Link	Comments
f=4, VQ (Z=8192, d=3)	0.58	533066	27.43 +/- 4.26	0.53 +/- 0.21	https://ommer-lab.com/files/latent-diffusion/vq-f4.zip
f=4, VQ (Z=8192, d=3)	1.06	658131	25.21 +/- 4.17	0.72 +/- 0.26	https://heibox.uni-heidelberg.de/f/9c6681f64bb94338a069/?dl=1	no attention
f=8, VQ (Z=16384, d=4)	1.14	971043	23.07 +/- 3.99	1.17 +/- 0.36	https://ommer-lab.com/files/latent-diffusion/vq-f8.zip
f=8, VQ (Z=256, d=4)	1.49	1608649	22.35 +/- 3.81	1.26 +/- 0.37	https://ommer-lab.com/files/latent-diffusion/vq-f8-n256.zip
f=16, VQ (Z=16384, d=8)	5.15	1101166	20.83 +/- 3.61	1.73 +/- 0.43	https://heibox.uni-heidelberg.de/f/0e42b04e2e904890a9b6/?dl=1

f=4, KL	0.27	176991	27.53 +/- 4.54	0.55 +/- 0.24	https://ommer-lab.com/files/latent-diffusion/kl-f4.zip
f=8, KL	0.90	246803	24.19 +/- 4.19	1.02 +/- 0.35	https://ommer-lab.com/files/latent-diffusion/kl-f8.zip
f=16, KL (d=16)	0.87	442998	24.08 +/- 4.22	1.07 +/- 0.36	https://ommer-lab.com/files/latent-diffusion/kl-f16.zip
f=32, KL (d=64)	2.04	406763	22.27 +/- 3.93	1.41 +/- 0.40	https://ommer-lab.com/files/latent-diffusion/kl-f32.zip

Get the models

Running the following script downloads und extracts all available pretrained autoencoding models.

bash scripts/download_first_stages.sh

The first stage models can then be found in models/first_stage_models/<model_spec>

Pretrained LDMs

Datset	Task	Model	FID	IS	Prec	Recall	Link	Comments
CelebA-HQ	Unconditional Image Synthesis	LDM-VQ-4 (200 DDIM steps, eta=0)	5.11 (5.11)	3.29	0.72	0.49	https://ommer-lab.com/files/latent-diffusion/celeba.zip
FFHQ	Unconditional Image Synthesis	LDM-VQ-4 (200 DDIM steps, eta=1)	4.98 (4.98)	4.50 (4.50)	0.73	0.50	https://ommer-lab.com/files/latent-diffusion/ffhq.zip
LSUN-Churches	Unconditional Image Synthesis	LDM-KL-8 (400 DDIM steps, eta=0)	4.02 (4.02)	2.72	0.64	0.52	https://ommer-lab.com/files/latent-diffusion/lsun_churches.zip
LSUN-Bedrooms	Unconditional Image Synthesis	LDM-VQ-4 (200 DDIM steps, eta=1)	2.95 (3.0)	2.22 (2.23)	0.66	0.48	https://ommer-lab.com/files/latent-diffusion/lsun_bedrooms.zip
ImageNet	Class-conditional Image Synthesis	LDM-VQ-8 (200 DDIM steps, eta=1)	7.77(7.76)* /15.82**	201.56(209.52)* /78.82**	0.84* / 0.65**	0.35* / 0.63**	https://ommer-lab.com/files/latent-diffusion/cin.zip	: w/ guiding, classifier_scale 10 *: w/o guiding, scores in bracket calculated with script provided by ADM
Conceptual Captions	Text-conditional Image Synthesis	LDM-VQ-f4 (100 DDIM steps, eta=0)	16.79	13.89	N/A	N/A	https://ommer-lab.com/files/latent-diffusion/text2img.zip	finetuned from LAION
OpenImages	Super-resolution	LDM-VQ-4	N/A	N/A	N/A	N/A	https://ommer-lab.com/files/latent-diffusion/sr_bsr.zip	BSR image degradation
OpenImages	Layout-to-Image Synthesis	LDM-VQ-4 (200 DDIM steps, eta=0)	32.02	15.92	N/A	N/A	https://ommer-lab.com/files/latent-diffusion/layout2img_model.zip
Landscapes	Semantic Image Synthesis	LDM-VQ-4	N/A	N/A	N/A	N/A	https://ommer-lab.com/files/latent-diffusion/semantic_synthesis256.zip
Landscapes	Semantic Image Synthesis	LDM-VQ-4	N/A	N/A	N/A	N/A	https://ommer-lab.com/files/latent-diffusion/semantic_synthesis.zip	finetuned on resolution 512x512

Get the models

The LDMs listed above can jointly be downloaded and extracted via

bash scripts/download_models.sh

The models can then be found in models/ldm/<model_spec>.

Coming Soon...

More inference scripts for conditional LDMs.
In the meantime, you can play with our colab notebook https://colab.research.google.com/drive/1xqzUi2iXQXDqXBHQGP9Mqt2YrYW6cx-J?usp=sharing

Comments

Our codebase for the diffusion models builds heavily on OpenAI's ADM codebase and https://github.com/lucidrains/denoising-diffusion-pytorch. Thanks for open-sourcing!
The implementation of the transformer encoder is from x-transformers by lucidrains.

BibTeX

@misc{rombach2021highresolution,
      title={High-Resolution Image Synthesis with Latent Diffusion Models}, 
      author={Robin Rombach and Andreas Blattmann and Dominik Lorenz and Patrick Esser and Björn Ommer},
      year={2021},
      eprint={2112.10752},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}