Model Card for flex-diffusion-2-1

stable-diffusion-2-1 (stabilityai/stable-diffusion-2-1) finetuned with different aspect ratios.

TLDR:

There are 2 models in this repo:

One based on stable-diffusion-2-1 (stabilityai/stable-diffusion-2-1) finetuned for 6k steps.
One based on stable-diffusion-2-base (stabilityai/stable-diffusion-2-base) finetuned for 6k steps, on the same dataset.

For usage, see How to Get Started with the Model.

It aims to solve the following issues:

Generated images look like they are cropped from a larger image.
Generating non-square images gives poor results, because the base model was trained on square images. Examples:

resolution	model	stable diffusion	flex diffusion
576x1024 (9:16)	v2-1	[image]	[image]
576x1024 (9:16)	v2-base	[image]	[image]
1024x576 (16:9)	v2-1	[image]	[image]
1024x576 (16:9)	v2-base	[image]	[image]

Limitations:

It's trained on a small dataset, so its improvements may be limited.
For each aspect ratio, it's trained at only one fixed resolution, so it may not generate well at other resolutions. For the 1:1 aspect ratio, it's finetuned at 512x512, although stable-diffusion-2-1 was last finetuned at 768x768.

Potential improvements:

Train on a larger dataset.
Train on different resolutions even for the same aspect ratio.
Train on specific aspect ratios, instead of a range of aspect ratios.

Table of Contents
Model Details
- Model Description
Uses
Training Details
- Training Data
- Training Procedure
  - Preprocessing
  - Speeds, Sizes, Times
Results
Model Card Authors
How to Get Started with the Model

Model Details

Model Description

stable-diffusion-2-1 (stabilityai/stable-diffusion-2-1) finetuned for dynamic aspect ratios.

Finetuned resolutions:

	width	height	aspect ratio
0	512	1024	1:2
1	576	1024	9:16
2	576	960	3:5
3	640	1024	5:8
4	512	768	2:3
5	640	896	5:7
6	576	768	3:4
7	512	640	4:5
8	640	768	5:6
9	640	704	10:11
10	512	512	1:1
11	704	640	11:10
12	768	640	6:5
13	640	512	5:4
14	768	576	4:3
15	896	640	7:5
16	768	512	3:2
17	1024	640	8:5
18	960	576	5:3
19	1024	576	16:9
20	1024	512	2:1

Developed by: Jonathan Chang
Model type: Diffusion-based text-to-image generation model
Language(s): English
License: creativeml-openrail-m
Parent Model: https://huggingface.co/stabilityai/stable-diffusion-2-1
Resources for more information: More information needed

Uses

See https://huggingface.co/stabilityai/stable-diffusion-2-1

Training Details

Training Data

The subset of the LAION aesthetic dataset with an aesthetic rating of 6 or higher:
- https://laion.ai/blog/laion-aesthetics/
- https://huggingface.co/datasets/ChristophSchuhmann/improved_aesthetics_6plus
I only used a small portion of it; see Preprocessing.
Most common aspect ratios in the dataset (before preprocessing):

	aspect_ratio	counts
0	1:1	154727
1	3:2	119615
2	2:3	61197
3	4:3	52276
4	16:9	38862
5	400:267	21893
6	3:4	16893
7	8:5	16258
8	4:5	15684
9	6:5	12228
10	1000:667	12097
11	2:1	11006
12	800:533	10259
13	5:4	9753
14	500:333	9700
15	250:167	9114
16	5:3	8460
17	200:133	7832
18	1024:683	7176
19	11:10	6470
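
The ratios in the table above look like width:height pairs reduced to lowest terms, which would explain entries such as 400:267 and 1024:683 that sit close to 3:2 without being equal to it. The following is a minimal sketch of how such a count could be produced. It is an assumption, not the script actually used for this card, and the function names `reduced_ratio` and `most_common_ratios` are hypothetical.

```python
from collections import Counter
from math import gcd


def reduced_ratio(width, height):
    """Return 'w:h' reduced to lowest terms, e.g. 1920x1080 -> '16:9'."""
    g = gcd(width, height)
    return f"{width // g}:{height // g}"


def most_common_ratios(sizes, top=20):
    """Count reduced aspect ratios over an iterable of (width, height) pairs."""
    counts = Counter(reduced_ratio(w, h) for w, h in sizes)
    return counts.most_common(top)
```

The `sizes` argument could be built from the WIDTH and HEIGHT columns of the parquet file. Because near-identical ratios are counted separately here, the preprocessing step maps each image to the closest predefined ratio instead of relying on exact matches.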

Predefined aspect ratios and their resolutions:

	width	height	aspect ratio
0	512	1024	1:2
1	576	1024	9:16
2	576	960	3:5
3	640	1024	5:8
4	512	768	2:3
5	640	896	5:7
6	576	768	3:4
7	512	640	4:5
8	640	768	5:6
9	640	704	10:11
10	512	512	1:1
11	704	640	11:10
12	768	640	6:5
13	640	512	5:4
14	768	576	4:3
15	896	640	7:5
16	768	512	3:2
17	1024	640	8:5
18	960	576	5:3
19	1024	576	16:9
20	1024	512	2:1

Training Procedure

Preprocessing

Download the files with URLs and captions from https://huggingface.co/datasets/ChristophSchuhmann/improved_aesthetics_6plus

I only used the first file, train-00000-of-00007-29aec9150af50f9f.parquet.

Use img2dataset to convert it to webdataset format:
- https://github.com/rom1504/img2dataset
- I put train-00000-of-00007-29aec9150af50f9f.parquet in a folder called first-file.
- The output folder is /mnt/aesthetics6plus; change this to your own folder.

# set the shell variables (a plain `echo` would only print them)
INPUT_FOLDER=first-file
OUTPUT_FOLDER=/mnt/aesthetics6plus
img2dataset --url_list "$INPUT_FOLDER" --input_format "parquet" \
        --url_col "URL" --caption_col "TEXT" --output_format webdataset \
        --output_folder "$OUTPUT_FOLDER" --processes_count 3 --thread_count 6 \
        --image_size 1024 --resize_only_if_bigger --resize_mode=keep_ratio_largest \
        --save_additional_columns '["WIDTH","HEIGHT","punsafe","similarity"]' --enable_wandb True

The data-loading code does the preprocessing on the fly, so nothing else needs to be done beforehand. It is not optimized for speed, though: GPU utilization fluctuates between 80% and 100%. It is also not written for multi-GPU training, so use it with caution. The code does the following:

Use webdataset to load the data.
Calculate the aspect ratio of each image.
Find the closest aspect ratio and its associated resolution among the predefined resolutions: argmin(abs(aspect_ratio - predefined_aspect_ratios)). E.g. if the aspect ratio is 1:3, the closest predefined aspect ratio is 1:2, and its associated resolution is 512x1024.
Keeping the aspect ratio, resize the image so that each side is at least as large as the associated resolution. E.g. resize to 512x(512*3) = 512x1536.
Randomly crop the image to the associated resolution. E.g. crop to 512x1024.
If more than 10% of the image is lost in the cropping, discard this example.
Batch examples by aspect ratio, so that all examples in a batch have the same aspect ratio.
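
The steps above can be sketched in plain Python, working on image sizes only rather than on pixels. This is an illustrative sketch, not the actual data-loading code: the function names are hypothetical, and it assumes that "more than 10% of the image is lost" refers to the fraction of the resized image's area removed by the crop.

```python
import random
from collections import defaultdict

# (width, height) pairs copied from the predefined resolutions table.
RESOLUTIONS = [
    (512, 1024), (576, 1024), (576, 960), (640, 1024), (512, 768),
    (640, 896), (576, 768), (512, 640), (640, 768), (640, 704),
    (512, 512), (704, 640), (768, 640), (640, 512), (768, 576),
    (896, 640), (768, 512), (1024, 640), (960, 576), (1024, 576),
    (1024, 512),
]


def closest_resolution(width, height):
    """argmin(abs(aspect_ratio - predefined_aspect_ratios))."""
    ratio = width / height
    return min(RESOLUTIONS, key=lambda wh: abs(ratio - wh[0] / wh[1]))


def plan_resize_and_crop(width, height, max_loss=0.10, rng=random):
    """Return (target, resized, crop_box), or None if the crop loses too much."""
    tw, th = closest_resolution(width, height)
    # Scale so both sides are >= the target while keeping the aspect ratio.
    scale = max(tw / width, th / height)
    rw = max(tw, round(width * scale))
    rh = max(th, round(height * scale))
    # Assumption: "lost" means the area fraction removed by the crop.
    lost = 1 - (tw * th) / (rw * rh)
    if lost > max_loss:
        return None
    left = rng.randint(0, rw - tw)
    top = rng.randint(0, rh - th)
    return (tw, th), (rw, rh), (left, top, left + tw, top + th)


def batch_by_resolution(sizes, batch_size):
    """Yield (resolution, indices) so each batch shares one resolution."""
    buckets = defaultdict(list)
    for idx, (w, h) in enumerate(sizes):
        key = closest_resolution(w, h)
        buckets[key].append(idx)
        if len(buckets[key]) == batch_size:
            yield key, buckets.pop(key)
```

Under this reading, the 1:3 example is resized to 512x1536 and cropping it to 512x1024 removes a third of the image, so `plan_resize_and_crop(1000, 3000)` returns `None` and the example is discarded.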

Speeds, Sizes, Times

Dataset size: 100k image-caption pairs, before filtering.
- I didn't wait for the whole dataset to download. I copied the first 10 tar files and their index files to a new folder called aesthetics6plus-small, with 100k image-caption pairs in total. The full dataset is a lot bigger.
Hardware: 1 RTX 3090 GPU
Optimizer: 8bit Adam
Batch size: 32
- actual batch size: 2
- gradient_accumulation_steps: 16
- effective batch size: 32
Learning rate: warmed up to 2e-6 over 500 steps, then kept constant
Training steps: 6k
Epochs (approximate): 32 * 6k / 100k = 1.92 (not accounting for the filtering)
- Each example is seen 1.92 times on average.
Training time: approximately 1 day
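
The batch-size and epoch figures above can be checked with a few lines of arithmetic. The learning-rate function is a sketch only: the card says the rate is warmed up to 2e-6 over 500 steps and then held constant, but it does not state the shape of the warmup, so a linear ramp is assumed here. All constant and function names are hypothetical.

```python
PER_STEP_BATCH = 2        # "actual batch size"
GRAD_ACCUM_STEPS = 16     # gradient_accumulation_steps
TRAIN_STEPS = 6_000
DATASET_SIZE = 100_000    # image-caption pairs before filtering
PEAK_LR = 2e-6
WARMUP_STEPS = 500

# Gradients from 16 micro-batches of 2 are accumulated per optimizer step.
effective_batch = PER_STEP_BATCH * GRAD_ACCUM_STEPS

# Average number of times each example is seen, ignoring the crop filter.
epochs = effective_batch * TRAIN_STEPS / DATASET_SIZE


def lr_at(step):
    """Learning rate at an optimizer step (assumed linear warmup, then constant)."""
    return PEAK_LR * min(1.0, step / WARMUP_STEPS)
```

Because the 10% crop filter removes some examples, the real number of passes over the remaining data is somewhat higher than 1.92.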

Results

More information needed

Model Card Authors

Jonathan Chang

How to Get Started with the Model

Use the code below to get started with the model.

import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler, UNet2DConditionModel

def use_DPM_solver(pipe):
    # swap the default scheduler for DPM-Solver++, reusing the existing config
    pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
    return pipe

# load the base pipeline, replacing its UNet with the finetuned one
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    unet=UNet2DConditionModel.from_pretrained("ttj/flex-diffusion-2-1", subfolder="2-1/unet", torch_dtype=torch.float16),
    torch_dtype=torch.float16,
)
# for v2-base, use the following lines instead
#pipe = StableDiffusionPipeline.from_pretrained(
#    "stabilityai/stable-diffusion-2-base",
#    unet=UNet2DConditionModel.from_pretrained("ttj/flex-diffusion-2-1", subfolder="2-base/unet", torch_dtype=torch.float16),
#    torch_dtype=torch.float16,
#)
pipe = use_DPM_solver(pipe).to("cuda")

prompt = "a professional photograph of an astronaut riding a horse"
image = pipe(prompt).images[0]

image.save("astronaut_rides_horse.png")
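
To generate a non-square image, pass `width` and `height` to the pipeline call. The sketch below looks up one of the finetuned resolutions by aspect ratio, since the model was trained at only one resolution per ratio. The `FINETUNED` dictionary is copied from the finetuned resolutions table, and the helper name `generate` is hypothetical.

```python
# Aspect ratio -> (width, height), copied from the finetuned resolutions table.
FINETUNED = {
    "1:2": (512, 1024), "9:16": (576, 1024), "3:5": (576, 960),
    "5:8": (640, 1024), "2:3": (512, 768), "5:7": (640, 896),
    "3:4": (576, 768), "4:5": (512, 640), "5:6": (640, 768),
    "10:11": (640, 704), "1:1": (512, 512), "11:10": (704, 640),
    "6:5": (768, 640), "5:4": (640, 512), "4:3": (768, 576),
    "7:5": (896, 640), "3:2": (768, 512), "8:5": (1024, 640),
    "5:3": (960, 576), "16:9": (1024, 576), "2:1": (1024, 512),
}


def generate(pipe, prompt, aspect="16:9"):
    """Run the pipeline at the finetuned resolution for the given aspect ratio."""
    width, height = FINETUNED[aspect]
    return pipe(prompt, width=width, height=height).images[0]
```

For example, `generate(pipe, prompt, "9:16").save("astronaut_portrait.png")` would produce a 576x1024 image with the `pipe` and `prompt` defined above.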