Temporal Scene Generation w/ Stable Diffusion

Community Article · Published January 8, 2024

1. Introduction

This blog post presents in detail how we executed our Deep Learning final project, "Zero-Shot Temporal Scene Generation with Stable Diffusion". It walks through each step we took along the way, the result of many days and hours of trial and error. In this project we implemented two pipelines, a Stable Diffusion 1.5 pipeline and a Stable Diffusion XL pipeline, both of which output a set of images that may be used for movie/cartoon scene generation, social story generation, and similar applications. Based on these scenes, videos may be generated using Stable Video Diffusion and then merged.

Resources used for this project:

  1. Google Colab Pro+ account ($50)
  2. ChatGPT-4 subscription with 20 API credits ($20)
  3. Google Drive 100 GB storage ($3)


The following points outline the major parts of this project. We worked on two pipelines:

  1. In the first pipeline we worked with the base Stable Diffusion 1.5 model (runwayml/stable-diffusion-v1-5). We performed DreamBooth training on specific characters, combined it with textual inversion to learn a specific style, and integrated cross-attention control to create temporally consistent images.
  2. In the second pipeline we used the newly released (18 July '23) Stable Diffusion XL model, trained it on a specific character using the LoRA technique, and fused the resulting layers with pre-trained checkpoints from CivitAI.

The motivation/goal behind this project was to build a pipeline that generates a consistent set of images which may later be used to create storyboards, animated movie/cartoon scenes, stories for children, and so on.

In both pipelines we integrated the ChatGPT API to generate prompts and captions for the desired images. The generated prompts are automatically fed into the models, which output the images with the captions displayed below them.

Besides Stable Diffusion XL, we used Stable Diffusion 1.5 because, in addition to being lighter, it has official implementations for multiple papers such as DreamBooth, textual inversion, and ControlNet, and many pre-trained checkpoints are built on top of it.

Below are images of both the pipelines:

[Image: Stable Diffusion 1.5 pipeline]

[Image: SDXL pipeline]

In Chapter 2 we walk through how each of the techniques used in this project works, and from Chapter 3 onwards we show the implementation.

Also below are some of the images generated from our models. All our models can be found on Hugging Face, all the code (fine-tuning/image generation) is in this GitHub repository, and the generated images can be found on our CivitAI profile as well as in the notebooks.

[Images generated from our models]

2. How Everything Works

a. Diffusion Models

Diffusion models in deep learning represent a class of generative models that leverage the concept of diffusion processes to model complex data distributions. These models have gained prominence due to their ability to generate high-quality samples and perform tasks such as image synthesis and denoising. One notable diffusion model is the Variational Diffusion Autoencoder (VDAE), which combines variational autoencoder (VAE) principles with the diffusion process for improved generative modeling.

[Image: internal architecture of a diffusion model]

[Image: stages of a diffusion model transforming a dog image to noise and back]

Mathematically, the generative process of a diffusion model can be expressed as follows.

Let $x_t$ denote the data at time $t$, and $z$ represent latent variables. The diffusion process can be modeled as a sequence of transformations:

$$x_t = x_0 + \sqrt{2\beta(t)}\cdot\sigma(t)\cdot\epsilon_t, \quad t = 1,2,\dots,T$$

where $\epsilon_t$ is unit Gaussian noise, $\sigma(t)$ is a scale function, and $\beta(t)$ controls the diffusion rate. The initial data point is assumed to be sampled from a simple distribution such as a standard Gaussian.
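As a concrete illustration, the short sketch below implements this forward noising step exactly as written above, with an assumed linear $\beta(t)$ schedule and $\sigma(t) = 1$ chosen purely for illustration (neither value comes from the formulation itself):

import torch

def forward_noise(x0, t, T=1000):
    # x_t = x_0 + sqrt(2 * beta(t)) * sigma(t) * eps_t, following the equation above.
    beta_t = torch.linspace(1e-4, 0.02, T)[t - 1]  # assumed linear beta schedule
    sigma_t = 1.0                                  # assumed constant scale function
    eps_t = torch.randn_like(x0)                   # unit Gaussian noise
    return x0 + torch.sqrt(2 * beta_t) * sigma_t * eps_t

x0 = torch.randn(1, 3, 64, 64)   # a toy "image"
x_T = forward_noise(x0, t=1000)  # heavily noised sample at the final step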

The following latent variable model is employed to integrate the diffusion process with a variational autoencoder:

$$p(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\; x_{t-1},\; 2\beta(t)\sigma^2(t)I\right)$$

Here, $g(z)$ denotes the decoder network that maps latent variables $z$ to an initial data point.

The likelihood of observing the data $x_t$ at each step of the diffusion process is given by the Gaussian transition above.

The objective is to maximize the (log) marginal likelihood of the observed data:

$$\log p(x_0) + \sum_{t=1}^{T} \log p(x_t \mid x_{t-1})$$

This objective can be further augmented with a variational lower bound, incorporating the encoder network $q(z \mid x_0)$. The final objective for training the VDAE becomes:

$$L = \log p(x_0) + \sum_{t=1}^{T} \log p(x_t \mid x_{t-1}) - \mathrm{KL}\left[q(z \mid x_0)\,\|\,p(z)\right]$$

In summary, diffusion models in deep learning blend the power of variational autoencoders with the elegance of diffusion processes to capture intricate data distributions. The integration of latent variables enables the model to generate diverse and realistic samples while facilitating efficient training through the optimization of the variational lower bound.

b. Stable Diffusion

Diffusion models play a pivotal role in synthesizing high-quality samples from complex data distributions. However, traditional diffusion models may encounter issues related to stability during the generative process. The Stable Diffusion Model (SDM) is proposed as a solution, incorporating stability mechanisms to enhance convergence and robustness.

The core of the Stable Diffusion Model lies in the modification of the diffusion process to ensure stability. The generative process at each time step $t$ is defined as follows:

$$x_t = x_0 + \sqrt{2\beta(t)}\cdot\sigma(t)\cdot\epsilon_t + \text{StabilityTerm}, \quad t = 1,2,\dots,T$$

where $\epsilon_t$ is unit Gaussian noise, $\sigma(t)$ is a scale function, and $\beta(t)$ controls the diffusion rate. The initial data point is assumed to be sampled from a simple distribution such as a standard Gaussian, and the stability term ensures stability during the diffusion process. The specific formulation of the stability term depends on the chosen stability criterion.

The architecture of the Stable Diffusion Model consists of an encoder, a decoder, and the stability module. The encoder network maps observed data x0 to latent variables 𝑧, while the decoder reconstructs the data from latent variables.

The stability module incorporates additional parameters, dynamically adjusting the Stability Term during the diffusion process to maintain stability. The stability module introduces extra latent variables that contribute to stability during the generative process. These additional latent variables are optimized jointly with the primary latent variables, enhancing the model's ability to capture intricate dependencies in the data distribution.

The training objective of the Stable Diffusion Model involves maximizing the marginal likelihood of the observed data, accounting for the Stability Term and additional latent variables. The objective function is augmented with a variational lower bound, enabling efficient training through optimization.
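In practice, none of this machinery needs to be implemented by hand: the diffusers library wraps the full Stable Diffusion pipeline (VAE, CLIP text encoder, U-Net, scheduler). A minimal text-to-image sketch with the SD 1.5 checkpoint used later in this report looks roughly like this (the prompt and settings are illustrative):

import torch
from diffusers import StableDiffusionPipeline

# Load the base SD 1.5 checkpoint in half precision.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Generate one image from a text prompt.
image = pipe(
    "a cartoon panda relaxing in a bamboo forest",  # illustrative prompt
    num_inference_steps=50,
    guidance_scale=7.5,
).images[0]
image.save("panda.png")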

c. Stable Diffusion XL

Stable Diffusion XL (SDXL) is a significant advancement in image generation, building upon the success of its predecessor, Stable Diffusion. It employs a larger U-Net model with 2.6 billion parameters, allowing for finer details and more complex imagery. SDXL adopts a heterogeneous distribution of Transformer blocks, optimizing the learning process and fostering improved image fidelity.

SDXL excels in text-to-image generation by leveraging a larger text conditioning encoder, OpenCLIP ViT-bigG, and an extra text encoder, CLIP ViT-L, to accurately translate textual descriptions into visual elements. It also introduces novel conditioning techniques such as size-conditioning, crop-conditioning, and multi-aspect conditioning, which refine the image generation process.

SDXL also incorporates a specialized refiner model to handle high-resolution image generation. This model excels at capturing minute local details and enhancing overall image quality, ensuring that even the most high-resolution outputs remain sharp and visually appealing. These advancements contribute to SDXL's superior performance in image generation.
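For reference, the base + refiner combination described above can be exercised with diffusers roughly as follows. This is only a sketch: the 0.8 denoising split is a commonly used assumption on our part, not a value from the SDXL paper.

import torch
from diffusers import DiffusionPipeline

# The SDXL base model handles the first ~80% of the denoising steps.
base = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16"
).to("cuda")

# The refiner finishes the remaining steps, sharpening local detail.
refiner = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0",
    text_encoder_2=base.text_encoder_2,
    vae=base.vae,
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")

prompt = "a cinematic shot of a man playing guitar at sunset"  # illustrative prompt
latents = base(prompt=prompt, denoising_end=0.8, output_type="latent").images
image = refiner(prompt=prompt, denoising_start=0.8, image=latents).images[0]
image.save("sdxl_refined.png")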

Now, let's delve into the mathematical formulation of SDXL. The model can be expressed as follows:

$$H_l = H_l + \alpha\, \nabla H_l * \nabla H_l$$

where $\nabla H_l$ represents the gradient of the loss with respect to the output of the $l$-th layer, and $*$ denotes element-wise multiplication. The hyperparameter $\alpha$ controls the strength of the self-diffusion.

The cross-layer diffusion is modeled by allowing information to flow between neighboring layers:

$$H_l = H_l + \beta\left(\nabla H_{l-1} - 2\nabla H_l + \nabla H_{l+1}\right) * \nabla H_l$$

Here, β is a hyperparameter controlling the cross-layer diffusion strength.

The overall objective function for the SDXL model can be formulated as a combination of the original loss function and the self-diffusion and cross-layer diffusion terms:

$$L_{SDXL} = L_{\text{original}} + \gamma \sum_{l=1}^{L} \| \nabla H_l \|_2^2$$

where $\gamma$ is a regularization parameter, and the $\| \nabla H_l \|_2^2$ term penalizes the magnitude of the gradients, encouraging smoother information flow.

This mathematical formulation demonstrates how SDXL incorporates self-diffusion and cross-layer diffusion mechanisms to enhance gradient flow and improve the training dynamics of deep neural networks. Researchers have found that integrating such diffusion models can lead to more stable and efficient training processes, contributing to better generalization performance across various tasks.

d. DreamBooth

DreamBooth works by altering the weights of the model itself. There are two inputs we have to be concerned with: the first is the image(s) we want to train on, and the second is the sentence(s) containing a unique identifier.

DreamBooth trains the model to associate the unique identifier with the concept of the image being trained. The sentence is converted into a text embedding, where each word is represented by a vector that contains semantic information about the word.

Each word is represented by a vector, and the concept (image) becomes associated with the vector of the new identifier word, which initially carries no meaning.

We pass the text embedding, then apply noise to the sample image until it is sufficiently noisy, and then apply slightly less noise to create a second image. The model is then trained to take the noisier image and output the less noisy one. The output is compared to the less-noisy image using the loss function, after which a gradient update is performed (the model is penalized if the loss is high and rewarded if it is low).

This effectively creates a whole new model, as we change the internal weights of the model until it learns the concept.

This is the most common and effective approach; however, it is very storage inefficient, because every DreamBooth run produces a new full model, and training multiple concepts into a single model can confuse it.
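To make the objective concrete, below is a minimal sketch of a DreamBooth-style loss with prior preservation for an ε-prediction diffusion model. The tensors are random stand-ins for UNet noise predictions, and the prior-loss weight of 1.0 simply mirrors the default we mention later; the real training script we used appears in Chapter 3.

import torch
import torch.nn.functional as F

def dreambooth_loss(pred_instance, noise_instance, pred_class, noise_class, prior_loss_weight=1.0):
    # Reconstruction loss on the instance images ("a photo of phtmejhn panda").
    instance_loss = F.mse_loss(pred_instance, noise_instance)
    # Prior-preservation loss on the auto-generated class images ("a photo of panda").
    prior_loss = F.mse_loss(pred_class, noise_class)
    return instance_loss + prior_loss_weight * prior_loss

# Toy usage with random tensors standing in for noise predictions on 64x64 latents.
pred_i, eps_i = torch.randn(2, 4, 64, 64), torch.randn(2, 4, 64, 64)
pred_c, eps_c = torch.randn(2, 4, 64, 64), torch.randn(2, 4, 64, 64)
print(dreambooth_loss(pred_i, eps_i, pred_c, eps_c).item())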

[Image: diagram of the DreamBooth training process for personalized image generation, with super-resolution and class-specific loss components]

The DreamBooth model is based on a conditional Generative Adversarial Network (cGAN) architecture, which consists of a generator network G and a discriminator network D. The generator takes as input a text description t and a set of input images I, and produces an output image O. The discriminator network D takes as input pairs of images and text descriptions, and learns to distinguish between real images and text-conditioned generated images.

The generator network G is trained to maximize the probability that the generated image O is classified as real by the discriminator D, while the discriminator D is trained to maximize the probability of correctly classifying real images and minimizing the probability of incorrectly classifying generated images.

The objective function for the cGAN can be formulated as:

$$\min_G \max_D \; \mathbb{E}_{I,t}\left[\log D(I, t)\right] + \mathbb{E}_{t}\left[\log\left(1 - D(G(t, I), t)\right)\right]$$

where $I$ represents the input images, $t$ the text description, $G(t, I)$ the generated image, and $\mathbb{E}$ the expectation.

  • The $\mathbb{E}_{I,t} [\log D(I, t)]$ term maximizes the discriminator's ability to correctly identify real images as real, given the conditional input $t$.
  • The $\mathbb{E}_{t} [\log(1 - D(G(t, I), t))]$ term maximizes the generator's ability to create images that fool the discriminator, given the conditional input $I$.

The cGAN architecture allows the DreamBooth model to generate images that are conditioned on specific text descriptions and input images, resulting in highly customizable and personalized image generation.

e. Textual Inversion

It's similar to DreamBooth; the difference is that the gradient update doesn't happen in the model but in the embedding vector. Slowly the vector gets closer to the desired visual phenomenon, and eventually we're left with a vector that can near-perfectly describe the desired image. The benefit of textual inversion is that the output is not a new model, as with DreamBooth, but a tiny additional embedding. Hence it is much more storage efficient than DreamBooth, and anyone can easily download the embedding, plug it into their model, and get the desired image output.

So just by finding the right vector, we can generate any arbitrary visual phenomenon. DreamBooth, however, is more effective than textual inversion, and DreamBooth is a viable option if you have at least 12 GB of VRAM available.

Model Architecture

The proposed textual inversion model architecture consists of several key components.

Let $x$ represent the input text sequence and $Y$ represent the corresponding image output. The core components of the architecture include:

  • Tokenizer: The tokenizer processes the input text sequence into tokens for further processing.

    $T(x) = \{t_1, t_2, \dots, t_n\}$ where $x$ is the input text, $T$ is the tokenizer, and $t_i$ are the tokens.

  • Embedding Lookup: The embedding lookup operation maps the tokenized input to continuous vector representations.

    $v_i = E(t_i)$ where $E$ is the embedding matrix and $v_i$ is the embedding of token $t_i$.

  • Text Transformer: The text transformer module transforms the vector sequence into an intermediate representation.

    $h = T_x(v_1, v_2, \dots, v_n)$ where $T_x$ is the text transformer and $h$ is the final text representation.

  • Concept Tokens: Concept tokens $C$ are embedded into the model to enable personalized image generation based on user-defined concepts.

  • Latent Space: The latent space $Z$ captures the underlying features of the input text and concept tokens. $z^* = \arg\min_z L(G(z), x)$ where $G$ is the image generator, $z$ is the latent code, and $L$ is the loss function.

  • Image Generation: $I = G(z^*)$ where $I$ is the generated image.
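A minimal sketch of the core trick using the transformers components of SD 1.5: a new placeholder token is added to the tokenizer, and only its embedding row is left trainable while everything else is frozen. This is a simplified illustration of the technique, not the training script we actually ran.

import torch
from transformers import CLIPTextModel, CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="text_encoder")

# Register the new placeholder token and give it an embedding row.
placeholder = "<style-spdmn>"
tokenizer.add_tokens(placeholder)
text_encoder.resize_token_embeddings(len(tokenizer))
token_id = tokenizer.convert_tokens_to_ids(placeholder)

# Freeze the whole text encoder; only the new embedding row should receive updates.
text_encoder.requires_grad_(False)
embeddings = text_encoder.get_input_embeddings()
embeddings.weight.requires_grad_(True)

optimizer = torch.optim.AdamW([embeddings.weight], lr=5e-4)
# During training, the diffusion loss is backpropagated into embeddings.weight,
# and gradients for all rows except `token_id` are zeroed out before each optimizer step.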

[Image: architecture of textual inversion]

f. LoRA

LoRA stands for Low-Rank Adaptation. In DreamBooth we teach a new concept to the model (we train and update the model until it learns the new concept) until we're left with a new model. With LoRA, we can teach the model a new concept without having to create a new copy of the model. LoRA was initially used for LLMs (large language models), which are known to have billions of parameters; creating a new model every time we want to teach an LLM a new sub-task is simply not feasible. So LoRA inserts new layers into the model. The image below shows a diagram of a neural network with LoRA layers (the upper diagram, with the orange layers being LoRA layers) and one without LoRA layers (the lower diagram).

By default, the LoRA layers are initialized so that they don't impact the model at all. Then, as training proceeds, the loss updates the intermediate LoRA layers so that they output different values as information flows through them. Once the intermediate layers have been updated enough, they are eventually able to capture the new concept. So LoRA is similar to DreamBooth; however, instead of updating the existing weights as in DreamBooth, we add new weights and update those to achieve exactly the same effect.

LoRA training is much faster than DreamBooth and takes much less memory. Also, as the LoRA layers are very small, we can pass them around and add them to different models and share them much more easily than the full model.

Model Architecture

The LoRA (Low-Rank Adaptation) model is a technique used to adapt a pre-trained language model to a specific task or domain. It achieves this by introducing a low-rank matrix into the model's attention mechanism, which allows the model to learn task-specific representations without significantly changing the overall architecture.

Given a pre-trained language model with an attention mechanism, we introduce a low-rank matrix $L \in \mathbb{R}^{d \times r}$ and a set of task-specific vectors $v_1, v_2, \dots, v_n$, where $d$ is the dimension of the attention key and value vectors, $r$ is the rank of the low-rank matrix, and $n$ is the number of task-specific vectors.

We modify the attention mechanism as follows:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d}}\right)V + LVv^T$$

where $Q$, $K$, and $V$ are the query, key, and value vectors, respectively, and $v$ is a task-specific vector.

The low-rank matrix L and the task-specific vectors $v_1,v_2...v_n$ are learned during fine-tuning on the specific task or domain.
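The sketch below shows the usual way such a low-rank update is realized in practice: a frozen linear layer plus a trainable down-projection/up-projection pair whose product starts at zero. It is a generic illustration of the technique, not the exact parameterization used by any particular library.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer augmented with a trainable low-rank update B @ A."""

    def __init__(self, base, rank=4, alpha=1.0):
        super().__init__()
        self.base = base
        self.base.requires_grad_(False)                       # original weights stay frozen
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init => no effect at start
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_A.T @ self.lora_B.T)

# Wrap, e.g., an attention projection of width 768 and fine-tune only the LoRA parameters.
layer = LoRALinear(nn.Linear(768, 768), rank=8)
out = layer(torch.randn(2, 77, 768))
trainable = [p for p in layer.parameters() if p.requires_grad]  # just lora_A and lora_B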

[Image: comparison of DreamBooth, LoRA, Textual Inversion and HyperNetworks]

Since we didn’t attempt HyperNetworks in our project, we’re skipping it in this blog.

3. SD 1.5 Pipeline

Attempting DreamBooth

This was perhaps the most challenging part of our project. To train the character for our pipeline, we opted for DreamBooth, as it is known to be the best-performing method for fine-tuning Stable Diffusion. To implement DreamBooth we jumped back and forth between several notebooks and techniques. We worked with several implementations, including the Hugging Face implementation (github.com/huggingface/diffusers), TheLastBen, ShivamShrirao, Kohya ss, and the fastDreamBooth notebooks, to name a few. For us, TheLastBen notebook proved to be the most straightforward implementation; however, the Hugging Face implementation provided more customization while remaining simple (no need to generate image captions, organize instances, etc.). We started with a learning rate of 1e-5, which resulted in a very overfitted model that only output the images that were fed into it; this rate proved to be far too high for DreamBooth. Upon browsing the Stable Diffusion subreddit, r/StableDiffusion, we learned that Stable Diffusion is highly prone to overfitting and that learning rates on the order of 1e-7 are common for DreamBooth. Also, a high number of class images leads to the model underfitting severely. We started off training on 14 images of the Po character from Kung Fu Panda, using the token phtmejhn for identification, and played around with a lot of parameters. We downloaded the images from the internet (and some from the movies), converted them all into a single format (jpg/png), and resized them using birme.net. Although the Hugging Face implementation provides the option for bucketing, we chose to resize our images anyway.


These are the main pointers we learned in this project:

  1. A learning rate of <=1e-6 gave us good models. The best Kung Fu Panda model we got was trained with a learning rate of 2e-6.
  2. We used 10 class images per instance image. We opted for automatic class-image generation in DreamBooth (it would have been very difficult to find 10*14 = 140 images of pandas), which also let us play with the numbers, as the Hugging Face implementation can generate class images for us. We also learned that this approach may work for real faces but is not the best approach for animated characters: the class images generated during DreamBooth training were of real pandas, which resulted in very strong regularization.
  3. Maximum steps = 100 worked fine for us. We went as high as 3000+, but that resulted in a highly overfitted model.
  4. We learned from our results (as well as from r/StableDiffusion) that prior preservation is very useful in avoiding overfitting (we left the prior loss weight at the default of 1), and that training the text encoder also gives better results (we used the default of 350 text-encoder training steps).
  5. We also learned that animated images (for Stable Diffusion 1.5 at least) require >=75 inference steps for complete generation; with fewer steps we usually observed polygon meshes forming on the face.
  6. Also, for animated characters a lower guidance scale (4-6) gave better inference results, while for real faces 7-9 worked well.
  7. We did use 8-bit Adam, as it resulted in faster training and we didn't observe a significant drop in quality (we were also cautious of our GPU units).
  8. Negative prompts play a crucial role and have a significant impact on getting the desired results. For prompts, we found this guide really helpful: https://www.reddit.com/r/StableDiffusion/comments/17qe7v1/sdxl_prompt_best_practices_to_guide_chatgpt/

For this project, we made more than 20 models, most of which (both good and bad) we uploaded to Hugging Face.

[Images: underfitted model results]

However, after several iterations we came up with the results below. Training this model took 2 hours on a V100 GPU (including class-image generation). The model is uploaded as Bilal326/PO_KungfuPanda_DB.

Results

These were the results of our first successful DreamBooth

[Images: results of our first successful DreamBooth run]

Training Script (Hugging face implementation)

After facing a lot of bugs and conflicts in the documentation, we found the following code to work:

!git clone https://github.com/huggingface/diffusers
%cd diffusers
!pip install -e .

%cd /content/diffusers/examples
!pip install -r /content/diffusers/examples/dreambooth/requirements.txt
!accelerate config default
!pip install bitsandbytes
MODEL_NAME="frankjoshua/toonyou_beta6"
INSTANCE_DIR="/content/drive/MyDrive/DL_Project/Flynn_19"

CLASS_DIR="/content/drive/MyDrive/DL_Project/Flynn_DB/T3/class_images" 
OUTPUT_DIR="/content/drive/MyDrive/DL_Project/Flynn_DB/T3"

!accelerate launch /content/diffusers/examples/dreambooth/train_dreambooth.py \
--pretrained_model_name_or_path=$MODEL_NAME  \
--train_text_encoder \
--instance_data_dir=$INSTANCE_DIR \
--class_data_dir=$CLASS_DIR \
--output_dir=$OUTPUT_DIR \
--with_prior_preservation --prior_loss_weight=1.0 \
--instance_prompt="a photo of fljsdhn man" \
--class_prompt="a photo of man" \
--resolution=512 \
--train_batch_size=1 \
--use_8bit_adam \
--gradient_checkpointing \
--learning_rate=5e-6 \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--num_class_images=3600 \
--checkpointing_steps=600 \
--max_train_steps=2400

DreamBooth outputs a model of 3-12 GB, which is why we found it necessary to buy 100 GB of Google Drive storage. We saved checkpoints during training so we could pick the training step that gives the best results. For inference, we found that installing autotrain-advanced installs most of the required libraries (such as xformers, diffusers, accelerate, etc.), so we could move straight to inference after it. For inference of a DreamBooth model we used the following code:

from diffusers import DiffusionPipeline, UNet2DConditionModel
from transformers import CLIPTextModel
import torch

# Load the fine-tuned UNet and text encoder from a training checkpoint.
unet = UNet2DConditionModel.from_pretrained("/content/drive/MyDrive/DL_Project/flyn_model_hf_5/checkpoint-5000/unet")
text_encoder = CLIPTextModel.from_pretrained("/content/drive/MyDrive/DL_Project/flyn_model_hf_5/checkpoint-5000/text_encoder")

# Plug them into the base SD 1.5 pipeline.
pipeline5000 = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", unet=unet, text_encoder=text_encoder, torch_dtype=torch.float16,
).to("cuda")

But when we didn’t have checkpoints, we could just load the folder directly from the drive like this:

pipelinef = DiffusionPipeline.from_pretrained(
    "/content/drive/MyDrive/DL_Project/flyn_model_hf_3", torch_dtype=torch.float16
).to("cuda")

And for generating images we used the following code (here `pipeline` is whichever DreamBooth pipeline was loaded above, e.g. pipeline5000 or pipelinef, and the negative prompt is just an example):

prompt = "a phtmejhn panda"
negative_prompt = "blurry, deformed, low quality"  # example negative prompt; we tuned this per run

generator = torch.Generator("cuda").manual_seed(1)
output = pipeline(prompt=prompt,
                  negative_prompt=negative_prompt,
                  guidance_scale=5, generator=generator, num_inference_steps=100)
image = output.images[0]
image

In the beginning, we generated 10 images per prompt for various seed values to better gauge the performance of the model as well as the quality of the prompt. Some of the notebooks, such as fastDreambooth, output the model in a .ckpt format which needs to be converted into the diffusers format before inference. For that, we utilized the convert_original_stable_diffusion_to_diffusers script from the huggingface/diffusers repository. The code is below:

!pip install autotrain-advanced
!wget https://raw.githubusercontent.com/huggingface/diffusers/main/scripts/convert_original_stable_diffusion_to_diffusers.py
!pip install omegaconf
!accelerate config default

!python convert_original_stable_diffusion_to_diffusers.py \
    --checkpoint_path "/content/gdrive/MyDrive/Fast-Dreambooth/Sessions/phtmejhn/phtmejhn.ckpt" \
    --dump_path "/content/gdrive/MyDrive/Fast-Dreambooth/Sessions/phtmejhn_conv2"

As our drive storage was limited, we pushed our models to huggingface using the code below. (This is much faster than uploading manually)

!huggingface-cli login

from huggingface_hub import HfApi
api = HfApi()
api.upload_folder(
    folder_path="/content/drive/MyDrive/DL_Project/Abishek_Flynn",
    repo_id="Bilal326/Flynn_LoRA",
    repo_type="model",
)

Some of the notebooks, such as ShivamShrirao's, output the model in the newer safetensors format. For those we found the .from_single_file() method to work for inference, but for further fine-tuning these too would need to be converted to the diffusers format.
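For reference, loading such a single-file checkpoint for inference looks roughly like this (the file path is illustrative):

import torch
from diffusers import StableDiffusionPipeline

# Load a single .safetensors/.ckpt checkpoint directly, without converting to the diffusers layout.
pipe = StableDiffusionPipeline.from_single_file(
    "/content/drive/MyDrive/DL_Project/phtmejhn.safetensors",  # illustrative path
    torch_dtype=torch.float16,
).to("cuda")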

DreamBooth on Flynn - Rapunzel

After learning that for animated characters we would either need to find regularization class images (in that animation style) or skip them, we found a workaround: selecting a pre-trained DreamBooth checkpoint from CivitAI and training on top of it with DreamBooth. This time we trained on 18 images of the Flynn character from the Rapunzel movie, using the token fljsdhn for identification.

The training images of Flynn are below:

[Images: training images of Flynn]

We chose two checkpoints from CivitAI: (1) the Toonyou checkpoint and (2) the Disney Pixar merged checkpoint.


We downloaded models from CivitAI using its API in Google Colab:

!wget https://civitai.com/api/download/models/{model_id} --content-disposition

The downloaded models were converted to the diffusers format, and DreamBooth was applied on top of them.

For the Toonyou model we couldn't fine-tune a good model, but we have uploaded it anyway (Bilal326/Flynn_fast_1). We did, however, get a very good model by training on top of the Disney Pixar checkpoint.

[Image: class images generated using the Toonyou checkpoint]

The following hyperparameters worked best for us:

  1. Max training steps = 1530
  2. Learning Rate = 2e-6
  3. Class Images = 200
  4. Text encoder training steps = 350

As evident from the results, training an animated character on top of a pre-trained/merged checkpoint gives better results.

[Result images]

Textual Inversion

For styling, we opted for textual inversion, as it is the fastest and lightest of all the Stable Diffusion fine-tuning methods. Here, as described in Section 2, specific token(s) are used to move the generated image(s) in a specific direction in the latent space. We utilized the official textual inversion implementation from the paper's GitHub page. For training, we took 30 screenshots from the movie Spider-Man: Across the Spider-Verse (2023). We took screenshots from various scenes, cropping out any characters. We trained on top of the raw Stable Diffusion 1.5 model (although, as mentioned earlier in this report, training on top of a pre-trained checkpoint would likely give better results). This embedding was pushed to the Hugging Face concepts library and can be found at sd-concepts-library/style-spdmn. For now we trained only a single token, <style-spdmn>, without any trigger words. The training details are as follows:

  1. Token = <style-spdmn>
  2. Learning rate = 5e-4
  3. Max train steps = 2000
  4. Train batch size = 4


We saved embedding checkpoints every 250 steps but found the one at 2000 steps to be the best of all. Below are images generated with this embedding; we generated 25 images for each prompt:

[Images. Prompt: A man walking in <style-spdmn>]

[Images. Prompt: A concert on mars in <style-spdmn>]

Integrating Textual Inversion and DreamBooth

We experimented with various techniques and methods for customizing a model to our liking. We found that combining the DreamBooth model with textual inversion made the model more customizable. We could also have fused LoRA layers with the model (for a particular theme/style); however, due to time constraints we couldn't try this. We loaded the textual inversion embedding onto the DreamBooth model trained on the Disney Pixar checkpoint using the code below:

pipeline.load_textual_inversion("sd-concepts-library/style-spdmn")

These are some of the results after loading the textual inversion embedding on top of the DreamBooth model:

[Result images]

Cross Attention Control

From the previous section, it is evident that the results are not consistent: the character, theme, and scene all change when only one or two words in the prompt are changed. To tackle this we utilized Cross Attention Control (paper released 2 Aug 2022, GitHub repo: https://github.com/bloc97/CrossAttentionControl). In Stable Diffusion, it alters the cross-attention maps between the prompt tokens and the image inside the model during the inference stage. This is done without requiring a mask from the user, avoiding the challenge of creating an appropriate mask shape. Better yet, it does not require additional training of the model. It works by scaling the importance of specific tokens in the prompt, allowing for targeted edits. This can be used for prompt-to-prompt editing, where an image is mapped back to its latent space and modified according to a new prompt.

We utilized the official implementation from the paper’s official GitHub repo with slight modifications.

  1. A .py script responsible for modifying the attention mechanism in the model.
  2. A .py script defining functions for sequence alignment.

Next, the AttentionReplace, AttentionRefine, and AttentionReweight functions were written, which use the above scripts to control the attention layers in the network. The notebook for this code is DB+TI+CAE2.ipynb.
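To give a flavor of the re-weighting idea, the toy function below scales the cross-attention columns of chosen prompt tokens and re-normalizes the maps. This is only a conceptual sketch of token re-weighting, not code from the repository above:

import torch

def reweight_attention(attn, token_indices, weight):
    # attn: attention probabilities of shape (heads, image_pixels, prompt_tokens).
    # token_indices: positions of prompt tokens to emphasize (weight > 1) or suppress (weight < 1).
    attn = attn.clone()
    attn[..., token_indices] *= weight
    # Re-normalize so each pixel's attention over prompt tokens still sums to 1.
    return attn / attn.sum(dim=-1, keepdim=True)

# Toy usage: boost attention on token 5 of a 77-token prompt.
maps = torch.softmax(torch.randn(8, 4096, 77), dim=-1)
edited = reweight_attention(maps, token_indices=[5], weight=1.4)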

OpenAI API

After combining the fine-tuned components (DreamBooth + textual inversion + cross-attention control), we obtained a model capable of creating a storyboard with multiple consistent images, each depicting a different part of a story. However, we wanted to generate these images from a single line, e.g. "Give me a story of Flynn relaxing on a sunny day", since good prompts can be 50+ words long and our characters and styles were trained on unique tokens (phtmejhn, fljsdhn, <style-spdmn>) which need to be placed inside the prompts. Also, to get desired features in an image, more weight needs to be given to certain words, e.g. (smile:1.3), (teeth, mouth open:1.4), and trigger words also need to be carefully placed inside the prompt if they were used in training. So we integrated our DreamBooth + textual inversion + cross-attention control model with the ChatGPT 3.5 API to generate prompts and captions for 5 images from the single sentence "Give me a story of Flynn relaxing on a sunny day". In the arguments, Flynn was asked to be replaced with fljsdhn. The prompts and captions for the scenes generated by the ChatGPT API are listed below; the code for the API integration can be found in the notebook chatgpt_api.ipynb, and a simplified sketch of such a call follows. Although the results are already much more consistent than what a raw model or the DreamBooth model alone would produce, these images may be edited further using prompt-to-prompt and improved by further tuning of models and prompts.
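The sketch below uses the current openai Python client; the system prompt, model name, and settings are illustrative rather than our exact notebook code:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

system = (
    "You write Stable Diffusion prompts and short captions for a 5-image storyboard. "
    "Always refer to the character Flynn with the token 'fljsdhn'. "
    "Return 5 numbered prompts followed by 5 numbered captions."
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": system},
        {"role": "user", "content": "Give me a story of Flynn relaxing on a sunny day"},
    ],
)
print(response.choices[0].message.content)  # prompts + captions to feed into the pipeline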

[Generated storyboard images]

Prompts:

  1. Fljsdhn sitting on a park bench in the early morning, holding a sandwich with a soft smile, gentle sunlight filtering through the trees in the background.
  2. Fljsdhn sitting on the grass, looking slightly puzzled and concerned as he examines a map and compass, with a picnic basket beside him and a vintage camera lying on the grass.
  3. Fljsdhn hiding behind a large tree in the forest, peeking out with a playful and mischievous expression, suggesting he's about to pull off a surprise.
  4. Fljsdhn is seated in a clearing with his guitar, deeply immersed in his music, a look of passion on his face as the sun sets behind him, casting warm hues over the scene.
  5. Under a vibrant twilight sky, Fljsdhn looks upwards with an inspired and joyful expression, the colors of the sunset reflecting in his eyes, capturing a moment of wonder.

Captions:

  1. Flynn enjoys a moment of solitude in the park, relishing a sandwich on a sunny day.
  2. Caught in mid-thought, Flynn ponders over a map, planning his next adventure with a hint of concern.
  3. Stealthily, Flynn peers out from behind a tree, his playful smirk suggesting a surprise in store.
  4. With a gentle strum, Flynn loses himself in the music, his guitar echoing through the quiet woods.
  5. As the sky blushes with dusk, Flynn’s face lights up with joy, the day’s adventures turning into fond memories.

These were fairly simple prompts, but they can be made more advanced when requested. Also, integrating ControlNet and reinforcement learning in conjunction with prompt-to-prompt cross-attention control may yield better results.

4. SDXL Pipeline

For this project, we also attempted fine-tuning our characters on the Stable Diffusion XL model. However, at the time of writing this report, Hugging Face had not released a DreamBooth implementation for SDXL. There are, however, unofficial implementations through the Linaqruf and bmaltais kohya_ss notebooks. We did attempt DreamBooth on SDXL using bmaltais/kohya_ss and burned 200 Colab units on it, but were not able to get good results. We then opted for LoRA (Low-Rank Adaptation of Large Language Models), which is the next best option after DreamBooth. For this, we went with the simplest implementation, using the AutoTrain library by Hugging Face. We trained two characters: first an animated character, Flynn from Rapunzel, and second a real-life character, Natalie Dormer from Game of Thrones. Initially, we opted for training on top of checkpoints, but we ran into script issues after training on the converted models (the safetensors files were converted into the diffusers format for this). Also, we couldn't find those same checkpoints on Hugging Face in a form that didn't give that error. So, because of this and time constraints, we were forced to train on the raw SDXL model.

LoRA - Flynn

We trained the Flynn character using the same 18 images that were used for the SD 1.5 model. Training SDXL is very computationally expensive, so we did an extensive search on r/StableDiffusion and CivitAI for parameters proven to give good results. We used the following parameters:

  1. Learning Rate = 1e-4
  2. Num-steps = 1800 (100 for each image)
  3. Gradient-accumulation =4
  4. Resolution 512 (using 1080p resolution increased the training time roughly 3x)
  5. We didn't use 8-bit Adam, as we had units to spare and wanted the best possible results. The full autotrain command we used is below:
!autotrain dreambooth \
--model stabilityai/stable-diffusion-xl-base-1.0 \
--image-path /content/drive/MyDrive/DL_Project/Flynn_19/ \
--prompt "photo of fljsdhn man" \
--project-name "Abishek_ND_3" \
--resolution 512 \
--batch-size 1 \
--num-steps 1800 \
--fp16 \
--gradient-accumulation 4 \
--lr 1e-4


Training on even 18 512x512 images took 2 hours on an A100 GPU and consumed ~70 GPU credits on Colab. The results are shown below, and they are much better than the DreamBooth results from Stable Diffusion 1.5. Since LoRA layers can be fused with other checkpoints, we took a popular checkpoint, Unstable Diffusers v11, which, according to its CivitAI page, is known to give good images of both animated and real-world characters. The safetensors file was downloaded from CivitAI, converted into the diffusers format, and pushed to Hugging Face. The LoRA layers trained from the base SDXL 1.0 model were fused with the Unstable Diffusers v11 checkpoint using the code below. We tried fusing at various scales; for Flynn, a LoRA scale of 0.7 worked very well.

from diffusers import DiffusionPipeline
import torch

pipeline_unst = DiffusionPipeline.from_pretrained(
    "Bilal326/SDXLUnstableDiff_v11", torch_dtype=torch.float16
).to("cuda")
pipeline_unst.load_lora_weights("/content/drive/MyDrive/DL_Project/Abishek_Flynn", weight_name="pytorch_lora_weights.safetensors")
pipeline_unst.fuse_lora(lora_scale=0.8)
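After fusing, image generation proceeds exactly as with any other SDXL pipeline. A hypothetical example call, continuing from the snippet above (the prompt and settings are illustrative):

generator = torch.Generator("cuda").manual_seed(42)
image = pipeline_unst(
    prompt="photo of fljsdhn man hiking in a forest, cinematic lighting",  # illustrative prompt
    negative_prompt="blurry, deformed, low quality",                       # illustrative negative prompt
    guidance_scale=7.0,
    num_inference_steps=40,
    generator=generator,
).images[0]
image.save("flynn_lora.png")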

The results are below:

[Result images]

Natalie - LoRA training

Similarly, to train a real-world character we used 30 images of actress Natalie Dormer from Game of Thrones, with the same parameters mentioned in the previous section. Training on 30 images took 3.5 hours and 150 GPU units on Colab (100 512x512 images took 8 hours, and 100 1920x1080 images took 24 hours).


Results from various prompts are shown below. Initially we fused using a scale of 0.7, but increasing it to 0.9 made the outputs resemble the character much more. We experimented with fusing the layers with the base SDXL model as well as with:

  1. Juggernaut XL v7 (trained checkpoint of SDXL base 1.0)
  2. Photon v1 (merged checkpoint of SDXL base 1.0)
  3. DreamShaper XL (trained checkpoint of SDXL Turbo)
  4. ICBINP (LCM of SD 1.5)

DreamShaper's results had somewhat high contrast, but the Juggernaut, Photon, and ICBINP results were much better than those of the base SDXL 1.0 model.

We have uploaded most of our generated images to our CivitAI profiles as well. The results are below:

[Result images]

5. Future Work

We did achieve the goal of our project, which was to fine-tune the Stable Diffusion model using all of the mainstream methods, combine them, and integrate the result with cross-attention control and the ChatGPT API. We were also able to fine-tune the SDXL model, integrate it with the API, and generate results from it. However, there are a few things that can be tried moving forward:

  1. We believe our fine-tuned model which was integrated with cross-attention control can be further fine-tuned to give better results.
  2. Due to time constraints, we couldn't compare our results with MidJourney; since it isn't a fine-tuned model, its prompting results for images of Natalie Dormer would likely not beat ours. We did, however, compare our results with those of Dall-E using the same prompts, with Flynn in place of the token words; visually its results are better than those of our SD 1.5-based model (but SD 1.5 is more than a year old, and Dall-E 3 was released in Oct '23). Due to time constraints we couldn't perform more experiments. Our tuned SDXL model, however, gives better results, and we believe further fine-tuning SD 1.5 could close the gap.


  3. At the time of writing this report, there isn't an official implementation of textual inversion for SDXL. There is, however, an unofficial implementation on GitHub (https://github.com/oss-roettger/XL-Textual-Inversion) which can be attempted, since embeddings can already be loaded into SDXL.
  4. We have yet to evaluate the images generated by our models with metrics such as CLIP score and RMSE.
  5. We integrated the ChatGPT 3.5 API with our models. The ChatGPT 4 API could be used instead, giving better prompts as well as the ability to upload images to better guide the model toward the type of image we want.
  6. We have yet to experiment with merging models. From the r/StableDiffusion subreddit we learned that merging trained DreamBooth models is a very common practice and is shown to give good results; the DisneyPixar checkpoint we used, for example, is itself a merged checkpoint.
  7. Due to time constraints, we couldn't experiment with the newly released Stable Video Diffusion model (released 23 November 2023). Our generated images and scenes could be converted into videos and then edited to depict a scene or even a music video.
  8. Currently, SDXL also doesn't have a cross-attention control implementation, which could be tried in the future.
  9. Our pipeline could be turned into a product (like ChatGPT) with which a user interacts to get a realistic animated storyboard scene. The user would select a character for the story, select a theme/style, and then enter a short prompt such as "Give me a story of Flynn relaxing on a sunny day"; the integrated ChatGPT API would generate prompts and feed them into the model to produce the desired number of temporally consistent images. The user could edit the images using prompts, or upload images to change the poses/emotions/clothing of characters. That could be done by integrating ControlNet with cross-attention control, which is another direction to attempt in the future.


6. Conclusion

In conclusion, our project presents a comprehensive exploration of utilizing Stable Diffusion models and Textual Inversion to produce consistent images for various creative applications, including storyboards, animated movie scenes, and children's stories. The project leverages advanced techniques such as DreamBooth and specific character training, offering detailed insights into training scripts, inference codes, and the resultant outcomes. The project exemplifies the potential of integrating sophisticated AI models and methodologies to create visually compelling and coherent images for diverse storytelling purposes, underscoring the significance of AI advancements in visual content generation.