Kandinsky 2.2

This script is experimental, and it’s easy to overfit and run into issues like catastrophic forgetting. Try exploring different hyperparameters to get the best results on your dataset.

Kandinsky 2.2 is a multilingual text-to-image model capable of producing photorealistic images. The model includes an image prior model for creating image embeddings from text prompts, and a decoder model that generates images based on the prior model’s embeddings. That’s why you’ll find two separate scripts in Diffusers for Kandinsky 2.2, one for training the prior model and one for training the decoder model. You can train either model on its own, but to get the best results, you should train both the prior and decoder models.

Depending on your GPU, you may need to enable gradient_checkpointing (⚠️ not supported for the prior model!), mixed_precision, and gradient_accumulation_steps to help fit the model into memory and to speed up training. You can reduce your memory usage even more by enabling memory-efficient attention with xFormers (version v0.0.16 fails for training on some GPUs, so you may need to install a development version instead).
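To make the effect of gradient_accumulation_steps concrete, here is a minimal arithmetic sketch. The helper name and the example numbers are hypothetical and are not part of the training scripts; the sketch only shows how many samples contribute to each optimizer update when gradients are accumulated over several small batches.

```python
def effective_batch_size(per_device_batch: int, accumulation_steps: int, num_devices: int = 1) -> int:
    """Number of samples whose gradients feed into one optimizer update."""
    if min(per_device_batch, accumulation_steps, num_devices) < 1:
        raise ValueError("all arguments must be positive integers")
    return per_device_batch * accumulation_steps * num_devices


# A batch of 1 per GPU, accumulated over 4 steps on 2 GPUs, updates the
# weights with gradients from 8 samples while only ever holding 1 sample
# per GPU in memory at a time.
print(effective_batch_size(per_device_batch=1, accumulation_steps=4, num_devices=2))
```

Accumulation trades wall-clock time for memory: each update takes more forward and backward passes, but the peak memory per pass stays at the small per-device batch size.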

This guide explores the train_text_to_image_prior.py and train_text_to_image_decoder.py scripts to help you become more familiar with them and with how you can adapt them for your own use case.

Before running the scripts, make sure you install the library from source:

git clone https://github.com/huggingface/diffusers
cd diffusers
pip install .

Then navigate to the example folder containing the training script and install the required dependencies for the script you’re using:

cd examples/kandinsky2_2/text_to_image
pip install -r requirements.txt

🤗 Accelerate is a library for helping you train on multiple GPUs/TPUs or with mixed-precision. It’ll automatically configure your training setup based on your hardware and environment. Take a look at the 🤗 Accelerate Quick tour to learn more.

Initialize an 🤗 Accelerate environment:

accelerate config

To set up a default 🤗 Accelerate environment without choosing any configurations:

accelerate config default

Or if your environment doesn’t support an interactive shell, like a notebook, you can use:

from accelerate.utils import write_basic_config

write_basic_config()

Lastly, if you want to train a model on your own dataset, take a look at the Create a dataset for training guide to learn how to create a dataset that works with the training script.

The following sections highlight the parts of the training scripts that are important for understanding how to modify them, but they don’t cover every aspect of the scripts in detail. If you’re interested in learning more, feel free to read through the scripts and let us know if you have any questions or concerns.

Script parameters

The training scripts provide many parameters to help you customize your training run. All of the parameters and their descriptions are found in the parse_args() function. Each parameter has a default value, such as the training batch size and learning rate, but you can also set your own values in the training command if you’d like.
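The sketch below illustrates the general pattern of defaults that can be overridden on the command line. It is a hypothetical, trimmed-down stand-in for parse_args(), not a copy of it: the three options and their default values are chosen for illustration only, so check the real function for the actual parameter list and defaults.

```python
import argparse


def parse_args(argv=None):
    # Hypothetical subset of options; defaults here are illustrative only.
    parser = argparse.ArgumentParser(description="Illustrative subset of training options")
    parser.add_argument("--train_batch_size", type=int, default=1)
    parser.add_argument("--learning_rate", type=float, default=1e-4)
    parser.add_argument("--mixed_precision", choices=["no", "fp16", "bf16"], default=None)
    return parser.parse_args(argv)


# No flags given: every option falls back to its default.
defaults = parse_args([])
# Flags given: only the named options change, the rest keep their defaults.
custom = parse_args(["--mixed_precision=fp16", "--learning_rate", "5e-5"])

print(defaults.train_batch_size, defaults.learning_rate, defaults.mixed_precision)
print(custom.train_batch_size, custom.learning_rate, custom.mixed_precision)
```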

For example, to speed up training with mixed precision using the fp16 format, add the --mixed_precision parameter to the training command:

accelerate launch train_text_to_image_prior.py \
  --mixed_precision="fp16"

Most of the parameters are identical to the parameters in the Text-to-image training guide, so let’s get straight to a walkthrough of the Kandinsky training scripts!

Min-SNR weighting

The Min-SNR weighting strategy can help with training by rebalancing the loss to achieve faster convergence. The training script supports predicting epsilon (noise) or v_prediction, and Min-SNR is compatible with both prediction types. This weighting strategy is only supported by PyTorch and is unavailable in the Flax training script.

Add the --snr_gamma parameter and set it to the recommended value of 5.0:

accelerate launch train_text_to_image_prior.py \
  --snr_gamma=5.0
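To show what the rebalancing does, here is a minimal pure-Python sketch of Min-SNR loss weights. It assumes a linear beta schedule with common DDPM-style values (1000 steps, betas from 1e-4 to 0.02), which may differ from the scheduler the training scripts actually use, and the function names are hypothetical. The weight is min(SNR, gamma) / SNR for epsilon prediction and min(SNR, gamma) / (SNR + 1) for v_prediction.

```python
def snr_per_timestep(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """SNR(t) = alpha_bar_t / (1 - alpha_bar_t) for an assumed linear beta schedule."""
    snrs, alpha_bar = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        alpha_bar *= 1.0 - beta  # cumulative product of (1 - beta)
        snrs.append(alpha_bar / (1.0 - alpha_bar))
    return snrs


def min_snr_weight(snr, gamma=5.0, prediction_type="epsilon"):
    """Per-timestep loss weight with the SNR clipped at gamma."""
    clipped = min(snr, gamma)
    if prediction_type == "epsilon":
        return clipped / snr
    if prediction_type == "v_prediction":
        return clipped / (snr + 1.0)
    raise ValueError(f"unsupported prediction type: {prediction_type}")


snrs = snr_per_timestep()
# Low-noise timesteps (high SNR) are down-weighted; noisy ones keep weight 1.
for t in (0, 250, 500, 999):
    print(t, round(snrs[t], 4), round(min_snr_weight(snrs[t]), 4))
```

With gamma set to 5.0, any timestep whose SNR exceeds 5 contributes less to the loss, which keeps the easy, nearly clean timesteps from dominating training.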

Training script

The training script is also similar to the one in the Text-to-image training guide, but it’s been modified to support training the prior and decoder models. This guide focuses on the code that is unique to the Kandinsky 2.2 training scripts.

prior model

decoder model

Launch the script

Once you’ve made all your changes or you’re okay with the default configuration, you’re ready to launch the training script! 🚀

You’ll train on the Naruto BLIP captions dataset to generate your own Naruto characters, but you can also create and train on your own dataset by following the Create a dataset for training guide. Set the environment variable DATASET_NAME to the name of the dataset on the Hub or, if you’re training on your own files, set the environment variable TRAIN_DIR to the path to your dataset.

If you’re training on more than one GPU, add the --multi_gpu parameter to the accelerate launch command.

To monitor training progress with Weights & Biases, add the --report_to=wandb parameter to the training command. You’ll also need to add the --validation_prompt to the training command to keep track of results. This can be really useful for debugging the model and viewing intermediate results.
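The options described above can be combined into one launch command. The sketch below assembles such a command as an argument list; the helper itself is hypothetical. The --dataset_name and --train_data_dir flag names are assumptions carried over from the related text-to-image scripts, and --validation_prompt is spelled as in this guide, so confirm all three against parse_args() in your copy of the script before relying on them.

```python
import os
import shlex


def build_launch_command(script, env=None, multi_gpu=False, report_to=None, validation_prompt=None):
    """Assemble an `accelerate launch` argument list (hypothetical helper)."""
    env = os.environ if env is None else env
    cmd = ["accelerate", "launch"]
    if multi_gpu:
        # --multi_gpu is an option of `accelerate launch`, so it precedes the script.
        cmd.append("--multi_gpu")
    cmd.append(script)
    if env.get("DATASET_NAME"):
        cmd.append(f"--dataset_name={env['DATASET_NAME']}")  # assumed flag name
    elif env.get("TRAIN_DIR"):
        cmd.append(f"--train_data_dir={env['TRAIN_DIR']}")  # assumed flag name
    else:
        raise ValueError("set DATASET_NAME or TRAIN_DIR")
    if report_to:
        cmd.append(f"--report_to={report_to}")
    if validation_prompt:
        cmd.append(f"--validation_prompt={validation_prompt}")  # spelled as in this guide
    return cmd


# "user/my-dataset" is a placeholder, not a real Hub dataset.
cmd = build_launch_command(
    "train_text_to_image_prior.py",
    env={"DATASET_NAME": "user/my-dataset"},
    multi_gpu=True,
    report_to="wandb",
    validation_prompt="A robot naruto, 4k photo",
)
print(shlex.join(cmd))
```

Building the command as a list and quoting it with shlex.join keeps prompts that contain spaces or commas intact when the command is pasted into a shell.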

prior model

decoder model

Once training is finished, you can use your newly trained model for inference!

prior model

decoder model

Next steps

Congratulations on training a Kandinsky 2.2 model! To learn more about how to use your new model, the following guides may be helpful:

Read the Kandinsky guide to learn how to use the model for a variety of different tasks (text-to-image, image-to-image, inpainting, interpolation), and how it can be combined with a ControlNet.
Check out the DreamBooth and LoRA training guides to learn how to train a personalized Kandinsky model with just a few example images. These two training techniques can even be combined!
