1to2: Training Multiple-Subject Models using only Single-Subject Data (Experimental)

Updates will be mirrored on both Hugging Face and Civitai.

Introduction

It has been shown that multiple characters can be trained into a single model. A harder task is to create a model that can generate multiple characters simultaneously without modifying the generation pipeline. This document describes a simple technique that has been shown to help generate multiple characters in the same image.

Method

Requirement: Sets of single-character images
Steps:
1. Train a multi-concept model using the original dataset
2. Create an augmentation dataset of joined image pairs from the original dataset
3. Train on the augmentation dataset
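
Step 2 can be sketched roughly as follows. This is a minimal illustration, not the exact script used for the experiment. It assumes that pairs are drawn from two different characters, that "joined" means pasted side by side at a common height, and that Pillow is installed. The function names, the `pairs_per_combo` parameter, and the 512-pixel height are all hypothetical choices.

```python
import itertools
import random


def sample_pairs(images_by_char, pairs_per_combo, seed=0):
    """Pick (left, right) image pairs whose two characters differ (assumption)."""
    rng = random.Random(seed)
    pairs = []
    # Ordered permutations, so every character appears on both the left and the right.
    for left_char, right_char in itertools.permutations(sorted(images_by_char), 2):
        for _ in range(pairs_per_combo):
            left = rng.choice(images_by_char[left_char])
            right = rng.choice(images_by_char[right_char])
            pairs.append(((left_char, left), (right_char, right)))
    return pairs


def join_side_by_side(left_path, right_path, out_path, height=512):
    """Scale both images to a common height and paste them left to right."""
    from PIL import Image  # Pillow, imported lazily so pair sampling works without it

    def load(path):
        img = Image.open(path).convert("RGB")
        width = max(1, round(img.width * height / img.height))
        return img.resize((width, height))

    left, right = load(left_path), load(right_path)
    canvas = Image.new("RGB", (left.width + right.width, height))
    canvas.paste(left, (0, 0))
    canvas.paste(right, (left.width, 0))
    canvas.save(out_path)
```

With three characters this yields six ordered character combinations, so `pairs_per_combo` directly controls the size of the augmentation dataset.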

Experiment

Setup

Three characters from the game Cinderella Girls were chosen for the experiment. The base model is Animefull-final-pruned. The base model was checked and found to have minimal knowledge of the trained characters.

Captions for the joined images follow the template CharLeft/CharRight/COMPOSITE, TagsLeft, TagsRight.
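
As an illustration, a caption following this template could be assembled as below. The sketch reads the template literally, which is an assumption: the two character names and the marker COMPOSITE are joined by slashes into the first comma-separated token, followed by the tags of the left image and then the tags of the right image. The function name and the example tags are hypothetical.

```python
def composite_caption(char_left, char_right, tags_left, tags_right):
    """Build 'CharLeft/CharRight/COMPOSITE, TagsLeft, TagsRight'."""
    # First comma-separated token: both character names plus the COMPOSITE marker.
    head = f"{char_left}/{char_right}/COMPOSITE"
    # Tags are appended as-is, left image first; duplicates are not removed.
    return ", ".join([head, *tags_left, *tags_right])


print(composite_caption("charA", "charB", ["smile", "long hair"], ["hat"]))
# prints: charA/charB/COMPOSITE, smile, long hair, hat
```

Keeping the character names and the marker in a single leading token means a setting that keeps the first token fixed would preserve all three together.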

A LoRA (Hadamard product variant, LoHa) was trained using the config file below:

[model_arguments]
v2 = false
v_parameterization = false
pretrained_model_name_or_path = "Animefull-final-pruned.ckpt"

[additional_network_arguments]
no_metadata = false
unet_lr = 0.0005
text_encoder_lr = 0.0005
network_module = "lycoris.kohya"
network_dim = 8
network_alpha = 1
network_args = [ "conv_dim=0", "conv_alpha=16", "algo=loha",]
network_train_unet_only = false
network_train_text_encoder_only = false

[optimizer_arguments]
optimizer_type = "AdamW8bit"
learning_rate = 0.0005
max_grad_norm = 1.0
lr_scheduler = "cosine"
lr_warmup_steps = 0

[dataset_arguments]
debug_dataset = false
# keep token 1

[training_arguments]
output_name = "cg3comp"
save_precision = "fp16"
save_every_n_epochs = 1
train_batch_size = 2
max_token_length = 225
mem_eff_attn = false
xformers = true
max_train_epochs = 40
max_data_loader_n_workers = 8
persistent_data_loader_workers = true
gradient_checkpointing = false
gradient_accumulation_steps = 1
mixed_precision = "fp16"
clip_skip = 2
lowram = true

[sample_prompt_arguments]
sample_every_n_epochs = 1
sample_sampler = "k_euler_a"

[saving_arguments]
save_model_as = "safetensors"

For the second stage of training, the batch size was reduced to 2 while all other settings were kept identical. Training took less than 2 hours on a T4 GPU.

Results

(see preview images)

Limitations

This technique doubles the memory/compute requirement
Composites can still be generated despite negative prompting
Cloned characters seem to become the primary failure mode in place of blended characters

Related Works

Models trained on datasets based on anime shows have demonstrated multi-subject capability. Simply using sufficiently distant concepts, such as 1girl and 1boy, has also been shown to be effective.

Future work

Below is a list of ideas yet to be explored:

Synthetic datasets
Regularization
Joint training instead of sequential