# 1to2: Training Multiple-Subject Models using only Single-Subject Data (Experimental)

Updates will be mirrored on both Hugging Face and Civitai.

## Introduction

[It has been shown that multiple characters can be trained into a single model](https://civitai.com/models/23476/the-idolmster-cinderella-girls-starlight-stage-style-90-characters). A harder task is to create a model that can generate several of those characters in one image without modifying the generation pipeline. This document describes a simple technique that has been shown to help generate multiple characters in the same image.

## Method

```
Requirement: Sets of single-character images
Steps:
1. Train a multi-concept model using the original dataset
2. Create an augmentation dataset of joined image pairs from the original dataset
3. Train on the augmentation dataset
```
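Step 2 can be sketched roughly as follows. This is a minimal illustration, not the exact procedure used here: the function names `sample_pairs` and `join_side_by_side`, the choice to pair only images of different characters, the random sampling, and the 512-pixel target height are all assumptions. The image joining uses Pillow.

```python
import itertools
import random


def sample_pairs(images_by_char, n_pairs, seed=0):
    """Pick (left, right) image pairs whose characters differ.

    images_by_char maps a character name to a list of image paths.
    Returns a list of ((char_left, path_left), (char_right, path_right)).
    Pairing only different characters is an assumption of this sketch.
    """
    rng = random.Random(seed)
    # Ordered character pairs, so each character can appear on either side.
    char_pairs = list(itertools.permutations(sorted(images_by_char), 2))
    if not char_pairs:
        raise ValueError("need at least two characters to build pairs")
    pairs = []
    for _ in range(n_pairs):
        left, right = rng.choice(char_pairs)
        pairs.append(((left, rng.choice(images_by_char[left])),
                      (right, rng.choice(images_by_char[right]))))
    return pairs


def join_side_by_side(path_left, path_right, height=512):
    """Resize both images to a common height and paste them left/right."""
    from PIL import Image  # Pillow; imported lazily so sampling works without it

    scaled = []
    for path in (path_left, path_right):
        img = Image.open(path).convert("RGB")
        # Keep the aspect ratio while matching the target height.
        width = max(1, round(img.width * height / img.height))
        scaled.append(img.resize((width, height)))
    canvas = Image.new("RGB", (scaled[0].width + scaled[1].width, height))
    canvas.paste(scaled[0], (0, 0))
    canvas.paste(scaled[1], (scaled[0].width, 0))
    return canvas
```

Each sampled pair would then be passed to `join_side_by_side` and saved alongside a caption file for the second training stage.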

## Experiment


### Setup

Three characters from the game Cinderella Girls were chosen for the experiment. The base model is `animefull-final-pruned`. The base model was checked and found to have minimal knowledge of the trained characters.

For the captions of the joined images, the template format `CharLeft/CharRight/COMPOSITE, TagsLeft, TagsRight` is used.
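As an illustration, a caption following this template could be assembled as below. This sketch assumes the first comma-separated token is the literal string `CharLeft/CharRight/COMPOSITE` with the character names substituted in, and that tags are kept in their original order without deduplication; the function name `composite_caption` and the example names and tags are hypothetical.

```python
def composite_caption(char_left, char_right, tags_left, tags_right):
    """Build a caption of the form
    'CharLeft/CharRight/COMPOSITE, TagsLeft, TagsRight'."""
    # First token names both characters and marks the image as a composite.
    head = f"{char_left}/{char_right}/COMPOSITE"
    return ", ".join([head, *tags_left, *tags_right])


caption = composite_caption(
    "charA", "charB",
    ["smile", "school uniform"],
    ["long hair", "dress"],
)
print(caption)
# charA/charB/COMPOSITE, smile, school uniform, long hair, dress
```

Keeping the character pair in a single leading token means a "keep first token" setting during tag shuffling would leave it in place.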

A LoHa (LoRA with Hadamard product representation) is trained using the config file below:
```
[model_arguments]
v2 = false
v_parameterization = false
pretrained_model_name_or_path = "Animefull-final-pruned.ckpt"

[additional_network_arguments]
no_metadata = false
unet_lr = 0.0005
text_encoder_lr = 0.0005
network_module = "lycoris.kohya"
network_dim = 8
network_alpha = 1
network_args = [ "conv_dim=0", "conv_alpha=16", "algo=loha",]
network_train_unet_only = false
network_train_text_encoder_only = false

[optimizer_arguments]
optimizer_type = "AdamW8bit"
learning_rate = 0.0005
max_grad_norm = 1.0
lr_scheduler = "cosine"
lr_warmup_steps = 0

[dataset_arguments]
debug_dataset = false
# keep token 1

[training_arguments]
output_name = "cg3comp"
save_precision = "fp16"
save_every_n_epochs = 1
train_batch_size = 2
max_token_length = 225
mem_eff_attn = false
xformers = true
max_train_epochs = 40
max_data_loader_n_workers = 8
persistent_data_loader_workers = true
gradient_checkpointing = false
gradient_accumulation_steps = 1
mixed_precision = "fp16"
clip_skip = 2
lowram = true

[sample_prompt_arguments]
sample_every_n_epochs = 1
sample_sampler = "k_euler_a"

[saving_arguments]
save_model_as = "safetensors"
```
For the second stage of training, the batch size was reduced to 2 while keeping other settings identical.
The training took less than 2 hours on a T4 GPU.

### Results
(see preview images)

## Limitations
* This technique doubles the memory/compute requirement
* Composites can still be generated despite negative prompting
* Cloned characters seem to become the primary failure mode in place of blended characters

## Related Works

Models trained on datasets based on anime shows have [demonstrated](https://civitai.com/models/21305/) multi-subject capability.
Simply using sufficiently distant concepts, such as `1girl, 1boy`, [has also been shown to be effective](https://civitai.com/models/17640/).

## Future work

Below is a list of ideas yet to be explored:
* Synthetic datasets
* Regularization
* Joint training instead of sequential