Experimental Regional Controlnet Test

by Sen-sou - opened 27 days ago

Tried training a regional ControlNet with 50 images and 10 repeats, for a total of 5000 steps, using all the recommended parameters from the LLLite training guide.
The dataset of 50 images was generated from Anima base 1 and manually masked with color blobs.
Model: https://huggingface.co/Sen-sou/Anima-LLLite-Regional-Controlnet

And here are the results:

The ControlNet strength was set to 2. Below that, the characters did not follow the regions.

The training captions I used followed this format:
masterpiece, absurdres, score_9, anime,
rias gremory standing close to the viewer, cat sitting on the ground in the background, himejima akeno standing in the background, outdoors,
region:red rias gremory, standing, ,
region:green cat, animal, sitting, on ground ,
region:blue himejima akeno, standing, facing viewer ,
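
For anyone who wants to pair these captions with the color masks programmatically, here is a minimal sketch of a parser for the format above. It is only an illustration: the function names are hypothetical, it is not part of the LLLite training scripts, and it assumes every regional line starts with `region:<color>` followed by a space and a comma-separated tag list.

```python
def split_tags(text):
    """Split a comma-separated tag string, dropping empty entries such as ', ,'."""
    return [tag.strip() for tag in text.split(",") if tag.strip()]


def parse_regional_caption(caption):
    """Return (global_tags, regions), where regions maps a blob color to its tags.

    Hypothetical helper: assumes one 'region:<color> <tags>' entry per line.
    """
    global_tags = []
    regions = {}
    for line in caption.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("region:"):
            # "region:red rias gremory, standing, ," -> color "red", rest is the tag list
            color, _, rest = line[len("region:"):].partition(" ")
            regions[color] = split_tags(rest)
        else:
            global_tags.extend(split_tags(line))
    return global_tags, regions


EXAMPLE_CAPTION = """masterpiece, absurdres, score_9, anime,
rias gremory standing close to the viewer, cat sitting on the ground in the background, himejima akeno standing in the background, outdoors,
region:red rias gremory, standing, ,
region:green cat, animal, sitting, on ground ,
region:blue himejima akeno, standing, facing viewer ,
"""

global_tags, regions = parse_regional_caption(EXAMPLE_CAPTION)
for color, tags in regions.items():
    print(color, tags)
```

The resulting color keys can then be matched against the blob colors in the mask image, for example to check that every caption region actually has a blob.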

I don't think it helped relate the prompt to the specific regions (maybe because the dataset was very small).

I test-generated a lot of images, and most of the time it did place characters where the color blobs were, so I think the experiment was a success.
Even though it didn't link the prompt to the regions very well, that can also be done with cross-attention token masking, or even a little self-attention token masking (as in https://github.com/Haoming02/sd-forge-couple or https://github.com/Sen-sou/Comfyui-Anima-Regional-Conditioning).
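
To make the cross-attention token masking idea concrete, here is a rough pure-Python sketch. It is a generic illustration and not the actual implementation of either extension linked above: each image position carries a region label (or `None` for background), each text token is either global (`None`) or tied to one region, and a position may only attend to global tokens and to tokens of its own region. All names here are hypothetical.

```python
import math


def build_token_mask(position_regions, token_regions):
    """mask[p][t] is True if image position p may attend to text token t.

    position_regions: region label per flattened image position (None = background).
    token_regions: region label per text token (None = global token).
    """
    return [
        [tok is None or tok == pos for tok in token_regions]
        for pos in position_regions
    ]


def masked_softmax(scores, allowed):
    """Softmax over one row of attention scores; disallowed tokens get weight 0."""
    kept = [s for s, ok in zip(scores, allowed) if ok]
    if not kept:
        return [0.0] * len(scores)
    peak = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]


# Three image positions (red blob, blue blob, background) and three tokens
# (one global tag, one red-region tag, one blue-region tag).
mask = build_token_mask(["red", "blue", None], [None, "red", "blue"])
weights = masked_softmax([0.0, 0.0, 5.0], mask[0])
print(mask)
print(weights)
```

In a real model the same boolean mask would be applied to the attention logits of every cross-attention layer before the softmax. The regional ControlNet and this kind of masking are complementary: the former decides where something is drawn, the latter decides which prompt tokens describe it.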

So it seems that with a proper, large dataset it could be done right.

Sen-sou

26 days ago

•

edited 26 days ago

@kohya-ss can you advise me on the training parameters? I used the same training parameters as the training guide.

Sen-sou

25 days ago

Updated the model: https://huggingface.co/Sen-sou/Anima-LLLite-Regional-Controlnet. It is now trained on 580 images.

kohya-ss

Owner 15 days ago

Thank you for letting me know. That's a great idea!

LLLite has the characteristic of not affecting text conditioning. Therefore, it's surprising how well it works. The reason it works well in practical applications when combined with attention couples is likely because it effectively complements this characteristic.

This LLLite model probably learns that characters should be drawn in the mask area.

I'm not entirely sure about the training parameters, but while it didn't have much effect with anytest or inpainting, you might want to try enabling ASPP to identify whether or not a character is present in the mask area.

I also think it would be a good idea to extend the architecture of LLLite for these kinds of applications.

Sen-sou

15 days ago

•

edited 15 days ago

Here is what I'm using:
--learning-rate "1e-3"
--lr-scheduler "constant_with_warmup"
--lr-warmup-steps 250
--max-train-epochs 50
--save-every-n-epochs 2
--lllite-cond-dim 128
--lllite-mlp-dim 64
--lllite-target-layers "self_attn_qkv"
--lllite-cond-resblocks 4
--lllite-use-aspp
--caption-dropout-rate "0.15"

[general]
caption_extension = ".txt"
shuffle_caption = false

[[datasets]]
resolution = 1024
batch_size = 4
enable_bucket = true
bucket_no_upscale = true
bucket_reso_steps = 64
min_bucket_reso = 64
max_bucket_reso = 1536

num_repeats = 2

As you can see, I'm already using ASPP @kohya-ss
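
For reference, here is a back-of-the-envelope estimate of how many optimizer steps these settings imply. This is only a sketch: it assumes the 580-image dataset from the updated model, ignores gradient accumulation, and ignores the fact that bucketing forms batches per bucket, so the trainer's real step count may differ slightly. The function name is hypothetical.

```python
import math


def estimate_steps(num_images, num_repeats, batch_size, epochs):
    """Rough step count: ceil(images * repeats / batch_size) per epoch, times epochs."""
    steps_per_epoch = math.ceil(num_images * num_repeats / batch_size)
    return steps_per_epoch * epochs


# Settings from the config above, with the assumed 580-image dataset:
# ceil(580 * 2 / 4) = 290 steps per epoch, 290 * 50 epochs = 14500 steps.
total = estimate_steps(num_images=580, num_repeats=2, batch_size=4, epochs=50)
print(total)

# The first experiment (50 images, 10 repeats) reaches the quoted 5000 steps
# if one assumes batch size 1 and 10 epochs; those two values are guesses.
print(estimate_steps(num_images=50, num_repeats=10, batch_size=1, epochs=10))
```

Under these assumptions the 250 warmup steps cover a bit under 2% of the run.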

Sen-sou

15 days ago

Would it be possible to apply the regions to the text tokens? For example, by training with a specific caption format like the one above and having the regions apply to those region-specific text tokens.

kohya-ss

Owner 15 days ago

•

edited 15 days ago

I see. To be honest, I don't know any recommended parameters either.

If you want to try something, try adding self_attn_kv_pre and mlp_fc1_pre (comma-separated) to --lllite_target_layers. self_attn_kv_pre might increase the effect. mlp_fc1_pre does the same, but it might also allow LLLite to learn extra information beyond control (since MLPs are said to learn knowledge).

LLLite always uses the same input size as images, so it cannot be applied to text. The main reason is that text length is variable. To apply it to text, LLLite would require an architectural extension to support variable-length input/output.

(This would probably be something new, not ControlNet.)

Sen-sou

15 days ago

Thanks for replying. I'll take your suggestions into account for the next version. 👍
