@MonsterMMORPG on Hugging Face: "Now You Can Full Fine Tune / DreamBooth Stable Diffusion XL (SDXL) with only…"

MonsterMMORPG

posted an update Mar 26, 2024

Now you can full fine-tune / DreamBooth Stable Diffusion XL (SDXL) with only 10.3 GB VRAM via OneTrainer. Both the U-NET and Text Encoder 1 are trained. I compared the 14 GB config against the slower 10.3 GB config.

Full config and instructions are shared here : https://www.patreon.com/posts/96028218

I used SG161222/RealVisXL_V4.0 as the base model and OneTrainer to train on Windows 10 : https://github.com/Nerogar/OneTrainer

The posted example x/y/z checkpoint comparison images are not cherry-picked, so I can get perfect images with multiple tries.

Trained for 150 epochs on 15 images, using my 5200 ground-truth regularization images : https://www.patreon.com/posts/massive-4k-woman-87700469

In each epoch, only 15 of the regularization images are used, which is what produces the DreamBooth training effect.

The only caption used for the training images is “ohwx man”; for the regularization images it is just “man”.
You can download configs and full instructions here : https://www.patreon.com/posts/96028218

Hopefully a full public tutorial will follow within 2 weeks. I will show the full configuration there as well.

The tutorial will be on our channel : https://www.youtube.com/SECourses
Training speeds and the resulting durations are as follows:

RTX 3060, slow preset : 3.72 seconds / it, thus 15 train images * 150 epochs * 2 (reg images concept) = 4500 steps, and 4500 * 3.72 / 3600 = 4.65 hours

RTX 3090 TI, slow preset : 1.58 seconds / it, thus 4500 * 1.58 / 3600 = 2 hours

RTX 3090 TI, fast preset : 1.45 seconds / it, thus 4500 * 1.45 / 3600 = 1.8 hours
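The duration arithmetic above can be reproduced with a short script. This is only a sketch: the function name `training_hours` and the `presets` dictionary are made up for illustration, and it assumes one image per step (batch size 1), which is what the 4500-step figure in the post implies.

```python
def training_hours(train_images, epochs, concepts, seconds_per_step):
    """Return (total_steps, hours), assuming one image per step (batch size 1)."""
    # Each epoch visits every training image once per concept
    # (here: the subject concept plus the regularization concept).
    steps = train_images * epochs * concepts
    hours = steps * seconds_per_step / 3600
    return steps, hours


# Seconds per iteration as reported in the post.
presets = {
    "RTX 3060, slow preset": 3.72,
    "RTX 3090 TI, slow preset": 1.58,
    "RTX 3090 TI, fast preset": 1.45,
}

for name, seconds_per_step in presets.items():
    steps, hours = training_hours(15, 150, 2, seconds_per_step)
    print(f"{name}: {steps} steps, about {hours:.2f} hours")
```

Changing the image count, epoch count, or seconds per iteration gives a rough estimate for other setups; a different batch size would need the step count divided accordingly.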

A quick tutorial for how to use concepts in OneTrainer : https://youtu.be/yPOadldf6bI

deleted

Mar 26, 2024

Did you find it useful to refine the text encoder? I mean, any difference with only fine-tuning the U-Net denoiser? The usual method in the reference Stable Diffusion is to only train the U-Net denoiser, but it is definitely interesting to explore the refinement of both the text encoder and the VAE as well

MonsterMMORPG

Mar 26, 2024

edited Mar 26, 2024

I ran over 100 trainings to empirically find the best hyperparameters, and training U-NET + Text Encoder 1 yields better results than training only the U-NET @researcher171473

MaziyarPanahi

Apr 1, 2024

This is awesome! I really wish I had more time to play around with DreamBooth, especially making it work with multiple prompts per image rather than 1 prompt per object.
Thanks for sharing this!

MonsterMMORPG

Apr 2, 2024

thanks a lot