Mind sharing the parameters you used during training?

#1
by zbulrush - opened

Hey there! So, I was trying to train ControlNet using the Diffusers library, but I ran into a few hiccups. Mind sharing the parameters you used during training? It could really help me out! Thanks!

Sure! I've made a few modifications to the training script, including using a non-standard optimizer. Here's my version of the script, and this is the command I used to launch it:

accelerate launch /workspace/train_control.py --pretrained_model_name_or_path=stabilityai/stable-diffusion-xl-base-1.0 --output_dir=/workspace/out --dataset_name=/workspace/zoe_stuff --conditioning_image_column=depth --validation_prompt "nightmare construction worker, unsettling" "android warrior, unsettling" --validation_image /workspace/depth2.png /workspace/depth4.png --validation_steps 100 --tracker_project_name sd_xl_train_controlnet --mixed_precision fp16 --report_to wandb --push_to_hub --hub_model_id controlnet-sd-xl-1.0-depth-magma --max_grad_norm 1.0 --checkpointing_steps 10000 --num_train_epochs 5 --resolution 512 --seed 1 --gradient_accumulation_steps 1 --train_batch_size 8 --enable_xformers_memory_efficient_attention --caption_column prompt --gradient_checkpointing --pretrained_vae_model_name_or_path madebyollin/sdxl-vae-fp16-fix --use_prodigy --set_grads_to_none --controlnet_model_name_or_path /workspace/out

If you were running into black preview images, chances are the VAE was your problem; the version bundled with the base model is unstable at FP16, which is why I pass madebyollin/sdxl-vae-fp16-fix via --pretrained_vae_model_name_or_path. Finally, you can view my wandb run for it here.
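If you want to sanity-check samples outside the training loop, here's a rough sketch of how you might load the fixed VAE at inference time; the ControlNet path is just a placeholder for whatever checkpoint you've trained:

```python
# Sketch only: SDXL + ControlNet inference with the fp16-fix VAE.
# "/workspace/out" stands in for your own trained ControlNet checkpoint.
import torch
from diffusers import AutoencoderKL, ControlNetModel, StableDiffusionXLControlNetPipeline
from diffusers.utils import load_image

# The fixed VAE avoids the black-image/NaN issue the stock SDXL VAE hits at fp16.
vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16)
controlnet = ControlNetModel.from_pretrained("/workspace/out", torch_dtype=torch.float16)

pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=controlnet,
    vae=vae,
    torch_dtype=torch.float16,
).to("cuda")

depth = load_image("/workspace/depth2.png")  # depth map used as the conditioning image
image = pipe("nightmare construction worker, unsettling", image=depth).images[0]
image.save("sample.png")
```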

Hope that helps!

Wow, thank you so much! You've been a great help to me.

No worries! I hope everything goes well; feel free to hit me up here or on Discord @sargezt if you need anything else. I'm working on a more general-purpose training base, a base ControlNet if you will, that is currently training. Right now a lot of the training time is spent simply on the first adaptation of the network, so I'm applying augmentations to the input data and mixing depth, canny, and seg conditions during training to decondition the network from normal image generation. If you'd like, I can certainly ping you when it's done so you can resume from that. Happy training!
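To give a rough idea of what I mean by mixing conditions, the per-sample selection looks something like this; the column names and dataset layout here are made up for illustration:

```python
import random
from PIL import Image

# Hypothetical sketch: randomly pick one conditioning modality per sample so the
# network doesn't lock onto a single input type during the first adaptation phase.
CONDITION_COLUMNS = ["depth", "canny", "seg"]  # placeholder column names

def pick_conditioning_image(example):
    """Return one of the stored conditioning maps at random for this sample."""
    column = random.choice(CONDITION_COLUMNS)
    return Image.open(example[column]).convert("RGB")
```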

I'm really puzzled: the validation image generated every 100 steps is always the same. You can find it in my wandb:
https://wandb.ai/zbulrush/sd_xl_train_controlnet/runs/3n9w86jj/workspace?workspace=user-zbulrush

That's very strange. I'll try to replicate that since the dataset is public and report back!

Oh jeez, I just noticed I left the augments on in the version I sent you; I was using those for some experiments. I've updated the gist to remove them. I'm about to start a run to confirm that they were the problem with that version.

OK yeah, that appears to have been the problem. I can't speak to how well it's fitting yet, but my samples are most certainly changing. I've relocated the trivial transform so it only acts on the target image, which is how it's supposed to be used, rather than applying it to both the control and training images the way I was. Here is the new version; one note, though: I've ripped out the manual prompts and just use the test split of the dataset to generate images. If you want to revert to the old behavior, just rip out the log_validation function from your current version.

With TrivialAugmentWide the model will take a bit longer to learn, but it will be much less prone to overfitting. You can remove it from the preprocessing transforms if you wish, though!
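For reference, the split between the two preprocessing pipelines looks roughly like this; it's a sketch in the spirit of the diffusers ControlNet training script, so the exact names may differ from my version:

```python
from torchvision import transforms

resolution = 512  # matches the --resolution flag above

# TrivialAugmentWide goes on the target image only; the conditioning (depth) image
# is left untouched so it still matches what the ControlNet will see at inference.
image_transforms = transforms.Compose([
    transforms.Resize(resolution, interpolation=transforms.InterpolationMode.BILINEAR),
    transforms.CenterCrop(resolution),
    transforms.TrivialAugmentWide(),  # augmentation on the target only
    transforms.ToTensor(),
    transforms.Normalize([0.5], [0.5]),
])

conditioning_image_transforms = transforms.Compose([
    transforms.Resize(resolution, interpolation=transforms.InterpolationMode.BILINEAR),
    transforms.CenterCrop(resolution),
    transforms.ToTensor(),  # no augmentation here
])
```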

Thanks again, I'll hurry up and give it a try!

Thanks a lot for sharing the training details! I'm trying to train a ControlNet for SDXL (full precision) myself, but my validation images quickly turn noisy (already after 200 steps).
Now I'm trying your hyperparameters, but I see that the learning rate is set to '1' with a 'cosine' schedule.
How does your model learn anything with such a high learning rate? Thanks!

I use the Prodigy optimizer, which adapts the LR automatically but requires the configured LR to start at 1 and decay from there, preferably with a cosine schedule.
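In code that's roughly the following; this is a sketch assuming the prodigyopt package and diffusers' get_scheduler helper, with a toy model and step count standing in for the real run:

```python
import torch
from prodigyopt import Prodigy                   # pip install prodigyopt
from diffusers.optimization import get_scheduler

# Toy placeholders: swap in the actual ControlNet and your real step count.
model = torch.nn.Linear(4, 4)
max_train_steps = 10_000

# Prodigy estimates the step size on its own, so lr stays at 1.0 and the cosine
# schedule only shapes the decay applied on top of that estimate.
optimizer = Prodigy(model.parameters(), lr=1.0)
lr_scheduler = get_scheduler(
    "cosine",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=max_train_steps,
)
```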
