Details about training

#1
by laidawang - opened

This is truly fantastic work! I have personally tested it and found the results amazing, which makes me even more curious about certain training details that I think may be key.
1. How do you do data augmentation?
2. How do you make the random mask?
3. Do you provide prompts of different lengths, or drop out some prompts? As far as I know, the prompts generated by CogVLM are often too long, making it difficult for humans to meet that standard.
Any information (details or training scripts) regarding training would be much appreciated!

The canny model is the easiest model to train in the ControlNet series. The data augmentation can be random thresholds; you can refer to https://github.com/lllyasviel/ControlNet/blob/main/gradio_canny2image.py. The point is that the augmentation needs to be set properly so the network can learn it; if it is too hard, the results get worse.

The random mask can be made in many ways: you can generate polygons of different sizes, or use the RGB channels of the images to filter out a random percentage of pixels. You can ask GPT-4 to help generate some functions for this.

Prompts mainly affect prompt following, i.e. the smartness of the model, but not the beauty of the images. The original Midjourney image-text pairs are good because they were selected by professional designers. However, many images on popular websites do not have a good natural prompt. You can instruct CogVLM to restrict the prompt length, but according to the DALL-E 3 report, longer prompts give better prompt following, and the main text-to-image models support 225 tokens, so a long prompt is not a bad thing.
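For what it's worth, here is a minimal Python sketch of the kind of augmentation described above: Canny with randomly sampled thresholds (in the spirit of gradio_canny2image.py) and a random polygon mask. The threshold ranges, polygon counts, and sizes are my own illustrative assumptions, not the values used for this model.

```python
import random

import cv2
import numpy as np


def random_canny(image_rgb: np.ndarray) -> np.ndarray:
    """Canny edges with randomly sampled thresholds (illustrative ranges,
    not the ones used for this model). Expects a uint8 RGB image."""
    low = random.randint(50, 200)
    high = random.randint(low + 1, 400)
    gray = cv2.cvtColor(image_rgb, cv2.COLOR_RGB2GRAY)
    edges = cv2.Canny(gray, low, high)
    return cv2.cvtColor(edges, cv2.COLOR_GRAY2RGB)


def random_polygon_mask(height: int, width: int, max_polygons: int = 4) -> np.ndarray:
    """Binary mask built from a few random convex polygons of different sizes."""
    mask = np.zeros((height, width), dtype=np.uint8)
    for _ in range(random.randint(1, max_polygons)):
        num_points = random.randint(3, 8)
        points = np.stack(
            [np.random.randint(0, width, num_points),
             np.random.randint(0, height, num_points)],
            axis=-1,
        ).astype(np.int32)
        # Fill the convex hull of the random points to get a solid polygon.
        cv2.fillPoly(mask, [cv2.convexHull(points)], 255)
    return mask
```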

Very good model! Thanks for your effort.
Do you plan to train a tile model for SDXL? I think the tile model in SD is also very important, like canny, and can achieve very good results!

Sure, I will try a tile model, perhaps a month later. The tile model sometimes behaves like deblurring and high-res upscaling, so I need to analyse it first.

I want to know what 'multiple loss' means.

not enough info..

Hello and thank you for the controlnet and information! I trained a few controlnets for Stable Diffusion 1 a year ago too and documented everything here: https://civitai.com/articles/2078 and here https://github.com/lllyasviel/ControlNet/discussions/318

I'm also interested in more details about the training and would be happy if you could answer some of these questions:

  • Which training script did you use and can you share?
  • Which image datasets and what part of them did you use, and why (if you want to publish this information)? (LAION-5B alone has 5 billion images, but you "only" used 100 million)
  • How did you crop and resize the images?
  • How specifically did you de-duplicate and clean the images, how much did you throw away?
  • Which concrete prompt did you use for CogVLM and how did you come up with it?
  • Can you share the captions as a dataset?
  • Did you apply prompt dropping, how much and why?
  • How did you come up with the parameters for masking and how did you evaluate it?
  • What do you mean by multiple loss and multi-resolution exactly?
  • What did you use for learning_rate and hyperparameters and how did you evaluate or come up with these values?
  • When did you reach convergence?

Thank you for your time, even if you can only answer a few questions!

The real batch size is 2560 when using accumulate_grad_batches.
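For readers wondering how such a number can come about, here is a tiny arithmetic sketch. The per-device batch size and GPU count are hypothetical assumptions purely for illustration; only the 2560 figure comes from the reply above, and accumulate_grad_batches is the standard PyTorch Lightning Trainer argument.

```python
# Hypothetical breakdown of an effective batch size of 2560 with gradient
# accumulation. The per-device batch size and GPU count are assumptions,
# not the actual training setup.
per_device_batch = 32          # hypothetical
num_gpus = 8                   # hypothetical
accumulate_grad_batches = 10   # hypothetical, passed to lightning.Trainer(...)

effective_batch = per_device_batch * num_gpus * accumulate_grad_batches
assert effective_batch == 2560
```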

Do you mean 256 or how do you get to this value?

I would also be interested in whether it is possible to fine-tune it. Amazing work!

@GeroldMeisinger I can tell you some of the details you care about.

The training script was based on diffusers, and I rewrote it in Lightning with reference to the SD-series model code; the training speed increased by 25% compared with the original diffusers code.

The dataset was mainly collected from the internet; from LAION, only images with a high aesthetic score were chosen. The internet sources include popular image websites that have good images.

The image crop and resize is just the usual approach. I think the key is the use of bucket training; I implemented the bucketing proposed by NovelAI and support any aspect ratio.

Captions can be tagged by vision-language models such as LLaVA or CogVLM, and now there are more choices as more powerful models are released. Or you can use the original prompt of the image itself, as the websites usually give the image a human-written description; it is short but precise.

Prompt dropping is used to enhance the model's understanding at the semantic level.

The masking algorithm is just the usual one; you can look at some papers about image super-resolution, like the GAN series, where many augmentation methods exist.

The multiple loss is used to accelerate the convergence of the model; it is based on different resolutions of the predicted noise. You can refer to the Simple Diffusion paper for more detail (a rough sketch follows below).

The learning rate and hyperparameters were not set elaborately; the default settings are fine and you can also explore them. Convergence is nothing special: as the original author Lvmin Zhang said, at some step the model will suddenly converge, and more training will refine the result, but not by much.
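For anyone curious what such a multi-resolution loss could look like in practice, here is a minimal PyTorch sketch of an MSE over the predicted noise computed at several downsampled scales, loosely in the spirit of the Simple Diffusion paper. The scales and weights are my own illustrative assumptions, not the ones used for this model.

```python
import torch
import torch.nn.functional as F


def multi_resolution_loss(pred_noise: torch.Tensor,
                          target_noise: torch.Tensor,
                          scales=(1, 2, 4),
                          weights=(1.0, 0.5, 0.25)) -> torch.Tensor:
    """MSE between predicted and target noise, summed over several
    downsampled resolutions. Scales and weights are illustrative assumptions.
    Both tensors are expected in (N, C, H, W) layout."""
    loss = pred_noise.new_zeros(())
    for scale, weight in zip(scales, weights):
        if scale == 1:
            p, t = pred_noise, target_noise
        else:
            # Average-pool both tensors down by the same factor before the MSE.
            p = F.avg_pool2d(pred_noise, kernel_size=scale)
            t = F.avg_pool2d(target_noise, kernel_size=scale)
        loss = loss + weight * F.mse_loss(p, t)
    return loss
```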

I will consider releasing the scripts, perhaps some time later. Right now I have some ideas around uni-controlnet and a new model is coming; when the final model is released I will release all the code and details.

Thank you for taking the time to answer EVERY question! Looking forward to the code and details.

Amazing model! I have been playing with a hybrid IPAdapter and this canny model within ComfyUI, with beautiful results. It is by far exceeding the performance of other SDXL canny models. Bravo! (I'm not a technician, so my experience is purely based on image results.)
