Learning rate for v3.1

#16
by KeilahElla - opened

It looks like v3.1 uses a significantly higher learning rate, especially for the text encoder. At the same time, the batches are also smaller.

Do you not observe overfitting or degradations resulting from this?

Cagliostro Research Lab org

It uses CosineAnnealingWithRestarts as the learning rate scheduler, with a gamma of 0.9 and a minimum learning rate of 1e-6. In our opinion, the model is underfitted rather than overfitted, given that it was trained to specialize in generating characters: some of the new characters in the Animagine XL 3.1 dataset aren't generated as accurately as what we saw when Animagine XL 3.0 was trained. There is already a good finetune of Animagine XL 3.1, and it has proven to be a good model to build on top of.
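
For illustration, here is a minimal sketch of that kind of schedule in PyTorch; the cycle length and the dummy parameters are placeholders, not our actual training configuration:

```python
import math

import torch
from torch.optim.lr_scheduler import LambdaLR

base_lr, min_lr, gamma = 1e-5, 1e-6, 0.9  # values mentioned above
cycle_steps = 1000                        # placeholder cycle length, not the real config

def cosine_with_restarts(step: int) -> float:
    """Cosine annealing with hard restarts; the peak LR decays by `gamma` each cycle."""
    cycle = step // cycle_steps
    t = (step % cycle_steps) / cycle_steps              # progress within the cycle, 0..1
    peak = base_lr * (gamma ** cycle)                   # decayed peak for this cycle
    lr = min_lr + 0.5 * (peak - min_lr) * (1 + math.cos(math.pi * t))
    return lr / base_lr                                 # LambdaLR expects a multiplier of base_lr

params = [torch.nn.Parameter(torch.zeros(1))]           # stand-in for the model parameters
optimizer = torch.optim.AdamW(params, lr=base_lr)
scheduler = LambdaLR(optimizer, lr_lambda=cosine_with_restarts)
```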


There is also a good evaluation by Furusu. Thanks to them, we know that Animagine XL 3.1 is much better than Animagine XL 3.0. For future projects, we plan to conduct qualitative and quantitative evaluations for better insights.


However, we don't think the new configuration is a good thing either. We admit that using 1e-5 with a 48x2 batch size was not a good choice, and the Animagine XL 3.0 training configuration is still much better than 3.1's.

It has been observed that artifacts and some undesirable results are generated because:

  1. AdamW was used for pretraining without a further proof of concept.
  2. The dataset's creation date range is not balanced; many of the images are way too old, from around 2005 to 2014.
  3. An aesthetic scorer was used without further research.
  4. The current quality tag distribution is poor, so many posts with a score of < 20 end up tagged as low quality or worst quality (a rough sketch follows below).
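
To illustrate point 4, this is roughly how score-based quality tagging works; the thresholds below are illustrative placeholders, not necessarily the exact cutoffs used for Animagine XL 3.1:

```python
def quality_tag(score: int) -> str:
    """Map a post score to a quality tag (illustrative thresholds only)."""
    if score >= 150:
        return "masterpiece"
    if score >= 100:
        return "best quality"
    if score >= 20:
        return "normal quality"
    if score >= 0:
        return "low quality"
    return "worst quality"

# With cutoffs like these, everything scoring below 20 collapses into the two
# lowest buckets, which is the skew described above.
print(quality_tag(12))   # -> "low quality", even if the image itself is fine
```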

Hello @Asahina2K, I really appreciate you taking the time to give this great, detailed response! This makes a lot of sense.

Even though your LR is 1e-5, because of the cosine scheduler with decay the average or effective LR is much lower. Just eyeballing the graph, 3e-6 looks much closer to the average or effective LR the model is actually trained with, with 1e-5 only being the peak.
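
As a rough sanity check, the time-average of such a schedule can be computed numerically; the restart count and cycle length here are just guesses on my part:

```python
import numpy as np

base_lr, min_lr, gamma = 1e-5, 1e-6, 0.9
n_cycles, cycle_steps = 10, 1000          # assumed restart count and cycle length

steps = np.arange(n_cycles * cycle_steps)
cycle = steps // cycle_steps
t = (steps % cycle_steps) / cycle_steps
peak = base_lr * gamma ** cycle
lr = min_lr + 0.5 * (peak - min_lr) * (1 + np.cos(np.pi * t))

print(f"mean LR over training: {lr.mean():.2e}")  # roughly 3-4e-6 under these assumptions
```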

What motivated the change from Adafactor to AdamW? Adafactor is a personal favourite of mine due to low memory consumption.
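
For context, this is roughly how I set up the two optimizers for SDXL finetuning; the hyperparameters are just my own defaults, not anything from your configuration:

```python
import torch
from transformers.optimization import Adafactor

params = [torch.nn.Parameter(torch.zeros(1))]  # stand-in for unet.parameters()

# AdamW keeps two full-precision moment tensors per parameter.
adamw = torch.optim.AdamW(params, lr=1e-5, betas=(0.9, 0.999), weight_decay=1e-2)

# Adafactor with a fixed external LR (relative_step and scale_parameter disabled),
# which is the usual way to combine it with a scheduler; it stores factored second
# moments, so its optimizer state is much smaller than AdamW's.
adafactor = Adafactor(params, lr=1e-5, scale_parameter=False,
                      relative_step=False, warmup_init=False)
```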

My personal experience with SDXL finetuning is that somewhere around 5e-6 my model outputs start getting more and more blurry as the model trains longer. I have a suspicion that my dataset contains quite a few JPEG images with compression artefacts, and with a high learning rate the model starts to overfit on those. Using smoothed-L1 loss instead of L2 helps a lot, but at the expense of much slower training (I'm not sure why). Maybe I can afford much higher learning rates with smoothed-L1 loss. I probably need to purge JPEG images with heavy compression artefacts from my dataset, but I haven't found an easy way to do this automatically yet.
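
In case it helps anyone else, the swap I'm describing is just replacing the MSE objective in the noise-prediction loss with smooth L1; the `beta` value here is an arbitrary choice of mine:

```python
import torch
import torch.nn.functional as F

def diffusion_loss(noise_pred: torch.Tensor, noise: torch.Tensor,
                   use_smooth_l1: bool = True, beta: float = 0.1) -> torch.Tensor:
    """Noise-prediction loss: smooth L1 (Huber-like) or the standard L2 (MSE).

    Smooth L1 behaves like L2 for residuals smaller than `beta` and like L1
    beyond it, which damps the gradient contribution of outlier pixels such
    as blocky JPEG compression artefacts.
    """
    if use_smooth_l1:
        return F.smooth_l1_loss(noise_pred.float(), noise.float(), beta=beta)
    return F.mse_loss(noise_pred.float(), noise.float())
```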
