# stable diffusion training methods
## fine-tuning
- retrains parts of the network with new data, modifying the original weights
requires a large and precisely labelled dataset
- size is same as original model, ~2-7gb
- verdict: prohibitive due to the large dataset and effort required
## model merge
- combines weights from multiple models according to specified rules
- verdict: highly desirable for creating pre-set models for specific use-cases
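A minimal sketch of the weighted-sum variant of model merging, with state dicts shown as plain Python dicts of scalar "weights" rather than real tensor checkpoints; the key names are illustrative, not actual checkpoint keys.

```python
# toy weighted-sum merge: result = (1 - alpha) * A + alpha * B
def merge_weighted_sum(state_a, state_b, alpha=0.5):
    """Blend two state dicts of matching architecture key-by-key."""
    if state_a.keys() != state_b.keys():
        raise ValueError("models must share the same architecture/keys")
    return {k: (1 - alpha) * state_a[k] + alpha * state_b[k] for k in state_a}

# hypothetical scalar stand-ins for real tensors
model_a = {"unet.attn": 1.0, "unet.ff": 2.0}
model_b = {"unet.attn": 3.0, "unet.ff": 4.0}
merged = merge_weighted_sum(model_a, model_b, alpha=0.25)
```

Real merge tools apply the same per-key blend to full tensors and support other rules (e.g. add-difference), but the principle is this element-wise combination.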
## textual inversion
- assigns a vector to a new concept; originally one vector per embedding, with hacks to enable multi-vector embeddings
works by expanding the vocabulary of the model, but the majority of learned content is actually assembled from existing concepts
can be considered a formula for how already-learned weights should be combined to achieve the learned concept
- size: 768 (v1) or 1024 (v2) values per vector
- verdict: best currently viable short-term training solution
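A hedged sketch of the vocabulary-expansion idea: the model stays frozen and only one new embedding vector is trained for a pseudo-token. The vocabulary here is a toy dict, not the real tokenizer.

```python
embedding_dim = 768  # per-vector size for SD 1.x (1024 for 2.x)

# toy stand-in for the frozen text-encoder vocabulary
vocabulary = {
    "photo": [0.1] * embedding_dim,
    "dog":   [0.2] * embedding_dim,
}

# a new pseudo-token gets its own vector; training adjusts ONLY this
# vector so the frozen model reproduces the new concept
vocabulary["<my-concept>"] = [0.0] * embedding_dim

def encode(prompt_tokens, vocab):
    """Look up each token's embedding vector."""
    return [vocab[t] for t in prompt_tokens]

embeddings = encode(["photo", "<my-concept>"], vocabulary)
```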
## aesthetic gradient
- uses low-precision trained embeddings to steer clip using classifier guidance
training is very cheap, but classifier guidance slows down image generation
result is a basic transfer of style from the learned image to the generated image
- size is same as an embedding
- origin: independent work
- verdict: inconsistent results with minimal value
## custom diffusion
- fine-tunes specific model matrices together with textual inversion
similar speed and memory requirements to embedding training and supposedly gives better results in fewer steps
- size ~50mb
- origin: cmu
- verdict: possibly promising, requires further investigation; surprisingly low chatter on this topic
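A rough sketch of why custom diffusion stays small: only the cross-attention key/value projection matrices are unfrozen, so a small fraction of the model trains. Parameter names and counts below are illustrative, not real checkpoint keys.

```python
# hypothetical layer names with toy parameter counts
params = {
    "unet.down.attn2.to_q": 100,
    "unet.down.attn2.to_k": 100,
    "unet.down.attn2.to_v": 100,
    "unet.down.ff.proj":    400,
}

# freeze everything except cross-attention K/V projections
trainable = {name: n for name, n in params.items()
             if name.endswith(("to_k", "to_v"))}

fraction = sum(trainable.values()) / sum(params.values())
```

Training so few matrices is what keeps the result around ~50mb instead of the full multi-gb checkpoint.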
## hypernetwork
- similar to model fine-tuning, but adds a small neural network that modifies the weights of the last two layers of the main model on-the-fly
works like an adaptive head that steers the model in a learned direction, so the primary use-case is style transfer, not concept transfer
- size is limited to the learned layers, ~100-200mb
- origin: leaked from novel.ai
- verdict: lower priority as concept transfer is more important than style transfer
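A toy sketch of the assumed mechanism: a tiny trainable transform is applied to the frozen model's attention inputs, and is initialized as an identity so training starts from unmodified base-model behavior. Real hypernetworks are small MLPs; this reduces one "layer" to an elementwise scale-and-shift for illustration.

```python
def hypernetwork(vec, weight, bias):
    """Toy one-layer transform of a context vector: elementwise w*v + b."""
    return [w * v + b for v, w, b in zip(vec, weight, bias)]

context = [1.0, 2.0, 3.0]                 # frozen encoder output (toy)
w, b = [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]   # identity init

# at identity init the hypernetwork leaves the context untouched,
# so generation initially matches the unmodified base model
assert hypernetwork(context, w, b) == context
```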
## null-text inversion
- similar concept to textual inversion, but trains the unconditional embedding used for classifier-free guidance instead of the text embedding
resulting embedding is apparently more detailed than a standard textual embedding
- size is larger than but comparable to textual inversion
- origin: google
- verdict: possibly promising, requires further investigation, but no working prototype as of yet
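A sketch of classifier-free guidance, the mechanism null-text inversion exploits: the unconditional noise prediction is mixed in at every denoising step, so optimizing the unconditional embedding steers the result without touching the text embedding. Noise predictions are reduced to toy scalars here.

```python
def cfg(noise_uncond, noise_cond, guidance_scale):
    """Classifier-free guidance: eps = eps_u + s * (eps_c - eps_u)."""
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)

# with scale 1.0 the unconditional branch cancels out...
assert abs(cfg(0.2, 0.8, 1.0) - 0.8) < 1e-12

# ...while typical scales (e.g. 7.5) amplify the conditional direction
out = cfg(0.2, 0.8, 7.5)
```

Because `noise_uncond` enters this formula at every step, a trained unconditional embedding has continuous influence over the whole denoising trajectory.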
## clip inversion
- similar concept to textual inversion, but uses a clip embedding instead of a text embedding
- size is same as textual inversion
- origin: google
- verdict: prohibitive due to the requirement of a specially fine-tuned model as a starting point
## dream artist
- variation on textual-inversion training where both positive and negative embeddings are created
- size is same as textual inversion
- origin: independent work
- verdict: skip for now, as the solution does not appear to be sufficiently maintained
## dreambooth
- similar to model fine-tuning, except it adds information on top of the model instead of forgetting/overwriting existing concepts
- size is equal to the original model, ~2-7gb
- origin: google, but heavily modified by independent work
- verdict: prohibitive due to the resulting size and the requirement to load the full model on-demand
## lora
- "low-rank adaptation of large language models"
injects trainable layers to steer the cross-attention layers
very flexible, but memory intensive, so limited training opportunities on a normal gpu
multiple incompatible implementations: need to choose which implementation to use
- size varies from ~5mb to full-model size, average ~150-300mb
- origin: microsoft
- verdict: very promising, but memory-prohibitive until further optimizations
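A minimal sketch of the low-rank idea behind lora: the frozen weight `W` is augmented by a low-rank product `B @ A`, and only `A` and `B` (far fewer parameters than `W`) are trained. Dimensions are toy values; real adapters also apply a scaling factor `alpha / r`, omitted here for brevity.

```python
def matmul(X, Y):
    """Plain-Python matrix multiply for the sketch."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

d, r = 4, 1                                  # full dim 4, rank-1 adapter
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen
B = [[0.0] for _ in range(d)]                # (d x r), zero-init
A = [[0.5] * d]                              # (r x d)

delta = matmul(B, A)                         # (d x d) low-rank update
W_adapted = [[w + dw for w, dw in zip(wr, dr)]
             for wr, dr in zip(W, delta)]

# zero-init B means training starts from the unmodified base weights;
# the adapter stores 2*d*r values instead of d*d
```

Training only `2*d*r` values per layer instead of `d*d` is what lets adapter files stay in the megabyte range, and merging `delta` back into `W` recovers a plain checkpoint.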