---
pipeline_tag: text-to-image
license: other
license_name: faipl-1.0-sd
license_link: LICENSE
decoder:
- Disty0/sotediffusion-wuerstchen3-alpha1-decoder
---

# SoteDiffusion Wuerstchen3

Anime finetune of Würstchen V3.
Currently in an early stage of training.
No commercial use, thanks to StabilityAI.

# Release Notes

Did a major cleanup of the dataset in this release.
Changed the training parameters and started from a fresh state.
Switched to the Fair AI license. (Still no commercial use.)

<table>
<img src="https://cdn-uploads.huggingface.co/production/uploads/6456af6195082f722d178522/oKTevlG-qi5Jfdy6TkGeI.png" height="576">
</table>

# Code Example

```shell
pip install diffusers
```

```python
import torch
from diffusers import StableCascadeCombinedPipeline

device = "cuda"
dtype = torch.bfloat16
model = "Disty0/sotediffusion-wuerstchen3-alpha1-decoder"

pipe = StableCascadeCombinedPipeline.from_pretrained(model, torch_dtype=dtype)

# send everything to the gpu:
pipe = pipe.to(device, dtype=dtype)
pipe.prior_pipe = pipe.prior_pipe.to(device, dtype=dtype)

# or enable model offload to save vram:
# pipe.enable_model_cpu_offload()

prompt = "1girl, solo, cowboy shot, straight hair, looking at viewer, hoodie, indoors, slight smile, casual, furniture, doorway, very aesthetic, best quality, newest,"
negative_prompt = "very displeasing, worst quality, oldest, monochrome, sketch, realistic,"

output = pipe(
    width=1024,
    height=1536,
    prompt=prompt,
    negative_prompt=negative_prompt,
    decoder_guidance_scale=1.0,
    prior_guidance_scale=8.0,
    prior_num_inference_steps=40,
    output_type="pil",
    num_inference_steps=10
).images[0]

# do something with the output image, e.g. save it:
# output.save("output.png")
```

## Training Status:

**GPU used for training**: 1x AMD RX 7900 XTX 24GB
**GPU Hours**: 100

| dataset name | training done | remaining |
|---|---|---|
| **newest** | 003 | 228 |
| **recent** | 003 | 169 |
| **mid** | 003 | 121 |
| **early** | 003 | 067 |
| **oldest** | 003 | 017 |
| **pixiv** | 003 | 039 |
| **visual novel cg** | 003 | 025 |
| **anime wallpaper** | 003 | 010 |
| **Total** | 32 | 676 |

**Note**: Chunk numbering starts from 0 (so `003` means 4 chunks are done), and there are 8000 images per chunk.

## Dataset:

**GPU used for captioning**: 1x Intel ARC A770 16GB
**GPU Hours**: 350

**Model used for captioning**: SmilingWolf/wd-swinv2-tagger-v3
**Command:**
```
python /mnt/DataSSD/AI/Apps/kohya_ss/sd-scripts/finetune/tag_images_by_wd14_tagger.py --model_dir "/mnt/DataSSD/AI/models/wd14_tagger_model" --repo_id "SmilingWolf/wd-swinv2-tagger-v3" --recursive --remove_underscore --use_rating_tags --character_tags_first --character_tag_expand --append_tags --onnx --caption_separator ", " --general_threshold 0.35 --character_threshold 0.50 --batch_size 4 --caption_extension ".txt" ./
```

| dataset name | total images | total chunks |
|---|---|---|
| **newest** | 1,848,331 | 232 |
| **recent** | 1,380,630 | 173 |
| **mid** | 993,227 | 125 |
| **early** | 566,152 | 071 |
| **oldest** | 160,397 | 021 |
| **pixiv** | 343,614 | 043 |
| **visual novel cg** | 231,358 | 029 |
| **anime wallpaper** | 104,790 | 014 |
| **Total** | 5,628,499 | 708 |

**Note**:
- Smallest image size is 1280x600 (768,000 pixels).
- Deduplicated by image similarity using czkawka-cli.

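
The chunk counts in the table appear to follow from the image totals and the 8000-images-per-chunk figure given under Training Status. A minimal sketch, assuming each dataset is split separately and a final partial chunk still counts as one chunk (`chunk_count` is a name made up for illustration, not part of any tool used here):

```python
import math

IMAGES_PER_CHUNK = 8000  # from the note under Training Status

# Image totals copied from the dataset table.
dataset_images = {
    "newest": 1_848_331,
    "recent": 1_380_630,
    "mid": 993_227,
    "early": 566_152,
    "oldest": 160_397,
    "pixiv": 343_614,
    "visual novel cg": 231_358,
    "anime wallpaper": 104_790,
}


def chunk_count(total_images, images_per_chunk=IMAGES_PER_CHUNK):
    """Chunks needed for total_images; a partial last chunk counts as one (assumption)."""
    return math.ceil(total_images / images_per_chunk)


chunks = {name: chunk_count(n) for name, n in dataset_images.items()}
total_chunks = sum(chunks.values())

for name, count in chunks.items():
    print(f"{name}: {count}")
print("total:", total_chunks)
```

Under these assumptions the per-dataset counts match the table, for example 232 chunks for **newest**.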
## Tags:

The model is trained with a random tag order. For reference, this is the tag order used in the dataset:
```
aesthetic tags, quality tags, date tags, custom tags, rating tags, character, series, rest of the tags
```

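
The dataset order is not required at inference time, but a prompt can be assembled in that order if you want to mirror it. A small sketch; `build_prompt` and the group keys are illustrative names, not part of any library:

```python
# Tag groups in the order used by the dataset captions.
TAG_ORDER = [
    "aesthetic",
    "quality",
    "date",
    "custom",
    "rating",
    "character",
    "series",
    "rest",
]


def build_prompt(tags_by_group):
    """Join tags group by group in dataset order; groups that are missing are skipped."""
    parts = []
    for group in TAG_ORDER:
        parts.extend(tags_by_group.get(group, []))
    # Trailing comma mirrors the prompts used in the code example (a style assumption).
    return ", ".join(parts) + ","


prompt = build_prompt(
    {
        "rest": ["1girl", "solo", "hoodie"],
        "quality": ["best quality"],
        "aesthetic": ["very aesthetic"],
        "date": ["newest"],
    }
)
print(prompt)
```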
### Date:

| tag | date |
|---|---|
| **newest** | 2022 to 2024 |
| **recent** | 2019 to 2021 |
| **mid** | 2015 to 2018 |
| **early** | 2011 to 2014 |
| **oldest** | 2005 to 2010 |

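
The date table can be read as a simple year lookup. A sketch, assuming the ranges are inclusive on both ends; `date_tag` is a hypothetical helper, and what happens for years outside the table is not specified by the source, so `None` is returned here:

```python
# (first_year, last_year, tag) rows copied from the date table.
DATE_TAGS = [
    (2022, 2024, "newest"),
    (2019, 2021, "recent"),
    (2015, 2018, "mid"),
    (2011, 2014, "early"),
    (2005, 2010, "oldest"),
]


def date_tag(year):
    """Return the date tag for a year, or None if the year is outside the table."""
    for first_year, last_year, tag in DATE_TAGS:
        if first_year <= year <= last_year:
            return tag
    return None


print(date_tag(2020))
```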
### Aesthetic Tags:
**Model used**: shadowlilac/aesthetic-shadow-v2

| score greater than | tag | count |
|---|---|---|
| **0.90** | extremely aesthetic | 125,451 |
| **0.80** | very aesthetic | 887,382 |
| **0.70** | aesthetic | 1,049,857 |
| **0.50** | slightly aesthetic | 1,643,091 |
| **0.40** | not displeasing | 569,543 |
| **0.30** | not aesthetic | 445,188 |
| **0.20** | slightly displeasing | 341,424 |
| **0.10** | displeasing | 237,660 |
| **rest of them** | very displeasing | 328,712 |

### Quality Tags:
**Model used**: https://huggingface.co/hakurei/waifu-diffusion-v1-4/blob/main/models/aes-B32-v0.pth

| score greater than | tag | count |
|---|---|---|
| **0.980** | best quality | 1,270,447 |
| **0.900** | high quality | 498,244 |
| **0.750** | great quality | 351,006 |
| **0.500** | medium quality | 366,448 |
| **0.250** | normal quality | 368,380 |
| **0.125** | bad quality | 279,050 |
| **0.025** | low quality | 538,958 |
| **rest of them** | worst quality | 1,955,966 |

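
Both the aesthetic and the quality table map a classifier score to a tag by thresholds. A minimal sketch of that mapping, assuming a strict "greater than" comparison as the column header says and checking thresholds from highest to lowest; `score_to_tag` is an illustrative name, and the scoring models themselves are not run here:

```python
# (threshold, tag) pairs copied from the tables, highest threshold first.
AESTHETIC_THRESHOLDS = [
    (0.90, "extremely aesthetic"),
    (0.80, "very aesthetic"),
    (0.70, "aesthetic"),
    (0.50, "slightly aesthetic"),
    (0.40, "not displeasing"),
    (0.30, "not aesthetic"),
    (0.20, "slightly displeasing"),
    (0.10, "displeasing"),
]
AESTHETIC_FALLBACK = "very displeasing"

QUALITY_THRESHOLDS = [
    (0.980, "best quality"),
    (0.900, "high quality"),
    (0.750, "great quality"),
    (0.500, "medium quality"),
    (0.250, "normal quality"),
    (0.125, "bad quality"),
    (0.025, "low quality"),
]
QUALITY_FALLBACK = "worst quality"


def score_to_tag(score, thresholds, fallback):
    """Return the tag of the first threshold the score exceeds, else the fallback tag."""
    for threshold, tag in thresholds:
        if score > threshold:
            return tag
    return fallback


print(score_to_tag(0.85, AESTHETIC_THRESHOLDS, AESTHETIC_FALLBACK))
print(score_to_tag(0.99, QUALITY_THRESHOLDS, QUALITY_FALLBACK))
```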
### Rating Tags:

| tag | count |
|---|---|
| **general** | 1,416,451 |
| **sensitive** | 3,447,664 |
| **nsfw** | 427,459 |
| **explicit nsfw** | 336,925 |

### Custom Tags:

| dataset name | custom tag |
|---|---|
| **image boards** | date, |
| **pixiv** | art by Display_Name, |
| **visual novel cg** | Full_VN_Name (short_3_letter_name), visual novel cg, |
| **anime wallpaper** | date, anime wallpaper, |

## Training Parameters:
**Software used**: Kohya SD-Scripts with Stable Cascade branch
https://github.com/kohya-ss/sd-scripts/tree/stable-cascade

**Base model**: Disty0/sote-diffusion-cascade-alpha0
### Command:
```shell
LD_PRELOAD=/usr/lib/libtcmalloc.so.4 accelerate launch --mixed_precision fp16 --num_cpu_threads_per_process 1 stable_cascade_train_stage_c.py \
    --mixed_precision fp16 \
    --save_precision fp16 \
    --full_fp16 \
    --sdpa \
    --gradient_checkpointing \
    --train_text_encoder \
    --resolution "1024,1024" \
    --train_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --learning_rate 1e-5 \
    --learning_rate_te1 1e-5 \
    --lr_scheduler constant_with_warmup \
    --lr_warmup_steps 100 \
    --optimizer_type adafactor \
    --optimizer_args "scale_parameter=False" "relative_step=False" "warmup_init=False" \
    --max_grad_norm 0 \
    --token_warmup_min 1 \
    --token_warmup_step 0 \
    --shuffle_caption \
    --caption_separator ", " \
    --caption_dropout_rate 0 \
    --caption_tag_dropout_rate 0 \
    --caption_dropout_every_n_epochs 0 \
    --dataset_repeats 1 \
    --save_state \
    --save_every_n_steps 256 \
    --sample_every_n_steps 64 \
    --max_token_length 225 \
    --max_train_epochs 1 \
    --caption_extension ".txt" \
    --max_data_loader_n_workers 2 \
    --persistent_data_loader_workers \
    --enable_bucket \
    --min_bucket_reso 256 \
    --max_bucket_reso 4096 \
    --bucket_reso_steps 64 \
    --bucket_no_upscale \
    --log_with tensorboard \
    --output_name sotediffusion-wr3_3b \
    --train_data_dir /mnt/DataSSD/AI/anime_image_dataset/combined/combined-0004/0005 \
    --in_json /mnt/DataSSD/AI/anime_image_dataset/combined/combined-0004/0005.json \
    --output_dir /mnt/DataSSD/AI/SoteDiffusion/Wuerstchen3/sotediffusion-wr3_3b-4/0005 \
    --logging_dir /mnt/DataSSD/AI/SoteDiffusion/Wuerstchen3/sotediffusion-wr3_3b-4/0005/logs \
    --resume /mnt/DataSSD/AI/SoteDiffusion/Wuerstchen3/sotediffusion-wr3_3b-4/0004/sotediffusion-wr3_3b-state \
    --stage_c_checkpoint_path /mnt/DataSSD/AI/SoteDiffusion/Wuerstchen3/sotediffusion-wr3_3b-4/0004/sotediffusion-wr3_3b.safetensors \
    --text_model_checkpoint_path /mnt/DataSSD/AI/SoteDiffusion/Wuerstchen3/sotediffusion-wr3_3b-4/0004/sotediffusion-wr3_3b_text_model.safetensors \
    --effnet_checkpoint_path /mnt/DataSSD/AI/models/wuerstchen3/effnet_encoder.safetensors \
    --previewer_checkpoint_path /mnt/DataSSD/AI/models/wuerstchen3/previewer.safetensors \
    --sample_prompts /mnt/DataSSD/AI/SoteDiffusion/Wuerstchen3/config/sotediffusion-prompt.txt
```

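
For orientation, the effective batch size implied by the command can be worked out from `--train_batch_size` and `--gradient_accumulation_steps`. A rough sketch, assuming a single GPU (as listed under Training Status) and a full 8000-image chunk; aspect-ratio bucketing can leave partial batches, so real step counts may differ slightly. The helper names are made up for illustration:

```python
def effective_batch_size(train_batch_size, gradient_accumulation_steps, num_gpus=1):
    """Images contributing to one optimizer update."""
    return train_batch_size * gradient_accumulation_steps * num_gpus


def optimizer_steps(num_images, batch_size):
    """Full optimizer updates for num_images; leftover images are ignored here."""
    return num_images // batch_size


# Values taken from the training command and the 8000-images-per-chunk note.
batch = effective_batch_size(train_batch_size=2, gradient_accumulation_steps=8)
steps_per_chunk = optimizer_steps(num_images=8000, batch_size=batch)

print("effective batch size:", batch)
print("optimizer steps per full chunk:", steps_per_chunk)
```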
## Limitations and Bias

### Bias

- This model is intended for anime illustrations.
  Realistic capabilities have not been tested at all.

### Limitations

- Output can fall back to a realistic style.
  Add the "realistic" tag to the negative prompt when this happens.
- Eyes in far shots can be bad.
- Anatomy and hands can be bad.
- Still in active training.

## License
(This part is copied directly from Animagine V3.1 and modified.)

SoteDiffusion models fall under the [Fair AI Public License 1.0-SD](https://freedevproject.org/faipl-1.0-sd/), which is compatible with the Stable Diffusion models’ license. Key points:

1. **Modification Sharing:** If you modify SoteDiffusion models, you must share both your changes and the original license.
2. **Source Code Accessibility:** If your modified version is network-accessible, provide a way (like a download link) for others to get the source code. This applies to derived models too.
3. **Distribution Terms:** Any distribution must be under this license or another with similar rules.
4. **Compliance:** Non-compliance must be fixed within 30 days to avoid license termination, emphasizing transparency and adherence to open-source values.

**Notes**: Anything not covered by the Fair AI license is inherited from the Stability AI Non-Commercial license, which is named LICENSE_INHERIT. This means there is still no commercial use of any kind.