SoteDiffusion Cascade

Anime finetune of Stable Cascade.
Training is currently at a very early stage.
No commercial use, per StabilityAI's license for Stable Cascade.

Code Example

pip install diffusers

import torch
from diffusers import StableCascadeDecoderPipeline, StableCascadePriorPipeline

prompt = "newest, 1girl, solo, cat ears, looking at viewer, blush, light smile,"
negative_prompt = "very displeasing, worst quality, monochrome, sketch, fat, child,"

prior = StableCascadePriorPipeline.from_pretrained("Disty0/sote-diffusion-cascade_alpha0", torch_dtype=torch.float16)
decoder = StableCascadeDecoderPipeline.from_pretrained("Disty0/sote-diffusion-cascade-decoder_alpha0", torch_dtype=torch.float16)

prior.enable_model_cpu_offload()
prior_output = prior(
    prompt=prompt,
    height=1024,
    width=1024,
    negative_prompt=negative_prompt,
    guidance_scale=7.0,
    num_images_per_prompt=1,
    num_inference_steps=40
)

decoder.enable_model_cpu_offload()
decoder_output = decoder(
    image_embeddings=prior_output.image_embeddings,
    prompt=prompt,
    negative_prompt=negative_prompt,
    guidance_scale=1.5,
    output_type="pil",
    num_inference_steps=10
).images[0]
decoder_output.save("cascade.png")

Training Status:

Alpha0 Release: This release resets the training and enables Text Encoder training.

GPU used for training: 1x AMD RX 7900 XTX 24GB

dataset name	training done	remaining
newest	000	230
recent	000	206
mid	000	201
early	000	055
oldest	000	016
pixiv	000	074
visual novel cg	000	070
anime wallpaper	000	013
Total	8	865

Note: chunks start from 0 and there are 8000 images per chunk

Dataset:

GPU used for captioning: 1x Intel ARC A770 16GB
Model used for captioning: SmilingWolf/wd-swinv2-tagger-v3
Command:

python /mnt/DataSSD/AI/Apps/kohya_ss/sd-scripts/finetune/tag_images_by_wd14_tagger.py --model_dir "/mnt/DataSSD/AI/models/wd14_tagger_model" --repo_id "SmilingWolf/wd-swinv2-tagger-v3" --recursive --remove_underscore --use_rating_tags --character_tags_first --character_tag_expand --append_tags --onnx --caption_separator ", " --general_threshold 0.35 --character_threshold 0.50 --batch_size 4 --caption_extension ".txt" ./

dataset name	total images	total chunk
newest	1.843.053	231
recent	1.652.420	207
mid	1.609.608	202
early	442.368	056
oldest	128.311	017
pixiv	594.046	075
visual novel cg	560.903	071
anime wallpaper	106.882	014
Total	6.937.591	873

Note: Smallest size is 1280x600 | 768.000 pixels
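
The chunk counts follow from the image totals and the 8000 images per chunk noted under Training Status. A minimal sketch, assuming each dataset is split on its own and the last chunk may be partial (the names below are illustrative, not from the training scripts):

```python
import math

IMAGES_PER_CHUNK = 8000  # from the note under Training Status

# Image totals copied from the table above.
total_images = {
    "newest": 1_843_053,
    "recent": 1_652_420,
    "mid": 1_609_608,
    "early": 442_368,
    "oldest": 128_311,
    "pixiv": 594_046,
    "visual novel cg": 560_903,
    "anime wallpaper": 106_882,
}

# Assumption: every dataset is chunked separately, rounding up for a partial last chunk.
chunks = {name: math.ceil(n / IMAGES_PER_CHUNK) for name, n in total_images.items()}

for name, count in chunks.items():
    # Chunks are numbered from 0, so the last index is count - 1.
    print(f"{name}: {count} chunks, last index {count - 1:03d}")
print("Total chunks:", sum(chunks.values()))
```

The per-dataset counts computed this way sum to 873, which matches the Total row.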

Tags:

aesthetic tags, quality tags, date tags, custom tags, rating tags, character tags, rest of the tags
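
A rough sketch of assembling a prompt from these tag groups. Treating the list above as the order tags appear in a caption is an assumption, and `build_prompt` is a hypothetical helper, not part of any library:

```python
# Group order as listed above; using it as the caption order is an assumption.
TAG_GROUP_ORDER = [
    "aesthetic", "quality", "date", "custom", "rating", "character", "rest",
]

def build_prompt(groups: dict[str, list[str]]) -> str:
    """Join tag groups in TAG_GROUP_ORDER, skipping groups that are missing or empty."""
    tags = []
    for group in TAG_GROUP_ORDER:
        tags.extend(groups.get(group, []))
    return ", ".join(tags)

prompt = build_prompt({
    "rest": ["1girl", "solo", "cat ears"],
    "date": ["newest"],
    "quality": ["best quality"],
    "aesthetic": ["very aesthetic"],
    "rating": ["general"],
})
print(prompt)
```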

Date:

tag	date
newest	2022 to 2024
recent	2019 to 2021
mid	2015 to 2018
early	2011 to 2014
oldest	2005 to 2010
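
The date table can be read as a lookup from year to tag. A minimal sketch, assuming both ends of each range are inclusive; `date_tag` is a hypothetical helper and years outside the table return None:

```python
from typing import Optional

# (first year, last year, tag), copied from the table above.
DATE_RANGES = [
    (2022, 2024, "newest"),
    (2019, 2021, "recent"),
    (2015, 2018, "mid"),
    (2011, 2014, "early"),
    (2005, 2010, "oldest"),
]

def date_tag(year: int) -> Optional[str]:
    """Return the date tag for a year, or None if the year is outside the table."""
    for first, last, tag in DATE_RANGES:
        if first <= year <= last:
            return tag
    return None

print(date_tag(2023))
print(date_tag(2012))
```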

Aesthetic Tags:

Model used: shadowlilac/aesthetic-shadow-v2

score greater than	tag
0.90	extremely aesthetic
0.80	very aesthetic
0.70	aesthetic
0.50	slightly aesthetic
0.40	not displeasing
0.30	not aesthetic
0.20	slightly displeasing
0.10	displeasing
rest of them	very displeasing

Quality Tags:

Model used: https://huggingface.co/hakurei/waifu-diffusion-v1-4/blob/main/models/aes-B32-v0.pth

score greater than	tag
0.980	best quality
0.900	high quality
0.750	great quality
0.500	medium quality
0.250	normal quality
0.125	bad quality
0.025	low quality
rest of them	worst quality
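
Both score tables (aesthetic and quality) work the same way: the first threshold the score is greater than decides the tag. A minimal sketch of that lookup, assuming a strict "greater than" comparison as the column header says; `score_to_tag` is a hypothetical helper, and producing the scores with the models themselves is not shown:

```python
# (threshold, tag) pairs copied from the tables above, highest threshold first.
AESTHETIC_THRESHOLDS = [
    (0.90, "extremely aesthetic"),
    (0.80, "very aesthetic"),
    (0.70, "aesthetic"),
    (0.50, "slightly aesthetic"),
    (0.40, "not displeasing"),
    (0.30, "not aesthetic"),
    (0.20, "slightly displeasing"),
    (0.10, "displeasing"),
]
QUALITY_THRESHOLDS = [
    (0.980, "best quality"),
    (0.900, "high quality"),
    (0.750, "great quality"),
    (0.500, "medium quality"),
    (0.250, "normal quality"),
    (0.125, "bad quality"),
    (0.025, "low quality"),
]

def score_to_tag(score, thresholds, fallback):
    """Return the tag of the first threshold the score exceeds, else the fallback."""
    for threshold, tag in thresholds:
        if score > threshold:
            return tag
    return fallback

print(score_to_tag(0.85, AESTHETIC_THRESHOLDS, "very displeasing"))
print(score_to_tag(0.01, QUALITY_THRESHOLDS, "worst quality"))
```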

Rating Tags:

general
sensitive
questionable
explicit

Custom Tags:

dataset name	custom tag
image boards	date,
pixiv	art by Display_Name,
visual novel cg	Full_VN_Name (short_3_letter_name), visual novel cg,
anime wallpaper	date, anime wallpaper,

Training Params:

Software used: Kohya SD-Scripts with Stable Cascade branch
Base model: Disty0/sote-diffusion-cascade_pre-alpha0

Command:

accelerate launch --mixed_precision fp16 --num_cpu_threads_per_process 1 stable_cascade_train_stage_c.py \
--mixed_precision fp16 \
--save_precision fp16 \
--full_fp16 \
--sdpa \
--gradient_checkpointing \
--train_text_encoder \
--resolution "1024,1024" \
--train_batch_size 2 \
--adaptive_loss_weight \
--learning_rate 4e-6 \
--lr_scheduler constant_with_warmup \
--lr_warmup_steps 100 \
--optimizer_type adafactor \
--optimizer_args "scale_parameter=False" "relative_step=False" "warmup_init=False" \
--max_grad_norm 0 \
--token_warmup_min 1 \
--token_warmup_step 0 \
--shuffle_caption \
--caption_dropout_rate 0 \
--caption_tag_dropout_rate 0 \
--caption_dropout_every_n_epochs 0 \
--dataset_repeats 1 \
--save_state \
--save_every_n_steps 2048 \
--sample_every_n_steps 512 \
--max_token_length 225 \
--max_train_epochs 1 \
--caption_extension ".txt" \
--max_data_loader_n_workers 2 \
--persistent_data_loader_workers \
--enable_bucket \
--min_bucket_reso 256 \
--max_bucket_reso 4096 \
--bucket_reso_steps 64 \
--bucket_no_upscale \
--log_with tensorboard \
--output_name sotediffusion-sc_3b \
--train_data_dir /mnt/DataSSD/AI/anime_image_dataset/combined/combined-0000 \
--in_json /mnt/DataSSD/AI/anime_image_dataset/combined/combined-0000.json \
--output_dir /mnt/DataSSD/AI/SoteDiffusion/StableCascade/sotediffusion-sc_3b-0 \
--logging_dir /mnt/DataSSD/AI/SoteDiffusion/StableCascade/sotediffusion-sc_3b-0/logs \
--resume /mnt/DataSSD/AI/SoteDiffusion/StableCascade/sotediffusion-sc_3b-step00020480-state \
--stage_c_checkpoint_path /mnt/DataSSD/AI/SoteDiffusion/StableCascade/sotediffusion-sc_3b-step00020480.safetensors \
--text_model_checkpoint_path /mnt/DataSSD/AI/SoteDiffusion/StableCascade/sotediffusion-sc_3b-step00020480_text_model.safetensors \
--effnet_checkpoint_path /mnt/DataSSD/AI/models/sd-cascade/effnet_encoder.safetensors \
--previewer_checkpoint_path /mnt/DataSSD/AI/models/sd-cascade/previewer.safetensors \
--sample_prompts /mnt/DataSSD/AI/SoteDiffusion/StableCascade/config/sotediffusion-prompt.txt
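
For a sense of scale, a back-of-the-envelope step count from the parameters above. This is a rough estimate only: it assumes a single GPU, no gradient accumulation (none is set in the command) and one repeat per image, and it ignores the extra partial batches that bucketing can produce:

```python
import math

IMAGES_PER_CHUNK = 8000   # from the note under Training Status
TOTAL_IMAGES = 6_937_591  # from the Dataset table
TRAIN_BATCH_SIZE = 2      # --train_batch_size

# One optimizer step per batch under the assumptions stated above.
steps_per_chunk = math.ceil(IMAGES_PER_CHUNK / TRAIN_BATCH_SIZE)
total_steps = math.ceil(TOTAL_IMAGES / TRAIN_BATCH_SIZE)

print("steps per full chunk:", steps_per_chunk)
print("steps for one pass over all images:", total_steps)
```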

Limitations and Bias

Bias

This model is intended for anime illustrations.
Realistic capabilities are not tested at all.
Still underbaked.

Limitations

Outputs can fall back to a realistic style.
Add the "realistic" tag to the negative prompt when this happens.
Eyes in far shots are still bad because of the heavy latent compression.
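
A small sketch of adding "realistic" to the negative prompt from the Code Example. `add_negative_tag` is a hypothetical helper, not part of diffusers:

```python
def add_negative_tag(negative_prompt: str, tag: str) -> str:
    """Append a tag to a comma-separated negative prompt unless it is already there."""
    tags = [t.strip() for t in negative_prompt.split(",") if t.strip()]
    if tag not in tags:
        tags.append(tag)
    # Keep the trailing comma style used by the prompts in the Code Example.
    return ", ".join(tags) + ","

negative_prompt = "very displeasing, worst quality, monochrome, sketch, fat, child,"
negative_prompt = add_negative_tag(negative_prompt, "realistic")
print(negative_prompt)
```

The result is then passed as negative_prompt to both the prior and the decoder.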
