SoteDiffusion Cascade

An anime finetune of Stable Cascade.
The model is currently at a very early stage of training.
Commercial use is not permitted due to StabilityAI's license terms.

Code Example

pip install diffusers transformers accelerate

import torch
from diffusers import StableCascadeDecoderPipeline, StableCascadePriorPipeline

prompt = "(extremely aesthetic, best quality, newest), 1girl, solo, cat ears, looking at viewer, blush, light smile, upper body,"
negative_prompt = "very displeasing, worst quality, monochrome, sketch, blurry, fat, child,"

# Load the prior and decoder pipelines in half precision (float16).
prior = StableCascadePriorPipeline.from_pretrained("Disty0/sote-diffusion-cascade_pre-alpha0", torch_dtype=torch.float16)
decoder = StableCascadeDecoderPipeline.from_pretrained("Disty0/sote-diffusion-cascade-decoder_pre-alpha0", torch_dtype=torch.float16)

# Stage 1: the prior turns the prompt into image embeddings.
# CPU offload keeps idle submodels off the GPU to reduce VRAM use.
prior.enable_model_cpu_offload()
prior_output = prior(
    prompt=prompt,
    height=1024,
    width=1024,
    negative_prompt=negative_prompt,
    guidance_scale=6.0,
    num_images_per_prompt=1,
    num_inference_steps=40
)

# Stage 2: the decoder turns the image embeddings into the final image.
decoder.enable_model_cpu_offload()
decoder_output = decoder(
    image_embeddings=prior_output.image_embeddings,
    prompt=prompt,
    negative_prompt=negative_prompt,
    guidance_scale=2.0,
    output_type="pil",
    num_inference_steps=10
).images[0]
decoder_output.save("cascade.png")

Training Status:

GPU used for training: 1x AMD RX 7900 XTX 24GB

dataset name	training done	remaining
newest	002	218
late	002	204
mid	002	199
early	002	053
oldest	002	014
pixiv	002	072
visual novel cg	002	068
anime wallpaper	002	011
Total	24	839

Note: chunk numbering starts from 0 (so 002 means three chunks are done), and there are 8000 images per chunk.

Dataset:

GPU used for captioning: 1x Intel ARC A770 16GB
Model used for captioning: SmilingWolf/wd-v1-4-convnextv2-tagger-v2

dataset name	total images	total chunks
newest	1.766.335	221
late	1.652.420	207
mid	1.609.608	202
early	442.368	056
oldest	128.311	017
pixiv	594.046	075
visual novel cg	560.903	071
anime wallpaper	106.882	014
Total	6.860.873	863

Note: The smallest image size is 1280x600 (768.000 pixels).
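
The chunk counts above follow from the image totals and the 8000-images-per-chunk note in Training Status. A small sketch of that arithmetic, assuming the last chunk of each dataset is simply left partially filled (the function names are illustrative and not part of the training scripts):

```python
import math

IMAGES_PER_CHUNK = 8000  # from the note in Training Status


def total_chunks(total_images: int) -> int:
    """Chunks needed for a dataset; the final chunk may be partially filled."""
    return math.ceil(total_images / IMAGES_PER_CHUNK)


def remaining_chunks(total_images: int, last_trained_chunk: int) -> int:
    """Chunks left to train. Chunk numbering starts at 0."""
    return total_chunks(total_images) - (last_trained_chunk + 1)


print(total_chunks(1_766_335))         # newest -> 221
print(remaining_chunks(1_766_335, 2))  # newest after chunk 002 -> 218
```

The same calculation reproduces every row of both tables, for example 442.368 images give 56 chunks for "early", of which 53 remain.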

Tags:

Tag order: aesthetic tags, quality tags, date tags, custom tags, then the rest of the tags

Date:

tag	date
newest	2022 to 2024
late	2019 to 2021
mid	2015 to 2018
early	2011 to 2014
oldest	2005 to 2010
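
As an illustration, the date table can be read as a simple year lookup. This is a minimal sketch; `date_tag` is a hypothetical helper, and returning `None` for years outside 2005 to 2024 is an assumption, since the table defines no tag for them:

```python
from typing import Optional

# (first year, last year, tag), copied from the table above
DATE_TAGS = [
    (2022, 2024, "newest"),
    (2019, 2021, "late"),
    (2015, 2018, "mid"),
    (2011, 2014, "early"),
    (2005, 2010, "oldest"),
]


def date_tag(year: int) -> Optional[str]:
    """Return the date tag for a year, or None if the table has no entry."""
    for first, last, tag in DATE_TAGS:
        if first <= year <= last:
            return tag
    return None  # assumption: years outside the table get no date tag


print(date_tag(2020))  # late
```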

Aesthetic Tags:

Model used: shadowlilac/aesthetic-shadow

score greater than	tag
0.980	extremely aesthetic
0.900	very aesthetic
0.750	aesthetic
0.500	slightly aesthetic
0.350	not displeasing
0.250	not aesthetic
0.125	slightly displeasing
0.025	displeasing
rest of them	very displeasing

Quality Tags:

Model used: https://huggingface.co/hakurei/waifu-diffusion-v1-4/blob/main/models/aes-B32-v0.pth

score greater than	tag
0.980	best quality
0.900	high quality
0.750	great quality
0.500	medium quality
0.250	normal quality
0.125	bad quality
0.025	low quality
rest of them	worst quality
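
Both threshold tables (aesthetic and quality) work the same way: the first threshold that the score exceeds decides the tag. A minimal sketch of that lookup, assuming a strict "greater than" comparison as the column header suggests; `score_to_tag` is a hypothetical helper, not the code used to tag the dataset:

```python
# (threshold, tag) pairs, highest threshold first, copied from the tables
AESTHETIC_TAGS = [
    (0.980, "extremely aesthetic"),
    (0.900, "very aesthetic"),
    (0.750, "aesthetic"),
    (0.500, "slightly aesthetic"),
    (0.350, "not displeasing"),
    (0.250, "not aesthetic"),
    (0.125, "slightly displeasing"),
    (0.025, "displeasing"),
]
QUALITY_TAGS = [
    (0.980, "best quality"),
    (0.900, "high quality"),
    (0.750, "great quality"),
    (0.500, "medium quality"),
    (0.250, "normal quality"),
    (0.125, "bad quality"),
    (0.025, "low quality"),
]


def score_to_tag(score: float, thresholds, fallback: str) -> str:
    """Return the tag of the first threshold that the score exceeds."""
    for threshold, tag in thresholds:
        if score > threshold:
            return tag
    return fallback  # the "rest of them" row


print(score_to_tag(0.60, AESTHETIC_TAGS, "very displeasing"))  # slightly aesthetic
print(score_to_tag(0.30, QUALITY_TAGS, "worst quality"))       # normal quality
```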

Custom Tags:

dataset name	custom tag
image boards	date,
pixiv	art by Display_Name,
visual novel cg	Full_VN_Name (short_3_letter_name), visual novel cg,
anime wallpaper	date, anime wallpaper,
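
The custom tag formats above can be expressed as string templates. The sketch below is only an illustration: `custom_tags` and its parameter names are hypothetical, the example values are made up, and it assumes "date" means one of the date tags from the Date table:

```python
def custom_tags(dataset: str, *, date: str = "", display_name: str = "",
                vn_name: str = "", vn_short: str = "") -> str:
    """Build the custom tag string for one image, following the table above."""
    if dataset == "image boards":
        return f"{date},"
    if dataset == "pixiv":
        return f"art by {display_name},"
    if dataset == "visual novel cg":
        return f"{vn_name} ({vn_short}), visual novel cg,"
    if dataset == "anime wallpaper":
        return f"{date}, anime wallpaper,"
    raise ValueError(f"unknown dataset: {dataset}")


# Hypothetical example values, not taken from the dataset
print(custom_tags("pixiv", display_name="Example_Artist"))
print(custom_tags("visual novel cg", vn_name="Example Visual Novel", vn_short="evn"))
print(custom_tags("anime wallpaper", date="newest"))
```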

Training Params:

Software used: Kohya SD-Scripts (Stable Cascade branch)
Base model: KBlueLeaf/Stable-Cascade-FP16-fixed

Command:

accelerate launch --mixed_precision fp16 --num_cpu_threads_per_process 1 stable_cascade_train_stage_c.py \
--mixed_precision fp16 \
--save_precision fp16 \
--full_fp16 \
--sdpa \
--gradient_checkpointing \
--resolution "1024,1024" \
--train_batch_size 2 \
--gradient_accumulation_steps 32 \
--adaptive_loss_weight \
--learning_rate 4e-6 \
--lr_scheduler constant_with_warmup \
--lr_warmup_steps 100 \
--optimizer_type adafactor \
--optimizer_args "scale_parameter=False" "relative_step=False" "warmup_init=False" \
--max_grad_norm 0 \
--token_warmup_min 1 \
--token_warmup_step 0 \
--shuffle_caption \
--caption_dropout_rate 0 \
--caption_tag_dropout_rate 0 \
--caption_dropout_every_n_epochs 0 \
--dataset_repeats 1 \
--save_state \
--save_every_n_steps 128 \
--sample_every_n_steps 32 \
--max_token_length 225 \
--max_train_epochs 1 \
--caption_extension ".txt" \
--max_data_loader_n_workers 2 \
--persistent_data_loader_workers \
--enable_bucket \
--min_bucket_reso 256 \
--max_bucket_reso 4096 \
--bucket_reso_steps 64 \
--bucket_no_upscale \
--log_with tensorboard \
--output_name sotediffusion-sc_3b \
--train_data_dir /mnt/DataSSD/AI/anime_image_dataset/combined/combined-0002 \
--in_json /mnt/DataSSD/AI/anime_image_dataset/combined/combined-0002.json \
--output_dir /mnt/DataSSD/AI/SoteDiffusion/StableCascade/sotediffusion-sc_3b-2 \
--logging_dir /mnt/DataSSD/AI/SoteDiffusion/StableCascade/sotediffusion-sc_3b-2/logs \
--resume /mnt/DataSSD/AI/SoteDiffusion/StableCascade/sotediffusion-sc_3b-1/sotediffusion-sc_3b-1-state \
--stage_c_checkpoint_path /mnt/DataSSD/AI/SoteDiffusion/StableCascade/sotediffusion-sc_3b-1/sotediffusion-sc_3b-1.safetensors \
--effnet_checkpoint_path /mnt/DataSSD/AI/models/sd-cascade/effnet_encoder.safetensors \
--previewer_checkpoint_path /mnt/DataSSD/AI/models/sd-cascade/previewer.safetensors \
--sample_prompts /mnt/DataSSD/AI/SoteDiffusion/StableCascade/sotediffusion-prompt.txt
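
For orientation, the batch settings in the command imply the following effective batch size and approximate number of optimizer steps per chunk. This is a back-of-the-envelope sketch that assumes a single GPU (as listed in Training Status) and full batches; with bucketing enabled the real step count can differ slightly:

```python
train_batch_size = 2              # --train_batch_size
gradient_accumulation_steps = 32  # --gradient_accumulation_steps
num_gpus = 1                      # 1x AMD RX 7900 XTX
images_per_chunk = 8000           # from the note in Training Status

# Images that contribute to a single optimizer update
effective_batch_size = train_batch_size * gradient_accumulation_steps * num_gpus

# Approximate optimizer updates for one pass over one chunk
optimizer_steps_per_chunk = images_per_chunk // effective_batch_size

print(effective_batch_size)       # 64
print(optimizer_steps_per_chunk)  # 125
```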

Limitations and Bias

Bias

This model is intended for anime illustrations.
Realistic capabilities are not tested at all.
The current version is biased toward older anime styles.

Limitations

The model can fall back to a realistic style.
Use the "anime illustration" tag to point it in the right direction.
Eyes in far shots come out badly because of the heavy latent compression.
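
One way to apply the "anime illustration" hint consistently is a small prompt helper. This is a sketch only: `build_prompt` is a hypothetical function, the parenthesized head mirrors the prompt from the Code Example section, and placing "anime illustration" right after the head is an assumption rather than a tested recommendation:

```python
def build_prompt(tags, aesthetic="extremely aesthetic", quality="best quality",
                 date="newest", force_anime=True) -> str:
    """Assemble a prompt in the style of the Code Example section."""
    head = f"({aesthetic}, {quality}, {date})"
    body = list(tags)
    # Add the style hint only if the caller has not already included it
    if force_anime and "anime illustration" not in body:
        body.insert(0, "anime illustration")
    return ", ".join([head] + body) + ","


print(build_prompt(["1girl", "solo", "cat ears"]))
# (extremely aesthetic, best quality, newest), anime illustration, 1girl, solo, cat ears,
```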
