metadata

language:
  - en
tags:
  - stable-cascade
  - SDXL
  - art
  - artstyle
  - fantasy
  - anime
  - aiart
  - ketengan
  - SomniumSC
pipeline_tag: text-to-image
library_name: diffusers

SomniumSC-v1 Model Showcase

Ketengan-Diffusion/SomniumSC-v1 is a fine-tuned Stage C Stable Cascade model, based on stabilityai/stable-cascade.

This is a fine-tune of Stability AI's new model, Stable Cascade (also known as Würstchen v3), trained on the 3.6B Stage C model to produce a 2D (cartoonish) style. The text encoder was also trained for the 2D style, so the model can be prompted with booru tags as well as natural language.

The model uses the same dataset size and method as AnySomniumXL v2: 33,000+ images curated from hundreds of thousands of images from various sources. The dataset keeps only images with an aesthetic score of at least 17 and at most 50. The scale is based on our proprietary aesthetic scoring mechanism, and the upper bound keeps the model cartoonish rather than too realistic. Images that score below 17 or above 50 are discarded, as are images that contain text or watermarks, such as signatures or comic/manga pages.

Training Process

SomniumSC v1 Technical Specifications:

Trained for 30 epochs (the SomniumSC results use epoch 30)

Captioned by a proprietary multimodal LLM that we consider better than LLaVA

Trained with a bucket size of 1024x1024

Shuffle Caption: Yes

Clip Skip: 0

Trained with 1x NVIDIA A100 80GB
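The "Shuffle Caption" setting above usually means that the comma-separated tags of each caption are reordered at random every time the sample is seen, so the model does not latch onto tag position. Below is a minimal sketch of that idea. It is not our training code: the function name, the `keep_first` option, and the example caption are hypothetical and only illustrate the technique.

```python
import random
from typing import Optional


def shuffle_caption(caption: str, keep_first: int = 0,
                    rng: Optional[random.Random] = None) -> str:
    """Shuffle comma-separated tags, optionally keeping the first few in place."""
    rng = rng or random.Random()
    # Split on commas and drop empty fragments.
    tags = [t.strip() for t in caption.split(",") if t.strip()]
    head, tail = tags[:keep_first], tags[keep_first:]
    rng.shuffle(tail)  # in-place shuffle of the non-fixed tags
    return ", ".join(head + tail)


# Hypothetical caption; "1girl" stays first, the rest are reordered.
example = "1girl, silver hair, forest, night sky"
print(shuffle_caption(example, keep_first=1, rng=random.Random(0)))
```

Passing a seeded `random.Random` makes the order reproducible, which helps when debugging a data loader.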

Our Dataset Curation Process

Image source: Source1 Source2 Source3

Our dataset is scored with the pretrained CLIP+MLP aesthetic scoring model from https://github.com/christophschuhmann/improved-aesthetic-predictor. We adjusted our script to detect text and watermarks using OCR via pytesseract.

This scoring method uses a scale from -1 to 100. We take a minimum threshold of around 17 to 20 and a maximum of 50 to 75 to retain the 2D style of the dataset. Any image with text receives a score of -1. Images with a score below 17 or above 65 are deleted.
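The filtering rule described here can be sketched roughly as follows. This is an illustration and not our actual script: it assumes the aesthetic score has already been computed by the CLIP+MLP predictor and that `ocr_text` is whatever string the OCR step returned (for example from pytesseract). All function and constant names are hypothetical.

```python
TEXT_PENALTY = -1.0  # score given to any image where OCR found text
MIN_SCORE = 17.0     # lower cutoff stated above
MAX_SCORE = 65.0     # upper cutoff stated above


def final_score(aesthetic_score: float, ocr_text: str) -> float:
    """Return -1 when OCR detected any text, otherwise the aesthetic score."""
    if ocr_text.strip():
        return TEXT_PENALTY
    return aesthetic_score


def keep_image(aesthetic_score: float, ocr_text: str,
               min_score: float = MIN_SCORE,
               max_score: float = MAX_SCORE) -> bool:
    """Keep an image only if its final score lies inside [min_score, max_score]."""
    score = final_score(aesthetic_score, ocr_text)
    return min_score <= score <= max_score


print(keep_image(30.0, ""))           # clean image inside the range
print(keep_image(30.0, "signature"))  # text detected, so it scores -1
print(keep_image(80.0, ""))           # too realistic, above the upper cutoff
```

Because images with text are forced to -1, they always fall below the minimum and need no separate deletion step.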

The dataset curation process runs on an NVIDIA T4 16GB machine and takes about 7 days to curate 1,000,000 images.
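As a quick sanity check, the figures above work out to roughly 1.65 images per second. The small calculation below shows the arithmetic and gives a rough time estimate for other dataset sizes. It assumes the time scales linearly with image count on the same hardware, which is only an approximation. The 250,000-image example is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def images_per_second(num_images: int, days: float) -> float:
    """Average curation throughput for a given image count and duration."""
    return num_images / (days * SECONDS_PER_DAY)


def estimated_days(num_images: int, rate: float) -> float:
    """Days needed to curate num_images at a given images-per-second rate."""
    return num_images / (rate * SECONDS_PER_DAY)


rate = images_per_second(1_000_000, 7)
print(f"{rate:.2f} images/s")
print(f"{estimated_days(250_000, rate):.2f} days for 250,000 images")
```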

Captioning process

We use a combination of a proprietary multimodal LLM and open-source multimodal LLMs such as LLaVA 1.5 for captioning, which yields more detailed captions than plain BLIP2. Details such as clothing, atmosphere, situation, scene, place, gender, skin, and others are generated by the LLM.

Tagging Process

We simply use booru tags retrieved from booru boards. These tags are applied manually by humans, which makes them more accurate.

Limitations:

✓ Still requires training on a broader dataset for more variation in poses and styles

✓ Text is not generated correctly and appears garbled

✓ The model is optimized for human or mutated-human generation. Non-human subjects such as SCP, ponies, and others may not turn out as expected

✓ Faces may look compressed. Generating the image at 1536px may give better results

A smaller half-size version and a Stable Cascade lite version will be released soon

How to use SomniumSC:

Currently, Stable Cascade is supported only by ComfyUI.

You can follow the tutorial here or here

To make it easier to pick the right files, here is where to download each model directly:

For stage A, download from the official stabilityai/stable-cascade repo.

For stage B, download from the official stabilityai/stable-cascade repo.

For stage C, download the safetensors file from our Hugging Face repo, under the Files tab.

For the text encoder, download it from the text_encoder folder of our Hugging Face repo.
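If you prefer to fetch the files from a script instead of the browser, the download steps above could be sketched with `huggingface_hub.snapshot_download`. The repo IDs are the ones named in this card. The file patterns and the `models` output folder are assumptions, so check the Files tab of each repo and adjust them first. The import is done inside the function so the plan can be inspected without the library installed.

```python
from typing import Dict, List

# Repo IDs come from this card. The glob patterns are ASSUMPTIONS about the
# file names; narrow them to the exact variant you need, since broad patterns
# may pull several multi-gigabyte files.
DOWNLOAD_PLAN: Dict[str, List[str]] = {
    "stabilityai/stable-cascade": ["stage_a*.safetensors", "stage_b*.safetensors"],
    "Ketengan-Diffusion/SomniumSC-v1": ["*.safetensors", "text_encoder/*"],
}


def download_all(local_root: str = "models") -> Dict[str, str]:
    """Download files matching DOWNLOAD_PLAN and return repo id -> local folder."""
    from huggingface_hub import snapshot_download  # requires `pip install huggingface_hub`

    paths: Dict[str, str] = {}
    for repo_id, patterns in DOWNLOAD_PLAN.items():
        paths[repo_id] = snapshot_download(
            repo_id=repo_id,
            allow_patterns=patterns,
            local_dir=f"{local_root}/{repo_id.split('/')[-1]}",
        )
    return paths


# Uncomment to actually download (needs network access and disk space):
# for repo, path in download_all().items():
#     print(repo, "->", path)
```

After downloading, move the files into the folders that your ComfyUI setup expects, as described in the tutorials.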

SomniumSC Pro tips:

A negative prompt is a must for better-quality output. The recommended negative prompt is: lowres, bad anatomy, bad hands, text, error, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality, normal quality, jpeg artifacts, signature, watermark, username, blurry, artist name

If the model produces unwanted pointy ears on the character, just add elf or pointy ears to the negative prompt.

If the model produces a "compressed face", use a 1536px resolution so the model can render the face clearly.
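If you build prompts in a script, the tips above can be wrapped in a small helper. This is only a convenience sketch: the function name is hypothetical, and it assumes the extra tags from the pointy-ears tip belong in the negative prompt.

```python
from typing import Iterable

# Recommended negative prompt, copied from the tip above.
RECOMMENDED_NEGATIVE = (
    "lowres, bad anatomy, bad hands, text, error, missing fingers, extra digit, "
    "fewer digits, cropped, worst quality, low quality, normal quality, "
    "jpeg artifacts, signature, watermark, username, blurry, artist name"
)


def build_negative_prompt(extra_tags: Iterable[str] = ()) -> str:
    """Append extra tags to the recommended negative prompt, skipping duplicates."""
    tags = [t.strip() for t in RECOMMENDED_NEGATIVE.split(",")]
    for tag in extra_tags:
        tag = tag.strip()
        if tag and tag not in tags:
            tags.append(tag)
    return ", ".join(tags)


# Example: suppress unwanted pointy ears.
print(build_negative_prompt(["elf", "pointy ears"]))
```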

Disclaimer:

This model is released under the STABILITY AI NON-COMMERCIAL RESEARCH COMMUNITY LICENSE, which means the model cannot be sold and derivative works cannot be commercialized. As far as I know, the exception is that you can buy a Stability AI membership here to commercialize derivative works based on this model. Please support Stability AI so they can keep providing open-source models for us. You can still merge our model freely.