BPModel / README.md
Crosstyan
language:
  - en
license: creativeml-openrail-m
tags:
  - stable-diffusion
  - stable-diffusion-diffusers
  - text-to-image
  - diffusers
inference: true
widget:
  - text: >-
      1girl with blonde two side up disheveled hair red eyes in black serafuku
      red ribbon, upper body, simple background, grey background, collarbone
    example_title: example 1girl

BPModel

BPModel is an experimental Stable Diffusion model based on ACertainty from Joseph Cheung.

Why does this model even exist? There are loads of Stable Diffusion models out there, especially anime-style models. Well, have any of them been trained with a base resolution (base_res) of 768 or even 1024 before? I don't think so. Here it is, the BPModel, a Stable Diffusion model you may love or hate. It was trained on 5k high-quality images that suit my taste (not necessarily yours, unfortunately) from Sankaku Complex, with annotations. This is not the best strategy, since a pure combination of tags may not be the optimal way to describe an image, but it saves me extra work. And no, I won't feed any AI-generated images to the model, even if that might get the model outlawed in some countries.

Training a high-resolution model requires a significant number of GPU hours and can be costly. In this case, 10 V100 GPU hours were spent training at a resolution of 512 and 60 V100 GPU hours at 768. A further 50 V100 GPU hours were spent training at 1024, although only 10 epochs were run. The 1024 model did not show a significant improvement over the 768 model, and its resource demands were high: only a batch size of 1 fit on a V100 with 32G of VRAM. Training at 768 did, however, yield better results than training at 512, so it is worth considering. It is worth noting that Stable Diffusion 2.x also chose to train at a resolution of 768. Still, it may be more efficient to start training at 512, because training at 768 is slower and needs additional prior knowledge to speed it up.

Mikubill/naifu-diffusion was used as the training script, and I also recommend checking out CCRcmcpe/scal-sdt.

The aspect ratio bucket (ARB) configurations for the 768 and 1024 resolutions are presented here.

```yaml
# 768
arb:
  enabled: true
  debug: false
  base_res: [768, 768]
  max_size: [1152, 768]
  divisible: 64
  max_ar_error: 4
  min_dim: 512
  dim_limit: 1792
```

```yaml
# 1024
arb:
  enabled: true
  debug: false
  base_res: [1024, 1024]
  max_size: [1536, 1024]
  divisible: 64
  max_ar_error: 4
  min_dim: 960
  dim_limit: 2389
```
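
To give a feel for what these numbers imply, here is a minimal Python sketch of how aspect ratio buckets could be enumerated from such a config. It is not the naifu-diffusion implementation. It assumes that `max_size` acts as a pixel-area budget, that every side is `min_dim` plus a multiple of `divisible`, and that `max_ar_error` is the largest tolerated aspect ratio difference when assigning an image to a bucket. The function names `enumerate_buckets` and `nearest_bucket` are made up for illustration.

```python
from typing import List, Optional, Sequence, Tuple

Bucket = Tuple[int, int]  # (width, height)


def enumerate_buckets(base_res: Sequence[int], max_size: Sequence[int],
                      divisible: int, min_dim: int, dim_limit: int) -> List[Bucket]:
    """List candidate bucket resolutions (illustrative, not the trainer's code)."""
    # Assumption: max_size is treated as an area budget, not a hard width/height cap.
    max_area = max_size[0] * max_size[1]
    buckets = {(base_res[0], base_res[1])}  # always keep the square base bucket
    w = min_dim
    while w <= dim_limit and w * min_dim <= max_area:
        # Grow the height in steps of `divisible` while it still fits the budget.
        h = min_dim
        while h + divisible <= dim_limit and w * (h + divisible) <= max_area:
            h += divisible
        buckets.add((w, h))
        buckets.add((h, w))  # mirrored orientation
        w += divisible
    return sorted(buckets)


def nearest_bucket(buckets: Sequence[Bucket], width: int, height: int,
                   max_ar_error: float) -> Optional[Bucket]:
    """Pick the bucket whose aspect ratio is closest to the image's."""
    ar = width / height
    best = min(buckets, key=lambda b: abs(b[0] / b[1] - ar))
    if abs(best[0] / best[1] - ar) > max_ar_error:
        return None  # image is too far from every bucket
    return best


# Values taken from the 768 config above.
buckets_768 = enumerate_buckets([768, 768], [1152, 768], 64, 512, 1792)
```

Under these assumptions, a 2:3 portrait image such as 1000x1500 would be assigned to the 768x1152 bucket, which is the portrait counterpart of the `max_size` entry in the 768 config.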

Limitations

The limitations described in the SCAL-SDT Wiki still apply.

SD cannot generate human body properly, like generating 6 fingers on one hand.

BPModel can generate a more proper kitty cat (if you know what I mean) than other anime models, but it's still not perfect. As the results presented in "Diffusion Art or Digital Forgery? Investigating Data Replication in Diffusion Models" show, the copy-and-paste effect is still observed.

Anything v3™ has proven to be the most popular anime model in the community, but it's not perfect either, as described in JosephusCheung/ACertainThing:

It does not always stay true to your prompts; it adds irrelevant details, and sometimes these details are highly homogenized.

BPModel, which has been fine-tuned on a relatively small dataset, is inherently prone to overfitting. This is not surprising given the size of the dataset, but the strong prior knowledge of ACertainty (full Danbooru) and Stable Diffusion (LAION) helps to minimize the impact of overfitting. However, I believe it would perform better than some artist-style DreamBooth models that are trained with only a few hundred images or even fewer. I also oppose changing style by merging models, since you can apply a different style by training with proper captions and prompting.

Besides, some of the images in my dataset have the artist name in the caption, but some artist names are misinterpreted by CLIP when tokenizing. For example, as109 will be tokenized as [as, 1, 0, 9] and fuzichoco will become [fu, z, ic, hoco]. Romanized Japanese suffers from this problem a lot, and I don't have a good solution other than changing the artist name in the caption, which is time-consuming, and there's no guarantee that the token you choose is unique enough. Remember the sks?
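
If you want to check how a given artist name gets split, a small helper along these lines may help. This is a sketch that assumes the Hugging Face `transformers` package and the `openai/clip-vit-large-patch14` tokenizer (the one used by Stable Diffusion 1.x text encoders; that this applies to BPModel is an assumption here). The helper names `load_clip_tokenizer`, `split_into_tokens` and `fragmented` are made up for illustration.

```python
from typing import Dict, Iterable, List


def load_clip_tokenizer(name: str = "openai/clip-vit-large-patch14"):
    """Load a CLIP tokenizer (downloads files on first use)."""
    # Imported lazily so the helpers below work without transformers installed.
    from transformers import CLIPTokenizer
    return CLIPTokenizer.from_pretrained(name)


def split_into_tokens(tokenizer, names: Iterable[str]) -> Dict[str, List[str]]:
    """Map each name to its sub-word pieces, dropping CLIP's '</w>' end marker."""
    return {
        name: [piece.replace("</w>", "") for piece in tokenizer.tokenize(name)]
        for name in names
    }


def fragmented(token_map: Dict[str, List[str]], max_pieces: int = 1) -> List[str]:
    """Return the names that were split into more than `max_pieces` pieces."""
    return [name for name, pieces in token_map.items() if len(pieces) > max_pieces]


# Example usage (needs network access for the first download):
#   tok = load_clip_tokenizer()
#   print(split_into_tokens(tok, ["as109", "fuzichoco"]))
```

Running `fragmented` over all artist names in a caption set gives a quick list of the names that are likely to be diluted across several tokens.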

The language drift problem still exists. There's nothing I can do unless I find a way to generate better captions or caption the images manually. OFA combined with convnext-tagger could provide better results for SFW content. However, fine-tuning is necessary for NSFW content, which I don't think anyone would like to do. (Could Unstable Diffusion give us a surprise?)

Cherry Picked Samples

Here are some cherry-picked samples.

orange

1girl in black serafuku standing in a field solo, food, fruit, lemon, bubble, planet, moon, orange \(fruit\), lemon slice, leaf, fish, orange slice, by (tabi:1.25), spot color, looking at viewer, closeup cowboy shot
Negative prompt: (bad:0.81), (comic:0.81), (cropped:0.81), (error:0.81), (extra:0.81), (low:0.81), (lowres:0.81), (speech:0.81), (worst:0.81), (blush:0.9), 2koma, 3koma, 4koma, collage, lipstick
Steps: 18, Sampler: DDIM, CFG scale: 7, Seed: 2017390109, Size: 768x1600, Model hash: fed5b383, Batch size: 4, Batch pos: 1, Denoising strength: 0.7, Clip skip: 2, ENSD: 31337, First pass size: 0x0

icecream

[sketch:0.75] [(oil painting:0.5)::0.75] by (fuzichoco:0.8) shion (fkey:0.9), fang solo cat ears nekomimi girl with multicolor streaked messy hair blue [black|blue] long hair bangs blue eyes in blue sailor collar school uniform serafuku short sleeves hand on own cheek hand on own face sitting, upper body, strawberry sweets ice cream food fruit spoon orange parfait
Negative prompt: (bad:0.98), (normal:0.98), (comic:0.81), (cropped:0.81), (error:0.81), (extra:0.81), (low:0.81), (lowres:1), (speech:0.81), (worst:0.81), 2koma, 3koma, 4koma, collage, lipstick
Steps: 40, Sampler: Euler a, CFG scale: 8, Seed: 910302581, Size: 960x1600, Model hash: fed5b383, Batch size: 4, Batch pos: 2, Denoising strength: 0.7, Clip skip: 2, ENSD: 31337, First pass size: 0x0

girl

(best:0.7), highly detailed,1girl,upper body,beautiful detailed eyes, medium_breasts, long hair,grey hair, grey eyes, curly hair, bangs,empty eyes,expressionless,twintails, beautiful detailed sky, beautiful detailed water, [cinematic lighting:0.6], upper body, school uniform,black ribbon,light smile
Negative prompt: (bad:0.98), (normal:0.98), (comic:0.81), (cropped:0.81), (error:0.81), (extra:0.81), (low:0.81), (lowres:1), (speech:0.81), (worst:0.81), 2koma, 3koma, 4koma, collage, lipstick
Steps: 40, Sampler: Euler, CFG scale: 8.5, Seed: 2311603025, Size: 960x1600, Model hash: fed5b383, Batch size: 4, Batch pos: 3, Denoising strength: 0.7, Clip skip: 2, ENSD: 31337, First pass size: 0x0

I don't think other models can do that.

middle_f

by [shion (fkey:0.9):momoko \(momopoco\):0.15], fang solo cat ears nekomimi girl with multicolor streaked messy hair blue [black|blue] long hair bangs blue eyes in blue sailor collar school uniform serafuku short sleeves hand on own cheek (middle finger:1.1) sitting, upper body, strawberry sweets ice cream food fruit spoon orange parfait
Negative prompt: (bad:0.98), (normal:0.98), (comic:0.81), (cropped:0.81), (error:0.81), (extra:0.81), (low:0.81), (lowres:1), (speech:0.81), (worst:0.81), 2koma, 3koma, 4koma, collage, lipstick
Steps: 40, Sampler: Euler a, CFG scale: 8, Seed: 2496891010, Size: 960x1600, Model hash: fed5b383, Batch size: 4, Batch pos: 1, Denoising strength: 0.7, Clip skip: 2, ENSD: 31337, First pass size: 0x0

middle_f_2

by [shion (fkey:0.9):momoko \(momopoco\):0.55], closed mouth fang solo cat ears nekomimi girl with multicolor streaked messy hair blue [black|blue] long hair bangs blue eyes in blue sailor collar school uniform serafuku short sleeves (middle finger:1.1) sitting, upper body, strawberry sweets ice cream food fruit spoon orange parfait
Negative prompt: (bad:0.98), (normal:0.98), (comic:0.81), (cropped:0.81), (error:0.81), (extra:0.81), (low:0.81), (lowres:1), (speech:0.81), (worst:0.81), 2koma, 3koma, 4koma, collage, lipstick, (chibi:0.8)
Steps: 40, Sampler: Euler a, CFG scale: 8, Seed: 2668993375, Size: 960x1600, Model hash: fed5b383, Batch size: 4, Batch pos: 3, Denoising strength: 0.7, Clip skip: 2, ENSD: 31337, First pass size: 0x0

Usage

bp_1024_e10.ckpt doesn't include a VAE, so you should use another popular VAE from the community when using it with AUTOMATIC1111/stable-diffusion-webui, or you will see the LaTeNt SpAcE!

Use bp_1024_with_vae_te.ckpt if you don't have a VAE and text encoder at hand. Note that it still doesn't include the EMA weights, and it's fp16.

If you want to continue training, use bp_1024_e10_ema.ckpt, which contains the EMA weights in fp32 precision.

For better results, it is strongly recommended to set Clip skip (CLIP stop at last layers) to 2.
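
For users of the `diffusers` library rather than the webui, the following is a rough sketch of the equivalent setup. It assumes that the `Crosstyan/BPModel` repository ships diffusers-format weights (suggested by the `diffusers` tag, but not verified here), that a CUDA GPU is available, and that your `diffusers` version supports the `clip_skip` call argument. Note the off-by-one: in `diffusers`, `clip_skip=1` selects the pre-final CLIP layer, which corresponds to Clip skip 2 in the webui. The helper names are made up for illustration, and the step count, CFG scale and size are borrowed from the samples and ARB config above.

```python
from typing import Optional


def webui_clip_skip_to_diffusers(webui_clip_skip: int) -> Optional[int]:
    """Translate the webui 'Clip skip' value to the diffusers `clip_skip` argument."""
    if webui_clip_skip < 1:
        raise ValueError("webui Clip skip starts at 1")
    # webui 1 = final layer (diffusers: None), webui 2 = pre-final layer (diffusers: 1)
    return None if webui_clip_skip == 1 else webui_clip_skip - 1


def generate(prompt: str, negative_prompt: str = "", webui_clip_skip: int = 2,
             repo_id: str = "Crosstyan/BPModel"):
    """Generate one image; repo_id and the settings below are assumptions."""
    # Heavy imports are kept local so the helper above stays importable anywhere.
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(repo_id, torch_dtype=torch.float16)
    pipe = pipe.to("cuda")
    result = pipe(
        prompt,
        negative_prompt=negative_prompt,
        width=768,
        height=1152,
        num_inference_steps=18,
        guidance_scale=7.0,
        clip_skip=webui_clip_skip_to_diffusers(webui_clip_skip),
    )
    return result.images[0]
```

Keep in mind that the webui prompt syntax used in the samples, such as `(tabi:1.25)` or `[black|blue]`, is not interpreted by a plain `diffusers` pipeline, so results will differ from the cherry-picked images unless the prompts are adapted.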

About the Model Name

I asked ChatGPT what a proper explanation of the abbreviation BP could be.

Here are a few more ideas for creative interpretations of the abbreviation "BP":

- Brightest Point - This could refer to a moment of exceptional brilliance or clarity.
- Brainpower - the abbreviation refers to something that requires a lot of mental effort or intelligence to accomplish.
- Bespoke Partition - A custom made section that separates two distinct areas.
- Bukkake Picchi - A Japanese style of rice dish.
- Bokeh Picker - A traditional Japanese photography technique that involves selecting points of light from a digital image.
- Bipolarity - Two distinct and opposing elements or perspectives.

Note that "BP" is often used as an abbreviation for "blood pressure," so it is important to context to determine the most appropriate interpretation of the abbreviation.

Personally, I would call it "Big Pot".

License

This model is open access and available to all, with a CreativeML OpenRAIL-M license further specifying rights and usage. The CreativeML OpenRAIL License specifies:

  1. You can't use the model to deliberately produce nor share illegal or harmful outputs or content
  2. The authors claim no rights on the outputs you generate; you are free to use them and are accountable for their use, which must not go against the provisions set in the license
  3. You may re-distribute the weights and use the model commercially and/or as a service. If you do, please be aware you have to include the same use restrictions as the ones in the license and share a copy of the CreativeML OpenRAIL-M to all your users (please read the license entirely and carefully)

Please read the full license here