File size: 14,058 Bytes
d893fb0
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
5aca794
 
b946723
 
33fc9c3
e7d19f4
4e33905
 
e7d19f4
b946723
 
 
5aca794
 
d893fb0
5aca794
 
a00bc39
30b3c66
1cee318
a00bc39
 
 
5aca794
ea4306f
 
 
 
 
 
 
 
 
 
 
 
 
 
 
5aca794
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
33fc9c3
 
5aca794
 
 
 
 
e7d19f4
5aca794
 
 
 
 
 
 
 
d893fb0
5aca794
 
 
 
 
 
 
30b3c66
5aca794
 
 
 
 
 
 
 
 
 
 
 
 
 
 
b83e690
 
33fdca1
b83e690
4c294cc
 
 
 
 
 
 
 
5aca794
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
33fc9c3
e7d19f4
5aca794
 
d893fb0
 
 
fd1f65b
 
 
b946723
7395ca4
 
bd194f9
33fdca1
bd194f9
5aca794
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
---
language:
- en
license: creativeml-openrail-m
tags:
- stable-diffusion
- stable-diffusion-diffusers
- text-to-image
- diffusers
inference: true
widget:
- text: "1girl with blonde two side up disheveled hair red eyes in black serafuku red ribbon, upper body, simple background, grey background, collarbone"
  example_title: "example 1girl"
---


# BPModel

## Update

**2023-01-02:** I wasted more GPU hours to train it a little bit more overfitting. Check out [bp_mk3.safetensors](bp_mk3.safetensors) and [bp_mk5.safetensors](bp_mk5.safetensors). Prepare yourself own VAE! Update your WebUI if you can't load [safetensors](https://github.com/huggingface/safetensors). Adds lots of samples in `images` folder!

**2023-01-06:** Checkout [NMFSAN](NMFSAN/README.md) for a new model trained with custom embeddings.

![car](images/00976-3769766671_20221226155509.png)

## Introduction

BPModel is an experimental Stable Diffusion model based on [ACertainty](https://huggingface.co/JosephusCheung/ACertainty) from [Joseph Cheung](https://huggingface.co/JosephusCheung). 

Why is the Model even existing? There are loads of Stable Diffusion model out there, especially anime style models.
Well, is there any models trained with resolution base resolution (`base_res`) 768 even 1024 before? Don't think so.
Here it is, the BPModel, a Stable Diffusion model you may love or hate.
Trained with 5k high quality images that suit my taste (not necessary yours unfortunately) from [Sankaku Complex](https://chan.sankakucomplex.com) with annotations. 
The dataset is public in [Crosstyan/BPDataset](https://huggingface.co/datasets/Crosstyan/BPDataset) for the sake of full disclosure .
Pure combination of tags may not be the optimal way to describe the image, 
but I don't need to do extra work. 
And no, I won't feed any AI generated image
to the model even it might outlaw the model from being used in some countries.

The training of a high resolution model requires a significant amount of GPU
hours and can be costly. In this particular case, 10 V100 GPU hours were spent
on training 30 epochs with a resolution of 512, while 60 V100 GPU hours were spent
on training 30 epochs with a resolution of 768. An additional 100 V100 GPU hours
were also spent on training a model with a resolution of 1024, although **ONLY** 10
epochs were run. The results of the training on the 1024 resolution model did
not show a significant improvement compared to the 768 resolution model, and the
resource demands, achieving a batch size of 1 on a V100 with 32G VRAM, were
high. However, training on the 768 resolution did yield better results than
training on the 512 resolution, and it is worth considering as an option. It is
worth noting that Stable Diffusion 2.x also chose to train on a 768 resolution
model. However, it may be more efficient to start with training on a 512
resolution model due to the slower training process and the need for additional
prior knowledge to speed up the training process when working with a 768
resolution.

[Mikubill/naifu-diffusion](https://github.com/Mikubill/naifu-diffusion) is used as training script and I also recommend to
checkout [CCRcmcpe/scal-sdt](https://github.com/CCRcmcpe/scal-sdt).

The configuration for 1024 and 768 resolution with aspect ratio bucket is presented here. 

```yaml
# 768
arb:
  enabled: true
  debug: false
  base_res: [768, 768]
  max_size: [1152, 768]
  divisible: 64
  max_ar_error: 4
  min_dim: 512
  dim_limit: 1792

# 1024
arb:
  enabled: true
  debug: false
  base_res: [1024, 1024]
  max_size: [1536, 1024]
  divisible: 64
  max_ar_error: 4
  min_dim: 960
  dim_limit: 2389
```

## Limitation

![building](images/00167-4082916932_20230102081230.png)

The limitation described in [SCAL-SDT Wiki](https://github.com/CCRcmcpe/scal-sdt/wiki#what-you-should-expect)
is still applied.

> SD cannot generate human body properly, like generating 6 fingers on one hand. 

BPModel can generate [more proper kitty cat](https://twitter.com/crosstyan/status/1606026536246685696) (if you know what I mean) than other anime model, 
but it's still not perfect. As results presented in [Diffusion Art or Digital Forgery? Investigating Data Replication in Diffusion Models](https://arxiv.org/abs/2212.03860), the copy and paste effect is still observed.

Anything v3™ has been proven to be the most popular anime model in the community, but it's not perfect either as
described in [JosephusCheung/ACertainThing](https://huggingface.co/JosephusCheung/ACertainThing)

> It does not always stay true to your prompts; it adds irrelevant details, and sometimes these details are highly homogenized. 

BPModel, which has been fine-tuned on a relatively small dataset, is prone
to overfit inherently. This is not surprising given the size of the dataset, but the
strong prior knowledge of ACertainty (full Danbooru) and Stable Diffusion
(LAION) helps to minimize the impact of overfitting.
However I believe it would perform
better than some artist style DreamBooth model which only train with a few
hundred images or even less. I also oppose changing style by merging model since You
could apply different style by training with proper captions and prompting.

Besides some of images in my dataset have the artist name in the caption, however some artist name will
be misinterpreted by CLIP when tokenizing. For example, *as109* will be tokenized as `[as, 1, 0, 9]` and
*fuzichoco* will become `[fu, z, ic, hoco]`. Romanized Japanese suffers from the problem a lot and
I don't have a good solution to fix it other than changing the artist name in the caption, which is
time consuming and you can't promise the token you choose is unique enough. [Remember the sks?](https://www.reddit.com/r/StableDiffusion/comments/yju5ks/from_one_of_the_original_dreambooth_authors_stop/)

Language drift problem is still exist. There's nothing I can do unless I can find a way to
generate better caption or caption the image manually. [OFA](https://github.com/OFA-Sys/OFA) 
combined with [convnext-tagger](https://huggingface.co/SmilingWolf/wd-v1-4-convnext-tagger) could
provide a better result for SFW content. However fine tune is necessary for NSFW content, which
I don't think anyone would like to do. (Could Unstable Diffusion give us surprise?)

## Cherry Picked Samples

Here're some **cherry picked** samples.

I were using [xformers](https://github.com/facebookresearch/xformers) when generating these sample 
and it might yield slight different result even with the same seed (welcome to the non deterministic field).
"`Upscale latent space image when doing hires. fix`" is enabled also.

![sunset](images/00121-4236324744_20230102073128.png)

```txt
by (fkey:1) (shion:0.4) [sketch:0.75] (closed mouth expressionless:1) cat ears nekomimi 1girl, wearing a white sailor uniform with a short skirt and white pantyhose standing on the deck of a yacht, cowboy shot, and the sun setting behind her in the background, light particle, bokeh
Negative prompt: lowres, bad anatomy, bad hands, text, error, missing fingers, extra digit, worst quality, low quality, normal quality, lipstick, 2koma, 3koma, dutch angle, blush, from behind
Steps: 28, Sampler: Euler a, CFG scale: 12, Seed: 4236324744, Size: 960x1600, Model hash: 855959a4, Denoising strength: 0.7, Clip skip: 2, ENSD: 31337, First pass size: 0x0
```

![orange](images/00317-2017390109_20221220015645.png)

```txt
1girl in black serafuku standing in a field solo, food, fruit, lemon, bubble, planet, moon, orange \(fruit\), lemon slice, leaf, fish, orange slice, by (tabi:1.25), spot color, looking at viewer, closeup cowboy shot
Negative prompt: (bad:0.81), (comic:0.81), (cropped:0.81), (error:0.81), (extra:0.81), (low:0.81), (lowres:0.81), (speech:0.81), (worst:0.81), (blush:0.9), 2koma, 3koma, 4koma, collage, lipstick
Steps: 18, Sampler: DDIM, CFG scale: 7, Seed: 2017390109, Size: 768x1600, Model hash: fed5b383, Batch size: 4, Batch pos: 1, Denoising strength: 0.7, Clip skip: 2, ENSD: 31337, First pass size: 0x0
```

![icecream](images/00748-910302581_20221220073123.png)

```txt
[sketch:0.75] [(oil painting:0.5)::0.75] by (fuzichoco:0.8) shion (fkey:0.9), fang solo cat ears nekomimi girl with multicolor streaked messy hair blue [black|blue] long hair bangs blue eyes in blue sailor collar school uniform serafuku short sleeves hand on own cheek hand on own face sitting, upper body, strawberry sweets ice cream food fruit spoon orange parfait
Negative prompt: (bad:0.98), (normal:0.98), (comic:0.81), (cropped:0.81), (error:0.81), (extra:0.81), (low:0.81), (lowres:1), (speech:0.81), (worst:0.81), 2koma, 3koma, 4koma, collage, lipstick
Steps: 40, Sampler: Euler a, CFG scale: 8, Seed: 910302581, Size: 960x1600, Model hash: fed5b383, Batch size: 4, Batch pos: 2, Denoising strength: 0.7, Clip skip: 2, ENSD: 31337, First pass size: 0x0
```

![girl](images/01101-2311603025_20221220161819.png)

```txt
(best:0.7), highly detailed,1girl,upper body,beautiful detailed eyes, medium_breasts, long hair,grey hair, grey eyes, curly hair, bangs,empty eyes,expressionless,twintails, beautiful detailed sky, beautiful detailed water, [cinematic lighting:0.6], upper body, school uniform,black ribbon,light smile
Negative prompt: (bad:0.98), (normal:0.98), (comic:0.81), (cropped:0.81), (error:0.81), (extra:0.81), (low:0.81), (lowres:1), (speech:0.81), (worst:0.81), 2koma, 3koma, 4koma, collage, lipstick
Steps: 40, Sampler: Euler, CFG scale: 8.5, Seed: 2311603025, Size: 960x1600, Model hash: fed5b383, Batch size: 4, Batch pos: 3, Denoising strength: 0.7, Clip skip: 2, ENSD: 31337, First pass size: 0x0
```

*I don't think other model can do that.*

![middle_f](images/00819-2496891010_20221220080243.png)

```txt
by [shion (fkey:0.9):momoko \(momopoco\):0.15], fang solo cat ears nekomimi girl with multicolor streaked messy hair blue [black|blue] long hair bangs blue eyes in blue sailor collar school uniform serafuku short sleeves hand on own cheek (middle finger:1.1) sitting, upper body, strawberry sweets ice cream food fruit spoon orange parfait
Negative prompt: (bad:0.98), (normal:0.98), (comic:0.81), (cropped:0.81), (error:0.81), (extra:0.81), (low:0.81), (lowres:1), (speech:0.81), (worst:0.81), 2koma, 3koma, 4koma, collage, lipstick
Steps: 40, Sampler: Euler a, CFG scale: 8, Seed: 2496891010, Size: 960x1600, Model hash: fed5b383, Batch size: 4, Batch pos: 1, Denoising strength: 0.7, Clip skip: 2, ENSD: 31337, First pass size: 0x0
```

![middle_f_2](images/01073-2668993375_20221220100952.png)

```txt
by [shion (fkey:0.9):momoko \(momopoco\):0.55], closed mouth fang solo cat ears nekomimi girl with multicolor streaked messy hair blue [black|blue] long hair bangs blue eyes in blue sailor collar school uniform serafuku short sleeves (middle finger:1.1) sitting, upper body, strawberry sweets ice cream food fruit spoon orange parfait
Negative prompt: (bad:0.98), (normal:0.98), (comic:0.81), (cropped:0.81), (error:0.81), (extra:0.81), (low:0.81), (lowres:1), (speech:0.81), (worst:0.81), 2koma, 3koma, 4koma, collage, lipstick, (chibi:0.8)
Steps: 40, Sampler: Euler a, CFG scale: 8, Seed: 2668993375, Size: 960x1600, Model hash: fed5b383, Batch size: 4, Batch pos: 3, Denoising strength: 0.7, Clip skip: 2, ENSD: 31337, First pass size: 0x0
```

more samples can be found in [images](images/00976-3769766671_20221226155509.png) folder.

## Usage

The [`bp_1024_e10.ckpt`](bp_1024_e10.ckpt) doesn't include any VAE and you should using other popular VAE in the community when using with [AUTOMATIC1111/stable-diffusion-webui](https://github.com/AUTOMATIC1111/stable-diffusion-webui) or you would see the
LaTeNt SpAcE! 

Use [`bp_1024_with_vae_te.ckpt`](bp_1024_with_vae_te.ckpt) if you don't have VAE and text encoder with you, still
EMA weight is not included and it's fp16.

If you want to continue training, use [`bp_1024_e10_ema.ckpt`](bp_1024_e10_ema.ckpt) which is the ema unet weight 
and with fp32 precision.

For better performance, it is strongly recommended to use Clip skip (CLIP stop at last layers) 2. It's also recommended to use turn on
"`Upscale latent space image when doing hires. fix`" in the settings of [AUTOMATIC1111/stable-diffusion-webui](https://github.com/AUTOMATIC1111/stable-diffusion-webui)
which adds intricate details when using `Highres. fix`. 

## About the Model Name

I asked the [chatGPT](https://openai.com/blog/chatgpt/) what the proper explanation of abbreviation BP could be.

```txt
Here are a few more ideas for creative interpretations of the abbreviation "BP":

- Brightest Point - This could refer to a moment of exceptional brilliance or clarity.
- Brainpower - the abbreviation refers to something that requires a lot of mental effort or intelligence to accomplish.
- Bespoke Partition - A custom made section that separates two distinct areas.
- Bukkake Picchi - A Japanese style of rice dish.
- Bokeh Picker - A traditional Japanese photography technique that involves selecting points of light from a digital image.
- Bipolarity - Two distinct and opposing elements or perspectives.

Note that "BP" is often used as an abbreviation for "blood pressure," so it is important to context to determine the most appropriate interpretation of the abbreviation.
```

Personally, I would call it "Big Pot".

## License

This model is open access and available to all, with a CreativeML OpenRAIL-M license further specifying rights and usage. The CreativeML OpenRAIL License specifies:

1. You can't use the model to deliberately produce nor share illegal or harmful outputs or content
2. The authors claims no rights on the outputs you generate, you are free to use them and are accountable for their use which must not go against the provisions set in the license
3. You may re-distribute the weights and use the model commercially and/or as a service. If you do, please be aware you have to include the same use restrictions as the ones in the license and share a copy of the CreativeML OpenRAIL-M to all your users (please read the license entirely and carefully) Please read the full license here