Crosstyan committed on
Commit 7ca1077
Parents: b946723, 33fdca1

Merge branch 'main' of https://huggingface.co/Crosstyan/BPModel

Files changed (1)
  1. README.md +30 -6
README.md CHANGED
@@ -27,10 +27,28 @@ BPModel is an experimental Stable Diffusion model based on [ACertainty](https://
 Why does the Model even exist? There are loads of Stable Diffusion models out there, especially anime style models.
 Well, were there any models trained with a base resolution (`base_res`) of 768 or even 1024 before? I don't think so.
 Here it is: BPModel, a Stable Diffusion model you may love or hate.
- Trained with 5k high quality images that suit my taste (not necessarily yours, unfortunately) from [Sankaku Complex](https://chan.sankakucomplex.com) with annotations. Not the best strategy, since a pure combination of tags may not be the optimal way to describe an image, but I don't need to do extra work. And no, I won't feed any AI generated images
- to the model, even if it might get the model outlawed in some countries.
-
- The training of a high resolution model requires a significant amount of GPU hours and can be costly. In this particular case, 10 V100 GPU hours were spent on training a model with a resolution of 512, while 60 V100 GPU hours were spent on training a model with a resolution of 768. An additional 50 V100 GPU hours were also spent on training a model with a resolution of 1024, although only 10 epochs were run. The results of the training on the 1024 resolution model did not show a significant improvement compared to the 768 resolution model, and the resource demands, achieving a batch size of 1 on a V100 with 32G VRAM, were high. However, training on the 768 resolution did yield better results than training on the 512 resolution, and it is worth considering as an option. It is worth noting that Stable Diffusion 2.x also chose to train on a 768 resolution model. However, it may be more efficient to start with training on a 512 resolution model due to the slower training process and the need for additional prior knowledge to speed up the training process when working with a 768 resolution.
+ Trained with 5k high quality images that suit my taste (not necessarily yours, unfortunately) from [Sankaku Complex](https://chan.sankakucomplex.com) with annotations.
+ The dataset is public at [Crosstyan/BPDataset](https://huggingface.co/datasets/Crosstyan/BPDataset) for the sake of full disclosure.
+ A pure combination of tags may not be the optimal way to describe an image,
+ but I don't need to do extra work.
+ And no, I won't feed any AI generated images
+ to the model, even if it might get the model outlawed in some countries.
+
+ The training of a high resolution model requires a significant amount of GPU
+ hours and can be costly. In this particular case, 10 V100 GPU hours were spent
+ on training 30 epochs with a resolution of 512, while 60 V100 GPU hours were spent
+ on training 30 epochs with a resolution of 768. An additional 100 V100 GPU hours
+ were also spent on training a model with a resolution of 1024, although **ONLY** 10
+ epochs were run. The results of the training on the 1024 resolution model did
+ not show a significant improvement compared to the 768 resolution model, and the
+ resource demands, achieving a batch size of 1 on a V100 with 32G VRAM, were
+ high. However, training on the 768 resolution did yield better results than
+ training on the 512 resolution, and it is worth considering as an option. It is
+ worth noting that Stable Diffusion 2.x also chose to train on a 768 resolution
+ model. However, it may be more efficient to start with training on a 512
+ resolution model due to the slower training process and the need for additional
+ prior knowledge to speed up the training process when working with a 768
+ resolution.
 
 [Mikubill/naifu-diffusion](https://github.com/Mikubill/naifu-diffusion) is used as the training script, and I also recommend
 checking out [CCRcmcpe/scal-sdt](https://github.com/CCRcmcpe/scal-sdt).
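To put the base resolutions above in perspective, here is a rough, illustrative sketch (not taken from the training code) assuming the standard Stable Diffusion VAE with 8x spatial downsampling and 4 latent channels; it shows how quickly the latent grid the UNet has to process grows with `base_res`:

```python
# Rough sketch: latent-grid size per base resolution, assuming the stock SD VAE
# (8x spatial downsampling, 4 latent channels). Illustrative only.
for base_res in (512, 768, 1024):
    h = w = base_res // 8
    print(f"{base_res:>4}px -> {h}x{w} latent grid, {4 * h * w:,} latent values per image")
# 512 -> 64x64, 768 -> 96x96 (2.25x the values of 512), 1024 -> 128x128 (4x),
# which is consistent with only fitting a batch size of 1 on a 32G V100 at 1024.
```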
@@ -85,7 +103,7 @@ better than some artist style DreamBooth model which only train with a few
 hundred images or even fewer. I also oppose changing style by merging models, since you
 could apply a different style by training with proper captions and prompting.
 
- Besides, some of the images in my dataset has the artist name in the caption; however, some artist names will
+ Besides, some of the images in my dataset have the artist name in the caption; however, some artist names will
 be misinterpreted by CLIP when tokenizing. For example, *as109* will be tokenized as `[as, 1, 0, 9]` and
 *fuzichoco* will become `[fu, z, ic, hoco]`. Romanized Japanese suffers from this problem a lot, and
 I don't have a good solution other than changing the artist name in the caption, which is
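The tokenization issue above is easy to check yourself. A minimal sketch, assuming the stock Stable Diffusion 1.x text-encoder vocabulary (`openai/clip-vit-large-patch14`); the exact splits may differ slightly from the examples quoted here:

```python
# Minimal check of how CLIP's BPE vocabulary splits romanized artist names.
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
for name in ("as109", "fuzichoco"):
    print(name, "->", tokenizer.tokenize(name))
# Out-of-vocabulary names fall apart into short subword pieces,
# roughly [as, 1, 0, 9] and [fu, z, ic, hoco] as described above.
```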
@@ -101,6 +119,10 @@ I don't think anyone would like to do. (Could Unstable Diffusion give us surpris
 
 Here're some **cherry picked** samples.
 
+ I was using [xformers](https://github.com/facebookresearch/xformers) when generating these samples,
+ and it might yield slightly different results even with the same seed (welcome to the non-deterministic field).
+ "`Upscale latent space image when doing hires. fix`" was also enabled.
+
 ![orange](images/00317-2017390109_20221220015645.png)
 
 ```txt
@@ -154,7 +176,9 @@ EMA weight is not included and it's fp16.
 If you want to continue training, use [`bp_1024_e10_ema.ckpt`](bp_1024_e10_ema.ckpt), which is the EMA UNet weight
 stored in fp32 precision.
 
- For better performance, it is strongly recommended to use Clip skip (CLIP stop at last layers) 2.
+ For better performance, it is strongly recommended to use Clip skip (CLIP stop at last layers) 2. It is also recommended to turn on
+ "`Upscale latent space image when doing hires. fix`" in the settings of [AUTOMATIC1111/stable-diffusion-webui](https://github.com/AUTOMATIC1111/stable-diffusion-webui),
+ which adds intricate details when using `Highres. fix`.
 
 ## About the Model Name
 
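For anyone applying the Clip skip recommendation outside the webui, here is a minimal sketch of what "CLIP stop at last layers 2" amounts to, assuming a diffusers `StableDiffusionPipeline`; the model path and prompt below are placeholders, not instructions from this repo:

```python
# Hedged sketch: emulate "Clip skip 2" with diffusers by feeding the UNet the
# penultimate CLIP text-encoder layer. Model path and prompt are placeholders.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "path/to/converted-bpmodel", torch_dtype=torch.float16
).to("cuda")
# Optional: xformers attention saves VRAM, but as noted above it can make results
# slightly non-deterministic even with a fixed seed.
pipe.enable_xformers_memory_efficient_attention()

prompt = "1girl, solo, highres"
tokens = pipe.tokenizer(
    prompt,
    padding="max_length",
    max_length=pipe.tokenizer.model_max_length,
    truncation=True,
    return_tensors="pt",
)
with torch.no_grad():
    enc = pipe.text_encoder(tokens.input_ids.to("cuda"), output_hidden_states=True)
# "Clip skip 2": take the second-to-last hidden state, then apply the final
# layer norm that the last layer's output would normally pass through.
embeds = pipe.text_encoder.text_model.final_layer_norm(enc.hidden_states[-2])

image = pipe(prompt_embeds=embeds, num_inference_steps=28).images[0]
image.save("sample.png")
```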