---
license: creativeml-openrail-m
language:
- en
tags:
- text-to-image
- Pixart-α
- art
- Pixart-XL
- fantasy
- anime
- waifu
- aiart
- ketengan
- AnySomniumAlpha
pipeline_tag: text-to-image
library_name: diffusers
---

# AnySomniumAlpha Model Teaser
<p align="center">
  <img src="01.png" width=70% height=70%>
</p>

`Ketengan-Diffusion/AnySomniumAlpha` is an experimental model built on the PixArt-α architecture, fine-tuned from [PixArt-alpha/PixArt-XL-2-1024-MS](https://huggingface.co/PixArt-alpha/PixArt-XL-2-1024-MS).

This is the first version of AnySomniumAlpha, the first anime-style model in the PixArt-α ecosystem, and it still needs a lot of improvement.

Our model uses the same dataset and curation process as AnySomniumXL v2, but with better captioning. The model supports both booru-tag-based captions and natural-language captions.

# Our Dataset Curation Process
<p align="center">
  <img src="Curation.png" width=70% height=70%>
</p>

Image source: [Source1](https://danbooru.donmai.us/posts/3143351) [Source2](https://danbooru.donmai.us/posts/3272710) [Source3](https://danbooru.donmai.us/posts/3320417)

Our dataset is scored with the pretrained CLIP+MLP aesthetic scoring model from https://github.com/christophschuhmann/improved-aesthetic-predictor, and we adjusted the script to detect text and watermarks using OCR via pytesseract.

The scoring scale ranges from -1 to 100. We set a minimum threshold of around 17-20 and a maximum of 65-75 to retain the 2D style of the dataset, and any image containing text returns a score of -1. Images scoring below the minimum or above the maximum were deleted.

The curation process ran on an NVIDIA T4 16GB machine and took about 7 days to curate 1,000,000 images.

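The score gating above can be sketched as a small helper. This is an illustration only: the function name and default thresholds are ours for this example, and the real script additionally runs the CLIP+MLP predictor and pytesseract OCR to produce the inputs:

```python
def passes_curation(score: float, has_text: bool,
                    min_score: float = 17.0, max_score: float = 65.0) -> bool:
    """Return True if an image survives curation.

    Images with OCR-detected text or watermarks are forced to a score
    of -1, so they always fall below the minimum threshold.
    """
    if has_text:
        score = -1.0
    return min_score <= score <= max_score
```

For example, a clean image scoring 42 is kept, while the same score with detected watermark text is dropped.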
# Captioning Process
We use a combination of a proprietary multimodal LLM and open-source multimodal LLMs such as LLaVA 1.5 for captioning, which produces more detailed results than plain BLIP-2. Details such as clothing, atmosphere, situation, scene, place, gender, skin, and more are generated by the LLM.

Captioning the 33k images took about 3 days on an NVIDIA Tesla A100 80GB PCIe. We are still improving our script to generate captions faster. The captioning process requires a minimum of 24GB of VRAM, so an NVIDIA Tesla T4 16GB is not sufficient.

# Tagging Process
We use booru tags retrieved from booru boards. Because these tags are largely assigned manually by humans, they are more accurate.

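Since the model supports both booru tags and natural-language captions, the two styles can be combined into a single training caption. The helper below is a hypothetical sketch of that idea, not our actual preprocessing code:

```python
def build_training_caption(booru_tags: list[str], nl_caption: str) -> str:
    """Combine booru tags with a natural-language caption so the model
    learns to respond to both caption styles."""
    return ", ".join(booru_tags) + ". " + nl_caption


# Example:
# build_training_caption(["1girl", "smile", "outdoors"],
#                        "A girl smiling in a sunny park.")
```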
# Official Demo
Coming soon.

# Technical Specifications

- Batch size: 8
- Learning rate: 3e-6
- Trained with a bucket size of 1024x1024
- Dataset size: 33k images
- Text encoder: t5-v1_1-xxl
- Training datatype: tfloat32
- Model weights: fp32
- Trained on an NVIDIA A100 80GB

You can support me:
- on [Ko-FI](https://ko-fi.com/ncaix)