---
license: other
license_name: fair-ai-public-license-1.0-sd
license_link: https://freedevproject.org/faipl-1.0-sd/
datasets:
- KBlueLeaf/danbooru2023-webp-4Mpixel
- KBlueLeaf/danbooru2023-sqlite
language:
- en
library_name: diffusers
pipeline_tag: text-to-image
---

# Kohaku XL Zeta

Join us: https://discord.gg/tPBsKDyRR5

![image/png](https://cdn-uploads.huggingface.co/production/uploads/630593e2fca1d8d92b81d2a1/rUeUdKYiUfi6LtTcpasgN.png)

## Highlights

- Resumed from Kohaku-XL-Epsilon rev2.
- More stable; long, detailed prompts are no longer required.
- Better fidelity on styles and characters, with support for more styles.
- The CCIP metric surpasses Sanae XL anime: over 2,200 characters in a 3,700-character set have a CCIP score > 0.9.
- Trained on both danbooru tags and natural language, so it handles natural language (nl) captions better.
- Trained on a combined dataset, not only danbooru:
  - danbooru (7.6M images, last id 7832883, 2024/07/10)
  - pixiv (filtered from a 2.6M special set; the URL set will be released)
  - pvc figure (around 30k images, internal source)
  - realbooru (around 90k images, for regularization)
  - 8.46M images in total
- Since the model is trained on both kinds of caption, the context length limit is extended to 300.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/630593e2fca1d8d92b81d2a1/2EpGwA8D1c0UnVGuPMFtY.png)

## Usage (PLEASE READ THIS SECTION)

### Recommended Generation Settings

- resolution: 1024x1024 or similar pixel count
- cfg scale: 3.5~6.5
- sampler/scheduler:
  - Euler (A) / any scheduler
  - DPM++ series / exponential scheduler
  - for other samplers, I personally recommend the exponential scheduler.
- steps: 12~50
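
To make the "similar pixel count" guideline concrete, here is a minimal sketch that picks a width and height for a given aspect ratio while staying close to 1024x1024 pixels. The helper name `resolution_for_aspect` and the rounding to multiples of 64 are assumptions for illustration only; the model card does not prescribe either.

```python
import math

# Pixel budget taken from the recommended 1024x1024 resolution.
TARGET_PIXELS = 1024 * 1024


def resolution_for_aspect(aspect_w: int, aspect_h: int, multiple: int = 64) -> tuple[int, int]:
    """Return (width, height) close to TARGET_PIXELS for the given aspect ratio.

    Rounding each side to `multiple` is an assumption made for this sketch,
    not a requirement stated by the model card.
    """
    ratio = aspect_w / aspect_h
    # Solve width * height = TARGET_PIXELS with width = ratio * height.
    height = math.sqrt(TARGET_PIXELS / ratio)
    width = height * ratio

    def snap(value: float) -> int:
        # Round to the nearest multiple, never going below one multiple.
        return max(multiple, round(value / multiple) * multiple)

    return snap(width), snap(height)


if __name__ == "__main__":
    for aspect in [(1, 1), (2, 3), (16, 9)]:
        w, h = resolution_for_aspect(*aspect)
        print(f"{aspect[0]}:{aspect[1]} -> {w}x{h} ({w * h} pixels)")
```

Because each side is rounded independently, the resulting pixel count only approximates the target; for example, a 2:3 portrait comes out slightly above 1024x1024 pixels.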

### Prompt Gen

The DTG series prompt gen can still be used with KXL Zeta.
A brand new prompt gen that works with both tag and nl captions is under development.

|![image/png](https://cdn-uploads.huggingface.co/production/uploads/630593e2fca1d8d92b81d2a1/ixiBsWdO1sg6QUMqRUbHu.png)|![image/png](https://cdn-uploads.huggingface.co/production/uploads/630593e2fca1d8d92b81d2a1/Byv2Xg1g8zN9nuCURasK6.png)|
|-|-|

### Prompt Format

Same as Kohaku XL Epsilon or Delta, but you can replace the "general tags" with a "natural language caption".
You can also use both together.

### Special Tags

- Quality tags: masterpiece, best quality, great quality, good quality, normal quality, low quality, worst quality
- Rating tags: safe, sensitive, nsfw, explicit
- Date tags: newest, recent, mid, early, old

#### Rating tags

- General: safe
- Sensitive: sensitive
- Questionable: nsfw
- Explicit: nsfw, explicit
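
The special tags and the rating mapping above can be captured in a small lookup. The sketch below builds the special-tag part of a prompt. The function name `special_tags` and the output order (quality, date, rating) are assumptions for illustration; the tag values themselves come from the lists above.

```python
QUALITY_TAGS = [
    "masterpiece", "best quality", "great quality", "good quality",
    "normal quality", "low quality", "worst quality",
]
DATE_TAGS = ["newest", "recent", "mid", "early", "old"]

# Rating level -> rating tags, as listed in the "Rating tags" section.
RATING_TAGS = {
    "general": ["safe"],
    "sensitive": ["sensitive"],
    "questionable": ["nsfw"],
    "explicit": ["nsfw", "explicit"],
}


def special_tags(quality: str, rating: str, date: str) -> str:
    """Return a comma-separated string of special tags.

    The ordering (quality, date, rating) is an assumption of this sketch,
    not something the model card specifies.
    """
    if quality not in QUALITY_TAGS:
        raise ValueError(f"unknown quality tag: {quality!r}")
    if date not in DATE_TAGS:
        raise ValueError(f"unknown date tag: {date!r}")
    if rating not in RATING_TAGS:
        raise ValueError(f"unknown rating: {rating!r}")
    return ", ".join([quality, date, *RATING_TAGS[rating]])


if __name__ == "__main__":
    print(special_tags("masterpiece", "general", "newest"))
```

Note that the explicit rating expands to two tags, so a prompt for that rating carries both `nsfw` and `explicit`.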

## Dataset

To improve the model's ability on certain concepts, I use the full danbooru dataset instead of a filtered one.
Then I use a crawled Pixiv dataset (from 3~5 tags, sorted by popularity) as an add-on dataset.
Since Pixiv's search system only allows 5000 pages per tag, there are not many meaningful images, and some of them duplicate images in the danbooru set (but since I want to reinforce these concepts, I simply ignore the duplication).

As with kxl eps rev2, I add realbooru and pvc figure images for more flexibility on concept/style.

## Training

- Hardware: Quad RTX 3090s
- Num Train Images: 8,468,798
- Total Epoch: 1
- Total Steps: 16548
- Training Time: 430 hours (wall time)
- Batch Size: 4
- Grad Accumulation Step: 32
- Equivalent Batch Size: 512
- Optimizer: Lion8bit
- Learning Rate: 1e-5 for UNet / TE training disabled
- LR Scheduler: Constant (with warmup)
- Warmup Steps: 100
- Weight Decay: 0.1
- Betas: 0.9, 0.95
- Min SNR Gamma: 5
- Debiased Estimation Loss: Enabled
- IP Noise Gamma: 0.05
- Resolution: 1024x1024
- Min Bucket Resolution: 256
- Max Bucket Resolution: 4096
- Mixed Precision: FP16
- Caption Tag Dropout: 0.2
- Caption Group Dropout: 0.2 (for dropping tag/nl caption entirely)
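
The batch and step figures above can be cross-checked with a little arithmetic. The sketch below assumes that the listed batch size of 4 is per GPU, which is what makes 4 GPUs x 4 x 32 accumulation steps equal the stated equivalent batch size of 512. The variable names are illustrative.

```python
import math

NUM_GPUS = 4            # "Quad RTX 3090s"
BATCH_SIZE = 4          # assumed to be per GPU
GRAD_ACCUM_STEPS = 32
NUM_IMAGES = 8_468_798
TOTAL_STEPS = 16_548
TRAIN_HOURS = 430

# Images consumed per optimizer step.
equivalent_batch = NUM_GPUS * BATCH_SIZE * GRAD_ACCUM_STEPS

# Lower bound on steps for one epoch if every batch were completely full.
min_steps_per_epoch = math.ceil(NUM_IMAGES / equivalent_batch)

# Average wall-clock cost and throughput over the whole run.
seconds_per_step = TRAIN_HOURS * 3600 / TOTAL_STEPS
images_per_second = NUM_IMAGES / (TRAIN_HOURS * 3600)

if __name__ == "__main__":
    print(f"equivalent batch size: {equivalent_batch}")
    print(f"minimum steps for one epoch: {min_steps_per_epoch}")
    print(f"seconds per step: {seconds_per_step:.1f}")
    print(f"images per second: {images_per_second:.2f}")
```

The reported 16548 steps is slightly higher than the full-batch lower bound. One plausible cause, stated here only as an assumption, is that aspect-ratio bucketing leaves some batches partially filled.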

## Why do you still use SDXL but not any Brand New DiT-Based Models?

Why do you think HunYuan, SD3, Flux, or AuraFlow would be a better choice when they are slower than SDXL and more difficult to finetune? <br>
Why do you think a DiT-based model would be a better choice when the DiT paper used 9 times as many samples seen to surpass LDM-4? <br>
Do you know that most of the "improvements" of these "DiT models" come from dataset and scaling? <br>
Do you know that the "UNet" in SDXL has more than 1.75B parameters, or 70% of its parameters, in transformer blocks?

Unless someone gives me reasonable compute resources, or some team releases an efficient enough DiT, I will not train any DiT-based anime base model. <br>
But if you give me 8xH100 for a year, I can even train a lot of DiTs from scratch (if you want).

## License:

Fair-AI-public-1.0-sd