---
tags:
- text-to-image
- KOALA
---

<div align="center">
<img src="https://dl.dropboxusercontent.com/scl/fi/yosvi68jvyarbvymxc4hm/github_logo.png?rlkey=r9ouwcd7cqxjbvio43q9b3djd&dl=1" width="1024px" />
</div>

<div style="display:flex;justify-content: center">
<a href="https://youngwanlee.github.io/KOALA/"><img src="https://img.shields.io/static/v1?label=Project%20Page&message=Github&color=blue&logo=github-pages"></a> &ensp;
<a href="https://github.com/youngwanLEE/sdxl-koala"><img src="https://img.shields.io/static/v1?label=Code&message=Github&color=blue&logo=github"></a> &ensp;
<a href="https://arxiv.org/abs/2312.04005"><img src="https://img.shields.io/static/v1?label=Paper&message=Arxiv:KOALA&color=red&logo=arxiv"></a> &ensp;
</div>

# KOALA-1B Model Card


## Abstract
### TL;DR
> We propose KOALA, a fast text-to-image model built by compressing SDXL's U-Net and distilling SDXL's generation capability into the compressed model. KOALA-700M can generate a 1024x1024 image in less than 1.5 seconds on an NVIDIA 4090 GPU, more than 2x faster than SDXL.

<details><summary>FULL abstract</summary>
Stable Diffusion is the mainstay of text-to-image (T2I) synthesis in the community due to its generation performance and open-source nature.
Recently, Stable Diffusion XL (SDXL), the successor of Stable Diffusion, has received a lot of attention due to its significant performance improvement, with a higher resolution of 1024x1024 and a larger model.
However, its increased computation cost and model size require higher-end hardware (e.g., a GPU with more VRAM) for end-users, incurring higher costs of operation.
To address this problem, in this work, we propose an efficient latent diffusion model for text-to-image synthesis obtained by distilling the knowledge of SDXL.
To this end, we first perform an in-depth analysis of the denoising U-Net in SDXL, which is the main bottleneck of the model, and then design a more efficient U-Net based on the analysis.
Secondly, we explore how to effectively distill the generation capability of SDXL into an efficient U-Net and eventually identify four essential factors, the core of which is that self-attention is the most important part.
With our efficient U-Net and self-attention-based knowledge distillation strategy, we build our efficient T2I models, called KOALA-1B & -700M, reducing the model size by up to 54% and 69% compared to the original SDXL model.
In particular, KOALA-700M is more than twice as fast as SDXL while still retaining decent generation quality.
We hope that, owing to its balanced speed-performance tradeoff, our KOALA models can serve as a cost-effective alternative to SDXL in resource-constrained environments.
</details>

<br>

These 1024x1024 samples are generated by KOALA-700M with 25 denoising steps.

<div align="center">
<img src="https://dl.dropboxusercontent.com/scl/fi/rjsqqgfney7be069y2yr7/teaser.png?rlkey=7lq0m90xpjcoqclzl4tieajpo&dl=1" width="1024px" />
</div>

## Architecture
There are two types of compressed U-Net, KOALA-1B and KOALA-700M, which are realized by reducing residual blocks and transformer blocks (see the inspection snippet after the figure).

<div align="center">
<img src="https://dl.dropboxusercontent.com/scl/fi/5ydeywgiyt1d3njw63dpk/arch.png?rlkey=1p6imbjs4lkmfpcxy153i1a2t&dl=1" width="1024px" />
</div>
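
The block layout in the comparison table below can be inspected directly from the checkpoint. This is a minimal sketch, assuming the repo follows the standard diffusers layout with a `unet` subfolder and a `UNet2DConditionModel`-style config; the exact config fields for KOALA's modified U-Net may differ.

```python
import torch
from diffusers import UNet2DConditionModel

# Load only the denoising U-Net from the KOALA checkpoint.
unet = UNet2DConditionModel.from_pretrained(
    "etri-vilab/koala-700m", subfolder="unet", torch_dtype=torch.float16
)

# Stage layout and per-stage transformer depth (cf. the "Tx blocks" row below).
print(unet.config.down_block_types)
print(unet.config.transformer_layers_per_block)

# Total U-Net parameter count (cf. the "Param." row below).
print(f"params: {sum(p.numel() for p in unet.parameters()) / 1e6:.0f}M")
```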

### U-Net comparison

| U-Net | SDM-v2.0 | SDXL-Base-1.0 | KOALA-1B | KOALA-700M |
|-------|:----------:|:-----------:|:-----------:|:-------------:|
| Param. | 865M | 2,567M | 1,161M | 782M |
| CKPT size | 3.46GB | 10.3GB | 4.4GB | 3.0GB |
| Tx blocks | [1, 1, 1, 1] | [0, 2, 10] | [0, 2, 6] | [0, 2, 5] |
| Mid block | ✓ | ✓ | ✓ | ✗ |
| Latency | 1.131s | 3.133s | 1.604s | 1.257s |

- Tx means transformer block, and CKPT means the trained checkpoint file.
- We measured latency with FP16 precision and 25 denoising steps on an NVIDIA 4090 GPU (24GB).
- SDM-v2.0 uses 768x768 resolution, while the SDXL and KOALA models use 1024x1024 resolution.
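
As a rough way to reproduce the latency row, the pipeline can be timed directly. This is a sketch, not the authors' benchmarking script; it assumes the `etri-vilab/koala-700m` checkpoint, a CUDA GPU, and the FP16 / 25-step setting from the notes above.

```python
import torch
from diffusers import StableDiffusionXLPipeline

# FP16 pipeline on the GPU, matching the measurement setting above.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "etri-vilab/koala-700m", torch_dtype=torch.float16
).to("cuda")

prompt = "A koala sitting in a eucalyptus tree"

# Warm up once so CUDA kernels and memory pools are initialized.
pipe(prompt, num_inference_steps=1)

# Time one 25-step 1024x1024 generation with CUDA events.
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
pipe(prompt, num_inference_steps=25)
end.record()
torch.cuda.synchronize()
print(f"latency: {start.elapsed_time(end) / 1000:.3f}s")
```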

## Latency and memory usage comparison on different GPUs

We measure the inference time of SDM-v2.0 at 768x768 resolution and the other models at 1024x1024, using a variety of consumer-grade GPUs: NVIDIA 3060Ti (8GB), 2080Ti (11GB), and 4090 (24GB). We use 25 denoising steps and FP16/FP32 precision. OOM means Out-of-Memory. Note that SDXL-Base cannot operate on the 8GB GPU.

<div align="center">
<img src="https://dl.dropboxusercontent.com/scl/fi/u1az20y0zfww1l5lhbcyd/latency_gpu.svg?rlkey=vjn3gpkmywmp7jpilar4km7sd&dl=1" width="1024px" />
</div>

## Key Features
- **Efficient U-Net Architecture**: KOALA models use a simplified U-Net architecture that reduces the model size by up to 54% (KOALA-1B) and 69% (KOALA-700M) compared to the predecessor, Stable Diffusion XL (SDXL).
- **Self-Attention-Based Knowledge Distillation**: The core technique in KOALA is the distillation of self-attention features, which proves crucial for maintaining image generation quality (a toy sketch of this idea follows below).
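
To make the second point concrete, here is a purely illustrative sketch of self-attention feature distillation: an MSE loss pulling the student U-Net's intermediate self-attention features toward the teacher's. The helper name, the feature shapes, and the plain-MSE choice are assumptions for illustration; this is not the authors' training code.

```python
import torch
import torch.nn.functional as F

def self_attention_distill_loss(student_feats, teacher_feats):
    """Illustrative distillation loss (hypothetical helper, not the KOALA training code).

    Both arguments are lists of self-attention feature maps hooked from matching
    layers of the student and teacher U-Nets, assumed to have identical shapes.
    """
    loss = torch.zeros((), device=student_feats[0].device)
    for s, t in zip(student_feats, teacher_feats):
        # Teacher features are targets only; no gradients flow into the teacher.
        loss = loss + F.mse_loss(s, t.detach())
    return loss / len(student_feats)
```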

## Model Description

- Developed by [ETRI Visual Intelligence Lab](https://huggingface.co/etri-vilab)
- Developers: [Youngwan Lee](https://youngwanlee.github.io/), [Kwanyong Park](https://pkyong95.github.io/), [Yoorhim Cho](https://ofzlo.github.io/), [Yong-Ju Lee](https://scholar.google.com/citations?user=6goOQh8AAAAJ&hl=en), [Sung Ju Hwang](http://www.sungjuhwang.com/)
- Model Description: Latent-diffusion-based text-to-image generative model. KOALA models use the same text encoders as [SDXL-Base-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0) and only replace the denoising U-Net with a compressed one (see the snippet below).
- Training data: [LAION-aesthetics-V2 6+](https://laion.ai/blog/laion-aesthetics/)
- Resources for more information: Check out the [KOALA report on arXiv](https://arxiv.org/abs/2312.04005) and the [project page](https://youngwanlee.github.io/KOALA/).
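
One way to see that only the U-Net differs is to print the classes of the pipeline components; in a standard SDXL-style diffusers pipeline the two text encoders are CLIP models. A small sketch, assuming the standard diffusers component attribute names:

```python
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained("etri-vilab/koala-700m")

# Same CLIP text encoders as SDXL-Base-1.0; only the U-Net is swapped out.
print(type(pipe.text_encoder).__name__)    # CLIPTextModel
print(type(pipe.text_encoder_2).__name__)  # CLIPTextModelWithProjection
print(type(pipe.unet).__name__)            # the compressed denoising U-Net
```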

## Usage with 🤗[Diffusers library](https://github.com/huggingface/diffusers)
Inference code with 25 denoising steps:
```python
import torch
from diffusers import StableDiffusionXLPipeline

# Load the pipeline in FP16 and move it to the GPU.
pipe = StableDiffusionXLPipeline.from_pretrained("etri-vilab/koala-700m", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "A portrait painting of a Golden Retriever like Leonardo da Vinci"
negative = "worst quality, low quality, illustration, low resolution"
image = pipe(prompt=prompt, negative_prompt=negative, num_inference_steps=25).images[0]
```
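
If you hit the OOM cases from the GPU comparison above, diffusers' standard memory-saving switches may help; note that these were not part of the measurements reported in this card. Continuing from the pipeline above:

```python
# Keep sub-modules on the CPU and move each to the GPU only while it runs
# (requires the `accelerate` package); do not combine with `pipe.to("cuda")`.
pipe.enable_model_cpu_offload()

# Compute attention in slices to lower peak memory, at some speed cost.
pipe.enable_attention_slicing()
```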

## Uses
### Direct Use
The model is intended for research purposes only. Possible research areas and tasks include:

- Generation of artworks and use in design and other artistic processes.
- Applications in educational or creative tools.
- Research on generative models.
- Safe deployment of models which have the potential to generate harmful content.
- Probing and understanding the limitations and biases of generative models.

Excluded uses are described below.

### Out-of-Scope Use

The model was not trained to produce factual or true representations of people or events; using it to generate such content is therefore out of scope for the abilities of this model.

## Limitations and Bias
- Text Rendering: The models face challenges in rendering long, legible text within images.
- Complex Prompts: KOALA sometimes struggles with complex prompts involving multiple attributes.
- Dataset Dependencies: The current limitations are partially attributed to the characteristics of the training dataset (LAION-aesthetics-V2 6+).

## Citation
```bibtex
@misc{lee2023koala,
      title={KOALA: Self-Attention Matters in Knowledge Distillation of Latent Diffusion Models for Memory-Efficient and Fast Image Synthesis},
      author={Youngwan Lee and Kwanyong Park and Yoorhim Cho and Yong-Ju Lee and Sung Ju Hwang},
      year={2023},
      eprint={2312.04005},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```