ai-forever commited on
Commit
1db830e
1 Parent(s): 87d578a

Create README.md

  1. README.md +230 -0
README.md ADDED
@@ -0,0 +1,230 @@
1
+ ---
2
+ license: apache-2.0
3
+ ---
4
+
5
+ # Kandinsky-3: Text-to-image diffusion model
6
+
7
+ ![](assets/title.jpg)
8
+
9
+ [Kandinsky 3.0 Post](https://habr.com/ru/companies/sberbank/articles/775590/) | [Project Page](https://ai-forever.github.io/Kandinsky-3) | [Generate](https://fusionbrain.ai) | [Telegram-bot](https://t.me/kandinsky21_bot) | [Technical Report](https://arxiv.org/pdf/2312.03511.pdf)
10
+
11
+ # Kandinsky 3.1:
12
+
13
+ ## Description:
14
+
15
+ We present Kandinsky 3.1, the follow-up to Kandinsky 3.0, a large-scale text-to-image generation model based on latent diffusion. It continues the Kandinsky series of text-to-image models and reflects our progress toward higher-quality, more realistic image generation. We have extended it with a variety of useful features and modes so that users can make fuller use of the new model.
16
+
17
+ ## Kandinsky Flash
18
+
19
+ <figure>
20
+ <img src="assets/butterly_effect.jpg">
21
+ </figure>
22
+
23
+
24
+ Diffusion models are slow at image generation. To address this, we trained Kandinsky Flash, a model based on the [Adversarial Diffusion Distillation](https://arxiv.org/abs/2311.17042) approach with two modifications: we trained the model on latents, which reduced the memory overhead, and we removed the distillation loss, as it did not affect training.
25
+
26
+ ### Architecture
27
+
28
+ To train Kandinsky Flash, we used the discriminator architecture shown below. It is half of the Kandinsky 3.0 U-Net encoder with additional prediction heads.
29
+
30
+ <img src="assets/architecture.png">
31
+
32
+ ### How to use:
33
+ See the Jupyter notebooks with examples in the `./examples` folder.
34
+
+ ```python
+ import torch
+ from kandinsky3 import get_T2I_Flash_pipeline
+
+ # Run the whole pipeline on the first GPU
+ device_map = torch.device('cuda:0')
+ # Per-component precision: only the text encoder runs in half precision
+ dtype_map = {
+     'unet': torch.float32,
+     'text_encoder': torch.float16,
+     'movq': torch.float32,
+ }
+
+ t2i_pipe = get_T2I_Flash_pipeline(
+     device_map, dtype_map
+ )
+
+ res = t2i_pipe("A cute corgi lives in a house made out of sushi.")
+ ```
51
+
52
+ ## Prompt beautification
53
+
54
+ <figure>
55
+ <img src="assets/prompt_beautifcation.png">
56
+ </figure>
57
+
58
+
59
+ The prompt plays a crucial role in text-to-image generation, so in Kandinsky 3.1 we use a language model to improve the prompt. As the LLM we used Intel's [neural-chat-7b-v3-1](https://huggingface.co/Intel/neural-chat-7b-v3-1) with the following system prompt:
60
+
61
+ ```
62
+ ### System: You are a prompt engineer. Your mission is to expand prompts written by user. You should provide the best prompt for text to image generation in English.
63
+ ### User:
64
+ {prompt}
65
+ ### Assistant:
66
+ {answer of the model}
67
+ ```
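
As a minimal sketch of how the template above could be filled in before it is sent to the LLM, the snippet below builds the input string and pulls the answer back out of the generated text. The helper names `build_beautification_prompt` and `extract_answer` are hypothetical and not part of the `kandinsky3` package. The model call itself is omitted, and the snippet assumes the LLM echoes the input before its answer.

```python
# Template text copied from the system prompt above; the model's answer is
# expected to follow the final "### Assistant:" line.
BEAUTIFICATION_TEMPLATE = (
    "### System: You are a prompt engineer. Your mission is to expand "
    "prompts written by user. You should provide the best prompt for "
    "text to image generation in English.\n"
    "### User:\n"
    "{prompt}\n"
    "### Assistant:\n"
)


def build_beautification_prompt(prompt: str) -> str:
    """Fill the template with the stripped user prompt (hypothetical helper)."""
    return BEAUTIFICATION_TEMPLATE.format(prompt=prompt.strip())


def extract_answer(generated_text: str) -> str:
    """Return the text after the last '### Assistant:' marker (hypothetical helper)."""
    return generated_text.rsplit("### Assistant:", 1)[-1].strip()


llm_input = build_beautification_prompt("  a corgi in a sushi house ")
print(llm_input)
```

The expanded prompt returned by `extract_answer` would then be passed to the text-to-image pipeline in place of the original prompt.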
68
+
69
+ ## KandiSuperRes
70
+
71
+ <figure>
72
+ <img src="assets/superres.png">
73
+ </figure>
74
+
75
+ To learn more about KandiSuperRes, please check out: https://github.com/ai-forever/KandiSuperRes/
76
+
77
+ ## Kandinsky IP-Adapter & Kandinsky ControlNet
78
+
79
+ <figure>
80
+ <img src="assets/ip-adapter.png">
81
+ </figure>
82
+
83
+ To allow using an image as a condition in the Kandinsky model, we trained an IP-Adapter and a HED-based ControlNet model. For more details, please check out: https://github.com/ai-forever/kandinsky3-diffusers
84
+
85
+ # Kandinsky 3.0:
86
+
87
+ ## Description:
88
+
89
+ Kandinsky 3.0 is an open-source text-to-image diffusion model built upon the Kandinsky2-x model family. Compared with its predecessors, Kandinsky 3.0 was trained on more data, including data specifically related to Russian culture, which allows it to generate pictures on Russian cultural themes. Text understanding and visual quality have also improved, achieved by increasing the size of the text encoder and of the diffusion U-Net, respectively.
90
+
91
+ For more information, including training details and example generations, check out our [post](). The English version will be released in a couple of days.
92
+
93
+ ## Architecture details:
94
+
95
+
96
+ ![](assets/kandinsky.jpg)
97
+
98
+
99
+ The architecture consists of three parts:
100
+
101
+ + Text encoder Flan-UL2 (encoder part) - 8.6B
102
+ + Latent Diffusion U-Net - 3B
103
+ + MoVQ encoder/decoder - 267M
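
To put the sizes above in perspective, here is a rough back-of-the-envelope sketch of the total parameter count and the memory the weights alone would occupy. It assumes the precisions from the `dtype_map` in the text2image example (float16 text encoder, float32 U-Net and MoVQ) and uses the rounded counts listed above. Activations, buffers and framework overhead are not included, so real usage will be higher.

```python
# Parameter counts as listed above (rounded figures from this README).
PARAMS = {
    "text_encoder": 8.6e9,  # Flan-UL2 encoder
    "unet": 3.0e9,          # latent diffusion U-Net
    "movq": 267e6,          # MoVQ encoder/decoder
}

# Bytes per parameter; assumption: same precisions as the text2image dtype_map
# (float16 = 2 bytes, float32 = 4 bytes).
BYTES_PER_PARAM = {"text_encoder": 2, "unet": 4, "movq": 4}


def weight_gigabytes(params, bytes_per_param):
    """Approximate weight storage per component in GB (1 GB = 1e9 bytes)."""
    return {name: params[name] * bytes_per_param[name] / 1e9 for name in params}


total_params = sum(PARAMS.values())
sizes = weight_gigabytes(PARAMS, BYTES_PER_PARAM)

print(f"total parameters: {total_params / 1e9:.2f}B")
for name, gb in sizes.items():
    print(f"{name}: ~{gb:.1f} GB")
print(f"all weights: ~{sum(sizes.values()):.1f} GB")
```

Under these assumptions, the half-precision text encoder still accounts for more than half of the weight memory.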
104
+
105
+
106
+ ## Models
107
+
108
+ We release two models:
109
+
110
+ + [Base](): Base text-to-image diffusion model. This model was trained for over 2M steps on 400 A100 GPUs.
111
+ + [Inpainting](): Inpainting version of the model. The model was initialized from the final checkpoint of the base model and trained for 250k steps on 300 A100 GPUs.
112
+
113
+ ## Installing
114
+
115
+ To install the repo, first create a conda environment:
116
+
117
+ ```
118
+ conda create -n kandinsky -y python=3.8;
119
+ source activate kandinsky;
120
+ pip install torch==1.10.1+cu111 torchvision==0.11.2+cu111 torchaudio==0.10.1 -f https://download.pytorch.org/whl/cu113/torch_stable.html;
121
+ pip install -r requirements.txt;
122
+ ```
123
+ The exact dependencies were obtained with `pip freeze` and can be found in `exact_requirements.txt`.
124
+
125
+ ## How to use:
126
+
127
+ See the Jupyter notebooks with examples in the `./examples` folder.
128
+
129
+ ### 1. text2image
130
+
+ ```python
+ import sys
+ sys.path.append('..')  # make the repo root importable from ./examples
+
+ import torch
+ from kandinsky3 import get_T2I_pipeline
+
+ device_map = torch.device('cuda:0')
+ # Per-component precision: only the text encoder runs in half precision
+ dtype_map = {
+     'unet': torch.float32,
+     'text_encoder': torch.float16,
+     'movq': torch.float32,
+ }
+
+ t2i_pipe = get_T2I_pipeline(
+     device_map, dtype_map,
+ )
+ res = t2i_pipe("A cute corgi lives in a house made out of sushi.")
+
+ res[0]
+ ```
152
+
153
+ ### 2. inpainting
154
+
+ ```python
+ import torch
+ from kandinsky3 import get_inpainting_pipeline
+
+ device_map = torch.device('cuda:0')
+ # Here the U-Net also runs in half precision
+ dtype_map = {
+     'unet': torch.float16,
+     'text_encoder': torch.float16,
+     'movq': torch.float32,
+ }
+
+ inp_pipe = get_inpainting_pipeline(
+     device_map, dtype_map,
+ )
+
+ image = ... # PIL Image
+ mask = ... # Numpy array (HxW). Set 1 where image should be masked
+ image = inp_pipe("A cute corgi lives in a house made out of sushi.", image, mask)
+ ```
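
The `mask` placeholder above could be filled in as follows. This is a minimal sketch that builds a rectangular mask with NumPy: 1 inside the region to repaint, 0 elsewhere. The helper `rectangle_mask`, the `float32` dtype and the 1024x1024 size are assumptions for illustration; the README only specifies an HxW NumPy array.

```python
import numpy as np


def rectangle_mask(height, width, top, left, bottom, right):
    """Return an HxW array with 1 inside the rectangle and 0 elsewhere."""
    # dtype is an assumption; the README does not state which one is expected
    mask = np.zeros((height, width), dtype=np.float32)
    mask[top:bottom, left:right] = 1.0  # 1 marks pixels to be repainted
    return mask


# Mask the central 256x256 region of a hypothetical 1024x1024 image
mask = rectangle_mask(1024, 1024, top=384, left=384, bottom=640, right=640)
print(mask.shape, int(mask.sum()))
```

Note that a PIL image reports its size as `(width, height)`, while the mask is indexed as `(height, width)`, so for a real image the mask shape would be `(image.size[1], image.size[0])`.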
173
+
174
+ ## Examples of generations
175
+
176
+ <hr>
177
+
178
+ <table class="center">
179
+ <tr>
180
+ <td><img src="assets/photo_8.jpg" raw=true></td>
181
+ <td><img src="assets/photo_15.jpg"></td>
182
+ <td><img src="assets/photo_16.jpg"></td>
183
+ <td><img src="assets/photo_17.jpg"></td>
184
+ </tr>
185
+ <tr>
186
+ <td width=25% align="center">"A beautiful landscape outdoors scene in the crochet knitting art style, drawing in style by Alfons Mucha"</td>
187
+ <td width=25% align="center">"gorgeous phoenix, cosmic, darkness, epic, cinematic, moonlight, stars, high - definition, texture,Oscar-Claude Monet"</td>
188
+ <td width=25% align="center">"a yellow house at the edge of the danish fjord, in the style of eiko ojala, ingrid baars, ad posters, mountainous vistas, george ault, realistic details, dark white and dark gray, 4k"</td>
189
+ <td width=25% align="center">"dragon fruit head, upper body, realistic, illustration by Joshua Hoffine Norman Rockwell, scary, creepy, biohacking, futurism, Zaha Hadid style"</td>
190
+ </tr>
191
+ <tr>
192
+ <td><img src="assets/photo_2.jpg" raw=true></td>
193
+ <td><img src="assets/photo_19.jpg"></td>
194
+ <td><img src="assets/photo_13.jpg"></td>
195
+ <td><img src="assets/photo_14.jpg"></td>
196
+ </tr>
197
+ <tr>
198
+ <td width=25% align="center">"Amazing playful nice cute strawberry character, dynamic poze, surreal fantazy garden background, gorgeous masterpice, award winning photo, soft natural lighting, 3d, Blender, Octane render, tilt - shift, deep field, colorful, I can't believe how beautiful this is, colorful, cute and sweet baby - loved photo"</td>
199
+ <td width=25% align="center">"beautiful fairy-tale desert, in the sky a wave of sand merges with the milky way, stars, cosmism, digital art, 8k"</td>
200
+ <td width=25% align="center">"Car, mustang, movie, person, poster, car cover, person, in the style of alessandro gottardo, gold and cyan, gerald harvey jones, reflections, highly detailed illustrations, industrial urban scenes"</td>
201
+ <td width=25% align="center">"cloud in blue sky, a red lip, collage art, shuji terayama, dreamy objects, surreal, criterion collection, showa era, intricate details, mirror"</td>
202
+ </tr>
203
+
204
+ </table>
205
+
206
+ <hr>
207
+
208
+ ## Authors
209
+
210
+ + Vladimir Arkhipkin: [Github](https://github.com/oriBetelgeuse)
211
+ + Anastasia Maltseva: [Github](https://github.com/NastyaMittseva)
212
+ + Andrei Filatov: [Github](https://github.com/anvilarth)
213
+ + Igor Pavlov: [Github](https://github.com/boomb0om)
214
+ + Julia Agafonova
215
+ + Arseniy Shakhmatov: [Github](https://github.com/cene555), [Blog](https://t.me/gradientdip)
216
+ + Andrey Kuznetsov: [Github](https://github.com/kuznetsoffandrey), [Blog](https://t.me/complete_ai)
217
+ + Denis Dimitrov: [Github](https://github.com/denndimitrov), [Blog](https://t.me/dendi_math_ai)
218
+
219
+ ## Citation
220
+ ```
221
+ @misc{arkhipkin2023kandinsky,
222
+ title={Kandinsky 3.0 Technical Report},
223
+ author={Vladimir Arkhipkin and Andrei Filatov and Viacheslav Vasilev and Anastasia Maltseva and Said Azizov and Igor Pavlov and Julia Agafonova and Andrey Kuznetsov and Denis Dimitrov},
224
+ year={2023},
225
+ eprint={2312.03511},
226
+ archivePrefix={arXiv},
227
+ primaryClass={cs.CV}
228
+ }
229
+ ```
230
+