svjack committed on
Commit 8b73e77
1 Parent(s): 9a9fe46

Create README.md

  1. README.md +370 -0
README.md ADDED
@@ -0,0 +1,370 @@
+ # Chinese Stable Diffusion Pokemon Model Card
+
+ <!--
+ ![rinna](https://github.com/rinnakk/japanese-clip/blob/master/data/rinna.png?raw=true)
+ -->
+
+ Stable-Diffusion-Pokemon-zh is a Chinese-specific latent text-to-image diffusion model that generates Pokemon-style images from Chinese text prompts.
+
+ This model was trained using [diffusers](https://github.com/huggingface/diffusers), a library for text-to-image diffusion models.
+ For more information about our training method, see [train_zh_model.py]().
+
+ <!--
+ [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/rinnakk/japanese-stable-diffusion/blob/master/scripts/txt2img.ipynb)
+ -->
+
+ ## Model Details
+ - **Developed by:** Zhipeng Yang
+ - **Model type:** Diffusion-based text-to-image generation model
+ - **Language(s):** Chinese
+ - **License:** [The CreativeML OpenRAIL M license](https://huggingface.co/spaces/CompVis/stable-diffusion-license) is an [Open RAIL M license](https://www.licenses.ai/blog/2022/8/18/naming-convention-of-responsible-ai-licenses), adapted from the work that [BigScience](https://bigscience.huggingface.co/) and [the RAIL Initiative](https://www.licenses.ai/) are jointly carrying out in the area of responsible AI licensing. See also [the article about the BLOOM Open RAIL license](https://bigscience.huggingface.co/blog/the-bigscience-rail-license) on which our license is based.
+ - **Model Description:** This is a model that can be used to generate and modify images based on text prompts. It is a [Latent Diffusion Model (LDM)](https://arxiv.org/abs/2112.10752) that uses [Stable Diffusion](https://github.com/CompVis/stable-diffusion) as its pre-trained base model.
+ - **Resources for more information:** [https://github.com/svjack/Stable-Diffusion-Pokemon](https://github.com/svjack/Stable-Diffusion-Pokemon)
+
+ ## Examples
+
+ First, install the required packages as follows. The `japanese-stable-diffusion` package is a modified version of [🤗's Diffusers library](https://github.com/huggingface/diffusers); it is used here to run Chinese Stable Diffusion.
+
+ ```bash
+ pip install git+https://github.com/rinnakk/japanese-stable-diffusion
+ pip install diffusers==0.4.1
+ sudo apt-get install git-lfs
+ git clone https://huggingface.co/svjack/Stable-Diffusion-Pokemon-zh
+ ```
+
+ Run this command to log in with your HF Hub token if you haven't already:
+
+ ```bash
+ huggingface-cli login
+ ```
+
+ Run the pipeline with the `LMSDiscreteScheduler`:
+
+ ```python
+ import inspect
+ from typing import Callable, List, Optional, Union
+
+ import torch
+ from diffusers import LMSDiscreteScheduler
+ from transformers import BertForTokenClassification, BertTokenizer
+
+ # The code below relies on these wildcard imports for StableDiffusionPipeline,
+ # StableDiffusionPipelineOutput, AutoencoderKL, UNet2DConditionModel,
+ # StableDiffusionSafetyChecker, CLIPFeatureExtractor and `logger`.
+ from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion import *
+ from japanese_stable_diffusion.pipeline_stable_diffusion import *
+
+ class StableDiffusionPipelineWrapper(StableDiffusionPipeline):
+
+     @torch.no_grad()
+     def __call__(
+         self,
+         prompt: Union[str, List[str]],
+         height: int = 512,
+         width: int = 512,
+         num_inference_steps: int = 50,
+         guidance_scale: float = 7.5,
+         negative_prompt: Optional[Union[str, List[str]]] = None,
+         num_images_per_prompt: Optional[int] = 1,
+         eta: float = 0.0,
+         generator: Optional[torch.Generator] = None,
+         latents: Optional[torch.FloatTensor] = None,
+         output_type: Optional[str] = "pil",
+         return_dict: bool = True,
+         callback: Optional[Callable[[int, int, torch.FloatTensor], None]] = None,
+         callback_steps: Optional[int] = 1,
+         **kwargs,
+     ):
+         if isinstance(prompt, str):
+             batch_size = 1
+         elif isinstance(prompt, list):
+             batch_size = len(prompt)
+         else:
+             raise ValueError(f"`prompt` has to be of type `str` or `list` but is {type(prompt)}")
+
+         if height % 8 != 0 or width % 8 != 0:
+             raise ValueError(f"`height` and `width` have to be divisible by 8 but are {height} and {width}.")
+
+         if (callback_steps is None) or (
+             callback_steps is not None and (not isinstance(callback_steps, int) or callback_steps <= 0)
+         ):
+             raise ValueError(
+                 f"`callback_steps` has to be a positive integer but is {callback_steps} of type"
+                 f" {type(callback_steps)}."
+             )
+
+         # get prompt text embeddings
+         text_inputs = self.tokenizer(
+             prompt,
+             padding="max_length",
+             max_length=self.tokenizer.model_max_length,
+             return_tensors="pt",
+         )
+         text_input_ids = text_inputs.input_ids
+
+         if text_input_ids.shape[-1] > self.tokenizer.model_max_length:
+             removed_text = self.tokenizer.batch_decode(text_input_ids[:, self.tokenizer.model_max_length :])
+             logger.warning(
+                 "The following part of your input was truncated because CLIP can only handle sequences up to"
+                 f" {self.tokenizer.model_max_length} tokens: {removed_text}"
+             )
+             text_input_ids = text_input_ids[:, : self.tokenizer.model_max_length]
+         text_embeddings = self.text_encoder(text_input_ids.to(self.device))[0]
+
+         # duplicate text embeddings for each generation per prompt, using mps friendly method
+         bs_embed, seq_len, _ = text_embeddings.shape
+         text_embeddings = text_embeddings.repeat(1, num_images_per_prompt, 1)
+         text_embeddings = text_embeddings.view(bs_embed * num_images_per_prompt, seq_len, -1)
+
+         # here `guidance_scale` is defined analogously to the guidance weight `w` of equation (2)
+         # of the Imagen paper: https://arxiv.org/pdf/2205.11487.pdf . `guidance_scale = 1`
+         # corresponds to doing no classifier free guidance.
+         do_classifier_free_guidance = guidance_scale > 1.0
+         # get unconditional embeddings for classifier free guidance
+         if do_classifier_free_guidance:
+             uncond_tokens: List[str]
+             if negative_prompt is None:
+                 uncond_tokens = [""]
+             elif type(prompt) is not type(negative_prompt):
+                 raise TypeError(
+                     f"`negative_prompt` should be the same type to `prompt`, but got {type(negative_prompt)} !="
+                     f" {type(prompt)}."
+                 )
+             elif isinstance(negative_prompt, str):
+                 uncond_tokens = [negative_prompt]
+             elif batch_size != len(negative_prompt):
+                 raise ValueError(
+                     f"`negative_prompt`: {negative_prompt} has batch size {len(negative_prompt)}, but `prompt`:"
+                     f" {prompt} has batch size {batch_size}. Please make sure that passed `negative_prompt` matches"
+                     " the batch size of `prompt`."
+                 )
+             else:
+                 uncond_tokens = negative_prompt
+
+             max_length = text_input_ids.shape[-1]
+             uncond_input = self.tokenizer(
+                 uncond_tokens,
+                 padding="max_length",
+                 max_length=max_length,
+                 truncation=True,
+                 return_tensors="pt",
+             )
+             uncond_embeddings = self.text_encoder(uncond_input.input_ids.to(self.device))[0]
+
+             # duplicate unconditional embeddings for each generation per prompt, using mps friendly method
+             seq_len = uncond_embeddings.shape[1]
+             uncond_embeddings = uncond_embeddings.repeat(batch_size, num_images_per_prompt, 1)
+             uncond_embeddings = uncond_embeddings.view(batch_size * num_images_per_prompt, seq_len, -1)
+
+             # For classifier free guidance, we need to do two forward passes.
+             # Here we concatenate the unconditional and text embeddings into a single batch
+             # to avoid doing two forward passes
+             text_embeddings = torch.cat([uncond_embeddings, text_embeddings])
+
+         # get the initial random noise unless the user supplied it
+
+         # Unlike in other pipelines, latents need to be generated in the target device
+         # for 1-to-1 results reproducibility with the CompVis implementation.
+         # However this currently doesn't work in `mps`.
+         latents_shape = (batch_size * num_images_per_prompt, self.unet.in_channels, height // 8, width // 8)
+         latents_dtype = text_embeddings.dtype
+         if latents is None:
+             if self.device.type == "mps":
+                 # randn does not work reproducibly on mps
+                 latents = torch.randn(latents_shape, generator=generator, device="cpu", dtype=latents_dtype).to(
+                     self.device
+                 )
+             else:
+                 latents = torch.randn(latents_shape, generator=generator, device=self.device, dtype=latents_dtype)
+         else:
+             if latents.shape != latents_shape:
+                 raise ValueError(f"Unexpected latents shape, got {latents.shape}, expected {latents_shape}")
+             latents = latents.to(self.device)
+
+         # set timesteps
+         self.scheduler.set_timesteps(num_inference_steps)
+
+         # Some schedulers like PNDM have timesteps as arrays
+         # It's more optimized to move all timesteps to correct device beforehand
+         timesteps_tensor = self.scheduler.timesteps.to(self.device)
+
+         # scale the initial noise by the standard deviation required by the scheduler
+         latents = latents * self.scheduler.init_noise_sigma
+
+         # prepare extra kwargs for the scheduler step, since not all schedulers have the same signature
+         # eta (η) is only used with the DDIMScheduler, it will be ignored for other schedulers.
+         # eta corresponds to η in DDIM paper: https://arxiv.org/abs/2010.02502
+         # and should be between [0, 1]
+         accepts_eta = "eta" in set(inspect.signature(self.scheduler.step).parameters.keys())
+         extra_step_kwargs = {}
+         if accepts_eta:
+             extra_step_kwargs["eta"] = eta
+
+         for i, t in enumerate(self.progress_bar(timesteps_tensor)):
+             # expand the latents if we are doing classifier free guidance
+             latent_model_input = torch.cat([latents] * 2) if do_classifier_free_guidance else latents
+             latent_model_input = self.scheduler.scale_model_input(latent_model_input, t)
+
+             # predict the noise residual
+             # On the first step only, zero-pad the last axis of the text embeddings
+             # by 768 - 512 = 256 features (presumably to match the width the UNet expects).
+             eh_shape = text_embeddings.shape
+             if i == 0:
+                 eh_pad = torch.zeros((eh_shape[0], eh_shape[1], 768 - 512))
+                 eh_pad = eh_pad.to(self.device)
+                 text_embeddings = torch.cat([text_embeddings, eh_pad], -1)
+
+             noise_pred = self.unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample
+
+             # perform guidance
+             if do_classifier_free_guidance:
+                 noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
+                 noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)
+
+             # compute the previous noisy sample x_t -> x_t-1
+             latents = self.scheduler.step(noise_pred, t, latents, **extra_step_kwargs).prev_sample
+
+             # call the callback, if provided
+             if callback is not None and i % callback_steps == 0:
+                 callback(i, t, latents)
+
+         latents = 1 / 0.18215 * latents
+         image = self.vae.decode(latents).sample
+
+         image = (image / 2 + 0.5).clamp(0, 1)
+
+         # we always cast to float32 as this does not cause significant overhead and is compatible with bfloat16
+         image = image.cpu().permute(0, 2, 3, 1).float().numpy()
+
+         if self.safety_checker is not None:
+             safety_checker_input = self.feature_extractor(self.numpy_to_pil(image), return_tensors="pt").to(
+                 self.device
+             )
+             image, has_nsfw_concept = self.safety_checker(
+                 images=image, clip_input=safety_checker_input.pixel_values.to(text_embeddings.dtype)
+             )
+         else:
+             has_nsfw_concept = None
+
+         if output_type == "pil":
+             image = self.numpy_to_pil(image)
+
+         if not return_dict:
+             return (image, has_nsfw_concept)
+
+         return StableDiffusionPipelineOutput(images=image, nsfw_content_detected=has_nsfw_concept)
+
+
+ scheduler = LMSDiscreteScheduler(
+     beta_start=0.00085,
+     beta_end=0.012,
+     beta_schedule="scaled_linear",
+     num_train_timesteps=1000,
+ )
+
+ # pretrained_model_name_or_path = "zh_model_20000"
+ # The path below is the local clone of the model repository:
+ #   sudo apt-get install git-lfs
+ #   git clone https://huggingface.co/svjack/Stable-Diffusion-Pokemon-zh
+ pretrained_model_name_or_path = "Stable-Diffusion-Pokemon-zh"
+
+ tokenizer = BertTokenizer.from_pretrained(pretrained_model_name_or_path, subfolder="tokenizer")
+ text_encoder = BertForTokenClassification.from_pretrained(pretrained_model_name_or_path, subfolder="text_encoder")
+
+ vae = AutoencoderKL.from_pretrained(pretrained_model_name_or_path, subfolder="vae")
+ unet = UNet2DConditionModel.from_pretrained(pretrained_model_name_or_path, subfolder="unet")
+
+ tokenizer.model_max_length = 77
+ pipeline_wrap = StableDiffusionPipelineWrapper(
+     text_encoder=text_encoder,
+     vae=vae,
+     unet=unet,
+     tokenizer=tokenizer,
+     scheduler=scheduler,
+     safety_checker=StableDiffusionSafetyChecker.from_pretrained("CompVis/stable-diffusion-safety-checker"),
+     feature_extractor=CLIPFeatureExtractor.from_pretrained("openai/clip-vit-base-patch32"),
+ )
+ # Replace the safety checker with a pass-through that returns the images unchanged.
+ pipeline_wrap.safety_checker = lambda images, clip_input: (images, False)
+ pipeline_wrap = pipeline_wrap.to("cuda")
+
+ # Prompt: "a cartoon character wearing a potted plant on its head"
+ imgs = pipeline_wrap(
+     "一个头上戴着盆栽的卡通人物",
+     num_inference_steps=100,
+ )
+ image = imgs.images[0]
+
+ image.save("output.png")
+ ```
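
One notable step in the wrapper above is the zero-padding of the text embeddings from 512 to 768 features before they are passed to the UNet. The sketch below illustrates that padding without any framework, using nested Python lists in place of tensors. The helper name `pad_last_dim` and its `target_dim` parameter are hypothetical and are not part of this model's code or of diffusers.

```python
from typing import List

Embeddings = List[List[List[float]]]  # [batch][sequence][feature]


def pad_last_dim(embeddings: Embeddings, target_dim: int = 768) -> Embeddings:
    """Zero-pad every feature vector on the right up to `target_dim` entries.

    Hypothetical helper for illustration only; the pipeline above does the
    equivalent on tensors by concatenating zeros along the last axis.
    """
    padded: Embeddings = []
    for sequence in embeddings:
        padded_sequence = []
        for vector in sequence:
            missing = target_dim - len(vector)
            if missing < 0:
                raise ValueError(
                    f"vector has {len(vector)} features, more than target_dim={target_dim}"
                )
            # Original values stay in place; zeros fill the new trailing slots.
            padded_sequence.append(list(vector) + [0.0] * missing)
        padded.append(padded_sequence)
    return padded


# Toy input: batch of 1, sequence length 2, 512 features per token.
toy = [[[1.0] * 512, [2.0] * 512]]
result = pad_last_dim(toy, target_dim=768)
print(len(result), len(result[0]), len(result[0][0]))  # 1 2 768
```

Because the padding only appends zeros, the first 512 values of each token embedding are unchanged; only the last dimension grows, while the batch and sequence dimensions stay the same.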
+
+ <!--
+ _Note: `JapaneseStableDiffusionPipeline` is almost the same as diffusers' `StableDiffusionPipeline`, with a few lines added to initialize our models properly._
+
+
+ ## Misuse, Malicious Use, and Out-of-Scope Use
+ _Note: This section is taken from the [DALLE-MINI model card](https://huggingface.co/dalle-mini/dalle-mini), but applies in the same way to Stable Diffusion v1._
+
+
+ The model should not be used to intentionally create or disseminate images that create hostile or alienating environments for people. This includes generating images that people would foreseeably find disturbing, distressing, or offensive; or content that propagates historical or current stereotypes.
+
+ ### Out-of-Scope Use
+ The model was not trained to produce factual or true representations of people or events, and therefore using the model to generate such content is out-of-scope for the abilities of this model.
+
+ ### Misuse and Malicious Use
+ Using the model to generate content that is cruel to individuals is a misuse of this model. This includes, but is not limited to:
+
+ - Generating demeaning, dehumanizing, or otherwise harmful representations of people or their environments, cultures, religions, etc.
+ - Intentionally promoting or propagating discriminatory content or harmful stereotypes.
+ - Impersonating individuals without their consent.
+ - Sexual content without consent of the people who might see it.
+ - Mis- and disinformation.
+ - Representations of egregious violence and gore.
+ - Sharing of copyrighted or licensed material in violation of its terms of use.
+ - Sharing content that is an alteration of copyrighted or licensed material in violation of its terms of use.
+
+ ## Limitations and Bias
+
+ ### Limitations
+
+ - The model does not achieve perfect photorealism.
+ - The model cannot render legible text.
+ - The model does not perform well on more difficult tasks which involve compositionality, such as rendering an image corresponding to “A red cube on top of a blue sphere”.
+ - Faces and people in general may not be generated properly.
+ - The model was trained mainly with Japanese captions and will not work as well in other languages.
+ - The autoencoding part of the model is lossy.
+ - The model was trained on a subset of the large-scale dataset [LAION-5B](https://laion.ai/blog/laion-5b/), which contains adult material and is not fit for product use without additional safety mechanisms and considerations.
+ - No additional measures were used to deduplicate the dataset. As a result, we observe some degree of memorization for images that are duplicated in the training data. The training data can be searched at [https://rom1504.github.io/clip-retrieval/](https://rom1504.github.io/clip-retrieval/) to possibly assist in the detection of memorized images.
+
+ ### Bias
+
+ While the capabilities of image generation models are impressive, they can also reinforce or exacerbate social biases.
+ Japanese Stable Diffusion was trained on Japanese datasets including [LAION-5B](https://laion.ai/blog/laion-5b/) with Japanese captions,
+ which consist of images that are primarily limited to Japanese descriptions.
+ Texts and images from communities and cultures that use other languages are likely to be insufficiently accounted for.
+ This affects the overall output of the model.
+ Further, the ability of the model to generate content with non-Japanese prompts is significantly worse than with Japanese-language prompts.
+
+ ### Safety Module
+
+ The intended use of this model is with the [Safety Checker](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/safety_checker.py) in Diffusers.
+ This checker works by checking model outputs against known hard-coded NSFW concepts.
+ The concepts are intentionally hidden to reduce the likelihood of reverse-engineering this filter.
+ Specifically, the checker compares the class probability of harmful concepts in the embedding space of the `CLIPTextModel` *after generation* of the images.
+ The concepts are passed into the model with the generated image and compared to a hand-engineered weight for each NSFW concept.
+
+
+ ## Training
+
+ **Training Data**
+ We used the following dataset for training the model:
+
+ - Approximately 100 million images with Japanese captions, including the Japanese subset of [LAION-5B](https://laion.ai/blog/laion-5b/).
+
+ **Training Procedure**
+ Japanese Stable Diffusion has the same architecture as Stable Diffusion and was trained using Stable Diffusion. Because Stable Diffusion was trained on an English dataset and the CLIP tokenizer is designed mainly for English, we used two stages to transfer it to a language-specific model, inspired by [PITI](https://arxiv.org/abs/2205.12952).
+
+ 1. Train a Japanese-specific text encoder with our Japanese tokenizer from scratch, with the latent diffusion model fixed. This stage is expected to map Japanese captions to Stable Diffusion's latent space.
+ 2. Fine-tune the text encoder and the latent diffusion model jointly. This stage is expected to make the generated images more Japanese in style.
+
+ [//]: # (_Note: Japanese Stable Diffusion is still running and this checkpoint is the current best one. We might update to a better checkpoint via this repository._)
+ -->