Navyabhat committed
Commit dcfb6cf • 1 Parent(s): 81376a7

Upload 10 files

README.md CHANGED
@@ -1,13 +1,89 @@
 ---
-title: Session20
-emoji: 🏃
-colorFrom: red
-colorTo: green
+title: "ERA SESSION20 - Stable Diffusion: Generative Art with Guidance"
+emoji: 🌍
+colorFrom: indigo
+colorTo: pink
 sdk: gradio
-sdk_version: 3.47.1
+sdk_version: 3.48.0
 app_file: app.py
 pinned: false
 license: mit
 ---
 
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+**Styles Used:**
+1. [Oil style](https://huggingface.co/sd-concepts-library/oil-style)
+2. [Xyz](https://huggingface.co/sd-concepts-library/xyz)
+3. [Allante](https://huggingface.co/sd-concepts-library/style-of-marc-allante)
+4. [Moebius](https://huggingface.co/sd-concepts-library/moebius)
+5. [Polygons](https://huggingface.co/sd-concepts-library/low-poly-hd-logos-icons)
+
+### Results of experiments with different styles
+**Prompt:** `"a cat and dog in the style of cs"` \
+_"cs" in the prompt is a "custom style" placeholder token whose embedding is replaced by each of the concept embeddings shown below._
+![image](https://github.com/RaviNaik/ERA-SESSION20/assets/23289802/1effe375-6ef4-4adc-be7b-d6311fdaa50d)
+
+---
+**Prompt:** `"dolphin swimming on Mars in the style of cs"`
+![image](https://github.com/RaviNaik/ERA-SESSION20/assets/23289802/2cd32248-4233-42c0-97c0-00e1ae8fdc85)
+
+### Results of experiments with guidance loss functions
+**Prompt:** `"a mouse in the style of cs"`
+**Loss Function:**
+```python
+def loss_fn(images):
+    return images.mean()
+```
+![image](https://github.com/RaviNaik/ERA-SESSION20/assets/23289802/c9d46e14-44bb-4ea7-88a4-26ef46344fce)
+---
+```python
+def loss_fn(images):
+    return -images.median() / 3
+```
+![image](https://github.com/RaviNaik/ERA-SESSION20/assets/23289802/2649e4f6-3de5-4e54-8f22-3d65874b7b07)
+---
+```python
+def loss_fn(images):
+    error = (images - images.min()) / 255 * (images.max() - images.min())
+    return error.mean()
+```
+![image](https://github.com/RaviNaik/ERA-SESSION20/assets/23289802/6399c780-e9b7-42f8-8d90-44c8b40d5265)
+---
+**Prompt:** `"angry german shephard in the style of cs"`
+```python
+def loss_fn(images):
+    error1 = torch.abs(images[:, 0] - 0.9)
+    error2 = torch.abs(images[:, 1] - 0.9)
+    error3 = torch.abs(images[:, 2] - 0.9)
+    return (
+        torch.sin(error1.mean()) + torch.sin(error2.mean()) + torch.sin(error3.mean())
+    ) / 3
+```
+![image](https://github.com/RaviNaik/ERA-SESSION20/assets/23289802/fa7d30ed-4efd-4504-b89c-94e093f51f9c)
+
+---
+**Prompt:** `"A campfire (oil on canvas)"`
+```python
+def loss_fn(images):
+    error1 = torch.abs(images[:, 0] - 0.9)
+    error2 = torch.abs(images[:, 1] - 0.9)
+    error3 = torch.abs(images[:, 2] - 0.9)
+    return (
+        torch.sin((error1 * error2 * error3)).mean()
+        + torch.cos((error1 * error2 * error3)).mean()
+    )
+```
+![image](https://github.com/RaviNaik/ERA-SESSION20/assets/23289802/88382dae-6701-4103-a664-ed17727b690f)
+
+---
+```python
+def loss_fn(images):
+    error1 = torch.abs(images[:, 0] - 0.9)
+    error2 = torch.abs(images[:, 1] - 0.9)
+    error3 = torch.abs(images[:, 2] - 0.9)
+    return (
+        torch.sin(error1.mean()) + torch.sin(error2.mean()) + torch.sin(error3.mean())
+    ) / 3
+```
+![image](https://github.com/RaviNaik/ERA-SESSION20/assets/23289802/0ab3edad-579d-4821-b992-6c18b61bd444)
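The loss functions above steer generation through an extra gradient step on the latents every few denoising iterations, as implemented in `src/stable_diffusion.py` added in this commit. A minimal sketch of that step, with illustrative function and argument names that are not part of the commit:

```python
import torch


def guidance_step(latents, decode_fn, loss_fn, sigma, loss_scale=200):
    # decode_fn maps latents to the predicted denoised images in [0, 1]
    # (predicted x0 followed by VAE decoding, as in generate_with_embs).
    latents = latents.detach().requires_grad_()
    loss = loss_fn(decode_fn(latents)) * loss_scale
    # Gradient of the guidance loss with respect to the latents
    grad = torch.autograd.grad(loss, latents)[0]
    # Nudge the latents against the gradient, scaled by the current noise level
    return latents.detach() - grad * sigma**2
```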
app.py CHANGED
@@ -88,4 +88,4 @@ with gr.Blocks() as app:
     outputs=[lossless_gallery, lossy_gallery],
 )
 
-app.launch()
+app.launch()
concept_libs/coffeemachine.bin ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:cc3a85dc9cbdf6ab5fca4056c473da1b632c0565030be918682ce3e62095b4b1
+size 3840
concept_libs/collage_style.bin ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:b143c4841c5f2d39d0eb2015d62c17d1b18da9bb0a42c76320df7acfe1e144bf
+size 3840
concept_libs/cube.bin ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:8a6d6394f0cd38847259c42746a6b0e50ca1e76e6ddc8e217ff14f2feb7dbca4
+size 3819
concept_libs/jerrymouse2.bin ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:a9713d9367f1faa6ebd753db5c8a209c565be0b25e32051c723c4533dd9df605
+size 3840
concept_libs/zero.bin ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:78286aa910deafe4e46c6e38a86f464a246aef95ad5611a756dd99405f418a85
+size 3819
requirements.txt CHANGED
Binary files a/requirements.txt and b/requirements.txt differ
 
src/stable_diffusion.py ADDED
@@ -0,0 +1,222 @@
+import torch
+from diffusers import AutoencoderKL, LMSDiscreteScheduler, UNet2DConditionModel
+from transformers import CLIPTextModel, CLIPTokenizer
+from PIL import Image
+from tqdm import tqdm
+
+
+class StableDiffusion:
+    def __init__(
+        self,
+        vae_arch="CompVis/stable-diffusion-v1-4",
+        tokenizer_arch="openai/clip-vit-large-patch14",
+        encoder_arch="openai/clip-vit-large-patch14",
+        unet_arch="CompVis/stable-diffusion-v1-4",
+        device="cpu",
+        height=512,
+        width=512,
+        num_inference_steps=30,
+        guidance_scale=7.5,
+        manual_seed=1,
+    ) -> None:
+        self.height = height  # default height of Stable Diffusion
+        self.width = width  # default width of Stable Diffusion
+        self.num_inference_steps = num_inference_steps  # Number of denoising steps
+        self.guidance_scale = guidance_scale  # Scale for classifier-free guidance
+        self.device = device
+        self.manual_seed = manual_seed
+
+        vae = AutoencoderKL.from_pretrained(vae_arch, subfolder="vae")
+        # Load the tokenizer and text encoder to tokenize and encode the text.
+        self.tokenizer = CLIPTokenizer.from_pretrained(tokenizer_arch)
+        text_encoder = CLIPTextModel.from_pretrained(encoder_arch)
+
+        # The UNet model for generating the latents.
+        unet = UNet2DConditionModel.from_pretrained(unet_arch, subfolder="unet")
+
+        # The noise scheduler
+        self.scheduler = LMSDiscreteScheduler(
+            beta_start=0.00085,
+            beta_end=0.012,
+            beta_schedule="scaled_linear",
+            num_train_timesteps=1000,
+        )
+
+        # To the GPU we go!
+        self.vae = vae.to(self.device)
+        self.text_encoder = text_encoder.to(self.device)
+        self.unet = unet.to(self.device)
+
+        self.token_emb_layer = text_encoder.text_model.embeddings.token_embedding
+        pos_emb_layer = text_encoder.text_model.embeddings.position_embedding
+        position_ids = text_encoder.text_model.embeddings.position_ids[:, :77]
+        self.position_embeddings = pos_emb_layer(position_ids)
+
+    def get_output_embeds(self, input_embeddings):
+        # CLIP's text model uses a causal mask, so we prepare it here:
+        bsz, seq_len = input_embeddings.shape[:2]
+        causal_attention_mask = (
+            self.text_encoder.text_model._build_causal_attention_mask(
+                bsz, seq_len, dtype=input_embeddings.dtype
+            )
+        )
+
+        # Getting the output embeddings involves calling the model with output_hidden_states=True
+        # so that it doesn't just return the pooled final predictions:
+        encoder_outputs = self.text_encoder.text_model.encoder(
+            inputs_embeds=input_embeddings,
+            attention_mask=None,  # We aren't using an attention mask, so that can be None
+            causal_attention_mask=causal_attention_mask.to(self.device),
+            output_attentions=None,
+            output_hidden_states=True,  # We want the output embs, not the final output
+            return_dict=None,
+        )
+
+        # We're interested in the output hidden state only
+        output = encoder_outputs[0]
+
+        # There is a final layer norm we need to pass these through
+        output = self.text_encoder.text_model.final_layer_norm(output)
+
+        # And now they're ready!
+        return output
+
+    def set_timesteps(self, scheduler, num_inference_steps):
+        scheduler.set_timesteps(num_inference_steps)
+        scheduler.timesteps = scheduler.timesteps.to(torch.float32)
+
+    def latents_to_pil(self, latents):
+        # batch of latents -> list of images
+        latents = (1 / 0.18215) * latents
+        with torch.no_grad():
+            image = self.vae.decode(latents).sample
+        image = (image / 2 + 0.5).clamp(0, 1)
+        image = image.detach().cpu().permute(0, 2, 3, 1).numpy()
+        images = (image * 255).round().astype("uint8")
+        pil_images = [Image.fromarray(image) for image in images]
+        return pil_images
+
+    def generate_with_embs(self, text_embeddings, text_input, loss_fn, loss_scale):
+        generator = torch.manual_seed(
+            self.manual_seed
+        )  # Seed generator to create the initial latent noise
+        batch_size = 1
+
+        max_length = text_input.input_ids.shape[-1]
+        uncond_input = self.tokenizer(
+            [""] * batch_size,
+            padding="max_length",
+            max_length=max_length,
+            return_tensors="pt",
+        )
+        with torch.no_grad():
+            uncond_embeddings = self.text_encoder(
+                uncond_input.input_ids.to(self.device)
+            )[0]
+        text_embeddings = torch.cat([uncond_embeddings, text_embeddings])
+
+        # Prep scheduler
+        self.set_timesteps(self.scheduler, self.num_inference_steps)
+
+        # Prep latents
+        latents = torch.randn(
+            (batch_size, self.unet.in_channels, self.height // 8, self.width // 8),
+            generator=generator,
+        )
+        latents = latents.to(self.device)
+        latents = latents * self.scheduler.init_noise_sigma
+
+        # Denoising loop
+        for i, t in tqdm(
+            enumerate(self.scheduler.timesteps), total=len(self.scheduler.timesteps)
+        ):
+            # Expand the latents for classifier-free guidance to avoid doing two forward passes.
+            latent_model_input = torch.cat([latents] * 2)
+            sigma = self.scheduler.sigmas[i]
+            latent_model_input = self.scheduler.scale_model_input(latent_model_input, t)
+
+            # Predict the noise residual
+            with torch.no_grad():
+                noise_pred = self.unet(
+                    latent_model_input, t, encoder_hidden_states=text_embeddings
+                )["sample"]
+
+            # Perform classifier-free guidance
+            noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
+            noise_pred = noise_pred_uncond + self.guidance_scale * (
+                noise_pred_text - noise_pred_uncond
+            )
+            if i % 5 == 0:
+                # Require grad on the latents
+                latents = latents.detach().requires_grad_()
+
+                # Get the predicted x0:
+                # latents_x0 = latents - sigma * noise_pred
+                latents_x0 = self.scheduler.step(
+                    noise_pred, t, latents
+                ).pred_original_sample
+
+                # Decode to image space
+                denoised_images = (
+                    self.vae.decode((1 / 0.18215) * latents_x0).sample / 2 + 0.5
+                )  # range (0, 1)
+
+                # Calculate loss
+                loss = loss_fn(denoised_images) * loss_scale
+
+                # Occasionally print it out
+                # if i % 10 == 0:
+                #     print(i, "loss:", loss.item())
+
+                # Get gradient
+                cond_grad = torch.autograd.grad(loss, latents)[0]
+
+                # Modify the latents based on this gradient
+                latents = latents.detach() - cond_grad * sigma**2
+                self.scheduler._step_index = self.scheduler._step_index - 1
+
+            # Compute the previous noisy sample x_t -> x_t-1
+            latents = self.scheduler.step(noise_pred, t, latents).prev_sample
+
+        return self.latents_to_pil(latents)[0]
+
+    def generate_image(
+        self,
+        prompt="A campfire (oil on canvas)",
+        loss_fn=None,
+        loss_scale=200,
+        concept_embed=None,  # e.g. birb_embed["<birb-style>"]
+    ):
+        prompt += " in the style of cs"
+        text_input = self.tokenizer(
+            prompt,
+            padding="max_length",
+            max_length=self.tokenizer.model_max_length,
+            truncation=True,
+            return_tensors="pt",
+        )
+        input_ids = text_input.input_ids.to(self.device)
+        custom_style_token = self.tokenizer.encode("cs", add_special_tokens=False)[0]
+        # Get token embeddings
+        token_embeddings = self.token_emb_layer(input_ids)
+
+        # The new embedding - the custom style concept
+        embed_key = list(concept_embed.keys())[0]
+        replacement_token_embedding = concept_embed[embed_key]
+
+        # Insert this into the token embeddings
+        token_embeddings[
+            0, torch.where(input_ids[0] == custom_style_token)
+        ] = replacement_token_embedding.to(self.device)
+        # token_embeddings = token_embeddings + (replacement_token_embedding * 0.9)
+        # Combine with position embeddings
+        input_embeddings = token_embeddings + self.position_embeddings
+
+        # Feed through to get the final output embeddings
+        modified_output_embeddings = self.get_output_embeds(input_embeddings)
+
+        # And generate an image with this:
+        generated_image = self.generate_with_embs(
+            modified_output_embeddings, text_input, loss_fn, loss_scale
+        )
+        return generated_image
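A minimal usage sketch of the class above, assuming one of the concept embeddings added under `concept_libs/` and a simple guidance loss; the chosen file name, device check, and output path are illustrative, not part of the commit:

```python
import torch

from src.stable_diffusion import StableDiffusion

# Learned concept embedding: a dict mapping the concept token to its embedding tensor
concept_embed = torch.load("concept_libs/cube.bin", map_location="cpu")


def loss_fn(images):
    # Same simple guidance loss as in src/utils.py: push the median pixel value up
    return -images.median() / 3


sd = StableDiffusion(device="cuda" if torch.cuda.is_available() else "cpu")
image = sd.generate_image(
    prompt="A campfire (oil on canvas)",
    loss_fn=loss_fn,
    loss_scale=200,
    concept_embed=concept_embed,
)
image.save("campfire_cube_style.png")
```

`generate_image` appends " in the style of cs" to the prompt and swaps the `cs` token embedding for the loaded concept embedding before encoding.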
src/utils.py ADDED
@@ -0,0 +1,11 @@
+def loss_fn(images):
+    return -images.median() / 3
+
+
+concept_styles = {
+    "Coffee Machine": "coffeemachine.bin",
+    "Collage Style": "collage_style.bin",
+    "Cube": "cube.bin",
+    "Jerry Mouse": "jerrymouse2.bin",
+    "Zero": "zero.bin",
+}
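A hypothetical helper (not part of this commit) showing how the `concept_styles` mapping might be resolved against the `concept_libs/` directory added above and loaded for `StableDiffusion.generate_image`:

```python
import os

import torch

from src.utils import concept_styles


def load_concept(style_name, concept_dir="concept_libs"):
    # Resolve a UI style name to its .bin file and load the {token: embedding} dict
    path = os.path.join(concept_dir, concept_styles[style_name])
    return torch.load(path, map_location="cpu")


embed = load_concept("Cube")  # loads concept_libs/cube.bin
```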