---
license: openrail++
base_model: stabilityai/stable-diffusion-xl-base-1.0
tags:
- stable-diffusion-xl
- stable-diffusion-xl-diffusers
- text-to-image
- diffusers
- controlnet
inference: false
---
    
# SDXL-controlnet: Depth

These are ControlNet weights trained on stabilityai/stable-diffusion-xl-base-1.0 with depth conditioning. This checkpoint is 5x smaller than the original XL ControlNet checkpoint. You can find some example images below.

prompt: donald trump, serious look, cigar in the mouth, 70mm, film still, head shot
![open](oppenheimer_mid.png)

prompt: spiderman lecture, photorealistic
![images_0](./spiderman_mid.png)

prompt: aerial view, a futuristic research complex in a bright foggy jungle, hard lighting
![images_1](./hf_logo_mid.png)

prompt: megatron in an apocalyptic world ground, ruined city in the background, photorealistic
![images_2](./megatron_mid.png)

## Usage

Make sure to first install the libraries:

```bash
pip install accelerate transformers safetensors diffusers
```

And then we're ready to go:

```python
import torch
import numpy as np
from PIL import Image

from transformers import DPTFeatureExtractor, DPTForDepthEstimation
from diffusers import ControlNetModel, StableDiffusionXLControlNetPipeline, AutoencoderKL
from diffusers.utils import load_image


depth_estimator = DPTForDepthEstimation.from_pretrained("Intel/dpt-hybrid-midas").to("cuda")
feature_extractor = DPTFeatureExtractor.from_pretrained("Intel/dpt-hybrid-midas")
controlnet = ControlNetModel.from_pretrained(
    "diffusers/controlnet-depth-sdxl-1.0-mid",
    variant="fp16",
    use_safetensors=True,
    torch_dtype=torch.float16,
).to("cuda")
vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16).to("cuda")
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=controlnet,
    vae=vae,
    variant="fp16",
    use_safetensors=True,
    torch_dtype=torch.float16,
).to("cuda")
pipe.enable_model_cpu_offload()

def get_depth_map(image):
    # Run MiDaS depth estimation on the input image.
    image = feature_extractor(images=image, return_tensors="pt").pixel_values.to("cuda")
    with torch.no_grad(), torch.autocast("cuda"):
        depth_map = depth_estimator(image).predicted_depth

    # Resize the predicted depth to the SDXL resolution.
    depth_map = torch.nn.functional.interpolate(
        depth_map.unsqueeze(1),
        size=(1024, 1024),
        mode="bicubic",
        align_corners=False,
    )
    # Normalize to [0, 1] and replicate to 3 channels so it can be used as a conditioning image.
    depth_min = torch.amin(depth_map, dim=[1, 2, 3], keepdim=True)
    depth_max = torch.amax(depth_map, dim=[1, 2, 3], keepdim=True)
    depth_map = (depth_map - depth_min) / (depth_max - depth_min)
    image = torch.cat([depth_map] * 3, dim=1)

    # Convert to a PIL image.
    image = image.permute(0, 2, 3, 1).cpu().numpy()[0]
    image = Image.fromarray((image * 255.0).clip(0, 255).astype(np.uint8))
    return image


prompt = "stormtrooper lecture, photorealistic"
image = load_image("https://huggingface.co/lllyasviel/sd-controlnet-depth/resolve/main/images/stormtrooper.png")
controlnet_conditioning_scale = 0.5  # recommended for good generalization

depth_image = get_depth_map(image)

images = pipe(
    prompt,
    image=depth_image,
    num_inference_steps=30,
    controlnet_conditioning_scale=controlnet_conditioning_scale,
).images

images[0].save("stormtrooper_grid.png")
```

![](stormtrooper_grid.png)

For more details, check out the official documentation of [`StableDiffusionXLControlNetPipeline`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/controlnet_sdxl).

🚨 Please note that this checkpoint is experimental and there's a lot of room for improvement. We encourage the community to build on top of it, improve it, and provide us with feedback. 🚨

### Training

Our training script was built on top of the official training script that we provide [here](https://github.com/huggingface/diffusers/blob/main/examples/controlnet/README_sdxl.md).
You can refer to [this script](https://github.com/huggingface/diffusers/blob/7b93c2a882d8e12209fbaeffa51ee2b599ab5349/examples/research_projects/controlnet/train_controlnet_webdataset.py) for full disclosure.

* This checkpoint does not perform distillation. We just use a smaller ControlNet initialized from the SDXL UNet. We
encourage the community to try and conduct distillation too. This [resource](https://huggingface.co/blog/sd_distillation) might be of help in this regard.
* To learn more about how the ControlNet was initialized, refer to [this code block](https://github.com/huggingface/diffusers/blob/7b93c2a882d8e12209fbaeffa51ee2b599ab5349/examples/research_projects/controlnet/train_controlnet_webdataset.py#L981C1-L999C36). 
* It does not have any attention blocks.
* The model works reasonably well on most conditioning images, but for more complex conditionings the bigger checkpoints might be better. We are still working on improving the quality of this checkpoint and looking for feedback from the community.
* We recommend playing around with the `controlnet_conditioning_scale` and `guidance_scale` arguments for potentially better
image generation quality; a small sweep like the sketch below is an easy way to compare settings.
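
As a rough illustration, here is a minimal sketch of such a sweep, reusing `pipe`, `prompt`, and `depth_image` from the snippet above (the scale and guidance values are illustrative starting points, not tuned recommendations):

```python
import torch

# Try a few combinations of ControlNet conditioning strength and CFG scale.
# Re-seeding each run keeps the comparisons apples-to-apples.
for cond_scale in (0.3, 0.5, 0.7):
    for guidance in (5.0, 7.5):
        result = pipe(
            prompt,
            image=depth_image,
            num_inference_steps=30,
            controlnet_conditioning_scale=cond_scale,
            guidance_scale=guidance,
            generator=torch.Generator(device="cuda").manual_seed(0),
        ).images[0]
        result.save(f"stormtrooper_cond{cond_scale}_cfg{guidance}.png")
```

Lower `controlnet_conditioning_scale` values give the base model more freedom, while higher values make the output follow the depth map more strictly.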

#### Training data
The model was trained on 3M images from the LAION aesthetic 6+ subset, with a batch size of 256 for 50k steps and a constant learning rate of 3e-5.
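For reference, that is 256 × 50,000 ≈ 12.8M training samples seen in total, i.e. roughly four passes over the 3M-image subset.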

#### Compute
One 8xA100 machine

#### Mixed precision
FP16