controlnet-zoe-depth-sdxl-1.0 / README.md

Update README.md

62134b9 10 months ago

No virus

4.8 kB


	---
	license: openrail++
	base_model: stabilityai/stable-diffusion-xl-base-1.0
	tags:
	- stable-diffusion-xl
	- stable-diffusion-xl-diffusers
	- text-to-image
	- diffusers
	- controlnet
	inference: false
	---

	# SDXL-controlnet: Zoe-Depth

	These are ControlNet weights trained on stabilityai/stable-diffusion-xl-base-1.0 with zoe depth conditioning. [Zoe-depth](https://github.com/isl-org/ZoeDepth) is an open-source SOTA depth estimation model which produces high-quality depth maps, which are better suited for conditioning.

	You can find some example images in the following.

	![images_0)](./zoe-depth-example.png)

	![images_2](./zoe-megatron.png)

	![images_3](./photo-woman.png)

	## Usage

	Make sure first to install the libraries:

	```bash
	pip install accelerate transformers safetensors diffusers
	```

	And then setup the zoe-depth model

	```
	import torch
	import matplotlib
	import matplotlib.cm
	import numpy as np

	torch.hub.help("intel-isl/MiDaS", "DPT_BEiT_L_384", force_reload=True) # Triggers fresh download of MiDaS repo
	model_zoe_n = torch.hub.load("isl-org/ZoeDepth", "ZoeD_NK", pretrained=True).eval()
	model_zoe_n = model_zoe_n.to("cuda")


	def colorize(value, vmin=None, vmax=None, cmap='gray_r', invalid_val=-99, invalid_mask=None, background_color=(128, 128, 128, 255), gamma_corrected=False, value_transform=None):
	if isinstance(value, torch.Tensor):
	value = value.detach().cpu().numpy()

	value = value.squeeze()
	if invalid_mask is None:
	invalid_mask = value == invalid_val
	mask = np.logical_not(invalid_mask)

	# normalize
	vmin = np.percentile(value[mask],2) if vmin is None else vmin
	vmax = np.percentile(value[mask],85) if vmax is None else vmax
	if vmin != vmax:
	value = (value - vmin) / (vmax - vmin) # vmin..vmax
	else:
	# Avoid 0-division
	value = value * 0.

	# squeeze last dim if it exists
	# grey out the invalid values

	value[invalid_mask] = np.nan
	cmapper = matplotlib.cm.get_cmap(cmap)
	if value_transform:
	value = value_transform(value)
	# value = value / value.max()
	value = cmapper(value, bytes=True) # (nxmx4)

	# img = value[:, :, :]
	img = value[...]
	img[invalid_mask] = background_color

	# gamma correction
	img = img / 255
	img = np.power(img, 2.2)
	img = img * 255
	img = img.astype(np.uint8)
	img = Image.fromarray(img)
	return img


	def get_zoe_depth_map(image):
	with torch.autocast("cuda", enabled=True):
	depth = model_zoe_n.infer_pil(image)
	depth = colorize(depth, cmap="gray_r")
	return depth
	```

	Now we're ready to go:

	```python
	import torch
	import numpy as np
	from PIL import Image

	from diffusers import ControlNetModel, StableDiffusionXLControlNetPipeline, AutoencoderKL
	from diffusers.utils import load_image

	controlnet = ControlNetModel.from_pretrained(
	"diffusers/controlnet-zoe-depth-sdxl-1.0",
	use_safetensors=True,
	torch_dtype=torch.float16,
	).to("cuda")
	vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16).to("cuda")
	pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
	"stabilityai/stable-diffusion-xl-base-1.0",
	controlnet=controlnet,
	vae=vae,
	variant="fp16",
	use_safetensors=True,
	torch_dtype=torch.float16,
	).to("cuda")
	pipe.enable_model_cpu_offload()


	prompt = "pixel-art margot robbie as barbie, in a coupé . low-res, blocky, pixel art style, 8-bit graphics"
	negative_prompt = "sloppy, messy, blurry, noisy, highly detailed, ultra textured, photo, realistic"
	image = load_image("https://media.vogue.fr/photos/62bf04b69a57673c725432f3/3:2/w_1793,h_1195,c_limit/rev-1-Barbie-InstaVert_High_Res_JPEG.jpeg")

	controlnet_conditioning_scale = 0.55

	depth_image = get_zoe_depth_map(image).resize((1088, 896))

	generator = torch.Generator("cuda").manual_seed(978364352)
	images = pipe(
	prompt, image=depth_image, num_inference_steps=50, controlnet_conditioning_scale=controlnet_conditioning_scale, generator=generator
	).images
	images[0]

	images[0].save(f"pixel-barbie.png")
	```

	![images_1)](./barbie.png)

	To more details, check out the official documentation of [`StableDiffusionXLControlNetPipeline`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/controlnet_sdxl).

	### Training

	Our training script was built on top of the official training script that we provide [here](https://github.com/huggingface/diffusers/blob/main/examples/controlnet/README_sdxl.md).

	#### Training data and Compute
	The model is trained on 3M image-text pairs from LAION-Aesthetics V2. The model is trained for 700 GPU hours on 80GB A100 GPUs.

	#### Batch size
	Data parallel with a single gpu batch size of 8 for a total batch size of 256.

	#### Hyper Parameters
	Constant learning rate of 1e-5.

	#### Mixed precision
	fp16