Diffusers documentation

Text-guided depth-to-image generation

You are viewing main version, which requires installation from source. If you'd like regular pip install, checkout the latest stable version (v0.14.0).
Hugging Face's logo
Join the Hugging Face community

and get access to the augmented documentation experience

to get started

Text-guided depth-to-image generation

The StableDiffusionDepth2ImgPipeline lets you pass a text prompt and an initial image to condition the generation of new images. In addition, you can also pass a depth_map to preserve the image structure. If no depth_map is provided, the pipeline automatically predicts the depth via an integrated depth-estimation model.

Start by creating an instance of the StableDiffusionDepth2ImgPipeline:

import torch
import requests
from PIL import Image

from diffusers import StableDiffusionDepth2ImgPipeline

pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(

Now pass your prompt to the pipeline. You can also pass a negative_prompt to prevent certain words from guiding how an image is generated:

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
init_image = Image.open(requests.get(url, stream=True).raw)
prompt = "two tigers"
n_prompt = "bad, deformed, ugly, bad anatomy"
image = pipe(prompt=prompt, image=init_image, negative_prompt=n_prompt, strength=0.7).images[0]
Input Output

Play around with the Spaces below and see if you notice a difference between generated images with and without a depth map!