Diffusers

You are viewing the main version, which requires installation from source. If you'd like a regular pip install, check out the latest stable version (v0.35.1).

This pipeline is deprecated but it can still be used. However, we won’t test the pipeline anymore and won’t accept any changes to it. If you run into any issues, reinstall the last Diffusers version that supported this model.

GLIGEN (Grounded Language-to-Image Generation)

The GLIGEN model was created by researchers and engineers from the University of Wisconsin-Madison, Columbia University, and Microsoft. The StableDiffusionGLIGENPipeline and StableDiffusionGLIGENTextImagePipeline can generate photorealistic images conditioned on grounding inputs. StableDiffusionGLIGENPipeline takes text and bounding boxes as grounding inputs. If input images are also given, StableDiffusionGLIGENTextImagePipeline can insert objects described by text into the regions defined by the bounding boxes. Otherwise, it generates an image described by the caption/prompt and inserts the objects described by text into the regions defined by the bounding boxes. The model is trained on the COCO2014D and COCO2014CD datasets and uses a frozen CLIP ViT-L/14 text encoder to condition itself on grounding inputs.
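As a rough illustration of text-and-box grounding, the sketch below prepares bounding boxes and passes them to StableDiffusionGLIGENPipeline. It rests on several assumptions that this page does not state: that boxes are given as `[xmin, ymin, xmax, ymax]` fractions of the image size, that the call accepts `gligen_phrases`, `gligen_boxes`, and `gligen_scheduled_sampling_beta`, and that a checkpoint named `masterful/gligen-1-4-generation-text-box` is available. The helper names `normalize_boxes` and `generate_grounded` are hypothetical and exist only for this example.

```python
from typing import List, Sequence


def normalize_boxes(
    pixel_boxes: Sequence[Sequence[float]], width: int, height: int
) -> List[List[float]]:
    """Convert pixel [xmin, ymin, xmax, ymax] boxes to fractions of the image size."""
    normalized = []
    for xmin, ymin, xmax, ymax in pixel_boxes:
        if not (0 <= xmin < xmax <= width and 0 <= ymin < ymax <= height):
            raise ValueError(f"box {(xmin, ymin, xmax, ymax)} is empty or outside the image")
        normalized.append([xmin / width, ymin / height, xmax / width, ymax / height])
    return normalized


def generate_grounded(prompt, phrases, pixel_boxes, width=512, height=512):
    """Run box-grounded generation. Needs diffusers, torch, and a CUDA GPU."""
    if len(phrases) != len(pixel_boxes):
        raise ValueError("each phrase needs exactly one bounding box")

    # Imported lazily so the helper above works without the heavy dependencies.
    import torch
    from diffusers import StableDiffusionGLIGENPipeline

    # Assumed checkpoint name; substitute whichever GLIGEN checkpoint you use.
    pipe = StableDiffusionGLIGENPipeline.from_pretrained(
        "masterful/gligen-1-4-generation-text-box",
        variant="fp16",
        torch_dtype=torch.float16,
    ).to("cuda")

    result = pipe(
        prompt=prompt,
        gligen_phrases=phrases,  # one phrase per box
        gligen_boxes=normalize_boxes(pixel_boxes, width, height),
        gligen_scheduled_sampling_beta=1,  # assumed: apply grounding at every step
        num_inference_steps=50,
    )
    return result.images[0]
```

A call such as `generate_grounded("a dog and a ball on grass", ["a dog", "a ball"], [[64, 192, 256, 448], [320, 320, 448, 448]])` would then return a single PIL image, provided the assumptions above hold for your installed Diffusers version.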

The abstract from the paper is:

Large-scale text-to-image diffusion models have made amazing advances. However, the status quo is to use text input alone, which can impede controllability. In this work, we propose GLIGEN, Grounded-Language-to-Image Generation, a novel approach that builds upon and extends the functionality of existing pre-trained text-to-image diffusion models by enabling them to also be conditioned on grounding inputs. To preserve the vast concept knowledge of the pre-trained model, we freeze all of its weights and inject the grounding information into new trainable layers via a gated mechanism. Our model achieves open-world grounded text2img generation with caption and bounding box condition inputs, and the grounding ability generalizes well to novel spatial configurations and concepts. GLIGEN’s zero-shot performance on COCO and LVIS outperforms existing supervised layout-to-image baselines by a large margin.
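The gated mechanism mentioned in the abstract can be pictured as a residual update whose strength is controlled by a learnable scalar. The sketch below is a simplified illustration, not the model's actual layer: it assumes the gate has the form `beta * tanh(gamma)`, with `gamma` starting at zero so that the new layer initially leaves the frozen model's features unchanged. The function name `gated_residual` and the flat-list feature representation are hypothetical.

```python
import math
from typing import List


def gated_residual(
    visual: List[float], grounded_update: List[float], gamma: float, beta: float = 1.0
) -> List[float]:
    """Return visual + beta * tanh(gamma) * grounded_update, element-wise."""
    if len(visual) != len(grounded_update):
        raise ValueError("visual features and update must have the same length")
    # tanh keeps the gate in (-1, 1); gamma = 0 closes the gate completely.
    gate = beta * math.tanh(gamma)
    return [v + gate * u for v, u in zip(visual, grounded_update)]
```

With `gamma = 0.0` the output equals the input features, which is one way a model can keep its pre-trained behavior at the start of training while the new layers gradually learn to use the grounding information.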

Make sure to check out the Stable Diffusion Tips section to learn how to explore the tradeoff between scheduler speed and quality and how to reuse pipeline components efficiently!

If you want to use one of the official checkpoints for a task, explore the gligen Hub organizations!

StableDiffusionGLIGENPipeline was contributed by Nikhil Gajendrakumar and StableDiffusionGLIGENTextImagePipeline was contributed by Nguyễn Công Tú Anh.

StableDiffusionGLIGENPipeline

class diffusers.StableDiffusionGLIGENPipeline

__call__

enable_vae_slicing

disable_vae_slicing

enable_vae_tiling

disable_vae_tiling

enable_model_cpu_offload

prepare_latents

enable_fuser

encode_prompt

StableDiffusionGLIGENTextImagePipeline

class diffusers.StableDiffusionGLIGENTextImagePipeline

__call__

enable_vae_slicing

disable_vae_slicing

enable_vae_tiling

disable_vae_tiling

enable_model_cpu_offload

prepare_latents

enable_fuser

complete_mask

crop

draw_inpaint_mask_from_boxes

encode_prompt

get_clip_feature

get_cross_attention_kwargs_with_grounded

get_cross_attention_kwargs_without_grounded

target_size_center_crop

StableDiffusionPipelineOutput

class diffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput
