T-GATE
T-GATE accelerates inference for Stable Diffusion, PixArt, and Latent Consistency Model pipelines by skipping the cross-attention calculation once it converges. This method doesn’t require any additional training and can speed up inference by 10-50%. T-GATE is also compatible with other optimization methods like DeepCache.
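To make the idea of skipping cross-attention concrete, here is a minimal toy sketch in plain Python. It is not the tgate implementation: the `GatedCrossAttention` class, the `toy_attend` function, and the choice to reuse the last computed output are all assumptions made for illustration. It only shows the general pattern of computing a value for the first few denoising steps and reusing a cached result afterwards.

```python
from typing import Callable, List, Optional


class GatedCrossAttention:
    """Hypothetical stand-in for a cross-attention layer with a temporal gate."""

    def __init__(self, attend: Callable[[List[float], List[float]], List[float]], gate_step: int):
        self.attend = attend          # the "expensive" cross-attention function
        self.gate_step = gate_step    # step index at which computation stops (assumed semantics)
        self.cache: Optional[List[float]] = None
        self.calls = 0                # number of times the real computation ran

    def __call__(self, step: int, latents: List[float], text_embedding: List[float]) -> List[float]:
        # Before the gate step: compute normally and remember the result.
        if step < self.gate_step or self.cache is None:
            self.cache = self.attend(latents, text_embedding)
            self.calls += 1
        # From the gate step onward: return the cached output unchanged.
        return self.cache


def toy_attend(latents: List[float], text_embedding: List[float]) -> List[float]:
    # Placeholder arithmetic, not real attention: scale latents by the mean embedding value.
    scale = sum(text_embedding) / len(text_embedding)
    return [x * scale for x in latents]


gate = GatedCrossAttention(toy_attend, gate_step=8)
for step in range(25):
    latents = [float(step), float(step) + 1.0]
    out = gate(step, latents, [0.5, 1.5])

print(gate.calls)  # 8 real computations; the remaining 17 steps reuse the cache
```

With a gate step of 8 and 25 inference steps, the toy layer only runs its computation on the first 8 steps, which is where the savings come from.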
Before you begin, make sure you install T-GATE.
pip install tgate
pip install -U torch diffusers transformers accelerate DeepCache
To use T-GATE with a pipeline, you need to use its corresponding loader.
| Pipeline | T-GATE Loader | 
|---|---|
| PixArt | TgatePixArtLoader | 
| Stable Diffusion XL | TgateSDXLLoader | 
| Stable Diffusion XL + DeepCache | TgateSDXLDeepCacheLoader | 
| Stable Diffusion | TgateSDLoader | 
| Stable Diffusion + DeepCache | TgateSDDeepCacheLoader | 
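The table can be expressed as a small lookup, which may be handy when a script needs to pick a loader by name. This is only a sketch: the `TGATE_LOADERS` dictionary and the `tgate_loader_name` helper are hypothetical and not part of tgate, and keying the rows on the diffusers class names `PixArtAlphaPipeline`, `StableDiffusionXLPipeline`, and `StableDiffusionPipeline` is an assumption about which classes the rows refer to.

```python
# Loader names copied from the table above.
# Keys are (diffusers pipeline class name, whether DeepCache is used); this keying is an assumption.
TGATE_LOADERS = {
    ("PixArtAlphaPipeline", False): "TgatePixArtLoader",
    ("StableDiffusionXLPipeline", False): "TgateSDXLLoader",
    ("StableDiffusionXLPipeline", True): "TgateSDXLDeepCacheLoader",
    ("StableDiffusionPipeline", False): "TgateSDLoader",
    ("StableDiffusionPipeline", True): "TgateSDDeepCacheLoader",
}


def tgate_loader_name(pipeline_class_name: str, deepcache: bool = False) -> str:
    """Return the loader name listed in the table for a pipeline (hypothetical helper)."""
    try:
        return TGATE_LOADERS[(pipeline_class_name, deepcache)]
    except KeyError:
        raise ValueError(
            f"No T-GATE loader listed for {pipeline_class_name!r} (deepcache={deepcache})"
        ) from None


print(tgate_loader_name("StableDiffusionXLPipeline", deepcache=True))
```

The helper returns names only, so it runs without tgate installed; in a real script the name could be resolved with `getattr(tgate, name)` after importing the package.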
Next, create a TgateLoader with a pipeline, the gate step (the time step to stop calculating the cross attention), and the number of inference steps. Then call the tgate method on the pipeline with a prompt, gate step, and the number of inference steps.
Let’s see how to enable this for several different pipelines.
Accelerate PixArtAlphaPipeline with T-GATE:
import torch
from diffusers import PixArtAlphaPipeline
from tgate import TgatePixArtLoader
pipe = PixArtAlphaPipeline.from_pretrained("PixArt-alpha/PixArt-XL-2-1024-MS", torch_dtype=torch.float16)
gate_step = 8
inference_step = 25
pipe = TgatePixArtLoader(
       pipe,
       gate_step=gate_step,
       num_inference_steps=inference_step,
).to("cuda")
image = pipe.tgate(
       "An alpaca made of colorful building blocks, cyberpunk.",
       gate_step=gate_step,
       num_inference_steps=inference_step,
).images[0]
T-GATE also supports StableDiffusionPipeline and PixArt-alpha/PixArt-LCM-XL-2-1024-MS.
Benchmarks
| Model | MACs | Param | Latency | Zero-shot 10K-FID on MS-COCO | 
|---|---|---|---|---|
| SD-1.5 | 16.938T | 859.520M | 7.032s | 23.927 | 
| SD-1.5 w/ T-GATE | 9.875T | 815.557M | 4.313s | 20.789 | 
| SD-2.1 | 38.041T | 865.785M | 16.121s | 22.609 | 
| SD-2.1 w/ T-GATE | 22.208T | 815.433M | 9.878s | 19.940 | 
| SD-XL | 149.438T | 2.570B | 53.187s | 24.628 | 
| SD-XL w/ T-GATE | 84.438T | 2.024B | 27.932s | 22.738 | 
| Pixart-Alpha | 107.031T | 611.350M | 61.502s | 38.669 | 
| Pixart-Alpha w/ T-GATE | 65.318T | 462.585M | 37.867s | 35.825 | 
| DeepCache (SD-XL) | 57.888T | - | 19.931s | 23.755 | 
| DeepCache w/ T-GATE | 43.868T | - | 14.666s | 23.999 | 
| LCM (SD-XL) | 11.955T | 2.570B | 3.805s | 25.044 | 
| LCM (SD-XL) w/ T-GATE | 11.171T | 2.024B | 3.533s | 25.028 | 
| LCM (Pixart-Alpha) | 8.563T | 611.350M | 4.733s | 36.086 | 
| LCM (Pixart-Alpha) w/ T-GATE | 7.623T | 462.585M | 4.543s | 37.048 | 
Latency is measured on an NVIDIA 1080 Ti, MACs and Params are calculated with calflops, and FID is calculated with PytorchFID.
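To read the latency column as a relative saving, the reduction can be computed directly from the table. The short sketch below does this for four of the base models; the `latency_reduction` helper is a hypothetical name, and the numbers are copied from the benchmark table rather than re-measured.

```python
# (baseline latency in seconds, latency with T-GATE in seconds), copied from the table above
latencies = {
    "SD-1.5": (7.032, 4.313),
    "SD-2.1": (16.121, 9.878),
    "SD-XL": (53.187, 27.932),
    "Pixart-Alpha": (61.502, 37.867),
}


def latency_reduction(baseline: float, accelerated: float) -> float:
    """Percentage by which latency drops relative to the baseline."""
    return (baseline - accelerated) / baseline * 100.0


reductions = {name: latency_reduction(base, fast) for name, (base, fast) in latencies.items()}
for name, pct in reductions.items():
    print(f"{name}: {pct:.1f}% lower latency with T-GATE")
```

The same function can be applied to the MACs column to compare compute savings against wall-clock savings.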