---
license: openrail++
base_model: stabilityai/stable-diffusion-xl-base-1.0
language:
  - en
tags:
  - stable-diffusion
  - stable-diffusion-xl
  - stable-diffusion-xl-lcm
  - stable-diffusion-xl-lcmlora
  - tensorrt
  - text-to-image
---

# Stable Diffusion XL 1.0 TensorRT

## Introduction

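This repository hosts the TensorRT versions (sdxl, sdxl-lcm, sdxl-lcmlora) of **Stable Diffusion XL 1.0**, created in collaboration with [NVIDIA](https://huggingface.co/nvidia). Compared with the non-optimized baseline, the optimized versions cut 30-step latency at 1024x1024 by roughly 13% to 41%, depending on the accelerator (see [Performance Comparison](#performance-comparison)).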

See the [usage instructions](#usage-example) for how to run the SDXL pipeline with the ONNX files hosted in this repository. 


![examples](./examples.jpg)

## Model Description

- **Developed by:** Stability AI
- **Model type:** Diffusion-based text-to-image generative model
- **License:** [CreativeML Open RAIL++-M License](https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0/blob/main/LICENSE.md)
- **Model Description:** This is a conversion of the [SDXL base 1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0) and [SDXL refiner 1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0) models for [NVIDIA TensorRT](https://developer.nvidia.com/tensorrt) optimized inference


## Performance Comparison

#### Timings for 30 steps at 1024x1024

| Accelerator | Baseline (non-optimized) | NVIDIA TensorRT (optimized) | Percentage improvement |
|-------------|--------------------------|-----------------------------|------------------------|
| A10         | 9399 ms                  | 8160 ms                     | ~13%                   |
| A100        | 3704 ms                  | 2742 ms                     | ~26%                   |
| H100        | 2496 ms                  | 1471 ms                     | ~41%                   |

#### Image throughput for 30 steps at 1024x1024

| Accelerator | Baseline (non-optimized) | NVIDIA TensorRT (optimized) | Percentage improvement |
|-------------|--------------------------|-----------------------------|------------------------|
| A10         | 0.10 images/sec          | 0.12 images/sec             | ~20%                   |
| A100        | 0.27 images/sec          | 0.36 images/sec             | ~33%                   |
| H100        | 0.40 images/sec          | 0.68 images/sec             | ~70%                   |

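The two 30-step tables describe the same speedup from two directions, which is why their percentages differ: the timing table reports the reduction in latency relative to the baseline, while the throughput table reports the gain in images per second. A minimal sketch of that arithmetic follows, using `awk`; the function names `latency_reduction` and `throughput_gain` are illustrative and not part of any tool in this repository.

```shell
# Relative latency reduction: (baseline - optimized) / baseline, in percent.
latency_reduction() {
  awk -v base="$1" -v opt="$2" 'BEGIN { printf "%.1f\n", (base - opt) / base * 100 }'
}

# Relative throughput gain: (optimized - baseline) / baseline, in percent.
throughput_gain() {
  awk -v base="$1" -v opt="$2" 'BEGIN { printf "%.1f\n", (opt - base) / base * 100 }'
}

# H100 row of each table.
latency_reduction 2496 1471   # prints 41.1
throughput_gain 0.40 0.68     # prints 70.0
```

On the H100 row, a ~41% drop in latency and a ~70% rise in throughput both correspond to roughly a 1.7x speedup (2496 ms / 1471 ms).
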
#### Timings for Latent Consistency Model (LCM) version for 4 steps at 1024x1024

| Accelerator | CLIP                     | UNet                        | VAE                    | Total                  |
|-------------|--------------------------|-----------------------------|------------------------|------------------------|
| A100        | 1.08 ms                  | 192.02 ms                   | 228.34 ms              | 426.16 ms              |
| H100        | 0.78 ms                  | 102.8 ms                    | 126.95 ms              | 234.22 ms              |

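The per-stage columns of the LCM table can be summed and compared with the Total column to see how much of each run is not attributed to any of the three stages. The sketch below does that with `awk`; `stage_breakdown` is a hypothetical helper name, and what the remainder consists of is not stated in this document.

```shell
# Print the summed stage time and the remainder not attributed to any stage.
# Arguments: CLIP, UNet, VAE, and total times, all in milliseconds.
stage_breakdown() {
  awk -v clip="$1" -v unet="$2" -v vae="$3" -v total="$4" 'BEGIN {
    stages = clip + unet + vae
    printf "stages=%.2f ms, remainder=%.2f ms\n", stages, total - stages
  }'
}

stage_breakdown 1.08 192.02 228.34 426.16   # A100 row
stage_breakdown 0.78 102.8 126.95 234.22    # H100 row
```

For the A100 row this gives 421.44 ms across the three stages and a 4.72 ms remainder. At 4 steps the VAE stage takes longer than the UNet stage on both accelerators.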

## Usage Example

1. Follow the [setup instructions](https://github.com/rajeevsrao/TensorRT/blob/release/9.2/demo/Diffusion/README.md) to launch a TensorRT NGC container.
```shell
git clone https://github.com/rajeevsrao/TensorRT.git
cd TensorRT
git checkout release/9.2
docker run --rm -it --gpus all -v $PWD:/workspace nvcr.io/nvidia/pytorch:23.11-py3 /bin/bash
```

2. Download the SDXL TensorRT files from this repo
```shell
git lfs install 
git clone https://huggingface.co/stabilityai/stable-diffusion-xl-1.0-tensorrt
cd stable-diffusion-xl-1.0-tensorrt
git lfs pull
cd ..
```
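
After `git lfs pull`, it can be worth confirming that the large files were actually downloaded. When Git LFS has not fetched a file, the working tree holds a small text pointer whose first line begins with `version https://git-lfs.github.com/spec/v1`. The bash sketch below looks for such leftovers; `is_lfs_pointer` and `find_lfs_pointers` are hypothetical helper names, not part of Git LFS.

```shell
# Succeed if the file is still a Git LFS pointer (content not downloaded).
is_lfs_pointer() {
  head -c 64 "$1" 2>/dev/null | grep -q '^version https://git-lfs.github.com/spec/v1'
}

# Print every file under a directory that is still an LFS pointer.
find_lfs_pointers() {
  local dir="$1" f
  find "$dir" -type f -not -path '*/.git/*' | while IFS= read -r f; do
    if is_lfs_pointer "$f"; then
      printf '%s\n' "$f"
    fi
  done
}

# Check the clone from step 2, if it is in the current directory.
if [ -d stable-diffusion-xl-1.0-tensorrt ]; then
  find_lfs_pointers stable-diffusion-xl-1.0-tensorrt
fi
```

No output means no pointer files were found; any path that is printed likely needs another `git lfs pull`.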

3. Install libraries and requirements
```shell
cd demo/Diffusion
python3 -m pip install --upgrade pip
pip3 install -r requirements.txt
python3 -m pip install --pre --upgrade --extra-index-url https://pypi.nvidia.com tensorrt
```

4. Perform TensorRT optimized inference:

  - **SDXL**
    
    The first invocation builds plan files in `engine_xl_base` and `engine_xl_refiner` that are specific to the accelerator it runs on; later invocations reuse them.

    ```
    python3 demo_txt2img_xl.py \
      "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" \
      --build-static-batch \
      --use-cuda-graph \
      --num-warmup-runs 1 \
      --width 1024 \
      --height 1024 \
      --denoising-steps 30 \
      --onnx-base-dir /workspace/stable-diffusion-xl-1.0-tensorrt/sdxl-1.0-base \
      --onnx-refiner-dir /workspace/stable-diffusion-xl-1.0-tensorrt/sdxl-1.0-refiner
    ```

  - **SDXL-LCM**
    
    The first invocation builds plan files in `--engine-dir` that are specific to the accelerator it runs on; later invocations reuse them.
    ```
    python3 demo_txt2img_xl.py \
      "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" \
      --version=xl-1.0 \
      --onnx-dir /workspace/stable-diffusion-xl-1.0-tensorrt/lcm \
      --engine-dir /workspace/stable-diffusion-xl-1.0-tensorrt/lcm/engine-sdxl-lcm-nocfg \
      --scheduler LCM \
      --denoising-steps 4 \
      --guidance-scale 0.0 \
      --seed 42
    ```
  - **SDXL-LCMLORA**

    The first invocation builds plan files in `--engine-dir` that are specific to the accelerator it runs on; later invocations reuse them.

    ```
    python3 demo_txt2img_xl.py \
      "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" \
      --version=xl-1.0 \
      --onnx-dir /workspace/stable-diffusion-xl-1.0-tensorrt/lcmlora \
      --engine-dir /workspace/stable-diffusion-xl-1.0-tensorrt/lcm/engine-sdxl-lcmlora-nocfg \
      --scheduler LCM \
      --lora-path latent-consistency/lcm-lora-sdxl \
      --lora-scale 1.0 \
      --denoising-steps 4 \
      --guidance-scale 0.0 \
      --seed 42
    ```
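
Because the first invocation of each command builds plan files and later ones reuse them, it can help to check whether an engine directory is already populated before timing a run. The bash sketch below is one way to do that; `engines_cached` is a hypothetical helper, and the `.plan` suffix is an assumption that should be adjusted to whatever the demo actually writes.

```shell
# Hypothetical helper: succeed if the directory holds at least one plan file.
# The ".plan" suffix is an assumption, not something this document specifies.
engines_cached() {
  local dir="$1" f
  for f in "$dir"/*.plan; do
    [ -f "$f" ] && return 0
  done
  return 1
}

# Report the cache state for the engine directories used by the SDXL command.
for dir in engine_xl_base engine_xl_refiner; do
  if engines_cached "$dir"; then
    echo "$dir: plan files found, the next run should reuse them"
  else
    echo "$dir: no plan files yet, the next run will build them first"
  fi
done
```

The same check can be pointed at the `--engine-dir` paths of the SDXL-LCM and SDXL-LCMLORA commands. Since plan files are specific to the accelerator they were built on, a directory populated on a different GPU should presumably be rebuilt rather than reused.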