<a href='https://hanlab.mit.edu/projects/svdquant'>[Website]</a> 
<a href='https://hanlab.mit.edu/blog/svdquant'>[Blog]</a>
</div>

SVDQuant is a post-training quantization technique for 4-bit weights and activations that maintains visual fidelity well. On the 12B-parameter FLUX.1-dev, it achieves a 3.6× memory reduction compared to the BF16 model. By eliminating CPU offloading, it delivers an 8.7× speedup over the 16-bit model on a 16GB laptop 4090 GPU, running 3× faster than the NF4 W4A16 baseline. On PixArt-Σ, it demonstrates significantly superior visual quality over other W4A4 and even W4A8 baselines. "E2E" denotes the end-to-end latency, including the text encoder and VAE decoder.
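
As a rough back-of-the-envelope check (simple arithmetic, not an official breakdown): 12B parameters occupy about 24 GB in BF16 versus about 6 GB at 4 bits, roughly 4× on the weights alone; the measured 3.6× overall reduction is consistent with the low-rank branch and a few other components staying at 16-bit precision.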
## Method
#### Quantization Method -- SVDQuant
The key idea behind SVDQuant is to introduce an additional low-rank branch that absorbs quantization difficulties in both weights and activations. As shown in the animation above, both the activation $\boldsymbol{X}$ and the weights $\boldsymbol{W}$ originally contain massive outliers, making 4-bit quantization challenging. We first aggregate the outliers by migrating them from the activations to the weights via smoothing, resulting in the updated activation $\hat{\boldsymbol{X}}$ and weights $\hat{\boldsymbol{W}}$. While $\hat{\boldsymbol{X}}$ becomes easier to quantize, $\hat{\boldsymbol{W}}$ now becomes more difficult. In the last stage, SVDQuant further decomposes $\hat{\boldsymbol{W}}$ into a low-rank component $\boldsymbol{L}_1 \boldsymbol{L}_2$ and a residual $\hat{\boldsymbol{W}} - \boldsymbol{L}_1 \boldsymbol{L}_2$ with Singular Value Decomposition (SVD). Since the singular value distribution of $\hat{\boldsymbol{W}}$ is highly imbalanced, with only the first few values being significantly larger, removing these dominant values dramatically reduces the magnitude and outliers of $\hat{\boldsymbol{W}}$, as suggested by the <a href='https://en.wikipedia.org/wiki/Low-rank_approximation'>Eckart–Young–Mirsky theorem</a>. The quantization difficulty is thus alleviated by the low-rank branch, which runs at 16-bit precision. The figure below illustrates an example value distribution of the input activations and weights in PixArt-Σ.
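
To make the pipeline concrete, here is a minimal NumPy sketch of the idea (an illustration only, not Nunchaku's implementation: the per-channel fake quantizer, the SmoothQuant-style smoothing scale with α = 0.5, and the rank of 16 are all illustrative assumptions):

```python
import numpy as np

def fake_quant_int4(t):
    """Simulate symmetric per-channel 4-bit quantization (quantize, then dequantize)."""
    scale = np.maximum(np.abs(t).max(axis=0, keepdims=True), 1e-8) / 7.0
    return np.clip(np.round(t / scale), -7, 7) * scale

rng = np.random.default_rng(0)
X = rng.standard_normal((128, 512))   # activations: (tokens, in_features)
X[:, :4] *= 30.0                      # inject a few outlier channels
W = rng.standard_normal((512, 256))   # weights: (in_features, out_features)

# Stage 2 -- smoothing: migrate activation outliers into the weights with a
# per-channel scale, so that X @ W == X_hat @ W_hat exactly.
lam = np.abs(X).max(axis=0) ** 0.5    # SmoothQuant-style scale, alpha = 0.5 (illustrative)
X_hat = X / lam                       # activations become easier to quantize
W_hat = lam[:, None] * W              # weights absorb the outliers

# Stage 3 -- SVD: peel the dominant singular values off into a low-rank
# branch L1 @ L2 kept at 16-bit, leaving a small-magnitude residual to quantize.
r = 16                                # rank of the low-rank branch (illustrative)
U, S, Vt = np.linalg.svd(W_hat, full_matrices=False)
L1, L2 = U[:, :r] * S[:r], Vt[:r]     # W_hat ~= L1 @ L2 + residual
residual = W_hat - L1 @ L2

# Forward pass: 16-bit low-rank branch + 4-bit quantized residual branch.
Y_approx = X_hat @ L1 @ L2 + fake_quant_int4(X_hat) @ fake_quant_int4(residual)
rel_err = np.linalg.norm(Y_approx - X @ W) / np.linalg.norm(X @ W)
print(f"relative error with the rank-{r} branch: {rel_err:.4f}")
```

Setting `r = 0` reproduces plain W4A4 quantization of the smoothed weights and should yield a noticeably larger error on such outlier-heavy inputs, which is exactly the gap the low-rank branch is meant to close.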
Overview of SVDQuant. Stage 1: originally, both the activation $\boldsymbol{X}$ and weights $\boldsymbol{W}$ contain outliers, making 4-bit quantization challenging. Stage 2: we migrate the outliers from activations to weights, resulting in the updated activation $\hat{\boldsymbol{X}}$ and weights $\hat{\boldsymbol{W}}$. While $\hat{\boldsymbol{X}}$ becomes easier to quantize, $\hat{\boldsymbol{W}}$ now becomes more difficult. Stage 3: SVDQuant further decomposes $\hat{\boldsymbol{W}}$ into a low-rank component $\boldsymbol{L}_1\boldsymbol{L}_2$ and a residual $\hat{\boldsymbol{W}}-\boldsymbol{L}_1\boldsymbol{L}_2$ with SVD. Thus, the quantization difficulty is alleviated by the low-rank branch, which runs at 16-bit precision.
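
In equation form, the three stages compose as follows (with $Q(\cdot)$ denoting 4-bit quantization; this is a paraphrase of the description above, not the paper's exact notation):

$$
\boldsymbol{X}\boldsymbol{W}
= \hat{\boldsymbol{X}}\hat{\boldsymbol{W}}
= \hat{\boldsymbol{X}}\boldsymbol{L}_1\boldsymbol{L}_2
+ \hat{\boldsymbol{X}}\bigl(\hat{\boldsymbol{W}}-\boldsymbol{L}_1\boldsymbol{L}_2\bigr)
\approx \hat{\boldsymbol{X}}\boldsymbol{L}_1\boldsymbol{L}_2
+ Q(\hat{\boldsymbol{X}})\,Q\bigl(\hat{\boldsymbol{W}}-\boldsymbol{L}_1\boldsymbol{L}_2\bigr),
$$

where the first term is the 16-bit low-rank branch and the second is the 4-bit residual branch.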
#### Nunchaku Engine Design