<a href='https://hanlab.mit.edu/projects/svdquant'>[Website]</a> 
<a href='https://hanlab.mit.edu/blog/svdquant'>[Blog]</a>
</div>

SVDQuant is a post-training quantization technique for 4-bit weights and activations that maintains visual fidelity well. On the 12B-parameter FLUX.1-dev, it achieves a 3.6× memory reduction compared to the BF16 model. By eliminating CPU offloading, it delivers an 8.7× speedup over the 16-bit model on a 16GB laptop 4090 GPU, running 3× faster than the NF4 W4A16 baseline. On PixArt-Σ, it demonstrates significantly superior visual quality over other W4A4 and even W4A8 baselines. "E2E" denotes the end-to-end latency, including the text encoder and VAE decoder.
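
As a rough back-of-the-envelope check (simple arithmetic, not an official breakdown): 12B parameters occupy about 24 GB in BF16 versus about 6 GB at 4 bits, roughly 4× on the weights alone; the measured 3.6× overall reduction is consistent with the low-rank branch and a few other components staying at 16-bit precision.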
## Method
#### Quantization Method -- SVDQuant
The key idea behind SVDQuant is to introduce an additional low-rank branch that absorbs quantization difficulties in both weights and activations. As shown in the animation above, both the activation $\boldsymbol{X}$ and the weights $\boldsymbol{W}$ originally contain massive outliers, making 4-bit quantization challenging. We first aggregate the outliers by migrating them from the activations to the weights via smoothing, resulting in the updated activation $\hat{\boldsymbol{X}}$ and weights $\hat{\boldsymbol{W}}$. While $\hat{\boldsymbol{X}}$ becomes easier to quantize, $\hat{\boldsymbol{W}}$ now becomes more difficult. In the last stage, SVDQuant further decomposes $\hat{\boldsymbol{W}}$ into a low-rank component $\boldsymbol{L}_1 \boldsymbol{L}_2$ and a residual $\hat{\boldsymbol{W}} - \boldsymbol{L}_1 \boldsymbol{L}_2$ with Singular Value Decomposition (SVD). Since the singular value distribution of $\hat{\boldsymbol{W}}$ is highly imbalanced, with only the first few values being significantly larger, removing these dominant values dramatically reduces the magnitude and outliers of $\hat{\boldsymbol{W}}$, as suggested by the <a href='https://en.wikipedia.org/wiki/Low-rank_approximation'>Eckart–Young–Mirsky theorem</a>. The quantization difficulty is thus alleviated by the low-rank branch, which runs at 16-bit precision. The figure below illustrates an example value distribution of the input activations and weights in PixArt-Σ.
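
To make the pipeline concrete, here is a minimal NumPy sketch of the idea (an illustration only, not Nunchaku's implementation: the per-channel fake quantizer, the SmoothQuant-style smoothing scale with α = 0.5, and the rank of 16 are all illustrative assumptions):

```python
import numpy as np

def fake_quant_int4(t):
    """Simulate symmetric per-channel 4-bit quantization (quantize, then dequantize)."""
    scale = np.maximum(np.abs(t).max(axis=0, keepdims=True), 1e-8) / 7.0
    return np.clip(np.round(t / scale), -7, 7) * scale

rng = np.random.default_rng(0)
X = rng.standard_normal((128, 512))   # activations: (tokens, in_features)
X[:, :4] *= 30.0                      # inject a few outlier channels
W = rng.standard_normal((512, 256))   # weights: (in_features, out_features)

# Stage 2 -- smoothing: migrate activation outliers into the weights with a
# per-channel scale, so that X @ W == X_hat @ W_hat exactly.
lam = np.abs(X).max(axis=0) ** 0.5    # SmoothQuant-style scale, alpha = 0.5 (illustrative)
X_hat = X / lam                       # activations become easier to quantize
W_hat = lam[:, None] * W              # weights absorb the outliers

# Stage 3 -- SVD: peel the dominant singular values off into a low-rank
# branch L1 @ L2 kept at 16-bit, leaving a small-magnitude residual to quantize.
r = 16                                # rank of the low-rank branch (illustrative)
U, S, Vt = np.linalg.svd(W_hat, full_matrices=False)
L1, L2 = U[:, :r] * S[:r], Vt[:r]     # W_hat ~= L1 @ L2 + residual
residual = W_hat - L1 @ L2

# Forward pass: 16-bit low-rank branch + 4-bit quantized residual branch.
Y_approx = X_hat @ L1 @ L2 + fake_quant_int4(X_hat) @ fake_quant_int4(residual)
rel_err = np.linalg.norm(Y_approx - X @ W) / np.linalg.norm(X @ W)
print(f"relative error with the rank-{r} branch: {rel_err:.4f}")
```

Setting `r = 0` reproduces plain W4A4 quantization of the smoothed weights and should yield a noticeably larger error on such outlier-heavy inputs, which is exactly the gap the low-rank branch is meant to close.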
Overview of SVDQuant. Stage 1: originally, both the activation $\boldsymbol{X}$ and weights $\boldsymbol{W}$ contain outliers, making 4-bit quantization challenging. Stage 2: we migrate the outliers from activations to weights, resulting in the updated activation $\hat{\boldsymbol{X}}$ and weights $\hat{\boldsymbol{W}}$. While $\hat{\boldsymbol{X}}$ becomes easier to quantize, $\hat{\boldsymbol{W}}$ now becomes more difficult. Stage 3: SVDQuant further decomposes $\hat{\boldsymbol{W}}$ into a low-rank component $\boldsymbol{L}_1\boldsymbol{L}_2$ and a residual $\hat{\boldsymbol{W}}-\boldsymbol{L}_1\boldsymbol{L}_2$ with SVD. Thus, the quantization difficulty is alleviated by the low-rank branch, which runs at 16-bit precision.
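
In equation form, the three stages compose as follows (with $Q(\cdot)$ denoting 4-bit quantization; this is a paraphrase of the description above, not the paper's exact notation):

$$
\boldsymbol{X}\boldsymbol{W}
= \hat{\boldsymbol{X}}\hat{\boldsymbol{W}}
= \hat{\boldsymbol{X}}\boldsymbol{L}_1\boldsymbol{L}_2
+ \hat{\boldsymbol{X}}\bigl(\hat{\boldsymbol{W}}-\boldsymbol{L}_1\boldsymbol{L}_2\bigr)
\approx \hat{\boldsymbol{X}}\boldsymbol{L}_1\boldsymbol{L}_2
+ Q(\hat{\boldsymbol{X}})\,Q\bigl(\hat{\boldsymbol{W}}-\boldsymbol{L}_1\boldsymbol{L}_2\bigr),
$$

where the first term is the 16-bit low-rank branch and the second is the 4-bit residual branch.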
#### Nunchaku Engine Design