Update README.md
README.md
CHANGED
@@ -20,6 +20,44 @@ inference: false
Llama 2 is a collection of pre-trained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. This is the repository for the 7B pre-trained model. Links to other models can be found in the index at the bottom.

## About GPTQ (from HF Blog)

Quantization methods usually fall into one of two categories:

1. Post-Training Quantization (PTQ): We quantize a pre-trained model using moderate resources, such as a calibration dataset and a few hours of computation.
2. Quantization-Aware Training (QAT): Quantization is performed before training or further fine-tuning.

GPTQ falls into the PTQ category, and this is particularly interesting for massive models, for which full model training or even fine-tuning can be very expensive.

Specifically, GPTQ adopts a mixed int4/fp16 quantization scheme where weights are quantized as int4 while activations remain in float16. During inference, weights are dequantized on the fly and the actual compute is performed in float16.
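To make the weight path concrete, here is a minimal NumPy sketch of that scheme. The shapes, scale values, and the plain-array representation are illustrative assumptions, not taken from any real checkpoint or GPTQ kernel, and the dequantize-then-matmul is written out explicitly rather than fused near the compute unit as a real kernel would do.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy layer: shapes are illustrative only.
out_features, in_features = 128, 256
x = rng.standard_normal((1, in_features)).astype(np.float16)

# int4 weights (values in [-8, 7], held here in an int8 array)
# plus one fp16 scale per output channel.
w_int4 = rng.integers(-8, 8, size=(out_features, in_features), dtype=np.int8)
scales = (rng.random((out_features, 1)) * 0.01).astype(np.float16)

# At inference time the weights are dequantized to fp16 on the fly and the
# matmul itself runs in float16.
w_fp16 = w_int4.astype(np.float16) * scales
y = x @ w_fp16.T  # fp16 activations out, shape (1, out_features)
```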
The benefits of this scheme are twofold:

- Memory savings close to 4x for int4 quantization, as the dequantization happens close to the compute unit in a fused kernel, and not in the GPU global memory (a rough back-of-the-envelope estimate follows after this list).
- Potential speedups thanks to the time saved on data communication due to the lower bitwidth used for weights.

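As a rough illustration of the memory-savings point, here is a back-of-the-envelope estimate for a 7B-parameter model; it counts weight storage only and ignores quantization scales, zero-points, activations, and the KV cache.

```python
# Weight storage only; scales/zero-points, activations and KV cache are ignored.
params = 7_000_000_000

fp16_bytes = params * 2      # 16 bits per weight
int4_bytes = params * 0.5    # 4 bits per weight

print(f"fp16: {fp16_bytes / 1e9:.1f} GB, int4: {int4_bytes / 1e9:.1f} GB, "
      f"ratio: {fp16_bytes / int4_bytes:.0f}x")
# fp16: 14.0 GB, int4: 3.5 GB, ratio: 4x
```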
The GPTQ paper tackles the layer-wise compression problem:

Given a layer \\(l\\) with weight matrix \\(W_{l}\\) and layer input \\(X_{l}\\), we want to find a quantized version of the weight \\(\hat{W}_{l}\\) to minimize the mean squared error (MSE):

\\({\hat{W}_{l}}^{*} = \mathrm{argmin}_{\hat{W}_{l}} \|W_{l}X_{l}-\hat{W}_{l}X_{l}\|^{2}_{2}\\)
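To make the objective concrete, the sketch below evaluates it for a toy layer, using a naive round-to-nearest int4 quantizer as the candidate \\(\hat{W}_{l}\\); GPTQ searches for a much better \\(\hat{W}_{l}\\), but the quantity it minimizes is this same reconstruction error. All shapes and data are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy layer weights W_l and calibration inputs X_l (illustrative shapes only).
W = rng.standard_normal((64, 128)).astype(np.float32)
X = rng.standard_normal((128, 32)).astype(np.float32)

# A naive candidate W_hat: symmetric round-to-nearest int4 with a single scale.
scale = np.abs(W).max() / 7.0
W_hat = np.clip(np.round(W / scale), -8, 7) * scale

# The layer-wise objective ||W_l X_l - W_hat_l X_l||_2^2 from the equation above.
err = np.linalg.norm(W @ X - W_hat @ X) ** 2
print(f"layer-wise squared reconstruction error: {err:.2f}")
```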
Once this is solved per layer, a solution to the global problem can be obtained by combining the layer-wise solutions.

In order to solve this layer-wise compression problem, the authors use the Optimal Brain Quantization framework ([Frantar et al 2022](https://arxiv.org/abs/2208.11580)). The OBQ method starts from the observation that the above equation can be written as the sum of the squared errors over each row of \\(W_{l}\\).

\\( \sum_{i=0}^{d_{row}} \|W_{l[i,:]}X_{l}-\hat{W}_{l[i,:]}X_{l}\|^{2}_{2} \\)

This means that we can quantize each row independently. This is called per-channel quantization. For each row \\(W_{l[i,:]}\\), OBQ quantizes one weight at a time while always updating all not-yet-quantized weights, in order to compensate for the error incurred by quantizing a single weight. The update on selected weights has a closed-form formula, utilizing Hessian matrices.
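A minimal sketch of this per-row (per-channel) view, again with made-up data: each row gets its own scale and is quantized independently, and the total error is exactly the sum of the per-row errors. The OBQ/GPTQ Hessian-based update of the not-yet-quantized weights is deliberately omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 128)).astype(np.float32)
X = rng.standard_normal((128, 32)).astype(np.float32)

# Quantize each row independently with its own scale (per-channel quantization).
# Plain round-to-nearest only: no Hessian-based error compensation.
W_hat = np.empty_like(W)
for i in range(W.shape[0]):
    s = np.abs(W[i]).max() / 7.0
    W_hat[i] = np.clip(np.round(W[i] / s), -8, 7) * s

# The global error decomposes into a sum of independent per-row errors.
total = np.linalg.norm(W @ X - W_hat @ X) ** 2
per_row = sum(np.linalg.norm(W[i] @ X - W_hat[i] @ X) ** 2 for i in range(64))
print(np.isclose(total, per_row))  # True
```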
The GPTQ paper improves this framework by introducing a set of optimizations that reduces the complexity of the quantization algorithm while retaining the accuracy of the model.

Compared to OBQ, the quantization step itself is also faster with GPTQ: it takes 2 GPU-hours to quantize a BERT model (336M) with OBQ, whereas with GPTQ, a Bloom model (176B) can be quantized in less than 4 GPU-hours.

To learn more about the exact algorithm and the different benchmarks on perplexity and speedups, check out the original [paper](https://arxiv.org/pdf/2210.17323.pdf).

## Example of Usage

```sh