Update README.md
README.md (CHANGED)
```diff
@@ -7,15 +7,13 @@ This model is quantized by autoawq package using `tctsung/chat_restaurant_recomm
 
 Reference model: [TinyLlama/TinyLlama-1.1B-Chat-v1.0](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0)
 
-For more details, see github repo [tctsung/LLM_quantize](https://github.com/tctsung/LLM_quantize.git)
-
 ## Key results:
 
 1. AWQ quantization resulted in a **1.62x improvement** in inference speed, generating **140.47 new tokens per second**.
 2. The model size was compressed from 4.4GB to 0.78GB, representing a reduction in memory footprint to only **17.57%** of the original model.
 3. I used 6 different LLM tasks to demonstrate that the quantized model maintains similar accuracy, with a maximum accuracy degradation of only ~1%.
 
-
+For more details, see the GitHub repo [tctsung/LLM_quantize](https://github.com/tctsung/LLM_quantize.git)
 
 ## Inference tutorial
```
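The hunk header shows the model was quantized with the autoawq package on a custom calibration dataset whose id is truncated above (`tctsung/chat_restaurant_recomm...`). As a non-authoritative sketch of how such a model is produced, a typical AutoAWQ run looks like the following; the 4-bit GEMM `quant_config` values and the output directory are illustrative assumptions, not values taken from the repo:

```python
# Sketch of AWQ quantization with the autoawq package.
# The quant_config values and output directory are assumptions;
# the calibration dataset id is truncated in the diff, so it is
# noted in a comment rather than guessed.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

base = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # reference model named in the README
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

# quantize() also accepts calib_data=<dataset id or list of strings>;
# substitute the full chat_restaurant dataset id there to reproduce
# the author's setup.
model.quantize(tokenizer, quant_config=quant_config)

model.save_quantized("tinyllama-chat-awq")  # hypothetical output path
tokenizer.save_pretrained("tinyllama-chat-awq")
```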
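Key result 1 reports 140.47 new tokens per second, a 1.62x speedup over the FP16 baseline; the actual benchmark lives in the linked repo, but a rough wall-clock measurement of that kind of figure can be sketched as follows (the prompt and generation settings are assumptions):

```python
# Rough tokens-per-second measurement; prompt and generation
# settings are illustrative, not the repo's benchmark setup.
import time
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "tinyllama-chat-awq"  # hypothetical local dir from the sketch above
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path, device_map="auto")

inputs = tokenizer("Recommend a restaurant for a quiet dinner.",
                   return_tensors="pt").to(model.device)
start = time.perf_counter()
output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
elapsed = time.perf_counter() - start

new_tokens = output.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.2f} new tokens/s")
```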
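The `## Inference tutorial` section is empty in this hunk. Under the assumption that the quantized weights load through the standard transformers AWQ integration (the hub id below is a placeholder, not the model card's actual repo id), a minimal usage sketch would be:

```python
# Minimal inference sketch; the hub id is a placeholder assumption.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tctsung/TinyLlama-1.1B-Chat-awq"  # hypothetical id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# TinyLlama-Chat ships a chat template, so format the prompt with it.
messages = [{"role": "user", "content": "Suggest a cozy Italian restaurant."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False,
                                       add_generation_prompt=True)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```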