yangapku committed
Commit
95590fa
1 Parent(s): 6ec2d41

update int8 quantization info

Files changed (1)
  1. README.md +33 -20
README.md CHANGED
@@ -37,7 +37,7 @@ For more details about the open-source model of Qwen-7B, please refer to the [Gi
  ## 要求(Requirements)

  * python 3.8及以上版本
- * pytorch 2.0及以上版本,推荐2.0及以上版本
  * 建议使用CUDA 11.4及以上(GPU用户、flash-attention用户等需考虑此选项)
  * python 3.8 and above
  * pytorch 2.0 and above, 2.0 and above are recommended
@@ -104,40 +104,53 @@ For more information, please refer to our [GitHub repo](https://github.com/QwenL

  ### 效果评测

- 我们对BF16和Int4模型在基准评测上做了测试(使用zero-shot设置),发现量化模型效果损失较小,结果如下所示:

- We illustrate the zero-shot performance of both BF16 and Int4 models on the benchmark, and we find that the quantized model does not suffer from significant performance degradation. Results are shown below:

- | Quantization | MMLU | CEval (val) | GSM8K | Humaneval |
- |--------------|:----:|:-----------:|:-----:|:---------:|
- | BF16 | 55.8 | 59.7 | 50.3 | 37.2 |
- | Int4 | 55.1 | 59.2 | 49.7 | 35.4 |
 
  ### 推理速度 (Inference Speed)

- 我们测算了BF16和Int4模型生成2048和8192个token的平均推理速度。如图所示:

- We measured the average inference speed of generating 2048 and 8192 tokens under BF16 precision and Int4 quantization level, respectively.

- | Quantization | Speed (2048 tokens) | Speed (8192 tokens) |
- |--------------|:-------------------:|:-------------------:|
- | BF16 | 30.53 | 28.51 |
- | Int4 | 45.60 | 33.83 |

- 具体而言,我们记录在长度为1的上下文的条件下生成8192个token的性能。评测运行于单张A100-SXM4-80G GPU,使用PyTorch 2.0.1和CUDA 11.4。推理速度是生成8192个token的速度均值。

- In detail, the setting of profiling is generating 8192 new tokens with 1 context token. The profiling runs on a single A100-SXM4-80G GPU with PyTorch 2.0.1 and CUDA 11.4. The inference speed is averaged over the generated 8192 tokens.

  ### 显存使用 (GPU Memory Usage)

- 我们还测算了BF16和Int4模型编码2048个token及生成8192个token的峰值显存占用情况。结果如下所示:

- We also profile the peak GPU memory usage for encoding 2048 tokens as context (and generating single token) and generating 8192 tokens (with single token as context) under BF16 or Int4 quantization level, respectively. The results are shown below.

  | Quantization Level | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens |
- |--------------------|:-----------------------------------:|:-------------------------------------:|
- | BF16 | 18.99GB | 24.40GB |
- | Int4 | 10.20GB | 15.61GB |

  上述性能测算使用[此脚本](https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile.py)完成。
 
  ## 要求(Requirements)

  * python 3.8及以上版本
+ * pytorch 2.0及以上版本
  * 建议使用CUDA 11.4及以上(GPU用户、flash-attention用户等需考虑此选项)
  * python 3.8 and above
  * pytorch 2.0 and above, 2.0 and above are recommended
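A minimal snippet to confirm the local environment matches the requirements above (illustrative only; it simply prints the installed versions):

```python
import sys
import torch

# Print the versions covered by the requirements above.
print("python :", sys.version.split()[0])   # expect 3.8 or above
print("pytorch:", torch.__version__)        # expect 2.0 or above
print("cuda   :", torch.version.cuda)       # 11.4 or above recommended for GPU / flash-attention use
print("gpu    :", torch.cuda.is_available())
```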
 
  ### 效果评测

+ 我们对BF16、Int8和Int4模型在基准评测上做了测试(使用zero-shot设置),发现量化模型效果损失较小,结果如下所示:

+ We illustrate the zero-shot performance of the BF16, Int8, and Int4 models on the benchmarks, and find that the quantized models do not suffer from significant performance degradation. Results are shown below:

+ | Quantization | MMLU | CEval (val) | GSM8K | Humaneval |
+ | ------------- | :--------: | :----------: | :----: | :--------: |
+ | BF16 | 55.8 | 59.7 | 50.3 | 37.2 |
+ | Int8 | 55.4 | 59.4 | 48.3 | 34.8 |
+ | Int4 | 55.1 | 59.2 | 49.7 | 29.9 |

  ### 推理速度 (Inference Speed)

+ 我们测算了不同精度模型以及不同FlashAttn库版本下,模型生成2048和8192个token的平均推理速度(tokens/s),结果如下所示:

+ We measured the average inference speed (tokens/s) of generating 2048 and 8192 tokens under different quantization levels and different versions of flash-attention, respectively.

+ | Quantization | FlashAttn | Speed (2048 tokens) | Speed (8192 tokens) |
+ | ------------- | :-------: | :------------------: | :------------------: |
+ | BF16 | v2 | 40.93 | 36.14 |
+ | Int8 | v2 | 37.47 | 32.54 |
+ | Int4 | v2 | 50.09 | 38.61 |
+ | BF16 | v1 | 40.75 | 35.34 |
+ | Int8 | v1 | 37.51 | 32.39 |
+ | Int4 | v1 | 45.98 | 36.47 |
+ | BF16 | Disabled | 37.55 | 33.56 |
+ | Int8 | Disabled | 37.84 | 32.65 |
+ | Int4 | Disabled | 48.12 | 36.70 |

+ 具体而言,我们记录在长度为1的上下文的条件下生成8192个token的性能。评测运行于单张A100-SXM4-80G GPU,使用PyTorch 2.0.1和CUDA 11.8。推理速度是生成8192个token的速度均值。

+ In detail, the profiling setting is generating 8192 new tokens with a 1-token context. The profiling runs on a single A100-SXM4-80G GPU with PyTorch 2.0.1 and CUDA 11.8. The inference speed is averaged over the 8192 generated tokens.
+
+ 注意:以上Int4/Int8模型生成速度使用autogptq库给出,当前``AutoModelForCausalLM.from_pretrained``载入的模型生成速度会慢大约20%。我们已经将该问题汇报给HuggingFace团队,若有解决方案将即时更新。
+
+ Note: The generation speed of the Int4/Int8 models above is measured with the model loaded via the autogptq library. A model loaded through ``AutoModelForCausalLM.from_pretrained`` currently generates roughly 20% slower. We have reported this issue to the HuggingFace team and will update here promptly if a solution becomes available.
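For reference, a minimal sketch of the two loading paths compared in the note above; the checkpoint id `Qwen/Qwen-7B-Chat-Int4` is used only as an example and should be replaced with the quantized repo actually being benchmarked:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_id = "Qwen/Qwen-7B-Chat-Int4"  # example repo id, adjust as needed
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Path 1: load the GPTQ checkpoint with auto-gptq (the setup used for the speeds above).
model = AutoGPTQForCausalLM.from_quantized(model_id, device="cuda:0", trust_remote_code=True)

# Path 2: plain transformers loading; currently about 20% slower at generation.
# model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", trust_remote_code=True).eval()
```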

  ### 显存使用 (GPU Memory Usage)

+ 我们还测算了不同模型精度下编码2048个token及生成8192个token的峰值显存占用情况(显存消耗在是否使用FlashAttn的情况下均类似),结果如下所示:

+ We also profile the peak GPU memory usage for encoding 2048 tokens as context (and generating a single token) and for generating 8192 tokens (with a single token as context) under different quantization levels. The GPU memory usage is similar whether or not flash-attention is used. The results are shown below.

  | Quantization Level | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens |
+ | ------------------ | :---------------------------------: | :-----------------------------------: |
+ | BF16 | 16.99GB | 22.53GB |
+ | Int8 | 11.20GB | 16.62GB |
+ | Int4 | 8.21GB | 13.63GB |

  上述性能测算使用[此脚本](https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile.py)完成。
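The numbers above come from the linked profile.py; purely as an illustration of the setting it describes (one context token, 8192 new tokens, peak memory read from torch.cuda), here is a simplified sketch that assumes `model` and `tokenizer` were loaded as in the earlier snippet:

```python
import time
import torch

# Simplified sketch of the generation benchmark described above:
# one arbitrary context token, 8192 new tokens, greedy decoding.
input_ids = torch.tensor([[0]], device="cuda:0")  # single arbitrary context token

torch.cuda.reset_peak_memory_stats()
torch.cuda.synchronize()
start = time.time()
out = model.generate(
    input_ids,
    max_new_tokens=8192,
    min_new_tokens=8192,  # force the full length so the average speed is comparable
    do_sample=False,
)
torch.cuda.synchronize()
elapsed = time.time() - start

new_tokens = out.shape[1] - input_ids.shape[1]
print(f"speed: {new_tokens / elapsed:.2f} tokens/s")
print(f"peak GPU memory: {torch.cuda.max_memory_allocated() / 1024 ** 3:.2f} GB")
```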