yangapku committed
Commit 806d62b
1 Parent(s): c48054c

update int8 quantization info

Files changed (1)
  1. README.md +29 -16
README.md CHANGED
@@ -155,40 +155,53 @@ response, history = model.chat(tokenizer, "你好", history=None)

  ### Evaluation

- We illustrate the zero-shot performance of the BF16 and Int4 models on the benchmarks, and we find that the quantized model does not suffer from significant performance degradation. Results are shown below:

  | Quantization | MMLU | CEval (val) | GSM8K | HumanEval |
  |--------------|:----:|:-----------:|:-----:|:---------:|
- | BF16 | 64.6 | 69.8 | 61.0 | 43.9 |
  | Int4 | 63.3 | 69.0 | 59.8 | 45.7 |

  ### Inference Speed

- We measured the average inference speed (tokens/s) of generating 2048 and 8192 tokens under BF16 precision and Int4 quantization. Results are shown below:

- | Quantization | Speed (2048 tokens) | Speed (8192 tokens) |
- |--------------|:-------------------:|:-------------------:|
- | BF16 | 30.70 | 21.73 |
- | Int4 | 37.11 | 26.11 |

- In detail, the profiling setting is generating 8192 new tokens with 1 context token. The profiling runs on a single A100-SXM4-80G GPU with PyTorch 2.0.1 and CUDA 11.4. The inference speed is averaged over the 8192 generated tokens.

  ### GPU Memory Usage

- We also profiled the peak GPU memory usage for encoding 2048 tokens as context (and generating a single token) and for generating 8192 tokens (with a single token as context) under BF16 precision and Int4 quantization. The results are shown below.

  | Quantization Level | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens |
- |--------------------|:-----------------------------------:|:-------------------------------------:|
- | BF16 | 30.15GB | 38.94GB |
- | Int4 | 13.00GB | 21.79GB |

  The above profiling was done with [this script](https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile.py).
 
  ### Evaluation

+ We illustrate the zero-shot performance of the BF16, Int8, and Int4 models on the benchmarks, and we find that the quantized models do not suffer from significant performance degradation. Results are shown below:

  | Quantization | MMLU | CEval (val) | GSM8K | HumanEval |
  |--------------|:----:|:-----------:|:-----:|:---------:|
+ | BF16 | 64.6 | 69.8 | 60.1 | 43.9 |
+ | Int8 | 63.6 | 68.6 | 60.0 | 48.2 |
  | Int4 | 63.3 | 69.0 | 59.8 | 45.7 |

  ### Inference Speed

+ We measured the average inference speed (tokens/s) of generating 2048 and 8192 tokens under different quantization levels and with different versions of flash-attention. Results are shown below:

+ | Quantization | FlashAttn | Speed (2048 tokens) | Speed (8192 tokens) |
+ |--------------|:---------:|:-------------------:|:-------------------:|
+ | BF16 | v2 | 32.88 | 24.87 |
+ | Int8 | v2 | 29.28 | 24.22 |
+ | Int4 | v2 | 38.72 | 27.33 |
+ | BF16 | v1 | 32.76 | 28.89 |
+ | Int8 | v1 | 28.31 | 23.87 |
+ | Int4 | v1 | 37.81 | 26.46 |
+ | BF16 | Disabled | 29.32 | 22.91 |
+ | Int8 | Disabled | 31.12 | 24.60 |
+ | Int4 | Disabled | 37.65 | 26.00 |

+ In detail, the profiling setting is generating 8192 new tokens with 1 context token. The profiling runs on a single A100-SXM4-80G GPU with PyTorch 2.0.1 and CUDA 11.8. The inference speed is averaged over the 8192 generated tokens.
+
+ Note: The generation speed of the Int4/Int8 models reported above is measured with the auto-gptq library; a model loaded via ``AutoModelForCausalLM.from_pretrained`` currently generates roughly 20% more slowly. We have reported this issue to the HuggingFace team and will update here promptly if a solution becomes available.

  ### GPU Memory Usage

+ We also profiled the peak GPU memory usage for encoding 2048 tokens as context (and generating a single token) and for generating 8192 tokens (with a single token as context) under different quantization levels. GPU memory usage is similar whether or not flash-attention is used. The results are shown below.

  | Quantization Level | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens |
+ |--------------------|:-----------------------------------:|:-------------------------------------:|
+ | BF16 | 30.15GB | 38.94GB |
+ | Int8 | 18.81GB | 27.54GB |
+ | Int4 | 13.01GB | 21.79GB |

  The above profiling was done with [this script](https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile.py).
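
For context on how the quantized checkpoints above are used, here is a minimal loading sketch. The repo id `Qwen/Qwen-7B-Chat-Int8` is an assumption (substitute the Int4 checkpoint as needed), and `auto-gptq` plus `optimum` are assumed to be installed; the chat call mirrors the usage shown in the hunk header above.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repo id -- replace with the Int8/Int4 checkpoint you are actually using.
model_id = "Qwen/Qwen-7B-Chat-Int8"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# The GPTQ quantization settings are read from the checkpoint's own config,
# so the quantized model is loaded the same way as the BF16 one.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True,
).eval()

response, history = model.chat(tokenizer, "你好", history=None)
print(response)
```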
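The FlashAttn column in the speed table distinguishes runs with flash-attention v2, v1, or disabled. A hedged sketch of reproducing the "Disabled" rows, assuming the remote Qwen modeling code exposes a `use_flash_attn` switch in its config (an assumption, not confirmed by this diff):

```python
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "Qwen/Qwen-7B-Chat-Int8"  # assumed repo id, as in the sketch above

config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
# Assumption: the remote code reads `use_flash_attn` from the config. Setting it
# to False disables flash-attention; otherwise the installed flash-attn build
# (v1 or v2) is picked up automatically.
config.use_flash_attn = False

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    config=config,
    device_map="auto",
    trust_remote_code=True,
).eval()
```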
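The speed numbers themselves come from the linked profiling script. As a rough, hypothetical approximation of that setting (1 context token, 8192 new tokens, speed averaged over the generated tokens), one could time `model.generate` directly:

```python
import time

import torch

# Assumes `model` and `tokenizer` are loaded as in the sketches above.
context = tokenizer("你好", return_tensors="pt").input_ids[:, :1].to(model.device)  # 1 context token
max_new_tokens = 8192

torch.cuda.synchronize()
start = time.time()
output = model.generate(
    context,
    max_new_tokens=max_new_tokens,
    min_new_tokens=max_new_tokens,  # force the full 8192 tokens so averages are comparable
    do_sample=False,
)
torch.cuda.synchronize()
elapsed = time.time() - start

generated = output.shape[1] - context.shape[1]
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.2f} tokens/s")
```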
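Likewise, the peak-memory figures can be approximated with PyTorch's CUDA memory counters; the linked script may measure memory differently, so treat this only as a sketch:

```python
import torch

# Assumes `model` and `tokenizer` are loaded as in the sketches above.
one_token = tokenizer("你好", return_tensors="pt").input_ids[:, :1].to(model.device)

# Case 1: encode 2048 context tokens (one token repeated) and generate a single token.
torch.cuda.empty_cache()
torch.cuda.reset_peak_memory_stats()
model.generate(one_token.repeat(1, 2048), max_new_tokens=1, do_sample=False)
print(f"peak, encode 2048 tokens: {torch.cuda.max_memory_allocated() / 2**30:.2f} GB")

# Case 2: generate 8192 tokens from a single context token.
torch.cuda.empty_cache()
torch.cuda.reset_peak_memory_stats()
model.generate(one_token, max_new_tokens=8192, min_new_tokens=8192, do_sample=False)
print(f"peak, generate 8192 tokens: {torch.cuda.max_memory_allocated() / 2**30:.2f} GB")
```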