yangapku committed
Commit: 563db39
1 Parent(s): f99c7d5

update content of vllm gptq model

Files changed (1):
  1. README.md (+12 -4)

README.md CHANGED
@@ -136,9 +136,9 @@ print(response)
 # They are an asset to the team, and their efforts do not go unnoticed. Keep up the great work!
 ```
 
-注意:vLLM暂不支持gptq量化方案,我们将近期给出解决方案。
+注意:使用vLLM运行量化模型需安装我们[vLLM分支仓库](https://github.com/QwenLM/vllm-gptq)。暂不支持int8模型,近期将更新。
 
-Note: vLLM does not currently support GPTQ quantization; we will provide a solution in the near future.
+Note: To run quantized models with vLLM, you need to install our [vLLM fork](https://github.com/QwenLM/vllm-gptq). The int8 model is not supported for the time being; we will add support soon.
 
 关于更多的使用说明,请参考我们的[GitHub repo](https://github.com/QwenLM/Qwen)获取更多信息。
 
@@ -192,12 +192,20 @@ We measured the average inference speed and GPU memory usage of generating 2048
 | Int4 | HF + FlashAttn-v2 | 1 | 1 | 2048 | 11.67 | 48.86GB |
 | Int4 | HF + FlashAttn-v1 | 1 | 1 | 2048 | 11.27 | 48.86GB |
 | Int4 | HF + No FlashAttn | 1 | 1 | 2048 | 11.32 | 48.86GB |
+| Int4 | vLLM | 1 | 1 | 2048 | 14.63 | Pre-Allocated* |
+| Int4 | vLLM | 2 | 1 | 2048 | 20.76 | Pre-Allocated* |
+| Int4 | vLLM | 4 | 1 | 2048 | 27.19 | Pre-Allocated* |
 | Int4 | HF + FlashAttn-v2 | 2 | 6144 | 2048 | 6.75 | 85.99GB |
 | Int4 | HF + FlashAttn-v1 | 2 | 6144 | 2048 | 6.32 | 85.99GB |
 | Int4 | HF + No FlashAttn | 2 | 6144 | 2048 | 5.97 | 88.30GB |
+| Int4 | vLLM | 2 | 6144 | 2048 | 18.07 | Pre-Allocated* |
+| Int4 | vLLM | 4 | 6144 | 2048 | 24.56 | Pre-Allocated* |
-| Int4 | HF + FlashAttn-v2 | 3 | 14336 | 2048 | 4.18 | 85.99GB |
-| Int4 | HF + FlashAttn-v1 | 3 | 14336 | 2048 | 3.72 | 85.99GB |
+| Int4 | HF + FlashAttn-v2 | 3 | 14336 | 2048 | 4.18 | 148.73GB |
+| Int4 | HF + FlashAttn-v1 | 3 | 14336 | 2048 | 3.72 | 148.73GB |
 | Int4 | HF + No FlashAttn | 3 | 14336 | 2048 | OOM | OOM |
+| Int4 | vLLM | 2 | 14336 | 2048 | 14.51 | Pre-Allocated* |
+| Int4 | vLLM | 4 | 14336 | 2048 | 19.28 | Pre-Allocated* |
+| Int4 | vLLM | 4 | 30720 | 2048 | 16.93 | Pre-Allocated* |
 
  \* vLLM会提前预分配显存,因此无法探测最大显存使用情况。HF是指使用Huggingface Transformers库进行推理。
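The added note points readers at the QwenLM/vllm-gptq fork. As a hypothetical sketch of how one might install it and serve an Int4 GPTQ checkpoint, assuming the fork keeps upstream vLLM's OpenAI-compatible server entrypoint and its `--quantization` flag (the model id `Qwen/Qwen-72B-Chat-Int4` and the flag value `gptq` are assumptions, not confirmed by this commit):

```shell
# Install the fork (replaces stock vLLM in the current environment).
pip install git+https://github.com/QwenLM/vllm-gptq.git

# Launch an OpenAI-compatible server on the quantized model.
# --trust-remote-code is needed because Qwen ships custom modeling code.
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen-72B-Chat-Int4 \
    --quantization gptq \
    --trust-remote-code
```

Multi-GPU serving would additionally need a tensor-parallel flag sized to the GPU count used in the benchmark rows.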
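To put the new vLLM rows in perspective, a quick back-of-the-envelope check (numbers copied from the bs=1, context-1 rows of the table above) converts tokens/s into per-request wall time:

```python
# Wall time to generate 2048 new tokens, and vLLM's relative speedup,
# using the Int4 benchmark rows at batch size 1, context length 1.
new_tokens = 2048
hf_tok_per_s = 11.67    # HF + FlashAttn-v2 row
vllm_tok_per_s = 14.63  # vLLM row added by this commit

hf_seconds = new_tokens / hf_tok_per_s
vllm_seconds = new_tokens / vllm_tok_per_s
speedup = vllm_tok_per_s / hf_tok_per_s

print(f"HF: {hf_seconds:.0f}s, vLLM: {vllm_seconds:.0f}s, speedup: {speedup:.2f}x")
# → HF: 175s, vLLM: 140s, speedup: 1.25x
```

So at batch size 1 the fork shaves roughly half a minute off a full 2048-token generation; the larger gains in the table come from batching, where vLLM reaches 27.19 tok/s at batch size 4.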