yangapku committed
Commit: 563db39
1 Parent(s): f99c7d5

update content of vllm gptq model

Files changed (1):
  1. README.md (+12 -4)

README.md CHANGED
@@ -136,9 +136,9 @@ print(response)
 # They are an asset to the team, and their efforts do not go unnoticed. Keep up the great work!
 ```
 
-注意:vLLM暂不支持gptq量化方案,我们将近期给出解决方案。
+注意:使用vLLM运行量化模型需安装我们[vLLM分支仓库](https://github.com/QwenLM/vllm-gptq)。暂不支持int8模型,近期将更新。
 
-Note: vLLM does not currently support GPTQ quantization; we will provide a solution in the near future.
+Note: To run quantized models with vLLM, you need to install our [vLLM fork](https://github.com/QwenLM/vllm-gptq). The int8 model is not supported for the time being; we will add support soon.
 
 关于更多的使用说明,请参考我们的[GitHub repo](https://github.com/QwenLM/Qwen)获取更多信息。
 
@@ -192,12 +192,20 @@ We measured the average inference speed and GPU memory usage of generating 2048
 | Int4 | HF + FlashAttn-v2 | 1 | 1 | 2048 | 11.67 | 48.86GB |
 | Int4 | HF + FlashAttn-v1 | 1 | 1 | 2048 | 11.27 | 48.86GB |
 | Int4 | HF + No FlashAttn | 1 | 1 | 2048 | 11.32 | 48.86GB |
+| Int4 | vLLM | 1 | 1 | 2048 | 14.63 | Pre-Allocated* |
+| Int4 | vLLM | 2 | 1 | 2048 | 20.76 | Pre-Allocated* |
+| Int4 | vLLM | 4 | 1 | 2048 | 27.19 | Pre-Allocated* |
 | Int4 | HF + FlashAttn-v2 | 2 | 6144 | 2048 | 6.75 | 85.99GB |
 | Int4 | HF + FlashAttn-v1 | 2 | 6144 | 2048 | 6.32 | 85.99GB |
 | Int4 | HF + No FlashAttn | 2 | 6144 | 2048 | 5.97 | 88.30GB |
+| Int4 | vLLM | 2 | 6144 | 2048 | 18.07 | Pre-Allocated* |
+| Int4 | vLLM | 4 | 6144 | 2048 | 24.56 | Pre-Allocated* |
-| Int4 | HF + FlashAttn-v2 | 3 | 14336 | 2048 | 4.18 | 85.99GB |
-| Int4 | HF + FlashAttn-v1 | 3 | 14336 | 2048 | 3.72 | 85.99GB |
+| Int4 | HF + FlashAttn-v2 | 3 | 14336 | 2048 | 4.18 | 148.73GB |
+| Int4 | HF + FlashAttn-v1 | 3 | 14336 | 2048 | 3.72 | 148.73GB |
 | Int4 | HF + No FlashAttn | 3 | 14336 | 2048 | OOM | OOM |
+| Int4 | vLLM | 2 | 14336 | 2048 | 14.51 | Pre-Allocated* |
+| Int4 | vLLM | 4 | 14336 | 2048 | 19.28 | Pre-Allocated* |
+| Int4 | vLLM | 4 | 30720 | 2048 | 16.93 | Pre-Allocated* |
 
  \* vLLM会提前预分配显存,因此无法探测最大显存使用情况。HF是指使用Huggingface Transformers库进行推理。
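The added note points readers at the QwenLM/vllm-gptq fork. As a hypothetical sketch of how one might install it and serve an Int4 GPTQ checkpoint, assuming the fork keeps upstream vLLM's OpenAI-compatible server entrypoint and its `--quantization` flag (the model id `Qwen/Qwen-72B-Chat-Int4` and the flag value `gptq` are assumptions, not confirmed by this commit):

```shell
# Install the fork (replaces stock vLLM in the current environment).
pip install git+https://github.com/QwenLM/vllm-gptq.git

# Launch an OpenAI-compatible server on the quantized model.
# --trust-remote-code is needed because Qwen ships custom modeling code.
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen-72B-Chat-Int4 \
    --quantization gptq \
    --trust-remote-code
```

Multi-GPU serving would additionally need a tensor-parallel flag sized to the GPU count used in the benchmark rows.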
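To put the new vLLM rows in perspective, a quick back-of-the-envelope check (numbers copied from the bs=1, context-1 rows of the table above) converts tokens/s into per-request wall time:

```python
# Wall time to generate 2048 new tokens, and vLLM's relative speedup,
# using the Int4 benchmark rows at batch size 1, context length 1.
new_tokens = 2048
hf_tok_per_s = 11.67    # HF + FlashAttn-v2 row
vllm_tok_per_s = 14.63  # vLLM row added by this commit

hf_seconds = new_tokens / hf_tok_per_s
vllm_seconds = new_tokens / vllm_tok_per_s
speedup = vllm_tok_per_s / hf_tok_per_s

print(f"HF: {hf_seconds:.0f}s, vLLM: {vllm_seconds:.0f}s, speedup: {speedup:.2f}x")
# → HF: 175s, vLLM: 140s, speedup: 1.25x
```

So at batch size 1 the fork shaves roughly half a minute off a full 2048-token generation; the larger gains in the table come from batching, where vLLM reaches 27.19 tok/s at batch size 4.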