renillhuang committed
Commit 9b4111c
Parent(s): 1205968
readme: Update inference speed
Signed-off-by: eric <renillhuang@163.com>
- README.md +14 -23
- README_zh.md +14 -23
- assets/imgs/inf_spd.png +0 -0
README.md
CHANGED
@@ -159,29 +159,20 @@ Test code: https://github.com/nishiwen1214/Benchmark-leakage-detection.
 |CMMLU | 0.38 | 0.39 | 0.23 | 0.27 | 0.22 |

 ### 3.1.6. Inference speed
-Setup inference server on 8x Nvidia RTX3090, and get results from client in unit of tokens per second
-
-|---------|-------|-------|-------|-------|
-|OrionMOE | 25.71 | 27.13 | 28.89 | 29.70 |
-|Qwen32 | 21.16 | 21.92 | 23.14 | 23.56 |
-
-<div align="center">
-<img src="./assets/imgs/inf_spd.png" alt="inf_speed" width="60%" />
-</div>
+Set up an inference server on 8x Nvidia RTX3090 and measure client-side throughput in tokens per second.
+
+|Models | 8x3090 1 concurrent | 8x3090 4 concurrent | 4xA100 1 concurrent | 4xA100 4 concurrent|
+|---------|--------|-------|--------|-------|
+|OrionMOE | 102.77 | 54.61 | 107.76 | 61.83 |
+|Qwen32 | 52.93 | 46.06 | 62.43 | 56.81 |
+
+<br>
+We also compared inference speed on 4x A100 across different input lengths (tokens), again measuring client-side throughput in tokens per second.
+
+|input size | 4k | 8k | 12k | 16k | 32k | 64k |
+|---------|-------|-------|-------|-------|-------|-------|
+|OrionMOE | 90.86 | 54.40 | 31.08 | 29.04 | 22.69 | 14.51 |
+|Qwen32 | 53.99 | 47.59 | 25.98 | 24.35 | 18.64 | 11.86 |
+
+<br>

 <a name="model-inference"></a><br>
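The tables above report client-measured throughput. The commit does not include the benchmark client itself, so as a minimal sketch (hypothetical `generate` callable standing in for a real inference client), a client-side tokens-per-second measurement could look like:

```python
import time

def tokens_per_second(num_tokens, elapsed_seconds):
    """Client-side throughput: generated tokens divided by wall-clock time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return num_tokens / elapsed_seconds

def measure(generate, prompt):
    """Time one generation call.

    `generate` is any callable returning (text, num_generated_tokens);
    it is a stand-in for a real client request to the inference server.
    """
    start = time.perf_counter()
    text, n_tokens = generate(prompt)
    elapsed = time.perf_counter() - start
    return text, tokens_per_second(n_tokens, elapsed)
```

For example, `measure(lambda p: ("ok", 10), "hello")` returns the generated text together with the observed tokens/sec for that single request.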
README_zh.md
CHANGED
@@ -152,29 +152,20 @@
 |CMMLU | 0.38 | 0.39 | 0.23 | 0.27 | 0.22 |

 ### 3.1.6. Inference speed
-Set up on 8x Nvidia RTX3090, measured in tokens per second
-
-|---------|-------|-------|-------|-------|
-|OrionMOE | 25.71 | 27.13 | 28.89 | 29.70 |
-|Qwen32 | 21.16 | 21.92 | 23.14 | 23.56 |
-
-<div align="center">
-<img src="./assets/imgs/inf_spd.png" alt="inf_speed" width="60%" />
-</div>
+Set up on 8x Nvidia RTX3090 and 4x Nvidia A100; results are collected client-side in tokens per second.
+
+|Models | 8x3090 1 concurrent | 8x3090 4 concurrent | 4xA100 1 concurrent | 4xA100 4 concurrent|
+|---------|--------|-------|--------|-------|
+|OrionMOE | 102.77 | 54.61 | 107.76 | 61.83 |
+|Qwen32 | 52.93 | 46.06 | 62.43 | 56.81 |
+
+<br>
+We also compared inference speed on 4x A100 across different input lengths (tokens), collected client-side in tokens per second.
+
+|input size | 4k | 8k | 12k | 16k | 32k | 64k |
+|---------|-------|-------|-------|-------|-------|-------|
+|OrionMOE | 90.86 | 54.40 | 31.08 | 29.04 | 22.69 | 14.51 |
+|Qwen32 | 53.99 | 47.59 | 25.98 | 24.35 | 18.64 | 11.86 |
+
+<br>

 <a name="zh_model-inference"></a><br>
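The concurrency columns in both README tables read as per-stream throughput under parallel clients, which is why the 4-concurrent figures are lower than the 1-concurrent ones. Assuming that aggregation (an assumption; the benchmark script is not part of this commit), the computation could be sketched as:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def per_stream_throughput(generate, prompts, concurrency):
    """Run `prompts` through `concurrency` parallel client threads and
    return per-stream tokens/sec: total tokens / wall time / concurrency.

    `generate` is any callable returning the number of tokens it produced;
    it stands in for one client request to the inference server.
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        token_counts = list(pool.map(generate, prompts))
    elapsed = time.perf_counter() - start
    return sum(token_counts) / elapsed / concurrency
```

With a fixed total load, raising `concurrency` typically raises aggregate server throughput while lowering this per-stream number, matching the pattern in the tables.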
assets/imgs/inf_spd.png
DELETED
Binary file (119 kB)