renillhuang committed on
Commit
9b4111c
1 Parent(s): 1205968

readme: Update inference speed


Signed-off-by: eric <renillhuang@163.com>

Files changed (3)
  1. README.md +14 -23
  2. README_zh.md +14 -23
  3. assets/imgs/inf_spd.png +0 -0
README.md CHANGED
@@ -159,29 +159,20 @@ Test code: https://github.com/nishiwen1214/Benchmark-leakage-detection.
  |CMMLU | 0.38 | 0.39 | 0.23 | 0.27 | 0.22 |

  ### 3.1.6. Inference speed
- Set up an inference server on 8x Nvidia RTX3090 GPUs; results are measured from the client in tokens per second.<br>
- Inference speed varies with the number of concurrent requests and the output length, so we ran several groups of tests to allow horizontal comparison. Each group is labeled \<n>para_out\<m>: for example, "4para_out220" is the inference speed with 4 concurrent client requests and an average output length of 220 tokens.
-
- |OrionLLM_V2.4.6.1|1para_out62|1para_out85|1para_out125|1para_out210|
- |---------|-------|-------|-------|-------|
- |OrionMOE | 33.04 | 33.43 | 33.53 | 33.59 |
- |Qwen32 | 26.46 | 26.73 | 26.80 | 27.03 |
-
- |OrionLLM_V2.4.6.1|4para_out62|4para_out90|4para_out125|4para_out220|
- |---------|-------|-------|-------|-------|
- |OrionMOE | 29.45 | 30.45 | 31.04 | 31.46 |
- |Qwen32 | 23.61 | 24.30 | 24.86 | 25.17 |
-
- |OrionLLM_V2.4.6.1|8para_out62|8para_out85|8para_out125|8para_out220|
- |---------|-------|-------|-------|-------|
- |OrionMOE | 25.71 | 27.13 | 28.89 | 29.70 |
- |Qwen32 | 21.16 | 21.92 | 23.14 | 23.56 |
-
- <div align="center">
-   <img src="./assets/imgs/inf_spd.png" alt="inf_speed" width="60%" />
- </div>
+ Set up inference servers on 8x Nvidia RTX3090 and 4x Nvidia A100 GPUs; results are measured from the client in tokens per second.
+ |Models | 8x RTX3090, 1 concurrent | 8x RTX3090, 4 concurrent | 4x A100, 1 concurrent | 4x A100, 4 concurrent |
+ |---------|--------|-------|--------|-------|
+ |OrionMOE | 102.77 | 54.61 | 107.76 | 61.83 |
+ |Qwen32 | 52.93 | 46.06 | 62.43 | 56.81 |
+
+ <br>
+ We also compared inference speed on the 4x A100 setup across different input lengths (in tokens), again measuring from the client in tokens per second.
+
+ |Input length | 4k | 8k | 12k | 16k | 32k | 64k |
+ |---------|-------|-------|-------|-------|-------|-------|
+ |OrionMOE | 90.86 | 54.40 | 31.08 | 29.04 | 22.69 | 14.51 |
+ |Qwen32 | 53.99 | 47.59 | 25.98 | 24.35 | 18.64 | 11.86 |
+ <br>


  <a name="model-inference"></a><br>
README_zh.md CHANGED
@@ -152,29 +152,20 @@
  |CMMLU | 0.38 | 0.39 | 0.23 | 0.27 | 0.22 |

  ### 3.1.6. Inference speed
- Set up an inference server on 8x Nvidia RTX3090 GPUs; results are measured from the client in tokens per second.<br>
- Inference speed varies with the number of concurrent requests and the output length, so we ran several groups of tests to allow horizontal comparison. Each group is labeled \<client concurrency>para_out\<output tokens>: for example, "4para_out220" is the inference speed with 4 concurrent client requests and an average output length of 220 tokens.
-
- |OrionLLM_V2.4.6.1|1para_out62|1para_out85|1para_out125|1para_out210|
- |---------|-------|-------|-------|-------|
- |OrionMOE | 33.04 | 33.43 | 33.53 | 33.59 |
- |Qwen32 | 26.46 | 26.73 | 26.80 | 27.03 |
-
- |OrionLLM_V2.4.6.1|4para_out62|4para_out90|4para_out125|4para_out220|
- |---------|-------|-------|-------|-------|
- |OrionMOE | 29.45 | 30.45 | 31.04 | 31.46 |
- |Qwen32 | 23.61 | 24.30 | 24.86 | 25.17 |
-
- |OrionLLM_V2.4.6.1|8para_out62|8para_out85|8para_out125|8para_out220|
- |---------|-------|-------|-------|-------|
- |OrionMOE | 25.71 | 27.13 | 28.89 | 29.70 |
- |Qwen32 | 21.16 | 21.92 | 23.14 | 23.56 |
-
- <div align="center">
-   <img src="./assets/imgs/inf_spd.png" alt="inf_speed" width="60%" />
- </div>
+ Set up inference servers on 8x Nvidia RTX3090 and 4x Nvidia A100 GPUs; results are measured from the client in tokens per second.
+ |Models | 8x RTX3090, 1 concurrent | 8x RTX3090, 4 concurrent | 4x A100, 1 concurrent | 4x A100, 4 concurrent |
+ |---------|--------|-------|--------|-------|
+ |OrionMOE | 102.77 | 54.61 | 107.76 | 61.83 |
+ |Qwen32 | 52.93 | 46.06 | 62.43 | 56.81 |
+
+ <br>
+ We also compared inference speed on the 4x A100 setup across different input lengths (in tokens), again measuring from the client in tokens per second.
+
+ |Input length | 4k | 8k | 12k | 16k | 32k | 64k |
+ |---------|-------|-------|-------|-------|-------|-------|
+ |OrionMOE | 90.86 | 54.40 | 31.08 | 29.04 | 22.69 | 14.51 |
+ |Qwen32 | 53.99 | 47.59 | 25.98 | 24.35 | 18.64 | 11.86 |
+ <br>


  <a name="zh_model-inference"></a><br>
assets/imgs/inf_spd.png DELETED
Binary file (119 kB)