renillhuang committed
Commit 9b4111c
Parent(s): 1205968
readme: Update inference speed
Signed-off-by: eric <renillhuang@163.com>
- README.md +14 -23
- README_zh.md +14 -23
- assets/imgs/inf_spd.png +0 -0
README.md
CHANGED
@@ -159,29 +159,20 @@ Test code: https://github.com/nishiwen1214/Benchmark-leakage-detection.
 |CMMLU | 0.38 | 0.39 | 0.23 | 0.27 | 0.22 |

 ### 3.1.6. Inference speed
-Setup inference server on 8x Nvidia RTX3090, and get results from client in unit of tokens per second
-
-|---------|-------|-------|-------|-------|
-|OrionMOE | 25.71 | 27.13 | 28.89 | 29.70 |
-|Qwen32 | 21.16 | 21.92 | 23.14 | 23.56 |
-
-<div align="center">
-<img src="./assets/imgs/inf_spd.png" alt="inf_speed" width="60%" />
-</div>
+Set up an inference server on 8x Nvidia RTX3090 and measure client-side throughput in tokens per second.
+
+|Models | 8x3090 1 concurrent | 8x3090 4 concurrent | 4xA100 1 concurrent | 4xA100 4 concurrent|
+|---------|--------|-------|--------|-------|
+|OrionMOE | 102.77 | 54.61 | 107.76 | 61.83 |
+|Qwen32 | 52.93 | 46.06 | 62.43 | 56.81 |
+
+<br>
+We also compared inference speed on 4x A100 across different input lengths (tokens), again measuring client-side throughput in tokens per second.
+
+|input size | 4k | 8k | 12k | 16k | 32k | 64k |
+|---------|-------|-------|-------|-------|-------|-------|
+|OrionMOE | 90.86 | 54.40 | 31.08 | 29.04 | 22.69 | 14.51 |
+|Qwen32 | 53.99 | 47.59 | 25.98 | 24.35 | 18.64 | 11.86 |
+
+<br>

 <a name="model-inference"></a><br>
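The tables above report client-measured throughput. The commit does not include the benchmark client itself, so as a minimal sketch (hypothetical `generate` callable standing in for a real inference client), a client-side tokens-per-second measurement could look like:

```python
import time

def tokens_per_second(num_tokens, elapsed_seconds):
    """Client-side throughput: generated tokens divided by wall-clock time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return num_tokens / elapsed_seconds

def measure(generate, prompt):
    """Time one generation call.

    `generate` is any callable returning (text, num_generated_tokens);
    it is a stand-in for a real client request to the inference server.
    """
    start = time.perf_counter()
    text, n_tokens = generate(prompt)
    elapsed = time.perf_counter() - start
    return text, tokens_per_second(n_tokens, elapsed)
```

For example, `measure(lambda p: ("ok", 10), "hello")` returns the generated text together with the observed tokens/sec for that single request.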
README_zh.md
CHANGED
@@ -152,29 +152,20 @@
 |CMMLU | 0.38 | 0.39 | 0.23 | 0.27 | 0.22 |

 ### 3.1.6. Inference speed
-Set up on 8x Nvidia RTX3090, measured in tokens per second
-
-|---------|-------|-------|-------|-------|
-|OrionMOE | 25.71 | 27.13 | 28.89 | 29.70 |
-|Qwen32 | 21.16 | 21.92 | 23.14 | 23.56 |
-
-<div align="center">
-<img src="./assets/imgs/inf_spd.png" alt="inf_speed" width="60%" />
-</div>
+Set up on 8x Nvidia RTX3090 and 4x Nvidia A100; results are collected client-side in tokens per second.
+
+|Models | 8x3090 1 concurrent | 8x3090 4 concurrent | 4xA100 1 concurrent | 4xA100 4 concurrent|
+|---------|--------|-------|--------|-------|
+|OrionMOE | 102.77 | 54.61 | 107.76 | 61.83 |
+|Qwen32 | 52.93 | 46.06 | 62.43 | 56.81 |
+
+<br>
+We also compared inference speed on 4x A100 across different input lengths (tokens), collected client-side in tokens per second.
+
+|input size | 4k | 8k | 12k | 16k | 32k | 64k |
+|---------|-------|-------|-------|-------|-------|-------|
+|OrionMOE | 90.86 | 54.40 | 31.08 | 29.04 | 22.69 | 14.51 |
+|Qwen32 | 53.99 | 47.59 | 25.98 | 24.35 | 18.64 | 11.86 |
+
+<br>

 <a name="zh_model-inference"></a><br>
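The concurrency columns in both README tables read as per-stream throughput under parallel clients, which is why the 4-concurrent figures are lower than the 1-concurrent ones. Assuming that aggregation (an assumption; the benchmark script is not part of this commit), the computation could be sketched as:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def per_stream_throughput(generate, prompts, concurrency):
    """Run `prompts` through `concurrency` parallel client threads and
    return per-stream tokens/sec: total tokens / wall time / concurrency.

    `generate` is any callable returning the number of tokens it produced;
    it stands in for one client request to the inference server.
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        token_counts = list(pool.map(generate, prompts))
    elapsed = time.perf_counter() - start
    return sum(token_counts) / elapsed / concurrency
```

With a fixed total load, raising `concurrency` typically raises aggregate server throughput while lowering this per-stream number, matching the pattern in the tables.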
assets/imgs/inf_spd.png
DELETED
Binary file (119 kB)