renillhuang committed
Commit 1205968
1 Parent(s): 94f9d42

readme: Modify image size

Signed-off-by: eric <renillhuang@163.com>

Files changed (2)
  1. README.md +5 -4
  2. README_zh.md +5 -4
README.md CHANGED
@@ -77,7 +77,7 @@ tags:
  - Model pretrain data distribution
  - The training dataset is primarily composed of English, Chinese, and other languages, accounting for 50%, 25%, and 12% of the data, respectively. Additionally, code makes up 9%, while mathematical text accounts for 4%. The distribution by topics is detailed in the table below.
  <div align="center">
- <img src="./assets/imgs/data_src_dist.png" alt="logo" width="70%" />
+ <img src="./assets/imgs/data_src_dist.png" alt="logo" width="50%" />
  </div>


@@ -159,7 +159,8 @@ Test code: https://github.com/nishiwen1214/Benchmark-leakage-detection.
  |CMMLU | 0.38 | 0.39 | 0.23 | 0.27 | 0.22 |

  ### 3.1.6. Inference speed
- Setup inference server on 8x Nvidia RTX3090, and get results from client in unit of tokens per second.
+ Setup inference server on 8x Nvidia RTX3090, and get results from client in unit of tokens per second.<br>
+ We found that the inference speed results vary based on the number of concurrent requests and the length of output. To facilitate horizontal comparisons, we conducted multiple sets of tests. Each set of test data has a specific format: \<n>para_out\<m>. For example, "4para_out220" indicates the inference speed when there are 4 concurrent requests from the client and the average output token length is 220.

  |OrionLLM_V2.4.6.1|1para_out62|1para_out85|1para_out125|1para_out210|
  |---------|-------|-------|-------|-------|
@@ -176,10 +177,10 @@ Setup inference server on 8x Nvidia RTX3090, and get results from client in un
  |OrionMOE | 25.71 | 27.13 | 28.89 | 29.70 |
  |Qwen32 | 21.16 | 21.92 | 23.14 | 23.56 |

- We found that the inference speed results vary based on the number of concurrent requests and the length of output. To facilitate horizontal comparisons, we conducted multiple sets of tests. Each set of test data has a specific format: \<n>para_out\<m>. For example, "4para_out220" indicates the inference speed when there are 4 concurrent requests from the client and the average output token length is 220.
+

  <div align="center">
- <img src="./assets/imgs/inf_spd.png" alt="inf_speed" width="100%" />
+ <img src="./assets/imgs/inf_spd.png" alt="inf_speed" width="60%" />
  </div>
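
For context on the `<n>para_out<m>` measurements described in the hunk above, here is a minimal sketch of a client-side throughput probe. It is hypothetical, not the benchmark client behind these tables, and it assumes an OpenAI-compatible HTTP completions endpoint; the URL, model name, and `usage.completion_tokens` response field are all assumptions.

```python
# Hypothetical client-side tokens/second probe; not the repo's actual benchmark code.
# Assumes an OpenAI-compatible completions endpoint at BASE_URL.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

BASE_URL = "http://localhost:8000/v1/completions"  # assumed server address

def one_request(prompt: str, max_tokens: int) -> float:
    """Send one completion request and return its tokens-per-second rate."""
    start = time.perf_counter()
    resp = requests.post(
        BASE_URL,
        json={"model": "orion-moe8x7b",  # placeholder model name
              "prompt": prompt,
              "max_tokens": max_tokens},
        timeout=600,
    )
    resp.raise_for_status()
    elapsed = time.perf_counter() - start
    out_tokens = resp.json()["usage"]["completion_tokens"]  # assumed schema
    return out_tokens / elapsed

def run_group(concurrency: int, max_tokens: int, prompt: str = "Hello") -> float:
    """One '<n>para_out<m>' group: n concurrent requests, ~m output tokens each.
    Returns the mean per-request tokens/second."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        speeds = list(pool.map(lambda _: one_request(prompt, max_tokens),
                               range(concurrency)))
    return sum(speeds) / len(speeds)

if __name__ == "__main__":
    # e.g. "4para_out220": 4 concurrent requests, ~220 output tokens each
    print(f"4para_out220: {run_group(4, 220):.2f} tok/s")
```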
README_zh.md CHANGED
@@ -69,7 +69,7 @@
  - Orion-MOE8x7B-Base training data composition
  - By language, the pretraining data consists mainly of English, Chinese, and other languages, accounting for 50%, 25%, and 12%, respectively. By category, code accounts for 9% and mathematical text for 4%; see the figure below for the distribution.
  <div align="center">
- <img src="./assets/imgs/data_src_dist.png" alt="logo" width="70%" />
+ <img src="./assets/imgs/data_src_dist.png" alt="logo" width="50%" />
  </div>


@@ -152,7 +152,8 @@
  |CMMLU | 0.38 | 0.39 | 0.23 | 0.27 | 0.22 |

  ### 3.1.6. Inference speed
- The inference server was set up on 8x Nvidia RTX3090 GPUs; test results were collected from the client in tokens per second.
+ The inference server was set up on 8x Nvidia RTX3090 GPUs; test results were collected from the client in tokens per second.<br>
+ We found that inference speed varies with the number of concurrent requests and the model's output length. To make horizontal comparison easier, we ran multiple groups of tests. Each group's label has the format <client concurrency>para_out<output tokens per request>; for example, "4para_out220" denotes the inference speed with 4 concurrent client requests and an average output length of 220 tokens.

  |OrionLLM_V2.4.6.1|1para_out62|1para_out85|1para_out125|1para_out210|
  |---------|-------|-------|-------|-------|
@@ -169,10 +170,10 @@
  |OrionMOE | 25.71 | 27.13 | 28.89 | 29.70 |
  |Qwen32 | 21.16 | 21.92 | 23.14 | 23.56 |

- We found that inference speed varies with the number of concurrent requests and the model's output length. To make horizontal comparison easier, we ran multiple groups of tests. Each group's label has the format <client concurrency>para_out<output tokens per request>; for example, "4para_out220" denotes the inference speed with 4 concurrent client requests and an average output length of 220 tokens.
+

  <div align="center">
- <img src="./assets/imgs/inf_spd.png" alt="inf_speed" width="100%" />
+ <img src="./assets/imgs/inf_spd.png" alt="inf_speed" width="60%" />
  </div>

