ZenQin committed
Commit 05f111b
1 Parent(s): fee10e0

Update README.md

Files changed (1)
  1. README.md +65 -47
README.md CHANGED
@@ -1,41 +1,78 @@
  ---
  license: apache-2.0
  ---
- # **JetMoE**
- **JetMoE-8B** is an 8B Mixture-of-Experts (MoE) language model developed by [Yikang Shen](https://scholar.google.com.hk/citations?user=qff5rRYAAAAJ) and [MyShell](https://myshell.ai/).
- JetMoE project aims to provide an LLaMA2-level performance and efficient language model with a limited budget.
- To achieve this goal, JetMoE uses a sparsely activated architecture inspired by the [ModuleFormer](https://arxiv.org/abs/2306.04640).
- Each JetMoE block consists of two MoE layers: Mixture of Attention Heads and Mixture of MLP Experts.
- Given the input tokens, it activates a subset of its experts to process them.
- Thus, JetMoE-8B has 8B parameters in total, but only 2B are activated for each input token.
- This sparse activation schema enables JetMoE to achieve much better training throughput than similar size dense models.
- The model is trained with 1.25T tokens from publicly available datasets on 96 H100s within 13 days.
- Given the current market price of H100 GPU hours, training the model costs around 0.1 million dollars.
- To our surprise, JetMoE-8B performs even better than LLaMA2-7B, LLaMA-13B, and DeepseekMoE-16B despite the lower training cost and computation.
- Compared to a model with similar training and inference computation, like Gemma-2B, JetMoE-8B achieves better performance.
-
- ## Evaluation Results
- For most benchmarks, we use the same evaluation methodology as in the [Open LLM leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard). For code benchmarks, we use the same evaluation methodology as in the LLaMA2 and Deepseek MoE paper. The evaluation results are as follows:
- |Model|Activate Params|Training Tokens|ARC-challenge|Hellaswag|MMLU|TruthfulQA|WinoGrande|GSM8k|Open LLM Leaderboard Average|MBPP|HumanEval|
  |---|---|---|---|---|---|---|---|---|---|---|---|
- |Shot|||25|10|5|0|5|5||3|0|
- |Metric|||acc_norm|acc_norm|acc|mc2|acc|acc||Pass@1|Pass@1|
- |LLaMA2-7B|7B|2T|53.1|78.6|46.9|38.8|74|14.5|51.0|20.8|12.8|
- |LLaMA-13B|13B|1T|**56.2**|**80.9**|47.7|39.5|**76.2**|7.6|51.4|22.0|15.8|
- |DeepseekMoE-16B|2.8B|2T|53.2|79.8|46.3|36.1|73.7|17.3|51.1|34.0|**25.0**|
- |Gemma-2B|2B|2T|48.4|71.8|41.8|33.1|66.3|16.9|46.4|28.0|24.4|
- |JetMoE-8B|2.2B|1.25T|48.7|80.5|**49.2**|**41.7**|70.2|**27.8**|**53.0**|**34.2**|14.6|

  ## Model Usage
- To load the models, you need install [this package](https://github.com/yikangshen/JetMoE):
  ```
  pip install -e .
  ```

  Then you can load the model with the following code:
- ```
  from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig, AutoModelForSequenceClassification
  from jetmoe import JetMoEForCausalLM, JetMoEConfig, JetMoEForSequenceClassification
  AutoConfig.register("jetmoe", JetMoEConfig)
  AutoModelForCausalLM.register(JetMoEConfig, JetMoEForCausalLM)
  AutoModelForSequenceClassification.register(JetMoEConfig, JetMoEForSequenceClassification)
@@ -59,10 +96,6 @@ JetMoE-8B is trained on 1.25T tokens from publicly available datasets, with a le
  </center>
  </figure>

- **Input** Models input text only.
-
- **Output** Models generate text only.
-
  ## Training Details
  Our training recipe follows the [MiniCPM](https://shengdinghu.notion.site/MiniCPM-Unveiling-the-Potential-of-End-side-Large-Language-Models-d4d3a8c426424654a4e80e42a711cb20?pvs=4)'s two-phase training method. Phase 1 uses a constant learning rate with linear warmup and is trained on 1 trillion tokens from large-scale open-source pretraining datasets, including RefinedWeb, Pile, Github data, etc. Phase 2 uses exponential learning rate decay and is trained on 250 billion tokens from phase 1 datasets and extra high-quality open-source datasets.
 
@@ -73,29 +106,14 @@ Our training recipe follows the [MiniCPM](https://shengdinghu.notion.site/MiniCP
  </center>
  </figure>

- ## Authors
- This project is currently contributed by the following authors:
- - Yikang Shen
- - Zhen Guo
- - Tianle Cai
- - Zengyi Qin
-
  ## Technical Report
  For more details, please refer to the JetMoE Technical Report (Coming Soon).

- <!-- ## Citation
-
- Please cite the following paper if you use the data or code in this repo.
-
- ```
- @article{
- }
- ``` -->
-
  ## JetMoE Model Index
  |Model|Index|
  |---|---|
  |JetMoE-8B| [Link](https://huggingface.co/jetmoe/jetmoe-8B) |

- ## Ethical Considerations and Limitations
- JetMoE is a new technology that carries risks with use. Testing conducted to date has been in English, and has not covered, nor could it cover all scenarios. For these reasons, as with all LLMs, JetMoE’s potential outputs cannot be predicted in advance, and the model may in some instances produce inaccurate, biased or other objectionable responses to user prompts. Therefore, before deploying any applications of JetMoE, developers should perform safety testing and tuning tailored to their specific applications of the model.
 
 
  ---
  license: apache-2.0
  ---
+
+ # JetMoE: Reaching LLaMA2 Performance with 0.1M Dollars
+
+
+ <div align="center">
+ <div>&nbsp;</div>
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/641de0213239b631552713e4/ieHnwuczidNNoGRA_FN2y.png" width="500"/>
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/641de0213239b631552713e4/UOsk9_zcbHpCCy6kmryYM.png" width="530"/>
+ </div>
+
+ ## Key Messages
+
+ 1. JetMoE-8B is **trained at a cost of less than $0.1 million**<sup>1</sup> **but outperforms LLaMA2-7B from Meta AI**, which has multi-billion-dollar training resources. LLM training can be **much cheaper than previously thought**.
+
+ 2. JetMoE-8B is **fully open-sourced and academia-friendly** because:
+     - It **only uses public datasets** for training, and the code is open-sourced. No proprietary resource is needed.
+     - It **can be finetuned with a very limited compute budget** (e.g., a consumer-grade GPU) that most labs can afford.
+
+ 3. JetMoE-8B **only has 2.2B active parameters** during inference, which drastically lowers the computational cost. Compared to a model with similar inference computation, like Gemma-2B, JetMoE-8B achieves consistently better performance (a minimal sketch of this sparse-activation idea follows below).
+
+ <sup>1</sup> We used a 96×H100 GPU cluster for 2 weeks, which cost ~$0.08 million.
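
A minimal, editor-added sketch (not from the original model card) of how sparse activation keeps per-token compute low: a toy top-2 mixture-of-experts MLP layer in PyTorch. It is not the JetMoE implementation (JetMoE also uses a Mixture of Attention Heads), and the layer sizes, expert count, and `top_k` value below are illustrative assumptions only; the point is that each token only runs through the experts its router selects.

```python
import torch
import torch.nn as nn

class TinyTop2MoE(nn.Module):
    """Toy top-k MoE MLP: only k of the experts run for each token."""

    def __init__(self, d_model=8, d_hidden=16, num_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
             for _ in range(num_experts)]
        )
        self.top_k = top_k

    def forward(self, x):                               # x: (tokens, d_model)
        scores = self.router(x)                         # (tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)  # pick k experts per token
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for t in range(x.size(0)):                      # naive per-token dispatch, for clarity
            for w, e in zip(weights[t], idx[t]):
                out[t] += w * self.experts[int(e)](x[t])
        return out

layer = TinyTop2MoE()
tokens = torch.randn(4, 8)      # 4 tokens of width d_model = 8
print(layer(tokens).shape)      # torch.Size([4, 8]); only 2 of 8 experts ran per token
```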
25
+
26
+ Website: [https://research.myshell.ai/jetmoe](https://research.myshell.ai/jetmoe)
27
+
28
+ HuggingFace: [https://huggingface.co/jetmoe/jetmoe-8b](https://huggingface.co/jetmoe/jetmoe-8b)
29
+
30
+ Online Demo on Lepton AI: [https://www.lepton.ai/playground/chat?model=jetmoe-8b-chat](https://www.lepton.ai/playground/chat?model=jetmoe-8b-chat)
31
+
32
+ ## Authors
33
+
34
+ The project is contributed by [Yikang Shen](https://scholar.google.com.hk/citations?user=qff5rRYAAAAJ), [Zhen Guo](https://zguo0525.github.io/), [Tianle Cai](https://www.tianle.website/#/) and [Zengyi Qin](https://www.qinzy.tech/). For technical inquiries, please contact [Yikang Shen](https://scholar.google.com.hk/citations?user=qff5rRYAAAAJ). For media and collaboration inquiries, please contact [Zengyi Qin](https://www.qinzy.tech/).
35
+
36
+ ## Collaboration
37
+ **If you have great ideas but need more resources (GPU, data, funding, etc.)**, welcome to contact **MyShell.ai** via [Zengyi Qin](https://www.qinzy.tech/). **MyShell.ai** is open to collaborations and are actively supporting high-quality open-source projects.
38
+
39
+ ## Benchmarks
40
+ We use the same evaluation methodology as in the Open LLM leaderboard. For MBPP code benchmark, we use the same evaluation methodology as in the LLaMA2 and Deepseek-MoE paper. The results are shown below:
+
+ |Model|Activated Params|Training Tokens|Open LLM Leaderboard Avg|ARC|Hellaswag|MMLU|TruthfulQA|WinoGrande|GSM8k|MBPP|HumanEval|
  |---|---|---|---|---|---|---|---|---|---|---|---|
+ |Shot||||25|10|5|0|5|5|3|0|
+ |Metric||||acc_norm|acc_norm|acc|mc2|acc|acc|Pass@1|Pass@1|
+ |LLaMA2-7B|7B|2T|51.0|53.1|78.6|46.9|38.8|74|14.5|20.8|12.8|
+ |LLaMA-13B|13B|1T|51.4|**56.2**|**80.9**|47.7|39.5|**76.2**|7.6|22.0|15.8|
+ |DeepseekMoE-16B|2.8B|2T|51.1|53.2|79.8|46.3|36.1|73.7|17.3|34.0|**25.0**|
+ |Gemma-2B|2B|2T|46.4|48.4|71.8|41.8|33.1|66.3|16.9|28.0|24.4|
+ |JetMoE-8B|2.2B|1.25T|**53.0**|48.7|80.5|**49.2**|**41.7**|70.2|**27.8**|**34.2**|14.6|
+
+ | Model | MT-Bench Score |
+ |---------------------|-----------|
+ | GPT-4 | 9.014 |
+ | GPT-3.5-turbo | 7.995 |
+ | Claude-v1 | 7.923 |
+ | **JetMoE-8B-chat** | **6.681** |
+ | Llama-2-13b-chat | 6.650 |
+ | Vicuna-13b-v1.3 | 6.413 |
+ | Wizardlm-13b | 6.353 |
+ | Llama-2-7b-chat | 6.269 |
+
+ To our surprise, despite the lower training cost and computation, JetMoE-8B performs even better than LLaMA2-7B, LLaMA-13B, and DeepseekMoE-16B. Compared to a model with similar training and inference computation, like Gemma-2B, JetMoE-8B achieves better performance.
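
The MBPP and HumanEval columns above report Pass@1. As a reference for how that metric is usually computed, here is a small, editor-added sketch of the standard unbiased pass@k estimator (as popularized by the HumanEval paper); it is not the authors' evaluation harness, and the sample counts below are made up for illustration.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n samples per problem, c of which pass the tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Made-up example: 20 samples per problem, with 3, 0, and 12 passing samples.
problems = [(20, 3), (20, 0), (20, 12)]
score = sum(pass_at_k(n, c, k=1) for n, c in problems) / len(problems)
print(f"pass@1 = {score:.3f}")   # pass@1 reduces to c/n for each problem
```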
 
  ## Model Usage
+ To load the models, you need to install [this package](https://github.com/yikangshen/JetMoE):
  ```
  pip install -e .
  ```

  Then you can load the model with the following code:
+ ```python
  from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig, AutoModelForSequenceClassification
  from jetmoe import JetMoEForCausalLM, JetMoEConfig, JetMoEForSequenceClassification
+
  AutoConfig.register("jetmoe", JetMoEConfig)
  AutoModelForCausalLM.register(JetMoEConfig, JetMoEForCausalLM)
  AutoModelForSequenceClassification.register(JetMoEConfig, JetMoEForSequenceClassification)
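
# ------------------------------------------------------------------
# Editor-added usage sketch (not part of the original README). It
# assumes the registrations above and the Hub repo id
# "jetmoe/jetmoe-8b" referenced elsewhere in this card.
# ------------------------------------------------------------------
tokenizer = AutoTokenizer.from_pretrained("jetmoe/jetmoe-8b")
model = AutoModelForCausalLM.from_pretrained("jetmoe/jetmoe-8b")

prompt = "The JetMoE architecture is"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))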
 
  </center>
  </figure>

  ## Training Details
  Our training recipe follows the [MiniCPM](https://shengdinghu.notion.site/MiniCPM-Unveiling-the-Potential-of-End-side-Large-Language-Models-d4d3a8c426424654a4e80e42a711cb20?pvs=4)'s two-phase training method. Phase 1 uses a constant learning rate with linear warmup and is trained on 1 trillion tokens from large-scale open-source pretraining datasets, including RefinedWeb, Pile, Github data, etc. Phase 2 uses exponential learning rate decay and is trained on 250 billion tokens from phase 1 datasets and extra high-quality open-source datasets.
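
The schedule described above can be sketched as a simple function of the training step: linear warmup into a constant rate for Phase 1, then an exponential decay during Phase 2. This is an editor-added illustration; the step counts and learning-rate values below are placeholder assumptions, not JetMoE's actual hyperparameters.

```python
PEAK_LR = 3e-4          # placeholder Phase 1 constant rate
FINAL_LR = 3e-5         # placeholder rate at the end of Phase 2
WARMUP_STEPS = 2_000    # placeholder linear-warmup length
PHASE1_STEPS = 100_000  # placeholder steps covering the 1T-token Phase 1
PHASE2_STEPS = 25_000   # placeholder steps covering the 250B-token Phase 2

def learning_rate(step: int) -> float:
    """Linear warmup -> constant (Phase 1) -> exponential decay (Phase 2)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    if step < PHASE1_STEPS:
        return PEAK_LR
    progress = min((step - PHASE1_STEPS) / PHASE2_STEPS, 1.0)
    return PEAK_LR * (FINAL_LR / PEAK_LR) ** progress  # exponential interpolation

for s in (0, 1_000, 50_000, 100_000, 112_500, 125_000):
    print(f"step {s:>7}: lr = {learning_rate(s):.2e}")
```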
 
 
  </center>
  </figure>

  ## Technical Report
  For more details, please refer to the JetMoE Technical Report (Coming Soon).

  ## JetMoE Model Index
  |Model|Index|
  |---|---|
  |JetMoE-8B| [Link](https://huggingface.co/jetmoe/jetmoe-8B) |

+ ## Acknowledgement
+ We express our gratitude to [Shengding Hu](https://shengdinghu.github.io/) for his valuable advice on the Phase 2 data mixture. We also thank [Exabits](https://www.exabits.ai/) for their assistance in setting up the GPU clusters, and [Lepton AI](https://www.lepton.ai/) for their support in setting up the chat demo.