update readme
- README.md +3 -2
- images/Phase1_data.png +0 -0
- images/Phase2_data.png +0 -0
README.md CHANGED
@@ -64,11 +64,12 @@ JetMoE-8x1B is trained on 1.25T tokens from publicly available datasets, with a
 **Output** Models generate text only.
 
 ## Training Details
-Our training recipe follows the [MiniCPM](https://shengdinghu.notion.site/MiniCPM-Unveiling-the-Potential-of-End-side-Large-Language-Models-d4d3a8c426424654a4e80e42a711cb20?pvs=4)'s two-phases training method. Phase 1 uses a constant learning rate with linear warmup and is trained on 1 trillion tokens from large-scale open-source pretraining datasets, including RefinedWeb, Pile, Github data, etc. Phase 2 uses
+Our training recipe follows [MiniCPM](https://shengdinghu.notion.site/MiniCPM-Unveiling-the-Potential-of-End-side-Large-Language-Models-d4d3a8c426424654a4e80e42a711cb20?pvs=4)'s two-phase training method. Phase 1 uses a constant learning rate with linear warmup and is trained on 1 trillion tokens from large-scale open-source pretraining datasets, including RefinedWeb, Pile, GitHub data, etc. Phase 2 uses exponential learning rate decay and is trained on 250 billion tokens from the phase 1 datasets and extra high-quality open-source datasets.
+
 <figure>
 <center>
 <img src="images/Phase1_data.png" width="60%">
-<img src="images/Phase2_data.png" width="
+<img src="images/Phase2_data.png" width="60%">
 </center>
 </figure>
 
images/Phase1_data.png CHANGED
images/Phase2_data.png CHANGED
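For context on the schedule described in the updated paragraph, here is a minimal sketch of a two-phase learning-rate schedule (linear warmup to a constant rate for phase 1, then exponential decay over phase 2). Only the 1T/250B token budgets come from the README; the function name, peak learning rate, warmup length, and decay floor are illustrative placeholders, not values from JetMoE or MiniCPM.

```python
def two_phase_lr(tokens_seen: float,
                 peak_lr: float = 5e-4,          # placeholder peak LR, not stated in the README
                 warmup_tokens: float = 10e9,    # placeholder warmup length, not stated
                 phase1_tokens: float = 1000e9,  # 1T tokens in phase 1 (from the README)
                 phase2_tokens: float = 250e9,   # 250B tokens in phase 2 (from the README)
                 final_lr_ratio: float = 0.1) -> float:  # placeholder decay floor
    """Phase 1: linear warmup, then a constant learning rate.
    Phase 2: exponential decay toward final_lr_ratio * peak_lr."""
    if tokens_seen < warmup_tokens:
        return peak_lr * tokens_seen / warmup_tokens   # warmup: 0 -> peak_lr
    if tokens_seen < phase1_tokens:
        return peak_lr                                  # constant for the rest of phase 1
    progress = min((tokens_seen - phase1_tokens) / phase2_tokens, 1.0)
    return peak_lr * (final_lr_ratio ** progress)       # exponential decay over phase 2

# Example: start of the decay phase vs. after the full 250B decay tokens.
print(two_phase_lr(1000e9))   # 5e-4
print(two_phase_lr(1250e9))   # 5e-5
```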