Doge 120M MoE checkpoint
Doge uses `wsd_scheduler` as its training scheduler, which divides the learning-rate curve into three stages: warmup, stable, and decay. This makes it possible to continue training on any new dataset from any checkpoint taken in the stable stage without loss spikes.
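For reference, the three-stage schedule can be sketched with a plain PyTorch `LambdaLR`. This is a minimal illustration, not the Doge training code: the linear decay shape and the 1600 decay steps are assumptions, while the warmup and stable lengths come from the Doge-120M-MoE row in the table below.

```python
# Minimal warmup-stable-decay (WSD) schedule sketch using PyTorch's LambdaLR.
# The decay shape (linear here) and num_decay_steps are assumptions for
# illustration; warmup/stable lengths match the Doge-120M-MoE row below.
import torch
from torch.optim.lr_scheduler import LambdaLR


def wsd_lambda(num_warmup_steps: int, num_stable_steps: int, num_decay_steps: int):
    def lr_lambda(step: int) -> float:
        if step < num_warmup_steps:
            # Warmup: ramp linearly from 0 up to the peak learning rate.
            return step / max(1, num_warmup_steps)
        if step < num_warmup_steps + num_stable_steps:
            # Stable: hold the peak learning rate. Checkpoints saved in this
            # stage can be resumed on new data without a learning-rate jump.
            return 1.0
        # Decay: anneal linearly from the peak learning rate down to 0.
        steps_into_decay = step - num_warmup_steps - num_stable_steps
        return max(0.0, 1.0 - steps_into_decay / max(1, num_decay_steps))

    return lr_lambda


model = torch.nn.Linear(8, 8)  # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-3)  # Doge-120M-MoE peak LR
scheduler = LambdaLR(optimizer, lr_lambda=wsd_lambda(1600, 12800, 1600))
```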
Here are the initial learning rates required to continue training at each checkpoint:
- Doge-20M: 8e-3
- Doge-20M-MoE: 8e-3
- Doge-60M: 6e-3
- Doge-120M-MoE: 6e-3
- Doge-160M: 4e-3
- Doge-480M-MoE: 4e-3
- Doge-320M: 2e-3
- Doge-1.4B-MoE: 2e-3
| Model | Learning Rate | Schedule | Warmup Steps | Stable Steps |
|---|---|---|---|---|
| Doge-20M | 8e-3 | wsd_scheduler | 800 | 6400 |
| Doge-20M-MoE | 8e-3 | wsd_scheduler | 800 | 6400 |
| Doge-60M | 6e-3 | wsd_scheduler | 1600 | 12800 |
| Doge-120M-MoE | 6e-3 | wsd_scheduler | 1600 | 12800 |
| Doge-160M | 4e-3 | wsd_scheduler | 2400 | 19200 |
| Doge-480M-MoE | 4e-3 | wsd_scheduler | 2400 | 19200 |
| Doge-320M | 2e-3 | wsd_scheduler | 3200 | 25600 |
| Doge-1.4B-MoE | 2e-3 | wsd_scheduler | 3200 | 25600 |
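To continue pretraining from this checkpoint, load it and restore the stable-stage hyperparameters from the table. The sketch below is a hedged example: the repo id is assumed from this card's title, and `trust_remote_code=True` is assumed because Doge ships custom modeling code.

```python
# Sketch of resuming continued pretraining from this stable-stage checkpoint.
# The repo id is assumed from this card's title (verify it on the Hub), and
# trust_remote_code=True is assumed because Doge ships custom modeling code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "SmallDoge/Doge-120M-MoE-checkpoint"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True)

# Per the table, Doge-120M-MoE continues from a 6e-3 peak learning rate
# (1600 warmup steps, 12800 stable steps under wsd_scheduler).
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-3)
# Attach the WSD schedule sketched above, then run the training loop.
```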