# megalodon-200m: minipile
Small pretraining experiment:
- 8192-token context, approx. 1 epoch (sequence packing sketched after this list)
- codebase: https://github.com/pszemraj/megalodon/tree/dataload-fixes
- training logs
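As a rough illustration of the data path, the sketch below packs tokenized minipile documents into fixed 8192-token training sequences. The dataset ID `JeanKaddour/minipile`, the `pack` helper, and the byte-level placeholder tokenizer are all assumptions for illustration; the run itself used the dataloader in the linked fork.

```python
# Minimal sketch of sequence packing to the 8192-token context.
# Assumptions: minipile is pulled from the Hugging Face hub as
# "JeanKaddour/minipile", and `tokenize` stands in for whatever produced
# the 20480-entry vocabulary; the real loader lives in the linked fork.
from typing import Callable, Iterable, Iterator

CTX_LEN = 8192  # context length used in this run


def pack(
    docs: Iterable[str],
    tokenize: Callable[[str], list[int]],
    ctx_len: int = CTX_LEN,
) -> Iterator[list[int]]:
    """Concatenate tokenized documents and yield fixed-length blocks."""
    buf: list[int] = []
    for doc in docs:
        buf.extend(tokenize(doc))
        while len(buf) >= ctx_len:
            yield buf[:ctx_len]   # one full training sequence
            buf = buf[ctx_len:]   # carry the remainder into the next block


if __name__ == "__main__":
    from datasets import load_dataset  # pip install datasets

    ds = load_dataset("JeanKaddour/minipile", split="train", streaming=True)
    docs = (row["text"] for row in ds)

    def byte_tokenize(s: str) -> list[int]:
        return list(s.encode("utf-8"))  # placeholder, not the real tokenizer

    first = next(pack(docs, byte_tokenize))
    print(len(first))  # -> 8192
```

Packing (rather than padding) keeps every position of the 8192-token context filled; with the chunk size of 2048 listed below, each packed sequence spans four chunks.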
## Model Configuration

The hyperparameters used (restated as a dataclass sketch after this list):
- Number of Layers: 12
- Model Dimension: 1024
- Z Dimension: 256
- Value Dimension: 2048
- Number of Heads: 1
- FFN Hidden Dimension: 2560
- CEMA NDIM: 16
- Chunk Size: 2048
- Efficient Attention: None
- Initialization Mode: He
- Vocabulary Size: 20480
- Output Size: 20480
- Normalization Groups: 32
- Normalization Affine: True
- Normalization Epsilon: 1e-05
- RoPE Base: None
- Dropout: 0.0
- Hidden Dropout: 0.0
- Attention Dropout: 0.0
- SwiGLU: False
- Rescale NFFN: False
- Scale Embedding: False
- Share Embedding: False
- Layerwise Checkpointing: False
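For reference, here is the same configuration as a self-contained Python dataclass. This is a sketch only: the field names are illustrative and need not match the fork's actual config keys, but every value comes from the list above.

```python
# Sketch: the configuration above as a dataclass. Field names are
# illustrative; they do not necessarily match the fork's config keys.
from dataclasses import dataclass


@dataclass(frozen=True)
class MegalodonConfig:
    num_layers: int = 12
    model_dim: int = 1024
    z_dim: int = 256
    value_dim: int = 2048
    num_heads: int = 1
    ffn_hidden_dim: int = 2560
    cema_ndim: int = 16
    chunk_size: int = 2048              # 8192-token contexts split into 4 chunks
    efficient_attention: str | None = None
    init_mode: str = "he"
    vocab_size: int = 20480
    output_size: int = 20480
    norm_groups: int = 32
    norm_affine: bool = True
    norm_eps: float = 1e-5
    rope_base: float | None = None
    dropout: float = 0.0
    hidden_dropout: float = 0.0
    attention_dropout: float = 0.0
    swiglu: bool = False
    rescale_nffn: bool = False
    scale_embedding: bool = False
    share_embedding: bool = False       # input/output embeddings are separate
    layerwise_checkpointing: bool = False


cfg = MegalodonConfig()
assert cfg.model_dim % cfg.norm_groups == 0  # model_dim must split evenly into norm groups
```

Since `share_embedding` is false, the input and output embedding tables each contribute 20480 × 1024 ≈ 21M parameters, so roughly a fifth of the ~200M total sits in the embeddings.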