megalodon-200m: minipile

A small pretraining experiment on the minipile dataset.

Model Configuration

  • Number of Layers: 12
  • Model Dimension: 1024
  • Z Dimension: 256
  • Value Dimension: 2048
  • Number of Heads: 1
  • FFN Hidden Dimension: 2560
  • CEMA NDIM: 16
  • Chunk Size: 2048
  • Efficient Attention: None
  • Initialization Mode: He
  • Vocabulary Size: 20480
  • Output Size: 20480
  • Normalization Groups: 32
  • Normalization Affine: True
  • Normalization Epsilon: 1e-05
  • ROPE Base: None
  • Dropout: 0.0
  • Hidden Dropout: 0.0
  • Attention Dropout: 0.0
  • SWIGLU: False
  • Rescale NFFN: False
  • Scale Embedding: False
  • Share Embedding: False
  • Layerwise Checkpointing: False
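For reference, the hyperparameters above can be collected into a single config object. The sketch below uses a hypothetical `MegalodonConfig` dataclass; the field names are illustrative and do not come from the official megalodon codebase, while the values are taken directly from the list above.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical config container; field names are illustrative, not the
# official megalodon API. Values mirror the model card's configuration list.
@dataclass
class MegalodonConfig:
    num_layers: int = 12                   # Number of Layers
    model_dim: int = 1024                  # Model Dimension
    z_dim: int = 256                       # Z Dimension
    value_dim: int = 2048                  # Value Dimension
    num_heads: int = 1                     # Number of Heads
    ffn_hidden_dim: int = 2560             # FFN Hidden Dimension
    cema_ndim: int = 16                    # CEMA NDIM
    chunk_size: int = 2048                 # Chunk Size
    efficient_attn: Optional[str] = None   # Efficient Attention
    init_mode: str = "he"                  # Initialization Mode
    vocab_size: int = 20480                # Vocabulary Size
    output_size: int = 20480               # Output Size
    norm_groups: int = 32                  # Normalization Groups
    norm_affine: bool = True               # Normalization Affine
    norm_eps: float = 1e-5                 # Normalization Epsilon
    rope_base: Optional[float] = None      # ROPE Base
    dropout: float = 0.0
    hidden_dropout: float = 0.0
    attention_dropout: float = 0.0
    swiglu: bool = False
    rescale_nffn: bool = False
    scale_embedding: bool = False
    share_embedding: bool = False
    layerwise_checkpointing: bool = False

config = MegalodonConfig()
print(config.model_dim, config.chunk_size)  # 1024 2048
```

Instantiating `MegalodonConfig()` with no arguments reproduces the settings above; training code could then consume the object field by field.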