moe100m-physics-tinybpe
A ~100M-active Qwen3-style sparse-MoE language model trained from scratch on physics-simulation next-frame-prediction text. It uses a custom 512-token ByteLevel-BPE tokenizer whose vocabulary is simulation-only (digits, punctuation, structural keywords). Built autonomously by the ml-intern Claude Code skill.
Model
- Active params: 92.8M | Total: 246.2M
- d_model=640, n_layers=14, GQA 10q/2kv head_dim 64, partial RoPE(32)
- MoE: 8 routed + 1 shared SwiGLU experts, top-2, aux-loss-free sigmoid-bias router
- Tied embeddings, RMSNorm, QK-Norm, fp32 router, Liger fused-CE
- max_seq_len 1024, vocab 512
- Optimizer: Muon (matrices) + AdamW (rest), cosine LR, fp16 (V100)
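To make the "aux-loss-free sigmoid-bias router" entry concrete, here is a minimal pure-Python sketch of one common form of that technique. It is an assumption about the general scheme, not a copy of `model.py`: the function names, the `step_size` value, and the sign-based bias update are illustrative. The per-expert bias only affects which experts are selected; the mixing weights come from the unbiased sigmoid scores. The shared expert is always active and is therefore not routed here.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def route_token(logits, bias, top_k=2):
    """Pick top_k routed experts for one token.

    Selection ranks experts by sigmoid(logit) + bias. The returned mixing
    weights use the unbiased sigmoid scores, renormalised over the chosen
    experts so they sum to 1.
    """
    scores = [sigmoid(z) for z in logits]
    biased = [s + b for s, b in zip(scores, bias)]
    chosen = sorted(range(len(scores)), key=lambda e: biased[e], reverse=True)[:top_k]
    total = sum(scores[e] for e in chosen)
    return {e: scores[e] / total for e in chosen}


def update_bias(bias, counts, step_size=1e-3):
    """Raise the bias of under-loaded experts and lower it for over-loaded ones.

    This replaces an auxiliary load-balancing loss; step_size is a
    hypothetical value chosen for illustration.
    """
    mean_load = sum(counts) / len(counts)
    return [
        b + step_size * ((c < mean_load) - (c > mean_load))
        for b, c in zip(bias, counts)
    ]


# Toy batch: two tokens, 8 routed experts, router logits made up for the demo.
n_experts = 8
bias = [0.0] * n_experts
batch_logits = [
    [2.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [1.5, 2.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
]
counts = [0] * n_experts
for logits in batch_logits:
    for expert in route_token(logits, bias):
        counts[expert] += 1
bias = update_bias(bias, counts)
```

In the toy batch both tokens pick experts 0 and 1, so their biases are nudged down and the idle experts' biases are nudged up, which makes the idle experts more likely to be selected on later steps.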
Training
- Data: AlexWortega/physics-scenarios-packed (24 trained scenario types, interleaved)
- Frame descriptions reduced to a 3-keyword controlled set (in motion / settling / at rest)
- tokens seen: 1.4e+08 (planned 7e+08)
- final train loss: 1.7107 | best eval loss: 1.6926
- wall: 0.20 GPU-h on 1x V100 (eva01)
Eval: Pymunk position error (% of scene diagonal), greedy autoregressive rollout
| set | @15f |
|---|---|
| trained (all 30 scenes) | 5.548% |
| held-out (all) | 6.753% |
| trained, fittable (<=12 obj) | 1.649% |
| held-out, fittable | 2.524% |
Baseline (fine-tuned LFM2-350M, bf16, 8192 ctx): @15f trained 0.38% / held-out 0.93%; orbit 0.75% @80f.
NOTE: this model uses max_seq_len=1024 vs the baseline's 8192. Scenes with more than ~12 objects cannot fit a full frame in the generation budget, so the fittable rows (<=12 objects) are the fair comparison. The model generates well-formed physics frames (Frame N: obj_i: pos/vel) and is ~3-5x less precise than the larger 8192-ctx LFM2 baseline.
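The metric in the table, position error as a percentage of the scene diagonal, can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the input layout, and the choice to average over every object in every compared frame are hypothetical. The actual evaluation script may aggregate differently, for example by reporting only frame 15.

```python
import math


def position_error_pct(pred, truth, scene_w, scene_h):
    """Mean Euclidean position error as a percentage of the scene diagonal.

    pred and truth are lists of frames; each frame is a list of (x, y)
    positions, one per object, in the same object order.
    """
    diagonal = math.hypot(scene_w, scene_h)
    errors = [
        math.hypot(px - tx, py - ty)
        for pred_frame, true_frame in zip(pred, truth)
        for (px, py), (tx, ty) in zip(pred_frame, true_frame)
    ]
    return 100.0 * sum(errors) / (len(errors) * diagonal)


# Toy example: a 300 x 400 scene has a diagonal of 500 units.
# One object is off by 5 units and the other is exact.
pred = [[(10.0, 0.0), (0.0, 0.0)]]
truth = [[(13.0, 4.0), (0.0, 0.0)]]
error = position_error_pct(pred, truth, scene_w=300.0, scene_h=400.0)
```

Normalising by the diagonal makes scenes of different sizes comparable, which is why the table can average across scenario types.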
Training note (honest)
Training diverged reproducibly at ~140M tokens (an intrinsic fp16+Muon weight instability at eval loss ~1.69, confirmed across peak_lr 6e-4/3e-4/2e-4 and two data seeds). The published checkpoint is the best clean one (step 17000, eval 1.693); eval loss had been on a plateau since ~step 7000. See POSTMORTEM.md.
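One way to pick a "best clean" checkpoint from an eval log is sketched below. This is a hypothetical selection rule, not the procedure recorded in POSTMORTEM.md: the function name, the `spike_factor` threshold, and the toy loss values are all made up for illustration and are not taken from train.log or eval.log.

```python
import math


def best_clean_checkpoint(history, spike_factor=1.5):
    """Return the (step, eval_loss) pair with the lowest loss before divergence.

    history is a list of (step, eval_loss) in step order. Divergence is
    flagged at the first non-finite loss, or at the first loss exceeding
    spike_factor times the best loss seen so far.
    """
    best = None
    for step, loss in history:
        if not math.isfinite(loss):
            break
        if best is not None and loss > spike_factor * best[1]:
            break
        if best is None or loss < best[1]:
            best = (step, loss)
    return best


# Toy history (synthetic values): loss improves, then spikes, then goes NaN.
toy_history = [(1, 2.0), (2, 1.8), (3, 1.7), (4, 5.0), (5, float("nan"))]
chosen = best_clean_checkpoint(toy_history)
```

Stopping the scan at the first spike means a checkpoint saved after the instability begins is never selected, even if a later eval happens to look low.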
VERIFY gates (4/6 pass; gates 4 data-consumption + 5 abort fail due to the divergence above; documented, not masked)
- 1_generation_sanity: PASS
- 2_loss_sanity: PASS
- 3_eval_tracks_train: PASS
- 4_data_consumption: FAIL
- 5_stderr_scan: FAIL
- 6_param_count: PASS
Files
model.py (+ optim/) defines MoEModel; config.json has the trained hyperparameters; tokenizer.json is the tiny-BPE tokenizer; train.log, eval.log, VERIFY.md, and EVAL_RESULTS.json are the full run record.