moe100m-physics-tinybpe

A ~100M-active Qwen3-style sparse-MoE language model trained from scratch on physics-simulation next-frame-prediction text, with a custom 512-token ByteLevel-BPE whose vocabulary is simulation-only (digits, punctuation, structural keywords). Built autonomously by the ml-intern Claude Code skill.

Model

  • Active params: 92.8M | Total: 246.2M
  • d_model=640, n_layers=14, GQA 10q/2kv head_dim 64, partial RoPE(32)
  • MoE: 8 routed + 1 shared SwiGLU experts, top-2, aux-loss-free sigmoid-bias router
  • Tied embeddings, RMSNorm, QK-Norm, fp32 router, Liger fused-CE
  • max_seq_len 1024, vocab 512
  • Optimizer: Muon (matrices) + AdamW (rest), cosine LR, fp16 (V100)
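
For readers unfamiliar with the routing scheme named above, here is a minimal pure-Python sketch of one common form of an aux-loss-free sigmoid-bias top-2 router. It is an illustration under stated assumptions, not the code in model.py: the function names `route` and `update_bias`, the bias step size, and the choice to normalize gate weights over the selected experts are all assumptions. The key idea is that the per-expert bias influences which experts are selected but not the gate weights, and the bias is nudged between steps to even out load instead of adding an auxiliary loss.

```python
import math

def route(logits, bias, top_k=2):
    """Select top_k routed experts for one token.

    logits: raw router outputs, one per routed expert.
    bias:   per-expert balancing bias, used for selection only (assumed form).
    Returns a list of (expert_index, gate_weight) pairs.
    """
    # Sigmoid affinity per expert (instead of a softmax over experts).
    scores = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    # The bias shifts the ranking but is not part of the gate weight.
    biased = [s + b for s, b in zip(scores, bias)]
    chosen = sorted(range(len(scores)), key=lambda i: biased[i], reverse=True)[:top_k]
    # Assumption: gate weights are the unbiased scores, normalized over the chosen experts.
    total = sum(scores[i] for i in chosen)
    return [(i, scores[i] / total) for i in chosen]

def update_bias(bias, tokens_per_expert, step=0.001):
    """Nudge biases toward balanced load; `step` is a hypothetical value."""
    mean_load = sum(tokens_per_expert) / len(tokens_per_expert)
    new_bias = []
    for b, load in zip(bias, tokens_per_expert):
        if load > mean_load:
            new_bias.append(b - step)   # overloaded: make selection less likely
        elif load < mean_load:
            new_bias.append(b + step)   # underloaded: make selection more likely
        else:
            new_bias.append(b)
    return new_bias

# Example with 8 routed experts and top-2 selection, matching the card's MoE shape.
example_logits = [2.0, 1.0, 0.0, -1.0, 0.5, 0.2, -0.3, 0.1]
print(route(example_logits, [0.0] * 8))
```

The shared expert is not routed at all in this scheme: every token passes through it, and its output is added to the weighted sum of the two selected routed experts.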

Training

  • Data: AlexWortega/physics-scenarios-packed (24 trained scenario types, interleaved)
  • Frame descriptions reduced to a 3-keyword controlled set (in motion / settling / at rest)
  • tokens seen: 1.4e+08 (planned 7e+08)
  • final train loss: 1.7107 | best eval loss: 1.6926
  • wall: 0.20 GPU-h on 1x V100 (eva01)

Eval: Pymunk position error (% of scene diagonal), greedy autoregressive rollout

set                            @15f
trained (all 30 scenes)        5.548%
held-out (all)                 6.753%
trained, fittable (<=12 obj)   1.649%
held-out, fittable             2.524%
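
To make the metric concrete, the sketch below shows one plausible way to compute a position error as a percentage of the scene diagonal. The exact aggregation used for the numbers above (per object, per frame, mean vs. median) is not stated here, so the mean over objects, the function name `position_error_pct`, and the 800x600 scene size are all assumptions.

```python
import math

def position_error_pct(pred, ref, scene_width, scene_height):
    """Mean Euclidean position error as a percentage of the scene diagonal.

    pred, ref: lists of (x, y) positions for the same objects, in the same order.
    Assumption: errors are averaged over objects; the real eval may aggregate differently.
    """
    if not pred or len(pred) != len(ref):
        raise ValueError("pred and ref must be non-empty and the same length")
    diagonal = math.hypot(scene_width, scene_height)
    mean_err = sum(math.dist(p, r) for p, r in zip(pred, ref)) / len(pred)
    return 100.0 * mean_err / diagonal

# Hypothetical 800x600 scene (diagonal 1000) with two objects at frame 15.
predicted = [(103.0, 104.0), (200.0, 200.0)]
reference = [(100.0, 100.0), (200.0, 200.0)]
print(position_error_pct(predicted, reference, 800, 600))
```

Normalizing by the diagonal makes errors comparable across scenes of different sizes.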

Baseline (fine-tuned LFM2-350M, bf16, 8192 ctx): @15f trained 0.38% / held-out 0.93%; orbit 0.75% @80f.

NOTE: this model uses max_seq_len=1024 vs the baseline's 8192. Scenes with more than ~12 objects cannot fit a full frame in the generation budget, so the fittable rows are the fair comparison. The model generates well-formed physics frames (Frame N: obj_i: pos/vel) and is ~3-5x less precise than the larger 8192-ctx LFM2 baseline.
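
The precision gap can be checked directly from the figures reported above. This short calculation divides the fittable-row errors by the baseline errors; note that the baseline numbers are used as reported and are not split by fittability, so the comparison is approximate.

```python
# @15f position error (% of scene diagonal), copied from this card.
ours_fittable = {"trained": 1.649, "held-out": 2.524}
baseline = {"trained": 0.38, "held-out": 0.93}

# Ratio > 1 means this model is less precise than the baseline.
ratios = {name: ours_fittable[name] / baseline[name] for name in ours_fittable}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x the baseline error")
```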

Training note (honest)

Training diverged reproducibly at ~140M tokens (an intrinsic fp16+Muon weight instability at eval-loss ~1.69; confirmed across peak_lr 6e-4/3e-4/2e-4 and two data seeds). The published checkpoint is the best clean one (step 17000, eval 1.693); eval loss had been flat at roughly that level since ~step 7000. See POSTMORTEM.md.
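
As an illustration of what "best clean checkpoint" can mean in practice, the sketch below picks the lowest-eval-loss checkpoint recorded before a divergence. It is not the selection code used for this run: the function name `best_clean_checkpoint`, the `spike_factor` threshold, and every value in `example_history` except the step-17000 entry are made up for illustration and are not taken from train.log or eval.log.

```python
import math

def best_clean_checkpoint(history, spike_factor=1.5):
    """Return the (step, eval_loss) with the lowest loss before divergence.

    history: list of (step, eval_loss) pairs in step order.
    A non-finite loss, or a loss above spike_factor times the best loss so far,
    is treated as the start of divergence (a hypothetical rule).
    """
    best = None
    for step, loss in history:
        if not math.isfinite(loss):
            break
        if best is not None and loss > spike_factor * best[1]:
            break
        if best is None or loss < best[1]:
            best = (step, loss)
    return best

# Illustrative values only; just the step-17000 entry matches this card.
example_history = [
    (7000, 1.700),
    (12000, 1.695),
    (17000, 1.693),
    (18000, 4.200),
    (19000, float("nan")),
]
print(best_clean_checkpoint(example_history))
```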

VERIFY gates (4/6 pass; gates 4 data-consumption + 5 abort fail due to the divergence above; documented, not masked)

  • 1_generation_sanity: PASS
  • 2_loss_sanity: PASS
  • 3_eval_tracks_train: PASS
  • 4_data_consumption: FAIL
  • 5_stderr_scan: FAIL
  • 6_param_count: PASS
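
The gate results above can be tallied mechanically. This is a small sketch that hard-codes the statuses from the list; the on-disk layout of VERIFY.md is not described here, so nothing is parsed from it.

```python
# Gate statuses copied from the list above.
gates = {
    "1_generation_sanity": "PASS",
    "2_loss_sanity": "PASS",
    "3_eval_tracks_train": "PASS",
    "4_data_consumption": "FAIL",
    "5_stderr_scan": "FAIL",
    "6_param_count": "PASS",
}

passed = [name for name, status in gates.items() if status == "PASS"]
failed = [name for name, status in gates.items() if status != "PASS"]
print(f"{len(passed)}/{len(gates)} pass; failed: {', '.join(failed)}")
```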

Files

model.py (+ optim/) defines MoEModel; config.json has the trained hyperparameters; tokenizer.json is the tiny-BPE; train.log, eval.log, VERIFY.md, and EVAL_RESULTS.json are the full run record.
