delphi-9e18-669Mparams-2.3Btokens

A 669M-parameter base model from the Delphi scaling suite. Trained at 9 × 10¹⁸ FLOPs on 2.3B tokens with the Delphi recipe.

About Delphi

Delphi is the Marin team's first open scaling suite, inspired by Pythia. It has three parts:

  • a scaling recipe that maps compute budgets to model configurations,
  • a scaling suite of models trained from that recipe at IsoFLOP budgets from 3 × 10¹⁸ to 1 × 10²³ FLOPs, and
  • a scaling law that uses the smaller Delphi models to predict the larger ones.

A pre-registered forecast from that scaling law predicted the final loss of the largest Delphi run (1 × 10²³ FLOPs, 25B parameters, 600B tokens) to within 0.2%, using 300× less compute than the training run itself. The same process forecasts downstream benchmarks (MMLU, HumanEval, and GSM8K) via a two-step regression that combines compute and observational scaling laws.

See "Scaling Laws That Extrapolate 300× Past the Fit" for the recipe, fit, and downstream-eval projections. The full set of Delphi checkpoints — IsoFLOP grid points, held-out optima at 1e21/1e22/1e23 with multiple random seeds, and training intermediates — lives on marin-community on the Hub.

This is a research artifact, not a production model.

Model details

Architecture: Qwen 3 (pre-norm decoder, RMSNorm, RoPE, QK-norm with learned scaling, SwiGLU MLPs)
Parameters: 669,160,448
Hidden size: 1280
Layers: 13
Attention heads: 10
KV heads: 10 (no GQA)
Head dim: 128
FFN intermediate: 5120 (MLP ratio 4)
Vocab size: 128,256 (Llama 3 tokenizer)
Max sequence length: 4096
Position encoding: RoPE (θ = 500000, Llama 3-style scaling)
Bias terms: None
Tied embeddings: No
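
The parameter count can be reproduced from the other rows of the table. The sketch below assumes the usual Qwen 3 layer layout (Q/K/V/O projections, gate/up/down SwiGLU projections, two RMSNorm scale vectors per layer, one QK-norm scale vector of size head dim each for Q and K, and a final RMSNorm); the constant and function names are illustrative and not taken from the Marin code.

```python
# Values copied from the model details table.
HIDDEN = 1280
LAYERS = 13
HEADS = 10
KV_HEADS = 10
HEAD_DIM = 128
FFN = 5120
VOCAB = 128_256


def count_parameters() -> int:
    """Count weights under the assumed Qwen 3 layout (no biases, untied embeddings)."""
    # Input embedding and output head are separate matrices (tied embeddings: No).
    embeddings = 2 * VOCAB * HIDDEN
    # Q and O projections span all attention heads; K and V span the KV heads.
    q_and_o = 2 * HIDDEN * HEADS * HEAD_DIM
    k_and_v = 2 * HIDDEN * KV_HEADS * HEAD_DIM
    # Assumption: QK-norm has one learned scale vector of size HEAD_DIM for Q and one for K.
    qk_norm = 2 * HEAD_DIM
    # SwiGLU MLP: gate, up, and down projections.
    mlp = 3 * HIDDEN * FFN
    # Two RMSNorm scale vectors per layer (before attention and before the MLP).
    layer_norms = 2 * HIDDEN
    per_layer = q_and_o + k_and_v + qk_norm + mlp + layer_norms
    final_norm = HIDDEN
    return embeddings + LAYERS * per_layer + final_norm


print(count_parameters())  # 669160448
```

Roughly half of the total (328,335,360 weights) sits in the two vocabulary matrices, which is typical for a small model paired with a 128K-entry tokenizer.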

Training

Compute: 9 × 10¹⁸ FLOPs
Tokens: 2,336,227,328
Steps: 35,647
Sequence length: 4096
Optimizer: AdamH (Adam with Hyperball)
Recipe: Delphi (Complete(d)P-style scaling with (T₀/T)^0.3 token-horizon LR correction)
LR schedule: WSD: 10% linear warmup, 20% linear decay, 0 floor
Precision: f32 master params, bf16 compute
Parallelism: FSDP
Data mixture: Nemotron-CC + StarCoderData + ProofPile 2
Tokenizer: Llama 3 (vocab 128,256)
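
As a rough sanity check on these numbers, the sketch below derives the implied batch size, estimates training compute with the common 6·N·D rule of thumb, and writes the WSD schedule as a step-dependent multiplier. The 6·N·D approximation is an assumption here (Delphi's own FLOP accounting may differ), and the schedule's rounding and endpoint handling are guesses rather than the recipe's exact implementation; all names are illustrative.

```python
# Values copied from the model details and training tables.
PARAMS = 669_160_448
TOKENS = 2_336_227_328
STEPS = 35_647
SEQ_LEN = 4096


def sequences_per_step() -> float:
    """Average number of 4096-token sequences consumed per optimizer step."""
    return TOKENS / SEQ_LEN / STEPS


def approx_train_flops() -> float:
    """Assumption: the 6 * N * D rule of thumb, not Delphi's exact FLOP accounting."""
    return float(6 * PARAMS * TOKENS)


def wsd_multiplier(step: float, total_steps: int = STEPS,
                   warmup_frac: float = 0.10, decay_frac: float = 0.20) -> float:
    """Learning-rate multiplier in [0, 1]: linear warmup, flat plateau, linear decay to 0."""
    warmup_steps = warmup_frac * total_steps
    decay_start = (1.0 - decay_frac) * total_steps
    if step < warmup_steps:
        return step / warmup_steps
    if step > decay_start:
        return max(0.0, (total_steps - step) / (total_steps - decay_start))
    return 1.0


print(f"sequences per step ~ {sequences_per_step():.2f}")
print(f"6ND estimate ~ {approx_train_flops():.2e} FLOPs")
print([round(wsd_multiplier(s), 3) for s in (0, 1_000, 17_000, 32_000, STEPS)])
```

The token count divides evenly by the sequence length, and the per-step average comes out close to 16 sequences, which suggests (but does not confirm) a batch size of 16. The 6·N·D estimate lands slightly above the nominal 9 × 10¹⁸ budget, so treat it as an order-of-magnitude check only.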

AdamH (Adam with Hyperball) constrains every projection weight to stay on the Frobenius-norm sphere it was initialized on, so weight decay has nothing to regularize away and drops out of the recipe. A Complete(d)P-style transfer rule with a (T₀/T)^0.3 correction adjusts the learning rate as the token horizon T grows. Reference constants: B₀ = 64, T₀ = 2.5 B tokens, η₀ = 0.00630, η₀,Adam = 0.000656, ε₀ = 1.85 × 10⁻⁸. Recipe code: experiments/scaling_law_sweeps/completed_adamh.py.
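
The two ideas in this paragraph can be illustrated in a few lines. This is a minimal sketch, not the code in completed_adamh.py: `horizon_lr_factor` computes only the (T₀/T)^0.3 term (how it combines with η₀ and the rest of the transfer rule is not shown), and `project_to_sphere` shows only the norm constraint by rescaling a matrix back to a fixed Frobenius norm, which is an assumed way of enforcing it rather than AdamH's actual update. All function names are hypothetical.

```python
import math

T0_TOKENS = 2.5e9        # reference token horizon T0 from the card
HORIZON_EXPONENT = 0.3   # exponent of the token-horizon correction


def horizon_lr_factor(tokens: float) -> float:
    """The (T0 / T)^0.3 factor: above 1 for shorter horizons, below 1 for longer ones."""
    return (T0_TOKENS / tokens) ** HORIZON_EXPONENT


def frobenius_norm(matrix: list[list[float]]) -> float:
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(x * x for row in matrix for x in row))


def project_to_sphere(matrix: list[list[float]], radius: float) -> list[list[float]]:
    """Rescale the matrix so its Frobenius norm equals `radius` (assumed projection)."""
    scale = radius / frobenius_norm(matrix)
    return [[x * scale for x in row] for row in matrix]


# Token horizon of this checkpoint, slightly shorter than T0.
print(horizon_lr_factor(2_336_227_328))

# A toy weight: record the norm at initialization, perturb it, then project back.
initial = [[3.0, 0.0], [0.0, 4.0]]
radius = frobenius_norm(initial)
after_step = [[3.5, 0.2], [-0.1, 4.4]]
print(frobenius_norm(project_to_sphere(after_step, radius)))
```

Because the norm is pinned to its initial value after every update, there is no growth in weight magnitude for a decay term to counteract, which is the sense in which weight decay drops out.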

Companion releases

  • All Delphi model checkpoints: marin-community on the Hub.
  • Plot data behind every figure in the blog post: marin-community/delphi-blog-data (one config per figure, with wandb_url on every row).
  • Pipelines that deterministically reproduce the training mixture from public Nemotron-CC, StarCoderData, and ProofPile 2 sources: see the Marin repo.

Evaluation

This checkpoint is part of the Delphi eval suite (experiments/exp1337_eval_suite.py), which scores every Delphi run alongside reference open-weights baselines (Qwen 3, Llama 2/3, OLMo 2, Marin 8B). Following the blog's two-step forecast, soft metrics (per-choice log-prob for multiple-choice tasks, bits-per-byte for generative tasks) carry the signal the scaling law is fit on, and a sigmoid fit on an external model pool maps soft metric to hard metric (accuracy, pass@1, exact-match). Below ~1e21 FLOPs the hard metrics stay near chance even when the underlying probabilities are improving smoothly; that is expected and is exactly why the soft metrics exist.
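
To make the soft-versus-hard distinction concrete, the sketch below computes bits-per-byte from a summed negative log-likelihood and applies a sigmoid link from a soft metric to a hard one. The bits-per-byte formula is standard; the sigmoid parameterization (`midpoint`, `slope`, `chance`, `ceiling`) is a hypothetical form chosen for illustration, and the values used are made up rather than fitted on the external model pool.

```python
import math


def bits_per_byte(total_nll_nats: float, num_bytes: int) -> float:
    """Convert a summed negative log-likelihood (in nats) to bits per UTF-8 byte."""
    return total_nll_nats / (math.log(2) * num_bytes)


def soft_to_hard(soft: float, midpoint: float, slope: float,
                 chance: float = 0.0, ceiling: float = 1.0) -> float:
    """Hypothetical sigmoid link: maps a soft metric to a score between chance and ceiling."""
    return chance + (ceiling - chance) / (1.0 + math.exp(-slope * (soft - midpoint)))


# Illustrative parameters only: lower bits-per-byte is better, so the slope is negative.
for bpb in (3.0, 1.0, 0.5, 0.2):
    print(bpb, round(soft_to_hard(bpb, midpoint=0.5, slope=-4.0, chance=0.25), 4))
```

Far from the midpoint the sigmoid is nearly flat, so a sizeable improvement in the soft metric barely moves the hard metric off chance; this is the behavior the paragraph describes for runs below ~1e21 FLOPs.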

Limitations

  • Trained on an English-heavy web mixture; no multilingual coverage.
  • Pretrained-only — no instruction tuning, RLHF, or safety alignment.
  • The Delphi recipe targets compute-optimal training, not inference-cost-aware overtraining; for inference-heavy deployments, an overtrained smaller model may be preferable. The blog's "off-optimal training" section quantifies the penalty.
  • This is one checkpoint in a much larger Delphi release; pick the one that matches your compute / parameter / token regime, or browse the full set at marin-community.

Citation

@misc{held2026delphi,
  title  = {Scaling Laws That Extrapolate 300× Past the Fit},
  author = {Held, Will and {Marin Community}},
  year   = {2026},
  url    = {https://openathena.ai/blog/delphi}
}