# 🏌️ Parameter Golf Submission – tbukuai

Minimize Bits Per Byte (BPB) on the FineWeb validation split under a ~15–16 MB artifact size limit.
| Metric | Value |
|---|---|
| SOTA to beat | 1.1056 BPB (cmpatino-8, 8×H100) |
| Baseline | 1.2244 BPB (9L/512d/SP1024) |
| Our config | SP4096, 11L, 512d, Muon + EMA, depth recurrence, int6 quantization |
## Architecture
- Tokenizer: SP4096 BPE (SentencePiece, vocab=4096)
- Model: 11 layers, 512 dim, 8 heads, 4 KV heads (GQA), MLP×4
- Tied embeddings: shared input/output embedding (saves artifact space)
- Parallel residuals: layers 7–10 (GPT-J/Falcon style)
- Depth recurrence: layers 3–5 looped ×2 extra (activates at 35% of training)
- U-Net skip connections: encoder/decoder halves with learned skip weights
- Activation: LeakyReLU² (negative_slope=0.5; sketched below)
- Logit softcapping: tanh(logits/30) × 30 (Gemma-style)
- Positional encoding: Partial RoPE (25% of head dim)
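
For concreteness, here is a minimal sketch of the two non-standard elementwise pieces above. The exact variants live in `train_gpt.py`; the straightforward composition shown for the activation is an assumption (a sign-preserving square is also plausible):

```python
import torch
import torch.nn.functional as F

def leaky_relu_squared(x: torch.Tensor, negative_slope: float = 0.5) -> torch.Tensor:
    # Assumed form: LeakyReLU followed by an elementwise square.
    # Note the square maps the negative branch to positive values;
    # train_gpt.py may instead use a sign-preserving variant.
    return F.leaky_relu(x, negative_slope=negative_slope).square()

def softcap_logits(logits: torch.Tensor, cap: float = 30.0) -> torch.Tensor:
    # Gemma-style soft capping: smoothly bounds logits to (-cap, cap).
    return cap * torch.tanh(logits / cap)
```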
## Optimizer
- Matrix weights (Q/K/V/O/MLP): Muon optimizer (lr=0.022, momentum=0.95, wd=0.095, 5 Newton-Schulz steps)
- Embeddings (tied): Adam (lr=0.05, fused)
- Scalars/norms: Adam (lr=0.04, fused)
- Schedule: 20-step warmup → trapezoidal plateau → 5000-step warmdown (sketched below)
- EMA: decay=0.9965, starts at step 100
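
A minimal sketch of the schedule, the Newton–Schulz orthogonalization at the core of Muon, and the EMA bookkeeping. It assumes `train_gpt.py` follows the standard public Muon recipe (the quintic coefficients below come from Keller Jordan's modded-nanogpt) and a 30K-iteration default; the in-repo logic is authoritative:

```python
import torch

def lr_scale(step: int, total: int = 30_000, warmup: int = 20, warmdown: int = 5_000) -> float:
    """Trapezoidal schedule: linear warmup -> flat plateau -> linear warmdown to zero."""
    if step < warmup:
        return (step + 1) / warmup
    if step < total - warmdown:
        return 1.0
    return max(0.0, (total - step) / warmdown)

@torch.no_grad()
def newton_schulz5(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Quintic Newton-Schulz iteration that approximately orthogonalizes a
    2-D gradient, as used by public Muon implementations."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.float() / (G.norm() + eps)
    transposed = G.size(0) > G.size(1)
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

class EMAWeights:
    """EMA of model weights; starts late to skip the noisy early phase."""
    def __init__(self, model: torch.nn.Module, decay: float = 0.9965, start_step: int = 100):
        self.decay, self.start_step = decay, start_step
        self.shadow = {k: v.detach().clone().float() for k, v in model.state_dict().items()}

    @torch.no_grad()
    def update(self, model: torch.nn.Module, step: int):
        # Before start_step, d=0 so the shadow just tracks the raw weights.
        d = self.decay if step >= self.start_step else 0.0
        for k, v in model.state_dict().items():
            self.shadow[k].mul_(d).add_(v.float(), alpha=1.0 - d)
```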
## Quantization & Compression
- Weights: int6 GPTQ-style with per-row SDClip (k=12.85 × std; sketched below)
- Embeddings: int8 (more sensitive, k=20.0)
- Small tensors (<65K params): kept as float16
- Control tensors (scales, gains, skip weights): kept as float32
- Compression: Brotli quality=11 (fallback: zlib-9)
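
A minimal sketch of the per-row SDClip step, assuming symmetric uniform quantization; the GPTQ-style error correction and the Brotli packing used in the real pipeline are omitted here:

```python
import torch

def sdclip_quantize(w: torch.Tensor, bits: int = 6, k: float = 12.85):
    """Clip each row to +/- k*std(row), then quantize uniformly to `bits` bits.

    Returns integer codes and per-row scales for dequantization.
    """
    qmax = 2 ** (bits - 1) - 1                # 31 levels each side for int6
    clip = k * w.std(dim=1, keepdim=True)     # per-row clipping threshold
    scale = clip / qmax                       # per-row step size
    q = torch.round(w.clamp(-clip, clip) / scale).to(torch.int8)
    return q, scale

def sdclip_dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale
```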
## Data

- Source: `InAbsentia/parameter-golf-sp4096` on HF Hub
- Train: 106 shards × 190MB each (10B tokens)
- Val: 1 shard (86.7MB)
- Tokenizer: `tokenizers/fineweb_4096_bpe.model` (295.6KB)
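
One way to fetch the shards manually, assuming the data sits in a dataset repo and shard names match a `*train*`/`*val*` pattern (both assumptions; `launch_training_competitive.py` handles the download for you):

```python
from huggingface_hub import snapshot_download

# Hypothetical invocation: repo_type and filename patterns are assumptions.
local_dir = snapshot_download(
    repo_id="InAbsentia/parameter-golf-sp4096",
    repo_type="dataset",
    allow_patterns=["*train*", "*val*"],
)
print("Data downloaded to", local_dir)
```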
## Files

| File | Description |
|---|---|
| `train_gpt.py` | Core training script – model, Muon optimizer, EMA, quantization, BPB eval, checkpoint save/resume |
| `launch_training_competitive.py` | Competitive launcher – downloads data, sets optimized hyperparams, runs torchrun |
| `launch_training.py` | Original launcher (requires private `openai/parameter-golf` repo) |
| `parameter_golf_colab.ipynb` | Colab notebook – train on a free 1×T4 GPU |
| `parameter_golf_kaggle.ipynb` | Kaggle notebook – train on free 2×T4 GPUs (recommended!) |
| `sota_train_gpt.py` | LZMA-compressed submission artifact |
| `claude.md` | Detailed technical notes |
| `roadmap.md` | Experiment plan & hyperparameter tuning roadmap |
## How to Run

### 🥇 Free: Kaggle (2×T4, 30h/week) – Recommended

The best free option: 2 GPUs run ~2× faster, the 30h/week quota is guaranteed, and sessions are stable.
**Quick start:**

- Go to kaggle.com/code → New Notebook
- Upload `parameter_golf_kaggle.ipynb`
- Settings (right panel):
  - Accelerator → GPU T4 ×2
  - Internet → ON
- Add HF token: Add-ons → Secrets → name: `HF_TOKEN`, value: your token from hf.co/settings/tokens
- Run all cells
**What the notebook does:**

- Downloads data (selectable: 10–106 train shards)
- Patches Flash Attention → mem-efficient SDP (T4 compatible; see the sketch after this list)
- Runs `torchrun --nproc_per_node=2` for 2-GPU distributed training
- `grad_accum_steps` auto-adjusts: 8 → 4 (effective batch stays the same)
- Checkpoints to HF Hub every 1000 steps
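
T4s (compute capability 7.5) are not supported by the FlashAttention kernels, so the patch amounts to steering PyTorch's `scaled_dot_product_attention` toward the memory-efficient backend. A minimal sketch of one way to do this; the notebook's actual patch may differ:

```python
import torch

# Disable the FlashAttention SDP backend (unsupported on T4 / sm_75)
# and keep the memory-efficient and math backends available.
torch.backends.cuda.enable_flash_sdp(False)
torch.backends.cuda.enable_mem_efficient_sdp(True)
torch.backends.cuda.enable_math_sdp(True)  # slow fallback, always correct
```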
**Kaggle presets:**
| Preset | Wallclock | Iters | Warmdown | Shards | Est. steps |
|---|---|---|---|---|---|
| Quick test | 600s (10min) | 2000 | 350 | 10 | ~200 |
| 1-hour run | 3600s | 8000 | 1500 | 20 | ~3000 |
| Full 12h session | 39600s (11h) | 30000 | 5000 | 106 | ~20000 |
**Maximizing 30h/week:**

- Run 3 sessions × 10h across the week
- Set `RESUME_FROM_HUB=1` in sessions 2 & 3 to continue from the last checkpoint
- Total: ~30K steps with the full dataset → competitive BPB
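
A minimal sketch of what hub-based resume involves, with hypothetical repo and file names (the real logic lives in `train_gpt.py` behind `RESUME_FROM_HUB`, which also picks the latest checkpoint automatically):

```python
import torch
from huggingface_hub import hf_hub_download

# Hypothetical checkpoint repo and filename, for illustration only.
ckpt_path = hf_hub_download(
    repo_id="your-username/parameter-golf-ckpts",
    filename="checkpoint_latest.pt",
)
state = torch.load(ckpt_path, map_location="cpu")
# Then restore training state, e.g.:
# model.load_state_dict(state["model"])
# optimizer.load_state_dict(state["optimizer"])
```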
### 🆓 Free: Google Colab (1×T4, 16GB)

- Open `parameter_golf_colab.ipynb` in Colab
- Runtime → Change runtime type → T4 GPU
- Run all cells
**Colab presets:**
| Preset | Wallclock | Iters | Warmdown | Purpose |
|---|---|---|---|---|
| Quick test | 600s (10min) | 2000 | 350 | Verify everything works |
| 1-hour run | 3600s | 5000 | 1000 | Get a baseline BPB |
| Max session | 36000s (10h) | 20000 | 3500 | Best from free Colab |
### Colab vs Kaggle

| | Colab Free | Kaggle |
|---|---|---|
| GPUs | 1× T4 | 2× T4 |
| VRAM | 16GB | 16GB × 2 |
| Quota | Throttled, no guarantee | 30h/week guaranteed |
| Session | ~12h (can disconnect) | 12h (stable) |
| Disk | ~100GB shared | 73GB |
| Speed | 1× | ~2× |
| Distributed | ❌ single GPU | ✅ torchrun DDP |
| Best for | Quick tests | Serious training |
### 💻 Multi-GPU (8×H100, 6 hours)

```bash
python launch_training_competitive.py
```

### ⏩ Resume from checkpoint

```bash
RESUME_FROM_HUB=1 python launch_training_competitive.py
```

### 🧪 Quick test (1×GPU, 10 min)

```bash
MAX_WALLCLOCK_SECONDS=600 ITERATIONS=2000 WARMDOWN_ITERS=350 \
python launch_training_competitive.py
```
All hyperparameters are overridable via environment variables (see the `Hyperparameters` class in `train_gpt.py`).
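
The override mechanism presumably looks something like the sketch below; the field names are a hypothetical subset matching the env vars shown above, and the real class in `train_gpt.py` is authoritative:

```python
import os
from dataclasses import dataclass, fields

@dataclass
class Hyperparameters:
    # Hypothetical subset of fields, named after the env vars above.
    max_wallclock_seconds: int = 21_600
    iterations: int = 30_000
    warmdown_iters: int = 5_000

    def __post_init__(self):
        # Any field can be overridden by setting its upper-cased name
        # in the environment, e.g. ITERATIONS=2000.
        for f in fields(self):
            env = os.environ.get(f.name.upper())
            if env is not None:
                setattr(self, f.name, f.type(env))
```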
## Free GPU Resources
| Platform | GPU | VRAM | Free Quota | Session Limit | Best For |
|---|---|---|---|---|---|
| Kaggle | 2× T4 | 16GB each | 30h/week | 12h | 🥇 Best free option |
| Google Colab | T4 | 16GB | Unlimited (throttled) | ~12h | Quick tests |
| Lightning.ai | T4 / A10G | 16–24GB | 22 GPU-hrs/month | Varies | Persistent storage |
| Paperspace Gradient | M4000 | 8GB | Free tier | 6h | Light experiments |
| Oracle Cloud Free | A10G | 24GB | 1 always-free VM | Persistent | Best if you get capacity |
| Google Cloud Trial | T4/V100/A100 | 16–80GB | $300 credit (90 days) | You manage | Competitive runs |
## BPB Metric

BPB = cross_entropy_nats / (ln(2) × avg_bytes_per_token)

Where avg_bytes_per_token is the actual UTF-8 byte count of decoded tokens (accounting for SentencePiece leading-space tokens). Larger vocab → more bytes/token → better BPB ratio.
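
A small sketch of the computation, with illustrative numbers rather than measured values:

```python
import math

def bits_per_byte(ce_nats_per_token: float, avg_bytes_per_token: float) -> float:
    # Convert nats/token to bits/token, then normalize by bytes per token.
    return ce_nats_per_token / (math.log(2) * avg_bytes_per_token)

def avg_bytes_per_token(token_ids, tokenizer) -> float:
    # UTF-8 bytes of the decoded text divided by the token count.
    text = tokenizer.decode(token_ids)
    return len(text.encode("utf-8")) / len(token_ids)

# Illustrative only: 3.0 nats/token at 3.9 bytes/token ~= 1.11 BPB.
print(bits_per_byte(3.0, 3.9))
```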
## Strategy to Beat SOTA

### High-confidence improvements
- Train longer: 6+ hours on 8×H100
- More iterations: 30K steps with 5K warmdown
- Use all 106 train shards: More data = better generalization
### Medium-risk experiments
- Increase recurrence: layers 3–5 × 3 instead of × 2
- Larger MLP: MLP×6 with smaller `model_dim`
- Better warmdown schedule: Cosine instead of linear
### High-risk / high-reward
- SP8192 tokenizer: Better bytes/token ratio, but larger embedding table
- Mixed quantization: int4 for MLP, int6 for attention, int8 for embeddings
- Distillation: train a larger model, then distill it into the size-limited one (see the sketch below)
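
If distillation is attempted, the standard recipe blends a soft KL term against the teacher with the usual hard-label cross-entropy. A minimal sketch; temperature and mixing weight are illustrative, not tuned values from this repo:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      targets: torch.Tensor,
                      temperature: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    """Blend soft-target KL (scaled by T^2) with hard-label cross-entropy.

    Logits are [batch*seq, vocab]; targets are [batch*seq] token ids.
    """
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    hard = F.cross_entropy(student_logits, targets)
    return alpha * soft + (1 - alpha) * hard
```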