# 🏌️ Parameter Golf Submission – tbukuai

Minimize Bits Per Byte (BPB) on the FineWeb validation split under a ~15–16 MB artifact size limit.
| Metric | Value |
|---|---|
| SOTA to beat | 1.1056 BPB (cmpatino-8, 8×H100) |
| Baseline | 1.2244 BPB (9L/512d/SP1024) |
| Our config | SP4096, 11L, 512d, Muon + EMA, depth recurrence, int6 quantization |
## Architecture
- Tokenizer: SP4096 BPE (SentencePiece, vocab=4096)
- Model: 11 layers, 512 dim, 8 heads, 4 KV heads (GQA), MLP×4
- Tied embeddings: shared input/output embedding (saves artifact space)
- Parallel residuals: layers 7–10 (GPT-J/Falcon style)
- Depth recurrence: layers 3–5 looped ×2 extra (activates at 35% of training)
- U-Net skip connections: encoder/decoder halves with learned skip weights
- Activation: LeakyReLU² (negative_slope=0.5; sketched below)
- Logit softcapping: tanh(logits/30) × 30 (Gemma-style)
- Positional encoding: Partial RoPE (25% of head dim)
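
For concreteness, here is a minimal sketch of the two non-standard elementwise pieces above. The exact variants live in `train_gpt.py`; the straightforward composition shown for the activation is an assumption (a sign-preserving square is also plausible):

```python
import torch
import torch.nn.functional as F

def leaky_relu_squared(x: torch.Tensor, negative_slope: float = 0.5) -> torch.Tensor:
    # Assumed form: LeakyReLU followed by an elementwise square.
    # Note the square maps the negative branch to positive values;
    # train_gpt.py may instead use a sign-preserving variant.
    return F.leaky_relu(x, negative_slope=negative_slope).square()

def softcap_logits(logits: torch.Tensor, cap: float = 30.0) -> torch.Tensor:
    # Gemma-style soft capping: smoothly bounds logits to (-cap, cap).
    return cap * torch.tanh(logits / cap)
```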
## Optimizer
- Matrix weights (Q/K/V/O/MLP): Muon optimizer (lr=0.022, momentum=0.95, wd=0.095, 5 Newton-Schulz steps)
- Embeddings (tied): Adam (lr=0.05, fused)
- Scalars/norms: Adam (lr=0.04, fused)
- Schedule: 20-step warmup → trapezoidal plateau → 5000-step warmdown (sketched below)
- EMA: decay=0.9965, starts at step 100
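
A minimal sketch of the schedule, the Newton–Schulz orthogonalization at the core of Muon, and the EMA bookkeeping. It assumes `train_gpt.py` follows the standard public Muon recipe (the quintic coefficients below come from Keller Jordan's modded-nanogpt) and a 30K-iteration default; the in-repo logic is authoritative:

```python
import torch

def lr_scale(step: int, total: int = 30_000, warmup: int = 20, warmdown: int = 5_000) -> float:
    """Trapezoidal schedule: linear warmup -> flat plateau -> linear warmdown to zero."""
    if step < warmup:
        return (step + 1) / warmup
    if step < total - warmdown:
        return 1.0
    return max(0.0, (total - step) / warmdown)

@torch.no_grad()
def newton_schulz5(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Quintic Newton-Schulz iteration that approximately orthogonalizes a
    2-D gradient, as used by public Muon implementations."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.float() / (G.norm() + eps)
    transposed = G.size(0) > G.size(1)
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

class EMAWeights:
    """EMA of model weights; starts late to skip the noisy early phase."""
    def __init__(self, model: torch.nn.Module, decay: float = 0.9965, start_step: int = 100):
        self.decay, self.start_step = decay, start_step
        self.shadow = {k: v.detach().clone().float() for k, v in model.state_dict().items()}

    @torch.no_grad()
    def update(self, model: torch.nn.Module, step: int):
        # Before start_step, d=0 so the shadow just tracks the raw weights.
        d = self.decay if step >= self.start_step else 0.0
        for k, v in model.state_dict().items():
            self.shadow[k].mul_(d).add_(v.float(), alpha=1.0 - d)
```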
## Quantization & Compression
- Weights: int6 GPTQ-style with per-row SDClip (k=12.85 × std; sketched below)
- Embeddings: int8 (more sensitive, k=20.0)
- Small tensors (<65K params): kept as float16
- Control tensors (scales, gains, skip weights): kept as float32
- Compression: Brotli quality=11 (fallback: zlib-9)
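
A minimal sketch of the per-row SDClip step, assuming symmetric uniform quantization; the GPTQ-style error correction and the Brotli packing used in the real pipeline are omitted here:

```python
import torch

def sdclip_quantize(w: torch.Tensor, bits: int = 6, k: float = 12.85):
    """Clip each row to +/- k*std(row), then quantize uniformly to `bits` bits.

    Returns integer codes and per-row scales for dequantization.
    """
    qmax = 2 ** (bits - 1) - 1                # 31 levels each side for int6
    clip = k * w.std(dim=1, keepdim=True)     # per-row clipping threshold
    scale = clip / qmax                       # per-row step size
    q = torch.round(w.clamp(-clip, clip) / scale).to(torch.int8)
    return q, scale

def sdclip_dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale
```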
## Data

- Source: `InAbsentia/parameter-golf-sp4096` on HF Hub
- Train: 106 shards × 190MB each (10B tokens)
- Val: 1 shard (86.7MB)
- Tokenizer: `tokenizers/fineweb_4096_bpe.model` (295.6KB)
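
One way to fetch the shards manually, assuming the data sits in a dataset repo and shard names match a `*train*`/`*val*` pattern (both assumptions; `launch_training_competitive.py` handles the download for you):

```python
from huggingface_hub import snapshot_download

# Hypothetical invocation: repo_type and filename patterns are assumptions.
local_dir = snapshot_download(
    repo_id="InAbsentia/parameter-golf-sp4096",
    repo_type="dataset",
    allow_patterns=["*train*", "*val*"],
)
print("Data downloaded to", local_dir)
```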
## Files

| File | Description |
|---|---|
| `train_gpt.py` | Core training script – model, Muon optimizer, EMA, quantization, BPB eval, checkpoint save/resume |
| `launch_training_competitive.py` | Competitive launcher – downloads data, sets optimized hyperparams, runs torchrun |
| `launch_training.py` | Original launcher (requires private `openai/parameter-golf` repo) |
| `parameter_golf_colab.ipynb` | Colab notebook – train on a free 1×T4 GPU |
| `parameter_golf_kaggle.ipynb` | Kaggle notebook – train on free 2×T4 GPUs (recommended!) |
| `sota_train_gpt.py` | LZMA-compressed submission artifact |
| `claude.md` | Detailed technical notes |
| `roadmap.md` | Experiment plan & hyperparameter tuning roadmap |
## How to Run

### 🥇 Free: Kaggle (2×T4, 30h/week) – Recommended

The best free option: 2 GPUs run ~2× faster, the 30h/week quota is guaranteed, and sessions are stable.
**Quick start:**

- Go to kaggle.com/code → New Notebook
- Upload `parameter_golf_kaggle.ipynb`
- Settings (right panel):
  - Accelerator → GPU T4 ×2
  - Internet → ON
- Add HF token: Add-ons → Secrets → name: `HF_TOKEN`, value: your token from hf.co/settings/tokens
- Run all cells
**What the notebook does:**

- Downloads data (selectable: 10–106 train shards)
- Patches Flash Attention → mem-efficient SDP (T4 compatible; see the sketch after this list)
- Runs `torchrun --nproc_per_node=2` for 2-GPU distributed training
- `grad_accum_steps` auto-adjusts: 8 → 4 (effective batch stays the same)
- Checkpoints to HF Hub every 1000 steps
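
T4s (compute capability 7.5) are not supported by the FlashAttention kernels, so the patch amounts to steering PyTorch's `scaled_dot_product_attention` toward the memory-efficient backend. A minimal sketch of one way to do this; the notebook's actual patch may differ:

```python
import torch

# Disable the FlashAttention SDP backend (unsupported on T4 / sm_75)
# and keep the memory-efficient and math backends available.
torch.backends.cuda.enable_flash_sdp(False)
torch.backends.cuda.enable_mem_efficient_sdp(True)
torch.backends.cuda.enable_math_sdp(True)  # slow fallback, always correct
```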
**Kaggle presets:**
| Preset | Wallclock | Iters | Warmdown | Shards | Est. steps |
|---|---|---|---|---|---|
| Quick test | 600s (10min) | 2000 | 350 | 10 | ~200 |
| 1-hour run | 3600s | 8000 | 1500 | 20 | ~3000 |
| Full 12h session | 39600s (11h) | 30000 | 5000 | 106 | ~20000 |
**Maximizing 30h/week:**

- Run 3 sessions × 10h across the week
- Set `RESUME_FROM_HUB=1` in sessions 2 & 3 to continue from the last checkpoint
- Total: ~30K steps with the full dataset → competitive BPB
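
A minimal sketch of what hub-based resume involves, with hypothetical repo and file names (the real logic lives in `train_gpt.py` behind `RESUME_FROM_HUB`, which also picks the latest checkpoint automatically):

```python
import torch
from huggingface_hub import hf_hub_download

# Hypothetical checkpoint repo and filename, for illustration only.
ckpt_path = hf_hub_download(
    repo_id="your-username/parameter-golf-ckpts",
    filename="checkpoint_latest.pt",
)
state = torch.load(ckpt_path, map_location="cpu")
# Then restore training state, e.g.:
# model.load_state_dict(state["model"])
# optimizer.load_state_dict(state["optimizer"])
```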
### 🆓 Free: Google Colab (1×T4, 16GB)

- Open `parameter_golf_colab.ipynb` in Colab
- Runtime → Change runtime type → T4 GPU
- Run all cells
**Colab presets:**
| Preset | Wallclock | Iters | Warmdown | Purpose |
|---|---|---|---|---|
| Quick test | 600s (10min) | 2000 | 350 | Verify everything works |
| 1-hour run | 3600s | 5000 | 1000 | Get a baseline BPB |
| Max session | 36000s (10h) | 20000 | 3500 | Best from free Colab |
### Colab vs Kaggle

| | Colab Free | Kaggle |
|---|---|---|
| GPUs | 1× T4 | 2× T4 |
| VRAM | 16GB | 16GB × 2 |
| Quota | Throttled, no guarantee | 30h/week guaranteed |
| Session | ~12h (can disconnect) | 12h (stable) |
| Disk | ~100GB shared | 73GB |
| Speed | 1× | ~2× |
| Distributed | ❌ single GPU | ✅ torchrun DDP |
| Best for | Quick tests | Serious training |
### 💻 Multi-GPU (8×H100, 6 hours)

```bash
python launch_training_competitive.py
```

### ⏩ Resume from checkpoint

```bash
RESUME_FROM_HUB=1 python launch_training_competitive.py
```

### 🧪 Quick test (1×GPU, 10 min)

```bash
MAX_WALLCLOCK_SECONDS=600 ITERATIONS=2000 WARMDOWN_ITERS=350 \
python launch_training_competitive.py
```
All hyperparameters are overridable via environment variables (see the `Hyperparameters` class in `train_gpt.py`).
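
The override mechanism presumably looks something like the sketch below; the field names are a hypothetical subset matching the env vars shown above, and the real class in `train_gpt.py` is authoritative:

```python
import os
from dataclasses import dataclass, fields

@dataclass
class Hyperparameters:
    # Hypothetical subset of fields, named after the env vars above.
    max_wallclock_seconds: int = 21_600
    iterations: int = 30_000
    warmdown_iters: int = 5_000

    def __post_init__(self):
        # Any field can be overridden by setting its upper-cased name
        # in the environment, e.g. ITERATIONS=2000.
        for f in fields(self):
            env = os.environ.get(f.name.upper())
            if env is not None:
                setattr(self, f.name, f.type(env))
```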
## Free GPU Resources
| Platform | GPU | VRAM | Free Quota | Session Limit | Best For |
|---|---|---|---|---|---|
| Kaggle | 2× T4 | 16GB each | 30h/week | 12h | 🥇 Best free option |
| Google Colab | T4 | 16GB | Unlimited (throttled) | ~12h | Quick tests |
| Lightning.ai | T4 / A10G | 16–24GB | 22 GPU-hrs/month | Varies | Persistent storage |
| Paperspace Gradient | M4000 | 8GB | Free tier | 6h | Light experiments |
| Oracle Cloud Free | A10G | 24GB | 1 always-free VM | Persistent | Best if you get capacity |
| Google Cloud Trial | T4/V100/A100 | 16–80GB | $300 credit (90 days) | You manage | Competitive runs |
## BPB Metric

BPB = cross_entropy_nats / (ln(2) × avg_bytes_per_token)

Where avg_bytes_per_token is the actual UTF-8 byte count of decoded tokens (accounting for SentencePiece leading-space tokens). Larger vocab → more bytes/token → better BPB ratio.
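
A small sketch of the computation, with illustrative numbers rather than measured values:

```python
import math

def bits_per_byte(ce_nats_per_token: float, avg_bytes_per_token: float) -> float:
    # Convert nats/token to bits/token, then normalize by bytes per token.
    return ce_nats_per_token / (math.log(2) * avg_bytes_per_token)

def avg_bytes_per_token(token_ids, tokenizer) -> float:
    # UTF-8 bytes of the decoded text divided by the token count.
    text = tokenizer.decode(token_ids)
    return len(text.encode("utf-8")) / len(token_ids)

# Illustrative only: 3.0 nats/token at 3.9 bytes/token ~= 1.11 BPB.
print(bits_per_byte(3.0, 3.9))
```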
## Strategy to Beat SOTA

### High-confidence improvements
- Train longer: 6+ hours on 8×H100
- More iterations: 30K steps with 5K warmdown
- Use all 106 train shards: More data = better generalization
### Medium-risk experiments
- Increase recurrence: layers 3–5 × 3 instead of × 2
- Larger MLP: MLP×6 with smaller `model_dim`
- Better warmdown schedule: Cosine instead of linear
### High-risk / high-reward
- SP8192 tokenizer: Better bytes/token ratio, but larger embedding table
- Mixed quantization: int4 for MLP, int6 for attention, int8 for embeddings
- Distillation: train a larger model, then distill it into the size-limited one (see the sketch below)
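
If distillation is attempted, the standard recipe blends a soft KL term against the teacher with the usual hard-label cross-entropy. A minimal sketch; temperature and mixing weight are illustrative, not tuned values from this repo:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      targets: torch.Tensor,
                      temperature: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    """Blend soft-target KL (scaled by T^2) with hard-label cross-entropy.

    Logits are [batch*seq, vocab]; targets are [batch*seq] token ids.
    """
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    hard = F.cross_entropy(student_logits, targets)
    return alpha * soft + (1 - alpha) * hard
```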