Hybrid Stream Mixer 48M — BabyLM 2026 Strict

A ~48M-parameter Hybrid Stream Mixer trained on the BabyLM 2026 Strict 100M-word corpus for 10 epochs. 15 of 16 layers are linear-time stream mixers; a single causal self-attention layer sits at the middle of the stack.

This model succeeds the 26M pure Stream Mixer. In an internal ablation at matched compute, inserting one attention layer reduced validation loss by 0.24 nats (about 21% lower perplexity), while a second attention layer added only ~0.02 nats more. Because the gain saturates, this configuration keeps just one.
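
The perplexity figure follows from perplexity being the exponential of the per-token loss in nats, so a loss change of Δ multiplies perplexity by exp(Δ). A quick check of that arithmetic (the function name is illustrative, not part of the training code):

```python
import math

def perplexity_change(delta_loss_nats: float) -> float:
    """Relative perplexity change for a given change in cross-entropy loss (nats)."""
    # perplexity = exp(loss), so new_ppl / old_ppl = exp(delta_loss)
    return math.exp(delta_loss_nats) - 1.0

first_attention_layer = perplexity_change(-0.24)   # roughly 21% lower perplexity
second_attention_layer = perplexity_change(-0.02)  # roughly 2% lower perplexity
print(f"{first_attention_layer:+.1%}, {second_attention_layer:+.1%}")
```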

Model Details

Architecture: Hybrid Stream Mixer (15 linear-time mixers + 1 causal self-attention)
Parameters: 48,169,440
Layers: 16 (attention at depth 8)
Hidden dim: 512
Streams: 32, stream dim 64, 4 read heads
Attention: 8 heads (head_dim 64), RoPE, causal
Vocab: 16,384 BPE (trained on BabyLM corpus)
Context: 1,024 tokens
Training: 10 epochs, ~1B tokens, WSD schedule, MuonAdamW optimizer

How to Use

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ecreeth/streammixer-48m-babylm")
# The architecture is custom, so the model class is loaded from code in the repository.
model = AutoModelForCausalLM.from_pretrained(
    "ecreeth/streammixer-48m-babylm",
    trust_remote_code=True,
)

# Generate a continuation of up to 50 new tokens and print the decoded text.
inputs = tokenizer("The cat sat on the", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0]))

BabyLM Challenge

Track: Strict (100M words, 10 epochs max)
Eval repo: babylm-eval
Leaderboard: BabyLM-Leaderboard-2026

Results (zero-shot, causal, temperature 1.0)

Task	Prior 26M	48M Hybrid	Δ
BLiMP	62.87	63.07	+0.20
BLiMP supplement	49.64	55.81	+6.17
EWOK	52.76	56.03	+3.27
Entity Tracking	17.90	24.10	+6.20
Comps	52.40	53.47	+1.07
Reading (eye tracking)	0.93	1.88	+0.95
Reading (self-paced)	0.14	0.06	−0.08

BLiMP field breakdown (48M): morphology 71.61, syntax/semantics 65.29, semantics 60.34, syntax 55.88. The remaining BLiMP headroom is concentrated in specific syntactic phenomena (left-branch islands, long-distance wh-extraction), which likely need more parameter scale than this model has.
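
For readers unfamiliar with how minimal-pair benchmarks such as BLiMP are usually scored with a causal model: each item pairs a grammatical sentence with an ungrammatical one, and the model counts as correct when it assigns the grammatical sentence the higher total log-probability. The sketch below illustrates that comparison in plain Python. The function names and the toy logits are assumptions for illustration; this is not the babylm-eval implementation.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def sequence_logprob(logits, token_ids):
    """Sum of log P(token_t | tokens_<t) for t >= 1 under a causal LM.

    logits: one row of vocabulary logits per input position (T rows).
    token_ids: the T input token ids that produced those logits.
    """
    total = 0.0
    # The logits at position t-1 predict the token at position t.
    for t in range(1, len(token_ids)):
        total += log_softmax(logits[t - 1])[token_ids[t]]
    return total

def prefers_grammatical(logits_good, ids_good, logits_bad, ids_bad):
    """True if the grammatical sentence gets the higher total log-probability."""
    return sequence_logprob(logits_good, ids_good) > sequence_logprob(logits_bad, ids_bad)

# Toy example with a 3-token vocabulary (made-up logits, not model output).
toy_logits = [[0.0, 2.0, 0.0], [0.0, 0.0, 2.0], [0.0, 0.0, 0.0]]
good = [0, 1, 2]  # follows the high-logit tokens
bad = [0, 2, 1]   # takes low-logit tokens
print(prefers_grammatical(toy_logits, good, toy_logits, bad))
```

With the model loaded as in How to Use, `model(**inputs).logits[0].tolist()` and `inputs["input_ids"][0].tolist()` would supply the two arguments per sentence, assuming the remote-code model returns the standard `logits` field. Note that the first token is not scored unless a beginning-of-sequence token is prepended.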

Results (fine-tuning, GLUE)

Single-seed (seed=42), 10 epochs per task (WSC: 30), bsz=16/32, lr=3e-5, seq=512, best-validation checkpoint reported.

Task	Prior 26M	48M Hybrid	Δ
BoolQ	63.8	64.4	+0.6
MultiRC	58.5	57.6	−0.9
RTE	61.2	59.7	−1.5
WSC	63.5	67.3	+3.8
MRPC	69.6	70.6	+1.0
QQP	69.6	70.3	+0.7
MNLI	43.6	45.0	+1.4

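5 of 7 tasks improved. The RTE and MultiRC dips (−1.5 and −0.9) are small enough to be plausibly single-seed noise, though one seed cannot confirm that. WSC's +3.8 is the largest gain; a plausible explanation is that coreference benefits from the wider hidden dim and from the longer effective dependency range of the centered attention layer.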

Evaluated with babylm-eval on the strict track. Zero-shot tasks measure linguistic knowledge; fine-tuning tasks measure transfer learning to downstream classification.

Architecture

The Hybrid Stream Mixer combines linear-time stream mixing with a single attention checkpoint:

Input tokens are embedded into the residual stream
Layers 0–7 and 9–15 are Stream Mixer blocks: n_streams parallel content-routed memory streams with diverse learned timescales, mixed via multi-head sigmoid-gated reads with QK-norm. Cost is O(n) in sequence length (constant work per token), with no KV cache.
Layer 8 is a causal self-attention block with RoPE positions, providing global syntactic binding the stream mixer can't express on its own.
Output is projected back to vocabulary logits (weight-tied with the embedding table).
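
The layer layout described above can be written down in a few lines. This is only a sketch of the ordering; the names `layer_types`, `"stream_mixer"`, and `"attention"` are hypothetical labels, not identifiers from the model's code.

```python
N_LAYERS = 16        # from Model Details
ATTENTION_LAYER = 8  # 0-indexed depth of the single attention block

def layer_types(n_layers=N_LAYERS, attention_layer=ATTENTION_LAYER):
    """Return the block type at each depth of the stack."""
    return ["attention" if i == attention_layer else "stream_mixer"
            for i in range(n_layers)]

layout = layer_types()
print(layout.count("stream_mixer"), layout.count("attention"))
```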

Aggregate complexity is O(n) for 15/16 of the network plus a single O(n²) checkpoint, giving most of the wall-clock and memory advantage of a pure linear-time model while restoring the syntactic-binding capability that the ablation showed pure Stream Mixers lacked. KV cache is required only for layer 8; the other 15 layers carry constant-size (B, M, D) state.
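
To put a rough number on the KV-cache saving, the sketch below compares the cache for the single attention layer with a hypothetical 16-layer all-attention stack of the same width. It assumes 2-byte (fp16/bf16) cache entries and 8 key/value heads of dimension 64 (no grouped-query sharing); neither assumption is stated in this card. It does not count the constant-size state of the stream mixer layers.

```python
def kv_cache_bytes(seq_len, n_attention_layers, n_kv_heads=8, head_dim=64,
                   batch_size=1, bytes_per_value=2):
    """Bytes needed to cache keys and values for every attention layer."""
    # Factor 2: one tensor for keys, one for values.
    per_layer = 2 * batch_size * seq_len * n_kv_heads * head_dim * bytes_per_value
    return per_layer * n_attention_layers

hybrid = kv_cache_bytes(seq_len=1024, n_attention_layers=1)
all_attention = kv_cache_bytes(seq_len=1024, n_attention_layers=16)
print(hybrid // 2**20, "MiB vs", all_attention // 2**20, "MiB")
```

Under these assumptions the cache at the full 1,024-token context is 2 MiB per sequence for the hybrid versus 32 MiB for the all-attention baseline, and both grow linearly with context length.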

Training Details

Parameter	Value
Optimizer	MuonAdamW (Muon for 2D weights, AdamW for embeddings/biases)
LR schedule	Warmup-Stable-Decay (85% stable)
Peak LR	5e-3 (muon), 5e-4 (adamw) — same as prior 26M config
Weight decay	0.1
Batch size	128 × 1,024 tokens (auto-scaled to GPU)
Total steps	7,629 (~1B tokens, 20.8× params, close to the Chinchilla heuristic of ~20 tokens per parameter)
GPU	RTX PRO 6000 Blackwell 95GB (~51 min, 355k tok/s sustained)
Final val loss	2.8094 (perplexity 16.60, −15% vs prior 26M's 19.5)
Data cleaning	CHILDES speaker tags, bracket annotations, Wikipedia headers, subtitle formatting, HTML tags filtered
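
The token budget and the schedule shape in the table can be reproduced with a short sketch. The step count, batch size, parameter count, and 85% stable fraction come from this card; the 2% warmup fraction and the linear decay to zero are assumptions made only for illustration, since the card does not state them.

```python
TOTAL_STEPS = 7_629
TOKENS_PER_STEP = 128 * 1_024
N_PARAMS = 48_169_440

total_tokens = TOTAL_STEPS * TOKENS_PER_STEP   # 999,948,288
tokens_per_param = total_tokens / N_PARAMS     # about 20.8

def wsd_lr(step, peak_lr, total_steps=TOTAL_STEPS, warmup_frac=0.02, stable_frac=0.85):
    """Warmup-Stable-Decay learning rate at a given step.

    warmup_frac and the linear decay shape are assumptions for this sketch;
    only the 85% stable fraction comes from the table.
    """
    warmup_steps = int(warmup_frac * total_steps)
    stable_end = warmup_steps + int(stable_frac * total_steps)
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps   # linear warmup
    if step < stable_end:
        return peak_lr                               # constant plateau
    decay_steps = total_steps - stable_end
    return peak_lr * max(0.0, (total_steps - step) / decay_steps)  # linear decay

print(f"{total_tokens:,} tokens, {tokens_per_param:.1f} tokens per parameter")
print(wsd_lr(0, 5e-3), wsd_lr(3000, 5e-3), wsd_lr(7628, 5e-3))
```

The same function would be called with `peak_lr=5e-3` for the Muon parameter group and `peak_lr=5e-4` for the AdamW group.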
