Stream Mixer 26M — BabyLM 2026 Strict

A ~26M-parameter Stream Mixer (linear-time, attention-free) trained on the BabyLM 2026 Strict 100M-word corpus for 10 epochs.

Model Details

Architecture: Stream Mixer (no attention, no KV cache)
Parameters: 26,247,616
Layers: 14
Hidden dim: 384
Streams: 32, stream dim 64, 4 read heads
Vocab: 16,384 BPE (trained on BabyLM corpus)
Context: 1,024 tokens
Training: 10 epochs, ~1B tokens, WSD schedule, MuonAdamW optimizer
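The ~1B token figure can be cross-checked against the batch size and step count listed under Training Details. The sketch below is only a sanity check. It assumes every step uses a full 128 × 1,024 batch (the card notes the batch is auto-scaled to the GPU, so the real per-step count may differ), and the variable names are illustrative.

```python
# Values copied from the Training Details table of this card.
SEQS_PER_BATCH = 128
TOKENS_PER_SEQ = 1_024
TOTAL_STEPS = 7_629
EPOCHS = 10
TRAIN_MINUTES = 39  # approximate; the card says "~39 min"

# ASSUMPTION: every optimizer step consumes one full batch.
tokens_per_step = SEQS_PER_BATCH * TOKENS_PER_SEQ
total_tokens = tokens_per_step * TOTAL_STEPS
tokens_per_epoch = total_tokens / EPOCHS
tokens_per_second = total_tokens / (TRAIN_MINUTES * 60)

print(f"tokens per step:   {tokens_per_step:,}")
print(f"total tokens:      {total_tokens:,}")
print(f"tokens per epoch:  {tokens_per_epoch:,.0f}")
print(f"rough throughput:  {tokens_per_second:,.0f} tokens/s")
```

Under that assumption the total comes to 999,948,288 tokens, which matches the ~1B figure above.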

How to Use

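from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the BPE tokenizer and the model weights from the Hub.
tokenizer = AutoTokenizer.from_pretrained("ecreeth/streammixer-26m-babylm")
# trust_remote_code=True lets transformers run the custom Stream Mixer
# model code shipped in the model repo.
model = AutoModelForCausalLM.from_pretrained(
    "ecreeth/streammixer-26m-babylm",
    trust_remote_code=True,
)

# Encode a prompt and generate up to 50 new tokens.
inputs = tokenizer("The cat sat on the", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0]))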

BabyLM Challenge

Track: Strict (100M words, 10 epochs max)
Eval repo: babylm-eval
Leaderboard: BabyLM-Leaderboard-2026

Results (zero-shot, causal, temperature 1.0)

Task	Score
BLiMP	62.87
EWOK (supplement)	49.64
VQA	52.76
Entity Tracking	17.90
Comps	52.40
Reading (eye tracking)	0.93
Reading (self-paced)	0.14

Results (fine-tuning, (Super)GLUE)

Task	Accuracy
BOOLQ	63.8
MULTIRC	58.5
RTE	61.2
WSC	63.5
MRPC	69.6
QQP	69.6
MNLI	43.6

Evaluated with babylm-eval on the strict track. Zero-shot tasks measure linguistic knowledge; fine-tuning tasks measure transfer learning to downstream classification.

Architecture

The Stream Mixer replaces self-attention with linear-time stream mixing:

Input tokens are embedded and fed into n_streams parallel streams
Streams are mixed via learned query/read-head projections (no attention matrix)
A lightweight feedforward layer processes the mixed streams
Output is projected back to vocabulary logits

This gives O(n) total compute in the sequence length n (vs O(n²) for Transformer self-attention) and needs no KV cache; generation uses model.step() for token-by-token decoding.
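To put the complexity and KV-cache points in numbers, here is a back-of-the-envelope comparison at the full 1,024-token context. It is a sketch under stated assumptions, not a description of the actual implementation. The Transformer baseline (same depth and width, one key and one value vector per token per layer) is hypothetical, and the Stream Mixer state shape (one n_streams × stream_dim state per layer) is assumed from the Model Details list rather than confirmed by the model code. The function names are illustrative.

```python
N_LAYERS = 14
HIDDEN_DIM = 384
N_STREAMS = 32
STREAM_DIM = 64
CONTEXT = 1_024


def causal_attention_pairs(n_tokens: int) -> int:
    # Causal self-attention scores every token against itself and all
    # earlier tokens: 1 + 2 + ... + n = n(n + 1) / 2 pairs per layer.
    return n_tokens * (n_tokens + 1) // 2


def kv_cache_entries(n_tokens: int, n_layers: int, hidden_dim: int) -> int:
    # HYPOTHETICAL baseline: a Transformer of the same depth and width
    # caching one key and one value vector per token per layer.
    return 2 * n_tokens * n_layers * hidden_dim


def stream_state_entries(n_layers: int, n_streams: int, stream_dim: int) -> int:
    # ASSUMPTION: each layer carries one fixed-size state of shape
    # (n_streams, stream_dim); its size does not depend on n_tokens.
    return n_layers * n_streams * stream_dim


pairs = causal_attention_pairs(CONTEXT)
kv = kv_cache_entries(CONTEXT, N_LAYERS, HIDDEN_DIM)
state = stream_state_entries(N_LAYERS, N_STREAMS, STREAM_DIM)

print(f"attention pairs per layer at n={CONTEXT}: {pairs:,}")
print(f"linear-time updates per layer:           {CONTEXT:,}")
print(f"KV cache entries (baseline):             {kv:,}")
print(f"fixed stream state entries (assumed):    {state:,}")
```

The baseline cache grows linearly with every generated token, while the assumed stream state stays the same size however long the sequence gets.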

Training Details

Parameter	Value
Optimizer	MuonAdamW (Muon for 2D weights, AdamW for embeddings/biases)
LR schedule	Warmup-Stable-Decay (85% stable)
Peak LR	5e-3 (muon), 5e-4 (adamw)
Weight decay	0.1
Batch size	128 × 1,024 tokens (auto-scaled to GPU)
Total steps	7,629
GPU	H100 80GB (~39 min)
Val loss	2.97
Data cleaning	CHILDES speaker tags, bracket annotations, Wikipedia headers, subtitle formatting, HTML tags filtered
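For readers unfamiliar with Warmup-Stable-Decay, the schedule's shape can be sketched as below. Only the 85% stable fraction, the step count, and the peak learning rates come from the table. The 2% warmup fraction, the linear warmup, and the linear decay to zero are assumptions made for illustration, and wsd_lr is a hypothetical helper, not the function used in training.

```python
def wsd_lr(step: int, total_steps: int, peak_lr: float,
           warmup_frac: float = 0.02, stable_frac: float = 0.85) -> float:
    """Learning rate at `step` (0-indexed) under a Warmup-Stable-Decay shape.

    ASSUMPTIONS: warmup_frac=0.02, linear warmup, and linear decay over
    the remaining steps are illustrative choices, not values from the card.
    """
    warmup_steps = int(total_steps * warmup_frac)
    stable_end = warmup_steps + int(total_steps * stable_frac)
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    if step < stable_end:
        # Stable phase: hold the peak learning rate.
        return peak_lr
    # Decay phase: linear ramp down over the remaining steps.
    decay_steps = total_steps - stable_end
    return peak_lr * (total_steps - step) / decay_steps


TOTAL_STEPS = 7_629
PEAK_LR_MUON = 5e-3
PEAK_LR_ADAMW = 5e-4

for step in (0, 151, 3_000, 7_000, 7_628):
    muon = wsd_lr(step, TOTAL_STEPS, PEAK_LR_MUON)
    adamw = wsd_lr(step, TOTAL_STEPS, PEAK_LR_ADAMW)
    print(f"step {step:>5}: muon lr = {muon:.2e}, adamw lr = {adamw:.2e}")
```

Both parameter groups follow the same shape here and differ only in their peak value; whether the real run ties the two schedules together this way is not stated in the card.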
