Hybrid Stream Mixer 48M: BabyLM 2026 Strict
A ~48M-parameter Hybrid Stream Mixer trained on the BabyLM 2026 Strict 100M-word corpus for 10 epochs. 15 of 16 layers are linear-time stream mixers; a single causal self-attention layer sits at the middle of the stack.
This is a successor to the 26M pure Stream Mixer. An internal ablation showed that inserting one attention layer lowers validation loss by 0.24 nats (−21% perplexity) at matched compute, while a second layer contributed only ~0.02 nats more. Because the gains saturate, this configuration keeps just one.
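The link between a loss change in nats and a perplexity change is a single exponential, since perplexity is exp(loss). A minimal sketch of that arithmetic follows; the function name is illustrative and not part of the released code.

```python
import math

def perplexity_change(delta_loss_nats: float) -> float:
    """Relative change in perplexity for a change in cross-entropy loss (nats)."""
    # perplexity = exp(loss), so the ratio new/old is exp(delta_loss)
    return math.exp(delta_loss_nats) - 1.0

# Loss deltas quoted in the ablation: -0.24 nats for the first attention
# layer, roughly -0.02 nats more for a second one.
first = perplexity_change(-0.24)
second = perplexity_change(-0.02)
print(f"first attention layer:  {first:+.1%} perplexity")   # -21.3%
print(f"second attention layer: {second:+.1%} perplexity")  # -2.0%
```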
Model Details
- Architecture: Hybrid Stream Mixer (15 linear-time mixers + 1 causal self-attention)
- Parameters: 48,169,440
- Layers: 16 (attention at depth 8)
- Hidden dim: 512
- Streams: 32, stream dim 64, 4 read heads
- Attention: 8 heads (head_dim 64), RoPE, causal
- Vocab: 16,384 BPE (trained on BabyLM corpus)
- Context: 1,024 tokens
- Training: 10 epochs, ~1B tokens, WSD schedule, MuonAdamW optimizer
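The hyperparameters above can be collected into a small config object to check that they are mutually consistent. This is only a sketch: `HybridConfig` and its field names are hypothetical and do not mirror the repository's actual config class, and "depth 8" is read as a 0-indexed layer position.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HybridConfig:  # hypothetical name, not the repo's config class
    n_layers: int = 16
    attention_layer: int = 8   # assumed 0-indexed position of the attention block
    hidden_dim: int = 512
    n_streams: int = 32
    stream_dim: int = 64
    n_read_heads: int = 4
    n_attn_heads: int = 8
    attn_head_dim: int = 64
    vocab_size: int = 16_384
    context_len: int = 1_024

    def layer_types(self) -> list[str]:
        """One label per layer: 'attention' at attention_layer, 'mixer' elsewhere."""
        return ["attention" if i == self.attention_layer else "mixer"
                for i in range(self.n_layers)]

cfg = HybridConfig()

# Attention heads must tile the hidden dimension exactly: 8 * 64 == 512.
assert cfg.n_attn_heads * cfg.attn_head_dim == cfg.hidden_dim

layers = cfg.layer_types()
print(layers.count("mixer"), layers.count("attention"))  # 15 1

# Size of the token embedding table (vocab x hidden).
embedding_params = cfg.vocab_size * cfg.hidden_dim
print(embedding_params)  # 8388608
```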
How to Use
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ecreeth/streammixer-48m-babylm")
# trust_remote_code=True lets transformers load the custom model code shipped with the repo
model = AutoModelForCausalLM.from_pretrained(
    "ecreeth/streammixer-48m-babylm",
    trust_remote_code=True,
)

# Tokenize a prompt and generate up to 50 new tokens
inputs = tokenizer("The cat sat on the", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0]))
BabyLM Challenge
- Track: Strict (100M words, 10 epochs max)
- Eval repo: babylm-eval
- Leaderboard: BabyLM-Leaderboard-2026
Results (zero-shot, causal, temperature 1.0)
| Task | Prior 26M | 48M Hybrid | Δ |
|---|---|---|---|
| BLiMP | 62.87 | 63.07 | +0.20 |
| BLiMP supplement | 49.64 | 55.81 | +6.17 |
| EWOK | 52.76 | 56.03 | +3.27 |
| Entity Tracking | 17.90 | 24.10 | +6.20 |
| Comps | 52.40 | 53.47 | +1.07 |
| Reading (eye tracking) | 0.93 | 1.88 | +0.95 |
| Reading (self-paced) | 0.14 | 0.06 | −0.08 |
BLiMP field breakdown (48M): morphology 71.61, syntax/semantics 65.29, semantics 60.34, syntax 55.88. The remaining BLiMP headroom is concentrated in specific syntactic phenomena (left-branch islands, long-distance wh-extraction) that likely require more parameter scale than this model has.
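BLiMP-style zero-shot accuracy comes from minimal pairs: the model is counted correct when it assigns a higher probability to the grammatical sentence than to its ungrammatical twin. The sketch below shows that scoring rule only. `minimal_pair_accuracy` and `toy_logprob` are hypothetical names, and the toy scorer stands in for a real sentence log-probability computed with the model; it is not how babylm-eval is implemented.

```python
from typing import Callable, Iterable, Tuple

def minimal_pair_accuracy(
    pairs: Iterable[Tuple[str, str]],
    sentence_logprob: Callable[[str], float],
) -> float:
    """Percent of (grammatical, ungrammatical) pairs where the grammatical one scores higher."""
    pairs = list(pairs)
    correct = sum(sentence_logprob(good) > sentence_logprob(bad) for good, bad in pairs)
    return 100.0 * correct / len(pairs)

def toy_logprob(sentence: str) -> float:
    """Hypothetical stand-in scorer: length penalty plus a penalty for 'cats is'."""
    penalty = 5.0 if "cats is" in sentence else 0.0
    return -float(len(sentence.split())) - penalty

pairs = [
    ("The cats are asleep.", "The cats is asleep."),  # toy scorer gets this right
    ("The cat is asleep.", "The cat are asleep."),    # toy scorer ties, so counted wrong
]
acc = minimal_pair_accuracy(pairs, toy_logprob)
print(acc)  # 50.0
```

With a real model, `sentence_logprob` would sum the token log-probabilities of the sentence under the causal LM.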
Results (fine-tuning, GLUE)
Single-seed (seed=42), 10 epochs per task (WSC: 30), bsz=16/32, lr=3e-5, seq=512, best-validation checkpoint reported.
| Task | Prior 26M | 48M Hybrid | Δ |
|---|---|---|---|
| BoolQ | 63.8 | 64.4 | +0.6 |
| MultiRC | 58.5 | 57.6 | −0.9 |
| RTE | 61.2 | 59.7 | −1.5 |
| WSC | 63.5 | 67.3 | +3.8 |
| MRPC | 69.6 | 70.6 | +1.0 |
| QQP | 69.6 | 70.3 | +0.7 |
| MNLI | 43.6 | 45.0 | +1.4 |
5 of 7 tasks improved. The RTE and MultiRC dips are small enough to fall within single-seed noise. WSC's +3.8 is the largest gain; coreference plausibly benefits from the wider hidden dim and the centered attention layer's longer effective dependency range.
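One rough way to judge whether a small accuracy change is meaningful is the binomial standard error of the score itself. The sketch below is an illustration under loud assumptions: the evaluation-set sizes are hypothetical placeholders (substitute the real ones), and sampling error is only a proxy for seed-to-seed variance, which would need multiple runs to measure.

```python
import math

def accuracy_standard_error(acc_percent: float, n_examples: int) -> float:
    """Binomial standard error of an accuracy, in percentage points."""
    p = acc_percent / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_examples)

# (task, 48M accuracy, delta vs 26M, HYPOTHETICAL eval-set size)
dips = [
    ("RTE", 59.7, -1.5, 300),
    ("MultiRC", 57.6, -0.9, 1000),
]
for task, acc, delta, n in dips:
    se = accuracy_standard_error(acc, n)
    print(f"{task}: delta {delta:+.1f}, standard error {se:.2f} points (n={n}, assumed)")
```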
Evaluated with babylm-eval on the strict track. Zero-shot tasks measure linguistic knowledge; fine-tuning tasks measure transfer learning to downstream classification.
Architecture
The Hybrid Stream Mixer combines linear-time stream mixing with a single attention checkpoint:
- Input tokens are embedded into the residual stream
- Layers 0–7 and 9–15 are Stream Mixer blocks: `n_streams` parallel content-routed memory streams with diverse learned timescales, mixed via multi-head sigmoid-gated reads with QK-norm. O(n) in sequence length, no KV cache.
- Layer 8 is a causal self-attention block with RoPE positions, providing global syntactic binding the stream mixer can't express on its own.
- Output is projected back to vocabulary logits (weight-tied with the embedding table).
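To make the idea of memory streams with different timescales concrete, here is a deliberately simplified toy: scalar inputs, fixed per-stream decay rates, and a sigmoid-gated read. It is a generic exponential-decay recurrence used as a stand-in, not the actual Stream Mixer block; content routing, multi-head reads, and QK-norm are all omitted, and every name in it is hypothetical.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def run_toy_streams(inputs, decays, read_gates):
    """Run M exponentially decaying memory streams over a sequence of scalars.

    decays:     per-stream decay a_m in (0, 1); larger means a longer timescale
    read_gates: per-stream gate logits; the read is a sigmoid-gated sum of states
    """
    state = [0.0] * len(decays)   # constant-size state, independent of sequence length
    outputs = []
    for x in inputs:              # one O(M) update per token, so O(n) over the sequence
        state = [a * s + (1.0 - a) * x for a, s in zip(decays, state)]
        outputs.append(sum(sigmoid(g) * s for g, s in zip(read_gates, state)))
    return outputs, state

# A single impulse followed by silence: the fast stream (decay 0.1) forgets it
# almost immediately, while the slow stream (decay 0.9) retains it longer.
outputs, state = run_toy_streams([1.0, 0.0, 0.0, 0.0], decays=[0.1, 0.9], read_gates=[0.0, 0.0])
print(state)
```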
Aggregate complexity is O(n) for 15/16 of the network plus a single O(n²) checkpoint, giving most of the wall-clock and memory advantage of a pure linear-time model while restoring the syntactic-binding capability that the ablation showed pure Stream Mixers lacked. KV cache is required only for layer 8; the other 15 layers carry constant-size (B, M, D) state.
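The inference-memory claim can be illustrated with a rough element count. The sketch assumes that M is the number of streams (32), D is the stream dim (64), and that the attention layer caches one key and one value vector of width 512 per token; actual mixer blocks may hold additional state, so treat the numbers as an order-of-magnitude comparison rather than a measurement.

```python
def kv_cache_elements(seq_len: int, hidden_dim: int = 512, batch: int = 1) -> int:
    """Elements cached by one attention layer: keys and values for every past token."""
    return 2 * batch * seq_len * hidden_dim

def mixer_state_elements(n_mixer_layers: int = 15, n_streams: int = 32,
                         stream_dim: int = 64, batch: int = 1) -> int:
    """Elements in the (B, M, D) states of all mixer layers; independent of seq_len."""
    return n_mixer_layers * batch * n_streams * stream_dim

for n in (128, 1024):
    hybrid = kv_cache_elements(n) + mixer_state_elements()
    all_attention = 16 * kv_cache_elements(n)  # hypothetical 16-layer all-attention stack
    print(f"seq_len={n}: hybrid {hybrid:,} vs all-attention {all_attention:,} elements")
```

Under these assumptions the mixer state stays at 30,720 elements regardless of sequence length, while the single KV cache grows linearly with it.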
Training Details
| Parameter | Value |
|---|---|
| Optimizer | MuonAdamW (Muon for 2D weights, AdamW for embeddings/biases) |
| LR schedule | Warmup-Stable-Decay (85% stable) |
| Peak LR | 5e-3 (muon), 5e-4 (adamw); same as prior 26M config |
| Weight decay | 0.1 |
| Batch size | 128 × 1,024 tokens (auto-scaled to GPU) |
| Total steps | 7,629 (~1B tokens, 20.8× params, roughly Chinchilla-optimal) |
| GPU | RTX PRO 6000 Blackwell 95GB (~51 min, 355k tok/s sustained) |
| Final val loss | 2.8094 (perplexity 16.60, −15% vs prior 26M's 19.5) |
| Data cleaning | CHILDES speaker tags, bracket annotations, Wikipedia headers, subtitle formatting, HTML tags filtered |
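The token budget and the shape of the learning-rate schedule can be sketched from the table. Only the step count, batch shape, peak LR, parameter count, and the 85% stable share come from this card; the 2% warmup share, the linear warmup, and the linear decay to zero are assumptions made for illustration, and `wsd_lr` is a hypothetical helper rather than the training code.

```python
def wsd_lr(step: int, total_steps: int = 7_629, peak_lr: float = 5e-3,
           warmup_frac: float = 0.02, stable_frac: float = 0.85) -> float:
    """Warmup-Stable-Decay: linear warmup, flat plateau, linear decay toward zero.

    warmup_frac and both linear ramps are assumptions, not documented values.
    """
    warmup_steps = int(warmup_frac * total_steps)
    decay_start = warmup_steps + int(stable_frac * total_steps)
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        return peak_lr
    remaining = total_steps - decay_start
    return peak_lr * max(0.0, (total_steps - step) / remaining)

# Token budget: steps x sequences per batch x tokens per sequence.
tokens = 7_629 * 128 * 1_024
params = 48_169_440
print(tokens)                     # 999948288
print(round(tokens / params, 1))  # 20.8

for step in (0, 3_000, 7_000, 7_628):
    print(step, wsd_lr(step))
```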