Stream Mixer 26M - BabyLM 2026 Strict
A ~26M-parameter Stream Mixer (linear-time, attention-free) trained on the BabyLM 2026 Strict 100M-word corpus for 10 epochs.
Model Details
- Architecture: Stream Mixer (no attention, no KV cache)
- Parameters: 26,247,616
- Layers: 14
- Hidden dim: 384
- Streams: 32, stream dim 64, 4 read heads
- Vocab: 16,384 BPE (trained on BabyLM corpus)
- Context: 1,024 tokens
- Training: 10 epochs, ~1B tokens, WSD schedule, MuonAdamW optimizer
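As a sanity check on the "~1B tokens" figure, the token budget can be recomputed from the step count and batch size reported under Training Details. This is a minimal sketch: the variable names are illustrative, and it assumes the effective batch stayed at 128 sequences of 1,024 tokens for every step and that the 10 epochs were equal in length.

```python
# Figures taken from the Training Details table.
total_steps = 7_629
sequences_per_step = 128
tokens_per_sequence = 1_024
epochs = 10

# Tokens consumed by one optimizer step (assumes a constant effective batch).
tokens_per_step = sequences_per_step * tokens_per_sequence

# Total training tokens, and the per-epoch share if epochs are equal in length.
total_tokens = total_steps * tokens_per_step
tokens_per_epoch = total_tokens / epochs

print(f"tokens per step:  {tokens_per_step:,}")
print(f"total tokens:     {total_tokens:,}")
print(f"tokens per epoch: {tokens_per_epoch:,.0f}")
```

Under those assumptions the total comes to 999,948,288 tokens, which matches the rounded "~1B tokens" above.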
How to Use
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and the model; the architecture is custom, so the
# model code is fetched from the repo via trust_remote_code.
tokenizer = AutoTokenizer.from_pretrained("ecreeth/streammixer-26m-babylm")
model = AutoModelForCausalLM.from_pretrained(
    "ecreeth/streammixer-26m-babylm",
    trust_remote_code=True,
)

# Tokenize a prompt and generate up to 50 new tokens.
inputs = tokenizer("The cat sat on the", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0]))
BabyLM Challenge
- Track: Strict (100M words, 10 epochs max)
- Eval repo: babylm-eval
- Leaderboard: BabyLM-Leaderboard-2026
Results (zero-shot, causal, temperature 1.0)
| Task | Score |
|---|---|
| BLiMP | 62.87 |
| EWOK (supplement) | 49.64 |
| VQA | 52.76 |
| Entity Tracking | 17.90 |
| Comps | 52.40 |
| Reading (eye tracking) | 0.93 |
| Reading (self-paced) | 0.14 |
Results (fine-tuning, (Super)GLUE)
| Task | Accuracy |
|---|---|
| BOOLQ | 63.8 |
| MULTIRC | 58.5 |
| RTE | 61.2 |
| WSC | 63.5 |
| MRPC | 69.6 |
| QQP | 69.6 |
| MNLI | 43.6 |
Evaluated with babylm-eval on the strict track. Zero-shot tasks measure linguistic knowledge; fine-tuning tasks measure transfer learning to downstream classification.
Architecture
The Stream Mixer replaces self-attention with linear-time stream mixing:
- Input tokens are embedded and fed into n_streams parallel streams
- Streams are mixed via learned query/read-head projections (no attention matrix)
- A lightweight feedforward layer processes the mixed streams
- Output is projected back to vocabulary logits
This gives O(n) cost in sequence length (vs O(n²) for Transformer self-attention) and requires no KV
cache; generation uses model.step() for token-by-token decoding.
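To make the scaling claim concrete, the sketch below counts work and decoding memory at the model's 1,024-token context. It is a back-of-the-envelope illustration, not the model's actual implementation. The function names are hypothetical, the comparison Transformer is an imagined model with the same depth (14) and width (384), and the per-layer stream state is assumed to be exactly n_streams × stream_dim values, which this card does not confirm.

```python
def causal_attention_pairs(n: int) -> int:
    """Query-key pairs scored by one causal self-attention layer over n tokens."""
    return n * (n + 1) // 2


def recurrent_updates(n: int) -> int:
    """State updates made by a linear-time layer over n tokens (one per token)."""
    return n


context = 1_024
layers, hidden = 14, 384
n_streams, stream_dim = 32, 64

pairs = causal_attention_pairs(context)
updates = recurrent_updates(context)

# Hypothetical same-size Transformer: keys and values cached per layer, per token.
kv_cache_values = 2 * layers * context * hidden

# ASSUMPTION: the Stream Mixer keeps one (n_streams x stream_dim) state per layer,
# independent of how many tokens have been processed.
stream_state_values = layers * n_streams * stream_dim

print(f"attention pairs per layer:   {pairs:,}")
print(f"recurrent updates per layer: {updates:,}")
print(f"KV cache values at n={context}: {kv_cache_values:,}")
print(f"fixed stream state values:   {stream_state_values:,}")
```

The attention pair count grows quadratically with context length while the update count grows linearly; under the stated assumption, the decoding state also stays constant instead of growing with every generated token.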
Training Details
| Parameter | Value |
|---|---|
| Optimizer | MuonAdamW (Muon for 2D weights, AdamW for embeddings/biases) |
| LR schedule | Warmup-Stable-Decay (85% stable) |
| Peak LR | 5e-3 (muon), 5e-4 (adamw) |
| Weight decay | 0.1 |
| Batch size | 128 × 1,024 tokens (auto-scaled to GPU) |
| Total steps | 7,629 |
| GPU | H100 80GB (~39 min) |
| Val loss | 2.97 |
| Data cleaning | CHILDES speaker tags, bracket annotations, Wikipedia headers, subtitle formatting, HTML tags filtered |
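The Warmup-Stable-Decay schedule in the table can be sketched as a simple piecewise function. This is a minimal illustration rather than the training script's code: the function name is hypothetical, the 2% warmup fraction and the linear decay to zero are assumptions (the table only states that 85% of training is spent in the stable phase), and the defaults use the Muon peak LR and total step count reported above.

```python
def wsd_lr(step: int,
           total_steps: int = 7_629,
           peak_lr: float = 5e-3,
           warmup_frac: float = 0.02,   # ASSUMPTION: not stated in the table
           stable_frac: float = 0.85) -> float:
    """Learning rate at `step` under a Warmup-Stable-Decay schedule."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    stable_steps = int(total_steps * stable_frac)
    decay_start = warmup_steps + stable_steps
    decay_steps = max(1, total_steps - decay_start)

    if step < warmup_steps:
        # Linear warmup from near zero up to the peak.
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        # Stable phase: hold the peak learning rate.
        return peak_lr
    # ASSUMPTION: linear decay to zero over the remaining steps.
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / decay_steps


for s in (0, 100, 3_000, 7_000, 7_629):
    print(s, wsd_lr(s))
```

The same function would cover the AdamW parameter group by passing peak_lr=5e-4.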