Stream Mixer 26M: BabyLM 2026 Strict

A ~26M-parameter Stream Mixer (linear-time, attention-free) trained on the BabyLM 2026 Strict 100M-word corpus for 10 epochs.

Model Details

  • Architecture: Stream Mixer (no attention, no KV cache)
  • Parameters: 26,247,616
  • Layers: 14
  • Hidden dim: 384
  • Streams: 32, stream dim 64, 4 read heads
  • Vocab: 16,384 BPE (trained on BabyLM corpus)
  • Context: 1,024 tokens
  • Training: 10 epochs, ~1B tokens, WSD schedule, MuonAdamW optimizer
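
As a rough sanity check on the numbers above, the hyperparameters can be collected into a small config object. This is only a sketch: the class and field names are hypothetical (they are not taken from the model's code), and the token embedding table is assumed to have shape vocab × hidden dim. Whether the output projection shares those weights is not stated here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamMixerConfig:
    # Field names are hypothetical; the values come from the Model Details list.
    n_params: int = 26_247_616
    n_layers: int = 14
    hidden_dim: int = 384
    n_streams: int = 32
    stream_dim: int = 64
    n_read_heads: int = 4
    vocab_size: int = 16_384
    context_length: int = 1_024


cfg = StreamMixerConfig()

# Assumption: one embedding row of size hidden_dim per vocabulary entry.
embedding_params = cfg.vocab_size * cfg.hidden_dim  # 6,291,456
# Total width across all streams.
stream_width = cfg.n_streams * cfg.stream_dim       # 2,048
embedding_share = embedding_params / cfg.n_params

print(f"token embedding table: {embedding_params:,} parameters")
print(f"share of total parameters: {embedding_share:.1%}")
print(f"combined stream width: {stream_width:,}")
```

Under that assumption, the embedding table alone accounts for roughly a quarter of the parameter budget, which is typical for models of this size with a 16K vocabulary.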

How to Use

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ecreeth/streammixer-26m-babylm")
model = AutoModelForCausalLM.from_pretrained(
    "ecreeth/streammixer-26m-babylm",
    trust_remote_code=True,
)

inputs = tokenizer("The cat sat on the", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0]))

BabyLM Challenge

Results (zero-shot, causal, temperature 1.0)

| Task | Score |
|---|---|
| BLiMP | 62.87 |
| EWOK (supplement) | 49.64 |
| VQA | 52.76 |
| Entity Tracking | 17.90 |
| Comps | 52.40 |
| Reading (eye tracking) | 0.93 |
| Reading (self-paced) | 0.14 |

Results (fine-tuning, (Super)GLUE)

| Task | Accuracy |
|---|---|
| BoolQ | 63.8 |
| MultiRC | 58.5 |
| RTE | 61.2 |
| WSC | 63.5 |
| MRPC | 69.6 |
| QQP | 69.6 |
| MNLI | 43.6 |

Evaluated with babylm-eval on the strict track. Zero-shot tasks measure linguistic knowledge; fine-tuning tasks measure transfer learning to downstream classification.
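
To illustrate what a zero-shot score such as BLiMP measures: minimal-pair benchmarks are commonly scored by checking whether a causal LM assigns a higher log-probability to the grammatical sentence than to its ungrammatical twin. The sketch below shows that comparison in plain Python. It is not the babylm-eval code; the `next_token_logprobs` callable and the toy bigram table are hypothetical stand-ins for a real model.

```python
import math


def sentence_logprob(tokens, next_token_logprobs):
    """Sum log P(token_t | tokens before t) over a sentence.

    `next_token_logprobs(prefix)` is a hypothetical callable that returns a
    dict mapping each candidate next token to its log-probability.
    """
    total = 0.0
    for t in range(1, len(tokens)):
        total += next_token_logprobs(tokens[:t])[tokens[t]]
    return total


def minimal_pair_accuracy(pairs, next_token_logprobs):
    """Percentage of (good, bad) pairs where the good sentence scores higher."""
    correct = sum(
        sentence_logprob(good, next_token_logprobs)
        > sentence_logprob(bad, next_token_logprobs)
        for good, bad in pairs
    )
    return 100.0 * correct / len(pairs)


# Toy bigram "model" (made-up probabilities, for illustration only).
BIGRAMS = {
    "the": {"cat": 0.5, "cats": 0.5},
    "cat": {"sleeps": 0.9, "sleep": 0.1},
    "cats": {"sleep": 0.9, "sleeps": 0.1},
}


def toy_logprobs(prefix):
    # Condition only on the last token of the prefix.
    return {tok: math.log(p) for tok, p in BIGRAMS[prefix[-1]].items()}


pairs = [
    (["the", "cats", "sleep"], ["the", "cats", "sleeps"]),
    (["the", "cat", "sleeps"], ["the", "cat", "sleep"]),
]
accuracy = minimal_pair_accuracy(pairs, toy_logprobs)
print(f"minimal-pair accuracy: {accuracy:.2f}")
```

With a real model, `next_token_logprobs` would wrap a forward pass and a log-softmax over the vocabulary. Chance level on such pairs is 50.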

Architecture

The Stream Mixer replaces self-attention with linear-time stream mixing:

  1. Input tokens are embedded and fed into n_streams parallel streams
  2. Streams are mixed via learned query/read-head projections (no attention matrix)
  3. A lightweight feedforward layer processes the mixed streams
  4. Output is projected back to vocabulary logits

This gives O(n) time in the sequence length n (vs O(n²) for Transformer self-attention) and needs no KV cache; generation uses model.step() for token-by-token decoding.
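
The decoding pattern described here can be sketched with a toy model. Everything in the block is an assumption made for illustration: `ToyStepModel` is a hypothetical stand-in rather than the Stream Mixer, its update rule is arbitrary, and the real signature of model.step() is not documented in this card. The point is only that the state has a fixed size and each new token costs one `step` call, with no cache that grows with the sequence.

```python
class ToyStepModel:
    """Hypothetical recurrent LM stand-in; NOT the Stream Mixer itself."""

    def __init__(self, vocab_size, n_streams=4):
        self.vocab_size = vocab_size
        # Fixed-size state: its length never depends on how many tokens were seen.
        self.state = [0.0] * n_streams

    def step(self, token_id):
        """Consume one token, update the state in place, return next-token logits."""
        for i in range(len(self.state)):
            # Arbitrary toy update: decay the old state and add the new token.
            self.state[i] = 0.5 * self.state[i] + float((token_id + i) % self.vocab_size)
        target = int(sum(self.state)) % self.vocab_size
        # Toy logits: highest at `target`, decreasing with distance from it.
        return [-abs(v - target) for v in range(self.vocab_size)]


def greedy_decode(model, prompt_ids, max_new_tokens):
    """Feed the prompt token by token, then generate greedily one step at a time."""
    if not prompt_ids:
        raise ValueError("prompt_ids must contain at least one token")
    logits = None
    for token_id in prompt_ids:
        logits = model.step(token_id)
    generated = []
    for _ in range(max_new_tokens):
        next_id = max(range(len(logits)), key=logits.__getitem__)  # argmax
        generated.append(next_id)
        logits = model.step(next_id)
    return generated


toy = ToyStepModel(vocab_size=16)
new_tokens = greedy_decode(toy, prompt_ids=[1, 2, 3], max_new_tokens=5)
print(new_tokens, "state size:", len(toy.state))
```

A Transformer decoder would instead append keys and values for every processed token, so its memory grows with the sequence while the state above stays constant.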

Training Details

| Parameter | Value |
|---|---|
| Optimizer | MuonAdamW (Muon for 2D weights, AdamW for embeddings/biases) |
| LR schedule | Warmup-Stable-Decay (85% stable) |
| Peak LR | 5e-3 (Muon), 5e-4 (AdamW) |
| Weight decay | 0.1 |
| Batch size | 128 × 1,024 tokens (auto-scaled to GPU) |
| Total steps | 7,629 |
| GPU | H100 80GB (~39 min) |
| Val loss | 2.97 |
| Data cleaning | Removed CHILDES speaker tags, bracket annotations, Wikipedia headers, subtitle formatting, and HTML tags |
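
The token budget and the schedule shape can be checked with a few lines of Python. The arithmetic uses only the values listed above. The schedule function is a sketch under assumptions: "85% stable" is read as the stable phase covering 85% of all steps, the 2% warmup fraction is a hypothetical placeholder, and the decay is assumed to be linear to zero. The actual training code may differ on all three points.

```python
BATCH_SEQUENCES = 128
SEQ_LEN = 1_024
TOTAL_STEPS = 7_629
EPOCHS = 10

tokens_per_step = BATCH_SEQUENCES * SEQ_LEN     # 131,072
total_tokens = tokens_per_step * TOTAL_STEPS    # 999,948,288, i.e. ~1B
tokens_per_epoch = total_tokens / EPOCHS


def wsd_lr(step, total_steps, peak_lr, warmup_frac=0.02, stable_frac=0.85):
    """Warmup-Stable-Decay learning rate at a 0-indexed step.

    warmup_frac is a hypothetical value; linear decay to zero is assumed.
    """
    warmup_steps = int(total_steps * warmup_frac)
    stable_end = warmup_steps + int(total_steps * stable_frac)
    if step < warmup_steps:
        # Linear warmup from near zero up to the peak.
        return peak_lr * (step + 1) / warmup_steps
    if step < stable_end:
        # Stable phase: hold the peak learning rate.
        return peak_lr
    # Decay phase: linear ramp down over the remaining steps.
    decay_steps = total_steps - stable_end
    return peak_lr * (total_steps - step) / decay_steps


print(f"tokens per step: {tokens_per_step:,}")
print(f"total tokens:    {total_tokens:,}")
for s in (0, 3_000, 7_000, TOTAL_STEPS - 1):
    print(f"step {s:>5}: muon lr = {wsd_lr(s, TOTAL_STEPS, 5e-3):.2e}")
```

The product of batch size, sequence length, and step count lands just under one billion tokens, which matches the ~1B tokens over 10 epochs quoted in Model Details.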
