Hybrid Stream Mixer 48M - BabyLM 2026 Strict

A ~48M-parameter Hybrid Stream Mixer trained on the BabyLM 2026 Strict 100M-word corpus for 10 epochs. 15 of 16 layers are linear-time stream mixers; a single causal self-attention layer sits at the middle of the stack.

This is a successor to the 26M pure Stream Mixer. An internal ablation showed that inserting one attention layer lowers validation loss by 0.24 nats (about 21% lower perplexity) at matched compute, while a second attention layer contributed only ~0.02 nats more. Because the gain saturates, this configuration keeps just one.
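The relation between the loss change and the perplexity change quoted above can be checked directly, since perplexity is the exponential of the mean per-token loss in nats. The snippet below is a small sketch of that arithmetic only; the function name is illustrative and not part of the released code.

```python
import math

def perplexity_ratio(delta_loss_nats: float) -> float:
    """Multiplicative change in perplexity for a change in mean loss (nats/token)."""
    return math.exp(delta_loss_nats)

# Loss deltas quoted from the ablation (negative = improvement).
first_layer = perplexity_ratio(-0.24)    # inserting one attention layer
second_layer = perplexity_ratio(-0.02)   # approximate extra gain from a second one

print(f"first attention layer:  {100 * (first_layer - 1):+.1f}% perplexity")   # -21.3%
print(f"second attention layer: {100 * (second_layer - 1):+.1f}% perplexity")  # -2.0%
```

A 0.24-nat drop corresponds to roughly 21% lower perplexity, while the additional 0.02 nats is worth only about 2%.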

Model Details

  • Architecture: Hybrid Stream Mixer (15 linear-time mixers + 1 causal self-attention)
  • Parameters: 48,169,440
  • Layers: 16 (attention at depth 8)
  • Hidden dim: 512
  • Streams: 32, stream dim 64, 4 read heads
  • Attention: 8 heads (head_dim 64), RoPE, causal
  • Vocab: 16,384 BPE (trained on BabyLM corpus)
  • Context: 1,024 tokens
  • Training: 10 epochs, ~1B tokens, WSD schedule, MuonAdamW optimizer
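A quick consistency check on the figures listed above, written as a sketch rather than taken from the training code. It assumes a single shared embedding/output matrix of shape vocab × hidden (the output projection is described as weight-tied in the Architecture section); the variable names are illustrative.

```python
# Figures copied from the model details list.
total_params = 48_169_440
vocab_size = 16_384
hidden_dim = 512
n_heads, head_dim = 8, 64

# The attention heads should tile the hidden dimension exactly.
assert n_heads * head_dim == hidden_dim

# Assumption: one weight-tied embedding matrix of shape (vocab_size, hidden_dim).
embedding_params = vocab_size * hidden_dim
embedding_share = embedding_params / total_params

print(f"embedding parameters: {embedding_params:,}")      # 8,388,608
print(f"share of total:       {embedding_share:.1%}")     # 17.4%
```

Under that assumption, roughly one sixth of the parameter budget sits in the token embeddings, leaving about 39.8M parameters for the 16 layers.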

How to Use

from transformers import AutoModelForCausalLM, AutoTokenizer

# The architecture is custom code shipped with the checkpoint,
# so loading the model requires trust_remote_code=True.
tokenizer = AutoTokenizer.from_pretrained("ecreeth/streammixer-48m-babylm")
model = AutoModelForCausalLM.from_pretrained(
    "ecreeth/streammixer-48m-babylm",
    trust_remote_code=True,
)

# Tokenize a prompt and generate up to 50 new tokens.
inputs = tokenizer("The cat sat on the", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0]))

BabyLM Challenge

Results (zero-shot, causal, temperature 1.0)

Task                     Prior 26M   48M Hybrid   Δ
BLiMP                    62.87       63.07        +0.20
BLiMP supplement         49.64       55.81        +6.17
EWOK                     52.76       56.03        +3.27
Entity Tracking          17.90       24.10        +6.20
Comps                    52.40       53.47        +1.07
Reading (eye tracking)   0.93        1.88         +0.95
Reading (self-paced)     0.14        0.06         −0.08

BLiMP field breakdown (48M): morphology 71.61, syntax/semantics 65.29, semantics 60.34, syntax 55.88. The remaining BLiMP headroom is concentrated in specific syntactic phenomena (left-branch islands, long-distance wh-extraction) that require parameter scale this model doesn't have.
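For readers unfamiliar with how BLiMP-style scores are produced: each item is a minimal pair, and the model counts as correct when it assigns a higher score (typically the summed token log-probability) to the acceptable sentence. The sketch below illustrates only that scoring rule. The `toy_score` function is a hypothetical stand-in for a real model scorer, and nothing here is taken from the babylm-eval code.

```python
from typing import Callable, Iterable, Tuple

def minimal_pair_accuracy(
    pairs: Iterable[Tuple[str, str]],
    score: Callable[[str], float],
) -> float:
    """Percentage of (acceptable, unacceptable) pairs where the
    acceptable sentence receives the strictly higher score."""
    pairs = list(pairs)
    correct = sum(score(good) > score(bad) for good, bad in pairs)
    return 100.0 * correct / len(pairs)

def toy_score(sentence: str) -> float:
    # Hypothetical scorer: a length penalty plus a penalty for one known
    # agreement error. A real scorer would sum the model's token log-probs.
    penalty = 5.0 if "cats sits" in sentence else 0.0
    return -len(sentence.split()) - penalty

pairs = [
    ("The cat sits on the mat.", "The cats sits on the mat."),  # toy scorer gets this right
    ("The dogs bark.", "The dogs barks."),                      # tie, so counted as wrong
]
accuracy = minimal_pair_accuracy(pairs, toy_score)
print(accuracy)  # 50.0
```

Because each item is a binary choice, chance level is 50, which is the baseline against which the field scores above should be read.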

Results (fine-tuning, GLUE)

Single-seed (seed=42), 10 epochs per task (WSC: 30), bsz=16/32, lr=3e-5, seq=512, best-validation checkpoint reported.

Task      Prior 26M   48M Hybrid   Δ
BoolQ     63.8        64.4         +0.6
MultiRC   58.5        57.6         −0.9
RTE       61.2        59.7         −1.5
WSC       63.5        67.3         +3.8
MRPC      69.6        70.6         +1.0
QQP       69.6        70.3         +0.7
MNLI      43.6        45.0         +1.4

5 of 7 tasks improved. RTE / MultiRC dips are within single-seed noise. WSC's +3.8 is the largest gain: coreference benefits from the wider hidden dim and the centered attention layer's longer effective dependency range.

Evaluated with babylm-eval on the strict track. Zero-shot tasks measure linguistic knowledge; fine-tuning tasks measure transfer learning to downstream classification.

Architecture

The Hybrid Stream Mixer combines linear-time stream mixing with a single attention checkpoint:

  1. Input tokens are embedded into the residual stream
  2. Layers 0–7 and 9–15 are Stream Mixer blocks: n_streams parallel content-routed memory streams with diverse learned timescales, mixed via multi-head sigmoid-gated reads with QK-norm. Cost is O(n) in sequence length (constant work per token), with no KV cache.
  3. Layer 8 is a causal self-attention block with RoPE positions, providing global syntactic binding the stream mixer can't express on its own.
  4. Output is projected back to vocabulary logits (weight-tied with the embedding table).
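The layer layout described in the steps above can be written down as a short sketch. The function and the string labels are illustrative only and do not correspond to names in the released model code; indices are 0-based, matching the list.

```python
from typing import List

N_LAYERS = 16
ATTENTION_LAYER = 8  # 0-indexed position of the single attention block

def layer_kinds(n_layers: int = N_LAYERS, attention_at: int = ATTENTION_LAYER) -> List[str]:
    """Return the block type for each layer index, in stack order."""
    return [
        "attention" if i == attention_at else "stream_mixer"
        for i in range(n_layers)
    ]

layout = layer_kinds()
for index, kind in enumerate(layout):
    print(f"layer {index:2d}: {kind}")
```

With this layout, 8 mixer blocks precede the attention layer and 7 follow it.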

Aggregate complexity is O(n) for 15/16 of the network plus a single O(n²) checkpoint, giving most of the wall-clock and memory advantage of a pure linear-time model while restoring the syntactic-binding capability that the ablation showed pure Stream Mixers lacked. KV cache is required only for layer 8; the other 15 layers carry constant-size (B, M, D) state.
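To give a rough sense of the inference-memory difference, the sketch below counts cached elements at the full 1,024-token context. It rests on two assumptions that the card does not confirm: that (B, M, D) means (batch, n_streams, stream_dim) with M = 32 and D = 64, and that the mixer state is exactly one such tensor per layer. The all-attention baseline is hypothetical and included only for comparison.

```python
def kv_cache_elements(batch: int, seq_len: int, hidden_dim: int, n_attn_layers: int) -> int:
    """Cached keys and values: two (batch, seq_len, hidden_dim) tensors per attention layer."""
    return 2 * batch * seq_len * hidden_dim * n_attn_layers

def mixer_state_elements(batch: int, n_streams: int, stream_dim: int, n_mixer_layers: int) -> int:
    """Assumed recurrent state: one (batch, n_streams, stream_dim) tensor per mixer layer."""
    return batch * n_streams * stream_dim * n_mixer_layers

B, N, HIDDEN = 1, 1024, 512
M, D = 32, 64  # assumption: streams and stream dim from the model details

# Hybrid: one attention layer with a KV cache plus 15 constant-size mixer states.
hybrid = kv_cache_elements(B, N, HIDDEN, 1) + mixer_state_elements(B, M, D, 15)
# Hypothetical baseline: the same 16 layers, all using attention.
all_attention = kv_cache_elements(B, N, HIDDEN, 16)

print(f"hybrid:        {hybrid:,} elements")
print(f"all-attention: {all_attention:,} elements")
print(f"ratio:         {all_attention / hybrid:.1f}x")
```

Under these assumptions the hybrid keeps roughly 15 times fewer cached elements than an all-attention stack at full context, and the gap grows with sequence length because only the layer-8 cache scales with n.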

Training Details

Parameter        Value
Optimizer        MuonAdamW (Muon for 2D weights, AdamW for embeddings/biases)
LR schedule      Warmup-Stable-Decay (85% stable)
Peak LR          5e-3 (muon), 5e-4 (adamw); same as prior 26M config
Weight decay     0.1
Batch size       128 × 1,024 tokens (auto-scaled to GPU)
Total steps      7,629 (~1B tokens, 20.8× params; Chinchilla-optimal)
GPU              RTX PRO 6000 Blackwell 95GB (~51 min, 355k tok/s sustained)
Final val loss   2.8094 (perplexity 16.60, −15% vs prior 26M's 19.5)
Data cleaning    CHILDES speaker tags, bracket annotations, Wikipedia headers, subtitle formatting, HTML tags filtered
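The token budget and the schedule shape in the table can be reproduced with a few lines of arithmetic. The sketch below is not the training script: the 2% warmup fraction, the linear decay to zero, and the reading of "85% stable" as 85% of all steps are assumptions chosen for illustration, and `wsd_lr` is a hypothetical helper.

```python
import math

BATCH_SEQUENCES = 128
SEQ_LEN = 1_024
TOTAL_STEPS = 7_629
N_PARAMS = 48_169_440

tokens_per_step = BATCH_SEQUENCES * SEQ_LEN      # 131,072
total_tokens = tokens_per_step * TOTAL_STEPS     # 999,948,288
tokens_per_param = total_tokens / N_PARAMS       # about 20.8
final_perplexity = math.exp(2.8094)              # about 16.60

def wsd_lr(step: int, peak_lr: float, total_steps: int = TOTAL_STEPS,
           warmup_frac: float = 0.02, stable_frac: float = 0.85) -> float:
    """Warmup-Stable-Decay: linear warmup, flat plateau, linear decay to zero.

    Only stable_frac=0.85 comes from the table; warmup_frac and the
    linear decay shape are assumptions.
    """
    warmup_steps = int(warmup_frac * total_steps)
    stable_end = warmup_steps + int(stable_frac * total_steps)
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    if step < stable_end:
        return peak_lr
    decay_steps = total_steps - stable_end
    return peak_lr * max(0.0, (total_steps - step) / decay_steps)

print(f"{total_tokens:,} tokens, {tokens_per_param:.1f} tokens/param, ppl {final_perplexity:.2f}")
for step in (0, 1_000, 7_000, TOTAL_STEPS):
    print(step, wsd_lr(step, peak_lr=5e-3))
```

The same function would be called with a peak of 5e-4 for the AdamW parameter group if both groups share one schedule shape, which the table does not state.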
