Mnemo
μνήμη — Greek for "memory"
Mnemo is a small attention-free language model with 117M parameters, built on the Stream Mixer architecture — a linear-time recurrent sequence mixer that uses multiple parallel content-routed memory streams instead of self-attention. The name nods to the model's recurrent memory: every layer maintains M parallel state buffers that "remember" content over the entire sequence without quadratic attention.
The training pipeline (data, tokenizer, eval, fine-tuning) is a fork of karpathy/nanochat, with the attention-based GPT replaced by a custom Stream Mixer block.
Quick facts
| Property | Value |
|---|---|
| Architecture | Stream Mixer (linear-time recurrent) |
| Parameters | 117,179,136 |
| Layers | 16 |
| Hidden dim | 768 |
| Memory streams (M) | 48 |
| Stream state dim (D) | 96 |
| Read heads | 6 |
| Context length | 2048 tokens |
| Vocab | 32,768 BPE (GPT-4-style pretokenization) |
| Special tokens | <|bos|>, <|user_start|>, <|user_end|>, <|assistant_start|>, <|assistant_end|> |
| Compute dtype | bf16 (Ampere+) / fp32 (T4/CPU) |
| Base perplexity | 19.47 (0.9011 bits per byte) |
| ChatCORE metric | 22.74% (mean centered accuracy across 5 tasks) |
| SpellingBee accuracy | 94.53% (242/256 on the full test set) |
| License | MIT |
Architecture: Stream Mixer
Mnemo's defining feature is its sequence mixer. Where a Transformer uses self-attention to compute pairwise interactions across tokens (cost: O(T²)), Mnemo uses a chunked parallel scan over M parallel content-routed memory streams (cost: O(T · M · D) — linear in sequence length).
Per token t and per layer:
- Compute value v[t], read query q[t], content router r[t], and per-stream decay α[t].
- Each memory stream s_m updates via s_m[t] = α_m[t] · s_m[t-1] + r_m[t] · v[t].
- A multi-head sigmoid-gated read with QK-norm aggregates from the M streams.
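The update rule above can be sketched in a few lines of plain Python. This is a minimal illustration and not the model source: the function name `stream_step`, the toy sizes, and the hand-picked decay and router values are assumptions (in Mnemo they would be computed from each token's hidden state), and the gated read is left out.

```python
def stream_step(state, alpha, route, value):
    """One recurrent step: s_m[t] = alpha_m[t] * s_m[t-1] + r_m[t] * v[t].

    state: list of M streams, each a list of D floats (the previous state)
    alpha: list of M decay factors, one per stream
    route: list of M router weights (how strongly each stream writes v[t])
    value: list of D floats (v[t])
    """
    return [
        [a * s_d + r * v_d for s_d, v_d in zip(stream, value)]
        for stream, a, r in zip(state, alpha, route)
    ]

# Toy sizes (hypothetical; the quick-facts table lists M=48, D=96).
M, D = 2, 3
state = [[0.0] * D for _ in range(M)]

# Two tokens: the router sends the first to stream 0 and the second to stream 1.
state = stream_step(state, alpha=[0.9, 0.1], route=[1.0, 0.0], value=[1.0, 2.0, 3.0])
state = stream_step(state, alpha=[0.9, 0.1], route=[0.0, 1.0], value=[4.0, 5.0, 6.0])
```

After the second step, stream 0 holds the first value scaled by its decay of 0.9 and stream 1 holds the second value unchanged. The state keeps the same M × D shape no matter how many tokens have been processed.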
The full state across a layer is (B, M, D) — a fixed-size recurrent memory that the model can carry across arbitrary sequence lengths. The chunked scan implementation keeps numerical range bounded even for slow-decay streams.
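To illustrate what a chunked scan computes, here is a minimal scalar sketch, assuming the recurrence has the generic form s[t] = a[t] · s[t-1] + b[t]. The function names and the chunk size are hypothetical, and the inner loop is written sequentially for readability; a real implementation would vectorize the work inside each chunk, and Mnemo's actual kernel may be organized differently.

```python
def sequential_scan(a, b, s0=0.0):
    """Reference: s[t] = a[t] * s[t-1] + b[t], one step at a time."""
    out, s = [], s0
    for a_t, b_t in zip(a, b):
        s = a_t * s + b_t
        out.append(s)
    return out

def chunked_scan(a, b, chunk, s0=0.0):
    """Same recurrence, processed chunk by chunk.

    Inside a chunk, each position tracks the cumulative decay since the
    chunk start and the local state as if the chunk had started from zero.
    Only one carried state crosses each chunk boundary.
    """
    out, carry = [], s0
    for start in range(0, len(a), chunk):
        decay, local = 1.0, 0.0
        for a_t, b_t in zip(a[start:start + chunk], b[start:start + chunk]):
            decay *= a_t                # product of decays since chunk start
            local = a_t * local + b_t   # state as if the chunk began at zero
            out.append(decay * carry + local)
        carry = out[-1]                 # state handed to the next chunk
    return out

a = [0.9, 0.5, 0.99, 0.7, 0.8]
b = [1.0, 2.0, 3.0, 4.0, 5.0]
ref = sequential_scan(a, b)
chk = chunked_scan(a, b, chunk=2)
```

In this sketch the cumulative decay product is reset at every chunk boundary, so it never spans more than `chunk` steps. That is one plausible way a chunked formulation keeps intermediate values in a bounded range.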
For details see the model source.
Training
Pretraining (base model)
| Setting | Value |
|---|---|
| Corpus | karpathy/climbmix-400b-shuffle (88 shards) |
| Total tokens | 5.24B (44.7× over params) |
| Steps | 80,000 × B=32 × T=2048 |
| Optimizer | AdamW (peak LR 1e-3, warmup 500, cosine to 1e-5, weight decay 0.1) |
| Compute | RTX PRO 6000 Blackwell (single GPU, bf16) |
| Wall time | ~9 hours |
| Best val loss | 2.9508 (perplexity ≈ 19.12) |
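The optimizer row describes a warmup followed by cosine decay. A minimal sketch of such a schedule, using the numbers from the table, is shown below. The function name `lr_at` is hypothetical, and the linear shape of the warmup is an assumption; the table only gives its length.

```python
import math

def lr_at(step, peak=1e-3, floor=1e-5, warmup=500, total=80_000):
    """Learning rate at a given step: linear warmup, then cosine decay to `floor`."""
    if step < warmup:
        return peak * (step + 1) / warmup  # assumed linear ramp up to the peak
    progress = min((step - warmup) / max(1, total - warmup), 1.0)
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))

for step in (0, 499, 40_250, 80_000):
    print(step, f"{lr_at(step):.2e}")
```

With these settings the rate reaches the peak at the end of warmup, sits halfway between peak and floor at the midpoint of the decay, and ends at the floor of 1e-5.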
Supervised fine-tuning
| Setting | Value |
|---|---|
| Mixture | SmolTalk + MMLU×3 + ARC×4 + GSM8K×4 + SimpleSpelling + SpellingBee + 1000 Mnemo-branded identity convs |
| Total conversations | ~1.09M |
| Steps | 30,000 × B=8 × T=2048 = ~500M SFT tokens |
| Optimizer | AdamW (peak LR 1e-4, warmup 300) |
| Best val loss | ~1.45 (masked cross-entropy over assistant tokens only) |
| Format | nanochat-style BOS-aligned best-fit packing with padding |
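The token totals in the pretraining and SFT tables follow directly from steps × batch size × sequence length. The short check below reproduces them; the helper name `tokens_seen` is hypothetical. Note that it counts token positions, so for SFT (which packs with padding) it is an upper bound on the tokens that carry content.

```python
def tokens_seen(steps, batch_size, seq_len):
    """Token positions processed = steps x sequences per step x tokens per sequence."""
    return steps * batch_size * seq_len

PARAMS = 117_179_136  # parameter count from the quick-facts table

pretrain = tokens_seen(80_000, 32, 2048)  # 5,242,880,000
sft = tokens_seen(30_000, 8, 2048)        # 491,520,000

print(f"pretrain: {pretrain / 1e9:.2f}B tokens, {pretrain / PARAMS:.1f}x params")
print(f"sft:      {sft / 1e6:.0f}M token positions")
```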
Pipeline
ClimbMix-400B
│
▼
[80k step pretrain on Stream Mixer]
│ best val 2.9508 @ step 79k
▼
Base checkpoint (completes prompts)
│
▼
[30k step SFT on multi-task mixture]
│ best val ~1.45
▼
SFT checkpoint (chat-aware — answers as Mnemo)
Evaluation results
Measured on the full test sets — no subsampling, no cherry-picking.
Base model — model.pt @ step 79,000
| Metric | Value |
|---|---|
| Validation loss (nats / token) | 2.9691 |
| Perplexity | 19.47 |
| Bits per byte (BPB) | 0.9011 |
| Evaluation window | 409,600 tokens / 1,947,169 bytes |
Bits-per-byte is the tokenizer-invariant measure — directly comparable across models with different vocabularies. For reference, GPT-2 on similar web text lands around BPB ≈ 1.0; Mnemo at 117M on ClimbMix-400B gets to ~0.90, which is sensible for the size class.
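The three numbers in the base-model table are related by two standard conversions: perplexity is the exponential of the per-token loss in nats, and bits per byte rescales the total loss by ln 2 and by the number of bytes the tokens cover. The sketch below applies them to the table's values; the function names are illustrative.

```python
import math

def perplexity(loss_nats):
    """Per-token perplexity from mean cross-entropy in nats."""
    return math.exp(loss_nats)

def bits_per_byte(loss_nats, n_tokens, n_bytes):
    """Total loss converted from nats to bits, divided by bytes of text covered."""
    total_bits = loss_nats * n_tokens / math.log(2)
    return total_bits / n_bytes

ppl = perplexity(2.9691)
bpb = bits_per_byte(2.9691, 409_600, 1_947_169)
print(f"perplexity={ppl:.2f}  bpb={bpb:.4f}")
```

Because the byte count does not depend on how the text was tokenized, two models with different vocabularies can be compared on this number as long as they are scored on the same text.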
Chat model — full benchmark suite
Evaluated on the complete test set of each task (no --max-problems cap).
Categorical tasks use logit comparison over allowed letters; generative tasks
sample greedily and parse #### N for the final answer.
| Task | Type | N | Accuracy | Random baseline | Centered |
|---|---|---|---|---|---|
| MMLU (57 subjects) | categorical 4-way MCQ | 14,042 | 28.32% | 25% | +4.42 |
| ARC-Easy | categorical 4-way MCQ | 2,376 | 30.68% | 25% | +7.58 |
| ARC-Challenge | categorical 4-way MCQ | 1,172 | 29.52% | 25% | +6.03 |
| GSM8K (math word problems) | generative, parse #### N | 1,319 | 1.14% | 0% | +1.14 |
| SpellingBee (letter counting) | generative, parse #### N | 256 | 94.53% | 0% | +94.53 |
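The two scoring modes described above the table can be sketched as follows. This is an assumption-laden illustration rather than the evaluation code: `logits` is taken to be a hypothetical mapping from answer letter to the model's logit at the answer position, and the regular expression is one reasonable way to read a `#### N` marker, not necessarily the one the pipeline uses.

```python
import re

def pick_letter(logits, allowed=("A", "B", "C", "D")):
    """Categorical scoring: compare logits only over the allowed answer letters."""
    return max(allowed, key=lambda letter: logits[letter])

# Optional sign, optional thousands separators, optional decimal part.
ANSWER_RE = re.compile(r"####\s*(-?[\d,]*\.?\d+)")

def parse_final_answer(text):
    """Generative scoring: number after the last '####' marker, or None if absent."""
    matches = ANSWER_RE.findall(text)
    if not matches:
        return None
    return matches[-1].replace(",", "")

choice = pick_letter({"A": 0.1, "B": 2.3, "C": -1.0, "D": 0.5, "E": 9.9})
answer = parse_final_answer("b-a-n-a-n-a has three a's.\n#### 3")
```

Restricting the comparison to the allowed letters means a multiple-choice item is always answered, which is why the random baseline for those tasks is 25% rather than 0%.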
ChatCORE metric
ChatCORE = 22.74% — mean centered accuracy across all five tasks.
ChatCORE follows the shape of nanochat's metric: each task is rescaled against its random baseline, so random guessing scores 0 and a perfect score is 100. At 22.74% on 117M parameters after 9h of pretraining plus 1h of SFT, Mnemo lands above random on all five tasks. The results suggest the Stream Mixer architecture can hold the necessary structure; the dominant ceiling appears to be parameter count, not architecture.
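The centering and averaging can be reproduced from the results table with a few lines. The sketch assumes the usual rescaling (accuracy − baseline) / (1 − baseline); the function name `centered` is illustrative. Per-task values may differ from the table in the last digit because the inputs here are the rounded percentages.

```python
def centered(accuracy, baseline):
    """Rescale so that random guessing scores 0 and a perfect score is 100."""
    return 100.0 * (accuracy - baseline) / (1.0 - baseline)

# (accuracy, random baseline) as fractions, copied from the results table.
results = {
    "MMLU": (0.2832, 0.25),
    "ARC-Easy": (0.3068, 0.25),
    "ARC-Challenge": (0.2952, 0.25),
    "GSM8K": (0.0114, 0.0),
    "SpellingBee": (0.9453, 0.0),
}

scores = {task: centered(acc, base) for task, (acc, base) in results.items()}
chatcore = sum(scores.values()) / len(scores)
print(f"ChatCORE = {chatcore:.2f}")
```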
Where the numbers come from
- SpellingBee at 94.53% is the standout. Mnemo learned to enumerate words from the 370k-word English dictionary character by character and reliably emit a correct #### N final answer. Common short words that tokenize as single BPE tokens (like "strawberry") still fail because the model never observes their letters individually; this is a tokenizer limitation, not a model one.
- All three MCQ tasks scoring above random confirms that the model genuinely commits to a letter at the assistant position when forced. The MMLU advantage is modest (+3.3 pp raw, +4.4 centered): 117M parameters can't memorize the breadth of academic facts MMLU covers.
- GSM8K at 1.14% is honest for an unaligned 117M-parameter model with no tool use. The format is learned correctly (step-by-step reasoning plus a #### N final answer), but the arithmetic isn't reliable enough to land the right number consistently.
Capabilities and limitations
Confirmed strong
- Coherent conversational dialogue in chat format (<|user_start|> / <|assistant_start|>)
- Factual recall on common entities (capital cities, chemical symbols, planet order)
- Letter counting via manual enumeration — 94.5% on SpellingBee
- Multiple-choice answer commitment (above random on all three MCQ benchmarks)
- Persona consistency (model identifies as Mnemo with consistent self-description)
- Greedy + nucleus (top-p) sampling configurable for short or long generation
Confirmed weak
- Math word problems — 1.14% on GSM8K. Format is learned, arithmetic is not
- Single-token common words for spelling — "strawberry" → 2 r's (real answer: 3); tokenizer hides character-level information for words that fit in a single BPE token
- Niche factual recall — confabulates confidently on rare entities, exact dates, specific quotations
- Long multi-turn conversations — context drifts after ~2-3 turns
Limitations (architectural)
- 117M parameters — knowledge density is the ceiling, not the architecture
- No tool use, no internet, no images, no memory across sessions
- 2048-token context — quality degrades past ~1500 tokens without repetition penalty
- No RLHF — outputs reflect only supervised signal; may produce inappropriate completions
- English only — pretraining corpus is essentially English educational/web text
- Repetition-prone in long generations without --repetition-penalty or --top-p
Usage
Direct loading
import torch
from tokenizers import Tokenizer
from model import GPT
tokenizer = Tokenizer.from_file('tokenizer.json')
ckpt = torch.load('model.pt', map_location='cuda')
config = dict(ckpt['config'])
config['vocab_size'] = ((tokenizer.get_vocab_size() + 63) // 64) * 64  # round vocab up to a multiple of 64
model = GPT.from_config(config).cuda().eval()
state = {k.removeprefix('_orig_mod.'): v for k, v in ckpt['model'].items()}  # strip the torch.compile prefix
model.load_state_dict(state, strict=False)  # strict=False tolerates missing or unexpected keys
Chat CLI (recommended)
python3 chat_cli.py # interactive REPL
python3 chat_cli.py -p "Who are you?" # one-shot
The chat CLI handles the chat-format token wrapping (<|bos|> → <|user_start|> …)
and stops generation cleanly on <|assistant_end|>. State is cached across turns
via the recurrent state buffer — only the new tokens of each user message are
prefilled, giving roughly 5–10× faster prefill on multi-turn conversations than
re-processing the entire history.
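As an illustration of the token wrapping, the sketch below renders a conversation as a string using the special tokens listed in the quick facts. The function names and the exact layout (BOS first, each turn wrapped in its start and end markers, an open assistant turn at the end) are assumptions based on the nanochat-style format. In the real CLI the special tokens would be emitted as single token ids, not as concatenated text.

```python
def render_turns(turns, add_bos=True):
    """Wrap (role, text) pairs in the chat special tokens; role is 'user' or 'assistant'."""
    parts = ["<|bos|>"] if add_bos else []
    for role, text in turns:
        parts.append(f"<|{role}_start|>{text}<|{role}_end|>")
    return "".join(parts)

def render_prompt(turns):
    """Conversation so far, ending with an open assistant turn for the model to complete."""
    return render_turns(turns) + "<|assistant_start|>"

prompt = render_prompt([("user", "Who are you?")])
print(prompt)
```

Generation would then continue from the open assistant turn until the model emits <|assistant_end|>.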
Raw inference (no chat format)
python3 infer.py -p "Photosynthesis is the process by which" --top-p 0.9 -r 1.15
Recommended sampling parameters (empirically tuned, see training log):
- Greedy / factual probes: -t 0
- Short prose (≤500 tok): -t 0.8 -k 50
- Long prose (500–2000 tok): -t 0.8 -k 50 --top-p 0.9 -r 1.15 (anti-loop)
- Diverse creative writing: -t 0.9 --top-p 0.85 -r 1.1
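For readers unfamiliar with these knobs, here is a minimal pure-Python sketch of how temperature, top-k, top-p, and a repetition penalty are commonly combined. It is not the code behind infer.py: the function names, the order of the filters, and the mapping of -t, -k, --top-p, and -r onto the parameters are assumptions, and the penalty follows the common convention of dividing positive logits and multiplying negative ones.

```python
import math
import random

def next_token_probs(logits, recent=(), temperature=0.8, top_k=None, top_p=None, rep_penalty=1.0):
    """Turn raw logits into a filtered, renormalized distribution {token_id: prob}."""
    logits = list(logits)
    # Repetition penalty: push down the logits of recently emitted tokens.
    for tok in set(recent):
        logits[tok] = logits[tok] / rep_penalty if logits[tok] > 0 else logits[tok] * rep_penalty
    # Temperature 0 is treated as greedy decoding: all mass on the argmax.
    if temperature == 0:
        best = max(range(len(logits)), key=logits.__getitem__)
        return {best: 1.0}
    # Softmax with temperature (subtract the max for numerical stability).
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    ranked = sorted(((w / total, i) for i, w in enumerate(weights)), reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]          # keep the k most likely tokens
    if top_p is not None:
        kept, mass = [], 0.0
        for p, i in ranked:              # keep the smallest prefix reaching top_p mass
            kept.append((p, i))
            mass += p
            if mass >= top_p:
                break
        ranked = kept
    norm = sum(p for p, _ in ranked)
    return {i: p / norm for p, i in ranked}

def sample(probs, rng=random):
    """Draw one token id from the filtered distribution."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]

probs = next_token_probs([2.0, 1.0, 0.5, -1.0], temperature=0.8, top_k=2)
token = sample(probs)
```

In this sketch, lowering the temperature or tightening top-k and top-p concentrates mass on the most likely tokens, while the repetition penalty is the only knob that depends on what has already been generated.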
Probe outputs (greedy, from the base checkpoint)
Run via python3 base_eval.py --eval sample against the pretrained checkpoint (model.pt, val 2.9508). Greedy, 64 tokens per completion.
| Prompt | First tokens of output | Verdict |
|---|---|---|
| The capital of France is | "...Paris, and it is the capital of France. The capital of France is Paris..." | ✓ Paris lands |
| The chemical symbol of gold is | "Au. It is a soft, silvery-white metal... good conductor of electricity and heat, making it useful in electrical wiring..." | ✓ Au + real applied claim |
| If yesterday was Friday, then tomorrow will be | "Tuesday. The weather is not so bad..." | ✗ (correct: Sunday) |
| The opposite of hot is | "the cold." | ✓ |
| The planets of the solar system are: | "Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune, Pluto..." | ✓ Correct astronomical order |
| My favorite color is | "red. It's a color that's been around for a long time..." | ✓ |
| If 5*x + 3 = 13, then x is | "a positive integer. If x is a positive integer, then x is a positive integer..." | ✗ Loop |
| Photosynthesis is the process by which | "plants and other organisms convert light energy into chemical energy. It is a complex process that involves the conversion of light energy into chemical energy..." | ✓ Factually correct opener |
Six of the eight probes above land correct answers under greedy decoding. Repetition is visible: the base model benefits substantially from --repetition-penalty 1.15 and/or --top-p 0.9 on longer generations (see the Usage section).
Citation and acknowledgements
Built on top of karpathy/nanochat by Andrej Karpathy. The Stream Mixer architecture is an attention-free experiment swapping the standard Transformer block for a recurrent linear-time sequence mixer.
Pretraining data is karpathy/climbmix-400b-shuffle. SFT mixture sources: HuggingFaceTB/smol-smoltalk, cais/mmlu, allenai/ai2_arc, openai/gsm8k, and a custom 1000-conversation identity dataset.
@misc{mnemo2026,
title={Mnemo: A Linear-Time Recurrent Language Model},
author={Alvarado, Luis Miguel},
year={2026},
note={Built on karpathy/nanochat. Stream Mixer architecture.},
howpublished={\url{https://github.com/<your-handle>/mnemo}}
}
License
MIT. Use freely. No warranty.