Mnemo

μνήμη — Greek for "memory"

Mnemo is a small attention-free language model with 117M parameters, built on the Stream Mixer architecture — a linear-time recurrent sequence mixer that uses multiple parallel content-routed memory streams instead of self-attention. The name nods to the model's recurrent memory: every layer maintains M parallel state buffers that "remember" content over the entire sequence without quadratic attention.

The training pipeline (data, tokenizer, eval, fine-tuning) is a fork of karpathy/nanochat, with the attention-based GPT replaced by a custom Stream Mixer block.

Quick facts


Architecture	Stream Mixer (linear-time recurrent)
Parameters	117,179,136
Layers	16
Hidden dim	768
Memory streams (M)	48
Stream state dim (D)	96
Read heads	6
Context length	2048 tokens
Vocab	32,768 BPE (GPT-4-style pretokenization)
Special tokens	`<|bos|>`, `<|user_start|>`, `<|user_end|>`, `<|assistant_start|>`, `<|assistant_end|>`
Compute dtype	bf16 (Ampere+) / fp32 (T4/CPU)
Base perplexity	19.47 (0.9011 bits per byte)
ChatCORE (chat model)	22.74% (mean centered accuracy across 5 tasks)
SpellingBee accuracy	94.53% (242/256 on the test set)
License	MIT

Architecture: Stream Mixer

Mnemo's defining feature is its sequence mixer. Where a Transformer uses self-attention to compute pairwise interactions across tokens (cost: O(T²)), Mnemo uses a chunked parallel scan over M parallel content-routed memory streams (cost: O(T · M · D) — linear in sequence length).

Per token t and per layer:

Compute value v[t], read query q[t], content-router r[t], and per-stream decay α[t].
Each memory stream s_m updates via s_m[t] = α_m[t] · s_m[t-1] + r_m[t] · v[t].
Multi-head sigmoid-gated read with QK-norm aggregates from the M streams.

The full state across a layer is (B, M, D) — a fixed-size recurrent memory that the model can carry across arbitrary sequence lengths. The chunked scan implementation keeps numerical range bounded even for slow-decay streams.
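The per-token update above can be sketched in plain Python. This is a minimal illustration of the state recurrence only, assuming scalar per-stream decay and router weights; the function name `stream_scan` and the toy shapes are hypothetical, and the chunked parallel scan and the gated multi-head read used by the real model are not shown.

```python
def stream_scan(values, routes, decays, state=None):
    """Sequentially apply s_m[t] = alpha_m[t] * s_m[t-1] + r_m[t] * v[t].

    values: T vectors of length D        (v[t])
    routes: T lists of M router weights  (r_m[t])
    decays: T lists of M decay factors   (alpha_m[t], assumed to lie in [0, 1])
    state:  optional M x D state carried over from earlier tokens
    Returns the final M x D state.
    """
    M, D = len(routes[0]), len(values[0])
    if state is None:
        state = [[0.0] * D for _ in range(M)]
    else:
        state = [row[:] for row in state]  # copy so the caller's state is untouched
    for v, r, a in zip(values, routes, decays):
        for m in range(M):
            state[m] = [a[m] * s + r[m] * x for s, x in zip(state[m], v)]
    return state


# Toy sizes (hypothetical): T=4 tokens, M=2 streams, D=3 state dims.
values = [[1.0, 0.0, 2.0], [0.5, 1.0, 0.0], [0.0, 2.0, 1.0], [1.0, 1.0, 1.0]]
routes = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [0.25, 0.75]]
decays = [[0.9, 0.5]] * 4

full = stream_scan(values, routes, decays)
# Processing the sequence in two pieces while carrying the state gives the same result.
head = stream_scan(values[:2], routes[:2], decays[:2])
resumed = stream_scan(values[2:], routes[2:], decays[2:], state=head)
```

The state stays M × D no matter how many tokens are processed, which is what lets it be carried across chunks or chat turns.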

For details see the model source.

Training

Pretraining (base model)


Corpus	karpathy/climbmix-400b-shuffle — 88 shards
Total tokens	5.24B (44.7× over params)
Steps	80,000 × B=32 × T=2048
Optimizer	AdamW (peak LR 1e-3, warmup 500, cosine to 1e-5, weight decay 0.1)
Compute	RTX PRO 6000 Blackwell (single GPU, bf16)
Wall time	~9 hours
Best val loss	2.9508 (perplexity ≈ 19.12)
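The schedule and token budget in the table can be reproduced with a short sketch. The table gives only the peak LR, warmup length, cosine decay and final LR; the linear warmup shape and the exact cosine formula here are assumptions, and `lr_at` is a hypothetical helper rather than the training script's own function.

```python
import math

PEAK_LR, FINAL_LR = 1e-3, 1e-5
WARMUP_STEPS, TOTAL_STEPS = 500, 80_000


def lr_at(step):
    """Linear warmup to PEAK_LR, then cosine decay to FINAL_LR (assumed shape)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    return FINAL_LR + 0.5 * (PEAK_LR - FINAL_LR) * (1.0 + math.cos(math.pi * progress))


# Token budget: steps x batch size x sequence length.
tokens_per_step = 32 * 2048
total_tokens = TOTAL_STEPS * tokens_per_step      # 5,242,880,000
tokens_per_param = total_tokens / 117_179_136     # about 44.7
```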

Supervised fine-tuning


Mixture	SmolTalk + MMLU×3 + ARC×4 + GSM8K×4 + SimpleSpelling + SpellingBee + 1000 Mnemo-branded identity convs
Total conversations	~1.09M
Steps	30,000 × B=8 × T=2048 ≈ 492M SFT tokens
Optimizer	AdamW (peak LR 1e-4, warmup 300)
Best val loss	~1.45 (masked cross-entropy over assistant tokens only)
Format	nanochat-style BOS-aligned best-fit packing with padding
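"Masked cross-entropy over assistant tokens only" means that prompt, user and padding positions contribute nothing to the loss. A minimal sketch in plain Python, assuming a 0/1 mask per position; `masked_cross_entropy` is a hypothetical name and the real training code operates on batched tensors.

```python
import math


def masked_cross_entropy(logits, targets, mask):
    """Mean negative log-likelihood over positions where mask is 1.

    logits:  per-position lists of unnormalized scores over the vocabulary
    targets: per-position target token ids
    mask:    1 for assistant tokens (counted), 0 for everything else (ignored)
    """
    total, count = 0.0, 0
    for scores, target, keep in zip(logits, targets, mask):
        if not keep:
            continue
        top = max(scores)  # subtract the max for a numerically stable log-sum-exp
        log_z = top + math.log(sum(math.exp(s - top) for s in scores))
        total += log_z - scores[target]
        count += 1
    return total / count if count else 0.0


# Toy example with a 3-token vocabulary; only the last two positions are assistant tokens.
logits = [[5.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
targets = [2, 1, 1]
mask = [0, 1, 1]
loss = masked_cross_entropy(logits, targets, mask)
```

The first position has a badly wrong prediction, but because it is masked out it does not affect `loss`.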

Pipeline

ClimbMix-400B
   │
   ▼
[80k step pretrain on Stream Mixer]
   │  best val 2.9508 @ step 79k
   ▼
Base checkpoint  (completes prompts)
   │
   ▼
[30k step SFT on multi-task mixture]
   │  best val ~1.45
   ▼
SFT checkpoint  (chat-aware — answers as Mnemo)

Evaluation results

Measured on the full test sets — no subsampling, no cherry-picking.

Base model — `model.pt` @ step 79,000

Metric	Value
Validation loss (nats / token)	2.9691
Perplexity	19.47
Bits per byte (BPB)	0.9011
Evaluation window	409,600 tokens / 1,947,169 bytes

Bits-per-byte is the tokenizer-invariant measure — directly comparable across models with different vocabularies. For reference, GPT-2 on similar web text lands around BPB ≈ 1.0; Mnemo at 117M on ClimbMix-400B gets to ~0.90, which is sensible for the size class.
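The three numbers in the table are linked by simple conversions. The sketch below recomputes perplexity and bits per byte from the reported loss, assuming the loss is a token-weighted mean over the evaluation window; the variable names are illustrative.

```python
import math

val_loss_nats = 2.9691      # mean loss per token, in nats
eval_tokens = 409_600
eval_bytes = 1_947_169

# Perplexity is the exponential of the per-token loss.
perplexity = math.exp(val_loss_nats)

# Total nats over the window, converted to bits, divided by the raw byte count.
total_bits = val_loss_nats * eval_tokens / math.log(2)
bits_per_byte = total_bits / eval_bytes
```

Both values agree with the table (19.47 and 0.9011) to the precision shown.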

Chat model — full benchmark suite

Evaluated on the complete test set of each task (no --max-problems cap). Categorical tasks use logit comparison over allowed letters; generative tasks sample greedily and parse #### N for the final answer.

Task	Type	N	Accuracy	Random baseline	Centered
MMLU (57 subjects)	categorical 4-way MCQ	14,042	28.32%	25%	+4.42
ARC-Easy	categorical 4-way MCQ	2,376	30.68%	25%	+7.58
ARC-Challenge	categorical 4-way MCQ	1,172	29.52%	25%	+6.03
GSM8K (math word problems)	generative, parse `#### N`	1,319	1.14%	0%	+1.14
SpellingBee (letter counting)	generative, parse `#### N`	256	94.53%	0%	+94.53
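The two scoring modes described above can be sketched as follows. This is an illustration, not the evaluation harness itself: `pick_letter` and `parse_final_answer` are hypothetical helpers, logits are passed as a plain dict keyed by token string, and taking the last `#### N` marker is an assumption.

```python
import re


def pick_letter(logits_by_token, allowed=("A", "B", "C", "D")):
    """Return the allowed letter with the highest logit, ignoring all other tokens."""
    return max(allowed, key=lambda letter: logits_by_token[letter])


def parse_final_answer(text):
    """Extract N from the last '#### N' marker, or None if there is none."""
    matches = re.findall(r"####\s*(-?[\d,]*\.?\d+)", text)
    if not matches:
        return None
    return matches[-1].replace(",", "")  # drop thousands separators


# Categorical: a non-letter token may have the highest logit, but only letters compete.
choice = pick_letter({"A": 1.2, "B": 3.4, "C": -0.5, "D": 0.1, "the": 9.0})

# Generative: the reasoning text is ignored and only the final marker is scored.
answer = parse_final_answer("She has 3 + 4 = 7 apples.\n#### 7")
```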

ChatCORE metric

ChatCORE = 22.74% — mean centered accuracy across all five tasks.

ChatCORE follows the same form as nanochat's metric: it normalizes each task to its random baseline, so random guessing scores 0 and a perfect model scores 100. At 22.74% on 117M params after 9h pretraining + 1h SFT, Mnemo scores above random on all five tasks, although most of the mean comes from SpellingBee (+94.53); the other four tasks sit between +1.14 and +7.58. The Stream Mixer architecture can hold the necessary structure, and the dominant ceiling appears to be parameter count rather than architecture.
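The centering can be checked directly from the benchmark table. A minimal sketch, assuming the usual rescaling in which the random baseline maps to 0 and a perfect score maps to 100; `centered` is a hypothetical helper name.

```python
def centered(accuracy, baseline):
    """Rescale accuracy (percent) so the random baseline maps to 0 and perfect to 100."""
    return 100.0 * (accuracy - baseline) / (100.0 - baseline)


# (accuracy %, random baseline %) copied from the benchmark table.
results = {
    "MMLU": (28.32, 25.0),
    "ARC-Easy": (30.68, 25.0),
    "ARC-Challenge": (29.52, 25.0),
    "GSM8K": (1.14, 0.0),
    "SpellingBee": (94.53, 0.0),
}

scores = {task: centered(acc, base) for task, (acc, base) in results.items()}
chatcore = sum(scores.values()) / len(scores)
```

Recomputing from the rounded accuracies reproduces the per-task centered values to within rounding and gives a mean of about 22.74.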

Where the numbers come from

SpellingBee 94.53% is the standout. Mnemo learned to character-by-character enumerate words from the 370k-word English dictionary and reliably emit a correct #### N final answer. Common short words that tokenize as single BPE tokens (like "strawberry") still fail because the model never observes their letters individually — this is a tokenizer limitation, not a model one.
All three MCQ tasks landing above random confirms that the model commits to a letter at the assistant position when forced. The MMLU advantage is modest (+3.3 pp over the 25% baseline, +4.4 centered); 117M parameters can't memorize the breadth of academic facts MMLU covers.
GSM8K at 1.14% is honest for an unaligned 117M-parameter model with no tool use. The format is correctly learned (step-by-step reasoning + #### N final answer) but the arithmetic isn't reliable enough to land the right number consistently.

Capabilities and limitations

Confirmed strong

Coherent conversational dialogue in chat format (<|user_start|> / <|assistant_start|>)
Factual recall on common entities (capital cities, chemical symbols, planets ordered)
Letter counting via manual enumeration — 94.5% on SpellingBee
Multiple-choice answer commitment (above random on all three MCQ benchmarks)
Persona consistency (model identifies as Mnemo with consistent self-description)
Greedy + nucleus (top-p) sampling configurable for short or long generation

Confirmed weak

Math word problems — 1.14% on GSM8K. Format is learned, arithmetic is not
Single-token common words for spelling — "strawberry" → 2 r's (real answer: 3); tokenizer hides character-level information for words that fit in a single BPE token
Niche factual recall — confabulates confidently on rare entities, exact dates, specific quotations
Long multi-turn conversations — context drifts after ~2-3 turns

Limitations (architectural)

117M parameters — knowledge density is the ceiling, not the architecture
No tool use, no internet, no images, no memory across sessions
2048-token context — quality degrades past ~1500 tokens without repetition penalty
No RLHF — outputs reflect only supervised signal; may produce inappropriate completions
English only — pretraining corpus is essentially English educational/web text
Repetition prone in long generations without --repetition-penalty or --top-p

Usage

Direct loading

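import torch
from tokenizers import Tokenizer
from model import GPT  # model.py from this repo

tokenizer = Tokenizer.from_file('tokenizer.json')
# Use map_location='cpu' (and drop .cuda() below) on machines without a GPU.
ckpt = torch.load('model.pt', map_location='cuda')

config = dict(ckpt['config'])
# Round the vocab size up to the next multiple of 64.
config['vocab_size'] = ((tokenizer.get_vocab_size() + 63) // 64) * 64
model = GPT.from_config(config).cuda().eval()

# Strip the '_orig_mod.' prefix that torch.compile adds to parameter names.
state = {k.removeprefix('_orig_mod.'): v for k, v in ckpt['model'].items()}
model.load_state_dict(state, strict=False)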

Chat CLI (recommended)

python3 chat_cli.py                   # interactive REPL
python3 chat_cli.py -p "Who are you?"  # one-shot

The chat CLI handles the chat-format token wrapping (<|bos|> → <|user_start|> …) and stops generation cleanly on <|assistant_end|>. State is cached across turns via the recurrent state buffer — only the new tokens of each user message are prefilled, giving roughly 5–10× faster prefill on multi-turn conversations than re-processing the entire history.
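For readers who want to build prompts by hand, the wrapping can be sketched at the string level. The exact layout is an assumption modeled on the special tokens listed in Quick facts; `render_chat` and `trim_at_stop` are hypothetical helpers, and the real CLI works on token ids rather than strings.

```python
def render_chat(messages):
    """Wrap (role, text) turns in special tokens and open an assistant turn.

    Assumed layout: <|bos|>, then each turn as <|role_start|>text<|role_end|>,
    then a trailing <|assistant_start|> for the model to continue from.
    """
    parts = ["<|bos|>"]
    for role, text in messages:
        parts.append(f"<|{role}_start|>{text}<|{role}_end|>")
    parts.append("<|assistant_start|>")
    return "".join(parts)


def trim_at_stop(generated, stop="<|assistant_end|>"):
    """Cut generated text at the first stop token, if present."""
    return generated.split(stop, 1)[0]


prompt = render_chat([("user", "Who are you?")])
reply = trim_at_stop("I am Mnemo.<|assistant_end|><|user_start|>")
```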

Raw inference (no chat format)

python3 infer.py -p "Photosynthesis is the process by which" --top-p 0.9 -r 1.15

Recommended sampling parameters (empirically tuned, see training log):

Greedy / factual probes: -t 0
Short prose (≤500 tok): -t 0.8 -k 50
Long prose (500–2000 tok): -t 0.8 -k 50 --top-p 0.9 -r 1.15 (anti-loop)
Diverse creative writing: -t 0.9 --top-p 0.85 -r 1.1
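To show how these flags interact, here is a plain-Python sketch of one sampling step. It is not the code in infer.py: the order of operations, the CTRL-style repetition penalty, and the name `sample_token` are all assumptions made for illustration.

```python
import math
import random


def sample_token(logits, history=(), temperature=0.8, top_k=0, top_p=1.0,
                 repetition_penalty=1.0, rng=random):
    """Pick a token id from a list of logits (a sketch, not infer.py's code)."""
    logits = list(logits)
    # Repetition penalty (CTRL-style, assumed): make already-seen tokens less likely.
    for tok in set(history):
        if logits[tok] > 0:
            logits[tok] /= repetition_penalty
        else:
            logits[tok] *= repetition_penalty
    if temperature == 0:
        return max(range(len(logits)), key=logits.__getitem__)  # greedy
    # Sort token ids by logit, highest first, and optionally keep only the top k.
    order = sorted(range(len(logits)), key=logits.__getitem__, reverse=True)
    if top_k > 0:
        order = order[:top_k]
    top = logits[order[0]]
    weights = [math.exp((logits[i] - top) / temperature) for i in order]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Nucleus (top-p): keep the smallest prefix whose probability mass reaches top_p.
    kept, mass = [], 0.0
    for i, p in zip(order, probs):
        kept.append((i, p))
        mass += p
        if mass >= top_p:
            break
    ids, ps = zip(*kept)
    return rng.choices(ids, weights=ps, k=1)[0]
```

With `temperature=0` the call is deterministic; a repetition penalty above 1 can still change the greedy choice when the top token has already appeared in `history`.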

Probe outputs (greedy, from the base checkpoint)

Run via python3 base_eval.py --eval sample against the pretrained checkpoint (model.pt, val 2.9508). Greedy, 64 tokens per completion.

Prompt	First tokens of output	Verdict
The capital of France is	"...Paris, and it is the capital of France. The capital of France is Paris..."	✓ Paris lands
The chemical symbol of gold is	"Au. It is a soft, silvery-white metal... good conductor of electricity and heat, making it useful in electrical wiring..."	✓ Au + real applied claim
If yesterday was Friday, then tomorrow will be	"Tuesday. The weather is not so bad..."	✗ (correct: Sunday)
The opposite of hot is	"the cold."	✓
The planets of the solar system are:	"Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune, Pluto..."	✓ Correct astronomical order
My favorite color is	"red. It's a color that's been around for a long time..."	✓
If 5x + 3 = 13, then x is	"a positive integer. If x is a positive integer, then x is a positive integer..."	✗ Loop
Photosynthesis is the process by which	"plants and other organisms convert light energy into chemical energy. It is a complex process that involves the conversion of light energy into chemical energy..."	✓ Factually correct opener

Of the 7 probes with a checkable answer (every row except the open-ended "My favorite color is"), 5 land correct answers at greedy. Repetition is visible: the base model benefits substantially from --repetition-penalty 1.15 and/or --top-p 0.9 on longer generations (see Usage section).

Citation and acknowledgements

Built on top of karpathy/nanochat by Andrej Karpathy. The Stream Mixer architecture is an attention-free experiment swapping the standard Transformer block for a recurrent linear-time sequence mixer.

Pretraining data is karpathy/climbmix-400b-shuffle. SFT mixture sources: HuggingFaceTB/smol-smoltalk, cais/mmlu, allenai/ai2_arc, openai/gsm8k, and a custom 1000-conversation identity dataset.

@misc{mnemo2026,
  title={Mnemo: A Linear-Time Recurrent Language Model},
  author={Alvarado, Luis Miguel},
  year={2026},
  note={Built on karpathy/nanochat. Stream Mixer architecture.},
  howpublished={\url{https://github.com/<your-handle>/mnemo}}
}

License

MIT. Use freely. No warranty.
