Mnemo

μνήμη — Greek for "memory"

Mnemo is a small attention-free language model with 117M parameters, built on the Stream Mixer architecture — a linear-time recurrent sequence mixer that uses multiple parallel content-routed memory streams instead of self-attention. The name nods to the model's recurrent memory: every layer maintains M parallel state buffers that "remember" content over the entire sequence without quadratic attention.

The training pipeline (data, tokenizer, eval, fine-tuning) is a fork of karpathy/nanochat, with the attention-based GPT replaced by a custom Stream Mixer block.


Quick facts

Architecture: Stream Mixer (linear-time recurrent)
Parameters: 117,179,136
Layers: 16
Hidden dim: 768
Memory streams (M): 48
Stream state dim (D): 96
Read heads: 6
Context length: 2048 tokens
Vocab: 32,768 BPE (GPT-4-style pretokenization)
Special tokens: <|bos|>, <|user_start|>, <|user_end|>, <|assistant_start|>, <|assistant_end|>
Compute dtype: bf16 (Ampere+) / fp32 (T4/CPU)
Base perplexity: 19.47 (0.9011 bits per byte, BPB)
ChatCORE (chat model): 22.74% (mean centered accuracy across 5 tasks)
SpellingBee accuracy: 94.53% (242/256 on the full test set)
License: MIT

Architecture: Stream Mixer

Mnemo's defining feature is its sequence mixer. Where a Transformer uses self-attention to compute pairwise interactions across tokens (cost: O(T²)), Mnemo uses a chunked parallel scan over M parallel content-routed memory streams (cost: O(T · M · D), which is linear in sequence length T).

Per token t and per layer:

  1. Compute value v[t], read query q[t], content-router r[t], and per-stream decay α[t].
  2. Each memory stream s_m updates via s_m[t] = α_m[t] · s_m[t-1] + r_m[t] · v[t].
  3. Multi-head sigmoid-gated read with QK-norm aggregates from the M streams.
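The update in step 2 can be sketched as a plain sequential reference. This is a minimal illustration under stated assumptions, not the model's actual kernel: the names `stream_update` and `run_streams` are hypothetical, the projections that produce v, r and α (step 1) and the gated multi-head read (step 3) are omitted, and a Python loop stands in for the chunked parallel scan.

```python
def stream_update(state, alpha, route, value):
    """One step of s_m[t] = alpha_m[t] * s_m[t-1] + route_m[t] * v[t].

    state: M lists of length D, alpha/route: length M, value: length D.
    """
    return [
        [a * s_d + r * v_d for s_d, v_d in zip(s_m, value)]
        for s_m, a, r in zip(state, alpha, route)
    ]


def run_streams(alphas, routes, values, num_streams, state_dim):
    """Run the recurrence over a whole sequence, starting from a zero state."""
    state = [[0.0] * state_dim for _ in range(num_streams)]
    for alpha, route, value in zip(alphas, routes, values):
        state = stream_update(state, alpha, route, value)
    return state


# Toy example with M=2 streams, D=2 state dims, T=2 tokens.
final_state = run_streams(
    alphas=[[0.5, 1.0], [0.5, 1.0]],   # stream 0 decays, stream 1 does not
    routes=[[1.0, 0.0], [0.0, 1.0]],   # token 0 -> stream 0, token 1 -> stream 1
    values=[[2.0, 4.0], [1.0, 3.0]],
    num_streams=2,
    state_dim=2,
)
print(final_state)  # [[1.0, 2.0], [1.0, 3.0]]
```

The state keeps the same M × D size no matter how many tokens are processed, which is what makes the per-token cost independent of sequence length.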

The full state across a layer is (B, M, D) — a fixed-size recurrent memory that the model can carry across arbitrary sequence lengths. The chunked scan implementation keeps numerical range bounded even for slow-decay streams.
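One way to see why a chunked scan is possible: each step of the recurrence is an affine map s → a · s + b (with a = α_m[t] and b = r_m[t] · v[t]), and affine maps compose associatively. The sketch below uses hypothetical helper names and a single stream; it collapses each chunk into one equivalent step and checks that the result matches the token-by-token loop. It does not reproduce the numerical-range handling of the real implementation, which this card does not describe.

```python
def combine(first, second):
    """Compose two affine updates s -> a*s + b (apply `first`, then `second`)."""
    a1, b1 = first
    a2, b2 = second
    return a2 * a1, [a2 * x + y for x, y in zip(b1, b2)]


def summarize_chunk(steps):
    """Collapse a chunk of (decay, write) steps into one equivalent step."""
    summary = steps[0]
    for step in steps[1:]:
        summary = combine(summary, step)
    return summary


def apply_step(step, state):
    decay, write = step
    return [decay * s + w for s, w in zip(state, write)]


def scan_sequential(steps, state):
    for step in steps:
        state = apply_step(step, state)
    return state


def scan_chunked(steps, state, chunk_size):
    # Chunk summaries do not depend on the incoming state, so they could be
    # computed in parallel; only the short loop over chunks is sequential.
    for start in range(0, len(steps), chunk_size):
        state = apply_step(summarize_chunk(steps[start:start + chunk_size]), state)
    return state


steps = [
    (0.5, [1.0, 2.0]),
    (0.25, [0.0, 4.0]),
    (1.0, [2.0, 0.0]),
    (0.5, [8.0, 8.0]),
    (0.5, [0.0, 16.0]),
]
initial = [16.0, 32.0]
print(scan_sequential(steps, initial))            # [5.0625, 22.125]
print(scan_chunked(steps, initial, chunk_size=2))  # [5.0625, 22.125]
```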

For details see the model source.


Training

Pretraining (base model)

Corpus: karpathy/climbmix-400b-shuffle (88 shards)
Total tokens: 5.24B (44.7× the parameter count)
Steps: 80,000 × B=32 × T=2048
Optimizer: AdamW (peak LR 1e-3, warmup 500, cosine to 1e-5, weight decay 0.1)
Compute: RTX PRO 6000 Blackwell (single GPU, bf16)
Wall time: ~9 hours
Best val loss: 2.9508 (perplexity ≈ 19.12)
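The derived figures in the pretraining summary follow from the step count, batch shape, and loss. A quick check, using only numbers stated in this card (variable names are illustrative):

```python
import math

steps, batch_size, seq_len = 80_000, 32, 2048
params = 117_179_136

tokens = steps * batch_size * seq_len          # tokens seen during pretraining
tokens_per_param = tokens / params
best_val_perplexity = math.exp(2.9508)         # perplexity = exp(loss in nats/token)

print(f"{tokens:,} tokens")                               # 5,242,880,000 tokens
print(f"{tokens_per_param:.1f} tokens per parameter")     # 44.7 tokens per parameter
print(f"perplexity {best_val_perplexity:.2f}")            # perplexity 19.12
```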

Supervised fine-tuning

Mixture: SmolTalk + MMLU×3 + ARC×4 + GSM8K×4 + SimpleSpelling + SpellingBee + 1000 Mnemo-branded identity conversations
Total conversations: ~1.09M
Steps: 30,000 × B=8 × T=2048 ≈ 492M SFT tokens
Optimizer: AdamW (peak LR 1e-4, warmup 300)
Best val loss: ~1.45 (masked cross-entropy over assistant tokens only)
Format: nanochat-style BOS-aligned best-fit packing with padding
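"Masked cross-entropy over assistant tokens only" means that user turns, special-token scaffolding, and padding contribute nothing to the loss. A minimal sketch of that averaging, assuming per-token log-probabilities are already available; the function name `masked_cross_entropy` is hypothetical and this is not the training code itself:

```python
import math


def masked_cross_entropy(token_logprobs, assistant_mask):
    """Mean negative log-likelihood over positions where the mask is 1."""
    kept = [-lp for lp, keep in zip(token_logprobs, assistant_mask) if keep]
    if not kept:
        raise ValueError("no supervised tokens in this sequence")
    return sum(kept) / len(kept)


# Four positions; only the middle two belong to the assistant reply.
logprobs = [math.log(0.5), math.log(0.25), math.log(0.125), math.log(0.5)]
mask = [0, 1, 1, 0]
loss = masked_cross_entropy(logprobs, mask)
print(f"{loss:.4f}")  # mean of -ln(0.25) and -ln(0.125) = 2.5 * ln(2)
```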

Pipeline

ClimbMix-400B
   │
   ▼
[80k step pretrain on Stream Mixer]
   │  best val 2.9508 @ step 79k
   ▼
Base checkpoint  (completes prompts)
   │
   ▼
[30k step SFT on multi-task mixture]
   │  best val ~1.45
   ▼
SFT checkpoint  (chat-aware — answers as Mnemo)

Evaluation results

Measured on the full test sets — no subsampling, no cherry-picking.

Base model — model.pt @ step 79,000

| Metric | Value |
|---|---|
| Validation loss (nats / token) | 2.9691 |
| Perplexity | 19.47 |
| Bits per byte (BPB) | 0.9011 |
| Evaluation window | 409,600 tokens / 1,947,169 bytes |

Bits-per-byte is the tokenizer-invariant measure — directly comparable across models with different vocabularies. For reference, GPT-2 on similar web text lands around BPB ≈ 1.0; Mnemo at 117M on ClimbMix-400B gets to ~0.90, which is sensible for the size class.
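The conversion from per-token loss to bits per byte is total nats divided by (ln 2 × total bytes). The sketch below applies it to the evaluation-window figures reported above; the function name is illustrative.

```python
import math


def bits_per_byte(mean_nats_per_token, num_tokens, num_bytes):
    """Convert a mean per-token loss in nats into bits per byte of raw text."""
    total_nats = mean_nats_per_token * num_tokens
    return total_nats / (math.log(2) * num_bytes)


val_loss = 2.9691
bpb = bits_per_byte(val_loss, num_tokens=409_600, num_bytes=1_947_169)
print(f"perplexity {math.exp(val_loss):.2f}")  # perplexity 19.47
print(f"BPB {bpb:.4f}")                        # BPB 0.9011
```

Because the byte count is a property of the text rather than of the tokenizer, a model with a coarser vocabulary gets no advantage from spending fewer tokens on the same bytes.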

Chat model — full benchmark suite

Evaluated on the complete test set of each task (no --max-problems cap). Categorical tasks use logit comparison over allowed letters; generative tasks sample greedily and parse #### N for the final answer.
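The two scoring modes can be sketched as follows. Both helpers are hypothetical illustrations rather than the evaluation harness itself: `pick_letter` assumes the logits for the candidate letter tokens have already been gathered into a dict, and the regular expression in `extract_final_answer` is one reasonable way to read a `#### N` marker.

```python
import re

ANSWER_RE = re.compile(r"####\s*(-?[\d,]*\.?\d+)")


def pick_letter(letter_logits, allowed=("A", "B", "C", "D")):
    """Categorical tasks: choose the allowed letter with the highest logit."""
    return max(allowed, key=lambda letter: letter_logits[letter])


def extract_final_answer(completion):
    """Generative tasks: return the number after '####', or None if absent."""
    match = ANSWER_RE.search(completion)
    if match is None:
        return None
    return match.group(1).replace(",", "")


print(pick_letter({"A": 0.1, "B": 1.2, "C": -0.3, "D": 0.0, "E": 9.0}))  # B
print(extract_final_answer("s, t, r, a, w, ... so the count is\n#### 3"))  # 3
print(extract_final_answer("I am not sure."))                              # None
```

Restricting the comparison to the allowed letters means a categorical task always yields a valid answer, which is why the random baseline for a 4-way question is exactly 25%.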

| Task | Type | N | Accuracy | Random baseline | Centered |
|---|---|---|---|---|---|
| MMLU (57 subjects) | categorical 4-way MCQ | 14,042 | 28.32% | 25% | +4.42 |
| ARC-Easy | categorical 4-way MCQ | 2,376 | 30.68% | 25% | +7.58 |
| ARC-Challenge | categorical 4-way MCQ | 1,172 | 29.52% | 25% | +6.03 |
| GSM8K (math word problems) | generative, parse #### N | 1,319 | 1.14% | 0% | +1.14 |
| SpellingBee (letter counting) | generative, parse #### N | 256 | 94.53% | 0% | +94.53 |

ChatCORE metric

ChatCORE = 22.74% — mean centered accuracy across all five tasks.

ChatCORE follows the same construction as nanochat's metric: each task is normalized against its random baseline, so chance-level guessing scores 0 and a perfect score is 100. At 22.74% on 117M parameters after 9h of pretraining plus 1h of SFT, Mnemo lands above random on all five tasks, although SpellingBee contributes most of the mean and the margins on the MCQ tasks and GSM8K are small. The results suggest that the Stream Mixer architecture can hold the necessary structure and that the dominant ceiling is parameter count rather than architecture.
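The centering can be reproduced from the accuracies and baselines in the results table. The sketch below recomputes from the rounded percentages, so individual centered values can differ from the table in the last digit; the function name `centered` is illustrative.

```python
def centered(accuracy_pct, baseline_pct):
    """Rescale so the random baseline maps to 0 and a perfect score to 100."""
    return 100.0 * (accuracy_pct - baseline_pct) / (100.0 - baseline_pct)


# (accuracy %, random baseline %) as reported in the results table.
results = {
    "MMLU": (28.32, 25.0),
    "ARC-Easy": (30.68, 25.0),
    "ARC-Challenge": (29.52, 25.0),
    "GSM8K": (1.14, 0.0),
    "SpellingBee": (94.53, 0.0),
}

scores = {task: centered(acc, base) for task, (acc, base) in results.items()}
chatcore = sum(scores.values()) / len(scores)
print(f"ChatCORE = {chatcore:.2f}%")  # ChatCORE = 22.74%
```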

Where the numbers come from

  • SpellingBee 94.53% is the standout. Mnemo learned to character-by-character enumerate words from the 370k-word English dictionary and reliably emit a correct #### N final answer. Common short words that tokenize as single BPE tokens (like "strawberry") still fail because the model never observes their letters individually — this is a tokenizer limitation, not a model one.
  • All three MCQ tasks landing above random confirms that the model genuinely commits to a letter at the assistant position when forced. The MMLU advantage (+4.4 centered) is modest: 117M parameters cannot memorize the breadth of academic facts MMLU covers.
  • GSM8K at 1.14% is honest for an unaligned 117M-parameter model with no tool use. The format is correctly learned (step-by-step reasoning + #### N final answer) but the arithmetic isn't reliable enough to land the right number consistently.
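For reference, the enumeration strategy described for SpellingBee amounts to walking the word one character at a time and ending with a `#### N` line. The trace layout below is a hypothetical illustration; the exact wording the model was trained to produce is not given in this card.

```python
def spell_and_count(word, letter):
    """Enumerate characters one by one, then emit the count as '#### N'."""
    lines = []
    count = 0
    for index, char in enumerate(word, start=1):
        hit = char == letter
        count += hit
        lines.append(f"{index}:{char}" + (" <- match" if hit else ""))
    lines.append(f"#### {count}")
    return "\n".join(lines)


trace = spell_and_count("strawberry", "r")
print(trace.splitlines()[-1])  # #### 3
```

The procedure is trivial once the characters are visible; the difficulty for the model is producing the per-character expansion when the word arrives as a single token.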

Capabilities and limitations

Confirmed strong

  • Coherent conversational dialogue in chat format (<|user_start|> / <|assistant_start|>)
  • Factual recall on common entities (capital cities, chemical symbols, planets ordered)
  • Letter counting via manual enumeration — 94.5% on SpellingBee
  • Multiple-choice answer commitment (above random on all three MCQ benchmarks)
  • Persona consistency (model identifies as Mnemo with consistent self-description)
  • Greedy + nucleus (top-p) sampling configurable for short or long generation

Confirmed weak

  • Math word problems — 1.14% on GSM8K. Format is learned, arithmetic is not
  • Single-token common words for spelling — "strawberry" → 2 r's (real answer: 3); tokenizer hides character-level information for words that fit in a single BPE token
  • Niche factual recall — confabulates confidently on rare entities, exact dates, specific quotations
  • Long multi-turn conversations — context drifts after ~2-3 turns

Limitations (architectural)

  • 117M parameters — knowledge density is the ceiling, not the architecture
  • No tool use, no internet, no images, no memory across sessions
  • 2048-token context — quality degrades past ~1500 tokens without repetition penalty
  • No RLHF — outputs reflect only supervised signal; may produce inappropriate completions
  • English only — pretraining corpus is essentially English educational/web text
  • Repetition prone in long generations without --repetition-penalty or --top-p

Usage

Direct loading

import torch
from tokenizers import Tokenizer
from model import GPT

tokenizer = Tokenizer.from_file('tokenizer.json')
ckpt = torch.load('model.pt', map_location='cuda')

config = dict(ckpt['config'])
config['vocab_size'] = ((tokenizer.get_vocab_size() + 63) // 64) * 64
model = GPT.from_config(config).cuda().eval()

state = {k.removeprefix('_orig_mod.'): v for k, v in ckpt['model'].items()}
model.load_state_dict(state, strict=False)
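Two details of the loading snippet are easy to miss: the vocabulary size is rounded up to a multiple of 64, and the `_orig_mod.` prefix that `torch.compile` adds to parameter names is stripped before loading. The helpers below isolate both steps without requiring PyTorch; their names are hypothetical, and `str.removeprefix` needs Python 3.9 or later.

```python
def pad_vocab_size(vocab_size, multiple=64):
    """Round the vocabulary size up to the next multiple (64 in the snippet)."""
    return ((vocab_size + multiple - 1) // multiple) * multiple


def strip_compile_prefix(state_dict, prefix="_orig_mod."):
    """Drop the prefix that torch.compile adds to parameter names."""
    return {key.removeprefix(prefix): value for key, value in state_dict.items()}


print(pad_vocab_size(32_768))  # 32768 (already a multiple of 64)
print(pad_vocab_size(32_769))  # 32832
print(strip_compile_prefix({"_orig_mod.wte.weight": 1, "lm_head.weight": 2}))
# {'wte.weight': 1, 'lm_head.weight': 2}
```

Since the loading snippet passes `strict=False`, mismatched parameter names are skipped silently; inspecting the missing and unexpected keys returned by `load_state_dict` is a cheap way to confirm the checkpoint actually loaded.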

Chat CLI (recommended)

python3 chat_cli.py                   # interactive REPL
python3 chat_cli.py -p "Who are you?"  # one-shot

The chat CLI handles the chat-format token wrapping (<|bos|><|user_start|> …) and stops generation cleanly on <|assistant_end|>. State is cached across turns via the recurrent state buffer — only the new tokens of each user message are prefilled, giving roughly 5–10× faster prefill on multi-turn conversations than re-processing the entire history.
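A rough sketch of the wrapping and of the incremental prefill idea, operating on strings rather than token IDs for readability. The special-token names come from this card; the exact turn layout beyond the `<|bos|><|user_start|>` prefix, and both function names, are assumptions rather than the CLI's actual code.

```python
def render_conversation(turns):
    """turns: list of (user_text, assistant_text or None for a pending reply)."""
    out = "<|bos|>"
    for user_text, assistant_text in turns:
        out += f"<|user_start|>{user_text}<|user_end|><|assistant_start|>"
        if assistant_text is not None:
            out += f"{assistant_text}<|assistant_end|>"
    return out


def new_suffix(previous, current):
    """Return only the part of `current` that was not already processed."""
    if not current.startswith(previous):
        raise ValueError("history changed; the cached state cannot be reused")
    return current[len(previous):]


first_prompt = render_conversation([("Hi", None)])
after_reply = render_conversation([("Hi", "Hello!")])
second_prompt = render_conversation([("Hi", "Hello!"), ("Who are you?", None)])

print(first_prompt)
# <|bos|><|user_start|>Hi<|user_end|><|assistant_start|>
print(new_suffix(after_reply, second_prompt))
# <|user_start|>Who are you?<|user_end|><|assistant_start|>
```

With a recurrent state, everything before the suffix is already summarized in the fixed-size buffers, so only the suffix needs to be run through the model on the next turn.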

Raw inference (no chat format)

python3 infer.py -p "Photosynthesis is the process by which" --top-p 0.9 -r 1.15

Recommended sampling parameters (empirically tuned, see training log):

  • Greedy / factual probes: -t 0
  • Short prose (≤500 tok): -t 0.8 -k 50
  • Long prose (500–2000 tok): -t 0.8 -k 50 --top-p 0.9 -r 1.15 (anti-loop)
  • Diverse creative writing: -t 0.9 --top-p 0.85 -r 1.1
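To make the knobs above concrete, here is a minimal pure-Python sketch of temperature, top-k, nucleus (top-p) and repetition-penalty sampling over a list of logits. It is an illustration only: the function name is hypothetical, the repetition penalty follows the common divide-positive, multiply-negative convention, and whether `infer.py` implements `-r` the same way is an assumption.

```python
import math
import random


def sample_next(logits, history, temperature=0.8, top_k=50, top_p=0.9,
                repetition_penalty=1.15, rng=random):
    logits = list(logits)
    # Repetition penalty: make tokens that were already generated less likely.
    for tok in set(history):
        if logits[tok] > 0:
            logits[tok] /= repetition_penalty
        else:
            logits[tok] *= repetition_penalty
    if temperature == 0:
        return max(range(len(logits)), key=logits.__getitem__)  # greedy
    # Top-k: keep the k highest-scoring candidates, best first.
    order = sorted(range(len(logits)), key=logits.__getitem__, reverse=True)
    if top_k:
        order = order[:top_k]
    # Temperature-scaled softmax over the survivors (shifted for stability).
    peak = logits[order[0]]
    weights = [math.exp((logits[i] - peak) / temperature) for i in order]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    if top_p is not None and top_p < 1.0:
        cumulative, keep = 0.0, 0
        for p in probs:
            cumulative += p
            keep += 1
            if cumulative >= top_p:
                break
        order, probs = order[:keep], probs[:keep]
    return rng.choices(order, weights=probs, k=1)[0]


print(sample_next([0.1, 2.0, 0.5], history=[], temperature=0))  # 1
print(sample_next([2.0, 1.9], history=[0], temperature=0))      # 1 (token 0 penalized)
```

The second call shows the anti-loop effect: token 0 has the higher raw logit, but after being divided by 1.15 it falls below token 1.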

Probe outputs (greedy, from the base checkpoint)

Run via python3 base_eval.py --eval sample against the pretrained checkpoint (model.pt, val 2.9508). Greedy, 64 tokens per completion.

| Prompt | First tokens of output | Verdict |
|---|---|---|
| The capital of France is | "...Paris, and it is the capital of France. The capital of France is Paris..." | ✓ Paris lands |
| The chemical symbol of gold is | "Au. It is a soft, silvery-white metal... good conductor of electricity and heat, making it useful in electrical wiring..." | ✓ Au + real applied claim |
| If yesterday was Friday, then tomorrow will be | "Tuesday. The weather is not so bad..." | ✗ (correct: Sunday) |
| The opposite of hot is | "the cold." | |
| The planets of the solar system are: | "Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune, Pluto..." | ✓ Correct astronomical order |
| My favorite color is | "red. It's a color that's been around for a long time..." | |
| If 5*x + 3 = 13, then x is | "a positive integer. If x is a positive integer, then x is a positive integer..." | ✗ Loop |
| Photosynthesis is the process by which | "plants and other organisms convert light energy into chemical energy. It is a complex process that involves the conversion of light energy into chemical energy..." | ✓ Factually correct opener |

5/7 of the original training probes land correct answers at greedy. Repetition is visible — the base model benefits substantially from --repetition-penalty 1.15 and/or --top-p 0.9 on longer generations (see Usage section).


Citation and acknowledgements

Built on top of karpathy/nanochat by Andrej Karpathy. The Stream Mixer architecture is an attention-free experiment swapping the standard Transformer block for a recurrent linear-time sequence mixer.

Pretraining data is karpathy/climbmix-400b-shuffle. SFT mixture sources: HuggingFaceTB/smol-smoltalk, cais/mmlu, allenai/ai2_arc, openai/gsm8k, and a custom 1000-conversation identity dataset.

@misc{mnemo2026,
  title={Mnemo: A Linear-Time Recurrent Language Model},
  author={Alvarado, Luis Miguel},
  year={2026},
  note={Built on karpathy/nanochat. Stream Mixer architecture.},
  howpublished={\url{https://github.com/<your-handle>/mnemo}}
}

License

MIT. Use freely. No warranty.

