CPUFlow v9.7: Memory-Enhanced Semi-Coherent Model
The best semi-coherent model in the CPUFlow series. It adds RAM-Net sparse memory to the v5-LN cumsum backbone, lowering validation perplexity by 1.7 points (11.94 to 10.23) without making generations less coherent than the baseline.
Results
| Metric | v5-LN (baseline) | v9.7 (memory-enhanced) |
|---|---|---|
| Val PPL | 11.94 | 10.23 |
| Parameters | 2.0M | 2.47M |
| Speed | 7,833 tok/s | 3,369 tok/s |
| Coherent? | Semi | Semi |
| NaN events | 0 | 0 |
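The trade-off in the table can be checked with a few lines of arithmetic. The snippet below is only a convenience sketch: the numbers are copied from the table, and the dictionary keys and variable names are illustrative, not part of the training code.

```python
# Values copied from the results table; key names are illustrative.
baseline = {"val_ppl": 11.94, "params_m": 2.0, "tok_per_s": 7833}
v97 = {"val_ppl": 10.23, "params_m": 2.47, "tok_per_s": 3369}

# Absolute and relative perplexity improvement (lower is better).
ppl_gain = baseline["val_ppl"] - v97["val_ppl"]
ppl_gain_pct = 100 * ppl_gain / baseline["val_ppl"]

# Throughput cost and parameter growth of the memory sidepath.
slowdown = baseline["tok_per_s"] / v97["tok_per_s"]
extra_params_pct = 100 * (v97["params_m"] / baseline["params_m"] - 1)

print(f"PPL improvement:  {ppl_gain:.2f} points ({ppl_gain_pct:.1f}% lower)")
print(f"Slowdown:         {slowdown:.2f}x")
print(f"Extra parameters: {extra_params_pct:.1f}%")
```

In short, roughly a quarter more parameters buys about 1.7 PPL at a bit more than twice the per-token cost.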
Architecture
embed + CumStepPos → [RAMScanBlock × 6] → LayerNorm → tied output + FSP
RAMScanBlock:
# Cumsum backbone (same as v5-LN)
x_n = LayerNorm(x)
h = W_proj(x_n) # fused: d → 3k
query, key, value = chunk(h, 3)
key = sigmoid(key); value = tanh(value)
scan_out = W_m(query * cumsum(key*value) / cumsum(key))
# RAM-Net sparse memory sidepath
addr = W_addr(x_n) → Product Softmax → Top-8 of 512 virtual slots
mem_out = sparse_read_write(addr, x_n)
merged = scan_out + W_mem_proj(mem_out) # direct addition, no gate
x = x + W_out(merged)
x = x + ff_down(relu(ff_up(LayerNorm(x))))
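To make the cumsum step concrete, here is a minimal pure-Python sketch of `query * cumsum(key*value) / cumsum(key)` for a single scalar channel. It is an illustration under simplifying assumptions, not the training code: the real block works on k-dimensional tensors and wraps the scan in `W_proj` and `W_m`, which are omitted here, and the function names `sigmoid` and `cumsum_scan` are made up for this example.

```python
import math
from itertools import accumulate

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def cumsum_scan(queries, keys, values):
    """Causal gated running average for one channel (hypothetical helper).

    out[t] = q[t] * sum_{i<=t} g[i] * v[i] / sum_{i<=t} g[i]
    with g = sigmoid(key) and v = tanh(value), as in the block above.
    """
    gates = [sigmoid(k) for k in keys]       # key = sigmoid(key), in (0, 1)
    vals = [math.tanh(v) for v in values]    # value = tanh(value), in (-1, 1)
    num = list(accumulate(g * v for g, v in zip(gates, vals)))
    den = list(accumulate(gates))            # strictly positive, so no division by zero
    return [q * n / d for q, n, d in zip(queries, num, den)]

# Toy sequence of three positions.
out = cumsum_scan([1.0, 2.0, -1.0], [0.0, 1.0, -2.0], [0.5, -0.3, 2.0])
print(out)
```

Because the ratio is a gate-weighted average of all earlier values, every past token is mixed into one running state. That is one plausible reading of the "cumsum state blending" mentioned under Limitations.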
Generation Samples
Prompt: "Lily and Tim went to the park. They"
...They saw many kids playing near the back house. They went up to a tree and gave them to their dad. They were very happy. After a while, they saw a big pile of ants. It was not a normal day. They did not want to play hide behind. Tim and his friends were scared, but they did not want to go home.
Prompt: "There was a little girl named Lily. She loved to play with her friends. One day"
...she put her shoes in the park. In the park, Lily saw a big lock on the ground. She wanted to open it. She tried to open the key, but it was too small. She tried to unlock the door open, but she could not.
Limitations
- Semi-coherent at best. Named characters and pronoun tracking work early, but coherence breaks down after about 100 tokens.
- "She tried to open the key" shows semantic confusion from cumsum state blending.
- Story drifts between scenes with no transition (park → church).
- 2.3x slower than v5-LN baseline due to memory overhead.
- Trained on TinyStories only: children's vocabulary, no general knowledge.
Key Finding
Sparse memory (RAM-Net Product Softmax, 512 slots, Top-8) improves PPL by 1.7 points as a parameter-efficient capacity expansion. It does NOT solve entity tracking: at 2.5M params the model sits far below the binding threshold (~160M params), so entity-specific addressing is impossible. The memory just adds raw capacity.
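The "Top-8 of 512" part of the addressing can be illustrated with a small sketch of a sparse read. This is a simplified stand-in, not the RAM-Net mechanism: it uses an ordinary softmax over the selected slots rather than the Product Softmax, it covers only the read half of `sparse_read_write`, and the random scores, the memory contents, `DIM`, and the name `sparse_read` are all assumptions made for the example.

```python
import math
import random

NUM_SLOTS, TOP_K = 512, 8   # from the text above
DIM = 4                     # illustrative slot width, not the real one

def sparse_read(scores, memory, k=TOP_K):
    """Read only the k highest-scoring slots, softmax-weighted over those k."""
    top = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]
    peak = max(scores[i] for i in top)                 # subtract max for stability
    weights = [math.exp(scores[i] - peak) for i in top]
    total = sum(weights)
    out = [0.0] * len(memory[0])
    for i, w in zip(top, weights):
        for j, m in enumerate(memory[i]):
            out[j] += (w / total) * m
    return top, out

rng = random.Random(0)
scores = [rng.gauss(0, 1) for _ in range(NUM_SLOTS)]   # stand-in for W_addr(x_n)
memory = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_SLOTS)]

slots, mem_out = sparse_read(scores, memory)
print(f"touched {len(slots)} of {NUM_SLOTS} slots ({100 * len(slots) / NUM_SLOTS:.2f}%)")
```

Each token touches only 8 of the 512 slots, which is why the memory can add capacity without a proportional increase in per-token compute.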
Usage
import torch
from tokenizers import Tokenizer
tokenizer = Tokenizer.from_file("tokenizer.json")
checkpoint = torch.load("best.pt", map_location="cpu")
# Build model (see train_cpuflow_v97_simple_memory.py for full architecture)
# Generate with temperature=0.8
See GitHub for full training code.
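The "Generate with temperature=0.8" step in the usage snippet can be sketched as below. It shows temperature sampling over a plain list of logits only; how the logits are produced depends on the model class in the training script, which is not reproduced here. The function name `sample_with_temperature` and its `rng` parameter are hypothetical.

```python
import math
import random

def sample_with_temperature(logits, temperature=0.8, rng=random):
    """Sample a token id from softmax(logits / temperature)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [l / temperature for l in logits]
    peak = max(scaled)                          # subtract max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    # random.choices normalizes the weights internally.
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

# Toy next-token logits over a 4-token vocabulary.
rng = random.Random(0)
next_id = sample_with_temperature([2.0, 1.0, 0.5, -1.0], temperature=0.8, rng=rng)
print(next_id)
```

Temperatures below 1.0 sharpen the distribution toward the most likely token, and values above 1.0 flatten it.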
Citation
@misc{Chang,
title = {FlashLM: CPU-Native Language Models Trained From Scratch on Free-Tier Hardware},
author = {Chang, Cheng},
year = {2026},
publisher = {Zenodo},
doi = {10.5281/zenodo.20113960}
}
MIT License.