CPUFlow v9.7: Memory-Enhanced Semi-Coherent Model

The best semi-coherent model in the CPUFlow series. It adds a RAM-Net sparse memory to the v5-LN cumsum backbone, lowering validation perplexity by 1.7 points (11.94 to 10.23) while keeping the baseline's semi-coherent generation.

Results

Metric        v5-LN (baseline)    v9.7 (memory-enhanced)
Val PPL       11.94               10.23
Parameters    2.0M                2.47M
Speed         7,833 tok/s         3,369 tok/s
Coherent?     Semi                Semi
NaN events    0                   0
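
The deltas quoted elsewhere in this card follow directly from the table. The short check below recomputes them; the variable names are illustrative, and the conversion to loss assumes the usual definition of perplexity as the exponential of mean per-token cross-entropy in nats, which this card does not state explicitly.

```python
import math

# Values copied from the results table above.
baseline = {"ppl": 11.94, "params_m": 2.0, "tok_s": 7833}
v97 = {"ppl": 10.23, "params_m": 2.47, "tok_s": 3369}

ppl_gain = baseline["ppl"] - v97["ppl"]                      # absolute PPL improvement
slowdown = baseline["tok_s"] / v97["tok_s"]                  # throughput ratio
param_growth = v97["params_m"] / baseline["params_m"] - 1.0  # relative parameter increase

# Assumption: PPL = exp(mean cross-entropy in nats per token).
loss_gain = math.log(baseline["ppl"]) - math.log(v97["ppl"])

print(f"PPL improvement:  {ppl_gain:.2f}")
print(f"Slowdown:         {slowdown:.2f}x")
print(f"Extra parameters: {param_growth:.1%}")
print(f"Loss reduction:   {loss_gain:.3f} nats/token")
```

Read this way, the memory sidepath buys its perplexity gain with roughly a quarter more parameters and a bit more than a twofold loss in throughput.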

Architecture

embed + CumStepPos → [RAMScanBlock × 6] → LayerNorm → tied output + FSP

RAMScanBlock:
  # Cumsum backbone (same as v5-LN)
  x_n = LayerNorm(x)
  h = W_proj(x_n)            # fused: d → 3k
  query, key, value = chunk(h, 3)
  key = sigmoid(key); value = tanh(value)
  scan_out = W_m(query * cumsum(key*value) / cumsum(key))

  # RAM-Net sparse memory sidepath
  addr = W_addr(x_n) → Product Softmax → Top-8 of 512 virtual slots
  mem_out = sparse_read_write(addr, x_n)
  merged = scan_out + W_mem_proj(mem_out)    # direct addition, no gate

  x = x + W_out(merged)
  x = x + ff_down(relu(ff_up(LayerNorm(x))))
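
The cumsum line in the pseudocode above amounts to a causal, gate-weighted running average of the values, scaled by the query. A minimal single-channel sketch in plain Python may make that concrete. It is not the training code: the learned projections (W_proj, W_m) are omitted, the function name cumsum_scan is hypothetical, and the eps guard on the division is an addition that the pseudocode does not show.

```python
import math
from itertools import accumulate


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def cumsum_scan(query, key_raw, value_raw, eps=1e-6):
    """Causal gated running average for one channel.

    query, key_raw, value_raw: equal-length lists of floats, one per token.
    Returns query[t] * sum_{i<=t}(k_i * v_i) / sum_{i<=t}(k_i) for each t.
    """
    key = [sigmoid(z) for z in key_raw]        # gate weights in (0, 1)
    value = [math.tanh(z) for z in value_raw]  # bounded values in (-1, 1)
    num = accumulate(k * v for k, v in zip(key, value))  # running sum of k*v
    den = accumulate(key)                                 # running sum of k
    # eps is an assumption added here to guard the division; the source
    # pseudocode divides by cumsum(key) directly.
    return [q * n / (d + eps) for q, n, d in zip(query, num, den)]


out = cumsum_scan([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], [0.0, 1.0, -1.0])
# Changing only the last token leaves the earlier outputs untouched (causality).
out_changed = cumsum_scan([1.0, 1.0, 1.0], [0.0, 0.0, 5.0], [0.0, 1.0, 3.0])
print(out[:2] == out_changed[:2])
```

Because every position reads from the same running sums, information about different entities is averaged into one state, which is consistent with the blending effect noted under Limitations.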

Generation Samples

Prompt: "Lily and Tim went to the park. They"

...They saw many kids playing near the back house. They went up to a tree and gave them to their dad. They were very happy. After a while, they saw a big pile of ants. It was not a normal day. They did not want to play hide behind. Tim and his friends were scared, but they did not want to go home.

Prompt: "There was a little girl named Lily. She loved to play with her friends. One day"

...she put her shoes in the park. In the park, Lily saw a big lock on the ground. She wanted to open it. She tried to open the key, but it was too small. She tried to unlock the door open, but she could not.

Limitations

  • Semi-coherent at best. Named characters and pronoun tracking work early, but coherence breaks down ~100 tokens in.
  • "She tried to open the key": semantic confusion from cumsum state blending.
  • Story drifts between scenes with no transition (park → church).
  • 2.3x slower than v5-LN baseline due to memory overhead.
  • Trained on TinyStories only: children's vocabulary, no general knowledge.

Key Finding

Sparse memory (RAM-Net Product Softmax, 512 slots, Top-8) improves PPL by 1.7 points as a parameter-efficient capacity expansion. It does NOT solve entity tracking: at 2.5M params the model sits far below the binding threshold (~160M params), so entity-specific addressing is impossible. The memory just adds raw capacity.
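
For readers unfamiliar with sparse slot addressing, the sketch below shows one plausible reading of "Product Softmax, 512 slots, Top-8". It is an illustration under assumptions, not the RAM-Net implementation: it assumes the address is factorised into two small softmaxes whose outer product defines a distribution over all virtual slots, it assumes a 32 x 16 split, it renormalises the selected weights, and it enumerates every product for clarity. The name top_k_slots is hypothetical.

```python
import heapq
import math
import random


def softmax(logits):
    m = max(logits)                          # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_slots(row_logits, col_logits, k=8):
    """Return [(slot_index, weight)] for the k most probable virtual slots.

    Assumption: slot (r, c) has probability softmax(row)[r] * softmax(col)[c],
    so len(row_logits) * len(col_logits) slots are addressed using only
    len(row_logits) + len(col_logits) logits.
    """
    p_row, p_col = softmax(row_logits), softmax(col_logits)
    n_col = len(p_col)
    scored = (
        (pr * pc, r * n_col + c)
        for r, pr in enumerate(p_row)
        for c, pc in enumerate(p_col)
    )
    best = heapq.nlargest(k, scored)         # sorted by probability, descending
    mass = sum(p for p, _ in best)
    # Renormalise so the k read/write weights sum to 1 (also an assumption).
    return [(slot, p / mass) for p, slot in best]


rng = random.Random(0)
rows = [rng.gauss(0.0, 1.0) for _ in range(32)]  # 32 x 16 = 512 slots (assumed split)
cols = [rng.gauss(0.0, 1.0) for _ in range(16)]
selected = top_k_slots(rows, cols)
print(selected)
```

Only 8 of the 512 slots are touched per token in this picture, which is why the memory can add capacity cheaply; nothing in the addressing ties a slot to a particular entity.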

Usage

import torch
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")
checkpoint = torch.load("best.pt", map_location="cpu")
# Build model (see train_cpuflow_v97_simple_memory.py for full architecture)
# Generate with temperature=0.8
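
The snippet above stops short of the sampling step. As a rough illustration of what generating with temperature=0.8 means, the following standalone sketch samples a token id from temperature-scaled logits. The function name sample_with_temperature is hypothetical and the logits are made up; the actual generation loop lives in the training script referenced above.

```python
import math
import random


def sample_with_temperature(logits, temperature=0.8, rng=random):
    """Sample a token id from softmax(logits / temperature)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)                              # subtract max for numerical stability
    weights = [math.exp(z - m) for z in scaled]  # unnormalised probabilities
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


rng = random.Random(0)
ids = [sample_with_temperature([0.0, 0.0, 10.0], 0.8, rng) for _ in range(200)]
print(ids.count(2), "of", len(ids), "samples picked the highest-logit token")
```

Temperatures below 1 sharpen the distribution toward the most likely tokens, which tends to trade variety for stability in small models like this one.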

See GitHub for full training code.

Citation

@misc{Chang,
  title        = {FlashLM: CPU-Native Language Models Trained From Scratch on Free-Tier Hardware},
  author       = {Chang, Cheng},
  year         = {2026},
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.20113960}
}

MIT License.
