CPUFlow v8 — Discrete State Streams

Best validation perplexity (PPL) in the CPUFlow series (9.30), but the generated text is incoherent. Uses hard argmax routing with a straight-through estimator (STE).

Results

Metric	Value
Val PPL	9.30
Parameters	2.2M
Training speed	~7K tok/s
Training time	2 hours
Hardware	4 vCPU (Lightning AI free tier)
Coherent?	No
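
As a rough sanity check on the figures above, throughput and wall-clock time together give an estimate of how many token positions were processed. This sketch assumes the ~7K tok/s rate held for the full 2 hours, which is not stated explicitly; the count may include repeated passes over the same data, and the variable names are illustrative only.

```python
# Rough estimate of token positions processed during training.
# Assumption: the ~7K tok/s rate was sustained for the entire 2-hour run.
tokens_per_second = 7_000
training_hours = 2
parameter_count = 2.2e6  # 2.2M parameters, from the table

tokens_processed = tokens_per_second * training_hours * 3600
tokens_per_parameter = tokens_processed / parameter_count

print(f"~{tokens_processed / 1e6:.1f}M tokens processed")
print(f"~{tokens_per_parameter:.0f} tokens per parameter")
```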

Architecture

embed + CumStepPos → [RouteBlock × 6] → LayerNorm → tied output + FSP

RouteBlock:
  x_n = LayerNorm(x)
  slot_id = argmax(W_route(x_n))    # hard routing via STE
  slot_state = read_write(slot_id, x_n)
  x = x + W_out(slot_state)
  x = x + ff_down(relu(ff_up(LayerNorm(x))))

Each token routes to exactly one of 32 memory slots via argmax. Because argmax has zero gradient almost everywhere, a straight-through estimator (STE) is used so that gradients can still flow through the discrete choice.
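
The routing step can be sketched in plain Python as follows. This is a minimal illustration, not the code in train_cpuflow_v8_discrete.py: it assumes the softmax-surrogate variant of the straight-through estimator (the source does not say which variant v8 uses), it uses 4 slots instead of 32 for brevity, and the function names are hypothetical. In PyTorch the same idea is commonly written as `one_hot + probs - probs.detach()`.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_forward(logits):
    """Forward pass: hard one-hot slot choice via argmax."""
    slot_id = max(range(len(logits)), key=lambda i: logits[i])
    one_hot = [1.0 if i == slot_id else 0.0 for i in range(len(logits))]
    return slot_id, one_hot

def route_backward(logits, grad_one_hot):
    """Backward pass (straight-through): treat the forward output as if it
    had been softmax(logits) and apply the softmax Jacobian to the gradient."""
    probs = softmax(logits)
    dot = sum(p * g for p, g in zip(probs, grad_one_hot))
    return [p * (g - dot) for p, g in zip(probs, grad_one_hot)]

# Routing scores for 4 hypothetical slots (v8 uses 32).
logits = [0.2, 1.5, -0.3, 0.9]
slot_id, one_hot = route_forward(logits)

# Suppose the loss gradient with respect to the one-hot output favors slot 1.
grad_logits = route_backward(logits, [0.0, 1.0, 0.0, 0.0])

print(slot_id, one_hot)
print(grad_logits)
```

The forward pass commits to a single slot, yet every routing score receives a nonzero gradient, so slots that were not chosen can still be adjusted during training.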

Why incoherent?

v8 has the best PPL (9.30) but produces word salad. The hard slot routing disrupts the continuous context representation that the cumsum backbone provides. PPL measures average next-token prediction accuracy, not the logical coherence of generated text. By contrast, v5-LN (PPL 11.94) generates more coherent text because its cumsum maintains a running summary of all past tokens.
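
To make the distinction concrete, here is a small sketch of how perplexity is typically computed, assuming PPL here is the standard exponentiated mean negative log-likelihood (the formula is not spelled out above). The probabilities in the example are invented for illustration. Each term scores one next-token prediction given the true preceding text, so nothing in the metric checks whether a sequence sampled from the model hangs together.

```python
import math

def perplexity(target_probs):
    """exp of the mean negative log-probability assigned to the correct tokens."""
    nll = [-math.log(p) for p in target_probs]
    return math.exp(sum(nll) / len(nll))

# Hypothetical probabilities a model assigned to each true next token.
# Assigning 0.5 to every correct token gives a perplexity of about 2.
print(perplexity([0.5, 0.5, 0.5, 0.5]))

# Mean loss in nats per token implied by a validation PPL of 9.30 (about 2.23).
print(math.log(9.30))
```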

Limitations

Completely incoherent generation despite best PPL
PPL 9.30 is misleading — this model cannot tell a coherent story
Hard routing prevents smooth context flow between positions
This is a research model demonstrating the PPL ≠ coherence finding

Usage

import torch
from tokenizers import Tokenizer

# Load the tokenizer and the trained checkpoint onto CPU
tokenizer = Tokenizer.from_file("tokenizer.json")
checkpoint = torch.load("best.pt", map_location="cpu")
# Build model (see train_cpuflow_v8_discrete.py for full architecture)

See GitHub for full training code.

Citation

@misc{Chang,
  title        = {FlashLM: CPU-Native Language Models Trained From Scratch on Free-Tier Hardware},
  author       = {Chang, Cheng},
  year         = {2026},
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.20113960}
}

MIT License.
