CPUFlow v8: Discrete State Streams
Best validation PPL in the CPUFlow series (9.30), but generates incoherent text. Uses hard argmax routing with a straight-through estimator (STE).
Results
| Metric | Value |
|---|---|
| Val PPL | 9.30 |
| Parameters | 2.2M |
| Training speed | ~7K tok/s |
| Training time | 2 hours |
| Hardware | 4 vCPU (Lightning AI free tier) |
| Coherent? | No |
Architecture
embed + CumStepPos → [RouteBlock × 6] → LayerNorm → tied output + FSP
RouteBlock:
x_n = LayerNorm(x)
slot_id = argmax(W_route(x_n)) # hard routing via STE
slot_state = read_write(slot_id, x_n)
x = x + W_out(slot_state)
x = x + ff_down(relu(ff_up(LayerNorm(x))))
Each token routes to exactly one of 32 memory slots via argmax. Because argmax has no useful gradient on its own, a straight-through estimator (STE) passes gradients through the discrete choice during the backward pass.
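As a rough illustration of the hard routing described above, the sketch below shows one common way to write a straight-through estimator in PyTorch: the forward pass uses a one-hot argmax, while the backward pass uses the softmax gradient. It is not taken from train_cpuflow_v8_discrete.py. The names `ste_one_hot` and `slots`, the width `d_model = 16`, and the read-only slot lookup are assumptions for illustration, and the real `read_write` step may differ.

```python
import torch
import torch.nn.functional as F

def ste_one_hot(logits: torch.Tensor) -> torch.Tensor:
    """One-hot argmax in the forward pass, softmax gradient in the backward pass."""
    soft = F.softmax(logits, dim=-1)
    hard = F.one_hot(logits.argmax(dim=-1), num_classes=logits.size(-1)).to(soft.dtype)
    # Forward value equals `hard`; only `soft` carries gradient.
    return hard + soft - soft.detach()

torch.manual_seed(0)
d_model, n_slots = 16, 32                      # d_model is an assumed width
w_route = torch.nn.Linear(d_model, n_slots)    # plays the role of W_route
slots = torch.nn.Parameter(torch.randn(n_slots, d_model))  # hypothetical slot memory

x_n = torch.randn(2, 5, d_model)               # (batch, seq, d_model)
gate = ste_one_hot(w_route(x_n))               # (batch, seq, n_slots), one-hot per token
slot_state = gate @ slots                      # each token reads exactly one slot
slot_state.sum().backward()                    # gradients reach w_route through the STE
```

Without the `soft - soft.detach()` term, `w_route` would receive no gradient at all, since the one-hot tensor is built from integer indices.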
Why incoherent?
v8 has the best PPL (9.30) but produces word salad. The hard slot routing disrupts the continuous context representation that the cumsum backbone provides. PPL measures how well the model predicts each next token given the true preceding text, not whether its own generated text stays logically coherent. v5-LN (PPL 11.94) generates more coherent text because its cumsum maintains a running summary of all past tokens.
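To make the point about PPL concrete, here is a minimal sketch of how perplexity is usually computed: the exponential of the mean per-token cross-entropy against the true next tokens. It assumes natural-log cross-entropy averaged over all tokens. The helper name `perplexity` and the toy shapes are illustrative, not taken from the training script.

```python
import math
import torch
import torch.nn.functional as F

def perplexity(logits: torch.Tensor, targets: torch.Tensor) -> float:
    """exp(mean cross-entropy in nats) over all token positions."""
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    return math.exp(loss.item())

torch.manual_seed(0)
vocab = 100                                    # toy vocabulary size (assumption)
targets = torch.randint(0, vocab, (4, 16))     # (batch, seq) gold next tokens
uniform_logits = torch.zeros(4, 16, vocab)     # a model that knows nothing

# A uniform predictor has perplexity equal to the vocabulary size.
ppl_uniform = perplexity(uniform_logits, targets)
print(ppl_uniform)
```

Under the same assumption, a val PPL of 9.30 corresponds to a mean loss of about 2.23 nats per token. Nothing in this quantity looks at text the model generates from its own samples, which is why a low PPL and incoherent generation can coexist.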
Limitations
- Completely incoherent generation despite best PPL
- PPL 9.30 is misleading: this model cannot tell a coherent story
- Hard routing prevents smooth context flow between positions
- This is a research model demonstrating the PPL ≠ coherence finding
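The "smooth context flow" that the limitations contrast with hard routing can be illustrated with a causal running mean built on `torch.cumsum`. This is only a sketch of the general idea of a cumsum-based running summary. The function name `causal_running_mean` and the normalization by step count are assumptions, and the actual CPUFlow backbone may be defined differently.

```python
import torch

def causal_running_mean(x: torch.Tensor) -> torch.Tensor:
    """Mean of positions 0..t at each step t, for x of shape (batch, seq, dim)."""
    steps = torch.arange(1, x.size(1) + 1, dtype=x.dtype, device=x.device)
    return x.cumsum(dim=1) / steps.view(1, -1, 1)

torch.manual_seed(0)
x = torch.randn(1, 6, 4)
summary = causal_running_mean(x)

# Perturb only the last token: earlier summaries are unchanged (causality),
# and the final summary moves by 1/6 of the perturbation (smooth update).
x2 = x.clone()
x2[:, -1] += 1.0
summary2 = causal_running_mean(x2)
```

Every position sees a blend of all earlier tokens, and a new token shifts that blend gradually. A hard argmax route, by contrast, can switch a token to a different slot when the routing logits change only slightly.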
Usage
import torch
from tokenizers import Tokenizer
tokenizer = Tokenizer.from_file("tokenizer.json")
checkpoint = torch.load("best.pt", map_location="cpu")
# Build model (see train_cpuflow_v8_discrete.py for full architecture)
See GitHub for full training code.
Citation
@misc{Chang,
title = {FlashLM: CPU-Native Language Models Trained From Scratch on Free-Tier Hardware},
author = {Chang, Cheng},
year = {2026},
publisher = {Zenodo},
doi = {10.5281/zenodo.20113960}
}
MIT License.