CPUFlow v5-LN
A CPU-native language model built around a fused linear-attention cumsum, trained from scratch on a free-tier CPU in 2 hours. It serves as the coherent baseline for the CPUFlow series.
Results
| Metric | Value |
|---|---|
| Val PPL | 11.94 |
| Parameters | 2.0M |
| Training speed | 7,833 tok/s |
| Training time | 2 hours |
| Hardware | 4 vCPU (Lightning AI free tier) |
| NaN events | 0 |
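As a rough sanity check on the table, the arithmetic below converts the reported perplexity into a per-token loss and estimates how many tokens the run processed. It assumes that Val PPL is the exponential of the mean cross-entropy in nats (the usual convention, not confirmed here) and that the reported throughput held for the full 2 hours, so the token count is an upper-bound estimate rather than a logged figure.

```python
import math

val_ppl = 11.94            # "Val PPL" from the table
tok_per_s = 7_833          # "Training speed" from the table
train_seconds = 2 * 3600   # "Training time" of 2 hours

# Assumption: PPL = exp(mean cross-entropy in nats), so loss = ln(PPL).
val_loss_nats = math.log(val_ppl)

# Assumption: throughput was constant for the whole run.
tokens_seen = tok_per_s * train_seconds

print(f"val loss ~ {val_loss_nats:.2f} nats/token")
print(f"tokens processed ~ {tokens_seen / 1e6:.1f}M")
```

Under these assumptions the run saw roughly 56M tokens, about 28 tokens per parameter for a 2.0M-parameter model.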
Architecture
embed + CumStepPos → [ScanBlock × 6] → LayerNorm → tied output + FSP
ScanBlock:
x_n = LayerNorm(x)
h = W_proj(x_n) # fused: d → 3k
query, key, value = chunk(h, 3)
key = sigmoid(key); value = tanh(value)
scan_out = W_m(query * cumsum(key*value) / cumsum(key))
x = x + W_out(scan_out)
x = x + ff_down(relu(ff_up(LayerNorm(x))))
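The ScanBlock pseudocode above can be sketched as a PyTorch module. This is a minimal illustration, not the code from train_cpuflow_v5_ln.py: the shapes of `W_m` (k to k) and `W_out` (k to d), the use of two separate LayerNorms, the `eps` guard in the division, and the sizes `d=64, k=32, d_ff=128` are all assumptions made so the example runs.

```python
import torch
import torch.nn as nn


class ScanBlock(nn.Module):
    """Sketch of one ScanBlock; layer shapes are assumed, not taken from the repo."""

    def __init__(self, d: int, k: int, d_ff: int, eps: float = 1e-6):
        super().__init__()
        self.ln1 = nn.LayerNorm(d)
        self.ln2 = nn.LayerNorm(d)
        self.w_proj = nn.Linear(d, 3 * k)  # fused: d -> 3k
        self.w_m = nn.Linear(k, k)         # assumed k -> k
        self.w_out = nn.Linear(k, d)       # assumed k -> d
        self.ff_up = nn.Linear(d, d_ff)
        self.ff_down = nn.Linear(d_ff, d)
        self.eps = eps                     # assumed guard; sigmoid keeps the sum positive

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d)
        h = self.w_proj(self.ln1(x))
        query, key, value = h.chunk(3, dim=-1)
        key = torch.sigmoid(key)
        value = torch.tanh(value)
        # Cumulative sums run along the sequence axis, so position t
        # only sees positions <= t (causal by construction).
        num = torch.cumsum(key * value, dim=1)
        den = torch.cumsum(key, dim=1)
        scan_out = self.w_m(query * num / (den + self.eps))
        x = x + self.w_out(scan_out)
        x = x + self.ff_down(torch.relu(self.ff_up(self.ln2(x))))
        return x


block = ScanBlock(d=64, k=32, d_ff=128)  # hypothetical sizes
y = block(torch.randn(2, 16, 64))
print(y.shape)
```

Because `key` passes through a sigmoid, the ratio of the two cumulative sums is a running weighted average of `value`, which stays bounded in (-1, 1) regardless of sequence length.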
Generation Sample
Prompt: "Once upon a time"
Once upon a time, there was a little girl named Lily. She loved to collect the world around the forest. One day, while playing outside, she heard a noise. It was pretty and a small bush. Lily was curious.
Limitations
- Semi-coherent at best. Named characters and sentence structure work, but coherence breaks down ~100 tokens in.
- Trained on TinyStories only: children's vocabulary, no general knowledge.
- Val PPL of 11.94 is worse than v8's 9.30, yet generation is more coherent. PPL ≠ coherence.
Usage
import torch
from tokenizers import Tokenizer
tokenizer = Tokenizer.from_file("tokenizer.json")
checkpoint = torch.load("best.pt", map_location="cpu")
# Build model (see train_cpuflow_v5_ln.py for full architecture)
# Generate with temperature=0.8
See GitHub for full training code.
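The "Generate with temperature=0.8" step can be sketched as a plain sampling loop. This is an illustrative sketch only: `sample` and `logits_fn` are hypothetical names, and `logits_fn` stands in for the built model, assumed to map token ids of shape (1, seq) to logits of shape (1, seq, vocab). The real model's call signature may differ.

```python
import torch


@torch.no_grad()
def sample(logits_fn, ids: torch.Tensor, max_new_tokens: int = 50,
           temperature: float = 0.8) -> torch.Tensor:
    """Autoregressive temperature sampling.

    logits_fn: hypothetical callable, (1, seq) ids -> (1, seq, vocab) logits.
    ids:       prompt token ids, shape (1, seq).
    """
    for _ in range(max_new_tokens):
        # Take the logits at the last position and sharpen them (T < 1).
        logits = logits_fn(ids)[:, -1, :] / temperature
        probs = torch.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)  # shape (1, 1)
        ids = torch.cat([ids, next_id], dim=1)
    return ids


# Stand-in for the real model so the sketch runs: random logits, vocab of 100.
dummy_logits_fn = lambda ids: torch.randn(ids.shape[0], ids.shape[1], 100)
out = sample(dummy_logits_fn, torch.tensor([[1, 2, 3]]), max_new_tokens=5)
print(out.shape)
```

With the real model and tokenizer loaded, the prompt ids would come from `tokenizer.encode("Once upon a time").ids` and the result would be decoded with `tokenizer.decode(out[0].tolist())`.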
Citation
@misc{Chang,
title = {FlashLM: CPU-Native Language Models Trained From Scratch on Free-Tier Hardware},
author = {Chang, Cheng},
year = {2026},
publisher = {Zenodo},
doi = {10.5281/zenodo.20113960}
}
MIT License.