Collected Results
All measurements below are from actual runs; GPU, settings, and seed are noted for each table.
Table 1: Original Results (Apple MPS, provided by the author)
Config: 6 layers, chunk_size=64, B=8, T=256, 10% active, 2000 steps
| d_model | Run | Time (s) | ms/step | Val Loss |
|---|---|---|---|---|
| 512 | dense_baseline | 74.77 | 99.70 | 5.3142 |
| 512 | sparse_full_dX | 91.04 | 121.38 | 5.4141 |
| 512 | sparse_sparse_dX | 93.33 | 124.44 | 5.5467 |
| 2048 | dense_baseline | 1035.84 | 591.91 | 6.0264 |
| 2048 | sparse_full_dX | 875.51 | 500.29 | 5.9807 |
| 2048 | sparse_sparse_dX | 847.22 | 484.13 | 6.0231 |
Observation: sparse is slower at d=512 (1.22x overhead) but faster at d=2048 (1.18x speedup for full_dX, 1.22x for sparse_dX). Val loss is comparable at d=2048, worse at d=512.
Table 2: Isolated Matmul Microbenchmark (T4, per single FFN layer)
Config: B=8, T=256 (M=2048), chunk_size=64, 10% active, fp32, 100 iterations
| d_model | FFN dim | Params | Fwd (ms) | dX (ms) | dW_dense (ms) | dW_sparse (ms) | Total_dense (ms) | Total_sparse_full_dX (ms) | Speedup |
|---|---|---|---|---|---|---|---|---|---|
| 256 | 1024 | 0.3M | 0.27 | 0.21 | 0.27 | 0.26 | 0.75 | 0.74 | 1.02x |
| 384 | 1536 | 0.6M | 0.52 | 0.69 | 0.61 | 0.18 | 1.82 | 1.39 | 1.31x |
| 512 | 2048 | 1.0M | 1.00 | 1.01 | 0.97 | 0.26 | 2.99 | 2.28 | 1.31x |
| 768 | 3072 | 2.4M | 2.16 | 2.25 | 2.05 | 0.40 | 6.46 | 4.81 | 1.34x |
| 1024 | 4096 | 4.2M | 3.69 | 3.90 | 3.35 | 0.59 | 10.95 | 8.18 | 1.34x |
| 1536 | 6144 | 9.4M | 10.33 | 9.03 | 8.14 | 1.30 | 27.50 | 20.66 | 1.33x |
| 2048 | 8192 | 16.8M | 14.76 | 15.57 | 13.19 | 1.93 | 43.51 | 32.26 | 1.35x |
Amdahl ceiling (if dW were free): sparsity only touches the dW matmul, so the best case is Total_dense / (Fwd + dX), which works out to ~1.42–1.48x across these rows. Crossover point: d_model ≈ 384.
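Quick sanity check of that ceiling from the Table 2 rows (pure arithmetic, no assumptions beyond the table):

```python
# Amdahl ceiling from Table 2: sparsity only removes the dW matmul, so the
# best case is total_dense / (fwd + dX). Rows: (d_model, fwd, dX, total_dense).
for d, fwd, dx, total in [(1024, 3.69, 3.90, 10.95), (2048, 14.76, 15.57, 43.51)]:
    print(f"d={d}: ceiling = {total / (fwd + dx):.2f}x")
# d=1024: ceiling = 1.44x
# d=2048: ceiling = 1.43x
```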
Table 3: Triton Kernel Correctness (T4)
| d_in | d_out | chunk_size | dW max_err | dBias max_err | dX max_err | Status |
|---|---|---|---|---|---|---|
| 512 | 2048 | 64 | 0.000320 | 0.000023 | 0.000042 | ✓ |
| 1024 | 4096 | 64 | 0.000443 | 0.000021 | 0.000092 | ✓ |
| 256 | 1024 | 32 | 0.000275 | 0.000038 | 0.000019 | ✓ |
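A minimal sketch of how such a check can be reproduced, assuming (as the Table 4 active-chunk counts suggest) that chunking is over the FFN hidden dimension. The pure-PyTorch chunked loop below stands in for the Triton kernel under test, so the snippet is self-contained; it is not the project's actual harness.

```python
import torch

# Sketch of the Table 3 methodology (assumed): compare chunk-sparse gradients
# against a dense fp32 reference and report the max absolute error.
# Requires a CUDA device.
def check_dw(d_in=512, d_out=2048, chunk_size=64, M=2048, density=0.10):
    dev = "cuda"
    x = torch.randn(M, d_in, device=dev)
    w = torch.randn(d_in, d_out, device=dev)
    g = torch.randn(M, d_out, device=dev)                  # upstream gradient

    n_chunks = d_out // chunk_size
    keep = torch.rand(n_chunks, device=dev) < density      # active chunk mask
    g = g * keep.repeat_interleave(chunk_size).float()     # zero inactive chunks

    dW_ref = x.T @ g                                       # dense reference
    dW = torch.zeros_like(w)
    for c in keep.nonzero().flatten().tolist():            # chunked candidate
        s = slice(c * chunk_size, (c + 1) * chunk_size)
        dW[:, s] = x.T @ g[:, s]

    print("dW max_err:", (dW - dW_ref).abs().max().item())

check_dw()  # dBias (g.sum(0)) and dX (g @ w.T) are checked the same way
```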
Table 4: Triton vs PyLoop vs Dense — Isolated Backward (T4)
Config: M=2048, chunk_size=64, 10% active, full_dX mode (dW sparse, dX dense), 50 iterations after warmup
| d_model | FFN dim | Active chunks | Dense (ms) | PyLoop (ms) | Triton (ms) | Speedup vs Dense | Speedup vs PyLoop |
|---|---|---|---|---|---|---|---|
| 256 | 1024 | 1 | 0.39 | 0.40 | 0.46 | 0.85x | 0.88x |
| 512 | 2048 | 3 | 1.96 | 1.30 | 1.16 | 1.69x | 1.12x |
| 768 | 3072 | 4 | 4.29 | 2.52 | 2.51 | 1.70x | 1.00x |
| 1024 | 4096 | 6 | 7.29 | 4.37 | 4.30 | 1.70x | 1.02x |
| 1536 | 6144 | 9 | 17.32 | 10.04 | 9.78 | 1.77x | 1.03x |
| 2048 | 8192 | 12 | 29.14 | 17.20 | 16.89 | 1.73x | 1.02x |
Triton with both dW and dX sparse (sparse_dX mode):
| d_model | Dense (ms) | Triton_all (ms) | Speedup |
|---|---|---|---|
| 512 | 1.96 | 0.41 | 4.83x |
| 1024 | 7.06 | 1.07 | 6.58x |
| 2048 | 29.00 | 3.71 | 7.81x |
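Mechanically, the extra savings come from also restricting dX to the active chunks: only the active columns of the upstream gradient and the matching rows of W.T enter the matmul. A minimal sketch of that gather, with assumed shapes and hypothetical names:

```python
import torch

# Sparse-dX gather (assumed semantics): `idx` holds the active chunk ids, so
# dX is built from ~10% of the columns of g and the matching rows of w.T.
def sparse_dx(g, w, idx, chunk_size=64):
    cols = (idx[:, None] * chunk_size
            + torch.arange(chunk_size, device=g.device)).flatten()
    return g[:, cols] @ w[:, cols].T        # (M, d_in)

# e.g. d=512: g (2048, 2048), w (512, 2048), 3 of 32 chunks active
g = torch.randn(2048, 2048, device="cuda")
w = torch.randn(512, 2048, device="cuda")
print(sparse_dx(g, w, torch.tensor([3, 17, 29], device="cuda")).shape)  # (2048, 512)
```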
Table 5: End-to-End Training (T4, 100 steps)
Config: 6 layers, 8 heads, B=8, T=256, chunk_size=64, 10% active, seed=42, AdamW lr=5e-4, full_dX mode
| d_model | Mode | ms/step | vs Dense | Val Loss |
|---|---|---|---|---|
| 512 | dense | 184.6 | 1.00x | 5.6954 |
| 512 | pyloop | 179.0 | 1.03x | 5.8683 |
| 512 | triton | 196.0 | 0.94x | 5.8683 |
| 1024 | dense | 451.5 | 1.00x | 5.5300 |
| 1024 | pyloop | 435.6 | 1.04x | 5.4803 |
| 1024 | triton | 441.0 | 1.02x | 5.4800 |
d=2048 does not fit on T4 (16GB). A10G results pending (job 69f3af45d2c8bd8662bd419d).
Note: Triton autotune overhead hurts at small scale. At d=512, with only a handful of active chunks per layer, fused kernels lose to PyTorch's already-optimized single-kernel launches.
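If the autotune sweep is the culprit, one standard mitigation is pinning a single tuned config so Triton skips its benchmarking pass on first launch. A toy illustration (not the project's kernel; whether this recovers the d=512 gap is untested):

```python
import torch
import triton
import triton.language as tl

# With exactly one config, Triton's autotuner has nothing to benchmark,
# removing the first-launch sweep that dominates at small d_model.
@triton.autotune(
    configs=[triton.Config({"BLOCK": 1024}, num_warps=4)],  # single config
    key=["n_elements"],
)
@triton.jit
def scale_kernel(x_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements
    tl.store(out_ptr + offs, tl.load(x_ptr + offs, mask=mask) * 2.0, mask=mask)

x = torch.randn(4096, device="cuda")
out = torch.empty_like(x)
grid = lambda meta: (triton.cdiv(4096, meta["BLOCK"]),)
scale_kernel[grid](x, out, 4096)
```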
Table 6: EMA Predictor Overlap (T4, 350 steps, seed=42)
Config: d=512, 6 layers, chunk_size=64, 10% active, measured every 25 steps after annealing (step ≥ 250)
| Step | Jaccard | Recall |
|---|---|---|
| 250 | 0.6000 | 0.7500 |
| 275 | 0.6552 | 0.7917 |
| 300 | 0.7778 | 0.8750 |
| 325 | 0.6000 | 0.7500 |
Single seed only. Full 3-seed results with 2000 steps pending from A10G job.
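For reference, the metric definitions assumed for Table 6 (predicted = chunks the EMA predictor selected; actual = chunks with the largest true gradient signal that step; names illustrative):

```python
# Jaccard = |pred ∩ true| / |pred ∪ true|; Recall = |pred ∩ true| / |true|.
def overlap_metrics(predicted: set, actual: set):
    inter = len(predicted & actual)
    return inter / len(predicted | actual), inter / len(actual)  # (Jaccard, Recall)

# e.g. predictor picks 4 chunks and hits 3 of the 4 true top chunks:
print(overlap_metrics({1, 5, 9, 20}, {1, 5, 9, 31}))  # (0.6, 0.75) — cf. step 250
```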
Table 7: Chunk-Size vs Speed (T4, 50 steps, timing only)
Config: d=512, 6 layers, 10% active, seed=42. Loss is identical across chunk sizes (only 50 steps, all still in warmup), so only timing is reported.
| Chunk Size | ms/step |
|---|---|
| 16 | 601.4 |
| 32 | 453.0 |
| 64 | 321.5 |
| 128 | 251.3 |
| 256 | 219.8 |
Larger chunks mean fewer Python-loop iterations and kernel launches, hence less overhead (see the sketch below). This is the PyLoop backend; Triton would show a different curve.
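The iteration counts behind that trend, assuming a 4x FFN as in Table 2 (d=512 → FFN dim 2048):

```python
# Sliced matmuls per linear layer at 10% density (floor, min 1, matching the
# active-chunk counts in Table 4). FFN width is an assumption.
ffn_dim, density = 2048, 0.10
for cs in [16, 32, 64, 128, 256]:
    n_chunks = ffn_dim // cs
    print(f"chunk_size={cs:3d}: {n_chunks:3d} chunks, "
          f"~{max(1, int(n_chunks * density)):2d} sliced matmuls")
# chunk_size=16 -> 12 sliced matmuls per linear; chunk_size=256 -> 1
```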
Pending Results (A10G jobs running)
| Job ID | Experiment | Status |
|---|---|---|
| 69f38371d70108f37ace1cae | Full 7-experiment suite (2000 steps, 3 seeds, all ablations) | Running |
| 69f395b3d70108f37ace1cee | Model-size scaling study (d=256→2048, 2000 steps, 2 seeds) | Running |
| 69f3af45d2c8bd8662bd419d | E2E training with Triton (d=512,1024,2048, 500 steps) | Running |
These will provide:
- Full baseline comparison: all 8 baselines with 3 seeds at 2000 steps (Dense, Random, EMA, EMA+sparse_dX, RigL, SET, TopK-SGD, Oracle)
- Compute-matched dense (same FLOPs) vs sparse
- Chunk-size ablation with loss numbers at 2000 steps
- Epsilon-greedy exploration sweep
- Attention sparsification results
- Sparsity level sweep (5%–100%)
- d=2048 end-to-end training with Triton