
Collected Results

All numbers below come from actual runs; the GPU, settings, and seed are noted for each table.


Table 1: Original Results (MPS, provided by the author)

Config: 6 layers, chunk_size=64, B=8, T=256, 10% active, 2000 steps

| d_model | Run | Time (s) | ms/step | Val Loss |
|---|---|---|---|---|
| 512 | dense_baseline | 74.77 | 99.70 | 5.3142 |
| 512 | sparse_full_dX | 91.04 | 121.38 | 5.4141 |
| 512 | sparse_sparse_dX | 93.33 | 124.44 | 5.5467 |
| 2048 | dense_baseline | 1035.84 | 591.91 | 6.0264 |
| 2048 | sparse_full_dX | 875.51 | 500.29 | 5.9807 |
| 2048 | sparse_sparse_dX | 847.22 | 484.13 | 6.0231 |

Observation: Sparse is slower at d=512 (1.22x overhead), faster at d=2048 (1.18x speedup for full_dX, 1.22x for sparse_dX). Quality comparable at d=2048, worse at d=512.
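
The overhead/speedup figures above are plain wall-clock ratios computed from the Time (s) column; a quick check:

```python
# Wall-clock ratios from Table 1.
# d=512: sparse carries overhead relative to dense.
print(91.04 / 74.77)     # ≈ 1.22x overhead for sparse_full_dX
# d=2048: sparse is faster than dense.
print(1035.84 / 875.51)  # ≈ 1.18x speedup for sparse_full_dX
print(1035.84 / 847.22)  # ≈ 1.22x speedup for sparse_sparse_dX
```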


Table 2: Isolated Matmul Microbenchmark (T4, per single FFN layer)

Config: B=8, T=256 (M=2048), chunk_size=64, 10% active, fp32, 100 iterations

| d_model | FFN dim | Params | Fwd (ms) | dX (ms) | dW_dense (ms) | dW_sparse (ms) | Total_dense (ms) | Total_sparse_full_dX (ms) | Speedup |
|---|---|---|---|---|---|---|---|---|---|
| 256 | 1024 | 0.3M | 0.27 | 0.21 | 0.27 | 0.26 | 0.75 | 0.74 | 1.02x |
| 384 | 1536 | 0.6M | 0.52 | 0.69 | 0.61 | 0.18 | 1.82 | 1.39 | 1.31x |
| 512 | 2048 | 1.0M | 1.00 | 1.01 | 0.97 | 0.26 | 2.99 | 2.28 | 1.31x |
| 768 | 3072 | 2.4M | 2.16 | 2.25 | 2.05 | 0.40 | 6.46 | 4.81 | 1.34x |
| 1024 | 4096 | 4.2M | 3.69 | 3.90 | 3.35 | 0.59 | 10.95 | 8.18 | 1.34x |
| 1536 | 6144 | 9.4M | 10.33 | 9.03 | 8.14 | 1.30 | 27.50 | 20.66 | 1.33x |
| 2048 | 8192 | 16.8M | 14.76 | 15.57 | 13.19 | 1.93 | 43.51 | 32.26 | 1.35x |

Amdahl ceiling (if dW were free): ~1.42–1.48x. Crossover point: d_model ≈ 384.
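
The Amdahl ceiling is what remains if the dW term is eliminated entirely while Fwd and dX stay dense; a quick check against the d_model=2048 row:

```python
# Amdahl-style ceiling: only dW is sparsified, so the best case drops dW to zero.
fwd, dx, dw_dense = 14.76, 15.57, 13.19   # ms, d_model = 2048 row of Table 2
total_dense = fwd + dx + dw_dense         # ≈ 43.52 ms (43.51 in the table after rounding)
print(total_dense / (fwd + dx))           # ≈ 1.43x, inside the ~1.42-1.48x range
```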


Table 3: Triton Kernel Correctness (T4)

| d_in | d_out | chunk_size | dW max_err | dBias max_err | dX max_err | Status |
|---|---|---|---|---|---|---|
| 512 | 2048 | 64 | 0.000320 | 0.000023 | 0.000042 | ✓ |
| 1024 | 4096 | 64 | 0.000443 | 0.000021 | 0.000092 | ✓ |
| 256 | 1024 | 32 | 0.000275 | 0.000038 | 0.000019 | ✓ |
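
The max_err columns are max absolute differences against a dense PyTorch reference on random inputs. A minimal sketch of that style of check; the reference gradients below are standard, while the Triton call is only a placeholder for whichever kernel wrapper is under test:

```python
import torch

M, d_in, d_out = 2048, 512, 2048
x = torch.randn(M, d_in, device="cuda")
w = torch.randn(d_in, d_out, device="cuda")
g = torch.randn(M, d_out, device="cuda")   # upstream gradient w.r.t. the layer output

# Dense reference gradients for y = x @ w + b.
dW_ref = x.t() @ g
dB_ref = g.sum(dim=0)
dX_ref = g @ w.t()

# dW_tri, dB_tri, dX_tri = <Triton backward under test>(x, w, g, active_chunks, chunk_size)
# for name, out, ref in [("dW", dW_tri, dW_ref), ("dBias", dB_tri, dB_ref), ("dX", dX_tri, dX_ref)]:
#     print(name, "max_err", (out - ref).abs().max().item())
```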

Table 4: Triton vs PyLoop vs Dense — Isolated Backward (T4)

Config: M=2048, chunk_size=64, 10% active, full_dX mode (dW sparse, dX dense), 50 iterations after warmup

| d_model | FFN dim | Active chunks | Dense (ms) | PyLoop (ms) | Triton (ms) | Dense / Triton | PyLoop / Triton |
|---|---|---|---|---|---|---|---|
| 256 | 1024 | 1 | 0.39 | 0.40 | 0.46 | 0.85x | 0.88x |
| 512 | 2048 | 3 | 1.96 | 1.30 | 1.16 | 1.69x | 1.12x |
| 768 | 3072 | 4 | 4.29 | 2.52 | 2.51 | 1.70x | 1.00x |
| 1024 | 4096 | 6 | 7.29 | 4.37 | 4.30 | 1.70x | 1.02x |
| 1536 | 6144 | 9 | 17.32 | 10.04 | 9.78 | 1.77x | 1.03x |
| 2048 | 8192 | 12 | 29.14 | 17.20 | 16.89 | 1.73x | 1.02x |

Triton with both dW and dX sparse:

| d_model | Dense (ms) | Triton_all (ms) | Speedup |
|---|---|---|---|
| 512 | 1.96 | 0.41 | 4.83x |
| 1024 | 7.06 | 1.07 | 6.58x |
| 2048 | 29.00 | 3.71 | 7.81x |
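
For context on what the two modes compute, here is a minimal PyTorch sketch of the chunked backward, assuming chunks are contiguous blocks of columns of the weight; the actual PyLoop and Triton backends may differ in details:

```python
import torch

def chunked_backward(x, w, g, active, chunk_size, sparse_dx=False):
    """Backward of h = x @ w with per-chunk sparsity along the output dimension.

    x: (M, d_in) input, w: (d_in, d_out) weight, g: (M, d_out) upstream gradient,
    active: list of active chunk indices along d_out.
    full_dX mode -> sparse_dx=False: dW only on active chunks, dX dense.
    all-sparse   -> sparse_dx=True:  dX also built only from active chunks.
    """
    dW = torch.zeros_like(w)
    for c in active:
        sl = slice(c * chunk_size, (c + 1) * chunk_size)
        dW[:, sl] = x.t() @ g[:, sl]        # only active chunks of dW are computed

    if sparse_dx:
        dX = torch.zeros_like(x)
        for c in active:
            sl = slice(c * chunk_size, (c + 1) * chunk_size)
            dX += g[:, sl] @ w[:, sl].t()   # only active chunks contribute to dX
    else:
        dX = g @ w.t()                      # dense dX (full_dX mode)
    return dW, dX
```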

Table 5: End-to-End Training (T4, 100 steps)

Config: 6 layers, 8 heads, B=8, T=256, chunk_size=64, 10% active, seed=42, AdamW lr=5e-4, full_dX mode

| d_model | Mode | ms/step | vs Dense | Val Loss |
|---|---|---|---|---|
| 512 | dense | 184.6 | 1.00x | 5.6954 |
| 512 | pyloop | 179.0 | 1.03x | 5.8683 |
| 512 | triton | 196.0 | 0.94x | 5.8683 |
| 1024 | dense | 451.5 | 1.00x | 5.5300 |
| 1024 | pyloop | 435.6 | 1.04x | 5.4803 |
| 1024 | triton | 441.0 | 1.02x | 5.4800 |

d=2048 does not fit on T4 (16GB). A10G results pending (job 69f3af45d2c8bd8662bd419d).

Note: Triton autotune overhead hurts at small scale. At d=512, with only a few active chunks per layer, the fused kernels lose to PyTorch's already well-optimized single-kernel launches.


Table 6: EMA Predictor Overlap (T4, 350 steps, seed=42)

Config: d=512, 6 layers, chunk_size=64, 10% active, measured every 25 steps after annealing (step ≥ 250)

| Step | Jaccard | Recall |
|---|---|---|
| 250 | 0.6000 | 0.7500 |
| 275 | 0.6552 | 0.7917 |
| 300 | 0.7778 | 0.8750 |
| 325 | 0.6000 | 0.7500 |

Single seed only; full 3-seed results at 2000 steps are pending from the A10G job.
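
Jaccard and recall here compare the predictor's chosen chunk set against the set actually selected by full scoring (the exact definition of the reference set is an assumption); a minimal sketch of the two metrics:

```python
def overlap_metrics(predicted, reference):
    """Jaccard and recall between two sets of active-chunk indices."""
    p, r = set(predicted), set(reference)
    union = p | r
    jaccard = len(p & r) / len(union) if union else 1.0
    recall = len(p & r) / len(r) if r else 1.0
    return jaccard, recall

# Illustrative arithmetic: with equal-size sets of 24 chunks sharing 18 members,
# jaccard = 18 / 30 = 0.60 and recall = 18 / 24 = 0.75 (cf. steps 250 and 325).
```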


Table 7: Chunk-Size vs Speed (T4, 50 steps, timing only)

Config: d=512, 6 layers, 10% active, seed=42. Loss is identical across chunk sizes (only 50 steps, all within warmup).

| Chunk size | ms/step |
|---|---|
| 16 | 601.4 |
| 32 | 453.0 |
| 64 | 321.5 |
| 128 | 251.3 |
| 256 | 219.8 |

Larger chunks mean fewer Python loop iterations and therefore less overhead. This is the PyLoop backend; the Triton backend would show a different curve.
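
A quick way to see the loop-count effect at this configuration (d=512, FFN dim 2048 from Table 2, 10% active; the exact rounding rule for the active count is an assumption):

```python
d_ffn, density = 2048, 0.10                     # d_model = 512 configuration
for chunk_size in (16, 32, 64, 128, 256):
    n_chunks = d_ffn // chunk_size
    active = max(1, round(density * n_chunks))  # chunks processed per layer per step
    print(f"chunk_size={chunk_size:>3}  chunks={n_chunks:>3}  active≈{active}")
# chunk_size=16 -> ~13 loop iterations per layer; chunk_size=256 -> ~1
```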


Pending Results (A10G jobs running)

| Job ID | Experiment | Status |
|---|---|---|
| 69f38371d70108f37ace1cae | Full 7-experiment suite (2000 steps, 3 seeds, all ablations) | Running |
| 69f395b3d70108f37ace1cee | Model-size scaling study (d=256→2048, 2000 steps, 2 seeds) | Running |
| 69f3af45d2c8bd8662bd419d | E2E training with Triton (d=512, 1024, 2048; 500 steps) | Running |

These will provide:

  • Full baseline comparison: all 8 baselines with 3 seeds at 2000 steps (Dense, Random, EMA, EMA+sparse_dX, RigL, SET, TopK-SGD, Oracle)
  • Compute-matched dense (same FLOPs) vs sparse
  • Chunk-size ablation with loss numbers at 2000 steps
  • Epsilon-greedy exploration sweep
  • Attention sparsification results
  • Sparsity level sweep (5%–100%)
  • d=2048 end-to-end training with Triton