Support this work -> · X · GitHub · REAP paper · Cerebras REAP

Step-3.7-Flash-148B

REAP-pruned stepfun-ai/Step-3.7-Flash-NVFP4.

At a glance

Base model stepfun-ai/Step-3.7-Flash-NVFP4
Format NVFP4 expert weights with FP8 KV cache
Effective size ~148B
Parameters removed 50.21B
Experts / MoE layer 212 kept / 288 original
Experts pruned / MoE layer 76
MoE layers 42
Hidden size 4096
Context 262,144
On-disk size 95 GB

Which variant should I pick?

Variant Format Link
Step-3.7-Flash-173B NVFP4 link
Step-3.7-Flash-148B (this) NVFP4 link

148B effective parameters | REAP-pruned | private experimental checkpoint

This is the more aggressively pruned Step 3.7 Flash NVFP4 checkpoint. It removes about 50.21B parameters by keeping 212 of 288 routed experts per MoE layer.

The goal is a smaller Step 3.7 Flash derivative for fit and serving experiments. It should be treated as an experimental compression artifact until load, generation, coherence, and benchmark evidence are complete.

What this is

  • Base: stepfun-ai/Step-3.7-Flash-NVFP4
  • Pruning: REAP (Routing-Enhanced Activation Pruning)
  • Routed experts kept per MoE layer: 212
  • Routed experts pruned per MoE layer: 76
  • Quantization: NVFP4 with FP8 KV cache metadata
  • Architecture: Step 3.7 Flash vision-language sparse MoE
  • Intended serving path: vLLM/SGLang/Transformers paths that support Step 3.7 Flash remote code and ModelOpt NVFP4
  • Status: private experimental checkpoint; validate fit, generation, and benchmark behavior before production use

How the REAP checkpoint was made

REAP is a one-shot MoE compression method that uses router-weighted expert activation observations to rank experts by practical usefulness. The observation pass records per-layer routed expert activity under calibration prompts, then each MoE layer is pruned independently.

For this checkpoint:

  1. Start from the Step 3.7 Flash NVFP4 checkpoint.
  2. Run calibration data through the model and record router/expert activation observations.
  3. Aggregate expert scores with the reap_score metric.
  4. Keep the top routed experts per MoE layer.
  5. Rewrite the checkpoint with pruned expert tensors and updated routing metadata.

Embeddings, attention blocks, normalization, router gates, shared experts, selected routed experts, vision components, tokenizer files, and generation/config files are preserved. The prune_summary.json and layer_expert_metrics.parquet files in this repo contain the exact pruning map and expert metrics.

Calibration evidence

The pruning pass used Step 3.7 Flash REAP observation artifacts uploaded to:

  • Dataset: 0xSero/step-3.7-flash-reap-observations-v2
  • Manifest size: 24,576 samples
  • Uploaded observation frontier: 24,576 / 24,576
  • Aggregate rows used for this prune: 13,696
  • Aggregate tokens by sequence length: 101,735,663
  • Sources: open-r1/Mixture-of-Thoughts/{math,science,code} and SWE-bench/SWE-smith-trajectories/tool

Benchmark status

Terminal-Bench artifacts are uploaded separately to 0xSero/step37-prune-terminal-bench-artifacts.

Do not treat the current Terminal-Bench evidence as a final score. The available 50B-pruned diagnostic run was interrupted and the corrected rerun hit harness/client timeouts before score-bearing proxy rows. The artifacts are useful for debugging the benchmark path, not for claiming model quality.

Loading

Use trust_remote_code=True and a runtime that supports Step 3.7 Flash plus ModelOpt NVFP4. For vLLM, start from StepFun's Step 3.7-compatible image and adapt the base NVFP4 launch profile:

python3 -m vllm.entrypoints.openai.api_server \
  --model 0xSero/Step-3.7-Flash-148B \
  --served-model-name step3p7-flash-148b \
  --tensor-parallel-size 4 \
  --enable-expert-parallel \
  --trust-remote-code \
  --quantization modelopt \
  --kv-cache-dtype fp8 \
  --reasoning-parser step3p5 \
  --enable-auto-tool-choice \
  --tool-call-parser step3p5

Hardware fit is not guaranteed by the upload alone. Run a load smoke, generation smoke, and memory audit before longer evaluation or serving.

Limitations

  • This is an experimental private derivative, not an official StepFun release.
  • No full quality benchmark should be inferred from the pruning summary alone.
  • This variant is more aggressively pruned than Step-3.7-Flash-173B; expect higher quality risk until evaluated.
  • Some serving stacks may need patched Step 3.7 Flash support for the pruned expert count.
  • Model cards and manifests intentionally avoid hostnames, IPs, absolute local paths, and credentials.
Downloads last month
262
Safetensors
Model size
90B params
Tensor type
F32
·
BF16
·
F8_E4M3
·
U8
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for 0xSero/Step-3.7-Flash-148B

Quantized
(6)
this model

Paper for 0xSero/Step-3.7-Flash-148B