Step-3.7-Flash — MXFP4 (mixed precision)

A 4-bit MXFP4 quantization of Step-3.7-Flash, produced with qstream. The routed MoE experts (≈95% of the weights) are quantized to MXFP4; everything quality-sensitive stays BF16. The original model card follows in full below.

Size: 123 GB (down from the 212.5 GB FP8 source, i.e. about 58% of its size)
Format: compressed-tensors mixed-precision (E2M1 4-bit experts + E8M0 group-32 scales, BF16 rest)
Base: Step-3.7-Flash (198B sparse MoE VLM, 288 experts top-8 + 1 shared, sigmoid routing, 45 layers + 3 MTP)

What is quantized to what

Component | Precision | Why
Routed experts (model.layers.*.moe.experts.*) | MXFP4 (4-bit) | ~95% of the weights; the only place worth the size win
Attention, dense MLP, shared expert, router gate | BF16 | sensitive / runs on every token; kept lossless from the source
MTP / next-token-prediction layers (45–47) | BF16 | speculative-decoding draft path; unquantized (as in the FP8/NVFP4 releases)
Embeddings, lm_head, vision encoder, projector, norms | BF16 | unchanged

The source Step-3.7-Flash-FP8 release already keeps attention / dense MLP / MTP in BF16 (only the experts are block-FP8); this checkpoint re-quantizes those experts from block-FP8 to MXFP4 and passes the BF16 remainder through unchanged.
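
The split above can be sanity-checked with a rough size estimate. The sketch below assumes 198e9 total parameters, exactly 95% of them in routed experts, and no file or metadata overhead; none of these figures are measured from the checkpoint, and the function names are illustrative only.

```python
# Back-of-the-envelope size estimate for a mixed MXFP4/BF16 checkpoint.
# Assumptions (not measured): 198e9 parameters, 95% in routed experts,
# no container or metadata overhead.

MXFP4_ELEMENT_BITS = 4   # one E2M1 value per weight
MXFP4_SCALE_BITS = 8     # one E8M0 scale ...
MXFP4_GROUP_SIZE = 32    # ... shared by a group of 32 weights
BF16_BITS = 16


def mxfp4_bits_per_weight() -> float:
    """Effective storage cost of one MXFP4 weight, including its share of the scale."""
    return MXFP4_ELEMENT_BITS + MXFP4_SCALE_BITS / MXFP4_GROUP_SIZE


def estimate_size_gb(total_params: float, expert_fraction: float) -> float:
    """Estimated checkpoint size in decimal gigabytes."""
    expert_bits = total_params * expert_fraction * mxfp4_bits_per_weight()
    other_bits = total_params * (1.0 - expert_fraction) * BF16_BITS
    return (expert_bits + other_bits) / 8 / 1e9


print(f"MXFP4 bits per weight: {mxfp4_bits_per_weight():.2f}")
print(f"estimated size: {estimate_size_gb(198e9, 0.95):.1f} GB")
```

Under these assumptions the estimate lands near 120 GB, in the same range as the reported 123 GB; the gap is plausibly due to the approximate 95% figure and file overhead.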

Quality (this checkpoint, served on vLLM)

We report deterministic, reproducible faithfulness metrics rather than downstream task scores. Task evals served over vLLM with continuous batching + MTP are not bitwise-deterministic: the same prompt at temperature=0 can produce different outputs because the floating-point reduction order depends on batch composition. Small-sample accuracies are therefore noisy, and we don't quote them.

Metric | Result | What it shows
Perplexity (clean English) | 6.52 | language modeling intact; a broken quant lands in the hundreds
Routed-expert SQNR | ≈ 19 dB | reconstruction error is just the unavoidable 4-bit rounding (MXFP4 vs the block-FP8 source)

Why this is enough to trust the checkpoint:

  • Only the routed experts changed. ~95% of the weights are re-quantized to MXFP4; everything else is bit-identical BF16 to the source (attention, dense MLP, shared expert, router, MTP, the entire vision stack). So the model is the base model except for 4-bit rounding on the expert GEMMs.
  • The math path is verified. The 2D-linear and 3D-MoE dequant/GEMM paths were checked numerically before the full run; the only residual is the ~19 dB expert rounding above.
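
To make the SQNR figure concrete, here is a minimal pure-Python sketch of MXFP4 round-tripping on synthetic Gaussian weights. It is not the qstream implementation: the scale rule (a power of two that places the group maximum in the top E2M1 binade), the saturation at 6, and the weight distribution are assumptions chosen for illustration, and the E8M0 exponent range is not clamped.

```python
import math
import random

# Magnitudes representable by an E2M1 (FP4) element.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
GROUP_SIZE = 32


def fake_quantize_group(values):
    """Round one group to MXFP4 and return the dequantized floats."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return [0.0] * len(values)
    # Power-of-two (E8M0-style) shared scale; 2 is the exponent of the
    # largest E2M1 value (6 = 1.5 * 2**2).
    scale = 2.0 ** (math.floor(math.log2(amax)) - 2)
    out = []
    for v in values:
        mag = min(abs(v) / scale, E2M1_GRID[-1])  # saturate at 6
        nearest = min(E2M1_GRID, key=lambda g: abs(g - mag))
        out.append(math.copysign(nearest * scale, v))
    return out


def sqnr_db(reference, approx):
    """Signal-to-quantization-noise ratio in decibels."""
    signal = sum(r * r for r in reference)
    noise = sum((r - a) ** 2 for r, a in zip(reference, approx))
    return 10.0 * math.log10(signal / noise)


rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(GROUP_SIZE * 256)]
dequantized = []
for start in range(0, len(weights), GROUP_SIZE):
    dequantized.extend(fake_quantize_group(weights[start:start + GROUP_SIZE]))

print(f"SQNR: {sqnr_db(weights, dequantized):.1f} dB")
```

The exact value depends on the weight distribution and the scale rule, so it should not be expected to reproduce the ≈ 19 dB reported for the real expert weights.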

PPL script: evals/eval_ppl.py in the qstream repo.
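
For readers unfamiliar with the metric, perplexity is the exponential of the mean negative log-likelihood per token. The sketch below shows that definition only; it is not the contents of evals/eval_ppl.py, and the function name is hypothetical.

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood (natural log) per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# Toy input: every token predicted with probability 1/4,
# so the perplexity comes out close to 4.
print(perplexity([math.log(0.25)] * 8))
```

By the same definition, a perplexity of 6.52 corresponds to a mean loss of about 1.87 nats per token.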

Fidelity, footprint & provenance

  • Vision is untouched: the 1.8B vision encoder + projector stay BF16 (bit-identical to the source), so image capability equals the base model — only the text MoE is quantized. Verified working end-to-end (multimodal generation produces correct image descriptions).
  • MTP preserved: the 3 multi-token-prediction draft layers stay BF16, so speculative decoding works (mean acceptance length ≈ 3.0, draft acceptance ≈ 68% on one B300).
  • Footprint: ~115 GiB of weights; fits a single ≥256 GB GPU (e.g. B300), and the weights also fit 2×128 GB (e.g. DGX Spark).
  • Provenance: built with qstream @c30945a from the Step-3.7-Flash-FP8 release; mixed-precision recipe (experts→MXFP4, rest→BF16).
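
Two of the figures above can be cross-checked with simple arithmetic. This sketch assumes that the mean acceptance length counts one target-model token plus the accepted draft tokens per verification step, which is how such metrics are commonly defined; the helper names are illustrative.

```python
def gb_to_gib(gigabytes: float) -> float:
    """Convert decimal gigabytes (1e9 bytes) to binary gibibytes (2**30 bytes)."""
    return gigabytes * 1e9 / 2**30


def draft_acceptance_rate(mean_acceptance_length: float, num_speculative_tokens: int) -> float:
    # Assumption: each step emits 1 target token plus the accepted drafts,
    # so accepted drafts per step = mean_acceptance_length - 1.
    return (mean_acceptance_length - 1.0) / num_speculative_tokens


print(f"123 GB = {gb_to_gib(123):.1f} GiB")
print(f"acceptance at length 3.0 with 3 drafts: {draft_acceptance_rate(3.0, 3):.0%}")
```

The 123 GB checkpoint works out to a little under 115 GiB, and an acceptance length of 3.0 with three draft tokens implies roughly two thirds of drafts accepted, close to the reported ≈ 68%.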

Serving with vLLM (this checkpoint)

Targets StepFun's prebuilt vllm/vllm-openai:stepfun37 image. The config.json here targets vLLM's merged runtime modules (qkv_proj, gate_up_proj) so the fused linears load quantized.

Single GPU (B300-class), MTP enabled:

docker run -d --name step37 --gpus all --privileged --ipc=host -p 8000:8000 \
  -e VLLM_MXFP4_USE_MARLIN=1 \
  -v $(pwd):/model \
  -v $(pwd)/vllm_patch/step3p5_mtp.py:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/step3p5_mtp.py \
  vllm/vllm-openai:stepfun37 /model \
  --served-model-name step3p7-flash \
  --tensor-parallel-size 1 --disable-cascade-attn \
  --reasoning-parser step3p5 --enable-auto-tool-choice --tool-call-parser step3p5 \
  --speculative_config '{"method": "mtp", "num_speculative_tokens": 3}' \
  --gpu-memory-utilization 0.97 \
  --max-model-len 32768 --max-num-batched-tokens 2048 \
  --linear-backend marlin --trust-remote-code

For multi-GPU, swap --tensor-parallel-size 1 for the official --tensor-parallel-size 8 --enable-expert-parallel.
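
Once the container is up, a quick smoke test against the OpenAI-compatible endpoint might look like the sketch below. It uses only the standard library; the host, port, and model name are taken from the docker command above, while the helper names and the 64-token budget are arbitrary choices for illustration.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"   # matches -p 8000:8000 above
MODEL_NAME = "step3p7-flash"            # matches --served-model-name


def build_chat_request(prompt: str, max_tokens: int = 64) -> urllib.request.Request:
    """Build a POST request for the chat completions endpoint."""
    payload = {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str) -> str:
    """Send the prompt to the running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling ask("Say hello in one sentence.") requires the server to be running. With a reasoning parser enabled, the content field may be empty when the token budget is spent on reasoning, so a larger max_tokens may be needed.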

Leave CUDA graphs on (don't pass --enforce-eager). Graph capture works fine for this MXFP4 + MTP checkpoint and is worth ~2–3× decode throughput; eager mode only saves ~1 min of startup. See the throughput numbers below.

Throughput

Aggregate output throughput on a single B300 (TP=1), MXFP4 + MTP, CUDA graphs on:

┌──────┬──────────────────┬─────────────────┐
│ conc │ GB200 ×4 (NVFP4) │ 1× B300 (MXFP4) │
├──────┼──────────────────┼─────────────────┤
│ 8    │ 1309             │ 1536            │
├──────┼──────────────────┼─────────────────┤
│ 32   │ 4391             │ 4693            │
├──────┼──────────────────┼─────────────────┤
│ 64   │ 8229             │ 7253            │
└──────┴──────────────────┴─────────────────┘

A single B300 here roughly matches the published 4×GB200 (TP=4) NVFP4 + MTP figures: it is about 17% and 7% ahead at concurrency 8 and 32, and about 12% behind at concurrency 64.

Method: fixed 512-token ignore_eos completions, N streams kept saturated, tokens counted over a 30 s steady-state window. Reproducible with evals/bench_mtp_sweep.py in the qstream repo. (MTP mean acceptance length ≈ 2.6–3.1 depending on workload.)
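
The windowed measurement described above can be sketched as follows. This is an illustration of the counting step only, on synthetic timestamps; it is not the code in evals/bench_mtp_sweep.py, and the function name is hypothetical.

```python
def steady_state_throughput(token_times, window_start, window_seconds=30.0):
    """Tokens per second over [window_start, window_start + window_seconds)."""
    window_end = window_start + window_seconds
    counted = sum(1 for t in token_times if window_start <= t < window_end)
    return counted / window_seconds


# Synthetic arrivals: one token every 10 ms for 60 s, i.e. 100 tokens/s.
arrivals = [i / 100 for i in range(6000)]
print(steady_state_throughput(arrivals, window_start=15.0))  # 100.0
```

Starting the window after warm-up and ending it before streams drain keeps ramp-up and tail effects out of the measurement.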

The MTP loader patch (vllm_patch/)

Serving with MTP requires the bundled vllm_patch/step3p5_mtp.py. The MTP block is BF16 (no KV-cache quantization), but the compressed-tensors path makes vLLM's inner Attention layer allocate inert k/v_scale + *_zero_point buffers, and the stock strict MTP loader rejects them as missing:

RuntimeError: ... mtp_block.self_attn.attn.k_zero_point ... not in the checkpoint

The patch treats those (inert) KV-attention quant params as optional — exactly as the loader already does for scalar scales. It is only needed when serving with MTP; without --speculative_config you can drop the patch mount. See vllm_patch/README.md. The official FP8/NVFP4 releases avoid the same allocation through their own config dialects (modules_to_not_convert / modelopt exclude_modules).
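
The idea of tolerating inert parameters can be illustrated with a small, generic sketch. This is not the bundled patch and does not use vLLM's loader API: the suffix list is inferred from the buffer names mentioned above, and every parameter name other than the one in the error message is hypothetical.

```python
# Hypothetical suffix list, inferred from the buffer names mentioned above.
OPTIONAL_SUFFIXES = (".k_scale", ".v_scale", ".k_zero_point", ".v_zero_point")


def missing_required(expected_names, checkpoint_names, optional_suffixes=OPTIONAL_SUFFIXES):
    """Return expected parameters absent from the checkpoint, ignoring optional ones."""
    present = set(checkpoint_names)
    return sorted(
        name for name in expected_names
        if name not in present and not name.endswith(optional_suffixes)
    )


expected = [
    "mtp_block.self_attn.qkv_proj.weight",    # hypothetical name
    "mtp_block.self_attn.attn.k_zero_point",  # from the error message above
    "mtp_block.self_attn.attn.v_scale",       # hypothetical name
]
in_checkpoint = ["mtp_block.self_attn.qkv_proj.weight"]

# A strict loader would reject the two KV quant params; this check does not.
print(missing_required(expected, in_checkpoint))  # []
```

A genuinely missing weight still shows up in the returned list, so the relaxation applies only to the named suffixes.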

How it was made

qstream-quantize \
  --model_dir <Step-3.7-Flash-FP8 (block-FP8 source)> \
  --output_dir ./stepfun-mxfp4 \
  --include_layers "moe.experts" \
  --device cuda --workers 8

detect_input_format auto-detects the source's block-FP8 (128×128 float scales), dequantizes only the routed experts and re-quantizes them to MXFP4, passes the BF16 remainder (attention, dense MLP, shared expert, MTP, vision) through, and writes the mixed-precision config.json.
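
The block-wise dequantization step can be sketched in a few lines. This is not qstream's code: it treats the FP8 values as already decoded to floats, uses nested lists instead of tensors, and assumes the stored scale is multiplied into each tile (some formats store the inverse). The function name is illustrative.

```python
def dequantize_blockwise(weight, scales, block=128):
    """Multiply each element by the scale of the (block x block) tile it falls in."""
    return [
        [value * scales[r // block][c // block] for c, value in enumerate(row)]
        for r, row in enumerate(weight)
    ]


# Tiny example with 2x2 tiles instead of 128x128.
weight = [[1.0, 2.0, 3.0, 4.0],
          [5.0, 6.0, 7.0, 8.0]]
scales = [[0.5, 2.0]]  # one scale per tile: 1 tile row, 2 tile columns
print(dequantize_blockwise(weight, scales, block=2))
# [[0.5, 1.0, 6.0, 8.0], [2.5, 3.0, 14.0, 16.0]]
```

The dequantized experts would then be re-quantized in groups of 32 with a shared power-of-two scale, as described in the format line at the top of this card.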


Original model card

[ModelPage]: https://static.stepfun.com/blog/step-3.7-flash/

1. Introduction

Step 3.7 Flash is a 198B-parameter sparse Mixture-of-Experts (MoE) vision-language model that combines a 196B-parameter language backbone with a 1.8B-parameter vision encoder for native image understanding. Engineered for high-frequency production workloads, it activates approximately 11B parameters per token and delivers a throughput of up to 400 tokens per second. Step 3.7 Flash supports a 256k context window and offers three selectable reasoning levels (low, medium, and high) so developers can easily balance speed, cost, and cognitive depth.

We built Step 3.7 Flash for developers who need to scale agentic workflows that combine perception, search, and reasoning. It is designed to handle intensive tasks such as parsing massive financial reports in one pass, running multi-step search loops with cross-source verification, or operating concurrent coding agents in high-throughput pipelines.

2. Capabilities & Performance

Multimodal Perception and Verification

The model delivers top-tier visual intelligence, securing first place on SimpleVQA (Search) with a score of 79.2 and achieving frontier parity on V* (Python) at 95.3. These metrics reflect strong visual grounding and retrieval-augmented reasoning beyond basic image description. The model accurately processes dense visual interfaces, such as UI wireframes, application GUIs, and data charts, to map them into structured code. When it encounters an incomplete visual asset, it can independently identify missing data and execute lookups to verify context before returning a factually verified conclusion.

Workflow Integrity and Tool Orchestration

Execution reliability is critical for autonomous agents. Step 3.7 Flash leads the ClawEval-1.1 benchmark with a score of 67.1, which significantly outperforms the next closest competitor at 59.8. This performance demonstrates high resistance to adversarial traps and strict adherence to system policies during multi-turn orchestration. Backed by scores of 49.5 on Toolathlon and 48.1 on HLE w. Tool, this profile ensures high trajectory integrity. Step 3.7 Flash reliably interacts with external APIs and executes long-horizon workflows without drifting from instructions or violating system constraints.

Code Engineering and Professional Baselines

Step 3.7 Flash is built for live engineering tasks and secured a definitive second-place finish on SWE-Bench PRO with a score of 56.3. It can independently trace multi-file repositories, isolate bugs from raw issue reports, and generate functional patches that pass automated unit tests. While evaluations like Terminal-Bench 2.1 (59.5) and GDPVal-AA (45.8) show clear areas for future optimization compared to the absolute peak of the cohort, they establish a dependable baseline for system interactions and structured professional deliverables.

[Figure: Step 3.7 Flash benchmark results across General Agent, Agentic Coding, and Multimodal evaluations]

3. Pricing

Token Type Price
Input (cache miss) $0.20 / M tokens
Input (cache hit) $0.04 / M tokens
Output $1.15 / M tokens
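
As a worked example of the price table, the sketch below estimates the cost of a single request. The rates come from the table above; the token counts and the function name are made up for illustration.

```python
# USD per million tokens, from the pricing table above.
PRICE_PER_M = {"input_miss": 0.20, "input_hit": 0.04, "output": 1.15}


def request_cost_usd(input_miss_tokens, input_hit_tokens, output_tokens):
    """Total cost in USD for one request, given its token counts."""
    return (
        input_miss_tokens * PRICE_PER_M["input_miss"]
        + input_hit_tokens * PRICE_PER_M["input_hit"]
        + output_tokens * PRICE_PER_M["output"]
    ) / 1_000_000


# Hypothetical request: 200k uncached input, 800k cached input, 50k output tokens.
print(f"${request_cost_usd(200_000, 800_000, 50_000):.4f}")
```

In this example the output tokens account for the largest share of the bill despite being the smallest count, which is worth keeping in mind when choosing a reasoning level.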

4. Availability, Deployment, and Ecosystem

  • Availability: Step 3.7 Flash is available on the StepFun Open Platform — platform.stepfun.ai (Global) and platform.stepfun.com (China), OpenRouter, and NVIDIA NIM. StepFun is also partnering with DeepInfra, Fireworks AI, and Modal to expand availability soon.
  • Deployment: Step 3.7 Flash supports flexible deployment across cloud, data center, and local environments. For large-scale production and enterprise use cases, Step 3.7 Flash can be deployed on modern data center infrastructure. For local and workstation scenarios, it can also run on high-memory devices such as NVIDIA DGX Station, AMD Ryzen AI Max+ 395-based systems, and Mac Studio / MacBook Pro devices with at least 128GB unified memory.
  • Ecosystem: Step 3.7 Flash is supported across popular open-source infrastructure for both inference and model development. For inference and serving, developers can use vLLM, SGLang, Hugging Face Transformers, and llama.cpp. For model development workflows, StepFun model support has landed in the NVIDIA Megatron ecosystem, including Megatron Core and Megatron Bridge.

5. Examples

You can get started with Step 3.7 Flash in minutes using StepFun's API or via other inference providers.

Pick the right base_url for your region. StepFun operates two regional platforms with separate API hosts. The base_url you pass to the OpenAI client must match the platform where your API key was issued, otherwise requests will be rejected as unauthorized.

To avoid hard-coding the wrong region, the examples below read both the API key and base URL from environment variables. Export them once before running:

export STEP_API_KEY="sk-..."
export STEP_BASE_URL="https://api.stepfun.ai/v1"   # use https://api.stepfun.com/v1 for the China platform

5.1 Chat Example

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["STEP_API_KEY"],
    base_url=os.environ["STEP_BASE_URL"],
)

completion = client.chat.completions.create(
    model="step-3.7-flash",
    messages=[
        {
            "role": "system",
            "content": "You are an AI assistant provided by StepFun. You are good at Chinese, English, and many other languages, and you can see, think, and act to help users get things done.",
        },
        {
            "role": "user",
            "content": "Introduce StepFun's artificial intelligence capabilities."
        },
    ],
)

print(completion)

5.2 Text and Image Input Example

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["STEP_API_KEY"],
    base_url=os.environ["STEP_BASE_URL"],
)

completion = client.chat.completions.create(
    model="step-3.7-flash",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this picture?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/photo.jpg"},
                },
            ],
        },
    ],
)

print(completion)

6. Local Deployment

Step 3.7 Flash is optimized for local inference and supports industry-standard backends including vLLM, SGLang, Hugging Face Transformers and llama.cpp.

6.1 vLLM

We recommend using StepFun's prebuilt vLLM Docker image with Step 3.7 support.

  1. Install vLLM.
# via Docker
docker pull vllm/vllm-openai:stepfun37
  2. Launch the server.
  • For FP8 model
vllm serve <MODEL_PATH_OR_HF_ID> \
--served-model-name step3p7-flash \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--disable-cascade-attn \
--reasoning-parser step3p5 \
--enable-auto-tool-choice \
--tool-call-parser step3p5 \
--speculative_config '{"method": "mtp", "num_speculative_tokens": 3}' \
--trust-remote-code
  • For BF16 model
vllm serve <MODEL_PATH_OR_HF_ID> \
--served-model-name step3p7-flash-bf16 \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--disable-cascade-attn \
--reasoning-parser step3p5 \
--enable-auto-tool-choice \
--tool-call-parser step3p5 \
--speculative_config '{"method": "mtp", "num_speculative_tokens": 3}' \
--trust-remote-code
  • For NVFP4 model
Compared with the standard precisions, running the FP4-quantized version requires the modelopt quantization backend (--quantization modelopt) and an FP8 KV cache (--kv-cache-dtype fp8).
python3 -m vllm.entrypoints.openai.api_server \
--host 0.0.0.0 \
--port ${PORT} \
--model stepfun-ai/Step-3.7-Flash-NVFP4 \
--served-model-name step3p7 \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.9 \
--enable-expert-parallel \
--trust-remote-code \
--quantization modelopt \
--kv-cache-dtype fp8 \
--max-model-len 8192 \
--reasoning-parser step3p5 \
--enable-auto-tool-choice \
--tool-call-parser step3p5 \
--async-scheduling

6.2 SGLang

  1. Install SGLang.
# via Docker
docker pull lmsysorg/sglang:dev-step-3.7-flash

# or from source (pip)
pip install "sglang[all] @ git+https://github.com/sgl-project/sglang.git"
  2. Launch the server.

Note: For Blackwell GPUs, --mm-attention-backend fa4 may be used.

  • For BF16 model
sglang serve --model-path stepfun-ai/Step-3.7-Flash \
  --tp 8 \
  --reasoning-parser step3p5 \
  --tool-call-parser step3p5 \
  --enable-multimodal \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --enable-multi-layer-eagle \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 8000
  • For FP8 model
sglang serve --model-path stepfun-ai/Step-3.7-Flash-FP8 \
  --tp 8 \
  --ep 4 \
  --reasoning-parser step3p5 \
  --tool-call-parser step3p5 \
  --enable-multimodal \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --enable-multi-layer-eagle \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 8000
  • For NVFP4 model
sglang serve --model-path stepfun-ai/Step-3.7-Flash-NVFP4 \
  --tp 4 --ep 4 \
  --moe-runner-backend flashinfer_trtllm \
  --kv-cache-dtype fp8_e4m3 \
  --quantization modelopt_fp4 \
  --trust-remote-code \
  --reasoning-parser step3p5 \
  --tool-call-parser step3p5 \
  --attention-backend trtllm_mha

6.3 Transformers (Debug / Verification)

Use this snippet for quick functional verification. For high-throughput serving, use vLLM or SGLang.

Note: Deployment of this model requires transformers 5.0 or later.

from transformers import AutoProcessor, AutoModelForCausalLM

MODEL_PATH = "<MODEL_PATH_OR_HF_ID>"

# 1. Setup
processor = AutoProcessor.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    device_map="auto",
    dtype="auto",
    trust_remote_code=True
)

# 2. Prepare Input
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/photo.jpg"},
            {"type": "text", "text": "What is in this picture?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# 3. Generate
generated_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
output_text = processor.decode(generated_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)

print(output_text)

6.4 llama.cpp

System Requirements

GGUF Model Weights:

Component Quantization File Size
Language Model Q4_K_S 111.5 GB
Language Model IQ4_XS 104.99 GB
Language Model Q3_K_L 102.5 GB
Multimodal Projector FP16 3.97 GB
  • Runtime Overhead: ~7 GB
  • Minimum unified memory / VRAM: 120 GB (e.g., Mac Studio, NVIDIA DGX Station, AMD Ryzen AI Max+ 395)
  • Recommended: 128 GB unified memory
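
The requirements above can be combined into a rough per-quantization memory estimate. The sketch assumes the three components simply add up and that the ~7 GB overhead already covers the KV cache for the intended context length; both are simplifications, and the helper name is illustrative.

```python
# File sizes from the table above, in GB.
GGUF_SIZES_GB = {"Q4_K_S": 111.5, "IQ4_XS": 104.99, "Q3_K_L": 102.5}
PROJECTOR_GB = 3.97
RUNTIME_OVERHEAD_GB = 7.0


def estimated_memory_gb(quant: str) -> float:
    """Language model + multimodal projector + runtime overhead."""
    return GGUF_SIZES_GB[quant] + PROJECTOR_GB + RUNTIME_OVERHEAD_GB


for quant in GGUF_SIZES_GB:
    print(f"{quant}: {estimated_memory_gb(quant):.2f} GB")
```

Under these assumptions all three variants fit in 128 GB. Q4_K_S comes out slightly above 120 GB, so the smaller quantizations leave more headroom on machines near the stated minimum.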

Steps

  1. Use llama.cpp:
git clone https://github.com/stepfun-ai/llama.cpp.git
cd llama.cpp
git checkout -b step3.7 origin/step3.7
  2. Build llama.cpp on Mac:
cmake -B build-macos -S . \
    -DCMAKE_BUILD_TYPE=Release \
    -DBUILD_SHARED_LIBS=ON \
    -DLLAMA_BUILD_SERVER=ON \
    -DLLAMA_BUILD_TESTS=ON \
    -DGGML_METAL=ON \
    -DGGML_METAL_EMBED_LIBRARY=ON \
    -DGGML_BLAS=ON \
    -DGGML_BLAS_VENDOR=Apple \
    -DGGML_ACCELERATE=ON \
    -DGGML_NATIVE=ON
cmake --build build-macos -j8
  3. Build llama.cpp on DGX Spark:
cmake -S . -B build-cuda \
  -DCMAKE_BUILD_TYPE=Release \
  -DGGML_CUDA=ON \
  -DGGML_CUDA_GRAPHS=ON \
  -DGGML_CUDA_FORCE_MMQ=ON \
  -DLLAMA_OPENSSL=OFF \
  -DLLAMA_BUILD_COMMON=ON \
  -DLLAMA_BUILD_TOOLS=ON \
  -DLLAMA_BUILD_SERVER=ON \
  -DLLAMA_BUILD_EXAMPLES=OFF \
  -DLLAMA_BUILD_TESTS=OFF
cmake --build build-cuda -j8
  4. Build llama.cpp on AMD Windows:
cmake -S . -B build-vulkan \
  -DCMAKE_BUILD_TYPE=Release \
  -DGGML_VULKAN=ON \
  -DGGML_NATIVE=ON \
  -DLLAMA_BUILD_SERVER=ON \
  -DLLAMA_BUILD_UI=OFF \
  -DLLAMA_BUILD_TOOLS=ON
cmake --build build-vulkan -j8
  5. Run with llama-cli:
./llama-cli -m Step3.7_Q4_K_S.gguf -b 2048 -ub 2048 -fa on --temp 1.0 -p "What's your name?"
  6. Test performance with llama-batched-bench:
./llama-batched-bench -m step3.7_Q4_K_S.gguf -c 32768 -b 2048 -ub 2048 -npp 0,2048,8192,16384,32768 -ntg 128 -npl 1

7. Using Step 3.7 Flash on Agent Platforms

You can use Step 3.7 Flash on Agent platforms such as Hermes Agent, OpenClaw, Kilo Code, and more.

8. Getting in Touch

As we work to shape the future of AGI by expanding broad model capabilities, we want to ensure we are solving the right problems. We invite you to be part of this continuous feedback loop — your insights directly influence our priorities.

  • Join the Conversation: Our Discord community is the primary hub for brainstorming future architectures, proposing capabilities, and getting early access updates 🚀
  • Report Friction: Encountering limitations? You can open an issue or start a discussion on GitHub / HuggingFace, or flag it directly in our Discord support channels.

📄 License

This project is open-sourced under the Apache 2.0 License.
