Step 3.7 Flash Q4_K_M GGUF

This repo contains the full text-side GGUF quantization of stepfun-ai/Step-3.7-Flash.

For speculative decoding, use the companion MTP draft GGUFs here:

notSnix/Step-3.7-Flash-MTP-Draft-GGUF

The source model is Apache-2.0. The original model is multimodal, but this GGUF artifact was prepared and tested for text-side llama.cpp serving.

Files

File Size SHA256 Purpose
Step-3.7-Flash-Q4_K_M.gguf 111 GB 4de6519cf0131820d81137ebe6a0ab8dc225f1c463cc385038ab7de41ee7a36f Full model
chat_template.jinja 5.6 KB f428623fc81c940c35be3509fbffc086b4b4360d8800e46103e6f34d02891633 Chat template

Runtime

Current llama.cpp main supports Step MTP draft loading natively when used with the companion draft repo. This was smoke-tested with clean llama.cpp commit d545a2a993849fcf3b752d85ae256fc9d6a9de79.

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build-cuda -DGGML_CUDA=ON
cmake --build build-cuda --config Release -j

Basic Command

llama-server \
  --model Step-3.7-Flash-Q4_K_M.gguf \
  --host 0.0.0.0 \
  --port 8000 \
  --ctx-size 262144 \
  --n-gpu-layers all \
  --split-mode layer \
  --parallel 1 \
  --reasoning on \
  --reasoning-format deepseek \
  --chat-template-file chat_template.jinja

With MTP Draft

Download an MTP draft GGUF from notSnix/Step-3.7-Flash-MTP-Draft-GGUF, then run:

llama-server \
  --model Step-3.7-Flash-Q4_K_M.gguf \
  --model-draft Step-3.7-Flash-MTP-Q8_0.gguf \
  --host 0.0.0.0 \
  --port 8000 \
  --ctx-size 262144 \
  --n-gpu-layers all \
  --split-mode layer \
  --parallel 1 \
  --reasoning on \
  --reasoning-format deepseek \
  --spec-type draft-mtp \
  --spec-draft-n-max 2 \
  --spec-draft-p-min 0.60 \
  --chat-template-file chat_template.jinja

Local Benchmark Snapshot

GPUs: RTX PRO 6000, 3x RTX 3090.

Recommended local MTP setting from the tested sweep: --spec-draft-n-max 2 --spec-draft-p-min 0.60 with the Q8_0 draft.

Run Prompt tokens Prefill Decode TTFT Notes
Q4_K_M + Q8_0 MTP n_max=2 p_min=0.60 32,769 1823.47 tok/s 104.38 tok/s 18.054 s 87.1% draft accepted
Q4_K_M + BF16 MTP n_max=2 p_min=0.60 32,769 1835.66 tok/s 93.38 tok/s 17.904 s 79.3% draft accepted
Q4_K_M + BF16 MTP n_max=2 p_min=0.60 65,537 1626.84 tok/s 94.79 tok/s 40.391 s 81.2% draft accepted
Q4_K_M + MTP n_max=3 604 - 143.81 tok/s 0.415 s 172/181 draft accepted, 95.0%
Q4_K_M + MTP n_max=3 32,519 2097.79 tok/s 104.91 tok/s 15.62 s 60/73 draft accepted, 82.2%
Q4_K_M + MTP n_max=3 54,619 1909.23 tok/s 106.73 tok/s 28.82 s 60/70 draft accepted, 85.7%
Q4_K_S baseline 604 1738.12 tok/s 110.70 tok/s 0.352 s no MTP
Q4_K_S baseline 54,619 2194.42 tok/s 89.15 tok/s 25.16 s no MTP

Limited task checks:

Check Q4_K_S baseline Q4_K_M + MTP n_max=3
ARC Challenge chat, 10 samples 0.9 0.9
GSM8K strict/flexible, 10 samples 0.9 / 0.9 0.8 / 0.8
Code needle / NIAH reasoning-aware 12/12 12/12

Checksums

sha256sum -c SHA256SUMS

Notes

  • The base model advertises 256k context; this GGUF release was loaded locally at 256k context.
  • The MTP draft GGUFs are companion files for speculative decoding and are hosted separately to avoid confusing them with full-model quants.
  • This is a community GGUF quantization/repackaging of the upstream Apache-2.0 model, not an official StepFun release.
Downloads last month
-
GGUF
Model size
197B params
Architecture
step35
Hardware compatibility
Log In to add your hardware

4-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for notSnix/Step-3.7-Flash-Q4_K_M-GGUF

Quantized
(23)
this model