StepFun Step 3.7 ROCmFPX Q3 QualityPlus

Step-3.7-Flash-ROCmFPX-Q3-QualityPlus

This is an extremely high quality FPX3 / ROCmFPX Q3 GGUF build of stepfun-ai/Step-3.7-Flash, tuned for AMD Strix Halo local serving with Step MTP.

The goal is simple: keep Step 3.7 Flash useful at 256K context, keep the quality as high as possible, and keep it as small as possible. This release is a true tight Q3-weight build: 3.57 BPW, 81.77 GiB of language-model shards, and strong agent/tool behavior in local evals.

Use this if you want the Step 3.7 behavior profile, MTP support, and a much smaller local footprint than the stock GGUF Q3_K_L or ROCmFP4 STRIX_LEAN builds.

Required runtime: these GGUFs do not run on stock upstream llama.cpp. They use ROCmFPX tensor types such as q3_0_rocmfpx plus Chadrock/ROCmFPX serving support for Step MTP. Build the pinned Ciru ROCmFPX runner below before trying to load the model.

Why This One

Step 3.7 is huge. The practical local problem is not only speed; it is fitting enough context, KV, and agent workload into memory.

This FPX3/Q3 QualityPlus recipe was built for that constraint:

  • 3.57 BPW effective language-model size
  • 81.77 GiB total language GGUF shards
  • 16.31% smaller than the local ROCmFP4 STRIX_LEAN build
  • 14.35% smaller than StepFun's original Q3_K_L GGUF split
  • up to 256K one-slot serving profile with q8_0 target KV and q8_0 draft KV
  • Step MTP Q8 draft support through draft-mtp
  • downloadable fixed Step tool/chat template using native tool_response observations and protocol-boundary escaping

In practice, the original StepFun Q3_K_L local split was not a compact 3-bit-feeling model: it measured about 95.46 GiB, or roughly 4.17 BPW by effective size. This QualityPlus build is the one I would publish/use as the FPX3 lane.

Size Comparison

Measured from local GGUF shards:

Build Effective BPW Shard total Difference vs this release
ROCmFPX Q3 QualityPlus 3.57 BPW 81.77 GiB baseline
StepFun original Q3_K_L ~4.17 BPW 95.46 GiB +13.70 GiB larger
ROCmFP4 STRIX_LEAN ~4.27 BPW 97.70 GiB +15.93 GiB larger

That size gap matters because Step 3.7 needs memory for long context, q8 KV, and MTP draft state. On the tested Strix Halo host, the Q3 QualityPlus 64K MTP profile used about 96.3 GiB peak pooled GPU memory during long tool/Hermes runs, leaving enough RAM headroom to run the evals cleanly.

Quality Highlights

This is not a throwaway low-bit build. The recipe protects the tensors that were most important for behavior while pushing the giant expert FFN tensors into q3_0_rocmfpx.

Local quality results on AMD Ryzen AI Max+ 395 / Strix Halo:

Benchmark Result Notes
Tool-Eval full, 69 scenarios 88/100, 122/138 raw points Same headline score as the recorded Step ROCmFP4 tool-eval row
HermesAgent-20, best Q3 run 85/100 13.40 min, 35.31 tok/s decode, 96.37 GiB peak pooled GPU

The best recorded Q3 HermesAgent-20 run was very close to the local BF16 Qwen3.6 27B MTP reference row:

Model / row HermesAgent-20 score Wall time
BF16 Qwen3.6 27B MTP GGUF 87/100 42.4 min
Step 3.7 ROCmFPX Q3 QualityPlus 85/100 13.4 min

That is within two points of the BF16 Qwen3.6 27B row on the local HermesAgent-20 suite, while running in a much more compact Step 3.7 Q3 package.

Exact Q3 QualityPlus tool-eval score summary: evals/tool-eval-q3-qualityplus.json. Public reference page for the Step 3.7 tool-calling work: StepFun Step 3.7 Tool Eval on llm.ciru.ai. The Q3 QualityPlus full run used the same 69-scenario tool-eval harness and scored 88/100 locally.

Speed

Q3 QualityPlus speed was effectively tied with the local ROCmFP4 Step build while using much less disk space.

Short-context MTP speed, Vulkan0, q8_0/q8_0 target KV, q8_0/q8_0 draft KV, one slot, n_max=2, p_min=0.75, b8192/u2048, 128 generated tokens:

Prompt PP tok/s TG tok/s
2k 309.44 29.97
4k 325.18 29.39
8k 311.15 28.58
16k 306.37 26.26

Compared with the local ROCmFP4 Step build:

Prompt Q3 QualityPlus TG ROCmFP4 TG Takeaway
2k 29.97 26.52 Q3 faster
4k 29.39 29.37 tied
8k 28.58 28.02 tied/slightly Q3
16k 26.26 26.42 tied

128K stress row:

Context PP tok/s TG tok/s Peak pooled GPU
~130k prompt 146.67 14.52 ~95.36 GiB

At 128K, MTP initialized but produced no accepted drafts in that particular row, so treat the 128K decode number as an effective no-draft long-context decode reference.

256K load proof:

Context Proof Memory state
262144 target + Q8 MTP draft loaded, one slot, draft-mtp, /v1/models reports n_ctx=262144 and n_ctx_train=262144 ~99.04 GiB pooled GPU used, ~16 GiB system RAM available

The 256K row is a load/allocation proof, not a 256K prompt prefill benchmark.

Files

Published shard names intentionally match the model name:

Step-3.7-Flash-ROCmFPX-Q3-QualityPlus-00001-of-00009.gguf
Step-3.7-Flash-ROCmFPX-Q3-QualityPlus-00002-of-00009.gguf
Step-3.7-Flash-ROCmFPX-Q3-QualityPlus-00003-of-00009.gguf
Step-3.7-Flash-ROCmFPX-Q3-QualityPlus-00004-of-00009.gguf
Step-3.7-Flash-ROCmFPX-Q3-QualityPlus-00005-of-00009.gguf
Step-3.7-Flash-ROCmFPX-Q3-QualityPlus-00006-of-00009.gguf
Step-3.7-Flash-ROCmFPX-Q3-QualityPlus-00007-of-00009.gguf
Step-3.7-Flash-ROCmFPX-Q3-QualityPlus-00008-of-00009.gguf
Step-3.7-Flash-ROCmFPX-Q3-QualityPlus-00009-of-00009.gguf

The Step MTP draft model is not duplicated here. If you enable draft-mtp, you must also download and pass the separate Q8 draft from notSnix/Step-3.7-Flash-MTP-Draft-GGUF, for example Step-3.7-Flash-MTP-Q8_0.gguf. The main Q3 target GGUF does not contain the MTP draft layers.

This repo also includes the tested chat/tool template:

step37-native-tool-response-template.jinja

Download the target shards and template:

huggingface-cli download jcbtc/Step-3.7-Flash-ROCmFPX-Q3-QualityPlus \
  --include "Step-3.7-Flash-ROCmFPX-Q3-QualityPlus-*.gguf" \
  --include "step37-native-tool-response-template.jinja" \
  --local-dir /mnt/models/jcbtc-Step-3.7-Flash-ROCmFPX-Q3-QualityPlus

Download the required Q8 MTP draft:

huggingface-cli download notSnix/Step-3.7-Flash-MTP-Draft-GGUF \
  Step-3.7-Flash-MTP-Q8_0.gguf \
  --local-dir /mnt/models/notSnix-Step-3.7-Flash-MTP-Draft-GGUF

Direct template URL:

https://huggingface.co/jcbtc/Step-3.7-Flash-ROCmFPX-Q3-QualityPlus/resolve/main/step37-native-tool-response-template.jinja

Direct Q8 draft URL:

https://huggingface.co/notSnix/Step-3.7-Flash-MTP-Draft-GGUF/resolve/main/Step-3.7-Flash-MTP-Q8_0.gguf

Required ROCmFPX Runner

This model is tied to the Charlie/Ciru ROCmFPX llama.cpp runner family. A stock llama-server will not understand the ROCmFPX tensor types in these shards and will not reproduce the MTP serving behavior used for the benchmark rows.

Use the pinned Ciru runner:

repo: https://github.com/ciru-ai/ROCmFPX
current recommended pin: 221402af8574faf652b101b6afe225a3f329561f
branch at time of pin: main
upstream lineage: charlie12345/ROCmFPX

The earlier Chadrock v2 speed-runner tag remains useful for historical comparison:

tag: chadrockv2-runner-20260622
commit: 7aa484a2f0a504dc612a3d74a068024f3e6d6353

The Q3 QualityPlus Step 3.7 rows on this card were validated with the Chadrock/ROCmFPX runner path on AMD Ryzen AI Max+ 395 / Strix Halo. For fresh installs, use the current Ciru pin above unless you are reproducing an older benchmark exactly.

Build the runner on a Linux system with a working ROCm/HIP toolchain, Vulkan development headers, CMake, and a C++ compiler. This is the pinned Strix Halo reference build used by Ciru; it is not a universal distro installer, so package names and ROCm paths may differ on Ubuntu, Arch, Fedora, NixOS, and other distros.

git clone https://github.com/ciru-ai/ROCmFPX.git
cd ROCmFPX
git checkout 221402af8574faf652b101b6afe225a3f329561f

env JOBS="$(nproc)" \
  CMAKE_HIP_ARCHITECTURES=gfx1151 \
  ROCMFPX_DECODE_TUNE=stable \
  scripts/build-strix-rocmfp4-mtp.sh llama-server llama-bench

If your ROCm or rocWMMA headers live outside the script defaults, set the relevant environment variables before running the build, for example ROCM_WMMA_INCLUDE=/path/to/rocWMMA/library/include. If your GPU is not Strix Halo / gfx1151, change CMAKE_HIP_ARCHITECTURES for your target.

The script and build directory still use the historical rocmfp4 name, but this is the ROCmFPX/Chadrock runner. For this model, the required support is ROCmFPX Q3 tensor support, not a ROCmFP4-only runtime.

The server binary should be:

./build-strix-rocmfp4/bin/llama-server

Again, build-strix-rocmfp4 is the historical build-directory name used by the ROCmFPX runner script.

If the model load fails with an unknown GGUF tensor type, you are using the wrong runner.

Recommended Serving Profile

The locally tested long-context profile:

context: up to 262144
slots: 1
backend: Vulkan0 target + Vulkan0 draft
MTP: --spec-type draft-mtp
draft model: Step-3.7-Flash-MTP-Q8_0.gguf from notSnix/Step-3.7-Flash-MTP-Draft-GGUF
speculative.n_max: 2
speculative.n_min: 0
speculative.p_min: 0.75
speculative.p_split: 0.10
batch / ubatch: 8192 / 2048
target KV: q8_0 / q8_0
draft KV: q8_0 / q8_0
prompt cache: disabled for 256K fit runs
sampler: temperature 1.0, top_p 0.95, min_p 0.0, repeat_penalty 1.0
reasoning: on, DeepSeek format
chat template: Step native tool_response template with protocol-boundary escaping

Serving backend note: on the tested AMD Ryzen AI Max+ 395 / Strix Halo system, this Step 3.7 Q3 build worked best through the ROCmFPX/Chadrock runner serving on Vulkan0 for both target and draft. In the command below, ROCmFPX is the required tensor/runtime support; -dev Vulkan0 and --spec-draft-device Vulkan0 are the recommended serving backend.

For models.ini-style launchers, make sure the draft path is present. Setting spec-type = draft-mtp without spec-draft-model makes the runner try to build an MTP draft context from the main target GGUF, which fails because the target does not contain MTP draft layers.

model = /mnt/models/jcbtc-Step-3.7-Flash-ROCmFPX-Q3-QualityPlus/Step-3.7-Flash-ROCmFPX-Q3-QualityPlus-00001-of-00009.gguf
chat-template-file = /mnt/models/jcbtc-Step-3.7-Flash-ROCmFPX-Q3-QualityPlus/step37-native-tool-response-template.jinja

spec-type = draft-mtp
spec-draft-model = /mnt/models/notSnix-Step-3.7-Flash-MTP-Draft-GGUF/Step-3.7-Flash-MTP-Q8_0.gguf
spec-draft-device = Vulkan0
spec-draft-ngl = all
spec-draft-type-k = q8_0
spec-draft-type-v = q8_0
spec-draft-n-max = 2
spec-draft-n-min = 0
spec-draft-p-min = 0.75
spec-draft-p-split = 0.10

If you see context type MTP requested but model doesn't contain MTP layers, the draft model is missing or the path is wrong.

Example shape:

./build-strix-rocmfp4/bin/llama-server \
  -m Step-3.7-Flash-ROCmFPX-Q3-QualityPlus-00001-of-00009.gguf \
  --alias step-3.7-flash-rocmfpx-q3-qualityplus \
  --host 127.0.0.1 \
  --port 8080 \
  --jinja \
  -c 262144 \
  --reasoning on \
  --reasoning-format deepseek \
  --reasoning-budget -1 \
  -dev Vulkan0 \
  -ngl 999 \
  -fa on \
  -b 8192 \
  -ub 2048 \
  --parallel 1 \
  --no-mmap \
  --cache-ram 0 \
  -ctk q8_0 \
  -ctv q8_0 \
  --spec-draft-model Step-3.7-Flash-MTP-Q8_0.gguf \
  --spec-draft-device Vulkan0 \
  --spec-type draft-mtp \
  --spec-draft-ngl all \
  --spec-draft-type-k q8_0 \
  --spec-draft-type-v q8_0 \
  --spec-draft-n-max 2 \
  --spec-draft-n-min 0 \
  --spec-draft-p-min 0.75 \
  --spec-draft-p-split 0.10 \
  --chat-template-file /path/to/step37-native-tool-response-template.jinja \
  --metrics

Template Note

The best local Step setup uses the included step37-native-tool-response-template.jinja template. It renders tool outputs as tool_response turns and escapes protocol-boundary tokens inside tool output. This is a general protocol-adapter fix: tool/file/search results stay observations instead of being flattened into user text.

Download:

curl -L -o step37-native-tool-response-template.jinja \
  https://huggingface.co/jcbtc/Step-3.7-Flash-ROCmFPX-Q3-QualityPlus/resolve/main/step37-native-tool-response-template.jinja

That matters for real agents because Step 3.7 can otherwise confuse tool output with conversation authority, especially in file/search-result injection cases.

Build Notes

These are model-build notes, not runner-build instructions. Build the pinned ROCmFPX runner in the section above before serving the GGUFs.

The QualityPlus policy used here:

  • huge ffn_*_exps tensors: q3_0_rocmfpx
  • attention q/output protected at q5_K
  • attention k/v protected at q4_K
  • shared/dense FFN protected at q5_K
  • output/token embeddings at q4_0_rocmfp4_fast

Converter-reported size: 83726.08 MiB / 3.57 BPW, 9 shards.

Credits

Caveats

  • This is a custom ROCmFPX GGUF release. It requires the compatible ROCmFPX/Chadrock llama.cpp runner; stock llama.cpp is not expected to load it.
  • Quality numbers are local Strix Halo measurements and depend on runtime, chat template, KV type, and MTP settings.
  • The model is strong but not perfect at autonomous email/message side effects; it can be cautious and ask for subject/body/recipient details instead of sending with inferred defaults.
Downloads last month
-
GGUF
Model size
197B params
Architecture
step35
Hardware compatibility
Log In to add your hardware

We're not able to determine the quantization variants.

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for jcbtc/Step-3.7-Flash-ROCmFPX-Q3-QualityPlus

Quantized
(1)
this model