Step-3.7-Flash-ROCmFPX-Q3-QualityPlus

This is an extremely high quality FPX3 / ROCmFPX Q3 GGUF build of stepfun-ai/Step-3.7-Flash, tuned for AMD Strix Halo local serving with Step MTP.

The goal is simple: keep Step 3.7 Flash useful at 256K context, keep quality as high as possible, and keep the footprint as small as possible. This release is a genuinely tight Q3-weight build: 3.57 BPW, 81.77 GiB of language-model shards, and strong agent/tool behavior in local evals.

Use this if you want the Step 3.7 behavior profile, MTP support, and a much smaller local footprint than the stock GGUF Q3_K_L or ROCmFP4 STRIX_LEAN builds.

Required runtime: these GGUFs do not run on stock upstream llama.cpp. They use ROCmFPX tensor types such as q3_0_rocmfpx plus Chadrock/ROCmFPX serving support for Step MTP. Build the pinned Ciru ROCmFPX runner below before trying to load the model.

Why This One

Step 3.7 is huge. The practical local problem is not only speed; it is fitting enough context, KV, and agent workload into memory.

This FPX3/Q3 QualityPlus recipe was built for that constraint:

3.57 BPW effective language-model size
81.77 GiB total language GGUF shards
16.31% smaller than the local ROCmFP4 STRIX_LEAN build
14.35% smaller than StepFun's original Q3_K_L GGUF split
up to 256K one-slot serving profile with q8_0 target KV and q8_0 draft KV
Step MTP Q8 draft support through draft-mtp
downloadable fixed Step tool/chat template using native tool_response observations and protocol-boundary escaping

In practice, the original StepFun Q3_K_L local split is not as compact as its 3-bit name suggests: it measured about 95.46 GiB, or roughly 4.17 BPW by effective size. This QualityPlus build is the one I would publish and use as the FPX3 lane.

Size Comparison

Measured from local GGUF shards:

| Build | Effective BPW | Shard total | Difference vs this release |
| --- | --- | --- | --- |
| ROCmFPX Q3 QualityPlus | `3.57 BPW` | `81.77 GiB` | baseline |
| StepFun original `Q3_K_L` | `~4.17 BPW` | `95.46 GiB` | `+13.70 GiB` larger |
| ROCmFP4 STRIX_LEAN | `~4.27 BPW` | `97.70 GiB` | `+15.93 GiB` larger |

That size gap matters because Step 3.7 needs memory for long context, q8 KV, and MTP draft state. On the tested Strix Halo host, the Q3 QualityPlus 64K MTP profile used about 96.3 GiB peak pooled GPU memory during long tool/Hermes runs, leaving enough RAM headroom to run the evals cleanly.
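
The size figures above can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes the rounded 197B parameter count shown on the model page, so its BPW estimates for the two larger builds land about 0.01 below the card's figures, and the helper names are hypothetical.

```python
GIB = 1024 ** 3  # bytes per GiB

# Shard totals from the size comparison table, in GiB.
SHARD_GIB = {
    "ROCmFPX Q3 QualityPlus": 81.77,
    "StepFun original Q3_K_L": 95.46,
    "ROCmFP4 STRIX_LEAN": 97.70,
}

# Assumption: rounded parameter count from the model page (197B).
PARAMS = 197e9


def bits_per_weight(size_gib: float, params: float = PARAMS) -> float:
    """Effective bits per weight for a given on-disk size."""
    return size_gib * GIB * 8 / params


def percent_smaller(this_gib: float, other_gib: float) -> float:
    """How much smaller `this_gib` is than `other_gib`, in percent."""
    return (other_gib - this_gib) / other_gib * 100


baseline = SHARD_GIB["ROCmFPX Q3 QualityPlus"]
for name, size in SHARD_GIB.items():
    print(f"{name}: {bits_per_weight(size):.2f} BPW, "
          f"{size - baseline:+.2f} GiB vs baseline, "
          f"baseline is {percent_smaller(baseline, size):.2f}% smaller")
```

Because the inputs are rounded to two decimals, the last digit of the GiB and percentage differences can differ slightly from the values quoted on this card.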

Quality Highlights

This is not a throwaway low-bit build. The recipe protects the tensors that were most important for behavior while pushing the giant expert FFN tensors into q3_0_rocmfpx.

Local quality results on AMD Ryzen AI Max+ 395 / Strix Halo:

| Benchmark | Result | Notes |
| --- | --- | --- |
| Tool-Eval full, 69 scenarios | `88/100`, `122/138` raw points | Same headline score as the recorded Step ROCmFP4 tool-eval row |
| HermesAgent-20, best Q3 run | `85/100` | `13.40 min`, `35.31 tok/s` decode, `96.37 GiB` peak pooled GPU |

The best recorded Q3 HermesAgent-20 run was very close to the local BF16 Qwen3.6 27B MTP reference row:

| Model / row | HermesAgent-20 score | Wall time |
| --- | --- | --- |
| BF16 Qwen3.6 27B MTP GGUF | `87/100` | `42.4 min` |
| Step 3.7 ROCmFPX Q3 QualityPlus | `85/100` | `13.4 min` |

That is within two points of the BF16 Qwen3.6 27B row on the local HermesAgent-20 suite, in about a third of the wall time (13.4 min vs 42.4 min), from a compact Q3 build of Step 3.7.

Exact Q3 QualityPlus tool-eval score summary: evals/tool-eval-q3-qualityplus.json. Public reference page for the Step 3.7 tool-calling work: StepFun Step 3.7 Tool Eval on llm.ciru.ai. The Q3 QualityPlus full run used the same 69-scenario tool-eval harness and scored 88/100 locally.

Speed

Q3 QualityPlus decode speed was effectively tied with the local ROCmFP4 Step build while using about 15.93 GiB less disk space.

Short-context MTP speed on Vulkan0 with q8_0/q8_0 target KV, q8_0/q8_0 draft KV, one slot, n_max=2, p_min=0.75, batch/ubatch 8192/2048 (b8192/u2048), and 128 generated tokens. PP is prompt processing and TG is token generation:

| Prompt | PP tok/s | TG tok/s |
| --- | --- | --- |
| `2k` | `309.44` | `29.97` |
| `4k` | `325.18` | `29.39` |
| `8k` | `311.15` | `28.58` |
| `16k` | `306.37` | `26.26` |

Compared with the local ROCmFP4 Step build:

| Prompt | Q3 QualityPlus TG | ROCmFP4 TG | Takeaway |
| --- | --- | --- | --- |
| `2k` | `29.97` | `26.52` | Q3 faster |
| `4k` | `29.39` | `29.37` | tied |
| `8k` | `28.58` | `28.02` | tied/slightly Q3 |
| `16k` | `26.26` | `26.42` | tied |
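
To put the "tied" takeaways in numbers, the relative decode-speed gap can be computed from the table above. This is a small illustrative sketch; the names are hypothetical and the inputs are the TG tok/s values reported on this card.

```python
# Decode speed (TG tok/s) per prompt size: (Q3 QualityPlus, ROCmFP4).
TG = {
    "2k": (29.97, 26.52),
    "4k": (29.39, 29.37),
    "8k": (28.58, 28.02),
    "16k": (26.26, 26.42),
}


def relative_gap(q3: float, fp4: float) -> float:
    """Percent by which the Q3 decode speed differs from the ROCmFP4 one."""
    return (q3 - fp4) / fp4 * 100


gaps = {prompt: relative_gap(q3, fp4) for prompt, (q3, fp4) in TG.items()}
for prompt, gap in gaps.items():
    print(f"{prompt}: {gap:+.1f}%")
```

With these inputs the gap is about 13% in favor of Q3 at 2k and stays within roughly 2% at the other prompt sizes.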

128K stress row:

| Context | PP tok/s | TG tok/s | Peak pooled GPU |
| --- | --- | --- | --- |
| `~130k prompt` | `146.67` | `14.52` | `~95.36 GiB` |

At 128K, MTP initialized but produced no accepted drafts in that particular row, so treat the 128K decode number as an effective no-draft long-context decode reference.

256K load proof:

| Context | Proof | Memory state |
| --- | --- | --- |
| `262144` | target + Q8 MTP draft loaded, one slot, `draft-mtp`, `/v1/models` reports `n_ctx=262144` and `n_ctx_train=262144` | `~99.04 GiB` pooled GPU used, `~16 GiB` system RAM available |

The 256K row is a load/allocation proof, not a 256K prompt prefill benchmark.
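
One way to repeat this load proof is to ask the running server for its model metadata and look for the context fields. The sketch below is a hedged example: the base URL is an assumption matching the serving command on this card, and because the exact JSON layout of the `/v1/models` response is not documented here, the helper searches the whole response for a key name rather than assuming where it lives.

```python
import json
import urllib.request


def find_values(obj, key):
    """Recursively collect every value stored under `key` in nested JSON."""
    found = []
    if isinstance(obj, dict):
        for k, v in obj.items():
            if k == key:
                found.append(v)
            found.extend(find_values(v, key))
    elif isinstance(obj, list):
        for item in obj:
            found.extend(find_values(item, key))
    return found


def fetch_models(base_url="http://127.0.0.1:8080"):
    """GET /v1/models from a running server (assumed address)."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=10) as resp:
        return json.load(resp)


def context_report(models_json):
    """Return the n_ctx / n_ctx_train values found anywhere in the response."""
    return {key: find_values(models_json, key) for key in ("n_ctx", "n_ctx_train")}
```

With the server up, `context_report(fetch_models())` should list 262144 under both keys if the 256K profile loaded as described.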

Files

Published shard names intentionally match the model name:

Step-3.7-Flash-ROCmFPX-Q3-QualityPlus-00001-of-00009.gguf
Step-3.7-Flash-ROCmFPX-Q3-QualityPlus-00002-of-00009.gguf
Step-3.7-Flash-ROCmFPX-Q3-QualityPlus-00003-of-00009.gguf
Step-3.7-Flash-ROCmFPX-Q3-QualityPlus-00004-of-00009.gguf
Step-3.7-Flash-ROCmFPX-Q3-QualityPlus-00005-of-00009.gguf
Step-3.7-Flash-ROCmFPX-Q3-QualityPlus-00006-of-00009.gguf
Step-3.7-Flash-ROCmFPX-Q3-QualityPlus-00007-of-00009.gguf
Step-3.7-Flash-ROCmFPX-Q3-QualityPlus-00008-of-00009.gguf
Step-3.7-Flash-ROCmFPX-Q3-QualityPlus-00009-of-00009.gguf
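
Because the model is split across nine shards, a missing or partially downloaded file is an easy mistake to make. The following is a minimal sketch, with hypothetical helper names, that rebuilds the shard names listed above and reports any that are absent from a directory.

```python
from pathlib import Path

PREFIX = "Step-3.7-Flash-ROCmFPX-Q3-QualityPlus"
TOTAL_SHARDS = 9


def expected_shards(prefix: str = PREFIX, total: int = TOTAL_SHARDS) -> list[str]:
    """Build the shard file names in the NNNNN-of-NNNNN pattern used above."""
    return [f"{prefix}-{i:05d}-of-{total:05d}.gguf" for i in range(1, total + 1)]


def missing_shards(model_dir: str) -> list[str]:
    """Return the expected shard names that do not exist in `model_dir`."""
    root = Path(model_dir)
    return [name for name in expected_shards() if not (root / name).is_file()]
```

For example, `missing_shards("/mnt/models/jcbtc-Step-3.7-Flash-ROCmFPX-Q3-QualityPlus")` returns an empty list when all nine files are present in that directory. It only checks that the files exist, not that they downloaded completely.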

The Step MTP draft model is not duplicated here. If you enable draft-mtp, you must also download and pass the separate Q8 draft from notSnix/Step-3.7-Flash-MTP-Draft-GGUF, for example Step-3.7-Flash-MTP-Q8_0.gguf. The main Q3 target GGUF does not contain the MTP draft layers.

This repo also includes the tested chat/tool template:

step37-native-tool-response-template.jinja

Download the target shards and template:

huggingface-cli download jcbtc/Step-3.7-Flash-ROCmFPX-Q3-QualityPlus \
  --include "Step-3.7-Flash-ROCmFPX-Q3-QualityPlus-*.gguf" \
  --include "step37-native-tool-response-template.jinja" \
  --local-dir /mnt/models/jcbtc-Step-3.7-Flash-ROCmFPX-Q3-QualityPlus

Download the required Q8 MTP draft:

huggingface-cli download notSnix/Step-3.7-Flash-MTP-Draft-GGUF \
  Step-3.7-Flash-MTP-Q8_0.gguf \
  --local-dir /mnt/models/notSnix-Step-3.7-Flash-MTP-Draft-GGUF

Direct template URL:

https://huggingface.co/jcbtc/Step-3.7-Flash-ROCmFPX-Q3-QualityPlus/resolve/main/step37-native-tool-response-template.jinja

Direct Q8 draft URL:

https://huggingface.co/notSnix/Step-3.7-Flash-MTP-Draft-GGUF/resolve/main/Step-3.7-Flash-MTP-Q8_0.gguf

Required ROCmFPX Runner

This model is tied to the Charlie/Ciru ROCmFPX llama.cpp runner family. A stock llama-server will not understand the ROCmFPX tensor types in these shards and will not reproduce the MTP serving behavior used for the benchmark rows.

Use the pinned Ciru runner:

repo: https://github.com/ciru-ai/ROCmFPX
current recommended pin: 221402af8574faf652b101b6afe225a3f329561f
branch at time of pin: main
upstream lineage: charlie12345/ROCmFPX

The earlier Chadrock v2 speed-runner tag remains useful for historical comparison:

tag: chadrockv2-runner-20260622
commit: 7aa484a2f0a504dc612a3d74a068024f3e6d6353

The Q3 QualityPlus Step 3.7 rows on this card were validated with the Chadrock/ROCmFPX runner path on AMD Ryzen AI Max+ 395 / Strix Halo. For fresh installs, use the current Ciru pin above unless you are reproducing an older benchmark exactly.

Build the runner on a Linux system with a working ROCm/HIP toolchain, Vulkan development headers, CMake, and a C++ compiler. This is the pinned Strix Halo reference build used by Ciru; it is not a universal distro installer, so package names and ROCm paths may differ on Ubuntu, Arch, Fedora, NixOS, and other distros.

git clone https://github.com/ciru-ai/ROCmFPX.git
cd ROCmFPX
git checkout 221402af8574faf652b101b6afe225a3f329561f

env JOBS="$(nproc)" \
  CMAKE_HIP_ARCHITECTURES=gfx1151 \
  ROCMFPX_DECODE_TUNE=stable \
  scripts/build-strix-rocmfp4-mtp.sh llama-server llama-bench

If your ROCm or rocWMMA headers live outside the script defaults, set the relevant environment variables before running the build, for example ROCM_WMMA_INCLUDE=/path/to/rocWMMA/library/include. If your GPU is not Strix Halo / gfx1151, change CMAKE_HIP_ARCHITECTURES for your target.

The script and build directory still use the historical rocmfp4 name, but this is the ROCmFPX/Chadrock runner. For this model, the required support is ROCmFPX Q3 tensor support, not a ROCmFP4-only runtime.

The server binary should be:

./build-strix-rocmfp4/bin/llama-server

Again, build-strix-rocmfp4 is the historical build-directory name used by the ROCmFPX runner script.

If the model load fails with an unknown GGUF tensor type, you are using the wrong runner.

Recommended Serving Profile

The locally tested long-context profile:

context: up to 262144
slots: 1
backend: Vulkan0 target + Vulkan0 draft
MTP: --spec-type draft-mtp
draft model: Step-3.7-Flash-MTP-Q8_0.gguf from notSnix/Step-3.7-Flash-MTP-Draft-GGUF
speculative.n_max: 2
speculative.n_min: 0
speculative.p_min: 0.75
speculative.p_split: 0.10
batch / ubatch: 8192 / 2048
target KV: q8_0 / q8_0
draft KV: q8_0 / q8_0
prompt cache: disabled for 256K fit runs
sampler: temperature 1.0, top_p 0.95, min_p 0.0, repeat_penalty 1.0
reasoning: on, DeepSeek format
chat template: Step native tool_response template with protocol-boundary escaping

Serving backend note: on the tested AMD Ryzen AI Max+ 395 / Strix Halo system, this Step 3.7 Q3 build worked best through the ROCmFPX/Chadrock runner serving on Vulkan0 for both target and draft. In the command below, ROCmFPX is the required tensor/runtime support; -dev Vulkan0 and --spec-draft-device Vulkan0 are the recommended serving backend.

For models.ini-style launchers, make sure the draft path is present. Setting spec-type = draft-mtp without spec-draft-model makes the runner try to build an MTP draft context from the main target GGUF, which fails because the target does not contain MTP draft layers.

model = /mnt/models/jcbtc-Step-3.7-Flash-ROCmFPX-Q3-QualityPlus/Step-3.7-Flash-ROCmFPX-Q3-QualityPlus-00001-of-00009.gguf
chat-template-file = /mnt/models/jcbtc-Step-3.7-Flash-ROCmFPX-Q3-QualityPlus/step37-native-tool-response-template.jinja

spec-type = draft-mtp
spec-draft-model = /mnt/models/notSnix-Step-3.7-Flash-MTP-Draft-GGUF/Step-3.7-Flash-MTP-Q8_0.gguf
spec-draft-device = Vulkan0
spec-draft-ngl = all
spec-draft-type-k = q8_0
spec-draft-type-v = q8_0
spec-draft-n-max = 2
spec-draft-n-min = 0
spec-draft-p-min = 0.75
spec-draft-p-split = 0.10

If you see the error "context type MTP requested but model doesn't contain MTP layers", the draft model is missing or the path is wrong.
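
A launcher wrapper can catch this misconfiguration before the server starts. The sketch below rests on assumptions: it treats the config as plain `key = value` lines like the fragment above (the real launcher's parsing rules are not documented here), and the function names are hypothetical.

```python
from pathlib import Path


def parse_key_values(text: str) -> dict[str, str]:
    """Parse simple `key = value` lines, skipping blanks, comments and sections."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";", "[")) or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def check_mtp_settings(settings: dict[str, str], check_files: bool = True) -> list[str]:
    """Return human-readable problems with the draft-mtp settings, if any."""
    problems = []
    if settings.get("spec-type") != "draft-mtp":
        return problems  # nothing to check when MTP drafting is off
    draft = settings.get("spec-draft-model")
    if not draft:
        problems.append("spec-type = draft-mtp requires spec-draft-model")
    elif check_files and not Path(draft).is_file():
        problems.append(f"draft model not found: {draft}")
    return problems
```

Running `check_mtp_settings(parse_key_values(open("models.ini").read()))` before launch returns an empty list when the draft path is set and the file exists.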

Example command shape (adjust the model, draft, and template paths to your download locations):

./build-strix-rocmfp4/bin/llama-server \
  -m Step-3.7-Flash-ROCmFPX-Q3-QualityPlus-00001-of-00009.gguf \
  --alias step-3.7-flash-rocmfpx-q3-qualityplus \
  --host 127.0.0.1 \
  --port 8080 \
  --jinja \
  -c 262144 \
  --reasoning on \
  --reasoning-format deepseek \
  --reasoning-budget -1 \
  -dev Vulkan0 \
  -ngl 999 \
  -fa on \
  -b 8192 \
  -ub 2048 \
  --parallel 1 \
  --no-mmap \
  --cache-ram 0 \
  -ctk q8_0 \
  -ctv q8_0 \
  --spec-draft-model Step-3.7-Flash-MTP-Q8_0.gguf \
  --spec-draft-device Vulkan0 \
  --spec-type draft-mtp \
  --spec-draft-ngl all \
  --spec-draft-type-k q8_0 \
  --spec-draft-type-v q8_0 \
  --spec-draft-n-max 2 \
  --spec-draft-n-min 0 \
  --spec-draft-p-min 0.75 \
  --spec-draft-p-split 0.10 \
  --chat-template-file /path/to/step37-native-tool-response-template.jinja \
  --metrics
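
Once the server is running, any OpenAI-compatible client can talk to it. The following is a minimal standard-library sketch, assuming the host, port, and alias from the command above and the temperature and top_p values from the recommended profile; the function names are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8080/v1"  # --host / --port from the command above
MODEL = "step-3.7-flash-rocmfpx-q3-qualityplus"  # value passed to --alias


def build_request(prompt: str) -> dict:
    """Build a chat completion payload using the card's sampler settings."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1.0,
        "top_p": 0.95,
    }


def chat(prompt: str) -> str:
    """POST the payload to /chat/completions and return the reply text."""
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Calling `chat("Summarize this repository layout.")` returns the assistant text. With `--reasoning-format deepseek` the reasoning trace is expected to arrive in a separate field rather than in `content`, so inspect the raw response if you need it.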

Template Note

The best local Step setup uses the included step37-native-tool-response-template.jinja template. It renders tool outputs as tool_response turns and escapes protocol-boundary tokens inside tool output. This is a general protocol-adapter fix: tool/file/search results stay observations instead of being flattened into user text.

Download:

curl -L -o step37-native-tool-response-template.jinja \
  https://huggingface.co/jcbtc/Step-3.7-Flash-ROCmFPX-Q3-QualityPlus/resolve/main/step37-native-tool-response-template.jinja

The template fix matters for real agents because Step 3.7 can otherwise confuse tool output with conversation authority, especially in file/search-result injection cases.

Build Notes

These are model-build notes, not runner-build instructions. Build the pinned ROCmFPX runner in the section above before serving the GGUFs.

The QualityPlus policy used here:

huge ffn_*_exps tensors: q3_0_rocmfpx
attention q/output protected at q5_K
attention k/v protected at q4_K
shared/dense FFN protected at q5_K
output/token embeddings at q4_0_rocmfp4_fast

Converter-reported size: 83726.08 MiB / 3.57 BPW, 9 shards.
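
The policy reads naturally as an ordered list of name patterns mapped to quantization types. The sketch below is illustrative only: the tensor-name patterns follow common llama.cpp GGUF naming and are assumptions, not the converter's actual rules, and the `unchanged` fallback is made up for the example.

```python
from fnmatch import fnmatch

# Ordered (pattern, type) pairs; the first match wins.
# Patterns are assumed tensor names, not taken from the real converter.
POLICY = [
    ("*ffn_*_exps*", "q3_0_rocmfpx"),       # huge routed-expert FFN tensors
    ("*attn_q.*", "q5_K"),                  # attention query
    ("*attn_output.*", "q5_K"),             # attention output
    ("*attn_k.*", "q4_K"),                  # attention key
    ("*attn_v.*", "q4_K"),                  # attention value
    ("*ffn_*_shexp.*", "q5_K"),             # shared-expert FFN
    ("*ffn_gate.*", "q5_K"),                # dense FFN gate
    ("*ffn_up.*", "q5_K"),                  # dense FFN up
    ("*ffn_down.*", "q5_K"),                # dense FFN down
    ("output.*", "q4_0_rocmfp4_fast"),      # output projection
    ("token_embd.*", "q4_0_rocmfp4_fast"),  # token embeddings
]


def quant_type_for(tensor_name: str, default: str = "unchanged") -> str:
    """Return the first matching quantization type for a tensor name."""
    for pattern, qtype in POLICY:
        if fnmatch(tensor_name, pattern):
            return qtype
    return default
```

Ordering matters in this sketch: the routed-expert pattern is listed first so the large `ffn_*_exps` tensors are assigned the Q3 type before any broader FFN rule could apply.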

Credits

Base model: stepfun-ai/Step-3.7-Flash
MTP draft GGUF source: notSnix/Step-3.7-Flash-MTP-Draft-GGUF
ROCmFPX creator: Charlie, charlie12345 / @italianclownz, charlie12345/ROCmFPX
Pinned public runner fork and build recipe: ciru-ai/ROCmFPX, current recommended pin 221402af8574faf652b101b6afe225a3f329561f
Quantization, the ROCmFPX Step 3.7 Q3 QualityPlus recipe, Strix Halo profile, and local benchmark work: Crown / Ciru

Caveats

This is a custom ROCmFPX GGUF release. It requires the compatible ROCmFPX/Chadrock llama.cpp runner; stock llama.cpp is not expected to load it.
Quality numbers are local Strix Halo measurements and depend on runtime, chat template, KV type, and MTP settings.
The model is strong but not perfect at autonomous email/message side effects; it can be cautious and ask for subject/body/recipient details instead of sending with inferred defaults.

Model size: 197B params
Architecture: step35
