Step 3.7 Flash MTP Draft GGUFs

These are companion MTP draft GGUFs for speculative decoding. They are not standalone chat models.

Use them with the full model repo:

notSnix/Step-3.7-Flash-Q4_K_M-GGUF

The draft GGUF is passed with --model-draft; the full model is passed with --model.

Files

File Size SHA256 Purpose
Step-3.7-Flash-MTP-Q8_0.gguf 3.5 GB 017de8990140621b5b4af431448f20873fbf0b052f6c50d2afac15f45802a98d Recommended MTP draft
Step-3.7-Flash-MTP-Q6_K.gguf 2.7 GB f41736e0dcce133d0dd0b81e14bd2965091e27dff306a28cec11ceb19fadbf46 Smaller Q6_K MTP draft
Step-3.7-Flash-MTP-Q4_K_M.gguf 2.0 GB 44118cfe64f45b38127ad6fb626e16bd94ee5a827cb34aa83d9e6df3450aebaf Smaller MTP draft
Step-3.7-Flash-MTP-BF16.gguf 6.5 GB fd811c81d14c786d314d8006655bba61971059abcfdfb6109ce83fd768f8b289 Experimental BF16 MTP draft

Runtime

Current llama.cpp main supports Step MTP-tail draft loading natively. This was smoke-tested with clean llama.cpp commit d545a2a993849fcf3b752d85ae256fc9d6a9de79.

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build-cuda -DGGML_CUDA=ON
cmake --build build-cuda --config Release -j

Usage

llama-server \
  --model Step-3.7-Flash-Q4_K_M.gguf \
  --model-draft Step-3.7-Flash-MTP-Q8_0.gguf \
  --host 0.0.0.0 \
  --port 8000 \
  --ctx-size 262144 \
  --n-gpu-layers all \
  --split-mode layer \
  --parallel 1 \
  --reasoning on \
  --reasoning-format deepseek \
  --spec-type draft-mtp \
  --spec-draft-n-max 2 \
  --spec-draft-p-min 0.60

Which Draft Should I Use?

Use Step-3.7-Flash-MTP-Q8_0.gguf first. It was the best local default in testing.

Use Step-3.7-Flash-MTP-Q6_K.gguf if you want a smaller draft file while staying above Q4.

Use Step-3.7-Flash-MTP-Q4_K_M.gguf if you want the smaller draft file.

Use Step-3.7-Flash-MTP-BF16.gguf for experimentation.

Checksums

sha256sum -c SHA256SUMS

Notes

  • These files intentionally keep the upstream Step MTP tail-layer numbering (blk.45, blk.46, blk.47).
  • They are companion speculative-decoding draft GGUFs, not full-model quants.
  • The full Q4_K_M model is hosted separately so Hugging Face's GGUF widget does not display draft files as tiny full-model quantizations.
  • This is a community GGUF conversion of the upstream Apache-2.0 model, not an official StepFun release.
Downloads last month
-
GGUF
Model size
3B params
Architecture
step35
Hardware compatibility
Log In to add your hardware

4-bit

6-bit

8-bit

16-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for notSnix/Step-3.7-Flash-MTP-Draft-GGUF

Quantized
(23)
this model