Step-3.5-Flash MTP GGUF
Experimental same-GGUF Step-3.5-Flash builds for tnhnyzc/llama.cpp, based on stepfun-ai/Step-3.5-Flash.
This is not intended for stock llama.cpp. The GGUFs contain the normal Step target model plus MTP / nextn tensors, so the fork can use the same file for MTP speculative decoding. No separate draft model file is needed.
Upstream Step 3.5 MTP support is being worked on in ggml-org/llama.cpp#23274. That PR uses llama.cpp's separate draft-model MTP path (--spec-type draft-mtp with -md). These published folders target the experimental same-GGUF fork as-is; upstream-style MTP serving requires a separate draft-only GGUF.
Files
The model files are split into folders. Choose one quant folder and load the first shard; the split-aware llama.cpp loader will find the sibling shards.
- Step-3.5-Flash-MTP-IQ4_XS-3.90BPW-Q8_MTP/
- Step-3.5-Flash-MTP-IQ3_S-3.64BPW-Q8_MTP/
- Step-3.5-Flash-MTP-IQ3_XXS-3.27BPW-Q8_MTP/
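If you script the launch, picking the first shard can be automated. The sketch below is a minimal helper, assuming the shards follow the usual llama.cpp split naming (`<prefix>-00001-of-0000N.gguf`); the actual file names in each folder are not listed here, so treat the pattern and the `first_shard` name as assumptions.

```python
import re

# Assumption: shards use the llama.cpp gguf-split naming scheme
# "<prefix>-00001-of-0000N.gguf". Adjust the pattern if the folder differs.
SHARD_RE = re.compile(r"-(\d{5})-of-(\d{5})\.gguf$")


def first_shard(names):
    """Return the name of shard 1 from a list of file names.

    Raises ValueError if no shards are found, if the shards disagree on the
    total count, or if any sibling shard is missing.
    """
    found = {}      # shard index -> file name
    totals = set()  # every "of N" value seen
    for name in names:
        match = SHARD_RE.search(name)
        if match is None:
            continue  # ignore non-shard files (README, imatrix, ...)
        found[int(match.group(1))] = name
        totals.add(int(match.group(2)))

    if not found:
        raise ValueError("no split GGUF shards found")
    if len(totals) != 1:
        raise ValueError(f"shards disagree on total count: {sorted(totals)}")

    total = totals.pop()
    missing = sorted(set(range(1, total + 1)) - set(found))
    if missing:
        raise ValueError(f"missing shards: {missing}")
    return found[1]
```

In practice this would be called as `first_shard(os.listdir("/path/to/quant-folder"))`, and the result joined with the folder path and passed to `--model`. Checking for missing siblings up front gives a clearer error than a failed load after a partial download.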
In these files, the MTP / nextn tensors are kept at Q8_0. The public metadata reports step35.nextn_predict_layers = 1.
The calibration imatrix is Bartowski's stepfun-ai_Step-3.5-Flash-imatrix.gguf from bartowski/stepfun-ai_Step-3.5-Flash-GGUF. The IQ4_XS variant follows the public AesSedai Step-3.5-Flash IQ4_XS expert layout. The IQ3_S and IQ3_XXS variants are smaller custom expert layouts.
Usage
Build the fork, then:
./build/bin/llama-server \
--model /path/to/quant-folder/first-shard.gguf \
--ctx-size 131072 \
-ctk q8_0 -ctv q8_0 \
-ctkd q8_0 -ctvd q8_0 \
-ngl 99 \
-np 1 \
-b 4096 \
-ub 1024 \
-fa on \
--cache-prompt \
--cache-ram 8192 \
-mtp \
--draft 1
Add normal sampler/server args as needed.
For stochastic p/q verification experiments:
--spec-draft-pq-accept
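For orientation, the two verification styles can be contrasted with a small sketch. This is the generic speculative-sampling acceptance rule, not the fork's code: it assumes "exact-match" means a draft token is kept only when it equals the token the target model selects, and that "p/q" means accepting a draft token with probability min(1, p/q), where p is the target model's probability for that token and q is the draft's. The function names are hypothetical.

```python
import random


def accept_exact(draft_token, target_token):
    """Exact-match rule: keep the draft token only if the target agrees."""
    return draft_token == target_token


def accept_pq(p_target, q_draft, rng=random.random):
    """Stochastic rule: accept with probability min(1, p_target / q_draft).

    p_target: target-model probability of the drafted token.
    q_draft:  draft-model probability of the same token (must be > 0,
              since the draft model sampled it).
    rng:      callable returning a float in [0, 1); injectable for testing.
    """
    if q_draft <= 0.0:
        raise ValueError("q_draft must be positive for a sampled token")
    return rng() < min(1.0, p_target / q_draft)
```

Under this rule a token the target likes at least as much as the draft (p >= q) is always accepted, while a token the target assigns zero probability is never accepted. Full speculative sampling also resamples from a corrected distribution after a rejection; that step is omitted here.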
Notes
- Tested locally on Apple M3 Max / Metal.
- Recommended starting point: -mtp --draft 1 with the default exact-match verifier. --draft 2+, dGPU performance, and long-run production stability are not proven.
- The published files report one trained nextn layer. Deeper drafts reuse that layer recurrently.
- StepFun's step3p5-mtp branch can run MTP, but in local server testing its prompt-cache path disabled MTP unless --cache-ram 0 was used. The linked fork includes additional prompt-cache/MTP handling.
- Observed speedups are prompt-dependent. One local 384-token IQ3_S smoke showed 26.87 t/s without MTP and 32.55 t/s with -mtp --draft 1, with 168/214 draft tokens accepted.
- Treat the numbers as directional. Context length, sampler settings, cache reuse, memory pressure, and host load can move them.
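The figures from that single IQ3_S run can be turned into an acceptance rate and a speedup with a few lines of arithmetic. The helper names below are illustrative only, and the "tokens per pass" figure assumes each verification pass with --draft 1 emits one target token plus the draft token when it is accepted.

```python
def acceptance_rate(accepted, drafted):
    """Fraction of drafted tokens that the verifier accepted."""
    return accepted / drafted


def speedup(baseline_tps, mtp_tps):
    """Throughput ratio of the MTP run over the non-MTP run."""
    return mtp_tps / baseline_tps


# Numbers reported for the 384-token IQ3_S run above.
rate = acceptance_rate(168, 214)
gain = speedup(26.87, 32.55)

# Assumption: one target token per verification pass, plus the accepted draft.
tokens_per_pass = 1 + rate

print(f"acceptance: {rate:.1%}")                 # acceptance: 78.5%
print(f"speedup: {gain:.2f}x")                   # speedup: 1.21x
print(f"tokens/pass: {tokens_per_pass:.2f}")     # tokens/pass: 1.79
```

The gap between roughly 1.79 tokens per pass and the observed 1.21x throughput gain reflects the cost of drafting and verification itself, which is one reason to treat a single run as directional.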
See the GitHub README for current limitations and implementation details.