Step-3.5-Flash MTP GGUF

Experimental same-GGUF Step-3.5-Flash builds for tnhnyzc/llama.cpp, based on stepfun-ai/Step-3.5-Flash.

These builds are not intended for stock llama.cpp. Each GGUF contains the normal Step target model plus the MTP / nextn tensors, so the fork can use the same file for MTP speculative decoding. No separate draft model file is needed.

Upstream Step 3.5 MTP support is being worked on in ggml-org/llama.cpp#23274. That PR uses llama.cpp's separate draft-model MTP path (--spec-type draft-mtp with -md). These published folders target the experimental same-GGUF fork as-is; upstream-style MTP serving requires a separate draft-only GGUF.

Files

Each quant is published as its own folder of split GGUF shards. Choose one quant folder and load the first shard; the split-aware llama.cpp loader will find the sibling shards.

  • Step-3.5-Flash-MTP-IQ4_XS-3.90BPW-Q8_MTP/
  • Step-3.5-Flash-MTP-IQ3_S-3.64BPW-Q8_MTP/
  • Step-3.5-Flash-MTP-IQ3_XXS-3.27BPW-Q8_MTP/
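
To pick the first shard programmatically, a minimal sketch follows. It assumes the shards use llama.cpp's usual gguf-split naming (a suffix such as -00001-of-00005.gguf); that pattern is an assumption, so check it against the actual filenames. The helper names pick_first_shard and first_shard_in are hypothetical.

```python
import re
from pathlib import Path

# Assumption: shard files end in "-NNNNN-of-MMMMM.gguf" (gguf-split convention).
SHARD_RE = re.compile(r"-(\d{5})-of-(\d{5})\.gguf$")


def pick_first_shard(filenames):
    """Return the single filename whose shard index is 1."""
    firsts = []
    for name in filenames:
        match = SHARD_RE.search(name)
        if match and int(match.group(1)) == 1:
            firsts.append(name)
    if len(firsts) != 1:
        raise ValueError(f"expected exactly one first shard, found {len(firsts)}")
    return firsts[0]


def first_shard_in(quant_dir):
    """Return the path of the first shard inside one quant folder."""
    folder = Path(quant_dir)
    names = [p.name for p in folder.glob("*.gguf")]
    return folder / pick_first_shard(names)
```

The returned path is what goes into --model; the remaining shards only need to sit in the same folder.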

In these files, the MTP / nextn tensors are kept at Q8_0. The public metadata reports step35.nextn_predict_layers = 1.
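
The metadata value can be checked without loading the model. The sketch below reads only the key/value header using the Python standard library. It assumes the GGUF v2/v3 little-endian layout (magic, uint32 version, uint64 tensor count, uint64 key/value count) and that the first shard carries the model metadata; read_gguf_metadata is a hypothetical helper, not part of llama.cpp.

```python
import io
import struct

# GGUF scalar value type IDs mapped to little-endian struct formats.
_SCALARS = {0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
            6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d"}
_STRING, _ARRAY = 8, 9


def _read(stream, fmt):
    """Read and unpack one fixed-size value."""
    size = struct.calcsize(fmt)
    data = stream.read(size)
    if len(data) != size:
        raise EOFError("truncated GGUF header")
    return struct.unpack(fmt, data)[0]


def _read_string(stream):
    """GGUF strings are a uint64 length followed by UTF-8 bytes."""
    length = _read(stream, "<Q")
    return stream.read(length).decode("utf-8")


def _read_value(stream, vtype):
    if vtype in _SCALARS:
        return _read(stream, _SCALARS[vtype])
    if vtype == _STRING:
        return _read_string(stream)
    if vtype == _ARRAY:
        elem_type = _read(stream, "<I")
        count = _read(stream, "<Q")
        return [_read_value(stream, elem_type) for _ in range(count)]
    raise ValueError(f"unknown GGUF value type {vtype}")


def read_gguf_metadata(stream):
    """Return the header key/value pairs as a dict (assumes GGUF v2 or v3)."""
    if stream.read(4) != b"GGUF":
        raise ValueError("not a GGUF file")
    _version = _read(stream, "<I")
    _tensor_count = _read(stream, "<Q")
    kv_count = _read(stream, "<Q")
    meta = {}
    for _ in range(kv_count):
        key = _read_string(stream)
        vtype = _read(stream, "<I")
        meta[key] = _read_value(stream, vtype)
    return meta


# Usage (path is a placeholder):
# with open("/path/to/quant-folder/first-shard.gguf", "rb") as f:
#     print(read_gguf_metadata(f).get("step35.nextn_predict_layers"))
```

Large tokenizer arrays in the header make this slower than a dedicated reader, but it is enough for a one-off check.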

The calibration imatrix is Bartowski's stepfun-ai_Step-3.5-Flash-imatrix.gguf from bartowski/stepfun-ai_Step-3.5-Flash-GGUF. The IQ4_XS variant follows the public AesSedai Step-3.5-Flash IQ4_XS expert layout. The IQ3_S and IQ3_XXS variants are smaller custom expert layouts.

Usage

Build the fork, then:

./build/bin/llama-server \
  --model /path/to/quant-folder/first-shard.gguf \
  --ctx-size 131072 \
  -ctk q8_0 -ctv q8_0 \
  -ctkd q8_0 -ctvd q8_0 \
  -ngl 99 \
  -np 1 \
  -b 4096 \
  -ub 1024 \
  -fa on \
  --cache-prompt \
  --cache-ram 8192 \
  -mtp \
  --draft 1

Add normal sampler/server args as needed.

To experiment with stochastic p/q verification instead of the default exact-match verifier, add:

--spec-draft-pq-accept
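
For scripted runs, the command above can be assembled in code. This is a minimal sketch that only reproduces the flags listed in this section; build_server_cmd and its parameters are hypothetical names, and the flags are assumed to behave as described for the fork.

```python
import shlex


def build_server_cmd(model_path, ctx_size=131072, draft=1, pq_accept=False,
                     binary="./build/bin/llama-server"):
    """Assemble the llama-server argument list shown in this README."""
    cmd = [
        binary,
        "--model", str(model_path),
        "--ctx-size", str(ctx_size),
        "-ctk", "q8_0", "-ctv", "q8_0",
        "-ctkd", "q8_0", "-ctvd", "q8_0",
        "-ngl", "99",
        "-np", "1",
        "-b", "4096",
        "-ub", "1024",
        "-fa", "on",
        "--cache-prompt",
        "--cache-ram", "8192",
        "-mtp",
        "--draft", str(draft),
    ]
    if pq_accept:
        # Optional stochastic p/q verification flag from this README.
        cmd.append("--spec-draft-pq-accept")
    return cmd


if __name__ == "__main__":
    # Print a copy-pasteable command line; the model path is a placeholder.
    print(shlex.join(build_server_cmd("/path/to/quant-folder/first-shard.gguf")))
```

The list form can be passed directly to subprocess.run, which avoids shell quoting issues with paths that contain spaces.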

Notes

  • Tested locally on Apple M3 Max / Metal.
  • Recommended starting point: -mtp --draft 1 with the default exact-match verifier.
  • --draft 2+, dGPU performance, and long-run production stability are not proven.
  • The published files report one trained nextn layer. Deeper drafts reuse that layer recurrently.
  • StepFun's step3p5-mtp branch can run MTP, but in local server testing its prompt-cache path disabled MTP unless --cache-ram 0 was used. The linked fork includes additional prompt-cache/MTP handling.
  • Observed speedups are prompt-dependent. One local 384-token IQ3_S smoke test showed 26.87 t/s without MTP and 32.55 t/s with -mtp --draft 1, with 168 of 214 draft tokens accepted.
  • Treat the numbers as directional. Context length, sampler settings, cache reuse, memory pressure, and host load can move them.
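
The smoke-test figures above translate into a speedup and an acceptance rate as follows. This is plain arithmetic on the reported numbers; the tokens-per-step estimate additionally assumes one draft token per verification step, which is an assumption about how the counters were collected.

```python
def speedup(baseline_tps, mtp_tps):
    """Throughput ratio of the MTP run over the baseline run."""
    return mtp_tps / baseline_tps


def acceptance_rate(accepted, drafted):
    """Fraction of drafted tokens that the verifier accepted."""
    return accepted / drafted


s = speedup(26.87, 32.55)
a = acceptance_rate(168, 214)

# Assumption: with --draft 1, each verification step emits one target token
# plus the draft token when it is accepted.
tokens_per_step = 1 + a

print(f"speedup: {s:.2f}x, acceptance: {a:.1%}")
# prints "speedup: 1.21x, acceptance: 78.5%"
```

A roughly 79% acceptance rate yielding only a 1.21x speedup is consistent with drafting and verification having their own cost, so acceptance alone does not predict throughput.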

See the GitHub README for current limitations and implementation details.
