Usage

For now, you need the gg/spec-mtp-experiments branch of llama.cpp or a custom MTP fork.

After cloning the llama.cpp repository and entering its directory, switch to the MTP branch with git checkout gg/spec-mtp-experiments.
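As a rough sketch, the clone, checkout, and build steps could look like the script below. The branch name comes from this README; the upstream repository URL, the plain CMake build (no GPU backend flags), and the `run` dry-run helper are my assumptions for illustration, so adjust them for your setup. By default the script only prints the commands.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Assumption: the upstream repository URL. Swap in your fork's URL if you use one.
REPO_URL="https://github.com/ggml-org/llama.cpp"
# Branch name as given in this README.
BRANCH="gg/spec-mtp-experiments"

# Hypothetical helper: print each command by default; set DRY_RUN=0 to execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

run git clone "$REPO_URL"
run cd llama.cpp
run git checkout "$BRANCH"

# Default CMake build; add the backend options your hardware needs.
run cmake -B build
run cmake --build build --config Release
```

Run it once as-is to review the commands, then again with DRY_RUN=0 to execute them.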

Add --spec-type mtp --spec-draft-n-max 5 --spec-draft-n-min 0 to your llama-server or llama-cli command.

Feel free to tweak --spec-draft-n-max and find out what works best for your setup.
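To make tweaking easier, here is a minimal sketch that assembles a llama-server command with the MTP flags and lets you override the draft bounds from the environment. The three --spec-* flags, the model path, and -ngl all come from this README; the llama-server binary path is assumed to mirror the llama-cli path used in my benchmark, and the variable names (SERVER_BIN, MODEL, N_MAX, N_MIN, RUN) are hypothetical. It prints the command unless RUN=1 is set.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Assumed paths; adjust for your setup.
SERVER_BIN="${SERVER_BIN:-/llama.cpp/build/bin/llama-server}"
MODEL="${MODEL:-/ai/models/Qwen3.5-4B-Q5_K_M-mtp.gguf}"

# Draft-length bounds, overridable from the environment (hypothetical variable names).
N_MAX="${N_MAX:-5}"
N_MIN="${N_MIN:-0}"

# MTP speculative decoding flags from this README.
spec_args=(--spec-type mtp --spec-draft-n-max "$N_MAX" --spec-draft-n-min "$N_MIN")

cmd=("$SERVER_BIN" -m "$MODEL" -ngl all "${spec_args[@]}")

# Print the command by default; set RUN=1 to actually start the server.
if [ "${RUN:-0}" = "1" ]; then
  exec "${cmd[@]}"
else
  printf '%s\n' "${cmd[*]}"
fi
```

For example, N_MAX=3 RUN=1 ./serve-mtp.sh would start the server with --spec-draft-n-max 3 (the script name is only a placeholder).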

Try not to push --spec-draft-n-min too far; keep it in single digits.

In my testing, generation speed varied with --spec-draft-n-min as follows:

--spec-draft-n-min   Generation speed (t/s)
0                    47.7
1                    47.2
2                    47.8
3                    44.2
4                    44.9
5                    36.2

The launch command was as follows, with "$i" set to the --spec-draft-n-min value under test:

/llama.cpp/build/bin/llama-cli -st -p 'What is the antiderivative of x^3?' --verbose-prompt --prio 3 --batch-size 1024 --ubatch-size 1024 --mmap --perf --flash-attn on --fit-ctx 16384 -ctk q8_0 -ctv q8_0 -m /ai/models/Qwen3.5-4B-Q5_K_M-mtp.gguf -ngl all --spec-type mtp --spec-draft-n-max 5 --spec-draft-n-min "$i" -fitt 2048
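The "$i" in that command is a shell loop variable. A sweep over the tested values could be sketched roughly like this; the llama-cli flags are copied from the command above, while the loop, the build_cmd helper, and the RUN switch are illustrative assumptions rather than the exact script I used. By default it only prints each command.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Paths from the launch command above; override them for your setup.
LLAMA_CLI="${LLAMA_CLI:-/llama.cpp/build/bin/llama-cli}"
MODEL="${MODEL:-/ai/models/Qwen3.5-4B-Q5_K_M-mtp.gguf}"

# Hypothetical helper: fill the global array `cmd` for one --spec-draft-n-min value.
build_cmd() {
  local n_min="$1"
  cmd=("$LLAMA_CLI" -st -p 'What is the antiderivative of x^3?'
    --verbose-prompt --prio 3 --batch-size 1024 --ubatch-size 1024
    --mmap --perf --flash-attn on --fit-ctx 16384 -ctk q8_0 -ctv q8_0
    -m "$MODEL" -ngl all --spec-type mtp --spec-draft-n-max 5
    --spec-draft-n-min "$n_min" -fitt 2048)
}

for i in 0 1 2 3 4 5; do
  build_cmd "$i"
  if [ "${RUN:-0}" = "1" ]; then
    # Run the benchmark; --perf makes llama-cli report timing information.
    echo "== --spec-draft-n-min $i =="
    "${cmd[@]}"
  else
    # Dry run: print the shell-quoted command instead of executing it.
    printf '%q ' "${cmd[@]}"
    echo
  fi
done
```

Set RUN=1 to execute the runs back to back, then compare the generation speeds reported for each value.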

Credits

  • Qwen for this amazing model.
  • Unsloth for the imatrix quantization file.
  • All of the llama.cpp and ggml contributors for allowing me to run AI models locally.