Why is using native MTP speculative decoding slower than the no-spec baseline?

#4
by tianxiaoban - opened

My device is a 5060 Ti 16GB with 32GB of RAM. For the same question, generation runs at 11.77 t/s with native MTP speculative decoding, but at 21 t/s with the no-spec baseline. I'm a beginner and can't tell where I went wrong.
[Image: use-no-spec baseline]

[Image: use-mtp]

llama.cpp HEAD implementation gaps.
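
As a rough illustration of how speculative decoding can end up slower than the baseline, here is a minimal Python sketch of the usual acceptance-rate cost model. It is not taken from llama.cpp: the function names, the assumption that each drafted token is accepted independently with the same probability, and the example cost values are all hypothetical. Only the 11.77 t/s and 21 t/s figures come from the post above.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification step when k tokens are drafted.

    Assumes (hypothetically) that each drafted token is accepted independently
    with probability alpha; the step always yields at least one token.
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def relative_speed(alpha: float, k: int, draft_cost: float,
                   verify_cost: float = 1.0) -> float:
    """Throughput relative to plain decoding (1.0 means the same speed).

    Costs are in units of one ordinary decode step. draft_cost is the cost of
    drafting one token; verify_cost is the cost of the pass that checks the
    drafted tokens (1.0 only if that pass is as cheap as a normal step).
    """
    step_cost = k * draft_cost + verify_cost
    return expected_tokens_per_step(alpha, k) / step_cost


# Ratio reported in the original post: MTP speed divided by baseline speed.
observed_ratio = 11.77 / 21.0

if __name__ == "__main__":
    print(f"observed MTP/baseline ratio: {observed_ratio:.2f}")
    # Illustrative settings only, not measurements of any real engine.
    for alpha, verify_cost in [(0.8, 1.0), (0.3, 1.0), (0.3, 2.0)]:
        speed = relative_speed(alpha, k=3, draft_cost=0.1, verify_cost=verify_cost)
        print(f"alpha={alpha}, verify_cost={verify_cost}: {speed:.2f}x baseline")
```

In this model the ratio drops below 1.0 whenever few drafted tokens are accepted or the verification pass costs noticeably more than a normal decode step. The model alone cannot say which of these factors is behind the reported numbers.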

What’s Qwopus? If you have the full tensors, you can use the published Qwen3.6-27B-DQ recipe: https://x.com/ex0byt/status/2054258311013265595?s=46

Okay, so after looking into it, llama.cpp still can't fully support MTP.

I released an EAGLE-3 specdec for the model a while ago, feel free to use it in oMLX or another supporting inference engine :)

Ex0bit changed discussion status to closed

Okay, thank you for the suggestion. I'll try again when I have time.
