MTP version

#2
by george-kc-chung - opened

Could you please make an MTP version?

BugTraceAI org

Hey george-kc-chung! 👋

MTP (Multi-Token Prediction) is on our radar, but it's not a simple re-export: the model has to be trained with MTP heads from scratch (or fine-tuned with them). MTP can't just be applied post-hoc to an existing GGUF.

The current Ultra is a standard next-token prediction model. Adding MTP would mean:

- Modifying the training objective to predict N future tokens simultaneously
- Re-running the full training pipeline on H100
- Relying on llama.cpp's MTP support, which is experimental and still maturing

Short answer: not in the current release cycle, but it's a valid request. If there's enough community interest we'll factor it into the next training run.
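
To make the training-objective change more concrete, here is a rough sketch of what "predict N future tokens" could look like as a loss. This is not our training code: `mtp_loss`, `head_logits`, and the equal weighting of heads are assumptions made up for illustration, and real MTP setups differ in how the heads are built and weighted. The idea is only that head k at position t is scored against the token at position t + k + 1, and the cross-entropy is averaged over every (head, position) pair that has a target.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def mtp_loss(head_logits, tokens):
    """Average cross-entropy over N prediction heads (illustrative only).

    head_logits[k][t] holds the logits that head k emits at position t;
    head k is trained to predict tokens[t + k + 1].
    With a single head (N = 1) this is ordinary next-token prediction.
    """
    total, count = 0.0, 0
    for k, per_position in enumerate(head_logits):
        offset = k + 1  # head 0 looks 1 token ahead, head 1 looks 2 ahead, ...
        for t, logits in enumerate(per_position):
            target_index = t + offset
            if target_index >= len(tokens):
                break  # no target this far ahead near the end of the sequence
            total -= log_softmax(logits)[tokens[target_index]]
            count += 1
    if count == 0:
        raise ValueError("sequence too short for any head to have a target")
    return total / count


# Toy example: vocabulary of 4, sequence of 4 tokens, N = 2 heads,
# every head outputs uniform logits at every position.
tokens = [0, 1, 2, 3]
uniform = [[0.0, 0.0, 0.0, 0.0] for _ in tokens]
loss = mtp_loss([uniform, uniform], tokens)
```

With uniform logits the loss equals log(vocabulary size), which is a handy sanity check. The extra heads are the reason an existing next-token checkpoint can't simply be re-exported: they have to be trained.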

In the meantime, if your main goal is faster inference, the Q4_K_S variant with n_batch=512 and Flash Attention enabled gives solid tokens/sec on most setups. The Q6_K is the better choice for quality if VRAM allows.
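
If you happen to run the GGUF through llama-cpp-python (an assumption, since other runtimes expose these settings differently), the settings above could be passed roughly like this. The helper names and the file name `ultra.Q4_K_S.gguf` are placeholders, and `n_gpu_layers=-1` (offload all layers to the GPU) is an extra assumption that was not part of the suggestion above.

```python
def build_llama_kwargs(model_path, n_batch=512, flash_attn=True, n_gpu_layers=-1):
    """Collect the keyword arguments for llama_cpp.Llama (hypothetical helper)."""
    return {
        "model_path": model_path,      # path to the quantized GGUF file
        "n_batch": n_batch,            # prompt-processing batch size
        "flash_attn": flash_attn,      # enable Flash Attention
        "n_gpu_layers": n_gpu_layers,  # assumption: -1 offloads every layer
    }


def load_model(model_path):
    """Load the model; needs `pip install llama-cpp-python` and a real GGUF file."""
    from llama_cpp import Llama  # imported lazily so the helper above works without it
    return Llama(**build_llama_kwargs(model_path))


# Placeholder file name; substitute the path of the file you downloaded.
kwargs = build_llama_kwargs("ultra.Q4_K_S.gguf")
```

Swapping in the Q6_K file only changes `model_path`; the other settings can stay the same.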
