mtp
#1 by festr2 - opened
Hello,
Is MTP still possible?
Hey @festr2, we'd need to run our pruning procedure on the MTP block too to keep a model with uniform num_experts. Pruning could also affect speedup from MTP in this case. We'll look into keeping a pruned MTP layer!
That would be best, if possible. Enabling MTP in sglang gives me roughly a 1.5x to 2x speedup on the original FP8 model.
On 4x RTX 6000 PRO with FP8: 58 tokens/sec without MTP, 90-105 tokens/sec with MTP.
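For reference, the speedup implied by the throughput figures quoted above can be checked with a few lines of Python. This is only a minimal arithmetic sketch: the helper name `speedup` is made up for illustration, and the inputs are the numbers reported in this thread, not new measurements.

```python
def speedup(baseline_tps: float, mtp_tps: float) -> float:
    """Return the throughput ratio of an MTP run over a non-MTP baseline."""
    if baseline_tps <= 0:
        raise ValueError("baseline throughput must be positive")
    return mtp_tps / baseline_tps


# Figures reported in this thread (4x RTX 6000 PRO, FP8), in tokens/sec.
baseline = 58.0
mtp_low, mtp_high = 90.0, 105.0

low = speedup(baseline, mtp_low)
high = speedup(baseline, mtp_high)
print(f"{low:.2f}x to {high:.2f}x")  # prints "1.55x to 1.81x"
```

So the reported numbers work out to about 1.55x to 1.81x on this setup, at the lower end of the 1.5x to 2x range.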
Need an NVFP4 version for us 2x RTX 6000 PRO users!