mtp

#1
by festr2 - opened

Hello,

Is MTP still possible?

Cerebras org

Hey @festr2, we'd need to run our pruning procedure on the MTP block too, to keep a model with uniform num_experts. Pruning could also affect the speedup from MTP in this case. We'll look into keeping a pruned MTP layer!

That would be best, if possible. Enabling MTP in SGLang gives me a 1.5x to 2x speedup for the original FP8 model.

On 4x RTX 6000 PRO with FP8: 58 tokens/sec without MTP, 90-105 tokens/sec with MTP.
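
For anyone comparing their own setup, here is a quick sketch of the arithmetic behind those numbers. The function name `mtp_speedup` is just a hypothetical helper for illustration, not part of SGLang or any library; the throughput values are the ones reported above.

```python
def mtp_speedup(baseline_tps: float, mtp_tps: float) -> float:
    """Return the throughput ratio of an MTP run over a non-MTP baseline."""
    if baseline_tps <= 0:
        raise ValueError("baseline throughput must be positive")
    return mtp_tps / baseline_tps


# Reported figures: 58 tokens/sec without MTP, 90-105 tokens/sec with MTP.
baseline = 58.0
low = mtp_speedup(baseline, 90.0)
high = mtp_speedup(baseline, 105.0)

print(f"speedup range: {low:.2f}x to {high:.2f}x")  # speedup range: 1.55x to 1.81x
```

So on this hardware the measured gain sits at roughly 1.55x to 1.81x, inside the 1.5x to 2x range mentioned.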

Need an NVFP4 version for 2x RTX 6000 PRO users!
