Quantization of the Model
#9
by shiva2022 - opened
I have been comparing this model with the DSV3.2 quantization published by NVIDIA. Running this model with sglang on RTX-PRO-6000 is significantly slower than running the NVIDIA DSV3.2 quantization. This model seems to quantize only the MoE layers and not the other layers. The voipmonitor/sglang container also behaves very differently from the base containers from lmsys.
I am not sure whether the author has any thoughts on the quantization of this model.
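For anyone who wants to check which layers are actually quantized, here is a rough sketch that reads only the JSON header of a `.safetensors` shard and counts tensor dtypes per layer group. The name patterns in `categorize` (`.experts.`, `.self_attn.`) are assumptions about how the checkpoint names its tensors, not something confirmed for this model, and the shard filename in the usage note is hypothetical. Packed low-bit weights often show up as `U8`, but that is also an assumption to verify against the actual files.

```python
import json
import struct
from collections import Counter, defaultdict


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file without loading tensor data."""
    with open(path, "rb") as f:
        # The file starts with an 8-byte little-endian header length,
        # followed by that many bytes of JSON.
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len))


def categorize(name):
    """Map a tensor name to a coarse layer group (patterns are assumptions)."""
    if ".experts." in name:
        return "moe_experts"
    if ".self_attn." in name:
        return "attention"
    return "other"


def dtype_summary(header):
    """Count tensors per (layer group, dtype) from a safetensors header dict."""
    summary = defaultdict(Counter)
    for name, meta in header.items():
        if name == "__metadata__":  # optional free-form metadata entry
            continue
        summary[categorize(name)][meta["dtype"]] += 1
    return {group: dict(counts) for group, counts in summary.items()}
```

Usage would look like `dtype_summary(read_safetensors_header("model-00001-of-000NN.safetensors"))`. If the attention group reports only `BF16` while the expert group reports a packed or FP8 dtype, that would be consistent with only the MoE layers being quantized.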
Try the latest container, which has proper native sparse attention.