Quantization of the Model

by shiva2022 - opened 27 days ago

shiva2022

I have been comparing this model to the DSV3.2 quantization published by NVIDIA. Running this model with sglang on an RTX-PRO-6000 is significantly slower than running DSV3.2. This model seems to quantize only the MoE layers and not the other layers. The voipmonitor/sglang container also behaves very differently from the base containers from lmsys.
Does the author have any thoughts on the quantization of this model?
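
For anyone who wants to check which tensors are actually stored in a quantized dtype, one option is to read the header of a checkpoint shard directly. This is only a minimal sketch: it assumes the checkpoint is in safetensors format and that MoE expert weights have "experts" in their tensor names. Neither assumption is confirmed in this thread, and the `moe_marker` parameter and both function names are hypothetical. Packed low-bit formats may also show up as a plain byte dtype such as U8, so the counts are a hint rather than proof.

```python
import json
import struct
from collections import Counter, defaultdict


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file as a dict."""
    with open(path, "rb") as f:
        # The first 8 bytes hold the header length as a little-endian uint64.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    # "__metadata__" is an optional free-form entry, not a tensor.
    header.pop("__metadata__", None)
    return header


def dtype_summary(header, moe_marker="experts"):
    """Count tensor dtypes separately for MoE and non-MoE tensors.

    `moe_marker` is an assumed naming convention; adjust it to match
    the tensor names in the checkpoint being inspected.
    """
    summary = defaultdict(Counter)
    for name, info in header.items():
        group = "moe" if moe_marker in name else "other"
        summary[group][info["dtype"]] += 1
    return {group: dict(counts) for group, counts in summary.items()}
```

Calling `dtype_summary(read_safetensors_header(path))` on one shard gives a per-group count of dtypes, which makes it easy to see whether the non-MoE tensors are still in a 16-bit format.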

lukealonso

Owner 19 days ago

Try the latest container, which has proper native sparse attention.
