What engine should be used to run inference with this model?
Thank you for your contribution. My question is: what engine should be used to run inference with this model?
I'm wondering whether this model was quantized with https://github.com/vllm-project/llm-compressor/blob/main/examples/quantizing_moe/deepseek_moe_w4a16.py. Could you share any details about the quantization?
How do I run this model with vLLM? Could you give some tips or examples?
The model was quantized through https://github.com/vllm-project/llm-compressor/blob/main/examples/quantizing_moe/deepseek_moe_w4a16.py
The same example also illustrates how to run the model through vLLM: https://github.com/vllm-project/llm-compressor/blob/0a34a894b11f317fb46c7a4bac7e71cd6417a0ad/examples/quantizing_moe/deepseek_moe_w4a16.py#L98. Instead of SAVE_DIR, you would pass in the model stub.
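For reference, here is a minimal offline-inference sketch with vLLM. The model stub below is a placeholder, not the actual repo id; substitute the quantized model's Hugging Face stub (or a local directory) just as you would replace SAVE_DIR in the example script.

```python
from vllm import LLM, SamplingParams

# Placeholder: replace with the actual quantized model stub or local path.
MODEL_STUB = "your-org/deepseek-moe-w4a16"

# vLLM should pick up the compressed-tensors (W4A16) checkpoint format from
# the model config, so no extra quantization flag is usually required.
llm = LLM(model=MODEL_STUB, trust_remote_code=True, max_model_len=4096)

sampling = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)
outputs = llm.generate(["Briefly explain mixture-of-experts models."], sampling)
print(outputs[0].outputs[0].text)
```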
Why is inference with the quantized model significantly slower?
How are you running it?
After deepseek_moe_w4a16.py finishes, you will get an int4 model of roughly 112 GB. Run it with vLLM 0.6; I failed with version 0.5.4, so skip that one. See https://github.com/vllm-project/llm-compressor/issues/857.
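Since a ~112 GB checkpoint typically does not fit on a single GPU, here is a sketch of loading it across several GPUs with vLLM. The local path and tensor_parallel_size=4 are example values, not settings taken from this thread; adjust them to your hardware.

```python
from vllm import LLM, SamplingParams

# Example local path produced by deepseek_moe_w4a16.py (SAVE_DIR);
# tensor_parallel_size should match the number of GPUs you actually have.
llm = LLM(
    model="/path/to/deepseek-moe-w4a16",
    tensor_parallel_size=4,
    trust_remote_code=True,
    max_model_len=4096,
    gpu_memory_utilization=0.90,
)

params = SamplingParams(temperature=0.6, max_tokens=128)
print(llm.generate(["Hello"], params)[0].outputs[0].text)
```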