What engine should be used to infer this model?

#1 by RobertLiu0905 - opened

Thank you for your contribution. My question is: what inference engine should be used to run this model?

I'm wondering whether this model was quantized with https://github.com/vllm-project/llm-compressor/blob/main/examples/quantizing_moe/deepseek_moe_w4a16.py. Could you share any quantization details?
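For reference, my understanding of that example is roughly the sketch below. This is my own reconstruction from the llm-compressor GPTQ examples, not the script itself; the model ID, calibration dataset, and ignore list are guesses:

```python
# Rough sketch of a W4A16 MoE quantization run with llm-compressor.
# Model ID, calibration dataset, and ignore list are assumptions; the
# deepseek_moe_w4a16.py example in the repo is the authoritative version.
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "deepseek-ai/DeepSeek-V2"  # placeholder model
SAVE_DIR = "DeepSeek-V2-W4A16"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype="auto", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# GPTQ to 4-bit weights / 16-bit activations; the MoE router ("gate")
# and lm_head are usually left unquantized.
recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head", "re:.*mlp.gate$"],
)

# One-shot calibration pass over a small dataset.
oneshot(
    model=model,
    dataset="open_platypus",       # assumed calibration set
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```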

How do I run this model with vLLM? Could you give some tips or examples?
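For instance, is the usual offline workflow below the right way to load it? (A minimal sketch; the model path, GPU count, and context length are placeholders.)

```python
# Minimal offline-inference sketch with vLLM; model path, GPU count,
# and context length are placeholders, not verified settings.
from vllm import LLM, SamplingParams

llm = LLM(
    model="path/to/quantized-model",  # directory produced by llm-compressor
    tensor_parallel_size=8,           # adjust to your GPU count
    trust_remote_code=True,
    max_model_len=4096,
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Hello, who are you?"], params)
for out in outputs:
    print(out.outputs[0].text)
```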

NM Testing org

Why does the speed of the quantized model decrease significantly?

NM Testing org

How are you running it?

> How are you running it?

After deepseek_moe_w4a16.py finishes, you will get an INT4 model, roughly 112 GB in size. Then run it with vLLM 0.6; I failed with version 0.5.4, so skip that one. See https://github.com/vllm-project/llm-compressor/issues/857
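Once it is serving (e.g. started with something like `vllm serve <model-dir> --tensor-parallel-size 8 --trust-remote-code`), you can query it through vLLM's OpenAI-compatible API. A minimal sketch, with placeholder endpoint and model name:

```python
# Querying a model behind vLLM's OpenAI-compatible server;
# endpoint URL and served model name below are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="path/to/quantized-model",  # must match the served model name
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```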
