Issue running on vLLM using FP8

#3
by ffleandro - opened

I'm trying to run this model on an H100 NVL or H100 SXM using vLLM, but I'm hitting the following error:

Quantization method specified in the model config (compressed-tensors) does not match the quantization method specified in the `quantization` argument (fp8)

This is my run command:

--model neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8 --host 0.0.0.0 --port 80 --dtype auto --quantization fp8 --tensor-parallel-size 1 --download-dir /data --gpu_memory_utilization 0.95 --kv-cache-dtype fp8 --max_model_len 60000

Any hints on what I'm doing wrong?

Neural Magic org

Don't specify --quantization fp8. The model's config already declares its quantization method (compressed-tensors), so vLLM detects it automatically, and passing a conflicting --quantization flag triggers the mismatch error you're seeing.
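For example, your command should work with only the --quantization flag removed (a sketch based on your original command; all other flags left as you had them):

--model neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8 --host 0.0.0.0 --port 80 --dtype auto --tensor-parallel-size 1 --download-dir /data --gpu_memory_utilization 0.95 --kv-cache-dtype fp8 --max_model_len 60000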

Ah yes, thank you. I thought that option was mandatory.

ffleandro changed discussion status to closed
