Issue running on vLLM using FP8
#3
by ffleandro · opened
I'm trying to run this model on an H100 NVL or H100 SXM using vLLM, but I'm facing the following error:
Quantization method specified in the model config (compressed-tensors) does not match the quantization method specified in the `quantization` argument (fp8)
This is my run command:
--model neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8 --host 0.0.0.0 --port 80 --dtype auto --quantization fp8 --tensor-parallel-size 1 --download-dir /data --gpu_memory_utilization 0.95 --kv-cache-dtype fp8 --max_model_len 60000
Any hints on what I'm doing wrong?
Don't specify `--quantization fp8`.
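As the error message itself says, the checkpoint is already quantized and vLLM reads the method (compressed-tensors) from the model config, so the flag must either match that or be omitted. A minimal sketch of the corrected argument list, which is just the original command with the conflicting flag dropped (the server entrypoint is omitted here, as in the original command):

```
# same arguments as before, with --quantization fp8 removed;
# vLLM auto-detects compressed-tensors from the model config
--model neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8 --host 0.0.0.0 --port 80 --dtype auto --tensor-parallel-size 1 --download-dir /data --gpu_memory_utilization 0.95 --kv-cache-dtype fp8 --max_model_len 60000
```

In short, for pre-quantized checkpoints the `--quantization` flag is redundant, and when given it has to agree with what the model config declares.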
Ah yes, thank you. I thought that option was mandatory.
ffleandro changed discussion status to closed