OOM on RTX 3090 with vLLM

#1
by willowill5 - opened

Weird bug with either this model or vLLM: the original HF model loads fine on 24 GB, but the AWQ version OOMs under vLLM.

python -m vllm.entrypoints.api_server --model TheBloke/zephyr-7B-alpha-AWQ --quantization awq --dtype float16
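If the quantized weights themselves fit, the OOM may be coming from vLLM's KV-cache preallocation rather than the model: by default the engine reserves ~90% of GPU memory up front, and the profiling pass can still push a 24 GB card over. A possible workaround is to lower the preallocation fraction and cap the context length; the values below are illustrative guesses to tune, not confirmed settings:

# illustrative values; tune --gpu-memory-utilization and --max-model-len for your card
python -m vllm.entrypoints.api_server --model TheBloke/zephyr-7B-alpha-AWQ --quantization awq --dtype float16 --gpu-memory-utilization 0.85 --max-model-len 4096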