vLLM out of memory

#2 by cfrancois7

I only have an RTX 3070 with 8 GB of VRAM.

When I run your AutoAWQ example, it works well on my machine.
I am able to control the max token size.
It runs with around 7 GB of RAM.

But when I try vLLM, the script wants to allocate 14 GB of GPU RAM and crashes.
I cannot manage to change the max token size.

Try with --max-model-len 512
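For context, that flag is for the vLLM server command line; with the Python LLM class it maps to the max_model_len argument. A minimal sketch, assuming the OpenAI-compatible server entrypoint and the model from this thread:

python -m vllm.entrypoints.openai.api_server \
    --model TheBloke/zephyr-7B-beta-AWQ \
    --quantization awq \
    --max-model-len 512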

I reinstalled and tested with:

from vllm import LLM

llm = LLM(
    model="TheBloke/zephyr-7B-beta-AWQ",
    quantization="awq",            # load the AWQ-quantized weights
    dtype="auto",
    max_model_len=512,             # cap the context length to shrink the KV cache
    gpu_memory_utilization=0.8,    # use at most 80% of the 8 GB of VRAM
)

And it works.
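For completeness, a minimal generation sketch on top of that llm object (the prompt and sampling settings are placeholders of my own; prompt plus output has to stay within the 512-token max_model_len):

from vllm import SamplingParams

# Placeholder prompt and settings; keep prompt + output under max_model_len=512.
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain what AWQ quantization does."], sampling_params)
print(outputs[0].outputs[0].text)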
