Suggested vllm options
#1 opened by vlzft
Hi! With a single H100, vLLM runs out of GPU memory when I try to serve this with a plain `vllm serve cortecs/Llama-3.3-70B-Instruct-FP8-Dynamic`.
What are the suggested options? The description says this model should be optimized for running on a single H100.
Hi! Apologies for the confusion, and thank you for bringing this to our attention. The model is indeed optimized for a single H100, but two parameters were missing from the instructions: pass `--max-model-len 9000` and set `--gpu-memory-utilization 0.95` when serving the model with vLLM. We've updated the description to include these details. Let us know if you have any further questions!
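Putting the two flags together, the full command should look something like this (values taken from the reply above; `--gpu-memory-utilization` is vLLM's long-form flag name):

```
vllm serve cortecs/Llama-3.3-70B-Instruct-FP8-Dynamic \
    --max-model-len 9000 \          # cap the context length to shrink the KV cache
    --gpu-memory-utilization 0.95   # allow vLLM to use up to 95% of GPU memory
```

Limiting the maximum model length reduces the memory reserved for the KV cache, which is what lets the 70B FP8 weights plus cache fit within a single H100's 80 GB.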