vLLM support, GGUF
by dpkirchner
According to the model card:
You can also run this model with vLLM, by running the following in your terminal after pip install vllm
vllm serve NousResearch/Hermes-3-Llama-3.1-70B
I assume this is just carried over from the base model's README. How do you load, say, the Q4_K_M GGUF with vLLM and then use it on the chat completions endpoint?
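For concreteness, here is roughly what I expect the workflow to look like, based on vLLM's experimental GGUF support. The quantized repo name, the exact .gguf filename, the --served-model-name value, and the port are my guesses rather than anything from the model card:

# Grab one quant file from the GGUF repo (exact filename is a guess; check the repo's file list)
huggingface-cli download NousResearch/Hermes-3-Llama-3.1-70B-GGUF Hermes-3-Llama-3.1-70B.Q4_K_M.gguf --local-dir .

# Serve the local GGUF; vLLM's GGUF docs suggest pointing --tokenizer at the original repo,
# and --served-model-name just gives the endpoint a friendlier model id
vllm serve ./Hermes-3-Llama-3.1-70B.Q4_K_M.gguf --tokenizer NousResearch/Hermes-3-Llama-3.1-70B --served-model-name hermes-3-70b

# Query the OpenAI-compatible chat completions endpoint (default port 8000)
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "hermes-3-70b", "messages": [{"role": "user", "content": "Hello"}]}'

Is that the intended invocation, or does vLLM need something else to pick up the GGUF weights for this model?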