Using this model with vLLM

#10
by haltux

Hello,

I can run the model directly in Python with the example code provided, but it does not work with vLLM, even with 80 GB of GPU memory.
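For reference, the direct-Python path that works looks roughly like this (a minimal sketch using the standard transformers API; the repo's actual example may differ):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "NousResearch/Yarn-Mistral-7b-128k"

# trust_remote_code is needed because the repo ships custom YaRN modeling code
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))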

With vLLM, I try:

python -u -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --model NousResearch/Yarn-Mistral-7b-128k --trust-remote-code

and I get:

  File "/home/azureuser/.local/lib/python3.10/site-packages/vllm/model_executor/layers/attention.py", line 266, in forward
    self.multi_query_kv_attention(
  File "/home/azureuser/.local/lib/python3.10/site-packages/vllm/model_executor/layers/attention.py", line 117, in multi_query_kv_attention
    key = torch.repeat_interleave(key, self.num_queries_per_kv, dim=1)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Any idea how to fix this?

Thanks a lot.

vLLM v0.2.2 was just released, which includes YaRN support, so maybe it will work for you now!
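For anyone landing here later: upgrading and rerunning the same command should be enough, e.g.

pip install --upgrade vllm
python -u -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --model NousResearch/Yarn-Mistral-7b-128k --trust-remote-code

and then a quick smoke test against the OpenAI-compatible endpoint (a sketch assuming vLLM's default port 8000; adjust host/port to your setup):

import requests

# query vLLM's OpenAI-compatible /v1/completions endpoint
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "NousResearch/Yarn-Mistral-7b-128k",
        "prompt": "The capital of France is",
        "max_tokens": 32,
    },
)
print(resp.json()["choices"][0]["text"])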
