Using this model with vLLM
#10 opened by haltux
Hello,
I was able to get the model working directly in Python with the provided example code, but with vLLM it does not work, even with 80 GB of GPU memory.
I run:
```
python -u -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --model NousResearch/Yarn-Mistral-7b-128k --trust-remote-code
```
and I get:
File "/home/azureuser/.local/lib/python3.10/site-packages/vllm/model_executor/layers/attention.py", line 266, in forward
self.multi_query_kv_attention(
File "/home/azureuser/.local/lib/python3.10/site-packages/vllm/model_executor/layers/attention.py", line 117, in multi_query_kv_attention
key = torch.repeat_interleave(key, self.num_queries_per_kv, dim=1)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Any idea how to fix this?
Thanks a lot.
vLLM v0.2.2 was just released, which includes YaRN support, so maybe it will work for you now!
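If so, upgrading and rerunning the same command should be a quick way to check. A minimal sketch, assuming the v0.2.2 wheel is published on PyPI:

```
# Upgrade to a vLLM release that includes YaRN rope-scaling support
pip install --upgrade "vllm>=0.2.2"

# Retry the same serve command from the original post
python -u -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --model NousResearch/Yarn-Mistral-7b-128k \
    --trust-remote-code
```

Once the server is up (it listens on port 8000 by default), a quick completion request should confirm it no longer crashes; the prompt and max_tokens here are just illustrative:

```
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "NousResearch/Yarn-Mistral-7b-128k", "prompt": "Hello,", "max_tokens": 16}'
```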