multi GPU inferencing

#18
by cjj2003 - opened

Is it possible to run inference on a multi-GPU setup? I have been unsuccessful using the demonstration code; it fails with this error:

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
../aten/src/ATen/native/cuda/TensorCompare.cu:110: _assert_async_cuda_kernel: block: [0,0,0], thread: [0,0,0] Assertion `probability tensor contains either `inf`, `nan` or element < 0` failed.
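As the traceback itself warns, CUDA errors are reported asynchronously, so the stack trace may point at the wrong call. A first debugging step (suggested by the error message) is to re-run with `CUDA_LAUNCH_BLOCKING=1` so kernel launches become synchronous and the trace lands on the failing call. A minimal sketch, where the inline `python -c` stands in for your actual demo script:

```shell
# With CUDA_LAUNCH_BLOCKING=1, kernel launches run synchronously,
# so the Python stack trace points at the call that actually failed.
# In practice you would run the real script instead, e.g.:
#   CUDA_LAUNCH_BLOCKING=1 python demo.py
CUDA_LAUNCH_BLOCKING=1 python -c 'import os; print(os.environ["CUDA_LAUNCH_BLOCKING"])'
```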

Try TabbyAPI with this quant :)

Yes, you can use the vLLM framework for multi-GPU inference.
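A minimal sketch of multi-GPU inference with vLLM, which shards the model across GPUs via tensor parallelism when `tensor_parallel_size` > 1. The model name and prompt below are placeholders, not from this thread; note that `tensor_parallel_size` must evenly divide the model's attention-head count, so a power of two is the usual safe choice.

```python
# Sketch of multi-GPU inference with vLLM using tensor parallelism.
# Model name and prompt are illustrative placeholders.

def pick_tensor_parallel_size(num_gpus: int) -> int:
    """Largest power of two <= num_gpus. tensor_parallel_size must
    evenly divide the model's attention-head count, and a power of
    two usually satisfies that constraint."""
    tp = 1
    while tp * 2 <= num_gpus:
        tp *= 2
    return tp

if __name__ == "__main__":
    import torch
    from vllm import LLM, SamplingParams

    # Shard across all visible GPUs (rounded down to a power of two).
    tp = pick_tensor_parallel_size(torch.cuda.device_count())
    llm = LLM(model="meta-llama/Llama-2-7b-hf", tensor_parallel_size=tp)

    params = SamplingParams(temperature=0.8, max_tokens=128)
    for out in llm.generate(["Hello, my name is"], params):
        print(out.outputs[0].text)
```

Launching the demo script unchanged on a multi-GPU box will not shard the model by itself; the framework has to be told to parallelize, which is what `tensor_parallel_size` does here.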
