getting this error while trying to fine-tune
#2 opened by Rapidinnovation
CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Yeah I've seen that a few times in multi-GPU situations with unquantised models. I'm afraid I don't know what causes it, but I don't believe it's specific to these files as I've seen it with several Llama models. It might be a Transformers bug.
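If it helps narrow things down, here is a minimal debugging sketch. It sets CUDA_LAUNCH_BLOCKING=1 as the error message suggests, so that the stack trace points at the kernel that actually failed. It also checks for token IDs outside the vocabulary range, which is one common trigger for device-side asserts in general. That second part is an assumption on my side, not a confirmed cause for this report; `find_out_of_range_ids` is a hypothetical helper and the example IDs and vocabulary size are made up for illustration.

```python
import os

# Set this before CUDA is initialised (in practice, before importing torch),
# so kernel launches run synchronously and the stack trace is accurate.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


def find_out_of_range_ids(input_ids, vocab_size):
    """Return (position, token_id) pairs that fall outside [0, vocab_size).

    Hypothetical helper: such IDs would index past the end of an embedding
    table with vocab_size rows.
    """
    return [
        (pos, tok)
        for pos, tok in enumerate(input_ids)
        if tok < 0 or tok >= vocab_size
    ]


# Illustrative values only: a 32000-entry vocabulary and a batch row that
# contains an extra token ID (32000) the embedding table does not cover.
bad = find_out_of_range_ids([1, 15043, 32000, 2], vocab_size=32000)
print(bad)  # [(2, 32000)]
```

If the check comes back empty, the IDs are probably not the problem, and the synchronous stack trace from CUDA_LAUNCH_BLOCKING=1 is the better lead.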