RuntimeError: NVML_SUCCESS == r INTERNAL ASSERT FAILED at "../c10/cuda/CUDACachingAllocator.cpp":995, please report a bug to PyTorch.
#54 by venkatesh-thiru - opened
I get this error when I try to run the sd3.5 large model on an A100 80GB MIG GPU, although the medium model works fine. I have been looking for a solution with no luck so far.
I am using a MIG device under Slurm, and the error was caused by an incorrect GPU ID in the CUDA_VISIBLE_DEVICES environment variable.
Inference worked properly after I added the following lines at the top of my inference script:
import os
# Set this before torch initializes CUDA; later changes are ignored.
os.environ['CUDA_VISIBLE_DEVICES'] = "1"
Previously, I had set the device ID to "0", which caused the runtime error above. Since this is not related to the sd3.5 code, I am closing this issue.
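For anyone hitting the same assertion, here is a minimal sketch of how the device selection could be checked before loading the model. The helper names `select_cuda_device` and `describe_visible_devices` are hypothetical and only for illustration, and the ID "1" is simply the value that worked on my Slurm allocation; yours may differ (running `nvidia-smi -L` on the node should list what is available). The torch import is guarded, so the script still runs on a machine without PyTorch or a GPU.

```python
import os


def select_cuda_device(device_id: str) -> str:
    """Set CUDA_VISIBLE_DEVICES (hypothetical helper).

    Call this before torch initializes CUDA, ideally before importing torch.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = device_id
    return os.environ["CUDA_VISIBLE_DEVICES"]


def describe_visible_devices() -> list:
    """Return the names of the CUDA devices PyTorch can see, or [] if none."""
    try:
        import torch  # imported late so the environment variable is already set
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    return [torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())]


if __name__ == "__main__":
    # "1" is the ID that worked in my case; it is an assumption for other setups.
    select_cuda_device("1")
    print("CUDA_VISIBLE_DEVICES =", os.environ["CUDA_VISIBLE_DEVICES"])
    print("Visible devices:", describe_visible_devices())
```

If the list comes back empty, or shows a device other than the MIG slice you were allocated, the ID is probably wrong for that job.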
venkatesh-thiru changed discussion status to closed