RuntimeError: NVML_SUCCESS == r INTERNAL ASSERT FAILED at "../c10/cuda/CUDACachingAllocator.cpp":995, please report a bug to PyTorch.

#54
by venkatesh-thiru - opened

I get this error when I try to run the sd3.5 large model on an A100 80GB MIG GPU, although the medium model works fine. I have been looking for a solution, with no luck so far.

I am using a MIG device under Slurm, and the error was caused by an incorrect GPU ID in the CUDA_VISIBLE_DEVICES environment variable.

Inference worked properly after I added the following lines at the top of my inference script:

import os
# Set this before CUDA is initialized (safest: before importing torch).
os.environ['CUDA_VISIBLE_DEVICES'] = "1"
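
For anyone hitting the same error, here is a minimal sketch of the same fix wrapped in a small helper. The name `set_visible_device` is hypothetical (it is not part of PyTorch or the sd3.5 code), and the value "1" is only what worked on my allocation; the right ID on another cluster may differ. The check for an already imported torch is a precaution, on the assumption that the variable is only read when CUDA is initialized.

```python
import os
import sys


def set_visible_device(device_id, env=None):
    """Set CUDA_VISIBLE_DEVICES and return its previous value (None if unset).

    Hypothetical helper for illustration only.
    """
    env = os.environ if env is None else env
    if "torch" in sys.modules:
        # The variable is read when CUDA initializes, so a late change may be ignored.
        print("warning: torch is already imported; set the variable earlier",
              file=sys.stderr)
    previous = env.get("CUDA_VISIBLE_DEVICES")
    env["CUDA_VISIBLE_DEVICES"] = str(device_id)
    return previous
```

Calling `set_visible_device("1")` as the first statement of the script, before any `import torch`, should have the same effect as the two lines above, and the returned value shows what the job had assigned before.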

Previously, I had set the device ID to "0", which caused the runtime error above. Since this is not related to the sd3.5 code, I am closing this issue.

venkatesh-thiru changed discussion status to closed
