RuntimeError: NVML_SUCCESS == r INTERNAL ASSERT FAILED at "../c10/cuda/CUDACachingAllocator.cpp":995, please report a bug to PyTorch.

#54

by venkatesh-thiru - opened Nov 21, 2024

Nov 21, 2024

I got this error when trying to run the sd3.5 large model on an A100 80GB MIG GPU. The medium model worked fine. I have been looking for a solution, with no luck so far.

venkatesh-thiru

Nov 25, 2024

I am using a MIG device on a Slurm cluster, and the issue was caused by an incorrect GPU ID in the CUDA_VISIBLE_DEVICES environment variable.

Inference worked properly after adding the following lines at the top of my inference script:

import os
# Set this before the first CUDA call (safest: before importing torch).
os.environ['CUDA_VISIBLE_DEVICES'] = "1"

Previously, I had set the device ID to "0", which caused the runtime error above. Since this is not related to the sd3.5 code, I am closing this issue.
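For anyone hitting the same error, here is a minimal sketch of the workaround with a quick sanity check. The helper names `select_gpu` and `report_devices` are hypothetical (they are not part of PyTorch or the sd3.5 code), and "1" is only the ID that worked on my cluster; the right value on another Slurm/MIG setup may differ.

```python
import os


def select_gpu(device_id: str) -> dict:
    """Override CUDA_VISIBLE_DEVICES and report the old and new values."""
    # Whatever Slurm (or the shell) had set before, if anything.
    previous = os.environ.get("CUDA_VISIBLE_DEVICES")
    os.environ["CUDA_VISIBLE_DEVICES"] = device_id
    return {"previous": previous, "current": os.environ["CUDA_VISIBLE_DEVICES"]}


def report_devices() -> None:
    """Print what PyTorch can see; call this only after select_gpu()."""
    # Imported here so the environment variable is set before torch touches CUDA.
    import torch

    print("CUDA available:", torch.cuda.is_available())
    print("Visible device count:", torch.cuda.device_count())


# "1" is the value that worked in this thread (an assumption for other clusters).
info = select_gpu("1")
print(info)
```

Calling `report_devices()` right after `select_gpu(...)` should show whether the chosen ID actually exposes a device before the model is loaded.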

venkatesh-thiru changed discussion status to closed Nov 25, 2024
