Out of memory error on multiple runs

#91
by giord - opened

I'm trying to run some simple code to generate multiple images:

import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained("stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

for i in range(10):
    image = pipe(
        my_prompt,
        negative_prompt="",
        num_inference_steps=28,
        guidance_scale=7.0,
    ).images[0]
    image.save(f"test{i}.png")

However, after a few iterations I get an out-of-memory error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB. GPU 

Am I doing something wrong?

Moreover, the same code also quickly runs out of memory when I run it on the MPS device of my M3 MacBook.

A few ideas. First, add

    del image

at the end of each loop iteration. Without going into details I don't really understand, Python can only free an object once nothing references it any more. I suspect that on each iteration you are holding the entire model in RAM.
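To make the effect of del concrete, here is a minimal sketch using only the standard library. FakeImage and run_loop are hypothetical stand-ins, not part of diffusers; the weak references just let us observe whether each object is still alive after the loop.

```python
import gc
import weakref


class FakeImage:
    """Hypothetical stand-in for a large result object such as a generated image."""

    def __init__(self, index):
        self.index = index


def run_loop(n):
    """Create n objects, deleting each at the end of its iteration.

    Returns weak references so the caller can check which objects survived.
    """
    refs = []
    for i in range(n):
        image = FakeImage(i)
        refs.append(weakref.ref(image))  # a weak reference does not keep the object alive
        del image  # drop the only strong reference
    gc.collect()  # also clean up anything that was only reachable through cycles
    return refs


refs = run_loop(3)
alive = [r() is not None for r in refs]
print(alive)  # on CPython: [False, False, False]
```

If the objects were also stored somewhere else (a list, a global, a cache), del image alone would not free them, which is worth checking in the real loop.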

If that doesn't work, make a garbage collection function and call it after the del image.
Here is mine with print statements about memory allocated. I haven't reviewed this in a long time, so a lot of it is excess print statements.

import gc
import time

import torch

def reclaim_mem():
    # Report usage before cleanup.
    allocated_memory = torch.cuda.memory_allocated()
    cached_memory = torch.cuda.memory_reserved()
    print(f"Memory Allocated: {allocated_memory / 1024**2:.2f} MB")
    print(f"Memory Cached: {cached_memory / 1024**2:.2f} MB")
    # Collect garbage and hand cached blocks back to the CUDA driver.
    torch.cuda.ipc_collect()
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.synchronize()
    time.sleep(0.01)
    # Re-read the counters so the "after" lines show the new values.
    allocated_memory = torch.cuda.memory_allocated()
    cached_memory = torch.cuda.memory_reserved()
    print(f"Memory Allocated after cleanup: {allocated_memory / 1024**2:.2f} MB")
    print(f"Memory Cached after cleanup: {cached_memory / 1024**2:.2f} MB")
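Since the question also mentions MPS, here is a rough device-agnostic variant of the same idea. The function name free_cached_memory is made up for this sketch, and it assumes a PyTorch version recent enough to provide torch.mps.empty_cache (hence the hasattr checks); without PyTorch installed it only runs the Python garbage collector.

```python
import gc

try:
    import torch  # optional here so the sketch still runs without PyTorch
except ImportError:
    torch = None


def free_cached_memory():
    """Collect garbage, then release cached device memory on CUDA or MPS.

    Returns the name of the backend whose cache was cleared, or "none".
    """
    gc.collect()
    if torch is None:
        return "none"
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        return "cuda"
    mps_backend = getattr(torch.backends, "mps", None)
    mps_module = getattr(torch, "mps", None)
    if (
        mps_backend is not None
        and mps_backend.is_available()
        and hasattr(mps_module, "empty_cache")
    ):
        mps_module.empty_cache()
        return "mps"
    return "none"


backend = free_cached_memory()
print(backend)
```

It would be called at the end of each loop iteration, after del image. Note that emptying the cache only returns memory that is no longer referenced; it cannot free tensors that are still held somewhere.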

Finally, if the easy fixes above don't work, read up on TCMalloc: https://github.com/google/tcmalloc
Or just try this:

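import ctypes

# Workaround for Python not releasing system memory back to the OS, which was locking up the system.
# Linux/glibc only: libc.so.6 does not exist on macOS or Windows.
libc = ctypes.cdll.LoadLibrary("libc.so.6")
M_MMAP_THRESHOLD = -3

# Set the malloc mmap threshold to 1 MiB: allocations at or above it use mmap and go back to the OS when freed.
libc.mallopt(M_MMAP_THRESHOLD, 2**20)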
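Because this trick depends on glibc, a slightly more defensive version might look like the sketch below. The wrapper name set_mmap_threshold is hypothetical; the only real calls are ctypes.CDLL and glibc's mallopt, which returns 1 on success. On anything other than Linux with glibc the function simply reports failure instead of raising.

```python
import ctypes
import sys

M_MMAP_THRESHOLD = -3  # parameter number from glibc's <malloc.h>


def set_mmap_threshold(num_bytes):
    """Try to set glibc's malloc mmap threshold; return True on success."""
    if not sys.platform.startswith("linux"):
        return False
    try:
        libc = ctypes.CDLL("libc.so.6")
        mallopt = libc.mallopt
    except (OSError, AttributeError):
        # No glibc under this name, or the symbol is missing.
        return False
    mallopt.argtypes = [ctypes.c_int, ctypes.c_int]
    mallopt.restype = ctypes.c_int
    return mallopt(M_MMAP_THRESHOLD, num_bytes) == 1


ok = set_mmap_threshold(2**20)
print("mmap threshold set:", ok)
```

This affects host RAM only, so it may help with system lock-ups but is unlikely to change a CUDA out-of-memory error by itself.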
