Minimal CUDA GPU requirements

#18, opened by wass-grass

So what's the minimum requirement to run this model?

I tried with a 4 GB GPU and got:
RuntimeError: CUDA out of memory. Tried to allocate 50.00 MiB (GPU 0; 4.00 GiB total capacity; 3.43 GiB already allocated; 0 bytes free; 3.50 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

:/
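For reference, the allocator config that the error message mentions is an environment variable, and it has to be set before the first CUDA allocation happens. A minimal sketch (the 128 MiB split size is only an illustrative value, not a tested recommendation):

import os
# must run before any CUDA memory is allocated (ideally before importing torch)
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:128'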

I had the same error a few times in Google Colab (free, not Pro) with a Tesla T4, and I was not able to set PYTORCH_CUDA_ALLOC_CONF in any way. The only solution I found was a memory check and cleanup script:

from torch import cuda


def get_less_used_gpu(gpus=None, debug=False):
    """Inspect cached/reserved and allocated memory on specified gpus and return the id of the least used device"""
    warn = None
    if gpus is None:
        warn = 'Falling back to default: all gpus'
        gpus = range(cuda.device_count())
    elif isinstance(gpus, str):
        gpus = [int(el) for el in gpus.split(',')]

    # check the gpus arg against the gpus actually available on the system
    sys_gpus = list(range(cuda.device_count()))
    if len(gpus) > len(sys_gpus):
        warn = f'WARNING: Specified {len(gpus)} gpus, but only {cuda.device_count()} available. Falling back to default: all gpus.\nIDs:\t{sys_gpus}'
        gpus = sys_gpus
    elif set(gpus).difference(sys_gpus):
        # keep the correctly specified ids and replace each invalid id with an unused system gpu
        available_gpus = set(gpus).intersection(sys_gpus)
        unavailable_gpus = set(gpus).difference(sys_gpus)
        unused_gpus = set(sys_gpus).difference(gpus)
        gpus = list(available_gpus) + list(unused_gpus)[:len(unavailable_gpus)]
        warn = f'GPU ids {unavailable_gpus} not available. Falling back to {len(gpus)} device(s).\nIDs:\t{list(gpus)}'

    cur_allocated_mem = {}
    cur_cached_mem = {}
    max_allocated_mem = {}
    max_cached_mem = {}
    for i in gpus:
        cur_allocated_mem[i] = cuda.memory_allocated(i)
        cur_cached_mem[i] = cuda.memory_reserved(i)
        max_allocated_mem[i] = cuda.max_memory_allocated(i)
        max_cached_mem[i] = cuda.max_memory_reserved(i)
    min_allocated = min(cur_allocated_mem, key=cur_allocated_mem.get)
    if debug:
        if warn:
            print(warn)
        print('Current allocated memory:', {f'cuda:{k}': v for k, v in cur_allocated_mem.items()})
        print('Current reserved memory:', {f'cuda:{k}': v for k, v in cur_cached_mem.items()})
        print('Maximum allocated memory:', {f'cuda:{k}': v for k, v in max_allocated_mem.items()})
        print('Maximum reserved memory:', {f'cuda:{k}': v for k, v in max_cached_mem.items()})
        print('Suggested GPU:', min_allocated)
    return min_allocated


def free_memory(to_delete: list, debug=False):
    import gc
    import inspect
    # grab the caller's frame so we can drop its references by name
    calling_namespace = inspect.currentframe().f_back
    if debug:
        print('Before:')
        get_less_used_gpu(debug=True)

    # to_delete holds the *names* (strings) of variables in the calling namespace
    if to_delete:
        for _var in to_delete:
            calling_namespace.f_locals.pop(_var, None)
    gc.collect()
    cuda.empty_cache()
    if debug:
        print('After:')
        get_less_used_gpu(debug=True)
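As a quick usage sketch (assuming at least one CUDA device is visible), the first helper can pick the device that the pipeline below is moved to:

device = f'cuda:{get_less_used_gpu(debug=True)}'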

What I did, after setting up the pipeline, e.g.

lms = LMSDiscreteScheduler(
    beta_start=0.00085,
    beta_end=0.012,
    beta_schedule="scaled_linear"
)
pipe = MyStableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-3",
    scheduler=lms,
    # float16 is cuda only
    use_auth_token=True
).to(device)

was to clean up resources immediately after the inference (e.g. images = pipe([prompt] * samples)), so in another cell I did something like:

# pass the variable *names*: free_memory pops them from the calling namespace by name
to_delete = ['lms', 'pipe']
free_memory(to_delete, debug=True)

This at least saved me from having to restart the kernel and lose the work done. Hope it helps someone else!

You tried on 4 GB; I am not able to get 1.4 running on an 8 GB card:

RuntimeError: CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 7.77 GiB total capacity; 5.36 GiB already allocated; 351.25 MiB free; 5.38 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Were you able to run at least a single inference before deleting and freeing up the memory?

Apparently it should be possible.

From the readme:
Note: If you are limited by GPU memory and have less than 10GB of GPU RAM available, please make sure to load the StableDiffusionPipeline in float16 precision instead of the default float32 precision as done above. You can do so by telling diffusers to expect the weights to be in float16 precision:
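The snippet that note refers to looks roughly like this (paraphrasing the readme of that time; revision="fp16" and torch_dtype were the documented way to request half-precision weights, so double-check against your diffusers version):

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    revision="fp16",            # download the half-precision weights
    torch_dtype=torch.float16,  # keep the model in float16
    use_auth_token=True
).to("cuda")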

I haven't tried it, since I could run it on Google Colab with 16 GB of VRAM, which was almost entirely used.

Exactly what I did after posting this: I downloaded the weights in 16-bit precision; the smaller model runs easily. Of course quality is a bit compromised, but it's good enough for learning.
Thanks @wass-grass

osanseviero changed discussion status to closed