2x GPU but only one is being used

#22
by ecaglar

I am trying to run an LLM, but even though I selected 2x H100, only one GPU is utilized and I then get the error below. Any ideas?

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 384.00 MiB. GPU 0 has a total capacty of 79.11 GiB of which 168.50 MiB is free. Process 3311833 has 78.93 GiB memory in use. Of the allocated memory 78.31 GiB is allocated by PyTorch, and 189.60 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
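
The error shows the whole model being placed on GPU 0 while GPU 1 sits idle. If you are loading the model directly with Transformers (an assumption, since the launch command isn't shown), a minimal sketch like the following asks accelerate to shard the weights across all visible GPUs instead of loading everything onto GPU 0; the model id is a placeholder.

```python
# Sketch: shard a large model across all visible GPUs instead of GPU 0 only.
# Assumes transformers + accelerate are installed; substitute your own model id.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-70b-hf"  # placeholder model id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision to fit within 2x 80 GB
    device_map="auto",          # let accelerate split layers across both GPUs
)
```

After loading, `model.hf_device_map` shows where each module landed; if everything is on `cuda:0`, the second GPU is likely not visible to the process (check `CUDA_VISIBLE_DEVICES`).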

Can you share your docker run command? It seems I get a hang when using 4x A100 80GB.
