Model loading taking too much GPU memory

#2
by tehreemfarooqi - opened

Hey, when trying to load the model using the code given in the repo card, it keeps giving me a CUDA out-of-memory error. I am using an NVIDIA V100 with 16 GB of VRAM. Given that I have run LLMs with more parameters, as well as speech-to-text models, on this GPU, this doesn't make sense to me. I'm using the exact code given in the repo card. Am I doing something wrong?

SILMA AI org

Hello Tehreem, and thanks for trying the model.

Our model will run on a 16 GB GPU only in quantized mode; you can find the sample code here:
https://huggingface.co/silma-ai/SILMA-9B-Instruct-v1.0#quantized-versions-through-bitsandbytes
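
If it helps, here is a minimal sketch of what 4-bit loading with bitsandbytes looks like. The code at the repo card link above is the reference; the NF4 and compute-dtype settings below are illustrative assumptions (fp16 compute is chosen because the V100 has no bf16 support):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "silma-ai/SILMA-9B-Instruct-v1.0"

# 4-bit NF4 quantization cuts the weight footprint to roughly 0.5 bytes
# per parameter (~5-6 GB for 9B params), which fits on a 16 GB V100.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,  # fp16 compute: V100 lacks bf16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```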

You can also find our recommended GPU requirements here:
https://huggingface.co/silma-ai/SILMA-9B-Instruct-v1.0#gpu-requirements

Finally, here is the likely technical explanation for the OOM error:

  • Our model has 9B parameters, with each parameter stored in BF16/FP16 (a 16-bit floating-point format).
  • This means the 9 billion parameters occupy 18 billion bytes, since each parameter takes 2 bytes (16 bits).
  • To get the memory requirement in GB, divide those 18 billion bytes by 1,073,741,824 (since 1 GB = 1,073,741,824 bytes).
  • Therefore, you need about 16.76 GB of GPU memory just to load the weights, which already exceeds your V100's 16 GB before counting activations or the KV cache (see the quick check below).
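
As a quick check, the same arithmetic in Python:

```python
params = 9_000_000_000        # 9B parameters
bytes_per_param = 2           # BF16/FP16 = 16 bits = 2 bytes
bytes_per_gb = 1_073_741_824  # 1 GB = 2**30 bytes

weights_gb = params * bytes_per_param / bytes_per_gb
print(f"{weights_gb:.2f} GB")  # 16.76 GB for the weights alone
```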

Thanks for your reply, @karimouda! I was able to run it using a multi-GPU setup.
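
For anyone landing on this thread later: a common way to get the multi-GPU setup working is `device_map="auto"`, which has accelerate shard the fp16 weights across all visible GPUs. A minimal sketch, assuming transformers and accelerate are installed:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "silma-ai/SILMA-9B-Instruct-v1.0"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" splits the fp16 weights across all visible GPUs,
# so e.g. two 16 GB cards can hold the ~16.76 GB of weights.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)
```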

tehreemfarooqi changed discussion status to closed
