CUDA Out of Memory Error

#4
by ibraweeb - opened

I'm currently trying to use orca_mini_3b and I'm getting a CUDA out-of-memory error. The error line says the following:
OutOfMemoryError: CUDA out of memory. Tried to allocate 392.00 MiB (GPU 0; 8.00 GiB total capacity; 6.65 GiB already allocated; 0 bytes free; 7.05 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

This is strange in my case, as I was able to use the model without any issues a while ago.

This is the cell where I'm importing the model:
[screenshot: notebook cell that loads the model]
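
In case the screenshot doesn't render, the cell roughly follows the standard loading code from the model card; the sketch below is a reconstruction, and the dtype/device_map arguments are illustrative (a ~3B-parameter model needs about 6.8 GB in fp16 and roughly double that in fp32, so an 8 GB card is already tight):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Repo id from the model card; dtype/device_map are illustrative choices.
model_name = "psmathur/orca_mini_3b"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # fp32 weights (~13 GB) would not fit on an 8 GB card at all
    device_map="auto",          # needs `accelerate`; offloads layers to CPU if VRAM runs out
)
```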

This is the standard generate text function I'm using with the default prompt to write a letter:
[screenshot: generate-text function with the default letter-writing prompt]
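
Roughly, the function follows the prompt template from the model card; this sketch is a reconstruction (the prompt text and generation arguments are placeholders, not the exact ones in my cell):

```python
def generate_text(system, instruction):
    # Prompt template as given on the orca_mini model card.
    prompt = f"### System:\n{system}\n\n### User:\n{instruction}\n\n### Response:\n"
    tokens = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(tokens, max_new_tokens=256)
    return tokenizer.decode(output[0], skip_special_tokens=True)

print(generate_text(
    "You are an AI assistant that follows instructions well.",
    "Write a letter to a friend.",  # stand-in for the card's default letter prompt
))
```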

Instead of an output I get the error:
[screenshot: CUDA out-of-memory traceback]

Usually when people get errors like this, a common solution is to reduce the batch size. I'm not exactly sure how to reduce the batch size here, since I don't see any parameters for it. I'm also new to Hugging Face/transformers, so if anyone has any suggestions, they would be much appreciated.
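
For reference, a single model.generate() call like the one above already runs at batch size 1, so there is no batch-size parameter to lower; the remaining levers are the weight dtype, the generation length, and the allocator option the error message itself suggests. A sketch of the latter two (the max_split_size_mb value is just a starting point to experiment with):

```python
import os

# The allocator option from the error message; it must be set before
# CUDA is initialized, so set it before the first torch CUDA call.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch

torch.cuda.empty_cache()  # drop cached blocks left over from earlier failed runs

# Generation length also matters: each generated token grows the KV cache,
# so lowering max_new_tokens (e.g., 64 instead of 256) shrinks peak memory.
```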

Thanks for the detailed post and screenshots. It looks like your current machine (shown in the screenshot) has only 8 GB of VRAM, so I'm not sure you can directly use this model with the code from the repo's model card. You should try a quantized version provided by TheBloke on his HF repo, and follow this detailed post on how to set up a local web UI to play around with quantized orca-minis or any other quantized HF models:
https://www.reddit.com/r/LocalLLaMA/wiki/guide?utm_source=share&utm_medium=ios_app&utm_name=ioscss&utm_content=1&utm_term=1
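
To make the quantized route concrete, here is a minimal sketch using llama-cpp-python with a GGML file; the filename is an assumption, so pick an actual quantized file from TheBloke's orca_mini repo (a 4-bit 3B file is roughly 2 GB, which fits comfortably in 8 GB):

```python
# pip install llama-cpp-python
from llama_cpp import Llama

# Filename is illustrative -- download a real quantized file from
# TheBloke's orca_mini GGML repo and point model_path at it.
llm = Llama(model_path="orca-mini-3b.ggmlv3.q4_0.bin", n_ctx=1024)

out = llm(
    "### System:\nYou are a helpful assistant.\n\n"
    "### User:\nWrite a short letter.\n\n"
    "### Response:\n",
    max_tokens=256,
)
print(out["choices"][0]["text"])
```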

pankajmathur changed discussion status to closed
