getting CUDA out of memory

#51
by allpunks - opened

Hi! I'm trying to run this model locally and I'm getting this error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacty of 6.00 GiB of which 0 bytes is free. Of the allocated memory 12.57 GiB is allocated by PyTorch, and 241.53 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I already tried to set this environment variable:
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:256"

but didn't work.
Is there any other configuration I can set to make this model run on my computer? Thanks!

Check Available GPU Memory

Verify the available GPU memory; it's possible something else is using it.
Use:

nvidia-smi
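
If it helps, you can also check free vs. total memory from inside Python; a minimal sketch using torch.cuda.mem_get_info (available in recent PyTorch versions):

import torch

# mem_get_info returns (free_bytes, total_bytes) for the current CUDA device
free, total = torch.cuda.mem_get_info()
print(f"Free: {free / 1024**3:.2f} GiB of {total / 1024**3:.2f} GiB")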

Memory Cleanup

Try explicitly releasing GPU memory using:

import torch
torch.cuda.empty_cache()

Limit GPU Usage

You can try to limit the fraction of GPU memory PyTorch is allowed to allocate by using:

import torch
torch.cuda.set_per_process_memory_fraction(0.8)  # Adjust the fraction as needed

Quantized Model

You can try using a quantized version of this model; TheBloke has GPTQ and AWQ quantized versions. You could also use bitsandbytes and load the model in either 8-bit or 4-bit.
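
As a rough sketch (assuming you have the bitsandbytes and accelerate packages installed), loading the model in 4-bit with transformers would look something like this:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 quantization with bfloat16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceH4/zephyr-7b-beta",
    quantization_config=bnb_config,
    device_map="auto",  # let accelerate place the weights on the GPU
)

A 7B model in 4-bit needs roughly 4 GB of VRAM for the weights, which should leave some room for activations on a 6 GB card.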

CPU Inference

Your last option would be to run the model on your CPU; you obviously won't get to use your GPU to accelerate inference.
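
For reference, a minimal sketch of CPU-only loading with the same pipeline API (expect generation to be much slower):

import torch
from transformers import pipeline

# device=-1 keeps the model on the CPU; float32 is the safe dtype there
pipe = pipeline(
    "text-generation",
    model="HuggingFaceH4/zephyr-7b-beta",
    torch_dtype=torch.float32,
    device=-1,
)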

If you’d like you could share your code and I could try to help you.

@SpeedStar101 Hey, thanks for the response and sorry for the late reply. I'm trying to use zephyr-7b-beta. I have an RTX 4060 Ti and I'm still getting the CUDA out of memory error.

Here is my code so far:

import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128,garbage_collection_threshold:0.7"
os.environ['HF_HOME'] = 'E:\\.cache\\huggingface'

import torch
#import gc
from transformers import pipeline

torch.cuda.set_per_process_memory_fraction(.75, device=0)
torch.cuda.empty_cache()
#gc.collect()

print(torch.cuda.memory_allocated(0))
print(torch.cuda.max_memory_allocated(0))

pipe = pipeline("text-generation", model="HuggingFaceH4/zephyr-7b-beta", torch_dtype=torch.bfloat16, device=0)

print("----------------------------")

while True:
    input_text = input("Ask a question: ").strip()
    
    messages = [
        {
            "role": "system",
            "content": "You are a friendly chatbot who is a little bit sarcastic and like jokes, you deliver medium to short responses. Your name is Oscar !",
        },
        {"role": "user", "content": input_text},
    ]
    
    prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
    print(outputs[0]["generated_text"])

    torch.cuda.empty_cache()

Is there anything I did wrong? Thanks for the response!

@SpeedStar101 This is the error the script gives me

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacty of 8.00 GiB of which 942.83 MiB is free. Of the allocated memory 5.82 GiB is allocated by PyTorch, and 113.79 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

@allpunks, I apologize for my late response; I haven't been on this site for a while.

It looks like a significant amount of memory is already allocated by PyTorch, which leaves limited free space for additional allocations.

Here's what I recommend:

Adjust Memory Allocation Configuration

You've already set the PYTORCH_CUDA_ALLOC_CONF environment variable, but it may be worth tweaking it further or confirming that it's being set correctly at the very start of your script. Given the error message about fragmentation, you could change max_split_size_mb to a different value to see whether it helps:

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:64,garbage_collection_threshold:0.7"

Optimize Memory Usage

Call torch.cuda.empty_cache() and gc.collect() after each generation to ensure no memory is held unnecessarily.
You could also reduce max_new_tokens even further if necessary.

Control Memory Allocation

Setting torch.cuda.set_per_process_memory_fraction to an even lower fraction might help manage the total memory usage better.

Last Option: Use a GPTQ or AWQ Version

You can try using a quantized version of HuggingFaceH4/zephyr-7b-beta. The GPTQ and AWQ versions of this model are compressed, which makes them smaller, easier to run, and lighter on memory on your local machine. You can try TheBloke/zephyr-7B-beta-GPTQ or TheBloke/zephyr-7B-beta-AWQ.
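
Assuming you have the optimum and auto-gptq packages installed, loading the GPTQ version should look roughly like this (a sketch, not tested on your hardware):

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "TheBloke/zephyr-7B-beta-GPTQ"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# transformers reads the GPTQ quantization config from the repo;
# device_map="auto" places the quantized weights on your GPU
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

The rest of your chat loop should work unchanged with this pipe.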

Here is your code with my suggested optimization methods:

import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:64,garbage_collection_threshold:0.7"
os.environ['HF_HOME'] = 'E:\\.cache\\huggingface'

import torch
import gc
from transformers import pipeline

# Further reduce the GPU memory fraction if needed
torch.cuda.set_per_process_memory_fraction(0.7, device=0)  # Adjusted from 0.75 to 0.7
torch.cuda.empty_cache()
gc.collect()

print("Allocated Memory:", torch.cuda.memory_allocated(0))
print("Max Memory Allocated:", torch.cuda.max_memory_allocated(0))

pipe = pipeline("text-generation", model="HuggingFaceH4/zephyr-7b-beta", torch_dtype=torch.bfloat16, device=0)

print("----------------------------")

while True:
    input_text = input("Ask a question: ").strip()
    
    messages = [
        {
            "role": "system",
            "content": "You are a friendly chatbot who is a little bit sarcastic and likes jokes, you deliver medium to short responses. Your name is Oscar!",
        },
        {"role": "user", "content": input_text},
    ]
    
    prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    outputs = pipe(prompt, max_new_tokens=150, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
    print(outputs[0]["generated_text"])

    torch.cuda.empty_cache()
    gc.collect()  # Added explicit garbage collection after each generation

I hope this helps!

@SpeedStar101 Thanks! I'll check it out when I have some time.
