Running into an OOM error on a 2070S (8GB) - any way around it?

#13
by skohan - opened

I'm attempting to run this code to test inference using this model on a 2070S:

from transformers import AutoTokenizer
import transformers
import torch

model = "meta-llama/Llama-2-7b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(
    model,
)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

sequences = pipeline(
    'I liked "Breaking Bad" and "Band of Brothers". Do you have any recommendations of other shows I might like?\n',
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=200,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

When running this code, I get the following error:

OutOfMemoryError                          Traceback (most recent call last)
Cell In[5], line 1
----> 1 sequences = pipeline(
      2     'I liked "Breaking Bad" and "Band of Brothers". Do you have any recommendations of other shows I might like?\n',
      3     do_sample=True,
      4     top_k=10,
      5     num_return_sequences=1,
      6     eos_token_id=tokenizer.eos_token_id,
      7     max_length=200,
      8 )
      9 for seq in sequences:
     10     print(f"Result: {seq['generated_text']}")

...

OutOfMemoryError: CUDA out of memory. Tried to allocate 250.00 MiB (GPU 0; 7.78 GiB total capacity; 6.05 GiB already allocated; 183.44 MiB free; 6.05 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

So it appears I am reaching my VRAM limit with this task. Is there any workaround?
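For reference, I assume the allocator hint from the traceback would be set like this (before CUDA is initialized), though I don't expect it to help much if the model simply doesn't fit in 8GB:

import os

# Allocator hint suggested by the error message; must be set before CUDA is initialized.
# 128 is just an example value, not a recommendation.
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:128'

import torch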

Huggingface Projects org

Hi @skohan

You can load the model in 4-bit so it fits in your 8GB of VRAM. I'm not sure whether 4-bit loading works directly with pipeline (see the untested sketch further down in this reply), but you can do this:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = 'meta-llama/Llama-2-7b-chat-hf'
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_4bit=True,   # 4-bit quantization; requires the bitsandbytes package
    device_map='auto',
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
DEFAULT_SYSTEM_PROMPT = "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."

message = 'I liked "Breaking Bad" and "Band of Brothers". Do you have any recommendations of other shows I might like?\n'

# Llama 2 chat format: system prompt inside <<SYS>> tags, user message wrapped in [INST] ... [/INST]
prompt = f'[INST] <<SYS>>\n{DEFAULT_SYSTEM_PROMPT}\n<</SYS>>\n\n{message.strip()} [/INST]'
inputs = tokenizer([prompt], return_tensors='pt').to('cuda')

out = model.generate(inputs['input_ids'],
                     max_new_tokens=200,
                     do_sample=True,
                     top_k=10)
decoded = tokenizer.decode(out[0], skip_special_tokens=True)
print(decoded[len(prompt):])  # print only the newly generated text, not the echoed prompt

It took a little less than 6GB of VRAM to run this in my environment.
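For what it's worth, pipeline forwards model_kwargs to the underlying from_pretrained call, so something like the following may also work for 4-bit loading; I haven't tested it, so treat it as a sketch:

import transformers

# Untested sketch: pass the 4-bit flag through model_kwargs so the model is
# quantized on load (requires the bitsandbytes and accelerate packages).
pipeline = transformers.pipeline(
    'text-generation',
    model='meta-llama/Llama-2-7b-chat-hf',
    device_map='auto',
    model_kwargs={'load_in_4bit': True},
)

If that errors out, the AutoModelForCausalLM approach above is the safer bet.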

BTW, it would be more appropriate to ask this kind of question in the Forum or on Discord.

Great, thank you!

And I will move to the forum/discord next time

skohan changed discussion status to closed
