Running into an OOM error on a 2070S (8GB) - any way around it?
#13 by skohan - opened
I'm attempting to run this code to test inference using this model on a 2070S:
from transformers import AutoTokenizer
import transformers
import torch

model = "meta-llama/Llama-2-7b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(
    model,
)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

sequences = pipeline(
    'I liked "Breaking Bad" and "Band of Brothers". Do you have any recommendations of other shows I might like?\n',
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=200,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")
When running this code, I get the following error:
OutOfMemoryError Traceback (most recent call last)
Cell In[5], line 1
----> 1 sequences = pipeline(
2 'I liked "Breaking Bad" and "Band of Brothers". Do you have any recommendations of other shows I might like?\n',
3 do_sample=True,
4 top_k=10,
5 num_return_sequences=1,
6 eos_token_id=tokenizer.eos_token_id,
7 max_length=200,
8 )
9 for seq in sequences:
10 print(f"Result: {seq['generated_text']}")
...
OutOfMemoryError: CUDA out of memory. Tried to allocate 250.00 MiB (GPU 0; 7.78 GiB total capacity; 6.05 GiB already allocated; 183.44 MiB free; 6.05 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
So it appears I am reaching my VRAM limit with this task. Is there any workaround?
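A back-of-the-envelope estimate is consistent with that reading. The sketch below counts weight storage only and assumes roughly 7 billion parameters for a "7b" model (an approximation, not a figure from this thread); activations, the KV cache, and CUDA overhead come on top. The helper name `weight_gib` is hypothetical.

```python
def weight_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed to hold the weights alone, in GiB."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 2**30


# Assumption: about 7e9 parameters for a "7b" model.
NUM_PARAMS = 7e9
# Total capacity reported in the error message above.
GPU_CAPACITY_GIB = 7.78

for bits in (16, 8, 4):
    needed = weight_gib(NUM_PARAMS, bits)
    verdict = "fits" if needed < GPU_CAPACITY_GIB else "does not fit"
    print(f"{bits:>2}-bit weights: ~{needed:.1f} GiB ({verdict} in {GPU_CAPACITY_GIB} GiB)")
```

Under these assumptions, float16 weights alone need about 13 GiB, well above the 7.78 GiB the card reports, so part of the model cannot stay on the GPU at that precision.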
Hi @skohan
You can load the model in 4-bit precision so that it fits in your environment. I'm not sure whether it's possible to load models in 4-bit when using pipeline, but you can do this:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = 'meta-llama/Llama-2-7b-chat-hf'
model = AutoModelForCausalLM.from_pretrained(model_id,
                                             load_in_4bit=True,
                                             device_map='auto')
tokenizer = AutoTokenizer.from_pretrained(model_id)

DEFAULT_SYSTEM_PROMPT = "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."
message = 'I liked "Breaking Bad" and "Band of Brothers". Do you have any recommendations of other shows I might like?\n'
prompt = f'[INST] <<SYS>>\n{DEFAULT_SYSTEM_PROMPT}\n<</SYS>>\n\n{message.strip()} [/INST]'

inputs = tokenizer([prompt], return_tensors='pt').to('cuda')
out = model.generate(inputs['input_ids'],
                     max_new_tokens=200,
                     do_sample=True,
                     top_k=10)
decoded = tokenizer.decode(out[0], skip_special_tokens=True)
print(decoded[len(prompt):])
It took a little less than 6GB of VRAM to run this in my environment.
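On the open question about pipeline: `transformers.pipeline` accepts an already-loaded model and tokenizer, so one option is to quantize first and wrap the result afterwards. The following is a minimal sketch, not something tested in this thread; it assumes `bitsandbytes` and `accelerate` are installed and a CUDA GPU is available. The function name `build_4bit_pipeline` is hypothetical, and the `bnb_4bit_compute_dtype` choice is an assumption.

```python
def build_4bit_pipeline(model_id: str):
    """Return a text-generation pipeline backed by a 4-bit quantized model."""
    # Imports are deferred so that defining this function does not
    # require torch/transformers to be importable.
    import torch
    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
        BitsAndBytesConfig,
        pipeline,
    )

    # Store weights in 4-bit; run the matmuls in float16 (assumed choice).
    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quant_config,
        device_map="auto",
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # Pass the loaded objects instead of a model name, so the pipeline
    # does not reload the weights at full precision.
    return pipeline("text-generation", model=model, tokenizer=tokenizer)
```

Calling `build_4bit_pipeline('meta-llama/Llama-2-7b-chat-hf')` should give an object that takes the same generation arguments as the pipeline in the original question (`do_sample`, `top_k`, `max_length`, and so on). Whether it stays under 8GB will depend on prompt length and generation settings.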
BTW, it would be more appropriate to ask this kind of question in the Forum or Discord.
Great, thank you!
And I will move to the forum/discord next time
skohan changed discussion status to closed