CUDA out of memory
I use an NVIDIA GeForce RTX 3090 GPU with 24 GB of VRAM.
When I run the demo code, I get this error:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB. GPU 0 has a total capacty of 23.69 GiB of which 185.62 MiB is free. Including non-PyTorch memory, this process has 23.50 GiB memory in use. Of the allocated memory 22.83 GiB is allocated by PyTorch, and 1.17 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
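The error message itself hints at one mitigation: if reserved-but-unallocated memory is large, fragmentation may be part of the problem, and max_split_size_mb can help. A minimal sketch of setting it (the value 128 is an illustrative guess, not from this thread; set the variable before any CUDA allocation happens):
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"  # read by the caching allocator when it initializes

import torch  # import and allocate only after the variable is set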
Could you specify which demo? Are you loading in float16 or bfloat16? You should use accelerate with device_map="auto" to overcome potential memory issues. It should fit in 24 GB.
Maybe try reducing the number of tokens that are generated; 1000 seems like a lot. Try a smaller number like 40 and go from there.
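For reference, a minimal sketch of capping the generation length (assuming model and tokenizer are already loaded; the prompt and the value 40 are placeholders):
inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40)  # cap the number of newly generated tokens
print(tokenizer.decode(outputs[0], skip_special_tokens=True))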
It doesn't help; the error appears at model.to(device1).
Then try loading in 8-bit, or use accelerate with device_map = "auto" :hug:
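A minimal sketch of the 8-bit option (requires bitsandbytes; the model id is the one from this thread):
# pip install bitsandbytes accelerate
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.1",
    load_in_8bit=True,   # 8-bit weights via bitsandbytes
    device_map="auto",   # let accelerate place the layers
)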
@logan39522361tq
Can you try to load the model as such:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1", torch_dtype=torch.float16, device_map="auto")
...
The snippet you shared will load the model in full precision (~28 GB), hence the GPU out-of-memory error you get.
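Rough arithmetic behind that figure (assuming ~7B parameters at 4 bytes each in float32):
params = 7_000_000_000
print(params * 4 / 1e9)  # ≈ 28 GB in float32
print(params * 2 / 1e9)  # ≈ 14 GB in float16, which fits in 24 GB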
Alternatively you can also do:
# pip install bitsandbytes
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1", load_in_4bit=True)
...
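In more recent transformers versions the same thing is usually expressed with a BitsAndBytesConfig (a sketch, assuming a version that provides it):
# pip install bitsandbytes
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.1",
    quantization_config=quantization_config,
    device_map="auto",
)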
I used this to initialize the model:
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1", torch_dtype=torch.float16, device_map="auto")
but when I run
model.to(device)
I get this error (using a Python virtual env):
lib/python3.10/site-packages/accelerate/big_modeling.py", line 425, in wrapper
raise RuntimeError("You can't move a model that has some modules offloaded to cpu or disk."
this is because device_map="auto" has automatically offloaded some of your model onto CPU or disk. What is your GPU's total VRAM?
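To check where accelerate actually placed each module, you can print the device map it records on the model (a sketch; hf_device_map is populated when device_map is used):
print(model.hf_device_map)  # maps module names to a GPU index, "cpu", or "disk"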
NVIDIA GeForce RTX 3090 GPU with 24 GB of VRAM
The model has 44B parameters (you need ~90 GB of VRAM to fit it on a GPU in half precision), so it will not fit into your GPU. Please consider running the model in 4-bit precision, or use CPU / disk offloading at the risk of not being able to call model.to(device).
code:
model = AutoModelForCausalLM.from_pretrained(path, load_in_4bit=True)
config.json:
"torch_dtype": "bfloat16",
When I call model.to(device), it raises this error:
.to is not supported for 4-bit or 8-bit bitsandbytes models. Please use the model as it is, since the model has already been set to the correct devices and casted to the correct dtype.
Loading the model with quantization automatically dispatches it across the available devices, so there is no need to call .to; calling it also creates issues with offloading.
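A minimal sketch of using the quantized, already-dispatched model without calling .to on it (only the inputs are moved; the prompt is a placeholder):
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)  # move the inputs, not the model
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))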