
ValueError: Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM

#8
by MoritzLaurer - opened

Loading the model with load_in_8bit=True does not seem to work on Google Colab.

model_name =  "google/flan-ul2"
from transformers import T5Tokenizer, T5ForConditionalGeneration, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(
    model_name, 
    #torch_dtype=torch.bfloat16, 
    load_in_8bit=True,
    device_map="auto",
    #offload_folder="offload",  
    #offload_state_dict=True,
)

This code leads to the following error. I'm running it on a standard Google Colab GPU (Tesla T4, 15 GB of RAM).
(I had successfully loaded the model with torch_dtype=torch.bfloat16 and offloading; accelerate and bitsandbytes are installed. But it doesn't seem to work with 8-bit.)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-22-b20ea4cc84c4> in <module>
      3 from transformers import T5Tokenizer, T5ForConditionalGeneration, AutoTokenizer
      4 tokenizer = AutoTokenizer.from_pretrained(model_name)
----> 5 model = T5ForConditionalGeneration.from_pretrained(
      6     model_name,
      7     #torch_dtype=torch.bfloat16,

/usr/local/lib/python3.8/dist-packages/transformers/modeling_utils.py in from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs)
   2423                 }
   2424                 if "cpu" in device_map_without_lm_head.values() or "disk" in device_map_without_lm_head.values():
-> 2425                     raise ValueError(
   2426                         """
   2427                         Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit

ValueError: 
                        Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit
                        the quantized model. If you have set a value for `max_memory` you should increase that. To have
                        an idea of the modules that are set on the CPU or RAM you can print model.hf_device_map.
                        
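For reference, the bf16 + offload variant mentioned above looked roughly like this (a sketch reconstructed from the commented-out kwargs in the snippet; untested as written):

import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

model_name = "google/flan-ul2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# bf16 weights, with accelerate offloading modules that don't fit on the GPU to disk
model = T5ForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    offload_folder="offload",
    offload_state_dict=True,
)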

Same for my deployment on SageMaker using instance_type="ml.g4dn.4xlarge". Waiting for someone to help with this as well.

My code:

from transformers import AutoTokenizer, T5ForConditionalGeneration

def model_fn(model_dir):
    # load model and tokenizer
    model = T5ForConditionalGeneration.from_pretrained(
        "google/flan-ul2", load_in_8bit=True, device_map="auto", cache_dir="/tmp/model_cache/"
    )
    tokenizer = AutoTokenizer.from_pretrained("google/flan-ul2")
    return model, tokenizer

from sagemaker.huggingface.model import HuggingFaceModel
from sagemaker.utils import name_from_base

huggingface_model = HuggingFaceModel(
    model_data=s3_location,
    role=role,
    transformers_version="4.17",
    pytorch_version="1.10",
    py_version="py38",
)

endpoint_name = name_from_base(model_name)

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    # instance_type="ml.g5.4xlarge",
    instance_type="ml.g4dn.4xlarge",
    endpoint_name=endpoint_name,
)

data = {
    "inputs": prompt,
    "min_length": 20,
    "max_length": 50,
    "do_sample": True,
    "temperature": 0.6,
}

res = predictor.predict(data=data)
print(res)

Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit the quantized model. If you have set a value for max_memory you should increase that. To have an idea of the modules that are set on the CPU or RAM you can print model.hf_device_map


Hi @MoritzLaurer @miltonc
As stated by the error trace, it seems that you don't have enough memory to fit the model on a 15 GB GPU. The model has 20B parameters, so you would need roughly 20 GB of GPU RAM at minimum to run it in int8.
However, you might be interested in dispatching the model between CPU and GPU, fitting ~70% of the model weights on the GPU and the rest on the CPU using BitsAndBytesConfig. Please have a look at the following section: https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu
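A minimal sketch of that offload recipe, assuming the llm_int8_enable_fp32_cpu_offload option documented at the link above; the max_memory caps are illustrative and not taken from this thread:

from transformers import AutoTokenizer, BitsAndBytesConfig, T5ForConditionalGeneration

model_name = "google/flan-ul2"

# Keep int8 weights on the GPU and let overflow modules sit on the CPU in fp32.
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_enable_fp32_cpu_offload=True,
)

model = T5ForConditionalGeneration.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=quantization_config,
    max_memory={0: "13GiB", "cpu": "30GiB"},  # illustrative caps, tune to your hardware
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
print(model.hf_device_map)  # shows which modules ended up on the GPU vs the CPU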

Hi @ybelkada, yeah, just increasing memory makes sense. What confused me is that it worked with bf16, but the same setup did not work with int8. I thought bf16 requires more memory than int8. If that's the case, then I don't really understand why I got the error with int8, especially since a central motivation for using int8 is to decrease memory requirements. But maybe I'm misunderstanding something.

Hey @MoritzLaurer, were you ever able to figure this out? I am having a similar problem in the Google Colab workspace on a T4 GPU, which has 16 GB of memory, but I am loading my fine-tuned Llama 2 7B hf model, which should in theory work, yet I run into the same error. Is it really as simple as needing more memory? I would really like to remain on the free tier of Google Colab if at all possible.

EDIT: To clarify, I am even using 4-bit quantization with BitsAndBytes:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_NAME = "meta-llama/Llama-2-7b-hf"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    device_map="auto",
    trust_remote_code=True,
    quantization_config=bnb_config,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token

Training the model works fine, but when I load the trained model I encounter the error.

@fredbananas
I ran into the same error despite having an AWS g5.4xlarge instance (using a different model). For me it was because the NVIDIA runtime wasn't up. If you think that might be your issue too, try nvidia-smi and see whether it's working.
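A quick sanity check along those lines (a small sketch, assuming PyTorch is installed): if no GPU is visible, device_map="auto" has to place modules on CPU/disk, which would trigger the same ValueError for quantized loads.

import torch

# Both of these should report a working GPU before you load the model.
print(torch.cuda.is_available())
print(torch.cuda.device_count())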


Hi @chenaclee @fredbananas
In order to load the model on a free-tier Google Colab instance, I recommend using a sharded version of the model instead, such as Trelis/Llama-2-7b-chat-hf-sharded-bf16.
You can find more sharded checkpoints in my personal collection: https://huggingface.co/collections/ybelkada/sharded-checkpoints-64fefe78cccea7ce7b268310
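A minimal sketch of loading that sharded checkpoint in 4-bit on a free-tier T4, reusing the BitsAndBytesConfig values posted earlier in this thread (untested here):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Trelis/Llama-2-7b-chat-hf-sharded-bf16"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Smaller shards keep peak CPU RAM during loading within free-tier Colab limits.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=bnb_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)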

@fredbananas were you able to fix the error?

Free up the GPU :)
For those trying this in Google Colab, you can restart the session; the GPU memory is then cleared for your use.

I have the same issue with the LLaVA model.
It previously worked on my system, but suddenly it says there is not enough GPU memory. It's weird. Why?


Same issue. Has anyone found a solution?

Same issue, does anyone have any suggestions? I am using the TinyLlama-1.1B-Chat-v0.1 LLM and have an 8 GB GPU.

As this thread has discussed, it is because your GPU memory is not big enough to load the model. You can either increase GPU memory or offload part of the modules into CPU memory.
