
ValueError: Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM

#8
by MoritzLaurer - opened

Loading the model with load_in_8bit=True does not seem to work on Google Colab.

model_name =  "google/flan-ul2"
from transformers import T5Tokenizer, T5ForConditionalGeneration, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(
    model_name, 
    #torch_dtype=torch.bfloat16, 
    load_in_8bit=True,
    device_map="auto",
    #offload_folder="offload",  
    #offload_state_dict=True,
)

This code leads to the following error. I'm running it on a standard Google Colab GPU (Tesla T4, 15 GB of RAM).
(I had successfully loaded the model with torch_dtype=torch.bfloat16 and offloading; accelerate and bitsandbytes are installed. But it doesn't seem to work with 8-bit.)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-22-b20ea4cc84c4> in <module>
      3 from transformers import T5Tokenizer, T5ForConditionalGeneration, AutoTokenizer
      4 tokenizer = AutoTokenizer.from_pretrained(model_name)
----> 5 model = T5ForConditionalGeneration.from_pretrained(
      6     model_name,
      7     #torch_dtype=torch.bfloat16,

/usr/local/lib/python3.8/dist-packages/transformers/modeling_utils.py in from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs)
   2423                 }
   2424                 if "cpu" in device_map_without_lm_head.values() or "disk" in device_map_without_lm_head.values():
-> 2425                     raise ValueError(
   2426                         """
   2427                         Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit

ValueError: 
                        Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit
                        the quantized model. If you have set a value for `max_memory` you should increase that. To have
                        an idea of the modules that are set on the CPU or RAM you can print model.hf_device_map.
                        
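For reference, the bf16 + offload variant mentioned above looked roughly like this (a sketch reconstructed from the commented-out kwargs in the snippet; untested as written):

import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

model_name = "google/flan-ul2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# bf16 weights, with accelerate offloading modules that don't fit on the GPU to disk
model = T5ForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    offload_folder="offload",
    offload_state_dict=True,
)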

Same for my deployment on SageMaker using instance_type="ml.g4dn.4xlarge". Waiting for someone to help with this as well.

My code:

from transformers import AutoTokenizer, T5ForConditionalGeneration

def model_fn(model_dir):
    # load model and tokenizer
    model = T5ForConditionalGeneration.from_pretrained(
        "google/flan-ul2", load_in_8bit=True, device_map="auto", cache_dir="/tmp/model_cache/"
    )
    tokenizer = AutoTokenizer.from_pretrained("google/flan-ul2")
    return model, tokenizer

from sagemaker.huggingface.model import HuggingFaceModel
from sagemaker.utils import name_from_base

huggingface_model = HuggingFaceModel(
    model_data=s3_location,
    role=role,
    transformers_version="4.17",
    pytorch_version="1.10",
    py_version="py38",
)

endpoint_name = name_from_base(model_name)

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    # instance_type="ml.g5.4xlarge",
    instance_type="ml.g4dn.4xlarge",
    endpoint_name=endpoint_name,
)

data = {
    "inputs": prompt,
    "min_length": 20,
    "max_length": 50,
    "do_sample": True,
    "temperature": 0.6,
}

res = predictor.predict(data=data)
print(res)

Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit the quantized model. If you have set a value for max_memory you should increase that. To have an idea of the modules that are set on the CPU or RAM you can print model.hf_device_map


Hi @MoritzLaurer @miltonc
As stated by the error trace, it seems that you don't have enough memory to fit the model on a 15 GB GPU. The model has 20B parameters, so you would need roughly 20 GB of GPU RAM at minimum to run it in int8.
However, you might be interested in dispatching the model between CPU and GPU, fitting ~70% of the model weights on the GPU and the rest on the CPU using BitsAndBytesConfig. Please have a look at the following section: https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu
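A minimal sketch of that offload recipe, assuming the llm_int8_enable_fp32_cpu_offload option documented at the link above; the max_memory caps are illustrative and not taken from this thread:

from transformers import AutoTokenizer, BitsAndBytesConfig, T5ForConditionalGeneration

model_name = "google/flan-ul2"

# Keep int8 weights on the GPU and let overflow modules sit on the CPU in fp32.
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_enable_fp32_cpu_offload=True,
)

model = T5ForConditionalGeneration.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=quantization_config,
    max_memory={0: "13GiB", "cpu": "30GiB"},  # illustrative caps, tune to your hardware
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
print(model.hf_device_map)  # shows which modules ended up on the GPU vs the CPU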

Hi @ybelkada, yeah, just increasing memory makes sense. What confused me is that it worked with bf16, but the same setup did not work with int8. I thought bf16 requires more memory than int8. If that's the case, then I don't really understand why I got the error with int8, especially since a central motivation for using int8 is to decrease memory requirements. But maybe I'm misunderstanding something.

Hey @MoritzLaurer, were you ever able to figure this out? I am having a similar problem in the Google Colab workspace on a T4 GPU, which has 16 GB of memory, but I am loading my fine-tuned Llama 2 7B hf model, which should in theory work, yet I run into the same error. Is it really as simple as needing more memory? I would really like to remain on the free tier of Google Colab if at all possible.

EDIT: To clarify, I am even using 4-bit quantization with BitsAndBytes:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_NAME = "meta-llama/Llama-2-7b-hf"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    device_map="auto",
    trust_remote_code=True,
    quantization_config=bnb_config,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token

Training the model works fine, but when I load the trained model I encounter the error.

@fredbananas
I ran into the same error despite having an AWS g5.4xlarge instance (using a different model). For me it was because the NVIDIA runtime wasn't up. If you think that might be your issue too, try nvidia-smi and see whether it's working.
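A quick sanity check along those lines (a small sketch, assuming PyTorch is installed): if no GPU is visible, device_map="auto" has to place modules on CPU/disk, which would trigger the same ValueError for quantized loads.

import torch

# Both of these should report a working GPU before you load the model.
print(torch.cuda.is_available())
print(torch.cuda.device_count())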


Hi @chenaclee @fredbananas
In order to load the model on a free-tier Google Colab instance, I recommend using a sharded version of the model instead, such as Trelis/Llama-2-7b-chat-hf-sharded-bf16.
You can find more sharded checkpoints in my personal collection: https://huggingface.co/collections/ybelkada/sharded-checkpoints-64fefe78cccea7ce7b268310
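A minimal sketch of loading that sharded checkpoint in 4-bit on a free-tier T4, reusing the BitsAndBytesConfig values posted earlier in this thread (untested here):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Trelis/Llama-2-7b-chat-hf-sharded-bf16"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Smaller shards keep peak CPU RAM during loading within free-tier Colab limits.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=bnb_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)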

@fredbananas were you able to fix the error?

Free up the GPU :)
For those trying this in Google Colab, you can restart the session; the GPU memory is then cleared for your use.

I have the same issue with the LLaVA model.
It previously worked on my system, but suddenly it says there is not enough GPU memory. It's weird. Why?


Same issue. Has anyone found a solution?

Same issue, does anyone have any suggestions? I am using the TinyLlama-1.1B-Chat-v0.1 LLM and have an 8 GB GPU.

As this thread has discussed, it is because your GPU memory is not big enough to load the model. You can either increase GPU memory or offload part of the modules into CPU memory.
