Why is this 7B model only showing 5 GB of GPU RAM allocation?

#96
by shayak - opened

I'm completely new to running LLMs locally. I'm using the following code to load the model into memory with the Transformers library. What am I doing wrong? With the float16 dtype I would expect about 14 GB of memory usage for a 7B model, no? However, it shows only around 5-6 GB of my GPU being used, and running nvidia-smi reports the same.

Calling model.to(DEVICE) throws an error, but without it I'm assuming at least a portion of the model is running on the GPU? How do I make it load the full 14 GB of the model onto the GPU?

import torch
import pynvml
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# DEVICE is set to 'cuda'

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    load_in_4bit=True,
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    device_map=DEVICE,
    trust_remote_code=True,
)
# model.to(DEVICE)  # raises an error for quantized models
tokenizer = AutoTokenizer.from_pretrained(model_path)
print('loaded.')

# Check total GPU memory usage via NVML (same number nvidia-smi reports).
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
info = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"GPU memory occupied: {info.used // 1024 ** 2} MB.")

If you want the full fp32 model on the GPU, then just do

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map = DEVICE,
    trust_remote_code = True,
)

and it won't quantize the model (quantization shrinks its memory footprint by using lower-precision numbers).
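As a side note, once the model is loaded you can confirm how much memory the weights themselves take: Transformers models expose a get_memory_footprint() helper. A minimal sketch, assuming the model variable from the snippet above:

# Report how much memory the loaded model's parameters and buffers occupy.
# Returns bytes; this counts the weights only, not the CUDA context or activations.
footprint_gib = model.get_memory_footprint() / 1024 ** 3
print(f"Model weight footprint: {footprint_gib:.2f} GiB")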


Have you resolved the issue?

@shayak - thanks for the issue!
This is because you are loading the model in 4-bit precision (you passed load_in_4bit=True). Since float16/bfloat16 needs 2 bytes per parameter, you would indeed observe roughly 14 GB of memory consumption for an fp16 7B model; in 8-bit you need only 1 byte per parameter --> ~7 GB, and in 4-bit you need ~0.6 bytes per parameter --> ~5 GB (see the quick arithmetic sketch after the snippet below). Does that make sense? To load Mistral in 16-bit, simply run

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map ="auto",
    torch_dtype=torch.float16,
)
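If it helps, here is a small back-of-the-envelope sketch of the arithmetic above. The 7B parameter count and the ~0.6 bytes per parameter for NF4 with double quantization are rough assumptions; on top of the weights, the GPU also holds the CUDA context, activations, and the KV cache, which is why the observed usage is a bit higher than the estimate.

# Approximate weight memory for a 7B-parameter model at different precisions.
n_params = 7e9
for precision, bytes_per_param in [
    ("fp32", 4),
    ("fp16 / bf16", 2),
    ("int8", 1),
    ("nf4 (approx.)", 0.6),
]:
    print(f"{precision:>15}: ~{n_params * bytes_per_param / 1e9:.1f} GB")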
