What is the required GPU memory to load the model?

#15
by nx

I've got 3 x V100; are they enough to load the model?

The current code causes an out-of-memory issue:

from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers
import torch

model = "tiiuae/falcon-40b-instruct"

tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,   # load the weights in bfloat16
    trust_remote_code=True,       # Falcon ships its own modelling code (modelling_RW.py)
    device_map="auto",            # spread the model over available GPUs (requires accelerate)
)
sequences = pipeline(
    "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Girafatron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:",
    max_length=200,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

ImportError: Using low_cpu_mem_usage=True or a device_map requires Accelerate: pip install accelerate

It fails on this part:

pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)
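
For what it's worth, that ImportError is not the memory problem itself; it just means the accelerate package is missing (pip install accelerate). Below is a minimal sketch, not from this thread, of loading the model sharded across several cards once accelerate is installed. The max_memory caps are illustrative values for 3 x 32 GB V100s, and float16 is used because V100s predate native bfloat16 support:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "tiiuae/falcon-40b-instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,                        # V100s have no native bfloat16
    trust_remote_code=True,
    device_map="auto",                                # requires accelerate
    max_memory={0: "30GiB", 1: "30GiB", 2: "30GiB"},  # leave headroom on each 32 GB card
)

Even sharded this way, the 16-bit weights may still be too large for 3 x 32 GB in practice; the 8-bit route discussed further down needs roughly 46 GB and is the safer bet.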

In bfloat16 it takes ~65 GB of VRAM on an A100 80GB; in 8-bit, ~46 GB.

I don't have any A100s, only 3x V100 (32 GB each). Is there a sample script I can use to run the model across all 3 cards?

Why do I still get a CUDA out-of-memory error when I use an A100-80GB?
File "/root/.cache/huggingface/modules/transformers_modules/falcon-40b/modelling_RW.py", line 93, in forward
return (q * cos) + (rotate_half(q) * sin), (k * cos) + (rotate_half(k) * sin)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 79.35 GiB total capacity; 77.18 GiB already allocated; 3.19 MiB free; 78.17 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
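
As a side note, the allocator hint in that error message can be set before any CUDA work happens, but it only helps with fragmentation; it cannot help when the weights simply do not fit on the card, which is what the reply below points to. A minimal sketch, with an illustrative value:

import os
# Must be set before the first CUDA allocation (i.e. before loading the model).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"  # value is illustrative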

@leocheung Same here, I have 96GB and still get the same error. Did you find any solution?

@a749734 I got this suggestion from @masonbraysx:
A 40B-parameter model will not fit on an A100-80GB in bf16 or fp16. In 16-bit precision, the VRAM needed just to hold a model is at least 2 GB per 1B parameters, and some models are closer to 3 GB per 1B parameters. That does not include the memory needed to actually run any inference. Two easy options:
1) Run it on a node with multiple A100 80GB GPUs.
2) Load the model in 8-bit precision. This requires the "bitsandbytes" package and reduces the necessary VRAM to about 45 GB.
I have successfully loaded and performed inference with falcon-40b-instruct on a system with 4 A4500s (20 GB VRAM each) using this method.
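
To make that concrete, here is a minimal sketch of the 8-bit route described above (not a script from this thread). It assumes accelerate and bitsandbytes are installed (pip install accelerate bitsandbytes); device_map="auto" then spreads the ~45 GB of 8-bit weights over whatever GPUs are visible (e.g. the 4 x 20 GB A4500s mentioned above), and the prompt is just a placeholder:

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_id = "tiiuae/falcon-40b-instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # 8-bit weights via bitsandbytes
    device_map="auto",                                          # shard across all visible GPUs
    trust_remote_code=True,
)

prompt = "Daniel: Hello, Girafatron!\nGirafatron:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50, eos_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))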

@leocheung How did you run the inference part of the model across the cluster of GPUs? Did you change the model's scripts or pass some arguments for assigning the devices?
