GPU requirement

#10
by pgedeon - opened

How much RAM would meet the minimum requirement? I can't wait for some language-specific models; buying an A100 is a bit out of my price range.

@pgedeon I was able to load it in under 24GB by using bitsandbytes. You can add load_in_8bit=True, device_map="auto" to the from_pretrained call.
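
For reference, a minimal sketch of that load (assuming transformers, accelerate and bitsandbytes are installed) could look like this:

from transformers import AutoModelForCausalLM

# load_in_8bit quantizes the weights via bitsandbytes;
# device_map="auto" lets accelerate decide where to place them
model = AutoModelForCausalLM.from_pretrained(
    "bigcode/starcoder",
    load_in_8bit=True,
    device_map="auto",
)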

I second setting load_in_8bit=True, but be careful when setting device_map to "auto" if you only have one GPU, since it may offload some of the layers to the CPU. The BigCode model class does not have a flag you can set to offload them to the CPU, so I ended up passing my own device_map dict.
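
As an illustration (not the exact map from that post), the simplest custom device_map pins the entire model to a single GPU and rules out any CPU offload:

from transformers import AutoModelForCausalLM

# "" stands for the whole model, so every layer stays on GPU 0
# (an illustrative map; a per-module dict with module names as keys works the same way)
model = AutoModelForCausalLM.from_pretrained(
    "bigcode/starcoder",
    load_in_8bit=True,
    device_map={"": 0},
)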

Is it possible to run it on a 3070?

How about a 4080 (16 GB)?

@cactusthecoder8 Could you probably share the device map that worked for you?


@AV99 How many GPUs do you have, and how much memory does each have?


I tried multiple configurations of the model, and nothing ran successfully with only 16GB, unfortunately.

@cactusthecoder8 I initially started out with a single 16GB GPU; with offloading between CPU and GPU (and an hour of inference time later), I was barely able to get a "Hello World" running.

I now have 4 GPUs with 16GB of memory each. Any suggestions?

Can I run it locally on my Mac Studio (M1 Max, 32 GB)?

You can try the ggml implementation, starcoder.cpp, to run the model locally on your M1 machine.

In fp16/bf16 the model takes ~32GB on one GPU, and in 8-bit it requires ~22GB, so with 4 GPUs you can split this memory requirement four ways and fit it in less than 10GB on each GPU using the following code (make sure you have accelerate installed, and bitsandbytes for 8-bit mode):

from transformers import AutoModelForCausalLM
import torch

def get_gpus_max_memory(max_memory):
    max_memory = {i: max_memory for i in range(torch.cuda.device_count())}
    return max_memory

# for example, for a max use of 10GB per GPU
# for fp16, replace `load_in_8bit=True` with `torch_dtype=torch.float16`
model = AutoModelForCausalLM.from_pretrained(
    "bigcode/starcoder", 
    device_map="auto", 
    load_in_8bit=True,
    max_memory=get_gpus_max_memory("10GB"),
)
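
To sanity-check the loaded model, a short generation call works; the prompt and max_new_tokens below are just example values:

from transformers import AutoTokenizer

# assumes `model` was loaded as above; the first shard lands on GPU 0,
# so the inputs go to "cuda:0"
tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder")
inputs = tokenizer("def fibonacci(n):", return_tensors="pt").to("cuda:0")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))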

To understand the logic behind this, check this documentation or this blog post on handling large model inference.

loubnabnl changed discussion status to closed
# pip install -q transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder"
device = "gpu" # for GPU usage or "cpu" for CPU usage

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, load_in_8bit=True).to(device)

This code snippet is giving me the following error:
python3.10/site-packages/transformers/modeling_utils.py", line 2009, in to
raise ValueError(
ValueError: .to is not supported for 4-bit or 8-bit bitsandbytes models. Please use the model as it is, since the model has already been set to the correct devices and casted to the correct dtype

I am unable to run it without the load_in_8bit flag. I have a single A6000 (48GB) GPU.

Can anyone please help me with running inference on StarCoder?
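
Following what the error message says, the fix is to drop the .to(device) call: with load_in_8bit=True (and device_map="auto"), accelerate already places the model on the GPU. A minimal sketch, assuming transformers, accelerate and bitsandbytes are installed (the prompt is just an example):

# pip install -q transformers accelerate bitsandbytes
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# no .to(device) here: the 8-bit model is already dispatched to the GPU(s)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    device_map="auto",
    load_in_8bit=True,
)

inputs = tokenizer("def hello_world():", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))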

To report progress after half a year:

  • I was able to run multiple small models (7B) quickly and flawlessly on my RTX 4080, using the LM Studio server out of the box. I could go up to 10B quantized, but such models are not common for some reason (a client-side sketch follows this list).
  • Their utility is questionable, though. They are certainly not reliable enough to serve as the base of any useful system or process, even if only for personal use.
  • They are usable only for experiments, learning, or just fun.
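
Since LM Studio exposes an OpenAI-compatible local server, a client-side call could look like the sketch below; the port, route, and prompt are assumptions based on LM Studio's defaults, not details from this thread:

# pip install requests
import requests

# assumes LM Studio's local server is running with a model already loaded
resp = requests.post(
    "http://localhost:1234/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Write a Python hello world."}],
        "temperature": 0.7,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])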
