Issue with loading model to GPU when using pipeline

#5
by AlpYu-HubX - opened

Maybe I'm doing something wrong, but when I try to run inference through the pipeline method on a CUDA device (16 GB of GPU RAM), I still get a "Killed" message, meaning my CPU RAM is running out.

from transformers import pipeline
import torch

use_cuda = torch.cuda.is_available()
print(use_cuda)

pipe = pipeline(model='togethercomputer/GPT-NeoXT-Chat-Base-20B', device="cuda")

def generate_response(input):
    response = pipe(input)
    print(response)
    return

if __name__ == "__main__":
    prompt = "<human>: Hello!\n<bot>:"
    generate_response(prompt)

@AlpYu-HubX You need a GPU with more RAM or multiple GPUs, for instance G5 instances on SageMaker. You could also use the following to load the model in 8-bit and offload part of it to the CPU (but you will still need a better GPU):

from transformers import AutoModelForCausalLM

model_8bit = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", load_in_8bit=True)

You need the following dependencies though:

bitsandbytes
accelerate
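
For reference, a minimal end-to-end sketch of that approach (the "text-generation" task name and the generation parameters are my assumptions, not something confirmed above):

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_name = "togethercomputer/GPT-NeoXT-Chat-Base-20B"

# Load the 20B model in 8-bit; accelerate places layers on the GPU(s) and offloads the rest to CPU.
model_8bit = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Do not pass device= here; accelerate has already placed the model.
pipe = pipeline("text-generation", model=model_8bit, tokenizer=tokenizer)

print(pipe("<human>: Hello!\n<bot>:", max_new_tokens=64))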
Together org

@AlpYu-HubX You seem to have encountered an OOM problem. Unfortunately, I suspect the 8-bit solution still won't work for you, as loading the 20B model in 8-bit already takes >20 GB of GPU memory, more than your 16 GB card has. But it can be distributed across multiple GPUs as a workaround.
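
For example, on a machine with two GPUs you could cap the memory used on each card so accelerate shards the 8-bit weights across both instead of spilling into CPU RAM (the device indices and memory limits below are illustrative assumptions, not tested values):

from transformers import AutoModelForCausalLM

# Hypothetical two-GPU setup: restrict each card to ~14 GiB so the layers are split across GPU 0 and GPU 1.
model_8bit = AutoModelForCausalLM.from_pretrained(
    "togethercomputer/GPT-NeoXT-Chat-Base-20B",
    device_map="auto",
    load_in_8bit=True,
    max_memory={0: "14GiB", 1: "14GiB"},
)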
