Can't run inference on multi-GPU

#8
by daryl149 - opened

Works on a single A6000:

from transformers import LlamaTokenizer, LlamaForCausalLM, TextStreamer

tokenizer = LlamaTokenizer.from_pretrained("oasst-rlhf-2-llama-30b")
model = LlamaForCausalLM.from_pretrained("oasst-rlhf-2-llama-30b", device_map="sequential", offload_folder="offload", load_in_8bit=True)
streamer = TextStreamer(tokenizer, skip_prompt=True)
message = "<|prompter|>This is a demo of a text streamer. What's a cool fact about ducks?<|assistant|>"
inputs = tokenizer(message, return_tensors="pt").to(model.device)
tokens = model.generate(**inputs, max_new_tokens=500, do_sample=True, temperature=0.9, streamer=streamer)

Throws an error on two V100S cards (each hosting 17GB of model weights):

from transformers import LlamaTokenizer, LlamaForCausalLM, TextStreamer

tokenizer = LlamaTokenizer.from_pretrained("oasst-rlhf-2-llama-30b")
model = LlamaForCausalLM.from_pretrained("oasst-rlhf-2-llama-30b", device_map="auto", offload_folder="offload", load_in_8bit=True)
streamer = TextStreamer(tokenizer, skip_prompt=True)
message = "<|prompter|>This is a demo of a text streamer. What's a cool fact about ducks?<|assistant|>"
inputs = tokenizer(message, return_tensors="pt").to(model.device)
tokens = model.generate(**inputs, max_new_tokens=500, do_sample=True, temperature=0.9, streamer=streamer)

throws:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "myvenv/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "myvenv/lib/python3.10/site-packages/transformers/generation/utils.py", line 1558, in generate
    return self.sample(
  File "myvenv/lib/python3.10/site-packages/transformers/generation/utils.py", line 2641, in sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

The only difference is that I'm using device_map="auto" to make use of both GPUs. (The error also occurs with .to('cuda'), .to(0), or .to(1) instead of .to(model.device).)
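For context, the failure happens in the sampling step: torch.multinomial refuses to draw from a probability tensor containing inf, nan, or negative entries, which is what a numerically broken 8-bit forward pass produces after softmax. A dependency-free sketch of the equivalent guard (validate_probs is a hypothetical helper, not transformers or torch API):

```python
import math

def validate_probs(probs):
    """Mimic the precondition torch.multinomial enforces before sampling:
    every entry must be a finite, non-negative number."""
    for p in probs:
        if math.isnan(p) or math.isinf(p) or p < 0:
            raise ValueError(
                "probability tensor contains either `inf`, `nan` or element < 0"
            )
    return True

# A healthy distribution passes the check...
validate_probs([0.1, 0.7, 0.2])

# ...while corrupted logits (as from the broken multi-GPU path) do not:
try:
    validate_probs([float("nan"), 0.5, 0.5])
except ValueError as e:
    print(e)
```

Printing the offending probability tensor right before generate() raises is a quick way to confirm the forward pass, not the sampler, is what broke.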

Ah, there's an open bug in transformers for it:
https://github.com/huggingface/transformers/issues/22914

daryl149 changed discussion status to closed

Update:
The inf/nan is caused by CUDA 11.8 with bitsandbytes==0.38.1. It's solved by downgrading to CUDA 11.6 and bitsandbytes 0.31.8.
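For anyone hitting the same thing, the pin described above would look something like this (the conda line is one possible way to get the CUDA 11.6 toolkit; adjust for your setup):

```shell
# Pin bitsandbytes to the version reported to work
pip install bitsandbytes==0.31.8

# Downgrade the CUDA toolkit to 11.6, e.g. via conda:
# conda install -c nvidia cuda-toolkit=11.6
# or install the 11.6 runfile/package from NVIDIA directly.
```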

However, inference on multi-GPU is still broken: it returns gibberish when using load_in_8bit=True. See the issue I created in transformers: https://github.com/huggingface/transformers/issues/23989

daryl149 changed discussion status to open
