Generating answers with Dolly_v2_3b on several GPUs

#31
by Andi2022HH - opened

Hi all,

As an exercise, I am trying to use Dolly_v2_3b on 4 GPUs with 16 GB of RAM. Distributing Dolly across my 4 GPUs works. However, when I try to send a prompt to the model

text = "Hello my friends! How are you doing today?"
tokenized_text = tokenizer(text, return_tensors="pt")
tokenized_text = tokenized_text.to(0)
translation = model.generate(**tokenized_text)
translated_text = tokenizer.batch_decode(translation, skip_special_tokens=True)[0]
print(translated_text)

then I get the following error:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:3 and cuda:0!

Here is the distribution across the GPUs, which works well:

| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                GPU Memory  |
|        ID   ID                                                 Usage       |
|=============================================================================|
|    0   N/A  N/A     11261      C   python                          3121MiB |
|    1   N/A  N/A     11261      C   python                          3837MiB |
|    2   N/A  N/A     11261      C   python                          3837MiB |
|    3   N/A  N/A     11261      C   python                          3121MiB |

Here is my complete code:

from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import get_peft_model, PromptTuningConfig, PromptTuningInit, TaskType
from accelerate import dispatch_model, infer_auto_device_map
from accelerate.utils import get_balanced_memory

tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-3b")
model = AutoModelForCausalLM.from_pretrained("databricks/dolly-v2-3b", device_map='auto')

# Prompt-tuning configuration for PEFT
peft_config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    prompt_tuning_init=PromptTuningInit.TEXT,
    num_virtual_tokens=50,
    prompt_tuning_init_text="Answer the question as truthfully as possible",
    tokenizer_name_or_path="databricks/dolly-v2-3b",
)

model = get_peft_model(model, peft_config)

# Balance the model's memory footprint across the available GPUs
max_memory = get_balanced_memory(
    model,
    max_memory=None,
    no_split_module_classes=["GPTNeoXLayer", "GPTNeoXMLP"],
    dtype='float16',
    low_zero=False,
)

device_map = infer_auto_device_map(
    model,
    max_memory=max_memory,
    no_split_module_classes=["GPTNeoXLayer", "GPTNeoXMLP"],
    dtype='float16',
)

model = dispatch_model(model, device_map=device_map)

text = "Hello my friends! How are you doing today?"
tokenized_text = tokenizer(text, return_tensors="pt")
tokenized_text = tokenized_text.to(0)
translation = model.generate(**tokenized_text)
translated_text = tokenizer.batch_decode(translation, skip_special_tokens=True)[0]
print(translated_text)

Any idea how to solve this?

Databricks org

You should use a pipeline with device="cuda" around the model and tokenizer; it will do the right thing here
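For reference, a minimal sketch of that suggestion with the standard transformers text-generation pipeline (the loading arguments such as torch_dtype are assumptions, not part of the original reply):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-3b")
# Assumption: load on a single device in float16 so the 3B model fits in 16 GB
model = AutoModelForCausalLM.from_pretrained("databricks/dolly-v2-3b", torch_dtype=torch.float16)

# Wrap model and tokenizer in a pipeline; device="cuda" lets the pipeline
# move the tokenized inputs to the same device as the model before generate()
generator = pipeline("text-generation", model=model, tokenizer=tokenizer, device="cuda")

print(generator("Hello my friends! How are you doing today?")[0]["generated_text"])

Note that if the model stays sharded across several GPUs with device_map='auto', the device argument should be omitted and the pipeline leaves placement to accelerate.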
