Generating answers with Dolly_v2_3b on several GPUs

#31
by Andi2022HH - opened

Hi all,

As an exercise, I am trying to use Dolly_v2_3b on 4 GPUs with 16 GB of RAM. Distributing Dolly across my 4 GPUs works. However, when I try to send a prompt to the model

text = "Hello my friends! How are you doing today?"
tokenized_text = tokenizer(text, return_tensors="pt")
tokenized_text = tokenized_text.to(0)
translation = model.generate(**tokenized_text)
translated_text = tokenizer.batch_decode(translation, skip_special_tokens=True)[0]
print(translated_text)

then I get the following error:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:3 and cuda:0!

Here is the distribution across the GPUs, which works well:

| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                GPU Memory  |
|        ID   ID                                                 Usage       |
|=============================================================================|
|    0   N/A  N/A     11261      C   python                          3121MiB |
|    1   N/A  N/A     11261      C   python                          3837MiB |
|    2   N/A  N/A     11261      C   python                          3837MiB |
|    3   N/A  N/A     11261      C   python                          3121MiB |

Here is my complete code:

from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import get_peft_model, PromptTuningConfig, PromptTuningInit, TaskType
from accelerate import dispatch_model, infer_auto_device_map
from accelerate.utils import get_balanced_memory

tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-3b")
model = AutoModelForCausalLM.from_pretrained("databricks/dolly-v2-3b", device_map='auto')

# Prompt-tuning configuration for PEFT
peft_config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    prompt_tuning_init=PromptTuningInit.TEXT,
    num_virtual_tokens=50,
    prompt_tuning_init_text="Answer the question as truthfully as possible",
    tokenizer_name_or_path="databricks/dolly-v2-3b",
)

model = get_peft_model(model, peft_config)

# Balance the model's memory footprint across the available GPUs
max_memory = get_balanced_memory(
    model,
    max_memory=None,
    no_split_module_classes=["GPTNeoXLayer", "GPTNeoXMLP"],
    dtype='float16',
    low_zero=False,
)

device_map = infer_auto_device_map(
    model,
    max_memory=max_memory,
    no_split_module_classes=["GPTNeoXLayer", "GPTNeoXMLP"],
    dtype='float16',
)

model = dispatch_model(model, device_map=device_map)

text = "Hello my friends! How are you doing today?"
tokenized_text = tokenizer(text, return_tensors="pt")
tokenized_text = tokenized_text.to(0)
translation = model.generate(**tokenized_text)
translated_text = tokenizer.batch_decode(translation, skip_special_tokens=True)[0]
print(translated_text)

Any idea how to solve this?

Databricks org

You should use a pipeline with device="cuda" around the model and tokenizer; it will do the right thing here
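For reference, a minimal sketch of that suggestion with the standard transformers text-generation pipeline (the loading arguments such as torch_dtype are assumptions, not part of the original reply):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-3b")
# Assumption: load on a single device in float16 so the 3B model fits in 16 GB
model = AutoModelForCausalLM.from_pretrained("databricks/dolly-v2-3b", torch_dtype=torch.float16)

# Wrap model and tokenizer in a pipeline; device="cuda" lets the pipeline
# move the tokenized inputs to the same device as the model before generate()
generator = pipeline("text-generation", model=model, tokenizer=tokenizer, device="cuda")

print(generator("Hello my friends! How are you doing today?")[0]["generated_text"])

Note that if the model stays sharded across several GPUs with device_map='auto', the device argument should be omitted and the pipeline leaves placement to accelerate.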
