input_ids issue

#14
by GaaraOtheSand - opened

I'm getting this odd error and I'm not entirely sure why; it may have to do with the model and how I'm using the device_map rather than the actual input_ids. It also states that the attention mask and the pad token id aren't set, but the example of how to run the script makes no mention of these, and unfortunately the error message in the console doesn't say where that issue is coming from, so there aren't a lot of clues to go off of. This is the error it provides:

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py:1535: UserWarning: You are calling .generate() with the `input_ids` being on a device type different than your model's device. `input_ids` is on cpu, whereas the model is on cuda. You may experience unexpected behaviors or slower generation. Please make sure that you have put `input_ids` to the correct device by calling for example `input_ids = input_ids.to('cuda')` before running `.generate()`.
  warnings.warn(
Traceback (most recent call last):
  File "/usr/local/llamaengineer.py", line 498, in <module>
    generated_text = generate(prompt)
  File "/usr/local/llamaengineer.py", line 488, in generate
    generate_ids = model.generate(inputs.input_ids.to("cpu"), max_new_tokens=384, do_sample=True, top_p=0.75, top_k=40, temperature=0.1)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 1648, in generate
    return self.sample(
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 2730, in sample
    outputs = self(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 820, in forward
    outputs = self.model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 708, in forward
    layer_outputs = decoder_layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 424, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 333, in forward
    query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 184, in apply_rotary_pos_emb
    cos = cos[position_ids].unsqueeze(1)  # [bs, 1, seq_len, dim]
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)

That was the error I got when I tried running it like this:

 generate_ids = model.generate(inputs.input_ids.to("cpu"), max_new_tokens=384, do_sample=True, top_p=0.75, top_k=40, temperature=0.1)

I only tried that because an almost identical issue occurred when I ran it with input_ids.to("cuda"); the difference is that instead of getting the warning about the input_ids being on a different device than my model, I just got this message:

 The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` 
 to obtain reliable results.
 Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
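
As far as I can tell, that warning just means `generate()` was only given `input_ids`; the usual way to quiet it is to pass the attention mask and a pad token id explicitly. A rough sketch, using the same `tokenizer`, `model`, and `inputs` objects as in my script below (not necessarily the exact change needed here):

 tokenizer.pad_token = tokenizer.eos_token
 inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=4096)
 # Pass the attention mask and pad token id along with the input ids so
 # generate() doesn't have to guess them.
 generate_ids = model.generate(
     inputs.input_ids,
     attention_mask=inputs.attention_mask,
     pad_token_id=tokenizer.eos_token_id,
     max_new_tokens=384,
 )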

Any help would be greatly appreciated; I'll provide the important part of the script I'm running for reference:

 from transformers import AutoTokenizer, LlamaForCausalLM

 # bnb_config and device_map are defined earlier in the full script
 model_path = "Phind/Phind-CodeLlama-34B-v2"
 model = LlamaForCausalLM.from_pretrained(model_path, quantization_config=bnb_config, device_map=device_map)
 tokenizer = AutoTokenizer.from_pretrained(model_path)

 def generate(prompt: str):

     tokenizer.pad_token = tokenizer.eos_token
     inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=4096)

     # Generate
     generate_ids = model.generate(inputs.input_ids.to("cuda"), max_new_tokens=384, do_sample=True, top_p=0.75, top_k=40, temperature=0.1)
     completion = tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
     completion = completion.replace(prompt, "").split("\n\n\n")[0]

     # Print the completion to the console
     print("Generated Completion:")
     print(completion)

     return completion
 prompt = "Please write a small script that prints the numbers 1-10 in the console"
 generated_text = generate(prompt) 
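
In case it helps anyone hitting the same traceback: the indexing error seems to come from hard-coding the device of the inputs while device_map decides where the model's layers actually live. A sketch of how the tokenize/generate step can be written so the inputs follow the model instead (same model and tokenizer as above; I'm not claiming this is the only fix):

 # Send the whole encoding (input_ids and attention_mask) to the device the
 # loaded model reports, instead of hard-coding "cpu" or "cuda".
 inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=4096).to(model.device)
 # **inputs unpacks both input_ids and attention_mask into generate().
 generate_ids = model.generate(**inputs, max_new_tokens=384, do_sample=True, top_p=0.75, top_k=40, temperature=0.1)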

So it turns out I was able to solve my problem and now have the model working; I'm very excited to see it in action. If anyone's interested, I put the model script up on my GitHub, mainly because I'm using a technique that lets me run this model on my limited GPU, which I think is pretty cool: Shikamaru5/LlamaEngineer
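
The "technique" is presumably the quantization_config (plus the device_map) shown in the snippet above; my exact bnb_config isn't included there, but a typical 4-bit bitsandbytes setup for fitting a 34B model onto a small GPU looks roughly like this (the values are illustrative, not necessarily what LlamaEngineer uses):

 import torch
 from transformers import BitsAndBytesConfig

 # Illustrative 4-bit quantization config; the actual values in the script may differ.
 bnb_config = BitsAndBytesConfig(
     load_in_4bit=True,
     bnb_4bit_quant_type="nf4",
     bnb_4bit_compute_dtype=torch.float16,
     bnb_4bit_use_double_quant=True,
 )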

GaaraOtheSand changed discussion status to closed
