
RuntimeError: expected scalar type Half but found Char

#5
by nicolasbo - opened

System Info

Error when running an LLM loaded with 8-bit quantization (load_in_8bit=True).

Versions:
tokenizers 0.13.3
transformers 4.31.0

Error message:

  File "/opt/conda/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 408, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 295, in forward
    query_states = [F.linear(hidden_states, query_slices[i]) for i in range(self.pretraining_tp)]
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 295, in <listcomp>
    query_states = [F.linear(hidden_states, query_slices[i]) for i in range(self.pretraining_tp)]
RuntimeError: expected scalar type Half but found Char
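
For context (not part of the original report): the failing frames are the pretraining_tp code path in modeling_llama.py, which slices the projection weights and calls F.linear on the slices directly; with load_in_8bit=True those weights are stored as int8 ("Char"), while the hidden states are float16 ("Half"). A minimal sketch of the same dtype mismatch (the exact error text may differ by device and backend):

# Sketch: F.linear called directly on int8 ("Char") weights with float16
# ("Half") activations, as the pretraining_tp slicing path effectively does.
import torch
import torch.nn.functional as F

hidden_states = torch.randn(1, 4, dtype=torch.float16)            # Half activations
int8_weight = torch.randint(-128, 127, (4, 4), dtype=torch.int8)  # Char weights

F.linear(hidden_states, int8_weight)  # raises a dtype-mismatch RuntimeError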

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

To reproduce the issue:

import torch
from transformers import LlamaTokenizer, LlamaForCausalLM

model_id = "WizardLM/WizardLM-13B-V1.2"
tokenizer = LlamaTokenizer.from_pretrained(model_id)

# Load in 8-bit via bitsandbytes; activations stay in float16.
model = LlamaForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,
    torch_dtype=torch.float16,
    device_map="auto",
)

model.config.pad_token_id = tokenizer.pad_token_id = 0  # unk
model.config.bos_token_id = 1
model.config.eos_token_id = 2
model.eval()
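
A small diagnostic sketch (not in the original report) to confirm what the failing path sees: under load_in_8bit=True the projection layers are typically bitsandbytes int8 modules, and the config's pretraining_tp value is what selects the sliced F.linear path from the traceback.

# Sketch: inspect the dtypes and the config flag involved in the failing path.
q_proj = model.model.layers[0].self_attn.q_proj
print(type(q_proj))                 # typically bitsandbytes Linear8bitLt under load_in_8bit=True
print(q_proj.weight.dtype)          # typically torch.int8 ("Char")
print(model.config.pretraining_tp)  # values > 1 trigger the sliced F.linear calls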

Inference:

prompt_ = "What is the difference between fusion and fission?"
prompts = f"""A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: {prompt_} ASSISTANT:"""
inputs = tokenizer(prompts, return_tensors="pt")
device = "cuda"
input_ids = inputs["input_ids"].to(device)
max_new_tokens= 2048
with torch.no_grad():
    generation_output = model.generate(
                input_ids=input_ids,
                return_dict_in_generate=True,
                output_scores=True,
                max_new_tokens=max_new_tokens
    )
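
For an end-to-end repro, the expected final step would be decoding the generated ids (a sketch reusing the names above; generation_output.sequences is available because return_dict_in_generate=True):

# Sketch: decode the generated sequence once generate() succeeds.
generated = generation_output.sequences[0]
print(tokenizer.decode(generated, skip_special_tokens=True))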

Expected behavior

A generated reply to the prompt, instead of the RuntimeError above.

I get the same error.

tokenizer = LlamaTokenizer.from_pretrained("WizardLM/WizardLM-13B-V1.2", model_max_length=2048)
model = LlamaForCausalLM.from_pretrained("WizardLM/WizardLM-13B-V1.2", pretraining_tp=1,
                                         load_in_8bit=True, torch_dtype=torch.float16, device_map="auto")

Load the model with pretraining_tp=1 (as in the snippet above); this avoids the tensor-parallel slicing path in modeling_llama.py that calls F.linear directly on the int8 weights.
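
An alternative sketch (my own, not from the thread), assuming the same override can be applied on the config object before loading rather than passed as a keyword argument:

# Sketch: set pretraining_tp=1 on the config, then load with it.
import torch
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig.from_pretrained("WizardLM/WizardLM-13B-V1.2")
config.pretraining_tp = 1  # disable the tensor-parallel slicing code path

model = LlamaForCausalLM.from_pretrained(
    "WizardLM/WizardLM-13B-V1.2",
    config=config,
    load_in_8bit=True,
    torch_dtype=torch.float16,
    device_map="auto",
)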
