RuntimeError: expected scalar type Half but found Char

#5
by nicolasbo - opened

System Info

Error when running inference with a Llama model loaded with 8-bit quantization (load_in_8bit=True).

Versions:
tokenizers 0.13.3
transformers 4.31.0

Error message:

  File "/opt/conda/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 408, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 295, in forward
    query_states = [F.linear(hidden_states, query_slices[i]) for i in range(self.pretraining_tp)]
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 295, in <listcomp>
    query_states = [F.linear(hidden_states, query_slices[i]) for i in range(self.pretraining_tp)]
RuntimeError: expected scalar type Half but found Char
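For context (my reading of the traceback above, not something stated elsewhere in this thread): when config.pretraining_tp > 1, the attention forward slices q_proj.weight and feeds the slices to F.linear directly. With load_in_8bit=True those weights are stored as int8 ("Char") by bitsandbytes, while the hidden states are float16 ("Half"), so the matmul rejects the mixed dtypes. A minimal sketch of the clash with synthetic tensors (not the real model weights):

import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
hidden_states = torch.randn(1, 8, 16, dtype=torch.float16, device=device)          # "Half" activations
int8_weight = torch.randint(-128, 127, (16, 16), dtype=torch.int8, device=device)  # "Char" weight slice

try:
    # Effectively what the pretraining_tp > 1 branch does with quantized weights.
    F.linear(hidden_states, int8_weight)
except RuntimeError as e:
    print(e)  # e.g. "expected scalar type Half but found Char" (exact wording may vary)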

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

To reproduce the issue:

import torch
from transformers import LlamaTokenizer, LlamaForCausalLM, GenerationConfig, LlamaConfig

model_id="WizardLM/WizardLM-13B-V1.2"
tokenizer = LlamaTokenizer.from_pretrained(model_id)
model = LlamaForCausalLM.from_pretrained(
        model_id,
        load_in_8bit=True,
        torch_dtype=torch.float16,
        device_map="auto",
)

model.config.pad_token_id = tokenizer.pad_token_id = 0  # unk
model.config.bos_token_id = 1
model.config.eos_token_id = 2
model.eval()

Inference:

prompt_ = "What is the difference between fusion and fission?"
prompts = f"""A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: {prompt_} ASSISTANT:"""
inputs = tokenizer(prompts, return_tensors="pt")
device = "cuda"
input_ids = inputs["input_ids"].to(device)
max_new_tokens = 2048
with torch.no_grad():
    generation_output = model.generate(
                input_ids=input_ids,
                return_dict_in_generate=True,
                output_scores=True,
                max_new_tokens=max_new_tokens
    )
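For completeness (not part of the original report), decoding the result afterwards would look like this; generation_output has a .sequences field because return_dict_in_generate=True:

# .sequences contains the prompt ids followed by the newly generated token ids
output_text = tokenizer.decode(generation_output.sequences[0], skip_special_tokens=True)
print(output_text)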

Expected behavior

The model generates a reply to the prompt instead of raising the error.

I get the same error with the following code:

tokenizer = LlamaTokenizer.from_pretrained("WizardLM/WizardLM-13B-V1.2", model_max_length=2048)
model = LlamaForCausalLM.from_pretrained("WizardLM/WizardLM-13B-V1.2", pretraining_tp=1,
                                         load_in_8bit=True, torch_dtype=torch.float16, device_map="auto")

Load the model with pretraining_tp=1 so that the attention projections go through their regular (8-bit aware) forward instead of the F.linear slicing path shown in the traceback.
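An equivalent way to do this (a sketch, untested on this checkpoint) is to override the value on the config object before loading:

import torch
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig.from_pretrained("WizardLM/WizardLM-13B-V1.2")
config.pretraining_tp = 1  # use the regular projection forward instead of slicing weights for F.linear

model = LlamaForCausalLM.from_pretrained(
    "WizardLM/WizardLM-13B-V1.2",
    config=config,
    load_in_8bit=True,
    torch_dtype=torch.float16,
    device_map="auto",
)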
