
example code returns RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'

#2
by iekang - opened

Thanks for sharing this amazing model!
When I try to run your example code on my server (8 GPUs, CUDA 11.4), I get the errors below. Any insight?

Traceback (most recent call last):
  File "/mnt/task_runtime/test_olm_llama.py", line 20, in <module>
    generation_output = model.generate(
  File "/mnt/task_runtime/anaconda3/envs/py39/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/mnt/task_runtime/anaconda3/envs/py39/lib/python3.9/site-packages/transformers/generation/utils.py", line 1522, in generate
    return self.greedy_search(
  File "/mnt/task_runtime/anaconda3/envs/py39/lib/python3.9/site-packages/transformers/generation/utils.py", line 2339, in greedy_search
    outputs = self(
  File "/mnt/task_runtime/anaconda3/envs/py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/task_runtime/anaconda3/envs/py39/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 688, in forward
    outputs = self.model(
  File "/mnt/task_runtime/anaconda3/envs/py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/task_runtime/anaconda3/envs/py39/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 578, in forward
    layer_outputs = decoder_layer(
  File "/mnt/task_runtime/anaconda3/envs/py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/task_runtime/anaconda3/envs/py39/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 292, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/mnt/task_runtime/anaconda3/envs/py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/task_runtime/anaconda3/envs/py39/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 194, in forward
    query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
  File "/mnt/task_runtime/anaconda3/envs/py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/task_runtime/anaconda3/envs/py39/lib/python3.9/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'

OpenLM Research org

This is likely a result of running it on CPU, where the half-precision ops are not supported. To use it on CPU, you need to convert the data type to float32 before you run any inference.
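For example, here is a minimal sketch of CPU-only inference in float32. The checkpoint name and prompt are only illustrative guesses (not taken from this thread); substitute the model you are actually using.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "openlm-research/open_llama_7b"  # assumed checkpoint; replace with yours

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float32,  # float32 so CPU matmuls work; float16 on CPU triggers the 'Half' error
)

inputs = tokenizer("Q: What is the largest animal?\nA:", return_tensors="pt")
generation_output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(generation_output[0]))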

It would also be helpful if you pasted the sample code you were testing.

If you see this line, comment it out:
torch.set_default_tensor_type(torch.cuda.HalfTensor)

Where do we find that line? Which file?

If you downloaded the model directly from Meta, there should be a Python script at /llama/llama/generation.py.
You can either comment out line 100 or update it to:

    if torch.cuda.is_available():
        torch.set_default_tensor_type(torch.cuda.HalfTensor)

so that half-tensors are not used if you don't have CUDA.

It would be helpful if you could paste the code that gave you this error. I was getting the same error when trying out the code from this tutorial: https://huggingface.co/blog/llama2

For me, changing torch_dtype from torch.float16 to torch.bfloat16 fixed the issue.
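Concretely, the only change relative to the pipeline call from that blog post (the same call appears in the code posted below) is the dtype; whether bfloat16 helps depends on your hardware and PyTorch build:

import torch
import transformers

pipeline = transformers.pipeline(
    "text-generation",
    model="meta-llama/Llama-2-70b-chat-hf",
    torch_dtype=torch.bfloat16,  # bfloat16 instead of float16
    device_map="auto",
)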

For me, this Colab notebook replicated the issue:
https://colab.research.google.com/drive/1SDN3rJhyL9EpDWuDVjyE3lJ6hV0Cfdd-?usp=sharing

The error was not resolved by changing torch_dtype from torch.float16 to torch.bfloat16.

This is the code that generated the error. If you have a solution, could you also describe the proposed solution's rationale?

from transformers import AutoTokenizer
import transformers
import torch

model = "meta-llama/Llama-2-70b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(model)

pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

sequences = pipeline(
    'I liked "Breaking Bad" and "Band of Brothers". Do you have any recommendations of other shows I might like?\n',
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=200,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

Same issue here. After changing torch.float16 to torch.float32, the model takes forever to load, consumes 99% of the RAM, and then the notebook stops. If anyone knows why this is happening and has a solution, please let me know. Thank you!
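For what it's worth, a rough back-of-the-envelope calculation (assuming the 70B-parameter checkpoint from the code above) suggests why float32 exhausts RAM: at 4 bytes per parameter, the weights alone are far larger than a typical notebook's memory.

# Rough weight-memory arithmetic, assuming the 70B-parameter model from the code above
params = 70e9  # approximate parameter count
print(f"float32 weights: ~{params * 4 / 1e9:.0f} GB")           # ~280 GB
print(f"float16/bfloat16 weights: ~{params * 2 / 1e9:.0f} GB")  # ~140 GB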
