Getting an error when loading the model

#3
by MLDataScientist - opened

Can you please share what script you used to run the model? With the script provided in this repo, I am getting "Runtime error: LayerNormKernelImpl not implemented for Half". Thank you!

Pruna AI org

Are you running on CPU ?

Yes, I am using CPU only. I changed the load mode to CPU as well. I have 96GB of RAM.

Pruna AI org

In that case can you try removing the torch_dtype ? CPUs don't support half-precision (i.e. float16)

Sure. I will share results soon here.
So, I will remove torch_dtype=bfloat16 parameter from the script provided here, right?
Thanks!

Pruna AI org

Yes exactly, let us know if it works :)

So, I changed the script and removed the bfloat16 parameter. But I am still getting an error.
Below is the code I used to run the model:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained(".", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(".", device_map="cpu", trust_remote_code=True)

input_text = "What does it take to build a great LLM?"
messages = [{"role": "user", "content": input_text}]
input_ids = tokenizer.apply_chat_template(messages, return_dict=True, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cpu")


outputs = model.generate(**input_ids, max_new_tokens=100)
print(tokenizer.decode(outputs[0])) 

Above code is saved in dbrx_run.py and it is in the same folder as model weights. And here is the error I get when I run it:

user:~/Downloads/text-generation-webui-main/models/PrunaAI_dbrx-instruct-bnb-4bit$ python3 dbrx_run.py
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 15/15 [00:02<00:00,  5.42it/s]
Setting `pad_token_id` to `eos_token_id`:100257 for open-end generation.
Traceback (most recent call last):
  File "/home/user/Downloads/text-generation-webui-main/models/PrunaAI_dbrx-instruct-bnb-4bit/dbrx_run.py", line 12, in <module>
    outputs = model.generate(**input_ids, max_new_tokens=100)
  File "/home/user/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/transformers/generation/utils.py", line 1527, in generate
    result = self._greedy_search(
  File "/home/user/.local/lib/python3.10/site-packages/transformers/generation/utils.py", line 2411, in _greedy_search
    outputs = self(
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/user/.cache/huggingface/modules/transformers_modules/SinclairSchneider/dbrx-instruct-quantization-fixed/52aabf1d1280c4a1f0425f8bc3554b66c1318007/modeling_dbrx.py", line 1307, in forward
    outputs = self.transformer(
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/user/.cache/huggingface/modules/transformers_modules/SinclairSchneider/dbrx-instruct-quantization-fixed/52aabf1d1280c4a1f0425f8bc3554b66c1318007/modeling_dbrx.py", line 1105, in forward
    block_outputs = block(
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/user/.cache/huggingface/modules/transformers_modules/SinclairSchneider/dbrx-instruct-quantization-fixed/52aabf1d1280c4a1f0425f8bc3554b66c1318007/modeling_dbrx.py", line 886, in forward
    resid_states, hidden_states, self_attn_weights, present_key_value = self.norm_attn_norm(
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/user/.cache/huggingface/modules/transformers_modules/SinclairSchneider/dbrx-instruct-quantization-fixed/52aabf1d1280c4a1f0425f8bc3554b66c1318007/modeling_dbrx.py", line 666, in forward
    hidden_states = self.norm_1(hidden_states).to(hidden_states.dtype)
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/normalization.py", line 201, in forward
    return F.layer_norm(
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/functional.py", line 2546, in layer_norm
    return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: "LayerNormKernelImpl" not implemented for 'Half'

Do I need to change some parameters in config.json file as well? Otherwise, let me know how to fix this. Thanks~

Pruna AI org
β€’
edited Apr 1

I have two suggestions:

1- Pass torch_dtype=torch.float32
2- Change L35 here to float32 (https://huggingface.co/PrunaAI/dbrx-instruct-bnb-4bit/blob/3b64c668f5fc525408170ca6565e347b9f95103f/config.json#L35)

Thanks! I will try it out tonight. I also have 36GB VRAM. Is it possible to offload part of the model to GPU and some part to CPU (RAM)? If yes, let me know. I will try out both methods tonight. Thanks!

Pruna AI org

I would suggest to try both methods at the same time.

Regarding your second question, if you set device_map='auto' you should be able to use both your GPU and CPU.

ok, I tested cpu only method. I am now getting this error:

user:~/Downloads/text-generation-webui-main/models/PrunaAI_dbrx-instruct-bnb-4bit$ python3 dbrx_run.py
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 15/15 [00:02<00:00,  6.28it/s]
Setting `pad_token_id` to `eos_token_id`:100257 for open-end generation.
Traceback (most recent call last):
  File "/home/user/Downloads/text-generation-webui-main/models/PrunaAI_dbrx-instruct-bnb-4bit/dbrx_run.py", line 12, in <module>
    outputs = model.generate(**input_ids, max_new_tokens=100)
  File "/home/user/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/transformers/generation/utils.py", line 1527, in generate
    result = self._greedy_search(
  File "/home/user/.local/lib/python3.10/site-packages/transformers/generation/utils.py", line 2411, in _greedy_search
    outputs = self(
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/user/.cache/huggingface/modules/transformers_modules/SinclairSchneider/dbrx-instruct-quantization-fixed/dfa405627499b4934c7d63132f7a3002ecf97d1e/modeling_dbrx.py", line 1307, in forward
    outputs = self.transformer(
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/user/.cache/huggingface/modules/transformers_modules/SinclairSchneider/dbrx-instruct-quantization-fixed/dfa405627499b4934c7d63132f7a3002ecf97d1e/modeling_dbrx.py", line 1105, in forward
    block_outputs = block(
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/user/.cache/huggingface/modules/transformers_modules/SinclairSchneider/dbrx-instruct-quantization-fixed/dfa405627499b4934c7d63132f7a3002ecf97d1e/modeling_dbrx.py", line 886, in forward
    resid_states, hidden_states, self_attn_weights, present_key_value = self.norm_attn_norm(
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/user/.cache/huggingface/modules/transformers_modules/SinclairSchneider/dbrx-instruct-quantization-fixed/dfa405627499b4934c7d63132f7a3002ecf97d1e/modeling_dbrx.py", line 668, in forward
    hidden_states, attn_weights, past_key_value = self.attn(
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/user/.cache/huggingface/modules/transformers_modules/SinclairSchneider/dbrx-instruct-quantization-fixed/dfa405627499b4934c7d63132f7a3002ecf97d1e/modeling_dbrx.py", line 316, in forward
    qkv_states = self.Wqkv(hidden_states)
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 429, in forward
    out = bnb.matmul_4bit(x, self.weight.t(), bias=bias, quant_state=self.weight.quant_state)
  File "/home/user/.local/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 577, in matmul_4bit
    return MatMul4Bit.apply(A, B, out, bias, quant_state)
  File "/home/user/.local/lib/python3.10/site-packages/torch/autograd/function.py", line 553, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/home/user/.local/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 516, in forward
    output = torch.nn.functional.linear(A, F.dequantize_4bit(B, quant_state).to(A.dtype).t(), bias)
  File "/home/user/.local/lib/python3.10/site-packages/bitsandbytes/functional.py", line 1082, in dequantize_4bit
    device = pre_call(A.device)
  File "/home/user/.local/lib/python3.10/site-packages/bitsandbytes/functional.py", line 437, in pre_call
    torch.cuda.set_device(device)
  File "/home/user/.local/lib/python3.10/site-packages/torch/cuda/__init__.py", line 406, in set_device
    device = _get_device_index(device)
  File "/home/user/.local/lib/python3.10/site-packages/torch/cuda/_utils.py", line 34, in _get_device_index
    raise ValueError(f"Expected a cuda device, but got: {device}")
ValueError: Expected a cuda device, but got: cpu

When I change the device_map='auto', I also get an error.

This is an error for device_map='auto'.

user:~/Downloads/text-generation-webui-main/models/PrunaAI_dbrx-instruct-bnb-4bit$ python3 dbrx_run.py
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 15/15 [00:07<00:00,  1.91it/s]
WARNING:root:Some parameters are on the meta device device because they were offloaded to the cpu.
Setting `pad_token_id` to `eos_token_id`:100257 for open-end generation.
Traceback (most recent call last):
  File "/home/user/Downloads/text-generation-webui-main/models/PrunaAI_dbrx-instruct-bnb-4bit/dbrx_run.py", line 12, in <module>
    outputs = model.generate(**input_ids, max_new_tokens=100)
  File "/home/user/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/transformers/generation/utils.py", line 1527, in generate
    result = self._greedy_search(
  File "/home/user/.local/lib/python3.10/site-packages/transformers/generation/utils.py", line 2411, in _greedy_search
    outputs = self(
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/user/.cache/huggingface/modules/transformers_modules/SinclairSchneider/dbrx-instruct-quantization-fixed/dfa405627499b4934c7d63132f7a3002ecf97d1e/modeling_dbrx.py", line 1307, in forward
    outputs = self.transformer(
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/.cache/huggingface/modules/transformers_modules/SinclairSchneider/dbrx-instruct-quantization-fixed/dfa405627499b4934c7d63132f7a3002ecf97d1e/modeling_dbrx.py", line 1105, in forward
    block_outputs = block(
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/user/.cache/huggingface/modules/transformers_modules/SinclairSchneider/dbrx-instruct-quantization-fixed/dfa405627499b4934c7d63132f7a3002ecf97d1e/modeling_dbrx.py", line 886, in forward
    resid_states, hidden_states, self_attn_weights, present_key_value = self.norm_attn_norm(
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/user/.cache/huggingface/modules/transformers_modules/SinclairSchneider/dbrx-instruct-quantization-fixed/dfa405627499b4934c7d63132f7a3002ecf97d1e/modeling_dbrx.py", line 668, in forward
    hidden_states, attn_weights, past_key_value = self.attn(
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/user/.cache/huggingface/modules/transformers_modules/SinclairSchneider/dbrx-instruct-quantization-fixed/dfa405627499b4934c7d63132f7a3002ecf97d1e/modeling_dbrx.py", line 316, in forward
    qkv_states = self.Wqkv(hidden_states)
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/accelerate/hooks.py", line 161, in new_forward
    args, kwargs = module._hf_hook.pre_forward(module, *args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/accelerate/hooks.py", line 347, in pre_forward
    set_module_tensor_to_device(
  File "/home/user/.local/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 411, in set_module_tensor_to_device
    new_value = param_cls(new_value, requires_grad=old_value.requires_grad).to(device)
  File "/home/user/.local/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 313, in to
    return self._quantize(device)
  File "/home/user/.local/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 280, in _quantize
    w_4bit, quant_state = bnb.functional.quantize_4bit(
  File "/home/user/.local/lib/python3.10/site-packages/bitsandbytes/functional.py", line 1009, in quantize_4bit
    raise ValueError(f"Blockwise quantization only supports 16/32-bit floats, but got {A.dtype}")
ValueError: Blockwise quantization only supports 16/32-bit floats, but got torch.uint8

@johnrachwanpruna let me know if there is a fix to above issues. Thanks!

Pruna AI org
β€’
edited Apr 2

ok, I tested cpu only method. I am now getting this error:

user:~/Downloads/text-generation-webui-main/models/PrunaAI_dbrx-instruct-bnb-4bit$ python3 dbrx_run.py
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 15/15 [00:02<00:00,  6.28it/s]
Setting `pad_token_id` to `eos_token_id`:100257 for open-end generation.
Traceback (most recent call last):
  File "/home/user/Downloads/text-generation-webui-main/models/PrunaAI_dbrx-instruct-bnb-4bit/dbrx_run.py", line 12, in <module>
    outputs = model.generate(**input_ids, max_new_tokens=100)
  File "/home/user/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/transformers/generation/utils.py", line 1527, in generate
    result = self._greedy_search(
  File "/home/user/.local/lib/python3.10/site-packages/transformers/generation/utils.py", line 2411, in _greedy_search
    outputs = self(
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/user/.cache/huggingface/modules/transformers_modules/SinclairSchneider/dbrx-instruct-quantization-fixed/dfa405627499b4934c7d63132f7a3002ecf97d1e/modeling_dbrx.py", line 1307, in forward
    outputs = self.transformer(
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/user/.cache/huggingface/modules/transformers_modules/SinclairSchneider/dbrx-instruct-quantization-fixed/dfa405627499b4934c7d63132f7a3002ecf97d1e/modeling_dbrx.py", line 1105, in forward
    block_outputs = block(
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/user/.cache/huggingface/modules/transformers_modules/SinclairSchneider/dbrx-instruct-quantization-fixed/dfa405627499b4934c7d63132f7a3002ecf97d1e/modeling_dbrx.py", line 886, in forward
    resid_states, hidden_states, self_attn_weights, present_key_value = self.norm_attn_norm(
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/user/.cache/huggingface/modules/transformers_modules/SinclairSchneider/dbrx-instruct-quantization-fixed/dfa405627499b4934c7d63132f7a3002ecf97d1e/modeling_dbrx.py", line 668, in forward
    hidden_states, attn_weights, past_key_value = self.attn(
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/user/.cache/huggingface/modules/transformers_modules/SinclairSchneider/dbrx-instruct-quantization-fixed/dfa405627499b4934c7d63132f7a3002ecf97d1e/modeling_dbrx.py", line 316, in forward
    qkv_states = self.Wqkv(hidden_states)
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 429, in forward
    out = bnb.matmul_4bit(x, self.weight.t(), bias=bias, quant_state=self.weight.quant_state)
  File "/home/user/.local/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 577, in matmul_4bit
    return MatMul4Bit.apply(A, B, out, bias, quant_state)
  File "/home/user/.local/lib/python3.10/site-packages/torch/autograd/function.py", line 553, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/home/user/.local/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 516, in forward
    output = torch.nn.functional.linear(A, F.dequantize_4bit(B, quant_state).to(A.dtype).t(), bias)
  File "/home/user/.local/lib/python3.10/site-packages/bitsandbytes/functional.py", line 1082, in dequantize_4bit
    device = pre_call(A.device)
  File "/home/user/.local/lib/python3.10/site-packages/bitsandbytes/functional.py", line 437, in pre_call
    torch.cuda.set_device(device)
  File "/home/user/.local/lib/python3.10/site-packages/torch/cuda/__init__.py", line 406, in set_device
    device = _get_device_index(device)
  File "/home/user/.local/lib/python3.10/site-packages/torch/cuda/_utils.py", line 34, in _get_device_index
    raise ValueError(f"Expected a cuda device, but got: {device}")
ValueError: Expected a cuda device, but got: cpu

When I change the device_map='auto', I also get an error.

Did you also put device_map='cpu' here ? and is the input also on cpu ?

@johnrachwanpruna yes, I changed the parameter to device_map='cpu'. Regarding inputs, yes, I also changed it to cpu. I could not find any similar issues in Google search. Then, I tried device_map='auto' which also resulted in a different error as shared above.

@johnrachwanpruna let me know if there is a workaround for this issue. Thanks!

Sign up or log in to comment