RuntimeError: expected scalar type Float but found Half

#3
by Jakxx

The model seems to load fine, but trying to generate text with it throws "RuntimeError: expected scalar type Float but found Half".

Any idea what this could be? I loaded the model as bfloat16.
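For context, that error just means a float32 ("Float") tensor met a float16 ("Half") tensor in a single op somewhere in the forward pass. Here's a minimal sketch of the failure mode and the usual fix in plain PyTorch (illustrative only; the exact message wording and which op trips first vary by model and device):

```python
import torch

layer = torch.nn.Linear(4, 4).half()  # weights stored as float16 ("Half")
x = torch.randn(1, 4)                 # inputs default to float32 ("Float")

try:
    layer(x)                          # dtype mismatch between input and weights
except RuntimeError as e:
    print(e)                          # e.g. "expected scalar type Float but found Half"

print(layer(x.half()).dtype)          # torch.float16 -- works once the dtypes agree
```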

Edit: my bad. I loaded the model as OPT instead of GPTJ in oobabooga. >_<

Has anybody gotten this model to load in oobabooga? I'm also getting the "RuntimeError: expected scalar type Float but found Half".

Using --wbits 4 --groupsize 128 (no model_type given)

I've seen other people set model_type to gptj, but then it runs slow as hell.
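For reference, the full launch line I'd expect for that setup looks something like this (assuming an early-2023 text-generation-webui checkout; the model folder is a placeholder and flag names may differ on newer versions):

```
python server.py --model <model-folder> --wbits 4 --groupsize 128 --model_type gptj
```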

I'm having no problem loading MetaIX/GPT4-X-Alpaca-30B-Int4/gpt4-x-alpaca-30b-128g-4bit.safetensors with just --wbits 4 --groupsize 128.

I think that's just what you have to deal with. 30 billion parameters is a LOT of data. I'm running this on a 4090 and I get around 0.4 tokens/s, while a 13B model gives me more like 9 tokens/s.
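For scale, here's a back-of-envelope estimate of the weight footprint alone at 4-bit (ignoring group scales, activations, and the KV cache, so real usage is higher):

```python
# Rough weight storage for 4-bit quantized models: 4 bits = 0.5 bytes per parameter.
for name, params in [("30B", 30e9), ("13B", 13e9)]:
    gib = params * 0.5 / 2**30
    print(f"{name} @ 4-bit: ~{gib:.1f} GiB of weights")
# 30B @ 4-bit: ~14.0 GiB of weights
# 13B @ 4-bit: ~6.1 GiB of weights
```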

Jakxx, the only problem is that my other 30-billion-parameter model performs fine on a 3090:

Found the following quantized model: models/GPT4-X-Alpaca-30B-Int4/gpt4-x-alpaca-30b-128g-4bit.safetensors
Loading model ...
Done.
Loaded the model in 18.39 seconds.
Output generated in 9.74 seconds (6.26 tokens/s, 61 tokens, context 218, seed 3259645)

Try MetaIX/GPT4-X-Alpaca-30B-Int4 on your 4090.

I believe this model was quantized with a nonstandard branch of GPTQ-for-LLaMa, while most people are running the standard GPTQ-for-LLaMa with text-generation-webui. Hopefully we'll see another quantization of this model.
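If anyone wants to requantize it themselves, the usual invocation in the mainstream GPTQ-for-LLaMa branch looks roughly like the line below (a sketch, assuming a LLaMA-family fp16 checkpoint; the script name, calibration dataset argument, and flags differ between branches, which is exactly the incompatibility at issue):

```
python llama.py <path-to-fp16-model> c4 --wbits 4 --groupsize 128 --save_safetensors <output>.safetensors
```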

Ah well, here's hoping.
