Download corrupted/Not working

#1
by Sergei6000 - opened

Could be an issue on my end, but I get an error that points towards a corrupted model, and the downloaded file size is only 6.5 GB instead of 7.01 GB.
Can anybody confirm this?

Part of the error in gptq-for-llama is:

size mismatch for model.layers.29.self_attn.o_proj.scales: copying a param with shape torch.Size([1, 5120]) from checkpoint, the shape in current model is torch.Size([40, 5120]).
size mismatch for model.layers.29.self_attn.q_proj.qzeros: copying a param with shape torch.Size([1, 640]) from checkpoint, the shape in current model is torch.Size([40, 640]).
size mismatch for model.layers.29.self_attn.q_proj.scales: copying a param with shape torch.Size([1, 5120]) from checkpoint, the shape in current model is torch.Size([40, 5120]).
size mismatch for model.layers.29.self_attn.v_proj.qzeros: copying a param with shape torch.Size([1, 640]) from checkpoint, the shape in current model is torch.Size([40, 640]).
size mismatch for model.layers.29.self_attn.v_proj.scales: copying a param with shape torch.Size([1, 5120]) from checkpoint, the shape in current model is torch.Size([40, 5120]).
size mismatch for model.layers.29.mlp.down_proj.qzeros: copying a param with shape torch.Size([1, 640]) from

I've gotten that error before from a corrupted download.
Tried multiple times with ooba and manually in the browser.

Try downloading again; I've downloaded it multiple times and it works fine. I use git lfs clone to download, so git lfs clone https://huggingface.co/flashvenom/Airoboros-13B-SuperHOT-8K-GPTQ should work.
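
If the git lfs clone keeps coming back truncated, a rough alternative (not from this thread) is to pull the repo with the huggingface_hub library and eyeball the resulting file sizes; this is only a sketch and assumes huggingface_hub is installed.

```python
import os
from huggingface_hub import snapshot_download  # pip install huggingface_hub

# Re-download the repo; files that are already complete are reused.
local_dir = snapshot_download(repo_id="flashvenom/Airoboros-13B-SuperHOT-8K-GPTQ")

# Print file sizes so a truncated weights file (e.g. 6.5 GB instead of ~7 GB)
# is easy to spot.
for name in sorted(os.listdir(local_dir)):
    path = os.path.join(local_dir, name)
    if os.path.isfile(path):
        print(f"{name}: {os.path.getsize(path) / 1e9:.2f} GB")
```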

Weird, with that method I could get it to load in exllama. Still getting the same error in gptq-for-llama.
Could be an issue on my end, but I can load other models.
(screenshot attached: Screenshot_20230623_123146.png)

Ah, I've only tried it on exLLaMa. I wonder what happened; I will re-run the quantization when I get some time.

Thanks, much appreciated!
For some reason exllama is slower on my system. Could be because of my two Pascal cards.

This model is really slow
(screenshot attached)

I tried to fix that in the config.json by changing "use_cache": false to "use_cache": true, but when I do that it doesn't want to generate anymore.

(screenshot attached)
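
For reference, a minimal sketch (not from this thread) of the config.json edit being discussed, assuming the file uses the standard Hugging Face use_cache key; the path is only an example.

```python
import json

# Toggle use_cache in the model's config.json (back up the file first).
config_path = "models/Airoboros-13B-SuperHOT-8K-GPTQ/config.json"  # adjust to your setup

with open(config_path) as f:
    config = json.load(f)

config["use_cache"] = True  # written out as lowercase `true` in JSON

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
```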

What is your setup? It's probably slow because you are using AutoGPTQ; try exllama instead. @Sergei6000 the new model is uploaded, I don't know if it will change anything but try it -- if it doesn't help then I'm unsure what's going on, it's probably incompatible with llama.cpp in its current form.

@flashvenom It's not supposed to be that slow, on AutoGPTQ I usually get 7-8 tokens/s with 13b models, but like I said your model is too slow because it has "use_cache = False" on the config.json, but I can't put it into true I got errors when doing that

Yeah, I should be using exllama, but it doesn't load models on Windows and I don't know why :(

Ah, I see. Well, I just quantized the model to use with exLLama, but I don't know the specifics of why it is slow with AutoGPTQ. I can look into it when I get some time. For exllama not loading on Windows, try WSL2 if you can.

Didn't know you could quantize the model to fit the exllama paradigm. And yeah, it works under WSL2; it's just that loading is much slower when doing that. I hope it'll get fixed in the future.

Btw I just tried your model (through WSL2 of course lol) and I'm a big fan. Airoboros was really good at following orders and eloquent prose, but it was a bit rigid and never brought something new to the table; adding SuperHOT fixed that issue. This is probably the best 13B model we've got right there :D

@flashvenom
Unfortunately it didn't help.
It's probably because I use the gptq-for-llama from ooba.
https://github.com/oobabooga/text-generation-webui/issues/1661

I tried updating to https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/old-cuda.
I'm able to load the model with that, but I get a "RuntimeError: expected scalar type Float but found Half" when generating.
Couldn't really get Triton to work, so I'm out of luck :(

@TheYuriLover
Definitely the best 13B I have seen yet.
Sometimes it really likes to go full "gpt" in its writing, though. That's most likely from Airoboros.
Haven't seen a 13B give long, coherent text outputs like this before. It's really good.

Why don't you try exllama? @Sergei6000

Exllama works, but it's a bit funky.
It seems like the prompt takes very long to process, but the actual new output is fast.
In TavernAI with a long prompt it gets really slow, almost 50 seconds for a reply, which is almost 30B levels with CPU offload for me.
But it's fine for testing. Thanks for trying the requantization for me; it might be my older Pascal cards.

It looks like the patch is made for Hugging Face Transformers. Does it actually affect ExLlama in any way?

exLLama has a built-in option to do scaling; it's called -cpe.
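
For illustration, a minimal sketch (not ExLlama's actual code) of what this kind of linear position compression does to the rotary embeddings; the helper name and the 2x factor are just examples.

```python
import torch

def rope_angles(head_dim: int, max_positions: int, compress: float = 1.0,
                base: float = 10000.0) -> torch.Tensor:
    """Hypothetical helper: rotary-embedding angles with linear position compression.

    compress=2.0 squeezes positions 0..8191 into the 0..4095 range the base model
    was trained on, which is the idea behind SuperHOT-style 8K context.
    """
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(max_positions).float() / compress  # the scaling step
    return torch.outer(positions, inv_freq)  # shape: (max_positions, head_dim // 2)

# Example: angles for an 8K context with a 2x compression factor
angles = rope_angles(head_dim=128, max_positions=8192, compress=2.0)
print(angles.shape)  # torch.Size([8192, 64])
```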

Not running here (Windows, oobabooga).
I tried several configs, but it returns only nonsense, almost random words in chat.
I also tested Airoboros 13B from TheBloke, yet no success; same wrong answers, completely messy words in chat.

Any clue?
Thanks
Giovani - Brazil

Can anyone port this to work with faraday.dev or koboldcpp?
