Download corrupted/Not working
Could be an issue on my end, but I get an error that points towards a corrupted model, and the downloaded file size is only 6.5 GB instead of 7.01 GB.
Can anybody confirm this?
Part of the error in GPTQ-for-LLaMa is:
size mismatch for model.layers.29.self_attn.o_proj.scales: copying a param with shape torch.Size([1, 5120]) from checkpoint, the shape in current model is torch.Size([40, 5120]).
size mismatch for model.layers.29.self_attn.q_proj.qzeros: copying a param with shape torch.Size([1, 640]) from checkpoint, the shape in current model is torch.Size([40, 640]).
size mismatch for model.layers.29.self_attn.q_proj.scales: copying a param with shape torch.Size([1, 5120]) from checkpoint, the shape in current model is torch.Size([40, 5120]).
size mismatch for model.layers.29.self_attn.v_proj.qzeros: copying a param with shape torch.Size([1, 640]) from checkpoint, the shape in current model is torch.Size([40, 640]).
size mismatch for model.layers.29.self_attn.v_proj.scales: copying a param with shape torch.Size([1, 5120]) from checkpoint, the shape in current model is torch.Size([40, 5120]).
size mismatch for model.layers.29.mlp.down_proj.qzeros: copying a param with shape torch.Size([1, 640]) from
I got that before on a corrupted download.
I tried downloading multiple times, both with ooba and manually in the browser.
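One way to rule out a truncated or corrupted download is to hash the local checkpoint and compare it with the SHA256 that Hugging Face shows on the repo's file listing. A minimal sketch (the filename here is just an assumption -- use whatever the checkpoint file in this repo is actually called):

```python
import hashlib

# Filename is illustrative -- substitute the actual checkpoint file from the repo.
path = "Airoboros-13B-SuperHOT-8K-GPTQ.safetensors"

sha256 = hashlib.sha256()
with open(path, "rb") as f:
    # Hash in 1 MiB chunks so the whole ~7 GB file isn't loaded into memory.
    for chunk in iter(lambda: f.read(1 << 20), b""):
        sha256.update(chunk)

print(sha256.hexdigest())  # compare with the hash shown on the Hugging Face file page
```

If the hashes don't match, the download itself is the problem rather than the loader.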
Try downloading again; I've downloaded it multiple times and it works fine. I use git lfs clone to download, so
git lfs clone https://huggingface.co/flashvenom/Airoboros-13B-SuperHOT-8K-GPTQ
should work.
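If git lfs keeps giving you a short file, the huggingface_hub client is another way to pull the repo down; a rough sketch, assuming you have it installed (pip install huggingface_hub):

```python
from huggingface_hub import snapshot_download

# Downloads every file in the repo into the local HF cache and returns the folder path.
local_dir = snapshot_download(repo_id="flashvenom/Airoboros-13B-SuperHOT-8K-GPTQ")
print(local_dir)
```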
Ah, I've only tried it on exLLaMa. I wonder what happened; I will re-run the quantization when I get some time.
Thanks, much appreciated!
For some reason exllama is slower on my system. Could be because of my two Pascal cards.
What is your setup? It's probably slow because you are using AutoGPTQ; try exllama instead. @Sergei6000 the new model is uploaded. I don't know if it will change anything, but give it a try -- if it doesn't help, then I'm unsure what's going on; it's probably incompatible with llama.cpp in its current form.
@flashvenom It's not supposed to be that slow; on AutoGPTQ I usually get 7-8 tokens/s with 13b models. But like I said, your model is too slow because it has "use_cache": false in the config.json, and I can't set it to true -- I got errors when I tried.
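For anyone following along, the usual way to flip that flag is to edit config.json directly or to go through transformers. A minimal sketch, assuming a local copy of the model (the path is illustrative); as said above, with this particular model it still threw errors for me afterwards:

```python
from transformers import AutoConfig

# Path is illustrative -- point it at the local model folder.
model_dir = "models/Airoboros-13B-SuperHOT-8K-GPTQ"

config = AutoConfig.from_pretrained(model_dir)
config.use_cache = True             # re-enable the KV cache used during generation
config.save_pretrained(model_dir)   # rewrites config.json in place
```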
Yeah, I should be using exllama, but it doesn't load models on Windows and I don't know why :(
Ah, I see. Well, I just quantized the model to use with exLLama, but I don't know the specifics of why it is slow with AutoGPTQ. I can look into it when I get some time. For exllama not loading on Windows, try WSL2 if you can.
I didn't know you could quantize the model to fit the exllama paradigm. And yeah, it works on WSL2; it's just that loading is much slower that way. I hope it'll get fixed in the future.
Btw, I just tried your model (through WSL2 of course lol) and I'm a big fan. Airoboros was really good at following orders and writing eloquent prose, but it was a bit rigid and never brought anything new to the table; adding SuperHOT fixed that issue. This is probably the best 13b model we have right there :D
@flashvenom
Unfortunately it didn't help.
It's probably because I use the GPTQ-for-LLaMa from ooba.
https://github.com/oobabooga/text-generation-webui/issues/1661
I tried updating to https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/old-cuda.
I'm able to load the model with that, but I get a RuntimeError: expected scalar type Float but found Half when generating.
I couldn't really get Triton to work, so I'm out of luck :(
@TheYuriLover
Definitely the best 13B I have seen yet.
Sometimes it really likes to go full "gpt" in its writing, though. That's most likely from Airoboros.
I haven't seen a 13B give long, coherent text outputs like this before. It's really good.
Why don't you try exllama? @Sergei6000
Exllama works, but it's a bit funky.
It seems like processing the prompt takes very long, but the actual new output is fast.
In TavernAI with a long prompt it gets really slow -- almost 50 seconds for a reply, which is close to 30B-with-CPU-offload levels for me.
But it's fine for testing. Thanks for trying the requantization for me; it might be my older Pascal cards.
It looks like the patch is made for Hugging Face transformers. Does it actually affect ExLlama in any way?
exLLama has a built-in option to do the scaling; it's called -cpe.
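For reference, the same knob is exposed in exllama's Python API as compress_pos_emb on the model config. A sketch under that assumption (paths and filenames are illustrative, and the imports follow the upstream repo's scripts, which import straight from model.py); SuperHOT 8K is normally run with a factor of 4 (8192 / 2048):

```python
# Assumes the upstream exllama repo layout; adjust the imports if your setup differs.
from model import ExLlama, ExLlamaCache, ExLlamaConfig

model_dir = "models/Airoboros-13B-SuperHOT-8K-GPTQ"       # illustrative path

config = ExLlamaConfig(f"{model_dir}/config.json")
config.model_path = f"{model_dir}/model.safetensors"       # illustrative filename
config.max_seq_len = 8192          # extended context length
config.compress_pos_emb = 4.0      # what -cpe sets: scale positions down by 4x

model = ExLlama(config)
cache = ExLlamaCache(model)
```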
Not running here (Windows, oobabooga).
I tried several configs, but it returns only nonsense -- almost random words in chat.
I also tested Airoboros 13b from TheBloke, but had no success either: the same wrong answers, completely messy words in chat.
Any clue?
Thanks
Giovani - Brazil
Can anyone port this to work with faraday.dev or koboldcpp?