Beware! Requires new cuda.

#5
by autobots - opened

Would not load in 0cc4m/ooba/etc.

New cuda is slower :(

Which new cuda? The oobabooga GPTQ cuda branch?

That's the one that won't work.

This one: https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/cuda will load it and be slow.

Yea, I'm looking for another 65B since I don't want to download 127GB of the fp32 model to convert myself. Will see if the maderix one works. Unfortunately none of them have a merged alpaca lora.

Did you guys try alpaca-lora-65B-GPTQ-4bit-128g.no-act-order.safetensors? That was the file I made for the old GPTQ, i.e. ooba.

Yea.. it did not work. That is what I downloaded. If you made it with the current cuda branch...

What exactly is the issue? The no-act-order models I make normally work in ooba GPTQ.

I will do some testing tomorrow.

It doesn't load and gives me a state dict error until I use the newest cuda branch.

Ah, you must be using CPU offload. Yes, I've seen that problem with pre_layer specifically. I will look into it.

Was able to run the model on 2 GPUs, 24GB each, by using --gpu-memory 17 17.
Works well until the context is about 1.1K tokens, then it runs out of memory.
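For anyone curious what --gpu-memory 17 17 amounts to under the hood: it's essentially a per-device memory cap that accelerate uses to plan the layer split. A rough sketch below, assuming the repo id is just illustrative; a GPTQ checkpoint still needs a GPTQ-aware loader on top of the resulting device map.

```python
# Sketch: express per-GPU caps like --gpu-memory 17 17 as an accelerate max_memory map.
from accelerate import init_empty_weights, infer_auto_device_map
from transformers import AutoConfig, AutoModelForCausalLM

model_path = "TheBloke/alpaca-lora-65B-GPTQ-4bit"   # illustrative repo id

config = AutoConfig.from_pretrained(model_path)
with init_empty_weights():
    empty_model = AutoModelForCausalLM.from_config(config)   # no weights allocated

# Cap each 24GB card at ~17GiB so activations / KV cache have headroom.
device_map = infer_auto_device_map(
    empty_model,
    max_memory={0: "17GiB", 1: "17GiB"},
    no_split_module_classes=["LlamaDecoderLayer"],
)
print(device_map)
```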

Nope.. no offload. P40/3090 :) I'll try it with AutoGPTQ and see if I get better perf now that it's fixed.

Are you splitting it across GPUs though? Maybe that causes the same issue as CPU offload, i.e. not all on the same device.

Yeah, as you're into AutoGPTQ now, just try that instead and let me know.
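For reference, a minimal AutoGPTQ loading sketch; the repo id and prompt are illustrative, and the basename matches the no-act-order 128g file mentioned above:

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_dir = "TheBloke/alpaca-lora-65B-GPTQ-4bit"            # illustrative repo id
basename = "alpaca-lora-65B-GPTQ-4bit-128g.no-act-order"    # file name minus .safetensors

tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(
    model_dir,
    model_basename=basename,
    use_safetensors=True,
    use_triton=False,        # CUDA kernels; the P40 can't take the Triton path
    device="cuda:0",
)

prompt = "### Instruction:\nSay hello.\n\n### Response:\n"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()
output = model.generate(inputs=input_ids, max_new_tokens=32)
print(tokenizer.decode(output[0]))
```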

I am splitting it.. I got it running now. I wonder if it was old instances of gptq_llama being installed. I think I can only do half the context and get about 1 it/s, slightly over if I just do instruct.

I'd really like to try a 1024-group version to see if it would run the full context, but you only have that for Triton. The 3090 can use Triton but the P40 cannot. AutoGPTQ loads but can only do very small contexts because it loads lopsided across the two cards.
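For what it's worth, producing a 1024-group, no-act-order quant with AutoGPTQ would look roughly like the sketch below; the base-model path, output dir, and single calibration example are placeholders, not the actual recipe used for the files in this repo:

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

base_model = "path/to/llama-65b-hf"        # placeholder fp16 base model
quant_config = BaseQuantizeConfig(
    bits=4,
    group_size=1024,    # larger groups = less VRAM overhead, slightly worse quality
    desc_act=False,     # "no-act-order", so the old CUDA kernels can run it
)

tokenizer = AutoTokenizer.from_pretrained(base_model, use_fast=True)
model = AutoGPTQForCausalLM.from_pretrained(base_model, quant_config)

# A real quantization run needs a proper calibration set; one sentence is a placeholder.
examples = [tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt")]
model.quantize(examples)
model.save_quantized("llama-65b-4bit-1024g", use_safetensors=True)
```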
