Will this work with the Local LLMs One-Click UI Runpod?

Opened by nichedreams

I see this model has no group size, and every GPTQ model without a group size gives me an error on Runpod:

NO_GROUP: tl.constexpr, BLOCK_SIZE_M: tl.constexpr, BLOCK_SIZE_N: tl.constexpr, BLOCK_SIZE_K: tl.constexpr, GROUP_SIZE_M: tl.constexpr):
    """
    Compute the matrix multiplication C = A x B.
    A is of shape (M, K) float16
    B is of shape (K//8, N) int32
    C is of shape (M, N) float16
    scales is of shape (G, N) float16
    zeros is of shape (G, N) float16
    g_ptr is of shape (K) int32
    """
    infearure_per_bits = 32 // bits

Or maybe there's a setting I'm missing, but I'm not proficient in Docker and/or Linux.

I'm going to bed now, but tomorrow I'll make another Runpod template that will work with it.

That is great news, thank you. Quick side question about your old template: when I click "Apply and restart the interface" in the webui after enabling extensions or changing the webui mode (from notebook to chat, for example), it always crashes. Is that an inherent limitation of the Docker template setup, or am I not doing something correctly?

Hi there, quick side question: where can I find these Runpod templates? I would also be very interested. I'm currently running my own, but on most hosts I get GPU enumeration errors, since there's no way to run a Docker container on Runpod with the --gpus all flag set.

Also, is the Dockerfile for your Runpod template image public? I've been trying to find that one but had no luck. I feel like I'm missing one little piece to be able to deploy my container reliably.

We can probably help you over on the OpenAccess AI Collective Discord: https://discord.gg/Y24CzatG

OK, I have updated the one-click template to use the older ooba CUDA GPTQ-for-LLaMA fork, for full compatibility with this and all the other models I've released recently.

I have tested it with WizardLM 30B GPTQ and it works fine on a 24GB GPU, e.g. a 3090.
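
For anyone loading these models outside the template, here is a minimal sketch of the kind of launch flags involved, assuming text-generation-webui's GPTQ options from that era; the model directory name is just an example and the exact flags may differ in your version:

```
# Hedged example: load a 4-bit, no-group-size GPTQ model with the CUDA GPTQ-for-LLaMA loader.
# --groupsize is left at its default (none) because this model was quantised without one.
# The model directory name is illustrative.
python server.py --model TheBloke_WizardLM-30B-GPTQ --wbits 4 --model_type llama
```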

GGML Support

I've also added support for CUDA-accelerated GGML models. Download any GGML model and you can apply parameters like these to offload layers to the GPU:

[Screenshot: GGML GPU offload parameters in the webui]
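
As a rough command-line equivalent of those settings - a sketch only, assuming text-generation-webui's llama.cpp options, with the filename and numbers purely illustrative:

```
# Hedged example: start the webui with a GGML model and offload part of it to the GPU.
# --n-gpu-layers controls how many layers go into VRAM; --threads sets CPU threads for the rest.
python server.py --model wizardlm-30b.ggmlv3.q4_0.bin --n-gpu-layers 50 --threads 8
```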

When to use GGML vs GPTQ?

Generally speaking, if you have enough VRAM to load the model fully (including context), I would use GPTQ. In my testing it's faster than GGML in that scenario.

But if you want to load a model you don't have enough VRAM for - e.g. a 65B model on a 24GB card, or a 30B model on a 16GB card - then GGML will be much faster. For example, on a 16GB GPU you can load WizardLM 30B GGML with 50 layers offloaded (which uses around 15.5GB VRAM) and get 4-5 tokens/s, compared to ~1 token/s for GPTQ. GPTQ/PyTorch are really slow in scenarios where the whole model can't be loaded into VRAM.

@nichedreams I haven't managed to fix the issue where you can't reload the UI. I'm not quite sure why it happens. I will keep investigating.

@jules241 you can find the Runpod template here: https://runpod.io/gsc?template=qk29nkmbfr&ref=eexqfacd It's dead easy to use. Just click that link and you'll be at the normal Runpod screen for selecting a pod, with that template selected. Read the README for further instructions on how to easily download and use GPTQ models. You can also find it by browsing the Templates section on Runpod, which would allow you to copy it if you want to modify it (e.g. to change the disk allocation defaults).
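
For anyone skimming, one common way to fetch a GPTQ model once the pod is running is text-generation-webui's bundled download helper - this is a sketch that assumes the template ships a standard webui checkout, and the repo name is just an example:

```
# Hedged example: download a Hugging Face model repo into the webui's models/ directory.
# Run from inside the text-generation-webui directory; the repo name is illustrative.
python download-model.py TheBloke/WizardLM-30B-GPTQ
```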

Perfect! Last question: would you know how I can pass command-line arguments when starting the pod (to avoid having to change settings and reload the interface)?
