
text generation webui / Error in GPTQ_loader.py

#3
by dspyrhsu - opened

For everyone else who is desperately searching for a usable text generation webui (especially the one TheBloke seems to be talking about in his model card), this is where to find it: https://github.com/oobabooga/text-generation-webui. After following the installation instructions there, I had an interface with the components mentioned in the model card, so I assume this is the one.

Now to the problem: when trying to load the model, I get the following error in modules/GPTQ_loader.py:

line 17, in <module>
    import llama_inference_offload
ModuleNotFoundError: No module named 'llama_inference_offload'
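From a quick look at the file, the import seems to come from a GPTQ-for-LLaMa checkout that the webui expects under its repositories/ folder, which my manual install apparently never set up. Roughly, the top of modules/GPTQ_loader.py does something like this (a sketch, not the exact code, and details vary by webui version):

```python
# Sketch of what modules/GPTQ_loader.py does around the failing import:
# it puts a local GPTQ-for-LLaMa checkout on sys.path and then imports
# llama_inference_offload from that checkout.
import sys
from pathlib import Path

sys.path.insert(0, str(Path("repositories/GPTQ-for-LLaMa")))

import llama_inference_offload  # fails if repositories/GPTQ-for-LLaMa was never cloned
```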

How to proceed? Any help would be appreciated!

@TheBloke maybe it would make sense to mention that directly in your model card. Searching for the term gives many results (especially when combined with "huggingface") which are not helpful at all, at least in my case.

I have clarified the README thus:

[screenshot of the updated README section]

Please use the text-generation-webui one-click installer if you haven't already, and then set Loader to ExLlama (I thought that was the default, actually). It should work out of the box with the one-click installer if you follow the instructions in the README.

Thanks for the quick reply. I will have to try this again, I think, since I used the conda (or rather mamba) install on an older machine. But now, at least, I get a CUDA OOM message, which is better, I guess ... I will try this on another machine with an RTX 2070 at least and see what happens there ...

Does that only have 6GB VRAM? If so, you're going to struggle. You need 10GB minimum to load a 13B GPTQ with ExLlama. You can use text-generation-webui's pre_layer to offload some to RAM but it will be very slow.
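Rough arithmetic behind that 10GB figure, in case it helps (back-of-the-envelope only):

```python
# Back-of-the-envelope VRAM estimate for a 13B model quantized to 4 bits.
# Weights alone come to roughly 6.5 GB; quantization metadata, the KV cache,
# activations and CUDA overhead push real usage up towards the ~10 GB mark.
params = 13e9                      # parameter count of a 13B model
weight_gb = params * 4 / 8 / 1e9   # 4 bits per weight -> bytes -> GB
print(f"weights alone: ~{weight_gb:.1f} GB (plus KV cache and overhead)")
```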

Using a GGML model might be the better option for you, as that performs much better when split between GPU and system RAM.
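If you go the GGML route outside the webui, the partial offload is just one parameter. A minimal sketch with llama-cpp-python (the file name is a placeholder, and you need a build with GPU support for the offload to do anything):

```python
# Minimal sketch of running a GGML model with some layers offloaded to the GPU
# via llama-cpp-python; the remaining layers stay in system RAM.
from llama_cpp import Llama

llm = Llama(
    model_path="models/your-model.ggmlv3.q4_K_M.bin",  # placeholder path
    n_gpu_layers=24,  # how many transformer layers to put in VRAM; lower it if you OOM
    n_ctx=2048,       # context length
)

out = llm("Hello! Briefly introduce yourself.", max_tokens=64)
print(out["choices"][0]["text"])
```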

Actually, it has 8 GB VRAM AFAIK (will check in a minute), but even that seems too little then. I have access to other machines, just not here at home ... Thanks for the hints, anyway!

If you just want to get started quickly, get a 7B model (like Orca Mini v2 7B GPTQ). That will fit in 8GB, and with ExLlama it should actually fit in 6GB too, now that I think of it.
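And if you want to sanity-check a GPTQ model outside the webui first, something like this should work with the auto-gptq package (treat it as a sketch: the repo name is just the example above, the prompt template should come from the model card, and arguments can differ between auto-gptq versions):

```python
# Sketch: load a 7B GPTQ model directly with auto-gptq and generate a reply.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

repo = "TheBloke/orca_mini_v2_7B-GPTQ"  # example repo, per the suggestion above
tokenizer = AutoTokenizer.from_pretrained(repo, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(repo, device="cuda:0", use_safetensors=True)

prompt = "What is GPTQ?"  # check the model card for the proper prompt template
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```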

Great, yes, I am just getting started and would like to tinker around a bit. Found your model card for it (my hint about linking to the text-generation-webui also applies there ...). Will give this a try!

Just wanted to let you know that it works like a charm (and the RTX 2070 has 8GB of VRAM, of which 6.2 are used when running the model). Asking it "How do I use HuggingFace's textgeneration web ui?", it comes up with a sensible-looking answer but hallucinates the URL "https://text-generation.huggingface.co/". Still, this looks very promising! Thanks for your work and for your immediate help! One thing, though: maybe you could also include the hint about having to use the ExLlama loader in your instructions? I would never have known had you not told me above.
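For anyone wanting to check their own usage, a quick way to read it from Python while the model is loaded (a GPU monitor like nvidia-smi gives roughly the same numbers):

```python
# Report how much VRAM is currently in use on the active CUDA device.
import torch

free, total = torch.cuda.mem_get_info()
print(f"used: {(total - free) / 1024**3:.1f} GiB of {total / 1024**3:.1f} GiB")
```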
