ModuleNotFoundError: No module named 'llama_inference_offload' when running in oobabooga

#7
by pelatho - opened

I'm trying to run this in oobabooga (via Runpod). I entered the model name, set wbits to 4, saved the settings, and tried to reload the model, but I get this error:

Traceback (most recent call last):
  File "/workspace/text-generation-webui/server.py", line 100, in load_model_wrapper
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "/workspace/text-generation-webui/modules/models.py", line 125, in load_model
    from modules.GPTQ_loader import load_quantized
  File "/workspace/text-generation-webui/modules/GPTQ_loader.py", line 14, in <module>
    import llama_inference_offload
ModuleNotFoundError: No module named 'llama_inference_offload'
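
A quick way to confirm this kind of failure before touching the UI is to ask Python whether it can locate the module at all. The snippet below is only a minimal sketch, not part of text-generation-webui: the helper name `is_importable` is made up for illustration, and it assumes you run it with the same Python environment the web UI uses.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if Python can locate a top-level module on the current sys.path."""
    # find_spec returns None when no finder can locate the module,
    # which is the same situation that raises ModuleNotFoundError on import.
    return importlib.util.find_spec(module_name) is not None


# Module name taken from the traceback above.
name = "llama_inference_offload"
if is_importable(name):
    print(f"{name} is importable")
else:
    print(f"{name} was not found; importing it will raise ModuleNotFoundError")
```

If this reports the module as missing, the environment has no GPTQ support installed, which matches the traceback.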

Yeah, the default Runpod text-gen-ui template doesn't support GPTQ.

I have one that does! https://runpod.io/gsc?template=qk29nkmbfr&ref=eexqfacd

Please read its README

@TheBloke Ah, I see. Thanks!

@TheBloke Got it working! Thanks for making this model! It's really quite versatile! Can't WAIT for the 65B version!

@TheBloke Sorry if this is asking a lot but the pod template seems to break when I stop the pod and start it again. It won't even open the pod ssh terminal. It seems to be crashing. Am I missing something?

Hmm that's not meant to happen. I recently updated the template so you could stop and start the pod and your models would still be there.

Did you set any template overrides at all?

What GPU type are you using?

@TheBloke I was using the RTX 4090 with 24 GB VRAM, 83 GB RAM, and 16 vCPU. No, I did not set any template overrides. I start it, follow your README to download the model and load it, and it works. Then if I stop the pod and start it again, the server won't start, and not even SSH works. Just clicking 'start SSH terminal' seems to crash the pod.

Oh! I think I know what that is. It is starting. It's just taking a long time.

So when text-gen starts, it will auto-load a model. And when the pod first launches, the model is not cached, so it loads much more slowly. I think that might be affecting SSH as well.

Can you launch it again and keep trying to access the UI for at least 3-4 minutes? Let me know how it goes. Then, if you can access the UI, try SSH again.
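
Rather than refreshing the page by hand, the "keep trying for a few minutes" step can be scripted. This is just a rough sketch under assumptions: `wait_for_ui` and `http_probe` are hypothetical helpers, the 240-second timeout mirrors the 3-4 minutes mentioned above, and the URL in the usage note is a placeholder you would swap for your pod's actual UI address.

```python
import time
import urllib.error
import urllib.request


def http_probe(url: str) -> bool:
    """Return True if the URL answers with an HTTP status below 500."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status < 500
    except urllib.error.HTTPError as err:
        # The server responded, but with an error status code.
        return err.code < 500
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return False


def wait_for_ui(url, timeout_s=240.0, interval_s=10.0,
                probe=http_probe, sleep=time.sleep):
    """Poll `url` until `probe` succeeds or `timeout_s` seconds have passed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

For example, `wait_for_ui("http://localhost:7860")` would return True once the UI starts answering, or False after roughly four minutes; the address here is an assumption, not something taken from the template.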

I'll then see if I can make that experience better, e.g. by disabling model auto-load.

@TheBloke Hmmm. I think I waited for 10 minutes a few times? I'll try one more time.

Oh OK. I'm testing it now

Question: I'm supposed to enable AutoGPTQ, right? (If I try loading the model without that, I get an error.)