Unfortunately I can't run it on text-generation-webui

#1
by Suoriks - opened

Tell me, what am I doing wrong? I did everything according to the instructions:

  1. Added to update_windows.bat:
    pip install autogptq
    pip install einops
  2. Ran it
  3. Added 'trust_remote_code': shared.args.trust_remote_code, in AutoGPTQ_loader.py (see the sketch below)
  4. Added --trust-remote-code (instead of --trust_remote_code) and --autogptq in webui.py

But I get an error:

Traceback (most recent call last):
  File "D:\LLaMA\oobabooga_windows\text-generation-webui\server.py", line 71, in load_model_wrapper
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "D:\LLaMA\oobabooga_windows\text-generation-webui\modules\models.py", line 95, in load_model
    output = load_func(model_name)
  File "D:\LLaMA\oobabooga_windows\text-generation-webui\modules\models.py", line 297, in AutoGPTQ_loader
    return modules.AutoGPTQ_loader.load_quantized(model_name)
  File "D:\LLaMA\oobabooga_windows\text-generation-webui\modules\AutoGPTQ_loader.py", line 43, in load_quantized
    model = AutoGPTQForCausalLM.from_quantized(path_to_model, **params)
  File "D:\LLaMA\oobabooga_windows\installer_files\env\lib\site-packages\auto_gptq\modeling\auto.py", line 62, in from_quantized
    model_type = check_and_get_model_type(save_dir)
  File "D:\LLaMA\oobabooga_windows\installer_files\env\lib\site-packages\auto_gptq\modeling\_utils.py", line 124, in check_and_get_model_type
    raise TypeError(f"{config.model_type} isn't supported yet.")
TypeError: RefinedWeb isn't supported yet.

Parameters on the GPTQ tab:
wbits 4, groupsize 64, model_type llama
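
For reference, step 3 boils down to adding the key to the params dict that modules/AutoGPTQ_loader.py passes into from_quantized (line 43 in the traceback above). A minimal sketch, assuming the other keys are whatever your copy of the webui already sets:

params = {
    # ... existing keys set by the webui ...
    'trust_remote_code': shared.args.trust_remote_code,  # the line added in step 3
}
model = AutoGPTQForCausalLM.from_quantized(path_to_model, **params)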

You need to update AutoGPTQ with:

git clone https://github.com/PanQiWei/AutoGPTQ
cd AutoGPTQ
pip install .  # This step requires the CUDA toolkit to be installed

I will make this clearer in the README!
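
A quick sanity check afterwards is to confirm that Python is actually picking up the freshly built package rather than an older wheel (nothing AutoGPTQ-specific, just standard introspection):

import auto_gptq

print(auto_gptq.__file__)                                       # which install is being imported
print(getattr(auto_gptq, "__version__", "no __version__ set"))  # version string, if the package exposes one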

Three more things to note:

  1. The GPTQ parameters don't have any effect for AutoGPTQ models.
  2. This 40B model requires more than 24GB of VRAM, so you will have to use CPU offloading (see the sketch after this list).
  3. It's slow as hell at the moment! Even with enough VRAM (e.g. on a 48GB card), I was getting less than 1 token/s.
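
If you do go the CPU offloading route, from_quantized can take a max_memory mapping so that whatever doesn't fit on the GPU stays in system RAM. This is only a rough sketch, assuming an AutoGPTQ version that supports max_memory; the path and memory limits are illustrative:

from auto_gptq import AutoGPTQForCausalLM

# illustrative budget for a 24GB card: leave headroom for activations, spill the rest to CPU RAM
max_memory = {0: "20GiB", "cpu": "60GiB"}

model = AutoGPTQForCausalLM.from_quantized(
    "/path/to/falcon-40b-instruct-GPTQ",  # local download of the repo
    use_safetensors=True,
    trust_remote_code=True,
    max_memory=max_memory,                # layers that exceed the GPU budget are offloaded to CPU
)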

It's working! Thanks!

Output generated in 24.27 seconds (0.45 tokens/s, 11 tokens, context 48, seed 466795515)
Output generated in 43.11 seconds (0.49 tokens/s, 21 tokens, context 48, seed 532631384)
Output generated in 40.49 seconds (0.57 tokens/s, 23 tokens, context 41, seed 1349492009)
Output generated in 334.20 seconds (0.33 tokens/s, 109 tokens, context 48, seed 1693397338)

😭

Yup :) It is slow as hell atm. I've flagged it with qwopqwop and PanQiWei of AutoGPTQ, so hopefully they can investigate whether it's anything on the AutoGPTQ side.

But my feeling is that it may have as much to do with the custom code for loading the Falcon model, or with some combination of that code and AutoGPTQ.

How do I enable CPU offloading? Is there any possibility of running this model on a 4090 with 24GB?

How do I run it in Google Colab? I encounter the error below; could you please help resolve the issue? I am running on an A100 GPU instance.
/opt/conda/lib/python3.7/site-packages/auto_gptq/modeling/_base.py:182
    if (pos_ids := kwargs.get("position_ids", None)) is not None:
SyntaxError: invalid syntax

Code used

! BUILD_CUDA_EXT=0 pip install auto-gptq
import torch
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

Download the model from HF and store it locally, then reference its location here:

quantized_model_dir = "/TheBloke/falcon-40b-instruct-GPTQ"

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir, use_fast=False)

model = AutoGPTQForCausalLM.from_quantized(
    quantized_model_dir,
    device="cuda:0",
    use_triton=False,
    use_safetensors=True,
    torch_dtype=torch.float32,
    trust_remote_code=True,
)
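
One note on the SyntaxError itself: the := (walrus) operator in auto_gptq's _base.py requires Python 3.8 or newer, and the traceback path shows a Python 3.7 environment, so the import fails before anything GPTQ-related runs. A tiny, hypothetical guard you could run first in the notebook:

import sys

# auto_gptq uses the := (walrus) operator, which was introduced in Python 3.8
if sys.version_info < (3, 8):
    raise RuntimeError(f"Python {sys.version.split()[0]} is too old for auto_gptq; use 3.8+")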

The model fully loads on my 3090 in WSL2 with text-gen-webui, using AutoGPTQ.

@Plaban81 you appear to be installing AutoGPTQ without the CUDA extension compiled. That will kill performance.

Please try:

pip uninstall -y auto-gptq
pip install auto-gptq --no-cache-dir

And report back. AutoGPTQ just released version 0.2.2, which fixes some installation issues.

Compiling AutoGPTQ 0.3.0.dev0 from source still does not mark the module as +cuXXX.

Yeah he still hasn't fixed that. But it does compile the module; or should.

There's a simple test as to whether you have the CUDA extension installed:

$ python -c 'import torch ; import autogptq_cuda'
$

If that returns no output, it should be OK.
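
The same check, wrapped so it prints an explicit result either way (just a convenience around the one-liner above):

try:
    import autogptq_cuda  # only importable when the CUDA extension was compiled
    print("autogptq_cuda imported OK - CUDA kernels are available")
except ImportError:
    print("autogptq_cuda not found - reinstall auto-gptq with the CUDA toolkit available")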
