Unfortunately I can't run it on text-generation-webui

#1
by Suoriks - opened

Tell me, what am I doing wrong? I did everything according to the instructions:

  1. Added to update_windows.bat:
    pip install autogptq
    pip install einops
  2. Ran it
  3. Added 'trust_remote_code': shared.args.trust_remote_code, in AutoGPTQ_loader.py (see the sketch below)
  4. Added --trust-remote-code (instead of --trust_remote_code) and --autogptq in webui.py

But I get an error:

Traceback (most recent call last):
  File "D:\LLaMA\oobabooga_windows\text-generation-webui\server.py", line 71, in load_model_wrapper
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "D:\LLaMA\oobabooga_windows\text-generation-webui\modules\models.py", line 95, in load_model
    output = load_func(model_name)
  File "D:\LLaMA\oobabooga_windows\text-generation-webui\modules\models.py", line 297, in AutoGPTQ_loader
    return modules.AutoGPTQ_loader.load_quantized(model_name)
  File "D:\LLaMA\oobabooga_windows\text-generation-webui\modules\AutoGPTQ_loader.py", line 43, in load_quantized
    model = AutoGPTQForCausalLM.from_quantized(path_to_model, **params)
  File "D:\LLaMA\oobabooga_windows\installer_files\env\lib\site-packages\auto_gptq\modeling\auto.py", line 62, in from_quantized
    model_type = check_and_get_model_type(save_dir)
  File "D:\LLaMA\oobabooga_windows\installer_files\env\lib\site-packages\auto_gptq\modeling\_utils.py", line 124, in check_and_get_model_type
    raise TypeError(f"{config.model_type} isn't supported yet.")
TypeError: RefinedWeb isn't supported yet.

Parameters on the GPTQ tab:
wbits 4, groupsize 64, model_type llama
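
For reference, step 3 boils down to adding the key to the params dict that modules/AutoGPTQ_loader.py passes into from_quantized (line 43 in the traceback above). A minimal sketch, assuming the other keys are whatever your copy of the webui already sets:

params = {
    # ... existing keys set by the webui ...
    'trust_remote_code': shared.args.trust_remote_code,  # the line added in step 3
}
model = AutoGPTQForCausalLM.from_quantized(path_to_model, **params)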

You need to update AutoGPTQ with:

git clone https://github.com/PanQiWei/AutoGPTQ
cd AutoGPTQ
pip install .  # This step requires the CUDA toolkit to be installed

I will make this clearer in the README!
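
A quick sanity check afterwards is to confirm that Python is actually picking up the freshly built package rather than an older wheel (nothing AutoGPTQ-specific, just standard introspection):

import auto_gptq

print(auto_gptq.__file__)                                       # which install is being imported
print(getattr(auto_gptq, "__version__", "no __version__ set"))  # version string, if the package exposes one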

Three more things to note:

  1. The GPTQ parameters don't have any effect for AutoGPTQ models.
  2. This 40B model requires more than 24GB of VRAM, so you will have to use CPU offloading (see the sketch after this list).
  3. It's slow as hell at the moment! Even with enough VRAM (e.g. on a 48GB card), I was getting less than 1 token/s.
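
If you do go the CPU offloading route, from_quantized can take a max_memory mapping so that whatever doesn't fit on the GPU stays in system RAM. This is only a rough sketch, assuming an AutoGPTQ version that supports max_memory; the path and memory limits are illustrative:

from auto_gptq import AutoGPTQForCausalLM

# illustrative budget for a 24GB card: leave headroom for activations, spill the rest to CPU RAM
max_memory = {0: "20GiB", "cpu": "60GiB"}

model = AutoGPTQForCausalLM.from_quantized(
    "/path/to/falcon-40b-instruct-GPTQ",  # local download of the repo
    use_safetensors=True,
    trust_remote_code=True,
    max_memory=max_memory,                # layers that exceed the GPU budget are offloaded to CPU
)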

It's working! Thanks!

Output generated in 24.27 seconds (0.45 tokens/s, 11 tokens, context 48, seed 466795515)
Output generated in 43.11 seconds (0.49 tokens/s, 21 tokens, context 48, seed 532631384)
Output generated in 40.49 seconds (0.57 tokens/s, 23 tokens, context 41, seed 1349492009)
Output generated in 334.20 seconds (0.33 tokens/s, 109 tokens, context 48, seed 1693397338)

😭

Yup :) It is slow as hell atm. I've flagged it with qwopqwop and PanQiWei of AutoGPTQ, so hopefully they can investigate whether it's anything on the AutoGPTQ side.

But my feeling is that it may have as much to do with the custom code for loading the Falcon model, or with some combination of that code and AutoGPTQ.

How do I enable CPU offloading? Is there any possibility of running this model on a 4090 with 24GB?

How do I run it in Google Colab? I encounter the error below; could you please help resolve the issue? I am running on an A100 GPU instance.
/opt/conda/lib/python3.7/site-packages/auto_gptq/modeling/_base.py:182
    if (pos_ids := kwargs.get("position_ids", None)) is not None:
SyntaxError: invalid syntax

Code used

! BUILD_CUDA_EXT=0 pip install auto-gptq
import torch
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

Download the model from HF and store it locally, then reference its location here:

quantized_model_dir = "/TheBloke/falcon-40b-instruct-GPTQ"

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir, use_fast=False)

model = AutoGPTQForCausalLM.from_quantized(
    quantized_model_dir,
    device="cuda:0",
    use_triton=False,
    use_safetensors=True,
    torch_dtype=torch.float32,
    trust_remote_code=True,
)
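
One note on the SyntaxError itself: the := (walrus) operator in auto_gptq's _base.py requires Python 3.8 or newer, and the traceback path shows a Python 3.7 environment, so the import fails before anything GPTQ-related runs. A tiny, hypothetical guard you could run first in the notebook:

import sys

# auto_gptq uses the := (walrus) operator, which was introduced in Python 3.8
if sys.version_info < (3, 8):
    raise RuntimeError(f"Python {sys.version.split()[0]} is too old for auto_gptq; use 3.8+")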

The model fully loads on my 3090 in WSL2 with text-gen-webui, using AutoGPTQ.

@Plaban81 you appear to be installing AutoGPTQ without the CUDA extension compiled. That will kill performance.

Please try:

pip uninstall -y auto-gptq
pip install auto-gptq --no-cache-dir

And report back. AutoGPTQ just released version 0.2.2, which fixes some installation issues.

Compiling AutoGPTQ 0.3.0.dev0 from source still does not mark the module as +cuXXX.

Yeah he still hasn't fixed that. But it does compile the module; or should.

There's a simple test as to whether you have the CUDA extension installed:

$ python -c 'import torch ; import autogptq_cuda'
$

If that returns no output, it should be OK.
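
The same check, wrapped so it prints an explicit result either way (just a convenience around the one-liner above):

try:
    import autogptq_cuda  # only importable when the CUDA extension was compiled
    print("autogptq_cuda imported OK - CUDA kernels are available")
except ImportError:
    print("autogptq_cuda not found - reinstall auto-gptq with the CUDA toolkit available")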
