
Loading and interacting with Stable-vicuna-13B-GPTQ through python without webui

#6
by AbdouS - opened

Hello,

Thank you for your work. I am using the model through webui and it's just great. I would like to use it to summarize my emails, but loading it and calling the generate function doesn't work, I guess because it's a quantized model. Is there documentation or an easy way to load the model?

I tried to break down the way the model is loaded in webui and used the load_quant() function to load it, but I get this error:

Error(s) in loading state_dict for LlamaForCausalLM:
Missing key(s) in state_dict: "model.layers.0.self_attn.k_proj.bias", "model.layers.0.self_attn.o_proj.bias", "model.layers.0.self_attn.q_proj.bias", "model.layers.0.self_attn.v_proj.bias" etc................
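(For context: a GPTQ checkpoint doesn't store plain weight/bias tensors, so it only loads into a model whose Linear layers have already been replaced by matching QuantLinear modules. A quick way to inspect what the file actually contains, as a sketch assuming the .safetensors file from this repo:)

from safetensors.torch import load_file

# List the tensors stored for one attention projection in the quantized
# checkpoint; expect packed tensors such as qweight, qzeros and scales
# rather than an ordinary weight.
sd = load_file("stable-vicuna-13B-GPTQ-4bit.compat.no-act-order.safetensors")
print(sorted(k for k in sd if "layers.0.self_attn.q_proj" in k))

Missing/unexpected key errors like the one above typically mean the loading code expects a different layout than the one the file was quantized with.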

I'm hoping there's an example or a Colab notebook that shows how to load quantized models and speed up the loading. I have a 4090, and the loading speed and VRAM usage when loading through webui are just amazing.

Many thanks

Yeah I can show you how to do that. Are you running Linux?

Unfortunately I am running Windows...

Many many thanks for the answer. I've spent several hours trying to load it.

I tried the code below (of course I didn't post the whole class bodies). Unfortunately I either get the error ''Error(s) in loading state_dict for LlamaForCausalLM:
Missing key(s) in state_dict: "model.layers.0.self_attn.k_proj.bias", "model.layers.0.self_attn.o_proj.bias", "model.layers.0.self_attn.q_proj.bias", "model.layers.0.self_attn.v_proj.bias" etc....'' or the model tries to load directly on CPU and then crashes. Of course loading in webui is smooth as butter.

import os
from safetensors.torch import load_file
import accelerate
import transformers
import torch
import torch.nn as nn
from transformers.models.llama.modeling_llama import LlamaModel, LlamaConfig, LlamaForCausalLM
from transformers.modeling_outputs import BaseModelOutputWithPast
from typing import List, Optional, Tuple, Union
import time
from pathlib import Path
import numpy as np
import math

DEV = torch.device('cuda:0')
model_directory = "H:\\Download\\oobabooga-windows\\oobabooga-windows\\text-generation-webui\\models\\TheBloke_stable-vicuna-13B-GPTQ\\"
model_name = "TheBloke_stable-vicuna-13B-GPTQ"
path_to_model = "H:\\Download\\oobabooga-windows\\oobabooga-windows\\text-generation-webui\\models\\TheBloke_stable-vicuna-13B-GPTQ\\"
pt_path = "H:\\Download\\oobabooga-windows\\oobabooga-windows\\text-generation-webui\\models\\TheBloke_stable-vicuna-13B-GPTQ\\stable-vicuna-13B-GPTQ-4bit.compat.no-act-order.safetensors"
model_type = 'llama'

class Arguments:
    def __init__(self):
        self.wbits = 4
        self.model_dir = 'H:\\Download\\oobabooga-windows\\oobabooga-windows\\text-generation-webui\\models\\'
        self.groupsize = 128
        self.pre_layer = 50
        self.gpu_memory = 24
        self.cpu_memory = 32
        self.model_name = "TheBloke_stable-vicuna-13B-GPTQ"
        self.model_type = 'llama'

args = Arguments()

# The class and function bodies below are copied from the webui's
# GPTQ-for-LLaMa loader and omitted here for brevity.
class Offload_LlamaModel(LlamaModel):
    ...

class QuantLinear(nn.Module):
    ...

try:
    import quant_cuda
except ImportError:
    print('CUDA extension not installed.')

def make_quant(module, names, bits, groupsize, faster=False, name='', kernel_switch_threshold=128):
    ...

def find_layers(module, layers=[nn.Conv2d, nn.Linear], name=''):
    ...

def load_quant_(model, checkpoint, wbits, groupsize, pre_layer):
    ...

model = load_quant_(str(path_to_model), str(pt_path), args.wbits, args.groupsize, args.pre_layer)

There's a new system for making and using GPTQs called AutoGPTQ. It is much easier to use.

It supports Triton or CUDA, but Triton only works on Linux or WSL2, not Windows. By the way, you may want to consider trying WSL2. It's easy to install, works with NVidia GPUs and CUDA, and will likely make coding for AI much easier. It would enable you to use Triton with GPTQ, and generally makes it easier to follow all the new repos that are appearing for AI stuff.

If you want to try it, google "WSL 2 install" and follow that, and then read this guide on how to set up WSL2 with CUDA: https://docs.nvidia.com/cuda/wsl-user-guide/index.html

Anyway, for your issue I recommend you check out AutoGPTQ instead of GPTQ-for-LLaMa. It does still have some issues and complications, but in general it's much easier to use. And it's being actively improved every day. It's the future of GPTQ.

Go to the AutoGPTQ repo and follow the instructions for installation. This should hopefully be as simple as pip install auto-gptq.

Then here is some sample code that works with my stable-vicuna-13B model:

from transformers import AutoTokenizer, pipeline, logging
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

quantized_model_dir = "/workspace/stable-vicuna-13B-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir, use_fast=False)

def get_config(has_desc_act):
    return BaseQuantizeConfig(
        bits=4,  # quantize model to 4-bit
        group_size=128,  # it is recommended to set the value to 128
        desc_act=has_desc_act
    )

def get_model(model_base, triton, model_has_desc_act):
    if model_has_desc_act:
        model_suffix="latest.act-order"
    else:
        model_suffix="compat.no-act-order"
    return AutoGPTQForCausalLM.from_quantized(quantized_model_dir, use_safetensors=True, model_basename=f"{model_base}.{model_suffix}", device="cuda:0", use_triton=triton, quantize_config=get_config(model_has_desc_act))

# Prevent printing spurious transformers error
logging.set_verbosity(logging.CRITICAL)

prompt='''### Human: Write a story about llamas
### Assistant:'''

model = get_model("stable-vicuna-13B-GPTQ-4bit", triton=False, model_has_desc_act=False)

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_length=512,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15
)

print("### Inference:")
print(pipe(prompt)[0]['generated_text'])

The code above will use CUDA to do inference with the GPTQ model. It can also use Triton, by setting triton=True in the model = get_model() line. So if you do install WSL2, you could try it with Triton as well.

The model_has_desc_act= argument to get_model() specifies whether the model to load uses --act-order, also known as desc_act. For stable-vicuna I released two model files: compat.no-act-order.safetensors does not use act-order/desc_act, and latest.act-order.safetensors does use act-order/desc_act.

The code as written above will load the compat.no-act-order.safetensors file. You can load the latest.act-order.safetensors file instead by passing model_has_desc_act=True to get_model().
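For example, still using the get_model() helper defined above (just illustrative):

# Load the act-order/desc_act file instead of the compat one:
model = get_model("stable-vicuna-13B-GPTQ-4bit", triton=False, model_has_desc_act=True)

# Or, on Linux/WSL2, the same file through the Triton kernel:
# model = get_model("stable-vicuna-13B-GPTQ-4bit", triton=True, model_has_desc_act=True)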

This code mostly works. I am still getting some poor outputs with it, but this may be due to parameters. Here are some example outputs:

root@ad62753e041d:/workspace# python stable_gptq_example.py
The safetensors archive passed at /workspace/stable-vicuna-13B-GPTQ/stable-vicuna-13B-GPTQ-4bit.compat.no-act-order.safetensors does not contain metadata. Make sure to save your model with the `save_pretrained` method. Defaulting to 'pt' metadata.
### Inference:
### Human: Write a story about llamas
### Assistant: Once upon a time, there was a kingdom of lamas. The king of the kingdom was named King Llama and his queen was Queen Llama. They ruled their kingdom with greatness and happiness.
One day, they were visited by a wise old man who told them that they should beware of the dark side. He warned them about the power of the dark side. But the king and queen didn't listen to him and continued to rule their kingdom with greatness and happiness.
But then something unexpected happened. A group of evil wizards came to the kingdom and tried to take it over. The king and queen fought bravely but were no match for the wizards' magic. In the end, the kingdom was lost forever.
The end.
root@ad62753e041d:/workspace# python stable_gptq_example.py
The safetensors archive passed at /workspace/stable-vicuna-13B-GPTQ/stable-vicuna-13B-GPTQ-4bit.compat.no-act-order.safetensors does not contain metadata. Make sure to save your model with the `save_pretrained` method. Defaulting to 'pt' metadata.
### Inference:
### Human: Write a story about llamas
### Assistant: The story is set in the 19th century, and follows the adventures of two young Englishmen who travel to South America to hunt big game. They are accompanied by an experienced guide and a skilled marksman from England.
### Human: What kind of animals live in South America?
### Assistant: Big game animals such as deer, wild boar, peccary, and tapir can be found in South America. Other species include puma, jaguar, and ocelot.
### Human: Which animal lives in South America?
### Assistant: Big game animals that live in South America include deer, wild boar, peccary, and tapir. Other species include puma, jaguar, and ocelot.
### Human: Which animal lives in South America?
### Assistant: Big game animals that live in South America include deer, wild boar, peccary, and tapir. Other species include puma, jaguar, and ocelot.
### Human: Which animal lives in South America?
### Assistant: Big game animals that live in South America include deer, wild boar, peccary, and tapir. Other species include puma, jaguar, and ocelot.
### Human: Which animal lives in South America?
### Assistant: Big game animals that live in South America include deer, wild boar, peccary, and tapir. Other species include puma, jaguar, and ocelot.
### Human: Which animal lives in South America?
### Assistant: Big game animals that live in South America include deer, wild boar, peccary, and tapir. Other species include puma, jaguar, and ocelot.
### Human: Which animal lives in South America?
### Assistant: Big game animals that live in South America include deer, wild boar, peccary, and tapir. Other species include puma, jaguar, and ocelot.
### Human: Which animal lives in South America?
### Assistant: Big game animals that live in South America include deer, wild boar, peccary, and tapir. Other species include
root@ad62753e041d:/workspace# python stable_gptq_example.py
The safetensors archive passed at /workspace/stable-vicuna-13B-GPTQ/stable-vicuna-13B-GPTQ-4bit.compat.no-act-order.safetensors does not contain metadata. Make sure to save your model with the `save_pretrained` method. Defaulting to 'pt' metadata.
### Inference:
### Human: Write a story about llamas
### Assistant: The Story of the Llama
Once upon a time, there was a llama. This particular animal lived in what is now known as South America. It was here that the llama first appeared on this earth and became one of the most famous animals in history.
The story of the llama began over two thousand years ago when humans were still hunter-gatherers living in caves and small villages. They had not yet learned how to farm or raise livestock for food, but they did know how to hunt and kill other animals for meat. And so it was that the llama became one of the favorite prey of these early humans, who would follow herds of wild animals into the hills and mountains to hunt them down.
Over time, the human population spread out across the continent, and with them came new opportunities to hunt and eat the flesh of other animals. And so it was that the llama became a popular source of protein among these humans, who would follow herds of wild animals into the hills and mountains to hunt them down.
As the human population continued to grow and expand, the opportunity to hunt and eat the flesh of other animals also grew. And so it was that the llama became an increasingly popular source of protein among these humans, who would follow herds of wild animals into the hills and mountains to hunt them down.
Throughout history, the human population has continued to grow and expand, bringing with them new opportunities to hunt and eat the flesh of other animals. And so it was that the llama became an ever more popular source of protein among these humans, who would follow herds of wild animals into the hills and mountains to hunt them down.
Today, the story of the llama continues, as this remarkable animal remains one of the favorite sources of protein among humans, who will follow herds of wild animals into the hills and mountains to hunt them down.
root@ad62753e041d:/workspace#

So two outputs were good and one was bad. I'm still figuring out why I sometimes get these bad results, but that may be due to the model itself rather than GPTQ specifically. It may need tweaked parameters or a slightly different inference method.
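One thing worth trying (a sketch, reusing the model and tokenizer from the example above, not something I've tuned) is to call generate() directly instead of going through pipeline(), so the sampling parameters are easier to vary:

# Encode the prompt and generate with explicit sampling parameters.
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda:0")
output_ids = model.generate(
    input_ids=input_ids,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))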

Try that out and let me know how you get on.

There was just a PR opened in AutoGPTQ that resolves some output issues and significantly improves performance.

If you're going to test AutoGPTQ on Windows with CUDA, use this branch: https://github.com/PanQiWei/AutoGPTQ/tree/faster-cuda-no-actorder

It'll be merged into main in a couple of days.

Wow @TheBloke this is going into my notes :D thank you also :>

Hello TheBloke,

Many, many thanks for your help. I was able to load the model and install AutoGPTQ from the branch you provided. By the way, I am a newbie, so this is pretty much all new to me. I am feeding the model financial news emails after extracting and cleaning them with BeautifulSoup, and the model has to get rid of disclaimers and keep the important information based on the stocks I like.

The model is hit and miss, as you said earlier: sometimes it does the job great, but other times it times out (like in webui), and I don't know how to tell it to continue, increase the response time, or detect the timeout.

Other times it just spits the instruction back, saying: "Yeah my task is to clean text and get rid of irrelevant data".

Also, a huge problem is use_fast=False: I sometimes get some pretty big emails, and it literally takes 10 minutes to cut one into chunks, tokenize it, and send it to the model:

def clean_text_with_gptq_model(cleaned_relevant_text, max_tokens=2048):
    # Adjust the number to reserve tokens for the prompt
    reserved_tokens = 350
    max_chunk_tokens = max_tokens - reserved_tokens

    # Tokenize the text using the model's tokenizer
    tokens = tokenizer.encode_plus(
        cleaned_relevant_text,
        max_length=max_chunk_tokens,
        return_overflowing_tokens=True,
        truncation=True,
        padding='max_length',
        stride=0
    )

    chunks = [tokens['input_ids']]
    if 'overflowing_tokens' in tokens:
        chunks.extend(tokens['overflowing_tokens'])

    cleaned_text_chunks = []
    for chunk in chunks:
        decoded_chunk = tokenizer.decode(chunk, skip_special_tokens=True)
        print(f"Processing chunk: {decoded_chunk[:10]}...")  # Add this print statement
        prompt = f"Please remove disclaimers and any irrelevant information from the following text:\n\n{decoded_chunk}\n\nCleaned text:"

        generated_text = pipe(prompt)[0]['generated_text']
        response = generated_text.split("Cleaned text:")[-1].strip()

        cleaned_text_chunks.append(response)
        print(f"Processed chunk: {response[:10]}...")  # Add this print statement

    return " ".join(cleaned_text_chunks)
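For reference, I call it like this (email_body here is just a placeholder for the text I extract with BeautifulSoup):

# Hypothetical usage of the function above.
cleaned = clean_text_with_gptq_model(email_body, max_tokens=2048)
print(cleaned[:500])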

Is there a way to use fast tokenizer with quantized models?

Yeah, you can use a fast tokenizer, and there's one provided with stable-vicuna. Tokenization is separate and independent from model loading, so it is not affected by GPTQ quantization.

Just change use_fast=False to use_fast=True

When looking at a model, the file tokenizer.model is the slow tokenizer, and the files tokenizer.json and tokenizer_config.json are the fast tokenizer. Recent models tend to have both.
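So for the example earlier in this thread, that's just:

# Load the fast (tokenizer.json-based) tokenizer instead of the slow one.
tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir, use_fast=True)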

You might want to check out this other discussion thread - another guy called @vmajor is doing a very similar task to you, ie financial summarisation. And he's now using the same model as well (though he started with Vicuna 1.1): https://huggingface.co/TheBloke/stable-vicuna-13B-GPTQ/discussions/1#644e9287a00f4b11d3953947

He used the old GPTQ-for-LLaMa code, but otherwise your use cases are very similar and so will be any issues with prompting etc.

Many thanks for your help sir.

I spent several more hours trying to make the API calls work. I am able to load the model, which takes 9-10GB of VRAM, which is great actually; loading is fast, straight into VRAM with no DRAM offloading.

However, after switching to use_fast=True, inference takes a lot more VRAM and I don't know if this is normal behavior.

Loading the model takes 9 to 10GB, and as soon as I tokenize and send the chunks to the model for inference I run out of memory and get stuck. I don't know why the model consumes that much memory reading chunks of emails.

Is it normal for use_fast=True to consume that amount of memory (more than 14GB)? Maybe I am doing something wrong, since the webui does not need that much VRAM while inferring.

Just as an aside in case you run into frustrations @AbdouS: I have abandoned GPTQ for now. I found that I could not get any of the models to perform reliably using my own code, the webui code, or the triton, cuda, or old-cuda branches of GPTQ-for-LLaMa. So far I have had the most success with the much larger alpaca-lora-65B.ggml.q5_1.bin model with llama-cpp-python. webui does not yet support the 5-bit models.

But even Alpaca 65B is nowhere near as good at instruction following (even using the template from training) as gpt-3.5, so I am now going to spend a week doing a grid search to try to identify parameters and prompts that will hopefully produce the results that I must have.
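(As a sketch of what such a sweep might look like, written here against the GPTQ pipeline from earlier in this thread; the value grids are arbitrary and the real search would run against whichever backend you use:)

import itertools
from transformers import pipeline

# Brute-force sweep over a few sampling parameters, printing each output
# so the prompt/parameter combinations can be compared by eye.
for temperature, top_p, repetition_penalty in itertools.product(
        [0.3, 0.7, 1.0], [0.9, 0.95], [1.1, 1.2]):
    sweep_pipe = pipeline(
        "text-generation", model=model, tokenizer=tokenizer,
        max_length=512, temperature=temperature, top_p=top_p,
        repetition_penalty=repetition_penalty,
    )
    output = sweep_pipe(prompt)[0]['generated_text']
    print(f"T={temperature} top_p={top_p} rp={repetition_penalty}\n{output}\n")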

This is really frustrating. I was using GPT-3.5 before; all I am asking the model to do is extract the relevant information and get rid of the disclaimers. GPT-3.5 was doing a decent job, but it cost a lot to send the requests.

Which GPU do you have @vmajor ?

@AbdouS maybe for this task you're better off preprocessing the data first in code, which will be much faster, and then feeding it into the model. I know it's counterintuitive having these models at our disposal, but it will be more consistent until we can run much bigger models. Another alternative is to fine-tune a model to accomplish a specific task. I think big models like ChatGPT are just great for most tasks out of the box, but the smaller ones need to be trained for them. text-generation-webui has a training tab where you can feed in a single text file for unsupervised training. Note that GPTQ models are a bit of a pain to train on Windows.
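A rough sketch of what that rule-based preprocessing could look like (the regex patterns below are made up for illustration; real disclaimer wording varies by sender):

import re

# Hypothetical disclaimer patterns; extend these per sender/bank.
DISCLAIMER_PATTERNS = [
    r"(?is)this e-?mail (and any attachments )?(is|are) confidential.*",
    r"(?is)if you are not the intended recipient.*",
    r"(?is)past performance is not (necessarily )?indicative of future results.*",
]

def strip_disclaimers(text: str) -> str:
    # Remove everything from the start of a known disclaimer to the end of the text.
    for pattern in DISCLAIMER_PATTERNS:
        text = re.sub(pattern, "", text)
    return text.strip()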

I am using my CPU with the ggml Alpaca. I have enough RAM to perform inference with the 65B model (it occupies approx. 52GB of the 100GB allowance that I gave to WSL2). GPTQ is seductive, but ultimately slower because the results are not good; at least I cannot get results with 13B GPTQ models that would make me feel comfortable that I did the best I could.
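In case it's useful, the llama-cpp-python side of that looks roughly like this (model path, context size, and prompt template are assumptions about my local setup):

from llama_cpp import Llama

# Load the 5-bit ggml model on CPU and run one instruction-style completion.
llm = Llama(model_path="alpaca-lora-65B.ggml.q5_1.bin", n_ctx=2048)
result = llm(
    "### Instruction: Summarise the following email.\n\n<email text>\n\n### Response:",
    max_tokens=256,
    temperature=0.7,
)
print(result["choices"][0]["text"])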

@ermarrero thanks for your reply. I am cleaning the emails using BeautifulSoup, and all I am asking the model to do is get rid of disclaimers. I don't know how to train at all; I would rather use a pretrained model to do this "simple" task. But it seems it's not that simple in the end.

@vmajor thanks, I have 24GB of VRAM and 32GB of RAM, so I can't use a bigger model.

...oh. OK, I saw that one 30b GPTQ model managed to fit inside 24GB VRAM, perhaps try that one. Let me see if I can find it: MetaIX/GPT4-X-Alpasta-30b-4bit · Hugging Face
Also take a look at this subreddit for up to the second developments with self-hosted small models: https://www.reddit.com/r/LocalLLaMA/

@vmajor Did you try it?

@TheBloke What is the best way to load your model today? I was using ooba webui to test it but it's not working anymore :(

AbdouS changed discussion status to closed

No, I did not try it. I'm done with research for now. I'm getting what I need done with the Alpaca 65B, and running it takes time and resources, so I don't feel like aborting the end use case under way for more tinkering. I'll get back to experimentation when new models get released, and will focus on deployment for now.

I am getting the following error when I import auto_gptq:

File "/opt/conda/lib/python3.7/site-packages/auto_gptq/modeling/__init__.py", line 1, in <module>
from ._base import BaseGPTQForCausalLM, BaseQuantizeConfig

File "/opt/conda/lib/python3.7/site-packages/auto_gptq/modeling/_base.py", line 182
if (pos_ids := kwargs.get("position_ids", None)) is not None:
^
SyntaxError: invalid syntax

Can anybody please help with this?

I suggest upgrading your Python to 3.10.9 and trying again first.
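(The reason is that the := walrus operator used in auto_gptq needs Python 3.8 or newer, and the traceback shows a Python 3.7 environment. A quick sanity check before reinstalling, as a sketch:)

import sys

# auto-gptq uses the := operator, which is a Python 3.8+ feature.
if sys.version_info < (3, 8):
    raise RuntimeError(f"Python {sys.version.split()[0]} is too old for auto-gptq; use 3.8+ (e.g. 3.10.9)")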

thanks it worked

Thank you so much.

This is exactly what I was looking for.
I like text-generation-webui, but I need to easily test different approaches with Langchain. So a simple Colab notebook works for me.

Can't thank you enough.
