GPTQ buggy: wondering if I'm loading the model correctly

#9
by quantuan125 - opened

Hi TheBloke,

Big fan of your work so keep it up!

Since I'm a bit of a newbie to the whole space, I wanted to ask whether I'm loading the model correctly through the auto-gptq module.

Essentially I'm trying to load the GPTQ and GGML models from a local directory into my own application. However, when I load them in the application and inspect them after they have gone through the pipeline, I can see that the GPTQ model parameters are not showing correctly (the GGML ones do), and the GPTQ model seems to be buggy, reloading every time I run the Python script.

(Screenshot) Caption: it shows "gpt2"??, which might be the default value from HuggingFacePipeline, but I don't know whether that is the correct model.

(Screenshot) Caption: it shows the model_path correctly, as well as all the parameters set in LlamaCpp.

I have tried many approaches to see if I've set it up correctly, including going through HuggingFacePipeline.from_model_id instead. For example:
"""
local_model = HuggingFacePipeline.from_model_id(
model_id=model_id,
task="text-generation",
model_kwargs={"trust_remote_code": True},
pipeline_kwargs={
"model": model,
"tokenizer": tokenizer,
"device_map": "auto",
"max_new_tokens": 1200,
"temperature": 0.3,
"top_p": 0.95,
"repetition_penalty": 1.15,
},
return local_model
"""

However, I'm getting constant errors such as:
""
OSError: TheBloke/Llama-2-7B-Chat-GPTQ does not appear to have a file named pytorch_model.bin but there is a file for TensorFlow weights. Use from_tf=True to load this model from those weights.
""

or

""
OSError: Error no file named pytorch_model.bin, tf_model.h5, model.ckpt.index or flax_model.msgpack found in directory
""

Anyhow, I would like to show you my code to see if I'm loading the GPTQ model correctly. Thank you in advance:

CODE: relevant snippet for loading the GPTQ model
"""
@st .cache_data
def load_local_model(device_type, model_id, model_path, model_basename=None):
if model_basename is not None:
...
if ".safetensors" in model_basename:
# Remove the ".safetensors" ending if present
model_basename = model_basename.replace(".safetensors", "")

    tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)
    model = AutoGPTQForCausalLM.from_quantized(
        model_path,
        model_basename=model_basename,
        use_safetensors=True,
        trust_remote_code=True,
        device="cuda:0",
        use_triton=False,
        quantize_config=None,
    )

...

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_length=2048,
    temperature=0,
    top_p=0.95,
    repetition_penalty=1.15,
    generation_config=generation_config
)

local_model = HuggingFacePipeline(pipeline=pipe)
logging.info("Local LLM Loaded")

return local_model

def main():
...
elif localai_model == "Llama-2-7B-Chat-GPTQ":
model_path = "C:\Users\quant\OneDrive\Documents\Purpose\AI\multipdfchat\local\Llama2\7B\GPTQ"
model_id = "TheBloke/Llama-2-7B-Chat-GPTQ"
model_basename = "gptq_model-4bit-128g.safetensors"
device_type = "cuda:0"
st.session_state.local = load_local_model(device_type=device_type, model_id=model_id, model_path=model_path, model_basename=model_basename)
st.write("Local LLM model has been loaded. Press 'Process' to continue")
st.write(st.session_state.local)
...
"""

I'm not sure. Your code seems all right. I just ran the following test script - largely copied from the README - and confirmed it works fine:

from transformers import AutoTokenizer, pipeline, logging
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_name_or_path = "TheBloke/Llama-2-7b-Chat-GPTQ"

use_triton = False

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
        use_safetensors=True,
        trust_remote_code=True,
        device="cuda:0",
        use_triton=use_triton,
        quantize_config=None)

prompt = "Tell me about AI"
system_message = "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."
prompt_template=f'''[INST] <<SYS>>
{system_message}
<</SYS>>

{prompt} [/INST]'''

# Prevent printing spurious transformers error when using pipeline with AutoGPTQ
logging.set_verbosity(logging.CRITICAL)

print("*** Pipeline:")
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15
)

print(pipe(prompt_template)[0]['generated_text'])

Output:

 [pytorch2] tomj@17b00c4e2a6d:/workspace ᐅ python3 test_gptq.py
Downloading (…)okenizer_config.json: 100%|████████| 727/727 [00:00<00:00, 4.40MB/s]
Downloading tokenizer.model: 100%|████████| 500k/500k [00:00<00:00, 21.1MB/s]
Downloading (…)/main/tokenizer.json: 100%|████████| 1.84M/1.84M [00:00<00:00, 11.5MB/s]
Downloading (…)cial_tokens_map.json: 100%|████████| 411/411 [00:00<00:00, 3.46MB/s]
Downloading (…)lve/main/config.json: 100%|████████| 572/572 [00:00<00:00, 4.59MB/s]
Downloading (…)quantize_config.json: 100%|████████| 185/185 [00:00<00:00, 1.55MB/s]
Downloading (…)bit-128g.safetensors: 100%|████████| 3.90G/3.90G [00:19<00:00, 199MB/s]
The safetensors archive passed at /workspace/huggingface/hub/models--TheBloke--Llama-2-7b-Chat-GPTQ/snapshots/b7ee6c20ac0bba85a310dc699d6bb4c845811608/gptq_model-4bit-128g.safetensors does not contain metadata. Make sure to save your model with the `save_pretrained` method. Defaulting to 'pt' metadata.
skip module injection for FusedLlamaMLPForQuantizedModel not support integrate without triton yet.
*** Pipeline:
[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>

Tell me about AI [/INST]  Of course! I'd be happy to provide information on AI (Artificial Intelligence). AI refers to the development of computer systems able to perform tasks typically requiring human intelligence, such as visual perception, speech recognition, decision-making, and language translation. Artificial intelligence has been around for several decades and has evolved significantly over time. Here are some key aspects of AI:
1. Machine Learning: A subset of AI focused on enabling machines to learn from data without explicit programming. It involves training algorithms using historical data to recognize patterns, classify objects, or predict outcomes.
2. Deep Learning: A subfield of machine learning that utilizes neural networks with multiple layers to analyze complex data sets. These networks can recognize images, understand natural language, and generate creative content like music or art.
3. Natural Language Processing (NLP): The branch of AI concerned with developing computers capable of understanding, interpreting, and generating human language. NLP enables applications like chatbots, voice assistants, and language translation software.
4. Robotics: The intersection of AI and robotics focuses on creating robots capable of performing tasks that typically require human intelligence, such as assembly, maintenance, and transportation.
5. Computer Vision: This area of AI deals with enabling computers to interpret and understand visual data from the world, including recognizing objects, tracking movements, and analyzing facial expressions.
6. Expert Systems: These are AI systems designed to mimic the decision-making abilities of human experts in specific domains, such as medicine, finance, or engineering.
7. Reinforcement Learning: An AI technique where an algorithm learns by interacting with its environment and receiving feedback in the form of rewards or penalties. This approach allows AI to optimize its behavior based on desired outcomes.
8. Generative Adversarial Networks (GANs): A type of deep learning algorithm involving two neural networks working together to create new data that resembles existing examples. GANs have led to breakthroughs in image generation, video creation, and text production.
9. Autonomous Vehicles: Self-driving cars and trucks use a combination of sensors, mapping technology, and AI algorithms to navigate roads safely and efficiently.
10. Ethical Considerations: As A

(Note I removed model_basename as it's not actually needed; I've named my models with the default filename that AutoGPTQ looks for, so you can remove that from your code.)

So the base AutoGPTQ and Transformers code is definitely working fine. I think it must be something happening elsewhere in your code. I don't know what HuggingFacePipeline(pipeline=pipe) might be doing, for example?
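If you want to double-check which model your pipeline actually wraps, independent of whatever LangChain prints about itself, something like this should do it (a rough sketch; the exact attribute layout of the AutoGPTQ wrapper is an assumption on my part and may differ between versions):

# pipe.model is the AutoGPTQ wrapper; its .model attribute should be the
# underlying transformers model (LlamaForCausalLM for this repo).
inner = getattr(pipe.model, "model", pipe.model)
print(type(inner).__name__)      # expect LlamaForCausalLM
print(inner.config.model_type)   # expect "llama", not "gpt2"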

Just to double-check: what version of AutoGPTQ are you using? You need 0.2.2 minimum, but I would recommend upgrading to 0.3.2. And I suggest installing it from source, due to various installation problems at the moment:

pip3 uninstall -y auto-gptq
git clone https://github.com/PanQiWei/AutoGPTQ
cd AutoGPTQ
pip3 install .
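
Afterwards you can confirm which version actually got installed with:

pip3 show auto-gptq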

I'm confident the model works and that the provided AutoGPTQ code works. So I suggest you debug it step by step, starting from a simple test script like the one above - confirm that works for you first - and then build your extra code on top of that known-good example.

One more thought: if you're cloning the models locally (rather than loading direct from HF like in my above example), confirm that you've cloned all the files in the GPTQ branch. If you missed out any files, that could cause problems.

I just realised that HuggingFacePipeline is a LangChain thing. I've never tested LangChain. It's possible that the problem is happening there. I know for sure that LangChain can work with AutoGPTQ as many people have mentioned to me that they're doing that. But I don't know exactly how they do it. Maybe you need to make a new HuggingFacePipeline class that loads the model with AutoGPTQ? Not sure.
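From what people have told me, the usual pattern seems to be to load the model with AutoGPTQ, build a normal Transformers pipeline from it (as in my script above), and then hand that pipeline to LangChain, rather than letting LangChain load the model itself. Roughly like this, though I haven't tested LangChain myself, so treat it as a sketch:

from langchain.llms import HuggingFacePipeline

# LangChain never loads the weights here; it only wraps and calls the
# already-constructed transformers pipeline built from the AutoGPTQ model.
llm = HuggingFacePipeline(pipeline=pipe)
print(llm("Tell me about AI"))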

Hi @TheBloke ,

Thank you very much for your detailed answer. I very much appreciate the guidance.

Just to let you know, I'm using AutoGPTQ 0.2.2 with CUDA 11.8, as I had problems installing it from source. However, I can give it another try.

Regarding my initial question, my main confusion was how HuggingFacePipeline (from LangChain) was reporting the model information: as you can see, it detected the GGML model correctly, but not the GPTQ model (it identifies the Llama-2-7B model as "gpt2").

The model loaded successfully regardless, but it left me a bit confused, hence the question.

Anyway, based on what I'm seeing and what you're saying, I'll take it that the GPTQ model really is the Llama-2-7B model and not a GPT-2 model.

Additionally, another reason I raised this concern is that the model takes quite some time to initialize, and it seems to reinitialize every time my application processes another action, which adds latency (e.g. when doing QA over PDF files). This was not the case for the GGML model, but perhaps that is just the nature of it.

To be clearer, the latency issue I'm describing looks like this:
GGML: Initialize -> Process 1 -> Process 2 -> Process 3
GPTQ: Initialize -> Process 1 -> Initialize -> Process 2 -> Initialize -> Process 3

If you are saying the model should work correctly, then perhaps it has to do with the application script instead.
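One thing I will double-check on my side (just a guess, not something confirmed here) is the caching decorator: I'm currently using @st.cache_data, but Streamlit documents st.cache_resource as the one meant for global resources like ML models, since cache_data serializes the return value, which doesn't really fit a GPU-resident model. Something like:

"""
import streamlit as st

@st.cache_resource  # keep one live model object across Streamlit reruns,
                    # instead of trying to serialize the return value
def load_local_model(device_type, model_id, model_path, model_basename=None):
    ...
"""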

Side note: I can confirm that all files are cloned. The way I load the two different models is via the following paths; can you confirm that I'm doing it correctly?
GGML: "C:\Users\quant\OneDrive\Documents\Purpose\AI\multipdfchat\local\Llama2\7B\GGML\llama-2-7b-chat.ggmlv3.q4_0.bin" -> to the bin file
GPTQ: "C:\Users\quant\OneDrive\Documents\Purpose\AI\multipdfchat\local\Llama2\7B\GPTQ" -> to the entire folder instead of the .safetensors file

Can anyone explain this warning?
skip module injection for FusedLlamaMLPForQuantizedModel not support integrate without triton yet.
