Error loading model from a different branch with revision

#8
by amitj - opened

I keep getting an error while loading gptq_model-8bit-128g.safetensors with revision gptq-8bit-128g-actorder_False; without a revision, the 4-bit model from the main branch loads fine. I have updated AutoGPTQ to 0.3.0.

│ ❱  65 │   │   │   model = AutoGPTQForCausalLM.from_quantized(
│    66 │   │   │   │   model_id,
│    67 │   │   │   │   revision=revision,
│    68 │   │   │   │   model_basename=model_basename,
│
│ /home/a/.local/lib/python3.10/site-packages/auto_gptq/modeling/auto.py:94 in from_quantized
│
│    91 │   │   │   for key in signature(quant_func).parameters
│    92 │   │   │   if key in kwargs
│    93 │   │   }
│ ❱  94 │   │   return quant_func(
│    95 │   │   │   model_name_or_path=model_name_or_path,
│    96 │   │   │   save_dir=save_dir,
│    97 │   │   │   device_map=device_map,
│
│ /home/a/.local/lib/python3.10/site-packages/auto_gptq/modeling/_base.py:714 in from_quantized
│
│   711 │   │   │   │   │   break
│   712 │   │
│   713 │   │   if resolved_archive_file is None: # Could not find a model file to use
│ ❱ 714 │   │   │   raise FileNotFoundError(f"Could not find model in {model_name_or_path}")
│   715 │   │
│   716 │   │   model_save_name = resolved_archive_file
│   717
╰──────────────────────────────────────────────────────────────────────────────────────╯
FileNotFoundError: Could not find model in TheBloke/Llama-2-13B-chat-GPTQ

Did you update the basename correctly for the file in the new branch? model_basename should be set to the name of the file without the .safetensors extension, so in this example it should be gptq_model-8bit-128g.
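For reference, a minimal loading sketch for that branch (assuming AutoGPTQ >= 0.3.2, where revision is honored; the repo, branch, and file names are the ones from this thread):

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_id = "TheBloke/Llama-2-13B-chat-GPTQ"
revision = "gptq-8bit-128g-actorder_False"
# The file in that branch is gptq_model-8bit-128g.safetensors, so drop the extension
model_basename = "gptq_model-8bit-128g"

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    revision=revision,
    model_basename=model_basename,
    use_safetensors=True,
    device="cuda:0",
)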

Yes I am removing the .safetensors extension. The behavior is as if the revision branch is not honored.

Yeah damn you're right, it's not using revision for some reason. It's an AutoGPTQ bug but I can't immediately see what's wrong. I will keep investigating

I could not load that revision either. My fix may be related:
https://github.com/TheBloke/AutoGPTQ/blob/45576f0933f5e9ef7c1617006d5db359e1669155/auto_gptq/modeling/_base.py#L666C95-L666C95
That kwargs dict gets popped empty, so it falls back to the default 4-bit settings. If I change it to cached_file_kwargs it still warns about the safetensors metadata, but inference works fine.
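Until that fix lands, one workaround is to sidestep AutoGPTQ's revision handling entirely: pull the branch locally with huggingface_hub and load it by path. A sketch, using the repo, branch, and basename from this thread:

from huggingface_hub import snapshot_download
from auto_gptq import AutoGPTQForCausalLM

# Download the whole 8-bit branch to a local snapshot, then load by path,
# so the (currently ignored) revision argument never comes into play.
local_dir = snapshot_download(
    repo_id="TheBloke/Llama-2-13B-chat-GPTQ",
    revision="gptq-8bit-128g-actorder_False",
)
model = AutoGPTQForCausalLM.from_quantized(
    local_dir,
    model_basename="gptq_model-8bit-128g",
    use_safetensors=True,
    device="cuda:0",
)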

The bug with revision was fixed in 0.3.2, please update and it will work fine

The warning about the safetensors metadata is also harmless, and it won't appear for future GPTQs I make. That was also fixed in 0.3.2 (metadata is now saved into each GPTQ to prevent that warning).
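If you want to check whether a particular .safetensors file already carries that metadata, the safetensors library can read it directly; a small sketch (the file path here is illustrative):

from safetensors import safe_open

# Files produced with AutoGPTQ >= 0.3.2 should carry format metadata;
# older files return None / an empty dict, which is what triggers the warning.
with safe_open("gptq_model-8bit-128g.safetensors", framework="pt", device="cpu") as f:
    print(f.metadata())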

Oh I see. Thanks for the quick answer. I just noticed that pip downloaded 0.3.1 because of this error:
Discarding https://files.pythonhosted.org/packages/1b/79/5a3a7d877a9b0a72f528e9977ec65cdb9fad800fa4f5110f87f2acaaf6fe/auto_gptq-0.3.2.tar.gz (from https://pypi.org/simple/auto-gptq/) (requires-python:>=3.8.0): Requested auto-gptq from https://files.pythonhosted.org/packages/1b/79/5a3a7d877a9b0a72f528e9977ec65cdb9fad800fa4f5110f87f2acaaf6fe/auto_gptq-0.3.2.tar.gz has inconsistent version: expected '0.3.2', but metadata has '0.3.2+cu117'

Yeah, that's a bug in AutoGPTQ at the moment; it should be fixed this weekend. The revision issue was fixed in 0.3.1, and 0.3.2 was a separate change, so 0.3.1 should work fine with revision too.
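To confirm which version pip actually ended up installing (including any +cu117 local suffix), you can query the installed package metadata; a quick sketch:

from importlib.metadata import version

# Prints e.g. "0.3.1" or "0.3.2+cu117", depending on which wheel pip resolved.
print(version("auto-gptq"))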

Alright, thanks again. Great work btw! )

Hello @TheBloke , has this been fixed? I'm also getting the same error.

"""
from transformers import AutoTokenizer, pipeline, logging
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from huggingface_hub import snapshot_download

model_name = "TheBloke/Llama-2-13B-chat-GPTQ"
local_folder = "/home/n/resume-parser/llama2/13b"

snapshot_download(repo_id=model_name, local_dir=local_folder, local_dir_use_symlinks=False)

model_basename = "gptq_model-4bit-128g"

use_triton = False

tokenizer = AutoTokenizer.from_pretrained(local_folder, use_fast=True)

model = AutoGPTQForCausalLM.from_quantized(
    local_folder,
    model_basename=model_basename,
    use_safetensors=True,
    trust_remote_code=True,
    device="cuda:0",
    use_triton=use_triton,
    quantize_config=None,
)

input_ids = tokenizer("Llamas are", return_tensors='pt').input_ids.cuda()
output = model.generate(inputs=input_ids, temperature=0.7, max_new_tokens=512)
print(tokenizer.decode(output[0]))

"""

ERROR:

Exllama kernel is not installed, reset disable_exllama to True. This may because you installed auto_gptq using a pre-build wheel on Windows, in which exllama_kernels are not compiled. To use exllama_kernels to further speedup inference, you can re-install auto_gptq from source.
CUDA kernels for auto_gptq are not installed, this will result in very slow inference speed. This may because:

  1. You disabled CUDA extensions compilation by setting BUILD_CUDA_EXT=0 when install auto_gptq from source.
  2. You are using pytorch without CUDA support.
  3. CUDA and nvcc are not installed in your device.
Traceback (most recent call last):
  File "/home/n/resume-parser/main.py", line 16, in <module>
    model = AutoGPTQForCausalLM.from_quantized(local_folder,
  File "/home/n/anaconda3/envs/resume-parser/lib/python3.10/site-packages/auto_gptq/modeling/auto.py", line 108, in from_quantized
    return quant_func(
  File "/home/n/anaconda3/envs/resume-parser/lib/python3.10/site-packages/auto_gptq/modeling/_base.py", line 791, in from_quantized
    raise FileNotFoundError(f"Could not find model in {model_name_or_path}")
FileNotFoundError: Could not find model in /home/n/resume-parser/llama2/13b

I am using CUDA 11.7 and Python 3.10.
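Since the error says the model file could not be found in the local folder, a quick sanity check is to list what snapshot_download actually fetched and confirm a file matching the basename exists; a sketch using the paths from the post above:

import os

local_folder = "/home/n/resume-parser/llama2/13b"
model_basename = "gptq_model-4bit-128g"

# List what was downloaded and check for the expected safetensors file.
print(sorted(os.listdir(local_folder)))
expected = os.path.join(local_folder, model_basename + ".safetensors")
print(expected, "exists:", os.path.isfile(expected))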
