I need help, I don't know what to do.
I have this error:
OSError: Error no file named pytorch_model.bin, tf_model.h5, model.ckpt.index or flax_model.msgpack found in directory models\TheBloke_vicuna-7B-1.1-GPTQ-4bit-128g.
Please check the README again - you need to apply the GPTQ parameters.
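In AutoGPTQ terms, "applying the GPTQ parameters" would look roughly like the sketch below. The values are taken from the 4-bit / 128g model name and the "no-act-order" checkpoint filename in this thread; double-check them against the model's README.

from auto_gptq import BaseQuantizeConfig

# GPTQ parameters for this model, per the model name (4-bit, group size 128)
# and the no-act-order checkpoint filename
quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=False,  # the checkpoint was quantized without act-order
)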
OSError: model.safetensors or model.safetensors.index.json and thus cannot be loaded with safetensors. Please make sure that the model has been saved with safe_serialization=True or do not set use_safetensors=True.
File "C:\GIT\localGPT\run_localGPT.py", line 28, in load_model
model = AutoGPTQForCausalLM.from_pretrained(model_id,quantize_config, device="cuda",use_safetensors=True)
A Python example of how to load this model on the GPU would be appreciated.
Set use_safetensors=False for this model. It's one of the very few that I didn't save in safetensors. Or use one of the newer, better models like WizardLM-13B-Uncensored-GPTQ or Nous-Hermes-13B-GPTQ.
It's still not working:
TheBloke/vicuna-7B-1.1-GPTQ-4bit-128g does not appear to have a file named pytorch_model.bin, tf_model.h5, model.ckpt or flax_model.msgpack.
This is the code that I am using:
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_id = "TheBloke/vicuna-7B-1.1-GPTQ-4bit-128g"

quantize_config = BaseQuantizeConfig(
    bits=4,          # quantize the model to 4-bit
    group_size=128,  # it is recommended to set this to 128
    desc_act=False,  # False can significantly speed up inference, but perplexity may be slightly worse
)

model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config, device="cuda", use_safetensors=False)
13B is too big for my GPU. I also tried TheBloke/wizardLM-7B-GPTQ and got the same error.
I just added a quantize_config.json, so you don't need to pass the quantize_config. But you do need to pass model_basename. Try this:
from auto_gptq import AutoGPTQForCausalLM

model_id = "TheBloke/vicuna-7B-1.1-GPTQ-4bit-128g"
model_basename = "vicuna-7B-1.1-GPTQ-4bit-128g.no-act-order"

model = AutoGPTQForCausalLM.from_pretrained(model_id, device="cuda:0", use_safetensors=False, quantize_config=None, model_basename=model_basename)
That's also the issue if you're trying models like WizardLM-7B-Uncensored (which is a better model than this one). You need to pass model_basename= with the name of the file without the extension. This file is called vicuna-7B-1.1-GPTQ-4bit-128g.no-act-order.pt, so we pass model_basename="vicuna-7B-1.1-GPTQ-4bit-128g.no-act-order".
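To illustrate (not from the original thread): the basename is just the checkpoint filename with its extension dropped, which you could also compute like this:

from pathlib import Path

# Hypothetical helper: strip the extension from the checkpoint filename to get model_basename
checkpoint_file = "vicuna-7B-1.1-GPTQ-4bit-128g.no-act-order.pt"
model_basename = Path(checkpoint_file).stem  # -> "vicuna-7B-1.1-GPTQ-4bit-128g.no-act-order"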
Still getting the same error:
TheBloke/vicuna-7B-1.1-GPTQ-4bit-128g does not appear to have a file named pytorch_model.bin, tf_model.h5, model.ckpt or flax_model.msgpack.
I have auto-gptq v0.2.2, bitsandbytes v0.39.0, transformers v4.29.2, and torch v2.1.0.
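(Not from the thread, but if you want to double-check what's actually installed, something like this works; the strings are the pip package names:)

from importlib.metadata import version

for pkg in ("auto-gptq", "bitsandbytes", "transformers", "torch"):
    print(pkg, version(pkg))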
Ohhh, I didn't spot before that you were using .from_pretrained(). It should be .from_quantized(). Also, this model didn't have a fast tokenizer, which I have now added.
This code works:
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer, logging, pipeline
model_id = "TheBloke/vicuna-7B-1.1-GPTQ-4bit-128g"
model_basename = "vicuna-7B-1.1-GPTQ-4bit-128g.no-act-order"
model = AutoGPTQForCausalLM.from_quantized(model_id, device="cuda:0", use_safetensors=False, quantize_config=None, model_basename=model_basename)
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
prompt = "Tell me about AI"
prompt_template=f'''### Human: {prompt}
### Assistant:'''
print("\n\n*** Generate:")
input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
output = model.generate(inputs=input_ids, temperature=0.7, max_new_tokens=512)
print(tokenizer.decode(output[0]))
# Inference can also be done using transformers' pipeline
# Prevent printing spurious transformers error when using pipeline with AutoGPTQ
logging.set_verbosity(logging.CRITICAL)
print("*** Pipeline:")
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15,
)
print(pipe(prompt_template)[0]['generated_text'])
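If you want to fold this back into localGPT's load_model() from the traceback at the top, a minimal adaptation could look like the sketch below. This is an assumption about how run_localGPT.py is structured, not its actual code; the essential changes are .from_quantized() and model_basename.

from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

def load_model(model_id="TheBloke/vicuna-7B-1.1-GPTQ-4bit-128g",
               model_basename="vicuna-7B-1.1-GPTQ-4bit-128g.no-act-order"):
    # Hypothetical signature; the real load_model() in run_localGPT.py may differ
    tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
    model = AutoGPTQForCausalLM.from_quantized(
        model_id,
        model_basename=model_basename,
        device="cuda:0",
        use_safetensors=False,
        quantize_config=None,
    )
    return model, tokenizer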
Thank you very much for your help. The above code works perfectly and is very easy to follow and understand. It's good to have a full example.