I need help I don't know what to do

#5
by enginee - opened

I have this error:

OSError: Error no file named pytorch_model.bin, tf_model.h5, model.ckpt.index or flax_model.msgpack found in directory models\TheBloke_vicuna-7B-1.1-GPTQ-4bit-128g.

Please check the README again - you need to apply the GPTQ parameters.

OSError: ... does not appear to have a file named model.safetensors or model.safetensors.index.json and thus cannot be loaded with safetensors. Please make sure that the model has been saved with safe_serialization=True or do not set use_safetensors=True.

File "C:\GIT\localGPT\run_localGPT.py", line 28, in load_model
model = AutoGPTQForCausalLM.from_pretrained(model_id,quantize_config, device="cuda",use_safetensors=True)

a Python example on how to load this model in GPU would be appreciated.

set use_safetensors=False for this model. It's one of the very few that I didn't save in safetensors

Or use one of the newer, better models like WizardLM-13B-Uncensored-GPTQ or Nous-Hermes-13B-GPTQ

It's still not working:
TheBloke/vicuna-7B-1.1-GPTQ-4bit-128g does not appear to have a file named pytorch_model.bin, tf_model.h5, model.ckpt or flax_model.msgpack.

This is the code that I am using:

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_id = "TheBloke/vicuna-7B-1.1-GPTQ-4bit-128g"

quantize_config = BaseQuantizeConfig(
    bits=4,          # quantize the model to 4-bit
    group_size=128,  # it is recommended to set this to 128
    desc_act=False,  # setting this to False can significantly speed up inference, but perplexity may be slightly worse
)

model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config, device="cuda", use_safetensors=False)

13B is too big for my GPU. I also tried TheBloke/wizardLM-7B-GPTQ and got the same error.
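
Not from this thread, but a quick way to check whether a model will fit is to look at how much VRAM is free before loading; a minimal sketch using plain torch calls:

import torch

# Minimal sketch: report total and currently free VRAM on the default CUDA
# device, to help decide between a 7B and a 13B GPTQ model.
props = torch.cuda.get_device_properties(0)
free_bytes, total_bytes = torch.cuda.mem_get_info()
print(f"GPU: {props.name}")
print(f"Total VRAM: {total_bytes / 1024**3:.1f} GiB, free: {free_bytes / 1024**3:.1f} GiB")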

I just added a quantize_config.json, so you don't need to pass the quantize_config. But you do need to pass model_basename.

Try this:

from auto_gptq import AutoGPTQForCausalLM
model_id = "TheBloke/vicuna-7B-1.1-GPTQ-4bit-128g"
model_basename="vicuna-7B-1.1-GPTQ-4bit-128g.no-act-order"

model = AutoGPTQForCausalLM.from_pretrained(model_id, device="cuda:0", use_safetensors=False, quantize_config=None, model_basename=model_basename)

That's also the issue if you're trying models like WizardLM-7B-Uncensored (which is a better model than this one). You need to pass model_basename= with the name of the checkpoint file, minus the extension. This model's file is called vicuna-7B-1.1-GPTQ-4bit-128g.no-act-order.pt, so we pass model_basename="vicuna-7B-1.1-GPTQ-4bit-128g.no-act-order".
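
As a side note (not part of the original reply): if you are unsure what the checkpoint file in a repo is called, one way to check from Python is huggingface_hub.list_repo_files; a rough sketch, assuming huggingface_hub is installed:

from huggingface_hub import list_repo_files

# Sketch: list the repo's files and keep likely checkpoint candidates.
# model_basename is the matching filename with its extension stripped.
files = list_repo_files("TheBloke/vicuna-7B-1.1-GPTQ-4bit-128g")
print([f for f in files if f.endswith((".pt", ".safetensors", ".bin"))])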

Still getting the same error:
TheBloke/vicuna-7B-1.1-GPTQ-4bit-128g does not appear to have a file named pytorch_model.bin, tf_model.h5, model.ckpt or flax_model.msgpack.

I have these modules installed: auto-gptq v0.2.2; bitsandbytes v0.39.0; transformers v4.29.2; torch v2.1.0.
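
For reference, a small sketch (not code from this thread) that prints those versions from Python, using the PyPI distribution names:

from importlib.metadata import version

# Print the installed versions of the packages involved in this setup.
for pkg in ["auto-gptq", "bitsandbytes", "transformers", "torch"]:
    print(pkg, version(pkg))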

Ohhh, I didn't spot before that you were using .from_pretrained(). It should be .from_quantized(). Also, this model didn't have a fast tokenizer, which I have now added.

This code works:

from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer, logging, pipeline

model_id = "TheBloke/vicuna-7B-1.1-GPTQ-4bit-128g"
model_basename="vicuna-7B-1.1-GPTQ-4bit-128g.no-act-order"

model = AutoGPTQForCausalLM.from_quantized(model_id, device="cuda:0", use_safetensors=False, quantize_config=None, model_basename=model_basename)

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)

prompt = "Tell me about AI"
prompt_template=f'''### Human: {prompt}
### Assistant:'''

print("\n\n*** Generate:")

input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
output = model.generate(inputs=input_ids, temperature=0.7, max_new_tokens=512)
print(tokenizer.decode(output[0]))

# Inference can also be done using transformers' pipeline

# Prevent printing spurious transformers error when using pipeline with AutoGPTQ
logging.set_verbosity(logging.CRITICAL)

print("*** Pipeline:")
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15
)

print(pipe(prompt_template)[0]['generated_text'])

Thank you very much for your help. The above code works perfectly and is very easy to follow and understand. It's good to have a full example.
