how to load this in transformers python

#3
by krish14388 - opened

I keep getting this error... any idea?

OSError: Could not locate pytorch_model-00001-of-00006.bin inside TheBloke/medalpaca-13B-GPTQ-4bit.

You can't load this GPTQ model using standard transformers. To load GPTQ models from Python code and use them transformers-style, please use AutoGPTQ.

Firstly, you need to compile the latest AutoGPTQ from source as the code is still under active development, and pre-compiled binaries are not yet provided. Compiling requires that you have the CUDA toolkit installed. Do the following:

git clone https://github.com/PanQiWei/AutoGPTQ
cd AutoGPTQ
pip install .
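
If you want to confirm the build worked before going further, a quick sanity check along these lines (just a sketch) should import cleanly and report CUDA as available:

import torch
import auto_gptq

# If the compile succeeded, this import works and a CUDA device should be visible
print("auto_gptq loaded from:", auto_gptq.__file__)
print("CUDA available:", torch.cuda.is_available())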

Then download the TheBloke/medalpaca-13B-GPTQ-4bit model locally (I guess you already have).
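
If you haven't, one way to fetch it (assuming you have the huggingface_hub package installed; the local path below is just an example) is:

from huggingface_hub import snapshot_download

# Download the whole repo (quantized weights, tokenizer, configs) to a local folder
snapshot_download(
    repo_id="TheBloke/medalpaca-13B-GPTQ-4bit",
    local_dir="/workspace/models/TheBloke_medalpaca-13B-GPTQ-4bit"
)

Once the model is on disk, run code like the following: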

from transformers import AutoTokenizer, pipeline, logging
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

# change this path to match where you downloaded the model
quantized_model_dir = "/workspace/models/TheBloke_medalpaca-13B-GPTQ-4bit"

model_basename = "medalpaca-13B-GPTQ-4bit-128g.compat.no-act-order"

use_triton = False

tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir, use_fast=True)

quantize_config = BaseQuantizeConfig(
    bits=4,          # 4-bit quantization
    group_size=128,  # 128g model file
    desc_act=False   # this file was quantized without act-order
)

model = AutoGPTQForCausalLM.from_quantized(
    quantized_model_dir,
    use_safetensors=True,
    model_basename=model_basename,
    device="cuda:0",
    use_triton=use_triton,
    quantize_config=quantize_config
)

# Prevent printing spurious transformers error when using pipeline with AutoGPTQ
logging.set_verbosity(logging.CRITICAL)

prompt = "Tell me about AI"
prompt_template=f'''### Human: {prompt}
### Assistant:'''

print("*** Pipeline:")
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    do_sample=True,  # needed for temperature/top_p to take effect
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15
)

print(pipe(prompt_template)[0]['generated_text'])

print("\n\n*** Generate:")

input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
output = model.generate(inputs=input_ids, do_sample=True, temperature=0.7, max_new_tokens=512)
print(tokenizer.decode(output[0]))

Then you can use that code as a base for whatever you want to do.

Thanks a lot, that worked.

krish14388 changed discussion status to closed

Works fine.
But it is generating gibberish. How do I resolve that?
