can you load the provided safetensors with AutoModelForCausalLM.from_pretrained?

#4
by shawei3000 - opened

I could not; I tried many ways on torch 2.0. Would you be able to provide a script showing how you load the downloaded 4-bit safetensors?

You can't load GPTQ files directly with transformers AutoModelForCausalLM. It's not supported without additional code.

I recommend using AutoGPTQ.

Here's an example script using AutoGPTQ. Note that you first need to download the model locally (e.g. with git clone; AutoGPTQ doesn't yet support downloading directly from HF, but it will very soon).
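For example, assuming the repo is the one matching the paths below (TheBloke/guanaco-65B-GPTQ), the download might look like:

git lfs install
git clone https://huggingface.co/TheBloke/guanaco-65B-GPTQ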

from transformers import AutoTokenizer, pipeline, logging
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

quantized_model_dir = "/path/to/guanaco-65B-GPTQ"
model_basename = "Guanaco-65B-GPTQ-4bit.act-order"

use_triton = False

tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir, use_fast=True)

model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir,
        use_safetensors=True,
        model_basename=model_basename,
        device="cuda:0",
        use_triton=use_triton,
        quantize_config=None)

# Prevent printing spurious transformers error when using pipeline with AutoGPTQ
logging.set_verbosity(logging.CRITICAL)

prompt = "Tell me about AI"
prompt_template=f'''### Human: {prompt}
### Assistant:'''

print("*** Pipeline:")
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15
)

print(pipe(prompt_template)[0]['generated_text'])

print("\n\n*** Generate:")

input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
output = model.generate(inputs=input_ids, temperature=0.7, max_new_tokens=512)
print(tokenizer.decode(output[0]))

Also, thanks for your super fast release of quantized models for the community. Very helpful, great work!

You're very welcome!

Here's the same script again, but with the quantize_config specified explicitly rather than loaded from the model directory:

from transformers import AutoTokenizer, pipeline, logging
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

quantized_model_dir = "/path/to/guanaco-65B-GPTQ"
model_basename = "Guanaco-65B-GPTQ-4bit.act-order."

use_triton = False

tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir, use_fast=True)

quantize_config = BaseQuantizeConfig(
        bits=4,
        group_size=128,
        desc_act=False
    )

model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir,
        use_safetensors=True,
        model_basename=model_basename,
        device="cuda:0",
        use_triton=use_triton,
        quantize_config=quantize_config)

# Prevent printing spurious transformers error when using pipeline with AutoGPTQ
logging.set_verbosity(logging.CRITICAL)

prompt = "Tell me about AI"
prompt_template=f'''### Human: {prompt}
### Assistant:'''

print("*** Pipeline:")
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15
)

print(pipe(prompt_template)[0]['generated_text'])

print("\n\n*** Generate:")

input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
output = model.generate(inputs=input_ids, temperature=0.7, max_new_tokens=512)
print(tokenizer.decode(output[0]))

Should group_size be set to -1 instead of 128?

err yes, sorry!
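If you do want to pass it explicitly, the corrected values would be along these lines (group_size=-1 as you say; desc_act=True is an assumption based on the "act-order" in the filename):

quantize_config = BaseQuantizeConfig(
        bits=4,
        group_size=-1,   # no grouping for this file
        desc_act=True    # assumption: the act-order file was made with desc_act enabled
    )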

In fact, you don't even need to specify quantize_config because it's provided with the model:

from transformers import AutoTokenizer, pipeline, logging
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

quantized_model_dir = "/path/to/guanaco-65B-GPTQ"
model_basename = "Guanaco-65B-GPTQ-4bit.act-order."

use_triton = False

tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir, use_fast=True)

model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir,
        use_safetensors=True,
        model_basename=model_basename,
        device="cuda:0",
        use_triton=use_triton,
        quantize_config=None)

# Prevent printing spurious transformers error when using pipeline with AutoGPTQ
logging.set_verbosity(logging.CRITICAL)

prompt = "Tell me about AI"
prompt_template=f'''### Human: {prompt}
### Assistant:'''

print("*** Pipeline:")
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15
)

print(pipe(prompt_template)[0]['generated_text'])

print("\n\n*** Generate:")

input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
output = model.generate(inputs=input_ids, temperature=0.7, max_new_tokens=512)
print(tokenizer.decode(output[0]))

Thanks, Bloke, this works!

I can fine-tune a 65B LoRA on your GPTQ model, but I could not merge the resulting LoRA with the GPTQ base model... In that case, do you have inference code that loads the foundation model and the LoRA separately and then runs prediction? I have the following code (using two 48GB GPUs), but I get an out-of-memory error. Is it possible to put the LoRA on a different GPU?

model, tokenizer = load_llama_model_4bit_low_ram_and_offload(
    '/media/jimsha/E/gpt4-alpaca-lora_mlp-65B-GPTQ',
    '/media/jimsha/E/gpt4-alpaca-lora_mlp-65B-GPTQ/gpt4-alpaca-lora_mlp-65B-GPTQ-4bit.safetensors',
    #device_map='auto',
    groupsize=-1,
    is_v1_model=False,
    max_memory={0: '42Gib', 1: '42Gib', 'cpu': '70Gib'})

model = PeftModel.from_pretrained(
    model,
    '/media/jimsha/E/ML_Tests/Jim/LLMZoo/alpaca_lora_4bit-main/alpaca_lora/',
    #device_map='auto',
    max_memory={0: '42Gib', 1: '42Gib', 'cpu': '70Gib'},
    torch_dtype=torch.float32,
    is_trainable=True)
......
with torch.no_grad():
    generation_output = model.generate(
        input_ids=input_ids,
        generation_config=generation_config,
        return_dict_in_generate=True,
        output_scores=True,
        max_new_tokens=max_new_tokens,
    )

ERROR:
"CUDA out of memory. Tried to allocate 44.00 MiB (GPU 0; 47.50 GiB total capacity; 44.36 GiB already allocated; 31.12 MiB free; 45.71 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF"

I always do a hard merge of the LoRA and base model, saving to a new model. I use this code for that:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

import os
import argparse

def get_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--base_model_name_or_path", type=str)
    parser.add_argument("--peft_model_path", type=str)
    parser.add_argument("--output_dir", type=str)
    parser.add_argument("--device", type=str, default="auto")
    parser.add_argument("--push_to_hub", action="store_true")
    parser.add_argument("--trust_remote_code", action="store_true")

    return parser.parse_args()

def main():
    args = get_args()

    if args.device == 'auto':
        device_arg = { 'device_map': 'auto' }
    else:
        device_arg = { 'device_map': { "": args.device} }

    print(f"Loading base model: {args.base_model_name_or_path}")
    base_model = AutoModelForCausalLM.from_pretrained(
        args.base_model_name_or_path,
        return_dict=True,
        torch_dtype=torch.float16,
        trust_remote_code=args.trust_remote_code,
        **device_arg
    )

    print(f"Loading PEFT: {args.peft_model_path}")
    model = PeftModel.from_pretrained(base_model, args.peft_model_path, **device_arg)
    print(f"Running merge_and_unload")
    model = model.merge_and_unload()

    tokenizer = AutoTokenizer.from_pretrained(args.base_model_name_or_path)

    if args.push_to_hub:
        print(f"Saving to hub ...")
        model.push_to_hub(f"{args.output_dir}", use_temp_dir=False)
        tokenizer.push_to_hub(f"{args.output_dir}", use_temp_dir=False)
    else:
        model.save_pretrained(f"{args.output_dir}")
        tokenizer.save_pretrained(f"{args.output_dir}")
        print(f"Model saved to {args.output_dir}")

if __name__ == "__main__" :
    main()
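Saved as, say, merge_peft.py (the filename and paths here are just placeholders), it would be run along these lines, with --base_model_name_or_path pointing at the original fp16 HF model rather than the GPTQ files:

python merge_peft.py --base_model_name_or_path /path/to/llama-65B-HF --peft_model_path /path/to/lora --output_dir /path/to/merged-65B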

I get a message, but no actual error; AutoGPTQ is the latest version:
skip module injection for FusedLlamaMLPForQuantizedModel not support integrate without triton yet.

That's fine, it's just an informational message. Completely normal. Please ignore it. In future releases of AutoGPTQ the message will likely be removed.

Thank you for this information.

@TheBloke : regarding "I always do a hard merge of the LoRA and base model, saving to a new model. I use this code for that:..."
I actually trained the LoRA using the 65B GPTQ 4-bit base model. Is there a way to merge the LoRA with the GPTQ 4-bit base model that was used during fine-tuning?
Or maybe you are suggesting that merging the LoRA (trained on the 65B 4-bit base) with the 65B HF base model would be just fine?

I've never looked at 4bit LoRA at all so I'm not immediately sure. It should be possible, but the script I provided wouldn't work.

AutoGPTQ is adding PEFT support now (it's already merged in the latest dev version). That might provide a way.

Otherwise I'm not sure. Maybe ask on the Alpaca Lora 4bit repo?
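For what it's worth, the PEFT support in recent AutoGPTQ dev versions exposes a get_gptq_peft_model helper in auto_gptq.utils.peft_utils; a rough, untested sketch of attaching a trained adapter to the quantized model, assuming that API, might look like:

from auto_gptq import AutoGPTQForCausalLM
from auto_gptq.utils.peft_utils import get_gptq_peft_model

# Load the quantized base exactly as in the scripts above
model = AutoGPTQForCausalLM.from_quantized(
    "/path/to/guanaco-65B-GPTQ",
    use_safetensors=True,
    model_basename="Guanaco-65B-GPTQ-4bit.act-order",
    device="cuda:0",
    use_triton=False)

# Attach the trained LoRA adapter; model_id points at the adapter directory (placeholder path)
model = get_gptq_peft_model(model, model_id="/path/to/lora-adapter", train_mode=False)

The exact signature may differ between AutoGPTQ versions, so it's worth checking the peft examples in the AutoGPTQ repo.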
