Conversion to GGUF and quantization

#1
by Ke09876 - opened

Hi, I tried to convert and quantize this model in order to experiment with it in llama.cpp, but I got the error "KeyError: ('torch._utils', '_rebuild_meta_tensor_no_storage')" during the conversion step. I don't know whether the issue comes from the way I merged the QLoRA adapter (I'm new to this), so I was hoping you could tell me how to proceed.

Here is the script I used for the merge:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base_model_name_or_path = "./Mistral-7B-Instruct-v0.1"
peft_model_path = "./aria-7b"
output_dir = "./merge"

# Load the base model in float16; device_map="auto" lets accelerate spread the
# weights across devices and offload whatever does not fit to ./offload
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name_or_path,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map="auto",
    offload_folder="./offload",
)

base_model.tie_weights()

# Attach the QLoRA adapter and fold its weights into the base model
model = PeftModel.from_pretrained(base_model, peft_model_path, offload_folder="./offload")
model = model.merge_and_unload()

tokenizer = AutoTokenizer.from_pretrained(base_model_name_or_path)

# Save the merged model and tokenizer
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
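
A quick check on the merged model, before calling save_pretrained, is to look for parameters still sitting on the meta device, which is what this particular KeyError usually points to; a minimal sketch against the script above:

# Parameters left on the meta device have no storage; saving them produces a
# checkpoint that llama.cpp's convert.py cannot unpickle.
meta_params = [name for name, p in model.named_parameters() if p.is_meta]
if meta_params:
    print(f"{len(meta_params)} parameters are still on the meta device, e.g. {meta_params[:5]}")
else:
    print("No meta tensors found; the checkpoint should be safe to convert.")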

πŸ™

Faraday Lab org

Hello, thank you for the feedback. @Edoziem from the team will check it out and get back to you ASAP. You can definitely quantize the model.

Faraday Lab org

Save the merged LoRA model to Hugging Face first by loading the model in float16:

base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name_or_path,
    return_dict=True,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)
model = PeftModel.from_pretrained(base_model, peft_model_path, torch_dtype=torch.float16, low_cpu_mem_usage=True)
Then call merge_and_unload() and push_to_hub():

model = model.merge_and_unload()
model.push_to_hub(f"{output_dir}", use_temp_dir=False)
Then clone the Hugging Face model repo into your /models folder and use convert.py in llama.cpp, as in the sketch below.
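
Putting those steps together, here is a minimal end-to-end sketch, assuming the local paths from the original script and a hypothetical Hub repo id (your-username/aria-7b-merged); the exact llama.cpp script and binary names depend on the version you have checked out:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model_name_or_path = "./Mistral-7B-Instruct-v0.1"
peft_model_path = "./aria-7b"
hub_repo_id = "your-username/aria-7b-merged"  # hypothetical repo id

# Load everything in float16 without device_map/offload so that no
# parameter ends up as a storage-less meta tensor in the checkpoint
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name_or_path,
    return_dict=True,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)
model = PeftModel.from_pretrained(
    base_model, peft_model_path, torch_dtype=torch.float16, low_cpu_mem_usage=True
)
model = model.merge_and_unload()

tokenizer = AutoTokenizer.from_pretrained(base_model_name_or_path)

# Push the merged weights and tokenizer to the Hub
model.push_to_hub(hub_repo_id, use_temp_dir=False)
tokenizer.push_to_hub(hub_repo_id, use_temp_dir=False)

# Then, from the llama.cpp checkout, clone the repo and convert/quantize, e.g.:
#   git clone https://huggingface.co/your-username/aria-7b-merged models/aria-7b-merged
#   python convert.py models/aria-7b-merged
#   ./quantize models/aria-7b-merged/ggml-model-f16.gguf models/aria-7b-merged/ggml-model-q4_K_M.gguf q4_K_M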

Thank you for your answer. I followed your instructions, but I still get the same error after cloning the model from my HF repo and running convert.py on it. Would it be possible to distribute a quantized version in your repository?
