longchat-7b-qlora-customer-support Model Card
This repo contains the 4-bit LoRA (low-rank adaptation) adapter weights for the longchat-7b-16k model, fine-tuned on Bitext's customer support domain dataset.
The supervised fine-tuning (SFT) method follows the QLoRA paper and uses 🤗 PEFT adapters, transformers, and bitsandbytes.
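For context, here is a minimal sketch of a QLoRA-style setup with peft and bitsandbytes. It is illustrative only: the LoRA rank, target modules, and other hyperparameters are assumptions, not the exact configuration used to train this adapter.

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch

## illustrative only: load the base model in 4-bit and attach a LoRA adapter for training
## (in practice the rope condense monkey patch described below is applied before loading)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "lmsys/longchat-7b-16k", quantization_config=bnb_config, device_map="auto"
)
base = prepare_model_for_kbit_training(base)
lora_config = LoraConfig(
    r=16,                                  ## rank: assumed value
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],   ## assumed target modules
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()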
Model details
Model type: longchat-7b-qlora-customer-support is a 4-bit LoRA (low-rank adaptation) adapter supervised fine-tuned on top of the longchat-7b-16k model with Bitext's customer support domain dataset.
It is a decoder-only causal language model (LLM).
Language: English
License: apache-2.0, inherited from the base model and the dataset.
Base Model: lmsys/longchat-7b-16k
Dataset: bitext/customer-support-intent-dataset
GPU Memory Consumption: ~6GB in 4-bit mode with both the base model and the QLoRA adapter fully loaded
Install dependency packages
pip install -r requirements.txt
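If requirements.txt is not at hand, the imports used in this card suggest at least the following packages (an assumed list; the repo's requirements.txt is authoritative):

pip install torch transformers peft bitsandbytes accelerate sentencepiece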
Per the base model's instructions, the llama_condense_monkey_patch.py file is needed to load the base model properly. This file is already included in this repo.
Load the model in 4-bit mode
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from llama_condense_monkey_patch import replace_llama_with_condense
from peft import PeftModel
import torch
## config device params & load model
peft_model_id = "mingkuan/longchat-7b-qlora-customer-support"
base_model_id = "lmsys/longchat-7b-16k"
config = AutoConfig.from_pretrained(base_model_id)
replace_llama_with_condense(config.rope_condense_ratio)
tokenizer = AutoTokenizer.from_pretrained(base_model_id, use_fast=False)
kwargs = {"torch_dtype": torch.float16}
kwargs["device_map"] = "auto"
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    return_dict=True,
    trust_remote_code=True,
    quantization_config=nf4_config,  ## 4-bit loading is already set via the quantization config
    **kwargs
)
model = PeftModel.from_pretrained(model, peft_model_id)
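As a quick sanity check against the ~6GB figure quoted above (not part of the original card; assumes a CUDA device):

model.eval()  ## inference only
print(f"Allocated GPU memory: {torch.cuda.memory_allocated() / 1024**3:.1f} GiB")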
Run inference with the model
def getLLMResponse(prompt):
    device = "cuda"
    input_ids = tokenizer(prompt, return_tensors='pt').input_ids.to(device)
    output = model.generate(inputs=input_ids, do_sample=True, temperature=0.5, max_new_tokens=256)  ## sampling enabled so temperature takes effect
    promptLen = len(prompt)
    response = tokenizer.decode(output[0], skip_special_tokens=True)[promptLen:]  ## omit the user input part
    return response
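The snippet below calls a generate_prompt helper that is not defined in this card. A hypothetical stand-in is shown here; the actual prompt template used during fine-tuning may differ:

def generate_prompt(query):
    ## hypothetical Vicuna-style template; replace with the template the adapter was trained on
    return (
        "A chat between a curious user and a customer support assistant. "
        "The assistant replies with a JSON object containing the category, intent, and answer.\n"
        f"USER: {query}\nASSISTANT:"
    )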
query = 'help me to setup my new shipping address.'
response = getLLMResponse(generate_prompt(query))
print(f'\nUserInput:{query}\n\nLLM:\n{response}\n\n')
Inference Output:
{
  "category": "SHIPPING",
  "intent": "setup_new_shipping_address",
  "answer": "Sure, I can help you with that. Can you please provide me your full name, current shipping address, and the new shipping address you would like to set up?"
}
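Since the adapter answers with a structured JSON object, you may want to parse the response downstream. This is an optional addition, not part of the original card, and generation is not guaranteed to produce valid JSON:

import json

try:
    result = json.loads(response)
    print(result["category"], result["intent"])
except (json.JSONDecodeError, KeyError):
    print("Could not parse structured output:", response)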