longchat-7b-qlora-customer-support Model Card
This repo contains the 4-bit LoRA (low-rank adaptation) adapter weights for the longchat-7b-16k model, fine-tuned on Bitext's customer support domain dataset.
The supervised fine-tuning (SFT) method follows the QLoRA paper and uses 🤗 PEFT adapters, transformers, and bitsandbytes.
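For context, here is a minimal sketch of a QLoRA-style setup with peft and bitsandbytes. It is illustrative only: the LoRA rank, target modules, and other hyperparameters are assumptions, not the exact configuration used to train this adapter.

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch

## illustrative only: load the base model in 4-bit and attach a LoRA adapter for training
## (in practice the rope condense monkey patch described below is applied before loading)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "lmsys/longchat-7b-16k", quantization_config=bnb_config, device_map="auto"
)
base = prepare_model_for_kbit_training(base)
lora_config = LoraConfig(
    r=16,                                  ## rank: assumed value
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],   ## assumed target modules
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()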
Model details
Model type: longchat-7b-qlora-customer-support is a 4-bit LoRA (low-rank adaptation) adapter supervised fine-tuned on top of the longchat-7b-16k model with Bitext's customer support domain dataset.
It is a decoder-only causal language model (LLM).
Language: English
License: apache-2.0, inherited from the base model and the dataset.
Base Model: lmsys/longchat-7b-16k
Dataset: bitext/customer-support-intent-dataset
GPU Memory Consumption: ~6GB in 4-bit mode with both the base model and the QLoRA adapter fully loaded
Install dependency packages
pip install -r requirements.txt
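If requirements.txt is not at hand, the imports used in this card suggest at least the following packages (an assumed list; the repo's requirements.txt is authoritative):

pip install torch transformers peft bitsandbytes accelerate sentencepiece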
Per the base model's instructions, the llama_condense_monkey_patch.py file is needed to load the base model properly. This file is already included in this repo.
Load the model in 4-bit mode
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from llama_condense_monkey_patch import replace_llama_with_condense
from peft import PeftModel
import torch
## config device params & load model
peft_model_id = "mingkuan/longchat-7b-qlora-customer-support"
base_model_id = "lmsys/longchat-7b-16k"
config = AutoConfig.from_pretrained(base_model_id)
replace_llama_with_condense(config.rope_condense_ratio)
tokenizer = AutoTokenizer.from_pretrained(base_model_id, use_fast=False)
kwargs = {"torch_dtype": torch.float16}
kwargs["device_map"] = "auto"
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    return_dict=True,
    trust_remote_code=True,
    quantization_config=nf4_config,  ## 4-bit loading is already set via the quantization config
    **kwargs
)
model = PeftModel.from_pretrained(model, peft_model_id)
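As a quick sanity check against the ~6GB figure quoted above (not part of the original card; assumes a CUDA device):

model.eval()  ## inference only
print(f"Allocated GPU memory: {torch.cuda.memory_allocated() / 1024**3:.1f} GiB")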
Run inference with the model
def getLLMResponse(prompt):
    device = "cuda"
    input_ids = tokenizer(prompt, return_tensors='pt').input_ids.to(device)
    output = model.generate(inputs=input_ids, do_sample=True, temperature=0.5, max_new_tokens=256)  ## sampling enabled so temperature takes effect
    promptLen = len(prompt)
    response = tokenizer.decode(output[0], skip_special_tokens=True)[promptLen:]  ## omit the user input part
    return response
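The snippet below calls a generate_prompt helper that is not defined in this card. A hypothetical stand-in is shown here; the actual prompt template used during fine-tuning may differ:

def generate_prompt(query):
    ## hypothetical Vicuna-style template; replace with the template the adapter was trained on
    return (
        "A chat between a curious user and a customer support assistant. "
        "The assistant replies with a JSON object containing the category, intent, and answer.\n"
        f"USER: {query}\nASSISTANT:"
    )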
query = 'help me to setup my new shipping address.'
response = getLLMResponse(generate_prompt(query))
print(f'\nUserInput:{query}\n\nLLM:\n{response}\n\n')
Inference Output:
{
  "category": "SHIPPING",
  "intent": "setup_new_shipping_address",
  "answer": "Sure, I can help you with that. Can you please provide me your full name, current shipping address, and the new shipping address you would like to set up?"
}
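Since the adapter answers with a structured JSON object, you may want to parse the response downstream. This is an optional addition, not part of the original card, and generation is not guaranteed to produce valid JSON:

import json

try:
    result = json.loads(response)
    print(result["category"], result["intent"])
except (json.JSONDecodeError, KeyError):
    print("Could not parse structured output:", response)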