
longchat-7b-qlora-customer-support Model Card

This repo contains the 4-bit LoRA (low-rank) adapter weights for the longchat-7b-16k model, fine-tuned on Bitext's customer support domain dataset.

The Supervised Fine-Tuning (SFT) method follows the QLoRA paper and uses 🤗 PEFT adapters, transformers, and bitsandbytes.
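
For context, a typical QLoRA fine-tuning setup pairs a 4-bit BitsAndBytesConfig with a LoraConfig from 🤗 PEFT. The sketch below is illustrative only; the actual hyperparameters used to train this adapter (rank, alpha, target modules, etc.) are not documented here and are assumptions.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

## 4-bit NF4 quantization for the frozen base model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "lmsys/longchat-7b-16k", quantization_config=bnb_config, device_map="auto"
)
base = prepare_model_for_kbit_training(base)

## LoRA adapter on top of the quantized base (values below are assumptions)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()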

Model details

Model type: longchat-7b-qlora-customer-support is a 4-bit LoRA (low-rank) adapter, supervised fine-tuned on top of the longchat-7b-16k model with Bitext's customer support domain dataset.

It is a decoder-only causal language model (Causal LM).

Language: English

License: apache-2.0, inherited from the base model and the dataset.

Base Model: lmsys/longchat-7b-16k

Dataset: bitext/customer-support-intent-dataset

GPU Memory Consumption: ~6GB of GPU memory in 4-bit mode with both models (base + QLoRA adapter) fully loaded.
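
If you want to verify this figure on your own hardware, a quick check (assuming a single CUDA device) is:

import torch

torch.cuda.reset_peak_memory_stats()
## ... load the base model and the QLoRA adapter as shown below ...
print(f"Peak GPU memory: {torch.cuda.max_memory_allocated() / 1024**3:.1f} GB")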

Install dependency packages

pip install -r requirements.txt

Per the base model's instructions, the llama_condense_monkey_patch.py file is needed to load the base model properly. This file is already included in this repo.
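
The pinned versions live in requirements.txt; as a rough guide, the code in this card relies on at least the packages below (names only, taken from the imports in the examples):

transformers
peft
bitsandbytes
accelerate      # needed for device_map="auto"
torch
sentencepiece   # needed for the slow LLaMA tokenizer (use_fast=False)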

Load the model in 4-bit mode


from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from llama_condense_monkey_patch import replace_llama_with_condense
from peft import PeftConfig, PeftModel
import torch

## config device params & load model
peft_model_id = "mingkuan/longchat-7b-qlora-customer-support"
base_model_id = "lmsys/longchat-7b-16k"

## apply the RoPE condense monkey patch required by longchat-7b-16k
config = AutoConfig.from_pretrained(base_model_id)
replace_llama_with_condense(config.rope_condense_ratio)
tokenizer = AutoTokenizer.from_pretrained(base_model_id, use_fast=False)

## 4-bit NF4 quantization with double quantization and bfloat16 compute
kwargs = {"torch_dtype": torch.float16, "device_map": "auto"}
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

## load the quantized base model, then attach the QLoRA adapter
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    return_dict=True,
    trust_remote_code=True,
    quantization_config=nf4_config,
    **kwargs
)
model = PeftModel.from_pretrained(model, peft_model_id)
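
Optionally, a quick sanity check after loading confirms the ~6GB footprint mentioned above:

## optional: confirm the 4-bit footprint of the fully loaded (base + qlora) stack
print(f"GPU memory allocated: {torch.cuda.memory_allocated() / 1024**3:.1f} GB")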

Run inference with the model


def getLLMResponse(prompt):
    ## tokenize the prompt and move it to the GPU
    device = "cuda"
    input_ids = tokenizer(prompt, return_tensors='pt').input_ids.to(device)
    ## note: temperature only takes effect when sampling (do_sample=True); greedy decoding is used here
    output = model.generate(inputs=input_ids, temperature=0.5, max_new_tokens=256)
    ## decode and strip the echoed prompt so only the model's answer remains
    promptLen = len(prompt)
    response = tokenizer.decode(output[0], skip_special_tokens=True)[promptLen:]
    return response

query = 'help me to setup my new shipping address.'
response = getLLMResponse(generate_prompt(query))
print(f'\nUserInput:{query}\n\nLLM:\n{response}\n\n')
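
The generate_prompt helper used above is not shown in this card; it is expected to wrap the raw user query in the instruction template the adapter was trained with. A minimal placeholder, assuming a plain instruction/response layout (the actual template may differ), could look like this:

## placeholder prompt builder -- the real fine-tuning template may differ
def generate_prompt(query):
    return f"### Instruction:\n{query}\n\n### Response:\n"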

Inference Output:

{
"category": "SHIPPING",
"intent": "setup_new_shipping_address",
"answer": "Sure, I can help you with that. Can you please provide me your full name, current shipping address, and the new shipping address you would like to set up?"
}
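
Because the adapter answers with a structured JSON-style payload (category, intent, answer), the response can be parsed downstream; a minimal sketch, assuming the output is valid JSON:

import json

## parse the structured response (assumes the model emitted valid JSON)
result = json.loads(response)
print(result["intent"], "->", result["answer"])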