---
license: llama2
language:
  - en
tags:
  - finance
---

InvestLM

This is the repo for InvestLM, a new financial-domain large language model tuned from LLaMA-33B [1] on a carefully curated instruction dataset related to financial investment. We provide guidance on how to use InvestLM for inference.

GitHub link: InvestLM

About AWQ

AWQ [2] is an efficient, accurate, and blazing-fast low-bit weight quantization method, currently supporting 4-bit quantization. Compared to GPTQ, it offers faster Transformers-based inference.

It is supported by:

  1. AutoAWQ: AutoAWQ is an easy-to-use package for AWQ 4-bit quantized models. AutoAWQ GitHub link
pip install autoawq
  2. vLLM: vLLM is a Python library that ships pre-compiled C++ and CUDA (11.8) binaries. It can be used for offline inference or to serve the model behind an API endpoint, and it offers continuous batching with much higher (~10x) throughput, at the cost of a more involved setup. vLLM docs
# (Optional) Create a new conda environment.
conda create -n myenv python=3.8 -y
conda activate myenv
# Install vLLM.
pip install vllm
  3. Additional options: these are Python libraries that integrate with vLLM.
    • FastChat: FastChat is an open platform for training, serving, and evaluating large language model-based chatbots. vLLM can be used as an optimized worker implementation in FastChat. FastChat GitHub
    • aphrodite-engine: Aphrodite is the official backend engine for PygmalionAI, designed to serve as the inference endpoint for the PygmalionAI website. Aphrodite-engine GitHub

Inference

Please use the following command to log in to Hugging Face first.

huggingface-cli login
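
Alternatively, you can authenticate from Python using the login helper in the huggingface_hub package (installed alongside transformers); the token below is a placeholder, not a real credential.

# Equivalent to `huggingface-cli login`, but from Python.
from huggingface_hub import login

login(token="hf_xxx")  # placeholder; replace with your own access token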

Prompt template

PROMPT_DICT = {
    "prompt_input": (
        "Below is an instruction that describes a task, paired with further context. "
        "Write a response that appropriately completes the request.\n\n"
        "Instruction:\n{instruction}\n\n Input:\n{input}\n\n Response:"
    ),
    "prompt_no_input": (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        "Instruction:\n{instruction}\n\nResponse:"
    ),
}

How to use this AWQ model from Python code

# Please run the following command in CLI.
# pip install autoawq transformers
from transformers import AutoTokenizer
from awq import AutoAWQForCausalLM
# Inference Template 
PROMPT_DICT = {
    "prompt_input": (
        "Below is an instruction that describes a task, paired with further context. "
        "Write a response that appropriately completes the request.\n\n"
        "Instruction:\n{instruction}\n\n Input:\n{input}\n\n Response:"
    ),
    "prompt_no_input": (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        "Instruction:\n{instruction}\n\nResponse:"
    ),
}
    
def generate_prompt(instruction, input=None):
    if input:
        return PROMPT_DICT["prompt_input"].format(instruction=instruction,input=input)
    else:
        return PROMPT_DICT["prompt_no_input"].format(instruction=instruction)
tokenizer = AutoTokenizer.from_pretrained('yixuantt/InvestLM-awq', use_fast=False)
tokenizer.pad_token = tokenizer.unk_token
model = AutoAWQForCausalLM.from_quantized('yixuantt/InvestLM-awq', fuse_layers=False)
print("\n\n*** Generate:")
tokens = tokenizer(
    generate_prompt(instruction="Tell me about finance."),
    return_tensors='pt'
).input_ids.cuda()
# Generate output
generation_output = model.generate(
    tokens,
    do_sample=True,
    temperature=0.1,
    top_p=0.75,
    top_k=40,
    repetition_penalty=1.1,
    max_new_tokens=512
)
print("Output: ", tokenizer.decode(generation_output[0]))

Serving this model from vLLM

  • Please ensure you are using vLLM version 0.2 or later.
  • When using vLLM as a server, pass the --quantization awq parameter.
  • At the time of writing, vLLM AWQ does not support loading models in bfloat16, so to ensure compatibility with all models, also pass --dtype float16. For example:
python3 -m vllm.entrypoints.api_server --model 'yixuantt/InvestLM-awq' --quantization awq --dtype float16
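
Once the server is running, you can query it over HTTP. The sketch below is only an illustration: it assumes the default host and port (localhost:8000) and the /generate route exposed by vLLM's demo api_server, so adjust the URL and sampling parameters as needed.

# Minimal HTTP client for the demo api_server (assumed to be listening on localhost:8000).
import requests

payload = {
    "prompt": "Below is an instruction that describes a task. "
              "Write a response that appropriately completes the request.\n\n"
              "Instruction:\nTell me about finance.\n\nResponse:",
    "temperature": 0.1,
    "top_p": 0.75,
    "max_tokens": 512,
}
response = requests.post("http://localhost:8000/generate", json=payload)
print(response.json()["text"][0])  # the server returns a list of generated texts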

When using vLLM from Python code, again pass the quantization="awq" and dtype="float16" parameters.

from vllm import LLM, SamplingParams
# Inference Template 
PROMPT_DICT = {
    "prompt_input": (
        "Below is an instruction that describes a task, paired with further context. "
        "Write a response that appropriately completes the request.\n\n"
        "Instruction:\n{instruction}\n\n Input:\n{input}\n\n Response:"
    ),
    "prompt_no_input": (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        "Instruction:\n{instruction}\n\nResponse:"
    ),
}
    
def generate_prompt(instruction, input=None):
    if input:
        return PROMPT_DICT["prompt_input"].format(instruction=instruction,input=input)
    else:
        return PROMPT_DICT["prompt_no_input"].format(instruction=instruction)
question = generate_prompt(instruction="Tell me about finance.")
prompts = [
   question,
]
sampling_params = SamplingParams(temperature=0.1, top_p=0.75)
llm = LLM(model="yixuantt/InvestLM-awq", quantization="awq", dtype="float16")
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

References

[1] Touvron, Hugo, et al. "LLaMA: Open and Efficient Foundation Language Models." arXiv preprint arXiv:2302.13971 (2023).

[2] Lin, Ji, et al. "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration." arXiv preprint arXiv:2306.00978 (2023).

