InvestLM

This is the repository for InvestLM, a new financial-domain large language model tuned from LLaMA-33B [1] on a carefully curated instruction dataset related to financial investment. This repository hosts the AWQ-quantized version; below we provide guidance on how to use InvestLM for inference.

Github Link: InvestLM

About AWQ

AWQ [2] is an efficient, accurate, and blazing-fast low-bit weight quantization method, currently supporting 4-bit quantization. Compared to GPTQ, it offers faster Transformers-based inference.

It is supported by:

  1. AutoAWQ: AutoAWQ is an easy-to-use package for AWQ 4-bit quantized models. AutoAWQ Github link
pip install autoawq
  2. vLLM: vLLM is a Python library that ships pre-compiled C++ and CUDA (11.8) binaries. It can be used for offline inference or to serve a model as an endpoint, and it offers advanced continuous batching with much higher (~10x) throughput, at the cost of a more involved setup. vLLM Doc
# (Optional) Create a new conda environment.
conda create -n myenv python=3.8 -y
conda activate myenv
# Install vLLM.
pip install vllm
  3. Additional options: Python libraries that integrate with vLLM.
    • FastChat: FastChat is an open platform for training, serving, and evaluating large language model based chatbots. vLLM can be used as an optimized worker implementation in FastChat; a launch sketch follows this list. FastChat GitHub
    • aphrodite-engine: Aphrodite is the official backend engine for PygmalionAI, designed to serve as the inference endpoint for the PygmalionAI website. Aphrodite-engine GitHub
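For FastChat, a minimal launch sketch (assuming recent FastChat and vLLM versions; the --quantization and --dtype flags are forwarded to the vLLM engine, and exact flag names may differ between releases):

# Start the controller, a vLLM worker serving InvestLM-AWQ, and the web UI in separate terminals.
python3 -m fastchat.serve.controller
python3 -m fastchat.serve.vllm_worker --model-path yixuantt/InvestLM-awq --quantization awq --dtype float16
python3 -m fastchat.serve.gradio_web_server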

Inference

Please use the following command to log in to Hugging Face first.

huggingface-cli login
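If you prefer to authenticate from Python instead of the CLI, here is a minimal sketch using huggingface_hub (the token value below is a placeholder):

from huggingface_hub import login

# Paste your own Hugging Face access token here (placeholder value).
login(token="hf_xxxxxxxxxxxxxxxx")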

Prompt template

PROMPT_DICT = {
    "prompt_input": (
        "Below is an instruction that describes a task, paired with further context. "
        "Write a response that appropriately completes the request.\n\n"
        "Instruction:\n{instruction}\n\n Input:\n{input}\n\n Response:"
    ),
    "prompt_no_input": (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        "Instruction:\n{instruction}\n\nResponse:"
    ),
}
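For illustration, this is how the template is filled in for a hypothetical instruction/context pair (both strings below are made up):

# Fill the "prompt_input" template with a sample instruction and context.
example = PROMPT_DICT["prompt_input"].format(
    instruction="Summarize the main risks mentioned in the filing.",
    input="The company reports rising interest-rate exposure and supply-chain delays.",
)
# Prints the full prompt, ending with "Response:" where the model is expected to continue.
print(example)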

How to use this AWQ model from Python code

# Please run the following command in CLI.
# pip install autoawq transformers
from transformers import AutoTokenizer
from awq import AutoAWQForCausalLM
# Inference Template 
PROMPT_DICT = {
    "prompt_input": (
        "Below is an instruction that describes a task, paired with further context. "
        "Write a response that appropriately completes the request.\n\n"
        "Instruction:\n{instruction}\n\n Input:\n{input}\n\n Response:"
    ),
    "prompt_no_input": (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        "Instruction:\n{instruction}\n\nResponse:"
    ),
}
    
def generate_prompt(instruction, input=None):
    if input:
        return PROMPT_DICT["prompt_input"].format(instruction=instruction,input=input)
    else:
        return PROMPT_DICT["prompt_no_input"].format(instruction=instruction)
# Load the tokenizer and the AWQ-quantized model from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained('yixuantt/InvestLM-awq', use_fast=False)
tokenizer.pad_token = tokenizer.unk_token
model = AutoAWQForCausalLM.from_quantized('yixuantt/InvestLM-awq', fuse_layers=False)
print("\n\n*** Generate:")
tokens = tokenizer(
    generate_prompt(instruction="Tell me about finance."),
    return_tensors='pt'
).input_ids.cuda()
# Generate output
generation_output = model.generate(
    tokens,
    do_sample=True,
    temperature=0.1,
    top_p=0.75,
    top_k=40,
    repetition_penalty=1.1,
    max_new_tokens=512
)
print("Output: ", tokenizer.decode(generation_output[0]))

Serving this model from vLLM

  • Please ensure you are using vLLM version 0.2 or later.
  • When using vLLM as a server, pass the --quantization awq parameter.
  • At the time of writing, vLLM AWQ does not support loading models in bfloat16, so to ensure compatibility with all models, also pass --dtype float16. For example:
python3 -m vllm.entrypoints.api_server --model yixuantt/InvestLM-awq --quantization awq --dtype float16
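Once the server is running, you can send prompts over HTTP. This is a minimal sketch using the requests library; the /generate route and its JSON fields follow vLLM's demo api_server and may differ between vLLM versions:

import requests

# Build the no-input instruction prompt used elsewhere in this card.
prompt = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "Instruction:\nTell me about finance.\n\nResponse:"
)
# POST to the demo server started above (default port 8000).
response = requests.post(
    "http://localhost:8000/generate",
    json={"prompt": prompt, "temperature": 0.1, "top_p": 0.75, "max_tokens": 512},
)
print(response.json())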

When using vLLM from Python code, again pass the quantization="awq" and dtype="float16" parameters:

from vllm import LLM, SamplingParams
# Inference Template 
PROMPT_DICT = {
    "prompt_input": (
        "Below is an instruction that describes a task, paired with further context. "
        "Write a response that appropriately completes the request.\n\n"
        "Instruction:\n{instruction}\n\n Input:\n{input}\n\n Response:"
    ),
    "prompt_no_input": (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        "Instruction:\n{instruction}\n\nResponse:"
    ),
}
    
def generate_prompt(instruction, input=None):
    if input:
        return PROMPT_DICT["prompt_input"].format(instruction=instruction,input=input)
    else:
        return PROMPT_DICT["prompt_no_input"].format(instruction=instruction)
question = generate_prompt(instruction="Tell me about finance.")
prompts = [
   question,
]
sampling_params = SamplingParams(temperature=0.1, top_p=0.75, max_tokens=512)  # raise max_tokens; vLLM's default (16) would truncate the answer
llm = LLM(model="yixuantt/InvestLM-awq", quantization="awq", dtype="float16")
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

References

[1] Touvron, Hugo, et al. "Llama: Open and efficient foundation language models." arXiv preprint arXiv:2302.13971 (2023).

[2] Lin, Ji, et al. "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration." arXiv preprint arXiv:2306.00978 (2023).


License: llama2
