
AWQ 4-bit Inference

We integrated AWQ (Activation-aware Weight Quantization) into FastChat to provide efficient and accurate 4-bit LLM inference.

Install AWQ

Set up the environment (see the llm-awq repository at https://github.com/mit-han-lab/llm-awq for more details):

conda create -n fastchat-awq python=3.10 -y
conda activate fastchat-awq
# cd /path/to/FastChat
pip install --upgrade pip    # enable PEP 660 support
pip install -e .             # install fastchat

git clone https://github.com/mit-han-lab/llm-awq repositories/llm-awq
cd repositories/llm-awq
pip install -e .             # install awq package

cd awq/kernels
python setup.py install      # install awq CUDA kernels
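
After installation, a quick import check can confirm that the packages are visible in the active conda environment. The following is a minimal sketch using only the standard library. The module names `fastchat` and `awq` follow from the two `pip install -e .` steps above; `awq_inference_engine` as the name of the compiled kernel extension is an assumption and should be checked against `awq/kernels/setup.py`.

```python
import importlib.util


def check_modules(names):
    """Return {module_name: bool} telling whether each top-level module can be located."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # "awq_inference_engine" is an ASSUMED name for the compiled CUDA kernel extension.
    for name, found in check_modules(["fastchat", "awq", "awq_inference_engine"]).items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

A module reported as missing usually means the corresponding install step ran in a different environment or did not complete.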

Chat with the CLI

# Download the quantized model from Hugging Face into models/
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/mit-han-lab/vicuna-7b-v1.3-4bit-g128-awq models/vicuna-7b-v1.3-4bit-g128-awq

# --model-path points at the quantized model directory;
# a specific quantized checkpoint can be selected with --awq-ckpt
python3 -m fastchat.serve.cli \
    --model-path models/vicuna-7b-v1.3-4bit-g128-awq \
    --awq-wbits 4 \
    --awq-groupsize 128
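
When the CLI is launched from a script, for example to switch between several quantized models, the same command can be assembled programmatically. The sketch below is one possible way to do this. It uses only the flags mentioned above (`--model-path`, `--awq-wbits`, `--awq-groupsize`, `--awq-ckpt`); the helper name `build_cli_command` is hypothetical and not part of FastChat.

```python
import shlex


def build_cli_command(model_path, wbits=4, groupsize=128, awq_ckpt=None):
    """Assemble the argument list for the FastChat CLI with AWQ flags (hypothetical helper)."""
    cmd = [
        "python3", "-m", "fastchat.serve.cli",
        "--model-path", str(model_path),
        "--awq-wbits", str(wbits),
        "--awq-groupsize", str(groupsize),
    ]
    # Only add --awq-ckpt when a specific checkpoint is requested.
    if awq_ckpt is not None:
        cmd += ["--awq-ckpt", str(awq_ckpt)]
    return cmd


cmd = build_cli_command("models/vicuna-7b-v1.3-4bit-g128-awq")
print(shlex.join(cmd))  # shell-quoted form of the command, for logging or copy-paste
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)` to start the chat session.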

Benchmark

  • With 4-bit weight quantization, AWQ makes it possible to run larger language models within device memory limits and substantially speeds up token generation. All benchmarks use a group size of 128.

  • Benchmark on NVIDIA RTX A6000:

    | Model | Bits | Max Memory (MiB) | Speed (ms/token) | AWQ Speedup |
    | --- | --- | --- | --- | --- |
    | vicuna-7b | 16 | 13543 | 26.06 | / |
    | vicuna-7b | 4 | 5547 | 12.43 | 2.1x |
    | llama2-7b-chat | 16 | 13543 | 27.14 | / |
    | llama2-7b-chat | 4 | 5547 | 12.44 | 2.2x |
    | vicuna-13b | 16 | 25647 | 44.91 | / |
    | vicuna-13b | 4 | 9355 | 17.30 | 2.6x |
    | llama2-13b-chat | 16 | 25647 | 47.28 | / |
    | llama2-13b-chat | 4 | 9355 | 20.28 | 2.3x |
  • NVIDIA RTX 4090:

    | Model | AWQ 4-bit Speed (ms/token) | FP16 Speed (ms/token) | AWQ Speedup |
    | --- | --- | --- | --- |
    | vicuna-7b | 8.61 | 19.09 | 2.2x |
    | llama2-7b-chat | 8.66 | 19.97 | 2.3x |
    | vicuna-13b | 12.17 | OOM | / |
    | llama2-13b-chat | 13.54 | OOM | / |
  • NVIDIA Jetson Orin:

    | Model | AWQ 4-bit Speed (ms/token) | FP16 Speed (ms/token) | AWQ Speedup |
    | --- | --- | --- | --- |
    | vicuna-7b | 65.34 | 93.12 | 1.4x |
    | llama2-7b-chat | 75.11 | 104.71 | 1.4x |
    | vicuna-13b | 115.40 | OOM | / |
    | llama2-13b-chat | 136.81 | OOM | / |
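
The "AWQ Speedup" column is the ratio of FP16 latency to AWQ 4-bit latency per token. As a small worked example, the sketch below recomputes that ratio, the equivalent throughput in tokens per second, and the memory reduction from the RTX A6000 numbers listed above. The function names are illustrative only and do not belong to any library.

```python
def speedup(fp16_ms_per_token, awq_ms_per_token):
    """Speedup of AWQ over FP16: ratio of per-token latencies."""
    return fp16_ms_per_token / awq_ms_per_token


def tokens_per_second(ms_per_token):
    """Convert per-token latency in milliseconds to throughput."""
    return 1000.0 / ms_per_token


def memory_reduction(fp16_mib, awq_mib):
    """How many times smaller the peak memory is with AWQ 4-bit."""
    return fp16_mib / awq_mib


# (model, FP16 ms/token, AWQ ms/token, FP16 MiB, AWQ MiB), copied from the RTX A6000 table
a6000 = [
    ("vicuna-7b", 26.06, 12.43, 13543, 5547),
    ("llama2-7b-chat", 27.14, 12.44, 13543, 5547),
    ("vicuna-13b", 44.91, 17.30, 25647, 9355),
    ("llama2-13b-chat", 47.28, 20.28, 25647, 9355),
]

for model, fp16_ms, awq_ms, fp16_mib, awq_mib in a6000:
    print(
        f"{model}: {speedup(fp16_ms, awq_ms):.1f}x faster, "
        f"{tokens_per_second(awq_ms):.1f} tokens/s, "
        f"{memory_reduction(fp16_mib, awq_mib):.1f}x less memory"
    )
```

For rows marked OOM, no FP16 latency was measured, so no speedup can be computed.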