# GPTQ 4bit Inference

FastChat supports GPTQ 4-bit inference via [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa).

1. Windows users: use the `old-cuda` branch.
2. Linux users: the `fastest-inference-4bit` branch is recommended.

## Install

Set up the environment:
```bash
# cd /path/to/FastChat
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa.git repositories/GPTQ-for-LLaMa
cd repositories/GPTQ-for-LLaMa
# Windows users should use the `old-cuda` branch
git switch fastest-inference-4bit
# Install the `quant_cuda` extension into FastChat's virtualenv
python3 setup_cuda.py install
pip3 install texttable
```
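
If the build succeeded, the CUDA extension should be importable from the same virtualenv. A minimal sanity check, assuming the extension installed by `setup_cuda.py` is importable as `quant_cuda`:

```bash
# Should print the message without an ImportError
# (assumes the extension module built by setup_cuda.py is named `quant_cuda`)
python3 -c "import quant_cuda; print('quant_cuda extension loaded')"
```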

Chat with the CLI:
```bash
python3 -m fastchat.serve.cli \
    --model-path models/vicuna-7B-1.1-GPTQ-4bit-128g \
    --gptq-wbits 4 \
    --gptq-groupsize 128
```

Start the model worker:
```bash
# Download the quantized model from Hugging Face
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/TheBloke/vicuna-7B-1.1-GPTQ-4bit-128g models/vicuna-7B-1.1-GPTQ-4bit-128g

python3 -m fastchat.serve.model_worker \
    --model-path models/vicuna-7B-1.1-GPTQ-4bit-128g \
    --gptq-wbits 4 \
    --gptq-groupsize 128

# You can also explicitly specify which quantized checkpoint file to load
python3 -m fastchat.serve.model_worker \
    --model-path models/vicuna-7B-1.1-GPTQ-4bit-128g \
    --gptq-ckpt models/vicuna-7B-1.1-GPTQ-4bit-128g/vicuna-7B-1.1-GPTQ-4bit-128g.safetensors \
    --gptq-wbits 4 \
    --gptq-groupsize 128 \
    --gptq-act-order
```
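
The model worker needs a running controller to register with. A minimal end-to-end check, assuming the default controller address and that the worker registers under the model path's basename:

```bash
# Terminal 1: start the controller (listens on the default address)
python3 -m fastchat.serve.controller

# Terminal 2: start the model worker as shown above

# Terminal 3: send a test prompt to the registered worker
python3 -m fastchat.serve.test_message --model-name vicuna-7B-1.1-GPTQ-4bit-128g
```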

## Benchmark

| LLaMA-13B | branch                 | Bits | group-size | memory (MiB) | PPL (c4) | Median (s/token) | act-order | speedup |
| --------- | ---------------------- | ---- | ---------- | ------------ | -------- | ---------------- | --------- | ------- |
| FP16      | fastest-inference-4bit | 16   | -          | 26634       | 6.96    | 0.0383          | -         | 1x       |
| GPTQ      | triton                 | 4    | 128        | 8590        | 6.97    | 0.0551          | -         | 0.69x    |
| GPTQ      | fastest-inference-4bit | 4    | 128        | 8699        | 6.97    | 0.0429          | true      | 0.89x    |
| GPTQ      | fastest-inference-4bit | 4    | 128        | 8699        | 7.03    | 0.0287          | false     | 1.33x    |
| GPTQ      | fastest-inference-4bit | 4    | -1         | 8448        | 7.12    | 0.0284          | false     | 1.44x    |
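
To run a similar benchmark yourself, one option is GPTQ-for-LLaMa's own benchmark entry point; a rough sketch, assuming `llama.py` and its `--load`/`--benchmark` flags work as documented in that repository (the model path and checkpoint name below are illustrative):

```bash
# Run inside repositories/GPTQ-for-LLaMa; model path and checkpoint are placeholders
CUDA_VISIBLE_DEVICES=0 python3 llama.py /path/to/llama-13b c4 \
    --wbits 4 \
    --groupsize 128 \
    --load llama13b-4bit-128g.safetensors \
    --benchmark 128 --check
```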