---
license: apache-2.0
pipeline_tag: text-generation
---
<div align="center">
  <img src="https://raw.githubusercontent.com/InternLM/lmdeploy/0be9e7ab6fe9a066cfb0a09d0e0c8d2e28435e58/resources/lmdeploy-logo.svg" width="450"/>
</div>

# INT4 Weight-only Quantization and Deployment (W4A16)

LMDeploy adopts the [AWQ](https://arxiv.org/abs/2306.00978) algorithm for 4-bit weight-only quantization. With its high-performance CUDA kernels, inference with the 4-bit quantized model is up to 2.4x faster than FP16.
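
To make the idea concrete, below is a minimal sketch of group-wise 4-bit weight quantization in plain NumPy. It is only schematic: AWQ additionally applies activation-aware scaling before quantizing, and LMDeploy's CUDA kernels pack the 4-bit weights and dequantize them on the fly against FP16 activations (hence "W4A16"); none of that is shown here.

```python
# Schematic group-wise 4-bit weight quantization (not LMDeploy's AWQ kernel).
import numpy as np

def quantize_w4(w: np.ndarray, group_size: int = 128):
    """Quantize an FP32 weight vector to unsigned 4-bit values, per group."""
    w = w.reshape(-1, group_size)
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = np.maximum((w_max - w_min) / 15.0, 1e-8)   # 4 bits -> 16 levels
    zero = np.round(-w_min / scale)                    # per-group zero point
    q = np.clip(np.round(w / scale) + zero, 0, 15).astype(np.uint8)
    return q, scale, zero

def dequantize_w4(q, scale, zero):
    """Recover approximate FP weights; at inference the kernel does this
    on the fly and multiplies with FP16 activations."""
    return (q.astype(np.float32) - zero) * scale

w = np.random.randn(4096).astype(np.float32)
q, scale, zero = quantize_w4(w)
w_hat = dequantize_w4(q, scale, zero).reshape(-1)
print("max abs quantization error:", np.abs(w - w_hat).max())
```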

LMDeploy supports the following NVIDIA GPUs for W4A16 inference:

- Turing (sm75): 20 series, T4
- Ampere (sm80, sm86): 30 series, A10, A16, A30, A100
- Ada Lovelace (sm89): 40 series

Before proceeding with quantization and inference, please make sure that lmdeploy is installed:

```shell
pip install lmdeploy[all]
```

This article comprises the following sections:

<!-- toc -->

- [Inference](#inference)
- [Evaluation](#evaluation)
- [Service](#service)

<!-- tocstop -->
## Inference

With the following code, you can perform batched offline inference with the quantized model:

```python
from lmdeploy import pipeline, TurbomindEngineConfig
engine_config = TurbomindEngineConfig(model_format='awq')
pipe = pipeline("internlm/internlm2-chat-7b-4bits", backend_config=engine_config)
response = pipe(["Hi, pls intro yourself", "Shanghai is"])
print(response)
```

For more information about the pipeline parameters, please refer to the [pipeline documentation](https://github.com/InternLM/lmdeploy/blob/main/docs/en/inference/pipeline.md).
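
As a concrete illustration of those parameters, the sketch below tunes both the engine and the sampling settings. The extra parameter names (`session_len`, `cache_max_entry_count`, `max_new_tokens`, `top_p`, `temperature`) follow the linked documentation but may differ across lmdeploy versions, so treat it as a starting point rather than a definitive reference.

```python
from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig

# Engine options: AWQ weight format plus (illustrative) context length and
# KV-cache memory fraction; check the pipeline docs for your lmdeploy version.
engine_config = TurbomindEngineConfig(
    model_format='awq',
    session_len=4096,
    cache_max_entry_count=0.5,
)

# Sampling options for generation (names assumed from the lmdeploy docs).
gen_config = GenerationConfig(
    max_new_tokens=256,
    top_p=0.8,
    temperature=0.7,
)

pipe = pipeline("internlm/internlm2-chat-7b-4bits", backend_config=engine_config)
response = pipe(["Hi, pls intro yourself", "Shanghai is"], gen_config=gen_config)
print(response)
```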

## Evaluation

Please refer to [this guide](https://opencompass.readthedocs.io/en/latest/advanced_guides/evaluation_turbomind.html) for model evaluation with LMDeploy.

## Service

LMDeploy's `api_server` enables models to be easily packaged into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below is an example of starting the service:

```shell
lmdeploy serve api_server internlm/internlm2-chat-7b-4bits --backend turbomind --model-format awq
```

The default port of `api_server` is `23333`. After the server is launched, you can communicate with it in the terminal through `api_client`:

```shell
lmdeploy serve api_client http://0.0.0.0:23333
```
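
Since the RESTful APIs are compatible with OpenAI's interfaces, you can also query the running server with the official `openai` Python client. The sketch below assumes the default port, that the served model keeps its Hugging Face name, and that the server was started without authentication (so `api_key` is just a placeholder).

```python
# Query the OpenAI-compatible endpoint exposed by `lmdeploy serve api_server`.
# Requires `pip install openai` (v1 or later).
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:23333/v1", api_key="none")

# List the models the server actually exposes, in case the name differs.
print([m.id for m in client.models.list().data])

completion = client.chat.completions.create(
    model="internlm/internlm2-chat-7b-4bits",  # assumed served model name
    messages=[{"role": "user", "content": "Hi, pls intro yourself"}],
)
print(completion.choices[0].message.content)
```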

You can browse and try out the `api_server` APIs through the Swagger UI at `http://0.0.0.0:23333`, or read the API specification [here](https://github.com/InternLM/lmdeploy/blob/main/docs/en/serving/restful_api.md).