---
license: apache-2.0
pipeline_tag: text-generation
---

<div align="center">
  <img src="https://raw.githubusercontent.com/InternLM/lmdeploy/0be9e7ab6fe9a066cfb0a09d0e0c8d2e28435e58/resources/lmdeploy-logo.svg" width="450"/>
</div>

[LMDeploy](https://github.com/InternLM/lmdeploy) supports 4-bit weight-only LLM inference. The minimum requirement is an NVIDIA GPU of compute capability sm80 or above, such as the A10, A100, and GeForce 30/40 series.

Before proceeding with inference on `internlm-chat-20b-4bit`, please make sure lmdeploy is installed:

```shell
pip install 'lmdeploy>=0.0.11'
```
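
If you are unsure whether your GPU meets the sm80 requirement, the quick check below may help; it assumes a CUDA-enabled PyTorch is available in the environment and is only a convenience, not part of the lmdeploy workflow.

```shell
# prints the compute capability of GPU 0, e.g. (8, 0) for A100 or (8, 6) for A10 / GeForce 30 series
python3 -c "import torch; print(torch.cuda.get_device_capability(0))"
```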

## Inference

Please download the `internlm-chat-20b-4bit` model as follows:

```shell
git-lfs install
git clone https://huggingface.co/internlm/internlm-chat-20b-4bit
```

As shown in the commands below, first convert the model's layout with `turbomind.deploy`, and then you can interact with the AI assistant in the terminal:

```shell
# Convert the model's layout and store it in the default path, ./workspace.
python3 -m lmdeploy.serve.turbomind.deploy \
    --model-name internlm-chat-20b \
    --model-path ./internlm-chat-20b-4bit \
    --model-format awq \
    --group-size 128

# inference
python3 -m lmdeploy.turbomind.chat ./workspace
```
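
After the conversion finishes, the converted weights and the runtime configuration referenced later in this card should be located under `./workspace/triton_models/weights`. A quick sanity check:

```shell
# the directory should contain the converted weight files and config.ini
ls ./workspace/triton_models/weights/
```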

## Serve with gradio

If you wish to interact with the model via web UI, please initiate the gradio server as indicated below:

```shell
python3 -m lmdeploy.serve.gradio.app ./workspace --server_name {ip_addr} --server_port {port}
```
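
For example, to expose the UI on all interfaces on port 6006 (illustrative values only; substitute your own address and port):

```shell
# 0.0.0.0 and 6006 are placeholder values for this example
python3 -m lmdeploy.serve.gradio.app ./workspace --server_name 0.0.0.0 --server_port 6006
```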

Subsequently, you can open the website `http://{ip_addr}:{port}` in your browser and interact with the model.

Besides serving with gradio, there are two more serving methods. One is serving with Triton Inference Server (TIS), and the other is an OpenAI-compatible server named `api_server`.

Please refer to the [user guide](https://github.com/InternLM/lmdeploy#quick-start) for details.


## Inference Performance

LMDeploy provides scripts for benchmarking `token throughput` and `request throughput`.

`token throughput` tests the speed of generating new tokens, given a specified number of prompt tokens and completion tokens, while `request throughput` measures the number of requests processed per minute with real dialogue data.

We benchmarked `internlm-chat-20b-4bit` on an A100-80G. The `token_throughput` was measured with 256 prompt tokens and 512 completion tokens.

**Note**: In our test, `session_len` in `workspace/triton_models/weights/config.ini` was changed to `2056`.
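
One way to reproduce this setting is the in-place edit sketched below; it assumes the key appears as `session_len = <value>` in `config.ini`, otherwise edit the file by hand.

```shell
# assumes the INI key is written as "session_len = <value>"; adjust if your file differs
sed -i 's/^session_len.*/session_len = 2056/' ./workspace/triton_models/weights/config.ini
```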


| batch | tensor parallel | prompt_tokens | completion_tokens | thr_per_proc(token/s) | rpm (req/min) | mem_per_proc(GB) |
|-------|-----------------|---------------|-------------------|-----------------------|---------------|------------------|
| 1     | 1               | 256           | 512               | 88.77                 | -             | 15.65            |
| 16    | 1               | 256           | 512               | 792.7                 | 220.23        | 51.46            |

### token throughput

Run the following command:

```shell
python benchmark/profile_generation.py \
  --model-path ./workspace \
  --concurrency 1 8 16 \
  --prompt-tokens 256 512 512 1024 \
  --completion-tokens 512 512 1024 1024 \
  --dst-csv ./token_throughput.csv
```
You will find the `token_throughput` metrics in `./token_throughput.csv`.
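
To take a quick look at the results in the terminal (optional; `column` is a standard util-linux tool):

```shell
# pretty-print the comma-separated results as aligned columns
column -t -s, ./token_throughput.csv
```

Our measurements are listed below.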

| batch | prompt_tokens | completion_tokens | thr_per_proc(token/s) | thr_per_node(token/s) | rpm(req/min) | mem_per_proc(GB) | mem_per_gpu(GB) | mem_per_node(GB) |
|-------|---------------|-------------------|-----------------------|-----------------------|--------------|------------------|-----------------|------------------|
| 1     | 256           | 512               | 88.77                 | 710.12                | -            | 15.65            | 15.65           | 125.21           |
| 1     | 512           | 512               | 83.89                 | 671.15                | -            | 15.68            | 15.68           | 125.46           |
| 1     | 512           | 1024              | 80.19                 | 641.5                 | -            | 15.68            | 15.68           | 125.46           |
| 1     | 1024          | 1024              | 72.34                 | 578.74                | -            | 15.75            | 15.75           | 125.96           |
| 1     | 1             | 2048              | 80.69                 | 645.55                | -            | 15.62            | 15.62           | 124.96           |
| 8     | 256           | 512               | 565.21                | 4521.67               | -            | 32.37            | 32.37           | 258.96           |
| 8     | 512           | 512               | 489.04                | 3912.33               | -            | 32.62            | 32.62           | 260.96           |
| 8     | 512           | 1024              | 467.23                | 3737.84               | -            | 32.62            | 32.62           | 260.96           |
| 8     | 1024          | 1024              | 383.4                 | 3067.19               | -            | 33.06            | 33.06           | 264.46           |
| 8     | 1             | 2048              | 487.74                | 3901.93               | -            | 32.12            | 32.12           | 256.96           |
| 16    | 256           | 512               | 792.7                 | 6341.6                | -            | 51.46            | 51.46           | 411.71           |
| 16    | 512           | 512               | 639.4                 | 5115.17               | -            | 51.93            | 51.93           | 415.46           |
| 16    | 512           | 1024              | 591.39                | 4731.09               | -            | 51.93            | 51.93           | 415.46           |
| 16    | 1024          | 1024              | 449.11                | 3592.85               | -            | 52.06            | 52.06           | 416.46           |
| 16    | 1             | 2048              | 620.5                 | 4964.02               | -            | 51               | 51              | 407.96           |


### request throughput

LMDeploy uses the ShareGPT dataset to test request throughput. Run the following commands to obtain the `rpm` (requests per minute) metric.

```shell
# download the ShareGPT dataset
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

# run the request throughput benchmark
python benchmark/profile_throughput.py \
 ShareGPT_V3_unfiltered_cleaned_split.json \
 ./workspace \
 --concurrency 16
```