whcao committed
Commit 13eba74 (parent: 208448c)

WIP: add readme of internlm2-chat-20b-4bits model

Files changed (1): README.md (added, +110 -0)
---
license: apache-2.0
pipeline_tag: text-generation
---
<div align="center">
<img src="https://raw.githubusercontent.com/InternLM/lmdeploy/0be9e7ab6fe9a066cfb0a09d0e0c8d2e28435e58/resources/lmdeploy-logo.svg" width="450"/>
</div>

[LMDeploy](https://github.com/InternLM/lmdeploy) supports 4-bit weight-quantized LLM inference. The minimum requirement is an NVIDIA GPU with compute capability sm80 or above, such as the A10, A100, or GeForce 30/40 series.

Before proceeding with inference of `internlm2-chat-20b-4bits`, please ensure that LMDeploy is installed:
```shell
pip install 'lmdeploy>=0.0.11'
```
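
To double-check the installation, you can ask pip for the package metadata (the reported version will depend on what was installed):

```shell
# show the installed lmdeploy version and package metadata
pip show lmdeploy
```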

## Inference

Please download the `internlm2-chat-20b-4bits` model as follows:

```shell
git-lfs install
git clone https://huggingface.co/internlm/internlm2-chat-20b-4bits
```
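
Since the weights are stored with Git LFS, it is worth confirming that the large files were actually fetched rather than left as pointer stubs; one way to check is:

```shell
# list LFS-tracked files; "*" marks files whose content is present locally, "-" marks pointers
cd internlm2-chat-20b-4bits && git lfs ls-files
```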

As demonstrated in the command below, you can interact with the AI assistant in the terminal:

```shell
lmdeploy chat turbomind \
    --model-path ./internlm2-chat-20b-4bits \
    --model-name internlm2-chat-20b \
    --model-format awq \
    --group-size 128
```

## Serve with gradio

If you wish to interact with the model via web UI, please initiate the gradio server as indicated below:

```shell
python3 -m lmdeploy.serve.gradio.app ./workspace --server_name {ip_addr} --server_port {port}
```

Here, `./workspace` refers to a TurboMind model directory produced by LMDeploy's model conversion step (`lmdeploy convert`).

Subsequently, you can open the website `http://{ip_addr}:{port}` in your browser and interact with the model.

Besides serving with gradio, there are two more serving methods. One is serving with Triton Inference Server (TIS), and the other is an OpenAI-like server named `api_server`.

Please refer to the [user guide](https://github.com/InternLM/lmdeploy#quick-start) for detailed information if you are interested.
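
As a rough illustration of the `api_server` route, the sketch below assumes a server has already been launched from the converted model (see the user guide for the launch command) and that it exposes an OpenAI-style `/v1/chat/completions` endpoint at `http://{ip_addr}:{port}`; adjust the URL and payload to your deployment.

```shell
# hypothetical request against an already-running api_server;
# the endpoint and fields follow the usual OpenAI-style chat schema
curl http://{ip_addr}:{port}/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "internlm2-chat-20b",
          "messages": [{"role": "user", "content": "Hello!"}]
        }'
```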

## Inference Performance

LMDeploy provides scripts for benchmarking `token throughput` and `request throughput`.

`token throughput` measures the speed of generating new tokens, given a specified number of prompt tokens and completion tokens, while `request throughput` measures the number of requests processed per minute on real dialogue data.

We benchmarked `internlm2-chat-20b-4bits` on an A100-80G. The `token throughput` results below were measured with 256 prompt tokens and 512 completion tokens.

**Note**: The `session_len` in `workspace/triton_models/weights/config.ini` was changed to `2056` in our test.
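
The benchmark commands below assume the model has already been converted into a TurboMind `./workspace` directory (for example via `lmdeploy convert`). If your `config.ini` stores the value as a plain `session_len = <number>` entry, a one-liner such as the following could apply the change; the exact layout may vary across LMDeploy versions, so inspect the file first.

```shell
# update session_len in the TurboMind config (assumes "session_len = <number>" formatting)
sed -i 's/^session_len = .*/session_len = 2056/' ./workspace/triton_models/weights/config.ini
```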

| batch | tensor parallel | prompt_tokens | completion_tokens | thr_per_proc(token/s) | rpm (req/min) | mem_per_proc(GB) |
|-------|-----------------|---------------|-------------------|-----------------------|---------------|------------------|
| 1 | 1 | 256 | 512 | 88.77 | - | 15.65 |
| 16 | 1 | 256 | 512 | 792.7 | 220.23 | 51.46 |

### token throughput

Run the following command:

```shell
python benchmark/profile_generation.py \
    --model-path ./workspace \
    --concurrency 1 8 16 --prompt-tokens 256 512 512 1024 --completion-tokens 512 512 1024 1024 \
    --dst-csv ./token_throughput.csv
```

You will find the `token_throughput` metrics in `./token_throughput.csv`.
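
For a quick look at the results in the terminal, a standard utility such as `column` can pretty-print the file (assuming the script writes a plain comma-separated CSV):

```shell
# render the benchmark CSV as an aligned table
column -t -s, ./token_throughput.csv
```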

| batch | prompt_tokens | completion_tokens | thr_per_proc(token/s) | thr_per_node(token/s) | rpm(req/min) | mem_per_proc(GB) | mem_per_gpu(GB) | mem_per_node(GB) |
|-------|---------------|-------------------|-----------------------|-----------------------|--------------|------------------|-----------------|------------------|
| 1 | 256 | 512 | 88.77 | 710.12 | - | 15.65 | 15.65 | 125.21 |
| 1 | 512 | 512 | 83.89 | 671.15 | - | 15.68 | 15.68 | 125.46 |
| 1 | 512 | 1024 | 80.19 | 641.5 | - | 15.68 | 15.68 | 125.46 |
| 1 | 1024 | 1024 | 72.34 | 578.74 | - | 15.75 | 15.75 | 125.96 |
| 1 | 1 | 2048 | 80.69 | 645.55 | - | 15.62 | 15.62 | 124.96 |
| 8 | 256 | 512 | 565.21 | 4521.67 | - | 32.37 | 32.37 | 258.96 |
| 8 | 512 | 512 | 489.04 | 3912.33 | - | 32.62 | 32.62 | 260.96 |
| 8 | 512 | 1024 | 467.23 | 3737.84 | - | 32.62 | 32.62 | 260.96 |
| 8 | 1024 | 1024 | 383.4 | 3067.19 | - | 33.06 | 33.06 | 264.46 |
| 8 | 1 | 2048 | 487.74 | 3901.93 | - | 32.12 | 32.12 | 256.96 |
| 16 | 256 | 512 | 792.7 | 6341.6 | - | 51.46 | 51.46 | 411.71 |
| 16 | 512 | 512 | 639.4 | 5115.17 | - | 51.93 | 51.93 | 415.46 |
| 16 | 512 | 1024 | 591.39 | 4731.09 | - | 51.93 | 51.93 | 415.46 |
| 16 | 1024 | 1024 | 449.11 | 3592.85 | - | 52.06 | 52.06 | 416.46 |
| 16 | 1 | 2048 | 620.5 | 4964.02 | - | 51 | 51 | 407.96 |

### request throughput

LMDeploy uses the ShareGPT dataset to benchmark request throughput. Run the following commands to obtain the `rpm` (requests per minute) metric:

```shell
# download the ShareGPT dataset
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

# benchmark request throughput against the converted model in ./workspace
# (profile_throughput.py ships in the benchmark/ directory of the lmdeploy repository)
python profile_throughput.py \
    ShareGPT_V3_unfiltered_cleaned_split.json \
    ./workspace \
    --concurrency 16
```