RangiLyu committed
Commit 7c42fba
1 Parent(s): 19f48c2

Update README.md

Files changed (1)
  1. README.md +24 -3
README.md CHANGED
@@ -48,6 +48,8 @@ InternLM2.5-7B-Chat-1M is the 1M-long-context version of InternLM2.5-7B-Chat. Si
 
 LMDeploy is a toolkit for compressing, deploying, and serving LLMs, developed by the MMRazor and MMDeploy teams.
 
+Here is an example of 1M-long-context inference. **Note: 1M context length requires 4xA100-80G!**
+
 ```bash
 pip install lmdeploy
 ```
@@ -57,7 +59,12 @@ You can run batch inference locally with the following python code:
 ```python
 from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig
 
-backend_config = TurbomindEngineConfig(rope_scaling_factor=2.5, session_len=1048576)
+backend_config = TurbomindEngineConfig(
+    rope_scaling_factor=2.5,
+    session_len=1048576,  # 1M context length
+    max_batch_size=1,
+    cache_max_entry_count=0.7,
+    tp=4)  # 4xA100-80G
 pipe = pipeline('internlm/internlm2_5-7b-chat-1m', backend_config=backend_config)
 prompt = 'Use a long prompt to replace this sentence'
 response = pipe(prompt)
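The hunk above imports `GenerationConfig` but never uses it. For completeness, a minimal sketch of passing sampling parameters through it, reusing `pipe` and `prompt` from the code above (the parameter values are illustrative, not from this commit):

```python
from lmdeploy import GenerationConfig

# Illustrative sampling parameters; tune them for your workload.
gen_config = GenerationConfig(
    max_new_tokens=1024,
    top_p=0.8,
    temperature=0.8)
response = pipe(prompt, gen_config=gen_config)
print(response.text)
```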
@@ -69,6 +76,7 @@ Find more details in the [LMDeploy documentation](https://lmdeploy.readthedocs.i
 
 ### Import from Transformers
 
+Since Transformers does not support the full 1M context, we only show non-long-context usage here.
 To load the InternLM2.5 7B Chat 1M model using Transformers, use the following code:
 
 ```python
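The hunk cuts off right at the opening fence of the Transformers snippet. As a sketch of what loading typically looks like for this model family (the `chat()` helper comes from the model's remote code; the exact code in the README may differ):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    'internlm/internlm2_5-7b-chat-1m', trust_remote_code=True)
# float16 halves the weight memory relative to float32.
model = AutoModelForCausalLM.from_pretrained(
    'internlm/internlm2_5-7b-chat-1m',
    torch_dtype=torch.float16,
    trust_remote_code=True).cuda().eval()

# chat() is provided by the model's remote code, not by Transformers itself.
response, history = model.chat(tokenizer, 'hello', history=[])
print(response)
```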
@@ -114,6 +122,8 @@ pip install vllm
 python -m vllm.entrypoints.openai.api_server --model internlm/internlm2_5-7b-chat-1m --served-model-name internlm2_5-7b-chat-1m --trust-remote-code
 ```
 
+If you encounter OOM, try to reduce `--max-model-len` or increase `--tensor-parallel-size`.
+
 Then you can send a chat request to the server:
 
 ```bash
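The hunk ends just as the request example opens. Against vLLM's OpenAI-compatible server (default port 8000 assumed; the model name matches `--served-model-name` above), a request of this shape works:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "internlm2_5-7b-chat-1m",
    "messages": [{"role": "user", "content": "Introduce deep learning in one sentence."}]
  }'
```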
@@ -164,6 +174,8 @@ InternLM2.5-7B-Chat-1M supports ultra-long-context inference of up to 1 million tokens, with performance on par with In
 
 LMDeploy, developed jointly by the MMDeploy and MMRazor teams, is a complete suite of compression, deployment, and serving solutions for LLM tasks.
 
+Here is an example of 1M-context inference. **Note: 1M context length requires 4xA100-80G!**
+
 ```bash
 pip install lmdeploy
 ```
@@ -174,8 +186,13 @@ pip install lmdeploy
 ```python
 from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig
 
-backend_config = TurbomindEngineConfig(rope_scaling_factor=2.5, session_len=1048576)
-pipe = pipeline('internlm/internlm2_5-7b-chat', backend_config=backend_config)
+backend_config = TurbomindEngineConfig(
+    rope_scaling_factor=2.5,
+    session_len=1048576,  # 1M context length
+    max_batch_size=1,
+    cache_max_entry_count=0.7,
+    tp=4)  # 4xA100-80G
+pipe = pipeline('internlm/internlm2_5-7b-chat-1m', backend_config=backend_config)
 prompt = 'Use a long prompt to replace this sentence'
 response = pipe(prompt)
 print(response)
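Reusing the 1M-context `pipe` built above, a hedged sketch of long-context inference in practice (`long_document.txt` is a hypothetical placeholder, not a file shipped with this repo):

```python
# Hypothetical input: any text whose token count fits within session_len.
with open('long_document.txt') as f:
    long_text = f.read()

response = pipe(long_text + '\n\nSummarize the document above in three sentences.')
print(response.text)
```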
@@ -183,6 +200,8 @@ print(response)
 
 ### Loading with Transformers
 
+Since Transformers cannot support 1M-long-context inference, only non-long-context usage is demonstrated here.
+
 Load the InternLM2.5 7B Chat 1M model with the following code:
 
 ```python
@@ -228,6 +247,8 @@ pip install vllm
 python -m vllm.entrypoints.openai.api_server --model internlm/internlm2_5-7b-chat-1m --trust-remote-code
 ```
 
+If you encounter OOM, reduce `--max-model-len` or increase `--tensor-parallel-size`.
+
 Then you can send a chat request to the server:
 
 ```bash
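Both flags named in the OOM note are standard vLLM server options. A sketch with illustrative values (not from this commit) that trades maximum context length and GPU count against per-GPU memory:

```bash
# Illustrative: cap the context at 256K tokens and shard across 4 GPUs.
python -m vllm.entrypoints.openai.api_server \
  --model internlm/internlm2_5-7b-chat-1m \
  --trust-remote-code \
  --max-model-len 262144 \
  --tensor-parallel-size 4
```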
 