Update vLLM description
README.md
python3 -m sglang.launch_server --model nvidia/MiniMax-M2.5-NVFP4 --tensor-parallel-size 8 --quantization modelopt_fp4 --trust-remote-code --reasoning-parser minimax-append-think --tool-call-parser minimax-m2 --moe-runner-backend flashinfer_cutlass --attention-backend flashinfer
```
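
Once the SGLang server above is up, it exposes an OpenAI-compatible API. As a minimal sketch (assuming the server runs locally on SGLang's default port 30000; the endpoint URL and prompt here are illustrative, not from this README), a chat request can be issued with Python's standard library:

```python
import json
import urllib.request

# Assumed endpoint: SGLang's launch_server listens on port 30000 by default;
# adjust if you passed --port.
URL = "http://localhost:30000/v1/chat/completions"

# OpenAI-style chat-completions payload for the served checkpoint.
payload = {
    "model": "nvidia/MiniMax-M2.5-NVFP4",
    "messages": [{"role": "user", "content": "Summarize NVFP4 in one sentence."}],
    "max_tokens": 256,
}

def chat(url: str = URL) -> dict:
    """POST the payload and return the parsed JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# With the server running:
#   response = chat()
#   print(response["choices"][0]["message"]["content"])
```
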
To serve this checkpoint with [vLLM](https://github.com/vllm-project/vllm), install the [vLLM nightly build](https://docs.vllm.ai/en/latest/getting_started/installation/gpu/#install-the-latest-code) or start the Docker image `vllm/vllm-openai:nightly`, then run the sample command below:
```sh
vllm serve nvidia/MiniMax-M2.5-NVFP4 --tensor-parallel-size 8 --trust-remote-code --reasoning-parser minimax-append-think --tool-call-parser minimax-m2 --enable-auto-tool-choice
```
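
Because the command enables `--enable-auto-tool-choice` with the `minimax-m2` tool-call parser, the server also accepts OpenAI-style `tools` definitions. The sketch below shows such a request, assuming vLLM's default port 8000; the `get_weather` tool is purely hypothetical and should be replaced with your own schema:

```python
import json
import urllib.request

# Assumed endpoint: vLLM's OpenAI-compatible server listens on port 8000 by default.
URL = "http://localhost:8000/v1/chat/completions"

# Hypothetical tool definition for illustration only.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

payload = {
    "model": "nvidia/MiniMax-M2.5-NVFP4",
    "messages": [{"role": "user", "content": "What is the weather in Santa Clara?"}],
    "tools": tools,
    "tool_choice": "auto",  # let the server decide when to emit a tool call
}

def request_tool_call(url: str = URL) -> dict:
    """POST the payload; the returned message may carry `tool_calls`."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```
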
### Evaluation
The accuracy benchmark results are presented in the table below:
<table>