# vLLM Integration
You can use [vLLM](https://vllm.ai/) as an optimized worker implementation in FastChat.
It provides continuous batching and roughly 10x higher throughput than the default worker.
See the list of supported models [here](https://vllm.readthedocs.io/en/latest/models/supported_models.html).
## Instructions
1. Install vLLM.
```
pip install vllm
```
2. When you launch a model worker, replace the normal worker (`fastchat.serve.model_worker`) with the vLLM worker (`fastchat.serve.vllm_worker`). All other commands, such as the controller, Gradio web server, and OpenAI API server, stay the same (see the full-stack sketch after this list).
```
python3 -m fastchat.serve.vllm_worker --model-path lmsys/vicuna-7b-v1.3
```
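
The worker command above assumes the rest of the serving stack is already running. As a point of reference, a minimal end-to-end launch might look like the following sketch; it assumes FastChat's default controller address and uses `localhost:8000` for the OpenAI-compatible API server, so adjust hosts and ports to match your deployment.
```
# Terminal 1: start the controller that workers register with
python3 -m fastchat.serve.controller

# Terminal 2: start the vLLM worker in place of fastchat.serve.model_worker
python3 -m fastchat.serve.vllm_worker --model-path lmsys/vicuna-7b-v1.3

# Terminal 3: expose an OpenAI-compatible REST API (host/port are assumptions)
python3 -m fastchat.serve.openai_api_server --host localhost --port 8000

# Sanity check: list the models currently registered with the stack
curl http://localhost:8000/v1/models
```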