# LightLLM Integration

You can use LightLLM as an optimized worker implementation in FastChat. It offers advanced continuous batching and a much higher (~10x) throughput. See the LightLLM repository for the list of supported models.

## Instructions

  1. Refer to the LightLLM Get Started guide to install LightLLM, or use a pre-built image.

  2. When you launch a model worker, replace the normal worker (`fastchat.serve.model_worker`) with the LightLLM worker (`fastchat.serve.lightllm_worker`). All other commands, such as the controller, the Gradio web server, and the OpenAI API server, stay the same; a sketch of the full launch sequence is given after this list. Refer to the LightLLM documentation on `--max_total_token_num` to understand how to calculate that argument; a rough estimate is also sketched below.

    python3 -m fastchat.serve.lightllm_worker --model-path lmsys/vicuna-7b-v1.5 --tokenizer_mode "auto" --max_total_token_num 154000
    

    If you want to use quantized weights and a quantized KV cache for inference, try

    python3 -m fastchat.serve.lightllm_worker --model-path lmsys/vicuna-7b-v1.5 --tokenizer_mode "auto" --max_total_token_num 154000 --mode triton_int8weight triton_int8kv
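
As mentioned above, every other FastChat component is launched exactly as it is with the default worker. A minimal sketch of a full serving stack with the LightLLM worker is shown below; the host and port values are FastChat's usual defaults and are illustrative, not taken from this page.

    # Start the controller that registers workers
    python3 -m fastchat.serve.controller

    # Start the LightLLM worker instead of fastchat.serve.model_worker
    python3 -m fastchat.serve.lightllm_worker --model-path lmsys/vicuna-7b-v1.5 --tokenizer_mode "auto" --max_total_token_num 154000

    # Expose an OpenAI-compatible API (or launch fastchat.serve.gradio_web_server for a web UI)
    python3 -m fastchat.serve.openai_api_server --host localhost --port 8000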
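
As a rough illustration of how `--max_total_token_num` scales (the LightLLM documentation is the authoritative reference), the value is bounded by how many KV-cache entries fit in the GPU memory left after the model weights are loaded. For a LLaMA-7B-style model such as Vicuna-7B in fp16, each token's KV cache takes about 2 (K and V) × 32 layers × 4096 hidden dimensions × 2 bytes, roughly 0.5 MB, so every 10 GB of free GPU memory holds on the order of 20,000 cached tokens. The `triton_int8kv` mode shown above stores the KV cache in int8 and roughly halves that per-token cost.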