Inference speed is extremely slow with FastChat

#22 opened by oximi123

I use FastChat to deploy CodeLlama-7b-Instruct-hf on an A800-80GB server. The inference speed is extremely slow: it runs for more than ten minutes without producing a response to a single request. Any suggestions on how to solve this problem?

Here is how I deploy it with FastChat:

# Start the controller
python -m fastchat.serve.controller
# Start the model worker with the local CodeLlama checkpoint
python -m fastchat.serve.model_worker --model-path /home/user/botao/CodeLlama-7b-Instruct-hf
# Start the OpenAI-compatible API server
python -m fastchat.serve.openai_api_server --host localhost --port 8000
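
A request against the OpenAI-compatible endpoint then looks roughly like the sketch below. This is only an illustration: the model name is assumed to be the basename of the checkpoint directory (FastChat's default worker registration), and the prompt is just an example.

# request_example.py - minimal sketch of querying the API server started above
import requests

payload = {
    # Assumed to match the name the model worker registered with the controller
    "model": "CodeLlama-7b-Instruct-hf",
    "messages": [
        {"role": "user", "content": "Write a Python function that reverses a string."}
    ],
    "max_tokens": 256,
}

# The openai_api_server above listens on localhost:8000 and exposes the
# standard OpenAI-compatible /v1/chat/completions route
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json=payload,
    timeout=600,
)
print(resp.json()["choices"][0]["message"]["content"])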
