Efficiently run the model locally using ScaleLLM

#4
by guocuimi - opened

https://github.com/vectorch-ai/ScaleLLM

ScaleLLM is a tool for serving language models locally. The project and its documentation are in the GitHub repository linked above. Here's how to set it up:

1: Start the model inference server
First, run the model inference server using the following Docker command. This command will start a container with GPU support (if available) and link it to your local model cache:

docker run -it --gpus=all --net=host --shm-size=1g \
  -v $HOME/.cache/huggingface/hub:/models \
  -e HF_MODEL_ID=01-ai/Yi-34B-200K \
  -e DEVICE=auto \
  docker.io/vectorchai/scalellm:latest --logtostderr

2: Start the REST API server
Next, start the REST API server by running the following Docker command:

docker run -it --net=host \
  docker.io/vectorchai/scalellm-gateway:latest --logtostderr

You will then have the following services running:

ScaleLLM gRPC server on port 8888: localhost:8888
ScaleLLM HTTP server for monitoring on port 9999: localhost:9999
ScaleLLM REST API server on port 8080: localhost:8080
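
Before sending requests, it can help to confirm that the three ports are actually accepting connections. The following is a minimal sketch in Python using only the standard library; the `SERVICES` table and the `is_port_open` helper are illustrative names, not part of ScaleLLM, and a successful TCP connection only shows that something is listening on the port, not that the service is healthy.

```python
import socket

# Ports as listed above; the host is assumed to be the local machine.
SERVICES = {
    "ScaleLLM gRPC server": 8888,
    "ScaleLLM HTTP monitoring server": 9999,
    "ScaleLLM REST API server": 8080,
}


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    for name, port in SERVICES.items():
        status = "listening" if is_port_open("localhost", port) else "not reachable"
        print(f"{name} (localhost:{port}): {status}")
```

If a port shows as not reachable, check that the corresponding container is still running and that both containers were started with --net=host.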

You can now send requests to the local REST API server to generate text completions using a command like this:

curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "01-ai/Yi-34B-200K",
    "prompt": "what is vue.js",
    "max_tokens": 32,
    "temperature": 0.7
  }'

This command sends a POST request to the local REST API server, specifying the model, prompt, and other parameters to generate completions.
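
The same request can be sent from Python. This is a rough sketch using only the standard library; `build_request` and `complete` are hypothetical helper names, and the response parsing assumes an OpenAI-style body with the generated text under `choices[0].text`, which the thread above does not confirm. Adjust the parsing to whatever the server actually returns.

```python
import json
import urllib.request

# Endpoint and default parameters mirror the curl example above.
API_URL = "http://localhost:8080/v1/completions"


def build_request(prompt, model="01-ai/Yi-34B-200K", max_tokens=32, temperature=0.7):
    """Build the POST request with the same JSON body as the curl command."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, **kwargs), timeout=60) as resp:
        body = json.load(resp)
    # Assumption: OpenAI-style response shape; not confirmed by the post.
    return body["choices"][0]["text"]


if __name__ == "__main__":
    print(complete("what is vue.js"))
```

Keeping the request construction separate from the network call makes it easy to inspect the payload before anything is sent.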

Make sure you have Docker installed and configured for GPU usage if you want to take advantage of GPU acceleration. This setup allows you to efficiently run the language model locally with ScaleLLM.

cc @cArlIcon Just to verify if you want these sorts of discussions or not ^

FancyZhao changed discussion status to closed
