Running an inference server using Docker + vLLM

by YorelNation

Hi,

Would it be possible to deploy Vigostral the same way Mistral can be deployed via their recommended method (https://docs.mistral.ai/quickstart)?
Can I simply run:

docker run --gpus all \
    -e HF_TOKEN=$HF_TOKEN -p 8000:8000 \
    ghcr.io/mistralai/mistral-src/vllm:latest \
    --host 0.0.0.0 \
    --model bofenghuang/vigostral-7b-chat

I don't have the hardware to try this yet, which is why I'm asking :)

Thanks

Hi,

Thanks for your message. I will look into it :)

Hi @YorelNation,

The Mistral AI version has not yet been updated to support the prompt format of the Vigostral model.

However, I have managed to create another Docker image that also leverages vLLM for inference. You can use it as follows:

# Launch inference engine
docker run --gpus '"device=0"' \
    -e HF_TOKEN=$HF_TOKEN -p 8000:8000 \
    ghcr.io/bofenghuang/vigogne/vllm:latest \
    --host 0.0.0.0 \
    --model bofenghuang/vigostral-7b-chat
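
Once the container is up, it should expose vLLM's OpenAI-compatible API on port 8000, so you can query it with a simple curl request. This is just a sketch, assuming the default /v1/chat/completions route and an example prompt:

# Query the running server (assumes vLLM's OpenAI-compatible API on port 8000)
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "bofenghuang/vigostral-7b-chat",
        "messages": [{"role": "user", "content": "Parle-moi de la ville de Lyon."}],
        "max_tokens": 256,
        "temperature": 0.7
    }'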

# Launch inference engine on multiple GPUs (4 here)
docker run --gpus all \
    -e HF_TOKEN=$HF_TOKEN -p 8000:8000 \
    ghcr.io/bofenghuang/vigogne/vllm:latest \
    --host 0.0.0.0 \
    --tensor-parallel-size 4 \
    --model bofenghuang/vigostral-7b-chat

# Launch inference engine using the quantized AWQ version
# Note: the AWQ version only supports Ampere or newer GPUs
docker run --gpus '"device=0"' \
    -e HF_TOKEN=$HF_TOKEN -p 8000:8000 \
    ghcr.io/bofenghuang/vigogne/vllm:latest \
    --host 0.0.0.0 \
    --quantization awq \
    --model TheBloke/Vigostral-7B-Chat-AWQ
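
For the AWQ variant, the model name you pass in API requests should match what the server reports. Assuming the same OpenAI-compatible API as above, you can check which model is being served with:

# List the models served by the running container (assumes the OpenAI-compatible API)
curl http://localhost:8000/v1/models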

# Launch inference engine using the downloaded weights
docker run --gpus '"device=0"' \
    -p 8000:8000 \
    -v /path/to/model/:/mnt/model/ \
    ghcr.io/bofenghuang/vigogne/vllm:latest \
    --host 0.0.0.0 \
    --model="/mnt/model/"
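
If you don't already have a local copy of the weights to mount, one way to fetch them beforehand is with the Hugging Face CLI. This is only a sketch; /path/to/model/ is a placeholder directory:

# Download the weights locally before mounting them into the container
pip install -U "huggingface_hub[cli]"
huggingface-cli download bofenghuang/vigostral-7b-chat --local-dir /path/to/model/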

Thanks! Will try this.
