
Sanket Rai

sanketrai

AI & ML interests

NLP, CV, RL, Deep Learning, Gen AI, MLOps

Recent Activity

reacted to macadeliccc's post with 🔥 2 days ago
updated a model about 1 month ago
sanketrai/modernbert-base-wnut17-english-ner
updated a model about 2 months ago
sanketrai/modernbert-base-conll2003-english-ner

Organizations

Nutanix

sanketrai's activity

reacted to macadeliccc's post with 🔥 2 days ago
Save money on your compute bill by using LMCache to share a prefix KV cache between two different vLLM instances. By deploying the LMCache backend alongside your vLLM containers, you can share a prefix KV cache across two different containers and models. It is very simple to add to your existing stack.

Step 1: Pull the Docker image
docker pull apostacyh/vllm:lmcache-0.1.0

Step 2: Start vLLM + LMCache
model=mistralai/Mistral-7B-Instruct-v0.2    # Replace with your model name
sudo docker run --runtime nvidia --gpus '"device=0"' \
    -v <Huggingface cache dir on your local machine>:/root/.cache/huggingface \
    -p 8000:8000 \
    --env "HF_TOKEN=<Your huggingface access token>" \
    --ipc=host \
    --network=host \
    apostacyh/vllm:lmcache-0.1.0 \
    --model $model --gpu-memory-utilization 0.6 --port 8000 \
    --lmcache-config-file /lmcache/LMCache/examples/example-local.yaml
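
Once the first container is running, you can sanity-check it through vLLM's OpenAI-compatible API. The request below is a minimal sketch: it assumes this image exposes the standard /v1/completions endpoint on port 8000, and the shared prefix is just a placeholder for whatever long system prompt or context your application actually reuses.

# Send a request whose prompt starts with a long, reusable prefix to the first instance (port 8000).
# LMCache should store the KV cache for this prefix in its backend.
PREFIX="You are a helpful assistant. Answer concisely."   # placeholder shared prefix

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d "{
        \"model\": \"mistralai/Mistral-7B-Instruct-v0.2\",
        \"prompt\": \"$PREFIX What is a KV cache?\",
        \"max_tokens\": 64
    }"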

You can add another vLLM instance, as long as it's on a separate GPU, by simply deploying another container:

# The second vLLM instance listens at port 8001
model=mistralai/Mistral-7B-Instruct-v0.2    # Replace with your model name
sudo docker run --runtime nvidia --gpus '"device=1"' \
    -v <Huggingface cache dir on your local machine>:/root/.cache/huggingface \
    -p 8001:8001 \
    --env "HF_TOKEN=<Your huggingface token>" \
    --ipc=host \
    --network=host \
    apostacyh/vllm:lmcache-0.1.0 \
    --model $model --gpu-memory-utilization 0.7 --port 8001 \
    --lmcache-config-file /lmcache/LMCache/examples/example.yaml
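
To see the sharing in action, send a prompt that reuses the same prefix to the second instance. Again a sketch, assuming both containers point at the same LMCache backend through their config files; if sharing works, the prefill for the shared prefix on port 8001 should be served from LMCache rather than recomputed from scratch, which you can roughly gauge by timing the request.

# Reuse the same prefix against the second instance (port 8001).
# The prefix KV should come from the shared LMCache backend instead of being recomputed.
PREFIX="You are a helpful assistant. Answer concisely."   # same placeholder prefix as above

time curl http://localhost:8001/v1/completions \
    -H "Content-Type: application/json" \
    -d "{
        \"model\": \"mistralai/Mistral-7B-Instruct-v0.2\",
        \"prompt\": \"$PREFIX Explain prefix caching in one sentence.\",
        \"max_tokens\": 64
    }"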

This method supports local, remote, or hybrid backends, so whichever vLLM deployment method you are already using should work with the LMCache container (excluding BentoML).
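
The two commands above point at different bundled configs (example-local.yaml for a local backend, example.yaml presumably for a remote or hybrid one). To check which backend a running instance is actually using, you can print its config from inside the container; the paths come from the commands above, and the container names are placeholders for whatever docker ps reports.

# Inspect the LMCache configs referenced by the two docker run commands above.
docker exec <first-container-name-or-id> cat /lmcache/LMCache/examples/example-local.yaml
docker exec <second-container-name-or-id> cat /lmcache/LMCache/examples/example.yaml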

LMCache: https://github.com/LMCache/LMCache/tree/dev
vLLM: https://github.com/vllm-project/vllm
upvoted an article 4 months ago