
Sanket Rai

sanketrai

AI & ML interests

NLP, CV, RL, Deep Learning, Gen AI, MLOps

Recent Activity

reacted to macadeliccc's post with 🔥 2 days ago
updated a model about 1 month ago
sanketrai/modernbert-base-wnut17-english-ner
updated a model about 2 months ago
sanketrai/modernbert-base-conll2003-english-ner

Organizations

Nutanix

sanketrai's activity

reacted to macadeliccc's post with 🔥 2 days ago
Save money on your compute bill by using LMCache to share a prefix KV cache between two different vLLM instances. By deploying the LMCache backend alongside your vLLM containers, you can share a prefix KV cache across two different containers and models. It is very simple to add to your existing stack.

Step 1: Pull the Docker image
docker pull apostacyh/vllm:lmcache-0.1.0

Step 2: Start vLLM + LMCache
model=mistralai/Mistral-7B-Instruct-v0.2    # Replace with your model name
sudo docker run --runtime nvidia --gpus '"device=0"' \
    -v <Huggingface cache dir on your local machine>:/root/.cache/huggingface \
    -p 8000:8000 \
    --env "HF_TOKEN=<Your huggingface access token>" \
    --ipc=host \
    --network=host \
    apostacyh/vllm:lmcache-0.1.0 \
    --model $model --gpu-memory-utilization 0.6 --port 8000 \
    --lmcache-config-file /lmcache/LMCache/examples/example-local.yaml
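
Once the first container is running, you can sanity-check it through vLLM's OpenAI-compatible API. The request below is a minimal sketch: it assumes this image exposes the standard /v1/completions endpoint on port 8000, and the shared prefix is just a placeholder for whatever long system prompt or context your application actually reuses.

# Send a request whose prompt starts with a long, reusable prefix to the first instance (port 8000).
# LMCache should store the KV cache for this prefix in its backend.
PREFIX="You are a helpful assistant. Answer concisely."   # placeholder shared prefix

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d "{
        \"model\": \"mistralai/Mistral-7B-Instruct-v0.2\",
        \"prompt\": \"$PREFIX What is a KV cache?\",
        \"max_tokens\": 64
    }"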

You can add another vLLM instance, as long as it's on a separate GPU, by simply deploying another container:

# The second vLLM instance listens at port 8001
model=mistralai/Mistral-7B-Instruct-v0.2    # Replace with your model name
sudo docker run --runtime nvidia --gpus '"device=1"' \
    -v <Huggingface cache dir on your local machine>:/root/.cache/huggingface \
    -p 8001:8001 \
    --env "HF_TOKEN=<Your huggingface token>" \
    --ipc=host \
    --network=host \
    apostacyh/vllm:lmcache-0.1.0 \
    --model $model --gpu-memory-utilization 0.7 --port 8001 \
    --lmcache-config-file /lmcache/LMCache/examples/example.yaml
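
To see the sharing in action, send a prompt that reuses the same prefix to the second instance. Again a sketch, assuming both containers point at the same LMCache backend through their config files; if sharing works, the prefill for the shared prefix on port 8001 should be served from LMCache rather than recomputed from scratch, which you can roughly gauge by timing the request.

# Reuse the same prefix against the second instance (port 8001).
# The prefix KV should come from the shared LMCache backend instead of being recomputed.
PREFIX="You are a helpful assistant. Answer concisely."   # same placeholder prefix as above

time curl http://localhost:8001/v1/completions \
    -H "Content-Type: application/json" \
    -d "{
        \"model\": \"mistralai/Mistral-7B-Instruct-v0.2\",
        \"prompt\": \"$PREFIX Explain prefix caching in one sentence.\",
        \"max_tokens\": 64
    }"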

This method supports local, remote, or hybrid backends, so whichever vLLM deployment method you are already using should work with the LMCache container (excluding BentoML).
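
The two commands above point at different bundled configs (example-local.yaml for a local backend, example.yaml presumably for a remote or hybrid one). To check which backend a running instance is actually using, you can print its config from inside the container; the paths come from the commands above, and the container names are placeholders for whatever docker ps reports.

# Inspect the LMCache configs referenced by the two docker run commands above.
docker exec <first-container-name-or-id> cat /lmcache/LMCache/examples/example-local.yaml
docker exec <second-container-name-or-id> cat /lmcache/LMCache/examples/example.yaml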

LMCache: https://github.com/LMCache/LMCache/tree/dev
vLLM: https://github.com/vllm-project/vllm
upvoted an article 4 months ago