Update README.md
README.md
CHANGED
@@ -14,7 +14,7 @@ docker network create vllm
docker run --runtime=nvidia --gpus all --network vllm --name vllm -v vllm_cache:/root/.cache/huggingface --env "HUGGING_FACE_HUB_TOKEN=..." --env "HF_HUB_ENABLE_HF_TRANSFER=0" -p 8000:8000 --ipc=host vllm/vllm-openai:latest --model Yujivus/Phi-4-Health-CoT-1.1-AWQ --quantization awq_marlin --dtype float16 --gpu-memory-utilization 0.95 --max-model-len 2500
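Once the container is up, a quick way to confirm the server is reachable before benchmarking is to list the served models through the OpenAI-compatible endpoint. This is a minimal sketch, not part of the original README; the `api_key` value is arbitrary here on the assumption that no `--api-key` was passed to vLLM:

```python
# Readiness check against the vLLM OpenAI-compatible server started above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # any placeholder key works when vLLM has no --api-key set
print([m.id for m in client.models.list().data])  # should include 'Yujivus/Phi-4-Health-CoT-1.1-AWQ'
```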
You can test vLLM's speed:
import asyncio
from openai import AsyncOpenAI
@@ -73,7 +73,7 @@ async def main():
if __name__ == "__main__":
    asyncio.run(main())
Since the model is quantized with AWQ (GEMM), you should see maximum throughput at around 8 concurrent requests.
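The test script is only partially visible in this diff, so here is a self-contained sketch of a concurrency sweep along the same lines. It is not the exact script from this README: the prompt, `max_tokens`, and `temperature` values are illustrative assumptions, and the model name is taken from the docker command above.

```python
# Hypothetical concurrency sweep against the vLLM server started above.
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "Yujivus/Phi-4-Health-CoT-1.1-AWQ"
PROMPT = "List three common causes of fatigue."  # illustrative prompt, not from the README

async def one_request() -> int:
    # Send a single chat completion and return how many tokens were generated.
    resp = await client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": PROMPT}],
        max_tokens=256,
        temperature=0.7,
    )
    return resp.usage.completion_tokens

async def main() -> None:
    # Increase the number of parallel requests and report generated tokens per second.
    for concurrency in (1, 2, 4, 8, 16):
        start = time.perf_counter()
        tokens = await asyncio.gather(*(one_request() for _ in range(concurrency)))
        elapsed = time.perf_counter() - start
        print(f"{concurrency:>2} concurrent requests: "
              f"{sum(tokens) / elapsed:.1f} generated tokens/s over {elapsed:.1f}s")

if __name__ == "__main__":
    asyncio.run(main())
```

If the saturation point described above holds, tokens/s should climb with concurrency and flatten out roughly around 8 parallel requests.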