Yujivus committed (verified)
Commit 014e08b · Parent(s): 9ad9b7c

Update README.md

Files changed (1): README.md (+2 −2)
README.md CHANGED
@@ -14,7 +14,7 @@ docker network create vllm
 docker run --runtime=nvidia --gpus all --network vllm --name vllm -v vllm_cache:/root/.cache/huggingface --env "HUGGING_FACE_HUB_TOKEN=..." --env "HF_HUB_ENABLE_HF_TRANSFER=0" -p 8000:8000 --ipc=host vllm/vllm-openai:latest --model Yujivus/Phi-4-Health-CoT-1.1-AWQ --quantization awq_marlin --dtype float16 --gpu-memory-utilization 0.95 --max-model-len 2500
 
 You can test vLLM's speed:
-""""
+
 import asyncio
 from openai import AsyncOpenAI
 
@@ -73,7 +73,7 @@ async def main():
 
 if __name__ == "__main__":
     asyncio.run(main())
-""""
+
 
 Since the model is quantized with AWQ (GEMM), you should see maximum throughput at around 8 concurrent requests.
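Once the container from the docker command above is running, you can confirm the OpenAI-compatible endpoint is serving the model before benchmarking. This is a minimal sketch, not part of the commit: it assumes the default `-p 8000:8000` port mapping, and the `api_key` value is a placeholder (vLLM ignores it unless a key is configured).

```python
# Sanity check: list the models served by the vLLM container started above.
# Assumes the server is reachable at localhost:8000 (the -p 8000:8000 mapping).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Should print ["Yujivus/Phi-4-Health-CoT-1.1-AWQ"] once the model has loaded.
print([m.id for m in client.models.list().data])
```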
 
 
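The diff only shows the edges of the speed-test script (the imports and the `asyncio.run(main())` entry point), so the sketch below is an illustrative stand-in rather than the author's code: it fires a batch of concurrent chat completions with `AsyncOpenAI` and reports tokens per second. The prompt, `max_tokens`, and the `one_request` helper are assumptions.

```python
# Sketch of a concurrency benchmark against the vLLM server started above.
# Assumptions (not from the diff): endpoint at localhost:8000, illustrative
# prompt and max_tokens; the README's actual script body is not shown here.
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's OpenAI-compatible endpoint
    api_key="EMPTY",                      # placeholder; vLLM ignores it by default
)

NUM_REQUESTS = 8  # the README suggests throughput peaks around 8 concurrent requests


async def one_request(i: int) -> int:
    """Send one chat completion and return the number of completion tokens."""
    response = await client.chat.completions.create(
        model="Yujivus/Phi-4-Health-CoT-1.1-AWQ",
        messages=[{"role": "user", "content": "Explain what chain-of-thought reasoning is."}],
        max_tokens=256,
    )
    return response.usage.completion_tokens


async def main():
    start = time.perf_counter()
    # Launch all requests concurrently so the server can batch them.
    token_counts = await asyncio.gather(*(one_request(i) for i in range(NUM_REQUESTS)))
    elapsed = time.perf_counter() - start
    total = sum(token_counts)
    print(f"{NUM_REQUESTS} requests, {total} completion tokens "
          f"in {elapsed:.1f}s -> {total / elapsed:.1f} tok/s")


if __name__ == "__main__":
    asyncio.run(main())
```

Varying `NUM_REQUESTS` above and below 8 makes it easy to see the throughput knee the README describes.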