Vezora/Qwen2.5-Coder-32B-Instruct-fp8-W8A16

Vezora committed on Dec 5, 2024

Commit eb87bcf (verified) · 1 Parent(s): fb1c114

Update README.md

Files changed (1)

README.md +2 -2

@@ -59,7 +59,7 @@ python3 -m vllm.entrypoints.openai.api_server \
     --dtype auto \
     --api-key token-abc123 \
     --quantization compressed-tensors \
-    --max-num-batched-tokens 32768 \
-    --max-model-len 32768 \
+    --max-num-batched-tokens 16384 \
+    --max-model-len 16384 \
     --tensor-parallel-size 2 \
     --gpu-memory-utilization 0.99
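
For reference, here is a minimal sketch of what the launch command might look like after this change. Only the flags visible in the hunk come from the README. The `--model` argument and its value are assumptions, since that part of the command falls outside the hunk, and the `MODEL` and `MAX_LEN` variables are introduced purely for illustration. In vLLM, `--max-model-len` sets the context length the server accepts and `--max-num-batched-tokens` caps the tokens processed per scheduling step. The commit halves both from 32768 to 16384.

```shell
#!/bin/sh
# Sketch of the post-commit launch command; not copied from the README.
# ASSUMPTION: the model argument is not visible in the diff hunk.
MODEL="Vezora/Qwen2.5-Coder-32B-Instruct-fp8-W8A16"
# New value used for both limits in this commit (previously 32768).
MAX_LEN=16384

# Collect the server flags as positional parameters so quoting is preserved.
set -- \
    --model "$MODEL" \
    --dtype auto \
    --api-key token-abc123 \
    --quantization compressed-tensors \
    --max-num-batched-tokens "$MAX_LEN" \
    --max-model-len "$MAX_LEN" \
    --tensor-parallel-size 2 \
    --gpu-memory-utilization 0.99

# Print the assembled command for inspection.
echo python3 -m vllm.entrypoints.openai.api_server "$@"

# Uncomment to actually start the server (requires vLLM and two GPUs):
# python3 -m vllm.entrypoints.openai.api_server "$@"
```

Keeping the two limits in a single variable is a design choice of this sketch, not something the README does. It simply mirrors the fact that the commit changes both values together.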