Vezora commited on
Commit
fb1c114
·
verified ·
1 Parent(s): a4df890

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +16 -1
README.md CHANGED
@@ -47,4 +47,19 @@ How to use the script:
47
 
48
  Have VLLM installed and run 'pip install llmcompressor==0.1.0'.
49
 
50
- Then literally run the script it will ask you for the model name enter it and it will do the rest **NOTE** this will use CPU ram to avoid OOM errors if you somehow on gods green earth have more GPU vram than CPU ram, edit the script to load to gpu.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
47
 
48
  Have VLLM installed and run 'pip install llmcompressor==0.1.0'.
49
 
50
+ Then literally run the script it will ask you for the model name enter it and it will do the rest **NOTE** this will use CPU ram to avoid OOM errors if you somehow on gods green earth have more GPU vram than CPU ram, edit the script to load to gpu.
51
+
52
+ ## How to Run
53
+
54
+ To launch the API server for this model, use the following command:
55
+
56
+ ```bash
57
+ python3 -m vllm.entrypoints.openai.api_server \
58
+ --model Vezora/Qwen2.5-Coder-32B-Instruct-fp8-W8A16 \
59
+ --dtype auto \
60
+ --api-key token-abc123 \
61
+ --quantization compressed-tensors \
62
+ --max-num-batched-tokens 32768 \
63
+ --max-model-len 32768 \
64
+ --tensor-parallel-size 2 \
65
+ --gpu-memory-utilization 0.99