Update README.md
How to use the script:

Have vLLM installed and run `pip install llmcompressor==0.1.0`.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Then just run the script: it will ask you for the model name; enter it, and it does the rest. **NOTE:** the script loads the model into CPU RAM to avoid OOM errors. If you somehow have more GPU VRAM than CPU RAM, edit the script to load to the GPU instead.
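The CPU-RAM note above boils down to one comparison. A minimal sketch of that device choice, assuming a hypothetical helper name (`choose_device` and the RAM figures below are illustrative, not part of the actual script):

```python
def choose_device(cpu_ram_gb: float, gpu_vram_gb: float) -> str:
    """Pick where to load the model before quantization.

    Default to CPU RAM to avoid GPU OOM; only load to GPU when
    VRAM actually exceeds system RAM (the rare case the note describes).
    """
    return "cuda" if gpu_vram_gb > cpu_ram_gb else "cpu"

# Typical workstation: 128 GB RAM, 24 GB VRAM -> stay on CPU.
print(choose_device(cpu_ram_gb=128, gpu_vram_gb=24))  # cpu
# Unusual case: more VRAM than RAM -> edit the script to load to GPU.
print(choose_device(cpu_ram_gb=64, gpu_vram_gb=80))   # cuda
```

In the script itself this choice would correspond to the device the model is loaded onto (e.g. a Hugging Face-style `device_map="cpu"` vs. `"cuda"`), but the exact loading call depends on the script's code.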
## How to Run
To launch the API server for this model, use the following command:
```bash
python3 -m vllm.entrypoints.openai.api_server \
--model Vezora/Qwen2.5-Coder-32B-Instruct-fp8-W8A16 \
--dtype auto \
--api-key token-abc123 \
--quantization compressed-tensors \
--max-num-batched-tokens 32768 \
--max-model-len 32768 \
--tensor-parallel-size 2 \
--gpu-memory-utilization 0.99
```
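The `--max-model-len 32768` and `--tensor-parallel-size 2` flags interact with KV-cache memory, which is what `--gpu-memory-utilization 0.99` is budgeting for. A back-of-the-envelope estimate, assuming Qwen2.5-32B's published architecture (64 layers, 8 KV heads via GQA, head dim 128) and a 16-bit KV cache — these figures are assumptions, so check the model's `config.json` for the exact values:

```python
# Rough KV-cache sizing for one full-length sequence (assumed figures;
# verify against the model's config.json).
layers = 64         # num_hidden_layers (assumed for Qwen2.5-32B)
kv_heads = 8        # num_key_value_heads under GQA (assumed)
head_dim = 128      # per-head dimension (assumed)
bytes_per_elem = 2  # fp16/bf16 KV cache
max_len = 32768     # matches --max-model-len

# K and V each store layers * kv_heads * head_dim values per token.
bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
total_gib = bytes_per_token * max_len / 2**30

print(f"{bytes_per_token} bytes/token, {total_gib:.1f} GiB for one 32768-token sequence")
```

Under these assumptions a single full-length sequence needs about 8 GiB of KV cache on top of the weights; `--tensor-parallel-size 2` splits both the weights and that cache across the two GPUs.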