Update README.md
README.md CHANGED
@@ -294,7 +294,6 @@ Our INT4 model is only optimized for batch size 1, so expect some slowdown with
 |----------------------------------|----------------|----------------------------|
 |                                  | Phi-4 mini-Ins | phi4-mini-INT4             |
 | latency (batch_size=1)           | 2.46s          | 2.2s (1.12x speedup)       |
-| serving (num_prompts=1)          | 0.87 req/s     | 1.05 req/s (1.20x speedup) |

 ## Results (H100 machine)
 | Benchmark (Latency) | | |
@@ -333,43 +332,6 @@ python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model
 VLLM_DISABLE_COMPILE_CACHE=1 python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model pytorch/Phi-4-mini-instruct-INT4 --batch-size 1
 ```
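A hedged sketch, not from the original README: since the INT4 checkpoint is only optimized for batch size 1, sweeping `--batch-size` over the same command shows where the speedup falls off. Everything except the loop is the documented invocation above.

```Shell
# Sketch: sweep batch sizes to observe the batch-size-1 optimization;
# larger batches may be slower than the BF16 baseline.
for bs in 1 2 4 8; do
  echo "batch_size=${bs}"
  VLLM_DISABLE_COMPILE_CACHE=1 python benchmarks/benchmark_latency.py \
    --input-len 256 --output-len 256 \
    --model pytorch/Phi-4-mini-instruct-INT4 --batch-size "${bs}"
done
```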
 
-## benchmark_serving
-
-We benchmarked throughput in a serving environment.
-
-Download the ShareGPT dataset:
-
-```Shell
-wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
-```
-
-Other datasets can be found at https://github.com/vllm-project/vllm/tree/main/benchmarks.
-
-Note: you can change the number of prompts to benchmark with the `--num-prompts` argument of the `benchmark_serving` script.
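As a hedged illustration of that flag (the value 100 is arbitrary, not from the README), the client command from the sections below can simply be re-run with a larger prompt count:

```Shell
# Example only: benchmark 100 ShareGPT prompts instead of 1.
python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt \
  --tokenizer microsoft/Phi-4-mini-instruct \
  --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json \
  --model microsoft/Phi-4-mini-instruct --num-prompts 100
```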
-
-### baseline
-Server:
-```Shell
-vllm serve microsoft/Phi-4-mini-instruct --tokenizer microsoft/Phi-4-mini-instruct -O3
-```
-
-Client:
-```Shell
-python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model microsoft/Phi-4-mini-instruct --num-prompts 1
-```
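A quick sanity check before launching the client, assuming vLLM's default OpenAI-compatible server on port 8000 (an assumption, not stated in the README):

```Shell
# Verify the server is up and lists the expected model;
# vllm serve exposes an OpenAI-compatible API on port 8000 by default.
curl http://localhost:8000/v1/models
```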
-
-### INT4
-Server:
-```Shell
-VLLM_DISABLE_COMPILE_CACHE=1 vllm serve pytorch/Phi-4-mini-instruct-INT4 --tokenizer microsoft/Phi-4-mini-instruct -O3 --pt-load-map-location cuda:0
-```
-
-Client:
-```Shell
-python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model pytorch/Phi-4-mini-instruct-INT4 --num-prompts 1
-```
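Putting the two halves together, a hypothetical end-to-end runner; the backgrounding, readiness polling, and cleanup are assumptions layered on the documented commands:

```Shell
# Sketch: launch the INT4 server, wait for readiness, benchmark, clean up.
VLLM_DISABLE_COMPILE_CACHE=1 vllm serve pytorch/Phi-4-mini-instruct-INT4 \
  --tokenizer microsoft/Phi-4-mini-instruct -O3 --pt-load-map-location cuda:0 &
SERVER_PID=$!
# Poll the (assumed) default endpoint until the server answers.
until curl -sf http://localhost:8000/v1/models > /dev/null; do sleep 5; done
python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt \
  --tokenizer microsoft/Phi-4-mini-instruct \
  --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json \
  --model pytorch/Phi-4-mini-instruct-INT4 --num-prompts 1
kill "$SERVER_PID"
```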
 </details>

 # Paper: TorchAO: PyTorch-Native Training-to-Serving Model Optimization