jerryzh168 committed
Commit 7fdfcf2 · verified · 1 Parent(s): 0b668ec

Update README.md

Files changed (1): README.md +0 -38
README.md CHANGED
@@ -294,7 +294,6 @@ Our INT4 model is only optimized for batch size 1, so expect some slowdown with
  |--------------------------|----------------|----------------------------|
  |                          | Phi-4 mini-Ins | phi4-mini-INT4             |
  | latency (batch_size=1)   | 2.46s          | 2.2s (1.12x speedup)       |
- | serving (num_prompts=1)  | 0.87 req/s     | 1.05 req/s (1.20x speedup) |
 
  ## Results (H100 machine)
  | Benchmark (Latency) | | |
@@ -333,43 +332,6 @@ python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model
  VLLM_DISABLE_COMPILE_CACHE=1 python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model pytorch/Phi-4-mini-instruct-INT4 --batch-size 1
  ```
 
- ## benchmark_serving
-
- We benchmarked throughput in a serving environment.
-
- Download the ShareGPT dataset:
-
- ```Shell
- wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
- ```
-
-
-
- Other datasets can be found at: https://github.com/vllm-project/vllm/tree/main/benchmarks
-
- Note: you can change the number of prompts to benchmark with the `--num-prompts` argument of the `benchmark_serving` script.
-
- ### baseline
- Server:
- ```Shell
- vllm serve microsoft/Phi-4-mini-instruct --tokenizer microsoft/Phi-4-mini-instruct -O3
- ```
-
- Client:
- ```Shell
- python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model microsoft/Phi-4-mini-instruct --num-prompts 1
- ```
-
- ### INT4
- Server:
- ```Shell
- VLLM_DISABLE_COMPILE_CACHE=1 vllm serve pytorch/Phi-4-mini-instruct-INT4 --tokenizer microsoft/Phi-4-mini-instruct -O3 --pt-load-map-location cuda:0
- ```
-
- Client:
- ```Shell
- python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model pytorch/Phi-4-mini-instruct-INT4 --num-prompts 1
- ```
  </details>
 
  # Paper: TorchAO: PyTorch-Native Training-to-Serving Model Optimization
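
As a usage illustration of the `--num-prompts` note in the removed section above, here is a minimal sketch of the same client command benchmarking 100 prompts instead of 1. The value 100 is an arbitrary example; every other flag and path is copied from the removed section and has not been re-verified against current vLLM.

```Shell
# Sketch, not from the original README: the removed section's client command,
# with --num-prompts raised from 1 to 100 (100 is an arbitrary example value).
python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt \
  --tokenizer microsoft/Phi-4-mini-instruct \
  --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json \
  --model microsoft/Phi-4-mini-instruct --num-prompts 100
```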
 