|
# dinov2_vitl14_trt_a4000_fp16

DINOv2 ViT-L/14 served as a TensorRT FP16 engine on an NVIDIA RTX A4000 through Triton Inference Server.

## Triton

Start the Triton Inference Server:

```
make triton
```
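
A minimal sketch of the kind of command the `make triton` target presumably wraps, assuming the 23.04 Triton container (matching the SDK image used for perf below) with the default HTTP/gRPC/metrics ports remapped to 6000/6001/6002 so that gRPC answers on the `0.0.0.0:6001` endpoint the perf run targets:

```
docker run --gpus all --rm -d \
  -p 6000:8000 -p 6001:8001 -p 6002:8002 \
  -v $(pwd)/model_repository:/models \
  nvcr.io/nvidia/tritonserver:23.04-py3 \
  tritonserver --model-repository=/models
```

Once the server is up, readiness can be checked against the standard health endpoint:

```
curl -v 0.0.0.0:6000/v2/health/ready
```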
|
|
|
## Build TensorRT Model

Building the engine is a two-step process. First, prepare the model for conversion:

```
make model
```
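
The TensorRT step needs an intermediate graph, so `make model` presumably exports DINOv2 ViT-L/14 to ONNX. A sketch of such an export, assuming the official torch.hub entry point; the `model.onnx` path is illustrative, and 560 = 40 × 14-pixel patches matches the shape used by perf_analyzer below:

```
python - <<'EOF'
import torch

# Official DINOv2 hub entry point; downloads weights on first use.
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14").eval()

# 560 x 560 is 40 x 40 patches of 14 px each.
dummy = torch.randn(1, 3, 560, 560)

# Batch and spatial dims are marked dynamic (assumption: the trace
# supports variable input sizes, which perf_analyzer's explicit
# --shape flag suggests).
torch.onnx.export(
    model, dummy, "model.onnx",
    input_names=["input"], output_names=["output"],
    dynamic_axes={"input": {0: "batch", 2: "height", 3: "width"},
                  "output": {0: "batch"}},
    opset_version=17,
)
EOF
```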
|
|
|
|
|
Then build the FP16 engine:

```
make trt
```
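
A hedged `trtexec` equivalent of what `make trt` likely does, writing the serialized engine to the `model.plan` location shown in the repository tree below. FP16 follows from the repo name; the optimization profile ranges are illustrative assumptions:

```
trtexec --onnx=model.onnx --fp16 \
  --minShapes=input:1x3x224x224 \
  --optShapes=input:8x3x560x560 \
  --maxShapes=input:16x3x560x560 \
  --saveEngine=model_repository/dinov2_vitl14/1/model.plan
```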
|
|
|
The resulting model repository layout:

```
tree model_repository
```
|
```
model_repository/
└── dinov2_vitl14
    ├── 1
    │   └── model.plan
    └── config.pbtxt
```
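
The contents of `config.pbtxt` are not shown above; the following is a plausible configuration for this layout, assuming FP32 I/O tensors, the 1024-dim ViT-L embedding as output, and dynamic batching (the perf report below shows 4009 inferences served in 728 executions, i.e. roughly 5.5 requests grouped per engine run):

```
name: "dinov2_vitl14"
platform: "tensorrt_plan"
max_batch_size: 16
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 3, -1, -1 ]    # variable H/W; hence --shape on perf_analyzer
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ 1024 ]
  }
]
dynamic_batching { }
```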
|
|
|
|
|
## Perf

Measure throughput and latency with `perf_analyzer` from the Triton SDK container:

```
make perf
```
|
|
|
```
docker run --gpus all --rm -it --net host nvcr.io/nvidia/tritonserver:23.04-py3-sdk \
  perf_analyzer -m dinov2_vitl14 --percentile=95 -i grpc -u 0.0.0.0:6001 \
  --concurrency-range 16:16 --shape input:3,560,560

=================================
== Triton Inference Server SDK ==
=================================

NVIDIA Release 23.04 (build 58408269)

Copyright (c) 2018-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

NOTE: CUDA Forward Compatibility mode ENABLED.
  Using CUDA 12.1 driver version 530.30.02 with kernel driver version 525.125.06.
  See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.

*** Measurement Settings ***
  Batch size: 1
  Service Kind: Triton
  Using "time_windows" mode for stabilization
  Measurement window: 5000 msec
  Latency limit: 0 msec
  Concurrency limit: 16 concurrent requests
  Using synchronous calls for inference
  Stabilizing using p95 latency

Request concurrency: 16
  Client:
    Request count: 4009
    Throughput: 222.66 infer/sec
    p50 latency: 70762 usec
    p90 latency: 83940 usec
    p95 latency: 90235 usec
    p99 latency: 102226 usec
    Avg gRPC time: 71655 usec ((un)marshal request/response 741 usec + response wait 70914 usec)
  Server:
    Inference count: 4009
    Execution count: 728
    Successful request count: 4009
    Avg request latency: 66080 usec (overhead 8949 usec + queue 16114 usec + compute input 1163 usec + compute infer 24751 usec + compute output 15103 usec)

Inferences/Second vs. Client p95 Batch Latency
Concurrency: 16, throughput: 222.66 infer/sec, latency 90235 usec
```

At concurrency 16 the A4000 sustains about 223 inferences/sec at a p95 latency of roughly 90 ms; the server handled 4009 requests in 728 executions, i.e. the dynamic batcher grouped about 5.5 single-image requests per engine run.
|
|