# dinov2_vitl14_trt_a4000_fp16
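Serve DINOv2 ViT-L/14 as an FP16 TensorRT engine on an NVIDIA RTX A4000 with Triton Inference Server.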
## Triton
```
make triton
```
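`make triton` starts the Triton Inference Server container and points it at the local `model_repository`. The actual target lives in the Makefile; a minimal sketch of an equivalent command, assuming the 23.04 server image and host networking with the gRPC port moved to 6001 to match the `perf_analyzer` call further below:
```
docker run --gpus all --rm -it --net host \
  -v $(pwd)/model_repository:/models \
  nvcr.io/nvidia/tritonserver:23.04-py3 \
  tritonserver --model-repository=/models \
    --http-port 6000 --grpc-port 6001 --metrics-port 6002
```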
## Build TensorRT Model
```
make model
```
```
make trt
```
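`make model` presumably exports the DINOv2 ViT-L/14 checkpoint to ONNX, and `make trt` converts that ONNX graph into the FP16 TensorRT plan that Triton serves. A rough `trtexec` equivalent of the conversion step (the ONNX file name and the shape ranges are assumptions; the Makefile is the source of truth):
```
trtexec --onnx=dinov2_vitl14.onnx \
        --fp16 \
        --minShapes=input:1x3x560x560 \
        --optShapes=input:8x3x560x560 \
        --maxShapes=input:16x3x560x560 \
        --saveEngine=model_repository/dinov2_vitl14/1/model.plan
```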
```
tree model_repository
```
```
model_repository/
└── dinov2_vitl14
    ├── 1
    │   └── model.plan
    └── config.pbtxt
```
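`config.pbtxt` tells Triton to load the plan with the TensorRT backend. A plausible configuration for this layout (tensor names, data types, dimensions, and the batching settings are assumptions, not copied from the repository):
```
name: "dinov2_vitl14"
platform: "tensorrt_plan"
max_batch_size: 16
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 3, -1, -1 ]   # variable H and W; perf_analyzer sends 3x560x560
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ 1024 ]        # ViT-L/14 embedding width; adjust to the actual export
  }
]
dynamic_batching { }
```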
## Perf
```
make perf
```
```
docker run --gpus all --rm -it --net host nvcr.io/nvidia/tritonserver:23.04-py3-sdk perf_analyzer -m dinov2_vitl14 --percentile=95 -i grpc -u 0.0.0.0:6001 --concurrency-range 16:16 --shape input:3,560,560
=================================
== Triton Inference Server SDK ==
=================================
NVIDIA Release 23.04 (build 58408269)
Copyright (c) 2018-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
NOTE: CUDA Forward Compatibility mode ENABLED.
Using CUDA 12.1 driver version 530.30.02 with kernel driver version 525.125.06.
See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.
*** Measurement Settings ***
  Batch size: 1
  Service Kind: Triton
  Using "time_windows" mode for stabilization
  Measurement window: 5000 msec
  Latency limit: 0 msec
  Concurrency limit: 16 concurrent requests
  Using synchronous calls for inference
  Stabilizing using p95 latency

Request concurrency: 16
  Client:
    Request count: 4009
    Throughput: 222.66 infer/sec
    p50 latency: 70762 usec
    p90 latency: 83940 usec
    p95 latency: 90235 usec
    p99 latency: 102226 usec
    Avg gRPC time: 71655 usec ((un)marshal request/response 741 usec + response wait 70914 usec)
  Server:
    Inference count: 4009
    Execution count: 728
    Successful request count: 4009
    Avg request latency: 66080 usec (overhead 8949 usec + queue 16114 usec + compute input 1163 usec + compute infer 24751 usec + compute output 15103 usec)

Inferences/Second vs. Client p95 Batch Latency
Concurrency: 16, throughput: 222.66 infer/sec, latency 90235 usec
```
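Note that the server reports 4009 inferences over only 728 executions, roughly 5 to 6 requests per batch on average, which suggests dynamic batching is coalescing the 16 concurrent requests before they reach the TensorRT engine.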