# dinov2_vitl14_trt_a4000_fp16

DINOv2 ViT-L/14 served as an FP16 TensorRT engine on an RTX A4000 with Triton Inference Server.

## Triton

Start the Triton server:

```
make triton
```

## Build TensorRT Model

Export the model and build the TensorRT engine (see the build sketch at the end of this README):

```
make model
```

```
make trt
```

Inspect the resulting model repository:

```
tree model_repository
```

```
model_repository/
└── dinov2_vitl14
    ├── 1
    │   └── model.plan
    └── config.pbtxt
```

## Perf

```
make perf
```

```
docker run --gpus all --rm -it --net host nvcr.io/nvidia/tritonserver:23.04-py3-sdk \
    perf_analyzer -m dinov2_vitl14 --percentile=95 -i grpc -u 0.0.0.0:6001 \
    --concurrency-range 16:16 --shape input:3,560,560

=================================
== Triton Inference Server SDK ==
=================================

NVIDIA Release 23.04 (build 58408269)

Copyright (c) 2018-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

NOTE: CUDA Forward Compatibility mode ENABLED.
  Using CUDA 12.1 driver version 530.30.02 with kernel driver version 525.125.06.
  See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.

*** Measurement Settings ***
  Batch size: 1
  Service Kind: Triton
  Using "time_windows" mode for stabilization
  Measurement window: 5000 msec
  Latency limit: 0 msec
  Concurrency limit: 16 concurrent requests
  Using synchronous calls for inference
  Stabilizing using p95 latency

Request concurrency: 16
  Client:
    Request count: 4009
    Throughput: 222.66 infer/sec
    p50 latency: 70762 usec
    p90 latency: 83940 usec
    p95 latency: 90235 usec
    p99 latency: 102226 usec
    Avg gRPC time: 71655 usec ((un)marshal request/response 741 usec + response wait 70914 usec)
  Server:
    Inference count: 4009
    Execution count: 728
    Successful request count: 4009
    Avg request latency: 66080 usec (overhead 8949 usec + queue 16114 usec + compute input 1163 usec + compute infer 24751 usec + compute output 15103 usec)

Inferences/Second vs. Client p95 Batch Latency
Concurrency: 16, throughput: 222.66 infer/sec, latency 90235 usec
```

At concurrency 16 the client sustains ~223 infer/sec at a p95 latency of ~90 ms. Note the server-side counters: 4009 inferences over 728 executions means dynamic batching groups roughly 5.5 requests per engine run.
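## Engine Build Sketch

The `make model` and `make trt` targets live in the Makefile and are not reproduced above. Below is a minimal sketch of an equivalent flow, assuming the torch.hub DINOv2 checkpoint and the TensorRT Python API; the file names, tensor names, and the batch 1..16 optimization profile at 560×560 are illustrative assumptions, not taken from the Makefile.

```python
# Sketch: export DINOv2 ViT-L/14 to ONNX, then build an FP16 TensorRT plan.
# Assumes torch and tensorrt are available (e.g. inside an NGC container).
import torch
import tensorrt as trt

# 1. Export to ONNX with a dynamic batch dimension (560 = 40 x 14-pixel patches).
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14").eval()
dummy = torch.randn(1, 3, 560, 560)
torch.onnx.export(
    model, dummy, "dinov2_vitl14.onnx",
    input_names=["input"], output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
)

# 2. Parse the ONNX graph and build a serialized FP16 engine.
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("dinov2_vitl14.onnx", "rb") as f:
    assert parser.parse(f.read()), parser.get_error(0)

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)
profile = builder.create_optimization_profile()
profile.set_shape("input", (1, 3, 560, 560), (8, 3, 560, 560), (16, 3, 560, 560))
config.add_optimization_profile(profile)

# 3. Write the plan where Triton expects it.
plan = builder.build_serialized_network(network, config)
with open("model_repository/dinov2_vitl14/1/model.plan", "wb") as f:
    f.write(plan)
```

The same build can be done on the command line with `trtexec --onnx=... --fp16 --saveEngine=...` plus `--minShapes/--optShapes/--maxShapes`; the Python API is shown here only to keep the sketch self-contained.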
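## Example config.pbtxt

The actual `config.pbtxt` is in the repository; a plausible minimal version for this layout could look like the following. The tensor names, FP32 I/O, and `max_batch_size` of 16 are assumptions (ViT-L/14 produces a 1024-dimensional embedding), though the perf numbers above (4009 inferences over 728 executions) do confirm that dynamic batching is enabled:

```
name: "dinov2_vitl14"
platform: "tensorrt_plan"
max_batch_size: 16
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 3, 560, 560 ]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ 1024 ]
  }
]
dynamic_batching { }
```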
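## Example gRPC Client

For a quick end-to-end check outside perf_analyzer, here is a minimal `tritonclient` sketch against the gRPC endpoint from the command above; the `input`/`output` tensor names are the same assumptions as in the config sketch.

```python
# Sketch: one inference against the running server; pip install tritonclient[grpc].
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="0.0.0.0:6001")

# Placeholder batch of one 560x560 RGB image; real use needs DINOv2 preprocessing.
image = np.random.rand(1, 3, 560, 560).astype(np.float32)
infer_input = grpcclient.InferInput("input", list(image.shape), "FP32")
infer_input.set_data_from_numpy(image)

result = client.infer(model_name="dinov2_vitl14", inputs=[infer_input])
embedding = result.as_numpy("output")
print(embedding.shape)  # expected (1, 1024) for a ViT-L/14 embedding
```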