# dinov2_vitl14_trt_a4000_fp16

DINOv2 ViT-L/14 compiled into a TensorRT FP16 engine and served with Triton Inference Server on an A4000 GPU.

## Triton

```sh
make triton
```
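
The Makefile itself is not reproduced here; the `triton` target presumably brings up the Triton server container against `model_repository`. A hypothetical equivalent, assuming host networking with the gRPC port on 6001 to match the `perf_analyzer` command further below:

```sh
# Hypothetical equivalent of `make triton`; image tag and port layout are assumptions.
docker run --gpus all --rm -d --net host \
    -v $(pwd)/model_repository:/models \
    nvcr.io/nvidia/tritonserver:23.04-py3 \
    tritonserver --model-repository=/models \
                 --http-port=6000 --grpc-port=6001 --metrics-port=6002
```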

## Build TensorRT Model

```sh
make model
make trt
```
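
The exact recipes behind `make model` and `make trt` are not shown; presumably they export DINOv2 ViT-L/14 to ONNX and then compile it into an FP16 TensorRT engine. A rough sketch, assuming an ONNX export named `dinov2_vitl14.onnx` with an input tensor called `input` (the 560x560 shape matches the `perf_analyzer` call later; the batch limits are assumptions):

```sh
# Hypothetical equivalent of `make trt`: compile the ONNX export into an FP16
# TensorRT engine and place it where the Triton model repository expects it.
trtexec --onnx=dinov2_vitl14.onnx \
        --fp16 \
        --minShapes=input:1x3x560x560 \
        --optShapes=input:8x3x560x560 \
        --maxShapes=input:16x3x560x560 \
        --saveEngine=model_repository/dinov2_vitl14/1/model.plan
```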

```sh
tree model_repository
```

```
model_repository/
└── dinov2_vitl14
    ├── 1
    │   └── model.plan
    └── config.pbtxt
```
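
`config.pbtxt` itself is not listed above. A minimal sketch for a TensorRT plan model with this input shape might look like the following; the output tensor name and size (1024, the ViT-L/14 embedding width), the batch limit, and the dynamic-batching block are all assumptions:

```sh
# Hypothetical config.pbtxt; the real file may differ.
cat > model_repository/dinov2_vitl14/config.pbtxt <<'EOF'
name: "dinov2_vitl14"
platform: "tensorrt_plan"
max_batch_size: 16
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 3, 560, 560 ]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ 1024 ]
  }
]
dynamic_batching { }
EOF
```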

## Perf

`make perf` invokes `perf_analyzer` from the Triton SDK container against the running server:

```sh
make perf
```

```sh
docker run --gpus all --rm -it --net host nvcr.io/nvidia/tritonserver:23.04-py3-sdk \
    perf_analyzer -m dinov2_vitl14 --percentile=95 -i grpc -u 0.0.0.0:6001 \
    --concurrency-range 16:16 --shape input:3,560,560
```

```
=================================
== Triton Inference Server SDK ==
=================================

NVIDIA Release 23.04 (build 58408269)

Copyright (c) 2018-2023, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

NOTE: CUDA Forward Compatibility mode ENABLED.
  Using CUDA 12.1 driver version 530.30.02 with kernel driver version 525.125.06.
  See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.

*** Measurement Settings ***
  Batch size: 1
  Service Kind: Triton
  Using "time_windows" mode for stabilization
  Measurement window: 5000 msec
  Latency limit: 0 msec
  Concurrency limit: 16 concurrent requests
  Using synchronous calls for inference
  Stabilizing using p95 latency

Request concurrency: 16
  Client:
    Request count: 4009
    Throughput: 222.66 infer/sec
    p50 latency: 70762 usec
    p90 latency: 83940 usec
    p95 latency: 90235 usec
    p99 latency: 102226 usec
    Avg gRPC time: 71655 usec ((un)marshal request/response 741 usec + response wait 70914 usec)
  Server:
    Inference count: 4009
    Execution count: 728
    Successful request count: 4009
    Avg request latency: 66080 usec (overhead 8949 usec + queue 16114 usec + compute input 1163 usec + compute infer 24751 usec + compute output 15103 usec)

Inferences/Second vs. Client p95 Batch Latency
Concurrency: 16, throughput: 222.66 infer/sec, latency 90235 usec
```

The server-side stats above show 4009 inferences executed in 728 batches (roughly 5.5 requests per execution on average), so requests are being batched together on the server rather than run one at a time.
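
To sweep several client loads in one run instead of pinning concurrency at 16, `perf_analyzer` also accepts a `start:end:step` range (a sketch, reusing the same port and input shape):

```sh
docker run --gpus all --rm -it --net host nvcr.io/nvidia/tritonserver:23.04-py3-sdk \
    perf_analyzer -m dinov2_vitl14 --percentile=95 -i grpc -u 0.0.0.0:6001 \
    --concurrency-range 1:16:5 --shape input:3,560,560
```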