# dinov2_vitl14_onnx
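
ONNX export of the DINOv2 ViT-L/14 backbone, served with NVIDIA Triton Inference Server through the onnxruntime backend.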

## Run Triton

```
make triton
=============================
== Triton Inference Server ==
=============================

NVIDIA Release 23.04 (build 58408265)
Triton Server Version 2.33.0

Copyright (c) 2018-2023, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

NOTE: CUDA Forward Compatibility mode ENABLED.
  Using CUDA 12.1 driver version 530.30.02 with kernel driver version 525.125.06.
  See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.

I0715 04:13:59.173070 1 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f1a70000000' with size 268435456
I0715 04:13:59.173293 1 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0715 04:13:59.175108 1 model_lifecycle.cc:459] loading: dinov2_vitl14:1
I0715 04:13:59.177471 1 onnxruntime.cc:2504] TRITONBACKEND_Initialize: onnxruntime
I0715 04:13:59.177510 1 onnxruntime.cc:2514] Triton TRITONBACKEND API version: 1.12
I0715 04:13:59.177518 1 onnxruntime.cc:2520] 'onnxruntime' TRITONBACKEND API version: 1.12
I0715 04:13:59.177525 1 onnxruntime.cc:2550] backend configuration:
{"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}}
I0715 04:13:59.233419 1 onnxruntime.cc:2608] TRITONBACKEND_ModelInitialize: dinov2_vitl14 (version 1)
I0715 04:13:59.233847 1 onnxruntime.cc:666] skipping model configuration auto-complete for 'dinov2_vitl14': inputs and outputs already specified
I0715 04:13:59.234233 1 onnxruntime.cc:2651] TRITONBACKEND_ModelInstanceInitialize: dinov2_vitl14_0 (GPU device 0)
2023-07-15 04:13:59.546824126 [W:onnxruntime:, session_state.cc:1136 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2023-07-15 04:13:59.546847104 [W:onnxruntime:, session_state.cc:1138 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
I0715 04:14:00.851748 1 model_lifecycle.cc:694] successfully loaded 'dinov2_vitl14' version 1
I0715 04:14:00.851859 1 server.cc:583]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0715 04:14:00.851944 1 server.cc:610]
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Backend     | Path                                                            | Config                                                                                                                                                        |
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
| onnxruntime | /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0715 04:14:00.852005 1 server.cc:653]
+---------------+---------+--------+
| Model         | Version | Status |
+---------------+---------+--------+
| dinov2_vitl14 | 1       | READY  |
+---------------+---------+--------+

I0715 04:14:00.872645 1 metrics.cc:808] Collecting metrics for GPU 0: NVIDIA RTX A4000
I0715 04:14:00.873026 1 metrics.cc:701] Collecting CPU metrics
I0715 04:14:00.873315 1 tritonserver.cc:2387]
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                                                                                                                           |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                                                                                                                          |
| server_version                   | 2.33.0                                                                                                                                                                                                          |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
| model_repository_path[0]         | /models                                                                                                                                                                                                         |
| model_control_mode               | MODE_NONE                                                                                                                                                                                                       |
| strict_model_config              | 0                                                                                                                                                                                                               |
| rate_limit                       | OFF                                                                                                                                                                                                             |
| pinned_memory_pool_byte_size     | 268435456                                                                                                                                                                                                       |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                                                                                                        |
| min_supported_compute_capability | 6.0                                                                                                                                                                                                             |
| strict_readiness                 | 1                                                                                                                                                                                                               |
| exit_timeout                     | 30                                                                                                                                                                                                              |
| cache_enabled                    | 0                                                                                                                                                                                                               |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0715 04:14:00.875498 1 grpc_server.cc:2450] Started GRPCInferenceService at 0.0.0.0:8001
I0715 04:14:00.875964 1 http_server.cc:3555] Started HTTPService at 0.0.0.0:8000
I0715 04:14:00.917871 1 http_server.cc:185] Started Metrics Service at 0.0.0.0:8002
```
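
The startup log shows that configuration auto-complete is skipped because the model repository already specifies inputs and outputs. The actual `config.pbtxt` is not reproduced in this README; the sketch below is a plausible reconstruction consistent with the log and with the perf run that follows. The tensor names `input` and `output`, the 1024-dim embedding (ViT-L/14's hidden width), and the `max_batch_size` value are assumptions, not confirmed values.

```
name: "dinov2_vitl14"
backend: "onnxruntime"
max_batch_size: 8          # assumption; the perf run below averages ~8 inferences per execution
input [
  {
    name: "input"          # assumed name; matches --shape input:3,560,560 used below
    data_type: TYPE_FP32
    dims: [ 3, 560, 560 ]
  }
]
output [
  {
    name: "output"         # assumed name; 1024 is the ViT-L/14 embedding width
    data_type: TYPE_FP32
    dims: [ 1024 ]
  }
]
dynamic_batching { }       # batching is evident from the inference vs. execution counts below
instance_group [ { count: 1, kind: KIND_GPU } ]
```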

## Perf Analyzer

```
docker run --gpus all --rm -it --net host nvcr.io/nvidia/tritonserver:23.04-py3-sdk \
  perf_analyzer -m dinov2_vitl14 --percentile=95 -i grpc -u 0.0.0.0:8001 \
  --concurrency-range 16:16 --shape input:3,560,560

=================================
== Triton Inference Server SDK ==
=================================

NVIDIA Release 23.04 (build 58408269)

Copyright (c) 2018-2023, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

NOTE: CUDA Forward Compatibility mode ENABLED.
  Using CUDA 12.1 driver version 530.30.02 with kernel driver version 525.125.06.
  See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.

*** Measurement Settings ***
  Batch size: 1
  Service Kind: Triton
  Using "time_windows" mode for stabilization
  Measurement window: 5000 msec
  Latency limit: 0 msec
  Concurrency limit: 16 concurrent requests
  Using synchronous calls for inference
  Stabilizing using p95 latency

Request concurrency: 16
  Client:
    Request count: 881
    Throughput: 48.927 infer/sec
    p50 latency: 324015 usec
    p90 latency: 330275 usec
    p95 latency: 331952 usec
    p99 latency: 336638 usec
    Avg gRPC time: 323066 usec ((un)marshal request/response 953 usec + response wait 322113 usec)
  Server:
    Inference count: 881
    Execution count: 111
    Successful request count: 881
    Avg request latency: 313673 usec (overhead 7065 usec + queue 151785 usec + compute input 7582 usec + compute infer 143162 usec + compute output 4077 usec)

Inferences/Second vs. Client p95 Batch Latency
Concurrency: 16, throughput: 48.927 infer/sec, latency 331952 usec
```
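
The server-side numbers are worth reading closely: 881 inferences over 111 executions is roughly 8 requests per batch, so dynamic batching is absorbing most of the concurrency, and queueing (~152 ms) and compute (~143 ms) each account for about half of the ~314 ms average request latency. For a quick functional check outside perf_analyzer, here is a minimal Python client sketch; the tensor names `input` and `output` are assumptions (only `input` appears above, in the `--shape` flag) and should be adjusted to match the actual model configuration.

```python
# Minimal gRPC client sketch for the dinov2_vitl14 deployment above.
# Assumptions: input tensor "input" (FP32, 3x560x560), output tensor "output";
# adjust both names to match the actual config.pbtxt.
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="0.0.0.0:8001")

# Random image-shaped tensor with a leading batch dimension; real use would
# resize and normalize an actual image to 560x560 first.
batch = np.random.rand(1, 3, 560, 560).astype(np.float32)

inputs = [grpcclient.InferInput("input", list(batch.shape), "FP32")]
inputs[0].set_data_from_numpy(batch)
outputs = [grpcclient.InferRequestedOutput("output")]

result = client.infer(model_name="dinov2_vitl14", inputs=inputs, outputs=outputs)
embedding = result.as_numpy("output")
print(embedding.shape)
```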