RoundtTble committed
Commit c360541
1 Parent(s): 355e44f

Add README

Files changed (1)
  1. README.md +138 -0
README.md ADDED
@@ -0,0 +1,138 @@
# dinov2_vitl14_onnx

## Run Triton

```
make triton
```
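The `triton` target itself is not part of this commit, so the exact invocation may differ; as a rough sketch (the image tag, ports, and `models/` layout below are assumptions inferred from the server log that follows), it amounts to launching the NGC Triton container against a local model repository:

```
# Hypothetical equivalent of `make triton`; assumes a repository layout of
# models/dinov2_vitl14/config.pbtxt and models/dinov2_vitl14/1/model.onnx.
docker run --gpus all --rm \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v $(pwd)/models:/models \
  nvcr.io/nvidia/tritonserver:23.04-py3 \
  tritonserver --model-repository=/models
```

On a successful start the server prints a banner and model table similar to the log below, ending with the gRPC, HTTP, and metrics endpoints listening on ports 8001, 8000, and 8002.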
```
=============================
== Triton Inference Server ==
=============================

NVIDIA Release 23.04 (build 58408265)
Triton Server Version 2.33.0

Copyright (c) 2018-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

NOTE: CUDA Forward Compatibility mode ENABLED.
Using CUDA 12.1 driver version 530.30.02 with kernel driver version 525.125.06.
See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.

I0715 04:13:59.173070 1 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f1a70000000' with size 268435456
I0715 04:13:59.173293 1 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0715 04:13:59.175108 1 model_lifecycle.cc:459] loading: dinov2_vitl14:1
I0715 04:13:59.177471 1 onnxruntime.cc:2504] TRITONBACKEND_Initialize: onnxruntime
I0715 04:13:59.177510 1 onnxruntime.cc:2514] Triton TRITONBACKEND API version: 1.12
I0715 04:13:59.177518 1 onnxruntime.cc:2520] 'onnxruntime' TRITONBACKEND API version: 1.12
I0715 04:13:59.177525 1 onnxruntime.cc:2550] backend configuration:
{"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}}
I0715 04:13:59.233419 1 onnxruntime.cc:2608] TRITONBACKEND_ModelInitialize: dinov2_vitl14 (version 1)
I0715 04:13:59.233847 1 onnxruntime.cc:666] skipping model configuration auto-complete for 'dinov2_vitl14': inputs and outputs already specified
I0715 04:13:59.234233 1 onnxruntime.cc:2651] TRITONBACKEND_ModelInstanceInitialize: dinov2_vitl14_0 (GPU device 0)
2023-07-15 04:13:59.546824126 [W:onnxruntime:, session_state.cc:1136 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2023-07-15 04:13:59.546847104 [W:onnxruntime:, session_state.cc:1138 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
I0715 04:14:00.851748 1 model_lifecycle.cc:694] successfully loaded 'dinov2_vitl14' version 1
I0715 04:14:00.851859 1 server.cc:583]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0715 04:14:00.851944 1 server.cc:610]
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Backend | Path | Config |
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
| onnxruntime | /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0715 04:14:00.852005 1 server.cc:653]
+---------------+---------+--------+
| Model         | Version | Status |
+---------------+---------+--------+
| dinov2_vitl14 | 1       | READY  |
+---------------+---------+--------+

I0715 04:14:00.872645 1 metrics.cc:808] Collecting metrics for GPU 0: NVIDIA RTX A4000
I0715 04:14:00.873026 1 metrics.cc:701] Collecting CPU metrics
I0715 04:14:00.873315 1 tritonserver.cc:2387]
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.33.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
| model_repository_path[0] | /models |
| model_control_mode | MODE_NONE |
| strict_model_config | 0 |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
| cache_enabled | 0 |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0715 04:14:00.875498 1 grpc_server.cc:2450] Started GRPCInferenceService at 0.0.0.0:8001
I0715 04:14:00.875964 1 http_server.cc:3555] Started HTTPService at 0.0.0.0:8000
I0715 04:14:00.917871 1 http_server.cc:185] Started Metrics Service at 0.0.0.0:8002
```
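Once the log reports `dinov2_vitl14` as READY, the endpoints can be sanity-checked from another shell over Triton's HTTP (KServe v2) API; the commands below are a suggested check, not part of the Makefile:

```
# Server readiness and model metadata (input/output names, dtypes, shapes)
curl -v localhost:8000/v2/health/ready
curl -s localhost:8000/v2/models/dinov2_vitl14 | python3 -m json.tool
```

The metadata response lists the model's input and output tensor names and shapes, which is what the `--shape input:...` argument in the next section refers to.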
## Perf Analyzer

```
docker run --gpus all --rm -it --net host nvcr.io/nvidia/tritonserver:23.04-py3-sdk perf_analyzer -m dinov2_vitl14 --percentile=95 -i grpc -u 0.0.0.0:8001 --concurrency-range 16:16 --shape input:3,560,560

=================================
== Triton Inference Server SDK ==
=================================

NVIDIA Release 23.04 (build 58408269)

Copyright (c) 2018-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

NOTE: CUDA Forward Compatibility mode ENABLED.
Using CUDA 12.1 driver version 530.30.02 with kernel driver version 525.125.06.
See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.

*** Measurement Settings ***
  Batch size: 1
  Service Kind: Triton
  Using "time_windows" mode for stabilization
  Measurement window: 5000 msec
  Latency limit: 0 msec
  Concurrency limit: 16 concurrent requests
  Using synchronous calls for inference
  Stabilizing using p95 latency

Request concurrency: 16
  Client:
    Request count: 881
    Throughput: 48.927 infer/sec
    p50 latency: 324015 usec
    p90 latency: 330275 usec
    p95 latency: 331952 usec
    p99 latency: 336638 usec
    Avg gRPC time: 323066 usec ((un)marshal request/response 953 usec + response wait 322113 usec)
  Server:
    Inference count: 881
    Execution count: 111
    Successful request count: 881
    Avg request latency: 313673 usec (overhead 7065 usec + queue 151785 usec + compute input 7582 usec + compute infer 143162 usec + compute output 4077 usec)

Inferences/Second vs. Client p95 Batch Latency
Concurrency: 16, throughput: 48.927 infer/sec, latency 331952 usec
```
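At a fixed concurrency of 16 the server sustains roughly 49 inferences/sec at a p95 latency of about 332 ms, and the inference count (881) versus the execution count (111) suggests requests are being batched server-side at roughly 8 per execution on average. To see how throughput and latency trade off under lighter load, the same command can sweep a range of concurrencies instead of pinning it at 16 (a hypothetical variation of the command above; output not shown):

```
# Sweep request concurrency from 1 to 16 in steps of 4
docker run --gpus all --rm -it --net host nvcr.io/nvidia/tritonserver:23.04-py3-sdk \
  perf_analyzer -m dinov2_vitl14 --percentile=95 -i grpc -u 0.0.0.0:8001 \
  --concurrency-range 1:16:4 --shape input:3,560,560
```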