|
# dinov2_vitl14_trt_a4000_fp16

DINOv2 ViT-L/14 served as a TensorRT FP16 engine on an NVIDIA RTX A4000 through Triton Inference Server.

## Triton

Start the Triton Inference Server:

```
make triton
```
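
A minimal sketch of the kind of command the `make triton` target presumably wraps, assuming the 23.04 Triton container (matching the SDK image used for perf below) with the default HTTP/gRPC/metrics ports remapped to 6000/6001/6002 so that gRPC answers on the `0.0.0.0:6001` endpoint the perf run targets:

```
docker run --gpus all --rm -d \
  -p 6000:8000 -p 6001:8001 -p 6002:8002 \
  -v $(pwd)/model_repository:/models \
  nvcr.io/nvidia/tritonserver:23.04-py3 \
  tritonserver --model-repository=/models
```

Once the server is up, readiness can be checked against the standard health endpoint:

```
curl -v 0.0.0.0:6000/v2/health/ready
```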
|
|
|
## Build TensorRT Model

Building the engine is a two-step process. First, prepare the model for conversion:

```
make model
```
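
The TensorRT step needs an intermediate graph, so `make model` presumably exports DINOv2 ViT-L/14 to ONNX. A sketch of such an export, assuming the official torch.hub entry point; the `model.onnx` path is illustrative, and 560 = 40 × 14-pixel patches matches the shape used by perf_analyzer below:

```
python - <<'EOF'
import torch

# Official DINOv2 hub entry point; downloads weights on first use.
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14").eval()

# 560 x 560 is 40 x 40 patches of 14 px each.
dummy = torch.randn(1, 3, 560, 560)

# Batch and spatial dims are marked dynamic (assumption: the trace
# supports variable input sizes, which perf_analyzer's explicit
# --shape flag suggests).
torch.onnx.export(
    model, dummy, "model.onnx",
    input_names=["input"], output_names=["output"],
    dynamic_axes={"input": {0: "batch", 2: "height", 3: "width"},
                  "output": {0: "batch"}},
    opset_version=17,
)
EOF
```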
|
|
|
|
|
Then build the FP16 engine:

```
make trt
```
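
A hedged `trtexec` equivalent of what `make trt` likely does, writing the serialized engine to the `model.plan` location shown in the repository tree below. FP16 follows from the repo name; the optimization profile ranges are illustrative assumptions:

```
trtexec --onnx=model.onnx --fp16 \
  --minShapes=input:1x3x224x224 \
  --optShapes=input:8x3x560x560 \
  --maxShapes=input:16x3x560x560 \
  --saveEngine=model_repository/dinov2_vitl14/1/model.plan
```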
|
|
|
The resulting model repository layout:

```
tree model_repository
```
|
```
model_repository/
└── dinov2_vitl14
    ├── 1
    │   └── model.plan
    └── config.pbtxt
```
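
The contents of `config.pbtxt` are not shown above; the following is a plausible configuration for this layout, assuming FP32 I/O tensors, the 1024-dim ViT-L embedding as output, and dynamic batching (the perf report below shows 4009 inferences served in 728 executions, i.e. roughly 5.5 requests grouped per engine run):

```
name: "dinov2_vitl14"
platform: "tensorrt_plan"
max_batch_size: 16
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 3, -1, -1 ]    # variable H/W; hence --shape on perf_analyzer
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ 1024 ]
  }
]
dynamic_batching { }
```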
|
|
|
|
|
## Perf

Measure throughput and latency with `perf_analyzer` from the Triton SDK container:

```
make perf
```
|
|
|
```
docker run --gpus all --rm -it --net host nvcr.io/nvidia/tritonserver:23.04-py3-sdk \
  perf_analyzer -m dinov2_vitl14 --percentile=95 -i grpc -u 0.0.0.0:6001 \
  --concurrency-range 16:16 --shape input:3,560,560

=================================
== Triton Inference Server SDK ==
=================================

NVIDIA Release 23.04 (build 58408269)

Copyright (c) 2018-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

NOTE: CUDA Forward Compatibility mode ENABLED.
  Using CUDA 12.1 driver version 530.30.02 with kernel driver version 525.125.06.
  See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.

*** Measurement Settings ***
  Batch size: 1
  Service Kind: Triton
  Using "time_windows" mode for stabilization
  Measurement window: 5000 msec
  Latency limit: 0 msec
  Concurrency limit: 16 concurrent requests
  Using synchronous calls for inference
  Stabilizing using p95 latency

Request concurrency: 16
  Client:
    Request count: 4009
    Throughput: 222.66 infer/sec
    p50 latency: 70762 usec
    p90 latency: 83940 usec
    p95 latency: 90235 usec
    p99 latency: 102226 usec
    Avg gRPC time: 71655 usec ((un)marshal request/response 741 usec + response wait 70914 usec)
  Server:
    Inference count: 4009
    Execution count: 728
    Successful request count: 4009
    Avg request latency: 66080 usec (overhead 8949 usec + queue 16114 usec + compute input 1163 usec + compute infer 24751 usec + compute output 15103 usec)

Inferences/Second vs. Client p95 Batch Latency
Concurrency: 16, throughput: 222.66 infer/sec, latency 90235 usec
```

At concurrency 16 the A4000 sustains about 223 inferences/sec at a p95 latency of roughly 90 ms; the server handled 4009 requests in 728 executions, i.e. the dynamic batcher grouped about 5.5 single-image requests per engine run.
|
|