Commit bdd3916 (parent 89d26a5): README.md
# dinov2_vitl14_trt_a4000_fp16


## Triton

```
make triton
```
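
The Makefile itself is not part of this commit, so the following is only a sketch of what `make triton` presumably wraps: serving the local `model_repository` with the matching Triton server image, with gRPC published on port 6001 so the `perf_analyzer` invocation in the Perf section below can reach it. The image tag and port mapping are assumptions.

```
# Hypothetical equivalent of `make triton` (recipe not shown in this commit):
# expose HTTP/gRPC/metrics on 6000/6001/6002 and serve ./model_repository.
docker run --gpus all --rm \
  -p 6000:8000 -p 6001:8001 -p 6002:8002 \
  -v "$PWD/model_repository:/models" \
  nvcr.io/nvidia/tritonserver:23.04-py3 \
  tritonserver --model-repository=/models
```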

## Build TensorRT Model

```
make model
```
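
As a hedged sketch of what `make model` might do (the recipe is not in this commit): pull `dinov2_vitl14` from torch.hub and export it to ONNX at the 560x560 input size used throughout this README. The output path, tensor names, and opset below are assumptions.

```
# Hypothetical equivalent of `make model`: export DINOv2 ViT-L/14 to ONNX.
python - <<'EOF'
import torch

# Load the ViT-L/14 backbone from the official DINOv2 hub entry point.
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14").eval()

# 560x560 matches the perf_analyzer --shape flag (560 = 40 patches of 14 px).
dummy = torch.randn(1, 3, 560, 560)

torch.onnx.export(
    model, dummy, "model.onnx",
    input_names=["input"], output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
    opset_version=17,
)
EOF
```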

```
make trt
```
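
`make trt` presumably compiles that ONNX graph into the FP16 engine that ends up at `model_repository/dinov2_vitl14/1/model.plan`. A rough `trtexec` equivalent is sketched below; the dynamic-batch shape range is an assumption chosen to match the concurrency-16 benchmark, not the Makefile's actual recipe.

```
# Hypothetical equivalent of `make trt`: build an FP16 engine with a dynamic
# batch dimension and install it where Triton expects the version-1 plan.
trtexec --onnx=model.onnx \
  --fp16 \
  --minShapes=input:1x3x560x560 \
  --optShapes=input:8x3x560x560 \
  --maxShapes=input:16x3x560x560 \
  --saveEngine=model_repository/dinov2_vitl14/1/model.plan
```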

```
tree model_repository
```
```
model_repository/
└── dinov2_vitl14
    ├── 1
    │   └── model.plan
    └── config.pbtxt
```
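
The tree lists `config.pbtxt`, but its contents are not part of this diff. The server counters under Perf (4009 inferences across only 728 executions) imply dynamic batching, so a plausible sketch looks like the following; every value here is an assumption, including the 1024-wide output, which presumes the raw ViT-L embedding is returned.

```
# Hypothetical model_repository/dinov2_vitl14/config.pbtxt (not in this commit).
name: "dinov2_vitl14"
platform: "tensorrt_plan"
max_batch_size: 16
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 3, 560, 560 ]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ 1024 ]
  }
]
dynamic_batching { }
```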

## Perf

```
make perf
```

```
docker run --gpus all --rm -it --net host nvcr.io/nvidia/tritonserver:23.04-py3-sdk perf_analyzer -m dinov2_vitl14 --percentile=95 -i grpc -u 0.0.0.0:6001 --concurrency-range 16:16 --shape input:3,560,560

=================================
...
NOTE: CUDA Forward Compatibility mode ENABLED.

Request concurrency: 16
  Client:
    Request count: 4009
    Throughput: 222.66 infer/sec
    p50 latency: 70762 usec
    p90 latency: 83940 usec
    p95 latency: 90235 usec
    p99 latency: 102226 usec
    Avg gRPC time: 71655 usec ((un)marshal request/response 741 usec + response wait 70914 usec)
  Server:
    Inference count: 4009
    Execution count: 728
    Successful request count: 4009
    Avg request latency: 66080 usec (overhead 8949 usec + queue 16114 usec + compute input 1163 usec + compute infer 24751 usec + compute output 15103 usec)

Inferences/Second vs. Client p95 Batch Latency
Concurrency: 16, throughput: 222.66 infer/sec, latency 90235 usec
```
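
The server counters confirm dynamic batching at work: 4009 inferences over 728 executions is an average batch of roughly 5.5 per engine run, and the 16114 usec of queue time is the price paid to form those batches. For a functional check beyond perf_analyzer, a minimal gRPC smoke test might look like this, assuming the `input`/`output` tensor names and FP32 types sketched in the config above.

```
# Hypothetical smoke test against the running server (not part of this commit).
python - <<'EOF'
import numpy as np
import tritonclient.grpc as grpcclient

# Same gRPC endpoint that perf_analyzer targets above.
client = grpcclient.InferenceServerClient("0.0.0.0:6001")

# One random 560x560 image; real use would preprocess an actual image.
img = np.random.rand(1, 3, 560, 560).astype(np.float32)

inp = grpcclient.InferInput("input", list(img.shape), "FP32")
inp.set_data_from_numpy(img)

result = client.infer("dinov2_vitl14", inputs=[inp])
print(result.as_numpy("output").shape)
EOF
```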