Metric3D v2 (ViT-S): LiteRT (on-device, fully-GPU metric depth)

Metric3D v2 (CVPR/TPAMI 2024) is a monocular metric-depth model: unlike relative-depth models (MiDaS, Depth Anything), it predicts absolute depth in meters. This build is converted to LiteRT and runs fully on the CompiledModel GPU accelerator (ML Drift) on Android. The DINOv2 ViT-S encoder and the RAFT-DPT decoder both run on the GPU delegate, with no CPU/ONNX fallback.

[Figure: Metric3D v2, input | metric depth (on-device LiteRT GPU)]

On-device (Pixel 8a, Tensor G3; verified)

nodes on GPU: 2447 / 2447 LITERT_CL (full residency)
compile: ~2.2 s (one-time)
inference: ~44 ms (model); ~335 ms (full app pipeline)
size: 78 MB (fp16)
accuracy: depth correlation 0.96 vs. the original Metric3D (0.96–0.98 across indoor 0.7–4 m / mid-range 4–17 m / outdoor 11–200 m)
image[1,3,448,448] (ImageNet-normalized) → [GPU: DINOv2 ViT-S → RAFT-DPT (4 iters)] → depth[1,1,448,448] (meters)

The model outputs depth for a canonical camera (focal length 1000 at the canonical resolution). For a calibrated camera, multiply the output by fx / 1000 (the de-canonical transform); without intrinsics the output is still in meters and qualitatively correct.
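The de-canonical transform is a single per-pixel scale. The sketch below is a minimal illustration, not code from the sample: `deCanonicalize` and `CANONICAL_FOCAL` are hypothetical names, and it assumes `fx` is the focal length in pixels at the resolution actually fed to the model (the text above only gives the fx / 1000 factor).

```kotlin
/** Focal length of the canonical camera, as stated in the text above. */
const val CANONICAL_FOCAL = 1000f

/**
 * Rescales canonical-camera depth to a calibrated camera.
 * Assumption: fx is in pixels, measured at the model's input resolution.
 */
fun deCanonicalize(depth: FloatArray, fx: Float): FloatArray {
    require(fx > 0f) { "fx must be positive" }
    val scale = fx / CANONICAL_FOCAL
    // Return a new array so the raw model output stays available.
    return FloatArray(depth.size) { i -> depth[i] * scale }
}
```

With no intrinsics available, skip the call and use the raw output directly.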

Preprocessing

Center-crop to square, resize to 448×448, ImageNet-normalize on the 0–255 scale, (px − [123.675, 116.28, 103.53]) / [58.395, 57.12, 57.375], and lay out as NCHW planar.
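The steps above can be sketched in plain Kotlin, without Android classes, so the arithmetic is easy to check. This is an illustrative sketch rather than the sample's code: `preprocess`, `MEAN`, `STD`, and `SIZE` are hypothetical names, the input is assumed to be packed ARGB integers (the layout Android's `Bitmap.getPixels` produces), and nearest-neighbour sampling stands in for whatever resize filter the real app uses.

```kotlin
/** ImageNet mean and std on the 0–255 scale, in R, G, B order (values from the text). */
val MEAN = floatArrayOf(123.675f, 116.28f, 103.53f)
val STD = floatArrayOf(58.395f, 57.12f, 57.375f)
const val SIZE = 448

/**
 * Center-crops to a square, resizes to size x size with nearest-neighbour
 * sampling (a simplification), normalizes, and returns planar CHW floats.
 */
fun preprocess(argb: IntArray, width: Int, height: Int, size: Int = SIZE): FloatArray {
    require(argb.size == width * height) { "pixel count must equal width * height" }
    val side = minOf(width, height)          // edge length of the center crop
    val x0 = (width - side) / 2
    val y0 = (height - side) / 2
    val plane = size * size
    val chw = FloatArray(3 * plane)
    for (y in 0 until size) {
        val sy = y0 + (y * side) / size      // source row inside the crop
        for (x in 0 until size) {
            val sx = x0 + (x * side) / size  // source column inside the crop
            val p = argb[sy * width + sx]
            val r = (p shr 16) and 0xFF
            val g = (p shr 8) and 0xFF
            val b = p and 0xFF
            val i = y * size + x
            chw[i] = (r - MEAN[0]) / STD[0]              // R plane
            chw[plane + i] = (g - MEAN[1]) / STD[1]      // G plane
            chw[2 * plane + i] = (b - MEAN[2]) / STD[2]  // B plane
        }
    }
    return chw
}
```

The returned array has 3 × 448 × 448 elements and can be written to the model's first input buffer as-is.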

Usage (Android, LiteRT CompiledModel)

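import com.google.ai.edge.litert.Accelerator
import com.google.ai.edge.litert.CompiledModel

// Compile once for the GPU accelerator (one-time cost, ~2.2 s on the Pixel 8a).
val model = CompiledModel.create(modelPath, CompiledModel.Options(Accelerator.GPU), null)
val input = model.createInputBuffers()
val output = model.createOutputBuffers()
input[0].writeFloat(chw)            // [1,3,448,448] ImageNet-normalized, NCHW planar
model.run(input, output)
val depth = output[0].readFloat()   // [448*448] meters, row-major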

A complete Android sample (image picker + depth colormap) is in the official google-ai-edge/litert-samples repo under compiled_model_api/metric_depth.
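For a quick visual check without the full sample, the depth array can be mapped to grayscale. The sketch below is not the sample's colormap: `depthToGray` is a hypothetical helper, and the log scaling over 0.1–200 m is an assumption borrowed from the depth range this card mentions.

```kotlin
import kotlin.math.ln

/**
 * Maps metric depth to opaque ARGB gray pixels: near = white, far = black.
 * Depth is clamped to [nearM, farM] and scaled logarithmically (an assumption).
 */
fun depthToGray(depth: FloatArray, nearM: Float = 0.1f, farM: Float = 200f): IntArray {
    val logNear = ln(nearM)
    val logFar = ln(farM)
    return IntArray(depth.size) { i ->
        val d = depth[i].coerceIn(nearM, farM)
        val t = (ln(d) - logNear) / (logFar - logNear)      // 0 = near, 1 = far
        val v = ((1f - t) * 255f).toInt().coerceIn(0, 255)  // gray level
        (0xFF shl 24) or (v shl 16) or (v shl 8) or v       // opaque ARGB
    }
}
```

The result has one packed pixel per depth value, in the same row-major order as the model output.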

How it converts (litert-torch)

Fixed 448×448. Encoder = the DINOv2 ViT-S suite (fused-QKV → 4D attention, LayerScale folded into Linear, baked pos-embed). The RAFT-DPT decoder needs three fixes that only the on-device run reveals (desktop fp16 stays at 0.9999):

  1. Convex upsample → depth-to-space via ZeroStuffConvT2d: the naive "nearest-upsample + in-block mask" is exact on desktop but drops to 0.57 on Mali (RESIZE_NEAREST differs at non-stride positions); ZeroStuffConvT2d masks only stride-aligned positions and the conv kernel supplies the offset.
  2. GELU → accurate tanh approximation (POW-free): x·sigmoid(1.702x) collapses far depth to 0.51 over the 0.1–200 m log-depth bins; the tanh form restores 0.96.
  3. nn.ReLU(inplace=True) mutates the DPT ConvBlock residual, so the skip path is relu(x)+convs rather than x+convs; the conversion replicates this exactly.
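Fixes 2 and 3 are easy to illustrate numerically. The following is a plain-Kotlin sketch, not the conversion code: all function names are hypothetical, the tanh formula is the standard GELU approximation with the cube written as x*x*x to avoid a POW op, and `convs` is a stand-in for the block's convolutions.

```kotlin
import kotlin.math.PI
import kotlin.math.exp
import kotlin.math.sqrt
import kotlin.math.tanh

/** Tanh GELU approximation; the cube is spelled out so no POW op is needed. */
fun geluTanh(x: Double): Double {
    val inner = sqrt(2.0 / PI) * (x + 0.044715 * x * x * x)
    return 0.5 * x * (1.0 + tanh(inner))
}

/** The cheaper x * sigmoid(1.702x) approximation, shown for comparison. */
fun geluSigmoid(x: Double): Double = x / (1.0 + exp(-1.702 * x))

/** In-place ReLU: overwrites and returns the same array, like nn.ReLU(inplace=True). */
fun reluInPlace(x: DoubleArray): DoubleArray {
    for (i in x.indices) if (x[i] < 0.0) x[i] = 0.0
    return x
}

/**
 * Residual block whose skip path sees the already-rectified input,
 * i.e. relu(x) + convs(relu(x)). `convs` must not modify its argument.
 */
fun convBlock(x: DoubleArray, convs: (DoubleArray) -> DoubleArray): DoubleArray {
    val activated = reluInPlace(x)  // x itself now holds relu(x)
    val branch = convs(activated)
    return DoubleArray(x.size) { i -> x[i] + branch[i] }
}
```

At x = 1 the tanh form lands within about 2e-4 of the exact GELU value (roughly 0.8413), while the sigmoid form is off by about 4e-3. In `convBlock`, a negative input contributes nothing to the skip path, which is the behaviour a re-implementation has to keep.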

Conversion scripts are in the conversion/ directory of the litert-samples sample.

License

BSD-2-Clause (Metric3D); the DINOv2 backbone is Apache-2.0. Upstream: YvanYin/Metric3D.
