Metric3D v2 (ViT-S) — LiteRT (on-device, fully-GPU metric depth)

Metric3D v2 (CVPR/TPAMI 2024) is a monocular metric depth model (absolute depth, in meters), converted to LiteRT and running entirely on the GPU through the CompiledModel API (ML Drift) on Android. Unlike relative-depth models such as MiDaS or Depth Anything, Metric3D predicts depth in meters. Both the DINOv2 ViT-S encoder and the RAFT-DPT decoder run on the GPU delegate, with no CPU or ONNX fallback.

On-device (Pixel 8a, Tensor G3 — verified)


nodes on GPU	2447 / 2447 LITERT_CL (full residency)
compile	~2.2 s (one-time)
inference	~44 ms (model); ~335 ms full app pipeline
size	78 MB (fp16)
accuracy	depth correlation 0.96 vs. the original Metric3D (0.96–0.98 across indoor 0.7–4 m, mid-range 4–17 m, and outdoor 11–200 m scenes)
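
The correlation figures quoted here compare a converted model's depth map against the original model's output. The exact metric is not spelled out, so the following is only a minimal sketch that assumes a plain Pearson correlation over the flattened depth arrays; the function name `pearson` is hypothetical and not part of the conversion scripts.

```kotlin
import kotlin.math.sqrt

/**
 * Pearson correlation between two equally sized depth maps.
 * Assumption: the "depth corr" figure is a Pearson correlation on raw depth;
 * the source does not say whether depth or log-depth is used.
 * Returns NaN if either input is constant (zero variance).
 */
fun pearson(a: FloatArray, b: FloatArray): Double {
    require(a.size == b.size && a.isNotEmpty()) { "inputs must be non-empty and the same size" }
    var meanA = 0.0
    var meanB = 0.0
    for (i in a.indices) {
        meanA += a[i]
        meanB += b[i]
    }
    meanA /= a.size
    meanB /= a.size

    var cov = 0.0
    var varA = 0.0
    var varB = 0.0
    for (i in a.indices) {
        val da = a[i] - meanA
        val db = b[i] - meanB
        cov += da * db
        varA += da * da
        varB += db * db
    }
    return cov / sqrt(varA * varB)
}
```

A check of this kind can be run per scene by passing the on-device output and the reference output for the same input image.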

image[1,3,448,448] (ImageNet-normalized) →[GPU: DINOv2 ViT-S → RAFT-DPT (4 iters)]→ depth[1,1,448,448] (meters)

The model outputs depth for a canonical camera (focal length 1000 at the canonical resolution). For a calibrated camera, multiply the output by fx / 1000 (the de-canonical transform). Without intrinsics, the output is still expressed in meters and is qualitatively correct, but its absolute scale assumes the canonical camera.
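
The de-canonical transform can be sketched as a single per-pixel scale. The snippet below assumes that fx is the focal length in pixels at the resolution the model actually sees, so an fx calibrated on the full-size image is first rescaled by the resize factor (a center crop leaves fx unchanged). The helper names `focalAtModelInput` and `deCanonicalize` are hypothetical.

```kotlin
// Canonical focal length stated in this README.
const val CANONICAL_FOCAL = 1000f

/**
 * Rescales a focal length (in pixels) calibrated on the original image to the
 * model input, assuming a square center crop of side cropSide resized to modelSide.
 */
fun focalAtModelInput(fxOriginal: Float, cropSide: Int, modelSide: Int = 448): Float {
    require(cropSide > 0 && modelSide > 0) { "sides must be positive" }
    return fxOriginal * modelSide / cropSide
}

/** Applies depth * fx / 1000 to every pixel of the canonical-camera depth map. */
fun deCanonicalize(canonicalDepth: FloatArray, fxPixels: Float): FloatArray {
    require(fxPixels > 0f) { "fx must be positive" }
    val scale = fxPixels / CANONICAL_FOCAL
    return FloatArray(canonicalDepth.size) { i -> canonicalDepth[i] * scale }
}
```

For example, a camera with fx = 2000 px on a 896-pixel crop has fx = 1000 px at 448×448, so the scale is 1 and the output is left unchanged.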

Preprocessing

Center-crop to square, resize to 448×448, ImageNet normalize in 0–255 scale (px − [123.675, 116.28, 103.53]) / [58.395, 57.12, 57.375], NCHW planar.
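
The normalization and layout step can be sketched in plain Kotlin. This example assumes the image has already been center-cropped and resized to 448×448 and is available as packed ARGB integers (for instance from Android's `Bitmap.getPixels`), and that the mean and standard deviation above are in R, G, B order, which matches the standard ImageNet values. The function name `argbToNormalizedChw` is hypothetical.

```kotlin
// ImageNet mean and std on the 0-255 scale, as listed above (assumed R, G, B order).
val MEAN = floatArrayOf(123.675f, 116.28f, 103.53f)
val STD = floatArrayOf(58.395f, 57.12f, 57.375f)

/**
 * Converts packed ARGB pixels (row-major, already cropped and resized to
 * size x size) into a normalized planar CHW float array: all R values,
 * then all G values, then all B values.
 */
fun argbToNormalizedChw(argb: IntArray, size: Int = 448): FloatArray {
    val plane = size * size
    require(argb.size == plane) { "expected $plane pixels, got ${argb.size}" }
    val chw = FloatArray(3 * plane)
    for (i in 0 until plane) {
        val p = argb[i]
        val r = ((p shr 16) and 0xFF).toFloat()
        val g = ((p shr 8) and 0xFF).toFloat()
        val b = (p and 0xFF).toFloat()
        chw[i] = (r - MEAN[0]) / STD[0]
        chw[plane + i] = (g - MEAN[1]) / STD[1]
        chw[2 * plane + i] = (b - MEAN[2]) / STD[2]
    }
    return chw
}
```

The resulting array has 3 × 448 × 448 elements and can be written to the model's input buffer as-is.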

Usage (Android, LiteRT CompiledModel)

// Compile the model for the GPU accelerator (one-time cost, ~2.2 s on the Pixel 8a).
val model = CompiledModel.create(modelPath, CompiledModel.Options(Accelerator.GPU), null)
val input = model.createInputBuffers()
val output = model.createOutputBuffers()
input[0].writeFloat(chw)            // chw: FloatArray, [1,3,448,448] ImageNet-normalized
model.run(input, output)
val depth = output[0].readFloat()   // FloatArray of 448*448 depths in meters
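
The flat array returned by `readFloat()` can be indexed per pixel or normalized for display. The sketch below assumes the [1,1,448,448] output is flattened row-major; both helper names (`depthAt`, `toGray`) are hypothetical and this is not the colormap used by the official sample.

```kotlin
/** Depth in meters at pixel (x, y) of a flat, row-major size x size depth map. */
fun depthAt(depth: FloatArray, x: Int, y: Int, size: Int = 448): Float {
    require(depth.size == size * size) { "expected ${size * size} values, got ${depth.size}" }
    require(x in 0 until size && y in 0 until size) { "pixel ($x, $y) is out of bounds" }
    return depth[y * size + x]
}

/**
 * Min-max normalizes a depth map to gray levels 0..255 (nearest = 0, farthest = 255),
 * a simple stand-in for a real colormap. Returns an empty array for empty input.
 */
fun toGray(depth: FloatArray): IntArray {
    val lo = depth.minOrNull() ?: return IntArray(0)
    val hi = depth.maxOrNull() ?: return IntArray(0)
    // Guard against a constant depth map, which would otherwise divide by zero.
    val range = (hi - lo).coerceAtLeast(1e-6f)
    return IntArray(depth.size) { i ->
        (255f * (depth[i] - lo) / range).toInt().coerceIn(0, 255)
    }
}
```

Min-max scaling makes every frame use the full gray range, so brightness is not comparable between frames; a fixed range in meters would be needed for that.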

A complete Android sample (image picker + depth colormap) is in the official google-ai-edge/litert-samples repo under compiled_model_api/metric_depth.

How it converts (litert-torch)

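The conversion uses a fixed 448×448 input. The encoder applies the DINOv2 ViT-S suite of rewrites (fused-QKV → 4D attention, LayerScale folded into Linear, baked pos-embed). The RAFT-DPT decoder needs three fixes whose necessity only shows up in the on-device run; on desktop, fp16 correlation stays at 0.9999, so the problems are invisible there: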

Convex upsample → depth-to-space via ZeroStuffConvT2d: the naive "nearest-upsample + in-block mask" formulation is exact on desktop but drops to 0.57 correlation on Mali, because RESIZE_NEAREST differs at non-stride positions. ZeroStuffConvT2d masks only stride-aligned positions and lets the conv kernel supply the offset.
GELU → accurate tanh approximation (POW-free): the x·sigmoid(1.702x) approximation collapses far depth, giving 0.51 correlation over the 0.1–200 m log-depth bins; the tanh form restores 0.96.
nn.ReLU(inplace=True) mutates the DPT ConvBlock residual, so the block computes relu(x) + convs rather than x + convs; the conversion replicates this behavior exactly.
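
To make the GELU item concrete, the two approximations can be written side by side. These are the standard textbook formulas, shown as a sketch in Kotlin for comparison only; how the conversion scripts express them as graph ops is not shown here, and the function names are hypothetical. The cube is written as x·x·x so that no POW op is needed.

```kotlin
import kotlin.math.PI
import kotlin.math.exp
import kotlin.math.sqrt
import kotlin.math.tanh

// sqrt(2 / pi), the constant in the tanh GELU approximation.
private val SQRT_2_OVER_PI = sqrt(2.0 / PI).toFloat()

/** Tanh GELU approximation: 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3))). */
fun geluTanh(x: Float): Float {
    val inner = SQRT_2_OVER_PI * (x + 0.044715f * x * x * x)
    return 0.5f * x * (1f + tanh(inner))
}

/** Sigmoid GELU approximation: x * sigmoid(1.702 * x), the cheaper but less accurate form. */
fun geluSigmoid(x: Float): Float = x / (1f + exp(-1.702f * x))
```

At x = 1 the two forms already differ in the third decimal place, which is the kind of per-activation error that can accumulate through a decoder.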

The conversion scripts are in the conversion/ directory of the litert-samples sample.

License

BSD-2-Clause (Metric3D); the DINOv2 backbone is Apache-2.0. Upstream: YvanYin/Metric3D.
