Depth Anything 3 (Small): LiteRT GPU, monocular depth

On-device LiteRT / TFLite conversion of Depth Anything 3 Small (ByteDance-Seed, Apache-2.0) for monocular depth, running fully on the mobile GPU via the LiteRT CompiledModel API (ML Drift delegate). No CPU fallback ops: the whole graph is GPU-compatible.

Task: Monocular depth (single RGB → depth)
Backbone: DINOv2 ViT-S + RoPE, DPT/DualDPT depth head
Input: [1, 3, 896, 504] NCHW float32, ImageNet-normalized, native portrait aspect
Output: [1, 1, 896, 504] depth
Precision / size: FP16, 55 MB
Device: Pixel 8a, LiteRT GPU (Accelerator.GPU), ~0.9 s / image (FP16, CompiledModel.Run)
Fidelity: corr 0.99948 vs official PyTorch; on-device GPU-vs-CPU cos 0.99993 (re-verified, see below)

Why a fixed 896×504 (native aspect, not square)

DA3 processes images at their native aspect ratio (upper_bound_resize: longer side → 896, sides multiples of 14). Forcing a square 896×896 input with letterbox padding drops the match to corr 0.977, because the black padding leaks into the content through global attention. Converting at the native rectangle restores corr 0.9994 and is also faster (fewer tokens). This checkpoint is built for portrait ~9:16. For another aspect, re-convert at that shape (or your camera's fixed aspect) with the script below.
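
The shape arithmetic behind that choice can be sketched in a few lines. This is an illustration only: native_aspect_shape and token_count are hypothetical helpers, and rounding each side to the nearest multiple of 14 is an assumption, not necessarily what DA3's upper_bound_resize does internally.

```python
def native_aspect_shape(height, width, upper=896, patch=14):
    """Scale so the longer side equals `upper`, then snap both sides
    to a multiple of `patch` (nearest-multiple rounding is an assumption)."""
    scale = upper / max(height, width)

    def snap(side):
        return max(patch, round(side * scale / patch) * patch)

    return snap(height), snap(width)


def token_count(height, width, patch=14):
    """Number of ViT patch tokens for an input of this size."""
    return (height // patch) * (width // patch)


h, w = native_aspect_shape(1920, 1080)   # a 9:16 portrait camera frame
print((h, w), token_count(h, w))         # (896, 504) 2304
print(token_count(896, 896))             # 4096 for the square letterbox
```

For a 9:16 frame the native rectangle carries 2304 patch tokens against 4096 for the padded square, which is consistent with the speed-up noted above.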

Preprocessing (must match)

resize to 504×896 (W×H)  →  x/255  →  (x - mean) / std
mean = [0.485, 0.456, 0.406], std = [0.229, 0.224, 0.225]   # ImageNet, RGB, NCHW
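
A minimal, dependency-free sketch of the normalization and layout step, assuming the image has already been resized to 504×896 and is available as interleaved 8-bit RGB (HWC). The function name hwc_rgb_to_nchw is hypothetical; the resize itself is not shown.

```python
MEAN = (0.485, 0.456, 0.406)  # ImageNet mean, RGB order
STD = (0.229, 0.224, 0.225)   # ImageNet std, RGB order


def hwc_rgb_to_nchw(rgb, height, width):
    """Convert interleaved 8-bit RGB (HWC) into a flat, planar NCHW float list.

    Each value becomes (x / 255 - mean[c]) / std[c], and the three channel
    planes are written one after another (R plane, G plane, B plane).
    """
    plane = height * width
    if len(rgb) != plane * 3:
        raise ValueError("expected height * width * 3 bytes of RGB data")
    out = [0.0] * (3 * plane)
    for i in range(plane):
        for c in range(3):
            out[c * plane + i] = (rgb[3 * i + c] / 255.0 - MEAN[c]) / STD[c]
    return out


# Tiny 1x2 image: one pure-red pixel and one pure-green pixel.
tensor = hwc_rgb_to_nchw(bytes([255, 0, 0, 0, 255, 0]), height=1, width=2)
print(tensor)
```

For the real model the same routine would run with height=896 and width=504, producing the 1×3×896×504 values expected by the input buffer.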

GPU-clean conversion (what was patched)

Converted with litert-torch. DA3 is not GPU-clean out of the box; the following GPU-clean rewrites were applied (all numerically exact unless noted):

  1. checkpoint "model." key prefix stripped (load fix)
  2. RoPE max_position = int(positions.max())+1 → constant (data-dependent under torch.export)
  3. fused-QKV attention → 3 separate Linears + 4D attention (avoids 5D RESHAPE; exact, 1e-6)
  4. LayerScale gamma folded into attn.proj / mlp.fc2 (the LayerScale MUL otherwise mis-lays-out the token dim on the GPU delegate: fully_connected {1,1,N,C} vs {N,1,1,C})
  5. pos_embed bicubic interpolation baked to a constant (the interpolate of a constant emits GATHER_ND on desktop and RESIZE_BILINEAR with 0 runtime inputs on device)
  6. ConvTranspose2d(k=s, stride=s) → zero-stuff (nearest-upsample × top-left mask) + Conv2d (flipped weight), an exact equivalent (~1e-7), because the Pixel 8a GPU rejects TRANSPOSE_CONV and the conv + depth-to-space alternative needs >4D
  7. DPT-head custom_interpolate align_corners=True → False (GPU bans align_corners=True resize); the only non-exact rewrite and the source of the residual ~0.05 % vs the official model
  8. head UV pos-embed-again disabled (its make_sincos broadcast emits BROADCAST_TO; ratio-0.1 refinement)
  9. camera-token insertion x[:, :, 0] = cam_token → torch.cat (in-place index-assign → SELECT_V2)
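
Rewrite (4) relies on a simple identity: scaling the output of a linear layer per channel is the same as scaling the rows of its weight and its bias. The sketch below checks that identity on a toy 2×2 layer in plain Python; it is not the conversion code, and the helper names linear and fold_layerscale are hypothetical.

```python
def linear(x, weight, bias):
    """y = W x + b for a single token; weight is laid out [out][in]."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def fold_layerscale(weight, bias, gamma):
    """Fold a per-channel scale into the layer:
    W'[o][i] = gamma[o] * W[o][i] and b'[o] = gamma[o] * b[o]."""
    folded_w = [[g * w for w in row] for row, g in zip(weight, gamma)]
    folded_b = [g * b for b, g in zip(bias, gamma)]
    return folded_w, folded_b


weight = [[0.5, -1.0], [2.0, 0.25]]
bias = [0.1, -0.2]
gamma = [0.1, 3.0]
x = [1.5, -2.0]

# Original graph: linear layer followed by a separate LayerScale multiply.
reference = [g * y for g, y in zip(gamma, linear(x, weight, bias))]
# Rewritten graph: a single linear layer with gamma folded in.
folded = linear(x, *fold_layerscale(weight, bias, gamma))

max_err = max(abs(a - b) for a, b in zip(reference, folded))
print(max_err)
```

The two paths agree up to floating-point rounding, and the rewritten graph no longer contains the standalone MUL.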

Net result: GATHER_ND = 0, no >4D tensors, no TRANSPOSE_CONV / BROADCAST_TO / banned ops.
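
The equivalence used in rewrite (6) is easiest to see in one dimension. The following sketch is a 1-D analogue in plain Python, not the 2-D PyTorch rewrite itself: the function names are hypothetical, and the left padding of k-1 zeros is this sketch's assumption about how the shifted kernel is aligned.

```python
def conv_transpose_1d(x, w):
    """Transposed conv with kernel == stride == len(w):
    out[i * s + j] = x[i] * w[j]."""
    s = len(w)
    out = [0.0] * (len(x) * s)
    for i, v in enumerate(x):
        for j, wj in enumerate(w):
            out[i * s + j] += v * wj
    return out


def zero_stuff_then_conv_1d(x, w):
    """Same result using only upsample, multiply, pad and an ordinary conv."""
    s = len(w)
    # Nearest-upsample by s, then keep only the first sample of each block
    # (the 1-D counterpart of the "top-left mask").
    upsampled = [v for v in x for _ in range(s)]
    mask = [1.0 if n % s == 0 else 0.0 for n in range(len(upsampled))]
    stuffed = [u * m for u, m in zip(upsampled, mask)]
    # Ordinary conv (cross-correlation) with the flipped kernel,
    # after padding s - 1 zeros on the left.
    padded = [0.0] * (s - 1) + stuffed
    flipped = w[::-1]
    return [sum(padded[n + m] * flipped[m] for m in range(s))
            for n in range(len(stuffed))]


x = [1.0, 2.0, -3.0]
w = [0.5, 4.0]
print(conv_transpose_1d(x, w))        # [0.5, 4.0, 1.0, 8.0, -1.5, -12.0]
print(zero_stuff_then_conv_1d(x, w))  # [0.5, 4.0, 1.0, 8.0, -1.5, -12.0]
```

Because kernel size equals stride, each output position receives exactly one nonzero product, so the two paths match; the 2-D case applies the same construction along both spatial axes.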

Fidelity note (honest)

corr 0.99948 vs the official FP32 PyTorch pipeline. FP16 is not a factor (FP32 ≡ FP16, corr 1.0). The residual ~0.05 % is the align_corners=True → False change in (7), which the mobile GPU forces: an irreducible hardware constraint, not a conversion error. Structure and edge sharpness are visually identical.
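
To see why the align_corners switch cannot be exact, it helps to compare the two coordinate mappings on a single row. The sketch below is a 1-D linear resize in plain Python that follows the usual conventions (corner-aligned vs half-pixel-centered sampling); resize_linear_1d is a hypothetical helper, not the DPT head's custom_interpolate.

```python
def resize_linear_1d(values, out_size, align_corners):
    """Linear resize of one row under the two sampling conventions."""
    in_size = len(values)
    out = []
    for dst in range(out_size):
        if align_corners:
            # First and last samples map exactly onto the input corners.
            src = dst * (in_size - 1) / (out_size - 1) if out_size > 1 else 0.0
        else:
            # Half-pixel centers, clamped at the left edge.
            src = max((dst + 0.5) * in_size / out_size - 0.5, 0.0)
        lo = min(int(src), in_size - 1)
        hi = min(lo + 1, in_size - 1)
        frac = src - lo
        out.append(values[lo] * (1.0 - frac) + values[hi] * frac)
    return out


row = [0.0, 1.0]
print(resize_linear_1d(row, 4, align_corners=True))   # samples at 0, 1/3, 2/3, 1
print(resize_linear_1d(row, 4, align_corners=False))  # samples at 0, 0.25, 0.75, 1
```

The interior samples land at different source positions, so the upsampled feature maps differ slightly even with identical weights.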

On-device GPU verification (re-confirmed)

Re-verified on a Pixel 8a with the official LiteRT C++ runtime + ML Drift accelerator. The model compiles with the log line "Replacing 1460 out of 1460 node(s) with delegate (LITERT_CL)" (full residency, single partition, no XNNPACK CPU fallback), and the on-device GPU output matches the CPU/XNNPACK reference at cos 0.99993 / Pearson 0.99975 for the same input. The GPU result is therefore numerically faithful, not merely "resident": full GPU residency does not by itself guarantee a correct result.
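
For readers who want to repeat this kind of check, here is a minimal sketch of the two metrics in plain Python, assuming both outputs have been flattened to equal-length lists of floats. The function names and the toy vectors are illustrative only and are not the measurements reported above.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two flattened outputs."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pearson(a, b):
    """Pearson correlation: cosine similarity after removing each mean."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return cosine_similarity([x - mean_a for x in a],
                             [y - mean_b for y in b])


# Toy depth values with a large shared offset and a small disagreement.
gpu_out = [10.0, 10.1, 10.2, 10.3]
cpu_out = [10.0, 10.2, 10.1, 10.3]
print(cosine_similarity(gpu_out, cpu_out))
print(pearson(gpu_out, cpu_out))
```

Pearson subtracts the mean first, so it is the stricter of the two when the outputs share a large constant offset, as depth maps often do; that is one reason to report both numbers.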

Usage (Android / LiteRT CompiledModel)

val model = CompiledModel.create(context.assets, "da3_small_gpu_fp16.tflite",
    CompiledModel.Options(Accelerator.GPU), null)
// input: [1,3,896,504] NCHW, ImageNet-normalized; output: [1,1,896,504] depth

Training data & PII

Depth Anything 3 was trained by ByteDance-Seed on a large-scale collection of monocular-depth data: a mix of synthetic depth datasets and real images with pseudo-labelled depth (the Depth Anything line scales to tens of millions of images). No new training was performed for this conversion; it is a weights-faithful (corr ≈ 1.0) format change of the public depth-anything/DA3-SMALL checkpoint. Because the source data includes real-world indoor/outdoor scenes, it may incidentally contain people, faces, vehicles, signage and other PII; no PII was deliberately collected and this conversion adds none. Apply your own content/PII filtering as appropriate. See the original Depth Anything 3 release and paper for full dataset details.

License

Apache-2.0, inherited from the upstream Depth Anything 3. This is a format conversion; all credit to the original authors (ByteDance-Seed).
