DA3-GIANT — CoreML (.mlpackage) for monocular depth

A precompiled Core ML conversion of DA3-GIANT, the full ViT-g (1.15B-parameter) Depth Anything 3 model, exposing a single-image relative-depth output for macOS/iOS.

Input: image, RGB, 504×504, values in [0,1] (Core ML ImageType; ImageNet normalization is baked into the model).
Output: depth, shape (1, 504, 504), single-channel relative depth.
Weights: FP16, ~2.2 GB. Only the backbone → depth-head path is converted; the camera, sky, and Gaussian-splat heads are bypassed.
Conversion notes: the full DA3-GIANT backbone uses RoPE, multi-view camera tokens, qk-norm, and a SwiGLU FFN. Five changes were needed for coremltools: the four RoPE / camera-token / meshgrid rewrites already used for DA3-LARGE, plus a converter-side cast shim. The shim is needed because numpy 2.x refuses int() on size-1 arrays that are not 0-dimensional, which breaks the constant cast of H//patch_size. Single-image behaviour is unchanged.
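
The cast problem in the conversion notes can be illustrated with a small helper. This is only a sketch of the kind of shim meant, not the patch actually applied to coremltools: the name to_python_int is hypothetical, and the patch size of 14 is an assumption made for illustration (the card does not state it).

```python
import numpy as np


def to_python_int(value):
    """Convert a Python number or any size-1 array to a plain int."""
    arr = np.asarray(value)
    if arr.size != 1:
        raise ValueError(f"expected exactly one element, got shape {arr.shape}")
    # .item() extracts the single element whatever the number of dimensions,
    # so int() is never called directly on a 1-D or 2-D array.
    return int(arr.item())


H = 504           # input height from the card
PATCH_SIZE = 14   # assumption for illustration, not stated in the card

grid = np.array([H // PATCH_SIZE])     # shape (1,): size 1 but not 0-dimensional
tokens_per_side = to_python_int(grid)  # plain Python int, safe to use as a constant
```

Because .item() accepts any size-1 array, the same helper covers 0-dimensional values and (1,)-shaped constants alike, and it fails loudly on anything larger.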

This is the highest-capacity DA3 depth variant and, correspondingly, the slowest at inference. For real-time monocular depth, the smaller DA3MONO-LARGE or DA3-LARGE is usually the better trade-off.
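
For orientation, a single-image prediction from Python might look like the sketch below. It assumes coremltools and Pillow are installed and that it runs on macOS, where Core ML prediction is available. The input name "image", the output name "depth", and the 504×504 size are taken from the card; the function names and the min-max rescaling for display are illustrative choices, not part of the released package.

```python
import numpy as np

INPUT_SIZE = 504  # input resolution stated in the card


def normalize_depth(depth):
    """Squeeze a (1, H, W) relative-depth map to (H, W) and rescale it to [0, 1]."""
    d = np.asarray(depth, dtype=np.float32).squeeze()
    lo, hi = float(d.min()), float(d.max())
    if hi - lo < 1e-12:
        # Constant map: avoid dividing by zero.
        return np.zeros_like(d)
    return (d - lo) / (hi - lo)


def predict_depth(model_path, image_path):
    """Run the Core ML package on one image (macOS, coremltools, Pillow required)."""
    # Imported lazily so normalize_depth stays usable without these packages.
    import coremltools as ct
    from PIL import Image

    model = ct.models.MLModel(model_path)
    # A plain resize ignores the aspect ratio; crop or pad first if that matters.
    img = Image.open(image_path).convert("RGB").resize((INPUT_SIZE, INPUT_SIZE))
    out = model.predict({"image": img})
    return normalize_depth(out["depth"])
```

The output is relative depth, so the rescaled values are only meaningful within one image and should not be compared across images or read as metric distances.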

License & attribution

Derived from depth-anything/DA3-GIANT (Depth Anything 3, arXiv:2511.10647), CC-BY-NC-4.0. Released under the same license: attribution required, non-commercial use only. For commercial use, see the Apache-2.0 depth-anything/DA3MONO-LARGE.


Paper for sdkv2/DA3-GIANT-CoreML: Depth Anything 3: Recovering the Visual Space from Any Views (arXiv:2511.10647, published Nov 13, 2025).