WiLoR hand pose estimation rebuilt end-to-end in MLX for Apple Silicon

#1
by lyonsno - opened
The Basin Maintenance Division org

We rebuilt WiLoR-mini end-to-end in MLX for Apple Silicon: the full inference pipeline, including the ViT-H/16 backbone, the MANO hand model, and the RefineNet refinement stage, with sub-millimeter geometric parity against PyTorch.

We couldn't find another public WiLoR MLX or CoreML port, so we're publishing this as a technical priority flag. If we missed related work, we'd love pointers.

One-line setup

from wilor_mlx import WiLoR
model = WiLoR.from_pretrained()  # auto-downloads weights, derives MANO locally

The first run needs torch once, to convert MANO from the upstream WiLoR-mini checkpoint. After that, inference is pure MLX, with no torch dependency.

Performance (M4 Max, float32)

The important measurement is the live sidecar route we actually use for interaction: camera frame → hand crop → WiLoR-mini pose/reconstruction → hand-pose event.

In a same-harness smoke test on a freshly rebooted M4 Max, run over recent 160x120 saved frames from a gesture UI prototype, MLX runs the pose/reconstruction model stage at a median of about 37 ms versus 49 ms for PyTorch MPS, and the full saved-frame route at about 49 ms versus 60 ms. That is roughly a 1.3x model-stage advantage and a 1.2x full-route advantage, measured against the comparison baseline we trust most right now.
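
For anyone who wants to reproduce the ratios above, here is a minimal sketch of a same-harness timing loop and the speedup arithmetic. The helper names (median_latency_ms, speedup) and the warm-up and run counts are illustrative assumptions, not part of wilor-mlx. With a lazy framework such as MLX, the timed callable has to force evaluation (for example with mx.eval) inside the timed region, otherwise only graph construction is measured.

```python
import statistics
import time
from typing import Callable, List


def median_latency_ms(stage: Callable[[], object], warmup: int = 3, runs: int = 20) -> float:
    """Median wall-clock latency of `stage` in milliseconds (hypothetical helper)."""
    for _ in range(warmup):
        stage()  # discard warm-up calls so one-time setup cost is not counted
    samples: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        stage()  # must block until the result is actually computed
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


# Medians quoted in the post (PyTorch MPS vs MLX, in milliseconds).
model_stage = speedup(49.0, 37.0)
full_route = speedup(60.0, 49.0)
print(f"model stage: {model_stage:.2f}x, full route: {full_route:.2f}x")
# prints "model stage: 1.32x, full route: 1.22x"
```

Passing the same saved frames through both backends with one helper like this keeps the comparison on a shared denominator.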

That latency is low enough to make 3D hand pose plausible as a real-time control primitive, not just a batch inference model. Our traces point to dispatch and synchronization as the main difference, not memory copies: both routes sit on Apple Silicon unified memory, but MLX's lazy graph gives the hot path fewer places for a hitch to land.

Older app-level PyTorch MPS telemetry is what motivated the port. Clean reruns moved the comparison baseline enough that we are not presenting the old tail history as a fresh, universal PyTorch-vs-MLX headline.

Larger derived-frame stress tests widen both backends; MLX remained faster in those runs, but we treat those numbers as route/runtime stress evidence rather than the headline model benchmark.

Lower-bandwidth M2 Pro/Tahoe validation also shows MLX ahead on archived hand-positive frames, but recent macOS/Metal changes moved both backends enough that we are treating exact M2 Pro numbers as rebaseline work rather than headline copy.

Numerical accuracy

Output                    Max abs diff
Mesh vertices (778×3)     0.006
Hand keypoints (21×3)     0.006

Sub-millimeter. Verified layer-by-layer through all 32 transformer blocks: the residual is float32 accumulation noise, not a port error.
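
The metric in the table can be sketched as follows. This is an illustration of a max-abs-diff parity check under stated assumptions, not the project's actual test code: max_abs_diff is a hypothetical helper, and the keypoints are synthetic stand-ins for the PyTorch and MLX outputs.

```python
from typing import Sequence

Points = Sequence[Sequence[float]]  # N rows of (x, y, z)


def max_abs_diff(reference: Points, candidate: Points) -> float:
    """Largest per-coordinate absolute difference between two point sets."""
    if len(reference) != len(candidate):
        raise ValueError("point counts differ")
    worst = 0.0
    for ref_pt, cand_pt in zip(reference, candidate):
        if len(ref_pt) != len(cand_pt):
            raise ValueError("point dimensions differ")
        for r, c in zip(ref_pt, cand_pt):
            worst = max(worst, abs(r - c))
    return worst


# Synthetic stand-ins for 21 hand keypoints from two backends (assumed data).
reference_keypoints = [[float(i), float(i) + 0.5, float(i) + 1.0] for i in range(21)]
candidate_keypoints = [[x + 0.004, y - 0.006, z] for x, y, z in reference_keypoints]

diff = max_abs_diff(reference_keypoints, candidate_keypoints)
print(f"max abs diff: {diff:.3f}")
```

Running the same check on intermediate activations after each transformer block is one way to localize where a residual first appears.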

Weights

Float32 (2.4 GB) and int4 (490 MB) safetensors on this model page. Int4 is a download/storage convenience: inference speed is the same because the model is compute-bound at 210 tokens, not memory-bandwidth-bound.
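
As a rough sanity check on those file sizes, the arithmetic below estimates the parameter count and the effective bits per weight of the int4 file. It assumes decimal units (1 GB = 1e9 bytes), that the float32 file is almost entirely 4-byte weights, and that both files hold the same parameters. These are assumptions for illustration, not statements about how the checkpoints were actually produced.

```python
FLOAT32_BYTES = 2.4e9   # 2.4 GB float32 safetensors, from the post
INT4_BYTES = 490e6      # 490 MB int4 safetensors, from the post

# Assumption: every float32 weight takes 4 bytes and metadata is negligible.
approx_params = FLOAT32_BYTES / 4

# Download-size reduction and average storage cost per weight in the int4 file.
size_ratio = FLOAT32_BYTES / INT4_BYTES
effective_bits = INT4_BYTES * 8 / approx_params

print(f"approx parameters: {approx_params / 1e6:.0f}M")
print(f"int4 file is {size_ratio:.1f}x smaller")
print(f"effective bits per weight: {effective_bits:.2f}")
```

Under these assumptions the int4 file comes out at about 4.9x smaller and averages more than 4 bits per weight, which would be consistent with quantization scales or some tensors kept at higher precision, though the post does not say which.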

MANO licensing

MANO is separately licensed by the Max Planck Institute. wilor-mlx does not bundle or rehost MANO data; it fetches upstream WiLoR-mini assets and converts them locally. You can also supply your own copy via mano_path=....

Links

