---
title: Sapiens2 CPU
emoji: 🧍
colorFrom: indigo
colorTo: green
sdk: gradio
sdk_version: 6.14.0
app_file: app.py
pinned: false
license: other
---

# Sapiens2 CPU

Meta's `facebook/sapiens2-*` models running on the free HF CPU tier. 15 variants are exposed: seg, normal, pointmap, and pose at 0.4b, 0.8b, and 1b, plus seg-5b, normal-5b, and pointmap-5b as INT8 ONNX. All are curl-callable with a Bearer token.

## Variants and inference time on the included 6000×4000 demo image

| Task | Notes | 0.4b | 0.8b | 1b | 5b (INT8 ONNX) |
|---|---|---|---|---|---|
| seg | DOME 29-class body parts | 57 s | 74 s | 208 s | 189 s |
| normal | per-pixel surface normals | 72 s | 84 s | 206 s | 359 s |
| pointmap | per-pixel XYZ in meters | 78 s | 99 s | 274 s | 386 s |
| pose | DETR detect, 308 keypoints | 47 s | 68 s | 232 s | not shipped |

Verified 15/15 via Gradio API on 2026-05-12. Times include first-call downloads.

0.4b through 1b run as fp32 PyTorch; 5B runs as INT8 ONNX (5 to 6 GB on disk; fp32 5B would need ~20 GB RAM, more than the free tier provides). Dense 0.4b/0.8b models share an LRU(2) cache. Loading any 1B variant hard-clears all model caches (dense + pose + ORT), since 16 GB cpu-basic cannot fit two 1B-class models simultaneously. Pose has its own slot, and DETR (`facebook/detr-resnet-50`) is sticky-loaded once.

**5B chain limitation:** calling a 5B variant immediately after another 5B variant on the same Space instance OOMs. ONNX Runtime's C++ session shutdown is not synchronous with the Python `_ORT_SESSIONS.clear()` call, so loading the next 5B session before the previous one's worker threads exit pushes peak RAM above 16 GB. To benchmark multiple 5B variants, factory-restart the Space (Settings → Factory restart) between calls, or run one variant per cold Space.
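For scripted benchmarking, the restart can also be triggered over the Hub API (a sketch; the `restart?factory=true` endpoint is the one `huggingface_hub`'s `restart_space(..., factory_reboot=True)` uses, assumed here rather than taken from this Space's docs):

```shell
# Hypothetical scripted factory restart between 5B benchmark calls.
if [ -z "$HF_TOKEN" ]; then
  echo "set HF_TOKEN first"
else
  curl -s -X POST \
    -H "Authorization: Bearer $HF_TOKEN" \
    "https://huggingface.co/api/spaces/WeReCooking/sapiens2-cpu/restart?factory=true"
fi
```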

The model expects a fixed 1024×768 input tensor (NCHW with H=1024, W=768: a portrait canvas in Meta's convention). Any input is aspect-preserving resized, then padded to that size.
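The resize-then-pad geometry is plain arithmetic (a minimal sketch; `fit_to_canvas` is a hypothetical helper, not `app.py`'s actual preprocessing code):

```python
def fit_to_canvas(w: int, h: int, canvas_w: int = 768, canvas_h: int = 1024):
    """Aspect-preserving fit of a (w, h) image into the fixed 768x1024
    portrait canvas; returns the resized size and the padding needed."""
    scale = min(canvas_w / w, canvas_h / h)
    new_w, new_h = round(w * scale), round(h * scale)
    return (new_w, new_h), (canvas_w - new_w, canvas_h - new_h)
```

For the 6000×4000 demo image this gives a 768×512 resize with 512 px of vertical padding.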

## CPU-friendly ONNX exports

Companion repo: WeReCooking/sapiens2-onnx (public). Files live in per-task folders `seg/`, `normal/`, `pointmap/`, `pose/`. Each variant is `<task>/<task>_<size>_<precision>.onnx` plus a `.onnx.data` external-data sidecar. 15 ONNX artifacts are shipped: 12 covering 0.4b/0.8b/1b (fp16 for seg-0.4b, fp32 for the rest), plus 3 new 5B INT8 files (seg, normal, pointmap). Cosine similarity vs PyTorch fp32 is 0.999 or better on all shipped variants.
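Fetching one variant follows directly from the naming scheme (a sketch; `onnx_paths` is an illustrative helper, and the commented download assumes the standard `huggingface_hub` API):

```python
def onnx_paths(task: str, size: str, precision: str):
    """Build the in-repo paths for a variant and its external-data sidecar,
    e.g. seg/seg_0.4b_fp16.onnx plus seg/seg_0.4b_fp16.onnx.data."""
    base = f"{task}/{task}_{size}_{precision}.onnx"
    return base, base + ".data"

# Downloading both files would then look like (needs network + huggingface_hub):
# from huggingface_hub import hf_hub_download
# for f in onnx_paths("seg", "0.4b", "fp16"):
#     hf_hub_download("WeReCooking/sapiens2-onnx", f)
```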

Turnkey CLI built into `app.py` (no `sapiens2`/PyTorch dependency needed; install `requirements.txt`):

```shell
export HF_TOKEN=hf_xxx
python app.py onnx seg 0.4b photo.jpg --output seg_overlay.png
python app.py onnx normal 1b photo.jpg --output normals.png
python app.py onnx pointmap 0.8b photo.jpg --output depth.png
python app.py onnx pose 0.4b photo.jpg --output pose.png
python app.py onnx seg 5b photo.jpg --output seg_5b.png
```

## Curl tests

```shell
TOKEN="hf_xxx"
SPACE="https://werecooking-sapiens2-cpu.hf.space"
IMG="https://huggingface.co/spaces/facebook/sapiens2-seg/resolve/main/assets/images/pexels-alex-green-5699868.jpg"

EVT=$(curl -s -X POST "$SPACE/gradio_api/call/predict" \
  -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
  -d "{\"data\":[{\"path\":\"$IMG\",\"meta\":{\"_type\":\"gradio.FileData\"}},\"seg\",\"0.4b\"]}" \
  | python -c "import sys,json;print(json.load(sys.stdin)['event_id'])")
curl -sN "$SPACE/gradio_api/call/predict/$EVT" -H "Authorization: Bearer $TOKEN"
```

## Logs (SSE)

```shell
curl -N -H "Authorization: Bearer $TOKEN" "https://huggingface.co/api/spaces/WeReCooking/sapiens2-cpu/logs/build"
curl -N -H "Authorization: Bearer $TOKEN" "https://huggingface.co/api/spaces/WeReCooking/sapiens2-cpu/logs/run"
```

## 5B INT8 ONNX conversion recipe

The dense 5B variants ship as INT8 ONNX. To re-run the pipeline:

1. Export fp16 ONNX using lazy fp16 init. Call `torch.set_default_dtype(torch.float16)` before `init_model(cfg, None, device="cpu")`, then stream the safetensors file tensor by tensor into the empty fp16 model. This avoids the ~22 GB fp32 init that OOMs on a 32 GB box. Export with `opset_version=18` and no `dynamic_axes`. Force `sys.stdout.reconfigure(encoding="utf-8")` so torch.onnx's success print does not crash on Windows cp1252.
2. Stream-cast fp16 to fp32 on disk via `onnx.external_data_helper.load_external_data_for_model` plus per-tensor `numpy_helper`. Peak RAM stays close to a single tensor (~250 MB). Drop `Cast`(fp16 → fp32) nodes with a transitive rename closure so consumers point at the original input.
3. Run `quantize.shape_inference.quant_pre_process(skip_onnx_shape=True, skip_optimization=True)`. This routes through ORT symbolic shape inference, which understands sapiens2's windowed attention; vanilla `onnx.shape_inference` errors with (6144) vs (512) on the pointmap and normal heads.
4. `quantize_dynamic(weight_type=QuantType.QInt8, per_channel=True, op_types_to_quantize=["MatMul"], use_external_data_format=True)`. This lowers to `MatMulIntegerToFloat`, which accepts fp32 input and has no 2D-only filter (unlike `MatMulNBitsQuantizer`, which silently skips 3D packed-QKV weights).

Pose-5b is not shipped: it uses a different forward signature (a single person-bbox-cropped tensor), and the INT8 quantization attempt did not complete on the available hardware.

## Files

- `app.py`: everything: Gradio Space UI, PyTorch dispatch for 0.4b/0.8b/1b, ORT for 5B, inlined keypoint visualization, and the `python app.py onnx ...` CLI
- `requirements.txt`: Python deps, including `sapiens @ git+https://github.com/facebookresearch/sapiens2.git`
- `packages.txt`: apt deps (`libgl1`, `libglib2.0-0`) installed by the Gradio SDK at build time