wifi-densepose-mmfi-pose — WiFi-CSI → 2D human pose (SOTA on MM-Fi)

Estimate where a person's body is — 17 joints — from nothing but the WiFi signal in the room. No camera.

This model reads the Channel State Information (CSI) of a WiFi link — the tiny distortions a human body imprints on the radio waves bouncing around a room — and predicts a 17-keypoint COCO skeleton. On the public MM-Fi benchmark it exceeds the prior published state of the art.

Results (MM-Fi WiFi-CSI, torso-normalized PCK@20)

Model	torso-PCK@20	Notes
CSI2Pose	68.41%	prior work
MultiFormer (prior SOTA, 2025)	72.25%	transformer + multi-stage fusion
This model (single)	82.69%	+10.4 over SOTA
This model (3-ensemble + TTA)	83.59%	+11.3 over SOTA

Protocol: MM-Fi random_split (ratio 0.8, seed 0) — the dataset's documented default.
Metric: torso-PCK@20 = fraction of predicted joints that lie within 0.2·‖right_shoulder − left_hip‖ of their ground-truth position. This is identical to MultiFormer's Table VII definition (we recompute on their metric for a fair comparison).
Integrity: the headline was self-corrected down from an inflated 91.86% (a looser bbox normalization) to 82.69% under the matched torso metric, before publishing.
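
The metric above can be sketched in a few lines of NumPy. This is a minimal illustration, not the evaluation script from the repo. It assumes standard COCO-17 joint ordering (right shoulder = 6, left hip = 11), that the torso length is taken from the ground-truth skeleton, and that all joints and samples are averaged with equal weight. The function name `torso_pck` is hypothetical.

```python
import numpy as np

# COCO-17 keypoint indices (standard COCO ordering, assumed here).
RIGHT_SHOULDER = 6
LEFT_HIP = 11

def torso_pck(pred, gt, alpha=0.2):
    """Fraction of joints whose error is within alpha * torso length.

    pred, gt: arrays of shape [N, 17, 2] in the same coordinate units.
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    # Per-sample torso length from ground truth: ||right_shoulder - left_hip||.
    torso = np.linalg.norm(gt[:, RIGHT_SHOULDER] - gt[:, LEFT_HIP], axis=-1)  # [N]
    # Per-joint Euclidean error between prediction and ground truth.
    err = np.linalg.norm(pred - gt, axis=-1)  # [N, 17]
    # A joint counts as correct when its error is at most alpha * torso.
    correct = err <= alpha * torso[:, None]
    return float(correct.mean())
```

Calling `torso_pck(pred, gt)` with the default `alpha=0.2` gives the PCK@20 variant. Samples with a degenerate (zero-length) torso are not handled specially in this sketch.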

⚠️ Controlled claim. This is an in-domain (random-split) result. Random split has temporal/subject-adjacency effects common to this benchmark family, so it does not show that real-world generalization is solved. On harder protocols we measured cross-subject 64% and cross-environment 17.5% (with CORAL domain alignment). Cross-room is the real deployment frontier.

How it works (plain language → technical)

Plain: A moving body changes how WiFi multipath arrives at the receiver. The model learns the mapping from those changes to joint positions.

Technical:

Input: [3, 114, 10] CSI amplitude — 3 antenna pairs × 114 subcarriers × 10 time frames (100 Hz). (We tested adding phase — raw and antenna-differenced — and it hurt; amplitude alone is best here.)
Encoder: the 10 time frames become 10 tokens (dim 342→256) → a 4-layer / 8-head Transformer.
Pooling: temporal attention pooling, the single biggest lever. Replacing global mean pooling, which discarded the motion dynamics, took us from 3% to 48%+ early on.
Decoder: MLP head → a skeleton-graph refinement head (graph-conv over the COCO bone topology) that nudges joints toward anatomically consistent positions.
Output: [17, 2] keypoints in [0,1]. ~2.3M params.
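
The input, encoder, pooling and MLP stages listed above can be sketched in PyTorch roughly as follows. This is an illustrative stand-in, not the `TF` class from `model.py`. The class name, the learned positional embedding, the feed-forward width, the hidden size of the MLP head and the final sigmoid (used to keep outputs in [0,1]) are all assumptions, and the sketch omits the skeleton-graph refinement head. It is not expected to reproduce the reported parameter count or accuracy.

```python
import torch
import torch.nn as nn

class AttnPoolPoseSketch(nn.Module):
    """Hypothetical sketch of the described pipeline (not the repo's TF class)."""

    def __init__(self, n_ant=3, n_sub=114, n_frames=10, d_model=256,
                 n_layers=4, n_heads=8, n_joints=17):
        super().__init__()
        self.n_joints = n_joints
        # Each time frame becomes one token: 3 * 114 = 342 features -> 256.
        self.embed = nn.Linear(n_ant * n_sub, d_model)
        # Learned positional embedding (assumption) so frame order is visible.
        self.pos = nn.Parameter(torch.zeros(1, n_frames, d_model))
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=512, batch_first=True)  # width is a guess
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Temporal attention pooling: one scalar logit per time token.
        self.score = nn.Linear(d_model, 1)
        self.head = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, n_joints * 2))

    def forward(self, csi):
        # csi: [B, 3, 114, 10] -> tokens: [B, 10, 342]
        b, a, s, t = csi.shape
        tokens = csi.permute(0, 3, 1, 2).reshape(b, t, a * s)
        h = self.encoder(self.embed(tokens) + self.pos)   # [B, 10, 256]
        w = torch.softmax(self.score(h), dim=1)           # [B, 10, 1], sums to 1 over time
        pooled = (w * h).sum(dim=1)                       # weighted sum over frames: [B, 256]
        # Sigmoid keeps coordinates in [0, 1] (assumed output squashing).
        return torch.sigmoid(self.head(pooled)).view(b, self.n_joints, 2)
```

The key contrast with global mean pooling is the `w` term: a mean would weight all 10 frames equally, while the softmax weights let the model emphasize the frames that carry the most motion information.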

What made it SOTA (ablation-backed)

Attention pooling — keep temporal dynamics (the breakthrough).
Transformer encoder — 81.6 → 83+ over the conv baseline.
Skeleton-graph head + ensemble + test-time augmentation — the final points.
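
As an illustration of the skeleton-graph head named above, here is a minimal sketch of a single residual graph-convolution step over the COCO bone topology. It assumes the standard 19-edge COCO-17 skeleton and assumes the refinement operates directly on 2D joint coordinates; the actual head may work on features instead. The names `COCO_EDGES`, `normalized_adjacency` and `GraphRefineSketch` are hypothetical.

```python
import torch
import torch.nn as nn

# Standard COCO-17 skeleton as 0-indexed joint pairs (assumed topology).
COCO_EDGES = [
    (15, 13), (13, 11), (16, 14), (14, 12), (11, 12), (5, 11), (6, 12),
    (5, 6), (5, 7), (6, 8), (7, 9), (8, 10), (1, 2), (0, 1), (0, 2),
    (1, 3), (2, 4), (3, 5), (4, 6),
]

def normalized_adjacency(n_joints=17, edges=COCO_EDGES):
    """Symmetric normalization D^-1/2 (A + I) D^-1/2 with self-loops."""
    a = torch.eye(n_joints)
    for i, j in edges:
        a[i, j] = 1.0
        a[j, i] = 1.0
    d = a.sum(dim=1).pow(-0.5)
    return d[:, None] * a * d[None, :]

class GraphRefineSketch(nn.Module):
    """Residual refinement: each joint is nudged using its bone neighbours."""

    def __init__(self, hidden=32):
        super().__init__()
        self.register_buffer("adj", normalized_adjacency())
        self.lin_in = nn.Linear(2, hidden)
        self.lin_out = nn.Linear(hidden, 2)
        # Zero-init the output layer so refinement starts as the identity.
        nn.init.zeros_(self.lin_out.weight)
        nn.init.zeros_(self.lin_out.bias)

    def forward(self, kp):
        # kp: [B, 17, 2]; adj: [17, 17] broadcasts over the batch dimension.
        h = torch.relu(self.adj @ self.lin_in(kp))   # aggregate neighbours: [B, 17, hidden]
        return kp + self.lin_out(self.adj @ h)       # residual correction: [B, 17, 2]
```

The residual form means the head can only add a correction to the MLP's prediction, which matches the description of nudging joints rather than re-predicting them.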

Honest negatives we tested and discarded: subject-adversarial (DANN), pose-contrastive embeddings, static-clutter removal, multi-task action+pose, bigger models, raw/sanitized phase.

Real-world deployment: zero-shot vs few-shot calibration

The headline 83.59% is in-domain (random_split). For a person the model has never seen (cross-subject), honest zero-shot accuracy is ~64%, and we measured how to close that gap. Adding more training subjects saturates fast (8→32 subjects: only +6 pts). But a tiny in-room calibration, a few labeled frames from the deployment site, recovers most of the gap:

In-room calibration	Total frames	Cross-subject PCK@20
none (zero-shot)	0	63.7%
~20 frames/person	160	68.1%
~50 frames/person	400	72.2% (≈ prior SOTA)
~200 frames/person	1,600	76.1%
~1000 frames/person	8,000	78.3%

Takeaway: ship the model plus a ~30-second on-site calibration and reach 72–76% cross-subject, far cheaper than collecting a giant multi-subject corpus (a few hundred calibration frames beat 20+ extra training subjects).

Cross-environment (unseen room and people) is the same story, even more strongly: zero-shot is nearly unusable (10.6%), but 5 calibration samples/person reach 60% and 200 reach 73.1%. An unseen room is one coherent shift that a few labeled frames can pin down, so the biggest zero-shot gap gives the biggest calibration gain.

Bottom line: in every deployment case we measured, a small in-room calibration recovers usable accuracy. About 5 labeled samples per person reach 60% in an unseen room, and about 200 reach prior-SOTA-level pose (73.1% vs 72.25%), deployable as an ~11 KB per-room LoRA adapter. Full study: ADR-150 §3.3–3.6 in the repo.
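
To make the per-room LoRA adapter idea concrete, here is a minimal sketch of a low-rank adapter around a frozen linear layer, plus a tiny calibration loop. The rank, scaling, loss, optimizer and step count are illustrative assumptions, and which layers the real adapter wraps is not stated here; see ADR-150 for the actual procedure. `LoRALinear` and `calibrate` are hypothetical names.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base layer plus a trainable low-rank update: W x + scale * B A x."""

    def __init__(self, base: nn.Linear, rank=4, alpha=8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)          # base weights stay fixed
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))  # zero: no change at init
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

def calibrate(model, csi, keypoints, steps=50, lr=1e-3):
    """Fit only the trainable (adapter) parameters on a few labeled frames."""
    params = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.Adam(params, lr=lr)
    model.train()
    last_loss = float("nan")
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(csi), keypoints)
        loss.backward()
        opt.step()
        last_loss = loss.item()              # loss measured before this step's update
    return last_loss
```

Only `lora_a` and `lora_b` need to be stored per room, which is why an adapter of this kind can be a few kilobytes while the base weights are shared across deployments.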

Efficiency frontier — beats SOTA at a fraction of the size

The 2.32M-param flagship hits 82.69% (single) / 83.59% (ensemble). For edge deployment, parameter count and latency matter too, so we swept smaller models on the same MM-Fi random_split:

Model	Params	Latency (b=1)	torso-PCK@20	vs prior SOTA (72.25%)
nano	39,971	0.126 ms	71.76%	−0.5 (58× smaller than flagship)
micro	75,237	0.224 ms	74.30%	✅ +2.1 (31× smaller than flagship)
tiny	210,949	0.299 ms	76.82%	✅ +4.6
small	348,005	0.287 ms	77.87%	✅ +5.6
base	726,437	0.344 ms	79.38%	✅ +7.1
flagship	2,320,869 (per model)	n/a	82.69% single; 83.59% with 3-ensemble + TTA	+10.4 single; +11.3 ensemble

A 75K-param model already beats MultiFormer (72.25%) — every config from micro up is Pareto-dominant (smaller and more accurate than prior SOTA). Full curve + method: https://github.com/ruvnet/RuView/blob/main/docs/benchmarks/wifi-pose-efficiency-frontier.md
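
For readers who want to place their own variant on this frontier, parameter count and batch-size-1 latency can be measured roughly as below. This is a generic CPU timing sketch, not the benchmark harness behind the table (the linked document describes the actual method). The helper names and the warm-up and iteration counts are assumptions.

```python
import time
import torch
import torch.nn as nn

def count_params(model: nn.Module) -> int:
    """Total number of parameters, trainable or not."""
    return sum(p.numel() for p in model.parameters())

def latency_ms(model: nn.Module, example: torch.Tensor, warmup=10, iters=100) -> float:
    """Mean forward-pass time in milliseconds on the current (CPU) device."""
    model.eval()
    with torch.no_grad():
        for _ in range(warmup):              # warm-up runs are not timed
            model(example)
        start = time.perf_counter()
        for _ in range(iters):
            model(example)
        elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / iters
```

For the b=1 setting in the table, `example` would be a tensor of shape `[1, 3, 114, 10]`. On a GPU the timed region would additionally need `torch.cuda.synchronize()` before reading the clock, otherwise asynchronous kernel launches make the numbers look too small.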

Reproduce / proof

Weights SHA-256: d43ccd7e52e30a3adf5f6f3457224ea61bce5ed893abcdb183c570c78b16e13e
Full reproduction (parser + trainer + protocol): https://gist.github.com/ruvnet/af2fbc1c7674dddf09c15509b3c7f785
Auditable leaderboard + witness ledger: https://huggingface.co/spaces/ruvnet/aether-arena
Source: https://github.com/ruvnet/RuView
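
The published checksum can be verified before loading any weights. A minimal sketch using only the standard library follows; it assumes the hash refers to the `pose_mmfi_best.pt` file used in the Usage section, which is not stated explicitly. The helper name `sha256_of` is hypothetical.

```python
import hashlib

EXPECTED_SHA256 = "d43ccd7e52e30a3adf5f6f3457224ea61bce5ed893abcdb183c570c78b16e13e"

def sha256_of(path, chunk_size=1 << 20):
    """Hex SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

A check such as `sha256_of("pose_mmfi_best.pt") == EXPECTED_SHA256` should pass before the file is handed to `torch.load`.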

Usage

import torch
from model import TF                     # see model.py in this repo

net = TF(na=3)
net.load_state_dict(torch.load("pose_mmfi_best.pt", map_location="cpu"))
net.eval()                               # disable dropout etc. for inference
csi = torch.randn(1, 3, 114, 10)         # placeholder; use MM-Fi CSI amplitude, per-sample standardized
with torch.no_grad():                    # no gradients needed at inference time
    kp = net(csi).reshape(1, 17, 2)      # -> 17 keypoints in [0,1]
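
The snippet above expects per-sample standardized amplitude and returns keypoints in [0,1]. A hedged sketch of both conversions is shown below. It assumes that "per-sample standardized" means zero mean and unit variance over each whole `[3, 114, 10]` sample, and that the [0,1] coordinates are relative to the image width and height; neither detail is confirmed here, so check the reproduction gist before relying on it. Both function names are hypothetical.

```python
import numpy as np

def standardize_per_sample(csi, eps=1e-6):
    """Zero-mean, unit-variance scaling computed separately for each sample.

    csi: array of shape [N, 3, 114, 10] holding CSI amplitude.
    """
    csi = np.asarray(csi, dtype=np.float32)
    axes = tuple(range(1, csi.ndim))                 # every axis except the batch axis
    mean = csi.mean(axis=axes, keepdims=True)
    std = csi.std(axis=axes, keepdims=True)
    return (csi - mean) / (std + eps)

def to_pixels(kp, width, height):
    """Scale [..., 2] keypoints from [0, 1] to pixel coordinates (x, y)."""
    kp = np.asarray(kp, dtype=np.float32)
    return kp * np.array([width, height], dtype=np.float32)
```

The output of `standardize_per_sample` can be wrapped with `torch.from_numpy` before being passed to the model.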

Dataset: MM-Fi (CC BY-NC 4.0). This model inherits the non-commercial license.
