wifi-densepose-mmfi-pose: WiFi-CSI → 2D human pose (SOTA on MM-Fi)
Estimate where a person's body is (17 joints) from nothing but the WiFi signal in the room. No camera.
This model reads the Channel State Information (CSI) of a WiFi link, the tiny distortions a human body imprints on the radio waves bouncing around a room, and predicts a 17-keypoint COCO skeleton. On the public MM-Fi benchmark it exceeds the prior published state of the art.
Results (MM-Fi WiFi-CSI, torso-normalized PCK@20)
| Model | torso-PCK@20 | Notes |
|---|---|---|
| CSI2Pose | 68.41% | prior work |
| MultiFormer (prior SOTA, 2025) | 72.25% | transformer + multi-stage fusion |
| This model (single) | 82.69% | +10.4 over SOTA |
| This model (3-ensemble + TTA) | 83.59% | +11.3 over SOTA |
- Protocol: MM-Fi random_split(ratio 0.8, seed 0), the dataset's documented default.
- Metric: torso-PCK@20 = fraction of joints within 0.2·‖right_shoulder − left_hip‖ of the ground truth. This is identical to MultiFormer's Table VII definition (we recompute on their metric for a fair comparison).
- Integrity: the headline was self-corrected down from an inflated 91.86% (a looser bbox normalization) to 82.69% under the matched torso metric, before publishing.
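The metric above can be sketched in a few lines of NumPy. This is a minimal illustration, not the evaluation script from the repo: it assumes the standard COCO-17 joint ordering (right_shoulder = 6, left_hip = 11), measures the torso on the ground-truth skeleton, and counts a joint as correct when its error is at most 0.2 torso lengths. The function name `torso_pck` is hypothetical.

```python
import numpy as np

# Standard COCO-17 indices (assumed ordering; check the repo's parser).
RIGHT_SHOULDER, LEFT_HIP = 6, 11

def torso_pck(pred, gt, alpha=0.2):
    """Fraction of joints whose error is within alpha * torso length.

    pred, gt: arrays of shape [N, 17, 2] in the same coordinate system.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    # Per-sample torso length ||right_shoulder - left_hip|| on the ground truth.
    torso = np.linalg.norm(gt[:, RIGHT_SHOULDER] - gt[:, LEFT_HIP], axis=-1)  # [N]
    # Per-joint Euclidean error between prediction and ground truth.
    err = np.linalg.norm(pred - gt, axis=-1)  # [N, 17]
    correct = err <= alpha * torso[:, None]
    return float(correct.mean())
```

Normalizing by a bounding-box size instead of the torso gives a larger threshold and therefore a higher score, which is why the two normalizations are not comparable.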
⚠️ Controlled claim. This is an in-domain (random-split) result. Random split has temporal/subject-adjacency effects common to this benchmark family, so it does not show solved real-world generalization. On harder protocols we measured cross-subject 64% and cross-environment 17.5% (with CORAL domain alignment). Cross-room is the real deployment frontier.
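For readers unfamiliar with CORAL, the sketch below shows the usual correlation-alignment distance: the squared Frobenius norm between the feature covariances of a source batch and a target batch, scaled by 1/(4d²). The text does not say which layer's features were aligned or how the term was weighted, so those choices are left open here, and `coral_loss` is a hypothetical name.

```python
import numpy as np

def coral_loss(source_feats, target_feats):
    """CORAL distance between two feature batches of shape [n, d]."""
    s = np.asarray(source_feats, dtype=float)
    t = np.asarray(target_feats, dtype=float)
    d = s.shape[1]
    # Feature covariance of each domain (rows are samples).
    cov_s = np.cov(s, rowvar=False)
    cov_t = np.cov(t, rowvar=False)
    # Squared Frobenius norm of the covariance gap, with the usual 1/(4 d^2) scaling.
    return float(np.sum((cov_s - cov_t) ** 2) / (4.0 * d * d))
```

In training, a term like this would be added to the pose loss so that features from the new room match the second-order statistics of the training rooms.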
How it works (plain language → technical)
Plain: A moving body changes how WiFi multipath arrives at the receiver. The model learns the mapping from those changes to joint positions.
Technical:
- Input: [3, 114, 10] CSI amplitude: 3 antenna pairs × 114 subcarriers × 10 time frames (100 Hz). (We tested adding phase, both raw and antenna-differenced, and it hurt; amplitude alone is best here.)
- Encoder: the 10 time frames become 10 tokens (dim 342→256) → a 4-layer / 8-head Transformer.
- Pooling: temporal attention pooling, the single biggest lever (replacing global-mean-pool, which discarded the motion dynamics, took us from 3% → 48%+ early on).
- Decoder: MLP head → a skeleton-graph refinement head (graph-conv over the COCO bone topology) that nudges joints toward anatomically consistent positions.
- Output: [17, 2] keypoints in [0,1]. ~2.3M params.
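The tokenization and pooling steps can be illustrated with plain NumPy. This is a shape-level sketch under stated assumptions, not the code in model.py: the random matrix stands in for the learned 342→256 projection, the Transformer layers are omitted, and the attention scores come from a single scoring vector, which is one common form of attention pooling and may differ from the released model. The names `tokens_from_csi` and `attention_pool` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def tokens_from_csi(csi):
    """[3, 114, 10] amplitude -> [10, 342]: one token per time frame."""
    pairs, subcarriers, frames = csi.shape
    return csi.transpose(2, 0, 1).reshape(frames, pairs * subcarriers)

def attention_pool(h, w):
    """Weighted average over time. h: [T, D] token features, w: [D] scoring vector."""
    scores = h @ w                      # one scalar score per time frame, [T]
    scores = scores - scores.max()      # subtract the max for a stable softmax
    weights = np.exp(scores) / np.exp(scores).sum()  # [T], sums to 1
    return weights @ h, weights         # pooled feature [D], weights [T]

csi = rng.normal(size=(3, 114, 10))
tokens = tokens_from_csi(csi)                       # [10, 342]
proj = rng.normal(size=(342, 256)) / np.sqrt(342)   # stand-in for the learned projection
h = tokens @ proj                                   # [10, 256]; Transformer layers would go here
pooled, weights = attention_pool(h, rng.normal(size=256))
```

With an all-zero scoring vector the weights become uniform and the result equals global mean pooling, so mean pooling is the special case in which every frame counts equally.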
What made it SOTA (ablation-backed)
- Attention pooling → keep temporal dynamics (the breakthrough).
- Transformer encoder: 81.6 → 83+ over the conv baseline.
- Skeleton-graph head + ensemble + test-time augmentation → the final points.
Honest negatives we tested and discarded: subject-adversarial (DANN), pose-contrastive embeddings, static-clutter removal, multi-task action+pose, bigger models, raw/sanitized phase.
Real-world deployment: zero-shot vs few-shot calibration
The headline 83.59% is in-domain (random_split). For a new, unseen person
(cross-subject), honest zero-shot accuracy is ~64%, and we measured exactly how to close that
gap. Adding more training subjects saturates fast (8→32 subjects: only +6 pts). But a tiny
in-room calibration (a few labeled frames from the deployment site) recovers most of the gap:
| In-room calibration | Frames | Cross-subject PCK@20 |
|---|---|---|
| none (zero-shot) | 0 | 63.7% |
| ~20 frames/person | 160 | 68.1% |
| ~50 frames/person | 400 | 72.2% (≈ prior SOTA) |
| ~200 frames/person | 1,600 | 76.1% |
| ~1000 frames/person | 8,000 | 78.3% |
Takeaway: ship the model plus a ~30-second on-site calibration and reach 72–76% in a brand-new room. That is far cheaper than collecting a giant multi-subject corpus (a few hundred calibration frames beat 20+ extra training subjects).
Cross-environment (unseen room and people) is the same story, even more strongly: zero-shot is nearly unusable (10.6%), but 5 calibration samples/person → 60%, and 200 → 73.1%. An unseen room is one coherent shift that a few labeled frames pin down, so the biggest zero-shot gap gives the biggest calibration gain.
Bottom line: in our measurements, a tiny in-room calibration leaves no deployment case out of reach. Any new room or person reaches SOTA-level pose from ~5–200 labeled samples, deployable as an ~11 KB per-room LoRA adapter. Full study: ADR-150 §3.3–3.6 in the repo.
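To make the per-room adapter idea concrete, here is a minimal NumPy sketch of a LoRA-style layer: the base weight stays frozen and only two small low-rank matrices are stored per room. The rank, scaling, and the choice of a 256×256 layer are illustrative assumptions. The text does not state which layers the real adapter touches or how its ~11 KB is laid out, and `LoRALinear` is a hypothetical name.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / rank) * B @ A."""

    def __init__(self, weight, rank=4, alpha=4.0, seed=0):
        rng = np.random.default_rng(seed)
        self.weight = np.asarray(weight, dtype=np.float32)   # [d_out, d_in], frozen
        d_out, d_in = self.weight.shape
        # A starts small and random, B starts at zero, so the adapter is a no-op at first.
        self.A = rng.normal(scale=0.01, size=(rank, d_in)).astype(np.float32)
        self.B = np.zeros((d_out, rank), dtype=np.float32)
        self.scale = alpha / rank

    def __call__(self, x):
        # x: [n, d_in] -> [n, d_out]; base path plus the low-rank correction.
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def adapter_bytes(self):
        # Only A and B need to be saved per room; W ships once with the base model.
        return self.A.nbytes + self.B.nbytes
```

Under these assumed settings (rank 4 on one 256×256 layer, float32), the adapter holds 2,048 values, a few kilobytes, while the frozen layer itself holds 65,536. During calibration only A and B would be updated on the handful of labeled in-room frames.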
Efficiency frontier: beats SOTA at a fraction of the size
The 2.32M-param flagship hits 82.69% (single) / 83.59% (ensemble). But for edge deployment,
params and latency matter too. We swept smaller models on the same MM-Fi random_split:
| Model | Params | Latency (b=1) | torso-PCK@20 | vs SOTA 72.25 |
|---|---|---|---|---|
| nano | 39,971 | 0.126 ms | 71.76% | −0.5 (58× smaller than flagship) |
| micro | 75,237 | 0.224 ms | 74.30% | ✓ +2.1 (31× smaller than flagship) |
| tiny | 210,949 | 0.299 ms | 76.82% | ✓ +4.6 |
| small | 348,005 | 0.287 ms | 77.87% | ✓ +5.6 |
| base | 726,437 | 0.344 ms | 79.38% | ✓ +7.1 |
| flagship | 2,320,869 | n/a | 83.59% (3-ensemble + TTA; 82.69% single) | +11.3 |
A 75K-param model already beats MultiFormer (72.25%): every config from micro up is
Pareto-dominant (smaller and more accurate than prior SOTA). Full curve + method:
https://github.com/ruvnet/RuView/blob/main/docs/benchmarks/wifi-pose-efficiency-frontier.md
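The margins and size ratios in the table can be rechecked with a few lines of Python. The numbers below are copied from the table above; the size ratios quoted there match a comparison against the flagship's parameter count, which is what this sketch assumes. The helper name `summarize` is hypothetical.

```python
SOTA_PCK = 72.25  # MultiFormer torso-PCK@20, from the results table

# (name, parameter count, torso-PCK@20 in %), copied from the table above.
models = [
    ("nano", 39_971, 71.76),
    ("micro", 75_237, 74.30),
    ("tiny", 210_949, 76.82),
    ("small", 348_005, 77.87),
    ("base", 726_437, 79.38),
    ("flagship", 2_320_869, 83.59),
]

flagship_params = models[-1][1]

def summarize(models, sota=SOTA_PCK):
    """Margin over prior SOTA and size ratio against the flagship for each model."""
    rows = []
    for name, params, pck in models:
        rows.append({
            "name": name,
            "delta_vs_sota": round(pck - sota, 2),
            "times_smaller_than_flagship": round(flagship_params / params),
            "beats_sota": pck > sota,
        })
    return rows

rows = summarize(models)
# Smallest configuration that still clears the prior SOTA.
smallest_winner = min((m for m in models if m[2] > SOTA_PCK), key=lambda m: m[1])
```

Here `smallest_winner` picks out the micro configuration, the first point on the curve that is above 72.25%.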
Reproduce / proof
- Weights SHA-256: d43ccd7e52e30a3adf5f6f3457224ea61bce5ed893abcdb183c570c78b16e13e
- Full reproduction (parser + trainer + protocol): https://gist.github.com/ruvnet/af2fbc1c7674dddf09c15509b3c7f785
- Auditable leaderboard + witness ledger: https://huggingface.co/spaces/ruvnet/aether-arena
- Source: https://github.com/ruvnet/RuView
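A downloaded checkpoint can be checked against the published digest with the standard library alone. This is a small sketch: the helper names are hypothetical, and the file name pose_mmfi_best.pt in the usage note is taken from the Usage section.

```python
import hashlib

EXPECTED_SHA256 = "d43ccd7e52e30a3adf5f6f3457224ea61bce5ed893abcdb183c570c78b16e13e"

def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large checkpoints do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_weights(path, expected=EXPECTED_SHA256):
    """Return True only if the file's digest matches the published one."""
    return sha256_of_file(path) == expected
```

Calling `verify_weights("pose_mmfi_best.pt")` before `torch.load` confirms that the file on disk is the one the reported numbers refer to.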
Usage
import torch
from model import TF  # see model.py in this repo
net = TF(na=3)
net.load_state_dict(torch.load("pose_mmfi_best.pt", map_location="cpu"))
net.eval()  # inference mode
csi = torch.randn(1, 3, 114, 10)  # placeholder for MM-Fi CSI amplitude, per-sample standardized
kp = net(csi).reshape(1, 17, 2)  # -> 17 keypoints in [0,1]
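The comment above says the input is per-sample standardized. A minimal sketch of one plausible reading follows: zero mean and unit variance computed over each whole [3, 114, 10] sample. Whether the released pipeline normalizes over the whole sample or per antenna pair is not stated here, so treat the axes as an assumption and check the reproduction gist. The function name is hypothetical.

```python
import numpy as np

def standardize_per_sample(csi, eps=1e-6):
    """Zero-mean, unit-variance scaling computed separately for each sample.

    csi: array of shape [N, 3, 114, 10], or a single [3, 114, 10] sample.
    Returns a float32 array of shape [N, 3, 114, 10].
    """
    csi = np.asarray(csi, dtype=np.float32)
    if csi.ndim == 3:
        csi = csi[None]  # add a batch dimension for a single sample
    axes = (1, 2, 3)     # assumed: statistics over the whole sample
    mean = csi.mean(axis=axes, keepdims=True)
    std = csi.std(axis=axes, keepdims=True)
    return (csi - mean) / (std + eps)
```

The result can be passed to the model with `torch.from_numpy(standardize_per_sample(raw_amplitude))`, where `raw_amplitude` is your own CSI amplitude array.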
Dataset: MM-Fi (CC BY-NC 4.0). This model inherits the non-commercial license.