wifi-densepose-mmfi-pose β€” WiFi-CSI β†’ 2D human pose (SOTA on MM-Fi)

Estimate where a person's body is β€” 17 joints β€” from nothing but the WiFi signal in the room. No camera.

This model reads the Channel State Information (CSI) of a WiFi link β€” the tiny distortions a human body imprints on the radio waves bouncing around a room β€” and predicts a 17-keypoint COCO skeleton. On the public MM-Fi benchmark it exceeds the prior published state of the art.

Results (MM-Fi WiFi-CSI, torso-normalized PCK@20)

Model torso-PCK@20 Notes
CSI2Pose 68.41% prior work
MultiFormer (prior SOTA, 2025) 72.25% transformer + multi-stage fusion
This model (single) 82.69% +10.4 over SOTA
This model (3-ensemble + TTA) 83.59% +11.3 over SOTA
  • Protocol: MM-Fi random_split (ratio 0.8, seed 0) β€” the dataset's documented default.
  • Metric: torso-PCK@20 = fraction of joints within 0.2Β·β€–right_shoulder βˆ’ left_hipβ€– β€” identical to MultiFormer's Table VII definition (we recompute on their metric for a fair comparison).
  • Integrity: the headline was self-corrected down from an inflated 91.86% (a looser bbox normalization) to 82.69% under the matched torso metric, before publishing.

⚠️ Controlled claim. This is an in-domain (random-split) result. Random split has temporal/subject-adjacency effects common to this benchmark family β€” it is not solved real-world generalization. On harder protocols we measured cross-subject 64% and cross-environment 17.5% (with CORAL domain alignment). Cross-room is the real deployment frontier.

How it works (plain language β†’ technical)

Plain: A moving body changes how WiFi multipath arrives at the receiver. The model learns the mapping from those changes to joint positions.

Technical:

  • Input: [3, 114, 10] CSI amplitude β€” 3 antenna pairs Γ— 114 subcarriers Γ— 10 time frames (100 Hz). (We tested adding phase β€” raw and antenna-differenced β€” and it hurt; amplitude alone is best here.)
  • Encoder: the 10 time frames become 10 tokens (dim 342β†’256) β†’ a 4-layer / 8-head Transformer.
  • Pooling: temporal attention pooling β€” the single biggest lever (replacing global-mean-pool, which discarded the motion dynamics, took us from 3% β†’ 48%+ early on).
  • Decoder: MLP head β†’ a skeleton-graph refinement head (graph-conv over the COCO bone topology) that nudges joints toward anatomically consistent positions.
  • Output: [17, 2] keypoints in [0,1]. ~2.3M params.

What made it SOTA (ablation-backed)

  1. Attention pooling β€” keep temporal dynamics (the breakthrough).
  2. Transformer encoder β€” 81.6 β†’ 83+ over the conv baseline.
  3. Skeleton-graph head + ensemble + test-time augmentation β€” the final points.

Honest negatives we tested and discarded: subject-adversarial (DANN), pose-contrastive embeddings, static-clutter removal, multi-task action+pose, bigger models, raw/sanitized phase.

Real-world deployment: zero-shot vs few-shot calibration

The headline 83.59% is in-domain (random_split). For a new person in a new room (cross-subject), honest zero-shot accuracy is ~64% β€” and we measured exactly how to close that gap. Adding more training subjects saturates fast (8β†’32 subjects: only +6 pts). But a tiny in-room calibration β€” a few labeled frames from the deployment site β€” recovers most of the gap:

In-room calibration Frames Cross-subject PCK@20
none (zero-shot) 0 63.7%
~20 frames/person 160 68.1%
~50 frames/person 400 72.2% (β‰ˆ prior SOTA)
~200 frames/person 1,600 76.1%
~1000 frames/person 8,000 78.3%

Takeaway: ship the model + a ~30-second on-site calibration and reach 72–76% in a brand-new room β€” far cheaper than collecting a giant multi-subject corpus (a few hundred calibration frames beat 20+ extra training subjects). Cross-environment (unseen room and people) is the same story, even more strongly: zero-shot is ~unusable (10.6%), but 5 calibration samples/person β†’ 60%, 200 β†’ 73.1%. An unseen room is one coherent shift a few labeled frames pin down β€” so the biggest zero-shot gap gives the biggest calibration gain. Bottom line: with a tiny in-room calibration there is no unsolved deployment case β€” any new room or person reaches SOTA-level pose from ~5–200 labeled samples, deployable as an ~11 KB per-room LoRA adapter. Full study: ADR-150 Β§3.3–3.6 in the repo.

Efficiency frontier β€” beats SOTA at a fraction of the size

The 2.32M-param flagship hits 82.69% (single) / 83.59% (ensemble). But for edge deployment, params + latency matter too. Swept smaller models on the same MM-Fi random_split:

Model Params Latency (b=1) torso-PCK@20 vs SOTA 72.25
nano 39,971 0.126 ms 71.76% βˆ’0.5 (58Γ— smaller)
micro 75,237 0.224 ms 74.30% βœ… +2.1 (31Γ— smaller)
tiny 210,949 0.299 ms 76.82% βœ… +4.6
small 348,005 0.287 ms 77.87% βœ… +5.6
base 726,437 0.344 ms 79.38% βœ… +7.1
flagship 2,320,869 β€” 83.59% +11.3

A 75K-param model already beats MultiFormer (72.25%) β€” every config from micro up is Pareto-dominant (smaller and more accurate than prior SOTA). Full curve + method: https://github.com/ruvnet/RuView/blob/main/docs/benchmarks/wifi-pose-efficiency-frontier.md

Reproduce / proof

Usage

import torch, numpy as np
from model import TF                     # see model.py in this repo
net = TF(na=3); net.load_state_dict(torch.load("pose_mmfi_best.pt", map_location="cpu")); net.eval()
csi = torch.randn(1, 3, 114, 10)         # MM-Fi CSI amplitude, per-sample standardized
kp = net(csi).reshape(1, 17, 2)          # -> 17 keypoints in [0,1]

Dataset: MM-Fi (CC BY-NC 4.0). This model inherits the non-commercial license.

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support