two_stream_attn_v1_2layer_ld_cong_finetune_20260515T134706Z

A real-time hand gesture classifier trained on a Hybrid Jester+IPN gesture dataset (Jester dynamic classes + IPN pointing classes).

This model is part of the Maestro pipeline, which enables touchless control of presentation and meeting software through hand gestures captured by a standard webcam, with MediaPipe handling landmark extraction.

Model Description

  • Architecture: EnhancedTwoStreamLSTM (BiLSTM h=96×2, MHA 8 heads, proj=96, mean+max pool, MLP gate)
  • Parameters: 1,208,554
  • Input: (batch, 16, 147), a 16-frame sliding window at 30 FPS ≈ 533 ms
  • Output: logits over 10 gesture classes (apply softmax to obtain class probabilities)
  • Inference latency: < 1 ms per call (CPU, single sample)
  • Feature schema: feature-schema-v1

Architecture

EnhancedTwoStreamLSTM splits the 147-dim feature vector into two parallel streams and processes them through a BiLSTM + self-attention + MLP-gate pipeline:

Input (B, T=32, 147)
    │
    ├─ Stream A: Pose/Shape (73 dims)
    │   Linear+LN+GELU → 96
    │   2-layer BiLSTM (h=96) → (B, T, 192)
    │   LayerNorm → Self-MHA (8 heads) + residual + post-LN
    │   mean+max pool → pool_LN → ctx_a (B, 192)
    │
    ├─ Stream B: Motion/Dynamics (74 dims)
    │   (identical structure) → ctx_b (B, 192)
    │
    ├─ MLP cross-stream gate
    │   gate_a = Sigmoid(
    │     Linear(96→192)(
    │       Tanh(Linear(192→96)(ctx_b))))
    │   ctx_a  = LN(ctx_a × gate_a + ctx_a)
    │   gate_b = Sigmoid(
    │     Linear(96→192)(
    │       Tanh(Linear(192→96)(ctx_a))))
    │   ctx_b  = LN(ctx_b × gate_b + ctx_b)
    │
    └─ cat(ctx_a, ctx_b) → (384,)
       LN → Linear(384→192) → GELU → Dropout → Linear(192→10)
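
The diagram above can be read as the following PyTorch sketch. It is not the Maestro source code: the class names StreamEncoder and TwoStreamSketch, the make_gate helper, the choice to sum the mean and max pools (so the width stays at 192), and the use of the already-updated ctx_a when gating stream B are all assumptions made to match the diagram. The two streams are passed in pre-split because the diagram does not say which feature indices go to which stream. Under these assumptions the sketch has 1,208,554 parameters, the same as the reported count, although the real layer layout may still differ.

```python
import torch
from torch import nn


class StreamEncoder(nn.Module):
    """One stream: projection -> BiLSTM -> self-attention -> pooled context."""

    def __init__(self, in_dim: int, proj: int = 96, hidden: int = 96, heads: int = 8):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(in_dim, proj), nn.LayerNorm(proj), nn.GELU())
        self.lstm = nn.LSTM(proj, hidden, num_layers=2, batch_first=True, bidirectional=True)
        width = 2 * hidden  # forward + backward hidden states = 192
        self.pre_ln = nn.LayerNorm(width)
        self.mha = nn.MultiheadAttention(width, heads, batch_first=True)
        self.post_ln = nn.LayerNorm(width)
        self.pool_ln = nn.LayerNorm(width)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, _ = self.lstm(self.proj(x))                    # (B, T, 192)
        q = self.pre_ln(h)
        attn, _ = self.mha(q, q, q, need_weights=False)   # self-attention over time
        h = self.post_ln(h + attn)                        # residual + post-LN
        # Assumption: mean and max pools are summed to keep the width at 192.
        pooled = h.mean(dim=1) + h.max(dim=1).values
        return self.pool_ln(pooled)                       # (B, 192)


def make_gate(width: int = 192, bottleneck: int = 96) -> nn.Sequential:
    """2-layer MLP gate: Linear(192->96) -> Tanh -> Linear(96->192) -> Sigmoid."""
    return nn.Sequential(nn.Linear(width, bottleneck), nn.Tanh(),
                         nn.Linear(bottleneck, width), nn.Sigmoid())


class TwoStreamSketch(nn.Module):
    def __init__(self, dim_a: int = 73, dim_b: int = 74, n_classes: int = 10,
                 dropout: float = 0.4):
        super().__init__()
        self.enc_a = StreamEncoder(dim_a)
        self.enc_b = StreamEncoder(dim_b)
        self.gate_a = make_gate()
        self.gate_b = make_gate()
        self.ln_a = nn.LayerNorm(192)
        self.ln_b = nn.LayerNorm(192)
        self.head = nn.Sequential(nn.LayerNorm(384), nn.Linear(384, 192), nn.GELU(),
                                  nn.Dropout(dropout), nn.Linear(192, n_classes))

    def forward(self, x_a: torch.Tensor, x_b: torch.Tensor) -> torch.Tensor:
        ctx_a, ctx_b = self.enc_a(x_a), self.enc_b(x_b)
        ctx_a = self.ln_a(ctx_a * self.gate_a(ctx_b) + ctx_a)   # stream B gates A
        ctx_b = self.ln_b(ctx_b * self.gate_b(ctx_a) + ctx_b)   # updated A gates B
        return self.head(torch.cat([ctx_a, ctx_b], dim=-1))      # (B, 10) logits


model = TwoStreamSketch().eval()
n_params = sum(p.numel() for p in model.parameters())
with torch.no_grad():
    logits = model(torch.randn(2, 32, 73), torch.randn(2, 32, 74))
print(n_params, tuple(logits.shape))
```

The recurrent layers accept any window length T, so the same sketch runs unchanged on shorter or longer windows.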

Design rationale:

  • BiLSTMs encode temporal order via their recurrent cell state, so no positional encoding is needed.
  • Mean+Max pooling captures both sustained gesture shape (mean) and transient click events (max).
  • The 2-layer MLP gate provides non-linear cross-modal recalibration at ~37 K params (vs ~263 K for full MHA cross-attention with a degenerate mean-pooled query).
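
The ~37 K figure for the gate can be checked by hand from the two layer shapes given in the architecture diagram. The short calculation below assumes both linear layers carry a bias term and counts a single gate (one direction); linear_params is a helper introduced here for illustration.

```python
def linear_params(n_in: int, n_out: int, bias: bool = True) -> int:
    """Number of weights (plus optional biases) in a dense layer."""
    return n_in * n_out + (n_out if bias else 0)


# One gate: Linear(192 -> 96) followed by Linear(96 -> 192).
gate_params = linear_params(192, 96) + linear_params(96, 192)
print(gate_params)  # 37152
```

Since the diagram applies one gate in each direction, the two gates together would account for roughly 74 K parameters under the same assumption.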

Gesture Classes

| Class | Description |
| --- | --- |
| fist | Closed fist (all fingers curled, thumb tucked) |
| swiping_right | Horizontal swipe from left to right |
| swiping_left | Horizontal swipe from right to left |
| swiping_down | Vertical swipe downward |
| swiping_up | Vertical swipe upward |
| zooming_in_full_hand | Pinch-open / spread fingers away from each other |
| zooming_out_full_hand | Pinch-close / bring fingers together |
| point_one | Single-finger pointing gesture (continuous laser-pointer control) |
| point_two | Two-finger pointing gesture (continuous annotation-pen control) |
| unknown | Background / transition / no gesture |

Gesture Usage In Presentation System

| Class | Mode | Command | Runtime handling |
| --- | --- | --- | --- |
| fist | discrete | erase_annotations | Discrete command via GestureActivationController → CommandDispatcher |
| swiping_right | discrete | next_slide | Discrete command via GestureActivationController → CommandDispatcher |
| swiping_left | discrete | previous_slide | Discrete command via GestureActivationController → CommandDispatcher |
| swiping_down | discrete | stop_presentation | Discrete command via GestureActivationController → CommandDispatcher |
| swiping_up | discrete | start_presentation | Discrete command via GestureActivationController → CommandDispatcher |
| zooming_in_full_hand | discrete | zoom_in_view | Discrete command via GestureActivationController → CommandDispatcher |
| zooming_out_full_hand | discrete | zoom_out_view | Discrete command via GestureActivationController → CommandDispatcher |
| point_one | continuous | (none) | Continuous tracker: LaserPointerTracker (bypasses discrete dispatcher) |
| point_two | continuous | (none) | Continuous tracker: AnnotationPenTracker (bypasses discrete dispatcher) |
| unknown | discrete | no_action | No-op background class |
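
The routing in the table can be sketched as a simple lookup. The command and tracker names come from the table; the route function and its return format are hypothetical and do not reflect the actual GestureActivationController or CommandDispatcher interfaces, which are not shown in this card.

```python
# Class -> command mapping for discrete gestures (from the table above).
DISCRETE_COMMANDS = {
    "fist": "erase_annotations",
    "swiping_right": "next_slide",
    "swiping_left": "previous_slide",
    "swiping_down": "stop_presentation",
    "swiping_up": "start_presentation",
    "zooming_in_full_hand": "zoom_in_view",
    "zooming_out_full_hand": "zoom_out_view",
    "unknown": "no_action",
}

# Continuous gestures bypass the discrete dispatcher and feed a tracker instead.
CONTINUOUS_TRACKERS = {
    "point_one": "LaserPointerTracker",
    "point_two": "AnnotationPenTracker",
}


def route(label: str) -> tuple[str, str]:
    """Hypothetical helper: return (mode, target) for a predicted class label."""
    if label in CONTINUOUS_TRACKERS:
        return ("continuous", CONTINUOUS_TRACKERS[label])
    return ("discrete", DISCRETE_COMMANDS[label])


print(route("swiping_right"))  # ('discrete', 'next_slide')
print(route("point_one"))      # ('continuous', 'LaserPointerTracker')
```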

Feature Schema (feature-schema-v1)

| Block | Indices | Description |
| --- | --- | --- |
| position | 0–62 | 21 wrist-relative, scale-normalised landmark positions (x, y, z) |
| fingertip_spread | 63–67 | 5 inter-fingertip Euclidean distances |
| wrist_trajectory | 68–70 | Net wrist displacement from oldest frame in the window |
| velocity | 71–133 | 21 per-landmark wrist-relative velocity vectors (Δposition per unit time) |
| joint_angles | 134–143 | 10 MCP + PIP joint angles in radians |
| wrist_vel_raw | 144–146 | Camera-normalised wrist velocity (x, y, z); key directional signal |
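
The index ranges above can be written down as slices and checked to tile the 147-dim vector exactly. The FEATURE_BLOCKS dictionary and block_slice helper are illustrative names, not part of the Maestro code. The last lines also note that position plus joint_angles totals 73 dims and the remaining blocks total 74, which matches the stream sizes in the architecture diagram; whether the model actually splits its streams this way is a guess.

```python
# Inclusive (first, last) index of each block in feature-schema-v1.
FEATURE_BLOCKS = {
    "position": (0, 62),
    "fingertip_spread": (63, 67),
    "wrist_trajectory": (68, 70),
    "velocity": (71, 133),
    "joint_angles": (134, 143),
    "wrist_vel_raw": (144, 146),
}


def block_slice(name: str) -> slice:
    """Python slice selecting one block from a 147-dim feature vector."""
    first, last = FEATURE_BLOCKS[name]
    return slice(first, last + 1)


sizes = {name: last - first + 1 for name, (first, last) in FEATURE_BLOCKS.items()}
total = sum(sizes.values())
print(sizes)
print(total)  # 147

# Guess only: one grouping whose sizes match Stream A (73) and Stream B (74).
pose_dims = sizes["position"] + sizes["joint_angles"]
motion_dims = total - pose_dims
print(pose_dims, motion_dims)  # 73 74
```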

How to Use

import torch
from huggingface_hub import hf_hub_download
from maestro.infrastructure.model.checkpoint_loader import load_inference_artifact

# Download the artifact (cached after first call)
local_path = hf_hub_download(
    repo_id="ntsrigaud/maestro-lstm-hybrid",
    filename="two_stream_attn_v1_2layer_ld_cong_finetune_20260515T134706Z_inference.pt",
)

# Load the artifact (includes model, class labels, and feature schema)
artifact = load_inference_artifact(
    artifact_path=local_path,
    device=torch.device("cpu"),
)
artifact.model.eval()

# Build 147-dim feature vectors with LandmarkFeatureTransformer and fill a
# 32-frame SlidingWindowSequenceBuffer to obtain `window_np`, a float array
# of shape (32, 147). That step is not shown here.
with torch.no_grad():
    # Add the batch dimension: (T=32, F=147) -> (batch=1, T=32, F=147)
    window_tensor = torch.tensor(window_np, dtype=torch.float32).unsqueeze(0)
    logits = artifact.model(window_tensor)
    # Map the highest-scoring logit back to its class label
    pred_class = artifact.class_labels[logits.argmax(dim=1).item()]
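
The snippet above relies on a sliding window of per-frame feature vectors. The following is a minimal stand-in for that idea, not the real SlidingWindowSequenceBuffer, whose interface is not documented here. The class name WindowBufferSketch and its methods are hypothetical, and the window size of 32 is taken from the usage snippet.

```python
from collections import deque


class WindowBufferSketch:
    """Keeps the most recent `window_size` feature vectors (oldest first)."""

    def __init__(self, window_size: int = 32, feature_dim: int = 147):
        self.window_size = window_size
        self.feature_dim = feature_dim
        self._frames = deque(maxlen=window_size)  # old frames drop off automatically

    def push(self, features) -> None:
        if len(features) != self.feature_dim:
            raise ValueError(f"expected {self.feature_dim} features, got {len(features)}")
        self._frames.append(list(features))

    @property
    def ready(self) -> bool:
        """True once a full window is available for inference."""
        return len(self._frames) == self.window_size

    def window(self) -> list:
        if not self.ready:
            raise RuntimeError("window is not full yet")
        return [list(frame) for frame in self._frames]  # shape (T, F)


buf = WindowBufferSketch()
for i in range(40):              # simulate 40 incoming frames
    buf.push([float(i)] * 147)
window = buf.window()
print(len(window), len(window[0]), window[0][0], window[-1][0])  # 32 147 8.0 39.0
```

A nested list of this shape can be passed to torch.tensor in place of window_np.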

Training Dataset

  • Source: hybrid merge of Jester and IPN-Hand windows. Jester provides the unknown, swiping, zoom, and stop_sign classes; IPN-Hand provides point_one and point_two.
  • Used classes: 10 (9 active gestures + unknown background)
  • Dataset split: 70% train / 15% val / 15% test (stratified by class)
  • Augmentation: temporal scale ±20%, spatial jitter σ=0.005
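
A stratified 70/15/15 split like the one listed above can be sketched as follows. This is a generic illustration, not the pipeline's own splitting code: the function name stratified_split, the rounding rule, and the seed are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=0):
    """Split sample indices into train/val/test, class by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]   # remainder goes to the test set
    return train, val, test


labels = ["fist"] * 100 + ["unknown"] * 200
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))  # 210 45 45
```

Splitting per class keeps each class's share the same in all three subsets, which matters when classes are capped at different sizes.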

Training Strategy

Two-phase transfer learning pipeline:

  • Phase 1 (pretraining): backbone initialised from the external checkpoint two_stream_attn_v1_2layer_pretrain_20260515T125437Z.pt, which was pretrained to learn generic gesture dynamics.
  • Phase 2 (fine-tuning): head replaced and model adapted to the hybrid Jester+IPN 10-gesture vocabulary.
  • Stage A (frozen backbone): 10 epochs of head-only warmup.
  • Stage B (full model): up to 66 epochs of joint fine-tuning with LR scheduling and early stopping.
  • Stage B retention defences: replay_max_samples_per_class=500, distillation_weight=0.0, replay_ce_weight=0.0, backbone_lr_multiplier=0.1, ewc_weight=N/A, gpm_components=0, forgetting_penalty_weight=0.5.
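
The two fine-tuning stages can be sketched in PyTorch as below. The tiny stand-in model, the choice of AdamW, and the helper set_backbone_trainable are assumptions for illustration; the card does not name the optimizer or show the training loop. The learning rate, weight decay, backbone_lr_multiplier, and scheduler settings are the values listed in this card.

```python
import torch
from torch import nn

# Hypothetical stand-in for the real model: a "backbone" and a "head".
model = nn.ModuleDict({"backbone": nn.Linear(147, 384), "head": nn.Linear(384, 10)})

BASE_LR = 3e-5
BACKBONE_LR_MULTIPLIER = 0.1
WEIGHT_DECAY = 1e-3


def set_backbone_trainable(net: nn.ModuleDict, trainable: bool) -> None:
    for param in net["backbone"].parameters():
        param.requires_grad = trainable


# Stage A: freeze the backbone and train only the new head.
set_backbone_trainable(model, False)
stage_a_opt = torch.optim.AdamW(model["head"].parameters(),
                                lr=BASE_LR, weight_decay=WEIGHT_DECAY)

# Stage B: unfreeze everything, but give the backbone a 10x smaller step size.
set_backbone_trainable(model, True)
stage_b_opt = torch.optim.AdamW(
    [
        {"params": model["backbone"].parameters(), "lr": BASE_LR * BACKBONE_LR_MULTIPLIER},
        {"params": model["head"].parameters(), "lr": BASE_LR},
    ],
    weight_decay=WEIGHT_DECAY,
)
# Halve the learning rates when the monitored value stops improving.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    stage_b_opt, mode="min", factor=0.5, patience=10)

stage_b_lrs = [group["lr"] for group in stage_b_opt.param_groups]
print(stage_b_lrs)
```

In a real loop, scheduler.step(val_loss) would be called once per epoch; using validation loss as the monitored value is also an assumption.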

Training Configuration

| Parameter | Value |
| --- | --- |
| Architecture | EnhancedTwoStreamLSTM (BiLSTM h=96×2, MHA 8 heads, proj=96, mean+max pool, MLP gate) |
| Input size | 147 |
| Hidden size | 96/stream (BiLSTM output: 192) |
| Projection dim | 96 |
| Num layers | 2 |
| MHA heads | 8 (head dim: 24) |
| Dropout | 0.4 |
| Learning rate | 3e-05 |
| Weight decay | 0.001 |
| Batch size | 128 |
| Max epochs | 80 |
| Early stopping patience | 20 |
| Label smoothing | 0.05 |
| Class weighting | disabled |
| Max samples per class | 3000 |
| LR scheduler | ReduceLROnPlateau (factor=0.5, patience=10) |
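
To make the label smoothing value concrete, the sketch below builds the smoothed target distribution for 10 classes with a smoothing factor of 0.05. It uses the common formulation in which the factor is spread uniformly over all classes (the one used by PyTorch's CrossEntropyLoss); that the Maestro pipeline uses exactly this formulation is an assumption, and the function names are illustrative.

```python
import math


def smoothed_targets(true_idx: int, n_classes: int = 10, eps: float = 0.05) -> list:
    """Target distribution: (1 - eps) on the true class plus eps / K on every class."""
    base = eps / n_classes
    return [base + (1.0 - eps if i == true_idx else 0.0) for i in range(n_classes)]


def cross_entropy(logits: list, targets: list) -> float:
    """Cross-entropy between soft targets and softmax(logits), computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return -sum(t * (v - log_z) for v, t in zip(logits, targets))


targets = smoothed_targets(true_idx=3)
uniform_loss = cross_entropy([0.0] * 10, targets)
print([round(t, 3) for t in targets])  # 0.955 for the true class, 0.005 elsewhere
print(round(uniform_loss, 4))          # 2.3026, i.e. ln(10) for uniform logits
```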

Evaluation Results (Test Set)

| Metric | Value |
| --- | --- |
| Accuracy | 96.1% |
| Macro F1 | 95.9% |

Per-Class Recall

| Class | Recall |
| --- | --- |
| fist | 97.3% |
| swiping_right | 97.1% |
| swiping_left | 98.3% |
| swiping_down | 98.0% |
| swiping_up | 98.2% |
| zooming_in_full_hand | 97.0% |
| zooming_out_full_hand | 95.1% |
| point_one | 97.4% |
| point_two | 95.1% |
| unknown | 85.7% |
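
As a quick sanity check on the table, the unweighted mean of the per-class recalls can be computed directly. Macro recall is a different metric from macro F1, so the two figures are only expected to be close, not equal.

```python
# Per-class recall in percent, copied from the table above.
recall_pct = {
    "fist": 97.3,
    "swiping_right": 97.1,
    "swiping_left": 98.3,
    "swiping_down": 98.0,
    "swiping_up": 98.2,
    "zooming_in_full_hand": 97.0,
    "zooming_out_full_hand": 95.1,
    "point_one": 97.4,
    "point_two": 95.1,
    "unknown": 85.7,
}

macro_recall = sum(recall_pct.values()) / len(recall_pct)
weakest = min(recall_pct, key=recall_pct.get)
print(f"macro recall: {macro_recall:.2f}%")  # macro recall: 95.92%
print(weakest)                                # unknown
```

The unknown class pulls the average down the most, which is consistent with it covering heterogeneous background and transition frames.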

Comparison with Previous Architecture

| Feature | TwoStreamGestureLSTM | EnhancedTwoStreamLSTM |
| --- | --- | --- |
| LSTM direction | Unidirectional | Bidirectional |
| Attention | Bahdanau (scalar) | MHA Q/K/V (8 heads) |
| Feature projection | No | Yes (→96) |
| Temporal pooling | Mean only | Mean + Max |
| Cross-stream fusion | Concat only | 2-layer MLP gate |
| Parameters | ~182 K | 1,208,554 |

Limitations and Risks

  • Trained only on subjects from the source datasets (Jester and IPN Hand). Performance may degrade with hand sizes, skin tones, or lighting conditions that are not represented in the training data.
  • The unknown class represents background/transition frames. At runtime, predictions are filtered through per-class confidence thresholds defined in production_hybrid.yaml.
  • Requires mediapipe>=0.10.14 for landmark extraction at inference time.
  • Not intended for safety-critical or accessibility-critical applications.
  • Performance was measured on a held-out test split from the same dataset; real-world generalisation may differ.
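
The per-class confidence filtering mentioned above could look roughly like the sketch below. The threshold values are made up for illustration (the real ones live in production_hybrid.yaml and are not reproduced in this card), and falling back to unknown when a prediction is below its threshold is an assumption about the runtime behaviour.

```python
import math

# Hypothetical thresholds; the real values are defined in production_hybrid.yaml.
THRESHOLDS = {"swiping_left": 0.80, "point_one": 0.60}
DEFAULT_THRESHOLD = 0.70


def softmax(logits: list) -> list:
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def filter_prediction(logits: list, class_labels: list) -> str:
    """Return the top class if it clears its threshold, otherwise 'unknown'."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    label = class_labels[best]
    if probs[best] < THRESHOLDS.get(label, DEFAULT_THRESHOLD):
        return "unknown"
    return label


labels = ["fist", "swiping_left", "unknown"]
print(filter_prediction([0.0, 5.0, 0.0], labels))  # swiping_left (p is about 0.99)
print(filter_prediction([0.0, 0.5, 0.0], labels))  # unknown (p is about 0.45)
```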

Environmental Impact

Training was performed on CPU/MPS. Estimated training time: ~10 minutes. Estimated CO₂ equivalent: negligible (<0.001 kg CO₂eq).


Generated by the Maestro training pipeline on 2026-05-15.
