two_stream_attn_v1_2layer_ld_cong_finetune_20260515T134706Z

A real-time hand gesture classifier trained on a Hybrid Jester+IPN gesture dataset (Jester dynamic classes + IPN pointing classes).

This model is part of the Maestro pipeline, which enables touchless control of presentation and meeting software through hand gestures captured by a standard webcam, with MediaPipe used for landmark extraction.

Model Description

Architecture: EnhancedTwoStreamLSTM (BiLSTM h=96×2, MHA 8 heads, proj=96, mean+max pool, MLP gate)
Parameters: 1,208,554
Input: (batch, 16, 147) — 16-frame sliding window at 30 FPS ≈ 533 ms
Output: Logits over 10 gesture classes (apply softmax to obtain probabilities)
Inference latency: < 1 ms per call (CPU, single sample)
Feature schema: feature-schema-v1

Architecture

EnhancedTwoStreamLSTM splits the 147-dim feature vector into two parallel streams and processes them through a BiLSTM + self-attention + MLP-gate pipeline:

Input (B, T=32, 147)
    │
    ├─ Stream A — Pose/Shape (73 dims)
    │   Linear+LN+GELU → 96
    │   2-layer BiLSTM (h=96) → (B, T, 192)
    │   LayerNorm → Self-MHA (8 heads) + residual + post-LN
    │   mean+max pool → pool_LN → ctx_a (B, 192)
    │
    ├─ Stream B — Motion/Dynamics (74 dims)
    │   (identical structure) → ctx_b (B, 192)
    │
    ├─ MLP cross-stream gate
    │   gate_a = Sigmoid(
    │     Linear(96→192)(
    │       Tanh(Linear(192→96)(ctx_b))))
    │   ctx_a  = LN(ctx_a × gate_a + ctx_a)
    │   gate_b = Sigmoid(
    │     Linear(96→192)(
    │       Tanh(Linear(192→96)(ctx_a))))
    │   ctx_b  = LN(ctx_b × gate_b + ctx_b)
    │
    └─ cat(ctx_a, ctx_b) → (384,)
       LN → Linear(384→192) → GELU → Dropout → Linear(192→10)

Design rationale:

BiLSTMs encode temporal order via their recurrent cell state — no positional encoding needed.
Mean+Max pooling captures both sustained gesture shape (mean) and transient click events (max).
The 2-layer MLP gate provides non-linear cross-modal recalibration at ~37 K params (vs ~263 K for full MHA cross-attention with a degenerate mean-pooled query).
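
The cross-stream gate from the diagram can be sketched in PyTorch as follows. This is a minimal illustration rather than the Maestro source: the class name `CrossStreamGate`, the use of two independent gate modules, and feeding the already-updated ctx_a into gate_b are all assumptions read off the diagram.

```python
import torch
from torch import nn


class CrossStreamGate(nn.Module):
    """Hypothetical 2-layer MLP gate: recalibrates one stream using the other."""

    def __init__(self, dim: int = 192, bottleneck: int = 96) -> None:
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, bottleneck),
            nn.Tanh(),
            nn.Linear(bottleneck, dim),
            nn.Sigmoid(),
        )
        self.norm = nn.LayerNorm(dim)

    def forward(self, ctx: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
        gate = self.mlp(other)              # (B, dim), values in (0, 1)
        return self.norm(ctx * gate + ctx)  # gated residual, then LayerNorm


gate_a, gate_b = CrossStreamGate(), CrossStreamGate()
ctx_a, ctx_b = torch.randn(4, 192), torch.randn(4, 192)
ctx_a = gate_a(ctx_a, ctx_b)  # stream A gated by stream B
ctx_b = gate_b(ctx_b, ctx_a)  # stream B gated by the updated stream A (assumption)
fused = torch.cat([ctx_a, ctx_b], dim=-1)  # (4, 384), input to the classifier head

# Weights and biases of one gate MLP, excluding its LayerNorm.
mlp_params = sum(p.numel() for p in gate_a.mlp.parameters())
```

Under these assumptions each gate MLP holds 192·96 + 96 + 96·192 + 192 = 37,152 parameters, which is consistent with the ~37 K figure if that figure is counted per gate.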

Gesture Classes

Class	Description
`fist`	Closed fist (all fingers curled, thumb tucked)
`swiping_right`	Horizontal swipe from left to right
`swiping_left`	Horizontal swipe from right to left
`swiping_down`	Vertical swipe downward
`swiping_up`	Vertical swipe upward
`zooming_in_full_hand`	Pinch-open / spread fingers away from each other
`zooming_out_full_hand`	Pinch-close / bring fingers together
`point_one`	Single-finger pointing gesture (continuous laser-pointer control)
`point_two`	Two-finger pointing gesture (continuous annotation-pen control)
`unknown`	Background / transition / no gesture

Gesture Usage In Presentation System

Class	Mode	Command	Runtime handling
`fist`	`discrete`	`erase_annotations`	Discrete command via GestureActivationController → CommandDispatcher
`swiping_right`	`discrete`	`next_slide`	Discrete command via GestureActivationController → CommandDispatcher
`swiping_left`	`discrete`	`previous_slide`	Discrete command via GestureActivationController → CommandDispatcher
`swiping_down`	`discrete`	`stop_presentation`	Discrete command via GestureActivationController → CommandDispatcher
`swiping_up`	`discrete`	`start_presentation`	Discrete command via GestureActivationController → CommandDispatcher
`zooming_in_full_hand`	`discrete`	`zoom_in_view`	Discrete command via GestureActivationController → CommandDispatcher
`zooming_out_full_hand`	`discrete`	`zoom_out_view`	Discrete command via GestureActivationController → CommandDispatcher
`point_one`	`continuous`	`—`	Continuous tracker: LaserPointerTracker (bypasses discrete dispatcher)
`point_two`	`continuous`	`—`	Continuous tracker: AnnotationPenTracker (bypasses discrete dispatcher)
`unknown`	`discrete`	`no_action`	No-op background class
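
The routing in the table above could be expressed as a small lookup. The sketch below is illustrative only: `GESTURE_ROUTES` and `route` are hypothetical names, and the real GestureActivationController and CommandDispatcher interfaces are not shown in this card, so they are represented here by plain return values.

```python
# Hypothetical routing table copied from the mapping above.
GESTURE_ROUTES = {
    "fist": ("discrete", "erase_annotations"),
    "swiping_right": ("discrete", "next_slide"),
    "swiping_left": ("discrete", "previous_slide"),
    "swiping_down": ("discrete", "stop_presentation"),
    "swiping_up": ("discrete", "start_presentation"),
    "zooming_in_full_hand": ("discrete", "zoom_in_view"),
    "zooming_out_full_hand": ("discrete", "zoom_out_view"),
    "point_one": ("continuous", "LaserPointerTracker"),
    "point_two": ("continuous", "AnnotationPenTracker"),
    "unknown": ("discrete", "no_action"),
}


def route(label):
    """Return (handler_kind, target) for a predicted class label."""
    mode, target = GESTURE_ROUTES[label]
    if mode == "continuous":
        return ("tracker", target)   # bypasses the discrete dispatcher
    if target == "no_action":
        return ("ignore", None)      # background class: nothing to do
    return ("dispatch", target)      # discrete command


decisions = {label: route(label) for label in GESTURE_ROUTES}
```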

Feature Schema (`feature-schema-v1`)

Block	Dims	Description
`position`	0–62	21 wrist-relative, scale-normalised landmark positions (x, y, z)
`fingertip_spread`	63–67	5 inter-fingertip Euclidean distances
`wrist_trajectory`	68–70	Net wrist displacement from oldest frame in the window
`velocity`	71–133	21 per-landmark wrist-relative velocity vectors (Δposition per unit time)
`joint_angles`	134–143	10 MCP + PIP joint angles in radians
`wrist_vel_raw`	144–146	Camera-normalised wrist velocity (x, y, z) — key directional signal
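
As a quick check on the layout above, the block ranges can be turned into slices over a single 147-dim frame. The helper names `block_slice` and `split_features` are hypothetical; only the index ranges come from the table.

```python
# Inclusive (first, last) index ranges from the feature-schema-v1 table.
FEATURE_BLOCKS = {
    "position": (0, 62),
    "fingertip_spread": (63, 67),
    "wrist_trajectory": (68, 70),
    "velocity": (71, 133),
    "joint_angles": (134, 143),
    "wrist_vel_raw": (144, 146),
}
FEATURE_DIM = 147


def block_slice(name):
    """Convert an inclusive range from the table into a Python slice."""
    first, last = FEATURE_BLOCKS[name]
    return slice(first, last + 1)


def split_features(vector):
    """Split one feature vector into its named blocks."""
    if len(vector) != FEATURE_DIM:
        raise ValueError(f"expected {FEATURE_DIM} features, got {len(vector)}")
    return {name: vector[block_slice(name)] for name in FEATURE_BLOCKS}


frame = [float(i) for i in range(FEATURE_DIM)]  # placeholder values
blocks = split_features(frame)
sizes = {name: len(values) for name, values in blocks.items()}
```

The block sizes come out as 63, 5, 3, 63, 10 and 3, which sum to the 147 input dimensions.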

How to Use

import torch
from huggingface_hub import hf_hub_download
from maestro.infrastructure.model.checkpoint_loader import load_inference_artifact

# Download the artifact (cached after first call)
local_path = hf_hub_download(
    repo_id="ntsrigaud/maestro-lstm-hybrid",
    filename="two_stream_attn_v1_2layer_ld_cong_finetune_20260515T134706Z_inference.pt",
)

# Load the artifact (includes model, class labels, and feature schema)
artifact = load_inference_artifact(
    artifact_path=local_path,
    device=torch.device("cpu"),
)
artifact.model.eval()

# Build 147-dim feature vectors with LandmarkFeatureTransformer and fill a
# 32-frame SlidingWindowSequenceBuffer. `window_np` below is the resulting
# NumPy array of shape (32, 147); it is not defined in this snippet.
# tensor shape: (batch=1, T=32, F=147)
window_tensor = torch.tensor(window_np, dtype=torch.float32).unsqueeze(0)
with torch.no_grad():
    logits = artifact.model(window_tensor)
pred_class = artifact.class_labels[logits.argmax(dim=1).item()]
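
The snippet above relies on a sliding window of per-frame features. The interface of SlidingWindowSequenceBuffer is not shown in this card, so the sketch below uses a hypothetical stand-in, `WindowBuffer`, built on `collections.deque`, to illustrate the idea of keeping only the most recent frames. The window size of 32 follows the comment in the snippet above.

```python
from collections import deque


class WindowBuffer:
    """Hypothetical stand-in for a sliding window over per-frame features."""

    def __init__(self, window_size, feature_dim=147):
        self.window_size = window_size
        self.feature_dim = feature_dim
        self._frames = deque(maxlen=window_size)  # oldest frame drops out first

    def push(self, features):
        if len(features) != self.feature_dim:
            raise ValueError(f"expected {self.feature_dim} features")
        self._frames.append(list(features))

    def is_full(self):
        return len(self._frames) == self.window_size

    def window(self):
        """Return the frames oldest-first; only valid once the buffer is full."""
        if not self.is_full():
            raise RuntimeError("window not full yet")
        return [list(frame) for frame in self._frames]


buffer = WindowBuffer(window_size=32)
for step in range(40):                 # 40 placeholder frames
    buffer.push([float(step)] * 147)
window = buffer.window()               # frames 8..39, shape (32, 147)
```

A nested list like `window` can be converted with `numpy.asarray(window, dtype=numpy.float32)` to obtain an array of the shape the model call expects.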

Training Dataset

Source: hybrid merge of Jester and IPN-Hand windows. Jester provides the unknown/swiping/zoom/stop_sign classes; IPN-Hand provides point_one and point_two.
Used classes: 10 (9 active gestures + unknown background)
Dataset split: 70% train / 15% val / 15% test (stratified by class)
Augmentation: temporal scale ±20%, spatial jitter σ=0.005
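
The two augmentations listed above might look like the following. This is a sketch under stated assumptions, not the pipeline's implementation: the card does not say how temporal scaling is applied, so here the window is resampled by linear interpolation while keeping its length, and jitter is added to every feature. All function names are hypothetical.

```python
import random


def temporal_rescale(frames, scale):
    """Resample a window as if played at `scale`x speed, keeping its length."""
    last = len(frames) - 1
    out = []
    for i in range(len(frames)):
        pos = min(i * scale, last)      # source position, clamped to last frame
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append([(1 - frac) * a + frac * b
                    for a, b in zip(frames[lo], frames[hi])])
    return out


def spatial_jitter(frames, sigma=0.005, rng=random):
    """Add zero-mean Gaussian noise with standard deviation `sigma`."""
    return [[x + rng.gauss(0.0, sigma) for x in frame] for frame in frames]


def augment(frames, rng=random):
    scale = rng.uniform(0.8, 1.2)       # temporal scale of +/-20%
    return spatial_jitter(temporal_rescale(frames, scale), rng=rng)


rng = random.Random(0)
window = [[float(t)] * 147 for t in range(16)]   # placeholder window
augmented = augment(window, rng=rng)
```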

Training Strategy

Two-phase transfer learning pipeline:

Phase 1 (pretraining): backbone loaded from the external pretraining checkpoint two_stream_attn_v1_2layer_pretrain_20260515T125437Z.pt, which captures generic gesture dynamics.
Phase 2 (fine-tuning): head replaced and model adapted on Hybrid Jester+IPN 10-gesture vocabulary.
Stage A (frozen backbone): 10 epochs of head-only warm-up.
Stage B (full model): up to 66 epochs of joint fine-tuning with LR scheduling and early stopping.
Stage B retention defences: replay_max_samples_per_class=500, distillation_weight=0.0, replay_ce_weight=0.0, backbone_lr_multiplier=0.1, ewc_weight=N/A, gpm_components=0, forgetting_penalty_weight=0.5.
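
The Stage A / Stage B split can be illustrated with a toy PyTorch model. The sketch below is hedged throughout: the `backbone` and `head` names, the tiny stand-in layers, and the choice of AdamW are assumptions (the card does not name the optimizer); only the learning rate, weight decay and backbone_lr_multiplier values are taken from this card.

```python
import torch
from torch import nn

# Toy stand-in for the real network; layer names are hypothetical.
model = nn.ModuleDict({
    "backbone": nn.Linear(147, 192),
    "head": nn.Linear(192, 10),
})


def set_backbone_trainable(model, trainable):
    for param in model["backbone"].parameters():
        param.requires_grad = trainable


# Stage A: freeze the backbone and optimise the head only.
set_backbone_trainable(model, False)
stage_a_params = [p for p in model.parameters() if p.requires_grad]
stage_a_opt = torch.optim.AdamW(stage_a_params, lr=3e-5, weight_decay=0.001)

# Stage B: unfreeze everything, with a smaller learning rate on the backbone.
set_backbone_trainable(model, True)
base_lr, backbone_lr_multiplier = 3e-5, 0.1
stage_b_opt = torch.optim.AdamW(
    [
        {"params": model["backbone"].parameters(),
         "lr": base_lr * backbone_lr_multiplier},
        {"params": model["head"].parameters(), "lr": base_lr},
    ],
    weight_decay=0.001,
)
```

Giving the backbone a tenth of the head's learning rate is one common way to limit drift away from pretrained weights during joint fine-tuning.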

Training Configuration

Parameter	Value
Architecture	EnhancedTwoStreamLSTM (BiLSTM h=96×2, MHA 8 heads, proj=96, mean+max pool, MLP gate)
Input size	147
Hidden size	96/stream (BiLSTM output: 192)
Projection dim	96
Num layers	2
MHA heads	8 (head dim: 24)
Dropout	0.4
Learning rate	3e-05
Weight decay	0.001
Batch size	128
Max epochs	80
Early stopping patience	20
Label smoothing	0.05
Class weighting	disabled
Max samples per class	3000
LR scheduler	ReduceLROnPlateau (factor=0.5, patience=10)
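
For orientation, the loss and scheduler settings in the table map onto standard PyTorch components roughly as shown below. The stand-in model, the AdamW optimizer, and monitoring validation loss in `min` mode are assumptions; the numeric values come from the table.

```python
import torch
from torch import nn

model = nn.Linear(147, 10)  # toy stand-in for the real network

criterion = nn.CrossEntropyLoss(label_smoothing=0.05)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5, weight_decay=0.001)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=10
)

# One illustrative training step on random data (batch size 128).
features = torch.randn(128, 147)
labels = torch.randint(0, 10, (128,))
optimizer.zero_grad()
loss = criterion(model(features), labels)
loss.backward()
optimizer.step()

# Simulate a validation loss that never improves, one value per epoch.
for _ in range(15):
    scheduler.step(1.0)
lr_after_plateau = optimizer.param_groups[0]["lr"]
```

With these settings the scheduler halves the learning rate once the monitored value has failed to improve for more than 10 consecutive epochs.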

Evaluation Results (Test Set)

Metric	Value
Accuracy	96.1%
Macro F1	95.9%

Per-Class Recall

Class	Recall
`fist`	97.3%
`swiping_right`	97.1%
`swiping_left`	98.3%
`swiping_down`	98.0%
`swiping_up`	98.2%
`zooming_in_full_hand`	97.0%
`zooming_out_full_hand`	95.1%
`point_one`	97.4%
`point_two`	95.1%
`unknown`	85.7%
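
For readers who want to reproduce these metrics from their own predictions, the sketch below computes accuracy, per-class recall and macro F1 from a confusion matrix (rows are true classes, columns are predictions). The 3-class `toy` matrix is made-up illustration data, not a result from this model; the `reported_recalls` list is copied from the table above.

```python
def accuracy(confusion):
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total


def per_class_recall(confusion):
    return [row[i] / sum(row) if sum(row) else 0.0
            for i, row in enumerate(confusion)]


def macro_f1(confusion):
    n = len(confusion)
    scores = []
    for i in range(n):
        tp = confusion[i][i]
        actual = sum(confusion[i])                       # true members of class i
        predicted = sum(confusion[r][i] for r in range(n))  # predicted as class i
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / n                               # unweighted class mean


toy = [[8, 1, 1],   # made-up counts for illustration only
       [0, 9, 1],
       [2, 0, 8]]
toy_recall = per_class_recall(toy)
toy_f1 = macro_f1(toy)

# Unweighted mean of the recalls reported in the table (macro recall, not F1).
reported_recalls = [97.3, 97.1, 98.3, 98.0, 98.2, 97.0, 95.1, 97.4, 95.1, 85.7]
macro_recall = sum(reported_recalls) / len(reported_recalls)
```

The unweighted mean of the ten reported recalls is about 95.9%, with the `unknown` class pulling the average down the most.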

Comparison with Previous Architecture

Feature	TwoStreamGestureLSTM	EnhancedTwoStreamLSTM
LSTM direction	Unidirectional	Bidirectional
Attention	Bahdanau (scalar)	MHA Q/K/V (8 heads)
Feature projection	No	Yes (→96)
Temporal pooling	Mean only	Mean + Max
Cross-stream fusion	Concat only	2-layer MLP gate
Parameters	~182 K	1,208,554

Limitations and Risks

Trained on Jester and IPN-Hand subjects only. Performance may degrade for hand sizes, skin tones, or lighting conditions that are under-represented in the training data.
The unknown class represents background/transition frames. At runtime, predictions are filtered through per-class confidence thresholds defined in production_hybrid.yaml.
Requires mediapipe>=0.10.14 for landmark extraction at inference time.
Not intended for safety-critical or accessibility-critical applications.
Performance was measured on a held-out test split from the same dataset; real-world generalisation may differ.
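
The per-class confidence filtering mentioned above could work along these lines. The contents of production_hybrid.yaml are not shown in this card, so the threshold values, the default threshold, and the fallback to `unknown` are all hypothetical; the function names are illustrative as well.

```python
import math


def softmax(logits):
    peak = max(logits)                        # subtract max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def filter_prediction(logits, class_labels, thresholds,
                      default_threshold=0.5, fallback="unknown"):
    """Return (label, confidence); fall back when confidence is too low."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    label = class_labels[best]
    if probs[best] < thresholds.get(label, default_threshold):
        return fallback, probs[best]
    return label, probs[best]


labels = ["fist", "swiping_right", "unknown"]
thresholds = {"fist": 0.9, "swiping_right": 0.6}   # hypothetical values

weak = filter_prediction([2.0, 0.0, 0.0], labels, thresholds)
strong = filter_prediction([0.0, 3.0, 0.0], labels, thresholds)
```

In this toy run the `fist` prediction is rejected because its confidence falls below its threshold, while the `swiping_right` prediction passes.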

Environmental Impact

Training was performed on CPU/MPS. Estimated training time: ~10 minutes. Estimated CO₂ equivalent: negligible (<0.001 kg CO₂eq).

Generated by the Maestro training pipeline on 2026-05-15.

