two_stream_attn_v1_2layer_ld_cong_finetune_20260515T134706Z
A real-time hand gesture classifier trained on a Hybrid Jester+IPN gesture dataset (Jester dynamic classes + IPN pointing classes).
This model is part of the Maestro pipeline that enables touchless control of presentation and meeting software through hand gestures captured from a standard webcam using MediaPipe for landmark extraction.
Model Description
- Architecture: EnhancedTwoStreamLSTM (BiLSTM h=96×2, MHA 8 heads, proj=96, mean+max pool, MLP gate)
- Parameters: 1,208,554
- Input: `(batch, 16, 147)`, a 16-frame sliding window at 30 FPS (≈ 533 ms)
- Output: logits over 10 gesture classes (apply softmax to obtain class probabilities)
- Inference latency: < 1 ms per call (CPU, single sample)
- Feature schema: `feature-schema-v1`
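The window duration quoted above follows from the frame count and the frame rate. A minimal arithmetic check; `window_duration_ms` is an illustrative helper and not part of the Maestro codebase:

```python
def window_duration_ms(num_frames: int, fps: float) -> float:
    """Return the time span covered by num_frames frames at the given frame rate."""
    return num_frames / fps * 1000.0


duration = window_duration_ms(16, 30.0)
print(f"{duration:.0f} ms")  # prints "533 ms"
```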
Architecture
EnhancedTwoStreamLSTM splits the 147-dim feature vector into two parallel streams and
processes them through a BiLSTM + self-attention + MLP-gate pipeline:
Input (B, T=32, 147)
│
├─ Stream A: Pose/Shape (73 dims)
│    Linear+LN+GELU → 96
│    2-layer BiLSTM (h=96) → (B, T, 192)
│    LayerNorm → Self-MHA (8 heads) + residual + post-LN
│    mean+max pool → pool_LN → ctx_a (B, 192)
│
├─ Stream B: Motion/Dynamics (74 dims)
│    (identical structure) → ctx_b (B, 192)
│
├─ MLP cross-stream gate
│    gate_a = Sigmoid(Linear(96→192)(Tanh(Linear(192→96)(ctx_b))))
│    ctx_a  = LN(ctx_a × gate_a + ctx_a)
│    gate_b = Sigmoid(Linear(96→192)(Tanh(Linear(192→96)(ctx_a))))
│    ctx_b  = LN(ctx_b × gate_b + ctx_b)
│
└─ cat(ctx_a, ctx_b) → (384,)
     LN → Linear(384→192) → GELU → Dropout → Linear(192→10)
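The cross-stream gate in the diagram can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions, not the Maestro implementation: the class name `CrossStreamGate` is hypothetical, each stream is assumed to own a separate gate and LayerNorm, and the second gate is assumed to read the already-updated `ctx_a`.

```python
import torch
from torch import nn


class CrossStreamGate(nn.Module):
    """Gate one stream's context vector with a signal computed from the other stream."""

    def __init__(self, dim: int = 192, bottleneck: int = 96) -> None:
        super().__init__()
        # Linear(192→96) → Tanh → Linear(96→192) → Sigmoid, as in the diagram
        self.gate = nn.Sequential(
            nn.Linear(dim, bottleneck),
            nn.Tanh(),
            nn.Linear(bottleneck, dim),
            nn.Sigmoid(),
        )
        self.norm = nn.LayerNorm(dim)

    def forward(self, ctx: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
        g = self.gate(other)             # values in (0, 1), shape (B, dim)
        return self.norm(ctx * g + ctx)  # gated residual, then LayerNorm


gate_a, gate_b = CrossStreamGate(), CrossStreamGate()
ctx_a, ctx_b = torch.randn(4, 192), torch.randn(4, 192)

new_a = gate_a(ctx_a, ctx_b)
new_b = gate_b(ctx_b, new_a)  # assumption: uses the updated ctx_a
fused = torch.cat([new_a, new_b], dim=-1)  # shape (4, 384), input to the classifier head
```

Because the gate output lies in (0, 1), `ctx * g + ctx` scales each feature by a factor between 1 and 2 before normalisation, so the gate can emphasise features but never zero them out.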
Design rationale:
- BiLSTMs encode temporal order via their recurrent cell state, so no positional encoding is needed.
- Mean+Max pooling captures both sustained gesture shape (mean) and transient click events (max).
- The 2-layer MLP gate provides non-linear cross-modal recalibration at ~37 K params (vs ~263 K for full MHA cross-attention with a degenerate mean-pooled query).
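The ~37 K figure is consistent with the two linear layers of a single gate, assuming both layers carry bias terms (the source does not say whether they do). A quick arithmetic check with a hypothetical helper:

```python
def linear_params(in_features: int, out_features: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features


# Linear(192→96) followed by Linear(96→192), as in the gate above
gate_params = linear_params(192, 96) + linear_params(96, 192)
print(gate_params)  # prints 37152
```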
Gesture Classes
| Class | Description |
|---|---|
| `fist` | Closed fist (all fingers curled, thumb tucked) |
| `swiping_right` | Horizontal swipe from left to right |
| `swiping_left` | Horizontal swipe from right to left |
| `swiping_down` | Vertical swipe downward |
| `swiping_up` | Vertical swipe upward |
| `zooming_in_full_hand` | Pinch-open / spread fingers away from each other |
| `zooming_out_full_hand` | Pinch-close / bring fingers together |
| `point_one` | Single-finger pointing gesture (continuous laser-pointer control) |
| `point_two` | Two-finger pointing gesture (continuous annotation-pen control) |
| `unknown` | Background / transition / no gesture |
Gesture Usage In Presentation System
| Class | Mode | Command | Runtime handling |
|---|---|---|---|
| `fist` | discrete | `erase_annotations` | Discrete command via GestureActivationController → CommandDispatcher |
| `swiping_right` | discrete | `next_slide` | Discrete command via GestureActivationController → CommandDispatcher |
| `swiping_left` | discrete | `previous_slide` | Discrete command via GestureActivationController → CommandDispatcher |
| `swiping_down` | discrete | `stop_presentation` | Discrete command via GestureActivationController → CommandDispatcher |
| `swiping_up` | discrete | `start_presentation` | Discrete command via GestureActivationController → CommandDispatcher |
| `zooming_in_full_hand` | discrete | `zoom_in_view` | Discrete command via GestureActivationController → CommandDispatcher |
| `zooming_out_full_hand` | discrete | `zoom_out_view` | Discrete command via GestureActivationController → CommandDispatcher |
| `point_one` | continuous | n/a | Continuous tracker: LaserPointerTracker (bypasses discrete dispatcher) |
| `point_two` | continuous | n/a | Continuous tracker: AnnotationPenTracker (bypasses discrete dispatcher) |
| `unknown` | discrete | `no_action` | No-op background class |
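The routing in the table can be expressed as a small lookup. The sketch below is illustrative only: `GESTURE_ROUTES`, `CONTINUOUS_TRACKERS`, and `route` are hypothetical names, and the real GestureActivationController and CommandDispatcher interfaces are not shown in this card.

```python
from typing import Dict, Optional, Tuple

# class label -> (mode, command), copied from the table above
GESTURE_ROUTES: Dict[str, Tuple[str, Optional[str]]] = {
    "fist": ("discrete", "erase_annotations"),
    "swiping_right": ("discrete", "next_slide"),
    "swiping_left": ("discrete", "previous_slide"),
    "swiping_down": ("discrete", "stop_presentation"),
    "swiping_up": ("discrete", "start_presentation"),
    "zooming_in_full_hand": ("discrete", "zoom_in_view"),
    "zooming_out_full_hand": ("discrete", "zoom_out_view"),
    "point_one": ("continuous", None),
    "point_two": ("continuous", None),
    "unknown": ("discrete", "no_action"),
}

# Continuous classes bypass the discrete dispatcher and feed a tracker instead.
CONTINUOUS_TRACKERS: Dict[str, str] = {
    "point_one": "LaserPointerTracker",
    "point_two": "AnnotationPenTracker",
}


def route(label: str) -> Tuple[str, str]:
    """Return (handler kind, target) for a predicted class label."""
    mode, command = GESTURE_ROUTES[label]
    if mode == "continuous":
        return ("tracker", CONTINUOUS_TRACKERS[label])
    return ("dispatcher", command)


print(route("swiping_right"))  # prints ('dispatcher', 'next_slide')
print(route("point_one"))      # prints ('tracker', 'LaserPointerTracker')
```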
Feature Schema (feature-schema-v1)
| Block | Indices | Description |
|---|---|---|
| `position` | 0–62 | 21 wrist-relative, scale-normalised landmark positions (x, y, z) |
| `fingertip_spread` | 63–67 | 5 inter-fingertip Euclidean distances |
| `wrist_trajectory` | 68–70 | Net wrist displacement from oldest frame in the window |
| `velocity` | 71–133 | 21 per-landmark wrist-relative velocity vectors (Δposition per unit time) |
| `joint_angles` | 134–143 | 10 MCP + PIP joint angles in radians |
| `wrist_vel_raw` | 144–146 | Camera-normalised wrist velocity (x, y, z); key directional signal |
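The index ranges above are inclusive and together cover all 147 features. A short sketch that encodes the schema and checks this; `FEATURE_BLOCKS`, `block_slice`, and `block_width` are hypothetical names used only for illustration:

```python
from typing import Dict, Tuple

# block name -> (first index, last index), both inclusive, as in the table
FEATURE_BLOCKS: Dict[str, Tuple[int, int]] = {
    "position": (0, 62),
    "fingertip_spread": (63, 67),
    "wrist_trajectory": (68, 70),
    "velocity": (71, 133),
    "joint_angles": (134, 143),
    "wrist_vel_raw": (144, 146),
}


def block_slice(name: str) -> slice:
    """Python slice selecting one block from a 147-dim feature vector."""
    first, last = FEATURE_BLOCKS[name]
    return slice(first, last + 1)


def block_width(name: str) -> int:
    """Number of features in one block."""
    first, last = FEATURE_BLOCKS[name]
    return last - first + 1


total = sum(block_width(name) for name in FEATURE_BLOCKS)
print(total)  # prints 147

frame = list(range(147))  # stand-in feature vector whose values equal their indices
wrist_vel = frame[block_slice("wrist_vel_raw")]
print(wrist_vel)  # prints [144, 145, 146]
```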
How to Use
import torch
from huggingface_hub import hf_hub_download
from maestro.infrastructure.model.checkpoint_loader import load_inference_artifact

# Download the artifact (cached after first call)
local_path = hf_hub_download(
    repo_id="ntsrigaud/maestro-lstm-hybrid",
    filename="two_stream_attn_v1_2layer_ld_cong_finetune_20260515T134706Z_inference.pt",
)

# Load the artifact (includes model, class labels, and feature schema)
artifact = load_inference_artifact(
    artifact_path=local_path,
    device=torch.device("cpu"),
)
artifact.model.eval()

# Build a 147-dim feature vector per frame using LandmarkFeatureTransformer
# and fill a 32-frame SlidingWindowSequenceBuffer; window_np below is the
# resulting array of shape (32, 147). Then:
with torch.no_grad():
    # tensor shape: (batch=1, T=32, F=147)
    window_tensor = torch.tensor(window_np, dtype=torch.float32).unsqueeze(0)
    logits = artifact.model(window_tensor)
pred_class = artifact.class_labels[logits.argmax(dim=1).item()]
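The snippet above relies on a sliding window of per-frame feature vectors. The following is a hedged stand-in for that idea built on `collections.deque`; `FrameWindow` is a hypothetical class and does not reproduce the actual `SlidingWindowSequenceBuffer` API.

```python
from collections import deque
from typing import Deque, List, Optional, Sequence


class FrameWindow:
    """Keep the most recent window_size feature vectors."""

    def __init__(self, window_size: int = 32, feature_dim: int = 147) -> None:
        self.window_size = window_size
        self.feature_dim = feature_dim
        # deque with maxlen drops the oldest frame automatically
        self._frames: Deque[List[float]] = deque(maxlen=window_size)

    def push(self, features: Sequence[float]) -> None:
        if len(features) != self.feature_dim:
            raise ValueError(f"expected {self.feature_dim} features, got {len(features)}")
        self._frames.append(list(features))

    def window(self) -> Optional[List[List[float]]]:
        """Return the full window (oldest frame first), or None until it is filled."""
        if len(self._frames) < self.window_size:
            return None
        return list(self._frames)


buf = FrameWindow()
for i in range(40):
    buf.push([float(i)] * 147)  # dummy frame whose values equal the frame index

win = buf.window()
print(len(win), win[0][0], win[-1][0])  # prints 32 8.0 39.0
```

Returning `None` until the buffer is full keeps the caller from running inference on a partially filled window.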
Training Dataset
- Source: Hybrid merge of Jester and IPN-Hand windows: Jester provides unknown/swiping/zoom/stop_sign classes; IPN-Hand provides point_one and point_two
- Used classes: 10 (9 active gestures + `unknown` background)
- Dataset split: 70% train / 15% val / 15% test (stratified by class)
- Augmentation: temporal scale ±20%, spatial jitter σ=0.005
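One possible reading of the two augmentations, sketched in plain Python. The card does not describe how they are implemented, so everything here is an assumption: temporal scaling is done by linear interpolation with the frame count kept fixed, and Gaussian jitter is applied to every feature. The function names are hypothetical.

```python
import random
from typing import List

Frame = List[float]


def temporal_rescale(window: List[Frame], scale: float) -> List[Frame]:
    """Resample the window at `scale` times the original speed, keeping its length."""
    n = len(window)
    out: List[Frame] = []
    for i in range(n):
        t = min(i * scale, n - 1)  # source position, clamped to the last frame
        lo = int(t)
        hi = min(lo + 1, n - 1)
        w = t - lo                 # interpolation weight between frames lo and hi
        out.append([(1 - w) * a + w * b for a, b in zip(window[lo], window[hi])])
    return out


def spatial_jitter(window: List[Frame], sigma: float, rng: random.Random) -> List[Frame]:
    """Add zero-mean Gaussian noise with standard deviation sigma to every value."""
    return [[x + rng.gauss(0.0, sigma) for x in frame] for frame in window]


def augment(window: List[Frame], rng: random.Random) -> List[Frame]:
    scale = rng.uniform(0.8, 1.2)  # temporal scale ±20%
    return spatial_jitter(temporal_rescale(window, scale), 0.005, rng)


demo_window = [[float(i)] for i in range(16)]  # 16 frames, 1 feature each
augmented = augment(demo_window, random.Random(0))
```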
Training Strategy
Two-phase transfer learning pipeline:
- Phase 1 (pretraining): backbone pretrained on external checkpoint `two_stream_attn_v1_2layer_pretrain_20260515T125437Z.pt` to learn generic gesture dynamics.
- Phase 2 (fine-tuning): head replaced and model adapted on Hybrid Jester+IPN 10-gesture vocabulary.
  - Stage A (frozen backbone): 10 epochs of head-only warmup.
  - Stage B (full model): up to 66 epochs of joint fine-tuning with scheduler/early stopping.
  - Stage B retention defences: replay_max_samples_per_class=500, distillation_weight=0.0, replay_ce_weight=0.0, backbone_lr_multiplier=0.1, ewc_weight=N/A, gpm_components=0, forgetting_penalty_weight=0.5.
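The two fine-tuning stages can be sketched with standard PyTorch tools. This is an illustration under assumptions, not the Maestro training code: the toy `backbone`/`head` model is a stand-in, the choice of AdamW is a guess (the card does not name the optimizer), and the same base learning rate is assumed for both stages.

```python
import torch
from torch import nn

# Toy stand-in model: a recurrent backbone and a 10-class head.
model = nn.ModuleDict({
    "backbone": nn.LSTM(147, 96, num_layers=2, batch_first=True, bidirectional=True),
    "head": nn.Linear(192, 10),
})


def set_backbone_trainable(m: nn.ModuleDict, trainable: bool) -> None:
    for p in m["backbone"].parameters():
        p.requires_grad = trainable


base_lr = 3e-5

# Stage A: backbone frozen, only the head is optimised.
set_backbone_trainable(model, False)
stage_a_opt = torch.optim.AdamW(model["head"].parameters(), lr=base_lr, weight_decay=1e-3)

# Stage B: everything trainable, backbone at base_lr * backbone_lr_multiplier.
set_backbone_trainable(model, True)
stage_b_opt = torch.optim.AdamW(
    [
        {"params": model["backbone"].parameters(), "lr": base_lr * 0.1},
        {"params": model["head"].parameters(), "lr": base_lr},
    ],
    weight_decay=1e-3,
)
```

A lower backbone learning rate is one common way to limit drift away from the pretrained weights while the new head adapts.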
Training Configuration
| Parameter | Value |
|---|---|
| Architecture | EnhancedTwoStreamLSTM (BiLSTM h=96×2, MHA 8 heads, proj=96, mean+max pool, MLP gate) |
| Input size | 147 |
| Hidden size | 96/stream (BiLSTM output: 192) |
| Projection dim | 96 |
| Num layers | 2 |
| MHA heads | 8 (head dim: 24) |
| Dropout | 0.4 |
| Learning rate | 3e-05 |
| Weight decay | 0.001 |
| Batch size | 128 |
| Max epochs | 80 |
| Early stopping patience | 20 |
| Label smoothing | 0.05 |
| Class weighting | disabled |
| Max samples per class | 3000 |
| LR scheduler | ReduceLROnPlateau (factor=0.5, patience=10) |
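To make the label-smoothing value concrete, the sketch below builds the target distribution for one sample under the common uniform-mixing convention, (1 − ε)·one-hot + ε/K. Whether the Maestro pipeline uses exactly this convention is an assumption; `smoothed_target` is a hypothetical helper.

```python
from typing import List


def smoothed_target(true_index: int, num_classes: int, smoothing: float) -> List[float]:
    """Mix a one-hot target with a uniform distribution over all classes."""
    off_value = smoothing / num_classes
    target = [off_value] * num_classes
    target[true_index] += 1.0 - smoothing
    return target


target = smoothed_target(true_index=0, num_classes=10, smoothing=0.05)
# The true class gets about 0.955; each of the other nine classes gets 0.005.
```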
Evaluation Results (Test Set)
| Metric | Value |
|---|---|
| Accuracy | 96.1% |
| Macro F1 | 95.9% |
Per-Class Recall
| Class | Recall |
|---|---|
| `fist` | 97.3% |
| `swiping_right` | 97.1% |
| `swiping_left` | 98.3% |
| `swiping_down` | 98.0% |
| `swiping_up` | 98.2% |
| `zooming_in_full_hand` | 97.0% |
| `zooming_out_full_hand` | 95.1% |
| `point_one` | 97.4% |
| `point_two` | 95.1% |
| `unknown` | 85.7% |
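The per-class figures can be summarised with an unweighted (macro) average. Note that this is macro recall, computed from the table above, and not the macro F1 reported earlier, which also depends on precision.

```python
# Per-class recall in percent, copied from the table above
recalls = {
    "fist": 97.3,
    "swiping_right": 97.1,
    "swiping_left": 98.3,
    "swiping_down": 98.0,
    "swiping_up": 98.2,
    "zooming_in_full_hand": 97.0,
    "zooming_out_full_hand": 95.1,
    "point_one": 97.4,
    "point_two": 95.1,
    "unknown": 85.7,
}

macro_recall = sum(recalls.values()) / len(recalls)
weakest = min(recalls, key=recalls.get)

print(f"{macro_recall:.2f}")  # prints 95.92
print(weakest)                # prints unknown
```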
Comparison with Previous Architecture
| Feature | TwoStreamGestureLSTM | EnhancedTwoStreamLSTM |
|---|---|---|
| LSTM direction | Unidirectional | Bidirectional |
| Attention | Bahdanau (scalar) | MHA Q/K/V (8 heads) |
| Feature projection | No | Yes (→96) |
| Temporal pooling | Mean only | Mean + Max |
| Cross-stream fusion | Concat only | 2-layer MLP gate |
| Parameters | ~182 K | 1,208,554 |
Limitations and Risks
- Trained on Jester and IPN Hand subjects only. Performance may degrade with unusual hand sizes, skin tones, or lighting conditions not represented in the training data.
- The `unknown` class represents background/transition frames. At runtime, predictions are filtered through per-class confidence thresholds defined in `production_hybrid.yaml`.
- Requires mediapipe>=0.10.14 for landmark extraction at inference time.
- Not intended for safety-critical or accessibility-critical applications.
- Performance was measured on a held-out test split from the same dataset; real-world generalisation may differ.
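The per-class confidence filtering mentioned above can be sketched as follows. The threshold values, the default threshold, and the fallback to `unknown` are all illustrative assumptions; the actual contents of `production_hybrid.yaml` are not shown in this card.

```python
import math
from typing import Dict, List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def filter_prediction(
    logits: Sequence[float],
    labels: Sequence[str],
    thresholds: Dict[str, float],
    default_threshold: float = 0.5,  # hypothetical default
    fallback: str = "unknown",
) -> str:
    """Return the top class if its probability clears its threshold, else the fallback."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    label = labels[best]
    if probs[best] >= thresholds.get(label, default_threshold):
        return label
    return fallback


labels = ["fist", "swiping_right", "unknown"]
thresholds = {"fist": 0.9, "swiping_right": 0.6}  # made-up values

print(filter_prediction([2.0, 1.0, 0.0], labels, thresholds))  # prints unknown
print(filter_prediction([0.0, 3.0, 0.0], labels, thresholds))  # prints swiping_right
```

In the first call `fist` wins with a probability of roughly 0.67, which is below its 0.9 threshold, so the prediction is suppressed.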
Environmental Impact
Training was performed on CPU/MPS. Estimated training time: ~10 minutes. Estimated CO₂ equivalent: negligible (<0.001 kg CO₂eq).
Generated by the Maestro training pipeline on 2026-05-15.