ASL Recognition Model (Improved)
Model Description
This model performs American Sign Language (ASL) recognition from MediaPipe hand landmarks. It recognizes 77 ASL signs:
- 26 letters (A-Z)
- 51 common words and phrases
Architecture: 1D CNN + Bidirectional LSTM with Attention Mechanism
Input: MediaPipe hand landmarks (2 hands × 21 landmarks × 3 coordinates = 126 features per frame)
Performance: 64.61% validation accuracy
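The 126-feature figure follows from flattening each frame's landmark array. The sketch below illustrates that reshape with NumPy. The helper name `flatten_frames` and the random input are hypothetical, and the card does not say whether the released model flattens internally or expects pre-flattened input.

```python
import numpy as np

NUM_HANDS, NUM_LANDMARKS, NUM_COORDS = 2, 21, 3

def flatten_frames(landmarks: np.ndarray) -> np.ndarray:
    """Flatten (seq_len, 2, 21, 3) landmarks into (seq_len, 126) per-frame features."""
    seq_len = landmarks.shape[0]
    return landmarks.reshape(seq_len, NUM_HANDS * NUM_LANDMARKS * NUM_COORDS)

# Hypothetical input: 30 frames of random values standing in for real landmarks.
frames = np.random.rand(30, NUM_HANDS, NUM_LANDMARKS, NUM_COORDS).astype(np.float32)
features = flatten_frames(frames)
print(features.shape)  # (30, 126)
```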
Model Details
Architecture Components
1D Convolutional Layers: Extract spatial features from hand landmarks
- Conv1d(126 → 128) + BatchNorm + ReLU + Dropout
- Conv1d(128 → 256) + BatchNorm + ReLU + Dropout
Bidirectional LSTM: Model temporal dependencies across frames
- 2 layers, hidden size 256
- Bidirectional for both past and future context
Attention Mechanism: Focus on important frames in the sequence
Classification Head: Multi-layer perceptron with dropout
- Linear(512 → 512) + BatchNorm + ReLU + Dropout
- Linear(512 → 256) + BatchNorm + ReLU + Dropout
- Linear(256 → 77)
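The components listed above could be wired together roughly as follows. This is an illustrative sketch, not the released `ImprovedASLModel`. The class name `SketchASLModel`, the convolution kernel size, the form of the attention layer, and the assumption that input arrives as `(batch, seq_len, 126)` are all guesses, so the parameter count is not expected to match the reported figure.

```python
import torch
import torch.nn as nn

class SketchASLModel(nn.Module):
    """Illustrative Conv1d + BiLSTM + attention classifier (not the released class)."""

    def __init__(self, num_classes: int = 77, in_features: int = 126,
                 hidden: int = 256, dropout: float = 0.3):
        super().__init__()
        # kernel_size=3 with padding=1 is an assumption; the card does not state it.
        self.conv = nn.Sequential(
            nn.Conv1d(in_features, 128, kernel_size=3, padding=1),
            nn.BatchNorm1d(128), nn.ReLU(), nn.Dropout(dropout),
            nn.Conv1d(128, 256, kernel_size=3, padding=1),
            nn.BatchNorm1d(256), nn.ReLU(), nn.Dropout(dropout),
        )
        self.lstm = nn.LSTM(256, hidden, num_layers=2, batch_first=True,
                            bidirectional=True, dropout=dropout)
        # Assumed attention: one learned score per frame, softmax over frames.
        self.attn = nn.Linear(2 * hidden, 1)
        self.head = nn.Sequential(
            nn.Linear(2 * hidden, 512), nn.BatchNorm1d(512), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(512, 256), nn.BatchNorm1d(256), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(256, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, 126); Conv1d expects (batch, channels, seq_len).
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)  # (batch, seq_len, 256)
        h, _ = self.lstm(h)                                # (batch, seq_len, 512)
        weights = torch.softmax(self.attn(h), dim=1)       # (batch, seq_len, 1)
        context = (weights * h).sum(dim=1)                 # (batch, 512)
        return self.head(context)                          # (batch, num_classes)

model = SketchASLModel().eval()
logits = model(torch.randn(4, 30, 126))
print(logits.shape)  # torch.Size([4, 77])
```

The attention weights sum to 1 across frames, so the context vector is a weighted average of the BiLSTM outputs.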
Training Configuration
- Optimizer: AdamW (lr=0.001, weight_decay=0.0001)
- Scheduler: OneCycleLR with cosine annealing
- Loss: Cross-Entropy with class weighting for imbalanced data
- Batch size: 32
- Epochs: best checkpoint at epoch 27; training stopped early after about 42 epochs (max 100)
- Data augmentation: Gaussian noise (σ=0.02) on training data
- Regularization: Dropout (0.3), gradient clipping (max_norm=1.0)
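A single training step under the configuration above might look like the sketch below. `TinyModel`, the uniform `class_weights`, and `STEPS_PER_EPOCH` are placeholders invented for illustration. Treating the card's learning rate as OneCycleLR's `max_lr` is also an assumption.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
NUM_CLASSES, FEATURES, SEQ_LEN, BATCH = 77, 126, 30, 32
EPOCHS, STEPS_PER_EPOCH = 100, 10  # steps per epoch is a placeholder value

class TinyModel(nn.Module):
    """Placeholder classifier: average over frames, then one linear layer."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(FEATURES, NUM_CLASSES)

    def forward(self, x):
        return self.fc(x.mean(dim=1))

model = TinyModel()
# Placeholder weights; in practice these would be derived from class frequencies.
class_weights = torch.ones(NUM_CLASSES)
criterion = nn.CrossEntropyLoss(weight=class_weights)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-3, total_steps=EPOCHS * STEPS_PER_EPOCH,
    anneal_strategy="cos",
)

def train_step(x, y, noise_std=0.02):
    """One optimizer step with noise augmentation and gradient clipping."""
    model.train()
    x = x + noise_std * torch.randn_like(x)  # Gaussian-noise augmentation
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()  # OneCycleLR is stepped once per batch
    return loss.item()

x = torch.randn(BATCH, SEQ_LEN, FEATURES)
y = torch.randint(0, NUM_CLASSES, (BATCH,))
loss = train_step(x, y)
print(f"loss after one step: {loss:.4f}")
```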
Model Parameters
Total trainable parameters: 3,258,574
Intended Use
Direct Use
This model is designed for:
- Real-time ASL recognition from webcam input
- Educational applications for learning sign language
- Accessibility tools that support communication with deaf and hard-of-hearing people
Limitations
- Requires MediaPipe hand tracking as preprocessing step
- Trained on specific hand landmark format (2 hands, 21 landmarks each)
- Performance may vary with different lighting conditions, hand sizes, and camera angles
- Currently supports 77 signs only
How to Use
Installation
pip install torch numpy mediapipe huggingface_hub
Inference Example
import torch
import numpy as np
from huggingface_hub import hf_hub_download
# Download model
model_path = hf_hub_download(repo_id="namratha2412/asl-improved-recognition", filename="best_model.pth")
label_path = hf_hub_download(repo_id="namratha2412/asl-improved-recognition", filename="label_encoder_classes.npy")
# Load model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
checkpoint = torch.load(model_path, map_location=device)
# Load architecture (you'll need the model class definition)
from your_model_file import ImprovedASLModel
label_encoder_classes = np.load(label_path, allow_pickle=True)
num_classes = len(label_encoder_classes)
model = ImprovedASLModel(num_classes=num_classes)
model.load_state_dict(checkpoint['model_state_dict'])
model = model.to(device)
model.eval()
# Inference on a landmark sequence
# landmarks: float tensor of shape (1, seq_len, 2, 21, 3), produced by your MediaPipe preprocessing
landmarks = landmarks.to(device)
with torch.no_grad():
    logits = model(landmarks)
    prediction = torch.argmax(logits, dim=-1)
predicted_sign = label_encoder_classes[prediction.item()]
print(f"Predicted sign: {predicted_sign}")
Training Data
The model was trained on a custom ASL dataset containing:
- 77 sign classes
- Hand landmark sequences extracted using MediaPipe Holistic
- Stratified train/validation split (80/20)
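A stratified 80/20 split keeps each sign's share roughly equal in both subsets. The pure-Python sketch below shows the idea. The function name, seed, and toy labels are hypothetical, and the card does not say which tool produced the actual split.

```python
import random
from collections import defaultdict

def stratified_split(labels, val_fraction=0.2, seed=42):
    """Return (train_idx, val_idx) so each class keeps roughly the same ratio."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # At least one validation sample per class; a class with a single
        # sample therefore ends up only in validation.
        n_val = max(1, round(len(indices) * val_fraction))
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)

# Toy labels standing in for the real dataset.
labels = ["hello"] * 10 + ["yes"] * 5 + ["A"] * 20
train_idx, val_idx = stratified_split(labels)
print(len(train_idx), len(val_idx))  # 28 7
```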
Signs Recognized
A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z (letters)
afternoon, angry, bad, book, come, drink, eat, evening, family, feel, food, friend, go, good, goodbye, happy, have, hear, hello, help, home, know, learn, love, morning, need, night, no, please, read, sad, school, see, sign, sorry, speak, student, teacher, thank_you, think, time, tired, today, tomorrow, understand, want, water, work, write, yes, yesterday (words)
Training Procedure
Preprocessing
- MediaPipe Holistic extracts hand landmarks from video frames
- Normalize coordinates to be camera-independent
- Pad/truncate sequences to fixed length
- Create binary masks for valid frames
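The normalization, padding, and masking steps above could be sketched as follows. The card does not specify the normalization method or the fixed sequence length. Wrist-centering with unit scaling, `max_len=30`, and all-zero rows for undetected hands are assumptions, and both helper names are hypothetical.

```python
import numpy as np

def normalize_hand(hand: np.ndarray) -> np.ndarray:
    """Center one (21, 3) hand on its wrist (landmark 0) and scale to unit size."""
    if not hand.any():  # all zeros is assumed to mark an undetected hand
        return hand
    centered = hand - hand[0]
    scale = np.linalg.norm(centered, axis=1).max()
    return centered / scale if scale > 0 else centered

def preprocess(sequence: np.ndarray, max_len: int = 30):
    """Normalize, then pad/truncate a (seq_len, 2, 21, 3) sequence to max_len frames."""
    normalized = np.stack([
        np.stack([normalize_hand(frame[h]) for h in range(2)])
        for frame in sequence
    ]).astype(np.float32)
    n_valid = min(len(normalized), max_len)
    padded = np.zeros((max_len, 2, 21, 3), dtype=np.float32)
    padded[:n_valid] = normalized[:n_valid]
    mask = np.zeros(max_len, dtype=bool)
    mask[:n_valid] = True  # True for real frames, False for padding
    return padded, mask

# Hypothetical 12-frame clip of random values standing in for real landmarks.
clip = np.random.rand(12, 2, 21, 3)
padded, mask = preprocess(clip, max_len=30)
print(padded.shape, int(mask.sum()))  # (30, 2, 21, 3) 12
```

Subtracting the wrist position removes dependence on where the hand sits in the frame, and dividing by the largest wrist-to-landmark distance removes dependence on apparent hand size.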
Training Hyperparameters
- Learning rate: 0.001
- Weight decay: 0.0001
- Batch size: 32
- Max epochs: 100
- Early stopping patience: 15
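Patience-based early stopping can be illustrated with a small helper. The `EarlyStopping` class and the accuracy curve below are synthetic and are not taken from the training logs. They only show how a best epoch of 27 combined with a patience of 15 would end training at epoch 42.

```python
class EarlyStopping:
    """Stop when validation accuracy has not improved for `patience` epochs."""

    def __init__(self, patience: int = 15):
        self.patience = patience
        self.best_acc = float("-inf")
        self.best_epoch = 0
        self.bad_epochs = 0

    def step(self, epoch: int, val_acc: float) -> bool:
        """Record one epoch; return True when training should stop."""
        if val_acc > self.best_acc:
            self.best_acc, self.best_epoch, self.bad_epochs = val_acc, epoch, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Synthetic curve: accuracy rises until epoch 27, then stays below the best value.
stopper = EarlyStopping(patience=15)
stopped_at = None
for epoch in range(1, 101):
    val_acc = 0.6461 * epoch / 27 if epoch <= 27 else 0.60
    if stopper.step(epoch, val_acc):
        stopped_at = epoch
        break
print(stopper.best_epoch, stopped_at)  # 27 42
```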
Evaluation Results
Metrics
| Metric | Value |
|---|---|
| Validation Accuracy | 64.61% |
| Training Accuracy | N/A |
| Best Epoch | 27 |
Environmental Impact
- Hardware: GPU training (CUDA-enabled)
- Training time: ~42 epochs with early stopping
- Carbon emissions: Not measured
Citation
If you use this model, please cite:
@misc{asl-improved-recognition-2025,
author = {Namratha},
title = {ASL Improved Recognition Model},
year = {2025},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/namratha2412/asl-improved-recognition}}
}
Model Card Authors
Namratha (@namratha2412)
Model Card Contact
For questions or issues, please open an issue in the model repository.