ASL Recognition Model (Improved)

Model Description

This model performs American Sign Language (ASL) recognition from MediaPipe hand landmarks. It recognizes 77 ASL signs:

26 letters (A-Z)
51 common words and phrases

Architecture: 1D CNN + Bidirectional LSTM with Attention Mechanism

Input: MediaPipe hand landmarks (2 hands × 21 landmarks × 3 coordinates = 126 features per frame)
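
A minimal sketch of how the 126 per-frame features can be obtained from a (seq_len, 2, 21, 3) landmark array. The hand-major ordering produced by the reshape and the helper name flatten_frames are assumptions for illustration; the card does not state the exact feature ordering used in training.

```python
import numpy as np

NUM_HANDS, NUM_LANDMARKS, NUM_COORDS = 2, 21, 3
FEATURES_PER_FRAME = NUM_HANDS * NUM_LANDMARKS * NUM_COORDS  # 126

def flatten_frames(landmarks: np.ndarray) -> np.ndarray:
    """Reshape (seq_len, 2, 21, 3) landmarks into (seq_len, 126) feature rows."""
    expected = (NUM_HANDS, NUM_LANDMARKS, NUM_COORDS)
    if landmarks.shape[1:] != expected:
        raise ValueError(f"expected trailing shape {expected}, got {landmarks.shape[1:]}")
    return landmarks.reshape(landmarks.shape[0], FEATURES_PER_FRAME)

frames = np.zeros((30, 2, 21, 3), dtype=np.float32)  # 30 dummy frames
features = flatten_frames(frames)
print(features.shape)  # (30, 126)
```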

Performance: 64.61% validation accuracy

Model Details

Architecture Components

1D Convolutional Layers: Extract spatial features from hand landmarks
- Conv1d(126 → 128) + BatchNorm + ReLU + Dropout
- Conv1d(128 → 256) + BatchNorm + ReLU + Dropout
Bidirectional LSTM: Model temporal dependencies across frames
- 2 layers, hidden size 256
- Bidirectional for both past and future context
Attention Mechanism: Focus on important frames in the sequence
Classification Head: Multi-layer perceptron with dropout
- Linear(512 → 512) + BatchNorm + ReLU + Dropout
- Linear(512 → 256) + BatchNorm + ReLU + Dropout
- Linear(256 → 77)
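
The components above could be wired together roughly as follows. This is an illustrative sketch, not the released ImprovedASLModel: the class name SketchASLModel, the kernel size and padding, and the single-score attention form are all assumptions, so its parameter count need not match the published one.

```python
import torch
import torch.nn as nn

class SketchASLModel(nn.Module):
    """Illustrative CNN + BiLSTM + attention classifier (not the released class)."""

    def __init__(self, num_classes=77, in_features=126, hidden=256, dropout=0.3):
        super().__init__()
        # kernel_size=3 with padding=1 is an assumption; it keeps seq_len unchanged.
        self.conv = nn.Sequential(
            nn.Conv1d(in_features, 128, kernel_size=3, padding=1),
            nn.BatchNorm1d(128), nn.ReLU(), nn.Dropout(dropout),
            nn.Conv1d(128, 256, kernel_size=3, padding=1),
            nn.BatchNorm1d(256), nn.ReLU(), nn.Dropout(dropout),
        )
        self.lstm = nn.LSTM(256, hidden, num_layers=2, batch_first=True,
                            bidirectional=True, dropout=dropout)
        # Assumed attention form: one learned score per frame, softmax over time.
        self.attn = nn.Linear(2 * hidden, 1)
        self.head = nn.Sequential(
            nn.Linear(2 * hidden, 512), nn.BatchNorm1d(512), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(512, 256), nn.BatchNorm1d(256), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(256, num_classes),
        )

    def forward(self, x, mask=None):
        # x: (batch, seq_len, 126); mask: (batch, seq_len), 1 for valid frames.
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)   # (batch, seq_len, 256)
        h, _ = self.lstm(h)                                 # (batch, seq_len, 512)
        scores = self.attn(h).squeeze(-1)                   # (batch, seq_len)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        weights = torch.softmax(scores, dim=1).unsqueeze(-1)
        context = (weights * h).sum(dim=1)                  # (batch, 512)
        return self.head(context)

model = SketchASLModel().eval()
with torch.no_grad():
    logits = model(torch.randn(4, 30, 126))
print(logits.shape)  # torch.Size([4, 77])
```

The bidirectional LSTM doubles the hidden size (2 × 256 = 512), which is why the first linear layer of the head takes 512 inputs.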

Training Configuration

Optimizer: AdamW (lr=0.001, weight_decay=0.0001)
Scheduler: OneCycleLR with cosine annealing
Loss: Cross-Entropy with class weighting for imbalanced data
Batch size: 32
Epochs: best checkpoint at epoch 27; early stopping ended training at about epoch 42 (max 100)
Data augmentation: Gaussian noise (σ=0.02) on training data
Regularization: Dropout (0.3), gradient clipping (max_norm=1.0)
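
The configuration above might translate into PyTorch along these lines. This is a sketch under assumptions: the inverse-frequency class weights, the use of 0.001 as the OneCycleLR peak rate, and the helper names build_training_objects and train_step are illustrative, and the toy model only exists so the snippet runs on its own.

```python
import torch
import torch.nn as nn

def build_training_objects(model, class_counts, steps_per_epoch, max_epochs=100):
    """Create loss, optimizer and scheduler with the hyperparameters listed above."""
    counts = torch.as_tensor(class_counts, dtype=torch.float32)
    # Assumed weighting: inverse class frequency, scaled so a balanced set gives 1.0.
    weights = counts.sum() / (len(counts) * counts)
    criterion = nn.CrossEntropyLoss(weight=weights)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
    # Assumption: the listed learning rate is the peak (max_lr) of the one-cycle schedule.
    scheduler = torch.optim.lr_scheduler.OneCycleLR(
        optimizer, max_lr=1e-3, epochs=max_epochs,
        steps_per_epoch=steps_per_epoch, anneal_strategy="cos",
    )
    return criterion, optimizer, scheduler

def train_step(model, batch, labels, criterion, optimizer, scheduler, noise_std=0.02):
    """One optimization step with Gaussian-noise augmentation and gradient clipping."""
    model.train()
    noisy = batch + noise_std * torch.randn_like(batch)   # augmentation, sigma = 0.02
    loss = criterion(model(noisy), labels)
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()   # OneCycleLR is stepped once per batch
    return loss.item()

# Tiny stand-in model and random data so the sketch runs on its own.
toy_model = nn.Sequential(nn.Flatten(), nn.Linear(30 * 126, 77))
criterion, optimizer, scheduler = build_training_objects(
    toy_model, class_counts=[10] * 77, steps_per_epoch=5, max_epochs=2)
loss = train_step(toy_model, torch.randn(32, 30, 126), torch.randint(0, 77, (32,)),
                  criterion, optimizer, scheduler)
print(f"loss: {loss:.4f}")
```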

Model Parameters

Total trainable parameters: 3,258,574
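
A figure like this is usually obtained by summing the element counts of all parameters that require gradients. The helper below is a generic sketch (the name count_trainable_parameters is hypothetical), shown on a layer the size of the final classification layer rather than on the full model.

```python
import torch.nn as nn

def count_trainable_parameters(model: nn.Module) -> int:
    """Sum the number of elements in every parameter that receives gradients."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

final_layer = nn.Linear(256, 77)
print(count_trainable_parameters(final_layer))  # 256 * 77 weights + 77 biases = 19789
```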

Intended Use

Direct Use

This model is designed for:

Real-time ASL recognition from webcam input
Educational applications for learning sign language
Accessibility tools for communication with deaf and hard-of-hearing people

Limitations

Requires MediaPipe hand tracking as a preprocessing step
Trained on a specific hand landmark format (2 hands, 21 landmarks each)
Performance may vary with different lighting conditions, hand sizes, and camera angles
Currently supports 77 signs only

How to Use

Installation

pip install torch numpy mediapipe huggingface_hub

Inference Example

import torch
import numpy as np
from huggingface_hub import hf_hub_download

# Download model
model_path = hf_hub_download(repo_id="namratha2412/asl-improved-recognition", filename="best_model.pth")
label_path = hf_hub_download(repo_id="namratha2412/asl-improved-recognition", filename="label_encoder_classes.npy")

# Load model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
checkpoint = torch.load(model_path, map_location=device)

# Load architecture (you'll need the model class definition)
from your_model_file import ImprovedASLModel
label_encoder_classes = np.load(label_path, allow_pickle=True)
num_classes = len(label_encoder_classes)

model = ImprovedASLModel(num_classes=num_classes)
model.load_state_dict(checkpoint['model_state_dict'])
model = model.to(device)
model.eval()

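# Inference on a landmark sequence.
# `landmarks` is not defined in this snippet: it must be a float32 tensor on `device`
# produced by your own MediaPipe preprocessing, with shape (1, seq_len, 2, 21, 3).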
with torch.no_grad():
    logits = model(landmarks)
    prediction = torch.argmax(logits, dim=-1)
    predicted_sign = label_encoder_classes[prediction.item()]
    
print(f"Predicted sign: {predicted_sign}")
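
Because a single argmax hides how confident the model is, it can help to inspect the top few candidates. The sketch below assumes logits of shape (1, num_classes) as in the example above; the helper name top_k_predictions and the dummy logits are illustrative only.

```python
import torch

def top_k_predictions(logits, class_names, k=3):
    """Return the k most probable (label, probability) pairs for one sequence."""
    probs = torch.softmax(logits, dim=-1).squeeze(0)   # (num_classes,)
    values, indices = torch.topk(probs, k)
    return [(class_names[i], float(p)) for p, i in zip(values, indices.tolist())]

# Dummy logits over three of the listed signs, for illustration only.
names = ["hello", "help", "home"]
dummy_logits = torch.tensor([[2.0, 0.5, 1.0]])
for label, prob in top_k_predictions(dummy_logits, names, k=2):
    print(f"{label}: {prob:.2f}")
```

In an application, a probability threshold on the top candidate is one way to suppress low-confidence predictions; a suitable threshold would have to be tuned on validation data.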

Training Data

The model was trained on a custom ASL dataset containing:

77 sign classes
Hand landmark sequences extracted using MediaPipe Holistic
Stratified train/validation split (80/20)
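
A stratified 80/20 split keeps each sign's share roughly equal in both subsets. The NumPy sketch below shows one way to do it; the function name stratified_split, the seed, and the rule of keeping at least one validation sample per class are assumptions, not details taken from the training code.

```python
import numpy as np

def stratified_split(labels, val_fraction=0.2, seed=0):
    """Return (train_idx, val_idx) with each class split roughly 80/20."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    train_idx, val_idx = [], []
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        n_val = max(1, round(len(idx) * val_fraction))  # keep every class in validation
        val_idx.extend(idx[:n_val])
        train_idx.extend(idx[n_val:])
    return np.array(train_idx), np.array(val_idx)

labels = ["hello"] * 10 + ["yes"] * 5
train_idx, val_idx = stratified_split(labels)
print(len(train_idx), len(val_idx))  # 12 3
```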

Signs Recognized

A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z (letters)

afternoon, angry, bad, book, come, drink, eat, evening, family, feel, food, friend, go, good, goodbye, happy, have, hear, hello, help, home, know, learn, love, morning, need, night, no, please, read, sad, school, see, sign, sorry, speak, student, teacher, thank_you, think, time, tired, today, tomorrow, understand, want, water, work, write, yes, yesterday (words)

Training Procedure

Preprocessing

MediaPipe Holistic extracts hand landmarks from video frames
Normalize coordinates to be camera-independent
Pad/truncate sequences to fixed length
Create binary masks for valid frames
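
The last three steps could look roughly like the sketch below. The card does not specify the normalization, so centering each hand on its wrist (MediaPipe landmark 0) and scaling by hand extent is an assumption, as are the helper names and the placeholder max_len value.

```python
import numpy as np

def normalize_hands(frames):
    """Center each hand on its wrist (landmark 0) and scale by its largest extent."""
    centered = frames - frames[:, :, :1, :]                    # (T, 2, 21, 3)
    scale = np.linalg.norm(centered, axis=-1).max(axis=-1)     # (T, 2)
    scale = np.where(scale > 0, scale, 1.0)                    # all-zero hands stay zero
    return centered / scale[:, :, None, None]

def pad_or_truncate(features, max_len):
    """Return (max_len, F) features and a (max_len,) mask with 1 for real frames."""
    seq_len = min(len(features), max_len)
    padded = np.zeros((max_len, features.shape[1]), dtype=np.float32)
    padded[:seq_len] = features[:seq_len]
    mask = np.zeros(max_len, dtype=np.float32)
    mask[:seq_len] = 1.0
    return padded, mask

raw = np.random.default_rng(0).random((20, 2, 21, 3)).astype(np.float32)
features = normalize_hands(raw).reshape(len(raw), -1)   # (20, 126)
padded, mask = pad_or_truncate(features, max_len=32)    # max_len is a placeholder
print(padded.shape, int(mask.sum()))  # (32, 126) 20
```

The mask lets downstream layers, such as an attention step, ignore the zero-padded frames.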

Training Hyperparameters

Learning rate: 0.001
Weight decay: 0.0001
Batch size: 32
Max epochs: 100
Early stopping patience: 15
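
To illustrate how a patience of 15 interacts with the best epoch, here is a small early-stopping sketch. It assumes validation accuracy is the monitored metric (the card does not say), and the accuracy values are synthetic. With a peak at epoch 27, training stops at epoch 27 + 15 = 42.

```python
class EarlyStopping:
    """Stop when validation accuracy has not improved for `patience` epochs."""

    def __init__(self, patience=15):
        self.patience = patience
        self.best_acc = float("-inf")
        self.best_epoch = 0
        self.bad_epochs = 0

    def step(self, epoch, val_acc):
        """Record one epoch; return True when training should stop."""
        if val_acc > self.best_acc:
            self.best_acc, self.best_epoch, self.bad_epochs = val_acc, epoch, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Synthetic accuracies that peak at epoch 27, for illustration only.
stopper = EarlyStopping(patience=15)
for epoch in range(1, 101):
    val_acc = epoch / 100 if epoch <= 27 else 0.0
    if stopper.step(epoch, val_acc):
        break
print(stopper.best_epoch, epoch)  # 27 42
```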

Evaluation Results

Metrics

Metric	Value
Validation Accuracy	64.61%
Training Accuracy	N/A
Best Epoch	27

Environmental Impact

Hardware: GPU training (CUDA-enabled)
Training length: ~42 epochs (ended by early stopping; wall-clock time not reported)
Carbon emissions: Not measured

Citation

If you use this model, please cite:

@misc{asl-improved-recognition-2025,
  author = {Namratha},
  title = {ASL Improved Recognition Model},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/namratha2412/asl-improved-recognition}}
}

Model Card Authors

Namratha (@namratha2412)

Model Card Contact

For questions or issues, please open an issue in the model repository.
