ASL Recognition Model (Improved)

Model Description

This model performs American Sign Language (ASL) recognition from MediaPipe hand landmarks. It recognizes 77 different ASL signs including:

  • 26 letters (A-Z)
  • 51 common words and phrases

Architecture: 1D CNN + Bidirectional LSTM with Attention Mechanism

Input: MediaPipe hand landmarks (2 hands × 21 landmarks × 3 coordinates = 126 features per frame)
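
As a minimal sketch of how one frame could be packed into the 126-value vector described above: the code below assumes MediaPipe Holistic results, whose `left_hand_landmarks` and `right_hand_landmarks` attributes each hold a `landmark` list with `x`, `y`, `z` fields. The helper name `frame_features`, the left-then-right ordering, and zero-filling an undetected hand are assumptions for illustration and are not confirmed by this card.

```python
import numpy as np

NUM_HANDS, NUM_LANDMARKS, NUM_COORDS = 2, 21, 3


def frame_features(results):
    """Pack one frame of Holistic output into a (126,) float32 vector."""
    frame = np.zeros((NUM_HANDS, NUM_LANDMARKS, NUM_COORDS), dtype=np.float32)
    # Assumed ordering: index 0 = left hand, index 1 = right hand.
    hands = (
        getattr(results, "left_hand_landmarks", None),
        getattr(results, "right_hand_landmarks", None),
    )
    for i, hand in enumerate(hands):
        if hand is None:
            continue  # assumed convention: an undetected hand stays all zeros
        for j, lm in enumerate(hand.landmark):
            frame[i, j] = (lm.x, lm.y, lm.z)
    return frame.reshape(-1)  # 2 * 21 * 3 = 126
```

Stacking these vectors over time gives a (seq_len, 126) array, which is the per-sequence layout the convolutional layers expect.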

Performance: 64.61% validation accuracy

Model Details

Architecture Components

  1. 1D Convolutional Layers: Extract spatial features from hand landmarks

    • Conv1d(126 → 128) + BatchNorm + ReLU + Dropout
    • Conv1d(128 → 256) + BatchNorm + ReLU + Dropout
  2. Bidirectional LSTM: Model temporal dependencies across frames

    • 2 layers, hidden size 256
    • Bidirectional for both past and future context
  3. Attention Mechanism: Focus on important frames in the sequence

  4. Classification Head: Multi-layer perceptron with dropout

    • Linear(512 → 512) + BatchNorm + ReLU + Dropout
    • Linear(512 → 256) + BatchNorm + ReLU + Dropout
    • Linear(256 → 77)
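
The components above can be sketched in PyTorch roughly as follows. This is a hypothetical reconstruction, not the author's `ImprovedASLModel`: the class name `ASLSketch`, the convolution kernel size (3, with padding 1), the attention scorer (a 128-unit tanh layer), and the optional `mask` argument are assumptions, since the card does not state them.

```python
import torch
import torch.nn as nn


class ASLSketch(nn.Module):
    """Hypothetical reconstruction of the CNN + BiLSTM + attention design."""

    def __init__(self, num_classes=77, in_features=126, hidden=256, dropout=0.3):
        super().__init__()
        # kernel_size=3 with padding=1 is an assumption; it keeps seq_len unchanged.
        self.conv = nn.Sequential(
            nn.Conv1d(in_features, 128, kernel_size=3, padding=1),
            nn.BatchNorm1d(128), nn.ReLU(), nn.Dropout(dropout),
            nn.Conv1d(128, 256, kernel_size=3, padding=1),
            nn.BatchNorm1d(256), nn.ReLU(), nn.Dropout(dropout),
        )
        self.lstm = nn.LSTM(256, hidden, num_layers=2, batch_first=True,
                            bidirectional=True, dropout=dropout)
        # Additive attention scorer; the 128-unit width is an assumption.
        self.attn = nn.Sequential(
            nn.Linear(2 * hidden, 128), nn.Tanh(), nn.Linear(128, 1)
        )
        self.head = nn.Sequential(
            nn.Linear(2 * hidden, 512), nn.BatchNorm1d(512), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(512, 256), nn.BatchNorm1d(256), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(256, num_classes),
        )

    def forward(self, x, mask=None):
        # x: (batch, seq_len, 126); mask: (batch, seq_len) bool, True = valid frame.
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)      # (B, T, 256)
        h, _ = self.lstm(h)                                    # (B, T, 512)
        scores = self.attn(h).squeeze(-1)                      # (B, T)
        if mask is not None:
            scores = scores.masked_fill(~mask, float("-inf"))  # ignore padded frames
        weights = torch.softmax(scores, dim=1).unsqueeze(-1)   # (B, T, 1)
        context = (weights * h).sum(dim=1)                     # (B, 512)
        return self.head(context)                              # (B, num_classes)


model = ASLSketch().eval()
with torch.no_grad():
    logits = model(torch.zeros(2, 16, 126))  # two dummy sequences of 16 frames
```

The attention step replaces "take the last LSTM state" with a learned weighted average over frames, which is one common way to let the classifier focus on the informative part of a sign.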

Training Configuration

  • Optimizer: AdamW (lr=0.001, weight_decay=0.0001)
  • Scheduler: OneCycleLR with cosine annealing
  • Loss: Cross-Entropy with class weighting for imbalanced data
  • Batch size: 32
  • Epochs: best checkpoint at epoch 27; early stopping ended training after ~42 epochs (max 100)
  • Data augmentation: Gaussian noise (σ=0.02) on training data
  • Regularization: Dropout (0.3), gradient clipping (max_norm=1.0)
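
A minimal sketch of how these settings might be wired together in PyTorch, shown as a single training step. The stand-in linear model, the random batch, `steps_per_epoch`, treating 0.001 as OneCycleLR's `max_lr`, and the inverse-frequency class weights are all assumptions made for illustration; the card does not say how the weights or the schedule were actually configured.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
num_classes, steps_per_epoch, max_epochs = 77, 10, 100  # steps_per_epoch is illustrative

# Stand-in model and data; the real model is the CNN + BiLSTM described above.
model = nn.Linear(126, num_classes)
train_labels = torch.randint(0, num_classes, (1000,))   # placeholder label list
features = torch.randn(32, 126)                          # batch size 32
labels = train_labels[:32]

# Assumed weighting scheme: inverse class frequency over the training labels.
counts = torch.bincount(train_labels, minlength=num_classes).clamp(min=1).float()
class_weights = counts.sum() / (num_classes * counts)

criterion = nn.CrossEntropyLoss(weight=class_weights)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=1e-3,                                   # assumption: peak LR equals lr above
    total_steps=steps_per_epoch * max_epochs,
    anneal_strategy="cos",
)

# One training step.
model.train()
noisy = features + 0.02 * torch.randn_like(features)    # Gaussian noise, sigma = 0.02
loss = criterion(model(noisy), labels)
optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
scheduler.step()                                         # OneCycleLR steps per batch
```

Note that OneCycleLR is stepped once per batch rather than once per epoch, so `total_steps` has to cover the full planned run even if early stopping ends it sooner.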

Model Parameters

Total trainable parameters: 3,258,574
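
The total can be checked by hand from the layer sizes listed under Architecture Components. The card does not state the convolution kernel size or the attention layer's shape; the sketch below assumes a kernel size of 3 and an attention scorer of Linear(512 → 128) followed by Linear(128 → 1), which is one configuration that reproduces the reported figure. Those two values are assumptions, not confirmed details.

```python
def conv1d(c_in, c_out, k):
    return c_in * c_out * k + c_out          # weights + bias


def batchnorm(c):
    return 2 * c                             # learnable scale + shift


def linear(n_in, n_out):
    return n_in * n_out + n_out              # weights + bias


def lstm_direction(n_in, hidden):
    # PyTorch stores weight_ih, weight_hh, bias_ih, bias_hh for the 4 gates.
    return 4 * hidden * (n_in + hidden + 2)


KERNEL = 3         # assumption: not stated in this card
ATTN_HIDDEN = 128  # assumption: not stated in this card

conv = conv1d(126, 128, KERNEL) + batchnorm(128) + conv1d(128, 256, KERNEL) + batchnorm(256)
# Two bidirectional layers: layer 1 sees 256 features, layer 2 sees 2 * 256.
lstm = 2 * lstm_direction(256, 256) + 2 * lstm_direction(512, 256)
attention = linear(512, ATTN_HIDDEN) + linear(ATTN_HIDDEN, 1)
head = (linear(512, 512) + batchnorm(512)
        + linear(512, 256) + batchnorm(256)
        + linear(256, 77))

total = conv + lstm + attention + head
print(f"conv={conv:,} lstm={lstm:,} attention={attention:,} head={head:,} total={total:,}")
# conv=147,840 lstm=2,629,632 attention=65,793 head=415,309 total=3,258,574
```

About 80% of the parameters sit in the two-layer bidirectional LSTM, so that is where a smaller variant would save the most.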

Intended Use

Direct Use

This model is designed for:

  • Real-time ASL recognition from webcam input
  • Educational applications for learning sign language
  • Accessibility tools that support communication with deaf and hard-of-hearing signers

Limitations

  • Requires MediaPipe hand tracking as a preprocessing step
  • Trained on a specific hand landmark format (2 hands, 21 landmarks each)
  • Performance may vary with different lighting conditions, hand sizes, and camera angles
  • Currently supports 77 signs only

How to Use

Installation

pip install torch numpy mediapipe huggingface_hub

Inference Example

import torch
import numpy as np
from huggingface_hub import hf_hub_download

# Download model
model_path = hf_hub_download(repo_id="namratha2412/asl-improved-recognition", filename="best_model.pth")
label_path = hf_hub_download(repo_id="namratha2412/asl-improved-recognition", filename="label_encoder_classes.npy")

# Load model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
checkpoint = torch.load(model_path, map_location=device)

# Load architecture (you'll need the model class definition)
from your_model_file import ImprovedASLModel
label_encoder_classes = np.load(label_path, allow_pickle=True)
num_classes = len(label_encoder_classes)

model = ImprovedASLModel(num_classes=num_classes)
model.load_state_dict(checkpoint['model_state_dict'])
model = model.to(device)
model.eval()

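# Inference on a landmark sequence
# landmarks shape: (1, seq_len, 2, 21, 3)
# Placeholder input so the snippet runs end to end; replace it with real,
# preprocessed MediaPipe landmarks. seq_len = 30 is an assumed value.
landmarks = torch.zeros(1, 30, 2, 21, 3, device=device)

with torch.no_grad():
    logits = model(landmarks)
    prediction = torch.argmax(logits, dim=-1)
    predicted_sign = label_encoder_classes[prediction.item()]

print(f"Predicted sign: {predicted_sign}")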
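
With validation accuracy near 65%, it may help to look at more than the single top prediction. The following sketch converts one row of logits into probabilities and returns the k most likely signs. The helper name `top_k_signs` and the example logits and labels are made up for illustration.

```python
import numpy as np


def top_k_signs(logits, class_names, k=3):
    """Return the k most probable (label, probability) pairs for one sequence."""
    logits = np.asarray(logits, dtype=np.float64).reshape(-1)
    shifted = logits - logits.max()              # subtract the max for numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum()
    order = np.argsort(probs)[::-1][:k]          # indices by descending probability
    return [(str(class_names[i]), float(probs[i])) for i in order]


# Illustrative values only; real logits would come from model(landmarks).cpu().numpy()
example = top_k_signs([2.0, 0.5, 1.0, -1.0], ["A", "B", "hello", "yes"], k=2)
```

In a webcam loop, one option is to show a prediction only when its probability exceeds a chosen threshold and otherwise display nothing.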

Training Data

The model was trained on a custom ASL dataset containing:

  • 77 sign classes
  • Hand landmark sequences extracted using MediaPipe Holistic
  • Stratified train/validation split (80/20)
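
A stratified split keeps roughly the same 80/20 ratio inside every sign class, so rare signs still appear in validation. The sketch below shows the idea in plain Python; the function name `stratified_split`, the seed, and the toy label list are illustrative, and the author's actual splitting code is not shown in this card.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_fraction=0.2, seed=0):
    """Return (train_idx, val_idx) with about val_fraction of each class held out."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Hold out at least one sample per class, unless the class has only one.
        n_val = max(1, round(len(indices) * val_fraction)) if len(indices) > 1 else 0
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


# Toy labels: 10 x "hello", 5 x "yes", 20 x "A"
labels = ["hello"] * 10 + ["yes"] * 5 + ["A"] * 20
train_idx, val_idx = stratified_split(labels)
```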

Signs Recognized

A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z (letters)

afternoon, angry, bad, book, come, drink, eat, evening, family, feel, food, friend, go, good, goodbye, happy, have, hear, hello, help, home, know, learn, love, morning, need, night, no, please, read, sad, school, see, sign, sorry, speak, student, teacher, thank_you, think, time, tired, today, tomorrow, understand, want, water, work, write, yes, yesterday (words)

Training Procedure

Preprocessing

  1. MediaPipe Holistic extracts hand landmarks from video frames
  2. Normalize coordinates to be camera-independent
  3. Pad/truncate sequences to fixed length
  4. Create binary masks for valid frames
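
Steps 2 to 4 can be sketched as follows. The card does not specify the normalization or the fixed length, so both are assumptions here: each hand is centered on its wrist (landmark 0 in MediaPipe's hand model) and scaled by its largest wrist-to-landmark distance, and sequences are padded or truncated to a hypothetical `max_len` of 30 frames.

```python
import numpy as np


def normalize_hand(hand):
    """Center a (21, 3) hand on its wrist (landmark 0) and scale to unit size."""
    if not np.any(hand):
        return hand                              # undetected hand: keep zeros
    centered = hand - hand[0]
    scale = np.linalg.norm(centered, axis=1).max()
    return centered / scale if scale > 0 else centered


def prepare_sequence(frames, max_len=30):
    """frames: (T, 2, 21, 3) -> features (max_len, 126), mask (max_len,) bool."""
    frames = np.asarray(frames, dtype=np.float32)[:max_len]   # truncate long clips
    normed = np.zeros_like(frames)
    for t in range(frames.shape[0]):
        for h in range(frames.shape[1]):
            normed[t, h] = normalize_hand(frames[t, h])

    n = frames.shape[0]
    features = np.zeros((max_len, 126), dtype=np.float32)     # zero padding
    mask = np.zeros(max_len, dtype=bool)
    features[:n] = normed.reshape(n, -1)
    mask[:n] = True                                            # True = real frame
    return features, mask


# Example: a 12-frame clip of random stand-in landmarks.
clip = np.random.default_rng(0).random((12, 2, 21, 3))
features, mask = prepare_sequence(clip)
```

The mask lets downstream layers, such as an attention step, ignore the padded frames instead of treating zeros as real hand positions.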

Training Hyperparameters

  • Learning rate: 0.001
  • Weight decay: 0.0001
  • Batch size: 32
  • Max epochs: 100
  • Early stopping patience: 15
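
As a rough illustration of how patience-based early stopping interacts with these numbers: if the best validation accuracy occurs at epoch 27 and no later epoch improves on it, a patience of 15 stops training at epoch 42. The function below is a generic sketch with a synthetic accuracy curve, not the author's training loop, and the exact counting convention is an assumption.

```python
def run_early_stopping(val_accuracies, patience=15):
    """Return (best_epoch, stop_epoch), both 1-indexed."""
    best_acc, best_epoch = float("-inf"), 0
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc > best_acc:
            best_acc, best_epoch = acc, epoch    # a checkpoint would be saved here
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch             # no improvement for `patience` epochs
    return best_epoch, len(val_accuracies)


# Synthetic curve: accuracy rises until epoch 27, then plateaus slightly lower.
curve = [0.30 + 0.0127 * e for e in range(1, 28)] + [0.60] * 73
best_epoch, stop_epoch = run_early_stopping(curve)
print(best_epoch, stop_epoch)  # 27 42
```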

Evaluation Results

Metrics

Metric               Value
Validation Accuracy  64.61%
Training Accuracy    N/A
Best Epoch           27

Environmental Impact

  • Hardware: GPU training (CUDA-enabled)
  • Training length: ~42 epochs (stopped early; wall-clock time not reported)
  • Carbon emissions: Not measured

Citation

If you use this model, please cite:

@misc{asl-improved-recognition-2025,
  author = {Namratha},
  title = {ASL Improved Recognition Model},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/namratha2412/asl-improved-recognition}}
}

Model Card Authors

Namratha (@namratha2412)

Model Card Contact

For questions or issues, please open an issue in the model repository.
