---
library_name: pytorch
tags:
- sign-language
- computer-vision
- video-classification
- isl
- transformer
- mediapipe
- deep-learning
pipeline_tag: video-classification
license: other
language:
- en
- hi
metrics:
- accuracy
base_model: mobilenet_v3_large
---

# Model Card for Indian Sign Language Recognition System

## Model Details

### Model Description

This model is a **hierarchical two-stream transformer** designed for real-time **Indian Sign Language (ISL)** recognition. It uses a novel gating architecture to decompose the classification problem into semantic groups, improving both accuracy and inference speed. The system integrates computer vision (MediaPipe) with deep learning (PyTorch) to process visual frames and pose landmarks simultaneously.

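At a high level, each stream produces per-frame embeddings that a learned gate blends before temporal modelling and classification. The sketch below is a minimal, illustrative PyTorch version of such a gated two-stream fusion, not the repository's actual implementation; the module layout, hidden sizes, and the 80-class head are assumptions, and the hierarchical group routing is omitted.

```python
# Illustrative sketch only -- names and dimensions are assumptions, not the repo's code.
import torch
import torch.nn as nn

class GatedTwoStreamFusion(nn.Module):
    """Blend per-frame visual and pose embeddings with a learned sigmoid gate."""

    def __init__(self, visual_dim=960, pose_dim=154, d_model=256, num_classes=80):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, d_model)   # e.g. pooled MobileNetV3 features
        self.pose_conv = nn.Sequential(                      # 1D convs over the landmark sequence
            nn.Conv1d(pose_dim, d_model, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)
        self.gate = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.Sigmoid())
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, visual_feats, landmarks):
        # visual_feats: (B, T, visual_dim), landmarks: (B, T, pose_dim)
        v = self.visual_proj(visual_feats)                                  # (B, T, d_model)
        p = self.pose_conv(landmarks.transpose(1, 2)).transpose(1, 2)       # (B, T, d_model)
        g = self.gate(torch.cat([v, p], dim=-1))                            # per-frame fusion weights
        fused = g * v + (1.0 - g) * p
        pooled = self.temporal(fused).mean(dim=1)                           # temporal encoding + pooling
        return self.classifier(pooled)

# Shape check with dummy tensors: batch of 2 clips, 16 frames each
model = GatedTwoStreamFusion()
logits = model(torch.randn(2, 16, 960), torch.randn(2, 16, 154))
print(logits.shape)  # torch.Size([2, 80])
```

A sigmoid gate of this kind lets the network lean on pose landmarks when frames are uninformative (motion blur, poor lighting) and on appearance when landmarks are noisy.
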
- **Developed by:** [Developer Name/Organization]
- **Model type:** Hierarchical Two-Stream Transformer (Visual + Pose)
- **Language(s) (NLP):** English (output), Hindi/regional (translation)
- **License:** [License Information]
- **Resources for more information:**
  - [GitHub Repository](https://github.com/Abs6187/DynamicIndianSignLanguageDetection)

### Model Sources

- **Repository:** https://github.com/Abs6187/DynamicIndianSignLanguageDetection
- **Paper:** [Link to relevant paper if applicable]
- **Demo:** [Link to demo if applicable]

## Uses

### Direct Use

The model is intended for:
- Real-time ISL-to-text translation.
- Accessibility tools for the deaf and hard-of-hearing community.
- Educational platforms for learning ISL.

### Downstream Use

- Integration into video conferencing tools.
- Public service kiosk interfaces.
- Mobile applications for sign language interpretation.

### Out-of-Scope Use

- Recognition of other sign languages (ASL, BSL, etc.) without retraining.
- High-stakes medical or legal interpretation without human oversight.

## Bias, Risks, and Limitations

- **Lighting Conditions:** Performance is best under good lighting; extreme low light may reduce accuracy, although HSV augmentation mitigates this.
- **Occlusions:** Heavy occlusion of the hands may degrade performance, despite robust interpolation methods.
- **Vocabulary:** Limited to the trained vocabulary (generally 60+ signs; specific checkpoints may vary).

## How to Get Started with the Model

Use the `SignLanguageInference` class to load the model and run predictions on video files.

```python
from infer import SignLanguageInference

# Initialize
inference = SignLanguageInference(
    model_path='best_model.pth',
    metadata_path='metadata.json'
)

# Predict
result = inference.predict('video_sample.mp4')
print(f"Predicted Sign: {result['top_prediction']['class']}")
```
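
Assuming `predict` accepts any readable video path (the only interface shown above), the same predictor can be looped over a folder of clips; the `samples/` directory below is hypothetical.

```python
from pathlib import Path

from infer import SignLanguageInference

inference = SignLanguageInference(
    model_path='best_model.pth',
    metadata_path='metadata.json'
)

# Run the same predictor over every .mp4 in a (hypothetical) samples/ folder
for video in sorted(Path('samples').glob('*.mp4')):
    result = inference.predict(str(video))
    print(f"{video.name}: {result['top_prediction']['class']}")
```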

## Training Details

### Training Data

- **Source:** Custom captured dataset of 2000+ video samples.
- **Classes:** 80+ ISL signs (hierarchically grouped).
- **Participants:** 10+ diverse signers in indoor/outdoor environments.

### Preprocessing

- **MediaPipe:** Extracts 154-dimensional landmark vectors (Hands + Pose).
- **Augmentation:** 8 strategies, including background blur, color shifts, and geometry-preserving spatial transforms.
- **Normalization:** Translation, scale, and Z-score standardization (a layout-agnostic sketch follows below).

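The exact layout of the 154-dimensional landmark vector is not spelled out here, so the sketch below stays layout-agnostic: it centres each frame (translation), rescales it (scale), then standardises every feature across the clip (Z-score). It is an illustrative recipe under those assumptions, not the repository's exact preprocessing.

```python
import numpy as np

def normalize_landmarks(seq: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """seq: (T, 154) landmark sequence. Layout-agnostic normalization sketch."""
    # Translation: centre each frame around its own mean value
    centred = seq - seq.mean(axis=1, keepdims=True)
    # Scale: divide by each frame's spread so distance from the camera matters less
    scaled = centred / (np.abs(centred).max(axis=1, keepdims=True) + eps)
    # Z-score: standardise every feature across the clip
    return (scaled - scaled.mean(axis=0)) / (scaled.std(axis=0) + eps)

clip = np.random.rand(32, 154).astype(np.float32)  # dummy 32-frame clip
print(normalize_landmarks(clip).shape)             # (32, 154)
```
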
### Training Procedure

- **Architecture:**
  - **Visual Stream:** MobileNetV3 backbone -> Transformer Encoder.
  - **Pose Stream:** 1D Convolutions -> Transformer Encoder.
  - **Fusion:** Gated mechanism combining visual and pose embeddings.
- **Loss Function:** Focal Loss, to handle class imbalance (see the sketch after this list).
- **Optimization:** Mixed precision training.

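Focal loss itself is standard; a minimal multi-class version is sketched below. The γ and α values are common defaults rather than the repository's settings, and the 80-way logits are only for the shape check.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma: float = 2.0, alpha: float = 0.25):
    """Down-weight well-classified examples so rare signs contribute more to the gradient."""
    log_probs = F.log_softmax(logits, dim=-1)
    ce = F.nll_loss(log_probs, targets, reduction='none')  # per-sample cross-entropy
    pt = torch.exp(-ce)                                     # probability of the true class
    return (alpha * (1.0 - pt) ** gamma * ce).mean()

# Shape check with dummy logits for an 80-class problem
loss = focal_loss(torch.randn(8, 80), torch.randint(0, 80, (8,)))
print(loss.item())
```

Mixed-precision training in PyTorch is usually done with `torch.cuda.amp.autocast` and `torch.cuda.amp.GradScaler`; the exact configuration used here is not specified.
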
## Evaluation

### Testing Data, Factors & Metrics

### Results

| Configuration | Accuracy | Inference Time | Memory |
|---------------|----------|----------------|--------|
| **Monolithic Model** | 88.3% | 150 ms | 50 MB |
| **Hierarchical (Ours)** | **93.8%** | **95 ms** | **30 MB** |

- **Group 0 (Pronouns):** 96.2% accuracy
- **Group 1 (Objects):** 93.4% accuracy
- **Group 2 (Actions):** 91.8% accuracy

## Environmental Impact

- **Hardware Type:** Trained on GPUs (specifics N/A).
- **Inference Efficiency:** Optimized for CPU inference (approx. 95 ms/video), suitable for edge deployment.

## Technical Specifications

- **Input:**
  - Video Frames: (T, 3, H, W)
  - Landmarks: (T, 154)
- **Frameworks:** PyTorch 2.0+, TensorFlow 2.x, MediaPipe

---