---
license: apache-2.0
language:
- en
library_name: pytorch
tags:
- action-recognition
- human-action-classification
- image-classification
- computer-vision
- pose-estimation
- mediapipe
- stanford40
- resnet
- mobilenet
datasets:
- stanford40
metrics:
- accuracy
- f1
- precision
- recall
pipeline_tag: image-classification
widget:
- src: https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/person_cooking.jpg
  example_title: "Cooking"
- src: https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/person_jumping.jpg
  example_title: "Jumping"
model-index:
- name: human-action-classification
  results:
  - task:
      type: image-classification
      name: Image Classification
    dataset:
      name: Stanford 40 Actions
      type: stanford40
    metrics:
    - type: accuracy
      value: 86.4
      name: Accuracy
      verified: false
    - type: f1
      value: 0.8618
      name: Macro F1-Score
      verified: false
---

# Human Action Classification v2.0

A state-of-the-art human action recognition model trained on the Stanford 40 Actions dataset.

![Demo](demo_result.jpg)

## Model Description

This model performs real-time human action classification from images, recognizing 40 different human activities. It combines a ResNet34 backbone with optional MediaPipe pose estimation for enhanced accuracy.

- **Developed by:** Saumya Kumaar Saksena ([@dronefreak](https://github.com/dronefreak))
- **Model type:** Image Classification (Action Recognition)
- **Language(s):** English (action labels)
- **License:** Apache 2.0
- **Finetuned from:** ImageNet-pretrained ResNet34

## Key Features

- 🎯 **86.4% accuracy** on the Stanford 40 Actions test set
- ⚡ **Real-time inference** (~25ms per image on a GTX 1050 Ti)
- 🎨 **Pose-aware**: optional MediaPipe pose-estimation integration
- 📦 **Easy to use** with a simple Python API
- 🔧 **Production-ready** with comprehensive evaluation metrics

## Model Variants

All variants were trained on the Stanford 40 Actions dataset:

| Model | Accuracy | Macro F1 | Parameters | Size | Inference Time* |
|-------|----------|----------|------------|------|-----------------|
| **ResNet50** | **88.5%** | **0.8842** | 23.5M | 94MB | ~30ms |
| **ResNet34** (this model) | **86.4%** | **0.8618** | 21.3M | 85MB | ~25ms |
| ResNet18 | 82.3% | 0.8178 | 11.2M | 45MB | ~18ms |
| MobileNet V3 Large | 82.1% | 0.8169 | 5.4M | 20MB | ~15ms |
| ViT Base Patch16 | 76.8% | 0.7650 | 86M | 330MB | ~45ms |
| MobileNet V3 Small | 74.35% | 0.7350 | 2.5M | 10MB | ~10ms |

\*Single image on an NVIDIA GTX 1050 Ti

### Detailed Performance Comparison

| Model | Accuracy (%) | Macro Precision | Macro Recall | Macro F1 | Weighted F1 |
|-------|--------------|-----------------|--------------|----------|-------------|
| ResNet50 | 88.5 | 0.8874 | 0.8850 | 0.8842 | 0.8842 |
| **ResNet34** | **86.4** | **0.8686** | **0.8640** | **0.8618** | **0.8618** |
| ResNet18 | 82.3 | 0.8211 | 0.8230 | 0.8178 | 0.8178 |
| MobileNet V3 Large | 82.1 | 0.8216 | 0.8210 | 0.8169 | 0.8169 |
| ViT Base Patch16 | 76.8 | 0.7774 | 0.7680 | 0.7650 | 0.7650 |
| MobileNet V3 Small | 74.35 | 0.7382 | 0.7435 | 0.7350 | 0.7350 |

**Trade-offs:**
- **ResNet50**: Best accuracy, but slower and larger
- **ResNet34**: Optimal balance of accuracy and speed ⭐
- **MobileNet V3 Large**: Best option for mobile/edge deployment
- **MobileNet V3 Small**: Fastest inference for resource-constrained devices

## Supported Actions (40 Classes)

<details>
<summary>Click to expand full list</summary>

- applauding
- blowing_bubbles
- brushing_teeth
- cleaning_the_floor
- climbing
- cooking
- cutting_trees
- cutting_vegetables
- drinking
- feeding_a_horse
- fishing
- fixing_a_bike
- fixing_a_car
- gardening
- holding_an_umbrella
- jumping
- looking_through_a_microscope
- looking_through_a_telescope
- playing_guitar
- playing_violin
- pouring_liquid
- pushing_a_cart
- reading
- phoning
- riding_a_bike
- riding_a_horse
- rowing_a_boat
- running
- shooting_an_arrow
- smoking
- taking_photos
- texting_message
- throwing_frisby
- using_a_computer
- walking_the_dog
- washing_dishes
- watching_TV
- waving_hands
- writing_on_a_board
- writing_on_a_book

</details>

## Quick Start

### Installation

```bash
pip install git+https://github.com/dronefreak/human-action-classification.git
```

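If you only need the trained weights, you can also fetch a checkpoint directly with the Hub client. A minimal sketch; the filename is an assumption based on the evaluation example below, so verify it against this repository's file list:

```python
from huggingface_hub import hf_hub_download

# Hypothetical filename; check the repository's files before using.
checkpoint_path = hf_hub_download(
    repo_id="dronefreak/human-action-classification",
    filename="resnet34_best.pth",
)
print(checkpoint_path)
```
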
### Basic Usage

```python
from hac import ActionPredictor

# Initialize predictor
predictor = ActionPredictor(
    model_path="hf://dronefreak/human-action-classification",
    device='cuda'
)

# Predict on an image
result = predictor.predict_image('photo.jpg', top_k=3)

# Print results
print(f"Action: {result['action']['top_class']}")
print(f"Confidence: {result['action']['top_confidence']:.2%}")

# Top 3 predictions
for pred in result['action']['predictions']:
    print(f"  {pred['class']}: {pred['confidence']:.2%}")
```

### With Pose Estimation

```python
predictor = ActionPredictor(
    model_path="hf://dronefreak/human-action-classification",
    use_pose_estimation=True,  # Enable MediaPipe
    device='cuda'
)

result = predictor.predict_image('photo.jpg', return_pose=True)

print(f"Detected pose: {result['pose']['class']}")
print(f"Action: {result['action']['top_class']}")
```

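For context, the pose stage roughly corresponds to running MediaPipe's standalone pose solution, as in the sketch below (independent of the `hac` package; assumes `mediapipe` and `opencv-python` are installed):

```python
import cv2
import mediapipe as mp

# Standalone MediaPipe pose detection, roughly what the pose stage wraps.
image = cv2.imread("photo.jpg")
with mp.solutions.pose.Pose(static_image_mode=True) as pose:
    # MediaPipe expects RGB input; OpenCV loads images as BGR.
    results = pose.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))

if results.pose_landmarks:
    # 33 landmarks with normalized coordinates and a visibility score.
    nose = results.pose_landmarks.landmark[0]
    print(f"Nose at ({nose.x:.3f}, {nose.y:.3f}), visibility {nose.visibility:.2f}")
```
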
### Batch Prediction

```python
from pathlib import Path

image_paths = list(Path('images/').glob('*.jpg'))
results = predictor.predict_batch(image_paths, batch_size=32)

for img_path, result in zip(image_paths, results):
    print(f"{img_path.name}: {result['action']['top_class']}")
```

## Performance Metrics

Evaluated on the Stanford 40 Actions test set (5,532 images):

| Metric | Score |
|--------|-------|
| **Accuracy** | **86.4%** |
| Macro F1-Score | 0.8618 |
| Weighted F1-Score | 0.8618 |
| Macro Precision | 0.8686 |
| Macro Recall | 0.8640 |

### Top Performing Classes

| Class | F1-Score |
|-------|----------|
| Applauding | 0.935 |
| Jumping | 0.925 |
| Running | 0.918 |
| Waving Hands | 0.912 |
| Drinking | 0.905 |

### Confusion Analysis

Most commonly confused actions:
- Cooking ↔ Washing Dishes (similar kitchen setting)
- Reading ↔ Using Computer (similar seated poses)
- Fixing Bike ↔ Fixing Car (similar repair actions)

Full metrics are available in [metrics.json](metrics.json).

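To reproduce this kind of confusion analysis on your own data, a small scikit-learn sketch is enough (`y_true`/`y_pred` are hypothetical label lists you would collect from your own evaluation loop):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def most_confused_pairs(y_true, y_pred, class_names, top=3):
    """Return the `top` most frequent (true, predicted) misclassification pairs."""
    cm = confusion_matrix(y_true, y_pred)
    np.fill_diagonal(cm, 0)  # drop correct predictions
    flat = cm.ravel()
    order = flat.argsort()[::-1][:top]
    n = cm.shape[1]
    return [(class_names[i // n], class_names[i % n], int(flat[i])) for i in order]

# Toy example with three classes:
names = ["cooking", "reading", "washing_dishes"]
print(most_confused_pairs([0, 0, 1, 2], [2, 2, 1, 1], names))
```
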
## Training Details

### Training Data

- **Dataset:** Stanford 40 Actions
- **Training split:** ~4,000 images
- **Test split:** ~5,532 images
- **Classes:** 40 human action categories
- **Image resolution:** 224×224 (resized)

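The splits can be consumed as ordinary image folders; a sketch with torchvision (the `stanford40/train/<class_name>/` layout is an assumption, so check the repository's data-preparation scripts):

```python
from torchvision import datasets, transforms

# Assumes an ImageFolder-style layout: stanford40/train/<class_name>/*.jpg
# (hypothetical; verify against the repository's data-preparation scripts).
train_set = datasets.ImageFolder("stanford40/train", transform=transforms.ToTensor())
print(len(train_set.classes))  # expected: 40
```
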
### Training Procedure

#### Preprocessing

```python
from torchvision import transforms

# Training augmentation
transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])
```

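At inference time, the usual ImageNet-style evaluation transform is the natural counterpart (a sketch; the released pipeline may differ in its exact resize strategy):

```python
from torchvision import transforms

# Deterministic evaluation preprocessing (assumed counterpart to the
# training augmentation above; resize details may differ in the repo).
eval_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```
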
#### Training Hyperparameters

- **Backbone:** ResNet34 (ImageNet pretrained)
- **Optimizer:** AdamW
- **Learning rate:** 1e-3 → 1e-5 (cosine decay)
- **Weight decay:** 1e-3
- **Batch size:** 32
- **Epochs:** 200
- **Augmentation:** Mixup (α=0.4)
- **Scheduler:** CosineAnnealingLR

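In PyTorch terms, these settings map onto roughly the following setup (a sketch, not the repository's training script; the 40-class head replacement is an assumption):

```python
import torch.nn as nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR
from torchvision.models import resnet34

# ResNet34 backbone with a 40-class head for Stanford 40.
model = resnet34(weights="IMAGENET1K_V1")
model.fc = nn.Linear(model.fc.in_features, 40)

optimizer = AdamW(model.parameters(), lr=1e-3, weight_decay=1e-3)
# Cosine decay from 1e-3 down to 1e-5 over 200 epochs.
scheduler = CosineAnnealingLR(optimizer, T_max=200, eta_min=1e-5)

for epoch in range(200):
    # ... one training epoch (with Mixup, α=0.4) goes here ...
    scheduler.step()
```
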
#### Hardware

- **GPU:** NVIDIA GTX 1050 Ti (4GB)
- **Training time:** ~4 hours
- **Framework:** PyTorch 2.0+

### Two-Stage Training Strategy

1. **Stage 1 (20 epochs):** Freeze the backbone and train only the classifier head
2. **Stage 2 (180 epochs):** Fine-tune the entire network with Mixup

This approach substantially reduced overfitting: train/test accuracy went from 99%/62% to 82%/86%.

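The freezing logic amounts to toggling `requires_grad` (a sketch continuing the optimizer setup above, where `model` is the ResNet34 with its replaced head):

```python
# Stage 1: freeze the backbone, train only the classifier head.
for param in model.parameters():
    param.requires_grad = False
for param in model.fc.parameters():
    param.requires_grad = True
# ... train for 20 epochs ...

# Stage 2: unfreeze everything and fine-tune with Mixup.
for param in model.parameters():
    param.requires_grad = True
# ... train for the remaining 180 epochs ...
```
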
## Evaluation

```python
from hac.evaluation import evaluate_model

# Evaluate on the test set
metrics = evaluate_model(
    checkpoint='resnet34_best.pth',
    data_dir='stanford40/',
    split='test'
)

print(f"Accuracy: {metrics['accuracy']:.2%}")
print(f"F1-Score: {metrics['f1_macro']:.4f}")
```

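If you evaluate outside the package, the headline numbers reduce to two scikit-learn calls (a sketch; `y_true`/`y_pred` are hypothetical label lists from your own test loop):

```python
from sklearn.metrics import accuracy_score, f1_score

def headline_metrics(y_true, y_pred):
    """Accuracy and macro F1, matching the metrics reported above."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "f1_macro": f1_score(y_true, y_pred, average="macro"),
    }

print(headline_metrics([0, 1, 2, 2], [0, 1, 2, 1]))
```
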
## Environmental Impact

- **Hardware:** 1× NVIDIA GTX 1050 Ti
- **Training time:** ~4 hours
- **Estimated CO2 emissions:** ~0.5 kg CO2eq

## Limitations

- Trained on Stanford 40, which has limited diversity
- Performs best on everyday indoor/outdoor activities
- May struggle with unusual camera angles or occlusions
- Requires a clear view of the person performing the action
- Not suitable for fine-grained action recognition (e.g., distinguishing different sports moves)

## Bias and Fairness

The model inherits biases from the Stanford 40 dataset:
- Limited demographic diversity
- Western-centric activities
- Imbalanced class distribution

Users should evaluate performance on their specific use case.

## Citation

```bibtex
@software{saksena2025hac,
  author = {Saksena, Saumya Kumaar},
  title = {Human Action Classification v2.0},
  year = {2025},
  url = {https://github.com/dronefreak/human-action-classification},
  version = {2.0}
}
```

## Model Card Authors

Saumya Kumaar Saksena

## Model Card Contact

- GitHub: [@dronefreak](https://github.com/dronefreak)
- Repository: [human-action-classification](https://github.com/dronefreak/human-action-classification)

## Additional Resources

- [GitHub Repository](https://github.com/dronefreak/human-action-classification)
- [Demo Notebook](https://github.com/dronefreak/human-action-classification/blob/main/notebooks/demo.ipynb)
- [Training Code](https://github.com/dronefreak/human-action-classification/blob/main/src/hac/training/train.py)
- [Evaluation Metrics](metrics.json)

## License

Apache License 2.0 - free for research and commercial use.

See [LICENSE](LICENSE) for full details.