+---
+license: apache-2.0
+language:
+- en
+library_name: pytorch
+tags:
+- action-recognition
+- human-action-classification
+- image-classification
+- computer-vision
+- pose-estimation
+- mediapipe
+- stanford40
+- resnet
+- mobilenet
+datasets:
+- stanford40
+metrics:
+- accuracy
+- f1
+- precision
+- recall
+pipeline_tag: image-classification
+widget:
+- src: https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/person_cooking.jpg
+  example_title: "Cooking"
+- src: https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/person_jumping.jpg
+  example_title: "Jumping"
+model-index:
+- name: human-action-classification
+  results:
+  - task:
+      type: image-classification
+      name: Image Classification
+    dataset:
+      name: Stanford 40 Actions
+      type: stanford40
+    metrics:
+    - type: accuracy
+      value: 86.4
+      name: Accuracy
+      verified: false
+    - type: f1
+      value: 0.8618
+      name: Macro F1-Score
+      verified: false
+---
+# Human Action Classification v2.0
+Human action recognition model trained on the Stanford 40 Actions dataset, reaching 86.4% accuracy on its test set.
+![Demo](demo_result.jpg)
+## Model Description
+This model performs real-time human action classification from images, recognizing 40 different human activities. It combines a ResNet34 backbone with optional MediaPipe pose estimation for enhanced accuracy.
+- **Developed by:** Saumya Kumaar Saksena ([@dronefreak](https://github.com/dronefreak))
+- **Model type:** Image Classification (Action Recognition)
+- **Language(s):** English (action labels)
+- **License:** Apache 2.0
+- **Finetuned from:** ImageNet pretrained ResNet34
+## Key Features
+- 🎯 **86% accuracy** on Stanford 40 Actions test set
+- ⚡ **Real-time inference** (~25ms per image on GTX 1050 Ti)
+- 🎨 **Pose-aware** with optional MediaPipe integration
+- 📦 **Easy to use** with simple Python API
+- 🔧 **Production-ready** with comprehensive evaluation metrics
+## Model Variants
+All models were trained on the Stanford 40 Actions dataset:
+| Model | Accuracy | Macro F1 | Parameters | Size | Inference Time* |
+|-------|----------|----------|-----------|------|-----------------|
+| **ResNet50** | **88.5%** | **0.8842** | 23.5M | 94MB | ~30ms |
+| **ResNet34** (this model) | **86.4%** | **0.8618** | 21.3M | 85MB | ~25ms |
+| ResNet18 | 82.3% | 0.8178 | 11.2M | 45MB | ~18ms |
+| MobileNet V3 Large | 82.1% | 0.8169 | 5.4M | 20MB | ~15ms |
+| ViT Base Patch16 | 76.8% | 0.7650 | 86M | 330MB | ~45ms |
+| MobileNet V3 Small | 74.35% | 0.7350 | 2.5M | 10MB | ~10ms |
+*Single image on NVIDIA GTX 1050 Ti
+### Detailed Performance Comparison
+| Model | Accuracy (%) | Macro Precision | Macro Recall | Macro F1 | Weighted F1 |
+|-------|--------------|-----------------|--------------|----------|-------------|
+| ResNet50 | 88.5 | 0.8874 | 0.8850 | 0.8842 | 0.8842 |
+| **ResNet34** | **86.4** | **0.8686** | **0.8640** | **0.8618** | **0.8618** |
+| ResNet18 | 82.3 | 0.8211 | 0.8230 | 0.8178 | 0.8178 |
+| MobileNet V3 Large | 82.1 | 0.8216 | 0.8210 | 0.8169 | 0.8169 |
+| ViT Base Patch16 | 76.8 | 0.7774 | 0.7680 | 0.7650 | 0.7650 |
+| MobileNet V3 Small | 74.35 | 0.7382 | 0.7435 | 0.7350 | 0.7350 |
+**Trade-offs:**
+- **ResNet50**: Best accuracy but slower and larger
+- **ResNet34**: Optimal balance of accuracy and speed ⭐
+- **MobileNet V3 Large**: Best mobile/edge deployment option
+- **MobileNet V3 Small**: Fastest inference for resource-constrained devices
+## Supported Actions (40 Classes)
+<details>
+<summary>Click to expand full list</summary>
+- applauding
+- blowing_bubbles
+- brushing_teeth
+- cleaning_the_floor
+- climbing
+- cooking
+- cutting_trees
+- cutting_vegetables
+- drinking
+- feeding_a_horse
+- fishing
+- fixing_a_bike
+- fixing_a_car
+- gardening
+- holding_an_umbrella
+- jumping
+- looking_through_a_microscope
+- looking_through_a_telescope
+- playing_guitar
+- playing_violin
+- pouring_liquid
+- pushing_a_cart
+- reading
+- phoning
+- riding_a_bike
+- riding_a_horse
+- rowing_a_boat
+- running
+- shooting_an_arrow
+- smoking
+- taking_photos
+- texting_message
+- throwing_frisby
+- using_a_computer
+- walking_the_dog
+- washing_dishes
+- watching_TV
+- waving_hands
+- writing_on_a_board
+- writing_on_a_book
+</details>
+## Quick Start
+### Installation
+```bash
+pip install git+https://github.com/dronefreak/human-action-classification.git
+```
+### Basic Usage
+```python
+from hac import ActionPredictor
+# Initialize predictor
+predictor = ActionPredictor(
+    model_path="hf://dronefreak/human-action-classification",
+    device='cuda'
+)
+# Predict on image
+result = predictor.predict_image('photo.jpg', top_k=3)
+# Print results
+print(f"Action: {result['action']['top_class']}")
+print(f"Confidence: {result['action']['top_confidence']:.2%}")
+# Top 3 predictions
+for pred in result['action']['predictions']:
+    print(f"  {pred['class']}: {pred['confidence']:.2%}")
+```
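For readers wondering how `top_k` confidences like those above are commonly produced, here is a minimal sketch. It assumes the classifier emits raw logits and that confidences are softmax probabilities; that is typical practice, not confirmed behavior of the `hac` library. The helper `top_k_predictions` and the sample logits are hypothetical.

```python
import math

def top_k_predictions(logits, class_names, k=3):
    """Convert raw logits to softmax probabilities and return the k most likely classes."""
    # Subtract the max logit before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Rank classes by probability, highest first.
    ranked = sorted(zip(class_names, probs), key=lambda p: p[1], reverse=True)
    return [{"class": name, "confidence": p} for name, p in ranked[:k]]

# Hypothetical logits for four of the 40 classes.
names = ["cooking", "washing_dishes", "reading", "jumping"]
preds = top_k_predictions([2.0, 1.0, 0.1, -1.0], names, k=3)
for pred in preds:
    print(f"  {pred['class']}: {pred['confidence']:.2%}")
```

The returned dictionaries reuse the `class` and `confidence` keys from the example above, purely for illustration.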
+### With Pose Estimation
+```python
+from hac import ActionPredictor
+# Initialize predictor with pose estimation enabled
+predictor = ActionPredictor(
+    model_path="hf://dronefreak/human-action-classification",
+    use_pose_estimation=True,  # Enable MediaPipe
+    device='cuda'
+)
+result = predictor.predict_image('photo.jpg', return_pose=True)
+print(f"Detected pose: {result['pose']['class']}")
+print(f"Action: {result['action']['top_class']}")
+```
+### Batch Prediction
+```python
+from pathlib import Path
+image_paths = list(Path('images/').glob('*.jpg'))
+# Reuses the `predictor` initialized in the examples above
+results = predictor.predict_batch(image_paths, batch_size=32)
+for img_path, result in zip(image_paths, results):
+    print(f"{img_path.name}: {result['action']['top_class']}")
+```
+## Performance Metrics
+Evaluated on Stanford 40 Actions test set (5,532 images):
+| Metric | Score |
+|--------|-------|
+| **Accuracy** | **86.4%** |
+| Macro F1-Score | 0.8618 |
+| Weighted F1-Score | 0.8618 |
+| Macro Precision | 0.8686 |
+| Macro Recall | 0.8640 |
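To make the difference between the macro and weighted F1 rows concrete, here is a small pure-Python sketch. Macro F1 averages the per-class F1 scores equally, while weighted F1 weights each class by its number of true samples (its support). The function `f1_scores` and the toy labels are illustrative only and do not come from the repository's evaluation code.

```python
from collections import Counter

def f1_scores(y_true, y_pred):
    """Return (macro_f1, weighted_f1) for two equal-length label sequences."""
    classes = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)  # number of true samples per class
    per_class = {}
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        per_class[c] = 2 * tp / denom if denom else 0.0
    # Macro: plain mean over classes. Weighted: mean weighted by class support.
    macro = sum(per_class.values()) / len(classes)
    weighted = sum(per_class[c] * support[c] for c in classes) / len(y_true)
    return macro, weighted

# Toy, deliberately imbalanced labels (4 cooking, 1 reading, 1 jumping).
y_true = ["cooking", "cooking", "cooking", "cooking", "reading", "jumping"]
y_pred = ["cooking", "cooking", "cooking", "reading", "reading", "jumping"]
macro_f1, weighted_f1 = f1_scores(y_true, y_pred)
print(f"Macro F1: {macro_f1:.4f}, Weighted F1: {weighted_f1:.4f}")
```

The two averages coincide whenever every class has the same support, and diverge on imbalanced data like the toy labels here.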
+### Top Performing Classes
+| Class | F1-Score |
+|-------|----------|
+| Applauding | 0.935 |
+| Jumping | 0.925 |
+| Running | 0.918 |
+| Waving Hands | 0.912 |
+| Drinking | 0.905 |
+### Confusion Analysis
+Most commonly confused actions:
+- Cooking ↔ Washing Dishes (similar kitchen setting)
+- Reading ↔ Using Computer (similar seated poses)
+- Fixing Bike ↔ Fixing Car (similar repair actions)
+Full metrics are available in [metrics.json](metrics.json).
+## Training Details
+### Training Data
+- **Dataset:** Stanford 40 Actions
+- **Training split:** 4,000 images
+- **Test split:** 5,532 images
+- **Classes:** 40 human action categories
+- **Image resolution:** 224×224 (resized)
+### Training Procedure
+#### Preprocessing
+```python
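+from torchvision import transforms
+# Training augmentation: random crop/flip and color jitter, then ImageNet normalization
+train_transform = transforms.Compose([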
+    transforms.RandomResizedCrop(224),
+    transforms.RandomHorizontalFlip(),
+    transforms.ColorJitter(brightness=0.2, contrast=0.2),
+    transforms.ToTensor(),
+    transforms.Normalize(mean=[0.485, 0.456, 0.406],
+                        std=[0.229, 0.224, 0.225])
+])
+```
+#### Training Hyperparameters
+- **Backbone:** ResNet34 (ImageNet pretrained)
+- **Optimizer:** AdamW
+- **Learning rate:** 1e-3 → 1e-5 (cosine decay)
+- **Weight decay:** 1e-3
+- **Batch size:** 32
+- **Epochs:** 200
+- **Augmentation:** Mixup (α=0.4)
+- **Scheduler:** CosineAnnealingLR
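The 1e-3 → 1e-5 cosine decay listed above can be illustrated with the closed-form cosine annealing formula. This is a pure-Python sketch that assumes a single schedule spanning all 200 epochs with no restarts; the repository's training script may configure it differently, for example per training stage. `cosine_lr` is a hypothetical helper name.

```python
import math

def cosine_lr(epoch, total_epochs=200, lr_max=1e-3, lr_min=1e-5):
    """Learning rate at `epoch` under cosine annealing from lr_max down to lr_min."""
    progress = epoch / total_epochs
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

# Print the learning rate at a few points along the schedule.
for epoch in (0, 50, 100, 150, 200):
    print(f"epoch {epoch:3d}: lr = {cosine_lr(epoch):.2e}")
```

The rate starts at `lr_max`, passes the midpoint of the two bounds halfway through, and flattens out as it approaches `lr_min`.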
+#### Hardware
+- **GPU:** NVIDIA GTX 1050 Ti (4GB)
+- **Training time:** ~4 hours
+- **Framework:** PyTorch 2.0+
+### Two-Stage Training Strategy
+1. **Stage 1 (20 epochs):** Freeze backbone, train classifier head
+2. **Stage 2 (180 epochs):** Fine-tune entire network with Mixup
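+This approach narrowed the train/test accuracy gap from 99% train / 62% test to 82% train / 86% test. Training accuracy can fall below test accuracy here, likely because it is measured on Mixup-blended samples.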
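Mixup, used in Stage 2 with α=0.4, blends pairs of training examples and their one-hot labels using a weight drawn from Beta(α, α). The following is a dependency-free sketch on toy vectors, not the repository's implementation; `mixup_pair` is a hypothetical name and the inputs stand in for image tensors.

```python
import random

def mixup_pair(x1, y1, x2, y2, alpha=0.4, rng=random):
    """Blend two (input, one-hot label) pairs with a Beta(alpha, alpha) weight."""
    lam = rng.betavariate(alpha, alpha)
    # The same mixing weight is applied to the inputs and to the labels.
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam

rng = random.Random(0)  # seeded for reproducibility
x_mixed, y_mixed, lam = mixup_pair(
    [1.0, 0.0, 0.5], [1.0, 0.0],   # toy "image" and one-hot label for class 0
    [0.0, 1.0, 0.5], [0.0, 1.0],   # toy "image" and one-hot label for class 1
    rng=rng,
)
print(f"lambda = {lam:.3f}, mixed label = {y_mixed}")
```

With α below 1 the Beta distribution is U-shaped, so most sampled weights sit near 0 or 1 and each blended example stays close to one of its two sources.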
+## Evaluation
+```python
+from hac.evaluation import evaluate_model
+# Evaluate on test set
+metrics = evaluate_model(
+    checkpoint='resnet34_best.pth',
+    data_dir='stanford40/',
+    split='test'
+)
+print(f"Accuracy: {metrics['accuracy']:.2%}")
+print(f"F1-Score: {metrics['f1_macro']:.4f}")
+```
+## Environmental Impact
+- **Hardware:** 1× NVIDIA GTX 1050 Ti
+- **Training time:** ~4 hours
+- **Estimated CO2 emissions:** ~0.5 kg CO2eq
+## Limitations
+- Trained only on Stanford 40, which has limited diversity
+- Performs best on everyday activities similar to the 40 training classes
+- May struggle with unusual camera angles or occlusions
+- Requires a clear view of the person performing the action
+- Not suitable for fine-grained action recognition (e.g., different sports moves)
+## Bias and Fairness
+The model inherits biases from the Stanford 40 dataset:
+- Limited demographic diversity
+- Western-centric activities
+- Imbalanced class distribution
+Users should evaluate performance on their specific use case.
+## Citation
+```bibtex
+@software{saksena2025hac,
+  author = {Saksena, Saumya Kumaar},
+  title = {Human Action Classification v2.0},
+  year = {2025},
+  url = {https://github.com/dronefreak/human-action-classification},
+  version = {2.0}
+}
+```
+## Model Card Authors
+Saumya Kumaar Saksena
+## Model Card Contact
+- GitHub: [@dronefreak](https://github.com/dronefreak)
+- Repository: [human-action-classification](https://github.com/dronefreak/human-action-classification)
+## Additional Resources
+- [GitHub Repository](https://github.com/dronefreak/human-action-classification)
+- [Demo Notebook](https://github.com/dronefreak/human-action-classification/blob/main/notebooks/demo.ipynb)
+- [Training Code](https://github.com/dronefreak/human-action-classification/blob/main/src/hac/training/train.py)
+- [Evaluation Metrics](metrics.json)
+## License
+Apache License 2.0 - Free for research and commercial use.
+See [LICENSE](LICENSE) for full details.