---
tags:
- video-classification
- action-recognition
- mc3
- hmdb51
- pytorch
- computer-vision
- spatiotemporal
- 3dcnn
library_name: pytorch
datasets:
- hmdb51
metrics:
- accuracy
- f1
- precision
pipeline_tag: video-classification
license: apache-2.0
language:
- en
---
# MC3-18 HMDB51 (Kinetics-400 Init)

## Model Description

MC3-18 (Mixed Convolution 3D) fine-tuned on HMDB51 split 1 for human action recognition. The model was initialized with Kinetics-400 pretrained weights and adapted to HMDB51's shorter video clips.

**Validation Accuracy: 56.34%**

This is a reference baseline implementation. State of the art on HMDB51 split 1 is approximately 70-75%, achieved with ensemble methods, test-time augmentation, and multi-crop evaluation.

## Model Details

- **Architecture:** MC3-18 (11.7M parameters)
- **Initialization:** Kinetics-400 pretrained weights
- **Dataset:** HMDB51 split 1
  - Train: 3,570 videos across 51 action classes
  - Validation: 1,530 videos
- **Input:** RGB video clips (8 frames, 112x112 spatial resolution)
- **Output:** 51-class action predictions
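
The input/output contract above can be sanity-checked with a random tensor (a minimal sketch; the 51-way head is attached the same way as in the Usage section below):

```python
import torch
from torchvision.models.video import mc3_18

# Build the backbone and attach a 51-way classification head
model = mc3_18(weights=None)
model.fc = torch.nn.Linear(model.fc.in_features, 51)
model.eval()

clip = torch.randn(1, 3, 8, 112, 112)  # (batch, channels, frames, height, width)
print(model(clip).shape)  # torch.Size([1, 51])
```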

## Training Configuration

```yaml
Frames: 8
Frame Interval: 1
Spatial Size: 112x112
Batch Size: 16
Epochs: 150
Learning Rate: 0.0003
Weight Decay: 3e-3
Optimizer: SGD (momentum=0.9)
```
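
In PyTorch terms, this configuration corresponds roughly to the setup below. This is a hedged sketch, not the actual training script; the label-smoothing value is taken from the augmentation list that follows.

```python
import torch
from torchvision.models.video import mc3_18, MC3_18_Weights

# Kinetics-400 initialization with a fresh 51-way head
model = mc3_18(weights=MC3_18_Weights.KINETICS400_V1)
model.fc = torch.nn.Linear(model.fc.in_features, 51)

# SGD (momentum=0.9), LR 0.0003, weight decay 3e-3, per the config above
optimizer = torch.optim.SGD(model.parameters(), lr=3e-4,
                            momentum=0.9, weight_decay=3e-3)
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.15)
```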

**Augmentation:**
- MixUp (alpha=0.6)
- CutMix (alpha=1.0)
- Label Smoothing (0.15)
- Random horizontal flip
- Color jitter
- Random grayscale
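
To illustrate the MixUp entry, here is a minimal sketch for video batches, assuming clips shaped `(B, C, T, H, W)`; the `mixup` helper is illustrative, not the exact augmentation code used in training:

```python
import torch

def mixup(clips: torch.Tensor, labels: torch.Tensor, alpha: float = 0.6):
    """Illustrative MixUp for a batch of clips shaped (B, C, T, H, W)."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(clips.size(0))
    mixed = lam * clips + (1.0 - lam) * clips[perm]
    # Train with: lam * loss(out, labels) + (1 - lam) * loss(out, labels[perm])
    return mixed, labels, labels[perm], lam
```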

## Performance

| Metric | Value |
|--------|-------|
| Validation Accuracy | 56.34% |
| Training Accuracy | ~75% |
| Train-Val Gap | ~19% |
| Val F1 Score | ~0.54 |
| Val Precision | ~0.55 |

## Overfitting Analysis

The ~19% train-validation gap indicates significant overfitting, which is expected given:
- HMDB51's small size (only ~70 training videos per class in split 1)
- MC3-18's relatively large capacity (11.7M parameters)
- Even with strong augmentation and regularization, the model memorizes training data

This gap is typical for HMDB51 and difficult to eliminate without:
- Larger pretraining datasets
- Ensemble methods
- More aggressive regularization techniques
- Reduced model capacity

## Design Choices

**Why num_frames=8 and frame_interval=1?**

HMDB51 contains many short videos (some as short as 10-20 frames). Using a smaller temporal window (8 frames at interval 1, i.e. 8 consecutive frames) prevents:
- Frame repetition/tiling for short videos
- Loss of temporal information

This differs from the Kinetics-400 pretraining (which typically uses 16 frames), but adapts to HMDB51's characteristics.
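
A hedged sketch of what consecutive-frame sampling looks like under these settings; `sample_clip` is an assumed helper, not the actual data loader, and falls back to wrap-around tiling only when a video is shorter than the window:

```python
import torch

def sample_clip(video: torch.Tensor, num_frames: int = 8, interval: int = 1) -> torch.Tensor:
    """Sample `num_frames` frames at `interval` from a (T, C, H, W) video."""
    t = video.shape[0]
    span = (num_frames - 1) * interval + 1
    if t >= span:
        # Random temporal crop: enough frames available, no repetition needed
        start = torch.randint(0, t - span + 1, (1,)).item()
        idx = start + torch.arange(num_frames) * interval
    else:
        # Very short video: wrap around (the frame tiling mentioned above)
        idx = (torch.arange(num_frames) * interval) % t
    return video[idx]
```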

## Usage

```python
import torch
import cv2
from torchvision.models.video import mc3_18
from torchvision import transforms

# Load model
model = mc3_18(weights=None)
model.fc = torch.nn.Linear(model.fc.in_features, 51)
checkpoint = torch.load('best.pth', map_location='cpu')
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Preprocessing (Kinetics-400 normalization statistics)
transform = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((128, 171)),
    transforms.CenterCrop(112),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.43216, 0.394666, 0.37645],
                         std=[0.22803, 0.22145, 0.216989])
])

# Load 8 consecutive RGB frames from a video ('video.avi' is an example path)
cap = cv2.VideoCapture('video.avi')
frames = []
while len(frames) < 8:
    ret, frame = cap.read()
    if not ret:
        break
    frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))  # OpenCV decodes to BGR
cap.release()

frames = [transform(frame) for frame in frames]
# (T, C, H, W) -> (1, C, T, H, W), the layout torchvision video models expect
video_tensor = torch.stack(frames).permute(1, 0, 2, 3).unsqueeze(0)  # (1, 3, 8, 112, 112)

# Inference
with torch.no_grad():
    output = model(video_tensor)
    pred = output.argmax(dim=1)
```

## Alternative Approach

We also provide a model initialized from UCF-101 instead of Kinetics-400. That model achieves 55.46% validation accuracy with better generalization (~13% train-val gap vs ~19%).

See: `mc3-18-hmdb51-ucf-transfer`

**Kinetics vs UCF-101 initialization:**
- Kinetics: larger pretraining dataset, optimized here for short clips (8 frames)
- UCF-101: closer domain to HMDB51 and better generalization, but requires 16-frame inputs (which forces frame tiling on short videos)

## Limitations

- Overfits on small datasets (~19% train-val gap)
- Single model without ensembling
- No test-time augmentation (multi-crop, temporal sampling)
- Optimized for 8-frame inputs (may not generalize to different temporal windows)
- Trained on HMDB51 split 1 only (performance may vary on splits 2 and 3)

## HMDB51 Classes

The model predicts the following 51 action classes: brush_hair, cartwheel, catch, chew, clap, climb, climb_stairs, dive, draw_sword, dribble, drink, eat, fall_floor, fencing, flic_flac, golf, handstand, hit, hug, jump, kick, kick_ball, kiss, laugh, pick, pour, pullup, punch, push, pushup, ride_bike, ride_horse, run, shake_hands, shoot_ball, shoot_bow, shoot_gun, sit, situp, smile, smoke, somersault, stand, swing_baseball, sword, sword_exercise, talk, throw, turn, walk, wave.
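
If the training pipeline indexed classes alphabetically (an assumption; verify against your own checkpoint), the predicted index from the Usage example can be mapped back to a name:

```python
# Hypothetical index-to-label mapping: assumes alphabetical class ordering
# at training time -- verify against your own pipeline before relying on it.
HMDB51_CLASSES = sorted([
    "brush_hair", "cartwheel", "catch", "chew", "clap", "climb",
    "climb_stairs", "dive", "draw_sword", "dribble", "drink", "eat",
    "fall_floor", "fencing", "flic_flac", "golf", "handstand", "hit",
    "hug", "jump", "kick", "kick_ball", "kiss", "laugh", "pick", "pour",
    "pullup", "punch", "push", "pushup", "ride_bike", "ride_horse", "run",
    "shake_hands", "shoot_ball", "shoot_bow", "shoot_gun", "sit", "situp",
    "smile", "smoke", "somersault", "stand", "swing_baseball", "sword",
    "sword_exercise", "talk", "throw", "turn", "walk", "wave",
])

label = HMDB51_CLASSES[pred.item()]  # `pred` from the Usage example above
```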

## Training Details

- Framework: PyTorch
- Hardware: Single GPU (CUDA)
- Training Time: ~2 hours (150 epochs)
- Convergence: Best model saved at epoch ~85-90

## Citation

If you use this model, please cite the original HMDB51 dataset:

```bibtex
@inproceedings{kuehne2011hmdb,
  title={HMDB: a large video database for human motion recognition},
  author={Kuehne, Hildegard and Jhuang, Hueihan and Garrote, Est{\'\i}baliz and Poggio, Tomaso and Serre, Thomas},
  booktitle={2011 International Conference on Computer Vision},
  pages={2556--2563},
  year={2011},
  organization={IEEE}
}
```

And the MC3 architecture:

```bibtex
@inproceedings{tran2018closer,
  title={A closer look at spatiotemporal convolutions for action recognition},
  author={Tran, Du and Wang, Heng and Torresani, Lorenzo and Ray, Jamie and LeCun, Yann and Paluri, Manohar},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
  pages={6450--6459},
  year={2018}
}
```

## License

- Model weights: Apache-2.0
- Code: Apache-2.0
- HMDB51 dataset: subject to the original dataset license