---
tags:
- video-classification
- action-recognition
- mc3
- hmdb51
- pytorch
- computer-vision
- spatiotemporal
- 3dcnn
library_name: pytorch
datasets:
- hmdb51
metrics:
- accuracy
- f1
- precision
pipeline_tag: video-classification
license: apache-2.0
language:
- en
---
# MC3-18 HMDB51 (Kinetics-400 Init)

## Model Description

MC3-18 (Mixed Convolution 3D) fine-tuned on HMDB51 split 1 for human action recognition. The model was initialized with Kinetics-400 pretrained weights and adapted to HMDB51's shorter video clips.

**Validation Accuracy: 56.34%**

This is a reference baseline implementation. State of the art on HMDB51 split 1 is approximately 70-75%, achieved with ensemble methods, test-time augmentation, and multi-crop evaluation.

## Model Details

- **Architecture:** MC3-18 (11.7M parameters)
- **Initialization:** Kinetics-400 pretrained weights
- **Dataset:** HMDB51 split 1
  - Train: 3,570 videos across 51 action classes
  - Validation: 1,530 videos
- **Input:** RGB video clips (8 frames, 112x112 spatial resolution)
- **Output:** 51-class action predictions
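
The input/output contract above can be sanity-checked with a random tensor (a minimal sketch; the 51-way head is attached the same way as in the Usage section below):

```python
import torch
from torchvision.models.video import mc3_18

# Build the backbone and attach a 51-way classification head
model = mc3_18(weights=None)
model.fc = torch.nn.Linear(model.fc.in_features, 51)
model.eval()

clip = torch.randn(1, 3, 8, 112, 112)  # (batch, channels, frames, height, width)
print(model(clip).shape)  # torch.Size([1, 51])
```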

## Training Configuration

```yaml
Frames: 8
Frame Interval: 1
Spatial Size: 112x112
Batch Size: 16
Epochs: 150
Learning Rate: 0.0003
Weight Decay: 3e-3
Optimizer: SGD (momentum=0.9)
```
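
In PyTorch terms, this configuration corresponds roughly to the setup below. This is a hedged sketch, not the actual training script; the label-smoothing value is taken from the augmentation list that follows.

```python
import torch
from torchvision.models.video import mc3_18, MC3_18_Weights

# Kinetics-400 initialization with a fresh 51-way head
model = mc3_18(weights=MC3_18_Weights.KINETICS400_V1)
model.fc = torch.nn.Linear(model.fc.in_features, 51)

# SGD (momentum=0.9), LR 0.0003, weight decay 3e-3, per the config above
optimizer = torch.optim.SGD(model.parameters(), lr=3e-4,
                            momentum=0.9, weight_decay=3e-3)
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.15)
```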

**Augmentation:**
- MixUp (alpha=0.6)
- CutMix (alpha=1.0)
- Label Smoothing (0.15)
- Random horizontal flip
- Color jitter
- Random grayscale
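
To illustrate the MixUp entry, here is a minimal sketch for video batches, assuming clips shaped `(B, C, T, H, W)`; the `mixup` helper is illustrative, not the exact augmentation code used in training:

```python
import torch

def mixup(clips: torch.Tensor, labels: torch.Tensor, alpha: float = 0.6):
    """Illustrative MixUp for a batch of clips shaped (B, C, T, H, W)."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(clips.size(0))
    mixed = lam * clips + (1.0 - lam) * clips[perm]
    # Train with: lam * loss(out, labels) + (1 - lam) * loss(out, labels[perm])
    return mixed, labels, labels[perm], lam
```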

## Performance

| Metric | Value |
|--------|-------|
| Validation Accuracy | 56.34% |
| Training Accuracy | ~75% |
| Train-Val Gap | ~19% |
| Val F1 Score | ~0.54 |
| Val Precision | ~0.55 |

## Overfitting Analysis

The ~19% train-validation gap indicates significant overfitting, which is expected given:
- HMDB51's small size (only ~70 training videos per class in split 1)
- MC3-18's relatively large capacity (11.7M parameters)
- Even with strong augmentation and regularization, the model memorizes training data

This gap is typical for HMDB51 and difficult to eliminate without:
- Larger pretraining datasets
- Ensemble methods
- More aggressive regularization techniques
- Reduced model capacity

## Design Choices

**Why num_frames=8 and frame_interval=1?**

HMDB51 contains many short videos (some as short as 10-20 frames). Using a smaller temporal window (8 frames at interval 1, i.e. 8 consecutive frames) prevents:
- Frame repetition/tiling for short videos
- Loss of temporal information

This differs from the Kinetics-400 pretraining (which typically uses 16 frames), but adapts to HMDB51's characteristics.
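
A hedged sketch of what consecutive-frame sampling looks like under these settings; `sample_clip` is an assumed helper, not the actual data loader, and falls back to wrap-around tiling only when a video is shorter than the window:

```python
import torch

def sample_clip(video: torch.Tensor, num_frames: int = 8, interval: int = 1) -> torch.Tensor:
    """Sample `num_frames` frames at `interval` from a (T, C, H, W) video."""
    t = video.shape[0]
    span = (num_frames - 1) * interval + 1
    if t >= span:
        # Random temporal crop: enough frames available, no repetition needed
        start = torch.randint(0, t - span + 1, (1,)).item()
        idx = start + torch.arange(num_frames) * interval
    else:
        # Very short video: wrap around (the frame tiling mentioned above)
        idx = (torch.arange(num_frames) * interval) % t
    return video[idx]
```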

## Usage

```python
import torch
import cv2
from torchvision.models.video import mc3_18
from torchvision import transforms

# Load model
model = mc3_18(weights=None)
model.fc = torch.nn.Linear(model.fc.in_features, 51)
checkpoint = torch.load('best.pth', map_location='cpu')
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Preprocessing (Kinetics-400 normalization statistics)
transform = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((128, 171)),
    transforms.CenterCrop(112),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.43216, 0.394666, 0.37645],
                         std=[0.22803, 0.22145, 0.216989])
])

# Load 8 consecutive RGB frames from a video ('video.avi' is an example path)
cap = cv2.VideoCapture('video.avi')
frames = []
while len(frames) < 8:
    ret, frame = cap.read()
    if not ret:
        break
    frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))  # OpenCV decodes to BGR
cap.release()

frames = [transform(frame) for frame in frames]
# (T, C, H, W) -> (1, C, T, H, W), the layout torchvision video models expect
video_tensor = torch.stack(frames).permute(1, 0, 2, 3).unsqueeze(0)  # (1, 3, 8, 112, 112)

# Inference
with torch.no_grad():
    output = model(video_tensor)
    pred = output.argmax(dim=1)
```

## Alternative Approach

We also provide a model initialized from UCF-101 instead of Kinetics-400. That model achieves 55.46% validation accuracy with better generalization (~13% train-val gap vs ~19%).

See: `mc3-18-hmdb51-ucf-transfer`

**Kinetics vs UCF-101 initialization:**
- Kinetics: larger pretraining dataset, optimized here for short clips (8 frames)
- UCF-101: closer domain to HMDB51 and better generalization, but requires 16-frame inputs (which forces frame tiling on short videos)

## Limitations

- Overfits on small datasets (~19% train-val gap)
- Single model without ensembling
- No test-time augmentation (multi-crop, temporal sampling)
- Optimized for 8-frame inputs (may not generalize to different temporal windows)
- Trained on HMDB51 split 1 only (performance may vary on splits 2 and 3)

## HMDB51 Classes

The model predicts the following 51 action classes: brush_hair, cartwheel, catch, chew, clap, climb, climb_stairs, dive, draw_sword, dribble, drink, eat, fall_floor, fencing, flic_flac, golf, handstand, hit, hug, jump, kick, kick_ball, kiss, laugh, pick, pour, pullup, punch, push, pushup, ride_bike, ride_horse, run, shake_hands, shoot_ball, shoot_bow, shoot_gun, sit, situp, smile, smoke, somersault, stand, swing_baseball, sword, sword_exercise, talk, throw, turn, walk, wave.
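
If the training pipeline indexed classes alphabetically (an assumption; verify against your own checkpoint), the predicted index from the Usage example can be mapped back to a name:

```python
# Hypothetical index-to-label mapping: assumes alphabetical class ordering
# at training time -- verify against your own pipeline before relying on it.
HMDB51_CLASSES = sorted([
    "brush_hair", "cartwheel", "catch", "chew", "clap", "climb",
    "climb_stairs", "dive", "draw_sword", "dribble", "drink", "eat",
    "fall_floor", "fencing", "flic_flac", "golf", "handstand", "hit",
    "hug", "jump", "kick", "kick_ball", "kiss", "laugh", "pick", "pour",
    "pullup", "punch", "push", "pushup", "ride_bike", "ride_horse", "run",
    "shake_hands", "shoot_ball", "shoot_bow", "shoot_gun", "sit", "situp",
    "smile", "smoke", "somersault", "stand", "swing_baseball", "sword",
    "sword_exercise", "talk", "throw", "turn", "walk", "wave",
])

label = HMDB51_CLASSES[pred.item()]  # `pred` from the Usage example above
```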

## Training Details

- Framework: PyTorch
- Hardware: Single GPU (CUDA)
- Training Time: ~2 hours (150 epochs)
- Convergence: Best model saved at epoch ~85-90

## Citation

If you use this model, please cite the original HMDB51 dataset:

```bibtex
@inproceedings{kuehne2011hmdb,
  title={HMDB: a large video database for human motion recognition},
  author={Kuehne, Hildegard and Jhuang, Hueihan and Garrote, Est{\'\i}baliz and Poggio, Tomaso and Serre, Thomas},
  booktitle={2011 International Conference on Computer Vision},
  pages={2556--2563},
  year={2011},
  organization={IEEE}
}
```

And the MC3 architecture:

```bibtex
@inproceedings{tran2018closer,
  title={A closer look at spatiotemporal convolutions for action recognition},
  author={Tran, Du and Wang, Heng and Torresani, Lorenzo and Ray, Jamie and LeCun, Yann and Paluri, Manohar},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
  pages={6450--6459},
  year={2018}
}
```

## License

- Model weights: Apache-2.0
- Code: Apache-2.0
- HMDB51 dataset: subject to the original dataset license