---
tags:
- video-classification
- action-recognition
- mc3
- hmdb51
- pytorch
- computer-vision
- spatiotemporal
- 3dcnn
library_name: pytorch
datasets:
- hmdb51
metrics:
- accuracy
- f1
- precision
pipeline_tag: video-classification
license: apache-2.0
language:
- en
---
# MC3-18 HMDB51 (Kinetics-400 Init)

## Model Description

MC3-18 (Mixed Convolution 3D) fine-tuned on HMDB51 split 1 for human action recognition. The model was initialized with Kinetics-400 pretrained weights and adapted to HMDB51's shorter video clips.

**Validation Accuracy: 56.34%**

This is a reference baseline implementation. State-of-the-art results on HMDB51 split 1 are approximately 70-75%, achieved with ensemble methods, test-time augmentation, and multi-crop evaluation.

## Model Details

- **Architecture:** MC3-18 (11.7M parameters)
- **Initialization:** Kinetics-400 pretrained weights
- **Dataset:** HMDB51 split 1
  - Train: 3,570 videos across 51 action classes
  - Validation: 1,530 videos
- **Input:** RGB video clips (8 frames, 112x112 spatial resolution)
- **Output:** 51-class action predictions

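As a quick sanity check of the shapes above, the snippet below builds the architecture and pushes one dummy clip through it. This is a minimal sketch: it does not load the fine-tuned weights (see the Usage section for that).

```python
import torch
from torchvision.models.video import mc3_18

# MC3-18 with its classification head swapped for HMDB51's 51 classes.
model = mc3_18(weights=None)
model.fc = torch.nn.Linear(model.fc.in_features, 51)
model.eval()

# One dummy clip: (batch, channels, frames, height, width) = (1, 3, 8, 112, 112).
clip = torch.randn(1, 3, 8, 112, 112)
with torch.no_grad():
    logits = model(clip)

print(logits.shape)  # torch.Size([1, 51])
```
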
## Training Configuration

```yaml
Frames: 8
Frame Interval: 1
Spatial Size: 112x112
Batch Size: 16
Epochs: 150
Learning Rate: 0.0003
Weight Decay: 3e-3
Optimizer: SGD (momentum=0.9)
```

**Augmentation:**
- MixUp (alpha=0.6)
- CutMix (alpha=1.0)
- Label Smoothing (0.15)
- Random horizontal flip
- Color jitter
- Random grayscale

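For orientation, here is a minimal sketch of a single training step consistent with this configuration: SGD with momentum, label smoothing, and MixUp. The actual training script is not part of this card; the MixUp formulation below (a convex combination of clips with an interpolated loss) is a standard implementation assumed here, and CutMix, flips, and color jitter are omitted for brevity.

```python
import torch
from torchvision.models.video import mc3_18

# Model as described above: MC3-18 with a 51-class head.
model = mc3_18(weights=None)
model.fc = torch.nn.Linear(model.fc.in_features, 51)
model.train()

# Optimizer and loss taken from the configuration block.
optimizer = torch.optim.SGD(model.parameters(), lr=3e-4,
                            momentum=0.9, weight_decay=3e-3)
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.15)

def mixup_batch(clips, labels, alpha=0.6):
    # Blend each clip with a randomly chosen partner from the same batch;
    # the loss is then interpolated between both sets of labels.
    lam = float(torch.distributions.Beta(alpha, alpha).sample())
    perm = torch.randperm(clips.size(0))
    return lam * clips + (1.0 - lam) * clips[perm], labels, labels[perm], lam

# One step on a dummy batch shaped (batch, C, T, H, W) = (16, 3, 8, 112, 112).
clips = torch.randn(16, 3, 8, 112, 112)
labels = torch.randint(0, 51, (16,))
mixed, y_a, y_b, lam = mixup_batch(clips, labels)
logits = model(mixed)
loss = lam * criterion(logits, y_a) + (1.0 - lam) * criterion(logits, y_b)

optimizer.zero_grad()
loss.backward()
optimizer.step()
```
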
## Performance

| Metric | Value |
|--------|-------|
| Validation Accuracy | 56.34% |
| Training Accuracy | ~75% |
| Train-Val Gap | ~19% |
| Val F1 Score | ~0.54 |
| Val Precision | ~0.55 |

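The card does not state how F1 and precision were averaged over the 51 classes; the sketch below assumes macro averaging with scikit-learn and uses random stand-in arrays in place of real validation predictions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score

# Stand-ins: in practice these are ground-truth labels and model predictions
# collected over the 1,530 validation clips.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 51, size=1530)
y_pred = rng.integers(0, 51, size=1530)

print(accuracy_score(y_true, y_pred))
print(f1_score(y_true, y_pred, average="macro"))
print(precision_score(y_true, y_pred, average="macro", zero_division=0))
```
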
## Overfitting Analysis

The 19% train-validation gap indicates significant overfitting, which is expected given:
- HMDB51's small size (roughly 70 training videos per class on average)
- MC3-18's relatively large capacity (11.7M parameters)
- The model's tendency to memorize the training data even with strong augmentation and regularization

This gap is typical for HMDB51 and difficult to eliminate without:
- Larger pretraining datasets
- Ensemble methods
- More aggressive regularization techniques
- Reduced model capacity

## Design Choices

**Why num_frames=8 and frame_interval=1?**

HMDB51 contains many short videos (some as short as 10-20 frames). Using a small temporal window (8 frames at interval 1, i.e. 8 consecutive frames) prevents:
- Frame repetition/tiling on short videos
- Loss of temporal information

This differs from the Kinetics-400 pretraining setup (which typically uses 16 frames) but suits HMDB51's characteristics.

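A minimal sketch of such a sampling policy is shown below. The exact training-time sampler is not included in this card; the short-video fallback (repeating the last frame instead of tiling the whole clip) is one common choice, assumed here.

```python
import random

def sample_clip(frames, num_frames=8, interval=1):
    # Span covered by the window: with interval 1 this is just 8 consecutive
    # frames, so even 10-20 frame videos fit without tiling.
    window = (num_frames - 1) * interval + 1
    if len(frames) >= window:
        start = random.randint(0, len(frames) - window)
        indices = list(range(start, start + window, interval))
    else:
        # Fallback for very short videos: pad by repeating the last frame.
        indices = list(range(0, len(frames), interval))[:num_frames]
        indices += [indices[-1]] * (num_frames - len(indices))
    return [frames[i] for i in indices]
```
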
## Usage

```python
import torch
from torchvision.models.video import mc3_18
from torchvision import transforms
import cv2

# Load model
model = mc3_18(weights=None)
model.fc = torch.nn.Linear(model.fc.in_features, 51)
checkpoint = torch.load('best.pth', map_location='cpu')
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Preprocessing (Kinetics-400 normalization statistics)
transform = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((128, 171)),
    transforms.CenterCrop(112),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.43216, 0.394666, 0.37645],
                         std=[0.22803, 0.22145, 0.216989])
])

# Load 8 consecutive RGB frames from a video file
cap = cv2.VideoCapture('video.avi')  # path to your video
frames = []
while len(frames) < 8:
    ret, frame = cap.read()
    if not ret:
        break
    frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))  # OpenCV decodes to BGR
cap.release()

frames = [transform(frame) for frame in frames]
video_tensor = torch.stack(frames).permute(1, 0, 2, 3).unsqueeze(0)  # (1, 3, 8, 112, 112)

# Inference
with torch.no_grad():
    output = model(video_tensor)
    pred = output.argmax(dim=1)
```

## Alternative Approach

We also provide a model initialized from UCF-101 instead of Kinetics-400. That model achieves 55.46% validation accuracy with better generalization (13% train-val gap vs. 19%).

See: `mc3-18-hmdb51-ucf-transfer`

**Kinetics vs UCF-101 initialization:**
- Kinetics: larger pretraining dataset, optimized for short clips (8 frames)
- UCF-101: closer domain to HMDB51, better generalization, but requires 16 frames (causes frame tiling on short videos)

## Limitations

- Overfits on small datasets (19% train-val gap)
- Single model without ensembling
- No test-time augmentation (multi-crop, temporal sampling)
- Optimized for 8-frame inputs (may not generalize to different temporal windows)
- Trained on HMDB51 split 1 only (performance may vary on splits 2 and 3)

## HMDB51 Classes

The model predicts the 51 HMDB51 action classes: brush_hair, cartwheel, catch, chew, clap, climb, climb_stairs, dive, draw_sword, dribble, drink, eat, fall_floor, fencing, flic_flac, golf, handstand, hit, hug, jump, kick, kick_ball, kiss, laugh, pick, pour, pullup, punch, push, pushup, ride_bike, ride_horse, run, shake_hands, shoot_ball, shoot_bow, shoot_gun, sit, situp, smile, smoke, somersault, stand, swing_baseball, sword, sword_exercise, talk, throw, turn, walk, wave.

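To map the `pred` index from the Usage section back to a name, something like the snippet below works. It assumes the classes were indexed in the alphabetical order listed above, the usual convention for folder-based datasets; verify against your own label encoding.

```python
# Assumed alphabetical index-to-name mapping; confirm against the label
# encoding used at training time before relying on it.
HMDB51_CLASSES = [
    "brush_hair", "cartwheel", "catch", "chew", "clap", "climb",
    "climb_stairs", "dive", "draw_sword", "dribble", "drink", "eat",
    "fall_floor", "fencing", "flic_flac", "golf", "handstand", "hit",
    "hug", "jump", "kick", "kick_ball", "kiss", "laugh", "pick", "pour",
    "pullup", "punch", "push", "pushup", "ride_bike", "ride_horse", "run",
    "shake_hands", "shoot_ball", "shoot_bow", "shoot_gun", "sit", "situp",
    "smile", "smoke", "somersault", "stand", "swing_baseball", "sword",
    "sword_exercise", "talk", "throw", "turn", "walk", "wave",
]

print(HMDB51_CLASSES[pred.item()])
```
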
## Training Details

- Framework: PyTorch
- Hardware: single GPU (CUDA)
- Training Time: ~2 hours (150 epochs)
- Convergence: best model saved around epoch 85-90

## Citation

If you use this model, please cite the original HMDB51 dataset:

```bibtex
@inproceedings{kuehne2011hmdb,
  title={HMDB: a large video database for human motion recognition},
  author={Kuehne, Hildegard and Jhuang, Hueihan and Garrote, Est{\'\i}baliz and Poggio, Tomaso and Serre, Thomas},
  booktitle={2011 International Conference on Computer Vision},
  pages={2556--2563},
  year={2011},
  organization={IEEE}
}
```

And the MC3 architecture:

```bibtex
@inproceedings{tran2018closer,
  title={A closer look at spatiotemporal convolutions for action recognition},
  author={Tran, Du and Wang, Heng and Torresani, Lorenzo and Ray, Jamie and LeCun, Yann and Paluri, Manohar},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
  pages={6450--6459},
  year={2018}
}
```

## License

- Model weights: Apache-2.0
- Code: Apache-2.0
- HMDB51 dataset: subject to the original dataset's license terms