LS-ViT for HMDB51 Action Recognition
LS-ViT (Long-Short ViT) is a ViT-Base backbone augmented with two motion-aware modules for video action recognition:
- SMIFModule: Short-term Motion Injection & Fusion. Operates on raw RGB frames, computes a windowed motion map across neighboring frames, and fuses it back into the spatial features via a 1×1 convolution and a learned blend.
- LMIModule: Long-term Motion Interaction. Inserted inside every transformer block. Operates on patch tokens by computing forward/backward temporal differences in a reduced space and using them as a token-level attention gate.
The backbone is initialized from vit_base_patch16_224 (timm) and the full
model is fine-tuned on HMDB51 for 51-way action classification.
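To make the LMIModule description above more concrete, here is a minimal sketch of a token-level gate driven by forward and backward temporal differences. It is not the code in `modeling.py`: the class name `TemporalDifferenceGate`, the `reduction` factor, the zero padding at the clip boundaries, and the way the two sigmoid gates are averaged are all assumptions made for illustration.

```python
import torch
import torch.nn as nn


class TemporalDifferenceGate(nn.Module):
    """Hypothetical LMI-style gate; an illustration, not the released module."""

    def __init__(self, dim: int, reduction: int = 4):
        super().__init__()
        self.reduce = nn.Linear(dim, dim // reduction)  # project tokens to a smaller space
        self.expand = nn.Linear(dim // reduction, dim)  # map differences back to token width

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, T, N, D) = batch, frames, patch tokens per frame, channels
        reduced = self.reduce(tokens)
        step = reduced[:, 1:] - reduced[:, :-1]          # frame t+1 minus frame t
        pad = torch.zeros_like(reduced[:, :1])
        forward_diff = torch.cat([step, pad], dim=1)     # last frame has no successor
        backward_diff = torch.cat([pad, -step], dim=1)   # first frame has no predecessor
        # Assumed fusion: average the two sigmoid gates, then scale each token.
        gate = 0.5 * (
            torch.sigmoid(self.expand(forward_diff))
            + torch.sigmoid(self.expand(backward_diff))
        )
        return tokens * gate


gate = TemporalDifferenceGate(dim=32)
x = torch.randn(2, 12, 5, 32)   # 2 clips, 12 frames, 5 tokens, 32 channels
print(gate(x).shape)            # same shape as the input
```

In this sketch a clip with no motion yields zero differences, so every frame receives the same gate. The released module may combine the differences differently.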
Files
| File | Description |
|---|---|
| `lsvit_hmdb51_best.pt` | Best checkpoint by validation accuracy. State dict under key `"model"`. |
| `modeling.py` | Self-contained model architecture (no timm runtime dependency). |
| `README.md` | This card. |
Training setup
| Setting | Value |
|---|---|
| Pretrained backbone | vit_base_patch16_224 |
| Image size | 224 |
| Frames per clip | 12 |
| Frame stride | 2 |
| Epochs | 5 |
| Batch size | 2 (gradient accumulation = 16) |
| Optimizer | AdamW |
| Backbone LR | 5e-5 |
| Head LR | 2.5e-4 |
| Weight decay | 0.05 |
| Mixed precision | Yes (torch.amp) |
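The clip and batch settings above imply some simple arithmetic, sketched below. The helper `clip_frame_indices` is hypothetical. How the training script chooses the start frame is not stated in this card, and the effective batch size assumes a single device.

```python
from typing import List


def clip_frame_indices(start: int, num_frames: int = 12, stride: int = 2) -> List[int]:
    """Indices of the frames in one clip, taking every `stride`-th frame."""
    return [start + i * stride for i in range(num_frames)]


indices = clip_frame_indices(start=0)
span = indices[-1] - indices[0] + 1   # source frames covered by one clip
effective_batch = 2 * 16              # batch size x gradient accumulation steps

print(indices)          # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22]
print(span)             # 23
print(effective_batch)  # 32
```

Under these assumptions each clip spans 23 source frames and each optimizer step sees 32 clips.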
Result
Top-1 validation accuracy on the held-out HMDB51 split: ~32.7% (best checkpoint). This is a short 5-epoch run with a small batch size, so treat it as a starting point rather than a competitive HMDB51 number.
Training curves
Validation accuracy climbs roughly linearly from ~13% (epoch 1) to ~32.7% (epoch 5). Training accuracy reaches ~45% by epoch 5, and validation loss tracks training loss closely after epoch 2. The run was still improving when training stopped, so additional epochs would likely help.
Usage
```python
import torch
from modeling import ViTConfig, LSViTForAction, HMDB51_CLASSES

config = ViTConfig(image_size=224)
model = LSViTForAction(config, num_classes=51)

# The checkpoint stores the state dict under the "model" key.
ckpt = torch.load("lsvit_hmdb51_best.pt", map_location="cpu", weights_only=False)
model.load_state_dict(ckpt["model"])
model.eval()

# video: (B, T, C, H, W), float tensor in [0, 1] normalized with
# mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225). Trained with T=12.
# Random values stand in for a real preprocessed clip here.
video = torch.randn(1, 12, 3, 224, 224)

with torch.no_grad():
    logits = model(video)

pred = logits.argmax(dim=-1).item()
print(HMDB51_CLASSES[pred])
```
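The input format described in the usage comment can be produced from decoded frames roughly as follows. This is a sketch: the function name `frames_to_clip` is hypothetical, and it assumes the frames are already resized or cropped to 224×224 and arrive as a uint8 RGB tensor of shape (T, H, W, C). The exact resize and crop used in training are not documented here.

```python
import torch

# ImageNet statistics quoted in the usage comment, shaped to broadcast over (T, C, H, W).
MEAN = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
STD = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)


def frames_to_clip(frames_uint8: torch.Tensor) -> torch.Tensor:
    """Convert (T, H, W, C) uint8 RGB frames to a normalized (1, T, C, H, W) clip."""
    clip = frames_uint8.permute(0, 3, 1, 2).float() / 255.0  # (T, C, H, W) in [0, 1]
    clip = (clip - MEAN) / STD                               # per-channel normalization
    return clip.unsqueeze(0)                                 # add the batch dimension


frames = torch.zeros(12, 224, 224, 3, dtype=torch.uint8)     # placeholder black frames
print(frames_to_clip(frames).shape)
```

The result can be passed to the model in place of the random `video` tensor.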
Downloading from the Hub
```python
from huggingface_hub import hf_hub_download

weights_path = hf_hub_download("devicoal/lsvit_hmdb51", "lsvit_hmdb51_best.pt")
modeling_path = hf_hub_download("devicoal/lsvit_hmdb51", "modeling.py")
```
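`hf_hub_download` returns a path inside the local Hub cache, so a plain `from modeling import ...` only works if that directory is on `sys.path`. One possible workaround, sketched here with the standard library, is to import the file by path. The helper name `load_module_from_path` is hypothetical.

```python
import importlib.util
import sys
from pathlib import Path
from types import ModuleType


def load_module_from_path(path: str, name: str = "modeling") -> ModuleType:
    """Import a Python source file from an arbitrary location under `name`."""
    spec = importlib.util.spec_from_file_location(name, Path(path))
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load a module from {path!r}")
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module       # register before executing, as a normal import would
    spec.loader.exec_module(module)  # run the file's top-level code
    return module
```

With the paths from the download step, `modeling = load_module_from_path(modeling_path)` should then expose `modeling.ViTConfig`, `modeling.LSViTForAction`, and `modeling.HMDB51_CLASSES`, and `weights_path` can be passed to `torch.load`.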
Classes
51 HMDB51 action categories (alphabetical, matching sorted(os.listdir(...))):
brush_hair, cartwheel, catch, chew, clap, climb, climb_stairs, dive,
draw_sword, dribble, drink, eat, fall_floor, fencing, flic_flac, golf,
handstand, hit, hug, jump, kick, kick_ball, kiss, laugh, pick, pour, pullup,
punch, push, pushup, ride_bike, ride_horse, run, shake_hands, shoot_ball,
shoot_bow, shoot_gun, sit, situp, smile, smoke, somersault, stand,
swing_baseball, sword, sword_exercise, talk, throw, turn, walk, wave
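As a quick check of the index convention, sorting the names above reproduces the listed order, so a label index is simply a position in the sorted list. The names `HMDB51_NAMES` and `class_to_idx` below are illustrative. `modeling.py` exports its own `HMDB51_CLASSES`, which is assumed to follow the same order.

```python
# Class names copied from the list above.
HMDB51_NAMES = """
brush_hair cartwheel catch chew clap climb climb_stairs dive
draw_sword dribble drink eat fall_floor fencing flic_flac golf
handstand hit hug jump kick kick_ball kiss laugh pick pour pullup
punch push pushup ride_bike ride_horse run shake_hands shoot_ball
shoot_bow shoot_gun sit situp smile smoke somersault stand
swing_baseball sword sword_exercise talk throw turn walk wave
""".split()

# Alphabetical order defines the label index, as with sorted(os.listdir(...)).
class_to_idx = {name: idx for idx, name in enumerate(sorted(HMDB51_NAMES))}

print(len(class_to_idx))           # 51
print(class_to_idx["brush_hair"])  # 0
print(class_to_idx["wave"])        # 50
```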
License
Apache-2.0.
