LS-ViT for HMDB51 Action Recognition

LS-ViT (Long-Short ViT) is a ViT-Base backbone augmented with two motion-aware modules for video action recognition:

SMIFModule — Short-term Motion Injection & Fusion. Operates on raw RGB frames, computes a windowed motion map across neighboring frames, and fuses it back into the spatial features via a 1×1 convolution and a learned blend.
LMIModule — Long-term Motion Interaction. Inserted inside every transformer block. Operates on patch tokens by computing forward/backward temporal differences in a reduced space and using them as a token-level attention gate.
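The LMI idea described above (temporal differences in a reduced space, used as a token-level gate) can be illustrated with a small PyTorch sketch. This is not the code in `modeling.py`: the class name `TemporalDiffGate`, the `(B, T, N, D)` token layout, the reduction factor, and the residual form of the gating are all assumptions chosen for illustration.

```python
import torch
import torch.nn as nn


class TemporalDiffGate(nn.Module):
    """Hypothetical sketch of a temporal-difference gate over patch tokens."""

    def __init__(self, dim: int, reduction: int = 4):
        super().__init__()
        hidden = dim // reduction
        self.reduce = nn.Linear(dim, hidden)  # project tokens to a reduced space
        self.expand = nn.Linear(hidden, dim)  # map the motion signal back to dim

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, T, N, D) = batch, frames, patches per frame, channels
        z = self.reduce(tokens)
        diff = z[:, 1:] - z[:, :-1]           # z[t+1] - z[t], shape (B, T-1, N, hidden)
        pad = torch.zeros_like(z[:, :1])      # zero motion where a neighbor is missing
        fwd = torch.cat([diff, pad], dim=1)   # forward difference; last frame padded
        bwd = torch.cat([pad, -diff], dim=1)  # backward difference; first frame padded
        # Average the two sigmoid gates so every value stays in (0, 1).
        gate = 0.5 * (torch.sigmoid(self.expand(fwd)) + torch.sigmoid(self.expand(bwd)))
        # Residual gating: keep the original token and add a motion-weighted copy.
        return tokens + tokens * gate


torch.manual_seed(0)
module = TemporalDiffGate(dim=32, reduction=4)
x = torch.randn(2, 4, 5, 32)  # 2 clips, 4 frames, 5 patches, 32 channels
y = module(x)                 # same shape as x
```

The residual form means the module can only rescale tokens, never remove them, which is one common way to add a gate to a pretrained backbone without disturbing its features too much at the start of fine-tuning.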

The backbone is initialized from vit_base_patch16_224 (timm) and the full model is fine-tuned on HMDB51 for 51-way action classification.

Files

File	Description
`lsvit_hmdb51_best.pt`	Best checkpoint by validation accuracy. State dict under key `"model"`.
`modeling.py`	Self-contained model architecture (no `timm` runtime dependency).
`README.md`	This card.

Training setup


Setting	Value
Pretrained backbone	`vit_base_patch16_224`
Image size	224
Frames per clip	12
Frame stride	2
Epochs	5
Batch size	2 (gradient accumulation = 16)
Optimizer	AdamW
Backbone LR	5e-5
Head LR	2.5e-4
Weight decay	0.05
Mixed precision	Yes (`torch.amp`)
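To make the clip settings above concrete, here is a minimal sketch of what 12 frames at stride 2 means in terms of frame indices, plus the effective batch size implied by gradient accumulation. The helper `clip_frame_indices`, its `start` parameter, and the clamping of short videos to their last frame are assumptions for illustration; the card does not say how clips were actually sampled.

```python
def clip_frame_indices(num_video_frames, num_frames=12, stride=2, start=0):
    """Frame indices for one clip; short videos repeat their last frame (assumed)."""
    last = num_video_frames - 1
    return [min(start + i * stride, last) for i in range(num_frames)]


indices = clip_frame_indices(num_video_frames=100)
span = indices[-1] - indices[0] + 1  # source frames covered by one clip

# Batch size 2 with 16 accumulation steps gives one optimizer update per 32 clips.
effective_batch = 2 * 16

print(indices)          # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22]
print(span)             # 23
print(effective_batch)  # 32
```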

Result

Top-1 validation accuracy on the held-out HMDB51 split is ~32.7% (best checkpoint). This is a short 5-epoch run with a small per-step batch size, so treat it as a starting point rather than a competitive HMDB51 number.

Training curves

Validation accuracy climbs roughly linearly from ~13% (epoch 1) to ~32.7% (epoch 5), and training accuracy reaches ~45% by epoch 5. Validation loss tracks training loss closely after epoch 2. The run was still improving when training stopped, so additional epochs would likely help.

Usage

import torch
from modeling import ViTConfig, LSViTForAction, HMDB51_CLASSES

config = ViTConfig(image_size=224)
model = LSViTForAction(config, num_classes=51)

# weights_only=False unpickles arbitrary Python objects, so only load checkpoints you trust.
ckpt = torch.load("lsvit_hmdb51_best.pt", map_location="cpu", weights_only=False)
model.load_state_dict(ckpt["model"])
model.eval()

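# video: (B, T, C, H, W) float tensor. Scale pixels to [0, 1], then normalize with
# mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225). Trained with T=12.
# torch.randn is only a placeholder with the right shape; pass real frames in practice.
video = torch.randn(1, 12, 3, 224, 224)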
with torch.no_grad():
    logits = model(video)

pred = logits.argmax(dim=-1).item()
print(HMDB51_CLASSES[pred])
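The usage snippet above feeds a random tensor. For real frames, the input convention in its comments can be sketched as follows. The helper `preprocess_clip` is hypothetical, and it assumes the frames are already decoded to RGB, resized or cropped to 224×224, and stacked as a `(T, H, W, C)` uint8 tensor.

```python
import torch

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def preprocess_clip(frames_uint8: torch.Tensor) -> torch.Tensor:
    """Convert (T, H, W, C) uint8 RGB frames into a (1, T, C, H, W) float batch."""
    if frames_uint8.dtype != torch.uint8 or frames_uint8.ndim != 4:
        raise ValueError("expected a uint8 tensor of shape (T, H, W, C)")
    clip = frames_uint8.permute(0, 3, 1, 2).float() / 255.0  # (T, C, H, W) in [0, 1]
    mean = torch.tensor(IMAGENET_MEAN).view(1, 3, 1, 1)
    std = torch.tensor(IMAGENET_STD).view(1, 3, 1, 1)
    clip = (clip - mean) / std                               # per-channel normalization
    return clip.unsqueeze(0)                                 # add the batch dimension


# Stand-in for 12 decoded frames; replace with frames from a video reader.
frames = torch.randint(0, 256, (12, 224, 224, 3), dtype=torch.uint8)
video = preprocess_clip(frames)  # shape (1, 12, 3, 224, 224)
```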

Downloading from the Hub

from huggingface_hub import hf_hub_download

weights_path = hf_hub_download("devicoal/lsvit_hmdb51", "lsvit_hmdb51_best.pt")
modeling_path = hf_hub_download("devicoal/lsvit_hmdb51", "modeling.py")
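The downloaded `modeling.py` sits in the Hub cache rather than on `sys.path`, so `from modeling import ...` will not find it there. One way to import it from `modeling_path`, using only the standard library, is sketched below. The helper name `load_module_from_path` is hypothetical. Importing the file executes it, so only do this for repositories you trust.

```python
import importlib.util
import sys
from pathlib import Path
from types import ModuleType


def load_module_from_path(path: str, name: str = "modeling") -> ModuleType:
    """Import a Python source file from an arbitrary location on disk."""
    spec = importlib.util.spec_from_file_location(name, Path(path))
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load a module from {path!r}")
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module  # register first so code that looks the module up by name works
    spec.loader.exec_module(module)
    return module


# Usage with the paths from the snippet above (not executed here):
# modeling = load_module_from_path(modeling_path)
# model = modeling.LSViTForAction(modeling.ViTConfig(image_size=224), num_classes=51)
```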

Classes

51 HMDB51 action categories (alphabetical, matching sorted(os.listdir(...))):

brush_hair, cartwheel, catch, chew, clap, climb, climb_stairs, dive,
draw_sword, dribble, drink, eat, fall_floor, fencing, flic_flac, golf,
handstand, hit, hug, jump, kick, kick_ball, kiss, laugh, pick, pour, pullup,
punch, push, pushup, ride_bike, ride_horse, run, shake_hands, shoot_ball,
shoot_bow, shoot_gun, sit, situp, smile, smoke, somersault, stand,
swing_baseball, sword, sword_exercise, talk, throw, turn, walk, wave
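Because the label order follows Python's default string sort, the index-to-name mapping can be rebuilt from the class names alone. The sketch below copies the list above and checks that it is already in sorted order; the variable names are illustrative, and `HMDB51_CLASSES` from `modeling.py` remains the authoritative list.

```python
CLASS_NAMES = """
brush_hair cartwheel catch chew clap climb climb_stairs dive
draw_sword dribble drink eat fall_floor fencing flic_flac golf
handstand hit hug jump kick kick_ball kiss laugh pick pour pullup
punch push pushup ride_bike ride_horse run shake_hands shoot_ball
shoot_bow shoot_gun sit situp smile smoke somersault stand
swing_baseball sword sword_exercise talk throw turn walk wave
""".split()

# Sorting reproduces the order that sorted(os.listdir(...)) gives for these folder names.
idx_to_class = sorted(CLASS_NAMES)
class_to_idx = {name: i for i, name in enumerate(idx_to_class)}

print(len(idx_to_class))           # 51
print(class_to_idx["brush_hair"])  # 0
print(class_to_idx["wave"])        # 50
```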

License

Apache-2.0.
