LS-ViT for HMDB51 Action Recognition

LS-ViT (Long-Short ViT) is a ViT-Base backbone augmented with two motion-aware modules for video action recognition:

  • SMIFModule: Short-term Motion Injection & Fusion. Operates on raw RGB frames, computes a windowed motion map across neighboring frames, and fuses it back into the spatial features via a 1×1 convolution and a learned blend.
  • LMIModule: Long-term Motion Interaction. Inserted inside every transformer block. Operates on patch tokens by computing forward/backward temporal differences in a reduced space and using them as a token-level attention gate.
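
To make the LMI description more concrete, here is a minimal sketch of a token-level temporal-difference gate. It is not the code in modeling.py: the class name TemporalDiffGate, the reduction ratio of 4, the zero padding at the clip boundaries, the sigmoid gate, and the residual form are all assumptions made for illustration.

```python
import torch
import torch.nn as nn


class TemporalDiffGate(nn.Module):
    """Hypothetical sketch of a long-term motion gate on patch tokens."""

    def __init__(self, dim: int = 768, reduction: int = 4):
        super().__init__()
        hidden = dim // reduction
        self.reduce = nn.Linear(dim, hidden)   # project tokens to a reduced space
        self.expand = nn.Linear(hidden, dim)   # map motion cues back to token width

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, T, N, D) = batch, frames, patches per frame, channels
        z = self.reduce(tokens)
        diff = z[:, 1:] - z[:, :-1]            # frame t+1 minus frame t
        pad = torch.zeros_like(z[:, :1])       # no neighbor at the clip boundary
        fwd = torch.cat([diff, pad], dim=1)    # difference to the next frame
        bwd = torch.cat([pad, -diff], dim=1)   # difference to the previous frame
        gate = torch.sigmoid(self.expand(fwd + bwd))
        return tokens + tokens * gate          # residual, gated per token and channel


gate = TemporalDiffGate(dim=768)
tokens = torch.randn(1, 12, 196, 768)          # 12 frames, 14x14 patches each
out = gate(tokens)                             # same shape as the input
```

In this sketch a static clip produces zero differences, so the gate collapses to a constant and every frame is scaled identically; only tokens that change over time receive a frame-specific weighting.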

The backbone is initialized from vit_base_patch16_224 (timm) and the full model is fine-tuned on HMDB51 for 51-way action classification.

Files

| File | Description |
| --- | --- |
| lsvit_hmdb51_best.pt | Best checkpoint by validation accuracy. State dict under key "model". |
| modeling.py | Self-contained model architecture (no timm runtime dependency). |
| README.md | This card. |

Training setup

| Setting | Value |
| --- | --- |
| Pretrained backbone | vit_base_patch16_224 |
| Image size | 224 |
| Frames per clip | 12 |
| Frame stride | 2 |
| Epochs | 5 |
| Batch size | 2 (gradient accumulation = 16) |
| Optimizer | AdamW |
| Backbone LR | 5e-5 |
| Head LR | 2.5e-4 |
| Weight decay | 0.05 |
| Mixed precision | Yes (torch.amp) |
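
The settings above imply two AdamW parameter groups and an effective batch size of 2 × 16 = 32. The sketch below shows one way to wire that up. It is not the training script: the helper build_optimizer, the "head" name prefix used to split parameters, and the Toy stand-in model are hypothetical, and mixed precision is left out for brevity.

```python
import torch
import torch.nn as nn


def build_optimizer(model: nn.Module, head_prefix: str = "head"):
    """Two AdamW parameter groups: backbone at 5e-5, head at 2.5e-4."""
    backbone, head = [], []
    for name, param in model.named_parameters():
        (head if name.startswith(head_prefix) else backbone).append(param)
    return torch.optim.AdamW(
        [{"params": backbone, "lr": 5e-5}, {"params": head, "lr": 2.5e-4}],
        weight_decay=0.05,
    )


class Toy(nn.Module):
    """Tiny stand-in model so the sketch runs without the real checkpoint."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(8, 8)
        self.head = nn.Linear(8, 51)

    def forward(self, x):
        return self.head(self.backbone(x))


model = Toy()
optimizer = build_optimizer(model)
criterion = nn.CrossEntropyLoss()
accum_steps = 16                                  # micro-batch of 2 -> effective batch of 32

optimizer.zero_grad()
for step in range(accum_steps):
    x = torch.randn(2, 8)                         # random stand-in for a micro-batch
    y = torch.randint(0, 51, (2,))
    loss = criterion(model(x), y) / accum_steps   # average over the accumulated steps
    loss.backward()                               # gradients add up across micro-batches
optimizer.step()                                  # one update per 16 micro-batches
```

Dividing each micro-batch loss by accum_steps keeps the accumulated gradient on the same scale as a single batch of 32.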

Result

Top-1 validation accuracy on the held-out HMDB51 split: ~32.7% (best checkpoint). This is a short 5-epoch run on a small batch size and should be treated as a starting point rather than a competitive HMDB51 number.

Training curves

Figure: training and validation loss/accuracy over 5 epochs.

Validation accuracy climbs roughly linearly from ~13% (epoch 1) to ~32.7% (epoch 5). Training accuracy reaches ~45% by epoch 5, and validation loss tracks training loss closely after epoch 2. The run was still improving when training stopped, so additional epochs would likely help.

Usage

import torch
from modeling import ViTConfig, LSViTForAction, HMDB51_CLASSES

config = ViTConfig(image_size=224)
model = LSViTForAction(config, num_classes=51)

ckpt = torch.load("lsvit_hmdb51_best.pt", map_location="cpu", weights_only=False)
model.load_state_dict(ckpt["model"])
model.eval()

# video: (B, T, C, H, W) float tensor, scaled to [0, 1] and then normalized with
# mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225). Trained with T=12.
video = torch.randn(1, 12, 3, 224, 224)  # random stand-in for a preprocessed clip
with torch.no_grad():
    logits = model(video)

pred = logits.argmax(dim=-1).item()
print(HMDB51_CLASSES[pred])
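
The snippet above feeds a random tensor. For a real clip, frames have to be sampled and normalized first. The following is a minimal sketch under stated assumptions: the helper names sample_indices and to_model_input are hypothetical, the centered window and the clamping for short videos are illustrative choices rather than the sampling used in training, and frames are assumed to be already resized to 224×224.

```python
import torch

MEAN = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
STD = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)


def sample_indices(num_frames: int, clip_len: int = 12, stride: int = 2) -> list:
    """Centered window of clip_len indices spaced by stride, clamped to the video."""
    span = (clip_len - 1) * stride + 1            # 23 source frames for T=12, stride=2
    start = max(0, (num_frames - span) // 2)
    return [min(start + i * stride, num_frames - 1) for i in range(clip_len)]


def to_model_input(frames_uint8: torch.Tensor) -> torch.Tensor:
    """(T, H, W, C) uint8 frames -> (1, T, C, H, W) normalized float tensor."""
    clip = frames_uint8.permute(0, 3, 1, 2).float() / 255.0   # scale to [0, 1]
    clip = (clip - MEAN) / STD                                 # ImageNet statistics
    return clip.unsqueeze(0)                                   # add batch dimension
```

With decoded as a (num_frames, 224, 224, 3) uint8 tensor from a video reader of your choice, to_model_input(decoded[sample_indices(len(decoded))]) should yield a tensor of the shape the usage snippet expects.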

Downloading from the Hub

from huggingface_hub import hf_hub_download

weights_path = hf_hub_download("devicoal/lsvit_hmdb51", "lsvit_hmdb51_best.pt")
modeling_path = hf_hub_download("devicoal/lsvit_hmdb51", "modeling.py")
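
hf_hub_download returns a path inside the local cache, so a plain "from modeling import ..." only works if that directory is importable. One possible way to bridge the gap is to load the file by path with the standard library. The helper name load_module_from_path is hypothetical.

```python
import importlib.util
import sys


def load_module_from_path(path: str, name: str = "modeling"):
    """Import a Python file by path and register it under `name`."""
    spec = importlib.util.spec_from_file_location(name, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load module from {path!r}")
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module        # lets later `import modeling` statements resolve
    spec.loader.exec_module(module)   # run the file's top-level code
    return module
```

Calling load_module_from_path(modeling_path) would then give a module object whose ViTConfig, LSViTForAction and HMDB51_CLASSES attributes can be used as in the usage section.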

Classes

51 HMDB51 action categories (alphabetical, matching sorted(os.listdir(...))):

brush_hair, cartwheel, catch, chew, clap, climb, climb_stairs, dive,
draw_sword, dribble, drink, eat, fall_floor, fencing, flic_flac, golf,
handstand, hit, hug, jump, kick, kick_ball, kiss, laugh, pick, pour, pullup,
punch, push, pushup, ride_bike, ride_horse, run, shake_hands, shoot_ball,
shoot_bow, shoot_gun, sit, situp, smile, smoke, somersault, stand,
swing_baseball, sword, sword_exercise, talk, throw, turn, walk, wave
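
Because the order follows a sorted directory listing, the logit index is simply the position of the class name in the list above. The sketch below rebuilds that mapping from the names as printed here. CLASSES, CLASS_TO_IDX and IDX_TO_CLASS are hypothetical names, and modeling.py already exports HMDB51_CLASSES for regular use.

```python
CLASSES = [
    "brush_hair", "cartwheel", "catch", "chew", "clap", "climb", "climb_stairs",
    "dive", "draw_sword", "dribble", "drink", "eat", "fall_floor", "fencing",
    "flic_flac", "golf", "handstand", "hit", "hug", "jump", "kick", "kick_ball",
    "kiss", "laugh", "pick", "pour", "pullup", "punch", "push", "pushup",
    "ride_bike", "ride_horse", "run", "shake_hands", "shoot_ball", "shoot_bow",
    "shoot_gun", "sit", "situp", "smile", "smoke", "somersault", "stand",
    "swing_baseball", "sword", "sword_exercise", "talk", "throw", "turn",
    "walk", "wave",
]

# Index <-> name lookups in the same order the model's logits use.
CLASS_TO_IDX = {name: idx for idx, name in enumerate(CLASSES)}
IDX_TO_CLASS = dict(enumerate(CLASSES))

# Sanity checks: 51 classes, already in sorted order.
assert len(CLASSES) == 51
assert CLASSES == sorted(CLASSES)
```

When building labels from a dataset folder, a stray file or hidden directory in the listing would shift every index after it, so filtering the listing to class directories before sorting is a sensible precaution.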

License

Apache-2.0.
