
VideoMAE finetuned for shot scale classification

This is the videomae-base-finetuned-kinetics model fine-tuned to classify the shot scale of a video clip into five classes: ECS (extreme close-up shot), CS (close-up shot), MS (medium shot), FS (full shot), LS (long shot).
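
If you only need quick predictions, the generic video-classification pipeline from transformers should also work with this checkpoint. A minimal sketch (the clip path is a placeholder; decord must be installed for video decoding):

from transformers import pipeline

## Sketch: quick shot-scale prediction via the video-classification pipeline
clf = pipeline("video-classification", model="gullalc/videomae-base-finetuned-kinetics-movieshots-scale")
print(clf("random_clip.mp4"))  ## list of {"label": ..., "score": ...} over the five scale classes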

The MovieNet dataset is used to fine-tune the model for 5 epochs. The file v1_split_trailer.json provides the training, validation, and test splits.
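
The exact structure of v1_split_trailer.json depends on the MovieNet release you download, so the snippet below is only a sketch that assumes the file maps split names to lists of clip identifiers:

import json

## Sketch: inspecting the split file (assumed top-level keys: "train", "val", "test")
with open("v1_split_trailer.json") as f:
    splits = json.load(f)

print({name: len(ids) for name, ids in splits.items()})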

Evaluation

The model achieves an overall accuracy of 88.93% and a macro-F1 of 89.19% on the MovieNet test split.

Class-wise accuracies: ECS - 91.16%, CS - 83.65%, MS - 86.2%, FS - 90.74%, LS - 94.55%
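
For reference, the numbers above are standard overall accuracy and macro-averaged F1, with the class-wise accuracies interpreted here as per-class recall. A sketch of how they can be computed with scikit-learn, where preds and labels are hypothetical placeholders for predicted and ground-truth class indices on the test split:

import numpy as np
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

classes = ["ECS", "CS", "MS", "FS", "LS"]
preds = np.array([0, 1, 2, 3, 4, 0])   ## hypothetical predictions
labels = np.array([0, 1, 2, 3, 4, 4])  ## hypothetical ground truth

print("accuracy:", accuracy_score(labels, preds))
print("macro-F1:", f1_score(labels, preds, average="macro"))

## Per-class accuracy (recall per class) from the confusion matrix
cm = confusion_matrix(labels, preds, labels=list(range(len(classes))))
per_class_acc = cm.diagonal() / cm.sum(axis=1)
for name, acc in zip(classes, per_class_acc):
    print(f"{name}: {acc:.2%}")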

How to use

The following shows how the model can be tested on a single shot/clip from a video. The same code is used to load, transform, and evaluate clips from the MovieNet test set.

import torch

from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification
from pytorchvideo.transforms import ApplyTransformToKey
from torchvision.transforms import v2
from decord import VideoReader, cpu

## Preprocessor and model loading
image_processor = VideoMAEImageProcessor.from_pretrained("gullalc/videomae-base-finetuned-kinetics-movieshots-scale")
model = VideoMAEForVideoClassification.from_pretrained("gullalc/videomae-base-finetuned-kinetics-movieshots-scale")

## Normalization statistics and target resolution taken from the preprocessor
img_mean = image_processor.image_mean
img_std = image_processor.image_std
height = width = image_processor.size["shortest_edge"]
resize_to = (height, width)

## Evaluation transform (defined after the preprocessor so img_mean, img_std and resize_to are available)
transform = v2.Compose(
    [
        ApplyTransformToKey(
            key="video",
            transform=v2.Compose(
                [
                    v2.Lambda(lambda x: x.permute(0, 3, 1, 2)),  # T, H, W, C -> T, C, H, W
                    v2.UniformTemporalSubsample(16),             # sample 16 frames uniformly
                    v2.Resize(resize_to),
                    v2.Lambda(lambda x: x / 255.0),              # scale pixel values to [0, 1]
                    v2.Normalize(img_mean, img_std),
                ]
            ),
        ),
    ]
)

## Load video/clip and predict
video_path = "random_clip.mp4"
vr = VideoReader(video_path, width=480, height=270, ctx=cpu(0))
frames_tensor = torch.stack([torch.tensor(vr[i].asnumpy()) for i in range(len(vr))])  ## shape: (T, H, W, C)

frames_tensor = transform({"video": frames_tensor})["video"]  ## shape: (16, 3, H, W)

with torch.no_grad():
    outputs = model(pixel_values=frames_tensor.unsqueeze(0))  ## add batch dimension -> (1, 16, 3, H, W)
pred = torch.argmax(outputs.logits, dim=1).cpu().numpy()

print(model.config.id2label[int(pred[0])])
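
To score many clips more quickly, the same transform can be applied per clip and the batch moved to a GPU. A minimal sketch, assuming the model and transform defined above and a hypothetical list of clip paths:

## Sketch: batched inference over several clips (clip_paths is a placeholder list)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()

clip_paths = ["clip1.mp4", "clip2.mp4"]
batch = []
for path in clip_paths:
    vr = VideoReader(path, width=480, height=270, ctx=cpu(0))
    frames = torch.stack([torch.tensor(vr[i].asnumpy()) for i in range(len(vr))])
    batch.append(transform({"video": frames})["video"])

with torch.no_grad():
    logits = model(pixel_values=torch.stack(batch).to(device)).logits

for path, idx in zip(clip_paths, logits.argmax(dim=1).tolist()):
    print(path, "->", model.config.id2label[idx])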