---
library_name: transformers
tags:
- shot type
- shot scale
- movienet
- movieshots
- video classification
license: mit
metrics:
- accuracy
- f1
pipeline_tag: video-classification
---

# VideoMAE finetuned for shot scale classification

A **videomae-base-finetuned-kinetics** model finetuned to classify the shot scale of a video clip into five classes: *ECS (extreme close-up shot), CS (close-up shot), MS (medium shot), FS (full shot) and LS (long shot)*.

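The exact index-to-label order is defined by the checkpoint config rather than by the listing above; as a quick check (not part of the original card), the mapping can be printed directly:

```python
from transformers import AutoConfig

## Inspect the id -> label mapping stored in the checkpoint config;
## the index order depends on the checkpoint, not on the listing above.
config = AutoConfig.from_pretrained("gullalc/videomae-base-finetuned-kinetics-movieshots-scale")
print(config.id2label)
```
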
The [MovieNet](https://movienet.github.io/projects/eccv20shot.html) dataset is used to finetune the model for 5 epochs. The *v1_split_trailer.json* file provides the training, validation and test splits.

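As a rough sketch of working with the split file (its internal schema is an assumption here and should be verified against the actual JSON):

```python
import json

## Load the official split file; the exact keys/structure inside
## v1_split_trailer.json are assumptions -- inspect before use.
with open("v1_split_trailer.json") as f:
    splits = json.load(f)
print(type(splits))
if isinstance(splits, dict):
    print(list(splits.keys()))
```
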

## Evaluation

The model achieves an accuracy of 88.93% and a macro-F1 of 89.19%.

Class-wise accuracies:

| Class | Accuracy |
|-------|----------|
| ECS   | 91.16%   |
| CS    | 83.65%   |
| MS    | 86.20%   |
| FS    | 90.74%   |
| LS    | 94.55%   |

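These metrics can be computed with standard scikit-learn calls; a minimal sketch, where `y_true`/`y_pred` are placeholders for the ground-truth and predicted class indices over the test split, and class-wise accuracy is taken as per-class recall:

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

## Placeholder labels; in practice these come from running the model
## over the MovieNet test split (see "How to use" below).
y_true = [0, 1, 2, 3, 4, 2, 1]
y_pred = [0, 1, 2, 3, 4, 0, 1]

print("accuracy:", accuracy_score(y_true, y_pred))
print("macro-F1:", f1_score(y_true, y_pred, average="macro"))

## Class-wise accuracy (per-class recall) is the diagonal of the
## row-normalized confusion matrix.
print("per-class:", confusion_matrix(y_true, y_pred, normalize="true").diagonal())
```
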

## How to use

The snippet below shows how the model can be tested on a single shot/clip from a video. The same code is used to preprocess, transform and evaluate on the MovieNet test set.

```python
import torch
from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification
from pytorchvideo.transforms import ApplyTransformToKey
from torchvision.transforms import v2
from decord import VideoReader, cpu

## Preprocessor and model loading (done first: the transform below uses the processor's stats)
image_processor = VideoMAEImageProcessor.from_pretrained("gullalc/videomae-base-finetuned-kinetics-movieshots-scale")
model = VideoMAEForVideoClassification.from_pretrained("gullalc/videomae-base-finetuned-kinetics-movieshots-scale")
model.eval()

img_mean = image_processor.image_mean
img_std = image_processor.image_std
height = width = image_processor.size["shortest_edge"]
resize_to = (height, width)

## Evaluation transform: 16 uniformly sampled frames, resized, scaled to [0, 1] and normalized
transform = v2.Compose(
    [
        ApplyTransformToKey(
            key="video",
            transform=v2.Compose(
                [
                    v2.Lambda(lambda x: x.permute(0, 3, 1, 2)),  # (T, H, W, C) -> (T, C, H, W)
                    v2.UniformTemporalSubsample(16),
                    v2.Resize(resize_to),
                    v2.Lambda(lambda x: x / 255.0),
                    v2.Normalize(img_mean, img_std),
                ]
            ),
        ),
    ]
)

## Load video/clip and predict
video_path = "random_clip.mp4"
vr = VideoReader(video_path, width=480, height=270, ctx=cpu(0))
frames_tensor = torch.stack([torch.tensor(vr[i].asnumpy()) for i in range(len(vr))])  # (T, H, W, C)

frames_tensor = transform({"video": frames_tensor})["video"]

with torch.no_grad():
    outputs = model(pixel_values=frames_tensor.unsqueeze(0))  # add batch dim -> (1, T, C, H, W)
pred = torch.argmax(outputs.logits, dim=1).cpu().numpy()

print(model.config.id2label[int(pred[0])])
```
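As a follow-up (continuing from the variables in the snippet above, not part of the original card), a softmax over the logits yields per-class confidence scores:

```python
## Optional: per-class confidence scores, continuing from the snippet above
probs = torch.softmax(outputs.logits, dim=1)[0]
for idx, p in enumerate(probs.tolist()):
    print(f"{model.config.id2label[idx]}: {p:.3f}")
```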