metadata
license: cc-by-nc-4.0
tags:
- vision
- video-classification
TimeSformer (base-sized model, fine-tuned on Something Something v2)
TimeSformer model fine-tuned on Something Something v2. It was introduced in the paper Is Space-Time Attention All You Need for Video Understanding? by Bertasius et al. and first released in this repository.
Disclaimer: The team releasing TimeSformer did not write a model card for this model, so this model card has been written by fcakyon.
Intended uses & limitations
You can use the raw model for video classification into one of the 174 possible Something Something v2 labels.
How to use
Here is how to use this model to classify a video:
from transformers import AutoImageProcessor, TimesformerForVideoClassification
import numpy as np
import torch

# Placeholder clip: 8 random frames, each 3 x 224 x 224 (channels, height, width).
video = list(np.random.randn(8, 3, 224, 224))

processor = AutoImageProcessor.from_pretrained("fcakyon/timesformer-base-finetuned-ssv2")
model = TimesformerForVideoClassification.from_pretrained("fcakyon/timesformer-base-finetuned-ssv2")

inputs = processor(images=video, return_tensors="pt")

# Inference only, so skip gradient tracking.
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

# The highest logit gives the predicted Something Something v2 label.
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])
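The snippet above prints only the single best label. When several of the 174 labels are plausible, it can help to look at the top few candidates with their probabilities. The following is a minimal sketch in plain Python, not part of the released model code: the helper name `top_k_predictions` and the toy three-class inputs are hypothetical and stand in for the real `logits` and `model.config.id2label`.

```python
import math


def top_k_predictions(logits, id2label, k=5):
    """Return the k most probable (label, probability) pairs for one clip."""
    # Numerically stable softmax: subtract the max logit before exponentiating.
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Rank class indices by probability, highest first.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


# Toy stand-ins (hypothetical): three classes instead of the model's 174.
toy_logits = [2.0, 0.5, 1.0]
toy_id2label = {0: "class_a", 1: "class_b", 2: "class_c"}
top2 = top_k_predictions(toy_logits, toy_id2label, k=2)
for label, prob in top2:
    print(f"{label}: {prob:.3f}")
```

With the real model, the same helper could be called as `top_k_predictions(logits[0].tolist(), model.config.id2label)`, assuming a single clip in the batch.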
For more code examples, see the documentation.
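The usage example feeds the model 8 random frames. A real video usually has many more frames, so some of them have to be selected first. The sketch below shows one common option, picking evenly spaced frame indices. It is an assumption for illustration, not necessarily the sampling strategy used to train this checkpoint, and the helper name `sample_frame_indices` is hypothetical. Decoding the frames themselves (with a video library of your choice) is left out.

```python
def sample_frame_indices(total_frames, num_frames=8):
    """Pick num_frames indices spread evenly across a video of total_frames frames."""
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")
    # Split the video into num_frames equal segments and take each segment's midpoint.
    segment = total_frames / num_frames
    indices = [int(segment * (i + 0.5)) for i in range(num_frames)]
    # Defensive clamp so an index can never exceed the last frame.
    return [min(i, total_frames - 1) for i in indices]


# An 80-frame video sampled down to the 8 frames used in the usage example.
indices = sample_frame_indices(total_frames=80, num_frames=8)
print(indices)  # [5, 15, 25, 35, 45, 55, 65, 75]
```

For videos shorter than `num_frames`, this approach repeats frames rather than failing, which keeps the clip length constant.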
BibTeX entry and citation info
@inproceedings{bertasius2021space,
title={Is Space-Time Attention All You Need for Video Understanding?},
author={Bertasius, Gedas and Wang, Heng and Torresani, Lorenzo},
booktitle={International Conference on Machine Learning},
pages={813--824},
year={2021},
organization={PMLR}
}