Video Classification

Video classification is the task of assigning a label or class to an entire video. Videos are expected to have only one class for each video. Video classification models take a video as input and return a prediction about which class the video belongs to.

Video Classification Model
Playing Guitar
Playing Tennis

About Video Classification

Use Cases

Video classification models can be used to categorize what a video is all about.

Activity Recognition

Video classification models are used to perform activity recognition which is useful for fitness applications. Activity recognition is also helpful for vision-impaired individuals especially when they're commuting.

Video Search

Models trained in video classification can improve user experience by organizing and categorizing video galleries on the phone or in the cloud, on multiple keywords or tags.


Below you can find code for inferring with a pre-trained video classification model.

from transformers import VideoMAEFeatureExtractor, VideoMAEForVideoClassification
from pytorchvideo.transforms import UniformTemporalSubsample
from pytorchvideo.data.encoded_video import EncodedVideo

# Load the video.
video = EncodedVideo.from_path("path_to_video.mp4")
video_data = video.get_clip(start_sec=0, end_sec=4.0)["video"]

# Sub-sample a fixed set of frames and convert them to a NumPy array.
num_frames = 16
subsampler = UniformTemporalSubsample(num_frames)
subsampled_frames = subsampler(video_data)
video_data_np = subsampled_frames.numpy().transpose(1, 2, 3, 0)

# Preprocess the video frames.
inputs = feature_extractor(list(video_data_np), return_tensors="pt")

# Run inference
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

# Model predicts one of the 400 Kinetics 400 classes
predicted_label = logits.argmax(-1).item()
# `eating spaghetti` (if you chose this video:
# https://hf.co/datasets/nielsr/video-demo/resolve/main/eating_spaghetti.mp4)

Useful Resources

Creating your own video classifier in minutes

Video Classification demo

No example widget is defined for this task.

Note Contribute by proposing a widget for this task !

Models for Video Classification
Browse Models (22)

Note Strong Video Classification model trained on the Kinects 400 dataset.

Note Strong Video Classification model trained on the Kinects 400 dataset.

Datasets for Video Classification

No example dataset is defined for this task.

Note Contribute by proposing a dataset for this task !

Metrics for Video Classification
Accuracy is the proportion of correct predictions among the total number of cases processed. It can be computed with: Accuracy = (TP + TN) / (TP + TN + FP + FN) Where: TP: True positive TN: True negative FP: False positive FN: False negative
Recall is the fraction of the positive examples that were correctly labeled by the model as positive. It can be computed with the equation: Recall = TP / (TP + FN) Where TP is the true positives and FN is the false negatives.
Precision is the fraction of correctly labeled positive examples out of all of the examples that were labeled as positive. It is computed via the equation: Precision = TP / (TP + FP) where TP is the True positives (i.e. the examples correctly labeled as positive) and FP is the False positive examples (i.e. the examples incorrectly labeled as positive).
The F1 score is the harmonic mean of the precision and recall. It can be computed with the equation: F1 = 2 * (precision * recall) / (precision + recall)