Overview
PyTorchVideo is an open source video understanding library that provides up-to-date builders for state-of-the-art video understanding backbones, layers, heads, and losses. These cover a range of tasks, including acoustic event detection, action recognition (video classification), action detection (video detection), multimodal understanding (acoustic-visual classification), and self-supervised learning.
The models subpackage contains definitions for the following model architectures and layers:
- Acoustic Backbone
- Acoustic ResNet
- Visual Backbone
- Self-Supervised Learning
Build standard models
PyTorchVideo provides default builders to construct state-of-the-art video understanding models, layers, heads, and losses.
Models
You can construct a model with random weights by calling its constructor:
import pytorchvideo.models as models
resnet = models.create_resnet()
acoustic_resnet = models.create_acoustic_resnet()
slowfast = models.create_slowfast()
x3d = models.create_x3d()
r2plus1d = models.create_r2plus1d()
csn = models.create_csn()
You can verify that the model was built successfully by running a forward pass on a dummy input:
import torch
import pytorchvideo.models as models
resnet = models.create_resnet()
# Dummy clip: batch, channels, frames, height, width.
B, C, T, H, W = 2, 3, 8, 224, 224
input_tensor = torch.zeros(B, C, T, H, W)
output = resnet(input_tensor)
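The same check can be wrapped in a small helper that reports the output shape. The sketch below is illustrative only: `forward_shape` and `toy_model` are hypothetical names, and the toy network is a stand-in built from plain `torch.nn` modules so that the snippet runs quickly without building a full video model.

```python
import torch
from torch import nn


def forward_shape(model: nn.Module, input_shape):
    """Run one forward pass on an all-zeros input and return the output shape."""
    model.eval()  # disable dropout and use running batch-norm statistics
    with torch.no_grad():  # no gradients are needed for a smoke test
        output = model(torch.zeros(*input_shape))
    return tuple(output.shape)


# Hypothetical stand-in for a video classifier (not a PyTorchVideo model).
toy_model = nn.Sequential(
    nn.Conv3d(3, 8, kernel_size=3, padding=1),  # keeps T, H, W unchanged
    nn.ReLU(),
    nn.AdaptiveAvgPool3d(1),  # global average over T, H, W
    nn.Flatten(),
    nn.Linear(8, 400),  # 400 output classes, chosen for illustration
)

# Input layout: batch, channels, frames, height, width.
shape = forward_shape(toy_model, (2, 3, 8, 32, 32))
print(shape)
```

Passing a model from one of the builders above, together with an input shape that the model accepts, should work the same way.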
Layers
You can construct a layer with random weights by calling its constructor:
import pytorchvideo.layers as layers
# "nonlocal" is a reserved keyword in Python, so use a different variable name.
nonlocal_block = layers.create_nonlocal(dim_in=256, dim_inner=128)
swish = layers.Swish()
conv_2plus1d = layers.create_conv_2plus1d(in_channels=256, out_channels=512)
You can verify that each layer was built successfully by running a forward pass on a dummy input:
import torch
import pytorchvideo.layers as layers
# All three layers take the same dummy feature map: batch, channels, frames, height, width.
B, C, T, H, W = 2, 256, 4, 14, 14
input_tensor = torch.zeros(B, C, T, H, W)
# "nonlocal" is a reserved keyword in Python, so use a different variable name.
nonlocal_block = layers.create_nonlocal(dim_in=256, dim_inner=128)
output = nonlocal_block(input_tensor)
swish = layers.Swish()
output = swish(input_tensor)
conv_2plus1d = layers.create_conv_2plus1d(in_channels=256, out_channels=512)
output = conv_2plus1d(input_tensor)
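For reference, Swish is usually defined as x * sigmoid(x). The scalar sketch below assumes `layers.Swish` follows that standard definition; the function `swish` here is a hypothetical pure-Python helper for illustration, not part of the library.

```python
import math


def swish(x: float) -> float:
    """Swish activation for a single value: x * sigmoid(x)."""
    # x * (1 / (1 + exp(-x))) simplifies to x / (1 + exp(-x)).
    return x / (1.0 + math.exp(-x))


# Negative inputs are damped towards zero, zero maps to zero,
# and large positive inputs pass through almost unchanged.
values = [swish(v) for v in (-2.0, 0.0, 2.0)]
print(values)
```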
Heads
You can construct a head with random weights by calling its constructor:
import pytorchvideo.models as models
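# in_features and out_features are integers you choose: the number of input channels and of outputs.
res_head = models.head.create_res_basic_head(in_features=in_features, out_features=out_features)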
x3d_head = models.x3d.create_x3d_head(dim_in=1024, dim_inner=512, dim_out=2048, num_classes=400)
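You can verify that each head was built successfully by running a forward pass on a dummy input:
import torch
import pytorchvideo.models as models
# Example values; in_features must match the channel dimension C of the input.
in_features, out_features = 256, 400
res_head = models.head.create_res_basic_head(in_features=in_features, out_features=out_features)
B, C, T, H, W = 2, 256, 4, 14, 14
input_tensor = torch.zeros(B, C, T, H, W)
output = res_head(input_tensor)
x3d_head = models.x3d.create_x3d_head(dim_in=1024, dim_inner=512, dim_out=2048, num_classes=400)
# C must equal dim_in, and T, H, W must be at least as large as the head's pooling kernel
# (assumed here to default to 13 x 5 x 5).
B, C, T, H, W = 2, 1024, 13, 5, 5
input_tensor = torch.zeros(B, C, T, H, W)
output = x3d_head(input_tensor)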
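Heads typically pool the feature map before projecting it, so the dummy input must be at least as large as the pooling kernel along T, H, and W. The sketch below applies the standard sliding-window size formula to check this ahead of time. `pooled_size` and `pooled_shape` are hypothetical helpers, and the kernel sizes are illustrative assumptions rather than values read from the library.

```python
def pooled_size(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Output length of a pooling window along one dimension (floor mode)."""
    if size + 2 * padding < kernel:
        raise ValueError(
            f"input length {size} (padding {padding}) is smaller than kernel {kernel}"
        )
    return (size + 2 * padding - kernel) // stride + 1


def pooled_shape(thw, kernel_thw, stride_thw=(1, 1, 1), padding_thw=(0, 0, 0)):
    """Apply pooled_size independently to the T, H, and W dimensions."""
    return tuple(
        pooled_size(s, k, st, p)
        for s, k, st, p in zip(thw, kernel_thw, stride_thw, padding_thw)
    )


# A 4 x 14 x 14 feature map pooled with an assumed 1 x 7 x 7 kernel, stride 1.
shape = pooled_shape((4, 14, 14), (1, 7, 7))
print(shape)
```

If a kernel dimension exceeds the corresponding input dimension, `pooled_shape` raises a `ValueError`, which mirrors the kind of shape error a real pooling layer would report.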
Losses
You can construct a loss by calling its constructor:
import pytorchvideo.models as models
simclr_loss = models.SimCLR()
You can verify that the loss was built successfully by computing it on two dummy views:
import torch
import pytorchvideo.models as models
import pytorchvideo.layers as layers
resnet = models.create_resnet()
mlp = layers.make_multilayer_perceptron(fully_connected_dims=(2048, 1024, 2048))
simclr_loss = models.SimCLR(mlp=mlp, backbone=resnet)
# The views are fed to the ResNet backbone, so they need a clip shape it accepts
# (3 input channels, as in the ResNet example above).
B, C, T, H, W = 2, 3, 8, 224, 224
view1, view2 = torch.zeros(B, C, T, H, W), torch.zeros(B, C, T, H, W)
loss = simclr_loss(view1, view2)
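To give a feel for what a SimCLR-style loss computes, here is a simplified single-process sketch of the normalized temperature-scaled cross-entropy (NT-Xent) commonly used in contrastive learning. It is not PyTorchVideo's implementation: `nt_xent` is a hypothetical function, it operates on already-computed embeddings rather than on video clips, and the temperature is an arbitrary illustrative value.

```python
import torch
import torch.nn.functional as F


def nt_xent(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """NT-Xent loss for two batches of embeddings, each of shape (B, D)."""
    n = z1.shape[0]
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2B, D), unit length
    sim = z @ z.t() / temperature  # pairwise cosine similarity, scaled
    # A sample must not be treated as its own candidate match.
    self_mask = torch.eye(2 * n, dtype=torch.bool)
    sim = sim.masked_fill(self_mask, float("-inf"))
    # Row i in the first half is paired with row i + n, and vice versa.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)


views = torch.eye(4)  # four mutually orthogonal embeddings
aligned = nt_xent(views, views)  # each sample is paired with itself
shuffled = nt_xent(views, views.roll(1, dims=0))  # pairs point at the wrong sample
print(aligned.item() < shuffled.item())
```

The loss is low when the two views of each sample map to nearby embeddings and high when they do not, which is the signal that drives the backbone during self-supervised training.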
Build customized models
PyTorchVideo also supports building models with customized components, which is an important feature for video understanding research. Each ResNet component (head, backbone, stem) can be built separately, and any standard component can be replaced with your own. The example below uses the standard stem to show how to swap in a customized norm, convolution, activation, or pooling layer.
from pytorchvideo.models.stem import create_res_basic_stem
# Create standard stem layer.
stem = create_res_basic_stem(in_channels=3, out_channels=64)
# Create customized stem layer with YourFancyNorm
stem = create_res_basic_stem(
    in_channels=3,
    out_channels=64,
    norm=YourFancyNorm,  # GhostNorm for example
)
# Create customized stem layer with YourFancyConv
stem = create_res_basic_stem(
    in_channels=3,
    out_channels=64,
    conv=YourFancyConv,  # OctConv for example
)
# Create customized stem layer with YourFancyAct
stem = create_res_basic_stem(
    in_channels=3,
    out_channels=64,
    activation=YourFancyAct,  # Swish for example
)
# Create customized stem layer with YourFancyPool
stem = create_res_basic_stem(
    in_channels=3,
    out_channels=64,
    pool=YourFancyPool,  # MinPool for example
)
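The pattern behind these examples is that each component is passed as a callable (usually a module class) which the builder invokes to create the actual layer. The sketch below illustrates that idea with a hypothetical stand-in builder, `make_toy_stem`, written with plain `torch.nn` modules. It is not the library's `create_res_basic_stem`, and its kernel sizes and strides are illustrative assumptions.

```python
import torch
from torch import nn


def make_toy_stem(
    in_channels: int,
    out_channels: int,
    conv=nn.Conv3d,
    norm=nn.BatchNorm3d,
    activation=nn.ReLU,
    pool=nn.MaxPool3d,
) -> nn.Sequential:
    """Hypothetical stem builder: every component is a swappable callable."""
    return nn.Sequential(
        conv(
            in_channels,
            out_channels,
            kernel_size=(3, 7, 7),
            stride=(1, 2, 2),
            padding=(1, 3, 3),
            bias=False,
        ),
        norm(out_channels),
        activation(),
        pool(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
    )


# Default components.
default_stem = make_toy_stem(3, 64)

# Swap the activation and pooling by passing different classes.
# nn.SiLU computes x * sigmoid(x), the same function as Swish.
custom_stem = make_toy_stem(3, 64, activation=nn.SiLU, pool=nn.AvgPool3d)

x = torch.zeros(2, 3, 8, 64, 64)  # batch, channels, frames, height, width
out_shape = tuple(custom_stem(x).shape)
print(out_shape)
```

Because the builder only calls the objects it receives, any class with a compatible constructor signature can be dropped in, which is what makes this style convenient for experimenting with new layers.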