|
# A 20-Minute Guide to MMAction2 Framework
|
|
|
|
In this tutorial, we will demonstrate the overall architecture of `MMAction2 1.0` through a step-by-step example of video action recognition.
|
|
|
|
The structure of this tutorial is as follows:
|
|
|
|
- [A 20-Minute Guide to MMAction2 Framework](#a-20-minute-guide-to-mmaction2-framework)
|
|
- [Step0: Prepare Data](#step0-prepare-data)
|
|
- [Step1: Build a Pipeline](#step1-build-a-pipeline)
|
|
- [Step2: Build a Dataset and DataLoader](#step2-build-a-dataset-and-dataloader)
|
|
- [Step3: Build a Recognizer](#step3-build-a-recognizer)
|
|
  - [Step4: Build an Evaluation Metric](#step4-build-an-evaluation-metric)
|
|
- [Step5: Train and Test with Native PyTorch](#step5-train-and-test-with-native-pytorch)
|
|
- [Step6: Train and Test with MMEngine (Recommended)](#step6-train-and-test-with-mmengine-recommended)
|
|
|
|
First, we need to initialize the default `scope` of the registry to ensure that each module is registered under the `mmaction` scope. For more detailed information about the registry, please refer to [MMEngine Tutorial](https://mmengine.readthedocs.io/en/latest/advanced_tutorials/registry.html).
|
|
|
|
```python
|
|
from mmaction.utils import register_all_modules
|
|
|
|
register_all_modules(init_default_scope=True)
|
|
```
|
|
|
|
## Step0: Prepare Data
|
|
|
|
Please download our self-made [kinetics400_tiny](https://download.openmmlab.com/mmaction/kinetics400_tiny.zip) dataset and extract it to the `$MMACTION2/data` directory.
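If you prefer to script this step, here is a minimal sketch in plain Python (run it from the `$MMACTION2` root; it assumes the archive contains a top-level `kinetics400_tiny` folder):

```python
import urllib.request
import zipfile

# Download the archive and unpack it into the data/ directory.
url = 'https://download.openmmlab.com/mmaction/kinetics400_tiny.zip'
urllib.request.urlretrieve(url, 'kinetics400_tiny.zip')
with zipfile.ZipFile('kinetics400_tiny.zip') as zf:
    zf.extractall('data/')
```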
|
|
The directory structure after extraction should be as follows:
|
|
|
|
```
|
|
mmaction2
├── data
│   ├── kinetics400_tiny
│   │   ├── kinetics_tiny_train_video.txt
│   │   ├── kinetics_tiny_val_video.txt
│   │   ├── train
│   │   │   ├── 27_CSXByd3s.mp4
│   │   │   ├── 34XczvTaRiI.mp4
│   │   │   ├── A-wiliK50Zw.mp4
│   │   │   └── ...
│   │   └── val
│   │       ├── 0pVGiAU6XEA.mp4
│   │       ├── AQrbRSnRt8M.mp4
│   │       └── ...
|
|
```
|
|
|
|
Here are some examples from the annotation file `kinetics_tiny_train_video.txt`:
|
|
|
|
```
|
|
D32_1gwq35E.mp4 0
|
|
iRuyZSKhHRg.mp4 1
|
|
oXy-e_P_cAI.mp4 0
|
|
34XczvTaRiI.mp4 1
|
|
h2YqqUhnR34.mp4 0
|
|
```
|
|
|
|
Each line in the file is the annotation of one video: the first item is the video filename (e.g., `D32_1gwq35E.mp4`), and the second item is the corresponding label (e.g., label `0` for `D32_1gwq35E.mp4`). This dataset contains only two categories.
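As a quick sanity check, the annotation file can be parsed with a few lines of plain Python (a minimal sketch; the dataset class in Step2 does the same thing via `mmengine`'s `list_from_file`):

```python
# Minimal sketch: read the annotation file into (filename, label) pairs.
ann_file = 'data/kinetics400_tiny/kinetics_tiny_train_video.txt'
with open(ann_file) as f:
    samples = [line.strip().split() for line in f if line.strip()]

for filename, label in samples[:3]:
    print(filename, int(label))
```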
|
|
|
|
## Step1: Build a Pipeline
|
|
|
|
To `decode`, `sample`, `resize`, `crop`, `format`, and `pack` the input video and its annotation, we need to design a pipeline to handle these processes. Specifically, we design seven `Transform` classes to build this video processing pipeline. Note that all `Transform` classes in OpenMMLab must inherit from the `BaseTransform` class in `mmcv`, implement the abstract method `transform`, and be registered to the `TRANSFORMS` registry. For more detailed information about data transforms, please refer to [MMEngine Tutorial](https://mmengine.readthedocs.io/en/latest/advanced_tutorials/data_transform.html).
|
|
|
|
```python
|
|
import mmcv
|
|
import decord
|
|
import numpy as np
|
|
from mmcv.transforms import TRANSFORMS, BaseTransform, to_tensor
|
|
from mmaction.structures import ActionDataSample
|
|
|
|
|
|
@TRANSFORMS.register_module()
|
|
class VideoInit(BaseTransform):
|
|
def transform(self, results):
|
|
container = decord.VideoReader(results['filename'])
|
|
results['total_frames'] = len(container)
|
|
results['video_reader'] = container
|
|
return results
|
|
|
|
|
|
@TRANSFORMS.register_module()
|
|
class VideoSample(BaseTransform):
|
|
def __init__(self, clip_len, num_clips, test_mode=False):
|
|
self.clip_len = clip_len
|
|
self.num_clips = num_clips
|
|
self.test_mode = test_mode
|
|
|
|
def transform(self, results):
|
|
total_frames = results['total_frames']
|
|
interval = total_frames // self.clip_len
|
|
|
|
if self.test_mode:
|
|
# Make the sampling during testing deterministic
|
|
np.random.seed(42)
|
|
|
|
inds_of_all_clips = []
|
|
for i in range(self.num_clips):
|
|
bids = np.arange(self.clip_len) * interval
|
|
offset = np.random.randint(interval, size=bids.shape)
|
|
inds = bids + offset
|
|
inds_of_all_clips.append(inds)
|
|
|
|
results['frame_inds'] = np.concatenate(inds_of_all_clips)
|
|
results['clip_len'] = self.clip_len
|
|
results['num_clips'] = self.num_clips
|
|
return results
|
|
|
|
|
|
@TRANSFORMS.register_module()
|
|
class VideoDecode(BaseTransform):
|
|
def transform(self, results):
|
|
frame_inds = results['frame_inds']
|
|
container = results['video_reader']
|
|
|
|
imgs = container.get_batch(frame_inds).asnumpy()
|
|
imgs = list(imgs)
|
|
|
|
results['video_reader'] = None
|
|
del container
|
|
|
|
results['imgs'] = imgs
|
|
results['img_shape'] = imgs[0].shape[:2]
|
|
return results
|
|
|
|
|
|
@TRANSFORMS.register_module()
|
|
class VideoResize(BaseTransform):
|
|
def __init__(self, r_size):
|
|
self.r_size = (np.inf, r_size)
|
|
|
|
def transform(self, results):
|
|
img_h, img_w = results['img_shape']
|
|
new_w, new_h = mmcv.rescale_size((img_w, img_h), self.r_size)
|
|
|
|
imgs = [mmcv.imresize(img, (new_w, new_h))
|
|
for img in results['imgs']]
|
|
results['imgs'] = imgs
|
|
results['img_shape'] = imgs[0].shape[:2]
|
|
return results
|
|
|
|
|
|
@TRANSFORMS.register_module()
|
|
class VideoCrop(BaseTransform):
|
|
def __init__(self, c_size):
|
|
self.c_size = c_size
|
|
|
|
def transform(self, results):
|
|
img_h, img_w = results['img_shape']
|
|
center_x, center_y = img_w // 2, img_h // 2
|
|
x1, x2 = center_x - self.c_size // 2, center_x + self.c_size // 2
|
|
y1, y2 = center_y - self.c_size // 2, center_y + self.c_size // 2
|
|
imgs = [img[y1:y2, x1:x2] for img in results['imgs']]
|
|
results['imgs'] = imgs
|
|
results['img_shape'] = imgs[0].shape[:2]
|
|
return results
|
|
|
|
|
|
@TRANSFORMS.register_module()
|
|
class VideoFormat(BaseTransform):
|
|
def transform(self, results):
|
|
num_clips = results['num_clips']
|
|
clip_len = results['clip_len']
|
|
imgs = results['imgs']
|
|
|
|
# [num_clips*clip_len, H, W, C]
|
|
imgs = np.array(imgs)
|
|
# [num_clips, clip_len, H, W, C]
|
|
imgs = imgs.reshape((num_clips, clip_len) + imgs.shape[1:])
|
|
# [num_clips, C, clip_len, H, W]
|
|
imgs = imgs.transpose(0, 4, 1, 2, 3)
|
|
|
|
results['imgs'] = imgs
|
|
return results
|
|
|
|
|
|
@TRANSFORMS.register_module()
|
|
class VideoPack(BaseTransform):
|
|
def __init__(self, meta_keys=('img_shape', 'num_clips', 'clip_len')):
|
|
self.meta_keys = meta_keys
|
|
|
|
def transform(self, results):
|
|
packed_results = dict()
|
|
inputs = to_tensor(results['imgs'])
|
|
data_sample = ActionDataSample()
|
|
data_sample.set_gt_label(results['label'])
|
|
metainfo = {k: results[k] for k in self.meta_keys if k in results}
|
|
data_sample.set_metainfo(metainfo)
|
|
packed_results['inputs'] = inputs
|
|
packed_results['data_samples'] = data_sample
|
|
return packed_results
|
|
```
|
|
|
|
Below, we provide a code snippet (using `D32_1gwq35E.mp4 0` from the annotation file) to demonstrate how to use the pipeline.
|
|
|
|
```python
|
|
import os.path as osp
|
|
from mmengine.dataset import Compose
|
|
|
|
pipeline_cfg = [
|
|
dict(type='VideoInit'),
|
|
dict(type='VideoSample', clip_len=16, num_clips=1, test_mode=False),
|
|
dict(type='VideoDecode'),
|
|
dict(type='VideoResize', r_size=256),
|
|
dict(type='VideoCrop', c_size=224),
|
|
dict(type='VideoFormat'),
|
|
dict(type='VideoPack')
|
|
]
|
|
|
|
pipeline = Compose(pipeline_cfg)
|
|
data_prefix = 'data/kinetics400_tiny/train'
|
|
results = dict(filename=osp.join(data_prefix, 'D32_1gwq35E.mp4'), label=0)
|
|
packed_results = pipeline(results)
|
|
|
|
inputs = packed_results['inputs']
|
|
data_sample = packed_results['data_samples']
|
|
|
|
print('shape of the inputs: ', inputs.shape)
|
|
|
|
# Get metainfo of the inputs
|
|
print('image_shape: ', data_sample.img_shape)
|
|
print('num_clips: ', data_sample.num_clips)
|
|
print('clip_len: ', data_sample.clip_len)
|
|
|
|
# Get label of the inputs
|
|
print('label: ', data_sample.gt_label)
|
|
```
|
|
|
|
```
|
|
shape of the inputs: torch.Size([1, 3, 16, 224, 224])
|
|
image_shape: (224, 224)
|
|
num_clips: 1
|
|
clip_len: 16
|
|
label: tensor([0])
|
|
```
|
|
|
|
## Step2: Build a Dataset and DataLoader
|
|
|
|
All `Dataset` classes in OpenMMLab must inherit from the `BaseDataset` class in `mmengine`. We can customize the annotation loading process by overriding the `load_data_list` method, and we can add extra information to the `results` dict that is passed to the `pipeline` by overriding the `get_data_info` method. For more detailed information about the `BaseDataset` class, please refer to [MMEngine Tutorial](https://mmengine.readthedocs.io/en/latest/advanced_tutorials/basedataset.html).
|
|
|
|
```python
|
|
import os.path as osp
|
|
from mmengine.fileio import list_from_file
|
|
from mmengine.dataset import BaseDataset
|
|
from mmaction.registry import DATASETS
|
|
|
|
|
|
@DATASETS.register_module()
|
|
class DatasetZelda(BaseDataset):
|
|
def __init__(self, ann_file, pipeline, data_root, data_prefix=dict(video=''),
|
|
test_mode=False, modality='RGB', **kwargs):
|
|
self.modality = modality
|
|
super(DatasetZelda, self).__init__(ann_file=ann_file, pipeline=pipeline, data_root=data_root,
|
|
data_prefix=data_prefix, test_mode=test_mode,
|
|
**kwargs)
|
|
|
|
def load_data_list(self):
|
|
data_list = []
|
|
fin = list_from_file(self.ann_file)
|
|
for line in fin:
|
|
line_split = line.strip().split()
|
|
filename, label = line_split
|
|
label = int(label)
|
|
filename = osp.join(self.data_prefix['video'], filename)
|
|
data_list.append(dict(filename=filename, label=label))
|
|
return data_list
|
|
|
|
def get_data_info(self, idx: int) -> dict:
|
|
data_info = super().get_data_info(idx)
|
|
data_info['modality'] = self.modality
|
|
return data_info
|
|
```
|
|
|
|
Next, we will demonstrate how to use the dataset and dataloader to index data. We will use the `Runner.build_dataloader` method to construct the dataloader. For more detailed information about dataloaders, please refer to [MMEngine Tutorial](https://mmengine.readthedocs.io/en/latest/tutorials/dataset.html#details-on-dataloader).
|
|
|
|
```python
|
|
from mmaction.registry import DATASETS
|
|
|
|
train_pipeline_cfg = [
|
|
dict(type='VideoInit'),
|
|
dict(type='VideoSample', clip_len=16, num_clips=1, test_mode=False),
|
|
dict(type='VideoDecode'),
|
|
dict(type='VideoResize', r_size=256),
|
|
dict(type='VideoCrop', c_size=224),
|
|
dict(type='VideoFormat'),
|
|
dict(type='VideoPack')
|
|
]
|
|
|
|
val_pipeline_cfg = [
|
|
dict(type='VideoInit'),
|
|
dict(type='VideoSample', clip_len=16, num_clips=5, test_mode=True),
|
|
dict(type='VideoDecode'),
|
|
dict(type='VideoResize', r_size=256),
|
|
dict(type='VideoCrop', c_size=224),
|
|
dict(type='VideoFormat'),
|
|
dict(type='VideoPack')
|
|
]
|
|
|
|
train_dataset_cfg = dict(
|
|
type='DatasetZelda',
|
|
ann_file='kinetics_tiny_train_video.txt',
|
|
pipeline=train_pipeline_cfg,
|
|
data_root='data/kinetics400_tiny/',
|
|
data_prefix=dict(video='train'))
|
|
|
|
val_dataset_cfg = dict(
|
|
type='DatasetZelda',
|
|
ann_file='kinetics_tiny_val_video.txt',
|
|
pipeline=val_pipeline_cfg,
|
|
data_root='data/kinetics400_tiny/',
|
|
data_prefix=dict(video='val'))
|
|
|
|
train_dataset = DATASETS.build(train_dataset_cfg)
|
|
|
|
packed_results = train_dataset[0]
|
|
|
|
inputs = packed_results['inputs']
|
|
data_sample = packed_results['data_samples']
|
|
|
|
print('shape of the inputs: ', inputs.shape)
|
|
|
|
# Get metainfo of the inputs
|
|
print('image_shape: ', data_sample.img_shape)
|
|
print('num_clips: ', data_sample.num_clips)
|
|
print('clip_len: ', data_sample.clip_len)
|
|
|
|
# Get label of the inputs
|
|
print('label: ', data_sample.gt_label)
|
|
|
|
from mmengine.runner import Runner
|
|
|
|
BATCH_SIZE = 2
|
|
|
|
train_dataloader_cfg = dict(
|
|
batch_size=BATCH_SIZE,
|
|
num_workers=0,
|
|
persistent_workers=False,
|
|
sampler=dict(type='DefaultSampler', shuffle=True),
|
|
dataset=train_dataset_cfg)
|
|
|
|
val_dataloader_cfg = dict(
|
|
batch_size=BATCH_SIZE,
|
|
num_workers=0,
|
|
persistent_workers=False,
|
|
sampler=dict(type='DefaultSampler', shuffle=False),
|
|
dataset=val_dataset_cfg)
|
|
|
|
train_data_loader = Runner.build_dataloader(dataloader=train_dataloader_cfg)
|
|
val_data_loader = Runner.build_dataloader(dataloader=val_dataloader_cfg)
|
|
|
|
batched_packed_results = next(iter(train_data_loader))
|
|
|
|
batched_inputs = batched_packed_results['inputs']
|
|
batched_data_sample = batched_packed_results['data_samples']
|
|
|
|
assert len(batched_inputs) == BATCH_SIZE
|
|
assert len(batched_data_sample) == BATCH_SIZE
|
|
```
|
|
|
|
The terminal output should be the same as that shown in [Step1: Build a Pipeline](#step1-build-a-pipeline).
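To spot-check the collated batch itself, you can inspect the first sample (a minimal sketch reusing the `batched_inputs` and `batched_data_sample` variables from the snippet above):

```python
# Each element of the batch still has shape [num_clips, C, clip_len, H, W].
print('shape of the first sample: ', batched_inputs[0].shape)
print('label of the first sample: ', batched_data_sample[0].gt_label)
```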
|
|
|
|
## Step3: Build a Recognizer
|
|
|
|
Next, we will construct the `recognizer`, which mainly consists of three parts: `data preprocessor` for batching and normalizing the data, `backbone` for feature extraction, and `cls_head` for classification.
|
|
|
|
The implementation of `data_preprocessor` is as follows:
|
|
|
|
```python
|
|
import torch
|
|
from mmengine.model import BaseDataPreprocessor, stack_batch
|
|
from mmaction.registry import MODELS
|
|
|
|
|
|
@MODELS.register_module()
|
|
class DataPreprocessorZelda(BaseDataPreprocessor):
|
|
def __init__(self, mean, std):
|
|
super().__init__()
|
|
|
|
self.register_buffer(
|
|
'mean',
|
|
torch.tensor(mean, dtype=torch.float32).view(-1, 1, 1, 1),
|
|
False)
|
|
self.register_buffer(
|
|
'std',
|
|
torch.tensor(std, dtype=torch.float32).view(-1, 1, 1, 1),
|
|
False)
|
|
|
|
def forward(self, data, training=False):
|
|
data = self.cast_data(data)
|
|
inputs = data['inputs']
|
|
batch_inputs = stack_batch(inputs) # Batching
|
|
batch_inputs = (batch_inputs - self.mean) / self.std # Normalization
|
|
data['inputs'] = batch_inputs
|
|
return data
|
|
```
|
|
|
|
Here is the usage of data_preprocessor: feed the `batched_packed_results` obtained from the [Step2: Build a Dataset and DataLoader](#step2-build-a-dataset-and-dataloader) into the `data_preprocessor` for batching and normalization.
|
|
|
|
```python
|
|
from mmaction.registry import MODELS
|
|
|
|
data_preprocessor_cfg = dict(
|
|
type='DataPreprocessorZelda',
|
|
mean=[123.675, 116.28, 103.53],
|
|
std=[58.395, 57.12, 57.375])
|
|
|
|
data_preprocessor = MODELS.build(data_preprocessor_cfg)
|
|
|
|
preprocessed_inputs = data_preprocessor(batched_packed_results)
|
|
print(preprocessed_inputs['inputs'].shape)
|
|
```
|
|
|
|
```
|
|
torch.Size([2, 1, 3, 16, 224, 224])
|
|
```
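The leading two dimensions are the batch size and the number of clips per video. The recognizer defined below merges them into a single `batch_size * num_clips` dimension before feeding the backbone; a minimal sketch of that reshaping, using the tensor produced above, looks like this:

```python
# Minimal sketch: merge the batch and clip dimensions, as RecognizerZelda.extract_feat does later.
batch_inputs = preprocessed_inputs['inputs']                    # [2, 1, 3, 16, 224, 224]
flattened = batch_inputs.view((-1, ) + batch_inputs.shape[2:])
print(flattened.shape)                                          # torch.Size([2, 3, 16, 224, 224])
```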
|
|
|
|
The implementations of `backbone`, `cls_head` and `recognizer` are as follows:
|
|
|
|
```python
|
|
import torch
|
|
import torch.nn as nn
|
|
import torch.nn.functional as F
|
|
from mmengine.model import BaseModel, BaseModule, Sequential
|
|
|
|
from mmaction.registry import MODELS
|
|
|
|
|
|
@MODELS.register_module()
|
|
class BackBoneZelda(BaseModule):
|
|
def __init__(self, init_cfg=None):
|
|
if init_cfg is None:
|
|
init_cfg = [dict(type='Kaiming', layer='Conv3d', mode='fan_out', nonlinearity="relu"),
|
|
dict(type='Constant', layer='BatchNorm3d', val=1, bias=0)]
|
|
|
|
super(BackBoneZelda, self).__init__(init_cfg=init_cfg)
|
|
|
|
self.conv1 = Sequential(nn.Conv3d(3, 64, kernel_size=(3, 7, 7),
|
|
stride=(1, 2, 2), padding=(1, 3, 3)),
|
|
nn.BatchNorm3d(64), nn.ReLU())
|
|
self.maxpool = nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2),
|
|
padding=(0, 1, 1))
|
|
|
|
self.conv = Sequential(nn.Conv3d(64, 128, kernel_size=3, stride=2, padding=1),
|
|
nn.BatchNorm3d(128), nn.ReLU())
|
|
|
|
def forward(self, imgs):
|
|
# imgs: [batch_size*num_views, 3, T, H, W]
|
|
# features: [batch_size*num_views, 128, T/2, H//8, W//8]
|
|
features = self.conv(self.maxpool(self.conv1(imgs)))
|
|
return features
|
|
|
|
|
|
@MODELS.register_module()
|
|
class ClsHeadZelda(BaseModule):
|
|
def __init__(self, num_classes, in_channels, dropout=0.5, average_clips='prob', init_cfg=None):
|
|
if init_cfg is None:
|
|
init_cfg = dict(type='Normal', layer='Linear', std=0.01)
|
|
|
|
super(ClsHeadZelda, self).__init__(init_cfg=init_cfg)
|
|
|
|
self.num_classes = num_classes
|
|
self.in_channels = in_channels
|
|
self.average_clips = average_clips
|
|
|
|
if dropout != 0:
|
|
self.dropout = nn.Dropout(dropout)
|
|
else:
|
|
self.dropout = None
|
|
|
|
self.fc = nn.Linear(self.in_channels, self.num_classes)
|
|
self.pool = nn.AdaptiveAvgPool3d(1)
|
|
self.loss_fn = nn.CrossEntropyLoss()
|
|
|
|
def forward(self, x):
|
|
N, C, T, H, W = x.shape
|
|
x = self.pool(x)
|
|
x = x.view(N, C)
|
|
assert x.shape[1] == self.in_channels
|
|
|
|
if self.dropout is not None:
|
|
x = self.dropout(x)
|
|
|
|
cls_scores = self.fc(x)
|
|
return cls_scores
|
|
|
|
def loss(self, feats, data_samples):
|
|
cls_scores = self(feats)
|
|
labels = torch.stack([x.gt_label for x in data_samples])
|
|
labels = labels.squeeze()
|
|
|
|
if labels.shape == torch.Size([]):
|
|
labels = labels.unsqueeze(0)
|
|
|
|
loss_cls = self.loss_fn(cls_scores, labels)
|
|
return dict(loss_cls=loss_cls)
|
|
|
|
def predict(self, feats, data_samples):
|
|
cls_scores = self(feats)
|
|
num_views = cls_scores.shape[0] // len(data_samples)
|
|
# assert num_views == data_samples[0].num_clips
|
|
cls_scores = self.average_clip(cls_scores, num_views)
|
|
|
|
        for ds, sc in zip(data_samples, cls_scores):
            # Attach the averaged scores to each sample so that `pred_score` is available downstream.
            ds.set_pred_score(sc)
|
|
return data_samples
|
|
|
|
def average_clip(self, cls_scores, num_views):
|
|
if self.average_clips not in ['score', 'prob', None]:
|
|
raise ValueError(f'{self.average_clips} is not supported. '
|
|
f'Currently supported ones are '
|
|
f'["score", "prob", None]')
|
|
|
|
total_views = cls_scores.shape[0]
|
|
cls_scores = cls_scores.view(total_views // num_views, num_views, -1)
|
|
|
|
if self.average_clips is None:
|
|
return cls_scores
|
|
elif self.average_clips == 'prob':
|
|
cls_scores = F.softmax(cls_scores, dim=2).mean(dim=1)
|
|
elif self.average_clips == 'score':
|
|
cls_scores = cls_scores.mean(dim=1)
|
|
|
|
return cls_scores
|
|
|
|
|
|
@MODELS.register_module()
|
|
class RecognizerZelda(BaseModel):
|
|
def __init__(self, backbone, cls_head, data_preprocessor):
|
|
super().__init__(data_preprocessor=data_preprocessor)
|
|
|
|
self.backbone = MODELS.build(backbone)
|
|
self.cls_head = MODELS.build(cls_head)
|
|
|
|
def extract_feat(self, inputs):
|
|
inputs = inputs.view((-1, ) + inputs.shape[2:])
|
|
return self.backbone(inputs)
|
|
|
|
def loss(self, inputs, data_samples):
|
|
feats = self.extract_feat(inputs)
|
|
loss = self.cls_head.loss(feats, data_samples)
|
|
return loss
|
|
|
|
def predict(self, inputs, data_samples):
|
|
feats = self.extract_feat(inputs)
|
|
predictions = self.cls_head.predict(feats, data_samples)
|
|
return predictions
|
|
|
|
def forward(self, inputs, data_samples=None, mode='tensor'):
|
|
if mode == 'tensor':
|
|
return self.extract_feat(inputs)
|
|
elif mode == 'loss':
|
|
return self.loss(inputs, data_samples)
|
|
elif mode == 'predict':
|
|
return self.predict(inputs, data_samples)
|
|
else:
|
|
raise RuntimeError(f'Invalid mode: {mode}')
|
|
```
|
|
|
|
The `init_cfg` controls model weight initialization; for more information, please refer to [MMEngine Tutorial](https://mmengine.readthedocs.io/en/latest/advanced_tutorials/initialize.html). The usage of the above modules is as follows:
|
|
|
|
```python
|
|
import torch
|
|
import copy
|
|
from mmaction.registry import MODELS
|
|
|
|
model_cfg = dict(
|
|
type='RecognizerZelda',
|
|
backbone=dict(type='BackBoneZelda'),
|
|
cls_head=dict(
|
|
type='ClsHeadZelda',
|
|
num_classes=2,
|
|
in_channels=128,
|
|
average_clips='prob'),
|
|
data_preprocessor = dict(
|
|
type='DataPreprocessorZelda',
|
|
mean=[123.675, 116.28, 103.53],
|
|
std=[58.395, 57.12, 57.375]))
|
|
|
|
model = MODELS.build(model_cfg)
|
|
|
|
# Train
|
|
model.train()
|
|
model.init_weights()
|
|
data_batch_train = copy.deepcopy(batched_packed_results)
|
|
data = model.data_preprocessor(data_batch_train, training=True)
|
|
loss = model(**data, mode='loss')
|
|
print('loss dict: ', loss)
|
|
|
|
# Test
|
|
with torch.no_grad():
|
|
model.eval()
|
|
data_batch_test = copy.deepcopy(batched_packed_results)
|
|
data = model.data_preprocessor(data_batch_test, training=False)
|
|
predictions = model(**data, mode='predict')
|
|
print('Label of Sample[0]', predictions[0].gt_label)
|
|
print('Scores of Sample[0]', predictions[0].pred_score)
|
|
```
|
|
|
|
```shell
|
|
04/03 23:28:01 - mmengine - INFO -
|
|
backbone.conv1.0.weight - torch.Size([64, 3, 3, 7, 7]):
|
|
KaimingInit: a=0, mode=fan_out, nonlinearity=relu, distribution =normal, bias=0
|
|
|
|
04/03 23:28:01 - mmengine - INFO -
|
|
backbone.conv1.0.bias - torch.Size([64]):
|
|
KaimingInit: a=0, mode=fan_out, nonlinearity=relu, distribution =normal, bias=0
|
|
|
|
04/03 23:28:01 - mmengine - INFO -
|
|
backbone.conv1.1.weight - torch.Size([64]):
|
|
The value is the same before and after calling `init_weights` of RecognizerZelda
|
|
|
|
04/03 23:28:01 - mmengine - INFO -
|
|
backbone.conv1.1.bias - torch.Size([64]):
|
|
The value is the same before and after calling `init_weights` of RecognizerZelda
|
|
|
|
04/03 23:28:01 - mmengine - INFO -
|
|
backbone.conv.0.weight - torch.Size([128, 64, 3, 3, 3]):
|
|
KaimingInit: a=0, mode=fan_out, nonlinearity=relu, distribution =normal, bias=0
|
|
|
|
04/03 23:28:01 - mmengine - INFO -
|
|
backbone.conv.0.bias - torch.Size([128]):
|
|
KaimingInit: a=0, mode=fan_out, nonlinearity=relu, distribution =normal, bias=0
|
|
|
|
04/03 23:28:01 - mmengine - INFO -
|
|
backbone.conv.1.weight - torch.Size([128]):
|
|
The value is the same before and after calling `init_weights` of RecognizerZelda
|
|
|
|
04/03 23:28:01 - mmengine - INFO -
|
|
backbone.conv.1.bias - torch.Size([128]):
|
|
The value is the same before and after calling `init_weights` of RecognizerZelda
|
|
|
|
04/03 23:28:01 - mmengine - INFO -
|
|
cls_head.fc.weight - torch.Size([2, 128]):
|
|
NormalInit: mean=0, std=0.01, bias=0
|
|
|
|
04/03 23:28:01 - mmengine - INFO -
|
|
cls_head.fc.bias - torch.Size([2]):
|
|
NormalInit: mean=0, std=0.01, bias=0
|
|
|
|
loss dict: {'loss_cls': tensor(0.6853, grad_fn=<NllLossBackward0>)}
|
|
Label of Sample[0] tensor([0])
|
|
Scores of Sample[0] tensor([0.5240, 0.4760])
|
|
```
|
|
|
|
## Step4: Build an Evaluation Metric
|
|
|
|
Note that all `Metric` classes in OpenMMLab must inherit from the `BaseMetric` class in `mmengine` and implement its abstract methods `process` and `compute_metrics`. For more information on evaluation, please refer to [MMEngine Tutorial](https://mmengine.readthedocs.io/en/latest/tutorials/evaluation.html).
|
|
|
|
```python
|
|
import copy
|
|
from collections import OrderedDict
|
|
from mmengine.evaluator import BaseMetric
|
|
from mmaction.evaluation import top_k_accuracy
|
|
from mmaction.registry import METRICS
|
|
|
|
|
|
@METRICS.register_module()
|
|
class AccuracyMetric(BaseMetric):
|
|
def __init__(self, topk=(1, 5), collect_device='cpu', prefix='acc'):
|
|
super().__init__(collect_device=collect_device, prefix=prefix)
|
|
self.topk = topk
|
|
|
|
def process(self, data_batch, data_samples):
|
|
data_samples = copy.deepcopy(data_samples)
|
|
for data_sample in data_samples:
|
|
result = dict()
|
|
scores = data_sample['pred_score'].cpu().numpy()
|
|
label = data_sample['gt_label'].item()
|
|
result['scores'] = scores
|
|
result['label'] = label
|
|
self.results.append(result)
|
|
|
|
def compute_metrics(self, results: list) -> dict:
|
|
eval_results = OrderedDict()
|
|
labels = [res['label'] for res in results]
|
|
scores = [res['scores'] for res in results]
|
|
topk_acc = top_k_accuracy(scores, labels, self.topk)
|
|
for k, acc in zip(self.topk, topk_acc):
|
|
eval_results[f'topk{k}'] = acc
|
|
return eval_results
|
|
```
|
|
|
|
```python
|
|
from mmaction.registry import METRICS
|
|
|
|
metric_cfg = dict(type='AccuracyMetric', topk=(1, 5))
|
|
|
|
metric = METRICS.build(metric_cfg)
|
|
|
|
data_samples = [d.to_dict() for d in predictions]
|
|
|
|
metric.process(batched_packed_results, data_samples)
|
|
acc = metric.compute_metrics(metric.results)
|
|
print(acc)
|
|
```
|
|
|
|
```shell
|
|
OrderedDict([('topk1', 0.5), ('topk5', 1.0)])
|
|
```
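Since the tiny dataset has only two classes, top-5 accuracy is trivially `1.0`. The underlying `top_k_accuracy` function can also be called directly; below is a minimal sketch with made-up scores and labels:

```python
import numpy as np
from mmaction.evaluation import top_k_accuracy

# Two hypothetical 2-class score vectors and their ground-truth labels.
scores = [np.array([0.8, 0.2]), np.array([0.3, 0.7])]
labels = [0, 1]
print(top_k_accuracy(scores, labels, topk=(1,)))  # [1.0]
```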
|
|
|
|
## Step5: Train and Test with Native PyTorch
|
|
|
|
```python
|
|
import torch
import torch.optim as optim
|
|
from mmengine import track_iter_progress
|
|
|
|
|
|
device = 'cuda'  # or 'cpu'
model = model.to(device)  # moving the model also updates the data_preprocessor's device
|
|
max_epochs = 10
|
|
|
|
optimizer = optim.Adam(model.parameters(), lr=0.01)
|
|
|
|
for epoch in range(max_epochs):
|
|
model.train()
|
|
losses = []
|
|
for data_batch in track_iter_progress(train_data_loader):
|
|
data = model.data_preprocessor(data_batch, training=True)
|
|
loss_dict = model(**data, mode='loss')
|
|
loss = loss_dict['loss_cls']
|
|
|
|
optimizer.zero_grad()
|
|
loss.backward()
|
|
optimizer.step()
|
|
|
|
losses.append(loss.item())
|
|
|
|
print(f'Epoch[{epoch}]: loss ', sum(losses) / len(train_data_loader))
|
|
|
|
with torch.no_grad():
|
|
model.eval()
|
|
for data_batch in track_iter_progress(val_data_loader):
|
|
data = model.data_preprocessor(data_batch, training=False)
|
|
predictions = model(**data, mode='predict')
|
|
data_samples = [d.to_dict() for d in predictions]
|
|
metric.process(data_batch, data_samples)
|
|
|
|
        acc = metric.compute_metrics(metric.results)
        metric.results.clear()  # reset the accumulated results so each epoch is evaluated independently
|
|
for name, topk in acc.items():
|
|
print(f'{name}: ', topk)
|
|
```
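To keep the weights trained by this native loop, you can save and reload the state dict with plain PyTorch (a minimal sketch; the filename is arbitrary):

```python
# Save the trained weights (hypothetical filename).
torch.save(model.state_dict(), 'recognizer_zelda.pth')

# Restore them later, e.g. before running inference.
model.load_state_dict(torch.load('recognizer_zelda.pth'))
```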
|
|
|
|
## Step6: Train and Test with MMEngine (Recommended)
|
|
|
|
For more details on training and testing, you can refer to [MMAction2 Tutorial](https://mmaction2.readthedocs.io/en/latest/user_guides/train_test.html). For more information on `Runner`, please refer to [MMEngine Tutorial](https://mmengine.readthedocs.io/en/latest/tutorials/runner.html).
|
|
|
|
```python
|
|
from mmengine.runner import Runner
|
|
|
|
train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=10, val_interval=1)
|
|
val_cfg = dict(type='ValLoop')
|
|
|
|
optim_wrapper = dict(optimizer=dict(type='Adam', lr=0.01))
|
|
|
|
runner = Runner(model=model_cfg, work_dir='./work_dirs/guide',
|
|
train_dataloader=train_dataloader_cfg,
|
|
train_cfg=train_cfg,
|
|
val_dataloader=val_dataloader_cfg,
|
|
val_cfg=val_cfg,
|
|
optim_wrapper=optim_wrapper,
|
|
val_evaluator=[metric_cfg],
|
|
default_scope='mmaction')
|
|
runner.train()
|
|
```
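Testing with `MMEngine` follows the same pattern: pass a `test_dataloader`, a `test_cfg`, and a `test_evaluator` to the `Runner` and call `runner.test()`. The sketch below reuses the validation data and metric as a stand-in test set, and assumes the default `CheckpointHook` of the training run above has written `epoch_10.pth` into the work directory:

```python
test_runner = Runner(model=model_cfg, work_dir='./work_dirs/guide',
                     load_from='./work_dirs/guide/epoch_10.pth',  # checkpoint from the training run above
                     test_dataloader=val_dataloader_cfg,
                     test_cfg=dict(type='TestLoop'),
                     test_evaluator=[metric_cfg],
                     default_scope='mmaction')
test_runner.test()
```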
|
|
|