# Prepare Dataset

MMAction2 supports many existing datasets. This chapter walks you through preparing datasets for MMAction2.

- [Prepare Dataset](#prepare-dataset)
  - [Notes on Video Data Format](#notes-on-video-data-format)
  - [Use built-in datasets](#use-built-in-datasets)
  - [Use a custom dataset](#use-a-custom-dataset)
    - [Action Recognition](#action-recognition)
    - [Skeleton-based Action Recognition](#skeleton-based-action-recognition)
    - [Audio-based Action Recognition](#audio-based-action-recognition)
    - [Spatio-temporal Action Detection](#spatio-temporal-action-detection)
    - [Temporal Action Localization](#temporal-action-localization)
  - [Use mixed datasets for training](#use-mixed-datasets-for-training)
    - [Repeat dataset](#repeat-dataset)
  - [Browse dataset](#browse-dataset)

## Notes on Video Data Format

MMAction2 supports two data formats: raw frames and video. The raw-frame format is widely used in earlier projects such as [TSN](https://github.com/yjxiong/temporal-segment-networks).
It is fast when an SSD is available, but it does not scale to today's fast-growing datasets.
(For example, the newest edition of [Kinetics](https://www.deepmind.com/open-source/kinetics) has 650K videos, and the extracted frames take up several TBs.)
The video format saves a lot of space but requires computation-intensive video decoding at runtime.
To make video decoding faster, we support several efficient video loading libraries, such as [decord](https://github.com/zhreshold/decord) and [PyAV](https://github.com/PyAV-Org/PyAV).
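
As a rough illustration of the practical difference, only the loading/decoding stage of the data pipeline changes between the two formats. The sketch below uses MMAction2's pipeline-style transform dicts; the sampling parameters are placeholders you would tune for your model.

```python
# Minimal sketch of the loading stage of a data pipeline (not a full config).

# Video format: open the file and decode only the sampled frames with decord.
video_loading = [
    dict(type='DecordInit'),
    dict(type='SampleFrames', clip_len=1, frame_interval=1, num_clips=8),
    dict(type='DecordDecode'),
]

# Raw-frame format: frames are already extracted, so only image reading is needed.
rawframe_loading = [
    dict(type='SampleFrames', clip_len=1, frame_interval=1, num_clips=8),
    dict(type='RawFrameDecode'),
]
```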

## Use built-in datasets

MMAction2 already supports many datasets. We provide shell scripts for data preparation under `$MMACTION2/tools/data/`; please refer to [supported datasets](https://mmaction2.readthedocs.io/en/latest/datasetzoo_statistics.html) for details on preparing specific datasets.

## Use a custom dataset

The simplest way is to convert your dataset to an existing dataset format:

- `RawframeDataset` and `VideoDataset` for [Action Recognition](#action-recognition)
- `PoseDataset` for [Skeleton-based Action Recognition](#skeleton-based-action-recognition)
- `AudioDataset` for [Audio-based Action Recognition](#audio-based-action-recognition)
- `AVADataset` for [Spatio-temporal Action Detection](#spatio-temporal-action-detection)
- `ActivityNetDataset` for [Temporal Action Localization](#temporal-action-localization)

After pre-processing the data, you need to modify the config files to use the dataset.
Here is an example of using a custom dataset in rawframe format.

In `configs/task/method/my_custom_config.py`:

```python
...

# dataset settings
dataset_type = 'RawframeDataset'
data_root = 'path/to/your/root'
data_root_val = 'path/to/your/root_val'
ann_file_train = 'data/custom/custom_train_list.txt'
ann_file_val = 'data/custom/custom_val_list.txt'
ann_file_test = 'data/custom/custom_val_list.txt'
...
data = dict(
    videos_per_gpu=32,
    workers_per_gpu=2,
    train=dict(
        type=dataset_type,
        ann_file=ann_file_train,
        ...),
    val=dict(
        type=dataset_type,
        ann_file=ann_file_val,
        ...),
    test=dict(
        type=dataset_type,
        ann_file=ann_file_test,
        ...))
...
```

### Action Recognition

There are two kinds of annotation files for action recognition.

- rawframe annotation for `RawframeDataset`

  The annotation of a rawframe dataset is a text file with multiple lines. Each line gives the `frame_directory` (relative path) of a video, the `total_frames` of the video, and the `label` of the video, separated by whitespace.

  Here is an example.

  ```
  some/directory-1 163 1
  some/directory-2 122 1
  some/directory-3 258 2
  some/directory-4 234 2
  some/directory-5 295 3
  some/directory-6 121 3
  ```

- video annotation for `VideoDataset`

  The annotation of a video dataset is a text file with multiple lines. Each line gives a sample video's `filepath` (relative path) and `label`, separated by whitespace. A short sketch for generating annotation files in both formats is given after this list.

  Here is an example.

  ```
  some/path/000.mp4 1
  some/path/001.mp4 1
  some/path/002.mp4 2
  some/path/003.mp4 2
  some/path/004.mp4 3
  some/path/005.mp4 3
  ```
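
Annotation lists in either format can be generated with a few lines of Python. Below is a minimal sketch, assuming a hypothetical layout with one sub-directory per class under `data/custom/`; the directory names, label mapping, and output paths are illustrative, not part of MMAction2.

```python
import os

# Hypothetical layout: data/custom/videos/<class_name>/<video>.mp4
# and data/custom/rawframes/<class_name>/<clip_name>/img_00001.jpg, ...
video_root = 'data/custom/videos'
frame_root = 'data/custom/rawframes'
classes = sorted(os.listdir(video_root))                 # class name -> label index
label_map = {name: idx for idx, name in enumerate(classes)}

# Video annotation: "<relative path> <label>"
with open('data/custom/custom_train_list_videos.txt', 'w') as f:
    for name in classes:
        for video in sorted(os.listdir(os.path.join(video_root, name))):
            f.write(f'{name}/{video} {label_map[name]}\n')

# Rawframe annotation: "<frame directory> <total frames> <label>"
with open('data/custom/custom_train_list_rawframes.txt', 'w') as f:
    for name in classes:
        for clip in sorted(os.listdir(os.path.join(frame_root, name))):
            num_frames = len(os.listdir(os.path.join(frame_root, name, clip)))
            f.write(f'{name}/{clip} {num_frames} {label_map[name]}\n')
```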

### Skeleton-based Action Recognition

This task recognizes the action class from a skeleton sequence (a time series of keypoints). We provide several ways to build a custom skeleton dataset.

- Build from RGB video data

  You need to extract keypoint data from the videos and convert it to a supported format. We provide a [tutorial](https://github.com/open-mmlab/mmaction2/tree/main/configs/skeleton/posec3d/custom_dataset_training.md) with detailed instructions.

- Build from existing keypoint data

  Assuming that you already have keypoint data in COCO format, you can gather it into a pickle file.

  Each pickle file corresponds to an action recognition dataset. The content of a pickle file is a dictionary with two fields: `split` and `annotations`.

  1. Split: The value of the `split` field is a dictionary: the keys are the split names, while the values are lists of video identifiers that belong to the specific split.
  2. Annotations: The value of the `annotations` field is a list of skeleton annotations. Each skeleton annotation is a dictionary containing the following fields:
     - `frame_dir` (str): The identifier of the corresponding video.
     - `total_frames` (int): The number of frames in this video.
     - `img_shape` (tuple\[int\]): The shape of a video frame, a tuple with two elements, in the format of `(height, width)`. Only required for 2D skeletons.
     - `original_shape` (tuple\[int\]): Same as `img_shape`.
     - `label` (int): The action label.
     - `keypoint` (np.ndarray, with shape `[M x T x V x C]`): The keypoint annotation.
       - M: number of persons;
       - T: number of frames (same as `total_frames`);
       - V: number of keypoints (25 for NTURGB+D 3D skeletons, 17 for COCO, 18 for OpenPose, etc.);
       - C: number of dimensions for keypoint coordinates (C=2 for 2D keypoint, C=3 for 3D keypoint).
     - `keypoint_score` (np.ndarray, with shape `[M x T x V]`): The confidence score of keypoints. Only required for 2D skeletons.

  Here is an example:

  ```
  {
      "split":
          {
              'xsub_train':
                  ['S001C001P001R001A001', ...],
              'xsub_val':
                  ['S001C001P003R001A001', ...],
              ...
          },

      "annotations":
          [
              {
                  'frame_dir': 'S001C001P001R001A001',
                  'label': 0,
                  'img_shape': (1080, 1920),
                  'original_shape': (1080, 1920),
                  'total_frames': 103,
                  'keypoint': array([[[[1032. ,  334.8], ...]]]),
                  'keypoint_score': array([[[0.934 , 0.9766, ...]]])
              },
              {
                  'frame_dir': 'S001C001P003R001A001',
                  ...
              },
              ...
          ]
  }
  ```

  Supporting other keypoint formats requires further modification; please refer to [customize dataset](../advanced_guides/customize_dataset.md). A minimal sketch of assembling such a pickle file is shown below.
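
Below is a minimal sketch of assembling such a pickle file from keypoint arrays you already have. The dictionary fields and array shapes follow the description above; the identifiers, split names, and output path are illustrative assumptions.

```python
import pickle

import numpy as np

# Illustrative numbers: 1 person, 103 frames, 17 COCO keypoints, 2D coordinates.
M, T, V, C = 1, 103, 17, 2

annotations = [
    dict(
        frame_dir='S001C001P001R001A001',                       # video identifier
        label=0,                                                # action label
        img_shape=(1080, 1920),                                 # (height, width)
        original_shape=(1080, 1920),
        total_frames=T,
        keypoint=np.zeros((M, T, V, C), dtype=np.float32),      # replace with real keypoints
        keypoint_score=np.ones((M, T, V), dtype=np.float32),    # confidence, 2D skeletons only
    ),
]

split = dict(
    xsub_train=['S001C001P001R001A001'],
    xsub_val=[],
)

with open('custom_skeleton_dataset.pkl', 'wb') as f:
    pickle.dump(dict(split=split, annotations=annotations), f)
```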

### Audio-based Action Recognition

MMAction2 supports audio-based action recognition via `AudioDataset`. This task uses mel spectrogram features as input. An example annotation file looks like this:

```
ihWykL5mYRI.npy 300 153
lumzQD42AN8.npy 240 321
sWFRmD9Of4s.npy 250 250
w_IpfgRsBVA.npy 300 356
```

Each line represents a training sample. Taking the first line as an example, `ihWykL5mYRI.npy` is the filename of the mel spectrogram feature, `300` is the total number of frames of the original video corresponding to this feature, and `153` is the class label. We take the following two steps to prepare the mel spectrogram feature data:

First, extract the audio from the videos:

```shell
cd $MMACTION2
python tools/data/extract_audio.py ${ROOT} ${DST_ROOT} [--ext ${EXT}] [--num-workers ${N_WORKERS}] \
    [--level ${LEVEL}]
```

- `ROOT`: The root directory of the videos.
- `DST_ROOT`: The destination root directory of the audios.
- `EXT`: Extension of the video files. e.g., `mp4`.
- `N_WORKERS`: Number of processes to be used.

Next, generate the mel spectrogram features offline from the audio files:

```shell
cd $MMACTION2
python tools/data/build_audio_features.py ${AUDIO_HOME_PATH} ${SPECTROGRAM_SAVE_PATH} [--level ${LEVEL}] \
    [--ext $EXT] [--num-workers $N_WORKERS] [--part $PART]
```

- `AUDIO_HOME_PATH`: The root directory of the audio files.
- `SPECTROGRAM_SAVE_PATH`: The destination root directory of the audio features.
- `EXT`: Extension of the audio files. e.g., `m4a`.
- `N_WORKERS`: Number of processes to be used.
- `PART`: Determines how many parts the files are split into and which part to run, e.g., `2/5` means splitting all files into 5 parts and processing the 2nd one. This is useful if the work is distributed across several machines.
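
Finally, write the annotation file in the format shown at the top of this section. A minimal sketch, assuming a hypothetical mapping from each feature file to its original video's frame count and class label (the mapping and output path are illustrative):

```python
# Hypothetical mapping: feature filename -> (total_frames, label)
samples = {
    'ihWykL5mYRI.npy': (300, 153),
    'lumzQD42AN8.npy': (240, 321),
}

# Each line: "<feature file> <total frames of original video> <label>"
with open('audio_feature_train_list.txt', 'w') as f:
    for name, (total_frames, label) in samples.items():
        f.write(f'{name} {total_frames} {label}\n')
```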

### Spatio-temporal Action Detection

MMAction2 supports spatio-temporal action detection based on `AVADataset`. The annotation contains groundtruth bboxes and proposal bboxes.

- groundtruth bbox
  The groundtruth bbox annotation is a csv file with multiple lines, and each line is a detection sample of one frame, in the following format:

  `video_identifier, time_stamp, lt_x, lt_y, rb_x, rb_y, label, entity_id`

  Each field means:

  - `video_identifier`: The identifier of the corresponding video.
  - `time_stamp`: The time stamp of the current frame.
  - `lt_x`: The normalized x-coordinate of the left-top point of the bounding box.
  - `lt_y`: The normalized y-coordinate of the left-top point of the bounding box.
  - `rb_x`: The normalized x-coordinate of the right-bottom point of the bounding box.
  - `rb_y`: The normalized y-coordinate of the right-bottom point of the bounding box.
  - `label`: The action label.
  - `entity_id`: A unique integer allowing this box to be linked to other boxes depicting the same person in adjacent frames of this video.

  Here is an example.

  ```
  _-Z6wFjXtGQ,0902,0.063,0.049,0.524,0.996,12,0
  _-Z6wFjXtGQ,0902,0.063,0.049,0.524,0.996,74,0
  ...
  ```

- proposal bbox
  The proposal bbox annotation is a pickle file generated by a person detector, which usually needs to be fine-tuned on the target dataset. The pickle file contains a dict with the structure below; a sketch of building such a file is shown after this list.

  `{'video_identifier,time_stamp': bbox_info}`

  - `video_identifier` (str): The identifier of the corresponding video.
  - `time_stamp` (int): The time stamp of the current frame.
  - `bbox_info` (np.ndarray, with shape `[n, 5]`): Detected bboxes in the format \<x1> \<y1> \<x2> \<y2> \<score>, where x1, y1, x2, y2 are normalized with respect to the frame size and lie between 0.0 and 1.0.
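
The proposal file can be assembled from detector outputs with plain `pickle`. Below is a minimal sketch; the key format and box layout follow the description above, while the detections themselves and the output path are illustrative assumptions.

```python
import pickle

import numpy as np

# Hypothetical detector output: one frame of video '_-Z6wFjXtGQ' at time stamp 902,
# with two detected persons. Boxes are normalized (x1, y1, x2, y2, score).
proposals = {
    '_-Z6wFjXtGQ,0902': np.array([
        [0.063, 0.049, 0.524, 0.996, 0.98],
        [0.510, 0.040, 0.990, 0.989, 0.95],
    ], dtype=np.float32),
}

with open('custom_proposals.pkl', 'wb') as f:
    pickle.dump(proposals, f)
```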



### Temporal Action Localization

We support temporal action localization based on `ActivityNetDataset`. The annotation of the ActivityNet dataset is a JSON file. Each key is a video name, and the corresponding value holds the metadata and annotations for that video.

Here is an example.

```
{
  "video1": {
      "duration_second": 211.53,
      "duration_frame": 6337,
      "annotations": [
          {
              "segment": [
                  30.025882995319815,
                  205.2318595943838
              ],
              "label": "Rock climbing"
          }
      ],
      "feature_frame": 6336,
      "fps": 30.0,
      "rfps": 29.9579255898
  },
  "video2": {...},
  ...
}
```
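
As a minimal sketch of how this annotation is consumed, the file can be loaded with the standard `json` module, e.g. to convert each segment from seconds to frame indices using the per-video `fps` (the file name below is an illustrative assumption):

```python
import json

with open('activitynet_annotations.json') as f:   # hypothetical file name
    anno = json.load(f)

for video_name, info in anno.items():
    fps = info['fps']
    for ann in info['annotations']:
        start_sec, end_sec = ann['segment']
        # Convert the temporal segment from seconds to frame indices.
        start_frame, end_frame = int(start_sec * fps), int(end_sec * fps)
        print(video_name, ann['label'], start_frame, end_frame)
```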


## Use mixed datasets for training

MMAction2 also supports mixing datasets for training. Currently it supports repeating a dataset.

### Repeat dataset

We use `RepeatDataset` as a wrapper to repeat a dataset. For example, to repeat the original dataset `Dataset_A`, the config looks like the following:

```python
dataset_A_train = dict(
    type='RepeatDataset',
    times=N,
    dataset=dict(  # This is the original config of Dataset_A
        type='Dataset_A',
        ...
        pipeline=train_pipeline
    )
)
```
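
For a more concrete sketch, here is the same wrapper around a `VideoDataset`, repeated 10 times per epoch; the annotation path and pipeline are placeholders:

```python
train_pipeline = [...]  # your usual training pipeline goes here

dataset_A_train = dict(
    type='RepeatDataset',
    times=10,                      # each epoch iterates the wrapped dataset 10 times
    dataset=dict(
        type='VideoDataset',
        ann_file='data/custom/custom_train_list.txt',  # placeholder annotation file
        pipeline=train_pipeline,
    ),
)
```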

## Browse dataset

coming soon...