# Tutorial 2: Customize Datasets

## Customize datasets by reorganizing data

The simplest way is to convert your dataset so that the data is organized into folders.
An example file structure is as follows.
```none
├── data
│   ├── my_dataset
│   │   ├── img_dir
│   │   │   ├── train
│   │   │   │   ├── xxx{img_suffix}
│   │   │   │   ├── yyy{img_suffix}
│   │   │   │   ├── zzz{img_suffix}
│   │   │   ├── val
│   │   ├── ann_dir
│   │   │   ├── train
│   │   │   │   ├── xxx{seg_map_suffix}
│   │   │   │   ├── yyy{seg_map_suffix}
│   │   │   │   ├── zzz{seg_map_suffix}
│   │   │   ├── val
```
A training pair consists of the files with the same suffix in `img_dir`/`ann_dir`.
If the `split` argument is given, only part of the files in `img_dir`/`ann_dir` will be loaded.
We may specify the prefixes of the files we would like to include in the split txt.
More specifically, for a split txt like the following,
```none
xxx
zzz
```
only
`data/my_dataset/img_dir/train/xxx{img_suffix}`,
`data/my_dataset/img_dir/train/zzz{img_suffix}`,
`data/my_dataset/ann_dir/train/xxx{seg_map_suffix}`,
and `data/my_dataset/ann_dir/train/zzz{seg_map_suffix}` will be loaded.
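The selection rule above can be sketched as follows. This is a minimal illustration, not the MMSegmentation implementation; the suffixes and file stems are hypothetical placeholders:

```python
# A minimal sketch of how a split txt narrows the loaded training pairs.
# The suffixes and stems below are hypothetical placeholders.
img_suffix, seg_map_suffix = '.jpg', '.png'

stems_in_train = ['xxx', 'yyy', 'zzz']   # file stems present in img_dir/train
split = {'xxx', 'zzz'}                   # stems listed in the split txt

# Only stems named in the split produce (image, annotation) pairs.
pairs = [('img_dir/train/' + s + img_suffix,
          'ann_dir/train/' + s + seg_map_suffix)
         for s in stems_in_train if s in split]
print(pairs)
```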
Note: the annotations are images of shape (H, W); the pixel values should fall in the range `[0, num_classes - 1]`.
You may use the `'P'` mode of [pillow](https://pillow.readthedocs.io/en/stable/handbook/concepts.html#palette) to create your annotation images with a color palette.
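For instance, a palette-mode annotation can be created with pillow like this. This is a minimal sketch; the class count, colors, and file name are made up for illustration:

```python
import numpy as np
from PIL import Image

# Hypothetical 4x4 annotation with 3 classes; pixel values are class
# indices in [0, num_classes - 1], as required.
seg_map = np.array([[0, 0, 1, 1],
                    [0, 0, 1, 1],
                    [2, 2, 2, 2],
                    [2, 2, 2, 2]], dtype=np.uint8)

# 'P' (palette) mode: the file stores class indices, while the palette
# maps each index to a display color (here: black, red, green).
ann = Image.fromarray(seg_map, mode='P')
ann.putpalette([0, 0, 0, 255, 0, 0, 0, 255, 0])
ann.save('xxx.png')

# Loading the file back yields the class indices, not RGB colors.
restored = np.array(Image.open('xxx.png'))
print(restored.shape, restored.max())
```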
## Customize datasets by mixing datasets

MMSegmentation also supports mixing datasets for training.
Currently it supports concatenating and repeating datasets.

### Repeat dataset

We use `RepeatDataset` as a wrapper to repeat a dataset.
For example, suppose the original dataset is `Dataset_A`; to repeat it, the config looks like the following.
```python
dataset_A_train = dict(
    type='RepeatDataset',
    times=N,
    dataset=dict(  # This is the original config of Dataset_A
        type='Dataset_A',
        ...
        pipeline=train_pipeline
    )
)
```
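Conceptually, such a wrapper reports `times` copies of the dataset's length and folds indices back into the wrapped dataset with a modulo. A minimal sketch, not the actual MMSegmentation implementation:

```python
# A minimal sketch of a repeat wrapper over a map-style dataset:
# the length is multiplied by `times`, and indices wrap around.
class RepeatDataset:
    def __init__(self, dataset, times):
        self.dataset = dataset
        self.times = times

    def __len__(self):
        return self.times * len(self.dataset)

    def __getitem__(self, idx):
        # Index 4 on a 3-element dataset repeated twice maps to item 1.
        return self.dataset[idx % len(self.dataset)]

ds = RepeatDataset(['a', 'b', 'c'], times=2)
print(len(ds), ds[4])
```

Repeating is useful when a dataset is small relative to others in the mix, or to reduce per-epoch bookkeeping overhead.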
### Concatenate dataset

There are 2 ways to concatenate datasets.

1. If the datasets you want to concatenate are of the same type but have different annotation files,
   you can concatenate the dataset configs like the following.

   1. You may concatenate two `ann_dir`.
      ```python
      dataset_A_train = dict(
          type='Dataset_A',
          img_dir='img_dir',
          ann_dir=['anno_dir_1', 'anno_dir_2'],
          pipeline=train_pipeline
      )
      ```
   2. You may concatenate two `split`.
      ```python
      dataset_A_train = dict(
          type='Dataset_A',
          img_dir='img_dir',
          ann_dir='anno_dir',
          split=['split_1.txt', 'split_2.txt'],
          pipeline=train_pipeline
      )
      ```
   3. You may concatenate two `ann_dir` and two `split` simultaneously.
      ```python
      dataset_A_train = dict(
          type='Dataset_A',
          img_dir='img_dir',
          ann_dir=['anno_dir_1', 'anno_dir_2'],
          split=['split_1.txt', 'split_2.txt'],
          pipeline=train_pipeline
      )
      ```
      In this case, `anno_dir_1` and `anno_dir_2` correspond to `split_1.txt` and `split_2.txt`, respectively.
2. In case the datasets you want to concatenate are of different types, you can concatenate the dataset configs like the following.
   ```python
   dataset_A_train = dict()
   dataset_B_train = dict()
   data = dict(
       imgs_per_gpu=2,
       workers_per_gpu=2,
       train=[
           dataset_A_train,
           dataset_B_train
       ],
       val=dataset_A_val,
       test=dataset_A_test
   )
   ```
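Under the hood, passing a list for `train` concatenates the datasets. The routing logic of such a concatenation, in the spirit of PyTorch's `ConcatDataset` (which MMSegmentation builds on), can be sketched as follows; this is an illustration, not MMSegmentation's actual code:

```python
import bisect

# A minimal sketch of dataset concatenation: cumulative sizes are
# precomputed, and a global index is routed to the right sub-dataset.
class ConcatDataset:
    def __init__(self, datasets):
        self.datasets = datasets
        self.cumulative_sizes = []
        total = 0
        for d in datasets:
            total += len(d)
            self.cumulative_sizes.append(total)

    def __len__(self):
        return self.cumulative_sizes[-1]

    def __getitem__(self, idx):
        # Find which sub-dataset the global index falls into, then
        # shift the index to be local to that sub-dataset.
        ds_idx = bisect.bisect_right(self.cumulative_sizes, idx)
        if ds_idx > 0:
            idx -= self.cumulative_sizes[ds_idx - 1]
        return self.datasets[ds_idx][idx]

ds = ConcatDataset([['a', 'b'], ['c', 'd', 'e']])
print(len(ds), ds[3])
```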
A more complex example that repeats `Dataset_A` and `Dataset_B` by N and M times, respectively, and then concatenates the repeated datasets is as follows.
```python
dataset_A_train = dict(
    type='RepeatDataset',
    times=N,
    dataset=dict(
        type='Dataset_A',
        ...
        pipeline=train_pipeline
    )
)
dataset_A_val = dict(
    ...
    pipeline=test_pipeline
)
dataset_A_test = dict(
    ...
    pipeline=test_pipeline
)
dataset_B_train = dict(
    type='RepeatDataset',
    times=M,
    dataset=dict(
        type='Dataset_B',
        ...
        pipeline=train_pipeline
    )
)
data = dict(
    imgs_per_gpu=2,
    workers_per_gpu=2,
    train=[
        dataset_A_train,
        dataset_B_train
    ],
    val=dataset_A_val,
    test=dataset_A_test
)
```