# Training and Test

- [Training and Test](#training-and-test)
  - [Training](#training)
    - [Training with your PC](#training-with-your-pc)
    - [Training with multiple GPUs](#training-with-multiple-gpus)
    - [Training with multiple machines](#training-with-multiple-machines)
      - [Multiple machines in the same network](#multiple-machines-in-the-same-network)
      - [Multiple machines managed with slurm](#multiple-machines-managed-with-slurm)
  - [Test](#test)
    - [Test with your PC](#test-with-your-pc)
    - [Test with multiple GPUs](#test-with-multiple-gpus)
    - [Test with multiple machines](#test-with-multiple-machines)
      - [Multiple machines in the same network](#multiple-machines-in-the-same-network-1)
      - [Multiple machines managed with slurm](#multiple-machines-managed-with-slurm-1)

## Training

### Training with your PC

You can use `tools/train.py` to train a model on a single machine with a CPU and optionally a GPU.

Here is the full usage of the script:

```shell
python tools/train.py ${CONFIG_FILE} [ARGS]
```

````{note}
By default, MMAction2 prefers GPU to CPU. If you want to train a model on CPU, please empty `CUDA_VISIBLE_DEVICES` or set it to `-1` to make the GPU invisible to the program.

```bash
CUDA_VISIBLE_DEVICES=-1 python tools/train.py ${CONFIG_FILE} [ARGS]
```
````



| ARGS                                  | Description                                                                                                                     |
| ------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------- |
| `CONFIG_FILE`                         | The path to the config file.                                                                                                    |
| `--work-dir WORK_DIR`                 | The target folder to save logs and checkpoints. Defaults to a folder with the same name as the config file under `./work_dirs`. |
| `--resume [RESUME]`                   | Resume training. If a path is specified, resume from it; if not specified, try to auto-resume from the latest checkpoint.       |
| `--amp`                               | Enable automatic mixed-precision training.                                                                                      |
| `--no-validate`                       | **Not suggested**. Disable checkpoint evaluation during training.                                                               |
| `--auto-scale-lr`                     | Automatically scale the learning rate according to the actual batch size and the original batch size.                           |
| `--seed`                              | Random seed.                                                                                                                    |
| `--diff-rank-seed`                    | Whether to set different seeds for different ranks.                                                                             |
| `--deterministic`                     | Whether to set deterministic options for the cuDNN backend.                                                                     |
| `--cfg-options CFG_OPTIONS`           | Override some settings in the used config; key-value pairs in `xxx=yyy` format will be merged into the config file. If the value to be overwritten is a list, it should be of the form `key="[a,b]"` or `key=a,b`. The argument also allows nested list/tuple values, e.g. `key="[(a,b),(c,d)]"`. Note that the quotation marks are necessary and that no white space is allowed. |
| `--launcher {none,pytorch,slurm,mpi}` | Options for the job launcher. Defaults to `none`.                                                                               |
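
For example, a typical single-machine run might look like the following. The config path and work directory below are illustrative placeholders, not files shipped with this guide:

```shell
# Train with AMP and a fixed seed; paths are placeholders for your own files.
python tools/train.py configs/recognition/tsn/my_tsn_config.py \
    --work-dir work_dirs/tsn_experiment \
    --amp \
    --seed 0
```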



### Training with multiple GPUs

We provide a shell script to start a multi-GPU task with `torch.distributed.launch`.

```shell
bash tools/dist_train.sh ${CONFIG} ${GPUS} [PY_ARGS]
```

| ARGS        | Description                                                                            |
| ----------- | --------------------------------------------------------------------------------------- |
| `CONFIG`    | The path to the config file.                                                           |
| `GPUS`      | The number of GPUs to be used.                                                         |
| `[PY_ARGS]` | The other optional arguments of `tools/train.py`; see [here](#training-with-your-pc).  |
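
For instance, to train with 8 GPUs on one machine (the config path is a placeholder):

```shell
bash tools/dist_train.sh configs/recognition/tsn/my_tsn_config.py 8 --amp
```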

You can also specify extra arguments of the launcher by environment variables. For example, change the
communication port of the launcher to 29666 by the following command:

```shell
PORT=29666 bash tools/dist_train.sh ${CONFIG} ${GPUS} [PY_ARGS]
```

If you want to start multiple training jobs on different GPUs, you can launch them by specifying
different ports and visible devices.

```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 PORT=29500 bash tools/dist_train.sh ${CONFIG} 4 [PY_ARGS]
CUDA_VISIBLE_DEVICES=4,5,6,7 PORT=29501 bash tools/dist_train.sh ${CONFIG} 4 [PY_ARGS]
```

### Training with multiple machines

#### Multiple machines in the same network

If you launch a training job across multiple machines connected via Ethernet, you can run the following commands:

On the first machine:

```shell
NNODES=2 NODE_RANK=0 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR bash tools/dist_train.sh $CONFIG $GPUS
```

On the second machine:

```shell
NNODES=2 NODE_RANK=1 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR bash tools/dist_train.sh $CONFIG $GPUS
```

The following extra environment variables need to be specified to train or test models with multiple machines:

| ENV_VARS      | Description                                                                                         |
| ------------- | ----------------------------------------------------------------------------------------------------- |
| `NNODES`      | The total number of machines. Defaults to 1.                                                        |
| `NODE_RANK`   | The index of the local machine. Defaults to 0.                                                      |
| `PORT`        | The communication port; it must be the same on all machines. Defaults to 29500.                     |
| `MASTER_ADDR` | The IP address of the master machine; it must be the same on all machines. Defaults to `127.0.0.1`. |

Multi-machine training is usually slow if you do not have high-speed networking such as InfiniBand.
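
As a concrete sketch, assuming two 8-GPU machines and a master reachable at `192.168.1.100` (the IP address and config path are placeholders for your own values):

```shell
# On the first machine (rank 0, the master):
NNODES=2 NODE_RANK=0 PORT=29500 MASTER_ADDR=192.168.1.100 \
    bash tools/dist_train.sh configs/recognition/tsn/my_tsn_config.py 8

# On the second machine (rank 1):
NNODES=2 NODE_RANK=1 PORT=29500 MASTER_ADDR=192.168.1.100 \
    bash tools/dist_train.sh configs/recognition/tsn/my_tsn_config.py 8
```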

#### Multiple machines managed with slurm

If you run MMAction2 on a cluster managed with [slurm](https://slurm.schedmd.com/), you can use the script `slurm_train.sh`.

```shell
[ENV_VARS] bash tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG} [PY_ARGS]
```

Here is a description of the script's arguments.

| ARGS        | Description                                                                            |
| ----------- | --------------------------------------------------------------------------------------- |
| `PARTITION` | The partition to use in your cluster.                                                  |
| `JOB_NAME`  | The name of your job; you can name it as you like.                                     |
| `CONFIG`    | The path to the config file.                                                           |
| `[PY_ARGS]` | The other optional arguments of `tools/train.py`; see [here](#training-with-your-pc).  |

Here are the environment variables that can be used to configure the Slurm job.

| ENV_VARS        | Description                                                                                                |
| --------------- | ---------------------------------------------------------------------------------------------------------- |
| `GPUS`          | The number of GPUs to be used. Defaults to 8.                                                              |
| `GPUS_PER_NODE` | The number of GPUs to be allocated per node. Defaults to 8.                                                |
| `CPUS_PER_TASK` | The number of CPUs to be allocated per task (usually one GPU corresponds to one task). Defaults to 5.      |
| `SRUN_ARGS`     | The other arguments of `srun`. Available options can be found [here](https://slurm.schedmd.com/srun.html). |
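
For example, a 16-GPU job spread across two 8-GPU nodes might be launched like this (the partition name, job name, and config path are placeholders):

```shell
GPUS=16 GPUS_PER_NODE=8 CPUS_PER_TASK=5 \
    bash tools/slurm_train.sh my_partition my_train_job \
    configs/recognition/tsn/my_tsn_config.py --amp
```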

## Test

### Test with your PC

You can use `tools/test.py` to test a model on a single machine with a CPU and optionally a GPU.

Here is the full usage of the script:

```shell
python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [ARGS]
```

````{note}
By default, MMAction2 prefers GPU to CPU. If you want to test a model on CPU, please empty `CUDA_VISIBLE_DEVICES` or set it to `-1` to make the GPU invisible to the program.

```bash
CUDA_VISIBLE_DEVICES=-1 python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [ARGS]
```
````



| ARGS                                  | Description                                                                                                                                   |
| ------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------- |
| `CONFIG_FILE`                         | The path to the config file.                                                                                                                  |
| `CHECKPOINT_FILE`                     | The path to the checkpoint file (it can be an HTTP link).                                                                                     |
| `--work-dir WORK_DIR`                 | The directory to save the file containing evaluation metrics. Defaults to a folder with the same name as the config file under `./work_dirs`. |
| `--dump DUMP`                         | The path to dump all outputs of the model for offline evaluation.                                                                             |
| `--cfg-options CFG_OPTIONS`           | Override some settings in the used config; key-value pairs in `xxx=yyy` format will be merged into the config file. If the value to be overwritten is a list, it should be of the form `key="[a,b]"` or `key=a,b`. The argument also allows nested list/tuple values, e.g. `key="[(a,b),(c,d)]"`. Note that the quotation marks are necessary and that no white space is allowed. |
| `--show-dir SHOW_DIR`                 | The directory to save the result visualization images.                                                                                        |
| `--show`                              | Visualize the prediction results in a window.                                                                                                 |
| `--interval INTERVAL`                 | The interval of samples to visualize. Defaults to 1.                                                                                          |
| `--wait-time WAIT_TIME`               | The display time of each window (in seconds). Defaults to 2.                                                                                  |
| `--launcher {none,pytorch,slurm,mpi}` | Options for the job launcher. Defaults to `none`.                                                                                             |
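
For example, evaluating a checkpoint and dumping the raw predictions for offline analysis might look like this (the config and checkpoint paths are placeholders):

```shell
python tools/test.py configs/recognition/tsn/my_tsn_config.py \
    work_dirs/tsn_experiment/epoch_100.pth \
    --dump work_dirs/tsn_experiment/predictions.pkl
```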



### Test with multiple GPUs

We provide a shell script to start a multi-GPU task with `torch.distributed.launch`.

```shell
bash tools/dist_test.sh ${CONFIG} ${CHECKPOINT} ${GPUS} [PY_ARGS]
```

| ARGS         | Description                                                                      |
| ------------ | ---------------------------------------------------------------------------------- |
| `CONFIG`     | The path to the config file.                                                     |
| `CHECKPOINT` | The path to the checkpoint file (it can be an HTTP link).                        |
| `GPUS`       | The number of GPUs to be used.                                                   |
| `[PY_ARGS]`  | The other optional arguments of `tools/test.py`; see [here](#test-with-your-pc). |
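
For instance, to test a checkpoint with 8 GPUs on one machine (paths are placeholders):

```shell
bash tools/dist_test.sh configs/recognition/tsn/my_tsn_config.py \
    work_dirs/tsn_experiment/epoch_100.pth 8
```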

You can also specify extra arguments of the launcher by environment variables. For example, change the
communication port of the launcher to 29666 by the following command:

```shell
PORT=29666 bash tools/dist_test.sh ${CONFIG} ${CHECKPOINT} ${GPUS} [PY_ARGS]
```

If you want to start multiple test jobs on different GPUs, you can launch them by specifying
different ports and visible devices.

```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 PORT=29500 bash tools/dist_test.sh ${CONFIG} ${CHECKPOINT} 4 [PY_ARGS]
CUDA_VISIBLE_DEVICES=4,5,6,7 PORT=29501 bash tools/dist_test.sh ${CONFIG} ${CHECKPOINT} 4 [PY_ARGS]
```

### Test with multiple machines

#### Multiple machines in the same network

If you launch a test job across multiple machines connected via Ethernet, you can run the following commands:

On the first machine:

```shell
NNODES=2 NODE_RANK=0 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR bash tools/dist_test.sh $CONFIG $CHECKPOINT $GPUS
```

On the second machine:

```shell
NNODES=2 NODE_RANK=1 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR bash tools/dist_test.sh $CONFIG $CHECKPOINT $GPUS
```

Compared with multi-GPU testing on a single machine, you need to specify some extra environment variables:

| ENV_VARS      | Description                                                                                         |
| ------------- | ----------------------------------------------------------------------------------------------------- |
| `NNODES`      | The total number of machines. Defaults to 1.                                                        |
| `NODE_RANK`   | The index of the local machine. Defaults to 0.                                                      |
| `PORT`        | The communication port; it must be the same on all machines. Defaults to 29500.                     |
| `MASTER_ADDR` | The IP address of the master machine; it must be the same on all machines. Defaults to `127.0.0.1`. |

Multi-machine testing is usually slow if you do not have high-speed networking such as InfiniBand.

#### Multiple machines managed with slurm

If you run MMAction2 on a cluster managed with [slurm](https://slurm.schedmd.com/), you can use the script `slurm_test.sh`.

```shell
[ENV_VARS] bash tools/slurm_test.sh ${PARTITION} ${JOB_NAME} ${CONFIG} ${CHECKPOINT} [PY_ARGS]
```

Here is a description of the script's arguments.

| ARGS         | Description                                                                      |
| ------------ | ---------------------------------------------------------------------------------- |
| `PARTITION`  | The partition to use in your cluster.                                            |
| `JOB_NAME`   | The name of your job; you can name it as you like.                               |
| `CONFIG`     | The path to the config file.                                                     |
| `CHECKPOINT` | The path to the checkpoint file (it can be an HTTP link).                        |
| `[PY_ARGS]`  | The other optional arguments of `tools/test.py`; see [here](#test-with-your-pc). |

Here are the environment variables that can be used to configure the Slurm job.

| ENV_VARS        | Description                                                                                                |
| --------------- | ---------------------------------------------------------------------------------------------------------- |
| `GPUS`          | The number of GPUs to be used. Defaults to 8.                                                              |
| `GPUS_PER_NODE` | The number of GPUs to be allocated per node. Defaults to 8.                                                |
| `CPUS_PER_TASK` | The number of CPUs to be allocated per task (usually one GPU corresponds to one task). Defaults to 5.      |
| `SRUN_ARGS`     | The other arguments of `srun`. Available options can be found [here](https://slurm.schedmd.com/srun.html). |
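
For example, an 8-GPU Slurm test job might be launched like this (the partition name, job name, and paths are placeholders):

```shell
GPUS=8 bash tools/slurm_test.sh my_partition my_test_job \
    configs/recognition/tsn/my_tsn_config.py \
    work_dirs/tsn_experiment/epoch_100.pth \
    --dump work_dirs/tsn_experiment/predictions.pkl
```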