# Introduction
Trainings can take some time. A well-running training setup is essential to get the most out of nnU-Net. nnU-Net does not 
require any fancy hardware, just a well-balanced system. We recommend at least 32 GB of RAM, 6 CPU cores (12 threads), 
SSD storage (this can be SATA and does not have to be PCIe. DO NOT use an external SSD connected via USB!) and an 
RTX 2080 ti GPU. If your system has multiple GPUs, the other components need to scale linearly with the number of GPUs.
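
As a rough illustration of the linear scaling rule, the following sketch multiplies the single-GPU recommendation by the number of GPUs. The names `SystemSpec` and `recommended_spec` are made up for this example and are not part of nnU-Net; treat the result as a starting point, not a hard requirement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemSpec:
    ram_gb: int
    cpu_cores: int
    cpu_threads: int


# Single-GPU recommendation from the text above (hypothetical helper, not an nnU-Net API)
BASELINE_PER_GPU = SystemSpec(ram_gb=32, cpu_cores=6, cpu_threads=12)


def recommended_spec(num_gpus: int) -> SystemSpec:
    """Scale the per-GPU baseline linearly with the number of GPUs."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return SystemSpec(
        ram_gb=BASELINE_PER_GPU.ram_gb * num_gpus,
        cpu_cores=BASELINE_PER_GPU.cpu_cores * num_gpus,
        cpu_threads=BASELINE_PER_GPU.cpu_threads * num_gpus,
    )


if __name__ == "__main__":
    # Example: a server with 4 GPUs
    print(recommended_spec(4))
```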

# Benchmark Details
To ensure your system is running as intended, we provide some benchmark numbers against which you can compare. Here 
are the details about benchmarking:

- We benchmark **2d**, **3d_fullres** and a modified 3d_fullres that uses 3x the default batch size (called **3d_fullres large** here) 
- The datasets **Task002_Heart**, **Task005_Prostate** and **Task003_Liver** of the Medical Segmentation Decathlon are used 
(they provide a good spectrum of dataset properties)
- We use the `nnUNetTrainerV2_5epochs` trainer. This will run only for 5 epochs and it will skip validation. 
From the 5 epochs, we select the fastest one as the epoch time. 
- We will also be running the nnUNetTrainerV2_5epochs_dummyLoad trainer on the 3d_fullres config (called **3d_fullres dummy**). This trainer does not use 
the dataloader and instead uses random dummy inputs, bypassing all data augmentation (CPU) and I/O bottlenecks. 
- All trainings are done with mixed precision. This is why Pascal GPUs (Titan Xp) are so slow: they do not have 
tensor cores to accelerate it.

# How to run the benchmark
First go into the folder where the preprocessed data and plans file of the task you would like to use are located. For me this is
`/home/fabian/data/nnUNet_preprocessed/Task002_Heart`

Then run the following python snippet. This will create our custom **3d_fullres_large** configuration. Note that this 
large configuration will only run on GPUs with 16GB or more! We included it in the test because some GPUs 
(V100, and probably also A100) can shine when they get more work to do per iteration.
```python
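from batchgenerators.utilities.file_and_folder_operations import load_pickle, save_pickle

# Load the default 3D plans and pick the last (highest resolution) stage
plans = load_pickle('nnUNetPlansv2.1_plans_3D.pkl')
stage = max(plans['plans_per_stage'].keys())
# Triple the batch size and save the result under a new plans identifier
plans['plans_per_stage'][stage]['batch_size'] *= 3
save_pickle(plans, 'nnUNetPlansv2.1_bs3x_plans_3D.pkl')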
```

Now you can run the benchmarks. Each should only take a couple of minutes.
```bash
nnUNet_train 2d nnUNetTrainerV2_5epochs TASKID 0
nnUNet_train 3d_fullres nnUNetTrainerV2_5epochs TASKID 0
nnUNet_train 3d_fullres nnUNetTrainerV2_5epochs_dummyLoad TASKID 0
nnUNet_train 3d_fullres nnUNetTrainerV2_5epochs TASKID 0 -p nnUNetPlansv2.1_bs3x # optional, only for GPUs with 16GB of VRAM or more
```

The time we are interested in is the epoch time. You can find it in the text output (stdout) or the log file 
located in your `RESULTS_FOLDER`. Note that the trainers used here run for 5 epochs. Select the fastest time from your 
output as your benchmark time.
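
If you would rather not pick the fastest epoch by eye, a small script can do it for you. The sketch below assumes that the log contains lines of the form `This epoch took <seconds> s`; check your own log file and adapt the pattern if it looks different. The function names and the sample log text are made up for illustration.

```python
import re
from typing import List

# ASSUMPTION: epoch durations appear in the log as "This epoch took <float> s"
EPOCH_TIME_PATTERN = re.compile(r"This epoch took ([0-9]+(?:\.[0-9]+)?) s")


def epoch_times(log_text: str) -> List[float]:
    """Return all epoch durations (in seconds) found in the log text."""
    return [float(match.group(1)) for match in EPOCH_TIME_PATTERN.finditer(log_text)]


def benchmark_time(log_text: str) -> float:
    """The benchmark time is the fastest epoch."""
    times = epoch_times(log_text)
    if not times:
        raise ValueError("no epoch times found in log text")
    return min(times)


# Made-up sample log, only used to demonstrate the parsing
example_log = """
epoch:  0
This epoch took 152.310000 s
epoch:  1
This epoch took 148.360000 s
epoch:  2
This epoch took 149.020000 s
"""

if __name__ == "__main__":
    print(benchmark_time(example_log))
```

To use it on a real run, read the log file from your `RESULTS_FOLDER` into a string and pass it to `benchmark_time`.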

# Results

The following table shows the results we are getting on our servers/workstations. We are using pytorch 1.7.1 that we 
compiled ourselves using the instructions found [here](https://github.com/pytorch/pytorch#from-source). The cuDNN 
version we used is 8.1.0.77. You should be seeing similar numbers when you 
run the benchmark on your server/workstation. Note that fluctuations of a couple of seconds are normal!

IMPORTANT: Compiling pytorch from source is currently mandatory for best performance! Pytorch 1.8 does not have 
working tensorcore acceleration for 3D convolutions when installed with pip or conda!

IMPORTANT: A100 and V100 are very fast with the newer cuDNN versions and need more CPU workers to prevent bottlenecks. 
Set the environment variable `nnUNet_n_proc_DA=XX` 
to increase the number of data augmentation workers. Recommended: 20 for V100, 32 for A100. Datasets with many input 
modalities (BraTS: 4) require A LOT of CPU and should be used with even larger values for `nnUNet_n_proc_DA`.
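
If you launch the benchmark from a Python script, one way to set the variable is to pass a modified environment to the subprocess. This is only a sketch: `benchmark_env` and `benchmark_command` are hypothetical helpers written for this example, and `TASKID` is the same placeholder as in the commands above.

```python
import os
from typing import Dict, List, Mapping, Optional


def benchmark_env(n_proc_da: int, base_env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Copy the environment and set the number of data augmentation workers."""
    if n_proc_da < 1:
        raise ValueError("n_proc_da must be at least 1")
    env = dict(os.environ if base_env is None else base_env)
    env["nnUNet_n_proc_DA"] = str(n_proc_da)
    return env


def benchmark_command(task_id: str) -> List[str]:
    """Build the 3d_fullres benchmark command shown in the section above."""
    return ["nnUNet_train", "3d_fullres", "nnUNetTrainerV2_5epochs", task_id, "0"]


# Usage (not executed here), e.g. with 20 workers as recommended for a V100:
# import subprocess
# subprocess.run(benchmark_command("TASKID"), env=benchmark_env(20), check=True)
```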

## Pytorch 1.7.1 compiled with cuDNN 8.1.0.77

|                                   | A100 40GB (DGX A100) 400W | V100 32GB SXM3 (DGX2) 350W | V100 32GB PCIe 250W | Quadro RTX6000 24GB 260W | Titan RTX 24GB 280W | RTX 2080 ti 11GB 250W | Titan Xp 12GB 250W |
|-----------------------------------|---------------------------|----------------------------|---------------------|--------------------------|---------------------|-----------------------|--------------------|
| Task002_Heart 2d                  | 40.06                     | 66.03                      | 76.19               | 78.01                    | 79.78               | 98.49                 | 177.87             |
| Task002_Heart 3d_fullres          | 51.17                     | 85.96                      | 99.29               | 110.47                   | 112.34              | 148.36                | 504.93             |
| Task002_Heart 3d_fullres dummy    | 48.53                     | 79                         | 89.66               | 105.16                   | 105.56              | 138.4                 | 501.64             |
| Task002_Heart 3d_fullres large    | 118.5                     | 220.45                     | 251.25              | 322.28                   | 300.96              | OOM                   | OOM                |
|                                   |                           |                            |                     |                          |                     |                       |                    |
| Task003_Liver 2d                  | 39.71                     | 60.69                      | 69.65               | 72.29                    | 76.17               | 92.54                 | 183.73             |
| Task003_Liver 3d_fullres          | 44.48                     | 75.53                      | 87.19               | 85.18                    | 86.17               | 106.76                | 290.87             |
| Task003_Liver 3d_fullres dummy    | 41.1                      | 70.96                      | 80.1                | 79.43                    | 79.43               | 101.54                | 289.03             |
| Task003_Liver 3d_fullres large    | 115.33                    | 213.27                     | 250.09              | 261.54                   | 266.66              | OOM                   | OOM                |
|                                   |                           |                            |                     |                          |                     |                       |                    |
| Task005_Prostate 2d               | 42.21                     | 68.88                      | 80.46               | 83.62                    | 81.59               | 102.81                | 183.68             |
| Task005_Prostate 3d_fullres       | 47.19                     | 76.33                      | 85.4                | 100                      | 102.05              | 132.82                | 415.45             |
| Task005_Prostate 3d_fullres dummy | 43.87                     | 70.58                      | 81.32               | 97.48                    | 98.99               | 124.73                | 410.12             |
| Task005_Prostate 3d_fullres large | 117.31                    | 209.12                     | 234.28              | 277.14                   | 284.35              | OOM                   | OOM                |
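
To put your own number into perspective, you can divide it by the matching entry of the table. The sketch below copies a few reference values from the table above (extend the dictionary as needed); the function name `slowdown` is made up for this example.

```python
from typing import Dict, Tuple

# A few reference epoch times (seconds) copied from the table above
REFERENCE_EPOCH_TIMES: Dict[Tuple[str, str], float] = {
    ("RTX 2080 ti", "Task002_Heart 2d"): 98.49,
    ("RTX 2080 ti", "Task002_Heart 3d_fullres"): 148.36,
    ("V100 32GB PCIe", "Task002_Heart 3d_fullres"): 99.29,
    ("A100 40GB", "Task002_Heart 3d_fullres"): 51.17,
}


def slowdown(gpu: str, benchmark: str, measured_seconds: float) -> float:
    """Return measured / reference. 1.0 means you match our numbers."""
    reference = REFERENCE_EPOCH_TIMES[(gpu, benchmark)]
    return measured_seconds / reference


if __name__ == "__main__":
    factor = slowdown("RTX 2080 ti", "Task002_Heart 3d_fullres", 155.0)
    print(f"{factor:.2f}x the reference epoch time")
```

A factor close to 1 (a couple of seconds difference) is fine. A factor that is far above 1 is a reason to look at the troubleshooting section.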

# Troubleshooting
Your epoch times are substantially slower than ours? That's not good! This section will help you figure out what is 
wrong. Note that each system is unique and we cannot help you find bottlenecks beyond providing the information 
presented in this section!

## First step: Make sure you have the right software!
In order to get maximum performance, you need to have pytorch compiled with a recent cuDNN version (8002 or newer is a must!). 
Unfortunately the currently provided pip/conda installable pytorch versions have a bug which causes their performance 
to be very low (see https://github.com/pytorch/pytorch/issues/57115 and https://github.com/pytorch/pytorch/issues/50153). 
They are about 2x-3x slower than the numbers we report in the table above. 
You need to have a pytorch version that was compiled from source to get maximum performance as shown in the table above.  
The easiest way to get that is by using the [Nvidia pytorch Docker](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch). 
If you cannot use docker, you will need to compile pytorch 
yourself. For that, first download and install cuDNN from the [Nvidia homepage](https://developer.nvidia.com/cudnn), then follow the 
[instructions on the pytorch website](https://github.com/pytorch/pytorch#from-source) on how to compile it.

If you compiled pytorch yourself, you can check for the correct cuDNN version by running:
```bash
python -c 'import torch;print(torch.backends.cudnn.version())'
```
If the output is `8002` or higher, then you are good to go. If not, you may have to take action. IMPORTANT: this 
only applies to pytorch that was compiled from source. pip/conda installed pytorch will report a recent cuDNN version 
but still have poor performance due to the bug linked above.

## Identifying the bottleneck
If the software is up to date and you are still experiencing problems, this is how you can figure out what is going on:

While a training is running, run `htop` and `watch -n 0.1 nvidia-smi` (depending on your locale you may have to write 
the interval as `0,1` instead). If you have physical access to the machine, also have a look at the LED indicating I/O activity.

Here is what you can read from that:
- `nvidia-smi` shows the GPU activity. `watch -n 0.1` makes this command refresh every 0.1s. This will allow you to 
see your GPU in action. A well-running training will have your GPU pegged at 90-100% with no drops in GPU utilization. 
Your power draw should also be close to the maximum (for example `237W / 250 W`) at all times. 
- `htop` gives you an overview of the CPU usage. nnU-Net uses 12 processes for data augmentation + one main process. 
This means that up to 13 processes should be running simultaneously.
- The I/O LED indicates that your system is reading/writing data from/to your hard drive/SSD. Whenever this is 
blinking your system is doing something with your HDD/SSD.
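
If you prefer to record the GPU numbers instead of watching them, `nvidia-smi` can print them as CSV. The following sketch queries utilization and power via the `--query-gpu` option and parses the result. The helper names are made up for this example, and the parser assumes that all three fields are numeric (some GPUs report `[N/A]` for power, which this sketch does not handle).

```python
import subprocess
from typing import List, NamedTuple

# One CSV line per GPU: utilization in %, power draw in W, power limit in W
QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,power.draw,power.limit",
    "--format=csv,noheader,nounits",
]


class GpuSample(NamedTuple):
    utilization_percent: float
    power_draw_watts: float
    power_limit_watts: float


def parse_samples(csv_text: str) -> List[GpuSample]:
    """Parse the CSV output of the query above (one line per GPU)."""
    samples = []
    for line in csv_text.strip().splitlines():
        utilization, draw, limit = (float(field) for field in line.split(","))
        samples.append(GpuSample(utilization, draw, limit))
    return samples


def query_gpus() -> List[GpuSample]:
    """Run nvidia-smi once and return one sample per GPU."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_samples(result.stdout)
```

Calling `query_gpus()` in a loop (for example every 0.1s) while a training is running gives you a utilization trace that you can inspect afterwards.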

### GPU bottleneck
If `nvidia-smi` is constantly showing 90-100% GPU utilization and the reported power draw is near the maximum, your 
GPU is the bottleneck. This is great! That means that your other components are not slowing it down. Your epoch times 
should be the same as ours reported above. If they are not then you need to investigate your software stack (see the cuDNN section above).

What can you do about it?
1) There is nothing holding you back. Everything is fine!
2) If you need faster training, consider upgrading your GPU. Performance numbers are above, feel free to use them for guidance.
3) Think about whether you need more (slower) GPUs or fewer (faster) GPUs. Make sure to include Server/Workstation 
costs into your calculations. Sometimes it is better to go with more cheaper but slower GPUs and run multiple trainings 
in parallel.

### CPU bottleneck
You can recognize a CPU bottleneck as follows:
1) htop is consistently showing 10+ processes that are associated with your nnU-Net training
2) nvidia-smi is reporting jumps of GPU activity with zeroes in between

What can you do about it?
1) Depending on your single core performance, some datasets may require more than the default 12 processes for data 
augmentation. The CPU requirements for DA increase roughly linearly with the number of input modalities. Most datasets 
will train fine with far fewer than 12 (6 or even just 4). But datasets with for example 4 modalities may require more. 
If you have more than 12 CPU threads available, set the environment variable `nnUNet_n_proc_DA` to a number higher than 12.
2) If your CPU has fewer than 12 threads in total, running 12 processes can overburden it. Try lowering `nnUNet_n_proc_DA` 
to the number of threads you have available.
3) (sounds stupid, but this is the only other way) upgrade your CPU. I have seen servers with 8 CPU cores (16 threads)
 and 8 GPUs in them. That is not well balanced. CPUs are cheap compared to GPUs. On a 'workstation' (single or dual GPU) 
 you can get something like a Ryzen 3900X or 3950X. On a server you could consider Xeon 6226R or 6258R on the Intel 
 side or the EPYC 7302P, 7402P, 7502P or 7702P on the AMD side. Make sure to scale the number of cores according to your 
 number of GPUs and use case. Feel free to also use our nnU-Net recommendations from above.
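
The first two points can be summarized as "ask for as many workers as your dataset needs, but never more than you have threads". A minimal sketch of that rule is shown below. `choose_n_proc_da` is a made-up helper, not part of nnU-Net, and the value it returns still has to be exported as `nnUNet_n_proc_DA` before starting the training.

```python
import os
from typing import Optional

# Default number of data augmentation processes mentioned above
DEFAULT_N_PROC_DA = 12


def choose_n_proc_da(desired: int = DEFAULT_N_PROC_DA,
                     available_threads: Optional[int] = None) -> int:
    """Clamp the desired number of workers to the CPU threads that are available."""
    if available_threads is None:
        # os.cpu_count() can return None, fall back to a single thread
        available_threads = os.cpu_count() or 1
    return max(1, min(desired, available_threads))


if __name__ == "__main__":
    print(f"export nnUNet_n_proc_DA={choose_n_proc_da()}")
```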
 
### I/O bottleneck
On a workstation, I/O bottlenecks can be identified by looking at the LED indicating I/O activity. This is what an 
I/O bottleneck looks like:
- nvidia-smi is reporting jumps of GPU activity with zeroes in between
- htop is not showing many active CPU processes
- I/O LED is blinking rapidly or turned on constantly

Detecting I/O bottlenecks is difficult on servers where you may not have physical access. Tools like `iotop` are 
difficult to read and can only be run with sudo. However, the presence of an I/O LED is not strictly necessary. If
- nvidia-smi is reporting jumps of GPU activity with zeroes in between
- htop is not showing many active CPU processes

then the only possible issue to my knowledge is in fact an I/O bottleneck. 
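
The decision logic of the three bottleneck sections can be written down as a small function. This is a simplified sketch of the rules of thumb given here, not a reliable diagnostic: the thresholds (90% utilization, 10 busy processes) are the ones named in the text, the function name is made up, and you have to collect the inputs yourself from `nvidia-smi` and `htop`.

```python
from typing import Sequence


def classify_bottleneck(gpu_utilization: Sequence[float], busy_cpu_processes: int) -> str:
    """Return 'gpu', 'cpu' or 'io' based on the rules of thumb in this section.

    gpu_utilization: GPU utilization samples in percent, taken during training
    busy_cpu_processes: number of busy processes that belong to the training
    """
    if not gpu_utilization:
        raise ValueError("need at least one GPU utilization sample")
    # GPU constantly at 90-100% with no drops: the GPU is the limiting factor
    if min(gpu_utilization) >= 90:
        return "gpu"
    # GPU utilization drops while 10+ training processes keep the CPU busy
    if busy_cpu_processes >= 10:
        return "cpu"
    # GPU utilization drops and the CPU is mostly idle: I/O is the remaining suspect
    return "io"


if __name__ == "__main__":
    print(classify_bottleneck([0, 97, 0, 99], busy_cpu_processes=2))
```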

Here is what you can do about an I/O bottleneck:
1) Make sure you are actually using an SSD to store the preprocessed data (`nnUNet_preprocessed`). Do not use an 
SSD connected via USB! Never use an HDD. Do not use a network drive that was not specifically designed to handle fast I/O. 
(Note that you can use a network drive if it was designed for this purpose. At the DKFZ we use a
[flashblade](https://www.purestorage.com/products/file-and-object/flashblade.html) connected via ethernet and that works 
great.)
2) A SATA SSD is only enough to feed 1-2 GPUs. If you have more GPUs installed you may have to upgrade to an NVMe 
drive (make sure to get one with a PCIe interface!).