|
|
|
Custom hardware for training |
|
The hardware you use to run model training and inference can have a big effect on performance. For a deep dive into GPUs, make sure to check out Tim Dettmers' excellent blog post.
|
Let's have a look at some practical advice for GPU setups. |
|
GPU |
|
When you train bigger models you have essentially three options: |
|
|
|
bigger GPUs |
|
more GPUs |
|
more CPU and NVMe memory, which DeepSpeed ZeRO-Infinity can offload to (see the config sketch below)
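To give a feel for the third option, here is a minimal sketch of a DeepSpeed ZeRO stage-3 config that offloads optimizer state and parameters to NVMe. The file name ds_zero3_nvme.json and the path /local_nvme are placeholders, and a real config needs the usual batch size, optimizer and precision settings as well.

```bash
# Sketch only: ds_zero3_nvme.json and /local_nvme are placeholders,
# and a complete DeepSpeed config needs more fields than shown here.
cat > ds_zero3_nvme.json <<'EOF'
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "nvme", "nvme_path": "/local_nvme" },
    "offload_param":     { "device": "nvme", "nvme_path": "/local_nvme" }
  }
}
EOF
```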
|
|
|
Let's start with the case where you have a single GPU.
|
Power and Cooling |
|
If you bought an expensive high-end GPU, make sure to give it the correct power and sufficient cooling.
|
Power: |
|
Some high-end consumer GPU cards have 2 and sometimes 3 PCI-E 8-Pin power sockets. Make sure you have as many independent 12V PCI-E 8-Pin cables plugged into the card as there are sockets. Do not use the 2 splits at one end of the same cable (also known as a pigtail cable). That is, if you have 2 sockets on the GPU, you want 2 PCI-E 8-Pin cables going from your PSU to the card and not one that has 2 PCI-E 8-Pin connectors at the end! You won't get the full performance out of your card otherwise.
|
Each PCI-E 8-Pin power cable needs to be plugged into a 12V rail on the PSU side and can supply up to 150W of power. |
|
Some other cards may use PCI-E 12-Pin connectors, and these can deliver up to 500-600W of power.
|
Low-end cards may use 6-Pin connectors, which supply up to 75W of power.
|
Additionally, you want a high-end PSU that provides stable voltage. Some lower-quality ones may not give the card the stable voltage it needs to function at its peak.
|
And of course the PSU needs to have enough spare wattage to power the card.
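To sanity-check that the card actually draws the power it is rated for under load, you can poll its power draw against the enforced limit with nvidia-smi. This is just a quick sketch; the exact fields available depend on your driver version.

```bash
# Poll power draw vs. the power limit every 2 seconds while training runs.
# If power.draw stays far below power.limit under full load, suspect the cabling or PSU.
nvidia-smi --query-gpu=index,name,power.draw,power.limit --format=csv -l 2
```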
|
Cooling: |
|
When a GPU overheats it will start throttling down and will not deliver full performance, and it can even shut down if it gets too hot.
|
It's hard to tell the exact best temperature to strive for when a GPU is heavily loaded, but probably anything under +80C is good, and lower is better - 70-75C is an excellent range to be in. Throttling is likely to start at around 84-90C. Besides throttling performance, a prolonged very high temperature is also likely to reduce the lifespan of the GPU.
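A simple way to keep an eye on this during a run is to log temperature and clocks; if the SM clock drops as the temperature climbs, you are most likely being thermally throttled. A rough sketch:

```bash
# Log temperature and SM clock every 5 seconds; dropping clocks at high
# temperature usually indicate thermal throttling.
nvidia-smi --query-gpu=index,temperature.gpu,clocks.sm,utilization.gpu --format=csv -l 5

# More detail, including the currently active throttle reasons:
nvidia-smi -q -d TEMPERATURE,PERFORMANCE
```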
|
Next, let's have a look at one of the most important aspects of multi-GPU setups: connectivity.
|
Multi-GPU Connectivity |
|
If you use multiple GPUs, the way the cards are interconnected can have a huge impact on the total training time. If the GPUs are on the same physical node, you can run:
|
|
|
```bash
nvidia-smi topo -m
```
|
and it will tell you how the GPUs are interconnected. On a machine with two GPUs connected with NVLink, you will most likely see something like:
|
```
        GPU0    GPU1    CPU Affinity    NUMA Affinity
GPU0     X      NV2     0-23            N/A
GPU1    NV2      X      0-23            N/A
```
|
On a different machine without NVLink you may see:
|
```
        GPU0    GPU1    CPU Affinity    NUMA Affinity
GPU0     X      PHB     0-11            N/A
GPU1    PHB      X      0-11            N/A
```
|
The report includes this legend: |
|
```
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
```
|
So the first report, NV2, tells us the GPUs are interconnected with 2 NVLinks, while in the second report, PHB, we have a typical consumer-level PCIe + Host Bridge setup.
|
Check what type of connectivity you have on your setup. Some of these will make the communication between cards faster (e.g. NVLink), others slower (e.g. PHB). |
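If the report shows NVLink, you can also query the individual links directly. The commands below are a sketch; the exact output varies across driver versions.

```bash
# Per-link NVLink status and speed for GPU 0
nvidia-smi nvlink --status -i 0

# Per-link error counters can hint at a flaky bridge or connector
nvidia-smi nvlink --errorcounters -i 0
```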
|
Depending on the type of scalability solution used, the connectivity speed could have a major or a minor impact. If the GPUs need to sync rarely, as in DDP, the impact of a slower connection will be less significant. If the GPUs need to send messages to each other often, as in ZeRO-DP, then faster connectivity becomes super important to achieve faster training. |
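To confirm which transport NCCL actually picks at runtime, you can enable its debug logging; during initialization NCCL prints whether it uses P2P over NVLink, PCIe, shared memory or sockets for each channel. This sketch simply reuses the benchmark command shown later in this section with a shorter run:

```bash
# NCCL_DEBUG=INFO makes NCCL log the transport it selects for each channel.
NCCL_DEBUG=INFO CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node 2 \
examples/pytorch/language-modeling/run_clm.py --model_name_or_path openai-community/gpt2 \
--dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train \
--output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 10
```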
|
NVlink |
|
NVLink is a wire-based serial multi-lane near-range communications link developed by Nvidia. |
|
Each new generation provides faster bandwidth, e.g. here is a quote from Nvidia Ampere GA102 GPU Architecture:
|
|
|
Third-Generation NVLink®

GA102 GPUs utilize NVIDIA's third-generation NVLink interface, which includes four x4 links, with each link providing 14.0625 GB/sec bandwidth in each direction between two GPUs. Four links provide 56.25 GB/sec bandwidth in each direction, and 112.5 GB/sec total bandwidth between two GPUs. Two RTX 3090 GPUs can be connected together for SLI using NVLink. (Note that 3-Way and 4-Way SLI configurations are not supported.)
|
|
|
So the higher X you get in the NVX report in the output of nvidia-smi topo -m, the better. The generation will depend on your GPU architecture.
|
Let's compare the execution of an openai-community/gpt2 language model training over a small sample of wikitext, with and without NVLink.
|
The results are: |
|
| NVlink | Time |
| -----  | ---: |
| Y      | 101s |
| N      | 131s |
|
You can see that NVLink completes the training ~23% faster. In the second benchmark we use NCCL_P2P_DISABLE=1 to tell the GPUs not to use NVLink. |
|
Here is the full benchmark code and outputs: |
|
```bash
# DDP w/ NVLink

rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 torchrun \
--nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py --model_name_or_path openai-community/gpt2 \
--dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train \
--output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200

{'train_runtime': 101.9003, 'train_samples_per_second': 1.963, 'epoch': 0.69}

# DDP w/o NVLink

rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 NCCL_P2P_DISABLE=1 torchrun \
--nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py --model_name_or_path openai-community/gpt2 \
--dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train \
--output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200

{'train_runtime': 131.4367, 'train_samples_per_second': 1.522, 'epoch': 0.69}

# Hardware: 2x TITAN RTX 24GB each + NVlink with 2 NVLinks (NV2 in nvidia-smi topo -m)
# Software: pytorch-1.8-to-be + cuda-11.0 / transformers==4.3.0.dev0
```