
# Long Sequence Performance

- The tables below show the pre-training performance of the LLAMA2-7B and LLAMA3-8B models with CP (context parallelism) on H100 and B200 GPUs, respectively, and compare it against results without CP at various input sequence lengths. The detailed model-parallel configurations and the achieved performance are shown for the training runs with CP. For the runs without CP, we use the most performant model- and data-parallel configuration that fits within the memory capacity of each GPU system. The short sketch below shows how the speedup and batch-size columns of the tables are derived.
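
To make the table columns concrete, here is a minimal, purely illustrative Python sketch (`TableRow` is a hypothetical helper, not part of NeMo; the row values are copied from the LLAMA2-7B (H100) table below). It recomputes the speedup column as the ratio of the two per-GPU TFLOPS values and shows that the global batch size is scaled inversely with the sequence length, so every configuration processes the same number of tokens per batch.

```python
from dataclasses import dataclass


@dataclass
class TableRow:
    """One row of the results tables below (hypothetical helper, not a NeMo API)."""
    seq_len_k: int       # sequence length in K (1024) tokens
    batch_size: int      # global batch size in sequences
    tflops_no_cp: float  # per-GPU throughput without context parallelism
    tflops_cp: float     # per-GPU throughput with context parallelism

    @property
    def tokens_per_batch(self) -> int:
        # The batch size is halved every time the sequence length doubles,
        # so the number of tokens per global batch stays constant.
        return self.seq_len_k * 1024 * self.batch_size

    @property
    def speedup(self) -> float:
        # The rightmost table column is simply the ratio of the two TFLOPS columns.
        return self.tflops_cp / self.tflops_no_cp


# Two rows from the LLAMA2-7B (H100) table below.
rows = [
    TableRow(seq_len_k=128, batch_size=32, tflops_no_cp=424, tflops_cp=555),
    TableRow(seq_len_k=512, batch_size=8,  tflops_no_cp=104, tflops_cp=549),
]
for r in rows:
    print(f"{r.seq_len_k}K: {r.tokens_per_batch} tokens/batch, speedup {r.speedup:.2f}x")
# -> 128K: 4194304 tokens/batch, speedup 1.31x
# -> 512K: 4194304 tokens/batch, speedup 5.28x
```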

## LLAMA3-8B (FP8) - B200

| SeqLen (K) | # of GPUs | Batch Size | TFLOPS / GPU (without CP) | TP | PP | DP | CP | TFLOPS / GPU (with CP) | Speedup (with CP / without CP) |
|------------|-----------|------------|---------------------------|----|----|----|----|------------------------|--------------------------------|
| 8          | 8         | 512        | 1,671                     | 1  | 1  | 2  | 1  | 1,671                  | 1.00                           |
| 16         | 16        | 256        | 1,717                     | 1  | 1  | 4  | 1  | 1,717                  | 1.00                           |
| 32         | 32        | 128        | 1,549                     | 1  | 1  | 4  | 2  | 1,624                  | 1.05                           |
| 64         | 64        | 64         | 1,481                     | 1  | 1  | 4  | 4  | 1,600                  | 1.08                           |
| 128        | 128       | 32         | 1,438                     | 2  | 1  | 4  | 4  | 1,588                  | 1.10                           |
| 256        | 256       | 16         | 1,162                     | 4  | 1  | 4  | 4  | 1,590                  | 1.37                           |
| 512        | 512       | 8          | 607                       | 4  | 1  | 4  | 8  | 1,619                  | 2.67                           |
| 1024       | 1024      | 4          | -<sup>1</sup>             | 4  | 1  | 4  | 16 | 1,608                  | -                              |

<sup>1</sup> Because the maximum TP size is limited by the number of query groups (8 in LLAMA3-8B), the LLAMA3-8B model cannot be trained on a 1024K-token sequence without CP, even with full activation recomputation, due to GPU memory constraints. The sketch below illustrates this TP limit.
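
As a rough illustration of that constraint, here is a minimal sketch assuming the usual Megatron-style rule for grouped-query attention, namely that each tensor-parallel rank must own a whole number of query groups, so the query-group count must be evenly divisible by the TP size. `max_tensor_parallel_size` is a hypothetical helper for illustration only, not a NeMo function.

```python
def max_tensor_parallel_size(num_query_groups: int,
                             candidate_tp_sizes=(1, 2, 4, 8, 16, 32)) -> int:
    """Largest TP size that still divides the query-group (KV-head) count evenly.

    Hypothetical helper: under Megatron-style GQA sharding, a TP size that does
    not divide the number of query groups would split a KV head across ranks,
    so the usable TP size is capped at the query-group count.
    """
    valid = [tp for tp in candidate_tp_sizes if num_query_groups % tp == 0]
    return max(valid)


# LLAMA3-8B uses grouped-query attention with 8 query groups (KV heads),
# so TP is capped at 8; further memory relief for very long sequences has to
# come from other dimensions such as CP.
print(max_tensor_parallel_size(num_query_groups=8))  # -> 8
```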

## LLAMA2-7B (FP8) - H100

| SeqLen (K) | # of GPUs | Batch Size | TFLOPS / GPU (without CP) | TP | PP | DP | CP | TFLOPS / GPU (with CP) | Speedup (with CP / without CP) |
|------------|-----------|------------|---------------------------|----|----|----|----|------------------------|--------------------------------|
| 4          | 4         | 1024       | 768                       | 1  | 1  | 4  | 1  | 768                    | 1.00                           |
| 8          | 8         | 512        | 730                       | 1  | 2  | 4  | 1  | 730                    | 1.00                           |
| 16         | 16        | 256        | 660                       | 2  | 1  | 8  | 1  | 660                    | 1.00                           |
| 32         | 32        | 128        | 595                       | 2  | 1  | 8  | 2  | 610                    | 1.03                           |
| 64         | 64        | 64         | 534                       | 4  | 1  | 8  | 2  | 574                    | 1.07                           |
| 128        | 128       | 32         | 424                       | 4  | 1  | 8  | 4  | 555                    | 1.31                           |
| 256        | 256       | 16         | 392                       | 4  | 1  | 8  | 8  | 549                    | 1.40                           |
| 512        | 512       | 8          | 104                       | 8  | 1  | 4  | 16 | 549                    | 5.28                           |
| 1024       | 1024      | 4          | 26.5                      | 8  | 1  | 4  | 32 | 536                    | 20.23                          |