Long Sequence Performance
- The tables below show the pre-training performance of the LLAMA2-7B and LLAMA3-8B models on H100 and B200 GPUs, respectively, with CP (context parallelism), and compare it against runs without CP at various input sequence lengths. The TP, PP, DP, and CP columns give the detailed model- and data-parallel configuration of each CP run, alongside the achieved performance. For the non-CP runs, we use the most performant model- and data-parallel configuration that fits within the memory capacity of each GPU system.
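Two quantities in the tables are simple to reproduce: the speedup column is just the ratio of achieved per-GPU throughput with and without CP, and CP divides each sequence's activations across the CP ranks. A minimal sketch (illustrative helper names, not part of any NeMo API):

```python
def cp_speedup(tflops_with_cp: float, tflops_without_cp: float) -> float:
    """Speedup as reported in the tables: ratio of per-GPU throughput."""
    return tflops_with_cp / tflops_without_cp

def tokens_per_cp_rank(seq_len_tokens: int, cp: int) -> int:
    """With context parallelism, each CP rank holds roughly
    seq_len / CP tokens of activations (simplified; the actual
    Megatron-style sharding also load-balances causal attention)."""
    return seq_len_tokens // cp

# e.g. LLAMA2-7B at 512K tokens on H100: 549 vs. 104 TFLOPS/GPU
print(round(cp_speedup(549, 104), 2))        # 5.28
# 1,024K-token sequence split across CP=32 ranks
print(tokens_per_cp_rank(1024 * 1024, 32))   # 32768 tokens per rank
```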
LLAMA3-8B (FP8) - B200
- Container: NeMo 25.04.rc2
- System: DGX-B200
SeqLen (K) | # of GPUs | Batch Size | Without CP: TFLOPS/GPU | TP | PP | DP | CP | With CP: TFLOPS/GPU | Speedup (with CP / without CP) |
---|---|---|---|---|---|---|---|---|---|
8 | 8 | 512 | 1,671 | 1 | 1 | 2 | 1 | 1,671 | 1.00 |
16 | 16 | 256 | 1,717 | 1 | 1 | 4 | 1 | 1,717 | 1.00 |
32 | 32 | 128 | 1,549 | 1 | 1 | 4 | 2 | 1,624 | 1.05 |
64 | 64 | 64 | 1,481 | 1 | 1 | 4 | 4 | 1,600 | 1.08 |
128 | 128 | 32 | 1,438 | 2 | 1 | 4 | 4 | 1,588 | 1.10 |
256 | 256 | 16 | 1,162 | 4 | 1 | 4 | 4 | 1,590 | 1.37 |
512 | 512 | 8 | 607 | 4 | 1 | 4 | 8 | 1,619 | 2.67 |
1024 | 1024 | 4 | - 1) | 4 | 1 | 4 | 16 | 1,608 | - |
1) Since the maximum TP size is limited by the number of query groups (8 in LLAMA3-8B), the LLAMA3-8B model cannot be run on a 1024K-token sequence without CP, even with full activation recomputation, due to GPU memory constraints.
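The footnote's memory argument can be seen with back-of-the-envelope arithmetic: per-GPU activation footprint scales roughly with seq_len / (TP × CP), so with TP capped at 8 by the query groups and no CP, far more activation tokens land on each GPU than in the table's TP=4, CP=16 configuration. A simplified sketch (illustrative helper, not an exact memory model):

```python
def activation_tokens_per_gpu(seq_len_k: int, tp: int, cp: int) -> int:
    """Rough intuition only: activation memory per GPU scales with
    seq_len / (TP * CP). Ignores PP, recomputation, and constants."""
    return seq_len_k // (tp * cp)

# 1,024K-token sequence, TP capped at 8 (query groups), no CP:
print(activation_tokens_per_gpu(1024, tp=8, cp=1))   # 128 (K tokens per GPU)
# The table's with-CP configuration for the same row:
print(activation_tokens_per_gpu(1024, tp=4, cp=16))  # 16 (K tokens per GPU)
```

CP thus reduces the per-GPU activation share by 8x here, which is what makes the 1024K run feasible.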
LLAMA2-7B (FP8) - H100
- Container: NeMo 24.03.01.framework
- System: DGX-H100
SeqLen (K) | # of GPUs | Batch Size | Without CP: TFLOPS/GPU | TP | PP | DP | CP | With CP: TFLOPS/GPU | Speedup (with CP / without CP) |
---|---|---|---|---|---|---|---|---|---|
4 | 4 | 1024 | 768 | 1 | 1 | 4 | 1 | 768 | 1.00 |
8 | 8 | 512 | 730 | 1 | 2 | 4 | 1 | 730 | 1.00 |
16 | 16 | 256 | 660 | 2 | 1 | 8 | 1 | 660 | 1.00 |
32 | 32 | 128 | 595 | 2 | 1 | 8 | 2 | 610 | 1.03 |
64 | 64 | 64 | 534 | 4 | 1 | 8 | 2 | 574 | 1.07 |
128 | 128 | 32 | 424 | 4 | 1 | 8 | 4 | 555 | 1.31 |
256 | 256 | 16 | 392 | 4 | 1 | 8 | 8 | 549 | 1.40 |
512 | 512 | 8 | 104 | 8 | 1 | 4 | 16 | 549 | 5.28 |
1024 | 1024 | 4 | 26.5 | 8 | 1 | 4 | 32 | 536 | 20.23 |
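For reference, a configuration like the H100 1024K-token row would be expressed through the usual NeMo hydra overrides, roughly as sketched below. This is an illustrative launch fragment under assumed defaults, not the exact scripts behind these numbers; data, checkpointing, and cluster flags are omitted.

```shell
# Hypothetical launch sketch for the TP=8, PP=1, CP=32 configuration
# (1024 GPUs = 128 nodes x 8 GPUs) on a 1,024K-token sequence.
python examples/nlp/language_modeling/megatron_gpt_pretraining.py \
  trainer.num_nodes=128 \
  trainer.devices=8 \
  model.tensor_model_parallel_size=8 \
  model.pipeline_model_parallel_size=1 \
  model.context_parallel_size=32 \
  model.encoder_seq_length=1048576
```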