
# Long Sequence Performance

- The tables below show the pre-training performance of the LLAMA2-7B and LLAMA3-8B models with CP (context parallelism) on H100 and B200 GPUs, respectively, and compare it against results without CP at various input sequence lengths. The detailed model-parallel configurations and the achieved performance are shown for the training runs with CP. For the runs without CP, we use the most performant model- and data-parallel configuration that fits within the memory capacity of each GPU system. The short sketch below shows how the speedup and batch-size columns of the tables are derived.
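
To make the table columns concrete, here is a minimal, purely illustrative Python sketch (`TableRow` is a hypothetical helper, not part of NeMo; the row values are copied from the LLAMA2-7B (H100) table below). It recomputes the speedup column as the ratio of the two per-GPU TFLOPS values and shows that the global batch size is scaled inversely with the sequence length, so every configuration processes the same number of tokens per batch.

```python
from dataclasses import dataclass


@dataclass
class TableRow:
    """One row of the results tables below (hypothetical helper, not a NeMo API)."""
    seq_len_k: int       # sequence length in K (1024) tokens
    batch_size: int      # global batch size in sequences
    tflops_no_cp: float  # per-GPU throughput without context parallelism
    tflops_cp: float     # per-GPU throughput with context parallelism

    @property
    def tokens_per_batch(self) -> int:
        # The batch size is halved every time the sequence length doubles,
        # so the number of tokens per global batch stays constant.
        return self.seq_len_k * 1024 * self.batch_size

    @property
    def speedup(self) -> float:
        # The rightmost table column is simply the ratio of the two TFLOPS columns.
        return self.tflops_cp / self.tflops_no_cp


# Two rows from the LLAMA2-7B (H100) table below.
rows = [
    TableRow(seq_len_k=128, batch_size=32, tflops_no_cp=424, tflops_cp=555),
    TableRow(seq_len_k=512, batch_size=8,  tflops_no_cp=104, tflops_cp=549),
]
for r in rows:
    print(f"{r.seq_len_k}K: {r.tokens_per_batch} tokens/batch, speedup {r.speedup:.2f}x")
# -> 128K: 4194304 tokens/batch, speedup 1.31x
# -> 512K: 4194304 tokens/batch, speedup 5.28x
```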

## LLAMA3-8B (FP8) - B200

| SeqLen (K) | # of GPUs | Batch Size | TFLOPS / GPU (without CP) | TP | PP | DP | CP | TFLOPS / GPU (with CP) | Speedup (with CP / without CP) |
|------------|-----------|------------|---------------------------|----|----|----|----|------------------------|--------------------------------|
| 8          | 8         | 512        | 1,671                     | 1  | 1  | 2  | 1  | 1,671                  | 1.00                           |
| 16         | 16        | 256        | 1,717                     | 1  | 1  | 4  | 1  | 1,717                  | 1.00                           |
| 32         | 32        | 128        | 1,549                     | 1  | 1  | 4  | 2  | 1,624                  | 1.05                           |
| 64         | 64        | 64         | 1,481                     | 1  | 1  | 4  | 4  | 1,600                  | 1.08                           |
| 128        | 128       | 32         | 1,438                     | 2  | 1  | 4  | 4  | 1,588                  | 1.10                           |
| 256        | 256       | 16         | 1,162                     | 4  | 1  | 4  | 4  | 1,590                  | 1.37                           |
| 512        | 512       | 8          | 607                       | 4  | 1  | 4  | 8  | 1,619                  | 2.67                           |
| 1024       | 1024      | 4          | -<sup>1</sup>             | 4  | 1  | 4  | 16 | 1,608                  | -                              |

<sup>1</sup> Because the maximum TP size is limited by the number of query groups (8 in LLAMA3-8B), the LLAMA3-8B model cannot be trained on a 1024K-token sequence without CP, even with full activation recomputation, due to GPU memory constraints. The sketch below illustrates this TP limit.
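
As a rough illustration of that constraint, here is a minimal sketch assuming the usual Megatron-style rule for grouped-query attention, namely that each tensor-parallel rank must own a whole number of query groups, so the query-group count must be evenly divisible by the TP size. `max_tensor_parallel_size` is a hypothetical helper for illustration only, not a NeMo function.

```python
def max_tensor_parallel_size(num_query_groups: int,
                             candidate_tp_sizes=(1, 2, 4, 8, 16, 32)) -> int:
    """Largest TP size that still divides the query-group (KV-head) count evenly.

    Hypothetical helper: under Megatron-style GQA sharding, a TP size that does
    not divide the number of query groups would split a KV head across ranks,
    so the usable TP size is capped at the query-group count.
    """
    valid = [tp for tp in candidate_tp_sizes if num_query_groups % tp == 0]
    return max(valid)


# LLAMA3-8B uses grouped-query attention with 8 query groups (KV heads),
# so TP is capped at 8; further memory relief for very long sequences has to
# come from other dimensions such as CP.
print(max_tensor_parallel_size(num_query_groups=8))  # -> 8
```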

## LLAMA2-7B (FP8) - H100

| SeqLen (K) | # of GPUs | Batch Size | TFLOPS / GPU (without CP) | TP | PP | DP | CP | TFLOPS / GPU (with CP) | Speedup (with CP / without CP) |
|------------|-----------|------------|---------------------------|----|----|----|----|------------------------|--------------------------------|
| 4          | 4         | 1024       | 768                       | 1  | 1  | 4  | 1  | 768                    | 1.00                           |
| 8          | 8         | 512        | 730                       | 1  | 2  | 4  | 1  | 730                    | 1.00                           |
| 16         | 16        | 256        | 660                       | 2  | 1  | 8  | 1  | 660                    | 1.00                           |
| 32         | 32        | 128        | 595                       | 2  | 1  | 8  | 2  | 610                    | 1.03                           |
| 64         | 64        | 64         | 534                       | 4  | 1  | 8  | 2  | 574                    | 1.07                           |
| 128        | 128       | 32         | 424                       | 4  | 1  | 8  | 4  | 555                    | 1.31                           |
| 256        | 256       | 16         | 392                       | 4  | 1  | 8  | 8  | 549                    | 1.40                           |
| 512        | 512       | 8          | 104                       | 8  | 1  | 4  | 16 | 549                    | 5.28                           |
| 1024       | 1024      | 4          | 26.5                      | 8  | 1  | 4  | 32 | 536                    | 20.23                          |