Reducing Memory Usage

Section under construction. Feel free to contribute!

Truncation

Sequence lengths in the dataset can vary widely, and by default, TRL does not modify the data. When data is batched, sequences are padded to match the longest one in the batch, which can cause high memory usage, even if most sequences are relatively short.
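As a rough illustration of the padding cost described above, the sketch below pads a small batch to its longest sequence and counts the resulting tokens. The `pad_batch` helper, the `PAD_ID` value, and the sequence lengths are hypothetical and chosen only for illustration; this is not how TRL collates batches internally.

```python
PAD_ID = 0  # hypothetical padding token id, for illustration only


def pad_batch(batch, pad_id=PAD_ID):
    """Right-pad every sequence to the length of the longest one in the batch."""
    longest = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (longest - len(seq)) for seq in batch]


# Hypothetical batch: three short sequences and one long one (dummy token ids).
batch = [[1] * 12, [1] * 15, [1] * 9, [1] * 480]
padded = pad_batch(batch)

real_tokens = sum(len(seq) for seq in batch)
total_tokens = sum(len(seq) for seq in padded)
print(f"real tokens: {real_tokens}, tokens after padding: {total_tokens}")
```

In this made-up batch a single long sequence forces every other row to grow to 480 tokens, so most of the padded batch carries no information.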

To reduce memory usage, truncate sequences to a reasonable length. Even discarding a few tokens from the longest sequences can save a significant amount of memory, because it reduces the padding added to every other sequence in the batch. Truncation is good practice and should always be applied. The limit doesn’t need to be overly restrictive, but setting a sensible value is essential for efficient use of resources.
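The effect of a truncation limit can be sketched in plain Python. The `truncate` and `padded_size` helpers, the batch contents, and the limit of 64 tokens are all assumptions made for this example, not part of TRL.

```python
def truncate(token_ids, max_length):
    """Keep at most the first `max_length` tokens of a sequence."""
    return token_ids[:max_length]


def padded_size(batch):
    """Number of token slots the batch occupies once padded to its longest row."""
    return len(batch) * max(len(seq) for seq in batch)


# Hypothetical batch with one outlier of 480 tokens (token ids are dummies).
batch = [list(range(n)) for n in (12, 15, 9, 480)]
truncated = [truncate(seq, 64) for seq in batch]

print(padded_size(batch))      # 4 rows * 480 tokens
print(padded_size(truncated))  # 4 rows * 64 tokens
```

Only the outlier loses tokens here; the three short sequences are untouched, yet the padded batch becomes several times smaller.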

DPO

SFT

Packing

This technique applies only to SFT.

Truncation has two main drawbacks:

Loss of information: Key data at the end of a sequence may be discarded.
Choosing truncation length: Too short loses data; too long undermines efficiency.

Packing, introduced in Raffel et al., 2020, addresses these issues by grouping sequences instead of truncating them. It concatenates the dataset sequences and splits the result into chunks of the desired length.
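The concatenate-and-split idea can be sketched as follows. This is a simplified illustration under stated assumptions, not TRL's actual implementation: the `pack` helper is hypothetical, no separator token is inserted between sequences, and the final chunk is simply left shorter than the others.

```python
from itertools import chain


def pack(sequences, seq_length):
    """Concatenate all sequences, then split the stream into fixed-length chunks."""
    stream = list(chain.from_iterable(sequences))
    return [stream[i:i + seq_length] for i in range(0, len(stream), seq_length)]


# Hypothetical tokenized dataset with sequences of different lengths.
sequences = [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10], [11]]
packed = pack(sequences, seq_length=4)
print(packed)  # [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11]]
```

Every token from the input appears exactly once in the output, and every chunk except possibly the last is full, so no padding is needed.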

Packing eliminates padding, preserves all sequence information, and allows for flexible sequence lengths, making it a more efficient alternative to truncation. To enable packing, use packing=True in the SFTConfig:

from trl import SFTConfig

training_args = SFTConfig(..., packing=True, max_seq_length=512)

Packing may cause batch contamination, where adjacent sequences influence one another. This can be problematic for some applications. For more details, see #1230.
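To see where contamination can come from, the sketch below records which source sequence each position of a packed chunk originated from. It assumes a naive concatenate-and-split packing scheme; the `pack_with_sources` helper and the data are hypothetical and do not reflect TRL's internals.

```python
def pack_with_sources(sequences, seq_length):
    """For naive concatenate-and-split packing, return per-chunk source indices."""
    # One entry per token: the index of the sequence that token came from.
    sources = [idx for idx, seq in enumerate(sequences) for _ in seq]
    return [sources[i:i + seq_length] for i in range(0, len(sources), seq_length)]


# Hypothetical tokenized dataset with sequences of different lengths.
sequences = [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10], [11]]
for chunk in pack_with_sources(sequences, seq_length=4):
    label = "mixed sources" if len(set(chunk)) > 1 else "single source"
    print(chunk, label)
```

Any chunk labeled as mixed contains tokens from unrelated examples in the same training row, which is the situation in which adjacent sequences can influence one another.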
