Pre-train LLaMA on RedPajama

This howto will walk you through setting up the RedPajama dataset and launching the pre-training script.

What's RedPajama

RedPajama is an open-source reproduction of the original LLaMA training dataset.

It contains a total of 1.2 trillion tokens, divided as follows:

Commoncrawl   878B
C4            175B
GitHub         59B
Books          26B
ArXiv          28B
Wikipedia      24B
StackExchange  20B
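As a quick check of the breakdown above, the following sketch sums the per-source counts and prints each source's share of the total. The numbers are copied from the table; the names `TOKENS_B` and `source_shares` are illustrative only.

```python
# Token counts per source, in billions, copied from the table above.
TOKENS_B = {
    "Commoncrawl": 878,
    "C4": 175,
    "GitHub": 59,
    "Books": 26,
    "ArXiv": 28,
    "Wikipedia": 24,
    "StackExchange": 20,
}


def source_shares(tokens):
    """Return each source's fraction of the total token count."""
    total = sum(tokens.values())
    return {name: count / total for name, count in tokens.items()}


if __name__ == "__main__":
    print(f"total: {sum(TOKENS_B.values())}B tokens")
    for name, share in source_shares(TOKENS_B).items():
        print(f"{name:<14}{share:6.1%}")
```

The per-source counts add up to 1210B, which matches the rounded figure of 1.2 trillion tokens.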

The RedPajama repo contains the source code for collecting and preparing the dataset, and it is Apache 2.0 licensed.

The data itself is licensed according to the original licenses under which its individual parts were released. The GitHub datasets are limited to MIT, BSD, or Apache 2.0 repositories.

Along with the full RedPajama-1T dataset, the smaller RedPajama-1T-Sample dataset (1B tokens) is also available for development.

You can download the data using git lfs. For the full dataset:

# Make sure you have git-lfs installed (https://git-lfs.com): git lfs install
git clone https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T data/RedPajama-Data-1T

For the sample dataset:

# Make sure you have git-lfs installed (https://git-lfs.com): git lfs install
git clone https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T-Sample data/RedPajama-Data-1T-Sample
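After cloning, it can be worth checking that the download is complete before moving on. This guide gives the expected number of jsonl files as 2084 for the full dataset and 11 for the sample. The sketch below counts them; it assumes the files carry a `.jsonl` extension and live somewhere under the clone directory, and `count_jsonl_files` is a hypothetical helper, not part of lit-llama.

```python
from pathlib import Path


def count_jsonl_files(root):
    """Count *.jsonl files anywhere under root (recursively)."""
    return sum(1 for p in Path(root).rglob("*.jsonl") if p.is_file())


if __name__ == "__main__":
    # Paths match the git clone targets used in this guide.
    for name in ("data/RedPajama-Data-1T", "data/RedPajama-Data-1T-Sample"):
        if Path(name).is_dir():
            print(name, count_jsonl_files(name))
        else:
            print(name, "not found")
```

A count that differs from the expected number may indicate an interrupted clone or that git-lfs was not installed when cloning.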

Prepare RedPajama for training

The dataset consists of 2084 jsonl files (the sample dataset contains 11). In order to start pre-training lit-llama on it, you need to read, tokenize, and write the data in binary chunks. This will leverage the PackedDataset streaming dataset that comes with lit-llama.

To do so, run

python scripts/prepare_redpajama.py --source_path data/RedPajama-Data-1T --tokenizer_path checkpoints/lit-llama/tokenizer.model --destination_path data/lit-redpajama

for the full dataset, or

python scripts/prepare_redpajama.py --source_path data/RedPajama-Data-1T-Sample --tokenizer_path checkpoints/lit-llama/tokenizer.model --destination_path data/lit-redpajama-sample --sample True

for the sample dataset.

The commands above assume that you are using the same tokenizer as LLaMA, but any trained SentencePiece tokenizer with a vocabulary size of 32000 will work here.

The script will take a while to run, so it's a good time for a cup of tea.
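To give a feel for what "read, tokenize, and write the data in binary chunks" means, here is a minimal standard-library sketch. It is not the PackedDataset format or the code in prepare_redpajama.py. The `"text"` field name, the helper names `iter_jsonl_texts` and `pack_token_ids`, the padding id, and the toy character-level tokenizer in the demo are all assumptions made for illustration.

```python
import json
from array import array


def iter_jsonl_texts(lines, key="text"):
    """Yield the text field of each non-empty JSON line (key is an assumption)."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)[key]


def pack_token_ids(docs, chunk_size, pad_id=0):
    """Concatenate per-document token ids and cut them into fixed-size chunks.

    Each chunk is an array of unsigned 16-bit integers, which is enough for
    a 32000-entry vocabulary. The final partial chunk is padded with pad_id.
    """
    buffer = array("H")
    for ids in docs:
        buffer.extend(ids)
        while len(buffer) >= chunk_size:
            yield buffer[:chunk_size]
            buffer = buffer[chunk_size:]
    if buffer:
        buffer.extend([pad_id] * (chunk_size - len(buffer)))
        yield buffer


if __name__ == "__main__":
    lines = ['{"text": "hello"}', '{"text": "world!"}']
    # Toy stand-in for a real tokenizer: one id per character.
    docs = ([ord(c) for c in text] for text in iter_jsonl_texts(lines))
    for chunk in pack_token_ids(docs, chunk_size=4):
        print(chunk.tolist())
```

In a real pipeline each chunk would be written to disk (for instance with `chunk.tofile(f)` on a file opened in binary mode) so that training can stream fixed-size blocks without re-tokenizing.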

Pre-training

Running the pre-training script requires at least 4 GPUs with 40GB+ of memory each (for example, A100s). To run on the full dataset:

python pretrain/redpajama.py --devices 4 --train_data_dir data/lit-redpajama

For running on the sample dataset:

python pretrain/redpajama.py --devices 4 --train_data_dir data/lit-redpajama-sample

The script will save checkpoints periodically to the folder out/.

The pretrain/redpajama.py script will pre-train the LLaMA 7B model with FSDP in bfloat16 precision and gradient accumulation.

You can easily change the size of the model by passing a different string to

config = LLaMAConfig.from_name("7B")

in the main function.

Keep in mind that the original LLaMA training for the 7B model required 83k A100 80GB hours, so you'll need access to a cluster.
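To put that figure in perspective, the sketch below converts a GPU-hour budget into wall-clock days for a given number of GPUs. It assumes identical GPUs and perfect linear scaling, which real clusters do not achieve, so treat the result as a lower bound; `wall_clock_days` is an illustrative name.

```python
def wall_clock_days(gpu_hours, num_gpus):
    """Days of wall-clock time if gpu_hours are spread evenly over num_gpus."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    return gpu_hours / num_gpus / 24


if __name__ == "__main__":
    budget = 83_000  # A100 80GB hours quoted above for the 7B model
    for gpus in (4, 64, 512):
        print(f"{gpus:>4} GPUs: {wall_clock_days(budget, gpus):8.1f} days")
```

With the 4 GPUs that the script needs as a minimum, the same budget works out to well over two years of continuous training, which is why a cluster is needed for a full run.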

Once you have access to a cluster, you can launch the script across machines by following the multi-node instructions for your cluster type.

The script contains several configurations and hyperparameters you can tweak:

out_dir = "out/training"
save_interval = 1000
eval_interval = 1000
eval_iters = 100
log_interval = 1

# Hyperparameters
learning_rate = 6e-4
batch_size = 125
micro_batch_size = 5
max_iters = 600000  # num_epochs * (epoch_size // micro_batch_size) // devices
weight_decay = 1e-1
beta1 = 0.9
beta2 = 0.95
grad_clip = 1.0
decay_lr = True
warmup_iters = 2000
lr_decay_iters = max_iters
min_lr = 6e-5
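The learning-rate settings above (learning_rate, warmup_iters, lr_decay_iters, min_lr) fit a schedule with a linear warmup followed by a decay to a floor. The sketch below shows one common choice, linear warmup then cosine decay. This is an assumption about the shape of the schedule, not a copy of the script's code, and `get_lr` is an illustrative name.

```python
import math

# Values copied from the hyperparameters above.
learning_rate = 6e-4
warmup_iters = 2000
max_iters = 600000
lr_decay_iters = max_iters
min_lr = 6e-5


def get_lr(it):
    """Learning rate at iteration `it`: linear warmup, then cosine decay."""
    if it < warmup_iters:
        # Ramp linearly from 0 up to learning_rate.
        return learning_rate * it / warmup_iters
    if it > lr_decay_iters:
        # Past the decay window, stay at the floor.
        return min_lr
    # Cosine interpolation from learning_rate down to min_lr.
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (learning_rate - min_lr)


if __name__ == "__main__":
    for it in (0, 1000, 2000, 300000, 600000):
        print(it, get_lr(it))
```

Under this sketch the rate peaks at learning_rate when warmup ends and reaches min_lr at lr_decay_iters, so setting lr_decay_iters equal to max_iters spreads the decay over the whole run.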

In particular, micro_batch_size should be adjusted so that the process makes full use of the available GPU memory without exceeding it.
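Changing micro_batch_size also changes how many forward and backward passes are accumulated before each optimizer step. The sketch below assumes that batch_size is the number of samples per optimizer step and that gradient accumulation covers the gap between the two values; `accumulation_steps` is a hypothetical helper, not a function from the script.

```python
def accumulation_steps(batch_size, micro_batch_size):
    """Number of micro-batches accumulated per optimizer step."""
    if micro_batch_size <= 0:
        raise ValueError("micro_batch_size must be positive")
    if batch_size % micro_batch_size != 0:
        raise ValueError("batch_size must be a multiple of micro_batch_size")
    return batch_size // micro_batch_size


if __name__ == "__main__":
    batch_size = 125  # value from the hyperparameters above
    for micro in (1, 5, 25):
        print(f"micro_batch_size={micro}: "
              f"{accumulation_steps(batch_size, micro)} accumulation steps")
```

Picking a micro_batch_size that divides batch_size evenly keeps the effective batch size unchanged while trading memory use against the number of accumulation steps.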

Finally, logging is kept minimal in the script. To use a particular logger, please refer to https://lightning.ai/docs/fabric/stable/api/loggers.html or call a logging client library like wandb directly.