
⚡️ Nanotron

Distributed training techniques:

All training was done with Hugging Face's Nanotron library for distributed training, which supports data parallelism, tensor parallelism, and pipeline parallelism.

  1. Data parallelism: Data parallelism was set to dp=2 across 2 A100 GPUs while keeping tensor parallelism and pipeline parallelism at 1.
    1. ddp_bucket_cap_size of 25 MB
    2. sequence_length of 256
    3. train_steps of 213 for 1 epoch of training
    4. batch_accumulation_per_replica of 1
    5. micro_batch_size of 1
    6. Additional optimizations: recomputing of layers, no accumulation of gradients in fp32, no caching of attention computation
  2. Tensor parallelism: Tensor parallelism was set to tp=2 across 2 A100 GPUs while keeping data parallelism and pipeline parallelism at 1.
    1. tp_linear_async_communication enabled
    2. tp_recompute_allgather enabled
    3. tp_mode used is reduce-scatter
    4. sequence_length of 256
    5. train_steps of 426 for 1 epoch of training
    6. batch_accumulation_per_replica of 1
    7. micro_batch_size of 1
    8. Additional optimizations: recomputing of layers, no accumulation of gradients in fp32, no caching of attention computation
  3. Pipeline parallelism: Pipeline parallelism was set to pp=2 across 2 A100 GPUs while keeping data parallelism and tensor parallelism at 1.
    1. pp_engine used is 1f1b for overlapping computation and communication
    2. sequence_length of 256
    3. train_steps of 426 for 1 epoch of training
    4. batch_accumulation_per_replica of 1
    5. micro_batch_size of 1
    6. Additional optimizations: recomputing of layers, no accumulation of gradients in fp32, no caching of attention computation
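
The settings above live in the `parallelism` block of a Nanotron YAML config. A minimal sketch for the tensor-parallel run, assuming Nanotron's usual field names and value spellings (in particular `REDUCE_SCATTER` for the reduce-scatter mode); check your config schema if they differ:

```yaml
# parallelism section of config_resume_training.yaml (tensor-parallel run)
parallelism:
  dp: 1                                 # data parallelism
  tp: 2                                 # tensor parallelism across the 2 A100s
  pp: 1                                 # pipeline parallelism
  pp_engine: 1f1b                       # one-forward-one-backward pipeline schedule
  tp_mode: REDUCE_SCATTER               # the reduce-scatter tp_mode listed above
  tp_linear_async_communication: true
  tp_recompute_allgather: true
```

For the data-parallel and pipeline-parallel runs, swap the dp/tp/pp values accordingly (dp=2 or pp=2, with the other two set to 1).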

Training performance and evaluation results:

  1. Data parallelism: 1 epoch
    1. Time per epoch: ~6 minutes
    2. Perplexity: ~44
    3. Other stats: consumed_tokens: 109K, time_per_iteration_ms: 1.71K, tokens_per_sec: 299, tokens_per_sec_per_gpu: 150, global_batch_size: 512
  2. Tensor parallelism: 1 epoch
    1. Time per epoch: ~9 minutes
    2. Perplexity: ~43
    3. Other stats: consumed_tokens: 109K, time_per_iteration_ms: 1.51K, tokens_per_sec: 170, tokens_per_sec_per_gpu: 84.8, global_batch_size: 256
  3. Pipeline parallelism: 1 epoch
    1. Time per epoch: ~8 minutes
    2. Perplexity: ~44
    3. Other stats: consumed_tokens: 54.5K, time_per_iteration_ms: 1.12K, tokens_per_sec: 229, tokens_per_sec_per_gpu: 114, global_batch_size: 256
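
The reported stats line up if `global_batch_size` is counted in tokens, i.e. dp × micro_batch_size × batch_accumulation_per_replica × sequence_length. A quick sanity check of that interpretation (the interpretation itself is an assumption, not stated by Nanotron's logs here):

```python
# Sanity-check the reported stats, assuming global_batch_size is in tokens.
seq_len = 256

# Data-parallel run: dp=2, micro_batch_size=1, grad accumulation=1, 213 steps.
gbs_dp = 2 * 1 * 1 * seq_len   # tokens per optimizer step
print(gbs_dp)                  # 512, matching the reported global_batch_size
print(gbs_dp * 213)            # 109056, ~ the reported 109K consumed_tokens

# Tensor-parallel run: dp=1, micro_batch_size=1, grad accumulation=1, 426 steps.
gbs_tp = 1 * 1 * 1 * seq_len
print(gbs_tp)                  # 256, matching the reported global_batch_size
print(gbs_tp * 426)            # 109056, ~ the reported 109K consumed_tokens
```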

Installation

To run the code in this project, first create a Conda environment from the environment.yml file, which installs all dependencies listed there.

Alternatively, install the packages from the original Nanotron installation guide directly:

```shell
pip install torch --index-url https://download.pytorch.org/whl/cu124
pip install datasets transformers datatrove[io] numba wandb
pip install ninja triton "flash-attn>=2.5.0" --no-build-isolation
```
Next, log into your Hugging Face and Weights and Biases accounts as follows:

```shell
huggingface-cli login
wandb login
```

Quick Start

In config_resume_training.yaml, replace tokenizer_name_or_path with the path to your original Llama 3.2 3B folder, and replace resume_checkpoint_path with the folder containing your Llama model converted via the examples/llama/convert_hf_to_nanotron.py script.

The following command trains the Llama model on a single node with 2 A100 GPUs:

```shell
CUDA_DEVICE_MAX_CONNECTIONS=1 torchrun --nproc_per_node=2 run_train.py --config-file config_resume_training.yaml
```

The model will be saved in the checkpoints directory as specified in the config file.

Set the config_resume_training.yaml configuration as follows:

Data parallelism:
- train_steps: 213
- dp: 2, tp: 1, pp: 1

Tensor parallelism:
- train_steps: 426
- dp: 1, tp: 2, pp: 1

Pipeline parallelism:
- train_steps: 426
- dp: 1, tp: 1, pp: 2
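
The train_steps values follow from keeping the per-epoch token budget fixed: doubling dp doubles the tokens consumed per step, so the step count halves. A sketch of that arithmetic, assuming an epoch budget of roughly 109K tokens (inferred from the consumed_tokens stats above):

```python
# Steps needed for one epoch under a fixed token budget.
epoch_tokens = 109_056    # ~109K consumed_tokens reported above
seq_len = 256
micro_batch_size = 1
grad_accum = 1

def train_steps(dp: int) -> int:
    """Tokens per optimizer step = dp * micro_batch_size * grad_accum * seq_len."""
    tokens_per_step = dp * micro_batch_size * grad_accum * seq_len
    return epoch_tokens // tokens_per_step

print(train_steps(dp=2))  # 213 -> the data-parallel config
print(train_steps(dp=1))  # 426 -> the tensor- and pipeline-parallel configs
```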