# ⚡️ Nanotron
## Distributed training techniques

All training was done using the Hugging Face Nanotron library for distributed training, which supports data parallelism, tensor parallelism, and pipeline parallelism.

- **Data parallelism**: set to dp=2 across 2 A100 GPUs while keeping tensor parallelism and pipeline parallelism at 1.
  - ddp_bucket_cap_size of 25 MB
  - sequence_length of 256
  - train_steps of 213 for 1 epoch of training
  - batch_accumulation_per_replica of 1
  - micro_batch_size of 1
  - Additional optimizations: recomputation of layers, no accumulation of gradients in fp32, no caching of attention computation
- **Tensor parallelism**: set to tp=2 across 2 A100 GPUs while keeping data parallelism and pipeline parallelism at 1.
  - tp_linear_async_communication enabled
  - tp_recompute_allgather enabled
  - tp_mode used is reduce-scatter
  - sequence_length of 256
  - train_steps of 426 for 1 epoch of training
  - batch_accumulation_per_replica of 1
  - micro_batch_size of 1
  - Additional optimizations: recomputation of layers, no accumulation of gradients in fp32, no caching of attention computation
- **Pipeline parallelism**: set to pp=2 across 2 A100 GPUs while keeping data parallelism and tensor parallelism at 1.
  - pp_engine used is 1f1b, which overlaps computation and communication
  - sequence_length of 256
  - train_steps of 426 for 1 epoch of training
  - batch_accumulation_per_replica of 1
  - micro_batch_size of 1
  - Additional optimizations: recomputation of layers, no accumulation of gradients in fp32, no caching of attention computation
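The different train_steps values follow from the per-step token budget: each step consumes dp × micro_batch_size × batch_accumulation_per_replica × sequence_length tokens, so the dp=2 run needs half as many steps per epoch as the dp=1 runs. A quick sanity check, assuming the ~109K-token epoch reported in the training stats:

```python
import math

EPOCH_TOKENS = 109_000  # ~109K consumed tokens per epoch (approximate, from the reported stats)
SEQ_LEN = 256

def steps_per_epoch(dp, micro_batch_size=1, grad_accum=1):
    """Steps needed to cover one epoch given the per-step token budget."""
    tokens_per_step = dp * micro_batch_size * grad_accum * SEQ_LEN
    return math.ceil(EPOCH_TOKENS / tokens_per_step)

print(steps_per_epoch(dp=2))  # dp=2 run -> 213 steps
print(steps_per_epoch(dp=1))  # tp=2 and pp=2 runs (dp=1) -> 426 steps
```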
## Training performance and evaluation results

All three configurations were trained for 1 epoch:

| Configuration | Time per epoch | Perplexity | consumed_tokens | time_per_iteration_ms | tokens_per_sec | tokens_per_sec_per_gpu | global_batch_size |
|---|---|---|---|---|---|---|---|
| Data parallelism (dp=2) | ~6 min | ~44 | 109K | 1.71K | 299 | 150 | 512 |
| Tensor parallelism (tp=2) | ~9 min | ~43 | 109K | 1.51K | 170 | 84.8 | 256 |
| Pipeline parallelism (pp=2) | ~8 min | ~44 | 54.5K | 1.12K | 229 | 114 | 256 |
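As a sanity check on the stats above, the reported tokens_per_sec figures are consistent with global_batch_size being measured in tokens per iteration (an inference from the numbers, not something the logs state explicitly):

```python
# Sanity check: tokens_per_sec ≈ global_batch_size (in tokens) / iteration time (in seconds).
runs = {
    "dp=2": {"global_batch_size": 512, "time_per_iteration_ms": 1710},
    "tp=2": {"global_batch_size": 256, "time_per_iteration_ms": 1510},
    "pp=2": {"global_batch_size": 256, "time_per_iteration_ms": 1120},
}
for name, r in runs.items():
    tokens_per_sec = r["global_batch_size"] / (r["time_per_iteration_ms"] / 1000)
    print(f"{name}: {tokens_per_sec:.0f} tokens/sec")  # -> 299, 170, 229
```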
## Installation

To run the code in this project, first create a Conda environment from the environment.yml file, which installs all of the listed dependencies. The packages below come from the original Nanotron installation guide:

```shell
pip install torch --index-url https://download.pytorch.org/whl/cu124
pip install datasets transformers datatrove[io] numba wandb
pip install ninja triton "flash-attn>=2.5.0" --no-build-isolation
```
Next, log into your Hugging Face and Weights and Biases accounts as follows:
```shell
huggingface-cli login
wandb login
```
## Quick Start
In config_resume_training.yaml, replace tokenizer_name_or_path with the path to your original Llama 3.2 3B folder, and set resume_checkpoint_path to the folder containing your Llama model converted with the examples/llama/convert_hf_to_nanotron.py script.
The following command will train the Llama model on a single node with 2 × A100 GPUs:

```shell
CUDA_DEVICE_MAX_CONNECTIONS=1 torchrun --nproc_per_node=2 run_train.py --config-file config_resume_training.yaml
```
The model will be saved in the checkpoints directory as specified in the config file.
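The checkpoint and resume paths both live in the config file. A minimal sketch of the relevant section, following the layout of Nanotron's example configs (paths are placeholders; verify field names against your own config file):

```yaml
checkpoints:
  checkpoints_path: checkpoints                     # where new checkpoints are written
  resume_checkpoint_path: ./llama-3.2-3b-nanotron   # output folder of convert_hf_to_nanotron.py (placeholder)
```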
Set the config_resume_training.yaml configurations as follows:

- Data parallelism:
  - train_steps: 213
  - dp: 2, tp: 1, pp: 1
- Tensor parallelism:
  - train_steps: 426
  - dp: 1, tp: 2, pp: 1
- Pipeline parallelism:
  - train_steps: 426
  - dp: 1, tp: 1, pp: 2
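These settings map onto the parallelism section of the Nanotron config. A sketch for the data-parallel run, with field names taken from Nanotron's example configs (check them against your config file before use):

```yaml
parallelism:
  dp: 2
  tp: 1
  pp: 1
  pp_engine: 1f1b                      # one-forward-one-backward pipeline schedule
  tp_mode: REDUCE_SCATTER              # tensor-parallel communication mode
  tp_linear_async_communication: true
```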