license: apache-2.0
datasets:
- cerebras/SlimPajama-627B
- bigcode/starcoderdata
language:
- en
https://github.com/jzhang38/TinyLlama
The TinyLlama project aims to pretrain a 1.1B-parameter Llama model on 3 trillion tokens. With some proper optimization, we can achieve this within a span of "just" 90 days using 16 A100-40G GPUs. Training started on 2023-09-01.
We adopted exactly the same architecture and tokenizer as Llama 2, so TinyLlama can be used as a drop-in replacement in many open-source projects built on Llama. In addition, with only 1.1B parameters, TinyLlama is compact enough for the many applications that must run within a restricted compute and memory footprint.
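As a rough sanity check on the 1.1B figure, the sketch below counts the weights of a Llama-style decoder from the values in the "Model Size" row of the Training Details table. Several inputs are assumptions that this card does not state: a vocabulary of 32000 entries (the size of the Llama 2 tokenizer), untied input and output embeddings, no bias terms, and one weight vector per normalization layer. The function name `llama_param_count` is hypothetical.

```python
def llama_param_count(layers, heads, kv_groups, hidden, intermediate, vocab, tied=False):
    """Approximate weight count of a Llama-style decoder with grouped query attention."""
    head_dim = hidden // heads
    kv_dim = kv_groups * head_dim                       # K/V width shrinks with fewer query groups
    attn = 2 * hidden * hidden + 2 * hidden * kv_dim    # Q and O projections, then K and V
    mlp = 3 * hidden * intermediate                     # SwiGLU: gate, up and down projections
    norms = 2 * hidden                                  # two norm weight vectors per layer
    per_layer = attn + mlp + norms
    embed = vocab * hidden                              # input embedding table
    lm_head = 0 if tied else vocab * hidden             # assumption: untied output projection
    return layers * per_layer + embed + lm_head + hidden  # trailing `hidden` = final norm

# Values from the Training Details table; vocab=32000 is an assumption (Llama 2 tokenizer).
total = llama_param_count(layers=22, heads=32, kv_groups=4,
                          hidden=2048, intermediate=5632, vocab=32000)
print(f"{total:,} parameters (~{total / 1e9:.2f}B)")
```

Under these assumptions the count lands very close to 1.1B, and most of it comes from the per-layer SwiGLU projections rather than from attention, which grouped query attention keeps small.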
Release Schedule
We will roll out intermediate checkpoints according to the schedule below. Some baseline models are included for comparison.
Date | HF Checkpoint | Tokens | Step | HellaSwag Acc_norm |
---|---|---|---|---|
Baseline | StableLM-Alpha-3B | 800B | -- | 38.31 |
Baseline | Pythia-1B-intermediate-step-50k-105b | 105B | 50k | 42.04 |
Baseline | Pythia-1B | 300B | 143k | 47.16 |
2023-09-04 | TinyLlama-1.1B-intermediate-step-50k-105b | 105B | 50k | 43.50 |
2023-09-16 | -- | 500B | -- | -- |
2023-10-01 | -- | 1T | -- | -- |
2023-10-16 | -- | 1.5T | -- | -- |
2023-10-31 | -- | 2T | -- | -- |
2023-11-15 | -- | 2.5T | -- | -- |
2023-12-01 | -- | 3T | -- | -- |
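So far, TinyLlama is progressing well: at 105B tokens its HellaSwag Acc_norm (43.50) is ahead of the Pythia-1B checkpoint trained on the same number of tokens (42.04).
Meanwhile, you can track the live cross-entropy loss here.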
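The schedule above implies a steady training pace. The following sketch, which assumes constant throughput with no downtime (an idealization, not something this card claims), estimates the per-GPU throughput needed to reach 3T tokens by the last scheduled date and compares each milestone with a constant-pace projection. All variable names are illustrative.

```python
from datetime import date

# Figures taken from this card.
TOTAL_TOKENS = 3e12
START = date(2023, 9, 1)
NUM_GPUS = 16
SCHEDULE = [                      # (release date, tokens seen)
    (date(2023, 9, 16), 0.5e12),
    (date(2023, 10, 1), 1.0e12),
    (date(2023, 10, 16), 1.5e12),
    (date(2023, 10, 31), 2.0e12),
    (date(2023, 11, 15), 2.5e12),
    (date(2023, 12, 1), 3.0e12),
]

end = SCHEDULE[-1][0]
days = (end - START).days
tokens_per_day = TOTAL_TOKENS / days
tokens_per_gpu_second = tokens_per_day / 86_400 / NUM_GPUS

print(f"{days} days, {tokens_per_day / 1e9:.1f}B tokens/day, "
      f"{tokens_per_gpu_second:,.0f} tokens/s per GPU")

# Compare each milestone with what a constant pace would have produced by then.
for when, planned in SCHEDULE:
    projected = tokens_per_day * (when - START).days
    print(f"{when}: planned {planned / 1e12:.1f}T, constant-pace {projected / 1e12:.2f}T")
```

The per-GPU number is only a lower bound on the sustained throughput required, since any restarts or evaluation pauses would have to be made up elsewhere.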
Training Details
Below are some details of our training setup:
Setting | Description |
---|---|
Parameters | 1.1B |
Attention Variant | Grouped Query Attention |
Model Size | Layers: 22, Heads: 32, Query Groups: 4, Embedding Size: 2048, Intermediate Size (SwiGLU): 5632 |
Sequence Length | 2048 |
Batch Size | 2 million tokens (2048 * 1024) |
Learning Rate | 4e-4 |
Learning Rate Schedule | Cosine with 2000 warmup steps |
Training Data | SlimPajama & Starcoderdata |
Data Preprocessing | Excluded GitHub subset of SlimPajama; Sampled all code from Starcoderdata |
Combined Dataset Size | 1 trillion tokens |
Total Tokens During Training | 3 trillion (3 epochs/1430k steps) |
Natural Language to Code Ratio | 7:3 |
Hardware | 16 A100-40G GPUs |
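
The batch size, step count, and learning rate rows above fit together as shown in the sketch below. It reads "2048 * 1024" as 1024 sequences of 2048 tokens each, and it fills in two details the table does not state, both labeled as assumptions: the warmup is linear from zero, and the cosine decays to a floor of 4e-5. The helper `lr_at` is a hypothetical name, not part of the training code.

```python
import math

# Values from the table above.
SEQ_LEN = 2048
SEQS_PER_BATCH = 1024
TOKENS_PER_STEP = SEQ_LEN * SEQS_PER_BATCH   # 2,097,152, i.e. "2 million tokens"
TOTAL_STEPS = 1_430_000
PEAK_LR = 4e-4
WARMUP_STEPS = 2000

# Assumptions: neither the warmup shape nor the final learning rate is given in the table.
MIN_LR = 4e-5

def lr_at(step):
    """Learning rate at a given optimizer step: linear warmup, then cosine decay."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = min(1.0, (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS))
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1 + math.cos(math.pi * progress))

total_tokens = TOKENS_PER_STEP * TOTAL_STEPS
print(f"tokens per step: {TOKENS_PER_STEP:,}")
print(f"total tokens:    {total_tokens / 1e12:.3f}T")
for step in (0, 1000, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(f"step {step:>9,}: lr = {lr_at(step):.2e}")
```

Multiplying the tokens per step by 1430k steps gives just under 3 trillion tokens, which matches the "Total Tokens During Training" row.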