|
# Megatron-11b |
|
|
|
Megatron-11b is a unidirectional language model with `11B` parameters based on [Megatron-LM](https://arxiv.org/pdf/1909.08053.pdf). Following the original Megatron work, we trained the model using intra-layer model parallelism with each layer's parameters split across 8 GPUs. |
|
|
|
Megatron-11b is trained on the same data and uses the same byte-pair encoding (BPE) as [RoBERTa](https://arxiv.org/pdf/1907.11692.pdf). |
|
|
|
## Pre-trained models |
|
|
|
Model | Description | # params | # filesize | Download
---|---|---|---|---
`megatron_11b` | megatron_11b unidirectional language model | 11B | 19GB | [megatron_11b.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/model_parallel/megatron_11b.tar.gz)
|
|
|
#### Architecture: |
|
|
|
Param | Value
---|---
embed_dim | 3072
ffn_dim | 3072 * 6
layers | 72
attention heads | 32
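
As a sanity check, a rough back-of-envelope count for this configuration lands near 11B parameters. The sketch below is an approximation only: it ignores biases, layer norms, and positional embeddings, and assumes the ~50k-entry GPT-2/RoBERTa BPE vocabulary.

```python
# Approximate parameter count for the architecture above.
# Assumptions: no biases/layer norms counted, vocab size of 50257
# (the GPT-2/RoBERTa BPE vocabulary), shared input/output embedding.
embed_dim = 3072
ffn_dim = 3072 * 6
layers = 72
vocab_size = 50257  # assumed

attn_params = 4 * embed_dim * embed_dim   # Q, K, V and output projections
ffn_params = 2 * embed_dim * ffn_dim      # FFN up- and down-projections
embed_params = vocab_size * embed_dim     # shared embedding matrix

total = layers * (attn_params + ffn_params) + embed_params
print(f"~{total / 1e9:.2f}B parameters")  # ~11.03B
```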
|
|
|
#### Training details: |
|
|
|
Param | Value
---|---
bsz | 512
num_updates | 300,000
peak_lr | 1.5e-04
lr scheduler | inverse_sqrt
clip norm | 0.0
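
For reference, the `inverse_sqrt` schedule warms the learning rate up linearly and then decays it proportionally to the inverse square root of the update number. The sketch below uses the peak LR from the table and the 3,000 warmup updates from the training command further down; it assumes the warmup starts from 0 (fairseq exposes this as a configurable warmup-init value), so treat it as an illustration rather than fairseq's exact implementation.

```python
# Illustrative inverse_sqrt learning-rate schedule (assumes warmup from 0).
def inverse_sqrt_lr(step, peak_lr=1.5e-4, warmup_updates=3000):
    if step < warmup_updates:
        return peak_lr * step / warmup_updates               # linear warmup
    return peak_lr * (warmup_updates ** 0.5) * step ** -0.5  # 1/sqrt(step) decay

for step in (1500, 3000, 30_000, 300_000):
    print(f"update {step:>6d}: lr = {inverse_sqrt_lr(step):.2e}")
```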
|
|
|
|
|
## Example training command (model parallel) |
|
|
|
Megatron-11b contains too many parameters to train on a single GPU. Following
the original Megatron work, we adopt an intra-layer model parallel training
approach in which each layer's parameters are split across multiple GPUs, with
activations and gradients communicated during the forward and backward passes,
respectively. We similarly split the loss computation using the
`vocab_parallel_cross_entropy` criterion.
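
To make the idea concrete, here is a single-process sketch of intra-layer (tensor) model parallelism for one linear projection: the weight matrix is split column-wise into 8 shards, each shard produces a partial output, and the partials are concatenated (an all-gather across GPUs in real training). This is only an illustration of the technique, not fairseq's Megatron code path.

```python
# Single-process illustration of intra-layer model parallelism.
# In real training each shard lives on a different GPU and the final
# concatenation is an all-gather; here everything runs on one device.
import torch

torch.manual_seed(0)
model_parallel_size = 8
embed_dim, ffn_dim = 3072, 3072 * 6

x = torch.randn(2, embed_dim)                    # toy batch of activations
w = torch.randn(ffn_dim, embed_dim) * 0.02       # full FFN input projection

shards = w.chunk(model_parallel_size, dim=0)     # split output features across "GPUs"
partials = [x @ shard.t() for shard in shards]   # each rank computes its slice
y = torch.cat(partials, dim=-1)                  # gather the slices back together

assert torch.allclose(y, x @ w.t(), atol=1e-4)   # matches the unsplit layer
```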
|
|
|
The following training command illustrates how to do model parallel training in
fairseq. We assume that each machine (node) has 8 GPUs among which to split the
model parameters (`--model-parallel-size 8`). If you have access to multiple
nodes, you may combine this with data parallel training by increasing
`--distributed-world-size`.
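
The relationship between the two flags is simple arithmetic: the number of data-parallel replicas is the world size divided by the model-parallel size. A hypothetical two-node example:

```python
# Illustrative arithmetic only (not a fairseq API call): with 2 nodes of
# 8 GPUs each and the model split across 8 GPUs, there are 2 data-parallel
# replicas of the 11B model.
world_size = 16          # --distributed-world-size (2 nodes x 8 GPUs, hypothetical)
model_parallel_size = 8  # --model-parallel-size
replicas = world_size // model_parallel_size
print(replicas)          # -> 2
```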
|
|
|
To train Megatron-11b on a single node:

```bash
fairseq-train <DATA_PATH> \
  --distributed-world-size 8 \
  --memory-efficient-fp16 \
  --num-workers 2 \
  --model-parallel-size 8 \
  --criterion vocab_parallel_cross_entropy \
  --task language_modeling \
  --sample-break-mode none \
  --tokens-per-sample 1024 \
  --arch transformer_lm_megatron_11b \
  --share-decoder-input-output-embed \
  --optimizer adam --adam-betas "(0.9, 0.98)" --adam-eps 1e-08 --clip-norm 0.0 \
  --lr-scheduler inverse_sqrt --lr 0.00015 \
  --warmup-updates 3000 --weight-decay 0.01 \
  --dropout 0.1 --attention-dropout 0.1 \
  --batch-size 2 \
  --max-update 300000;
```
|
|
|
Note: the above was tested on a `DGX-1` box with 8 x `V100-32GB` GPUs.
|
|
|
## Results |
|
|
|
**[Wikitext103](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/)** |
|
|
|
Model | Valid perplexity | Test perplexity
---|---|---
`megatron_11b` | 10.64 | 10.54
|
|
|
|
|
## Evaluating `megatron_11b` on Wikitext-103 |
|
|
|
#### 1. Downloading Megatron-11b
```bash
# WARNING: this file is 19GB
wget https://dl.fbaipublicfiles.com/fairseq/models/model_parallel/megatron_11b.tar.gz
tar -xzvf megatron_11b.tar.gz
```
|
|
|
#### 2. Download Wikitext-103
```bash
wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip
unzip wikitext-103-raw-v1.zip
```
|
|
|
#### 3. Detokenize test tokens
Megatron-11b uses a byte-level BPE that expects raw (untokenized) input. Since
the wikitext-103 dataset comes tokenized, we apply a simple detokenization
process to restore the untokenized test set:

```bash
python -m examples.megatron_11b.detok wikitext-103-raw/wiki.test.raw > wikitext-103-raw/wiki.test.detok
```
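
If you want to see what this step amounts to, the sketch below is a minimal stand-in that detokenizes with the Moses detokenizer from `sacremoses`. It is an assumption for illustration only; the bundled `examples.megatron_11b.detok` script is what is actually run above and may differ in detail.

```python
# Minimal detokenization sketch (assumes `pip install sacremoses`).
# Reads a tokenized file given on the command line and writes raw text
# to stdout, mirroring how the detok command above is invoked.
import sys

from sacremoses import MosesDetokenizer

detok = MosesDetokenizer(lang="en")
with open(sys.argv[1], encoding="utf-8") as f:
    for line in f:
        print(detok.detokenize(line.strip().split()))
```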
|
|
|
#### 4. BPE encoding
```bash
wget -N 'https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/encoder.json'
wget -N 'https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/vocab.bpe'

python -m examples.roberta.multiprocessing_bpe_encoder \
  --encoder-json encoder.json \
  --vocab-bpe vocab.bpe \
  --inputs "wikitext-103-raw/wiki.test.detok" \
  --outputs "wikitext-103-raw/wiki.test.bpe" \
  --workers 60;
```
|
|
|
#### 5. Fairseq binarize
```bash
fairseq-preprocess \
  --only-source \
  --testpref wikitext-103-raw/wiki.test.bpe \
  --srcdict megatron_11b/dict.txt \
  --destdir wikitext103-bin;
```
|
|
|
#### 6. Evaluating perplexity
We can now evaluate perplexity on the test set. Note that because we've modified
the test set (via detokenization and BPE), the perplexity reported by
`fairseq-eval-lm` needs to be renormalized.
|
|
|
Compute unnormalized perplexity:

```bash
DATA_PATH=wikitext103-bin/
fairseq-eval-lm \
  $DATA_PATH \
  --path megatron_11b/model.pt \
  --task language_modeling \
  --gen-subset test \
  --batch-size 8 \
  --criterion cross_entropy \
  --context-window 992 \
  --distributed-world-size 8 \
  --model-parallel-size 8;
# Expected PPL (unnormalized_ppl): [8.46]
# Note: the eval command needs to run on 8 GPUs for the released model
```
|
To renormalize the perplexity, we must account for the change in token count
after detokenizing and applying BPE. The formula for this is:
`2 ^ ( log_2(unnormalized_PPL) * (new_token_cnt / orig_token_cnt))`

For the wikitext-103 test set, the original token count is `245566` and the
token count after detokenization and BPE is `270847`.

The perplexity after renormalization is:
`2 ^ ( log_2(8.46) * (270847 / 245566)) = 10.54`
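
The same calculation, spelled out so it can be rerun with a different unnormalized perplexity:

```python
# Renormalize the perplexity reported by fairseq-eval-lm to the original
# wikitext-103 tokenization (token counts taken from the text above).
import math

unnormalized_ppl = 8.46
orig_token_cnt = 245566   # wikitext-103 test tokens, original tokenization
new_token_cnt = 270847    # tokens after detokenization + GPT-2 BPE

renormalized = 2 ** (math.log2(unnormalized_ppl) * (new_token_cnt / orig_token_cnt))
print(f"{renormalized:.2f}")  # -> 10.54
```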
|
|