# Mixture of Experts Language Models

## Dependencies

Our mixture-of-experts implementation depends on [megablocks](https://github.com/stanford-futuredata/megablocks) and a version of xformers that is compatible with torch 2.1:

```
pip install megablocks
pip3 install -U xformers --index-url https://download.pytorch.org/whl/cu121
```
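
A quick, optional sanity check that the dependencies import cleanly in the same environment:

```python
# Optional sanity check: the MoE code path needs megablocks and a torch-compatible xformers.
import torch
import xformers
import megablocks  # noqa: F401  (imported only to verify the installation)

print("torch:", torch.__version__)
print("xformers:", xformers.__version__)
```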

## Train MoE

To train an MoE, add the `--moe-*` arguments to the training command:

```
torchrun --nproc-per-node 8 -m open_lm.main \
    --train-num-samples 10000000000 \
    --workers 2 \
    --dataset-manifest "s3://laion-west/rpj_tokenized_upsampled_eleutherai/manifest.jsonl" "s3://laion-west/2T_no_rpj_tokenized_upsampled_25k_shards/manifest.jsonl" \
    --train-data-mix-weights 0.725 0.275 \
    --precision amp_bfloat16 \
    --batch-size 8 \
    --accum-freq 4 \
    --log-every-n-steps 20 \
    --grad-clip-norm 1 \
    --lr 5e-4 \
    --warmup 200 \
    --model open_lm_41m \
    --wd 0.1 \
    --beta2 0.95 \
    --epochs 50 \
    --report-to wandb \
    --moe-freq 2 \
    --moe-num-experts 8 \
    --moe-top-k 2 \
    --moe-capacity-factor 1.25 --moe-loss-weight 0.1 \
    --disable-meta-device \
    --wandb-project-name moe \
    --name test$RANDOM \
    --logs /fsx/home-$USER/experiments/moe \
    --resume latest \
    --seed 124 \
    --data-key 'json' \
    --fsdp --fsdp-amp \
    --model-norm gain_only_layer_norm \
    --lr-scheduler cosine \
    --lr-cooldown-end 0.00001
```

The above command adds an MoE FFN layer to every other Transformer block (controlled by `--moe-freq 2`). You can use an arbitrary number of experts; the only limit is the total memory available across your GPUs.
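
As a rough illustration of what `--moe-freq` controls, the sketch below marks which blocks receive an MoE FFN. This is illustrative only, not the actual open_lm layer-construction code, and the exact interleaving convention is an assumption:

```python
# Illustrative sketch: which Transformer blocks get an MoE FFN for a given --moe-freq.
# The exact interleaving convention used by open_lm may differ.
def moe_layer_mask(n_layers, moe_freq):
    """True means the block's dense FFN is (hypothetically) replaced by an MoE FFN."""
    if moe_freq <= 0:
        return [False] * n_layers  # MoE disabled
    return [(i + 1) % moe_freq == 0 for i in range(n_layers)]

# With --moe-freq 2 and a 12-layer model, every other block is an MoE block.
print(moe_layer_mask(12, 2))  # [False, True, False, True, ...]
```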


You can also enable `moe_expert_model_parallelism`, which distributes experts across GPUs. However, if the number of GPUs is larger than the number of experts, an additional num_gpus/num_experts-way tensor parallelism is applied on top. This is currently not eval-friendly, so we do not recommend using it yet.
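
For intuition, the layout described above looks roughly like the following bookkeeping. This is only a sketch of the GPU-assignment logic, not megablocks' actual implementation, and the function name is made up:

```python
# Illustrative sketch of expert placement under expert model parallelism.
def expert_parallel_layout(num_gpus, num_experts):
    if num_gpus <= num_experts:
        # Each GPU holds a slice of whole experts; no extra tensor parallelism.
        return {"experts_per_gpu": num_experts // num_gpus, "tensor_parallel_ways": 1}
    # More GPUs than experts: each expert is additionally sharded across
    # num_gpus / num_experts GPUs (tensor parallelism).
    return {"experts_per_gpu": 1, "tensor_parallel_ways": num_gpus // num_experts}

print(expert_parallel_layout(num_gpus=8, num_experts=8))   # one expert per GPU, no TP
print(expert_parallel_layout(num_gpus=16, num_experts=8))  # each expert sharded 2 ways
```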

You can evaluate the MoE in the same way as dense models:

```
torchrun --nproc-per-node 8 -m open_lm.main \
    --val-data "pipe:aws s3 cp s3://laion-west/lmdata/validation_data_tokenized/open_lm//shard_00000000.tar -" \
    --workers 6 \
    --precision amp_bfloat16 \
    --batch-size 8 \
    --log-every-n-steps 1 \
    --model open_lm_41m \
    --fsdp --fsdp-amp \
    --moe-num-experts 64 --moe-freq 2 \
    --data-key json \
    --train-num-samples 1000000000 \
    --model-norm gain_only_layer_norm \
    --name $RANDOM \
    --resume /fsx/home-suching/experiments/mix_wo/test8086/checkpoints/epoch_1.pt \
    --logs /fsx/home-$USER/experiments/eval
```


## Benchmarking

To benchmark your results, here are the perplexities we obtain with our implementation across a range of model sizes and compute budgets on our A100 cluster:

### Compute budgets

| Compute budget | 41M  | 87M  | 160M | 410M | 830M |
|------------------|------|------|------|------|------|
| Number of nodes  | 1    | 1    | 1    | 2    | 4    |
| Number of tokens | 20.0B | 20.0B | 20.0B | 20.0B | 20.0B |

### Perplexity
| Number of Experts | 41M | 87M | 160M | 410M | 830M |
|--------------|------|------|------|------|------|
| 1 | 27.61 | 18.68 | 14.87 | 10.54 | 9.39 |
| 8 | 19.85 | 14.66 | 12.26 | 9.82 | 8.84 |
| 32 | 20.55 | 15.28 | 14.62 | | |


### Tokens/sec/GPU 

| Number of Experts | 41M | 87M | 160M | 410M | 830M |
|--------------|------|------|------|------|------|
| 1 | 141.2K | 106.0K | 95.5K | 30.3K | 16.0K |  
| 8 | 69.5K | 66.6K | 66.2K | 18.5K | 9.2K |
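
These throughput numbers translate directly into rough wall-clock estimates. The back-of-the-envelope below assumes the single-node, 8-GPU setup from the compute-budget table and ignores checkpointing and evaluation overhead:

```python
# Rough wall-clock estimate from the throughput table above (41M model, 8 experts).
tokens_total = 20.0e9            # 20.0B-token compute budget
tokens_per_sec_per_gpu = 69.5e3  # from the Tokens/sec/GPU table
num_gpus = 8                     # one A100 node

seconds = tokens_total / (tokens_per_sec_per_gpu * num_gpus)
print(f"~{seconds / 3600:.1f} hours")  # roughly 10 hours
```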

### Training Parameters

| Number of Experts | 41M | 87M | 160M | 410M | 830M |
|--------------|------|------|------|------|------|
| 8 | 68.9M | 165.4M | 360.6M | 1.1B | 2.4B |
| 32 | 164.5M | 439.9M | 1.0B | 3.5B | 7.9B |

### Inference Parameters  

| Active Experts (top-k) | 41M | 87M | 160M | 410M | 830M |
|--------------|------|------|------|------|------|
| 2 | 45.0M | 96.8M | 190.7M | 509.2M | 1.1B |
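
The gap between the training and inference tables reflects the usual MoE accounting: all experts' FFN weights must be stored during training, but only the `--moe-top-k` experts selected by the router run per token at inference. The sketch below shows the shape of that calculation; every dimension in it is a placeholder, not the real open_lm_41m configuration:

```python
# Illustrative MoE parameter accounting; all numbers below are hypothetical.
def moe_param_counts(dense_params, ffn_params_per_layer, n_layers,
                     moe_freq, num_experts, top_k):
    n_moe_layers = n_layers // moe_freq
    # Training stores every expert: each MoE layer adds (E - 1) extra FFN copies.
    train = dense_params + n_moe_layers * (num_experts - 1) * ffn_params_per_layer
    # Inference only runs the top-k experts per MoE layer.
    active = dense_params + n_moe_layers * (top_k - 1) * ffn_params_per_layer
    return train, active

train, active = moe_param_counts(dense_params=41e6, ffn_params_per_layer=2e6,
                                 n_layers=12, moe_freq=2, num_experts=8, top_k=2)
print(f"training ~ {train / 1e6:.1f}M params, active at inference ~ {active / 1e6:.1f}M")
```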