|
# Mixture of Experts Language Models
|
|
|
## Dependencies
|
|
|
Our mixture-of-experts implementation depends on [megablocks](https://github.com/stanford-futuredata/megablocks) and a version of xformers that is compatible with torch 2.1:
|
|
|
```
pip install megablocks
pip3 install -U xformers --index-url https://download.pytorch.org/whl/cu121
```
|
|
|
## Train MoE
|
|
|
To train an MoE, add the MoE-related arguments (`--moe-*`) to the training command:
|
|
|
```
torchrun --nproc-per-node 8 -m open_lm.main \
    --train-num-samples 10000000000 \
    --workers 2 \
    --dataset-manifest "s3://laion-west/rpj_tokenized_upsampled_eleutherai/manifest.jsonl" "s3://laion-west/2T_no_rpj_tokenized_upsampled_25k_shards/manifest.jsonl" \
    --train-data-mix-weights 0.725 0.275 \
    --precision amp_bfloat16 \
    --batch-size 8 \
    --accum-freq 4 \
    --log-every-n-steps 20 \
    --grad-clip-norm 1 \
    --lr 5e-4 \
    --warmup 200 \
    --model open_lm_41m \
    --wd 0.1 \
    --beta2 0.95 \
    --epochs 50 \
    --report-to wandb \
    --moe-freq 2 \
    --moe-num-experts 8 \
    --moe-top-k 2 \
    --moe-capacity-factor 1.25 --moe-loss-weight 0.1 \
    --disable-meta-device \
    --wandb-project-name moe \
    --name test$RANDOM \
    --logs /fsx/home-$USER/experiments/moe \
    --resume latest \
    --seed 124 \
    --data-key 'json' \
    --fsdp --fsdp-amp \
    --model-norm gain_only_layer_norm \
    --lr-scheduler cosine \
    --lr-cooldown-end 0.00001
```
|
|
|
The command above adds an MoE FFN layer to every other Transformer block (controlled by `--moe-freq 2`). You can use an arbitrary number of experts; you are limited only by the total memory available across all GPUs.
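For intuition, here is a minimal sketch (not the actual open_lm code) of how an interleaving frequency like `--moe-freq` decides which blocks use an MoE FFN; the exact indexing convention in open_lm may differ:

```python
# Minimal sketch of MoE/dense interleaving. This is illustrative only and
# does not reproduce open_lm's actual block construction logic.

def uses_moe_ffn(layer_idx: int, moe_freq: int) -> bool:
    """Return True if this transformer block should use an MoE FFN.

    With moe_freq == 2, every other block is an MoE block; moe_freq == 0
    disables MoE entirely. The exact indexing convention (which half of
    the blocks is MoE) is an assumption here.
    """
    return moe_freq > 0 and layer_idx % moe_freq == 0

# Example: a 12-layer model with --moe-freq 2 -> blocks 0, 2, 4, ... use MoE.
print([i for i in range(12) if uses_moe_ffn(i, moe_freq=2)])
```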
|
|
|
|
|
You can also add the `moe_expert_model_parallelism` option, which distributes experts across different GPUs. However, if the number of GPUs is larger than the number of experts, an additional num_gpus/num_experts-way tensor parallelism is applied. This mode is currently not eval-friendly, so we do not recommend using it yet.
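As a rough illustration of the layout described above (a sketch under assumptions, not the actual megablocks/open_lm implementation), the expert-to-GPU mapping can be thought of as:

```python
# Illustrative sketch of expert model parallelism layout. The helper below is
# hypothetical and only mirrors the rule described in the text above.

def expert_parallel_layout(num_experts: int, num_gpus: int):
    """Return (experts_per_gpu, tensor_parallel_degree).

    If there are at least as many experts as GPUs, each GPU holds a slice of
    the experts. If there are more GPUs than experts, each expert is
    additionally sharded across num_gpus // num_experts GPUs (tensor
    parallelism), as described above.
    """
    if num_experts >= num_gpus:
        return num_experts // num_gpus, 1
    return 1, num_gpus // num_experts

print(expert_parallel_layout(num_experts=8, num_gpus=8))   # (1, 1): one expert per GPU
print(expert_parallel_layout(num_experts=8, num_gpus=32))  # (1, 4): each expert sharded 4 ways
```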
|
|
|
You can evaluate the MoE in the same way as dense models:
|
|
|
```
torchrun --nproc-per-node 8 -m open_lm.main \
    --val-data "pipe:aws s3 cp s3://laion-west/lmdata/validation_data_tokenized/open_lm//shard_00000000.tar -" \
    --workers 6 \
    --precision amp_bfloat16 \
    --batch-size 8 \
    --log-every-n-steps 1 \
    --model open_lm_41m \
    --fsdp --fsdp-amp \
    --moe-num-experts 64 --moe-freq 2 \
    --data-key json \
    --train-num-samples 1000000000 \
    --model-norm gain_only_layer_norm \
    --name $RANDOM \
    --resume /fsx/home-suching/experiments/mix_wo/test8086/checkpoints/epoch_1.pt \
    --logs /fsx/home-$USER/experiments/eval
```
|
|
|
|
|
## Benchmarking
|
|
|
To benchmark your results, here are the perplexities we obtain with our implementation across several compute budgets and model sizes on our A100 cluster:
|
|
|
### Compute budgets

| Compute type | 41M | 87M | 160M | 410M | 830M |
|------------------|-------|-------|-------|-------|-------|
| Number of nodes  | 1     | 1     | 1     | 2     | 4     |
| Number of tokens | 20.0B | 20.0B | 20.0B | 20.0B | 20.0B |
|
|
|
### Perplexity

| Number of Experts | 41M | 87M | 160M | 410M | 830M |
|-------------------|-------|-------|-------|-------|------|
| 1  | 27.61 | 18.68 | 14.87 | 10.54 | 9.39 |
| 8  | 19.85 | 14.66 | 12.26 | 9.82  | 8.84 |
| 32 | 20.55 | 15.28 | 14.62 |       |      |
|
|
|
|
|
### Tokens/sec/GPU

| Number of Experts | 41M | 87M | 160M | 410M | 830M |
|-------------------|--------|--------|-------|-------|-------|
| 1 | 141.2K | 106.0K | 95.5K | 30.3K | 16.0K |
| 8 | 69.5K  | 66.6K  | 66.2K | 18.5K | 9.2K  |
|
|
|
### Training Parameters

| Number of Experts | 41M | 87M | 160M | 410M | 830M |
|-------------------|--------|--------|--------|------|------|
| 8  | 68.9M  | 165.4M | 360.6M | 1.1B | 2.4B |
| 32 | 164.5M | 439.9M | 1.0B   | 3.5B | 7.9B |
|
|
|
### Inference Parameters

| Number of active experts (top-k) | 41M | 87M | 160M | 410M | 830M |
|----------------------------------|-------|-------|--------|--------|------|
| 2 | 45.0M | 96.8M | 190.7M | 509.2M | 1.1B |
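The gap between the training and inference parameter counts comes from routing: all experts are trained, but only the top-k experts are active per token at inference. Below is a rough back-of-the-envelope sketch of this relationship; the layer shapes are hypothetical and embeddings/router weights are ignored, so the numbers will not match the tables above.

```python
# Rough estimate of total (trained) vs. active (inference) parameters for an
# MoE transformer. Layer shapes here are illustrative, not open_lm's exact
# architecture; embedding and router parameters are omitted for simplicity.

def moe_param_estimate(d_model, n_layers, ffn_mult, num_experts, top_k, moe_freq):
    attn = n_layers * 4 * d_model * d_model            # q, k, v, out projections
    ffn_dense = 2 * d_model * (ffn_mult * d_model)     # one dense FFN block
    n_moe = n_layers // moe_freq if moe_freq else 0    # blocks with an MoE FFN
    n_dense = n_layers - n_moe                         # blocks with a dense FFN
    total = attn + n_dense * ffn_dense + n_moe * num_experts * ffn_dense
    active = attn + n_dense * ffn_dense + n_moe * top_k * ffn_dense
    return total, active

total, active = moe_param_estimate(
    d_model=512, n_layers=8, ffn_mult=4, num_experts=8, top_k=2, moe_freq=2
)
print(f"total (trained): {total / 1e6:.1f}M, active (inference): {active / 1e6:.1f}M")
```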