GPT model training
------------------
GPT is a decoder-only Transformer model.
Quick start
^^^^^^^^^^^
The steps below demonstrate how to train a GPT-style model with NeMo.
Data download & pre-processing
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. note::
Data download, pre-processing and tokenizer training in the example below will take ~3 hours.
**Step 1: Download data**
The step below downloads the Wikipedia dump (around 20GB) and can take several hours.
.. code-block:: bash
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
**Step 2: Extract raw data**
.. code-block:: bash
pip install wikiextractor
python -m wikiextractor.WikiExtractor enwiki-latest-pages-articles.xml.bz2 --json
find text -name 'wiki_*' -exec cat {} \; > train_data.jsonl
Now ``train_data.jsonl`` contains our training data in JSON lines format. We are interested in the data under the ``"text"`` field.
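If you want a quick sanity check, you can peek at the first record and count how many documents were extracted:

.. code-block:: bash

    # show the first JSON record (one document per line)
    head -n 1 train_data.jsonl

    # count the number of extracted documents
    wc -l train_data.jsonl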
**Step 3: Train tokenizer**
Below we consider two options for training a tokenizer on this data: using the pre-built HuggingFace GPT2 BPE tokenizer files, or training and using your own Google Sentencepiece tokenizer.
Note that only the second option allows you to experiment with the vocabulary size.
*Option 1:* Using HuggingFace GPT2 tokenizer files.
With this option we simply download the pre-built vocabulary and merge files for the BPE tokenizer.
.. code-block:: bash
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt
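As an optional check on the download, the merges file should contain roughly 50 thousand BPE merge rules and the vocabulary file maps roughly 50 thousand tokens to ids (approximate figures for the GPT2 tokenizer):

.. code-block:: bash

    # quick look at the downloaded tokenizer files
    wc -l gpt2-merges.txt          # roughly 50k merge rules (plus a version header)
    head -c 200 gpt2-vocab.json    # start of the token-to-id mapping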
*Option 2:* Using `Google Sentencepiece <https://github.com/google/sentencepiece>`_ tokenizer library.
It comes as a dependency of NeMo, so if you have installed NeMo it should already be available.
Note that training the tokenizer model will also take some time.
.. code-block:: bash
sudo apt install jq
jq .text train_data.jsonl >> text_for_tokenizer.txt
spm_train --input=text_for_tokenizer.txt \
--model_prefix=spm_32k_wiki \
--vocab_size=32768 \
--character_coverage=0.9999 \
--model_type=bpe \
--byte_fallback=true \
--pad_id=0 --unk_id=1 --bos_id=2 --eos_id=3 \
--split_digits true
After this is done (it will take a while), you'll have two files: ``spm_32k_wiki.model`` and ``spm_32k_wiki.vocab``, which correspond to the model and vocabulary.
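To verify the trained tokenizer, you can encode a sample sentence with ``spm_encode``, which is installed alongside ``spm_train`` (the sample text below is arbitrary):

.. code-block:: bash

    echo "A Wikipedia article about natural language processing." | \
        spm_encode --model=spm_32k_wiki.model --output_format=piece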
**Step 4: Convert training data into memory map format**
This format makes training more efficient, especially with many nodes and GPUs. This step will also tokenize the data using the tokenizer model from Step 3.
*Option 1:* Using HuggingFace GPT2 tokenizer files.
.. code-block:: bash
python <NeMo_ROOT_FOLDER>/scripts/nlp_language_modeling/preprocess_data_for_megatron.py \
--input=train_data.jsonl \
--json-keys=text \
--tokenizer-library=megatron \
--vocab gpt2-vocab.json \
--dataset-impl mmap \
--tokenizer-type GPT2BPETokenizer \
--merge-file gpt2-merges.txt \
--output-prefix=hfbpe_gpt_training_data \
--append-eod \
--workers=32
*Option 2:* Using `Google Sentencepiece <https://github.com/google/sentencepiece>`_ tokenizer library.
.. code-block:: bash
python <NeMo_ROOT_FOLDER>/scripts/nlp_language_modeling/preprocess_data_for_megatron.py \
--input=train_data.jsonl \
--json-keys=text \
--tokenizer-library=sentencepiece \
--tokenizer-model=spm_32k_wiki.model \
--output-prefix=gpt_training_data \
--append-eod \
--workers=32
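Either command writes an indexed, memory-mapped dataset as a ``.bin``/``.idx`` pair named after ``--output-prefix`` and the JSON key (for example ``gpt_training_data_text_document.bin`` and ``gpt_training_data_text_document.idx``); the prefix without the extension is what ``model.data.data_prefix`` refers to in the training commands below. A quick way to confirm the files were produced:

.. code-block:: bash

    # indexed dataset: .bin holds the token data, .idx holds the index
    ls -lh *_text_document.bin *_text_document.idx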
Train GPT-style Model
~~~~~~~~~~~~~~~~~~~~~
Once you have prepared the training data and tokenizer, you are ready to train the model.
The configuration presented below has about 124M parameters and should fit on a single 16GB GPU when using float16.
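If you want to double-check the 124M figure, a rough estimate that ignores biases and LayerNorm parameters and assumes the ~50K-token GPT2 vocabulary counts the token and position embeddings plus 12 layers of attention and MLP weights:

.. code-block:: bash

    # token embeddings + position embeddings + 12 x (QKV/output projections + 2-layer MLP)
    echo $(( 50257*768 + 1024*768 + 12*(4*768*768 + 2*768*3072) ))
    # prints 124318464, i.e. roughly 124M parameters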
Let's go!!!
*Option 1:* Using HuggingFace GPT2 tokenizer files.
.. code-block:: bash
python <NeMo_ROOT_FOLDER>/examples/nlp/language_modeling/megatron_gpt_pretraining.py \
--config-path=<NeMo_ROOT_FOLDER>/examples/nlp/language_modeling/conf \
--config-name=megatron_gpt_config \
trainer.devices=1 \
trainer.num_nodes=1 \
trainer.max_epochs=null \
trainer.max_steps=300000 \
trainer.val_check_interval=300 \
trainer.log_every_n_steps=50 \
trainer.limit_val_batches=50 \
trainer.limit_test_batches=50 \
trainer.accumulate_grad_batches=1 \
trainer.precision=16 \
model.micro_batch_size=6 \
model.global_batch_size=192 \
model.tensor_model_parallel_size=1 \
model.pipeline_model_parallel_size=1 \
model.max_position_embeddings=1024 \
model.encoder_seq_length=1024 \
model.hidden_size=768 \
model.ffn_hidden_size=3072 \
model.num_layers=12 \
model.num_attention_heads=12 \
model.init_method_std=0.021 \
model.hidden_dropout=0.1 \
model.layernorm_epsilon=1e-5 \
model.tokenizer.vocab_file=gpt2-vocab.json \
model.tokenizer.merge_file=gpt2-merges.txt \
model.data.data_prefix=[1.0,hfbpe_gpt_training_data_text_document] \
model.data.num_workers=2 \
model.data.seq_length=1024 \
model.data.splits_string=\'980,10,10\' \
model.optim.name=fused_adam \
model.optim.lr=6e-4 \
model.optim.betas=[0.9,0.95] \
model.optim.weight_decay=0.1 \
model.optim.sched.name=CosineAnnealing \
model.optim.sched.warmup_steps=750 \
model.optim.sched.constant_steps=80000 \
model.optim.sched.min_lr=6e-5 \
exp_manager.resume_if_exists=True \
exp_manager.resume_ignore_no_checkpoint=True \
exp_manager.create_checkpoint_callback=True \
exp_manager.checkpoint_callback_params.monitor=val_loss \
exp_manager.checkpoint_callback_params.save_top_k=3 \
exp_manager.checkpoint_callback_params.mode=min \
exp_manager.checkpoint_callback_params.always_save_nemo=False
*Option 2:* Using `Google Sentencepiece <https://github.com/google/sentencepiece>`_ tokenizer library.
.. code-block:: bash
python <NeMo_ROOT_FOLDER>/examples/nlp/language_modeling/megatron_gpt_pretraining.py \
--config-path=<NeMo_ROOT_FOLDER>/examples/nlp/language_modeling/conf \
--config-name=megatron_gpt_config \
trainer.devices=1 \
trainer.num_nodes=1 \
trainer.max_epochs=null \
trainer.max_steps=300000 \
trainer.val_check_interval=300 \
trainer.log_every_n_steps=50 \
trainer.limit_val_batches=50 \
trainer.limit_test_batches=50 \
trainer.accumulate_grad_batches=1 \
trainer.precision=16 \
model.micro_batch_size=6 \
model.global_batch_size=192 \
model.tensor_model_parallel_size=1 \
model.pipeline_model_parallel_size=1 \
model.max_position_embeddings=1024 \
model.encoder_seq_length=1024 \
model.hidden_size=768 \
model.ffn_hidden_size=3072 \
model.num_layers=12 \
model.num_attention_heads=12 \
model.init_method_std=0.021 \
model.hidden_dropout=0.1 \
model.layernorm_epsilon=1e-5 \
model.tokenizer.library=sentencepiece \
model.tokenizer.model=spm_32k_wiki.model \
model.data.data_prefix=[1.0,gpt_training_data_text_document] \
model.data.num_workers=2 \
model.data.seq_length=1024 \
model.data.splits_string=\'980,10,10\' \
model.optim.name=fused_adam \
model.optim.lr=6e-4 \
model.optim.betas=[0.9,0.95] \
model.optim.weight_decay=0.1 \
model.optim.sched.name=CosineAnnealing \
model.optim.sched.warmup_steps=750 \
model.optim.sched.constant_steps=80000 \
model.optim.sched.min_lr=6e-5 \
exp_manager.resume_if_exists=True \
exp_manager.resume_ignore_no_checkpoint=True \
exp_manager.create_checkpoint_callback=True \
exp_manager.checkpoint_callback_params.monitor=val_loss \
exp_manager.checkpoint_callback_params.save_top_k=3 \
exp_manager.checkpoint_callback_params.mode=min \
exp_manager.checkpoint_callback_params.always_save_nemo=False
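A note on the batch-size settings used above: NeMo derives the number of gradient-accumulation steps from the ratio of ``global_batch_size`` to ``micro_batch_size`` times the data-parallel size, which is why ``trainer.accumulate_grad_batches`` is left at 1 (see the :ref:`batching` section for details). With the values above on a single GPU this works out to 32 micro-batches per optimizer step:

.. code-block:: bash

    # accumulation steps = global_batch_size / (micro_batch_size * data_parallel_size)
    echo $(( 192 / (6 * 1) ))    # prints 32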
Next, simply launch TensorBoard to monitor training:
.. code-block:: bash
tensorboard --logdir nemo_experiments --bind_all
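Logs and checkpoints are written by the experiment manager under ``nemo_experiments`` in the working directory by default (the exact subdirectory layout depends on your ``exp_manager`` settings), so you can also inspect them directly:

.. code-block:: bash

    # list what exp_manager has written so far
    ls -R nemo_experiments | head -n 20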
Next steps
~~~~~~~~~~
Please refer to:
* :ref:`batching` section for batch size adjustments
* :ref:`parallelisms` section for understanding various types of parallelisms
* :ref:`promptlearning` section for details on prompt-tuning and p-tuning