# Language Model Generation
## Step 0: Download and clean the data

Preliminary steps are defined in [`examples/scripts/prepare_wikitext-103_data.sh`](https://github.com/OpenNMT/OpenNMT-py/tree/master/examples/scripts/prepare_wikitext-103_data.sh). The following commands will download the [WikiText103 dataset](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/), remove empty lines, and shuffle the training corpus:

```bash
chmod u+x prepare_wikitext-103_data.sh
./prepare_wikitext-103_data.sh
```
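
For reference, the cleaning step boils down to dropping empty lines and shuffling. A rough Python equivalent, as a sketch only (the actual script uses shell tools, and `wiki.train.raw` assumes the download has already happened):

```python
import random

# a rough sketch of the cleaning step: drop empty lines and shuffle
# the training corpus (the real prepare_wikitext-103_data.sh uses
# shell tools; "wiki.train.raw" assumes the dataset was downloaded)
with open("wiki.train.raw", encoding="utf-8") as f:
    lines = [line for line in f if line.strip()]

random.shuffle(lines)

with open("wiki.train.raw", "w", encoding="utf-8") as f:
    f.writelines(lines)
```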

## Step 1: Prepare the subword model - BPE with pyonmttok

This snippet will train a BPE model of 40,000 symbols on the training dataset using `pyonmttok`. The BPE model will be stored in `subwords.bpe`, and the train/valid/test sets will be tokenized and saved.

The tokenized files won't be used for training: dynamic iteration over the training dataset enables on-the-fly tokenization using transforms (see Step 2).

```python
import pyonmttok

args = {
    "mode": "aggressive",
    "joiner_annotate": True,
    "preserve_placeholders": True,
    "case_markup": True,
    "soft_case_regions": True,
    "preserve_segmented_tokens": True,
}
n_symbols = 40000

tokenizer_default = pyonmttok.Tokenizer(**args)
learner = pyonmttok.BPELearner(tokenizer=tokenizer_default, symbols=n_symbols)

# load the training corpus
learner.ingest_file("wiki.train.raw")

# learn and store the BPE model
tokenizer = learner.learn("subwords.bpe")

# tokenize the corpora and save the results
for data_file in ["wiki.valid", "wiki.test", "wiki.train"]:
    tokenizer.tokenize_file(f"{data_file}.raw", f"{data_file}.bpe")
```
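
Optionally, the learned model can be sanity-checked on a single sentence. A minimal sketch, assuming pyonmttok 1.x where `tokenize()` returns a `(tokens, features)` tuple (the return signature varies across versions):

```python
# quick sanity check of the learned BPE model on one sentence;
# in pyonmttok 1.x, tokenize() returns a (tokens, features) tuple
tokens, _ = tokenizer.tokenize("The quick brown fox jumps over the lazy dog.")
print(tokens)                        # subword tokens with joiner annotations
print(tokenizer.detokenize(tokens))  # should round-trip to the original text
```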

## Step 2: Build the vocabulary

An example YAML configuration for the language modeling task is available in [`examples/wiki_103.yaml`](https://github.com/OpenNMT/OpenNMT-py/tree/master/examples/wiki_103.yaml). This configuration will be used both for building the vocabulary and for training the model.

BPE and language modeling specificities are explained in the following sections.

### Language Model specificities

In LM tasks we expect a single source, therefore `path_tgt` is not required.

```yaml
data:
    corpus_1:
        path_src: data/wikitext-103-raw/wiki.train.raw
```

### BPE specificities

To use BPE tokenization on the fly, the following parameters must be in the config file; note that they mirror the tokenization options used in Step 1. Slight differences between on-the-fly tokenization and the tokenized files output in Step 1 may be observed.

```yaml
src_subword_type: bpe
src_subword_model: data/wikitext-103-raw/subwords.bpe
src_onmttok_kwargs: '{"mode": "aggressive", "joiner_annotate": True, "preserve_placeholders": True, "case_markup": True, "soft_case_regions": True, "preserve_segmented_tokens": True}'
transforms: [onmt_tokenize]
```

### Build vocabulary command

The vocabulary is built with the following command, where `-n_sample -1` means the full corpus is used:

```bash
onmt_build_vocab -config examples/wiki_103.yaml -n_sample -1
```
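
The resulting vocabulary file can then be inspected. A sketch under two assumptions: the path is a placeholder for whatever `src_vocab` points to in `wiki_103.yaml`, and each line holds a token and its count separated by whitespace:

```python
# peek at the most frequent entries of the generated vocabulary file;
# the path is a placeholder for whatever src_vocab points to in the
# config, and the token/count-per-line format is an assumption
with open("data/wikitext-103-raw/run/example.vocab.src", encoding="utf-8") as f:
    for line in list(f)[:5]:
        token, count = line.split()
        print(token, int(count))
```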

## Step 3: Train the model

To train a model for LM tasks, the following parameters are required:

* *model_task* is used to specify that the task is language modeling (a decoder-only model with the transformer_lm decoder type, expecting a source-only dataset)
* *decoder_type* must be transformer_lm. This transformer is the one used in GPT-2: [**Language Models are Unsupervised Multitask Learners**](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf). Basically, it is a transformer without an encoder attention block
* *encoder_type* is not used, but still needs to be set
* *share_vocab* must be true: the shifted source plays the role of the target, hence the vocabulary must be shared

```yaml
model_task: lm
encoder_type: transformer_lm
decoder_type: transformer_lm

share_vocab: true
```

The training is launched using:

```bash
onmt_train -config examples/wiki_103.yaml
```

TensorBoard can be used to monitor the training.

**Expected results:** a perplexity of 20-22 on the validation set.
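
As a reminder of how the reported loss relates to this figure, perplexity is the exponential of the average negative log-likelihood per token. A minimal sketch:

```python
import math

# perplexity = exp(average negative log-likelihood per token)
def perplexity(total_nll: float, n_tokens: int) -> float:
    return math.exp(total_nll / n_tokens)

# e.g. an average loss of ~3.05 nats per token corresponds to ppl ~21
print(perplexity(total_nll=3.05 * 1_000_000, n_tokens=1_000_000))
```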

## Step 4: Generate output

Options contained in the loaded model will trigger language-model-specific inference.

`input.txt` must contain examples that are already tokenized with the same method as the training data. Here, part of the validation data is used:

```bash
head data/wikitext-103-raw/wiki.valid.bpe | cut -d" " -f-15 > data/wikitext-103-raw/lm_input.txt
```

To proceed with LM inference, sampling methods such as top-k sampling or nucleus sampling are usually applied. Details and options about inference methods can be found in [`onmt/opts.py`](https://github.com/OpenNMT/OpenNMT-py/tree/master/onmt/opts.py).
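
For intuition, here is a minimal sketch of nucleus (top-p) sampling over a next-token distribution. OpenNMT-py implements this internally when `-random_sampling_topp` is set, so this is illustration only:

```python
import numpy as np

# nucleus (top-p) sampling sketch: keep the smallest set of tokens whose
# cumulative probability reaches p, renormalize, and sample from that set
def nucleus_sample(probs: np.ndarray, p: float = 0.9) -> int:
    order = np.argsort(probs)[::-1]               # token ids, most probable first
    cutoff = np.searchsorted(np.cumsum(probs[order]), p) + 1
    kept = order[:cutoff]                         # the "nucleus"
    kept_probs = probs[kept] / probs[kept].sum()  # renormalize within the nucleus
    return int(np.random.choice(kept, p=kept_probs))

# toy usage: a 5-token vocabulary
print(nucleus_sample(np.array([0.5, 0.2, 0.15, 0.1, 0.05]), p=0.9))
```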

The following command will run inference with nucleus sampling at p=0.9 and return the 3 sequences with the lowest perplexity out of the 10 generated sequences:

```bash
onmt_translate -model data/wikitext-103-raw/run/model-lm_step_1000000.pt -src data/wikitext-103-raw/lm_input.txt -output data/wikitext-103-raw/lm_pred_input.txt -verbose -n_best 3 -random_sampling_topp 0.9 -beam_size 10
```