# Quickstart |
|
|
|
### Step 0: Install OpenNMT-py |
|
|
|
```bash
pip install --upgrade pip
pip install OpenNMT-py
```
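
If the installation succeeded, the `onmt_*` command-line entry points should now be on your `PATH`. A quick sanity check (the version string will vary; this assumes the `onmt` package exposes `__version__`, which recent releases do):

```bash
# Check the installed version and confirm a CLI entry point works
python -c "import onmt; print(onmt.__version__)"
onmt_train -h
```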
|
|
|
### Step 1: Prepare the data |
|
|
|
To get started, download a toy English-German dataset for machine translation that contains 10k tokenized sentences:
|
|
|
```bash
wget https://s3.amazonaws.com/opennmt-trainingdata/toy-ende.tar.gz
tar xf toy-ende.tar.gz
```
|
|
|
The data consists of parallel source (`src`) and target (`tgt`) files, each containing one sentence per line with tokens separated by a space:
|
|
|
* `src-train.txt` |
|
* `tgt-train.txt` |
|
* `src-val.txt` |
|
* `tgt-val.txt` |
|
|
|
The validation files are used to evaluate the convergence of the training; they usually contain no more than 5k sentences.
|
|
|
```text
$ head -n 2 toy-ende/src-train.txt
It is not acceptable that , with the help of the national bureaucracies , Parliament 's legislative prerogative should be made null and void by means of implementing provisions whose content , purpose and extent are not laid down in advance .
Federal Master Trainer and Senior Instructor of the Italian Federation of Aerobic Fitness , Group Fitness , Postural Gym , Stretching and Pilates; from 2004 , he has been collaborating with Antiche Terme as personal Trainer and Instructor of Stretching , Pilates and Postural Gym .
```
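
Since the source and target files must stay line-aligned, a quick sanity check is to compare line counts; this is plain shell, nothing OpenNMT-specific:

```bash
# Parallel files should report identical line counts per split
wc -l toy-ende/src-train.txt toy-ende/tgt-train.txt
wc -l toy-ende/src-val.txt toy-ende/tgt-val.txt
```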
|
|
|
We need to build a **YAML configuration file** to specify the data that will be used: |
|
|
|
```yaml
# toy_en_de.yaml

## Where the samples will be written
save_data: toy-ende/run/example
## Where the vocab(s) will be written
src_vocab: toy-ende/run/example.vocab.src
tgt_vocab: toy-ende/run/example.vocab.tgt
# Prevent overwriting existing files in the folder
overwrite: False

# Corpus opts:
data:
    corpus_1:
        path_src: toy-ende/src-train.txt
        path_tgt: toy-ende/tgt-train.txt
    valid:
        path_src: toy-ende/src-val.txt
        path_tgt: toy-ende/tgt-val.txt
...
```
|
|
|
From this configuration, we can build the vocab(s) that will be needed to train the model:
|
|
|
```bash
onmt_build_vocab -config toy_en_de.yaml -n_sample 10000
```
|
|
|
**Notes**: |
|
- `-n_sample` is required here -- it represents the number of lines sampled from each corpus to build the vocab. |
|
- This configuration is the simplest possible, without any tokenization or other *transforms*. See [other example configurations](https://github.com/OpenNMT/OpenNMT-py/tree/master/config) for more complex pipelines. |
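
The resulting vocab files are plain text and can be inspected directly. As a rough sketch (the exact line format may vary by version; each line should hold a token, most frequent first):

```bash
# Peek at the most frequent source-side tokens
head -n 5 toy-ende/run/example.vocab.src
```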
|
|
|
### Step 2: Train the model |
|
|
|
To train a model, we need to **add the following to the YAML configuration file**: |
|
- the vocabulary path(s) that will be used: these can be the ones generated by `onmt_build_vocab`;

- training-specific parameters.
|
|
|
```yaml
# toy_en_de.yaml

...

# Vocabulary files that were just created
src_vocab: toy-ende/run/example.vocab.src
tgt_vocab: toy-ende/run/example.vocab.tgt

# Train on a single GPU
world_size: 1
gpu_ranks: [0]

# Where to save the checkpoints
save_model: toy-ende/run/model
save_checkpoint_steps: 500
train_steps: 1000
valid_steps: 500
```
|
|
|
Then you can simply run: |
|
|
|
```bash
onmt_train -config toy_en_de.yaml
```
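
As training runs, checkpoints are written under `toy-ende/run/` every `save_checkpoint_steps` steps, so with the settings above you should end up with `model_step_500.pt` and `model_step_1000.pt`. You can list them with plain shell:

```bash
# List the saved checkpoints
ls -lh toy-ende/run/model_step_*.pt
```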
|
|
|
This configuration will run the default model, which consists of a 2-layer LSTM with 500 hidden units in both the encoder and the decoder. It will run on a single GPU (`world_size: 1` and `gpu_ranks: [0]`).
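
Those two options are what you would adjust for a different setup. For example, a hypothetical two-GPU run (assuming two visible CUDA devices) would use:

```yaml
# Hypothetical variant: train with one process per GPU on two GPUs
world_size: 2
gpu_ranks: [0, 1]
```

Omitting `gpu_ranks` entirely should fall back to CPU training, which will be much slower.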
|
|
|
Before the training process actually starts, the `*.vocab.pt` and `*.transforms.pt` files can be dumped to the `-save_data` location, with the configuration taken from the `-config` YAML file, by enabling the `-dump_fields` and `-dump_transforms` flags. It is also possible to generate transformed samples to simplify any required visual inspection. The number of sample lines to dump per corpus is set with the `-n_sample` flag.
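
For example, an invocation combining those flags could look like the following (a sketch built from the flags described above; `-n_sample 3` is an arbitrary illustration value):

```bash
# Dump fields/transforms and 3 transformed sample lines per corpus
onmt_train -config toy_en_de.yaml -dump_fields -dump_transforms -n_sample 3
```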
|
|
|
For more advanced models and parameters, see [other example configurations](https://github.com/OpenNMT/OpenNMT-py/tree/master/config) or the [FAQ](FAQ).
|
|
|
### Step 3: Translate |
|
|
|
```bash
onmt_translate -model toy-ende/run/model_step_1000.pt -src toy-ende/src-test.txt -output toy-ende/pred_1000.txt -gpu 0 -verbose
```
|
|
|
You now have a model that you can use to predict on new data, which is done by running beam search. The predictions are written to `toy-ende/pred_1000.txt`.
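
If you want a rough quality score, the predictions can be compared against a reference with an external tool such as sacreBLEU. This sketch assumes `pip install sacrebleu` and that a reference file `toy-ende/tgt-test.txt` exists alongside `src-test.txt`:

```bash
# Score the predictions against the reference translations
sacrebleu toy-ende/tgt-test.txt -i toy-ende/pred_1000.txt -m bleu
```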
|
|
|
**Note**: |
|
|
|
The predictions are going to be quite terrible, as the demo dataset is small. Try running on some larger datasets! For example you can download millions of parallel sentences for [translation](http://www.statmt.org/wmt16/translation-task.html) or [summarization](https://github.com/harvardnlp/sent-summary). |
|
|