# Quickstart |
|
|
|
### Step 0: Install OpenNMT-py |
|
|
|
```bash
pip install --upgrade pip
pip install OpenNMT-py
```
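
If the installation succeeded, the `onmt_*` command-line entry points should now be on your `PATH`. A quick sanity check (the version string will vary; this assumes the `onmt` package exposes `__version__`, which recent releases do):

```bash
# Check the installed version and confirm a CLI entry point works
python -c "import onmt; print(onmt.__version__)"
onmt_train -h
```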
|
|
|
### Step 1: Prepare the data |
|
|
|
To get started, download a toy English-German dataset for machine translation that contains 10k tokenized sentences:
|
|
|
```bash
wget https://s3.amazonaws.com/opennmt-trainingdata/toy-ende.tar.gz
tar xf toy-ende.tar.gz
```
|
|
|
The data consists of parallel source (`src`) and target (`tgt`) files, each containing one sentence per line with tokens separated by a space:
|
|
|
* `src-train.txt` |
|
* `tgt-train.txt` |
|
* `src-val.txt` |
|
* `tgt-val.txt` |
|
|
|
The validation files are used to evaluate the convergence of the training; they usually contain no more than 5k sentences.
|
|
|
```text
$ head -n 2 toy-ende/src-train.txt
It is not acceptable that , with the help of the national bureaucracies , Parliament 's legislative prerogative should be made null and void by means of implementing provisions whose content , purpose and extent are not laid down in advance .
Federal Master Trainer and Senior Instructor of the Italian Federation of Aerobic Fitness , Group Fitness , Postural Gym , Stretching and Pilates; from 2004 , he has been collaborating with Antiche Terme as personal Trainer and Instructor of Stretching , Pilates and Postural Gym .
```
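
Since the source and target files must stay line-aligned, a quick sanity check is to compare line counts; this is plain shell, nothing OpenNMT-specific:

```bash
# Parallel files should report identical line counts per split
wc -l toy-ende/src-train.txt toy-ende/tgt-train.txt
wc -l toy-ende/src-val.txt toy-ende/tgt-val.txt
```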
|
|
|
We need to build a **YAML configuration file** to specify the data that will be used: |
|
|
|
```yaml
# toy_en_de.yaml

## Where the samples will be written
save_data: toy-ende/run/example
## Where the vocab(s) will be written
src_vocab: toy-ende/run/example.vocab.src
tgt_vocab: toy-ende/run/example.vocab.tgt
# Prevent overwriting existing files in the folder
overwrite: False

# Corpus opts:
data:
    corpus_1:
        path_src: toy-ende/src-train.txt
        path_tgt: toy-ende/tgt-train.txt
    valid:
        path_src: toy-ende/src-val.txt
        path_tgt: toy-ende/tgt-val.txt
...
```
|
|
|
From this configuration, we can build the vocab(s) that will be needed to train the model:
|
|
|
```bash
onmt_build_vocab -config toy_en_de.yaml -n_sample 10000
```
|
|
|
**Notes**: |
|
- `-n_sample` is required here -- it represents the number of lines sampled from each corpus to build the vocab. |
|
- This configuration is the simplest possible, without any tokenization or other *transforms*. See [other example configurations](https://github.com/OpenNMT/OpenNMT-py/tree/master/config) for more complex pipelines. |
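
The resulting vocab files are plain text and can be inspected directly. As a rough sketch (the exact line format may vary by version; each line should hold a token, most frequent first):

```bash
# Peek at the most frequent source-side tokens
head -n 5 toy-ende/run/example.vocab.src
```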
|
|
|
### Step 2: Train the model |
|
|
|
To train a model, we need to **add the following to the YAML configuration file**: |
|
- the vocabulary path(s) that will be used: these can be the ones generated by `onmt_build_vocab`;

- training-specific parameters.
|
|
|
```yaml
# toy_en_de.yaml

...

# Vocabulary files that were just created
src_vocab: toy-ende/run/example.vocab.src
tgt_vocab: toy-ende/run/example.vocab.tgt

# Train on a single GPU
world_size: 1
gpu_ranks: [0]

# Where to save the checkpoints
save_model: toy-ende/run/model
save_checkpoint_steps: 500
train_steps: 1000
valid_steps: 500
```
|
|
|
Then you can simply run: |
|
|
|
```bash
onmt_train -config toy_en_de.yaml
```
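
As training runs, checkpoints are written under `toy-ende/run/` every `save_checkpoint_steps` steps, so with the settings above you should end up with `model_step_500.pt` and `model_step_1000.pt`. You can list them with plain shell:

```bash
# List the saved checkpoints
ls -lh toy-ende/run/model_step_*.pt
```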
|
|
|
This configuration will run the default model, which consists of a 2-layer LSTM with 500 hidden units in both the encoder and the decoder. It will run on a single GPU (`world_size: 1` and `gpu_ranks: [0]`).
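
Those two options are what you would adjust for a different setup. For example, a hypothetical two-GPU run (assuming two visible CUDA devices) would use:

```yaml
# Hypothetical variant: train with one process per GPU on two GPUs
world_size: 2
gpu_ranks: [0, 1]
```

Omitting `gpu_ranks` entirely should fall back to CPU training, which will be much slower.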
|
|
|
Before the training process actually starts, the `*.vocab.pt` and `*.transforms.pt` files can be dumped to the `-save_data` location, with the configuration taken from the `-config` YAML file, by enabling the `-dump_fields` and `-dump_transforms` flags. It is also possible to generate transformed samples to simplify any required visual inspection. The number of sample lines to dump per corpus is set with the `-n_sample` flag.
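
For example, an invocation combining those flags could look like the following (a sketch built from the flags described above; `-n_sample 3` is an arbitrary illustration value):

```bash
# Dump fields/transforms and 3 transformed sample lines per corpus
onmt_train -config toy_en_de.yaml -dump_fields -dump_transforms -n_sample 3
```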
|
|
|
For more advanced models and parameters, see [other example configurations](https://github.com/OpenNMT/OpenNMT-py/tree/master/config) or the [FAQ](FAQ).
|
|
|
### Step 3: Translate |
|
|
|
```bash
onmt_translate -model toy-ende/run/model_step_1000.pt -src toy-ende/src-test.txt -output toy-ende/pred_1000.txt -gpu 0 -verbose
```
|
|
|
You now have a model that you can use to predict on new data, which is done by running beam search. The predictions are written to `toy-ende/pred_1000.txt`.
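
If you want a rough quality score, the predictions can be compared against a reference with an external tool such as sacreBLEU. This sketch assumes `pip install sacrebleu` and that a reference file `toy-ende/tgt-test.txt` exists alongside `src-test.txt`:

```bash
# Score the predictions against the reference translations
sacrebleu toy-ende/tgt-test.txt -i toy-ende/pred_1000.txt -m bleu
```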
|
|
|
**Note**: |
|
|
|
The predictions are going to be quite terrible, as the demo dataset is small. Try running on some larger datasets! For example you can download millions of parallel sentences for [translation](http://www.statmt.org/wmt16/translation-task.html) or [summarization](https://github.com/harvardnlp/sent-summary). |
|
|