# How to use OpenNMT-py as a Library

The example notebook (available [here](https://github.com/OpenNMT/OpenNMT-py/blob/master/docs/source/examples/Library.ipynb)) should run standalone, provided `onmt` is importable (for instance, installed via `pip`).

Some parts of the API are not yet fully 'library-friendly', but most of it is workable.

### Import a few modules and functions that will be necessary

In [1]:
import yaml
import torch
import torch.nn as nn
from argparse import Namespace
from collections import defaultdict, Counter

In [2]:
import onmt
from onmt.inputters.inputter import _load_vocab, _build_fields_vocab, get_fields, IterOnDevice
from onmt.inputters.corpus import ParallelCorpus
from onmt.inputters.dynamic_iterator import DynamicDatasetIter
from onmt.translate import GNMTGlobalScorer, Translator, TranslationBuilder
from onmt.utils.misc import set_random_seed

### Enable logging

In [3]:
# enable logging
from onmt.utils.logging import init_logger, logger
init_logger()
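As a rough illustration of what the logger produces, the sketch below uses only the standard `logging` module to emit lines in the same `[time LEVEL] message` layout as the log output shown in this notebook. The helper `make_logger` and the logger name are hypothetical; this is not the `onmt` implementation.

```python
import io
import logging


def make_logger(stream, name="onmt_sketch"):
    """Return a logger writing '[time LEVEL] message' lines to `stream`."""
    log = logging.getLogger(name)
    log.setLevel(logging.INFO)
    log.handlers.clear()  # avoid duplicate handlers when the cell is re-run
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("[%(asctime)s %(levelname)s] %(message)s"))
    log.addHandler(handler)
    log.propagate = False  # keep the root logger from printing the line twice
    return log


buffer = io.StringIO()
sketch_logger = make_logger(buffer)
sketch_logger.info("Parsed 2 corpora from -data.")
log_line = buffer.getvalue().strip()
print(log_line)
```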



### Set random seed

In [4]:
is_cuda = torch.cuda.is_available()
set_random_seed(1111, is_cuda)
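Seeding matters because it makes runs comparable: with the same seed, the same pseudo-random choices (initial weights, shuffling) are drawn. The sketch below shows the principle with the standard `random` module only; `seeded_sample` is a hypothetical helper, not part of `onmt`.

```python
import random


def seeded_sample(seed, n=5):
    """Draw n pseudo-random integers from a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [rng.randint(0, 99) for _ in range(n)]


first = seeded_sample(1111)
second = seeded_sample(1111)
# Same seed, same sequence: this is what makes an experiment repeatable.
print(first == second)
```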

### Retrieve data

To make a proper example, we need some data, as well as one or more vocabularies.

Let's take the same data as in the [quickstart](https://opennmt.net/OpenNMT-py/quickstart.html):

In [5]:
!wget https://s3.amazonaws.com/opennmt-trainingdata/toy-ende.tar.gz

--2020-09-25 15:28:05-- https://s3.amazonaws.com/opennmt-trainingdata/toy-ende.tar.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.18.38
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.18.38|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1662081 (1,6M) [application/x-gzip]
Saving to: ‘toy-ende.tar.gz.5’


2020-09-25 15:28:07 (2,33 MB/s) - ‘toy-ende.tar.gz.5’ saved [1662081/1662081]



In [6]:
!tar xf toy-ende.tar.gz

In [7]:
ls toy-ende

config.yaml src-test.txt src-val.txt tgt-train.txt
run/ src-train.txt tgt-test.txt tgt-val.txt


### Prepare data and vocab

As with any OpenNMT-py 2.0 use case, we start by creating a simple YAML configuration that lists our datasets. This is the easiest way to build the `opts` `Namespace` that will be used to create the vocabularies.

In [8]:
yaml_config = """
## Where the vocab(s) will be written
save_data: toy-ende/run/example
src_vocab: toy-ende/run/example.vocab.src
tgt_vocab: toy-ende/run/example.vocab.tgt
# Corpus opts:
data:
    corpus:
        path_src: toy-ende/src-train.txt
        path_tgt: toy-ende/tgt-train.txt
        transforms: []
        weight: 1
    valid:
        path_src: toy-ende/src-val.txt
        path_tgt: toy-ende/tgt-val.txt
        transforms: []
"""
config = yaml.safe_load(yaml_config)
with open("toy-ende/config.yaml", "w") as f:
    f.write(yaml_config)
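Because the nesting under `data` is significant in YAML, it can be worth sanity-checking the parsed dictionary before going further. The sketch below is a hypothetical helper (`check_corpora` is not an `onmt` function) that takes an already-parsed config and reports corpora missing `path_src` or `path_tgt`. A mis-indented block typically shows up as a corpus whose value is `None`.

```python
def check_corpora(config):
    """Return a list of problems found in the `data` section of a parsed config."""
    problems = []
    data = config.get("data") or {}
    if not data:
        problems.append("no corpora declared under 'data'")
    for name, entry in data.items():
        if not isinstance(entry, dict):
            problems.append(
                f"{name}: expected a mapping, got {type(entry).__name__}")
            continue
        for key in ("path_src", "path_tgt"):
            if key not in entry:
                problems.append(f"{name}: missing '{key}'")
    return problems


# Same structure as the YAML above, written as a plain dict.
good = {"data": {
    "corpus": {"path_src": "toy-ende/src-train.txt",
               "path_tgt": "toy-ende/tgt-train.txt",
               "transforms": [], "weight": 1},
    "valid": {"path_src": "toy-ende/src-val.txt",
              "path_tgt": "toy-ende/tgt-val.txt",
              "transforms": []}}}
# What a flattened (mis-indented) block would look like once parsed.
bad = {"data": {"corpus": None, "path_src": "toy-ende/src-train.txt"}}

good_problems = check_corpora(good)
bad_problems = check_corpora(bad)
print(good_problems)
print(bad_problems)
```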

In [9]:
from onmt.utils.parse import ArgumentParser
parser = ArgumentParser(description='build_vocab.py')

In [10]:
from onmt.opts import dynamic_prepare_opts
dynamic_prepare_opts(parser, build_vocab_only=True)

In [11]:
base_args = (["-config", "toy-ende/config.yaml", "-n_sample", "10000"])
opts, unknown = parser.parse_known_args(base_args)
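`parse_known_args` returns two values: a namespace holding the options the parser recognizes, and a list of leftover arguments it does not. The sketch below shows this behaviour with the standard `argparse` module. The parser here is a stand-in that declares only two options, and `-extra_flag` is a made-up option used to show what lands in the leftover list.

```python
import argparse

# Stand-in parser for illustration: only two options are declared.
sketch_parser = argparse.ArgumentParser(description="parse_known_args sketch")
sketch_parser.add_argument("-config", help="path to a YAML configuration file")
sketch_parser.add_argument("-n_sample", type=int, default=0,
                           help="number of lines to sample")

argv = ["-config", "toy-ende/config.yaml", "-n_sample", "10000",
        "-extra_flag", "value"]
known, unknown = sketch_parser.parse_known_args(argv)

print(known.config, known.n_sample)  # recognized options, already typed
print(unknown)                       # everything the parser did not recognize
```

Checking that the leftover list is empty is a cheap way to catch misspelled option names.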

In [12]:
opts



In [13]:
from onmt.bin.build_vocab import build_vocab_main
build_vocab_main(opts)

[2020-09-25 15:28:08,068 INFO] Parsed 2 corpora from -data.
[2020-09-25 15:28:08,069 INFO] Counter vocab from 10000 samples.
[2020-09-25 15:28:08,070 INFO] Save 10000 transformed example/corpus.
[2020-09-25 15:28:08,070 INFO] corpus's transforms: TransformPipe()
[2020-09-25 15:28:08,101 INFO] Loading ParallelCorpus(toy-ende/src-train.txt, toy-ende/tgt-train.txt, align=None)...
[2020-09-25 15:28:08,320 INFO] Just finished the first loop
[2020-09-25 15:28:08,320 INFO] Counters src:24995
[2020-09-25 15:28:08,321 INFO] Counters tgt:35816


In [14]:
ls toy-ende/run

example.vocab.src example.vocab.tgt sample/


We just created our source and target vocabularies, respectively `toy-ende/run/example.vocab.src` and `toy-ende/run/example.vocab.tgt`.
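To inspect such a vocabulary outside of `onmt`, something like the sketch below may help. It assumes each line holds a token and its count separated by a tab; that layout is an assumption to verify against your own files, and `read_vocab` is a hypothetical helper.

```python
import io
from collections import Counter


def read_vocab(lines):
    """Parse 'token<TAB>count' lines into a Counter (assumed file layout)."""
    counter = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        token, _, count = line.partition("\t")
        # Fall back to a count of 1 when a line holds only a token.
        counter[token] = int(count) if count else 1
    return counter


# In-memory stand-in for a vocabulary file; the counts are made up.
sample = io.StringIO("the\t120\n,\t95\nParliament\t3\n")
vocab_counts = read_vocab(sample)
print(vocab_counts.most_common(2))
```

With a real file, pass `open(src_vocab_path, encoding="utf-8")` instead of the `StringIO` object.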

### Build fields

We can build the fields from the text files that were just created.

In [15]:
src_vocab_path = "toy-ende/run/example.vocab.src"
tgt_vocab_path = "toy-ende/run/example.vocab.tgt"

In [16]:
# initialize the frequency counter
counters = defaultdict(Counter)
# load source vocab
_src_vocab, _src_vocab_size = _load_vocab(
 src_vocab_path,
 'src',
 counters)
# load target vocab
_tgt_vocab, _tgt_vocab_size = _load_vocab(
 tgt_vocab_path,
 'tgt',
 counters)

[2020-09-25 15:28:08,495 INFO] Loading src vocabulary from toy-ende/run/example.vocab.src
[2020-09-25 15:28:08,554 INFO] Loaded src vocab has 24995 tokens.
[2020-09-25 15:28:08,562 INFO] Loading tgt vocabulary from toy-ende/run/example.vocab.tgt
[2020-09-25 15:28:08,617 INFO] Loaded tgt vocab has 35816 tokens.


In [17]:
# initialize fields
src_nfeats, tgt_nfeats = 0, 0 # do not support word features for now
fields = get_fields(
 'text', src_nfeats, tgt_nfeats)

In [18]:
fields

{'src': ,
 'tgt': ,
 'indices': }

In [19]:
# build fields vocab
share_vocab = False
vocab_size_multiple = 1
src_vocab_size = 30000
tgt_vocab_size = 30000
src_words_min_frequency = 1
tgt_words_min_frequency = 1
vocab_fields = _build_fields_vocab(
 fields, counters, 'text', share_vocab,
 vocab_size_multiple,
 src_vocab_size, src_words_min_frequency,
 tgt_vocab_size, tgt_words_min_frequency)

[2020-09-25 15:28:08,699 INFO] * tgt vocab size: 30004.
[2020-09-25 15:28:08,749 INFO] * src vocab size: 24997.
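The reported sizes can be reproduced by hand. The target counter holds 35816 tokens, which is capped at `tgt_vocab_size = 30000`, plus 4 special tokens, giving 30004. The source counter holds 24995 tokens, below the cap, plus 2 special tokens, giving 24997. The sketch below reproduces this arithmetic; `build_itos` is a hypothetical helper and the special-token names are illustrative assumptions.

```python
from collections import Counter


def build_itos(counter, max_size, min_freq, specials):
    """Keep the most frequent tokens (up to max_size), specials first."""
    kept = [tok for tok, freq in counter.most_common() if freq >= min_freq]
    return list(specials) + kept[:max_size]


# Dummy counters with the same number of distinct tokens as in the logs.
src_counter = Counter({f"s{i}": 1 for i in range(24995)})
tgt_counter = Counter({f"t{i}": 1 for i in range(35816)})

src_itos = build_itos(src_counter, 30000, 1, ["<unk>", "<blank>"])
tgt_itos = build_itos(tgt_counter, 30000, 1,
                      ["<unk>", "<blank>", "<s>", "</s>"])
print(len(src_itos), len(tgt_itos))
```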


An alternative way of creating these fields is to run `onmt_train` without actually training, to just output the necessary files.

### Prepare for training: model and optimizer creation

Let's get a few fields/vocab related variables to simplify the model creation a bit:

In [20]:
src_text_field = vocab_fields["src"].base_field
src_vocab = src_text_field.vocab
src_padding = src_vocab.stoi[src_text_field.pad_token]

tgt_text_field = vocab_fields['tgt'].base_field
tgt_vocab = tgt_text_field.vocab
tgt_padding = tgt_vocab.stoi[tgt_text_field.pad_token]

Next we specify the core model itself. Here we build a small model with an encoder and an attention-based, input-feeding decoder. Both are RNNs, and the encoder is bidirectional.

In [21]:
emb_size = 100
rnn_size = 500
# Specify the core model.

encoder_embeddings = onmt.modules.Embeddings(emb_size, len(src_vocab),
 word_padding_idx=src_padding)

encoder = onmt.encoders.RNNEncoder(hidden_size=rnn_size, num_layers=1,
 rnn_type="LSTM", bidirectional=True,
 embeddings=encoder_embeddings)

decoder_embeddings = onmt.modules.Embeddings(emb_size, len(tgt_vocab),
 word_padding_idx=tgt_padding)
decoder = onmt.decoders.decoder.InputFeedRNNDecoder(
 hidden_size=rnn_size, num_layers=1, bidirectional_encoder=True, 
 rnn_type="LSTM", embeddings=decoder_embeddings)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = onmt.models.model.NMTModel(encoder, decoder)
model.to(device)

# Specify the tgt word generator and loss computation module
model.generator = nn.Sequential(
 nn.Linear(rnn_size, len(tgt_vocab)),
 nn.LogSoftmax(dim=-1)).to(device)

loss = onmt.utils.loss.NMTLossCompute(
 criterion=nn.NLLLoss(ignore_index=tgt_padding, reduction="sum"),
 generator=model.generator)
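The generator ends in a log-softmax, and the criterion is a negative log-likelihood summed over tokens, with padding positions ignored. The pure-Python sketch below spells out that computation on toy numbers; it is meant as an illustration of the formula, not of the `onmt` or `torch` internals, and the helper names are hypothetical.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def summed_nll(batch_scores, targets, pad_index):
    """Sum of -log p(target) over non-padding positions, plus their count."""
    total, n_tokens = 0.0, 0
    for scores, target in zip(batch_scores, targets):
        if target == pad_index:
            continue  # padding contributes neither loss nor token count
        total -= log_softmax(scores)[target]
        n_tokens += 1
    return total, n_tokens


# Three positions over a 4-word vocabulary; index 1 plays the role of padding.
toy_scores = [[0.0, 0.0, 0.0, 0.0]] * 3
toy_targets = [2, 3, 1]
toy_loss, toy_tokens = summed_nll(toy_scores, toy_targets, pad_index=1)
print(toy_loss, toy_tokens)
```

With uniform scores, each real token costs `log(4)`, so the two non-padding positions sum to `2 * log(4)`.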

Now we set up the optimizer. This could be a core torch optim class, or our wrapper which handles learning rate updates and gradient normalization automatically.

In [22]:
lr = 1
torch_optimizer = torch.optim.SGD(model.parameters(), lr=lr)
optim = onmt.utils.optimizers.Optimizer(
 torch_optimizer, learning_rate=lr, max_grad_norm=2)
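Assuming `max_grad_norm` has its usual meaning (an upper bound on the global L2 norm of the gradients), the effect can be sketched in a few lines of plain Python. `clip_by_global_norm` is a hypothetical helper, and the real implementation operates on tensors rather than lists.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale grads so that their global L2 norm does not exceed max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm


clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=2)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before, clipped, norm_after)
```

The direction of the update is preserved; only its length is capped, which helps keep RNN training stable with a learning rate as high as 1.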

### Create the training and validation data iterators

Now we need to create the dynamic dataset iterator.

This is not very 'library-friendly' for now because of the way the `DynamicDatasetIter` constructor is defined. It may evolve in the future.

In [23]:
src_train = "toy-ende/src-train.txt"
tgt_train = "toy-ende/tgt-train.txt"
src_val = "toy-ende/src-val.txt"
tgt_val = "toy-ende/tgt-val.txt"

# build the ParallelCorpus
corpus = ParallelCorpus("corpus", src_train, tgt_train)
valid = ParallelCorpus("valid", src_val, tgt_val)
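Conceptually, a parallel corpus pairs line `i` of the source file with line `i` of the target file. The sketch below shows that idea with plain file reading; `read_parallel` is a hypothetical helper, and the actual `ParallelCorpus` class may differ in details such as alignment handling.

```python
import os
import tempfile


def read_parallel(src_path, tgt_path):
    """Yield one {'src': ..., 'tgt': ...} dict per pair of aligned lines."""
    with open(src_path, encoding="utf-8") as src_f, \
            open(tgt_path, encoding="utf-8") as tgt_f:
        for src_line, tgt_line in zip(src_f, tgt_f):
            yield {"src": src_line.strip(), "tgt": tgt_line.strip()}


# Tiny made-up files so that the sketch runs on its own.
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "src.txt")
    tgt_path = os.path.join(tmp, "tgt.txt")
    with open(src_path, "w", encoding="utf-8") as f:
        f.write("Hello world\nGood morning\n")
    with open(tgt_path, "w", encoding="utf-8") as f:
        f.write("Hallo Welt\nGuten Morgen\n")
    pairs = list(read_parallel(src_path, tgt_path))

print(pairs)
```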

In [24]:
# build the training iterator
train_iter = DynamicDatasetIter(
 corpora={"corpus": corpus},
 corpora_info={"corpus": {"weight": 1}},
 transforms={},
 fields=vocab_fields,
 is_train=True,
 batch_type="tokens",
 batch_size=4096,
 batch_size_multiple=1,
 data_type="text")

In [25]:
# run the iteration on GPU 0 if available, else on CPU (-1 for CPU, N for GPU N)
train_iter = iter(IterOnDevice(train_iter, 0 if torch.cuda.is_available() else -1))

In [26]:
# build the validation iterator
valid_iter = DynamicDatasetIter(
 corpora={"valid": valid},
 corpora_info={"valid": {"weight": 1}},
 transforms={},
 fields=vocab_fields,
 is_train=False,
 batch_type="sents",
 batch_size=8,
 batch_size_multiple=1,
 data_type="text")

In [27]:
valid_iter = IterOnDevice(valid_iter, 0 if torch.cuda.is_available() else -1)
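The two iterators use different batching units: the training iterator counts tokens (`batch_type="tokens"`, `batch_size=4096`), while the validation iterator counts sentences (`batch_type="sents"`, `batch_size=8`). The sketch below is a simplified illustration of the difference, where the token cost of a batch is taken as the number of examples times the longest example. `batch_examples` is a hypothetical helper; the real iterator also buckets and sorts examples.

```python
def batch_examples(lengths, batch_size, batch_type="sents"):
    """Group example indices into batches bounded by sentences or tokens."""
    batches, current, max_len = [], [], 0
    for idx, length in enumerate(lengths):
        new_max = max(max_len, length)
        if batch_type == "tokens":
            cost = (len(current) + 1) * new_max  # size once padded
        else:
            cost = len(current) + 1              # number of sentences
        if current and cost > batch_size:
            batches.append(current)              # close the full batch
            current, new_max = [], length
        current.append(idx)
        max_len = new_max
    if current:
        batches.append(current)
    return batches


lengths = [5, 7, 3, 10, 2]  # made-up sentence lengths
by_tokens = batch_examples(lengths, batch_size=16, batch_type="tokens")
by_sents = batch_examples(lengths, batch_size=2, batch_type="sents")
print(by_tokens)
print(by_sents)
```

Token-based batching keeps memory use roughly constant per step, which is why it is the usual choice for training.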

### Training

Finally we train.

In [28]:
report_manager = onmt.utils.ReportMgr(
 report_every=50, start_time=None, tensorboard_writer=None)

trainer = onmt.Trainer(model=model,
 train_loss=loss,
 valid_loss=loss,
 optim=optim,
 report_manager=report_manager,
 dropout=[0.1])

trainer.train(train_iter=train_iter,
 train_steps=1000,
 valid_iter=valid_iter,
 valid_steps=500)

[2020-09-25 15:28:15,184 INFO] Start training loop and validate every 500 steps...
[2020-09-25 15:28:15,185 INFO] corpus's transforms: TransformPipe()
[2020-09-25 15:28:15,187 INFO] Loading ParallelCorpus(toy-ende/src-train.txt, toy-ende/tgt-train.txt, align=None)...
[2020-09-25 15:28:21,140 INFO] Step 50/ 1000; acc: 7.52; ppl: 8832.29; xent: 9.09; lr: 1.00000; 18916/18871 tok/s; 6 sec
[2020-09-25 15:28:24,869 INFO] Loading ParallelCorpus(toy-ende/src-train.txt, toy-ende/tgt-train.txt, align=None)...
[2020-09-25 15:28:27,121 INFO] Step 100/ 1000; acc: 9.34; ppl: 1840.06; xent: 7.52; lr: 1.00000; 18911/18785 tok/s; 12 sec
[2020-09-25 15:28:33,048 INFO] Step 150/ 1000; acc: 10.35; ppl: 1419.18; xent: 7.26; lr: 1.00000; 19062/19017 tok/s; 18 sec
[2020-09-25 15:28:37,019 INFO] Loading ParallelCorpus(toy-ende/src-train.txt, toy-ende/tgt-train.txt, align=None)...
[2020-09-25 15:28:39,022 INFO] Step 200/ 1000; acc: 11.14; ppl: 1127.44; xent: 7.03; lr: 1.00000; 19084/18911 tok/s; 24 sec
[2020-
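The `acc`, `xent` and `ppl` columns of the training log are tied together: `xent` is the summed loss divided by the number of target tokens, and `ppl` is its exponential. For instance, `exp(9.09)` is about 8866, close to the 8832.29 logged at step 50; the gap comes from `xent` being rounded to two decimals. The sketch below shows the relations on made-up numbers, and `report` is a hypothetical helper.

```python
import math


def report(total_loss, n_words, n_correct):
    """Derive accuracy, cross-entropy and perplexity from raw counts."""
    xent = total_loss / n_words
    return {
        "acc": 100.0 * n_correct / n_words,
        "xent": xent,
        "ppl": math.exp(min(xent, 100)),  # capped to avoid overflow
    }


# Made-up counts: 10 tokens, each costing log(4), half of them correct.
stats = report(total_loss=10 * math.log(4), n_words=10, n_correct=5)
print(stats)
```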



### Translate

For translation, we can build a "traditional" (as opposed to dynamic) dataset for now.

In [29]:
src_data = {"reader": onmt.inputters.str2reader["text"](), "data": src_val}
tgt_data = {"reader": onmt.inputters.str2reader["text"](), "data": tgt_val}
_readers, _data = onmt.inputters.Dataset.config(
 [('src', src_data), ('tgt', tgt_data)])

In [30]:
dataset = onmt.inputters.Dataset(
 vocab_fields, readers=_readers, data=_data,
 sort_key=onmt.inputters.str2sortkey["text"])

In [31]:
data_iter = onmt.inputters.OrderedIterator(
 dataset=dataset,
 device=device,  # "cuda" or "cpu", as chosen when the model was created
 batch_size=10,
 train=False,
 sort=False,
 sort_within_batch=True,
 shuffle=False
 )

In [32]:
src_reader = onmt.inputters.str2reader["text"]
tgt_reader = onmt.inputters.str2reader["text"]
scorer = GNMTGlobalScorer(alpha=0.7, 
 beta=0., 
 length_penalty="avg", 
 coverage_penalty="none")
gpu = 0 if torch.cuda.is_available() else -1
translator = Translator(model=model, 
 fields=vocab_fields, 
 src_reader=src_reader, 
 tgt_reader=tgt_reader, 
 global_scorer=scorer,
 gpu=gpu)
builder = onmt.translate.TranslationBuilder(data=dataset, 
 fields=vocab_fields)
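The scorer normalizes hypothesis scores by length so that beam search does not systematically prefer short outputs. In our reading of the options, `"avg"` divides the summed log-probability by the hypothesis length, while `alpha` only plays a role in the `"wu"` penalty from the GNMT paper. Treat this as an assumption to check against the `onmt` source. The sketch below compares the two on made-up numbers with hypothetical helper names.

```python
def length_penalty(length, kind="avg", alpha=0.0):
    """Return the divisor applied to a hypothesis' summed log-probability."""
    if kind == "avg":
        return float(length)                   # plain per-token average
    if kind == "wu":
        return ((5.0 + length) / 6.0) ** alpha  # GNMT-style penalty
    return 1.0                                  # no normalization


def normalized_score(sum_log_prob, length, kind="avg", alpha=0.0):
    """Length-normalized score of a hypothesis."""
    return sum_log_prob / length_penalty(length, kind, alpha)


avg_score = normalized_score(-16.0, 10, kind="avg")
wu_score = normalized_score(-16.0, 7, kind="wu", alpha=1.0)
raw_score = normalized_score(-16.0, 10, kind="none")
print(avg_score, wu_score, raw_score)
```

Under this reading, a `PRED SCORE` such as -1.5935 is an average log-probability per generated token.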

**Note**: translations will be very poor, because of the very low quantity of data, the absence of proper tokenization, and the brevity of the training.

In [33]:
for batch in data_iter:
    trans_batch = translator.translate_batch(
        batch=batch, src_vocabs=[src_vocab],
        attn_debug=False)
    translations = builder.from_batch(trans_batch)
    for trans in translations:
        print(trans.log(0))
    break


SENT 0: ['Parliament', 'Does', 'Not', 'Support', 'Amendment', 'Freeing', 'Tymoshenko']
PRED 0: Parlament das Parlament über die Europäische Parlament , die sich in der Lage in der Lage ist , die es in der Lage sind .
PRED SCORE: -1.5935


SENT 0: ['Today', ',', 'the', 'Ukraine', 'parliament', 'dismissed', ',', 'within', 'the', 'Code', 'of', 'Criminal', 'Procedure', 'amendment', ',', 'the', 'motion', 'to', 'revoke', 'an', 'article', 'based', 'on', 'which', 'the', 'opposition', 'leader', ',', 'Yulia', 'Tymoshenko', ',', 'was', 'sentenced', '.']
PRED 0: In der Nähe des Hotels , die in der Lage , die sich in der Lage ist , in der Lage , die in der Lage , die in der Lage ist .
PRED SCORE: -1.7173


SENT 0: ['The', 'amendment', 'that', 'would', 'lead', 'to', 'freeing', 'the', 'imprisoned', 'former', 'Prime', 'Minister', 'was', 'revoked', 'during', 'second', 'reading', 'of', 'the', 'proposal', 'for', 'mitigation', 'of', 'sentences', 'for', 'economic', 'offences', '.']
PRED 0: Die Tatsache , 