Spaces:

ICML2022
/

OFA

Runtime error

App Files Files Community

OFA / fairseq /docs /tutorial_simple_lstm.rst

logicwong

init

c9bb3f2 almost 2 years ago

raw history blame contribute delete

No virus

21.2 kB

	Tutorial: Simple LSTM
	=====================

	In this tutorial we will extend fairseq by adding a new
	:class:`~fairseq.models.FairseqEncoderDecoderModel` that encodes a source
	sentence with an LSTM and then passes the final hidden state to a second LSTM
	that decodes the target sentence (without attention).

	This tutorial covers:

	1. Writing an Encoder and Decoder to encode/decode the source/target
	sentence, respectively.
	2. Registering a new Model so that it can be used with the existing
	:ref:`Command-line tools`.
	3. Training the Model using the existing command-line tools.
	4. Making generation faster by modifying the Decoder to use
	:ref:`Incremental decoding`.


	1. Building an Encoder and Decoder
	----------------------------------

	In this section we'll define a simple LSTM Encoder and Decoder. All Encoders
	should implement the :class:`~fairseq.models.FairseqEncoder` interface and
	Decoders should implement the :class:`~fairseq.models.FairseqDecoder` interface.
	These interfaces themselves extend :class:`torch.nn.Module`, so FairseqEncoders
	and FairseqDecoders can be written and used in the same ways as ordinary PyTorch
	Modules.


	Encoder
	~~~~~~~

	Our Encoder will embed the tokens in the source sentence, feed them to a
	:class:`torch.nn.LSTM` and return the final hidden state. To create our encoder
	save the following in a new file named :file:`fairseq/models/simple_lstm.py`::

	import torch.nn as nn
	from fairseq import utils
	from fairseq.models import FairseqEncoder

	class SimpleLSTMEncoder(FairseqEncoder):

	def __init__(
	self, args, dictionary, embed_dim=128, hidden_dim=128, dropout=0.1,
	):
	super().__init__(dictionary)
	self.args = args

	# Our encoder will embed the inputs before feeding them to the LSTM.
	self.embed_tokens = nn.Embedding(
	num_embeddings=len(dictionary),
	embedding_dim=embed_dim,
	padding_idx=dictionary.pad(),
	)
	self.dropout = nn.Dropout(p=dropout)

	# We'll use a single-layer, unidirectional LSTM for simplicity.
	self.lstm = nn.LSTM(
	input_size=embed_dim,
	hidden_size=hidden_dim,
	num_layers=1,
	bidirectional=False,
	batch_first=True,
	)

	def forward(self, src_tokens, src_lengths):
	# The inputs to the ``forward()`` function are determined by the
	# Task, and in particular the ``'net_input'`` key in each
	# mini-batch. We discuss Tasks in the next tutorial, but for now just
	# know that src_tokens has shape `(batch, src_len)` and src_lengths
	# has shape `(batch)`.

	# Note that the source is typically padded on the left. This can be
	# configured by adding the `--left-pad-source "False"` command-line
	# argument, but here we'll make the Encoder handle either kind of
	# padding by converting everything to be right-padded.
	if self.args.left_pad_source:
	# Convert left-padding to right-padding.
	src_tokens = utils.convert_padding_direction(
	src_tokens,
	padding_idx=self.dictionary.pad(),
	left_to_right=True
	)

	# Embed the source.
	x = self.embed_tokens(src_tokens)

	# Apply dropout.
	x = self.dropout(x)

	# Pack the sequence into a PackedSequence object to feed to the LSTM.
	x = nn.utils.rnn.pack_padded_sequence(x, src_lengths, batch_first=True)

	# Get the output from the LSTM.
	_outputs, (final_hidden, _final_cell) = self.lstm(x)

	# Return the Encoder's output. This can be any object and will be
	# passed directly to the Decoder.
	return {
	# this will have shape `(bsz, hidden_dim)`
	'final_hidden': final_hidden.squeeze(0),
	}

	# Encoders are required to implement this method so that we can rearrange
	# the order of the batch elements during inference (e.g., beam search).
	def reorder_encoder_out(self, encoder_out, new_order):
	"""
	Reorder encoder output according to `new_order`.

	Args:
	encoder_out: output from the ``forward()`` method
	new_order (LongTensor): desired order

	Returns:
	`encoder_out` rearranged according to `new_order`
	"""
	final_hidden = encoder_out['final_hidden']
	return {
	'final_hidden': final_hidden.index_select(0, new_order),
	}


	Decoder
	~~~~~~~

	Our Decoder will predict the next word, conditioned on the Encoder's final
	hidden state and an embedded representation of the previous target word -- which
	is sometimes called teacher forcing. More specifically, we'll use a
	:class:`torch.nn.LSTM` to produce a sequence of hidden states that we'll project
	to the size of the output vocabulary to predict each target word.

	::

	import torch
	from fairseq.models import FairseqDecoder

	class SimpleLSTMDecoder(FairseqDecoder):

	def __init__(
	self, dictionary, encoder_hidden_dim=128, embed_dim=128, hidden_dim=128,
	dropout=0.1,
	):
	super().__init__(dictionary)

	# Our decoder will embed the inputs before feeding them to the LSTM.
	self.embed_tokens = nn.Embedding(
	num_embeddings=len(dictionary),
	embedding_dim=embed_dim,
	padding_idx=dictionary.pad(),
	)
	self.dropout = nn.Dropout(p=dropout)

	# We'll use a single-layer, unidirectional LSTM for simplicity.
	self.lstm = nn.LSTM(
	# For the first layer we'll concatenate the Encoder's final hidden
	# state with the embedded target tokens.
	input_size=encoder_hidden_dim + embed_dim,
	hidden_size=hidden_dim,
	num_layers=1,
	bidirectional=False,
	)

	# Define the output projection.
	self.output_projection = nn.Linear(hidden_dim, len(dictionary))

	# During training Decoders are expected to take the entire target sequence
	# (shifted right by one position) and produce logits over the vocabulary.
	# The prev_output_tokens tensor begins with the end-of-sentence symbol,
	# ``dictionary.eos()``, followed by the target sequence.
	def forward(self, prev_output_tokens, encoder_out):
	"""
	Args:
	prev_output_tokens (LongTensor): previous decoder outputs of shape
	`(batch, tgt_len)`, for teacher forcing
	encoder_out (Tensor, optional): output from the encoder, used for
	encoder-side attention

	Returns:
	tuple:
	- the last decoder layer's output of shape
	`(batch, tgt_len, vocab)`
	- the last decoder layer's attention weights of shape
	`(batch, tgt_len, src_len)`
	"""
	bsz, tgt_len = prev_output_tokens.size()

	# Extract the final hidden state from the Encoder.
	final_encoder_hidden = encoder_out['final_hidden']

	# Embed the target sequence, which has been shifted right by one
	# position and now starts with the end-of-sentence symbol.
	x = self.embed_tokens(prev_output_tokens)

	# Apply dropout.
	x = self.dropout(x)

	# Concatenate the Encoder's final hidden state to every embedded
	# target token.
	x = torch.cat(
	[x, final_encoder_hidden.unsqueeze(1).expand(bsz, tgt_len, -1)],
	dim=2,
	)

	# Using PackedSequence objects in the Decoder is harder than in the
	# Encoder, since the targets are not sorted in descending length order,
	# which is a requirement of ``pack_padded_sequence()``. Instead we'll
	# feed nn.LSTM directly.
	initial_state = (
	final_encoder_hidden.unsqueeze(0), # hidden
	torch.zeros_like(final_encoder_hidden).unsqueeze(0), # cell
	)
	output, _ = self.lstm(
	x.transpose(0, 1), # convert to shape `(tgt_len, bsz, dim)`
	initial_state,
	)
	x = output.transpose(0, 1) # convert to shape `(bsz, tgt_len, hidden)`

	# Project the outputs to the size of the vocabulary.
	x = self.output_projection(x)

	# Return the logits and ``None`` for the attention weights
	return x, None


	2. Registering the Model
	------------------------

	Now that we've defined our Encoder and Decoder we must register our model with
	fairseq using the :func:`~fairseq.models.register_model` function decorator.
	Once the model is registered we'll be able to use it with the existing
	:ref:`Command-line Tools`.

	All registered models must implement the
	:class:`~fairseq.models.BaseFairseqModel` interface. For sequence-to-sequence
	models (i.e., any model with a single Encoder and Decoder), we can instead
	implement the :class:`~fairseq.models.FairseqEncoderDecoderModel` interface.

	Create a small wrapper class in the same file and register it in fairseq with
	the name ``'simple_lstm'``::

	from fairseq.models import FairseqEncoderDecoderModel, register_model

	# Note: the register_model "decorator" should immediately precede the
	# definition of the Model class.

	@register_model('simple_lstm')
	class SimpleLSTMModel(FairseqEncoderDecoderModel):

	@staticmethod
	def add_args(parser):
	# Models can override this method to add new command-line arguments.
	# Here we'll add some new command-line arguments to configure dropout
	# and the dimensionality of the embeddings and hidden states.
	parser.add_argument(
	'--encoder-embed-dim', type=int, metavar='N',
	help='dimensionality of the encoder embeddings',
	)
	parser.add_argument(
	'--encoder-hidden-dim', type=int, metavar='N',
	help='dimensionality of the encoder hidden state',
	)
	parser.add_argument(
	'--encoder-dropout', type=float, default=0.1,
	help='encoder dropout probability',
	)
	parser.add_argument(
	'--decoder-embed-dim', type=int, metavar='N',
	help='dimensionality of the decoder embeddings',
	)
	parser.add_argument(
	'--decoder-hidden-dim', type=int, metavar='N',
	help='dimensionality of the decoder hidden state',
	)
	parser.add_argument(
	'--decoder-dropout', type=float, default=0.1,
	help='decoder dropout probability',
	)

	@classmethod
	def build_model(cls, args, task):
	# Fairseq initializes models by calling the ``build_model()``
	# function. This provides more flexibility, since the returned model
	# instance can be of a different type than the one that was called.
	# In this case we'll just return a SimpleLSTMModel instance.

	# Initialize our Encoder and Decoder.
	encoder = SimpleLSTMEncoder(
	args=args,
	dictionary=task.source_dictionary,
	embed_dim=args.encoder_embed_dim,
	hidden_dim=args.encoder_hidden_dim,
	dropout=args.encoder_dropout,
	)
	decoder = SimpleLSTMDecoder(
	dictionary=task.target_dictionary,
	encoder_hidden_dim=args.encoder_hidden_dim,
	embed_dim=args.decoder_embed_dim,
	hidden_dim=args.decoder_hidden_dim,
	dropout=args.decoder_dropout,
	)
	model = SimpleLSTMModel(encoder, decoder)

	# Print the model architecture.
	print(model)

	return model

	# We could override the ``forward()`` if we wanted more control over how
	# the encoder and decoder interact, but it's not necessary for this
	# tutorial since we can inherit the default implementation provided by
	# the FairseqEncoderDecoderModel base class, which looks like:
	#
	# def forward(self, src_tokens, src_lengths, prev_output_tokens):
	# encoder_out = self.encoder(src_tokens, src_lengths)
	# decoder_out = self.decoder(prev_output_tokens, encoder_out)
	# return decoder_out

	Finally let's define a named architecture with the configuration for our
	model. This is done with the :func:`~fairseq.models.register_model_architecture`
	function decorator. Thereafter this named architecture can be used with the
	``--arch`` command-line argument, e.g., ``--arch tutorial_simple_lstm``::

	from fairseq.models import register_model_architecture

	# The first argument to ``register_model_architecture()`` should be the name
	# of the model we registered above (i.e., 'simple_lstm'). The function we
	# register here should take a single argument args and modify it in-place
	# to match the desired architecture.

	@register_model_architecture('simple_lstm', 'tutorial_simple_lstm')
	def tutorial_simple_lstm(args):
	# We use ``getattr()`` to prioritize arguments that are explicitly given
	# on the command-line, so that the defaults defined below are only used
	# when no other value has been specified.
	args.encoder_embed_dim = getattr(args, 'encoder_embed_dim', 256)
	args.encoder_hidden_dim = getattr(args, 'encoder_hidden_dim', 256)
	args.decoder_embed_dim = getattr(args, 'decoder_embed_dim', 256)
	args.decoder_hidden_dim = getattr(args, 'decoder_hidden_dim', 256)


	3. Training the Model
	---------------------

	Now we're ready to train the model. We can use the existing :ref:`fairseq-train`
	command-line tool for this, making sure to specify our new Model architecture
	(``--arch tutorial_simple_lstm``).

	.. note::

	Make sure you've already preprocessed the data from the IWSLT example in the
	:file:`examples/translation/` directory.

	.. code-block:: console

	> fairseq-train data-bin/iwslt14.tokenized.de-en \
	--arch tutorial_simple_lstm \
	--encoder-dropout 0.2 --decoder-dropout 0.2 \
	--optimizer adam --lr 0.005 --lr-shrink 0.5 \
	--max-tokens 12000
	(...)
	\| epoch 052 \| loss 4.027 \| ppl 16.30 \| wps 420805 \| ups 39.7 \| wpb 9841 \| bsz 400 \| num_updates 20852 \| lr 1.95313e-05 \| gnorm 0.218 \| clip 0% \| oom 0 \| wall 529 \| train_wall 396
	\| epoch 052 \| valid on 'valid' subset \| valid_loss 4.74989 \| valid_ppl 26.91 \| num_updates 20852 \| best 4.74954

	The model files should appear in the :file:`checkpoints/` directory. While this
	model architecture is not very good, we can use the :ref:`fairseq-generate` script to
	generate translations and compute our BLEU score over the test set:

	.. code-block:: console

	> fairseq-generate data-bin/iwslt14.tokenized.de-en \
	--path checkpoints/checkpoint_best.pt \
	--beam 5 \
	--remove-bpe
	(...)
	\| Translated 6750 sentences (153132 tokens) in 17.3s (389.12 sentences/s, 8827.68 tokens/s)
	\| Generate test with beam=5: BLEU4 = 8.18, 38.8/12.1/4.7/2.0 (BP=1.000, ratio=1.066, syslen=139865, reflen=131146)


	4. Making generation faster
	---------------------------

	While autoregressive generation from sequence-to-sequence models is inherently
	slow, our implementation above is especially slow because it recomputes the
	entire sequence of Decoder hidden states for every output token (i.e., it is
	``O(n^2)``). We can make this significantly faster by instead caching the
	previous hidden states.

	In fairseq this is called :ref:`Incremental decoding`. Incremental decoding is a
	special mode at inference time where the Model only receives a single timestep
	of input corresponding to the immediately previous output token (for teacher
	forcing) and must produce the next output incrementally. Thus the model must
	cache any long-term state that is needed about the sequence, e.g., hidden
	states, convolutional states, etc.

	To implement incremental decoding we will modify our model to implement the
	:class:`~fairseq.models.FairseqIncrementalDecoder` interface. Compared to the
	standard :class:`~fairseq.models.FairseqDecoder` interface, the incremental
	decoder interface allows ``forward()`` methods to take an extra keyword argument
	(incremental_state) that can be used to cache state across time-steps.

	Let's replace our ``SimpleLSTMDecoder`` with an incremental one::

	import torch
	from fairseq.models import FairseqIncrementalDecoder

	class SimpleLSTMDecoder(FairseqIncrementalDecoder):

	def __init__(
	self, dictionary, encoder_hidden_dim=128, embed_dim=128, hidden_dim=128,
	dropout=0.1,
	):
	# This remains the same as before.
	super().__init__(dictionary)
	self.embed_tokens = nn.Embedding(
	num_embeddings=len(dictionary),
	embedding_dim=embed_dim,
	padding_idx=dictionary.pad(),
	)
	self.dropout = nn.Dropout(p=dropout)
	self.lstm = nn.LSTM(
	input_size=encoder_hidden_dim + embed_dim,
	hidden_size=hidden_dim,
	num_layers=1,
	bidirectional=False,
	)
	self.output_projection = nn.Linear(hidden_dim, len(dictionary))

	# We now take an additional kwarg (incremental_state) for caching the
	# previous hidden and cell states.
	def forward(self, prev_output_tokens, encoder_out, incremental_state=None):
	if incremental_state is not None:
	# If the incremental_state argument is not ``None`` then we are
	# in incremental inference mode. While prev_output_tokens will
	# still contain the entire decoded prefix, we will only use the
	# last step and assume that the rest of the state is cached.
	prev_output_tokens = prev_output_tokens[:, -1:]

	# This remains the same as before.
	bsz, tgt_len = prev_output_tokens.size()
	final_encoder_hidden = encoder_out['final_hidden']
	x = self.embed_tokens(prev_output_tokens)
	x = self.dropout(x)
	x = torch.cat(
	[x, final_encoder_hidden.unsqueeze(1).expand(bsz, tgt_len, -1)],
	dim=2,
	)

	# We will now check the cache and load the cached previous hidden and
	# cell states, if they exist, otherwise we will initialize them to
	# zeros (as before). We will use the ``utils.get_incremental_state()``
	# and ``utils.set_incremental_state()`` helpers.
	initial_state = utils.get_incremental_state(
	self, incremental_state, 'prev_state',
	)
	if initial_state is None:
	# first time initialization, same as the original version
	initial_state = (
	final_encoder_hidden.unsqueeze(0), # hidden
	torch.zeros_like(final_encoder_hidden).unsqueeze(0), # cell
	)

	# Run one step of our LSTM.
	output, latest_state = self.lstm(x.transpose(0, 1), initial_state)

	# Update the cache with the latest hidden and cell states.
	utils.set_incremental_state(
	self, incremental_state, 'prev_state', latest_state,
	)

	# This remains the same as before
	x = output.transpose(0, 1)
	x = self.output_projection(x)
	return x, None

	# The ``FairseqIncrementalDecoder`` interface also requires implementing a
	# ``reorder_incremental_state()`` method, which is used during beam search
	# to select and reorder the incremental state.
	def reorder_incremental_state(self, incremental_state, new_order):
	# Load the cached state.
	prev_state = utils.get_incremental_state(
	self, incremental_state, 'prev_state',
	)

	# Reorder batches according to new_order.
	reordered_state = (
	prev_state[0].index_select(1, new_order), # hidden
	prev_state[1].index_select(1, new_order), # cell
	)

	# Update the cached state.
	utils.set_incremental_state(
	self, incremental_state, 'prev_state', reordered_state,
	)

	Finally, we can rerun generation and observe the speedup:

	.. code-block:: console

	# Before

	> fairseq-generate data-bin/iwslt14.tokenized.de-en \
	--path checkpoints/checkpoint_best.pt \
	--beam 5 \
	--remove-bpe
	(...)
	\| Translated 6750 sentences (153132 tokens) in 17.3s (389.12 sentences/s, 8827.68 tokens/s)
	\| Generate test with beam=5: BLEU4 = 8.18, 38.8/12.1/4.7/2.0 (BP=1.000, ratio=1.066, syslen=139865, reflen=131146)

	# After

	> fairseq-generate data-bin/iwslt14.tokenized.de-en \
	--path checkpoints/checkpoint_best.pt \
	--beam 5 \
	--remove-bpe
	(...)
	\| Translated 6750 sentences (153132 tokens) in 5.5s (1225.54 sentences/s, 27802.94 tokens/s)
	\| Generate test with beam=5: BLEU4 = 8.18, 38.8/12.1/4.7/2.0 (BP=1.000, ratio=1.066, syslen=139865, reflen=131146)