Tutorial: Classifying Names with a Character-Level RNN
======================================================

In this tutorial we will extend fairseq to support *classification* tasks. In
particular we will re-implement the PyTorch tutorial for `Classifying Names with
a Character-Level RNN <https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html>`_
in fairseq. It is recommended to quickly skim that tutorial before beginning
this one.

This tutorial covers:

1. **Preprocessing the data** to create dictionaries.
2. **Registering a new Model** that encodes an input sentence with a simple RNN
   and predicts the output label.
3. **Registering a new Task** that loads our dictionaries and dataset.
4. **Training the Model** using the existing command-line tools.
5. **Writing an evaluation script** that imports fairseq and allows us to
   interactively evaluate our model on new inputs.


1. Preprocessing the data
-------------------------

The original tutorial provides raw data, but we'll work with a modified version
of the data that is already tokenized into characters and split into separate
train, valid and test sets.

Download and extract the data from here:
`tutorial_names.tar.gz <https://dl.fbaipublicfiles.com/fairseq/data/tutorial_names.tar.gz>`_
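
Each ``*.input`` file in the archive holds one name per line, already tokenized
into space-separated characters, and the matching ``*.label`` file holds one
language label per line. For example (the lines shown are illustrative, not
necessarily the actual first lines of the files):

.. code-block:: console

  > head -n 1 names/train.input
  S a t o s h i
  > head -n 1 names/train.label
  Japanese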

Once extracted, let's preprocess the data using the :ref:`fairseq-preprocess`
command-line tool to create the dictionaries. While this tool is primarily
intended for sequence-to-sequence problems, we're able to reuse it here by
treating the label as a "target" sequence of length 1. We'll also output the
preprocessed files in "raw" format using the ``--dataset-impl`` option to
enhance readability:

.. code-block:: console

  > fairseq-preprocess \
    --trainpref names/train --validpref names/valid --testpref names/test \
    --source-lang input --target-lang label \
    --destdir names-bin --dataset-impl raw

After running the above command you should see a new directory,
:file:`names-bin/`, containing the dictionaries for *inputs* and *labels*.
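
For orientation, the directory should look roughly like the following (an
illustrative listing; exact contents can vary across fairseq versions):

.. code-block:: console

  > ls names-bin
  dict.input.txt  dict.label.txt  preprocess.log
  test.input-label.input    test.input-label.label
  train.input-label.input   train.input-label.label
  valid.input-label.input   valid.input-label.label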


2. Registering a new Model
--------------------------

Next we'll register a new model in fairseq that will encode an input sentence
with a simple RNN and predict the output label. Compared to the original PyTorch
tutorial, our version will also work with batches of data and GPU Tensors.

First let's copy the simple RNN module implemented in the `PyTorch tutorial
<https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html#creating-the-network>`_.
Create a new file named :file:`fairseq/models/rnn_classifier.py` with the
following contents::

    import torch
    import torch.nn as nn


    class RNN(nn.Module):

        def __init__(self, input_size, hidden_size, output_size):
            super(RNN, self).__init__()

            self.hidden_size = hidden_size

            self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
            self.i2o = nn.Linear(input_size + hidden_size, output_size)
            self.softmax = nn.LogSoftmax(dim=1)

        def forward(self, input, hidden):
            combined = torch.cat((input, hidden), 1)
            hidden = self.i2h(combined)
            output = self.i2o(combined)
            output = self.softmax(output)
            return output, hidden

        def initHidden(self):
            return torch.zeros(1, self.hidden_size)
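
To get a feel for the module before wiring it into fairseq, here is a minimal
sketch (not part of the tutorial files) that runs a single time step on a small
batch of one-hot inputs, assuming the ``RNN`` class above is in scope. The
sizes are illustrative::

    import torch

    # Illustrative sizes: a 64-character vocabulary and 24 output labels.
    rnn = RNN(input_size=64, hidden_size=128, output_size=24)

    x = torch.eye(64)[:3]                   # three one-hot characters: (3, 64)
    hidden = rnn.initHidden().repeat(3, 1)  # batched hidden state: (3, 128)

    output, hidden = rnn(x, hidden)
    print(output.shape)  # torch.Size([3, 24]): log-probabilities over labels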

We must also *register* this model with fairseq using the
:func:`~fairseq.models.register_model` function decorator. Once the model is
registered we'll be able to use it with the existing :ref:`Command-line Tools`.

All registered models must implement the :class:`~fairseq.models.BaseFairseqModel`
interface, so we'll create a small wrapper class in the same file and register
it in fairseq with the name ``'rnn_classifier'``::

    from fairseq.models import BaseFairseqModel, register_model

    # Note: the register_model "decorator" should immediately precede the
    # definition of the Model class.

    @register_model('rnn_classifier')
    class FairseqRNNClassifier(BaseFairseqModel):

        @staticmethod
        def add_args(parser):
            # Models can override this method to add new command-line arguments.
            # Here we'll add a new command-line argument to configure the
            # dimensionality of the hidden state.
            parser.add_argument(
                '--hidden-dim', type=int, metavar='N',
                help='dimensionality of the hidden state',
            )

        @classmethod
        def build_model(cls, args, task):
            # Fairseq initializes models by calling the ``build_model()``
            # function. This provides more flexibility, since the returned model
            # instance can be of a different type than the one that was called.
            # In this case we'll just return a FairseqRNNClassifier instance.

            # Initialize our RNN module
            rnn = RNN(
                # We'll define the Task in the next section, but for now just
                # notice that the task holds the dictionaries for the "source"
                # (i.e., the input sentence) and "target" (i.e., the label).
                input_size=len(task.source_dictionary),
                hidden_size=args.hidden_dim,
                output_size=len(task.target_dictionary),
            )

            # Return the wrapped version of the module
            return FairseqRNNClassifier(
                rnn=rnn,
                input_vocab=task.source_dictionary,
            )

        def __init__(self, rnn, input_vocab):
            super(FairseqRNNClassifier, self).__init__()

            self.rnn = rnn
            self.input_vocab = input_vocab

            # The RNN module in the tutorial expects one-hot inputs, so we can
            # precompute the identity matrix to help convert from indices to
            # one-hot vectors. We register it as a buffer so that it is moved to
            # the GPU when ``cuda()`` is called.
            self.register_buffer('one_hot_inputs', torch.eye(len(input_vocab)))

        def forward(self, src_tokens, src_lengths):
            # The inputs to the ``forward()`` function are determined by the
            # Task, and in particular the ``'net_input'`` key in each
            # mini-batch. We'll define the Task in the next section, but for
            # now just know that *src_tokens* has shape `(batch, src_len)` and
            # *src_lengths* has shape `(batch)`.
            bsz, max_src_len = src_tokens.size()

            # Initialize the RNN hidden state. Compared to the original PyTorch
            # tutorial we'll also handle batched inputs and work on the GPU.
            hidden = self.rnn.initHidden()
            hidden = hidden.repeat(bsz, 1)  # expand for batched inputs
            hidden = hidden.to(src_tokens.device)  # move to GPU

            for i in range(max_src_len):
                # WARNING: The inputs have padding, so we should mask those
                # elements here so that padding doesn't affect the results.
                # This is left as an exercise for the reader. The padding symbol
                # is given by ``self.input_vocab.pad()`` and the unpadded length
                # of each input is given by *src_lengths*.

                # One-hot encode a batch of input characters.
                input = self.one_hot_inputs[src_tokens[:, i].long()]

                # Feed the input to our RNN.
                output, hidden = self.rnn(input, hidden)

            # Return the final output state for making a prediction
            return output
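
As a sketch of that masking exercise (one possible solution, not the tutorial's
official one): since the Task below uses ``left_pad_source=False``, padding
appears on the right, so we can freeze each sequence's hidden state once ``i``
reaches its unpadded length. Inside the loop, the RNN step would become::

    # Sketch: replaces ``output, hidden = self.rnn(input, hidden)`` above.
    output, new_hidden = self.rnn(input, hidden)
    # *keep* is 1 while position i is within the sequence, else 0.
    keep = (src_lengths > i).unsqueeze(1).type_as(hidden)  # shape: (bsz, 1)
    hidden = keep * new_hidden + (1 - keep) * hidden
    # Note: a complete solution would also carry forward the last valid
    # *output* for sequences shorter than max_src_len.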

Finally let's define a *named architecture* with the configuration for our
model. This is done with the :func:`~fairseq.models.register_model_architecture`
function decorator. Thereafter this named architecture can be used with the
``--arch`` command-line argument, e.g. ``--arch pytorch_tutorial_rnn``::

    from fairseq.models import register_model_architecture

    # The first argument to ``register_model_architecture()`` should be the name
    # of the model we registered above (i.e., 'rnn_classifier'). The function we
    # register here should take a single argument *args* and modify it in-place
    # to match the desired architecture.

    @register_model_architecture('rnn_classifier', 'pytorch_tutorial_rnn')
    def pytorch_tutorial_rnn(args):
        # We use ``getattr()`` to prioritize arguments that are explicitly given
        # on the command-line, so that the defaults defined below are only used
        # when no other value has been specified.
        args.hidden_dim = getattr(args, 'hidden_dim', 128)
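
Because architectures are just functions that fill in defaults, a single model
can expose several presets. As a purely hypothetical sketch (not part of the
tutorial files), a second, larger preset could be registered in the same file
like this::

    @register_model_architecture('rnn_classifier', 'pytorch_tutorial_rnn_big')
    def pytorch_tutorial_rnn_big(args):
        # Hypothetical variant: same model class, larger default hidden state.
        args.hidden_dim = getattr(args, 'hidden_dim', 256)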


3. Registering a new Task
-------------------------

Now we'll register a new :class:`~fairseq.tasks.FairseqTask` that will load our
dictionaries and dataset. Tasks can also control how the data is batched into
mini-batches, but in this tutorial we'll reuse the batching provided by
:class:`fairseq.data.LanguagePairDataset`.

Create a new file named :file:`fairseq/tasks/simple_classification.py` with the
following contents::

    import os
    import torch

    from fairseq.data import Dictionary, LanguagePairDataset
    from fairseq.tasks import LegacyFairseqTask, register_task


    @register_task('simple_classification')
    class SimpleClassificationTask(LegacyFairseqTask):

        @staticmethod
        def add_args(parser):
            # Add some command-line arguments for specifying where the data is
            # located and the maximum supported input length.
            parser.add_argument('data', metavar='FILE',
                                help='file prefix for data')
            parser.add_argument('--max-positions', default=1024, type=int,
                                help='max input length')

        @classmethod
        def setup_task(cls, args, **kwargs):
            # Here we can perform any setup required for the task. This may include
            # loading Dictionaries, initializing shared Embedding layers, etc.
            # In this case we'll just load the Dictionaries.
            input_vocab = Dictionary.load(os.path.join(args.data, 'dict.input.txt'))
            label_vocab = Dictionary.load(os.path.join(args.data, 'dict.label.txt'))
            print('| [input] dictionary: {} types'.format(len(input_vocab)))
            print('| [label] dictionary: {} types'.format(len(label_vocab)))

            return SimpleClassificationTask(args, input_vocab, label_vocab)

        def __init__(self, args, input_vocab, label_vocab):
            super().__init__(args)
            self.input_vocab = input_vocab
            self.label_vocab = label_vocab

        def load_dataset(self, split, **kwargs):
            """Load a given dataset split (e.g., train, valid, test)."""

            prefix = os.path.join(self.args.data, '{}.input-label'.format(split))

            # Read input sentences.
            sentences, lengths = [], []
            with open(prefix + '.input', encoding='utf-8') as file:
                for line in file:
                    sentence = line.strip()

                    # Tokenize the sentence, splitting on spaces
                    tokens = self.input_vocab.encode_line(
                        sentence, add_if_not_exist=False,
                    )

                    sentences.append(tokens)
                    lengths.append(tokens.numel())

            # Read labels.
            labels = []
            with open(prefix + '.label', encoding='utf-8') as file:
                for line in file:
                    label = line.strip()
                    labels.append(
                        # Convert label to a numeric ID.
                        torch.LongTensor([self.label_vocab.add_symbol(label)])
                    )

            assert len(sentences) == len(labels)
            print('| {} {} {} examples'.format(self.args.data, split, len(sentences)))

            # We reuse LanguagePairDataset since classification can be modeled as a
            # sequence-to-sequence task where the target sequence has length 1.
            self.datasets[split] = LanguagePairDataset(
                src=sentences,
                src_sizes=lengths,
                src_dict=self.input_vocab,
                tgt=labels,
                tgt_sizes=torch.ones(len(labels)),  # targets have length 1
                tgt_dict=self.label_vocab,
                left_pad_source=False,
                # Since our target is a single class label, there's no need for
                # teacher forcing. If we set this to ``True`` then our Model's
                # ``forward()`` method would receive an additional argument called
                # *prev_output_tokens* that would contain a shifted version of the
                # target sequence.
                input_feeding=False,
            )

        def max_positions(self):
            """Return the max input length allowed by the task."""
            # The source should be less than *args.max_positions* and the "target"
            # has max length 1.
            return (self.args.max_positions, 1)

        @property
        def source_dictionary(self):
            """Return the source :class:`~fairseq.data.Dictionary`."""
            return self.input_vocab

        @property
        def target_dictionary(self):
            """Return the target :class:`~fairseq.data.Dictionary`."""
            return self.label_vocab

        # We could override this method if we wanted more control over how batches
        # are constructed, but it's not necessary for this tutorial since we can
        # reuse the batching provided by LanguagePairDataset.
        #
        # def get_batch_iterator(
        #     self, dataset, max_tokens=None, max_sentences=None, max_positions=None,
        #     ignore_invalid_inputs=False, required_batch_size_multiple=1,
        #     seed=1, num_shards=1, shard_id=0, num_workers=0, epoch=1,
        #     data_buffer_size=0, disable_iterator_cache=False,
        # ):
        #     (...)


4. Training the Model
---------------------

Now we're ready to train the model. We can use the existing :ref:`fairseq-train`
command-line tool for this, making sure to specify our new Task
(``--task simple_classification``) and Model architecture
(``--arch pytorch_tutorial_rnn``):

.. note::

    You can also configure the dimensionality of the hidden state by passing the
    ``--hidden-dim`` argument to :ref:`fairseq-train`.

.. code-block:: console

  > fairseq-train names-bin \
    --task simple_classification \
    --arch pytorch_tutorial_rnn \
    --optimizer adam --lr 0.001 --lr-shrink 0.5 \
    --max-tokens 1000
  (...)
  | epoch 027 | loss 1.200 | ppl 2.30 | wps 15728 | ups 119.4 | wpb 116 | bsz 116 | num_updates 3726 | lr 1.5625e-05 | gnorm 1.290 | clip 0% | oom 0 | wall 32 | train_wall 21
  | epoch 027 | valid on 'valid' subset | valid_loss 1.41304 | valid_ppl 2.66 | num_updates 3726 | best 1.41208
  | done training in 31.6 seconds

The model files should appear in the :file:`checkpoints/` directory.
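
The exact checkpoint files depend on your fairseq version and training
settings, but an illustrative listing looks like this
(``checkpoint_best.pt`` is the one we'll load below):

.. code-block:: console

  > ls checkpoints
  checkpoint1.pt  checkpoint2.pt  (...)  checkpoint_best.pt  checkpoint_last.pt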


5. Writing an evaluation script
-------------------------------

Finally we can write a short script to evaluate our model on new inputs. Create
a new file named :file:`eval_classifier.py` with the following contents::

    from fairseq import checkpoint_utils, data, options, tasks

    # Parse command-line arguments for generation
    parser = options.get_generation_parser(default_task='simple_classification')
    args = options.parse_args_and_arch(parser)

    # Setup task
    task = tasks.setup_task(args)

    # Load model
    print('| loading model from {}'.format(args.path))
    models, _model_args = checkpoint_utils.load_model_ensemble([args.path], task=task)
    model = models[0]

    while True:
        sentence = input('\nInput: ')

        # Tokenize into characters
        chars = ' '.join(list(sentence.strip()))
        tokens = task.source_dictionary.encode_line(
            chars, add_if_not_exist=False,
        )

        # Build mini-batch to feed to the model
        batch = data.language_pair_dataset.collate(
            samples=[{'id': -1, 'source': tokens}],  # bsz = 1
            pad_idx=task.source_dictionary.pad(),
            eos_idx=task.source_dictionary.eos(),
            left_pad_source=False,
            input_feeding=False,
        )

        # Feed batch to the model and get predictions
        preds = model(**batch['net_input'])

        # Print top 3 predictions and their log-probabilities
        top_scores, top_labels = preds[0].topk(k=3)
        for score, label_idx in zip(top_scores, top_labels):
            label_name = task.target_dictionary.string([label_idx])
            print('({:.2f})\t{}'.format(score, label_name))

Now we can evaluate our model interactively. Note that we have included the
original data path (:file:`names-bin/`) so that the dictionaries can be loaded:

.. code-block:: console

  > python eval_classifier.py names-bin --path checkpoints/checkpoint_best.pt
  | [input] dictionary: 64 types
  | [label] dictionary: 24 types
  | loading model from checkpoints/checkpoint_best.pt

  Input: Satoshi
  (-0.61) Japanese
  (-1.20) Arabic
  (-2.86) Italian

  Input: Sinbad
  (-0.30) Arabic
  (-1.76) English
  (-4.08) Russian