Processors

This library includes processors for several traditional tasks. These processors can be used to process a dataset into examples that can be fed to a model.

Processors

All processors follow the same architecture, which is that of the DataProcessor class. A processor returns a list of InputExample objects, which can then be converted to InputFeatures in order to be fed to the model.

class transformers.data.processors.utils.DataProcessor[source]

Base class for data converters for sequence classification data sets.

get_dev_examples(data_dir)[source]

Gets a collection of `InputExample`s for the dev set.

get_example_from_tensor_dict(tensor_dict)[source]

Gets an example from a dict of TensorFlow tensors.

Parameters

tensor_dict – Keys and values should match the corresponding GLUE tensorflow_datasets examples.

get_labels()[source]

Gets the list of labels for this data set.

get_train_examples(data_dir)[source]

Gets a collection of `InputExample`s for the train set.

tfds_map(example)[source]

Some tensorflow_datasets datasets are not formatted the same way the GLUE datasets are. This method converts examples to the correct format.
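The methods above can be combined into a minimal custom processor. The following is a hypothetical sketch, not the library's API: the class name, the inline data, and the Example stand-in are invented for illustration. A real processor would subclass transformers.data.processors.utils.DataProcessor and read files from data_dir.

```python
class Example:
    """Stand-in for InputExample: guid, text_a, optional text_b and label."""
    def __init__(self, guid, text_a, text_b=None, label=None):
        self.guid = guid
        self.text_a = text_a
        self.text_b = text_b
        self.label = label


class SentimentProcessor:
    """Hypothetical processor following the DataProcessor pattern:
    one method per split returning a list of examples, plus the label set."""

    # Inline data used instead of files, for a self-contained sketch.
    _data = {
        "train": [("great movie", "pos"), ("terrible plot", "neg")],
        "dev": [("quite good", "pos")],
    }

    def get_labels(self):
        return ["neg", "pos"]

    def _create_examples(self, rows, set_type):
        return [
            Example(guid=f"{set_type}-{i}", text_a=text, label=label)
            for i, (text, label) in enumerate(rows)
        ]

    def get_train_examples(self, data_dir=None):
        # A real implementation would read e.g. {data_dir}/train.tsv here.
        return self._create_examples(self._data["train"], "train")

    def get_dev_examples(self, data_dir=None):
        return self._create_examples(self._data["dev"], "dev")


processor = SentimentProcessor()
train_examples = processor.get_train_examples()
```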

class transformers.data.processors.utils.InputExample(guid, text_a, text_b=None, label=None)[source]

A single training/test example for simple sequence classification.

Parameters
  • guid – Unique id for the example.

  • text_a – string. The untokenized text of the first sequence. For single-sequence tasks, only this sequence must be specified.

  • text_b – (Optional) string. The untokenized text of the second sequence. Must only be specified for sequence-pair tasks.

  • label – (Optional) string. The label of the example. This should be specified for train and dev examples, but not for test examples.

to_dict()[source]

Serializes this instance to a Python dictionary.

to_json_string()[source]

Serializes this instance to a JSON string.
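The serialization behaviour described above can be sketched with a plain dataclass. This is a self-contained stand-in for illustration, not the library class (which lives in transformers.data.processors.utils); the exact JSON formatting options are an assumption.

```python
import dataclasses
import json


@dataclasses.dataclass
class InputExample:
    """Sketch of InputExample: guid plus one or two text fields and a label."""
    guid: str
    text_a: str
    text_b: str = None
    label: str = None

    def to_dict(self):
        # Serializes this instance to a Python dictionary.
        return dataclasses.asdict(self)

    def to_json_string(self):
        # Serializes this instance to a JSON string.
        return json.dumps(self.to_dict(), indent=2, sort_keys=True) + "\n"


example = InputExample(guid="train-0", text_a="The cat sat.", label="positive")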

class transformers.data.processors.utils.InputFeatures(input_ids, attention_mask, token_type_ids, label)[source]

A single set of features of data.

Parameters
  • input_ids – Indices of input sequence tokens in the vocabulary.

  • attention_mask – Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: Usually 1 for tokens that are NOT MASKED, 0 for MASKED (padded) tokens.

  • token_type_ids – Segment token indices to indicate first and second portions of the inputs.

  • label – Label corresponding to the input.

to_dict()[source]

Serializes this instance to a Python dictionary.

to_json_string()[source]

Serializes this instance to a JSON string.
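The relationship between the fields above, in particular how attention_mask marks padding, can be sketched with a small helper. The function name make_features is invented for illustration; it only demonstrates the usual convention of 1 for real tokens and 0 for padded positions.

```python
def make_features(input_ids, max_length, pad_token=0):
    """Pad token ids to max_length and build the matching attention mask
    (1 for real tokens, 0 for padding) and segment ids."""
    attention_mask = [1] * len(input_ids)
    pad_len = max_length - len(input_ids)
    input_ids = input_ids + [pad_token] * pad_len
    attention_mask = attention_mask + [0] * pad_len
    token_type_ids = [0] * max_length  # single-sequence input: all segment 0
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "token_type_ids": token_type_ids,
    }


feats = make_features([101, 2023, 102], max_length=6)
```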

GLUE

General Language Understanding Evaluation (GLUE) is a benchmark that evaluates the performance of models across a diverse set of existing NLU tasks. It was released together with the paper GLUE: A multi-task benchmark and analysis platform for natural language understanding.

This library hosts a total of 10 processors for the following tasks: MRPC, MNLI, MNLI (mismatched), CoLA, SST2, STSB, QQP, QNLI, RTE and WNLI.

Those processors are:
  • MrpcProcessor

  • MnliProcessor

  • MnliMismatchedProcessor

  • Sst2Processor

  • StsbProcessor

  • QqpProcessor

  • QnliProcessor

  • RteProcessor

  • WnliProcessor

Additionally, the following method can be used to load values from a data file and convert them to a list of InputFeatures.

glue.glue_convert_examples_to_features(examples, tokenizer, max_length=512, task=None, label_list=None, output_mode=None, pad_on_left=False, pad_token=0, pad_token_segment_id=0, mask_padding_with_zero=True)

Loads a data file into a list of InputFeatures.

Parameters
  • examples – List of InputExamples or tf.data.Dataset containing the examples.

  • tokenizer – Instance of a tokenizer that will tokenize the examples

  • max_length – Maximum example length

  • task – GLUE task

  • label_list – List of labels. Can be obtained from the processor using the processor.get_labels() method

  • output_mode – String indicating the output mode. Either regression or classification

  • pad_on_left – If set to True, the examples will be padded on the left rather than on the right (default)

  • pad_token – Padding token

  • pad_token_segment_id – The segment ID for the padding token (It is usually 0, but can vary such as for XLNet where it is 4)

  • mask_padding_with_zero – If set to True, the attention mask will be filled by 1 for actual values and by 0 for padded values. If set to False, inverts it (1 for padded values, 0 for actual values)

Returns

If the examples input is a tf.data.Dataset, will return a tf.data.Dataset containing the task-specific features. If the input is a list of InputExamples, will return a list of task-specific InputFeatures which can be fed to the model.

Example usage

An example using these processors is given in the run_glue.py script.
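The padding behaviour controlled by pad_on_left, pad_token, and mask_padding_with_zero can be sketched without the library. The whitespace "tokenizer" and vocabulary below are invented for illustration; the real function delegates tokenization to a transformers tokenizer and handles labels, segment ids, and truncation as well.

```python
def convert_to_features(texts, vocab, max_length=8,
                        pad_on_left=False, pad_token=0,
                        mask_padding_with_zero=True):
    """Toy sketch of the conversion: map tokens to ids, then pad either
    side up to max_length, building the attention mask as it goes."""
    features = []
    for text in texts:
        ids = [vocab[tok] for tok in text.split()][:max_length]
        # mask_padding_with_zero=True: 1 for real tokens, 0 for padding;
        # False inverts the convention.
        real = 1 if mask_padding_with_zero else 0
        mask = [real] * len(ids)
        pad_len = max_length - len(ids)
        padding = [pad_token] * pad_len
        mask_pad = [1 - real] * pad_len
        if pad_on_left:
            ids, mask = padding + ids, mask_pad + mask
        else:
            ids, mask = ids + padding, mask + mask_pad
        features.append({"input_ids": ids, "attention_mask": mask})
    return features


vocab = {"the": 1, "cat": 2, "sat": 3}
feats = convert_to_features(["the cat sat"], vocab, max_length=5)
```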

XNLI

The Cross-Lingual NLI Corpus (XNLI) is a benchmark that evaluates the quality of cross-lingual text representations. XNLI is a crowd-sourced dataset based on MultiNLI (http://www.nyu.edu/projects/bowman/multinli/): pairs of text are labeled with textual entailment annotations in 15 different languages (including both high-resource languages such as English and low-resource languages such as Swahili).

It was released together with the paper XNLI: Evaluating Cross-lingual Sentence Representations.

This library hosts the processor to load the XNLI data:
  • XnliProcessor

Please note that since the gold labels are available on the test set, evaluation is performed on the test set.

Example usage

An example using these processors is given in the run_xnli.py script.