Processors

This library includes processors for several traditional NLU tasks. These processors convert a dataset into examples that can be fed to a model.

Processors

All processors follow the same architecture, inherited from the DataProcessor base class. A processor returns a list of InputExample objects, which can then be converted to InputFeatures and fed to the model.
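
The flow above can be sketched as follows. Everything here is a simplified, illustrative stand-in for the library's classes (plain dicts and a toy whitespace tokenizer), not the transformers API itself:

```python
# Illustrative pipeline: examples -> features.
# The dicts below stand in for InputExample and InputFeatures.
examples = [{"guid": "train-0", "text_a": "good film", "label": "positive"}]

def example_to_features(example, vocab):
    # A toy "tokenizer": map whitespace-separated tokens to vocabulary indices.
    input_ids = [vocab[tok] for tok in example["text_a"].split()]
    return {"input_ids": input_ids,
            "attention_mask": [1] * len(input_ids),  # 1 for every real token
            "label": example["label"]}

vocab = {"good": 1, "film": 2}
features = [example_to_features(e, vocab) for e in examples]
```

A real tokenizer replaces the toy vocabulary lookup, but the shape of the data at each stage is the same.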

class transformers.data.processors.utils.DataProcessor[source]

Base class for data converters for sequence classification data sets.

get_dev_examples(data_dir)[source]

Gets a collection of `InputExample`s for the dev set.

get_example_from_tensor_dict(tensor_dict)[source]

Gets an example from a dict of TensorFlow tensors.

Parameters

tensor_dict – Keys and values should match the corresponding GLUE tensorflow_datasets examples.

get_labels()[source]

Gets the list of labels for this data set.

get_train_examples(data_dir)[source]

Gets a collection of `InputExample`s for the train set.

tfds_map(example)[source]

Some tensorflow_datasets datasets are not formatted the same way the GLUE datasets are. This method converts examples to the correct format.
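
A processor with this interface can be implemented for a custom dataset. The sketch below assumes a tab-separated file format (text, then label) and uses a stand-in InputExample dataclass so it runs without the transformers library; it illustrates the interface, not the library's implementation:

```python
import csv
import os
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class InputExample:
    """Stand-in for transformers.data.processors.utils.InputExample."""
    guid: str
    text_a: str
    text_b: Optional[str] = None
    label: Optional[str] = None

class SentimentProcessor:
    """Reads TSV files with two columns: text<TAB>label."""

    def get_train_examples(self, data_dir: str) -> List[InputExample]:
        return self._create_examples(os.path.join(data_dir, "train.tsv"), "train")

    def get_dev_examples(self, data_dir: str) -> List[InputExample]:
        return self._create_examples(os.path.join(data_dir, "dev.tsv"), "dev")

    def get_labels(self) -> List[str]:
        return ["negative", "positive"]

    def _create_examples(self, path: str, set_type: str) -> List[InputExample]:
        examples = []
        with open(path, newline="", encoding="utf-8") as f:
            for i, row in enumerate(csv.reader(f, delimiter="\t")):
                examples.append(InputExample(guid=f"{set_type}-{i}",
                                             text_a=row[0], label=row[1]))
        return examples
```

The guid convention (set name plus row index) mirrors the pattern used by the built-in processors.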

class transformers.data.processors.utils.InputExample(guid, text_a, text_b=None, label=None)[source]

A single training/test example for simple sequence classification.

Parameters
  • guid – Unique id for the example.

  • text_a – string. The untokenized text of the first sequence. For single sequence tasks, only this sequence must be specified.

  • text_b – (Optional) string. The untokenized text of the second sequence. Must only be specified for sequence pair tasks.

  • label – (Optional) string. The label of the example. This should be specified for train and dev examples, but not for test examples.

to_dict()[source]

Serializes this instance to a Python dictionary.

to_json_string()[source]

Serializes this instance to a JSON string.
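
The to_dict / to_json_string pair follows a common serialization pattern. A hedged sketch of that pattern is below (a stand-in class, not the library's implementation; the exact output formatting of the real methods may differ):

```python
import copy
import json

class InputExampleSketch:
    """Illustrative stand-in showing the to_dict / to_json_string pattern."""

    def __init__(self, guid, text_a, text_b=None, label=None):
        self.guid = guid
        self.text_a = text_a
        self.text_b = text_b
        self.label = label

    def to_dict(self):
        # Deep-copy so callers can mutate the dict without touching the example.
        return copy.deepcopy(self.__dict__)

    def to_json_string(self):
        return json.dumps(self.to_dict(), indent=2, sort_keys=True) + "\n"

example = InputExampleSketch("train-0", "The cat sat.", label="neutral")
```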

class transformers.data.processors.utils.InputFeatures(input_ids, attention_mask, token_type_ids, label)[source]

A single set of features of data.

Parameters
  • input_ids – Indices of input sequence tokens in the vocabulary.

  • attention_mask – Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: Usually 1 for tokens that are NOT MASKED, 0 for MASKED (padded) tokens.

  • token_type_ids – Segment token indices to indicate first and second portions of the inputs.

  • label – Label corresponding to the input

to_dict()[source]

Serializes this instance to a Python dictionary.

to_json_string()[source]

Serializes this instance to a JSON string.
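
As an illustration of what these fields encode, the sketch below builds the three lists by hand for a sequence pair, assuming BERT-style special tokens ([CLS] and [SEP]) with illustrative token ids; in practice the features are produced by a tokenizer, not by hand:

```python
def build_features(ids_a, ids_b, max_length, cls_id=101, sep_id=102, pad_id=0):
    # [CLS] first sequence [SEP] second sequence [SEP], then padding.
    input_ids = [cls_id] + ids_a + [sep_id] + ids_b + [sep_id]
    # Segment 0 covers [CLS] + first sequence + [SEP]; segment 1 the rest.
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    attention_mask = [1] * len(input_ids)  # 1 for real tokens
    pad = max_length - len(input_ids)
    input_ids += [pad_id] * pad
    token_type_ids += [0] * pad
    attention_mask += [0] * pad            # 0 for padded positions
    return {"input_ids": input_ids,
            "token_type_ids": token_type_ids,
            "attention_mask": attention_mask}
```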

GLUE

General Language Understanding Evaluation (GLUE) is a benchmark that evaluates the performance of models across a diverse set of existing NLU tasks. It was released together with the paper GLUE: A multi-task benchmark and analysis platform for natural language understanding.

This library hosts a total of 10 processors for the following tasks: MRPC, MNLI, MNLI (mismatched), CoLA, SST-2, STS-B, QQP, QNLI, RTE and WNLI.

Those processors are:
  • MrpcProcessor

  • MnliProcessor

  • MnliMismatchedProcessor

  • ColaProcessor

  • Sst2Processor

  • StsbProcessor

  • QqpProcessor

  • QnliProcessor

  • RteProcessor

  • WnliProcessor

Additionally, the following function can be used to load values from a data file and convert them to a list of InputFeatures.

glue.glue_convert_examples_to_features(examples, tokenizer, max_length=512, task=None, label_list=None, output_mode=None, pad_on_left=False, pad_token=0, pad_token_segment_id=0, mask_padding_with_zero=True)

Loads a data file into a list of InputFeatures.

Parameters
  • examples – List of InputExamples or tf.data.Dataset containing the examples.

  • tokenizer – Instance of a tokenizer that will tokenize the examples

  • max_length – Maximum example length

  • task – GLUE task

  • label_list – List of labels. Can be obtained from the processor using the processor.get_labels() method

  • output_mode – String indicating the output mode. Either regression or classification

  • pad_on_left – If set to True, the examples will be padded on the left rather than on the right (default)

  • pad_token – Padding token

  • pad_token_segment_id – The segment ID for the padding token (It is usually 0, but can vary such as for XLNet where it is 4)

  • mask_padding_with_zero – If set to True, the attention mask will be filled by 1 for actual values and by 0 for padded values. If set to False, inverts it (1 for padded values, 0 for actual values)

Returns

If the examples input is a tf.data.Dataset, will return a tf.data.Dataset containing the task-specific features. If the input is a list of InputExamples, will return a list of task-specific InputFeatures which can be fed to the model.
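
The interaction of pad_on_left, pad_token and mask_padding_with_zero can be sketched in isolation (a simplified illustration of the parameters described above, not the library's code):

```python
def pad_and_mask(input_ids, max_length, pad_on_left=False, pad_token=0,
                 mask_padding_with_zero=True):
    # mask_padding_with_zero=True: 1 marks real tokens, 0 marks padding;
    # mask_padding_with_zero=False inverts that convention.
    real = 1 if mask_padding_with_zero else 0
    padv = 1 - real
    attention_mask = [real] * len(input_ids)
    pad_len = max_length - len(input_ids)
    padding = [pad_token] * pad_len
    mask_pad = [padv] * pad_len
    if pad_on_left:
        return padding + input_ids, mask_pad + attention_mask
    return input_ids + padding, attention_mask + mask_pad
```

Left-padding with pad_token_segment_id handled analogously is the configuration typically used for XLNet-style models.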

Example usage

An example using these processors is given in the run_glue.py script.