Processors¶
This library includes processors for several traditional tasks. These processors can be used to process a dataset into examples that can be fed to a model.
Processors¶
All processors follow the same architecture, which is that of the DataProcessor. The processor returns a list of InputExample. These InputExample can be converted to InputFeatures in order to be fed to the model.
- class transformers.data.processors.utils.DataProcessor
Base class for data converters for sequence classification data sets.
- get_dev_examples(data_dir)
Gets a collection of InputExample for the dev set.
- get_example_from_tensor_dict(tensor_dict)
Gets an example from a dict with TensorFlow tensors.
- Parameters
tensor_dict – Keys and values should match the corresponding GLUE tensorflow_datasets examples.
- get_test_examples(data_dir)
Gets a collection of InputExample for the test set.
- get_train_examples(data_dir)
Gets a collection of InputExample for the train set.
- class transformers.data.processors.utils.InputExample(guid: str, text_a: str, text_b: Optional[str] = None, label: Optional[str] = None)
A single training/test example for simple sequence classification.
- Parameters
guid – Unique id for the example.
text_a – string. The untokenized text of the first sequence. For single sequence tasks, only this sequence must be specified.
text_b – (Optional) string. The untokenized text of the second sequence. Must be specified only for sequence pair tasks.
label – (Optional) string. The label of the example. This should be specified for train and dev examples, but not for test examples.
- class transformers.data.processors.utils.InputFeatures(input_ids: List[int], attention_mask: Optional[List[int]] = None, token_type_ids: Optional[List[int]] = None, label: Optional[Union[int, float]] = None)
A single set of features of data. Property names are the same names as the corresponding inputs to a model.
- Parameters
input_ids – Indices of input sequence tokens in the vocabulary.
attention_mask – Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: usually 1 for tokens that are NOT MASKED, 0 for MASKED (padded) tokens.
token_type_ids – (Optional) Segment token indices to indicate first and second portions of the inputs. Only some models use them.
label – (Optional) Label corresponding to the input. Int for classification problems, float for regression problems.
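As a rough illustration of the flow described above (processor, then InputExample, then InputFeatures), the following sketch uses only the standard library. SimpleExample, SimpleFeatures, TsvProcessor, to_features and the toy vocabulary are hypothetical stand-ins that mirror the documented fields; they are not the library's classes, and a real tokenizer would replace the whitespace lookup used here.

```python
import csv
import io
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class SimpleExample:
    """Hypothetical stand-in mirroring the documented InputExample fields."""
    guid: str
    text_a: str
    text_b: Optional[str] = None
    label: Optional[str] = None


@dataclass
class SimpleFeatures:
    """Hypothetical stand-in mirroring some of the documented InputFeatures fields."""
    input_ids: List[int]
    attention_mask: List[int]
    label: Optional[int] = None


class TsvProcessor:
    """Toy processor: each TSV row is `label<TAB>sentence`."""

    def get_labels(self) -> List[str]:
        return ["0", "1"]

    def create_examples(self, tsv_text: str, set_type: str) -> List[SimpleExample]:
        rows = csv.reader(io.StringIO(tsv_text), delimiter="\t")
        return [
            SimpleExample(guid=f"{set_type}-{i}", text_a=row[1], label=row[0])
            for i, row in enumerate(rows)
        ]


def to_features(
    example: SimpleExample,
    vocab: Dict[str, int],
    label_list: List[str],
    max_length: int,
    pad_id: int = 0,
) -> SimpleFeatures:
    # Toy "tokenizer": whitespace split plus vocabulary lookup (unknown words -> 1).
    ids = [vocab.get(word, 1) for word in example.text_a.lower().split()][:max_length]
    n_pad = max_length - len(ids)
    # 1 marks real tokens, 0 marks padding, as described for attention_mask.
    mask = [1] * len(ids) + [0] * n_pad
    label_map = {label: i for i, label in enumerate(label_list)}
    return SimpleFeatures(ids + [pad_id] * n_pad, mask, label_map[example.label])


processor = TsvProcessor()
examples = processor.create_examples("1\tgood movie\n0\tbad plot twist\n", "train")
toy_vocab = {"good": 2, "movie": 3, "bad": 4, "plot": 5}
features = [to_features(e, toy_vocab, processor.get_labels(), max_length=4) for e in examples]
print(features[0])
```

The padding positions receive both the pad id and a 0 in the mask, so a model can ignore them.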
GLUE¶
General Language Understanding Evaluation (GLUE) is a benchmark that evaluates the performance of models across a diverse set of existing NLU tasks. It was released together with the paper GLUE: A multi-task benchmark and analysis platform for natural language understanding.
This library hosts a total of 10 processors for the following tasks: MRPC, MNLI, MNLI (mismatched), CoLA, SST2, STSB, QQP, QNLI, RTE and WNLI.
- Those processors are:
MrpcProcessor
MnliProcessor
MnliMismatchedProcessor
ColaProcessor
Sst2Processor
StsbProcessor
QqpProcessor
QnliProcessor
RteProcessor
WnliProcessor
Additionally, the following method can be used to convert a list of InputExample (or a tf.data.Dataset) into a list of InputFeatures.
- glue.glue_convert_examples_to_features(examples, tokenizer: transformers.tokenization_utils.PreTrainedTokenizer, max_length: Optional[int] = None, task=None, label_list=None, output_mode=None)
Converts the given examples into a list of InputFeatures.
- Parameters
examples – List of InputExamples or tf.data.Dataset containing the examples.
tokenizer – Instance of a tokenizer that will tokenize the examples.
max_length – Maximum example length. Defaults to the tokenizer’s max_len.
task – GLUE task.
label_list – List of labels. Can be obtained from the processor using the processor.get_labels() method.
output_mode – String indicating the output mode. Either regression or classification.
- Returns
If the examples input is a tf.data.Dataset, will return a tf.data.Dataset containing the task-specific features. If the input is a list of InputExamples, will return a list of task-specific InputFeatures which can be fed to the model.
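The output_mode parameter determines how each example's string label becomes the numeric label of an InputFeatures. Below is a minimal sketch of that conversion, assuming class labels are indexed in label_list order and regression labels are parsed as floats. convert_label is a hypothetical helper written for illustration, not a library function, and the label names are only illustrative.

```python
from typing import List, Optional, Union


def convert_label(
    label: Optional[str], label_list: List[str], output_mode: str
) -> Optional[Union[int, float]]:
    """Hypothetical helper: map a raw string label to a model-ready value."""
    if label is None:
        # Test examples carry no label.
        return None
    if output_mode == "classification":
        # Each class name is replaced by its position in label_list.
        label_map = {name: index for index, name in enumerate(label_list)}
        return label_map[label]
    if output_mode == "regression":
        # Regression targets are kept as real numbers.
        return float(label)
    raise ValueError(f"Unknown output_mode: {output_mode}")


nli_labels = ["contradiction", "entailment", "neutral"]
class_label = convert_label("entailment", nli_labels, "classification")
score_label = convert_label("3.8", [], "regression")
print(class_label, score_label)
```

This matches the documented InputFeatures.label type: an int for classification problems and a float for regression problems.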
Example usage¶
An example using these processors is given in the run_glue.py script.
XNLI¶
The Cross-Lingual NLI Corpus (XNLI) is a benchmark that evaluates the quality of cross-lingual text representations. XNLI is a crowd-sourced dataset based on MultiNLI (http://www.nyu.edu/projects/bowman/multinli/): pairs of text are labeled with textual entailment annotations for 15 different languages (including both high-resource languages such as English and low-resource languages such as Swahili).
It was released together with the paper XNLI: Evaluating Cross-lingual Sentence Representations.
- This library hosts the processor to load the XNLI data:
XnliProcessor
Please note that since the gold labels are available on the test set, evaluation is performed on the test set.
An example using this processor is given in the run_xnli.py script.
SQuAD¶
The Stanford Question Answering Dataset (SQuAD) is a benchmark that evaluates the performance of models on question answering. Two versions are available, v1.1 and v2.0. The first version (v1.1) was released together with the paper SQuAD: 100,000+ Questions for Machine Comprehension of Text. The second version (v2.0) was released alongside the paper Know What You Don’t Know: Unanswerable Questions for SQuAD.
This library hosts a processor for each of the two versions:
Processors¶
- Those processors are:
SquadV1Processor
SquadV2Processor
They both inherit from the abstract class SquadProcessor.
- class transformers.data.processors.squad.SquadProcessor
Processor for the SQuAD data set. Overridden by SquadV1Processor and SquadV2Processor, used by version 1.1 and version 2.0 of SQuAD, respectively.
- get_dev_examples(data_dir, filename=None)
Returns the evaluation examples from the data directory.
- Parameters
data_dir – Directory containing the data files used for training and evaluating.
filename – None by default. Specify this if the evaluation file has a different name than the original one, which is dev-v1.1.json for SQuAD version 1.1 and dev-v2.0.json for version 2.0.
- get_examples_from_dataset(dataset, evaluate=False)
Creates a list of SquadExample using a TFDS dataset.
- Parameters
dataset – The tfds dataset loaded from tensorflow_datasets.load("squad")
evaluate – Boolean specifying if in evaluation mode or in training mode
- Returns
List of SquadExample
Examples:
>>> import tensorflow_datasets as tfds
>>> dataset = tfds.load("squad")
>>> processor = SquadV1Processor()
>>> training_examples = processor.get_examples_from_dataset(dataset, evaluate=False)
>>> evaluation_examples = processor.get_examples_from_dataset(dataset, evaluate=True)
- get_train_examples(data_dir, filename=None)
Returns the training examples from the data directory.
- Parameters
data_dir – Directory containing the data files used for training and evaluating.
filename – None by default. Specify this if the training file has a different name than the original one, which is train-v1.1.json for SQuAD version 1.1 and train-v2.0.json for version 2.0.
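Reading a training or evaluation file from data_dir roughly amounts to resolving the file name and walking the nested JSON. The sketch below assumes the standard SQuAD layout (data, then paragraphs, then qas). DEFAULT_FILES and read_squad_file are hypothetical names used for illustration; the library's processors build SquadExample objects rather than plain dictionaries.

```python
import json
import os
import tempfile
from typing import Dict, List, Optional

# Default file names per split and SQuAD version, as listed in the parameter descriptions.
DEFAULT_FILES = {
    ("train", "1.1"): "train-v1.1.json",
    ("dev", "1.1"): "dev-v1.1.json",
    ("train", "2.0"): "train-v2.0.json",
    ("dev", "2.0"): "dev-v2.0.json",
}


def read_squad_file(
    data_dir: str, split: str, version: str, filename: Optional[str] = None
) -> List[Dict]:
    """Hypothetical reader: flatten a SQuAD JSON file into one dict per question."""
    name = DEFAULT_FILES[(split, version)] if filename is None else filename
    with open(os.path.join(data_dir, name), encoding="utf-8") as reader:
        articles = json.load(reader)["data"]
    records = []
    for article in articles:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                records.append({
                    "qas_id": qa["id"],
                    "question": qa["question"],
                    "context": paragraph["context"],
                    # v2.0 marks unanswerable questions; v1.1 has no such flag.
                    "is_impossible": qa.get("is_impossible", False),
                    "answers": [a["text"] for a in qa.get("answers", [])],
                })
    return records


# Write a tiny v2.0-style file to a temporary directory and read it back.
toy = {"data": [{"title": "t", "paragraphs": [{
    "context": "Paris is in France.",
    "qas": [{"id": "q1", "question": "Where is Paris?", "is_impossible": False,
             "answers": [{"text": "France", "answer_start": 12}]}],
}]}]}
with tempfile.TemporaryDirectory() as data_dir:
    with open(os.path.join(data_dir, "dev-v2.0.json"), "w", encoding="utf-8") as f:
        json.dump(toy, f)
    records = read_squad_file(data_dir, "dev", "2.0")
print(records[0]["qas_id"], records[0]["answers"])
```

Passing filename overrides the default name, which is the role the filename parameter plays in get_train_examples and get_dev_examples.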
Additionally, the following method can be used to convert SQuAD examples into SquadFeatures
that can be used as model inputs.
- squad.squad_convert_examples_to_features(examples, tokenizer, max_seq_length, doc_stride, max_query_length, is_training, padding_strategy='max_length', return_dataset=False, threads=1, tqdm_enabled=True)
Converts a list of examples into a list of features that can be directly given as input to a model. It is model-dependent and takes advantage of many of the tokenizer’s features to create the model’s inputs.
- Parameters
examples – list of SquadExample
tokenizer – an instance of a child of PreTrainedTokenizer
max_seq_length – The maximum sequence length of the inputs.
doc_stride – The stride used when the context is too large and is split across several features.
max_query_length – The maximum length of the query.
is_training – whether to create features for model training (True) or model evaluation (False).
padding_strategy – Defaults to “max_length”. Which padding strategy to use.
return_dataset – Defaults to False. Either ‘pt’ or ‘tf’. If ‘pt’: returns a torch.utils.data.TensorDataset; if ‘tf’: returns a tf.data.Dataset.
threads – multiple processing threads.
- Returns
list of SquadFeatures
Example:
processor = SquadV2Processor()
examples = processor.get_dev_examples(data_dir)
features = squad_convert_examples_to_features(
    examples=examples,
    tokenizer=tokenizer,
    max_seq_length=args.max_seq_length,
    doc_stride=args.doc_stride,
    max_query_length=args.max_query_length,
    is_training=not evaluate,
)
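To illustrate what max_seq_length and doc_stride control, here is a simplified sketch of splitting a long context into overlapping windows. It assumes doc_stride is the step between window starts (the convention of the original BERT SQuAD script) and ignores the query and special tokens that a real conversion also has to budget for. split_into_windows is a hypothetical helper, not part of the library.

```python
from typing import List, Tuple


def split_into_windows(
    tokens: List[str], max_len: int, doc_stride: int
) -> List[Tuple[int, List[str]]]:
    """Hypothetical helper: return (start_offset, window_tokens) pairs."""
    if max_len <= 0 or doc_stride <= 0:
        raise ValueError("max_len and doc_stride must be positive")
    windows = []
    start = 0
    while start < len(tokens):
        window = tokens[start:start + max_len]
        windows.append((start, window))
        if start + len(window) == len(tokens):
            # The last window reached the end of the context.
            break
        # Advance by the stride, so consecutive windows overlap
        # whenever doc_stride is smaller than max_len.
        start += min(len(window), doc_stride)
    return windows


context = "the quick brown fox jumps over the lazy dog".split()
windows = split_into_windows(context, max_len=5, doc_stride=3)
for start, window in windows:
    print(start, window)
```

Each window would become one feature, so an answer cut off at a window boundary can still appear whole in a neighbouring window.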
These processors, as well as the aforementioned method, can be used with files containing the data as well as with the tensorflow_datasets package. Examples are given below.
Example usage¶
Here is an example using the processors as well as the conversion method using data files:
Example:
from transformers.data.processors.squad import SquadV1Processor, SquadV2Processor, squad_convert_examples_to_features

# Loading a V2 processor
processor = SquadV2Processor()
examples = processor.get_dev_examples(squad_v2_data_dir)
# Loading a V1 processor
processor = SquadV1Processor()
examples = processor.get_dev_examples(squad_v1_data_dir)
# Convert the examples loaded last (here, the V1 ones) into features
features = squad_convert_examples_to_features(
    examples=examples,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length,
    doc_stride=doc_stride,
    max_query_length=max_query_length,
    is_training=not evaluate,
)
Using tensorflow_datasets is as easy as using a data file:
Example:
import tensorflow_datasets as tfds

# tensorflow_datasets only handles SQuAD V1.
tfds_examples = tfds.load("squad")
examples = SquadV1Processor().get_examples_from_dataset(tfds_examples, evaluate=evaluate)
features = squad_convert_examples_to_features(
    examples=examples,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length,
    doc_stride=doc_stride,
    max_query_length=max_query_length,
    is_training=not evaluate,
)
Another example using these processors is given in the run_squad.py script.