Transformers documentation

Processors

Transformers

You are viewing v4.21.1 version. A newer version v4.57.1 is available.

Join the Hugging Face community

and get access to the augmented documentation experience

Collaborate on models, datasets and Spaces

Faster examples with accelerated inference

Switch between documentation themes

to get started

Processors

Processors can mean two different things in the Transformers library:

the objects that pre-process inputs for multi-modal models such as Wav2Vec2 (speech and text) or CLIP (text and vision)
deprecated objects that were used in older versions of the library to preprocess data for GLUE or SQUAD.

Multi-modal processors

Any multi-modal model will require an object to encode or decode the data that groups several modalities (among text, vision and audio). This is handled by objects called processors, which group tokenizers (for the text modality) and feature extractors (for vision and audio).

Those processors inherit from the following base class that implements the saving and loading functionality:

class transformers.ProcessorMixin

< source >

( *args **kwargs )

This is a mixin used to provide saving/loading functionality for all processor classes.

from_pretrained

< source >

( pretrained_model_name_or_path **kwargs )

Parameters

pretrained_model_name_or_path (str or os.PathLike) — This can be either:
- a string, the model id of a pretrained feature_extractor hosted inside a model repo on huggingface.co. Valid model ids can be located at the root-level, like bert-base-uncased, or namespaced under a user or organization name, like dbmdz/bert-base-german-cased.
- a path to a directory containing a feature extractor file saved using the save_pretrained() method, e.g., ./my_model_directory/.
- a path or url to a saved feature extractor JSON file, e.g., ./my_model_directory/preprocessor_config.json. **kwargs — Additional keyword arguments passed along to both from_pretrained() and from_pretrained.

Instantiate a processor associated with a pretrained model.

This class method is simply calling the feature extractor from_pretrained() and the tokenizer from_pretrained methods. Please refer to the docstrings of the methods above for more information.

push_to_hub

< source >

( repo_path_or_name: typing.Optional[str] = None repo_url: typing.Optional[str] = None use_temp_dir: bool = False commit_message: typing.Optional[str] = None organization: typing.Optional[str] = None private: typing.Optional[bool] = None use_auth_token: typing.Union[bool, str, NoneType] = None max_shard_size: typing.Union[int, str, NoneType] = '10GB' **model_card_kwargs ) → str

Parameters

repo_path_or_name (str, optional) — Can either be a repository name for your processor in the Hub or a path to a local folder (in which case the repository will have the name of that local folder). If not specified, will default to the name given by repo_url and a local directory with that name will be created.
repo_url (str, optional) — Specify this in case you want to push to an existing repository in the hub. If unspecified, a new repository will be created in your namespace (unless you specify an organization) with repo_name.
use_temp_dir (bool, optional, defaults to False) — Whether or not to clone the distant repo in a temporary directory or in repo_path_or_name inside the current working directory. This will slow things down if you are making changes in an existing repo since you will need to clone the repo before every push.
commit_message (str, optional) — Message to commit while pushing. Will default to "add processor".
organization (str, optional) — Organization in which you want to push your processor (you must be a member of this organization).
private (bool, optional) — Whether or not the repository created should be private (requires a paying subscription).
use_auth_token (bool or str, optional) — The token to use as HTTP bearer authorization for remote files. If True, will use the token generated when running transformers-cli login (stored in ~/.huggingface). Will default to True if repo_url is not specified.

Returns

str

The url of the commit of your processor in the given repository.

Upload the processor files to the 🤗 Model Hub while synchronizing a local clone of the repo in repo_path_or_name.

Examples:

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("bert-base-cased")

# Push the processor to your namespace with the name "my-finetuned-bert" and have a local clone in the
# *my-finetuned-bert* folder.
processor.push_to_hub("my-finetuned-bert")

# Push the processor to your namespace with the name "my-finetuned-bert" with no local clone.
processor.push_to_hub("my-finetuned-bert", use_temp_dir=True)

# Push the processor to an organization with the name "my-finetuned-bert" and have a local clone in the
# *my-finetuned-bert* folder.
processor.push_to_hub("my-finetuned-bert", organization="huggingface")

# Make a change to an existing repo that has been cloned locally in *my-finetuned-bert*.
processor.push_to_hub("my-finetuned-bert", repo_url="https://huggingface.co/sgugger/my-finetuned-bert")

register_for_auto_class

< source >

( auto_class = 'AutoProcessor' )

Parameters

auto_class (str or type, optional, defaults to "AutoProcessor") — The auto class to register this new feature extractor with.

Register this class with a given auto class. This should only be used for custom feature extractors as the ones in the library are already mapped with AutoProcessor.

This API is experimental and may have some slight breaking changes in the next releases.

save_pretrained

< source >

( save_directory push_to_hub: bool = False **kwargs )

Parameters

save_directory (str or os.PathLike) — Directory where the feature extractor JSON file and the tokenizer files will be saved (directory will be created if it does not exist).
push_to_hub (bool, optional, defaults to False) — Whether or not to push your processor to the Hugging Face model hub after saving it.

Using push_to_hub=True will synchronize the repository you are pushing to with save_directory, which requires save_directory to be a local clone of the repo you are pushing to if it’s an existing folder. Pass along temp_dir=True to use a temporary directory instead.

kwargs — Additional key word arguments passed along to the push_to_hub() method.

Saves the attributes of this processor (feature extractor, tokenizer…) in the specified directory so that it can be reloaded using the from_pretrained() method.

This class method is simply calling save_pretrained() and save_pretrained. Please refer to the docstrings of the methods above for more information.

Deprecated processors

All processors follow the same architecture which is that of the DataProcessor. The processor returns a list of InputExample. These InputExample can be converted to InputFeatures in order to be fed to the model.

class transformers.DataProcessor

< source >

( )

Base class for data converters for sequence classification data sets.

get_dev_examples

< source >

( data_dir )

Gets a collection of InputExample for the dev set.

get_example_from_tensor_dict

< source >

( tensor_dict )

Gets an example from a dict with tensorflow tensors.

get_labels

< source >

( )

Gets the list of labels for this data set.

get_test_examples

< source >

( data_dir )

Gets a collection of InputExample for the test set.

get_train_examples

< source >

( data_dir )

Gets a collection of InputExample for the train set.

tfds_map

< source >

( example )

Some tensorflow_datasets datasets are not formatted the same way the GLUE datasets are. This method converts examples to the correct format.

class transformers.InputExample

< source >

( guid: str text_a: str text_b: typing.Optional[str] = None label: typing.Optional[str] = None )

A single training/test example for simple sequence classification.

to_json_string

< source >

( )

Serializes this instance to a JSON string.

class transformers.InputFeatures

< source >

( input_ids: typing.List[int] attention_mask: typing.Optional[typing.List[int]] = None token_type_ids: typing.Optional[typing.List[int]] = None label: typing.Union[int, float, NoneType] = None )

A single set of features of data. Property names are the same names as the corresponding inputs to a model.

to_json_string

< source >

( )

Serializes this instance to a JSON string.

GLUE

General Language Understanding Evaluation (GLUE) is a benchmark that evaluates the performance of models across a diverse set of existing NLU tasks. It was released together with the paper GLUE: A multi-task benchmark and analysis platform for natural language understanding

This library hosts a total of 10 processors for the following tasks: MRPC, MNLI, MNLI (mismatched), CoLA, SST2, STSB, QQP, QNLI, RTE and WNLI.

Those processors are:

MrpcProcessor
MnliProcessor
MnliMismatchedProcessor
Sst2Processor
StsbProcessor
QqpProcessor
QnliProcessor
RteProcessor
WnliProcessor

Additionally, the following method can be used to load values from a data file and convert them to a list of InputExample.

transformers.glue_convert_examples_to_features

< source >

( examples: typing.Union[typing.List[transformers.data.processors.utils.InputExample], ForwardRef('tf.data.Dataset')] tokenizer: PreTrainedTokenizer max_length: typing.Optional[int] = None task = None label_list = None output_mode = None )

Loads a data file into a list of InputFeatures

XNLI

The Cross-Lingual NLI Corpus (XNLI) is a benchmark that evaluates the quality of cross-lingual text representations. XNLI is crowd-sourced dataset based on MultiNLI: pairs of text are labeled with textual entailment annotations for 15 different languages (including both high-resource language such as English and low-resource languages such as Swahili).

It was released together with the paper XNLI: Evaluating Cross-lingual Sentence Representations

This library hosts the processor to load the XNLI data:

XnliProcessor

Please note that since the gold labels are available on the test set, evaluation is performed on the test set.

An example using these processors is given in the run_xnli.py script.

SQuAD

The Stanford Question Answering Dataset (SQuAD) is a benchmark that evaluates the performance of models on question answering. Two versions are available, v1.1 and v2.0. The first version (v1.1) was released together with the paper SQuAD: 100,000+ Questions for Machine Comprehension of Text. The second version (v2.0) was released alongside the paper Know What You Don’t Know: Unanswerable Questions for SQuAD.

This library hosts a processor for each of the two versions:

Processors

Those processors are:

SquadV1Processor
SquadV2Processor

They both inherit from the abstract class SquadProcessor

class transformers.data.processors.squad.SquadProcessor

< source >

( )

Processor for the SQuAD data set. overridden by SquadV1Processor and SquadV2Processor, used by the version 1.1 and version 2.0 of SQuAD, respectively.

get_dev_examples

< source >

( data_dir filename = None )

Returns the evaluation example from the data directory.

get_examples_from_dataset

< source >

( dataset evaluate = False )

Creates a list of SquadExample using a TFDS dataset.

Examples:

>>> import tensorflow_datasets as tfds

>>> dataset = tfds.load("squad")

>>> training_examples = get_examples_from_dataset(dataset, evaluate=False)
>>> evaluation_examples = get_examples_from_dataset(dataset, evaluate=True)

get_train_examples

< source >

( data_dir filename = None )

Returns the training examples from the data directory.

Additionally, the following method can be used to convert SQuAD examples into SquadFeatures that can be used as model inputs.

transformers.squad_convert_examples_to_features

< source >

( examples tokenizer max_seq_length doc_stride max_query_length is_training padding_strategy = 'max_length' return_dataset = False threads = 1 tqdm_enabled = True )

Converts a list of examples into a list of features that can be directly given as input to a model. It is model-dependant and takes advantage of many of the tokenizer’s features to create the model’s inputs.

Example:

processor = SquadV2Processor()
examples = processor.get_dev_examples(data_dir)

features = squad_convert_examples_to_features(
    examples=examples,
    tokenizer=tokenizer,
    max_seq_length=args.max_seq_length,
    doc_stride=args.doc_stride,
    max_query_length=args.max_query_length,
    is_training=not evaluate,
)

These processors as well as the aforementionned method can be used with files containing the data as well as with the tensorflow_datasets package. Examples are given below.

Example usage

Here is an example using the processors as well as the conversion method using data files:

# Loading a V2 processor
processor = SquadV2Processor()
examples = processor.get_dev_examples(squad_v2_data_dir)

# Loading a V1 processor
processor = SquadV1Processor()
examples = processor.get_dev_examples(squad_v1_data_dir)

features = squad_convert_examples_to_features(
    examples=examples,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length,
    doc_stride=args.doc_stride,
    max_query_length=max_query_length,
    is_training=not evaluate,
)

Using tensorflow_datasets is as easy as using a data file:

# tensorflow_datasets only handle Squad V1.
tfds_examples = tfds.load("squad")
examples = SquadV1Processor().get_examples_from_dataset(tfds_examples, evaluate=evaluate)

features = squad_convert_examples_to_features(
    examples=examples,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length,
    doc_stride=args.doc_stride,
    max_query_length=max_query_length,
    is_training=not evaluate,
)

Another example using these processors is given in the run_squad.py script.

←Pipelines Tokenizer→