Transformers documentation

Processors

Transformers

You are viewing v4.43.3 version. A newer version v4.56.2 is available.

Join the Hugging Face community

and get access to the augmented documentation experience

Collaborate on models, datasets and Spaces

Faster examples with accelerated inference

Switch between documentation themes

to get started

Processors

Transformers ライブラリでは、プロセッサは 2 つの異なる意味を持ちます。

Wav2Vec2 などのマルチモーダルモデルの入力を前処理するオブジェクト (音声とテキスト) または CLIP (テキストとビジョン)
古いバージョンのライブラリで GLUE または SQUAD のデータを前処理するために使用されていたオブジェクトは非推奨になりました。

Multi-modal processors

マルチモーダルモデルでは、オブジェクトが複数のモダリティ (テキスト、視覚と音声）。これは、2 つ以上の処理オブジェクトをグループ化するプロセッサーと呼ばれるオブジェクトによって処理されます。トークナイザー (テキストモダリティ用)、画像プロセッサー (視覚用)、特徴抽出器 (オーディオ用) など。

これらのプロセッサは、保存およびロード機能を実装する次の基本クラスを継承します。

class transformers.ProcessorMixin

< source >

( *args **kwargs )

This is a mixin used to provide saving/loading functionality for all processor classes.

apply_chat_template

< source >

( conversation: List chat_template: Optional = None tokenize: bool = False **kwargs )

Parameters

conversation (List[Dict, str, str]) — The conversation to format.
chat_template (Optional[str], optional) — The Jinja template to use for formatting the conversation. If not provided, the default chat template is used.
tokenize (bool, optional, defaults to False) — Whether to tokenize the output or not. **kwargs — Additional keyword arguments

Similar to the apply_chat_template method on tokenizers, this method applies a Jinja template to input conversations to turn them into a single tokenizable string.

from_args_and_dict

< source >

( args processor_dict: Dict **kwargs ) → ~processing_utils.ProcessingMixin

Parameters

processor_dict (Dict[str, Any]) — Dictionary that will be used to instantiate the processor object. Such a dictionary can be retrieved from a pretrained checkpoint by leveraging the ~processing_utils.ProcessingMixin.to_dict method.
kwargs (Dict[str, Any]) — Additional parameters from which to initialize the processor object.

Returns

~processing_utils.ProcessingMixin

The processor object instantiated from those parameters.

Instantiates a type of ~processing_utils.ProcessingMixin from a Python dictionary of parameters.

from_pretrained

< source >

( pretrained_model_name_or_path: Union cache_dir: Union = None force_download: bool = False local_files_only: bool = False token: Union = None revision: str = 'main' **kwargs )

Parameters

pretrained_model_name_or_path (str or os.PathLike) — This can be either:
- a string, the model id of a pretrained feature_extractor hosted inside a model repo on huggingface.co.
- a path to a directory containing a feature extractor file saved using the save_pretrained() method, e.g., ./my_model_directory/.
- a path or url to a saved feature extractor JSON file, e.g., ./my_model_directory/preprocessor_config.json. **kwargs — Additional keyword arguments passed along to both from_pretrained() and ~tokenization_utils_base.PreTrainedTokenizer.from_pretrained.

Instantiate a processor associated with a pretrained model.

This class method is simply calling the feature extractor from_pretrained(), image processor ImageProcessingMixin and the tokenizer ~tokenization_utils_base.PreTrainedTokenizer.from_pretrained methods. Please refer to the docstrings of the methods above for more information.

get_processor_dict

< source >

( pretrained_model_name_or_path: Union **kwargs ) → Tuple[Dict, Dict]

Parameters

pretrained_model_name_or_path (str or os.PathLike) — The identifier of the pre-trained checkpoint from which we want the dictionary of parameters.
subfolder (str, optional, defaults to "") — In case the relevant files are located inside a subfolder of the model repo on huggingface.co, you can specify the folder name here.

Returns

Tuple[Dict, Dict]

The dictionary(ies) that will be used to instantiate the processor object.

From a pretrained_model_name_or_path, resolve to a dictionary of parameters, to be used for instantiating a processor of type ~processing_utils.ProcessingMixin using from_args_and_dict.

push_to_hub

< source >

( repo_id: str use_temp_dir: Optional = None commit_message: Optional = None private: Optional = None token: Union = None max_shard_size: Union = '5GB' create_pr: bool = False safe_serialization: bool = True revision: str = None commit_description: str = None tags: Optional = None **deprecated_kwargs )

Parameters

repo_id (str) — The name of the repository you want to push your processor to. It should contain your organization name when pushing to a given organization.
use_temp_dir (bool, optional) — Whether or not to use a temporary directory to store the files saved before they are pushed to the Hub. Will default to True if there is no directory named like repo_id, False otherwise.
commit_message (str, optional) — Message to commit while pushing. Will default to "Upload processor".
private (bool, optional) — Whether or not the repository created should be private.
token (bool or str, optional) — The token to use as HTTP bearer authorization for remote files. If True, will use the token generated when running huggingface-cli login (stored in ~/.huggingface). Will default to True if repo_url is not specified.
max_shard_size (int or str, optional, defaults to "5GB") — Only applicable for models. The maximum size for a checkpoint before being sharded. Checkpoints shard will then be each of size lower than this size. If expressed as a string, needs to be digits followed by a unit (like "5MB"). We default it to "5GB" so that users can easily load models on free-tier Google Colab instances without any CPU OOM issues.
create_pr (bool, optional, defaults to False) — Whether or not to create a PR with the uploaded files or directly commit.
safe_serialization (bool, optional, defaults to True) — Whether or not to convert the model weights in safetensors format for safer serialization.
revision (str, optional) — Branch to push the uploaded files to.
commit_description (str, optional) — The description of the commit that will be created
tags (List[str], optional) — List of tags to push on the Hub.

Upload the processor files to the 🤗 Model Hub.

Examples:

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("google-bert/bert-base-cased")

# Push the processor to your namespace with the name "my-finetuned-bert".
processor.push_to_hub("my-finetuned-bert")

# Push the processor to an organization with the name "my-finetuned-bert".
processor.push_to_hub("huggingface/my-finetuned-bert")

register_for_auto_class

< source >

( auto_class = 'AutoProcessor' )

Parameters

auto_class (str or type, optional, defaults to "AutoProcessor") — The auto class to register this new feature extractor with.

Register this class with a given auto class. This should only be used for custom feature extractors as the ones in the library are already mapped with AutoProcessor.

This API is experimental and may have some slight breaking changes in the next releases.

save_pretrained

< source >

( save_directory push_to_hub: bool = False **kwargs )

Parameters

save_directory (str or os.PathLike) — Directory where the feature extractor JSON file and the tokenizer files will be saved (directory will be created if it does not exist).
push_to_hub (bool, optional, defaults to False) — Whether or not to push your model to the Hugging Face model hub after saving it. You can specify the repository you want to push to with repo_id (will default to the name of save_directory in your namespace).
kwargs (Dict[str, Any], optional) — Additional key word arguments passed along to the push_to_hub() method.

Saves the attributes of this processor (feature extractor, tokenizer…) in the specified directory so that it can be reloaded using the from_pretrained() method.

This class method is simply calling save_pretrained() and save_pretrained(). Please refer to the docstrings of the methods above for more information.

to_dict

< source >

( ) → Dict[str, Any]

Returns

Dict[str, Any]

Dictionary of all the attributes that make up this processor instance.

Serializes this instance to a Python dictionary.

to_json_file

< source >

( json_file_path: Union )

Parameters

json_file_path (str or os.PathLike) — Path to the JSON file in which this processor instance’s parameters will be saved.

Save this instance to a JSON file.

to_json_string

< source >

( ) → str

Returns

str

String containing all the attributes that make up this feature_extractor instance in JSON format.

Serializes this instance to a JSON string.

Deprecated processors

すべてのプロセッサは、同じアーキテクチャに従っています。 DataProcessor。プロセッサは次のリストを返します。 InputExample。これら InputExample は次のように変換できます。 ~data.processors.utils.Input features をモデルにフィードします。

class transformers.DataProcessor

< source >

( )

Base class for data converters for sequence classification data sets.

get_dev_examples

< source >

( data_dir )

Gets a collection of InputExample for the dev set.

get_example_from_tensor_dict

< source >

( tensor_dict )

Gets an example from a dict with tensorflow tensors.

get_labels

< source >

( )

Gets the list of labels for this data set.

get_test_examples

< source >

( data_dir )

Gets a collection of InputExample for the test set.

get_train_examples

< source >

( data_dir )

Gets a collection of InputExample for the train set.

tfds_map

< source >

( example )

Some tensorflow_datasets datasets are not formatted the same way the GLUE datasets are. This method converts examples to the correct format.

class transformers.InputExample

< source >

( guid: str text_a: str text_b: Optional = None label: Optional = None )

A single training/test example for simple sequence classification.

to_json_string

< source >

( )

Serializes this instance to a JSON string.

class transformers.InputFeatures

< source >

( input_ids: List attention_mask: Optional = None token_type_ids: Optional = None label: Union = None )

A single set of features of data. Property names are the same names as the corresponding inputs to a model.

to_json_string

< source >

( )

Serializes this instance to a JSON string.

GLUE

一般言語理解評価 (GLUE) は、既存の NLU タスクの多様なセットにわたるモデルのパフォーマンス。紙と同時発売された GLUE: A 自然言語理解のためのマルチタスクベンチマークおよび分析プラットフォーム

このライブラリは、MRPC、MNLI、MNLI (不一致)、CoLA、SST2、STSB、 QQP、QNLI、RTE、WNLI。

それらのプロセッサは次のとおりです。

~data.processors.utils.MrpcProcessor
~data.processors.utils.MnliProcessor
~data.processors.utils.MnliMismatchedProcessor
~data.processors.utils.Sst2Processor
~data.processors.utils.StsbProcessor
~data.processors.utils.QqpProcessor
~data.processors.utils.QnliProcessor
~data.processors.utils.RteProcessor
~data.processors.utils.WnliProcessor

さらに、次のメソッドを使用して、データファイルから値をロードし、それらをリストに変換することができます。 InputExample。

transformers.glue_convert_examples_to_features

< source >

( examples: Union tokenizer: PreTrainedTokenizer max_length: Optional = None task = None label_list = None output_mode = None )

Loads a data file into a list of InputFeatures

XNLI

クロスリンガル NLI コーパス (XNLI) は、言語を超えたテキスト表現の品質。 XNLI は、MultiNLI に基づくクラウドソースのデータセットです。テキストのペアには、15 個のテキスト含意アノテーションがラベル付けされています。さまざまな言語 (英語などの高リソース言語とスワヒリ語などの低リソース言語の両方を含む)。

論文 XNLI: Evaluating Cross-lingual Sentence Representations と同時にリリースされました。

このライブラリは、XNLI データをロードするプロセッサをホストします。

~data.processors.utils.XnliProcessor

テストセットにはゴールドラベルが付いているため、評価はテストセットで行われますのでご了承ください。

これらのプロセッサを使用する例は、run_xnli.py スクリプトに示されています。

SQuAD

The Stanford Question Answering Dataset (SQuAD) は、次のベンチマークです。質問応答に関するモデルのパフォーマンスを評価します。 v1.1 と v2.0 の 2 つのバージョンが利用可能です。最初のバージョン (v1.1) は、論文 SQuAD: 100,000+ question for Machine Comprehension of Text とともにリリースされました。 2 番目のバージョン (v2.0) は、論文 Know What You Don’t と同時にリリースされました。知っておくべき: SQuAD の答えられない質問。

このライブラリは、次の 2 つのバージョンのそれぞれのプロセッサをホストします。

Processors

それらのプロセッサは次のとおりです。

~data.processors.utils.SquadV1Processor
~data.processors.utils.SquadV2Processor

どちらも抽象クラス ~data.processors.utils.SquadProcessor を継承しています。

class transformers.data.processors.squad.SquadProcessor

< source >

( )

Processor for the SQuAD data set. overridden by SquadV1Processor and SquadV2Processor, used by the version 1.1 and version 2.0 of SQuAD, respectively.

get_dev_examples

< source >

( data_dir filename = None )

Returns the evaluation example from the data directory.

get_examples_from_dataset

< source >

( dataset evaluate = False )

Creates a list of SquadExample using a TFDS dataset.

Examples:

>>> import tensorflow_datasets as tfds

>>> dataset = tfds.load("squad")

>>> training_examples = get_examples_from_dataset(dataset, evaluate=False)
>>> evaluation_examples = get_examples_from_dataset(dataset, evaluate=True)

get_train_examples

< source >

( data_dir filename = None )

Returns the training examples from the data directory.

さらに、次のメソッドを使用して、SQuAD の例を次の形式に変換できます。モデルの入力として使用できる ~data.processors.utils.SquadFeatures。

transformers.squad_convert_examples_to_features

< source >

( examples tokenizer max_seq_length doc_stride max_query_length is_training padding_strategy = 'max_length' return_dataset = False threads = 1 tqdm_enabled = True )

Converts a list of examples into a list of features that can be directly given as input to a model. It is model-dependant and takes advantage of many of the tokenizer’s features to create the model’s inputs.

Example:

processor = SquadV2Processor()
examples = processor.get_dev_examples(data_dir)

features = squad_convert_examples_to_features(
    examples=examples,
    tokenizer=tokenizer,
    max_seq_length=args.max_seq_length,
    doc_stride=args.doc_stride,
    max_query_length=args.max_query_length,
    is_training=not evaluate,
)

これらのプロセッサと前述の方法は、データを含むファイルだけでなく、 tensorflow_datasets パッケージ。以下に例を示します。

Example usage

以下にプロセッサを使用した例と、データファイルを使用した変換方法を示します。

# Loading a V2 processor
processor = SquadV2Processor()
examples = processor.get_dev_examples(squad_v2_data_dir)

# Loading a V1 processor
processor = SquadV1Processor()
examples = processor.get_dev_examples(squad_v1_data_dir)

features = squad_convert_examples_to_features(
    examples=examples,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length,
    doc_stride=args.doc_stride,
    max_query_length=max_query_length,
    is_training=not evaluate,
)

tensorflow_datasets の使用は、データファイルを使用するのと同じくらい簡単です。

# tensorflow_datasets only handle Squad V1.
tfds_examples = tfds.load("squad")
examples = SquadV1Processor().get_examples_from_dataset(tfds_examples, evaluate=evaluate)

features = squad_convert_examples_to_features(
    examples=examples,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length,
    doc_stride=args.doc_stride,
    max_query_length=max_query_length,
    is_training=not evaluate,
)

これらのプロセッサを使用する別の例は、run_squad.py スクリプトに示されています。

< > Update on GitHub

←パイプライン量子化→