Processors
在 Transformers 库中,processors可以有两种不同的含义:
多模态processors
任何多模态模型都需要一个对象来编码或解码将多个模态(包括文本、视觉和音频)组合在一起的数据。这由称为processors的对象处理,这些processors将两个或多个处理对象组合在一起,例如tokenizers(用于文本模态),image processors(用于视觉)和feature extractors(用于音频)。
这些processors继承自以下实现保存和加载功能的基类:
This is a mixin used to provide saving/loading functionality for all processor classes.
apply_chat_template
< source >( conversation: typing.List[typing.Dict[str, str]] chat_template: typing.Optional[str] = None tokenize: bool = False **kwargs )
Parameters
- conversation (
List[Dict, str, str]
) — The conversation to format. - chat_template (
Optional[str]
, optional) — The Jinja template to use for formatting the conversation. If not provided, the tokenizer’s chat template is used. - tokenize (
bool
, optional, defaults toFalse
) — Whether to tokenize the output or not. - **kwargs — Additional keyword arguments
Similar to the apply_chat_template
method on tokenizers, this method applies a Jinja template to input
conversations to turn them into a single tokenizable string.
from_args_and_dict
< source >( args processor_dict: typing.Dict[str, typing.Any] **kwargs ) → ~processing_utils.ProcessingMixin
Parameters
- processor_dict (
Dict[str, Any]
) — Dictionary that will be used to instantiate the processor object. Such a dictionary can be retrieved from a pretrained checkpoint by leveraging the~processing_utils.ProcessingMixin.to_dict
method. - kwargs (
Dict[str, Any]
) — Additional parameters from which to initialize the processor object.
Returns
~processing_utils.ProcessingMixin
The processor object instantiated from those parameters.
Instantiates a type of ~processing_utils.ProcessingMixin
from a Python dictionary of parameters.
from_pretrained
< source >( pretrained_model_name_or_path: typing.Union[str, os.PathLike] cache_dir: typing.Union[str, os.PathLike, NoneType] = None force_download: bool = False local_files_only: bool = False token: typing.Union[str, bool, NoneType] = None revision: str = 'main' **kwargs )
Parameters
- pretrained_model_name_or_path (
str
oros.PathLike
) — This can be either:- a string, the model id of a pretrained feature_extractor hosted inside a model repo on huggingface.co.
- a path to a directory containing a feature extractor file saved using the
save_pretrained() method, e.g.,
./my_model_directory/
. - a path or url to a saved feature extractor JSON file, e.g.,
./my_model_directory/preprocessor_config.json
.
- **kwargs —
Additional keyword arguments passed along to both
from_pretrained() and
~tokenization_utils_base.PreTrainedTokenizer.from_pretrained
.
Instantiate a processor associated with a pretrained model.
This class method is simply calling the feature extractor
from_pretrained(), image processor
ImageProcessingMixin and the tokenizer
~tokenization_utils_base.PreTrainedTokenizer.from_pretrained
methods. Please refer to the docstrings of the
methods above for more information.
get_processor_dict
< source >( pretrained_model_name_or_path: typing.Union[str, os.PathLike] **kwargs ) → Tuple[Dict, Dict]
Parameters
- pretrained_model_name_or_path (
str
oros.PathLike
) — The identifier of the pre-trained checkpoint from which we want the dictionary of parameters. - subfolder (
str
, optional, defaults to""
) — In case the relevant files are located inside a subfolder of the model repo on huggingface.co, you can specify the folder name here.
Returns
Tuple[Dict, Dict]
The dictionary(ies) that will be used to instantiate the processor object.
From a pretrained_model_name_or_path
, resolve to a dictionary of parameters, to be used for instantiating a
processor of type ~processing_utils.ProcessingMixin
using from_args_and_dict
.
post_process_image_text_to_text
< source >( generated_outputs ) → List[str]
Post-process the output of a vlm to decode the text.
Matches optional positional arguments to their corresponding names in optional_call_args
in the processor class in the order they are passed to the processor call.
Note that this should only be used in the __call__
method of the processors with special
arguments. Special arguments are arguments that aren’t text
, images
, audio
, nor videos
but also aren’t passed to the tokenizer, image processor, etc. Examples of such processors are:
CLIPSegProcessor
LayoutLMv2Processor
OwlViTProcessor
Also note that passing by position to the processor call is now deprecated and will be disallowed in future versions. We only have this for backward compatibility.
Example:
Suppose that the processor class has optional_call_args = ["arg_name_1", "arg_name_2"]
.
And we define the call method as:
def __call__(
self,
text: str,
images: Optional[ImageInput] = None,
*arg,
audio=None,
videos=None,
)
push_to_hub
< source >( repo_id: str use_temp_dir: typing.Optional[bool] = None commit_message: typing.Optional[str] = None private: typing.Optional[bool] = None token: typing.Union[bool, str, NoneType] = None max_shard_size: typing.Union[int, str, NoneType] = '5GB' create_pr: bool = False safe_serialization: bool = True revision: str = None commit_description: str = None tags: typing.Optional[typing.List[str]] = None **deprecated_kwargs )
Parameters
- repo_id (
str
) — The name of the repository you want to push your processor to. It should contain your organization name when pushing to a given organization. - use_temp_dir (
bool
, optional) — Whether or not to use a temporary directory to store the files saved before they are pushed to the Hub. Will default toTrue
if there is no directory named likerepo_id
,False
otherwise. - commit_message (
str
, optional) — Message to commit while pushing. Will default to"Upload processor"
. - private (
bool
, optional) — Whether to make the repo private. IfNone
(default), the repo will be public unless the organization’s default is private. This value is ignored if the repo already exists. - token (
bool
orstr
, optional) — The token to use as HTTP bearer authorization for remote files. IfTrue
, will use the token generated when runninghuggingface-cli login
(stored in~/.huggingface
). Will default toTrue
ifrepo_url
is not specified. - max_shard_size (
int
orstr
, optional, defaults to"5GB"
) — Only applicable for models. The maximum size for a checkpoint before being sharded. Checkpoints shard will then be each of size lower than this size. If expressed as a string, needs to be digits followed by a unit (like"5MB"
). We default it to"5GB"
so that users can easily load models on free-tier Google Colab instances without any CPU OOM issues. - create_pr (
bool
, optional, defaults toFalse
) — Whether or not to create a PR with the uploaded files or directly commit. - safe_serialization (
bool
, optional, defaults toTrue
) — Whether or not to convert the model weights in safetensors format for safer serialization. - revision (
str
, optional) — Branch to push the uploaded files to. - commit_description (
str
, optional) — The description of the commit that will be created - tags (
List[str]
, optional) — List of tags to push on the Hub.
Upload the processor files to the 🤗 Model Hub.
Examples:
from transformers import AutoProcessor
processor = AutoProcessor.from_pretrained("google-bert/bert-base-cased")
# Push the processor to your namespace with the name "my-finetuned-bert".
processor.push_to_hub("my-finetuned-bert")
# Push the processor to an organization with the name "my-finetuned-bert".
processor.push_to_hub("huggingface/my-finetuned-bert")
register_for_auto_class
< source >( auto_class = 'AutoProcessor' )
Register this class with a given auto class. This should only be used for custom feature extractors as the ones
in the library are already mapped with AutoProcessor
.
This API is experimental and may have some slight breaking changes in the next releases.
save_pretrained
< source >( save_directory push_to_hub: bool = False **kwargs )
Parameters
- save_directory (
str
oros.PathLike
) — Directory where the feature extractor JSON file and the tokenizer files will be saved (directory will be created if it does not exist). - push_to_hub (
bool
, optional, defaults toFalse
) — Whether or not to push your model to the Hugging Face model hub after saving it. You can specify the repository you want to push to withrepo_id
(will default to the name ofsave_directory
in your namespace). - kwargs (
Dict[str, Any]
, optional) — Additional key word arguments passed along to the push_to_hub() method.
Saves the attributes of this processor (feature extractor, tokenizer…) in the specified directory so that it can be reloaded using the from_pretrained() method.
This class method is simply calling save_pretrained() and save_pretrained(). Please refer to the docstrings of the methods above for more information.
to_dict
< source >( ) → Dict[str, Any]
Returns
Dict[str, Any]
Dictionary of all the attributes that make up this processor instance.
Serializes this instance to a Python dictionary.
to_json_file
< source >( json_file_path: typing.Union[str, os.PathLike] )
Save this instance to a JSON file.
to_json_string
< source >( ) → str
Returns
str
String containing all the attributes that make up this feature_extractor instance in JSON format.
Serializes this instance to a JSON string.
已弃用的processors
所有processor都遵循与 DataProcessor 相同的架构。processor返回一个 InputExample 列表。这些 InputExample 可以转换为 InputFeatures 以供输送到模型。
Base class for data converters for sequence classification data sets.
Gets a collection of InputExample for the dev set.
get_example_from_tensor_dict
< source >( tensor_dict )
Gets an example from a dict with tensorflow tensors.
Gets the list of labels for this data set.
Gets a collection of InputExample for the test set.
Gets a collection of InputExample for the train set.
Some tensorflow_datasets datasets are not formatted the same way the GLUE datasets are. This method converts examples to the correct format.
class transformers.InputExample
< source >( guid: str text_a: str text_b: typing.Optional[str] = None label: typing.Optional[str] = None )
Parameters
- guid — Unique id for the example.
- text_a — string. The untokenized text of the first sequence. For single sequence tasks, only this sequence must be specified.
- text_b — (Optional) string. The untokenized text of the second sequence. Only must be specified for sequence pair tasks.
- label — (Optional) string. The label of the example. This should be specified for train and dev examples, but not for test examples.
A single training/test example for simple sequence classification.
Serializes this instance to a JSON string.
class transformers.InputFeatures
< source >( input_ids: typing.List[int] attention_mask: typing.Optional[typing.List[int]] = None token_type_ids: typing.Optional[typing.List[int]] = None label: typing.Union[int, float, NoneType] = None )
Parameters
- input_ids — Indices of input sequence tokens in the vocabulary.
- attention_mask — Mask to avoid performing attention on padding token indices.
Mask values selected in
[0, 1]
: Usually1
for tokens that are NOT MASKED,0
for MASKED (padded) tokens. - token_type_ids — (Optional) Segment token indices to indicate first and second portions of the inputs. Only some models use them.
- label — (Optional) Label corresponding to the input. Int for classification problems, float for regression problems.
A single set of features of data. Property names are the same names as the corresponding inputs to a model.
Serializes this instance to a JSON string.
GLUE
General Language Understanding Evaluation (GLUE) 是一个基准测试,评估模型在各种现有的自然语言理解任务上的性能。它与论文 GLUE: A multi-task benchmark and analysis platform for natural language understanding 一同发布。
该库为以下任务提供了总共10个processor:MRPC、MNLI、MNLI(mismatched)、CoLA、SST2、STSB、QQP、QNLI、RTE 和 WNLI。
这些processor是:
~data.processors.utils.MrpcProcessor
~data.processors.utils.MnliProcessor
~data.processors.utils.MnliMismatchedProcessor
~data.processors.utils.Sst2Processor
~data.processors.utils.StsbProcessor
~data.processors.utils.QqpProcessor
~data.processors.utils.QnliProcessor
~data.processors.utils.RteProcessor
~data.processors.utils.WnliProcessor
此外,还可以使用以下方法从数据文件加载值并将其转换为 InputExample 列表。
transformers.glue_convert_examples_to_features
< source >( examples: typing.Union[typing.List[transformers.data.processors.utils.InputExample], ForwardRef('tf.data.Dataset')] tokenizer: PreTrainedTokenizer max_length: typing.Optional[int] = None task = None label_list = None output_mode = None )
Parameters
- examples — List of
InputExamples
ortf.data.Dataset
containing the examples. - tokenizer — Instance of a tokenizer that will tokenize the examples
- max_length — Maximum example length. Defaults to the tokenizer’s max_len
- task — GLUE task
- label_list — List of labels. Can be obtained from the processor using the
processor.get_labels()
method - output_mode — String indicating the output mode. Either
regression
orclassification
Loads a data file into a list of InputFeatures
XNLI
跨语言NLI语料库(XNLI) 是一个评估跨语言文本表示质量的基准测试。XNLI是一个基于MultiNLI的众包数据集:”文本对“被标记为包含15种不同语言(包括英语等高资源语言和斯瓦希里语等低资源语言)的文本蕴涵注释。
它与论文 XNLI: Evaluating Cross-lingual Sentence Representations 一同发布。
该库提供了加载XNLI数据的processor:
~data.processors.utils.XnliProcessor
请注意,由于测试集上有“gold”标签,因此评估是在测试集上进行的。
使用这些processor的示例在 run_xnli.py 脚本中提供。
SQuAD
斯坦福问答数据集(SQuAD) 是一个评估模型在问答上性能的基准测试。有两个版本,v1.1 和 v2.0。第一个版本(v1.1)与论文 SQuAD: 100,000+ Questions for Machine Comprehension of Text 一同发布。第二个版本(v2.0)与论文 Know What You Don’t Know: Unanswerable Questions for SQuAD 一同发布。
该库为两个版本各自提供了一个processor:
Processors
这两个processor是:
~data.processors.utils.SquadV1Processor
~data.processors.utils.SquadV2Processor
它们都继承自抽象类 ~data.processors.utils.SquadProcessor
。
Processor for the SQuAD data set. overridden by SquadV1Processor and SquadV2Processor, used by the version 1.1 and version 2.0 of SQuAD, respectively.
get_dev_examples
< source >( data_dir filename = None )
Returns the evaluation example from the data directory.
get_examples_from_dataset
< source >( dataset evaluate = False )
Creates a list of SquadExample
using a TFDS dataset.
get_train_examples
< source >( data_dir filename = None )
Returns the training examples from the data directory.
此外,可以使用以下方法将 SQuAD 示例转换为可用作模型输入的 ~data.processors.utils.SquadFeatures
。
transformers.squad_convert_examples_to_features
< source >( examples tokenizer max_seq_length doc_stride max_query_length is_training padding_strategy = 'max_length' return_dataset = False threads = 1 tqdm_enabled = True )
Parameters
- examples — list of
SquadExample
- tokenizer — an instance of a child of PreTrainedTokenizer
- max_seq_length — The maximum sequence length of the inputs.
- doc_stride — The stride used when the context is too large and is split across several features.
- max_query_length — The maximum length of the query.
- is_training — whether to create features for model evaluation or model training.
- padding_strategy — Default to “max_length”. Which padding strategy to use
- return_dataset — Default False. Either ‘pt’ or ‘tf’. if ‘pt’: returns a torch.data.TensorDataset, if ‘tf’: returns a tf.data.Dataset
- threads — multiple processing threads.
Converts a list of examples into a list of features that can be directly given as input to a model. It is model-dependant and takes advantage of many of the tokenizer’s features to create the model’s inputs.
Example:
processor = SquadV2Processor()
examples = processor.get_dev_examples(data_dir)
features = squad_convert_examples_to_features(
examples=examples,
tokenizer=tokenizer,
max_seq_length=args.max_seq_length,
doc_stride=args.doc_stride,
max_query_length=args.max_query_length,
is_training=not evaluate,
)
这些processor以及前面提到的方法可以与包含数据的文件以及tensorflow_datasets包一起使用。下面给出了示例。
Example使用
以下是使用processor以及使用数据文件的转换方法的示例:
# Loading a V2 processor
processor = SquadV2Processor()
examples = processor.get_dev_examples(squad_v2_data_dir)
# Loading a V1 processor
processor = SquadV1Processor()
examples = processor.get_dev_examples(squad_v1_data_dir)
features = squad_convert_examples_to_features(
examples=examples,
tokenizer=tokenizer,
max_seq_length=max_seq_length,
doc_stride=args.doc_stride,
max_query_length=max_query_length,
is_training=not evaluate,
)
使用 tensorflow_datasets 就像使用数据文件一样简单:
# tensorflow_datasets only handle Squad V1.
tfds_examples = tfds.load("squad")
examples = SquadV1Processor().get_examples_from_dataset(tfds_examples, evaluate=evaluate)
features = squad_convert_examples_to_features(
examples=examples,
tokenizer=tokenizer,
max_seq_length=max_seq_length,
doc_stride=args.doc_stride,
max_query_length=max_query_length,
is_training=not evaluate,
)
另一个使用这些processor的示例在 run_squad.py 脚本中提供。
< > Update on GitHub