Data Collator
Data collators are objects that will form a batch by using a list of dataset elements as input. These elements are of the same type as the elements of train_dataset or eval_dataset.

To be able to build batches, data collators may apply some processing (like padding). Some of them (like DataCollatorForLanguageModeling) also apply some random data augmentation (like random masking) on the formed batch.

Examples of use can be found in the example scripts or example notebooks.
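Concretely, a data collator is a callable that turns a list of dataset elements into a batch of tensors, so it can be used as the collate_fn of a PyTorch DataLoader or passed as the data_collator argument of Trainer. Below is a minimal sketch; the checkpoint name and the toy sentences are only placeholders, not something prescribed by the library.

    from torch.utils.data import DataLoader
    from transformers import AutoTokenizer
    from transformers.data.data_collator import DataCollatorWithPadding

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder checkpoint
    collator = DataCollatorWithPadding(tokenizer=tokenizer)

    # A "dataset" is simply a sequence of dict-like elements, here produced by the tokenizer.
    dataset = [tokenizer(text) for text in ["A short example.", "A slightly longer example sentence."]]

    # The collator maps a list of such elements to a padded batch of tensors.
    loader = DataLoader(dataset, batch_size=2, collate_fn=collator)
    batch = next(iter(loader))
    print(batch["input_ids"].shape)  # (2, length of the longest sequence in the batch)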
Default data collator

transformers.data.data_collator.default_data_collator(features: List[InputDataClass]) → Dict[str, torch.Tensor]

Very simple data collator that simply collates batches of dict-like objects and performs special handling for potential keys named:

- label: handles a single value (int or float) per object
- label_ids: handles a list of values per object

Does not do any additional preprocessing: property names of the input object will be used as the corresponding inputs to the model. See the glue and ner examples for how it is useful.
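As a small sketch of the behavior described above, the hand-written features below all have the same sequence length (this collator does no padding itself), and the label key is collected into a labels tensor; the token ids are arbitrary placeholders.

    from transformers.data.data_collator import default_data_collator

    # Arbitrary, equal-length token ids: default_data_collator stacks values as-is, without padding.
    features = [
        {"input_ids": [101, 2023, 2003, 102], "attention_mask": [1, 1, 1, 1], "label": 0},
        {"input_ids": [101, 2008, 2205, 102], "attention_mask": [1, 1, 1, 1], "label": 1},
    ]

    batch = default_data_collator(features)
    print(batch["input_ids"].shape)  # torch.Size([2, 4])
    print(batch["labels"])           # tensor([0, 1]); the "label" key becomes "labels"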
DataCollatorWithPadding

class transformers.data.data_collator.DataCollatorWithPadding(tokenizer: transformers.tokenization_utils_base.PreTrainedTokenizerBase, padding: Union[bool, str, transformers.file_utils.PaddingStrategy] = True, max_length: Optional[int] = None, pad_to_multiple_of: Optional[int] = None)

Data collator that will dynamically pad the inputs received.

Parameters

- tokenizer (PreTrainedTokenizer or PreTrainedTokenizerFast) – The tokenizer used for encoding the data.
- padding (bool, str or PaddingStrategy, optional, defaults to True) – Select a strategy to pad the returned sequences (according to the model’s padding side and padding index) among:
  - True or 'longest': Pad to the longest sequence in the batch (or no padding if only a single sequence is provided).
  - 'max_length': Pad to a maximum length specified with the argument max_length, or to the maximum acceptable input length for the model if that argument is not provided.
  - False or 'do_not_pad': No padding (i.e., can output a batch with sequences of different lengths).
- max_length (int, optional) – Maximum length of the returned list and optionally padding length (see above).
- pad_to_multiple_of (int, optional) – If set, will pad the sequence to a multiple of the provided value. This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.5 (Volta).
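A minimal sketch of dynamic padding with this collator; the checkpoint name and sentences are placeholders, and pad_to_multiple_of=8 is only shown to illustrate the Tensor Core note above.

    from transformers import AutoTokenizer
    from transformers.data.data_collator import DataCollatorWithPadding

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder checkpoint
    collator = DataCollatorWithPadding(tokenizer=tokenizer, padding="longest", pad_to_multiple_of=8)

    features = [tokenizer("short"), tokenizer("a noticeably longer sentence that needs more tokens")]
    batch = collator(features)

    # Both tensors are padded to the longest sequence in the batch, rounded up to a multiple of 8.
    print(batch["input_ids"].shape)
    print(batch["attention_mask"].shape)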
DataCollatorForTokenClassification

class transformers.data.data_collator.DataCollatorForTokenClassification(tokenizer: transformers.tokenization_utils_base.PreTrainedTokenizerBase, padding: Union[bool, str, transformers.file_utils.PaddingStrategy] = True, max_length: Optional[int] = None, pad_to_multiple_of: Optional[int] = None, label_pad_token_id: int = -100)

Data collator that will dynamically pad the inputs received, as well as the labels.

Parameters

- tokenizer (PreTrainedTokenizer or PreTrainedTokenizerFast) – The tokenizer used for encoding the data.
- padding (bool, str or PaddingStrategy, optional, defaults to True) – Select a strategy to pad the returned sequences (according to the model’s padding side and padding index) among:
  - True or 'longest': Pad to the longest sequence in the batch (or no padding if only a single sequence is provided).
  - 'max_length': Pad to a maximum length specified with the argument max_length, or to the maximum acceptable input length for the model if that argument is not provided.
  - False or 'do_not_pad': No padding (i.e., can output a batch with sequences of different lengths).
- max_length (int, optional) – Maximum length of the returned list and optionally padding length (see above).
- pad_to_multiple_of (int, optional) – If set, will pad the sequence to a multiple of the provided value. This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.5 (Volta).
- label_pad_token_id (int, optional, defaults to -100) – The id to use when padding the labels (-100 will be automatically ignored by PyTorch loss functions).
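A minimal sketch, assuming features that already contain per-token labels; the tag ids below are made up and the checkpoint name is a placeholder.

    from transformers import AutoTokenizer
    from transformers.data.data_collator import DataCollatorForTokenClassification

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder checkpoint
    collator = DataCollatorForTokenClassification(tokenizer=tokenizer, label_pad_token_id=-100)

    # Toy features: input_ids are placeholder token ids, labels are made-up per-token tag ids
    # (with -100 already placed on the special tokens).
    features = [
        {"input_ids": [101, 7592, 2088, 102], "labels": [-100, 1, 2, -100]},
        {"input_ids": [101, 7592, 102], "labels": [-100, 1, -100]},
    ]

    batch = collator(features)
    # input_ids are padded with the tokenizer's pad token, labels with -100,
    # so the padded label positions are ignored by PyTorch loss functions.
    print(batch["input_ids"])
    print(batch["labels"])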
DataCollatorForSeq2Seq

class transformers.data.data_collator.DataCollatorForSeq2Seq(tokenizer: transformers.tokenization_utils_base.PreTrainedTokenizerBase, model: Optional[transformers.modeling_utils.PreTrainedModel] = None, padding: Union[bool, str, transformers.file_utils.PaddingStrategy] = True, max_length: Optional[int] = None, pad_to_multiple_of: Optional[int] = None, label_pad_token_id: int = -100)

Data collator that will dynamically pad the inputs received, as well as the labels.

Parameters

- tokenizer (PreTrainedTokenizer or PreTrainedTokenizerFast) – The tokenizer used for encoding the data.
- model (PreTrainedModel) – The model that is being trained. If set and the model has a prepare_decoder_input_ids_from_labels method, it is used to prepare the decoder_input_ids. This is useful when using label_smoothing to avoid calculating the loss twice.
- padding (bool, str or PaddingStrategy, optional, defaults to True) – Select a strategy to pad the returned sequences (according to the model’s padding side and padding index) among:
  - True or 'longest': Pad to the longest sequence in the batch (or no padding if only a single sequence is provided).
  - 'max_length': Pad to a maximum length specified with the argument max_length, or to the maximum acceptable input length for the model if that argument is not provided.
  - False or 'do_not_pad': No padding (i.e., can output a batch with sequences of different lengths).
- max_length (int, optional) – Maximum length of the returned list and optionally padding length (see above).
- pad_to_multiple_of (int, optional) – If set, will pad the sequence to a multiple of the provided value. This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.5 (Volta).
- label_pad_token_id (int, optional, defaults to -100) – The id to use when padding the labels (-100 will be automatically ignored by PyTorch loss functions).
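A minimal sketch with toy translation pairs; the t5-small checkpoint and the sentences are placeholders. Passing the model is optional, but when it defines prepare_decoder_input_ids_from_labels the collator also adds decoder_input_ids to the batch, as described above.

    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
    from transformers.data.data_collator import DataCollatorForSeq2Seq

    tokenizer = AutoTokenizer.from_pretrained("t5-small")            # placeholder checkpoint
    model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
    collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model, label_pad_token_id=-100)

    # Toy encoder/decoder features: "labels" holds the tokenized target sequence.
    pairs = [
        ("translate English to German: Hello.", "Hallo."),
        ("translate English to German: How are you today?", "Wie geht es dir heute?"),
    ]
    features = []
    for source, target in pairs:
        feature = tokenizer(source)
        feature["labels"] = tokenizer(target)["input_ids"]
        features.append(feature)

    batch = collator(features)
    # input_ids are padded with the pad token and labels with -100; if the model defines
    # prepare_decoder_input_ids_from_labels, a decoder_input_ids entry is added as well.
    print(list(batch.keys()))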
DataCollatorForLanguageModeling

class transformers.data.data_collator.DataCollatorForLanguageModeling(tokenizer: transformers.tokenization_utils_base.PreTrainedTokenizerBase, mlm: bool = True, mlm_probability: float = 0.15, pad_to_multiple_of: Optional[int] = None)

Data collator used for language modeling. Inputs are dynamically padded to the maximum length of a batch if they are not all of the same length.

Parameters

- tokenizer (PreTrainedTokenizer or PreTrainedTokenizerFast) – The tokenizer used for encoding the data.
- mlm (bool, optional, defaults to True) – Whether or not to use masked language modeling. If set to False, the labels are the same as the inputs with the padding tokens ignored (by setting them to -100). Otherwise, the labels are -100 for non-masked tokens and the value to predict for the masked token.
- mlm_probability (float, optional, defaults to 0.15) – The probability with which to (randomly) mask tokens in the input, when mlm is set to True.
- pad_to_multiple_of (int, optional) – If set, will pad the sequence to a multiple of the provided value.

Note

For best performance, this data collator should be used with a dataset having items that are dictionaries or BatchEncoding, with the "special_tokens_mask" key, as returned by a PreTrainedTokenizer or a PreTrainedTokenizerFast with the argument return_special_tokens_mask=True.
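A minimal sketch following the note above; the checkpoint name and sentences are placeholders.

    from transformers import AutoTokenizer
    from transformers.data.data_collator import DataCollatorForLanguageModeling

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder checkpoint
    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

    # return_special_tokens_mask=True lets the collator avoid masking [CLS]/[SEP]/[PAD].
    texts = [
        "Masked language modeling randomly replaces some tokens.",
        "The collator builds the labels automatically.",
    ]
    features = [tokenizer(text, return_special_tokens_mask=True) for text in texts]

    batch = collator(features)
    # Roughly 15% of the non-special tokens are selected; labels hold the original token id
    # at those positions and -100 everywhere else.
    print(batch["input_ids"].shape, batch["labels"].shape)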
DataCollatorForWholeWordMask

class transformers.data.data_collator.DataCollatorForWholeWordMask(tokenizer: transformers.tokenization_utils_base.PreTrainedTokenizerBase, mlm: bool = True, mlm_probability: float = 0.15, pad_to_multiple_of: Optional[int] = None)

Data collator used for language modeling that masks entire words.

- collates batches of tensors, honoring their tokenizer’s pad_token
- preprocesses batches for masked language modeling

Note

This collator relies on details of the implementation of subword tokenization by BertTokenizer, specifically that subword tokens are prefixed with ##. For tokenizers that do not adhere to this scheme, this collator will produce an output that is roughly equivalent to DataCollatorForLanguageModeling.

mask_tokens(inputs: torch.Tensor, mask_labels: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor]

Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original. Because this collator performs whole word masking (wwm), the positions to mask are not sampled here: they are given by mask_labels, which is built from the word-level reference of each sequence.
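A minimal sketch, assuming a BERT-style tokenizer so that the ## subword convention mentioned in the note applies; the checkpoint and sentences are placeholders.

    from transformers import BertTokenizer
    from transformers.data.data_collator import DataCollatorForWholeWordMask

    # BertTokenizer marks subword pieces with the "##" prefix, which is what this collator relies on.
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # placeholder checkpoint
    collator = DataCollatorForWholeWordMask(tokenizer=tokenizer, mlm_probability=0.15)

    texts = [
        "Tokenization often splits uncommon words into several subword pieces.",
        "Whole word masking masks all pieces of a word together.",
    ]
    features = [{"input_ids": tokenizer(text)["input_ids"]} for text in texts]

    batch = collator(features)
    # When a word is chosen for masking, all of its "##" subword pieces are masked with it.
    print(batch["input_ids"].shape, batch["labels"].shape)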
DataCollatorForPermutationLanguageModeling

class transformers.data.data_collator.DataCollatorForPermutationLanguageModeling(tokenizer: transformers.tokenization_utils_base.PreTrainedTokenizerBase, plm_probability: float = 0.16666666666666666, max_span_length: int = 5)

Data collator used for permutation language modeling.

- collates batches of tensors, honoring their tokenizer’s pad_token
- preprocesses batches for permutation language modeling with procedures specific to XLNet

mask_tokens(inputs: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]

The masked tokens to be predicted for a particular sequence are determined by the following algorithm:

1. Start from the beginning of the sequence by setting cur_len = 0 (the number of tokens processed so far).
2. Sample a span_length from the interval [1, max_span_length] (the length of the span of tokens to be masked).
3. Reserve a context of length context_length = span_length / plm_probability to surround the span to be masked.
4. Sample a starting point start_index from the interval [cur_len, cur_len + context_length - span_length] and mask tokens start_index:start_index + span_length.
5. Set cur_len = cur_len + context_length. If cur_len < max_len (i.e. there are tokens remaining in the sequence to be processed), repeat from step 2.
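A minimal sketch; the XLNet checkpoint is a placeholder and the input_ids are arbitrary token ids, chosen with an even length because this collator rejects odd-length sequences.

    from transformers import XLNetTokenizer
    from transformers.data.data_collator import DataCollatorForPermutationLanguageModeling

    tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")  # placeholder checkpoint
    collator = DataCollatorForPermutationLanguageModeling(
        tokenizer=tokenizer, plm_probability=1 / 6, max_span_length=5
    )

    # Arbitrary token ids of even length (16 each); the even-length requirement comes from
    # the collator itself, which raises an error on odd-length sequences.
    features = [{"input_ids": list(range(10, 26))}, {"input_ids": list(range(40, 56))}]

    batch = collator(features)
    # The batch holds the masked input_ids together with the XLNet-specific perm_mask,
    # target_mapping and labels tensors produced by mask_tokens.
    print(sorted(batch.keys()))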