Custom Layers and Utilities

This page lists all the custom layers used by the library, as well as the utility functions it provides for modeling.

Most of these are useful only if you are studying the code of the models in the library.

PyTorch custom modules

class transformers.modeling_utils.Conv1D(nf, nx)

1D-convolutional layer as defined by Radford et al. for OpenAI GPT (and also used in GPT-2).

Basically works like a linear layer but the weights are transposed.

Parameters
  • nf (int) – The number of output features.

  • nx (int) – The number of input features.
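
The difference from a linear layer is only the weight layout. The following is a minimal framework-free sketch, assuming the Conv1D weight is stored with shape (nx, nf) while a standard linear layer stores (nf, nx). The helper names are hypothetical, and plain Python lists stand in for tensors.

```python
def linear_forward(x, weight, bias):
    """Standard linear layer: weight has shape (nf, nx), y = x @ weight.T + bias."""
    return [
        [sum(xi * wi for xi, wi in zip(row, w_out)) + b for w_out, b in zip(weight, bias)]
        for row in x
    ]


def conv1d_forward(x, weight, bias):
    """Conv1D-style layer: weight has shape (nx, nf), y = x @ weight + bias."""
    nf = len(bias)
    return [
        [sum(row[i] * weight[i][j] for i in range(len(row))) + bias[j] for j in range(nf)]
        for row in x
    ]


x = [[1.0, 2.0, 3.0]]                              # one input row, nx = 3
w_linear = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]      # shape (nf=2, nx=3)
w_conv1d = [list(col) for col in zip(*w_linear)]   # transposed: shape (nx=3, nf=2)
bias = [0.5, -0.5]

y_linear = linear_forward(x, w_linear, bias)
y_conv1d = conv1d_forward(x, w_conv1d, bias)
```

Both calls produce the same output, which is the sense in which the layer "works like a linear layer".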

class transformers.modeling_utils.PoolerStartLogits(config: transformers.configuration_utils.PretrainedConfig)

Compute SQuAD start logits from sequence hidden states.

Parameters

config (PretrainedConfig) – The config used by the model. It is used to read the hidden_size of the model.

forward(hidden_states: torch.FloatTensor, p_mask: Optional[torch.FloatTensor] = None) → torch.FloatTensor
Parameters
  • hidden_states (torch.FloatTensor of shape (batch_size, seq_len, hidden_size)) – The final hidden states of the model.

  • p_mask (torch.FloatTensor of shape (batch_size, seq_len), optional) – Mask for tokens at invalid position, such as query and special symbols (PAD, SEP, CLS). 1.0 means token should be masked.

Returns

The start logits for SQuAD.

Return type

torch.FloatTensor
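
The effect of p_mask can be illustrated with a small sketch. It assumes that masked positions are pushed to a very large negative logit so that they receive essentially zero probability after a softmax. The constant 1e30 and the helper names are illustrative choices for this sketch, not a statement of the library's exact code.

```python
import math


def apply_p_mask(logits, p_mask, large_negative=1e30):
    """Push masked positions (p_mask == 1.0) to a very negative logit."""
    return [x * (1.0 - m) - large_negative * m for x, m in zip(logits, p_mask)]


def softmax(logits):
    """Numerically stable softmax over a flat list of logits."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 0.5, 1.0, 3.0]
p_mask = [1.0, 0.0, 0.0, 1.0]   # first and last tokens are special symbols

masked_logits = apply_p_mask(logits, p_mask)
probs = softmax(masked_logits)
```

After masking, only the two unmasked tokens can be selected as the start of the answer.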

class transformers.modeling_utils.PoolerEndLogits(config: transformers.configuration_utils.PretrainedConfig)

Compute SQuAD end logits from sequence hidden states.

Parameters

config (PretrainedConfig) – The config used by the model. It is used to read the hidden_size of the model and the layer_norm_eps to use.

forward(hidden_states: torch.FloatTensor, start_states: Optional[torch.FloatTensor] = None, start_positions: Optional[torch.LongTensor] = None, p_mask: Optional[torch.FloatTensor] = None) → torch.FloatTensor
Parameters
  • hidden_states (torch.FloatTensor of shape (batch_size, seq_len, hidden_size)) – The final hidden states of the model.

  • start_states (torch.FloatTensor of shape (batch_size, seq_len, hidden_size), optional) – The hidden states of the first tokens for the labeled span.

  • start_positions (torch.LongTensor of shape (batch_size,), optional) – The position of the first token for the labeled span.

  • p_mask (torch.FloatTensor of shape (batch_size, seq_len), optional) – Mask for tokens at invalid position, such as query and special symbols (PAD, SEP, CLS). 1.0 means token should be masked.

Note

One of start_states or start_positions must not be None. If both are set, start_positions overrides start_states.

Returns

The end logits for SQuAD.

Return type

torch.FloatTensor
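
The note above implies that start_positions can be turned into start_states by reading the hidden state at each labeled start position. The sketch below shows that step under two assumptions: the start state is repeated across the sequence, and it is concatenated with each token's hidden state before scoring. The function names are hypothetical, and nested lists stand in for tensors.

```python
def start_states_from_positions(hidden_states, start_positions):
    """For each sequence, repeat the hidden state at the start position seq_len times."""
    start_states = []
    for sequence, position in zip(hidden_states, start_positions):
        start_vector = sequence[position]
        start_states.append([list(start_vector) for _ in sequence])
    return start_states


def concat_features(hidden_states, start_states):
    """Concatenate each token's hidden state with the start state along the feature axis."""
    return [
        [token + start for token, start in zip(sequence, starts)]
        for sequence, starts in zip(hidden_states, start_states)
    ]


# batch_size = 1, seq_len = 3, hidden_size = 2
hidden_states = [[[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]]
start_positions = [1]

start_states = start_states_from_positions(hidden_states, start_positions)
features = concat_features(hidden_states, start_states)
```

Each token ends up with a feature vector of size 2 * hidden_size, so the end logits can depend on the chosen start token.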

class transformers.modeling_utils.PoolerAnswerClass(config)

Compute SQuAD 2.0 answer class from classification and start tokens hidden states.

Parameters

config (PretrainedConfig) – The config used by the model. It is used to read the hidden_size of the model.

forward(hidden_states: torch.FloatTensor, start_states: Optional[torch.FloatTensor] = None, start_positions: Optional[torch.LongTensor] = None, cls_index: Optional[torch.LongTensor] = None) → torch.FloatTensor
Parameters
  • hidden_states (torch.FloatTensor of shape (batch_size, seq_len, hidden_size)) – The final hidden states of the model.

  • start_states (torch.FloatTensor of shape (batch_size, seq_len, hidden_size), optional) – The hidden states of the first tokens for the labeled span.

  • start_positions (torch.LongTensor of shape (batch_size,), optional) – The position of the first token for the labeled span.

  • cls_index (torch.LongTensor of shape (batch_size,), optional) – Position of the CLS token for each sentence in the batch. If None, takes the last token.

Note

One of start_states or start_positions must not be None. If both are set, start_positions overrides start_states.

Returns

The SQuAD 2.0 answer class.

Return type

torch.FloatTensor

class transformers.modeling_utils.SquadHeadOutput(loss: Optional[torch.FloatTensor] = None, start_top_log_probs: Optional[torch.FloatTensor] = None, start_top_index: Optional[torch.LongTensor] = None, end_top_log_probs: Optional[torch.FloatTensor] = None, end_top_index: Optional[torch.LongTensor] = None, cls_logits: Optional[torch.FloatTensor] = None)

Base class for outputs of question answering models using a SQuADHead.

Parameters
  • loss (torch.FloatTensor of shape (1,), optional, returned if both start_positions and end_positions are provided) – Classification loss as the sum of start token, end token (and is_impossible if provided) classification losses.

  • start_top_log_probs (torch.FloatTensor of shape (batch_size, config.start_n_top), optional, returned if start_positions or end_positions is not provided) – Log probabilities for the top config.start_n_top start token possibilities (beam-search).

  • start_top_index (torch.LongTensor of shape (batch_size, config.start_n_top), optional, returned if start_positions or end_positions is not provided) – Indices for the top config.start_n_top start token possibilities (beam-search).

  • end_top_log_probs (torch.FloatTensor of shape (batch_size, config.start_n_top * config.end_n_top), optional, returned if start_positions or end_positions is not provided) – Log probabilities for the top config.start_n_top * config.end_n_top end token possibilities (beam-search).

  • end_top_index (torch.LongTensor of shape (batch_size, config.start_n_top * config.end_n_top), optional, returned if start_positions or end_positions is not provided) – Indices for the top config.start_n_top * config.end_n_top end token possibilities (beam-search).

  • cls_logits (torch.FloatTensor of shape (batch_size,), optional, returned if start_positions or end_positions is not provided) – Log probabilities for the is_impossible label of the answers.

class transformers.modeling_utils.SQuADHead(config)

A SQuAD head inspired by XLNet.

Parameters

config (PretrainedConfig) – The config used by the model. It is used to read the hidden_size of the model and the layer_norm_eps to use.

forward(hidden_states: torch.FloatTensor, start_positions: Optional[torch.LongTensor] = None, end_positions: Optional[torch.LongTensor] = None, cls_index: Optional[torch.LongTensor] = None, is_impossible: Optional[torch.LongTensor] = None, p_mask: Optional[torch.FloatTensor] = None, return_dict: bool = False) → Union[transformers.modeling_utils.SquadHeadOutput, Tuple[torch.FloatTensor]]
Parameters
  • hidden_states (torch.FloatTensor of shape (batch_size, seq_len, hidden_size)) – Final hidden states of the model on the sequence tokens.

  • start_positions (torch.LongTensor of shape (batch_size,), optional) – Positions of the first token for the labeled span.

  • end_positions (torch.LongTensor of shape (batch_size,), optional) – Positions of the last token for the labeled span.

  • cls_index (torch.LongTensor of shape (batch_size,), optional) – Position of the CLS token for each sentence in the batch. If None, takes the last token.

  • is_impossible (torch.LongTensor of shape (batch_size,), optional) – Whether the question has a possible answer in the paragraph or not.

  • p_mask (torch.FloatTensor of shape (batch_size, seq_len), optional) – Mask for tokens at invalid position, such as query and special symbols (PAD, SEP, CLS). 1.0 means token should be masked.

  • return_dict (bool, optional, defaults to False) – Whether or not to return a ModelOutput instead of a plain tuple.

Returns

A SquadHeadOutput (if return_dict=True is passed or when config.return_dict=True) or a tuple of torch.FloatTensor comprising various elements depending on the configuration (~transformers.) and inputs.

  • loss (torch.FloatTensor of shape (1,), optional, returned if both start_positions and end_positions are provided) – Classification loss as the sum of start token, end token (and is_impossible if provided) classification losses.

  • start_top_log_probs (torch.FloatTensor of shape (batch_size, config.start_n_top), optional, returned if start_positions or end_positions is not provided) – Log probabilities for the top config.start_n_top start token possibilities (beam-search).

  • start_top_index (torch.LongTensor of shape (batch_size, config.start_n_top), optional, returned if start_positions or end_positions is not provided) – Indices for the top config.start_n_top start token possibilities (beam-search).

  • end_top_log_probs (torch.FloatTensor of shape (batch_size, config.start_n_top * config.end_n_top), optional, returned if start_positions or end_positions is not provided) – Log probabilities for the top config.start_n_top * config.end_n_top end token possibilities (beam-search).

  • end_top_index (torch.LongTensor of shape (batch_size, config.start_n_top * config.end_n_top), optional, returned if start_positions or end_positions is not provided) – Indices for the top config.start_n_top * config.end_n_top end token possibilities (beam-search).

  • cls_logits (torch.FloatTensor of shape (batch_size,), optional, returned if start_positions or end_positions is not provided) – Log probabilities for the is_impossible label of the answers.

Return type

SquadHeadOutput or tuple(torch.FloatTensor)

class transformers.modeling_utils.SequenceSummary(config: transformers.configuration_utils.PretrainedConfig)

Compute a single vector summary of a sequence's hidden states.

Parameters

config (PretrainedConfig) –

The config used by the model. Relevant arguments in the config class of the model are (refer to the actual config class of your model for the default values it uses):

  • summary_type (str) – The method to use to make this summary. Accepted values are:

    • "last" – Take the last token hidden state (like XLNet)

    • "first" – Take the first token hidden state (like Bert)

    • "mean" – Take the mean of all tokens hidden states

    • "cls_index" – Supply a Tensor of classification token position (GPT/GPT-2)

    • "attn" – Not implemented now, use multi-head attention

  • summary_use_proj (bool) – Add a projection after the vector extraction.

  • summary_proj_to_labels (bool) – If True, the projection outputs to config.num_labels classes (otherwise to config.hidden_size).

  • summary_activation (Optional[str]) – Set to "tanh" to add a tanh activation to the output, another string or None will add no activation.

  • summary_first_dropout (float) – Optional dropout probability before the projection and activation.

  • summary_last_dropout (float) – Optional dropout probability after the projection and activation.

forward(hidden_states: torch.FloatTensor, cls_index: Optional[torch.LongTensor] = None) → torch.FloatTensor

Compute a single vector summary of a sequence's hidden states.

Parameters
  • hidden_states (torch.FloatTensor of shape [batch_size, seq_len, hidden_size]) – The hidden states of the last layer.

  • cls_index (torch.LongTensor of shape [batch_size] or [batch_size, ...] where … are optional leading dimensions of hidden_states, optional) – Used if summary_type == "cls_index". If not provided, the last token of the sequence is taken as the classification token.

Returns

The summary of the sequence's hidden states.

Return type

torch.FloatTensor
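
The vector extraction step for the implemented summary_type values can be sketched as follows. This is a simplified illustration on a single sequence stored as a list of lists. It leaves out the optional projection, activation, and dropout, and the function name is hypothetical.

```python
def summarize(hidden_states, summary_type, cls_index=None):
    """Reduce one sequence of shape (seq_len, hidden_size) to a single vector."""
    if summary_type == "last":
        return hidden_states[-1]
    if summary_type == "first":
        return hidden_states[0]
    if summary_type == "mean":
        seq_len = len(hidden_states)
        return [sum(column) / seq_len for column in zip(*hidden_states)]
    if summary_type == "cls_index":
        # Fall back to the last token when no position is supplied.
        position = len(hidden_states) - 1 if cls_index is None else cls_index
        return hidden_states[position]
    raise NotImplementedError(f"Unsupported summary_type: {summary_type}")


hidden_states = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # seq_len = 3, hidden_size = 2

last_vector = summarize(hidden_states, "last")
mean_vector = summarize(hidden_states, "mean")
cls_vector = summarize(hidden_states, "cls_index", cls_index=1)
```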

PyTorch Helper Functions

transformers.apply_chunking_to_forward(forward_fn: Callable[…, torch.Tensor], chunk_size: int, chunk_dim: int, *input_tensors) → torch.Tensor

This function chunks the input_tensors into smaller input tensor parts of size chunk_size over the dimension chunk_dim. It then applies a layer forward_fn to each chunk independently to save memory.

If forward_fn is independent across chunk_dim, this function yields the same result as applying forward_fn directly to input_tensors.

Parameters
  • forward_fn (Callable[..., torch.Tensor]) – The forward function of the model.

  • chunk_size (int) – The chunk size of a chunked tensor: num_chunks = len(input_tensors[0]) / chunk_size.

  • chunk_dim (int) – The dimension over which the input_tensors should be chunked.

  • input_tensors (Tuple[torch.Tensor]) – The input tensors of forward_fn which will be chunked.

Returns

A tensor with the same shape as forward_fn would have produced if applied directly to input_tensors.

Return type

torch.Tensor

Examples:

# rename the usual forward() fn to forward_chunk()
def forward_chunk(self, hidden_states):
    hidden_states = self.decoder(hidden_states)
    return hidden_states

# implement a chunked forward function
def forward(self, hidden_states):
    return apply_chunking_to_forward(self.forward_chunk, self.chunk_size_lm_head, self.seq_len_dim, hidden_states)
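
The independence condition can be checked on a toy case. The sketch below is framework-free: it chunks a plain Python list along its only dimension, and all names in it are hypothetical. Treating a non-positive chunk_size as "no chunking" and requiring the length to be a multiple of chunk_size are choices made for this sketch.

```python
def apply_chunking(forward_fn, chunk_size, inputs):
    """Apply forward_fn to consecutive chunks of a list and concatenate the results."""
    if chunk_size <= 0:
        return forward_fn(inputs)
    if len(inputs) % chunk_size != 0:
        raise ValueError("len(inputs) must be a multiple of chunk_size")
    outputs = []
    for start in range(0, len(inputs), chunk_size):
        outputs.extend(forward_fn(inputs[start:start + chunk_size]))
    return outputs


def double_each(values):
    """Position-wise function: each output depends only on its own input."""
    return [2 * v for v in values]


def subtract_chunk_mean(values):
    """Not position-wise: each output depends on every value it is called with."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


data = [1.0, 2.0, 3.0, 4.0]

chunked_ok = apply_chunking(double_each, 2, data)
chunked_differs = apply_chunking(subtract_chunk_mean, 2, data)
```

Chunking leaves double_each unchanged but changes the result of subtract_chunk_mean, because the latter mixes information across positions.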

transformers.modeling_utils.find_pruneable_heads_and_indices(heads: List[int], n_heads: int, head_size: int, already_pruned_heads: Set[int]) → Tuple[Set[int], torch.LongTensor]

Finds the heads and their indices taking already_pruned_heads into account.

Parameters
  • heads (List[int]) – List of the indices of heads to prune.

  • n_heads (int) – The number of heads in the model.

  • head_size (int) – The size of each head.

  • already_pruned_heads (Set[int]) – A set of already pruned heads.

Returns

A tuple with the remaining heads and their corresponding indices.

Return type

Tuple[Set[int], torch.LongTensor]
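
The sketch below illustrates one reading of this description in plain Python. It assumes that n_heads counts the heads currently present in the layer, that the returned set holds the requested heads that were not pruned before, and that the returned indices are the flat positions to keep. The function name is hypothetical, and a list replaces the torch.LongTensor.

```python
def find_heads_and_indices(heads, n_heads, head_size, already_pruned_heads):
    """Return the heads still to prune and the flat indices of the entries to keep."""
    heads_to_prune = set(heads) - already_pruned_heads
    keep = [[True] * head_size for _ in range(n_heads)]
    for head in heads_to_prune:
        # Shift the head number down by the count of lower-numbered heads already removed.
        shifted = head - sum(1 for h in already_pruned_heads if h < head)
        keep[shifted] = [False] * head_size
    flat = [flag for row in keep for flag in row]
    index = [i for i, flag in enumerate(flat) if flag]
    return heads_to_prune, index


# Four heads of size 2, nothing pruned yet: remove heads 0 and 2.
first_pass = find_heads_and_indices([0, 2], 4, 2, set())

# Head 0 is already gone, so three heads remain and head 2 now sits in slot 1.
second_pass = find_heads_and_indices([0, 2], 3, 2, {0})
```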

transformers.modeling_utils.prune_layer(layer: Union[torch.nn.modules.linear.Linear, transformers.modeling_utils.Conv1D], index: torch.LongTensor, dim: Optional[int] = None) → Union[torch.nn.modules.linear.Linear, transformers.modeling_utils.Conv1D]

Prune a Conv1D or linear layer to keep only entries in index.

Used to remove heads.

Parameters
  • layer (Union[torch.nn.Linear, Conv1D]) – The layer to prune.

  • index (torch.LongTensor) – The indices to keep in the layer.

  • dim (int, optional) – The dimension on which to keep the indices.

Returns

The pruned layer as a new layer with requires_grad=True.

Return type

torch.nn.Linear or Conv1D

transformers.modeling_utils.prune_conv1d_layer(layer: transformers.modeling_utils.Conv1D, index: torch.LongTensor, dim: int = 1) → transformers.modeling_utils.Conv1D

Prune a Conv1D layer to keep only entries in index. A Conv1D works like a Linear layer (see e.g. BERT) but the weights are transposed.

Used to remove heads.

Parameters
  • layer (Conv1D) – The layer to prune.

  • index (torch.LongTensor) – The indices to keep in the layer.

  • dim (int, optional, defaults to 1) – The dimension on which to keep the indices.

Returns

The pruned layer as a new layer with requires_grad=True.

Return type

Conv1D

transformers.modeling_utils.prune_linear_layer(layer: torch.nn.modules.linear.Linear, index: torch.LongTensor, dim: int = 0) → torch.nn.modules.linear.Linear

Prune a linear layer to keep only entries in index.

Used to remove heads.

Parameters
  • layer (torch.nn.Linear) – The layer to prune.

  • index (torch.LongTensor) – The indices to keep in the layer.

  • dim (int, optional, defaults to 0) – The dimension on which to keep the indices.

Returns

The pruned layer as a new layer with requires_grad=True.

Return type

torch.nn.Linear
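
What "keeping only entries in index" means for the weight and the bias can be sketched without PyTorch. The example assumes the usual linear layout, with a weight of shape (out_features, in_features) and one bias entry per output feature. The function name is hypothetical and lists stand in for tensors.

```python
def prune_linear(weight, bias, index, dim=0):
    """Keep only `index` along `dim` of a weight stored as (out_features, in_features)."""
    if dim == 0:
        # Pruning output features: drop weight rows and the matching bias entries.
        new_weight = [list(weight[i]) for i in index]
        new_bias = [bias[i] for i in index]
    elif dim == 1:
        # Pruning input features: drop weight columns; the bias is unchanged.
        new_weight = [[row[i] for i in index] for row in weight]
        new_bias = list(bias)
    else:
        raise ValueError("dim must be 0 or 1")
    return new_weight, new_bias


weight = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]  # out_features=4, in_features=3
bias = [0.1, 0.2, 0.3, 0.4]

rows_kept = prune_linear(weight, bias, index=[0, 2], dim=0)
cols_kept = prune_linear(weight, bias, index=[0, 2], dim=1)
```

Because a Conv1D stores its weight transposed, the output features of a Conv1D sit on dimension 1, which is consistent with the different default dim of prune_conv1d_layer.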

TensorFlow custom layers

class transformers.modeling_tf_utils.TFConv1D(*args, **kwargs)

1D-convolutional layer as defined by Radford et al. for OpenAI GPT (and also used in GPT-2).

Basically works like a linear layer but the weights are transposed.

Parameters
  • nf (int) – The number of output features.

  • nx (int) – The number of input features.

  • initializer_range (float, optional, defaults to 0.02) – The standard deviation to use to initialize the weights.

  • kwargs – Additional keyword arguments passed along to the __init__ of tf.keras.layers.Layer.

class transformers.modeling_tf_utils.TFSharedEmbeddings(*args, **kwargs)

Construct shared token embeddings.

The weights of the embedding layer are usually shared with the weights of the linear decoder when doing language modeling.

Parameters
  • vocab_size (int) – The size of the vocabulary, i.e., the number of unique tokens.

  • hidden_size (int) – The size of the embedding vectors.

  • initializer_range (float, optional) – The standard deviation to use when initializing the weights. If no value is provided, it will default to \(1/\sqrt{hidden\_size}\).

  • kwargs – Additional keyword arguments passed along to the __init__ of tf.keras.layers.Layer.

call(inputs: tensorflow.python.framework.ops.Tensor, mode: str = 'embedding') → tensorflow.python.framework.ops.Tensor

Get token embeddings of inputs or decode final hidden state.

Parameters
  • inputs (tf.Tensor) –

    In embedding mode, should be an int64 tensor with shape [batch_size, length].

    In linear mode, should be a float tensor with shape [batch_size, length, hidden_size].

  • mode (str, defaults to "embedding") – Either "embedding" or "linear". The first indicates that the layer should be used as an embedding layer, the second that it should be used as a linear decoder.

Returns

In embedding mode, the output is a float32 embedding tensor, with shape [batch_size, length, embedding_size].

In linear mode, the output is a float32 tensor with shape [batch_size, length, vocab_size].

Return type

tf.Tensor

Raises

ValueError – if mode is not valid.

Shared weights logic is adapted from here.
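
The two modes can be pictured as two uses of one weight matrix of shape [vocab_size, hidden_size]: a row lookup in embedding mode and a product with the transposed matrix in linear mode. The class below is a hypothetical, framework-free stand-in that uses nested lists. It is meant only to show the shapes involved.

```python
class SharedEmbeddingsSketch:
    """Hypothetical stand-in: one weight matrix of shape (vocab_size, hidden_size)."""

    def __init__(self, weight):
        self.weight = weight

    def __call__(self, inputs, mode="embedding"):
        if mode == "embedding":
            # inputs: token ids of shape (batch_size, length).
            return [[list(self.weight[token]) for token in sequence] for sequence in inputs]
        if mode == "linear":
            # inputs: hidden states of shape (batch_size, length, hidden_size).
            return [
                [[sum(h * w for h, w in zip(hidden, row)) for row in self.weight]
                 for hidden in sequence]
                for sequence in inputs
            ]
        raise ValueError(f"mode {mode} is not valid.")


# vocab_size = 3, hidden_size = 2
layer = SharedEmbeddingsSketch([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
embedded = layer([[2, 0]])                 # shape (1, 2, hidden_size)
logits = layer(embedded, mode="linear")    # shape (1, 2, vocab_size)
```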

class transformers.modeling_tf_utils.TFSequenceSummary(*args, **kwargs)

Compute a single vector summary of a sequence's hidden states.

Parameters
  • config (PretrainedConfig) –

    The config used by the model. Relevant arguments in the config class of the model are (refer to the actual config class of your model for the default values it uses):

    • summary_type (str) – The method to use to make this summary. Accepted values are:

      • "last" – Take the last token hidden state (like XLNet)

      • "first" – Take the first token hidden state (like Bert)

      • "mean" – Take the mean of all tokens hidden states

      • "cls_index" – Supply a Tensor of classification token position (GPT/GPT-2)

      • "attn" – Not implemented now, use multi-head attention

    • summary_use_proj (bool) – Add a projection after the vector extraction.

    • summary_proj_to_labels (bool) – If True, the projection outputs to config.num_labels classes (otherwise to config.hidden_size).

    • summary_activation (Optional[str]) – Set to "tanh" to add a tanh activation to the output, another string or None will add no activation.

    • summary_first_dropout (float) – Optional dropout probability before the projection and activation.

    • summary_last_dropout (float) – Optional dropout probability after the projection and activation.

  • initializer_range (float, defaults to 0.02) – The standard deviation to use to initialize the weights.

  • kwargs – Additional keyword arguments passed along to the __init__ of tf.keras.layers.Layer.

call(inputs, cls_index=None, training=False)

This is where the layer’s logic lives.

Note that the call() method in tf.keras is slightly different from the Keras API. In the Keras API, you can pass masking support for layers as additional arguments, whereas tf.keras has a compute_mask() method to support masking.

Parameters
  • inputs – Input tensor, or list/tuple of input tensors.

  • *args – Additional positional arguments. Currently unused.

  • **kwargs – Additional keyword arguments. Currently unused.

Returns

A tensor or list/tuple of tensors.

TensorFlow loss functions

class transformers.modeling_tf_utils.TFCausalLanguageModelingLoss

Loss function suitable for causal language modeling (CLM), that is, the task of guessing the next token.

Note

Any label of -100 will be ignored (along with the corresponding logits) in the loss computation.
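
The ignore rule in the note can be sketched in plain Python. The function below computes a cross-entropy per token and skips every position labeled -100. Returning per-token losses rather than a reduced scalar is a choice made for this sketch, and the function name is hypothetical.

```python
import math


def masked_cross_entropy(logits, labels, ignore_index=-100):
    """Per-token cross-entropy, skipping positions whose label equals ignore_index."""
    losses = []
    for token_logits, label in zip(logits, labels):
        if label == ignore_index:
            continue  # drop both the label and its logits
        top = max(token_logits)
        log_norm = top + math.log(sum(math.exp(x - top) for x in token_logits))
        losses.append(log_norm - token_logits[label])
    return losses


logits = [[2.0, 0.0], [0.0, 0.0], [5.0, -5.0]]
labels = [0, -100, 1]   # the middle token does not contribute to the loss

losses = masked_cross_entropy(logits, labels)
```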

class transformers.modeling_tf_utils.TFMaskedLanguageModelingLoss

Loss function suitable for masked language modeling (MLM), that is, the task of guessing the masked tokens.

Note

Any label of -100 will be ignored (along with the corresponding logits) in the loss computation.

class transformers.modeling_tf_utils.TFMultipleChoiceLoss

Loss function suitable for multiple choice tasks.

class transformers.modeling_tf_utils.TFQuestionAnsweringLoss

Loss function suitable for question answering.

class transformers.modeling_tf_utils.TFSequenceClassificationLoss

Loss function suitable for sequence classification.

class transformers.modeling_tf_utils.TFTokenClassificationLoss

Loss function suitable for token classification.

Note

Any label of -100 will be ignored (along with the corresponding logits) in the loss computation.

TensorFlow Helper Functions

transformers.modeling_tf_utils.get_initializer(initializer_range: float = 0.02) → tensorflow.python.keras.initializers.initializers_v2.TruncatedNormal

Creates a tf.initializers.TruncatedNormal with the given range.

Parameters

initializer_range (float, defaults to 0.02) – Standard deviation of the initializer range.

Returns

The truncated normal initializer.

Return type

tf.initializers.TruncatedNormal
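
A truncated normal initializer draws from a normal distribution and re-draws any value that falls more than two standard deviations from the mean. The sketch below imitates that behavior with the standard library only. The function name and the fixed seed are illustrative, and it is not the TensorFlow implementation.

```python
import random


def truncated_normal(n, stddev=0.02, mean=0.0, seed=0):
    """Draw n samples, re-drawing any that fall more than two stddev from the mean."""
    rng = random.Random(seed)
    samples = []
    while len(samples) < n:
        value = rng.gauss(mean, stddev)
        if abs(value - mean) <= 2 * stddev:
            samples.append(value)
    return samples


weights = truncated_normal(1000, stddev=0.02)
```

With initializer_range set to 0.02, every initial weight therefore lies within 0.04 of zero.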

transformers.modeling_tf_utils.keras_serializable(cls)

Decorate a Keras Layer class to support Keras serialization.

This is done by:

  1. Adding a transformers_config dict to the Keras config dictionary in get_config (called by Keras at serialization time).

  2. Wrapping __init__ to accept that transformers_config dict (passed by Keras at deserialization time) and convert it to a config object for the actual layer initializer.

  3. Registering the class as a custom object in Keras (if the TensorFlow version supports this), so that it does not need to be supplied in custom_objects in the call to tf.keras.models.load_model.

Parameters

cls (a tf.keras.layers.Layer subclass) – Typically a TF.MainLayer class in this project. In general, it must accept a config argument to its initializer.

Returns

The same class object, with modifications for Keras deserialization.
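
The first two steps listed above can be imitated with a plain Python class decorator. The sketch does not use Keras and skips the registration step. ToyConfig, toy_serializable, and ToyMainLayer are hypothetical names, and the way the config class is looked up (a config_class attribute) is an assumption of this sketch.

```python
import functools


class ToyConfig:
    """Hypothetical stand-in for a PretrainedConfig."""

    def __init__(self, hidden_size=8):
        self.hidden_size = hidden_size

    def to_dict(self):
        return {"hidden_size": self.hidden_size}

    @classmethod
    def from_dict(cls, values):
        return cls(**values)


def toy_serializable(cls):
    """Hypothetical decorator following steps 1 and 2 (no Keras involved)."""
    original_init = cls.__init__
    config_class = cls.config_class

    @functools.wraps(original_init)
    def wrapped_init(self, config, *args, **kwargs):
        # Step 2: accept either a config object or the dict stored at serialization time.
        if isinstance(config, dict):
            config = config_class.from_dict(config)
        original_init(self, config, *args, **kwargs)
        self._config = config

    def get_config(self):
        # Step 1: expose the config as a plain dict under "transformers_config".
        return {"transformers_config": self._config.to_dict()}

    cls.__init__ = wrapped_init
    cls.get_config = get_config
    return cls


@toy_serializable
class ToyMainLayer:
    config_class = ToyConfig

    def __init__(self, config):
        self.hidden_size = config.hidden_size


layer = ToyMainLayer(ToyConfig(hidden_size=16))
saved = layer.get_config()
restored = ToyMainLayer(saved["transformers_config"])
```

The decorated class is the same class object, so existing code that instantiates it with a config object keeps working.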

transformers.modeling_tf_utils.shape_list(tensor: tensorflow.python.framework.ops.Tensor) → List[int]

Deal with dynamic shapes in TensorFlow cleanly.

Parameters

tensor (tf.Tensor) – The tensor we want the shape of.

Returns

The shape of the tensor as a list.

Return type

List[int]
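
A common way to handle dynamic shapes is to use each dimension's static size when it is known and to fall back to the runtime size otherwise. The sketch below shows that merge on plain lists, with None marking a dimension that is unknown statically. The function name is hypothetical, and in TensorFlow the runtime entries would be scalar tensors rather than Python integers.

```python
def merge_shapes(static_shape, dynamic_shape):
    """Prefer statically known dimensions; fall back to runtime values where static is None."""
    return [
        dynamic if static is None else static
        for static, dynamic in zip(static_shape, dynamic_shape)
    ]


# Hypothetical tensor whose batch and sequence sizes are only known at run time.
static_shape = [None, None, 768]
dynamic_shape = [8, 128, 768]

shape = merge_shapes(static_shape, dynamic_shape)
```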