Encoder Decoder Models

This class can wrap an encoder model, such as BertModel and a decoder modeling with a language modeling head, such as BertForMaskedLM into a encoder-decoder model.

The EncoderDecoderModel class allows to instantiate a encoder decoder model using the from_encoder_decoder_pretrain class method taking a pretrained encoder and pretrained decoder model as an input. The EncoderDecoderModel is saved using the standard save_pretrained() method and can also again be loaded using the standard from_pretrained() method.

An application of this architecture could be summarization using two pretrained Bert models as is shown in the paper: Text Summarization with Pretrained Encoders by Yang Liu and Mirella Lapata.

EncoderDecoderConfig

class transformers.EncoderDecoderConfig(**kwargs)[source]

EncoderDecoderConfig is the configuration class to store the configuration of a EncoderDecoderModel.

It is used to instantiate an Encoder Decoder model according to the specified arguments, defining the encoder and decoder configs. Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. See the documentation for PretrainedConfig for more information.

Parameters

kwargs (optional) –

Remaining dictionary of keyword arguments. Notably:
encoder (PretrainedConfig, optional, defaults to None):

An instance of a configuration object that defines the encoder config.

decoder (PretrainedConfig, optional, defaults to None):

An instance of a configuration object that defines the decoder config.

Example:

>>> from transformers import BertConfig, EncoderDecoderConfig, EncoderDecoderModel

>>> # Initializing a BERT bert-base-uncased style configuration
>>> config_encoder = BertConfig()
>>> config_decoder = BertConfig()

>>> config = EncoderDecoderConfig.from_encoder_decoder_configs(config_encoder, config_decoder)

>>> # Initializing a Bert2Bert model from the bert-base-uncased style configurations
>>> model = EncoderDecoderModel(config=config)

>>> # Accessing the model configuration
>>> config_encoder = model.config.encoder
>>> config_decoder  = model.config.decoder
classmethod from_encoder_decoder_configs(encoder_config: transformers.configuration_utils.PretrainedConfig, decoder_config: transformers.configuration_utils.PretrainedConfig) → transformers.configuration_utils.PretrainedConfig[source]

Instantiate a EncoderDecoderConfig (or a derived class) from a pre-trained encoder model configuration and decoder model configuration.

Returns

An instance of a configuration object

Return type

EncoderDecoderConfig

to_dict()[source]

Serializes this instance to a Python dictionary. Override the default to_dict() from PretrainedConfig.

Returns

Dictionary of all the attributes that make up this configuration instance,

Return type

Dict[str, any]

EncoderDecoderModel

class transformers.EncoderDecoderModel(config: Optional[transformers.configuration_utils.PretrainedConfig] = None, encoder: Optional[transformers.modeling_utils.PreTrainedModel] = None, decoder: Optional[transformers.modeling_utils.PreTrainedModel] = None)[source]

EncoderDecoder is a generic model class that will be instantiated as a transformer architecture with one of the base model classes of the library as encoder and another one as decoder when created with the AutoModel.from_pretrained(pretrained_model_name_or_path) class method for the encoder and AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path) class method for the decoder.

config_class

alias of transformers.configuration_encoder_decoder.EncoderDecoderConfig

forward(input_ids=None, inputs_embeds=None, attention_mask=None, head_mask=None, encoder_outputs=None, decoder_input_ids=None, decoder_attention_mask=None, decoder_head_mask=None, decoder_inputs_embeds=None, labels=None, **kwargs)[source]
Parameters
  • input_ids (torch.LongTensor of shape (batch_size, sequence_length)) – Indices of input sequence tokens in the vocabulary for the encoder. Indices can be obtained using transformers.PretrainedTokenizer. See transformers.PreTrainedTokenizer.encode() and transformers.PreTrainedTokenizer.convert_tokens_to_ids() for details.

  • inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional, defaults to None) – Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.

  • attention_mask (torch.FloatTensor of shape (batch_size, sequence_length), optional, defaults to None) – Mask to avoid performing attention on padding token indices for the encoder. Mask values selected in [0, 1]: 1 for tokens that are NOT MASKED, 0 for MASKED tokens.

  • head_mask – (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional, defaults to None): Mask to nullify selected heads of the self-attention modules for the encoder. Mask values selected in [0, 1]: 1 indicates the head is not masked, 0 indicates the head is masked.

  • encoder_outputs (tuple(tuple(torch.FloatTensor), optional, defaults to None) – Tuple consists of (last_hidden_state, optional: hidden_states, optional: attentions) last_hidden_state of shape (batch_size, sequence_length, hidden_size), optional, defaults to None) is a sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention of the decoder.

  • decoder_input_ids (torch.LongTensor of shape (batch_size, target_sequence_length), optional, defaults to None) – Provide for sequence to sequence training to the decoder. Indices can be obtained using transformers.PretrainedTokenizer. See transformers.PreTrainedTokenizer.encode() and transformers.PreTrainedTokenizer.convert_tokens_to_ids() for details.

  • decoder_attention_mask (torch.BoolTensor of shape (batch_size, tgt_seq_len), optional, defaults to None) – Default behavior: generate a tensor that ignores pad tokens in decoder_input_ids. Causal mask will also be used by default.

  • decoder_head_mask – (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional, defaults to None): Mask to nullify selected heads of the self-attention modules for the decoder. Mask values selected in [0, 1]: 1 indicates the head is not masked, 0 indicates the head is masked.

  • decoder_inputs_embeds (torch.FloatTensor of shape (batch_size, target_sequence_length, hidden_size), optional, defaults to None) – Optionally, instead of passing decoder_input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert decoder_input_ids indices into associated vectors than the model’s internal embedding lookup matrix.

  • labels (torch.LongTensor of shape (batch_size, sequence_length), optional, defaults to None) – Labels for computing the masked language modeling loss for the decoder. Indices should be in [-100, 0, ..., config.vocab_size] (see input_ids docstring) Tokens with indices set to -100 are ignored (masked), the loss is only computed for the tokens with labels in [0, ..., config.vocab_size]

  • kwargs – (optional) Remaining dictionary of keyword arguments. Keyword arguments come in two flavors: - Without a prefix which will be input as **encoder_kwargs for the encoder forward function. - With a decoder_ prefix which will be input as **decoder_kwargs for the decoder forward function.

Examples:

>>> from transformers import EncoderDecoderModel, BertTokenizer
>>> import torch

>>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
>>> model = EncoderDecoderModel.from_encoder_decoder_pretrained('bert-base-uncased', 'bert-base-uncased') # initialize Bert2Bert

>>> # forward
>>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute", add_special_tokens=True)).unsqueeze(0)  # Batch size 1
>>> outputs = model(input_ids=input_ids, decoder_input_ids=input_ids)

>>> # training
>>> loss, outputs = model(input_ids=input_ids, decoder_input_ids=input_ids, labels=input_ids)[:2]

>>> # generation
>>> generated = model.generate(input_ids, decoder_start_token_id=model.config.decoder.pad_token_id)
classmethod from_encoder_decoder_pretrained(encoder_pretrained_model_name_or_path: str = None, decoder_pretrained_model_name_or_path: str = None, *model_args, **kwargs) → transformers.modeling_utils.PreTrainedModel[source]

Instantiates an encoder and a decoder from one or two base classes of the library from pre-trained model checkpoints.

The model is set in evaluation mode by default using model.eval() (Dropout modules are deactivated). To train the model, you need to first set it back in training mode with model.train().

Params:
encoder_pretrained_model_name_or_path (:obj: str, optional, defaults to None):

information necessary to initiate the encoder. Either:

  • a string with the shortcut name of a pre-trained model to load from cache or download, e.g.: bert-base-uncased.

  • a string with the identifier name of a pre-trained model that was user-uploaded to our S3, e.g.: dbmdz/bert-base-german-cased.

  • a path to a directory containing model weights saved using save_pretrained(), e.g.: ./my_model_directory/encoder.

  • a path or url to a tensorflow index checkpoint file (e.g. ./tf_model/model.ckpt.index). In this case, from_tf should be set to True and a configuration object should be provided as config argument. This loading path is slower than converting the TensorFlow checkpoint in a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards.

decoder_pretrained_model_name_or_path (:obj: str, optional, defaults to None):

information necessary to initiate the decoder. Either:

  • a string with the shortcut name of a pre-trained model to load from cache or download, e.g.: bert-base-uncased.

  • a string with the identifier name of a pre-trained model that was user-uploaded to our S3, e.g.: dbmdz/bert-base-german-cased.

  • a path to a directory containing model weights saved using save_pretrained(), e.g.: ./my_model_directory/decoder.

  • a path or url to a tensorflow index checkpoint file (e.g. ./tf_model/model.ckpt.index). In this case, from_tf should be set to True and a configuration object should be provided as config argument. This loading path is slower than converting the TensorFlow checkpoint in a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards.

model_args: (optional) Sequence of positional arguments:

All remaning positional arguments will be passed to the underlying model’s __init__ method

kwargs: (optional) Remaining dictionary of keyword arguments.

Can be used to update the configuration object (after it being loaded) and initiate the model. (e.g. output_attention=True). Behave differently depending on whether a config is provided or automatically loaded:

Examples:

>>> from transformers import EncoderDecoderModel
>>> model = EncoderDecoderModel.from_encoder_decoder_pretrained('bert-base-uncased', 'bert-base-uncased') # initialize Bert2Bert
get_input_embeddings()[source]

Returns the model’s input embeddings.

Returns

A torch module mapping vocabulary to hidden states.

Return type

nn.Module

get_output_embeddings()[source]

Returns the model’s output embeddings.

Returns

A torch module mapping hidden states to vocabulary.

Return type

nn.Module

tie_weights()[source]

Tie the weights between the input embeddings and the output embeddings. If the torchscript flag is set in the configuration, can’t handle parameter sharing so we are cloning the weights instead.