Bart
DISCLAIMER: If you see something strange, file a GitHub issue and assign @sshleifer.
Paper
The Bart model was proposed by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer on 29 Oct, 2019. According to the abstract,
Bart uses a standard seq2seq/machine translation architecture with a bidirectional encoder (like BERT) and a left-to-right decoder (like GPT).
The pretraining task involves randomly shuffling the order of the original sentences and a novel in-filling scheme, where spans of text are replaced with a single mask token.
BART is particularly effective when fine tuned for text generation but also works well for comprehension tasks. It matches the performance of RoBERTa with comparable training resources on GLUE and SQuAD, achieves new state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks, with gains of up to 6 ROUGE.
The Authors’ code can be found here
Implementation Notes
- Bart doesn’t use token_type_ids for sequence classification. Use BartTokenizer.encode to get the proper splitting.
- The forward pass of BartModel will create decoder inputs (using the helper function transformers.modeling_bart._prepare_bart_decoder_inputs) if they are not passed. This is different from some other modeling APIs.
- Model predictions are intended to be identical to the original implementation. This only works, however, if the string you pass to fairseq.encode starts with a space.
- BartForConditionalGeneration.generate should be used for conditional generation tasks like summarization; see the example in that docstring.
- Models that load the "bart-large-cnn" weights will not have a mask_token_id, or be able to perform mask-filling tasks.
BartModel
class transformers.BartModel(config: transformers.configuration_bart.BartConfig)
The bare BART Model outputting raw hidden-states without any specific head on top.
This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
- Parameters
config (BartConfig) – Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
forward(input_ids, attention_mask=None, decoder_input_ids=None, encoder_outputs: Optional[Tuple] = None, decoder_attention_mask=None, decoder_cached_states=None, use_cache=False)
The BartModel forward method, overrides the __call__() special method.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
- Parameters
input_ids (torch.LongTensor of shape (batch_size, sequence_length)) – Indices of input sequence tokens in the vocabulary. Indices can be obtained using transformers.BartTokenizer.encode(text). Padding will be ignored by default should you provide it.
attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional, defaults to None) – Mask to avoid performing attention on padding token indices in input_ids. Mask values selected in [0, 1]: 1 for tokens that are NOT MASKED, 0 for MASKED tokens.
encoder_outputs (tuple(tuple(torch.FloatTensor)), optional, defaults to None) – Tuple consisting of (last_hidden_state, optional: hidden_states, optional: attentions). last_hidden_state, of shape (batch_size, sequence_length, hidden_size), is a sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention of the decoder.
decoder_input_ids (torch.LongTensor of shape (batch_size, target_sequence_length), optional, defaults to None) – Provide for translation and summarization training. By default, the model will create this tensor by shifting the input_ids right, following the paper.
decoder_attention_mask (torch.BoolTensor of shape (batch_size, tgt_seq_len), optional, defaults to None) – Default behavior: generate a tensor that ignores pad tokens in decoder_input_ids. A causal mask will also be used by default. If you want to change the padding behavior, read _prepare_bart_decoder_inputs() and modify it to your needs. See diagram 1 in the paper for more information on the default strategy.
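The default construction of decoder_input_ids described above (shifting input_ids right) can be illustrated with a minimal pure-Python sketch. It assumes the fairseq-style convention in which the last non-padding token (usually the end-of-sequence token) is wrapped around to the first position, and it assumes right padding. The function name and the token ids other than the BartConfig defaults (bos 0, pad 1, eos 2) are illustrative, and the library itself operates on tensors rather than lists.

```python
from typing import List

PAD_TOKEN_ID = 1  # pad_token_id default listed in BartConfig


def shift_tokens_right(input_ids: List[List[int]], pad_token_id: int = PAD_TOKEN_ID) -> List[List[int]]:
    """Shift each row one position to the right, wrapping the last non-pad token to the front."""
    shifted = []
    for row in input_ids:
        # Number of real (non-padding) tokens; assumes padding is on the right.
        num_real = sum(1 for tok in row if tok != pad_token_id)
        last_token = row[num_real - 1]  # usually the end-of-sequence token
        shifted.append([last_token] + row[:-1])
    return shifted


# Illustrative ids: 0 = bos, 2 = eos, 1 = pad; the other values are made up.
batch = [
    [0, 31414, 232, 2],
    [0, 31414, 2, 1],
]
decoder_input_ids = shift_tokens_right(batch)
print(decoder_input_ids)
```

Each decoder row therefore starts with the token that ended the corresponding input row, and the decoder is trained to predict the unshifted sequence one position ahead.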
transformers.modeling_bart._prepare_bart_decoder_inputs(config, input_ids, decoder_input_ids=None, decoder_padding_mask=None, causal_mask_dtype=torch.float32)
Prepare masks that ignore padding tokens in the decoder and a causal mask for the decoder if none are provided. This mimics the default behavior in fairseq. To override it, pass in masks. Note: this is not called during generation.
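The two masks mentioned above can be sketched in plain Python as follows. This is only an illustration of the idea, not the library code: it assumes the padding mask marks pad positions with True and that the causal mask is additive (0.0 where attention is allowed, negative infinity for future positions). The helper names make_padding_mask and make_causal_mask are hypothetical.

```python
from typing import List

PAD_TOKEN_ID = 1  # pad_token_id default listed in BartConfig
NEG_INF = float("-inf")


def make_padding_mask(decoder_input_ids: List[List[int]], pad_token_id: int = PAD_TOKEN_ID) -> List[List[bool]]:
    """True marks a padding position that attention should ignore."""
    return [[tok == pad_token_id for tok in row] for row in decoder_input_ids]


def make_causal_mask(tgt_len: int) -> List[List[float]]:
    """Additive mask: 0.0 where a query may attend, -inf for positions in its future."""
    return [
        [0.0 if key <= query else NEG_INF for key in range(tgt_len)]
        for query in range(tgt_len)
    ]


decoder_input_ids = [[2, 0, 31414, 1]]  # illustrative ids, right-padded with one pad token
padding_mask = make_padding_mask(decoder_input_ids)
causal_mask = make_causal_mask(len(decoder_input_ids[0]))

print(padding_mask)
for row in causal_mask:
    print(row)
```

Adding the causal mask to the raw attention scores before the softmax drives the weight of every future position to zero, which is what keeps the decoder left-to-right.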
BartForConditionalGeneration
class transformers.BartForConditionalGeneration(config: transformers.configuration_bart.BartConfig)
The BART Model with a language modeling head. Can be used for summarization.
This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
- Parameters
config (BartConfig) – Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
Examples:
    from transformers import BartTokenizer, BartForConditionalGeneration, BartConfig
    # see ``examples/summarization/bart/evaluate_cnn.py`` for a longer example
    model = BartForConditionalGeneration.from_pretrained('bart-large-cnn')
    tokenizer = BartTokenizer.from_pretrained('bart-large-cnn')
    ARTICLE_TO_SUMMARIZE = "My friends are cool but they eat too many carbs."
    inputs = tokenizer.batch_encode_plus([ARTICLE_TO_SUMMARIZE], max_length=1024, return_tensors='pt')
    # Generate Summary
    summary_ids = model.generate(inputs['input_ids'], num_beams=4, max_length=5, early_stopping=True)
    print([tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summary_ids])
forward(input_ids, attention_mask=None, encoder_outputs=None, decoder_input_ids=None, decoder_attention_mask=None, decoder_cached_states=None, lm_labels=None, use_cache=False, **unused)
The BartForConditionalGeneration forward method, overrides the __call__() special method.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
- Parameters
input_ids (torch.LongTensor of shape (batch_size, sequence_length)) – Indices of input sequence tokens in the vocabulary. Indices can be obtained using transformers.BartTokenizer.encode(text). Padding will be ignored by default should you provide it.
attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional, defaults to None) – Mask to avoid performing attention on padding token indices in input_ids. Mask values selected in [0, 1]: 1 for tokens that are NOT MASKED, 0 for MASKED tokens.
encoder_outputs (tuple(tuple(torch.FloatTensor)), optional, defaults to None) – Tuple consisting of (last_hidden_state, optional: hidden_states, optional: attentions). last_hidden_state, of shape (batch_size, sequence_length, hidden_size), is a sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention of the decoder.
decoder_input_ids (torch.LongTensor of shape (batch_size, target_sequence_length), optional, defaults to None) – Provide for translation and summarization training. By default, the model will create this tensor by shifting the input_ids right, following the paper.
decoder_attention_mask (torch.BoolTensor of shape (batch_size, tgt_seq_len), optional, defaults to None) – Default behavior: generate a tensor that ignores pad tokens in decoder_input_ids. A causal mask will also be used by default. If you want to change the padding behavior, read _prepare_bart_decoder_inputs() and modify it to your needs. See diagram 1 in the paper for more information on the default strategy.
lm_labels (torch.LongTensor of shape (batch_size, sequence_length), optional, defaults to None) – Labels for computing the masked language modeling loss. Indices should either be in [0, ..., config.vocab_size] or -100 (see input_ids docstring). Tokens with indices set to -100 are ignored (masked); the loss is only computed for the tokens with labels in [0, ..., config.vocab_size].
- Returns
- masked_lm_loss (optional, returned when lm_labels is provided) torch.FloatTensor of shape (1,): Masked language modeling loss.
- prediction_scores (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)): Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
- hidden_states (tuple(torch.FloatTensor), optional, returned when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
- Return type
tuple(torch.FloatTensor) comprising various elements depending on the configuration (BartConfig) and inputs
Examples:
    # Mask filling only works for bart-large
    from transformers import BartTokenizer, BartForConditionalGeneration
    tokenizer = BartTokenizer.from_pretrained('bart-large')
    TXT = "My friends are <mask> but they eat too many carbs."
    model = BartForConditionalGeneration.from_pretrained('bart-large')
    input_ids = tokenizer.batch_encode_plus([TXT], return_tensors='pt')['input_ids']
    logits = model(input_ids)[0]
    masked_index = (input_ids[0] == tokenizer.mask_token_id).nonzero().item()
    probs = logits[0, masked_index].softmax(dim=0)
    values, predictions = probs.topk(5)
    tokenizer.decode(predictions).split()
    # ['good', 'great', 'all', 'really', 'very']
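The label convention described in the forward parameters above, where positions labelled -100 are left out of the loss, can be made concrete with a small pure-Python sketch. It is a simplified stand-in for a cross-entropy loss with an ignore index, working on one sequence of per-token scores; the function name masked_cross_entropy and the example numbers are illustrative only.

```python
import math
from typing import List

IGNORE_INDEX = -100  # label value that excludes a position from the loss


def masked_cross_entropy(logits: List[List[float]], labels: List[int]) -> float:
    """Mean negative log-likelihood over positions whose label is not IGNORE_INDEX."""
    total, count = 0.0, 0
    for scores, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue  # ignored position: contributes nothing to the loss
        # Log of the softmax normalizer, with max subtraction for numerical stability.
        max_score = max(scores)
        log_norm = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
        total += log_norm - scores[label]
        count += 1
    # Returning 0.0 when every position is ignored is an arbitrary choice for this sketch.
    return total / count if count else 0.0


# Three positions over a toy vocabulary of size 3; the middle position is ignored.
example_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
example_labels = [0, IGNORE_INDEX, 1]
print(masked_cross_entropy(example_logits, example_labels))
```

Because the ignored position is skipped before averaging, the result is the same as if that position had never been part of the sequence.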
generate(**kwargs)
Generates sequences for models with a LM head. The method currently supports greedy decoding, beam-search decoding, sampling with temperature, sampling with top-k or nucleus sampling.
Adapted in part from Facebook’s XLM beam search code.
- Parameters
input_ids – (optional) torch.LongTensor of shape (batch_size, sequence_length) The sequence used as a prompt for the generation. If None the method initializes it as an empty torch.LongTensor of shape (1,).
max_length – (optional) int The max length of the sequence to be generated. Between min_length and infinity. Defaults to 20.
min_length – (optional) int The min length of the sequence to be generated. Between 0 and infinity. Defaults to 0.
do_sample – (optional) bool If set to False greedy decoding is used. Otherwise sampling is used. Defaults to False as defined in configuration_utils.PretrainedConfig.
early_stopping – (optional) bool If set to True beam search is stopped when at least num_beams sentences are finished per batch. Defaults to False as defined in configuration_utils.PretrainedConfig.
num_beams – (optional) int Number of beams for beam search. Must be between 1 and infinity. 1 means no beam search. Defaults to 1.
temperature – (optional) float The value used to modulate the next token probabilities. Must be strictly positive. Defaults to 1.0.
top_k – (optional) int The number of highest probability vocabulary tokens to keep for top-k-filtering. Between 1 and infinity. Defaults to 50.
top_p – (optional) float Cumulative probability threshold for nucleus sampling: the highest probability vocabulary tokens are kept until their cumulative probability exceeds top_p. Must be between 0 and 1. Defaults to 1.
repetition_penalty – (optional) float The parameter for repetition penalty. Between 1.0 and infinity. 1.0 means no penalty. Defaults to 1.0.
pad_token_id – (optional) int Padding token. Defaults to the model-specific pad_token_id, or None if it does not exist.
bos_token_id – (optional) int BOS token. Defaults to bos_token_id as defined in the model’s config.
eos_token_id – (optional) int EOS token. Defaults to eos_token_id as defined in the model’s config.
length_penalty – (optional) float Exponential penalty to the length. Defaults to 1.
no_repeat_ngram_size – (optional) int If set to int > 0, all ngrams of size no_repeat_ngram_size can only occur once.
bad_words_ids – (optional) list of lists of int bad_words_ids contains tokens that are not allowed to be generated. In order to get the tokens of the words that should not appear in the generated text, use tokenizer.encode(bad_word, add_prefix_space=True).
num_return_sequences – (optional) int The number of independently computed returned sequences for each element in the batch. Defaults to 1.
attention_mask – (optional) torch.LongTensor of same shape as input_ids Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are NOT MASKED, 0 for MASKED tokens. Defaults to None.
decoder_start_token_id – (optional) int If an encoder-decoder model starts decoding with a different token than BOS. Defaults to None and is changed to BOS later.
use_cache – (optional) bool If use_cache is True, past key values are used to speed up decoding if applicable to the model. Defaults to True.
model_specific_kwargs – (optional) dict Additional model-specific kwargs will be forwarded to the forward function of the model.
- Returns
- torch.LongTensor of shape (batch_size * num_return_sequences, sequence_length)
sequence_length is either equal to max_length or shorter if all batches finished early due to the eos_token_id
- Return type
output
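To make the top_k and top_p parameters listed above more concrete, here is a minimal pure-Python sketch of filtering the scores for a single decoding step. It is an illustration under stated assumptions rather than the library routine: the function name top_k_top_p_filter is hypothetical, top_k=0 is used here to mean "no top-k filtering", and the most probable token is always kept.

```python
import math
from typing import List

NEG_INF = float("-inf")


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax; entries equal to -inf get probability 0."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def top_k_top_p_filter(logits: List[float], top_k: int = 0, top_p: float = 1.0) -> List[float]:
    """Return a copy of logits with filtered-out entries set to -inf."""
    filtered = list(logits)
    # Token indices ordered from most to least likely.
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    if top_k > 0:
        for i in order[top_k:]:
            filtered[i] = NEG_INF
    if top_p < 1.0:
        probs = softmax(filtered)
        cumulative = 0.0
        for rank, i in enumerate(order):
            # Drop a token once the probability mass ranked before it already exceeds top_p.
            if rank > 0 and cumulative > top_p:
                filtered[i] = NEG_INF
            cumulative += probs[i]
    return filtered


# Toy distribution over four tokens with probabilities 0.15, 0.5, 0.05, 0.3.
example_logits = [math.log(0.15), math.log(0.5), math.log(0.05), math.log(0.3)]
print(top_k_top_p_filter(example_logits, top_k=2))
print(top_k_top_p_filter(example_logits, top_p=0.7))
```

Sampling then draws the next token from the softmax of the filtered scores, so the removed tokens can never be chosen.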
Examples:
    from transformers import AutoTokenizer, AutoModelWithLMHead

    tokenizer = AutoTokenizer.from_pretrained('distilgpt2')  # Initialize tokenizer
    model = AutoModelWithLMHead.from_pretrained('distilgpt2')  # Download model and configuration from S3 and cache.
    outputs = model.generate(max_length=40)  # do greedy decoding
    print('Generated: {}'.format(tokenizer.decode(outputs[0], skip_special_tokens=True)))

    tokenizer = AutoTokenizer.from_pretrained('openai-gpt')  # Initialize tokenizer
    model = AutoModelWithLMHead.from_pretrained('openai-gpt')  # Download model and configuration from S3 and cache.
    input_context = 'The dog'
    input_ids = tokenizer.encode(input_context, return_tensors='pt')  # encode input context
    outputs = model.generate(input_ids=input_ids, num_beams=5, num_return_sequences=3, temperature=1.5)  # generate 3 sequences using beam search decoding (5 beams) from initial context 'The dog'
    for i in range(3):  # 3 output sequences were generated
        print('Generated {}: {}'.format(i, tokenizer.decode(outputs[i], skip_special_tokens=True)))

    tokenizer = AutoTokenizer.from_pretrained('distilgpt2')  # Initialize tokenizer
    model = AutoModelWithLMHead.from_pretrained('distilgpt2')  # Download model and configuration from S3 and cache.
    input_context = 'The dog'
    input_ids = tokenizer.encode(input_context, return_tensors='pt')  # encode input context
    outputs = model.generate(input_ids=input_ids, max_length=40, temperature=0.7, num_return_sequences=3, do_sample=True)  # generate 3 sequences by sampling
    for i in range(3):  # 3 output sequences were generated
        print('Generated {}: {}'.format(i, tokenizer.decode(outputs[i], skip_special_tokens=True)))

    tokenizer = AutoTokenizer.from_pretrained('ctrl')  # Initialize tokenizer
    model = AutoModelWithLMHead.from_pretrained('ctrl')  # Download model and configuration from S3 and cache.
    input_context = 'Legal My neighbor is'  # "Legal" is one of the control codes for ctrl
    input_ids = tokenizer.encode(input_context, return_tensors='pt')  # encode input context
    outputs = model.generate(input_ids=input_ids, max_length=50, temperature=0.7, repetition_penalty=1.2)  # generate sequences
    print('Generated: {}'.format(tokenizer.decode(outputs[0], skip_special_tokens=True)))

    tokenizer = AutoTokenizer.from_pretrained('gpt2')  # Initialize tokenizer
    model = AutoModelWithLMHead.from_pretrained('gpt2')  # Download model and configuration from S3 and cache.
    input_context = 'My cute dog'
    bad_words_ids = [tokenizer.encode(bad_word, add_prefix_space=True) for bad_word in ['idiot', 'stupid', 'shut up']]
    input_ids = tokenizer.encode(input_context, return_tensors='pt')  # encode input context
    outputs = model.generate(input_ids=input_ids, max_length=100, do_sample=True, bad_words_ids=bad_words_ids)  # generate sequences without allowing bad_words to be generated
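The no_repeat_ngram_size parameter of generate described above says that every n-gram of that size may occur only once. One way to picture this is as a per-step list of banned next tokens, sketched below in plain Python. The function name banned_next_tokens is hypothetical and the sketch handles a single sequence of token ids; it is not taken from the library.

```python
from typing import List


def banned_next_tokens(generated: List[int], no_repeat_ngram_size: int) -> List[int]:
    """Tokens that would complete an n-gram already present in `generated`."""
    n = no_repeat_ngram_size
    if n <= 0 or len(generated) < n - 1:
        return []  # feature disabled, or not enough context to form an n-gram
    # The last n-1 generated tokens form the prefix of the next n-gram.
    prefix = tuple(generated[len(generated) - (n - 1):])
    banned: List[int] = []
    for start in range(len(generated) - n + 1):
        ngram = generated[start:start + n]
        # If an earlier n-gram began with the same prefix, its last token is banned.
        if tuple(ngram[:-1]) == prefix and ngram[-1] not in banned:
            banned.append(ngram[-1])
    return banned


history = [5, 6, 7, 5, 6]  # made-up token ids; the trigram (5, 6, 7) has already occurred
print(banned_next_tokens(history, 3))
```

During decoding, the scores of the banned tokens would be set to negative infinity before the next token is selected, so the repeated n-gram cannot be produced.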
BartForSequenceClassification
class transformers.BartForSequenceClassification(config: transformers.configuration_bart.BartConfig, **kwargs)
Bart model with a sequence classification head on top (a linear layer on top of the pooled output), e.g. for GLUE tasks.
This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
- Parameters
config (BartConfig) – Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
forward(input_ids, attention_mask=None, encoder_outputs=None, decoder_input_ids=None, decoder_attention_mask=None, labels=None)
The BartForSequenceClassification forward method, overrides the __call__() special method.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
- Parameters
input_ids (torch.LongTensor of shape (batch_size, sequence_length)) – Indices of input sequence tokens in the vocabulary. Indices can be obtained using transformers.BartTokenizer.encode(text). Padding will be ignored by default should you provide it.
attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional, defaults to None) – Mask to avoid performing attention on padding token indices in input_ids. Mask values selected in [0, 1]: 1 for tokens that are NOT MASKED, 0 for MASKED tokens.
encoder_outputs (tuple(tuple(torch.FloatTensor)), optional, defaults to None) – Tuple consisting of (last_hidden_state, optional: hidden_states, optional: attentions). last_hidden_state, of shape (batch_size, sequence_length, hidden_size), is a sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention of the decoder.
decoder_input_ids (torch.LongTensor of shape (batch_size, target_sequence_length), optional, defaults to None) – Provide for translation and summarization training. By default, the model will create this tensor by shifting the input_ids right, following the paper.
decoder_attention_mask (torch.BoolTensor of shape (batch_size, tgt_seq_len), optional, defaults to None) – Default behavior: generate a tensor that ignores pad tokens in decoder_input_ids. A causal mask will also be used by default. If you want to change the padding behavior, read _prepare_bart_decoder_inputs() and modify it to your needs. See diagram 1 in the paper for more information on the default strategy.
labels (torch.LongTensor of shape (batch_size,), optional, defaults to None) – Labels for computing the sequence classification/regression loss. Indices should be in [0, ..., config.num_labels - 1]. If config.num_labels > 1 a classification loss is computed (Cross-Entropy).
- Returns
- loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided): Classification loss (cross entropy).
- logits (torch.FloatTensor of shape (batch_size, config.num_labels)): Classification (or regression if config.num_labels==1) scores (before SoftMax).
- hidden_states (tuple(torch.FloatTensor), optional, returned when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
- Return type
tuple(torch.FloatTensor) comprising various elements depending on the configuration (BartConfig) and inputs
Examples:
    from transformers import BartTokenizer, BartForSequenceClassification
    import torch
    tokenizer = BartTokenizer.from_pretrained('bart-large')
    model = BartForSequenceClassification.from_pretrained('bart-large')
    input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute", add_special_tokens=True)).unsqueeze(0)  # Batch size 1
    labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1
    outputs = model(input_ids, labels=labels)
    loss, logits = outputs[:2]
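The logits returned above are scores before SoftMax, so turning them into class probabilities and a predicted label takes one more step. The sketch below shows that step in pure Python on made-up numbers for a single example with three labels (the num_labels default listed in BartConfig); with real model outputs the same computation would be applied to each row of the logits tensor.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one row of classification scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


# Made-up scores for one example: shape (batch_size=1, num_labels=3).
batch_logits = [[1.2, -0.3, 0.1]]

probabilities = [softmax(row) for row in batch_logits]
# The predicted class is the index of the highest probability in each row.
predicted_classes = [max(range(len(p)), key=p.__getitem__) for p in probabilities]

print(probabilities)
print(predicted_classes)
```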
BartConfig
class transformers.BartConfig(activation_dropout=0.0, activation_function='gelu', vocab_size=50265, d_model=1024, encoder_ffn_dim=4096, encoder_layers=12, encoder_attention_heads=16, decoder_ffn_dim=4096, decoder_layers=12, decoder_attention_heads=16, encoder_layerdrop=0.0, decoder_layerdrop=0.0, attention_dropout=0.0, dropout=0.1, max_position_embeddings=1024, init_std=0.02, classifier_dropout=0.0, num_labels=3, is_encoder_decoder=True, pad_token_id=1, bos_token_id=0, eos_token_id=2, normalize_before=False, add_final_layer_norm=False, scale_embedding=False, normalize_embedding=True, static_position_embeddings=False, add_bias_logits=False, **common_kwargs)
Configuration class for Bart. Parameters are renamed from the fairseq implementation.