XLM-ProphetNet

DISCLAIMER: If you see something strange, file a Github Issue and assign @patrickvonplaten

Overview

The XLM-ProphetNet model was proposed in ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training, by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, Ming Zhou on 13 Jan, 2020.

XLM-ProphetNet is an encoder-decoder model and can predict n-future tokens for “ngram” language modeling instead of just the next token. Its architecture is identical to ProhpetNet, but the model was trained on the multi-lingual “wiki100” Wikipedia dump.

The abstract from the paper is the following:

In this paper, we present a new sequence-to-sequence pretraining model called ProphetNet, which introduces a novel self-supervised objective named future n-gram prediction and the proposed n-stream self-attention mechanism. Instead of the optimization of one-step ahead prediction in traditional sequence-to-sequence model, the ProphetNet is optimized by n-step ahead prediction which predicts the next n tokens simultaneously based on previous context tokens at each time step. The future n-gram prediction explicitly encourages the model to plan for the future tokens and prevent overfitting on strong local correlations. We pre-train ProphetNet using a base scale dataset (16GB) and a large scale dataset (160GB) respectively. Then we conduct experiments on CNN/DailyMail, Gigaword, and SQuAD 1.1 benchmarks for abstractive summarization and question generation tasks. Experimental results show that ProphetNet achieves new state-of-the-art results on all these datasets compared to the models using the same scale pretraining corpus.

The Authors’ code can be found here.

XLMProphetNetConfig

class transformers.XLMProphetNetConfig(activation_dropout=0.1, activation_function='gelu', vocab_size=30522, hidden_size=1024, encoder_ffn_dim=4096, num_encoder_layers=12, num_encoder_attention_heads=16, decoder_ffn_dim=4096, num_decoder_layers=12, num_decoder_attention_heads=16, attention_dropout=0.1, dropout=0.1, max_position_embeddings=512, init_std=0.02, is_encoder_decoder=True, add_cross_attention=True, decoder_start_token_id=0, ngram=2, num_buckets=32, relative_max_distance=128, disable_ngram_loss=False, gradient_checkpointing=False, eps=0.0, use_cache=True, pad_token_id=0, bos_token_id=1, eos_token_id=2, **kwargs)[source]

This class overrides ProphetNetConfig. Please check the superclass for the appropriate documentation alongside usage examples.

XLMProphetNetTokenizer

class transformers.XLMProphetNetTokenizer(vocab_file, bos_token='[SEP]', eos_token='[SEP]', sep_token='[SEP]', unk_token='[UNK]', pad_token='[PAD]', cls_token='[CLS]', mask_token='[MASK]', sp_model_kwargs: Optional[Dict[str, Any]] = None, **kwargs)[source]

Adapted from RobertaTokenizer and class:~transformers.XLNetTokenizer. Based on SentencePiece.

This tokenizer inherits from PreTrainedTokenizer which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.

Parameters
  • vocab_file (str) – Path to the vocabulary file.

  • bos_token (str, optional, defaults to "<s>") –

    The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token.

    Note

    When building a sequence using special tokens, this is not the token that is used for the beginning of sequence. The token used is the cls_token.

  • eos_token (str, optional, defaults to "</s>") –

    The end of sequence token.

    Note

    When building a sequence using special tokens, this is not the token that is used for the end of sequence. The token used is the sep_token.

  • sep_token (str, optional, defaults to "</s>") – The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for sequence classification or for a text and a question for question answering. It is also used as the last token of a sequence built with special tokens.

  • cls_token (str, optional, defaults to "<s>") – The classifier token which is used when doing sequence classification (classification of the whole sequence instead of per-token classification). It is the first token of the sequence when built with special tokens.

  • unk_token (str, optional, defaults to "<unk>") – The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.

  • pad_token (str, optional, defaults to "<pad>") – The token used for padding, for example when batching sequences of different lengths.

  • mask_token (str, optional, defaults to "<mask>") – The token used for masking values. This is the token used when training this model with masked language modeling. This is the token which the model will try to predict.

  • additional_special_tokens (List[str], optional, defaults to ["<s>NOTUSED", "</s>NOTUSED"]) – Additional special tokens used by the tokenizer.

  • sp_model_kwargs (dict, optional) –

    Will be passed to the SentencePieceProcessor.__init__() method. The Python wrapper for SentencePiece can be used, among other things, to set:

    • enable_sampling: Enable subword regularization.

    • nbest_size: Sampling parameters for unigram. Invalid for BPE-Dropout.

      • nbest_size = {0,1}: No sampling is performed.

      • nbest_size > 1: samples from the nbest_size results.

      • nbest_size < 0: assuming that nbest_size is infinite and samples from the all hypothesis (lattice) using forward-filtering-and-backward-sampling algorithm.

    • alpha: Smoothing parameter for unigram sampling, and dropout probability of merge operations for BPE-dropout.

sp_model

The SentencePiece processor that is used for every conversion (string, tokens and IDs).

Type

SentencePieceProcessor

build_inputs_with_special_tokens(token_ids_0: List[int], token_ids_1: Optional[List[int]] = None) → List[int][source]

Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and adding special tokens. A XLMProphetNet sequence has the following format:

  • single sequence: X [SEP]

  • pair of sequences: A [SEP] B [SEP]

Parameters
  • token_ids_0 (List[int]) – List of IDs to which the special tokens will be added

  • token_ids_1 (List[int], optional) – Optional second list of IDs for sequence pairs.

Returns

list of input IDs with the appropriate special tokens.

Return type

List[int]

convert_tokens_to_string(tokens)[source]

Converts a sequence of tokens (strings for sub-words) in a single string.

create_token_type_ids_from_sequences(token_ids_0: List[int], token_ids_1: Optional[List[int]] = None) → List[int][source]

Create a mask from the two sequences passed to be used in a sequence-pair classification task. XLMProphetNet does not make use of token type ids, therefore a list of zeros is returned.

Parameters
  • token_ids_0 (List[int]) – List of IDs.

  • token_ids_1 (List[int], optional) – Optional second list of IDs for sequence pairs.

Returns

List of zeros.

Return type

List[int]

get_special_tokens_mask(token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False) → List[int][source]

Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer prepare_for_model method.

Parameters
  • token_ids_0 (List[int]) – List of IDs.

  • token_ids_1 (List[int], optional) – Optional second list of IDs for sequence pairs.

  • already_has_special_tokens (bool, optional, defaults to False) – Whether or not the token list is already formatted with special tokens for the model.

Returns

A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.

Return type

List[int]

get_vocab()[source]

Returns the vocabulary as a dictionary of token to index.

tokenizer.get_vocab()[token] is equivalent to tokenizer.convert_tokens_to_ids(token) when token is in the vocab.

Returns

The vocabulary.

Return type

Dict[str, int]

save_vocabulary(save_directory: str, filename_prefix: Optional[str] = None) → Tuple[str][source]

Save only the vocabulary of the tokenizer (vocabulary + added tokens).

This method won’t save the configuration and special token mappings of the tokenizer. Use _save_pretrained() to save the whole state of the tokenizer.

Parameters
  • save_directory (str) – The directory in which to save the vocabulary.

  • filename_prefix (str, optional) – An optional prefix to add to the named of the saved files.

Returns

Paths to the files saved.

Return type

Tuple(str)

property vocab_size

Size of the base vocabulary (without the added tokens).

Type

int

XLMProphetNetModel

class transformers.XLMProphetNetModel(config)[source]

This class overrides ProphetNetModel. Please check the superclass for the appropriate documentation alongside usage examples.

Example:

>>> from transformers import XLMProphetNetTokenizer, XLMProphetNetModel

>>> tokenizer = XLMProphetNetTokenizer.from_pretrained('microsoft/xprophetnet-large-wiki100-cased')
>>> model = XLMProphetNetModel.from_pretrained('microsoft/xprophetnet-large-wiki100-cased')

>>> input_ids = tokenizer("Studies have been shown that owning a dog is good for you", return_tensors="pt").input_ids  # Batch size 1
>>> decoder_input_ids = tokenizer("Studies show that", return_tensors="pt").input_ids  # Batch size 1
>>> outputs = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)

>>> last_hidden_states = outputs.last_hidden_state  # main stream hidden states
    >>> last_hidden_states_ngram = outputs.last_hidden_state_ngram  # predict hidden states

XLMProphetNetEncoder

class transformers.XLMProphetNetEncoder(config: transformers.models.prophetnet.configuration_prophetnet.ProphetNetConfig, word_embeddings: torch.nn.modules.sparse.Embedding = None)[source]

This class overrides ProphetNetEncoder. Please check the superclass for the appropriate documentation alongside usage examples.

Example:

>>> from transformers import XLMProphetNetTokenizer, XLMProphetNetEncoder
>>> import torch

>>> tokenizer = XLMProphetNetTokenizer.from_pretrained('microsoft/xprophetnet-large-wiki100-cased')
>>> model = XLMProphetNetEncoder.from_pretrained('patrickvonplaten/xprophetnet-large-uncased-standalone')
>>> assert model.config.is_decoder, f"{model.__class__} has to be configured as a decoder."
>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> outputs = model(**inputs)

>>> last_hidden_states = outputs.last_hidden_state

XLMProphetNetDecoder

class transformers.XLMProphetNetDecoder(config: transformers.models.prophetnet.configuration_prophetnet.ProphetNetConfig, word_embeddings: torch.nn.modules.sparse.Embedding = None)[source]

This class overrides ProphetNetDecoder. Please check the superclass for the appropriate documentation alongside usage examples.

Example:

>>> from transformers import XLMProphetNetTokenizer, XLMProphetNetDecoder
>>> import torch

>>> tokenizer = XLMProphetNetTokenizer.from_pretrained('microsoft/xprophetnet-large-wiki100-cased')
>>> model = XLMProphetNetDecoder.from_pretrained('patrickvonplaten/xprophetnet-large-uncased-standalone', add_cross_attention=False)
>>> assert model.config.is_decoder, f"{model.__class__} has to be configured as a decoder."
>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> outputs = model(**inputs)

>>> last_hidden_states = outputs.last_hidden_state

XLMProphetNetForConditionalGeneration

class transformers.XLMProphetNetForConditionalGeneration(config: transformers.models.prophetnet.configuration_prophetnet.ProphetNetConfig)[source]

This class overrides ProphetNetForConditionalGeneration. Please check the superclass for the appropriate documentation alongside usage examples.

Example:

>>> from transformers import XLMProphetNetTokenizer, XLMProphetNetForConditionalGeneration

>>> tokenizer = XLMProphetNetTokenizer.from_pretrained('microsoft/xprophetnet-large-wiki100-cased')
>>> model =  XLMProphetNetForConditionalGeneration.from_pretrained('microsoft/xprophetnet-large-wiki100-cased')

>>> input_ids = tokenizer("Studies have been shown that owning a dog is good for you", return_tensors="pt").input_ids  # Batch size 1
>>> decoder_input_ids = tokenizer("Studies show that", return_tensors="pt").input_ids  # Batch size 1
>>> outputs = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)

>>> logits_next_token = outputs.logits  # logits to predict next token as usual
>>> logits_ngram_next_tokens = outputs.logits_ngram  # logits to predict 2nd, 3rd, ... next tokens

XLMProphetNetForCausalLM

class transformers.XLMProphetNetForCausalLM(config)[source]

This class overrides ProphetNetForCausalLM. Please check the superclass for the appropriate documentation alongside usage examples.

Example:

>>> from transformers import XLMProphetNetTokenizer, XLMProphetNetForCausalLM
>>> import torch

>>> tokenizer = XLMProphetNetTokenizer.from_pretrained('microsoft/xprophetnet-large-wiki100-cased')
>>> model = XLMProphetNetForCausalLM.from_pretrained('microsoft/xprophetnet-large-wiki100-cased')
>>> assert model.config.is_decoder, f"{model.__class__} has to be configured as a decoder."
>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> outputs = model(**inputs)

>>> logits = outputs.logits

>>> # Model can also be used with EncoderDecoder framework
>>> from transformers import EncoderDecoderModel, XLMProphetNetTokenizer, XLMRobertaTokenizer
>>> import torch

>>> tokenizer_enc = XLMRobertaTokenizer.from_pretrained('xlm-roberta-large')
>>> tokenizer_dec = XLMProphetNetTokenizer.from_pretrained('microsoft/xprophetnet-large-wiki100-cased')
>>> model = EncoderDecoderModel.from_encoder_decoder_pretrained("xlm-roberta-large", 'microsoft/xprophetnet-large-wiki100-cased')

>>> ARTICLE = (
... "the us state department said wednesday it had received no "
... "formal word from bolivia that it was expelling the us ambassador there "
... "but said the charges made against him are `` baseless ."
... )
>>> input_ids = tokenizer_enc(ARTICLE, return_tensors="pt").input_ids
>>> labels = tokenizer_dec("us rejects charges against its ambassador in bolivia", return_tensors="pt").input_ids
>>> outputs = model(input_ids=input_ids, decoder_input_ids=labels[:, :-1], labels=labels[:, 1:])

>>> loss = outputs.loss