transformers documentation

BARTpho

BARTpho

Overview

The BARTpho model was proposed in BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese by Nguyen Luong Tran, Duong Minh Le and Dat Quoc Nguyen.

The abstract from the paper is the following:

We present BARTpho with two versions — BARTpho_word and BARTpho_syllable — the first public large-scale monolingual sequence-to-sequence models pre-trained for Vietnamese. Our BARTpho uses the “large” architecture and pre-training scheme of the sequence-to-sequence denoising model BART, thus especially suitable for generative NLP tasks. Experiments on a downstream task of Vietnamese text summarization show that in both automatic and human evaluations, our BARTpho outperforms the strong baseline mBART and improves the state-of-the-art. We release BARTpho to facilitate future research and applications of generative Vietnamese NLP tasks.

Example of use:

>>> import torch
>>> from transformers import AutoModel, AutoTokenizer

>>> bartpho = AutoModel.from_pretrained("vinai/bartpho-syllable")

>>> tokenizer = AutoTokenizer.from_pretrained("vinai/bartpho-syllable")

>>> line = "Chúng tôi là những nghiên cứu viên."

>>> input_ids = tokenizer(line, return_tensors="pt")

>>> with torch.no_grad():
...     features = bartpho(**input_ids)  # Models outputs are now tuples

>>> # With TensorFlow 2.0+:
>>> from transformers import TFAutoModel
>>> bartpho = TFAutoModel.from_pretrained("vinai/bartpho-syllable")
>>> input_ids = tokenizer(line, return_tensors="tf")
>>> features = bartpho(**input_ids)

Tips:

  • Following mBART, BARTpho uses the “large” architecture of BART with an additional layer-normalization layer on top of both the encoder and decoder. Thus, usage examples in the documentation of BART, when adapting to use with BARTpho, should be adjusted by replacing the BART-specialized classes with the mBART-specialized counterparts. For example:
>>> from transformers import MBartForConditionalGeneration
>>> bartpho = MBartForConditionalGeneration.from_pretrained("vinai/bartpho-syllable")
>>> TXT = 'Chúng tôi là <mask> nghiên cứu viên.'
>>> input_ids = tokenizer([TXT], return_tensors='pt')['input_ids']
>>> logits = bartpho(input_ids).logits
>>> masked_index = (input_ids[0] == tokenizer.mask_token_id).nonzero().item()
>>> probs = logits[0, masked_index].softmax(dim=0)
>>> values, predictions = probs.topk(5)
>>> print(tokenizer.decode(predictions).split())
  • This implementation is only for tokenization: “monolingual_vocab_file” consists of Vietnamese-specialized types extracted from the pre-trained SentencePiece model “vocab_file” that is available from the multilingual XLM-RoBERTa. Other languages, if employing this pre-trained multilingual SentencePiece model “vocab_file” for subword segmentation, can reuse BartphoTokenizer with their own language-specialized “monolingual_vocab_file”.

This model was contributed by dqnguyen. The original code can be found here.

BartphoTokenizer

class transformers.BartphoTokenizer < > expand 

( vocab_file monolingual_vocab_file bos_token = '<s>' eos_token = '</s>' sep_token = '</s>' cls_token = '<s>' unk_token = '<unk>' pad_token = '<pad>' mask_token = '<mask>' sp_model_kwargs: typing.Union[typing.Dict[str, typing.Any], NoneType] = None **kwargs )

Adapted from XLMRobertaTokenizer. Based on SentencePiece.

This tokenizer inherits from PreTrainedTokenizer which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.

Attributes: spmodel (SentencePieceProcessor): The _SentencePiece processor that is used for every conversion (string, tokens and IDs).

build_inputs_with_special_tokens < > expand 

( token_ids_0: typing.List[int] token_ids_1: typing.Optional[typing.List[int]] = None ) List[int]

Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and adding special tokens. An BARTPho sequence has the following format:

  • single sequence: <s> X </s>
  • pair of sequences: <s> A </s></s> B </s>
convert_tokens_to_string < > expand 

( tokens )

Converts a sequence of tokens (strings for sub-words) in a single string.

create_token_type_ids_from_sequences < > expand 

( token_ids_0: typing.List[int] token_ids_1: typing.Optional[typing.List[int]] = None ) List[int]

Create a mask from the two sequences passed to be used in a sequence-pair classification task. BARTPho does not make use of token type ids, therefore a list of zeros is returned.

get_special_tokens_mask < > expand 

( token_ids_0: typing.List[int] token_ids_1: typing.Optional[typing.List[int]] = None already_has_special_tokens: bool = False ) List[int]

Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer prepare_for_model method.