BERTweet
Overview
The BERTweet model was proposed in BERTweet: A pre-trained language model for English Tweets by Dat Quoc Nguyen, Thanh Vu, Anh Tuan Nguyen.
The abstract from the paper is the following:
We present BERTweet, the first public large-scale pre-trained language model for English Tweets. Our BERTweet, having the same architecture as BERT-base (Devlin et al., 2019), is trained using the RoBERTa pre-training procedure (Liu et al., 2019). Experiments show that BERTweet outperforms strong baselines RoBERTa-base and XLM-R-base (Conneau et al., 2020), producing better performance results than the previous state-of-the-art models on three Tweet NLP tasks: Part-of-speech tagging, Named-entity recognition and text classification.
Example of use:
>>> import torch
>>> from transformers import AutoModel, AutoTokenizer
>>> bertweet = AutoModel.from_pretrained("vinai/bertweet-base")
>>> # For transformers v4.x+:
>>> tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False)
>>> # For transformers v3.x:
>>> # tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")
>>> # INPUT TWEET IS ALREADY NORMALIZED!
>>> line = "SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry:"
>>> input_ids = torch.tensor([tokenizer.encode(line)])
>>> with torch.no_grad():
...     features = bertweet(input_ids)  # Model outputs are now tuples
>>> # With TensorFlow 2.0+:
>>> # from transformers import TFAutoModel
>>> # bertweet = TFAutoModel.from_pretrained("vinai/bertweet-base")
This model was contributed by dqnguyen. The original code can be found here.
BertweetTokenizer
( vocab_file merges_file normalization = False bos_token = '<s>' eos_token = '</s>' sep_token = '</s>' cls_token = '<s>' unk_token = '<unk>' pad_token = '<pad>' mask_token = '<mask>' **kwargs )
Parameters
- vocab_file (str) — Path to the vocabulary file.
- merges_file (str) — Path to the merges file.
- normalization (bool, optional, defaults to False) — Whether or not to apply a normalization preprocessing step.
- bos_token (str, optional, defaults to "<s>") — The beginning of sequence token that was used during pretraining. Can be used as a sequence classifier token.
Constructs a BERTweet tokenizer, using Byte-Pair-Encoding.
This tokenizer inherits from PreTrainedTokenizer which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.
Loads a pre-existing dictionary from a text file and adds its symbols to this instance.
( token_ids_0: typing.List[int] token_ids_1: typing.Optional[typing.List[int]] = None ) → List[int]
Parameters
- token_ids_0 (List[int]) — List of IDs to which the special tokens will be added.
- token_ids_1 (List[int], optional) — Optional second list of IDs for sequence pairs.
Returns
List[int]
List of input IDs with the appropriate special tokens.
Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens. A BERTweet sequence has the following format:
- single sequence:
<s> X </s>
- pair of sequences:
<s> A </s></s> B </s>
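The layout above can be illustrated with a minimal pure-Python sketch. It is not the library's implementation: the helper name `build_inputs_sketch` is hypothetical, and the special-token IDs (0 for `<s>`, 2 for `</s>`) are assumptions chosen for illustration. Real code would read them from the tokenizer's `cls_token_id` and `sep_token_id` attributes.

```python
from typing import List, Optional

# Assumed IDs for illustration only; read the real values from the tokenizer.
CLS_ID = 0  # stands in for <s>
SEP_ID = 2  # stands in for </s>


def build_inputs_sketch(
    token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
) -> List[int]:
    """Wrap one or two ID lists in the documented special-token layout."""
    if token_ids_1 is None:
        # single sequence: <s> X </s>
        return [CLS_ID] + token_ids_0 + [SEP_ID]
    # pair of sequences: <s> A </s></s> B </s>
    return [CLS_ID] + token_ids_0 + [SEP_ID, SEP_ID] + token_ids_1 + [SEP_ID]


single = build_inputs_sketch([11, 12])
pair = build_inputs_sketch([11, 12], [21])
```

Note the doubled separator between the two segments of a pair, which follows the RoBERTa-style format rather than the single `[SEP]` used by BERT.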
Converts a sequence of tokens (string) into a single string.
( token_ids_0: typing.List[int] token_ids_1: typing.Optional[typing.List[int]] = None ) → List[int]
Create a mask from the two sequences passed to be used in a sequence-pair classification task. BERTweet does not make use of token type ids, therefore a list of zeros is returned.
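As a rough sketch of what "a list of zeros" means in practice, the function below returns one zero per position of the final model input. It assumes the returned list also covers the special tokens of the `<s> X </s>` and `<s> A </s></s> B </s>` formats; the helper name `token_type_ids_sketch` is hypothetical and this is not the library's code.

```python
from typing import List, Optional


def token_type_ids_sketch(
    token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
) -> List[int]:
    """Return one zero per position of the final input, special tokens included."""
    if token_ids_1 is None:
        # <s> X </s> adds two special tokens
        return [0] * (len(token_ids_0) + 2)
    # <s> A </s></s> B </s> adds four special tokens
    return [0] * (len(token_ids_0) + len(token_ids_1) + 4)


single_types = token_type_ids_sketch([11, 12])
pair_types = token_type_ids_sketch([11, 12], [21])
```

Because every entry is zero, the two segments of a pair are not distinguished by this mask.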
( token_ids_0: typing.List[int] token_ids_1: typing.Optional[typing.List[int]] = None already_has_special_tokens: bool = False ) → List[int]
Parameters
- token_ids_0 (List[int]) — List of IDs.
- token_ids_1 (List[int], optional) — Optional second list of IDs for sequence pairs.
- already_has_special_tokens (bool, optional, defaults to False) — Whether or not the token list is already formatted with special tokens for the model.
Returns
List[int]
A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer prepare_for_model method.
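The mask can be sketched in plain Python for the default case where `already_has_special_tokens` is False. The helper name `special_tokens_mask_sketch` is hypothetical, and the positions of the ones assume the `<s> X </s>` and `<s> A </s></s> B </s>` formats; treat it as an illustration, not the library's implementation.

```python
from typing import List, Optional


def special_tokens_mask_sketch(
    token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
) -> List[int]:
    """Mark special-token positions with 1 and sequence-token positions with 0."""
    if token_ids_1 is None:
        # <s> X </s>
        return [1] + [0] * len(token_ids_0) + [1]
    # <s> A </s></s> B </s>
    return (
        [1] + [0] * len(token_ids_0) + [1, 1] + [0] * len(token_ids_1) + [1]
    )


single_mask = special_tokens_mask_sketch([11, 12])
pair_mask = special_tokens_mask_sketch([11, 12], [21])
```

A mask like this is useful, for example, to exclude special tokens when selecting positions to mask for masked language modeling.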
Normalize tokens in a Tweet
Normalize a raw Tweet
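A simplified sketch of the kind of normalization involved is shown below. The replacement rules are inferred from the placeholders in the usage example (`@USER` for user mentions, `HTTPURL` for links) and are an assumption about the behavior, not the tokenizer's actual code. The function names are hypothetical, whitespace splitting stands in for a real Tweet tokenizer, and emoji conversion (such as `:cry:`) is omitted.

```python
def normalize_token_sketch(token: str) -> str:
    """Replace user mentions and links with placeholder tokens (assumed rules)."""
    lowered = token.lower()
    if lowered.startswith("@"):
        return "@USER"
    if lowered.startswith("http") or lowered.startswith("www"):
        return "HTTPURL"
    return token


def normalize_tweet_sketch(tweet: str) -> str:
    """Normalize a raw Tweet token by token, splitting on whitespace only."""
    return " ".join(normalize_token_sketch(t) for t in tweet.split())


raw = "DHEC confirms https://example.com/news via @someone"
normalized = normalize_tweet_sketch(raw)
```

With the real tokenizer, the equivalent behavior is requested by passing `normalization=True`, the constructor parameter documented above.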