Fernando Carneiro committed f8c2752 (parent: 6686b7f). Files changed: README.md (+23, -1).
README.md:

---
language: pt
license: apache-2.0
---

# <a name="introduction"></a> BERTweet.BR: A Pre-Trained Language Model for Tweets in Portuguese

Sharing the same architecture as BERTweet (Nguyen et al., 2020), we trained our model from scratch following the RoBERTa (Liu et al., 2019) pre-training procedure on a corpus of approximately 9GB containing 100M Portuguese tweets. We evaluate the model on the task of sentiment analysis using a collection of eight human-annotated datasets, five of which have three classes while the rest are binary. We compare the performance of our model against a broad set of contextualized transformer-based models, including language-specific, multilingual, and Twitter-adapted models. We also take the Portuguese version of the static fastText word embeddings as a baseline and compare BERTweet.BR against it in a feature-based approach that extracts fixed word representations. Experiments show that our model consistently outperforms mBERT (Devlin et al., 2018), BERTimbau (Souza et al., 2020), XLM-R (Conneau et al., 2020), and XLM-T (Barbieri et al., 2022) in most cases, and outperforms the static word embeddings word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), and fastText (Mikolov et al., 2018) in all tests.
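The feature-based setup mentioned above can be sketched with the Hugging Face `transformers` API. This is a minimal illustration, not the authors' exact pipeline: the mean-pooling choice is an assumption, and you must pass the model's actual Hugging Face repository id yourself.

```python
def extract_fixed_representation(text, model_name):
    """Return one fixed vector for `text` by mean-pooling the last
    hidden states of a pre-trained encoder (feature-based, no fine-tuning).

    `model_name` is the Hugging Face model id, e.g. the BERTweet.BR
    repository id from this model card's page.
    """
    # Imports live inside the function so the sketch stays self-contained.
    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()  # feature extraction only; weights stay frozen

    with torch.no_grad():
        inputs = tokenizer(text, return_tensors="pt")
        outputs = model(**inputs)

    # Mean-pool token vectors into a single fixed-size representation.
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)
```

The frozen-encoder variant shown here mirrors the paper's comparison against static embeddings: both produce fixed vectors that a downstream sentiment classifier can consume without updating the encoder.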