Fernando Carneiro committed f8c2752 (parent: 6686b7f). Files changed: README.md (+23, -1).
README.md:

---
language: pt
license: apache-2.0
---

# <a name="introduction"></a> BERTweet.BR: A Pre-Trained Language Model for Tweets in Portuguese

Sharing the same architecture as BERTweet (Nguyen et al., 2020), we trained our model from scratch following the RoBERTa (Liu et al., 2019) pre-training procedure on a corpus of approximately 9GB containing 100M Portuguese tweets. We evaluate the model on the task of sentiment analysis using a collection of eight human-annotated datasets, five of which have three classes while the rest are binary. We compare the performance of our model against a broad set of contextualized transformer-based models, including language-specific, multilingual, and Twitter-adapted models. We also take the Portuguese version of the static fastText word embeddings as a baseline and compare BERTweet.BR against it in a feature-based approach that extracts fixed word representations. Experiments show that our model consistently outperforms mBERT (Devlin et al., 2018), BERTimbau (Souza et al., 2020), XLM-R (Conneau et al., 2020), and XLM-T (Barbieri et al., 2022) in most cases, and outperforms the static word embeddings word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), and fastText (Mikolov et al., 2018) in all tests.
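The feature-based setup mentioned above can be sketched with the Hugging Face `transformers` API. This is a minimal illustration, not the authors' exact pipeline: the mean-pooling choice is an assumption, and you must pass the model's actual Hugging Face repository id yourself.

```python
def extract_fixed_representation(text, model_name):
    """Return one fixed vector for `text` by mean-pooling the last
    hidden states of a pre-trained encoder (feature-based, no fine-tuning).

    `model_name` is the Hugging Face model id, e.g. the BERTweet.BR
    repository id from this model card's page.
    """
    # Imports live inside the function so the sketch stays self-contained.
    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()  # feature extraction only; weights stay frozen

    with torch.no_grad():
        inputs = tokenizer(text, return_tensors="pt")
        outputs = model(**inputs)

    # Mean-pool token vectors into a single fixed-size representation.
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)
```

The frozen-encoder variant shown here mirrors the paper's comparison against static embeddings: both produce fixed vectors that a downstream sentiment classifier can consume without updating the encoder.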