knowhate/HateBERTimbau

Not-For-All-Audiences


gilramos committed on May 13, 2024

Commit aa35a5a · verified · 1 Parent(s): 8761ba5

Update README.md

Files changed (1)

README.md +13 -2


@@ -37,7 +37,7 @@ HateBERTimbau is a transformer-based encoder model for identifying hate speech i
 ## Training Data
-229,103 tweets associated with offensive content were used to retrain the base model
+229,103 tweets associated with offensive content were used to retrain the base model.
 ## Training Hyperparameters
@@ -64,7 +64,18 @@ Twitter Test Set:
 ## BibTeX Citation
-[More Information Needed]
+@mastersthesis{Matos-Automatic-Hate-Speech-Detection-in-Portuguese-Social-Media-Text,
+title = {{Automatic Hate Speech Detection in Portuguese Social Media Text}},
+author = {Matos, Bernardo Cunha},
+month = nov,
+year = {2022},
+abstract = {{Online Hate Speech (HS) has been growing dramatically on social media and its uncontrolled spread has motivated researchers to develop a diversity of methods for its automated detection. However, the detection of online HS in Portuguese still merits further research. To fill this gap, we explored different models that proved to be successful in the literature to address this task. In particular, we have explored models that use the BERT architecture. Beyond testing single-task models we also explored multitask models that use the information on other related categories to learn HS. To better capture the semantics of this type of texts, we developed HateBERTimbau, a retrained version of BERTimbau more directed to social media language including potential HS targeting African descent, Roma, and LGBTQI+ communities. The performed experiments were based on CO-HATE and FIGHT, corpora of social media messages posted by the Portuguese online community that were labelled regarding the presence of HS among other categories.
+The results achieved show the importance of considering the annotator's agreement on the data used to develop HS detection models. Comparing different subsets of data used for the training of the models it was shown that, in general, a higher agreement on the data leads to better results.
+HATEBERTimbau consistently outperformed BERTimbau on both datasets confirming that further pre-training of BERTimbau was a successful strategy to obtain a language model more suitable for online HS detection in Portuguese.
+The implementation of target-specific models, and multitask learning have shown potential in obtaining better results.}},
+language = {eng},
+copyright = {embargoed-access},
+}
 ## Acknowledgements