metadata
language: pl
license: cc-by-sa-4.0
datasets:
- Polish subset of Open Subtitles
- Polish subset of ParaCrawl
- Polish Parliamentary Corpus
- Polish Wikipedia - Feb 2020
- >-
Expert-annotated Dataset for Automatic Cyberbullying Detection in Polish
Language
Polbert-CB - Polish BERT trained for Automatic Cyberbullying Detection
This is Polbert, a Polish version of the BERT language model, fine-tuned on a re-annotated and improved version of the Dataset for Automatic Cyberbullying Detection in Polish Language.
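For readers who want to try the model, the following is a minimal sketch of how it might be loaded with the Hugging Face `transformers` text-classification pipeline. The Hub model ID is an assumption that mirrors the GitHub repository name given in the citation, and `load_classifier` and `classify` are hypothetical helper names introduced here for illustration. The label names returned by the model are not documented in this card, so the sketch passes them through unchanged.

```python
from typing import Callable, Dict, List

# Assumption: the Hub model ID mirrors the GitHub repository name in the citation.
MODEL_ID = "ptaszynski/bert-base-polish-cyberbullying"


def load_classifier(model_id: str = MODEL_ID):
    """Build a text-classification pipeline (downloads weights on first use)."""
    # Imported lazily so the helper below can be used without transformers installed.
    from transformers import pipeline

    return pipeline("text-classification", model=model_id)


def classify(classifier: Callable, texts: List[str]) -> List[Dict]:
    """Pair each input text with the top label and score the classifier returns."""
    predictions = classifier(texts)
    return [
        {"text": text, "label": pred["label"], "score": float(pred["score"])}
        for text, pred in zip(texts, predictions)
    ]


# Example usage (requires network access and `pip install transformers torch`):
#   classifier = load_classifier()
#   for row in classify(classifier, ["Przykładowy tweet do sprawdzenia."]):
#       print(row["label"], row["score"], row["text"])
```

Keeping the model loading separate from `classify` makes it possible to swap in a different checkpoint or a stub classifier without changing the calling code.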
Fine-tuning dataset
The dataset used for fine-tuning this model is based on the original Dataset for Automatic Cyberbullying Detection in Polish Language, which was recently cleaned and re-annotated by experts from Samurai Labs. The improved dataset will be released separately later.
Acknowledgements
- We would like to express our gratitude to the annotators of this dataset, both the original annotators and the more recent expert annotators, for the invaluable time they spent preparing it.
Author
Michal Ptaszynski - contact me on:
- Twitter: @mich_ptaszynski
- GitHub: ptaszynski
- LinkedIn: michalptaszynski
- HuggingFace: ptaszynski
Licences
The fine-tuned model, with all attached files, is licensed under CC BY-SA 4.0 (Creative Commons Attribution-ShareAlike 4.0 International License).
Citations
Please cite this model using the following citation.
Model:
@article{ptaszynski2022cyberbullyibng-bert-pl,
title={Polish BERT trained for Automatic Cyberbullying Detection},
author={Ptaszynski, Michal and Pieciukiewicz, Agata and Dybala, Pawel and Skrzek, Pawel and Soliwoda, Kamil and Fortuna, Marcin and Leliwa, Gniewosz and Wroczynski, Michal},
year={2022},
publisher={HuggingFace},
url={https://github.com/ptaszynski/bert-base-polish-cyberbullying}
}
Original dataset:
@article{ptaszynski2019results,
title={Results of the poleval 2019 shared task 6: First dataset and open shared task for automatic cyberbullying detection in polish twitter},
author={Ptaszynski, Michal and Pieciukiewicz, Agata and Dyba{\l}a, Pawe{\l}},
year={2019},
publisher={Warszawa: Institute of Computer Sciences. Polish Academy of Sciences}
}
Improved dataset:
TBA