---
license: cc-by-sa-4.0
---

# IndoBERTweet-SexuallyExplicit

## Model Description

IndoBERTweet fine-tuned on the IndoToxic2024 dataset for sexually explicit text classification, with an accuracy of 0.91 and a macro-F1 of 0.80. Both scores were obtained through stratified 10-fold cross-validation.

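To make the two reported metrics concrete, here is a minimal sketch of how accuracy and macro-F1 are computed from gold and predicted labels. The toy labels are hypothetical and are not taken from IndoToxic2024; the helper names `accuracy` and `macro_f1` are illustrative only, not the evaluation code behind the scores above.

```python
def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label matches the gold label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical, imbalanced toy labels (assumption: 1 marks the minority class)
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

print(round(accuracy(y_true, y_pred), 3))  # 0.9
print(round(macro_f1(y_true, y_pred), 3))  # 0.804
```

Because macro-F1 weights every class equally, a single missed minority-class example lowers it far more than it lowers accuracy, which is one way a gap between the two scores can arise on imbalanced data.
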
## Supported Tokenizer

- **indolem/indobertweet-base-uncased**

## Example Code

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Specify the model and tokenizer names
model_name = "Exqrch/IndoBERTweet-SexuallyExplicit"
tokenizer_name = "indolem/indobertweet-base-uncased"

# Load the fine-tuned model
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

text = "selamat pagi semua!"

# Run inference without tracking gradients
with torch.no_grad():
    output = model(**tokenizer(text, return_tensors="pt"))
logits = output.logits

# Get the predicted class label
predicted_class = torch.argmax(logits, dim=-1).item()

print(predicted_class)
# Output: 0
```

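If a confidence score is more useful than a bare label, the logits can be turned into probabilities with a softmax. The sketch below shows the arithmetic in plain Python under two assumptions that the card does not state: the model has two output classes, and index 1 is the sexually explicit class. The `classify` helper and its `threshold` parameter are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits, threshold=0.5):
    """Return (label, probability of class 1).

    Assumption: logits are ordered [class 0, class 1]; `threshold` is a
    hypothetical knob, not something defined by the model.
    """
    positive_prob = softmax(logits)[1]
    label = 1 if positive_prob >= threshold else 0
    return label, positive_prob


# Hypothetical logits for a single input (not real model output)
example_logits = [2.0, -1.0]
label, prob = classify(example_logits)
print(label, round(prob, 4))
```

With the example script, `logits[0].tolist()` yields a list that can be passed to `classify`. At the default threshold of 0.5 a two-class softmax gives the same label as `argmax`; raising or lowering the threshold trades precision against recall.
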
## Limitations

The model was trained only on Indonesian text. Its performance on code-switched text is unknown.

## Sample Output

```
Model name: Exqrch/IndoBERTweet-SexuallyExplicit
Text 1: billiard engak ntar bro?
Prediction: 0
Text 2: eh kerumah ku yok main bareng di ranjang
Prediction: 1
```

## Citation

If you use this model, please cite:

```
@article{susanto2024indotoxic2024,
      title={IndoToxic2024: A Demographically-Enriched Dataset of Hate Speech and Toxicity Types for Indonesian Language},
      author={Lucky Susanto and Musa Izzanardi Wijanarko and Prasetia Anugrah Pratama and Traci Hong and Ika Idris and Alham Fikri Aji and Derry Wijaya},
      year={2024},
      eprint={2406.19349},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2406.19349},
}
```