|
--- |
|
base_model: tarudesu/ViHateT5-base |
|
tags: |
|
- generated_from_trainer |
|
model-index: |
|
- name: ViHateT5-base-HSD |
|
results: [] |
|
datasets: |
|
- tarudesu/ViCTSD |
|
- tarudesu/ViHOS |
|
- tarudesu/ViHSD |
|
language: |
|
- vi |
|
metrics: |
|
- f1 |
|
- accuracy |
|
pipeline_tag: text2text-generation |
|
widget: |
|
- text: "toxic-speech-detection: Nhìn bà không thể không nhớ đến các phim phù thủy" |
|
- text: "hate-speech-detection: thằng đó trông đần vcl ấy nhỉ" |
|
- text: "hate-spans-detection: trông như cl" |
|
--- |
|
|
|
|
|
|
# <a name="introduction"></a>ViHateT5: Enhancing Hate Speech Detection in Vietnamese with A Unified Text-to-Text Transformer Model | ACL'2024 (Findings) |
|
**Disclaimer**: This paper contains examples from actual content on social media platforms that could be considered toxic and offensive. |
|
|
|
ViHateT5-HSD is a fine-tuned version of [ViHateT5](https://huggingface.co/tarudesu/ViHateT5-base), trained on multiple Vietnamese hate speech detection benchmark datasets.
|
|
|
The architecture and experimental results of ViHateT5 can be found in the [paper](https://arxiv.org/abs/2405.14141):
|
|
|
```bibtex
@misc{nguyen2024vihatet5,
      title={ViHateT5: Enhancing Hate Speech Detection in Vietnamese With A Unified Text-to-Text Transformer Model},
      author={Luan Thanh Nguyen},
      year={2024},
      eprint={2405.14141},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
|
|
|
The pre-training dataset, VOZ-HSD, is available [HERE](https://huggingface.co/datasets/tarudesu/VOZ-HSD).
|
|
|
Kindly **CITE** our paper if you use ViHateT5-HSD to generate published results or integrate it into other software. |
|
|
|
**Example usage** |
|
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("tarudesu/ViHateT5-base-HSD")
model = AutoModelForSeq2SeqLM.from_pretrained("tarudesu/ViHateT5-base-HSD")

def generate_output(input_text, prefix):
    # Prepend the task prefix
    prefixed_input_text = prefix + ': ' + input_text

    # Tokenize the input text
    input_ids = tokenizer.encode(prefixed_input_text, return_tensors="pt")

    # Generate the output sequence
    output_ids = model.generate(input_ids, max_length=256)

    # Decode the generated output
    output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)

    return output_text

sample = 'Tôi ghét bạn vl luôn!'  # "I hate you so much!"
prefix = 'hate-spans-detection'  # One of: 'hate-speech-detection', 'toxic-speech-detection', 'hate-spans-detection'

result = generate_output(sample, prefix)
print('Result:', result)
```
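Since the task prefix is the only signal that routes an input to one of the three tasks, it helps to validate it before calling the model. The sketch below (the helper name `build_input` is ours, not part of the library) shows how inputs in the `"<prefix>: <text>"` format expected by ViHateT5-base-HSD can be constructed:

```python
# The three task prefixes supported by ViHateT5-base-HSD.
TASKS = ("hate-speech-detection", "toxic-speech-detection", "hate-spans-detection")

def build_input(text: str, task: str) -> str:
    """Build a prefixed input string, rejecting unknown task names."""
    if task not in TASKS:
        raise ValueError(f"Unknown task: {task!r}; expected one of {TASKS}")
    return f"{task}: {text}"
```

A string built this way can be passed directly to `generate_output` (or tokenized and fed to `model.generate`) as in the example above.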
|
|
|
Please feel free to contact us via email at luannt@uit.edu.vn if you have any questions or need further information.