5CD-AI/visocial-T5-base

Overview

We trimmed the vocabulary to 50,589 tokens and continually pretrained google/mt5-base[1] on a merged 20GB dataset. The training data includes:

  • Crawled data (100M comments and 15M posts on Facebook)
  • UIT data[2], which was used to pretrain uitnlp/visobert[2]
  • MC4 ecommerce
  • 10.7M comments on VOZ Forum from tarudesu/VOZ-HSD[7]
  • 3.6M Amazon reviews[3] translated into Vietnamese, from 5CD-AI/Vietnamese-amazon_polarity-gg-translated
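Because the checkpoint is continued from google/mt5-base, it should load through the generic seq2seq classes of the transformers library. The snippet below is a minimal sketch under that assumption; the helper name `load_visocial_t5` is hypothetical, and the input format expected by the model for each downstream task is not specified here.

```python
MODEL_ID = "5CD-AI/visocial-T5-base"


def load_visocial_t5(model_id: str = MODEL_ID):
    """Return (tokenizer, model) for the given checkpoint.

    Hypothetical helper: transformers is imported lazily so that defining
    the function does not require the library or network access.
    """
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
    return tokenizer, model
```

Calling `tokenizer, model = load_visocial_t5()` downloads the checkpoint on first use. `len(tokenizer)` can then be compared with the 50,589 vocabulary figure stated above.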

Results on three downstream tasks on Vietnamese social media text: Hate Speech Detection (UIT-HSD), Toxic Speech Detection (ViCTSD), and Hate Spans Detection (ViHOS):

| Model | Average MF1 | Hate Speech Acc | Hate Speech WF1 | Hate Speech MF1 | Toxic Speech Acc | Toxic Speech WF1 | Toxic Speech MF1 | Hate Spans Acc | Hate Spans WF1 | Hate Spans MF1 |
|---|---|---|---|---|---|---|---|---|---|---|
| PhoBERT[4] | 69.63 | 86.75 | 86.52 | 64.76 | 90.78 | 90.27 | 71.31 | 84.65 | 81.12 | 72.81 |
| PhoBERT_v2[4] | 70.50 | 87.42 | 87.33 | 66.60 | 90.23 | 89.78 | 71.39 | 84.92 | 81.51 | 73.51 |
| viBERT[5] | 67.80 | 86.33 | 85.79 | 62.85 | 88.81 | 88.17 | 67.65 | 84.63 | 81.28 | 72.91 |
| ViSoBERT[6] | 75.07 | 88.17 | 87.86 | 67.71 | 90.35 | 90.16 | 71.45 | 90.16 | 90.07 | 86.04 |
| ViHateT5[7] | 75.56 | 88.76 | 89.14 | 68.67 | 90.80 | 91.78 | 71.63 | 91.00 | 90.20 | 86.37 |
| visocial-T5-base (Ours) | 78.01 | 89.51 | 89.78 | 71.19 | 92.20 | 93.47 | 73.81 | 92.57 | 92.20 | 89.04 |
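
The Average MF1 column appears to be the unweighted mean of the three per-task MF1 scores. This is inferred from the numbers rather than stated explicitly, so the sketch below simply recomputes the mean from the reported values as a sanity check.

```python
# (reported Average MF1, [Hate Speech MF1, Toxic Speech MF1, Hate Spans MF1]),
# copied from the results table above.
rows = {
    "PhoBERT": (69.63, [64.76, 71.31, 72.81]),
    "PhoBERT_v2": (70.50, [66.60, 71.39, 73.51]),
    "viBERT": (67.80, [62.85, 67.65, 72.91]),
    "ViSoBERT": (75.07, [67.71, 71.45, 86.04]),
    "ViHateT5": (75.56, [68.67, 71.63, 86.37]),
    "visocial-T5-base": (78.01, [71.19, 73.81, 89.04]),
}


def mean_mf1(scores):
    """Unweighted mean of per-task MF1 scores, rounded to two decimals."""
    return round(sum(scores) / len(scores), 2)


for name, (reported, scores) in rows.items():
    print(f"{name}: reported {reported:.2f}, recomputed {mean_mf1(scores):.2f}")
```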

Comparison of visocial-T5-base with other T5-based models on Vietnamese HSD-related tasks, measured by Macro F1-score:

| Model | Hate Speech Detection MF1 | Toxic Speech Detection MF1 | Hate Spans Detection MF1 |
|---|---|---|---|
| mT5[1] | 66.76 | 69.93 | 86.60 |
| ViT5[8] | 66.95 | 64.82 | 86.90 |
| ViHateT5[7] | 68.67 | 71.63 | 86.37 |
| visocial-T5-base (Ours) | 71.90 | 73.81 | 89.04 |
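
For readers unfamiliar with the metric, Macro F1 averages the per-class F1 scores without weighting by class frequency, so rare classes count as much as common ones. The following is a minimal pure-Python sketch of that definition, with a support-weighted variant for contrast (assuming WF1 in the tables denotes weighted F1). The labels are toy data, not drawn from any of the benchmark datasets, and the evaluation scripts actually used for the tables may differ.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """F1 score for every label that occurs in y_true or y_pred."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 weighted by each class's count in y_true."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[label] * support[label] for label in scores) / len(y_true)


# Toy example: class 0 is twice as frequent as class 1.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
print(macro_f1(y_true, y_pred), weighted_f1(y_true, y_pred))
```

On imbalanced data such as hate speech corpora, the macro average is typically lower than the weighted one because errors on the minority class are not diluted by the majority class.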

Fine-tune Configuration

We fine-tuned 5CD-AI/visocial-T5-base on the 3 downstream tasks using the transformers library with the following configuration:

  • seed: 42
  • training_epochs: 4
  • train_batch_size: 4
  • gradient_accumulation_steps: 8
  • learning_rate: 3e-4
  • lr_scheduler_type: linear
  • model_max_length: 256
  • metric_for_best_model: eval_loss
  • evaluation_strategy: steps
  • eval_steps: 0.1
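
Two consequences of this configuration are worth spelling out. With a batch size of 4 and 8 gradient accumulation steps, each optimizer update sees 32 examples per device. A fractional eval_steps is interpreted by recent versions of transformers as a ratio of the total number of training steps, so 0.1 means roughly ten evaluations per run. The sketch below works through that arithmetic; the device count and the dataset size of 24,000 examples are hypothetical, and the step computation is a simplified approximation of what the Trainer does internally.

```python
import math

# Values copied from the list above; key names follow the list, not
# necessarily the exact argument names of the transformers library.
finetune_config = {
    "seed": 42,
    "training_epochs": 4,
    "train_batch_size": 4,
    "gradient_accumulation_steps": 8,
    "learning_rate": 3e-4,
    "lr_scheduler_type": "linear",
    "model_max_length": 256,
    "metric_for_best_model": "eval_loss",
    "evaluation_strategy": "steps",
    "eval_steps": 0.1,
}


def effective_batch_size(cfg, num_devices=1):
    """Examples consumed per optimizer update (num_devices is an assumption)."""
    return cfg["train_batch_size"] * cfg["gradient_accumulation_steps"] * num_devices


def eval_interval(cfg, num_train_examples, num_devices=1):
    """Approximate number of optimizer updates between two evaluations."""
    batches_per_epoch = math.ceil(
        num_train_examples / (cfg["train_batch_size"] * num_devices)
    )
    updates_per_epoch = max(batches_per_epoch // cfg["gradient_accumulation_steps"], 1)
    total_updates = math.ceil(cfg["training_epochs"] * updates_per_epoch)
    ratio = cfg["eval_steps"]
    # A value below 1 is treated as a fraction of all training steps.
    return math.ceil(total_updates * ratio) if ratio < 1 else int(ratio)


print(effective_batch_size(finetune_config))
print(eval_interval(finetune_config, num_train_examples=24_000))
```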

References

[1] mT5: A massively multilingual pre-trained text-to-text transformer

[2] ViSoBERT: A Pre-Trained Language Model for Vietnamese Social Media Text Processing

[3] The Amazon Polarity dataset

[4] PhoBERT: Pre-trained language models for Vietnamese

[5] Improving Sequence Tagging for Vietnamese Text Using Transformer-based Neural Models

[6] ViSoBERT: A Pre-Trained Language Model for Vietnamese Social Media Text Processing

[7] ViHateT5: Enhancing Hate Speech Detection in Vietnamese With A Unified Text-to-Text Transformer Model

[8] ViT5: Pretrained Text-to-Text Transformer for Vietnamese Language Generation
