5CD-AI/visobert-14gb-corpus

Overview

We continue pretraining uitnlp/visobert on a merged 14GB dataset. The training data includes:

  • Internal data (100M comments and 15M posts on Facebook)
  • UIT data, which was used to pretrain uitnlp/visobert
  • MC4 ecommerce

Results on 4 downstream tasks on Vietnamese social media text: Emotion Recognition (UIT-VSMEC), Hate Speech Detection (UIT-HSD), Spam Reviews Detection (ViSpamReviews), and Hate Speech Spans Detection (ViHOS). Acc is accuracy, WF1 is weighted F1, and MF1 is macro F1:

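| Model | Avg | Emotion Recognition Acc | Emotion Recognition WF1 | Emotion Recognition MF1 | Hate Speech Detection Acc | Hate Speech Detection WF1 | Hate Speech Detection MF1 | Spam Reviews Detection Acc | Spam Reviews Detection WF1 | Spam Reviews Detection MF1 | Hate Speech Spans Detection Acc | Hate Speech Spans Detection WF1 | Hate Speech Spans Detection MF1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| viBERT | 78.16 | 61.91 | 61.98 | 59.7 | 85.34 | 85.01 | 62.07 | 89.93 | 89.79 | 76.8 | 90.42 | 90.45 | 84.55 |
| vELECTRA | 79.23 | 64.79 | 64.71 | 61.95 | 86.96 | 86.37 | 63.95 | 89.83 | 89.68 | 76.23 | 90.59 | 90.58 | 85.12 |
| PhoBERT-Base | 79.3 | 63.49 | 63.36 | 61.41 | 87.12 | 86.81 | 65.01 | 89.83 | 89.75 | 76.18 | 91.32 | 91.38 | 85.92 |
| PhoBERT-Large | 79.82 | 64.71 | 64.66 | 62.55 | 87.32 | 86.98 | 65.14 | 90.12 | 90.03 | 76.88 | 91.44 | 91.46 | 86.56 |
| ViSoBERT | 81.58 | 68.1 | 68.37 | 65.88 | 88.51 | 88.31 | 68.77 | 90.99 | 90.92 | 79.06 | 91.62 | 91.57 | 86.8 |
| visobert-14gb-corpus | 82.2 | 68.69 | 68.75 | 66.03 | 88.79 | 88.6 | 69.57 | 91.02 | 90.88 | 77.13 | 93.69 | 93.63 | 89.66 |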
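
The card does not say how the Avg column is computed. The reported values are consistent with an unweighted mean of the 12 per-task scores (Acc, WF1, MF1 for each of the 4 tasks). That reading is an inference from the numbers, not a statement from the authors. A minimal check for two rows, with the scores copied from the table:

```python
from statistics import mean

# Per-task scores copied from the results table, in column order:
# (Acc, WF1, MF1) for Emotion Recognition, Hate Speech Detection,
# Spam Reviews Detection, Hate Speech Spans Detection.
scores = {
    "ViSoBERT": [68.1, 68.37, 65.88, 88.51, 88.31, 68.77,
                 90.99, 90.92, 79.06, 91.62, 91.57, 86.8],
    "visobert-14gb-corpus": [68.69, 68.75, 66.03, 88.79, 88.6, 69.57,
                             91.02, 90.88, 77.13, 93.69, 93.63, 89.66],
}

# Avg values as reported in the table.
reported_avg = {"ViSoBERT": 81.58, "visobert-14gb-corpus": 82.2}

# Assumption: Avg is the plain mean of all 12 scores in a row.
averages = {name: mean(values) for name, values in scores.items()}

for name, avg in averages.items():
    print(f"{name}: computed {avg:.3f}, reported {reported_avg[name]}")
```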

Usage (HuggingFace Transformers)

Install the transformers package:

pip install transformers

Then use the model for the fill-mask task:

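from transformers import pipeline

# Load the model from the Hugging Face Hub into a fill-mask pipeline.
model_path = "5CD-AI/visobert-14gb-corpus"
mask_filler = pipeline("fill-mask", model=model_path)

# Return the 10 most likely replacements for <mask>.
mask_filler("shop làm ăn như cái <mask>", top_k=10)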
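
For a single input with one mask, the fill-mask pipeline returns a list of dicts with the keys score, token, token_str, and sequence. The sketch below shows one way to reduce that list to (token, score) pairs. The helper name summarize_predictions and the example entries are hypothetical placeholders, not real output from this model.

```python
def summarize_predictions(predictions, min_score=0.0):
    """Return (token, score) pairs at or above min_score, highest score first."""
    pairs = [
        (p["token_str"].strip(), p["score"])
        for p in predictions
        if p["score"] >= min_score
    ]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Placeholder entries in the shape the pipeline returns.
# These are NOT real predictions from 5CD-AI/visobert-14gb-corpus.
example_predictions = [
    {"score": 0.10, "token": 2, "token_str": " token_b", "sequence": "text token_b"},
    {"score": 0.25, "token": 1, "token_str": " token_a", "sequence": "text token_a"},
    {"score": 0.01, "token": 3, "token_str": " token_c", "sequence": "text token_c"},
]

summary = summarize_predictions(example_predictions, min_score=0.05)
print(summary)
```

With a real pipeline, the same helper could be called as summarize_predictions(mask_filler(text, top_k=10)).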

Fine-tune Configuration

We fine-tune 5CD-AI/visobert-14gb-corpus on the 4 downstream tasks using the transformers library with the following shared configuration:

  • seed: 42
  • gradient_accumulation_steps: 1
  • weight_decay: 0.01
  • optimizer: AdamW with betas=(0.9, 0.999) and epsilon=1e-08
  • training_epochs: 30
  • model_max_length: 128
  • learning_rate: 1e-5
  • metric_for_best_model: wf1
  • strategy: epoch
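
The setting metric_for_best_model: wf1 means the best checkpoint is chosen by the wf1 metric. The sketch below shows one way to compute Acc, WF1, and MF1 for a classification task, assuming WF1 is the support-weighted mean of per-class F1 and MF1 is the unweighted mean. It is an illustration in plain Python, not the authors' evaluation code, and the function name classification_scores is hypothetical.

```python
from collections import Counter


def classification_scores(y_true, y_pred):
    """Compute accuracy, weighted F1, and macro F1 from two label lists."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)  # number of true examples per class
    pairs = list(zip(y_true, y_pred))

    per_class_f1 = {}
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        per_class_f1[label] = 2 * tp / denom if denom else 0.0

    acc = sum(1 for t, p in pairs if t == p) / len(pairs)
    mf1 = sum(per_class_f1.values()) / len(labels)
    wf1 = sum(per_class_f1[label] * support[label] for label in labels) / len(pairs)
    return {"acc": acc, "wf1": wf1, "mf1": mf1}


# Toy labels for illustration only.
result = classification_scores([0, 0, 0, 1], [0, 0, 1, 1])
print(result)
```

In a Trainer setup, a compute_metrics function would need to return a dict containing the key wf1 for metric_for_best_model to find it.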

Each task also uses the following additional configuration:

| Task | train_batch_size | lr_scheduler_type |
|---|---|---|
| Emotion Recognition | 64 | linear |
| Hate Speech Detection | 32 | linear |
| Spam Reviews Detection | 32 | cosine |
| Hate Speech Spans Detection | 32 | cosine |
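
The shared and per-task settings above can be sketched as keyword arguments for transformers.TrainingArguments. The mapping is an assumption: the card gives short names, so train_batch_size is read here as per_device_train_batch_size, training_epochs as num_train_epochs, and strategy: epoch as save_strategy (it may also refer to the evaluation strategy, whose argument name differs between library versions). The task keys and the function name training_kwargs are hypothetical. model_max_length is a tokenizer setting and is kept separate.

```python
# Shared settings, using TrainingArguments keyword names.
COMMON_ARGS = {
    "seed": 42,
    "gradient_accumulation_steps": 1,
    "weight_decay": 0.01,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "num_train_epochs": 30,        # assumption: "training_epochs" on the card
    "learning_rate": 1e-5,
    "metric_for_best_model": "wf1",
    "save_strategy": "epoch",      # assumption: one reading of "strategy: epoch"
}

# Tokenizer-side setting, not a TrainingArguments field.
MODEL_MAX_LENGTH = 128

# Per-task settings; the dictionary keys are hypothetical task identifiers.
TASK_ARGS = {
    "emotion_recognition": {"per_device_train_batch_size": 64, "lr_scheduler_type": "linear"},
    "hate_speech_detection": {"per_device_train_batch_size": 32, "lr_scheduler_type": "linear"},
    "spam_reviews_detection": {"per_device_train_batch_size": 32, "lr_scheduler_type": "cosine"},
    "hate_speech_spans_detection": {"per_device_train_batch_size": 32, "lr_scheduler_type": "cosine"},
}


def training_kwargs(task):
    """Merge the shared settings with the settings for one task."""
    return {**COMMON_ARGS, **TASK_ARGS[task]}


emotion_kwargs = training_kwargs("emotion_recognition")
print(emotion_kwargs)
```

These could then be passed as TrainingArguments(output_dir="out", **emotion_kwargs), where the output directory name is a placeholder.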

Model size: 97.6M params (Safetensors, tensor type F32)