A Vietnamese RoBERTa-based language model pre-trained on the Vietnamese portion of CC-100 and a custom Vietnamese corpus.
viBERT-base uses the BERT-base architecture and is trained with RoBERTa-style pre-training on Vietnamese text. It can be fine-tuned for Vietnamese NLP downstream tasks such as named entity recognition, text classification, and question answering.
| Parameter | Value |
|---|---|
| Architecture | BERT-base |
| Hidden size | 768 |
| Attention heads | 12 |
| Hidden layers | 12 |
| Vocab size | 41,035 |
| Max sequence length | 512 |
| Parameters | ~110M |
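As a rough cross-check of the size figure, the parameter count can be estimated from the values in the table. The sketch below assumes the standard BERT-base layout, which the table does not spell out: a feed-forward size of 3072, two token-type embeddings, biases on every linear layer, and a pooler head. `estimate_bert_params` is an illustrative helper, not part of any library.

```python
def estimate_bert_params(vocab_size=41_035, hidden=768, layers=12,
                         max_positions=512, intermediate=3072, type_vocab=2):
    """Estimate the parameter count of a BERT-style encoder.

    intermediate=3072 and type_vocab=2 are assumptions (standard BERT-base
    values); the other defaults come from the configuration table.
    """
    layer_norm = 2 * hidden  # scale + bias
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm

    attention = 4 * (hidden * hidden + hidden)  # Q, K, V and output projections
    feed_forward = (hidden * intermediate + intermediate) + (intermediate * hidden + hidden)
    per_layer = attention + feed_forward + 2 * layer_norm

    pooler = hidden * hidden + hidden
    return embeddings + layers * per_layer + pooler


total = estimate_bert_params()
print(f"{total:,} parameters (~{total / 1e6:.0f}M)")
```

Under these assumptions the estimate comes out near 118M, slightly above the rounded ~110M in the table. Most of the gap comes from the 41,035-entry vocabulary, which is larger than the 30,522 entries of the original English BERT-base.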
```python
from transformers import AutoModel, AutoTokenizer

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("mainguyen9/viBERT-base")
model = AutoModel.from_pretrained("mainguyen9/viBERT-base")

# Encode text
text = "Xin chào Việt Nam"
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)

# Get token-level embeddings, shape (batch_size, sequence_length, 768)
last_hidden_state = outputs.last_hidden_state
```
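`last_hidden_state` holds one vector per token. If a single vector per sentence is needed, one common option is to average the token vectors while skipping padding. The model card does not prescribe a pooling strategy, so the sketch below is only one possibility; `masked_mean` and `sentence_embedding` are hypothetical helper names.

```python
def masked_mean(token_vectors, attention_mask):
    """Average the token vectors whose attention-mask entry is 1."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    return [sum(dim) / len(kept) for dim in zip(*kept)]


def sentence_embedding(text, model_name="mainguyen9/viBERT-base"):
    """Encode `text` and mean-pool its last hidden state (downloads the model)."""
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():  # inference only, no gradients needed
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, hidden_size)
    return masked_mean(hidden.tolist(), inputs["attention_mask"][0].tolist())
```

Calling `sentence_embedding("Xin chào Việt Nam")` would return a plain list of 768 floats. For batches, the same averaging is usually done directly on tensors rather than on Python lists.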
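```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="mainguyen9/viBERT-base")
# Take the mask token from the tokenizer rather than hard-coding "[MASK]" or "<mask>".
mask = fill_mask.tokenizer.mask_token
result = fill_mask(f"Hà Nội là {mask} đô của Việt Nam.")
print(result)
```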
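For a single masked position, the fill-mask pipeline returns a list of dictionaries with the keys `score`, `token`, `token_str`, and `sequence`. A small helper can turn that into a readable ranking. The sketch below uses made-up scores purely for illustration; `format_predictions` is a hypothetical name.

```python
def format_predictions(predictions, top_k=3):
    """Return 'token (probability)' strings for the best-scoring candidates."""
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [f'{p["token_str"].strip()} ({p["score"]:.2%})' for p in ranked[:top_k]]


# Hypothetical pipeline output, for illustration only; not real model scores.
example = [
    {"score": 0.10, "token": 2, "token_str": "trung", "sequence": "..."},
    {"score": 0.85, "token": 1, "token_str": "thủ", "sequence": "..."},
]
print(format_predictions(example))
```

With real output, `format_predictions(result)` can be applied to the `result` list directly.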
```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Size of your tag set, e.g. 21 for BIO tags over 10 entity types plus "O".
num_labels = 21

model = AutoModelForTokenClassification.from_pretrained(
    "mainguyen9/viBERT-base",
    num_labels=num_labels,
)
tokenizer = AutoTokenizer.from_pretrained("mainguyen9/viBERT-base")
```
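The value of `num_labels` depends on the tag set of the target dataset. As a minimal sketch, the ten entity types listed in the PhoNER_COVID19 results of this document can be expanded into BIO tags. The BIO scheme and the label order are assumptions here, and `build_bio_labels` and `load_token_classifier` are illustrative helpers.

```python
ENTITY_TYPES = [
    "AGE", "DATE", "GENDER", "JOB", "LOCATION", "NAME", "ORGANIZATION",
    "PATIENT_ID", "SYMPTOM_AND_DISEASE", "TRANSPORTATION",
]


def build_bio_labels(entity_types):
    """Return (labels, label2id, id2label) for a BIO tagging scheme."""
    labels = ["O"]
    for entity in entity_types:
        labels += [f"B-{entity}", f"I-{entity}"]
    label2id = {label: i for i, label in enumerate(labels)}
    id2label = {i: label for label, i in label2id.items()}
    return labels, label2id, id2label


labels, label2id, id2label = build_bio_labels(ENTITY_TYPES)
num_labels = len(labels)  # one "O" tag plus B-/I- tags for each entity type


def load_token_classifier(model_name="mainguyen9/viBERT-base"):
    """Load the checkpoint with a fresh classification head (downloads the model)."""
    from transformers import AutoModelForTokenClassification

    return AutoModelForTokenClassification.from_pretrained(
        model_name, num_labels=num_labels, id2label=id2label, label2id=label2id
    )
```

Passing `id2label` and `label2id` stores the mapping in the model config, so predictions can later be decoded into tag names instead of raw indices.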
| Task | Dataset | Metric | Score (%) |
|---|---|---|---|
| NER | PhoNER_COVID19 | F1 | 89.38 |
| NLI | XNLI Vietnamese | Accuracy | 71.06 |
| Hate Speech | ViHSD | Accuracy | 87.89 |
NER on PhoNER_COVID19: fine-tuned for 5 epochs with batch size 32 and learning rate 2e-5.
| Entity Type | Precision (%) | Recall (%) | F1-Score (%) | Support |
|---|---|---|---|---|
| AGE | 90.91 | 97.27 | 93.98 | 586 |
| DATE | 98.20 | 99.17 | 98.68 | 3,026 |
| GENDER | 89.96 | 92.23 | 91.08 | 476 |
| JOB | 66.59 | 51.75 | 58.24 | 570 |
| LOCATION | 88.52 | 91.33 | 89.90 | 10,845 |
| NAME | 94.09 | 90.56 | 92.29 | 1,388 |
| ORGANIZATION | 77.02 | 78.05 | 77.53 | 1,640 |
| PATIENT_ID | 95.61 | 98.54 | 97.05 | 2,120 |
| SYMPTOM_AND_DISEASE | 82.84 | 74.70 | 78.56 | 2,158 |
| TRANSPORTATION | 85.63 | 91.41 | 88.43 | 489 |
| Micro Average | 89.09 | 89.69 | 89.38 | 23,298 |
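The F1 column is the harmonic mean of precision and recall, so any row can be sanity-checked from its other two columns. The sketch below recomputes a few rows from the table above; `f1_score` is a local helper, not a library function.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above.
rows = {
    "JOB": (66.59, 51.75, 58.24),
    "LOCATION": (88.52, 91.33, 89.90),
    "Micro Average": (89.09, 89.69, 89.38),
}
for name, (p, r, reported) in rows.items():
    print(f"{name}: recomputed {f1_score(p, r):.2f}, reported {reported:.2f}")
```

Recomputed values can differ from the table in the last digit because the precision and recall shown are already rounded.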
NLI on XNLI Vietnamese: fine-tuned for 5 epochs with batch size 64 and learning rate 2e-5.
| Metric | Score |
|---|---|
| Accuracy | 71.06% |
| F1 (macro) | 71.02% |
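The NLI settings stated above (5 epochs, batch size 64, learning rate 2e-5) and the two reported metrics can be sketched as follows. This is not the author's training script: the output path is hypothetical, the batch size is assumed to be per device on a single device, and the labels are toy data.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        scores.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(scores) / len(scores)


def nli_training_args(output_dir="vibert-xnli"):
    """Training settings stated for the NLI run (requires transformers)."""
    from transformers import TrainingArguments

    return TrainingArguments(
        output_dir=output_dir,            # hypothetical path
        num_train_epochs=5,
        per_device_train_batch_size=64,   # assumes a single device
        learning_rate=2e-5,
    )


# Toy labels for illustration only (0 = entailment, 1 = neutral, 2 = contradiction).
gold = [0, 1, 2, 0, 1, 2]
pred = [0, 1, 1, 0, 2, 2]
print(f"accuracy={accuracy(gold, pred):.2%}  macro-F1={macro_f1(gold, pred):.2%}")
```

Accuracy and macro F1 land close together when the classes are balanced, which is consistent with the near-identical 71.06% and 71.02% reported above.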
Hate speech detection on ViHSD: fine-tuned for 5 epochs with batch size 8 and learning rate 2e-5.
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| CLEAN | 91.84% | 96.40% | 94.06% | 5,548 |
| OFFENSIVE | 51.86% | 40.77% | 45.65% | 444 |
| HATE | 67.32% | 49.71% | 57.19% | 688 |
| Accuracy | | | 87.89% | 6,680 |
| Macro Avg | 70.34% | 62.29% | 65.63% | 6,680 |
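The macro average weights the three classes equally, while accuracy is dominated by the most frequent class. The sketch below recomputes the macro row from the per-class rows above and adds the accuracy of a trivial majority-class predictor for comparison; both helpers are illustrative.

```python
# Per-class rows copied from the table above: (precision, recall, F1, support).
rows = {
    "CLEAN": (91.84, 96.40, 94.06, 5548),
    "OFFENSIVE": (51.86, 40.77, 45.65, 444),
    "HATE": (67.32, 49.71, 57.19, 688),
}


def macro_average(rows):
    """Unweighted mean of precision, recall and F1 across classes."""
    n = len(rows)
    return tuple(round(sum(r[i] for r in rows.values()) / n, 2) for i in range(3))


def majority_baseline_accuracy(rows):
    """Accuracy (%) of always predicting the most frequent class."""
    supports = [r[3] for r in rows.values()]
    return round(100 * max(supports) / sum(supports), 2)


print("macro precision/recall/F1:", macro_average(rows))
print("majority-class accuracy:", majority_baseline_accuracy(rows))
```

Always predicting CLEAN would already reach about 83% accuracy on this split, so the macro-averaged scores give a more informative picture of the OFFENSIVE and HATE classes than accuracy alone.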
```bibtex
@misc{vibert-base,
  author    = {Mai Nguyen},
  title     = {viBERT-base: A Vietnamese RoBERTa Model},
  year      = {2024},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/mainguyen9/viBERT-base}
}
```
MIT License