viBERT-base

A Vietnamese RoBERTa-based language model pre-trained on CC-100 Vietnamese and a custom Vietnamese corpus.

Model Description

viBERT-base uses the BERT-base architecture and was pre-trained RoBERTa-style on Vietnamese text. It can be fine-tuned for downstream Vietnamese NLP tasks such as named entity recognition, text classification, and question answering.

Model Architecture

Parameter	Value
Architecture	BERT-base
Hidden size	768
Attention heads	12
Hidden layers	12
Vocab size	41,035
Max sequence length	512
Parameters	~110M
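As a rough sanity check on the table above, the parameter count can be estimated from the listed sizes. This is a back-of-the-envelope sketch, not the author's calculation: the feed-forward size (3072), the two token-type embeddings, and the pooler layer are standard BERT-base values assumed here, since the table does not state them.

```python
def bert_param_count(vocab_size, hidden=768, layers=12, max_positions=512,
                     intermediate=3072, type_vocab=2, with_pooler=True):
    """Estimate the parameter count of a BERT-style encoder.

    intermediate, type_vocab and with_pooler are assumptions (standard
    BERT-base values); they are not given in the model card.
    """
    layer_norm = 2 * hidden  # scale and bias
    # Word, position and token-type embedding tables plus one LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value and output projections (weights and biases) plus LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers (weights and biases) plus LayerNorm.
    feed_forward = (hidden * intermediate + intermediate
                    + intermediate * hidden + hidden) + layer_norm
    pooler = hidden * hidden + hidden if with_pooler else 0
    return embeddings + layers * (attention + feed_forward) + pooler


total = bert_param_count(vocab_size=41_035)
print(f"{total / 1e6:.1f}M parameters")
```

Under these assumptions the estimate is of the same order as the ~110M listed above; the exact figure depends on the checkpoint's actual configuration. The number of attention heads does not change the count, because the heads split the same projection matrices.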

Training Data

CC-100 Vietnamese: Large-scale web crawl data
Custom Vietnamese corpus: Additional curated Vietnamese text

Usage

Feature Extraction

from transformers import AutoModel, AutoTokenizer

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("mainguyen9/viBERT-base")
model = AutoModel.from_pretrained("mainguyen9/viBERT-base")

# Encode text
text = "Xin chào Việt Nam"
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)

# Get embeddings
last_hidden_state = outputs.last_hidden_state

Masked Language Modeling

from transformers import pipeline

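fill_mask = pipeline("fill-mask", model="mainguyen9/viBERT-base")

# Use the tokenizer's own mask token; it may be "[MASK]" or "<mask>"
# depending on how the tokenizer was built.
mask = fill_mask.tokenizer.mask_token
result = fill_mask(f"Hà Nội là {mask} đô của Việt Nam.")
print(result)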

Fine-tuning for NER

from transformers import AutoModelForTokenClassification, AutoTokenizer

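# Number of token-level labels in your tagging scheme. Example value:
# 10 entity types tagged in BIO format give 2 * 10 + 1 = 21 labels.
num_labels = 21

model = AutoModelForTokenClassification.from_pretrained(
    "mainguyen9/viBERT-base",
    num_labels=num_labels,
)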
tokenizer = AutoTokenizer.from_pretrained("mainguyen9/viBERT-base")
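Before fine-tuning, word-level NER tags usually have to be aligned with subword tokens. The sketch below shows one common convention (label the first subword of each word, ignore the rest); it is not necessarily how the reported results were produced. The BIO scheme and the helper names `build_label_list` and `align_labels` are assumptions made for illustration; the entity types are the ones listed in the PhoNER_COVID19 results of this card.

```python
ENTITY_TYPES = [
    "AGE", "DATE", "GENDER", "JOB", "LOCATION", "NAME", "ORGANIZATION",
    "PATIENT_ID", "SYMPTOM_AND_DISEASE", "TRANSPORTATION",
]


def build_label_list(entity_types):
    """Return BIO labels: "O" plus a B- and an I- tag per entity type."""
    labels = ["O"]
    for entity in entity_types:
        labels += [f"B-{entity}", f"I-{entity}"]
    return labels


def align_labels(word_ids, word_label_ids, ignore_index=-100):
    """Map word-level label ids onto subword tokens.

    word_ids holds, for each token, the index of the word it came from
    (None for special tokens). Only the first subword of each word keeps
    its label; all other positions get ignore_index.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None or word_id == previous:
            aligned.append(ignore_index)
        else:
            aligned.append(word_label_ids[word_id])
        previous = word_id
    return aligned


LABELS = build_label_list(ENTITY_TYPES)
print(len(LABELS))
```

With a fast tokenizer, `word_ids` can be obtained from `tokenizer(words, is_split_into_words=True).word_ids()`; whether this checkpoint ships a fast tokenizer is not stated here. The value -100 is the default index ignored by PyTorch's cross-entropy loss.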

Benchmark Results

Task	Dataset	Metric	Score (%)
NER	PhoNER_COVID19	F1	89.38
NLI	XNLI Vietnamese	Accuracy	71.06
Hate Speech	ViHSD	Accuracy	87.89

NER Performance Details (PhoNER_COVID19)

Fine-tuned for 5 epochs with batch size 32 and learning rate 2e-5.

Entity Type	Precision (%)	Recall (%)	F1-Score (%)	Support
AGE	90.91	97.27	93.98	586
DATE	98.20	99.17	98.68	3,026
GENDER	89.96	92.23	91.08	476
JOB	66.59	51.75	58.24	570
LOCATION	88.52	91.33	89.90	10,845
NAME	94.09	90.56	92.29	1,388
ORGANIZATION	77.02	78.05	77.53	1,640
PATIENT_ID	95.61	98.54	97.05	2,120
SYMPTOM_AND_DISEASE	82.84	74.70	78.56	2,158
TRANSPORTATION	85.63	91.41	88.43	489
Micro Average	89.09	89.69	89.38	23,298
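The micro-average row pools all entity types before computing precision and recall, so frequent types such as LOCATION dominate it. As a consistency check, it can be approximately reconstructed from the per-entity rows. This is only a sketch: true-positive and prediction counts are estimated from the rounded percentages in the table, so small deviations from the reported row are expected.

```python
# (precision %, recall %, support) per entity type, copied from the table above.
ROWS = {
    "AGE": (90.91, 97.27, 586),
    "DATE": (98.20, 99.17, 3026),
    "GENDER": (89.96, 92.23, 476),
    "JOB": (66.59, 51.75, 570),
    "LOCATION": (88.52, 91.33, 10845),
    "NAME": (94.09, 90.56, 1388),
    "ORGANIZATION": (77.02, 78.05, 1640),
    "PATIENT_ID": (95.61, 98.54, 2120),
    "SYMPTOM_AND_DISEASE": (82.84, 74.70, 2158),
    "TRANSPORTATION": (85.63, 91.41, 489),
}


def micro_average(rows):
    """Estimate micro precision, recall and F1 (in %) from per-class rows."""
    tp_total = predicted_total = gold_total = 0.0
    for precision, recall, support in rows.values():
        true_positives = recall / 100 * support            # recall = TP / gold
        tp_total += true_positives
        predicted_total += true_positives / (precision / 100)  # precision = TP / predicted
        gold_total += support
    p = 100 * tp_total / predicted_total
    r = 100 * tp_total / gold_total
    f1 = 2 * p * r / (p + r)  # harmonic mean of precision and recall
    return p, r, f1


micro_p, micro_r, micro_f1 = micro_average(ROWS)
print(f"P={micro_p:.2f} R={micro_r:.2f} F1={micro_f1:.2f}")
```

The reconstructed values should land within rounding error of the reported micro-average row.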

NLI Performance (XNLI Vietnamese)

Fine-tuned for 5 epochs with batch size 64 and learning rate 2e-5.

Metric	Score
Accuracy	71.06%
F1 (macro)	71.02%

Hate Speech Detection (ViHSD)

Fine-tuned for 5 epochs with batch size 8 and learning rate 2e-5.

Class	Precision	Recall	F1-Score	Support
CLEAN	91.84%	96.40%	94.06%	5,548
OFFENSIVE	51.86%	40.77%	45.65%	444
HATE	67.32%	49.71%	57.19%	688
Accuracy			87.89%	6,680
Macro Avg	70.34%	62.29%	65.63%	6,680
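The gap between accuracy and macro-averaged F1 in this table follows from class imbalance: accuracy is dominated by the large CLEAN class, while the macro average weights all three classes equally. The sketch below recomputes both from the per-class rows; it works from the rounded table values, so it is a consistency check rather than a re-evaluation.

```python
# (precision %, recall %, F1 %, support) per class, copied from the table above.
VIHSD = {
    "CLEAN": (91.84, 96.40, 94.06, 5548),
    "OFFENSIVE": (51.86, 40.77, 45.65, 444),
    "HATE": (67.32, 49.71, 57.19, 688),
}


def macro_average(rows):
    """Unweighted mean of precision, recall and F1 across classes."""
    n = len(rows)
    precision = sum(r[0] for r in rows.values()) / n
    recall = sum(r[1] for r in rows.values()) / n
    f1 = sum(r[2] for r in rows.values()) / n
    return precision, recall, f1


def accuracy_from_recall(rows):
    """Accuracy (%) as the support-weighted mean of per-class recall."""
    total = sum(r[3] for r in rows.values())
    correct = sum(r[1] / 100 * r[3] for r in rows.values())
    return 100 * correct / total


macro_p, macro_r, macro_f1 = macro_average(VIHSD)
accuracy = accuracy_from_recall(VIHSD)
print(f"macro P={macro_p:.2f} R={macro_r:.2f} F1={macro_f1:.2f}, accuracy={accuracy:.2f}")
```

For imbalanced datasets like this one, the macro F1 is usually the more informative figure for the minority OFFENSIVE and HATE classes.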

Limitations

Trained primarily on Vietnamese text; performance may vary on code-mixed text
Maximum sequence length is 512 tokens; longer inputs must be truncated or split into chunks
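One way to work within the 512-token limit is to split long inputs into overlapping windows and run the model on each window. The following is a minimal sketch of that idea on a plain list of token ids. The helper name `chunk_token_ids`, the overlap of 64 tokens, and the reservation of two positions for special tokens are illustrative assumptions, not settings taken from this model.

```python
def chunk_token_ids(token_ids, max_len=512, num_special=2, stride=64):
    """Split token ids into overlapping windows that fit the model.

    max_len:     model limit, including special tokens.
    num_special: positions reserved for special tokens (assumed: one at
                 the start and one at the end of each window).
    stride:      number of tokens shared by consecutive windows.
    """
    window = max_len - num_special
    if not 0 <= stride < window:
        raise ValueError("stride must be non-negative and smaller than the window")
    step = window - stride
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # the last window already reaches the end of the input
    return chunks


ids = list(range(1000))  # stand-in for real token ids
chunks = chunk_token_ids(ids)
print([len(c) for c in chunks])
```

Hugging Face tokenizers can produce similar windows directly via `return_overflowing_tokens=True` together with `stride` and `truncation=True`; the manual version here is only meant to make the mechanics explicit.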

Citation

@misc{vibert-base,
  author = {Mai Nguyen},
  title = {viBERT-base: A Vietnamese RoBERTa Model},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/mainguyen9/viBERT-base}
}

License

MIT License
