
FaBERT: Pre-training BERT on Persian Blogs

Model Details

FaBERT is a Persian BERT-base model pre-trained on the diverse HmBlogs corpus, which encompasses both casual and formal Persian text. Across a range of Natural Language Understanding (NLU) tasks, FaBERT consistently delivers notable improvements over comparable Persian and multilingual models while keeping a compact model size. The model is available on Hugging Face and can be integrated into existing Transformers pipelines without added complexity.

Features

  • Pre-trained on the diverse HmBlogs corpus, consisting of more than 50 GB of text from Persian blogs
  • Remarkable performance across various downstream NLP tasks
  • BERT architecture with 124 million parameters

Useful Links

  • Paper: https://arxiv.org/abs/2402.06617

Usage

Loading the Model with MLM head

from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("sbunlp/fabert") # make sure to use the default fast tokenizer
model = AutoModelForMaskedLM.from_pretrained("sbunlp/fabert")
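
For a quick check of the MLM head, you can use the fill-mask pipeline. A minimal sketch, assuming the default fast tokenizer; the Persian example sentence is illustrative only:

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="sbunlp/fabert")

# "The capital of Iran is the city of [MASK]." (illustrative sentence)
sentence = f"پایتخت ایران شهر {fill_mask.tokenizer.mask_token} است."
for prediction in fill_mask(sentence):
    print(prediction["token_str"], round(prediction["score"], 3))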

Downstream Tasks

Similar to the original English BERT, FaBERT can be fine-tuned on many downstream tasks; see the Hugging Face fine-tuning guide (https://huggingface.co/docs/transformers/en/training) for the general workflow.

Examples on Persian datasets are available in our GitHub repository.

Make sure to use the default fast tokenizer.
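
As a rough sketch of that workflow, the snippet below fine-tunes FaBERT for binary text classification with the Trainer API. The CSV file names, the "text"/"label" column names, and the hyperparameters are placeholders rather than the settings used in the paper; complete examples on real Persian datasets are in the GitHub repository.

from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("sbunlp/fabert")  # default fast tokenizer
model = AutoModelForSequenceClassification.from_pretrained("sbunlp/fabert", num_labels=2)

# Placeholder data: any Persian classification set with "text" and "label" columns
dataset = load_dataset("csv", data_files={"train": "train.csv", "validation": "dev.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

tokenized = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="fabert-finetuned",
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    tokenizer=tokenizer,  # enables dynamic padding via the default data collator
)
trainer.train()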

Training Details

FaBERT was pre-trained with a masked language modeling (MLM) objective using whole word masking (WWM), reaching a perplexity of 7.76 on the validation set.

Hyperparameter     Value
Batch Size         32
Optimizer          Adam
Learning Rate      6e-5
Weight Decay       0.01
Total Steps        18 Million
Warmup Steps       1.8 Million
Precision Format   TF32
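
To illustrate the whole word masking objective described above, the sketch below masks a couple of short Persian sentences with Hugging Face's DataCollatorForWholeWordMask and turns the resulting MLM loss into a perplexity. The sentences and masking rate are illustrative only and do not reproduce the original HmBlogs validation setup.

import math

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer, DataCollatorForWholeWordMask

tokenizer = AutoTokenizer.from_pretrained("sbunlp/fabert")
model = AutoModelForMaskedLM.from_pretrained("sbunlp/fabert")
model.eval()

texts = [
    # "The weather is very nice today." / "I am interested in reading books."
    "امروز هوا بسیار خوب است.",
    "من به خواندن کتاب علاقه دارم.",
]

# Whole word masking: all WordPiece sub-tokens of a selected word are masked together
collator = DataCollatorForWholeWordMask(tokenizer=tokenizer, mlm_probability=0.15)
batch = collator([tokenizer(t) for t in texts])

attention_mask = (batch["input_ids"] != tokenizer.pad_token_id).long()
with torch.no_grad():
    loss = model(input_ids=batch["input_ids"],
                 attention_mask=attention_mask,
                 labels=batch["labels"]).loss

print(f"MLM loss: {loss.item():.2f}  perplexity: {math.exp(loss.item()):.2f}")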

Evaluation

Here are some key performance results for the FaBERT model:

Sentiment Analysis

Task            FaBERT   ParsBERT   XLM-R
MirasOpinion    87.51    86.73      84.92
MirasIrony      74.82    71.08      75.51
DeepSentiPers   79.85    74.94      79.00

Named Entity Recognition

Task            FaBERT   ParsBERT   XLM-R
PEYMA           91.39    91.24      90.91
ParsTwiner      82.22    81.13      79.50
MultiCoNER v2   57.92    58.09      51.47

Question Answering

Task       FaBERT   ParsBERT   XLM-R
ParsiNLU   55.87    44.89      42.55
PQuAD      87.34    86.89      87.60
PCoQA      53.51    50.96      51.12

Natural Language Inference & QQP

Task           FaBERT   ParsBERT   XLM-R
FarsTail       84.45    82.52      83.50
SBU-NLI        66.65    58.41      58.85
ParsiNLU QQP   82.62    77.60      79.74

Number of Parameters

                      FaBERT   ParsBERT   XLM-R
Parameter Count (M)   124      162        278
Vocabulary Size (K)   50       100        250

For a more detailed performance analysis, refer to the paper (https://arxiv.org/abs/2402.06617).

How to Cite

If you use FaBERT in your research or projects, please cite it using the following BibTeX:

@article{masumi2024fabert,
  title={FaBERT: Pre-training BERT on Persian Blogs},
  author={Masumi, Mostafa and Majd, Seyed Soroush and Shamsfard, Mehrnoush and Beigy, Hamid},
  journal={arXiv preprint arXiv:2402.06617},
  year={2024}
}