FaBERT: Pre-training BERT on Persian Blogs

Model Details

FaBERT is a Persian BERT-base model pre-trained on the diverse HmBlogs corpus, which encompasses both casual and formal Persian text. Developed for natural language processing tasks, FaBERT is a robust solution for processing Persian text. Across a range of Natural Language Understanding (NLU) tasks, FaBERT consistently delivers notable improvements while keeping a compact model size. FaBERT is available on Hugging Face, so integrating it into your projects is straightforward.

Features

  • Pre-trained on the diverse HmBlogs corpus, consisting of more than 50 GB of text from Persian blogs
  • Remarkable performance across various downstream NLP tasks
  • BERT architecture with 124 million parameters

Usage

Loading the Model with MLM head

from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("sbunlp/fabert") # make sure to use the default fast tokenizer
model = AutoModelForMaskedLM.from_pretrained("sbunlp/fabert")
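Once loaded, the MLM head can be exercised directly with the `fill-mask` pipeline. A minimal sketch (the Persian sentence below, "Tehran is the capital of [MASK].", is illustrative; the model is downloaded on first use):

```python
from transformers import pipeline

# Load FaBERT with its MLM head behind the fill-mask pipeline.
fill = pipeline("fill-mask", model="sbunlp/fabert")

# Build the input using the tokenizer's own mask token instead of hard-coding it.
masked = f"تهران پایتخت {fill.tokenizer.mask_token} است."

# Print the top five candidate tokens with their scores.
for prediction in fill(masked, top_k=5):
    print(prediction["token_str"], round(prediction["score"], 3))
```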

Downstream Tasks

Similar to the original English BERT, FaBERT can be fine-tuned on many downstream tasks (see the Hugging Face fine-tuning guide: https://huggingface.co/docs/transformers/en/training).

Examples on Persian datasets are available in our GitHub repository.

Make sure to use the default fast tokenizer.
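As a rough illustration of such fine-tuning, here is a minimal sequence-classification sketch. The toy sentences, label set, and learning rate are illustrative assumptions, not the paper's setup:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load FaBERT with a freshly initialized classification head (2 toy labels).
tokenizer = AutoTokenizer.from_pretrained("sbunlp/fabert")  # default fast tokenizer
model = AutoModelForSequenceClassification.from_pretrained("sbunlp/fabert", num_labels=2)

# Toy positive/negative Persian examples; replace with a real dataset.
texts = ["این فیلم عالی بود", "اصلا خوب نبود"]
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# One illustrative training step; the loss is computed internally from `labels`.
model.train()
outputs = model(**batch, labels=labels)
outputs.loss.backward()
optimizer.step()
print(outputs.loss.item())
```

In practice you would wrap this in a training loop (or use the `Trainer` API) over a full Persian dataset such as those in the GitHub repository's examples.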

Training Details

FaBERT was pre-trained with the masked language modeling (MLM) objective using whole-word masking (WWM), and the resulting perplexity on the validation set was 7.76.
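For reference, MLM perplexity is just the exponential of the mean masked-token cross-entropy loss, so the reported 7.76 corresponds to a mean validation loss of roughly 2.05 (the loss value below is back-computed, not taken from the paper):

```python
import math

def perplexity(mean_loss: float) -> float:
    """Perplexity is exp() of the mean cross-entropy loss over masked tokens."""
    return math.exp(mean_loss)

# A mean MLM loss of about 2.049 corresponds to the reported perplexity of 7.76.
print(round(perplexity(2.049), 2))
```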

| Hyperparameter   | Value       |
|------------------|-------------|
| Batch Size       | 32          |
| Optimizer        | Adam        |
| Learning Rate    | 6e-5        |
| Weight Decay     | 0.01        |
| Total Steps      | 18 Million  |
| Warmup Steps     | 1.8 Million |
| Precision Format | TF32        |

Evaluation

Here are some key performance results for the FaBERT model:

Sentiment Analysis

| Task          | FaBERT | ParsBERT | XLM-R |
|---------------|--------|----------|-------|
| MirasOpinion  | 87.51  | 86.73    | 84.92 |
| MirasIrony    | 74.82  | 71.08    | 75.51 |
| DeepSentiPers | 79.85  | 74.94    | 79.00 |

Named Entity Recognition

| Task          | FaBERT | ParsBERT | XLM-R |
|---------------|--------|----------|-------|
| PEYMA         | 91.39  | 91.24    | 90.91 |
| ParsTwiner    | 82.22  | 81.13    | 79.50 |
| MultiCoNER v2 | 57.92  | 58.09    | 51.47 |

Question Answering

| Task     | FaBERT | ParsBERT | XLM-R |
|----------|--------|----------|-------|
| ParsiNLU | 55.87  | 44.89    | 42.55 |
| PQuAD    | 87.34  | 86.89    | 87.60 |
| PCoQA    | 53.51  | 50.96    | 51.12 |

Natural Language Inference & QQP

| Task         | FaBERT | ParsBERT | XLM-R |
|--------------|--------|----------|-------|
| FarsTail     | 84.45  | 82.52    | 83.50 |
| SBU-NLI      | 66.65  | 58.41    | 58.85 |
| ParsiNLU QQP | 82.62  | 77.60    | 79.74 |

Number of Parameters

|                     | FaBERT | ParsBERT | XLM-R |
|---------------------|--------|----------|-------|
| Parameter Count (M) | 124    | 162      | 278   |
| Vocabulary Size (K) | 50     | 100      | 250   |

For a more detailed performance analysis, refer to the paper.

How to Cite

If you use FaBERT in your research or projects, please cite it using the following BibTeX:

@article{masumi2024fabert,
  title={FaBERT: Pre-training BERT on Persian Blogs},
  author={Masumi, Mostafa and Majd, Seyed Soroush and Shamsfard, Mehrnoush and Beigy, Hamid},
  journal={arXiv preprint arXiv:2402.06617},
  year={2024}
}