Contributed by Mehrdad Farahani (m3hrdadfi)

How to use this model directly from the 🤗/transformers library:

from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("m3hrdadfi/albert-fa-base-v2")
model = AutoModelForMaskedLM.from_pretrained("m3hrdadfi/albert-fa-base-v2")

ALBERT-Persian

ALBERT-Persian: A Lite BERT for Self-supervised Learning of Language Representations for the Persian Language

Introduction

ALBERT-Persian was trained on a massive amount of public corpora (Persian Wikidumps, MirasText) and six other manually crawled text collections from various types of websites (BigBang Page: scientific; Chetor: lifestyle; Eligasht: itinerary; Digikala: digital magazine; Ted Talks: general conversational; Books: novels, storybooks, and short stories from old to contemporary eras).

Intended uses & limitations

You can use the raw model for either masked language modeling or sentence order prediction (the two objectives ALBERT is pretrained with), but it is mostly intended to be fine-tuned on a downstream task. See the model hub for fine-tuned versions on a task that interests you.
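As a minimal sketch of the masked language modeling use, the snippet below queries the raw model through the `fill-mask` pipeline of the transformers library. The helper names `mask_word` and `top_predictions` are hypothetical, the mask token is assumed to be `[MASK]`, and the `top_k` argument assumes a recent transformers release (older releases spelled it differently). The library is imported lazily so that `mask_word` works without it.

```python
MODEL_NAME = "m3hrdadfi/albert-fa-base-v2"


def mask_word(sentence, word, mask_token="[MASK]"):
    """Replace the first occurrence of `word` in `sentence` with the mask token."""
    if word not in sentence:
        raise ValueError(f"{word!r} does not occur in the sentence")
    return sentence.replace(word, mask_token, 1)


def top_predictions(masked_sentence, k=5):
    """Return the k most likely fillers as (token, score) pairs.

    Downloads the model on first use, so it needs network access.
    """
    from transformers import pipeline  # imported lazily on purpose

    fill_mask = pipeline("fill-mask", model=MODEL_NAME, tokenizer=MODEL_NAME)
    results = fill_mask(masked_sentence, top_k=k)
    return [(r["token_str"], r["score"]) for r in results]


# Example (requires network access to download the model):
# masked = mask_word("شعار ما هوش مصنوعی برای همه است.", "مصنوعی")
# for token, score in top_predictions(masked):
#     print(f"{token}\t{score:.4f}")
```

The predictions of the raw model are only a sanity check; for real tasks the fine-tuned derivatives are the better starting point.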

How to use

TensorFlow 2.0

from transformers import AutoConfig, AutoTokenizer, TFAutoModel

config = AutoConfig.from_pretrained("m3hrdadfi/albert-fa-base-v2")
tokenizer = AutoTokenizer.from_pretrained("m3hrdadfi/albert-fa-base-v2")
model = TFAutoModel.from_pretrained("m3hrdadfi/albert-fa-base-v2")

text = "ما در هوشواره معتقدیم با انتقال صحیح دانش و آگاهی، همه افراد می‌توانند از ابزارهای هوشمند استفاده کنند. شعار ما هوش مصنوعی برای همه است."
tokenizer.tokenize(text)

>>> ['▁ما', '▁در', '▁هوش', 'واره', '▁معتقد', 'یم', '▁با', '▁انتقال', '▁صحیح', '▁دانش', '▁و', '▁اگاه', 'ی', '،', '▁همه', '▁افراد', '▁می', '▁توانند', '▁از', '▁ابزارهای', '▁هوشمند', '▁استفاده', '▁کنند', '.', '▁شعار', '▁ما', '▁هوش', '▁مصنوعی', '▁برای', '▁همه', '▁است', '.']
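The `▁` prefix in the output marks a piece that starts a new word, following the SentencePiece convention used by ALBERT tokenizers; pieces without it continue the previous word. A minimal sketch of grouping pieces back into words, using a hypothetical helper `pieces_to_words` and the first few pieces from the output above:

```python
WORD_START = "\u2581"  # the "▁" marker that opens a new word


def pieces_to_words(pieces):
    """Merge subword pieces into whole words based on the word-start marker."""
    words = []
    for piece in pieces:
        if piece.startswith(WORD_START) or not words:
            # A marked piece (or the very first piece) opens a new word.
            words.append(piece[len(WORD_START):] if piece.startswith(WORD_START) else piece)
        else:
            # An unmarked piece is glued onto the word before it.
            words[-1] += piece
    return words


pieces = ['▁ما', '▁در', '▁هوش', 'واره', '▁معتقد', 'یم']
print(pieces_to_words(pieces))
```

Here `▁هوش` and `واره` merge back into the single word هوشواره. Punctuation pieces carry no marker, so this simple helper attaches them to the preceding word.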

PyTorch

from transformers import AutoConfig, AutoTokenizer, AutoModel

config = AutoConfig.from_pretrained("m3hrdadfi/albert-fa-base-v2")
tokenizer = AutoTokenizer.from_pretrained("m3hrdadfi/albert-fa-base-v2")
model = AutoModel.from_pretrained("m3hrdadfi/albert-fa-base-v2")
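Once loaded, the model can serve as a feature extractor. The following is a sketch rather than part of the original card: the helper names `mean_pool` and `embed` are hypothetical, and mean pooling over non-padding tokens is one common choice, not a method the authors prescribe.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the vectors of the tokens whose attention mask is 1."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def embed(text, model_name="m3hrdadfi/albert-fa-base-v2"):
    """Return one sentence vector for `text` (downloads the model on first use)."""
    import torch  # imported lazily; needs torch and transformers installed
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()  # disable dropout for deterministic features

    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # outputs[0] holds the last hidden states: (batch, tokens, hidden_size)
    hidden = outputs[0][0].tolist()
    mask = inputs["attention_mask"][0].tolist()
    return mean_pool(hidden, mask)
```

For a single unpadded sentence the mask is all ones, so the pooling reduces to a plain average; the mask only matters when padded batches are pooled.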

Training

ALBERT-Persian is the first attempt at an ALBERT model for the Persian language. The model was trained following Google's ALBERT BASE Version 2.0 on text covering various writing styles and numerous subjects (e.g., scientific, novels, news), totaling more than 3.9M documents, 73M sentences, and 1.3B words, the same way we did for ParsBERT.

Goals

The training objective metrics after 140K steps are as follows.

***** Eval results *****
global_step = 140000
loss = 2.0080082
masked_lm_accuracy = 0.6141017
masked_lm_loss = 1.9963315
sentence_order_accuracy = 0.985
sentence_order_loss = 0.06908702
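
To put the masked-LM loss in more familiar terms, it can be converted to a perplexity. This sketch assumes the reported `masked_lm_loss` is a mean cross-entropy in nats (natural logarithm), which the log itself does not state; `parse_eval_log` is a hypothetical helper.

```python
import math

# Metrics copied from the eval results above.
EVAL_LOG = """\
global_step = 140000
loss = 2.0080082
masked_lm_accuracy = 0.6141017
masked_lm_loss = 1.9963315
sentence_order_accuracy = 0.985
sentence_order_loss = 0.06908702
"""


def parse_eval_log(text):
    """Parse `key = value` lines into a dict of floats."""
    metrics = {}
    for line in text.splitlines():
        if "=" not in line:
            continue  # skip banners and blank lines
        key, value = line.split("=", 1)
        metrics[key.strip()] = float(value)
    return metrics


metrics = parse_eval_log(EVAL_LOG)
# Perplexity is the exponential of the mean cross-entropy (assumed in nats).
perplexity = math.exp(metrics["masked_lm_loss"])
print(f"masked LM perplexity: {perplexity:.2f}")
```

Under that assumption the masked-LM perplexity comes out at about 7.36.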

Derivative models

Base Config

Albert Model

Albert Sentiment Analysis

Albert Text Classification

Albert NER

Eval results

The following tables summarize the F1 scores obtained by ALBERT-Persian as compared to other models and architectures.

Sentiment Analysis (SA) Task

| Dataset | ALBERT-fa-base-v2 | ParsBERT-v1 | mBERT | DeepSentiPers |
|---|---|---|---|---|
| Digikala User Comments | 81.12 | 81.74 | 80.74 | - |
| SnappFood User Comments | 85.79 | 88.12 | 87.87 | - |
| SentiPers (Multi Class) | 66.12 | 71.11 | - | 69.33 |
| SentiPers (Binary Class) | 91.09 | 92.13 | - | 91.98 |

Text Classification (TC) Task

| Dataset | ALBERT-fa-base-v2 | ParsBERT-v1 | mBERT |
|---|---|---|---|
| Digikala Magazine | 92.33 | 93.59 | 90.72 |
| Persian News | 97.01 | 97.19 | 95.79 |

Named Entity Recognition (NER) Task

| Dataset | ALBERT-fa-base-v2 | ParsBERT-v1 | mBERT | MorphoBERT | Beheshti-NER | LSTM-CRF | Rule-Based CRF | BiLSTM-CRF |
|---|---|---|---|---|---|---|---|---|
| PEYMA | 88.99 | 93.10 | 86.64 | - | 90.59 | - | 84.00 | - |
| ARMAN | 97.43 | 98.79 | 95.89 | 89.9 | 84.03 | 86.55 | - | 77.45 |
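
The three tables can be read side by side by taking the F1 gap between ALBERT-fa-base-v2 and ParsBERT-v1 on each dataset. The snippet is a small illustrative sketch; the scores are copied from the tables above and the variable names are hypothetical.

```python
# (dataset, ALBERT-fa-base-v2 F1, ParsBERT-v1 F1), copied from the tables above
SCORES = [
    ("Digikala User Comments", 81.12, 81.74),
    ("SnappFood User Comments", 85.79, 88.12),
    ("SentiPers (Multi Class)", 66.12, 71.11),
    ("SentiPers (Binary Class)", 91.09, 92.13),
    ("Digikala Magazine", 92.33, 93.59),
    ("Persian News", 97.01, 97.19),
    ("PEYMA", 88.99, 93.10),
    ("ARMAN", 97.43, 98.79),
]

# Negative gap: ALBERT-fa-base-v2 scores below ParsBERT-v1 on that dataset.
gaps = {name: round(albert - parsbert, 2) for name, albert, parsbert in SCORES}

# Print the datasets from the largest shortfall to the smallest.
for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name:<26}{gap:+.2f}")
```

On every dataset listed, ALBERT-fa-base-v2 trails ParsBERT-v1, by between 0.18 (Persian News) and 4.99 (SentiPers multi-class) F1 points.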

BibTeX entry and citation info

Please cite this work in publications as follows:

@misc{ALBERT-Persian,
  author = {Mehrdad Farahani},
  title = {ALBERT-Persian: A Lite BERT for Self-supervised Learning of Language Representations for the Persian Language},
  year = {2020},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/m3hrdadfi/albert-persian}},
}

@article{ParsBERT,
    title={ParsBERT: Transformer-based Model for Persian Language Understanding},
    author={Mehrdad Farahani and Mohammad Gharachorloo and Marzieh Farahani and Mohammad Manthouri},
    journal={ArXiv},
    year={2020},
    volume={abs/2005.12515}
}

Questions?

Post a GitHub issue on the ALBERT-Persian repo.