YasalmaBERT base model

The YasalmaBERT-base is an encoder-only Transformer text model with 110 million parameters. It is pretrained on the Tatar language (Cyrillic script) using a masked language modeling (MLM) objective. This model is case-sensitive: it makes a difference between tatar and Tatar.

For full details of this model please read our paper (coming soon!).

Model variations

This model is part of the YasalmaBERT family of models trained with different numbers of parameters, which will be continuously expanded in the future.

| Model | Number of parameters | Language | Script |
|---|---|---|---|
| yasalma-bert-base | 110M | Tatar | Cyrillic |

Intended uses & limitations

This model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification or question answering.
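
For example, a sequence-classification fine-tune can be initialized from this checkpoint with the standard Auto* classes. The snippet below is only a sketch: it assumes the checkpoint is compatible with AutoModelForSequenceClassification, and num_labels=2 is an arbitrary placeholder for your task.

>>> from transformers import AutoTokenizer, AutoModelForSequenceClassification
>>> tokenizer = AutoTokenizer.from_pretrained('yasalma/yasalma-bert-base')
>>> model = AutoModelForSequenceClassification.from_pretrained('yasalma/yasalma-bert-base', num_labels=2)
>>> # the classification head is freshly initialized and must be fine-tuned on labeled data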

How to use

You can use this model directly with a pipeline for masked language modeling:

>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='yasalma/yasalma-bert-base')
>>> unmasker("Җәйге матур көнне без гаилә белән урманга җиләк җыярга бардык һәм бик күп <mask> таптык.")
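
You can also use the model to extract contextual features from text. This is a sketch that assumes the checkpoint loads with the generic AutoTokenizer and AutoModel classes:

>>> from transformers import AutoTokenizer, AutoModel
>>> tokenizer = AutoTokenizer.from_pretrained('yasalma/yasalma-bert-base')
>>> model = AutoModel.from_pretrained('yasalma/yasalma-bert-base')
>>> text = "Җәйге матур көнне без гаилә белән урманга бардык."  # "On a beautiful summer day we went to the forest with our family."
>>> encoded_input = tokenizer(text, return_tensors='pt')
>>> output = model(**encoded_input)
>>> output.last_hidden_state.shape  # (batch_size, sequence_length, hidden_size)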

Training data

YasalmaBERT is pretrained using a standard masked language modeling (MLM) objective: the model is given a sequence of text with some tokens hidden, and it has to predict these masked tokens. YasalmaBERT is trained on the Tatar Corpus, which contains roughly 4,000 preprocessed books and 1.8 million curated text documents scraped from the internet and Telegram blogs (equivalent to 5 billion tokens).
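
For illustration, the dynamic token masking behind the MLM objective can be reproduced with the standard data collator from transformers. This is only a sketch: the 15% masking probability is the usual BERT default and is assumed here, not taken from the actual training configuration.

>>> from transformers import AutoTokenizer, DataCollatorForLanguageModeling
>>> tokenizer = AutoTokenizer.from_pretrained('yasalma/yasalma-bert-base')
>>> collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
>>> batch = collator([tokenizer("Казан — Татарстанның башкаласы.")])  # "Kazan is the capital of Tatarstan."
>>> batch['labels']  # original token ids at masked positions, -100 everywhere else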

Training procedure

Preprocessing

The texts are tokenized using a byte-level version of Byte-Pair Encoding (BPE) with a vocabulary size of 30,528 to make full use of rare words. The inputs of the model take pieces of 512 contiguous tokens that may span over documents. We also added a number of regular expressions to avoid misrepresentation of symbols that are often used incorrectly in practice.
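
To see how this tokenizer splits Tatar text, you can load it and inspect a sample (assuming the tokenizer files are published alongside the model):

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained('yasalma/yasalma-bert-base')
>>> tokenizer.vocab_size  # 30,528, as described above
>>> tokenizer.tokenize("Җәйге матур көнне")  # "a beautiful summer day"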

Pretraining

The model was trained for one million steps with a batch size of 512. The sequence length was limited to 512 tokens during the entire pretraining stage. The optimizer used is Adam with a learning rate of 5e-4, β₁ = 0.9 and β₂ = 0.98, a weight decay of 1e-5, and a learning rate schedule that warms up to the full LR over the first 6% of the training duration and then decays linearly to 0.02× the full LR by the end of training.
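
The warmup-then-decay schedule described above can be written down as a simple LambdaLR. The sketch below is not the original training script: AdamW stands in for the Adam-with-weight-decay setup, and the placeholder module must be replaced by the actual model.

import torch

model = torch.nn.Linear(8, 8)            # placeholder module; substitute the real model
total_steps = 1_000_000                  # one million steps, as stated above
warmup_steps = int(0.06 * total_steps)   # 6% of the training duration
peak_lr, final_ratio = 5e-4, 0.02

optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr,
                              betas=(0.9, 0.98), weight_decay=1e-5)

def lr_lambda(step):
    # linear warmup to the full LR, then linear decay to 0.02x the full LR
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 1.0 - (1.0 - final_ratio) * progress

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)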
