KazBERT: A Custom BERT Model for the Kazakh Language 🇰🇿

This repository contains KazBERT, a BERT-based model fine-tuned for Kazakh language tasks. It was trained with Masked Language Modeling (MLM) on a Kazakh text corpus.

Model Details

  • Architecture: BERT (based on bert-base-uncased)
  • Tokenizer: WordPiece tokenizer trained on Kazakh texts
  • Training Data: Custom Kazakh corpus
  • Training Method: Masked Language Modeling (MLM)
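
As a quick sanity check of the WordPiece tokenizer mentioned above, the minimal sketch below loads it from the Hub and splits a Kazakh sentence into subwords (the example sentence is arbitrary):

from transformers import BertTokenizerFast

# Load the Kazakh WordPiece tokenizer from the Hub
tokenizer = BertTokenizerFast.from_pretrained("Eraly-ml/KazBERT")

# Split a Kazakh sentence into WordPiece subwords
print(tokenizer.tokenize("Қазақ тілі қазақ халқының ана тілі."))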

For full details, see the paper.

Files in this Repository

  • config.json – Model configuration
  • model.safetensors – Model weights in safetensors format
  • tokenizer.json – Tokenizer data
  • tokenizer_config.json – Tokenizer configuration
  • special_tokens_map.json – Special token mappings
  • vocab.txt – Vocabulary file

Training Details

  • Number of epochs: 20
  • Batch size: 16
  • Learning rate: framework default
  • Weight decay: 0.01
  • Mixed Precision Training: Enabled (FP16)
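
The exact training script is not included in this repository; the sketch below reproduces the listed settings with the standard 🤗 Trainer MLM recipe. The corpus file name kazakh_corpus.txt is a placeholder.

from datasets import load_dataset
from transformers import (
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = BertTokenizerFast.from_pretrained("Eraly-ml/KazBERT")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
# The custom tokenizer has its own vocabulary, so resize the embeddings
model.resize_token_embeddings(len(tokenizer))

# Placeholder corpus: one Kazakh sentence per line
dataset = load_dataset("text", data_files={"train": "kazakh_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Randomly mask tokens for MLM (15% is the standard BERT masking rate)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="kazbert-mlm",
    num_train_epochs=20,             # 20 epochs, as listed above
    per_device_train_batch_size=16,  # batch size 16
    weight_decay=0.01,
    fp16=True,                       # mixed precision (FP16)
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()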

Usage

To use the model with the 🤗 Transformers library:

from transformers import BertForMaskedLM, BertTokenizerFast

model_name = "Eraly-ml/KazBERT"
tokenizer = BertTokenizerFast.from_pretrained(model_name)
model = BertForMaskedLM.from_pretrained(model_name)

Example: Masked Token Prediction

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("fill-mask", model="Eraly-ml/KazBERT")

output = pipe('KazBERT қазақ тілін [MASK] түсінеді.')
print(output)

Output:
[{'score': 0.19899696111679077,
  'token': 25721,
  'token_str': 'жетік',
  'sequence': 'kazbert қазақ тілін жетік түсінеді.'},
 {'score': 0.0383591502904892,
  'token': 1722,
  'token_str': 'де',
  'sequence': 'kazbert қазақ тілін де түсінеді.'},
 {'score': 0.0325467586517334,
  'token': 4743,
  'token_str': 'терең',
  'sequence': 'kazbert қазақ тілін терең түсінеді.'},
 {'score': 0.029968073591589928,
  'token': 5533,
  'token_str': 'ерте',
  'sequence': 'kazbert қазақ тілін ерте түсінеді.'},
 {'score': 0.0264473594725132,
  'token': 17340,
  'token_str': 'жете',
  'sequence': 'kazbert қазақ тілін жете түсінеді.'}]
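
For lower-level control, the same prediction can be made without the pipeline by reading the logits at the [MASK] position. A minimal sketch, reusing the tokenizer and model loaded in the Usage section:

import torch

text = "KazBERT қазақ тілін [MASK] түсінеді."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the [MASK] token and take the highest-scoring vocabulary id
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
top_id = logits[0, mask_pos].argmax(dim=-1)
print(tokenizer.decode(top_id))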

Citation

If you use this model, please cite:

@misc{kazbert2025,
  title={KazBERT: A BERT-based Language Model for Kazakh},
  author={Gainulla Eraly},
  year={2025},
  publisher={Hugging Face Model Hub}
}

License

This model is released under the Apache 2.0 License.
