KazBERT: A Custom BERT Model for the Kazakh Language 🇰🇿

This repository contains KazBERT, a BERT-based model fine-tuned for Kazakh language tasks. It was trained with Masked Language Modeling (MLM) on a Kazakh text corpus.

Model Details

  • Architecture: BERT (based on bert-base-uncased)
  • Tokenizer: WordPiece tokenizer trained on Kazakh texts
  • Training Data: Custom Kazakh corpus
  • Training Method: Masked Language Modeling (MLM)
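
As a quick sanity check of the WordPiece tokenizer mentioned above, the minimal sketch below loads it from the Hub and splits a Kazakh sentence into subwords (the example sentence is arbitrary):

from transformers import BertTokenizerFast

# Load the Kazakh WordPiece tokenizer from the Hub
tokenizer = BertTokenizerFast.from_pretrained("Eraly-ml/KazBERT")

# Split a Kazakh sentence into WordPiece subwords
print(tokenizer.tokenize("Қазақ тілі қазақ халқының ана тілі."))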

For full details, see the paper.

Files in this Repository

  • config.json – Model configuration
  • model.safetensors – Model weights in safetensors format
  • tokenizer.json – Tokenizer data
  • tokenizer_config.json – Tokenizer configuration
  • special_tokens_map.json – Special token mappings
  • vocab.txt – Vocabulary file

Training Details

  • Number of epochs: 20
  • Batch size: 16
  • Learning rate: framework default
  • Weight decay: 0.01
  • Mixed Precision Training: Enabled (FP16)
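
The exact training script is not included in this repository; the sketch below reproduces the listed settings with the standard 🤗 Trainer MLM recipe. The corpus file name kazakh_corpus.txt is a placeholder.

from datasets import load_dataset
from transformers import (
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = BertTokenizerFast.from_pretrained("Eraly-ml/KazBERT")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
# The custom tokenizer has its own vocabulary, so resize the embeddings
model.resize_token_embeddings(len(tokenizer))

# Placeholder corpus: one Kazakh sentence per line
dataset = load_dataset("text", data_files={"train": "kazakh_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Randomly mask tokens for MLM (15% is the standard BERT masking rate)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="kazbert-mlm",
    num_train_epochs=20,             # 20 epochs, as listed above
    per_device_train_batch_size=16,  # batch size 16
    weight_decay=0.01,
    fp16=True,                       # mixed precision (FP16)
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()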

Usage

To use the model with the 🤗 Transformers library:

from transformers import BertForMaskedLM, BertTokenizerFast

model_name = "Eraly-ml/KazBERT"
tokenizer = BertTokenizerFast.from_pretrained(model_name)
model = BertForMaskedLM.from_pretrained(model_name)

Example: Masked Token Prediction

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("fill-mask", model="Eraly-ml/KazBERT")

output = pipe('KazBERT қазақ тілін [MASK] түсінеді.')
print(output)

Output:
[{'score': 0.19899696111679077,
  'token': 25721,
  'token_str': 'жетік',
  'sequence': 'kazbert қазақ тілін жетік түсінеді.'},
 {'score': 0.0383591502904892,
  'token': 1722,
  'token_str': 'де',
  'sequence': 'kazbert қазақ тілін де түсінеді.'},
 {'score': 0.0325467586517334,
  'token': 4743,
  'token_str': 'терең',
  'sequence': 'kazbert қазақ тілін терең түсінеді.'},
 {'score': 0.029968073591589928,
  'token': 5533,
  'token_str': 'ерте',
  'sequence': 'kazbert қазақ тілін ерте түсінеді.'},
 {'score': 0.0264473594725132,
  'token': 17340,
  'token_str': 'жете',
  'sequence': 'kazbert қазақ тілін жете түсінеді.'}]
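
For lower-level control, the same prediction can be made without the pipeline by reading the logits at the [MASK] position. A minimal sketch, reusing the tokenizer and model loaded in the Usage section:

import torch

text = "KazBERT қазақ тілін [MASK] түсінеді."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the [MASK] token and take the highest-scoring vocabulary id
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
top_id = logits[0, mask_pos].argmax(dim=-1)
print(tokenizer.decode(top_id))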

Citation

If you use this model, please cite:

@misc{kazbert2025,
  title={KazBERT: A BERT-based Language Model for Kazakh},
  author={Gainulla Eraly},
  year={2025},
  publisher={Hugging Face Model Hub}
}

License

This model is released under the Apache 2.0 License.
