KazBERT: A Custom BERT Model for the Kazakh Language 🇰🇿
This repository contains KazBERT, a BERT-based model fine-tuned for Kazakh language tasks. The model is trained using Masked Language Modeling (MLM) on a Kazakh text corpus.
Model Details
- Architecture: BERT (based on `bert-base-uncased`)
- Tokenizer: WordPiece tokenizer trained on Kazakh texts
- Training Data: Custom Kazakh corpus
- Training Method: Masked Language Modeling (MLM)
For full details, see the paper.
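To illustrate what a WordPiece tokenizer does, here is a minimal, stdlib-only sketch of greedy longest-match-first subword segmentation. The toy vocabulary below is hypothetical and far smaller than the model's real `vocab.txt`; it only shows the mechanism:

```python
# Minimal greedy longest-match-first WordPiece segmentation (toy sketch).
# The vocabulary here is hypothetical; KazBERT ships its own vocab.txt.
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # continuation pieces carry the ## prefix
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:
            return [unk]  # no subword matched: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

toy_vocab = {"тіл", "##ін", "қазақ", "##дер"}
print(wordpiece_tokenize("тілін", toy_vocab))  # ['тіл', '##ін']
```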
Files in this Repository
- `config.json` – Model configuration
- `model.safetensors` – Model weights in safetensors format
- `tokenizer.json` – Tokenizer data
- `tokenizer_config.json` – Tokenizer configuration
- `special_tokens_map.json` – Special token mappings
- `vocab.txt` – Vocabulary file
Training Details
- Number of epochs: 20
- Batch size: 16
- Learning rate: default
- Weight decay: 0.01
- Mixed Precision Training: Enabled (FP16)
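The MLM objective behind these settings corrupts a random 15% of input tokens before each step: of the selected positions, the standard BERT recipe replaces 80% with `[MASK]`, 10% with a random token, and leaves 10% unchanged. A stdlib-only sketch of that corruption step (function name and arguments are illustrative, not KazBERT's actual training code):

```python
import random

def mask_tokens(token_ids, vocab_size, mask_id, mlm_prob=0.15, rng=random):
    """Standard BERT MLM corruption: 80% [MASK], 10% random, 10% unchanged."""
    inputs = list(token_ids)
    labels = [-100] * len(inputs)  # -100 = position ignored by the loss
    for i in range(len(inputs)):
        if rng.random() < mlm_prob:
            labels[i] = inputs[i]  # the model must predict the original token
            r = rng.random()
            if r < 0.8:
                inputs[i] = mask_id          # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = rng.randrange(vocab_size)  # 10%: random token
            # else: 10%: leave the token unchanged
    return inputs, labels
```

In practice this step is handled for you by `DataCollatorForLanguageModeling` in 🤗 Transformers.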
Usage
To use the model with the 🤗 Transformers library:
```python
from transformers import BertForMaskedLM, BertTokenizerFast

model_name = "Eraly-ml/KazBERT"
tokenizer = BertTokenizerFast.from_pretrained(model_name)
model = BertForMaskedLM.from_pretrained(model_name)
```
Example: Masked Token Prediction
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("fill-mask", model="Eraly-ml/KazBERT")
output = pipe("KazBERT қазақ тілін [MASK] түсінеді.")
```
Output:
```python
[{'score': 0.19899696111679077,
  'token': 25721,
  'token_str': 'жетік',
  'sequence': 'kazbert қазақ тілін жетік түсінеді.'},
 {'score': 0.0383591502904892,
  'token': 1722,
  'token_str': 'де',
  'sequence': 'kazbert қазақ тілін де түсінеді.'},
 {'score': 0.0325467586517334,
  'token': 4743,
  'token_str': 'терең',
  'sequence': 'kazbert қазақ тілін терең түсінеді.'},
 {'score': 0.029968073591589928,
  'token': 5533,
  'token_str': 'ерте',
  'sequence': 'kazbert қазақ тілін ерте түсінеді.'},
 {'score': 0.0264473594725132,
  'token': 17340,
  'token_str': 'жете',
  'sequence': 'kazbert қазақ тілін жете түсінеді.'}]
```
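Each `score` is the softmax probability of that candidate token over the model's full vocabulary at the `[MASK]` position. A minimal sketch of the softmax computation itself (the three logits below are made up for illustration, not taken from KazBERT):

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate tokens at the [MASK] position.
probs = softmax([2.0, 0.5, 0.1])
print([round(p, 3) for p in probs])  # → [0.728, 0.163, 0.109]
```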
Citation
If you use this model, please cite:
```bibtex
@misc{kazbert2025,
  title={KazBERT: A BERT-based Language Model for Kazakh},
  author={Gainulla Eraly},
  year={2025},
  publisher={Hugging Face Model Hub}
}
```
License
This model is released under the Apache 2.0 License.