BERT-Small (~29.55M parameters) pretrained from scratch on the Kazakh corpus Abzalbek89/corpus_clean
using the morph-bpe tokenizer.
Part of a comparative study of subword tokenizers (BPE, Unigram, morphology-aware) for Kazakh small language models.
Tokenizer: Abzalbek89/kk-tokenizer-morph-hf-bpe-32k (32K vocabulary, trained on Kazakh news and literature).
BERT-Small (Turc et al. 2019):
| setting | value |
|---|---|
| Steps | 10,000 |
| Batch size | 32 |
| Sequence length | 256 |
| Train blocks | 33,392 (≈8.5M tokens) |
| Optimizer | AdamW, lr 1e-4, warmup 10%, weight decay 0.01 |
| Precision | fp16 |
| Hardware | RTX A4000 (16 GB) |
| Wallclock | 17.1 min |
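
The token count and the approximate number of passes over the data follow from the training settings above. A quick arithmetic check, assuming every training block is a full 256-token sequence and every step consumes one batch of 32 blocks with no gradient accumulation (neither assumption is stated in this card):

```python
# Values copied from the training settings above.
steps = 10_000
batch_size = 32
seq_len = 256
train_blocks = 33_392

# Assumption: every block is a full 256-token sequence.
train_tokens = train_blocks * seq_len

# Assumption: one optimizer step consumes one batch (no gradient accumulation).
blocks_seen = steps * batch_size
epochs = blocks_seen / train_blocks

print(f"training tokens: {train_tokens:,}")
print(f"blocks seen:     {blocks_seen:,}")
print(f"approx. epochs:  {epochs:.2f}")
```

Under these assumptions the run makes several passes over a small corpus, which is worth keeping in mind when reading the evaluation numbers.
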
| metric | value |
|---|---|
| eval_loss | 6.4837 |
| eval_perplexity | 654.403 |
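
The two metrics are related: perplexity is conventionally the exponential of the mean cross-entropy loss. A small consistency check, assuming eval_loss is reported in nats (the card does not say so explicitly):

```python
import math

eval_loss = 6.4837       # from the metrics table above
reported_ppl = 654.403   # from the metrics table above

# Assumption: eval_loss is a mean masked-token cross-entropy in nats,
# so perplexity = exp(loss).
ppl = math.exp(eval_loss)

print(f"exp(eval_loss) = {ppl:.2f}")
print(f"difference from reported value = {abs(ppl - reported_ppl):.3f}")
```

The small remaining difference is consistent with the loss having been rounded to four decimal places.
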
from transformers import AutoTokenizer, AutoModelForMaskedLM
tok = AutoTokenizer.from_pretrained("Abzalbek89/kk-bert-small-morph-bpe")
model = AutoModelForMaskedLM.from_pretrained("Abzalbek89/kk-bert-small-morph-bpe")
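
To query the checkpoint on a masked token, one option is the `fill-mask` pipeline from `transformers`. The sketch below is illustrative only: the helper names `format_predictions` and `predict_masked` are hypothetical and not part of this repository, and given the reported perplexity the predictions may be of limited quality.

```python
def format_predictions(preds):
    """Turn fill-mask pipeline output into 'token<TAB>score' lines."""
    return [f"{p['token_str']}\t{p['score']:.4f}" for p in preds]


def predict_masked(template, top_k=5,
                   model_id="Abzalbek89/kk-bert-small-morph-bpe"):
    """Fill the single {mask} placeholder in `template`.

    Downloads the model and tokenizer from the Hub on first use.
    """
    # Imported lazily so format_predictions works without transformers installed.
    from transformers import pipeline

    fill = pipeline("fill-mask", model=model_id)
    # Use the tokenizer's own mask token rather than hard-coding "[MASK]".
    text = template.format(mask=fill.tokenizer.mask_token)
    return format_predictions(fill(text, top_k=top_k))
```

For example, `predict_masked("Қазақстанның астанасы {mask} қаласы.")` would return up to five candidate tokens with their scores; the sentence is an arbitrary illustration, not taken from the training data.
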
If you use this model, please cite the upcoming paper:
Tokenizer Optimization for Kazakh Small Language Models: A Comparative Study of BPE, Unigram, and Morphological Segmentation. (in prep.)
Abzalbek89/kk-tokenizer-fertility-baseline
kk-bert-small-bpe, kk-bert-small-unigram, kk-bert-small-morph-bpe