kk-bert-small-morph-bpe

BERT-Small (~29.55M parameters) pretrained from scratch on the Kazakh corpus_clean dataset with the morph-bpe tokenizer.

Part of a comparative study of subword tokenizers (BPE, Unigram, morphology-aware) for Kazakh small language models.

Tokenizer

Abzalbek89/kk-tokenizer-morph-hf-bpe-32k: 32K-token vocabulary, trained on Kazakh news and literature.

Architecture

BERT-Small (Turc et al. 2019):

  • 4 hidden layers
  • 512 hidden size
  • 8 attention heads
  • 2048 intermediate size
  • 512 maximum position embeddings (pretraining sequence length 256)
  • ~29.55M parameters
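The parameter count can be sanity-checked from the dimensions listed above. The sketch below is a back-of-the-envelope calculation, not the author's code. It assumes a vocabulary of exactly 32,000 tokens, two token-type embeddings, a standard BERT masked-LM head with output weights tied to the input embeddings, and no pooler. The function name `bert_mlm_param_count` is made up for illustration. The number of attention heads does not change the total, since the heads only partition the hidden dimension.

```python
def bert_mlm_param_count(vocab_size=32_000, hidden=512, layers=4,
                         intermediate=2048, max_positions=512, type_vocab=2):
    """Count parameters of a BERT masked-LM with tied input/output embeddings."""
    layer_norm = 2 * hidden  # LayerNorm weight + bias

    def linear(n_in, n_out):
        # Dense layer: weight matrix plus bias vector.
        return n_in * n_out + n_out

    # Word, position and token-type embeddings, followed by one LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Q, K, V and output projections, followed by one LayerNorm.
    attention = 4 * linear(hidden, hidden) + layer_norm
    # Two feed-forward projections, followed by one LayerNorm.
    feed_forward = (linear(hidden, intermediate)
                    + linear(intermediate, hidden) + layer_norm)
    encoder = layers * (attention + feed_forward)
    # MLM head: dense transform + LayerNorm + output bias (decoder weights are tied).
    mlm_head = linear(hidden, hidden) + layer_norm + vocab_size
    return embeddings + encoder + mlm_head


total = bert_mlm_param_count()
print(f"{total:,} parameters (~{total / 1e6:.2f}M)")
```

Under these assumptions the total is 29,553,408, which agrees with the ~29.55M figure. More than half of the parameters (about 16.4M) sit in the word-embedding matrix.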

Pretraining

Steps             10,000
Batch size        32
Sequence length   256
Train blocks      33,392 (≈8.5M tokens)
Optimizer         AdamW, lr 1e-4, warmup 10%, weight decay 0.01
Precision         fp16
Hardware          RTX A4000 (16 GB)
Wallclock         17.1 min
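The figures above imply how many passes over the data the run made. The following is a rough sketch of that arithmetic. It assumes a single GPU with no gradient accumulation (so one step consumes exactly one batch of 32 blocks) and that the 10% warmup is measured against the total step count. Neither assumption is stated in this card.

```python
train_blocks = 33_392   # blocks of 256 tokens each
seq_len = 256
steps = 10_000
batch_size = 32         # assumed: per-step batch, no gradient accumulation
warmup_ratio = 0.10     # assumed: fraction of total steps

tokens_per_epoch = train_blocks * seq_len
sequences_seen = steps * batch_size
epochs = sequences_seen / train_blocks
tokens_seen = sequences_seen * seq_len
warmup_steps = round(steps * warmup_ratio)

print(f"tokens per epoch: {tokens_per_epoch:,}")
print(f"approx. epochs:   {epochs:.2f}")
print(f"tokens processed: {tokens_seen:,}")
print(f"warmup steps:     {warmup_steps:,}")
```

With these assumptions, one epoch is 8,548,352 tokens (the ≈8.5M quoted above) and the run covers roughly 9.6 epochs.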

Evaluation

metric            value
eval_loss         6.4837
eval_perplexity   654.403
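The two metrics are related: perplexity is the exponential of the mean cross-entropy loss. The short check below assumes that eval_loss is the mean masked-token cross-entropy in nats, which is the usual convention but is not stated explicitly in this card.

```python
import math

eval_loss = 6.4837                 # reported value, rounded to 4 decimals
perplexity = math.exp(eval_loss)   # perplexity = exp(mean cross-entropy)

print(f"perplexity from rounded loss: {perplexity:.2f}")
```

The result is close to the reported 654.403. The small difference comes from the loss being rounded to four decimal places.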

Usage

from transformers import AutoModelForMaskedLM, AutoTokenizer

# Load the tokenizer and the masked-LM checkpoint from the Hugging Face Hub.
tok = AutoTokenizer.from_pretrained("Abzalbek89/kk-bert-small-morph-bpe")
model = AutoModelForMaskedLM.from_pretrained("Abzalbek89/kk-bert-small-morph-bpe")
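For a quick qualitative check, the checkpoint can also be queried through the `fill-mask` pipeline. This is a minimal sketch, not an official example. It assumes the repository loads with the standard pipeline and that the tokenizer defines a mask token. The helper name `fill_mask_top_k` and the `{mask}` placeholder are illustrative conventions introduced here. Calling the function downloads the checkpoint.

```python
DEFAULT_MODEL_ID = "Abzalbek89/kk-bert-small-morph-bpe"


def fill_mask_top_k(template, k=5, model_id=DEFAULT_MODEL_ID):
    """Return the top-k (token, score) predictions for the single {mask} slot."""
    # Imported lazily so that defining the helper does not require transformers.
    from transformers import pipeline

    fill = pipeline("fill-mask", model=model_id)
    # Use the tokenizer's own mask token rather than hard-coding "[MASK]".
    text = template.format(mask=fill.tokenizer.mask_token)
    predictions = fill(text, top_k=k)
    return [(p["token_str"], p["score"]) for p in predictions]
```

An illustrative call would be `fill_mask_top_k("Қазақстанның астанасы {mask} қаласы.")`. Given the evaluation perplexity reported above, the predictions should be read as a smoke test and not as a quality benchmark.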

Citation

If you use this model, please cite the upcoming paper:

Tokenizer Optimization for Kazakh Small Language Models: A Comparative Study of BPE, Unigram, and Morphological Segmentation. (in prep.)
