Til Core 0.5B

Til Core 0.5B is a 498-million-parameter Kazakh language model trained from scratch on a clean Kazakh corpus using a 256K morpheme-aware BPE tokenizer. It is a Qwen2-style decoder-only transformer built by TilQazyna as a compact, efficient foundation model for the Kazakh language.

Til means "language" in Kazakh. Til Core is the base model on which task-specific Kazakh models (instruct, grammar correction, translation) can be fine-tuned.

Why a 256K morpheme-aware vocabulary?

Kazakh is highly agglutinative: a single root can take a long chain of suffixes. Standard byte-level BPE tends to fragment such words into many sub-tokens, wasting context and parameters. Til Core uses a 256,000-token morpheme-aware BPE tokenizer (stukenov/sozkz-morphbpe-256k-kk-v1) that aligns tokens with morphological boundaries, giving roughly 15–20% better compression on Kazakh text. The trade-off, a heavier embedding table, is absorbed by tying the input and output embeddings and using a deeper-than-usual transformer body.
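
One way to check a compression claim like this on your own text is to compare the average number of tokens per word across tokenizers. The sketch below is an illustration only: `tokens_per_word` and the fixed-width `chunk_tokenizer` stand-in are hypothetical helpers, not part of the Til Core release, and the toy tokenizers do not model real BPE behaviour.

```python
from typing import Callable, Iterable, List


def tokens_per_word(tokenize: Callable[[str], List[str]], texts: Iterable[str]) -> float:
    """Average number of tokens per whitespace-separated word (lower means better compression)."""
    n_tokens = 0
    n_words = 0
    for text in texts:
        n_tokens += len(tokenize(text))
        n_words += len(text.split())
    return n_tokens / n_words if n_words else 0.0


def chunk_tokenizer(width: int) -> Callable[[str], List[str]]:
    """Hypothetical stand-in tokenizer: cuts every word into fixed-width pieces."""
    def tokenize(text: str) -> List[str]:
        return [word[i:i + width] for word in text.split() for i in range(0, len(word), width)]
    return tokenize


# Long, suffix-heavy Kazakh word forms used purely as sample input.
sample = ["балаларымызға кітаптарды бердік"]
fine = tokens_per_word(chunk_tokenizer(2), sample)    # many short pieces per word
coarse = tokens_per_word(chunk_tokenizer(4), sample)  # fewer, longer pieces per word
print(f"fine-grained: {fine:.2f} tokens/word")
print(f"coarse:       {coarse:.2f} tokens/word")
print(f"relative saving: {1 - coarse / fine:.0%}")
```

To measure real tokenizers instead of the toy ones, pass the `tokenize` method of a loaded Hugging Face tokenizer (for example one loaded from stukenov/sozkz-morphbpe-256k-kk-v1) as the first argument and use a representative sample of Kazakh text.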

Model details

| Property | Value |
|---|---|
| Architecture | Qwen2 (decoder-only, SwiGLU, RoPE, GQA) |
| Parameters | 497.8M (embedding ≈ 229M, transformer ≈ 268M) |
| Vocabulary | 256,000 (morpheme-aware BPE) |
| Hidden size | 896 |
| Layers | 18 |
| Attention heads | 14 (GQA, 2 KV heads) |
| Intermediate size | 4864 |
| Context length | 32,768 (rope_theta = 1e6) |
| Tied embeddings | yes |
| Precision | bf16 |
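
The parameter split can be reproduced from the figures listed above. The following is a back-of-the-envelope sketch, not an official accounting: it assumes the standard Qwen2 layout (bias terms on the Q/K/V projections only, bias-free SwiGLU MLP, two RMSNorm weights per layer plus a final norm) and a head dimension of hidden size divided by attention heads.

```python
# Values taken from the model details above.
vocab, hidden, layers = 256_000, 896, 18
heads, kv_heads, intermediate = 14, 2, 4864
head_dim = hidden // heads  # assumption: head_dim = hidden_size / attention heads

# Tied embeddings: the input table doubles as the output projection, so count it once.
embedding = vocab * hidden

# Attention: Q and O are hidden x hidden; K and V project to kv_heads * head_dim (GQA).
# Assumption: Q/K/V carry bias vectors and O does not, as in the Qwen2 reference layout.
kv_dim = kv_heads * head_dim
attention = (hidden * hidden + hidden) + 2 * (hidden * kv_dim + kv_dim) + hidden * hidden

# SwiGLU MLP: gate, up and down projections, no biases.
mlp = 3 * hidden * intermediate

# Two RMSNorm weight vectors per layer, plus one final norm after the last layer.
per_layer = attention + mlp + 2 * hidden
transformer = layers * per_layer + hidden

total = embedding + transformer
print(f"embedding   {embedding / 1e6:.1f}M")
print(f"transformer {transformer / 1e6:.1f}M")
print(f"total       {total / 1e6:.1f}M")
```

Under these assumptions the total comes to 497,799,808 parameters, which matches the stated 497.8M. Without weight tying, a separate output projection would add another 256,000 × 896 ≈ 229M parameters.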

Training

| Setting | Value |
|---|---|
| Data | stukenov/sozkz-corpus-tokenized-kk-morphbpe256k-v1, pre-tokenized clean Kazakh (~1.44M sequences × 2048 tokens ≈ 2.94B tokens) |
| Tokens seen | ≈ 5.88B (2 epochs) |
| Steps | 11,222 |
| Global batch | 524,288 tokens/step (8 × 8 × grad-accum 4 × 2048) |
| Optimizer | AdamW (default betas), weight decay 0.1, grad clip 1.0 |
| LR schedule | 4e-4 peak, cosine decay, 500 warmup steps |
| Sequence length | 2048 |
| Hardware | 8 × NVIDIA H200 (141 GB), ~3h15m |
| Final eval loss | 2.436 (validation), perplexity ≈ 11.4 |
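
The learning-rate schedule listed above can be sketched as a function of the step index. This is a minimal illustration under stated assumptions: the warmup is taken to be linear from zero, and the cosine phase is taken to decay to a floor of zero by the final step. Neither detail is given in the model card, and `lr_at` is a hypothetical helper name.

```python
import math


def lr_at(step: int, peak_lr: float = 4e-4, warmup_steps: int = 500,
          total_steps: int = 11_222, min_lr: float = 0.0) -> float:
    """Learning rate at a given step: linear warmup, then cosine decay.

    Assumptions (not stated in the model card): linear warmup from 0 and
    decay down to min_lr = 0 at total_steps.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)  # clamp in case step runs past total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for s in (0, 250, 500, 5_861, 11_222):
    print(f"step {s:>6}: lr = {lr_at(s):.2e}")
```

If the actual run used a non-zero floor, passing it as `min_lr` changes only the tail of the curve.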

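Chinchilla-style budget: 498M params trained on ≈5.9B tokens, or about 11.8 tokens per parameter. This counts both epochs over the ≈2.94B-token corpus; the commonly cited Chinchilla heuristic is roughly 20 tokens per parameter.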
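
The training figures above are mutually consistent, which a few lines of arithmetic can confirm. In this sketch, reading the "8 × 8" factor of the global batch as 8 GPUs times a per-GPU micro-batch of 8 is an assumption; the other numbers are taken directly from the training details.

```python
import math

# Global batch in tokens. Assumption: "8 x 8" means 8 GPUs x micro-batch 8 per GPU.
gpus, micro_batch, grad_accum, seq_len = 8, 8, 4, 2048
global_batch = gpus * micro_batch * grad_accum * seq_len

# Total tokens processed over the whole run.
steps = 11_222
tokens_seen = steps * global_batch

# Tokens per parameter, using the stated 497.8M parameter count.
params = 497.8e6
tokens_per_param = tokens_seen / params

# Perplexity is the exponential of the mean cross-entropy loss (in nats).
eval_loss = 2.436
perplexity = math.exp(eval_loss)

print(f"global batch:     {global_batch:,} tokens/step")
print(f"tokens seen:      {tokens_seen / 1e9:.2f}B")
print(f"tokens per param: {tokens_per_param:.1f}")
print(f"perplexity:       {perplexity:.1f}")
```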

Usage

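import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "TilQazyna/Til-Core-0.5B"
tok = AutoTokenizer.from_pretrained(repo)
# Load the weights in bf16. device_map="auto" requires the accelerate package;
# older transformers releases use torch_dtype= instead of dtype=.
model = AutoModelForCausalLM.from_pretrained(repo, dtype=torch.bfloat16, device_map="auto").eval()

# This is a base model: give it the start of a text and let it continue.
prompt = "Абай Құнанбайұлы — қазақтың"
ids = tok(prompt, return_tensors="pt").to(model.device)
# Nucleus sampling with a repetition penalty to reduce loops in the continuation.
out = model.generate(**ids, max_new_tokens=60, do_sample=True,
                     temperature=0.8, top_p=0.9, repetition_penalty=1.2)
print(tok.decode(out[0], skip_special_tokens=True))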

The tokenizer is bundled with this repository (tokenizer.json, tokenizer_config.json).

Sample generations

Қазақстан Республикасының астанасы
→ … Астана қаласында орналасқан, Қазақстан Республикасы Президентінің
  резиденциясы. Сарайдың негізгі ғимараттары: «Ақорда» залы …

Абай Құнанбайұлы — қазақтың
→ … рухани мәдениетінің көрнекті өкілі. Ол – ақын, ағартушы, жазба
  әдебиетінің негізін салушы әрі дамытушы …

Жасанды интеллект дегеніміз —
→ … ақпаратты беру мен оны өңдеудің үздіксіз және тиімді жұмыс жасауын
  қамтамасыз ететін технологиялар жиынтығы.

Limitations

  • Base model, not instruction-tuned: it continues text but does not follow chat instructions out of the box. Fine-tune it for downstream tasks.
  • Trained on web/encyclopedic Kazakh, so it can emit corpus artifacts (URLs, site names, boilerplate).
  • No safety alignment — outputs are unfiltered.
  • Knowledge is limited to the training corpus.
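
For the corpus-artifact limitation, a light post-processing pass over generated text is one possible mitigation. The sketch below only strips URL-like spans. `strip_urls` is a hypothetical helper, the regular expression is a rough heuristic rather than a full URL parser, and it does nothing about site names or boilerplate phrases.

```python
import re

# Rough heuristic: anything starting with http://, https:// or www. up to the next whitespace.
_URL = re.compile(r"(?:https?://|www\.)\S+")


def strip_urls(text: str) -> str:
    """Remove URL-like spans and collapse the runs of spaces they leave behind."""
    without_urls = _URL.sub("", text)
    # Collapse only spaces and tabs so that line breaks in the generation survive.
    return re.sub(r"[ \t]{2,}", " ", without_urls).strip()


print(strip_urls("Толығырақ https://example.kz/page сайтында оқыңыз"))
```

A filter like this belongs after decoding, applied to the string returned by the tokenizer, so that it never interferes with sampling itself.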

Citation

@misc{tilcore05b2026,
  title  = {Til Core 0.5B: a morpheme-aware Kazakh language model},
  author = {TilQazyna},
  year   = {2026},
  url    = {https://huggingface.co/TilQazyna/Til-Core-0.5B}
}

Tokenizer: stukenov/sozkz-morphbpe-256k-kk-v1 · Dataset: stukenov/sozkz-corpus-tokenized-kk-morphbpe256k-v1
