Til Core 0.5B

Til Core 0.5B is a 498-million-parameter Kazakh language model trained from scratch on a clean Kazakh corpus using a 256K morpheme-aware BPE tokenizer. It is a Qwen2-style decoder-only transformer built by TilQazyna as a compact, efficient foundation model for the Kazakh language.

Til means "language" in Kazakh. Til Core is the base on which task-specific Kazakh models (instruct, grammar correction, translation) can be fine-tuned.

Why a 256K morpheme-aware vocabulary?

Kazakh is highly agglutinative: a single root can take a long chain of suffixes. Standard byte-level BPE fragments such words into many sub-tokens, wasting context and parameters. Til Core uses a 256,000-token morpheme-aware BPE (`stukenov/sozkz-morphbpe-256k-kk-v1`) that aligns tokens with morphological boundaries, giving ~15–20% better compression on Kazakh text. The trade-off, a heavier embedding table (256,000 × 896 ≈ 229M parameters), is absorbed by tying input/output embeddings and using a deeper-than-usual transformer body.
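
One way to check a compression figure like the one above is to tokenize the same texts with two tokenizers and compare token counts. The sketch below is a minimal illustration, not the authors' evaluation script: `chars_per_token` and `token_reduction` are hypothetical helper names, and the two toy chunking tokenizers are stand-ins so the example runs offline.

```python
from typing import Callable, List, Sequence

Tokenize = Callable[[str], List[str]]


def chars_per_token(texts: Sequence[str], tokenize: Tokenize) -> float:
    """Average number of characters covered by one token (higher = better compression)."""
    total_chars = sum(len(t) for t in texts)
    total_tokens = sum(len(tokenize(t)) for t in texts)
    return total_chars / total_tokens


def token_reduction(texts: Sequence[str], baseline: Tokenize, candidate: Tokenize) -> float:
    """Fraction of tokens saved by `candidate` relative to `baseline` on the same texts."""
    base = sum(len(baseline(t)) for t in texts)
    cand = sum(len(candidate(t)) for t in texts)
    return 1.0 - cand / base


# Toy stand-ins (assumptions for illustration only): the baseline cuts text into
# 3-character chunks, the candidate into 4-character chunks.
def chunk3(text: str) -> List[str]:
    return [text[i:i + 3] for i in range(0, len(text), 3)]


def chunk4(text: str) -> List[str]:
    return [text[i:i + 4] for i in range(0, len(text), 4)]


sample = ["балаларымызға кітаптарды бердік"]
print(f"chars/token (baseline):  {chars_per_token(sample, chunk3):.2f}")
print(f"chars/token (candidate): {chars_per_token(sample, chunk4):.2f}")
print(f"token reduction:         {token_reduction(sample, chunk3, chunk4):.1%}")
```

To compare real tokenizers, pass each tokenizer's `tokenize` method in place of the toy functions (assuming both load with `AutoTokenizer`) and use a held-out Kazakh sample rather than a single sentence.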

Model details


Architecture	Qwen2 (decoder-only, SwiGLU, RoPE, GQA)
Parameters	497.8M (embedding ≈ 229M, transformer ≈ 268M)
Vocabulary	256,000 (morpheme-aware BPE)
Hidden size	896
Layers	18
Attention heads	14 (GQA, 2 KV heads)
Intermediate size	4864
Context length	32,768 (`rope_theta` = 1e6)
Tied embeddings	yes
Precision	bf16

Training


Data	`stukenov/sozkz-corpus-tokenized-kk-morphbpe256k-v1` — pre-tokenized clean Kazakh (~1.44M sequences × 2048 tokens ≈ 2.94B tokens)
Tokens seen	≈ 5.88B (2 epochs)
Steps	11,222
Global batch	524,288 tokens/step (8 GPUs × 8 sequences per GPU × grad-accum 4 × 2048 tokens)
Optimizer	AdamW (default betas), weight decay 0.1, gradient clipping at 1.0
LR schedule	peak 4e-4, cosine decay, 500 warmup steps
Sequence length	2048
Hardware	8 × NVIDIA H200 (140 GB), ~3h15m
Final eval loss	2.436 (validation), perplexity ≈ 11.4
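
The learning-rate row can be read as a warmup followed by cosine decay. A minimal sketch of that shape follows, assuming the warmup is linear from zero and the decay ends at zero after the final step; neither detail is stated in the table, and `lr_at` is a hypothetical helper.

```python
import math

PEAK_LR = 4e-4
WARMUP_STEPS = 500
TOTAL_STEPS = 11_222


def lr_at(step, peak=PEAK_LR, warmup=WARMUP_STEPS, total=TOTAL_STEPS, min_lr=0.0):
    """Learning rate at a given optimizer step.

    Assumed shape: linear warmup from 0 to `peak`, then cosine decay to `min_lr`.
    """
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / max(1, total - warmup)   # 0.0 -> 1.0 over the decay
    return min_lr + 0.5 * (peak - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 250, 500, 5_861, 11_222):
    print(f"step {step:>6}: lr = {lr_at(step):.2e}")
```

If the actual run decayed to a non-zero floor, pass it as `min_lr`; the rest of the curve is unchanged.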

Chinchilla-style budget: ~498M params with ≈5.9B tokens (~11.8 tokens/param).
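
The figures in the training table are mutually consistent, which can be checked with a few lines of arithmetic. The snippet below uses only numbers quoted above; the variable names are illustrative, and the loss is assumed to be a natural-log cross-entropy.

```python
import math

steps = 11_222
tokens_per_step = 8 * 8 * 4 * 2048        # global batch as factored in the table
params = 497.8e6                          # parameter count from the model details
eval_loss = 2.436                         # final validation loss (assumed nats per token)

tokens_seen = steps * tokens_per_step
tokens_per_param = tokens_seen / params
perplexity = math.exp(eval_loss)          # perplexity is exp(mean cross-entropy)

print(f"tokens/step:  {tokens_per_step:,}")
print(f"tokens seen:  {tokens_seen / 1e9:.2f}B")
print(f"tokens/param: {tokens_per_param:.1f}")
print(f"perplexity:   {perplexity:.1f}")
```

This reproduces the 524,288 tokens/step, ≈5.88B tokens seen, ~11.8 tokens/param and perplexity ≈ 11.4 reported above.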

Usage

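import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "TilQazyna/Til-Core-0.5B"
tok = AutoTokenizer.from_pretrained(repo)
# Load the weights in bf16; device_map="auto" needs the accelerate package.
# Older transformers releases spell the dtype argument `torch_dtype`.
model = AutoModelForCausalLM.from_pretrained(repo, dtype=torch.bfloat16, device_map="auto").eval()

# Base model: give it a text prefix to continue, not a chat-style instruction.
prompt = "Абай Құнанбайұлы — қазақтың"
ids = tok(prompt, return_tensors="pt").to(model.device)
# Nucleus sampling; the repetition penalty discourages loops in the continuation.
out = model.generate(**ids, max_new_tokens=60, do_sample=True,
                     temperature=0.8, top_p=0.9, repetition_penalty=1.2)
print(tok.decode(out[0], skip_special_tokens=True))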

The tokenizer is bundled with this repository (`tokenizer.json`, `tokenizer_config.json`).

Sample generations

Қазақстан Республикасының астанасы
→ … Астана қаласында орналасқан, Қазақстан Республикасы Президентінің
  резиденциясы. Сарайдың негізгі ғимараттары: «Ақорда» залы …

Абай Құнанбайұлы — қазақтың
→ … рухани мәдениетінің көрнекті өкілі. Ол – ақын, ағартушы, жазба
  әдебиетінің негізін салушы әрі дамытушы …

Жасанды интеллект дегеніміз —
→ … ақпаратты беру мен оны өңдеудің үздіксіз және тиімді жұмыс жасауын
  қамтамасыз ететін технологиялар жиынтығы.

Limitations

Base model, not instruction-tuned: it continues text and does not follow chat instructions out of the box. Fine-tune it for downstream tasks.
Trained on web/encyclopedic Kazakh, so it can emit corpus artifacts (URLs, site names, boilerplate).
No safety alignment — outputs are unfiltered.
Knowledge is limited to the training corpus.

Citation

@misc{tilcore05b2026,
  title  = {Til Core 0.5B: a morpheme-aware Kazakh language model},
  author = {TilQazyna},
  year   = {2026},
  url    = {https://huggingface.co/TilQazyna/Til-Core-0.5B}
}

Tokenizer: stukenov/sozkz-morphbpe-256k-kk-v1 · Dataset: stukenov/sozkz-corpus-tokenized-kk-morphbpe256k-v1
