Til Core 1B (base)

A 1.25B-parameter Kazakh base language model, pre-trained from scratch on a deduplicated Kazakh web/text corpus with a 256k morpheme-aware BPE tokenizer (stukenov/sozkz-morphbpe-256k-kk-v1).

This is a base (non-instruct) model: it completes text and does not follow chat instructions. An instruct version is planned (see Roadmap).

Model details

Architecture: Llama-style decoder (RoPE, RMSNorm, SwiGLU, GQA)
Parameters: 1.246 B (tied input/output embeddings)
Hidden size / layers: 2048 / 16
Attention heads: 32 query / 8 KV (GQA)
Intermediate size: 5632
Context length: 2048 tokens
Vocabulary: 256 000 (morpheme-BPE)
Precision: bf16
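
The parameter count can be checked against the shapes listed above. The sketch below assumes a standard Llama-style layout with no bias terms, one RMSNorm before attention and one before the MLP in each layer, plus a final RMSNorm; these layout details are assumptions and are not stated in this card. The function and constant names are illustrative only.

```python
# Shapes taken from the model details above.
HIDDEN = 2048
LAYERS = 16
Q_HEADS = 32
KV_HEADS = 8
INTERMEDIATE = 5632
VOCAB = 256_000


def count_parameters() -> int:
    """Count weights for an assumed bias-free Llama-style decoder."""
    head_dim = HIDDEN // Q_HEADS      # 64
    kv_dim = KV_HEADS * head_dim      # 512: GQA shrinks the K and V projections

    embedding = VOCAB * HIDDEN        # counted once because embeddings are tied
    attention = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * kv_dim  # Q, O, then K, V
    mlp = 3 * HIDDEN * INTERMEDIATE   # SwiGLU: gate, up and down projections
    norms = 2 * HIDDEN                # two RMSNorm weight vectors per layer

    per_layer = attention + mlp + norms
    return embedding + LAYERS * per_layer + HIDDEN  # + final RMSNorm


total = count_parameters()
print(f"{total:,} parameters ({total / 1e9:.3f} B)")
print(f"bf16 weights: {total * 2 / 1e9:.2f} GB")
```

Under these assumptions the total is 1,245,775,872, which rounds to the 1.246 B reported above. About 42 % of that is the 256k-entry embedding matrix, so tying input and output embeddings saves roughly another 0.52 B parameters.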

Training

Tokens: 6.26 B (1 epoch)
Train blocks: 3 057 865 × 2048 tokens
Corpus: cleaned, then MinHash-deduplicated (11.29 M of 13.19 M docs kept, 85.6 %)
Hardware: 8 × NVIDIA H200, FSDP full-shard, bf16
Optimizer: AdamW (β = 0.9/0.95, weight decay 0.1), cosine LR 3e-4, warmup 200
Effective batch: 512 blocks (8 × 16 × grad-accum 4) ≈ 1.05 M tokens/step
Throughput: ~313 K tokens/s
Wall-clock: ~5 h 40 m
Final loss: ~2.90 (train)
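
The training figures above are mutually consistent, as the following sketch shows. It also includes a hypothetical `lr_at` helper for a cosine schedule. Several things here are assumptions rather than statements from this card: that "8 × 16" means 8 GPUs with a per-GPU batch of 16 blocks, that "warmup 200" is measured in optimizer steps with a linear ramp, and that the learning rate decays to zero.

```python
import math

BLOCKS = 3_057_865      # training blocks, from the figures above
BLOCK_LEN = 2048        # tokens per block
GPUS, PER_GPU_BATCH, GRAD_ACCUM = 8, 16, 4  # assumed reading of "8 x 16 x 4"
PEAK_LR, WARMUP = 3e-4, 200                 # warmup assumed to be in steps
THROUGHPUT = 313_000    # tokens per second (approximate)

total_tokens = BLOCKS * BLOCK_LEN
blocks_per_step = GPUS * PER_GPU_BATCH * GRAD_ACCUM
tokens_per_step = blocks_per_step * BLOCK_LEN
total_steps = math.ceil(BLOCKS / blocks_per_step)
hours = total_tokens / THROUGHPUT / 3600    # pure compute time at this rate


def lr_at(step: int, min_lr: float = 0.0) -> float:
    """Assumed schedule: linear warmup, then cosine decay to min_lr."""
    if step < WARMUP:
        return PEAK_LR * (step + 1) / WARMUP
    progress = min(1.0, (step - WARMUP) / max(1, total_steps - WARMUP))
    return min_lr + 0.5 * (PEAK_LR - min_lr) * (1 + math.cos(math.pi * progress))


print(f"{total_tokens:,} tokens, {tokens_per_step:,} tokens/step")
print(f"{total_steps:,} optimizer steps, about {hours:.2f} h at the stated rate")
```

This gives 6,262,507,520 tokens, 1,048,576 tokens per step, and 5,973 optimizer steps. At ~313 K tokens/s the pure throughput time is about 5.56 h, a little under the reported ~5 h 40 m wall-clock; the card does not say what accounts for the difference.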

Usage

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

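name = "stukenov/Til-Core-1B"
tok = AutoTokenizer.from_pretrained(name)
# Load the weights in bf16, move the model to the GPU, and switch to eval mode.
m = AutoModelForCausalLM.from_pretrained(name, dtype=torch.bfloat16).cuda().eval()

# Tokenize a Kazakh prompt for the model to continue.
ids = tok("Қазақстан Республикасы — ", return_tensors="pt").input_ids.cuda()
# Sample up to 50 new tokens; the repetition penalty discourages loops.
out = m.generate(ids, max_new_tokens=50, do_sample=True,
                 temperature=0.8, top_p=0.95, repetition_penalty=1.2)
print(tok.decode(out[0], skip_special_tokens=True))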
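
Because the model was trained with a 2048-token context, a long prompt plus the newly generated tokens should stay within that window. The helper below is a minimal sketch of one way to do this, keeping only the most recent prompt tokens. `fit_prompt` is a hypothetical name and not part of this repository or of transformers.

```python
CONTEXT_LEN = 2048  # context length from the model details


def fit_prompt(token_ids: list[int], max_new_tokens: int,
               context_len: int = CONTEXT_LEN) -> list[int]:
    """Keep the most recent prompt tokens so prompt + generation fits."""
    budget = context_len - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens leaves no room for a prompt")
    # Slicing from the end leaves short prompts unchanged.
    return token_ids[-budget:]


long_prompt = list(range(3000))          # stand-in for real token ids
trimmed = fit_prompt(long_prompt, max_new_tokens=50)
print(len(trimmed))                      # 1998 tokens kept
```

With a tensor of shape `(1, seq_len)` as in the usage snippet, the equivalent slice would be `ids[:, -budget:]`. Note that truncating from the left drops any beginning-of-sequence token the tokenizer may have added, which may or may not matter for this model.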

Sample generations

Қазақстан Республикасы — мемлекеттік рәміздері. Жалпы білім беретін мектептің 6-сыныбына арналған оқулық…

Жасанды интеллект дегеніміз бұл адам миының эволюциясы, ойлау жүйесі мен мінез-құлқының ерекшеліктерін…

Менің Отаным — «Отан» туралы өлеңді мәнерлеп оқу… Біздің Отанымыз қалай аталады?…

Limitations

  • Base model — no instruction following, no safety alignment.
  • Single epoch on a 6.26 B-token corpus; factual reliability is limited.
  • Corpus skews toward educational / encyclopedic Kazakh text; occasional rare-token artifacts in generation.
  • Kazakh-centric; not optimized for other languages.

Roadmap

  • Til Core 1B Instruct — SFT on Kazakh instruction data (see plan in repo).
  • A smaller instruct sibling for on-device use.

Citation

@misc{tilcore1b2026,
  title  = {Til Core 1B: a Kazakh base language model with a morpheme-BPE tokenizer},
  author = {Tukenov, Saken},
  year   = {2026}
}