Böri — Kazakh Punctuation Restoration (bori-punct)

Restores commas, periods, and question marks in unpunctuated Kazakh text. The model is XLM-RoBERTa-large with a token-classification head. Part of the Böri Kazakh language-learning project.

Validation metrics

Token accuracy: 0.951
Weighted F1: 0.95
Macro-F1 (COMMA/PERIOD/QUESTION): 0.804
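
The gap between weighted F1 and macro-F1 is what one would expect when the O label dominates. The sketch below shows one way such numbers can be computed. It assumes that macro-F1 is the unweighted mean of per-class F1 over the three punctuation labels (O excluded), which the metric name suggests but the card does not spell out. The function names and the toy label sequences are illustrative, not taken from the actual evaluation.

```python
def f1_for(label, gold, pred):
    """F1 for a single label, computed from aligned gold/pred sequences."""
    pairs = list(zip(gold, pred))
    tp = sum(g == label and p == label for g, p in pairs)
    fp = sum(g != label and p == label for g, p in pairs)
    fn = sum(g == label and p != label for g, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def punctuation_macro_f1(gold, pred, labels=("COMMA", "PERIOD", "QUESTION")):
    """Unweighted mean of per-class F1 over the punctuation labels (assumed definition)."""
    return sum(f1_for(label, gold, pred) for label in labels) / len(labels)


def token_accuracy(gold, pred):
    """Fraction of positions where the predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


# Toy sequences (made up for illustration): one QUESTION mispredicted as PERIOD.
gold = ["O", "COMMA", "O", "O", "PERIOD", "O", "O", "QUESTION"]
pred = ["O", "COMMA", "O", "O", "PERIOD", "O", "O", "PERIOD"]

print(token_accuracy(gold, pred))                  # 0.875
print(round(punctuation_macro_f1(gold, pred), 3))  # 0.556
```

A single error on a rare class barely moves token accuracy but lowers macro-F1 sharply, which is why the two figures can differ as much as they do above.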

Labels

O (none), COMMA, PERIOD, QUESTION — one per word; the label is the punctuation that goes AFTER the word.
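
To make the labeling scheme concrete, the sketch below converts a punctuated sentence into bare words plus one label per word. It is only an illustration of the scheme described above. The helper name `to_words_and_labels` and the example sentence are hypothetical, and the actual preprocessing used for training may differ.

```python
PUNCT_TO_LABEL = {",": "COMMA", ".": "PERIOD", "?": "QUESTION"}


def to_words_and_labels(punctuated):
    """Split punctuated text into bare words and per-word labels.

    The label of a word is the punctuation mark that follows it, or 'O'.
    """
    words, labels = [], []
    for token in punctuated.split():
        label = "O"
        # Only the final character is inspected; any other marks stay attached.
        if token[-1] in PUNCT_TO_LABEL:
            label = PUNCT_TO_LABEL[token[-1]]
            token = token[:-1]
        if token:  # skip tokens that consisted of a punctuation mark only
            words.append(token)
            labels.append(label)
    return words, labels


words, labels = to_words_and_labels("Сәлем, қалың қалай?")
print(words)   # ['Сәлем', 'қалың', 'қалай']
print(labels)  # ['COMMA', 'O', 'QUESTION']
```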

Usage

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tok = AutoTokenizer.from_pretrained('zhxdoka/bori-punct')
model = AutoModelForTokenClassification.from_pretrained('zhxdoka/bori-punct').eval()

I2L = {0: 'O', 1: 'COMMA', 2: 'PERIOD', 3: 'QUESTION'}  # class index -> label
MARKS = {'O': '', 'COMMA': ',', 'PERIOD': '.', 'QUESTION': '?'}  # label -> character

def restore(text):
    words = text.split()
    if not words:
        return ''
    # Input beyond 256 subword tokens is truncated; truncated words keep the label 'O'.
    enc = tok(words, is_split_into_words=True, return_tensors='pt',
              truncation=True, max_length=256)
    with torch.no_grad():
        logits = model(**enc).logits[0]
    preds = ['O'] * len(words)
    prev = None
    for i, wid in enumerate(enc.word_ids()):
        # Take the prediction from the first subword token of each word.
        if wid is not None and wid != prev:
            preds[wid] = I2L[int(logits[i].argmax())]
        prev = wid
    return ' '.join(w + MARKS[p] for w, p in zip(words, preds))
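
Because `restore` truncates its input at 256 subword tokens, any words past that limit come back unpunctuated. One possible workaround, sketched below, is to split long text into fixed-size word chunks and restore each chunk separately. The helper `restore_long` and the default of 100 words per chunk are assumptions for illustration. The right chunk size depends on how many subword tokens each word produces, and punctuation near chunk boundaries may be less reliable because the model sees less context there.

```python
def restore_long(text, restore_fn, words_per_chunk=100):
    """Apply restore_fn to consecutive word chunks and join the results.

    restore_fn is any callable mapping unpunctuated text to punctuated text,
    for example the restore function defined above.
    """
    words = text.split()
    chunks = [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]
    return " ".join(restore_fn(chunk) for chunk in chunks)
```

With the model loaded as above, a call would look like `restore_long(long_text, restore)`.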

Model size: 0.6B params (Safetensors, tensor type F32)

Base model: FacebookAI/xlm-roberta-large (this model is a fine-tune)