Böri — Kazakh Punctuation Restoration (bori-punct)
Restores commas, periods, and question marks in unpunctuated Kazakh text. The model is XLM-RoBERTa-large with a token-classification head. Part of the Böri Kazakh language-learning project.
Validation metrics
- Token accuracy: 0.951
- Weighted F1: 0.95
- Macro-F1 (COMMA/PERIOD/QUESTION): 0.804
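The gap between weighted F1 and macro-F1 is typical when one class (here presumably `O`) is far more frequent than the others: weighted F1 averages per-class scores by support, while macro-F1 over the three punctuation classes counts each class equally. The card does not say how the numbers above were computed, so the following is only a minimal sketch of the two averages on made-up toy tags; the helper names `per_class_f1`, `macro_f1`, and `weighted_f1` are hypothetical.

```python
from collections import Counter

def per_class_f1(gold, pred, label):
    """F1 for one label, from parallel lists of gold and predicted tags."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(gold, pred, labels):
    """Unweighted mean of per-class F1 over the given labels."""
    return sum(per_class_f1(gold, pred, l) for l in labels) / len(labels)

def weighted_f1(gold, pred):
    """Mean of per-class F1 over all gold labels, weighted by class support."""
    support = Counter(gold)
    return sum(per_class_f1(gold, pred, l) * n for l, n in support.items()) / len(gold)

# Toy tags (invented for illustration, not model output): one missed comma.
gold = ['O', 'O', 'O', 'O', 'COMMA', 'O', 'O', 'PERIOD', 'O', 'QUESTION']
pred = ['O', 'O', 'O', 'O', 'O',     'O', 'O', 'PERIOD', 'O', 'QUESTION']

macro = macro_f1(gold, pred, ['COMMA', 'PERIOD', 'QUESTION'])
weighted = weighted_f1(gold, pred)
print(f"macro-F1 (punctuation only): {macro:.3f}, weighted F1: {weighted:.3f}")
```

On this toy data a single missed comma pulls the macro average down much more than the weighted one, which is why the macro figure is the more informative of the two for the rare punctuation classes.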
Labels
O (none), COMMA, PERIOD, QUESTION — one per word; the label is the punctuation that goes AFTER the word.
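To make the label scheme concrete, the sketch below converts a punctuated sentence into parallel lists of words and labels. It only illustrates the "label follows the word" convention; the card does not describe how the training data was actually prepared, and the names `MARK_TO_LABEL` and `to_words_and_labels` are hypothetical. It handles a single trailing mark per word and nothing else.

```python
# Maps a trailing punctuation mark to the label of the word it follows.
MARK_TO_LABEL = {',': 'COMMA', '.': 'PERIOD', '?': 'QUESTION'}

def to_words_and_labels(text):
    """Split punctuated text into (words, labels), one label per word."""
    words, labels = [], []
    for token in text.split():
        label = MARK_TO_LABEL.get(token[-1], 'O')
        word = token[:-1] if label != 'O' else token
        if not word:
            # Token was a stand-alone punctuation mark; skip it in this sketch.
            continue
        words.append(word)
        labels.append(label)
    return words, labels

words, labels = to_words_and_labels('Сәлем, қалың қалай?')
print(list(zip(words, labels)))
```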
Usage
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

tok = AutoTokenizer.from_pretrained('zhxdoka/bori-punct')
model = AutoModelForTokenClassification.from_pretrained('zhxdoka/bori-punct').eval()

# Label ids -> label names, and label names -> punctuation marks.
I2L = {0: 'O', 1: 'COMMA', 2: 'PERIOD', 3: 'QUESTION'}
P = {'O': '', 'COMMA': ',', 'PERIOD': '.', 'QUESTION': '?'}

def restore(text):
    words = text.split()
    enc = tok(words, is_split_into_words=True, return_tensors='pt', truncation=True, max_length=256)
    with torch.no_grad():
        logits = model(**enc).logits[0]
    # Take the prediction at the first sub-token of each word.
    # Words cut off by truncation keep the default 'O' (no punctuation).
    wids = enc.word_ids()
    preds = ['O'] * len(words)
    prev = None
    for i, wid in enumerate(wids):
        if wid is not None and wid != prev:
            preds[wid] = I2L[int(logits[i].argmax())]
        prev = wid
    return ' '.join(w + P[p] for w, p in zip(words, preds))
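Because `restore` truncates its input at 256 sub-tokens, any words past that limit come back without punctuation. One possible workaround, sketched here under assumptions, is to split long text into fixed-size word chunks and restore each chunk separately. The helper `restore_long` and the default `chunk_words=100` are hypothetical and not part of the model card; the number of sub-tokens per word varies, so the chunk size may need tuning, and predictions near chunk boundaries may be less reliable because the model sees less context there.

```python
def restore_long(text, restore_fn, chunk_words=100):
    """Apply restore_fn to consecutive chunks of at most chunk_words words."""
    words = text.split()
    chunks = [
        ' '.join(words[i:i + chunk_words])
        for i in range(0, len(words), chunk_words)
    ]
    return ' '.join(restore_fn(chunk) for chunk in chunks)
```

With the function defined in the usage snippet, a call would look like `restore_long(long_text, restore)`.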
Base model: FacebookAI/xlm-roberta-large