mT5-small fine-tuned for Kyrgyz text normalization
This model is google/mt5-small fine-tuned to normalize noisy Kyrgyz social-media text (YouTube comments, Instagram posts, Telegram messages) into a standardized form. It covers punctuation, capitalization, dialectal spelling, and digit–word compounds.
This is the fine-tuning-only variant from the camera-ready paper "Kyrgyz Text Normalization: A Comparative Study of Neural and Rule-Based Approaches" (MeLLM Workshop @ ACL 2026). For the continual pre-training + fine-tuning variant, see Zarinaaa/mt5-small-kyrgyz-normalization-ptft.
Usage
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "Zarinaaa/mt5-small-kyrgyz-normalization"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# Noisy input: no capitalization or punctuation
noisy = "барды жакшы болсун коркунучту жерлерди тазалаш керек"
# Prepend the task prefix and tokenize (inputs longer than 256 tokens are truncated)
inputs = tokenizer("correct: " + noisy, return_tensors="pt", truncation=True, max_length=256)
# Decode with beam search (4 beams)
out = model.generate(**inputs, max_new_tokens=256, num_beams=4)
print(tokenizer.decode(out[0], skip_special_tokens=True))
# Expected output: Барды жакшы болсун. Коркунучтуу жерлерди тазалаш керек.
The prefix "correct: " is required because the model was fine-tuned with this exact prompt.
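For several inputs at once, a batched variant of the snippet above might look like the following sketch. The helper names `with_prefix` and `normalize_batch` are hypothetical and not part of the released code; the generation settings simply mirror the single-example call.

```python
PREFIX = "correct: "  # task prefix the model card states is required


def with_prefix(text: str) -> str:
    """Prepend the task prefix unless the text already starts with it."""
    return text if text.startswith(PREFIX) else PREFIX + text


def normalize_batch(texts, model_id="Zarinaaa/mt5-small-kyrgyz-normalization"):
    """Normalize a list of noisy strings in one padded batch (illustrative helper)."""
    # Imported lazily so with_prefix can be used without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
    inputs = tokenizer(
        [with_prefix(t) for t in texts],
        return_tensors="pt",
        padding=True,          # pad to the longest input in the batch
        truncation=True,
        max_length=256,
    )
    out = model.generate(**inputs, max_new_tokens=256, num_beams=4)
    return tokenizer.batch_decode(out, skip_special_tokens=True)
```

This sketch reloads the model on every call; for repeated use, loading the tokenizer and model once and passing them in would be the more sensible design.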
Training data
1.67M noisy–clean Kyrgyz text pairs from YouTube (45%), Instagram (25%), and Telegram (30%), automatically annotated with Gemini 3 Pro and spot-checked on a 400-example sample (84% acceptance rate, 95% Wilson CI [80%, 87%]). The 1,000-example test set was fully verified by two native Kyrgyz speakers with adjudication.
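The spot-check interval above can be reproduced with the standard Wilson score formula. The sketch below assumes that the 84% acceptance rate corresponds to 336 accepted pairs out of 400; the card reports only the percentage.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Assumption: 84% of 400 means 336 accepted pairs.
low, high = wilson_interval(336, 400)
print(f"{low:.3f}, {high:.3f}")
```

Under that assumption the bounds round to the reported [80%, 87%].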
A 20,000-pair subset of the training data and the full test set are released at Zarinaaa/kyrgyz-text-normalization.
Training procedure
- Base model: google/mt5-small (300M parameters)
- Effective batch size: 64 (physical batch 4 × gradient accumulation 16)
- Learning rate: 3e-4, cosine schedule, 500 warmup steps
- Epochs: 5
- Max sequence length: 256
- Train/validation split: 95 / 5, seed 42; best checkpoint by validation loss
- Hardware: 1× NVIDIA RTX 5080 (16 GB VRAM)
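As a rough illustration of the hyperparameters above, the sketch below derives the effective batch size and an approximate optimizer step count, then evaluates a warmup-plus-cosine learning rate. The exact schedule shape (linear warmup, half-cosine decay to zero) is an assumption, since the card states only "cosine schedule, 500 warmup steps", and the step count is approximate because 1.67M is a rounded figure. The function name `lr_at` is hypothetical.

```python
import math


def lr_at(step: int, total_steps: int, base_lr: float = 3e-4, warmup: int = 500) -> float:
    """Linear warmup to base_lr, then half-cosine decay to zero (assumed form)."""
    if step < warmup:
        return base_lr * step / max(1, warmup)
    progress = (step - warmup) / max(1, total_steps - warmup)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


effective_batch = 4 * 16                      # physical batch x gradient accumulation
train_pairs = 1_670_000 * 95 // 100           # 95% train split of ~1.67M pairs (approximate)
steps_per_epoch = math.ceil(train_pairs / effective_batch)
total_steps = steps_per_epoch * 5             # 5 epochs

print(effective_batch, steps_per_epoch, total_steps)
print(lr_at(250, total_steps), lr_at(500, total_steps), lr_at(total_steps, total_steps))
```

With these assumptions the 500 warmup steps are a very small fraction of training, so almost the entire run is spent on the cosine decay.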
The 1,000 test inputs are disjoint from the 1.67M training set (verified 0/1,000 exact-match overlap and 0/1,000 case-insensitive overlap).
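A disjointness check of this kind could be sketched as follows. The function name `overlap_counts` and the toy strings are illustrative only; the authors' actual verification script is not shown in this card.

```python
def overlap_counts(test_inputs, train_inputs):
    """Count test inputs that also occur in training, exactly and ignoring case."""
    train_exact = set(train_inputs)
    train_folded = {s.casefold() for s in train_inputs}
    exact = sum(1 for s in test_inputs if s in train_exact)
    folded = sum(1 for s in test_inputs if s.casefold() in train_folded)
    return exact, folded


# Toy data for illustration only; not drawn from the released dataset.
train = ["салам кандайсың", "Рахмат сага"]
test = ["рахмат сага", "жакшы болсун"]
print(overlap_counts(test, train))  # (0, 1)
```

Using sets keeps the check linear in the size of the training data, which matters at 1.67M pairs.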
Evaluation
Automatic metrics on the held-out 1,000-example test set:
| Metric | Value |
|---|---|
| CER | 0.0796 ± 0.003 |
| WER | 0.1978 |
| Exact Match | 0.186 |
For comparison, the rule-based baseline reaches 0.2029 CER and zero-shot Gemma 4 (9.6B parameters, 32× larger) reaches 0.1620 CER.
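For readers unfamiliar with the metrics, the sketch below shows the standard edit-distance definitions of CER and WER for a single reference/hypothesis pair. It is not the paper's scoring script, and how the per-pair scores are aggregated over the test set (and what the ± term denotes) is not specified in this card.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: character edits divided by reference length."""
    return edit_distance(ref, hyp) / max(1, len(ref))


def wer(ref: str, hyp: str) -> float:
    """Word error rate: word edits divided by reference word count."""
    ref_words, hyp_words = ref.split(), hyp.split()
    return edit_distance(ref_words, hyp_words) / max(1, len(ref_words))


def exact_match(ref: str, hyp: str) -> float:
    """1.0 if the strings are identical, else 0.0."""
    return float(ref == hyp)


ref = "Барды жакшы болсун."
hyp = "Барды жакшы болсун"   # missing final period
print(cer(ref, hyp), wer(ref, hyp), exact_match(ref, hyp))
```

The example shows why the three metrics diverge: a single missing period costs one character out of 19, one word out of three, and the whole exact match.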
Human evaluation by two native Kyrgyz speakers on 200 examples: 99.8% rated correct (Wilson 95% CI [98.6%, 99.96%]). Reliability under prevalence skew: PABAK = 0.990, Gwet's AC1 = 0.995. Of the 199 outputs both annotators rated correct, 162 (81.4%) differ from the Gemini reference at the character level. This is surface-form variability that EM penalizes but native speakers accept.
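The agreement figures above can be checked with the usual two-rater, binary-label formulas. The sketch assumes a confusion pattern of 199 outputs rated correct by both annotators, one disagreement, and none rated incorrect by both; the card does not list these counts explicitly, so treat them as an inference from the reported numbers.

```python
def agreement_stats(both_yes: int, both_no: int, disagree: int):
    """Observed agreement, PABAK, and Gwet's AC1 for two raters, binary labels."""
    n = both_yes + both_no + disagree
    p_o = (both_yes + both_no) / n
    pabak = 2 * p_o - 1
    # Mean proportion of "yes" ratings across both raters; each disagreement
    # contributes exactly one "yes" out of two ratings.
    pi_yes = (2 * both_yes + disagree) / (2 * n)
    p_e = 2 * pi_yes * (1 - pi_yes)   # chance agreement under AC1
    ac1 = (p_o - p_e) / (1 - p_e)
    return p_o, pabak, ac1


# Assumption: 199 both-correct, 0 both-incorrect, 1 disagreement.
p_o, pabak, ac1 = agreement_stats(199, 0, 1)
print(f"{pabak:.3f} {ac1:.3f}")  # 0.990 0.995
```

Both statistics are used here instead of Cohen's kappa because kappa becomes unstable when nearly all ratings fall in one category.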
Per-category CER
| Category | N | CER |
|---|---|---|
| Punctuation restoration | 849 | 0.078 |
| Capitalization | 62 | 0.084 |
| All-caps segments | 39 | 0.084 |
| Digit–word compounds | 41 | 0.076 |
Limitations
- Domain: trained and evaluated on social-media text. Performance on news, speech transcripts, or formal government text is not guaranteed.
- Reference bias: training references were produced by Gemini 3 Pro; a probe with an independent annotator shows the model has learned a general normalization function (CER changes by only 0.012 against an independent reference), but residual stylistic bias is possible.
- Label noise: ~16% of training pairs may contain minor issues per the 400-example spot-check.
- Model size: larger variants (mT5-base/large, ByT5) and fine-tuned LLMs were not evaluated due to compute constraints.
- Rule-based comparison: the baseline in the paper is intentionally minimal; a stronger Kyrgyz FST-based pipeline would likely close part of the gap.
Citation
@inproceedings{uvalieva2026kyrgyz,
title={Kyrgyz Text Normalization: A Comparative Study of Neural and Rule-Based Approaches},
author={Uvalieva, Zarina and Kumarbai uulu, Bektemir and Metinov, Adilet and Tashbaltaev, Tynchtykbek and Alibekov, Nurtilek},
booktitle={Proceedings of the MeLLM Workshop at ACL 2026},
year={2026}
}
License
MIT. Code: github.com/Zarina33/Kyrgyz-Text-Normalization-Conference.