Khmer Text Summarization: mT5

An abstractive summarization model for the Khmer language, fine-tuned from google/mt5-base on two Khmer news datasets.

Note: Khmer is written without spaces between words.
The mT5 SentencePiece tokenizer handles subword segmentation on its own, so
do not apply any word-splitting pre-processing.

Usage

import re
import unicodedata

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("phonsobon/khmer-text-summarization", use_fast=False)
model     = AutoModelForSeq2SeqLM.from_pretrained("phonsobon/khmer-text-summarization")
model.eval()

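def clean_khmer(text):
    """Normalize Unicode, strip HTML tags and URLs, and collapse runs of spaces/tabs."""
    text = unicodedata.normalize("NFC", text)
    text = re.sub(r"<[^>]+>|https?://\S+", " ", text)  # drop HTML tags and URLs
    text = re.sub(r"[ \t]+", " ", text)                # collapse spaces and tabs (newlines are kept)
    return text.strip()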

article = "បញ្ចូលអត្ថបទខ្មែររបស់អ្នកនៅទីនេះ ..."   # placeholder ("Enter your Khmer article here"); replace with your Khmer article

inputs = tokenizer(
    "summarize: " + clean_khmer(article),
    return_tensors="pt",
    max_length=512,
    truncation=True,
)

output_ids = model.generate(**inputs)   # uses the generation_config saved with the model
summary    = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(summary)
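
To see what the cleaning step does on its own, without loading the model, the sketch below repeats clean_khmer from the snippet above and runs it on two made-up inputs. The sample strings and the example.com URL are illustrative assumptions, not data from the training sets.

```python
import re
import unicodedata


def clean_khmer(text):
    # Same steps as clean_khmer in the usage snippet above.
    text = unicodedata.normalize("NFC", text)
    text = re.sub(r"<[^>]+>|https?://\S+", " ", text)  # drop HTML tags and URLs
    text = re.sub(r"[ \t]+", " ", text)                # collapse spaces and tabs
    return text.strip()


# Illustrative input: HTML tags, a tab, a URL, then more Khmer text.
raw = "<p>ខ្មែរ</p>\t https://example.com   អត្ថបទ"
cleaned = clean_khmer(raw)
print(cleaned)  # ខ្មែរ អត្ថបទ

# A URL followed directly by Khmer text, with no space in between.
glued = clean_khmer("https://example.comខ្មែរ")
print(repr(glued))  # ''
```

The second call shows a side effect worth knowing about: because Khmer is written without spaces, the `\S+` in the URL pattern also consumes any Khmer text that directly follows a URL. If your articles embed URLs inline, you may want to separate them from the surrounding text before cleaning.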

Evaluation: Validation set

Metric   Score
-        -

Evaluation: Test set

Metric   Score
-        -

Training details

Setting              Value
Base model           google/mt5-base
Fine-tuning method   LoRA (merged)
Task prefix          summarize:
Max input length     512 tokens
Max target length    128 tokens
Epochs               10
Learning rate        0.0002
Beam search          4 beams
No-repeat n-gram     3
Training date        2026-06-20T09:20:50.974517
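
The usage snippet relies on the generation settings saved with the model. If you would rather pass them explicitly, the following is a minimal sketch. It assumes the beam and n-gram values in the table map onto the standard generate() arguments num_beams and no_repeat_ngram_size, and that the 128-token target length is applied through max_new_tokens (the saved config may use max_length instead). The summarize helper and the GENERATION_KWARGS name are hypothetical, introduced only for illustration.

```python
# Generation settings copied from the training-details table.
# Mapping "Max target length" to max_new_tokens is an assumption.
GENERATION_KWARGS = {
    "num_beams": 4,
    "no_repeat_ngram_size": 3,
    "max_new_tokens": 128,
}


def summarize(model, tokenizer, text, **overrides):
    """Summarize text with explicit generation settings (hypothetical helper)."""
    inputs = tokenizer(
        "summarize: " + text,   # task prefix from the table
        return_tensors="pt",
        max_length=512,         # max input length from the table
        truncation=True,
    )
    # Caller-supplied overrides win over the defaults above.
    kwargs = {**GENERATION_KWARGS, **overrides}
    output_ids = model.generate(**inputs, **kwargs)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

With the model and tokenizer loaded as in the Usage section, a call would look like summarize(model, tokenizer, clean_khmer(article)), and a single setting can be changed per call, for example num_beams=2.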

Datasets

Dataset                                  Columns used
phonsobon/khmer-artical-summaries        content → summaries
phonsobon/khmer-text-summarization-v2    article → summaries
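
Since the two datasets name their input column differently, they need to be brought to a common schema before training. The sketch below shows one way this could be done on plain dictionaries. The COLUMN_MAP and to_example names, the "source"/"target" output keys, and the choice to prepend the task prefix at this stage are all assumptions for illustration; the card does not describe the actual preprocessing code.

```python
# (input column, target column) per dataset, taken from the table above.
COLUMN_MAP = {
    "phonsobon/khmer-artical-summaries": ("content", "summaries"),
    "phonsobon/khmer-text-summarization-v2": ("article", "summaries"),
}

TASK_PREFIX = "summarize: "


def to_example(dataset_name, row):
    """Map one raw row to a shared {"source", "target"} schema (hypothetical)."""
    source_col, target_col = COLUMN_MAP[dataset_name]
    return {
        "source": TASK_PREFIX + row[source_col],
        "target": row[target_col],
    }


# Illustrative rows, not real dataset content.
a = to_example("phonsobon/khmer-artical-summaries",
               {"content": "text A", "summaries": "sum A"})
b = to_example("phonsobon/khmer-text-summarization-v2",
               {"article": "text B", "summaries": "sum B"})
print(a)
print(b)
```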

Limitations

  • Optimised for Khmer-language news articles.
  • ROUGE scores are computed at the character level (no Khmer word segmenter is used), so treat them as relative indicators rather than absolute measures of quality.
  • Model may struggle on very short or colloquial Khmer text outside the training distribution.
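
To make the character-level scoring concrete, here is a minimal sketch of a character-level ROUGE-1 F1. It is not the author's evaluation script. It assumes each non-whitespace Unicode code point counts as one token, so Khmer combining signs are counted separately from their base consonants; the function name char_rouge1_f1 is hypothetical.

```python
from collections import Counter


def char_rouge1_f1(reference, candidate):
    """Character-level ROUGE-1 F1 over non-whitespace code points (sketch)."""
    ref_counts = Counter(ch for ch in reference if not ch.isspace())
    cand_counts = Counter(ch for ch in candidate if not ch.isspace())

    # Clipped overlap: each character counts at most as often as it
    # appears in both strings.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


print(char_rouge1_f1("abcd", "abcd"))  # 1.0
print(char_rouge1_f1("abcd", "wxyz"))  # 0.0
```

Because single characters overlap far more often than whole words do, scores from a scheme like this tend to run higher than word-level ROUGE, which is why they are best compared only against other runs scored the same way.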