T5 Indonesian Summarization (Augmented 3x)

A T5-base model fine-tuned with 3x data augmentation (utterance-level paraphrasing) to summarize Indonesian-language conversations. Training used 5-fold cross-validation.

Model Details

  • Base Model: cahya/t5-base-indonesian-summarization-cased
  • Architecture: T5-base (encoder-decoder, 12 layers, 768 hidden, 12 heads)
  • Parameters: ~220M
  • Language: Indonesian (Bahasa Indonesia)
  • Task: Abstractive Summarization of Indonesian Conversations
  • Training: 5-Fold Cross Validation
  • Available Folds: all 5 folds are available as branches (fold_0 through fold_4). The main branch contains fold 3 (best performance).
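
The 5-fold setup listed above can be sketched in plain Python. The card does not say how the folds were drawn, so the shuffling, the seed, and the function name `k_fold_indices` are illustrative assumptions rather than the author's procedure.

```python
import random
from typing import List, Tuple


def k_fold_indices(n_samples: int, k: int = 5, seed: int = 42) -> List[Tuple[List[int], List[int]]]:
    """Return k (train_indices, val_indices) pairs covering all samples.

    Assumption: samples are shuffled once with a fixed seed; the card does
    not state whether or how the data was shuffled.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    # Deal the shuffled indices into k roughly equal folds.
    folds = [indices[i::k] for i in range(k)]

    splits = []
    for i in range(k):
        val = folds[i]
        # Every fold except fold i goes into the training set.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, val))
    return splits


splits = k_fold_indices(n_samples=10, k=5)
for fold_id, (train, val) in enumerate(splits):
    print(f"fold_{fold_id}: {len(train)} train, {len(val)} validation")
```

Each sample lands in exactly one validation fold, so every fold's model is evaluated on data it did not train on.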

Usage

from transformers import T5Tokenizer, T5ForConditionalGeneration

# Load the model and tokenizer
model_name = "aloisiusedwin/t5-id-summarization-augmented3x"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# Example conversation
conversation = "summarize: S1: Halo, gimana kabarmu? S2: Baik, aku lagi sibuk ngerjain tugas nih."

# Generate the summary
inputs = tokenizer(conversation, return_tensors="pt", max_length=512, truncation=True)
outputs = model.generate(
    **inputs,  # passes input_ids together with the attention mask
    max_length=150,
    num_beams=1,
    no_repeat_ngram_size=2,
    early_stopping=True,  # only takes effect with beam search (num_beams > 1)
)
summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(summary)
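
The example input uses a `summarize: ` prefix followed by turns labeled S1, S2. A minimal sketch of building that string from a list of turns is shown here. The helper `format_conversation` and the rule of numbering speakers by first appearance are assumptions inferred from the single example above, not a documented input specification.

```python
from typing import Dict, List, Tuple

PREFIX = "summarize: "


def format_conversation(turns: List[Tuple[str, str]]) -> str:
    """Build a model input from (speaker_name, text) turns.

    Assumption: speakers are relabeled S1, S2, ... in order of first
    appearance, mirroring the example input in this card.
    """
    labels: Dict[str, str] = {}
    parts = []
    for speaker, text in turns:
        # Reuse the label if this speaker has been seen, else assign the next one.
        label = labels.setdefault(speaker, f"S{len(labels) + 1}")
        parts.append(f"{label}: {text.strip()}")
    return PREFIX + " ".join(parts)


conversation = format_conversation([
    ("Andi", "Halo, gimana kabarmu?"),
    ("Budi", "Baik, aku lagi sibuk ngerjain tugas nih."),
])
print(conversation)
```

The resulting string can be passed to the tokenizer in place of the hard-coded example.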

Loading Specific Fold

# Load a specific fold (e.g. fold_0)
model = T5ForConditionalGeneration.from_pretrained(model_name, revision="fold_0")
tokenizer = T5Tokenizer.from_pretrained(model_name, revision="fold_0")

Training Details

Hyperparameters

| Parameter | Value |
| --- | --- |
| Learning Rate | 5e-5 |
| Batch Size | 8 |
| Epochs | 10 |
| Early Stopping Patience | 2 |
| Weight Decay | 0.01 |
| Label Smoothing | 0.1 |
| LR Scheduler | Cosine |
| Warmup Ratio | 0.10 |
| Max Grad Norm | 1.0 |
| Max Input Length | 512 |
| Max Target Length | 128 |
| FP16 | True |
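
For readers who want to reproduce a similar setup, the hyperparameters above can be written as keyword arguments in the style of the Hugging Face `Seq2SeqTrainingArguments` class. The card does not say which training loop was used, so this mapping is an assumption. In particular, treating the batch size as per-device is a guess, and argument names can differ between `transformers` versions.

```python
# Hyperparameters from the table, keyed by the Seq2SeqTrainingArguments
# names they most plausibly correspond to (assumed mapping).
TRAINING_KWARGS = {
    "learning_rate": 5e-5,
    "per_device_train_batch_size": 8,  # assumption: batch size is per device
    "num_train_epochs": 10,
    "weight_decay": 0.01,
    "label_smoothing_factor": 0.1,
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.10,
    "max_grad_norm": 1.0,
    "fp16": True,  # requires a GPU that supports half precision
}

# Settings that live outside the training arguments.
EARLY_STOPPING_PATIENCE = 2   # e.g. EarlyStoppingCallback(early_stopping_patience=2)
MAX_INPUT_LENGTH = 512        # tokenizer max_length for conversations
MAX_TARGET_LENGTH = 128       # tokenizer max_length for reference summaries

# Hypothetical usage (output_dir is a placeholder):
# from transformers import Seq2SeqTrainingArguments
# args = Seq2SeqTrainingArguments(output_dir="out", **TRAINING_KWARGS)

for name, value in TRAINING_KWARGS.items():
    print(f"{name} = {value}")
```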

Data Augmentation

The dataset was expanded 3x using utterance-level paraphrasing with the Wikidepia/IndoT5-base-paraphrase model. Each conversation is paraphrased sentence by sentence.
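
The augmentation idea can be sketched as follows. This is an illustration under stated assumptions, not the author's pipeline: `paraphrase` is a hypothetical callable standing in for the paraphrase model, "3x" is taken to mean the original plus two paraphrased copies, utterances are assumed to carry `S1:`-style labels, and the reference summary is assumed to stay unchanged.

```python
import re
from typing import Callable, List, Tuple

# Assumption: speaker labels look like "S1:", "S2:", as in the usage example.
SPEAKER_PATTERN = re.compile(r"(S\d+):\s*")


def split_utterances(conversation: str) -> List[Tuple[str, str]]:
    """Split 'S1: a S2: b' into [('S1', 'a'), ('S2', 'b')]."""
    parts = SPEAKER_PATTERN.split(conversation)
    # With a capturing group, split() returns ['', 'S1', 'a ', 'S2', 'b'].
    speakers = parts[1::2]
    texts = [text.strip() for text in parts[2::2]]
    return list(zip(speakers, texts))


def paraphrase_conversation(conversation: str, paraphrase: Callable[[str], str]) -> str:
    """Paraphrase each utterance separately and keep the speaker labels."""
    return " ".join(
        f"{speaker}: {paraphrase(text)}"
        for speaker, text in split_utterances(conversation)
    )


def augment(dataset: List[Tuple[str, str]], paraphrase: Callable[[str], str],
            factor: int = 3) -> List[Tuple[str, str]]:
    """Return the originals plus (factor - 1) paraphrased copies of each pair."""
    augmented = []
    for conversation, summary in dataset:
        augmented.append((conversation, summary))  # keep the original pair
        for _ in range(factor - 1):
            augmented.append((paraphrase_conversation(conversation, paraphrase), summary))
    return augmented


# Stand-in paraphraser for demonstration only; a real one would call a model.
demo = augment([("S1: Halo S2: Baik", "Sapaan singkat.")], paraphrase=str.upper)
print(len(demo), demo[1][0])
```

Because the paraphraser is passed in as a function, a model-backed implementation can replace the stand-in without changing the rest of the sketch.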

Evaluation Results

Per-Fold Results

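| Fold | ROUGE-1 | ROUGE-2 | ROUGE-L | BERTScore F1 | Eval Loss |
| --- | --- | --- | --- | --- | --- |
| 0 | 28.63 | 9.66 | 24.54 | 0.7310 | 4.6597 |
| 1 | 28.39 | 9.54 | 24.08 | 0.7314 | 4.6101 |
| 2 | 24.83 | 8.12 | 22.21 | 0.7201 | 4.6530 |
| 3 (best) | 28.80 | 10.38 | 26.07 | 0.7339 | 4.5295 |
| 4 | 27.49 | 8.85 | 24.57 | 0.7295 | 4.5674 |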

(best) = best-performing fold (used as the main branch)

Aggregated (5-Fold Cross Validation)

| Metric | Mean | Std |
| --- | --- | --- |
| ROUGE-1 | 27.63 | 1.47 |
| ROUGE-2 | 9.31 | 0.77 |
| ROUGE-L | 24.29 | 1.24 |
| BERTScore F1 | 0.7292 | 0.0048 |
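
The aggregated figures can be recomputed from the per-fold results. The short check below uses Python's `statistics` module. The reported Std values are consistent with the population standard deviation (`pstdev`, dividing by 5) rather than the sample standard deviation, although the card does not state this explicitly.

```python
from statistics import mean, pstdev

# Per-fold scores copied from the Per-Fold Results table (folds 0 to 4).
per_fold = {
    "ROUGE-1": [28.63, 28.39, 24.83, 28.80, 27.49],
    "ROUGE-2": [9.66, 9.54, 8.12, 10.38, 8.85],
    "ROUGE-L": [24.54, 24.08, 22.21, 26.07, 24.57],
    "BERTScore F1": [0.7310, 0.7314, 0.7201, 0.7339, 0.7295],
}


def aggregate(scores, digits):
    """Return (mean, population std) rounded to the given number of digits."""
    return round(mean(scores), digits), round(pstdev(scores), digits)


summary = {
    name: aggregate(scores, 4 if name == "BERTScore F1" else 2)
    for name, scores in per_fold.items()
}

for name, (avg, std) in summary.items():
    print(f"{name}: mean={avg}, std={std}")
```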

Comparison with Baseline

| Model | ROUGE-1 | ROUGE-2 | ROUGE-L | BERTScore F1 |
| --- | --- | --- | --- | --- |
| Baseline (pretrained) | 15.92 | 4.40 | 13.12 | 0.6626 |
| T5 Indonesian Summarization (Augmented 3x) | 27.63 | 9.31 | 24.29 | 0.7292 |
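
To read the comparison as gains over the baseline, the differences can be derived directly from the two rows. This is simple arithmetic on the reported numbers, not an additional evaluation.

```python
# Scores copied from the comparison table above.
baseline = {"ROUGE-1": 15.92, "ROUGE-2": 4.40, "ROUGE-L": 13.12, "BERTScore F1": 0.6626}
augmented = {"ROUGE-1": 27.63, "ROUGE-2": 9.31, "ROUGE-L": 24.29, "BERTScore F1": 0.7292}

# Absolute gain and gain relative to the baseline score, per metric.
absolute_gain = {metric: augmented[metric] - baseline[metric] for metric in baseline}
relative_gain = {metric: 100 * absolute_gain[metric] / baseline[metric] for metric in baseline}

for metric in baseline:
    print(f"{metric}: +{absolute_gain[metric]:.4f} ({relative_gain[metric]:.1f}% over baseline)")
```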

Intended Use

This model is designed to summarize conversations in Indonesian.

Limitations

  • Input must begin with the prefix summarize: for best results.
  • Maximum input length is 512 tokens.

Citation

@thesis{edwin2026summarization,
  title={Pengaruh Augmentasi Data terhadap Kualitas Ringkasan Percakapan Bahasa Indonesia menggunakan T5},
  author={Aloisius Edwin},
  year={2026},
  school={Institut Teknologi Sumatera}
}