
castle-gec

CASTLE: Context-Aware Semantic Transformer with Knowledge Graph Enhancement for Indonesian Grammatical Error Correction

Paper: Expert Systems with Applications, Vol. 299 (2026), 130233
Code: github.com/syauqie/castle-gec
Dataset: syauqie/IGED

Model Description

CASTLE is a seq2seq transformer for correcting grammatical errors in Indonesian text. It was trained on the IGED dataset, which covers three error categories: morphology, syntax, and semantics.

Architecture: 4-layer encoder-decoder transformer (d=256, 8 heads, FFN=2048) with WordPiece tokenization (vocab=10K). Key feature: linked attention mechanism that conditions each layer's attention on the previous layer's attention scores.
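
The card does not spell out how the link between layers is formed. As a minimal sketch, assuming the link is additive (each layer adds the previous layer's pre-softmax scores to its own), a single-head version in plain Python might look like the following. The function names and the additive form are illustrative assumptions, not the repository's implementation.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_scores(queries, keys):
    """Scaled dot-product scores: one row per query, one column per key."""
    scale = math.sqrt(len(queries[0]))
    return [
        [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        for q in queries
    ]


def linked_attention(queries, keys, values, prev_scores=None):
    """Single-head attention that optionally reuses the previous layer's scores.

    ASSUMPTION: the link is a plain addition of pre-softmax scores.
    Returns (output, scores) so the next layer can receive `scores`.
    """
    scores = attention_scores(queries, keys)
    if prev_scores is not None:
        scores = [
            [s + p for s, p in zip(row, prev_row)]
            for row, prev_row in zip(scores, prev_scores)
        ]
    weights = [softmax(row) for row in scores]
    dim = len(values[0])
    output = [
        [sum(w * v[j] for w, v in zip(w_row, values)) for j in range(dim)]
        for w_row in weights
    ]
    return output, scores


# Toy example: two positions, two dimensions, the same inputs fed to two layers.
x = [[1.0, 0.0], [0.0, 1.0]]
out1, scores1 = linked_attention(x, x, x)                       # first layer, no link
out2, scores2 = linked_attention(x, x, x, prev_scores=scores1)  # second layer, linked
```

In this toy case the second layer's attention is sharper than the first, because the inherited scores reinforce the pattern the first layer already found.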

This checkpoint is a PyTorch reconstruction of the original Fairseq-based model described in the paper. The GitHub README documents how the reconstruction differs from the paper.

Evaluation Results

On the IGED test set (134,025 sentence pairs):

Category	F1	BLEU
Overall	0.9444	88.95
Morphology	0.9682	—
Syntax	0.9343	—
Semantics	0.9413	—

BLEU is computed at the WordPiece token level with length-constrained decoding (max_len = src_len + 3). The paper reports F1=0.9629 and BLEU=92.72; the repository README explains the gap in detail.
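
To illustrate the length constraint, here is a minimal sketch of a decoding loop that stops after `src_len + 3` tokens. Greedy decoding, the `next_token` callback (a stand-in for the model), and the choice not to count the end-of-sequence token toward the limit are all assumptions made for illustration; the card only states the `max_len` formula.

```python
def constrained_greedy_decode(src_tokens, next_token, eos_id, slack=3):
    """Decode token by token, emitting at most len(src_tokens) + slack tokens.

    `next_token(src_tokens, prefix)` is a hypothetical callback that returns
    the id of the next output token given the source and the output so far.
    """
    max_len = len(src_tokens) + slack
    output = []
    while len(output) < max_len:
        token = next_token(src_tokens, output)
        if token == eos_id:
            break
        output.append(token)
    return output


EOS = 0


def copy_model(src, prefix):
    """Toy model: copies the source, then emits EOS."""
    return src[len(prefix)] if len(prefix) < len(src) else EOS


def never_stops(src, prefix):
    """Toy model that never emits EOS, so only the length cap ends decoding."""
    return 9


copied = constrained_greedy_decode([5, 6, 7], copy_model, eos_id=EOS)
capped = constrained_greedy_decode([5, 6, 7], never_stops, eos_id=EOS)
```

The cap matters for correction tasks because the output is expected to stay close in length to the input; a runaway hypothesis is cut off three tokens past the source length.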

How to Use

Load and run inference

import sys

from huggingface_hub import snapshot_download

# Download model files
local_dir = snapshot_download(repo_id="syauqie/castle-gec")

# Make the bundled src/ package importable, then load the corrector
sys.path.insert(0, local_dir)
from src.inference import CASTLECorrector

corrector = CASTLECorrector.from_checkpoint(
    checkpoint_path=f"{local_dir}/checkpoint_best.pt",
    config_path=f"{local_dir}/configs/castle_base.yaml",
    tokenizer_dir=f"{local_dir}/data/tokenizer",
)

# Correct a sentence
result = corrector.correct("Saya sudah pergi ke sana kemarin hari.")
print(result)
# → "Saya sudah pergi ke sana kemarin."

Batch correction

sentences = [
    "Para mahasiswa-mahasiswa itu berdiskusi dengan aktif.",
    "Dia mempermasalahkan tentang hal tersebut.",
    "Mobil itu sangat cepat sekali.",
]
results = corrector.correct_batch(sentences)
for src, tgt in zip(sentences, results):
    print(f"  Error: {src}")
    print(f"  Fixed: {tgt}")
    print()

Intended Use

This model is intended for:

Automated correction of Indonesian grammatical errors
Research on Indonesian NLP and GEC
Educational tools for Indonesian language learners

The model performs best on formal written Indonesian. It may not handle very informal or colloquial text well.

Repository Files

File	Description
`checkpoint_best.pt`	Model weights (PyTorch)
`configs/castle_base.yaml`	Model configuration
`data/tokenizer/`	WordPiece tokenizer (vocab=10K)
`src/`	Full source code
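
Before loading the model, it can help to confirm that the download contains everything listed above. The helper below is a small sketch; `missing_repo_files` is a hypothetical name, and the path list is copied from the table rather than read from the repository.

```python
from pathlib import Path

# Paths taken from the Repository Files table, relative to the download root.
EXPECTED_PATHS = (
    "checkpoint_best.pt",
    "configs/castle_base.yaml",
    "data/tokenizer",
    "src",
)


def missing_repo_files(local_dir, expected=EXPECTED_PATHS):
    """Return the expected relative paths that do not exist under local_dir."""
    root = Path(local_dir)
    return [rel for rel in expected if not (root / rel).exists()]
```

With `local_dir` set to the directory returned by `snapshot_download`, an empty list from `missing_repo_files(local_dir)` means all four entries are present.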

Training Details

Dataset: IGED (80/10/10 train/val/test split)
Optimizer: Adam (β₁=0.9, β₂=0.98, ε=1e-8)
Learning rate: 5e-4 with inverse sqrt schedule, 4000 warmup steps
Batch size: ~40 samples, update_freq=2 (effective batch = 80)
Training: 10 epochs (~150K gradient steps)
Loss: Label-smoothed cross-entropy (ε=0.1)
Hardware: NVIDIA RTX 3090 24GB
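
The learning-rate schedule above can be sketched as a small function. This follows the usual inverse square root recipe (linear warmup to the peak rate, then decay proportional to 1/sqrt(step)); the initial warmup rate of 0 is an assumption, since the card does not state it.

```python
import math


def inverse_sqrt_lr(step, peak_lr=5e-4, warmup_steps=4000, init_lr=0.0):
    """Learning rate at a given update step.

    Linear warmup from init_lr to peak_lr over warmup_steps, then decay
    as peak_lr * sqrt(warmup_steps / step).
    ASSUMPTION: init_lr defaults to 0.0; the model card does not give it.
    """
    if step < warmup_steps:
        return init_lr + (peak_lr - init_lr) * step / warmup_steps
    return peak_lr * math.sqrt(warmup_steps / step)


halfway_through_warmup = inverse_sqrt_lr(2000)
at_peak = inverse_sqrt_lr(4000)
near_end_of_training = inverse_sqrt_lr(150_000)
```

Under these assumptions the rate peaks at 5e-4 at step 4000 and has decayed to roughly 8.2e-5 by step 150,000, the approximate end of training.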

Citation

@article{castle2026,
  title     = {{CASTLE}: Context-Aware Semantic Transformer with Knowledge Graph Enhancement for Low-Resource Grammar Correction},
  author    = {Syauqie Muhammad Marier and Xiangjie Kong and Linan Zhu and Xiangfan Chen and Abdulloh Badruzzaman and I. Nyoman Apraz Ramatryana},
  journal   = {Expert Systems with Applications},
  volume    = {299},
  number    = {Part D},
  pages     = {130233},
  year      = {2026},
  doi       = {10.1016/j.eswa.2025.130233},
  url       = {https://www.sciencedirect.com/science/article/pii/S0957417425038485},
  issn      = {0957-4174},
  keywords  = {Grammatical error correction, Low-resource language, Knowledge graph, Semantic error correction}
}
