
CSC T5 - T5 for Traditional and Simplified Chinese Spelling Correction

This model was obtained by instruction-tuning ClueAI/PromptCLUE-base-v1-5 on the shibing624/CSC spelling error corpus.

Model Details

Model Description

  • Language(s) (NLP): Chinese
  • Base model: ClueAI/PromptCLUE-base-v1-5
  • Pretraining dataset: 1M UDN news corpus
  • Fine-tuning dataset: shibing624/CSC spelling error corpus (CN + TC)

Model Sources

Evaluation

  • Chinese spelling error correction task (SIGHAN2015):
    • FPR: False Positive Rate

| Model | Base Model | Accuracy | Recall | Precision | F1 | FPR |
|---|---|---|---|---|---|---|
| GECToR | hfl/chinese-macbert-base | 71.7 | 71.6 | 71.8 | 71.7 | 28.2 |
| GECToR_large | hfl/chinese-macbert-large | 73.7 | 76.5 | 72.5 | 74.4 | 29.1 |
| T5 w/ pretrain | ClueAI/PromptCLUE-base-v1-5 | 79.2 | 69.2 | 85.8 | 76.6 | 11.1 |
| T5 w/o pretrain | ClueAI/PromptCLUE-base-v1-5 | 75.1 | 63.1 | 82.2 | 71.4 | 13.3 |
| PTCSpell | | N/A | 79.0 | 89.4 | 83.8 | N/A |
| MDCSpell | | N/A | 77.2 | 81.5 | 79.3 | N/A |
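The card does not say how the reported metrics were computed. As a rough illustration, the sketch below uses one common sentence-level convention for spelling correction. The function name `sentence_level_metrics` and the exact definitions are assumptions, not the author's evaluation script: a sentence counts as positive when the gold sentence differs from the source, and a prediction counts as correct only when it matches the gold sentence exactly.

```python
def sentence_level_metrics(sources, golds, predictions):
    """Sentence-level metrics under an ASSUMED convention (not from the model card).

    TP: source has errors and the prediction equals the gold sentence.
    FN: source has errors and the prediction differs from the gold sentence.
    TN: source is already correct and the prediction leaves it correct.
    FP: source is already correct but the prediction differs from gold.
    """
    tp = fp = tn = fn = 0
    for src, gold, pred in zip(sources, golds, predictions):
        if src == gold:  # sentence contains no spelling error
            if pred == gold:
                tn += 1
            else:
                fp += 1
        else:  # sentence contains at least one spelling error
            if pred == gold:
                tp += 1
            else:
                fn += 1

    def ratio(numerator, denominator):
        # Return 0.0 instead of raising when a denominator is zero.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "fpr": ratio(fp, fp + tn),  # False Positive Rate
    }
```

Under this convention a low FPR means the model rarely alters sentences that were already correct.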

Usage

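from transformers import AutoTokenizer, T5ForConditionalGeneration

# Load the tokenizer and the fine-tuned model from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("CodeTed/Chinese_Spelling_Correction_T5")
model = T5ForConditionalGeneration.from_pretrained("CodeTed/Chinese_Spelling_Correction_T5")

# Prefix the sentence with the instruction "糾正句子裡的錯字: " ("Correct the typos in the sentence: ").
input_text = '糾正句子裡的錯字: 為了降低少子化,政府可以堆動獎勵生育的政策。'
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

# Generate the corrected sentence and decode it back to text.
outputs = model.generate(input_ids, max_length=256)
edited_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(edited_text)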
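Because the model returns a whole corrected sentence, it can help to see which characters actually changed. The following is a minimal sketch using only the standard library. The helper names `build_prompt` and `list_corrections` are hypothetical, and the `corrected` string is an assumed output used for illustration, not a recorded model response.

```python
from difflib import SequenceMatcher

# Instruction prefix taken from the usage example above.
PROMPT_PREFIX = "糾正句子裡的錯字: "


def build_prompt(sentence):
    """Prepend the correction instruction to a raw sentence (hypothetical helper)."""
    return PROMPT_PREFIX + sentence


def list_corrections(original, corrected):
    """Return (position, original_span, corrected_span) for each differing span."""
    matcher = SequenceMatcher(None, original, corrected, autojunk=False)
    return [
        (i1, original[i1:i2], corrected[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


original = "為了降低少子化,政府可以堆動獎勵生育的政策。"
# ASSUMED model output, shown only to illustrate the diff.
corrected = "為了降低少子化,政府可以推動獎勵生育的政策。"

for position, wrong, fixed in list_corrections(original, corrected):
    print(f"position {position}: {wrong} -> {fixed}")
```

In practice, `corrected` would be the `edited_text` produced by `model.generate`, and `build_prompt(original)` would supply the `input_text`.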

Related Project

CodeTed/CGEDit - Chinese Grammatical Error Diagnosis by Task-Specific Instruction Tuning
