Chuboy/macbert4csc-traditional-chinese

Traditional Chinese spelling correction model

Trained with the code from the shibing624/pycorrector project
Built on hfl/chinese-macbert-base as the base model

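Training data

270,000 SIGHAN examples from shibing624/CSC
270,000 NLG examples from Weaxs/csc
Converted from Simplified to Traditional Chinese with OpenCC, using the s2twp configuration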
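
The conversion step can be sketched as follows. This is a minimal illustration, not the author's actual preprocessing script: the function names `to_traditional_tw` and `convert_pair` are hypothetical, and the code assumes a Python OpenCC binding is installed (for example the `opencc` or `opencc-python-reimplemented` package). The only detail taken from the card is the `s2twp` configuration, which converts Simplified Chinese to Traditional Chinese (Taiwan standard) and also rewrites phrases to Taiwanese usage.

```python
from functools import lru_cache


@lru_cache(maxsize=1)
def _converter():
    """Build the OpenCC converter once and reuse it."""
    # Imported lazily so this module loads even where OpenCC is not installed.
    # Assumption: a Python OpenCC binding exposing `OpenCC(config)` is available.
    from opencc import OpenCC

    return OpenCC("s2twp")  # configuration named in the model card


def to_traditional_tw(text: str) -> str:
    """Convert Simplified Chinese text to Traditional Chinese (Taiwan usage)."""
    return _converter().convert(text)


def convert_pair(source: str, target: str) -> tuple[str, str]:
    """Convert both sides of a (noisy sentence, corrected sentence) pair.

    Hypothetical helper: the card does not say how pairs were processed.
    """
    return to_traditional_tw(source), to_traditional_tw(target)
```

Because `s2twp` rewrites whole phrases as well as single characters, converting the noisy and the corrected sentence independently could in principle leave the two sides with different lengths. Whether and how such pairs were handled is not stated in the card.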

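Training techniques

Input sentence lengths should follow a normal distribution, with the number of erroneous characters per sentence kept between 1 and 3
FocalLoss is introduced, treating typo detection like an object detection problem
The output loss weights EntropyLoss and FocalLoss at 7:3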
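
The techniques above can be sketched in plain Python. This is an illustration under stated assumptions, not the training code: all function names are hypothetical, "EntropyLoss" is assumed to mean the cross-entropy loss on the corrected characters, and the focal loss defaults `gamma=2.0` and `alpha=0.25` are common values from the focal loss literature rather than values given in the card. Only the 1 to 3 error range and the 7:3 weighting come from the card. Focal loss down-weights easy examples, which suits detection here on the assumption that most characters in a sentence are correct and typos are the rare positive class.

```python
import math


def count_errors(source: str, target: str) -> int:
    """Count positions where the noisy source differs from the corrected target."""
    if len(source) != len(target):
        raise ValueError("source and target must have the same length")
    return sum(s != t for s, t in zip(source, target))


def keep_pair(source: str, target: str, min_errors: int = 1, max_errors: int = 3) -> bool:
    """Keep only equal-length pairs with 1 to 3 erroneous characters (range from the card)."""
    if len(source) != len(target):
        return False
    return min_errors <= count_errors(source, target) <= max_errors


def binary_focal_loss(p: float, label: int, gamma: float = 2.0, alpha: float = 0.25) -> float:
    """Focal loss for one character, where p is the predicted probability of a typo.

    gamma and alpha are assumed defaults, not values from the card.
    """
    p_t = p if label == 1 else 1.0 - p              # probability given to the true class
    alpha_t = alpha if label == 1 else 1.0 - alpha  # class-balancing weight
    p_t = max(p_t, 1e-12)                           # guard against log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def combined_loss(entropy_loss: float, focal_loss: float,
                  w_entropy: float = 0.7, w_focal: float = 0.3) -> float:
    """Weight the two losses 7:3, as stated in the card."""
    return w_entropy * entropy_loss + w_focal * focal_loss
```

For example, `keep_pair("今天天汽很好", "今天天氣很好")` keeps the pair because exactly one character differs, while an identical pair is rejected because it contains no error to learn from.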

SIGHAN validation scores

Model	Accuracy	Precision	Recall	F1 score
chinese-macbert-base	0.88	0.09	0.31	0.14
macbert4csc-base-chinese (output converted from Simplified to Traditional)	0.99	0.79	0.95	0.86
macbert4csc-traditional-chinese	1.00	0.90	0.99	0.94

NLG validation scores

Model	Accuracy	Precision	Recall	F1 score
chinese-macbert-base	0.85	0.08	0.31	0.13
macbert4csc-base-chinese (output converted from Simplified to Traditional)	0.98	0.70	0.95	0.81
macbert4csc-traditional-chinese	0.99	0.80	0.99	0.89
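
The F1 column in both tables can be cross-checked against the precision and recall columns, since F1 is their harmonic mean: F1 = 2PR / (P + R). The sketch below recomputes it from the reported values. The names `f1_score` and `REPORTED` are illustrative, and because the tables round to two decimals, a small gap between reported and recomputed values is expected.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (model, precision, recall, reported F1), copied from the two tables above.
REPORTED = {
    "SIGHAN": [
        ("chinese-macbert-base", 0.09, 0.31, 0.14),
        ("macbert4csc-base-chinese (converted)", 0.79, 0.95, 0.86),
        ("macbert4csc-traditional-chinese", 0.90, 0.99, 0.94),
    ],
    "NLG": [
        ("chinese-macbert-base", 0.08, 0.31, 0.13),
        ("macbert4csc-base-chinese (converted)", 0.70, 0.95, 0.81),
        ("macbert4csc-traditional-chinese", 0.80, 0.99, 0.89),
    ],
}

for dataset, rows in REPORTED.items():
    for model, precision, recall, reported_f1 in rows:
        recomputed = f1_score(precision, recall)
        print(f"{dataset:6} {model:40} reported={reported_f1:.2f} recomputed={recomputed:.4f}")
```

Every recomputed value lands within 0.01 of the reported one, which is consistent with the inputs having been rounded to two decimals.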

Sincere thanks to the original author, XuMing, for open-sourcing this research.

Safetensors: model size 102M params, tensor type F32