Pseudo-Native-BART-CGEC

This is a Chinese grammatical error correction (CGEC) model based on Chinese BART-large. It was trained on about 100M pseudo native-speaker CGEC training data generated by heuristic rules, together with human-annotated training data for the media domain. More details can be found on our GitHub and in the paper.

Usage

pip install transformers torch

from transformers import BertTokenizer, BartForConditionalGeneration

model_name = "HillZhang/pseudo_native_bart_CGEC_media"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

# The example sentences deliberately contain errors for the model to correct.
sentences = ["北京是中国的都。", "他说:”我最爱的运动是打蓝球“", "我每天大约喝5次水左右。", "今天,我非常开开心。"]
encoded_input = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
# BertTokenizer emits token_type_ids, which BART does not use, so drop them.
if "token_type_ids" in encoded_input:
    del encoded_input["token_type_ids"]
output = model.generate(**encoded_input)
print(tokenizer.batch_decode(output, skip_special_tokens=True))
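Because BertTokenizer joins decoded tokens with spaces, the strings printed by the snippet above typically come back with a space between every character. The following is a minimal post-processing sketch, not part of the original model card: strip_token_spaces and show_edits are hypothetical helper names, and the decoded string is an illustrative placeholder, not an actual model prediction. Note that the helper removes all whitespace, so it would also drop legitimate spaces in mixed Chinese/English text.

```python
import difflib


def strip_token_spaces(decoded):
    """Remove the spaces a character-level tokenizer puts between decoded tokens."""
    return ["".join(text.split()) for text in decoded]


def show_edits(source, corrected):
    """Return (operation, source_span, corrected_span) for each changed region."""
    matcher = difflib.SequenceMatcher(None, source, corrected, autojunk=False)
    return [
        (tag, source[i1:i2], corrected[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Hypothetical decoded output, used only to illustrate the helpers;
# it is not an actual prediction of the model.
decoded = ["北 京 是 中 国 的 首 都 。"]
cleaned = strip_token_spaces(decoded)
print(cleaned)
print(show_edits("北京是中国的都。", cleaned[0]))
```

Comparing each input sentence with its cleaned output in this way makes it easy to see which characters were inserted, deleted, or replaced.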

Citation

@inproceedings{zhang-etal-2023-nasgec,
    title = "{Na}{SGEC}: a Multi-Domain Chinese Grammatical Error Correction Dataset from Native Speaker Texts",
    author = "Zhang, Yue  and
      Zhang, Bo  and
      Jiang, Haochen  and
      Li, Zhenghua  and
      Li, Chen  and
      Huang, Fei  and
      Zhang, Min",
    booktitle = "Findings of ACL",
    year = "2023"
    }