---
tags:
- text2text-generation
- Chinese
- seq2seq
- grammar
language: zh
license: apache-2.0
---

# Pseudo-Native-BART-CGEC

This is a cutting-edge Chinese grammatical error correction (CGEC) model based on [Chinese BART-large](https://huggingface.co/fnlp/bart-large-chinese). It is trained on about 100M pseudo native-speaker CGEC training samples generated by heuristic rules, plus human-annotated training data for the thesis domain. More details can be found in our [GitHub repository](https://github.com/HillZhang1999/NaSGEC) and our [paper](https://arxiv.org/pdf/2305.16023.pdf).

## Usage

First install the dependency:

```
pip install transformers
```

Then load the tokenizer and model and generate corrections (a sketch for inspecting the predicted edits follows the Citation section):

```
from transformers import BertTokenizer, BartForConditionalGeneration

tokenizer = BertTokenizer.from_pretrained("HillZhang/pseudo_native_bart_CGEC_thesis")
model = BartForConditionalGeneration.from_pretrained("HillZhang/pseudo_native_bart_CGEC_thesis")
encoded_input = tokenizer(
    ["北京是中国的都。", "他说:”我最爱的运动是打蓝球“", "我每天大约喝5次水左右。", "今天,我非常开开心。"],
    return_tensors="pt",
    padding=True,
    truncation=True,
)
# BART does not take token_type_ids, but BertTokenizer produces them; drop them.
if "token_type_ids" in encoded_input:
    del encoded_input["token_type_ids"]
output = model.generate(**encoded_input)
print(tokenizer.batch_decode(output, skip_special_tokens=True))
```

## Citation

```
@inproceedings{zhang-etal-2023-nasgec,
    title = "{Na}{SGEC}: a Multi-Domain Chinese Grammatical Error Correction Dataset from Native Speaker Texts",
    author = "Zhang, Yue and Zhang, Bo and Jiang, Haochen and Li, Zhenghua and Li, Chen and Huang, Fei and Zhang, Min",
    booktitle = "Findings of ACL",
    year = "2023"
}
```
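
## Inspecting the edits

The snippet below is a minimal sketch (not part of the official NaSGEC repository) showing one way to surface the character-level edits the model proposes, using Python's standard `difflib`. The `show_edits` helper and the hard-coded corrected string are our own illustrative additions; only the tokenizer and model calls in the Usage section come from the model card.

```
import difflib

def show_edits(source: str, corrected: str) -> None:
    # Align the two strings and print every non-matching span.
    matcher = difflib.SequenceMatcher(None, source, corrected)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            print(f"{op}: '{source[i1:i2]}' -> '{corrected[j1:j2]}'")

# Illustrative pair; in practice, pass each input sentence together with the
# corresponding string from tokenizer.batch_decode (with spaces stripped,
# since BertTokenizer inserts spaces between Chinese characters when decoding).
show_edits("今天,我非常开开心。", "今天,我非常开心。")
# Prints something like: delete: '开' -> ''
```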