shibing624/macbert4csc-base-chinese

@@ -8,13 +8,14 @@ tags:
 license: "apache-2.0"
 ---
-# Please use 'Bert' related functions to load this model!
-`macbert4csc-base-chinese` evaluation on SIGHAN15:
 Sentence Level: acc:0.825492, precision:0.993085, recall:0.825376, f1:0.901497
 ## Usage
@@ -31,21 +32,58 @@ print(i)
 Of course, you can also call the model through the official huggingface/transformers library:
 ```python
 import torch
 from transformers import BertTokenizer, BertForMaskedLM
 tokenizer = BertTokenizer.from_pretrained("shibing624/macbert4csc-base-chinese")
 model = BertForMaskedLM.from_pretrained("shibing624/macbert4csc-base-chinese")
-texts = ["今天心情很好", "你找到你最喜欢的工作，我也很高心。"]
 outputs = model(**tokenizer(texts, padding=True, return_tensors='pt'))
-corrected_texts = []
 for ids, text in zip(outputs.logits, texts):
     _text = tokenizer.decode(torch.argmax(ids, dim=-1), skip_special_tokens=True).replace(' ', '')
-    corrected_texts.append(_text[:len(text)])
-print(corrected_texts)
 ```
 ### Training dataset
@@ -116,7 +154,7 @@ For more technical details, please check our paper: [Revisiting Pre-trained Mode
 @software{pycorrector,
   author = {Xu Ming},
   title = {pycorrector: Text Error Correction Tool},
-  year = {2020},
   url = {https://github.com/shibing624/pycorrector},
 }
 ```

 license: "apache-2.0"
 ---
+# MacBERT for Chinese Spelling Correction (macbert4csc) Model
+Chinese spelling correction model
+`macbert4csc-base-chinese` evaluation on SIGHAN2015:
 Sentence Level: acc:0.825492, precision:0.993085, recall:0.825376, f1:0.901497
+The model achieves state-of-the-art (SOTA) results on the SIGHAN2015 dataset.
 ## Usage
 Of course, you can also call the model through the official huggingface/transformers library:
+*Please use the BERT-related classes (e.g. `BertTokenizer`, `BertForMaskedLM`) to load this model!*
 ```python
+import operator
 import torch
 from transformers import BertTokenizer, BertForMaskedLM
 tokenizer = BertTokenizer.from_pretrained("shibing624/macbert4csc-base-chinese")
 model = BertForMaskedLM.from_pretrained("shibing624/macbert4csc-base-chinese")
+texts = ["今天新情很好", "你找到你最喜欢的工作，我也很高心。"]
 outputs = model(**tokenizer(texts, padding=True, return_tensors='pt'))
+def get_errors(corrected_text, origin_text):
+    details = []
+    for i, ori_char in enumerate(origin_text):
+        if ori_char == ' ':
+            # re-insert the space stripped from the decoded text so indices stay aligned
+            corrected_text = corrected_text[:i] + ' ' + corrected_text[i:]
+            continue
+        if i >= len(corrected_text):
+            continue
+        if ori_char != corrected_text[i]:
+            details.append((ori_char, corrected_text[i], i, i + 1))
+    details = sorted(details, key=operator.itemgetter(2))
+    return corrected_text, details
+result = []
 for ids, text in zip(outputs.logits, texts):
     _text = tokenizer.decode(torch.argmax(ids, dim=-1), skip_special_tokens=True).replace(' ', '')
+    corrected_text = _text[:len(text)]
+    corrected_text, details = get_errors(corrected_text, text)
+    print(text, ' => ', corrected_text, details)
+    result.append((corrected_text, details))
+print(result)
+```
+Output:
+```shell
+今天新情很好  =>  今天心情很好 [('新', '心', 2, 3)]
+你找到你最喜欢的工作，我也很高心。  =>  你找到你最喜欢的工作，我也很高兴。 [('心', '兴', 15, 16)]
+```
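Each entry in `details` above appears to follow the shape `(wrong_char, corrected_char, start, end)`, with `end` exclusive. As a minimal sketch under that assumption, the snippet below rebuilds the corrected sentence from the original text and its detail tuples, without loading the model. `apply_details` is a hypothetical helper written for illustration; it is not part of pycorrector or transformers.

```python
def apply_details(origin_text, details):
    """Rebuild the corrected sentence from the original text and its edit tuples.

    Each tuple is assumed to be (wrong_char, corrected_char, start, end),
    with `end` exclusive, as in the printed output of the usage example.
    """
    chars = list(origin_text)
    for wrong, right, start, end in details:
        # Guard against tuples that do not line up with the original text.
        found = origin_text[start:end]
        if found != wrong:
            raise ValueError(f"expected {wrong!r} at [{start}:{end}], found {found!r}")
        # Single-character substitution, so later indices do not shift.
        chars[start:end] = right
    return "".join(chars)


# Inputs copied from the sample output of the usage example.
examples = [
    ("今天新情很好", [("新", "心", 2, 3)]),
    ("你找到你最喜欢的工作，我也很高心。", [("心", "兴", 15, 16)]),
]
rebuilt = [apply_details(text, details) for text, details in examples]
for (text, _), fixed in zip(examples, rebuilt):
    print(text, "=>", fixed)
```

Keeping the original text plus a list of position-tagged substitutions, rather than only the corrected string, lets a caller highlight or review each change individually.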
+Model files:
+```
+macbert4csc-base-chinese
+    ├── config.json
+    ├── added_tokens.json
+    ├── pytorch_model.bin
+    ├── special_tokens_map.json
+    ├── tokenizer_config.json
+    └── vocab.txt
 ```
 ### Training dataset
 @software{pycorrector,
   author = {Xu Ming},
   title = {pycorrector: Text Error Correction Tool},
+  year = {2021},
   url = {https://github.com/shibing624/pycorrector},
 }
 ```