shibing624 committed on
Commit
4a65218
1 Parent(s): a880cc1

Update README.md

  1. README.md +46 -8
README.md CHANGED
@@ -8,13 +8,14 @@ tags:
8
  license: "apache-2.0"
9
  ---
10
 
11
- # Please use 'Bert' related functions to load this model!
 
12
 
13
-
14
- `macbert4csc-base-chinese` evaluate sighan15:
15
 
16
  Sentence Level: acc:0.825492, precision:0.993085, recall:0.825376, f1:0.901497
17
 
 
18
 
19
  ## Usage
20
 
@@ -31,21 +32,58 @@ print(i)
31
 
32
  Of course, you can also call the model through the official huggingface/transformers library:
33
 
 
 
34
  ```python
 
35
  import torch
36
  from transformers import BertTokenizer, BertForMaskedLM
37
 
38
  tokenizer = BertTokenizer.from_pretrained("shibing624/macbert4csc-base-chinese")
39
  model = BertForMaskedLM.from_pretrained("shibing624/macbert4csc-base-chinese")
40
 
41
- texts = ["今天心情很好", "你找到你最喜欢的工作,我也很高心。"]
42
  outputs = model(**tokenizer(texts, padding=True, return_tensors='pt'))
43
- corrected_texts = []
44
  for ids, text in zip(outputs.logits, texts):
45
      _text = tokenizer.decode(torch.argmax(ids, dim=-1), skip_special_tokens=True).replace(' ', '')
46
-     corrected_texts.append(_text[:len(text)])
47
 
48
- print(corrected_texts)
49
  ```
50
 
51
  ### Training dataset
@@ -116,7 +154,7 @@ For more technical details, please check our paper: [Revisiting Pre-trained Mode
116
  @software{pycorrector,
117
  author = {Xu Ming},
118
  title = {pycorrector: Text Error Correction Tool},
119
- year = {2020},
120
  url = {https://github.com/shibing624/pycorrector},
121
  }
122
  ```
 
8
  license: "apache-2.0"
9
  ---
10
 
11
+ # MacBERT for Chinese Spelling Correction (macbert4csc) Model
12
+ Chinese spelling correction model
13
 
14
+ `macbert4csc-base-chinese` evaluate sighan2015:
 
15
 
16
  Sentence Level: acc:0.825492, precision:0.993085, recall:0.825376, f1:0.901497
17
 
18
+ The model achieves SOTA results on the SIGHAN2015 dataset.
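As a sanity check on the sentence-level numbers above, the reported F1 is consistent with the usual harmonic mean of precision and recall. The sketch below assumes that standard definition (the source does not state how F1 was computed); `f1_score` is a hypothetical helper written for illustration and is not part of pycorrector.

```python
def f1_score(precision, recall):
    # Harmonic mean of precision and recall (assumed standard definition).
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Sentence-level precision and recall as reported above.
f1 = f1_score(0.993085, 0.825376)
print(round(f1, 6))
```

Under this assumption the result agrees with the reported f1 of 0.901497 to six decimal places.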
19
 
20
  ## Usage
21
 
 
32
 
33
  Of course, you can also call the model through the official huggingface/transformers library:
34
 
35
+ *Please use 'Bert' related functions to load this model!*
36
+
37
  ```python
38
+ import operator
39
  import torch
40
  from transformers import BertTokenizer, BertForMaskedLM
41
 
42
  tokenizer = BertTokenizer.from_pretrained("shibing624/macbert4csc-base-chinese")
43
  model = BertForMaskedLM.from_pretrained("shibing624/macbert4csc-base-chinese")
44
 
45
+ texts = ["今天新情很好", "你找到你最喜欢的工作,我也很高心。"]
46
  outputs = model(**tokenizer(texts, padding=True, return_tensors='pt'))
47
+
48
+ def get_errors(corrected_text, origin_text):
49
+     details = []
50
+     for i, ori_char in enumerate(origin_text):
51
+         if ori_char == ' ':
52
+             # add blank space
53
+             corrected_text = corrected_text[:i] + ' ' + corrected_text[i:]
54
+             continue
55
+         if i >= len(corrected_text):
56
+             continue
57
+         if ori_char != corrected_text[i]:
58
+             details.append((ori_char, corrected_text[i], i, i + 1))
59
+     details = sorted(details, key=operator.itemgetter(2))
60
+     return corrected_text, details
61
+
62
+ result = []
63
  for ids, text in zip(outputs.logits, texts):
64
      _text = tokenizer.decode(torch.argmax(ids, dim=-1), skip_special_tokens=True).replace(' ', '')
65
+     corrected_text = _text[:len(text)]
66
+     corrected_text, details = get_errors(corrected_text, text)
67
+     print(text, ' => ', corrected_text, details)
68
+     result.append((corrected_text, details))
69
+ print(result)
70
+ ```
71
+
72
+ output:
73
+ ```shell
74
+ 今天新情很好 => 今天心情很好 [('新', '心', 2, 3)]
75
+ 你找到你最喜欢的工作,我也很高心。 => 你找到你最喜欢的工作,我也很高兴。 [('心', '兴', 15, 16)]
76
+ ```
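The character-level comparison that `get_errors` performs can be tried without loading the model. The following is a minimal sketch, assuming the corrected text has the same length as the input (as in the two example sentences above) and skipping the space handling; `diff_chars` is a hypothetical name used only for illustration.

```python
def diff_chars(origin_text, corrected_text):
    """Return (original_char, corrected_char, start, end) for each differing position."""
    details = []
    # zip stops at the shorter string, so extra trailing characters are ignored.
    for i, (ori_char, cor_char) in enumerate(zip(origin_text, corrected_text)):
        if ori_char != cor_char:
            details.append((ori_char, cor_char, i, i + 1))
    return details


# The corrected string is hard-coded here in place of the model's output.
example = diff_chars("今天新情很好", "今天心情很好")
print(example)
```

Each tuple has the form (original character, corrected character, start index, end index), matching the `details` entries in the output shown above.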
77
 
78
+ Model files:
79
+ ```
80
+ macbert4csc-base-chinese
81
+ ├── config.json
82
+ ├── added_tokens.json
83
+ ├── pytorch_model.bin
84
+ ├── special_tokens_map.json
85
+ ├── tokenizer_config.json
86
+ └── vocab.txt
87
  ```
88
 
89
  ### Training dataset
 
154
  @software{pycorrector,
155
  author = {Xu Ming},
156
  title = {pycorrector: Text Error Correction Tool},
157
+ year = {2021},
158
  url = {https://github.com/shibing624/pycorrector},
159
  }
160
  ```