chenpengfei committed on
Commit 6cddc41 • 1 Parent(s): b44e3a3

Update README.md


Improve the code sample

Files changed (1)
  1. README.md +8 -74
README.md CHANGED
@@ -1,8 +1,7 @@
 ---
 license: apache-2.0
-language: en
+language: zh
 tags:
-- generated_from_trainer
 - Token Classification
 metrics:
 - precision
@@ -13,92 +12,27 @@ metrics:
 
 ## Model description
 
-This model is a fine-tuned version of macbert for the purpose of spell checking in medical apllication scenarious, and we fine-tuned on our own medical data which accumulated during past several years including 600,000 fine edited medical articals. When processing the dataset, we proposed to sample 30% of these articals then randomly select characters and replace these words with spelling errors which are either visally or phonologically resembled characters. Consequently, the model can achieve 90% accuracy on our test dataset.
+This model is a fine-tuned version of MacBERT for spell checking in medical application scenarios. We fine-tuned the Chinese MacBERT base model on a 300M dataset built from more than 60,000 authorized medical articles. We randomly corrupted 30% of the sentences in these articles by replacing characters with visually or phonologically similar ones. The fine-tuned model achieves 96% accuracy on our test dataset.
 
 ## Intended uses & limitations
 
 You can use this model directly with a pipeline for token classification:
 ```python
->>> from transformers import (AutoModelForTokenClassification, BertTokenizer)
+>>> from transformers import (AutoModelForTokenClassification, AutoTokenizer)
 >>> from transformers import pipeline
 
 >>> hub_model_id = "9pinus/macbert-base-chinese-medical-collation"
 
 >>> model = AutoModelForTokenClassification.from_pretrained(hub_model_id)
->>> tokenizer = BertTokenizer.from_pretrained(hub_model_id)
+>>> tokenizer = AutoTokenizer.from_pretrained(hub_model_id)
 >>> classifier = pipeline('ner', model=model, tokenizer=tokenizer)
->>> result = classifier("ε¦‚ζžœη—…ζƒ…θΎƒι‡οΌŒε―ι€‚ε½“ε£ζœη”²θ‚–ε”‘η‰‡γ€ηŽ―ι…―ηΊ’ιœ‰η΄ η‰‡γ€ε²ε“šηΎŽθΎ›η‰‡η­‰θ―η‰©θΏ›θ‘ŒζŠ—ζ„ŸζŸ“ι•‡η—›γ€‚εŒζ—Άεœ¨ζ—₯εΈΈη”Ÿζ΄»δΈ­θ¦ζ³¨ζ„η‰™ι½ΏζΈ…ζ΄ε«η”ŸοΌŒε…»ζˆεˆ·η‰™ηš„ε₯½δΉ ζƒ―。")
+>>> result = classifier("ε¦‚ζžœη—…ζƒ…θΎƒι‡οΌŒε―ι€‚ε½“ε£ζœη”²θ‚–ε”‘η‰‡γ€ηŽ―ι…―ηΊ’ιœ‰η΄ η‰‡η­‰θ―η‰©θΏ›θ‘ŒζŠ—ζ„ŸζŸ“ι•‡η—›γ€‚")
 
 >>> for item in result:
->>> print(item)
+>>>     if item['entity'] == 1:
+>>>         print(item)
 
-{'entity': 0, 'score': 0.9999982, 'index': 1, 'word': '如', 'start': None, 'end': None}
-{'entity': 0, 'score': 0.99999845, 'index': 2, 'word': '果', 'start': None, 'end': None}
-{'entity': 0, 'score': 0.99999845, 'index': 3, 'word': 'η—…', 'start': None, 'end': None}
-{'entity': 0, 'score': 0.99999857, 'index': 4, 'word': 'ζƒ…', 'start': None, 'end': None}
-{'entity': 0, 'score': 0.99999845, 'index': 5, 'word': 'θΎƒ', 'start': None, 'end': None}
-{'entity': 0, 'score': 0.99999845, 'index': 6, 'word': '重', 'start': None, 'end': None}
-{'entity': 0, 'score': 0.99999833, 'index': 7, 'word': ',', 'start': None, 'end': None}
-{'entity': 0, 'score': 0.99999845, 'index': 8, 'word': '可', 'start': None, 'end': None}
-{'entity': 0, 'score': 0.99999845, 'index': 9, 'word': '适', 'start': None, 'end': None}
-{'entity': 0, 'score': 0.99999845, 'index': 10, 'word': '当', 'start': None, 'end': None}
-{'entity': 0, 'score': 0.99999845, 'index': 11, 'word': '口', 'start': None, 'end': None}
-{'entity': 0, 'score': 0.99999845, 'index': 12, 'word': '服', 'start': None, 'end': None}
-{'entity': 0, 'score': 0.9999982, 'index': 13, 'word': 'η”²', 'start': None, 'end': None}
-{'entity': 1, 'score': 0.901703, 'index': 14, 'word': 'θ‚–', 'start': None, 'end': None}
-{'entity': 0, 'score': 0.99999833, 'index': 15, 'word': 'ε”‘', 'start': None, 'end': None}
-{'entity': 0, 'score': 0.99999845, 'index': 16, 'word': '片', 'start': None, 'end': None}
-{'entity': 0, 'score': 0.99999845, 'index': 17, 'word': '、', 'start': None, 'end': None}
-{'entity': 0, 'score': 0.99999845, 'index': 18, 'word': '环', 'start': None, 'end': None}
-{'entity': 0, 'score': 0.99999845, 'index': 19, 'word': 'ι…―', 'start': None, 'end': None}
-{'entity': 0, 'score': 0.99999845, 'index': 20, 'word': 'ηΊ’', 'start': None, 'end': None}
-{'entity': 0, 'score': 0.99999845, 'index': 21, 'word': 'ιœ‰', 'start': None, 'end': None}
-{'entity': 0, 'score': 0.99999845, 'index': 22, 'word': 'η΄ ', 'start': None, 'end': None}
-{'entity': 0, 'score': 0.99999845, 'index': 23, 'word': '片', 'start': None, 'end': None}
-{'entity': 0, 'score': 0.99999845, 'index': 24, 'word': '、', 'start': None, 'end': None}
-{'entity': 0, 'score': 0.99999845, 'index': 25, 'word': '吲', 'start': None, 'end': None}
-{'entity': 0, 'score': 0.99999833, 'index': 26, 'word': 'ε“š', 'start': None, 'end': None}
-{'entity': 0, 'score': 0.999998, 'index': 27, 'word': '美', 'start': None, 'end': None}
-{'entity': 0, 'score': 0.99999833, 'index': 28, 'word': 'θΎ›', 'start': None, 'end': None}
-{'entity': 0, 'score': 0.99999845, 'index': 29, 'word': '片', 'start': None, 'end': None}
-{'entity': 0, 'score': 0.99999833, 'index': 30, 'word': 'η­‰', 'start': None, 'end': None}
-{'entity': 0, 'score': 0.99999845, 'index': 31, 'word': '药', 'start': None, 'end': None}
-{'entity': 0, 'score': 0.99999845, 'index': 32, 'word': '物', 'start': None, 'end': None}
-{'entity': 0, 'score': 0.99999833, 'index': 33, 'word': 'θΏ›', 'start': None, 'end': None}
-{'entity': 0, 'score': 0.99999845, 'index': 34, 'word': '葌', 'start': None, 'end': None}
-{'entity': 0, 'score': 0.99999845, 'index': 35, 'word': 'ζŠ—', 'start': None, 'end': None}
-{'entity': 0, 'score': 0.99999845, 'index': 36, 'word': 'ζ„Ÿ', 'start': None, 'end': None}
-{'entity': 0, 'score': 0.99999857, 'index': 37, 'word': 'ζŸ“', 'start': None, 'end': None}
-{'entity': 0, 'score': 0.99999845, 'index': 38, 'word': '镇', 'start': None, 'end': None}
-{'entity': 0, 'score': 0.99999857, 'index': 39, 'word': 'η—›', 'start': None, 'end': None}
-{'entity': 0, 'score': 0.99999833, 'index': 40, 'word': '。', 'start': None, 'end': None}
-{'entity': 0, 'score': 0.99999845, 'index': 41, 'word': '同', 'start': None, 'end': None}
-{'entity': 0, 'score': 0.99999845, 'index': 42, 'word': 'ζ—Ά', 'start': None, 'end': None}
-{'entity': 0, 'score': 0.99999833, 'index': 43, 'word': '在', 'start': None, 'end': None}
-{'entity': 0, 'score': 0.99999845, 'index': 44, 'word': 'ζ—₯', 'start': None, 'end': None}
-{'entity': 0, 'score': 0.99999857, 'index': 45, 'word': 'εΈΈ', 'start': None, 'end': None}
-{'entity': 0, 'score': 0.99999845, 'index': 46, 'word': 'η”Ÿ', 'start': None, 'end': None}
-{'entity': 0, 'score': 0.99999845, 'index': 47, 'word': 'ζ΄»', 'start': None, 'end': None}
-{'entity': 0, 'score': 0.99999845, 'index': 48, 'word': 'δΈ­', 'start': None, 'end': None}
-{'entity': 0, 'score': 0.99999845, 'index': 49, 'word': '要', 'start': None, 'end': None}
-{'entity': 0, 'score': 0.99999845, 'index': 50, 'word': '注', 'start': None, 'end': None}
-{'entity': 0, 'score': 0.99999857, 'index': 51, 'word': '意', 'start': None, 'end': None}
-{'entity': 0, 'score': 0.99999845, 'index': 52, 'word': '牙', 'start': None, 'end': None}
-{'entity': 0, 'score': 0.99999845, 'index': 53, 'word': 'ι½Ώ', 'start': None, 'end': None}
-{'entity': 0, 'score': 0.99999857, 'index': 54, 'word': 'ζΈ…', 'start': None, 'end': None}
-{'entity': 0, 'score': 0.99999857, 'index': 55, 'word': '洁', 'start': None, 'end': None}
-{'entity': 0, 'score': 0.99999857, 'index': 56, 'word': '卫', 'start': None, 'end': None}
-{'entity': 0, 'score': 0.99999857, 'index': 57, 'word': 'η”Ÿ', 'start': None, 'end': None}
-{'entity': 0, 'score': 0.99999845, 'index': 58, 'word': ',', 'start': None, 'end': None}
-{'entity': 0, 'score': 0.99999845, 'index': 59, 'word': 'ε…»', 'start': None, 'end': None}
-{'entity': 0, 'score': 0.99999857, 'index': 60, 'word': '成', 'start': None, 'end': None}
-{'entity': 0, 'score': 0.99999857, 'index': 61, 'word': '刷', 'start': None, 'end': None}
-{'entity': 0, 'score': 0.99999857, 'index': 62, 'word': '牙', 'start': None, 'end': None}
-{'entity': 0, 'score': 0.99999845, 'index': 63, 'word': 'ηš„', 'start': None, 'end': None}
-{'entity': 0, 'score': 0.99999845, 'index': 64, 'word': 'ε₯½', 'start': None, 'end': None}
-{'entity': 0, 'score': 0.99999845, 'index': 65, 'word': 'δΉ ', 'start': None, 'end': None}
-{'entity': 0, 'score': 0.99999857, 'index': 66, 'word': 'ζƒ―', 'start': None, 'end': None}
-{'entity': 0, 'score': 0.99999833, 'index': 67, 'word': '。', 'start': None, 'end': None}
+{'entity': 1, 'score': 0.58127016, 'index': 14, 'word': 'θ‚–', 'start': 13, 'end': 14}
 
 ```
 
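The data-corruption procedure described in the updated model card (corrupting about 30% of the sentences by swapping characters for visually or phonologically similar ones) can be sketched roughly as follows. This is a minimal illustration only, not the authors' actual pipeline: the confusion table, function names, and rates below are invented for the example, and a real setup would use a much larger confusion set built from glyph and pinyin similarity.

```python
import random

# Toy confusion table: each character maps to visually or phonologically
# similar characters. Purely illustrative (hypothetical entries), not the
# table used to build the actual training data.
CONFUSION = {
    "硝": ["θ‚–", "ζΆˆ"],  # e.g. η”²η‘ε”‘ -> η”²θ‚–ε”‘, the typo flagged in the card's example
    "ιœ‰": ["ζ‚”"],
    "η—›": ["ι€š"],
}

def corrupt_sentence(sentence, char_error_rate=0.1):
    """Replace some characters with confusable ones and return the noisy
    sentence plus per-character labels (1 = corrupted, 0 = unchanged)."""
    chars, labels = [], []
    for ch in sentence:
        if ch in CONFUSION and random.random() < char_error_rate:
            chars.append(random.choice(CONFUSION[ch]))
            labels.append(1)
        else:
            chars.append(ch)
            labels.append(0)
    return "".join(chars), labels

def build_training_examples(sentences, sentence_corruption_rate=0.3):
    """Corrupt roughly 30% of the sentences, as the model card describes;
    the remaining sentences are kept clean with all-zero labels."""
    examples = []
    for s in sentences:
        if random.random() < sentence_corruption_rate:
            examples.append(corrupt_sentence(s))
        else:
            examples.append((s, [0] * len(s)))
    return examples
```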
+ {'entity': 1, 'score': 0.58127016, 'index': 14, 'word': 'θ‚–', 'start': 13, 'end': 14}
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
36
 
37
  ```
38
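For downstream use, the flagged tokens can be mapped back to character offsets in the input using the start/end fields shown in the new example output. The helper below is an illustrative sketch, not part of the model repository; it assumes the output format shown in the card, where entity is 1 for a suspected misspelled character and 0 otherwise.

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

hub_model_id = "9pinus/macbert-base-chinese-medical-collation"
model = AutoModelForTokenClassification.from_pretrained(hub_model_id)
tokenizer = AutoTokenizer.from_pretrained(hub_model_id)
classifier = pipeline('ner', model=model, tokenizer=tokenizer)

def flag_suspects(text):
    """Return (character, start, end) for every position the model flags
    as a suspected spelling error (entity == 1)."""
    suspects = []
    for item in classifier(text):
        if item['entity'] == 1 and item['start'] is not None:
            suspects.append((text[item['start']:item['end']],
                             item['start'], item['end']))
    return suspects

text = "ε¦‚ζžœη—…ζƒ…θΎƒι‡οΌŒε―ι€‚ε½“ε£ζœη”²θ‚–ε”‘η‰‡γ€ηŽ―ι…―ηΊ’ιœ‰η΄ η‰‡η­‰θ―η‰©θΏ›θ‘ŒζŠ—ζ„ŸζŸ“ι•‡η—›γ€‚"
print(flag_suspects(text))
# For the example sentence above this should include ('θ‚–', 13, 14),
# matching the output shown in the model card.
```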