shibing624 committed on
Commit
344a656
1 Parent(s): e90c1f8

Update README.md

Files changed (1)
  1. README.md +145 -1
README.md CHANGED
@@ -1,3 +1,147 @@
1
  ---
2
- license: apache-2.0
3
  ---
1
  ---
2
+ language:
3
+ - en
4
+ tags:
5
+ - bert
6
+ - pytorch
7
+ - en
8
+ - ner
9
+ license: "apache-2.0"
10
  ---
11
+
12
+ # BERT for English Named Entity Recognition (bert4ner) Model
13
+ An English named entity recognition model.
14
+
15
+ Evaluation of `bert4ner-base-uncased` on the CoNLL-2003 test data:
16
+
17
+ The overall performance of BERT on CoNLL-2003 **test**:
18
+
19
+ | Model | Precision | Recall | F1 |
20
+ | ------------ | ------------------ | ------------------ | ------------------ |
21
+ | BertSoftmax | 0.8956 | 0.9132 | 0.9043 |
22
+
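As a quick consistency check on the table, the reported F1 can be reproduced as a harmonic mean. This reads the table's first two metric columns as precision $P$ and recall $R$, which is an assumption rather than something the README states:

$$
F_1 = \frac{2PR}{P + R} = \frac{2 \times 0.8956 \times 0.9132}{0.8956 + 0.9132} = \frac{1.6357}{1.8088} \approx 0.9043
$$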
23
+ The model reaches near state-of-the-art results on the CoNLL-2003 test set.
24
+
25
+ BertSoftmax uses the vanilla BERT network architecture.
26
+
27
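+ This model is open-sourced as part of the named entity recognition project [nerpy](https://github.com/shibing624/nerpy), which supports the bert4ner model. It can be called as follows: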
28
+
29
+ #### English named entity recognition:
30
+
31
+ ```python
32
+ >>> from nerpy import NERModel
33
+ >>> model = NERModel("bert", "shibing624/bert4ner-base-uncased")
34
+ >>> predictions, raw_outputs, entities = model.predict(["AL-AIN, United Arab Emirates 1996-12-06"], split_on_space=True)
35
+ entities: [('AL-AIN,', 'LOC'), ('United Arab Emirates', 'LOC')]
36
+ ```
37
+
38
+ Model files:
39
+ ```
40
+ bert4ner-base-uncased
41
+ ├── config.json
42
+ ├── model_args.json
43
+ ├── pytorch_model.bin
44
+ ├── special_tokens_map.json
45
+ ├── tokenizer_config.json
46
+ └── vocab.txt
47
+ ```
48
+
49
+ ## Usage (HuggingFace Transformers)
50
+ Without [nerpy](https://github.com/shibing624/nerpy), you can use the model like this:
51
+
52
+ First, pass the input through the transformer model, then decode the predicted BIOES tags to extract the entity words.
53
+
54
+ Install the required packages:
55
+ ```
56
+ pip install torch transformers seqeval
57
+ ```
58
+
59
+ ```python
60
+ import os
+
+ import torch
+ from seqeval.metrics.sequence_labeling import get_entities
+ from transformers import AutoModelForTokenClassification, AutoTokenizer
+
+ os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"
+
+ # Load the model and tokenizer from the HuggingFace Hub
+ tokenizer = AutoTokenizer.from_pretrained("shibing624/bert4ner-base-uncased")
+ model = AutoModelForTokenClassification.from_pretrained("shibing624/bert4ner-base-uncased")
+ label_list = ["E-ORG", "E-LOC", "S-MISC", "I-MISC", "S-PER", "E-PER", "B-MISC", "O", "S-LOC",
+               "E-MISC", "B-ORG", "S-ORG", "I-ORG", "B-LOC", "I-LOC", "B-PER", "I-PER"]
+
+ sentence = "AL-AIN, United Arab Emirates 1996-12-06"
+
+
+ def get_entity(sentence):
+     # Encode once and recover the word pieces from the ids, so that tokens and
+     # predictions stay aligned (both include [CLS] and [SEP]).
+     inputs = tokenizer.encode(sentence, return_tensors="pt")
+     tokens = tokenizer.convert_ids_to_tokens(inputs[0].tolist())
+     with torch.no_grad():
+         outputs = model(inputs).logits
+     predictions = torch.argmax(outputs, dim=2)
+     # Pair each word piece with its predicted tag, then drop [CLS] and [SEP]
+     char_tags = [(token, label_list[prediction]) for token, prediction in zip(tokens, predictions[0].tolist())][1:-1]
+     print(sentence)
+     print(char_tags)
+
+     pieces = [i[0] for i in char_tags]
+     pred_labels = [i[1] for i in char_tags]
+     entities = []
+     # get_entities returns (entity_type, start, end) with inclusive token indices
+     line_entities = get_entities(pred_labels)
+     for i in line_entities:
+         # Indices refer to word pieces, not characters, so rebuild the text
+         # from the pieces (lowercased, because the model is uncased).
+         word = tokenizer.convert_tokens_to_string(pieces[i[1]: i[2] + 1])
+         entity_type = i[0]
+         entities.append((word, entity_type))
+
+     print("Sentence entity:")
+     print(entities)
+
+
+ get_entity(sentence)
99
+ ```
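The tag-decoding step can also be sketched in plain Python, which makes it easier to see how BIOES tags turn into entity spans. This is a minimal illustration, not how nerpy or seqeval implement it: `decode_bioes` is a hypothetical helper, it joins tokens with spaces, and it is stricter than seqeval's `get_entities` in that it discards spans that never reach an `E-` tag.

```python
def decode_bioes(tokens, tags):
    """Group tokens into (text, entity_type) pairs from BIOES tags."""
    entities = []
    current, current_type = [], None
    for token, tag in zip(tokens, tags):
        prefix, _, etype = tag.partition("-")
        if prefix == "S":
            # Single-token entity
            entities.append((token, etype))
            current, current_type = [], None
        elif prefix == "B":
            # Start a new multi-token entity
            current, current_type = [token], etype
        elif prefix in ("I", "E") and current and etype == current_type:
            current.append(token)
            if prefix == "E":
                # Close the span and emit it
                entities.append((" ".join(current), current_type))
                current, current_type = [], None
        else:
            # "O" or an inconsistent tag: discard any unfinished span
            current, current_type = [], None
    return entities


tokens = ["Peter", "Blackburn", "visited", "the", "EU"]
tags = ["B-PER", "E-PER", "O", "O", "S-ORG"]
print(decode_bioes(tokens, tags))
```

With word-piece tokens, the joined text would still need the `##` continuation markers merged, for example with the tokenizer's `convert_tokens_to_string`.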
100
+
101
+
102
+ ### Datasets
103
+
104
+ #### Named entity recognition datasets
105
+
106
+
107
+ | Dataset | Corpus | Download link | File size |
108
+ | :------- | :--------- | :---------: | :---------: |
109
+ | **`CNER Chinese NER dataset`** | CNER (120k characters) | [CNER github](https://github.com/shibing624/nerpy/tree/main/examples/data/cner)| 1.1MB |
110
+ | **`PEOPLE Chinese NER dataset`** | People's Daily dataset (2M characters) | [PEOPLE github](https://github.com/shibing624/nerpy/tree/main/examples/data/people)| 12.8MB |
111
+ | **`CoNLL03 English NER dataset`** | CoNLL-2003 dataset (220k words) | [CoNLL03 github](https://github.com/shibing624/nerpy/tree/main/examples/data/conll03)| 1.7MB |
112
+
113
+
114
+ ### Input format
115
+
116
+ The input uses the BIOES tag scheme (preferred), with one token and its label per line. Sentences are separated by a blank line.
117
+
118
+ ```text
119
+ EU S-ORG
120
+ rejects O
121
+ German S-MISC
122
+ call O
123
+ to O
124
+ boycott O
125
+ British S-MISC
126
+ lamb O
127
+ . O
128
+
129
+ Peter B-PER
130
+ Blackburn E-PER
131
+ ```
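The format above can be read with a few lines of Python. The sketch below is only an illustration under stated assumptions: `parse_bioes_text` is a hypothetical helper (not part of nerpy), and it assumes each non-blank line holds a token and a label separated by whitespace.

```python
def parse_bioes_text(text):
    """Parse 'token label' lines into sentences; blank lines separate sentences."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        # The label is the last whitespace-separated field on the line
        token, label = line.rsplit(maxsplit=1)
        current.append((token, label))
    if current:
        sentences.append(current)
    return sentences


sample = """EU S-ORG
rejects O
German S-MISC

Peter B-PER
Blackburn E-PER
"""
for sent in parse_bioes_text(sample):
    print(sent)
```

Each sentence comes back as a list of `(token, label)` pairs. A line with only one field raises a `ValueError`, so malformed files fail loudly instead of being silently skipped.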
132
+
133
+
134
+ To train bert4ner, see [https://github.com/shibing624/nerpy/tree/main/examples](https://github.com/shibing624/nerpy/tree/main/examples)
135
+
136
+
137
+ ## Citation
138
+
139
+ ```bibtex
140
+ @software{nerpy,
141
+ author = {Xu Ming},
142
+ title = {nerpy: Named Entity Recognition toolkit},
143
+ year = {2022},
144
+ url = {https://github.com/shibing624/nerpy},
145
+ }
146
+ ```
147
+