	---
	language:
	- en
	tags:
	- bert
	- pytorch
	- en
	- ner
	license: apache-2.0
	library_name: transformers
	pipeline_tag: token-classification
	widget:
	- text: AL-AIN, United Arab Emirates 1996-12-06
	---

	# BERT for English Named Entity Recognition (bert4ner) Model
	English named entity recognition model.

	Performance of `bert4ner-base-uncased` on the CoNLL-2003 test set:

	| Model | Precision | Recall | F1 |
	| ------------ | ------------------ | ------------------ | ------------------ |
	| BertSoftmax | 0.8956 | 0.9132 | 0.9043 |
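The F1 score is the harmonic mean of precision and recall. As a quick consistency check, assuming the first two numeric columns of the table above are precision and recall (an assumption, not something the model card states explicitly), the reported F1 can be reproduced:

```python
# Values copied from the results table; treating the first column as
# precision is an assumption made for this check.
precision = 0.8956
recall = 0.9132

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.9043
```

The result matches the F1 column to four decimal places.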

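	The model reaches close to state-of-the-art performance on the CoNLL-2003 test set.

	BertSoftmax uses the vanilla BERT network architecture.

	This model is released as part of the open-source named entity recognition project [nerpy](https://github.com/shibing624/nerpy), which supports the bert4ner model. It can be called as follows:

	#### English named entity recognition: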

	```python
	>>> from nerpy import NERModel
	>>> model = NERModel("bert", "shibing624/bert4ner-base-uncased")
	>>> predictions, raw_outputs, entities = model.predict(["AL-AIN, United Arab Emirates 1996-12-06"], split_on_space=True)
	entities: [('AL-AIN,', 'LOC'), ('United Arab Emirates', 'LOC')]
	```

	Model files:
	```
	bert4ner-base-uncased
	├── config.json
	├── model_args.json
	├── pytorch_model.bin
	├── special_tokens_map.json
	├── tokenizer_config.json
	└── vocab.txt
	```

	## Usage (HuggingFace Transformers)
	Without [nerpy](https://github.com/shibing624/nerpy), you can use the model like this:

	First, pass the input through the transformer model to get one predicted tag per token, then decode the BIOES tags to recover the entity spans.

	Install the required packages:
	```shell
	pip install transformers seqeval
	```

	```python
	import os
	import torch
	from transformers import AutoTokenizer, AutoModelForTokenClassification
	from seqeval.metrics.sequence_labeling import get_entities

	os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"

	# Load model from HuggingFace Hub
	tokenizer = AutoTokenizer.from_pretrained("shibing624/bert4ner-base-uncased")
	model = AutoModelForTokenClassification.from_pretrained("shibing624/bert4ner-base-uncased")
	label_list = ["E-ORG", "E-LOC", "S-MISC", "I-MISC", "S-PER", "E-PER", "B-MISC", "O", "S-LOC",
	"E-MISC", "B-ORG", "S-ORG", "I-ORG", "B-LOC", "I-LOC", "B-PER", "I-PER"]

	sentence = "AL-AIN, United Arab Emirates 1996-12-06"


	def get_entity(sentence):
	    # WordPiece tokens, without the [CLS] and [SEP] special tokens
	    tokens = tokenizer.tokenize(sentence)
	    inputs = tokenizer.encode(sentence, return_tensors="pt")
	    with torch.no_grad():
	        outputs = model(inputs).logits
	    predictions = torch.argmax(outputs, dim=2)
	    # Drop the predictions for [CLS] and [SEP] so they align with `tokens`
	    word_tags = [(token, label_list[prediction]) for token, prediction in zip(tokens, predictions[0].numpy()[1:-1])]
	    print(sentence)
	    print(word_tags)

	    # Group the BIOES tags into (type, start, end) spans; `end` is inclusive
	    pred_labels = [i[1] for i in word_tags]
	    entities = []
	    line_entities = get_entities(pred_labels)
	    for i in line_entities:
	        word = tokens[i[1]: i[2] + 1]
	        entity_type = i[0]
	        entities.append((word, entity_type))

	    print("Sentence entity:")
	    print(entities)


	get_entity(sentence)
	```
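Each entity returned by the snippet above is a list of WordPiece tokens rather than a readable string. A minimal sketch of one way to join them is shown below. The helper names `merge_wordpieces` and `format_entities` and the example token lists are hypothetical and not part of nerpy or transformers; the actual tokenization of the sentence may differ.

```python
from typing import List, Tuple


def merge_wordpieces(pieces: List[str]) -> str:
    """Join WordPiece tokens into a string; a '##' prefix marks a continuation piece."""
    words: List[str] = []
    for piece in pieces:
        if piece.startswith("##") and words:
            # Glue the continuation piece onto the previous word.
            words[-1] += piece[2:]
        else:
            words.append(piece)
    return " ".join(words)


def format_entities(entities: List[Tuple[List[str], str]]) -> List[Tuple[str, str]]:
    """Convert (token list, entity type) pairs into (text, entity type) pairs."""
    return [(merge_wordpieces(pieces), label) for pieces, label in entities]


# Hypothetical token lists, made up for illustration only.
example = [(["al", "-", "ai", "##n"], "LOC"), (["united", "arab", "emirates"], "LOC")]
print(format_entities(example))
# [('al - ain', 'LOC'), ('united arab emirates', 'LOC')]
```

This sketch separates words with single spaces, so the original spacing around punctuation is not restored, and the text stays lowercased because the model is uncased.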


	### Datasets

	#### Named entity recognition datasets


	| Dataset | Corpus | Download link | File size |
	| :------- | :--------- | :---------: | :---------: |
	| `CNER Chinese NER dataset` | CNER (120k characters) | [CNER github](https://github.com/shibing624/nerpy/tree/main/examples/data/cner) | 1.1MB |
	| `PEOPLE Chinese NER dataset` | People's Daily dataset (2 million characters) | [PEOPLE github](https://github.com/shibing624/nerpy/tree/main/examples/data/people) | 12.8MB |
	| `CoNLL03 English NER dataset` | CoNLL-2003 dataset (220k words) | [CoNLL03 github](https://github.com/shibing624/nerpy/tree/main/examples/data/conll03) | 1.7MB |


	### Input format

	Training data should preferably use the BIOES tagging scheme, with one token and its label per line, separated by a space. Sentences are separated by a blank line.

	```text
	EU S-ORG
	rejects O
	German S-MISC
	call O
	to O
	boycott O
	British S-MISC
	lamb O
	. O

	Peter B-PER
	Blackburn E-PER
	```
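A file in this format can be read with a few lines of Python. The sketch below is a minimal illustration, not nerpy's own loader: the function names `parse_conll` and `bioes_spans` are hypothetical, and the code assumes well-formed input in which every non-blank line holds a token and a label separated by whitespace.

```python
from typing import List, Tuple

Sentence = List[Tuple[str, str]]


def parse_conll(text: str) -> List[Sentence]:
    """Split 'token label' lines into sentences; blank lines end a sentence."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        # The label is the last whitespace-separated field on the line.
        token, label = line.rsplit(maxsplit=1)
        current.append((token, label))
    if current:
        sentences.append(current)
    return sentences


def bioes_spans(sentence: Sentence) -> List[Tuple[str, str]]:
    """Collect (entity text, entity type) pairs from BIOES labels.

    Simplification: the entity type is taken from the closing S- or E- tag,
    and type consistency inside a span is not checked.
    """
    spans: List[Tuple[str, str]] = []
    buffer: List[str] = []
    for token, label in sentence:
        if label == "O":
            buffer = []
            continue
        prefix, entity_type = label.split("-", 1)
        if prefix == "S":
            spans.append((token, entity_type))
            buffer = []
        elif prefix == "B":
            buffer = [token]
        elif prefix == "I" and buffer:
            buffer.append(token)
        elif prefix == "E" and buffer:
            buffer.append(token)
            spans.append((" ".join(buffer), entity_type))
            buffer = []
    return spans


sample = """EU S-ORG
rejects O
German S-MISC
call O
to O
boycott O
British S-MISC
lamb O
. O

Peter B-PER
Blackburn E-PER
"""

sentences = parse_conll(sample)
for sentence in sentences:
    print(bioes_spans(sentence))
# [('EU', 'ORG'), ('German', 'MISC'), ('British', 'MISC')]
# [('Peter Blackburn', 'PER')]
```

In BIOES, `S-` marks a single-token entity, `B-`, `I-` and `E-` mark the beginning, inside and end of a multi-token entity, and `O` marks tokens outside any entity.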


	To train bert4ner, see the examples at [https://github.com/shibing624/nerpy/tree/main/examples](https://github.com/shibing624/nerpy/tree/main/examples).


	## Citation

	```bibtex
	@software{nerpy,
	  author = {Xu Ming},
	  title = {nerpy: Named Entity Recognition toolkit},
	  year = {2022},
	  url = {https://github.com/shibing624/nerpy},
	}
	```