---
language:
- en
tags:
- bert
- pytorch
- en
- ner
license: apache-2.0
library_name: transformers
pipeline_tag: token-classification
widget:
- text: AL-AIN, United Arab Emirates 1996-12-06
---

# BERT for English Named Entity Recognition (bert4ner) Model

`bert4ner-base-uncased` is an English named entity recognition model, evaluated on the CoNLL-2003 test data.

Overall performance of BERT on the CoNLL-2003 **test** set:

|              | Precision          | Recall             | F1                 |
| ------------ | ------------------ | ------------------ | ------------------ |
| BertSoftmax  | 0.8956             | 0.9132             | 0.9043             |

The model reaches near-SOTA performance on the CoNLL-2003 test set.

BertSoftmax uses the native BERT network architecture.

This project is open-sourced as part of the named entity recognition toolkit [nerpy](https://github.com/shibing624/nerpy), which supports the bert4ner model. It can be called as follows:

#### English named entity recognition:

```python
>>> from nerpy import NERModel
>>> model = NERModel("bert", "shibing624/bert4ner-base-uncased")
>>> predictions, raw_outputs, entities = model.predict(["AL-AIN, United Arab Emirates 1996-12-06"], split_on_space=True)
entities: [('AL-AIN,', 'LOC'), ('United Arab Emirates', 'LOC')]
```

Model files:

```
bert4ner-base-uncased
    ├── config.json
    ├── model_args.json
    ├── pytorch_model.bin
    ├── special_tokens_map.json
    ├── tokenizer_config.json
    └── vocab.txt
```

## Usage (HuggingFace Transformers)

Without [nerpy](https://github.com/shibing624/nerpy), you can use the model directly with Transformers. First, pass your input through the transformer model to get one predicted tag per token, then decode the BIOES tags to recover the entity words.

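The tag-decoding step mentioned above can be sketched in plain Python. This is a minimal illustration, not the logic used by nerpy or seqeval: `decode_bioes` is a hypothetical helper, the tokens and tags are hand-written rather than model output, and the decoder is strict (it only emits spans that are well-formed under BIOES), whereas library decoders may be more lenient.

```python
from typing import List, Tuple


def decode_bioes(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Collapse BIOES tags into (entity_type, start, end) spans; end is inclusive.

    Hypothetical helper for illustration; malformed spans are dropped.
    """
    spans = []
    start, current = None, None  # start index and type of the open span, if any
    for i, tag in enumerate(tags):
        prefix, _, etype = tag.partition("-")
        if prefix == "S":
            # Single-token entity.
            spans.append((etype, i, i))
            start, current = None, None
        elif prefix == "B":
            # Open a new multi-token entity.
            start, current = i, etype
        elif prefix == "I" and current == etype:
            continue  # still inside the open span
        elif prefix == "E" and current == etype:
            # Close the open span.
            spans.append((etype, start, i))
            start, current = None, None
        else:
            # "O" or an inconsistent tag: discard any open span.
            start, current = None, None
    return spans


# Hand-written tokens and tags (assumed for illustration, not model output).
tokens = ["peter", "blackburn", "visited", "germany"]
tags = ["B-PER", "E-PER", "O", "S-LOC"]

for etype, start, end in decode_bioes(tags):
    print(tokens[start:end + 1], etype)
```

For the hand-written example this prints `['peter', 'blackburn'] PER` and `['germany'] LOC`. With a real model the tokens are WordPiece pieces, so an entity may come back as several sub-word tokens.
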
Install the required packages:

```
pip install transformers seqeval torch
```

```python
import os

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification
from seqeval.metrics.sequence_labeling import get_entities

# Allow duplicate OpenMP runtimes (works around a crash on some setups)
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("shibing624/bert4ner-base-uncased")
model = AutoModelForTokenClassification.from_pretrained("shibing624/bert4ner-base-uncased")
# Index in this list corresponds to the class id predicted by the model
label_list = ["E-ORG", "E-LOC", "S-MISC", "I-MISC", "S-PER", "E-PER", "B-MISC", "O", "S-LOC",
              "E-MISC", "B-ORG", "S-ORG", "I-ORG", "B-LOC", "I-LOC", "B-PER", "I-PER"]

sentence = "AL-AIN, United Arab Emirates 1996-12-06"


def get_entity(sentence):
    tokens = tokenizer.tokenize(sentence)
    inputs = tokenizer.encode(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(inputs).logits
    predictions = torch.argmax(outputs, dim=2)
    # Drop the [CLS] and [SEP] positions so predictions align with `tokens`
    word_tags = [(token, label_list[prediction])
                 for token, prediction in zip(tokens, predictions[0].numpy()[1:-1])]
    print(sentence)
    print(word_tags)

    # Merge the per-token tags into (type, start, end) entity spans
    pred_labels = [i[1] for i in word_tags]
    entities = []
    line_entities = get_entities(pred_labels)
    for i in line_entities:
        word = tokens[i[1]: i[2] + 1]  # WordPiece tokens covered by the span
        entity_type = i[0]
        entities.append((word, entity_type))

    print("Sentence entity:")
    print(entities)


get_entity(sentence)
```

### Datasets

#### Named entity recognition datasets

| Dataset | Corpus | Download link | File size |
| :------- | :--------- | :---------: | :---------: |
| **`CNER Chinese NER dataset`** | CNER (120k characters) | [CNER github](https://github.com/shibing624/nerpy/tree/main/examples/data/cner)| 1.1MB |
| **`PEOPLE Chinese NER dataset`** | People's Daily dataset (2 million characters) | [PEOPLE github](https://github.com/shibing624/nerpy/tree/main/examples/data/people)| 12.8MB |
| **`CoNLL03 English NER dataset`** | CoNLL-2003 dataset (220k words) | [CoNLL03 github](https://github.com/shibing624/nerpy/tree/main/examples/data/conll03)| 1.7MB |

### Input format

The input uses the BIOES tag scheme (preferred), with one token (a character for Chinese, a word for English) and its label per line. Sentences are separated by a blank line.

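A minimal sketch of a reader for this format, assuming the token and its label are separated by whitespace. `parse_tagged_text` is a hypothetical helper written for illustration and is not part of nerpy; the short sample string reuses tokens from this card's own example.

```python
from typing import List, Tuple

Sentence = List[Tuple[str, str]]


def parse_tagged_text(text: str) -> List[Sentence]:
    """Split 'token label' lines into sentences; blank lines end a sentence."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # Blank line: close the current sentence, if there is one.
            if current:
                sentences.append(current)
                current = []
            continue
        # The label is the last whitespace-separated field on the line.
        # A line with a single field raises ValueError here.
        token, tag = line.rsplit(maxsplit=1)
        current.append((token, tag))
    if current:
        sentences.append(current)
    return sentences


sample = "EU S-ORG\nrejects O\n\nPeter B-PER\nBlackburn E-PER\n"
for sentence in parse_tagged_text(sample):
    print(sentence)
```

Each returned sentence is a list of `(token, label)` pairs. A sample file in this format looks like this:
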
```text
EU S-ORG
rejects O
German S-MISC
call O
to O
boycott O
British S-MISC
lamb O
. O

Peter B-PER
Blackburn E-PER
```

To train bert4ner yourself, see [https://github.com/shibing624/nerpy/tree/main/examples](https://github.com/shibing624/nerpy/tree/main/examples).

## Citation

```latex
@software{nerpy,
  author = {Xu Ming},
  title = {nerpy: Named Entity Recognition toolkit},
  year = {2022},
  url = {https://github.com/shibing624/nerpy},
}
```