File size: 3,933 Bytes
97b9212 637d6b7 97b9212 637d6b7 42919dd 637d6b7 42919dd 79101bb 007224a e59884a 637d6b7 d5fb2a9 42919dd 637d6b7 d5fb2a9 637d6b7 d5fb2a9 637d6b7 d5fb2a9 637d6b7 efee1b3 637d6b7 e954413 637d6b7 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 |
---
license: mit
datasets:
- numind/NuNER
library_name: gliner
language:
- en
pipeline_tag: token-classification
tags:
- entity recognition
- NER
- named entity recognition
- zero shot
- zero-shot
---
NuNerZero - is the family of Zero-Shot Entity Recognition models inspired by [GLiNER](https://huggingface.co/papers/2311.08526) and built with insights we gathered throughout our work on [NuNER](https://huggingface.co/collections/numind/nuner-token-classification-and-ner-backbones-65e1f6e14639e2a465af823b).
The key differences between NuNerZero Long in comparison to GLiNER are:
* The possibility to **detect entities that are longer than 12 tokens**, as NuZero Token operates on the token level rather than on the span level.
* a more powerful version of GLiNER-large-v2.1, surpassing it by **+3.1% on average**
* NuNerZero family is trained on the **diverse dataset tailored for real-life use cases** - NuNER v2.0 dataset
<p align="center">
<img src="zero_shot_performance_unzero_token.png">
</p>
## Installation & Usage
```
!pip install gliner
```
**NuZero requires labels to be lower-cased**
```python
from gliner import GLiNER
def merge_entities(entities):
if not entities:
return []
merged = []
current = entities[0]
for next_entity in entities[1:]:
if next_entity['label'] == current['label'] and (next_entity['start'] == current['end'] + 1 or next_entity['start'] == current['end']):
current['text'] = text[current['start']: next_entity['end']].strip()
current['end'] = next_entity['end']
else:
merged.append(current)
current = next_entity
# Append the last entity
merged.append(current)
return merged
model = GLiNER.from_pretrained("numind/NuNerZero")
# NuZero requires labels to be lower-cased!
labels = ["organization", "initiative", "project"]
labels = [l.lower() for l in labels]
text = "At the annual technology summit, the keynote address was delivered by a senior member of the Association for Computing Machinery Special Interest Group on Algorithms and Computation Theory, which recently launched an expansive initiative titled 'Quantum Computing and Algorithmic Innovations: Shaping the Future of Technology'. This initiative explores the implications of quantum mechanics on next-generation computing and algorithm design and is part of a broader effort that includes the 'Global Computational Science Advancement Project'. The latter focuses on enhancing computational methodologies across scientific disciplines, aiming to set new benchmarks in computational efficiency and accuracy."
entities = model.predict_entities(text, labels)
entities = merge_entities(entities)
for entity in entities:
print(entity["text"], "=>", entity["label"])
```
```
Association for Computing Machinery Special Interest Group on Algorithms and Computation Theory => organization
Quantum Computing and Algorithmic Innovations: Shaping the Future of Technology => initiative
Global Computational Science Advancement Project => project
```
## Fine-tuning
A fine-tuning script can be found [here](https://colab.research.google.com/drive/1-hk5AIdX-TZdyes1yx-0qzS34YYEf3d2?usp=sharing).
## Citation
### This work
```bibtex
@misc{bogdanov2024nuner,
title={NuNER: Entity Recognition Encoder Pre-training via LLM-Annotated Data},
author={Sergei Bogdanov and Alexandre Constantin and Timothée Bernard and Benoit Crabbé and Etienne Bernard},
year={2024},
eprint={2402.15343},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
### Previous work
```bibtex
@misc{zaratiana2023gliner,
title={GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer},
author={Urchade Zaratiana and Nadi Tomeh and Pierre Holat and Thierry Charnois},
year={2023},
eprint={2311.08526},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
``` |