	---
	license: mit
	datasets:
	- numind/NuNER
	library_name: gliner
	language:
	- en
	pipeline_tag: token-classification
	tags:
	- entity recognition
	- NER
	- named entity recognition
	- zero shot
	- zero-shot
	---

	NuNER Zero is a zero-shot Named Entity Recognition (NER) model. (Check [NuNER](https://huggingface.co/collections/numind/nuner-token-classification-and-ner-backbones-65e1f6e14639e2a465af823b) for the few-shot setting.)

	NuNER Zero uses the [GLiNER](https://huggingface.co/papers/2311.08526) architecture: its input should be a concatenation of entity types and text.
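
To make the "concatenation of entity types and text" concrete, here is a minimal sketch that follows the `[ENT] type ... [SEP] text` notation of the GLiNER paper. The `gliner` library builds this input internally, so you never call anything like this yourself; the function name `build_prompt` and the literal marker strings are assumptions made for illustration only.

```python
def build_prompt(labels, words, ent_token="[ENT]", sep_token="[SEP]"):
    """Concatenate entity types and text words into one input sequence.

    ent_token and sep_token are placeholder marker strings (assumed here),
    not necessarily the special tokens used by the real tokenizer.
    """
    prompt = []
    for label in labels:
        # Each entity type is preceded by an entity marker.
        prompt.append(ent_token)
        prompt.append(label)
    # A separator marks the boundary between entity types and text.
    prompt.append(sep_token)
    return prompt + list(words)


tokens = build_prompt(
    ["organization", "project"],
    "NuMind released NuNER Zero".split(),
)
print(tokens)
```

Because the entity types are part of the input, they can be changed at inference time without retraining, which is what makes the model zero-shot.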

	Unlike GLiNER, NuNER Zero is a token classifier, which allows it to detect arbitrarily long entities.
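
The sketch below illustrates why token-level classification places no limit on entity length: consecutive positive tokens are simply grouped into one span, however long the run is. A model that scores candidate spans up to a fixed maximum width cannot return anything longer than that width. This is an illustration of the idea, not the library's actual decoding code; `tokens_to_spans` and the example flags are hypothetical.

```python
def tokens_to_spans(token_flags):
    """Group consecutive positive token predictions into (start, end) spans.

    `end` is exclusive. `token_flags` holds one truthy/falsy value per token
    for a single entity type.
    """
    spans = []
    start = None
    for i, flag in enumerate(token_flags):
        if flag and start is None:
            start = i  # a new run of positive tokens begins
        elif not flag and start is not None:
            spans.append((start, i))  # the run ended at the previous token
            start = None
    if start is not None:
        # The last run extends to the end of the sequence.
        spans.append((start, len(token_flags)))
    return spans


# Hypothetical predictions: a 13-token entity followed by a 1-token entity.
flags = [0] + [1] * 13 + [0, 1]
spans = tokens_to_spans(flags)
print(spans)
```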

	NuNER Zero was trained on the [NuNER v2.0](https://huggingface.co/numind/NuNER-v2.0) dataset, which combines subsets of Pile and C4 annotated via LLMs using [NuNER's procedure](https://huggingface.co/papers/2402.15343).

	NuNER Zero is (at the time of its release) the best compact zero-shot NER model (+3.1% token-level F1-Score over GLiNER-large-v2.1 on GLiNER's benchmark).
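
For readers unfamiliar with the metric, the following is a minimal sketch of a token-level F1 computation: each token carries one label, and a token counts as correct only when the predicted entity type matches the gold one. The exact evaluation protocol of the benchmark (averaging scheme, tokenization) is not described here, so the function `token_f1`, the `"O"` outside label, and the toy labels are all assumptions.

```python
def token_f1(gold, pred, outside="O"):
    """Micro-averaged token-level F1 over all non-outside labels."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same number of tokens")
    # True positives: entity tokens predicted with the correct type.
    tp = sum(1 for g, p in zip(gold, pred) if g == p and g != outside)
    pred_pos = sum(1 for p in pred if p != outside)
    gold_pos = sum(1 for g in gold if g != outside)
    if tp == 0:
        return 0.0
    precision = tp / pred_pos
    recall = tp / gold_pos
    return 2 * precision * recall / (precision + recall)


# Toy example: one gold "org" token is missed by the prediction.
gold = ["O", "org", "org", "org", "O", "project"]
pred = ["O", "org", "org", "O", "O", "project"]
score = token_f1(gold, pred)
print(round(score, 3))
```

Here precision is 3/3 and recall is 3/4, so the score is 6/7.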

	<p align="left">
	<img src="zero_shot_performance_unzero_token.png" width="600">
	</p>

	## Installation & Usage

	```
	!pip install gliner
	```

	NuNER Zero requires labels to be lower-cased.

	```python
	from gliner import GLiNER


	def merge_entities(entities, text):
	    """Merge adjacent or touching entities that share the same label."""
	    if not entities:
	        return []
	    merged = []
	    current = entities[0]
	    for next_entity in entities[1:]:
	        # Same label, and the next span starts at the current end or one character after it
	        if next_entity['label'] == current['label'] and (next_entity['start'] == current['end'] + 1 or next_entity['start'] == current['end']):
	            current['text'] = text[current['start']: next_entity['end']].strip()
	            current['end'] = next_entity['end']
	        else:
	            merged.append(current)
	            current = next_entity
	    # Append the last entity
	    merged.append(current)
	    return merged


	model = GLiNER.from_pretrained("numind/NuNerZero")

	# NuNER Zero requires labels to be lower-cased!
	labels = ["organization", "initiative", "project"]
	labels = [label.lower() for label in labels]

	text = "At the annual technology summit, the keynote address was delivered by a senior member of the Association for Computing Machinery Special Interest Group on Algorithms and Computation Theory, which recently launched an expansive initiative titled 'Quantum Computing and Algorithmic Innovations: Shaping the Future of Technology'. This initiative explores the implications of quantum mechanics on next-generation computing and algorithm design and is part of a broader effort that includes the 'Global Computational Science Advancement Project'. The latter focuses on enhancing computational methodologies across scientific disciplines, aiming to set new benchmarks in computational efficiency and accuracy."

	entities = model.predict_entities(text, labels)

	# Pass the source text explicitly so the merged entity text can be re-sliced from it
	entities = merge_entities(entities, text)

	for entity in entities:
	    print(entity["text"], "=>", entity["label"])
	```

	```
	Association for Computing Machinery Special Interest Group on Algorithms and Computation Theory => organization
	Quantum Computing and Algorithmic Innovations: Shaping the Future of Technology => initiative
	Global Computational Science Advancement Project => project
	```
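
The merging step can also be checked in isolation, without downloading the model. The sketch below runs a non-mutating variant of the merge on hand-written entity dicts that use the same `start`, `end`, `text`, and `label` keys as above. The helper names `fragment` and `merge_adjacent`, the sample sentence, and the fragments themselves are hypothetical and do not come from real model output.

```python
def fragment(text, piece, label):
    """Build an entity dict for the first occurrence of `piece` in `text`."""
    start = text.index(piece)
    return {"start": start, "end": start + len(piece), "text": piece, "label": label}


def merge_adjacent(entities, text):
    """Merge same-label entities that touch or are separated by one character.

    Works on copies, so the input dicts are left unchanged.
    """
    if not entities:
        return []
    merged = []
    current = dict(entities[0])
    for nxt in entities[1:]:
        touching = nxt["start"] in (current["end"], current["end"] + 1)
        if nxt["label"] == current["label"] and touching:
            # Extend the current entity and re-slice its text from the source.
            current["end"] = nxt["end"]
            current["text"] = text[current["start"]:current["end"]].strip()
        else:
            merged.append(current)
            current = dict(nxt)
    merged.append(current)
    return merged


sample = "The Global Computational Science Advancement Project started in Paris."
fragments = [
    fragment(sample, "Global Computational", "project"),
    fragment(sample, "Science Advancement Project", "project"),
    fragment(sample, "Paris", "location"),
]
result = merge_adjacent(fragments, sample)
for entity in result:
    print(entity["text"], "=>", entity["label"])
```

The two `project` fragments are separated by a single space, so they are joined into one entity, while `Paris` has a different label and stays separate.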

	## Fine-tuning

	A fine-tuning script can be found [here](https://colab.research.google.com/drive/1-hk5AIdX-TZdyes1yx-0qzS34YYEf3d2?usp=sharing).


	## Citation
	### This work
	```bibtex
	@misc{bogdanov2024nuner,
	title={NuNER: Entity Recognition Encoder Pre-training via LLM-Annotated Data},
	author={Sergei Bogdanov and Alexandre Constantin and Timothée Bernard and Benoit Crabbé and Etienne Bernard},
	year={2024},
	eprint={2402.15343},
	archivePrefix={arXiv},
	primaryClass={cs.CL}
	}
	```
	### Previous work
	```bibtex
	@misc{zaratiana2023gliner,
	title={GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer},
	author={Urchade Zaratiana and Nadi Tomeh and Pierre Holat and Thierry Charnois},
	year={2023},
	eprint={2311.08526},
	archivePrefix={arXiv},
	primaryClass={cs.CL}
	}
	```