---
language:
- en
license: mit
tags:
- token-classification
- entity-recognition
- foundation-model
- feature-extraction
- RoBERTa
- generic
datasets:
- numind/NuNER
pipeline_tag: token-classification
inference: false
---

# SOTA Entity Recognition English Foundation Model by NuMind 🔥

This model provides the best embedding for the Entity Recognition task in English.

This model is based on our [paper](https://arxiv.org/abs/2402.15343).

**Check out other models by NuMind:**
* SOTA Multilingual Entity Recognition Foundation Model: [link](https://huggingface.co/numind/entity-recognition-multilingual-general-sota-v1)
* SOTA Sentiment Analysis Foundation Model: [English](https://huggingface.co/numind/generic-sentiment-v1), [Multilingual](https://huggingface.co/numind/generic-sentiment-multi-v1)

## About

[RoBERTa-base](https://huggingface.co/roberta-base) fine-tuned on [NuNER data](https://huggingface.co/datasets/numind/NuNER).

**Metrics:**

Read more about the evaluation protocol and datasets in our [paper](https://arxiv.org/abs/2402.15343).

Here is the aggregated performance of the models over several datasets.

k=X means that, for this evaluation, we took only X examples per class as training data, trained the model on them, and evaluated it on the full test set.

| Model | k=1 | k=4 | k=16 | k=64 |
|----------|----------|----------|----------|----------|
| RoBERTa-base | 24.5 | 44.7 | 58.1 | 65.4 |
| RoBERTa-base + NER-BERT pre-training | 32.3 | 50.9 | 61.9 | 67.6 |
| NuNER v1.0 | **39.4** | **59.6** | **67.8** | **71.5** |

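The k=X sampling described above can be sketched as follows. This is a minimal illustration, not the evaluation code from the paper: the `sample_k_per_class` helper, the `(item, label)` example format, the toy data, and the fixed seed are all assumptions made for this example. Real NER datasets attach labels to spans inside sentences, so the paper's protocol may differ in detail.

```python
import random
from collections import defaultdict


def sample_k_per_class(examples, k, seed=0):
    """Return at most k examples per label from (item, label) pairs.

    Hypothetical helper for illustration; not part of NuNER or transformers.
    """
    by_label = defaultdict(list)
    for item, label in examples:
        by_label[label].append((item, label))

    rng = random.Random(seed)
    subset = []
    for label in sorted(by_label):
        pool = by_label[label]
        # Classes with fewer than k examples contribute everything they have.
        subset.extend(rng.sample(pool, min(k, len(pool))))
    return subset


# Toy data (made up): entity mentions paired with their class.
train = [
    ("Paris", "LOC"), ("USA", "LOC"), ("Berlin", "LOC"),
    ("NuMind", "ORG"), ("Hugging Face", "ORG"),
    ("Etienne Bernard", "PER"),
]
few_shot = sample_k_per_class(train, k=2)
```

The model would then be trained on `few_shot` only and scored on the untouched test set, which is what makes small values of k a test of how much the pre-trained embeddings already know.
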
NuNER v1.0 performs similarly to 7B LLMs (70 times bigger than NuNER v1.0) created specifically for the NER task.

| Model | k=8~16 | k=64~128 |
|----------|----------|----------|
| UniversalNER (7B) | 57.89 ± 4.34 | 71.02 ± 1.53 |
| NuNER v1.0 (100M) | 58.75 ± 0.93 | 70.30 ± 0.35 |

## Usage

Embeddings can be used out of the box or fine-tuned on specific datasets.

Get embeddings:

```python
import torch
import transformers

# Load the encoder; output_hidden_states=True exposes every layer's output
model = transformers.AutoModel.from_pretrained(
    'numind/NuNER-v1.0',
    output_hidden_states=True
)
tokenizer = transformers.AutoTokenizer.from_pretrained(
    'numind/NuNER-v1.0'
)

text = [
    "NuMind is an AI company based in Paris and USA.",
    "See other models from us on https://huggingface.co/numind"
]
# Pad to the longest text in the batch so both fit in one tensor
encoded_input = tokenizer(
    text,
    return_tensors='pt',
    padding=True,
    truncation=True
)
output = model(**encoded_input)

# for better quality: concatenate two layers along the feature dimension
emb = torch.cat(
    (output.hidden_states[-1], output.hidden_states[-7]),
    dim=2
)

# for better speed
# emb = output.hidden_states[-1]
```

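One way to use the embeddings out of the box is to train a small classifier on top of them while leaving NuNER itself untouched. The sketch below is an illustration under stated assumptions, not NuMind's training code: `TokenClassifierHead`, `num_labels = 5`, and the random tensors standing in for `emb`, the attention mask, and the labels are all hypothetical. The only model-specific fact used is that RoBERTa-base has a hidden size of 768, so concatenating two layers as above gives 1536 features per token.

```python
import torch
from torch import nn


class TokenClassifierHead(nn.Module):
    """Linear layer mapping each token embedding to label logits (hypothetical)."""

    def __init__(self, emb_dim: int, num_labels: int):
        super().__init__()
        self.linear = nn.Linear(emb_dim, num_labels)

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        # emb: (batch, seq_len, emb_dim) -> logits: (batch, seq_len, num_labels)
        return self.linear(emb)


# Random stand-ins so the sketch runs without downloading the model.
batch, seq_len, emb_dim, num_labels = 2, 16, 2 * 768, 5
emb = torch.randn(batch, seq_len, emb_dim)
attention_mask = torch.ones(batch, seq_len, dtype=torch.long)
attention_mask[1, 10:] = 0  # pretend the second text is padded after 10 tokens
labels = torch.randint(0, num_labels, (batch, seq_len))

head = TokenClassifierHead(emb_dim, num_labels)
logits = head(emb)

# Exclude padded positions from the loss; -100 is the default ignore_index
# of PyTorch's cross-entropy loss.
masked_labels = labels.masked_fill(attention_mask == 0, -100)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, num_labels), masked_labels.reshape(-1)
)
loss.backward()  # gradients reach only the head; emb carries no graph here
```

In real use, `emb` would be the tensor computed in the snippet above (wrapped in `torch.no_grad()` if the encoder should stay frozen) and `attention_mask` would be `encoded_input['attention_mask']`.
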
## Citation
```
@misc{bogdanov2024nuner,
      title={NuNER: Entity Recognition Encoder Pre-training via LLM-Annotated Data},
      author={Sergei Bogdanov and Alexandre Constantin and Timothée Bernard and Benoit Crabbé and Etienne Bernard},
      year={2024},
      eprint={2402.15343},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```