---
language:
- en
license: mit
tags:
- token-classification
- entity-recognition
- foundation-model
- feature-extraction
- RoBERTa
- generic
datasets:
- numind/NuNER
pipeline_tag: token-classification
inference: false
---
# SOTA Entity Recognition English Foundation Model by NuMind 🔥
This model provides the best embedding for the Entity Recognition task in English.
We suggest using the **newer version of this model: [NuNER v2.0](https://huggingface.co/numind/NuNER-v2.0)**
This is the model from our [**Paper**](https://arxiv.org/abs/2402.15343): **NuNER: Entity Recognition Encoder Pre-training via LLM-Annotated Data**
**Check out other models by NuMind:**
* SOTA Multilingual Entity Recognition Foundation Model: [link](https://huggingface.co/numind/entity-recognition-multilingual-general-sota-v1)
* SOTA Sentiment Analysis Foundation Model: [English](https://huggingface.co/numind/generic-sentiment-v1), [Multilingual](https://huggingface.co/numind/generic-sentiment-multi-v1)
## About
[RoBERTa-base](https://huggingface.co/roberta-base) fine-tuned on the [NuNER data](https://huggingface.co/datasets/numind/NuNER).
**Metrics:**
Read more about the evaluation protocol and datasets in our [paper](https://arxiv.org/abs/2402.15343).
The table below shows the performance of each model, aggregated over several datasets.
k=X means that, for this evaluation, we trained the model on only X examples per class and evaluated it on the full test set.
| Model | k=1 | k=4 | k=16 | k=64 |
|----------|----------|----------|----------|----------|
| RoBERTa-base | 24.5 | 44.7 | 58.1 | 65.4 |
| RoBERTa-base + NER-BERT pre-training | 32.3 | 50.9 | 61.9 | 67.6 |
| NuNER v0.1 | 34.3 | 54.6 | 64.0 | 68.7 |
| NuNER v1.0 | 39.4 | 59.6 | 67.8 | 71.5 |
| **NuNER v2.0** | **43.6** | **61.0** | **68.2** | **72.0** |
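The k-shot setting behind this table can be illustrated with a small sampling helper. This is only a minimal sketch: the function name `sample_k_per_class`, the toy data, and the simplification to one label per example are assumptions made for illustration, not the sampling procedure from the paper.

```python
import random
from collections import defaultdict


def sample_k_per_class(examples, k, seed=0):
    """Return up to k (text, label) pairs for every label (hypothetical helper)."""
    rng = random.Random(seed)  # local RNG so the subset is reproducible
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    subset = []
    for label in sorted(by_label):
        pool = by_label[label]
        # A class with fewer than k examples contributes everything it has.
        subset.extend(rng.sample(pool, min(k, len(pool))))
    return subset


# Toy, made-up data used purely for illustration.
toy = [
    ("Paris", "LOC"), ("Berlin", "LOC"), ("Tokyo", "LOC"),
    ("NuMind", "ORG"), ("Hugging Face", "ORG"),
]
few_shot = sample_k_per_class(toy, k=2)
```

In a k-shot run the model is trained only on such a subset and then scored on the untouched test set, so smaller k values measure how much the pre-trained embeddings help when labels are scarce.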
NuNER v1.0 performs on par with 7B LLMs created specifically for the NER task, even though they are about 70 times bigger than NuNER v1.0.
| Model | k=8~16 | k=64~128 |
|----------|----------|----------|
| UniversalNER (7B) | 57.89 ± 4.34 | 71.02 ± 1.53 |
| NuNER v1.0 (100M) | 58.75 ± 0.93 | 70.30 ± 0.35 |
## Usage
Embeddings can be used out of the box or fine-tuned on specific datasets.
Get embeddings:
```python
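import torch
import transformers

# Load the pre-trained encoder and its matching tokenizer.
model = transformers.AutoModel.from_pretrained(
    'numind/NuNER-v1.0'
)
tokenizer = transformers.AutoTokenizer.from_pretrained(
    'numind/NuNER-v1.0'
)

text = [
    "NuMind is an AI company based in Paris and USA.",
    "See other models from us on https://huggingface.co/numind"
]
# Tokenize the batch, padding to the longest sentence.
encoded_input = tokenizer(
    text,
    return_tensors='pt',
    padding=True,
    truncation=True
)
# Inference only, so gradient tracking is not needed.
with torch.no_grad():
    output = model(**encoded_input)

# One embedding per token: (batch_size, sequence_length, hidden_size).
emb = output.last_hidden_state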
```
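To use the embeddings for a specific task, one option is to train a small classification head on top of them. The sketch below is an assumption-laden illustration rather than the setup used in the paper: `TokenProbe` and the label set are hypothetical, and a random tensor stands in for `output.last_hidden_state` so the snippet runs without downloading the model.

```python
import torch

HIDDEN_SIZE = 768              # hidden size of RoBERTa-base
LABELS = ["O", "ORG", "LOC"]   # hypothetical label set for illustration


class TokenProbe(torch.nn.Module):
    """Linear layer mapping each token embedding to label logits (hypothetical)."""

    def __init__(self, hidden_size, num_labels):
        super().__init__()
        self.classifier = torch.nn.Linear(hidden_size, num_labels)

    def forward(self, emb):
        # emb: (batch, seq_len, hidden) -> logits: (batch, seq_len, num_labels)
        return self.classifier(emb)


torch.manual_seed(0)
# Stand-ins for real model outputs and annotations (random, illustration only).
emb = torch.randn(2, 12, HIDDEN_SIZE)
gold = torch.randint(0, len(LABELS), (2, 12))
attention_mask = torch.ones(2, 12, dtype=torch.long)
attention_mask[1, 8:] = 0      # pretend the second sentence is padded

probe = TokenProbe(HIDDEN_SIZE, len(LABELS))
logits = probe(emb)

# Mark padding positions with -100 so the loss skips them.
targets = gold.masked_fill(attention_mask == 0, -100)
loss = torch.nn.functional.cross_entropy(
    logits.reshape(-1, len(LABELS)),
    targets.reshape(-1),
    ignore_index=-100,
)
```

With real data, `emb` would come from the encoder and `gold` from token-level annotations aligned to the tokenizer's subwords. Whether to keep the encoder frozen or update it together with the head is a design choice that depends on how much labeled data is available.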
## Citation
```
@misc{bogdanov2024nuner,
title={NuNER: Entity Recognition Encoder Pre-training via LLM-Annotated Data},
author={Sergei Bogdanov and Alexandre Constantin and Timothée Bernard and Benoit Crabbé and Etienne Bernard},
year={2024},
eprint={2402.15343},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```