---
language: en
datasets:
- conll2003
widget:
- text: "My name is jean-baptiste and I live in montreal"
- text: "My name is clara and I live in berkeley, california."
- text: "My name is wolfgang and I live in berlin"
---

# roberta-large-ner: model fine-tuned from roberta-large for NER task

## Introduction

roberta-large-ner is a NER model that was fine-tuned from roberta-large on the conll2003 dataset.
The model was validated on email and chat data and outperformed other models on this type of data specifically.
In particular, the model seems to work better on entities that don't start with an upper case.

## Training data

Training data was classified as follows:

Abbreviation|Description
-|-
O|Outside of a named entity
MISC|Miscellaneous entity
PER|Person’s name
ORG|Organization
LOC|Location

To simplify the task, the B- and I- prefixes from the original conll2003 tagging scheme were removed (a sketch of this preprocessing is given at the end of this card).
I used the train and test splits from the original conll2003 for training, and the "validation" split for validation. This resulted in a dataset of size:

Split|Size
-|-
Train|17494
Validation|3250

## How to use roberta-large-ner with HuggingFace

##### Load roberta-large-ner and its sub-word tokenizer:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("Jean-Baptiste/roberta-large-ner")
model = AutoModelForTokenClassification.from_pretrained("Jean-Baptiste/roberta-large-ner")
```

##### Process text sample (from wikipedia):

```python
from transformers import pipeline

nlp = pipeline('ner', model=model, tokenizer=tokenizer, aggregation_strategy="simple")
nlp("Apple was founded in 1976 by Steve Jobs, Steve Wozniak and Ronald Wayne to develop and sell Wozniak's Apple I personal computer")

[{'entity_group': 'ORG', 'score': 0.99381506, 'word': ' Apple', 'start': 0, 'end': 5},
 {'entity_group': 'PER', 'score': 0.99970853, 'word': ' Steve Jobs', 'start': 29, 'end': 39},
 {'entity_group': 'PER', 'score': 0.99981767, 'word': ' Steve Wozniak', 'start': 41, 'end': 54},
 {'entity_group': 'PER', 'score': 0.99956465, 'word': ' Ronald Wayne', 'start': 59, 'end': 71},
 {'entity_group': 'PER', 'score': 0.9997918, 'word': ' Wozniak', 'start': 92, 'end': 99},
 {'entity_group': 'MISC', 'score': 0.99956393, 'word': ' Apple I', 'start': 102, 'end': 109}]
```

## Model performances

Model performance computed on the conll2003 validation dataset (token-level predictions):

```
entity  | precision | recall | f1
-       | -         | -      | -
PER     | 0.9914    | 0.9927 | 0.9920
ORG     | 0.9627    | 0.9661 | 0.9644
LOC     | 0.9795    | 0.9862 | 0.9828
MISC    | 0.9292    | 0.9262 | 0.9277
Overall | 0.9740    | 0.9766 | 0.9753
```

On a private dataset (email, chat, informal discussion), computed on word-level predictions:

```
entity | precision | recall | f1
-      | -         | -      | -
PER    | 0.8823    | 0.9116 | 0.8967
ORG    | 0.7694    | 0.7292 | 0.7487
LOC    | 0.8619    | 0.7768 | 0.8171
```

For comparison, spaCy (en_core_web_trf-3.2.0) on the same private dataset gives:

```
entity | precision | recall | f1
-      | -         | -      | -
PER    | 0.9146    | 0.8287 | 0.8695
ORG    | 0.7655    | 0.6437 | 0.6993
LOC    | 0.8727    | 0.6180 | 0.7236
```
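
For reference, the label simplification and split arrangement described in the Training data section could be reproduced along the following lines. This is a minimal sketch assuming the Hugging Face `datasets` version of conll2003 (with `tokens`/`ner_tags` fields); the label ordering and helper names are illustrative assumptions, not the exact script used to train this model.

```python
# Sketch only: reproduces the B-/I- prefix removal and the train+test / validation
# split arrangement described above, assuming the HF `datasets` conll2003 schema.
from datasets import load_dataset, concatenate_datasets

raw = load_dataset("conll2003")
bio_labels = raw["train"].features["ner_tags"].feature.names  # ['O', 'B-PER', 'I-PER', ...]

# Drop the B-/I- prefixes: 'B-PER' and 'I-PER' both become 'PER'.
simple_labels = sorted({label.split("-")[-1] for label in bio_labels})  # ['LOC', 'MISC', 'O', 'ORG', 'PER']
label2id = {label: i for i, label in enumerate(simple_labels)}

def simplify(example):
    # Map each BIO tag id to the id of its prefix-free label.
    example["ner_tags"] = [
        label2id[bio_labels[tag].split("-")[-1]] for tag in example["ner_tags"]
    ]
    return example

# Train on the original train + test splits (14041 + 3453 = 17494 sentences),
# and keep the original "validation" split (3250 sentences) for validation.
train_ds = concatenate_datasets([raw["train"], raw["test"]]).map(simplify)
valid_ds = raw["validation"].map(simplify)
```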