language:

  • ar
  • de
  • en
  • es
  • fr
  • it
  • lv
  • nl
  • pt
  • zh
  • multilingual

distilbert-base-multilingual-cased-ner-hrl

Model description

distilbert-base-multilingual-cased-ner-hrl is a Named Entity Recognition (NER) model for 10 high-resource languages (Arabic, German, English, Spanish, French, Italian, Latvian, Dutch, Portuguese and Chinese), based on a fine-tuned DistilBERT base model. It has been trained to recognize three types of entities: location (LOC), organization (ORG), and person (PER). Specifically, this model is a distilbert-base-multilingual-cased model that was fine-tuned on an aggregation of NER datasets covering these 10 languages.

Intended uses & limitations

How to use

You can use this model with the Transformers pipeline for NER:

from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

# Load the tokenizer and the fine-tuned token-classification model from the Hub
tokenizer = AutoTokenizer.from_pretrained("Davlan/distilbert-base-multilingual-cased-ner-hrl")
model = AutoModelForTokenClassification.from_pretrained("Davlan/distilbert-base-multilingual-cased-ner-hrl")

# Build an NER pipeline and run it on a sample sentence
nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "Nader Jokhadar had given Syria the lead with a well-struck header in the seventh minute."
ner_results = nlp(example)
print(ner_results)
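Without aggregation, the pipeline returns one dictionary per predicted token, with keys such as `entity`, `score`, `word`, `start` and `end`. The following is a minimal sketch of how such results might be post-processed. It does not load the model: `sample_results` is a hand-written, hypothetical stand-in for `ner_results`, its scores are made up for illustration, and `confident_mentions` and `min_score` are names introduced here, not part of Transformers.

```python
from typing import Dict, List, Tuple

example = "Nader Jokhadar had given Syria the lead with a well-struck header in the seventh minute."

# Hypothetical token-level predictions in the pipeline's output shape.
# Scores (and the spurious last item) are invented for illustration only.
sample_results: List[Dict] = [
    {"entity": "B-PER", "score": 0.99, "word": "Nader", "start": 0, "end": 5},
    {"entity": "I-PER", "score": 0.98, "word": "Jokhadar", "start": 6, "end": 14},
    {"entity": "B-LOC", "score": 0.97, "word": "Syria", "start": 25, "end": 30},
    {"entity": "B-ORG", "score": 0.31, "word": "header", "start": 59, "end": 65},
]


def confident_mentions(
    text: str, results: List[Dict], min_score: float = 0.5
) -> List[Tuple[str, str]]:
    """Keep predictions scoring at least min_score and recover their surface
    text from the character offsets rather than from the tokenizer's `word`."""
    return [
        (text[r["start"]:r["end"]], r["entity"])
        for r in results
        if r["score"] >= min_score
    ]


mentions = confident_mentions(example, sample_results)
print(mentions)
```

Slicing the input with `start` and `end` avoids having to undo word-piece markers in `word`. A suitable threshold depends on the application; 0.5 here is an arbitrary choice.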

Limitations and bias

This model is limited by its training data, which consists of entity-annotated news articles from a specific span of time. It may therefore not generalize well to other domains or use cases.

Training data

The training data for the 10 languages are from:

Language     Dataset
Arabic       ANERcorp
German       CoNLL 2003
English      CoNLL 2003
Spanish      CoNLL 2002
French       Europeana Newspapers
Italian      Italian I-CAB
Latvian      Latvian NER
Dutch        CoNLL 2002
Portuguese   Paramopama + Second Harem
Chinese      MSRA

The training dataset distinguishes between the beginning and continuation of an entity so that if there are back-to-back entities of the same type, the model can output where the second entity begins. As in the dataset, each token will be classified as one of the following classes:

Abbreviation   Description
O              Outside of a named entity
B-PER          Beginning of a person’s name right after another person’s name
I-PER          Person’s name
B-ORG          Beginning of an organisation right after another organisation
I-ORG          Organisation
B-LOC          Beginning of a location right after another location
I-LOC          Location
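To illustrate how the B- and I- prefixes separate back-to-back entities, here is a minimal sketch of a decoder that turns a tag sequence into entity spans. It assumes tags follow the scheme in the table above, where an I- tag may also open an entity; the function name `decode_entities` and the sample tokens and tags are hypothetical and are not model output.

```python
from typing import List, Tuple


def decode_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group tagged tokens into (text, type) spans.

    A new span starts on any B- tag, or on an I- tag whose type differs
    from the span currently being built. An O tag closes the current span.
    """
    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type = None

    for token, tag in zip(tokens, tags):
        if tag == "O":
            prefix, label = "O", None
        else:
            prefix, label = tag.split("-", 1)

        starts_new = label is not None and (prefix == "B" or label != current_type)
        if label is None or starts_new:
            # Close the span in progress, if any
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None

        if label is not None:
            current_tokens.append(token)
            current_type = label

    # Flush a span that runs to the end of the sequence
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


# Hypothetical example: two adjacent locations, then a two-token person name
tokens = ["Paris", "London", "welcomed", "Nader", "Jokhadar"]
tags = ["I-LOC", "B-LOC", "O", "I-PER", "I-PER"]
spans = decode_entities(tokens, tags)
print(spans)
```

Without the B-LOC tag on "London", the two locations would be merged into a single span, which is the ambiguity the beginning/continuation distinction is meant to resolve.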

Training procedure

This model was trained on a single NVIDIA V100 GPU with the recommended hyperparameters from the Hugging Face code.

Model size: 135M params (Safetensors, tensor type F32)
