Hungarian named entity recognition model with OntoNotes 5 and additional entity types

Pretrained model used: SZTAKI-HLT/hubert-base-cc
Fine-tuned on: the NerKor+CARS-ONPP (NerKor+CARS-OntoNotes++) corpus

Limitations

max_seq_length = 448 (inputs longer than this limit need to be truncated or split)
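
One common way to work within a sequence length limit like this is to split long inputs into overlapping windows and tag each window separately. The following is only a minimal sketch: the helper name `split_into_windows` and the overlap of 32 are illustrative assumptions, not part of this model's code. It also assumes the limit is counted in tokenizer (subword) tokens and ignores any special tokens the tokenizer adds, so a real pipeline may need a slightly smaller window.

```python
def split_into_windows(tokens, max_len=448, overlap=32):
    """Split a token list into windows of at most max_len items.

    Consecutive windows share `overlap` tokens, which gives the tagger
    some context on both sides of each window boundary.
    max_len=448 mirrors the limit stated in this card; overlap=32 is
    an arbitrary illustrative value.
    """
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be in [0, max_len)")
    step = max_len - overlap
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + max_len])
        # Stop once the current window reaches the end of the input.
        if start + max_len >= len(tokens):
            break
        start += step
    return windows


if __name__ == "__main__":
    dummy_tokens = [f"tok{i}" for i in range(1000)]
    for window in split_into_windows(dummy_tokens):
        print(len(window), window[0], window[-1])
```

Predictions for tokens that appear in two windows then have to be reconciled, for example by keeping the prediction from the window in which the token is farther from the boundary.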

Training data

The underlying corpus, NerKor+CARS-OntoNotes++, was derived from NYTK-NerKor, a Hungarian gold-standard named-entity-annotated corpus containing about 1 million tokens. It also includes a small addition of 12k tokens of text (individual sentences) about motor vehicles (cars, buses, motorcycles) from the news archive of hvg.hu. While the annotation in NYTK-NerKor followed the CoNLL-2002 labelling standard with just four NE categories (PER, LOC, MISC, ORG), this version of the corpus features over 30 entity types, including all entity types used in the OntoNotes 5.0 English NER annotation. The new annotation distinguishes subtypes of the LOC and MISC entity types and covers non-name expressions such as times and dates, quantities, languages, and nationalities or religious or political groups. It was further extended with entity subtypes not present in the OntoNotes 5.0 annotation (see below).

Tags derived from the OntoNotes 5.0 annotation

Names are annotated according to the following set of types:


`PER`	(= PERSON) People, including fictional
`FAC`	(= FACILITY) Buildings, airports, highways, bridges, etc.
`ORG`	(= ORGANIZATION) Companies, agencies, institutions, etc.
`GPE`	Geopolitical entities: countries, cities, states
`LOC`	(= LOCATION) Non-GPE locations, mountain ranges, bodies of water
`PROD`	(= PRODUCT) Vehicles, weapons, foods, etc. (Not services)
`EVENT`	Named hurricanes, battles, wars, sports events, etc.
`WORK_OF_ART`	Titles of books, songs, etc.
`LAW`	Named documents made into laws

The following are also annotated in a style similar to names:


`NORP`	Nationalities or religious or political groups
`LANGUAGE`	Any named language
`DATE`	Absolute or relative dates or periods
`TIME`	Times smaller than a day
`PERCENT`	Percentage (including "%")
`MONEY`	Monetary values, including unit
`QUANTITY`	Measurements, as of weight or distance
`ORDINAL`	"first", "second"
`CARDINAL`	Numerals that do not fall under another type
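
Token-level taggers usually attach these types to individual tokens, so the output has to be grouped into entity spans. The sketch below assumes a BIO-style encoding (`B-TYPE` opens an entity, `I-TYPE` continues it, `O` is outside any entity); this card does not state the encoding the model uses, so treat that as an assumption. The function name `bio_to_spans` and the example sentence with its tags are made up for illustration and are not model output.

```python
def bio_to_spans(tokens, tags):
    """Group BIO tags into (entity_type, start, end, text) tuples.

    `end` is exclusive. An I- tag that does not continue the current
    entity (wrong type, or no open entity) starts a new span.
    """
    spans = []
    start, etype = None, None
    # The trailing "O" is a sentinel that flushes an entity ending at
    # the last token.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.append((etype, start, i, " ".join(tokens[start:i])))
            start, etype = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, etype = i, tag[2:]
    return spans


if __name__ == "__main__":
    # Hand-written illustrative example, not produced by the model.
    tokens = ["Novák", "Attila", "2022", "júniusában", "Marseille-ben", "adott", "elő"]
    tags = ["B-PER", "I-PER", "B-DATE", "I-DATE", "B-GPE", "O", "O"]
    for span in bio_to_spans(tokens, tags):
        print(span)
```

Because the type is taken as everything after the two-character prefix, hyphenated tag names are handled without special casing.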

Additional tags (not in OntoNotes 5)

Further subtypes of names of type MISC:


`AWARD`	Awards and prizes
`CAR`	Cars and other motor vehicles
`MEDIA`	Media outlets, TV channels, news portals
`SMEDIA`	Social media platforms
`PROJ`	Projects and initiatives
`MISC`	Unresolved subtypes of MISC entities
`MISC-ORG`	Organization-like unresolved subtypes of MISC entities

Further non-name entities:


`DUR`	Time duration
`AGE`	Age
`ID`	Identifier
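
For downstream code it can help to have the tag set in machine-readable form. The sketch below collects the tags listed in this card and expands them into a BIO label inventory. The BIO expansion, the variable names, and the label order are assumptions made for illustration; the label mapping shipped with the model's configuration is authoritative and may differ in order or content.

```python
# Tags copied from the lists in this card.
ONTONOTES_NAME_TAGS = ["PER", "FAC", "ORG", "GPE", "LOC", "PROD",
                       "EVENT", "WORK_OF_ART", "LAW"]
ONTONOTES_OTHER_TAGS = ["NORP", "LANGUAGE", "DATE", "TIME", "PERCENT",
                        "MONEY", "QUANTITY", "ORDINAL", "CARDINAL"]
EXTRA_NAME_TAGS = ["AWARD", "CAR", "MEDIA", "SMEDIA", "PROJ",
                   "MISC", "MISC-ORG"]
EXTRA_OTHER_TAGS = ["DUR", "AGE", "ID"]

ALL_TAGS = (ONTONOTES_NAME_TAGS + ONTONOTES_OTHER_TAGS
            + EXTRA_NAME_TAGS + EXTRA_OTHER_TAGS)


def build_bio_labels(tags):
    """Expand entity types into 'O' plus a B-/I- pair per type (assumed scheme)."""
    labels = ["O"]
    for tag in tags:
        labels += [f"B-{tag}", f"I-{tag}"]
    return labels


LABELS = build_bio_labels(ALL_TAGS)
# Hypothetical id mapping; not necessarily the one used by the model.
LABEL2ID = {label: i for i, label in enumerate(LABELS)}

if __name__ == "__main__":
    print(len(ALL_TAGS), "entity types listed,", len(LABELS), "BIO labels")
```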

If you use this model, please cite:

@inproceedings{novak-novak-2022-nerkor,
    title = "{N}er{K}or+{C}ars-{O}nto{N}otes++",
    author = "Nov{\'a}k, Attila  and
      Nov{\'a}k, Borb{\'a}la",
    booktitle = "Proceedings of the Thirteenth Language Resources and Evaluation Conference",
    month = jun,
    year = "2022",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2022.lrec-1.203",
    pages = "1907--1916",
    abstract = "In this paper, we present an upgraded version of the Hungarian NYTK-NerKor named entity corpus, which contains about twice as many annotated spans and 7 times as many distinct entity types as the original version. We used an extended version of the OntoNotes 5 annotation scheme including time and numerical expressions. NerKor is the newest and biggest NER corpus for Hungarian containing diverse domains. We applied cross-lingual transfer of NER models trained for other languages based on multilingual contextual language models to preannotate the corpus. We corrected the annotation semi-automatically and manually. Zero-shot preannotation was very effective with about 0.82 F1 score for the best model. We also added a 12000-token subcorpus on cars and other motor vehicles. We trained and release a transformer-based NER tagger for Hungarian using the annotation in the new corpus version, which provides similar performance to an identical model trained on the original version of the corpus.",
}