arxiv:2405.00155

HistNERo: Historical Named Entity Recognition for the Romanian Language

Published on Apr 30

Abstract

This work introduces HistNERo, the first Romanian corpus for Named Entity Recognition (NER) in historical newspapers. The dataset contains 323k tokens of text and spans from 1817, covering more than half of the 19th century, to 1990, late in the 20th century. Eight native Romanian speakers annotated the dataset with five named entity types. Each sample belongs to one of four historical regions of Romania: Bessarabia, Moldavia, Transylvania, or Wallachia. We used the dataset to run several NER experiments with Romanian pre-trained language models. Our results show that the best model achieved a strict F1-score of 55.69%. Furthermore, by reducing the discrepancies between regions through a novel domain adaptation technique, we improved the performance on this corpus to a strict F1-score of 66.80%, an absolute gain of more than 10 percentage points.
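
For readers unfamiliar with the metric, the following is a minimal sketch of how a strict (exact-match) entity-level F1 score is commonly computed: a predicted entity counts as correct only if both its boundaries and its type match a gold entity. It is not the authors' evaluation code. The BIO tagging scheme, the function names `bio_to_spans` and `strict_f1`, and the labels `PERS`, `LOC`, and `ORG` are assumptions made for illustration.

```python
from typing import List, Set, Tuple

# (start token index, end token index exclusive, entity type)
Span = Tuple[int, int, str]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO tag sequence (assumed scheme)."""
    spans: Set[Span] = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes a trailing span
        # An open span ends at "O", at any "B-", or at an "I-" of another type.
        if start is not None and (
            tag == "O" or tag.startswith("B-") or tag[2:] != label
        ):
            spans.add((start, i, label))
            start, label = None, None
        # Open a new span; a stray "I-" without "B-" is leniently treated as a start.
        if tag != "O" and start is None:
            start, label = i, tag[2:]
    return spans


def strict_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-averaged F1 where a prediction must match boundaries and type exactly."""
    tp = n_gold = n_pred = 0
    for g_tags, p_tags in zip(gold, pred):
        g, p = bio_to_spans(g_tags), bio_to_spans(p_tags)
        tp += len(g & p)  # exact (start, end, type) matches only
        n_gold += len(g)
        n_pred += len(p)
    if tp == 0:
        return 0.0
    precision = tp / n_pred
    recall = tp / n_gold
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the second entity has the right boundaries but the wrong type.
gold = [["B-PERS", "I-PERS", "O", "B-LOC"]]
pred = [["B-PERS", "I-PERS", "O", "B-ORG"]]
score = strict_f1(gold, pred)
```

In this toy example one of the two predicted entities matches exactly, so precision and recall are both one half. Under the same kind of metric, the gain reported in the abstract is 66.80 - 55.69 = 11.11 percentage points.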

Community

Hey @avramandrei, very interesting paper and a great new resource for Romanian!

I will definitely extend my hmBench with this dataset!

By the way, do you know of any additional publicly available corpora for Historical Romanian? I am thinking of "Public Domain" datasets like the ones @Pclanglais and team are collecting, e.g. the Common Corpus collection.

If such corpora are available, I would love to extend my historical multilingual language models with them :)

Paper author

@stefan-it Thank you! The RODICA dataset (from which the documents used in this dataset were collected) is the only Historical Romanian corpus that I know of. Hope it helps you! :)

Good work!

Paper author

Thanks! :)
