
Model Card for ehri-ner/xlm-roberta-large-ehri-ner-all

The European Holocaust Research Infrastructure (EHRI) aims to support Holocaust research by making information about dispersed Holocaust material accessible and interconnected through its services. Creating a tool capable of detecting named entities in texts such as Holocaust testimonies or archival descriptions would make it easier to link more material with relevant identifiers in domain-specific controlled vocabularies, semantically enriching it and making it more discoverable. The xlm-roberta-large-ehri-ner-all model fine-tunes XLM-RoBERTa (XLM-R) for Holocaust-related Named Entity Recognition (NER) using the EHRI-NER dataset, a multilingual dataset (Czech, German, English, French, Hungarian, Dutch, Polish, Slovak, Yiddish) for NER in Holocaust-related texts. The EHRI-NER dataset is built by aggregating all the annotated documents in the EHRI Online Editions and converting them to a format suitable for training NER models. The results of our experiments show that despite our relatively small dataset, in a multilingual experiment setup, the overall F1 score achieved by XLM-R fine-tuned on multilingual annotations is 81.5%.

Model Description

  • Developed by: Dermentzi, M. & Scheithauer, H.
  • Funded by: European Commission call H2020-INFRAIA-2018–2020. Grant agreement ID 871111. DOI 10.3030/871111.
  • Language(s) (NLP): The model was fine-tuned on cs, de, en, fr, hu, nl, pl, sk, yi data but it may work for more languages due to the use of a multilingual base model (XLM-R) with cross-lingual transfer capabilities.
  • License: EUPL-1.2
  • Finetuned from model: FacebookAI/xlm-roberta-large

Uses

This model was developed for research purposes in the context of the EHRI-3 project. Specifically, the aim was to determine whether a single model can be trained to recognize entities across different document types and languages in Holocaust-related texts. Despite our relatively small dataset, XLM-R fine-tuned on multilingual Holocaust-related annotations achieves an overall F1 score of 81.5% in a multilingual experiment setup. We argue that this score is sufficiently high to consider the next steps towards deploying this model, i.e., receiving more feedback from the EHRI community.

Once we have a stable model that EHRI stakeholders are satisfied with, this model and its potential successors are intended to be used as part of an EHRI editorial pipeline. When text is entered into a tool that supports our model, potential named entities in it will be automatically pre-annotated, helping our intended users (i.e., researchers and professional archivists) detect them faster and link them to the corresponding entries in the custom EHRI controlled vocabularies and authority sets. This has the potential to facilitate metadata enrichment of descriptions in the EHRI Portal and enhance their discoverability. It would also make it easier for EHRI to develop new Online Editions and unlock new ways for archivists and researchers within the EHRI network to organize, analyze, and present their materials and research data in ways that would otherwise require a lot of manual work.
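The pre-annotation step described above could be sketched roughly as follows. This is a minimal illustration, not the EHRI editorial pipeline itself: it assumes the Hugging Face transformers library is installed, and the function names (load_tagger, pre_annotate) and the [text|LABEL] markup are hypothetical choices made for this example.

```python
from typing import Dict, List

MODEL_ID = "ehri-ner/xlm-roberta-large-ehri-ner-all"


def load_tagger():
    """Build a token-classification pipeline for the model (downloads weights)."""
    # Imported lazily so the formatting helper below works without transformers.
    from transformers import pipeline

    return pipeline(
        "token-classification",
        model=MODEL_ID,
        aggregation_strategy="simple",  # merge sub-word pieces into entity spans
    )


def pre_annotate(text: str, entities: List[Dict]) -> str:
    """Wrap each predicted span as [surface form|LABEL] for an editor to review.

    `entities` is expected in the shape the pipeline returns with aggregation:
    dicts with "entity_group", "start", and "end" (character offsets).
    """
    out, cursor = [], 0
    for ent in sorted(entities, key=lambda e: e["start"]):
        if ent["start"] < cursor:  # skip spans that overlap an earlier one
            continue
        out.append(text[cursor:ent["start"]])
        out.append(f"[{text[ent['start']:ent['end']]}|{ent['entity_group']}]")
        cursor = ent["end"]
    out.append(text[cursor:])
    return "".join(out)
```

With the weights available, `tagger = load_tagger()` followed by `pre_annotate(text, tagger(text))` would return the input text with the model's predicted spans marked up for manual review and linking. Building the tagger once and reusing it avoids reloading the model for every document.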

Limitations

The dataset used to fine-tune this model stems from a series of manually annotated digital scholarly editions, the EHRI Online Editions. The original purpose of these editions was not to provide a dataset for training NER models, although we argue that they nevertheless constitute a high-quality resource that is suitable to be used in this way. However, users should still be mindful that our dataset repurposes a resource that was not built for this purpose.

The fine-tuned model occasionally misclassifies entity tokens as non-entity tokens, with I-GHETTO being the label most often confused in this way. It also occasionally struggles to extract multi-token entities: the continuation tags I-CAMP, I-LOC, and I-ORG are sometimes confused with the corresponding beginning-of-entity tags. Moreover, it tends to misclassify B-GHETTO and B-CAMP as B-LOC, which is not surprising given that they are semantically close.
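To make the effect of these tag confusions concrete, the sketch below decodes a sequence of B-/I-/O tags into entity spans. It is an illustrative helper written for this card, not the authors' evaluation code; the function name bio_to_spans and the lenient handling of a stray I- tag (treating it as the start of a span) are assumptions.

```python
from typing import List, Optional, Tuple

Span = Tuple[str, int, int]  # (label, first token index, last token index + 1)


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Group per-token BIO tags such as B-CAMP / I-CAMP / O into entity spans."""
    spans: List[Span] = []
    label: Optional[str] = None
    start = 0
    for i, tag in enumerate(tags + ["O"]):  # trailing "O" flushes the last span
        prefix, _, name = tag.partition("-")
        continues = prefix == "I" and name == label
        if label is not None and not continues:
            spans.append((label, start, i))  # close the currently open span
            label = None
        if prefix == "B" or (prefix == "I" and label is None):
            # A stray I- tag with no matching open span is treated as a start.
            label, start = name, i
    return spans


# A two-token camp name tagged correctly yields one span ...
correct = bio_to_spans(["O", "B-CAMP", "I-CAMP", "O"])
# ... but if the second token is tagged B-CAMP instead of I-CAMP,
# the same name is split into two single-token spans.
split = bio_to_spans(["O", "B-CAMP", "B-CAMP", "O"])
```

Here `correct` contains a single CAMP span covering tokens 1 to 2, while `split` contains two separate CAMP spans, which is how an I-/B- confusion can turn one multi-token entity into several shorter ones.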

This model was envisioned to work as part of EHRI-related editorial and publishing pipelines and may not be suitable for the purposes of other users/organizations.

Recommendations

For more information, we encourage potential users to read the paper accompanying this model: Dermentzi, M., & Scheithauer, H. (2024, May). Repurposing Holocaust-Related Digital Scholarly Editions to Develop Multilingual Domain-Specific Named Entity Recognition Tools. LREC-COLING 2024 - Joint International Conference on Computational Linguistics, Language Resources and Evaluation. HTRes@LREC-COLING 2024, Torino, Italy. https://hal.science/hal-04547222

Citation

BibTeX:

@inproceedings{dermentzi_repurposing_2024,
  address = {Torino, Italy},
  title = {Repurposing {Holocaust}-{Related} {Digital} {Scholarly} {Editions} to {Develop} {Multilingual} {Domain}-{Specific} {Named} {Entity} {Recognition} {Tools}},
  url = {https://hal.science/hal-04547222},
  abstract = {The European Holocaust Research Infrastructure (EHRI) aims to support Holocaust research by making information about dispersed Holocaust material accessible and interconnected through its services. Creating a tool capable of detecting named entities in texts such as Holocaust testimonies or archival descriptions would make it easier to link more material with relevant identifiers in domain-specific controlled vocabularies, semantically enriching it, and making it more discoverable. With this paper, we release EHRI-NER, a multilingual dataset (Czech, German, English, French, Hungarian, Dutch, Polish, Slovak, Yiddish) for Named Entity Recognition (NER) in Holocaust-related texts. EHRI-NER is built by aggregating all the annotated documents in the EHRI Online Editions and converting them to a format suitable for training NER models. We leverage this dataset to fine-tune the multilingual Transformer-based language model XLM-RoBERTa (XLM-R) to determine whether a single model can be trained to recognize entities across different document types and languages. The results of our experiments show that despite our relatively small dataset, in a multilingual experiment setup, the overall F1 score achieved by XLM-R fine-tuned on multilingual annotations is 81.5\%. We argue that this score is sufficiently high to consider the next steps towards deploying this model.},
  urldate = {2024-04-29},
  booktitle = {{LREC}-{COLING} 2024 - {Joint} {International} {Conference} on {Computational} {Linguistics}, {Language} {Resources} and {Evaluation}},
  publisher = {ELRA Language Resources Association (ELRA); International Committee on Computational Linguistics (ICCL)},
  author = {Dermentzi, Maria and Scheithauer, Hugo},
  month = may,
  year = {2024},
  keywords = {Digital Editions, Holocaust Testimonies, Multilingual, Named Entity Recognition, Transfer Learning, Transformers},
}

APA: Dermentzi, M., & Scheithauer, H. (2024, May). Repurposing Holocaust-Related Digital Scholarly Editions to Develop Multilingual Domain-Specific Named Entity Recognition Tools. LREC-COLING 2024 - Joint International Conference on Computational Linguistics, Language Resources and Evaluation. HTRes@LREC-COLING 2024, Torino, Italy. https://hal.science/hal-04547222

Model size: 559M params (Safetensors, tensor type F32)
