---
license: eupl-1.1
datasets:
- ehri-ner/ehri-ner-all
language:
- cs
- de
- en
- fr
- hu
- nl
- pl
- sk
- yi
metrics:
- name: f1
  type: f1
  value: 81.5
pipeline_tag: token-classification
tags:
- Holocaust
- EHRI
base_model: FacebookAI/xlm-roberta-large
---
# Model Card for ehri-ner/xlm-roberta-large-ehri-ner-all
<!-- Provide a quick summary of what the model is/does. -->
The European Holocaust Research Infrastructure (EHRI) aims to support Holocaust research by making information
about dispersed Holocaust material accessible and interconnected through its services. Creating a tool capable of
detecting named entities in texts such as Holocaust testimonies or archival descriptions would make it easier to
link more material with relevant identifiers in domain-specific controlled vocabularies, semantically enriching it, and
making it more discoverable. The xlm-roberta-large-ehri-ner-all model fine-tunes XLM-RoBERTa (XLM-R) for Holocaust-related Named Entity Recognition (NER)
using the EHRI-NER dataset, a multilingual dataset (Czech, German, English, French, Hungarian, Dutch, Polish, Slovak, Yiddish) for NER in Holocaust-related texts.
The EHRI-NER dataset is built by aggregating all the annotated documents in the EHRI Online Editions and converting them to a
format suitable for training NER models. Despite our relatively small dataset, XLM-R fine-tuned on multilingual annotations
achieves an overall F1 score of 81.5% in a multilingual experimental setup.
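
For readers unfamiliar with the metric, the sketch below shows how an F1 score can be computed over entity spans. It is an illustration only: this card does not say which scorer produced the 81.5% figure, so the exact-match span criterion, the `entity_f1` helper, and the toy spans are all assumptions.

```python
def entity_f1(gold: set, predicted: set) -> float:
    """Exact-match F1 over (start, end, label) spans.

    Assumption: a prediction counts only if boundaries AND label both match.
    The scorer actually used for the reported score is not stated in this card.
    """
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Toy spans, invented for illustration: one CAMP span is predicted as LOC.
gold = {(0, 2, "ORG"), (5, 6, "CAMP"), (9, 10, "LOC")}
predicted = {(0, 2, "ORG"), (5, 6, "LOC"), (9, 10, "LOC")}
score = entity_f1(gold, predicted)
print(f"F1 = {score:.3f}")
```

Under this strict criterion, a CAMP span labelled as LOC counts both as a missed entity and as a spurious one, which is why label confusions lower precision and recall at the same time.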
### Model Description
<!-- Provide a longer summary of what this model is. -->
- **Developed by:** Dermentzi, M. & Scheithauer, H.
- **Funded by:** European Commission call H2020-INFRAIA-2018–2020. Grant agreement ID 871111. DOI 10.3030/871111.
- **Language(s) (NLP):** The model was fine-tuned on cs, de, en, fr, hu, nl, pl, sk, and yi data, but it may also work for other languages because its multilingual base model (XLM-R) has cross-lingual transfer capabilities.
- **License:** EUPL-1.2
- **Finetuned from model:** FacebookAI/xlm-roberta-large
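
A minimal sketch of loading the model with the Hugging Face `transformers` token-classification pipeline follows. The model ID is taken from this card's title; the helper names `load_tagger` and `summarize` and the mock output are hypothetical and exist only for illustration.

```python
from __future__ import annotations

MODEL_ID = "ehri-ner/xlm-roberta-large-ehri-ner-all"  # from this card's title


def load_tagger(model_id: str = MODEL_ID):
    """Build a token-classification pipeline (downloads weights on first use)."""
    # Imported lazily because transformers is a heavy, optional dependency.
    from transformers import pipeline

    # aggregation_strategy="simple" merges word pieces into whole entity spans
    # and reports the label without its B-/I- prefix under "entity_group".
    return pipeline(
        "token-classification", model=model_id, aggregation_strategy="simple"
    )


def summarize(entities: list[dict]) -> list[tuple[str, str, float]]:
    """Reduce pipeline output to (label, surface text, rounded score) triples."""
    return [
        (e["entity_group"], e["word"], round(float(e["score"]), 2))
        for e in entities
    ]


# Hand-written stand-in with the pipeline's output keys; NOT real model output.
mock_output = [
    {"entity_group": "CAMP", "word": "Auschwitz", "score": 0.9, "start": 0, "end": 9}
]
summary = summarize(mock_output)

# Real usage (needs network access and the transformers package):
# tagger = load_tagger()
# print(summarize(tagger("Some Holocaust-related text.")))
```

The download is kept behind `load_tagger` so that the helper can be tried on mock data without fetching the model.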
<!-- ### Model Sources [optional]
<!-- Provide the basic links for the model.
- **Repository:** [More Information Needed]
- **Paper [optional]:** [More Information Needed]
-->
## Uses
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
This model was developed for research purposes in the context of the EHRI-3 project. Specifically, the aim was to determine
whether a single model can be trained to recognize entities across different document types and languages in Holocaust-related texts.
Despite our relatively small dataset, XLM-R fine-tuned on multilingual Holocaust-related annotations achieves an overall F1 score of 81.5%
in a multilingual experimental setup. We argue that this score is sufficiently high to consider the next steps
towards deploying this model, i.e., gathering more feedback from the EHRI community. Once we have a stable model that EHRI stakeholders are
satisfied with, this model and its potential successors are intended to be used as part of an EHRI editorial pipeline.
When text is entered into a tool that supports our model, potential named entities in it will be automatically pre-annotated,
helping our intended users (i.e., researchers and professional archivists) detect them faster and link them to the corresponding entries in the
custom EHRI controlled vocabularies and authority sets. This has the potential to facilitate metadata enrichment of descriptions
in the EHRI Portal and enhance their discoverability. It would also make it easier for EHRI to develop new Online Editions and
unlock new ways for archivists and researchers within the EHRI network to organize,
analyze, and present their materials and research data in ways that would otherwise require a lot of manual work.
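
The pre-annotation step described above could be sketched as follows. This is not the EHRI editorial tooling: the bracketed `[LABEL: text]` markup, the helpers `pre_annotate` and `span`, and the example sentence are hypothetical. The only thing borrowed from real software is the shape of the entity dictionaries (`entity_group`, `start`, `end`), which matches what the `transformers` token-classification pipeline returns when an aggregation strategy is set.

```python
from __future__ import annotations


def pre_annotate(text: str, entities: list[dict]) -> str:
    """Wrap each predicted span in a [LABEL: ...] marker for human review.

    Spans are processed right to left so that the character offsets of
    earlier spans stay valid while the string grows.
    """
    annotated = text
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        start, end, label = ent["start"], ent["end"], ent["entity_group"]
        annotated = f"{annotated[:start]}[{label}: {annotated[start:end]}]{annotated[end:]}"
    return annotated


def span(text: str, surface: str, label: str) -> dict:
    """Build a mock entity dict by locating `surface` in `text` (illustration only)."""
    start = text.index(surface)
    return {"entity_group": label, "start": start, "end": start + len(surface)}


# Illustrative sentence and hand-made predictions; NOT real model output.
text = "She was deported from Terezín to Auschwitz."
entities = [span(text, "Terezín", "GHETTO"), span(text, "Auschwitz", "CAMP")]
print(pre_annotate(text, entities))
```

An editor would then confirm or correct each bracketed candidate and link it to a controlled vocabulary entry; that linking step is outside the scope of this sketch.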
## Limitations
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
The dataset used to fine-tune this model stems from a series of manually annotated digital scholarly editions, the EHRI Online Editions.
These editions were not originally created to provide a dataset for training NER models, although we argue that they nevertheless
constitute a high-quality resource that is suitable for this use. Users should still be mindful that
our dataset repurposes a resource that was not built for this purpose.
The fine-tuned model occasionally misclassifies entity tokens as non-entity tokens, with I-GHETTO being the most frequently confused label.
It also occasionally struggles to extract multi-token entities: inside tags such as I-CAMP, I-LOC, and I-ORG
are sometimes confused with the beginning of an entity. Moreover, it tends to misclassify B-GHETTO
and B-CAMP as B-LOC, which is not surprising given that they are semantically close.
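
Because beginning and inside tags are sometimes confused, downstream code may want to decode the tag sequence leniently. The sketch below is one possible approach and not part of the released model: the `decode_bio` helper and the example tag sequence are hypothetical, and only the label names come from this card.

```python
from __future__ import annotations


def decode_bio(tags: list[str]) -> list[tuple[int, int, str]]:
    """Turn BIO tags into (start, end_exclusive, label) spans.

    Lenient rule (an assumption, not the authors' method): an I- tag that
    does not continue a span of the same label opens a new span instead of
    being dropped.
    """
    spans: list[tuple[int, int, str]] = []
    start, label = None, None
    for i, tag in enumerate(tags):
        prefix, _, tag_label = tag.partition("-")
        if prefix == "I" and tag_label == label:
            continue  # still inside the current span
        if label is not None:
            spans.append((start, i, label))  # close the span that just ended
        if prefix in ("B", "I"):
            start, label = i, tag_label  # open a new span, even on a stray I-
        else:
            start, label = None, None  # "O" or anything unrecognised
    if label is not None:
        spans.append((start, len(tags), label))
    return spans


# Invented tag sequence: the CAMP entity starts with I- instead of B-.
tags = ["O", "I-CAMP", "I-CAMP", "O", "B-GHETTO", "I-GHETTO", "B-LOC"]
spans = decode_bio(tags)
print(spans)
```

A strict decoder would discard the stray `I-CAMP` run entirely. The lenient rule keeps it as a candidate span, which may suit a pre-annotation workflow where a human reviews every suggestion.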
This model was envisioned to work as part of EHRI-related editorial and publishing pipelines and may not be suitable for
the purposes of other users/organizations.
### Recommendations
For more information, we encourage potential users to read the paper accompanying this model:
Dermentzi, M., & Scheithauer, H. (2024, May). Repurposing Holocaust-Related Digital Scholarly Editions to Develop Multilingual Domain-Specific Named Entity Recognition Tools. LREC-COLING 2024 - Joint International Conference on Computational Linguistics, Language Resources and Evaluation. HTRes@LREC-COLING 2024, Torino, Italy. https://hal.science/hal-04547222
## Citation
**BibTeX:**
@inproceedings{dermentzi_repurposing_2024,
address = {Torino, Italy},
title = {Repurposing {Holocaust}-{Related} {Digital} {Scholarly} {Editions} to {Develop} {Multilingual} {Domain}-{Specific} {Named} {Entity} {Recognition} {Tools}},
url = {https://hal.science/hal-04547222},
abstract = {The European Holocaust Research Infrastructure (EHRI) aims to support Holocaust research by making information about dispersed Holocaust material accessible and interconnected through its services. Creating a tool capable of detecting named entities in texts such as Holocaust testimonies or archival descriptions would make it easier to link more material with relevant identifiers in domain-specific controlled vocabularies, semantically enriching it, and making it more discoverable. With this paper, we release EHRI-NER, a multilingual dataset (Czech, German, English, French, Hungarian, Dutch, Polish, Slovak, Yiddish) for Named Entity Recognition (NER) in Holocaust-related texts. EHRI-NER is built by aggregating all the annotated documents in the EHRI Online Editions and converting them to a format suitable for training NER models. We leverage this dataset to fine-tune the multilingual Transformer-based language model XLM-RoBERTa (XLM-R) to determine whether a single model can be trained to recognize entities across different document types and languages. The results of our experiments show that despite our relatively small dataset, in a multilingual experiment setup, the overall F1 score achieved by XLM-R fine-tuned on multilingual annotations is 81.5\%. We argue that this score is sufficiently high to consider the next steps towards deploying this model.},
urldate = {2024-04-29},
booktitle = {{LREC}-{COLING} 2024 - {Joint} {International} {Conference} on {Computational} {Linguistics}, {Language} {Resources} and {Evaluation}},
publisher = {ELRA Language Resources Association (ELRA); International Committee on Computational Linguistics (ICCL)},
author = {Dermentzi, Maria and Scheithauer, Hugo},
month = may,
year = {2024},
keywords = {Digital Editions, Holocaust Testimonies, Multilingual, Named Entity Recognition, Transfer Learning, Transformers},
}
**APA:**
Dermentzi, M., & Scheithauer, H. (2024, May). Repurposing Holocaust-Related Digital Scholarly Editions to Develop Multilingual Domain-Specific Named Entity Recognition Tools. LREC-COLING 2024 - Joint International Conference on Computational Linguistics, Language Resources and Evaluation. HTRes@LREC-COLING 2024, Torino, Italy. https://hal.science/hal-04547222