license: apache-2.0

Model Card for sbb_ned-fr

This model is part of a named entity disambiguation and linking system (NED, NEL). The system was developed by the Berlin State Library (SBB) in the QURATOR project. Questions and comments about the model can be directed to Kai Labusch at kai.labusch@sbb.spk-berlin.de or Clemens Neudecker at clemens.neudecker@sbb.spk-berlin.de.

Model Details

Model Description

This model forms the core of a named entity disambiguation and linking system (NED, NEL) that consists of three components: (i) Lookup of possible candidates in an approximative nearest neighbour (ANN) index that stores BERT embeddings. (ii) Evaluation of each candidate by comparison of text passages of Wikipedia performed by a purpose-trained BERT model. (iii) Final ranking of candidates on the basis of information gathered from previous steps.

This model is used in order to generate the BERT embeddings in step (i) and to perform the comparison of the text passages in step (ii).
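The three steps can be illustrated with a minimal, self-contained sketch. It is not the SBB implementation: exhaustive cosine search stands in for the ANN index, a word-overlap score stands in for the purpose-trained BERT evaluation model, and the mean pair score stands in for the final ranking step. All identifiers, vectors and passages below are made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def lookup_candidates(mention_vec, index, k=2):
    """Step (i): return the k entity ids whose embeddings are closest.

    Assumption: exhaustive search replaces the ANN index of the real system.
    """
    by_similarity = sorted(
        index, key=lambda entity: cosine(mention_vec, index[entity]), reverse=True
    )
    return by_similarity[:k]


def overlap_score(text_a, text_b):
    """Step (ii) stand-in: Jaccard word overlap instead of a BERT pair classifier."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


def rank_candidates(context, candidates, passages):
    """Step (iii) stand-in: order candidates by their mean passage score."""
    scores = {
        entity: sum(overlap_score(context, p) for p in passages[entity])
        / len(passages[entity])
        for entity in candidates
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy knowledge base: embeddings and Wikipedia-like passages.
index = {
    "paris_city": [0.9, 0.1, 0.0],
    "paris_myth": [0.8, 0.0, 0.6],
    "berlin_city": [0.0, 1.0, 0.0],
}
passages = {
    "paris_city": [
        "Paris is the capital of France",
        "The city of Paris lies on the Seine",
    ],
    "paris_myth": ["Paris is a prince of Troy in Greek myth"],
    "berlin_city": ["Berlin is the capital of Germany"],
}

mention_vec = [1.0, 0.1, 0.1]  # made-up embedding of the mention "Paris"
context = "He travelled to Paris the capital of France"

candidates = lookup_candidates(mention_vec, index, k=2)
ranking = rank_candidates(context, candidates, passages)
print(ranking[0][0])  # best-ranked candidate id
```

In the real system, the same BERT model supplies both the embeddings used in step (i) and the pairwise text comparison used in step (ii).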

Uses

Disciplines such as the digital humanities create use cases for text and data mining or the semantic enrichment of full-texts with named entity recognition and linking, e.g., for the reconstruction of historical social networks. NED/NEL opens up new possibilities for improved access to text, knowledge creation and clustering of texts. Linking against Wikidata IDs makes it possible to join the linked texts with the world knowledge provided by Wikidata by means of arbitrary SPARQL queries.
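As an illustration of the last point, the sketch below builds a SPARQL query for a list of linked Wikidata IDs. It only constructs the query string and performs no network access. The helper name build_label_query, the example IDs and the choice of requesting French labels are assumptions made for this example and are not part of the SBB tooling.

```python
import re


def build_label_query(qids, language="fr"):
    """Return a SPARQL query that fetches labels for the given Wikidata IDs."""
    for qid in qids:
        # Guard against malformed ids before interpolating them into the query.
        if not re.fullmatch(r"Q[1-9][0-9]*", qid):
            raise ValueError(f"not a Wikidata item id: {qid!r}")
    if not re.fullmatch(r"[a-z]{2,3}", language):
        raise ValueError(f"unexpected language code: {language!r}")
    values = " ".join(f"wd:{qid}" for qid in qids)
    return (
        "SELECT ?item ?label WHERE {\n"
        f"  VALUES ?item {{ {values} }}\n"
        "  ?item rdfs:label ?label .\n"
        f'  FILTER(LANG(?label) = "{language}")\n'
        "}"
    )


# Example ids as the linking step might produce them (illustrative only).
query = build_label_query(["Q64", "Q90"])
print(query)
```

The resulting string could be sent to a SPARQL endpoint such as the Wikidata Query Service, where the wd: and rdfs: prefixes are predefined.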

Direct Use

The NED/NEL system was developed on the basis of the digitised collections of the Staatsbibliothek zu Berlin -- Berlin State Library. The emphasis of this system is therefore on recognition and disambiguation of entities in historical texts.

Downstream Use

Due to the historical nature of the documents being digitised in libraries, standard methods and procedures from the NLP domain typically require additional adaptation in order to deal successfully with historical spelling variation and the remaining noise resulting from OCR errors. For use on other textual material, e.g. with an emphasis on entities contained in Wikipedias other than the German, English and French ones, significant adaptations have to be performed. In such a case, the methodology used to develop the process, as described in the related papers, can serve as a showcase.

Out-of-Scope Use

Though technically possible, named entity disambiguation and linking does not necessarily work well on contemporary data. This is because the disambiguation process relies on a subset of the entities available in Wikidata. In other words, persons, places, or organizations can only be reliably identified if they are present in the extracted Wikidata subset.

Bias, Risks, and Limitations

The identification and disambiguation of named entities in historical and contemporary texts is a task that contributes to knowledge creation, with the aim of enhancing scientific research and improving the discoverability of information in digitised historical texts. The aim of the development of these models was to improve this knowledge creation process, an endeavour that was not undertaken for profit. The results of the applied models are freely accessible for the users of the digitised collections of the Berlin State Library. Against this backdrop, ethical challenges cannot be identified; rather, improved access and semantic enrichment of the derived full-texts with NER and NEL serve every human being with access to the digital collections of the Berlin State Library. As a limitation, it has to be noted that in historical texts the vast majority of identified and disambiguated persons are white, heterosexual and male, whereas other groups (e.g., those defeated in a war, colonial subjects, and others) are often not mentioned in such texts or are not addressed as identifiable entities with full names.

The knowledge base has been directly derived from Wikidata and Wikipedia in a two-step process. In the first step, relevant entities have been selected by use of appropriate SPARQL queries on the basis of Wikidata. In the second step, for all selected entities relevant text comparison material has been extracted from Wikipedia.

Recommendations

Disambiguation of named entities proves to be challenging beyond the task of automatically identifying entities. Broad variation in the spelling of person and place names, caused by non-normalized orthography and linguistic change, as well as context-dependent changes in the naming of places, adds to this challenge. Historical texts, especially newspapers, contain narrative descriptions and visual representations of minorities and disadvantaged groups without naming them; de-anonymizing such persons and groups is a research task in itself that has only begun to be tackled in the 2020s. The biggest potential for improvement of the NER / NEL / NED system is to be expected from improved OCR performance and improved NEL recall.

Training Details

Training Data

Training data have been made available on Zenodo in the form of an SQLite database of French text snippets. A data card for this data set is available on Zenodo. The English database is available at 10.5281/zenodo.7773746.

Training Procedure

Before entity disambiguation starts, the input text is run through a named entity recognition (NER) system that tags all person (PER), location (LOC) and organization (ORG) entities; see the related NER model on Hugging Face. A BERT-based NER system previously developed at SBB was used; it is described in this paper.

The entity linking and disambiguation works by comparison of continuous text snippets where the entities in question are mentioned. A purpose-trained BERT model (the evaluation model) performs that text comparison task. Therefore, a knowledge base that contains structured information like Wikidata is not sufficient. Rather, additional continuous text is needed where the entities that are part of the knowledge base are discussed, mentioned and referenced. Hence, the knowledge base is derived in such a way that each entity in it has a corresponding Wikipedia page, since the Wikipedia articles contain continuous texts that have been annotated by human authors with references that can serve as ground truth.

Preprocessing

See section above.

Speeds, Sizes, Times

Since the NED models are purpose-trained BERT derivatives, all the speed and performance properties of standard BERT models apply.

The models were trained on a two-class classification task. Given a pair of sentences, the models decide whether or not the two sentences refer to the same entity.

The construction of the training samples is implemented in the data processor that can be found in the GitHub repo.
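To make the task concrete, the following sketch shows one possible way of deriving labelled sentence pairs from sentences grouped by the entity they mention. It is a simplified illustration, not the data processor from the GitHub repo: the function name build_pairs, the sampling strategy and the example sentences are assumptions.

```python
import itertools
import random


def build_pairs(sentences_by_entity, seed=0):
    """Return (sentence_a, sentence_b, label) triples.

    label 1: both sentences mention the same entity.
    label 0: the sentences mention different entities.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible

    # Positive pairs: every combination of two sentences about one entity.
    positives = [
        (a, b, 1)
        for sentences in sentences_by_entity.values()
        for a, b in itertools.combinations(sentences, 2)
    ]

    # Negative pairs: as many as there are positives, drawn across entities.
    entities = list(sentences_by_entity)
    negatives = []
    if len(entities) >= 2:
        for _ in range(len(positives)):
            first, second = rng.sample(entities, 2)
            negatives.append(
                (
                    rng.choice(sentences_by_entity[first]),
                    rng.choice(sentences_by_entity[second]),
                    0,
                )
            )
    return positives + negatives


# Made-up example sentences, keyed by hypothetical entity ids.
examples = {
    "paris_city": [
        "Paris est la capitale de la France.",
        "Il arriva à Paris en 1848.",
    ],
    "paris_myth": [
        "Pâris enleva Hélène.",
        "Pâris est un prince troyen.",
        "Le jugement de Pâris est un épisode célèbre.",
    ],
}
pairs = build_pairs(examples)
print(len(pairs), "pairs,", sum(label for _, _, label in pairs), "positive")
```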

Training Hyperparameters

The training can be performed by the ned-bert command line tool. After installation of the sbb_ned package, type "ned-bert --help" in order to get more information about its functionality.

The training hyperparameters used can be found in the Makefile. The de-ned-train-2, en-ned-train-1, and fr-ned-train-0 targets were used to train the published models.

Training Results

During training, the data processor that feeds the training process continuously generates new sentence pairs without repetition over the entire training period. The models have been trained for roughly two weeks on a V100 GPU. During the entire training period the cross-entropy training loss was evaluated and continued to decrease.
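For reference, the cross-entropy loss of a two-class classifier for a single sentence pair can be computed as in the sketch below. It assumes that the model outputs two raw scores (logits), one per class, which is a common setup for BERT sequence classification but is not spelled out in this card.

```python
import math


def cross_entropy(logits, label):
    """Negative log-probability of the true class under a softmax over the logits."""
    peak = max(logits)  # subtract the maximum for numerical stability
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[label]


# An undecided model (equal logits) incurs a loss of ln 2.
undecided = cross_entropy([0.0, 0.0], label=1)
# A confident, correct prediction incurs a loss close to zero.
confident = cross_entropy([-4.0, 4.0], label=1)
print(undecided, confident)
```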

Evaluation

A first version of the system was evaluated at CLEF 2020 HIPE. Several lessons learned from that first evaluation were applied to the system and a second evaluation was performed at CLEF 2022 HIPE. The models published here are the ones that have been evaluated in the CLEF 2022 HIPE competition.

Testing Data, Factors and Metrics

Please consult the papers mentioned above. For a more complete overview of the evaluation methodology, see the CLEF HIPE 2020 Overview Paper and the CLEF HIPE 2022 Overview Paper.

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

  • Hardware Type: V100.
  • Hours used: Roughly 1-2 weeks.
  • Cloud Provider: No cloud.
  • Compute Region: Germany.
  • Carbon Emitted: More information needed.
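A rough estimate of this kind has the general form energy consumed times carbon intensity of the electricity used. The sketch below shows the arithmetic only. The power draw and the carbon intensity are placeholder assumptions, not measured values, so the printed number is purely illustrative and must not be read as the actual emissions of this training run.

```python
def estimate_kg_co2(hours, power_kw, kg_co2_per_kwh):
    """Energy used (kWh) multiplied by the carbon intensity of the grid."""
    if min(hours, power_kw, kg_co2_per_kwh) < 0:
        raise ValueError("inputs must be non-negative")
    return hours * power_kw * kg_co2_per_kwh


hours = 14 * 24        # "roughly two weeks" of training, as stated in this card
power_kw = 0.3         # ASSUMPTION: average draw of one V100 near its rated maximum
kg_co2_per_kwh = 0.4   # ASSUMPTION: placeholder grid intensity; look up the real figure

estimate = estimate_kg_co2(hours, power_kw, kg_co2_per_kwh)
print(f"{estimate:.1f} kg CO2eq (illustrative only)")
```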

Technical Specifications

Software

See the information and source code published on GitHub.

Citation

BibTeX:

@inproceedings{labusch_named_2020,
    title = {Named {Entity} {Disambiguation} and {Linking} on {Historic} {Newspaper} {OCR} with {BERT}},
    url = {https://ceur-ws.org/Vol-2696/paper_163.pdf},
    abstract = {In this paper, we propose a named entity disambiguation and linking (NED, NEL) system that consists of three components: (i) Lookup of possible candidates in an approximative nearest neighbour (ANN) index that stores BERT-embeddings. (ii) Evaluation of each candidate by comparison of text passages of Wikipedia performed by a purpose-trained BERT model. (iii) Final ranking of candidates on the basis of information gathered from previous steps. We participated in the CLEF 2020 HIPE NERC-COARSE and NEL-LIT tasks for German, French, and English. The CLEF HIPE 2020 results show that our NEL approach is competitive in terms of precision but has low recall performance due to insufficient knowledge base coverage of the test data.},
    language = {en},
    booktitle = {{CLEF}},
    author = {Labusch, Kai and Neudecker, Clemens},
    year = {2020},
    pages = {14},
}

APA:

(Labusch et al., 2020)

BibTeX:

@inproceedings{labusch_entity_2022,
    title = {Entity {Linking} in {Multilingual} {Newspapers} and {Classical} {Commentaries} with {BERT}},
    url = {http://ceur-ws.org/Vol-3180/paper-85.pdf},
    abstract = {Building on our BERT-based entity recognition and three stage entity linking (EL) system [1] that we evaluated in the CLEF HIPE 2020 challenge [2], we focused in the CLEF HIPE 2022 challenge [3] on the entity linking part by participation in the EL-only tasks. We submitted results for the multilingual newspaper challenge (MNC), the multilingual classical commentary challenge (MCC), and the global adaptation challenge (GAC). This working note presents the most important modifications of the entity linking system in comparison to the HIPE 2020 approach and the additional results that have been obtained on the ajmc, hipe2020, newseye, topres19th, and sonar datasets for German, French, and English. The results show that our entity linking approach can be applied to a broad range of text categories and qualities without heavy adaptation and reveals qualitative differences of the impact of hyperparameters on our system that need further investigation.},
    language = {en},
    booktitle = {{CLEF}},
    author = {Labusch, Kai and Neudecker, Clemens},
    year = {2022},
    pages = {11},
}

APA:

(Labusch et al., 2022)

More Information

A demo of the named entity recognition and disambiguation tool can be found here. Please note that the PPN (Pica Production Number) found in the link can be replaced by the PPN of any other work in the digitised collections of the Staatsbibliothek zu Berlin / Berlin State Library, provided that a full text of this work is available.

MD5 hash of the French pytorch_model.bin:

8158d8b6df6ef4e86779c288d8f61a70

SHA256 hash of the French pytorch_model.bin:

a206e4a7acd48414e04595ec9924089b944a161866114e675921dce9e630e5e4
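A downloaded copy of the model file can be checked against these hashes, for example with the following sketch. The local file name in model_path is an assumption; adjust it to wherever the file was saved.

```python
import hashlib
import os

# Hashes as published in this model card.
EXPECTED_MD5 = "8158d8b6df6ef4e86779c288d8f61a70"
EXPECTED_SHA256 = "a206e4a7acd48414e04595ec9924089b944a161866114e675921dce9e630e5e4"


def file_digests(path, chunk_size=1 << 20):
    """Return (md5, sha256) hex digests of a file, reading it in chunks."""
    md5, sha256 = hashlib.md5(), hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
            sha256.update(chunk)
    return md5.hexdigest(), sha256.hexdigest()


model_path = "pytorch_model.bin"  # assumed local download location
if os.path.exists(model_path):
    md5_hex, sha256_hex = file_digests(model_path)
    print("MD5 ok:", md5_hex == EXPECTED_MD5)
    print("SHA256 ok:", sha256_hex == EXPECTED_SHA256)
else:
    print(f"{model_path} not found; download it first")
```

Reading in chunks keeps memory use low, which matters for model files of several hundred megabytes.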

Model Card Authors

Kai Labusch and Jörg Lehmann

Model Card Contact

Questions and comments about the model can be directed to Kai Labusch at kai.labusch@sbb.spk-berlin.de. Questions and comments about the model card can be directed to Jörg Lehmann at joerg.lehmann@sbb.spk-berlin.de.

How to Get Started with the Model

How to get started with this model is explained in the README file of the GitHub repository.

Model Card as of September 12th, 2023