gm-ner-xlmrbase / README.md
metadata
language: nl
license: apache-2.0
tags:
  - dighum
inference: false

Early-modern Dutch NER (General Letters)

Description

This is a NER model for early-modern Dutch letters of the United East India Company (VOC), fine-tuned from XLM-R_base (Conneau et al., 2020). The model identifies locations, persons, organisations and ships, as well as derived forms of locations and forms related to religion.

Intended uses and limitations

This model was fine-tuned (trained, validated and tested) on a single source of data, the General Letters (Generale Missiven). These letters span a wide variety of Dutch, as they cover most of the 17th and 18th centuries, and were extended with editorial notes between 1960 and 2017. Because the model was fine-tuned on this data only, it may perform less well on other texts from the same period.
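As a rough sketch of how the model might be applied, the snippet below wraps the Hugging Face `transformers` token-classification pipeline and groups the predicted entities by tag. `MODEL_ID` is a hypothetical placeholder (use the actual Hub id or a local path), and `load_tagger`, `group_by_tag` and `tag_text` are illustrative helper names, not part of this repository.

```python
from collections import defaultdict

# Hypothetical placeholder: replace with the real Hub id or a local checkpoint path.
MODEL_ID = "path/to/gm-ner-xlmrbase"


def load_tagger(model_id=MODEL_ID):
    """Build a token-classification pipeline (needs `transformers` and a backend such as PyTorch)."""
    # Imported lazily so that group_by_tag can be used without transformers installed.
    from transformers import pipeline

    # aggregation_strategy="simple" merges sub-word pieces into whole entity spans.
    return pipeline("token-classification", model=model_id, aggregation_strategy="simple")


def group_by_tag(entities):
    """Collect entity strings per tag from aggregated pipeline output."""
    grouped = defaultdict(list)
    for ent in entities:
        grouped[ent["entity_group"]].append(ent["word"])
    return dict(grouped)


def tag_text(text, model_id=MODEL_ID):
    """Tag one passage and return a {tag: [entity strings]} dictionary."""
    tagger = load_tagger(model_id)
    return group_by_tag(tagger(text))
```

Calling `tag_text(passage)` on a passage from the letters should return a dictionary keyed by the tags listed under "Training data and tagset". Since training used a maximum sequence length of 256, longer passages are probably best split into shorter segments first.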

Training data and tagset

The model was fine-tuned on the General Letters GM-NER dataset, with the following tagset:

| tag | description | notes |
|---|---|---|
| LOC | locations | |
| LOCderiv | derived forms of locations | by derivation, e.g. Bandanezen, or composition, e.g. Javakoffie |
| ORG | organisations | includes forms derived by composition, e.g. Compagnieszaken |
| PER | persons | |
| RELderiv | forms related to religion | merges religion names (Christendom), derived forms (christenen) and composed forms (Christen-orangkay) |
| SHP | ships | |

The base text for this dataset is OCR text that has been partially corrected. The text is clean overall but errors remain.

Training procedure

The model was fine-tuned from xlm-roberta-base, using this script.

Non-default training parameters are:

  • training batch size: 16
  • max sequence length: 256
  • number of epochs: 4, with a checkpoint every 200 steps; the checkpoint with the lowest loss is loaded at the end
  • (seed: 1)
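For reference, the parameters above can be written down as a small configuration. This is only a sketch: the key names follow the argument names of the Hugging Face `Trainer` and its `run_ner.py` example, which is an assumption about the training script, and `to_cli_args` is a hypothetical helper. Whether the batch size of 16 is per device or in total is not stated, so `per_device_train_batch_size` is a guess as well.

```python
# Non-default parameters from the list above.
# Key names are modelled on Hugging Face Trainer / run_ner.py arguments (assumption).
TRAINING_PARAMS = {
    "model_name_or_path": "xlm-roberta-base",
    "per_device_train_batch_size": 16,  # assumption: batch size is per device
    "max_seq_length": 256,
    "num_train_epochs": 4,
    "save_steps": 200,                  # checkpoint every 200 steps
    "load_best_model_at_end": True,     # reload the best checkpoint after training
    "metric_for_best_model": "loss",    # "best" means lowest loss
    "seed": 1,
}


def to_cli_args(params):
    """Turn a parameter dict into a flat list of command-line arguments."""
    args = []
    for name, value in params.items():
        if isinstance(value, bool):
            # Boolean options are passed as bare flags, and omitted when False.
            if value:
                args.append(f"--{name}")
        else:
            args.extend([f"--{name}", str(value)])
    return args
```

With the `Trainer`, loading the best checkpoint at the end also requires evaluating at the same interval as saving; the exact option for that depends on the `transformers` version and is left out here.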

Evaluation

Metric

  • entity-level F1
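Entity-level F1 counts a prediction as correct only if both the span boundaries and the entity type match the gold annotation exactly. The sketch below illustrates the idea on BIO-encoded tag sequences; the BIO encoding and the function names `extract_spans` and `entity_f1` are assumptions made for illustration, not a description of the evaluation code used for the reported results.

```python
def extract_spans(tags):
    """Return the set of (start, end, type) entity spans in a BIO tag sequence."""
    spans = set()
    start, current = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        prefix, _, label = tag.partition("-")
        # An open span ends at O, at any B-, or at an I- of a different type.
        if current is not None and (prefix != "I" or label != current):
            spans.add((start, i, current))
            start, current = None, None
        # A span starts at B-, or at a stray I- when no span is open.
        if prefix in ("B", "I") and current is None:
            start, current = i, label
    return spans


def entity_f1(gold, pred):
    """Exact-match entity-level F1 between two BIO tag sequences of equal length."""
    gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
    true_positives = len(gold_spans & pred_spans)
    precision = true_positives / len(pred_spans) if pred_spans else 0.0
    recall = true_positives / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = ["B-LOC", "O", "B-PER", "I-PER", "O", "B-SHP"]
pred = ["B-LOC", "O", "B-PER", "O", "O", "B-SHP"]
score = entity_f1(gold, pred)  # the truncated PER span counts as an error
```

In this toy example two of the three entities match exactly, so precision and recall are both 2/3. For a whole test set, the spans would be collected over all sentences before computing precision and recall.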

Results

| entity type | F1 |
|---|---|
| overall | 92.7 |
| LOC | 95.8 |
| LOCderiv | 92.7 |
| ORG | 92.5 |
| PER | 86.2 |
| RELderiv | 90.7 |
| SHP | 81.6 |

Authors and references

Authors

Sophie Arnoult, Lodewijk Petram and Piek Vossen

Reference

This model was fine-tuned as part of experiments for a paper accepted at LaTeCH-CLfL 2021: "Batavia asked for advice. Pretrained language models for Named Entity Recognition in historical texts."