BERT model for toponym recognition in 19th-century English

Description

toponym-19thC-en is a BERT model fine-tuned for the task of toponym recognition on the TopRes19th dataset. It has been trained to recognise the following types of entities: LOC, BUILDING, and STREET, particularly in digitised 19th-century newspaper texts in English.

toponym-19thC-en uses Livingwithmachines/bert_1760_1900 as its base model. This is a bert-base-uncased model fine-tuned on a large historical dataset of books in English, published between 1760 and 1900 and comprising ~5.1 billion tokens.

Intended use and limitations

This model is intended for performing toponym recognition (a subtask of NER) on historical English texts, particularly on digitised 19th-century newspaper texts, on which it has been trained. It has been trained to recognise the following types of entities: LOC, BUILDING, and STREET.

How to use

You can use this model with a named entity recognition pipeline. For example:

>>> from transformers import pipeline
>>> model = "Livingwithmachines/toponym-19thC-en"
>>> ner_pipe = pipeline("ner", model=model)
>>> results = ner_pipe("MANUFACTURED ONLY AT 7S, NEW OXFORD-STREET, LONDON.")

[
    {'entity': 'B-STREET', 'score': 0.99885094, 'index': 7, 'word': 'new', 'start': 25, 'end': 28}, 
    {'entity': 'I-STREET', 'score': 0.9906386, 'index': 8, 'word': 'oxford', 'start': 29, 'end': 35}, 
    {'entity': 'I-STREET', 'score': 0.9944792, 'index': 9, 'word': '-', 'start': 35, 'end': 36}, 
    {'entity': 'I-STREET', 'score': 0.9945181, 'index': 10, 'word': 'street', 'start': 36, 'end': 42}, 
    {'entity': 'B-LOC', 'score': 0.9986091, 'index': 12, 'word': 'london', 'start': 44, 'end': 50}
]

You can also group all tokens corresponding to the same entity together, as follows:

>>> from transformers import pipeline
>>> model = "Livingwithmachines/toponym-19thC-en"
>>> ner_pipe = pipeline("ner", model=model, aggregation_strategy="average")
>>> results = ner_pipe("MANUFACTURED ONLY AT 7S, NEW OXFORD-STREET, LONDON.")

[
    {'entity_group': 'STREET', 'score': 0.9946217, 'word': 'new oxford - street', 'start': 25, 'end': 42}, 
    {'entity_group': 'LOC', 'score': 0.9986091, 'word': 'london', 'start': 44, 'end': 50}
]
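
Because the base model is uncased, the `word` field is lowercased and re-joined from word pieces, so it may not match the source text exactly (compare 'new oxford - street' with the input). A minimal sketch of recovering the original surface form from the `start` and `end` character offsets follows. The `results` list is copied from the grouped output above so that the snippet runs without downloading the model, and `surface_forms` is a hypothetical helper name, not part of the transformers library.

```python
text = "MANUFACTURED ONLY AT 7S, NEW OXFORD-STREET, LONDON."

# Grouped pipeline output, copied from the example above.
results = [
    {"entity_group": "STREET", "score": 0.9946217,
     "word": "new oxford - street", "start": 25, "end": 42},
    {"entity_group": "LOC", "score": 0.9986091,
     "word": "london", "start": 44, "end": 50},
]


def surface_forms(text, entities):
    """Return (label, original text span) pairs using character offsets."""
    return [(e["entity_group"], text[e["start"]:e["end"]]) for e in entities]


print(surface_forms(text, results))
# [('STREET', 'NEW OXFORD-STREET'), ('LOC', 'LONDON')]
```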

Training data

This model is fine-tuned on the training set of version 2 of the TopRes19th dataset. For more information about the dataset, see the paper describing it.

Each token has been annotated using the BIO format: O marks a token that does not belong to a named entity, a tag prefixed with B- marks the first token of a named entity, and a tag prefixed with I- marks any subsequent token inside the same named entity.
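
As an illustration of how BIO tags map back to entities, here is a minimal sketch of a decoder. The function name `bio_to_spans` and the example tokens are assumptions made for illustration; they are not taken from the dataset or its tooling.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, [tokens]) spans."""
    spans = []
    current = None  # (entity_type, [tokens]) of the span being built
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity.
            current = (tag[2:], [token])
            spans.append(current)
        elif tag.startswith("I-") and current is not None and current[0] == tag[2:]:
            # An I- tag continues the open entity of the same type.
            current[1].append(token)
        else:
            # An O tag (or a stray I- tag) closes any open entity.
            current = None
    return spans


tokens = ["New", "Oxford", "-", "Street", ",", "London"]
tags = ["B-STREET", "I-STREET", "I-STREET", "I-STREET", "O", "B-LOC"]
print(bio_to_spans(tokens, tags))
# [('STREET', ['New', 'Oxford', '-', 'Street']), ('LOC', ['London'])]
```

In this sketch a stray I- tag with no matching open entity is simply dropped; other decoders may choose to treat it as the start of a new entity.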

The training set consists of 5,216 annotated examples, and the development set consists of 1,304 annotated examples.

A toponym is a mention of a location in a text. In the original dataset, annotators classified toponyms into the following categories:

BUILDING for buildings,
STREET for streets, roads, and other odonyms,
LOC for any other real world places regardless of type or scale,
ALIEN for extraterrestrial locations, such as 'Venus',
FICTION for fictional or mythical places, such as 'Hell', and
OTHER for other types of entities with coordinates, such as events, like the 'Battle of Waterloo'.

However, the ALIEN, FICTION and OTHER named entities were found to occur between zero and five times in the whole dataset, and are therefore negligible for training purposes.

Limitations

This model is based on Livingwithmachines/bert_1760_1900, which is fine-tuned on a historical dataset of digitised books in English, published between 1760 and 1900, including both fiction and non-fiction. Therefore, the model's predictions have to be understood in their historical context. Furthermore, despite the size of the dataset (ca. 48,000 books and 5.1 billion words), this dataset is not representative of nineteenth-century English, but only of (some of) those authors who had the option to publish a book. It therefore needs to be used with caution. You can find more information about the original dataset here, or read more about the base model in this paper.

The dataset used for fine-tuning for the task of toponym resolution is described in this paper. Articles for annotation were selected from newspaper issues published between 1780 and 1870, belonging to newspapers based in four different locations in England. The model may therefore be biased towards better predicting entities similar to the ones in the source data. Although the articles contain many OCR errors, only articles that were legible were selected. In particular, we selected only those articles with an OCR quality confidence score greater than 0.7, calculated as the mean of the per-word OCR confidence scores as reported in the source metadata. The model's performance on lower-quality texts still needs to be tested.
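
The selection criterion described above (mean per-word OCR confidence greater than 0.7) can be sketched as follows. This is only an illustration: the article names and confidence values are made up, and `article_ocr_quality` is a hypothetical helper, not the code used to build the dataset.

```python
from statistics import mean

OCR_THRESHOLD = 0.7  # threshold stated in the text above


def article_ocr_quality(word_confidences):
    """Article-level OCR quality: mean of the per-word confidence scores."""
    return mean(word_confidences)


# Made-up per-word confidence scores for two hypothetical articles.
articles = {
    "article_a": [0.95, 0.88, 0.91, 0.60],
    "article_b": [0.40, 0.75, 0.55, 0.62],
}

# Keep only the articles whose mean confidence exceeds the threshold.
selected = [
    name for name, confidences in articles.items()
    if article_ocr_quality(confidences) > OCR_THRESHOLD
]
print(selected)
# ['article_a']
```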

Finally, we've noticed that there are often B- and I- prefix assignment errors in hyphenated entities. For example, "Ashton-under-Lyne" (["Ashton", "-", "under", "-", "Lyne"]) is tagged as ["B-LOC", "B-LOC", "B-LOC", "B-LOC", "B-LOC"], instead of ["B-LOC", "I-LOC", "I-LOC", "I-LOC", "I-LOC"]. An imperfect solution is to apply a post-processing step in which the tag prefix is changed to "I-" when the current token or the previous token is a hyphen, and the entity type of both the previous and the current token is the same and not "O".
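
The post-processing rule described above could be sketched as follows. This is one possible reading of the rule, not necessarily the authors' implementation; `fix_hyphen_prefixes` is a hypothetical name, and the function assumes tokens and tags are parallel lists with tags of the form "O", "B-TYPE" or "I-TYPE".

```python
def fix_hyphen_prefixes(tokens, tags):
    """Rewrite B- to I- around hyphens when neighbouring tags share a type."""
    fixed = list(tags)
    for i in range(1, len(tokens)):
        # The rule only applies when both tokens belong to an entity.
        if fixed[i] == "O" or fixed[i - 1] == "O":
            continue
        cur_type = fixed[i].split("-", 1)[1]
        prev_type = fixed[i - 1].split("-", 1)[1]
        # The current token or the previous token must be a hyphen.
        touches_hyphen = tokens[i] == "-" or tokens[i - 1] == "-"
        if touches_hyphen and cur_type == prev_type:
            fixed[i] = "I-" + cur_type
    return fixed


tokens = ["Ashton", "-", "under", "-", "Lyne"]
tags = ["B-LOC", "B-LOC", "B-LOC", "B-LOC", "B-LOC"]
print(fix_hyphen_prefixes(tokens, tags))
# ['B-LOC', 'I-LOC', 'I-LOC', 'I-LOC', 'I-LOC']
```

As the text notes, the fix is imperfect: in this sketch, two genuinely separate entities of the same type joined by a hyphen would also be merged into one.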

License

The model is released under the open license CC BY 4.0, available at https://creativecommons.org/licenses/by/4.0/legalcode.

Funding Statement

This work was supported by Living with Machines (AHRC grant AH/S01179X/1) and The Alan Turing Institute (EPSRC grant EP/N510129/1). Living with Machines, funded by the UK Research and Innovation (UKRI) Strategic Priority Fund, is a multidisciplinary collaboration delivered by the Arts and Humanities Research Council (AHRC), with The Alan Turing Institute, the British Library and Cambridge, King's College London, East Anglia, Exeter, and Queen Mary University of London.

Cite

If you use this model, please cite the following papers describing the base model and the dataset used for fine-tuning:

Coll Ardanuy, Mariona, David Beavan, Kaspar Beelen, Kasra Hosseini, Jon Lawrence, Katherine McDonough, Federico Nanni, Daniel van Strien, and Daniel C. S. Wilson. 2022. “A Dataset for Toponym Resolution in Nineteenth-century English Newspapers”. Journal of Open Humanities Data 8 (0): 3. DOI: https://doi.org/10.5334/johd.56

Hosseini, Kasra, Kaspar Beelen, Giovanni Colavizza, and Mariona Coll Ardanuy. 2021. “Neural Language Models for Nineteenth-Century English”. Journal of Open Humanities Data 7 (0): 22. DOI: https://doi.org/10.5334/johd.48