metadata

language: en
tags:
  - newspapers
  - library
  - historic
  - glam
license: mit
metrics:
  - f1
widget:
  - text: 1820 [DATE] We received a letter from [MASK] Majesty.
  - text: 1850 [DATE] We received a letter from [MASK] Majesty.
  - text: '[MASK] [DATE] The Franco-Prussian war is a matter of great concern.'
  - text: '[MASK] [DATE] The Schleswig war is a matter of great concern.'

MODEL CARD UNDER CONSTRUCTION, ETA END OF NOVEMBER

ERWT-year

🌺ERWT is a language model that is (🤭 maybe 🤫) better at history than you...🌺

ERWT is a fine-tuned distilbert-base-cased model trained on historical newspapers from the Heritage Made Digital collection with temporal metadata.

ERWT performs time-sensitive masked language modelling and can be used for date prediction as well.

This model is served to you by Kaspar von Beelen and Daniel van Strien, "Improving AI, one pea at a time".

*ERWT is Dutch for PEA.

Introductory Note: Repent Now. 😇

The ERWT models are trained for experimental purposes; please use them with care.

You can find more detailed information below. Please consult the limitations section (seriously, read this section before using the models, we don't repent in public just for fun).

If you can't get enough of these peas and crave some more, you can still consult our working paper "Metadata Might Make Language Models Better" for more background information and nerdy evaluation stuff (work in progress, handle with care and kindness).

Background: MDMA to the rescue. 🙂

ERWT was created using a MetaData Masking Approach (or MDMA 💊), in which we train a Masked Language Model simultaneously on text and metadata. Our intuition was that incorporating metadata (information that describes a text but is not part of the content, such as the time/place of publication or the political orientation) may make language models "better", or at least make them more sensitive to historical, political and geographical aspects of language use.

ERWT is a distilbert-base-cased model, fine-tuned on a random subsample taken from the Heritage Made Digital newspaper collection. The training data comprises around half a billion words.

To unleash the power of MDMA, we adapted the training routine for the masked language model. When preprocessing the text, we prepended each segment of a hundred words with a time-stamp (year of publication) and a special [DATE] token.

The snippet below, taken from the Londonderry Sentinel, shows what a formatted segment looks like:

"1870 [DATE] Every scrap of intelligence relative to the war between France and Prussia is now read with interest."

These formatted chunks of text are then forwarded to the data collator and eventually the language model.
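The formatting step described above can be sketched in a few lines of Python. This is a minimal illustration, not our actual preprocessing code: the function name `format_chunks` is hypothetical, and splitting on whitespace to count "words" is an assumption.

```python
from typing import Iterator


def format_chunks(text: str, year: int, chunk_size: int = 100,
                  date_token: str = "[DATE]") -> Iterator[str]:
    """Yield segments of `chunk_size` words, each prefixed with year and date token."""
    # Assumption: "words" are whitespace-separated tokens.
    words = text.split()
    for start in range(0, len(words), chunk_size):
        segment = " ".join(words[start:start + chunk_size])
        yield f"{year} {date_token} {segment}"


article = ("Every scrap of intelligence relative to the war between France "
           "and Prussia is now read with interest.")
for chunk in format_chunks(article, year=1870):
    print(chunk)
```

Because the example article is shorter than a hundred words, it yields a single chunk in the same "year [DATE] text" layout as the Londonderry Sentinel snippet.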

Exposed to both the tokens and (temporal) metadata, the model learns a relation between text and time. When a token is masked, the prepended year field is taken into account when predicting the hidden word in the text. Vice versa, when the metadata token at the front of the formatted snippet is hidden, the model aims to predict the year of publication based on the content of a document.
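To make the second case concrete, here is a small sketch of how the year at the front of a formatted snippet could be hidden. The helper `mask_year` is hypothetical and assumes the snippet starts with a four-digit year followed by [DATE]; during training the masking is handled by the data collator, not by a function like this.

```python
import re

# Assumption: a formatted snippet starts with a four-digit year followed by " [DATE]".
YEAR_PREFIX = re.compile(r"^\d{4}(?= \[DATE\])")


def mask_year(snippet: str, mask_token: str = "[MASK]") -> str:
    """Replace the leading year of a formatted snippet with the mask token."""
    masked, n_subs = YEAR_PREFIX.subn(mask_token, snippet)
    if n_subs == 0:
        raise ValueError("snippet does not start with '<year> [DATE]'")
    return masked


print(mask_year("1870 [DATE] Every scrap of intelligence relative to the war."))
```

The result is a prompt in which the model has to fill in the year, while the rest of the text stays visible.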

Intended Uses: LMs as History Machines.

Exposing the model to temporal metadata allows us to investigate historical language change and perform date prediction.

Historical Language Change: Her/His Majesty? 👑

Let's show how ERWT works with a very concrete example.

The ERWT models are trained on British newspapers from before 1880 (Why? Long story, don't ask...) and can be used to monitor historical change in this specific context.

Imagine you are confronted with the following snippet "We received a letter from [MASK] Majesty" and want to predict the correct pronoun (again assuming a British context).

👩‍🏫 History Intermezzo Please remember, for most of the nineteenth century, Queen Victoria ruled Britain. From 1837 to 1901 to be precise. Her nineteenth-century predecessors (George III, George IV and William IV) were all male.

While a standard language model will provide you with one general prediction, based on what it has observed previously in the training corpus, ERWT models allow you to manipulate the prediction by anchoring the text in a specific year.

from transformers import pipeline

mask_filler = pipeline("fill-mask",
                       model='Livingwithmachines/erwt-year')

mask_filler("1820 [DATE] We received a letter from [MASK] Majesty.")

Returns as most likely prediction:

{'score': 0.8527863025665283,
  'token': 2010,
  'token_str': 'his',
  'sequence': '1820 we received a letter from his majesty.'}

However, if we change the date at the start of the sentence to 1850:

mask_filler("1850 [DATE] We received a letter from [MASK] Majesty.")

This will put most of the probability mass on the token "her" and only a little bit on "his".

{'score': 0.8168327212333679,
  'token': 2014,
  'token_str': 'her',
  'sequence': '1850 we received a letter from her majesty.'}

You can repeat this experiment for yourself using the example sentences in the Hosted inference API at the top right.
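If you want to trace the his/her shift over more than two years, the experiment can be wrapped in a loop. The sketch below is one possible way to do it: the function name `pronoun_scores_by_year` is hypothetical, and it relies on the `targets` argument of the fill-mask pipeline to restrict predictions to the two pronouns. The pipeline is passed in as an argument so the function itself does not download anything.

```python
from typing import Callable, Dict, Iterable, List, Sequence

TEMPLATE = "{year} [DATE] We received a letter from [MASK] Majesty."


def pronoun_scores_by_year(mask_filler: Callable[..., List[dict]],
                           years: Iterable[int],
                           targets: Sequence[str] = ("his", "her"),
                           ) -> Dict[int, Dict[str, float]]:
    """Return, for each year, the score assigned to each target pronoun."""
    results = {}
    for year in years:
        predictions = mask_filler(TEMPLATE.format(year=year), targets=list(targets))
        results[year] = {p["token_str"].strip(): p["score"] for p in predictions}
    return results


# With the real model (downloads the weights on first use):
#   from transformers import pipeline
#   mask_filler = pipeline("fill-mask", model="Livingwithmachines/erwt-year")
#   print(pronoun_scores_by_year(mask_filler, range(1800, 1880, 10)))
```

Given the examples above, one would expect "his" to dominate before Victoria's accession in 1837 and "her" afterwards, but we have not listed the scores for the intermediate years here.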

Okay, but why is this interesting?

Firstly, eyeballing some toy examples (but also using more rigorous metrics such as perplexity) shows that MLMs can make more accurate predictions when they have access to temporal metadata. In other words, ERWT's predictions reflect historical language use more accurately. Models that are sensitive to historical context could

Secondly, MDMA may reduce biases induced by imbalances in the training data (or at least give us more of a handle on this problem). Admittedly, we have to prove this more formally, but some experiments at least hint in this direction. The data used for training is highly biased towards the Victorian age and a standard language model trained on this corpus will predict "her" for "[MASK] Majesty".

Date Prediction

Another feature of the ERWT model series is date prediction. Remember that during training the temporal metadata token is often masked. In this case the model effectively learns to situate documents in time based on the tokens they contain.

By masking the year token, ERWT guesses the document's year of publication.

👩‍🏫 History Intermezzo To unite the German states (there were plenty!), Prussia fought a number of wars with its neighbours in the second half of the nineteenth century. It invaded Denmark in 1864 (the second of the Schleswig Wars) and France in 1870 (the Franco-Prussian war).

Reusing the code above, we can time-stamp documents by masking the year. For example, the line of Python code below:

mask_filler("[MASK] [DATE] The Schleswig war is a matter of great concern.")

Outputs as most likely filler:

{'score': 0.48822104930877686,
  'token': 6717,
  'token_str': '1864',
  'sequence': '1864 the schleswig war is a matter of great concern.'}

The prediction "1864" makes sense, as this was indeed the year Prussian troops (with some help from their Austrian friends) crossed the border into Schleswig, then part of the Kingdom of Denmark.

A few years later, in 1870, Prussia aimed artillery southwards and invaded France.

mask_filler("[MASK] [DATE] The Franco-Prussian war is a matter of great concern.")

ERWT clearly learned a lot about the history of German unification by ploughing through a plethora of nineteenth-century newspaper articles: it correctly returns "1870" as the predicted year.

Again, we have to ask: Who cares? Wikipedia can tell us pretty much the same. More importantly, don't we already have timestamps for newspaper data?
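For time-stamping many documents, it helps to wrap the masked-year query in a small helper. The sketch below is an illustration under assumptions: `predict_year` is a hypothetical name, it uses the pipeline's `top_k` argument, and it only keeps predictions that look like four-digit years (the model may occasionally propose other tokens for the masked position).

```python
from typing import Callable, List, Optional, Tuple


def predict_year(mask_filler: Callable[..., List[dict]], text: str,
                 top_k: int = 10) -> Optional[Tuple[int, float]]:
    """Return the highest-scoring four-digit year and its score, or None."""
    prompt = f"[MASK] [DATE] {text}"
    best = None
    for prediction in mask_filler(prompt, top_k=top_k):
        token = prediction["token_str"].strip()
        # Skip fillers that are not plausible years.
        if len(token) == 4 and token.isdigit():
            if best is None or prediction["score"] > best[1]:
                best = (int(token), prediction["score"])
    return best


# With the real model (downloads the weights on first use):
#   from transformers import pipeline
#   mask_filler = pipeline("fill-mask", model="Livingwithmachines/erwt-year")
#   print(predict_year(mask_filler, "The Schleswig war is a matter of great concern."))
```

The function returns None when none of the top predictions is a year, which makes such cases easy to spot when running it over a larger collection.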

In both cases, our answers would be "yes, but...". ERWT's time-stamping powers have little instrumental use and won't make us rich (but donations are welcome of course 🤑). Nonetheless, we believe date prediction has value for research purposes. We can use ERWT for "fictitious" prediction, i.e. as a diagnostic tool.

Firstly, we used date prediction for evaluation purposes, to measure which training routine produces the best models. Secondly, we could use it as an analytical tool, to study temporal variation within text documents and further scrutinise which features drive the time prediction (it goes without saying that the same applies to other metadata fields, for example predicting political orientation).

Limitations

The ERWT models were trained for evaluation purposes, and therefore carry some critical limitations.

Training Data

Many of the limitations are a direct result of the data. ERWT models are trained on a rather small subsample of nineteenth-century British newspapers, and their predictions have to be understood in this context (remember, Her Majesty?). Moreover, the corpus has a strong metropolitan and liberal bias (see section on Data Description for more information).

Historical models tend to reflect past (and present?) stereotypes and prejudices. We strongly advise against using these models outside of the context of historical research. The predictions are likely to exhibit harmful biases and should be investigated critically and understood within the context of nineteenth-century British cultural history.

Training Routine

We created this model as part of a wider experiment, which attempted to establish best practices for training models with metadata. An overview of all the models is available on our GitHub page.

To reduce training time, we based our experiments on a random subsample of the HMD corpus, consisting of half a billion tokens. Furthermore, we only trained the models for one epoch, which implies they are most likely undertrained at the moment.

We were mainly interested in the relative performance of the different ERWT models. We did, however, compare ERWT with distilbert-base-cased in our evaluation experiments, and of course, our tiny LM peas did much better. 🎉🥳

Want to know how much? Then read our paper!

Data Description

The ERWT models are trained on an openly accessible newspaper corpus created by the Heritage Made Digital (HMD) newspaper digitisation project. The HMD newspapers comprise around 2 billion words in total, but the bulk of the articles originate from the (then) liberal paper The Sun. Geographically, most papers are metropolitan (i.e. based in London). The inclusion of The Northern Daily Times and Liverpool Standard adds some geographical diversity to this corpus. The political classifications are taken from historical newspaper press directories; please read our paper on bias in newspaper collections for more information.

The table below contains a more detailed overview of the corpus.

| NLP  | Title                    | Politics     | Location  | Tokens        |
|------|--------------------------|--------------|-----------|---------------|
| 2083 | The Northern Daily Times | NEUTRAL      | LIVERPOOL | 14,094,212    |
| 2084 | The Northern Daily Times | NEUTRAL      | LIVERPOOL | 34,450,366    |
| 2085 | The Northern Daily Times | NEUTRAL      | LIVERPOOL | 16,166,627    |
| 2088 | The Liverpool Standard   | CONSERVATIVE | LIVERPOOL | 149,204,800   |
| 2090 | The Liverpool Standard   | CONSERVATIVE | LIVERPOOL | 6,417,320     |
| 2194 | The Sun                  | LIBERAL      | LONDON    | 1,155,791,480 |
| 2244 | Colored News             | NONE         | LONDON    | 53,634        |
| 2642 | The Express              | LIBERAL      | LONDON    | 236,240,555   |
| 2644 | National Register        | CONSERVATIVE | LONDON    | 23,409,733    |
| 2645 | The Press                | CONSERVATIVE | LONDON    | 15,702,276    |
| 2646 | The Star                 | NONE         | LONDON    | 163,072,742   |
| 2647 | The Statesman            | RADICAL      | LONDON    | 61,225,215    |
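To get a feel for how skewed the corpus is, the token counts from the table above can be tallied per title and per location. The snippet is a quick back-of-the-envelope calculation on the numbers in the table, not part of our training pipeline.

```python
from collections import defaultdict

# (title, location, tokens), copied row by row from the table above.
ROWS = [
    ("The Northern Daily Times", "LIVERPOOL", 14_094_212),
    ("The Northern Daily Times", "LIVERPOOL", 34_450_366),
    ("The Northern Daily Times", "LIVERPOOL", 16_166_627),
    ("The Liverpool Standard", "LIVERPOOL", 149_204_800),
    ("The Liverpool Standard", "LIVERPOOL", 6_417_320),
    ("The Sun", "LONDON", 1_155_791_480),
    ("Colored News", "LONDON", 53_634),
    ("The Express", "LONDON", 236_240_555),
    ("National Register", "LONDON", 23_409_733),
    ("The Press", "LONDON", 15_702_276),
    ("The Star", "LONDON", 163_072_742),
    ("The Statesman", "LONDON", 61_225_215),
]

total = sum(tokens for _, _, tokens in ROWS)
by_title = defaultdict(int)
by_location = defaultdict(int)
for title, location, tokens in ROWS:
    by_title[title] += tokens
    by_location[location] += tokens

# Print each title's share of the corpus, largest first.
for title, tokens in sorted(by_title.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{title:<26} {tokens / total:6.1%}")
print(f"London share: {by_location['LONDON'] / total:.1%}")
```

The rows add up to roughly 1.88 billion tokens, of which The Sun alone contributes about 62%.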

Temporally, most of the articles date from the second half of the nineteenth century.

Evaluation

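| Model          | Length 64 (mean) | Length 64 (sd) | Length 128 (mean) | Length 128 (sd) |
|----------------|------------------|----------------|-------------------|-----------------|
| DistilBERT     | 354.40           | 376.32         | 229.19            | 294.70          |
| HMDistilBERT   | 32.94            | 64.78          | 25.72             | 45.99           |
| ERWT           | 31.49            | 61.85          | 24.97             | 44.58           |
| ERWT_st        | 31.69            | 62.42          | 25.03             | 44.74           |
| ERWT_masked_25 | **30.97**        | 61.50          | **24.59**         | 44.36           |
| ERWT_masked_75 | 31.02            | 61.41          | 24.63             | 44.40           |
| PEA            | 31.63            | 62.09          | 25.58             | 44.99           |
| PEA_st         | 31.65            | 62.19          | 25.59             | 44.99           |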
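As a rough reading aid, the mean scores in the table above can be turned into relative reductions with respect to a baseline. This sketch assumes that lower scores are better (as with perplexity), which the table itself does not state explicitly; the helper name `relative_reduction` is hypothetical.

```python
# Mean scores copied from the table above: {model: (length 64, length 128)}.
MEAN_SCORES = {
    "DistilBERT": (354.40, 229.19),
    "HMDistilBERT": (32.94, 25.72),
    "ERWT": (31.49, 24.97),
    "ERWT_masked_25": (30.97, 24.59),
}


def relative_reduction(baseline: float, score: float) -> float:
    """Fraction by which `score` is lower than `baseline` (assumes lower is better)."""
    return (baseline - score) / baseline


for baseline_name in ("DistilBERT", "HMDistilBERT"):
    for index, length in enumerate((64, 128)):
        reduction = relative_reduction(MEAN_SCORES[baseline_name][index],
                                       MEAN_SCORES["ERWT"][index])
        print(f"ERWT vs {baseline_name} (length {length}): {reduction:.1%}")
```

Under that assumption, most of the gain comes from fine-tuning on historical newspapers (DistilBERT versus HMDistilBERT), while adding temporal metadata yields a further, smaller reduction.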