---
language: en
tags:
- newspapers
- library
- historic
- glam
license: mit
metrics:
- f1
widget:
- text: "1820 [DATE] We received a letter from [MASK] Majesty."
- text: "1850 [DATE] We received a letter from [MASK] Majesty."
- text: "[MASK] [DATE] The Franco-Prussian war is a matter of great concern."
- text: "[MASK] [DATE] The Schleswig war is a matter of great concern."
---
**MODEL CARD UNDER CONSTRUCTION, ETA END OF NOVEMBER**
<img src="https://upload.wikimedia.org/wikipedia/commons/5/5b/NCI_peas_in_pod.jpg" alt="erwt" width="200" >
# ERWT-year
ERWT is a language model that is (🤭 maybe 🤫) better at history than you...
ERWT is a fine-tuned [`distilbert-base-cased`](https://huggingface.co/distilbert-base-cased) model trained on historical newspapers from the [Heritage Made Digital collection](https://huggingface.co/datasets/davanstrien/hmd-erwt-training).
We trained a model based on a combination of text and **temporal metadata** (i.e. year information).
ERWT performs **time-sensitive masked language modelling** and can be used for **date prediction** as well.
This model is served to you by [Kaspar von Beelen](https://huggingface.co/Kaspar) and [Daniel van Strien](https://huggingface.co/davanstrien), *"Improving AI, one pea at a time"*.
\*ERWT is Dutch for PEA.
# Overview
- [Introduction: Repent Now 😇](#introductory-note-repent-now-%F0%9F%98%87)
- [Background: MDMA to the rescue 🙂](#background-mdma-to-the-rescue-%F0%9F%99%82)
- [Intended Use: LMs as History Machines](#intended-use-lms-as-history-machines)
- [Historical Language Change: Her/His Majesty? 👑](#historical-language-change-herhis-majesty-%F0%9F%91%91)
- [Date Prediction: Pub Quiz with LMs 🍻](#date-prediction-pub-quiz-with-lms-%F0%9F%8D%BB)
- [Limitations: Not all is well 😮](#limitations-not-all-is-well-%F0%9F%98%AE)
- [Training Data](#training-data)
- [Training Routine](#training-routine)
- [Data Description](#data-description)
- [Evaluation: In case you care to count](#evaluation)
## Introductory Note: Repent Now. 😇
The ERWT models are trained for **experimental purposes**; please use them with care.
You will find more detailed information below. Please consult the **limitations** section (seriously, read this section before using the models, **we don't repent in public just for fun**).
If you can't get enough of these peas and crave some more, you can consult our working paper ["Metadata Might Make Language Models Better"](https://drive.google.com/file/d/1Xp21KENzIeEqFpKvO85FkHynC0PNwBn7/view?usp=sharing) for more background information and nerdy evaluation stuff (work in progress, handle with care and kindness).
## Background: MDMA to the rescue. 🙂
ERWT was created using a **M**eta**D**ata **M**asking **A**pproach (or **MDMA**), in which we train a Masked Language Model simultaneously on text and metadata. Our intuition was that incorporating metadata (information that describes a text but is not part of the content, such as the time/place of publication or the political orientation) may make language models "better", or at least make them more sensitive to historical, political and geographical aspects of language use.
ERWT is a [`distilbert-base-cased`](https://huggingface.co/distilbert-base-cased) model, fine-tuned on a random subsample taken from the Heritage Made Digital newspaper collection. The training data comprises around half a billion words.
To unleash the power of MDMA, we adapted the training routine for the masked language model. When preprocessing the text, we prepended each segment of a hundred words with a time stamp (year of publication) and a special `[DATE]` token.
The snippet below, taken from the [Londonderry Sentinel](https://www.britishnewspaperarchive.co.uk/viewer/bl/0001480/18700722/014/0002), shows the resulting format:
```python
"1870 [DATE] Every scrap of intelligence relative to the war between France and Prussia is now read with interest."
```
These formatted chunks of text are then forwarded to the data collator and eventually the language model.
Exposed to both the tokens and (temporal) metadata, the model learns a relation between text and time. When a token is masked, the prepended `year` field is taken into account when predicting hidden words in the text. Vice versa, when the metadata token is hidden, the model aims to predict the year of publication based on the content.
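The preprocessing step described above can be sketched as follows. This is a minimal illustration and not the actual training code: the function name `add_year_prefix`, the whitespace-based word splitting and the default segment length are assumptions made for the example.

```python
from typing import List


def add_year_prefix(text: str, year: int, segment_len: int = 100) -> List[str]:
    """Split `text` into segments of `segment_len` words and prepend
    '<year> [DATE]' to each segment (hypothetical helper)."""
    words = text.split()  # assumption: words are separated by whitespace
    segments = []
    for start in range(0, len(words), segment_len):
        chunk = " ".join(words[start:start + segment_len])
        segments.append(f"{year} [DATE] {chunk}")
    return segments


example = add_year_prefix(
    "Every scrap of intelligence relative to the war between France "
    "and Prussia is now read with interest.",
    year=1870,
)
print(example[0])
```

Each returned string has the same shape as the Londonderry Sentinel snippet shown earlier, so it could be passed on to a tokenizer and data collator.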
## Intended Use: LMs as History Machines.
Exposing the model to temporal metadata allows us to investigate **historical language change** and perform **date prediction**.
### Historical Language Change: Her/His Majesty? 👑
Let's show how ERWT works with a very concrete example.
The ERWT models are trained on British newspapers from before 1880 (Why? Long story, don't ask...) and can be used to monitor historical change in this specific context.
Imagine you are confronted with the following snippet "We received a letter from [MASK] Majesty" and want to predict the correct pronoun (again assuming a British context).
👩‍🏫 **History Intermezzo** Please remember, for most of the nineteenth century, Queen Victoria ruled Britain. From 1837 to 1901 to be precise. Her nineteenth-century predecessors (George III, George IV and William IV) were all male.
While a standard language model will provide you with one general prediction, based on what it has observed previously in the training corpus, ERWT models allow you to manipulate the prediction by anchoring the text in a specific year.
```python
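from transformers import pipeline

# Load the fill-mask pipeline with the ERWT-year checkpoint
mask_filler = pipeline("fill-mask", model="Livingwithmachines/erwt-year")

# Anchor the sentence in 1820 by prepending the year and the [DATE] token
mask_filler("1820 [DATE] We received a letter from [MASK] Majesty.")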
```
Returns as most likely prediction:
```python
{'score': 0.8527863025665283,
'token': 2010,
'token_str': 'his',
'sequence': '1820 we received a letter from his majesty.'}
```
However, if we change the date at the start of the sentence to 1850:
```python
mask_filler("1850 [DATE] We received a letter from [MASK] Majesty.")
```
ERWT will put most of the probability mass on the token "her" and only a little bit on "his".
```python
{'score': 0.8168327212333679,
'token': 2014,
'token_str': 'her',
'sequence': '1850 we received a letter from her majesty.'}
```
You can repeat this experiment for yourself using the example sentences in the **Hosted inference API** at the top right.
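To see how the prediction shifts over time, the same sentence can be anchored in a range of years. The sketch below is illustrative only: `build_prompts` and `top_token_by_year` are hypothetical helper names, and the code assumes that the `fill-mask` pipeline returns a list of dictionaries sorted by score, as in the outputs shown earlier.

```python
from typing import Callable, Dict, Iterable, List


def build_prompts(template: str, years: Iterable[int]) -> Dict[int, str]:
    """Prefix `template` with '<year> [DATE]' for every year (hypothetical helper)."""
    return {year: f"{year} [DATE] {template}" for year in years}


def top_token_by_year(mask_filler: Callable[[str], List[dict]],
                      template: str,
                      years: Iterable[int]) -> Dict[int, str]:
    """Return the highest-scoring filler for each year (hypothetical helper)."""
    prompts = build_prompts(template, years)
    return {year: mask_filler(prompt)[0]["token_str"]
            for year, prompt in prompts.items()}


template = "We received a letter from [MASK] Majesty."
prompts = build_prompts(template, range(1820, 1871, 10))
print(prompts[1850])

# With a real pipeline (requires downloading the model):
# from transformers import pipeline
# mask_filler = pipeline("fill-mask", model="Livingwithmachines/erwt-year")
# print(top_token_by_year(mask_filler, template, range(1820, 1871, 10)))
```

Passing the pipeline in as an argument keeps the prompt-building logic separate from the model call, so the helpers can be tried out without loading the model.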
Okay, but why is this interesting?
Firstly, eyeballing some toy examples (but also using more rigorous metrics such as perplexity) shows that MLMs make more accurate predictions when they have access to temporal metadata. In other words, ERWT's predictions reflect historical language use more accurately.
Secondly, MDMA may reduce biases induced by imbalances in the training data (or at least give us more of a handle on this problem). Admittedly, we have to prove this more formally, but some experiments at least hint in this direction. The data used for training is highly biased towards the Victorian age, and a standard language model trained on this corpus will predict "her" for `"[MASK] Majesty"`.
### Date Prediction: Pub Quiz with LMs 🍻
Another feature of the ERWT model series is date prediction. Remember that during training the temporal metadata token is often masked. In this case, the model effectively learns to situate documents in time based on the tokens they contain.
By masking the year token, ERWT guesses the document's year of publication.
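As a rough sketch of what hiding the metadata token could look like at the level of a formatted text segment: the helper below replaces the leading year with `[MASK]` with a given probability. The name `maybe_mask_year` and the `mask_prob` parameter are assumptions for illustration; how often the year is actually masked during training is not specified here, and in practice masking would happen on token ids inside the data collator.

```python
import random


def maybe_mask_year(segment: str, mask_prob: float, rng: random.Random) -> str:
    """Replace the leading year of a '<year> [DATE] ...' segment with '[MASK]'
    with probability `mask_prob` (hypothetical helper)."""
    _, rest = segment.split(" ", 1)  # drop the year, keep '[DATE] ...'
    if rng.random() < mask_prob:
        return f"[MASK] {rest}"
    return segment


segment = "1870 [DATE] Every scrap of intelligence relative to the war."
# mask_prob=1.0 always hides the year; mask_prob=0.0 never does.
print(maybe_mask_year(segment, mask_prob=1.0, rng=random.Random(0)))
print(maybe_mask_year(segment, mask_prob=0.0, rng=random.Random(0)))
```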
👩‍🏫 **History Intermezzo** To unite the German states (there were plenty!), Prussia fought a number of wars with its neighbours in the second half of the nineteenth century. It invaded Denmark in 1864 (the second of the Schleswig Wars) and France in 1870 (the Franco-Prussian war).
Reusing the code above, we can time-stamp documents by masking the year. For example, the line of Python code below:
```python
mask_filler("[MASK] [DATE] The Schleswig war is a matter of great concern.")
```
Outputs as most likely filler:
```python
{'score': 0.48822104930877686,
'token': 6717,
'token_str': '1864',
'sequence': '1864 the schleswig war is a matter of great concern.'}
```
The prediction "1864" makes sense, as this was indeed the year Prussian troops (with some help from their Austrian friends) crossed the border into Schleswig, then part of the Kingdom of Denmark.
A few years later, in 1870, Prussia aimed artillery southwards and invaded France.
```python
mask_filler("[MASK] [DATE] The Franco-Prussian war is a matter of great concern.")
```
ERWT clearly learned a lot about the history of German unification by ploughing through a plethora of nineteenth-century newspaper articles: it correctly returns "1870" as the predicted year.
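When the year is masked, nothing forces the model to return a year, so it can help to keep only the fillers that look like one. The sketch below is a hedged illustration: `best_year` is a hypothetical helper, the 1800 to 1900 bounds are an assumption, and apart from the "1864" entry quoted above, the toy predictions are invented for the example.

```python
from typing import List, Optional


def best_year(predictions: List[dict],
              first: int = 1800,
              last: int = 1900) -> Optional[int]:
    """Return the highest-scoring filler that is a plausible year, or None
    if no filler qualifies (hypothetical helper, assumed year bounds)."""
    candidates = []
    for pred in predictions:
        token = pred["token_str"].strip()
        if token.isdecimal() and first <= int(token) <= last:
            candidates.append((pred["score"], int(token)))
    if not candidates:
        return None
    return max(candidates)[1]


# Toy predictions in the shape returned by the fill-mask pipeline.
toy_predictions = [
    {"score": 0.48822104930877686, "token_str": "1864"},  # from the output above
    {"score": 0.10, "token_str": "1863"},                 # invented for illustration
    {"score": 0.05, "token_str": "the"},                  # invented for illustration
]
print(best_year(toy_predictions))
```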
Again, we have to ask: Who cares? Wikipedia can tell us pretty much the same. More importantly, don't we already have timestamps for newspaper data?
In both cases, our answers would be "yes, but...". ERWT's time-stamping powers have little instrumental use and won't make us rich (but donations are welcome of course). Nonetheless, we believe date prediction has value for research purposes. We can use ERWT for "fictitious" prediction, i.e. as a diagnostic tool.
Firstly, we used date prediction for evaluation purposes, to compare the models produced by different training routines.
Secondly, we could use it as an analytical tool, to study temporal variation **within** text documents and further scrutinise which features drive the time prediction (it goes without saying that the same applies to other metadata fields, for example predicting political orientation).
## Limitations: Not all is well 😮.
The ERWT series were trained for evaluation purposes and therefore carry some critical limitations.
### Training Data
Many of the limitations are a direct result of the data. ERWT models are trained on a rather small subsample of nineteenth-century British newspapers, and their predictions have to be understood in this context (remember, Her Majesty?). Moreover, the corpus has a strong metropolitan and liberal bias (see the section on Data Description for more information).
Historical language models tend to reflect past (and present?) stereotypes and prejudices. We strongly advise against using these models outside of the context of historical research. The predictions are likely to exhibit harmful biases and should be investigated critically and understood within the context of nineteenth-century British cultural history.
One way of evaluating a model's bias is to change a prompt and assess the impact on the predicted [MASK] token. Often a comparison is made between the predictions for the prompt 'The **man** worked as a [MASK]' and those for the prompt 'The **woman** worked as a [MASK]'. An example of the output for this model:
```
1810 [DATE] The man worked as a [MASK].
```
This prompt produces the following three top predicted mask tokens:
```python
[
{
"score": 0.17358914017677307,
"token": 10533,
"token_str": "carpenter",
},
{
"score": 0.08387620747089386,
"token": 22701,
"token_str": "tailor",
},
{
"score": 0.068501777946949,
"token": 6243,
"token_str": "baker",
}
]
```
```
1810 [DATE] The woman worked as a [MASK].
```
This prompt produces the following three top predicted mask tokens:
```python
[
{
"score": 0.148710235953331,
"token": 7947,
"token_str": "servant",
},
{
"score": 0.07184035331010818,
"token": 6243,
"token_str": "baker",
},
{
"score": 0.0675836056470871,
"token": 6821,
"token_str": "nurse",
},
]
```
This kind of prompt evaluation is often done to assess the bias in *contemporary* language models. These biases often reflect the data used to train the model. In the case of historic language models, the bias exhibited by a model *may* be a valuable research tool in assessing (at scale) the use of language over time. For this particular prompt, the 'bias' exhibited by the language model (and the underlying data) may be a relatively accurate reflection of employment patterns during the 19th century. A possible area of exploration is to see how these predictions change when the model is prompted with different dates. With a dataset covering a more extended time period, we may expect to see a decline in the prediction `servant` for the [MASK] token toward the end of the 19th century and particularly following the start of the First World War, when the number of domestic servants employed in the United Kingdom fell rapidly.
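The comparison between the 'man' and 'woman' prompts can be made a little more systematic by looking at which fillers the two prediction lists share. The sketch below is only an illustration: `compare_fillers` is a hypothetical helper, and the input lists reuse the top-three tokens reported above, with scores rounded.

```python
from typing import Dict, List, Set


def compare_fillers(first: List[dict], second: List[dict]) -> Dict[str, Set[str]]:
    """Split the predicted tokens of two prompts into those unique to each
    prompt and those they share (hypothetical helper)."""
    tokens_first = {pred["token_str"] for pred in first}
    tokens_second = {pred["token_str"] for pred in second}
    return {
        "only_first": tokens_first - tokens_second,
        "only_second": tokens_second - tokens_first,
        "shared": tokens_first & tokens_second,
    }


# Top-three fillers for "1810 [DATE] The man worked as a [MASK]."
man = [
    {"score": 0.174, "token_str": "carpenter"},
    {"score": 0.084, "token_str": "tailor"},
    {"score": 0.069, "token_str": "baker"},
]
# Top-three fillers for "1810 [DATE] The woman worked as a [MASK]."
woman = [
    {"score": 0.149, "token_str": "servant"},
    {"score": 0.072, "token_str": "baker"},
    {"score": 0.068, "token_str": "nurse"},
]

result = compare_fillers(man, woman)
for key, tokens in result.items():
    print(key, sorted(tokens))
```

Running the same comparison for prompts anchored in different years would be one way to explore how these predictions change over time.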
### Training Routine
We created this model as part of a wider experiment, which attempted to establish best practices for training models with metadata. An overview of all the models is available on our [GitHub](https://github.com/Living-with-machines/ERWT/) page.
To reduce training time, we based our experiments on a random subsample of the HMD corpus, consisting of half a billion tokens.
Furthermore, we only trained the models for one epoch, which implies they are most likely undertrained at the moment.
We were mainly interested in the **relative** performance of the different ERWT models. We did, however, compare ERWT with [`distilbert-base-cased`](https://huggingface.co/distilbert-base-cased) in our evaluation experiments, and of course, our tiny LM peas did much better. 🥳
Want to know how much? Then read our paper!
## Data Description
The ERWT models are trained on an openly accessible newspaper corpus created by the [Heritage Made Digital (HMD) newspaper digitisation project](https://blogs.bl.uk/thenewsroom/2019/01/heritage-made-digital-the-newspapers.html).
The HMD newspapers comprise around 2 billion words in total, but the bulk of the articles originate from the (then) liberal paper *The Sun*.
Geographically, most papers are metropolitan (i.e. based in London). The inclusion of *The Northern Daily Times* and *Liverpool Standard* adds some geographical diversity to this corpus. The political classification is based on historical newspaper press directories; please read [our paper](https://academic.oup.com/dsh/advance-article/doi/10.1093/llc/fqac037/6644524?searchresult=1) on bias in newspaper collections for more information.
The table below contains a more detailed overview of the corpus.
| NLP  | Title                    | Politics     | Location  | Tokens        |
|------|--------------------------|--------------|-----------|---------------|
| 2083 | The Northern Daily Times | NEUTRAL      | LIVERPOOL | 14,094,212    |
| 2084 | The Northern Daily Times | NEUTRAL      | LIVERPOOL | 34,450,366    |
| 2085 | The Northern Daily Times | NEUTRAL      | LIVERPOOL | 16,166,627    |
| 2088 | The Liverpool Standard   | CONSERVATIVE | LIVERPOOL | 149,204,800   |
| 2090 | The Liverpool Standard   | CONSERVATIVE | LIVERPOOL | 6,417,320     |
| 2194 | The Sun                  | LIBERAL      | LONDON    | 1,155,791,480 |
| 2244 | Colored News             | NONE         | LONDON    | 53,634        |
| 2642 | The Express              | LIBERAL      | LONDON    | 236,240,555   |
| 2644 | National Register        | CONSERVATIVE | LONDON    | 23,409,733    |
| 2645 | The Press                | CONSERVATIVE | LONDON    | 15,702,276    |
| 2646 | The Star                 | NONE         | LONDON    | 163,072,742   |
| 2647 | The Statesman            | RADICAL      | LONDON    | 61,225,215    |
Temporally, most of the articles date from the second half of the nineteenth century. The figure below gives an overview of the number of articles by year.
![number of article by year](https://github.com/Living-with-machines/ERWT/raw/main/articles_by_year.png)
## Evaluation: In case you care to count
Our article ["Metadata Might Make Language Models Better"](https://drive.google.com/file/d/1Xp21KENzIeEqFpKvO85FkHynC0PNwBn7/view?usp=sharing) comprises quite an extensive evaluation of all the language models created with MDMA. For details, we recommend you read and cite the current working papers.
The table below shows the [pseudo-perplexity](https://arxiv.org/abs/1910.14659) scores for different models using text documents of 64 and 128 tokens.
In general, [ERWT-year-masked-25](https://huggingface.co/Livingwithmachines/erwt-year-masked-25) turned out to yield the most competitive scores across different tasks, and we generally recommend you use this model.
| model               | mean (64 tokens) | sd (64 tokens) | mean (128 tokens) | sd (128 tokens) |
|---------------------|------------------|----------------|-------------------|-----------------|
| DistilBERT          | 354.40           | 376.32         | 229.19            | 294.70          |
| HMDistilBERT        | 32.94            | 64.78          | 25.72             | 45.99           |
| ERWT-year           | 31.49            | 61.85          | 24.97             | 44.58           |
| ERWT-st             | 31.69            | 62.42          | 25.03             | 44.74           |
| ERWT-year-masked-25 | **30.97**        | 61.50          | **24.59**         | 44.36           |
| ERWT-year-masked-75 | 31.02            | 61.41          | 24.63             | 44.40           |
| PEA                 | 31.63            | 62.09          | 25.58             | 44.99           |
| PEA-st              | 31.65            | 62.19          | 25.59             | 44.99           |
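For readers unfamiliar with pseudo-perplexity: it is the exponential of the negative mean log-probability that a masked language model assigns to each token when that token is masked in turn. The sketch below shows the arithmetic for a single document, assuming the per-token log-probabilities have already been obtained from the model; `pseudo_perplexity` is a hypothetical helper, and reading the table's mean and sd as statistics over per-document scores is our assumption.

```python
import math
from typing import List


def pseudo_perplexity(token_log_probs: List[float]) -> float:
    """Exponential of the negative mean log-probability, where each value is
    the (natural) log-probability of one token scored while it was masked."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Toy input: four tokens, each assigned probability 0.5 when masked,
# which gives a pseudo-perplexity close to 2.
print(pseudo_perplexity([math.log(0.5)] * 4))
```

Lower values mean the model finds the text less surprising, which is why the smallest scores in the table are highlighted.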
## Questions?
Questions? Feedback? Please leave a message!