Kaspar committed on
Commit a806757 • 1 Parent(s): 07a0c80

Update README.md

Files changed (1)
  1. README.md +21 -21
README.md CHANGED
@@ -37,7 +37,7 @@ The ERWT models are trained for **experimental purposes**, please use them with
37
 
38
  You find more detailed information below. Please consult the **limitations** section (seriously, read this section before using the models, **we don't repent in public just for fun**).
39
 
40
- If you can't get enough of these peas and crave for some more, you can still consult our working paper ["Metadata Might Make Language Models Better"](https://drive.google.com/file/d/1Xp21KENzIeEqFpKvO85FkHynC0PNwBn7/view?usp=sharing) for more background information and nerdy evaluation stuff (work in progress, handle with care and kindness).
41
 
42
  ## Background: MDMA to the rescue. 🙂
43
 
@@ -45,16 +45,16 @@ ERWT was created using a **M**eta**D**ata **M**asking **A**pproach (or **MDMA**
45
 
46
  ERWT is a [`distilbert-base-cased`](https://huggingface.co/distilbert-base-cased) model, fine-tuned on a random subsample taken from the Heritage Made Digital newspaper collection. The training data comprises around half a billion words.
47
 
48
- To unleash the power of MDMA, we adapted to the training routine for the masked language model. When preprocessing the text, we prepended each segment of hundred words with a time-stamp (year of publication) and a special `[DATE]` token.
49
 
50
  The snippet below, taken from the [Londonderry Sentinel](https://www.britishnewspaperarchive.co.uk/viewer/bl/0001480/18700722/014/0002),
51
  ```python
52
  "1870 [DATE] Every scrap of intelligence relative to the war between France and Prussia is now read with interest."
53
  ```
54
 
55
- These formatted chunks of text are then forward to the data collator and eventually the language model.
56
 
57
- Exposed to both the tokens and (temporal) metadata, the model learns a relation between text and time. When a token is masked, the prepended `year` field is taken into account when predicting hidden word in the text. Vice versa, when the metadata token at the front of the formatted snippet is hidden, the models aims to predict the year of publication based on the content of a document.
58
 
59
  ## Intended Uses: LMs as History Machines.
60
 
@@ -66,11 +66,11 @@ Let's show how ERWT works with a very concrete example.
66
 
67
  The ERWT models are trained on British newspapers from before 1880 (Why? Long story, don't ask...) and can be used to monitor historical change in this specific context.
68
 
69
- Imagine you are confronted with the following snippet "We received a letter from [MASK] Majesty" and want to predict correct pronoun (again assuming a British context).
70
 
71
- 👩‍🏫 **History Intermezzo** Please remember, for most of in the nineteenth-century, Queen Victoria ruled Britain. From 1837 to 1901 to be precise. Her nineteenth century predecessors (George III, George IV and William IV) were all male.
72
 
73
- While a standard language model will provide you with one a general prediction, based on what it has observed previously in the training corpus, ERWT models allow you to manipulate to prediction, by anchoring the text in specific the year.
74
 
75
  ```python
76
  from transformers import pipeline
@@ -96,7 +96,7 @@ However, if we change the date at the start of the sentence to 1850:
96
  mask_filler(f"1850 [DATE] We received a letter from [MASK] Majesty.")
97
  ```
98
 
99
- Will put most of probability mass on the token "her" and only a little bit on "him".
100
 
101
  ```python
102
  {'score': 0.8168327212333679,
@@ -109,13 +109,13 @@ You can repeat this experiment for yourself using the example sentences in the *
109
 
110
  Okay, but why is this interesting?
111
 
112
- Firstly, eyeballing some toy-examples (but also using more rigorous metrics such as perplexity) shows that MLMs can perform more accurate predictions when it has access to temporal metadata. In other words, ERWT's prediction reflects historical language use more accurately. Model that are sensitive to historical context could
113
 
114
- Secondly, MDMA may reduce biases induced by imbalances in the training data (or at least gives us more of a handle on this problem). Admittedly, we have to prove this more formally, but some experiments at least hint in this direction. The data used for training is highly biased towards the Victorian age and a standard language model trained on this corpus will predict "her" for ```"[MASK] Majesty"```.
115
 
116
  ### Date Prediction
117
 
118
- Another feature of the ERWT model series, is date prediction. Remember that during training the temporal metadata token is often masked. In this case the model effectively learns to situate documents in time based on the tokens they contain.
119
 
120
  By masking the year token, ERWT guesses the document's year of publication.
121
 
@@ -147,24 +147,24 @@ A few years later, in 1870, Prussia aimed artillery southwards and invaded Franc
147
  mask_filler("[MASK] [DATE] The Franco-Prussian war is a matter of great concern.")
148
  ```
149
 
150
- ERWT clearly learned a lot about history of German unification by ploughing through a plethora of nineteenth century newspaper articles: it correctly returns "1870" as the predicted year.
151
 
152
- Again, we have to ask: Who cares? Wikipedia can tell us pretty much the same. More importantly, don't we already have timestamps for newspaper data.
153
 
154
  In both cases, our answers would be "yes, but...". ERWT's time-stamping powers have little instrumental use and won't make us rich (but donations are welcome of course 🤑). Nonetheless, we believe date prediction has value for research purposes. We can use ERWT for "fictitious" prediction, i.e. as a diagnostic tool.
155
 
156
  Firstly, we used date prediction for evaluation purposes, to measure which training routine produces models
157
- Secondly, we could use it as an analytical tool, to study how temporal variation **within** text documents and further scrutinise which features drive the time prediction (it goes without saying that the same applies to other metadata fields, but example predicting political orientation).
158
 
159
  ## Limitations
160
 
161
- The ERWT series were trained for evaluation purposes, and therefore carry some critical limitations.
162
 
163
  ### Training Data
164
 
165
- Many of the limitations are a direct result of the data. ERWT models are trained on a rather small subsample of nineteenth-century British newspapers, and its predictions have to be understood in this context (remember, Her Majesty?). Moreover, the corpus has a strong Metropolitan and liberal bias (see section on Data Description for more information).
166
 
167
- Historically models tend to reflect the past (and present?) stereotypes and prejudices. We strongly advice against using these models outside of the context of historical research. The predictions are likely to exhibit harmful biases and should be investigated critically and understood within the context of nineteenth century British cultural history.
168
 
169
  ### Training Routine
170
 
@@ -182,7 +182,7 @@ Want to know how much, then read our paper!
182
 
183
  The ERWT models are trained on an openly accessible newspaper corpus created by the [Heritage Made Digital (HMD) newspaper digitisation project](https://blogs.bl.uk/thenewsroom/2019/01/heritage-made-digital-the-newspapers.html).
184
  The HMD newspapers comprise around 2 billion words in total, but the bulk of the articles originate from the (then) liberal paper *The Sun*.
185
- Geographically, most papers are metropolitan (i.e. based in London). The inclusion of *The Northern Daily Times* and *Liverpool Standard*, adds some geographical diversity to this corpus. The political classification are taken for historical newspaper press directories, please read [our paper](https://academic.oup.com/dsh/advance-article/doi/10.1093/llc/fqac037/6644524?searchresult=1) on bias in newspaper collections for more information.
186
 
187
  The table below contains a more detailed overview of the corpus.
188
 
@@ -209,11 +209,11 @@ Temporally, most of the articles date from the second half of the nineteenth cen
209
 
210
  ## Evaluation
211
 
212
- Our article ["Metadata Might Make Language Models Better"](https://drive.google.com/file/d/1Xp21KENzIeEqFpKvO85FkHynC0PNwBn7/view?usp=sharing) comprises quite an extensive evaluation of all the language models created with MDMA. For details we recommend you read and cite the current working papers.
213
 
214
- The table below shows the [pseudo-perplexity](https://arxiv.org/abs/1910.14659) scores for different models using text document of 64 and 128 tokens.
215
 
216
- In general, [ERWT-year-masked-25](https://huggingface.co/Livingwithmachines/erwt-year-masked-25) turned out to yield the most competitive scores across different task and generally recommend you use this model.
217
 
218
 
219
 
 
37
 
38
  You find more detailed information below. Please consult the **limitations** section (seriously, read this section before using the models, **we don't repent in public just for fun**).
39
 
40
+ If you can't get enough of these peas and crave some more, you can consult our working paper ["Metadata Might Make Language Models Better"](https://drive.google.com/file/d/1Xp21KENzIeEqFpKvO85FkHynC0PNwBn7/view?usp=sharing) for more background information and nerdy evaluation stuff (work in progress, handle with care and kindness).
41
 
42
  ## Background: MDMA to the rescue. 🙂
43
 
 
45
 
46
  ERWT is a [`distilbert-base-cased`](https://huggingface.co/distilbert-base-cased) model, fine-tuned on a random subsample taken from the Heritage Made Digital newspaper collection. The training data comprises around half a billion words.
47
 
48
+ To unleash the power of MDMA, we adapted the training routine for the masked language model. When preprocessing the text, we prepended each segment of a hundred words with a time stamp (year of publication) and a special `[DATE]` token.
49
 
50
  The snippet below, taken from the [Londonderry Sentinel](https://www.britishnewspaperarchive.co.uk/viewer/bl/0001480/18700722/014/0002),
51
  ```python
52
  "1870 [DATE] Every scrap of intelligence relative to the war between France and Prussia is now read with interest."
53
  ```
54
 
55
+ These formatted chunks of text are then forwarded to the data collator and eventually the language model.
56
 
57
+ Exposed to both the tokens and (temporal) metadata, the model learns a relation between text and time. When a token is masked, the prepended `year` field is taken into account when predicting hidden words in the text. Vice versa, when the metadata token is hidden, the model aims to predict the year of publication based on the content.
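
The preprocessing step described above can be sketched roughly as follows. This is a minimal illustration, not the ERWT training code: the function name `format_segments`, its parameters, and the use of plain whitespace splitting are all assumptions made for the example.

```python
from typing import List


def format_segments(text: str, year: int, segment_length: int = 100) -> List[str]:
    """Split `text` into chunks of `segment_length` words and prepend the
    year of publication plus the special [DATE] token to each chunk."""
    words = text.split()  # assumption: simple whitespace tokenisation
    segments = []
    for start in range(0, len(words), segment_length):
        chunk = " ".join(words[start:start + segment_length])
        segments.append(f"{year} [DATE] {chunk}")
    return segments


# The Londonderry Sentinel snippet is shorter than 100 words,
# so it ends up in a single formatted segment.
example = format_segments(
    "Every scrap of intelligence relative to the war between France "
    "and Prussia is now read with interest.",
    year=1870,
)
print(example[0])
```

Each string in the returned list has the same shape as the formatted snippet shown earlier and could then be handed to a tokenizer and data collator.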
58
 
59
  ## Intended Uses: LMs as History Machines.
60
 
 
66
 
67
  The ERWT models are trained on British newspapers from before 1880 (Why? Long story, don't ask...) and can be used to monitor historical change in this specific context.
68
 
69
+ Imagine you are confronted with the following snippet "We received a letter from [MASK] Majesty" and want to predict the correct pronoun (again assuming a British context).
70
 
71
+ 👩‍🏫 **History Intermezzo** Please remember, for most of the nineteenth century, Queen Victoria ruled Britain. From 1837 to 1901 to be precise. Her nineteenth-century predecessors (George III, George IV and William IV) were all male.
72
 
73
+ While a standard language model will provide you with one general prediction, based on what it has observed previously in the training corpus, ERWT models allow you to manipulate the prediction by anchoring the text in a specific year.
74
 
75
  ```python
76
  from transformers import pipeline
 
96
  mask_filler(f"1850 [DATE] We received a letter from [MASK] Majesty.")
97
  ```
98
 
99
+ Will put most of the probability mass on the token "her" and only a little bit on "him".
100
 
101
  ```python
102
  {'score': 0.8168327212333679,
 
109
 
110
  Okay, but why is this interesting?
111
 
112
+ Firstly, eyeballing some toy examples (but also using more rigorous metrics such as perplexity) shows that MLMs can make more accurate predictions when they have access to temporal metadata. In other words, ERWT's predictions reflect historical language use more accurately.
113
 
114
+ Secondly, MDMA may reduce biases induced by imbalances in the training data (or at least give us more of a handle on this problem). Admittedly, we have to prove this more formally, but some experiments at least hint in this direction. The data used for training is highly biased towards the Victorian age and a standard language model trained on this corpus will predict "her" for ```"[MASK] Majesty"```.
115
 
116
  ### Date Prediction
117
 
118
+ Another feature of the ERWT model series is date prediction. Remember that during training the temporal metadata token is often masked. In this case, the model effectively learns to situate documents in time based on the tokens they contain.
119
 
120
  By masking the year token, ERWT guesses the document's year of publication.
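
As a rough sketch of how such a year guess could be read off the model output: the helper below picks the highest-scoring four-digit token from a list of predictions shaped like the output of a `transformers` fill-mask pipeline (dicts with `score` and `token_str` keys). The name `best_year`, the assumption that a year is predicted as a single token, and the scores in `fake_predictions` are all hypothetical and only serve the illustration.

```python
import re
from typing import Dict, List, Optional


def best_year(predictions: List[Dict], pattern: str = r"\d{4}") -> Optional[str]:
    """Return the highest-scoring prediction whose token looks like a year."""
    years = [
        p for p in predictions
        if re.fullmatch(pattern, p["token_str"].strip())
    ]
    if not years:
        return None  # no year-like token among the candidates
    return max(years, key=lambda p: p["score"])["token_str"].strip()


# Hypothetical predictions (made-up scores), shaped like fill-mask output.
fake_predictions = [
    {"score": 0.10, "token_str": "the"},
    {"score": 0.35, "token_str": "1870"},
    {"score": 0.20, "token_str": "1871"},
]
print(best_year(fake_predictions))
```

Filtering on a year pattern is a design choice for the sketch: it guards against the model proposing an ordinary word in the metadata slot.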
121
 
 
147
  mask_filler("[MASK] [DATE] The Franco-Prussian war is a matter of great concern.")
148
  ```
149
 
150
+ ERWT clearly learned a lot about the history of German unification by ploughing through a plethora of nineteenth-century newspaper articles: it correctly returns "1870" as the predicted year.
151
 
152
+ Again, we have to ask: Who cares? Wikipedia can tell us pretty much the same. More importantly, don't we already have timestamps for newspaper data?
153
 
154
  In both cases, our answers would be "yes, but...". ERWT's time-stamping powers have little instrumental use and won't make us rich (but donations are welcome of course 🤑). Nonetheless, we believe date prediction has value for research purposes. We can use ERWT for "fictitious" prediction, i.e. as a diagnostic tool.
155
 
156
  Firstly, we used date prediction for evaluation purposes, to measure which training routine produces models
157
+ Secondly, we could use it as an analytical tool, to study temporal variation **within** text documents and further scrutinise which features drive the time prediction (it goes without saying that the same applies to other metadata fields, for example predicting political orientation).
158
 
159
  ## Limitations
160
 
161
+ The ERWT series were trained for evaluation purposes and therefore carry some critical limitations.
162
 
163
  ### Training Data
164
 
165
+ Many of the limitations are a direct result of the data. ERWT models are trained on a rather small subsample of nineteenth-century British newspapers, and their predictions have to be understood in this context (remember, Her Majesty?). Moreover, the corpus has a strong Metropolitan and liberal bias (see the section on Data Description for more information).
166
 
167
+ Historical models tend to reflect past (and present?) stereotypes and prejudices. We strongly advise against using these models outside of the context of historical research. The predictions are likely to exhibit harmful biases and should be investigated critically and understood within the context of nineteenth-century British cultural history.
168
 
169
  ### Training Routine
170
 
 
182
 
183
  The ERWT models are trained on an openly accessible newspaper corpus created by the [Heritage Made Digital (HMD) newspaper digitisation project](https://blogs.bl.uk/thenewsroom/2019/01/heritage-made-digital-the-newspapers.html).
184
  The HMD newspapers comprise around 2 billion words in total, but the bulk of the articles originate from the (then) liberal paper *The Sun*.
185
+ Geographically, most papers are metropolitan (i.e. based in London). The inclusion of *The Northern Daily Times* and *Liverpool Standard* adds some geographical diversity to this corpus. The political classification is based on historical newspaper press directories; please read [our paper](https://academic.oup.com/dsh/advance-article/doi/10.1093/llc/fqac037/6644524?searchresult=1) on bias in newspaper collections for more information.
186
 
187
  The table below contains a more detailed overview of the corpus.
188
 
 
209
 
210
  ## Evaluation
211
 
212
+ Our article ["Metadata Might Make Language Models Better"](https://drive.google.com/file/d/1Xp21KENzIeEqFpKvO85FkHynC0PNwBn7/view?usp=sharing) comprises quite an extensive evaluation of all the language models created with MDMA. For details, we recommend you read and cite the current working papers.
213
 
214
+ The table below shows the [pseudo-perplexity](https://arxiv.org/abs/1910.14659) scores for different models using text documents of 64 and 128 tokens.
215
 
216
+ In general, [ERWT-year-masked-25](https://huggingface.co/Livingwithmachines/erwt-year-masked-25) turned out to yield the most competitive scores across different tasks, and we generally recommend you use this model.
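
For readers unfamiliar with the metric, pseudo-perplexity can be sketched as the exponential of the negative mean pseudo-log-likelihood, where each token is scored while it alone is masked. The function below assumes those per-token natural-log probabilities have already been obtained from a masked language model; the name `pseudo_perplexity` and the input layout are illustrative assumptions, not the evaluation code used for the paper.

```python
import math
from typing import Sequence


def pseudo_perplexity(token_log_probs: Sequence[Sequence[float]]) -> float:
    """Compute pseudo-perplexity over a collection of documents.

    token_log_probs[i][t] is the natural-log probability the masked LM
    assigns to token t of document i when only that token is masked.
    """
    total_log_prob = sum(sum(doc) for doc in token_log_probs)
    n_tokens = sum(len(doc) for doc in token_log_probs)
    if n_tokens == 0:
        raise ValueError("need at least one scored token")
    # exp of the negative average log-probability per token
    return math.exp(-total_log_prob / n_tokens)
```

Lower values mean the model assigns higher probability to the held-out tokens, which is why lower pseudo-perplexity is read as a better fit.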
217
 
218
 
219