Livingwithmachines/erwt-year

@@ -164,6 +164,8 @@ The ERWT series were trained for evaluation purposes, and therefore carry some c
 Many of the limitations are a direct result of the data. ERWT models are trained on a rather small subsample of nineteenth-century British newspapers, and their predictions have to be understood in this context (remember, Her Majesty?). Moreover, the corpus has a strong metropolitan and liberal bias (see the Data Description section for more information).
 ### Training Routine
 We created this model as part of a wider experiment, which attempted to establish best practices for training models with metadata. An overview of all the models is available on our [GitHub](https://github.com/Living-with-machines/ERWT/) page.
@@ -171,8 +173,46 @@ We created this model as part of a wider experiment, which attempted to establis
 To reduce training time, we based our experiments on a random subsample of the HMD corpus, consisting of half a billion tokens.
 Furthermore, we only trained the models for one epoch, which implies they are most likely undertrained at the moment.
-We were mainly interested in the **relative** performance of the different ERWT models. We did, however, compared ERWT with with [`distilbert-base-cased`](https://huggingface.co/distilbert-base-cased) in our evaluation experiments, and of course, our tiny LM peas
-did much better. 🥳
 ## Data Description

 Many of the limitations are a direct result of the data. ERWT models are trained on a rather small subsample of nineteenth-century British newspapers, and their predictions have to be understood in this context (remember, Her Majesty?). Moreover, the corpus has a strong metropolitan and liberal bias (see the Data Description section for more information).
+Historical models tend to reflect past (and present?) stereotypes and prejudices. We strongly advise against using these models outside the context of historical research. Their predictions are likely to exhibit harmful biases and should be investigated critically and understood within the context of nineteenth-century British cultural history.
 ### Training Routine
 We created this model as part of a wider experiment, which attempted to establish best practices for training models with metadata. An overview of all the models is available on our [GitHub](https://github.com/Living-with-machines/ERWT/) page.
 To reduce training time, we based our experiments on a random subsample of the HMD corpus, consisting of half a billion tokens.
 Furthermore, we only trained the models for one epoch, which implies they are most likely undertrained at the moment.
+We were mainly interested in the **relative** performance of the different ERWT models. We did, however, compare ERWT with [`distilbert-base-cased`](https://huggingface.co/distilbert-base-cased) in our evaluation experiments, and of course, our tiny LM peas
+did much better. 🎉🥳
+Want to know how much better? Then read our paper!
 ## Data Description
+The ERWT models are trained on an openly accessible newspaper corpus created by the [Heritage Made Digital (HMD) newspaper digitisation project](https://blogs.bl.uk/thenewsroom/2019/01/heritage-made-digital-the-newspapers.html).
+The HMD newspapers comprise around 2 billion words in total, but the bulk of the articles originate from the (then) liberal paper *The Sun*.
+Geographically, most papers are metropolitan (i.e. based in London). The inclusion of *The Northern Daily Times* and *The Liverpool Standard* adds some geographical diversity to this corpus. The political classifications are taken from historical newspaper press directories; please read [our paper](https://academic.oup.com/dsh/advance-article/doi/10.1093/llc/fqac037/6644524?searchresult=1) on bias in newspaper collections for more information.
+The table below contains a more detailed overview of the corpus.
+| NLP  | Title                    | Politics     | Location  | Tokens        |
+|------|--------------------------|--------------|-----------|---------------|
+| 2083 | The Northern Daily Times | NEUTRAL      | LIVERPOOL | 14.094.212    |
+| 2084 | The Northern Daily Times | NEUTRAL      | LIVERPOOL | 34.450.366    |
+| 2085 | The Northern Daily Times | NEUTRAL      | LIVERPOOL | 16.166.627    |
+| 2088 | The Liverpool Standard   | CONSERVATIVE | LIVERPOOL | 149.204.800   |
+| 2090 | The Liverpool Standard   | CONSERVATIVE | LIVERPOOL | 6.417.320     |
+| 2194 | The Sun                  | LIBERAL      | LONDON    | 1.155.791.480 |
+| 2244 | Colored News             | NONE         | LONDON    | 53.634        |
+| 2642 | The Express              | LIBERAL      | LONDON    | 236.240.555   |
+| 2644 | National Register        | CONSERVATIVE | LONDON    | 23.409.733    |
+| 2645 | The Press                | CONSERVATIVE | LONDON    | 15.702.276    |
+| 2646 | The Star                 | NONE         | LONDON    | 163.072.742   |
+| 2647 | The Statesman            | RADICAL      | LONDON    | 61.225.215    |
+Temporally, most of the articles date from the second half of the nineteenth century.
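
The metropolitan and liberal skew described above can be checked directly from the token counts in the table. The snippet below is a minimal sketch, not part of the ERWT code base: the `CORPUS` list is transcribed by hand from the table, and the helper name `token_shares` is an assumption introduced for illustration.

```python
from collections import defaultdict

# Rows transcribed from the corpus table above:
# (NLP id, title, politics, location, tokens).
# The dots in the table are thousands separators.
CORPUS = [
    (2083, "The Northern Daily Times", "NEUTRAL", "LIVERPOOL", 14_094_212),
    (2084, "The Northern Daily Times", "NEUTRAL", "LIVERPOOL", 34_450_366),
    (2085, "The Northern Daily Times", "NEUTRAL", "LIVERPOOL", 16_166_627),
    (2088, "The Liverpool Standard", "CONSERVATIVE", "LIVERPOOL", 149_204_800),
    (2090, "The Liverpool Standard", "CONSERVATIVE", "LIVERPOOL", 6_417_320),
    (2194, "The Sun", "LIBERAL", "LONDON", 1_155_791_480),
    (2244, "Colored News", "NONE", "LONDON", 53_634),
    (2642, "The Express", "LIBERAL", "LONDON", 236_240_555),
    (2644, "National Register", "CONSERVATIVE", "LONDON", 23_409_733),
    (2645, "The Press", "CONSERVATIVE", "LONDON", 15_702_276),
    (2646, "The Star", "NONE", "LONDON", 163_072_742),
    (2647, "The Statesman", "RADICAL", "LONDON", 61_225_215),
]

TITLE, POLITICS, LOCATION, TOKENS = 1, 2, 3, 4


def token_shares(rows, column):
    """Return the fraction of all tokens that falls under each value of `column`."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[column]] += row[TOKENS]
    grand_total = sum(totals.values())
    return {key: count / grand_total for key, count in totals.items()}


total_tokens = sum(row[TOKENS] for row in CORPUS)
by_title = token_shares(CORPUS, TITLE)
by_politics = token_shares(CORPUS, POLITICS)
by_location = token_shares(CORPUS, LOCATION)

print(f"total tokens: {total_tokens:,}")
for label, shares in [("title", by_title), ("politics", by_politics), ("location", by_location)]:
    print(f"\nshare by {label}:")
    for key, share in sorted(shares.items(), key=lambda item: -item[1]):
        print(f"  {key:<26} {share:6.1%}")
```

With the figures from the table, this gives roughly 1.88 billion tokens in total, of which about 62% come from *The Sun* and about 88% from London titles.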
+## Evaluation
+| Model            | mean (length 64) | sd (length 64) | mean (length 128) | sd (length 128) |
+|------------------|------------------|----------------|-------------------|-----------------|
+| DistilBERT       | 354.40           | 376.32         | 229.19            | 294.70          |
+| HMDistilBERT     | 32.94            | 64.78          | 25.72             | 45.99           |
+| ERWT             | 31.49            | 61.85          | 24.97             | 44.58           |
+| ERWT\_st         | 31.69            | 62.42          | 25.03             | 44.74           |
+| ERWT\_masked\_25 | **30.97**        | 61.50          | **24.59**         | 44.36           |
+| ERWT\_masked\_75 | 31.02            | 61.41          | 24.63             | 44.40           |
+| PEA              | 31.63            | 62.09          | 25.58             | 44.99           |
+| PEA\_st          | 31.65            | 62.19          | 25.59             | 44.99           |
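
To make the relative comparison concrete, the mean scores from the table can be put side by side. The sketch below assumes that lower scores are better, which is our reading of the table (the lowest values are the ones in bold) and is not stated explicitly here. The `SCORES` dictionary is transcribed by hand, and the helper names `best_model` and `ratio_to_baseline` are hypothetical.

```python
# Mean scores transcribed from the evaluation table above,
# keyed by model name and then by sequence length.
SCORES = {
    "DistilBERT": {64: 354.40, 128: 229.19},
    "HMDistilBERT": {64: 32.94, 128: 25.72},
    "ERWT": {64: 31.49, 128: 24.97},
    "ERWT_st": {64: 31.69, 128: 25.03},
    "ERWT_masked_25": {64: 30.97, 128: 24.59},
    "ERWT_masked_75": {64: 31.02, 128: 24.63},
    "PEA": {64: 31.63, 128: 25.58},
    "PEA_st": {64: 31.65, 128: 25.59},
}


def best_model(scores, length):
    """Return the model with the lowest mean score (assumption: lower is better)."""
    return min(scores, key=lambda model: scores[model][length])


def ratio_to_baseline(scores, baseline, length):
    """Return baseline score divided by each other model's score at `length`."""
    base = scores[baseline][length]
    return {
        model: base / by_length[length]
        for model, by_length in scores.items()
        if model != baseline
    }


for length in (64, 128):
    print(f"length {length}: best = {best_model(SCORES, length)}")
    for model, ratio in ratio_to_baseline(SCORES, "DistilBERT", length).items():
        print(f"  DistilBERT / {model:<15} = {ratio:5.2f}")
```

Under that assumption, the differences among the models trained on historical data are small compared with the gap between any of them and the off-the-shelf DistilBERT baseline.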