Kaspar committed on
Commit
53215eb
1 Parent(s): cf9efaf

Update README.md

  1. README.md +42 -2
README.md CHANGED
@@ -164,6 +164,8 @@ The ERWT series were trained for evaluation purposes, and therefore carry some c
164
 
165
  Many of the limitations are a direct result of the data. ERWT models are trained on a rather small subsample of nineteenth-century British newspapers, and their predictions have to be understood in this context (remember, Her Majesty?). Moreover, the corpus has a strong metropolitan and liberal bias (see section on Data Description for more information).
166
 
 
 
167
  ### Training Routine
168
 
169
  We created this model as part of a wider experiment, which attempted to establish best practices for training models with metadata. An overview of all the models is available on our [GitHub](https://github.com/Living-with-machines/ERWT/) page.
@@ -171,8 +173,46 @@ We created this model as part of a wider experiment, which attempted to establis
171
  To reduce training time, we based our experiments on a random subsample of the HMD corpus, consisting of half a billion tokens.
172
  Furthermore, we only trained the models for one epoch, which implies they are most likely undertrained at the moment.
173
 
174
- We were mainly interested in the **relative** performance of the different ERWT models. We did, however, compared ERWT with with [`distilbert-base-cased`](https://huggingface.co/distilbert-base-cased) in our evaluation experiments, and of course, our tiny LM peas
175
- did much better. 🥳
 
 
176
 
177
  ## Data Description
178
 
164
 
165
  Many of the limitations are a direct result of the data. ERWT models are trained on a rather small subsample of nineteenth-century British newspapers, and their predictions have to be understood in this context (remember, Her Majesty?). Moreover, the corpus has a strong metropolitan and liberal bias (see section on Data Description for more information).
166
 
167
+ Historical models tend to reflect past (and present?) stereotypes and prejudices. We strongly advise against using these models outside the context of historical research. The predictions are likely to exhibit harmful biases and should be investigated critically and understood within the context of nineteenth-century British cultural history.
168
+
169
  ### Training Routine
170
 
171
  We created this model as part of a wider experiment, which attempted to establish best practices for training models with metadata. An overview of all the models is available on our [GitHub](https://github.com/Living-with-machines/ERWT/) page.
 
173
  To reduce training time, we based our experiments on a random subsample of the HMD corpus, consisting of half a billion tokens.
174
  Furthermore, we only trained the models for one epoch, which implies they are most likely undertrained at the moment.
175
 
176
+ We were mainly interested in the **relative** performance of the different ERWT models. We did, however, compare ERWT with [`distilbert-base-cased`](https://huggingface.co/distilbert-base-cased) in our evaluation experiments, and of course, our tiny LM peas
177
+ did much better. 🎉🥳
178
+
179
+ Want to know how much? Then read our paper!
180
 
181
  ## Data Description
182
 
183
+ The ERWT models are trained on an openly accessible newspaper corpus created by the [Heritage Made Digital (HMD) newspaper digitisation project](https://blogs.bl.uk/thenewsroom/2019/01/heritage-made-digital-the-newspapers.html).
184
+ The HMD newspapers comprise around 2 billion words in total, but the bulk of the articles originate from the (then) liberal paper *The Sun*.
185
+ Geographically, most papers are metropolitan (i.e. based in London). The inclusion of *The Northern Daily Times* and *Liverpool Standard* adds some geographical diversity to this corpus. The political classifications are taken from historical newspaper press directories; please read [our paper](https://academic.oup.com/dsh/advance-article/doi/10.1093/llc/fqac037/6644524?searchresult=1) on bias in newspaper collections for more information.
186
+
187
+ The table below contains a more detailed overview of the corpus.
188
+ | NLP | Title | Politics | Location | Tokens |
189
+ |------|--------------------------|--------------|-----------|---------------|
190
+ | 2083 | The Northern Daily Times | NEUTRAL | LIVERPOOL | 14,094,212 |
191
+ | 2084 | The Northern Daily Times | NEUTRAL | LIVERPOOL | 34,450,366 |
192
+ | 2085 | The Northern Daily Times | NEUTRAL | LIVERPOOL | 16,166,627 |
193
+ | 2088 | The Liverpool Standard | CONSERVATIVE | LIVERPOOL | 149,204,800 |
194
+ | 2090 | The Liverpool Standard | CONSERVATIVE | LIVERPOOL | 6,417,320 |
195
+ | 2194 | The Sun | LIBERAL | LONDON | 1,155,791,480 |
196
+ | 2244 | Colored News | NONE | LONDON | 53,634 |
197
+ | 2642 | The Express | LIBERAL | LONDON | 236,240,555 |
198
+ | 2644 | National Register | CONSERVATIVE | LONDON | 23,409,733 |
199
+ | 2645 | The Press | CONSERVATIVE | LONDON | 15,702,276 |
200
+ | 2646 | The Star | NONE | LONDON | 163,072,742 |
201
+ | 2647 | The Statesman | RADICAL | LONDON | 61,225,215 |
202
+
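To make the liberal and metropolitan skew described above concrete, the token counts in the corpus table can be aggregated per label. The snippet below is only an illustrative sketch: the `CORPUS` list and the `token_shares` helper are names introduced here for the example and are not part of the ERWT code base; the figures themselves are copied from the table.

```python
from collections import defaultdict

# Rows copied from the corpus table: (NLP id, title, politics, location, tokens).
CORPUS = [
    (2083, "The Northern Daily Times", "NEUTRAL", "LIVERPOOL", 14_094_212),
    (2084, "The Northern Daily Times", "NEUTRAL", "LIVERPOOL", 34_450_366),
    (2085, "The Northern Daily Times", "NEUTRAL", "LIVERPOOL", 16_166_627),
    (2088, "The Liverpool Standard", "CONSERVATIVE", "LIVERPOOL", 149_204_800),
    (2090, "The Liverpool Standard", "CONSERVATIVE", "LIVERPOOL", 6_417_320),
    (2194, "The Sun", "LIBERAL", "LONDON", 1_155_791_480),
    (2244, "Colored News", "NONE", "LONDON", 53_634),
    (2642, "The Express", "LIBERAL", "LONDON", 236_240_555),
    (2644, "National Register", "CONSERVATIVE", "LONDON", 23_409_733),
    (2645, "The Press", "CONSERVATIVE", "LONDON", 15_702_276),
    (2646, "The Star", "NONE", "LONDON", 163_072_742),
    (2647, "The Statesman", "RADICAL", "LONDON", 61_225_215),
]


def token_shares(rows, column):
    """Return {label: share of all tokens} for the given column index."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[column]] += row[-1]  # the last field is the token count
    grand_total = sum(totals.values())
    return {label: count / grand_total for label, count in totals.items()}


total_tokens = sum(row[-1] for row in CORPUS)
by_title = token_shares(CORPUS, 1)
by_politics = token_shares(CORPUS, 2)
by_location = token_shares(CORPUS, 3)

print(f"total tokens: {total_tokens:,}")
for label, share in sorted(by_politics.items(), key=lambda kv: -kv[1]):
    print(f"{label:<13}{share:6.1%}")
```

On these figures, roughly three quarters of the tokens come from titles labelled LIBERAL and close to nine tenths from London titles, which is the imbalance the limitations section warns about.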
203
+ Temporally, most of the articles date from the second half of the nineteenth century.
204
+
205
+ ## Evaluation
206
+
207
+ | length | 64 | | 128 | |
208
+ |------------------|----------------|--------|----------------|--------|
209
+ | model | mean | sd | mean | sd |
210
+ | DistilBERT | 354.40 | 376.32 | 229.19 | 294.70 |
211
+ | HMDistilBERT | 32.94 | 64.78 | 25.72 | 45.99 |
212
+ | ERWT | 31.49 | 61.85 | 24.97 | 44.58 |
213
+ | ERWT\_st | 31.69 | 62.42 | 25.03 | 44.74 |
214
+ | ERWT\_masked\_25 | **30.97** | 61.50 | **24.59** | 44.36 |
215
+ | ERWT\_masked\_75 | 31.02 | 61.41 | 24.63 | 44.40 |
216
+ | PEA | 31.63 | 62.09 | 25.58 | 44.99 |
217
+ | PEA\_st | 31.65 | 62.19 | 25.59 | 44.99 |
218
+
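As a rough way of reading the evaluation table, the mean scores can be expressed relative to the `DistilBERT` baseline. This is a sketch under stated assumptions: the table does not name the metric, so the code assumes that lower scores are better (consistent with the remark that the ERWT models "did much better"), and the `SCORES` dictionary and `ratio_to_baseline` helper are hypothetical names introduced only for this example.

```python
# Mean scores copied from the evaluation table, keyed by model and length.
# Assumption: lower is better (the table does not name the metric).
SCORES = {
    "DistilBERT": {64: 354.40, 128: 229.19},
    "HMDistilBERT": {64: 32.94, 128: 25.72},
    "ERWT": {64: 31.49, 128: 24.97},
    "ERWT_st": {64: 31.69, 128: 25.03},
    "ERWT_masked_25": {64: 30.97, 128: 24.59},
    "ERWT_masked_75": {64: 31.02, 128: 24.63},
    "PEA": {64: 31.63, 128: 25.58},
    "PEA_st": {64: 31.65, 128: 25.59},
}


def ratio_to_baseline(scores, baseline="DistilBERT"):
    """Return baseline_mean / model_mean for every model and length."""
    base = scores[baseline]
    return {
        model: {length: base[length] / value for length, value in means.items()}
        for model, means in scores.items()
    }


ratios = ratio_to_baseline(SCORES)
# Model with the lowest mean score at each length.
best = {length: min(SCORES, key=lambda m: SCORES[m][length]) for length in (64, 128)}

for model, per_length in ratios.items():
    print(f"{model:<16}{per_length[64]:6.2f}x {per_length[128]:6.2f}x")
print("lowest mean per length:", best)
```

The ratios only summarise the means; the standard deviations in the table are large relative to the differences between the ERWT variants, so small gaps between those models should be read with caution.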