Kaspar committed on
Commit
fe148ce
•
1 Parent(s): 48c9844

Update README.md

  1. README.md +76 -45
README.md CHANGED
@@ -5,9 +5,10 @@ tags:
5
  - library
6
  - historic
7
  - glam
 
8
  license: mit
9
  metrics:
10
- - f1
11
  widget:
12
  - text: "1820 [DATE] We received a letter from [MASK] Majesty."
13
  - text: "1850 [DATE] We received a letter from [MASK] Majesty."
@@ -21,15 +22,29 @@ widget:
21
 
22
  # ERWT-year
23
 
24
- 🌺ERWT is a language model that is (🤭 maybe 🤫) better at history than you...🌺
25
 
26
- ERWT is a fine-tuned [`distilbert-base-cased`](https://huggingface.co/distilbert-base-cased) model trained on historical newspapers from the [Heritage Made Digital collection](https://huggingface.co/datasets/davanstrien/hmd-erwt-training).
27
 
28
  We trained a model based on a combination of text and **temporal metadata** (i.e. year information).
29
 
30
- ERWT performs **time-sensitive masked language modelling** and can be used for **date prediction** as well.
31
 
32
- This model is served to you by [Kaspar von Beelen](https://huggingface.co/Kaspar) and [Daniel van Strien](https://huggingface.co/davanstrien), *"Improving AI, one pea at a time"*.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
33
 
34
  \*ERWT is Dutch for PEA.
35
 
@@ -50,30 +65,37 @@ This model is served to you by [Kaspar von Beelen](https://huggingface.co/Kaspar
50
 
51
  ## Introductory Note: Repent Now. 😇
52
 
53
- The ERWT models are trained for **experimental purposes**, please use them with care.
54
 
55
- You find more detailed information below. Please consult the **limitations** section (seriously, read this section before using the models, **we don't repent in public just for fun**).
56
 
57
- If you can't get enough of these peas and crave some more. In that case, you can consult our working paper ["Metadata Might Make Language Models Better"](https://arxiv.org/abs/2211.10086) for more background information and nerdy evaluation stuff (work in progress, handle with care and kindness).
58
 
59
  ## Background: MDMA to the rescue. 🙂
60
 
61
- ERWT was created using a **M**eta**D**ata **M**asking **A**pproach (or **MDMA** 💊), in which we train a Masked Language Model simultaneously on text and metadata. Our intuition was that incorporating metadata (information that describes a text but and is not part of the content, such as the time/place of publication or the political orientation) may make language models "better", or at least make them more sensitive to historical, political and geographical aspects of language use.
62
 
63
- ERWT is a [`distilbert-base-cased`](https://huggingface.co/distilbert-base-cased) model, fine-tuned on a random subsample taken from the Heritage Made Digital newspaper collection. The training data comprises around half a billion words.
64
 
65
- To unleash the power of MDMA, we adapted to the training routine for the masked language model. When preprocessing the text, we prepended each segment of hundred words with a time stamp (year of publication) and a special `[DATE]` token.
66
 
 
67
 
68
 
69
- The snippet below, gives an example taken from the [Londonderry Sentinel](https://www.britishnewspaperarchive.co.uk/viewer/bl/0001480/18700722/014/0002),
 
 
 
 
 
 
70
  ```python
71
  "1870 [DATE] Every scrap of intelligence relative to the war between France and Prussia is now read with interest."
72
  ```
73
 
74
- These formatted chunks of text are then forwarded to the data collator and eventually the language model.
75
 
76
- Exposed to both the tokens and (temporal) metadata, the model learns a relation between text and time. When a token is masked, the prepended `year` field is taken into account when predicting hidden words in the text. Vice versa, when the metadata token is hidden, the model aims to predict the year of publication based on the content.
77
 
78
  ## Intended Use: LMs as History Machines.
79
 
@@ -83,13 +105,15 @@ Exposing the model to temporal metadata allows us to investigate **historical la
83
 
84
  Let's show how ERWT works with a very concrete example.
85
 
86
- The ERWT models are trained on British newspapers from before 1870 (Why? Long story, don't ask...) and can be used to monitor historical change in this specific context.
87
 
88
- Imagine you are confronted with the following snippet "We received a letter from [MASK] Majesty" and want to predict the correct pronoun (again assuming a British context).
89
 
90
- 👩‍🏫 **History Intermezzo** Please remember, for most of in the nineteenth century, Queen Victoria ruled Britain. From 1837 to 1901 to be precise. Her nineteenth-century predecessors (George III, George IV and William IV) were all male.
91
 
92
- While a standard language model will provide you with one a general prediction, based on what it has observed previously in the training corpus, ERWT models allow you to manipulate to prediction, by anchoring the text in a specific year.
 
 
93
 
94
  ```python
95
  from transformers import pipeline
@@ -100,7 +124,7 @@ mask_filler = pipeline("fill-mask",
100
  mask_filler(f"1820 [DATE] We received a letter from [MASK] Majesty.")
101
  ```
102
 
103
- Returns as most likely prediction:
104
 
105
  ```python
106
  {'score': 0.8527863025665283,
@@ -115,7 +139,7 @@ However, if we change the date at the start of the sentence to 1850:
115
  mask_filler(f"1850 [DATE] We received a letter from [MASK] Majesty.")
116
  ```
117
 
118
- Will put most the probability mass on the token "her" and only a little bit on "him".
119
 
120
  ```python
121
  {'score': 0.8168327212333679,
@@ -126,19 +150,21 @@ Will put most the probability mass on the token "her" and only a little bit on "
126
 
127
  You can repeat this experiment for yourself using the example sentences in the **Hosted inference API** at the top right.
128
 
129
- Okay, but why is this interesting?
 
 
130
 
131
- Firstly, eyeballing some toy examples (but also using more rigorous metrics such as perplexity) shows that MLMs can perform more accurate predictions when it has access to temporal metadata. In other words, ERWT's prediction reflects historical language use more accurately.
132
 
133
- Secondly, MDMA may reduce biases induced by imbalances in the training data (or at least give us more of a handle on this problem). Admittedly, we have to prove this more formally, but some experiments at least hint in this direction. The data used for training is highly biased towards the Victorian age and a standard language model trained on this corpus will predict "her" for ```"[MASK] Majesty"```.
134
 
135
  ### Date Prediction: Pub Quiz with LMs 🍻
136
 
137
- Another feature of the ERWT model series is date prediction. Remember that during training the temporal metadata token is often masked. In this case, the model effectively learns to situate documents in time based on the tokens they contain.
138
 
139
- By masking the year token, ERWT guesses the document's year of publication.
140
 
141
- 👩‍🏫 **History Intermezzo** To unite the German states (there were plenty!), Prussia fought a number of wars with its neighbours in the second half of the nineteenth century. It invaded Denmark in 1864 (the second of the Schleswig Wars) and France in 1870 (the Franco-Prussian war).
142
 
143
  Reusing the code above, we can time-stamp documents by masking the year. For example, the line of Python code below:
144
 
@@ -158,22 +184,23 @@ Outputs as most likely filler:
158
  ```
159
 
160
 
161
- The prediction "1864" makes sense as this was indeed the year of Prussian troops (with some help of their Austrian friends) crossed the border into Schleswig, then part of the Kingdom of Denmark.
162
 
163
- A few years later, in 1870, Prussia aimed artillery southwards and invaded France.
164
 
165
  ```python
166
  mask_filler("[MASK] [DATE] The Franco-Prussian war is a matter of great concern.")
167
  ```
168
 
169
- ERWT clearly learned a lot about the history of German unification by ploughing through a plethora of nineteenth-century newspaper articles: it correctly returns "1870" as the predicted year.
170
 
171
  Again, we have to ask: Who cares? Wikipedia can tell us pretty much the same. More importantly, don't we already have timestamps for newspaper data?
172
 
173
- In both cases, our answers would be "yes, but...". ERWT's time-stamping powers have little instrumental use and won't make us rich (but donations are welcome of course 🤑). Nonetheless, we believe date prediction has value for research purposes. We can use ERWT for "fictitious" prediction, i.e. as a diagnostic tool.
174
 
175
- Firstly, we used date prediction for evaluation purposes, to measure which training routine produces models
176
- Secondly, we could use it as an analytical tool, to study how temporal variation **within** text documents and further scrutinise which features drive the time prediction (it goes without saying that the same applies to other metadata fields, but for example predicting political orientation).
 
177
 
178
  ## Limitations: Not all is well 😮.
179
 
@@ -181,13 +208,15 @@ The ERWT series were trained for evaluation purposes and therefore carry some cr
181
 
182
  ### Training Data
183
 
184
- Many of the limitations are a direct result of the data. ERWT models are trained on a rather small subsample of nineteenth-century British newspapers, and its predictions have to be understood in this context (remember, Her Majesty?). Moreover, the corpus has a strong Metropolitan and liberal bias (see the section on Data Description for more information).
 
 
185
 
186
- The training data ranges from 1800 to 1870. If your period of interest is outside of this historical period, it's unlikely that ERWT will be of much use. Don't ask the poor model to predict when the Second World War happened. ERWT can be smart (at times) but it doesn't have the power of fortune-telling. At least not yet...
187
 
188
- Historically models tend to reflect past (and present?) stereotypes and prejudices. We strongly advise against using these models outside of the context of historical research. The predictions are likely to exhibit harmful biases and should be investigated critically and understood within the context of nineteenth-century British cultural history.
189
 
190
- One way of evaluating a model's bias is to evaluate the impact of making a change to a prompt and evaluating the impact on the predicted [MASK] token. Often a comparison is made between the predictions given for the prompt 'The **man** worked as a [MASK]' compared to the prompt 'The **woman** worked as a [MASK]'. An example of the output for this model:
191
 
192
  ```
193
  1810 [DATE] The man worked as a [MASK].
@@ -241,19 +270,21 @@ Produces the following three top predicted mask tokens
241
  ]
242
  ```
243
 
244
- Often this promoting prompt evaluation is done to assess the bias in *contemporary* language models. Often these biases reflect the training data used to train the model. In the case of historic language models, the bias exhibited by a model *may* be a valuable research tool in assessing (at scale) the use of language over time. For this particular prompt, the 'bias' exhibited by the language model (and the underlying data) may be a relatively accurate reflection of employment patterns during the 19th century. A possible area of exploration is to see how these predictions change when the model is prompted with different dates. With a dataset covering a more extended time period, we may expect to see a decline in the [MASK] `servant` toward the end of the 19th Century and particularly following the start of the First World War when the number of domestic servants employed in the United Kingdom fell rapidly.
 
 
245
 
246
  ### Training Routine
247
 
248
- We created this model as part of a wider experiment, which attempted to establish best practices for training models with metadata. An overview of all the models is available on our [GitHub](https://github.com/Living-with-machines/ERWT/) page.
249
 
250
  To reduce training time, we based our experiments on a random subsample of the HMD corpus, consisting of half a billion tokens.
251
- Furthermore, we only trained the models for one epoch, which implies they are most likely undertrained at the moment.
252
 
253
- We were mainly interested in the **relative** performance of the different ERWT models. We did, however, compared ERWT with [`distilbert-base-cased`](https://huggingface.co/distilbert-base-cased) in our evaluation experiments, and of course, our tiny LM peas
254
  did much better. 🎉🥳
255
 
256
- Want to know how much, then read [our working paper](https://arxiv.org/abs/2211.10086)!
257
 
258
  ## Data Description
259
 
@@ -286,12 +317,11 @@ Temporally, most of the articles date from the second half of the nineteenth cen
286
 
287
  ## Evaluation: 🤓 In case you care to count 🤓
288
 
289
- Our article ["Metadata Might Make Language Models Better"](https://drive.google.com/file/d/1Xp21KENzIeEqFpKvO85FkHynC0PNwBn7/view?usp=sharing) comprises quite an extensive evaluation of all the language models created with MDMA. For details, we recommend you read and cite the current [working paper](https://arxiv.org/abs/2211.10086).
290
-
291
- The table below shows the [pseudo-perplexity](https://arxiv.org/abs/1910.14659) scores for different models using text documents of 64 and 128 tokens.
292
 
293
- In general, [ERWT-year-masked-25](https://huggingface.co/Livingwithmachines/erwt-year-masked-25) turned out to yield the most competitive scores across different tasks and generally recommend you use this model.
294
 
 
295
 
296
 
297
  | text length | 64 | | 128 | |
@@ -311,3 +341,4 @@ In general, [ERWT-year-masked-25](https://huggingface.co/Livingwithmachines/erwt
311
 
312
  Questions? Feedback? Please leave a message!
313
 
 
 
5
  - library
6
  - historic
7
  - glam
8
+ - mdma
9
  license: mit
10
  metrics:
11
+ - pseudo-perplexity
12
  widget:
13
  - text: "1820 [DATE] We received a letter from [MASK] Majesty."
14
  - text: "1850 [DATE] We received a letter from [MASK] Majesty."
 
22
 
23
  # ERWT-year
24
 
25
+ 🌺ERWT\* is a language model that (🤭 maybe 🤫) knows more about history than you...🌺
26
 
27
+ ERWT is a fine-tuned [`distilbert-base-uncased`](https://huggingface.co/distilbert-base-uncased) model trained on historical newspapers from the [Heritage Made Digital collection](https://huggingface.co/datasets/davanstrien/hmd-erwt-training).
28
 
29
  We trained a model based on a combination of text and **temporal metadata** (i.e. year information).
30
 
31
+ ERWT performs [**time-sensitive masked language modelling**](#historical-language-change-herhis-majesty-%F0%9F%91%91) or [**date prediction**](#date-prediction-pub-quiz-with-lms-%F0%9F%8D%BB).
32
 
33
+ This model is served by [Kaspar von Beelen](https://huggingface.co/Kaspar) and [Daniel van Strien](https://huggingface.co/davanstrien), *"Improving AI, one pea at a time"*.
34
+
35
+ If these models happen to be useful, please cite our working paper.
36
+
37
+ ```
38
+ @misc{https://doi.org/10.48550/arxiv.2211.10086,
39
+ doi = {10.48550/ARXIV.2211.10086},
40
+ url = {https://arxiv.org/abs/2211.10086},
41
+ author = {Beelen, Kaspar and van Strien, Daniel},
42
+ keywords = {Computation and Language (cs.CL), Digital Libraries (cs.DL), FOS: Computer and information sciences, FOS: Computer and information sciences},
43
+ title = {Metadata Might Make Language Models Better},
44
+ publisher = {arXiv},
45
+ year = {2022},
46
+ copyright = {Creative Commons Attribution 4.0 International}}
47
+ ```
48
 
49
  \*ERWT is Dutch for PEA.
50
 
 
65
 
66
  ## Introductory Note: Repent Now. 😇
67
 
68
+ The ERWT models are trained for **experimental purposes**.
69
 
70
+ Please consult the [**limitations**](#limitations-not-all-is-well-%F0%9F%98%AE) section before using the models (seriously, read this section, **we don't repent in public just for fun**).
71
 
72
+ If you can't get enough of these neural peas and crave some more, you can consult our working paper ["Metadata Might Make Language Models Better"](https://arxiv.org/abs/2211.10086) for more background information and nerdy evaluation stuff (work in progress, handle with care and kindness).
73
 
74
  ## Background: MDMA to the rescue. 🙂
75
 
76
+ ERWT was created using a **M**eta**D**ata **M**asking **A**pproach (or **MDMA** 💊), a scenario in which we train a Masked Language Model (MLM) on text and metadata simultaneously. Our intuition was that incorporating metadata (information that *describes* a text but is not part of the content) may make language models "better", or at least make them more **sensitive** to historical, political and geographical aspects of language use. We mainly use temporal, political and geographical metadata.
77
 
78
+ ERWT is a [`distilbert-base-uncased`](https://huggingface.co/distilbert-base-uncased) model, fine-tuned on a random subsample taken from the [Heritage Made Digital newspaper collection](https://huggingface.co/datasets/davanstrien/hmd-erwt-training). The training data comprises around half a billion words.
79
 
80
+ To unleash the power of MDMA, we adapted the training routine, mainly by fiddling with the input data.
81
 
82
+ When preprocessing the text, we prepended each segment of a hundred words with a time stamp (year of publication) and a special `[DATE]` token.
83
 
84
 
85
+ The snippet below, taken from the [Londonderry Sentinel](https://www.britishnewspaperarchive.co.uk/viewer/bl/0001480/18700722/014/0002)...
86
+ ```
87
+ Every scrap of intelligence relative to the war between France and Prussia is now read with interest.
88
+ ```
89
+
90
+ ... would be formatted as:
91
+
92
  ```python
93
  "1870 [DATE] Every scrap of intelligence relative to the war between France and Prussia is now read with interest."
94
  ```
95
 
96
+ These text chunks are then forwarded to the data collator and eventually the language model.
97
 
98
+ Exposed to the tokens and (temporal) metadata, the model learns a relation between text and time. When a text token is hidden, the prepended `year` field influences the prediction of the masked words. Vice versa, when the prepended metadata is hidden, the model predicts the year of publication based on the content.
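
The preprocessing step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' training code: the helper names `chunk_words` and `add_year_prefix` are hypothetical, and splitting on whitespace is an assumption about how "a hundred words" is counted.

```python
from typing import Iterator, List

DATE_TOKEN = "[DATE]"  # special token named in the README


def chunk_words(text: str, size: int = 100) -> Iterator[str]:
    """Yield consecutive segments of at most `size` whitespace-separated words."""
    words = text.split()
    for start in range(0, len(words), size):
        yield " ".join(words[start:start + size])


def add_year_prefix(text: str, year: int, size: int = 100) -> List[str]:
    """Prepend '<year> [DATE]' to every segment of `text` (hypothetical helper)."""
    return [f"{year} {DATE_TOKEN} {segment}" for segment in chunk_words(text, size)]


snippet = (
    "Every scrap of intelligence relative to the war between France "
    "and Prussia is now read with interest."
)
formatted = add_year_prefix(snippet, 1870)
print(formatted[0])
```

Because the year is ordinary text at the start of each segment, a standard masked-language-modelling collator can mask it like any other token.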
99
 
100
  ## Intended Use: LMs as History Machines.
101
 
 
105
 
106
  Let's show how ERWT works with a very concrete example.
107
 
108
+ The ERWT models are trained on a handful of British newspapers published between 1800 and 1870. They can be used to monitor historical change in this specific context.
109
 
110
+ Imagine you are confronted with the following snippet: "We received a letter from [MASK] Majesty" and want to predict the correct pronoun for the masked token (again assuming a British context).
111
 
112
+ 👩‍🏫 **History Intermezzo** Please remember, for most of the nineteenth century, Queen Victoria ruled Britain, from 1837 to 1901 to be precise. Her nineteenth-century predecessors (George III, George IV and William IV) were all male.
113
 
114
+ While a standard language model will provide you with one general prediction, based on what it has observed during training, ERWT allows you to manipulate the prediction by anchoring the text in a specific year.
115
+
116
+ Doing this requires just a few lines of code:
117
 
118
  ```python
119
  from transformers import pipeline
 
124
  mask_filler(f"1820 [DATE] We received a letter from [MASK] Majesty.")
125
  ```
126
 
127
+ This returns "his" as the most likely filler:
128
 
129
  ```python
130
  {'score': 0.8527863025665283,
 
139
  mask_filler(f"1850 [DATE] We received a letter from [MASK] Majesty.")
140
  ```
141
 
142
+ ERWT puts most of the probability mass on the token "her" and only a little bit on "his".
143
 
144
  ```python
145
  {'score': 0.8168327212333679,
 
150
 
151
  You can repeat this experiment for yourself using the example sentences in the **Hosted inference API** at the top right.
152
 
153
+ Okay, but why is this **interesting**?
154
+
155
+ Firstly, eyeballing some toy examples (but also using more rigorous metrics such as [perplexity](#evaluation-%F0%9F%A4%93-in-case-you-care-to-count-%F0%9F%A4%93)) shows that MLMs yield more accurate predictions when they have access to temporal metadata.
156
 
157
+ In other words, **ERWT models are better at capturing historical context.**
158
 
159
+ Secondly, MDMA may **reduce biases** that arise from imbalanced training data (or at least give us more of a handle on this problem). Admittedly, we have to prove this more formally, but some experiments at least hint in this direction.
160
 
161
  ### Date Prediction: Pub Quiz with LMs 🍻
162
 
163
+ Another feature of ERWT is **date prediction**. Remember that during training the temporal metadata token is regularly masked. In this case, the model effectively learns to situate documents in time based on the tokens in a text.
164
 
165
+ By masking the year token at the beginning of the text string, ERWT guesses the document's year of publication.
166
 
167
+ 👩‍🏫 **History Intermezzo** To unite the German states (there used to be [plenty](https://www.britannica.com/topic/German-Confederation)!), Prussia fought a number of wars with its neighbours in the second half of the nineteenth century. It invaded Denmark in 1864 (the second of the Schleswig Wars) and France in 1870 (the Franco-Prussian war).
168
 
169
  Reusing the code above, we can time-stamp documents by masking the year. For example, the line of Python code below:
170
 
 
184
  ```
185
 
186
 
187
+ The prediction "1864" makes sense; this was indeed the year Prussian troops (with some help from their Austrian friends) crossed the border into Schleswig, then part of the Kingdom of Denmark.
188
 
189
+ A few years later, in 1870, Prussia aimed its artillery and bayonets southwards and invaded France.
190
 
191
  ```python
192
  mask_filler("[MASK] [DATE] The Franco-Prussian war is a matter of great concern.")
193
  ```
194
 
195
+ ERWT clearly learned a lot about the history of German unification by ploughing through a plethora of nineteenth-century newspaper articles: it correctly returns "1870" as the predicted year for the Franco-Prussian war!
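
For time-stamping many documents, it may help to wrap the masked-year prompt in a small function. The sketch below is an assumption-laden illustration: `predict_year` is a hypothetical helper, and it relies on the fill-mask pipeline returning a score-sorted list of dictionaries with a `token_str` key. The `fake_fill_mask` stub only stands in for the real `mask_filler` so the example runs without downloading a model; its scores are placeholders, not model output.

```python
import re
from typing import Callable, Dict, List, Optional

YEAR_PATTERN = re.compile(r"^\d{4}$")  # keep only four-digit fillers


def predict_year(fill_mask: Callable[[str], List[Dict]], text: str) -> Optional[str]:
    """Return the highest-ranked four-digit filler for the masked year, if any."""
    prompt = f"[MASK] [DATE] {text}"
    for candidate in fill_mask(prompt):  # assumed to be sorted by score
        token = candidate["token_str"].strip()
        if YEAR_PATTERN.match(token):
            return token
    return None


def fake_fill_mask(prompt: str) -> List[Dict]:
    """Stand-in for `mask_filler`; the values are dummies, not real predictions."""
    return [
        {"score": 0.6, "token_str": "the"},
        {"score": 0.4, "token_str": "1870"},
    ]


year = predict_year(fake_fill_mask, "The Franco-Prussian war is a matter of great concern.")
print(year)
```

With the real model, `mask_filler` from the snippet above would be passed in place of `fake_fill_mask`.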
196
 
197
  Again, we have to ask: Who cares? Wikipedia can tell us pretty much the same. More importantly, don't we already have timestamps for newspaper data?
198
 
199
+ In both cases, our answer is "yes, but...". ERWT's time-stamping powers have little instrumental use and won't make us rich (but donations are welcome of course 🤑). Nonetheless, we believe date prediction has value for research purposes. We can use ERWT for "fictitious" prediction, i.e. as a diagnostic tool.
200
 
201
+ Firstly, we used date prediction for evaluation purposes, to measure which training routine produces models that best capture the year of publication from a set of tokens.
202
+
203
+ Secondly, we could use date prediction as an analytical or research tool, and study, for example, temporal variation **within** text documents; or scrutinise which features drive the time prediction (it goes without saying that the same applies to other metadata fields, like political orientation).
204
 
205
  ## Limitations: Not all is well 😮.
206
 
 
208
 
209
  ### Training Data
210
 
211
+ Many of the limitations are a direct result of the training data. ERWT models are trained on a rather small subsample of nineteenth-century **British newspapers**, and their predictions have to be understood in this context (remember "Her Majesty"?). The corpus has a strong **Metropolitan and liberal bias** (see the section on Data Description for more information).
212
+
213
+ The training data spans from **1800 to 1870**. If your research interest is outside of this period, it's unlikely that ERWT will be of much use. Don't ask the poor model to predict when the Second World War happened. ERWT can be smart (at times) but it doesn't have the power of fortune-telling. At least not yet...
214
 
215
+ Furthermore, historical models tend to reflect past (and present?) stereotypes and prejudices. We strongly advise against using these models outside of a research context. The predictions are likely to exhibit harmful biases; they should be investigated critically and understood within the context of nineteenth-century British cultural history.
216
 
217
+ One way of evaluating a model's bias is to gauge the impact of changing a prompt on the predicted [MASK] token. Often a comparison is made between the predictions given for 'The **man** worked as a [MASK]' and 'The **woman** worked as a [MASK]'.
218
 
219
+ An example of the output for this model:
220
 
221
  ```
222
  1810 [DATE] The man worked as a [MASK].
 
270
  ]
271
  ```
272
 
273
+ Mostly, prompt evaluation is done to assess the bias in *contemporary* language models. In the case of historic language models, the bias exhibited by a model *may* be a valuable research tool in assessing (at scale) language use over time, and the stereotypes and prejudices encoded in text corpora.
274
+
275
+ For this particular prompt, the 'bias' exhibited by the language model (and the underlying data) may be a relatively accurate reflection of employment patterns during the 19th century. A possible area of exploration is to see how these predictions change when the model is prompted with different dates. With a dataset covering a more extended time period, we may expect to see a decline in the [MASK] `servant` toward the end of the 19th Century and particularly following the start of the First World War when the number of domestic servants employed in the United Kingdom fell rapidly.
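
To explore how these predictions shift over time, one option is to generate the same prompt for several years and subjects and feed each to the model. The sketch below only builds the prompts; `build_prompts` and its default template are hypothetical, modelled on the prompt format used in this section.

```python
from itertools import product
from typing import Iterable, List


def build_prompts(
    years: Iterable[int],
    subjects: Iterable[str],
    template: str = "{year} [DATE] The {subject} worked as a [MASK].",
) -> List[str]:
    """Create one prompt per (year, subject) pair (hypothetical helper)."""
    return [
        template.format(year=year, subject=subject)
        for year, subject in product(years, subjects)
    ]


prompts = build_prompts([1810, 1860], ["man", "woman"])
for prompt in prompts:
    print(prompt)
```

Each prompt can then be passed to `mask_filler`, and the top fillers compared across years and subjects.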
276
 
277
  ### Training Routine
278
 
279
+ We created various ERWT models as part of a wider experiment that aimed to establish best practices and guidelines for training models with metadata. An overview of all the models is available on our [GitHub](https://github.com/Living-with-machines/ERWT/) page.
280
 
281
  To reduce training time, we based our experiments on a random subsample of the HMD corpus, consisting of half a billion tokens.
282
+ Furthermore, we only trained the models for one epoch, which implies they are most likely **undertrained** at the moment.
283
 
284
+ We were mainly interested in the **relative** performance of the different ERWT models. We did, however, compare ERWT with [`distilbert-base-uncased`](https://huggingface.co/distilbert-base-uncased) in our evaluation experiments. And, of course, our tiny LM peas
285
  did much better. 🎉🥳
286
 
287
+ Want to know the details, oh critical reader? Then consult and cite [our working paper](https://arxiv.org/abs/2211.10086)!
288
 
289
  ## Data Description
290
 
 
317
 
318
  ## Evaluation: 🤓 In case you care to count 🤓
319
 
320
+ Our article ["Metadata Might Make Language Models Better"](https://arxiv.org/abs/2211.10086) comprises an extensive evaluation of all the MDMA-infused language models.
 
 
321
 
322
+ The table below shows the [pseudo-perplexity](https://arxiv.org/abs/1910.14659) scores for different models based on text documents of 64 and 128 tokens.
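
As a rough sketch of how such a score is computed (following the definition in the linked pseudo-perplexity paper, not the authors' evaluation code): each token is masked in turn, the log-probability the model assigns to the original token is recorded, and the negative mean over all tokens is exponentiated. The function name and the nested-list input layout are assumptions for illustration.

```python
import math
from typing import Sequence


def pseudo_perplexity(token_log_probs: Sequence[Sequence[float]]) -> float:
    """Compute exp(-(1/N) * sum of per-token log-probabilities).

    `token_log_probs` holds one inner sequence per document; each value is the
    natural-log probability of a token when only that token is masked.
    """
    total = sum(sum(doc) for doc in token_log_probs)
    n_tokens = sum(len(doc) for doc in token_log_probs)
    if n_tokens == 0:
        raise ValueError("need at least one token")
    return math.exp(-total / n_tokens)


# Toy input: four tokens, each predicted with probability 0.5.
score = pseudo_perplexity([[math.log(0.5)] * 4])
print(score)  # close to 2.0
```

Lower values mean the model assigns higher probability to the held-out tokens.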
323
 
324
+ In general, [ERWT-year-masked-25](https://huggingface.co/Livingwithmachines/erwt-year-masked-25) turned out to yield the most competitive scores across different tasks and we generally recommend you use this model.
325
 
326
 
327
  | text length | 64 | | 128 | |
 
341
 
342
  Questions? Feedback? Please leave a message!
343
 
344
+