\*ERWT is Dutch for PEA.

# Overview

- [Introductory Note: Repent Now 😇](#introductory-note-repent-now-%F0%9F%98%87)
- [Background: MDMA to the rescue 🙂](#background-mdma-to-the-rescue-%F0%9F%99%82)
- [Intended Use: LMs as History Machines](#intended-use-lms-as-history-machines)
- [Historical Language Change: Her/His Majesty? 👑](#historical-language-change-herhis-majesty-%F0%9F%91%91)
- [Date Prediction: Pub Quiz with LMs 🍻](#date-prediction)
- [Limitations: Not all is well 😮](#limitations)
- [Training Data](#training-data)
- [Training Routine](#training-routine)
- [Data Description](#data-description)
- [Evaluation: 🤓 In case you care to count 🤓](#evaluation)

## Introductory Note: Repent Now. 😇

The ERWT models are trained for **experimental purposes**; please use them with care.

Exposed to both the tokens and (temporal) metadata, the model learns a relation between text and time. When a token is masked, the prepended `year` field is taken into account when predicting hidden words in the text. Vice versa, when the metadata token is hidden, the model aims to predict the year of publication based on the content.
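
As a minimal sketch of what this looks like in practice (the checkpoint id and the `<year> [DATE] <text>` input layout are assumptions here; consult the model card and repository for the exact conventions):

```python
# Minimal sketch: metadata-aware mask filling with an ERWT checkpoint.
# ASSUMPTIONS: the model id "Livingwithmachines/erwt-year" and the
# "<year> [DATE] <text>" input layout are illustrative, not confirmed.
from transformers import pipeline

mask_filler = pipeline("fill-mask", model="Livingwithmachines/erwt-year")

# The prepended year is ordinary context for the model, so it conditions
# the distribution over the masked token.
for pred in mask_filler("1850 [DATE] The [MASK] left the station at noon."):
    print(pred["token_str"], round(pred["score"], 3))
```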
## Intended Use: LMs as History Machines.

Exposing the model to temporal metadata allows us to investigate **historical language change** and perform **date prediction**.

Secondly, MDMA may reduce biases induced by imbalances in the training data (or at least give us more of a handle on this problem). Admittedly, we have to prove this more formally, but some experiments at least hint in this direction. The data used for training is highly biased towards the Victorian age, and a standard language model trained on this corpus will predict "her" for `"[MASK] Majesty"`.
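
If MDMA delivers here, varying only the year prefix should shift that prediction; a quick probe, under the same assumptions as the sketch above:

```python
# Sketch: does the year prefix move the "[MASK] Majesty" prediction?
# ASSUMPTIONS as above; under a reigning king (e.g. 1810) we would hope
# for "his", under Queen Victoria (1850, 1890) for "her".
from transformers import pipeline

mask_filler = pipeline("fill-mask", model="Livingwithmachines/erwt-year")

for year in (1810, 1850, 1890):
    top = mask_filler(f"{year} [DATE] [MASK] Majesty attended the ceremony.")[0]
    print(year, top["token_str"], round(top["score"], 3))
```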
### Date Prediction: Pub Quiz with LMs

Another feature of the ERWT model series is date prediction. Remember that during training the temporal metadata token is often masked. In this case, the model effectively learns to situate documents in time based on the tokens they contain.
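
Concretely, the same fill-mask interface can be pointed at the metadata slot itself (a sketch; it additionally assumes that candidate years survive as single tokens in the vocabulary):

```python
# Sketch: date prediction by masking the temporal metadata slot.
# ASSUMPTIONS: the "[MASK] [DATE] <text>" layout, and that years such
# as "1851" are single vocabulary tokens.
from transformers import pipeline

mask_filler = pipeline("fill-mask", model="Livingwithmachines/erwt-year")

preds = mask_filler("[MASK] [DATE] Her Majesty opened the Great Exhibition.")
for p in preds:
    print(p["token_str"], round(p["score"], 3))
```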
Firstly, we used date prediction for evaluation purposes, to measure which training routine produces the most accurate models.
Secondly, we could use it as an analytical tool, to study temporal variation **within** text documents and to scrutinise further which features drive the date prediction (it goes without saying that the same applies to other metadata fields, for example predicting political orientation).
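
For the first of these uses, a toy version of the evaluation could look as follows (the labelled samples and the decoding heuristic are invented for illustration; the metric and setup in the working paper may well differ):

```python
# Sketch: score date prediction with mean absolute error over labelled
# (text, year) pairs. ASSUMPTIONS: model id, input layout, and the
# "first prediction that parses as a year" heuristic are all illustrative.
from transformers import pipeline

mask_filler = pipeline("fill-mask", model="Livingwithmachines/erwt-year")

samples = [
    ("The steam locomotive arrived from Manchester.", 1845),
    ("Her Majesty opened the Great Exhibition.", 1851),
]

errors = []
for text, true_year in samples:
    preds = mask_filler(f"[MASK] [DATE] {text}", top_k=100)
    # Keep the highest-ranked prediction that parses as a year.
    year = next(
        (int(p["token_str"]) for p in preds if p["token_str"].strip().isdigit()),
        None,
    )
    if year is not None:
        errors.append(abs(year - true_year))

if errors:
    print("MAE:", sum(errors) / len(errors))
```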
## Limitations: Not all is well 😮.

The ERWT models were trained for evaluation purposes and therefore carry some critical limitations.

Temporally, most of the articles date from the second half of the nineteenth century.

![number of articles by year](https://github.com/Living-with-machines/ERWT/raw/main/articles_by_year.png)
## Evaluation: 🤓 In case you care to count 🤓

Our article ["Metadata Might Make Language Models Better"](https://drive.google.com/file/d/1Xp21KENzIeEqFpKvO85FkHynC0PNwBn7/view?usp=sharing) provides quite an extensive evaluation of all the language models created with MDMA. For details, we recommend you read and cite the current working paper.