\*ERWT is Dutch for PEA.

# Overview

- [Introductory Note: Repent Now 😇](#introductory-note-repent-now-%F0%9F%98%87)
- [Background: MDMA to the rescue 🙂](#background-mdma-to-the-rescue-%F0%9F%99%82)
- [Intended Use: LMs as History Machines](#intended-use-lms-as-history-machines)
- [Historical Language Change: Her/His Majesty? 👑](#historical-language-change-herhis-majesty-%F0%9F%91%91)
- [Date Prediction: Pub Quiz with LMs 🍻](#date-prediction)
- [Limitations: Not all is well 😮](#limitations)
- [Training Data](#training-data)
- [Training Routine](#training-routine)
- [Data Description](#data-description)
- [Evaluation: 🤓 In case you care to count 🤓](#evaluation)

## Introductory Note: Repent Now. 😇

The ERWT models are trained for **experimental purposes**; please use them with care.

Exposed to both the tokens and (temporal) metadata, the model learns a relation between text and time. When a token is masked, the prepended `year` field is taken into account when predicting hidden words in the text. Vice versa, when the metadata token is hidden, the model aims to predict the year of publication based on the content.
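
As a minimal sketch of what this looks like in practice (the checkpoint id and the `<year> [DATE] <text>` input layout are assumptions here; consult the model card and repository for the exact conventions):

```python
# Minimal sketch: metadata-aware mask filling with an ERWT checkpoint.
# ASSUMPTIONS: the model id "Livingwithmachines/erwt-year" and the
# "<year> [DATE] <text>" input layout are illustrative, not confirmed.
from transformers import pipeline

mask_filler = pipeline("fill-mask", model="Livingwithmachines/erwt-year")

# The prepended year is ordinary context for the model, so it conditions
# the distribution over the masked token.
for pred in mask_filler("1850 [DATE] The [MASK] left the station at noon."):
    print(pred["token_str"], round(pred["score"], 3))
```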
## Intended Use: LMs as History Machines.

Exposing the model to temporal metadata allows us to investigate **historical language change** and perform **date prediction**.

Secondly, MDMA may reduce biases induced by imbalances in the training data (or at least give us more of a handle on this problem). Admittedly, we have to prove this more formally, but some experiments at least hint in this direction. The data used for training is highly biased towards the Victorian age, and a standard language model trained on this corpus will predict "her" for `"[MASK] Majesty"`.
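
If MDMA delivers here, varying only the year prefix should shift that prediction; a quick probe, under the same assumptions as the sketch above:

```python
# Sketch: does the year prefix move the "[MASK] Majesty" prediction?
# ASSUMPTIONS as above; under a reigning king (e.g. 1810) we would hope
# for "his", under Queen Victoria (1850, 1890) for "her".
from transformers import pipeline

mask_filler = pipeline("fill-mask", model="Livingwithmachines/erwt-year")

for year in (1810, 1850, 1890):
    top = mask_filler(f"{year} [DATE] [MASK] Majesty attended the ceremony.")[0]
    print(year, top["token_str"], round(top["score"], 3))
```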
### Date Prediction: Pub Quiz with LMs

Another feature of the ERWT model series is date prediction. Remember that during training the temporal metadata token is often masked. In this case, the model effectively learns to situate documents in time based on the tokens they contain.
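
Concretely, the same fill-mask interface can be pointed at the metadata slot itself (a sketch; it additionally assumes that candidate years survive as single tokens in the vocabulary):

```python
# Sketch: date prediction by masking the temporal metadata slot.
# ASSUMPTIONS: the "[MASK] [DATE] <text>" layout, and that years such
# as "1851" are single vocabulary tokens.
from transformers import pipeline

mask_filler = pipeline("fill-mask", model="Livingwithmachines/erwt-year")

preds = mask_filler("[MASK] [DATE] Her Majesty opened the Great Exhibition.")
for p in preds:
    print(p["token_str"], round(p["score"], 3))
```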
Firstly, we used date prediction for evaluation purposes, to measure which training routine produces the most accurate models.
Secondly, we could use it as an analytical tool, to study temporal variation **within** text documents and to scrutinise further which features drive the date prediction (it goes without saying that the same applies to other metadata fields, for example predicting political orientation).
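
For the first of these uses, a toy version of the evaluation could look as follows (the labelled samples and the decoding heuristic are invented for illustration; the metric and setup in the working paper may well differ):

```python
# Sketch: score date prediction with mean absolute error over labelled
# (text, year) pairs. ASSUMPTIONS: model id, input layout, and the
# "first prediction that parses as a year" heuristic are all illustrative.
from transformers import pipeline

mask_filler = pipeline("fill-mask", model="Livingwithmachines/erwt-year")

samples = [
    ("The steam locomotive arrived from Manchester.", 1845),
    ("Her Majesty opened the Great Exhibition.", 1851),
]

errors = []
for text, true_year in samples:
    preds = mask_filler(f"[MASK] [DATE] {text}", top_k=100)
    # Keep the highest-ranked prediction that parses as a year.
    year = next(
        (int(p["token_str"]) for p in preds if p["token_str"].strip().isdigit()),
        None,
    )
    if year is not None:
        errors.append(abs(year - true_year))

if errors:
    print("MAE:", sum(errors) / len(errors))
```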
## Limitations: Not all is well 😮.

The ERWT models were trained for evaluation purposes and therefore carry some critical limitations.

Temporally, most of the articles date from the second half of the nineteenth century.

![number of articles by year](https://github.com/Living-with-machines/ERWT/raw/main/articles_by_year.png)
## Evaluation: 🤓 In case you care to count 🤓

Our article ["Metadata Might Make Language Models Better"](https://drive.google.com/file/d/1Xp21KENzIeEqFpKvO85FkHynC0PNwBn7/view?usp=sharing) provides quite an extensive evaluation of all the language models created with MDMA. For details, we recommend you read and cite the current working paper.