Kaspar committed on
Commit
4cb572d
•
1 Parent(s): 823a5d1

Update README.md

Files changed (1)
  1. README.md +20 -14
README.md CHANGED
@@ -20,45 +20,51 @@ widget:
  <img src="https://upload.wikimedia.org/wikipedia/commons/5/5b/NCI_peas_in_pod.jpg" alt="erwt" width="200" >

  # ERWT-year
- \~🌺\~A language model that is (🤭 maybe 🤫) better at history than you...\~🌺\~

- ERWT is a fine-tuned [`distilbert-base-cased`](https://huggingface.co/distilbert-base-cased) model trained on historical newspapers from the [Heritage Made Digital collection](https://huggingface.co/datasets/davanstrien/hmd-erwt-training) with temporal metadata.

  This model is served to you by [Kaspar von Beelen](https://huggingface.co/Kaspar) and [Daniel van Strien](https://huggingface.co/davanstrien).

  *Improving AI, one pea at a time.*

-
  ## Introductory Note: Repent Now. 😇

- This model was trained for **experimental purposes**, please use it with care.

- You find more detailed information below, especially the "limitations" section. Seriously, read this section before using the models, we don't repent in public just for fun.
-
- If you can't get enough, you can still consult our working paper ["Metadata Might Make Language Models Better"](https://drive.google.com/file/d/1Xp21KENzIeEqFpKvO85FkHynC0PNwBn7/view?usp=sharing) for more background and nerdy evaluation stuff (still, work in progress, handle with care and kindness).

  ## Background: MDMA to the rescue. 🙂

- ERWT was created using a **M**eta**D**ata **M**asking **A**pproach (or **MDMA** 💊), in which we train a Masked Language Model simultaneously on text and metadata. Our intuition was that incorporating information that is not explicitly present in the text (such as the time of publication or the political leaning of the author) may make language models "better" in the sense of being more sensitive to historical and political aspects of language use.

- To create this ERWT model we fine-tuned [`distilbert-base-cased`](https://huggingface.co/distilbert-base-cased) on a random subsample of the Heritage Made Digital newspapers of about half a billion words. We slightly adapted the training routine by adding the year of publication and a special token `[DATE]` in front of each text segment (i.e. a chunk of a hundred tokens).

- For example, we would format a snippet of text taken from the [Londonderry Sentinel](https://www.britishnewspaperarchive.co.uk/viewer/bl/0001480/18700722/014/0002) as...
  ```python
  "1870 [DATE] Every scrap of intelligence relative to the war between France and Prussia is now read with interest."
  ```

- ... and then provide this sentence, with its prepended temporal metadata, to the masked language model. While training, the model learns a relation between the text and the time it was produced. When a token is masked, the prepended `year` field is taken into account when predicting candidates. Vice versa, if the metadata token at the front of the formatted snippet is hidden, the model aims to predict the year of publication based on the content of a document.

-
- ## Intended uses

  Exposing the model to temporal metadata allows us to investigate **historical language change** and perform **date prediction**.

- ### Language Change 👑

  In the nineteenth century, too, Britain had a Queen on the throne for a very long time: Queen Victoria ruled from 1837 to 1901.
 
 
  <img src="https://upload.wikimedia.org/wikipedia/commons/5/5b/NCI_peas_in_pod.jpg" alt="erwt" width="200" >

  # ERWT-year
+ \~🌺\~ERWT**\*** is a language model that is (🤭 maybe 🤫) better at history than you...\~🌺\~

+ \*ERWT is Dutch for PEA.

+ ERWT is a fine-tuned [`distilbert-base-cased`](https://huggingface.co/distilbert-base-cased) model trained on historical newspapers from the [Heritage Made Digital collection](https://huggingface.co/datasets/davanstrien/hmd-erwt-training) with **temporal metadata**.
+
+ ERWT performs time-sensitive masked language modelling. It can also guess the year a text was written.
 
  This model is served to you by [Kaspar von Beelen](https://huggingface.co/Kaspar) and [Daniel van Strien](https://huggingface.co/davanstrien).

  *Improving AI, one pea at a time.*

  ## Introductory Note: Repent Now. 😇

+ The ERWT models are trained for **experimental purposes**; please use them with care.

+ You will find more detailed information below. Please consult the **limitations** section (seriously, read it before using the models, **we don't repent in public just for fun**).

+ If you can't get enough of these peas and crave some more, you can consult our working paper ["Metadata Might Make Language Models Better"](https://drive.google.com/file/d/1Xp21KENzIeEqFpKvO85FkHynC0PNwBn7/view?usp=sharing) for more background information and nerdy evaluation stuff (work in progress, handle with care and kindness).
 
 
  ## Background: MDMA to the rescue. 🙂

+ ERWT was created using a **M**eta**D**ata **M**asking **A**pproach (or **MDMA** 💊), in which we train a Masked Language Model simultaneously on text and metadata. Our intuition was that incorporating metadata (information that describes a text but is not explicitly part of it, such as the time of publication or the political leaning of the author) may make language models "better", or at least make them more sensitive to historical and political aspects of language use.
+
+ ERWT is a [`distilbert-base-cased`](https://huggingface.co/distilbert-base-cased) model, fine-tuned on a random subsample taken from the Heritage Made Digital newspaper collection. The training data comprises around half a billion words.

+ To unleash the power of MDMA, we adapted the training routine for the masked language model. When preprocessing the text, we prepended each segment of a hundred words with a time-stamp (the year of publication) and a special `[DATE]` token.

+ The snippet below, taken from the [Londonderry Sentinel](https://www.britishnewspaperarchive.co.uk/viewer/bl/0001480/18700722/014/0002), shows what such a formatted segment looks like:
  ```python
  "1870 [DATE] Every scrap of intelligence relative to the war between France and Prussia is now read with interest."
  ```

+ These formatted chunks of text are then forwarded to the data collator and eventually to the language model.
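
For readers who want to see the mechanics, here is a minimal sketch of this preprocessing step. It is not the exact training code: the helper name, the plain-dict input format, and the naive 100-word windowing are assumptions for illustration only.

```python
# Minimal illustrative sketch of the metadata-prepending step described above.
# Assumptions: articles arrive as plain dicts with "text" and "year" fields, and
# segments are simple 100-word windows; the actual ERWT training pipeline may differ.

def make_training_segments(text: str, year: int, words_per_segment: int = 100):
    """Split an article into ~100-word chunks, each prefixed with '<year> [DATE]'."""
    words = text.split()
    for start in range(0, len(words), words_per_segment):
        chunk = " ".join(words[start:start + words_per_segment])
        yield f"{year} [DATE] {chunk}"

article = {
    "year": 1870,
    "text": "Every scrap of intelligence relative to the war between France and Prussia is now read with interest.",
}

for segment in make_training_segments(article["text"], article["year"]):
    print(segment)
# -> 1870 [DATE] Every scrap of intelligence relative to the war between France and Prussia is now read with interest.
```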
 
+ Exposed to both the tokens and the (temporal) metadata, the model learns a relation between text and time. When a token is masked, the prepended `year` field is taken into account when predicting the hidden word in the text. Vice versa, when the metadata token at the front of the formatted snippet is hidden, the model aims to predict the year of publication based on the content of the document.
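
To make the two masking directions concrete, the strings below show the same segment masked both ways (purely illustrative; `[MASK]` is the mask token of the underlying `distilbert-base-cased` tokenizer).

```python
# The same formatted segment, masked in the two directions described above.

# 1) Mask a word in the text: the prepended year ("1870") conditions the prediction.
masked_token = (
    "1870 [DATE] Every scrap of [MASK] relative to the war "
    "between France and Prussia is now read with interest."
)

# 2) Mask the year itself: the model has to infer the year of publication from the content.
masked_year = (
    "[MASK] [DATE] Every scrap of intelligence relative to the war "
    "between France and Prussia is now read with interest."
)
```
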
+ ## Intended Uses: LMs as History Machines.

  Exposing the model to temporal metadata allows us to investigate **historical language change** and perform **date prediction**.

+ ### Historical Language Change: Her/His Majesty? 👑
+
+ Let's look at a very concrete example.

  In the nineteenth century, too, Britain had a Queen on the throne for a very long time: Queen Victoria ruled from 1837 to 1901.
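
As a hedged sketch of how such a query could be run with the Hugging Face `transformers` fill-mask pipeline: the model identifier and the prompt sentence below are placeholders invented for illustration, not taken from this repository.

```python
from transformers import pipeline

# Placeholder model ID: replace with the actual Hub ID of this ERWT-year checkpoint.
mask_filler = pipeline("fill-mask", model="your-org/erwt-year")

# Ask the model to fill in the monarch's title for two different years.
for year in (1810, 1870):
    prompt = f"{year} [DATE] A speech delivered by [MASK] Majesty at the opening of Parliament."
    predictions = mask_filler(prompt, top_k=3)
    print(year, [p["token_str"] for p in predictions])

# If the temporal conditioning works as intended, "His" should be favoured for 1810
# (George III) and "Her" for 1870 (Queen Victoria).
```

Because the year is just another token in the input, no special interface is needed: any standard fill-mask call should work, with the prepended year steering the predictions.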