---
language: en
tags:
- newspapers
- library
- historic
- glam
license: mit
metrics:
- f1
widget:
- text: "1820 [DATE] We received a letter from [MASK] Majesty."
- text: "1850 [DATE] We received a letter from [MASK] Majesty."
- text: "[MASK] [DATE] The Franco-Prussian war is a matter of great concern."
- text: "[MASK] [DATE] The Second Schleswig war is a matter of great concern."
---

# ERWT-year

A fine-tuned [`distilbert-base-cased`](https://huggingface.co/distilbert-base-cased) model trained on historical newspapers from the [Heritage Made Digital collection](https://huggingface.co/datasets/davanstrien/hmd-erwt-training) with temporal metadata.

**Warning**: This model was trained for **experimental purposes**; please use it with care.

You can find more detailed information below and in our working paper ["Metadata Might Make Language Models Better"](https://drive.google.com/file/d/1Xp21KENzIeEqFpKvO85FkHynC0PNwBn7/view?usp=sharing).

## Background

ERWT was created using a MetaData Masking Approach (or MDMA 💊), in which we train a Masked Language Model simultaneously on text and metadata. Our intuition was that incorporating information that is not explicitly present in the text, such as the time of publication or the political leaning of the author, may make language models "better" in the sense of being more sensitive to historical and political aspects of language use.

To create this ERWT model, we fine-tuned [`distilbert-base-cased`](https://huggingface.co/distilbert-base-cased) on a random subsample of the Heritage Made Digital newspapers comprising about half a billion words. We slightly adapted the training routine by prepending the year of publication and a special `[DATE]` token to each text segment (i.e. a chunk of a hundred tokens). For example, we would format a snippet of text taken from the [Londonderry Sentinel](https://www.britishnewspaperarchive.co.uk/viewer/bl/0001480/18700722/014/0002) as...

```python
"1870 [DATE] Every scrap of intelligence relative to the war between France and Prussia is now read with interest."
```

... and then feed this sentence, with its prepended temporal metadata, to the masked language model. During training, the model learns a relation between the text and the time it was produced: when a token in the text is masked, the prepended year is taken into account when predicting candidates. Vice versa, if the metadata token at the front of the formatted snippet is hidden, the model aims to predict the year of publication based on the content of the document.

## Intended uses

Exposing the model to temporal metadata allows us to investigate **historical language change** and perform **date prediction**.

### Language Change 👑

In the nineteenth century, Britain had a queen on the throne for a very long time: Queen Victoria reigned from 1837 to 1901. The first two widget examples above probe this. With the year set to 1820, when a king was on the throne, the model is expected to favour "His" for the masked token, whereas with the year set to 1850 it is expected to favour "Her".

### Date Prediction

The remaining widget examples mask the year token instead: given a prompt such as "[MASK] [DATE] The Franco-Prussian war is a matter of great concern.", the model attempts to predict when the text was written. A usage sketch covering both tasks is given further below.

## Limitations

ERWT models were trained for evaluation purposes and carry critical limitations. First of all, as explained in more detail below, this model is trained on a rather small subsample of British newspapers with a strong Metropolitan and liberal bias. Secondly, we only trained for one epoch: for our evaluation purposes we were mainly interested in the relative performance of our models.

## Data Description
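As a rough illustration of the two use cases above, the following sketch queries the model with the widget prompts through the Hugging Face `transformers` fill-mask pipeline. This is a minimal sketch, not the authors' evaluation code, and the model identifier in the snippet is a placeholder rather than this model's actual Hub id.

```python
# Minimal sketch: querying ERWT through the Hugging Face fill-mask pipeline.
# "path/to/erwt-year" is a placeholder model id; substitute this model's
# actual identifier on the Hub.
from transformers import pipeline

mask_filler = pipeline("fill-mask", model="path/to/erwt-year")

# Historical language change: the prepended year should steer the masked pronoun.
for prompt in [
    "1820 [DATE] We received a letter from [MASK] Majesty.",
    "1850 [DATE] We received a letter from [MASK] Majesty.",
]:
    candidates = mask_filler(prompt, top_k=3)
    print(prompt, "->", [c["token_str"] for c in candidates])

# Date prediction: masking the year token asks the model for a publication year.
prompt = "[MASK] [DATE] The Franco-Prussian war is a matter of great concern."
print(prompt, "->", [c["token_str"] for c in mask_filler(prompt, top_k=3)])
```

The pipeline returns the highest-scoring fillers for the `[MASK]` position, so the same call handles both the pronoun and the year prompts.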