impresso-project/ner-newsagency-bert-de

@@ -10,18 +10,15 @@ Since their beginnings in the 1830s and 1840s, news agencies have played an impo
 This project aimed to bridge this gap by detecting news agencies in a large corpus of Swiss and Luxembourgish newspaper articles (the [impresso](https://impresso-project.ch/) corpus) for the years 1840-2000 using deep learning methods. To this end, we first built and annotated a multilingual dataset of news agency mentions, which we then used to train and evaluate several BERT-based agency detection and classification models. Based on these experiments, we chose two models (one for French, one for German) for inference on the [impresso](https://impresso-project.ch/) corpus.
-dbmdz/bert-base-french-europeana-cased
 ## Research Summary
 Results show that ca. 10% of the articles explicitly reference news agencies, with the greatest share of agency content after 1940, although systematic citation of agencies already started slowly in the 1910s.
 Differences in the usage of agency content across time, countries and languages as well as between newspapers reveal a complex network of news flows, whose exploration provides many opportunities for future work.
-## Intended uses
-dbmdz/bert-base-french-europeana-cased
 ## Dataset Characteristics
 The dataset contains 1,133 French and 397 German annotated documents, with 1,058,449 tokens, of which 1,976 have annotations. Below is an overview of the corpus statistics:
 The annotated dataset is released on [Zenodo](https://doi.org/10.5281/zenodo.8333933).
@@ -63,7 +60,16 @@ ner_results = nlp(example)
 print(ner_results)
 ```
-## Training data
-### BibTeX entry and citation info

 This project aimed to bridge this gap by detecting news agencies in a large corpus of Swiss and Luxembourgish newspaper articles (the [impresso](https://impresso-project.ch/) corpus) for the years 1840-2000 using deep learning methods. To this end, we first built and annotated a multilingual dataset of news agency mentions, which we then used to train and evaluate several BERT-based agency detection and classification models. Based on these experiments, we chose two models (one for French, one for German) for inference on the [impresso](https://impresso-project.ch/) corpus.
+## Model Details
+The model is based on [dbmdz/bert-base-french-europeana-cased](https://huggingface.co/dbmdz/bert-base-french-europeana-cased), fine-tuned for 3 epochs on inputs with a maximum length of 256.
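Since fine-tuning used inputs capped at a length of 256, articles longer than that presumably need to be split before inference. The following is a minimal sketch of one way to do this, not the project's actual preprocessing: `chunk_tokens` and `MAX_LEN` are hypothetical names, and the placeholder tokens stand in for the model's subword tokens. BERT tokenizers also add special tokens such as `[CLS]` and `[SEP]`, so the usable budget per window is likely slightly below 256.

```python
from typing import List

# Assumption: 256 is the maximum input length stated in the model details.
MAX_LEN = 256


def chunk_tokens(tokens: List[str], max_len: int = MAX_LEN) -> List[List[str]]:
    """Split a token list into consecutive windows of at most max_len tokens."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    # Step through the list in strides of max_len; the last window may be shorter.
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]


if __name__ == "__main__":
    # Placeholder input: 700 dummy tokens standing in for a long article.
    article_tokens = ["tok"] * 700
    windows = chunk_tokens(article_tokens)
    print([len(w) for w in windows])
```

Each window can then be passed to the model separately. Non-overlapping windows are the simplest choice, but an agency mention that straddles a window boundary could be missed, so an overlapping stride may be preferable in practice.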
 ## Research Summary
 Results show that ca. 10% of the articles explicitly reference news agencies, with the greatest share of agency content after 1940, although systematic citation of agencies already started slowly in the 1910s.
 Differences in the usage of agency content across time, countries and languages as well as between newspapers reveal a complex network of news flows, whose exploration provides many opportunities for future work.
 ## Dataset Characteristics
 The dataset contains 1,133 French and 397 German annotated documents, with 1,058,449 tokens, of which 1,976 have annotations. Below is an overview of the corpus statistics:
 The annotated dataset is released on [Zenodo](https://doi.org/10.5281/zenodo.8333933).
 print(ner_results)
 ```
+### BibTeX entry and citation info
+The code is available [here](https://github.com/impresso/newsagency-classification/tree/main).
+```
+@misc{newsagency_classification,
+  title = "Where Did the News come from? Detection of News Agency Releases in Historical Newspapers",
+  author = "Marxen, Lea and Ehrmann, Maud and Boros, Emanuela",
+  year = "2023",
+  url = "https://github.com/impresso/newsagency-classification/tree/main",
+  note = "Master's thesis"
+}
+```