Commit 3a64d03 (verified) by emanuelaboros
1 Parent(s): c2a6965

Update README.md

Files changed (1)
  1. README.md +14 -8
README.md CHANGED
@@ -10,18 +10,15 @@ Since their beginnings in the 1830s and 1840s, news agencies have played an impo
 
  This project aimed at bridging this gap by detecting news agencies in a large corpus of Swiss and Luxembourgish newspaper articles (the [impresso](https://impresso-project.ch/) corpus) for the years 1840-2000 using deep learning methods. For this, we first build and annotate a multilingual dataset with news agency mentions, which we then use to train and evaluate several BERT-based agency detection and classification models. Based on these experiments, we choose two models (for French and German) for the inference on the [impresso](https://impresso-project.ch/) corpus.
 
- dbmdz/bert-base-french-europeana-cased
 
  ## Research Summary
 
  Results show that ca. 10% of the articles explicitly reference news agencies, with the greatest share of agency content after 1940, although systematic citation of agencies already started slowly in the 1910s.
  Differences in the usage of agency content across time, countries and languages as well as between newspapers reveal a complex network of news flows, whose exploration provides many opportunities for future work.
 
-
- ## Intended uses
-
- dbmdz/bert-base-french-europeana-cased
-
  ## Dataset Characteristics
  The dataset contains 1,133 French and 397 German annotated documents, with 1,058,449 tokens, of which 1,976 have annotations. Below is an overview of the corpus statistics:
  The annotated dataset is released on [Zenodo](https://doi.org/10.5281/zenodo.8333933).
@@ -63,7 +60,16 @@ ner_results = nlp(example)
  print(ner_results)
  ```
 
- ## Training data
 
 
- ### BibTeX entry and citation info
 
  This project aimed at bridging this gap by detecting news agencies in a large corpus of Swiss and Luxembourgish newspaper articles (the [impresso](https://impresso-project.ch/) corpus) for the years 1840-2000 using deep learning methods. For this, we first build and annotate a multilingual dataset with news agency mentions, which we then use to train and evaluate several BERT-based agency detection and classification models. Based on these experiments, we choose two models (for French and German) for the inference on the [impresso](https://impresso-project.ch/) corpus.
 
+ ## Model Details
+
+ The base model is [dbmdz/bert-base-french-europeana-cased](https://huggingface.co/dbmdz/bert-base-french-europeana-cased), fine-tuned for 3 epochs with a maximum sequence length of 256.
 
  ## Research Summary
 
  Results show that ca. 10% of the articles explicitly reference news agencies, with the greatest share of agency content after 1940, although systematic citation of agencies already started slowly in the 1910s.
  Differences in the usage of agency content across time, countries and languages as well as between newspapers reveal a complex network of news flows, whose exploration provides many opportunities for future work.
 
  ## Dataset Characteristics
  The dataset contains 1,133 French and 397 German annotated documents, with 1,058,449 tokens, of which 1,976 have annotations. Below is an overview of the corpus statistics:
  The annotated dataset is released on [Zenodo](https://doi.org/10.5281/zenodo.8333933).
 
  print(ner_results)
  ```
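
The Model Details note in this commit states a maximum length of 256, so a long newspaper article may not fit into a single pass of `nlp`. The following is a minimal sketch of one way to split an article into overlapping windows first. It is not taken from the released code: `chunk_words`, `max_words` and `overlap` are hypothetical names, and whitespace-separated words are only a rough stand-in for the model's subword tokens (subword counts are usually higher, which is why the default window is set well below 256).

```python
from typing import List


def chunk_words(text: str, max_words: int = 128, overlap: int = 16) -> List[str]:
    """Split text into overlapping windows of at most max_words words.

    Words are whitespace-separated; this is an approximation of the
    tokenizer's subword count, not an exact match (assumption).
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must satisfy 0 <= overlap < max_words")

    words = text.split()
    step = max_words - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once the current window already reaches the end of the text.
        if start + max_words >= len(words):
            break
    return chunks


if __name__ == "__main__":
    # Toy article; each printed number is the word count of one window.
    article = "Paris, 3 mai (Havas). " * 100
    for chunk in chunk_words(article):
        print(len(chunk.split()))
```

Each chunk could then be passed to `nlp` in place of `example`. The overlap is there so that an agency mention falling on a window boundary still appears whole in at least one window; duplicates found in the overlapping region would need to be merged afterwards.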
 
+ ### BibTeX entry and citation info
 
+ The code is available [here](https://github.com/impresso/newsagency-classification/tree/main).
 
+ ```
+ @misc{newsagency_classification,
+ title = "Where Did the News come from? Detection of News Agency Releases in Historical Newspapers",
+ author = "Marxen, Lea and Ehrmann, Maud and Boros, Emanuela",
+ year = "2023",
+ url = "https://github.com/impresso/newsagency-classification/tree/main",
+ note = "Master Thesis"
+ }
+ ```