Update README.md
Since their beginnings in the 1830s and 1840s, news agencies have played an important …
This project aimed to bridge this gap by detecting news agencies in a large corpus of Swiss and Luxembourgish newspaper articles (the [impresso](https://impresso-project.ch/) corpus) covering the years 1840–2000, using deep learning methods. To this end, we first built and annotated a multilingual dataset with news agency mentions, which we then used to train and evaluate several BERT-based agency detection and classification models. Based on these experiments, we chose two models (one for French, one for German) for inference on the impresso corpus.
## Model Details

The base model is [dbmdz/bert-base-french-europeana-cased](https://huggingface.co/dbmdz/bert-base-french-europeana-cased), fine-tuned for 3 epochs with a maximum sequence length of 256.
## Research Summary
Results show that ca. 10% of the articles explicitly reference news agencies, with the greatest share of agency content after 1940, although systematic citation of agencies already started slowly in the 1910s.
Differences in the usage of agency content across time, countries and languages as well as between newspapers reveal a complex network of news flows, whose exploration provides many opportunities for future work.
## Dataset Characteristics
The dataset contains 1,133 French and 397 German annotated documents, with 1,058,449 tokens, of which 1,976 have annotations. Below is an overview of the corpus statistics:
The annotated dataset is released on [Zenodo](https://doi.org/10.5281/zenodo.8333933).
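As a purely hypothetical illustration of how summary statistics like the ones above can be derived from an annotated corpus (the document structure below is invented for this sketch; the actual Zenodo release has its own schema):

```
# Hypothetical sketch: compute per-language document counts, total tokens,
# and annotated-token counts. The data layout is an assumption made for
# illustration, not the schema of the released dataset.
from collections import Counter

docs = [
    {"lang": "fr", "tokens": ["L'", "agence", "Havas", "annonce"],
     "annotated": {1, 2}},  # token indices carrying an annotation
    {"lang": "de", "tokens": ["Reuter", "meldet"], "annotated": {0}},
]

docs_per_lang = Counter(d["lang"] for d in docs)
total_tokens = sum(len(d["tokens"]) for d in docs)
annotated_tokens = sum(len(d["annotated"]) for d in docs)

print(docs_per_lang)                   # Counter({'fr': 1, 'de': 1})
print(total_tokens, annotated_tokens)  # 6 3
```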
```
# … (model and pipeline setup elided in this excerpt)
ner_results = nlp(example)
print(ner_results)
```
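The `ner_results` printed by the usage snippet is a list of per-token predictions. As a minimal sketch of how such output can be aggregated into whole agency mentions, assuming placeholder labels like `B-agency`/`I-agency` (the model's actual tag set may differ, and the sample output below is imitation data, not real model output):

```
# Hypothetical sketch: merge token-level NER output into whole mentions.
# The sample imitates the structure of a Hugging Face token-classification
# pipeline's output; the label names are placeholders for illustration.

def group_mentions(ner_results):
    """Group B-/I- tagged tokens into (mention, label) pairs."""
    mentions = []
    current_tokens, current_label = [], None
    for token in ner_results:
        tag = token["entity"]
        if tag.startswith("B-"):
            if current_tokens:
                mentions.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token["word"]], tag[2:]
        elif tag.startswith("I-") and current_tokens:
            current_tokens.append(token["word"])
        else:  # "O" tag or stray "I-": close any open mention
            if current_tokens:
                mentions.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
    if current_tokens:
        mentions.append((" ".join(current_tokens), current_label))
    return mentions

# Imitation of pipeline output for "… meldet die Agentur Havas …"
sample = [
    {"word": "Agentur", "entity": "B-agency", "score": 0.99},
    {"word": "Havas", "entity": "I-agency", "score": 0.98},
]
print(group_mentions(sample))  # [('Agentur Havas', 'agency')]
```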
### BibTeX entry and citation info
The code is available [here](https://github.com/impresso/newsagency-classification/tree/main).

```
@misc{newsagency_classification,
  title  = "Where Did the News Come From? Detection of News Agency Releases in Historical Newspapers",
  author = "Marxen, Lea and Ehrmann, Maud and Boros, Emanuela",
  year   = "2023",
  url    = "\url{https://github.com/impresso/newsagency-classification/tree/main}",
  note   = "Master Thesis"
}
```
|