philippgawlik committed on
Commit
757067f
1 Parent(s): 4f3726a

Added dataset chapter

Files changed (1)
  1. README.md +18 -2
README.md CHANGED
@@ -1,3 +1,4 @@
  ---
  language:
  - de
@@ -14,7 +15,7 @@ widget:

  # German news title gen

- Please note that this model originates from the ["What's there, what's missing"](https://interaktiv.br.de/ai-detect-newsroom-mentions-in-comments/) collaboration of the [AI & Automation Lab of Bayerischer Rundfunk](https://www.br.de/extra/ai-automation-lab/index.html) and [Mitteldeutscher Rundfunk](https://www.mdr.de/) as well as [ida](https://idalab.de/). The collaboration took place during the [JournalismAI fellowship '22](https://www.lse.ac.uk/media-and-communications/polis/JournalismAI/Fellowship-Programme). The model presented is part of the documentation of the half-year project. The related technical framework can be found on [GitHub](https://github.com/br-data/wtwm-topic-modelling).

  ## The task

@@ -23,10 +24,25 @@ This is a model for the task of classifying whether or not a articles comment ad
  This classification task is implemented as a binary classification into:

  label 0: the comment holds no mention
-
  label 1: the comment addresses the moderation team/authors of the media house

  We decided to use [german-gpt2](https://huggingface.co/dbmdz/german-gpt2) by MDZ of Bayerische Staatsbibliothek as the foundation model.

  **This model is still work in progress and might be updated in the future.**

 
+
  ---
  language:
  - de

  # German news title gen

+ Please note that this model originates from the ["What's there, what's missing"](https://interaktiv.br.de/ai-detect-newsroom-mentions-in-comments/) collaboration of the [AI & Automation Lab of Bayerischer Rundfunk (BR hereafter)](https://www.br.de/extra/ai-automation-lab/index.html) and [Mitteldeutscher Rundfunk (mdr hereafter)](https://www.mdr.de/) as well as [ida](https://idalab.de/). The collaboration took place during the [JournalismAI fellowship '22](https://www.lse.ac.uk/media-and-communications/polis/JournalismAI/Fellowship-Programme). The model presented is part of the documentation of the half-year project. The related technical framework can be found on [GitHub](https://github.com/br-data/wtwm-topic-modelling).

  ## The task

  This classification task is implemented as a binary classification into:

  label 0: the comment holds no mention

  label 1: the comment addresses the moderation team/authors of the media house

  We decided to use [german-gpt2](https://huggingface.co/dbmdz/german-gpt2) by MDZ of Bayerische Staatsbibliothek as the foundation model.

  **This model is still work in progress and might be updated in the future.**

+
+ ## Dataset & preprocessing
+
+ This model was finetuned on a corpus of 18,860 user comments, which includes user comments from the BR and mdr websites and social media channels. The ratio of comments without mentions to comments with mentions is 92% to 8%. In the initially annotated data, comments with mentions made up only 2% of the data. To run the first round of training during the [JournalismAI fellowship '22](https://www.lse.ac.uk/media-and-communications/polis/JournalismAI/Fellowship-Programme), we decided to augment the corpus with 1,421 generated comments with mentions. The generated comments were annotated the same way as the initial data.
+ Please note that the generated comments are merely meant to kick off the training of the prototype model. Retraining of the model in later iterations of our system will ignore the generated comments and rely solely on authentic comments.
+
+ The preprocessing of the data included:
+ - remove line breaks
+ - remove HTML tags
+ - remove emojis
+ - remove formatting fragments (e.g. "---------", "......")
+ - remove gaps (two or more adjacent spaces)
+ - strip whitespace at the beginning and end of each comment
+
+ We advise performing the same preprocessing steps when working with the model.
+
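
The preprocessing steps listed in the dataset chapter could be sketched roughly as follows. This is a minimal illustration using only Python's standard `re` module, not the project's actual implementation. The function name `preprocess_comment`, the regular expressions, the emoji code-point ranges, and the threshold of four repeated characters for a "formatting fragment" are all assumptions made for this example.

```python
import re

# Assumption: a simple pattern for tags; heavily nested or broken HTML may need a real parser.
HTML_TAG = re.compile(r"<[^>]+>")

# Assumption: emojis are approximated by a few common Unicode blocks.
EMOJI = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, supplemental symbols
    "\U0001F1E6-\U0001F1FF"  # regional indicator symbols (flags)
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F\u200D"           # variation selector and zero-width joiner
    "]+"
)

# Assumption: a formatting fragment is the same character from this set repeated 4+ times,
# so an ordinary three-dot ellipsis is kept.
FRAGMENT = re.compile(r"([-._*=~])\1{3,}")

LINEBREAK = re.compile(r"[\r\n]+")
GAP = re.compile(r" {2,}")


def preprocess_comment(text: str) -> str:
    """Apply the listed cleaning steps to a single comment (hypothetical helper)."""
    text = LINEBREAK.sub(" ", text)  # remove line breaks
    text = HTML_TAG.sub(" ", text)   # remove HTML tags
    text = EMOJI.sub(" ", text)      # remove emojis
    text = FRAGMENT.sub(" ", text)   # remove formatting fragments
    text = GAP.sub(" ", text)        # collapse two or more adjacent spaces
    return text.strip()              # strip leading and trailing whitespace


if __name__ == "__main__":
    raw = "<p>Hallo BR,</p>\nwarum fehlt das? \U0001F600 ------- Danke......  "
    print(preprocess_comment(raw))
```

Each removed element is replaced by a single space before gaps are collapsed, so words separated only by a tag or a line break do not get glued together. The same function would need to be applied to comments at inference time to match the training data.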