aiautomationlab/wtwm-gpt2-based-mentions-detector

@@ -1,3 +1,4 @@
 ---
 language:
   - de
@@ -14,7 +15,7 @@ widget:
 # German news title gen
-Please note that this model originates from the ["What's there, what's missing"](https://interaktiv.br.de/ai-detect-newsroom-mentions-in-comments/) collaboration of the [AI & Automation Lab of Bayerischer Rundfunk](https://www.br.de/extra/ai-automation-lab/index.html) and [Mitteldeutscher Rundfunk](https://www.mdr.de/) as well as [ida](https://idalab.de/). The collaboration took place during the [JournalismAI fellowship '22](https://www.lse.ac.uk/media-and-communications/polis/JournalismAI/Fellowship-Programme). The model presented is part of the documentation of the half-year project. The related technical framework can be found on [GitHub](https://github.com/br-data/wtwm-topic-modelling).
 ## The task
@@ -23,10 +24,25 @@ This is a model for the task of classifying whether or not a articles comment ad
 This classification task is implemented as a binary classification into:
 label 0: the comment holds no mention
 label 1: the comment addresses the moderation team/authors of the media house
 We decided to use [german-gpt2](https://huggingface.co/dbmdz/german-gpt2) by MDZ of Bayerische Staatsbibliothek as the foundation model.
 **This model is still work in progress and might be updated in the future.**

+Please note that this model originates from the ["What's there, what's missing"](https://interaktiv.br.de/ai-detect-newsroom-mentions-in-comments/) collaboration of the [AI & Automation Lab of Bayerischer Rundfunk (BR hereafter)](https://www.br.de/extra/ai-automation-lab/index.html) and [Mitteldeutscher Rundfunk (mdr hereafter)](https://www.mdr.de/) as well as [ida](https://idalab.de/). The collaboration took place during the [JournalismAI fellowship '22](https://www.lse.ac.uk/media-and-communications/polis/JournalismAI/Fellowship-Programme). The model presented is part of the documentation of the half-year project. The related technical framework can be found on [GitHub](https://github.com/br-data/wtwm-topic-modelling).
+## Dataset & preprocessing
+This model was fine-tuned on a corpus of 18,860 user comments collected from BR and mdr websites and social media channels. The ratio of comments without mentions to comments with mentions is 92% to 8%. In the initially annotated data, comments with mentions made up only 2% of the data. To run the first round of training during the [JournalismAI fellowship '22](https://www.lse.ac.uk/media-and-communications/polis/JournalismAI/Fellowship-Programme), we decided to augment the corpus with 1,421 generated comments with mentions. The generated comments were annotated in the same way as the initial data.
+Please note that the generated comments are merely meant to kick off the training of the prototype model. Retraining of the model in later iterations of our system will ignore the generated comments and depend solely on authentic comments.
+The preprocessing of the data included:
+- remove linebreaks
+- remove html tags
+- remove emojis
+- remove formatting fragments (e.g. "---------", "......")
+- remove gaps (~ two or more adjacent spaces)
+- strip leading and trailing whitespace from each comment
+We advise performing the same preprocessing steps when working with the model.
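
The preprocessing steps listed above could be sketched in Python roughly as follows. This is a minimal illustration, not the project's actual pipeline: the function name `preprocess_comment`, the regular expressions, and the emoji code point ranges are all assumptions, since the model card does not state the exact rules that were used.

```python
import re

# All patterns below are assumptions made for illustration only;
# the model card does not specify the exact rules.
LINEBREAK = re.compile(r"[\r\n]+")
HTML_TAG = re.compile(r"<[^>]+>")
EMOJI = re.compile(
    "["
    "\U0001F1E6-\U0001F1FF"  # regional indicator symbols (flags)
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, extended symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F\u200D"           # variation selector and zero-width joiner
    "]+"
)
# Runs of three or more separator characters, e.g. "---------" or "......".
# Note that this also removes ordinary ellipses ("...").
FRAGMENT = re.compile(r"[-._=*]{3,}")
GAP = re.compile(r" {2,}")


def preprocess_comment(text: str) -> str:
    """Apply the listed cleaning steps to a single comment (hypothetical helper)."""
    text = LINEBREAK.sub(" ", text)   # remove linebreaks
    text = HTML_TAG.sub(" ", text)    # remove html tags
    text = EMOJI.sub("", text)        # remove emojis
    text = FRAGMENT.sub(" ", text)    # remove formatting fragments
    text = GAP.sub(" ", text)         # collapse two or more adjacent spaces
    return text.strip()               # strip leading and trailing whitespace


if __name__ == "__main__":
    raw = "  Liebe Redaktion,<br>warum fehlt\ndas Thema? 😀 ------  "
    print(preprocess_comment(raw))
```

Tags and formatting fragments are replaced by a space rather than deleted so that words separated only by a tag such as `<br>` are not merged; the gap-collapsing step then removes any doubled spaces this introduces.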