---
language: 
  - de
tags:
  - text-classication
license: mit
metrics:
  - precision
  - recall
  - f1
widget:
- text: "Im Artikel wird leider nicht erwähnt, inwieweit und ob dadurch Natur zerstört werden muss."
---

# German news title gen

Please node that this model originates from the ["What's there, what's missing"](https://interaktiv.br.de/ai-detect-newsroom-mentions-in-comments/) collaboration of [AI & Automation Labl of Bayerischer Rundfunk (BR hereafter)](https://www.br.de/extra/ai-automation-lab/index.html) and [Mitteldeutscher Rundfunk (mdr hereafter)](https://www.mdr.de/) as well as [ida](https://idalab.de/). The collaboration took place during the [JournalismAI fellowship '22](https://www.lse.ac.uk/media-and-communications/polis/JournalismAI/Fellowship-Programme). The model presented is part of the the documenation of the half year of project time. The related technical framework can be found a [github](https://github.com/br-data/wtwm-topic-modelling).

## The task

This is a model for the task of classifying whether or not a articles comment addresses the moderation team/authors of the media house that published the article. In this prototype stage the media houses are Bayerischer Rundfunk and Mitteldeutscher Rundfunk.

This classification task is implemented as a binary classification into:

label 0: the comment holds no mention
label 1: the comment addresses the moderation team/authors of the media house

We decided to use [german-gpt2](https://huggingface.co/dbmdz/german-gpt2) by MDZ of Bayerische Staatsbibliothek as the foundation model.

**This model is still work in progress and might be updated in the future.**


## Dataset & preprocessing

This model was finetuned on a corpus of 18.860 user comments with a share of user comments from BR and mdr websites and social media channels. The ratio of comments without mentions and with mentions is 92% to 8%. With the initial annotated data the share of comments with mentions was 2% of the data. To run the first round of training during the time of the [JournalismAI fellowship '22](https://www.lse.ac.uk/media-and-communications/polis/JournalismAI/Fellowship-Programme), we decided to augment the corpus by 1421 generated comments with mentions. The generated comments were annotated the same way as the initial data.
Please note, that the generated comments are merely meant to kick off the training of the prototype model. Retraining of the model in later iterations of our system will ignore the generated comments and solely depend on authentic comments.

The preprocessing of the data included:
- remove linebreaks
- remove html tags
- remove emojis
- remove formatting fragments (e.g. "---------", "......")
- remove gaps (~ two or more adjacent spaces)
- strip comments for whitespaces at the begin and end of the corpus

We advice to perform the same preprocessing steps when working with the mode.