Model Card for personal-noun-detection-german-bert

This is a fine-tuned model based on bert-base-german-cased to detect personal nouns (i.e. common nouns denoting human beings like Lehrer 'teacher', Besucher 'visitor') in German text.

Model Details

Model Description

Personal nouns are defined as including all nouns referring to natural person(s) regardless of reference type (generic, non-generic, predicative). Not included in this definition are:

Proper names
Personal noun instances that do not refer to a human referent but to an institution or organization

The model performs binary token classification. Labels are PERS_N for personal nouns and O for all other tokens. Evaluated on a test set during training, the model achieved an F1 score of 0.94. For more information on the training data and evaluation, see Sökefeld et al. (2023).
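As a rough illustration of what a token-level F1 score for the PERS_N label measures, the following sketch computes precision, recall and F1 in plain Python. The function name and the toy label sequences are made up for illustration; this is not the evaluation script behind the reported score, which is described in Sökefeld et al. (2023).

```python
def token_f1(gold, predicted, positive="PERS_N"):
    """Token-level precision, recall and F1 for one positive label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Toy example with hand-written labels (not real model output).
gold = ["O", "PERS_N", "O", "O", "O", "PERS_N", "O", "O"]
pred = ["O", "PERS_N", "O", "PERS_N", "O", "O", "O", "O"]
precision, recall, f1 = token_f1(gold, pred)
print(precision, recall, f1)
```

In the toy example, one personal noun is found, one is missed and one token is wrongly tagged, so precision, recall and F1 all come out at 0.5.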

Developed by: Carla Sökefeld, Melanie Andresen, Johanna Binnewitt, Heike Zinsmeister
Model type: Pre-trained Language Model for token classification
Language: German
Finetuned from model: bert-base-german-cased

Model Sources

Paper: Sökefeld, Carla; Andresen, Melanie; Binnewitt, Johanna; Zinsmeister, Heike: "Personal noun detection for German". In: Proceedings of the 19th Joint ACL – ISO Workshop on Interoperable Semantic Annotation (ISA-19), Nancy, 20 June 2023.

Direct Use

The model can be used to detect personal nouns in German texts. This can be useful for studying the variety of morphological forms currently used in German. A particular use case lies in gender linguistics: the model can detect all personal nouns in a text or corpus, which makes it possible to quantify the share of gender-inclusive forms (Lehrer:innen, Lehrer*innen) relative to the total population of personal nouns.
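The gender-linguistic use case could be sketched as follows, assuming the detected personal nouns are already available as a list of strings. The regular expression and the function name are illustrative assumptions: the pattern only recognises forms with a colon, asterisk or underscore before -in/-innen and ignores other inclusive strategies.

```python
import re

# Assumption: an inclusive form ends in a gender marker (colon, asterisk or
# underscore) followed by "in" or "innen". Forms such as "LehrerInnen" or
# "Lehrende" are not covered by this simple pattern.
INCLUSIVE_SUFFIX = re.compile(r"[:*_](in|innen)$")


def inclusive_share(personal_nouns):
    """Return the fraction of personal nouns that carry an inclusive suffix."""
    if not personal_nouns:
        return 0.0
    inclusive = [noun for noun in personal_nouns if INCLUSIVE_SUFFIX.search(noun)]
    return len(inclusive) / len(personal_nouns)


# Hand-written list standing in for the nouns detected in a corpus.
detected = ["Lehrerin", "Schüler:innen", "Besucher", "Lehrer*innen"]
share = inclusive_share(detected)
print(f"{share:.0%} of the detected personal nouns are gender-inclusive forms")
```

Counting against all detected personal nouns, rather than against all tokens, is what gives the share a meaningful denominator.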

How to Get Started with the Model

Use the code below to get started with the model.

from transformers import pipeline

model_checkpoint = '/your/file_path/to_the_model'

token_classifier = pipeline(
    "token-classification",
    model=model_checkpoint,
    aggregation_strategy="simple",
)

example_sentence = 'Die Lehrerin macht mit ihren Schüler:innen einen Ausflug.'

print(token_classifier(example_sentence))
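With aggregation_strategy="simple", the pipeline returns a list of dictionaries with the keys entity_group, score, word, start and end. A minimal sketch for turning that output into a plain list of personal nouns is shown below. The helper name is hypothetical, and the predictions are a hand-written stand-in with placeholder scores so that the snippet runs without the model; they are not real model output.

```python
def extract_personal_nouns(text, predictions, label="PERS_N"):
    """Return the surface strings of all spans tagged with `label`."""
    return [
        text[p["start"]:p["end"]]
        for p in predictions
        if p["entity_group"] == label
    ]


sentence = "Die Lehrerin macht mit ihren Schüler:innen einen Ausflug."

# Hand-written stand-in for token_classifier(sentence); the scores are
# placeholders, not real model confidences.
mock_predictions = [
    {"entity_group": "PERS_N", "score": 1.0, "word": "Lehrerin",
     "start": 4, "end": 12},
    {"entity_group": "PERS_N", "score": 1.0, "word": "Schüler:innen",
     "start": 29, "end": 42},
]

print(extract_personal_nouns(sentence, mock_predictions))
```

Slicing the original sentence with the character offsets avoids depending on the word field, which is reconstructed from subword tokens and may not match the input text exactly.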

Training Details

Training Data

The training data consisted of a corpus of roughly 130,000 tokens of newspaper and blog texts from 2019. An overview of the texts with dates, author information and links to the newspaper articles/blog posts is provided in the file overview_training_data.xlsx.

Training Procedure

We applied the transformer tokenizer to pre-tokenized sentences. For training, we used the default hyperparameters specified in the Hugging Face token classification tutorial (https://huggingface.co/learn/nlp-course/chapter7/2?fw=pt#processing-the-data). We evaluated the model at the token level on the remaining 10% of the corpus, which was held out from training.
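Running a subword tokenizer over pre-tokenized sentences means that word-level labels have to be mapped onto subword tokens. The sketch below shows one way to do this in plain Python, given the word_ids() list that a fast tokenizer returns when called with is_split_into_words=True. The label-to-id mapping, the choice to copy a word's label to all of its subwords, and the example subword split are assumptions for illustration, not details confirmed by the authors.

```python
LABEL2ID = {"O": 0, "PERS_N": 1}  # assumed mapping, not confirmed by the source
IGNORE_INDEX = -100  # label value ignored by PyTorch's cross-entropy loss


def align_labels(word_labels, word_ids):
    """Expand word-level labels to subword level.

    `word_ids` has one entry per subword token: the index of the word the
    token belongs to, or None for special tokens such as [CLS] and [SEP].
    """
    return [
        IGNORE_INDEX if word_id is None else LABEL2ID[word_labels[word_id]]
        for word_id in word_ids
    ]


words = ["Die", "Lehrerin", "wartet", "."]
word_labels = ["O", "PERS_N", "O", "O"]

# Hypothetical subword split: "Lehrerin" is assumed to span two subwords,
# and the sequence is wrapped in two special tokens.
word_ids = [None, 0, 1, 1, 2, 3, None]

print(align_labels(word_labels, word_ids))
```

Special tokens receive -100 so that they do not contribute to the loss; an alternative design would label only the first subword of each word and mask the rest.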

Training Hyperparameters

Training regime: number of training epochs: 3; learning rate: 2e-5; weight decay: 0.01.
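For reference, these three values can be written down as a small configuration whose keys match the keyword argument names of transformers.TrainingArguments. This is only a sketch of how the settings might be recorded; all other settings, including the output directory, are not stated here and would have to be supplied separately.

```python
# The three values are taken from the training regime listed above;
# everything else would be left at the library defaults.
hyperparameters = {
    "num_train_epochs": 3,
    "learning_rate": 2e-5,
    "weight_decay": 0.01,
}

# With transformers installed, the dictionary could be passed on as
# TrainingArguments(output_dir="some/output/dir", **hyperparameters),
# where the output directory is a placeholder.
for name, value in hyperparameters.items():
    print(f"{name} = {value}")
```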