metadata

language: de
datasets:
  - Short-Answer-Feedback/saf_micro_job_german
tags:
  - generated_from_trainer
widget:
  - text: >-
      Antwort: Ich gebe mich zu erkennen und zeige das Informationsschreiben vor
      Lösung: Der Jobber soll sich in diesem Fall dem Personal gegenüber zu
      erkennen geben (0.25 P) und das entsprechende Informationsschreiben in der
      App vorzeigen (0.25 P). Zusätzlich muss notiert werden, zu welchem
      Zeitpunkt (0.25 P) des Jobs der Jobber enttarnt wurde. Zentrale Frage ist
      dabei, ob ein neutrales, unvoreingenommenes Verkaufsgespräch stattgefunden
      hat. Der Job soll mit Erlaubnis der Mitarbeiter bis zum Ende durchgeführt
      (0.25 P) werden. Frage: Frage 1: Wie reagierst du, wenn du auf deine
      Tätigkeit angesprochen wirst?

mbart-finetuned-saf-micro-job

This model is a fine-tuned version of facebook/mbart-large-cc25 on the saf_micro_job_german dataset for Short Answer Feedback (SAF), as proposed in Filighera et al., ACL 2022.

Model description

This model was built on top of mBART, which is a sequence-to-sequence denoising auto-encoder pre-trained on large-scale monolingual corpora in many languages.

It expects inputs in the following format:

Antwort: [answer] Lösung: [reference_answer] Frage: [question]

In the template above, [answer], [reference_answer] and [question] should be replaced by the provided answer, the reference answer and the question to which they refer, respectively.
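The input layout described above can be assembled with a small helper. This is a minimal sketch, not code from the repository: the function name `build_saf_input` is hypothetical, and stripping surrounding whitespace from each field is an assumption about reasonable pre-processing rather than something the model card specifies.

```python
def build_saf_input(answer: str, reference_answer: str, question: str) -> str:
    """Join the three fields into the 'Antwort: ... Lösung: ... Frage: ...' layout."""
    # Assumption: leading/trailing whitespace carries no meaning, so it is stripped.
    fields = [text.strip() for text in (answer, reference_answer, question)]
    return "Antwort: {} Lösung: {} Frage: {}".format(*fields)


# Shortened, illustrative field values (not taken from the dataset).
example = build_saf_input(
    answer="Ich gebe mich zu erkennen",
    reference_answer="Der Jobber soll sich zu erkennen geben (0.25 P).",
    question="Frage 1: Wie reagierst du?",
)
print(example)
```

The resulting string can be passed to the tokenizer in place of a hand-written input.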

The outputs are formatted as follows:

[verification_feedback] Feedback: [feedback]

Here, the [verification_feedback] label is one of Correct, Partially correct or Incorrect, and [feedback] is the textual feedback the model generates for the given answer.
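Because the label and the feedback arrive in a single string, downstream code usually needs to split them. The sketch below shows one way to do that, assuming the literal separator "Feedback:" always follows the label; the names `SafPrediction` and `parse_saf_output` are hypothetical, and the sample string is a shortened illustration rather than real model output.

```python
from typing import NamedTuple, Optional

LABELS = ("Correct", "Partially correct", "Incorrect")
SEPARATOR = "Feedback:"


class SafPrediction(NamedTuple):
    label: Optional[str]  # None when no known label could be recovered
    feedback: str


def parse_saf_output(text: str) -> SafPrediction:
    """Split '[verification_feedback] Feedback: [feedback]' into its two parts."""
    head, sep, tail = text.partition(SEPARATOR)
    if not sep:
        # Separator missing: keep the whole string as feedback, label unknown.
        return SafPrediction(None, text.strip())
    head = head.strip()
    label = head if head in LABELS else None
    return SafPrediction(label, tail.strip())


prediction = parse_saf_output(
    "Partially correct Feedback: Bitte halte fest, wann du angesprochen wurdest."
)
print(prediction.label)
print(prediction.feedback)
```

Returning None for an unrecognized label, instead of raising, keeps batch evaluation running when the model produces malformed output.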

Intended uses & limitations

This model is intended to be used for Short Answer Feedback generation in the context of micro-job training (as conducted on the crowd-worker platform appJobber). It is therefore not expected to perform well on questions and answers outside this scope.

It is important to acknowledge that the model underperforms when a question that was not seen during training is given as input for inference. In particular, it tends to classify most answers as correct and does not provide relevant feedback in such cases. This limitation could be partially overcome by extending the dataset with the desired question (and associated answers) and fine-tuning the model for a few epochs on the new data.

Training and evaluation data

As mentioned previously, the model was trained on the saf_micro_job_german dataset, which is divided into the following splits.

Split	Number of examples
train	1226
validation	308
test_unseen_answers	271
test_unseen_questions	602

Evaluation was performed on the test_unseen_answers and test_unseen_questions splits.

Training procedure

The Trainer API was used to fine-tune the model. The code used for pre-processing and training was mostly adapted from the summarization script made available by Hugging Face.

Training was completed in a little under 1 hour on a GPU on Google Colab.

Training hyperparameters

The following hyperparameters were utilized during training:

num_epochs: 10
optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
learning_rate: 5e-05
lr_scheduler_type: linear
train_batch_size: 1
gradient_accumulation_steps: 4
eval_batch_size: 4
mixed_precision_training: Native AMP
PyTorch seed: 42
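The hyperparameters listed above map naturally onto keyword arguments of `transformers.Seq2SeqTrainingArguments`. The sketch below collects them in a plain dictionary and derives the effective batch size; the exact call the authors used is not shown in this card, so the mapping, the single-GPU assumption and the `output_dir` value in the comment are assumptions for illustration.

```python
# Keyword names follow transformers.Seq2SeqTrainingArguments (assumed mapping).
training_kwargs = {
    "num_train_epochs": 10,
    "learning_rate": 5e-5,
    "lr_scheduler_type": "linear",
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 4,
    "per_device_eval_batch_size": 4,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "fp16": True,  # "Native AMP" mixed precision
    "seed": 42,
}

NUM_DEVICES = 1        # assumption: one Colab GPU
TRAIN_EXAMPLES = 1226  # size of the train split

# Gradients are accumulated over several forward passes before each update.
effective_batch_size = (
    training_kwargs["per_device_train_batch_size"]
    * training_kwargs["gradient_accumulation_steps"]
    * NUM_DEVICES
)
approx_updates_per_epoch = TRAIN_EXAMPLES / effective_batch_size

print("effective batch size:", effective_batch_size)
print("approx. optimizer updates per epoch:", approx_updates_per_epoch)

# With transformers installed, the dictionary could be passed on like this
# (the output_dir value is a placeholder):
# from transformers import Seq2SeqTrainingArguments
# args = Seq2SeqTrainingArguments(output_dir="out", **training_kwargs)
```

With a per-device batch size of 1 and 4 accumulation steps, each optimizer update sees 4 examples, which keeps memory use low on a single GPU.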

Framework versions

Transformers 4.25.1
PyTorch 1.12.1+cu113
Datasets 2.7.1
Tokenizers 0.13.2

Evaluation results

The generated feedback was evaluated using the SacreBLEU, ROUGE-2, METEOR and BERTScore metrics from Hugging Face, while the accuracy and F1 scores from scikit-learn were used to evaluate the labels.

The following results were achieved.

Split	SacreBLEU	ROUGE-2	METEOR	BERTScore	Accuracy	Weighted F1	Macro F1
test_unseen_answers	39.5	29.8	63.3	63.1	80.1	80.3	80.7
test_unseen_questions	0.3	0.5	33.8	31.3	48.7	46.5	40.6

The script used to compute these metrics and perform evaluation can be found in the evaluation.py file in this repository.
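To make the label metrics concrete, the sketch below computes accuracy, macro F1 and weighted F1 in plain Python. It is a dependency-free stand-in for scikit-learn's `accuracy_score` and `f1_score` (with `average='macro'` and `average='weighted'`), not the repository's evaluation.py; the function name `label_scores` and the toy labels are made up for illustration.

```python
from collections import Counter


def label_scores(gold, pred):
    """Return accuracy, macro F1 and weighted F1 for two equal-length label lists."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and of equal length")
    classes = sorted(set(gold) | set(pred))
    support = Counter(gold)  # number of gold examples per class
    f1 = {}
    for c in classes:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        f1[c] = 2 * tp / denom if denom else 0.0
    accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    macro_f1 = sum(f1.values()) / len(classes)  # every class counts equally
    weighted_f1 = sum(f1[c] * support[c] for c in classes) / len(gold)
    return {"accuracy": accuracy, "macro_f1": macro_f1, "weighted_f1": weighted_f1}


# Toy labels, invented for this example only.
gold = ["Correct", "Correct", "Partially correct", "Incorrect", "Partially correct", "Incorrect"]
pred = ["Correct", "Partially correct", "Partially correct", "Incorrect", "Partially correct", "Correct"]
scores = label_scores(gold, pred)
print(scores)
```

Macro F1 weights every class equally, whereas weighted F1 weights each class by its number of gold examples, so the two diverge when the label distribution is imbalanced.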

Usage

The example below shows how the model can be applied to generate feedback for a given answer.

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained('Short-Answer-Feedback/mbart-finetuned-saf-micro-job')
tokenizer = AutoTokenizer.from_pretrained('Short-Answer-Feedback/mbart-finetuned-saf-micro-job')

example_input = 'Antwort: Ich gebe mich zu erkennen und zeige das Informationsschreiben vor Lösung: Der Jobber soll sich in diesem Fall dem Personal gegenüber zu erkennen geben (0.25 P) und das entsprechende Informationsschreiben in der App vorzeigen (0.25 P). Zusätzlich muss notiert werden, zu welchem Zeitpunkt (0.25 P) des Jobs der Jobber enttarnt wurde. Zentrale Frage ist dabei, ob ein neutrales, unvoreingenommenes Verkaufsgespräch stattgefunden hat. Der Job soll mit Erlaubnis der Mitarbeiter bis zum Ende durchgeführt (0.25 P) werden. Frage: Frage 1: Wie reagierst du, wenn du auf deine Tätigkeit angesprochen wirst?'
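# Tokenize the input, padding or truncating it to 256 tokens
inputs = tokenizer(example_input, max_length=256, padding='max_length', truncation=True, return_tensors='pt')

# Generate the label and feedback (at most 128 tokens)
generated_tokens = model.generate(
    inputs['input_ids'],
    attention_mask=inputs['attention_mask'],
    max_length=128
)
# Decode the generated token ids back into text
output = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]
print(output)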

The output produced by the model then looks as follows:

Partially correct Feedback: Sollte das Personal dies gestatten, kannst du den Check auch gerne noch abschließen. Bitte halte nur in fest, wann genau du auf deine Tätigkeit angesprochen wurdest.

Related Work

Filighera et al., ACL 2022 trained an mT5 model on this dataset, providing a baseline for SAF generation. The entire code used to define and train the model can be found on GitHub.