language: de
license: mit

German BERT base fine-tuned to predict educational requirements

This model is a fine-tuned version of the German BERT base language model deepset/gbert-base. It was trained on a multilabel classification task: predicting educational requirements from job ad texts. The training dataset is not publicly available. The 7 labels, in the order of the classification head outputs, are:

  • 'Bachelor'
  • 'Berufsausbildung'
  • 'Doktorat oder äquivalent'
  • 'Höhere Berufsausbildung'
  • 'Master'
  • 'Sonstiges'
  • 'keine Ausbildungserfordernisse'
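Because the task is multilabel, each of the 7 outputs is scored independently, so one job ad can receive several labels or none. The following is a minimal sketch of how a row of logits could be decoded into label names. It assumes a sigmoid activation per label and a cutoff of 0.5; the threshold, the helper name `decode_multilabel`, and the example logits are illustrative assumptions, not values taken from this model.

```python
import math

# Label order as listed above (classification head order).
LABELS = [
    "Bachelor",
    "Berufsausbildung",
    "Doktorat oder äquivalent",
    "Höhere Berufsausbildung",
    "Master",
    "Sonstiges",
    "keine Ausbildungserfordernisse",
]


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def decode_multilabel(logits, threshold=0.5):
    """Return the names of all labels whose sigmoid probability reaches the threshold.

    `threshold=0.5` is an assumed default, not a value documented for this model.
    """
    if len(logits) != len(LABELS):
        raise ValueError(f"expected {len(LABELS)} logits, got {len(logits)}")
    return [label for label, z in zip(LABELS, logits) if sigmoid(z) >= threshold]


# Made-up logits for illustration only (not real model output).
example_logits = [2.1, -3.0, -5.2, -1.7, 0.4, -2.5, -4.0]
predicted = decode_multilabel(example_logits)
print(predicted)
```

In practice the logits would come from the fine-tuned model's classification head, and the threshold could be tuned per label on held-out data.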

The number of examples carrying each label in each split of the dataset (training/validation/test) is summarized in the following table:

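| Label name                     | All data | Training | Validation | Test |
|--------------------------------|----------|----------|------------|------|
| Bachelor                       | 521      | 365      | 52         | 104  |
| Berufsausbildung               | 1854     | 1298     | 185        | 371  |
| Doktorat oder äquivalent       | 38       | 27       | 4          | 7    |
| Höhere Berufsausbildung        | 564      | 395      | 56         | 113  |
| Master                         | 245      | 171      | 25         | 49   |
| Sonstiges                      | 819      | 573      | 82         | 164  |
| keine Ausbildungserfordernisse | 176      | 123      | 18         | 35   |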
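As a quick sanity check on the counts above, the short sketch below recomputes the share of each label that falls into each split. The numbers are copied from the table; the helper name `split_fractions` is hypothetical. The resulting fractions come out close to 70/10/20 for every label, which suggests (but does not confirm) that the split was stratified by label.

```python
# (training, validation, test) counts per label, copied from the table above.
counts = {
    "Bachelor": (365, 52, 104),
    "Berufsausbildung": (1298, 185, 371),
    "Doktorat oder äquivalent": (27, 4, 7),
    "Höhere Berufsausbildung": (395, 56, 113),
    "Master": (171, 25, 49),
    "Sonstiges": (573, 82, 164),
    "keine Ausbildungserfordernisse": (123, 18, 35),
}


def split_fractions(train, val, test):
    """Return the (train, val, test) shares of one label, rounded to two decimals."""
    total = train + val + test
    return tuple(round(n / total, 2) for n in (train, val, test))


fractions = {label: split_fractions(*c) for label, c in counts.items()}

for label, (tr, va, te) in fractions.items():
    print(f"{label:32s} train={tr:.2f} val={va:.2f} test={te:.2f}")
```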

Performance

Training minimized the binary cross-entropy (BCE) loss between the model's predictions and the true labels in the training set. During training, a weighted version of the label ranking average precision (LRAP) was tracked on the test set. For each true label, LRAP measures the fraction of labels ranked at or above it that are also true labels, averaged over all true labels and samples. To account for the label imbalance, the rankings were weighted so that misranked rare labels are penalized more heavily than misranked frequent ones. After training was complete, the model with the highest weighted LRAP was saved.
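To make the two quantities concrete, here is a minimal pure-Python sketch of a per-sample BCE loss on logits and of LRAP. The exact weighting scheme used for this model is not specified here, so `weighted_lrap` simply takes per-sample weights as an input; that interface, the function names, and the example scores are assumptions for illustration, not the training code.

```python
import math


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over the labels of one sample.

    Uses the stable form max(z, 0) - z*y + log(1 + exp(-|z|)).
    """
    total = 0.0
    for z, y in zip(logits, targets):
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


def lrap_sample(scores, targets):
    """LRAP for one sample.

    For each true label, take the fraction of labels ranked at or above it
    that are also true, then average over the true labels.
    """
    true_idx = [i for i, y in enumerate(targets) if y == 1]
    if not true_idx:
        return 1.0  # convention chosen here: nothing to rank, so no ranking error
    total = 0.0
    for i in true_idx:
        rank = sum(1 for s in scores if s >= scores[i])
        true_at_or_above = sum(1 for j in true_idx if scores[j] >= scores[i])
        total += true_at_or_above / rank
    return total / len(true_idx)


def weighted_lrap(all_scores, all_targets, sample_weights=None):
    """Weighted mean of per-sample LRAP (weights are an assumed input)."""
    if sample_weights is None:
        sample_weights = [1.0] * len(all_scores)
    weighted_sum = sum(
        w * lrap_sample(s, t)
        for s, t, w in zip(all_scores, all_targets, sample_weights)
    )
    return weighted_sum / sum(sample_weights)


# Made-up scores: true labels are index 0 (ranked 1st) and index 5 (ranked 3rd).
scores = [0.9, 0.2, 0.1, 0.3, 0.8, 0.4, 0.05]
targets = [1, 0, 0, 0, 0, 1, 0]
print(lrap_sample(scores, targets))  # (1/1 + 2/3) / 2
```

One way to emphasize rare labels would be to give larger weights to samples that contain them, for example weights derived from inverse label frequency; this is only one possible choice.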

LRAP: 0.93

See also:

Authors

Rodrigo C. G. Pena: rodrigocgp [at] gmail.com