language: de
license: mit
German BERT base fine-tuned to predict educational requirements
This model is a fine-tuned version of deepset/gbert-base, the German BERT base language model. It was trained to predict educational requirements from the text of job ads. The training dataset is not publicly available. The task has 7 labels, listed here in the order of the classification head:
'Bachelor'
'Berufsausbildung'
'Doktorat oder äquivalent'
'Höhere Berufsausbildung'
'Master'
'Sonstiges'
'keine Ausbildungserfordernisse'
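Because the label order matters when reading the model's output, here is a minimal sketch of how raw scores could be mapped back to label names. It assumes the model emits one logit per label in the order listed above, and that a per-label sigmoid is the right way to turn logits into scores (a guess based on the BCE loss described under Performance). The function name `rank_labels` and the example logits are made up for illustration; they are not real model output.

```python
import math

# Label order as listed above (classification head order).
LABELS = [
    "Bachelor",
    "Berufsausbildung",
    "Doktorat oder äquivalent",
    "Höhere Berufsausbildung",
    "Master",
    "Sonstiges",
    "keine Ausbildungserfordernisse",
]


def rank_labels(logits):
    """Return (label, score) pairs sorted from highest to lowest score.

    Assumption: each logit is squashed independently with a sigmoid,
    as is usual for a model trained with binary cross-entropy.
    """
    if len(logits) != len(LABELS):
        raise ValueError(f"expected {len(LABELS)} logits, got {len(logits)}")
    scores = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    return sorted(zip(LABELS, scores), key=lambda pair: pair[1], reverse=True)


# Hypothetical logits for a single job ad (illustrative values only).
example_logits = [-2.1, 3.4, -5.0, 0.3, -3.2, -1.0, -4.0]
ranked = rank_labels(example_logits)
for label, score in ranked:
    print(f"{label}: {score:.3f}")
```

In practice the logits would come from the fine-tuned classification head rather than from a hard-coded list.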
The following table shows how many examples carry each label, both overall and in each split of the dataset (training/validation/test):
Label name | All data | Training | Validation | Test |
---|---|---|---|---|
Bachelor | 521 | 365 | 52 | 104 |
Berufsausbildung | 1854 | 1298 | 185 | 371 |
Doktorat oder äquivalent | 38 | 27 | 4 | 7 |
Höhere Berufsausbildung | 564 | 395 | 56 | 113 |
Master | 245 | 171 | 25 | 49 |
Sonstiges | 819 | 573 | 82 | 164 |
keine Ausbildungserfordernisse | 176 | 123 | 18 | 35 |
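The split ratio is not stated explicitly, but it can be read off the table. The sketch below copies the counts from the table and computes each split's share of all examples, plus each label's share of the training split to make the label imbalance concrete. The helper names are illustrative.

```python
# (training, validation, test) counts per label, copied from the table above.
COUNTS = {
    "Bachelor": (365, 52, 104),
    "Berufsausbildung": (1298, 185, 371),
    "Doktorat oder äquivalent": (27, 4, 7),
    "Höhere Berufsausbildung": (395, 56, 113),
    "Master": (171, 25, 49),
    "Sonstiges": (573, 82, 164),
    "keine Ausbildungserfordernisse": (123, 18, 35),
}


def split_shares(counts):
    """Fraction of all examples that fall into each split."""
    totals = [sum(column) for column in zip(*counts.values())]
    grand_total = sum(totals)
    return [total / grand_total for total in totals]


def training_label_shares(counts):
    """Fraction of the training split that carries each label."""
    train_total = sum(row[0] for row in counts.values())
    return {label: row[0] / train_total for label, row in counts.items()}


shares = split_shares(COUNTS)
print([round(share, 2) for share in shares])

label_shares = training_label_shares(COUNTS)
for label, share in sorted(label_shares.items(), key=lambda kv: kv[1]):
    print(f"{label}: {share:.1%}")
```

The counts work out to roughly a 70/10/20 training/validation/test split. The rarest label, 'Doktorat oder äquivalent', makes up under 1% of the training examples, while 'Berufsausbildung' makes up about 44%.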
Performance
Training minimized the binary cross-entropy (BCE) loss between the model's predictions and the true labels of the training set. During training, a weighted version of the label ranking average precision (LRAP) was tracked on the test set. For each true label, LRAP measures what fraction of the labels ranked at or above it are themselves true labels, and then averages this over all examples. To account for the label imbalance, the rankings were weighted so that misranking a rare label is penalized more heavily than misranking a frequent one. After training was complete, the model with the highest weighted LRAP was saved.
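To make the metric concrete, here is a small pure-Python sketch of LRAP with optional per-example weights. The exact weighting scheme used for this model is not documented here, so the `sample_weight` argument and the weights in the usage example are assumptions for illustration only, not the author's actual setup.

```python
def lrap(y_true, y_score, sample_weight=None):
    """Label ranking average precision, optionally weighted per example.

    y_true:  list of 0/1 indicator lists, one per example.
    y_score: list of score lists with the same shape as y_true.
    sample_weight: optional list of non-negative weights, one per example
        (hypothetical here; e.g. larger for examples with rare labels).
    """
    if sample_weight is None:
        sample_weight = [1.0] * len(y_true)

    total = 0.0
    for truth, scores, weight in zip(y_true, y_score, sample_weight):
        relevant = [j for j, t in enumerate(truth) if t]
        if not relevant or len(relevant) == len(truth):
            # Degenerate case: no ranking can be wrong, so score it as 1.
            example_score = 1.0
        else:
            example_score = 0.0
            for j in relevant:
                # Rank of label j: number of labels scored at least as high.
                rank = sum(1 for s in scores if s >= scores[j])
                # How many of those labels are true labels.
                hits = sum(1 for k in relevant if scores[k] >= scores[j])
                example_score += hits / rank
            example_score /= len(relevant)
        total += weight * example_score
    return total / sum(sample_weight)


# Toy data with 3 labels and 2 examples (illustrative values only).
y_true = [[1, 0, 0], [0, 0, 1]]
y_score = [[0.75, 0.5, 1.0], [1.0, 0.2, 0.1]]

plain = lrap(y_true, y_score)
weighted = lrap(y_true, y_score, sample_weight=[1.0, 3.0])
print(plain, weighted)
```

In the toy data, the second example is ranked worse than the first. Giving it a larger weight pulls the weighted score below the unweighted one, which is the effect the weighting is meant to have on poorly ranked rare labels.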
LRAP: 0.93
Authors
Rodrigo C. G. Pena: rodrigocgp [at] gmail.com