File size: 2,298 Bytes
f977019
58e2452
f977019
 
58e2452
 
 
c180783
58e2452
 
 
 
bfad2e1
58e2452
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
c6810c0
e75c537
 
 
 
58e2452
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
---
language: de
license: mit
---

# German BERT base fine-tuned to predict educational requirements

This is a fine-tuned version of the German BERT base language model [deepset/gbert-base](https://huggingface.co/deepset/gbert-base). The multilabel task this model was trained on was to predict education requirements from job ad texts. The dataset used for training is not available to the public. The 7 labels in the task are (in the classification head order):

- `'Bachelor'`
- `'Berufsausbildung'`
- `'Doktorat oder äquivalent'`
- `'Höhere Berufsausbildung'`
- `'Master'`
- `'Sonstiges'`
- `'keine Ausbildungserfordernisse'`

The number of representatives of these labels in each of the splits (train/test/val) of the dataset is summarized in the following table:

| Label name | All data | Training | Validation | Test |
|------------|----------|----------|------------|------|
| Bachelor | 521 | 365 | 52 | 104 |
| Berufsausbildung | 1854 | 1298 | 185 | 371 |
| Doktorat oder äquivalent | 38 | 27 | 4 | 7 |
| Höhere Berufsausbildung | 564 | 395 | 56 | 113 |
| Master | 245 | 171 | 25 | 49 |
| Sonstiges | 819 | 573 | 82 | 164 |
| keine Ausbildungserfordernisse | 176 | 123 | 18 | 35 |

## Performance  

Training consisted of [minimizing the binary cross-entropy (BCE)](https://en.wikipedia.org/wiki/Cross_entropy#Cross-entropy_minimization) loss between the model's predictions and the actual labels in the training set. During training, a weighted version of the [label ranking average precision (LRAP)](https://scikit-learn.org/stable/modules/model_evaluation.html#label-ranking-average-precision) was tracked for the testing set. LRAP measures what fraction of higher-ranked labels produced by the model were true labels. To account for the label imbalance, the rankings were weighted so that improperly ranked rare labels are penalized more than their more frequent counterparts. After training was complete, the model with highest weighted LRAP was saved.

```
LRAP: 0.93
```

# See also:  

- [deepset/gbert-base](https://huggingface.co/deepset/gbert-base)
- [deepset/gbert-large](https://huggingface.co/deepset/gbert-large)
- [gonzpen/gbert-large-ft-edu-redux](https://huggingface.co/gonzpen/gbert-large-ft-edu-redux)

## Authors
Rodrigo C. G. Pena: `rodrigocgp [at] gmail.com`