---
license: mit
datasets:
- oscar
- DDSC/dagw_reddit_filtered_v1.0.0
- graelo/wikipedia
language:
- da
widget:
- text: Der var engang en [MASK]
---

# What is this?

A pre-trained BERT model (base size, ~110M parameters) for Danish NLP. Rather than being pre-trained from scratch, it was adapted from the English bert-base-uncased model, using a new tokenizer trained on Danish text.

# How to use

Try the model with the `pipeline` API from the [🤗 Transformers](https://github.com/huggingface/transformers) library:

```python
from transformers import pipeline

# Build a fill-mask pipeline backed by this model
pipe = pipeline("fill-mask", model="KennethTM/bert-base-uncased-danish")

# Danish for "Once upon a time there was a [MASK]"
pipe("Der var engang en [MASK]")
```

Or load it using the Auto* classes:

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("KennethTM/bert-base-uncased-danish")
model = AutoModelForMaskedLM.from_pretrained("KennethTM/bert-base-uncased-danish")
```
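
Once the tokenizer and model are loaded this way, mask predictions can be read from the logits by hand. The following is a minimal sketch, assuming PyTorch is installed and the model can be downloaded from the Hub; the choice of 5 candidates and the variable names are illustrative, not part of the original card.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "KennethTM/bert-base-uncased-danish"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)
model.eval()  # disable dropout for inference

# Danish for "Once upon a time there was a [MASK]"
inputs = tokenizer("Der var engang en [MASK]", return_tensors="pt")

# Locate the [MASK] token in the encoded input
is_mask = inputs["input_ids"][0] == tokenizer.mask_token_id
mask_positions = is_mask.nonzero(as_tuple=True)[0]

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, sequence, vocab)

# Take the 5 highest-scoring vocabulary entries at the first mask position
top_ids = logits[0, mask_positions[0]].topk(5).indices.tolist()
predictions = tokenizer.convert_ids_to_tokens(top_ids)
print(predictions)
```

This does roughly what the `fill-mask` pipeline does internally, but leaves the raw logits available for custom scoring.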

# Model training

The model was trained on multiple Danish datasets with a context length of 512 tokens.

The model weights were initialized from the English [bert-base-uncased model](https://huggingface.co/bert-base-uncased), with new word token embeddings created for Danish using [WECHSEL](https://github.com/CPJKU/wechsel).

Training ran in two stages: first, only the word token embeddings were trained on 1,000,000 samples; then the whole model was trained for 8 epochs.
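
The two-stage schedule can be illustrated by toggling which parameters receive gradients. This is only a sketch under stated assumptions: the card does not say how freezing was implemented, `set_stage` is a hypothetical helper, and a tiny randomly initialized `BertConfig` stands in for the real model so the example runs without downloading weights. The WECHSEL embedding initialization is not shown.

```python
from transformers import BertConfig, BertForMaskedLM

# Tiny random stand-in for the real model (assumption, keeps the sketch offline)
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
    max_position_embeddings=64,
)
model = BertForMaskedLM(config)


def set_stage(model, embeddings_only):
    """Hypothetical helper.

    embeddings_only=True:  only the word token embeddings are trainable.
    embeddings_only=False: every parameter is trainable.
    """
    for name, param in model.named_parameters():
        is_word_embedding = name.endswith("embeddings.word_embeddings.weight")
        param.requires_grad = is_word_embedding or not embeddings_only


# Stage 1: freeze everything except the word token embeddings
set_stage(model, embeddings_only=True)
stage1 = [n for n, p in model.named_parameters() if p.requires_grad]

# Stage 2: unfreeze the whole model
set_stage(model, embeddings_only=False)
stage2 = [n for n, p in model.named_parameters() if p.requires_grad]

print(stage1)
print(len(stage2), "trainable tensors in stage 2")
```

An optimizer created for each stage should be given only the parameters with `requires_grad=True`, so that frozen weights are left untouched in the first stage.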

# Evaluation

The performance of the pretrained model was evaluated using [ScandEval](https://github.com/ScandEval/ScandEval).

| Task                     | Dataset      | Score (±SE)                      |
|:-------------------------|:-------------|:---------------------------------|
| sentiment-classification | swerec       | mcc = 63.02 (±2.16)              |
|                          |              | macro_f1 = 62.2 (±3.61)          |
| sentiment-classification | angry-tweets | mcc = 47.21 (±0.53)              |
|                          |              | macro_f1 = 64.21 (±0.53)         |
| sentiment-classification | norec        | mcc = 42.23 (±8.69)              |
|                          |              | macro_f1 = 57.24 (±7.67)         |
| named-entity-recognition | suc3         | micro_f1 = 50.03 (±4.16)         |
|                          |              | micro_f1_no_misc = 53.55 (±4.57) |
| named-entity-recognition | dane         | micro_f1 = 76.44 (±1.36)         |
|                          |              | micro_f1_no_misc = 80.61 (±1.11) |
| named-entity-recognition | norne-nb     | micro_f1 = 68.38 (±1.72)         |
|                          |              | micro_f1_no_misc = 73.08 (±1.66) |
| named-entity-recognition | norne-nn     | micro_f1 = 60.45 (±1.71)         |
|                          |              | micro_f1_no_misc = 64.39 (±1.8)  |
| linguistic-acceptability | scala-sv     | mcc = 5.01 (±5.41)               |
|                          |              | macro_f1 = 49.46 (±3.67)         |
| linguistic-acceptability | scala-da     | mcc = 54.74 (±12.22)             |
|                          |              | macro_f1 = 76.25 (±6.09)         |
| linguistic-acceptability | scala-nb     | mcc = 19.18 (±14.01)             |
|                          |              | macro_f1 = 55.3 (±8.85)          |
| linguistic-acceptability | scala-nn     | mcc = 5.72 (±5.91)               |
|                          |              | macro_f1 = 49.56 (±3.73)         |
| question-answering       | scandiqa-da  | em = 26.36 (±1.17)               |
|                          |              | f1 = 32.41 (±1.1)                |
| question-answering       | scandiqa-no  | em = 26.14 (±1.59)               |
|                          |              | f1 = 32.02 (±1.59)               |
| question-answering       | scandiqa-sv  | em = 26.38 (±1.1)                |
|                          |              | f1 = 32.33 (±1.05)               |
| speed                    | speed        | speed = 4.55 (±0.0)              |
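
To make the "Score (±SE)" column concrete, the sketch below shows how a mean and a standard error of the mean could be computed from repeated evaluation runs. It takes the column header at face value and assumes SE means the sample standard deviation divided by the square root of the number of runs; ScandEval's own aggregation may differ. The scores in the example are made-up placeholders, not results for this model.

```python
import math
import statistics


def mean_and_se(scores):
    """Return the mean and the standard error of the mean of a list of scores."""
    mean = statistics.mean(scores)
    # Sample standard deviation (n - 1 in the denominator) over sqrt(n)
    se = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, se


# Hypothetical per-run scores, for illustration only
example_scores = [75.0, 77.0, 76.0, 78.0, 74.0]

mean, se = mean_and_se(example_scores)
print(f"micro_f1 = {mean:.2f} (±{se:.2f})")  # prints: micro_f1 = 76.00 (±0.71)
```

A large ± value relative to the score, as in some of the linguistic-acceptability rows, indicates that results varied considerably between runs.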
|