metadata

license: mit
datasets:
  - oscar
  - DDSC/dagw_reddit_filtered_v1.0.0
  - graelo/wikipedia
language:
  - da
widget:
  - text: Der var engang en [MASK]

What is this?

A pre-trained BERT model (base version, ~110M parameters) for Danish NLP. The model was not pre-trained from scratch; it was adapted from the English bert-base-uncased model, using a new tokenizer trained on Danish text.

How to use

Test the model using the pipeline from the 🤗 Transformers library:

from transformers import pipeline

pipe = pipeline("fill-mask", model="KennethTM/bert-base-uncased-danish")

pipe("Der var engang en [MASK]")
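For a single `[MASK]` in a single input string, the fill-mask pipeline returns a list of dictionaries with the keys `score`, `token`, `token_str` and `sequence`. The helper below is a minimal sketch for printing the top candidates in a readable form; the name `format_predictions` and its `top_n` parameter are illustrative and not part of the Transformers library.

```python
def format_predictions(results, top_n=3):
    """Render fill-mask results as a ranked list of candidate tokens.

    `results` is assumed to be the list of dicts returned by a fill-mask
    pipeline for one input string containing one [MASK] token.
    """
    lines = []
    for rank, item in enumerate(results[:top_n], start=1):
        # token_str is the predicted token, score its probability.
        lines.append(f"{rank}. {item['token_str']} (score={item['score']:.3f})")
    return "\n".join(lines)


# Usage with the pipeline created above (downloads the model on first run):
# print(format_predictions(pipe("Der var engang en [MASK]")))
```

The pipeline returns five candidates by default; pass `top_k` when calling it to request more or fewer.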

Or load it using the Auto* classes:

# Load model directly
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("KennethTM/bert-base-uncased-danish")
model = AutoModelForMaskedLM.from_pretrained("KennethTM/bert-base-uncased-danish")

Model training

The model is trained using multiple Danish datasets and a context length of 512 tokens.

The model weights are initialized from the English bert-base-uncased model with new word token embeddings created for Danish using WECHSEL.
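The idea behind this kind of embedding transfer can be sketched as follows: each new Danish token gets an embedding that is a similarity-weighted average of the English model's token embeddings, where similarity is measured in an auxiliary, cross-lingually aligned static embedding space. This is a simplified illustration in plain Python, not the WECHSEL library's API. The function names and the values of `k` and `temperature` are assumptions chosen for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def init_target_embedding(target_aux, source_aux, source_emb, k=10, temperature=0.1):
    """Build one target-token embedding from source-token embeddings.

    target_aux: aligned static vector of the new (Danish) token.
    source_aux: aligned static vectors of the source (English) tokens.
    source_emb: the source model's embeddings, in the same order as source_aux.
    """
    sims = [cosine(target_aux, s) for s in source_aux]
    # Keep only the k most similar source tokens.
    nearest = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
    # Temperature-scaled softmax over the similarities (shifted for stability).
    top = max(sims[i] for i in nearest)
    weights = [math.exp((sims[i] - top) / temperature) for i in nearest]
    total = sum(weights)
    dim = len(source_emb[0])
    # Weighted average of the selected source embeddings.
    return [
        sum(w / total * source_emb[i][d] for w, i in zip(weights, nearest))
        for d in range(dim)
    ]
```

Starting from embeddings built this way, rather than from random vectors, lets the remaining transformer layers of the English model be reused largely as they are.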

Training proceeds in two stages. First, only the word token embeddings are trained, using 1,000,000 samples. Then the whole model is trained for 8 epochs.
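A two-stage schedule like this is commonly implemented by toggling `requires_grad` on the model parameters. The sketch below is one way to do it, not the training script used for this model. It works with any object exposing PyTorch's `named_parameters()`, and it assumes the Hugging Face BERT naming in which the word embedding matrix is called `...embeddings.word_embeddings.weight`. The function name `set_training_stage` is hypothetical.

```python
def set_training_stage(model, stage):
    """Toggle requires_grad for a two-stage training schedule.

    stage="embeddings": only the word token embeddings are trainable.
    stage="full": every parameter is trainable.
    Returns the names of the parameters left trainable.
    """
    if stage not in ("embeddings", "full"):
        raise ValueError(f"unknown stage: {stage!r}")
    trainable = []
    for name, param in model.named_parameters():
        # Assumption: the word embedding matrix is identified by this
        # substring, as in the Hugging Face BERT implementation.
        is_word_embedding = "embeddings.word_embeddings" in name
        param.requires_grad = (stage == "full") or is_word_embedding
        if param.requires_grad:
            trainable.append(name)
    return trainable


# Usage with the model loaded via AutoModelForMaskedLM:
# set_training_stage(model, "embeddings")  # stage 1: word embeddings only
# set_training_stage(model, "full")        # stage 2: whole model
```

Freezing the transformer layers in the first stage gives the newly initialized embeddings time to adapt before the rest of the network is updated.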

Evaluation

The performance of the pre-trained model was evaluated using ScandEval.

Task	Dataset	Score (±SE)
sentiment-classification	swerec	mcc = 63.02 (±2.16)
		macro_f1 = 62.2 (±3.61)
sentiment-classification	angry-tweets	mcc = 47.21 (±0.53)
		macro_f1 = 64.21 (±0.53)
sentiment-classification	norec	mcc = 42.23 (±8.69)
		macro_f1 = 57.24 (±7.67)
named-entity-recognition	suc3	micro_f1 = 50.03 (±4.16)
		micro_f1_no_misc = 53.55 (±4.57)
named-entity-recognition	dane	micro_f1 = 76.44 (±1.36)
		micro_f1_no_misc = 80.61 (±1.11)
named-entity-recognition	norne-nb	micro_f1 = 68.38 (±1.72)
		micro_f1_no_misc = 73.08 (±1.66)
named-entity-recognition	norne-nn	micro_f1 = 60.45 (±1.71)
		micro_f1_no_misc = 64.39 (±1.8)
linguistic-acceptability	scala-sv	mcc = 5.01 (±5.41)
		macro_f1 = 49.46 (±3.67)
linguistic-acceptability	scala-da	mcc = 54.74 (±12.22)
		macro_f1 = 76.25 (±6.09)
linguistic-acceptability	scala-nb	mcc = 19.18 (±14.01)
		macro_f1 = 55.3 (±8.85)
linguistic-acceptability	scala-nn	mcc = 5.72 (±5.91)
		macro_f1 = 49.56 (±3.73)
question-answering	scandiqa-da	em = 26.36 (±1.17)
		f1 = 32.41 (±1.1)
question-answering	scandiqa-no	em = 26.14 (±1.59)
		f1 = 32.02 (±1.59)
question-answering	scandiqa-sv	em = 26.38 (±1.1)
		f1 = 32.33 (±1.05)
speed	speed	speed = 4.55 (±0.0)