language: nl
thumbnail: https://github.com/iPieter/RobBERT/raw/master/res/robbert_logo.png
tags:
- Dutch
- Flemish
- RoBERTa
- RobBERT
license: mit
datasets:
- oscar
- dbrd
- lassy-ud
- europarl-mono
- conll2002
widget:
- text: Hallo, ik ben RobBERT-2022, het nieuwe <mask> taalmodel van de KU Leuven.
RobBERT-2022: Updating a Dutch Language Model to Account for Evolving Language Use.
RobBERT-2022 is the newest release of the Dutch RobBERT model. Since the original release in January 2020, a lot has happened and our language has evolved. For instance, the COVID-19 pandemic introduced a wide range of new words that were suddenly used daily. To account for this and other changes in language use, we release a new Dutch BERT model trained on data from 2022: RobBERT 2022. More in-depth information about RobBERT-2022 can be found in our blog post, our paper, and the original RobBERT GitHub repository.
How to use
RobBERT-2022 and RobBERT both use the RoBERTa architecture and pre-training regime, but with a Dutch tokenizer and Dutch training data. RoBERTa is the robustly optimized English BERT model, making it even more powerful than the original BERT model. Given this shared architecture, RobBERT can easily be finetuned and used for inference with code written for RoBERTa models, as well as with most code written for BERT models, e.g. as provided by the HuggingFace Transformers library.
By default, RobBERT-2022 has the masked language model head used during training. This head can be used as a zero-shot way to fill masks in sentences, and it can be tested out for free on RobBERT's Hosted Inference API on Hugging Face; a minimal mask-filling sketch is shown directly below. You can also create a new prediction head for your own task by using any of HuggingFace's RoBERTa runners or their fine-tuning notebooks, changing the model name to DTAI-KULeuven/robbert-2022-dutch-base, as in the classification snippet further down.
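For example, a minimal zero-shot mask-filling sketch (assuming the transformers package and a PyTorch backend are installed; the example sentence is the one from this card's widget):

```python
from transformers import pipeline

# Zero-shot mask filling with RobBERT-2022's pretrained masked-language-model head.
fill_mask = pipeline("fill-mask", model="DTAI-KULeuven/robbert-2022-dutch-base")

# <mask> is the RoBERTa-style mask token expected by this tokenizer.
predictions = fill_mask("Hallo, ik ben RobBERT-2022, het nieuwe <mask> taalmodel van de KU Leuven.")
for prediction in predictions:
    print(prediction["token_str"], round(prediction["score"], 3))
```

The snippet below instead loads the model with a sequence-classification head, ready for fine-tuning: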
```python
from transformers import RobertaTokenizer, RobertaForSequenceClassification

tokenizer = RobertaTokenizer.from_pretrained("DTAI-KULeuven/robbert-2022-dutch-base")
model = RobertaForSequenceClassification.from_pretrained("DTAI-KULeuven/robbert-2022-dutch-base")
```
You can then use most of HuggingFace's BERT-based notebooks for finetuning RobBERT-2022 on your own Dutch dataset; a minimal Trainer-based sketch is shown below.
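As an illustration of what such a notebook boils down to, here is a minimal Trainer-based fine-tuning sketch; the two Dutch example sentences, their labels, and the output directory are hypothetical placeholders, not part of the original card or paper:

```python
import torch
from transformers import (
    RobertaForSequenceClassification,
    RobertaTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = RobertaTokenizer.from_pretrained("DTAI-KULeuven/robbert-2022-dutch-base")
model = RobertaForSequenceClassification.from_pretrained(
    "DTAI-KULeuven/robbert-2022-dutch-base", num_labels=2
)

# Toy stand-in for a real Dutch dataset (e.g. book reviews with positive/negative labels).
texts = ["Wat een prachtig boek, ik heb ervan genoten!", "Dit was helaas een grote teleurstelling."]
labels = [1, 0]
encodings = tokenizer(texts, truncation=True, padding=True)

class ToyDataset(torch.utils.data.Dataset):
    """Wraps the tokenized texts and labels so the Trainer can iterate over them."""
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, labels
    def __getitem__(self, idx):
        item = {key: torch.tensor(values[idx]) for key, values in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item
    def __len__(self):
        return len(self.labels)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="robbert-2022-finetuned", num_train_epochs=1),
    train_dataset=ToyDataset(encodings, labels),
)
trainer.train()
```

In practice, you would swap the toy lists for a full Dutch dataset and add an evaluation set, exactly as the HuggingFace notebooks do.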
Technical Details From The Paper
Our Performance Evaluation Results
All experiments are described in more detail in our paper, with the code in our GitHub repository.
Sentiment analysis
Predicting whether a review is positive or negative using the Dutch Book Reviews Dataset.
Model | Accuracy [%] |
---|---|
ULMFiT | 93.8 |
BERTje | 93.0 |
RobBERT v2 | 95.1 |
Die/Dat (coreference resolution)
We measured how well the models are able to do coreference resolution by predicting whether "die" or "dat" should be filled into a sentence. For this, we used the EuroParl corpus.
Finetuning on the whole dataset
Model | Accuracy [%] | F1 [%] |
---|---|---|
Baseline (LSTM) | 75.03 | |
mBERT | 98.285 | 98.033 |
BERTje | 98.268 | 98.014 |
RobBERT v2 | 99.232 | 99.121 |
Finetuning on 10K examples
We also measured the performance using only 10K training examples. This experiment clearly illustrates that RobBERT outperforms other models when there is little data available.
Model | Accuracy [%] | F1 [%] |
---|---|---|
mBERT | 92.157 | 90.898 |
BERTje | 93.096 | 91.279 |
RobBERT v2 | 97.816 | 97.514 |
Using the zero-shot word masking task
Since BERT models are pre-trained using the word masking task, we can use this directly to predict whether "die" or "dat" is more likely (a sketch of this setup follows the table below). This experiment shows that RobBERT has internalised more information about Dutch than the other models.
Model | Accuracy [%] |
---|---|
ZeroR | 66.70 |
mBERT | 90.21 |
BERTje | 94.94 |
RobBERT v2 | 98.75 |
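To illustrate how such a zero-shot comparison can be set up (a sketch with a made-up example sentence, not the exact evaluation code from the paper), the fill-mask pipeline can be restricted to the two candidate words so their scores can be compared directly:

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="DTAI-KULeuven/robbert-2022-dutch-base")

# Hypothetical die/dat sentence; the leading space in the targets matters for
# RoBERTa-style byte-level BPE tokenizers.
sentence = "Het boek <mask> ik gisteren las, was erg spannend."
for prediction in fill_mask(sentence, targets=[" die", " dat"]):
    print(prediction["token_str"], round(prediction["score"], 3))
```

The word with the higher score is taken as the model's prediction for that sentence.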
Part-of-Speech Tagging
Using the Lassy UD dataset.
Model | Accuracy [%] |
---|---|
Frog | 91.7 |
mBERT | 96.5 |
BERTje | 96.3 |
RobBERT v2 | 96.4 |
Credits and citation
This project was created by Pieter Delobelle, Thomas Winters and Bettina Berendt. If you would like to cite our paper or model, you can use the following BibTeX:
@inproceedings{delobelle2022robbert2022,
doi = {10.48550/ARXIV.2211.08192},
url = {https://arxiv.org/abs/2211.08192},
author = {Delobelle, Pieter and Winters, Thomas and Berendt, Bettina},
keywords = {Computation and Language (cs.CL), Machine Learning (cs.LG), FOS: Computer and information sciences},
title = {RobBERT-2022: Updating a Dutch Language Model to Account for Evolving Language Use},
venue = {arXiv},
year = {2022},
}
@inproceedings{delobelle2020robbert,
title = "{R}ob{BERT}: a {D}utch {R}o{BERT}a-based {L}anguage {M}odel",
author = "Delobelle, Pieter and
Winters, Thomas and
Berendt, Bettina",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.findings-emnlp.292",
doi = "10.18653/v1/2020.findings-emnlp.292",
pages = "3255--3265"
}