thomaswint committed
Commit af646e6
1 Parent(s): bc2035f

Update README.md

Files changed (1)
  1. README.md +26 -2
README.md CHANGED
@@ -23,10 +23,16 @@ widget:

# RobBERT-2022: Updating a Dutch Language Model to Account for Evolving Language Use.

- RobBERT-2022 is the newest release of the [Dutch RobBERT model](https://pieter.ai/robbert/). Since the original release in January 2020, some things happened and our language evolved. For instance, the COVID-19 pandemic introduced a wide range of new words that were suddenly used daily. To account for this and other changes in usage, we release a new Dutch BERT model trained on data from 2022: RobBERT 2022.
+ RobBERT-2022 is the latest release of the [Dutch RobBERT model](https://pieter.ai/robbert/).
+ It was created by further pretraining the original [pdelobelle/robbert-v2-dutch-base](https://huggingface.co/pdelobelle/robbert-v2-dutch-base) model on the 2022 version of the OSCAR corpus.
+ Thanks to this more recent dataset, this [DTAI-KULeuven/robbert-2022-dutch-base](https://huggingface.co/DTAI-KULeuven/robbert-2022-dutch-base) model shows improved performance on several tasks related to recent events, e.g. COVID-19-related tasks.
+ We also found that on some tasks that do not involve information more recent than 2019, the original [pdelobelle/robbert-v2-dutch-base](https://huggingface.co/pdelobelle/robbert-v2-dutch-base) RobBERT model can still outperform this newer one.
+
+ The original RobBERT model was released in January 2020. Dutch has evolved considerably since then: the COVID-19 pandemic, for example, introduced a wide range of new words that were suddenly in daily use, and many world facts that the original model considered true have since changed. To account for these and other changes in usage, we release a new Dutch BERT model trained on data from 2022: RobBERT 2022.
More in-depth information about RobBERT-2022 can be found in our [blog post](https://pieter.ai/robbert-2022/), [our paper](https://arxiv.org/abs/2001.06286) and [the original RobBERT Github repository](https://github.com/iPieter/RobBERT).


+
## How to use

RobBERT-2022 and RobBERT both use the [RoBERTa](https://arxiv.org/abs/1907.11692) architecture and pre-training, but with a Dutch tokenizer and training data. RoBERTa is the robustly optimized English BERT model, making it even more powerful than the original BERT model. Given this same architecture, RobBERT can easily be finetuned and used for inference with the [code to finetune RoBERTa](https://huggingface.co/transformers/model_doc/roberta.html) models and most code written for BERT models, e.g. as provided by the [HuggingFace Transformers](https://huggingface.co/transformers/) library.
@@ -42,6 +48,21 @@ model = AutoModelForSequenceClassification.from_pretrained("DTAI-KULeuven/robber

You can then use most of [HuggingFace's BERT-based notebooks](https://huggingface.co/transformers/v4.1.1/notebooks.html) for finetuning RobBERT-2022 on your type of Dutch language dataset.

+
+ ## Dutch BERT models
+
+ There is a wide variety of Dutch BERT-based models available for finetuning on your tasks.
+ Here is a quick summary to find the one that suits your needs:
+
+ - [pdelobelle/robbert-v2-dutch-base](https://huggingface.co/pdelobelle/robbert-v2-dutch-base): The RobBERT model has for years been the best-performing BERT-like model for most Dutch language tasks. It is trained on a large web-crawled Dutch dataset (OSCAR) and uses the superior [RoBERTa](https://huggingface.co/docs/transformers/model_doc/roberta) architecture, which robustly optimizes the original [BERT model](https://huggingface.co/docs/transformers/model_doc/bert).
+ - [DTAI-KULeuven/robbertje-1-gb-merged](https://huggingface.co/DTAI-KULeuven/robbertje-1-gb-merged): The RobBERTje model is a distilled version of RobBERT that is about half the size and four times faster at inference. This can help deploy more scalable language models for your language task.
+ - [DTAI-KULeuven/robbert-2022-dutch-base](https://huggingface.co/DTAI-KULeuven/robbert-2022-dutch-base): RobBERT-2022 is the RobBERT model further pretrained on the OSCAR 2022 dataset. It is helpful for tasks that rely on words and/or information about more recent events.
+
+ There is also the [GroNLP/bert-base-dutch-cased](https://huggingface.co/GroNLP/bert-base-dutch-cased) "BERTje" model. This model uses the older, basic BERT architecture and is trained on a smaller corpus of clean Dutch texts.
+ Thanks to RobBERT's more recent architecture as well as its larger and more real-world-like training corpus, most researchers and practitioners seem to achieve higher performance on their language tasks with the RobBERT model.
+
+
+
## Technical Details From The Paper


@@ -56,7 +77,8 @@ Predicting whether a review is positive or negative using the [Dutch Book Review
|-------------------|--------------------------|
| ULMFiT | 93.8 |
| BERTje | 93.0 |
- | RobBERT v2 | **95.1** |
+ | RobBERT v2 | 94.4 |
+ | RobBERT 2022 | **95.1** |

### Die/Dat (coreference resolution)

@@ -71,6 +93,7 @@ For this, we used the [EuroParl corpus](https://www.statmt.org/europarl/).
| mBERT | 98.285 | 98.033 |
| BERTje | 98.268 | 98.014 |
| RobBERT v2 | **99.232** | **99.121** |
+ | RobBERT 2022 | 97.8 | |

#### Finetuning on 10K examples

@@ -106,6 +129,7 @@ Using the [Lassy UD dataset](https://universaldependencies.org/treebanks/nl_lass
| mBERT | **96.5** |
| BERTje | 96.3 |
| RobBERT v2 | 96.4 |
+ | RobBERT 2022 | 96.1 |


## Credits and citation
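As a quick reference for the "How to use" section above, here is a minimal loading sketch assuming the standard HuggingFace Transformers auto classes; the example sentence and the `num_labels` value are illustrative and not taken from the model card.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "DTAI-KULeuven/robbert-2022-dutch-base"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# The sequence-classification head is freshly initialised here; it only becomes
# useful after finetuning on a labelled Dutch dataset, e.g. with one of the
# BERT-based notebooks linked in the "How to use" section.
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

inputs = tokenizer("Dit boek was echt de moeite waard.", return_tensors="pt")  # illustrative sentence
logits = model(**inputs).logits
print(logits.shape)  # torch.Size([1, 2]) -- scores are meaningless until finetuned
```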
 
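Since RobBERT-2022 is a masked language model, a fill-mask query is a quick way to sanity-check the checkpoint before finetuning. The sketch below assumes the standard `transformers` pipeline API; the Dutch sentence is only an illustration.

```python
from transformers import pipeline

# Masked-word prediction with the further-pretrained 2022 checkpoint.
fill_mask = pipeline("fill-mask", model="DTAI-KULeuven/robbert-2022-dutch-base")

# Use the tokenizer's own mask token rather than hard-coding it.
sentence = f"Er staat een {fill_mask.tokenizer.mask_token} in mijn tuin."
for prediction in fill_mask(sentence, top_k=3):
    print(prediction["token_str"], round(prediction["score"], 3))
```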