wietsedv committed on
Commit
539f321
1 Parent(s): 629c4ac

Copy from wietsedv/

Files changed (7)
  1. README.md +76 -0
  2. config.json +21 -0
  3. pytorch_model.bin +3 -0
  4. special_tokens_map.json +1 -0
  5. tf_model.h5 +3 -0
  6. tokenizer_config.json +11 -0
  7. vocab.txt +0 -0
README.md ADDED
@@ -0,0 +1,76 @@
+ ---
+ language: nl
+ thumbnail: "https://raw.githubusercontent.com/wietsedv/bertje/master/bertje.png"
+ tags:
+ - BERTje
+ - BERT
+ - Dutch
+ ---
+
+ # BERTje: A Dutch BERT model
+
+ ## Model description
+
+ BERTje is a Dutch pre-trained BERT model developed at the University of Groningen.
+
+ <img src="https://raw.githubusercontent.com/wietsedv/bertje/master/bertje.png" height="250">
+
+ For details, check out our paper on [arXiv](https://arxiv.org/abs/1912.09582) and the code on [GitHub](https://github.com/wietsedv/bertje).
+
+ The paper and GitHub page mention fine-tuned models that are available [here](https://huggingface.co/wietsedv).
+
+ ## How to use
+
+ ```python
+ from transformers import AutoTokenizer, AutoModel, TFAutoModel
+
+ tokenizer = AutoTokenizer.from_pretrained("GroNLP/bert-base-dutch-cased")
+ model = AutoModel.from_pretrained("GroNLP/bert-base-dutch-cased")  # PyTorch
+ model = TFAutoModel.from_pretrained("GroNLP/bert-base-dutch-cased")  # TensorFlow
+ ```
+
+ ## Benchmarks
+
+ The arXiv paper lists benchmarks. Here are a couple of comparisons between BERTje, multilingual BERT, BERT-NL and RobBERT that were done after writing the paper. Unlike some other comparisons, the fine-tuning procedures for these benchmarks are identical for each pre-trained model. You may be able to achieve higher scores for individual models by optimizing fine-tuning procedures.
+
+ More experimental results will be added to this page as they are finished. Technical details about how we fine-tuned these models will be published later, as well as downloadable fine-tuned checkpoints.
+
+ All of the tested models are *base* sized (12 layers) with cased tokenization.
+
+ Headers in the tables below link to the original data sources. Scores link to the model pages that correspond to each specific fine-tuned model. These tables will be updated as more fine-tuned models are made available.
+
+ ### Named Entity Recognition
+
+ | Model | [CoNLL-2002](https://www.clips.uantwerpen.be/conll2002/ner/) | [SoNaR-1](https://ivdnt.org/downloads/taalmaterialen/tstc-sonar-corpus) | spaCy UD LassySmall |
+ | ---------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------- |
+ | **BERTje** | [**90.24**](https://huggingface.co/wietsedv/bert-base-dutch-cased-finetuned-conll2002-ner) | [**84.93**](https://huggingface.co/wietsedv/bert-base-dutch-cased-finetuned-sonar-ner) | [86.10](https://huggingface.co/wietsedv/bert-base-dutch-cased-finetuned-udlassy-ner) |
+ | [mBERT](https://github.com/google-research/bert/blob/master/multilingual.md) | [88.61](https://huggingface.co/wietsedv/bert-base-multilingual-cased-finetuned-conll2002-ner) | [84.19](https://huggingface.co/wietsedv/bert-base-multilingual-cased-finetuned-sonar-ner) | [**86.77**](https://huggingface.co/wietsedv/bert-base-multilingual-cased-finetuned-udlassy-ner) |
+ | [BERT-NL](http://textdata.nl) | 85.05 | 80.45 | 81.62 |
+ | [RobBERT](https://github.com/iPieter/RobBERT) | 84.72 | 81.98 | 79.84 |
+
+ ### Part-of-speech tagging
+
+ | Model | [UDv2.5 LassySmall](https://universaldependencies.org/treebanks/nl_lassysmall/index.html) |
+ | ---------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------- |
+ | **BERTje** | **96.48** |
+ | [mBERT](https://github.com/google-research/bert/blob/master/multilingual.md) | 96.20 |
+ | [BERT-NL](http://textdata.nl) | 96.10 |
+ | [RobBERT](https://github.com/iPieter/RobBERT) | 95.91 |
+
+ ### BibTeX entry and citation info
+
+ ```bibtex
+ @misc{devries2019bertje,
+ 	title = {{BERTje}: {A} {Dutch} {BERT} {Model}},
+ 	shorttitle = {{BERTje}},
+ 	author = {de Vries, Wietse and van Cranenburgh, Andreas and Bisazza, Arianna and Caselli, Tommaso and van Noord, Gertjan and Nissim, Malvina},
+ 	year = {2019},
+ 	month = dec,
+ 	howpublished = {arXiv:1912.09582},
+ 	url = {http://arxiv.org/abs/1912.09582},
+ }
+ ```
config.json ADDED
@@ -0,0 +1,21 @@
+ {
+   "_name_or_path": "wietsedv/bert-base-dutch-cased",
+   "architectures": [
+     "BertForMaskedLM"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "gradient_checkpointing": false,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 3,
+   "type_vocab_size": 2,
+   "vocab_size": 30000
+ }
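The shape parameters in this config determine the checkpoint size. A rough sanity check, assuming float32 weights and the standard BERT-base parameter layout (embeddings, 12 encoder layers, pooler); the arithmetic is illustrative, not an official parameter count:

```python
# Values taken from config.json above
vocab_size, hidden, layers, intermediate = 30000, 768, 12, 3072
max_pos, type_vocab = 512, 2

# Embeddings: word + position + token-type tables, plus one LayerNorm (gamma + beta)
embed = (vocab_size + max_pos + type_vocab) * hidden + 2 * hidden

# One encoder layer: Q/K/V/output projections with biases, FFN, two LayerNorms
attn = 4 * (hidden * hidden + hidden)
ffn = (hidden * intermediate + intermediate) + (intermediate * hidden + hidden)
layer = attn + ffn + 2 * (2 * hidden)

# Pooler: one dense layer applied to the [CLS] vector
pooler = hidden * hidden + hidden

total = embed + layers * layer + pooler
print(total)            # 109081344 parameters, i.e. ~109M
print(total * 4 / 1e6)  # ~436 MB at 4 bytes per weight
```

At 4 bytes per weight this lands within a fraction of a percent of the 436,409,143-byte `pytorch_model.bin` below (the small remainder is the masked-LM head and serialization overhead).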
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e7bbada9bd1f19adb55f62096564080c4f58f037bfe7aa9084dfd7781d18249c
+ size 436409143
special_tokens_map.json ADDED
@@ -0,0 +1 @@
+ {"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
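These special tokens frame every encoded input: one `[CLS]` first, `[SEP]` after each segment, and `[PAD]` (id 3, per `config.json`) to fill out the batch. A toy sketch of that framing; `PAD_ID = 3` matches the config, but the other ids here are invented for illustration, not the real vocabulary ids:

```python
CLS_ID, SEP_ID, PAD_ID = 1, 2, 3  # PAD_ID matches pad_token_id in config.json; CLS/SEP ids illustrative

def build_input(tokens_a, tokens_b=None, max_len=12):
    """Assemble BERT-style input ids: [CLS] a [SEP] (b [SEP]) + padding, with attention mask."""
    ids = [CLS_ID] + tokens_a + [SEP_ID]
    if tokens_b is not None:
        ids += tokens_b + [SEP_ID]
    attention_mask = [1] * len(ids) + [0] * (max_len - len(ids))
    ids += [PAD_ID] * (max_len - len(ids))
    return ids, attention_mask

ids, mask = build_input([101, 102, 103], [201, 202], max_len=10)
print(ids)   # [1, 101, 102, 103, 2, 201, 202, 2, 3, 3]
print(mask)  # [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
```

In practice `tokenizer(...)` does all of this for you; the sketch only shows what the special tokens are for.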
tf_model.h5 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0a659525e7b8a92c53f9cf0d6e42ee7a15a7aabd2ba63298ab1a3b4c4105e85e
+ size 436587288
tokenizer_config.json ADDED
@@ -0,0 +1,11 @@
+ {
+   "do_lower_case": false,
+   "unk_token": "[UNK]",
+   "sep_token": "[SEP]",
+   "pad_token": "[PAD]",
+   "cls_token": "[CLS]",
+   "mask_token": "[MASK]",
+   "tokenize_chinese_chars": true,
+   "strip_accents": null,
+   "model_max_length": 512
+ }
vocab.txt ADDED
The diff for this file is too large to render. See raw diff