wietsedv committed on
Commit
539f321
1 Parent(s): 629c4ac

Copy from wietsedv/

Files changed (7)
  1. README.md +76 -0
  2. config.json +21 -0
  3. pytorch_model.bin +3 -0
  4. special_tokens_map.json +1 -0
  5. tf_model.h5 +3 -0
  6. tokenizer_config.json +11 -0
  7. vocab.txt +0 -0
README.md ADDED
@@ -0,0 +1,76 @@
+ ---
+ language: nl
+ thumbnail: "https://raw.githubusercontent.com/wietsedv/bertje/master/bertje.png"
+ tags:
+ - BERTje
+ - BERT
+ - Dutch
+ ---
+
+ # BERTje: A Dutch BERT model
+
+ ## Model description
+
+ BERTje is a Dutch pre-trained BERT model developed at the University of Groningen.
+
+ <img src="https://raw.githubusercontent.com/wietsedv/bertje/master/bertje.png" height="250">
+
+ For details, check out our paper on [arXiv](https://arxiv.org/abs/1912.09582) and the code on [GitHub](https://github.com/wietsedv/bertje).
+
+ The paper and GitHub page mention fine-tuned models that are available [here](https://huggingface.co/wietsedv).
+
+ ## How to use
+
+ ```python
+ from transformers import AutoTokenizer, AutoModel, TFAutoModel
+
+ tokenizer = AutoTokenizer.from_pretrained("GroNLP/bert-base-dutch-cased")
+ model = AutoModel.from_pretrained("GroNLP/bert-base-dutch-cased")  # PyTorch
+ model = TFAutoModel.from_pretrained("GroNLP/bert-base-dutch-cased")  # TensorFlow
+ ```
+
+ ## Benchmarks
+
+ The arXiv paper lists benchmarks. Here are a couple of comparisons between BERTje, multilingual BERT, BERT-NL and RobBERT that were done after writing the paper. Unlike some other comparisons, the fine-tuning procedures for these benchmarks are identical for each pre-trained model. You may be able to achieve higher scores for individual models by optimizing fine-tuning procedures.
+
+ More experimental results will be added to this page as they are finished. Technical details about how we fine-tuned these models will be published later, as well as downloadable fine-tuned checkpoints.
+
+ All of the tested models are *base* sized (12 layers) with cased tokenization.
+
+ Headers in the tables below link to the original data sources. Scores link to the model pages that correspond to each specific fine-tuned model. These tables will be updated as more fine-tuned models are made available.
+
+ ### Named Entity Recognition
+
+ | Model | [CoNLL-2002](https://www.clips.uantwerpen.be/conll2002/ner/) | [SoNaR-1](https://ivdnt.org/downloads/taalmaterialen/tstc-sonar-corpus) | spaCy UD LassySmall |
+ | ---------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------- |
+ | **BERTje** | [**90.24**](https://huggingface.co/wietsedv/bert-base-dutch-cased-finetuned-conll2002-ner) | [**84.93**](https://huggingface.co/wietsedv/bert-base-dutch-cased-finetuned-sonar-ner) | [86.10](https://huggingface.co/wietsedv/bert-base-dutch-cased-finetuned-udlassy-ner) |
+ | [mBERT](https://github.com/google-research/bert/blob/master/multilingual.md) | [88.61](https://huggingface.co/wietsedv/bert-base-multilingual-cased-finetuned-conll2002-ner) | [84.19](https://huggingface.co/wietsedv/bert-base-multilingual-cased-finetuned-sonar-ner) | [**86.77**](https://huggingface.co/wietsedv/bert-base-multilingual-cased-finetuned-udlassy-ner) |
+ | [BERT-NL](http://textdata.nl) | 85.05 | 80.45 | 81.62 |
+ | [RobBERT](https://github.com/iPieter/RobBERT) | 84.72 | 81.98 | 79.84 |
+
+ ### Part-of-speech tagging
+
+ | Model | [UDv2.5 LassySmall](https://universaldependencies.org/treebanks/nl_lassysmall/index.html) |
+ | ---------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------- |
+ | **BERTje** | **96.48** |
+ | [mBERT](https://github.com/google-research/bert/blob/master/multilingual.md) | 96.20 |
+ | [BERT-NL](http://textdata.nl) | 96.10 |
+ | [RobBERT](https://github.com/iPieter/RobBERT) | 95.91 |
+
+ ### BibTeX entry and citation info
+
+ ```bibtex
+ @misc{devries2019bertje,
+ 	title = {{BERTje}: {A} {Dutch} {BERT} {Model}},
+ 	shorttitle = {{BERTje}},
+ 	author = {de Vries, Wietse and van Cranenburgh, Andreas and Bisazza, Arianna and Caselli, Tommaso and van Noord, Gertjan and Nissim, Malvina},
+ 	year = {2019},
+ 	month = dec,
+ 	howpublished = {arXiv:1912.09582},
+ 	url = {http://arxiv.org/abs/1912.09582},
+ }
+ ```
config.json ADDED
@@ -0,0 +1,21 @@
+ {
+   "_name_or_path": "wietsedv/bert-base-dutch-cased",
+   "architectures": [
+     "BertForMaskedLM"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "gradient_checkpointing": false,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 3,
+   "type_vocab_size": 2,
+   "vocab_size": 30000
+ }
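The shape parameters in this config determine the checkpoint size. A rough sanity check, assuming float32 weights and the standard BERT-base parameter layout (embeddings, 12 encoder layers, pooler); the arithmetic is illustrative, not an official parameter count:

```python
# Values taken from config.json above
vocab_size, hidden, layers, intermediate = 30000, 768, 12, 3072
max_pos, type_vocab = 512, 2

# Embeddings: word + position + token-type tables, plus one LayerNorm (gamma + beta)
embed = (vocab_size + max_pos + type_vocab) * hidden + 2 * hidden

# One encoder layer: Q/K/V/output projections with biases, FFN, two LayerNorms
attn = 4 * (hidden * hidden + hidden)
ffn = (hidden * intermediate + intermediate) + (intermediate * hidden + hidden)
layer = attn + ffn + 2 * (2 * hidden)

# Pooler: one dense layer applied to the [CLS] vector
pooler = hidden * hidden + hidden

total = embed + layers * layer + pooler
print(total)            # 109081344 parameters, i.e. ~109M
print(total * 4 / 1e6)  # ~436 MB at 4 bytes per weight
```

At 4 bytes per weight this lands within a fraction of a percent of the 436,409,143-byte `pytorch_model.bin` below (the small remainder is the masked-LM head and serialization overhead).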
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e7bbada9bd1f19adb55f62096564080c4f58f037bfe7aa9084dfd7781d18249c
+ size 436409143
special_tokens_map.json ADDED
@@ -0,0 +1 @@
+ {"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
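These special tokens frame every encoded input: one `[CLS]` first, `[SEP]` after each segment, and `[PAD]` (id 3, per `config.json`) to fill out the batch. A toy sketch of that framing; `PAD_ID = 3` matches the config, but the other ids here are invented for illustration, not the real vocabulary ids:

```python
CLS_ID, SEP_ID, PAD_ID = 1, 2, 3  # PAD_ID matches pad_token_id in config.json; CLS/SEP ids illustrative

def build_input(tokens_a, tokens_b=None, max_len=12):
    """Assemble BERT-style input ids: [CLS] a [SEP] (b [SEP]) + padding, with attention mask."""
    ids = [CLS_ID] + tokens_a + [SEP_ID]
    if tokens_b is not None:
        ids += tokens_b + [SEP_ID]
    attention_mask = [1] * len(ids) + [0] * (max_len - len(ids))
    ids += [PAD_ID] * (max_len - len(ids))
    return ids, attention_mask

ids, mask = build_input([101, 102, 103], [201, 202], max_len=10)
print(ids)   # [1, 101, 102, 103, 2, 201, 202, 2, 3, 3]
print(mask)  # [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
```

In practice `tokenizer(...)` does all of this for you; the sketch only shows what the special tokens are for.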
tf_model.h5 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0a659525e7b8a92c53f9cf0d6e42ee7a15a7aabd2ba63298ab1a3b4c4105e85e
+ size 436587288
tokenizer_config.json ADDED
@@ -0,0 +1,11 @@
+ {
+   "do_lower_case": false,
+   "unk_token": "[UNK]",
+   "sep_token": "[SEP]",
+   "pad_token": "[PAD]",
+   "cls_token": "[CLS]",
+   "mask_token": "[MASK]",
+   "tokenize_chinese_chars": true,
+   "strip_accents": null,
+   "model_max_length": 512
+ }
vocab.txt ADDED
The diff for this file is too large to render. See raw diff