KB committed on
Commit
26ed410
1 Parent(s): ed69dda

Upload README.md

  1. README.md +121 -0
README.md ADDED
@@ -0,0 +1,121 @@
1
+ ---
2
+ language: sv
3
+ ---
4
+
5
+ # Swedish BERT Models
6
+
7
+ The National Library of Sweden / KBLab releases three pretrained language models based on BERT and ALBERT. The models are trained on approximately 15-20 GB of text (200M sentences, 3000M tokens) from various sources (books, news, government publications, Swedish Wikipedia and internet forums), with the aim of providing a representative BERT model for Swedish text. A more complete description will be published later.
8
+
9
+ The following three models are currently available:
10
+
11
+ - **bert-base-swedish-cased** (*v1*) - A BERT trained with the same hyperparameters as first published by Google.
12
+ - **bert-base-swedish-cased-ner** (*experimental*) - A BERT fine-tuned for NER using SUC 3.0.
13
+ - **albert-base-swedish-cased-alpha** (*alpha*) - A first attempt at an ALBERT for Swedish.
14
+
15
+ All models are cased and trained with whole word masking.
16
+
17
+ ## Files
18
+
19
+ | **name** | **files** |
20
+ |---------------------------------|-----------|
21
+ | bert-base-swedish-cased | [config](https://s3.amazonaws.com/models.huggingface.co/bert/KB/bert-base-swedish-cased/config.json), [vocab](https://s3.amazonaws.com/models.huggingface.co/bert/KB/bert-base-swedish-cased/vocab.txt), [pytorch_model.bin](https://s3.amazonaws.com/models.huggingface.co/bert/KB/bert-base-swedish-cased/pytorch_model.bin) |
22
+ | bert-base-swedish-cased-ner | [config](https://s3.amazonaws.com/models.huggingface.co/bert/KB/bert-base-swedish-cased-ner/config.json), [vocab](https://s3.amazonaws.com/models.huggingface.co/bert/KB/bert-base-swedish-cased-ner/vocab.txt), [pytorch_model.bin](https://s3.amazonaws.com/models.huggingface.co/bert/KB/bert-base-swedish-cased-ner/pytorch_model.bin) |
23
+ | albert-base-swedish-cased-alpha | [config](https://s3.amazonaws.com/models.huggingface.co/bert/KB/albert-base-swedish-cased-alpha/config.json), [sentencepiece model](https://s3.amazonaws.com/models.huggingface.co/bert/KB/albert-base-swedish-cased-alpha/spiece.model), [pytorch_model.bin](https://s3.amazonaws.com/models.huggingface.co/bert/KB/albert-base-swedish-cased-alpha/pytorch_model.bin) |
24
+
25
+ TensorFlow model weights will be released soon.
26
+
27
+ ## Usage requirements / installation instructions
28
+
29
+ The examples below require Huggingface Transformers 2.4.1 and PyTorch 1.3.1 or greater. For Transformers<2.4.0 the tokenizer must be instantiated manually, with the `do_lower_case` parameter set to `False` and, for ALBERT, `keep_accents` set to `True`.
30
+
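For older Transformers releases, the manual instantiation described above might look like the following sketch. The helper name `load_cased_tokenizer` and its `albert` flag are hypothetical and not part of this repository. The sketch assumes the `BertTokenizer` and `AlbertTokenizer` classes and passes the two parameters named above.

```python
def load_cased_tokenizer(model_name, albert=False):
    """Manually build a tokenizer that keeps case (and, for ALBERT, accents).

    Hypothetical helper for illustration only; it is not part of this repository.
    """
    # Imported lazily so that defining the helper does not itself
    # require transformers to be installed.
    from transformers import AlbertTokenizer, BertTokenizer

    if albert:
        # ALBERT: disable lower-casing and keep accented characters such as å, ä, ö.
        return AlbertTokenizer.from_pretrained(
            model_name, do_lower_case=False, keep_accents=True)
    # BERT: only lower-casing needs to be disabled.
    return BertTokenizer.from_pretrained(model_name, do_lower_case=False)
```

Calling `load_cased_tokenizer('KB/bert-base-swedish-cased')` would fetch the vocabulary on first use, so it needs network access. With Transformers 2.4.1 or later, `AutoTokenizer` should make this helper unnecessary.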
31
+ To create an environment where the examples can be run, run the following commands in a terminal on your OS of choice.
32
+
33
+ ```
34
+ git clone https://github.com/Kungbib/swedish-bert-models
35
+ cd swedish-bert-models
36
+ python3 -m venv venv
37
+ source venv/bin/activate
38
+ pip install --upgrade pip
39
+ pip install -r requirements.txt
40
+ ```
41
+
42
+ ### BERT Base Swedish
43
+
44
+ A standard BERT base for Swedish trained on a variety of sources. Vocabulary size is ~50k. Using Huggingface Transformers, the model can be loaded in Python as follows:
45
+
46
+ ```python
47
+ from transformers import AutoModel, AutoTokenizer
48
+
49
+ tok = AutoTokenizer.from_pretrained('KB/bert-base-swedish-cased')
50
+ model = AutoModel.from_pretrained('KB/bert-base-swedish-cased')
51
+ ```
52
+
53
+
54
+ ### BERT base fine-tuned for Swedish NER
55
+
56
+ This model is fine-tuned on the SUC 3.0 dataset. Using the Huggingface pipeline, the model can be easily instantiated. For Transformers<2.4.1 it seems the tokenizer must be loaded separately to disable lower-casing of input strings:
57
+
58
+ ```python
59
+ from transformers import pipeline
60
+
61
+ nlp = pipeline('ner', model='KB/bert-base-swedish-cased-ner', tokenizer='KB/bert-base-swedish-cased-ner')
62
+
63
+ nlp('Idag släpper KB tre språkmodeller.')
64
+ ```
65
+
66
+ Running the Python code above should produce something like the result below. Entity types used are `TME` for time, `PRS` for personal names, `LOC` for locations, `EVN` for events and `ORG` for organisations. These labels are subject to change.
67
+
68
+ ```python
69
+ [ { 'word': 'Idag', 'score': 0.9998126029968262, 'entity': 'TME' },
70
+ { 'word': 'KB', 'score': 0.9814832210540771, 'entity': 'ORG' } ]
71
+ ```
72
+
73
+ The BERT tokenizer often splits words into multiple tokens, with each continuation piece prefixed by `##`. For example, the string `Engelbert kör Volvo till Herrängens fotbollsklubb` gets tokenized as `Engel ##bert kör Volvo till Herr ##ängens fotbolls ##klubb`. To glue the parts back together, one can use something like this:
74
+
75
+ ```python
76
+ text = ('Engelbert tar Volvon till Tele2 Arena för att titta på Djurgården IF '
77
+         'som spelar fotboll i VM klockan två på kvällen.')
78
+
79
+ merged = []  # one entry per whole word
80
+ for token in nlp(text):
81
+     if token['word'].startswith('##') and merged:
82
+         merged[-1]['word'] += token['word'][2:]  # glue the piece onto the previous word
83
+     else:
84
+         merged.append(token)
85
+
86
+ print(merged)
87
+ ```
88
+
89
+ This should print something like the following (though less cleanly formatted):
90
+
91
+ ```python
92
+ [ { 'word': 'Engelbert', 'score': 0.99..., 'entity': 'PRS'},
93
+ { 'word': 'Volvon', 'score': 0.99..., 'entity': 'OBJ'},
94
+ { 'word': 'Tele2', 'score': 0.99..., 'entity': 'LOC'},
95
+ { 'word': 'Arena', 'score': 0.99..., 'entity': 'LOC'},
96
+ { 'word': 'Djurgården', 'score': 0.99..., 'entity': 'ORG'},
97
+ { 'word': 'IF', 'score': 0.99..., 'entity': 'ORG'},
98
+ { 'word': 'VM', 'score': 0.99..., 'entity': 'EVN'},
99
+ { 'word': 'klockan', 'score': 0.99..., 'entity': 'TME'},
100
+ { 'word': 'två', 'score': 0.99..., 'entity': 'TME'},
101
+ { 'word': 'på', 'score': 0.99..., 'entity': 'TME'},
102
+ { 'word': 'kvällen', 'score': 0.54..., 'entity': 'TME'} ]
103
+ ```
104
+
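As the output above shows, multi-word names such as `Tele2 Arena` still come back as separate entries. The sketch below wraps the gluing step in a function and then joins neighbouring words that share a label. Both function names are hypothetical, and the sample input is hand-written in the shape shown above with made-up scores. Because the pipeline output omits untagged tokens, neighbours in the list are not guaranteed to be neighbours in the text, so the grouping is only a heuristic.

```python
def merge_wordpieces(tokens):
    """Return a new list where '##' continuation pieces are glued to the previous word."""
    merged = []
    for token in tokens:
        if token['word'].startswith('##') and merged:
            merged[-1]['word'] += token['word'][2:]
        else:
            merged.append(dict(token))  # copy so the input is left untouched
    return merged


def group_entities(words):
    """Join runs of neighbouring words that share an entity label into one span."""
    spans = []
    for word in words:
        if spans and spans[-1]['entity'] == word['entity']:
            spans[-1]['word'] += ' ' + word['word']
            # Keep the lowest score in the run as a conservative estimate.
            spans[-1]['score'] = min(spans[-1]['score'], word['score'])
        else:
            spans.append(dict(word))
    return spans


# Hand-written sample, not real model output; the scores are made up.
sample = [
    {'word': 'Engel', 'score': 0.99, 'entity': 'PRS'},
    {'word': '##bert', 'score': 0.98, 'entity': 'PRS'},
    {'word': 'Tele', 'score': 0.97, 'entity': 'LOC'},
    {'word': '##2', 'score': 0.96, 'entity': 'LOC'},
    {'word': 'Arena', 'score': 0.95, 'entity': 'LOC'},
]

spans = group_entities(merge_wordpieces(sample))
print([(s['word'], s['entity']) for s in spans])
# → [('Engelbert', 'PRS'), ('Tele2 Arena', 'LOC')]
```

With real pipeline output, `merge_wordpieces(nlp(text))` would replace the hand-written sample. Two distinct entities of the same type that happen to be neighbours in the list would be merged incorrectly, so the result is worth checking against the original text.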
105
+ ### ALBERT base
106
+
107
+ The easiest way to load the ALBERT model is, again, using Huggingface Transformers:
108
+
109
+ ```python
110
+ from transformers import AutoModel, AutoTokenizer
111
+
112
+ tok = AutoTokenizer.from_pretrained('KB/albert-base-swedish-cased-alpha')
113
+ model = AutoModel.from_pretrained('KB/albert-base-swedish-cased-alpha')
114
+ ```
115
+
116
+ ## Acknowledgements ❤️
117
+
118
+ - Resources from Stockholm University, Umeå University and the Swedish Language Bank at Gothenburg University were used when fine-tuning BERT for NER.
119
+ - Model pretraining was done partly in-house at KBLab and partly (for material without active copyright) with the support of Cloud TPUs from Google's TensorFlow Research Cloud (TFRC).
120
+ - Models are hosted on S3 by Huggingface 🤗
121
+