---
language: sv
---

# Swedish BERT Models

The National Library of Sweden / KBLab releases three pretrained language models based on BERT and ALBERT. The models are trained on approximately 15-20 GB of text (200M sentences, 3000M tokens) from a variety of sources (books, news, government publications, Swedish Wikipedia and internet forums), aiming to provide a representative BERT model for Swedish text. A more complete description will be published later on.

The following three models are currently available:
- **bert-base-swedish-cased** (*v1*) - A BERT trained with the same hyperparameters as first published by Google.
- **bert-base-swedish-cased-ner** (*experimental*) - A BERT fine-tuned for NER using SUC 3.0.
- **albert-base-swedish-cased-alpha** (*alpha*) - A first attempt at an ALBERT for Swedish.

All models are cased and trained with whole word masking.
## Files

| **name**                        | **files** |
|---------------------------------|-----------|
| bert-base-swedish-cased         | [config](https://s3.amazonaws.com/models.huggingface.co/bert/KB/bert-base-swedish-cased/config.json), [vocab](https://s3.amazonaws.com/models.huggingface.co/bert/KB/bert-base-swedish-cased/vocab.txt), [pytorch_model.bin](https://s3.amazonaws.com/models.huggingface.co/bert/KB/bert-base-swedish-cased/pytorch_model.bin) |
| bert-base-swedish-cased-ner     | [config](https://s3.amazonaws.com/models.huggingface.co/bert/KB/bert-base-swedish-cased-ner/config.json), [vocab](https://s3.amazonaws.com/models.huggingface.co/bert/KB/bert-base-swedish-cased-ner/vocab.txt), [pytorch_model.bin](https://s3.amazonaws.com/models.huggingface.co/bert/KB/bert-base-swedish-cased-ner/pytorch_model.bin) |
| albert-base-swedish-cased-alpha | [config](https://s3.amazonaws.com/models.huggingface.co/bert/KB/albert-base-swedish-cased-alpha/config.json), [sentencepiece model](https://s3.amazonaws.com/models.huggingface.co/bert/KB/albert-base-swedish-cased-alpha/spiece.model), [pytorch_model.bin](https://s3.amazonaws.com/models.huggingface.co/bert/KB/albert-base-swedish-cased-alpha/pytorch_model.bin) |
TensorFlow model weights will be released soon.
## Usage requirements / installation instructions
The examples below require Huggingface Transformers 2.4.1 and PyTorch 1.3.1 or greater. For Transformers < 2.4.0 the tokenizer must be instantiated manually, with the `do_lower_case` parameter set to `False` and, for ALBERT, `keep_accents` set to `True`.
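
For older Transformers versions this might look like the sketch below (our own illustration, using the model names listed above):

```python
from transformers import BertTokenizer, AlbertTokenizer

# Swedish text is cased, so disable lower-casing; for the ALBERT
# sentencepiece tokenizer, also keep accented characters intact.
bert_tok = BertTokenizer.from_pretrained('KB/bert-base-swedish-cased',
                                         do_lower_case=False)
albert_tok = AlbertTokenizer.from_pretrained('KB/albert-base-swedish-cased-alpha',
                                             do_lower_case=False,
                                             keep_accents=True)
```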
To create an environment where the examples can be run, run the following in a terminal on your OS of choice.
```bash
git clone https://github.com/Kungbib/swedish-bert-models
cd swedish-bert-models
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
```
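
Once the environment is active, a quick sanity check (our own suggestion, not part of the repository) is to confirm the installed versions:

```python
# Confirm the installed versions meet the requirements above.
import torch
import transformers

print(transformers.__version__)  # expect 2.4.1 or greater
print(torch.__version__)         # expect 1.3.1 or greater
```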
### BERT Base Swedish
A standard BERT base for Swedish trained on a variety of sources. The vocabulary size is ~50k tokens. Using Huggingface Transformers, the model can be loaded in Python as follows:
```python
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained('KB/bert-base-swedish-cased')
model = AutoModel.from_pretrained('KB/bert-base-swedish-cased')
```
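
As a quick smoke test, a minimal sketch of running the model on a sentence (the example sentence is our own) could look like this:

```python
import torch

# Encode a sentence and extract its contextual embeddings.
input_ids = tok.encode('Kungliga biblioteket ligger i Stockholm.',
                       return_tensors='pt')
with torch.no_grad():
    last_hidden_state = model(input_ids)[0]  # shape: (1, sequence_length, 768)
```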
### BERT base fine-tuned for Swedish NER
This model is fine-tuned on the SUC 3.0 dataset. Using the Huggingface pipeline, the model can be easily instantiated. For Transformers < 2.4.1 it seems the tokenizer must be loaded separately to disable lower-casing of input strings:
```python
from transformers import pipeline

nlp = pipeline('ner', model='KB/bert-base-swedish-cased-ner', tokenizer='KB/bert-base-swedish-cased-ner')

nlp('Idag släpper KB tre språkmodeller.')
```
Running the Python code above should produce something like the result below. Entity types used are `TME` for time, `PRS` for personal names, `LOC` for locations, `EVN` for events and `ORG` for organisations. These labels are subject to change.
```python
[ { 'word': 'Idag', 'score': 0.9998126029968262, 'entity': 'TME' },
  { 'word': 'KB',   'score': 0.9814832210540771, 'entity': 'ORG' } ]
```
The BERT tokenizer often splits words into multiple tokens, with the subparts starting with `##`; for example, the string `Engelbert kör Volvo till Herrängens fotbollsklubb` gets tokenized as `Engel ##bert kör Volvo till Herr ##ängens fotbolls ##klubb`. To glue the parts back together one can use something like this:
```python
text = ('Engelbert tar Volvon till Tele2 Arena för att titta på Djurgården IF '
        'som spelar fotboll i VM klockan två på kvällen.')

# Merge '##'-continuation tokens back into the preceding token.
merged = []
for token in nlp(text):
    if token['word'].startswith('##'):
        merged[-1]['word'] += token['word'][2:]
    else:
        merged += [ token ]

print(merged)
```
This should result in the following, though less cleanly formatted:
```python
[ { 'word': 'Engelbert',     'score': 0.99..., 'entity': 'PRS'},
  { 'word': 'Volvon',        'score': 0.99..., 'entity': 'OBJ'},
  { 'word': 'Tele2',         'score': 0.99..., 'entity': 'LOC'},
  { 'word': 'Arena',         'score': 0.99..., 'entity': 'LOC'},
  { 'word': 'Djurgården',    'score': 0.99..., 'entity': 'ORG'},
  { 'word': 'IF',            'score': 0.99..., 'entity': 'ORG'},
  { 'word': 'VM',            'score': 0.99..., 'entity': 'EVN'},
  { 'word': 'klockan',       'score': 0.99..., 'entity': 'TME'},
  { 'word': 'två',           'score': 0.99..., 'entity': 'TME'},
  { 'word': 'på',            'score': 0.99..., 'entity': 'TME'},
  { 'word': 'kvällen',       'score': 0.54..., 'entity': 'TME'} ]
```
### ALBERT base
The ALBERT model can be loaded the same way, again using Huggingface Transformers:
```python
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained('KB/albert-base-swedish-cased-alpha')
model = AutoModel.from_pretrained('KB/albert-base-swedish-cased-alpha')
```
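
As a quick check that accents survive tokenization, one can tokenize the example string from the NER section above (a sketch; the exact subword split depends on the sentencepiece vocabulary):

```python
# 'Herrängens' should keep its 'ä' rather than being stripped to 'a'.
print(tok.tokenize('Engelbert kör Volvo till Herrängens fotbollsklubb'))
```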
## Acknowledgements ❤️
- Resources from Stockholm University, Umeå University and the Swedish Language Bank at Gothenburg University were used when fine-tuning BERT for NER.
- Model pretraining was done partly in-house at KBLab and partly (for material without active copyright) with the support of Cloud TPUs from Google's TensorFlow Research Cloud (TFRC).
- Models are hosted on S3 by Huggingface 🤗