Historical Swedish BERT Model

** WORK IN PROGRESS ** (The model will be updated with larger datasets soon, and new OCR output is coming to extend the dataset even further.)

A historical Swedish BERT model has been released by the Swedish National Archives (Riksarkivet) to generalise better to historical Swedish text. Researchers are well aware that the Swedish language has changed over time, which makes models built from a present-day point of view less than ideal candidates for the job. This model, by contrast, can be used to interpret and analyse historical textual material and can be fine-tuned for different downstream tasks.

Intended uses & limitations

This model should primarily be used as a base for further fine-tuning on downstream tasks.
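
As a rough illustration of what fine-tuning on a downstream task could start from, the sketch below loads the checkpoint with a fresh token-classification head. The task (named-entity tagging), the label set in EXAMPLE_LABELS, and the helper names build_label_maps and load_for_token_classification are assumptions made for this example, not part of the released model or its training code.

```python
MODEL_ID = "Riksarkivet/bert-base-cased-swe-historical"

# Hypothetical label set for illustration only; replace with your task's labels.
EXAMPLE_LABELS = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]


def build_label_maps(labels):
    """Return (id2label, label2id) dictionaries for a list of label names."""
    id2label = dict(enumerate(labels))
    label2id = {label: idx for idx, label in id2label.items()}
    return id2label, label2id


def load_for_token_classification(labels, model_id=MODEL_ID):
    """Load the tokenizer and the encoder with a new, randomly initialised
    token-classification head sized to `labels`."""
    # Imported lazily so build_label_maps works without transformers installed.
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    id2label, label2id = build_label_maps(labels)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForTokenClassification.from_pretrained(
        model_id,
        num_labels=len(labels),
        id2label=id2label,
        label2id=label2id,
    )
    return tokenizer, model
```

Calling load_for_token_classification(EXAMPLE_LABELS) downloads the checkpoint and returns a tokenizer and a model whose classification head still has to be trained on labelled data before its predictions mean anything.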

Inference for fill-mask with Hugging Face Transformers in Python:

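from transformers import pipeline

# Load a fill-mask pipeline backed by the historical Swedish BERT model.
fill_mask = pipeline("fill-mask", model="Riksarkivet/bert-base-cased-swe-historical")

# The input must contain the mask token, [MASK], at the position to predict.
historical_text = """Det vore [MASK] häller nödvändigt att bita af tungan än berättat hvad jag varit med om."""

# Print the candidate fillers for [MASK] together with their scores.
print(fill_mask(historical_text))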
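
For an input with a single [MASK], the fill-mask pipeline returns a list of dictionaries with the keys score, token, token_str and sequence. The helper below is a small sketch of how that list could be reduced to the most likely fillers; the name summarise_predictions and the default of three candidates are choices made for this example.

```python
def summarise_predictions(predictions, top_k=3):
    """Return the top_k (token, score) pairs, highest score first.

    `predictions` is expected to be the list of dicts that a fill-mask
    pipeline returns for an input containing exactly one mask token.
    """
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [(p["token_str"].strip(), round(p["score"], 4)) for p in ranked[:top_k]]
```

Passing the result of the pipeline call above to summarise_predictions gives a compact list of candidate words and their probabilities. Inputs with several mask tokens produce a nested list, which this sketch does not handle.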

Model Description

The training procedure can be recreated from here: Src_code. The preprocessing procedure can be recreated from here: Src_code.

Model: The following hyperparameters were used during training:

  • learning_rate: 3e-05
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • gradient_accumulation_steps: 0
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • num_epochs: 6
  • fp16: False
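
To make the linear scheduler setting concrete, the sketch below collects the listed values in one place and computes the learning rate at a given optimiser step under linear decay to zero. The number of warmup steps (zero here) and the total number of training steps are not stated in this card, so both are assumptions, and the names TrainingHyperparameters and linear_lr are introduced only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingHyperparameters:
    """Values copied as reported in the list above."""
    learning_rate: float = 3e-05
    train_batch_size: int = 8
    eval_batch_size: int = 8
    seed: int = 42
    gradient_accumulation_steps: int = 0
    adam_betas: tuple = (0.9, 0.999)
    adam_epsilon: float = 1e-08
    lr_scheduler_type: str = "linear"
    num_epochs: int = 6
    fp16: bool = False


def linear_lr(step, total_steps, base_lr, warmup_steps=0):
    """Learning rate at `step`: linear warmup, then linear decay to zero.

    warmup_steps=0 is an assumption; the card does not report a warmup.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


hp = TrainingHyperparameters()
total_steps = 1_000  # hypothetical; the real number of steps is not stated
for step in (0, 500, 1_000):
    print(step, linear_lr(step, total_steps, hp.learning_rate))
```

Under these assumptions the rate starts at the base value of 3e-05, is half of that midway through training, and reaches zero at the final step.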

Dataset (WIP):

  • Khubist2, which has been cleaned and chunked (will be extended further).
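
The actual cleaning and chunking steps are defined in the preprocessing source code and are not described in this card. Purely to illustrate what "cleaned and chunked" can mean in practice, the generic sketch below normalises whitespace and splits text into fixed-size word windows. The function names and the 128-word limit are assumptions for this example and are not taken from the project.

```python
import re


def clean_text(text):
    """Collapse runs of whitespace (including line breaks) into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def chunk_words(text, max_words=128):
    """Split cleaned text into consecutive chunks of at most `max_words` words."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = clean_text(text).split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]
```

Word-based windows are only a stand-in here; a pipeline that feeds a BERT model would more likely measure chunk length in tokenizer tokens so that each chunk fits the model's maximum sequence length.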

Acknowledgements

We gratefully acknowledge EuroHPC for funding this research by providing computing resources on the HPC system Vega, and Swe-Clarin for providing the datasets.

Citation Information

Eva Pettersson and Lars Borin (2022). Swedish Diachronic Corpus. In Darja Fišer & Andreas Witt (eds.), CLARIN. The Infrastructure for Language Resources. Berlin: De Gruyter. https://degruyter.com/document/doi/10.1515/9783110767377-022/html

Model size: 135M params (Safetensors; tensor types I64 and F32)