
Historical Swedish Bert Model

** WORK IN PROGRESS ** (Will be updated with bigger datasets soon + new OCR is coming to extend the dataset even further)

The Swedish National Archives has released a historical Swedish BERT model that generalises better to historical Swedish text. Researchers are well aware that the Swedish language has changed over time, which makes models built from a present-day point of view less ideal candidates for the job. This model, by contrast, can be used to interpret and analyse historical textual material and can be fine-tuned for different downstream tasks.

Intended uses & limitations

This model should primarily be used as a base for further fine-tuning on downstream tasks.

Inference for fill-mask with Hugging Face Transformers in Python:

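from transformers import pipeline

# Load the model into a fill-mask pipeline (weights are downloaded on first use).
fill_mask = pipeline("fill-mask", model="Riksarkivet/bert-base-cased-swe-historical")
historical_text = """Det vore [MASK] häller nödvändigt att bita af tungan än berättat hvad jag varit med om."""
# Print the candidate tokens for [MASK] together with their scores.
print(fill_mask(historical_text))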
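
The pipeline call above returns a list of dictionaries, one per candidate token, with the keys `score`, `token`, `token_str` and `sequence`. The sketch below shows one way to reduce that output to the few highest-scoring tokens. The helper names `top_tokens` and `predict_mask` are hypothetical and not part of the model card or of Transformers, and the input is assumed to contain exactly one `[MASK]` token.

```python
from typing import Dict, List, Tuple

MODEL_NAME = "Riksarkivet/bert-base-cased-swe-historical"


def top_tokens(predictions: List[Dict], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k highest-scoring (token, score) pairs from fill-mask output."""
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [(p["token_str"], round(p["score"], 4)) for p in ranked[:k]]


def predict_mask(text: str, k: int = 3) -> List[Tuple[str, float]]:
    """Run the model on a sentence that contains exactly one [MASK] token."""
    # Imported lazily so that top_tokens works without transformers installed.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=MODEL_NAME)
    # top_k limits how many candidates the pipeline returns.
    return top_tokens(fill_mask(text, top_k=k), k)
```

Calling `predict_mask(historical_text)` downloads the model on first use and returns a short list of `(token, score)` pairs that is easier to inspect than the raw pipeline output.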

Model Description

The training procedure can be recreated from here: Src_code. The preprocessing procedure can be recreated from here: Src_code.

Model: The following hyperparameters were used during training:

  • learning_rate: 3e-05
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • gradient_accumulation_steps: 0
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • num_epochs: 6
  • fp16: False
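
To make the `linear` scheduler setting concrete, here is a minimal sketch of how the learning rate would decay from 3e-05 to zero over training. It makes several assumptions that the card does not state: no warmup steps, a single device, no gradient accumulation, and a made-up dataset size. The function names are hypothetical, and this is not the training code linked above.

```python
import math


def total_training_steps(num_examples: int, batch_size: int = 8, epochs: int = 6) -> int:
    """Optimizer steps for the listed batch size and epoch count (single device assumed)."""
    return math.ceil(num_examples / batch_size) * epochs


def linear_lr(step: int, total_steps: int, base_lr: float = 3e-5, warmup_steps: int = 0) -> float:
    """Learning rate at a given step under linear decay to zero.

    warmup_steps=0 is an assumption; the card does not mention warmup.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr during warmup.
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Hypothetical dataset of 80 examples, used only to illustrate the arithmetic.
steps = total_training_steps(80)
for s in (0, steps // 2, steps):
    print(s, linear_lr(s, steps))
```

With the real dataset, only `num_examples` changes; the rate still starts at the listed `learning_rate` and reaches zero at the final step.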

Dataset (WIP):

  • Kubhist2, cleaned and chunked (to be extended further).

Acknowledgements

We gratefully acknowledge EuroHPC for funding this research by providing computing resources on the HPC system Vega, and Swe-Clarin for providing the datasets.

Citation Information

Eva Pettersson and Lars Borin (2022). Swedish Diachronic Corpus. In Darja Fišer & Andreas Witt (eds.), CLARIN. The Infrastructure for Language Resources. Berlin: De Gruyter. https://degruyter.com/document/doi/10.1515/9783110767377-022/html

Model size: 135M params (Safetensors; tensor types I64, F32)
