---
license: mit
datasets:
- Riksarkivet/mini_cleaned_diachronic_swe
language:
- sv
metrics:
- perplexity
pipeline_tag: fill-mask
widget:
  - text: Det vore [MASK] häller nödvändigt att bita af tungan än berättat hvad jag varit med om.

train-eval-index:
- config: Riksarkivet/mini_cleaned_diachronic_swe
  task: fill-mask
  task_id: fill-mask
  splits:
    eval_split: test
  col_mapping:
    text: text

model-index:
- name: bert-base-cased-swe-historical
  results:
  - task:
      type: fill-mask
      name: fill-mask
    dataset:
      name: Riksarkivet/mini_cleaned_diachronic_swe
      type: Riksarkivet/mini_cleaned_diachronic_swe
      split: test
    metrics:
    - type: perplexity
      value: 3.42
      name: Perplexity (WIP)
---

# Historical Swedish BERT Model

**WORK IN PROGRESS** (will be updated with bigger datasets soon; new OCR is coming to extend the dataset even further)

This historical Swedish BERT model is released by the National Swedish Archives to generalise better to historical Swedish text. Researchers are well aware that the Swedish language has changed over time, which makes models built from a present-day point of view less ideal candidates for the job.
This model, by contrast, can be used to interpret and analyse historical textual material and can be fine-tuned for different downstream tasks.


## Intended uses & limitations
This model should primarily be used as a starting point for further fine-tuning on downstream tasks.

Inference for fill-mask with Hugging Face Transformers in Python:

```python
from transformers import pipeline

# Load the model into a fill-mask pipeline
fill_mask = pipeline("fill-mask", model="Riksarkivet/bert-base-cased-swe-historical")
historical_text = """Det vore [MASK] häller nödvändigt att bita af tungan än berättat hvad jag varit med om."""
# Print the candidate tokens predicted for [MASK]
print(fill_mask(historical_text))
```
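
The fill-mask pipeline returns, for an input with a single `[MASK]`, a list of candidate dictionaries that include a `token_str` and a `score`. As a minimal sketch of how that output might be condensed for inspection, the helper below (`top_tokens` is a hypothetical name, not part of the model or the library) keeps only the highest-scoring candidates:

```python
def top_tokens(predictions, k=3):
    """Return the k highest-scoring (token, score) pairs.

    `predictions` is assumed to be the list of dicts that the fill-mask
    pipeline returns for a single-[MASK] input, where each dict carries
    a "token_str" and a "score" key. `top_tokens` is a hypothetical helper.
    """
    # Sort explicitly by score so the helper does not rely on input order
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [(p["token_str"], round(p["score"], 4)) for p in ranked[:k]]
```

Passing the pipeline's return value for `historical_text` to `top_tokens` gives a short list of candidate words with their scores, which is easier to scan than the full output.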


## Model Description
The training and preprocessing procedures can be recreated from the source code: [Src_code](https://github.com/Borg93/kbuhist2/tree/main).

**Model**:
The following hyperparameters were used during training:
- learning_rate: 3e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 0
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 6
- fp16: False
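
To illustrate what a linear learning-rate schedule with the listed `learning_rate` of 3e-05 looks like, here is a minimal sketch in plain Python. It assumes zero warmup steps and a hypothetical total of 1000 optimizer steps; neither value is stated in this model card, and the function `linear_lr` is an illustrative name rather than the training code itself.

```python
def linear_lr(step, total_steps, base_lr=3e-5, warmup_steps=0):
    """Learning rate at `step` under linear warmup followed by linear decay.

    base_lr matches the learning_rate listed above; warmup_steps=0 and the
    total step count used below are assumptions made for illustration.
    """
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr during warmup
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr to 0 over the remaining steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


TOTAL_STEPS = 1000  # hypothetical value, not taken from the training run
for step in (0, 500, 1000):
    print(step, linear_lr(step, TOTAL_STEPS))
```

Under these assumptions the rate starts at the base value, reaches half of it midway through training, and falls to zero at the final step.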

**Dataset (WIP)**:
- [Khubist2](https://huggingface.co/datasets/Riksarkivet/mini_cleaned_diachronic_swe), which has been cleaned and chunked. **(will be further extended)**

## Acknowledgements
We gratefully acknowledge [EuroHPC](https://eurohpc-ju.europa.eu) for funding this research by providing computing resources of the HPC system [Vega](https://www.izum.si),
and [Swe-Clarin](https://sweclarin.se/) for the datasets.

## Citation Information

Eva Pettersson and Lars Borin (2022). Swedish Diachronic Corpus. In Darja Fišer & Andreas Witt (eds.), *CLARIN. The Infrastructure for Language Resources*. Berlin: De Gruyter. https://degruyter.com/document/doi/10.1515/9783110767377-022/html