---
license: mit
datasets:
- Riksarkivet/mini_cleaned_diachronic_swe
language:
- sv
metrics:
- perplexity
pipeline_tag: fill-mask
widget:
- text: Det vore [MASK] häller nödvändigt att bita af tungan än berättat hvad jag varit med om.
train-eval-index:
- config: Riksarkivet/mini_cleaned_diachronic_swe
  task: fill-mask
  task_id: fill-mask
  splits:
    eval_split: test
  col_mapping:
    text: text
model-index:
- name: bert-base-cased-swe-historical
  results:
  - task:
      type: fill-mask
      name: fill-mask
    dataset:
      name: Riksarkivet/mini_cleaned_diachronic_swe
      type: Riksarkivet/mini_cleaned_diachronic_swe
      split: test
    metrics:
    - type: perplexity
      value: 3.42
      name: Perplexity (WIP)
---

# Historical Swedish BERT Model

The Swedish National Archives (Riksarkivet) releases a historical Swedish BERT model that is intended to generalise better to historical Swedish text. Researchers are well aware that the Swedish language has changed over time, which makes models trained on present-day Swedish less than ideal for this material.

This model, by contrast, can be used to interpret and analyse historical textual material, and it can be fine-tuned for different downstream tasks.

## Intended uses & limitations

This model should primarily be used as a starting point for further fine-tuning on downstream tasks.

Inference for fill-mask with Hugging Face Transformers in Python:

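```python
from transformers import pipeline

# Load the fill-mask pipeline with the historical Swedish BERT model.
fill_mask = pipeline("fill-mask", model="Riksarkivet/bert-base-cased-swe-historical")

# Historical Swedish sentence with one masked token.
historical_text = """Det vore [MASK] häller nödvändigt att bita af tungan än berättat hvad jag varit med om."""

# Print the candidate tokens for the masked position with their scores.
print(fill_mask(historical_text))
```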
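
For a single masked token, the fill-mask pipeline returns a list of dictionaries with the keys `score`, `token`, `token_str` and `sequence`. The following is a minimal sketch of how such output could be filtered and ranked. The helper `top_candidates` and its `min_score` parameter are hypothetical names introduced here, and the sample data is hand-written placeholder data, not real output from the model.

```python
from typing import Dict, List, Tuple


def top_candidates(predictions: List[Dict], min_score: float = 0.0) -> List[Tuple[str, float]]:
    """Return (token_str, score) pairs at or above min_score, highest score first."""
    pairs = [
        (p["token_str"], float(p["score"]))
        for p in predictions
        if p["score"] >= min_score
    ]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Hand-written placeholder data in the shape the pipeline returns.
# These are NOT real predictions from the model.
sample_predictions = [
    {"score": 0.10, "token": 1, "token_str": "placeholder_b", "sequence": "..."},
    {"score": 0.60, "token": 2, "token_str": "placeholder_a", "sequence": "..."},
    {"score": 0.01, "token": 3, "token_str": "placeholder_c", "sequence": "..."},
]

for token_str, score in top_candidates(sample_predictions, min_score=0.05):
    print(f"{token_str}\t{score:.2f}")
```

With a real pipeline object, the list returned by the pipeline call would take the place of `sample_predictions`.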

## Model Description

The training procedure can be reproduced from the [source code](https://github.com/Borg93/kbuhist2/tree/main).

The preprocessing procedure can be reproduced from the same [source code](https://github.com/Borg93/kbuhist2/tree/main) repository.

**Model**:

The following hyperparameters were used during training:

- learning_rate: 3e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 0
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 6
- fp16: False
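
To illustrate what a `linear` scheduler type means for the learning rate listed above, here is a minimal sketch of a linear schedule that decays from the base rate to zero. The number of warm-up steps and the total number of training steps are not stated in this card, so `warmup_steps=0` and the `total_steps` value used below are assumptions chosen only for illustration. The function name `linear_lr` is hypothetical.

```python
def linear_lr(step: int, total_steps: int, base_lr: float = 3e-05, warmup_steps: int = 0) -> float:
    """Learning rate at `step`: linear warm-up, then linear decay to zero."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if step < warmup_steps:
        # Warm-up phase: ramp up linearly from 0 to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: fall linearly from base_lr to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Assumed total of 1000 optimizer steps, purely for illustration.
for step in (0, 250, 500, 750, 1000):
    print(step, linear_lr(step, total_steps=1000))
```

Under these assumptions the rate starts at 3e-05 and reaches zero at the final step.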

**Dataset (WIP)**:

- Kubhist2, which has been cleaned and chunked (to be extended further).
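
The actual cleaning and chunking steps are defined in the linked repository and are not described in this card. As a generic illustration only, and not the authors' method, cleaning and chunking of raw text could be sketched as follows. The functions `clean_text` and `chunk_words`, the whitespace normalisation rule, and the `max_words` limit are all assumptions made for this example.

```python
import re
from typing import List


def clean_text(text: str) -> str:
    """Collapse runs of whitespace into single spaces and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


def chunk_words(text: str, max_words: int = 128) -> List[str]:
    """Split cleaned text into consecutive chunks of at most `max_words` words."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = clean_text(text).split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


# Small illustrative input with irregular whitespace.
sample = "Det  vore häller\nnödvändigt att bita af tungan."
print(chunk_words(sample, max_words=3))
```

A word-based limit is used here for simplicity; a pipeline that feeds a BERT model might instead limit chunks by tokenizer length.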

## Acknowledgements

We gratefully acknowledge [EuroHPC](https://eurohpc-ju.europa.eu) for funding this research by providing computing resources on the HPC system [Vega](https://www.izum.si), and [Swe-Clarin](https://sweclarin.se/) for the datasets.

## Citation Information

Eva Pettersson and Lars Borin (2022). Swedish Diachronic Corpus. In Darja Fišer & Andreas Witt (eds.), *CLARIN. The Infrastructure for Language Resources*. Berlin: De Gruyter. https://degruyter.com/document/doi/10.1515/9783110767377-022/html