---
license: mit
datasets:
- Riksarkivet/mini_cleaned_diachronic_swe
language:
- sv
metrics:
- perplexity
pipeline_tag: fill-mask
widget:
- text: Det vore [MASK] häller nödvändigt att bita af tungan än berättat hvad jag varit med om.
train-eval-index:
- config: Riksarkivet/mini_cleaned_diachronic_swe
  task: fill-mask
  task_id: fill-mask
  splits:
    eval_split: test
  col_mapping:
    text: text
model-index:
- name: bert-base-cased-swe-historical
  results:
  - task:
      type: fill-mask
      name: fill-mask
    dataset:
      name: Riksarkivet/mini_cleaned_diachronic_swe
      type: Riksarkivet/mini_cleaned_diachronic_swe
      split: test
    metrics:
    - type: perplexity
      value: 3.42
      name: Perplexity (WIP)
---

# Historical Swedish BERT Model

**WORK IN PROGRESS** (will be updated with bigger datasets soon; new OCR output is coming to extend the dataset even further)

A historical Swedish BERT model released by the National Swedish Archives to generalise better to historical Swedish text. Researchers are well aware that the Swedish language has changed over time, which makes models trained only on present-day text less ideal candidates for this material. This model can instead be used to interpret and analyse historical textual material, and it can be fine-tuned for different downstream tasks.

## Intended uses & limitations

This model is primarily intended as a base model to be fine-tuned further on downstream tasks.

Inference for fill-mask with Hugging Face Transformers in Python:

```python
from transformers import pipeline

# Fill-mask pipeline backed by the historical Swedish BERT model
fill_mask = pipeline("fill-mask", model="Riksarkivet/bert-base-cased-swe-historical")

historical_text = """Det vore [MASK] häller nödvändigt att bita af tungan än berättat hvad jag varit med om."""

print(fill_mask(historical_text))
```

## Model Description

The training procedure can be recreated from here: [Src_code](https://github.com/Borg93/kbuhist2/tree/main).

The preprocessing procedure can be recreated from here: [Src_code](https://github.com/Borg93/kbuhist2/tree/main).

**Model**:

The following hyperparameters were used during training (a minimal training sketch using them is included at the end of this card):

- learning_rate: 3e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 0
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 6
- fp16: False

**Dataset (WIP)**:

- [Khubist2](https://huggingface.co/datasets/Riksarkivet/mini_cleaned_diachronic_swe), which has been cleaned and chunked. **(will be further extended)**

## Acknowledgements

We gratefully acknowledge [EuroHPC](https://eurohpc-ju.europa.eu) for funding this research by providing computing resources on the HPC system [Vega](https://www.izum.si), and [SWE-clarin](https://sweclarin.se/) for the datasets.

## Citation Information

Eva Pettersson and Lars Borin (2022). Swedish Diachronic Corpus. In Darja Fišer & Andreas Witt (eds.), *CLARIN: The Infrastructure for Language Resources*. Berlin: De Gruyter. https://degruyter.com/document/doi/10.1515/9783110767377-022/html
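
## Training sketch

For reference, the hyperparameters listed under **Model Description** correspond to a masked-language-modelling setup along the lines of the sketch below, which continues MLM training from this checkpoint on the card's dataset with the Hugging Face `Trainer`. This is a minimal sketch, not the project's training script (see the linked repository for that); the `train`/`test` split names, the `text` column, the 512-token truncation and the 15% masking probability are assumptions.

```python
import math

from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

checkpoint = "Riksarkivet/bert-base-cased-swe-historical"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# Dataset named on this card; "train"/"test" splits and a "text" column are assumed.
dataset = load_dataset("Riksarkivet/mini_cleaned_diachronic_swe")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset["train"].column_names)

# Dynamic masking for the MLM objective (15% is the standard BERT value, assumed here).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

# Hyperparameters as listed on this card; the Adam betas/epsilon and the linear
# learning-rate schedule are already the TrainingArguments defaults, and the card's
# gradient_accumulation_steps of 0 means no accumulation, which is also the default.
args = TrainingArguments(
    output_dir="bert-base-cased-swe-historical-mlm",
    learning_rate=3e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=6,
    seed=42,
    fp16=False,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=collator,
)
trainer.train()

# Perplexity on the evaluation split, as reported in the model-index above.
eval_loss = trainer.evaluate()["eval_loss"]
print(f"perplexity: {math.exp(eval_loss):.2f}")
```

The reported perplexity is simply `exp(eval_loss)` on the held-out split, which is what the last two lines compute.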