Model Card for DA-Bert_Old_News_V1
DA-Bert_Old_News_V1 is the first version of a transformer trained on Danish historical texts from the period of Danish Absolutism (1660-1849). It was created by researchers at Aalborg University. The aim is to provide a domain-specific model that captures meaning from texts far enough removed in time that they no longer read like contemporary Danish.
Model Details
A BERT model pretrained on the MLM task. Training data: ENO (Enevældens Nyheder Online), a corpus of news articles, announcements and advertisements from Danish and Norwegian newspapers from the period 1762 to 1848. The model was trained on a subset consisting of about 260 million words. The data was created using a tailored Transkribus PyLaia model and has a word-level error rate of around 5%.
Model Description
- Architecture: BERT
- Pretraining Objective: Masked Language Modeling (MLM)
- Sequence Length: 512 tokens
- Tokenizer: Custom WordPiece tokenizer
- Developed by: CALDISS
- Shared by: JohanHeinsen
- Model type: BERT
- Language(s) (NLP): Danish
- License: MIT
Model Sources
- Repository: https://github.com/CALDISS-AAU/OldNewsBERT
- Paper: In progress
Uses
This model is designed for:
- Domain-specific masked token prediction
- Embedding extraction for semantic search
- Further fine-tuning
Further fine-tuning is needed to address specific use cases. Plans for retraining on more data, and for annotated data to support fine-tuning, are in the works.
The model is mainly intended for research purposes in the historical domain, although its use is not limited to history.
It can also serve as a baseline for further fine-tuning a historical BERT-based language model for Danish or other Scandinavian languages for textual or literary purposes.
Direct Use
- This model can be used out-of-the-box for domain-specific masked token prediction.
- The model can also be used for basic mean-pooled embeddings on similar data (see the sketch below). Results may vary, as the model was only trained on the MLM task using the transformers Trainer framework.
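A minimal sketch of mean-pooled embedding extraction using the Hugging Face transformers and torch libraries. The model id and the example sentences below are placeholders, not taken from the card.

```python
# Minimal sketch: mean-pooled sentence embeddings from the MLM encoder.
# The model id is a placeholder; substitute the published checkpoint.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "CALDISS-AAU/DA-Bert_Old_News_V1"  # placeholder id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
model.eval()

# Illustrative historical-style sentences (placeholders).
texts = ["Skibet ankom til Helsingøer i Gaar.", "En Tieneste-Pige søges strax."]
enc = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    out = model(**enc)

# Mean-pool token embeddings, ignoring padding via the attention mask.
mask = enc["attention_mask"].unsqueeze(-1).float()
embeddings = (out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)  # (2, hidden_size)
```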
Out-of-Scope Use
As the model is trained on the ENO dataset, it should not be used on modern Danish text; its training data is inherently historical.
Bias, Risks, and Limitations
The model is heavily limited to the historical period its training data is from. When using it for masked token prediction on modern Danish, or on other Scandinavian languages, performance will vary, and further fine-tuning is therefore needed. The training data consists of newspapers, so a bias towards this type of material, and therefore a particular manner of writing, is inherent to the model. Newspapers are defined by highly literal language, so performance will also vary on material defined by figurative language. Small biases and risks also exist in the model due to errors from the creation of the corpus: as mentioned, there is an approximate 5% word-level error rate, which carries over into the pretrained model. Further work on addressing these biases and risks is planned.
Recommendations
The model is based on historical texts that express a range of antiquated worldviews. These include racist, anti-democratic and patriarchal sentiments. This makes it utterly unfit for many use cases. It can, however, be used to examine such biases in Danish history.
How to Get Started with the Model
You don't. But if you must, use the code below to get started with the model.
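A minimal sketch using the transformers fill-mask pipeline. The model id below is a placeholder for the published checkpoint, and the example sentence is illustrative only.

```python
# Minimal sketch: domain-specific masked token prediction with the fill-mask pipeline.
# The model id is a placeholder; substitute the published checkpoint.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="CALDISS-AAU/DA-Bert_Old_News_V1")  # placeholder id

# The mask token depends on the custom WordPiece tokenizer ([MASK] for standard BERT).
print(fill_mask(f"Skibet ankom til {fill_mask.tokenizer.mask_token} i Gaar."))
```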
Training Details
Training Data
See Model Details above: a subset of the ENO corpus (Enevældens Nyheder Online) of about 260 million words.
Training Procedure
Preprocessing
Texts shorter than 35 characters were removed. Texts containing more than a predetermined amount of German, Latin or rare words were removed. Extra whitespace was also removed. A sketch of this filtering is given below.
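A minimal sketch of the kind of filtering described above. The 35-character threshold and whitespace normalization follow the description; the check for German, Latin and rare words is left as a hypothetical callback, since the actual thresholds and word lists are not given.

```python
# Sketch of the described preprocessing. foreign_or_rare_filter is a
# hypothetical placeholder for the wordlist-based German/Latin/rare-word check.
import re

def clean_corpus(texts, foreign_or_rare_filter):
    cleaned = []
    for text in texts:
        text = re.sub(r"\s+", " ", text).strip()  # collapse extra whitespace
        if len(text) < 35:                        # drop very short texts
            continue
        if foreign_or_rare_filter(text):          # drop texts with too many German/Latin/rare words
            continue
        cleaned.append(text)
    return cleaned
```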
Training Hyperparameters
- Training regime: fp16 mixed precision when CUDA is available (see the training arguments below)
- The model was trained for roughly 45 hours on the HPC system described under Compute Infrastructure.
- The MLM masking probability was set to 0.15.
Training arguments (restated as a configuration sketch below): eval_strategy="steps", overwrite_output_dir=True, num_train_epochs=15, per_device_train_batch_size=16, gradient_accumulation_steps=4, per_device_eval_batch_size=64, logging_steps=500, learning_rate=5e-5, save_steps=1000, save_total_limit=5, load_best_model_at_end=True, metric_for_best_model="eval_loss", greater_is_better=False, fp16=torch.cuda.is_available(), warmup_steps=2000, warmup_ratio=0.03, weight_decay=0.01, lr_scheduler_type="cosine", dataloader_num_workers=4, dataloader_pin_memory=True, save_on_each_node=False, ddp_find_unused_parameters=False, optim="adamw_torch", local_rank=local_rank
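For readability, the arguments above correspond to roughly the following transformers.TrainingArguments configuration. The output directory and the local_rank value are assumptions; the rest restates the values listed above.

```python
# The training arguments listed above, expressed as transformers.TrainingArguments.
# output_dir and local_rank handling are assumptions not given in the card.
import torch
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="da-bert-old-news",   # assumed
    eval_strategy="steps",
    overwrite_output_dir=True,
    num_train_epochs=15,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=4,
    per_device_eval_batch_size=64,
    logging_steps=500,
    learning_rate=5e-5,
    save_steps=1000,
    save_total_limit=5,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    fp16=torch.cuda.is_available(),
    warmup_steps=2000,
    warmup_ratio=0.03,
    weight_decay=0.01,
    lr_scheduler_type="cosine",
    dataloader_num_workers=4,
    dataloader_pin_memory=True,
    save_on_each_node=False,
    ddp_find_unused_parameters=False,
    optim="adamw_torch",
    local_rank=-1,                   # set per process in distributed training (assumed)
)
```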
Speeds, Sizes, Times
[More Information Needed]
Evaluation
Testing Data, Factors & Metrics
Testing Data
[More Information Needed]
Factors
[More Information Needed]
Metrics
- Cross-entropy loss (standard for BERT trained with MLM)
- Average loss on the test set
- Perplexity, calculated from the loss value
Results
- Loss: 2.08
- Average loss on test set: 2.07
- Perplexity: 7.65
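A minimal sketch of how a comparable evaluation could be run with the transformers Trainer. The model id, test texts and output directory are placeholders; the results above were produced with the authors' own setup.

```python
# Sketch of MLM evaluation on a held-out test set. Model id, dataset and
# output directory are placeholders, not the authors' actual configuration.
import math
import torch
from datasets import Dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "CALDISS-AAU/DA-Bert_Old_News_V1"  # placeholder id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Placeholder test texts; in practice this would be the held-out ENO split.
test_ds = Dataset.from_dict({"text": ["Kiøbenhavn, den 3die Januarii 1801."]})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

test_ds = test_ds.map(tokenize, batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="eval_out", per_device_eval_batch_size=64),
    data_collator=collator,
    eval_dataset=test_ds,
)

metrics = trainer.evaluate()
print("eval loss:", metrics["eval_loss"])
print("perplexity:", math.exp(metrics["eval_loss"]))  # perplexity = exp(avg. cross-entropy)
```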
Summary
Model Examination
[More Information Needed]
Technical Specifications
Model Architecture and Objective
BERT architecture pretrained with a masked language modeling (MLM) objective (see Model Description above).
Compute Infrastructure
UCloud infrastructure available to the Danish universities.
Hardware
- Hardware Type: 64 CPU cores (Intel Xeon Gold 6326), 256 GB memory, 4 NVIDIA A10 GPUs
- Hours used: 44 hours 34 minutes
- Cloud Provider: UCloud (SDU)
- Compute Region: Cloud services based at the University of Southern Denmark, Aarhus University and Aalborg University
Software
Python 3.12.8
Citation
BibTeX:
[More Information Needed]
APA:
[More Information Needed]
Model Card Authors
- Matias Appel (mkap@adm.aau.dk)
- Johan Heinsen (heinsen@dps.aau.dk)
Model Card Contact
CALDISS, AAU: www.caldiss.aau.dk