---
license: mit
datasets:
  - DDSC/reddit-da
  - uonlp/CulturaX
language:
  - da
---

# Model Card for the Danoliterate Baseline 7B Model

A base model with the same architecture as Llama 2 7B, trained from scratch on a combination of Danish datasets for 20K updates (655M tokens).

## Model Details

### Model Description

This model is a test artifact from the thesis *Are GLLMs Danoliterate? Benchmarking Generative NLP in Danish*, with relevant details in Sections 4.1, 5.1, and 6.1.

- **Developed by:** Søren Vejlgaard Holm, under supervision from Lars Kai Hansen and Martin Carsten Nielsen.
- **Model type:** Base autoregressive LLM with the Llama 2 7B architecture.
- **Language(s) (NLP):** Danish
- **License:** MIT

## Uses

This model is strictly a research artifact for investigating the effect of pre-training a model from scratch and is not intended to be applied directly.
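For research experiments, the model can be loaded like any Hugging Face causal LM. A minimal sketch using the `transformers` library follows; the repository id is a placeholder assumption (not stated in this card), and since this is a base model, plain continuation prompts should be used rather than chat templates:

```python
# Hedged sketch: loading and sampling from the baseline model.
# REPO_ID is an assumed placeholder, not confirmed by this model card.
from transformers import AutoModelForCausalLM, AutoTokenizer

REPO_ID = "sorenmulli/danoliterate-baseline-7b"  # assumption

def load_baseline(repo_id: str = REPO_ID):
    """Download tokenizer and weights for the base (non-instruct) model."""
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(repo_id)
    return tokenizer, model

if __name__ == "__main__":
    tokenizer, model = load_baseline()
    # Base model: give it Danish text to continue, no chat formatting.
    inputs = tokenizer("Danmark er", return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=20)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
```

Note that a 7B-parameter model requires roughly 14 GB of memory in float16; for constrained hardware, `from_pretrained` accepts options such as `torch_dtype` and `device_map` to reduce the footprint.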

## Bias, Risks, and Limitations

The model has been trained on a large corpus of uncurated internet content and can thus possibly generate problematic content.

## Training Details

### Training Data

The pretraining mix combined the Danish Gigaword and Danish Reddit corpora, as compiled by the Danish Data Science Community, with the Danish subset of CulturaX. For more details, see Section 4.1 of the thesis.

### Training Procedure

See Sections 5.1 and 6.1 of the thesis.

## Evaluation

On the Danoliterate LLM Benchmark, this model achieves an index score of 13 (as of June 2024).

## Model Card Contact

Contact Søren Vejlgaard Holm at swiho@dtu.dk or swh@alvenir.ai.