sorenmulli committed commit 91789ac (1 parent: e8fafde): Update README.md

README.md CHANGED
@@ -1 +1,52 @@
---
license: mit
datasets:
- DDSC/reddit-da
- uonlp/CulturaX
language:
- da
---

# Model Card for the Danoliterate Baseline 7B Model

A base model with the same architecture as Llama 2 7B but trained from scratch on a combination of Danish datasets for 20K updates (655M tokens).
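For reference, the sketch below spells out what "the Llama 2 7B architecture" corresponds to in Hugging Face `transformers` terms, using the standard Llama 2 7B hyperparameters; the vocabulary size is an assumption, since the tokenizer used in the thesis may differ.

```python
# A sketch of the standard Llama 2 7B architecture expressed as a
# transformers LlamaConfig; vocab_size is an assumption, since the thesis
# trains its own Danish tokenizer, whose vocabulary may differ in size.
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=32_000,            # assumption, see note above
    hidden_size=4096,
    intermediate_size=11008,
    num_hidden_layers=32,
    num_attention_heads=32,
    max_position_embeddings=4096,
)
model = LlamaForCausalLM(config)  # randomly initialised, i.e. "from scratch"
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e9:.1f}B parameters")
```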
## Model Details

### Model Description

A test model developed as part of the thesis [Are GLLMs Danoliterate? Benchmarking Generative NLP in Danish](https://sorenmulli.github.io/thesis/thesis.pdf); the relevant details are in Sections 4.1, 5.1 and 6.1.

- **Developed by:** Søren Vejlgaard Holm under supervision from Lars Kai Hansen and Martin Carsten Nielsen.
- **Model type:** Base, autoregressive LLM with the Llama 2 7B architecture.
- **Language(s) (NLP):** Danish
- **License:** MIT

## Uses

This model is strictly a research artifact for investigating the effect of pre-training a model from scratch and is not intended to be applied directly.
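For research experiments, a minimal loading sketch with `transformers` is shown below; the repository identifier is a placeholder for wherever this model is hosted, and, as a base model, it produces raw continuations rather than following instructions.

```python
# Minimal research-use sketch; "<this-repo-id>" is a placeholder for the
# Hugging Face repository hosting this model card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "<this-repo-id>"  # placeholder, replace with the actual repository id
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype=torch.float16)

# The model is a base LM: it continues text rather than follows instructions.
inputs = tokenizer("Der var engang", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```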
## Bias, Risks, and Limitations

The model has been trained on a large corpus of uncurated internet content and can thus possibly generate problematic content.

## Training Details

### Training Data

The pretraining mix contained the Danish Gigaword and Danish Reddit corpora as compiled by the Danish Data Science Community, as well as the Danish subset of CulturaX.
For more details, see Section 4.1 in [the thesis](https://sorenmulli.github.io/thesis/thesis.pdf).
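A minimal sketch of streaming two of the listed sources with the `datasets` library is shown below; the identifiers are taken from this card's metadata, while the Danish Gigaword portion and the actual mixing proportions are specified in Section 4.1 of the thesis and are not reproduced here.

```python
# Sketch of streaming two of the listed sources; the Danish Gigaword part and
# the real mixing proportions are described in Section 4.1 of the thesis, so
# the probabilities below are illustrative only. Both sources are assumed to
# expose a "text" column; CulturaX may require accepting its terms of use.
from datasets import load_dataset, interleave_datasets

reddit = load_dataset("DDSC/reddit-da", split="train", streaming=True)
culturax = load_dataset("uonlp/CulturaX", "da", split="train", streaming=True)

mix = interleave_datasets([reddit, culturax], probabilities=[0.1, 0.9], seed=0)
for example in mix.take(3):
    print(example["text"][:100])
```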
### Training Procedure

See Sections 5.1 and 6.1 in [the thesis](https://sorenmulli.github.io/thesis/thesis.pdf).

## Evaluation

On the [Danoliterate LLM Benchmark](https://danoliterate.compute.dtu.dk/), this model achieves an index score of 13 as of June 2024.

## Model Card Contact

Contact Søren Vejlgaard Holm at swiho@dtu.dk or swh@alvenir.ai.