sorenmulli committed commit 91789ac (1 parent: e8fafde): Update README.md

README.md CHANGED
@@ -1 +1,52 @@
---
license: mit
datasets:
- DDSC/reddit-da
- uonlp/CulturaX
language:
- da
---

# Model Card for the Danoliterate Baseline 7B Model

A base model with the same architecture as Llama 2 7B but trained from scratch on a combination of Danish datasets for 20K updates (655M tokens).
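For reference, the sketch below spells out what "the Llama 2 7B architecture" corresponds to in Hugging Face `transformers` terms, using the standard Llama 2 7B hyperparameters; the vocabulary size is an assumption, since the tokenizer used in the thesis may differ.

```python
# A sketch of the standard Llama 2 7B architecture expressed as a
# transformers LlamaConfig; vocab_size is an assumption, since the thesis
# trains its own Danish tokenizer, whose vocabulary may differ in size.
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=32_000,            # assumption, see note above
    hidden_size=4096,
    intermediate_size=11008,
    num_hidden_layers=32,
    num_attention_heads=32,
    max_position_embeddings=4096,
)
model = LlamaForCausalLM(config)  # randomly initialised, i.e. "from scratch"
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e9:.1f}B parameters")
```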
## Model Details

### Model Description

A test model developed as part of the thesis [Are GLLMs Danoliterate? Benchmarking Generative NLP in Danish](https://sorenmulli.github.io/thesis/thesis.pdf); the relevant details are in Sections 4.1, 5.1 and 6.1.

- **Developed by:** Søren Vejlgaard Holm under supervision from Lars Kai Hansen and Martin Carsten Nielsen.
- **Model type:** Base, autoregressive LLM with the Llama 2 7B architecture.
- **Language(s) (NLP):** Danish
- **License:** MIT

## Uses

This model is strictly a research artifact for investigating the effect of pre-training a model from scratch and is not intended to be applied directly.
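For research experiments, a minimal loading sketch with `transformers` is shown below; the repository identifier is a placeholder for wherever this model is hosted, and, as a base model, it produces raw continuations rather than following instructions.

```python
# Minimal research-use sketch; "<this-repo-id>" is a placeholder for the
# Hugging Face repository hosting this model card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "<this-repo-id>"  # placeholder, replace with the actual repository id
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype=torch.float16)

# The model is a base LM: it continues text rather than follows instructions.
inputs = tokenizer("Der var engang", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```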
## Bias, Risks, and Limitations

The model has been trained on a large corpus of uncurated internet content and can thus possibly generate problematic content.

## Training Details

### Training Data

The pretraining mix contained the Danish Gigaword and Danish Reddit corpora as compiled by the Danish Data Science Community, as well as the Danish subset of CulturaX.
For more details, see Section 4.1 in [the thesis](https://sorenmulli.github.io/thesis/thesis.pdf).
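A minimal sketch of streaming two of the listed sources with the `datasets` library is shown below; the identifiers are taken from this card's metadata, while the Danish Gigaword portion and the actual mixing proportions are specified in Section 4.1 of the thesis and are not reproduced here.

```python
# Sketch of streaming two of the listed sources; the Danish Gigaword part and
# the real mixing proportions are described in Section 4.1 of the thesis, so
# the probabilities below are illustrative only. Both sources are assumed to
# expose a "text" column; CulturaX may require accepting its terms of use.
from datasets import load_dataset, interleave_datasets

reddit = load_dataset("DDSC/reddit-da", split="train", streaming=True)
culturax = load_dataset("uonlp/CulturaX", "da", split="train", streaming=True)

mix = interleave_datasets([reddit, culturax], probabilities=[0.1, 0.9], seed=0)
for example in mix.take(3):
    print(example["text"][:100])
```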
### Training Procedure

See Sections 5.1 and 6.1 in [the thesis](https://sorenmulli.github.io/thesis/thesis.pdf).

## Evaluation

On the [Danoliterate LLM Benchmark](https://danoliterate.compute.dtu.dk/), this model achieves an index score of 13 as of June 2024.

## Model Card Contact

Contact Søren Vejlgaard Holm at swiho@dtu.dk or swh@alvenir.ai.