---
license: mit
datasets:
- DDSC/reddit-da
- uonlp/CulturaX
language:
- da
---

# Model Card for the Danoliterate Baseline 7B Model

A base model with the same architecture as Llama 2 7B, trained from scratch on a combination of Danish datasets for 20K updates (655M tokens).

## Model Details

### Model Description

A test model developed as part of the thesis [Are GLLMs Danoliterate? Benchmarking Generative NLP in Danish](https://sorenmulli.github.io/thesis/thesis.pdf), with relevant details in Sections 4.1, 5.1 and 6.1.

- **Developed by:** Søren Vejlgaard Holm under supervision from Lars Kai Hansen and Martin Carsten Nielsen.
- **Model type:** Base, autoregressive LLM with the Llama 2 7B architecture.
- **Language(s) (NLP):** Danish
- **License:** MIT

## Uses

This model is strictly a research artifact for investigating the effect of pre-training a model from scratch and is not intended to be applied directly.

## Bias, Risks, and Limitations

The model has been trained on a large corpus of uncurated internet content and can thus possibly generate problematic content.

## Training Details

### Training Data

The pre-training mix contained the Danish Gigaword and Danish Reddit corpora, as compiled by the Danish Data Science Community, as well as the Danish subset of CulturaX.
For more details, see Section 4.1 in [the thesis](https://sorenmulli.github.io/thesis/thesis.pdf).

### Training Procedure

See Sections 5.1 and 6.1 in [the thesis](https://sorenmulli.github.io/thesis/thesis.pdf).

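The stated training volume can be sanity-checked with quick arithmetic: 20K updates over roughly 655M tokens works out to about 2^15 tokens per update. The batch composition shown below (8 sequences at a 4,096-token context length) is an illustrative assumption consistent with that figure, not a number taken from this model card.

```python
# Rough sanity check of the stated training scale: ~655M tokens over
# 20K optimizer updates. The batch shape (8 sequences x 4,096 tokens)
# is an illustrative assumption, not a documented training setting.
total_tokens = 655_000_000
num_updates = 20_000

tokens_per_update = total_tokens / num_updates
print(f"~{tokens_per_update:,.0f} tokens per update")  # ~32,750, close to 2**15 = 32,768

# Under the assumed batch shape, each update would process:
assumed_batch_tokens = 8 * 4096  # = 32,768 tokens
print(f"assumed batch covers {assumed_batch_tokens:,} tokens")
```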
## Evaluation

On the [Danoliterate LLM Benchmark](https://danoliterate.compute.dtu.dk/), this model gets an index score of 13 as of June 2024.

## Model Card Contact

Contact Søren Vejlgaard Holm at swiho@dtu.dk or swh@alvenir.ai.