---
license: llama2
datasets:
- DDSC/reddit-da
- uonlp/CulturaX
language:
- da
base_model: meta-llama/Llama-2-7b-hf
---

# Model Card for the Danoliterate Baseline 7B Model

A base model fine-tuned from Llama 2 7B on a combination of Danish datasets for 80K updates (2.6B tokens).
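
As a back-of-the-envelope check (using only the figures stated above, not numbers from the thesis), the training budget implies an average of about 32,500 tokens per optimizer update:

```python
# Rough arithmetic implied by the stated training budget:
# 2.6B tokens spread over 80K optimizer updates.
total_tokens = 2.6e9
num_updates = 80_000

tokens_per_update = total_tokens / num_updates
print(f"{tokens_per_update:,.0f} tokens per update")  # 32,500
```

With 4,096-token sequences (the Llama 2 context length), this would correspond to roughly 8 sequences per batch, though the actual batch configuration is documented in the thesis.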

## Model Details

### Model Description

A test model that is part of the thesis [Are GLLMs Danoliterate? Benchmarking Generative NLP in Danish](https://sorenmulli.github.io/thesis/thesis.pdf); relevant details can be found in Sections 4.1, 5.1, and 6.1.

- **Developed by:** Søren Vejlgaard Holm under supervision from Lars Kai Hansen and Martin Carsten Nielsen.
- **Model type:** Base, autoregressive LLM with the Llama 2 7B architecture.
- **Language(s) (NLP):** Danish
- **License:** Llama 2.

## Uses

This model is strictly a research artifact for investigating the effect of pre-training a model from scratch and is not intended to be applied directly.

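For research experiments, the model can be loaded with Hugging Face Transformers. The sketch below uses a placeholder repository id (the real Hub id is not stated in this card) and assumes the Llama 2 license has been accepted on the Hub:

```python
# Sketch: loading the model with Hugging Face Transformers for research use.
# The repository id is a placeholder -- substitute the actual Hub id of this model.

repo_id = "<hub-id-of-this-model>"  # placeholder, not the real repository id


def load_model(repo_id: str):
    """Return (tokenizer, model) loaded from the Hugging Face Hub."""
    # Imported lazily so the sketch can be read without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(repo_id, device_map="auto")
    return tokenizer, model


def generate(tokenizer, model, prompt: str, max_new_tokens: int = 50) -> str:
    """Greedy continuation of a Danish prompt with the base model."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

With the real repository id, `generate(*load_model(repo_id), "Danmark er et land, hvor")` would continue the Danish prompt; note that as a base model it is not instruction-tuned and will only continue text.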

## Bias, Risks, and Limitations

The model has been trained on a large corpus of uncurated internet content and can thus possibly generate problematic content.

## Training Details

### Training Data

The pre-training mix contained the Danish Gigaword and Danish Reddit corpora, as compiled by the Danish Data Science Community, as well as the Danish subset of CulturaX.
For more details, see Section 4.1 in [the thesis](https://sorenmulli.github.io/thesis/thesis.pdf).

### Training Procedure

See Sections 5.1 and 6.1 in [the thesis](https://sorenmulli.github.io/thesis/thesis.pdf).


## Evaluation

On the [Danoliterate LLM Benchmark](https://danoliterate.compute.dtu.dk/), this model achieves an index score of 18 as of June 2024.


## Model Card Contact

Contact Søren Vejlgaard Holm at swiho@dtu.dk or swh@alvenir.ai.