sorenmulli committed
Commit c6e4eac
1 Parent(s): 8dfc2fc

Update README.md

Files changed (1)
  1. README.md +53 -1
README.md CHANGED
@@ -1 +1,53 @@
- *Please note that this model and model card both are works in progress. For now refer to the related [thesis](https://sorenmulli.github.io/thesis/thesis.pdf) for all details*
+ ---
+ license: mit
+ datasets:
+ - DDSC/reddit-da
+ - uonlp/CulturaX
+ language:
+ - da
+ base_model: mistralai/Mistral-7B-v0.1
+ ---
+
+ # Model Card for the Danoliterate Mistral 7B Model
+
+ A base model fine-tuned from Mistral 7B on a combination of Danish datasets for 20K updates (655M tokens).
+ ## Model Details
+
+ ### Model Description
+
+ A test model developed as part of the thesis [Are GLLMs Danoliterate? Benchmarking Generative NLP in Danish](https://sorenmulli.github.io/thesis/thesis.pdf), with relevant details in Sections 4.1, 5.1, and 6.1.
+
+ - **Developed by:** Søren Vejlgaard Holm under supervision from Lars Kai Hansen and Martin Carsten Nielsen.
+ - **Model type:** Base, autoregressive LLM with Mistral 7B architecture.
+ - **Language(s) (NLP):** Danish
+ - **License:** MIT
+
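+ As a quick-start sketch (not taken from the thesis), the model can be loaded with the Hugging Face `transformers` library. The repository id below is a placeholder assumption; replace it with the id of this repository.
+
+ ```python
+ # Minimal sketch: load the base model and sample a Danish continuation.
+ # REPO_ID is a placeholder assumption, not a confirmed repository name.
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ REPO_ID = "sorenmulli/danoliterate-mistral-7b"  # placeholder id
+
+ tokenizer = AutoTokenizer.from_pretrained(REPO_ID)
+ model = AutoModelForCausalLM.from_pretrained(REPO_ID)
+
+ # This is a base (non-instruct) model, so give it text to continue.
+ inputs = tokenizer("Danmark er et land, hvor", return_tensors="pt")
+ outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True)
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```
+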
+ ## Uses
+
+ This model is strictly a research artifact for investigating the effect of pre-training a model from scratch and is not intended to be applied directly.
+
+
+ ## Bias, Risks, and Limitations
+
+ The model has been trained on a large corpus of uncurated internet content and can thus possibly generate problematic content.
+
+ ## Training Details
+
+ ### Training Data
+
+ The pre-training mix contained the Danish Gigaword and Danish Reddit corpora, as compiled by the Danish Data Science Community, as well as the Danish subset of CulturaX.
+ For more details, see Section 4.1 in [the thesis](https://sorenmulli.github.io/thesis/thesis.pdf).
+
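+ As a hedged sketch (not the thesis data pipeline), the two Hugging Face datasets listed in the metadata above can be fetched with the `datasets` library; the Danish Gigaword part of the mix is not linked here. The `"da"` configuration name for CulturaX and the split and field names are assumptions based on the public dataset cards.
+
+ ```python
+ # Rough sketch: fetch the Danish corpora referenced in the metadata.
+ # Config, split, and field names are assumptions, not taken from the thesis code.
+ from datasets import load_dataset
+
+ reddit_da = load_dataset("DDSC/reddit-da", split="train")
+ culturax_da = load_dataset("uonlp/CulturaX", "da", split="train", streaming=True)
+
+ print(reddit_da)
+ print(next(iter(culturax_da))["text"][:200])
+ ```
+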
+ ### Training Procedure
+
+ See Sections 5.1 and 6.1 in [the thesis](https://sorenmulli.github.io/thesis/thesis.pdf).
+
+
+ ## Evaluation
+
+ On the [Danoliterate LLM Benchmark](https://danoliterate.compute.dtu.dk/), this model achieves an index score of 24 as of June 2024.
+
+
+ ## Model Card Contact
+
+ Contact Søren Vejlgaard Holm at swiho@dtu.dk or swh@alvenir.ai.