sorenmulli/dano-mistral-7b-0.1


sorenmulli committed on Aug 28

Commit c6e4eac (1 parent: 8dfc2fc)

Update README.md

Files changed (1)

README.md +53 -1

	@@ -1 +1,53 @@
1	- ~~Please note that this model and model card both are works in progress. For now refer to the related [thesis](https://sorenmulli.github.io/thesis/thesis.pdf) for all details~~

+---
+license: mit
+datasets:
+- DDSC/reddit-da
+- uonlp/CulturaX
+language:
+- da
+base_model: mistralai/Mistral-7B-v0.1
+---
+# Model Card for the Danoliterate Mistral 7B Model
+A base model fine-tuned from Mistral 7B on a combination of Danish datasets for 20K updates (655M tokens).
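The two figures quoted above imply a rough per-update token budget. The snippet below is only a back-of-the-envelope check, assuming every update processed the same number of tokens and treating "655M" as a rounded value; the function name is illustrative, and the actual batch configuration is documented in the thesis, not here.

```python
# Rough check of the training budget quoted in the model card.
# Assumption: every update sees the same number of tokens.
UPDATES = 20_000              # "20K updates" from the card
TOTAL_TOKENS = 655_000_000    # "655M tokens" from the card (rounded)


def tokens_per_update(total_tokens: int, updates: int) -> float:
    """Average number of tokens consumed per optimizer update."""
    return total_tokens / updates


average = tokens_per_update(TOTAL_TOKENS, UPDATES)
print(f"{average:,.0f} tokens per update")

# For comparison only: 2**15 tokens per update would give a total
# that also rounds to 655M. Whether this was the actual setting is
# an assumption, not something the card states.
print(f"{UPDATES * 2**15:,} tokens at 2**15 per update")
```

The average comes out at about 32.75K tokens per update, which is close to 2^15 = 32,768; that value would yield 655.36M tokens in total and is consistent with the rounded figure.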
+## Model Details
+### Model Description
+A test model developed as part of the thesis [Are GLLMs Danoliterate? Benchmarking Generative NLP in Danish](https://sorenmulli.github.io/thesis/thesis.pdf); the relevant details are in Sections 4.1, 5.1 and 6.1.
+- **Developed by:** Søren Vejlgaard Holm under supervision from Lars Kai Hansen and Martin Carsten Nielsen.
+- **Model type:** Base, autoregressive LLM with Mistral 7B architecture.
+- **Language(s) (NLP):** Danish
+- **License:** MIT
+## Uses
+This model is strictly a research artifact for investigating the effect of pre-training a model from scratch and is not intended to be applied directly.
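For research inspection, the checkpoint can presumably be loaded like any other causal language model on the Hub. The following is a minimal sketch, assuming the `transformers` and `torch` packages are installed and that the repository ships tokenizer files alongside the weights; the helper names `load_model` and `continue_text` are hypothetical, and only the repository id is taken from this page.

```python
# Minimal loading sketch for research inspection (not an official recipe).
# Only MODEL_ID comes from this page; the helper names are hypothetical.
MODEL_ID = "sorenmulli/dano-mistral-7b-0.1"


def load_model(model_id: str = MODEL_ID):
    """Download and return (tokenizer, model) from the Hugging Face Hub."""
    # Imported lazily: transformers is a heavy, optional dependency.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    return tokenizer, model


def continue_text(prompt: str, max_new_tokens: int = 30) -> str:
    """Let the base model continue a prompt; it is not instruction-tuned."""
    tokenizer, model = load_model()
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `continue_text("Danmarks hovedstad er")` would download the 7B-parameter weights on first use, so expect a large download and substantial memory requirements. Because this is a base model, it continues text and does not follow instructions.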
+## Bias, Risks, and Limitations
+The model has been trained on a large corpus of uncurated internet content and can thus possibly generate problematic content.
+## Training Details
+### Training Data
+The pretraining mix contained the Danish Gigaword and Danish Reddit corpora, as compiled by the Danish Data Science Community, as well as the Danish subset of CulturaX.
+For more details, see Section 4.1 in [the thesis](https://sorenmulli.github.io/thesis/thesis.pdf).
+### Training Procedure
+See Sections 5.1 and 6.1 in [the thesis](https://sorenmulli.github.io/thesis/thesis.pdf).
+## Evaluation
+On the [Danoliterate LLM Benchmark](https://danoliterate.compute.dtu.dk/), this model gets an index score of 24 as of June 2024.
+## Model Card Contact
+Contact Søren Vejlgaard Holm at swiho@dtu.dk or swh@alvenir.ai.