Commit 638c4cb by matthieumeeus97 (1 parent: c9802b9): Update README.md

README.md:
license: llama3
---

<p align="center" style="margin:0;padding:0">
  <img src="./chocollama_logo.png" alt="ChocoLlama logo" width="500" style="margin-left:auto; margin-right:auto; display:block"/>
</p>
<div style="margin:auto; text-align:center">
  <h1 style="margin-bottom: 0">ChocoLlama</h1>
  <em>A Llama-2/3-based family of Dutch language models</em>
</div>

## Llama-3-ChocoLlama-8B-base: Getting Started

We present **Llama-3-ChocoLlama-8B-base**, a language-adapted version of Meta's Llama-3-8B, fine-tuned on 17B Dutch Llama-3 tokens (104GB) using LoRA.
Note that this is a base model, not optimized for conversational behavior.
If that is what your use-case requires, we recommend fine-tuning this model on your own Dutch data or using the instruction-tuned version of this model, [Llama-3-ChocoLlama-instruct](https://huggingface.co/ChocoLlama/Llama-3-ChocoLlama-instruct).

Use the code below to get started with the model.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and base model from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained('ChocoLlama/Llama-3-ChocoLlama-base')
model = AutoModelForCausalLM.from_pretrained('ChocoLlama/Llama-3-ChocoLlama-base')
```
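
Continuing from the snippet above, the short sketch below generates a Dutch completion. The prompt and sampling settings are only examples, and moving the model to a GPU is optional; none of this is prescribed by the model itself.

```python
import torch

# Optionally move the model to a GPU if one is available.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = model.to(device)

# The base model simply continues the Dutch prompt (no chat formatting).
prompt = 'Brussel is de hoofdstad van'
inputs = tokenizer(prompt, return_tensors='pt').to(device)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.9, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

For larger models or longer generations, loading the model in `bfloat16` (and, if you have `accelerate` installed, with `device_map='auto'`) will reduce memory usage.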

## Model Details

ChocoLlama is a family of open LLMs specifically adapted to Dutch, contributing to the state-of-the-art of Dutch open LLMs in their weight class.

We provide 6 variants (3 base models and 3 instruction-tuned models):
- **ChocoLlama-2-7B-base** ([link](https://huggingface.co/ChocoLlama/ChocoLlama-2-7B-base)): A language-adapted version of Meta's Llama-2-7b, fine-tuned on 32B Dutch Llama-2 tokens (104GB) using LoRA.
- **ChocoLlama-2-7B-instruct** ([link](https://huggingface.co/ChocoLlama/ChocoLlama-2-7B-instruct)): An instruction-tuned version of ChocoLlama-2-7B-base, fine-tuned on a collection of Dutch translations of instruction-tuning datasets, using SFT followed by DPO.
- **ChocoLlama-2-7B-tokentrans-base** ([link](https://huggingface.co/ChocoLlama/ChocoLlama-2-7B-tokentrans-base)): A language-adapted version of Meta's Llama-2-7b, using a Dutch RoBERTa-based tokenizer. The token embeddings of this model were reinitialized using the token translation algorithm proposed by [Remy et al.](https://arxiv.org/pdf/2310.03477). The model was subsequently fine-tuned on the same Dutch dataset as ChocoLlama-2-7B-base, again using LoRA.
- **ChocoLlama-2-7B-tokentrans-instruct** ([link](https://huggingface.co/ChocoLlama/ChocoLlama-2-7B-tokentrans-instruct)): An instruction-tuned version of ChocoLlama-2-7B-tokentrans-base, fine-tuned on the same dataset as ChocoLlama-2-7B-instruct, again using SFT followed by DPO.
- **Llama-3-ChocoLlama-8B-base** ([link](https://huggingface.co/ChocoLlama/Llama-3-ChocoLlama-8B-base)): A language-adapted version of Meta's Llama-3-8B, fine-tuned on the same Dutch dataset as ChocoLlama-2-7B-base, again using LoRA.
- **Llama-3-ChocoLlama-instruct** ([link](https://huggingface.co/ChocoLlama/Llama-3-ChocoLlama-8B-instruct)): An instruction-tuned version of Llama-3-ChocoLlama-8B-base, fine-tuned on the same dataset as ChocoLlama-2-7B-instruct, again using SFT followed by DPO.

For benchmark results for all models, including comparisons to their base models and to other Dutch LLMs, we refer to our paper [here](some_url).

### Model Description

- **Developed by:** [Matthieu Meeus](https://huggingface.co/matthieumeeus97), [Anthony Rathé](https://huggingface.co/anthonyrathe)
- **Funded by:** [Vlaams Supercomputer Centrum](https://www.vscentrum.be/), through a grant of approx. 40K GPU hours (NVIDIA A100-80GB)
- **Language(s):** Dutch
- **License:** [Llama-3 Community License](https://www.llama.com/llama3/license/)
- **Finetuned from model:** [Llama-3-8b](https://huggingface.co/meta-llama/Meta-Llama-3-8B)

### Model Sources

- **Repository:** Will be released soon.
- **Paper:** Will be released soon.

## Uses

### Direct Use

Since this is a base model, we do not recommend using it for your use-cases directly. We instead recommend:
1. Fine-tuning this model for your specific use-case
2. Leveraging the instruction-tuned version of this model (see the sketch below)
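
As an illustration of option 2, the sketch below queries the instruction-tuned model through its chat template. This is an assumption-laden example rather than documented usage: it assumes the instruct repository linked above ships a chat template in its tokenizer configuration and that a GPU with the `accelerate` package is available; verify the exact repository id on the Hub.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical usage of the instruct variant; check the exact repository id on the Hub.
model_id = 'ChocoLlama/Llama-3-ChocoLlama-8B-instruct'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map='auto')

messages = [{'role': 'user', 'content': 'Geef drie tips om thuis energie te besparen.'}]
# apply_chat_template uses whatever template the tokenizer provides (assumed to exist for the instruct model).
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors='pt').to(model.device)
outputs = model.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.7, top_p=0.9)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```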

### Downstream Use

Since this is a base model, it can easily be adapted to specific use-cases that require Dutch language understanding and generation.
We expect this model to be particularly useful for use-cases in the domains explicitly covered in our dataset, e.g. the analysis and/or generation of Dutch job descriptions, corporate filings and legislation.

### Out-of-Scope Use

- Use-cases requiring a chat-style interface: since this is a base model, it cannot be used reliably for turn-based chat interaction. Please refer to the instruction-tuned version of this model instead.
- Use-cases requiring understanding or generation of text in languages other than Dutch: the dataset on which this model was fine-tuned does not contain data in languages other than Dutch, hence we expect significant catastrophic forgetting to have occurred for English, the language Llama-3 was originally trained on.

## Bias, Risks, and Limitations

We have taken care to include only widely used and high-quality data in our dataset. Some of this data has been filtered by the original creators.
However, we did not explicitly conduct any additional filtering of this dataset with regard to biased or otherwise harmful content.

### Recommendations

We recommend fine-tuning this model on your own curated data to maximally avoid undesirable outputs.

## Training Details

### Training Data

We collected a diverse set of Dutch natural language:

1. **OSCAR**
   The bulk of our data comes from the Dutch portion of [OSCAR](https://oscar-corpus.com), January 2023 version, based on Common Crawl. This dataset includes **93 GB** of text (~28.6B tokens).

2. **Open Subtitles**
   We collected Dutch text from movie subtitles, focusing on unique movies either in Dutch or with Dutch subtitles. This dataset contains **5 GB** of text (~1.54B tokens) from **214k samples**.

3. **Project Gutenberg**
   We downloaded **970 full Dutch books** from [Project Gutenberg](https://www.gutenberg.org) using a public scraper. The dataset includes **0.3 GB** of text (~92M tokens) and is available on [Hugging Face](https://huggingface.co/datasets/ChocoLlama/gutenberg-dutch) (see the loading sketch after this list).

4. **Wikipedia**
   Using the March 2023 [Wikipedia dump](https://dumps.wikimedia.org), we included **2.5 GB** of text (~769M tokens). Despite some duplication with OSCAR, Wikipedia's high quality justifies its inclusion.

5. **Job Descriptions (TechWolf)**
   A sample of **750k Dutch job descriptions** collected over five years from public websites, provided by TechWolf. This dataset contains **1.5 GB** of text (~462M tokens).

6. **Staatsblad (Bizzy)**
   A sample of **80k legal filings** from [Het Belgisch Staatsblad](https://www.ejustice.just.fgov.be/cgi/welcome.pl). Documents were OCR-processed, and personal data was excluded. This dataset includes **1.4 GB** of text (~431M tokens), collected with help from Bizzy.

7. **Legislation (ML6)**
   **15k documents** from Flemish legislation accessed via the [Open Data API](https://www.vlaanderen.be/vlaams-parlement/de-vlaamse-codex). This dataset contains **0.2 GB** of text (~62M tokens), collected with support from ML6.
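
The Project Gutenberg subset linked above is available as a standalone Hugging Face dataset. As a small illustrative sketch (the split and column names are not documented here, so inspect the object before relying on them):

```python
from datasets import load_dataset

# Load the Dutch Gutenberg books released by the ChocoLlama team; splits and columns are assumptions to verify.
gutenberg_nl = load_dataset('ChocoLlama/gutenberg-dutch')
print(gutenberg_nl)
```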

### Training Procedure

This model was fine-tuned using low-rank adaptation (LoRA) with trainable embeddings, for a total of 1.07B trainable parameters. The hyperparameters are listed below; an illustrative configuration sketch follows the list.

#### Training Hyperparameters

- **Training regime:** bf16 non-mixed precision
- **Epochs:** 1
- **LoRA parameters:**
  - R: 8
  - Alpha: 32
  - Trainable modules: q_proj, v_proj, k_proj, o_proj, gate_proj, up_proj, down_proj, embed_tokens, lm_head
  - LoRA dropout: 0.05
- **Learning rate:**
  - Scheduler: StepLR
  - Step size: 6212
  - Learning rate: 0.0003
  - Gamma: 0.85
- **Other parameters:**
  - Minibatch size: 16
  - Gradient accumulation steps: 8
  - Parallelization factor: 8
  - Weight decay: 0
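
For reference, the hyperparameters above roughly map onto the PEFT configuration sketched below. This is a minimal sketch rather than the actual training script: whether `embed_tokens` and `lm_head` are trained fully (via `modules_to_save`) or through LoRA adapters is an assumption, and the data pipeline, trainer and parallelization are omitted.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Base model (gated on the Hub; requires accepting the Llama-3 license).
base = AutoModelForCausalLM.from_pretrained('meta-llama/Meta-Llama-3-8B', torch_dtype=torch.bfloat16)

lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=['q_proj', 'v_proj', 'k_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj'],
    modules_to_save=['embed_tokens', 'lm_head'],  # assumption: embeddings and LM head are trained fully
    task_type='CAUSAL_LM',
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()

# Reported learning-rate schedule: StepLR with step size 6212 and gamma 0.85, starting at 3e-4.
optimizer = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=3e-4, weight_decay=0.0)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=6212, gamma=0.85)
```

With fully trainable embeddings and LM head, this setup is roughly consistent with the 1.07B trainable parameters quoted above (about 1.05B for the two embedding matrices plus on the order of 20M LoRA parameters).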

## Evaluation

### Quantitative evaluation

We have evaluated our models on several industry-standard Dutch benchmarks, translated from their original versions. The results can be found in the table below, together with results from several other prominent Dutch models.

| Model | ARC | HellaSwag | MMLU | TruthfulQA | Avg. |
|-----------------------------------------|----------|----------|----------|----------|----------|
| **Llama-3-ChocoLlama-instruct** | **0.48** | **0.66** | **0.49** | **0.49** | **0.53** |
| llama-3-8B-rebatch | 0.44 | 0.64 | 0.46 | 0.48 | 0.51 |
| llama-3-8B-instruct | 0.47 | 0.59 | 0.47 | 0.52 | 0.51 |
| llama-3-8B | 0.44 | 0.64 | 0.47 | 0.45 | 0.50 |
| Reynaerde-7B-Chat | 0.44 | 0.62 | 0.39 | 0.52 | 0.49 |
| **Llama-3-ChocoLlama-base** | **0.45** | **0.64** | **0.44** | **0.44** | **0.49** |
| zephyr-7b-beta | 0.43 | 0.58 | 0.43 | 0.53 | 0.49 |
| geitje-7b-ultra | 0.40 | 0.66 | 0.36 | 0.49 | 0.48 |
| **ChocoLlama-2-7B-tokentrans-instruct** | **0.45** | **0.62** | **0.34** | **0.42** | **0.46** |
| mistral-7b-v0.1 | 0.43 | 0.58 | 0.37 | 0.45 | 0.46 |
| **ChocoLlama-2-7B-tokentrans-base** | **0.42** | **0.61** | **0.32** | **0.43** | **0.45** |
| **ChocoLlama-2-7B-instruct** | **0.36** | **0.57** | **0.33** | **0.45** | **0.43** |
| **ChocoLlama-2-7B-base** | **0.35** | **0.56** | **0.31** | **0.43** | **0.41** |
| llama-2-7b-chat-hf | 0.36 | 0.49 | 0.33 | 0.44 | 0.41 |
| llama-2-7b-hf | 0.36 | 0.51 | 0.32 | 0.41 | 0.40 |

On average, Llama-3-ChocoLlama-instruct surpasses the previous state-of-the-art on these benchmarks.

### Qualitative evaluation

In our paper, we also provide an additional qualitative evaluation of all models, which we empirically find more reliable.
For details, we refer to the paper and to our benchmark [ChocoLlama-Bench](https://huggingface.co/datasets/ChocoLlama/ChocoLlama-Bench).

### Compute Infrastructure

All ChocoLlama models have been trained on the compute cluster provided by the [Flemish Supercomputer Center (VSC)](https://www.vscentrum.be/). We used 8 to 16 NVIDIA A100 GPUs with 80 GB of VRAM.