---
language:
- nl
license: cc-by-nc-4.0
base_model: ChocoLlama/Llama-3-ChocoLlama-8B-base
datasets:
- BramVanroy/ultrachat_200k_dutch
- BramVanroy/stackoverflow-chat-dutch
- BramVanroy/alpaca-cleaned-dutch
- BramVanroy/dolly-15k-dutch
- BramVanroy/no_robots_dutch
- BramVanroy/ultra_feedback_dutch
---
<p align="center" style="margin:0;padding:0">
<img src="./chocollama_logo.png" alt="ChocoLlama logo" width="500" style="margin-left:'auto' margin-right:'auto' display:'block'"/>
</p>
<div style="margin:auto; text-align:center">
<h1 style="margin-bottom: 0">ChocoLlama</h1>
<em>A Llama-2/3-based family of Dutch language models</em>
</div>
## Llama-3-ChocoLlama-8B-instruct: Getting Started
We present **Llama-3-ChocoLlama-8B-instruct**, an instruction-tuned version of Llama-3-ChocoLlama-8B-base, fine-tuned on a collection of Dutch translations of instruction-tuning datasets, using SFT followed by DPO.
Its base model, [Llama-3-ChocoLlama-8B-base](https://huggingface.co/ChocoLlama/Llama-3-ChocoLlama-8B-base), is a language-adapted version of Meta's Llama-3-8B, fine-tuned using LoRA on the same Dutch dataset as ChocoLlama-2-7B-base (104GB, or 32B tokens as counted with the Llama-2 tokenizer).
Use the code below to get started with the model.
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model; device_map="auto" (requires `accelerate`) places the weights on the available device(s).
tokenizer = AutoTokenizer.from_pretrained('ChocoLlama/Llama-3-ChocoLlama-8B-instruct')
model = AutoModelForCausalLM.from_pretrained('ChocoLlama/Llama-3-ChocoLlama-8B-instruct', device_map="auto")

messages = [
    {"role": "system", "content": "Je bent een artificiële intelligentie-assistent en geeft behulpzame, gedetailleerde en beleefde antwoorden op de vragen van de gebruiker."},
    {"role": "user", "content": "Jacques Brel, Willem Elsschot en Jan Jambon zitten op café. Waar zouden ze over babbelen?"},
]

# Format the conversation with the model's chat template and add the prompt that cues the assistant's reply.
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

# Stop generating at either the regular EOS token or the end-of-turn token.
new_terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = model.generate(
    input_ids,
    max_new_tokens=512,
    eos_token_id=new_terminators,
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
)

# Decode only the newly generated tokens, skipping the prompt.
response = outputs[0][input_ids.shape[-1]:]
print(tokenizer.decode(response, skip_special_tokens=True))
```
Note that the datasets used for instruction-tuning were translated using GPT-3.5/4, which means that this instruction-tuned model cannot be used for commercial purposes.
Hence, for any commercial application, we recommend fine-tuning the base model on your own Dutch data.
## Model Details
ChocoLlama is a family of open LLMs specifically adapted to Dutch, contributing to the state of the art of Dutch open LLMs in their weight class.
We provide 6 variants (3 base models and 3 instruction-tuned models):
- **ChocoLlama-2-7B-base** ([link](https://huggingface.co/ChocoLlama/ChocoLlama-2-7B-base)): A language-adapted version of Meta's Llama-2-7b, fine-tuned on 32B Dutch Llama-2 tokens (104GB) using LoRA.
- **ChocoLlama-2-7B-instruct** ([link](https://huggingface.co/ChocoLlama/ChocoLlama-2-7B-instruct)): An instruction-tuned version of ChocoLlama-2-7B-base, fine-tuned on a collection of Dutch translations of instruction-tuning datasets, using SFT followed by DPO.
- **ChocoLlama-2-7B-tokentrans-base** ([link](https://huggingface.co/ChocoLlama/ChocoLlama-2-7B-tokentrans-base)): A language-adapted version of Meta's Llama-2-7b, using a Dutch RoBERTa-based tokenizer. The token embeddings of this model were reinitialized using the token translation algorithm proposed by [Remy et al.](https://arxiv.org/pdf/2310.03477). The model was subsequently fine-tuned on the same Dutch dataset as ChocoLlama-2-7B-base, again using LoRA.
- **ChocoLlama-2-7B-tokentrans-instruct** ([link](https://huggingface.co/ChocoLlama/ChocoLlama-2-7B-tokentrans-instruct)): An instruction-tuned version of ChocoLlama-2-7B-tokentrans-base, fine-tuned on the same dataset as ChocoLlama-2-7B-instruct, again using SFT followed by DPO.
- **Llama-3-ChocoLlama-8B-base** ([link](https://huggingface.co/ChocoLlama/Llama-3-ChocoLlama-8B-base)): A language-adapted version of Meta's Llama-3-8B, fine-tuned on the same Dutch dataset as ChocoLlama-2-7B-base, again using LoRA.
- **Llama-3-ChocoLlama-8B-instruct** ([link](https://huggingface.co/ChocoLlama/Llama-3-ChocoLlama-8B-instruct)): An instruction-tuned version of Llama-3-ChocoLlama-8B-base, fine-tuned on the same dataset as ChocoLlama-2-7B-instruct, again using SFT followed by DPO.
For benchmark results for all models, including comparisons with their base models and other Dutch LLMs, we refer to our paper [here](https://arxiv.org/pdf/2412.07633).
### Model Description
- **Developed by:** [Matthieu Meeus](https://huggingface.co/matthieumeeus97), [Anthony Rathé](https://huggingface.co/anthonyrathe)
- **Funded by:** [Vlaams Supercomputer Centrum](https://www.vscentrum.be/), through a grant of approx. 40K GPU hours (NVIDIA A100-80GB)
- **Language(s):** Dutch
- **License:** cc-by-nc-4.0
- **Finetuned from model:** [Llama-3-ChocoLlama-8B-base](https://huggingface.co/ChocoLlama/Llama-3-ChocoLlama-8B-base)
### Model Sources
- **Repository:** [on GitHub here](https://github.com/ChocoLlamaModel/ChocoLlama).
- **Paper:** [on arXiv here](https://arxiv.org/pdf/2412.07633).
## Uses
### Direct Use
This is an instruction-tuned (SFT + DPO) Dutch model, optimized for Dutch language generation in conversational settings.
For optimal behavior, we advise using the model only with the correct chat template (see Python code above), optionally supported by a system prompt.
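
To make concrete what a chat template adds around the raw messages, the sketch below renders a conversation by hand in a Llama-3-style layout. This layout is an assumption, inferred from the `<|eot_id|>` terminator used in the Python code above; `render_llama3_style` is a hypothetical helper written for illustration only. In practice, `tokenizer.apply_chat_template` remains the authoritative way to build the prompt.

```python
def render_llama3_style(messages, add_generation_prompt=True):
    """Render chat messages in a Llama-3-style layout.

    Assumption: this mirrors the general Llama-3 header / <|eot_id|> format;
    it is not read from this model's tokenizer configuration.
    """
    parts = ["<|begin_of_text|>"]
    for message in messages:
        # Each turn: a role header, a blank line, the content, then an end-of-turn marker.
        parts.append(f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n")
        parts.append(f"{message['content'].strip()}<|eot_id|>")
    if add_generation_prompt:
        # An open assistant header cues the model to write the next turn.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


# Illustrative messages (hypothetical, not from the benchmark data).
example_messages = [
    {"role": "system", "content": "Je bent een behulpzame assistent."},
    {"role": "user", "content": "Wat is de hoofdstad van België?"},
]
prompt = render_llama3_style(example_messages)
print(prompt)
```

Sending plain text without these markers means the model sees input that differs from what it saw during SFT and DPO, which is why the template matters.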
### Out-of-Scope Use
Use cases requiring understanding or generation of text in languages other than Dutch. The dataset on which this model was fine-tuned does not contain data in languages other than Dutch, so we expect significant catastrophic forgetting to have occurred for English, the language Llama-3 was primarily trained on.
## Bias, Risks, and Limitations
We have taken care to include only widely used and high-quality data in our dataset. Some of this data has been filtered by the original creators.
However, we did not explicitly conduct any additional filtering of this dataset with regard to biased or otherwise harmful content.
## Training Details
We adopt the same strategy as used to align GEITje-7B to [GEITje-7B-ultra](https://huggingface.co/BramVanroy/GEITje-7B-ultra).
First, we apply supervised finetuning (SFT), utilizing the data made available by [Vanroy](https://arxiv.org/pdf/2312.12852):
- [BramVanroy/ultrachat_200k_dutch](https://huggingface.co/datasets/BramVanroy/ultrachat_200k_dutch)
- [BramVanroy/no_robots_dutch](https://huggingface.co/datasets/BramVanroy/no_robots_dutch)
- [BramVanroy/stackoverflow-chat-dutch](https://huggingface.co/datasets/BramVanroy/stackoverflow-chat-dutch)
- [BramVanroy/alpaca-cleaned-dutch](https://huggingface.co/datasets/BramVanroy/alpaca-cleaned-dutch)
- [BramVanroy/dolly-15k-dutch](https://huggingface.co/datasets/BramVanroy/dolly-15k-dutch)
Next, we apply Direct Preference Optimization (DPO) to the SFT version of all the pretrained models we develop here,
now utilizing a Dutch version of the data used to train Zephyr-7B-$\beta$, [BramVanroy/ultra_feedback_dutch](https://huggingface.co/datasets/BramVanroy/ultra_feedback_dutch).
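
As a reminder of what the DPO stage optimizes, the following is a minimal pure-Python sketch of the standard DPO loss for a single preference pair. It is not the training code: the `beta` value and the log-probabilities are made-up illustrative numbers, and `dpo_loss` is a hypothetical helper name.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are summed sequence log-probabilities under the policy being
    trained and under the frozen reference (SFT) model.
    beta=0.1 is an illustrative assumption, not a value from the model card.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) == softplus(-margin), written in a numerically stable form.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# At the start of DPO the policy equals the reference, so the margin is zero.
loss_at_start = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# Hypothetical later state: the policy prefers the chosen answer more than the reference does.
loss_after_update = dpo_loss(-10.0, -16.0, -12.0, -15.0)

print(loss_at_start, loss_after_update)
```

The loss decreases as the policy raises the likelihood of the preferred answer relative to the rejected one, measured against the reference model.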
For both the SFT and DPO stages, we update all model weights and apply the same set of hyperparameters to all models as used in GEITje-7B-ultra:
- learning_rate: 5e-07
- train_batch_size: 4
- eval_batch_size: 4
- seed: 42
- distributed_type: multi-GPU
- num_devices: 4
- gradient_accumulation_steps: 4
- total_train_batch_size: 64
- total_eval_batch_size: 16
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 1
Further, we leverage the publicly available [alignment handbook](https://github.com/huggingface/alignment-handbook) and use a set of 4 NVIDIA A100 (80 GB) GPUs for both stages.
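
The hyperparameters listed above can be sanity-checked with a short sketch. It recomputes the total train batch size and illustrates a common linear-warmup, cosine-decay schedule shape. The function `lr_at_step` and the choice of `total_steps = 1000` are hypothetical, and the exact rounding of warmup steps in the training library may differ.

```python
import math

# Values taken from the hyperparameter list.
learning_rate = 5e-07
train_batch_size = 4              # per device
num_devices = 4
gradient_accumulation_steps = 4
warmup_ratio = 0.1

# Samples contributing to one optimizer update.
total_train_batch_size = train_batch_size * num_devices * gradient_accumulation_steps


def lr_at_step(step, total_steps, peak_lr=learning_rate, warmup=warmup_ratio):
    """Linear warmup to peak_lr, then cosine decay to zero (illustrative shape)."""
    warmup_steps = int(total_steps * warmup)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total_steps = 1000  # hypothetical number of optimizer steps
print(total_train_batch_size)
for step in (0, 50, 100, 550, 1000):
    print(step, lr_at_step(step, total_steps))
```

With a warmup ratio of 0.1, the learning rate peaks after the first 10% of the steps and then decays smoothly over the remainder of the single epoch.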
## Evaluation
### Quantitative evaluation
We have evaluated our models on several industry-standard Dutch benchmarks, translated from their original versions. The results can be found in the table below, together with results from several other prominent Dutch models.
| Model | ARC | HellaSwag | MMLU | TruthfulQA | Avg. |
|----------------------------------------------|----------------|----------------|----------------|----------------|----------------|
| **Llama-3-ChocoLlama-instruct** | **0.48** | **0.66** | **0.49** | **0.49** | **0.53** |
| llama-3-8B-rebatch | 0.44 | 0.64 | 0.46 | 0.48 | 0.51 |
| llama-3-8B-instruct | 0.47 | 0.59 | 0.47 | 0.52 | 0.51 |
| llama-3-8B | 0.44 | 0.64 | 0.47 | 0.45 | 0.50 |
| Reynaerde-7B-Chat | 0.44 | 0.62 | 0.39 | 0.52 | 0.49 |
| **Llama-3-ChocoLlama-base** | **0.45** | **0.64** | **0.44** | **0.44** | **0.49** |
| zephyr-7b-beta | 0.43 | 0.58 | 0.43 | 0.53 | 0.49 |
| geitje-7b-ultra | 0.40 | 0.66 | 0.36 | 0.49 | 0.48 |
| **ChocoLlama-2-7B-tokentrans-instruct** | **0.45** | **0.62** | **0.34** | **0.42** | **0.46** |
| mistral-7b-v0.1 | 0.43 | 0.58 | 0.37 | 0.45 | 0.46 |
| **ChocoLlama-2-7B-tokentrans-base** | **0.42** | **0.61** | **0.32** | **0.43** | **0.45** |
| **ChocoLlama-2-7B-instruct** | **0.36** | **0.57** | **0.33** | **0.45** | **0.43** |
| **ChocoLlama-2-7B-base** | **0.35** | **0.56** | **0.31** | **0.43** | **0.41** |
| llama-2-7b-chat-hf | 0.36 | 0.49 | 0.33 | 0.44 | 0.41 |
| llama-2-7b-hf | 0.36 | 0.51 | 0.32 | 0.41 | 0.40 |
On average, Llama-3-ChocoLlama-instruct surpasses the previous state-of-the-art on these benchmarks.
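
The Avg. column is the unweighted mean of the four benchmark scores. The small sketch below recomputes it for three rows copied from the table. Because the per-benchmark scores shown are already rounded to two decimals, recomputed averages for other rows may differ from the table by one rounding step.

```python
# Scores copied from the table, in the order (ARC, HellaSwag, MMLU, TruthfulQA).
scores = {
    "Llama-3-ChocoLlama-instruct": (0.48, 0.66, 0.49, 0.49),
    "llama-3-8B-instruct": (0.47, 0.59, 0.47, 0.52),
    "zephyr-7b-beta": (0.43, 0.58, 0.43, 0.53),
}

# Unweighted mean per model, rounded to two decimals as in the table.
averages = {
    name: round(sum(values) / len(values), 2)
    for name, values in scores.items()
}

best_model = max(averages, key=averages.get)
for name, avg in averages.items():
    print(f"{name}: {avg:.2f}")
print("highest average:", best_model)
```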
### Qualitative evaluation
In our paper, we also provide a qualitative evaluation of all models, which we empirically find more reliable.
For details, we refer to the paper and to our benchmark [ChocoLlama-Bench](https://huggingface.co/datasets/ChocoLlama/ChocoLlama-Bench).
### Compute Infrastructure
All ChocoLlama models have been trained on the compute cluster provided by the [Flemish Supercomputer Center (VSC)](https://www.vscentrum.be/). We used 8 to 16 NVIDIA A100 GPUs with 80 GB of VRAM.
## Citation
If you found this useful for your work, kindly cite our paper:
```
@article{meeus2024chocollama,
title={ChocoLlama: Lessons Learned From Teaching Llamas Dutch},
author={Meeus, Matthieu and Rath{\'e}, Anthony and Remy, Fran{\c{c}}ois and Delobelle, Pieter and Decorte, Jens-Joris and Demeester, Thomas},
journal={arXiv preprint arXiv:2412.07633},
year={2024}
}
```