---
language:
- nl
license: llama2
---

<p align="center" style="margin:0;padding:0">
<img src="./chocollama_logo.png" alt="ChocoLlama logo" width="800" style="margin-left:'auto' margin-right:'auto' display:'block'"/>
</p>
<div style="margin:auto; text-align:center">
<h1 style="margin-bottom: 0">ChocoLlama</h1>
<em>A Llama-2/3-based family of Dutch language models</em>
</div>

## Model Details

ChocoLlama is a family of open LLMs specifically adapted to Dutch, advancing the state of the art for Dutch open LLMs in their weight class.

We provide six variants, three base models and three instruction-tuned models:

- **ChocoLlama-2-7B-base**: A language-adapted version of Meta's Llama-2-7b, fine-tuned on a Dutch dataset of 104 GB (XXX tokens) using LoRA.
- **ChocoLlama-2-7B-instruct**: An instruction-tuned version of ChocoLlama-2-7B-base, fine-tuned on a collection of Dutch translations of instruction-tuning datasets, using SFT followed by DPO.
- **ChocoLlama-2-7B-tokentrans-base**: A language-adapted version of Meta's Llama-2-7b, using a Dutch RoBERTa-based tokenizer. The token embeddings of this model were reinitialized using the token translation algorithm proposed by [Remy et al.](https://arxiv.org/pdf/2310.03477). The model was subsequently fine-tuned on the same Dutch dataset as ChocoLlama-2-7B-base, again using LoRA.
- **ChocoLlama-2-7B-tokentrans-instruct**: An instruction-tuned version of ChocoLlama-2-7B-tokentrans-base, fine-tuned on the same dataset as ChocoLlama-2-7B-instruct, again using SFT followed by DPO.
- **Llama-3-ChocoLlama-8B-base**: A language-adapted version of Meta's Llama-3-8B, fine-tuned on the same Dutch dataset as ChocoLlama-2-7B-base, again using LoRA.
- **Llama-3-ChocoLlama-8B-instruct**: An instruction-tuned version of Llama-3-ChocoLlama-8B-base, fine-tuned on the same dataset as ChocoLlama-2-7B-instruct, again using SFT followed by DPO.

As far as we are aware, Llama-3-ChocoLlama-8B-instruct sets a new state of the art for Dutch open models in its weight class.

### Model Description

- **Developed by:** [Matthieu Meeus](https://huggingface.co/matthieumeeus97), [Anthony Rathé](https://huggingface.co/anthonyrathe)
- **Funded by:** [Vlaams Supercomputer Centrum](https://www.vscentrum.be/), through a grant of approximately 40K GPU hours (NVIDIA H100-80GB)
- **Language(s):** Dutch
- **License:** [Llama-2 Community License](https://ai.meta.com/llama/license/)
- **Finetuned from model:** [Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf)

### Model Sources

- **Repository:** Will be released soon.
- **Paper:** Will be released soon.

## Uses

### Direct Use

Since this is a base model, we do not recommend using it directly for your use cases. Instead, we recommend:

1. Fine-tuning this model for your specific use case
2. Using the instruction-tuned version of this model

### Downstream Use

Since this is a base model, it can easily be adapted to specific use cases that require Dutch language understanding and generation. We expect this model to be particularly useful in the domains explicitly covered by our dataset, e.g. the analysis and/or generation of:

- Dutch job descriptions
- Dutch corporate filings
- Dutch legislation

### Out-of-Scope Use

- Use cases requiring a chat-style interface: since this is a base model, it cannot be used reliably for turn-based chat interaction. Please refer to the instruction-tuned version of this model instead.
- Use cases requiring understanding or generation of text in languages other than Dutch: the dataset on which this model was fine-tuned does not contain data in languages other than Dutch, so we expect significant catastrophic forgetting to have occurred for English, the language Llama-2 was primarily trained on.

## Bias, Risks, and Limitations

We have taken care to include only widely used and high-quality data in our dataset. Some of this data has been filtered by the original creators.
However, we did not conduct any additional filtering of this dataset with regard to biased or otherwise harmful content.

### Recommendations

We recommend fine-tuning this model on your own curated data to minimize undesirable outputs.

## How to Get Started with the Model

Use the code below to get started with the model.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and the base model weights from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained('ChocoLlama/ChocoLlama-2-7B-base')
model = AutoModelForCausalLM.from_pretrained('ChocoLlama/ChocoLlama-2-7B-base')
```
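
A base model is used by continuing a prompt rather than by chatting. The following is a minimal sketch of plain text completion, assuming `torch` and `transformers` are installed and enough memory is available for a 7B model. The helper name `generate_dutch`, the bf16 dtype, and the sampling settings are illustrative choices, not values from the model authors.

```python
def generate_dutch(prompt, max_new_tokens=50):
    """Continue a Dutch prompt with the base model (illustrative helper)."""
    # Imported lazily so that defining this helper neither requires the
    # heavy dependencies nor triggers a model download.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "ChocoLlama/ChocoLlama-2-7B-base"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # Assumption: load weights in bf16 to reduce memory use versus fp32.
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
    model.eval()

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,   # assumed sampling settings, tune as needed
            temperature=0.7,
            top_p=0.9,
        )
    # The returned sequence holds the prompt followed by the continuation.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate_dutch("Brussel is de hoofdstad van")` would then return the prompt followed by a sampled continuation; because sampling is enabled, the text differs between runs.
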

## Training Details

### Training Data

[More Information Needed]

### Training Procedure

This model was fine-tuned using low-rank adaptation (LoRA) with trainable embeddings, for a total of 4% trainable parameters.
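
The 4% figure can be roughly reproduced with a back-of-the-envelope count. The sketch below assumes the public Llama-2-7B dimensions (hidden size 4096, MLP size 11008, 32 layers, vocabulary of 32000 tokens), rank-8 LoRA adapters on the seven projection matrices listed under Training Hyperparameters, and fully trainable `embed_tokens` and `lm_head` matrices. These assumptions are ours and are not confirmed details of the training setup.

```python
# Assumed Llama-2-7B architecture constants (from the public model config).
HIDDEN, MLP, LAYERS, VOCAB = 4096, 11008, 32, 32000
RANK = 8  # LoRA rank from the hyperparameter list


def lora_params(d_in, d_out, r=RANK):
    """A rank-r adapter adds two matrices: (d_in x r) and (r x d_out)."""
    return r * (d_in + d_out)


# Per layer: q, k, v, o are HIDDEN x HIDDEN; gate and up map HIDDEN -> MLP,
# down maps MLP -> HIDDEN (same adapter size as gate and up).
lora_per_layer = 4 * lora_params(HIDDEN, HIDDEN) + 3 * lora_params(HIDDEN, MLP)
lora_total = LAYERS * lora_per_layer

# Assumption: embed_tokens and lm_head are trained in full, not via adapters.
embedding_total = 2 * VOCAB * HIDDEN

# All base weights: attention and MLP matrices, norms, and both embeddings.
base_per_layer = 4 * HIDDEN * HIDDEN + 3 * HIDDEN * MLP + 2 * HIDDEN
base_total = LAYERS * base_per_layer + HIDDEN + embedding_total

trainable = lora_total + embedding_total
fraction = trainable / (base_total + lora_total)
print(f"Trainable share: {fraction:.1%}")
```

Under these assumptions the trainable share comes out at about 4.2%, in line with the 4% stated above. Most of it comes from the two embedding matrices; the adapters themselves add roughly 20M parameters.
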

#### Training Hyperparameters

- **Training regime:** bf16 non-mixed precision
- **Epochs:** 1
- **LoRA parameters:**
  - R: 8
  - Alpha: 32
  - Trainable modules: q_proj, v_proj, k_proj, o_proj, gate_proj, up_proj, down_proj, embed_tokens, lm_head
  - LoRA dropout: 0.05
- **Learning Rate:**
  - Scheduler: StepLR
  - Step size: 6212
  - Learning rate: 0.0003
  - Gamma: 0.85
- **Other parameters:**
  - Minibatch size: 16
  - Gradient accumulation steps: 8
  - Parallelization factor: 8
  - Weight decay: 0
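
The hyperparameters above can be written down as code. This is a sketch under stated assumptions, not the authors' training script: it assumes the `peft` library (the card does not say which toolchain was used), that `embed_tokens` and `lm_head` were trained in full via `modules_to_save`, that the minibatch size is per device, and that the parallelization factor is the number of data-parallel workers. All names in the block are hypothetical.

```python
# Values copied from the hyperparameter list above.
LORA = {"r": 8, "alpha": 32, "dropout": 0.05}
ADAPTED_MODULES = ["q_proj", "v_proj", "k_proj", "o_proj",
                   "gate_proj", "up_proj", "down_proj"]
FULLY_TRAINED_MODULES = ["embed_tokens", "lm_head"]  # assumption: no adapters
BASE_LR, STEP_SIZE, GAMMA = 0.0003, 6212, 0.85
MINIBATCH, GRAD_ACCUM, PARALLEL = 16, 8, 8


def build_lora_config():
    """Build a peft LoraConfig; peft is imported lazily (assumed toolchain)."""
    from peft import LoraConfig
    return LoraConfig(
        r=LORA["r"],
        lora_alpha=LORA["alpha"],
        lora_dropout=LORA["dropout"],
        target_modules=ADAPTED_MODULES,
        modules_to_save=FULLY_TRAINED_MODULES,
        task_type="CAUSAL_LM",
    )


def learning_rate_at(step):
    """StepLR: multiply the learning rate by GAMMA every STEP_SIZE steps."""
    return BASE_LR * GAMMA ** (step // STEP_SIZE)


# Sequences per optimizer update, under the assumptions in the lead-in.
effective_batch_size = MINIBATCH * GRAD_ACCUM * PARALLEL
# Standard LoRA scaling factor applied to the adapter output.
lora_scaling = LORA["alpha"] / LORA["r"]
```

Under these assumptions each optimizer update would see 1024 sequences, the adapter output is scaled by alpha / r = 4, and the learning rate drops from 0.0003 to 0.000255 after the first 6212 scheduler steps.
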

## Evaluation

### Quantitative evaluation

We have evaluated our models on several industry-standard benchmarks, translated from their original versions into Dutch. The results are shown in the table below, together with results from several other prominent Dutch models.

| Model | ARC | HellaSwag | MMLU | TruthfulQA | Avg. |
|----------------------------------------------|----------------|----------------|----------------|----------------|----------------|
| **Llama-3-ChocoLlama-8B-instruct** | **0.48** | **0.66** | **0.49** | **0.49** | **0.53** |
| llama-3-8B-rebatch | 0.44 | 0.64 | 0.46 | 0.48 | 0.51 |
| llama-3-8B-instruct | 0.47 | 0.59 | 0.47 | 0.52 | 0.51 |
| llama-3-8B | 0.44 | 0.64 | 0.47 | 0.45 | 0.50 |
| Reynaerde-7B-Chat | 0.44 | 0.62 | 0.39 | 0.52 | 0.49 |
| **Llama-3-ChocoLlama-8B-base** | **0.45** | **0.64** | **0.44** | **0.44** | **0.49** |
| zephyr-7b-beta | 0.43 | 0.58 | 0.43 | 0.53 | 0.49 |
| geitje-7b-ultra | 0.40 | 0.66 | 0.36 | 0.49 | 0.48 |
| **ChocoLlama-2-7B-tokentrans-instruct** | **0.45** | **0.62** | **0.34** | **0.42** | **0.46** |
| mistral-7b-v0.1 | 0.43 | 0.58 | 0.37 | 0.45 | 0.46 |
| **ChocoLlama-2-7B-tokentrans-base** | **0.42** | **0.61** | **0.32** | **0.43** | **0.45** |
| **ChocoLlama-2-7B-instruct** | **0.36** | **0.57** | **0.33** | **0.45** | **0.43** |
| **ChocoLlama-2-7B-base** | **0.35** | **0.56** | **0.31** | **0.43** | **0.41** |
| llama-2-7b-chat-hf | 0.36 | 0.49 | 0.33 | 0.44 | 0.41 |
| llama-2-7b-hf | 0.36 | 0.51 | 0.32 | 0.41 | 0.40 |

On average, Llama-3-ChocoLlama-8B-instruct surpasses the previous state of the art on these benchmarks.
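
The Avg. column appears to be the unweighted mean of the four benchmark scores. The short check below illustrates this for three rows copied from the table. The averaging rule is our assumption, and the tolerance of 0.005 allows for the scores being rounded to two decimals.

```python
# Rows copied from the table: (ARC, HellaSwag, MMLU, TruthfulQA), reported Avg.
rows = {
    "llama-3-8B": ((0.44, 0.64, 0.47, 0.45), 0.50),
    "geitje-7b-ultra": ((0.40, 0.66, 0.36, 0.49), 0.48),
    "ChocoLlama-2-7B-base": ((0.35, 0.56, 0.31, 0.43), 0.41),
}


def mean(scores):
    """Unweighted mean of the per-benchmark scores."""
    return sum(scores) / len(scores)


# A reported average is consistent if it lies within rounding distance
# of the mean of the (already rounded) per-benchmark scores.
consistent = {
    name: abs(mean(scores) - reported) <= 0.005 + 1e-9
    for name, (scores, reported) in rows.items()
}
print(consistent)
```
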

### Qualitative evaluation

### Compute Infrastructure

All ChocoLlama models were trained on the compute cluster provided by the [Flemish Supercomputer Center (VSC)](https://www.vscentrum.be/). We used 8 to 16 NVIDIA H100 GPUs with 80 GB of VRAM each.