ChocoLlama

A Llama-2/3-based family of Dutch language models

Llama-3-ChocoLlama-8B-instruct: Getting Started

We present Llama-3-ChocoLlama-8B-instruct, an instruction-tuned version of Llama-3-ChocoLlama-8B-base, fine-tuned on a collection of Dutch translations of instruction-tuning datasets using SFT followed by DPO. Its base model, Llama-3-ChocoLlama-8B-base, is a language-adapted version of Meta's Llama-3-8B, fine-tuned with LoRA on the same 104GB Dutch dataset as ChocoLlama-2-7B-base.

Use the code below to get started with the model.

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model; device_map="auto" (requires accelerate)
# places the weights on the available device(s).
model_id = 'ChocoLlama/Llama-3-ChocoLlama-8B-instruct'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Conversation in chat format: an optional system prompt followed by the user turn.
messages = [
    {"role": "system", "content": "Je bent een artificiële intelligentie-assistent en geeft behulpzame, gedetailleerde en beleefde antwoorden op de vragen van de gebruiker."},
    {"role": "user", "content": "Jacques Brel, Willem Elsschot en Jan Jambon zitten op café. Waar zouden ze over babbelen?"},
]

# Apply the model's chat template and open an assistant turn for generation.
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

# Stop on either the regular EOS token or the end-of-turn token.
new_terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = model.generate(
    input_ids,
    max_new_tokens=512,
    eos_token_id=new_terminators,
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
)
# Drop the prompt tokens and decode only the newly generated answer.
response = outputs[0][input_ids.shape[-1]:]
print(tokenizer.decode(response, skip_special_tokens=True))

Note that the datasets used for instruction-tuning were translated using GPT-3.5/4, which means that this instruction-tuned model cannot be used for commercial purposes. For commercial applications, we recommend fine-tuning the base model on your own Dutch data.

Model Details

ChocoLlama is a family of open LLMs specifically adapted to Dutch, contributing to the state of the art among Dutch open LLMs in their weight class.

We provide six variants, three base models and three instruction-tuned models:

  • ChocoLlama-2-7B-base: A language-adapted version of Meta's Llama-2-7b, fine-tuned on 32B Dutch Llama-2 tokens (104GB) using LoRA.
  • ChocoLlama-2-7B-instruct: An instruction-tuned version of ChocoLlama-2-7B-base, fine-tuned on a collection of Dutch translations of instruction-tuning datasets, using SFT followed by DPO.
  • ChocoLlama-2-7B-tokentrans-base: A language-adapted version of Meta's Llama-2-7b, using a Dutch RoBERTa-based tokenizer. The token embeddings of this model were reinitialized using the token translation algorithm proposed by Remy et al. The model was subsequently fine-tuned on the same Dutch dataset as ChocoLlama-2-7B-base, again using LoRA.
  • ChocoLlama-2-7B-tokentrans-instruct: An instruction-tuned version of ChocoLlama-2-7B-tokentrans-base, fine-tuned on the same dataset as ChocoLlama-2-7B-instruct, again using SFT followed by DPO.
  • Llama-3-ChocoLlama-8B-base: A language-adapted version of Meta's Llama-3-8B, fine-tuned on the same Dutch dataset as ChocoLlama-2-7B-base, again using LoRA.
  • Llama-3-ChocoLlama-8B-instruct: An instruction-tuned version of Llama-3-ChocoLlama-8B-base, fine-tuned on the same dataset as ChocoLlama-2-7B-instruct, again using SFT followed by DPO.

For benchmark results for all models, including comparisons with their base models and other Dutch LLMs, see our paper (arXiv:2412.07633).

Model Description

Model Sources

Uses

Direct Use

This is an instruction-tuned (SFT + DPO) Dutch model, optimized for Dutch language generation in conversational settings. For optimal behavior, we advise using the model only with the correct chat template (see the Python code above), optionally with a system prompt.
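To make the role of the chat template concrete, the sketch below renders a conversation into a single prompt string by hand. It assumes the model follows the standard Llama-3 chat layout (header tokens around each role, `<|eot_id|>` closing each turn), which is consistent with the `<|eot_id|>` terminator used in the Python code above but is not confirmed here; `render_llama3_prompt` is a hypothetical helper, and `tokenizer.apply_chat_template` remains the authoritative way to build prompts.

```python
# Hand-written sketch of the standard Llama-3 chat layout.
# ASSUMPTION: this model uses that layout; the tokenizer's own chat
# template is authoritative and should be used in practice.

def render_llama3_prompt(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts into one prompt string."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        # Each turn: role header, blank line, content, end-of-turn token.
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its answer.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "Je bent een behulpzame assistent."},
    {"role": "user", "content": "Wat is de hoofdstad van België?"},
]
prompt = render_llama3_prompt(messages)
print(prompt)
```

Prompting the model with raw text that skips this structure puts it outside the format it saw during SFT and DPO, which is why the template matters.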

Out-of-Scope Use

Use cases requiring understanding or generation of text in languages other than Dutch: the dataset on which this model was fine-tuned does not contain data in other languages, so we expect significant catastrophic forgetting to have occurred for English, the language Llama-3 was primarily trained on.

Bias, Risks, and Limitations

We have taken care to include only widely used and high-quality data in our dataset. Some of this data has been filtered by the original creators. However, we did not explicitly conduct any additional filtering of this dataset with regard to biased or otherwise harmful content.

Training Details

We adopt the same strategy as used to align GEITje-7B to GEITje-7B-ultra. First, we apply supervised fine-tuning (SFT), using the data made available by Vanroy.

Next, we apply Direct Preference Optimization (DPO) to the SFT version of each pretrained model we develop, now using a Dutch version of the data used to train Zephyr-7B-$\beta$, BramVanroy/ultra_feedback_dutch.

For both the SFT and DPO stages, we update all model weights and apply the same set of hyperparameters to all models as used in GEITje-7B-ultra:
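As a rough illustration of what the DPO stage optimizes, the sketch below computes the standard DPO loss for a single preference pair from sequence log-probabilities. The function name and the value `beta=0.1` are assumptions for illustration; the value of $\beta$ used for ChocoLlama is not stated in this card, and actual training relies on the alignment handbook rather than on code like this.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are summed log-probabilities of each full response under the
    policy being trained and under the frozen reference (SFT) model.
    beta=0.1 is an assumed value, not taken from the model card.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: the loss equals log(2).
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
# Policy favors the chosen answer more than the reference does: lower loss.
print(dpo_loss(-5.0, -15.0, -10.0, -10.0))
```

The loss shrinks as the policy raises the likelihood of the preferred answer relative to the rejected one, measured against the reference model.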

  • learning_rate: 5e-07
  • train_batch_size: 4
  • eval_batch_size: 4
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 4
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 64
  • total_eval_batch_size: 16
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 1

Further, we use the publicly available alignment handbook and four NVIDIA A100 (80 GB) GPUs for both stages.
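The batch-size totals in the hyperparameter list follow directly from the per-device values, and the learning-rate settings describe a warmup followed by cosine decay. The sketch below reproduces that arithmetic. The schedule shape (linear warmup, then cosine decay to zero) is the common convention and is assumed here, and `total_steps=1000` is a hypothetical value, since the number of optimizer steps is not given in this card.

```python
import math

# Values taken from the hyperparameter list.
PER_DEVICE_TRAIN_BATCH = 4   # train_batch_size
PER_DEVICE_EVAL_BATCH = 4    # eval_batch_size
NUM_DEVICES = 4              # num_devices
GRAD_ACCUM_STEPS = 4         # gradient_accumulation_steps
PEAK_LR = 5e-7               # learning_rate
WARMUP_RATIO = 0.1           # lr_scheduler_warmup_ratio

# Effective batch sizes: per-device size x devices (x accumulation for training).
total_train_batch = PER_DEVICE_TRAIN_BATCH * NUM_DEVICES * GRAD_ACCUM_STEPS
total_eval_batch = PER_DEVICE_EVAL_BATCH * NUM_DEVICES
print(total_train_batch, total_eval_batch)


def lr_at(step, total_steps):
    """Learning rate at a given optimizer step.

    ASSUMPTION: linear warmup to PEAK_LR, then cosine decay to zero.
    """
    warmup_steps = int(WARMUP_RATIO * total_steps)
    if step < warmup_steps:
        return PEAK_LR * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


total_steps = 1000  # hypothetical; not stated in the model card
for step in (0, 50, 100, 550, 1000):
    print(step, lr_at(step, total_steps))
```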

Evaluation

Quantitative evaluation

We have evaluated our models on several industry-standard Dutch benchmarks, translated from their original versions. The results can be found in the table below, together with results from several other prominent Dutch models.

| Model | ARC | HellaSwag | MMLU | TruthfulQA | Avg. |
|---|---|---|---|---|---|
| Llama-3-ChocoLlama-instruct | 0.48 | 0.66 | 0.49 | 0.49 | 0.53 |
| llama-3-8B-rebatch | 0.44 | 0.64 | 0.46 | 0.48 | 0.51 |
| llama-3-8B-instruct | 0.47 | 0.59 | 0.47 | 0.52 | 0.51 |
| llama-3-8B | 0.44 | 0.64 | 0.47 | 0.45 | 0.50 |
| Reynaerde-7B-Chat | 0.44 | 0.62 | 0.39 | 0.52 | 0.49 |
| Llama-3-ChocoLlama-base | 0.45 | 0.64 | 0.44 | 0.44 | 0.49 |
| zephyr-7b-beta | 0.43 | 0.58 | 0.43 | 0.53 | 0.49 |
| geitje-7b-ultra | 0.40 | 0.66 | 0.36 | 0.49 | 0.48 |
| ChocoLlama-2-7B-tokentrans-instruct | 0.45 | 0.62 | 0.34 | 0.42 | 0.46 |
| mistral-7b-v0.1 | 0.43 | 0.58 | 0.37 | 0.45 | 0.46 |
| ChocoLlama-2-7B-tokentrans-base | 0.42 | 0.61 | 0.32 | 0.43 | 0.45 |
| ChocoLlama-2-7B-instruct | 0.36 | 0.57 | 0.33 | 0.45 | 0.43 |
| ChocoLlama-2-7B-base | 0.35 | 0.56 | 0.31 | 0.43 | 0.41 |
| llama-2-7b-chat-hf | 0.36 | 0.49 | 0.33 | 0.44 | 0.41 |
| llama-2-7b-hf | 0.36 | 0.51 | 0.32 | 0.41 | 0.40 |

On average, Llama-3-ChocoLlama-instruct surpasses the previous state-of-the-art on these benchmarks.
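The Avg. column is the unweighted mean of the four benchmark scores. The sketch below recomputes it for a handful of rows copied from the table and compares the result with the reported value. Because the inputs are already rounded to two decimals, a small tolerance is assumed when comparing; the variable names are illustrative only.

```python
# (ARC, HellaSwag, MMLU, TruthfulQA), reported Avg. -- copied from the table.
rows = {
    "Llama-3-ChocoLlama-instruct": ((0.48, 0.66, 0.49, 0.49), 0.53),
    "llama-3-8B-instruct": ((0.47, 0.59, 0.47, 0.52), 0.51),
    "geitje-7b-ultra": ((0.40, 0.66, 0.36, 0.49), 0.48),
    "ChocoLlama-2-7B-instruct": ((0.36, 0.57, 0.33, 0.45), 0.43),
}

# Unweighted mean over the four benchmarks.
recomputed = {name: sum(scores) / len(scores) for name, (scores, _) in rows.items()}

# Largest gap between the recomputed and reported averages.
max_gap = max(abs(recomputed[name] - reported) for name, (_, reported) in rows.items())

best = max(recomputed, key=recomputed.get)
for name, avg in sorted(recomputed.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:30s} recomputed={avg:.4f} reported={rows[name][1]:.2f}")
print("best:", best)
```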

Qualitative evaluation

In our paper, we also provide an additional qualitative evaluation of all models, which we empirically find more reliable. For details, we refer to the paper and to our benchmark ChocoLlama-Bench.

Compute Infrastructure

All ChocoLlama models have been trained on the compute cluster provided by the Flemish Supercomputer Center (VSC). We used 8 to 16 NVIDIA A100 GPUs with 80 GB of VRAM.

Citation

If you found this useful for your work, kindly cite our paper:

@article{meeus2024chocollama,
  title={ChocoLlama: Lessons Learned From Teaching Llamas Dutch},
  author={Meeus, Matthieu and Rath{\'e}, Anthony and Remy, Fran{\c{c}}ois and Delobelle, Pieter and Decorte, Jens-Joris and Demeester, Thomas},
  journal={arXiv preprint arXiv:2412.07633},
  year={2024}
}