BramVanroy committed commit 1794d9c (parent: e86ae6c): Update README.md

README.md CHANGED
@@ -1,50 +1,152 @@

Before:

---
license: cc-by-nc-4.0
base_model: BramVanroy/GEITje-ultra-sft
tags:
- alignment-handbook
- generated_from_trainer
- trl
- dpo
datasets:
- BramVanroy/ultra_feedback_dutch
model-index:
- name: GEITje-ultra
  results: []
---

This model is a fine-tuned version of [BramVanroy/GEITje-ultra-sft](https://huggingface.co/BramVanroy/GEITje-ultra-sft) on the BramVanroy/ultra_feedback_dutch dataset.
It achieves the following results on the evaluation set:
- Loss: 0.0138
- Rewards/chosen: -2.1351
- Rewards/rejected: -13.8922
- Rewards/accuracies: 0.9950
- Rewards/margins: 11.7570
- Logps/rejected: -565.1809
- Logps/chosen: -519.8008
- Logits/rejected: -3.0261
- Logits/chosen: -2.9779

## Model description

## Intended uses & limitations

## Training and evaluation data

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:

@@ -77,4 +179,4 @@ The following hyperparameters were used during training:
- Transformers 4.36.2
- Pytorch 2.1.2+cu121
- Datasets 2.14.6
- Tokenizers 0.15.0

After:

---
license: cc-by-nc-4.0
base_model: BramVanroy/GEITje-7B-ultra-sft
tags:
- alignment-handbook
- generated_from_trainer
- trl
- dpo
- geitje
datasets:
- BramVanroy/ultra_feedback_dutch
model-index:
- name: BramVanroy/GEITje-7B-ultra
  results: []
language:
- nl
pipeline_tag: conversational
---

<img src="https://huggingface.co/BramVanroy/GEITje-ultra/resolve/main/geitje-ultra-banner.png" alt="GEITje Ultra banner" width="800" style="margin-left: auto; margin-right: auto; display: block;"/>

# GEITje 7B ultra

**A conversational model, aligned through AI feedback.**

This model is a fine-tuned version of [BramVanroy/GEITje-ultra-sft](https://huggingface.co/BramVanroy/GEITje-ultra-sft), aligned with DPO on a synthetic Dutch preference dataset of around 56M tokens that was generated with gpt-4-turbo and [Rijgersberg/GEITje-7B-chat](https://huggingface.co/Rijgersberg/GEITje-7B-chat).

## Model description

This is a Dutch instruction/chat model, ultimately based on Mistral and aligned with AI feedback via DPO. It is a DPO continuation of the SFT-trained [BramVanroy/GEITje-ultra-sft](https://huggingface.co/BramVanroy/GEITje-ultra-sft), which in turn builds on [Rijgersberg/GEITje-7B](https://huggingface.co/Rijgersberg/GEITje-7B), a version of Mistral 7B that was further pretrained on Dutch data. In (rather naive) [benchmarks](https://huggingface.co/spaces/BramVanroy/open_dutch_llm_leaderboard) it outperforms all the original GEITje models on average and ties with the powerful Zephyr model by Hugging Face. Note, however, that these benchmarks should be taken with a massive grain of salt (see the disclaimer below the benchmarks on that page).

## Usage

One-off:

```python
from transformers import pipeline, Conversation

# load_in_8bit: lower precision but saves a lot of GPU memory
# device_map=auto: loads the model across multiple GPUs
chatbot = pipeline("conversational", model="BramVanroy/GEITje-7B-ultra", model_kwargs={"load_in_8bit": True}, device_map="auto")

start_messages = [
    {"role": "system", "content": "Je bent een grappige chatbot die Bert heet. Je maakt vaak mopjes."},
    {"role": "user", "content": "Hallo, ik ben Bram. Ik wil vanavond graag een film kijken. Heb je enkele suggesties?"}
]
conversation = Conversation(start_messages)
conversation = chatbot(conversation)
response = conversation.messages[-1]["content"]
print(response)
```

Interactive conversation:

```python
from transformers import pipeline, Conversation

# load_in_8bit: lower precision but saves a lot of memory
# device_map=auto: loads the model across multiple GPUs
# attn_implementation: uses flash attention, if your device supports it - otherwise remove it
chatbot = pipeline("conversational", model="BramVanroy/GEITje-7B-ultra", model_kwargs={"load_in_8bit": True, "attn_implementation": "flash_attention_2"}, device_map="auto")

while (system_message := input("System message ('q' to quit): ")) != "q":
    start_messages = [
        {"role": "system", "content": system_message},
    ]
    conversation = Conversation(start_messages)
    while (user_input := input("User ('r' to reset): ")) != "r":
        conversation.add_user_input(user_input)
        conversation = chatbot(conversation)
        response = conversation.messages[-1]["content"]
        print("Assistant:", response)
```
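
If you prefer not to use the `conversational` pipeline, the same kind of interaction can be sketched with `AutoModelForCausalLM` and the tokenizer's chat template. This is an illustrative alternative, not part of the original card; the example messages, dtype, and sampling settings are assumptions you may want to adjust to your hardware and use case.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "BramVanroy/GEITje-7B-ultra"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")

# Example Dutch conversation: "You are a helpful assistant." / "Give me three movie tips for tonight."
messages = [
    {"role": "system", "content": "Je bent een behulpzame assistent."},
    {"role": "user", "content": "Geef me drie filmtips voor vanavond."},
]

# Build the prompt with the model's chat template and generate a reply
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=512, do_sample=True, temperature=0.7)
response = tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
```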

## Intended uses & limitations

Although the model has been aligned with gpt-4-turbo output, which has strong content filters, the model could still generate wrong, misleading, and potentially even offensive content. Use at your own risk.

Because the model was trained on synthetic data created with OpenAI/Azure services, this model cannot be used for commercial purposes.

## Training and evaluation data

The training data consists of a synthetic dataset based on [UltraFeedback binarized](https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized), created with gpt-4-turbo and GEITje-7B-chat. Each prompt, translated from the original dataset, is given to both models, which each generate an answer. The gpt-4-turbo answer is then always selected as the "chosen" answer that DPO optimises towards; a sketch of this pairing is given below. While this is not completely fair, I did not have the budget to actually have gpt-4 rate both replies. Furthermore, while GEITje-chat is an impressive model, it still seems to lag behind gpt-4-turbo in the testing that I have done.

In total the dataset consists of 56,137,090 training tokens (combination of prompt + rejected + chosen) and a test set of 6,178,969 tokens (around 11.00% of the training set size).

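To make that pairing concrete, here is a minimal sketch of how one such preference record could be assembled. The record layout (`prompt`, `chosen`, `rejected` as message lists) follows the UltraFeedback-binarized convention and is an assumption about the dataset format; the generator callables are hypothetical stand-ins for the gpt-4-turbo and GEITje-chat backends, not the actual scripts used to build this dataset.

```python
from typing import Callable

def build_preference_record(
    prompt: str,
    generate_chosen: Callable[[str], str],    # hypothetical gpt-4-turbo backend
    generate_rejected: Callable[[str], str],  # hypothetical GEITje-7B-chat backend
) -> dict:
    """Pair two answers to the same Dutch prompt; the gpt-4-turbo answer is always 'chosen'."""
    return {
        "prompt": prompt,
        "chosen": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": generate_chosen(prompt)},
        ],
        "rejected": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": generate_rejected(prompt)},
        ],
    }

# Toy usage with dummy generators, only to show the record shape:
record = build_preference_record(
    "Wat is de hoofdstad van België?",
    generate_chosen=lambda p: "De hoofdstad van België is Brussel.",
    generate_rejected=lambda p: "Brussel.",
)
print(record["chosen"][-1]["content"])
```
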
## Training procedure

The great [alignment handbook](https://github.com/huggingface/alignment-handbook/) was used for training, with a custom Slurm script for compatibility with our cluster. The model was trained in full, without LoRA or other adapters.

Training was done in bfloat16 with flash attention 2 on two nodes of four A100 80GB GPUs each, for around 11 hours. I thank the [Flemish Super Computer](https://www.vscentrum.be/compute) for their compute.

For conversational usage, the model relies on the Zephyr chat template, which is compatible with system messages. A small portion of the SFT model's training data contained system messages, so it is assumed that the model can handle system messages at least to some extent. A sketch of how the template is applied is given below.
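
As a quick way to see what that template produces (a minimal sketch, assuming the tokenizer in this repository exposes the Zephyr template via `apply_chat_template`):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BramVanroy/GEITje-7B-ultra")

# Example Dutch messages: "You are a helpful assistant." / "What is DPO?"
messages = [
    {"role": "system", "content": "Je bent een behulpzame assistent."},
    {"role": "user", "content": "Wat is DPO?"},
]

# Render the Zephyr-style prompt as a string so it can be inspected;
# add_generation_prompt=True appends the marker for the assistant's turn.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```

The `conversational` pipeline in the usage examples above should apply this same template under the hood.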

In earlier iterations I found that using the alignment handbook's default `beta=0.01` led to poor results (hallucinations of random tokens). After investigating, it seems that such a low beta does not work well for this dataset, as it gives the model too much room to deviate from its initial base model (see the objective sketched below). After a [hyperparameter search](https://huggingface.co/posts/BramVanroy/492522322273746) and manual analysis of the resulting metrics, I selected the current model as the best one, with a beta of 0.1.
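
For reference, this is the standard DPO objective (after Rafailov et al., 2023; not stated in the original card), which shows where beta enters: it scales how strongly the policy is tied to the reference (SFT) model, so a very small beta leaves more room to drift away from it.

```latex
% pi_theta is the policy being trained, pi_ref the frozen SFT reference,
% and (x, y_w, y_l) a prompt with its chosen and rejected completions.
\mathcal{L}_{\mathrm{DPO}}(\theta) =
  -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
  \right]
```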

Recipe used with the handbook:

```yaml
# Model arguments
model_name_or_path: BramVanroy/GEITje-7B-ultra-sft
model_revision: main
torch_dtype: bfloat16
use_flash_attention_2: true

# Data training arguments
# For definitions, see: src/h4/training/config.py
dataset_mixer:
  BramVanroy/ultra_feedback_dutch: 1.0
dataset_splits:
- train_prefs
- test_prefs
preprocessing_num_workers: 8

# DPOTrainer arguments
bf16: true
beta: 0.1
do_eval: true
evaluation_strategy: steps
eval_steps: 100
gradient_accumulation_steps: 4
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: False
hub_model_id: BramVanroy/GEITje-ultra
learning_rate: 5.0e-7
log_level: info
logging_steps: 10
lr_scheduler_type: cosine
max_length: 2048
max_prompt_length: 1536
num_train_epochs: 1
optim: adamw_torch
output_dir: data/GEITje-ultra
per_device_train_batch_size: 4
per_device_eval_batch_size: 4
push_to_hub: true
save_strategy: "steps"
save_steps: 100
save_total_limit: 3
seed: 42
warmup_ratio: 0.1
```
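
As a quick sanity check, this recipe together with the hardware mentioned above implies the following global batch size (assuming plain data parallelism over all eight GPUs, which is an assumption on my part):

```python
# Global batch size implied by the recipe above, assuming data parallelism
# over 2 nodes x 4 A100s = 8 GPUs.
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
num_gpus = 2 * 4

effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
print(effective_batch_size)  # 128
```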

### Training hyperparameters

The following hyperparameters were used during training:

@@ -77,4 +179,4 @@ The following hyperparameters were used during training:
- Transformers 4.36.2
- Pytorch 2.1.2+cu121
- Datasets 2.14.6
- Tokenizers 0.15.0