---
license: cc-by-nc-4.0
library_name: peft
tags:
- alignment-handbook
- generated_from_trainer
- trl
- sft
- geitje
- fingeitje
- dutch
- nl
- finance
base_model: BramVanroy/GEITje-7B-ultra
datasets:
- snoels/FinGEITje-sft
model-index:
- name: snoels/FinGEITje-7B-sft
  results: []
language:
- nl
pipeline_tag: text-generation
inference: false
---


# 🐐 FinGEITje 7B

A large open Dutch financial language model.
This model is a fine-tuned version of [BramVanroy/GEITje-7B-ultra](https://huggingface.co/BramVanroy/GEITje-7B-ultra) on the [snoels/FinGEITje-sft](https://huggingface.co/datasets/snoels/FinGEITje-sft) dataset.

## 📖 Model Description

FinGEITje 7B is a large open Dutch financial language model with 7 billion parameters, based on Mistral 7B. It has been further trained on Dutch financial texts, enhancing its proficiency in the Dutch language and its knowledge of financial topics. As a result, FinGEITje provides more accurate and relevant responses in the domain of finance.

## 📊 Training and Evaluation Data

### Training Data

FinGEITje 7B was fine-tuned on the [snoels/FinGEITje-sft](https://huggingface.co/datasets/snoels/FinGEITje-sft) dataset, which consists of translated and processed Dutch financial texts. This dataset includes a wide range of financial topics and instruction tuning data.

#### Data Processing Steps

1. **Translation**: Original instruction tuning datasets were translated into Dutch using a specialized translation service to maintain the integrity of financial terminology.
2. **Post-processing**: The translated data underwent post-processing to correct any translation inconsistencies and to format it according to the original dataset structure.
3. **Formatting**: The data was formatted to match the style and requirements of instruction tuning datasets, ensuring compatibility with the fine-tuning process.
4. **Filtering**: A Dutch language check and predefined validation checks were applied to filter out any low-quality or irrelevant data.

### Evaluation Data

The model was evaluated using:

- **[snoels/FinDutchBench](https://huggingface.co/datasets/snoels/FinDutchBench)**: A Dutch financial benchmark dataset designed to assess the model's performance on various financial tasks.

## ⚙️ Training Procedure

FinGEITje was trained following the methodology described in the [Alignment Handbook](https://github.com/huggingface/alignment-handbook).

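
The hyperparameters reported in this section include a peak learning rate of 0.0002, a cosine scheduler, and a warmup ratio of 0.1, and the results table reports 3922 steps for the single epoch. As a rough illustration of what such a schedule looks like, here is a minimal sketch in plain Python. It assumes the common shape of linear warmup followed by a half-cosine decay to zero, with the warmup length rounded up; the card does not state the exact scheduler implementation, so `lr_at` and its defaults are illustrative names rather than the training code itself.

```python
import math

PEAK_LR = 2e-4       # "Learning Rate" from the hyperparameter list
WARMUP_RATIO = 0.1   # "Warmup Ratio" from the hyperparameter list
TOTAL_STEPS = 3922   # final step reported in the training results table


def lr_at(step, peak_lr=PEAK_LR, total_steps=TOTAL_STEPS, warmup_ratio=WARMUP_RATIO):
    """Learning rate at a given optimizer step (assumed schedule shape)."""
    # Assumption: the warmup length is the ratio times the total steps, rounded up.
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    # Half-cosine decay from the peak down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for step in (0, 100, 393, 2000, 3922):
        print(f"step {step:>4}: lr = {lr_at(step):.2e}")
```

Under these assumptions the learning rate climbs during roughly the first tenth of training, peaks once, and then decays smoothly for the rest of the epoch.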
### Training Configuration

- The training configuration is based on the recipe outlined in the alignment handbook and can be found in the [config_qlora.yaml](https://github.com/snoels/fingeit/blob/master/src/training/sft/config_qlora.yaml) file.
- The model was further trained using **QLoRA** (Quantized LoRA) for efficient fine-tuning with reduced computational resources.

### Training Hyperparameters

The following hyperparameters were used during training:

- **Learning Rate**: 0.0002
- **Train Batch Size**: 4
- **Evaluation Batch Size**: 8
- **Seed**: 42
- **Distributed Type**: Multi-GPU
- **Gradient Accumulation Steps**: 2
- **Total Train Batch Size**: 8
- **Optimizer**: Adam with betas=(0.9, 0.999) and epsilon=1e-08
- **LR Scheduler Type**: Cosine
- **Warmup Ratio**: 0.1
- **Number of Epochs**: 1

### Training Results

| Training Loss | Epoch | Step | Validation Loss |
|---------------|-------|------|-----------------|
| 0.406         | 1.0   | 3922 | 0.3928          |

### Evaluation Package

The evaluation package includes a set of metrics defined per task, grouped per dataset to evaluate the model's performance across different financial domains. The evaluation notebooks are available:

- **[Evaluation in Dutch](https://github.com/snoels/fingeit/blob/master/notebooks/evaluation_nl.ipynb)**: Assesses the model's performance on the Dutch financial benchmark dataset.
- **[Evaluation in English](https://github.com/snoels/fingeit/blob/master/notebooks/evaluation_en.ipynb)**: Evaluates the model's performance on English financial benchmarks for comparison purposes.

### Framework Versions

- **PEFT**: 0.7.1
- **Transformers**: 4.39.0.dev0
- **PyTorch**: 2.1.2
- **Datasets**: 2.14.6
- **Tokenizers**: 0.15.2

## 🛠️ How to Use

FinGEITje 7B can be used with the Hugging Face Transformers library, together with PEFT to load the LoRA adapters efficiently.

### Installation

Ensure you have the necessary libraries installed:

```bash
pip install torch transformers peft accelerate
```

### Loading the Model

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("BramVanroy/GEITje-7B-ultra", use_fast=False)

# Load the base model
base_model = AutoModelForCausalLM.from_pretrained("BramVanroy/GEITje-7B-ultra", device_map='auto')

# Load the FinGEITje model with PEFT adapters
model = PeftModel.from_pretrained(base_model, "snoels/FinGEITje-7B-sft", device_map='auto')
```

### Generating Text

```python
# Prepare the input
input_text = "Wat zijn de laatste trends in de Nederlandse banksector?"
input_ids = tokenizer.encode(input_text, return_tensors='pt').to(model.device)

# Generate a response
outputs = model.generate(input_ids, max_length=200, num_return_sequences=1)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```

## 🚧 Limitations and Future Work

While FinGEITje 7B demonstrates significant improvements in understanding and generating Dutch financial content, certain limitations exist:

- **Data Cutoff**: The model's knowledge is limited to the data it was trained on and may not include the most recent developments in the financial sector.
- **Accuracy Concerns**: The model may generate incorrect or outdated information. Users should verify critical information with reliable sources.
- **Biases**: Potential biases in the training data may affect the neutrality and fairness of the model's responses.
- **Language Scope**: Primarily designed for Dutch; performance in other languages is not optimized.
- **Ethical Use**: Users should ensure that the model's outputs comply with ethical standards and do not promote misinformation or harmful content.

### Future Work

- **Data Updates**: Incorporate more recent and diverse financial datasets to keep the model up-to-date.
- **Bias Mitigation**: Implement techniques to identify and reduce biases in the model's outputs.
- **Performance Enhancement**: Fine-tune on more specialized financial topics and complex financial tasks.
- **Multilingual Expansion**: Extend support to other languages relevant to the financial sector in the Netherlands and Europe.

## 🙏 Acknowledgements

We would like to thank:

- **Rijgersberg** ([GitHub](https://github.com/Rijgersberg)) for creating [GEITje](https://github.com/Rijgersberg/GEITje), one of the first Dutch foundation models, and for contributing significantly to the development of Dutch language models.
- **Bram Vanroy** ([GitHub](https://github.com/BramVanroy)) for creating [GEITje-7B-ultra](https://huggingface.co/BramVanroy/GEITje-7B-ultra), an open-source Dutch chat model, and for sharing training, translation, and evaluation resources.
- **Contributors of the [Alignment Handbook](https://github.com/huggingface/alignment-handbook)** for providing valuable resources that guided the development and training process of FinGEITje.
- **Silverfin** for their collaboration in this research. Silverfin, a Belgian scale-up focused on building an accountancy cloud service, provided valuable insights and resources that were instrumental in the development of FinGEITje. More about their work can be found at [Silverfin](https://silverfin.com/).

## 📝 Citation

[Link to the paper](https://arxiv.org/abs/2410.12835)

If you use FinGEITje in your work, please cite:

```bibtex
@article{FinGEITje2024,
  title={A Dutch Financial Large Language Model},
  author={Noels, Sander and De Blaere, Jorne and De Bie, Tijl},
  journal={arXiv preprint arXiv:2410.12835},
  year={2024},
  url={https://arxiv.org/abs/2410.12835}
}
```

## 📜 License

This model is licensed under the [Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)](https://creativecommons.org/licenses/by-nc/4.0/) license.

## 📧 Contact

For any inquiries or questions, please contact [Sander Noels](mailto:sander.noels@ugent.be).