|
--- |
|
license: apache-2.0 |
|
base_model: tiiuae/falcon-7b |
|
datasets: |
|
- yhavinga/mc4_nl_cleaned |
|
model-index: |
|
- name: falcon-7b-ft-mc4_nl_cleaned_tiny |
|
results: [] |
|
language: |
|
- nl |
|
inference: false |
|
tags: |
|
- falcon |
|
--- |
|
|
|
|
|
# falcon-7b-ft-mc4_nl_cleaned_tiny |
|
|
|
This model is a fine-tuned version of [tiiuae/falcon-7b](https://huggingface.co/tiiuae/falcon-7b)

on the [yhavinga/mc4_nl_cleaned](https://huggingface.co/datasets/yhavinga/mc4_nl_cleaned/viewer/tiny/train) dataset (`tiny` partition) with a context length of 2048 tokens.

See the original [tiiuae/falcon-7b](https://huggingface.co/tiiuae/falcon-7b) model card for more information, intended uses, and biases.
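
A minimal sketch of loading the merged model for generation with 🤗 Transformers. The repository id below is a placeholder, since this card does not state the namespace; depending on your `transformers` version, Falcon may require `trust_remote_code=True`:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repository id: replace with the actual namespace of this model.
model_id = "your-namespace/falcon-7b-ft-mc4_nl_cleaned_tiny"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,  # Falcon used custom modeling code around transformers 4.31
)

# "The most beautiful thing about the Netherlands is ..."
inputs = tokenizer("Het mooiste aan Nederland is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```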
|
|
|
|
|
## Intended uses & limitations |
|
|
|
This model is intended as a (poor) baseline for Dutch generative LLMs. It does not aim for SOTA performance and is intended specifically for research purposes.
|
|
|
Importantly, the original Falcon 7B model was trained only on English and French, so Dutch generations should be taken with a massive grain of salt. I

wanted to see whether performance would be reasonable after finetuning the model on a Dutch dataset. I find that it is okay but not great; in particular, the generations often lack coherence.
|
|
|
## Training and evaluation data |
|
|
|
Trained on the [yhavinga/mc4_nl_cleaned](https://huggingface.co/datasets/yhavinga/mc4_nl_cleaned/viewer/tiny/train) dataset (`tiny` partition) for one epoch. The canonical

validation split was not used; instead, 5% of the `train` split was held out as validation.
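
A sketch of how such a held-out split can be created with 🤗 Datasets. This is a reconstruction from the description above, not the exact preprocessing script:

```python
from datasets import load_dataset

# Load the `tiny` configuration of the cleaned Dutch mC4 dataset.
dataset = load_dataset("yhavinga/mc4_nl_cleaned", "tiny", split="train")

# Hold out 5% of `train` as validation instead of the canonical validation split.
split = dataset.train_test_split(test_size=0.05, seed=42)
train_ds, valid_ds = split["train"], split["test"]
```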
|
|
|
At a context length of 2048 tokens, the training set comprised around 2M (2,008,858) samples. Training for one epoch therefore covered

around 4B Dutch tokens (`2048 * 2,008,858 = 4,114,141,184`).
|
|
|
|
|
## Training procedure |
|
|
|
Trained with LoRA targeting `['query_key_value', 'dense', 'dense_h_to_4h', 'dense_4h_to_h']` in 4-bit precision; the adapters were merged into the base model before upload.

The unmerged adapters are available in the `adapters` branch.
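
A minimal sketch of the corresponding PEFT/bitsandbytes setup. This is a reconstruction from the description above, not the exact training script; the LoRA rank and alpha are assumptions, as they are not reported on this card:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the base model in 4-bit precision.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b",
    quantization_config=bnb_config,
    trust_remote_code=True,
)
model = prepare_model_for_kbit_training(model)

# Attach LoRA adapters to the target modules listed above.
# r and lora_alpha are assumed values, not reported on this card.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["query_key_value", "dense", "dense_h_to_4h", "dense_4h_to_h"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
```

After training, the adapters can be merged into the base weights (e.g. with `peft`'s `merge_and_unload()`) to produce the standalone model published here.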
|
|
|
### Training hyperparameters |
|
|
|
The following hyperparameters were used during training: |
|
- learning_rate: 0.0003 |
|
- train_batch_size: 12 |
|
- eval_batch_size: 24 |
|
- seed: 42 |
|
- distributed_type: multi-GPU |
|
- num_devices: 16 |
|
- gradient_accumulation_steps: 6 |
|
- total_train_batch_size: 1152 |
|
- total_eval_batch_size: 384 |
|
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08 |
|
- lr_scheduler_type: cosine |
|
- lr_scheduler_warmup_ratio: 0.03 |
|
- num_epochs: 1 |
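
Expressed as 🤗 `TrainingArguments`, these settings would look roughly as follows. This is a reconstruction, not the exact training script; per-device batch sizes assume the 16-GPU setup above, and the mixed-precision setting is an assumption:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="falcon-7b-ft-mc4_nl_cleaned_tiny",
    learning_rate=3e-4,
    per_device_train_batch_size=12,
    per_device_eval_batch_size=24,
    gradient_accumulation_steps=6,
    num_train_epochs=1,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    seed=42,
    bf16=True,  # assumption: the precision used is not reported on this card
)
# Effective train batch size: 12 per device * 16 GPUs * 6 accumulation steps = 1152.
```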
|
|
|
### Training results |
|
|
|
| Training Loss | Epoch | Step | Validation Loss | |
|
|:-------------:|:-----:|:----:|:---------------:| |
|
| 2.6094 | 0.1 | 170 | 2.5980 | |
|
| 2.4503 | 0.19 | 340 | 2.4405 | |
|
| 2.3243 | 0.29 | 510 | 2.3428 | |
|
| 2.2822 | 0.39 | 680 | 2.2752 | |
|
| 2.238 | 0.49 | 850 | 2.2248 | |
|
| 2.2015 | 0.58 | 1020 | 2.1865 | |
|
| 2.1678 | 0.68 | 1190 | 2.1560 | |
|
| 2.1301 | 0.78 | 1360 | 2.1312 | |
|
| 2.1161 | 0.88 | 1530 | 2.1112 | |
|
| 2.0997 | 0.97 | 1700 | 2.0928 | |
|
|
|
|
|
### Framework versions |
|
|
|
- Transformers 4.31.0.dev0 |
|
- Pytorch 2.0.1+cu117 |
|
- Datasets 2.13.1 |
|
- Tokenizers 0.13.3 |