GPT-2 finetuned on a French dataset
Tokenizer
We first trained a tokenizer on OSCAR's `unshuffled_original_fr` French data subset, following the training procedure of the GPT-2 tokenizer (same vocabulary size of 50,257). The Python file used for the training is available here; a rough sketch of the process is shown below.
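A minimal sketch of that step, assuming the `tokenizers` library's `ByteLevelBPETokenizer` and streaming access to the corpus through `datasets`; the batch size, `min_frequency`, and output directory are illustrative assumptions, not the exact settings used here:

```python
import os

from datasets import load_dataset
from tokenizers import ByteLevelBPETokenizer

# Stream the French OSCAR subset so the full corpus never has to fit in memory.
dataset = load_dataset("oscar", "unshuffled_original_fr", split="train", streaming=True)

def batch_iterator(batch_size=1_000):
    """Yield batches of raw text for the tokenizer trainer."""
    batch = []
    for example in dataset:
        batch.append(example["text"])
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

# GPT-2 uses a byte-level BPE tokenizer with a 50,257-token vocabulary and a
# single special token, <|endoftext|>.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(
    batch_iterator(),
    vocab_size=50257,
    min_frequency=2,
    special_tokens=["<|endoftext|>"],
)

# The output directory name is illustrative.
os.makedirs("gpt2-fr-tokenizer", exist_ok=True)
tokenizer.save_model("gpt2-fr-tokenizer")
```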
Model
We finetuned the `wte` (token embedding) and `wpe` (positional embedding) layers of GPT-2, while freezing the parameters of all other layers, on OSCAR's `unshuffled_original_fr` French data subset; a sketch of this layer-freezing step is shown below.
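A minimal sketch of the freezing step, assuming the `transformers` library and the standard `gpt2` checkpoint as the starting point; the actual fine-tuning script may wire this up differently:

```python
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")

# Keep only the token embeddings (transformer.wte) and positional embeddings
# (transformer.wpe) trainable; freeze every other parameter.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith(("transformer.wte", "transformer.wpe"))

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable:,} / {total:,}")
```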
We used Hugging Face's fine-tuning code for the causal language model GPT-2, with the following parameters changed:
- preprocessing_num_workers: 8
- per_device_train_batch_size: 2
- gradient_accumulation_steps: 4
- per_device_eval_batch_size: 2
- eval_accumulation_steps: 4
- eval_steps: 1000
- evaluation_strategy: "steps"
- max_eval_samples: 5000
Setup: 8 RTX 3090 GPUs, trained for seven days (total training steps: 110,500; effective train batch size: 64, i.e. 8 GPUs × 2 per device × 4 gradient-accumulation steps; tokens per batch: 1024)
Final checkpoint: checkpoint-111500
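For completeness, a checkpoint like this can be loaded for text generation as in the sketch below; `checkpoint-111500` is used as a placeholder path and should be replaced with the actual model directory or hub repository id:

```python
from transformers import GPT2LMHeadModel, GPT2TokenizerFast, pipeline

# "checkpoint-111500" is assumed to be a local directory holding the saved
# model and tokenizer files; substitute the real path or repository id.
model = GPT2LMHeadModel.from_pretrained("checkpoint-111500")
tokenizer = GPT2TokenizerFast.from_pretrained("checkpoint-111500")

generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(generator("La France est", max_new_tokens=30)[0]["generated_text"])
```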