
araft_trained_dpo

This model was obtained by fine-tuning araft_trained_sft with DPO. Trajectories from the Araft dataset were used to adapt the model to issue a novel query at every step instead of repeating the query from the previous step.

Model description

This model was produced as part of the Araft project. The Araft project consists of fine-tuning a Llama2-7B model to enable the ReAct pattern for Wikipedia-augmented question answering. This model is the product of the second and final training step: DPO training.

In the DPO training step, the trajectories from the Araft dataset were used to fine-tune the model. For each trajectory, every step was taken as the desired (chosen) output given the preceding part of the trajectory, while a repetition of the previous step was taken as the undesired (rejected) output. The model achieves 26% performance (F1 score) on the HotpotQA dataset.
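
As a rough illustration of how such preference pairs can be built from a ReAct trajectory, here is a minimal sketch; the field names ("prompt", "chosen", "rejected") follow the common DPO dataset convention and are assumptions, since the actual Araft preprocessing code lives in the project repo:

```python
# Minimal sketch of turning one ReAct trajectory into DPO preference pairs.
# Each step becomes the chosen continuation of the trajectory so far, while
# a repetition of the previous step becomes the rejected continuation.

def trajectory_to_pairs(steps):
    """steps: list of strings, each one a Thought/Action/Observation step."""
    pairs = []
    for i in range(1, len(steps)):
        prompt = "\n".join(steps[:i])      # trajectory up to this point
        pairs.append({
            "prompt": prompt,
            "chosen": steps[i],            # the genuinely new next step
            "rejected": steps[i - 1],      # simply repeating the previous step
        })
    return pairs
```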

For further information, please see the Araft GitHub repo.
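
Since the card lists PEFT among the framework versions, the model can presumably be loaded as an adapter on top of its base model roughly as follows. This is a sketch only: the repository ids below are placeholders, not confirmed paths.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Placeholder ids; substitute the actual base model and adapter repositories.
base_id = "meta-llama/Llama-2-7b-hf"
adapter_id = "path/to/araft_trained_dpo"

tokenizer = AutoTokenizer.from_pretrained(base_id)
base_model = AutoModelForCausalLM.from_pretrained(base_id)

# Attach the DPO-trained adapter weights to the base model.
model = PeftModel.from_pretrained(base_model, adapter_id)
```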

Training hyperparameters

The following hyperparameters were used during training (a sketch of the corresponding configuration follows the list):

  • learning_rate: 5e-05
  • train_batch_size: 1
  • eval_batch_size: 8
  • seed: 42
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 4
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_steps: 100
  • num_epochs: 1
  • mixed_precision_training: Native AMP
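
For reference, these settings map roughly onto the following transformers.TrainingArguments. This is a sketch under the assumption that a DPO trainer (e.g. TRL's DPOTrainer) consumed them; the exact training script is in the Araft repo.

```python
from transformers import TrainingArguments

# Approximate reconstruction of the hyperparameters listed above.
training_args = TrainingArguments(
    output_dir="araft_trained_dpo",
    learning_rate=5e-5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=8,
    seed=42,
    gradient_accumulation_steps=4,   # effective train batch size: 4
    lr_scheduler_type="cosine",
    warmup_steps=100,
    num_train_epochs=1,
    fp16=True,                       # mixed precision (native AMP)
)
```

The listed Adam betas (0.9, 0.999) and epsilon (1e-08) are the optimizer defaults, so they are not set explicitly in the sketch.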

Framework versions

  • PEFT 0.10.0
  • Transformers 4.38.2
  • Pytorch 2.2.1+cu121
  • Datasets 2.18.0
  • Tokenizers 0.15.2