metadata

library_name: transformers
license: mit
base_model: openai-community/gpt2
tags:
  - trl
  - dpo
  - generated_from_trainer
datasets:
  - piqa
model-index:
  - name: dpo-model-output
    results: []

dpo-model-output

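This model is a version of openai-community/gpt2 fine-tuned on the piqa dataset with DPO (Direct Preference Optimization) using TRL, as indicated by the tags above. It achieves the following results on the evaluation set at the final checkpoint (epoch 3, step 9669):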

Loss: 1.0022
Rewards/chosen: -17.7008
Rewards/rejected: -19.2025
Rewards/accuracies: 0.6638
Rewards/margins: 1.5016
Logps/rejected: -283.3009
Logps/chosen: -266.5706
Logits/rejected: -79.0250
Logits/chosen: -79.1854
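
As a quick sanity check on the metrics above, the reward margin should equal the chosen reward minus the rejected reward, up to rounding. This is a minimal sketch that assumes the margin is logged as the mean of per-example differences over the same evaluation examples, so the difference of the two reported means should reproduce it. The variable names are illustrative only.

```python
# Reported evaluation metrics, copied from the list above.
rewards_chosen = -17.7008
rewards_rejected = -19.2025
reported_margin = 1.5016

# Assumption: the margin is the mean of per-example (chosen - rejected)
# differences, so the difference of the two reported means should match it.
computed_margin = rewards_chosen - rewards_rejected

# Each reported value is rounded to 4 decimals, so allow a small tolerance.
is_consistent = abs(computed_margin - reported_margin) < 1e-3

print(f"computed margin: {computed_margin:.4f}")
print(f"consistent with reported margin: {is_consistent}")
```

A positive margin means the model assigns, on average, a higher implicit reward to the chosen completion than to the rejected one. Rewards/accuracies reports the fraction of evaluation pairs for which that holds.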

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

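The model was trained on the piqa dataset. More information needed on the train/evaluation split and on how preference pairs were constructed.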

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 5e-05
train_batch_size: 4
eval_batch_size: 4
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
num_epochs: 3
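
To illustrate what the linear scheduler implies for the learning rate over the run, here is a minimal sketch. It assumes zero warmup steps (no warmup value is listed above), decay to zero at the final step, and a total of 9669 optimizer steps, taken from the training results table. The function name `linear_lr` is hypothetical and not part of any library.

```python
def linear_lr(step, base_lr=5e-05, total_steps=9669, warmup_steps=0):
    """Learning rate at a given optimizer step under linear decay.

    Assumptions (not confirmed by the model card): no warmup, decay to zero.
    """
    if step < warmup_steps:
        # Linear ramp-up during warmup (unused when warmup_steps is 0).
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Learning rate at the start and at the end of each epoch (3223 steps each).
for step in (0, 3223, 6446, 9669):
    print(f"step {step:>5}: lr = {linear_lr(step):.3e}")
```

Under these assumptions the learning rate falls by one third of its initial value per epoch and reaches zero at step 9669.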

Training results

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.717	1.0	3223	0.6773	-6.7817	-7.2423	0.6513	0.4606	-163.6990	-157.3790	-76.4738	-76.4513
0.3122	2.0	6446	0.7910	-12.1665	-13.1665	0.6675	1.0000	-222.9408	-211.2272	-85.8639	-85.9837
0.0481	3.0	9669	1.0022	-17.7008	-19.2025	0.6638	1.5016	-283.3009	-266.5706	-79.0250	-79.1854
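
One way to read the table is to compare training and validation loss per epoch. The sketch below copies four of the columns from the table and computes the gap between the two losses, plus the epochs with the lowest validation loss and the highest reward accuracy. The tuple layout and variable names are illustrative only.

```python
# (epoch, training_loss, validation_loss, rewards_accuracy), from the table above.
rows = [
    (1.0, 0.717, 0.6773, 0.6513),
    (2.0, 0.3122, 0.7910, 0.6675),
    (3.0, 0.0481, 1.0022, 0.6638),
]

# Gap between validation and training loss at each epoch.
gaps = [(epoch, val - train) for epoch, train, val, _ in rows]

# Epochs with the lowest validation loss and the highest reward accuracy.
best_loss_epoch = min(rows, key=lambda r: r[2])[0]
best_acc_epoch = max(rows, key=lambda r: r[3])[0]

for epoch, gap in gaps:
    print(f"epoch {epoch:.0f}: validation - training loss = {gap:+.4f}")
print(f"lowest validation loss at epoch {best_loss_epoch:.0f}")
print(f"highest reward accuracy at epoch {best_acc_epoch:.0f}")
```

The gap widens at every epoch, a pattern commonly associated with overfitting, while reward accuracy changes little. Whether an earlier checkpoint would be preferable depends on which metric matters for the intended use.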

Framework versions

Transformers 4.44.2
PyTorch 2.0.0
Datasets 2.16.1
Tokenizers 0.19.1