Mistral-7B-Instruct-v0.2-DPO

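This model is a fine-tuned version of mistralai/Mistral-7B-Instruct-v0.2, trained on the Dahoas/full-hh-rlhf preference dataset; the model name and the logged reward metrics indicate Direct Preference Optimization (DPO). It achieves the following results on the evaluation set: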

Loss: 0.5782
Rewards/chosen: -0.2120
Rewards/rejected: -0.7002
Rewards/accuracies: 0.6926
Rewards/margins: 0.4883
Logps/rejected: -296.2612
Logps/chosen: -255.5737
Logits/rejected: -2.4985
Logits/chosen: -2.5472
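
The metric names above match those logged by common DPO trainers (for example TRL's DPOTrainer; that this run used it is an assumption, since the card does not say). Under the standard sigmoid DPO loss, each reward is the scaled log-probability gap between the policy and the reference model, the margin is the chosen reward minus the rejected reward, and accuracy is the fraction of pairs with a positive margin. The sketch below shows how such per-example values would aggregate; the input rewards are hypothetical and are not taken from this evaluation run.

```python
import math

def dpo_summary(chosen_rewards, rejected_rewards):
    """Aggregate per-example implicit rewards into DPO-style eval metrics."""
    margins = [c - r for c, r in zip(chosen_rewards, rejected_rewards)]
    # Sigmoid DPO loss per pair: -log(sigmoid(margin)) == log(1 + exp(-margin)).
    losses = [math.log1p(math.exp(-m)) for m in margins]
    n = len(margins)
    return {
        "loss": sum(losses) / n,
        "rewards/chosen": sum(chosen_rewards) / n,
        "rewards/rejected": sum(rejected_rewards) / n,
        "rewards/margins": sum(margins) / n,
        "rewards/accuracies": sum(m > 0 for m in margins) / n,
    }

# Hypothetical per-example rewards, for illustration only.
example = dpo_summary([0.3, -0.5, 0.1, -0.2], [-0.4, -0.1, -0.9, -0.6])

# Loss evaluated at the reported mean margin (0.4883). Because the per-pair
# loss is convex in the margin, this is a lower bound on the mean loss.
loss_at_mean_margin = math.log1p(math.exp(-0.4883))
```

The reported loss (0.5782) lies above the loss at the mean margin, as expected when per-example margins vary. The reported margin (0.4883) also differs from Rewards/chosen minus Rewards/rejected (0.4882) only in the last digit, which is consistent with rounding.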

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 5e-07
train_batch_size: 8
eval_batch_size: 8
seed: 42
distributed_type: multi-GPU
num_devices: 4
gradient_accumulation_steps: 2
total_train_batch_size: 64
total_eval_batch_size: 32
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: cosine
lr_scheduler_warmup_ratio: 0.1
num_epochs: 1
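
The totals in this list follow from the per-device values: 8 samples x 4 devices x 2 accumulation steps = 64 for training, and 8 x 4 = 32 for evaluation. The sketch below reproduces that arithmetic and illustrates a cosine schedule with linear warmup over the first 10% of steps. The function names are hypothetical, the schedule shape is assumed to be the usual linear-warmup, half-cosine decay, and the total step count is a placeholder, since the card does not report it.

```python
import math

def effective_batch_size(per_device, num_devices, grad_accum_steps=1):
    """Samples contributing to one optimizer step (or one eval step)."""
    return per_device * num_devices * grad_accum_steps

def cosine_lr(step, total_steps, base_lr=5e-7, warmup_ratio=0.1):
    """Linear warmup to base_lr, then half-cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

train_total = effective_batch_size(8, 4, 2)  # matches total_train_batch_size
eval_total = effective_batch_size(8, 4)      # matches total_eval_batch_size

# Placeholder step count (assumption); substitute the real number of
# optimizer steps for the run.
TOTAL_STEPS = 1000
peak_lr = cosine_lr(100, TOTAL_STEPS)   # end of warmup
final_lr = cosine_lr(TOTAL_STEPS, TOTAL_STEPS)
```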

Training results

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.6628	0.06	100	0.6611	0.1337	0.0489	0.6317	0.0848	-221.3471	-221.0088	-2.6721	-2.7152
0.6203	0.11	200	0.6121	-0.0960	-0.4057	0.6609	0.3097	-266.8084	-243.9758	-2.6213	-2.6775
0.6134	0.17	300	0.6074	-0.0623	-0.3733	0.6702	0.3111	-263.5724	-240.6045	-2.7988	-2.8551
0.5967	0.23	400	0.5992	-0.1315	-0.5181	0.6782	0.3866	-278.0497	-247.5236	-2.4576	-2.5191
0.6216	0.29	500	0.5941	-0.0370	-0.4146	0.6721	0.3775	-267.6940	-238.0781	-2.6879	-2.7311
0.5919	0.34	600	0.5904	-0.1509	-0.5767	0.6865	0.4258	-283.9072	-249.4699	-2.4044	-2.4745
0.5769	0.4	700	0.5902	-0.2407	-0.6647	0.6772	0.4240	-292.7129	-258.4496	-2.2190	-2.2924
0.5725	0.46	800	0.5882	-0.0462	-0.4830	0.6837	0.4368	-274.5383	-238.9940	-2.5276	-2.5732
0.5814	0.51	900	0.5864	-0.1178	-0.5375	0.6811	0.4197	-279.9914	-246.1586	-2.3355	-2.4098
0.5514	0.57	1000	0.5839	-0.1827	-0.6505	0.6872	0.4678	-291.2902	-252.6515	-2.4115	-2.4855
0.5946	0.63	1100	0.5846	-0.0669	-0.5120	0.6846	0.4451	-277.4430	-241.0672	-2.4475	-2.5090
0.5988	0.69	1200	0.5829	-0.2676	-0.7315	0.6891	0.4638	-299.3864	-261.1408	-2.4703	-2.5293
0.5725	0.74	1300	0.5809	-0.1107	-0.5656	0.6878	0.4549	-282.7961	-245.4460	-2.4590	-2.5131
0.5719	0.8	1400	0.5793	-0.2111	-0.6982	0.6894	0.4871	-296.0592	-255.4868	-2.4585	-2.5096
0.5702	0.86	1500	0.5789	-0.2663	-0.7548	0.6888	0.4884	-301.7152	-261.0100	-2.4746	-2.5243
0.5854	0.91	1600	0.5783	-0.2282	-0.7193	0.6913	0.4911	-298.1695	-257.1977	-2.5037	-2.5523
0.578	0.97	1700	0.5782	-0.2135	-0.7018	0.6920	0.4884	-296.4236	-255.7232	-2.4987	-2.5475

Framework versions

Transformers 4.39.0.dev0
PyTorch 2.3.0+cu121
Datasets 2.14.6
Tokenizers 0.15.2

Repository: AmberYifan/Mistral-7B-Instruct-v0.2-DPO