llama2-7b-sft-full-dpo-bs128-1

This model is a fine-tuned version of elichen3051/llama2-7b-sft-full, trained on the HuggingFaceH4/ultrafeedback_binarized, HuggingFaceH4/cai-conversation-harmless, and HuggingFaceH4/orca_dpo_pairs datasets. It achieves the following results on the evaluation set:

Loss: 0.3941
Rewards/chosen: -5.3789
Rewards/rejected: -9.2548
Rewards/accuracies: 0.7172
Rewards/margins: 3.8759
Logps/rejected: -1109.1118
Logps/chosen: -745.4045
Logits/rejected: -0.3099
Logits/chosen: -0.4827
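
As a quick sanity check on how the reward metrics above relate to each other, the sketch below recomputes Rewards/margins as Rewards/chosen minus Rewards/rejected. The `implicit_reward` helper is hypothetical: it assumes the usual DPO convention in which a reward is beta times the difference between the policy and reference log-probabilities. This card does not state beta or the training library, so treat that part as an illustration only.

```python
# Final evaluation metrics copied from the list above.
rewards_chosen = -5.3789
rewards_rejected = -9.2548
reported_margin = 3.8759

# The margin is the gap between the mean chosen and mean rejected rewards.
margin = rewards_chosen - rewards_rejected
print(f"recomputed margin: {margin:.4f} (reported: {reported_margin})")


def implicit_reward(policy_logp: float, ref_logp: float, beta: float) -> float:
    """Hypothetical helper: DPO-style implicit reward for one completion.

    Assumes reward = beta * (log pi_policy - log pi_reference).
    The value of beta used for this model is not given in the card.
    """
    return beta * (policy_logp - ref_logp)
```

Under this convention, negative rewards for both chosen and rejected completions mean the policy assigns lower log-probability than the reference model to both, while the positive margin shows that rejected completions are pushed down further.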

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 5e-07
train_batch_size: 4
eval_batch_size: 4
seed: 42
distributed_type: multi-GPU
num_devices: 4
gradient_accumulation_steps: 8
total_train_batch_size: 128
total_eval_batch_size: 16
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: cosine
lr_scheduler_warmup_ratio: 0.1
num_epochs: 3
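
The totals in the list above follow from the per-device settings, and the learning-rate schedule can be sketched from the warmup ratio and scheduler type. The snippet below is a minimal illustration, not the training code: `lr_at` is a hypothetical helper that assumes linear warmup followed by a half-cosine decay to zero, and the total step count passed to it is an illustrative value because the card does not state it.

```python
import math

# Per-device settings copied from the hyperparameter list.
train_batch_size = 4
eval_batch_size = 4
num_devices = 4
gradient_accumulation_steps = 8

# Effective batch sizes across devices (and accumulation steps for training).
total_train_batch_size = train_batch_size * num_devices * gradient_accumulation_steps
total_eval_batch_size = eval_batch_size * num_devices
print(total_train_batch_size, total_eval_batch_size)


def lr_at(step: int, total_steps: int, peak_lr: float = 5e-07,
          warmup_ratio: float = 0.1) -> float:
    """Hypothetical sketch of a cosine schedule with linear warmup."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    # Half-cosine decay from the peak down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Illustrative total of 1000 optimizer steps (an assumption, not from the card).
for s in (0, 50, 100, 550, 1000):
    print(s, lr_at(s, total_steps=1000))
```

With these assumptions, the learning rate peaks at 5e-07 once 10% of the steps have elapsed and then decays smoothly toward zero.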

Training results

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.6822	0.1351	100	0.6805	-0.0879	-0.1131	0.6004	0.0252	-194.9413	-216.3078	-1.4902	-1.4701
0.5994	0.2701	200	0.5900	-0.5683	-0.8376	0.6354	0.2692	-267.3889	-264.3494	-1.4558	-1.4603
0.5207	0.4052	300	0.5228	-1.1191	-1.7333	0.6594	0.6142	-356.9614	-319.4261	-1.4859	-1.4942
0.4853	0.5403	400	0.4862	-1.4612	-2.3687	0.6758	0.9075	-420.4995	-353.6355	-1.4775	-1.4836
0.4354	0.6753	500	0.4588	-2.3149	-3.5829	0.6910	1.2680	-541.9225	-439.0110	-1.4326	-1.4491
0.4762	0.8104	600	0.4449	-2.8038	-4.4754	0.6812	1.6716	-631.1749	-487.9007	-1.2917	-1.3200
0.4387	0.9455	700	0.4269	-2.5426	-4.3834	0.6998	1.8408	-621.9758	-461.7772	-1.1139	-1.1704
0.4078	1.0805	800	0.4174	-3.4511	-5.6259	0.7151	2.1747	-746.2182	-552.6292	-0.7747	-0.8511
0.4360	1.2156	900	0.4223	-2.8520	-4.9250	0.6998	2.0730	-676.1364	-492.7158	-0.8051	-0.8816
0.3981	1.3507	1000	0.4113	-4.1103	-6.7093	0.7085	2.5990	-854.5594	-618.5446	-0.6129	-0.7287
0.4197	1.4857	1100	0.4078	-3.9815	-6.7710	0.7063	2.7895	-860.7372	-605.6670	-0.6085	-0.7236
0.3948	1.6208	1200	0.4029	-4.4373	-7.6681	0.7129	3.2308	-950.4404	-651.2417	-0.4505	-0.5882
0.3774	1.7559	1300	0.4027	-4.4668	-7.7729	0.7140	3.3061	-960.9225	-654.1990	-0.4254	-0.5766
0.3776	1.8909	1400	0.3981	-5.1800	-8.6009	0.7151	3.4209	-1043.7207	-725.5146	-0.2898	-0.4448
0.3603	2.0260	1500	0.3969	-5.2589	-8.9354	0.7238	3.6765	-1077.1724	-733.4023	-0.3307	-0.4994
0.3755	2.1611	1600	0.3941	-5.3490	-9.0434	0.7227	3.6944	-1087.9674	-742.4117	-0.3008	-0.4750
0.3329	2.2961	1700	0.3955	-5.5370	-9.4449	0.7172	3.9079	-1128.1211	-761.2183	-0.2802	-0.4581
0.3792	2.4312	1800	0.3958	-5.3708	-9.1595	0.7194	3.7887	-1099.5865	-744.5994	-0.3103	-0.4787
0.3375	2.5663	1900	0.3959	-5.7224	-9.6999	0.7118	3.9775	-1153.6201	-779.7544	-0.2838	-0.4608
0.3395	2.7013	2000	0.3959	-5.5592	-9.4717	0.7172	3.9125	-1130.8043	-763.4398	-0.3069	-0.4800
0.3429	2.8364	2100	0.3944	-5.4070	-9.2880	0.7194	3.8810	-1112.4368	-748.2163	-0.3043	-0.4769
0.3641	2.9715	2200	0.3942	-5.3741	-9.2509	0.7183	3.8769	-1108.7258	-744.9233	-0.3113	-0.4836
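
The table can also be read programmatically. The sketch below copies the Epoch, Step, and Validation Loss columns, finds the logged checkpoint with the lowest validation loss, and derives a rough steps-per-epoch figure from the last row. The estimate of training pairs per epoch is an assumption: it multiplies steps per epoch by the total train batch size of 128 and ignores any dropped or partial batches.

```python
# (epoch, step, validation_loss) copied from the training results table.
rows = [
    (0.1351, 100, 0.6805), (0.2701, 200, 0.5900), (0.4052, 300, 0.5228),
    (0.5403, 400, 0.4862), (0.6753, 500, 0.4588), (0.8104, 600, 0.4449),
    (0.9455, 700, 0.4269), (1.0805, 800, 0.4174), (1.2156, 900, 0.4223),
    (1.3507, 1000, 0.4113), (1.4857, 1100, 0.4078), (1.6208, 1200, 0.4029),
    (1.7559, 1300, 0.4027), (1.8909, 1400, 0.3981), (2.0260, 1500, 0.3969),
    (2.1611, 1600, 0.3941), (2.2961, 1700, 0.3955), (2.4312, 1800, 0.3958),
    (2.5663, 1900, 0.3959), (2.7013, 2000, 0.3959), (2.8364, 2100, 0.3944),
    (2.9715, 2200, 0.3942),
]

# Logged checkpoint with the lowest validation loss.
best = min(rows, key=lambda r: r[2])
print(f"lowest validation loss {best[2]} at step {best[1]} (epoch {best[0]})")

# Rough steps per epoch, inferred from the last logged row.
last_epoch, last_step, _ = rows[-1]
steps_per_epoch = last_step / last_epoch

# Rough number of preference pairs per epoch, assuming every optimizer
# step consumed a full batch of 128 pairs (an approximation).
approx_pairs_per_epoch = steps_per_epoch * 128
print(f"about {steps_per_epoch:.0f} steps and {approx_pairs_per_epoch:.0f} pairs per epoch")
```

Validation loss is essentially flat after roughly step 1500, while the reward margin keeps widening slightly, so later checkpoints differ little by this metric.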

Framework versions

Transformers 4.41.2
Pytorch 2.3.0
Datasets 2.19.1
Tokenizers 0.19.1

Finetuned from skymizer/llama2-7b-sft-full
