llama-3.2-3b-dpo-2

This model is a fine-tuned version of tanliboy/llama-3.2-3b-sft-2 on the HuggingFaceH4/orca_dpo_pairs and HuggingFaceH4/ultrafeedback_binarized datasets. It achieves the following results on the evaluation set (these values match the step-1500 row, epoch 2.6121, of the training results table):

Loss: 0.5814
Rewards/chosen: 1.7432
Rewards/rejected: -4.1735
Rewards/accuracies: 0.7848
Rewards/margins: 5.9167
Logps/rejected: -388.2242
Logps/chosen: -338.5596
Logits/rejected: 0.2395
Logits/chosen: 0.1826
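
The reward metrics above use the names commonly logged in DPO-style preference training. As a rough sketch, assuming the standard sigmoid DPO definitions, the per-example quantities relate as shown below. The card does not state the loss variant or the value of beta, so `beta=0.1` is an assumption, and `dpo_example_metrics` is a hypothetical helper, not part of any library.

```python
import math

def dpo_example_metrics(policy_chosen_logp, policy_rejected_logp,
                        ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO quantities under the standard sigmoid loss.

    beta=0.1 is an assumed value; the model card does not report it.
    """
    # Implicit rewards: scaled log-probability ratio of policy vs. reference.
    reward_chosen = beta * (policy_chosen_logp - ref_chosen_logp)
    reward_rejected = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)), written in a numerically stable form.
    loss = max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return {
        "reward_chosen": reward_chosen,
        "reward_rejected": reward_rejected,
        "margin": margin,
        "accuracy": 1.0 if margin > 0 else 0.0,
        "loss": loss,
    }

# Consistency check on the reported evaluation numbers:
# Rewards/margins should equal Rewards/chosen minus Rewards/rejected.
reported_margin = 1.7432 - (-4.1735)
print(round(reported_margin, 4))
```

The reported Rewards/margins (5.9167) equals Rewards/chosen minus Rewards/rejected. The reported Loss is an average over evaluation examples, so it cannot be recovered from the mean margin alone.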

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 5e-07
train_batch_size: 4
eval_batch_size: 4
seed: 42
distributed_type: multi-GPU
num_devices: 8
gradient_accumulation_steps: 4
total_train_batch_size: 128
total_eval_batch_size: 32
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: cosine
lr_scheduler_warmup_ratio: 0.03
num_epochs: 3
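
The total batch sizes listed above follow from the per-device sizes, the device count, and gradient accumulation. The sketch below reproduces that arithmetic and illustrates a cosine schedule with linear warmup. The helper names are hypothetical, the warmup-step rounding (ceiling of ratio times total steps) is an assumption about how the trainer converts the ratio, and `TOTAL_STEPS = 1722` is only a rough estimate inferred from the logged epoch and step columns, not a value stated in this card.

```python
import math

def effective_batch_size(per_device, num_devices, grad_accum_steps=1):
    """Examples consumed per optimizer step (or per eval step if accum is 1)."""
    return per_device * num_devices * grad_accum_steps

def cosine_lr(step, total_steps, peak_lr=5e-07, warmup_ratio=0.03):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    # Assumed rounding rule for converting the warmup ratio into steps.
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

train_total = effective_batch_size(4, 8, 4)  # train_batch_size * num_devices * accumulation
eval_total = effective_batch_size(4, 8)      # eval_batch_size * num_devices

TOTAL_STEPS = 1722  # hypothetical estimate, see the note above
for step in (0, 52, 861, 1722):
    print(step, cosine_lr(step, TOTAL_STEPS))
```

With these inputs, `train_total` is 128 and `eval_total` is 32, matching total_train_batch_size and total_eval_batch_size.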

Training results

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.7596	0.1741	100	0.7588	0.1349	-1.4398	0.6994	1.5747	-360.8871	-354.6434	0.6135	0.5482
0.6725	0.3483	200	0.6680	0.6247	-2.7323	0.7278	3.3569	-373.8118	-349.7451	0.5335	0.4718
0.6452	0.5224	300	0.6514	0.1770	-3.8036	0.7500	3.9807	-384.5256	-354.2216	0.5477	0.4866
0.6259	0.6966	400	0.6328	0.9885	-3.5382	0.7722	4.5267	-381.8713	-346.1070	0.4531	0.3927
0.5709	0.8707	500	0.6219	0.9150	-4.0091	0.7816	4.9242	-386.5804	-346.8415	0.4148	0.3563
0.5835	1.0448	600	0.6094	1.5034	-3.6390	0.7722	5.1423	-382.8790	-340.9584	0.3504	0.2933
0.5571	1.2190	700	0.5992	1.5696	-3.7206	0.7690	5.2901	-383.6949	-340.2962	0.3217	0.2649
0.5532	1.3931	800	0.5954	1.7147	-3.7261	0.7785	5.4408	-383.7506	-338.8453	0.2961	0.2383
0.5168	1.5673	900	0.5930	1.9934	-3.3982	0.7753	5.3916	-380.4709	-336.0577	0.2838	0.2266
0.5232	1.7414	1000	0.5884	1.7308	-4.0024	0.7816	5.7332	-386.5127	-338.6839	0.2787	0.2220
0.5574	1.9155	1100	0.5849	1.8420	-3.9351	0.7911	5.7771	-385.8401	-337.5714	0.2706	0.2134
0.5077	2.0897	1200	0.5842	1.6188	-4.2472	0.7880	5.8659	-388.9607	-339.8043	0.2657	0.2083
0.4952	2.2638	1300	0.5837	1.9316	-3.8913	0.7816	5.8229	-385.4018	-336.6759	0.2694	0.2115
0.5236	2.4380	1400	0.5812	1.8289	-4.0636	0.7880	5.8925	-387.1253	-337.7025	0.2465	0.1895
0.5001	2.6121	1500	0.5814	1.7432	-4.1735	0.7848	5.9167	-388.2242	-338.5596	0.2395	0.1826
0.5246	2.7862	1600	0.5809	1.8622	-4.0120	0.7880	5.8742	-386.6093	-337.3701	0.2395	0.1825
0.5042	2.9604	1700	0.5808	1.8125	-4.0822	0.7880	5.8947	-387.3112	-337.8669	0.2355	0.1785
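
To work with the table programmatically, the sketch below transcribes the step, validation loss, and reward columns from the rows above. It checks that each Rewards/margins entry equals Rewards/chosen minus Rewards/rejected up to rounding, and finds the logged step with the lowest validation loss. The variable and function names are illustrative only.

```python
# (step, validation_loss, rewards_chosen, rewards_rejected, rewards_margins)
ROWS = [
    (100, 0.7588, 0.1349, -1.4398, 1.5747),
    (200, 0.6680, 0.6247, -2.7323, 3.3569),
    (300, 0.6514, 0.1770, -3.8036, 3.9807),
    (400, 0.6328, 0.9885, -3.5382, 4.5267),
    (500, 0.6219, 0.9150, -4.0091, 4.9242),
    (600, 0.6094, 1.5034, -3.6390, 5.1423),
    (700, 0.5992, 1.5696, -3.7206, 5.2901),
    (800, 0.5954, 1.7147, -3.7261, 5.4408),
    (900, 0.5930, 1.9934, -3.3982, 5.3916),
    (1000, 0.5884, 1.7308, -4.0024, 5.7332),
    (1100, 0.5849, 1.8420, -3.9351, 5.7771),
    (1200, 0.5842, 1.6188, -4.2472, 5.8659),
    (1300, 0.5837, 1.9316, -3.8913, 5.8229),
    (1400, 0.5812, 1.8289, -4.0636, 5.8925),
    (1500, 0.5814, 1.7432, -4.1735, 5.9167),
    (1600, 0.5809, 1.8622, -4.0120, 5.8742),
    (1700, 0.5808, 1.8125, -4.0822, 5.8947),
]

def margins_consistent(rows, tol=2e-4):
    """True if margin == chosen - rejected for every row, within rounding."""
    return all(abs((chosen - rejected) - margin) <= tol
               for _, _, chosen, rejected, margin in rows)

def best_by_validation_loss(rows):
    """Return (step, validation_loss) for the lowest logged validation loss."""
    step, loss, *_ = min(rows, key=lambda row: row[1])
    return step, loss

print(margins_consistent(ROWS))
print(best_by_validation_loss(ROWS))
```

Among the logged checkpoints, the lowest validation loss is 0.5808 at step 1700, slightly below the 0.5814 at step 1500.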

Framework versions

Transformers 4.44.2
Pytorch 2.4.0+cu121
Datasets 2.19.1
Tokenizers 0.19.1
