
zephyr-7b-dpo-lora-r16

This model is a LoRA adapter (rank 16) for alignment-handbook/zephyr-7b-sft-full, trained with DPO on the HuggingFaceH4/ultrafeedback_binarized dataset. It achieves the following results on the evaluation set (a usage sketch follows the metric list):

  • Loss: 0.6410
  • Rewards/chosen: -2.2125
  • Rewards/rejected: -3.0591
  • Rewards/accuracies: 0.6650
  • Rewards/margins: 0.8466
  • Logps/rejected: -554.3575
  • Logps/chosen: -489.4880
  • Logits/rejected: -2.1525
  • Logits/chosen: -2.1542
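
This repository contains only the LoRA adapter, so it has to be attached to the base checkpoint at load time. A minimal usage sketch with PEFT and Transformers; the prompt and generation settings are illustrative, not taken from the card:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "alignment-handbook/zephyr-7b-sft-full"
adapter_id = "LaoRay/zephyr-7b-dpo-lora-r16"  # this repository

tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, adapter_id)  # attach the LoRA adapter

messages = [{"role": "user", "content": "What is direct preference optimization?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```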

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed
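
The card names HuggingFaceH4/ultrafeedback_binarized above. A short sketch for inspecting its preference splits; the train_prefs/test_prefs split names follow the dataset's published layout and are assumptions as far as this card is concerned:

```python
from datasets import load_dataset

# DPO uses paired preference data: each row carries a prompt plus a
# preferred ("chosen") and a dispreferred ("rejected") completion.
ds = load_dataset("HuggingFaceH4/ultrafeedback_binarized")
prefs = ds["train_prefs"]  # assumed preference split name
print(prefs.column_names)  # expected to include prompt, chosen, rejected
print(prefs[0]["prompt"][:200])
```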

Training procedure

Training hyperparameters

The following hyperparameters were used during training (a configuration sketch follows the list):

  • learning_rate: 5e-06
  • train_batch_size: 4
  • eval_batch_size: 8
  • seed: 42
  • distributed_type: multi-GPU
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 16
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 20
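
A sketch of how these values could map onto a DPO training run. The card does not list TRL among the framework versions, so the use of trl.DPOTrainer here, as well as the LoRA alpha/dropout choices, are assumptions; only the hyperparameter values above are taken from the card:

```python
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_id = "alignment-handbook/zephyr-7b-sft-full"
tokenizer = AutoTokenizer.from_pretrained(base_id)
ds = load_dataset("HuggingFaceH4/ultrafeedback_binarized")

# Rank-16 LoRA, matching the "-r16" suffix in the model name;
# alpha and dropout are assumptions, not stated in the card.
peft_config = LoraConfig(r=16, lora_alpha=16, lora_dropout=0.05,
                         task_type="CAUSAL_LM")

# Values copied from the hyperparameter list above. Per-device batch 4
# times 4 gradient-accumulation steps gives the effective batch size of 16.
args = DPOConfig(
    output_dir="zephyr-7b-dpo-lora-r16",
    learning_rate=5e-6,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=4,
    num_train_epochs=20,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    seed=42,
)

# Note: the chosen/rejected columns are chat-message lists; recent TRL
# versions apply the chat template automatically, while older ones need
# explicit preprocessing first.
trainer = DPOTrainer(
    model=base_id,                # TRL loads the base model from the hub
    args=args,
    train_dataset=ds["train_prefs"],
    eval_dataset=ds["test_prefs"],
    tokenizer=tokenizer,
    peft_config=peft_config,      # no ref_model needed: disabling the
)                                 # adapter recovers the reference policy
trainer.train()
```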

Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|:-------------:|:-----:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:-------------:|
| 0.6933        | 0.992 | 62   | 0.6932          | 0.0001         | 0.0001           | 0.4950             | -0.0001         | -248.4330      | -268.2283    | -2.8603         | -2.8949       |
| 0.6898        | 2.0   | 125  | 0.6900          | -0.0019        | -0.0083          | 0.6300             | 0.0064          | -249.2751      | -268.4230    | -2.8539         | -2.8884       |
| 0.6767        | 2.992 | 187  | 0.6785          | -0.0026        | -0.0339          | 0.6300             | 0.0312          | -251.8334      | -268.4987    | -2.8254         | -2.8577       |
| 0.5861        | 4.0   | 250  | 0.6520          | -0.0414        | -0.1367          | 0.6350             | 0.0953          | -262.1136      | -272.3749    | -2.8027         | -2.8314       |
| 0.5654        | 4.992 | 312  | 0.6219          | -0.2603        | -0.4550          | 0.6500             | 0.1947          | -293.9497      | -294.2625    | -2.7777         | -2.8036       |
| 0.4986        | 6.0   | 375  | 0.6055          | -0.4927        | -0.7779          | 0.6800             | 0.2851          | -326.2355      | -317.5081    | -2.7652         | -2.7893       |
| 0.4719        | 6.992 | 437  | 0.6055          | -0.7077        | -1.0586          | 0.6900             | 0.3508          | -354.3046      | -339.0088    | -2.7391         | -2.7606       |
| 0.4512        | 8.0   | 500  | 0.6028          | -0.7213        | -1.1042          | 0.6750             | 0.3829          | -358.8697      | -340.3660    | -2.7246         | -2.7431       |
| 0.264         | 8.992 | 562  | 0.5955          | -1.0493        | -1.4939          | 0.7000             | 0.4446          | -397.8353      | -373.1655    | -2.6715         | -2.6867       |
| 0.3516        | 10.0  | 625  | 0.5927          | -1.1473        | -1.6948          | 0.6800             | 0.5474          | -417.9223      | -382.9673    | -2.5714         | -2.5856       |
| 0.3271        | 10.992| 687  | 0.5922          | -1.4044        | -2.0377          | 0.6900             | 0.6332          | -452.2125      | -408.6782    | -2.4751         | -2.4864       |
| 0.336         | 12.0  | 750  | 0.6034          | -1.6164        | -2.3135          | 0.7100             | 0.6972          | -479.8002      | -429.8719    | -2.3841         | -2.3919       |
| 0.2157        | 12.992| 812  | 0.6125          | -1.6968        | -2.4270          | 0.6800             | 0.7302          | -491.1445      | -437.9121    | -2.3161         | -2.3226       |
| 0.2436        | 14.0  | 875  | 0.6211          | -1.9546        | -2.7134          | 0.6800             | 0.7588          | -519.7897      | -463.6995    | -2.2583         | -2.2637       |
| 0.1747        | 14.992| 937  | 0.6250          | -2.0090        | -2.8105          | 0.6750             | 0.8015          | -529.4984      | -469.1342    | -2.2179         | -2.2224       |
| 0.162         | 16.0  | 1000 | 0.6350          | -2.1464        | -2.9679          | 0.6750             | 0.8214          | -545.2337      | -482.8784    | -2.1872         | -2.1901       |
| 0.1898        | 16.992| 1062 | 0.6415          | -2.2332        | -3.0695          | 0.6700             | 0.8363          | -555.3980      | -491.5554    | -2.1618         | -2.1639       |
| 0.1337        | 18.0  | 1125 | 0.6401          | -2.2070        | -3.0519          | 0.6700             | 0.8449          | -553.6342      | -488.9332    | -2.1605         | -2.1619       |
| 0.1233        | 18.992| 1187 | 0.6414          | -2.2093        | -3.0569          | 0.6650             | 0.8476          | -554.1345      | -489.1610    | -2.1630         | -2.1636       |
| 0.1832        | 19.84 | 1240 | 0.6410          | -2.2125        | -3.0591          | 0.6650             | 0.8466          | -554.3575      | -489.4880    | -2.1525         | -2.1542       |
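
The Rewards/* columns follow the usual DPO bookkeeping: a completion's implicit reward is beta times the log-probability ratio between the policy and the frozen reference model, the margin is the chosen reward minus the rejected reward, and accuracy is the fraction of pairs with a positive margin. A minimal sketch of that computation (beta=0.1 is an assumed default; the card does not state it):

```python
import torch
import torch.nn.functional as F

def dpo_metrics(policy_chosen_logps, policy_rejected_logps,
                ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Compute the DPO loss and the reward statistics logged above.

    Each argument is a tensor of summed per-sequence log-probabilities
    (the quantities behind the Logps/* columns).
    """
    # Implicit rewards: beta * log-ratio of policy vs. reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    margins = chosen_rewards - rejected_rewards        # Rewards/margins
    accuracy = (margins > 0).float().mean()            # Rewards/accuracies
    loss = -F.logsigmoid(margins).mean()               # DPO objective
    return loss, chosen_rewards.mean(), rejected_rewards.mean(), accuracy
```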

Framework versions

  • PEFT 0.12.0
  • Transformers 4.44.0
  • PyTorch 2.4.0+cu121
  • Datasets 2.20.0
  • Tokenizers 0.19.1