
phi-2-gpo-renew2-b0.001-0.5ultrafeedback-lowLr-i1

This model is a fine-tuned version of DUAL-GPO/phi-2-gpo-renew2-b0.001-i0 on the HuggingFaceH4/ultrafeedback_binarized dataset. It achieves the following results on the evaluation set (a loading sketch follows the list):

  • Loss: 0.0497
  • Rewards/chosen: 0.0617
  • Rewards/rejected: 0.0473
  • Rewards/accuracies: 0.5645
  • Rewards/margins: 0.0144
  • Logps/rejected: -1829.1201
  • Logps/chosen: -2154.7461
  • Logits/rejected: -0.2678
  • Logits/chosen: -0.2583
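
The following is a minimal loading sketch, not part of the original card: it assumes this repo is a standard PEFT LoRA adapter whose config points at the base model named above, that a tokenizer is bundled with the adapter repo, and that the dtype, device map, and prompt format are illustrative choices rather than documented requirements.

```python
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

adapter_id = "DUAL-GPO/phi-2-gpo-renew2-b0.001-0.5ultrafeedback-lowLr-i1"

# AutoPeftModelForCausalLM reads the adapter config, downloads the base model
# it references, and applies the adapter weights on top of it.
model = AutoPeftModelForCausalLM.from_pretrained(
    adapter_id,
    torch_dtype=torch.bfloat16,   # assumption: dtype not stated in the card
    device_map="auto",
    trust_remote_code=True,       # phi-2 checkpoints have required this in the past
)
# Assumption: tokenizer files ship with the adapter; otherwise load from the base model.
tokenizer = AutoTokenizer.from_pretrained(adapter_id)

# "Instruct:/Output:" follows the upstream phi-2 prompt convention (an assumption here).
prompt = "Instruct: Summarize what preference optimization does.\nOutput:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```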

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training (mapped onto a TrainingArguments sketch after the list):

  • learning_rate: 2e-06
  • train_batch_size: 4
  • eval_batch_size: 4
  • seed: 42
  • distributed_type: multi-GPU
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 16
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 1
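
As a sketch, the listed values expressed as transformers (4.36.x) TrainingArguments; output_dir is a placeholder, and settings the card does not list (mixed precision, logging, etc.) are omitted rather than guessed:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="phi-2-gpo-i1",      # hypothetical path, not from the card
    learning_rate=2e-6,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    seed=42,
    gradient_accumulation_steps=4,  # 4 (per device) x 4 (accumulation) = the card's total of 16
    adam_beta1=0.9,                 # transformers defaults, spelled out to match the card
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    num_train_epochs=1,
)
```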

Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|---------------|-------|------|-----------------|----------------|------------------|--------------------|-----------------|----------------|--------------|-----------------|---------------|
| 0.0515        | 0.05  | 100  | 0.0532          | 0.0078         | 0.0065           | 0.5190             | 0.0013          | -1869.9457     | -2208.6421   | -0.2109         | -0.2202       |
| 0.0386        | 0.1   | 200  | 0.0515          | 0.0511         | 0.0427           | 0.5095             | 0.0083          | -1833.6853     | -2165.3538   | -0.2153         | -0.2175       |
| 0.0428        | 0.16  | 300  | 0.0515          | 0.0358         | 0.0281           | 0.5465             | 0.0077          | -1848.3311     | -2180.6155   | -0.2312         | -0.2333       |
| 0.0513        | 0.21  | 400  | 0.0520          | 0.0645         | 0.0516           | 0.5305             | 0.0129          | -1824.8289     | -2151.9404   | -0.2533         | -0.2474       |
| 0.0565        | 0.26  | 500  | 0.0507          | 0.0520         | 0.0403           | 0.5565             | 0.0117          | -1836.1078     | -2164.4390   | -0.2774         | -0.2711       |
| 0.0549        | 0.31  | 600  | 0.0504          | 0.0581         | 0.0443           | 0.5535             | 0.0138          | -1832.1049     | -2158.2695   | -0.3657         | -0.3506       |
| 0.0445        | 0.37  | 700  | 0.0504          | 0.0480         | 0.0362           | 0.5575             | 0.0118          | -1840.2194     | -2168.3940   | -0.3268         | -0.3160       |
| 0.0584        | 0.42  | 800  | 0.0504          | 0.0547         | 0.0417           | 0.5530             | 0.0130          | -1834.7174     | -2161.7117   | -0.3244         | -0.3128       |
| 0.0439        | 0.47  | 900  | 0.0501          | 0.0743         | 0.0588           | 0.5455             | 0.0155          | -1817.6077     | -2142.0779   | -0.3005         | -0.2897       |
| 0.0545        | 0.52  | 1000 | 0.0500          | 0.0612         | 0.0477           | 0.5580             | 0.0135          | -1828.6910     | -2155.1626   | -0.2889         | -0.2812       |
| 0.0535        | 0.58  | 1100 | 0.0499          | 0.0762         | 0.0605           | 0.5480             | 0.0158          | -1815.9238     | -2140.1655   | -0.2758         | -0.2662       |
| 0.0484        | 0.63  | 1200 | 0.0499          | 0.0611         | 0.0476           | 0.5545             | 0.0135          | -1828.7972     | -2155.2605   | -0.2614         | -0.2536       |
| 0.0443        | 0.68  | 1300 | 0.0499          | 0.0536         | 0.0409           | 0.5640             | 0.0127          | -1835.5480     | -2162.8499   | -0.2628         | -0.2563       |
| 0.0527        | 0.73  | 1400 | 0.0500          | 0.0536         | 0.0406           | 0.5705             | 0.0130          | -1835.7953     | -2162.7734   | -0.2801         | -0.2716       |
| 0.0427        | 0.79  | 1500 | 0.0499          | 0.0581         | 0.0443           | 0.5655             | 0.0137          | -1832.0787     | -2158.3472   | -0.2702         | -0.2613       |
| 0.0391        | 0.84  | 1600 | 0.0498          | 0.0624         | 0.0479           | 0.5625             | 0.0145          | -1828.5033     | -2153.9939   | -0.2688         | -0.2594       |
| 0.056         | 0.89  | 1700 | 0.0498          | 0.0626         | 0.0481           | 0.5615             | 0.0145          | -1828.3557     | -2153.8423   | -0.2686         | -0.2589       |
| 0.0505        | 0.94  | 1800 | 0.0498          | 0.0619         | 0.0476           | 0.5655             | 0.0144          | -1828.8563     | -2154.4631   | -0.2667         | -0.2571       |
| 0.0501        | 0.99  | 1900 | 0.0498          | 0.0617         | 0.0473           | 0.5635             | 0.0144          | -1829.1072     | -2154.7471   | -0.2678         | -0.2582       |
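
As context for the Rewards/* columns, the sketch below shows how DPO-style implicit rewards are commonly computed: beta times the policy-vs-reference log-probability gap, averaged over the batch. It is an illustration, not the exact GPO objective used in training, and beta = 0.001 is inferred from the "b0.001" in the model name.

```python
import torch

def reward_metrics(policy_chosen_logps: torch.Tensor,
                   policy_rejected_logps: torch.Tensor,
                   ref_chosen_logps: torch.Tensor,
                   ref_rejected_logps: torch.Tensor,
                   beta: float = 0.001):
    """Implicit DPO-style reward: beta * (policy logp - reference logp)."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return {
        "rewards/chosen": chosen_rewards.mean().item(),
        "rewards/rejected": rejected_rewards.mean().item(),
        # fraction of pairs where the chosen response outscores the rejected one
        "rewards/accuracies": (chosen_rewards > rejected_rewards).float().mean().item(),
        "rewards/margins": (chosen_rewards - rejected_rewards).mean().item(),
    }

# Toy example with made-up sequence log-probabilities on the table's scale:
print(reward_metrics(
    policy_chosen_logps=torch.tensor([-2154.7, -2140.2]),
    policy_rejected_logps=torch.tensor([-1829.1, -1815.9]),
    ref_chosen_logps=torch.tensor([-2216.4, -2208.6]),
    ref_rejected_logps=torch.tensor([-1876.4, -1869.9]),
))
```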

Framework versions

  • PEFT 0.7.1
  • Transformers 4.36.2
  • PyTorch 2.1.2
  • Datasets 2.14.6
  • Tokenizers 0.15.2