Meta-Llama-3-8B-Instruct-ORPO-QLoRA

This model is a fine-tuned version of meta-llama/Meta-Llama-3-8B-Instruct on the HuggingFaceH4/ultrafeedback_binarized dataset. It achieves the following results on the evaluation set:

Loss: 0.5734
Rewards/chosen: -0.0085
Rewards/rejected: -0.0105
Rewards/accuracies: 0.6070
Rewards/margins: 0.0020
Logps/rejected: -1.0492
Logps/chosen: -0.8470
Logits/rejected: -0.2321
Logits/chosen: -0.2275
Nll Loss: 0.5669
Log Odds Ratio: -0.6615
Log Odds Chosen: 0.3163
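
The metrics above appear to be linked by the ORPO objective, in which the total loss is the NLL term minus a weighted log-odds-ratio term, and each reward is the same weight times the corresponding log-probability. The sketch below checks that arithmetic on the reported numbers. The weight `BETA = 0.01` is an assumption: the card does not state it, and the value is inferred only because it reproduces the reported figures to within rounding.

```python
# Evaluation metrics copied from the list above.
nll_loss = 0.5669
log_odds_ratio = -0.6615
rewards_chosen = -0.0085
rewards_rejected = -0.0105
logps_chosen = -0.8470
logps_rejected = -1.0492
reported_loss = 0.5734

# ASSUMPTION: ORPO weighting term, not stated in the card; inferred from the numbers.
BETA = 0.01

# ORPO-style total loss: NLL term minus the weighted log-odds-ratio term.
loss = nll_loss - BETA * log_odds_ratio

# The reward margin is the gap between the chosen and rejected rewards.
margin = rewards_chosen - rewards_rejected

# Under the assumed BETA, each reward is BETA times the mean log-probability.
implied_chosen = BETA * logps_chosen
implied_rejected = BETA * logps_rejected

print(f"reconstructed loss: {loss:.4f} (reported {reported_loss})")
print(f"margin:             {margin:.4f}")
print(f"implied rewards:    {implied_chosen:.4f}, {implied_rejected:.4f}")
```

With this assumed weight the reconstructed loss is about 0.5735, within rounding of the reported 0.5734. The log-odds-ratio term therefore contributes less than 0.01 to the total, so the loss curve mostly tracks the NLL term.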

Model description

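The model name indicates ORPO preference tuning of meta-llama/Meta-Llama-3-8B-Instruct with QLoRA, and the framework list includes PEFT. More information needed.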

Intended uses & limitations

More information needed

Training and evaluation data

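Training and evaluation used the HuggingFaceH4/ultrafeedback_binarized dataset. More information is needed on the splits and preprocessing.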

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 7e-06
train_batch_size: 4
eval_batch_size: 8
seed: 42
distributed_type: multi-GPU
num_devices: 2
gradient_accumulation_steps: 4
total_train_batch_size: 32
total_eval_batch_size: 16
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: cosine
lr_scheduler_warmup_ratio: 0.1
num_epochs: 1
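
The batch-size and scheduler settings above can be sanity-checked with a short sketch. It recomputes the effective batch size from the per-device values and evaluates a cosine schedule with linear warmup at a few steps. The function `cosine_with_warmup` is a hypothetical helper written for illustration, not the trainer's own code, and `TOTAL_STEPS = 1909` is an estimate derived from the results table (step 1900 at epoch 0.9954), not a value stated in the card.

```python
import math

# Values from the hyperparameter list.
LEARNING_RATE = 7e-6
TRAIN_BATCH_SIZE = 4
NUM_DEVICES = 2
GRAD_ACCUM_STEPS = 4
WARMUP_RATIO = 0.1

# ASSUMPTION: estimated as round(1900 / 0.9954) from the results table.
TOTAL_STEPS = 1909

# Effective batch size = per-device batch * devices * accumulation steps.
effective_batch = TRAIN_BATCH_SIZE * NUM_DEVICES * GRAD_ACCUM_STEPS


def cosine_with_warmup(step, total_steps=TOTAL_STEPS,
                       base_lr=LEARNING_RATE, warmup_ratio=WARMUP_RATIO):
    """Hypothetical helper: linear warmup followed by cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr over the warmup phase.
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    print("effective train batch size:", effective_batch)
    for s in (0, 100, 191, 1000, 1909):
        print(f"step {s:>4}: lr = {cosine_with_warmup(s):.3e}")
```

The effective batch size matches the listed total_train_batch_size of 32. Under the estimated step count, warmup would end near step 191, so the first logged evaluation at step 100 would fall inside the warmup phase.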

Training results

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen	Nll Loss	Log Odds Ratio	Log Odds Chosen
0.8633	0.0524	100	0.7181	-0.0135	-0.0158	0.6060	0.0023	-1.5779	-1.3476	-0.4503	-0.4466	0.7126	-0.6965	0.2913
0.7831	0.1048	200	0.6487	-0.0105	-0.0125	0.6140	0.0020	-1.2499	-1.0520	-0.3621	-0.3619	0.6432	-0.6627	0.2691
0.7146	0.1572	300	0.6238	-0.0102	-0.0122	0.6140	0.0020	-1.2194	-1.0173	-0.3196	-0.3169	0.6181	-0.6594	0.2790
0.7361	0.2096	400	0.6137	-0.0100	-0.0120	0.6140	0.0020	-1.2012	-1.0014	-0.2841	-0.2811	0.6078	-0.6618	0.2770
0.7382	0.2620	500	0.6066	-0.0099	-0.0119	0.6120	0.0020	-1.1884	-0.9868	-0.3023	-0.2982	0.6006	-0.6603	0.2812
0.7339	0.3143	600	0.6009	-0.0097	-0.0118	0.6100	0.0020	-1.1751	-0.9714	-0.2544	-0.2490	0.5948	-0.6587	0.2859
0.7133	0.3667	700	0.5968	-0.0096	-0.0116	0.6070	0.0020	-1.1590	-0.9588	-0.2830	-0.2764	0.5906	-0.6590	0.2828
0.6988	0.4191	800	0.5926	-0.0095	-0.0115	0.6070	0.0020	-1.1491	-0.9451	-0.2817	-0.2745	0.5864	-0.6576	0.2898
0.7493	0.4715	900	0.5882	-0.0093	-0.0114	0.6080	0.0021	-1.1357	-0.9301	-0.2547	-0.2476	0.5820	-0.6552	0.2952
0.7022	0.5239	1000	0.5842	-0.0091	-0.0111	0.6070	0.0020	-1.1110	-0.9090	-0.2588	-0.2514	0.5780	-0.6569	0.2962
0.6805	0.5763	1100	0.5807	-0.0089	-0.0108	0.6020	0.0020	-1.0833	-0.8865	-0.2590	-0.2519	0.5744	-0.6608	0.2937
0.6427	0.6287	1200	0.5780	-0.0087	-0.0107	0.6070	0.0020	-1.0670	-0.8682	-0.2483	-0.2430	0.5717	-0.6609	0.3024
0.6762	0.6811	1300	0.5762	-0.0086	-0.0106	0.6070	0.0020	-1.0576	-0.8586	-0.2376	-0.2322	0.5698	-0.6618	0.3069
0.6944	0.7335	1400	0.5750	-0.0085	-0.0105	0.6070	0.0020	-1.0548	-0.8542	-0.2468	-0.2420	0.5686	-0.6609	0.3102
0.6695	0.7859	1500	0.5742	-0.0085	-0.0105	0.6080	0.0020	-1.0505	-0.8493	-0.2426	-0.2372	0.5678	-0.6616	0.3135
0.7258	0.8382	1600	0.5738	-0.0085	-0.0105	0.6080	0.0020	-1.0497	-0.8485	-0.2418	-0.2371	0.5673	-0.6619	0.3140
0.7193	0.8906	1700	0.5735	-0.0085	-0.0105	0.6050	0.0020	-1.0499	-0.8477	-0.2403	-0.2352	0.5671	-0.6610	0.3162
0.7038	0.9430	1800	0.5734	-0.0085	-0.0105	0.6090	0.0020	-1.0493	-0.8471	-0.2360	-0.2311	0.5670	-0.6615	0.3164
0.6723	0.9954	1900	0.5734	-0.0085	-0.0105	0.6070	0.0020	-1.0493	-0.8470	-0.2369	-0.2320	0.5669	-0.6615	0.3168

Framework versions

PEFT 0.11.1
Transformers 4.41.0
Pytorch 2.3.0+cu121
Datasets 2.19.1
Tokenizers 0.19.1
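
To reproduce the run in a matching environment, the listed versions can be compared against what is installed. The following is a minimal sketch using only the standard library. The helper `check_versions` is hypothetical, and the distribution names (for example `torch` for Pytorch) are assumed to be the usual PyPI names.

```python
from importlib import metadata

# Versions listed in the card; keys are the assumed PyPI distribution names.
EXPECTED = {
    "peft": "0.11.1",
    "transformers": "4.41.0",
    "torch": "2.3.0+cu121",
    "datasets": "2.19.1",
    "tokenizers": "0.19.1",
}


def check_versions(expected=EXPECTED):
    """Return {name: (wanted, installed_or_None, matches)} for each package."""
    report = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[name] = (wanted, installed, installed == wanted)
    return report


if __name__ == "__main__":
    for name, (wanted, installed, ok) in check_versions().items():
        status = "ok" if ok else "MISMATCH"
        print(f"{name:<13} wanted {wanted:<12} installed {installed}  [{status}]")
```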
