llama-7b-sft-DPO

This model is a fine-tuned version of argsearch/llama-7b-sft-float32 on the Dahoas/full-hh-rlhf dataset. It achieves the following results on the evaluation set:

Loss: 0.6525
Rewards/chosen: 0.3315
Rewards/rejected: 0.1953
Rewards/accuracies: 0.6080
Rewards/margins: 0.1362
Logps/rejected: -633.3815
Logps/chosen: -690.5654
Logits/rejected: -1.9212
Logits/chosen: -1.9766
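
As a quick sanity check on the numbers above, the following minimal sketch recomputes the reward margin and compares the reported loss against two reference values. It assumes the standard sigmoid DPO objective, in which the per-example loss is -log(sigmoid(margin)) and the logged rewards are already scaled by beta; the card states neither the loss variant nor beta, so both are assumptions. The variable and function names are illustrative only.

```python
import math

# Evaluation metrics copied from the list above.
rewards_chosen = 0.3315
rewards_rejected = 0.1953
reported_margin = 0.1362
reported_loss = 0.6525

# The margin is the chosen reward minus the rejected reward.
margin = rewards_chosen - rewards_rejected

# ASSUMPTION: sigmoid DPO loss, -log(sigmoid(m)) = log(1 + exp(-m)).
def sigmoid_preference_loss(m: float) -> float:
    return math.log1p(math.exp(-m))

# A policy identical to the reference has margin 0 and loss log(2), about 0.6931.
chance_loss = math.log(2.0)

# The loss is convex in the margin, so the loss evaluated at the mean margin
# is a lower bound on the mean per-example loss (Jensen's inequality).
loss_at_mean_margin = sigmoid_preference_loss(margin)

print(f"margin recomputed:   {margin:.4f} (reported {reported_margin})")
print(f"chance-level loss:   {chance_loss:.4f}")
print(f"loss at mean margin: {loss_at_mean_margin:.4f}")
print(f"reported loss:       {reported_loss}")
```

Under these assumptions the reported loss sits between the lower bound and the chance-level value, which is consistent with a modest but real preference signal.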

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 5e-07
train_batch_size: 8
eval_batch_size: 8
seed: 42
distributed_type: multi-GPU
num_devices: 4
gradient_accumulation_steps: 2
total_train_batch_size: 64
total_eval_batch_size: 32
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: cosine
lr_scheduler_warmup_ratio: 0.1
num_epochs: 1
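
The total batch sizes listed above follow from the per-device sizes, the device count, and gradient accumulation. The sketch below reproduces that arithmetic and adds an illustrative cosine schedule with linear warmup. The schedule function is an assumption about the general shape implied by lr_scheduler_type and lr_scheduler_warmup_ratio, not the trainer's exact code, and total_steps is a hypothetical argument because the card does not state the total number of optimizer steps.

```python
import math

# Hyperparameters copied from the list above.
per_device_train_batch_size = 8
per_device_eval_batch_size = 8
num_devices = 4
gradient_accumulation_steps = 2
peak_lr = 5e-07
warmup_ratio = 0.1

# Effective batch sizes: gradients are accumulated only during training.
total_train_batch_size = (
    per_device_train_batch_size * num_devices * gradient_accumulation_steps
)
total_eval_batch_size = per_device_eval_batch_size * num_devices

def lr_at(step: int, total_steps: int) -> float:
    """Illustrative cosine decay with linear warmup (assumed shape)."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    # Half-cosine decay from the peak down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print(total_train_batch_size, total_eval_batch_size)
# total_steps=1000 is a hypothetical value chosen only for illustration.
print(lr_at(0, 1000), lr_at(100, 1000), lr_at(1000, 1000))
```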

Training results

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.6884	0.06	100	0.6886	0.0879	0.0774	0.5647	0.0105	-645.1731	-714.9250	-2.7786	-2.8754
0.6769	0.11	200	0.6809	0.2546	0.2194	0.5747	0.0352	-630.9728	-698.2556	-2.6094	-2.6971
0.6734	0.17	300	0.6755	0.2980	0.2471	0.5833	0.0508	-628.1946	-693.9142	-2.5226	-2.6062
0.6684	0.23	400	0.6713	0.3480	0.2822	0.5888	0.0658	-624.6848	-688.9108	-2.4007	-2.4782
0.6647	0.29	500	0.6671	0.3495	0.2706	0.6048	0.0789	-625.8477	-688.7593	-2.3026	-2.3749
0.6598	0.34	600	0.6636	0.3311	0.2429	0.6058	0.0882	-628.6143	-690.6030	-2.1694	-2.2345
0.6598	0.40	700	0.6606	0.2824	0.1853	0.6106	0.0971	-634.3779	-695.4718	-1.9252	-1.9781
0.6563	0.46	800	0.6585	0.3476	0.2374	0.6071	0.1102	-629.1707	-688.9521	-2.0030	-2.0599
0.6636	0.51	900	0.6572	0.3569	0.2427	0.6119	0.1142	-628.6379	-688.0209	-1.9872	-2.0440
0.6436	0.57	1000	0.6558	0.2921	0.1732	0.6096	0.1190	-635.5912	-694.4999	-1.9618	-2.0181
0.6759	0.63	1100	0.6548	0.3436	0.2165	0.6071	0.1272	-631.2626	-689.3489	-1.9627	-2.0198
0.6679	0.69	1200	0.6542	0.3533	0.2212	0.6077	0.1321	-630.7878	-688.3820	-1.9058	-1.9598
0.6358	0.74	1300	0.6533	0.3363	0.2036	0.6074	0.1327	-632.5449	-690.0779	-1.9447	-2.0015
0.6473	0.80	1400	0.6528	0.3378	0.2021	0.6080	0.1357	-632.6981	-689.9300	-1.9072	-1.9621
0.6447	0.86	1500	0.6526	0.3221	0.1869	0.6080	0.1352	-634.2156	-691.5005	-1.9226	-1.9781
0.6546	0.91	1600	0.6525	0.3303	0.1941	0.6074	0.1362	-633.5018	-690.6824	-1.9134	-1.9684
0.6725	0.97	1700	0.6525	0.3312	0.1950	0.6074	0.1363	-633.4115	-690.5892	-1.9098	-1.9645
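
To summarize the trend in the table above, this small sketch takes the Validation Loss column as logged every 100 steps and checks whether it ever increases, how far it drops in total, and at which step the minimum is first reached. The values are copied from the table; the variable names are illustrative.

```python
# Validation loss at steps 100, 200, ..., 1700, copied from the table above.
steps = list(range(100, 1800, 100))
val_losses = [
    0.6886, 0.6809, 0.6755, 0.6713, 0.6671, 0.6636, 0.6606, 0.6585, 0.6572,
    0.6558, 0.6548, 0.6542, 0.6533, 0.6528, 0.6526, 0.6525, 0.6525,
]

# True if no evaluation point is worse than the one before it.
non_increasing = all(b <= a for a, b in zip(val_losses, val_losses[1:]))

# Total improvement between the first and last evaluation.
total_drop = val_losses[0] - val_losses[-1]

# First step at which the lowest validation loss is reached.
best_step = steps[val_losses.index(min(val_losses))]

print(f"non-increasing: {non_increasing}")
print(f"total drop:     {total_drop:.4f}")
print(f"best step:      {best_step}")
```

The loss flattens over the last few hundred steps, which matches the learning rate approaching zero at the end of a cosine schedule.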

Framework versions

Transformers 4.39.0.dev0
PyTorch 2.3.0+cu121
Datasets 2.14.6
Tokenizers 0.15.2
