TinyLlama-1.1B-Chat-v1.0-reasoning-v2-dpo

This model is a DPO fine-tuned version of alexredna/TinyLlama-1.1B-Chat-v1.0-reasoning-v2. The preference dataset used for training is not specified in the card. It achieves the following results on the evaluation set (the reward metrics are explained after the list):

  • Loss: 0.1772
  • Rewards/chosen: -0.9390
  • Rewards/rejected: -4.1141
  • Rewards/accuracies: 0.8385
  • Rewards/margins: 3.1750
  • Logps/rejected: -327.8484
  • Logps/chosen: -280.3031
  • Logits/rejected: -2.7526
  • Logits/chosen: -2.6271
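
For context, these reward columns follow the usual DPO logging convention (as in TRL's DPOTrainer): the logged reward is the policy's implicit reward relative to the frozen reference model, scaled by β. This is an interpretation based on the standard setup, not something stated in the card:

```latex
r_\theta(x, y) = \beta \, \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)},
\qquad
\text{Rewards/margins} = r_\theta(x, y_{\text{chosen}}) - r_\theta(x, y_{\text{rejected}})
```

Consistent with this, the reported margin equals the chosen reward minus the rejected reward: -0.9390 - (-4.1141) = 3.1751, matching the logged 3.1750 up to rounding.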

Model description

More information needed

Intended uses & limitations

More information needed
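
Pending details from the author, here is a minimal inference sketch using 🤗 Transformers. The repo id is inferred from the base model's namespace, and the chat template is assumed to be the one shipped with the TinyLlama-Chat tokenizer; both are assumptions, not confirmed by the card:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: repo id inferred from the base model's namespace; adjust if it differs.
model_id = "alexredna/TinyLlama-1.1B-Chat-v1.0-reasoning-v2-dpo"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Build the prompt with the chat template stored in the tokenizer.
messages = [{"role": "user", "content": "Explain why the sky is blue in two sentences."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")

outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7, top_p=0.95)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```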

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-07
  • train_batch_size: 8
  • eval_batch_size: 4
  • seed: 42
  • distributed_type: multi-GPU
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 3
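
These hyperparameters map naturally onto TRL's DPOTrainer. Below is a minimal sketch under that assumption; the actual training script, dataset, and β are not given in the card, so the toy preference pairs and beta=0.1 are illustrative only:

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

base_id = "alexredna/TinyLlama-1.1B-Chat-v1.0-reasoning-v2"
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

# Toy preference pairs in the prompt/chosen/rejected format DPOTrainer expects;
# the real dataset is not specified in the card.
pairs = Dataset.from_dict({
    "prompt": ["What is 2 + 2?"],
    "chosen": ["2 + 2 = 4."],
    "rejected": ["2 + 2 = 5."],
})

# Values copied from the hyperparameter list above; the default AdamW optimizer
# already uses betas=(0.9, 0.999) and epsilon=1e-08.
args = TrainingArguments(
    output_dir="tinyllama-reasoning-v2-dpo",
    learning_rate=5e-7,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=4,
    seed=42,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    num_train_epochs=3,
    bf16=True,                    # matches the BF16 weights of the release
    remove_unused_columns=False,  # required by DPOTrainer's data collator
)

trainer = DPOTrainer(
    model,
    ref_model=None,  # TRL snapshots the policy as the frozen reference model
    args=args,
    beta=0.1,        # assumption: the card does not report beta
    train_dataset=pairs,
    eval_dataset=pairs,
    tokenizer=tokenizer,
)
trainer.train()
```

The card's distributed_type: multi-GPU entry suggests the original run was launched with accelerate launch across several GPUs; the sketch above runs on a single device as written.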

Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|:-------------:|:-----:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:-------------:|
| 0.6892 | 0.06 | 100 | 0.6904 | -0.0007 | -0.0068 | 0.4692 | 0.0061 | -286.7757 | -270.9199 | -2.7940 | -2.6576 |
| 0.6767 | 0.13 | 200 | 0.6754 | -0.0060 | -0.0430 | 0.6385 | 0.0370 | -287.1373 | -270.9724 | -2.7931 | -2.6568 |
| 0.6493 | 0.19 | 300 | 0.6431 | -0.0105 | -0.1151 | 0.7885 | 0.1046 | -287.8588 | -271.0174 | -2.7922 | -2.6561 |
| 0.5809 | 0.25 | 400 | 0.5879 | -0.0345 | -0.2649 | 0.8308 | 0.2304 | -289.3571 | -271.2578 | -2.7893 | -2.6534 |
| 0.4994 | 0.32 | 500 | 0.5043 | -0.0774 | -0.5296 | 0.8385 | 0.4522 | -292.0042 | -271.6873 | -2.7851 | -2.6499 |
| 0.4093 | 0.38 | 600 | 0.4360 | -0.1267 | -0.8043 | 0.8385 | 0.6776 | -294.7504 | -272.1800 | -2.7820 | -2.6476 |
| 0.3951 | 0.44 | 700 | 0.3844 | -0.1731 | -1.0600 | 0.8423 | 0.8870 | -297.3079 | -272.6434 | -2.7796 | -2.6459 |
| 0.3307 | 0.51 | 800 | 0.3413 | -0.2208 | -1.3252 | 0.8346 | 1.1044 | -299.9597 | -273.1208 | -2.7764 | -2.6434 |
| 0.3035 | 0.57 | 900 | 0.3095 | -0.2914 | -1.5963 | 0.8308 | 1.3049 | -302.6710 | -273.8272 | -2.7734 | -2.6410 |
| 0.2565 | 0.63 | 1000 | 0.2856 | -0.3318 | -1.8163 | 0.8385 | 1.4845 | -304.8706 | -274.2305 | -2.7712 | -2.6397 |
| 0.2409 | 0.7 | 1100 | 0.2676 | -0.3754 | -2.0199 | 0.8385 | 1.6445 | -306.9071 | -274.6673 | -2.7691 | -2.6380 |
| 0.2341 | 0.76 | 1200 | 0.2515 | -0.4233 | -2.2275 | 0.8385 | 1.8042 | -308.9832 | -275.1463 | -2.7675 | -2.6371 |
| 0.2584 | 0.82 | 1300 | 0.2393 | -0.4799 | -2.4301 | 0.8385 | 1.9501 | -311.0082 | -275.7123 | -2.7653 | -2.6355 |
| 0.2171 | 0.89 | 1400 | 0.2294 | -0.5274 | -2.6087 | 0.8385 | 2.0812 | -312.7944 | -276.1873 | -2.7635 | -2.6342 |
| 0.1638 | 0.95 | 1500 | 0.2206 | -0.5748 | -2.7894 | 0.8385 | 2.2146 | -314.6021 | -276.6611 | -2.7623 | -2.6336 |
| 0.2334 | 1.02 | 1600 | 0.2147 | -0.6108 | -2.9348 | 0.8385 | 2.3240 | -316.0559 | -277.0210 | -2.7603 | -2.6319 |
| 0.2178 | 1.08 | 1700 | 0.2086 | -0.6523 | -3.0743 | 0.8385 | 2.4220 | -317.4505 | -277.4355 | -2.7597 | -2.6314 |
| 0.1704 | 1.14 | 1800 | 0.2037 | -0.6819 | -3.1955 | 0.8385 | 2.5136 | -318.6626 | -277.7317 | -2.7590 | -2.6309 |
| 0.1683 | 1.21 | 1900 | 0.1996 | -0.7152 | -3.3176 | 0.8385 | 2.6024 | -319.8835 | -278.0646 | -2.7587 | -2.6313 |
| 0.271 | 1.27 | 2000 | 0.1959 | -0.7447 | -3.4272 | 0.8385 | 2.6825 | -320.9794 | -278.3595 | -2.7576 | -2.6305 |
| 0.127 | 1.33 | 2100 | 0.1930 | -0.7665 | -3.5137 | 0.8385 | 2.7472 | -321.8449 | -278.5782 | -2.7571 | -2.6302 |
| 0.2107 | 1.4 | 2200 | 0.1905 | -0.7830 | -3.5883 | 0.8385 | 2.8053 | -322.5906 | -278.7429 | -2.7572 | -2.6305 |
| 0.1977 | 1.46 | 2300 | 0.1883 | -0.7986 | -3.6574 | 0.8385 | 2.8588 | -323.2822 | -278.8991 | -2.7566 | -2.6300 |
| 0.1655 | 1.52 | 2400 | 0.1872 | -0.8203 | -3.7149 | 0.8385 | 2.8946 | -323.8572 | -279.1161 | -2.7553 | -2.6289 |
| 0.1776 | 1.59 | 2500 | 0.1850 | -0.8439 | -3.7881 | 0.8385 | 2.9442 | -324.5885 | -279.3518 | -2.7548 | -2.6285 |
| 0.1372 | 1.65 | 2600 | 0.1850 | -0.8548 | -3.8280 | 0.8385 | 2.9732 | -324.9880 | -279.4609 | -2.7544 | -2.6282 |
| 0.15 | 1.71 | 2700 | 0.1836 | -0.8734 | -3.8792 | 0.8385 | 3.0059 | -325.5001 | -279.6465 | -2.7543 | -2.6283 |
| 0.1338 | 1.78 | 2800 | 0.1823 | -0.8736 | -3.9132 | 0.8385 | 3.0396 | -325.8393 | -279.6486 | -2.7541 | -2.6282 |
| 0.1507 | 1.84 | 2900 | 0.1811 | -0.8932 | -3.9558 | 0.8385 | 3.0626 | -326.2653 | -279.8444 | -2.7533 | -2.6273 |
| 0.1615 | 1.9 | 3000 | 0.1811 | -0.8986 | -3.9790 | 0.8385 | 3.0804 | -326.4981 | -279.8992 | -2.7533 | -2.6275 |
| 0.1656 | 1.97 | 3100 | 0.1800 | -0.9039 | -4.0052 | 0.8385 | 3.1012 | -326.7594 | -279.9523 | -2.7528 | -2.6270 |
| 0.1398 | 2.03 | 3200 | 0.1797 | -0.9123 | -4.0258 | 0.8385 | 3.1135 | -326.9660 | -280.0360 | -2.7534 | -2.6278 |
| 0.1929 | 2.09 | 3300 | 0.1792 | -0.9098 | -4.0380 | 0.8385 | 3.1282 | -327.0879 | -280.0112 | -2.7524 | -2.6269 |
| 0.1616 | 2.16 | 3400 | 0.1787 | -0.9249 | -4.0622 | 0.8385 | 3.1374 | -327.3301 | -280.1616 | -2.7519 | -2.6263 |
| 0.1664 | 2.22 | 3500 | 0.1790 | -0.9246 | -4.0716 | 0.8385 | 3.1470 | -327.4239 | -280.1592 | -2.7524 | -2.6269 |
| 0.2085 | 2.28 | 3600 | 0.1787 | -0.9301 | -4.0835 | 0.8385 | 3.1534 | -327.5426 | -280.2136 | -2.7532 | -2.6279 |
| 0.1565 | 2.35 | 3700 | 0.1782 | -0.9301 | -4.0909 | 0.8385 | 3.1608 | -327.6164 | -280.2137 | -2.7521 | -2.6265 |
| 0.153 | 2.41 | 3800 | 0.1778 | -0.9281 | -4.0947 | 0.8385 | 3.1666 | -327.6550 | -280.1937 | -2.7522 | -2.6268 |
| 0.1787 | 2.47 | 3900 | 0.1783 | -0.9319 | -4.0918 | 0.8385 | 3.1599 | -327.6259 | -280.2316 | -2.7520 | -2.6266 |
| 0.172 | 2.54 | 4000 | 0.1780 | -0.9338 | -4.1035 | 0.8385 | 3.1697 | -327.7429 | -280.2505 | -2.7526 | -2.6273 |
| 0.2643 | 2.6 | 4100 | 0.1771 | -0.9229 | -4.0969 | 0.8385 | 3.1739 | -327.6764 | -280.1422 | -2.7521 | -2.6267 |
| 0.1619 | 2.66 | 4200 | 0.1776 | -0.9326 | -4.1083 | 0.8385 | 3.1757 | -327.7909 | -280.2390 | -2.7523 | -2.6270 |
| 0.2413 | 2.73 | 4300 | 0.1778 | -0.9292 | -4.1024 | 0.8385 | 3.1732 | -327.7315 | -280.2050 | -2.7529 | -2.6277 |
| 0.1187 | 2.79 | 4400 | 0.1778 | -0.9343 | -4.1068 | 0.8385 | 3.1725 | -327.7758 | -280.2554 | -2.7521 | -2.6267 |
| 0.1439 | 2.86 | 4500 | 0.1776 | -0.9368 | -4.1118 | 0.8385 | 3.1750 | -327.8253 | -280.2808 | -2.7517 | -2.6263 |
| 0.1116 | 2.92 | 4600 | 0.1773 | -0.9302 | -4.1079 | 0.8385 | 3.1777 | -327.7867 | -280.2152 | -2.7526 | -2.6272 |
| 0.18 | 2.98 | 4700 | 0.1772 | -0.9290 | -4.1048 | 0.8385 | 3.1758 | -327.7554 | -280.2029 | -2.7526 | -2.6271 |

Framework versions

  • Transformers 4.36.2
  • PyTorch 2.1.0+cu118
  • Datasets 2.14.6
  • Tokenizers 0.15.0