
# distily_experiments_loss_kl

This student model was distilled from the teacher model Qwen/Qwen2-0.5B-Instruct on an unspecified dataset.

The Distily library was used for this distillation.
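
The checkpoint is a standard `transformers` causal language model, so it can be loaded and sampled the usual way. The snippet below is a minimal usage sketch; the repo id `lapp0/distily_experiments_loss_kl` is taken from this card, while the prompt and generation settings are arbitrary examples.

```python
# Minimal usage sketch: load the distilled student and generate text.
# The bfloat16 dtype matches the stored tensor type; adjust for your hardware.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lapp0/distily_experiments_loss_kl"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```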

It achieves the following results on the evaluation set (the perplexity metrics are described briefly after the list):

- eval_enwikippl: 44644.5977
- eval_frwikippl: 57738.3242
- eval_zhwikippl: 350857.0312
- eval_loss: 34.5918
- eval_runtime: 91.6837
- eval_samples_per_second: 10.907
- eval_steps_per_second: 2.727
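
The `*wikippl` values are perplexities of the student on English, French, and Chinese Wikipedia text. The exact corpora and windowing used by Distily are not specified here; the sketch below only illustrates the standard relationship between mean token-level cross-entropy and perplexity, which these metrics are assumed to follow.

```python
# Illustrative only: perplexity as exp(mean next-token cross-entropy).
import torch
import torch.nn.functional as F

def perplexity(logits: torch.Tensor, labels: torch.Tensor) -> float:
    # logits: (batch, seq_len, vocab), labels: (batch, seq_len)
    shift_logits = logits[:, :-1, :].reshape(-1, logits.size(-1))
    shift_labels = labels[:, 1:].reshape(-1)
    nll = F.cross_entropy(shift_logits, shift_labels)  # mean negative log-likelihood
    return torch.exp(nll).item()
```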

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:

- distillation_strategy: logits_activations
- loss_fn: kl (see the loss sketch after this list)
- train_embeddings: True
- learning_rate: 4e-05
- train_batch_size: 2
- eval_batch_size: 4
- seed: 42
- gradient_accumulation_steps: 8
- total_train_batch_size: 16
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: constant
- num_epochs: 1.0
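
With `loss_fn: kl`, the student is trained to match the teacher's output distribution via a KL-divergence term on the logits (the `logits_activations` strategy also distills intermediate activations, which is not shown here). The block below is a minimal sketch of a logits-only KL loss; the temperature, reduction, and scaling are assumptions, not Distily's exact implementation.

```python
# Sketch of a KL distillation loss on logits. temperature=1.0 and "batchmean"
# reduction are assumptions and may differ from Distily's implementation.
import torch.nn.functional as F

def kl_distillation_loss(student_logits, teacher_logits, temperature: float = 1.0):
    # KL(teacher || student) over the vocabulary dimension.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    return loss * temperature ** 2  # standard scaling when a temperature is used
```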

## Resource Usage

Peak GPU Memory: 18.5713 GB

## Eval-Phase Metrics

| step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | zhwikippl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| teacher eval | | 13.0697 | 11.6518 | | | | | 21.6262 |
| 0 | 0 | 182369.3906 | 181455.125 | 780.3353 | 91.2363 | 10.961 | 2.74 | 180761.6719 |
| 500 | 0.0808 | 86362.2812 | 101159.6328 | 54.5867 | 91.4066 | 10.94 | 2.735 | 232064.5 |
| 1000 | 0.1616 | 69487.6797 | 86111.1172 | 53.6604 | 91.4579 | 10.934 | 2.733 | 345169.5312 |
| 1500 | 0.2424 | 82681.5703 | 89315.8984 | 35.7168 | 91.5034 | 10.929 | 2.732 | 445378.8125 |
| 2000 | 0.3232 | 57160.3164 | 68079.1016 | 35.2824 | 91.7023 | 10.905 | 2.726 | 518689.375 |
| 2500 | 0.4040 | 51816.1094 | 64056.9492 | 35.1557 | 91.637 | 10.913 | 2.728 | 541196.6875 |
| 3000 | 0.4848 | 48757.2461 | 61247.7969 | 34.9901 | 91.7102 | 10.904 | 2.726 | 490950.375 |
| 3500 | 0.5657 | 46986.8359 | 60241.9648 | 34.8887 | 91.7509 | 10.899 | 2.725 | 494001.2812 |
| 4000 | 0.6465 | 45312.1992 | 58663.0117 | 34.8189 | 91.5035 | 10.929 | 2.732 | 481694.25 |
| 4500 | 0.7273 | 44666.1914 | 58929.5781 | 34.8044 | 91.6675 | 10.909 | 2.727 | 422349.9062 |
| 5000 | 0.8081 | 44634.5938 | 59004.7656 | 34.6895 | 91.5871 | 10.919 | 2.73 | 366104.7188 |
| 5500 | 0.8889 | 43732.4805 | 57270.8125 | 34.6591 | 91.6394 | 10.912 | 2.728 | 369470.4688 |
| 6000 | 0.9697 | 43799.5938 | 57096.7930 | 34.6125 | 91.6634 | 10.909 | 2.727 | 372021.5 |
| 6187 | 0.9999 | 44644.5977 | 57738.3242 | 34.5918 | 91.6837 | 10.907 | 2.727 | 350857.0312 |

## Framework versions

- Distily 0.1.0
- Transformers 4.43.3
- Pytorch 2.3.0
- Datasets 2.20.0