The training time mentioned in the paper and the explanations in the Git repository have a significant gap.

#77
by wangzl - opened

I summarized the training throughput according to the model cards as follows:

| Model   | Training tokens | Time    | A100s | Throughput (tokens/s per A100) |
|---------|-----------------|---------|-------|--------------------------------|
| phi-1   | 54B             | 6 days  | 8     | ~13,020                        |
| phi-1.5 | 150B            | 8 days  | 32    | ~6,781                         |
| phi-2   | 1.4T            | 14 days | 96    | ~12,056                        |

In the phi-1.5 paper, however, training on 150B tokens is reported to cost 1.5K A100 GPU-hours, which implies a throughput of roughly 27,777 tokens/s per A100.
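The arithmetic behind the figures above can be sketched as a quick back-of-the-envelope check (using the token counts, durations, and GPU counts quoted from the model cards and the paper):

```python
# Per-A100 throughput: tokens / (days * 86400 s/day * number of GPUs).
def tokens_per_sec_per_gpu(tokens: float, days: float, gpus: int) -> float:
    return tokens / (days * 86400 * gpus)

# Model-card figures
phi1  = tokens_per_sec_per_gpu(54e9,   6,  8)   # ~13,020 tokens/s per A100
phi15 = tokens_per_sec_per_gpu(150e9,  8, 32)   # ~6,781 tokens/s per A100
phi2  = tokens_per_sec_per_gpu(1.4e12, 14, 96)  # ~12,056 tokens/s per A100

# Paper figure for phi-1.5: 150B tokens in 1.5K A100-hours
paper_phi15 = 150e9 / (1500 * 3600)             # ~27,777 tokens/s per A100
```

The paper's implied throughput is roughly 4x the model-card figure for phi-1.5, which is the gap in question.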

Comparing these figures, I suspect there may be errors in the numbers reported in the paper, though there could be another explanation. I welcome any input.

I am facing this error with my fine-tuned model: "The repository for microsoft/phi-1_5 contains custom code which must be executed to correctly load the model. You can inspect the repository content at https://hf.co/microsoft/phi-1_5. Please pass the argument `trust_remote_code=True` to allow custom code to be run." Can anyone suggest a solution?
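As the error message itself suggests, phi-1_5 (and fine-tunes derived from it) ships custom modeling code, so loading it requires passing `trust_remote_code=True`. A minimal sketch, where `"my-finetuned-phi"` is a placeholder for your own checkpoint path or Hub id:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# "my-finetuned-phi" is a hypothetical local directory or Hub repo id.
model = AutoModelForCausalLM.from_pretrained(
    "my-finetuned-phi",
    trust_remote_code=True,  # allow the repo's custom modeling code to run
)
tokenizer = AutoTokenizer.from_pretrained(
    "my-finetuned-phi",
    trust_remote_code=True,
)
```

Note that `trust_remote_code=True` executes code from the repository on your machine, so inspect the repo contents first, as the error message recommends.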
