Generating train split: 5315000 examples [01:02, 85927.46 examples/s] Generating train 
split: 5325000 examples [01:02, 87040.59 examples/s] Generating train split: 5335000 examples [01:02, 86376.93 examples/s] Generating train split: 5345000 examples [01:03, 86005.52 examples/s] Generating train split: 5355000 examples [01:03, 85717.88 examples/s] Generating train split: 5364000 examples [01:03, 84171.99 examples/s] Generating train split: 5374000 examples [01:03, 84271.75 examples/s] Generating train split: 5384000 examples [01:03, 84131.83 examples/s] Generating train split: 5394000 examples [01:03, 84295.39 examples/s] Generating train split: 5403000 examples [01:03, 84223.19 examples/s] Generating train split: 5413000 examples [01:03, 84010.35 examples/s] Generating train split: 5423000 examples [01:04, 84584.77 examples/s] Generating train split: 5432000 examples [01:04, 83512.80 examples/s] Generating train split: 5442000 examples [01:04, 84661.36 examples/s] Generating train split: 5452000 examples [01:04, 85719.13 examples/s] Generating train split: 5462000 examples [01:04, 86660.20 examples/s] Generating train split: 5472000 examples [01:04, 87948.60 examples/s] Generating train split: 5482000 examples [01:04, 88807.64 examples/s] Generating train split: 5492000 examples [01:04, 88772.44 examples/s] Generating train split: 5502000 examples [01:04, 87988.47 examples/s] Generating train split: 5512000 examples [01:05, 88083.88 examples/s] Generating train split: 5522000 examples [01:05, 88015.42 examples/s] Generating train split: 5532000 examples [01:05, 86499.63 examples/s] Generating train split: 5542000 examples [01:05, 85925.89 examples/s] Generating train split: 5551000 examples [01:05, 84320.82 examples/s] Generating train split: 5560000 examples [01:05, 83238.12 examples/s] Generating train split: 5569000 examples [01:05, 82148.22 examples/s] Generating train split: 5579000 examples [01:05, 83330.21 examples/s] Generating train split: 5589000 examples [01:05, 84431.95 examples/s] Generating train split: 5598000 examples [01:06, 
83774.34 examples/s] Generating train split: 5608000 examples [01:06, 85013.09 examples/s] Generating train split: 5618000 examples [01:06, 86087.77 examples/s] Generating train split: 5628000 examples [01:06, 86935.72 examples/s] Generating train split: 5638000 examples [01:06, 86813.11 examples/s] Generating train split: 5648000 examples [01:06, 87365.79 examples/s] Generating train split: 5658000 examples [01:06, 87890.32 examples/s] Generating train split: 5668000 examples [01:06, 87273.73 examples/s] Generating train split: 5678000 examples [01:06, 87929.76 examples/s] Generating train split: 5688000 examples [01:07, 88276.62 examples/s] Generating train split: 5698000 examples [01:07, 87936.75 examples/s] Generating train split: 5707000 examples [01:07, 87376.23 examples/s] Generating train split: 5717000 examples [01:07, 87175.80 examples/s] Generating train split: 5727000 examples [01:07, 86407.49 examples/s] Generating train split: 5737000 examples [01:07, 86657.85 examples/s] Generating train split: 5747000 examples [01:07, 85200.14 examples/s] Generating train split: 5757000 examples [01:07, 85615.65 examples/s] Generating train split: 5767000 examples [01:08, 86676.59 examples/s] Generating train split: 5779000 examples [01:08, 80627.42 examples/s] Generating train split: 5789000 examples [01:08, 82301.93 examples/s] Generating train split: 5799000 examples [01:08, 83424.07 examples/s] Generating train split: 5809000 examples [01:08, 84468.19 examples/s] Generating train split: 5819000 examples [01:08, 85428.27 examples/s] Generating train split: 5829000 examples [01:08, 85908.13 examples/s] Generating train split: 5838000 examples [01:08, 84554.36 examples/s] Generating train split: 5848000 examples [01:08, 84579.43 examples/s] Generating train split: 5858000 examples [01:09, 85538.05 examples/s] Generating train split: 5867000 examples [01:09, 84876.62 examples/s] Generating train split: 5877000 examples [01:09, 85291.99 examples/s] Generating train 
split: 5887000 examples [01:09, 85475.58 examples/s] Generating train split: 5896000 examples [01:09, 84057.50 examples/s] Generating train split: 5905000 examples [01:09, 82516.32 examples/s] Generating train split: 5915000 examples [01:09, 82566.68 examples/s] Generating train split: 5924000 examples [01:09, 81831.05 examples/s] Generating train split: 5934000 examples [01:10, 83091.08 examples/s] Generating train split: 5944000 examples [01:10, 84069.81 examples/s] Generating train split: 5954000 examples [01:10, 83723.13 examples/s] Generating train split: 5964000 examples [01:10, 84531.60 examples/s] Generating train split: 5973000 examples [01:10, 83145.72 examples/s] Generating train split: 5983000 examples [01:10, 83693.07 examples/s] Generating train split: 5992000 examples [01:10, 82014.20 examples/s] Generating train split: 6002000 examples [01:10, 82120.59 examples/s] Generating train split: 6012000 examples [01:10, 82263.29 examples/s] Generating train split: 6022000 examples [01:11, 84265.80 examples/s] Generating train split: 6032000 examples [01:11, 85101.03 examples/s] Generating train split: 6041000 examples [01:11, 84256.94 examples/s] Generating train split: 6051000 examples [01:11, 85443.03 examples/s] Generating train split: 6061000 examples [01:11, 85043.23 examples/s] Generating train split: 6071000 examples [01:11, 85513.91 examples/s] Generating train split: 6081000 examples [01:11, 85033.60 examples/s] Generating train split: 6091000 examples [01:11, 85435.60 examples/s] Generating train split: 6101000 examples [01:11, 85440.88 examples/s] Generating train split: 6111000 examples [01:12, 85171.68 examples/s] Generating train split: 6121000 examples [01:12, 85528.80 examples/s] Generating train split: 6131000 examples [01:12, 86003.03 examples/s] Generating train split: 6141000 examples [01:12, 86584.19 examples/s] Generating train split: 6151000 examples [01:12, 86552.33 examples/s] Generating train split: 6161000 examples [01:12, 
86365.13 examples/s] Generating train split: 6171000 examples [01:12, 86395.18 examples/s] Generating train split: 6180000 examples [01:12, 84406.75 examples/s] Generating train split: 6190000 examples [01:13, 85935.85 examples/s] Generating train split: 6200000 examples [01:13, 86249.62 examples/s] Generating train split: 6210000 examples [01:13, 86155.00 examples/s] Generating train split: 6220000 examples [01:13, 86787.54 examples/s] Generating train split: 6230000 examples [01:13, 86855.18 examples/s] Generating train split: 6240000 examples [01:13, 86612.81 examples/s] Generating train split: 6250000 examples [01:13, 84951.22 examples/s] Generating train split: 6260000 examples [01:13, 85470.42 examples/s] Generating train split: 6270000 examples [01:13, 86042.92 examples/s] Generating train split: 6279000 examples [01:14, 80006.15 examples/s] Generating train split: 6288000 examples [01:14, 79864.95 examples/s] Generating train split: 6298000 examples [01:14, 80606.38 examples/s] Generating train split: 6307000 examples [01:14, 80965.68 examples/s] Generating train split: 6317000 examples [01:14, 82297.59 examples/s] Generating train split: 6326000 examples [01:14, 82249.82 examples/s] Generating train split: 6336000 examples [01:14, 83310.77 examples/s] Generating train split: 6345000 examples [01:14, 83303.32 examples/s] Generating train split: 6355000 examples [01:15, 84160.64 examples/s] Generating train split: 6364000 examples [01:15, 81682.44 examples/s] Generating train split: 6373000 examples [01:15, 82477.56 examples/s] Generating train split: 6382000 examples [01:15, 81848.96 examples/s] Generating train split: 6392000 examples [01:15, 82310.96 examples/s] Generating train split: 6401000 examples [01:15, 82163.94 examples/s] Generating train split: 6411000 examples [01:15, 83719.89 examples/s] Generating train split: 6421000 examples [01:15, 84700.10 examples/s] Generating train split: 6434000 examples [01:15, 83155.47 examples/s] Generating train 
split: 6444000 examples [01:16, 83578.16 examples/s] Generating train split: 6454000 examples [01:16, 83471.56 examples/s] Generating train split: 6464000 examples [01:16, 84775.92 examples/s] Generating train split: 6474000 examples [01:16, 85949.91 examples/s] Generating train split: 6484000 examples [01:16, 85115.97 examples/s] Generating train split: 6494000 examples [01:16, 85458.98 examples/s] Generating train split: 6503000 examples [01:16, 84422.36 examples/s] Generating train split: 6513000 examples [01:16, 84980.80 examples/s] Generating train split: 6523000 examples [01:17, 85286.76 examples/s] Generating train split: 6532000 examples [01:17, 83965.98 examples/s] Generating train split: 6542000 examples [01:17, 85359.35 examples/s] Generating train split: 6551000 examples [01:17, 84121.88 examples/s] Generating train split: 6561000 examples [01:17, 84163.27 examples/s] Generating train split: 6571000 examples [01:17, 85330.97 examples/s] Generating train split: 6581000 examples [01:17, 85816.01 examples/s] Generating train split: 6591000 examples [01:17, 86263.67 examples/s] Generating train split: 6601000 examples [01:17, 86230.20 examples/s] Generating train split: 6610000 examples [01:18, 85236.68 examples/s] Generating train split: 6620000 examples [01:18, 85490.63 examples/s] Generating train split: 6630000 examples [01:18, 86064.18 examples/s] Generating train split: 6640000 examples [01:18, 85600.27 examples/s] Generating train split: 6650000 examples [01:18, 86856.78 examples/s] Generating train split: 6659000 examples [01:18, 85240.47 examples/s] Generating train split: 6669000 examples [01:18, 85981.98 examples/s] Generating train split: 6678000 examples [01:18, 83978.67 examples/s] Generating train split: 6688000 examples [01:18, 84947.42 examples/s] Generating train split: 6698000 examples [01:19, 85667.16 examples/s] Generating train split: 6708000 examples [01:19, 85909.89 examples/s] Generating train split: 6718000 examples [01:19, 
87004.17 examples/s] Generating train split: 6728000 examples [01:19, 87265.78 examples/s] Generating train split: 6738000 examples [01:19, 86203.54 examples/s] Generating train split: 6748000 examples [01:19, 87443.02 examples/s] Generating train split: 6758000 examples [01:19, 87327.64 examples/s] Generating train split: 6768000 examples [01:19, 87820.97 examples/s] Generating train split: 6777000 examples [01:19, 85538.33 examples/s] Generating train split: 6787000 examples [01:20, 86298.14 examples/s] Generating train split: 6797000 examples [01:20, 85735.95 examples/s] Generating train split: 6807000 examples [01:20, 85067.65 examples/s] Generating train split: 6817000 examples [01:20, 85407.21 examples/s] Generating train split: 6827000 examples [01:20, 86067.10 examples/s] Generating train split: 6837000 examples [01:20, 86225.73 examples/s] Generating train split: 6846000 examples [01:20, 83681.32 examples/s] Generating train split: 6856000 examples [01:20, 84660.78 examples/s] Generating train split: 6865000 examples [01:21, 83277.41 examples/s] Generating train split: 6874000 examples [01:21, 82559.60 examples/s] Generating train split: 6883000 examples [01:21, 82289.91 examples/s] Generating train split: 6892000 examples [01:21, 81691.52 examples/s] Generating train split: 6905000 examples [01:21, 80741.44 examples/s] Generating train split: 6915000 examples [01:21, 81948.76 examples/s] Generating train split: 6924000 examples [01:21, 81671.54 examples/s] Generating train split: 6934000 examples [01:21, 82326.19 examples/s] Generating train split: 6944000 examples [01:21, 82986.44 examples/s] Generating train split: 6954000 examples [01:22, 83584.51 examples/s] Generating train split: 6967000 examples [01:22, 81942.21 examples/s] Generating train split: 6977000 examples [01:22, 82914.07 examples/s] Generating train split: 6987000 examples [01:22, 83583.78 examples/s] Generating train split: 6997000 examples [01:22, 83562.60 examples/s] Generating train 
split: 7007000 examples [01:22, 84266.58 examples/s] Generating train split: 7017000 examples [01:22, 85172.00 examples/s] Generating train split: 7027000 examples [01:22, 85297.92 examples/s] Generating train split: 7037000 examples [01:23, 84078.15 examples/s] Generating train split: 7047000 examples [01:23, 84321.64 examples/s] Generating train split: 7057000 examples [01:23, 84918.22 examples/s] Generating train split: 7067000 examples [01:23, 86179.60 examples/s] Generating train split: 7077000 examples [01:23, 86397.96 examples/s] Generating train split: 7087000 examples [01:23, 87577.35 examples/s] Generating train split: 7097000 examples [01:23, 87820.72 examples/s] Generating train split: 7107000 examples [01:23, 87546.93 examples/s] Generating train split: 7117000 examples [01:23, 88317.21 examples/s] Generating train split: 7127000 examples [01:24, 87784.71 examples/s] Generating train split: 7137000 examples [01:24, 88073.06 examples/s] Generating train split: 7147000 examples [01:24, 87916.74 examples/s] Generating train split: 7157000 examples [01:24, 86842.35 examples/s] Generating train split: 7167000 examples [01:24, 87164.42 examples/s] Generating train split: 7177000 examples [01:24, 86868.92 examples/s] Generating train split: 7190000 examples [01:24, 84683.17 examples/s] Generating train split: 7199000 examples [01:24, 83630.68 examples/s] Generating train split: 7209000 examples [01:25, 84490.40 examples/s] Generating train split: 7218000 examples [01:25, 83321.58 examples/s] Generating train split: 7228000 examples [01:25, 84579.77 examples/s] Generating train split: 7238000 examples [01:25, 85228.94 examples/s] Generating train split: 7248000 examples [01:25, 84874.60 examples/s] Generating train split: 7257000 examples [01:25, 83193.30 examples/s] Generating train split: 7267000 examples [01:25, 84202.01 examples/s] Generating train split: 7277000 examples [01:25, 85159.18 examples/s] Generating train split: 7287000 examples [01:25, 
84941.94 examples/s] Generating train split: 7297000 examples [01:26, 86196.72 examples/s] Generating train split: 7307000 examples [01:26, 85806.65 examples/s] Generating train split: 7316000 examples [01:26, 84723.12 examples/s] Generating train split: 7329000 examples [01:26, 81049.73 examples/s] Generating train split: 7338000 examples [01:26, 80644.90 examples/s] Generating train split: 7351000 examples [01:26, 80333.92 examples/s] Generating train split: 7360000 examples [01:26, 79719.44 examples/s] Generating train split: 7370000 examples [01:27, 81188.42 examples/s] Generating train split: 7379000 examples [01:27, 81249.62 examples/s] Generating train split: 7389000 examples [01:27, 83899.44 examples/s] Generating train split: 7399000 examples [01:27, 84490.37 examples/s] Generating train split: 7408000 examples [01:27, 83447.13 examples/s] Generating train split: 7418000 examples [01:27, 84374.59 examples/s] Generating train split: 7428000 examples [01:27, 84743.80 examples/s] Generating train split: 7438000 examples [01:27, 84158.34 examples/s] Generating train split: 7447000 examples [01:27, 83707.05 examples/s] Generating train split: 7457000 examples [01:28, 84145.82 examples/s] Generating train split: 7467000 examples [01:28, 84882.61 examples/s] Generating train split: 7477000 examples [01:28, 85112.61 examples/s] Generating train split: 7487000 examples [01:28, 85782.93 examples/s] Generating train split: 7497000 examples [01:28, 85803.85 examples/s] Generating train split: 7507000 examples [01:28, 85458.24 examples/s] Generating train split: 7517000 examples [01:28, 84420.05 examples/s] Generating train split: 7526000 examples [01:28, 84205.48 examples/s] Generating train split: 7535000 examples [01:28, 83631.17 examples/s] Generating train split: 7544000 examples [01:29, 82892.71 examples/s] Generating train split: 7554000 examples [01:29, 85016.37 examples/s] Generating train split: 7564000 examples [01:29, 84434.95 examples/s] Generating train 
split: 7574000 examples [01:29, 84856.26 examples/s] Generating train split: 7584000 examples [01:29, 84695.63 examples/s] Generating train split: 7593000 examples [01:29, 81435.05 examples/s] Generating train split: 7603000 examples [01:29, 83152.57 examples/s] Generating train split: 7613000 examples [01:29, 84623.83 examples/s] Generating train split: 7623000 examples [01:30, 84490.98 examples/s] Generating train split: 7633000 examples [01:30, 84453.27 examples/s] Generating train split: 7643000 examples [01:30, 84304.04 examples/s] Generating train split: 7652000 examples [01:30, 81958.26 examples/s] Generating train split: 7662000 examples [01:30, 82673.00 examples/s] Generating train split: 7672000 examples [01:30, 83037.90 examples/s] Generating train split: 7682000 examples [01:30, 82986.28 examples/s] Generating train split: 7692000 examples [01:30, 83151.59 examples/s] Generating train split: 7702000 examples [01:30, 83296.28 examples/s] Generating train split: 7712000 examples [01:31, 83676.14 examples/s] Generating train split: 7721000 examples [01:31, 82504.28 examples/s] Generating train split: 7731000 examples [01:31, 83375.72 examples/s] Generating train split: 7741000 examples [01:31, 83785.27 examples/s] Generating train split: 7751000 examples [01:31, 84672.33 examples/s] Generating train split: 7761000 examples [01:31, 85108.38 examples/s] Generating train split: 7771000 examples [01:31, 85555.16 examples/s] Generating train split: 7781000 examples [01:31, 85548.83 examples/s] Generating train split: 7790000 examples [01:31, 84473.14 examples/s] Generating train split: 7799000 examples [01:32, 83810.46 examples/s] Generating train split: 7809000 examples [01:32, 83665.26 examples/s] Generating train split: 7818000 examples [01:32, 83465.84 examples/s] Generating train split: 7828000 examples [01:32, 82681.78 examples/s] Generating train split: 7838000 examples [01:32, 83018.21 examples/s] Generating train split: 7848000 examples [01:32, 
83786.64 examples/s] Generating train split: 7857000 examples [01:32, 83011.70 examples/s] Generating train split: 7867000 examples [01:32, 83059.23 examples/s] Generating train split: 7877000 examples [01:33, 83816.25 examples/s] Generating train split: 7887000 examples [01:33, 84577.38 examples/s] Generating train split: 7897000 examples [01:33, 84653.10 examples/s] Generating train split: 7907000 examples [01:33, 84536.65 examples/s] Generating train split: 7917000 examples [01:33, 84204.73 examples/s] Generating train split: 7927000 examples [01:33, 84554.51 examples/s] Generating train split: 7936000 examples [01:33, 83338.34 examples/s] Generating train split: 7946000 examples [01:33, 84008.91 examples/s] Generating train split: 7956000 examples [01:33, 84890.56 examples/s] Generating train split: 7966000 examples [01:34, 84137.40 examples/s] Generating train split: 7976000 examples [01:34, 85572.47 examples/s] Generating train split: 7986000 examples [01:34, 85787.48 examples/s] Generating train split: 7996000 examples [01:34, 85594.30 examples/s] Generating train split: 8006000 examples [01:34, 84532.47 examples/s] Generating train split: 8015000 examples [01:34, 83352.31 examples/s] Generating train split: 8025000 examples [01:34, 84395.47 examples/s] Generating train split: 8034000 examples [01:34, 83741.59 examples/s] Generating train split: 8044000 examples [01:35, 84625.20 examples/s] Generating train split: 8054000 examples [01:35, 84812.40 examples/s] Generating train split: 8064000 examples [01:35, 85297.69 examples/s] Generating train split: 8073000 examples [01:35, 84921.79 examples/s] Generating train split: 8083000 examples [01:35, 86074.08 examples/s] Generating train split: 8093000 examples [01:35, 85815.61 examples/s] Generating train split: 8103000 examples [01:35, 85654.82 examples/s] Generating train split: 8113000 examples [01:35, 85340.97 examples/s] Generating train split: 8123000 examples [01:35, 85784.34 examples/s] Generating train 
split: 8133000 examples [01:36, 86165.67 examples/s] Generating train split: 8143000 examples [01:36, 86468.53 examples/s] Generating train split: 8153000 examples [01:36, 86476.20 examples/s] Generating train split: 8162000 examples [01:36, 84770.30 examples/s] Generating train split: 8171000 examples [01:36, 84240.22 examples/s] Generating train split: 8181000 examples [01:36, 84904.83 examples/s] Generating train split: 8191000 examples [01:36, 85209.31 examples/s] Generating train split: 8201000 examples [01:36, 85553.04 examples/s] Generating train split: 8211000 examples [01:36, 86513.14 examples/s] Generating train split: 8221000 examples [01:37, 86065.38 examples/s] Generating train split: 8231000 examples [01:37, 86285.90 examples/s] Generating train split: 8241000 examples [01:37, 86113.97 examples/s] Generating train split: 8251000 examples [01:37, 85975.20 examples/s] Generating train split: 8261000 examples [01:37, 86940.39 examples/s] Generating train split: 8271000 examples [01:37, 87355.89 examples/s] Generating train split: 8281000 examples [01:37, 87057.32 examples/s] Generating train split: 8291000 examples [01:37, 87063.04 examples/s] Generating train split: 8301000 examples [01:37, 86562.71 examples/s] Generating train split: 8311000 examples [01:38, 86908.52 examples/s] Generating train split: 8321000 examples [01:38, 87139.24 examples/s] Generating train split: 8331000 examples [01:38, 87342.25 examples/s] Generating train split: 8341000 examples [01:38, 87164.92 examples/s] Generating train split: 8350000 examples [01:38, 86000.09 examples/s] Generating train split: 8360000 examples [01:38, 85246.71 examples/s] Generating train split: 8372000 examples [01:38, 80056.46 examples/s] Generating train split: 8381000 examples [01:38, 80414.94 examples/s] Generating train split: 8391000 examples [01:39, 82309.12 examples/s] Generating train split: 8401000 examples [01:39, 83773.41 examples/s] Generating train split: 8411000 examples [01:39, 
84637.77 examples/s] Generating train split: 8421000 examples [01:39, 85055.62 examples/s] Generating train split: 8431000 examples [01:39, 85198.06 examples/s] Generating train split: 8441000 examples [01:39, 85567.81 examples/s] Generating train split: 8451000 examples [01:39, 85997.50 examples/s] Generating train split: 8460000 examples [01:39, 84658.57 examples/s] Generating train split: 8470000 examples [01:39, 84938.28 examples/s] Generating train split: 8480000 examples [01:40, 84605.94 examples/s] Generating train split: 8490000 examples [01:40, 85050.60 examples/s] Generating train split: 8500000 examples [01:40, 85047.84 examples/s] Generating train split: 8510000 examples [01:40, 84452.51 examples/s] Generating train split: 8520000 examples [01:40, 84797.08 examples/s] Generating train split: 8529000 examples [01:40, 83543.31 examples/s] Generating train split: 8539000 examples [01:40, 84731.46 examples/s] Generating train split: 8549000 examples [01:40, 85748.87 examples/s] Generating train split: 8559000 examples [01:41, 86540.91 examples/s] Generating train split: 8569000 examples [01:41, 87056.88 examples/s] Generating train split: 8579000 examples [01:41, 87332.29 examples/s] Generating train split: 8589000 examples [01:41, 87432.76 examples/s] Generating train split: 8599000 examples [01:41, 86432.48 examples/s] Generating train split: 8609000 examples [01:41, 87280.91 examples/s] Generating train split: 8619000 examples [01:41, 86644.26 examples/s] Generating train split: 8629000 examples [01:41, 87185.68 examples/s] Generating train split: 8638000 examples [01:41, 84520.50 examples/s] Generating train split: 8648000 examples [01:42, 85409.37 examples/s] Generating train split: 8658000 examples [01:42, 85595.68 examples/s] Generating train split: 8668000 examples [01:42, 85097.46 examples/s] Generating train split: 8678000 examples [01:42, 85361.23 examples/s] Generating train split: 8688000 examples [01:42, 85012.00 examples/s] Generating train 
split: 8698000 examples [01:42, 83743.67 examples/s] Generating train split: 8707000 examples [01:42, 83265.29 examples/s] Generating train split: 8717000 examples [01:42, 84436.58 examples/s] Generating train split: 8727000 examples [01:43, 84240.13 examples/s] Generating train split: 8737000 examples [01:43, 83978.17 examples/s] Generating train split: 8747000 examples [01:43, 84533.20 examples/s] Generating train split: 8757000 examples [01:43, 84468.41 examples/s] Generating train split: 8767000 examples [01:43, 84469.18 examples/s] Generating train split: 8780000 examples [01:43, 82383.82 examples/s] Generating train split: 8790000 examples [01:43, 83692.69 examples/s] Generating train split: 8799000 examples [01:43, 82973.72 examples/s] Generating train split: 8808000 examples [01:43, 82842.59 examples/s] Generating train split: 8817000 examples [01:44, 81056.48 examples/s] Generating train split: 8827000 examples [01:44, 80572.35 examples/s] Generating train split: 8837000 examples [01:44, 82023.67 examples/s] Generating train split: 8846000 examples [01:44, 81330.66 examples/s] Generating train split: 8859000 examples [01:44, 80598.39 examples/s] Generating train split: 8869000 examples [01:44, 81741.78 examples/s] Generating train split: 8882000 examples [01:44, 81323.22 examples/s] Generating train split: 8892000 examples [01:45, 81173.60 examples/s] Generating train split: 8902000 examples [01:45, 82348.13 examples/s] Generating train split: 8912000 examples [01:45, 82942.78 examples/s] Generating train split: 8922000 examples [01:45, 84651.95 examples/s] Generating train split: 8932000 examples [01:45, 84402.74 examples/s] Generating train split: 8942000 examples [01:45, 85472.30 examples/s] Generating train split: 8952000 examples [01:45, 84811.25 examples/s] Generating train split: 8962000 examples [01:45, 85152.77 examples/s] Generating train split: 8972000 examples [01:45, 84981.19 examples/s] Generating train split: 8982000 examples [01:46, 
85855.02 examples/s] Generating train split: 8992000 examples [01:46, 86570.94 examples/s] Generating train split: 9002000 examples [01:46, 86307.01 examples/s] Generating train split: 9011000 examples [01:46, 84966.41 examples/s] Generating train split: 9020000 examples [01:46, 84269.04 examples/s] Generating train split: 9030000 examples [01:46, 84505.58 examples/s] Generating train split: 9039000 examples [01:46, 82872.11 examples/s] Generating train split: 9049000 examples [01:46, 83438.47 examples/s] Generating train split: 9059000 examples [01:46, 83495.16 examples/s] Generating train split: 9069000 examples [01:47, 83959.90 examples/s] Generating train split: 9079000 examples [01:47, 84563.25 examples/s] Generating train split: 9089000 examples [01:47, 84724.83 examples/s] Generating train split: 9099000 examples [01:47, 84586.37 examples/s] Generating train split: 9109000 examples [01:47, 85318.59 examples/s] Generating train split: 9119000 examples [01:47, 84701.39 examples/s] Generating train split: 9129000 examples [01:47, 85648.90 examples/s] Generating train split: 9139000 examples [01:47, 85457.37 examples/s] Generating train split: 9148000 examples [01:48, 85345.16 examples/s] Generating train split: 9158000 examples [01:48, 85606.48 examples/s] Generating train split: 9168000 examples [01:48, 85682.22 examples/s] Generating train split: 9178000 examples [01:48, 83754.03 examples/s] Generating train split: 9187000 examples [01:48, 82795.04 examples/s] Generating train split: 9196000 examples [01:48, 81563.77 examples/s] Generating train split: 9206000 examples [01:48, 83083.20 examples/s] Generating train split: 9219000 examples [01:48, 81983.82 examples/s] Generating train split: 9229000 examples [01:49, 82273.39 examples/s] Generating train split: 9239000 examples [01:49, 83823.60 examples/s] Generating train split: 9249000 examples [01:49, 83793.09 examples/s] Generating train split: 9259000 examples [01:49, 84437.31 examples/s] Generating train 
split: 9268000 examples [01:49, 83536.71 examples/s] Generating train split: 9278000 examples [01:49, 84991.96 examples/s] Generating train split: 9288000 examples [01:49, 85222.54 examples/s] Generating train split: 9298000 examples [01:49, 84037.34 examples/s] Generating train split: 9307000 examples [01:49, 83504.69 examples/s] Generating train split: 9317000 examples [01:50, 84334.16 examples/s] Generating train split: 9327000 examples [01:50, 84521.68 examples/s] Generating train split: 9336000 examples [01:50, 82992.88 examples/s] Generating train split: 9346000 examples [01:50, 81956.01 examples/s] Generating train split: 9356000 examples [01:50, 81890.48 examples/s] Generating train split: 9366000 examples [01:50, 81877.42 examples/s] Generating train split: 9376000 examples [01:50, 81221.27 examples/s] Generating train split: 9386000 examples [01:50, 82266.66 examples/s] Generating train split: 9396000 examples [01:51, 82483.00 examples/s] Generating train split: 9405000 examples [01:51, 80248.18 examples/s] Generating train split: 9414000 examples [01:51, 80129.52 examples/s] Generating train split: 9424000 examples [01:51, 81783.57 examples/s] Generating train split: 9433000 examples [01:51, 81845.64 examples/s] Generating train split: 9446000 examples [01:51, 81337.15 examples/s] Generating train split: 9455000 examples [01:51, 81107.20 examples/s] Generating train split: 9465000 examples [01:51, 81440.80 examples/s] Generating train split: 9478000 examples [01:52, 79515.68 examples/s] Generating train split: 9488000 examples [01:52, 81142.27 examples/s] Generating train split: 9498000 examples [01:52, 82047.68 examples/s] Generating train split: 9508000 examples [01:52, 83192.83 examples/s] Generating train split: 9520000 examples [01:52, 80518.70 examples/s] Generating train split: 9530000 examples [01:52, 81566.09 examples/s] Generating train split: 9539000 examples [01:52, 81395.35 examples/s] Generating train split: 9548000 examples [01:52, 
81683.25 examples/s] Generating train split: 9557000 examples [01:53, 80513.27 examples/s] Generating train split: 9567000 examples [01:53, 82547.38 examples/s] Generating train split: 9577000 examples [01:53, 83813.51 examples/s] Generating train split: 9586000 examples [01:53, 82605.47 examples/s] Generating train split: 9596000 examples [01:53, 82792.42 examples/s] Generating train split: 9606000 examples [01:53, 84031.03 examples/s] Generating train split: 9616000 examples [01:53, 84420.20 examples/s] Generating train split: 9625000 examples [01:53, 83685.63 examples/s] Generating train split: 9635000 examples [01:53, 83479.63 examples/s] Generating train split: 9644000 examples [01:54, 81758.13 examples/s] Generating train split: 9654000 examples [01:54, 83245.66 examples/s] Generating train split: 9664000 examples [01:54, 83993.04 examples/s] Generating train split: 9674000 examples [01:54, 83748.91 examples/s] Generating train split: 9684000 examples [01:54, 83928.95 examples/s] Generating train split: 9693000 examples [01:54, 82882.77 examples/s] Generating train split: 9703000 examples [01:54, 83153.82 examples/s] Generating train split: 9713000 examples [01:54, 83829.87 examples/s] Generating train split: 9723000 examples [01:54, 84482.98 examples/s] Generating train split: 9732000 examples [01:55, 82770.32 examples/s] Generating train split: 9741000 examples [01:55, 80788.27 examples/s] Generating train split: 9751000 examples [01:55, 82538.93 examples/s] Generating train split: 9761000 examples [01:55, 83999.20 examples/s] Generating train split: 9771000 examples [01:55, 84242.68 examples/s] Generating train split: 9780000 examples [01:55, 83793.71 examples/s] Generating train split: 9790000 examples [01:55, 84742.45 examples/s] Generating train split: 9800000 examples [01:55, 84391.86 examples/s] Generating train split: 9809000 examples [01:56, 83600.97 examples/s] Generating train split: 9822000 examples [01:56, 80700.07 examples/s] Generating train 
split: 9832000 examples [01:56, 82039.34 examples/s] Generating train split: 9842000 examples [01:56, 83360.35 examples/s] Generating train split: 9855000 examples [01:56, 81037.96 examples/s] Generating train split: 9865000 examples [01:56, 82711.45 examples/s] Generating train split: 9875000 examples [01:56, 83231.48 examples/s] Generating train split: 9884000 examples [01:56, 82866.50 examples/s] Generating train split: 9894000 examples [01:57, 83636.78 examples/s] Generating train split: 9904000 examples [01:57, 83425.65 examples/s] Generating train split: 9914000 examples [01:57, 82506.78 examples/s] Generating train split: 9924000 examples [01:57, 83886.21 examples/s] Generating train split: 9934000 examples [01:57, 84524.65 examples/s] Generating train split: 9943000 examples [01:57, 83542.22 examples/s] Generating train split: 9952000 examples [01:57, 83044.56 examples/s] Generating train split: 9962000 examples [01:57, 83355.29 examples/s] Generating train split: 9972000 examples [01:57, 81536.05 examples/s] Generating train split: 9982000 examples [01:58, 82965.15 examples/s] Generating train split: 9991000 examples [01:58, 82346.32 examples/s] Generating train split: 10001000 examples [01:58, 83551.74 examples/s] Generating train split: 10011000 examples [01:58, 85099.89 examples/s] Generating train split: 10021000 examples [01:58, 84327.86 examples/s] Generating train split: 10030000 examples [01:58, 82900.48 examples/s] Generating train split: 10039000 examples [01:58, 82423.70 examples/s] Generating train split: 10049000 examples [01:58, 83157.58 examples/s] Generating train split: 10058000 examples [01:59, 82525.48 examples/s] Generating train split: 10068000 examples [01:59, 83164.43 examples/s] Generating train split: 10078000 examples [01:59, 83628.71 examples/s] Generating train split: 10088000 examples [01:59, 84878.61 examples/s] Generating train split: 10098000 examples [01:59, 85684.74 examples/s] Generating train split: 10107000 examples 
[01:59, 85460.86 examples/s] Generating train split: 10116000 examples [01:59, 83057.35 examples/s] Generating train split: 10126000 examples [01:59, 82869.56 examples/s] Generating train split: 10135000 examples [01:59, 82290.44 examples/s] Generating train split: 10144000 examples [02:00, 81814.80 examples/s] Generating train split: 10154000 examples [02:00, 82970.87 examples/s] Generating train split: 10164000 examples [02:00, 83432.22 examples/s] Generating train split: 10174000 examples [02:00, 84332.17 examples/s] Generating train split: 10184000 examples [02:00, 84826.04 examples/s] Generating train split: 10193000 examples [02:00, 83711.19 examples/s] Generating train split: 10203000 examples [02:00, 84372.82 examples/s] Generating train split: 10213000 examples [02:00, 85333.48 examples/s] Generating train split: 10223000 examples [02:00, 85575.03 examples/s] Generating train split: 10233000 examples [02:01, 85589.89 examples/s] Generating train split: 10243000 examples [02:01, 85997.17 examples/s] Generating train split: 10252000 examples [02:01, 84626.41 examples/s] Generating train split: 10262000 examples [02:01, 84410.62 examples/s] Generating train split: 10272000 examples [02:01, 85282.53 examples/s] Generating train split: 10282000 examples [02:01, 85504.87 examples/s] Generating train split: 10292000 examples [02:01, 84601.46 examples/s] Generating train split: 10302000 examples [02:01, 84976.41 examples/s] Generating train split: 10312000 examples [02:02, 85266.66 examples/s] Generating train split: 10321000 examples [02:02, 83367.64 examples/s] Generating train split: 10331000 examples [02:02, 82937.09 examples/s] Generating train split: 10340000 examples [02:02, 82260.95 examples/s] Generating train split: 10350000 examples [02:02, 81922.94 examples/s] Generating train split: 10360000 examples [02:02, 82706.13 examples/s] Generating train split: 10373000 examples [02:02, 81496.83 examples/s] Generating train split: 10382000 examples [02:02, 
80829.21 examples/s] Generating train split: 10392000 examples [02:03, 81348.19 examples/s] Generating train split: 10402000 examples [02:03, 82966.80 examples/s] Generating train split: 10412000 examples [02:03, 82861.68 examples/s] Generating train split: 10421000 examples [02:03, 82908.80 examples/s] Generating train split: 10431000 examples [02:03, 83923.39 examples/s] Generating train split: 10440000 examples [02:03, 83531.37 examples/s] Generating train split: 10453000 examples [02:03, 80100.48 examples/s] Generating train split: 10463000 examples [02:03, 81042.20 examples/s] Generating train split: 10472000 examples [02:03, 81477.90 examples/s] Generating train split: 10482000 examples [02:04, 82294.66 examples/s] Generating train split: 10495000 examples [02:04, 81704.61 examples/s] Generating train split: 10505000 examples [02:04, 83242.79 examples/s] Generating train split: 10515000 examples [02:04, 84012.11 examples/s] Generating train split: 10524000 examples [02:04, 82374.86 examples/s] Generating train split: 10534000 examples [02:04, 83016.61 examples/s] Generating train split: 10543000 examples [02:04, 83089.57 examples/s] Generating train split: 10553000 examples [02:04, 83172.66 examples/s] Generating train split: 10563000 examples [02:05, 83827.40 examples/s] Generating train split: 10573000 examples [02:05, 84862.31 examples/s] Generating train split: 10583000 examples [02:05, 84126.42 examples/s] Generating train split: 10592000 examples [02:05, 83614.57 examples/s] Generating train split: 10602000 examples [02:05, 84787.01 examples/s] Generating train split: 10612000 examples [02:05, 85624.03 examples/s] Generating train split: 10621000 examples [02:05, 85070.63 examples/s] Generating train split: 10631000 examples [02:05, 85828.42 examples/s] Generating train split: 10641000 examples [02:05, 85991.29 examples/s] Generating train split: 10651000 examples [02:06, 85668.23 examples/s] Generating train split: 10661000 examples [02:06, 85979.76 
examples/s] Generating train split: 10671000 examples [02:06, 85461.29 examples/s] Generating train split: 10681000 examples [02:06, 85985.22 examples/s] Generating train split: 10690000 examples [02:06, 84616.84 examples/s] Generating train split: 10700000 examples [02:06, 84443.72 examples/s] Generating train split: 10710000 examples [02:06, 85284.68 examples/s] Generating train split: 10720000 examples [02:06, 85780.11 examples/s] Generating train split: 10730000 examples [02:07, 85181.33 examples/s] Generating train split: 10740000 examples [02:07, 85563.31 examples/s] Generating train split: 10750000 examples [02:07, 85460.12 examples/s] Generating train split: 10760000 examples [02:07, 85192.80 examples/s] Generating train split: 10770000 examples [02:07, 83686.48 examples/s] Generating train split: 10779000 examples [02:07, 83512.69 examples/s] Generating train split: 10789000 examples [02:07, 83773.95 examples/s] Generating train split: 10799000 examples [02:07, 83840.91 examples/s] Generating train split: 10809000 examples [02:07, 83329.52 examples/s] Generating train split: 10819000 examples [02:08, 84314.53 examples/s] Generating train split: 10829000 examples [02:08, 85031.52 examples/s] Generating train split: 10838000 examples [02:08, 84048.86 examples/s] Generating train split: 10848000 examples [02:08, 84446.47 examples/s] Generating train split: 10857000 examples [02:08, 81284.84 examples/s] Generating train split: 10867000 examples [02:08, 81742.07 examples/s] Generating train split: 10877000 examples [02:08, 82639.34 examples/s] Generating train split: 10887000 examples [02:08, 82588.40 examples/s] Generating train split: 10896000 examples [02:09, 82108.38 examples/s] Generating train split: 10906000 examples [02:09, 82004.95 examples/s] Generating train split: 10916000 examples [02:09, 81763.05 examples/s] Generating train split: 10926000 examples [02:09, 82539.92 examples/s] Generating train split: 10935000 examples [02:09, 81549.22 examples/s] 
Generating train split: 10945000 examples [02:09, 83273.28 examples/s] Generating train split: 10955000 examples [02:09, 84291.33 examples/s] Generating train split: 10964000 examples [02:09, 84164.76 examples/s] Generating train split: 10973000 examples [02:09, 82048.25 examples/s] Generating train split: 10983000 examples [02:10, 80972.75 examples/s] Generating train split: 10993000 examples [02:10, 82019.65 examples/s] Generating train split: 11003000 examples [02:10, 83357.16 examples/s] Generating train split: 11016000 examples [02:10, 81905.87 examples/s] Generating train split: 11026000 examples [02:10, 82749.11 examples/s] Generating train split: 11035000 examples [02:10, 81311.80 examples/s] Generating train split: 11045000 examples [02:10, 82241.62 examples/s] Generating train split: 11055000 examples [02:10, 82699.27 examples/s] Generating train split: 11065000 examples [02:11, 83329.34 examples/s] Generating train split: 11075000 examples [02:11, 84757.21 examples/s] Generating train split: 11085000 examples [02:11, 85290.55 examples/s] Generating train split: 11095000 examples [02:11, 85967.50 examples/s] Generating train split: 11105000 examples [02:11, 85789.15 examples/s] Generating train split: 11115000 examples [02:11, 85302.12 examples/s] Generating train split: 11124000 examples [02:11, 82927.94 examples/s] Generating train split: 11136000 examples [02:11, 78679.76 examples/s] Generating train split: 11146000 examples [02:12, 81157.62 examples/s] Generating train split: 11156000 examples [02:12, 82169.49 examples/s] Generating train split: 11165000 examples [02:12, 82092.02 examples/s] Generating train split: 11175000 examples [02:12, 83213.02 examples/s] Generating train split: 11185000 examples [02:12, 84121.57 examples/s] Generating train split: 11195000 examples [02:12, 83790.19 examples/s] Generating train split: 11204000 examples [02:12, 83420.82 examples/s] Generating train split: 11214000 examples [02:12, 83366.59 examples/s] Generating 
train split: 11223000 examples [02:12, 81847.91 examples/s] Generating train split: 11233000 examples [02:13, 83389.75 examples/s] Generating train split: 11243000 examples [02:13, 83511.63 examples/s] Generating train split: 11253000 examples [02:13, 84318.65 examples/s] Generating train split: 11263000 examples [02:13, 84509.72 examples/s] Generating train split: 11273000 examples [02:13, 85165.52 examples/s] Generating train split: 11286000 examples [02:13, 83386.71 examples/s] Generating train split: 11295000 examples [02:13, 81384.92 examples/s] Generating train split: 11305000 examples [02:13, 82354.71 examples/s] Generating train split: 11315000 examples [02:14, 82438.56 examples/s] Generating train split: 11325000 examples [02:14, 82689.75 examples/s] Generating train split: 11334000 examples [02:14, 82990.41 examples/s] Generating train split: 11344000 examples [02:14, 83770.95 examples/s] Generating train split: 11353000 examples [02:14, 82020.49 examples/s] Generating train split: 11363000 examples [02:14, 82322.70 examples/s] Generating train split: 11376000 examples [02:14, 80975.53 examples/s] Generating train split: 11386000 examples [02:14, 81824.99 examples/s] Generating train split: 11396000 examples [02:15, 83589.45 examples/s] Generating train split: 11406000 examples [02:15, 84815.08 examples/s] Generating train split: 11416000 examples [02:15, 84961.45 examples/s] Generating train split: 11426000 examples [02:15, 85484.90 examples/s] Generating train split: 11436000 examples [02:15, 85873.85 examples/s] Generating train split: 11446000 examples [02:15, 86976.20 examples/s] Generating train split: 11456000 examples [02:15, 85930.86 examples/s] Generating train split: 11466000 examples [02:15, 86136.07 examples/s] Generating train split: 11475000 examples [02:15, 85006.19 examples/s] Generating train split: 11488000 examples [02:16, 80845.26 examples/s] Generating train split: 11498000 examples [02:16, 82249.17 examples/s] Generating train 
split: 11508000 examples [02:16, 83249.84 examples/s] Generating train split: 11518000 examples [02:16, 84690.33 examples/s] Generating train split: 11528000 examples [02:16, 84952.45 examples/s] Generating train split: 11538000 examples [02:16, 85087.28 examples/s] Generating train split: 11548000 examples [02:16, 84735.32 examples/s] Generating train split: 11557000 examples [02:16, 83126.97 examples/s] Generating train split: 11567000 examples [02:17, 83235.45 examples/s] Generating train split: 11577000 examples [02:17, 84908.13 examples/s] Generating train split: 11587000 examples [02:17, 85469.41 examples/s] Generating train split: 11597000 examples [02:17, 85606.82 examples/s] Generating train split: 11607000 examples [02:17, 85155.25 examples/s] Generating train split: 11617000 examples [02:17, 85549.20 examples/s] Generating train split: 11626000 examples [02:17, 83881.14 examples/s] Generating train split: 11635000 examples [02:17, 82716.44 examples/s] Generating train split: 11645000 examples [02:17, 83848.65 examples/s] Generating train split: 11655000 examples [02:18, 83639.43 examples/s] Generating train split: 11665000 examples [02:18, 84110.80 examples/s] Generating train split: 11675000 examples [02:18, 84518.29 examples/s] Generating train split: 11684000 examples [02:18, 83774.02 examples/s] Generating train split: 11694000 examples [02:18, 85327.93 examples/s] Generating train split: 11704000 examples [02:18, 84867.42 examples/s] Generating train split: 11714000 examples [02:18, 85219.39 examples/s] Generating train split: 11723000 examples [02:18, 82994.07 examples/s] Generating train split: 11733000 examples [02:19, 83306.40 examples/s] Generating train split: 11742000 examples [02:19, 82770.39 examples/s] Generating train split: 11754000 examples [02:19, 80731.79 examples/s] Generating train split: 11764000 examples [02:19, 82393.61 examples/s] Generating train split: 11774000 examples [02:19, 83313.41 examples/s] Generating train split: 
11784000 examples [02:19, 83578.93 examples/s] Generating train split: 11793000 examples [02:19, 82742.65 examples/s] Generating train split: 11803000 examples [02:19, 81475.28 examples/s] Generating train split: 11813000 examples [02:20, 82853.61 examples/s] Generating train split: 11823000 examples [02:20, 83266.74 examples/s] Generating train split: 11832000 examples [02:20, 82414.34 examples/s] Generating train split: 11842000 examples [02:20, 83892.03 examples/s] Generating train split: 11852000 examples [02:20, 84053.97 examples/s] Generating train split: 11861000 examples [02:20, 83378.28 examples/s] Generating train split: 11870000 examples [02:20, 83392.38 examples/s] Generating train split: 11880000 examples [02:20, 84191.11 examples/s] Generating train split: 11890000 examples [02:20, 84528.80 examples/s] Generating train split: 11900000 examples [02:21, 84595.54 examples/s] Generating train split: 11910000 examples [02:21, 85074.47 examples/s] Generating train split: 11920000 examples [02:21, 85108.08 examples/s] Generating train split: 11930000 examples [02:21, 85280.50 examples/s] Generating train split: 11940000 examples [02:21, 85384.68 examples/s] Generating train split: 11950000 examples [02:21, 86062.80 examples/s] Generating train split: 11960000 examples [02:21, 85828.92 examples/s] Generating train split: 11970000 examples [02:21, 85423.38 examples/s] Generating train split: 11980000 examples [02:21, 86448.92 examples/s] Generating train split: 11990000 examples [02:22, 86324.88 examples/s] Generating train split: 12000000 examples [02:22, 86046.66 examples/s] Generating train split: 12010000 examples [02:22, 84570.13 examples/s] Generating train split: 12020000 examples [02:22, 84934.27 examples/s] Generating train split: 12030000 examples [02:22, 85295.63 examples/s] Generating train split: 12040000 examples [02:22, 85291.95 examples/s] Generating train split: 12053000 examples [02:22, 82947.18 examples/s] Generating train split: 12063000 
examples [02:22, 82989.87 examples/s] Generating train split: 12073000 examples [02:23, 83800.95 examples/s] Generating train split: 12083000 examples [02:23, 83726.85 examples/s] Generating train split: 12093000 examples [02:23, 83251.17 examples/s] Generating train split: 12103000 examples [02:23, 84340.44 examples/s] Generating train split: 12113000 examples [02:23, 85095.76 examples/s] Generating train split: 12123000 examples [02:23, 85764.57 examples/s] Generating train split: 12133000 examples [02:23, 85923.91 examples/s] Generating train split: 12143000 examples [02:23, 85630.98 examples/s] Generating train split: 12153000 examples [02:24, 84528.68 examples/s] Generating train split: 12162000 examples [02:24, 83203.99 examples/s] Generating train split: 12172000 examples [02:24, 83516.43 examples/s] Generating train split: 12182000 examples [02:24, 84295.00 examples/s] Generating train split: 12192000 examples [02:24, 84022.01 examples/s] Generating train split: 12202000 examples [02:24, 84750.61 examples/s] Generating train split: 12212000 examples [02:24, 84472.50 examples/s] Generating train split: 12221000 examples [02:24, 83442.83 examples/s] Generating train split: 12231000 examples [02:24, 83025.99 examples/s] Generating train split: 12241000 examples [02:25, 84562.08 examples/s] Generating train split: 12251000 examples [02:25, 84445.11 examples/s] Generating train split: 12261000 examples [02:25, 84604.28 examples/s] Generating train split: 12271000 examples [02:25, 85402.85 examples/s] Generating train split: 12281000 examples [02:25, 84351.87 examples/s] Generating train split: 12291000 examples [02:25, 84767.99 examples/s] Generating train split: 12301000 examples [02:25, 85252.89 examples/s] Generating train split: 12311000 examples [02:25, 84611.43 examples/s] Generating train split: 12321000 examples [02:26, 85079.47 examples/s] Generating train split: 12330000 examples [02:26, 84447.90 examples/s] Generating train split: 12340000 examples 
[02:26, 84584.78 examples/s] Generating train split: 12350000 examples [02:26, 84750.05 examples/s] Generating train split: 12360000 examples [02:26, 85062.29 examples/s] Generating train split: 12370000 examples [02:26, 85182.85 examples/s] Generating train split: 12380000 examples [02:26, 85885.22 examples/s] Generating train split: 12390000 examples [02:26, 85834.98 examples/s] Generating train split: 12399000 examples [02:26, 83175.92 examples/s] Generating train split: 12408000 examples [02:27, 81548.19 examples/s] Generating train split: 12417000 examples [02:27, 82013.81 examples/s] Generating train split: 12427000 examples [02:27, 82686.58 examples/s] Generating train split: 12437000 examples [02:27, 83180.63 examples/s] Generating train split: 12446000 examples [02:27, 82789.17 examples/s] Generating train split: 12455000 examples [02:27, 82650.43 examples/s] Generating train split: 12465000 examples [02:27, 83694.38 examples/s] Generating train split: 12475000 examples [02:27, 83901.14 examples/s] Generating train split: 12484000 examples [02:27, 82640.88 examples/s] Generating train split: 12493000 examples [02:28, 82090.65 examples/s] Generating train split: 12503000 examples [02:28, 83128.87 examples/s] Generating train split: 12516000 examples [02:28, 80433.84 examples/s] Generating train split: 12525000 examples [02:28, 80653.67 examples/s] Generating train split: 12534000 examples [02:28, 81123.29 examples/s] Generating train split: 12544000 examples [02:28, 82168.35 examples/s] Generating train split: 12554000 examples [02:28, 82330.85 examples/s] Generating train split: 12563000 examples [02:28, 82126.99 examples/s] Generating train split: 12572000 examples [02:29, 81389.96 examples/s] Generating train split: 12581000 examples [02:29, 81317.10 examples/s] Generating train split: 12591000 examples [02:29, 82471.69 examples/s] Generating train split: 12601000 examples [02:29, 83702.08 examples/s] Generating train split: 12611000 examples [02:29, 
82821.94 examples/s] Generating train split: 12621000 examples [02:29, 83985.59 examples/s] Generating train split: 12630000 examples [02:29, 83443.94 examples/s] Generating train split: 12640000 examples [02:29, 84858.16 examples/s] Generating train split: 12650000 examples [02:29, 84843.72 examples/s] Generating train split: 12660000 examples [02:30, 84848.08 examples/s] Generating train split: 12670000 examples [02:30, 84496.94 examples/s] Generating train split: 12680000 examples [02:30, 84553.15 examples/s] Generating train split: 12690000 examples [02:30, 83339.05 examples/s] Generating train split: 12700000 examples [02:30, 84399.59 examples/s] Generating train split: 12710000 examples [02:30, 84796.23 examples/s] Generating train split: 12720000 examples [02:30, 84224.23 examples/s] Generating train split: 12729000 examples [02:30, 83706.05 examples/s] Generating train split: 12739000 examples [02:31, 84455.56 examples/s] Generating train split: 12748000 examples [02:31, 83128.10 examples/s] Generating train split: 12757000 examples [02:31, 82476.81 examples/s] Generating train split: 12766000 examples [02:31, 82235.51 examples/s] Generating train split: 12776000 examples [02:31, 82826.31 examples/s] Generating train split: 12786000 examples [02:31, 83562.89 examples/s] Generating train split: 12795000 examples [02:31, 81764.97 examples/s] Generating train split: 12805000 examples [02:31, 82966.09 examples/s] Generating train split: 12815000 examples [02:31, 82988.02 examples/s] Generating train split: 12824000 examples [02:32, 80955.24 examples/s] Generating train split: 12834000 examples [02:32, 82421.36 examples/s] Generating train split: 12843000 examples [02:32, 82563.72 examples/s] Generating train split: 12852000 examples [02:32, 81012.96 examples/s] Generating train split: 12862000 examples [02:32, 82645.68 examples/s] Generating train split: 12872000 examples [02:32, 83354.84 examples/s] Generating train split: 12882000 examples [02:32, 83052.63 
examples/s] Generating train split: 12892000 examples [02:32, 83421.46 examples/s] Generating train split: 12902000 examples [02:33, 83344.18 examples/s] Generating train split: 12912000 examples [02:33, 85037.08 examples/s] Generating train split: 12922000 examples [02:33, 85377.63 examples/s] Generating train split: 12932000 examples [02:33, 85686.78 examples/s] Generating train split: 12942000 examples [02:33, 86053.94 examples/s] Generating train split: 12952000 examples [02:33, 85991.79 examples/s] Generating train split: 12962000 examples [02:33, 85573.39 examples/s] Generating train split: 12972000 examples [02:33, 85318.53 examples/s] Generating train split: 12981000 examples [02:33, 83154.55 examples/s] Generating train split: 12991000 examples [02:34, 83606.24 examples/s] Generating train split: 13000000 examples [02:34, 82503.76 examples/s] Generating train split: 13010000 examples [02:34, 83374.41 examples/s] Generating train split: 13020000 examples [02:34, 84083.75 examples/s] Generating train split: 13029000 examples [02:34, 82949.26 examples/s] Generating train split: 13042000 examples [02:34, 82298.38 examples/s] Generating train split: 13051000 examples [02:34, 82302.23 examples/s] Generating train split: 13061000 examples [02:34, 83076.11 examples/s] Generating train split: 13070000 examples [02:35, 82403.46 examples/s] Generating train split: 13080000 examples [02:35, 82634.23 examples/s] Generating train split: 13090000 examples [02:35, 83466.78 examples/s] Generating train split: 13100000 examples [02:35, 83580.55 examples/s] Generating train split: 13109000 examples [02:35, 83143.50 examples/s] Generating train split: 13119000 examples [02:35, 83651.90 examples/s] Generating train split: 13129000 examples [02:35, 83795.91 examples/s] Generating train split: 13139000 examples [02:35, 84020.81 examples/s] Generating train split: 13149000 examples [02:35, 84494.87 examples/s] Generating train split: 13159000 examples [02:36, 83987.80 examples/s] 
Generating train split: 13168000 examples [02:36, 83332.46 examples/s] Generating train split: 13177000 examples [02:36, 82945.10 examples/s] Generating train split: 13186000 examples [02:36, 83400.17 examples/s] Generating train split: 13196000 examples [02:36, 83810.08 examples/s] Generating train split: 13206000 examples [02:36, 84941.41 examples/s] Generating train split: 13215000 examples [02:36, 84440.35 examples/s] Generating train split: 13225000 examples [02:36, 85409.26 examples/s] Generating train split: 13235000 examples [02:36, 85655.82 examples/s] Generating train split: 13245000 examples [02:37, 85809.75 examples/s] Generating train split: 13255000 examples [02:37, 85612.98 examples/s] Generating train split: 13265000 examples [02:37, 84917.42 examples/s] Generating train split: 13275000 examples [02:37, 83567.08 examples/s] Generating train split: 13284000 examples [02:37, 82294.37 examples/s] Generating train split: 13294000 examples [02:37, 83074.75 examples/s] Generating train split: 13304000 examples [02:37, 84023.16 examples/s] Generating train split: 13314000 examples [02:37, 84746.68 examples/s] Generating train split: 13324000 examples [02:38, 84652.80 examples/s] Generating train split: 13333000 examples [02:38, 82591.14 examples/s] Generating train split: 13343000 examples [02:38, 82636.33 examples/s] Generating train split: 13352000 examples [02:38, 81792.02 examples/s] Generating train split: 13362000 examples [02:38, 81782.67 examples/s] Generating train split: 13372000 examples [02:38, 81950.32 examples/s] Generating train split: 13381000 examples [02:38, 81159.52 examples/s] Generating train split: 13391000 examples [02:38, 81858.93 examples/s] Generating train split: 13400000 examples [02:38, 82246.02 examples/s] Generating train split: 13410000 examples [02:39, 83881.81 examples/s] Generating train split: 13420000 examples [02:39, 84663.97 examples/s] Generating train split: 13433000 examples [02:39, 81848.53 examples/s] Generating 
train split: 13443000 examples [02:39, 82920.77 examples/s] Generating train split: 13452000 examples [02:39, 81316.60 examples/s] Generating train split: 13462000 examples [02:39, 81772.52 examples/s] Generating train split: 13472000 examples [02:39, 82994.01 examples/s] Generating train split: 13482000 examples [02:39, 82617.51 examples/s] Generating train split: 13492000 examples [02:40, 82999.91 examples/s] Generating train split: 13501000 examples [02:40, 83016.28 examples/s] Generating train split: 13511000 examples [02:40, 83528.23 examples/s] Generating train split: 13521000 examples [02:40, 83448.17 examples/s] Generating train split: 13531000 examples [02:40, 84678.61 examples/s] Generating train split: 13540000 examples [02:40, 84020.41 examples/s] Generating train split: 13553000 examples [02:40, 80603.92 examples/s] Generating train split: 13563000 examples [02:40, 82457.97 examples/s] Generating train split: 13573000 examples [02:41, 83877.03 examples/s] Generating train split: 13582000 examples [02:41, 82851.47 examples/s] Generating train split: 13595000 examples [02:41, 80409.59 examples/s] Generating train split: 13605000 examples [02:41, 80872.99 examples/s] Generating train split: 13614000 examples [02:41, 80716.33 examples/s] Generating train split: 13624000 examples [02:41, 82056.50 examples/s] Generating train split: 13633000 examples [02:41, 81568.49 examples/s] Generating train split: 13643000 examples [02:41, 82565.40 examples/s] Generating train split: 13653000 examples [02:42, 83063.66 examples/s] Generating train split: 13663000 examples [02:42, 84287.45 examples/s] Generating train split: 13673000 examples [02:42, 84011.52 examples/s] Generating train split: 13683000 examples [02:42, 84736.69 examples/s] Generating train split: 13692000 examples [02:42, 83914.36 examples/s] Generating train split: 13702000 examples [02:42, 82527.34 examples/s] Generating train split: 13712000 examples [02:42, 83273.94 examples/s] Generating train 
split: 14873731 examples [02:56, 84345.73 examples/s]
Shard 0: 0%| | 0/100000000 [00:00 bytes: 248951808
allocated 237 MiB for model parameters
batch_size B=64 * seq_len T=1024 * num_processes=8 and total_batch_size=524288
=> setting grad_accum_steps=1
created directory: log124M
allocating 237 MiB for parameter gradients
allocating 21216 MiB for activations
allocating 59 MiB for AdamW optimizer state m
allocating 59 MiB for AdamW optimizer state v
allocating 59 MiB for master copy of params
--- WARNING: Failed to open the tokenizer file gpt2_tokenizer.bin The Tokenizer is a new feature added April 14 2024.
Re-run `python train_gpt2.py` to write it --- device memory usage: 23129 MiB / 40326 MiB memory per sequence: 331 MiB -> estimated maximum batch size: 115 val loss 11.009204 step 1/19560 | loss 11.009109 (+nanz)| norm 15.3717 (+nanz)| lr 8.57e-07 | 694.02 ms | 24.3% bf16 MFU | 755432 tok/s step 2/19560 | loss 10.959894 (+nanz)| norm 15.2375 (+nanz)| lr 1.71e-06 | 314.23 ms | 53.7% bf16 MFU | 1668471 tok/s step 3/19560 | loss 10.855607 (+nanz)| norm 14.8389 (+nanz)| lr 2.57e-06 | 314.03 ms | 53.7% bf16 MFU | 1669030 tok/s step 4/19560 | loss 10.715733 (+nanz)| norm 13.1195 (+nanz)| lr 3.43e-06 | 315.18 ms | 53.5% bf16 MFU | 1667084 tok/s step 5/19560 | loss 10.568422 (+nanz)| norm 10.4406 (+nanz)| lr 4.29e-06 | 314.67 ms | 53.6% bf16 MFU | 1666839 tok/s step 6/19560 | loss 10.423462 (+nanz)| norm 8.4943 (+nanz)| lr 5.14e-06 | 314.80 ms | 53.6% bf16 MFU | 1666536 tok/s step 7/19560 | loss 10.291794 (+nanz)| norm 7.2414 (+nanz)| lr 6.00e-06 | 314.67 ms | 53.6% bf16 MFU | 1666467 tok/s step 8/19560 | loss 10.187210 (+nanz)| norm 6.1835 (+nanz)| lr 6.86e-06 | 313.99 ms | 53.8% bf16 MFU | 1667013 tok/s step 9/19560 | loss 10.096747 (+nanz)| norm 5.2947 (+nanz)| lr 7.71e-06 | 314.87 ms | 53.6% bf16 MFU | 1666731 tok/s step 10/19560 | loss 9.976242 (+nanz)| norm 4.6636 (+nanz)| lr 8.57e-06 | 313.82 ms | 53.8% bf16 MFU | 1667259 tok/s step 11/19560 | loss 9.935363 (+nanz)| norm 3.8501 (+nanz)| lr 9.43e-06 | 314.47 ms | 53.7% bf16 MFU | 1667250 tok/s step 12/19560 | loss 9.853886 (+nanz)| norm 3.3981 (+nanz)| lr 1.03e-05 | 314.46 ms | 53.7% bf16 MFU | 1667249 tok/s step 13/19560 | loss 9.792469 (+nanz)| norm 3.0057 (+nanz)| lr 1.11e-05 | 314.79 ms | 53.6% bf16 MFU | 1667061 tok/s step 14/19560 | loss 9.758893 (+nanz)| norm 2.6958 (+nanz)| lr 1.20e-05 | 314.43 ms | 53.7% bf16 MFU | 1667098 tok/s step 15/19560 | loss 9.702515 (+nanz)| norm 2.5212 (+nanz)| lr 1.29e-05 | 314.33 ms | 53.7% bf16 MFU | 1667180 tok/s step 16/19560 | loss 9.701741 (+nanz)| norm 2.3030 (+nanz)| lr 
1.37e-05 | 314.86 ms | 53.6% bf16 MFU | 1666991 tok/s step 17/19560 | loss 9.655608 (+nanz)| norm 2.2412 (+nanz)| lr 1.46e-05 | 314.08 ms | 53.7% bf16 MFU | 1667196 tok/s step 18/19560 | loss 9.636610 (+nanz)| norm 2.1886 (+nanz)| lr 1.54e-05 | 314.75 ms | 53.6% bf16 MFU | 1667068 tok/s step 19/19560 | loss 9.603570 (+nanz)| norm 2.2160 (+nanz)| lr 1.63e-05 | 314.42 ms | 53.7% bf16 MFU | 1667100 tok/s step 20/19560 | loss 9.596474 (+nanz)| norm 2.1545 (+nanz)| lr 1.71e-05 | 314.56 ms | 53.7% bf16 MFU | 1667070 tok/s step 21/19560 | loss 9.562174 (+nanz)| norm 2.1811 (+nanz)| lr 1.80e-05 | 314.04 ms | 53.7% bf16 MFU | 1667257 tok/s step 22/19560 | loss 9.537735 (+nanz)| norm 2.1489 (+nanz)| lr 1.89e-05 | 315.56 ms | 53.5% bf16 MFU | 1666816 tok/s step 23/19560 | loss 9.533236 (+nanz)| norm 2.0896 (+nanz)| lr 1.97e-05 | 315.00 ms | 53.6% bf16 MFU | 1666637 tok/s step 24/19560 | loss 9.476654 (+nanz)| norm 2.1037 (+nanz)| lr 2.06e-05 | 314.74 ms | 53.6% bf16 MFU | 1666576 tok/s step 25/19560 | loss 9.458330 (+nanz)| norm 2.0526 (+nanz)| lr 2.14e-05 | 315.12 ms | 53.6% bf16 MFU | 1666377 tok/s step 26/19560 | loss 9.422844 (+nanz)| norm 2.0944 (+nanz)| lr 2.23e-05 | 315.20 ms | 53.5% bf16 MFU | 1666166 tok/s step 27/19560 | loss 9.377932 (+nanz)| norm 2.1533 (+nanz)| lr 2.31e-05 | 315.00 ms | 53.6% bf16 MFU | 1666048 tok/s step 28/19560 | loss 9.388620 (+nanz)| norm 1.9650 (+nanz)| lr 2.40e-05 | 315.06 ms | 53.6% bf16 MFU | 1665918 tok/s step 29/19560 | loss 9.357556 (+nanz)| norm 2.4479 (+nanz)| lr 2.49e-05 | 316.05 ms | 53.4% bf16 MFU | 1665455 tok/s step 30/19560 | loss 9.306395 (+nanz)| norm 1.9511 (+nanz)| lr 2.57e-05 | 314.97 ms | 53.6% bf16 MFU | 1665397 tok/s step 31/19560 | loss 9.244958 (+nanz)| norm 2.2945 (+nanz)| lr 2.66e-05 | 315.26 ms | 53.5% bf16 MFU | 1665248 tok/s step 32/19560 | loss 9.211145 (+nanz)| norm 2.2606 (+nanz)| lr 2.74e-05 | 315.26 ms | 53.5% bf16 MFU | 1665108 tok/s step 33/19560 | loss 9.235586 (+nanz)| norm 2.5847 (+nanz)| lr 2.83e-05 | 
316.22 ms | 53.4% bf16 MFU | 1664665 tok/s step 34/19560 | loss 9.159502 (+nanz)| norm 2.4417 (+nanz)| lr 2.91e-05 | 315.45 ms | 53.5% bf16 MFU | 1664503 tok/s step 35/19560 | loss 9.200704 (+nanz)| norm 1.9541 (+nanz)| lr 3.00e-05 | 315.33 ms | 53.5% bf16 MFU | 1664392 tok/s step 36/19560 | loss 9.080119 (+nanz)| norm 2.4101 (+nanz)| lr 3.09e-05 | 316.02 ms | 53.4% bf16 MFU | 1664069 tok/s step 37/19560 | loss 9.029286 (+nanz)| norm 1.8689 (+nanz)| lr 3.17e-05 | 315.40 ms | 53.5% bf16 MFU | 1663964 tok/s step 38/19560 | loss 9.039859 (+nanz)| norm 1.9748 (+nanz)| lr 3.26e-05 | 315.42 ms | 53.5% bf16 MFU | 1663861 tok/s step 39/19560 | loss 8.977991 (+nanz)| norm 1.8097 (+nanz)| lr 3.34e-05 | 317.05 ms | 53.2% bf16 MFU | 1663265 tok/s step 40/19560 | loss 8.961273 (+nanz)| norm 2.0400 (+nanz)| lr 3.43e-05 | 315.88 ms | 53.4% bf16 MFU | 1663062 tok/s step 41/19560 | loss 8.906675 (+nanz)| norm 1.8695 (+nanz)| lr 3.51e-05 | 316.29 ms | 53.4% bf16 MFU | 1662749 tok/s step 42/19560 | loss 8.860172 (+nanz)| norm 1.6710 (+nanz)| lr 3.60e-05 | 316.54 ms | 53.3% bf16 MFU | 1662382 tok/s step 43/19560 | loss 8.839678 (+nanz)| norm 1.7310 (+nanz)| lr 3.69e-05 | 315.59 ms | 53.5% bf16 MFU | 1662320 tok/s step 44/19560 | loss 8.856478 (+nanz)| norm 1.6672 (+nanz)| lr 3.77e-05 | 316.37 ms | 53.3% bf16 MFU | 1662033 tok/s step 45/19560 | loss 8.748159 (+nanz)| norm 1.6705 (+nanz)| lr 3.86e-05 | 316.51 ms | 53.3% bf16 MFU | 1661722 tok/s step 46/19560 | loss 8.692660 (+nanz)| norm 1.6797 (+nanz)| lr 3.94e-05 | 316.05 ms | 53.4% bf16 MFU | 1661564 tok/s step 47/19560 | loss 8.712541 (+nanz)| norm 1.5855 (+nanz)| lr 4.03e-05 | 316.11 ms | 53.4% bf16 MFU | 1661398 tok/s step 48/19560 | loss 8.646988 (+nanz)| norm 1.5953 (+nanz)| lr 4.11e-05 | 316.75 ms | 53.3% bf16 MFU | 1661059 tok/s step 49/19560 | loss 8.607601 (+nanz)| norm 1.5653 (+nanz)| lr 4.20e-05 | 316.04 ms | 53.4% bf16 MFU | 1660943 tok/s step 50/19560 | loss 8.586852 (+nanz)| norm 1.5617 (+nanz)| lr 4.29e-05 | 316.96 ms 
| 53.2% bf16 MFU | 1660572 tok/s step 51/19560 | loss 8.536360 (+nanz)| norm 1.6164 (+nanz)| lr 4.37e-05 | 316.65 ms | 53.3% bf16 MFU | 1660310 tok/s step 52/19560 | loss 8.493007 (+nanz)| norm 1.7058 (+nanz)| lr 4.46e-05 | 316.76 ms | 53.3% bf16 MFU | 1660032 tok/s step 53/19560 | loss 8.489213 (+nanz)| norm 1.7822 (+nanz)| lr 4.54e-05 | 316.78 ms | 53.3% bf16 MFU | 1659766 tok/s step 54/19560 | loss 8.421474 (+nanz)| norm 1.5619 (+nanz)| lr 4.63e-05 | 317.03 ms | 53.2% bf16 MFU | 1659443 tok/s step 55/19560 | loss 8.446768 (+nanz)| norm 1.4457 (+nanz)| lr 4.71e-05 | 316.30 ms | 53.4% bf16 MFU | 1659343 tok/s step 56/19560 | loss 8.378421 (+nanz)| norm 1.6670 (+nanz)| lr 4.80e-05 | 317.06 ms | 53.2% bf16 MFU | 1659038 tok/s step 57/19560 | loss 8.326710 (+nanz)| norm 1.6516 (+nanz)| lr 4.89e-05 | 316.82 ms | 53.3% bf16 MFU | 1658817 tok/s step 58/19560 | loss 8.262535 (+nanz)| norm 1.6611 (+nanz)| lr 4.97e-05 | 316.83 ms | 53.3% bf16 MFU | 1658603 tok/s step 59/19560 | loss 8.225919 (+nanz)| norm 1.7677 (+nanz)| lr 5.06e-05 | 316.05 ms | 53.4% bf16 MFU | 1658617 tok/s step 60/19560 | loss 8.181098 (+nanz)| norm 1.4496 (+nanz)| lr 5.14e-05 | 317.47 ms | 53.2% bf16 MFU | 1658242 tok/s step 61/19560 | loss 8.197227 (+nanz)| norm 1.7004 (+nanz)| lr 5.23e-05 | 317.43 ms | 53.2% bf16 MFU | 1657897 tok/s step 62/19560 | loss 8.186023 (+nanz)| norm 1.3861 (+nanz)| lr 5.31e-05 | 317.40 ms | 53.2% bf16 MFU | 1657579 tok/s step 63/19560 | loss 8.078353 (+nanz)| norm 1.3741 (+nanz)| lr 5.40e-05 | 317.43 ms | 53.2% bf16 MFU | 1657270 tok/s step 64/19560 | loss 8.078118 (+nanz)| norm 1.4731 (+nanz)| lr 5.49e-05 | 317.50 ms | 53.2% bf16 MFU | 1656958 tok/s step 65/19560 | loss 8.054794 (+nanz)| norm 1.4474 (+nanz)| lr 5.57e-05 | 317.05 ms | 53.2% bf16 MFU | 1656788 tok/s step 66/19560 | loss 7.981112 (+nanz)| norm 1.6258 (+nanz)| lr 5.66e-05 | 316.69 ms | 53.3% bf16 MFU | 1656722 tok/s step 67/19560 | loss 7.928320 (+nanz)| norm 1.3561 (+nanz)| lr 5.74e-05 | 317.43 ms | 53.2% 
bf16 MFU | 1656461 tok/s step 68/19560 | loss 7.929897 (+nanz)| norm 1.3077 (+nanz)| lr 5.83e-05 | 317.57 ms | 53.1% bf16 MFU | 1656176 tok/s step 69/19560 | loss 7.937150 (+nanz)| norm 1.2623 (+nanz)| lr 5.91e-05 | 317.29 ms | 53.2% bf16 MFU | 1655981 tok/s step 70/19560 | loss 7.888319 (+nanz)| norm 1.2202 (+nanz)| lr 6.00e-05 | 317.32 ms | 53.2% bf16 MFU | 1655788 tok/s step 71/19560 | loss 7.836940 (+nanz)| norm 1.3426 (+nanz)| lr 6.09e-05 | 317.24 ms | 53.2% bf16 MFU | 1655628 tok/s step 72/19560 | loss 7.792568 (+nanz)| norm 1.1690 (+nanz)| lr 6.17e-05 | 316.99 ms | 53.2% bf16 MFU | 1655542 tok/s step 73/19560 | loss 7.773895 (+nanz)| norm 1.1610 (+nanz)| lr 6.26e-05 | 317.74 ms | 53.1% bf16 MFU | 1655260 tok/s step 74/19560 | loss 7.792548 (+nanz)| norm 1.4956 (+nanz)| lr 6.34e-05 | 317.85 ms | 53.1% bf16 MFU | 1654965 tok/s step 75/19560 | loss 7.717053 (+nanz)| norm 1.3791 (+nanz)| lr 6.43e-05 | 318.58 ms | 53.0% bf16 MFU | 1654491 tok/s step 76/19560 | loss 7.568538 (+nanz)| norm 1.0961 (+nanz)| lr 6.51e-05 | 318.38 ms | 53.0% bf16 MFU | 1654095 tok/s step 77/19560 | loss 7.679242 (+nanz)| norm 1.7162 (+nanz)| lr 6.60e-05 | 317.44 ms | 53.2% bf16 MFU | 1653968 tok/s step 78/19560 | loss 7.632957 (+nanz)| norm 0.9554 (+nanz)| lr 6.69e-05 | 318.68 ms | 53.0% bf16 MFU | 1653519 tok/s step 79/19560 | loss 7.592442 (+nanz)| norm 1.2417 (+nanz)| lr 6.77e-05 | 318.08 ms | 53.1% bf16 MFU | 1653253 tok/s step 80/19560 | loss 7.615833 (+nanz)| norm 0.9376 (+nanz)| lr 6.86e-05 | 318.47 ms | 53.0% bf16 MFU | 1652898 tok/s step 81/19560 | loss 7.423759 (+nanz)| norm 1.4176 (+nanz)| lr 6.94e-05 | 318.70 ms | 53.0% bf16 MFU | 1652501 tok/s step 82/19560 | loss 7.515634 (+nanz)| norm 1.5691 (+nanz)| lr 7.03e-05 | 318.52 ms | 53.0% bf16 MFU | 1652172 tok/s step 83/19560 | loss 7.442426 (+nanz)| norm 0.9844 (+nanz)| lr 7.11e-05 | 318.12 ms | 53.1% bf16 MFU | 1651964 tok/s step 84/19560 | loss 7.463370 (+nanz)| norm 1.0315 (+nanz)| lr 7.20e-05 | 318.07 ms | 53.1% bf16 MFU | 
1651779 tok/s step 85/19560 | loss 7.399691 (+nanz)| norm 1.3543 (+nanz)| lr 7.29e-05 | 319.53 ms | 52.8% bf16 MFU | 1651224 tok/s step 86/19560 | loss 7.403323 (+nanz)| norm 0.9910 (+nanz)| lr 7.37e-05 | 318.51 ms | 53.0% bf16 MFU | 1650963 tok/s step 87/19560 | loss 7.372291 (+nanz)| norm 0.9624 (+nanz)| lr 7.46e-05 | 318.49 ms | 53.0% bf16 MFU | 1650720 tok/s step 88/19560 | loss 7.304594 (+nanz)| norm 1.0400 (+nanz)| lr 7.54e-05 | 318.36 ms | 53.0% bf16 MFU | 1650523 tok/s step 89/19560 | loss 7.340849 (+nanz)| norm 1.0088 (+nanz)| lr 7.63e-05 | 318.75 ms | 52.9% bf16 MFU | 1650235 tok/s step 90/19560 | loss 7.323218 (+nanz)| norm 0.9783 (+nanz)| lr 7.71e-05 | 318.66 ms | 53.0% bf16 MFU | 1649985 tok/s step 91/19560 | loss 7.380167 (+nanz)| norm 0.8986 (+nanz)| lr 7.80e-05 | 318.62 ms | 53.0% bf16 MFU | 1649759 tok/s step 92/19560 | loss 7.239905 (+nanz)| norm 1.3216 (+nanz)| lr 7.89e-05 | 319.01 ms | 52.9% bf16 MFU | 1649442 tok/s step 93/19560 | loss 7.279544 (+nanz)| norm 1.1597 (+nanz)| lr 7.97e-05 | 318.16 ms | 53.0% bf16 MFU | 1649364 tok/s step 94/19560 | loss 7.172609 (+nanz)| norm 0.6085 (+nanz)| lr 8.06e-05 | 318.64 ms | 53.0% bf16 MFU | 1649165 tok/s step 95/19560 | loss 7.207014 (+nanz)| norm 1.1218 (+nanz)| lr 8.14e-05 | 319.32 ms | 52.9% bf16 MFU | 1648799 tok/s step 96/19560 | loss 7.126600 (+nanz)| norm 1.1004 (+nanz)| lr 8.23e-05 | 319.01 ms | 52.9% bf16 MFU | 1648530 tok/s step 97/19560 | loss 7.140619 (+nanz)| norm 0.7407 (+nanz)| lr 8.31e-05 | 319.12 ms | 52.9% bf16 MFU | 1648247 tok/s step 98/19560 | loss 7.246284 (+nanz)| norm 0.8143 (+nanz)| lr 8.40e-05 | 319.28 ms | 52.9% bf16 MFU | 1647939 tok/s step 99/19560 | loss 7.150189 (+nanz)| norm 0.9513 (+nanz)| lr 8.49e-05 | 320.06 ms | 52.7% bf16 MFU | 1647443 tok/s step 100/19560 | loss 7.145782 (+nanz)| norm 0.7225 (+nanz)| lr 8.57e-05 | 319.18 ms | 52.9% bf16 MFU | 1647199 tok/s step 101/19560 | loss 7.090214 (+nanz)| norm 0.9221 (+nanz)| lr 8.66e-05 | 319.96 ms | 52.7% bf16 MFU | 1646766 
tok/s step 102/19560 | loss 7.049090 (+nanz)| norm 0.8677 (+nanz)| lr 8.74e-05 | 319.36 ms | 52.8% bf16 MFU | 1646510 tok/s step 103/19560 | loss 7.089582 (+nanz)| norm 0.9838 (+nanz)| lr 8.83e-05 | 319.40 ms | 52.8% bf16 MFU | 1646258 tok/s step 104/19560 | loss 7.151857 (+nanz)| norm 1.0523 (+nanz)| lr 8.91e-05 | 319.61 ms | 52.8% bf16 MFU | 1645963 tok/s step 105/19560 | loss 7.035015 (+nanz)| norm 0.7944 (+nanz)| lr 9.00e-05 | 319.10 ms | 52.9% bf16 MFU | 1645816 tok/s step 106/19560 | loss 7.067171 (+nanz)| norm 0.8416 (+nanz)| lr 9.09e-05 | 320.25 ms | 52.7% bf16 MFU | 1645378 tok/s step 107/19560 | loss 6.894728 (+nanz)| norm 0.9262 (+nanz)| lr 9.17e-05 | 319.67 ms | 52.8% bf16 MFU | 1645112 tok/s step 108/19560 | loss 7.045878 (+nanz)| norm 0.8536 (+nanz)| lr 9.26e-05 | 319.66 ms | 52.8% bf16 MFU | 1644861 tok/s step 109/19560 | loss 6.967838 (+nanz)| norm 0.8406 (+nanz)| lr 9.34e-05 | 319.72 ms | 52.8% bf16 MFU | 1644609 tok/s step 110/19560 | loss 7.057088 (+nanz)| norm 0.9238 (+nanz)| lr 9.43e-05 | 319.93 ms | 52.8% bf16 MFU | 1644315 tok/s step 111/19560 | loss 6.920825 (+nanz)| norm 1.0072 (+nanz)| lr 9.51e-05 | 320.17 ms | 52.7% bf16 MFU | 1643976 tok/s step 112/19560 | loss 6.909135 (+nanz)| norm 1.0471 (+nanz)| lr 9.60e-05 | 320.07 ms | 52.7% bf16 MFU | 1643677 tok/s step 113/19560 | loss 7.005341 (+nanz)| norm 0.7607 (+nanz)| lr 9.69e-05 | 320.24 ms | 52.7% bf16 MFU | 1643351 tok/s step 114/19560 | loss 6.882371 (+nanz)| norm 1.3039 (+nanz)| lr 9.77e-05 | 322.02 ms | 52.4% bf16 MFU | 1642588 tok/s step 115/19560 | loss 6.914182 (+nanz)| norm 1.3692 (+nanz)| lr 9.86e-05 | 319.71 ms | 52.8% bf16 MFU | 1642453 tok/s step 116/19560 | loss 6.932863 (+nanz)| norm 0.8334 (+nanz)| lr 9.94e-05 | 322.12 ms | 52.4% bf16 MFU | 1641710 tok/s step 117/19560 | loss 6.998663 (+nanz)| norm 1.9052 (+nanz)| lr 1.00e-04 | 321.63 ms | 52.5% bf16 MFU | 1641128 tok/s step 118/19560 | loss 6.875816 (+nanz)| norm 0.8842 (+nanz)| lr 1.01e-04 | 320.07 ms | 52.7% bf16 MFU | 
1640972 tok/s step 119/19560 | loss 6.886818 (+nanz)| norm 1.1607 (+nanz)| lr 1.02e-04 | 319.87 ms | 52.8% bf16 MFU | 1640876 tok/s step 120/19560 | loss 6.917952 (+nanz)| norm 1.0263 (+nanz)| lr 1.03e-04 | 320.55 ms | 52.7% bf16 MFU | 1640612 tok/s step 121/19560 | loss 6.934100 (+nanz)| norm 1.0135 (+nanz)| lr 1.04e-04 | 320.85 ms | 52.6% bf16 MFU | 1640284 tok/s step 122/19560 | loss 6.870656 (+nanz)| norm 1.0886 (+nanz)| lr 1.05e-04 | 320.78 ms | 52.6% bf16 MFU | 1639989 tok/s step 123/19560 | loss 6.941378 (+nanz)| norm 0.9241 (+nanz)| lr 1.05e-04 | 320.65 ms | 52.6% bf16 MFU | 1639743 tok/s step 124/19560 | loss 6.821346 (+nanz)| norm 0.9488 (+nanz)| lr 1.06e-04 | 320.34 ms | 52.7% bf16 MFU | 1639588 tok/s step 125/19560 | loss 6.788586 (+nanz)| norm 1.2955 (+nanz)| lr 1.07e-04 | 320.52 ms | 52.7% bf16 MFU | 1639396 tok/s step 126/19560 | loss 6.886657 (+nanz)| norm 0.6732 (+nanz)| lr 1.08e-04 | 320.43 ms | 52.7% bf16 MFU | 1639236 tok/s step 127/19560 | loss 6.848351 (+nanz)| norm 0.7708 (+nanz)| lr 1.09e-04 | 320.73 ms | 52.6% bf16 MFU | 1639008 tok/s step 128/19560 | loss 6.757625 (+nanz)| norm 0.9592 (+nanz)| lr 1.10e-04 | 322.94 ms | 52.3% bf16 MFU | 1638231 tok/s step 129/19560 | loss 6.738553 (-1.32z)| norm 1.1891 (-0.37z)| lr 1.11e-04 | 320.51 ms | 52.7% bf16 MFU | 1638110 tok/s step 130/19560 | loss 6.805323 (-1.25z)| norm 0.7411 (-0.59z)| lr 1.11e-04 | 320.87 ms | 52.6% bf16 MFU | 1637903 tok/s step 131/19560 | loss 6.733213 (-1.31z)| norm 0.7565 (-0.64z)| lr 1.12e-04 | 320.75 ms | 52.6% bf16 MFU | 1637735 tok/s step 132/19560 | loss 6.692756 (-1.34z)| norm 0.7235 (-0.74z)| lr 1.13e-04 | 320.80 ms | 52.6% bf16 MFU | 1637565 tok/s step 133/19560 | loss 6.714043 (-1.31z)| norm 0.6848 (-0.86z)| lr 1.14e-04 | 320.76 ms | 52.6% bf16 MFU | 1637412 tok/s step 134/19560 | loss 6.744080 (-1.27z)| norm 0.7027 (-0.93z)| lr 1.15e-04 | 320.60 ms | 52.6% bf16 MFU | 1637308 tok/s step 135/19560 | loss 6.686044 (-1.31z)| norm 0.8869 (-0.80z)| lr 1.16e-04 | 321.07 
ms | 52.6% bf16 MFU | 1637090 tok/s step 136/19560 | loss 6.698227 (-1.29z)| norm 1.3588 (-0.22z)| lr 1.17e-04 | 321.55 ms | 52.5% bf16 MFU | 1636759 tok/s step 137/19560 | loss 6.701592 (-1.27z)| norm 0.7831 (-1.07z)| lr 1.17e-04 | 320.66 ms | 52.6% bf16 MFU | 1636673 tok/s step 138/19560 | loss 6.695870 (-1.26z)| norm 0.8640 (-0.99z)| lr 1.18e-04 | 320.81 ms | 52.6% bf16 MFU | 1636552 tok/s step 139/19560 | loss 6.701926 (-1.24z)| norm 1.0766 (-0.64z)| lr 1.19e-04 | 321.63 ms | 52.5% bf16 MFU | 1636228 tok/s step 140/19560 | loss 6.657983 (-1.27z)| norm 1.2112 (-0.39z)| lr 1.20e-04 | 320.67 ms | 52.6% bf16 MFU | 1636166 tok/s step 141/19560 | loss 6.638037 (-1.28z)| norm 0.6936 (-1.36z)| lr 1.21e-04 | 321.14 ms | 52.6% bf16 MFU | 1635986 tok/s step 142/19560 | loss 6.660120 (-1.24z)| norm 1.1571 (-0.46z)| lr 1.22e-04 | 321.25 ms | 52.5% bf16 MFU | 1635788 tok/s step 143/19560 | loss 6.659147 (-1.23z)| norm 0.9675 (-0.82z)| lr 1.23e-04 | 321.54 ms | 52.5% bf16 MFU | 1635526 tok/s step 144/19560 | loss 6.660672 (-1.21z)| norm 0.8242 (-1.10z)| lr 1.23e-04 | 322.29 ms | 52.4% bf16 MFU | 1635087 tok/s step 145/19560 | loss 6.631635 (-1.23z)| norm 1.0779 (-0.57z)| lr 1.24e-04 | 321.48 ms | 52.5% bf16 MFU | 1634876 tok/s step 146/19560 | loss 6.685111 (-1.16z)| norm 1.2781 (-0.14z)| lr 1.25e-04 | 321.03 ms | 52.6% bf16 MFU | 1634789 tok/s step 147/19560 | loss 6.644342 (-1.19z)| norm 0.9396 (-0.84z)| lr 1.26e-04 | 321.46 ms | 52.5% bf16 MFU | 1634598 tok/s step 148/19560 | loss 6.640083 (-1.18z)| norm 0.8838 (-0.94z)| lr 1.27e-04 | 321.11 ms | 52.6% bf16 MFU | 1634504 tok/s step 149/19560 | loss 6.611271 (-1.20z)| norm 0.5423 (-1.64z)| lr 1.28e-04 | 321.16 ms | 52.6% bf16 MFU | 1634402 tok/s step 150/19560 | loss 6.676477 (-1.12z)| norm 0.9108 (-0.85z)| lr 1.29e-04 | 321.63 ms | 52.5% bf16 MFU | 1634188 tok/s step 151/19560 | loss 6.588881 (-1.20z)| norm 0.9762 (-0.69z)| lr 1.29e-04 | 322.06 ms | 52.4% bf16 MFU | 1633875 tok/s step 152/19560 | loss 6.512462 (-1.28z)| 
norm 0.6910 (-1.30z)| lr 1.30e-04 | 321.87 ms | 52.4% bf16 MFU | 1633624 tok/s step 153/19560 | loss 6.654555 (-1.10z)| norm 0.7826 (-1.08z)| lr 1.31e-04 | 320.97 ms | 52.6% bf16 MFU | 1633615 tok/s step 154/19560 | loss 6.591102 (-1.16z)| norm 0.9325 (-0.74z)| lr 1.32e-04 | 321.26 ms | 52.5% bf16 MFU | 1633534 tok/s step 155/19560 | loss 6.576769 (-1.17z)| norm 1.0483 (-0.47z)| lr 1.33e-04 | 321.67 ms | 52.5% bf16 MFU | 1633352 tok/s step 156/19560 | loss 6.631476 (-1.09z)| norm 1.1559 (-0.22z)| lr 1.34e-04 | 322.17 ms | 52.4% bf16 MFU | 1633052 tok/s step 157/19560 | loss 6.553662 (-1.18z)| norm 0.8778 (-0.84z)| lr 1.35e-04 | 320.93 ms | 52.6% bf16 MFU | 1633082 tok/s step 158/19560 | loss 6.565734 (-1.15z)| norm 1.0544 (-0.42z)| lr 1.35e-04 | 321.59 ms | 52.5% bf16 MFU | 1632942 tok/s step 159/19560 | loss 6.564290 (-1.14z)| norm 1.2492 (+0.06z)| lr 1.36e-04 | 322.38 ms | 52.4% bf16 MFU | 1632609 tok/s step 160/19560 | loss 6.537234 (-1.17z)| norm 0.8371 (-0.93z)| lr 1.37e-04 | 321.88 ms | 52.4% bf16 MFU | 1632421 tok/s step 161/19560 | loss 6.581266 (-1.10z)| norm 0.8292 (-0.96z)| lr 1.38e-04 | 321.20 ms | 52.5% bf16 MFU | 1632413 tok/s step 162/19560 | loss 6.515292 (-1.18z)| norm 0.9193 (-0.72z)| lr 1.39e-04 | 323.29 ms | 52.2% bf16 MFU | 1631880 tok/s step 163/19560 | loss 6.477805 (-1.22z)| norm 0.8065 (-1.01z)| lr 1.40e-04 | 322.12 ms | 52.4% bf16 MFU | 1631666 tok/s step 164/19560 | loss 6.463096 (-1.23z)| norm 0.7481 (-1.19z)| lr 1.41e-04 | 322.09 ms | 52.4% bf16 MFU | 1631471 tok/s step 165/19560 | loss 6.489305 (-1.18z)| norm 0.8020 (-1.02z)| lr 1.41e-04 | 321.79 ms | 52.4% bf16 MFU | 1631362 tok/s step 166/19560 | loss 6.481836 (-1.18z)| norm 1.1173 (-0.10z)| lr 1.42e-04 | 322.28 ms | 52.4% bf16 MFU | 1631136 tok/s step 167/19560 | loss 6.484381 (-1.17z)| norm 0.8161 (-0.97z)| lr 1.43e-04 | 322.11 ms | 52.4% bf16 MFU | 1630962 tok/s step 168/19560 | loss 6.440076 (-1.22z)| norm 0.7304 (-1.22z)| lr 1.44e-04 | 321.60 ms | 52.5% bf16 MFU | 1630926 tok/s 
step 169/19560 | loss 6.499251 (-1.13z)| norm 0.6188 (-1.55z)| lr 1.45e-04 | 322.89 ms | 52.3% bf16 MFU | 1630567 tok/s step 170/19560 | loss 6.445863 (-1.20z)| norm 0.5297 (-1.79z)| lr 1.46e-04 | 321.74 ms | 52.5% bf16 MFU | 1630515 tok/s step 171/19560 | loss 6.423327 (-1.22z)| norm 0.7079 (-1.23z)| lr 1.47e-04 | 321.59 ms | 52.5% bf16 MFU | 1630504 tok/s step 172/19560 | loss 6.527288 (-1.06z)| norm 0.8409 (-0.81z)| lr 1.47e-04 | 321.88 ms | 52.4% bf16 MFU | 1630420 tok/s step 173/19560 | loss 6.439231 (-1.19z)| norm 1.2091 (+0.36z)| lr 1.48e-04 | 321.71 ms | 52.5% bf16 MFU | 1630382 tok/s step 174/19560 | loss 6.411086 (-1.22z)| norm 1.2758 (+0.59z)| lr 1.49e-04 | 321.60 ms | 52.5% bf16 MFU | 1630375 tok/s step 175/19560 | loss 6.422555 (-1.20z)| norm 0.8457 (-0.78z)| lr 1.50e-04 | 321.57 ms | 52.5% bf16 MFU | 1630377 tok/s step 176/19560 | loss 6.430792 (-1.18z)| norm 0.9278 (-0.50z)| lr 1.51e-04 | 321.56 ms | 52.5% bf16 MFU | 1630381 tok/s step 177/19560 | loss 6.482769 (-1.08z)| norm 1.0105 (-0.22z)| lr 1.52e-04 | 322.20 ms | 52.4% bf16 MFU | 1630222 tok/s step 178/19560 | loss 6.405682 (-1.20z)| norm 1.2120 (+0.46z)| lr 1.53e-04 | 321.21 ms | 52.5% bf16 MFU | 1630323 tok/s step 179/19560 | loss 6.443990 (-1.13z)| norm 0.9910 (-0.26z)| lr 1.53e-04 | 321.80 ms | 52.4% bf16 MFU | 1630269 tok/s step 180/19560 | loss 6.446496 (-1.12z)| norm 1.2475 (+0.62z)| lr 1.54e-04 | 321.94 ms | 52.4% bf16 MFU | 1630183 tok/s step 181/19560 | loss 6.426447 (-1.15z)| norm 1.1282 (+0.24z)| lr 1.55e-04 | 321.83 ms | 52.4% bf16 MFU | 1630129 tok/s step 182/19560 | loss 6.540252 (-0.94z)| norm 1.2299 (+0.61z)| lr 1.56e-04 | 321.45 ms | 52.5% bf16 MFU | 1630173 tok/s step 183/19560 | loss 6.466063 (-1.07z)| norm 1.2860 (+0.82z)| lr 1.57e-04 | 322.16 ms | 52.4% bf16 MFU | 1630036 tok/s step 184/19560 | loss 6.410929 (-1.17z)| norm 1.2490 (+0.71z)| lr 1.58e-04 | 321.83 ms | 52.4% bf16 MFU | 1629988 tok/s step 185/19560 | loss 6.398338 (-1.18z)| norm 0.9078 (-0.52z)| lr 1.59e-04 | 
322.53 ms | 52.3% bf16 MFU | 1629765 tok/s step 186/19560 | loss 6.421873 (-1.13z)| norm 1.0038 (-0.15z)| lr 1.59e-04 | 321.51 ms | 52.5% bf16 MFU | 1629811 tok/s step 187/19560 | loss 6.434301 (-1.10z)| norm 1.0095 (-0.11z)| lr 1.60e-04 | 321.78 ms | 52.4% bf16 MFU | 1629788 tok/s step 188/19560 | loss 6.347479 (-1.27z)| norm 0.9490 (-0.33z)| lr 1.61e-04 | 322.35 ms | 52.4% bf16 MFU | 1629620 tok/s step 189/19560 | loss 6.346868 (-1.26z)| norm 0.9825 (-0.18z)| lr 1.62e-04 | 322.48 ms | 52.3% bf16 MFU | 1629430 tok/s step 190/19560 | loss 6.367538 (-1.22z)| norm 0.9460 (-0.32z)| lr 1.63e-04 | 322.03 ms | 52.4% bf16 MFU | 1629361 tok/s step 191/19560 | loss 6.332886 (-1.28z)| norm 0.9006 (-0.49z)| lr 1.64e-04 | 321.77 ms | 52.5% bf16 MFU | 1629362 tok/s step 192/19560 | loss 6.339782 (-1.26z)| norm 0.8391 (-0.73z)| lr 1.65e-04 | 322.33 ms | 52.4% bf16 MFU | 1629220 tok/s step 193/19560 | loss 6.369072 (-1.19z)| norm 0.8594 (-0.63z)| lr 1.65e-04 | 322.57 ms | 52.3% bf16 MFU | 1629027 tok/s step 194/19560 | loss 6.322015 (-1.29z)| norm 0.9760 (-0.13z)| lr 1.66e-04 | 322.70 ms | 52.3% bf16 MFU | 1628810 tok/s step 195/19560 | loss 6.264630 (-1.42z)| norm 0.9491 (-0.23z)| lr 1.67e-04 | 322.20 ms | 52.4% bf16 MFU | 1628730 tok/s step 196/19560 | loss 6.391530 (-1.11z)| norm 0.6910 (-1.33z)| lr 1.68e-04 | 321.82 ms | 52.4% bf16 MFU | 1628752 tok/s step 197/19560 | loss 6.303662 (-1.32z)| norm 0.5944 (-1.71z)| lr 1.69e-04 | 321.90 ms | 52.4% bf16 MFU | 1628751 tok/s step 198/19560 | loss 6.346094 (-1.21z)| norm 0.9871 (-0.01z)| lr 1.70e-04 | 322.37 ms | 52.4% bf16 MFU | 1628631 tok/s step 199/19560 | loss 6.332290 (-1.24z)| norm 1.0545 (+0.29z)| lr 1.71e-04 | 321.87 ms | 52.4% bf16 MFU | 1628645 tok/s step 200/19560 | loss 6.309676 (-1.29z)| norm 0.9253 (-0.27z)| lr 1.71e-04 | 321.49 ms | 52.5% bf16 MFU | 1628753 tok/s step 201/19560 | loss 6.360720 (-1.15z)| norm 0.6984 (-1.24z)| lr 1.72e-04 | 322.18 ms | 52.4% bf16 MFU | 1628681 tok/s step 202/19560 | loss 6.334215 
(-1.22z)| norm 0.8400 (-0.61z)| lr 1.73e-04 | 321.22 ms | 52.5% bf16 MFU | 1628855 tok/s step 203/19560 | loss 6.384963 (-1.08z)| norm 0.9009 (-0.33z)| lr 1.74e-04 | 321.74 ms | 52.5% bf16 MFU | 1628889 tok/s step 204/19560 | loss 6.334068 (-1.21z)| norm 0.6918 (-1.24z)| lr 1.75e-04 | 322.00 ms | 52.4% bf16 MFU | 1628857 tok/s step 205/19560 | loss 6.385949 (-1.06z)| norm 0.7777 (-0.86z)| lr 1.76e-04 | 321.77 ms | 52.5% bf16 MFU | 1628884 tok/s step 206/19560 | loss 6.327003 (-1.22z)| norm 0.7041 (-1.19z)| lr 1.77e-04 | 321.50 ms | 52.5% bf16 MFU | 1628977 tok/s step 207/19560 | loss 6.258727 (-1.41z)| norm 0.9408 (-0.09z)| lr 1.77e-04 | 321.65 ms | 52.5% bf16 MFU | 1629029 tok/s step 208/19560 | loss 6.339325 (-1.17z)| norm 1.0069 (+0.22z)| lr 1.78e-04 | 322.30 ms | 52.4% bf16 MFU | 1628912 tok/s step 209/19560 | loss 6.319749 (-1.22z)| norm 0.9581 (+0.01z)| lr 1.79e-04 | 322.47 ms | 52.3% bf16 MFU | 1628758 tok/s step 210/19560 | loss 6.312520 (-1.23z)| norm 0.8478 (-0.50z)| lr 1.80e-04 | 321.33 ms | 52.5% bf16 MFU | 1628901 tok/s step 211/19560 | loss 6.282964 (-1.31z)| norm 0.8311 (-0.58z)| lr 1.81e-04 | 322.39 ms | 52.4% bf16 MFU | 1628768 tok/s step 212/19560 | loss 6.352109 (-1.09z)| norm 0.7342 (-1.04z)| lr 1.82e-04 | 321.41 ms | 52.5% bf16 MFU | 1628892 tok/s step 213/19560 | loss 6.293660 (-1.27z)| norm 1.0657 (+0.60z)| lr 1.83e-04 | 322.23 ms | 52.4% bf16 MFU | 1628800 tok/s step 214/19560 | loss 6.373910 (-1.00z)| norm 1.2671 (+1.57z)| lr 1.83e-04 | 323.61 ms | 52.2% bf16 MFU | 1628366 tok/s step 215/19560 | loss 6.274603 (-1.32z)| norm 0.9181 (-0.14z)| lr 1.84e-04 | 322.12 ms | 52.4% bf16 MFU | 1628329 tok/s step 216/19560 | loss 6.326112 (-1.13z)| norm 0.9230 (-0.11z)| lr 1.85e-04 | 322.02 ms | 52.4% bf16 MFU | 1628318 tok/s step 217/19560 | loss 6.305054 (-1.20z)| norm 0.9743 (+0.14z)| lr 1.86e-04 | 321.81 ms | 52.4% bf16 MFU | 1628362 tok/s step 218/19560 | loss 6.303480 (-1.19z)| norm 1.0917 (+0.71z)| lr 1.87e-04 | 322.63 ms | 52.3% bf16 MFU | 
1628196 tok/s step 219/19560 | loss 6.276670 (-1.28z)| norm 0.8728 (-0.36z)| lr 1.88e-04 | 321.76 ms | 52.5% bf16 MFU | 1628259 tok/s step 220/19560 | loss 6.187314 (-1.59z)| norm 0.8676 (-0.37z)| lr 1.89e-04 | 323.23 ms | 52.2% bf16 MFU | 1627948 tok/s step 221/19560 | loss 6.289708 (-1.21z)| norm 0.7918 (-0.74z)| lr 1.89e-04 | 322.76 ms | 52.3% bf16 MFU | 1627771 tok/s step 222/19560 | loss 6.290948 (-1.19z)| norm 0.8122 (-0.65z)| lr 1.90e-04 | 321.64 ms | 52.5% bf16 MFU | 1627885 tok/s step 223/19560 | loss 6.231311 (-1.41z)| norm 0.6466 (-1.45z)| lr 1.91e-04 | 323.03 ms | 52.2% bf16 MFU | 1627641 tok/s step 224/19560 | loss 6.299545 (-1.13z)| norm 0.6262 (-1.53z)| lr 1.92e-04 | 321.72 ms | 52.5% bf16 MFU | 1627740 tok/s step 225/19560 | loss 6.268791 (-1.24z)| norm 0.6307 (-1.49z)| lr 1.93e-04 | 322.17 ms | 52.4% bf16 MFU | 1627722 tok/s step 226/19560 | loss 6.170896 (-1.61z)| norm 0.7372 (-0.96z)| lr 1.94e-04 | 321.82 ms | 52.4% bf16 MFU | 1627792 tok/s step 227/19560 | loss 6.202180 (-1.48z)| norm 0.8456 (-0.42z)| lr 1.95e-04 | 321.92 ms | 52.4% bf16 MFU | 1627834 tok/s step 228/19560 | loss 6.255916 (-1.25z)| norm 0.9567 (+0.12z)| lr 1.95e-04 | 321.96 ms | 52.4% bf16 MFU | 1627863 tok/s step 229/19560 | loss 6.189298 (-1.51z)| norm 0.7889 (-0.71z)| lr 1.96e-04 | 321.91 ms | 52.4% bf16 MFU | 1627904 tok/s step 230/19560 | loss 6.202960 (-1.44z)| norm 0.7400 (-0.94z)| lr 1.97e-04 | 322.27 ms | 52.4% bf16 MFU | 1627853 tok/s step 231/19560 | loss 6.205101 (-1.41z)| norm 0.8333 (-0.48z)| lr 1.98e-04 | 321.67 ms | 52.5% bf16 MFU | 1627955 tok/s step 232/19560 | loss 6.190864 (-1.47z)| norm 0.7611 (-0.82z)| lr 1.99e-04 | 322.41 ms | 52.3% bf16 MFU | 1627866 tok/s step 233/19560 | loss 6.116053 (-1.76z)| norm 0.7484 (-0.88z)| lr 2.00e-04 | 322.03 ms | 52.4% bf16 MFU | 1627877 tok/s step 234/19560 | loss 6.208245 (-1.36z)| norm 0.7113 (-1.05z)| lr 2.01e-04 | 321.31 ms | 52.5% bf16 MFU | 1628070 tok/s step 235/19560 | loss 6.189609 (-1.42z)| norm 0.5289 (-1.90z)| lr 
2.01e-04 | 323.50 ms | 52.2% bf16 MFU | 1627700 tok/s step 236/19560 | loss 6.183286 (-1.44z)| norm 0.5708 (-1.67z)| lr 2.02e-04 | 321.70 ms | 52.5% bf16 MFU | 1627802 tok/s step 237/19560 | loss 6.233494 (-1.20z)| norm 0.6833 (-1.12z)| lr 2.03e-04 | 322.52 ms | 52.3% bf16 MFU | 1627690 tok/s step 238/19560 | loss 6.152956 (-1.55z)| norm 1.1028 (+0.86z)| lr 2.04e-04 | 322.12 ms | 52.4% bf16 MFU | 1627687 tok/s step 239/19560 | loss 6.141744 (-1.58z)| norm 1.2986 (+1.76z)| lr 2.05e-04 | 322.86 ms | 52.3% bf16 MFU | 1627498 tok/s step 240/19560 | loss 6.122013 (-1.64z)| norm 0.9029 (-0.09z)| lr 2.06e-04 | 321.89 ms | 52.4% bf16 MFU | 1627562 tok/s step 241/19560 | loss 6.203560 (-1.27z)| norm 0.9829 (+0.28z)| lr 2.07e-04 | 322.17 ms | 52.4% bf16 MFU | 1627551 tok/s step 242/19560 | loss 6.089408 (-1.76z)| norm 0.9447 (+0.11z)| lr 2.07e-04 | 322.21 ms | 52.4% bf16 MFU | 1627531 tok/s step 243/19560 | loss 6.097892 (-1.70z)| norm 1.3246 (+1.93z)| lr 2.08e-04 | 322.17 ms | 52.4% bf16 MFU | 1627523 tok/s step 244/19560 | loss 6.157339 (-1.42z)| norm 0.7777 (-0.68z)| lr 2.09e-04 | 322.96 ms | 52.3% bf16 MFU | 1627316 tok/s step 245/19560 | loss 6.222884 (-1.11z)| norm 1.1242 (+1.09z)| lr 2.10e-04 | 321.89 ms | 52.4% bf16 MFU | 1627388 tok/s step 246/19560 | loss 6.168555 (-1.35z)| norm 1.0572 (+0.74z)| lr 2.11e-04 | 322.56 ms | 52.3% bf16 MFU | 1627289 tok/s step 247/19560 | loss 6.219739 (-1.09z)| norm 1.3531 (+2.24z)| lr 2.12e-04 | 323.00 ms | 52.3% bf16 MFU | 1627083 tok/s step 248/19560 | loss 6.193687 (-1.21z)| norm 0.8183 (-0.50z)| lr 2.13e-04 | 321.80 ms | 52.4% bf16 MFU | 1627191 tok/s step 249/19560 | loss 6.098156 (-1.67z)| norm 0.7166 (-1.01z)| lr 2.13e-04 | 322.11 ms | 52.4% bf16 MFU | 1627215 tok/s step 250/19560 | loss 6.191140 (-1.20z)| norm 0.7039 (-1.06z)| lr 2.14e-04 | 322.71 ms | 52.3% bf16 MFU | 1627086 tok/s val loss 6.161226 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating 
HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating HellaSwag: 70/79 HellaSwag: 2474/10042 = 0.246365 step 251/19560 | loss 6.170714 (-1.30z)| norm 0.7247 (-0.94z)| lr 2.15e-04 | 321.51 ms | 52.5% bf16 MFU | 1627265 tok/s step 252/19560 | loss 6.139436 (-1.44z)| norm 0.7882 (-0.61z)| lr 2.16e-04 | 321.50 ms | 52.5% bf16 MFU | 1627440 tok/s step 253/19560 | loss 6.186738 (-1.18z)| norm 1.0084 (+0.53z)| lr 2.17e-04 | 321.97 ms | 52.4% bf16 MFU | 1627488 tok/s step 254/19560 | loss 6.098423 (-1.64z)| norm 1.5318 (+3.09z)| lr 2.18e-04 | 322.31 ms | 52.4% bf16 MFU | 1627446 tok/s step 255/19560 | loss 6.163700 (-1.28z)| norm 0.8270 (-0.43z)| lr 2.19e-04 | 322.50 ms | 52.3% bf16 MFU | 1627360 tok/s step 256/19560 | loss 6.178758 (-1.19z)| norm 0.8859 (-0.13z)| lr 2.19e-04 | 322.01 ms | 52.4% bf16 MFU | 1627400 tok/s step 257/19560 | loss 6.131074 (-1.43z)| norm 0.9603 (+0.25z)| lr 2.20e-04 | 322.54 ms | 52.3% bf16 MFU | 1627304 tok/s step 258/19560 | loss 6.132967 (-1.41z)| norm 0.7425 (-0.84z)| lr 2.21e-04 | 322.39 ms | 52.4% bf16 MFU | 1627252 tok/s step 259/19560 | loss 6.120650 (-1.46z)| norm 0.6974 (-1.07z)| lr 2.22e-04 | 323.25 ms | 52.2% bf16 MFU | 1626984 tok/s step 260/19560 | loss 6.065009 (-1.74z)| norm 0.7579 (-0.76z)| lr 2.23e-04 | 322.82 ms | 52.3% bf16 MFU | 1626838 tok/s step 261/19560 | loss 6.168475 (-1.15z)| norm 0.8033 (-0.54z)| lr 2.24e-04 | 322.58 ms | 52.3% bf16 MFU | 1626760 tok/s step 262/19560 | loss 6.122702 (-1.39z)| norm 0.8633 (-0.25z)| lr 2.25e-04 | 321.76 ms | 52.5% bf16 MFU | 1626895 tok/s step 263/19560 | loss 6.077872 (-1.63z)| norm 1.0526 (+0.71z)| lr 2.25e-04 | 322.72 ms | 52.3% bf16 MFU | 1626780 tok/s step 264/19560 | loss 6.072188 (-1.64z)| norm 0.9287 (+0.10z)| lr 2.26e-04 | 322.54 ms | 52.3% bf16 MFU | 1626716 tok/s step 265/19560 | loss 6.078524 (-1.58z)| norm 0.7649 (-0.75z)| lr 2.27e-04 | 322.69 ms | 52.3% bf16 MFU | 1626618 tok/s step 266/19560 | loss 6.099412 (-1.45z)| norm 0.6816 (-1.17z)| lr 
2.28e-04 | 322.21 ms | 52.4% bf16 MFU | 1626646 tok/s step 267/19560 | loss 6.132026 (-1.24z)| norm 0.5822 (-1.64z)| lr 2.29e-04 | 321.75 ms | 52.5% bf16 MFU | 1626788 tok/s step 268/19560 | loss 6.046096 (-1.72z)| norm 0.7515 (-0.77z)| lr 2.30e-04 | 323.18 ms | 52.2% bf16 MFU | 1626563 tok/s step 269/19560 | loss 6.147350 (-1.11z)| norm 0.8301 (-0.37z)| lr 2.31e-04 | 322.66 ms | 52.3% bf16 MFU | 1626480 tok/s step 270/19560 | loss 6.132587 (-1.19z)| norm 0.9729 (+0.38z)| lr 2.31e-04 | 322.43 ms | 52.3% bf16 MFU | 1626458 tok/s step 271/19560 | loss 6.052064 (-1.65z)| norm 0.7361 (-0.84z)| lr 2.32e-04 | 323.07 ms | 52.2% bf16 MFU | 1626277 tok/s step 272/19560 | loss 6.098472 (-1.36z)| norm 0.5770 (-1.64z)| lr 2.33e-04 | 322.25 ms | 52.4% bf16 MFU | 1626310 tok/s step 273/19560 | loss 6.043133 (-1.67z)| norm 0.6163 (-1.41z)| lr 2.34e-04 | 321.81 ms | 52.4% bf16 MFU | 1626453 tok/s step 274/19560 | loss 6.140728 (-1.07z)| norm 0.7127 (-0.91z)| lr 2.35e-04 | 322.79 ms | 52.3% bf16 MFU | 1626342 tok/s step 275/19560 | loss 6.042943 (-1.65z)| norm 0.8290 (-0.30z)| lr 2.36e-04 | 322.80 ms | 52.3% bf16 MFU | 1626234 tok/s step 276/19560 | loss 6.087063 (-1.36z)| norm 0.7367 (-0.77z)| lr 2.37e-04 | 322.05 ms | 52.4% bf16 MFU | 1626322 tok/s step 277/19560 | loss 6.012643 (-1.80z)| norm 0.9222 (+0.17z)| lr 2.37e-04 | 322.79 ms | 52.3% bf16 MFU | 1626218 tok/s step 278/19560 | loss 6.073081 (-1.42z)| norm 0.7074 (-0.94z)| lr 2.38e-04 | 321.93 ms | 52.4% bf16 MFU | 1626338 tok/s step 279/19560 | loss 6.085942 (-1.32z)| norm 0.6929 (-1.00z)| lr 2.39e-04 | 322.47 ms | 52.3% bf16 MFU | 1626313 tok/s step 280/19560 | loss 6.014268 (-1.75z)| norm 0.8564 (-0.16z)| lr 2.40e-04 | 323.03 ms | 52.2% bf16 MFU | 1626150 tok/s step 281/19560 | loss 6.064736 (-1.41z)| norm 1.0737 (+0.96z)| lr 2.41e-04 | 322.45 ms | 52.3% bf16 MFU | 1626139 tok/s step 282/19560 | loss 6.163796 (-0.76z)| norm 1.3107 (+2.13z)| lr 2.42e-04 | 322.57 ms | 52.3% bf16 MFU | 1626100 tok/s step 283/19560 | loss 
6.092479 (-1.22z)| norm 0.7165 (-0.88z)| lr 2.43e-04 | 321.97 ms | 52.4% bf16 MFU | 1626213 tok/s step 284/19560 | loss 6.059585 (-1.43z)| norm 0.7752 (-0.57z)| lr 2.43e-04 | 323.46 ms | 52.2% bf16 MFU | 1625946 tok/s step 285/19560 | loss 6.051013 (-1.47z)| norm 0.8881 (+0.01z)| lr 2.44e-04 | 322.11 ms | 52.4% bf16 MFU | 1626033 tok/s step 286/19560 | loss 6.031407 (-1.58z)| norm 0.9707 (+0.43z)| lr 2.45e-04 | 322.47 ms | 52.3% bf16 MFU | 1626025 tok/s step 287/19560 | loss 6.037878 (-1.52z)| norm 0.7730 (-0.57z)| lr 2.46e-04 | 322.26 ms | 52.4% bf16 MFU | 1626068 tok/s step 288/19560 | loss 6.010933 (-1.68z)| norm 0.7018 (-0.93z)| lr 2.47e-04 | 322.21 ms | 52.4% bf16 MFU | 1626122 tok/s step 289/19560 | loss 6.050262 (-1.40z)| norm 0.7083 (-0.89z)| lr 2.48e-04 | 322.31 ms | 52.4% bf16 MFU | 1626149 tok/s step 290/19560 | loss 6.014580 (-1.63z)| norm 0.6835 (-1.01z)| lr 2.49e-04 | 322.67 ms | 52.3% bf16 MFU | 1626084 tok/s step 291/19560 | loss 5.997790 (-1.72z)| norm 0.7134 (-0.85z)| lr 2.49e-04 | 322.36 ms | 52.4% bf16 MFU | 1626099 tok/s step 292/19560 | loss 6.056323 (-1.29z)| norm 0.7873 (-0.47z)| lr 2.50e-04 | 323.06 ms | 52.2% bf16 MFU | 1625937 tok/s step 293/19560 | loss 5.936280 (-2.09z)| norm 0.9420 (+0.32z)| lr 2.51e-04 | 321.84 ms | 52.4% bf16 MFU | 1626092 tok/s step 294/19560 | loss 6.031710 (-1.41z)| norm 1.1393 (+1.33z)| lr 2.52e-04 | 322.05 ms | 52.4% bf16 MFU | 1626185 tok/s step 295/19560 | loss 6.076549 (-1.08z)| norm 0.7521 (-0.65z)| lr 2.53e-04 | 322.85 ms | 52.3% bf16 MFU | 1626072 tok/s step 296/19560 | loss 5.974367 (-1.77z)| norm 0.8062 (-0.38z)| lr 2.54e-04 | 321.97 ms | 52.4% bf16 MFU | 1626187 tok/s step 297/19560 | loss 6.017417 (-1.45z)| norm 0.7973 (-0.43z)| lr 2.55e-04 | 322.59 ms | 52.3% bf16 MFU | 1626141 tok/s step 298/19560 | loss 5.929180 (-2.03z)| norm 0.8796 (-0.02z)| lr 2.55e-04 | 322.22 ms | 52.4% bf16 MFU | 1626189 tok/s step 299/19560 | loss 6.032049 (-1.29z)| norm 1.0782 (+1.00z)| lr 2.56e-04 | 323.28 ms | 52.2% bf16 
MFU | 1625969 tok/s step 300/19560 | loss 6.004977 (-1.47z)| norm 1.1486 (+1.35z)| lr 2.57e-04 | 322.81 ms | 52.3% bf16 MFU | 1625877 tok/s step 301/19560 | loss 6.012208 (-1.40z)| norm 1.1050 (+1.13z)| lr 2.58e-04 | 321.97 ms | 52.4% bf16 MFU | 1626002 tok/s step 302/19560 | loss 6.007414 (-1.41z)| norm 0.9424 (+0.30z)| lr 2.59e-04 | 322.55 ms | 52.3% bf16 MFU | 1625975 tok/s step 303/19560 | loss 5.927119 (-1.95z)| norm 0.7586 (-0.67z)| lr 2.60e-04 | 322.98 ms | 52.3% bf16 MFU | 1625841 tok/s step 304/19560 | loss 5.979395 (-1.55z)| norm 0.7463 (-0.73z)| lr 2.61e-04 | 322.26 ms | 52.4% bf16 MFU | 1625895 tok/s step 305/19560 | loss 6.026130 (-1.20z)| norm 0.8494 (-0.17z)| lr 2.61e-04 | 323.28 ms | 52.2% bf16 MFU | 1625688 tok/s step 306/19560 | loss 5.999972 (-1.37z)| norm 0.9095 (+0.16z)| lr 2.62e-04 | 322.94 ms | 52.3% bf16 MFU | 1625579 tok/s step 307/19560 | loss 5.983938 (-1.47z)| norm 0.8459 (-0.18z)| lr 2.63e-04 | 322.08 ms | 52.4% bf16 MFU | 1625692 tok/s step 308/19560 | loss 5.999636 (-1.34z)| norm 0.8304 (-0.25z)| lr 2.64e-04 | 322.08 ms | 52.4% bf16 MFU | 1625798 tok/s step 309/19560 | loss 5.988765 (-1.40z)| norm 1.0922 (+1.19z)| lr 2.65e-04 | 323.04 ms | 52.2% bf16 MFU | 1625657 tok/s step 310/19560 | loss 6.181597 (+0.05z)| norm 1.0049 (+0.73z)| lr 2.66e-04 | 322.07 ms | 52.4% bf16 MFU | 1625768 tok/s step 311/19560 | loss 5.978450 (-1.49z)| norm 1.1352 (+1.48z)| lr 2.67e-04 | 322.48 ms | 52.3% bf16 MFU | 1625770 tok/s step 312/19560 | loss 6.019770 (-1.16z)| norm 1.3545 (+2.67z)| lr 2.67e-04 | 322.09 ms | 52.4% bf16 MFU | 1625870 tok/s step 313/19560 | loss 6.015505 (-1.17z)| norm 0.9328 (+0.33z)| lr 2.68e-04 | 323.16 ms | 52.2% bf16 MFU | 1625696 tok/s step 314/19560 | loss 6.024765 (-1.09z)| norm 0.9984 (+0.69z)| lr 2.69e-04 | 322.34 ms | 52.4% bf16 MFU | 1625736 tok/s step 315/19560 | loss 6.004093 (-1.24z)| norm 1.1876 (+1.72z)| lr 2.70e-04 | 322.18 ms | 52.4% bf16 MFU | 1625815 tok/s step 316/19560 | loss 5.937592 (-1.74z)| norm 0.9990 
(+0.68z)| lr 2.71e-04 | 323.30 ms | 52.2% bf16 MFU | 1625610 tok/s step 317/19560 | loss 5.930367 (-1.77z)| norm 1.0054 (+0.71z)| lr 2.72e-04 | 322.43 ms | 52.3% bf16 MFU | 1625632 tok/s step 318/19560 | loss 5.957980 (-1.53z)| norm 0.7862 (-0.48z)| lr 2.73e-04 | 321.84 ms | 52.4% bf16 MFU | 1625802 tok/s step 319/19560 | loss 5.962630 (-1.47z)| norm 0.7464 (-0.69z)| lr 2.73e-04 | 322.41 ms | 52.3% bf16 MFU | 1625819 tok/s step 320/19560 | loss 5.891017 (-2.00z)| norm 0.6738 (-1.08z)| lr 2.74e-04 | 322.71 ms | 52.3% bf16 MFU | 1625760 tok/s step 321/19560 | loss 5.948895 (-1.52z)| norm 0.6873 (-0.99z)| lr 2.75e-04 | 322.73 ms | 52.3% bf16 MFU | 1625699 tok/s step 322/19560 | loss 5.896472 (-1.90z)| norm 0.6523 (-1.17z)| lr 2.76e-04 | 322.94 ms | 52.3% bf16 MFU | 1625588 tok/s step 323/19560 | loss 5.866193 (-2.08z)| norm 0.7516 (-0.62z)| lr 2.77e-04 | 322.30 ms | 52.4% bf16 MFU | 1625643 tok/s step 324/19560 | loss 5.929397 (-1.57z)| norm 1.0871 (+1.18z)| lr 2.78e-04 | 322.23 ms | 52.4% bf16 MFU | 1625714 tok/s step 325/19560 | loss 5.824729 (-2.33z)| norm 1.1047 (+1.25z)| lr 2.79e-04 | 322.82 ms | 52.3% bf16 MFU | 1625632 tok/s step 326/19560 | loss 5.920951 (-1.56z)| norm 1.0379 (+0.89z)| lr 2.79e-04 | 322.31 ms | 52.4% bf16 MFU | 1625683 tok/s step 327/19560 | loss 5.907260 (-1.64z)| norm 1.3754 (+2.63z)| lr 2.80e-04 | 322.56 ms | 52.3% bf16 MFU | 1625670 tok/s step 328/19560 | loss 5.982334 (-1.05z)| norm 0.7945 (-0.42z)| lr 2.81e-04 | 322.75 ms | 52.3% bf16 MFU | 1625608 tok/s step 329/19560 | loss 5.892969 (-1.72z)| norm 0.9413 (+0.34z)| lr 2.82e-04 | 322.32 ms | 52.4% bf16 MFU | 1625656 tok/s step 330/19560 | loss 5.908260 (-1.58z)| norm 0.8929 (+0.08z)| lr 2.83e-04 | 322.86 ms | 52.3% bf16 MFU | 1625568 tok/s step 331/19560 | loss 5.958205 (-1.18z)| norm 0.8705 (-0.04z)| lr 2.84e-04 | 322.95 ms | 52.3% bf16 MFU | 1625460 tok/s step 332/19560 | loss 5.932256 (-1.36z)| norm 0.8339 (-0.24z)| lr 2.85e-04 | 323.18 ms | 52.2% bf16 MFU | 1625301 tok/s step 
333/19560 | loss 5.904912 (-1.57z)| norm 0.8774 (-0.01z)| lr 2.85e-04 | 322.49 ms | 52.3% bf16 MFU | 1625324 tok/s step 334/19560 | loss 5.852131 (-1.96z)| norm 0.9555 (+0.40z)| lr 2.86e-04 | 323.42 ms | 52.2% bf16 MFU | 1625112 tok/s step 335/19560 | loss 5.889353 (-1.63z)| norm 1.1322 (+1.32z)| lr 2.87e-04 | 322.11 ms | 52.4% bf16 MFU | 1625241 tok/s step 336/19560 | loss 5.898538 (-1.54z)| norm 1.2622 (+1.97z)| lr 2.88e-04 | 322.67 ms | 52.3% bf16 MFU | 1625220 tok/s step 337/19560 | loss 5.893711 (-1.56z)| norm 0.9089 (+0.13z)| lr 2.89e-04 | 322.84 ms | 52.3% bf16 MFU | 1625159 tok/s step 338/19560 | loss 5.882050 (-1.63z)| norm 0.8151 (-0.36z)| lr 2.90e-04 | 321.95 ms | 52.4% bf16 MFU | 1625324 tok/s step 339/19560 | loss 5.934131 (-1.19z)| norm 0.7395 (-0.75z)| lr 2.91e-04 | 322.30 ms | 52.4% bf16 MFU | 1625393 tok/s step 340/19560 | loss 5.838213 (-1.95z)| norm 0.7837 (-0.52z)| lr 2.91e-04 | 322.02 ms | 52.4% bf16 MFU | 1625530 tok/s step 341/19560 | loss 5.845649 (-1.86z)| norm 0.7655 (-0.60z)| lr 2.92e-04 | 323.55 ms | 52.2% bf16 MFU | 1625273 tok/s step 342/19560 | loss 5.851879 (-1.79z)| norm 0.7371 (-0.74z)| lr 2.93e-04 | 322.98 ms | 52.3% bf16 MFU | 1625173 tok/s step 343/19560 | loss 5.873928 (-1.59z)| norm 0.7046 (-0.90z)| lr 2.94e-04 | 322.34 ms | 52.4% bf16 MFU | 1625238 tok/s step 344/19560 | loss 5.815110 (-2.04z)| norm 0.7610 (-0.60z)| lr 2.95e-04 | 322.46 ms | 52.3% bf16 MFU | 1625272 tok/s step 345/19560 | loss 5.890047 (-1.41z)| norm 0.8427 (-0.16z)| lr 2.96e-04 | 323.01 ms | 52.2% bf16 MFU | 1625164 tok/s step 346/19560 | loss 5.877178 (-1.50z)| norm 0.8441 (-0.14z)| lr 2.97e-04 | 322.15 ms | 52.4% bf16 MFU | 1625279 tok/s step 347/19560 | loss 5.843349 (-1.76z)| norm 0.9460 (+0.39z)| lr 2.97e-04 | 323.14 ms | 52.2% bf16 MFU | 1625138 tok/s step 348/19560 | loss 5.834387 (-1.80z)| norm 1.0741 (+1.06z)| lr 2.98e-04 | 322.26 ms | 52.4% bf16 MFU | 1625226 tok/s step 349/19560 | loss 5.835071 (-1.77z)| norm 1.1507 (+1.44z)| lr 2.99e-04 | 322.67 
ms | 52.3% bf16 MFU | 1625206 tok/s step 350/19560 | loss 5.887681 (-1.31z)| norm 1.2230 (+1.78z)| lr 3.00e-04 | 322.94 ms | 52.3% bf16 MFU | 1625121 tok/s step 351/19560 | loss 5.855923 (-1.56z)| norm 1.2046 (+1.65z)| lr 3.01e-04 | 322.98 ms | 52.3% bf16 MFU | 1625030 tok/s step 352/19560 | loss 5.869227 (-1.43z)| norm 1.0849 (+1.02z)| lr 3.02e-04 | 322.53 ms | 52.3% bf16 MFU | 1625055 tok/s step 353/19560 | loss 5.876354 (-1.36z)| norm 0.6274 (-1.34z)| lr 3.03e-04 | 322.62 ms | 52.3% bf16 MFU | 1625058 tok/s step 354/19560 | loss 5.750834 (-2.38z)| norm 0.7555 (-0.68z)| lr 3.03e-04 | 322.37 ms | 52.4% bf16 MFU | 1625122 tok/s step 355/19560 | loss 5.808078 (-1.85z)| norm 0.8335 (-0.28z)| lr 3.04e-04 | 322.60 ms | 52.3% bf16 MFU | 1625126 tok/s step 356/19560 | loss 5.734389 (-2.42z)| norm 0.9844 (+0.50z)| lr 3.05e-04 | 323.26 ms | 52.2% bf16 MFU | 1624963 tok/s step 357/19560 | loss 5.861800 (-1.33z)| norm 1.0258 (+0.70z)| lr 3.06e-04 | 323.08 ms | 52.2% bf16 MFU | 1624853 tok/s step 358/19560 | loss 5.850970 (-1.40z)| norm 0.8918 (+0.01z)| lr 3.07e-04 | 322.31 ms | 52.4% bf16 MFU | 1624944 tok/s step 359/19560 | loss 5.794242 (-1.85z)| norm 0.9129 (+0.11z)| lr 3.08e-04 | 322.03 ms | 52.4% bf16 MFU | 1625101 tok/s step 360/19560 | loss 5.836368 (-1.47z)| norm 1.0078 (+0.59z)| lr 3.09e-04 | 322.63 ms | 52.3% bf16 MFU | 1625097 tok/s step 361/19560 | loss 5.874473 (-1.13z)| norm 1.1199 (+1.16z)| lr 3.09e-04 | 321.84 ms | 52.4% bf16 MFU | 1625293 tok/s step 362/19560 | loss 5.823833 (-1.53z)| norm 1.1025 (+1.05z)| lr 3.10e-04 | 323.18 ms | 52.2% bf16 MFU | 1625143 tok/s step 363/19560 | loss 5.816920 (-1.57z)| norm 0.9826 (+0.42z)| lr 3.11e-04 | 322.45 ms | 52.3% bf16 MFU | 1625183 tok/s step 364/19560 | loss 5.808570 (-1.61z)| norm 0.7347 (-0.89z)| lr 3.12e-04 | 321.78 ms | 52.4% bf16 MFU | 1625392 tok/s step 365/19560 | loss 5.763203 (-1.96z)| norm 0.6808 (-1.18z)| lr 3.13e-04 | 322.60 ms | 52.3% bf16 MFU | 1625382 tok/s step 366/19560 | loss 5.778937 (-1.79z)| 
norm 0.6525 (-1.31z)| lr 3.14e-04 | 322.96 ms | 52.3% bf16 MFU | 1625283 tok/s step 367/19560 | loss 5.710406 (-2.30z)| norm 0.7673 (-0.69z)| lr 3.15e-04 | 322.25 ms | 52.4% bf16 MFU | 1625366 tok/s step 368/19560 | loss 5.700305 (-2.32z)| norm 0.7359 (-0.85z)| lr 3.15e-04 | 322.63 ms | 52.3% bf16 MFU | 1625350 tok/s step 369/19560 | loss 5.741160 (-1.95z)| norm 0.6710 (-1.18z)| lr 3.16e-04 | 322.21 ms | 52.4% bf16 MFU | 1625440 tok/s step 370/19560 | loss 5.794682 (-1.49z)| norm 0.7766 (-0.61z)| lr 3.17e-04 | 322.61 ms | 52.3% bf16 MFU | 1625427 tok/s step 371/19560 | loss 5.798728 (-1.43z)| norm 0.8420 (-0.25z)| lr 3.18e-04 | 323.51 ms | 52.2% bf16 MFU | 1625186 tok/s step 372/19560 | loss 5.753415 (-1.76z)| norm 1.4364 (+2.85z)| lr 3.19e-04 | 322.42 ms | 52.3% bf16 MFU | 1625232 tok/s step 373/19560 | loss 5.780749 (-1.53z)| norm 0.8169 (-0.39z)| lr 3.20e-04 | 323.39 ms | 52.2% bf16 MFU | 1625032 tok/s step 374/19560 | loss 5.804579 (-1.32z)| norm 0.7360 (-0.80z)| lr 3.21e-04 | 322.23 ms | 52.4% bf16 MFU | 1625133 tok/s step 375/19560 | loss 5.704128 (-2.09z)| norm 0.8319 (-0.28z)| lr 3.21e-04 | 322.31 ms | 52.4% bf16 MFU | 1625209 tok/s step 376/19560 | loss 5.771087 (-1.53z)| norm 1.0515 (+0.90z)| lr 3.22e-04 | 322.87 ms | 52.3% bf16 MFU | 1625139 tok/s step 377/19560 | loss 5.762410 (-1.57z)| norm 1.1884 (+1.60z)| lr 3.23e-04 | 322.73 ms | 52.3% bf16 MFU | 1625110 tok/s step 378/19560 | loss 5.763268 (-1.55z)| norm 0.8333 (-0.31z)| lr 3.24e-04 | 322.14 ms | 52.4% bf16 MFU | 1625230 tok/s step 379/19560 | loss 5.745463 (-1.66z)| norm 0.8395 (-0.28z)| lr 3.25e-04 | 322.92 ms | 52.3% bf16 MFU | 1625146 tok/s step 380/19560 | loss 5.750226 (-1.60z)| norm 0.7997 (-0.49z)| lr 3.26e-04 | 322.41 ms | 52.3% bf16 MFU | 1625196 tok/s step 381/19560 | loss 5.736010 (-1.69z)| norm 0.9710 (+0.43z)| lr 3.27e-04 | 323.11 ms | 52.2% bf16 MFU | 1625068 tok/s step 382/19560 | loss 5.749287 (-1.55z)| norm 1.0330 (+0.83z)| lr 3.27e-04 | 322.53 ms | 52.3% bf16 MFU | 1625093 tok/s 
step 383/19560 | loss 5.698438 (-1.93z)| norm 1.0423 (+0.87z)| lr 3.28e-04 | 323.13 ms | 52.2% bf16 MFU | 1624964 tok/s step 384/19560 | loss 5.784569 (-1.23z)| norm 0.9105 (+0.12z)| lr 3.29e-04 | 322.21 ms | 52.4% bf16 MFU | 1625073 tok/s step 385/19560 | loss 5.764462 (-1.37z)| norm 1.1218 (+1.30z)| lr 3.30e-04 | 322.63 ms | 52.3% bf16 MFU | 1625072 tok/s step 386/19560 | loss 5.696859 (-1.88z)| norm 0.8176 (-0.41z)| lr 3.31e-04 | 323.06 ms | 52.2% bf16 MFU | 1624963 tok/s step 387/19560 | loss 5.697162 (-1.84z)| norm 0.7666 (-0.70z)| lr 3.32e-04 | 322.61 ms | 52.3% bf16 MFU | 1624972 tok/s step 388/19560 | loss 5.683781 (-1.91z)| norm 0.6920 (-1.12z)| lr 3.33e-04 | 323.19 ms | 52.2% bf16 MFU | 1624834 tok/s step 389/19560 | loss 5.739658 (-1.45z)| norm 0.7381 (-0.85z)| lr 3.33e-04 | 323.14 ms | 52.2% bf16 MFU | 1624716 tok/s step 390/19560 | loss 5.701883 (-1.72z)| norm 0.8968 (+0.04z)| lr 3.34e-04 | 322.34 ms | 52.4% bf16 MFU | 1624806 tok/s step 391/19560 | loss 5.746603 (-1.34z)| norm 1.2453 (+1.96z)| lr 3.35e-04 | 322.76 ms | 52.3% bf16 MFU | 1624786 tok/s step 392/19560 | loss 5.709377 (-1.61z)| norm 1.1252 (+1.28z)| lr 3.36e-04 | 322.78 ms | 52.3% bf16 MFU | 1624762 tok/s step 393/19560 | loss 5.773208 (-1.09z)| norm 1.5690 (+3.51z)| lr 3.37e-04 | 322.15 ms | 52.4% bf16 MFU | 1624898 tok/s step 394/19560 | loss 5.736277 (-1.36z)| norm 1.0398 (+0.72z)| lr 3.38e-04 | 322.40 ms | 52.3% bf16 MFU | 1624963 tok/s step 395/19560 | loss 5.697823 (-1.64z)| norm 0.7677 (-0.72z)| lr 3.39e-04 | 321.79 ms | 52.4% bf16 MFU | 1625180 tok/s step 396/19560 | loss 5.719881 (-1.44z)| norm 0.6493 (-1.34z)| lr 3.39e-04 | 323.02 ms | 52.2% bf16 MFU | 1625075 tok/s step 397/19560 | loss 5.676668 (-1.76z)| norm 0.7460 (-0.82z)| lr 3.40e-04 | 322.49 ms | 52.3% bf16 MFU | 1625110 tok/s step 398/19560 | loss 5.699487 (-1.56z)| norm 0.8371 (-0.34z)| lr 3.41e-04 | 322.50 ms | 52.3% bf16 MFU | 1625140 tok/s step 399/19560 | loss 5.721134 (-1.36z)| norm 1.4316 (+2.69z)| lr 3.42e-04 | 
322.62 ms | 52.3% bf16 MFU | 1625138 tok/s step 400/19560 | loss 5.747796 (-1.13z)| norm 0.9091 (-0.00z)| lr 3.43e-04 | 322.58 ms | 52.3% bf16 MFU | 1625146 tok/s step 401/19560 | loss 5.645664 (-1.92z)| norm 0.9908 (+0.41z)| lr 3.44e-04 | 323.21 ms | 52.2% bf16 MFU | 1624994 tok/s step 402/19560 | loss 5.732313 (-1.21z)| norm 1.5178 (+3.03z)| lr 3.45e-04 | 322.52 ms | 52.3% bf16 MFU | 1625025 tok/s step 403/19560 | loss 5.699378 (-1.45z)| norm 0.8086 (-0.56z)| lr 3.45e-04 | 322.40 ms | 52.3% bf16 MFU | 1625083 tok/s step 404/19560 | loss 5.608023 (-2.14z)| norm 1.0548 (+0.68z)| lr 3.46e-04 | 321.87 ms | 52.4% bf16 MFU | 1625273 tok/s step 405/19560 | loss 5.708708 (-1.32z)| norm 0.9836 (+0.31z)| lr 3.47e-04 | 322.54 ms | 52.3% bf16 MFU | 1625285 tok/s step 406/19560 | loss 5.652881 (-1.73z)| norm 0.8873 (-0.18z)| lr 3.48e-04 | 322.63 ms | 52.3% bf16 MFU | 1625273 tok/s step 407/19560 | loss 5.627907 (-1.90z)| norm 0.7628 (-0.82z)| lr 3.49e-04 | 322.04 ms | 52.4% bf16 MFU | 1625410 tok/s step 408/19560 | loss 5.711045 (-1.22z)| norm 0.7968 (-0.64z)| lr 3.50e-04 | 322.55 ms | 52.3% bf16 MFU | 1625410 tok/s step 409/19560 | loss 5.664342 (-1.56z)| norm 0.7079 (-1.08z)| lr 3.51e-04 | 322.55 ms | 52.3% bf16 MFU | 1625411 tok/s step 410/19560 | loss 5.633265 (-1.80z)| norm 0.7716 (-0.74z)| lr 3.51e-04 | 322.59 ms | 52.3% bf16 MFU | 1625403 tok/s step 411/19560 | loss 5.588402 (-2.12z)| norm 0.8159 (-0.52z)| lr 3.52e-04 | 322.73 ms | 52.3% bf16 MFU | 1625360 tok/s step 412/19560 | loss 5.705221 (-1.17z)| norm 0.9973 (+0.41z)| lr 3.53e-04 | 322.65 ms | 52.3% bf16 MFU | 1625339 tok/s step 413/19560 | loss 5.640998 (-1.66z)| norm 0.8342 (-0.43z)| lr 3.54e-04 | 322.74 ms | 52.3% bf16 MFU | 1625298 tok/s step 414/19560 | loss 5.599078 (-1.95z)| norm 0.8024 (-0.59z)| lr 3.55e-04 | 322.95 ms | 52.3% bf16 MFU | 1625205 tok/s step 415/19560 | loss 5.605090 (-1.87z)| norm 0.8830 (-0.18z)| lr 3.56e-04 | 323.21 ms | 52.2% bf16 MFU | 1625051 tok/s step 416/19560 | loss 5.651168 
(-1.48z)| norm 1.2966 (+1.92z)| lr 3.57e-04 | 322.87 ms | 52.3% bf16 MFU | 1624991 tok/s step 417/19560 | loss 5.683066 (-1.21z)| norm 1.0212 (+0.50z)| lr 3.57e-04 | 322.12 ms | 52.4% bf16 MFU | 1625122 tok/s step 418/19560 | loss 5.687293 (-1.16z)| norm 1.0925 (+0.85z)| lr 3.58e-04 | 322.51 ms | 52.3% bf16 MFU | 1625149 tok/s step 419/19560 | loss 5.656088 (-1.39z)| norm 1.0495 (+0.62z)| lr 3.59e-04 | 322.91 ms | 52.3% bf16 MFU | 1625074 tok/s step 420/19560 | loss 5.648342 (-1.43z)| norm 1.2059 (+1.40z)| lr 3.60e-04 | 323.38 ms | 52.2% bf16 MFU | 1624884 tok/s step 421/19560 | loss 5.696465 (-1.03z)| norm 0.9834 (+0.25z)| lr 3.61e-04 | 322.54 ms | 52.3% bf16 MFU | 1624914 tok/s step 422/19560 | loss 5.581553 (-1.92z)| norm 0.7401 (-0.98z)| lr 3.62e-04 | 323.15 ms | 52.2% bf16 MFU | 1624790 tok/s step 423/19560 | loss 5.582055 (-1.89z)| norm 0.6630 (-1.37z)| lr 3.63e-04 | 323.05 ms | 52.2% bf16 MFU | 1624696 tok/s step 424/19560 | loss 5.627133 (-1.50z)| norm 0.8760 (-0.28z)| lr 3.63e-04 | 323.14 ms | 52.2% bf16 MFU | 1624586 tok/s step 425/19560 | loss 5.599741 (-1.69z)| norm 1.0957 (+0.84z)| lr 3.64e-04 | 322.65 ms | 52.3% bf16 MFU | 1624604 tok/s step 426/19560 | loss 5.640201 (-1.34z)| norm 1.1179 (+0.94z)| lr 3.65e-04 | 322.48 ms | 52.3% bf16 MFU | 1624665 tok/s step 427/19560 | loss 5.644914 (-1.29z)| norm 0.9126 (-0.11z)| lr 3.66e-04 | 323.02 ms | 52.2% bf16 MFU | 1624584 tok/s step 428/19560 | loss 5.586823 (-1.73z)| norm 0.8980 (-0.17z)| lr 3.67e-04 | 322.88 ms | 52.3% bf16 MFU | 1624544 tok/s step 429/19560 | loss 5.594231 (-1.64z)| norm 0.7729 (-0.81z)| lr 3.68e-04 | 322.63 ms | 52.3% bf16 MFU | 1624568 tok/s step 430/19560 | loss 5.581963 (-1.71z)| norm 0.7826 (-0.75z)| lr 3.69e-04 | 322.55 ms | 52.3% bf16 MFU | 1624613 tok/s step 431/19560 | loss 5.592231 (-1.60z)| norm 0.8492 (-0.41z)| lr 3.69e-04 | 323.10 ms | 52.2% bf16 MFU | 1624516 tok/s step 432/19560 | loss 5.555316 (-1.86z)| norm 0.7208 (-1.07z)| lr 3.70e-04 | 322.17 ms | 52.4% bf16 MFU | 
1624658 tok/s step 433/19560 | loss 5.549290 (-1.88z)| norm 0.7836 (-0.74z)| lr 3.71e-04 | 322.08 ms | 52.4% bf16 MFU | 1624815 tok/s step 434/19560 | loss 5.597592 (-1.47z)| norm 0.9018 (-0.13z)| lr 3.72e-04 | 323.02 ms | 52.2% bf16 MFU | 1624728 tok/s step 435/19560 | loss 5.595928 (-1.46z)| norm 1.1269 (+1.01z)| lr 3.73e-04 | 322.68 ms | 52.3% bf16 MFU | 1624731 tok/s step 436/19560 | loss 5.604266 (-1.38z)| norm 1.0030 (+0.37z)| lr 3.74e-04 | 322.75 ms | 52.3% bf16 MFU | 1624717 tok/s step 437/19560 | loss 5.503093 (-2.14z)| norm 0.8563 (-0.38z)| lr 3.75e-04 | 322.81 ms | 52.3% bf16 MFU | 1624687 tok/s step 438/19560 | loss 5.637400 (-1.08z)| norm 0.9635 (+0.18z)| lr 3.75e-04 | 323.93 ms | 52.1% bf16 MFU | 1624379 tok/s step 439/19560 | loss 5.548215 (-1.79z)| norm 1.1969 (+1.37z)| lr 3.76e-04 | 322.81 ms | 52.3% bf16 MFU | 1624368 tok/s step 440/19560 | loss 5.570685 (-1.59z)| norm 0.8225 (-0.54z)| lr 3.77e-04 | 323.64 ms | 52.1% bf16 MFU | 1624150 tok/s step 441/19560 | loss 5.535388 (-1.86z)| norm 0.8110 (-0.59z)| lr 3.78e-04 | 321.95 ms | 52.4% bf16 MFU | 1624365 tok/s step 442/19560 | loss 5.551450 (-1.71z)| norm 0.8799 (-0.23z)| lr 3.79e-04 | 322.70 ms | 52.3% bf16 MFU | 1624382 tok/s step 443/19560 | loss 5.588181 (-1.38z)| norm 0.7774 (-0.75z)| lr 3.80e-04 | 322.85 ms | 52.3% bf16 MFU | 1624359 tok/s step 444/19560 | loss 5.493528 (-2.14z)| norm 0.7301 (-0.98z)| lr 3.81e-04 | 322.32 ms | 52.4% bf16 MFU | 1624472 tok/s step 445/19560 | loss 5.515854 (-1.92z)| norm 0.9461 (+0.15z)| lr 3.81e-04 | 322.63 ms | 52.3% bf16 MFU | 1624500 tok/s step 446/19560 | loss 5.569407 (-1.45z)| norm 1.1722 (+1.31z)| lr 3.82e-04 | 322.84 ms | 52.3% bf16 MFU | 1624474 tok/s step 447/19560 | loss 5.520355 (-1.83z)| norm 0.7020 (-1.14z)| lr 3.83e-04 | 322.31 ms | 52.4% bf16 MFU | 1624584 tok/s step 448/19560 | loss 5.540468 (-1.63z)| norm 0.7891 (-0.69z)| lr 3.84e-04 | 322.80 ms | 52.3% bf16 MFU | 1624565 tok/s step 449/19560 | loss 5.607126 (-1.06z)| norm 0.8632 (-0.31z)| lr 
3.85e-04 | 322.83 ms | 52.3% bf16 MFU | 1624539 tok/s step 450/19560 | loss 5.553960 (-1.49z)| norm 0.9792 (+0.28z)| lr 3.86e-04 | 321.82 ms | 52.4% bf16 MFU | 1624769 tok/s step 451/19560 | loss 5.591267 (-1.15z)| norm 1.0273 (+0.53z)| lr 3.87e-04 | 322.78 ms | 52.3% bf16 MFU | 1624745 tok/s step 452/19560 | loss 5.571222 (-1.30z)| norm 0.8060 (-0.63z)| lr 3.87e-04 | 322.75 ms | 52.3% bf16 MFU | 1624730 tok/s step 453/19560 | loss 5.544106 (-1.51z)| norm 0.7530 (-0.90z)| lr 3.88e-04 | 322.62 ms | 52.3% bf16 MFU | 1624748 tok/s step 454/19560 | loss 5.467613 (-2.11z)| norm 0.7107 (-1.11z)| lr 3.89e-04 | 323.97 ms | 52.1% bf16 MFU | 1624427 tok/s step 455/19560 | loss 5.529677 (-1.56z)| norm 0.7287 (-1.01z)| lr 3.90e-04 | 321.97 ms | 52.4% bf16 MFU | 1624623 tok/s step 456/19560 | loss 5.523317 (-1.60z)| norm 0.8351 (-0.43z)| lr 3.91e-04 | 322.79 ms | 52.3% bf16 MFU | 1624605 tok/s step 457/19560 | loss 5.563190 (-1.25z)| norm 0.8202 (-0.51z)| lr 3.92e-04 | 322.71 ms | 52.3% bf16 MFU | 1624608 tok/s step 458/19560 | loss 5.500129 (-1.75z)| norm 0.8308 (-0.45z)| lr 3.93e-04 | 322.10 ms | 52.4% bf16 MFU | 1624762 tok/s step 459/19560 | loss 5.558644 (-1.25z)| norm 1.1058 (+1.02z)| lr 3.93e-04 | 322.46 ms | 52.3% bf16 MFU | 1624819 tok/s step 460/19560 | loss 5.532295 (-1.45z)| norm 0.9210 (+0.02z)| lr 3.94e-04 | 322.83 ms | 52.3% bf16 MFU | 1624781 tok/s step 461/19560 | loss 5.520588 (-1.53z)| norm 1.1248 (+1.11z)| lr 3.95e-04 | 323.08 ms | 52.2% bf16 MFU | 1624680 tok/s step 462/19560 | loss 5.520423 (-1.51z)| norm 1.0647 (+0.78z)| lr 3.96e-04 | 322.86 ms | 52.3% bf16 MFU | 1624639 tok/s step 463/19560 | loss 5.649081 (-0.39z)| norm 0.9920 (+0.40z)| lr 3.97e-04 | 323.10 ms | 52.2% bf16 MFU | 1624541 tok/s step 464/19560 | loss 5.549935 (-1.24z)| norm 1.0482 (+0.72z)| lr 3.98e-04 | 322.02 ms | 52.4% bf16 MFU | 1624721 tok/s step 465/19560 | loss 5.462357 (-1.97z)| norm 0.9506 (+0.18z)| lr 3.99e-04 | 322.85 ms | 52.3% bf16 MFU | 1624682 tok/s step 466/19560 | loss 
5.453300 (-2.01z)| norm 0.8609 (-0.31z)| lr 3.99e-04 | 323.04 ms | 52.2% bf16 MFU | 1624598 tok/s step 467/19560 | loss 5.465189 (-1.89z)| norm 0.8461 (-0.39z)| lr 4.00e-04 | 322.37 ms | 52.4% bf16 MFU | 1624685 tok/s step 468/19560 | loss 5.469641 (-1.81z)| norm 0.7119 (-1.12z)| lr 4.01e-04 | 322.45 ms | 52.3% bf16 MFU | 1624748 tok/s step 469/19560 | loss 5.452538 (-1.92z)| norm 0.7696 (-0.80z)| lr 4.02e-04 | 322.76 ms | 52.3% bf16 MFU | 1624730 tok/s step 470/19560 | loss 5.530628 (-1.23z)| norm 0.8799 (-0.21z)| lr 4.03e-04 | 322.49 ms | 52.3% bf16 MFU | 1624780 tok/s step 471/19560 | loss 5.481446 (-1.63z)| norm 0.7972 (-0.67z)| lr 4.04e-04 | 323.68 ms | 52.1% bf16 MFU | 1624530 tok/s step 472/19560 | loss 5.426326 (-2.06z)| norm 0.8624 (-0.32z)| lr 4.05e-04 | 323.12 ms | 52.2% bf16 MFU | 1624432 tok/s step 473/19560 | loss 5.470809 (-1.66z)| norm 1.0298 (+0.59z)| lr 4.05e-04 | 322.15 ms | 52.4% bf16 MFU | 1624583 tok/s step 474/19560 | loss 5.450632 (-1.81z)| norm 0.9982 (+0.41z)| lr 4.06e-04 | 322.81 ms | 52.3% bf16 MFU | 1624560 tok/s step 475/19560 | loss 5.466815 (-1.64z)| norm 1.1457 (+1.21z)| lr 4.07e-04 | 322.47 ms | 52.3% bf16 MFU | 1624624 tok/s step 476/19560 | loss 5.444256 (-1.80z)| norm 0.7032 (-1.19z)| lr 4.08e-04 | 322.35 ms | 52.4% bf16 MFU | 1624716 tok/s step 477/19560 | loss 5.514423 (-1.18z)| norm 0.7536 (-0.90z)| lr 4.09e-04 | 323.15 ms | 52.2% bf16 MFU | 1624603 tok/s step 478/19560 | loss 5.500505 (-1.29z)| norm 0.7322 (-1.00z)| lr 4.10e-04 | 322.61 ms | 52.3% bf16 MFU | 1624631 tok/s step 479/19560 | loss 5.428710 (-1.88z)| norm 0.8018 (-0.61z)| lr 4.11e-04 | 322.65 ms | 52.3% bf16 MFU | 1624646 tok/s step 480/19560 | loss 5.412281 (-2.00z)| norm 0.8902 (-0.11z)| lr 4.11e-04 | 322.85 ms | 52.3% bf16 MFU | 1624609 tok/s step 481/19560 | loss 5.470517 (-1.47z)| norm 0.7932 (-0.66z)| lr 4.12e-04 | 322.61 ms | 52.3% bf16 MFU | 1624636 tok/s step 482/19560 | loss 5.462602 (-1.51z)| norm 0.7694 (-0.79z)| lr 4.13e-04 | 322.75 ms | 52.3% bf16 
MFU | 1624625 tok/s step 483/19560 | loss 5.455883 (-1.55z)| norm 0.9316 (+0.11z)| lr 4.14e-04 | 322.65 ms | 52.3% bf16 MFU | 1624641 tok/s step 484/19560 | loss 5.427275 (-1.76z)| norm 0.9896 (+0.44z)| lr 4.15e-04 | 322.91 ms | 52.3% bf16 MFU | 1624591 tok/s step 485/19560 | loss 5.453872 (-1.51z)| norm 1.0460 (+0.75z)| lr 4.16e-04 | 323.00 ms | 52.3% bf16 MFU | 1624520 tok/s step 486/19560 | loss 5.480731 (-1.26z)| norm 0.9344 (+0.12z)| lr 4.17e-04 | 321.80 ms | 52.4% bf16 MFU | 1624755 tok/s step 487/19560 | loss 5.421312 (-1.76z)| norm 0.7281 (-1.02z)| lr 4.17e-04 | 322.83 ms | 52.3% bf16 MFU | 1624718 tok/s step 488/19560 | loss 5.495105 (-1.10z)| norm 0.7017 (-1.15z)| lr 4.18e-04 | 323.71 ms | 52.1% bf16 MFU | 1624464 tok/s step 489/19560 | loss 5.379706 (-2.09z)| norm 0.6671 (-1.32z)| lr 4.19e-04 | 322.86 ms | 52.3% bf16 MFU | 1624434 tok/s step 490/19560 | loss 5.452560 (-1.43z)| norm 0.7446 (-0.88z)| lr 4.20e-04 | 321.79 ms | 52.4% bf16 MFU | 1624678 tok/s step 491/19560 | loss 5.411655 (-1.77z)| norm 0.7727 (-0.71z)| lr 4.21e-04 | 322.73 ms | 52.3% bf16 MFU | 1624671 tok/s step 492/19560 | loss 5.403941 (-1.81z)| norm 0.8306 (-0.39z)| lr 4.22e-04 | 322.88 ms | 52.3% bf16 MFU | 1624626 tok/s step 493/19560 | loss 5.394545 (-1.85z)| norm 0.8389 (-0.36z)| lr 4.23e-04 | 323.19 ms | 52.2% bf16 MFU | 1624506 tok/s step 494/19560 | loss 5.409517 (-1.69z)| norm 1.0450 (+0.78z)| lr 4.23e-04 | 322.41 ms | 52.3% bf16 MFU | 1624587 tok/s step 495/19560 | loss 5.418550 (-1.58z)| norm 1.1251 (+1.22z)| lr 4.24e-04 | 322.23 ms | 52.4% bf16 MFU | 1624710 tok/s step 496/19560 | loss 5.394727 (-1.76z)| norm 0.8615 (-0.27z)| lr 4.25e-04 | 322.73 ms | 52.3% bf16 MFU | 1624701 tok/s step 497/19560 | loss 5.406218 (-1.63z)| norm 0.8084 (-0.58z)| lr 4.26e-04 | 323.01 ms | 52.2% bf16 MFU | 1624623 tok/s step 498/19560 | loss 5.419038 (-1.49z)| norm 0.8683 (-0.24z)| lr 4.27e-04 | 322.52 ms | 52.3% bf16 MFU | 1624671 tok/s step 499/19560 | loss 5.415681 (-1.50z)| norm 0.8377 
(-0.42z)| lr 4.28e-04 | 322.47 ms | 52.3% bf16 MFU | 1624730 tok/s step 500/19560 | loss 5.391510 (-1.69z)| norm 0.6522 (-1.48z)| lr 4.29e-04 | 322.08 ms | 52.4% bf16 MFU | 1624883 tok/s val loss 5.364819 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating HellaSwag: 70/79 HellaSwag: 2455/10042 = 0.244473 step 501/19560 | loss 5.498502 (-0.73z)| norm 0.5839 (-1.84z)| lr 4.29e-04 | 321.88 ms | 52.4% bf16 MFU | 1625080 tok/s step 502/19560 | loss 5.364617 (-1.89z)| norm 0.5607 (-1.94z)| lr 4.30e-04 | 322.13 ms | 52.4% bf16 MFU | 1625205 tok/s step 503/19560 | loss 5.347536 (-2.00z)| norm 0.6226 (-1.57z)| lr 4.31e-04 | 323.36 ms | 52.2% bf16 MFU | 1625015 tok/s step 504/19560 | loss 5.347878 (-1.96z)| norm 0.7357 (-0.91z)| lr 4.32e-04 | 322.41 ms | 52.3% bf16 MFU | 1625073 tok/s step 505/19560 | loss 5.418486 (-1.32z)| norm 0.8774 (-0.10z)| lr 4.33e-04 | 322.33 ms | 52.4% bf16 MFU | 1625147 tok/s step 506/19560 | loss 5.340372 (-1.97z)| norm 0.9101 (+0.08z)| lr 4.34e-04 | 322.40 ms | 52.3% bf16 MFU | 1625201 tok/s step 507/19560 | loss 5.350870 (-1.85z)| norm 1.0830 (+1.05z)| lr 4.35e-04 | 322.33 ms | 52.4% bf16 MFU | 1625268 tok/s step 508/19560 | loss 5.441101 (-1.05z)| norm 1.2948 (+2.19z)| lr 4.35e-04 | 321.97 ms | 52.4% bf16 MFU | 1625423 tok/s step 509/19560 | loss 5.356126 (-1.76z)| norm 0.8207 (-0.45z)| lr 4.36e-04 | 322.38 ms | 52.4% bf16 MFU | 1625466 tok/s step 510/19560 | loss 5.409202 (-1.28z)| norm 0.7187 (-1.00z)| lr 4.37e-04 | 322.53 ms | 52.3% bf16 MFU | 1625470 tok/s step 511/19560 | loss 5.337696 (-1.87z)| norm 0.7167 (-0.99z)| lr 4.38e-04 | 322.30 ms | 52.4% bf16 MFU | 1625532 tok/s step 512/19560 | loss 5.302006 (-2.14z)| norm 0.8005 (-0.52z)| lr 4.39e-04 | 322.52 ms | 52.3% bf16 MFU | 1625535 tok/s step 513/19560 | loss 5.340942 (-1.78z)| norm 0.9039 (+0.06z)| lr 4.40e-04 | 322.43 ms | 52.3% 
bf16 MFU | 1625562 tok/s step 514/19560 | loss 5.346510 (-1.70z)| norm 0.8360 (-0.32z)| lr 4.41e-04 | 323.08 ms | 52.2% bf16 MFU | 1625423 tok/s step 515/19560 | loss 5.381519 (-1.37z)| norm 0.9962 (+0.57z)| lr 4.41e-04 | 322.35 ms | 52.4% bf16 MFU | 1625474 tok/s step 516/19560 | loss 5.420273 (-1.02z)| norm 1.5913 (+3.67z)| lr 4.42e-04 | 322.00 ms | 52.4% bf16 MFU | 1625612 tok/s step 517/19560 | loss 5.352194 (-1.59z)| norm 0.9089 (+0.03z)| lr 4.43e-04 | 322.75 ms | 52.3% bf16 MFU | 1625554 tok/s step 518/19560 | loss 5.368319 (-1.43z)| norm 0.8914 (-0.06z)| lr 4.44e-04 | 322.93 ms | 52.3% bf16 MFU | 1625452 tok/s step 519/19560 | loss 5.309851 (-1.91z)| norm 0.9037 (+0.02z)| lr 4.45e-04 | 322.38 ms | 52.4% bf16 MFU | 1625493 tok/s step 520/19560 | loss 5.326802 (-1.73z)| norm 0.8177 (-0.44z)| lr 4.46e-04 | 322.36 ms | 52.4% bf16 MFU | 1625539 tok/s step 521/19560 | loss 5.380197 (-1.25z)| norm 0.8519 (-0.23z)| lr 4.47e-04 | 322.04 ms | 52.4% bf16 MFU | 1625664 tok/s step 522/19560 | loss 5.286824 (-2.04z)| norm 0.8887 (-0.02z)| lr 4.47e-04 | 322.57 ms | 52.3% bf16 MFU | 1625648 tok/s step 523/19560 | loss 5.339777 (-1.55z)| norm 0.9630 (+0.40z)| lr 4.48e-04 | 323.13 ms | 52.2% bf16 MFU | 1625493 tok/s step 524/19560 | loss 5.276851 (-2.06z)| norm 1.1618 (+1.53z)| lr 4.49e-04 | 323.45 ms | 52.2% bf16 MFU | 1625264 tok/s step 525/19560 | loss 5.331431 (-1.56z)| norm 0.8211 (-0.44z)| lr 4.50e-04 | 321.93 ms | 52.4% bf16 MFU | 1625430 tok/s step 526/19560 | loss 5.305256 (-1.76z)| norm 0.7615 (-0.78z)| lr 4.51e-04 | 322.70 ms | 52.3% bf16 MFU | 1625394 tok/s step 527/19560 | loss 5.337534 (-1.46z)| norm 0.7519 (-0.83z)| lr 4.52e-04 | 323.07 ms | 52.2% bf16 MFU | 1625265 tok/s step 528/19560 | loss 5.365855 (-1.20z)| norm 0.6454 (-1.45z)| lr 4.53e-04 | 323.05 ms | 52.2% bf16 MFU | 1625149 tok/s step 529/19560 | loss 5.215557 (-2.46z)| norm 0.6524 (-1.38z)| lr 4.53e-04 | 322.72 ms | 52.3% bf16 MFU | 1625121 tok/s step 530/19560 | loss 5.355121 (-1.23z)| norm 0.6029 
(-1.71z)| lr 4.54e-04 | 322.60 ms | 52.3% bf16 MFU | 1625124 tok/s step 531/19560 | loss 5.309175 (-1.61z)| norm 0.6232 (-1.56z)| lr 4.55e-04 | 322.21 ms | 52.4% bf16 MFU | 1625226 tok/s step 532/19560 | loss 5.286662 (-1.78z)| norm 0.6796 (-1.19z)| lr 4.56e-04 | 322.45 ms | 52.3% bf16 MFU | 1625262 tok/s step 533/19560 | loss 5.276320 (-1.84z)| norm 0.8257 (-0.30z)| lr 4.57e-04 | 322.93 ms | 52.3% bf16 MFU | 1625176 tok/s step 534/19560 | loss 5.301159 (-1.59z)| norm 1.0123 (+0.84z)| lr 4.58e-04 | 323.05 ms | 52.2% bf16 MFU | 1625064 tok/s step 535/19560 | loss 5.258131 (-1.93z)| norm 0.9188 (+0.26z)| lr 4.59e-04 | 322.45 ms | 52.3% bf16 MFU | 1625109 tok/s step 536/19560 | loss 5.260259 (-1.88z)| norm 0.8472 (-0.18z)| lr 4.59e-04 | 323.31 ms | 52.2% bf16 MFU | 1624935 tok/s step 537/19560 | loss 5.205267 (-2.31z)| norm 0.8731 (-0.03z)| lr 4.60e-04 | 322.57 ms | 52.3% bf16 MFU | 1624955 tok/s step 538/19560 | loss 5.237278 (-1.99z)| norm 0.9583 (+0.48z)| lr 4.61e-04 | 322.99 ms | 52.3% bf16 MFU | 1624869 tok/s step 539/19560 | loss 5.273420 (-1.65z)| norm 0.7446 (-0.82z)| lr 4.62e-04 | 322.69 ms | 52.3% bf16 MFU | 1624863 tok/s step 540/19560 | loss 5.300308 (-1.40z)| norm 0.6498 (-1.38z)| lr 4.63e-04 | 322.80 ms | 52.3% bf16 MFU | 1624828 tok/s step 541/19560 | loss 5.199187 (-2.22z)| norm 0.6230 (-1.52z)| lr 4.64e-04 | 322.95 ms | 52.3% bf16 MFU | 1624758 tok/s step 542/19560 | loss 5.224133 (-1.96z)| norm 0.5510 (-1.92z)| lr 4.65e-04 | 322.75 ms | 52.3% bf16 MFU | 1624742 tok/s step 543/19560 | loss 5.257647 (-1.65z)| norm 0.5791 (-1.71z)| lr 4.65e-04 | 322.90 ms | 52.3% bf16 MFU | 1624688 tok/s step 544/19560 | loss 5.188994 (-2.17z)| norm 0.7344 (-0.79z)| lr 4.66e-04 | 322.86 ms | 52.3% bf16 MFU | 1624648 tok/s step 545/19560 | loss 5.323084 (-1.05z)| norm 1.1945 (+1.95z)| lr 4.67e-04 | 323.01 ms | 52.3% bf16 MFU | 1624573 tok/s step 546/19560 | loss 5.234902 (-1.77z)| norm 1.0489 (+1.09z)| lr 4.68e-04 | 322.20 ms | 52.4% bf16 MFU | 1624706 tok/s step 
547/19560 | loss 5.212219 (-1.92z)| norm 0.9813 (+0.69z)| lr 4.69e-04 | 322.61 ms | 52.3% bf16 MFU | 1624728 tok/s step 548/19560 | loss 5.345063 (-0.80z)| norm 0.9062 (+0.26z)| lr 4.70e-04 | 322.78 ms | 52.3% bf16 MFU | 1624706 tok/s step 549/19560 | loss 5.245448 (-1.62z)| norm 0.9232 (+0.36z)| lr 4.71e-04 | 321.98 ms | 52.4% bf16 MFU | 1624886 tok/s step 550/19560 | loss 5.260484 (-1.47z)| norm 0.9321 (+0.41z)| lr 4.71e-04 | 323.61 ms | 52.2% bf16 MFU | 1624648 tok/s step 551/19560 | loss 5.314529 (-1.00z)| norm 0.8480 (-0.11z)| lr 4.72e-04 | 322.58 ms | 52.3% bf16 MFU | 1624680 tok/s step 552/19560 | loss 5.238575 (-1.62z)| norm 0.8240 (-0.26z)| lr 4.73e-04 | 323.08 ms | 52.2% bf16 MFU | 1624585 tok/s step 553/19560 | loss 5.231218 (-1.65z)| norm 0.8802 (+0.10z)| lr 4.74e-04 | 322.30 ms | 52.4% bf16 MFU | 1624692 tok/s step 554/19560 | loss 5.267057 (-1.33z)| norm 0.8607 (-0.01z)| lr 4.75e-04 | 322.95 ms | 52.3% bf16 MFU | 1624628 tok/s step 555/19560 | loss 5.220951 (-1.70z)| norm 0.7869 (-0.46z)| lr 4.76e-04 | 322.96 ms | 52.3% bf16 MFU | 1624566 tok/s step 556/19560 | loss 5.239261 (-1.52z)| norm 0.8243 (-0.23z)| lr 4.77e-04 | 322.62 ms | 52.3% bf16 MFU | 1624592 tok/s step 557/19560 | loss 5.230048 (-1.57z)| norm 1.0265 (+1.02z)| lr 4.77e-04 | 322.48 ms | 52.3% bf16 MFU | 1624653 tok/s step 558/19560 | loss 5.130980 (-2.36z)| norm 0.7765 (-0.54z)| lr 4.78e-04 | 323.15 ms | 52.2% bf16 MFU | 1624543 tok/s step 559/19560 | loss 5.240596 (-1.41z)| norm 0.8398 (-0.14z)| lr 4.79e-04 | 322.94 ms | 52.3% bf16 MFU | 1624489 tok/s step 560/19560 | loss 5.223023 (-1.53z)| norm 0.7135 (-0.93z)| lr 4.80e-04 | 322.40 ms | 52.3% bf16 MFU | 1624574 tok/s step 561/19560 | loss 5.253553 (-1.26z)| norm 0.6145 (-1.52z)| lr 4.81e-04 | 322.89 ms | 52.3% bf16 MFU | 1624531 tok/s step 562/19560 | loss 5.229512 (-1.44z)| norm 0.6276 (-1.42z)| lr 4.82e-04 | 323.56 ms | 52.2% bf16 MFU | 1624322 tok/s step 563/19560 | loss 5.245800 (-1.28z)| norm 0.6034 (-1.54z)| lr 4.83e-04 | 322.43 
ms | 52.3% bf16 MFU | 1624408 tok/s step 564/19560 | loss 5.212085 (-1.55z)| norm 0.9239 (+0.43z)| lr 4.83e-04 | 322.47 ms | 52.3% bf16 MFU | 1624480 tok/s step 565/19560 | loss 5.237493 (-1.31z)| norm 1.2477 (+2.35z)| lr 4.84e-04 | 323.87 ms | 52.1% bf16 MFU | 1624198 tok/s step 566/19560 | loss 5.189203 (-1.70z)| norm 0.8181 (-0.23z)| lr 4.85e-04 | 323.11 ms | 52.2% bf16 MFU | 1624118 tok/s step 567/19560 | loss 5.222032 (-1.40z)| norm 0.8756 (+0.13z)| lr 4.86e-04 | 322.67 ms | 52.3% bf16 MFU | 1624155 tok/s step 568/19560 | loss 5.208960 (-1.49z)| norm 0.8070 (-0.29z)| lr 4.87e-04 | 322.80 ms | 52.3% bf16 MFU | 1624155 tok/s step 569/19560 | loss 5.227763 (-1.31z)| norm 0.6797 (-1.05z)| lr 4.88e-04 | 322.35 ms | 52.4% bf16 MFU | 1624271 tok/s step 570/19560 | loss 5.199591 (-1.52z)| norm 0.7698 (-0.50z)| lr 4.89e-04 | 323.25 ms | 52.2% bf16 MFU | 1624154 tok/s step 571/19560 | loss 5.223817 (-1.30z)| norm 0.7924 (-0.36z)| lr 4.89e-04 | 322.38 ms | 52.4% bf16 MFU | 1624262 tok/s step 572/19560 | loss 5.183462 (-1.62z)| norm 0.7297 (-0.74z)| lr 4.90e-04 | 323.16 ms | 52.2% bf16 MFU | 1624167 tok/s step 573/19560 | loss 5.145094 (-1.90z)| norm 0.8880 (+0.22z)| lr 4.91e-04 | 322.73 ms | 52.3% bf16 MFU | 1624186 tok/s step 574/19560 | loss 5.153665 (-1.80z)| norm 0.8206 (-0.17z)| lr 4.92e-04 | 322.98 ms | 52.3% bf16 MFU | 1624141 tok/s step 575/19560 | loss 5.144991 (-1.84z)| norm 0.6598 (-1.17z)| lr 4.93e-04 | 323.05 ms | 52.2% bf16 MFU | 1624081 tok/s step 576/19560 | loss 5.180297 (-1.52z)| norm 0.6297 (-1.33z)| lr 4.94e-04 | 322.84 ms | 52.3% bf16 MFU | 1624076 tok/s step 577/19560 | loss 5.174501 (-1.55z)| norm 0.7124 (-0.82z)| lr 4.95e-04 | 322.93 ms | 52.3% bf16 MFU | 1624050 tok/s step 578/19560 | loss 5.201692 (-1.30z)| norm 0.7491 (-0.58z)| lr 4.95e-04 | 322.73 ms | 52.3% bf16 MFU | 1624073 tok/s step 579/19560 | loss 5.227182 (-1.07z)| norm 0.8329 (-0.06z)| lr 4.96e-04 | 323.23 ms | 52.2% bf16 MFU | 1623972 tok/s step 580/19560 | loss 5.207070 (-1.23z)| 
norm 0.7012 (-0.86z)| lr 4.97e-04 | 322.71 ms | 52.3% bf16 MFU | 1624004 tok/s step 581/19560 | loss 5.105211 (-2.07z)| norm 0.6148 (-1.38z)| lr 4.98e-04 | 322.66 ms | 52.3% bf16 MFU | 1624048 tok/s step 582/19560 | loss 5.142806 (-1.71z)| norm 0.6275 (-1.29z)| lr 4.99e-04 | 323.08 ms | 52.2% bf16 MFU | 1623985 tok/s step 583/19560 | loss 5.175977 (-1.41z)| norm 0.7438 (-0.59z)| lr 5.00e-04 | 322.85 ms | 52.3% bf16 MFU | 1623983 tok/s step 584/19560 | loss 5.148100 (-1.62z)| norm 0.9102 (+0.42z)| lr 5.01e-04 | 322.62 ms | 52.3% bf16 MFU | 1624040 tok/s step 585/19560 | loss 5.153560 (-1.56z)| norm 0.9601 (+0.72z)| lr 5.01e-04 | 322.80 ms | 52.3% bf16 MFU | 1624047 tok/s step 586/19560 | loss 5.154264 (-1.53z)| norm 0.7470 (-0.57z)| lr 5.02e-04 | 323.02 ms | 52.2% bf16 MFU | 1623998 tok/s step 587/19560 | loss 5.078993 (-2.14z)| norm 0.6308 (-1.26z)| lr 5.03e-04 | 323.06 ms | 52.2% bf16 MFU | 1623942 tok/s step 588/19560 | loss 5.163724 (-1.39z)| norm 0.8836 (+0.28z)| lr 5.04e-04 | 323.09 ms | 52.2% bf16 MFU | 1623882 tok/s step 589/19560 | loss 5.170393 (-1.31z)| norm 0.9619 (+0.77z)| lr 5.05e-04 | 322.74 ms | 52.3% bf16 MFU | 1623913 tok/s step 590/19560 | loss 5.067248 (-2.16z)| norm 0.6952 (-0.85z)| lr 5.06e-04 | 322.46 ms | 52.3% bf16 MFU | 1624012 tok/s step 591/19560 | loss 5.150664 (-1.45z)| norm 0.8111 (-0.13z)| lr 5.07e-04 | 323.34 ms | 52.2% bf16 MFU | 1623884 tok/s step 592/19560 | loss 5.208635 (-0.92z)| norm 0.7689 (-0.38z)| lr 5.07e-04 | 322.35 ms | 52.4% bf16 MFU | 1624013 tok/s step 593/19560 | loss 5.115358 (-1.73z)| norm 0.6106 (-1.34z)| lr 5.08e-04 | 323.02 ms | 52.2% bf16 MFU | 1623966 tok/s step 594/19560 | loss 5.123180 (-1.63z)| norm 0.6507 (-1.08z)| lr 5.09e-04 | 323.03 ms | 52.2% bf16 MFU | 1623919 tok/s step 595/19560 | loss 5.123184 (-1.60z)| norm 0.7032 (-0.75z)| lr 5.10e-04 | 322.96 ms | 52.3% bf16 MFU | 1623891 tok/s step 596/19560 | loss 5.132890 (-1.49z)| norm 0.7276 (-0.60z)| lr 5.11e-04 | 322.46 ms | 52.3% bf16 MFU | 1623991 tok/s 
step 597/19560 | loss 5.200877 (-0.87z)| norm 0.6916 (-0.81z)| lr 5.12e-04 | 323.59 ms | 52.2% bf16 MFU | 1623803 tok/s step 598/19560 | loss 5.109354 (-1.67z)| norm 0.7024 (-0.74z)| lr 5.13e-04 | 322.33 ms | 52.4% bf16 MFU | 1623940 tok/s step 599/19560 | loss 5.097095 (-1.75z)| norm 0.7637 (-0.36z)| lr 5.13e-04 | 322.58 ms | 52.3% bf16 MFU | 1624009 tok/s step 600/19560 | loss 5.075758 (-1.90z)| norm 0.7418 (-0.49z)| lr 5.14e-04 | 323.54 ms | 52.2% bf16 MFU | 1623832 tok/s step 601/19560 | loss 5.197525 (-0.81z)| norm 0.7172 (-0.63z)| lr 5.15e-04 | 323.35 ms | 52.2% bf16 MFU | 1623711 tok/s step 602/19560 | loss 5.135144 (-1.35z)| norm 0.7520 (-0.40z)| lr 5.16e-04 | 322.95 ms | 52.3% bf16 MFU | 1623699 tok/s step 603/19560 | loss 5.131846 (-1.36z)| norm 0.8895 (+0.47z)| lr 5.17e-04 | 323.13 ms | 52.2% bf16 MFU | 1623640 tok/s step 604/19560 | loss 5.101715 (-1.60z)| norm 0.8449 (+0.18z)| lr 5.18e-04 | 322.29 ms | 52.4% bf16 MFU | 1623795 tok/s step 605/19560 | loss 5.074262 (-1.82z)| norm 0.6858 (-0.82z)| lr 5.19e-04 | 322.85 ms | 52.3% bf16 MFU | 1623801 tok/s step 606/19560 | loss 5.116573 (-1.43z)| norm 0.6640 (-0.95z)| lr 5.19e-04 | 323.40 ms | 52.2% bf16 MFU | 1623670 tok/s step 607/19560 | loss 5.073135 (-1.79z)| norm 0.6640 (-0.94z)| lr 5.20e-04 | 323.92 ms | 52.1% bf16 MFU | 1623416 tok/s step 608/19560 | loss 5.094081 (-1.57z)| norm 0.7195 (-0.58z)| lr 5.21e-04 | 322.62 ms | 52.3% bf16 MFU | 1623500 tok/s step 609/19560 | loss 5.102177 (-1.48z)| norm 0.7199 (-0.57z)| lr 5.22e-04 | 322.69 ms | 52.3% bf16 MFU | 1623561 tok/s step 610/19560 | loss 5.069654 (-1.75z)| norm 0.7280 (-0.52z)| lr 5.23e-04 | 322.83 ms | 52.3% bf16 MFU | 1623584 tok/s step 611/19560 | loss 5.083541 (-1.60z)| norm 0.8477 (+0.23z)| lr 5.24e-04 | 322.98 ms | 52.3% bf16 MFU | 1623568 tok/s step 612/19560 | loss 5.091235 (-1.50z)| norm 0.7500 (-0.37z)| lr 5.25e-04 | 322.98 ms | 52.3% bf16 MFU | 1623555 tok/s step 613/19560 | loss 5.080273 (-1.58z)| norm 0.7085 (-0.62z)| lr 5.25e-04 | 
323.44 ms | 52.2% bf16 MFU | 1623427 tok/s step 614/19560 | loss 5.067672 (-1.68z)| norm 0.6610 (-0.91z)| lr 5.26e-04 | 323.37 ms | 52.2% bf16 MFU | 1623322 tok/s step 615/19560 | loss 5.057717 (-1.74z)| norm 0.6492 (-0.97z)| lr 5.27e-04 | 323.33 ms | 52.2% bf16 MFU | 1623232 tok/s step 616/19560 | loss 5.016659 (-2.08z)| norm 0.6081 (-1.22z)| lr 5.28e-04 | 322.92 ms | 52.3% bf16 MFU | 1623249 tok/s step 617/19560 | loss 5.003469 (-2.15z)| norm 0.6443 (-0.99z)| lr 5.29e-04 | 323.26 ms | 52.2% bf16 MFU | 1623180 tok/s step 618/19560 | loss 5.125568 (-1.03z)| norm 0.7191 (-0.52z)| lr 5.30e-04 | 322.66 ms | 52.3% bf16 MFU | 1623264 tok/s step 619/19560 | loss 5.062023 (-1.59z)| norm 0.8893 (+0.54z)| lr 5.31e-04 | 323.14 ms | 52.2% bf16 MFU | 1623225 tok/s step 620/19560 | loss 5.051396 (-1.66z)| norm 0.9007 (+0.61z)| lr 5.31e-04 | 323.00 ms | 52.3% bf16 MFU | 1623222 tok/s step 621/19560 | loss 5.030831 (-1.81z)| norm 0.6909 (-0.70z)| lr 5.32e-04 | 322.33 ms | 52.4% bf16 MFU | 1623389 tok/s step 622/19560 | loss 4.978292 (-2.24z)| norm 0.6073 (-1.20z)| lr 5.33e-04 | 323.03 ms | 52.2% bf16 MFU | 1623372 tok/s step 623/19560 | loss 4.993465 (-2.06z)| norm 0.5829 (-1.34z)| lr 5.34e-04 | 322.82 ms | 52.3% bf16 MFU | 1623408 tok/s step 624/19560 | loss 5.010024 (-1.88z)| norm 0.6075 (-1.17z)| lr 5.35e-04 | 322.95 ms | 52.3% bf16 MFU | 1623408 tok/s step 625/19560 | loss 5.060742 (-1.41z)| norm 0.6416 (-0.94z)| lr 5.36e-04 | 322.86 ms | 52.3% bf16 MFU | 1623433 tok/s step 626/19560 | loss 5.031288 (-1.64z)| norm 0.7701 (-0.13z)| lr 5.37e-04 | 322.73 ms | 52.3% bf16 MFU | 1623487 tok/s step 627/19560 | loss 5.102491 (-0.99z)| norm 0.9538 (+1.01z)| lr 5.37e-04 | 322.95 ms | 52.3% bf16 MFU | 1623484 tok/s step 628/19560 | loss 5.092971 (-1.07z)| norm 0.9394 (+0.91z)| lr 5.38e-04 | 322.86 ms | 52.3% bf16 MFU | 1623504 tok/s step 629/19560 | loss 5.068263 (-1.29z)| norm 0.9414 (+0.91z)| lr 5.39e-04 | 322.72 ms | 52.3% bf16 MFU | 1623559 tok/s step 630/19560 | loss 5.060086 
(-1.34z)| norm 0.9682 (+1.06z)| lr 5.40e-04 | 322.85 ms | 52.3% bf16 MFU | 1623578 tok/s step 631/19560 | loss 5.093626 (-1.01z)| norm 0.9878 (+1.17z)| lr 5.41e-04 | 323.05 ms | 52.2% bf16 MFU | 1623544 tok/s step 632/19560 | loss 5.079383 (-1.13z)| norm 0.8824 (+0.49z)| lr 5.42e-04 | 322.90 ms | 52.3% bf16 MFU | 1623551 tok/s step 633/19560 | loss 5.052071 (-1.37z)| norm 0.7613 (-0.26z)| lr 5.43e-04 | 322.98 ms | 52.3% bf16 MFU | 1623538 tok/s step 634/19560 | loss 5.123904 (-0.68z)| norm 0.6617 (-0.88z)| lr 5.43e-04 | 322.95 ms | 52.3% bf16 MFU | 1623532 tok/s step 635/19560 | loss 4.997602 (-1.84z)| norm 0.6105 (-1.19z)| lr 5.44e-04 | 322.39 ms | 52.3% bf16 MFU | 1623667 tok/s step 636/19560 | loss 5.064015 (-1.21z)| norm 0.6564 (-0.90z)| lr 5.45e-04 | 322.08 ms | 52.4% bf16 MFU | 1623874 tok/s step 637/19560 | loss 4.984893 (-1.93z)| norm 0.6587 (-0.87z)| lr 5.46e-04 | 322.98 ms | 52.3% bf16 MFU | 1623845 tok/s step 638/19560 | loss 5.020777 (-1.57z)| norm 0.7053 (-0.56z)| lr 5.47e-04 | 322.31 ms | 52.4% bf16 MFU | 1623986 tok/s step 639/19560 | loss 5.058457 (-1.19z)| norm 0.5669 (-1.45z)| lr 5.48e-04 | 321.67 ms | 52.5% bf16 MFU | 1624281 tok/s step 640/19560 | loss 5.027593 (-1.47z)| norm 0.5760 (-1.37z)| lr 5.49e-04 | 322.94 ms | 52.3% bf16 MFU | 1624241 tok/s step 641/19560 | loss 5.004543 (-1.66z)| norm 0.6891 (-0.63z)| lr 5.49e-04 | 323.27 ms | 52.2% bf16 MFU | 1624121 tok/s step 642/19560 | loss 4.970813 (-1.95z)| norm 0.7922 (+0.04z)| lr 5.50e-04 | 322.78 ms | 52.3% bf16 MFU | 1624130 tok/s step 643/19560 | loss 5.007246 (-1.58z)| norm 0.6525 (-0.85z)| lr 5.51e-04 | 323.02 ms | 52.2% bf16 MFU | 1624078 tok/s step 644/19560 | loss 5.031816 (-1.33z)| norm 0.6939 (-0.61z)| lr 5.52e-04 | 322.30 ms | 52.4% bf16 MFU | 1624210 tok/s step 645/19560 | loss 5.025863 (-1.38z)| norm 0.8993 (+0.90z)| lr 5.53e-04 | 322.10 ms | 52.4% bf16 MFU | 1624387 tok/s step 646/19560 | loss 4.972078 (-1.88z)| norm 0.7775 (+0.01z)| lr 5.54e-04 | 322.91 ms | 52.3% bf16 MFU | 
1624348 tok/s step 647/19560 | loss 4.995323 (-1.62z)| norm 0.6490 (-0.92z)| lr 5.55e-04 | 323.18 ms | 52.2% bf16 MFU | 1624245 tok/s step 648/19560 | loss 5.004746 (-1.51z)| norm 0.7387 (-0.25z)| lr 5.55e-04 | 322.64 ms | 52.3% bf16 MFU | 1624281 tok/s step 649/19560 | loss 5.017095 (-1.37z)| norm 0.9046 (+0.96z)| lr 5.56e-04 | 323.11 ms | 52.2% bf16 MFU | 1624198 tok/s step 650/19560 | loss 5.013813 (-1.38z)| norm 0.7492 (-0.17z)| lr 5.57e-04 | 322.26 ms | 52.4% bf16 MFU | 1624334 tok/s step 651/19560 | loss 4.993725 (-1.56z)| norm 0.7661 (-0.03z)| lr 5.58e-04 | 322.56 ms | 52.3% bf16 MFU | 1624387 tok/s step 652/19560 | loss 4.974297 (-1.72z)| norm 1.3601 (+4.19z)| lr 5.59e-04 | 323.34 ms | 52.2% bf16 MFU | 1624240 tok/s step 653/19560 | loss 5.069190 (-0.77z)| norm 0.8768 (+0.74z)| lr 5.60e-04 | 322.58 ms | 52.3% bf16 MFU | 1624294 tok/s step 654/19560 | loss 4.943919 (-1.99z)| norm 0.8171 (+0.31z)| lr 5.61e-04 | 322.97 ms | 52.3% bf16 MFU | 1624246 tok/s step 655/19560 | loss 4.945743 (-1.94z)| norm 0.8294 (+0.40z)| lr 5.61e-04 | 322.86 ms | 52.3% bf16 MFU | 1624228 tok/s step 656/19560 | loss 5.040781 (-0.98z)| norm 1.1191 (+2.38z)| lr 5.62e-04 | 322.74 ms | 52.3% bf16 MFU | 1624241 tok/s step 657/19560 | loss 4.969687 (-1.67z)| norm 0.8635 (+0.59z)| lr 5.63e-04 | 322.68 ms | 52.3% bf16 MFU | 1624267 tok/s step 658/19560 | loss 5.039181 (-0.96z)| norm 0.7252 (-0.39z)| lr 5.64e-04 | 323.26 ms | 52.2% bf16 MFU | 1624148 tok/s step 659/19560 | loss 5.063507 (-0.70z)| norm 0.6768 (-0.73z)| lr 5.65e-04 | 322.81 ms | 52.3% bf16 MFU | 1624147 tok/s step 660/19560 | loss 4.980377 (-1.54z)| norm 0.6036 (-1.24z)| lr 5.66e-04 | 322.13 ms | 52.4% bf16 MFU | 1624318 tok/s step 661/19560 | loss 5.276470 (+1.54z)| norm 0.5489 (-1.59z)| lr 5.67e-04 | 323.31 ms | 52.2% bf16 MFU | 1624182 tok/s step 662/19560 | loss 5.026884 (-1.04z)| norm 0.7274 (-0.34z)| lr 5.67e-04 | 323.38 ms | 52.2% bf16 MFU | 1624038 tok/s step 663/19560 | loss 4.992612 (-1.38z)| norm 0.9472 (+1.20z)| lr 
5.68e-04 | 322.80 ms | 52.3% bf16 MFU | 1624045 tok/s step 664/19560 | loss 5.018263 (-1.09z)| norm 0.9637 (+1.30z)| lr 5.69e-04 | 323.05 ms | 52.2% bf16 MFU | 1623990 tok/s step 665/19560 | loss 5.104635 (-0.18z)| norm 0.9886 (+1.46z)| lr 5.70e-04 | 322.52 ms | 52.3% bf16 MFU | 1624072 tok/s step 666/19560 | loss 5.028912 (-0.96z)| norm 1.1072 (+2.24z)| lr 5.71e-04 | 322.96 ms | 52.3% bf16 MFU | 1624037 tok/s step 667/19560 | loss 5.020986 (-1.03z)| norm 1.0803 (+2.01z)| lr 5.72e-04 | 323.13 ms | 52.2% bf16 MFU | 1623961 tok/s step 668/19560 | loss 4.959846 (-1.66z)| norm 0.8126 (+0.20z)| lr 5.73e-04 | 322.52 ms | 52.3% bf16 MFU | 1624043 tok/s step 669/19560 | loss 5.019249 (-1.01z)| norm 0.8732 (+0.60z)| lr 5.73e-04 | 322.46 ms | 52.3% bf16 MFU | 1624135 tok/s step 670/19560 | loss 4.995850 (-1.24z)| norm 0.8977 (+0.75z)| lr 5.74e-04 | 323.05 ms | 52.2% bf16 MFU | 1624075 tok/s step 671/19560 | loss 5.004800 (-1.13z)| norm 0.9138 (+0.85z)| lr 5.75e-04 | 323.29 ms | 52.2% bf16 MFU | 1623958 tok/s step 672/19560 | loss 5.029220 (-0.85z)| norm 0.9062 (+0.79z)| lr 5.76e-04 | 322.89 ms | 52.3% bf16 MFU | 1623946 tok/s step 673/19560 | loss 5.001417 (-1.14z)| norm 0.8267 (+0.27z)| lr 5.77e-04 | 322.66 ms | 52.3% bf16 MFU | 1623992 tok/s step 674/19560 | loss 5.011970 (-1.01z)| norm 0.7735 (-0.09z)| lr 5.78e-04 | 323.36 ms | 52.2% bf16 MFU | 1623862 tok/s step 675/19560 | loss 4.978489 (-1.35z)| norm 0.6499 (-0.96z)| lr 5.79e-04 | 322.58 ms | 52.3% bf16 MFU | 1623932 tok/s step 676/19560 | loss 4.974800 (-1.39z)| norm 0.7067 (-0.54z)| lr 5.79e-04 | 322.93 ms | 52.3% bf16 MFU | 1623913 tok/s step 677/19560 | loss 5.000626 (-1.09z)| norm 0.7027 (-0.56z)| lr 5.80e-04 | 322.89 ms | 52.3% bf16 MFU | 1623904 tok/s step 678/19560 | loss 4.925295 (-1.90z)| norm 0.6024 (-1.26z)| lr 5.81e-04 | 322.55 ms | 52.3% bf16 MFU | 1623981 tok/s step 679/19560 | loss 4.878237 (-2.39z)| norm 0.5471 (-1.63z)| lr 5.82e-04 | 322.45 ms | 52.3% bf16 MFU | 1624080 tok/s step 680/19560 | loss 
4.988664 (-1.14z)| norm 0.4570 (-2.21z)| lr 5.83e-04 | 322.24 ms | 52.4% bf16 MFU | 1624228 tok/s step 681/19560 | loss 4.897344 (-2.12z)| norm 0.5565 (-1.48z)| lr 5.84e-04 | 323.09 ms | 52.2% bf16 MFU | 1624152 tok/s step 682/19560 | loss 4.926800 (-1.77z)| norm 0.6114 (-1.08z)| lr 5.85e-04 | 322.88 ms | 52.3% bf16 MFU | 1624133 tok/s step 683/19560 | loss 4.894254 (-2.09z)| norm 0.7927 (+0.17z)| lr 5.85e-04 | 323.89 ms | 52.1% bf16 MFU | 1623863 tok/s step 684/19560 | loss 5.002707 (-0.87z)| norm 0.9107 (+0.98z)| lr 5.86e-04 | 322.24 ms | 52.4% bf16 MFU | 1624020 tok/s step 685/19560 | loss 4.993126 (-0.96z)| norm 0.8631 (+0.67z)| lr 5.87e-04 | 322.93 ms | 52.3% bf16 MFU | 1623997 tok/s step 686/19560 | loss 4.872915 (-2.25z)| norm 0.8211 (+0.37z)| lr 5.88e-04 | 323.28 ms | 52.2% bf16 MFU | 1623887 tok/s step 687/19560 | loss 4.909283 (-1.82z)| norm 0.7737 (+0.04z)| lr 5.89e-04 | 323.13 ms | 52.2% bf16 MFU | 1623818 tok/s step 688/19560 | loss 4.967854 (-1.16z)| norm 0.8200 (+0.36z)| lr 5.90e-04 | 322.45 ms | 52.3% bf16 MFU | 1623926 tok/s step 689/19560 | loss 4.874228 (-2.16z)| norm 0.6533 (-0.81z)| lr 5.91e-04 | 322.77 ms | 52.3% bf16 MFU | 1623948 tok/s step 690/19560 | loss 4.892570 (-1.93z)| norm 0.6775 (-0.64z)| lr 5.91e-04 | 322.90 ms | 52.3% bf16 MFU | 1623934 tok/s step 691/19560 | loss 4.980680 (-0.94z)| norm 0.7240 (-0.32z)| lr 5.92e-04 | 322.77 ms | 52.3% bf16 MFU | 1623954 tok/s step 692/19560 | loss 4.958344 (-1.17z)| norm 0.8255 (+0.40z)| lr 5.93e-04 | 322.64 ms | 52.3% bf16 MFU | 1624005 tok/s step 693/19560 | loss 4.879052 (-2.03z)| norm 0.6773 (-0.65z)| lr 5.94e-04 | 323.28 ms | 52.2% bf16 MFU | 1623894 tok/s step 694/19560 | loss 4.937470 (-1.35z)| norm 0.6570 (-0.79z)| lr 5.95e-04 | 322.55 ms | 52.3% bf16 MFU | 1623972 tok/s step 695/19560 | loss 4.904213 (-1.71z)| norm 0.5991 (-1.19z)| lr 5.96e-04 | 323.05 ms | 52.2% bf16 MFU | 1623920 tok/s step 696/19560 | loss 4.871541 (-2.04z)| norm 0.5632 (-1.43z)| lr 5.97e-04 | 322.96 ms | 52.3% bf16 
MFU | 1623894 tok/s step 697/19560 | loss 4.841018 (-2.33z)| norm 0.5918 (-1.21z)| lr 5.97e-04 | 322.82 ms | 52.3% bf16 MFU | 1623904 tok/s step 698/19560 | loss 4.933313 (-1.28z)| norm 0.5891 (-1.22z)| lr 5.98e-04 | 323.25 ms | 52.2% bf16 MFU | 1623805 tok/s step 699/19560 | loss 4.882298 (-1.83z)| norm 0.5889 (-1.20z)| lr 5.99e-04 | 323.41 ms | 52.2% bf16 MFU | 1623670 tok/s step 700/19560 | loss 4.842571 (-2.22z)| norm 0.5787 (-1.26z)| lr 6.00e-04 | 322.64 ms | 52.3% bf16 MFU | 1623737 tok/s step 701/19560 | loss 4.935590 (-1.17z)| norm 0.6714 (-0.58z)| lr 6.00e-04 | 323.00 ms | 52.3% bf16 MFU | 1623710 tok/s step 702/19560 | loss 4.868848 (-1.87z)| norm 0.8171 (+0.46z)| lr 6.00e-04 | 322.69 ms | 52.3% bf16 MFU | 1623762 tok/s step 703/19560 | loss 4.943333 (-1.03z)| norm 0.8086 (+0.39z)| lr 6.00e-04 | 323.04 ms | 52.2% bf16 MFU | 1623723 tok/s step 704/19560 | loss 4.923669 (-1.23z)| norm 0.7288 (-0.19z)| lr 6.00e-04 | 322.58 ms | 52.3% bf16 MFU | 1623801 tok/s step 705/19560 | loss 4.881377 (-1.68z)| norm 0.6616 (-0.67z)| lr 6.00e-04 | 323.10 ms | 52.2% bf16 MFU | 1623744 tok/s step 706/19560 | loss 4.857975 (-1.91z)| norm 0.6513 (-0.73z)| lr 6.00e-04 | 323.50 ms | 52.2% bf16 MFU | 1623590 tok/s step 707/19560 | loss 4.885143 (-1.59z)| norm 0.6598 (-0.66z)| lr 6.00e-04 | 322.70 ms | 52.3% bf16 MFU | 1623644 tok/s step 708/19560 | loss 4.906756 (-1.33z)| norm 0.7385 (-0.10z)| lr 6.00e-04 | 323.25 ms | 52.2% bf16 MFU | 1623558 tok/s step 709/19560 | loss 4.795980 (-2.50z)| norm 0.7501 (-0.02z)| lr 6.00e-04 | 321.98 ms | 52.4% bf16 MFU | 1623796 tok/s step 710/19560 | loss 4.859131 (-1.77z)| norm 0.7155 (-0.28z)| lr 6.00e-04 | 322.97 ms | 52.3% bf16 MFU | 1623772 tok/s step 711/19560 | loss 4.908017 (-1.22z)| norm 0.6768 (-0.55z)| lr 6.00e-04 | 322.31 ms | 52.4% bf16 MFU | 1623917 tok/s step 712/19560 | loss 4.825622 (-2.08z)| norm 0.7052 (-0.34z)| lr 6.00e-04 | 322.65 ms | 52.3% bf16 MFU | 1623969 tok/s step 713/19560 | loss 4.805757 (-2.25z)| norm 0.7129 
(-0.27z)| lr 6.00e-04 | 322.74 ms | 52.3% bf16 MFU | 1623995 tok/s step 714/19560 | loss 4.819892 (-2.05z)| norm 0.7852 (+0.25z)| lr 6.00e-04 | 323.13 ms | 52.2% bf16 MFU | 1623922 tok/s step 715/19560 | loss 4.852966 (-1.66z)| norm 0.8758 (+0.90z)| lr 6.00e-04 | 322.70 ms | 52.3% bf16 MFU | 1623960 tok/s step 716/19560 | loss 4.859012 (-1.57z)| norm 0.9480 (+1.42z)| lr 6.00e-04 | 323.20 ms | 52.2% bf16 MFU | 1623872 tok/s step 717/19560 | loss 4.856486 (-1.57z)| norm 0.8429 (+0.67z)| lr 6.00e-04 | 322.78 ms | 52.3% bf16 MFU | 1623892 tok/s step 718/19560 | loss 4.747184 (-2.65z)| norm 0.7563 (+0.03z)| lr 6.00e-04 | 323.04 ms | 52.2% bf16 MFU | 1623846 tok/s step 719/19560 | loss 4.854129 (-1.51z)| norm 0.6280 (-0.90z)| lr 6.00e-04 | 323.07 ms | 52.2% bf16 MFU | 1623796 tok/s step 720/19560 | loss 4.924534 (-0.76z)| norm 0.6322 (-0.86z)| lr 6.00e-04 | 322.46 ms | 52.3% bf16 MFU | 1623902 tok/s step 721/19560 | loss 4.841887 (-1.61z)| norm 0.6348 (-0.84z)| lr 6.00e-04 | 322.82 ms | 52.3% bf16 MFU | 1623911 tok/s step 722/19560 | loss 4.870138 (-1.29z)| norm 0.6908 (-0.44z)| lr 6.00e-04 | 322.18 ms | 52.4% bf16 MFU | 1624082 tok/s step 723/19560 | loss 4.821965 (-1.77z)| norm 0.7505 (-0.00z)| lr 6.00e-04 | 322.86 ms | 52.3% bf16 MFU | 1624073 tok/s step 724/19560 | loss 4.799098 (-1.97z)| norm 0.7012 (-0.36z)| lr 6.00e-04 | 322.71 ms | 52.3% bf16 MFU | 1624100 tok/s step 725/19560 | loss 4.806869 (-1.87z)| norm 0.5692 (-1.31z)| lr 6.00e-04 | 322.34 ms | 52.4% bf16 MFU | 1624220 tok/s step 726/19560 | loss 4.754896 (-2.35z)| norm 0.5899 (-1.15z)| lr 6.00e-04 | 322.99 ms | 52.3% bf16 MFU | 1624169 tok/s step 727/19560 | loss 4.865260 (-1.19z)| norm 0.6771 (-0.51z)| lr 6.00e-04 | 322.71 ms | 52.3% bf16 MFU | 1624193 tok/s step 728/19560 | loss 4.821895 (-1.61z)| norm 0.7336 (-0.10z)| lr 6.00e-04 | 323.05 ms | 52.2% bf16 MFU | 1624129 tok/s step 729/19560 | loss 4.770300 (-2.11z)| norm 0.7784 (+0.22z)| lr 6.00e-04 | 323.07 ms | 52.2% bf16 MFU | 1624065 tok/s step 
730/19560 | loss 4.790355 (-1.87z)| norm 0.5867 (-1.15z)| lr 6.00e-04 | 322.41 ms | 52.3% bf16 MFU | 1624168 tok/s step 731/19560 | loss 4.799998 (-1.74z)| norm 0.6848 (-0.44z)| lr 6.00e-04 | 322.86 ms | 52.3% bf16 MFU | 1624155 tok/s step 732/19560 | loss 4.793441 (-1.78z)| norm 0.7803 (+0.26z)| lr 6.00e-04 | 323.08 ms | 52.2% bf16 MFU | 1624087 tok/s step 733/19560 | loss 4.854555 (-1.13z)| norm 0.7823 (+0.26z)| lr 6.00e-04 | 322.91 ms | 52.3% bf16 MFU | 1624065 tok/s step 734/19560 | loss 4.720920 (-2.43z)| norm 0.7072 (-0.28z)| lr 6.00e-04 | 322.83 ms | 52.3% bf16 MFU | 1624063 tok/s step 735/19560 | loss 4.779951 (-1.80z)| norm 0.7608 (+0.10z)| lr 6.00e-04 | 322.44 ms | 52.3% bf16 MFU | 1624160 tok/s step 736/19560 | loss 4.766173 (-1.90z)| norm 0.8045 (+0.41z)| lr 6.00e-04 | 322.80 ms | 52.3% bf16 MFU | 1624163 tok/s step 737/19560 | loss 4.793047 (-1.60z)| norm 0.9228 (+1.25z)| lr 6.00e-04 | 322.99 ms | 52.3% bf16 MFU | 1624115 tok/s step 738/19560 | loss 4.761183 (-1.88z)| norm 0.7274 (-0.16z)| lr 6.00e-04 | 323.34 ms | 52.2% bf16 MFU | 1623984 tok/s step 739/19560 | loss 4.824106 (-1.24z)| norm 0.8541 (+0.76z)| lr 6.00e-04 | 322.64 ms | 52.3% bf16 MFU | 1624035 tok/s step 740/19560 | loss 4.771642 (-1.73z)| norm 0.8130 (+0.46z)| lr 6.00e-04 | 322.59 ms | 52.3% bf16 MFU | 1624095 tok/s step 741/19560 | loss 4.793971 (-1.48z)| norm 0.6912 (-0.42z)| lr 6.00e-04 | 322.66 ms | 52.3% bf16 MFU | 1624135 tok/s step 742/19560 | loss 4.758155 (-1.80z)| norm 0.6010 (-1.06z)| lr 6.00e-04 | 322.87 ms | 52.3% bf16 MFU | 1624120 tok/s step 743/19560 | loss 4.757328 (-1.77z)| norm 0.6363 (-0.81z)| lr 6.00e-04 | 322.88 ms | 52.3% bf16 MFU | 1624102 tok/s step 744/19560 | loss 4.754287 (-1.76z)| norm 0.5118 (-1.68z)| lr 6.00e-04 | 323.07 ms | 52.2% bf16 MFU | 1624040 tok/s step 745/19560 | loss 4.766960 (-1.61z)| norm 0.5423 (-1.45z)| lr 6.00e-04 | 322.63 ms | 52.3% bf16 MFU | 1624090 tok/s step 746/19560 | loss 4.747696 (-1.76z)| norm 0.4927 (-1.77z)| lr 6.00e-04 | 323.48 
ms | 52.2% bf16 MFU | 1623925 tok/s step 747/19560 | loss 4.769404 (-1.53z)| norm 0.5390 (-1.42z)| lr 6.00e-04 | 322.64 ms | 52.3% bf16 MFU | 1623978 tok/s step 748/19560 | loss 4.739942 (-1.77z)| norm 0.5319 (-1.44z)| lr 6.00e-04 | 322.78 ms | 52.3% bf16 MFU | 1623994 tok/s step 749/19560 | loss 4.697782 (-2.11z)| norm 0.5500 (-1.30z)| lr 6.00e-04 | 322.51 ms | 52.3% bf16 MFU | 1624076 tok/s step 750/19560 | loss 4.706387 (-1.98z)| norm 0.5522 (-1.28z)| lr 6.00e-04 | 322.68 ms | 52.3% bf16 MFU | 1624111 tok/s val loss 4.731819 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating HellaSwag: 70/79 HellaSwag: 2523/10042 = 0.251245 step 751/19560 | loss 4.752688 (-1.53z)| norm 0.7793 (+0.27z)| lr 6.00e-04 | 322.76 ms | 52.3% bf16 MFU | 1624125 tok/s step 752/19560 | loss 4.767797 (-1.37z)| norm 0.8375 (+0.66z)| lr 6.00e-04 | 323.47 ms | 52.2% bf16 MFU | 1623960 tok/s step 753/19560 | loss 4.823668 (-0.86z)| norm 0.7592 (+0.11z)| lr 6.00e-04 | 323.22 ms | 52.2% bf16 MFU | 1623867 tok/s step 754/19560 | loss 4.769236 (-1.32z)| norm 0.6913 (-0.35z)| lr 6.00e-04 | 322.92 ms | 52.3% bf16 MFU | 1623852 tok/s step 755/19560 | loss 4.733784 (-1.62z)| norm 0.5884 (-1.05z)| lr 6.00e-04 | 322.91 ms | 52.3% bf16 MFU | 1623841 tok/s step 756/19560 | loss 4.714999 (-1.76z)| norm 0.5645 (-1.20z)| lr 6.00e-04 | 323.86 ms | 52.1% bf16 MFU | 1623594 tok/s step 757/19560 | loss 4.746745 (-1.45z)| norm 0.5690 (-1.15z)| lr 6.00e-04 | 322.44 ms | 52.3% bf16 MFU | 1623713 tok/s step 758/19560 | loss 4.846592 (-0.54z)| norm 0.6310 (-0.70z)| lr 6.00e-04 | 322.55 ms | 52.3% bf16 MFU | 1623799 tok/s step 759/19560 | loss 4.756117 (-1.34z)| norm 0.9392 (+1.48z)| lr 6.00e-04 | 323.39 ms | 52.2% bf16 MFU | 1623671 tok/s step 760/19560 | loss 4.679823 (-1.99z)| norm 0.7017 (-0.19z)| lr 6.00e-04 | 322.76 ms | 52.3% bf16 MFU | 1623708 tok/s step 
761/19560 | loss 4.685313 (-1.90z)| norm 0.6757 (-0.37z)| lr 6.00e-04 | 322.90 ms | 52.3% bf16 MFU | 1623706 tok/s step 762/19560 | loss 4.701752 (-1.74z)| norm 0.5762 (-1.07z)| lr 6.00e-04 | 323.83 ms | 52.1% bf16 MFU | 1623472 tok/s step 763/19560 | loss 4.740846 (-1.36z)| norm 0.5438 (-1.29z)| lr 6.00e-04 | 322.14 ms | 52.4% bf16 MFU | 1623675 tok/s step 764/19560 | loss 4.641892 (-2.20z)| norm 0.5347 (-1.34z)| lr 6.00e-04 | 323.30 ms | 52.2% bf16 MFU | 1623576 tok/s step 765/19560 | loss 4.689033 (-1.74z)| norm 0.6166 (-0.76z)| lr 6.00e-04 | 323.15 ms | 52.2% bf16 MFU | 1623518 tok/s step 766/19560 | loss 4.729656 (-1.36z)| norm 0.7154 (-0.07z)| lr 6.00e-04 | 322.78 ms | 52.3% bf16 MFU | 1623556 tok/s step 767/19560 | loss 4.718853 (-1.43z)| norm 0.9485 (+1.53z)| lr 6.00e-04 | 322.52 ms | 52.3% bf16 MFU | 1623658 tok/s step 768/19560 | loss 4.701714 (-1.56z)| norm 0.9646 (+1.61z)| lr 6.00e-04 | 322.93 ms | 52.3% bf16 MFU | 1623653 tok/s step 769/19560 | loss 4.744864 (-1.16z)| norm 0.8986 (+1.14z)| lr 6.00e-04 | 323.15 ms | 52.2% bf16 MFU | 1623591 tok/s step 770/19560 | loss 4.725748 (-1.31z)| norm 0.8321 (+0.68z)| lr 6.00e-04 | 323.53 ms | 52.2% bf16 MFU | 1623438 tok/s step 771/19560 | loss 4.677504 (-1.69z)| norm 0.5429 (-1.30z)| lr 6.00e-04 | 323.25 ms | 52.2% bf16 MFU | 1623362 tok/s step 772/19560 | loss 4.727492 (-1.24z)| norm 0.5423 (-1.29z)| lr 6.00e-04 | 322.86 ms | 52.3% bf16 MFU | 1623389 tok/s step 773/19560 | loss 4.650577 (-1.87z)| norm 0.5354 (-1.32z)| lr 6.00e-04 | 322.88 ms | 52.3% bf16 MFU | 1623409 tok/s step 774/19560 | loss 4.737201 (-1.11z)| norm 0.6554 (-0.49z)| lr 6.00e-04 | 323.18 ms | 52.2% bf16 MFU | 1623353 tok/s step 775/19560 | loss 4.727294 (-1.17z)| norm 0.6655 (-0.43z)| lr 6.00e-04 | 323.03 ms | 52.2% bf16 MFU | 1623336 tok/s step 776/19560 | loss 4.681262 (-1.54z)| norm 0.6185 (-0.74z)| lr 6.00e-04 | 322.76 ms | 52.3% bf16 MFU | 1623388 tok/s step 777/19560 | loss 4.697464 (-1.38z)| norm 0.6033 (-0.83z)| lr 6.00e-04 | 322.91 
ms | 52.3% bf16 MFU | 1623400 tok/s step 778/19560 | loss 4.758095 (-0.85z)| norm 0.6096 (-0.77z)| lr 6.00e-04 | 323.04 ms | 52.2% bf16 MFU | 1623377 tok/s step 779/19560 | loss 4.692084 (-1.39z)| norm 0.6559 (-0.45z)| lr 6.00e-04 | 323.30 ms | 52.2% bf16 MFU | 1623291 tok/s step 780/19560 | loss 4.683318 (-1.44z)| norm 0.7188 (+0.01z)| lr 6.00e-04 | 322.66 ms | 52.3% bf16 MFU | 1623372 tok/s step 781/19560 | loss 4.644392 (-1.75z)| norm 0.6525 (-0.47z)| lr 6.00e-04 | 322.75 ms | 52.3% bf16 MFU | 1623424 tok/s step 782/19560 | loss 4.671682 (-1.49z)| norm 0.6190 (-0.71z)| lr 6.00e-04 | 323.30 ms | 52.2% bf16 MFU | 1623337 tok/s step 783/19560 | loss 4.646279 (-1.67z)| norm 0.8445 (+0.96z)| lr 6.00e-04 | 323.00 ms | 52.3% bf16 MFU | 1623331 tok/s step 784/19560 | loss 4.667816 (-1.47z)| norm 0.6836 (-0.21z)| lr 6.00e-04 | 323.15 ms | 52.2% bf16 MFU | 1623284 tok/s step 785/19560 | loss 4.668583 (-1.44z)| norm 0.7206 (+0.08z)| lr 6.00e-04 | 322.81 ms | 52.3% bf16 MFU | 1623328 tok/s step 786/19560 | loss 4.607783 (-1.91z)| norm 0.6913 (-0.14z)| lr 6.00e-04 | 322.96 ms | 52.3% bf16 MFU | 1623331 tok/s step 787/19560 | loss 4.635625 (-1.66z)| norm 0.5621 (-1.12z)| lr 6.00e-04 | 322.86 ms | 52.3% bf16 MFU | 1623360 tok/s step 788/19560 | loss 4.681964 (-1.25z)| norm 0.6446 (-0.50z)| lr 6.00e-04 | 323.37 ms | 52.2% bf16 MFU | 1623258 tok/s step 789/19560 | loss 4.638859 (-1.65z)| norm 0.7375 (+0.21z)| lr 6.00e-04 | 323.40 ms | 52.2% bf16 MFU | 1623155 tok/s step 790/19560 | loss 4.685817 (-1.22z)| norm 0.5858 (-0.95z)| lr 6.00e-04 | 323.51 ms | 52.2% bf16 MFU | 1623029 tok/s step 791/19560 | loss 4.700580 (-1.07z)| norm 0.4689 (-1.82z)| lr 6.00e-04 | 323.18 ms | 52.2% bf16 MFU | 1622991 tok/s step 792/19560 | loss 4.611324 (-1.84z)| norm 0.6815 (-0.17z)| lr 6.00e-04 | 323.25 ms | 52.2% bf16 MFU | 1622938 tok/s step 793/19560 | loss 4.677313 (-1.24z)| norm 0.8066 (+0.83z)| lr 6.00e-04 | 323.42 ms | 52.2% bf16 MFU | 1622846 tok/s step 794/19560 | loss 4.616054 (-1.78z)| 
norm 0.8201 (+0.99z)| lr 6.00e-04 | 322.93 ms | 52.3% bf16 MFU | 1622881 tok/s step 795/19560 | loss 4.669495 (-1.27z)| norm 0.6950 (-0.02z)| lr 6.00e-04 | 323.82 ms | 52.1% bf16 MFU | 1622690 tok/s step 796/19560 | loss 4.689667 (-1.07z)| norm 0.8209 (+1.06z)| lr 6.00e-04 | 322.98 ms | 52.3% bf16 MFU | 1622719 tok/s step 797/19560 | loss 4.633152 (-1.58z)| norm 0.8728 (+1.50z)| lr 6.00e-04 | 323.21 ms | 52.2% bf16 MFU | 1622691 tok/s step 798/19560 | loss 4.706203 (-0.88z)| norm 0.9022 (+1.76z)| lr 6.00e-04 | 322.19 ms | 52.4% bf16 MFU | 1622920 tok/s step 799/19560 | loss 4.655054 (-1.35z)| norm 0.9404 (+2.07z)| lr 6.00e-04 | 323.19 ms | 52.2% bf16 MFU | 1622885 tok/s step 800/19560 | loss 4.616530 (-1.70z)| norm 0.7385 (+0.37z)| lr 6.00e-04 | 323.12 ms | 52.2% bf16 MFU | 1622869 tok/s step 801/19560 | loss 4.653017 (-1.33z)| norm 0.6143 (-0.69z)| lr 6.00e-04 | 323.18 ms | 52.2% bf16 MFU | 1622840 tok/s step 802/19560 | loss 4.598925 (-1.83z)| norm 0.5159 (-1.52z)| lr 6.00e-04 | 322.68 ms | 52.3% bf16 MFU | 1622938 tok/s step 803/19560 | loss 4.596166 (-1.83z)| norm 0.4487 (-2.05z)| lr 6.00e-04 | 322.65 ms | 52.3% bf16 MFU | 1623038 tok/s step 804/19560 | loss 4.736272 (-0.46z)| norm 0.4405 (-2.07z)| lr 6.00e-04 | 323.12 ms | 52.2% bf16 MFU | 1623016 tok/s step 805/19560 | loss 4.605661 (-1.72z)| norm 0.4042 (-2.30z)| lr 6.00e-04 | 323.22 ms | 52.2% bf16 MFU | 1622970 tok/s step 806/19560 | loss 4.608259 (-1.67z)| norm 0.4162 (-2.15z)| lr 6.00e-04 | 323.42 ms | 52.2% bf16 MFU | 1622875 tok/s step 807/19560 | loss 4.612420 (-1.60z)| norm 0.4671 (-1.73z)| lr 6.00e-04 | 322.95 ms | 52.3% bf16 MFU | 1622904 tok/s step 808/19560 | loss 4.615067 (-1.55z)| norm 0.5502 (-1.08z)| lr 6.00e-04 | 322.32 ms | 52.4% bf16 MFU | 1623089 tok/s step 809/19560 | loss 4.611504 (-1.56z)| norm 0.7799 (+0.75z)| lr 6.00e-04 | 323.46 ms | 52.2% bf16 MFU | 1622979 tok/s step 810/19560 | loss 4.583096 (-1.81z)| norm 0.8123 (+0.99z)| lr 6.00e-04 | 322.67 ms | 52.3% bf16 MFU | 1623072 tok/s 
step 811/19560 | loss 4.577732 (-1.82z)| norm 0.7363 (+0.39z)| lr 6.00e-04 | 323.19 ms | 52.2% bf16 MFU | 1623030 tok/s step 812/19560 | loss 4.601267 (-1.58z)| norm 0.5560 (-1.05z)| lr 6.00e-04 | 323.29 ms | 52.2% bf16 MFU | 1622965 tok/s step 813/19560 | loss 4.647179 (-1.12z)| norm 0.5833 (-0.81z)| lr 6.00e-04 | 322.97 ms | 52.3% bf16 MFU | 1622984 tok/s step 814/19560 | loss 4.625286 (-1.32z)| norm 0.5765 (-0.85z)| lr 6.00e-04 | 323.42 ms | 52.2% bf16 MFU | 1622890 tok/s step 815/19560 | loss 4.554557 (-1.98z)| norm 0.6459 (-0.28z)| lr 6.00e-04 | 322.31 ms | 52.4% bf16 MFU | 1623077 tok/s step 816/19560 | loss 4.628315 (-1.24z)| norm 0.6117 (-0.55z)| lr 6.00e-04 | 322.09 ms | 52.4% bf16 MFU | 1623311 tok/s step 817/19560 | loss 4.581597 (-1.68z)| norm 0.4984 (-1.46z)| lr 6.00e-04 | 322.93 ms | 52.3% bf16 MFU | 1623323 tok/s step 818/19560 | loss 4.583045 (-1.63z)| norm 0.4709 (-1.65z)| lr 6.00e-04 | 322.64 ms | 52.3% bf16 MFU | 1623406 tok/s step 819/19560 | loss 4.599386 (-1.46z)| norm 0.4307 (-1.93z)| lr 6.00e-04 | 322.93 ms | 52.3% bf16 MFU | 1623413 tok/s step 820/19560 | loss 4.586987 (-1.57z)| norm 0.3871 (-2.22z)| lr 6.00e-04 | 322.15 ms | 52.4% bf16 MFU | 1623615 tok/s step 821/19560 | loss 4.530159 (-2.10z)| norm 0.4596 (-1.62z)| lr 6.00e-04 | 322.11 ms | 52.4% bf16 MFU | 1623817 tok/s step 822/19560 | loss 4.556902 (-1.80z)| norm 0.4253 (-1.84z)| lr 6.00e-04 | 322.66 ms | 52.3% bf16 MFU | 1623872 tok/s step 823/19560 | loss 4.564120 (-1.70z)| norm 0.4978 (-1.28z)| lr 6.00e-04 | 323.06 ms | 52.2% bf16 MFU | 1623822 tok/s step 824/19560 | loss 4.547585 (-1.84z)| norm 0.5812 (-0.64z)| lr 6.00e-04 | 322.97 ms | 52.3% bf16 MFU | 1623797 tok/s step 825/19560 | loss 4.557037 (-1.71z)| norm 0.6368 (-0.22z)| lr 6.00e-04 | 322.69 ms | 52.3% bf16 MFU | 1623844 tok/s step 826/19560 | loss 4.563113 (-1.63z)| norm 0.6022 (-0.49z)| lr 6.00e-04 | 322.94 ms | 52.3% bf16 MFU | 1623826 tok/s step 827/19560 | loss 4.651984 (-0.73z)| norm 0.5418 (-0.94z)| lr 6.00e-04 | 
322.91 ms | 52.3% bf16 MFU | 1623817 tok/s step 828/19560 | loss 4.569741 (-1.53z)| norm 0.6107 (-0.42z)| lr 6.00e-04 | 322.86 ms | 52.3% bf16 MFU | 1623820 tok/s step 829/19560 | loss 4.549089 (-1.72z)| norm 0.5649 (-0.76z)| lr 6.00e-04 | 323.55 ms | 52.2% bf16 MFU | 1623649 tok/s step 830/19560 | loss 4.525521 (-1.92z)| norm 0.5604 (-0.78z)| lr 6.00e-04 | 322.44 ms | 52.3% bf16 MFU | 1623766 tok/s step 831/19560 | loss 4.571943 (-1.44z)| norm 0.5655 (-0.73z)| lr 6.00e-04 | 322.78 ms | 52.3% bf16 MFU | 1623794 tok/s step 832/19560 | loss 4.590589 (-1.24z)| norm 0.5755 (-0.65z)| lr 6.00e-04 | 322.84 ms | 52.3% bf16 MFU | 1623804 tok/s step 833/19560 | loss 4.621037 (-0.91z)| norm 0.5677 (-0.70z)| lr 6.00e-04 | 322.52 ms | 52.3% bf16 MFU | 1623892 tok/s step 834/19560 | loss 4.571955 (-1.40z)| norm 0.5425 (-0.88z)| lr 6.00e-04 | 322.80 ms | 52.3% bf16 MFU | 1623906 tok/s step 835/19560 | loss 4.559008 (-1.52z)| norm 0.6194 (-0.30z)| lr 6.00e-04 | 322.75 ms | 52.3% bf16 MFU | 1623933 tok/s step 836/19560 | loss 4.592214 (-1.16z)| norm 0.6047 (-0.40z)| lr 6.00e-04 | 323.24 ms | 52.2% bf16 MFU | 1623836 tok/s step 837/19560 | loss 4.509272 (-1.99z)| norm 0.5429 (-0.86z)| lr 6.00e-04 | 322.75 ms | 52.3% bf16 MFU | 1623866 tok/s step 838/19560 | loss 4.550372 (-1.54z)| norm 0.5858 (-0.52z)| lr 6.00e-04 | 322.82 ms | 52.3% bf16 MFU | 1623876 tok/s step 839/19560 | loss 4.723170 (+0.30z)| norm 0.6180 (-0.27z)| lr 6.00e-04 | 322.91 ms | 52.3% bf16 MFU | 1623865 tok/s step 840/19560 | loss 4.552167 (-1.51z)| norm 0.7232 (+0.52z)| lr 6.00e-04 | 323.10 ms | 52.2% bf16 MFU | 1623804 tok/s step 841/19560 | loss 4.554356 (-1.46z)| norm 0.5647 (-0.67z)| lr 6.00e-04 | 322.33 ms | 52.4% bf16 MFU | 1623942 tok/s step 842/19560 | loss 4.614858 (-0.80z)| norm 0.4958 (-1.17z)| lr 6.00e-04 | 322.51 ms | 52.3% bf16 MFU | 1624028 tok/s step 843/19560 | loss 4.535882 (-1.62z)| norm 0.4649 (-1.39z)| lr 6.00e-04 | 322.79 ms | 52.3% bf16 MFU | 1624038 tok/s step 844/19560 | loss 4.492441 
(-2.05z)| norm 0.5210 (-0.95z)| lr 6.00e-04 | 322.58 ms | 52.3% bf16 MFU | 1624101 tok/s step 845/19560 | loss 4.573414 (-1.17z)| norm 0.6421 (-0.01z)| lr 6.00e-04 | 322.98 ms | 52.3% bf16 MFU | 1624060 tok/s step 846/19560 | loss 4.543926 (-1.46z)| norm 0.6277 (-0.11z)| lr 6.00e-04 | 322.81 ms | 52.3% bf16 MFU | 1624064 tok/s step 847/19560 | loss 4.516595 (-1.73z)| norm 0.5525 (-0.69z)| lr 6.00e-04 | 323.03 ms | 52.2% bf16 MFU | 1624013 tok/s step 848/19560 | loss 4.525113 (-1.64z)| norm 0.5189 (-0.95z)| lr 6.00e-04 | 322.80 ms | 52.3% bf16 MFU | 1624021 tok/s step 849/19560 | loss 4.548756 (-1.36z)| norm 0.4917 (-1.14z)| lr 6.00e-04 | 323.13 ms | 52.2% bf16 MFU | 1623946 tok/s step 850/19560 | loss 4.456058 (-2.34z)| norm 0.5447 (-0.72z)| lr 6.00e-04 | 322.80 ms | 52.3% bf16 MFU | 1623957 tok/s step 851/19560 | loss 4.556492 (-1.22z)| norm 0.5563 (-0.62z)| lr 6.00e-04 | 322.55 ms | 52.3% bf16 MFU | 1624031 tok/s step 852/19560 | loss 4.470996 (-2.12z)| norm 0.5314 (-0.80z)| lr 6.00e-04 | 322.69 ms | 52.3% bf16 MFU | 1624067 tok/s step 853/19560 | loss 4.521953 (-1.53z)| norm 0.6179 (-0.14z)| lr 6.00e-04 | 323.88 ms | 52.1% bf16 MFU | 1623802 tok/s step 854/19560 | loss 4.503319 (-1.70z)| norm 0.6026 (-0.26z)| lr 6.00e-04 | 322.50 ms | 52.3% bf16 MFU | 1623898 tok/s step 855/19560 | loss 4.516375 (-1.55z)| norm 0.5844 (-0.39z)| lr 6.00e-04 | 322.92 ms | 52.3% bf16 MFU | 1623881 tok/s step 856/19560 | loss 4.477001 (-1.95z)| norm 0.5539 (-0.62z)| lr 6.00e-04 | 322.76 ms | 52.3% bf16 MFU | 1623905 tok/s step 857/19560 | loss 4.524323 (-1.41z)| norm 0.5265 (-0.82z)| lr 6.00e-04 | 322.87 ms | 52.3% bf16 MFU | 1623902 tok/s step 858/19560 | loss 4.509416 (-1.54z)| norm 0.5588 (-0.56z)| lr 6.00e-04 | 322.34 ms | 52.4% bf16 MFU | 1624032 tok/s step 859/19560 | loss 4.605366 (-0.48z)| norm 0.5723 (-0.45z)| lr 6.00e-04 | 323.07 ms | 52.2% bf16 MFU | 1623972 tok/s step 860/19560 | loss 4.533696 (-1.26z)| norm 0.5541 (-0.58z)| lr 6.00e-04 | 322.47 ms | 52.3% bf16 MFU | 
1624066 tok/s
step 861/19560 | loss 4.560771 (-0.95z)| norm 0.5803 (-0.37z)| lr 6.00e-04 | 323.10 ms | 52.2% bf16 MFU | 1623997 tok/s
step 862/19560 | loss 4.456868 (-2.07z)| norm 0.4696 (-1.21z)| lr 6.00e-04 | 322.94 ms | 52.3% bf16 MFU | 1623971 tok/s
step 863/19560 | loss 4.580936 (-0.68z)| norm 0.4137 (-1.62z)| lr 6.00e-04 | 322.36 ms | 52.4% bf16 MFU | 1624093 tok/s
step 864/19560 | loss 4.527455 (-1.25z)| norm 0.4502 (-1.32z)| lr 6.00e-04 | 323.05 ms | 52.2% bf16 MFU | 1624036 tok/s
step 865/19560 | loss 4.507093 (-1.46z)| norm 0.4017 (-1.68z)| lr 6.00e-04 | 323.04 ms | 52.2% bf16 MFU | 1623984 tok/s
step 866/19560 | loss 4.500328 (-1.51z)| norm 0.4538 (-1.25z)| lr 6.00e-04 | 322.45 ms | 52.3% bf16 MFU | 1624082 tok/s
step 867/19560 | loss 4.565885 (-0.77z)| norm 0.5075 (-0.82z)| lr 6.00e-04 | 322.44 ms | 52.3% bf16 MFU | 1624178 tok/s
step 868/19560 | loss 4.471168 (-1.82z)| norm 0.6120 (+0.02z)| lr 6.00e-04 | 323.01 ms | 52.2% bf16 MFU | 1624125 tok/s
start Wed Nov 20 05:52:53 UTC 2024
+-----------------------+----------------------------------------------------+
| Parameter             | Value                                              |
+-----------------------+----------------------------------------------------+
| train data pattern    | dev/data/fineweb10B/fineweb_train_*.bin            |
| val data pattern      | dev/data/fineweb10B/fineweb_val_*.bin              |
| output log dir        | log124M                                            |
| checkpoint_every      | 5000                                               |
| resume                | 0                                                  |
| micro batch size B    | 64                                                 |
| sequence length T     | 1024                                               |
| total batch size      | 524288                                             |
| LR scheduler          | cosine                                             |
| learning rate (LR)    | 6.000000e-04                                       |
| warmup iterations     | 700                                                |
| final LR fraction     | 0.000000e+00                                       |
| weight decay          | 1.000000e-01                                       |
| skip update lossz     | 0.000000                                           |
| skip update gradz     | 0.000000                                           |
| max_steps             | -1                                                 |
| val_loss_every        | 250                                                |
| val_max_steps         | 20                                                 |
| sample_every          | 20000                                              |
| genT                  | 64                                                 |
| overfit_single_batch  | 0                                                  |
| use_master_weights    | enabled                                            |
| gelu_fusion           | 0                                                  |
| recompute             | 1                                                  |
+-----------------------+----------------------------------------------------+
| device                | NVIDIA A100-SXM4-40GB                              |
| peak TFlops           | 312.0                                              |
| precision             | BF16                                               |
+-----------------------+----------------------------------------------------+
---
WARNING: Failed to open the tokenizer file gpt2_tokenizer.bin
The Tokenizer is a new feature added April 14 2024.
Re-run `python train_gpt2.py` to write it
---
---
WARNING: Failed to open the tokenizer file gpt2_tokenizer.bin
The Tokenizer is a new feature added April 14 2024.
Re-run `python train_gpt2.py` to write it
---
---
WARNING: Failed to open the tokenizer file gpt2_tokenizer.bin
The Tokenizer is a new feature added April 14 2024.
Re-run `python train_gpt2.py` to write it
---
---
WARNING: Failed to open the tokenizer file gpt2_tokenizer.bin
The Tokenizer is a new feature added April 14 2024.
Re-run `python train_gpt2.py` to write it
---
---
WARNING: Failed to open the tokenizer file gpt2_tokenizer.bin
The Tokenizer is a new feature added April 14 2024.
Re-run `python train_gpt2.py` to write it
---
---
WARNING: Failed to open the tokenizer file gpt2_tokenizer.bin
The Tokenizer is a new feature added April 14 2024.
Re-run `python train_gpt2.py` to write it
---
| weight init method    | d12                                                |
| max_sequence_length T | 1024                                               |
| vocab_size V          | 50257                                              |
| padded_vocab_size Vp  | 50304                                              |
| num_layers L          | 12                                                 |
| num_heads NH          | 12                                                 |
| channels C            | 768                                                |
| num_parameters        | 124475904                                          |
+-----------------------+----------------------------------------------------+
| train_num_batches     | 19560                                              |
| val_num_batches       | 20                                                 |
+-----------------------+----------------------------------------------------+
| run hellaswag         | yes                                                |
+-----------------------+----------------------------------------------------+
| num_processes         | 8                                                  |
| zero_stage            | 1                                                  |
+-----------------------+----------------------------------------------------+
num_parameters: 124475904 => bytes: 248951808
allocated 237 MiB for model parameters
batch_size B=64 * seq_len T=1024 * num_processes=8 and total_batch_size=524288
=> setting grad_accum_steps=1
---
WARNING: Failed to open the tokenizer file gpt2_tokenizer.bin
The Tokenizer is a new feature added April 14 2024.
Re-run `python train_gpt2.py` to write it
---
allocating 237 MiB for parameter gradients
allocating 21216 MiB for activations
---
WARNING: Failed to open the tokenizer file gpt2_tokenizer.bin
The Tokenizer is a new feature added April 14 2024.
Re-run `python train_gpt2.py` to write it
---
allocating 59 MiB for AdamW optimizer state m
allocating 59 MiB for AdamW optimizer state v
allocating 59 MiB for master copy of params
device memory usage: 23129 MiB / 40326 MiB
memory per sequence: 331 MiB
 -> estimated maximum batch size: 115
val loss 11.009204
step 1/19560 | loss 11.009109 (+nanz)| norm 15.3717 (+nanz)| lr 8.57e-07 | 694.62 ms | 24.3% bf16 MFU | 754784 tok/s
step 2/19560 | loss 10.959906 (+nanz)| norm 15.2377 (+nanz)| lr 1.71e-06 | 314.81 ms | 53.6% bf16 MFU | 1665435 tok/s
step 3/19560 | loss 10.855613 (+nanz)| norm 14.8389 (+nanz)| lr 2.57e-06 | 315.55 ms | 53.5% bf16 MFU | 1663410 tok/s
step 4/19560 | loss 10.715732 (+nanz)| norm 13.1195 (+nanz)| lr 3.43e-06 | 315.63 ms | 53.5% bf16 MFU | 1662603 tok/s
step 5/19560 | loss 10.568424 (+nanz)| norm 10.4405 (+nanz)| lr 4.29e-06 | 314.80 ms | 53.6% bf16 MFU | 1663372 tok/s
step 6/19560 | loss 10.423451 (+nanz)| norm 8.4943 (+nanz)| lr 5.14e-06 | 316.25 ms | 53.4% bf16 MFU | 1662141 tok/s
step 7/19560 | loss 10.291803 (+nanz)| norm 7.2415 (+nanz)| lr 6.00e-06 | 314.33 ms | 53.7% bf16 MFU | 1663243 tok/s
step 8/19560 | loss 10.187215 (+nanz)| norm 6.1835 (+nanz)| lr 6.86e-06 | 314.61 ms | 53.6% bf16 MFU | 1663778 tok/s
step 9/19560 | loss 10.096741 (+nanz)| norm 5.2947 (+nanz)| lr 7.71e-06 | 314.69 ms | 53.6% bf16 MFU | 1664113 tok/s
step 10/19560 | loss 9.976251 (+nanz)| norm 4.6635 (+nanz)| lr 8.57e-06 | 314.25 ms | 53.7% bf16 MFU | 1664688 tok/s
step 11/19560 | loss 9.935358 (+nanz)| norm 3.8501 (+nanz)| lr 9.43e-06 | 314.42 ms | 53.7% bf16 MFU | 1665038 tok/s
step 12/19560 | loss 9.853889 (+nanz)| norm 3.3980 (+nanz)| lr 1.03e-05 | 314.53 ms | 53.7% bf16 MFU | 1665252 tok/s
step 13/19560 | loss 9.792458 (+nanz)| norm 3.0057 (+nanz)| lr 1.11e-05 | 313.95 ms | 53.8% bf16 MFU | 1665768 tok/s
step 14/19560 | loss 9.758890 (+nanz)| norm 2.6957 (+nanz)| lr 1.20e-05 | 314.46 ms | 53.7% bf16 MFU | 1665920 tok/s
step 15/19560 | loss 9.702516 (+nanz)| norm
2.5212 (+nanz)| lr 1.29e-05 | 314.34 ms | 53.7% bf16 MFU | 1666115 tok/s step 16/19560 | loss 9.701745 (+nanz)| norm 2.3030 (+nanz)| lr 1.37e-05 | 314.66 ms | 53.6% bf16 MFU | 1666125 tok/s step 17/19560 | loss 9.655603 (+nanz)| norm 2.2412 (+nanz)| lr 1.46e-05 | 313.97 ms | 53.8% bf16 MFU | 1666458 tok/s step 18/19560 | loss 9.636615 (+nanz)| norm 2.1886 (+nanz)| lr 1.54e-05 | 314.98 ms | 53.6% bf16 MFU | 1666292 tok/s step 19/19560 | loss 9.603565 (+nanz)| norm 2.2160 (+nanz)| lr 1.63e-05 | 314.25 ms | 53.7% bf16 MFU | 1666464 tok/s step 20/19560 | loss 9.596481 (+nanz)| norm 2.1544 (+nanz)| lr 1.71e-05 | 314.16 ms | 53.7% bf16 MFU | 1666656 tok/s step 21/19560 | loss 9.562174 (+nanz)| norm 2.1811 (+nanz)| lr 1.80e-05 | 314.54 ms | 53.7% bf16 MFU | 1666672 tok/s step 22/19560 | loss 9.537735 (+nanz)| norm 2.1489 (+nanz)| lr 1.89e-05 | 313.90 ms | 53.8% bf16 MFU | 1666944 tok/s step 23/19560 | loss 9.533234 (+nanz)| norm 2.0896 (+nanz)| lr 1.97e-05 | 314.80 ms | 53.6% bf16 MFU | 1666835 tok/s step 24/19560 | loss 9.476656 (+nanz)| norm 2.1037 (+nanz)| lr 2.06e-05 | 314.22 ms | 53.7% bf16 MFU | 1666959 tok/s step 25/19560 | loss 9.458328 (+nanz)| norm 2.0526 (+nanz)| lr 2.14e-05 | 314.46 ms | 53.7% bf16 MFU | 1666982 tok/s step 26/19560 | loss 9.422851 (+nanz)| norm 2.0944 (+nanz)| lr 2.23e-05 | 315.15 ms | 53.6% bf16 MFU | 1666748 tok/s step 27/19560 | loss 9.377934 (+nanz)| norm 2.1533 (+nanz)| lr 2.31e-05 | 314.62 ms | 53.6% bf16 MFU | 1666724 tok/s step 28/19560 | loss 9.388617 (+nanz)| norm 1.9649 (+nanz)| lr 2.40e-05 | 315.59 ms | 53.5% bf16 MFU | 1666363 tok/s step 29/19560 | loss 9.357564 (+nanz)| norm 2.4482 (+nanz)| lr 2.49e-05 | 314.29 ms | 53.7% bf16 MFU | 1666483 tok/s step 30/19560 | loss 9.306392 (+nanz)| norm 1.9512 (+nanz)| lr 2.57e-05 | 315.52 ms | 53.5% bf16 MFU | 1666172 tok/s step 31/19560 | loss 9.244952 (+nanz)| norm 2.2942 (+nanz)| lr 2.66e-05 | 315.59 ms | 53.5% bf16 MFU | 1665862 tok/s step 32/19560 | loss 9.211153 (+nanz)| norm 2.2607 
(+nanz)| lr 2.74e-05 | 315.30 ms | 53.5% bf16 MFU | 1665672 tok/s step 33/19560 | loss 9.235577 (+nanz)| norm 2.5835 (+nanz)| lr 2.83e-05 | 315.60 ms | 53.5% bf16 MFU | 1665396 tok/s step 34/19560 | loss 9.159492 (+nanz)| norm 2.4417 (+nanz)| lr 2.91e-05 | 315.31 ms | 53.5% bf16 MFU | 1665234 tok/s step 35/19560 | loss 9.200701 (+nanz)| norm 1.9539 (+nanz)| lr 3.00e-05 | 315.66 ms | 53.5% bf16 MFU | 1664972 tok/s step 36/19560 | loss 9.080112 (+nanz)| norm 2.4098 (+nanz)| lr 3.09e-05 | 315.95 ms | 53.4% bf16 MFU | 1664639 tok/s step 37/19560 | loss 9.029287 (+nanz)| norm 1.8690 (+nanz)| lr 3.17e-05 | 315.30 ms | 53.5% bf16 MFU | 1664530 tok/s step 38/19560 | loss 9.039859 (+nanz)| norm 1.9751 (+nanz)| lr 3.26e-05 | 315.53 ms | 53.5% bf16 MFU | 1664360 tok/s step 39/19560 | loss 8.977989 (+nanz)| norm 1.8097 (+nanz)| lr 3.34e-05 | 315.59 ms | 53.5% bf16 MFU | 1664182 tok/s step 40/19560 | loss 8.961264 (+nanz)| norm 2.0398 (+nanz)| lr 3.43e-05 | 315.81 ms | 53.4% bf16 MFU | 1663950 tok/s step 41/19560 | loss 8.906665 (+nanz)| norm 1.8697 (+nanz)| lr 3.51e-05 | 317.02 ms | 53.2% bf16 MFU | 1663368 tok/s step 42/19560 | loss 8.860152 (+nanz)| norm 1.6709 (+nanz)| lr 3.60e-05 | 316.21 ms | 53.4% bf16 MFU | 1663064 tok/s step 43/19560 | loss 8.839649 (+nanz)| norm 1.7306 (+nanz)| lr 3.69e-05 | 316.50 ms | 53.3% bf16 MFU | 1662694 tok/s step 44/19560 | loss 8.856464 (+nanz)| norm 1.6668 (+nanz)| lr 3.77e-05 | 316.67 ms | 53.3% bf16 MFU | 1662297 tok/s step 45/19560 | loss 8.748160 (+nanz)| norm 1.6705 (+nanz)| lr 3.86e-05 | 316.52 ms | 53.3% bf16 MFU | 1661970 tok/s step 46/19560 | loss 8.692651 (+nanz)| norm 1.6793 (+nanz)| lr 3.94e-05 | 316.42 ms | 53.3% bf16 MFU | 1661691 tok/s step 47/19560 | loss 8.712533 (+nanz)| norm 1.5849 (+nanz)| lr 4.03e-05 | 316.61 ms | 53.3% bf16 MFU | 1661374 tok/s step 48/19560 | loss 8.646973 (+nanz)| norm 1.5954 (+nanz)| lr 4.11e-05 | 316.15 ms | 53.4% bf16 MFU | 1661206 tok/s step 49/19560 | loss 8.607582 (+nanz)| norm 1.5651 (+nanz)| 
lr 4.20e-05 | 317.01 ms | 53.2% bf16 MFU | 1660803 tok/s step 50/19560 | loss 8.586839 (+nanz)| norm 1.5604 (+nanz)| lr 4.29e-05 | 317.10 ms | 53.2% bf16 MFU | 1660398 tok/s step 51/19560 | loss 8.536323 (+nanz)| norm 1.6132 (+nanz)| lr 4.37e-05 | 316.83 ms | 53.3% bf16 MFU | 1660095 tok/s step 52/19560 | loss 8.492963 (+nanz)| norm 1.7036 (+nanz)| lr 4.46e-05 | 316.48 ms | 53.3% bf16 MFU | 1659908 tok/s step 53/19560 | loss 8.489202 (+nanz)| norm 1.7864 (+nanz)| lr 4.54e-05 | 316.81 ms | 53.3% bf16 MFU | 1659640 tok/s step 54/19560 | loss 8.421484 (+nanz)| norm 1.5661 (+nanz)| lr 4.63e-05 | 316.76 ms | 53.3% bf16 MFU | 1659400 tok/s step 55/19560 | loss 8.446745 (+nanz)| norm 1.4467 (+nanz)| lr 4.71e-05 | 316.79 ms | 53.3% bf16 MFU | 1659167 tok/s step 56/19560 | loss 8.378387 (+nanz)| norm 1.6649 (+nanz)| lr 4.80e-05 | 316.73 ms | 53.3% bf16 MFU | 1658961 tok/s step 57/19560 | loss 8.326707 (+nanz)| norm 1.6512 (+nanz)| lr 4.89e-05 | 317.15 ms | 53.2% bf16 MFU | 1658651 tok/s step 58/19560 | loss 8.262538 (+nanz)| norm 1.6633 (+nanz)| lr 4.97e-05 | 317.71 ms | 53.1% bf16 MFU | 1658204 tok/s step 59/19560 | loss 8.225888 (+nanz)| norm 1.7655 (+nanz)| lr 5.06e-05 | 316.92 ms | 53.3% bf16 MFU | 1658000 tok/s step 60/19560 | loss 8.181088 (+nanz)| norm 1.4495 (+nanz)| lr 5.14e-05 | 317.11 ms | 53.2% bf16 MFU | 1657754 tok/s step 61/19560 | loss 8.197222 (+nanz)| norm 1.7008 (+nanz)| lr 5.23e-05 | 317.48 ms | 53.2% bf16 MFU | 1657421 tok/s step 62/19560 | loss 8.186007 (+nanz)| norm 1.3840 (+nanz)| lr 5.31e-05 | 317.33 ms | 53.2% bf16 MFU | 1657148 tok/s step 63/19560 | loss 8.078360 (+nanz)| norm 1.3752 (+nanz)| lr 5.40e-05 | 316.38 ms | 53.3% bf16 MFU | 1657147 tok/s step 64/19560 | loss 8.078139 (+nanz)| norm 1.4739 (+nanz)| lr 5.49e-05 | 317.47 ms | 53.2% bf16 MFU | 1656850 tok/s step 65/19560 | loss 8.054733 (+nanz)| norm 1.4442 (+nanz)| lr 5.57e-05 | 317.15 ms | 53.2% bf16 MFU | 1656657 tok/s step 66/19560 | loss 7.981145 (+nanz)| norm 1.6291 (+nanz)| lr 
5.66e-05 | 317.78 ms | 53.1% bf16 MFU | 1656304 tok/s step 67/19560 | loss 7.928283 (+nanz)| norm 1.3553 (+nanz)| lr 5.74e-05 | 317.21 ms | 53.2% bf16 MFU | 1656124 tok/s step 68/19560 | loss 7.929861 (+nanz)| norm 1.3051 (+nanz)| lr 5.83e-05 | 317.44 ms | 53.2% bf16 MFU | 1655891 tok/s step 69/19560 | loss 7.937125 (+nanz)| norm 1.2620 (+nanz)| lr 5.91e-05 | 317.23 ms | 53.2% bf16 MFU | 1655727 tok/s step 70/19560 | loss 7.888303 (+nanz)| norm 1.2230 (+nanz)| lr 6.00e-05 | 318.91 ms | 52.9% bf16 MFU | 1655123 tok/s step 71/19560 | loss 7.836903 (+nanz)| norm 1.3416 (+nanz)| lr 6.09e-05 | 317.28 ms | 53.2% bf16 MFU | 1654985 tok/s step 72/19560 | loss 7.792533 (+nanz)| norm 1.1669 (+nanz)| lr 6.17e-05 | 317.11 ms | 53.2% bf16 MFU | 1654899 tok/s step 73/19560 | loss 7.773809 (+nanz)| norm 1.1522 (+nanz)| lr 6.26e-05 | 318.22 ms | 53.0% bf16 MFU | 1654524 tok/s step 74/19560 | loss 7.792335 (+nanz)| norm 1.4781 (+nanz)| lr 6.34e-05 | 318.10 ms | 53.1% bf16 MFU | 1654199 tok/s step 75/19560 | loss 7.717278 (+nanz)| norm 1.3997 (+nanz)| lr 6.43e-05 | 317.26 ms | 53.2% bf16 MFU | 1654116 tok/s step 76/19560 | loss 7.568539 (+nanz)| norm 1.0994 (+nanz)| lr 6.51e-05 | 317.61 ms | 53.1% bf16 MFU | 1653942 tok/s step 77/19560 | loss 7.679488 (+nanz)| norm 1.7398 (+nanz)| lr 6.60e-05 | 318.15 ms | 53.0% bf16 MFU | 1653634 tok/s step 78/19560 | loss 7.633045 (+nanz)| norm 0.9579 (+nanz)| lr 6.69e-05 | 317.74 ms | 53.1% bf16 MFU | 1653452 tok/s step 79/19560 | loss 7.592651 (+nanz)| norm 1.2439 (+nanz)| lr 6.77e-05 | 317.91 ms | 53.1% bf16 MFU | 1653233 tok/s step 80/19560 | loss 7.616010 (+nanz)| norm 0.9443 (+nanz)| lr 6.86e-05 | 318.23 ms | 53.0% bf16 MFU | 1652942 tok/s step 81/19560 | loss 7.424067 (+nanz)| norm 1.4287 (+nanz)| lr 6.94e-05 | 318.45 ms | 53.0% bf16 MFU | 1652608 tok/s step 82/19560 | loss 7.515984 (+nanz)| norm 1.5817 (+nanz)| lr 7.03e-05 | 318.30 ms | 53.0% bf16 MFU | 1652331 tok/s step 83/19560 | loss 7.442465 (+nanz)| norm 0.9805 (+nanz)| lr 7.11e-05 | 
318.05 ms | 53.1% bf16 MFU | 1652134 tok/s step 84/19560 | loss 7.463772 (+nanz)| norm 1.0387 (+nanz)| lr 7.20e-05 | 318.95 ms | 52.9% bf16 MFU | 1651712 tok/s step 85/19560 | loss 7.399567 (+nanz)| norm 1.3155 (+nanz)| lr 7.29e-05 | 317.95 ms | 53.1% bf16 MFU | 1651572 tok/s step 86/19560 | loss 7.403726 (+nanz)| norm 1.0048 (+nanz)| lr 7.37e-05 | 318.22 ms | 53.0% bf16 MFU | 1651368 tok/s step 87/19560 | loss 7.372361 (+nanz)| norm 0.9639 (+nanz)| lr 7.46e-05 | 318.97 ms | 52.9% bf16 MFU | 1650979 tok/s step 88/19560 | loss 7.304776 (+nanz)| norm 1.0455 (+nanz)| lr 7.54e-05 | 318.20 ms | 53.0% bf16 MFU | 1650810 tok/s step 89/19560 | loss 7.340996 (+nanz)| norm 0.9987 (+nanz)| lr 7.63e-05 | 318.88 ms | 52.9% bf16 MFU | 1650474 tok/s step 90/19560 | loss 7.323092 (+nanz)| norm 0.9632 (+nanz)| lr 7.71e-05 | 318.92 ms | 52.9% bf16 MFU | 1650144 tok/s step 91/19560 | loss 7.380370 (+nanz)| norm 0.9188 (+nanz)| lr 7.80e-05 | 318.48 ms | 53.0% bf16 MFU | 1649947 tok/s step 92/19560 | loss 7.240514 (+nanz)| norm 1.3798 (+nanz)| lr 7.89e-05 | 318.65 ms | 53.0% bf16 MFU | 1649716 tok/s step 93/19560 | loss 7.278755 (+nanz)| norm 1.0515 (+nanz)| lr 7.97e-05 | 318.62 ms | 53.0% bf16 MFU | 1649502 tok/s step 94/19560 | loss 7.172446 (+nanz)| norm 0.6091 (+nanz)| lr 8.06e-05 | 319.39 ms | 52.8% bf16 MFU | 1649101 tok/s step 95/19560 | loss 7.206691 (+nanz)| norm 1.0823 (+nanz)| lr 8.14e-05 | 318.61 ms | 53.0% bf16 MFU | 1648922 tok/s step 96/19560 | loss 7.126145 (+nanz)| norm 1.0807 (+nanz)| lr 8.23e-05 | 319.09 ms | 52.9% bf16 MFU | 1648628 tok/s step 97/19560 | loss 7.140677 (+nanz)| norm 0.7696 (+nanz)| lr 8.31e-05 | 318.52 ms | 53.0% bf16 MFU | 1648495 tok/s step 98/19560 | loss 7.245871 (+nanz)| norm 0.7834 (+nanz)| lr 8.40e-05 | 319.75 ms | 52.8% bf16 MFU | 1648051 tok/s step 99/19560 | loss 7.150112 (+nanz)| norm 0.9784 (+nanz)| lr 8.49e-05 | 319.48 ms | 52.8% bf16 MFU | 1647701 tok/s step 100/19560 | loss 7.146232 (+nanz)| norm 0.7532 (+nanz)| lr 8.57e-05 | 319.01 ms 
| 52.9% bf16 MFU | 1647488 tok/s step 101/19560 | loss 7.090318 (+nanz)| norm 0.9386 (+nanz)| lr 8.66e-05 | 318.51 ms | 53.0% bf16 MFU | 1647415 tok/s step 102/19560 | loss 7.049257 (+nanz)| norm 0.8929 (+nanz)| lr 8.74e-05 | 319.68 ms | 52.8% bf16 MFU | 1647043 tok/s step 103/19560 | loss 7.089221 (+nanz)| norm 0.9470 (+nanz)| lr 8.83e-05 | 319.13 ms | 52.9% bf16 MFU | 1646833 tok/s step 104/19560 | loss 7.151436 (+nanz)| norm 1.0774 (+nanz)| lr 8.91e-05 | 319.46 ms | 52.8% bf16 MFU | 1646550 tok/s step 105/19560 | loss 7.035321 (+nanz)| norm 0.8694 (+nanz)| lr 9.00e-05 | 320.22 ms | 52.7% bf16 MFU | 1646082 tok/s step 106/19560 | loss 7.067957 (+nanz)| norm 0.9092 (+nanz)| lr 9.09e-05 | 319.38 ms | 52.8% bf16 MFU | 1645857 tok/s step 107/19560 | loss 6.896207 (+nanz)| norm 0.9970 (+nanz)| lr 9.17e-05 | 319.57 ms | 52.8% bf16 MFU | 1645594 tok/s step 108/19560 | loss 7.045938 (+nanz)| norm 0.8555 (+nanz)| lr 9.26e-05 | 320.33 ms | 52.7% bf16 MFU | 1645148 tok/s step 109/19560 | loss 6.968539 (+nanz)| norm 0.8375 (+nanz)| lr 9.34e-05 | 319.29 ms | 52.9% bf16 MFU | 1644992 tok/s step 110/19560 | loss 7.056652 (+nanz)| norm 0.8642 (+nanz)| lr 9.43e-05 | 320.58 ms | 52.6% bf16 MFU | 1644513 tok/s step 111/19560 | loss 6.921691 (+nanz)| norm 1.0757 (+nanz)| lr 9.51e-05 | 320.35 ms | 52.7% bf16 MFU | 1644117 tok/s step 112/19560 | loss 6.911484 (+nanz)| norm 1.1640 (+nanz)| lr 9.60e-05 | 320.26 ms | 52.7% bf16 MFU | 1643764 tok/s step 113/19560 | loss 7.007264 (+nanz)| norm 0.8044 (+nanz)| lr 9.69e-05 | 319.60 ms | 52.8% bf16 MFU | 1643597 tok/s step 114/19560 | loss 6.890769 (+nanz)| norm 1.9847 (+nanz)| lr 9.77e-05 | 320.29 ms | 52.7% bf16 MFU | 1643263 tok/s step 115/19560 | loss 6.908181 (+nanz)| norm 1.0685 (+nanz)| lr 9.86e-05 | 319.79 ms | 52.8% bf16 MFU | 1643073 tok/s step 116/19560 | loss 6.931115 (+nanz)| norm 0.7416 (+nanz)| lr 9.94e-05 | 319.62 ms | 52.8% bf16 MFU | 1642938 tok/s step 117/19560 | loss 6.991655 (+nanz)| norm 1.4740 (+nanz)| lr 1.00e-04 | 
320.40 ms | 52.7% bf16 MFU | 1642607 tok/s step 118/19560 | loss 6.873771 (+nanz)| norm 1.0094 (+nanz)| lr 1.01e-04 | 320.56 ms | 52.6% bf16 MFU | 1642253 tok/s step 119/19560 | loss 6.878549 (+nanz)| norm 0.8534 (+nanz)| lr 1.02e-04 | 320.33 ms | 52.7% bf16 MFU | 1641975 tok/s step 120/19560 | loss 6.912476 (+nanz)| norm 0.7953 (+nanz)| lr 1.03e-04 | 320.25 ms | 52.7% bf16 MFU | 1641732 tok/s step 121/19560 | loss 6.926718 (+nanz)| norm 0.6222 (+nanz)| lr 1.04e-04 | 320.77 ms | 52.6% bf16 MFU | 1641369 tok/s step 122/19560 | loss 6.864659 (+nanz)| norm 0.7715 (+nanz)| lr 1.05e-04 | 319.45 ms | 52.8% bf16 MFU | 1641362 tok/s step 123/19560 | loss 6.937659 (+nanz)| norm 0.8921 (+nanz)| lr 1.05e-04 | 319.90 ms | 52.8% bf16 MFU | 1641241 tok/s step 124/19560 | loss 6.817866 (+nanz)| norm 0.8085 (+nanz)| lr 1.06e-04 | 320.50 ms | 52.7% bf16 MFU | 1640970 tok/s step 125/19560 | loss 6.784693 (+nanz)| norm 1.1932 (+nanz)| lr 1.07e-04 | 320.04 ms | 52.7% bf16 MFU | 1640831 tok/s step 126/19560 | loss 6.884774 (+nanz)| norm 1.0859 (+nanz)| lr 1.08e-04 | 319.83 ms | 52.8% bf16 MFU | 1640753 tok/s step 127/19560 | loss 6.848583 (+nanz)| norm 0.9801 (+nanz)| lr 1.09e-04 | 320.86 ms | 52.6% bf16 MFU | 1640416 tok/s step 128/19560 | loss 6.754868 (+nanz)| norm 0.9551 (+nanz)| lr 1.10e-04 | 320.56 ms | 52.6% bf16 MFU | 1640172 tok/s step 129/19560 | loss 6.737210 (-1.32z)| norm 1.2265 (-0.35z)| lr 1.11e-04 | 320.73 ms | 52.6% bf16 MFU | 1639897 tok/s step 130/19560 | loss 6.802297 (-1.25z)| norm 0.6604 (-0.63z)| lr 1.11e-04 | 320.29 ms | 52.7% bf16 MFU | 1639748 tok/s step 131/19560 | loss 6.728189 (-1.31z)| norm 0.6409 (-0.70z)| lr 1.12e-04 | 320.14 ms | 52.7% bf16 MFU | 1639645 tok/s step 132/19560 | loss 6.695301 (-1.33z)| norm 1.0297 (-0.52z)| lr 1.13e-04 | 320.54 ms | 52.7% bf16 MFU | 1639444 tok/s step 133/19560 | loss 6.712407 (-1.31z)| norm 0.8658 (-0.70z)| lr 1.14e-04 | 320.53 ms | 52.7% bf16 MFU | 1639257 tok/s step 134/19560 | loss 6.745340 (-1.26z)| norm 0.9275 
(-0.70z)| lr 1.15e-04 | 320.54 ms | 52.7% bf16 MFU | 1639076 tok/s step 135/19560 | loss 6.685360 (-1.31z)| norm 1.0226 (-0.64z)| lr 1.16e-04 | 320.51 ms | 52.7% bf16 MFU | 1638911 tok/s step 136/19560 | loss 6.695441 (-1.29z)| norm 1.2460 (-0.37z)| lr 1.17e-04 | 320.55 ms | 52.7% bf16 MFU | 1638746 tok/s step 137/19560 | loss 6.700593 (-1.27z)| norm 0.9276 (-0.85z)| lr 1.17e-04 | 321.39 ms | 52.5% bf16 MFU | 1638373 tok/s step 138/19560 | loss 6.694285 (-1.26z)| norm 0.9143 (-0.91z)| lr 1.18e-04 | 320.98 ms | 52.6% bf16 MFU | 1638124 tok/s step 139/19560 | loss 6.701507 (-1.24z)| norm 1.0674 (-0.66z)| lr 1.19e-04 | 321.38 ms | 52.5% bf16 MFU | 1637785 tok/s step 140/19560 | loss 6.649533 (-1.28z)| norm 0.8993 (-0.97z)| lr 1.20e-04 | 320.71 ms | 52.6% bf16 MFU | 1637633 tok/s step 141/19560 | loss 6.643652 (-1.27z)| norm 1.2593 (-0.28z)| lr 1.21e-04 | 320.40 ms | 52.7% bf16 MFU | 1637570 tok/s step 142/19560 | loss 6.653690 (-1.25z)| norm 0.7099 (-1.34z)| lr 1.22e-04 | 320.37 ms | 52.7% bf16 MFU | 1637516 tok/s step 143/19560 | loss 6.653822 (-1.23z)| norm 0.7694 (-1.22z)| lr 1.23e-04 | 321.36 ms | 52.5% bf16 MFU | 1637212 tok/s step 144/19560 | loss 6.654366 (-1.22z)| norm 0.5592 (-1.62z)| lr 1.23e-04 | 320.58 ms | 52.6% bf16 MFU | 1637122 tok/s step 145/19560 | loss 6.625488 (-1.23z)| norm 0.7303 (-1.26z)| lr 1.24e-04 | 321.52 ms | 52.5% bf16 MFU | 1636799 tok/s step 146/19560 | loss 6.673954 (-1.17z)| norm 0.6119 (-1.48z)| lr 1.25e-04 | 321.15 ms | 52.6% bf16 MFU | 1636585 tok/s step 147/19560 | loss 6.635135 (-1.20z)| norm 0.7076 (-1.27z)| lr 1.26e-04 | 320.76 ms | 52.6% bf16 MFU | 1636481 tok/s step 148/19560 | loss 6.636708 (-1.18z)| norm 1.0659 (-0.52z)| lr 1.27e-04 | 321.21 ms | 52.5% bf16 MFU | 1636269 tok/s step 149/19560 | loss 6.620650 (-1.19z)| norm 1.1442 (-0.34z)| lr 1.28e-04 | 321.65 ms | 52.5% bf16 MFU | 1635955 tok/s step 150/19560 | loss 6.670755 (-1.12z)| norm 0.8805 (-0.88z)| lr 1.29e-04 | 321.00 ms | 52.6% bf16 MFU | 1635822 tok/s step 
151/19560 | loss 6.596502 (-1.19z)| norm 1.6190 (+0.69z)| lr 1.29e-04 | 321.24 ms | 52.5% bf16 MFU | 1635634 tok/s step 152/19560 | loss 6.519027 (-1.27z)| norm 0.9438 (-0.74z)| lr 1.30e-04 | 320.94 ms | 52.6% bf16 MFU | 1635532 tok/s step 153/19560 | loss 6.655558 (-1.10z)| norm 0.9484 (-0.72z)| lr 1.31e-04 | 320.74 ms | 52.6% bf16 MFU | 1635486 tok/s step 154/19560 | loss 6.590478 (-1.16z)| norm 1.0792 (-0.42z)| lr 1.32e-04 | 321.28 ms | 52.5% bf16 MFU | 1635305 tok/s step 155/19560 | loss 6.582931 (-1.16z)| norm 1.3952 (+0.30z)| lr 1.33e-04 | 321.80 ms | 52.4% bf16 MFU | 1635000 tok/s step 156/19560 | loss 6.626073 (-1.10z)| norm 0.8860 (-0.83z)| lr 1.34e-04 | 320.62 ms | 52.6% bf16 MFU | 1635011 tok/s step 157/19560 | loss 6.556036 (-1.17z)| norm 1.0843 (-0.37z)| lr 1.35e-04 | 320.80 ms | 52.6% bf16 MFU | 1634976 tok/s step 158/19560 | loss 6.570563 (-1.14z)| norm 1.1266 (-0.26z)| lr 1.35e-04 | 321.20 ms | 52.5% bf16 MFU | 1634842 tok/s step 159/19560 | loss 6.558052 (-1.15z)| norm 0.9948 (-0.56z)| lr 1.36e-04 | 321.81 ms | 52.4% bf16 MFU | 1634560 tok/s step 160/19560 | loss 6.537641 (-1.16z)| norm 0.8370 (-0.93z)| lr 1.37e-04 | 321.36 ms | 52.5% bf16 MFU | 1634405 tok/s step 161/19560 | loss 6.579946 (-1.10z)| norm 0.7336 (-1.20z)| lr 1.38e-04 | 321.32 ms | 52.5% bf16 MFU | 1634269 tok/s step 162/19560 | loss 6.512710 (-1.18z)| norm 0.9114 (-0.74z)| lr 1.39e-04 | 322.31 ms | 52.4% bf16 MFU | 1633888 tok/s step 163/19560 | loss 6.482497 (-1.21z)| norm 1.1708 (-0.03z)| lr 1.40e-04 | 321.78 ms | 52.4% bf16 MFU | 1633660 tok/s step 164/19560 | loss 6.473069 (-1.21z)| norm 1.4080 (+0.66z)| lr 1.41e-04 | 321.06 ms | 52.6% bf16 MFU | 1633626 tok/s step 165/19560 | loss 6.495692 (-1.17z)| norm 1.0809 (-0.25z)| lr 1.41e-04 | 321.50 ms | 52.5% bf16 MFU | 1633481 tok/s step 166/19560 | loss 6.482262 (-1.18z)| norm 0.8868 (-0.80z)| lr 1.42e-04 | 321.67 ms | 52.5% bf16 MFU | 1633303 tok/s start Wed Nov 20 05:55:47 UTC 2024 
+-----------------------+----------------------------------------------------+
| Parameter             | Value                                              |
+-----------------------+----------------------------------------------------+
| train data pattern    | dev/data/fineweb10B/fineweb_train_*.bin            |
| val data pattern      | dev/data/fineweb10B/fineweb_val_*.bin              |
| output log dir        | log124M                                            |
| checkpoint_every      | 5000                                               |
| resume                | 0                                                  |
| micro batch size B    | 64                                                 |
| sequence length T     | 1024                                               |
| total batch size      | 524288                                             |
| LR scheduler          | cosine                                             |
| learning rate (LR)    | 6.000000e-04                                       |
| warmup iterations     | 700                                                |
| final LR fraction     | 0.000000e+00                                       |
| weight decay          | 1.000000e-01                                       |
| skip update lossz     | 0.000000                                           |
| skip update gradz     | 0.000000                                           |
| max_steps             | -1                                                 |
| val_loss_every        | 250                                                |
| val_max_steps         | 20                                                 |
| sample_every          | 20000                                              |
| genT                  | 64                                                 |
| overfit_single_batch  | 0                                                  |
| use_master_weights    | enabled                                            |
| gelu_fusion           | 0                                                  |
| recompute             | 1                                                  |
+-----------------------+----------------------------------------------------+
| device                | NVIDIA A100-SXM4-40GB                              |
| peak TFlops           | 312.0                                              |
| precision             | BF16                                               |
+-----------------------+----------------------------------------------------+
---
WARNING: Failed to open the tokenizer file gpt2_tokenizer.bin
The Tokenizer is a new feature added April 14 2024.
Re-run `python train_gpt2.py` to write it
---
| weight init method    | d12                                                |
| max_sequence_length T | 1024                                               |
| vocab_size V          | 50257                                              |
| padded_vocab_size Vp  | 50304                                              |
| num_layers L          | 12                                                 |
| num_heads NH          | 12                                                 |
| channels C            | 768                                                |
| num_parameters        | 124475904                                          |
+-----------------------+----------------------------------------------------+
| train_num_batches     | 19560                                              |
| val_num_batches       | 20                                                 |
+-----------------------+----------------------------------------------------+
| run hellaswag         | yes                                                |
+-----------------------+----------------------------------------------------+
| num_processes         | 8                                                  |
| zero_stage            | 1                                                  |
+-----------------------+----------------------------------------------------+
num_parameters: 124475904 => bytes: 248951808
allocated 237 MiB for model parameters
batch_size B=64 * seq_len T=1024 * num_processes=8 and total_batch_size=524288
=> setting grad_accum_steps=1
allocating 237 MiB for parameter gradients
allocating 21216 MiB for activations
allocating 59 MiB for AdamW optimizer state m
allocating 59 MiB for AdamW optimizer state v
allocating 59 MiB for master copy of params
device memory usage: 23129 MiB / 40326 MiB
memory per sequence: 331 MiB
 -> estimated maximum batch size: 115
val loss 11.009204
step 1/19560 | loss 11.009109 (+nanz)| norm 15.3717 (+nanz)| lr 8.57e-07 | 696.92 ms | 24.2% bf16 MFU | 752294 tok/s step 2/19560 | loss 10.959850 (+nanz)| norm 15.2377 (+nanz)| lr 1.71e-06 | 313.72 ms | 53.8% bf16 MFU | 1671212 tok/s step 3/19560 | loss 10.855611 (+nanz)| norm 14.8389 (+nanz)| lr 2.57e-06 | 315.14 ms | 53.6% bf16 MFU | 1667356 tok/s step 4/19560 | loss 10.715738 (+nanz)| norm 13.1195 (+nanz)| lr 3.43e-06 | 315.67 ms | 53.5% bf16 MFU | 1665085 tok/s step 5/19560 | loss 10.568426 (+nanz)| norm 10.4405 (+nanz)| lr 4.29e-06 | 314.86 ms | 53.6% bf16 MFU | 1665101 tok/s step 6/19560 | loss 10.423454 (+nanz)| norm 8.4942 (+nanz)| lr 5.14e-06 | 315.10 ms | 53.6% bf16 MFU | 1664830 tok/s step 7/19560 | loss 10.291792 (+nanz)| norm 7.2415 (+nanz)| lr 6.00e-06 | 315.63 ms | 53.5% bf16 MFU | 1664128 tok/s step 8/19560 | loss 10.187212 (+nanz)| norm 6.1835 (+nanz)| lr 6.86e-06 | 314.60 ms | 53.6% bf16 MFU | 1664521 tok/s step 9/19560 | loss 10.096745 (+nanz)| norm 5.2947 (+nanz)| lr 7.71e-06 | 315.47 ms | 53.5% bf16 MFU | 1664136 tok/s step 10/19560 | loss 9.976257 (+nanz)| norm 4.6636 (+nanz)| lr 8.57e-06 | 315.13 ms | 53.6% bf16 MFU | 1664076 tok/s step 11/19560 | loss 9.935361 (+nanz)| norm 3.8501 (+nanz)| lr 9.43e-06 | 314.56 ms | 53.7% bf16 MFU | 1664410 tok/s step 12/19560 | loss 9.853884 (+nanz)| norm 3.3980 (+nanz)| lr 1.03e-05 | 315.67 ms | 53.5% bf16 MFU | 1663998 tok/s step 13/19560 | loss 9.792458 (+nanz)| norm 3.0057 (+nanz)| lr 1.11e-05 | 315.94 ms | 53.4% bf16 MFU | 1663503 tok/s step 14/19560 | loss 9.758899 (+nanz)| norm 2.6957 (+nanz)| lr 1.20e-05 | 315.41 ms | 53.5% bf16 MFU | 1663371 tok/s step 15/19560 | loss 9.702504 (+nanz)| norm 2.5212 (+nanz)| lr 1.29e-05 | 315.25 ms | 53.5% bf16 MFU | 1663343 tok/s step 16/19560 | loss 9.701744 (+nanz)| norm 2.3030 (+nanz)| lr 
1.37e-05 | 316.39 ms | 53.3% bf16 MFU | 1662760 tok/s step 17/19560 | loss 9.655616 (+nanz)| norm 2.2412 (+nanz)| lr 1.46e-05 | 314.88 ms | 53.6% bf16 MFU | 1662965 tok/s step 18/19560 | loss 9.636614 (+nanz)| norm 2.1886 (+nanz)| lr 1.54e-05 | 314.89 ms | 53.6% bf16 MFU | 1663138 tok/s step 19/19560 | loss 9.603564 (+nanz)| norm 2.2161 (+nanz)| lr 1.63e-05 | 314.83 ms | 53.6% bf16 MFU | 1663318 tok/s step 20/19560 | loss 9.596482 (+nanz)| norm 2.1545 (+nanz)| lr 1.71e-05 | 315.22 ms | 53.5% bf16 MFU | 1663312 tok/s step 21/19560 | loss 9.562165 (+nanz)| norm 2.1811 (+nanz)| lr 1.80e-05 | 315.59 ms | 53.5% bf16 MFU | 1663155 tok/s step 22/19560 | loss 9.537733 (+nanz)| norm 2.1489 (+nanz)| lr 1.89e-05 | 315.39 ms | 53.5% bf16 MFU | 1663095 tok/s step 23/19560 | loss 9.533237 (+nanz)| norm 2.0896 (+nanz)| lr 1.97e-05 | 315.02 ms | 53.6% bf16 MFU | 1663183 tok/s step 24/19560 | loss 9.476657 (+nanz)| norm 2.1037 (+nanz)| lr 2.06e-05 | 315.23 ms | 53.5% bf16 MFU | 1663182 tok/s step 25/19560 | loss 9.458346 (+nanz)| norm 2.0526 (+nanz)| lr 2.14e-05 | 315.10 ms | 53.6% bf16 MFU | 1663232 tok/s step 26/19560 | loss 9.422850 (+nanz)| norm 2.0943 (+nanz)| lr 2.23e-05 | 315.80 ms | 53.4% bf16 MFU | 1663022 tok/s step 27/19560 | loss 9.377939 (+nanz)| norm 2.1531 (+nanz)| lr 2.31e-05 | 315.89 ms | 53.4% bf16 MFU | 1662799 tok/s step 28/19560 | loss 9.388626 (+nanz)| norm 1.9649 (+nanz)| lr 2.40e-05 | 315.54 ms | 53.5% bf16 MFU | 1662716 tok/s step 29/19560 | loss 9.357561 (+nanz)| norm 2.4480 (+nanz)| lr 2.49e-05 | 315.81 ms | 53.4% bf16 MFU | 1662546 tok/s step 30/19560 | loss 9.306387 (+nanz)| norm 1.9511 (+nanz)| lr 2.57e-05 | 316.32 ms | 53.4% bf16 MFU | 1662217 tok/s step 31/19560 | loss 9.244957 (+nanz)| norm 2.2944 (+nanz)| lr 2.66e-05 | 315.64 ms | 53.5% bf16 MFU | 1662143 tok/s step 32/19560 | loss 9.211149 (+nanz)| norm 2.2605 (+nanz)| lr 2.74e-05 | 317.27 ms | 53.2% bf16 MFU | 1661538 tok/s step 33/19560 | loss 9.235588 (+nanz)| norm 2.5857 (+nanz)| lr 2.83e-05 | 
316.16 ms | 53.4% bf16 MFU | 1661335 tok/s step 34/19560 | loss 9.159510 (+nanz)| norm 2.4415 (+nanz)| lr 2.91e-05 | 316.08 ms | 53.4% bf16 MFU | 1661174 tok/s step 35/19560 | loss 9.200710 (+nanz)| norm 1.9544 (+nanz)| lr 3.00e-05 | 316.53 ms | 53.3% bf16 MFU | 1660882 tok/s step 36/19560 | loss 9.080112 (+nanz)| norm 2.4098 (+nanz)| lr 3.09e-05 | 316.35 ms | 53.3% bf16 MFU | 1660666 tok/s step 37/19560 | loss 9.029282 (+nanz)| norm 1.8688 (+nanz)| lr 3.17e-05 | 316.29 ms | 53.4% bf16 MFU | 1660485 tok/s step 38/19560 | loss 9.039858 (+nanz)| norm 1.9748 (+nanz)| lr 3.26e-05 | 316.99 ms | 53.2% bf16 MFU | 1660102 tok/s step 39/19560 | loss 8.977991 (+nanz)| norm 1.8095 (+nanz)| lr 3.34e-05 | 317.55 ms | 53.1% bf16 MFU | 1659574 tok/s step 40/19560 | loss 8.961290 (+nanz)| norm 2.0396 (+nanz)| lr 3.43e-05 | 316.64 ms | 53.3% bf16 MFU | 1659356 tok/s step 41/19560 | loss 8.906673 (+nanz)| norm 1.8697 (+nanz)| lr 3.51e-05 | 317.42 ms | 53.2% bf16 MFU | 1658919 tok/s step 42/19560 | loss 8.860168 (+nanz)| norm 1.6712 (+nanz)| lr 3.60e-05 | 317.66 ms | 53.1% bf16 MFU | 1658437 tok/s step 43/19560 | loss 8.839666 (+nanz)| norm 1.7301 (+nanz)| lr 3.69e-05 | 316.28 ms | 53.4% bf16 MFU | 1658392 tok/s step 44/19560 | loss 8.856472 (+nanz)| norm 1.6667 (+nanz)| lr 3.77e-05 | 317.10 ms | 53.2% bf16 MFU | 1658112 tok/s step 45/19560 | loss 8.748165 (+nanz)| norm 1.6708 (+nanz)| lr 3.86e-05 | 318.71 ms | 53.0% bf16 MFU | 1657382 tok/s step 46/19560 | loss 8.692653 (+nanz)| norm 1.6775 (+nanz)| lr 3.94e-05 | 317.29 ms | 53.2% bf16 MFU | 1657104 tok/s step 47/19560 | loss 8.712533 (+nanz)| norm 1.5835 (+nanz)| lr 4.03e-05 | 316.96 ms | 53.2% bf16 MFU | 1656940 tok/s step 48/19560 | loss 8.646987 (+nanz)| norm 1.5961 (+nanz)| lr 4.11e-05 | 317.18 ms | 53.2% bf16 MFU | 1656720 tok/s step 49/19560 | loss 8.607571 (+nanz)| norm 1.5648 (+nanz)| lr 4.20e-05 | 317.69 ms | 53.1% bf16 MFU | 1656370 tok/s step 50/19560 | loss 8.586817 (+nanz)| norm 1.5581 (+nanz)| lr 4.29e-05 | 317.18 ms 
| 53.2% bf16 MFU | 1656184 tok/s step 51/19560 | loss 8.536273 (+nanz)| norm 1.6072 (+nanz)| lr 4.37e-05 | 316.89 ms | 53.3% bf16 MFU | 1656091 tok/s step 52/19560 | loss 8.492901 (+nanz)| norm 1.7007 (+nanz)| lr 4.46e-05 | 317.45 ms | 53.2% bf16 MFU | 1655846 tok/s step 53/19560 | loss 8.489247 (+nanz)| norm 1.7998 (+nanz)| lr 4.54e-05 | 317.93 ms | 53.1% bf16 MFU | 1655483 tok/s step 54/19560 | loss 8.421532 (+nanz)| norm 1.5751 (+nanz)| lr 4.63e-05 | 317.67 ms | 53.1% bf16 MFU | 1655212 tok/s step 55/19560 | loss 8.446787 (+nanz)| norm 1.4480 (+nanz)| lr 4.71e-05 | 316.55 ms | 53.3% bf16 MFU | 1655267 tok/s step 56/19560 | loss 8.378431 (+nanz)| norm 1.6670 (+nanz)| lr 4.80e-05 | 317.54 ms | 53.1% bf16 MFU | 1655044 tok/s step 57/19560 | loss 8.326777 (+nanz)| norm 1.6541 (+nanz)| lr 4.89e-05 | 317.91 ms | 53.1% bf16 MFU | 1654733 tok/s step 58/19560 | loss 8.262592 (+nanz)| norm 1.6627 (+nanz)| lr 4.97e-05 | 317.74 ms | 53.1% bf16 MFU | 1654486 tok/s step 59/19560 | loss 8.225939 (+nanz)| norm 1.7647 (+nanz)| lr 5.06e-05 | 318.23 ms | 53.0% bf16 MFU | 1654118 tok/s step 60/19560 | loss 8.181162 (+nanz)| norm 1.4505 (+nanz)| lr 5.14e-05 | 317.53 ms | 53.2% bf16 MFU | 1653962 tok/s step 61/19560 | loss 8.197295 (+nanz)| norm 1.6991 (+nanz)| lr 5.23e-05 | 317.84 ms | 53.1% bf16 MFU | 1653730 tok/s step 62/19560 | loss 8.186028 (+nanz)| norm 1.3791 (+nanz)| lr 5.31e-05 | 318.47 ms | 53.0% bf16 MFU | 1653340 tok/s step 63/19560 | loss 8.078424 (+nanz)| norm 1.3744 (+nanz)| lr 5.40e-05 | 318.06 ms | 53.1% bf16 MFU | 1653083 tok/s step 64/19560 | loss 8.078199 (+nanz)| norm 1.4735 (+nanz)| lr 5.49e-05 | 316.94 ms | 53.3% bf16 MFU | 1653141 tok/s step 65/19560 | loss 8.054811 (+nanz)| norm 1.4451 (+nanz)| lr 5.57e-05 | 318.27 ms | 53.0% bf16 MFU | 1652838 tok/s step 66/19560 | loss 7.981157 (+nanz)| norm 1.6246 (+nanz)| lr 5.66e-05 | 318.65 ms | 53.0% bf16 MFU | 1652449 tok/s step 67/19560 | loss 7.928317 (+nanz)| norm 1.3552 (+nanz)| lr 5.74e-05 | 317.91 ms | 53.1% 
bf16 MFU | 1652278 tok/s step 68/19560 | loss 7.929958 (+nanz)| norm 1.3094 (+nanz)| lr 5.83e-05 | 317.97 ms | 53.1% bf16 MFU | 1652101 tok/s step 69/19560 | loss 7.937172 (+nanz)| norm 1.2620 (+nanz)| lr 5.91e-05 | 319.02 ms | 52.9% bf16 MFU | 1651654 tok/s step 70/19560 | loss 7.888276 (+nanz)| norm 1.2156 (+nanz)| lr 6.00e-05 | 318.55 ms | 53.0% bf16 MFU | 1651357 tok/s step 71/19560 | loss 7.836855 (+nanz)| norm 1.3333 (+nanz)| lr 6.09e-05 | 318.25 ms | 53.0% bf16 MFU | 1651155 tok/s step 72/19560 | loss 7.792550 (+nanz)| norm 1.1686 (+nanz)| lr 6.17e-05 | 318.10 ms | 53.1% bf16 MFU | 1651003 tok/s step 73/19560 | loss 7.773930 (+nanz)| norm 1.1652 (+nanz)| lr 6.26e-05 | 319.02 ms | 52.9% bf16 MFU | 1650615 tok/s step 74/19560 | loss 7.792465 (+nanz)| norm 1.4856 (+nanz)| lr 6.34e-05 | 318.27 ms | 53.0% bf16 MFU | 1650447 tok/s step 75/19560 | loss 7.716968 (+nanz)| norm 1.3789 (+nanz)| lr 6.43e-05 | 317.49 ms | 53.2% bf16 MFU | 1650492 tok/s step 76/19560 | loss 7.568437 (+nanz)| norm 1.0922 (+nanz)| lr 6.51e-05 | 318.67 ms | 53.0% bf16 MFU | 1650224 tok/s step 77/19560 | loss 7.679068 (+nanz)| norm 1.7059 (+nanz)| lr 6.60e-05 | 319.03 ms | 52.9% bf16 MFU | 1649873 tok/s step 78/19560 | loss 7.632849 (+nanz)| norm 0.9554 (+nanz)| lr 6.69e-05 | 318.88 ms | 52.9% bf16 MFU | 1649582 tok/s step 79/19560 | loss 7.592070 (+nanz)| norm 1.2280 (+nanz)| lr 6.77e-05 | 318.49 ms | 53.0% bf16 MFU | 1649409 tok/s step 80/19560 | loss 7.615588 (+nanz)| norm 0.9336 (+nanz)| lr 6.86e-05 | 318.47 ms | 53.0% bf16 MFU | 1649250 tok/s step 81/19560 | loss 7.423123 (+nanz)| norm 1.3849 (+nanz)| lr 6.94e-05 | 318.67 ms | 53.0% bf16 MFU | 1649047 tok/s step 82/19560 | loss 7.515193 (+nanz)| norm 1.5655 (+nanz)| lr 7.03e-05 | 319.08 ms | 52.9% bf16 MFU | 1648747 tok/s step 83/19560 | loss 7.442195 (+nanz)| norm 0.9895 (+nanz)| lr 7.11e-05 | 319.01 ms | 52.9% bf16 MFU | 1648479 tok/s step 84/19560 | loss 7.462711 (+nanz)| norm 1.0160 (+nanz)| lr 7.20e-05 | 319.62 ms | 52.8% bf16 MFU | 
1648068 tok/s step 85/19560 | loss 7.399843 (+nanz)| norm 1.4101 (+nanz)| lr 7.29e-05 | 318.71 ms | 53.0% bf16 MFU | 1647913 tok/s step 86/19560 | loss 7.402619 (+nanz)| norm 0.9672 (+nanz)| lr 7.37e-05 | 318.41 ms | 53.0% bf16 MFU | 1647846 tok/s step 87/19560 | loss 7.372180 (+nanz)| norm 0.9722 (+nanz)| lr 7.46e-05 | 319.19 ms | 52.9% bf16 MFU | 1647578 tok/s step 88/19560 | loss 7.304500 (+nanz)| norm 1.0659 (+nanz)| lr 7.54e-05 | 319.43 ms | 52.8% bf16 MFU | 1647262 tok/s step 89/19560 | loss 7.340682 (+nanz)| norm 1.0230 (+nanz)| lr 7.63e-05 | 319.15 ms | 52.9% bf16 MFU | 1647035 tok/s step 90/19560 | loss 7.322897 (+nanz)| norm 0.9668 (+nanz)| lr 7.71e-05 | 319.81 ms | 52.8% bf16 MFU | 1646649 tok/s step 91/19560 | loss 7.380168 (+nanz)| norm 0.9023 (+nanz)| lr 7.80e-05 | 318.77 ms | 52.9% bf16 MFU | 1646552 tok/s step 92/19560 | loss 7.239254 (+nanz)| norm 1.2497 (+nanz)| lr 7.89e-05 | 319.17 ms | 52.9% bf16 MFU | 1646356 tok/s step 93/19560 | loss 7.280473 (+nanz)| norm 1.2754 (+nanz)| lr 7.97e-05 | 319.14 ms | 52.9% bf16 MFU | 1646176 tok/s step 94/19560 | loss 7.173059 (+nanz)| norm 0.6341 (+nanz)| lr 8.06e-05 | 319.84 ms | 52.8% bf16 MFU | 1645826 tok/s step 95/19560 | loss 7.207663 (+nanz)| norm 1.1749 (+nanz)| lr 8.14e-05 | 319.74 ms | 52.8% bf16 MFU | 1645518 tok/s step 96/19560 | loss 7.126929 (+nanz)| norm 1.1039 (+nanz)| lr 8.23e-05 | 319.82 ms | 52.8% bf16 MFU | 1645205 tok/s step 97/19560 | loss 7.140967 (+nanz)| norm 0.7372 (+nanz)| lr 8.31e-05 | 319.02 ms | 52.9% bf16 MFU | 1645117 tok/s step 98/19560 | loss 7.246431 (+nanz)| norm 0.8139 (+nanz)| lr 8.40e-05 | 320.12 ms | 52.7% bf16 MFU | 1644747 tok/s step 99/19560 | loss 7.150344 (+nanz)| norm 0.9218 (+nanz)| lr 8.49e-05 | 319.66 ms | 52.8% bf16 MFU | 1644515 tok/s step 100/19560 | loss 7.145090 (+nanz)| norm 0.6691 (+nanz)| lr 8.57e-05 | 319.76 ms | 52.8% bf16 MFU | 1644268 tok/s step 101/19560 | loss 7.090052 (+nanz)| norm 0.8795 (+nanz)| lr 8.66e-05 | 320.24 ms | 52.7% bf16 MFU | 1643911 
tok/s step 102/19560 | loss 7.048390 (+nanz)| norm 0.7897 (+nanz)| lr 8.74e-05 | 319.96 ms | 52.7% bf16 MFU | 1643644 tok/s step 103/19560 | loss 7.089171 (+nanz)| norm 0.9258 (+nanz)| lr 8.83e-05 | 319.52 ms | 52.8% bf16 MFU | 1643504 tok/s step 104/19560 | loss 7.151195 (+nanz)| norm 1.0183 (+nanz)| lr 8.91e-05 | 319.80 ms | 52.8% bf16 MFU | 1643300 tok/s step 105/19560 | loss 7.035669 (+nanz)| norm 0.8900 (+nanz)| lr 9.00e-05 | 320.26 ms | 52.7% bf16 MFU | 1642986 tok/s step 106/19560 | loss 7.065207 (+nanz)| norm 0.6486 (+nanz)| lr 9.09e-05 | 320.05 ms | 52.7% bf16 MFU | 1642742 tok/s step 107/19560 | loss 6.892458 (+nanz)| norm 0.8898 (+nanz)| lr 9.17e-05 | 320.26 ms | 52.7% bf16 MFU | 1642457 tok/s step 108/19560 | loss 7.045056 (+nanz)| norm 0.8603 (+nanz)| lr 9.26e-05 | 320.70 ms | 52.6% bf16 MFU | 1642074 tok/s step 109/19560 | loss 6.967371 (+nanz)| norm 0.8506 (+nanz)| lr 9.34e-05 | 320.54 ms | 52.7% bf16 MFU | 1641750 tok/s step 110/19560 | loss 7.060790 (+nanz)| norm 1.1379 (+nanz)| lr 9.43e-05 | 319.62 ms | 52.8% bf16 MFU | 1641681 tok/s step 111/19560 | loss 6.921129 (+nanz)| norm 0.9117 (+nanz)| lr 9.51e-05 | 320.24 ms | 52.7% bf16 MFU | 1641454 tok/s step 112/19560 | loss 6.903554 (+nanz)| norm 0.6963 (+nanz)| lr 9.60e-05 | 320.72 ms | 52.6% bf16 MFU | 1641116 tok/s step 113/19560 | loss 7.002238 (+nanz)| norm 0.6131 (+nanz)| lr 9.69e-05 | 322.25 ms | 52.4% bf16 MFU | 1640406 tok/s step 114/19560 | loss 6.877550 (+nanz)| norm 0.9659 (+nanz)| lr 9.77e-05 | 321.01 ms | 52.6% bf16 MFU | 1640048 tok/s step 115/19560 | loss 6.912578 (+nanz)| norm 1.3991 (+nanz)| lr 9.86e-05 | 321.66 ms | 52.5% bf16 MFU | 1639542 tok/s step 116/19560 | loss 6.930488 (+nanz)| norm 0.7822 (+nanz)| lr 9.94e-05 | 320.56 ms | 52.6% bf16 MFU | 1639341 tok/s step 117/19560 | loss 6.994165 (+nanz)| norm 1.4629 (+nanz)| lr 1.00e-04 | 320.58 ms | 52.6% bf16 MFU | 1639146 tok/s step 118/19560 | loss 6.870842 (+nanz)| norm 0.8994 (+nanz)| lr 1.01e-04 | 321.30 ms | 52.5% bf16 MFU | 
1638777 tok/s step 119/19560 | loss 6.878057 (+nanz)| norm 0.8905 (+nanz)| lr 1.02e-04 | 320.99 ms | 52.6% bf16 MFU | 1638505 tok/s step 120/19560 | loss 6.911891 (+nanz)| norm 0.9541 (+nanz)| lr 1.03e-04 | 321.44 ms | 52.5% bf16 MFU | 1638132 tok/s step 121/19560 | loss 6.925647 (+nanz)| norm 0.7611 (+nanz)| lr 1.04e-04 | 320.86 ms | 52.6% bf16 MFU | 1637924 tok/s step 122/19560 | loss 6.864729 (+nanz)| norm 0.8617 (+nanz)| lr 1.05e-04 | 321.47 ms | 52.5% bf16 MFU | 1637571 tok/s step 123/19560 | loss 6.934637 (+nanz)| norm 1.0621 (+nanz)| lr 1.05e-04 | 321.70 ms | 52.5% bf16 MFU | 1637179 tok/s step 124/19560 | loss 6.830661 (+nanz)| norm 1.4283 (+nanz)| lr 1.06e-04 | 320.95 ms | 52.6% bf16 MFU | 1636998 tok/s step 125/19560 | loss 6.785739 (+nanz)| norm 1.0351 (+nanz)| lr 1.07e-04 | 321.16 ms | 52.6% bf16 MFU | 1636772 tok/s step 126/19560 | loss 6.888448 (+nanz)| norm 1.2989 (+nanz)| lr 1.08e-04 | 321.26 ms | 52.5% bf16 MFU | 1636531 tok/s step 127/19560 | loss 6.854775 (+nanz)| norm 1.5645 (+nanz)| lr 1.09e-04 | 321.76 ms | 52.5% bf16 MFU | 1636175 tok/s step 128/19560 | loss 6.757652 (+nanz)| norm 0.8796 (+nanz)| lr 1.10e-04 | 321.22 ms | 52.5% bf16 MFU | 1635976 tok/s step 129/19560 | loss 6.734069 (-1.32z)| norm 0.9832 (-0.46z)| lr 1.11e-04 | 321.40 ms | 52.5% bf16 MFU | 1635740 tok/s step 130/19560 | loss 6.804447 (-1.25z)| norm 0.9410 (-0.49z)| lr 1.11e-04 | 322.35 ms | 52.4% bf16 MFU | 1635274 tok/s step 131/19560 | loss 6.736572 (-1.30z)| norm 1.1193 (-0.42z)| lr 1.12e-04 | 321.83 ms | 52.4% bf16 MFU | 1634964 tok/s step 132/19560 | loss 6.690763 (-1.34z)| norm 0.8417 (-0.65z)| lr 1.13e-04 | 320.95 ms | 52.6% bf16 MFU | 1634894 tok/s step 133/19560 | loss 6.716677 (-1.30z)| norm 0.8852 (-0.68z)| lr 1.14e-04 | 321.11 ms | 52.6% bf16 MFU | 1634785 tok/s step 134/19560 | loss 6.746459 (-1.26z)| norm 0.8890 (-0.74z)| lr 1.15e-04 | 321.30 ms | 52.5% bf16 MFU | 1634636 tok/s step 135/19560 | loss 6.686617 (-1.31z)| norm 0.9627 (-0.71z)| lr 1.16e-04 | 321.82 
ms | 52.4% bf16 MFU | 1634360 tok/s step 136/19560 | loss 6.694745 (-1.29z)| norm 0.9746 (-0.74z)| lr 1.17e-04 | 321.61 ms | 52.5% bf16 MFU | 1634152 tok/s step 137/19560 | loss 6.701411 (-1.27z)| norm 0.8374 (-0.99z)| lr 1.17e-04 | 322.18 ms | 52.4% bf16 MFU | 1633810 tok/s step 138/19560 | loss 6.695772 (-1.26z)| norm 0.8120 (-1.08z)| lr 1.18e-04 | 322.23 ms | 52.4% bf16 MFU | 1633474 tok/s step 139/19560 | loss 6.699075 (-1.24z)| norm 0.9601 (-0.85z)| lr 1.19e-04 | 321.53 ms | 52.5% bf16 MFU | 1633330 tok/s step 140/19560 | loss 6.652607 (-1.28z)| norm 0.9474 (-0.88z)| lr 1.20e-04 | 321.25 ms | 52.5% bf16 MFU | 1633266 tok/s step 141/19560 | loss 6.644627 (-1.27z)| norm 1.1908 (-0.41z)| lr 1.21e-04 | 322.03 ms | 52.4% bf16 MFU | 1633006 tok/s step 142/19560 | loss 6.658344 (-1.24z)| norm 1.1292 (-0.52z)| lr 1.22e-04 | 321.54 ms | 52.5% bf16 MFU | 1632883 tok/s step 143/19560 | loss 6.652598 (-1.23z)| norm 0.5062 (-1.75z)| lr 1.23e-04 | 321.09 ms | 52.6% bf16 MFU | 1632880 tok/s step 144/19560 | loss 6.658814 (-1.21z)| norm 0.7841 (-1.17z)| lr 1.23e-04 | 321.29 ms | 52.5% bf16 MFU | 1632826 tok/s step 145/19560 | loss 6.629550 (-1.23z)| norm 0.8909 (-0.94z)| lr 1.24e-04 | 320.94 ms | 52.6% bf16 MFU | 1632864 tok/s step 146/19560 | loss 6.678525 (-1.16z)| norm 0.9792 (-0.75z)| lr 1.25e-04 | 321.88 ms | 52.4% bf16 MFU | 1632663 tok/s step 147/19560 | loss 6.642830 (-1.19z)| norm 1.1263 (-0.43z)| lr 1.26e-04 | 321.43 ms | 52.5% bf16 MFU | 1632584 tok/s step 148/19560 | loss 6.639793 (-1.18z)| norm 1.0252 (-0.63z)| lr 1.27e-04 | 322.25 ms | 52.4% bf16 MFU | 1632303 tok/s step 149/19560 | loss 6.615000 (-1.19z)| norm 0.7926 (-1.11z)| lr 1.28e-04 | 321.73 ms | 52.5% bf16 MFU | 1632168 tok/s step 150/19560 | loss 6.670747 (-1.12z)| norm 0.7933 (-1.09z)| lr 1.29e-04 | 321.68 ms | 52.5% bf16 MFU | 1632052 tok/s step 151/19560 | loss 6.594179 (-1.19z)| norm 1.4163 (+0.26z)| lr 1.29e-04 | 322.05 ms | 52.4% bf16 MFU | 1631847 tok/s step 152/19560 | loss 6.517328 (-1.27z)| 
norm 1.1646 (-0.27z)| lr 1.30e-04 | 321.29 ms | 52.5% bf16 MFU | 1631845 tok/s step 153/19560 | loss 6.656227 (-1.10z)| norm 1.0052 (-0.61z)| lr 1.31e-04 | 320.75 ms | 52.6% bf16 MFU | 1631982 tok/s step 154/19560 | loss 6.594466 (-1.16z)| norm 1.1414 (-0.30z)| lr 1.32e-04 | 321.37 ms | 52.5% bf16 MFU | 1631954 tok/s step 155/19560 | loss 6.581488 (-1.16z)| norm 1.1147 (-0.34z)| lr 1.33e-04 | 321.14 ms | 52.6% bf16 MFU | 1631985 tok/s step 156/19560 | loss 6.626822 (-1.10z)| norm 0.8736 (-0.88z)| lr 1.34e-04 | 321.82 ms | 52.4% bf16 MFU | 1631842 tok/s step 157/19560 | loss 6.552612 (-1.18z)| norm 0.7725 (-1.11z)| lr 1.35e-04 | 322.66 ms | 52.3% bf16 MFU | 1631495 tok/s step 158/19560 | loss 6.562229 (-1.16z)| norm 0.6758 (-1.33z)| lr 1.35e-04 | 321.14 ms | 52.6% bf16 MFU | 1631550 tok/s step 159/19560 | loss 6.556561 (-1.15z)| norm 0.8551 (-0.89z)| lr 1.36e-04 | 322.11 ms | 52.4% bf16 MFU | 1631356 tok/s step 160/19560 | loss 6.534228 (-1.17z)| norm 0.7214 (-1.21z)| lr 1.37e-04 | 322.40 ms | 52.3% bf16 MFU | 1631098 tok/s step 161/19560 | loss 6.581291 (-1.10z)| norm 0.7718 (-1.10z)| lr 1.38e-04 | 321.83 ms | 52.4% bf16 MFU | 1630997 tok/s step 162/19560 | loss 6.509456 (-1.18z)| norm 0.6473 (-1.43z)| lr 1.39e-04 | 321.96 ms | 52.4% bf16 MFU | 1630868 tok/s step 163/19560 | loss 6.475272 (-1.22z)| norm 0.7170 (-1.23z)| lr 1.40e-04 | 323.06 ms | 52.2% bf16 MFU | 1630468 tok/s step 164/19560 | loss 6.458817 (-1.23z)| norm 0.5773 (-1.63z)| lr 1.41e-04 | 321.15 ms | 52.6% bf16 MFU | 1630570 tok/s step 165/19560 | loss 6.487965 (-1.18z)| norm 0.6924 (-1.29z)| lr 1.41e-04 | 322.18 ms | 52.4% bf16 MFU | 1630408 tok/s step 166/19560 | loss 6.474494 (-1.19z)| norm 0.6740 (-1.34z)| lr 1.42e-04 | 321.59 ms | 52.5% bf16 MFU | 1630404 tok/s step 167/19560 | loss 6.484888 (-1.17z)| norm 1.0290 (-0.30z)| lr 1.43e-04 | 321.52 ms | 52.5% bf16 MFU | 1630416 tok/s step 168/19560 | loss 6.452101 (-1.20z)| norm 1.4963 (+1.10z)| lr 1.44e-04 | 321.99 ms | 52.4% bf16 MFU | 1630309 tok/s 
step 169/19560 | loss 6.499334 (-1.13z)| norm 0.6267 (-1.49z)| lr 1.45e-04 | 321.46 ms | 52.5% bf16 MFU | 1630342 tok/s step 170/19560 | loss 6.452170 (-1.19z)| norm 1.0261 (-0.27z)| lr 1.46e-04 | 321.12 ms | 52.6% bf16 MFU | 1630458 tok/s step 171/19560 | loss 6.440082 (-1.20z)| norm 1.7258 (+1.87z)| lr 1.47e-04 | 321.48 ms | 52.5% bf16 MFU | 1630477 tok/s step 172/19560 | loss 6.531271 (-1.05z)| norm 0.8970 (-0.65z)| lr 1.47e-04 | 322.19 ms | 52.4% bf16 MFU | 1630316 tok/s step 173/19560 | loss 6.446108 (-1.18z)| norm 1.4760 (+1.15z)| lr 1.48e-04 | 321.31 ms | 52.5% bf16 MFU | 1630385 tok/s step 174/19560 | loss 6.417402 (-1.21z)| norm 1.2472 (+0.45z)| lr 1.49e-04 | 322.12 ms | 52.4% bf16 MFU | 1630246 tok/s step 175/19560 | loss 6.424203 (-1.19z)| norm 0.6903 (-1.28z)| lr 1.50e-04 | 322.23 ms | 52.4% bf16 MFU | 1630088 tok/s step 176/19560 | loss 6.437121 (-1.17z)| norm 0.9103 (-0.57z)| lr 1.51e-04 | 321.07 ms | 52.6% bf16 MFU | 1630230 tok/s step 177/19560 | loss 6.487861 (-1.07z)| norm 1.0471 (-0.13z)| lr 1.52e-04 | 322.70 ms | 52.3% bf16 MFU | 1629952 tok/s step 178/19560 | loss 6.400074 (-1.21z)| norm 0.9577 (-0.40z)| lr 1.53e-04 | 321.99 ms | 52.4% bf16 MFU | 1629867 tok/s step 179/19560 | loss 6.445080 (-1.13z)| norm 1.0410 (-0.12z)| lr 1.53e-04 | 322.78 ms | 52.3% bf16 MFU | 1629589 tok/s step 180/19560 | loss 6.441687 (-1.13z)| norm 0.9112 (-0.53z)| lr 1.54e-04 | 322.27 ms | 52.4% bf16 MFU | 1629451 tok/s step 181/19560 | loss 6.424420 (-1.15z)| norm 0.9778 (-0.30z)| lr 1.55e-04 | 321.78 ms | 52.4% bf16 MFU | 1629445 tok/s step 182/19560 | loss 6.539721 (-0.94z)| norm 1.1865 (+0.43z)| lr 1.56e-04 | 322.22 ms | 52.4% bf16 MFU | 1629329 tok/s step 183/19560 | loss 6.459341 (-1.08z)| norm 1.0446 (-0.05z)| lr 1.57e-04 | 322.00 ms | 52.4% bf16 MFU | 1629275 tok/s step 184/19560 | loss 6.407334 (-1.17z)| norm 1.1060 (+0.18z)| lr 1.58e-04 | 322.13 ms | 52.4% bf16 MFU | 1629190 tok/s step 185/19560 | loss 6.395887 (-1.19z)| norm 0.8576 (-0.68z)| lr 1.59e-04 | 
322.12 ms | 52.4% bf16 MFU | 1629112 tok/s step 186/19560 | loss 6.423013 (-1.13z)| norm 1.1102 (+0.24z)| lr 1.59e-04 | 322.30 ms | 52.4% bf16 MFU | 1628992 tok/s step 187/19560 | loss 6.426555 (-1.11z)| norm 0.8073 (-0.86z)| lr 1.60e-04 | 322.12 ms | 52.4% bf16 MFU | 1628924 tok/s step 188/19560 | loss 6.348249 (-1.27z)| norm 1.0249 (-0.03z)| lr 1.61e-04 | 322.08 ms | 52.4% bf16 MFU | 1628869 tok/s step 189/19560 | loss 6.347377 (-1.26z)| norm 1.0403 (+0.05z)| lr 1.62e-04 | 321.72 ms | 52.5% bf16 MFU | 1628907 tok/s step 190/19560 | loss 6.365944 (-1.22z)| norm 0.9139 (-0.43z)| lr 1.63e-04 | 323.12 ms | 52.2% bf16 MFU | 1628590 tok/s step 191/19560 | loss 6.328228 (-1.29z)| norm 0.8223 (-0.78z)| lr 1.64e-04 | 322.48 ms | 52.3% bf16 MFU | 1628450 tok/s step 192/19560 | loss 6.335621 (-1.27z)| norm 0.8242 (-0.76z)| lr 1.65e-04 | 321.83 ms | 52.4% bf16 MFU | 1628481 tok/s step 193/19560 | loss 6.367826 (-1.20z)| norm 0.8237 (-0.75z)| lr 1.65e-04 | 321.93 ms | 52.4% bf16 MFU | 1628485 tok/s step 194/19560 | loss 6.320416 (-1.30z)| norm 0.9747 (-0.12z)| lr 1.66e-04 | 321.48 ms | 52.5% bf16 MFU | 1628604 tok/s step 195/19560 | loss 6.265522 (-1.41z)| norm 0.9921 (-0.04z)| lr 1.67e-04 | 322.17 ms | 52.4% bf16 MFU | 1628542 tok/s step 196/19560 | loss 6.387854 (-1.12z)| norm 0.8391 (-0.66z)| lr 1.68e-04 | 321.37 ms | 52.5% bf16 MFU | 1628685 tok/s step 197/19560 | loss 6.310363 (-1.30z)| norm 0.8593 (-0.57z)| lr 1.69e-04 | 321.74 ms | 52.5% bf16 MFU | 1628727 tok/s step 198/19560 | loss 6.339876 (-1.23z)| norm 0.7897 (-0.84z)| lr 1.70e-04 | 321.21 ms | 52.5% bf16 MFU | 1628901 tok/s step 199/19560 | loss 6.329884 (-1.25z)| norm 0.8923 (-0.40z)| lr 1.71e-04 | 322.06 ms | 52.4% bf16 MFU | 1628851 tok/s step 200/19560 | loss 6.306967 (-1.30z)| norm 0.9885 (+0.01z)| lr 1.71e-04 | 321.59 ms | 52.5% bf16 MFU | 1628925 tok/s step 201/19560 | loss 6.368891 (-1.13z)| norm 1.2945 (+1.29z)| lr 1.72e-04 | 321.45 ms | 52.5% bf16 MFU | 1629029 tok/s step 202/19560 | loss 6.333399 
(-1.22z)| norm 0.7737 (-0.89z)| lr 1.73e-04 | 321.73 ms | 52.5% bf16 MFU | 1629058 tok/s step 203/19560 | loss 6.379382 (-1.09z)| norm 0.7853 (-0.83z)| lr 1.74e-04 | 321.36 ms | 52.5% bf16 MFU | 1629178 tok/s step 204/19560 | loss 6.334266 (-1.20z)| norm 0.9555 (-0.09z)| lr 1.75e-04 | 321.97 ms | 52.4% bf16 MFU | 1629138 tok/s step 205/19560 | loss 6.385811 (-1.06z)| norm 1.0454 (+0.33z)| lr 1.76e-04 | 321.52 ms | 52.5% bf16 MFU | 1629215 tok/s step 206/19560 | loss 6.331663 (-1.20z)| norm 1.0411 (+0.31z)| lr 1.77e-04 | 322.27 ms | 52.4% bf16 MFU | 1629098 tok/s step 207/19560 | loss 6.254904 (-1.42z)| norm 0.8483 (-0.54z)| lr 1.77e-04 | 322.25 ms | 52.4% bf16 MFU | 1628990 tok/s step 208/19560 | loss 6.333177 (-1.18z)| norm 0.7313 (-1.06z)| lr 1.78e-04 | 322.78 ms | 52.3% bf16 MFU | 1628754 tok/s step 209/19560 | loss 6.311634 (-1.24z)| norm 0.7074 (-1.15z)| lr 1.79e-04 | 321.87 ms | 52.4% bf16 MFU | 1628760 tok/s step 210/19560 | loss 6.310075 (-1.24z)| norm 0.8357 (-0.56z)| lr 1.80e-04 | 322.22 ms | 52.4% bf16 MFU | 1628676 tok/s step 211/19560 | loss 6.281957 (-1.31z)| norm 0.9084 (-0.22z)| lr 1.81e-04 | 322.38 ms | 52.4% bf16 MFU | 1628558 tok/s step 212/19560 | loss 6.355189 (-1.08z)| norm 0.9187 (-0.17z)| lr 1.82e-04 | 322.39 ms | 52.4% bf16 MFU | 1628443 tok/s step 213/19560 | loss 6.285794 (-1.29z)| norm 0.9755 (+0.11z)| lr 1.83e-04 | 322.19 ms | 52.4% bf16 MFU | 1628385 tok/s step 214/19560 | loss 6.374501 (-0.99z)| norm 1.1681 (+1.01z)| lr 1.83e-04 | 322.23 ms | 52.4% bf16 MFU | 1628318 tok/s step 215/19560 | loss 6.268120 (-1.34z)| norm 0.7327 (-1.03z)| lr 1.84e-04 | 321.87 ms | 52.4% bf16 MFU | 1628346 tok/s step 216/19560 | loss 6.324082 (-1.14z)| norm 0.8923 (-0.27z)| lr 1.85e-04 | 322.16 ms | 52.4% bf16 MFU | 1628299 tok/s step 217/19560 | loss 6.301116 (-1.21z)| norm 0.8987 (-0.24z)| lr 1.86e-04 | 322.55 ms | 52.3% bf16 MFU | 1628157 tok/s step 218/19560 | loss 6.297749 (-1.21z)| norm 0.9448 (-0.02z)| lr 1.87e-04 | 322.24 ms | 52.4% bf16 MFU | 
1628101 tok/s step 219/19560 | loss 6.276949 (-1.28z)| norm 0.9657 (+0.08z)| lr 1.88e-04 | 322.04 ms | 52.4% bf16 MFU | 1628098 tok/s step 220/19560 | loss 6.189223 (-1.58z)| norm 0.8820 (-0.31z)| lr 1.89e-04 | 322.11 ms | 52.4% bf16 MFU | 1628075 tok/s step 221/19560 | loss 6.288687 (-1.21z)| norm 0.7327 (-1.00z)| lr 1.89e-04 | 322.37 ms | 52.4% bf16 MFU | 1627990 tok/s step 222/19560 | loss 6.287546 (-1.20z)| norm 0.7229 (-1.05z)| lr 1.90e-04 | 321.98 ms | 52.4% bf16 MFU | 1628006 tok/s step 223/19560 | loss 6.232458 (-1.40z)| norm 0.7927 (-0.71z)| lr 1.91e-04 | 322.53 ms | 52.3% bf16 MFU | 1627882 tok/s step 224/19560 | loss 6.302462 (-1.12z)| norm 0.9380 (-0.00z)| lr 1.92e-04 | 321.88 ms | 52.4% bf16 MFU | 1627929 tok/s step 225/19560 | loss 6.274525 (-1.22z)| norm 0.9228 (-0.08z)| lr 1.93e-04 | 321.83 ms | 52.4% bf16 MFU | 1627986 tok/s step 226/19560 | loss 6.173187 (-1.60z)| norm 0.8746 (-0.32z)| lr 1.94e-04 | 322.07 ms | 52.4% bf16 MFU | 1627981 tok/s step 227/19560 | loss 6.202115 (-1.48z)| norm 0.8800 (-0.29z)| lr 1.95e-04 | 321.38 ms | 52.5% bf16 MFU | 1628149 tok/s step 228/19560 | loss 6.251428 (-1.27z)| norm 0.6701 (-1.31z)| lr 1.95e-04 | 322.09 ms | 52.4% bf16 MFU | 1628129 tok/s step 229/19560 | loss 6.182144 (-1.53z)| norm 0.5626 (-1.79z)| lr 1.96e-04 | 323.25 ms | 52.2% bf16 MFU | 1627820 tok/s step 230/19560 | loss 6.197141 (-1.46z)| norm 0.7991 (-0.66z)| lr 1.97e-04 | 321.94 ms | 52.4% bf16 MFU | 1627856 tok/s step 231/19560 | loss 6.207821 (-1.40z)| norm 1.1295 (+0.90z)| lr 1.98e-04 | 321.42 ms | 52.5% bf16 MFU | 1628021 tok/s step 232/19560 | loss 6.206169 (-1.40z)| norm 1.2466 (+1.44z)| lr 1.99e-04 | 321.84 ms | 52.4% bf16 MFU | 1628073 tok/s step 233/19560 | loss 6.121982 (-1.74z)| norm 0.9204 (-0.10z)| lr 2.00e-04 | 321.96 ms | 52.4% bf16 MFU | 1628090 tok/s step 234/19560 | loss 6.214460 (-1.33z)| norm 0.9283 (-0.07z)| lr 2.01e-04 | 321.81 ms | 52.4% bf16 MFU | 1628145 tok/s step 235/19560 | loss 6.196240 (-1.39z)| norm 0.8991 (-0.21z)| lr 
2.01e-04 | 321.47 ms | 52.5% bf16 MFU | 1628282 tok/s step 236/19560 | loss 6.193775 (-1.39z)| norm 0.8129 (-0.62z)| lr 2.02e-04 | 322.15 ms | 52.4% bf16 MFU | 1628240 tok/s step 237/19560 | loss 6.242153 (-1.17z)| norm 0.8246 (-0.56z)| lr 2.03e-04 | 322.26 ms | 52.4% bf16 MFU | 1628173 tok/s step 238/19560 | loss 6.159338 (-1.53z)| norm 1.0368 (+0.45z)| lr 2.04e-04 | 322.55 ms | 52.3% bf16 MFU | 1628036 tok/s step 239/19560 | loss 6.141267 (-1.59z)| norm 1.0644 (+0.58z)| lr 2.05e-04 | 322.32 ms | 52.4% bf16 MFU | 1627964 tok/s step 240/19560 | loss 6.128356 (-1.62z)| norm 1.0487 (+0.49z)| lr 2.06e-04 | 322.02 ms | 52.4% bf16 MFU | 1627972 tok/s step 241/19560 | loss 6.204836 (-1.27z)| norm 0.8554 (-0.45z)| lr 2.07e-04 | 322.34 ms | 52.4% bf16 MFU | 1627897 tok/s step 242/19560 | loss 6.093307 (-1.75z)| norm 0.9986 (+0.24z)| lr 2.07e-04 | 322.04 ms | 52.4% bf16 MFU | 1627904 tok/s step 243/19560 | loss 6.099175 (-1.71z)| norm 1.0532 (+0.53z)| lr 2.08e-04 | 321.54 ms | 52.5% bf16 MFU | 1628037 tok/s step 244/19560 | loss 6.157618 (-1.42z)| norm 0.7963 (-0.73z)| lr 2.09e-04 | 322.37 ms | 52.4% bf16 MFU | 1627954 tok/s step 245/19560 | loss 6.210825 (-1.17z)| norm 0.6617 (-1.39z)| lr 2.10e-04 | 322.94 ms | 52.3% bf16 MFU | 1627731 tok/s step 246/19560 | loss 6.159844 (-1.40z)| norm 0.6581 (-1.39z)| lr 2.11e-04 | 322.41 ms | 52.3% bf16 MFU | 1627652 tok/s step 247/19560 | loss 6.195800 (-1.21z)| norm 0.5860 (-1.72z)| lr 2.12e-04 | 322.29 ms | 52.4% bf16 MFU | 1627608 tok/s step 248/19560 | loss 6.181733 (-1.27z)| norm 0.6444 (-1.41z)| lr 2.13e-04 | 321.94 ms | 52.4% bf16 MFU | 1627653 tok/s step 249/19560 | loss 6.091333 (-1.70z)| norm 0.6953 (-1.15z)| lr 2.13e-04 | 322.52 ms | 52.3% bf16 MFU | 1627550 tok/s step 250/19560 | loss 6.188012 (-1.21z)| norm 0.8369 (-0.46z)| lr 2.14e-04 | 321.96 ms | 52.4% bf16 MFU | 1627595 tok/s val loss 6.173676 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating 
HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating HellaSwag: 70/79 HellaSwag: 2475/10042 = 0.246465 step 251/19560 | loss 6.181686 (-1.24z)| norm 1.3707 (+2.09z)| lr 2.15e-04 | 321.77 ms | 52.5% bf16 MFU | 1627686 tok/s step 252/19560 | loss 6.139812 (-1.44z)| norm 0.8852 (-0.22z)| lr 2.16e-04 | 322.01 ms | 52.4% bf16 MFU | 1627709 tok/s step 253/19560 | loss 6.184560 (-1.19z)| norm 0.9596 (+0.15z)| lr 2.17e-04 | 322.66 ms | 52.3% bf16 MFU | 1627569 tok/s step 254/19560 | loss 6.075264 (-1.75z)| norm 0.8222 (-0.51z)| lr 2.18e-04 | 321.45 ms | 52.5% bf16 MFU | 1627741 tok/s step 255/19560 | loss 6.159996 (-1.30z)| norm 0.8925 (-0.15z)| lr 2.19e-04 | 322.19 ms | 52.4% bf16 MFU | 1627716 tok/s step 256/19560 | loss 6.182535 (-1.16z)| norm 1.0928 (+0.88z)| lr 2.19e-04 | 322.73 ms | 52.3% bf16 MFU | 1627556 tok/s step 257/19560 | loss 6.129143 (-1.43z)| norm 1.1002 (+0.91z)| lr 2.20e-04 | 321.68 ms | 52.5% bf16 MFU | 1627670 tok/s step 258/19560 | loss 6.128068 (-1.43z)| norm 0.8288 (-0.48z)| lr 2.21e-04 | 322.04 ms | 52.4% bf16 MFU | 1627689 tok/s step 259/19560 | loss 6.116661 (-1.48z)| norm 0.6254 (-1.50z)| lr 2.22e-04 | 322.20 ms | 52.4% bf16 MFU | 1627666 tok/s step 260/19560 | loss 6.054629 (-1.79z)| norm 0.5884 (-1.66z)| lr 2.23e-04 | 322.45 ms | 52.3% bf16 MFU | 1627580 tok/s step 261/19560 | loss 6.157582 (-1.20z)| norm 0.5724 (-1.71z)| lr 2.24e-04 | 322.17 ms | 52.4% bf16 MFU | 1627570 tok/s step 262/19560 | loss 6.116055 (-1.42z)| norm 0.6775 (-1.17z)| lr 2.25e-04 | 321.80 ms | 52.4% bf16 MFU | 1627652 tok/s step 263/19560 | loss 6.060465 (-1.72z)| norm 0.6570 (-1.25z)| lr 2.25e-04 | 323.25 ms | 52.2% bf16 MFU | 1627367 tok/s step 264/19560 | loss 6.059302 (-1.70z)| norm 0.6723 (-1.16z)| lr 2.26e-04 | 322.49 ms | 52.3% bf16 MFU | 1627286 tok/s step 265/19560 | loss 6.071833 (-1.61z)| norm 0.7637 (-0.70z)| lr 2.27e-04 | 322.53 ms | 52.3% bf16 MFU | 1627199 tok/s step 266/19560 | loss 6.091773 (-1.48z)| norm 0.7208 (-0.91z)| lr 
2.28e-04 | 321.96 ms | 52.4% bf16 MFU | 1627259 tok/s step 267/19560 | loss 6.132460 (-1.23z)| norm 0.8663 (-0.19z)| lr 2.29e-04 | 321.92 ms | 52.4% bf16 MFU | 1627328 tok/s step 268/19560 | loss 6.049953 (-1.69z)| norm 1.1055 (+0.97z)| lr 2.30e-04 | 322.40 ms | 52.3% bf16 MFU | 1627272 tok/s step 269/19560 | loss 6.144701 (-1.12z)| norm 0.8932 (-0.06z)| lr 2.31e-04 | 321.92 ms | 52.4% bf16 MFU | 1627340 tok/s step 270/19560 | loss 6.120766 (-1.24z)| norm 0.7558 (-0.72z)| lr 2.31e-04 | 321.71 ms | 52.5% bf16 MFU | 1627458 tok/s step 271/19560 | loss 6.054239 (-1.62z)| norm 0.9668 (+0.31z)| lr 2.32e-04 | 322.40 ms | 52.3% bf16 MFU | 1627394 tok/s step 272/19560 | loss 6.109072 (-1.28z)| norm 1.0310 (+0.62z)| lr 2.33e-04 | 321.41 ms | 52.5% bf16 MFU | 1627584 tok/s step 273/19560 | loss 6.048799 (-1.62z)| norm 0.9505 (+0.21z)| lr 2.34e-04 | 322.86 ms | 52.3% bf16 MFU | 1627400 tok/s step 274/19560 | loss 6.159984 (-0.94z)| norm 1.2466 (+1.67z)| lr 2.35e-04 | 322.22 ms | 52.4% bf16 MFU | 1627384 tok/s step 275/19560 | loss 6.051655 (-1.59z)| norm 1.0267 (+0.58z)| lr 2.36e-04 | 322.84 ms | 52.3% bf16 MFU | 1627216 tok/s step 276/19560 | loss 6.100168 (-1.27z)| norm 0.9614 (+0.26z)| lr 2.37e-04 | 322.05 ms | 52.4% bf16 MFU | 1627254 tok/s step 277/19560 | loss 6.010573 (-1.81z)| norm 0.8595 (-0.25z)| lr 2.37e-04 | 321.57 ms | 52.5% bf16 MFU | 1627410 tok/s step 278/19560 | loss 6.079008 (-1.37z)| norm 0.7606 (-0.74z)| lr 2.38e-04 | 322.03 ms | 52.4% bf16 MFU | 1627444 tok/s step 279/19560 | loss 6.088156 (-1.30z)| norm 0.7010 (-1.03z)| lr 2.39e-04 | 322.63 ms | 52.3% bf16 MFU | 1627323 tok/s step 280/19560 | loss 6.010752 (-1.76z)| norm 0.6840 (-1.10z)| lr 2.40e-04 | 322.44 ms | 52.3% bf16 MFU | 1627258 tok/s step 281/19560 | loss 6.057764 (-1.45z)| norm 0.7041 (-0.98z)| lr 2.41e-04 | 322.77 ms | 52.3% bf16 MFU | 1627113 tok/s step 282/19560 | loss 6.135691 (-0.93z)| norm 0.7332 (-0.82z)| lr 2.42e-04 | 321.71 ms | 52.5% bf16 MFU | 1627241 tok/s step 283/19560 | loss 
6.088119 (-1.23z)| norm 0.7690 (-0.62z)| lr 2.43e-04 | 321.92 ms | 52.4% bf16 MFU | 1627311 tok/s step 284/19560 | loss 6.051684 (-1.47z)| norm 0.7286 (-0.82z)| lr 2.43e-04 | 322.81 ms | 52.3% bf16 MFU | 1627152 tok/s step 285/19560 | loss 6.045553 (-1.49z)| norm 0.7824 (-0.55z)| lr 2.44e-04 | 322.45 ms | 52.3% bf16 MFU | 1627093 tok/s step 286/19560 | loss 6.019584 (-1.64z)| norm 0.7575 (-0.68z)| lr 2.45e-04 | 322.46 ms | 52.3% bf16 MFU | 1627034 tok/s step 287/19560 | loss 6.027895 (-1.57z)| norm 0.6522 (-1.20z)| lr 2.46e-04 | 322.36 ms | 52.4% bf16 MFU | 1627002 tok/s step 288/19560 | loss 6.005795 (-1.70z)| norm 0.6869 (-1.03z)| lr 2.47e-04 | 322.12 ms | 52.4% bf16 MFU | 1627034 tok/s step 289/19560 | loss 6.048473 (-1.39z)| norm 0.8421 (-0.24z)| lr 2.48e-04 | 322.61 ms | 52.3% bf16 MFU | 1626940 tok/s step 290/19560 | loss 6.013555 (-1.61z)| norm 0.8975 (+0.03z)| lr 2.49e-04 | 322.72 ms | 52.3% bf16 MFU | 1626822 tok/s step 291/19560 | loss 6.002988 (-1.66z)| norm 1.0161 (+0.63z)| lr 2.49e-04 | 323.01 ms | 52.2% bf16 MFU | 1626637 tok/s step 292/19560 | loss 6.070134 (-1.18z)| norm 1.2183 (+1.64z)| lr 2.50e-04 | 322.58 ms | 52.3% bf16 MFU | 1626569 tok/s step 293/19560 | loss 5.941104 (-2.04z)| norm 0.9722 (+0.37z)| lr 2.51e-04 | 322.26 ms | 52.4% bf16 MFU | 1626586 tok/s step 294/19560 | loss 6.023878 (-1.44z)| norm 0.9237 (+0.11z)| lr 2.52e-04 | 322.48 ms | 52.3% bf16 MFU | 1626547 tok/s step 295/19560 | loss 6.084309 (-1.01z)| norm 1.0126 (+0.57z)| lr 2.53e-04 | 322.92 ms | 52.3% bf16 MFU | 1626400 tok/s step 296/19560 | loss 5.997696 (-1.59z)| norm 1.4799 (+2.99z)| lr 2.54e-04 | 322.07 ms | 52.4% bf16 MFU | 1626474 tok/s step 297/19560 | loss 6.020942 (-1.41z)| norm 0.8512 (-0.28z)| lr 2.55e-04 | 322.16 ms | 52.4% bf16 MFU | 1626520 tok/s step 298/19560 | loss 5.929161 (-2.02z)| norm 0.7992 (-0.54z)| lr 2.55e-04 | 323.61 ms | 52.2% bf16 MFU | 1626200 tok/s step 299/19560 | loss 6.036056 (-1.25z)| norm 1.1161 (+1.23z)| lr 2.56e-04 | 321.86 ms | 52.4% bf16 
MFU | 1626336 tok/s step 300/19560 | loss 6.007178 (-1.44z)| norm 0.9531 (+0.31z)| lr 2.57e-04 | 322.77 ms | 52.3% bf16 MFU | 1626236 tok/s step 301/19560 | loss 5.997185 (-1.50z)| norm 0.7145 (-1.04z)| lr 2.58e-04 | 322.96 ms | 52.3% bf16 MFU | 1626094 tok/s step 302/19560 | loss 6.000340 (-1.45z)| norm 0.6340 (-1.49z)| lr 2.59e-04 | 322.14 ms | 52.4% bf16 MFU | 1626166 tok/s step 303/19560 | loss 5.917272 (-2.01z)| norm 0.5613 (-1.90z)| lr 2.60e-04 | 322.68 ms | 52.3% bf16 MFU | 1626098 tok/s step 304/19560 | loss 5.971196 (-1.60z)| norm 0.6630 (-1.28z)| lr 2.61e-04 | 322.64 ms | 52.3% bf16 MFU | 1626043 tok/s step 305/19560 | loss 6.011260 (-1.30z)| norm 0.5642 (-1.82z)| lr 2.61e-04 | 322.08 ms | 52.4% bf16 MFU | 1626131 tok/s step 306/19560 | loss 5.988555 (-1.44z)| norm 0.7029 (-1.01z)| lr 2.62e-04 | 322.52 ms | 52.3% bf16 MFU | 1626104 tok/s step 307/19560 | loss 5.982687 (-1.47z)| norm 1.0402 (+0.92z)| lr 2.63e-04 | 323.67 ms | 52.1% bf16 MFU | 1625789 tok/s step 308/19560 | loss 6.012085 (-1.24z)| norm 1.2939 (+2.31z)| lr 2.64e-04 | 322.17 ms | 52.4% bf16 MFU | 1625868 tok/s step 309/19560 | loss 5.988081 (-1.40z)| norm 1.0314 (+0.83z)| lr 2.65e-04 | 322.46 ms | 52.3% bf16 MFU | 1625869 tok/s step 310/19560 | loss 6.205850 (+0.24z)| norm 1.4755 (+3.21z)| lr 2.66e-04 | 322.96 ms | 52.3% bf16 MFU | 1625745 tok/s step 311/19560 | loss 5.974777 (-1.50z)| norm 0.9991 (+0.62z)| lr 2.67e-04 | 322.30 ms | 52.4% bf16 MFU | 1625794 tok/s step 312/19560 | loss 6.027915 (-1.08z)| norm 1.4537 (+2.99z)| lr 2.67e-04 | 322.09 ms | 52.4% bf16 MFU | 1625892 tok/s step 313/19560 | loss 6.027864 (-1.07z)| norm 1.0915 (+1.06z)| lr 2.68e-04 | 322.48 ms | 52.3% bf16 MFU | 1625886 tok/s step 314/19560 | loss 6.054132 (-0.85z)| norm 1.5860 (+3.48z)| lr 2.69e-04 | 322.28 ms | 52.4% bf16 MFU | 1625933 tok/s step 315/19560 | loss 6.008661 (-1.20z)| norm 1.1098 (+1.07z)| lr 2.70e-04 | 322.37 ms | 52.4% bf16 MFU | 1625954 tok/s step 316/19560 | loss 5.953825 (-1.61z)| norm 1.2184 
(+1.60z)| lr 2.71e-04 | 322.21 ms | 52.4% bf16 MFU | 1626015 tok/s step 317/19560 | loss 5.939453 (-1.70z)| norm 0.9028 (+0.04z)| lr 2.72e-04 | 321.88 ms | 52.4% bf16 MFU | 1626156 tok/s step 318/19560 | loss 5.965785 (-1.47z)| norm 0.8877 (-0.04z)| lr 2.73e-04 | 321.81 ms | 52.4% bf16 MFU | 1626307 tok/s step 319/19560 | loss 5.965137 (-1.45z)| norm 0.6572 (-1.17z)| lr 2.73e-04 | 321.67 ms | 52.5% bf16 MFU | 1626487 tok/s step 320/19560 | loss 5.891699 (-2.00z)| norm 0.6046 (-1.41z)| lr 2.74e-04 | 321.98 ms | 52.4% bf16 MFU | 1626578 tok/s step 321/19560 | loss 5.950617 (-1.51z)| norm 0.6766 (-1.05z)| lr 2.75e-04 | 322.31 ms | 52.4% bf16 MFU | 1626581 tok/s step 322/19560 | loss 5.905738 (-1.83z)| norm 0.8218 (-0.33z)| lr 2.76e-04 | 321.98 ms | 52.4% bf16 MFU | 1626669 tok/s step 323/19560 | loss 5.877674 (-2.01z)| norm 1.0206 (+0.64z)| lr 2.77e-04 | 322.89 ms | 52.3% bf16 MFU | 1626522 tok/s step 324/19560 | loss 5.943759 (-1.47z)| norm 1.2408 (+1.68z)| lr 2.78e-04 | 322.46 ms | 52.3% bf16 MFU | 1626492 tok/s step 325/19560 | loss 5.825142 (-2.35z)| norm 1.0034 (+0.53z)| lr 2.79e-04 | 322.08 ms | 52.4% bf16 MFU | 1626558 tok/s step 326/19560 | loss 5.931548 (-1.50z)| norm 1.1801 (+1.36z)| lr 2.79e-04 | 322.97 ms | 52.3% bf16 MFU | 1626397 tok/s step 327/19560 | loss 5.898881 (-1.72z)| norm 0.9803 (+0.39z)| lr 2.80e-04 | 322.37 ms | 52.4% bf16 MFU | 1626394 tok/s step 328/19560 | loss 5.984981 (-1.03z)| norm 0.9396 (+0.20z)| lr 2.81e-04 | 323.40 ms | 52.2% bf16 MFU | 1626133 tok/s step 329/19560 | loss 5.902948 (-1.66z)| norm 1.0167 (+0.59z)| lr 2.82e-04 | 322.47 ms | 52.3% bf16 MFU | 1626119 tok/s step 330/19560 | loss 5.909916 (-1.58z)| norm 0.7868 (-0.53z)| lr 2.83e-04 | 322.83 ms | 52.3% bf16 MFU | 1626015 tok/s step 331/19560 | loss 5.963213 (-1.15z)| norm 0.8201 (-0.37z)| lr 2.84e-04 | 322.59 ms | 52.3% bf16 MFU | 1625976 tok/s step 332/19560 | loss 5.933629 (-1.37z)| norm 0.7109 (-0.89z)| lr 2.85e-04 | 321.78 ms | 52.4% bf16 MFU | 1626144 tok/s step 
333/19560 | loss 5.905851 (-1.58z)| norm 0.7739 (-0.57z)| lr 2.85e-04 | 321.82 ms | 52.4% bf16 MFU | 1626294 tok/s step 334/19560 | loss 5.841239 (-2.07z)| norm 0.7540 (-0.66z)| lr 2.86e-04 | 321.70 ms | 52.5% bf16 MFU | 1626466 tok/s step 335/19560 | loss 5.874666 (-1.76z)| norm 0.7758 (-0.55z)| lr 2.87e-04 | 321.89 ms | 52.4% bf16 MFU | 1626582 tok/s step 336/19560 | loss 5.880956 (-1.69z)| norm 0.9425 (+0.25z)| lr 2.88e-04 | 322.07 ms | 52.4% bf16 MFU | 1626646 tok/s step 337/19560 | loss 5.901791 (-1.50z)| norm 1.2648 (+1.78z)| lr 2.89e-04 | 322.73 ms | 52.3% bf16 MFU | 1626542 tok/s step 338/19560 | loss 5.881930 (-1.64z)| norm 0.9167 (+0.10z)| lr 2.90e-04 | 322.92 ms | 52.3% bf16 MFU | 1626395 tok/s step 339/19560 | loss 5.938679 (-1.16z)| norm 0.8419 (-0.26z)| lr 2.91e-04 | 322.04 ms | 52.4% bf16 MFU | 1626477 tok/s step 340/19560 | loss 5.840579 (-1.94z)| norm 0.9368 (+0.20z)| lr 2.91e-04 | 322.22 ms | 52.4% bf16 MFU | 1626508 tok/s step 341/19560 | loss 5.866185 (-1.70z)| norm 1.2356 (+1.61z)| lr 2.92e-04 | 322.86 ms | 52.3% bf16 MFU | 1626378 tok/s step 342/19560 | loss 5.862097 (-1.72z)| norm 0.7736 (-0.58z)| lr 2.93e-04 | 322.58 ms | 52.3% bf16 MFU | 1626324 tok/s step 343/19560 | loss 5.884152 (-1.52z)| norm 0.6846 (-1.00z)| lr 2.94e-04 | 322.41 ms | 52.3% bf16 MFU | 1626315 tok/s step 344/19560 | loss 5.827714 (-1.96z)| norm 0.9660 (+0.34z)| lr 2.95e-04 | 322.72 ms | 52.3% bf16 MFU | 1626229 tok/s step 345/19560 | loss 5.905993 (-1.29z)| norm 1.0111 (+0.55z)| lr 2.96e-04 | 322.11 ms | 52.4% bf16 MFU | 1626300 tok/s step 346/19560 | loss 5.890816 (-1.41z)| norm 0.7848 (-0.52z)| lr 2.97e-04 | 322.47 ms | 52.3% bf16 MFU | 1626277 tok/s step 347/19560 | loss 5.849085 (-1.74z)| norm 0.7704 (-0.58z)| lr 2.97e-04 | 323.83 ms | 52.1% bf16 MFU | 1625914 tok/s step 348/19560 | loss 5.842230 (-1.76z)| norm 1.0372 (+0.68z)| lr 2.98e-04 | 322.54 ms | 52.3% bf16 MFU | 1625893 tok/s step 349/19560 | loss 5.839679 (-1.76z)| norm 1.0092 (+0.54z)| lr 2.99e-04 | 323.64 
ms | 52.1% bf16 MFU | 1625596 tok/s step 350/19560 | loss 5.883446 (-1.37z)| norm 0.9541 (+0.27z)| lr 3.00e-04 | 322.59 ms | 52.3% bf16 MFU | 1625579 tok/s step 351/19560 | loss 5.845253 (-1.68z)| norm 0.7711 (-0.60z)| lr 3.01e-04 | 322.29 ms | 52.4% bf16 MFU | 1625637 tok/s step 352/19560 | loss 5.859211 (-1.54z)| norm 0.8240 (-0.35z)| lr 3.02e-04 | 322.74 ms | 52.3% bf16 MFU | 1625578 tok/s step 353/19560 | loss 5.884883 (-1.31z)| norm 0.7642 (-0.62z)| lr 3.03e-04 | 322.33 ms | 52.4% bf16 MFU | 1625628 tok/s step 354/19560 | loss 5.743085 (-2.48z)| norm 0.6273 (-1.26z)| lr 3.03e-04 | 322.08 ms | 52.4% bf16 MFU | 1625737 tok/s step 355/19560 | loss 5.807248 (-1.89z)| norm 0.7283 (-0.77z)| lr 3.04e-04 | 322.54 ms | 52.3% bf16 MFU | 1625724 tok/s step 356/19560 | loss 5.727894 (-2.50z)| norm 0.7883 (-0.50z)| lr 3.05e-04 | 322.95 ms | 52.3% bf16 MFU | 1625609 tok/s step 357/19560 | loss 5.849351 (-1.45z)| norm 0.9012 (+0.03z)| lr 3.06e-04 | 323.01 ms | 52.2% bf16 MFU | 1625486 tok/s step 358/19560 | loss 5.873186 (-1.23z)| norm 1.7094 (+3.65z)| lr 3.07e-04 | 322.21 ms | 52.4% bf16 MFU | 1625569 tok/s step 359/19560 | loss 5.796082 (-1.85z)| norm 1.0025 (+0.46z)| lr 3.08e-04 | 322.69 ms | 52.3% bf16 MFU | 1625528 tok/s step 360/19560 | loss 5.839208 (-1.47z)| norm 0.9790 (+0.36z)| lr 3.09e-04 | 322.26 ms | 52.4% bf16 MFU | 1625596 tok/s step 361/19560 | loss 5.858850 (-1.28z)| norm 0.8327 (-0.30z)| lr 3.09e-04 | 322.58 ms | 52.3% bf16 MFU | 1625580 tok/s step 362/19560 | loss 5.808546 (-1.68z)| norm 0.7634 (-0.62z)| lr 3.10e-04 | 322.28 ms | 52.4% bf16 MFU | 1625641 tok/s step 363/19560 | loss 5.806164 (-1.67z)| norm 0.9428 (+0.20z)| lr 3.11e-04 | 322.76 ms | 52.3% bf16 MFU | 1625578 tok/s step 364/19560 | loss 5.817901 (-1.55z)| norm 1.0145 (+0.53z)| lr 3.12e-04 | 322.21 ms | 52.4% bf16 MFU | 1625656 tok/s step 365/19560 | loss 5.770277 (-1.92z)| norm 0.9715 (+0.32z)| lr 3.13e-04 | 322.30 ms | 52.4% bf16 MFU | 1625710 tok/s step 366/19560 | loss 5.792306 (-1.71z)| 
norm 1.0671 (+0.76z)| lr 3.14e-04 | 322.15 ms | 52.4% bf16 MFU | 1625799 tok/s step 367/19560 | loss 5.720038 (-2.26z)| norm 0.9619 (+0.28z)| lr 3.15e-04 | 322.75 ms | 52.3% bf16 MFU | 1625730 tok/s step 368/19560 | loss 5.718963 (-2.21z)| norm 0.9665 (+0.31z)| lr 3.15e-04 | 323.02 ms | 52.2% bf16 MFU | 1625598 tok/s step 369/19560 | loss 5.764118 (-1.81z)| norm 1.0256 (+0.57z)| lr 3.16e-04 | 322.17 ms | 52.4% bf16 MFU | 1625685 tok/s step 370/19560 | loss 5.827168 (-1.27z)| norm 1.1665 (+1.21z)| lr 3.17e-04 | 322.54 ms | 52.3% bf16 MFU | 1625677 tok/s step 371/19560 | loss 5.822650 (-1.28z)| norm 0.8900 (-0.05z)| lr 3.18e-04 | 321.91 ms | 52.4% bf16 MFU | 1625828 tok/s step 372/19560 | loss 5.750629 (-1.84z)| norm 0.8913 (-0.05z)| lr 3.19e-04 | 322.61 ms | 52.3% bf16 MFU | 1625793 tok/s step 373/19560 | loss 5.785405 (-1.53z)| norm 0.6851 (-0.99z)| lr 3.20e-04 | 322.91 ms | 52.3% bf16 MFU | 1625686 tok/s step 374/19560 | loss 5.819055 (-1.24z)| norm 0.8535 (-0.23z)| lr 3.21e-04 | 322.57 ms | 52.3% bf16 MFU | 1625670 tok/s step 375/19560 | loss 5.713826 (-2.07z)| norm 0.8625 (-0.20z)| lr 3.21e-04 | 322.31 ms | 52.4% bf16 MFU | 1625718 tok/s step 376/19560 | loss 5.776162 (-1.54z)| norm 0.8518 (-0.26z)| lr 3.22e-04 | 322.94 ms | 52.3% bf16 MFU | 1625607 tok/s step 377/19560 | loss 5.761419 (-1.62z)| norm 0.9883 (+0.37z)| lr 3.23e-04 | 323.24 ms | 52.2% bf16 MFU | 1625427 tok/s step 378/19560 | loss 5.783203 (-1.43z)| norm 1.2551 (+1.59z)| lr 3.24e-04 | 322.50 ms | 52.3% bf16 MFU | 1625439 tok/s step 379/19560 | loss 5.756898 (-1.62z)| norm 0.8764 (-0.15z)| lr 3.25e-04 | 322.35 ms | 52.4% bf16 MFU | 1625491 tok/s step 380/19560 | loss 5.770188 (-1.49z)| norm 0.9272 (+0.08z)| lr 3.26e-04 | 322.32 ms | 52.4% bf16 MFU | 1625547 tok/s step 381/19560 | loss 5.745415 (-1.67z)| norm 0.8312 (-0.36z)| lr 3.27e-04 | 323.07 ms | 52.2% bf16 MFU | 1625411 tok/s step 382/19560 | loss 5.750505 (-1.60z)| norm 0.8161 (-0.43z)| lr 3.27e-04 | 322.34 ms | 52.4% bf16 MFU | 1625464 tok/s 
step 383/19560 | loss 5.695450 (-2.01z)| norm 0.9185 (+0.05z)| lr 3.28e-04 | 323.01 ms | 52.2% bf16 MFU | 1625348 tok/s step 384/19560 | loss 5.786231 (-1.26z)| norm 0.8738 (-0.16z)| lr 3.29e-04 | 323.15 ms | 52.2% bf16 MFU | 1625201 tok/s step 385/19560 | loss 5.757025 (-1.48z)| norm 1.0157 (+0.52z)| lr 3.30e-04 | 321.90 ms | 52.4% bf16 MFU | 1625378 tok/s step 386/19560 | loss 5.692902 (-1.96z)| norm 0.7604 (-0.69z)| lr 3.31e-04 | 322.68 ms | 52.3% bf16 MFU | 1625348 tok/s step 387/19560 | loss 5.694768 (-1.91z)| norm 0.7658 (-0.67z)| lr 3.32e-04 | 322.84 ms | 52.3% bf16 MFU | 1625281 tok/s step 388/19560 | loss 5.681092 (-1.98z)| norm 0.6569 (-1.20z)| lr 3.33e-04 | 322.73 ms | 52.3% bf16 MFU | 1625244 tok/s step 389/19560 | loss 5.742258 (-1.47z)| norm 0.6893 (-1.05z)| lr 3.33e-04 | 322.25 ms | 52.4% bf16 MFU | 1625329 tok/s step 390/19560 | loss 5.695709 (-1.81z)| norm 0.7322 (-0.85z)| lr 3.34e-04 | 322.07 ms | 52.4% bf16 MFU | 1625457 tok/s step 391/19560 | loss 5.731614 (-1.49z)| norm 1.2843 (+1.77z)| lr 3.35e-04 | 322.82 ms | 52.3% bf16 MFU | 1625389 tok/s step 392/19560 | loss 5.736551 (-1.43z)| norm 1.4277 (+2.39z)| lr 3.36e-04 | 322.86 ms | 52.3% bf16 MFU | 1625313 tok/s step 393/19560 | loss 5.755140 (-1.26z)| norm 0.9903 (+0.33z)| lr 3.37e-04 | 322.46 ms | 52.3% bf16 MFU | 1625343 tok/s step 394/19560 | loss 5.735332 (-1.40z)| norm 1.0733 (+0.70z)| lr 3.38e-04 | 322.47 ms | 52.3% bf16 MFU | 1625367 tok/s step 395/19560 | loss 5.721637 (-1.49z)| norm 1.1927 (+1.25z)| lr 3.39e-04 | 322.45 ms | 52.3% bf16 MFU | 1625397 tok/s step 396/19560 | loss 5.736702 (-1.34z)| norm 0.9543 (+0.14z)| lr 3.39e-04 | 322.63 ms | 52.3% bf16 MFU | 1625380 tok/s step 397/19560 | loss 5.687119 (-1.72z)| norm 0.9149 (-0.05z)| lr 3.40e-04 | 321.90 ms | 52.4% bf16 MFU | 1625548 tok/s step 398/19560 | loss 5.716406 (-1.46z)| norm 1.0600 (+0.62z)| lr 3.41e-04 | 322.98 ms | 52.3% bf16 MFU | 1625435 tok/s step 399/19560 | loss 5.719507 (-1.41z)| norm 1.0972 (+0.79z)| lr 3.42e-04 | 
322.98 ms | 52.3% bf16 MFU | 1625327 tok/s step 400/19560 | loss 5.766946 (-1.02z)| norm 1.1407 (+0.99z)| lr 3.43e-04 | 322.73 ms | 52.3% bf16 MFU | 1625288 tok/s step 401/19560 | loss 5.669549 (-1.78z)| norm 1.1828 (+1.17z)| lr 3.44e-04 | 322.61 ms | 52.3% bf16 MFU | 1625280 tok/s step 402/19560 | loss 5.742557 (-1.18z)| norm 1.4134 (+2.21z)| lr 3.45e-04 | 322.74 ms | 52.3% bf16 MFU | 1625240 tok/s step 403/19560 | loss 5.729059 (-1.27z)| norm 1.1337 (+0.92z)| lr 3.45e-04 | 322.97 ms | 52.3% bf16 MFU | 1625145 tok/s step 404/19560 | loss 5.622725 (-2.10z)| norm 0.9593 (+0.12z)| lr 3.46e-04 | 322.44 ms | 52.3% bf16 MFU | 1625188 tok/s step 405/19560 | loss 5.725780 (-1.24z)| norm 0.9325 (-0.01z)| lr 3.47e-04 | 322.86 ms | 52.3% bf16 MFU | 1625122 tok/s step 406/19560 | loss 5.670414 (-1.66z)| norm 1.0246 (+0.41z)| lr 3.48e-04 | 323.08 ms | 52.2% bf16 MFU | 1625004 tok/s step 407/19560 | loss 5.655542 (-1.75z)| norm 1.0075 (+0.32z)| lr 3.49e-04 | 321.73 ms | 52.5% bf16 MFU | 1625233 tok/s step 408/19560 | loss 5.720356 (-1.21z)| norm 0.7679 (-0.79z)| lr 3.50e-04 | 322.78 ms | 52.3% bf16 MFU | 1625187 tok/s step 409/19560 | loss 5.689933 (-1.43z)| norm 0.8612 (-0.37z)| lr 3.51e-04 | 322.82 ms | 52.3% bf16 MFU | 1625133 tok/s step 410/19560 | loss 5.647253 (-1.76z)| norm 0.8234 (-0.55z)| lr 3.51e-04 | 321.96 ms | 52.4% bf16 MFU | 1625298 tok/s step 411/19560 | loss 5.608929 (-2.04z)| norm 0.8368 (-0.49z)| lr 3.52e-04 | 322.87 ms | 52.3% bf16 MFU | 1625225 tok/s step 412/19560 | loss 5.703447 (-1.25z)| norm 0.6916 (-1.17z)| lr 3.53e-04 | 322.43 ms | 52.3% bf16 MFU | 1625266 tok/s step 413/19560 | loss 5.643915 (-1.71z)| norm 0.7164 (-1.05z)| lr 3.54e-04 | 322.49 ms | 52.3% bf16 MFU | 1625290 tok/s step 414/19560 | loss 5.622111 (-1.85z)| norm 1.1186 (+0.81z)| lr 3.55e-04 | 322.58 ms | 52.3% bf16 MFU | 1625289 tok/s step 415/19560 | loss 5.636696 (-1.70z)| norm 1.1089 (+0.76z)| lr 3.56e-04 | 322.34 ms | 52.4% bf16 MFU | 1625351 tok/s step 416/19560 | loss 5.644901 
(-1.61z)| norm 0.7677 (-0.85z)| lr 3.57e-04 | 322.44 ms | 52.3% bf16 MFU | 1625384 tok/s step 417/19560 | loss 5.676686 (-1.33z)| norm 0.8204 (-0.60z)| lr 3.57e-04 | 322.27 ms | 52.4% bf16 MFU | 1625458 tok/s step 418/19560 | loss 5.672420 (-1.35z)| norm 0.6397 (-1.43z)| lr 3.58e-04 | 322.33 ms | 52.4% bf16 MFU | 1625512 tok/s step 419/19560 | loss 5.638844 (-1.59z)| norm 0.7051 (-1.10z)| lr 3.59e-04 | 322.61 ms | 52.3% bf16 MFU | 1625494 tok/s step 420/19560 | loss 5.629757 (-1.64z)| norm 0.6911 (-1.15z)| lr 3.60e-04 | 322.47 ms | 52.3% bf16 MFU | 1625510 tok/s step 421/19560 | loss 5.674884 (-1.25z)| norm 0.6832 (-1.17z)| lr 3.61e-04 | 322.22 ms | 52.4% bf16 MFU | 1625591 tok/s step 422/19560 | loss 5.586718 (-1.93z)| norm 0.9527 (+0.07z)| lr 3.62e-04 | 322.21 ms | 52.4% bf16 MFU | 1625668 tok/s step 423/19560 | loss 5.597835 (-1.82z)| norm 1.0148 (+0.36z)| lr 3.63e-04 | 322.80 ms | 52.3% bf16 MFU | 1625595 tok/s step 424/19560 | loss 5.625693 (-1.57z)| norm 0.8374 (-0.45z)| lr 3.63e-04 | 322.62 ms | 52.3% bf16 MFU | 1625570 tok/s step 425/19560 | loss 5.593684 (-1.80z)| norm 0.8765 (-0.26z)| lr 3.64e-04 | 322.23 ms | 52.4% bf16 MFU | 1625644 tok/s step 426/19560 | loss 5.619625 (-1.56z)| norm 0.7717 (-0.76z)| lr 3.65e-04 | 322.62 ms | 52.3% bf16 MFU | 1625616 tok/s step 427/19560 | loss 5.646550 (-1.32z)| norm 1.1069 (+0.83z)| lr 3.66e-04 | 322.74 ms | 52.3% bf16 MFU | 1625560 tok/s step 428/19560 | loss 5.589303 (-1.76z)| norm 1.1361 (+0.96z)| lr 3.67e-04 | 322.43 ms | 52.3% bf16 MFU | 1625584 tok/s step 429/19560 | loss 5.599124 (-1.65z)| norm 0.9718 (+0.17z)| lr 3.68e-04 | 322.70 ms | 52.3% bf16 MFU | 1625538 tok/s step 430/19560 | loss 5.584253 (-1.74z)| norm 0.9013 (-0.17z)| lr 3.69e-04 | 322.45 ms | 52.3% bf16 MFU | 1625559 tok/s step 431/19560 | loss 5.599386 (-1.59z)| norm 1.0663 (+0.60z)| lr 3.69e-04 | 323.12 ms | 52.2% bf16 MFU | 1625410 tok/s step 432/19560 | loss 5.568501 (-1.80z)| norm 0.9260 (-0.09z)| lr 3.70e-04 | 322.22 ms | 52.4% bf16 MFU | 
1625496 tok/s step 433/19560 | loss 5.555615 (-1.87z)| norm 0.8499 (-0.47z)| lr 3.71e-04 | 322.58 ms | 52.3% bf16 MFU | 1625488 tok/s step 434/19560 | loss 5.597670 (-1.51z)| norm 0.8140 (-0.66z)| lr 3.72e-04 | 322.73 ms | 52.3% bf16 MFU | 1625441 tok/s step 435/19560 | loss 5.579070 (-1.63z)| norm 0.6902 (-1.25z)| lr 3.73e-04 | 322.76 ms | 52.3% bf16 MFU | 1625388 tok/s step 436/19560 | loss 5.590956 (-1.52z)| norm 0.7445 (-0.97z)| lr 3.74e-04 | 322.03 ms | 52.4% bf16 MFU | 1625522 tok/s step 437/19560 | loss 5.501616 (-2.18z)| norm 0.9231 (-0.08z)| lr 3.75e-04 | 322.84 ms | 52.3% bf16 MFU | 1625446 tok/s step 438/19560 | loss 5.645470 (-1.06z)| norm 1.1164 (+0.92z)| lr 3.75e-04 | 323.03 ms | 52.2% bf16 MFU | 1625326 tok/s step 439/19560 | loss 5.533571 (-1.94z)| norm 0.9254 (-0.05z)| lr 3.76e-04 | 322.22 ms | 52.4% bf16 MFU | 1625415 tok/s step 440/19560 | loss 5.561307 (-1.69z)| norm 0.8409 (-0.47z)| lr 3.77e-04 | 322.38 ms | 52.4% bf16 MFU | 1625461 tok/s step 441/19560 | loss 5.518496 (-2.02z)| norm 0.6693 (-1.34z)| lr 3.78e-04 | 322.44 ms | 52.3% bf16 MFU | 1625488 tok/s step 442/19560 | loss 5.538939 (-1.83z)| norm 0.7303 (-1.04z)| lr 3.79e-04 | 322.32 ms | 52.4% bf16 MFU | 1625544 tok/s step 443/19560 | loss 5.580063 (-1.48z)| norm 0.7631 (-0.84z)| lr 3.80e-04 | 323.51 ms | 52.2% bf16 MFU | 1625299 tok/s step 444/19560 | loss 5.485353 (-2.22z)| norm 0.7597 (-0.85z)| lr 3.81e-04 | 322.33 ms | 52.4% bf16 MFU | 1625362 tok/s step 445/19560 | loss 5.500202 (-2.06z)| norm 0.6991 (-1.17z)| lr 3.81e-04 | 323.23 ms | 52.2% bf16 MFU | 1625196 tok/s step 446/19560 | loss 5.536902 (-1.73z)| norm 0.7388 (-0.94z)| lr 3.82e-04 | 323.12 ms | 52.2% bf16 MFU | 1625065 tok/s step 447/19560 | loss 5.518736 (-1.85z)| norm 0.9274 (+0.07z)| lr 3.83e-04 | 322.64 ms | 52.3% bf16 MFU | 1625061 tok/s step 448/19560 | loss 5.525872 (-1.75z)| norm 0.8873 (-0.16z)| lr 3.84e-04 | 322.63 ms | 52.3% bf16 MFU | 1625059 tok/s step 449/19560 | loss 5.598598 (-1.14z)| norm 0.8607 (-0.32z)| lr 
3.85e-04 | 322.49 ms | 52.3% bf16 MFU | 1625093 tok/s step 450/19560 | loss 5.555230 (-1.47z)| norm 0.9887 (+0.39z)| lr 3.86e-04 | 322.27 ms | 52.4% bf16 MFU | 1625182 tok/s step 451/19560 | loss 5.584246 (-1.22z)| norm 0.8414 (-0.42z)| lr 3.87e-04 | 322.76 ms | 52.3% bf16 MFU | 1625143 tok/s step 452/19560 | loss 5.559985 (-1.40z)| norm 0.7951 (-0.67z)| lr 3.87e-04 | 321.82 ms | 52.4% bf16 MFU | 1625342 tok/s step 453/19560 | loss 5.543053 (-1.51z)| norm 0.9667 (+0.30z)| lr 3.88e-04 | 322.43 ms | 52.3% bf16 MFU | 1625376 tok/s step 454/19560 | loss 5.481050 (-1.99z)| norm 1.0480 (+0.77z)| lr 3.89e-04 | 322.66 ms | 52.3% bf16 MFU | 1625352 tok/s step 455/19560 | loss 5.553469 (-1.37z)| norm 1.1745 (+1.47z)| lr 3.90e-04 | 322.75 ms | 52.3% bf16 MFU | 1625305 tok/s step 456/19560 | loss 5.536829 (-1.49z)| norm 1.0369 (+0.69z)| lr 3.91e-04 | 322.56 ms | 52.3% bf16 MFU | 1625310 tok/s step 457/19560 | loss 5.584871 (-1.08z)| norm 1.0562 (+0.80z)| lr 3.92e-04 | 322.84 ms | 52.3% bf16 MFU | 1625244 tok/s step 458/19560 | loss 5.514379 (-1.64z)| norm 0.8486 (-0.38z)| lr 3.93e-04 | 322.47 ms | 52.3% bf16 MFU | 1625274 tok/s step 459/19560 | loss 5.549839 (-1.33z)| norm 0.8395 (-0.43z)| lr 3.93e-04 | 322.33 ms | 52.4% bf16 MFU | 1625339 tok/s step 460/19560 | loss 5.530938 (-1.47z)| norm 0.7881 (-0.73z)| lr 3.94e-04 | 322.82 ms | 52.3% bf16 MFU | 1625278 tok/s step 461/19560 | loss 5.496747 (-1.74z)| norm 0.6855 (-1.30z)| lr 3.95e-04 | 322.35 ms | 52.4% bf16 MFU | 1625335 tok/s step 462/19560 | loss 5.499086 (-1.69z)| norm 0.7505 (-0.93z)| lr 3.96e-04 | 322.86 ms | 52.3% bf16 MFU | 1625263 tok/s step 463/19560 | loss 5.628045 (-0.58z)| norm 0.8172 (-0.56z)| lr 3.97e-04 | 322.62 ms | 52.3% bf16 MFU | 1625254 tok/s step 464/19560 | loss 5.545953 (-1.26z)| norm 1.1557 (+1.34z)| lr 3.98e-04 | 322.56 ms | 52.3% bf16 MFU | 1625262 tok/s step 465/19560 | loss 5.453753 (-2.02z)| norm 1.0178 (+0.58z)| lr 3.99e-04 | 322.56 ms | 52.3% bf16 MFU | 1625269 tok/s step 466/19560 | loss 
5.455788 (-1.96z)| norm 1.0387 (+0.69z)| lr 3.99e-04 | 323.14 ms | 52.2% bf16 MFU | 1625128 tok/s step 467/19560 | loss 5.472921 (-1.79z)| norm 0.9023 (-0.08z)| lr 4.00e-04 | 322.52 ms | 52.3% bf16 MFU | 1625152 tok/s step 468/19560 | loss 5.470323 (-1.78z)| norm 0.7840 (-0.75z)| lr 4.01e-04 | 322.77 ms | 52.3% bf16 MFU | 1625110 tok/s step 469/19560 | loss 5.455796 (-1.87z)| norm 0.8661 (-0.27z)| lr 4.02e-04 | 322.33 ms | 52.4% bf16 MFU | 1625182 tok/s step 470/19560 | loss 5.540091 (-1.14z)| norm 1.0270 (+0.64z)| lr 4.03e-04 | 323.31 ms | 52.2% bf16 MFU | 1625003 tok/s step 471/19560 | loss 5.507588 (-1.40z)| norm 1.2436 (+1.85z)| lr 4.04e-04 | 322.07 ms | 52.4% bf16 MFU | 1625148 tok/s step 472/19560 | loss 5.447962 (-1.87z)| norm 0.9871 (+0.39z)| lr 4.05e-04 | 322.29 ms | 52.4% bf16 MFU | 1625228 tok/s step 473/19560 | loss 5.456339 (-1.78z)| norm 0.7876 (-0.74z)| lr 4.05e-04 | 322.71 ms | 52.3% bf16 MFU | 1625199 tok/s step 474/19560 | loss 5.448184 (-1.82z)| norm 0.9719 (+0.30z)| lr 4.06e-04 | 323.03 ms | 52.2% bf16 MFU | 1625090 tok/s step 475/19560 | loss 5.447782 (-1.79z)| norm 0.9177 (-0.02z)| lr 4.07e-04 | 321.99 ms | 52.4% bf16 MFU | 1625250 tok/s step 476/19560 | loss 5.435426 (-1.86z)| norm 0.6918 (-1.29z)| lr 4.08e-04 | 322.50 ms | 52.3% bf16 MFU | 1625272 tok/s step 477/19560 | loss 5.514537 (-1.18z)| norm 0.8658 (-0.29z)| lr 4.09e-04 | 322.55 ms | 52.3% bf16 MFU | 1625281 tok/s step 478/19560 | loss 5.498700 (-1.30z)| norm 0.8963 (-0.11z)| lr 4.10e-04 | 322.41 ms | 52.3% bf16 MFU | 1625326 tok/s step 479/19560 | loss 5.435882 (-1.80z)| norm 0.9105 (-0.04z)| lr 4.11e-04 | 322.26 ms | 52.4% bf16 MFU | 1625404 tok/s step 480/19560 | loss 5.426637 (-1.85z)| norm 1.2213 (+1.70z)| lr 4.11e-04 | 322.19 ms | 52.4% bf16 MFU | 1625497 tok/s step 481/19560 | loss 5.496995 (-1.24z)| norm 1.1932 (+1.52z)| lr 4.12e-04 | 322.48 ms | 52.3% bf16 MFU | 1625512 tok/s step 482/19560 | loss 5.484913 (-1.32z)| norm 0.9583 (+0.18z)| lr 4.13e-04 | 322.77 ms | 52.3% bf16 
MFU | 1625454 tok/s step 483/19560 | loss 5.466426 (-1.46z)| norm 0.8764 (-0.29z)| lr 4.14e-04 | 322.33 ms | 52.4% bf16 MFU | 1625508 tok/s step 484/19560 | loss 5.423511 (-1.79z)| norm 0.8038 (-0.71z)| lr 4.15e-04 | 322.28 ms | 52.4% bf16 MFU | 1625573 tok/s step 485/19560 | loss 5.438282 (-1.64z)| norm 0.7041 (-1.26z)| lr 4.16e-04 | 322.35 ms | 52.4% bf16 MFU | 1625617 tok/s step 486/19560 | loss 5.460457 (-1.43z)| norm 0.5657 (-2.13z)| lr 4.17e-04 | 322.80 ms | 52.3% bf16 MFU | 1625544 tok/s step 487/19560 | loss 5.409263 (-1.84z)| norm 0.6027 (-1.87z)| lr 4.17e-04 | 322.38 ms | 52.4% bf16 MFU | 1625583 tok/s step 488/19560 | loss 5.488084 (-1.15z)| norm 0.6170 (-1.74z)| lr 4.18e-04 | 322.50 ms | 52.3% bf16 MFU | 1625588 tok/s step 489/19560 | loss 5.378205 (-2.07z)| norm 0.8146 (-0.57z)| lr 4.19e-04 | 323.13 ms | 52.2% bf16 MFU | 1625435 tok/s step 490/19560 | loss 5.468954 (-1.27z)| norm 1.0960 (+1.08z)| lr 4.20e-04 | 322.18 ms | 52.4% bf16 MFU | 1625530 tok/s step 491/19560 | loss 5.427242 (-1.60z)| norm 0.9624 (+0.29z)| lr 4.21e-04 | 322.48 ms | 52.3% bf16 MFU | 1625542 tok/s step 492/19560 | loss 5.411309 (-1.72z)| norm 0.8421 (-0.42z)| lr 4.22e-04 | 322.86 ms | 52.3% bf16 MFU | 1625459 tok/s step 493/19560 | loss 5.391879 (-1.85z)| norm 0.6382 (-1.59z)| lr 4.23e-04 | 322.17 ms | 52.4% bf16 MFU | 1625553 tok/s step 494/19560 | loss 5.397700 (-1.77z)| norm 0.7292 (-1.04z)| lr 4.23e-04 | 322.54 ms | 52.3% bf16 MFU | 1625551 tok/s step 495/19560 | loss 5.402930 (-1.69z)| norm 0.9433 (+0.21z)| lr 4.24e-04 | 322.73 ms | 52.3% bf16 MFU | 1625502 tok/s step 496/19560 | loss 5.391466 (-1.75z)| norm 0.8844 (-0.13z)| lr 4.25e-04 | 322.48 ms | 52.3% bf16 MFU | 1625516 tok/s step 497/19560 | loss 5.403422 (-1.62z)| norm 0.8047 (-0.59z)| lr 4.26e-04 | 322.35 ms | 52.4% bf16 MFU | 1625564 tok/s step 498/19560 | loss 5.416341 (-1.50z)| norm 0.9743 (+0.42z)| lr 4.27e-04 | 322.39 ms | 52.4% bf16 MFU | 1625598 tok/s step 499/19560 | loss 5.419875 (-1.45z)| norm 1.0150 
(+0.66z)| lr 4.28e-04 | 322.48 ms | 52.3% bf16 MFU | 1625609 tok/s step 500/19560 | loss 5.402827 (-1.57z)| norm 0.8915 (-0.07z)| lr 4.29e-04 | 322.95 ms | 52.3% bf16 MFU | 1625501 tok/s val loss 5.389525 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating HellaSwag: 70/79 HellaSwag: 2388/10042 = 0.237801 step 501/19560 | loss 5.524487 (-0.51z)| norm 1.0309 (+0.74z)| lr 4.29e-04 | 321.91 ms | 52.4% bf16 MFU | 1625661 tok/s step 502/19560 | loss 5.402844 (-1.55z)| norm 1.0707 (+0.96z)| lr 4.30e-04 | 323.14 ms | 52.2% bf16 MFU | 1625503 tok/s step 503/19560 | loss 5.378118 (-1.73z)| norm 0.8648 (-0.26z)| lr 4.31e-04 | 322.31 ms | 52.4% bf16 MFU | 1625560 tok/s step 504/19560 | loss 5.369720 (-1.78z)| norm 0.7505 (-0.93z)| lr 4.32e-04 | 322.24 ms | 52.4% bf16 MFU | 1625632 tok/s step 505/19560 | loss 5.428893 (-1.25z)| norm 0.7723 (-0.79z)| lr 4.33e-04 | 322.32 ms | 52.4% bf16 MFU | 1625680 tok/s step 506/19560 | loss 5.347645 (-1.92z)| norm 0.8359 (-0.40z)| lr 4.34e-04 | 323.25 ms | 52.2% bf16 MFU | 1625492 tok/s step 507/19560 | loss 5.348750 (-1.88z)| norm 0.9073 (+0.03z)| lr 4.35e-04 | 323.14 ms | 52.2% bf16 MFU | 1625340 tok/s step 508/19560 | loss 5.432893 (-1.13z)| norm 0.8907 (-0.07z)| lr 4.35e-04 | 323.01 ms | 52.2% bf16 MFU | 1625230 tok/s step 509/19560 | loss 5.358836 (-1.75z)| norm 0.8624 (-0.24z)| lr 4.36e-04 | 323.96 ms | 52.1% bf16 MFU | 1624888 tok/s step 510/19560 | loss 5.419590 (-1.21z)| norm 0.8713 (-0.19z)| lr 4.37e-04 | 322.46 ms | 52.3% bf16 MFU | 1624938 tok/s step 511/19560 | loss 5.351375 (-1.77z)| norm 0.8790 (-0.14z)| lr 4.38e-04 | 322.79 ms | 52.3% bf16 MFU | 1624901 tok/s step 512/19560 | loss 5.317251 (-2.04z)| norm 0.9398 (+0.22z)| lr 4.39e-04 | 322.91 ms | 52.3% bf16 MFU | 1624839 tok/s step 513/19560 | loss 5.337533 (-1.83z)| norm 0.7361 (-0.99z)| lr 4.40e-04 | 322.53 ms | 52.3% 
bf16 MFU | 1624874 tok/s step 514/19560 | loss 5.347617 (-1.71z)| norm 0.7874 (-0.68z)| lr 4.41e-04 | 322.30 ms | 52.4% bf16 MFU | 1624966 tok/s step 515/19560 | loss 5.370241 (-1.49z)| norm 0.7819 (-0.72z)| lr 4.41e-04 | 323.20 ms | 52.2% bf16 MFU | 1624827 tok/s step 516/19560 | loss 5.361914 (-1.53z)| norm 0.7619 (-0.85z)| lr 4.42e-04 | 322.99 ms | 52.3% bf16 MFU | 1624747 tok/s step 517/19560 | loss 5.332770 (-1.76z)| norm 0.8869 (-0.10z)| lr 4.43e-04 | 322.56 ms | 52.3% bf16 MFU | 1624780 tok/s step 518/19560 | loss 5.362351 (-1.48z)| norm 1.0179 (+0.68z)| lr 4.44e-04 | 322.56 ms | 52.3% bf16 MFU | 1624810 tok/s step 519/19560 | loss 5.306779 (-1.92z)| norm 0.9885 (+0.52z)| lr 4.45e-04 | 323.32 ms | 52.2% bf16 MFU | 1624648 tok/s step 520/19560 | loss 5.336686 (-1.64z)| norm 1.0889 (+1.21z)| lr 4.46e-04 | 322.96 ms | 52.3% bf16 MFU | 1624583 tok/s step 521/19560 | loss 5.382787 (-1.23z)| norm 0.9118 (+0.07z)| lr 4.47e-04 | 322.79 ms | 52.3% bf16 MFU | 1624566 tok/s step 522/19560 | loss 5.291169 (-1.99z)| norm 0.9392 (+0.26z)| lr 4.47e-04 | 323.29 ms | 52.2% bf16 MFU | 1624423 tok/s step 523/19560 | loss 5.337360 (-1.57z)| norm 0.9163 (+0.12z)| lr 4.48e-04 | 323.08 ms | 52.2% bf16 MFU | 1624340 tok/s step 524/19560 | loss 5.252901 (-2.26z)| norm 0.7830 (-0.74z)| lr 4.49e-04 | 322.74 ms | 52.3% bf16 MFU | 1624348 tok/s step 525/19560 | loss 5.320742 (-1.64z)| norm 0.6455 (-1.61z)| lr 4.50e-04 | 322.69 ms | 52.3% bf16 MFU | 1624368 tok/s step 526/19560 | loss 5.293087 (-1.85z)| norm 0.6213 (-1.74z)| lr 4.51e-04 | 322.36 ms | 52.4% bf16 MFU | 1624470 tok/s step 527/19560 | loss 5.325662 (-1.55z)| norm 0.6578 (-1.48z)| lr 4.52e-04 | 322.52 ms | 52.3% bf16 MFU | 1624526 tok/s step 528/19560 | loss 5.362967 (-1.22z)| norm 0.7042 (-1.16z)| lr 4.53e-04 | 323.27 ms | 52.2% bf16 MFU | 1624392 tok/s step 529/19560 | loss 5.212873 (-2.46z)| norm 0.7884 (-0.60z)| lr 4.53e-04 | 322.32 ms | 52.4% bf16 MFU | 1624503 tok/s step 530/19560 | loss 5.369731 (-1.10z)| norm 0.9738 
(+0.66z)| lr 4.54e-04 | 322.80 ms | 52.3% bf16 MFU | 1624488 tok/s step 531/19560 | loss 5.325718 (-1.47z)| norm 0.8640 (-0.08z)| lr 4.55e-04 | 322.91 ms | 52.3% bf16 MFU | 1624444 tok/s step 532/19560 | loss 5.299959 (-1.66z)| norm 0.7597 (-0.79z)| lr 4.56e-04 | 322.36 ms | 52.4% bf16 MFU | 1624542 tok/s step 533/19560 | loss 5.282382 (-1.79z)| norm 0.8405 (-0.23z)| lr 4.57e-04 | 322.76 ms | 52.3% bf16 MFU | 1624534 tok/s step 534/19560 | loss 5.308837 (-1.54z)| norm 0.9572 (+0.59z)| lr 4.58e-04 | 322.25 ms | 52.4% bf16 MFU | 1624656 tok/s step 535/19560 | loss 5.254652 (-1.97z)| norm 0.8290 (-0.29z)| lr 4.59e-04 | 322.80 ms | 52.3% bf16 MFU | 1624633 tok/s step 536/19560 | loss 5.262820 (-1.88z)| norm 0.8172 (-0.38z)| lr 4.59e-04 | 323.17 ms | 52.2% bf16 MFU | 1624517 tok/s step 537/19560 | loss 5.207742 (-2.31z)| norm 0.9359 (+0.45z)| lr 4.60e-04 | 322.30 ms | 52.4% bf16 MFU | 1624625 tok/s step 538/19560 | loss 5.228716 (-2.08z)| norm 0.8486 (-0.17z)| lr 4.61e-04 | 322.94 ms | 52.3% bf16 MFU | 1624568 tok/s step 539/19560 | loss 5.280213 (-1.61z)| norm 0.7906 (-0.57z)| lr 4.62e-04 | 322.47 ms | 52.3% bf16 MFU | 1624632 tok/s step 540/19560 | loss 5.303538 (-1.39z)| norm 0.7402 (-0.93z)| lr 4.63e-04 | 323.21 ms | 52.2% bf16 MFU | 1624506 tok/s step 541/19560 | loss 5.209216 (-2.16z)| norm 0.8289 (-0.31z)| lr 4.64e-04 | 322.89 ms | 52.3% bf16 MFU | 1624468 tok/s step 542/19560 | loss 5.247682 (-1.80z)| norm 0.8443 (-0.19z)| lr 4.65e-04 | 323.28 ms | 52.2% bf16 MFU | 1624334 tok/s step 543/19560 | loss 5.272687 (-1.56z)| norm 0.7364 (-0.95z)| lr 4.65e-04 | 322.32 ms | 52.4% bf16 MFU | 1624446 tok/s step 544/19560 | loss 5.200600 (-2.13z)| norm 0.6822 (-1.33z)| lr 4.66e-04 | 322.37 ms | 52.4% bf16 MFU | 1624540 tok/s step 545/19560 | loss 5.296463 (-1.30z)| norm 0.5700 (-2.08z)| lr 4.67e-04 | 322.31 ms | 52.4% bf16 MFU | 1624646 tok/s step 546/19560 | loss 5.208014 (-2.02z)| norm 0.5812 (-1.99z)| lr 4.68e-04 | 322.71 ms | 52.3% bf16 MFU | 1624645 tok/s step 
547/19560 | loss 5.178508 (-2.23z)| norm 0.6208 (-1.69z)| lr 4.69e-04 | 322.78 ms | 52.3% bf16 MFU | 1624628 tok/s step 548/19560 | loss 5.329017 (-0.94z)| norm 0.8896 (+0.17z)| lr 4.70e-04 | 322.59 ms | 52.3% bf16 MFU | 1624658 tok/s step 549/19560 | loss 5.254537 (-1.55z)| norm 1.3744 (+3.38z)| lr 4.71e-04 | 322.24 ms | 52.4% bf16 MFU | 1624775 tok/s step 550/19560 | loss 5.273184 (-1.37z)| norm 1.1152 (+1.62z)| lr 4.71e-04 | 322.61 ms | 52.3% bf16 MFU | 1624794 tok/s step 551/19560 | loss 5.319035 (-0.96z)| norm 0.9714 (+0.66z)| lr 4.72e-04 | 322.81 ms | 52.3% bf16 MFU | 1624761 tok/s step 552/19560 | loss 5.250103 (-1.53z)| norm 1.0021 (+0.86z)| lr 4.73e-04 | 323.25 ms | 52.2% bf16 MFU | 1624620 tok/s step 553/19560 | loss 5.222087 (-1.74z)| norm 0.9173 (+0.29z)| lr 4.74e-04 | 322.77 ms | 52.3% bf16 MFU | 1624606 tok/s step 554/19560 | loss 5.268061 (-1.33z)| norm 0.8835 (+0.06z)| lr 4.75e-04 | 322.23 ms | 52.4% bf16 MFU | 1624728 tok/s step 555/19560 | loss 5.226632 (-1.67z)| norm 0.8722 (-0.00z)| lr 4.76e-04 | 322.84 ms | 52.3% bf16 MFU | 1624692 tok/s step 556/19560 | loss 5.238420 (-1.54z)| norm 0.8402 (-0.21z)| lr 4.77e-04 | 322.72 ms | 52.3% bf16 MFU | 1624687 tok/s step 557/19560 | loss 5.238579 (-1.52z)| norm 1.1119 (+1.62z)| lr 4.77e-04 | 322.29 ms | 52.4% bf16 MFU | 1624791 tok/s step 558/19560 | loss 5.134130 (-2.36z)| norm 0.6978 (-1.16z)| lr 4.78e-04 | 322.55 ms | 52.3% bf16 MFU | 1624824 tok/s step 559/19560 | loss 5.227890 (-1.53z)| norm 0.6225 (-1.63z)| lr 4.79e-04 | 322.82 ms | 52.3% bf16 MFU | 1624788 tok/s step 560/19560 | loss 5.213434 (-1.63z)| norm 0.5709 (-1.93z)| lr 4.80e-04 | 323.26 ms | 52.2% bf16 MFU | 1624642 tok/s step 561/19560 | loss 5.256575 (-1.24z)| norm 0.6584 (-1.33z)| lr 4.81e-04 | 322.55 ms | 52.3% bf16 MFU | 1624683 tok/s step 562/19560 | loss 5.221104 (-1.52z)| norm 0.6159 (-1.59z)| lr 4.82e-04 | 322.84 ms | 52.3% bf16 MFU | 1624648 tok/s step 563/19560 | loss 5.249693 (-1.26z)| norm 0.7306 (-0.85z)| lr 4.83e-04 | 322.57 
ms | 52.3% bf16 MFU | 1624682 tok/s step 564/19560 | loss 5.215561 (-1.53z)| norm 0.8677 (+0.04z)| lr 4.83e-04 | 322.92 ms | 52.3% bf16 MFU | 1624626 tok/s step 565/19560 | loss 5.207976 (-1.57z)| norm 0.8878 (+0.17z)| lr 4.84e-04 | 322.48 ms | 52.3% bf16 MFU | 1624686 tok/s step 566/19560 | loss 5.194946 (-1.66z)| norm 0.9942 (+0.88z)| lr 4.85e-04 | 322.46 ms | 52.3% bf16 MFU | 1624746 tok/s step 567/19560 | loss 5.227783 (-1.36z)| norm 0.9163 (+0.37z)| lr 4.86e-04 | 322.90 ms | 52.3% bf16 MFU | 1624692 tok/s step 568/19560 | loss 5.199757 (-1.57z)| norm 0.6972 (-1.06z)| lr 4.87e-04 | 322.60 ms | 52.3% bf16 MFU | 1624718 tok/s step 569/19560 | loss 5.228430 (-1.30z)| norm 0.7097 (-0.98z)| lr 4.88e-04 | 322.25 ms | 52.4% bf16 MFU | 1624830 tok/s step 570/19560 | loss 5.198398 (-1.54z)| norm 0.7512 (-0.71z)| lr 4.89e-04 | 322.82 ms | 52.3% bf16 MFU | 1624793 tok/s step 571/19560 | loss 5.212082 (-1.40z)| norm 0.7198 (-0.92z)| lr 4.89e-04 | 322.79 ms | 52.3% bf16 MFU | 1624764 tok/s step 572/19560 | loss 5.179855 (-1.64z)| norm 0.7626 (-0.64z)| lr 4.90e-04 | 322.76 ms | 52.3% bf16 MFU | 1624745 tok/s step 573/19560 | loss 5.130253 (-2.02z)| norm 0.7789 (-0.54z)| lr 4.91e-04 | 322.91 ms | 52.3% bf16 MFU | 1624689 tok/s step 574/19560 | loss 5.137377 (-1.92z)| norm 0.7269 (-0.88z)| lr 4.92e-04 | 323.11 ms | 52.2% bf16 MFU | 1624587 tok/s step 575/19560 | loss 5.140460 (-1.86z)| norm 0.6764 (-1.19z)| lr 4.93e-04 | 322.86 ms | 52.3% bf16 MFU | 1624552 tok/s step 576/19560 | loss 5.175934 (-1.53z)| norm 0.6887 (-1.09z)| lr 4.94e-04 | 322.10 ms | 52.4% bf16 MFU | 1624709 tok/s step 577/19560 | loss 5.175227 (-1.52z)| norm 0.8266 (-0.19z)| lr 4.95e-04 | 322.57 ms | 52.3% bf16 MFU | 1624741 tok/s step 578/19560 | loss 5.202324 (-1.28z)| norm 0.8131 (-0.27z)| lr 4.95e-04 | 323.05 ms | 52.2% bf16 MFU | 1624651 tok/s step 579/19560 | loss 5.226425 (-1.06z)| norm 0.8043 (-0.33z)| lr 4.96e-04 | 322.88 ms | 52.3% bf16 MFU | 1624607 tok/s step 580/19560 | loss 5.211358 (-1.18z)| 
norm 0.8189 (-0.24z)| lr 4.97e-04 | 322.41 ms | 52.3% bf16 MFU | 1624685 tok/s step 581/19560 | loss 5.119787 (-1.92z)| norm 0.8870 (+0.21z)| lr 4.98e-04 | 322.33 ms | 52.4% bf16 MFU | 1624779 tok/s step 582/19560 | loss 5.162151 (-1.54z)| norm 0.8209 (-0.21z)| lr 4.99e-04 | 323.11 ms | 52.2% bf16 MFU | 1624670 tok/s step 583/19560 | loss 5.181075 (-1.36z)| norm 0.7294 (-0.80z)| lr 5.00e-04 | 322.62 ms | 52.3% bf16 MFU | 1624690 tok/s step 584/19560 | loss 5.152989 (-1.57z)| norm 0.9127 (+0.43z)| lr 5.01e-04 | 322.44 ms | 52.3% bf16 MFU | 1624756 tok/s step 585/19560 | loss 5.163503 (-1.47z)| norm 0.9945 (+0.99z)| lr 5.01e-04 | 322.57 ms | 52.3% bf16 MFU | 1624785 tok/s step 586/19560 | loss 5.181536 (-1.30z)| norm 1.0547 (+1.37z)| lr 5.02e-04 | 323.00 ms | 52.3% bf16 MFU | 1624706 tok/s step 587/19560 | loss 5.106391 (-1.92z)| norm 0.8645 (+0.10z)| lr 5.03e-04 | 322.37 ms | 52.4% bf16 MFU | 1624787 tok/s step 588/19560 | loss 5.168844 (-1.36z)| norm 0.7727 (-0.51z)| lr 5.04e-04 | 323.09 ms | 52.2% bf16 MFU | 1624683 tok/s step 589/19560 | loss 5.157025 (-1.44z)| norm 0.7311 (-0.79z)| lr 5.05e-04 | 322.60 ms | 52.3% bf16 MFU | 1624708 tok/s step 590/19560 | loss 5.073137 (-2.12z)| norm 0.6870 (-1.08z)| lr 5.06e-04 | 322.72 ms | 52.3% bf16 MFU | 1624703 tok/s step 591/19560 | loss 5.136032 (-1.57z)| norm 0.6021 (-1.62z)| lr 5.07e-04 | 322.43 ms | 52.3% bf16 MFU | 1624769 tok/s step 592/19560 | loss 5.194335 (-1.05z)| norm 0.5910 (-1.68z)| lr 5.07e-04 | 322.41 ms | 52.3% bf16 MFU | 1624840 tok/s step 593/19560 | loss 5.107955 (-1.78z)| norm 0.5316 (-2.02z)| lr 5.08e-04 | 322.99 ms | 52.3% bf16 MFU | 1624760 tok/s step 594/19560 | loss 5.113731 (-1.70z)| norm 0.5490 (-1.87z)| lr 5.09e-04 | 322.83 ms | 52.3% bf16 MFU | 1624724 tok/s step 595/19560 | loss 5.116246 (-1.65z)| norm 0.6390 (-1.26z)| lr 5.10e-04 | 322.89 ms | 52.3% bf16 MFU | 1624674 tok/s step 596/19560 | loss 5.130189 (-1.50z)| norm 0.8921 (+0.38z)| lr 5.11e-04 | 322.34 ms | 52.4% bf16 MFU | 1624765 tok/s 
step 597/19560 | loss 5.237840 (-0.55z)| norm 1.1005 (+1.70z)| lr 5.12e-04 | 322.56 ms | 52.3% bf16 MFU | 1624796 tok/s step 598/19560 | loss 5.121524 (-1.56z)| norm 0.7647 (-0.45z)| lr 5.13e-04 | 322.96 ms | 52.3% bf16 MFU | 1624725 tok/s step 599/19560 | loss 5.108318 (-1.65z)| norm 0.7731 (-0.38z)| lr 5.13e-04 | 322.81 ms | 52.3% bf16 MFU | 1624695 tok/s step 600/19560 | loss 5.082011 (-1.85z)| norm 0.6581 (-1.13z)| lr 5.14e-04 | 322.94 ms | 52.3% bf16 MFU | 1624635 tok/s step 601/19560 | loss 5.203005 (-0.76z)| norm 0.5965 (-1.51z)| lr 5.15e-04 | 322.70 ms | 52.3% bf16 MFU | 1624638 tok/s step 602/19560 | loss 5.128351 (-1.41z)| norm 0.5401 (-1.84z)| lr 5.16e-04 | 321.89 ms | 52.4% bf16 MFU | 1624846 tok/s step 603/19560 | loss 5.117797 (-1.48z)| norm 0.5554 (-1.71z)| lr 5.17e-04 | 322.56 ms | 52.3% bf16 MFU | 1624873 tok/s step 604/19560 | loss 5.081254 (-1.77z)| norm 0.5858 (-1.50z)| lr 5.18e-04 | 322.75 ms | 52.3% bf16 MFU | 1624850 tok/s step 605/19560 | loss 5.069843 (-1.85z)| norm 0.7113 (-0.68z)| lr 5.19e-04 | 322.52 ms | 52.3% bf16 MFU | 1624888 tok/s step 606/19560 | loss 5.118257 (-1.40z)| norm 0.7893 (-0.18z)| lr 5.19e-04 | 322.32 ms | 52.4% bf16 MFU | 1624973 tok/s step 607/19560 | loss 5.081005 (-1.71z)| norm 0.8457 (+0.19z)| lr 5.20e-04 | 322.38 ms | 52.4% bf16 MFU | 1625040 tok/s step 608/19560 | loss 5.107182 (-1.45z)| norm 0.9137 (+0.65z)| lr 5.21e-04 | 322.99 ms | 52.3% bf16 MFU | 1624948 tok/s step 609/19560 | loss 5.131974 (-1.21z)| norm 0.9982 (+1.24z)| lr 5.22e-04 | 323.11 ms | 52.2% bf16 MFU | 1624832 tok/s step 610/19560 | loss 5.106199 (-1.43z)| norm 0.9657 (+1.02z)| lr 5.23e-04 | 322.91 ms | 52.3% bf16 MFU | 1624772 tok/s step 611/19560 | loss 5.098745 (-1.48z)| norm 0.8403 (+0.19z)| lr 5.24e-04 | 322.65 ms | 52.3% bf16 MFU | 1624780 tok/s step 612/19560 | loss 5.103807 (-1.41z)| norm 0.7037 (-0.72z)| lr 5.25e-04 | 322.54 ms | 52.3% bf16 MFU | 1624815 tok/s step 613/19560 | loss 5.090555 (-1.51z)| norm 0.7163 (-0.64z)| lr 5.25e-04 | 
322.94 ms | 52.3% bf16 MFU | 1624749 tok/s step 614/19560 | loss 5.085859 (-1.54z)| norm 0.7535 (-0.40z)| lr 5.26e-04 | 322.84 ms | 52.3% bf16 MFU | 1624710 tok/s step 615/19560 | loss 5.074931 (-1.61z)| norm 0.7368 (-0.53z)| lr 5.27e-04 | 322.62 ms | 52.3% bf16 MFU | 1624730 tok/s step 616/19560 | loss 5.035066 (-1.95z)| norm 0.7201 (-0.65z)| lr 5.28e-04 | 322.55 ms | 52.3% bf16 MFU | 1624765 tok/s step 617/19560 | loss 5.019628 (-2.05z)| norm 0.6617 (-1.04z)| lr 5.29e-04 | 322.63 ms | 52.3% bf16 MFU | 1624780 tok/s step 618/19560 | loss 5.136543 (-0.97z)| norm 0.7097 (-0.70z)| lr 5.30e-04 | 322.64 ms | 52.3% bf16 MFU | 1624790 tok/s step 619/19560 | loss 5.069710 (-1.56z)| norm 0.8442 (+0.24z)| lr 5.31e-04 | 322.86 ms | 52.3% bf16 MFU | 1624744 tok/s step 620/19560 | loss 5.057464 (-1.65z)| norm 0.8453 (+0.24z)| lr 5.31e-04 | 322.35 ms | 52.4% bf16 MFU | 1624831 tok/s step 621/19560 | loss 5.044183 (-1.74z)| norm 0.7773 (-0.24z)| lr 5.32e-04 | 322.35 ms | 52.4% bf16 MFU | 1624913 tok/s step 622/19560 | loss 4.990043 (-2.19z)| norm 0.6644 (-1.02z)| lr 5.33e-04 | 322.84 ms | 52.3% bf16 MFU | 1624866 tok/s step 623/19560 | loss 5.012901 (-1.94z)| norm 0.6626 (-1.01z)| lr 5.34e-04 | 322.57 ms | 52.3% bf16 MFU | 1624890 tok/s step 624/19560 | loss 5.017928 (-1.86z)| norm 0.6281 (-1.23z)| lr 5.35e-04 | 322.42 ms | 52.3% bf16 MFU | 1624951 tok/s step 625/19560 | loss 5.068626 (-1.38z)| norm 0.5539 (-1.71z)| lr 5.36e-04 | 322.50 ms | 52.3% bf16 MFU | 1624989 tok/s step 626/19560 | loss 5.031133 (-1.70z)| norm 0.5444 (-1.74z)| lr 5.37e-04 | 323.10 ms | 52.2% bf16 MFU | 1624874 tok/s step 627/19560 | loss 5.077384 (-1.26z)| norm 0.5904 (-1.41z)| lr 5.37e-04 | 322.76 ms | 52.3% bf16 MFU | 1624851 tok/s step 628/19560 | loss 5.064902 (-1.36z)| norm 0.6532 (-0.97z)| lr 5.38e-04 | 322.41 ms | 52.3% bf16 MFU | 1624916 tok/s step 629/19560 | loss 5.035083 (-1.64z)| norm 0.6408 (-1.04z)| lr 5.39e-04 | 323.11 ms | 52.2% bf16 MFU | 1624801 tok/s step 630/19560 | loss 5.014622 
(-1.80z)| norm 0.6388 (-1.04z)| lr 5.40e-04 | 322.66 ms | 52.3% bf16 MFU | 1624805 tok/s step 631/19560 | loss 5.048150 (-1.46z)| norm 0.6796 (-0.75z)| lr 5.41e-04 | 322.43 ms | 52.3% bf16 MFU | 1624867 tok/s step 632/19560 | loss 5.036717 (-1.55z)| norm 0.7330 (-0.38z)| lr 5.42e-04 | 322.86 ms | 52.3% bf16 MFU | 1624818 tok/s step 633/19560 | loss 5.043062 (-1.47z)| norm 0.9825 (+1.32z)| lr 5.43e-04 | 322.71 ms | 52.3% bf16 MFU | 1624809 tok/s step 634/19560 | loss 5.149160 (-0.45z)| norm 1.1988 (+2.69z)| lr 5.43e-04 | 322.75 ms | 52.3% bf16 MFU | 1624790 tok/s step 635/19560 | loss 5.007878 (-1.78z)| norm 0.7625 (-0.19z)| lr 5.44e-04 | 322.57 ms | 52.3% bf16 MFU | 1624819 tok/s step 636/19560 | loss 5.065555 (-1.21z)| norm 0.6432 (-0.97z)| lr 5.45e-04 | 322.34 ms | 52.4% bf16 MFU | 1624904 tok/s step 637/19560 | loss 4.986655 (-1.95z)| norm 0.6298 (-1.04z)| lr 5.46e-04 | 322.83 ms | 52.3% bf16 MFU | 1624862 tok/s step 638/19560 | loss 5.013972 (-1.67z)| norm 0.6211 (-1.08z)| lr 5.47e-04 | 323.14 ms | 52.2% bf16 MFU | 1624743 tok/s step 639/19560 | loss 5.062648 (-1.17z)| norm 0.6861 (-0.65z)| lr 5.48e-04 | 322.84 ms | 52.3% bf16 MFU | 1624705 tok/s step 640/19560 | loss 5.039643 (-1.38z)| norm 0.7980 (+0.10z)| lr 5.49e-04 | 322.40 ms | 52.3% bf16 MFU | 1624779 tok/s step 641/19560 | loss 5.029319 (-1.46z)| norm 0.9868 (+1.33z)| lr 5.49e-04 | 322.91 ms | 52.3% bf16 MFU | 1624722 tok/s step 642/19560 | loss 4.993726 (-1.78z)| norm 1.0527 (+1.72z)| lr 5.50e-04 | 322.89 ms | 52.3% bf16 MFU | 1624674 tok/s step 643/19560 | loss 5.049856 (-1.21z)| norm 0.9163 (+0.83z)| lr 5.51e-04 | 322.89 ms | 52.3% bf16 MFU | 1624627 tok/s step 644/19560 | loss 5.062680 (-1.07z)| norm 0.7868 (-0.01z)| lr 5.52e-04 | 322.81 ms | 52.3% bf16 MFU | 1624603 tok/s step 645/19560 | loss 5.028902 (-1.39z)| norm 0.6895 (-0.63z)| lr 5.53e-04 | 322.80 ms | 52.3% bf16 MFU | 1624582 tok/s step 646/19560 | loss 4.973598 (-1.92z)| norm 0.7215 (-0.41z)| lr 5.54e-04 | 322.99 ms | 52.3% bf16 MFU | 
1624515 tok/s step 647/19560 | loss 5.006144 (-1.56z)| norm 0.7853 (+0.02z)| lr 5.55e-04 | 322.72 ms | 52.3% bf16 MFU | 1624518 tok/s step 648/19560 | loss 5.010554 (-1.50z)| norm 0.6757 (-0.69z)| lr 5.55e-04 | 322.42 ms | 52.3% bf16 MFU | 1624597 tok/s step 649/19560 | loss 4.993208 (-1.66z)| norm 0.5831 (-1.29z)| lr 5.56e-04 | 322.93 ms | 52.3% bf16 MFU | 1624544 tok/s step 650/19560 | loss 4.999944 (-1.56z)| norm 0.6699 (-0.70z)| lr 5.57e-04 | 323.05 ms | 52.2% bf16 MFU | 1624464 tok/s step 651/19560 | loss 4.994024 (-1.60z)| norm 0.8102 (+0.24z)| lr 5.58e-04 | 322.51 ms | 52.3% bf16 MFU | 1624522 tok/s step 652/19560 | loss 4.952317 (-1.98z)| norm 0.8145 (+0.27z)| lr 5.59e-04 | 322.58 ms | 52.3% bf16 MFU | 1624561 tok/s step 653/19560 | loss 5.064122 (-0.84z)| norm 0.9658 (+1.26z)| lr 5.60e-04 | 322.76 ms | 52.3% bf16 MFU | 1624552 tok/s step 654/19560 | loss 4.946109 (-2.00z)| norm 0.8448 (+0.44z)| lr 5.61e-04 | 322.92 ms | 52.3% bf16 MFU | 1624504 tok/s step 655/19560 | loss 4.934239 (-2.08z)| norm 0.7075 (-0.48z)| lr 5.61e-04 | 323.45 ms | 52.2% bf16 MFU | 1624325 tok/s step 656/19560 | loss 4.991432 (-1.49z)| norm 0.6353 (-0.96z)| lr 5.62e-04 | 322.74 ms | 52.3% bf16 MFU | 1624333 tok/s step 657/19560 | loss 4.938716 (-1.98z)| norm 0.6217 (-1.03z)| lr 5.63e-04 | 323.25 ms | 52.2% bf16 MFU | 1624212 tok/s step 658/19560 | loss 5.006645 (-1.29z)| norm 0.5430 (-1.53z)| lr 5.64e-04 | 322.81 ms | 52.3% bf16 MFU | 1624209 tok/s step 659/19560 | loss 5.036720 (-0.97z)| norm 0.5862 (-1.22z)| lr 5.65e-04 | 323.65 ms | 52.1% bf16 MFU | 1623994 tok/s step 660/19560 | loss 4.961880 (-1.72z)| norm 0.6142 (-1.03z)| lr 5.66e-04 | 322.49 ms | 52.3% bf16 MFU | 1624082 tok/s step 661/19560 | loss 5.270241 (+1.46z)| norm 0.7808 (+0.07z)| lr 5.67e-04 | 322.92 ms | 52.3% bf16 MFU | 1624058 tok/s step 662/19560 | loss 5.036720 (-0.93z)| norm 1.0769 (+1.99z)| lr 5.67e-04 | 322.78 ms | 52.3% bf16 MFU | 1624070 tok/s step 663/19560 | loss 4.994245 (-1.35z)| norm 0.9730 (+1.30z)| lr 
5.68e-04 | 322.92 ms | 52.3% bf16 MFU | 1624046 tok/s step 664/19560 | loss 4.989199 (-1.39z)| norm 0.7031 (-0.44z)| lr 5.69e-04 | 322.82 ms | 52.3% bf16 MFU | 1624049 tok/s step 665/19560 | loss 5.085757 (-0.37z)| norm 0.8159 (+0.30z)| lr 5.70e-04 | 323.21 ms | 52.2% bf16 MFU | 1623953 tok/s step 666/19560 | loss 4.998095 (-1.27z)| norm 0.8673 (+0.63z)| lr 5.71e-04 | 322.40 ms | 52.3% bf16 MFU | 1624066 tok/s step 667/19560 | loss 4.984738 (-1.39z)| norm 0.8970 (+0.81z)| lr 5.72e-04 | 322.80 ms | 52.3% bf16 MFU | 1624072 tok/s step 668/19560 | loss 4.964344 (-1.58z)| norm 1.0919 (+2.03z)| lr 5.73e-04 | 323.09 ms | 52.2% bf16 MFU | 1624005 tok/s step 669/19560 | loss 5.018040 (-1.00z)| norm 1.0096 (+1.48z)| lr 5.73e-04 | 322.88 ms | 52.3% bf16 MFU | 1623995 tok/s step 670/19560 | loss 4.987677 (-1.30z)| norm 0.8449 (+0.44z)| lr 5.74e-04 | 322.63 ms | 52.3% bf16 MFU | 1624046 tok/s step 671/19560 | loss 4.984091 (-1.32z)| norm 0.8880 (+0.70z)| lr 5.75e-04 | 322.98 ms | 52.3% bf16 MFU | 1624008 tok/s step 672/19560 | loss 5.033219 (-0.79z)| norm 1.1347 (+2.20z)| lr 5.76e-04 | 322.62 ms | 52.3% bf16 MFU | 1624063 tok/s step 673/19560 | loss 5.003238 (-1.09z)| norm 0.8940 (+0.69z)| lr 5.77e-04 | 322.81 ms | 52.3% bf16 MFU | 1624067 tok/s step 674/19560 | loss 5.015086 (-0.95z)| norm 0.7828 (-0.01z)| lr 5.78e-04 | 322.83 ms | 52.3% bf16 MFU | 1624065 tok/s step 675/19560 | loss 4.988688 (-1.21z)| norm 0.7909 (+0.03z)| lr 5.79e-04 | 323.04 ms | 52.2% bf16 MFU | 1624011 tok/s step 676/19560 | loss 4.986254 (-1.23z)| norm 0.8312 (+0.29z)| lr 5.79e-04 | 322.67 ms | 52.3% bf16 MFU | 1624051 tok/s step 677/19560 | loss 5.023244 (-0.82z)| norm 0.9255 (+0.96z)| lr 5.80e-04 | 322.45 ms | 52.3% bf16 MFU | 1624145 tok/s step 678/19560 | loss 4.936784 (-1.74z)| norm 0.6732 (-0.71z)| lr 5.81e-04 | 323.00 ms | 52.3% bf16 MFU | 1624097 tok/s step 679/19560 | loss 4.886338 (-2.27z)| norm 0.5921 (-1.24z)| lr 5.82e-04 | 323.13 ms | 52.2% bf16 MFU | 1624019 tok/s step 680/19560 | loss 
4.999411 (-1.01z)| norm 0.5438 (-1.54z)| lr 5.83e-04 | 322.93 ms | 52.3% bf16 MFU | 1623994 tok/s step 681/19560 | loss 4.899426 (-2.08z)| norm 0.5390 (-1.55z)| lr 5.84e-04 | 323.12 ms | 52.2% bf16 MFU | 1623923 tok/s step 682/19560 | loss 4.927174 (-1.75z)| norm 0.5453 (-1.48z)| lr 5.85e-04 | 322.58 ms | 52.3% bf16 MFU | 1623992 tok/s step 683/19560 | loss 4.873587 (-2.29z)| norm 0.4864 (-1.83z)| lr 5.85e-04 | 322.87 ms | 52.3% bf16 MFU | 1623984 tok/s step 684/19560 | loss 4.971961 (-1.19z)| norm 0.5311 (-1.51z)| lr 5.86e-04 | 323.19 ms | 52.2% bf16 MFU | 1623897 tok/s step 685/19560 | loss 4.963676 (-1.27z)| norm 0.6455 (-0.75z)| lr 5.87e-04 | 322.91 ms | 52.3% bf16 MFU | 1623885 tok/s step 686/19560 | loss 4.864251 (-2.30z)| norm 0.9424 (+1.22z)| lr 5.88e-04 | 322.73 ms | 52.3% bf16 MFU | 1623918 tok/s step 687/19560 | loss 4.922982 (-1.64z)| norm 0.9146 (+1.02z)| lr 5.89e-04 | 323.14 ms | 52.2% bf16 MFU | 1623846 tok/s step 688/19560 | loss 4.955513 (-1.26z)| norm 0.7357 (-0.18z)| lr 5.90e-04 | 322.81 ms | 52.3% bf16 MFU | 1623860 tok/s step 689/19560 | loss 4.878910 (-2.07z)| norm 0.7903 (+0.18z)| lr 5.91e-04 | 322.56 ms | 52.3% bf16 MFU | 1623936 tok/s step 690/19560 | loss 4.897873 (-1.83z)| norm 0.7225 (-0.28z)| lr 5.91e-04 | 322.85 ms | 52.3% bf16 MFU | 1623937 tok/s step 691/19560 | loss 4.975124 (-0.97z)| norm 0.6568 (-0.72z)| lr 5.92e-04 | 323.08 ms | 52.2% bf16 MFU | 1623878 tok/s step 692/19560 | loss 4.945214 (-1.29z)| norm 0.6408 (-0.81z)| lr 5.93e-04 | 322.59 ms | 52.3% bf16 MFU | 1623946 tok/s step 693/19560 | loss 4.871378 (-2.06z)| norm 0.6512 (-0.73z)| lr 5.94e-04 | 322.48 ms | 52.3% bf16 MFU | 1624038 tok/s step 694/19560 | loss 4.927730 (-1.42z)| norm 0.6274 (-0.88z)| lr 5.95e-04 | 322.63 ms | 52.3% bf16 MFU | 1624089 tok/s step 695/19560 | loss 4.903580 (-1.67z)| norm 0.6795 (-0.51z)| lr 5.96e-04 | 323.14 ms | 52.2% bf16 MFU | 1624007 tok/s step 696/19560 | loss 4.871758 (-1.98z)| norm 0.6566 (-0.67z)| lr 5.97e-04 | 323.09 ms | 52.2% bf16 
MFU | 1623944 tok/s step 697/19560 | loss 4.842562 (-2.26z)| norm 0.7150 (-0.27z)| lr 5.97e-04 | 322.67 ms | 52.3% bf16 MFU | 1623988 tok/s step 698/19560 | loss 4.941879 (-1.15z)| norm 0.7540 (-0.01z)| lr 5.98e-04 | 323.42 ms | 52.2% bf16 MFU | 1623843 tok/s step 699/19560 | loss 4.894193 (-1.65z)| norm 0.7439 (-0.08z)| lr 5.99e-04 | 323.39 ms | 52.2% bf16 MFU | 1623712 tok/s step 700/19560 | loss 4.862970 (-1.96z)| norm 0.7967 (+0.28z)| lr 6.00e-04 | 323.18 ms | 52.2% bf16 MFU | 1623640 tok/s step 701/19560 | loss 4.964519 (-0.83z)| norm 1.0231 (+1.78z)| lr 6.00e-04 | 322.89 ms | 52.3% bf16 MFU | 1623646 tok/s step 702/19560 | loss 4.896083 (-1.55z)| norm 0.9381 (+1.19z)| lr 6.00e-04 | 322.64 ms | 52.3% bf16 MFU | 1623712 tok/s step 703/19560 | loss 4.946777 (-0.98z)| norm 0.7163 (-0.29z)| lr 6.00e-04 | 323.21 ms | 52.2% bf16 MFU | 1623634 tok/s step 704/19560 | loss 4.921443 (-1.24z)| norm 0.6400 (-0.79z)| lr 6.00e-04 | 322.19 ms | 52.4% bf16 MFU | 1623816 tok/s step 705/19560 | loss 4.883496 (-1.63z)| norm 0.6028 (-1.02z)| lr 6.00e-04 | 322.85 ms | 52.3% bf16 MFU | 1623822 tok/s step 706/19560 | loss 4.858153 (-1.88z)| norm 0.5906 (-1.09z)| lr 6.00e-04 | 322.56 ms | 52.3% bf16 MFU | 1623901 tok/s step 707/19560 | loss 4.883925 (-1.58z)| norm 0.5978 (-1.03z)| lr 6.00e-04 | 323.05 ms | 52.2% bf16 MFU | 1623852 tok/s step 708/19560 | loss 4.902527 (-1.36z)| norm 0.6575 (-0.63z)| lr 6.00e-04 | 322.38 ms | 52.4% bf16 MFU | 1623976 tok/s step 709/19560 | loss 4.788436 (-2.55z)| norm 0.6490 (-0.67z)| lr 6.00e-04 | 322.51 ms | 52.3% bf16 MFU | 1624058 tok/s step 710/19560 | loss 4.858588 (-1.75z)| norm 0.7027 (-0.31z)| lr 6.00e-04 | 322.64 ms | 52.3% bf16 MFU | 1624105 tok/s step 711/19560 | loss 4.900668 (-1.28z)| norm 0.6287 (-0.79z)| lr 6.00e-04 | 322.30 ms | 52.4% bf16 MFU | 1624234 tok/s step 712/19560 | loss 4.812255 (-2.20z)| norm 0.6168 (-0.86z)| lr 6.00e-04 | 323.10 ms | 52.2% bf16 MFU | 1624156 tok/s step 713/19560 | loss 4.789722 (-2.38z)| norm 0.6140 
(-0.86z)| lr 6.00e-04 | 323.09 ms | 52.2% bf16 MFU | 1624085 tok/s step 714/19560 | loss 4.813803 (-2.08z)| norm 0.8224 (+0.54z)| lr 6.00e-04 | 322.71 ms | 52.3% bf16 MFU | 1624113 tok/s step 715/19560 | loss 4.856848 (-1.59z)| norm 0.9713 (+1.53z)| lr 6.00e-04 | 323.54 ms | 52.2% bf16 MFU | 1623931 tok/s step 716/19560 | loss 4.843964 (-1.70z)| norm 0.7927 (+0.33z)| lr 6.00e-04 | 322.67 ms | 52.3% bf16 MFU | 1623975 tok/s step 717/19560 | loss 4.830743 (-1.81z)| norm 0.6847 (-0.39z)| lr 6.00e-04 | 323.44 ms | 52.2% bf16 MFU | 1623824 tok/s step 718/19560 | loss 4.736825 (-2.70z)| norm 0.8112 (+0.45z)| lr 6.00e-04 | 323.18 ms | 52.2% bf16 MFU | 1623747 tok/s step 719/19560 | loss 4.860819 (-1.40z)| norm 0.7532 (+0.06z)| lr 6.00e-04 | 322.57 ms | 52.3% bf16 MFU | 1623826 tok/s step 720/19560 | loss 4.927311 (-0.71z)| norm 0.7661 (+0.13z)| lr 6.00e-04 | 322.60 ms | 52.3% bf16 MFU | 1623893 tok/s step 721/19560 | loss 4.845044 (-1.54z)| norm 0.7537 (+0.04z)| lr 6.00e-04 | 322.70 ms | 52.3% bf16 MFU | 1623933 tok/s step 722/19560 | loss 4.887622 (-1.08z)| norm 0.8110 (+0.42z)| lr 6.00e-04 | 322.37 ms | 52.4% bf16 MFU | 1624053 tok/s step 723/19560 | loss 4.841437 (-1.53z)| norm 0.8767 (+0.85z)| lr 6.00e-04 | 322.76 ms | 52.3% bf16 MFU | 1624069 tok/s step 724/19560 | loss 4.820416 (-1.72z)| norm 0.8706 (+0.81z)| lr 6.00e-04 | 322.78 ms | 52.3% bf16 MFU | 1624080 tok/s step 725/19560 | loss 4.830868 (-1.61z)| norm 0.7195 (-0.20z)| lr 6.00e-04 | 322.85 ms | 52.3% bf16 MFU | 1624072 tok/s step 726/19560 | loss 4.766219 (-2.23z)| norm 0.5976 (-1.04z)| lr 6.00e-04 | 322.66 ms | 52.3% bf16 MFU | 1624114 tok/s step 727/19560 | loss 4.871646 (-1.12z)| norm 0.5606 (-1.28z)| lr 6.00e-04 | 323.16 ms | 52.2% bf16 MFU | 1624026 tok/s step 728/19560 | loss 4.815707 (-1.67z)| norm 0.4959 (-1.70z)| lr 6.00e-04 | 322.80 ms | 52.3% bf16 MFU | 1624035 tok/s step 729/19560 | loss 4.751202 (-2.30z)| norm 0.4953 (-1.69z)| lr 6.00e-04 | 323.50 ms | 52.2% bf16 MFU | 1623867 tok/s step 
730/19560 | loss 4.781316 (-1.95z)| norm 0.4700 (-1.85z)| lr 6.00e-04 | 322.17 ms | 52.4% bf16 MFU | 1624041 tok/s step 731/19560 | loss 4.783220 (-1.90z)| norm 0.5161 (-1.53z)| lr 6.00e-04 | 323.17 ms | 52.2% bf16 MFU | 1623954 tok/s step 732/19560 | loss 4.766355 (-2.02z)| norm 0.4855 (-1.72z)| lr 6.00e-04 | 322.88 ms | 52.3% bf16 MFU | 1623947 tok/s step 733/19560 | loss 4.824569 (-1.41z)| norm 0.5339 (-1.37z)| lr 6.00e-04 | 323.06 ms | 52.2% bf16 MFU | 1623893 tok/s step 734/19560 | loss 4.694496 (-2.64z)| norm 0.6102 (-0.86z)| lr 6.00e-04 | 323.36 ms | 52.2% bf16 MFU | 1623766 tok/s step 735/19560 | loss 4.752392 (-2.02z)| norm 0.6103 (-0.84z)| lr 6.00e-04 | 322.87 ms | 52.3% bf16 MFU | 1623771 tok/s step 736/19560 | loss 4.744374 (-2.05z)| norm 0.6483 (-0.58z)| lr 6.00e-04 | 322.95 ms | 52.3% bf16 MFU | 1623754 tok/s step 737/19560 | loss 4.756208 (-1.90z)| norm 0.7648 (+0.21z)| lr 6.00e-04 | 322.41 ms | 52.3% bf16 MFU | 1623873 tok/s step 738/19560 | loss 4.742828 (-1.99z)| norm 0.7986 (+0.45z)| lr 6.00e-04 | 323.28 ms | 52.2% bf16 MFU | 1623770 tok/s step 739/19560 | loss 4.790648 (-1.51z)| norm 0.7069 (-0.17z)| lr 6.00e-04 | 322.89 ms | 52.3% bf16 MFU | 1623768 tok/s step 740/19560 | loss 4.737453 (-1.98z)| norm 0.6920 (-0.27z)| lr 6.00e-04 | 322.82 ms | 52.3% bf16 MFU | 1623784 tok/s step 741/19560 | loss 4.776419 (-1.58z)| norm 0.7704 (+0.26z)| lr 6.00e-04 | 322.84 ms | 52.3% bf16 MFU | 1623793 tok/s step 742/19560 | loss 4.752944 (-1.77z)| norm 0.8000 (+0.46z)| lr 6.00e-04 | 322.41 ms | 52.3% bf16 MFU | 1623910 tok/s step 743/19560 | loss 4.743009 (-1.83z)| norm 0.7097 (-0.15z)| lr 6.00e-04 | 323.44 ms | 52.2% bf16 MFU | 1623764 tok/s step 744/19560 | loss 4.751386 (-1.71z)| norm 0.6410 (-0.61z)| lr 6.00e-04 | 323.15 ms | 52.2% bf16 MFU | 1623696 tok/s step 745/19560 | loss 4.764440 (-1.56z)| norm 0.7278 (-0.03z)| lr 6.00e-04 | 322.92 ms | 52.3% bf16 MFU | 1623690 tok/s step 746/19560 | loss 4.767522 (-1.51z)| norm 0.8383 (+0.72z)| lr 6.00e-04 | 322.24 
ms | 52.4% bf16 MFU | 1623856 tok/s step 747/19560 | loss 4.795957 (-1.23z)| norm 0.8131 (+0.55z)| lr 6.00e-04 | 323.16 ms | 52.2% bf16 MFU | 1623782 tok/s step 748/19560 | loss 4.753600 (-1.60z)| norm 0.7053 (-0.18z)| lr 6.00e-04 | 322.68 ms | 52.3% bf16 MFU | 1623832 tok/s step 749/19560 | loss 4.709216 (-1.96z)| norm 0.6796 (-0.35z)| lr 6.00e-04 | 322.79 ms | 52.3% bf16 MFU | 1623853 tok/s step 750/19560 | loss 4.724714 (-1.78z)| norm 0.7019 (-0.20z)| lr 6.00e-04 | 323.29 ms | 52.2% bf16 MFU | 1623748 tok/s val loss 4.731918 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating HellaSwag: 70/79 HellaSwag: 2504/10042 = 0.249353 step 751/19560 | loss 4.750709 (-1.52z)| norm 0.7485 (+0.12z)| lr 6.00e-04 | 321.27 ms | 52.5% bf16 MFU | 1624155 tok/s step 752/19560 | loss 4.758900 (-1.42z)| norm 0.7801 (+0.32z)| lr 6.00e-04 | 321.99 ms | 52.4% bf16 MFU | 1624361 tok/s step 753/19560 | loss 4.810132 (-0.94z)| norm 0.6522 (-0.56z)| lr 6.00e-04 | 322.75 ms | 52.3% bf16 MFU | 1624366 tok/s step 754/19560 | loss 4.756321 (-1.40z)| norm 0.5733 (-1.10z)| lr 6.00e-04 | 322.55 ms | 52.3% bf16 MFU | 1624421 tok/s step 755/19560 | loss 4.724876 (-1.66z)| norm 0.5487 (-1.26z)| lr 6.00e-04 | 322.32 ms | 52.4% bf16 MFU | 1624530 tok/s step 756/19560 | loss 4.702960 (-1.82z)| norm 0.4971 (-1.59z)| lr 6.00e-04 | 322.16 ms | 52.4% bf16 MFU | 1624674 tok/s step 757/19560 | loss 4.733330 (-1.52z)| norm 0.4862 (-1.64z)| lr 6.00e-04 | 322.39 ms | 52.4% bf16 MFU | 1624754 tok/s step 758/19560 | loss 4.826579 (-0.68z)| norm 0.5437 (-1.25z)| lr 6.00e-04 | 322.48 ms | 52.3% bf16 MFU | 1624806 tok/s step 759/19560 | loss 4.734684 (-1.48z)| norm 0.8831 (+1.01z)| lr 6.00e-04 | 322.65 ms | 52.3% bf16 MFU | 1624813 tok/s step 760/19560 | loss 4.668581 (-2.02z)| norm 0.6775 (-0.36z)| lr 6.00e-04 | 322.68 ms | 52.3% bf16 MFU | 1624811 tok/s step 
761/19560 | loss 4.682612 (-1.86z)| norm 0.8361 (+0.71z)| lr 6.00e-04 | 322.78 ms | 52.3% bf16 MFU | 1624786 tok/s step 762/19560 | loss 4.712783 (-1.58z)| norm 0.8224 (+0.67z)| lr 6.00e-04 | 322.26 ms | 52.4% bf16 MFU | 1624892 tok/s step 763/19560 | loss 4.757513 (-1.16z)| norm 0.7292 (+0.02z)| lr 6.00e-04 | 322.59 ms | 52.3% bf16 MFU | 1624909 tok/s step 764/19560 | loss 4.650088 (-2.07z)| norm 0.5842 (-0.99z)| lr 6.00e-04 | 322.71 ms | 52.3% bf16 MFU | 1624894 tok/s step 765/19560 | loss 4.698924 (-1.61z)| norm 0.6550 (-0.50z)| lr 6.00e-04 | 322.40 ms | 52.3% bf16 MFU | 1624961 tok/s step 766/19560 | loss 4.732685 (-1.29z)| norm 0.7272 (-0.00z)| lr 6.00e-04 | 322.38 ms | 52.4% bf16 MFU | 1625027 tok/s step 767/19560 | loss 4.713754 (-1.44z)| norm 0.8933 (+1.14z)| lr 6.00e-04 | 322.44 ms | 52.3% bf16 MFU | 1625075 tok/s step 768/19560 | loss 4.704225 (-1.50z)| norm 1.0007 (+1.86z)| lr 6.00e-04 | 322.64 ms | 52.3% bf16 MFU | 1625071 tok/s step 769/19560 | loss 4.752849 (-1.06z)| norm 0.9058 (+1.22z)| lr 6.00e-04 | 322.56 ms | 52.3% bf16 MFU | 1625086 tok/s step 770/19560 | loss 4.725540 (-1.27z)| norm 0.8220 (+0.66z)| lr 6.00e-04 | 322.61 ms | 52.3% bf16 MFU | 1625090 tok/s step 771/19560 | loss 4.686444 (-1.59z)| norm 0.6726 (-0.38z)| lr 6.00e-04 | 322.68 ms | 52.3% bf16 MFU | 1625075 tok/s step 772/19560 | loss 4.723881 (-1.25z)| norm 0.5360 (-1.33z)| lr 6.00e-04 | 322.72 ms | 52.3% bf16 MFU | 1625052 tok/s step 773/19560 | loss 4.645856 (-1.89z)| norm 0.4452 (-1.93z)| lr 6.00e-04 | 322.84 ms | 52.3% bf16 MFU | 1624998 tok/s step 774/19560 | loss 4.724556 (-1.19z)| norm 0.4756 (-1.68z)| lr 6.00e-04 | 323.19 ms | 52.2% bf16 MFU | 1624860 tok/s step 775/19560 | loss 4.713929 (-1.26z)| norm 0.4888 (-1.56z)| lr 6.00e-04 | 322.13 ms | 52.4% bf16 MFU | 1624996 tok/s step 776/19560 | loss 4.669775 (-1.62z)| norm 0.6016 (-0.79z)| lr 6.00e-04 | 323.01 ms | 52.2% bf16 MFU | 1624903 tok/s step 777/19560 | loss 4.708864 (-1.26z)| norm 0.8484 (+0.87z)| lr 6.00e-04 | 322.53 
ms | 52.3% bf16 MFU | 1624936 tok/s step 778/19560 | loss 4.784009 (-0.59z)| norm 0.8905 (+1.14z)| lr 6.00e-04 | 323.03 ms | 52.2% bf16 MFU | 1624842 tok/s step 779/19560 | loss 4.701509 (-1.29z)| norm 0.7122 (-0.06z)| lr 6.00e-04 | 322.16 ms | 52.4% bf16 MFU | 1624971 tok/s step 780/19560 | loss 4.695317 (-1.32z)| norm 0.6949 (-0.17z)| lr 6.00e-04 | 322.20 ms | 52.4% bf16 MFU | 1625083 tok/s step 781/19560 | loss 4.646690 (-1.72z)| norm 0.5945 (-0.84z)| lr 6.00e-04 | 322.75 ms | 52.3% bf16 MFU | 1625051 tok/s step 782/19560 | loss 4.670248 (-1.49z)| norm 0.5197 (-1.33z)| lr 6.00e-04 | 322.22 ms | 52.4% bf16 MFU | 1625154 tok/s step 783/19560 | loss 4.641622 (-1.70z)| norm 0.6750 (-0.27z)| lr 6.00e-04 | 323.39 ms | 52.2% bf16 MFU | 1624958 tok/s step 784/19560 | loss 4.671340 (-1.42z)| norm 0.6059 (-0.73z)| lr 6.00e-04 | 322.14 ms | 52.4% bf16 MFU | 1625087 tok/s step 785/19560 | loss 4.662946 (-1.47z)| norm 0.6321 (-0.56z)| lr 6.00e-04 | 322.96 ms | 52.3% bf16 MFU | 1625002 tok/s step 786/19560 | loss 4.606144 (-1.91z)| norm 0.6847 (-0.21z)| lr 6.00e-04 | 323.17 ms | 52.2% bf16 MFU | 1624870 tok/s step 787/19560 | loss 4.646077 (-1.55z)| norm 0.7391 (+0.16z)| lr 6.00e-04 | 322.48 ms | 52.3% bf16 MFU | 1624916 tok/s step 788/19560 | loss 4.676457 (-1.28z)| norm 0.6007 (-0.79z)| lr 6.00e-04 | 322.98 ms | 52.3% bf16 MFU | 1624836 tok/s step 789/19560 | loss 4.627562 (-1.73z)| norm 0.6312 (-0.57z)| lr 6.00e-04 | 322.68 ms | 52.3% bf16 MFU | 1624834 tok/s step 790/19560 | loss 4.681332 (-1.23z)| norm 0.6403 (-0.50z)| lr 6.00e-04 | 322.39 ms | 52.4% bf16 MFU | 1624906 tok/s step 791/19560 | loss 4.702680 (-1.03z)| norm 0.6070 (-0.72z)| lr 6.00e-04 | 323.07 ms | 52.2% bf16 MFU | 1624801 tok/s step 792/19560 | loss 4.603000 (-1.89z)| norm 0.6346 (-0.52z)| lr 6.00e-04 | 322.57 ms | 52.3% bf16 MFU | 1624829 tok/s step 793/19560 | loss 4.656129 (-1.41z)| norm 0.5581 (-1.05z)| lr 6.00e-04 | 322.60 ms | 52.3% bf16 MFU | 1624848 tok/s step 794/19560 | loss 4.586333 (-2.00z)| 
norm 0.5650 (-0.98z)| lr 6.00e-04 | 323.13 ms | 52.2% bf16 MFU | 1624732 tok/s step 795/19560 | loss 4.649553 (-1.41z)| norm 0.5602 (-1.00z)| lr 6.00e-04 | 322.01 ms | 52.4% bf16 MFU | 1624904 tok/s step 796/19560 | loss 4.658714 (-1.31z)| norm 0.5545 (-1.04z)| lr 6.00e-04 | 323.06 ms | 52.2% bf16 MFU | 1624802 tok/s step 797/19560 | loss 4.588563 (-1.92z)| norm 0.5655 (-0.95z)| lr 6.00e-04 | 322.88 ms | 52.3% bf16 MFU | 1624751 tok/s step 798/19560 | loss 4.658260 (-1.26z)| norm 0.5544 (-1.02z)| lr 6.00e-04 | 322.73 ms | 52.3% bf16 MFU | 1624740 tok/s step 799/19560 | loss 4.592325 (-1.84z)| norm 0.5368 (-1.13z)| lr 6.00e-04 | 322.79 ms | 52.3% bf16 MFU | 1624716 tok/s step 800/19560 | loss 4.567824 (-2.03z)| norm 0.4869 (-1.52z)| lr 6.00e-04 | 322.42 ms | 52.3% bf16 MFU | 1624786 tok/s step 801/19560 | loss 4.609706 (-1.63z)| norm 0.5433 (-1.07z)| lr 6.00e-04 | 322.48 ms | 52.3% bf16 MFU | 1624837 tok/s step 802/19560 | loss 4.573147 (-1.94z)| norm 0.5992 (-0.62z)| lr 6.00e-04 | 323.09 ms | 52.2% bf16 MFU | 1624732 tok/s step 803/19560 | loss 4.572931 (-1.91z)| norm 0.5501 (-0.99z)| lr 6.00e-04 | 322.59 ms | 52.3% bf16 MFU | 1624758 tok/s step 804/19560 | loss 4.718993 (-0.55z)| norm 0.5840 (-0.72z)| lr 6.00e-04 | 322.76 ms | 52.3% bf16 MFU | 1624741 tok/s step 805/19560 | loss 4.594267 (-1.70z)| norm 0.5874 (-0.68z)| lr 6.00e-04 | 322.03 ms | 52.4% bf16 MFU | 1624907 tok/s step 806/19560 | loss 4.601640 (-1.61z)| norm 0.5890 (-0.66z)| lr 6.00e-04 | 322.61 ms | 52.3% bf16 MFU | 1624920 tok/s step 807/19560 | loss 4.602854 (-1.57z)| norm 0.5708 (-0.80z)| lr 6.00e-04 | 323.64 ms | 52.1% bf16 MFU | 1624674 tok/s step 808/19560 | loss 4.610412 (-1.48z)| norm 0.7077 (+0.27z)| lr 6.00e-04 | 322.40 ms | 52.3% bf16 MFU | 1624750 tok/s step 809/19560 | loss 4.604239 (-1.51z)| norm 0.7287 (+0.43z)| lr 6.00e-04 | 322.89 ms | 52.3% bf16 MFU | 1624700 tok/s step 810/19560 | loss 4.546172 (-2.02z)| norm 0.5808 (-0.75z)| lr 6.00e-04 | 323.09 ms | 52.2% bf16 MFU | 1624601 tok/s 
step 811/19560 | loss 4.540331 (-2.02z)| norm 0.5491 (-1.01z)| lr 6.00e-04 | 322.53 ms | 52.3% bf16 MFU | 1624647 tok/s step 812/19560 | loss 4.581498 (-1.62z)| norm 0.5331 (-1.14z)| lr 6.00e-04 | 322.36 ms | 52.4% bf16 MFU | 1624734 tok/s step 813/19560 | loss 4.624102 (-1.21z)| norm 0.5494 (-1.00z)| lr 6.00e-04 | 322.88 ms | 52.3% bf16 MFU | 1624687 tok/s step 814/19560 | loss 4.606969 (-1.35z)| norm 0.5397 (-1.07z)| lr 6.00e-04 | 323.03 ms | 52.2% bf16 MFU | 1624603 tok/s step 815/19560 | loss 4.528615 (-2.04z)| norm 0.5090 (-1.30z)| lr 6.00e-04 | 322.43 ms | 52.3% bf16 MFU | 1624676 tok/s step 816/19560 | loss 4.601659 (-1.35z)| norm 0.5698 (-0.79z)| lr 6.00e-04 | 322.40 ms | 52.3% bf16 MFU | 1624752 tok/s step 817/19560 | loss 4.568259 (-1.63z)| norm 0.5680 (-0.80z)| lr 6.00e-04 | 323.40 ms | 52.2% bf16 MFU | 1624574 tok/s step 818/19560 | loss 4.580789 (-1.49z)| norm 0.6613 (-0.03z)| lr 6.00e-04 | 322.47 ms | 52.3% bf16 MFU | 1624637 tok/s step 819/19560 | loss 4.603157 (-1.27z)| norm 0.6944 (+0.24z)| lr 6.00e-04 | 322.09 ms | 52.4% bf16 MFU | 1624793 tok/s step 820/19560 | loss 4.594363 (-1.34z)| norm 0.6182 (-0.38z)| lr 6.00e-04 | 323.65 ms | 52.1% bf16 MFU | 1624548 tok/s step 821/19560 | loss 4.527572 (-1.93z)| norm 0.5325 (-1.07z)| lr 6.00e-04 | 322.99 ms | 52.3% bf16 MFU | 1624482 tok/s step 822/19560 | loss 4.554987 (-1.65z)| norm 0.5277 (-1.10z)| lr 6.00e-04 | 322.54 ms | 52.3% bf16 MFU | 1624533 tok/s step 823/19560 | loss 4.559875 (-1.58z)| norm 0.5441 (-0.95z)| lr 6.00e-04 | 323.12 ms | 52.2% bf16 MFU | 1624435 tok/s step 824/19560 | loss 4.542778 (-1.71z)| norm 0.6095 (-0.42z)| lr 6.00e-04 | 322.88 ms | 52.3% bf16 MFU | 1624403 tok/s step 825/19560 | loss 4.549355 (-1.62z)| norm 0.5806 (-0.65z)| lr 6.00e-04 | 322.92 ms | 52.3% bf16 MFU | 1624361 tok/s step 826/19560 | loss 4.546217 (-1.63z)| norm 0.4836 (-1.40z)| lr 6.00e-04 | 322.83 ms | 52.3% bf16 MFU | 1624345 tok/s step 827/19560 | loss 4.642519 (-0.71z)| norm 0.5038 (-1.22z)| lr 6.00e-04 | 
322.52 ms | 52.3% bf16 MFU | 1624406 tok/s step 828/19560 | loss 4.553758 (-1.53z)| norm 0.5305 (-0.99z)| lr 6.00e-04 | 322.56 ms | 52.3% bf16 MFU | 1624455 tok/s step 829/19560 | loss 4.534232 (-1.70z)| norm 0.4886 (-1.33z)| lr 6.00e-04 | 322.88 ms | 52.3% bf16 MFU | 1624420 tok/s step 830/19560 | loss 4.507692 (-1.93z)| norm 0.4488 (-1.65z)| lr 6.00e-04 | 322.81 ms | 52.3% bf16 MFU | 1624406 tok/s step 831/19560 | loss 4.556407 (-1.45z)| norm 0.4433 (-1.66z)| lr 6.00e-04 | 323.26 ms | 52.2% bf16 MFU | 1624281 tok/s step 832/19560 | loss 4.568287 (-1.32z)| norm 0.4338 (-1.70z)| lr 6.00e-04 | 322.10 ms | 52.4% bf16 MFU | 1624452 tok/s step 833/19560 | loss 4.608150 (-0.91z)| norm 0.5693 (-0.60z)| lr 6.00e-04 | 323.35 ms | 52.2% bf16 MFU | 1624301 tok/s step 834/19560 | loss 4.575103 (-1.22z)| norm 0.7558 (+0.91z)| lr 6.00e-04 | 323.01 ms | 52.2% bf16 MFU | 1624242 tok/s step 835/19560 | loss 4.559517 (-1.36z)| norm 0.7067 (+0.50z)| lr 6.00e-04 | 322.59 ms | 52.3% bf16 MFU | 1624293 tok/s step 836/19560 | loss 4.593911 (-1.01z)| norm 0.6912 (+0.37z)| lr 6.00e-04 | 323.06 ms | 52.2% bf16 MFU | 1624223 tok/s step 837/19560 | loss 4.529801 (-1.63z)| norm 0.7482 (+0.83z)| lr 6.00e-04 | 322.93 ms | 52.3% bf16 MFU | 1624188 tok/s step 838/19560 | loss 4.556854 (-1.34z)| norm 0.6626 (+0.14z)| lr 6.00e-04 | 322.44 ms | 52.3% bf16 MFU | 1624278 tok/s step 839/19560 | loss 4.727320 (+0.42z)| norm 0.6642 (+0.15z)| lr 6.00e-04 | 322.69 ms | 52.3% bf16 MFU | 1624300 tok/s step 840/19560 | loss 4.547984 (-1.42z)| norm 0.5160 (-1.04z)| lr 6.00e-04 | 322.87 ms | 52.3% bf16 MFU | 1624278 tok/s step 841/19560 | loss 4.543496 (-1.44z)| norm 0.4255 (-1.74z)| lr 6.00e-04 | 323.86 ms | 52.1% bf16 MFU | 1624007 tok/s step 842/19560 | loss 4.604276 (-0.80z)| norm 0.4637 (-1.41z)| lr 6.00e-04 | 323.09 ms | 52.2% bf16 MFU | 1623944 tok/s step 843/19560 | loss 4.533867 (-1.51z)| norm 0.4794 (-1.28z)| lr 6.00e-04 | 322.68 ms | 52.3% bf16 MFU | 1623987 tok/s step 844/19560 | loss 4.484531 
(-1.98z)| norm 0.4853 (-1.22z)| lr 6.00e-04 | 322.68 ms | 52.3% bf16 MFU | 1624028 tok/s step 845/19560 | loss 4.554555 (-1.24z)| norm 0.4964 (-1.11z)| lr 6.00e-04 | 322.93 ms | 52.3% bf16 MFU | 1624002 tok/s step 846/19560 | loss 4.527771 (-1.49z)| norm 0.6121 (-0.16z)| lr 6.00e-04 | 322.57 ms | 52.3% bf16 MFU | 1624070 tok/s step 847/19560 | loss 4.514639 (-1.61z)| norm 0.6127 (-0.15z)| lr 6.00e-04 | 323.11 ms | 52.2% bf16 MFU | 1623997 tok/s step 848/19560 | loss 4.520532 (-1.54z)| norm 0.5448 (-0.69z)| lr 6.00e-04 | 322.71 ms | 52.3% bf16 MFU | 1624030 tok/s step 849/19560 | loss 4.551583 (-1.20z)| norm 0.5476 (-0.66z)| lr 6.00e-04 | 322.61 ms | 52.3% bf16 MFU | 1624086 tok/s step 850/19560 | loss 4.454568 (-2.21z)| norm 0.5909 (-0.29z)| lr 6.00e-04 | 323.57 ms | 52.2% bf16 MFU | 1623898 tok/s step 851/19560 | loss 4.559063 (-1.08z)| norm 0.6485 (+0.21z)| lr 6.00e-04 | 322.70 ms | 52.3% bf16 MFU | 1623937 tok/s step 852/19560 | loss 4.474381 (-1.96z)| norm 0.6051 (-0.14z)| lr 6.00e-04 | 323.12 ms | 52.2% bf16 MFU | 1623868 tok/s step 853/19560 | loss 4.519273 (-1.45z)| norm 0.5834 (-0.32z)| lr 6.00e-04 | 323.18 ms | 52.2% bf16 MFU | 1623788 tok/s step 854/19560 | loss 4.499451 (-1.64z)| norm 0.5967 (-0.21z)| lr 6.00e-04 | 322.65 ms | 52.3% bf16 MFU | 1623845 tok/s step 855/19560 | loss 4.513379 (-1.48z)| norm 0.5701 (-0.44z)| lr 6.00e-04 | 323.33 ms | 52.2% bf16 MFU | 1623729 tok/s step 856/19560 | loss 4.469506 (-1.93z)| norm 0.5184 (-0.89z)| lr 6.00e-04 | 322.75 ms | 52.3% bf16 MFU | 1623766 tok/s step 857/19560 | loss 4.516726 (-1.39z)| norm 0.4881 (-1.15z)| lr 6.00e-04 | 322.60 ms | 52.3% bf16 MFU | 1623838 tok/s step 858/19560 | loss 4.501809 (-1.53z)| norm 0.4874 (-1.16z)| lr 6.00e-04 | 323.67 ms | 52.1% bf16 MFU | 1623636 tok/s step 859/19560 | loss 4.596187 (-0.48z)| norm 0.5102 (-0.96z)| lr 6.00e-04 | 322.58 ms | 52.3% bf16 MFU | 1623720 tok/s step 860/19560 | loss 4.523275 (-1.27z)| norm 0.5262 (-0.82z)| lr 6.00e-04 | 322.98 ms | 52.3% bf16 MFU | 
1623699 tok/s step 861/19560 | loss 4.548597 (-0.98z)| norm 0.5660 (-0.48z)| lr 6.00e-04 | 323.28 ms | 52.2% bf16 MFU | 1623602 tok/s step 862/19560 | loss 4.460073 (-1.92z)| norm 0.5790 (-0.37z)| lr 6.00e-04 | 323.00 ms | 52.3% bf16 MFU | 1623581 tok/s step 863/19560 | loss 4.587985 (-0.50z)| norm 0.6137 (-0.06z)| lr 6.00e-04 | 323.23 ms | 52.2% bf16 MFU | 1623505 tok/s step 864/19560 | loss 4.545689 (-0.95z)| norm 0.7424 (+1.04z)| lr 6.00e-04 | 322.74 ms | 52.3% bf16 MFU | 1623554 tok/s step 865/19560 | loss 4.530947 (-1.10z)| norm 0.6848 (+0.55z)| lr 6.00e-04 | 323.12 ms | 52.2% bf16 MFU | 1623506 tok/s step 866/19560 | loss 4.529347 (-1.10z)| norm 0.7610 (+1.22z)| lr 6.00e-04 | 323.10 ms | 52.2% bf16 MFU | 1623464 tok/s step 867/19560 | loss 4.592486 (-0.38z)| norm 0.6393 (+0.17z)| lr 6.00e-04 | 322.45 ms | 52.3% bf16 MFU | 1623589 tok/s step 868/19560 | loss 4.479220 (-1.63z)| norm 0.5195 (-0.87z)| lr 6.00e-04 | 322.59 ms | 52.3% bf16 MFU | 1623672 tok/s step 869/19560 | loss 4.501421 (-1.37z)| norm 0.4768 (-1.22z)| lr 6.00e-04 | 323.61 ms | 52.2% bf16 MFU | 1623494 tok/s step 870/19560 | loss 4.479470 (-1.59z)| norm 0.4399 (-1.52z)| lr 6.00e-04 | 323.09 ms | 52.2% bf16 MFU | 1623455 tok/s step 871/19560 | loss 4.455842 (-1.82z)| norm 0.3992 (-1.84z)| lr 6.00e-04 | 322.96 ms | 52.3% bf16 MFU | 1623450 tok/s step 872/19560 | loss 4.486840 (-1.45z)| norm 0.3829 (-1.94z)| lr 6.00e-04 | 323.38 ms | 52.2% bf16 MFU | 1623341 tok/s step 873/19560 | loss 4.453401 (-1.79z)| norm 0.3922 (-1.82z)| lr 6.00e-04 | 322.84 ms | 52.3% bf16 MFU | 1623373 tok/s step 874/19560 | loss 4.377589 (-2.57z)| norm 0.4649 (-1.19z)| lr 6.00e-04 | 322.84 ms | 52.3% bf16 MFU | 1623404 tok/s step 875/19560 | loss 4.456231 (-1.68z)| norm 0.4779 (-1.06z)| lr 6.00e-04 | 323.26 ms | 52.2% bf16 MFU | 1623329 tok/s step 876/19560 | loss 4.519008 (-0.97z)| norm 0.4777 (-1.05z)| lr 6.00e-04 | 323.23 ms | 52.2% bf16 MFU | 1623264 tok/s step 877/19560 | loss 4.517462 (-0.97z)| norm 0.4947 (-0.89z)| lr 
6.00e-04 | 322.84 ms | 52.3% bf16 MFU | 1623300 tok/s step 878/19560 | loss 4.509140 (-1.05z)| norm 0.5601 (-0.32z)| lr 6.00e-04 | 322.77 ms | 52.3% bf16 MFU | 1623352 tok/s step 879/19560 | loss 4.636855 (+0.40z)| norm 0.5959 (+0.00z)| lr 6.00e-04 | 322.52 ms | 52.3% bf16 MFU | 1623465 tok/s step 880/19560 | loss 4.536485 (-0.73z)| norm 0.8146 (+1.89z)| lr 6.00e-04 | 323.67 ms | 52.1% bf16 MFU | 1623284 tok/s step 881/19560 | loss 4.504397 (-1.09z)| norm 0.9071 (+2.60z)| lr 6.00e-04 | 322.63 ms | 52.3% bf16 MFU | 1623373 tok/s step 882/19560 | loss 4.567525 (-0.34z)| norm 0.7612 (+1.35z)| lr 6.00e-04 | 322.64 ms | 52.3% bf16 MFU | 1623453 tok/s step 883/19560 | loss 4.501540 (-1.11z)| norm 0.6687 (+0.57z)| lr 6.00e-04 | 322.49 ms | 52.3% bf16 MFU | 1623568 tok/s step 884/19560 | loss 4.465900 (-1.50z)| norm 0.5694 (-0.26z)| lr 6.00e-04 | 322.78 ms | 52.3% bf16 MFU | 1623604 tok/s step 885/19560 | loss 4.508203 (-0.99z)| norm 0.5253 (-0.64z)| lr 6.00e-04 | 322.70 ms | 52.3% bf16 MFU | 1623658 tok/s step 886/19560 | loss 4.474409 (-1.39z)| norm 0.5023 (-0.83z)| lr 6.00e-04 | 323.07 ms | 52.2% bf16 MFU | 1623617 tok/s step 887/19560 | loss 4.518575 (-0.84z)| norm 0.4296 (-1.43z)| lr 6.00e-04 | 323.15 ms | 52.2% bf16 MFU | 1623557 tok/s step 888/19560 | loss 4.425880 (-1.94z)| norm 0.4267 (-1.43z)| lr 6.00e-04 | 322.57 ms | 52.3% bf16 MFU | 1623647 tok/s step 889/19560 | loss 4.480249 (-1.26z)| norm 0.4374 (-1.32z)| lr 6.00e-04 | 323.53 ms | 52.2% bf16 MFU | 1623491 tok/s step 890/19560 | loss 4.411825 (-2.05z)| norm 0.4264 (-1.40z)| lr 6.00e-04 | 322.05 ms | 52.4% bf16 MFU | 1623714 tok/s step 891/19560 | loss 4.487944 (-1.12z)| norm 0.3899 (-1.68z)| lr 6.00e-04 | 323.27 ms | 52.2% bf16 MFU | 1623618 tok/s step 892/19560 | loss 4.467730 (-1.34z)| norm 0.3721 (-1.80z)| lr 6.00e-04 | 322.81 ms | 52.3% bf16 MFU | 1623644 tok/s step 893/19560 | loss 4.387103 (-2.27z)| norm 0.3451 (-1.98z)| lr 6.00e-04 | 323.55 ms | 52.2% bf16 MFU | 1623484 tok/s step 894/19560 | loss 
4.432099 (-1.70z)| norm 0.4041 (-1.46z)| lr 6.00e-04 | 323.25 ms | 52.2% bf16 MFU | 1623407 tok/s step 895/19560 | loss 4.486924 (-1.02z)| norm 0.4702 (-0.91z)| lr 6.00e-04 | 322.47 ms | 52.3% bf16 MFU | 1623528 tok/s step 896/19560 | loss 4.572297 (+0.03z)| norm 0.5171 (-0.50z)| lr 6.00e-04 | 323.38 ms | 52.2% bf16 MFU | 1623416 tok/s step 897/19560 | loss 4.494519 (-0.92z)| norm 0.6089 (+0.35z)| lr 6.00e-04 | 323.24 ms | 52.2% bf16 MFU | 1623343 tok/s step 898/19560 | loss 4.444746 (-1.53z)| norm 0.6813 (+1.06z)| lr 6.00e-04 | 322.90 ms | 52.3% bf16 MFU | 1623361 tok/s step 899/19560 | loss 4.427084 (-1.72z)| norm 0.6165 (+0.45z)| lr 6.00e-04 | 323.71 ms | 52.1% bf16 MFU | 1623174 tok/s step 900/19560 | loss 4.448463 (-1.44z)| norm 0.6340 (+0.61z)| lr 6.00e-04 | 322.32 ms | 52.4% bf16 MFU | 1623346 tok/s step 901/19560 | loss 4.493427 (-0.85z)| norm 0.6015 (+0.29z)| lr 6.00e-04 | 322.99 ms | 52.3% bf16 MFU | 1623341 tok/s step 902/19560 | loss 4.491237 (-0.87z)| norm 0.5868 (+0.14z)| lr 6.00e-04 | 323.35 ms | 52.2% bf16 MFU | 1623244 tok/s step 903/19560 | loss 4.455154 (-1.32z)| norm 0.5297 (-0.41z)| lr 6.00e-04 | 323.04 ms | 52.2% bf16 MFU | 1623232 tok/s step 904/19560 | loss 4.451303 (-1.35z)| norm 0.4985 (-0.70z)| lr 6.00e-04 | 323.38 ms | 52.2% bf16 MFU | 1623135 tok/s step 905/19560 | loss 4.391316 (-2.10z)| norm 0.5159 (-0.52z)| lr 6.00e-04 | 322.32 ms | 52.4% bf16 MFU | 1623308 tok/s step 906/19560 | loss 4.433712 (-1.56z)| norm 0.4424 (-1.25z)| lr 6.00e-04 | 322.60 ms | 52.3% bf16 MFU | 1623402 tok/s step 907/19560 | loss 4.450596 (-1.31z)| norm 0.4416 (-1.24z)| lr 6.00e-04 | 323.03 ms | 52.2% bf16 MFU | 1623382 tok/s step 908/19560 | loss 4.391743 (-2.08z)| norm 0.4435 (-1.20z)| lr 6.00e-04 | 323.16 ms | 52.2% bf16 MFU | 1623332 tok/s step 909/19560 | loss 4.467037 (-1.04z)| norm 0.4097 (-1.52z)| lr 6.00e-04 | 322.34 ms | 52.4% bf16 MFU | 1623491 tok/s step 910/19560 | loss 4.426228 (-1.58z)| norm 0.4409 (-1.19z)| lr 6.00e-04 | 322.69 ms | 52.3% bf16 
MFU | 1623553 tok/s step 911/19560 | loss 4.384405 (-2.10z)| norm 0.4533 (-1.05z)| lr 6.00e-04 | 322.60 ms | 52.3% bf16 MFU | 1623635 tok/s step 912/19560 | loss 4.432291 (-1.43z)| norm 0.5030 (-0.54z)| lr 6.00e-04 | 322.74 ms | 52.3% bf16 MFU | 1623678 tok/s step 913/19560 | loss 4.387957 (-2.00z)| norm 0.5230 (-0.33z)| lr 6.00e-04 | 322.80 ms | 52.3% bf16 MFU | 1623704 tok/s step 914/19560 | loss 4.421427 (-1.52z)| norm 0.5007 (-0.54z)| lr 6.00e-04 | 322.62 ms | 52.3% bf16 MFU | 1623774 tok/s step 915/19560 | loss 4.427155 (-1.42z)| norm 0.4845 (-0.70z)| lr 6.00e-04 | 322.57 ms | 52.3% bf16 MFU | 1623852 tok/s step 916/19560 | loss 4.426624 (-1.41z)| norm 0.4647 (-0.89z)| lr 6.00e-04 | 322.61 ms | 52.3% bf16 MFU | 1623917 tok/s step 917/19560 | loss 4.364565 (-2.19z)| norm 0.4699 (-0.82z)| lr 6.00e-04 | 322.84 ms | 52.3% bf16 MFU | 1623920 tok/s step 918/19560 | loss 4.415363 (-1.50z)| norm 0.5083 (-0.42z)| lr 6.00e-04 | 322.94 ms | 52.3% bf16 MFU | 1623899 tok/s step 919/19560 | loss 4.372352 (-2.06z)| norm 0.5103 (-0.39z)| lr 6.00e-04 | 322.46 ms | 52.3% bf16 MFU | 1624000 tok/s step 920/19560 | loss 4.485586 (-0.51z)| norm 0.5489 (+0.01z)| lr 6.00e-04 | 322.81 ms | 52.3% bf16 MFU | 1624006 tok/s step 921/19560 | loss 4.420200 (-1.38z)| norm 0.5793 (+0.32z)| lr 6.00e-04 | 322.91 ms | 52.3% bf16 MFU | 1623986 tok/s step 922/19560 | loss 4.448822 (-0.97z)| norm 0.5836 (+0.37z)| lr 6.00e-04 | 322.68 ms | 52.3% bf16 MFU | 1624028 tok/s step 923/19560 | loss 4.456485 (-0.86z)| norm 0.5008 (-0.48z)| lr 6.00e-04 | 322.51 ms | 52.3% bf16 MFU | 1624110 tok/s step 924/19560 | loss 4.425251 (-1.27z)| norm 0.4745 (-0.74z)| lr 6.00e-04 | 322.33 ms | 52.4% bf16 MFU | 1624232 tok/s step 925/19560 | loss 4.515070 (-0.01z)| norm 0.4778 (-0.70z)| lr 6.00e-04 | 322.97 ms | 52.3% bf16 MFU | 1624186 tok/s step 926/19560 | loss 4.410030 (-1.47z)| norm 0.5401 (-0.06z)| lr 6.00e-04 | 322.90 ms | 52.3% bf16 MFU | 1624160 tok/s step 927/19560 | loss 4.404501 (-1.52z)| norm 0.4812 
(-0.66z)| lr 6.00e-04 | 322.58 ms | 52.3% bf16 MFU | 1624217 tok/s step 928/19560 | loss 4.431636 (-1.12z)| norm 0.4799 (-0.67z)| lr 6.00e-04 | 322.55 ms | 52.3% bf16 MFU | 1624280 tok/s step 929/19560 | loss 4.576412 (+0.92z)| norm 0.6022 (+0.57z)| lr 6.00e-04 | 322.36 ms | 52.4% bf16 MFU | 1624385 tok/s step 930/19560 | loss 4.432632 (-1.09z)| norm 0.6696 (+1.25z)| lr 6.00e-04 | 323.04 ms | 52.2% bf16 MFU | 1624315 tok/s step 931/19560 | loss 4.451801 (-0.81z)| norm 0.6462 (+1.00z)| lr 6.00e-04 | 322.76 ms | 52.3% bf16 MFU | 1624320 tok/s step 932/19560 | loss 4.421003 (-1.24z)| norm 0.5632 (+0.16z)| lr 6.00e-04 | 323.01 ms | 52.2% bf16 MFU | 1624260 tok/s step 933/19560 | loss 4.415336 (-1.30z)| norm 0.4687 (-0.78z)| lr 6.00e-04 | 322.78 ms | 52.3% bf16 MFU | 1624261 tok/s step 934/19560 | loss 4.434638 (-1.01z)| norm 0.4074 (-1.38z)| lr 6.00e-04 | 322.86 ms | 52.3% bf16 MFU | 1624242 tok/s step 935/19560 | loss 4.375488 (-1.83z)| norm 0.4156 (-1.28z)| lr 6.00e-04 | 322.78 ms | 52.3% bf16 MFU | 1624245 tok/s step 936/19560 | loss 4.456660 (-0.65z)| norm 0.4230 (-1.19z)| lr 6.00e-04 | 322.85 ms | 52.3% bf16 MFU | 1624230 tok/s step 937/19560 | loss 4.417194 (-1.21z)| norm 0.5059 (-0.34z)| lr 6.00e-04 | 322.22 ms | 52.4% bf16 MFU | 1624374 tok/s step 938/19560 | loss 4.459547 (-0.58z)| norm 0.5765 (+0.38z)| lr 6.00e-04 | 322.85 ms | 52.3% bf16 MFU | 1624352 tok/s step 939/19560 | loss 4.408288 (-1.30z)| norm 0.5991 (+0.60z)| lr 6.00e-04 | 323.26 ms | 52.2% bf16 MFU | 1624229 tok/s step 940/19560 | loss 4.373230 (-1.78z)| norm 0.5430 (+0.03z)| lr 6.00e-04 | 322.31 ms | 52.4% bf16 MFU | 1624350 tok/s step 941/19560 | loss 4.379801 (-1.66z)| norm 0.5271 (-0.13z)| lr 6.00e-04 | 322.41 ms | 52.3% bf16 MFU | 1624442 tok/s step 942/19560 | loss 4.409242 (-1.22z)| norm 0.5201 (-0.20z)| lr 6.00e-04 | 322.34 ms | 52.4% bf16 MFU | 1624544 tok/s step 943/19560 | loss 4.446931 (-0.66z)| norm 0.4840 (-0.56z)| lr 6.00e-04 | 323.08 ms | 52.2% bf16 MFU | 1624457 tok/s step 
944/19560 | loss 4.464184 (-0.40z)| norm 0.4879 (-0.52z)| lr 6.00e-04 | 322.51 ms | 52.3% bf16 MFU | 1624517 tok/s step 945/19560 | loss 4.401361 (-1.30z)| norm 0.4788 (-0.60z)| lr 6.00e-04 | 322.57 ms | 52.3% bf16 MFU | 1624557 tok/s step 946/19560 | loss 4.367875 (-1.76z)| norm 0.4487 (-0.89z)| lr 6.00e-04 | 322.85 ms | 52.3% bf16 MFU | 1624526 tok/s step 947/19560 | loss 4.399831 (-1.27z)| norm 0.4133 (-1.24z)| lr 6.00e-04 | 322.74 ms | 52.3% bf16 MFU | 1624526 tok/s step 948/19560 | loss 4.397146 (-1.29z)| norm 0.4093 (-1.26z)| lr 6.00e-04 | 322.70 ms | 52.3% bf16 MFU | 1624533 tok/s step 949/19560 | loss 4.422815 (-0.90z)| norm 0.4580 (-0.75z)| lr 6.00e-04 | 322.60 ms | 52.3% bf16 MFU | 1624565 tok/s step 950/19560 | loss 4.474222 (-0.14z)| norm 0.4285 (-1.04z)| lr 6.00e-04 | 322.70 ms | 52.3% bf16 MFU | 1624572 tok/s step 951/19560 | loss 4.389612 (-1.36z)| norm 0.4395 (-0.92z)| lr 6.00e-04 | 322.80 ms | 52.3% bf16 MFU | 1624554 tok/s step 952/19560 | loss 4.360703 (-1.75z)| norm 0.4810 (-0.49z)| lr 6.00e-04 | 322.61 ms | 52.3% bf16 MFU | 1624584 tok/s step 953/19560 | loss 4.321119 (-2.26z)| norm 0.4401 (-0.89z)| lr 6.00e-04 | 322.70 ms | 52.3% bf16 MFU | 1624588 tok/s step 954/19560 | loss 4.403850 (-1.06z)| norm 0.4896 (-0.39z)| lr 6.00e-04 | 322.25 ms | 52.4% bf16 MFU | 1624708 tok/s step 955/19560 | loss 4.425549 (-0.74z)| norm 0.5754 (+0.47z)| lr 6.00e-04 | 323.27 ms | 52.2% bf16 MFU | 1624564 tok/s step 956/19560 | loss 4.368651 (-1.55z)| norm 0.5164 (-0.13z)| lr 6.00e-04 | 322.74 ms | 52.3% bf16 MFU | 1624559 tok/s step 957/19560 | loss 4.351866 (-1.75z)| norm 0.4522 (-0.77z)| lr 6.00e-04 | 322.60 ms | 52.3% bf16 MFU | 1624591 tok/s step 958/19560 | loss 4.359515 (-1.61z)| norm 0.4542 (-0.75z)| lr 6.00e-04 | 322.53 ms | 52.3% bf16 MFU | 1624640 tok/s step 959/19560 | loss 4.370401 (-1.43z)| norm 0.4046 (-1.24z)| lr 6.00e-04 | 322.53 ms | 52.3% bf16 MFU | 1624686 tok/s step 960/19560 | loss 4.415242 (-0.78z)| norm 0.4209 (-1.08z)| lr 6.00e-04 | 322.76 
ms | 52.3% bf16 MFU | 1624670 tok/s step 961/19560 | loss 4.354748 (-1.63z)| norm 0.4342 (-0.93z)| lr 6.00e-04 | 322.32 ms | 52.4% bf16 MFU | 1624766 tok/s step 962/19560 | loss 4.435390 (-0.45z)| norm 0.4271 (-0.99z)| lr 6.00e-04 | 323.28 ms | 52.2% bf16 MFU | 1624618 tok/s step 963/19560 | loss 4.371108 (-1.36z)| norm 0.5329 (+0.10z)| lr 6.00e-04 | 323.13 ms | 52.2% bf16 MFU | 1624513 tok/s step 964/19560 | loss 4.364655 (-1.44z)| norm 0.6184 (+0.99z)| lr 6.00e-04 | 322.38 ms | 52.4% bf16 MFU | 1624603 tok/s step 965/19560 | loss 4.363682 (-1.43z)| norm 0.5986 (+0.81z)| lr 6.00e-04 | 323.00 ms | 52.3% bf16 MFU | 1624532 tok/s step 966/19560 | loss 4.383604 (-1.12z)| norm 0.6175 (+1.02z)| lr 6.00e-04 | 322.38 ms | 52.4% bf16 MFU | 1624621 tok/s step 967/19560 | loss 4.351465 (-1.63z)| norm 0.5749 (+0.58z)| lr 6.00e-04 | 323.03 ms | 52.2% bf16 MFU | 1624541 tok/s step 968/19560 | loss 4.487774 (+0.47z)| norm 0.4916 (-0.31z)| lr 6.00e-04 | 322.60 ms | 52.3% bf16 MFU | 1624574 tok/s step 969/19560 | loss 4.346553 (-1.68z)| norm 0.4743 (-0.50z)| lr 6.00e-04 | 322.42 ms | 52.3% bf16 MFU | 1624651 tok/s step 970/19560 | loss 4.336484 (-1.82z)| norm 0.4271 (-1.00z)| lr 6.00e-04 | 322.70 ms | 52.3% bf16 MFU | 1624654 tok/s step 971/19560 | loss 4.435225 (-0.27z)| norm 0.4410 (-0.85z)| lr 6.00e-04 | 322.48 ms | 52.3% bf16 MFU | 1624712 tok/s step 972/19560 | loss 4.333393 (-1.83z)| norm 0.4309 (-0.95z)| lr 6.00e-04 | 322.49 ms | 52.3% bf16 MFU | 1624765 tok/s step 973/19560 | loss 4.358710 (-1.41z)| norm 0.4381 (-0.86z)| lr 6.00e-04 | 322.25 ms | 52.4% bf16 MFU | 1624875 tok/s step 974/19560 | loss 4.304809 (-2.19z)| norm 0.4118 (-1.12z)| lr 6.00e-04 | 323.37 ms | 52.2% bf16 MFU | 1624697 tok/s step 975/19560 | loss 4.340498 (-1.62z)| norm 0.4034 (-1.19z)| lr 6.00e-04 | 322.85 ms | 52.3% bf16 MFU | 1624660 tok/s step 976/19560 | loss 4.345835 (-1.51z)| norm 0.4029 (-1.18z)| lr 6.00e-04 | 322.53 ms | 52.3% bf16 MFU | 1624706 tok/s step 977/19560 | loss 4.387686 (-0.86z)| 
norm 0.3893 (-1.30z)| lr 6.00e-04 | 322.67 ms | 52.3% bf16 MFU | 1624713 tok/s step 978/19560 | loss 4.248606 (-2.86z)| norm 0.4698 (-0.45z)| lr 6.00e-04 | 323.03 ms | 52.2% bf16 MFU | 1624628 tok/s step 979/19560 | loss 4.306850 (-1.96z)| norm 0.5072 (-0.05z)| lr 6.00e-04 | 322.86 ms | 52.3% bf16 MFU | 1624591 tok/s step 980/19560 | loss 4.310771 (-1.86z)| norm 0.5280 (+0.18z)| lr 6.00e-04 | 322.54 ms | 52.3% bf16 MFU | 1624637 tok/s step 981/19560 | loss 4.389370 (-0.71z)| norm 0.5381 (+0.29z)| lr 6.00e-04 | 323.09 ms | 52.2% bf16 MFU | 1624542 tok/s step 982/19560 | loss 4.329943 (-1.55z)| norm 0.5678 (+0.61z)| lr 6.00e-04 | 322.96 ms | 52.3% bf16 MFU | 1624485 tok/s step 983/19560 | loss 4.327624 (-1.55z)| norm 0.4728 (-0.39z)| lr 6.00e-04 | 322.78 ms | 52.3% bf16 MFU | 1624476 tok/s step 984/19560 | loss 4.453816 (+0.26z)| norm 0.4585 (-0.54z)| lr 6.00e-04 | 322.61 ms | 52.3% bf16 MFU | 1624508 tok/s step 985/19560 | loss 4.333447 (-1.44z)| norm 0.4562 (-0.56z)| lr 6.00e-04 | 322.64 ms | 52.3% bf16 MFU | 1624532 tok/s step 986/19560 | loss 4.336009 (-1.38z)| norm 0.4903 (-0.20z)| lr 6.00e-04 | 322.68 ms | 52.3% bf16 MFU | 1624546 tok/s step 987/19560 | loss 4.390206 (-0.60z)| norm 0.5948 (+0.90z)| lr 6.00e-04 | 322.46 ms | 52.3% bf16 MFU | 1624614 tok/s step 988/19560 | loss 4.361480 (-1.00z)| norm 0.5287 (+0.20z)| lr 6.00e-04 | 322.95 ms | 52.3% bf16 MFU | 1624553 tok/s step 989/19560 | loss 4.316251 (-1.64z)| norm 0.3977 (-1.17z)| lr 6.00e-04 | 322.74 ms | 52.3% bf16 MFU | 1624549 tok/s step 990/19560 | loss 4.301621 (-1.81z)| norm 0.4308 (-0.81z)| lr 6.00e-04 | 322.81 ms | 52.3% bf16 MFU | 1624528 tok/s step 991/19560 | loss 4.359831 (-0.96z)| norm 0.4747 (-0.33z)| lr 6.00e-04 | 323.26 ms | 52.2% bf16 MFU | 1624397 tok/s step 992/19560 | loss 4.295407 (-1.88z)| norm 0.5028 (-0.02z)| lr 6.00e-04 | 322.43 ms | 52.3% bf16 MFU | 1624480 tok/s step 993/19560 | loss 4.308767 (-1.65z)| norm 0.4657 (-0.41z)| lr 6.00e-04 | 322.99 ms | 52.3% bf16 MFU | 1624418 tok/s 
step 994/19560 | loss 4.314485 (-1.55z)| norm 0.4382 (-0.71z)| lr 6.00e-04 | 322.89 ms | 52.3% bf16 MFU | 1624384 tok/s step 995/19560 | loss 4.326447 (-1.37z)| norm 0.4220 (-0.88z)| lr 6.00e-04 | 322.65 ms | 52.3% bf16 MFU | 1624411 tok/s step 996/19560 | loss 4.343665 (-1.09z)| norm 0.4428 (-0.63z)| lr 6.00e-04 | 322.78 ms | 52.3% bf16 MFU | 1624406 tok/s step 997/19560 | loss 4.299064 (-1.72z)| norm 0.4405 (-0.65z)| lr 6.00e-04 | 322.60 ms | 52.3% bf16 MFU | 1624444 tok/s step 998/19560 | loss 4.366317 (-0.71z)| norm 0.4143 (-0.95z)| lr 6.00e-04 | 323.32 ms | 52.2% bf16 MFU | 1624300 tok/s step 999/19560 | loss 4.357787 (-0.83z)| norm 0.4375 (-0.69z)| lr 6.00e-04 | 322.89 ms | 52.3% bf16 MFU | 1624273 tok/s step 1000/19560 | loss 4.334036 (-1.16z)| norm 0.3980 (-1.15z)| lr 6.00e-04 | 322.57 ms | 52.3% bf16 MFU | 1624327 tok/s val loss 4.327087 HellaSwag: 2575/10042 = 0.256423 step 1001/19560 | loss 4.315507 (-1.41z)| norm 0.5049 (+0.07z)| lr 6.00e-04 | 322.35 ms | 52.4% bf16 MFU | 1624434 tok/s step 1002/19560 | loss 4.332686 (-1.15z)| norm 0.5759 (+0.88z)| lr 6.00e-04 | 322.62 ms | 52.3% bf16 MFU | 1624468 tok/s step 1003/19560 | loss 4.334732 (-1.10z)| norm 0.6145 (+1.31z)| lr 6.00e-04 | 322.55 ms | 52.3% bf16 MFU | 1624518 tok/s step 1004/19560 | loss 4.358151 (-0.75z)| norm 0.6296 (+1.45z)| lr 6.00e-04 | 322.76 ms | 52.3% bf16 MFU | 1624512 tok/s step 1005/19560 | loss 4.355356 (-0.77z)| norm 0.4870 (-0.17z)| lr 6.00e-04 | 322.70 ms | 52.3% bf16 MFU | 1624522 tok/s step 1006/19560 | loss 4.254376 (-2.22z)| norm 0.4357 (-0.74z)| lr 6.00e-04 | 323.18 ms | 52.2% bf16 MFU | 1624409 tok/s step 1007/19560 | loss 4.336263 (-1.03z)| norm 0.4272 (-0.82z)| lr 6.00e-04 | 322.67 ms | 52.3% bf16 MFU | 1624430 tok/s step 1008/19560 | loss 4.264089 
(-2.10z)| norm 0.4253 (-0.85z)| lr 6.00e-04 | 322.56 ms | 52.3% bf16 MFU | 1624479 tok/s step 1009/19560 | loss 4.317988 (-1.26z)| norm 0.4082 (-1.11z)| lr 6.00e-04 | 322.74 ms | 52.3% bf16 MFU | 1624479 tok/s step 1010/19560 | loss 4.364633 (-0.53z)| norm 0.3745 (-1.59z)| lr 6.00e-04 | 322.47 ms | 52.3% bf16 MFU | 1624546 tok/s step 1011/19560 | loss 4.310432 (-1.37z)| norm 0.3969 (-1.27z)| lr 6.00e-04 | 323.10 ms | 52.2% bf16 MFU | 1624454 tok/s step 1012/19560 | loss 4.334487 (-0.97z)| norm 0.4199 (-0.93z)| lr 6.00e-04 | 322.76 ms | 52.3% bf16 MFU | 1624451 tok/s step 1013/19560 | loss 4.319251 (-1.20z)| norm 0.4474 (-0.54z)| lr 6.00e-04 | 323.02 ms | 52.2% bf16 MFU | 1624382 tok/s step 1014/19560 | loss 4.316617 (-1.22z)| norm 0.4709 (-0.20z)| lr 6.00e-04 | 322.65 ms | 52.3% bf16 MFU | 1624411 tok/s step 1015/19560 | loss 4.281258 (-1.76z)| norm 0.4508 (-0.49z)| lr 6.00e-04 | 322.82 ms | 52.3% bf16 MFU | 1624394 tok/s step 1016/19560 | loss 4.321610 (-1.10z)| norm 0.4984 (+0.18z)| lr 6.00e-04 | 322.94 ms | 52.3% bf16 MFU | 1624349 tok/s step 1017/19560 | loss 4.318653 (-1.13z)| norm 0.5479 (+0.86z)| lr 6.00e-04 | 322.81 ms | 52.3% bf16 MFU | 1624337 tok/s step 1018/19560 | loss 4.262259 (-1.99z)| norm 0.4924 (+0.07z)| lr 6.00e-04 | 322.94 ms | 52.3% bf16 MFU | 1624296 tok/s step 1019/19560 | loss 4.287188 (-1.57z)| norm 0.4157 (-1.02z)| lr 6.00e-04 | 322.76 ms | 52.3% bf16 MFU | 1624300 tok/s step 1020/19560 | loss 4.266711 (-1.85z)| norm 0.4812 (-0.10z)| lr 6.00e-04 | 322.66 ms | 52.3% bf16 MFU | 1624329 tok/s step 1021/19560 | loss 4.342735 (-0.65z)| norm 0.4838 (-0.08z)| lr 6.00e-04 | 323.77 ms | 52.1% bf16 MFU | 1624079 tok/s step 1022/19560 | loss 4.333303 (-0.79z)| norm 0.4454 (-0.65z)| lr 6.00e-04 | 322.82 ms | 52.3% bf16 MFU | 1624080 tok/s step 1023/19560 | loss 4.331410 (-0.80z)| norm 0.4235 (-0.96z)| lr 6.00e-04 | 322.61 ms | 52.3% bf16 MFU | 1624133 tok/s step 1024/19560 | loss 4.325337 (-0.90z)| norm 0.4276 (-0.89z)| lr 6.00e-04 | 322.82 ms | 52.3% 
bf16 MFU | 1624131 tok/s step 1025/19560 | loss 4.341902 (-0.61z)| norm 0.4362 (-0.75z)| lr 6.00e-04 | 322.84 ms | 52.3% bf16 MFU | 1624124 tok/s step 1026/19560 | loss 4.336498 (-0.69z)| norm 0.4801 (-0.09z)| lr 6.00e-04 | 323.11 ms | 52.2% bf16 MFU | 1624048 tok/s step 1027/19560 | loss 4.346272 (-0.52z)| norm 0.4709 (-0.21z)| lr 6.00e-04 | 322.83 ms | 52.3% bf16 MFU | 1624047 tok/s step 1028/19560 | loss 4.263344 (-1.87z)| norm 0.3693 (-1.77z)| lr 6.00e-04 | 323.13 ms | 52.2% bf16 MFU | 1623972 tok/s step 1029/19560 | loss 4.227961 (-2.40z)| norm 0.4241 (-0.90z)| lr 6.00e-04 | 322.83 ms | 52.3% bf16 MFU | 1623976 tok/s step 1030/19560 | loss 4.328561 (-0.73z)| norm 0.4084 (-1.13z)| lr 6.00e-04 | 322.79 ms | 52.3% bf16 MFU | 1623989 tok/s step 1031/19560 | loss 4.285833 (-1.42z)| norm 0.4547 (-0.39z)| lr 6.00e-04 | 322.92 ms | 52.3% bf16 MFU | 1623970 tok/s step 1032/19560 | loss 4.272451 (-1.62z)| norm 0.4401 (-0.61z)| lr 6.00e-04 | 322.68 ms | 52.3% bf16 MFU | 1624011 tok/s step 1033/19560 | loss 4.352414 (-0.28z)| norm 0.3992 (-1.24z)| lr 6.00e-04 | 322.66 ms | 52.3% bf16 MFU | 1624056 tok/s step 1034/19560 | loss 4.298450 (-1.16z)| norm 0.3842 (-1.46z)| lr 6.00e-04 | 322.32 ms | 52.4% bf16 MFU | 1624183 tok/s step 1035/19560 | loss 4.294179 (-1.21z)| norm 0.4080 (-1.08z)| lr 6.00e-04 | 323.10 ms | 52.2% bf16 MFU | 1624109 tok/s step 1036/19560 | loss 4.210465 (-2.52z)| norm 0.4292 (-0.75z)| lr 6.00e-04 | 323.14 ms | 52.2% bf16 MFU | 1624029 tok/s step 1037/19560 | loss 4.293723 (-1.15z)| norm 0.4928 (+0.24z)| lr 6.00e-04 | 322.50 ms | 52.3% bf16 MFU | 1624112 tok/s step 1038/19560 | loss 4.262871 (-1.62z)| norm 0.4634 (-0.23z)| lr 6.00e-04 | 322.99 ms | 52.3% bf16 MFU | 1624070 tok/s step 1039/19560 | loss 4.237586 (-1.98z)| norm 0.4094 (-1.07z)| lr 6.00e-04 | 322.84 ms | 52.3% bf16 MFU | 1624064 tok/s step 1040/19560 | loss 4.250121 (-1.75z)| norm 0.4154 (-0.96z)| lr 6.00e-04 | 323.06 ms | 52.2% bf16 MFU | 1624006 tok/s step 1041/19560 | loss 4.346634 
(-0.22z)| norm 0.4455 (-0.48z)| lr 6.00e-04 | 323.38 ms | 52.2% bf16 MFU | 1623870 tok/s step 1042/19560 | loss 4.300263 (-0.94z)| norm 0.5610 (+1.31z)| lr 6.00e-04 | 322.73 ms | 52.3% bf16 MFU | 1623904 tok/s step 1043/19560 | loss 4.335783 (-0.36z)| norm 0.5796 (+1.57z)| lr 6.00e-04 | 323.46 ms | 52.2% bf16 MFU | 1623752 tok/s step 1044/19560 | loss 4.319640 (-0.61z)| norm 0.5098 (+0.49z)| lr 6.00e-04 | 322.89 ms | 52.3% bf16 MFU | 1623752 tok/s step 1045/19560 | loss 4.341962 (-0.25z)| norm 0.4671 (-0.16z)| lr 6.00e-04 | 323.07 ms | 52.2% bf16 MFU | 1623707 tok/s step 1046/19560 | loss 4.307075 (-0.79z)| norm 0.4322 (-0.69z)| lr 6.00e-04 | 322.36 ms | 52.4% bf16 MFU | 1623842 tok/s step 1047/19560 | loss 4.291709 (-1.02z)| norm 0.5139 (+0.56z)| lr 6.00e-04 | 324.04 ms | 52.1% bf16 MFU | 1623547 tok/s step 1048/19560 | loss 4.231095 (-1.96z)| norm 0.4769 (+0.00z)| lr 5.99e-04 | 322.93 ms | 52.3% bf16 MFU | 1623546 tok/s step 1049/19560 | loss 4.230667 (-1.92z)| norm 0.3938 (-1.26z)| lr 5.99e-04 | 322.55 ms | 52.3% bf16 MFU | 1623640 tok/s step 1050/19560 | loss 4.257368 (-1.48z)| norm 0.4180 (-0.87z)| lr 5.99e-04 | 323.80 ms | 52.1% bf16 MFU | 1623417 tok/s step 1051/19560 | loss 4.284750 (-1.03z)| norm 0.4304 (-0.67z)| lr 5.99e-04 | 323.09 ms | 52.2% bf16 MFU | 1623383 tok/s step 1052/19560 | loss 4.273417 (-1.19z)| norm 0.4154 (-0.89z)| lr 5.99e-04 | 322.73 ms | 52.3% bf16 MFU | 1623440 tok/s step 1053/19560 | loss 4.284092 (-1.02z)| norm 0.3919 (-1.24z)| lr 5.99e-04 | 323.13 ms | 52.2% bf16 MFU | 1623394 tok/s step 1054/19560 | loss 4.277356 (-1.11z)| norm 0.3778 (-1.43z)| lr 5.99e-04 | 323.51 ms | 52.2% bf16 MFU | 1623254 tok/s step 1055/19560 | loss 4.308005 (-0.60z)| norm 0.4021 (-1.05z)| lr 5.99e-04 | 323.12 ms | 52.2% bf16 MFU | 1623220 tok/s step 1056/19560 | loss 4.254036 (-1.46z)| norm 0.4492 (-0.32z)| lr 5.99e-04 | 322.76 ms | 52.3% bf16 MFU | 1623280 tok/s step 1057/19560 | loss 4.285842 (-0.96z)| norm 0.4369 (-0.50z)| lr 5.99e-04 | 323.57 ms | 52.2% 
bf16 MFU | 1623132 tok/s step 1058/19560 | loss 4.272183 (-1.17z)| norm 0.4513 (-0.26z)| lr 5.99e-04 | 322.90 ms | 52.3% bf16 MFU | 1623161 tok/s step 1059/19560 | loss 4.250378 (-1.53z)| norm 0.4449 (-0.35z)| lr 5.99e-04 | 322.25 ms | 52.4% bf16 MFU | 1623350 tok/s step 1060/19560 | loss 4.318236 (-0.34z)| norm 0.4776 (+0.22z)| lr 5.99e-04 | 322.99 ms | 52.3% bf16 MFU | 1623345 tok/s step 1061/19560 | loss 4.461792 (+2.14z)| norm 0.5335 (+1.15z)| lr 5.99e-04 | 322.58 ms | 52.3% bf16 MFU | 1623441 tok/s step 1062/19560 | loss 4.275778 (-1.07z)| norm 0.6488 (+2.96z)| lr 5.99e-04 | 322.49 ms | 52.3% bf16 MFU | 1623556 tok/s step 1063/19560 | loss 4.300944 (-0.62z)| norm 0.5553 (+1.41z)| lr 5.99e-04 | 323.07 ms | 52.2% bf16 MFU | 1623521 tok/s step 1064/19560 | loss 4.341844 (+0.11z)| norm 0.4979 (+0.47z)| lr 5.99e-04 | 322.97 ms | 52.3% bf16 MFU | 1623512 tok/s step 1065/19560 | loss 4.287944 (-0.83z)| norm 0.5022 (+0.54z)| lr 5.99e-04 | 322.55 ms | 52.3% bf16 MFU | 1623608 tok/s step 1066/19560 | loss 4.300617 (-0.59z)| norm 0.4693 (+0.02z)| lr 5.99e-04 | 323.20 ms | 52.2% bf16 MFU | 1623537 tok/s step 1067/19560 | loss 4.264319 (-1.23z)| norm 0.4487 (-0.31z)| lr 5.99e-04 | 323.02 ms | 52.2% bf16 MFU | 1623514 tok/s step 1068/19560 | loss 4.302049 (-0.54z)| norm 0.4347 (-0.53z)| lr 5.99e-04 | 323.32 ms | 52.2% bf16 MFU | 1623418 tok/s step 1069/19560 | loss 4.425255 (+1.69z)| norm 0.4280 (-0.63z)| lr 5.99e-04 | 323.03 ms | 52.2% bf16 MFU | 1623399 tok/s step 1070/19560 | loss 4.301131 (-0.54z)| norm 0.4736 (+0.15z)| lr 5.99e-04 | 322.76 ms | 52.3% bf16 MFU | 1623449 tok/s step 1071/19560 | loss 4.246578 (-1.52z)| norm 0.4479 (-0.28z)| lr 5.99e-04 | 323.25 ms | 52.2% bf16 MFU | 1623371 tok/s step 1072/19560 | loss 4.244065 (-1.56z)| norm 0.4306 (-0.57z)| lr 5.99e-04 | 323.30 ms | 52.2% bf16 MFU | 1623286 tok/s step 1073/19560 | loss 4.311873 (-0.28z)| norm 0.3979 (-1.10z)| lr 5.99e-04 | 322.61 ms | 52.3% bf16 MFU | 1623380 tok/s step 1074/19560 | loss 4.271628 
(-1.02z)| norm 0.4051 (-0.98z)| lr 5.99e-04 | 322.72 ms | 52.3% bf16 MFU | 1623441 tok/s step 1075/19560 | loss 4.265193 (-1.13z)| norm 0.3738 (-1.49z)| lr 5.99e-04 | 322.31 ms | 52.4% bf16 MFU | 1623601 tok/s step 1076/19560 | loss 4.273385 (-0.96z)| norm 0.3771 (-1.42z)| lr 5.99e-04 | 322.97 ms | 52.3% bf16 MFU | 1623588 tok/s step 1077/19560 | loss 4.254282 (-1.30z)| norm 0.3587 (-1.69z)| lr 5.99e-04 | 322.56 ms | 52.3% bf16 MFU | 1623678 tok/s step 1078/19560 | loss 4.305501 (-0.32z)| norm 0.3851 (-1.25z)| lr 5.99e-04 | 323.05 ms | 52.2% bf16 MFU | 1623641 tok/s step 1079/19560 | loss 4.242988 (-1.52z)| norm 0.4177 (-0.71z)| lr 5.99e-04 | 323.08 ms | 52.2% bf16 MFU | 1623598 tok/s step 1080/19560 | loss 4.306591 (-0.26z)| norm 0.4312 (-0.49z)| lr 5.99e-04 | 322.52 ms | 52.3% bf16 MFU | 1623697 tok/s step 1081/19560 | loss 4.261479 (-1.13z)| norm 0.4131 (-0.78z)| lr 5.99e-04 | 322.70 ms | 52.3% bf16 MFU | 1623746 tok/s step 1082/19560 | loss 4.298052 (-0.41z)| norm 0.3856 (-1.20z)| lr 5.99e-04 | 322.90 ms | 52.3% bf16 MFU | 1623743 tok/s step 1083/19560 | loss 4.323348 (+0.11z)| norm 0.4195 (-0.64z)| lr 5.99e-04 | 322.77 ms | 52.3% bf16 MFU | 1623771 tok/s step 1084/19560 | loss 4.303419 (-0.28z)| norm 0.4543 (-0.07z)| lr 5.99e-04 | 322.93 ms | 52.3% bf16 MFU | 1623759 tok/s step 1085/19560 | loss 4.267123 (-1.00z)| norm 0.5108 (+0.85z)| lr 5.99e-04 | 322.79 ms | 52.3% bf16 MFU | 1623784 tok/s step 1086/19560 | loss 4.254716 (-1.23z)| norm 0.5396 (+1.31z)| lr 5.99e-04 | 322.89 ms | 52.3% bf16 MFU | 1623780 tok/s step 1087/19560 | loss 4.268510 (-0.94z)| norm 0.5929 (+2.12z)| lr 5.99e-04 | 322.66 ms | 52.3% bf16 MFU | 1623835 tok/s step 1088/19560 | loss 4.273470 (-0.83z)| norm 0.4767 (+0.25z)| lr 5.99e-04 | 322.96 ms | 52.3% bf16 MFU | 1623812 tok/s step 1089/19560 | loss 4.303243 (-0.21z)| norm 0.4317 (-0.48z)| lr 5.99e-04 | 322.92 ms | 52.3% bf16 MFU | 1623800 tok/s step 1090/19560 | loss 4.215114 (-1.99z)| norm 0.4072 (-0.87z)| lr 5.99e-04 | 322.78 ms | 52.3% 
bf16 MFU | 1623825 tok/s step 1091/19560 | loss 4.316307 (+0.10z)| norm 0.4040 (-0.90z)| lr 5.99e-04 | 322.82 ms | 52.3% bf16 MFU | 1623838 tok/s step 1092/19560 | loss 4.308203 (-0.06z)| norm 0.3978 (-1.00z)| lr 5.99e-04 | 322.93 ms | 52.3% bf16 MFU | 1623822 tok/s step 1093/19560 | loss 4.262244 (-1.00z)| norm 0.3936 (-1.06z)| lr 5.99e-04 | 323.06 ms | 52.2% bf16 MFU | 1623775 tok/s step 1094/19560 | loss 4.310332 (+0.01z)| norm 0.3832 (-1.23z)| lr 5.99e-04 | 322.83 ms | 52.3% bf16 MFU | 1623788 tok/s step 1095/19560 | loss 4.252660 (-1.18z)| norm 0.3649 (-1.52z)| lr 5.99e-04 | 322.85 ms | 52.3% bf16 MFU | 1623796 tok/s step 1096/19560 | loss 4.225327 (-1.79z)| norm 0.3816 (-1.22z)| lr 5.99e-04 | 322.82 ms | 52.3% bf16 MFU | 1623811 tok/s step 1097/19560 | loss 4.180476 (-2.67z)| norm 0.3770 (-1.28z)| lr 5.99e-04 | 322.44 ms | 52.3% bf16 MFU | 1623920 tok/s step 1098/19560 | loss 4.259275 (-0.97z)| norm 0.3877 (-1.08z)| lr 5.99e-04 | 322.73 ms | 52.3% bf16 MFU | 1623952 tok/s step 1099/19560 | loss 4.285997 (-0.39z)| norm 0.4428 (-0.15z)| lr 5.99e-04 | 322.55 ms | 52.3% bf16 MFU | 1624027 tok/s step 1100/19560 | loss 4.211704 (-1.98z)| norm 0.4856 (+0.57z)| lr 5.99e-04 | 322.51 ms | 52.3% bf16 MFU | 1624107 tok/s step 1101/19560 | loss 4.267924 (-0.74z)| norm 0.5066 (+0.92z)| lr 5.99e-04 | 322.89 ms | 52.3% bf16 MFU | 1624089 tok/s step 1102/19560 | loss 4.250597 (-1.11z)| norm 0.5486 (+1.60z)| lr 5.99e-04 | 322.91 ms | 52.3% bf16 MFU | 1624067 tok/s step 1103/19560 | loss 4.256037 (-0.97z)| norm 0.5318 (+1.29z)| lr 5.99e-04 | 322.94 ms | 52.3% bf16 MFU | 1624038 tok/s step 1104/19560 | loss 4.249403 (-1.10z)| norm 0.4415 (-0.22z)| lr 5.99e-04 | 322.34 ms | 52.4% bf16 MFU | 1624161 tok/s step 1105/19560 | loss 4.289483 (-0.22z)| norm 0.4712 (+0.27z)| lr 5.99e-04 | 322.82 ms | 52.3% bf16 MFU | 1624156 tok/s step 1106/19560 | loss 4.276387 (-0.51z)| norm 0.4701 (+0.25z)| lr 5.99e-04 | 322.87 ms | 52.3% bf16 MFU | 1624139 tok/s step 1107/19560 | loss 4.407222 
(+2.29z)| norm 0.4354 (-0.33z)| lr 5.99e-04 | 322.12 ms | 52.4% bf16 MFU | 1624314 tok/s step 1108/19560 | loss 4.230106 (-1.49z)| norm 0.4870 (+0.55z)| lr 5.99e-04 | 323.43 ms | 52.2% bf16 MFU | 1624148 tok/s step 1109/19560 | loss 4.229639 (-1.48z)| norm 0.5265 (+1.23z)| lr 5.99e-04 | 322.70 ms | 52.3% bf16 MFU | 1624174 tok/s step 1110/19560 | loss 4.256308 (-0.90z)| norm 0.4725 (+0.33z)| lr 5.99e-04 | 322.61 ms | 52.3% bf16 MFU | 1624223 tok/s step 1111/19560 | loss 4.163010 (-2.78z)| norm 0.4820 (+0.49z)| lr 5.99e-04 | 322.86 ms | 52.3% bf16 MFU | 1624207 tok/s step 1112/19560 | loss 4.228174 (-1.44z)| norm 0.4962 (+0.73z)| lr 5.99e-04 | 322.87 ms | 52.3% bf16 MFU | 1624187 tok/s step 1113/19560 | loss 4.245527 (-1.05z)| norm 0.3770 (-1.31z)| lr 5.99e-04 | 322.65 ms | 52.3% bf16 MFU | 1624224 tok/s step 1114/19560 | loss 4.251103 (-0.92z)| norm 0.3307 (-2.04z)| lr 5.99e-04 | 322.64 ms | 52.3% bf16 MFU | 1624261 tok/s step 1115/19560 | loss 4.246608 (-1.00z)| norm 0.3441 (-1.80z)| lr 5.99e-04 | 322.47 ms | 52.3% bf16 MFU | 1624340 tok/s step 1116/19560 | loss 4.240746 (-1.11z)| norm 0.3235 (-2.11z)| lr 5.99e-04 | 322.72 ms | 52.3% bf16 MFU | 1624352 tok/s step 1117/19560 | loss 4.226913 (-1.39z)| norm 0.3296 (-1.97z)| lr 5.99e-04 | 322.39 ms | 52.4% bf16 MFU | 1624447 tok/s step 1118/19560 | loss 4.224158 (-1.43z)| norm 0.3120 (-2.20z)| lr 5.99e-04 | 323.09 ms | 52.2% bf16 MFU | 1624361 tok/s step 1119/19560 | loss 4.217641 (-1.54z)| norm 0.3155 (-2.09z)| lr 5.99e-04 | 322.75 ms | 52.3% bf16 MFU | 1624365 tok/s step 1120/19560 | loss 4.233857 (-1.18z)| norm 0.3431 (-1.62z)| lr 5.99e-04 | 322.64 ms | 52.3% bf16 MFU | 1624396 tok/s step 1121/19560 | loss 4.207540 (-1.71z)| norm 0.4048 (-0.63z)| lr 5.99e-04 | 323.19 ms | 52.2% bf16 MFU | 1624287 tok/s step 1122/19560 | loss 4.174316 (-2.34z)| norm 0.4092 (-0.55z)| lr 5.99e-04 | 322.26 ms | 52.4% bf16 MFU | 1624417 tok/s step 1123/19560 | loss 4.294195 (+0.16z)| norm 0.4419 (-0.04z)| lr 5.99e-04 | 323.07 ms | 52.2% 
bf16 MFU | 1624339 tok/s step 1124/19560 | loss 4.201818 (-1.73z)| norm 0.4715 (+0.43z)| lr 5.99e-04 | 322.42 ms | 52.3% bf16 MFU | 1624428 tok/s step 1125/19560 | loss 4.230196 (-1.13z)| norm 0.4194 (-0.39z)| lr 5.99e-04 | 322.65 ms | 52.3% bf16 MFU | 1624454 tok/s step 1126/19560 | loss 4.312085 (+0.57z)| norm 0.4336 (-0.17z)| lr 5.99e-04 | 322.98 ms | 52.3% bf16 MFU | 1624396 tok/s step 1127/19560 | loss 4.208316 (-1.56z)| norm 0.4685 (+0.38z)| lr 5.99e-04 | 322.63 ms | 52.3% bf16 MFU | 1624428 tok/s step 1128/19560 | loss 4.180327 (-2.09z)| norm 0.5129 (+1.07z)| lr 5.99e-04 | 322.74 ms | 52.3% bf16 MFU | 1624433 tok/s step 1129/19560 | loss 4.261365 (-0.42z)| norm 0.5181 (+1.15z)| lr 5.99e-04 | 322.70 ms | 52.3% bf16 MFU | 1624446 tok/s step 1130/19560 | loss 4.251108 (-0.62z)| norm 0.4000 (-0.71z)| lr 5.99e-04 | 322.21 ms | 52.4% bf16 MFU | 1624581 tok/s step 1131/19560 | loss 4.149376 (-2.62z)| norm 0.3458 (-1.58z)| lr 5.99e-04 | 322.95 ms | 52.3% bf16 MFU | 1624522 tok/s step 1132/19560 | loss 4.223258 (-1.12z)| norm 0.3727 (-1.14z)| lr 5.99e-04 | 323.04 ms | 52.2% bf16 MFU | 1624446 tok/s step 1133/19560 | loss 4.221910 (-1.13z)| norm 0.3783 (-1.03z)| lr 5.99e-04 | 323.12 ms | 52.2% bf16 MFU | 1624353 tok/s step 1134/19560 | loss 4.259343 (-0.37z)| norm 0.4631 (+0.40z)| lr 5.99e-04 | 322.91 ms | 52.3% bf16 MFU | 1624316 tok/s step 1135/19560 | loss 4.229073 (-0.97z)| norm 0.5217 (+1.37z)| lr 5.99e-04 | 322.75 ms | 52.3% bf16 MFU | 1624322 tok/s step 1136/19560 | loss 4.231712 (-0.91z)| norm 0.4970 (+0.94z)| lr 5.99e-04 | 322.16 ms | 52.4% bf16 MFU | 1624475 tok/s step 1137/19560 | loss 4.188697 (-1.75z)| norm 0.4085 (-0.54z)| lr 5.99e-04 | 322.59 ms | 52.3% bf16 MFU | 1624513 tok/s step 1138/19560 | loss 4.307250 (+0.65z)| norm 0.3533 (-1.45z)| lr 5.99e-04 | 322.99 ms | 52.3% bf16 MFU | 1624448 tok/s step 1139/19560 | loss 4.266775 (-0.16z)| norm 0.3798 (-1.01z)| lr 5.99e-04 | 322.73 ms | 52.3% bf16 MFU | 1624454 tok/s step 1140/19560 | loss 4.197397 
(-1.55z)| norm 0.3659 (-1.23z)| lr 5.99e-04 | 322.47 ms | 52.3% bf16 MFU | 1624524 tok/s step 1141/19560 | loss 4.206427 (-1.34z)| norm 0.3752 (-1.06z)| lr 5.99e-04 | 322.55 ms | 52.3% bf16 MFU | 1624569 tok/s step 1142/19560 | loss 4.160388 (-2.21z)| norm 0.3856 (-0.87z)| lr 5.99e-04 | 322.73 ms | 52.3% bf16 MFU | 1624568 tok/s step 1143/19560 | loss 4.164186 (-2.08z)| norm 0.4105 (-0.46z)| lr 5.99e-04 | 323.27 ms | 52.2% bf16 MFU | 1624431 tok/s step 1144/19560 | loss 4.187081 (-1.60z)| norm 0.4165 (-0.35z)| lr 5.99e-04 | 322.64 ms | 52.3% bf16 MFU | 1624460 tok/s step 1145/19560 | loss 4.190776 (-1.50z)| norm 0.4206 (-0.27z)| lr 5.99e-04 | 322.77 ms | 52.3% bf16 MFU | 1624454 tok/s step 1146/19560 | loss 4.249236 (-0.37z)| norm 0.4281 (-0.14z)| lr 5.99e-04 | 322.50 ms | 52.3% bf16 MFU | 1624518 tok/s step 1147/19560 | loss 4.394010 (+2.36z)| norm 0.4151 (-0.36z)| lr 5.99e-04 | 322.63 ms | 52.3% bf16 MFU | 1624545 tok/s step 1148/19560 | loss 4.235040 (-0.64z)| norm 0.3994 (-0.61z)| lr 5.99e-04 | 322.83 ms | 52.3% bf16 MFU | 1624520 tok/s step 1149/19560 | loss 4.156266 (-2.08z)| norm 0.4114 (-0.40z)| lr 5.99e-04 | 322.65 ms | 52.3% bf16 MFU | 1624542 tok/s step 1150/19560 | loss 4.146527 (-2.21z)| norm 0.3886 (-0.77z)| lr 5.99e-04 | 322.52 ms | 52.3% bf16 MFU | 1624596 tok/s step 1151/19560 | loss 4.257714 (-0.15z)| norm 0.4006 (-0.57z)| lr 5.99e-04 | 322.67 ms | 52.3% bf16 MFU | 1624607 tok/s step 1152/19560 | loss 4.190435 (-1.37z)| norm 0.4374 (+0.05z)| lr 5.99e-04 | 322.65 ms | 52.3% bf16 MFU | 1624623 tok/s step 1153/19560 | loss 4.185427 (-1.44z)| norm 0.4423 (+0.13z)| lr 5.99e-04 | 323.35 ms | 52.2% bf16 MFU | 1624462 tok/s step 1154/19560 | loss 4.216591 (-0.85z)| norm 0.3881 (-0.77z)| lr 5.99e-04 | 322.76 ms | 52.3% bf16 MFU | 1624458 tok/s step 1155/19560 | loss 4.192917 (-1.27z)| norm 0.4513 (+0.29z)| lr 5.99e-04 | 322.62 ms | 52.3% bf16 MFU | 1624490 tok/s step 1156/19560 | loss 4.238899 (-0.41z)| norm 0.3659 (-1.14z)| lr 5.99e-04 | 323.10 ms | 52.2% 
bf16 MFU | 1624400 tok/s step 1157/19560 | loss 4.170979 (-1.65z)| norm 0.3431 (-1.49z)| lr 5.99e-04 | 322.86 ms | 52.3% bf16 MFU | 1624374 tok/s step 1158/19560 | loss 4.281308 (+0.39z)| norm 0.3549 (-1.28z)| lr 5.99e-04 | 322.85 ms | 52.3% bf16 MFU | 1624353 tok/s step 1159/19560 | loss 4.189448 (-1.29z)| norm 0.3662 (-1.08z)| lr 5.99e-04 | 322.78 ms | 52.3% bf16 MFU | 1624349 tok/s step 1160/19560 | loss 4.308475 (+0.90z)| norm 0.4213 (-0.17z)| lr 5.99e-04 | 322.27 ms | 52.4% bf16 MFU | 1624474 tok/s step 1161/19560 | loss 4.204869 (-1.00z)| norm 0.4216 (-0.17z)| lr 5.99e-04 | 322.57 ms | 52.3% bf16 MFU | 1624516 tok/s step 1162/19560 | loss 4.204150 (-0.99z)| norm 0.4771 (+0.73z)| lr 5.99e-04 | 322.83 ms | 52.3% bf16 MFU | 1624492 tok/s step 1163/19560 | loss 4.286435 (+0.53z)| norm 0.5112 (+1.27z)| lr 5.99e-04 | 322.79 ms | 52.3% bf16 MFU | 1624479 tok/s step 1164/19560 | loss 4.167003 (-1.66z)| norm 0.4828 (+0.80z)| lr 5.99e-04 | 322.49 ms | 52.3% bf16 MFU | 1624542 tok/s step 1165/19560 | loss 4.211086 (-0.84z)| norm 0.4589 (+0.41z)| lr 5.99e-04 | 322.36 ms | 52.4% bf16 MFU | 1624636 tok/s step 1166/19560 | loss 4.246637 (-0.18z)| norm 0.4763 (+0.70z)| lr 5.99e-04 | 322.84 ms | 52.3% bf16 MFU | 1624603 tok/s step 1167/19560 | loss 4.237340 (-0.35z)| norm 0.4727 (+0.63z)| lr 5.99e-04 | 322.85 ms | 52.3% bf16 MFU | 1624570 tok/s step 1168/19560 | loss 4.207323 (-0.90z)| norm 0.4088 (-0.42z)| lr 5.99e-04 | 322.52 ms | 52.3% bf16 MFU | 1624623 tok/s step 1169/19560 | loss 4.159014 (-1.75z)| norm 0.4054 (-0.47z)| lr 5.99e-04 | 322.66 ms | 52.3% bf16 MFU | 1624636 tok/s step 1170/19560 | loss 4.247252 (-0.13z)| norm 0.4304 (-0.04z)| lr 5.99e-04 | 323.01 ms | 52.2% bf16 MFU | 1624559 tok/s step 1171/19560 | loss 4.221545 (-0.59z)| norm 0.4400 (+0.14z)| lr 5.99e-04 | 322.82 ms | 52.3% bf16 MFU | 1624535 tok/s step 1172/19560 | loss 4.189423 (-1.17z)| norm 0.4713 (+0.68z)| lr 5.99e-04 | 322.64 ms | 52.3% bf16 MFU | 1624558 tok/s step 1173/19560 | loss 4.216667 
(-0.65z)| norm 0.4143 (-0.29z)| lr 5.99e-04 | 323.32 ms | 52.2% bf16 MFU | 1624410 tok/s step 1174/19560 | loss 4.354784 (+1.90z)| norm 0.4666 (+0.60z)| lr 5.99e-04 | 322.68 ms | 52.3% bf16 MFU | 1624428 tok/s step 1175/19560 | loss 4.254039 (+0.04z)| norm 0.5576 (+2.13z)| lr 5.99e-04 | 322.68 ms | 52.3% bf16 MFU | 1624446 tok/s step 1176/19560 | loss 4.213186 (-0.71z)| norm 0.4859 (+0.91z)| lr 5.99e-04 | 322.54 ms | 52.3% bf16 MFU | 1624500 tok/s step 1177/19560 | loss 4.179469 (-1.31z)| norm 0.4027 (-0.49z)| lr 5.99e-04 | 323.03 ms | 52.2% bf16 MFU | 1624426 tok/s step 1178/19560 | loss 4.164054 (-1.57z)| norm 0.3798 (-0.87z)| lr 5.99e-04 | 322.55 ms | 52.3% bf16 MFU | 1624476 tok/s step 1179/19560 | loss 4.239907 (-0.18z)| norm 0.4072 (-0.41z)| lr 5.99e-04 | 322.50 ms | 52.3% bf16 MFU | 1624537 tok/s step 1180/19560 | loss 4.159409 (-1.62z)| norm 0.3927 (-0.65z)| lr 5.99e-04 | 322.48 ms | 52.3% bf16 MFU | 1624600 tok/s step 1181/19560 | loss 4.197785 (-0.91z)| norm 0.4038 (-0.46z)| lr 5.99e-04 | 323.17 ms | 52.2% bf16 MFU | 1624488 tok/s step 1182/19560 | loss 4.244324 (-0.07z)| norm 0.3581 (-1.23z)| lr 5.99e-04 | 322.55 ms | 52.3% bf16 MFU | 1624534 tok/s step 1183/19560 | loss 4.216019 (-0.57z)| norm 0.3500 (-1.35z)| lr 5.99e-04 | 322.69 ms | 52.3% bf16 MFU | 1624544 tok/s step 1184/19560 | loss 4.193949 (-0.95z)| norm 0.3448 (-1.41z)| lr 5.99e-04 | 322.35 ms | 52.4% bf16 MFU | 1624639 tok/s step 1185/19560 | loss 4.198360 (-0.86z)| norm 0.3800 (-0.82z)| lr 5.99e-04 | 322.76 ms | 52.3% bf16 MFU | 1624626 tok/s step 1186/19560 | loss 4.183338 (-1.12z)| norm 0.3955 (-0.55z)| lr 5.99e-04 | 322.64 ms | 52.3% bf16 MFU | 1624644 tok/s step 1187/19560 | loss 4.197066 (-0.86z)| norm 0.4053 (-0.39z)| lr 5.99e-04 | 322.67 ms | 52.3% bf16 MFU | 1624654 tok/s step 1188/19560 | loss 4.169842 (-1.32z)| norm 0.4409 (+0.21z)| lr 5.99e-04 | 322.90 ms | 52.3% bf16 MFU | 1624606 tok/s step 1189/19560 | loss 4.191823 (-0.95z)| norm 0.5074 (+1.32z)| lr 5.99e-04 | 322.59 ms | 52.3% 
bf16 MFU | 1624636 tok/s step 1190/19560 | loss 4.181055 (-1.14z)| norm 0.4292 (+0.05z)| lr 5.99e-04 | 322.26 ms | 52.4% bf16 MFU | 1624750 tok/s step 1191/19560 | loss 4.212391 (-0.53z)| norm 0.4028 (-0.40z)| lr 5.99e-04 | 322.49 ms | 52.3% bf16 MFU | 1624799 tok/s step 1192/19560 | loss 4.204677 (-0.67z)| norm 0.3985 (-0.47z)| lr 5.99e-04 | 322.93 ms | 52.3% bf16 MFU | 1624737 tok/s step 1193/19560 | loss 4.218983 (-0.38z)| norm 0.3900 (-0.61z)| lr 5.99e-04 | 322.86 ms | 52.3% bf16 MFU | 1624694 tok/s step 1194/19560 | loss 4.151640 (-1.65z)| norm 0.3780 (-0.82z)| lr 5.99e-04 | 322.66 ms | 52.3% bf16 MFU | 1624703 tok/s step 1195/19560 | loss 4.176857 (-1.15z)| norm 0.4004 (-0.40z)| lr 5.99e-04 | 322.85 ms | 52.3% bf16 MFU | 1624665 tok/s step 1196/19560 | loss 4.158461 (-1.47z)| norm 0.4460 (+0.42z)| lr 5.99e-04 | 323.16 ms | 52.2% bf16 MFU | 1624551 tok/s step 1197/19560 | loss 4.193521 (-0.81z)| norm 0.4861 (+1.14z)| lr 5.99e-04 | 322.95 ms | 52.3% bf16 MFU | 1624494 tok/s step 1198/19560 | loss 4.156698 (-1.53z)| norm 0.4028 (-0.36z)| lr 5.99e-04 | 322.76 ms | 52.3% bf16 MFU | 1624490 tok/s step 1199/19560 | loss 4.206083 (-0.53z)| norm 0.3915 (-0.55z)| lr 5.99e-04 | 322.73 ms | 52.3% bf16 MFU | 1624491 tok/s step 1200/19560 | loss 4.146915 (-1.68z)| norm 0.3655 (-1.01z)| lr 5.99e-04 | 322.52 ms | 52.3% bf16 MFU | 1624547 tok/s step 1201/19560 | loss 4.151252 (-1.57z)| norm 0.3462 (-1.34z)| lr 5.99e-04 | 322.63 ms | 52.3% bf16 MFU | 1624573 tok/s step 1202/19560 | loss 4.188396 (-0.82z)| norm 0.3154 (-1.86z)| lr 5.99e-04 | 323.26 ms | 52.2% bf16 MFU | 1624439 tok/s step 1203/19560 | loss 4.370740 (+2.69z)| norm 0.3537 (-1.18z)| lr 5.99e-04 | 323.57 ms | 52.2% bf16 MFU | 1624232 tok/s step 1204/19560 | loss 4.082326 (-2.75z)| norm 0.3673 (-0.93z)| lr 5.99e-04 | 322.05 ms | 52.4% bf16 MFU | 1624419 tok/s step 1205/19560 | loss 4.185234 (-0.81z)| norm 0.4121 (-0.15z)| lr 5.99e-04 | 323.04 ms | 52.2% bf16 MFU | 1624349 tok/s step 1206/19560 | loss 4.179572 
(-0.90z)| norm 0.4715 (+0.89z)| lr 5.99e-04 | 323.60 ms | 52.2% bf16 MFU | 1624139 tok/s step 1207/19560 | loss 4.119399 (-1.98z)| norm 0.4467 (+0.44z)| lr 5.99e-04 | 322.64 ms | 52.3% bf16 MFU | 1624181 tok/s step 1208/19560 | loss 4.150757 (-1.38z)| norm 0.4527 (+0.54z)| lr 5.99e-04 | 322.29 ms | 52.4% bf16 MFU | 1624310 tok/s step 1209/19560 | loss 4.132619 (-1.68z)| norm 0.4262 (+0.08z)| lr 5.99e-04 | 323.32 ms | 52.2% bf16 MFU | 1624174 tok/s step 1210/19560 | loss 4.202093 (-0.40z)| norm 0.3774 (-0.78z)| lr 5.99e-04 | 322.99 ms | 52.3% bf16 MFU | 1624127 tok/s step 1211/19560 | loss 4.176745 (-0.86z)| norm 0.3610 (-1.06z)| lr 5.99e-04 | 322.67 ms | 52.3% bf16 MFU | 1624163 tok/s step 1212/19560 | loss 4.076228 (-2.64z)| norm 0.3604 (-1.05z)| lr 5.99e-04 | 322.77 ms | 52.3% bf16 MFU | 1624173 tok/s step 1213/19560 | loss 4.153026 (-1.22z)| norm 0.3481 (-1.25z)| lr 5.99e-04 | 323.74 ms | 52.1% bf16 MFU | 1623937 tok/s step 1214/19560 | loss 4.123189 (-1.73z)| norm 0.3645 (-0.95z)| lr 5.99e-04 | 322.61 ms | 52.3% bf16 MFU | 1623998 tok/s step 1215/19560 | loss 4.152661 (-1.18z)| norm 0.3753 (-0.76z)| lr 5.99e-04 | 322.81 ms | 52.3% bf16 MFU | 1624006 tok/s step 1216/19560 | loss 4.151632 (-1.18z)| norm 0.3810 (-0.64z)| lr 5.99e-04 | 322.97 ms | 52.3% bf16 MFU | 1623972 tok/s step 1217/19560 | loss 4.126484 (-1.60z)| norm 0.4298 (+0.26z)| lr 5.99e-04 | 323.57 ms | 52.2% bf16 MFU | 1623790 tok/s step 1218/19560 | loss 4.173442 (-0.75z)| norm 0.5062 (+1.65z)| lr 5.99e-04 | 323.09 ms | 52.2% bf16 MFU | 1623737 tok/s step 1219/19560 | loss 4.129362 (-1.52z)| norm 0.4997 (+1.50z)| lr 5.99e-04 | 322.83 ms | 52.3% bf16 MFU | 1623751 tok/s step 1220/19560 | loss 4.154597 (-1.05z)| norm 0.4004 (-0.30z)| lr 5.99e-04 | 323.08 ms | 52.2% bf16 MFU | 1623704 tok/s step 1221/19560 | loss 4.100323 (-1.99z)| norm 0.3834 (-0.61z)| lr 5.99e-04 | 322.89 ms | 52.3% bf16 MFU | 1623704 tok/s step 1222/19560 | loss 4.210634 (-0.00z)| norm 0.3712 (-0.83z)| lr 5.99e-04 | 322.80 ms | 52.3% 
bf16 MFU | 1623730 tok/s step 1223/19560 | loss 4.254615 (+0.79z)| norm 0.3962 (-0.38z)| lr 5.99e-04 | 322.89 ms | 52.3% bf16 MFU | 1623730 tok/s step 1224/19560 | loss 4.226801 (+0.29z)| norm 0.4601 (+0.77z)| lr 5.99e-04 | 323.15 ms | 52.2% bf16 MFU | 1623666 tok/s step 1225/19560 | loss 4.172290 (-0.69z)| norm 0.4747 (+1.02z)| lr 5.99e-04 | 323.22 ms | 52.2% bf16 MFU | 1623587 tok/s step 1226/19560 | loss 4.167961 (-0.76z)| norm 0.4690 (+0.90z)| lr 5.99e-04 | 323.27 ms | 52.2% bf16 MFU | 1623500 tok/s step 1227/19560 | loss 4.262054 (+0.95z)| norm 0.4267 (+0.14z)| lr 5.99e-04 | 322.52 ms | 52.3% bf16 MFU | 1623605 tok/s step 1228/19560 | loss 4.160992 (-0.88z)| norm 0.3813 (-0.67z)| lr 5.99e-04 | 322.84 ms | 52.3% bf16 MFU | 1623624 tok/s step 1229/19560 | loss 4.116798 (-1.65z)| norm 0.3538 (-1.16z)| lr 5.99e-04 | 323.23 ms | 52.2% bf16 MFU | 1623544 tok/s step 1230/19560 | loss 4.140265 (-1.20z)| norm 0.3383 (-1.43z)| lr 5.99e-04 | 322.85 ms | 52.3% bf16 MFU | 1623564 tok/s step 1231/19560 | loss 4.133938 (-1.29z)| norm 0.3151 (-1.84z)| lr 5.99e-04 | 323.32 ms | 52.2% bf16 MFU | 1623464 tok/s step 1232/19560 | loss 4.323032 (+2.04z)| norm 0.3224 (-1.67z)| lr 5.99e-04 | 322.41 ms | 52.3% bf16 MFU | 1623599 tok/s step 1233/19560 | loss 4.166764 (-0.70z)| norm 0.3389 (-1.34z)| lr 5.99e-04 | 322.99 ms | 52.3% bf16 MFU | 1623582 tok/s step 1234/19560 | loss 4.115427 (-1.58z)| norm 0.3846 (-0.49z)| lr 5.99e-04 | 323.34 ms | 52.2% bf16 MFU | 1623476 tok/s step 1235/19560 | loss 4.129014 (-1.36z)| norm 0.4297 (+0.35z)| lr 5.99e-04 | 323.19 ms | 52.2% bf16 MFU | 1623413 tok/s step 1236/19560 | loss 4.135087 (-1.23z)| norm 0.4543 (+0.81z)| lr 5.99e-04 | 323.22 ms | 52.2% bf16 MFU | 1623346 tok/s step 1237/19560 | loss 4.187058 (-0.27z)| norm 0.3752 (-0.65z)| lr 5.99e-04 | 322.20 ms | 52.4% bf16 MFU | 1623538 tok/s step 1238/19560 | loss 4.170949 (-0.55z)| norm 0.3508 (-1.10z)| lr 5.99e-04 | 322.56 ms | 52.3% bf16 MFU | 1623632 tok/s step 1239/19560 | loss 4.118270 
(-1.51z)| norm 0.3300 (-1.47z)| lr 5.99e-04 | 323.65 ms | 52.1% bf16 MFU | 1623448 tok/s step 1240/19560 | loss 4.161664 (-0.70z)| norm 0.3355 (-1.34z)| lr 5.99e-04 | 322.97 ms | 52.3% bf16 MFU | 1623443 tok/s step 1241/19560 | loss 4.134306 (-1.18z)| norm 0.3053 (-1.88z)| lr 5.99e-04 | 322.93 ms | 52.3% bf16 MFU | 1623448 tok/s step 1242/19560 | loss 4.110388 (-1.59z)| norm 0.3232 (-1.54z)| lr 5.99e-04 | 322.99 ms | 52.3% bf16 MFU | 1623438 tok/s step 1243/19560 | loss 4.183714 (-0.25z)| norm 0.3611 (-0.84z)| lr 5.99e-04 | 322.96 ms | 52.3% bf16 MFU | 1623435 tok/s step 1244/19560 | loss 4.148710 (-0.87z)| norm 0.3617 (-0.84z)| lr 5.99e-04 | 322.81 ms | 52.3% bf16 MFU | 1623471 tok/s step 1245/19560 | loss 4.092630 (-1.85z)| norm 0.3575 (-0.93z)| lr 5.99e-04 | 323.26 ms | 52.2% bf16 MFU | 1623390 tok/s step 1246/19560 | loss 4.123858 (-1.27z)| norm 0.4315 (+0.47z)| lr 5.99e-04 | 322.94 ms | 52.3% bf16 MFU | 1623395 tok/s step 1247/19560 | loss 4.141810 (-0.94z)| norm 0.4624 (+1.05z)| lr 5.99e-04 | 322.84 ms | 52.3% bf16 MFU | 1623426 tok/s step 1248/19560 | loss 4.124043 (-1.23z)| norm 0.5388 (+2.46z)| lr 5.99e-04 | 322.59 ms | 52.3% bf16 MFU | 1623517 tok/s step 1249/19560 | loss 4.121623 (-1.26z)| norm 0.5292 (+2.21z)| lr 5.99e-04 | 322.56 ms | 52.3% bf16 MFU | 1623611 tok/s step 1250/19560 | loss 4.124686 (-1.19z)| norm 0.4793 (+1.26z)| lr 5.99e-04 | 322.83 ms | 52.3% bf16 MFU | 1623631 tok/s val loss 4.141542 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating HellaSwag: 70/79 
HellaSwag: 2595/10042 = 0.258415 step 1251/19560 | loss 4.167761 (-0.42z)| norm 0.4587 (+0.87z)| lr 5.99e-04 | 321.67 ms | 52.5% bf16 MFU | 1623945 tok/s step 1252/19560 | loss 4.118954 (-1.27z)| norm 0.4477 (+0.68z)| lr 5.99e-04 | 323.16 ms | 52.2% bf16 MFU | 1623866 tok/s step 1253/19560 | loss 4.119660 (-1.24z)| norm 0.3975 (-0.25z)| lr 5.99e-04 | 324.18 ms | 52.1% bf16 MFU | 1623536 tok/s step 1254/19560 | loss 4.112191 (-1.35z)| norm 0.3505 (-1.11z)| lr 5.99e-04 | 323.31 ms | 52.2% bf16 MFU | 1623441 tok/s step 1255/19560 | loss 4.105109 (-1.45z)| norm 0.4028 (-0.13z)| lr 5.99e-04 | 323.52 ms | 52.2% bf16 MFU | 1623296 tok/s step 1256/19560 | loss 4.075395 (-1.93z)| norm 0.3462 (-1.17z)| lr 5.99e-04 | 322.78 ms | 52.3% bf16 MFU | 1623346 tok/s step 1257/19560 | loss 4.106080 (-1.38z)| norm 0.3317 (-1.43z)| lr 5.99e-04 | 323.28 ms | 52.2% bf16 MFU | 1623268 tok/s step 1258/19560 | loss 4.114724 (-1.21z)| norm 0.3162 (-1.69z)| lr 5.99e-04 | 323.07 ms | 52.2% bf16 MFU | 1623247 tok/s step 1259/19560 | loss 4.124422 (-1.04z)| norm 0.3345 (-1.34z)| lr 5.99e-04 | 322.96 ms | 52.3% bf16 MFU | 1623255 tok/s step 1260/19560 | 
loss 4.155513 (-0.49z)| norm 0.3676 (-0.72z)| lr 5.99e-04 | 322.94 ms | 52.3% bf16 MFU | 1623266 tok/s step 1261/19560 | loss 4.165601 (-0.31z)| norm 0.3917 (-0.28z)| lr 5.99e-04 | 323.05 ms | 52.2% bf16 MFU | 1623250 tok/s step 1262/19560 | loss 4.090282 (-1.58z)| norm 0.3847 (-0.40z)| lr 5.99e-04 | 323.72 ms | 52.1% bf16 MFU | 1623067 tok/s step 1263/19560 | loss 4.119903 (-1.06z)| norm 0.3584 (-0.88z)| lr 5.99e-04 | 322.93 ms | 52.3% bf16 MFU | 1623091 tok/s step 1264/19560 | loss 4.151977 (-0.49z)| norm 0.3081 (-1.81z)| lr 5.99e-04 | 322.99 ms | 52.3% bf16 MFU | 1623097 tok/s step 1265/19560 | loss 4.190174 (+0.17z)| norm 0.3364 (-1.25z)| lr 5.99e-04 | 322.98 ms | 52.3% bf16 MFU | 1623108 tok/s step 1266/19560 | loss 4.214524 (+0.61z)| norm 0.4458 (+0.80z)| lr 5.99e-04 | 322.95 ms | 52.3% bf16 MFU | 1623123 tok/s step 1267/19560 | loss 4.137581 (-0.73z)| norm 0.4704 (+1.25z)| lr 5.99e-04 | 322.64 ms | 52.3% bf16 MFU | 1623216 tok/s step 1268/19560 | loss 4.109784 (-1.20z)| norm 0.4521 (+0.89z)| lr 5.99e-04 | 323.14 ms | 52.2% bf16 MFU | 1623179 tok/s step 1269/19560 | loss 4.184509 (+0.11z)| norm 0.5158 (+2.04z)| lr 5.99e-04 | 322.82 ms | 52.3% bf16 MFU | 1623224 tok/s step 1270/19560 | loss 4.108706 (-1.21z)| norm 0.4424 (+0.67z)| lr 5.99e-04 | 323.28 ms | 52.2% bf16 MFU | 1623150 tok/s step 1271/19560 | loss 4.090984 (-1.50z)| norm 0.4492 (+0.79z)| lr 5.99e-04 | 322.50 ms | 52.3% bf16 MFU | 1623279 tok/s step 1272/19560 | loss 4.077159 (-1.70z)| norm 0.3861 (-0.37z)| lr 5.99e-04 | 323.06 ms | 52.2% bf16 MFU | 1623258 tok/s step 1273/19560 | loss 4.078413 (-1.65z)| norm 0.3513 (-1.00z)| lr 5.99e-04 | 322.86 ms | 52.3% bf16 MFU | 1623290 tok/s step 1274/19560 | loss 4.066900 (-1.81z)| norm 0.3423 (-1.15z)| lr 5.99e-04 | 322.62 ms | 52.3% bf16 MFU | 1623380 tok/s step 1275/19560 | loss 4.076345 (-1.68z)| norm 0.3608 (-0.80z)| lr 5.99e-04 | 323.01 ms | 52.2% bf16 MFU | 1623367 tok/s step 1276/19560 | loss 4.089949 (-1.42z)| norm 0.4125 (+0.14z)| lr 5.99e-04 | 
322.96 ms | 52.3% bf16 MFU | 1623368 tok/s step 1277/19560 | loss 4.099973 (-1.22z)| norm 0.4324 (+0.50z)| lr 5.99e-04 | 323.07 ms | 52.2% bf16 MFU | 1623342 tok/s step 1278/19560 | loss 4.062058 (-1.85z)| norm 0.4132 (+0.15z)| lr 5.99e-04 | 322.83 ms | 52.3% bf16 MFU | 1623377 tok/s step 1279/19560 | loss 4.157190 (-0.20z)| norm 0.4424 (+0.67z)| lr 5.99e-04 | 322.71 ms | 52.3% bf16 MFU | 1623439 tok/s step 1280/19560 | loss 4.127111 (-0.71z)| norm 0.4409 (+0.64z)| lr 5.99e-04 | 322.89 ms | 52.3% bf16 MFU | 1623455 tok/s step 1281/19560 | loss 4.146239 (-0.37z)| norm 0.4010 (-0.08z)| lr 5.99e-04 | 322.64 ms | 52.3% bf16 MFU | 1623533 tok/s step 1282/19560 | loss 4.144344 (-0.40z)| norm 0.3913 (-0.25z)| lr 5.99e-04 | 322.98 ms | 52.3% bf16 MFU | 1623519 tok/s step 1283/19560 | loss 4.133632 (-0.57z)| norm 0.3562 (-0.88z)| lr 5.99e-04 | 323.03 ms | 52.2% bf16 MFU | 1623495 tok/s step 1284/19560 | loss 4.103918 (-1.08z)| norm 0.3510 (-0.97z)| lr 5.99e-04 | 322.66 ms | 52.3% bf16 MFU | 1623565 tok/s step 1285/19560 | loss 4.075294 (-1.55z)| norm 0.3543 (-0.91z)| lr 5.99e-04 | 322.80 ms | 52.3% bf16 MFU | 1623596 tok/s step 1286/19560 | loss 4.119056 (-0.78z)| norm 0.3443 (-1.09z)| lr 5.99e-04 | 322.46 ms | 52.3% bf16 MFU | 1623712 tok/s step 1287/19560 | loss 4.129856 (-0.58z)| norm 0.3679 (-0.66z)| lr 5.99e-04 | 323.06 ms | 52.2% bf16 MFU | 1623671 tok/s step 1288/19560 | loss 4.138741 (-0.41z)| norm 0.4137 (+0.17z)| lr 5.99e-04 | 322.84 ms | 52.3% bf16 MFU | 1623686 tok/s step 1289/19560 | loss 4.090109 (-1.27z)| norm 0.4167 (+0.23z)| lr 5.99e-04 | 322.86 ms | 52.3% bf16 MFU | 1623695 tok/s step 1290/19560 | loss 4.094909 (-1.16z)| norm 0.3594 (-0.80z)| lr 5.99e-04 | 322.72 ms | 52.3% bf16 MFU | 1623741 tok/s step 1291/19560 | loss 4.094496 (-1.16z)| norm 0.3439 (-1.08z)| lr 5.99e-04 | 322.72 ms | 52.3% bf16 MFU | 1623782 tok/s step 1292/19560 | loss 4.115676 (-0.77z)| norm 0.3359 (-1.21z)| lr 5.99e-04 | 322.97 ms | 52.3% bf16 MFU | 1623759 tok/s step 1293/19560 | 
loss 4.090409 (-1.21z)| norm 0.3191 (-1.49z)| lr 5.99e-04 | 322.49 ms | 52.3% bf16 MFU | 1623860 tok/s step 1294/19560 | loss 4.104644 (-0.94z)| norm 0.3570 (-0.78z)| lr 5.99e-04 | 322.60 ms | 52.3% bf16 MFU | 1623927 tok/s step 1295/19560 | loss 4.132895 (-0.41z)| norm 0.3929 (-0.10z)| lr 5.99e-04 | 323.17 ms | 52.2% bf16 MFU | 1623846 tok/s step 1296/19560 | loss 4.103261 (-0.94z)| norm 0.4353 (+0.69z)| lr 5.99e-04 | 322.65 ms | 52.3% bf16 MFU | 1623902 tok/s step 1297/19560 | loss 4.082335 (-1.30z)| norm 0.4010 (+0.05z)| lr 5.99e-04 | 322.98 ms | 52.3% bf16 MFU | 1623870 tok/s step 1298/19560 | loss 4.123477 (-0.54z)| norm 0.4824 (+1.55z)| lr 5.99e-04 | 322.84 ms | 52.3% bf16 MFU | 1623876 tok/s step 1299/19560 | loss 4.008973 (-2.57z)| norm 0.4999 (+1.84z)| lr 5.99e-04 | 322.56 ms | 52.3% bf16 MFU | 1623952 tok/s step 1300/19560 | loss 4.041705 (-1.93z)| norm 0.3914 (-0.13z)| lr 5.99e-04 | 322.88 ms | 52.3% bf16 MFU | 1623943 tok/s step 1301/19560 | loss 4.153322 (+0.07z)| norm 0.3732 (-0.46z)| lr 5.99e-04 | 322.76 ms | 52.3% bf16 MFU | 1623964 tok/s step 1302/19560 | loss 4.140512 (-0.14z)| norm 0.4270 (+0.54z)| lr 5.98e-04 | 323.48 ms | 52.2% bf16 MFU | 1623805 tok/s step 1303/19560 | loss 4.045087 (-1.92z)| norm 0.3682 (-0.54z)| lr 5.98e-04 | 322.56 ms | 52.3% bf16 MFU | 1623885 tok/s step 1304/19560 | loss 4.153109 (+0.14z)| norm 0.3322 (-1.22z)| lr 5.98e-04 | 322.63 ms | 52.3% bf16 MFU | 1623943 tok/s step 1305/19560 | loss 4.059356 (-1.62z)| norm 0.3242 (-1.35z)| lr 5.98e-04 | 322.81 ms | 52.3% bf16 MFU | 1623953 tok/s step 1306/19560 | loss 4.160428 (+0.30z)| norm 0.3477 (-0.89z)| lr 5.98e-04 | 322.86 ms | 52.3% bf16 MFU | 1623949 tok/s step 1307/19560 | loss 4.023816 (-2.25z)| norm 0.3568 (-0.71z)| lr 5.98e-04 | 322.76 ms | 52.3% bf16 MFU | 1623970 tok/s step 1308/19560 | loss 4.151257 (+0.16z)| norm 0.3477 (-0.88z)| lr 5.98e-04 | 323.12 ms | 52.2% bf16 MFU | 1623900 tok/s step 1309/19560 | loss 4.043431 (-1.84z)| norm 0.3621 (-0.59z)| lr 5.98e-04 | 
322.62 ms | 52.3% bf16 MFU | 1623959 tok/s step 1310/19560 | loss 4.118425 (-0.42z)| norm 0.3308 (-1.18z)| lr 5.98e-04 | 322.78 ms | 52.3% bf16 MFU | 1623976 tok/s step 1311/19560 | loss 4.113226 (-0.51z)| norm 0.3447 (-0.92z)| lr 5.98e-04 | 323.06 ms | 52.2% bf16 MFU | 1623922 tok/s step 1312/19560 | loss 4.087688 (-0.98z)| norm 0.3935 (-0.00z)| lr 5.98e-04 | 322.79 ms | 52.3% bf16 MFU | 1623937 tok/s step 1313/19560 | loss 4.141487 (+0.05z)| norm 0.4112 (+0.33z)| lr 5.98e-04 | 322.95 ms | 52.3% bf16 MFU | 1623911 tok/s step 1314/19560 | loss 4.116852 (-0.41z)| norm 0.3485 (-0.85z)| lr 5.98e-04 | 322.84 ms | 52.3% bf16 MFU | 1623916 tok/s step 1315/19560 | loss 4.046134 (-1.73z)| norm 0.3045 (-1.65z)| lr 5.98e-04 | 323.00 ms | 52.3% bf16 MFU | 1623880 tok/s step 1316/19560 | loss 4.070950 (-1.24z)| norm 0.3306 (-1.14z)| lr 5.98e-04 | 323.03 ms | 52.2% bf16 MFU | 1623838 tok/s step 1317/19560 | loss 4.052610 (-1.56z)| norm 0.3357 (-1.04z)| lr 5.98e-04 | 322.96 ms | 52.3% bf16 MFU | 1623814 tok/s step 1318/19560 | loss 4.061085 (-1.38z)| norm 0.3646 (-0.48z)| lr 5.98e-04 | 322.64 ms | 52.3% bf16 MFU | 1623874 tok/s step 1319/19560 | loss 4.111064 (-0.42z)| norm 0.3909 (+0.02z)| lr 5.98e-04 | 322.63 ms | 52.3% bf16 MFU | 1623932 tok/s step 1320/19560 | loss 4.139692 (+0.13z)| norm 0.4405 (+0.95z)| lr 5.98e-04 | 323.17 ms | 52.2% bf16 MFU | 1623852 tok/s step 1321/19560 | loss 4.136234 (+0.08z)| norm 0.4441 (+1.01z)| lr 5.98e-04 | 322.89 ms | 52.3% bf16 MFU | 1623845 tok/s step 1322/19560 | loss 4.084999 (-0.90z)| norm 0.4603 (+1.29z)| lr 5.98e-04 | 322.91 ms | 52.3% bf16 MFU | 1623834 tok/s step 1323/19560 | loss 4.118074 (-0.25z)| norm 0.4515 (+1.12z)| lr 5.98e-04 | 322.88 ms | 52.3% bf16 MFU | 1623831 tok/s step 1324/19560 | loss 4.192485 (+1.17z)| norm 0.4036 (+0.23z)| lr 5.98e-04 | 322.12 ms | 52.4% bf16 MFU | 1624020 tok/s step 1325/19560 | loss 4.107422 (-0.45z)| norm 0.3844 (-0.11z)| lr 5.98e-04 | 322.92 ms | 52.3% bf16 MFU | 1623997 tok/s step 1326/19560 | 
loss 4.092796 (-0.72z)| norm 0.3788 (-0.22z)| lr 5.98e-04 | 322.59 ms | 52.3% bf16 MFU | 1624060 tok/s step 1327/19560 | loss 4.093022 (-0.71z)| norm 0.3935 (+0.06z)| lr 5.98e-04 | 323.07 ms | 52.2% bf16 MFU | 1623999 tok/s step 1328/19560 | loss 4.181535 (+1.00z)| norm 0.3441 (-0.87z)| lr 5.98e-04 | 322.65 ms | 52.3% bf16 MFU | 1624047 tok/s step 1329/19560 | loss 4.076842 (-1.01z)| norm 0.3170 (-1.37z)| lr 5.98e-04 | 322.58 ms | 52.3% bf16 MFU | 1624110 tok/s step 1330/19560 | loss 4.204367 (+1.44z)| norm 0.3193 (-1.33z)| lr 5.98e-04 | 322.52 ms | 52.3% bf16 MFU | 1624185 tok/s step 1331/19560 | loss 4.084752 (-0.89z)| norm 0.3360 (-1.01z)| lr 5.98e-04 | 322.96 ms | 52.3% bf16 MFU | 1624145 tok/s step 1332/19560 | loss 4.051969 (-1.56z)| norm 0.3196 (-1.30z)| lr 5.98e-04 | 322.86 ms | 52.3% bf16 MFU | 1624131 tok/s step 1333/19560 | loss 4.088660 (-0.78z)| norm 0.2938 (-1.75z)| lr 5.98e-04 | 322.64 ms | 52.3% bf16 MFU | 1624175 tok/s step 1334/19560 | loss 4.167230 (+0.87z)| norm 0.2979 (-1.64z)| lr 5.98e-04 | 324.58 ms | 52.0% bf16 MFU | 1623731 tok/s step 1335/19560 | loss 4.110883 (-0.32z)| norm 0.3637 (-0.42z)| lr 5.98e-04 | 322.24 ms | 52.4% bf16 MFU | 1623894 tok/s step 1336/19560 | loss 4.129498 (+0.08z)| norm 0.4022 (+0.30z)| lr 5.98e-04 | 322.59 ms | 52.3% bf16 MFU | 1623962 tok/s step 1337/19560 | loss 4.120827 (-0.10z)| norm 0.3838 (-0.04z)| lr 5.98e-04 | 322.58 ms | 52.3% bf16 MFU | 1624028 tok/s step 1338/19560 | loss 4.054112 (-1.48z)| norm 0.4041 (+0.34z)| lr 5.98e-04 | 322.22 ms | 52.4% bf16 MFU | 1624182 tok/s step 1339/19560 | loss 4.232796 (+2.24z)| norm 0.4791 (+1.70z)| lr 5.98e-04 | 323.32 ms | 52.2% bf16 MFU | 1624051 tok/s step 1340/19560 | loss 4.150126 (+0.51z)| norm 0.4515 (+1.17z)| lr 5.98e-04 | 322.66 ms | 52.3% bf16 MFU | 1624092 tok/s step 1341/19560 | loss 4.093121 (-0.66z)| norm 0.3791 (-0.16z)| lr 5.98e-04 | 322.46 ms | 52.3% bf16 MFU | 1624183 tok/s step 1342/19560 | loss 4.145670 (+0.43z)| norm 0.3880 (-0.00z)| lr 5.98e-04 | 
323.28 ms | 52.2% bf16 MFU | 1624064 tok/s step 1343/19560 | loss 4.129520 (+0.09z)| norm 0.3465 (-0.76z)| lr 5.98e-04 | 322.79 ms | 52.3% bf16 MFU | 1624074 tok/s step 1344/19560 | loss 4.078375 (-0.96z)| norm 0.3154 (-1.31z)| lr 5.98e-04 | 322.94 ms | 52.3% bf16 MFU | 1624043 tok/s step 1345/19560 | loss 4.041599 (-1.69z)| norm 0.3466 (-0.73z)| lr 5.98e-04 | 322.46 ms | 52.3% bf16 MFU | 1624137 tok/s step 1346/19560 | loss 4.041125 (-1.66z)| norm 0.3604 (-0.47z)| lr 5.98e-04 | 323.21 ms | 52.2% bf16 MFU | 1624038 tok/s step 1347/19560 | loss 4.024932 (-1.95z)| norm 0.3829 (-0.03z)| lr 5.98e-04 | 322.83 ms | 52.3% bf16 MFU | 1624038 tok/s step 1348/19560 | loss 4.075709 (-0.91z)| norm 0.3752 (-0.17z)| lr 5.98e-04 | 322.38 ms | 52.4% bf16 MFU | 1624152 tok/s step 1349/19560 | loss 4.117517 (-0.08z)| norm 0.3306 (-1.01z)| lr 5.98e-04 | 322.71 ms | 52.3% bf16 MFU | 1624176 tok/s step 1350/19560 | loss 4.082644 (-0.77z)| norm 0.3231 (-1.14z)| lr 5.98e-04 | 322.90 ms | 52.3% bf16 MFU | 1624151 tok/s step 1351/19560 | loss 4.052288 (-1.38z)| norm 0.3147 (-1.27z)| lr 5.98e-04 | 322.64 ms | 52.3% bf16 MFU | 1624193 tok/s step 1352/19560 | loss 4.080681 (-0.78z)| norm 0.3262 (-1.04z)| lr 5.98e-04 | 323.14 ms | 52.2% bf16 MFU | 1624106 tok/s step 1353/19560 | loss 4.098711 (-0.39z)| norm 0.3673 (-0.26z)| lr 5.98e-04 | 322.84 ms | 52.3% bf16 MFU | 1624100 tok/s step 1354/19560 | loss 4.166344 (+1.04z)| norm 0.4942 (+2.12z)| lr 5.98e-04 | 322.71 ms | 52.3% bf16 MFU | 1624128 tok/s step 1355/19560 | loss 4.194375 (+1.69z)| norm 0.6202 (+4.16z)| lr 5.98e-04 | 322.93 ms | 52.3% bf16 MFU | 1624099 tok/s step 1356/19560 | loss 4.134597 (+0.40z)| norm 0.5152 (+2.26z)| lr 5.98e-04 | 323.11 ms | 52.2% bf16 MFU | 1624025 tok/s step 1357/19560 | loss 4.058218 (-1.25z)| norm 0.4679 (+1.42z)| lr 5.98e-04 | 323.13 ms | 52.2% bf16 MFU | 1623949 tok/s step 1358/19560 | loss 4.114557 (-0.03z)| norm 0.3947 (+0.16z)| lr 5.98e-04 | 322.25 ms | 52.4% bf16 MFU | 1624100 tok/s step 1359/19560 | 
loss 4.055529 (-1.29z)| norm 0.4255 (+0.68z)| lr 5.98e-04 | 322.71 ms | 52.3% bf16 MFU | 1624127 tok/s step 1360/19560 | loss 4.065694 (-1.11z)| norm 0.4300 (+0.74z)| lr 5.98e-04 | 323.00 ms | 52.3% bf16 MFU | 1624080 tok/s step 1361/19560 | loss 4.025440 (-2.01z)| norm 0.3630 (-0.41z)| lr 5.98e-04 | 322.69 ms | 52.3% bf16 MFU | 1624114 tok/s step 1362/19560 | loss 4.110214 (-0.04z)| norm 0.3514 (-0.61z)| lr 5.98e-04 | 322.72 ms | 52.3% bf16 MFU | 1624138 tok/s step 1363/19560 | loss 4.093547 (-0.42z)| norm 0.3261 (-1.03z)| lr 5.98e-04 | 323.04 ms | 52.2% bf16 MFU | 1624082 tok/s step 1364/19560 | loss 4.030174 (-1.85z)| norm 0.3229 (-1.07z)| lr 5.98e-04 | 323.06 ms | 52.2% bf16 MFU | 1624020 tok/s step 1365/19560 | loss 4.066306 (-1.01z)| norm 0.2807 (-1.76z)| lr 5.98e-04 | 322.75 ms | 52.3% bf16 MFU | 1624042 tok/s step 1366/19560 | loss 4.096484 (-0.30z)| norm 0.2627 (-2.02z)| lr 5.98e-04 | 322.56 ms | 52.3% bf16 MFU | 1624110 tok/s step 1367/19560 | loss 4.088656 (-0.48z)| norm 0.2834 (-1.66z)| lr 5.98e-04 | 322.79 ms | 52.3% bf16 MFU | 1624116 tok/s step 1368/19560 | loss 4.076690 (-0.74z)| norm 0.2850 (-1.61z)| lr 5.98e-04 | 322.74 ms | 52.3% bf16 MFU | 1624135 tok/s step 1369/19560 | loss 4.090102 (-0.42z)| norm 0.3470 (-0.60z)| lr 5.98e-04 | 322.62 ms | 52.3% bf16 MFU | 1624182 tok/s step 1370/19560 | loss 4.050742 (-1.32z)| norm 0.3974 (+0.23z)| lr 5.98e-04 | 322.86 ms | 52.3% bf16 MFU | 1624167 tok/s step 1371/19560 | loss 4.039556 (-1.56z)| norm 0.3369 (-0.77z)| lr 5.98e-04 | 323.04 ms | 52.2% bf16 MFU | 1624106 tok/s step 1372/19560 | loss 4.020722 (-1.95z)| norm 0.3300 (-0.88z)| lr 5.98e-04 | 322.77 ms | 52.3% bf16 MFU | 1624118 tok/s step 1373/19560 | loss 4.053628 (-1.18z)| norm 0.3144 (-1.13z)| lr 5.98e-04 | 322.87 ms | 52.3% bf16 MFU | 1624105 tok/s step 1374/19560 | loss 4.081979 (-0.52z)| norm 0.3489 (-0.55z)| lr 5.98e-04 | 322.25 ms | 52.4% bf16 MFU | 1624247 tok/s step 1375/19560 | loss 4.049324 (-1.25z)| norm 0.3638 (-0.29z)| lr 5.98e-04 | 
322.97 ms | 52.3% bf16 MFU | 1624203 tok/s step 1376/19560 | loss 4.049805 (-1.22z)| norm 0.3475 (-0.55z)| lr 5.98e-04 | 323.16 ms | 52.2% bf16 MFU | 1624112 tok/s step 1377/19560 | loss 4.036278 (-1.50z)| norm 0.3441 (-0.60z)| lr 5.98e-04 | 322.69 ms | 52.3% bf16 MFU | 1624142 tok/s step 1378/19560 | loss 4.055214 (-1.06z)| norm 0.3613 (-0.29z)| lr 5.98e-04 | 322.64 ms | 52.3% bf16 MFU | 1624185 tok/s step 1379/19560 | loss 4.144552 (+0.95z)| norm 0.3874 (+0.18z)| lr 5.98e-04 | 323.68 ms | 52.1% bf16 MFU | 1623966 tok/s step 1380/19560 | loss 4.039429 (-1.39z)| norm 0.3790 (+0.04z)| lr 5.98e-04 | 322.94 ms | 52.3% bf16 MFU | 1623942 tok/s step 1381/19560 | loss 4.011459 (-1.97z)| norm 0.4386 (+1.10z)| lr 5.98e-04 | 322.87 ms | 52.3% bf16 MFU | 1623936 tok/s step 1382/19560 | loss 4.121147 (+0.45z)| norm 0.3692 (-0.14z)| lr 5.98e-04 | 322.52 ms | 52.3% bf16 MFU | 1624020 tok/s step 1383/19560 | loss 3.972503 (-2.72z)| norm 0.3237 (-0.94z)| lr 5.98e-04 | 322.47 ms | 52.3% bf16 MFU | 1624113 tok/s step 1384/19560 | loss 4.084218 (-0.33z)| norm 0.3589 (-0.31z)| lr 5.98e-04 | 322.84 ms | 52.3% bf16 MFU | 1624106 tok/s step 1385/19560 | loss 4.048861 (-1.08z)| norm 0.3473 (-0.52z)| lr 5.98e-04 | 322.88 ms | 52.3% bf16 MFU | 1624089 tok/s step 1386/19560 | loss 4.030810 (-1.44z)| norm 0.3942 (+0.31z)| lr 5.98e-04 | 322.89 ms | 52.3% bf16 MFU | 1624071 tok/s step 1387/19560 | loss 4.126639 (+0.59z)| norm 0.4331 (+0.99z)| lr 5.98e-04 | 323.28 ms | 52.2% bf16 MFU | 1623957 tok/s step 1388/19560 | loss 4.097718 (-0.01z)| norm 0.4786 (+1.76z)| lr 5.98e-04 | 322.51 ms | 52.3% bf16 MFU | 1624042 tok/s step 1389/19560 | loss 4.198313 (+2.10z)| norm 0.4884 (+1.90z)| lr 5.98e-04 | 323.01 ms | 52.2% bf16 MFU | 1623996 tok/s step 1390/19560 | loss 4.086090 (-0.26z)| norm 0.4371 (+0.99z)| lr 5.98e-04 | 322.88 ms | 52.3% bf16 MFU | 1623985 tok/s step 1391/19560 | loss 4.140803 (+0.88z)| norm 0.3961 (+0.27z)| lr 5.98e-04 | 323.14 ms | 52.2% bf16 MFU | 1623910 tok/s step 1392/19560 | 
loss 4.092006 (-0.13z)| norm 0.3768 (-0.07z)| lr 5.98e-04 | 322.68 ms | 52.3% bf16 MFU | 1623955 tok/s step 1393/19560 | loss 4.000178 (-2.04z)| norm 0.3900 (+0.15z)| lr 5.98e-04 | 323.26 ms | 52.2% bf16 MFU | 1623850 tok/s step 1394/19560 | loss 4.012664 (-1.76z)| norm 0.3640 (-0.29z)| lr 5.98e-04 | 323.02 ms | 52.2% bf16 MFU | 1623811 tok/s step 1395/19560 | loss 4.069551 (-0.54z)| norm 0.3438 (-0.64z)| lr 5.98e-04 | 323.07 ms | 52.2% bf16 MFU | 1623762 tok/s step 1396/19560 | loss 4.064982 (-0.63z)| norm 0.3348 (-0.78z)| lr 5.98e-04 | 323.56 ms | 52.2% bf16 MFU | 1623593 tok/s step 1397/19560 | loss 4.069489 (-0.52z)| norm 0.3399 (-0.68z)| lr 5.98e-04 | 322.94 ms | 52.3% bf16 MFU | 1623587 tok/s step 1398/19560 | loss 4.035017 (-1.24z)| norm 0.3265 (-0.91z)| lr 5.98e-04 | 323.62 ms | 52.2% bf16 MFU | 1623412 tok/s step 1399/19560 | loss 4.104888 (+0.26z)| norm 0.3471 (-0.52z)| lr 5.98e-04 | 322.27 ms | 52.4% bf16 MFU | 1623586 tok/s step 1400/19560 | loss 4.065950 (-0.58z)| norm 0.3542 (-0.39z)| lr 5.98e-04 | 322.62 ms | 52.3% bf16 MFU | 1623661 tok/s step 1401/19560 | loss 4.116630 (+0.50z)| norm 0.3496 (-0.47z)| lr 5.98e-04 | 323.01 ms | 52.3% bf16 MFU | 1623635 tok/s step 1402/19560 | loss 4.074270 (-0.41z)| norm 0.3626 (-0.24z)| lr 5.98e-04 | 322.77 ms | 52.3% bf16 MFU | 1623671 tok/s step 1403/19560 | loss 4.112826 (+0.42z)| norm 0.3429 (-0.60z)| lr 5.98e-04 | 323.44 ms | 52.2% bf16 MFU | 1623535 tok/s step 1404/19560 | loss 4.049541 (-0.93z)| norm 0.3584 (-0.31z)| lr 5.98e-04 | 322.67 ms | 52.3% bf16 MFU | 1623601 tok/s step 1405/19560 | loss 4.082209 (-0.23z)| norm 0.3372 (-0.68z)| lr 5.98e-04 | 323.40 ms | 52.2% bf16 MFU | 1623479 tok/s step 1406/19560 | loss 4.017365 (-1.60z)| norm 0.3251 (-0.89z)| lr 5.98e-04 | 323.62 ms | 52.2% bf16 MFU | 1623310 tok/s step 1407/19560 | loss 4.034731 (-1.21z)| norm 0.3272 (-0.84z)| lr 5.98e-04 | 323.13 ms | 52.2% bf16 MFU | 1623270 tok/s step 1408/19560 | loss 4.031169 (-1.27z)| norm 0.3282 (-0.81z)| lr 5.98e-04 | 
322.89 ms | 52.3% bf16 MFU | 1623293 tok/s step 1409/19560 | loss 4.041810 (-1.03z)| norm 0.3275 (-0.81z)| lr 5.98e-04 | 322.94 ms | 52.3% bf16 MFU | 1623302 tok/s step 1410/19560 | loss 4.037273 (-1.10z)| norm 0.3499 (-0.39z)| lr 5.98e-04 | 323.49 ms | 52.2% bf16 MFU | 1623173 tok/s step 1411/19560 | loss 4.044871 (-0.93z)| norm 0.3925 (+0.40z)| lr 5.98e-04 | 322.80 ms | 52.3% bf16 MFU | 1623223 tok/s step 1412/19560 | loss 4.093290 (+0.10z)| norm 0.4013 (+0.55z)| lr 5.98e-04 | 322.93 ms | 52.3% bf16 MFU | 1623238 tok/s step 1413/19560 | loss 4.053787 (-0.73z)| norm 0.3894 (+0.32z)| lr 5.98e-04 | 324.32 ms | 52.0% bf16 MFU | 1622905 tok/s step 1414/19560 | loss 4.025778 (-1.31z)| norm 0.3633 (-0.16z)| lr 5.98e-04 | 322.74 ms | 52.3% bf16 MFU | 1622984 tok/s step 1415/19560 | loss 4.055253 (-0.67z)| norm 0.3640 (-0.15z)| lr 5.98e-04 | 323.33 ms | 52.2% bf16 MFU | 1622910 tok/s step 1416/19560 | loss 4.088930 (+0.05z)| norm 0.3820 (+0.19z)| lr 5.98e-04 | 323.03 ms | 52.2% bf16 MFU | 1622916 tok/s step 1417/19560 | loss 4.172957 (+1.79z)| norm 0.4069 (+0.66z)| lr 5.98e-04 | 322.49 ms | 52.3% bf16 MFU | 1623056 tok/s step 1418/19560 | loss 4.097576 (+0.21z)| norm 0.4509 (+1.45z)| lr 5.98e-04 | 322.88 ms | 52.3% bf16 MFU | 1623093 tok/s step 1419/19560 | loss 4.076322 (-0.23z)| norm 0.5171 (+2.58z)| lr 5.98e-04 | 322.92 ms | 52.3% bf16 MFU | 1623118 tok/s step 1420/19560 | loss 4.049673 (-0.77z)| norm 0.4778 (+1.83z)| lr 5.98e-04 | 323.10 ms | 52.2% bf16 MFU | 1623097 tok/s step 1421/19560 | loss 4.100173 (+0.28z)| norm 0.4188 (+0.77z)| lr 5.98e-04 | 322.89 ms | 52.3% bf16 MFU | 1623130 tok/s step 1422/19560 | loss 4.099229 (+0.26z)| norm 0.4073 (+0.56z)| lr 5.98e-04 | 322.83 ms | 52.3% bf16 MFU | 1623174 tok/s step 1423/19560 | loss 4.024128 (-1.29z)| norm 0.4275 (+0.91z)| lr 5.98e-04 | 322.29 ms | 52.4% bf16 MFU | 1623353 tok/s step 1424/19560 | loss 4.095939 (+0.21z)| norm 0.4045 (+0.51z)| lr 5.98e-04 | 323.05 ms | 52.2% bf16 MFU | 1623332 tok/s step 1425/19560 | 
loss 4.161427 (+1.55z)| norm 0.3384 (-0.66z)| lr 5.98e-04 | 323.07 ms | 52.2% bf16 MFU | 1623308 tok/s step 1426/19560 | loss 4.020894 (-1.33z)| norm 0.2972 (-1.38z)| lr 5.98e-04 | 323.05 ms | 52.2% bf16 MFU | 1623288 tok/s step 1427/19560 | loss 4.061159 (-0.52z)| norm 0.3158 (-1.03z)| lr 5.98e-04 | 322.42 ms | 52.3% bf16 MFU | 1623428 tok/s step 1428/19560 | loss 4.013459 (-1.49z)| norm 0.3193 (-0.96z)| lr 5.98e-04 | 322.63 ms | 52.3% bf16 MFU | 1623508 tok/s step 1429/19560 | loss 3.995669 (-1.83z)| norm 0.2979 (-1.33z)| lr 5.98e-04 | 323.46 ms | 52.2% bf16 MFU | 1623377 tok/s step 1430/19560 | loss 4.048530 (-0.73z)| norm 0.2955 (-1.35z)| lr 5.98e-04 | 322.80 ms | 52.3% bf16 MFU | 1623418 tok/s step 1431/19560 | loss 4.142686 (+1.19z)| norm 0.3280 (-0.76z)| lr 5.98e-04 | 322.90 ms | 52.3% bf16 MFU | 1623431 tok/s step 1432/19560 | loss 3.989061 (-1.92z)| norm 0.3451 (-0.45z)| lr 5.98e-04 | 323.38 ms | 52.2% bf16 MFU | 1623323 tok/s step 1433/19560 | loss 4.030345 (-1.07z)| norm 0.4008 (+0.54z)| lr 5.98e-04 | 323.30 ms | 52.2% bf16 MFU | 1623240 tok/s step 1434/19560 | loss 4.122947 (+0.82z)| norm 0.3422 (-0.51z)| lr 5.98e-04 | 322.85 ms | 52.3% bf16 MFU | 1623275 tok/s step 1435/19560 | loss 4.144711 (+1.25z)| norm 0.3605 (-0.18z)| lr 5.98e-04 | 322.85 ms | 52.3% bf16 MFU | 1623308 tok/s step 1436/19560 | loss 4.085480 (+0.05z)| norm 0.4550 (+1.49z)| lr 5.98e-04 | 322.81 ms | 52.3% bf16 MFU | 1623349 tok/s step 1437/19560 | loss 4.086898 (+0.07z)| norm 0.4354 (+1.12z)| lr 5.98e-04 | 323.23 ms | 52.2% bf16 MFU | 1623284 tok/s step 1438/19560 | loss 4.036263 (-0.96z)| norm 0.3667 (-0.10z)| lr 5.98e-04 | 323.13 ms | 52.2% bf16 MFU | 1623247 tok/s step 1439/19560 | loss 4.014958 (-1.38z)| norm 0.3274 (-0.80z)| lr 5.98e-04 | 323.05 ms | 52.2% bf16 MFU | 1623230 tok/s step 1440/19560 | loss 4.069127 (-0.26z)| norm 0.3423 (-0.53z)| lr 5.98e-04 | 322.34 ms | 52.4% bf16 MFU | 1623393 tok/s step 1441/19560 | loss 4.089343 (+0.16z)| norm 0.3433 (-0.50z)| lr 5.98e-04 | 
322.38 ms | 52.4% bf16 MFU | 1623540 tok/s step 1442/19560 | loss 4.026896 (-1.11z)| norm 0.3446 (-0.48z)| lr 5.98e-04 | 322.87 ms | 52.3% bf16 MFU | 1623554 tok/s step 1443/19560 | loss 4.005475 (-1.53z)| norm 0.3834 (+0.20z)| lr 5.98e-04 | 322.63 ms | 52.3% bf16 MFU | 1623629 tok/s step 1444/19560 | loss 4.056977 (-0.48z)| norm 0.3827 (+0.18z)| lr 5.98e-04 | 322.42 ms | 52.3% bf16 MFU | 1623752 tok/s step 1445/19560 | loss 3.998985 (-1.64z)| norm 0.4263 (+0.95z)| lr 5.98e-04 | 322.99 ms | 52.3% bf16 MFU | 1623726 tok/s step 1446/19560 | loss 4.069960 (-0.21z)| norm 0.3501 (-0.41z)| lr 5.98e-04 | 322.83 ms | 52.3% bf16 MFU | 1623741 tok/s step 1447/19560 | loss 4.022071 (-1.16z)| norm 0.3219 (-0.90z)| lr 5.98e-04 | 323.13 ms | 52.2% bf16 MFU | 1623680 tok/s step 1448/19560 | loss 4.057993 (-0.42z)| norm 0.3109 (-1.08z)| lr 5.98e-04 | 322.47 ms | 52.3% bf16 MFU | 1623788 tok/s step 1449/19560 | loss 4.038275 (-0.81z)| norm 0.3231 (-0.85z)| lr 5.98e-04 | 322.57 ms | 52.3% bf16 MFU | 1623867 tok/s step 1450/19560 | loss 4.122305 (+0.89z)| norm 0.3322 (-0.67z)| lr 5.98e-04 | 323.10 ms | 52.2% bf16 MFU | 1623807 tok/s step 1451/19560 | loss 4.086774 (+0.18z)| norm 0.3861 (+0.31z)| lr 5.98e-04 | 323.02 ms | 52.2% bf16 MFU | 1623770 tok/s step 1452/19560 | loss 4.056293 (-0.43z)| norm 0.4185 (+0.89z)| lr 5.98e-04 | 322.90 ms | 52.3% bf16 MFU | 1623765 tok/s step 1453/19560 | loss 4.079787 (+0.06z)| norm 0.3935 (+0.44z)| lr 5.98e-04 | 323.24 ms | 52.2% bf16 MFU | 1623675 tok/s step 1454/19560 | loss 4.055875 (-0.43z)| norm 0.3668 (-0.04z)| lr 5.98e-04 | 323.49 ms | 52.2% bf16 MFU | 1623528 tok/s step 1455/19560 | loss 4.110554 (+0.70z)| norm 0.3166 (-0.94z)| lr 5.98e-04 | 323.40 ms | 52.2% bf16 MFU | 1623411 tok/s step 1456/19560 | loss 4.008396 (-1.40z)| norm 0.2852 (-1.49z)| lr 5.98e-04 | 323.33 ms | 52.2% bf16 MFU | 1623317 tok/s step 1457/19560 | loss 3.987010 (-1.81z)| norm 0.3016 (-1.19z)| lr 5.98e-04 | 323.49 ms | 52.2% bf16 MFU | 1623187 tok/s step 1458/19560 | 
loss 4.071054 (-0.05z)| norm 0.3242 (-0.78z)| lr 5.98e-04 | 323.02 ms | 52.2% bf16 MFU | 1623183 tok/s step 1459/19560 | loss 4.056961 (-0.35z)| norm 0.3164 (-0.92z)| lr 5.98e-04 | 323.08 ms | 52.2% bf16 MFU | 1623162 tok/s step 1460/19560 | loss 4.102508 (+0.61z)| norm 0.3122 (-0.99z)| lr 5.98e-04 | 323.19 ms | 52.2% bf16 MFU | 1623115 tok/s step 1461/19560 | loss 4.049548 (-0.51z)| norm 0.3162 (-0.93z)| lr 5.98e-04 | 322.84 ms | 52.3% bf16 MFU | 1623159 tok/s step 1462/19560 | loss 4.086233 (+0.29z)| norm 0.3198 (-0.87z)| lr 5.98e-04 | 322.65 ms | 52.3% bf16 MFU | 1623247 tok/s step 1463/19560 | loss 4.070814 (-0.04z)| norm 0.3511 (-0.30z)| lr 5.98e-04 | 322.69 ms | 52.3% bf16 MFU | 1623323 tok/s step 1464/19560 | loss 4.051999 (-0.43z)| norm 0.3698 (+0.04z)| lr 5.98e-04 | 322.74 ms | 52.3% bf16 MFU | 1623381 tok/s step 1465/19560 | loss 4.029665 (-0.90z)| norm 0.3699 (+0.04z)| lr 5.98e-04 | 322.94 ms | 52.3% bf16 MFU | 1623386 tok/s step 1466/19560 | loss 4.006220 (-1.40z)| norm 0.3672 (-0.00z)| lr 5.98e-04 | 322.75 ms | 52.3% bf16 MFU | 1623438 tok/s step 1467/19560 | loss 4.065162 (-0.10z)| norm 0.3552 (-0.21z)| lr 5.98e-04 | 323.18 ms | 52.2% bf16 MFU | 1623381 tok/s step 1468/19560 | loss 4.017833 (-1.16z)| norm 0.3782 (+0.23z)| lr 5.98e-04 | 322.54 ms | 52.3% bf16 MFU | 1623486 tok/s step 1469/19560 | loss 4.015823 (-1.19z)| norm 0.3675 (+0.03z)| lr 5.98e-04 | 323.14 ms | 52.2% bf16 MFU | 1623436 tok/s step 1470/19560 | loss 4.057711 (-0.22z)| norm 0.3569 (-0.16z)| lr 5.98e-04 | 322.95 ms | 52.3% bf16 MFU | 1623436 tok/s step 1471/19560 | loss 4.067285 (+0.01z)| norm 0.3411 (-0.45z)| lr 5.98e-04 | 323.43 ms | 52.2% bf16 MFU | 1623317 tok/s step 1472/19560 | loss 3.983083 (-1.90z)| norm 0.3473 (-0.34z)| lr 5.98e-04 | 322.96 ms | 52.3% bf16 MFU | 1623319 tok/s step 1473/19560 | loss 3.933501 (-2.93z)| norm 0.3269 (-0.72z)| lr 5.98e-04 | 322.69 ms | 52.3% bf16 MFU | 1623389 tok/s step 1474/19560 | loss 4.079915 (+0.32z)| norm 0.3188 (-0.86z)| lr 5.98e-04 | 
322.54 ms | 52.3% bf16 MFU | 1623494 tok/s step 1475/19560 | loss 4.026858 (-0.86z)| norm 0.3192 (-0.84z)| lr 5.98e-04 | 322.45 ms | 52.3% bf16 MFU | 1623615 tok/s step 1476/19560 | loss 4.079443 (+0.31z)| norm 0.3454 (-0.35z)| lr 5.98e-04 | 323.03 ms | 52.2% bf16 MFU | 1623586 tok/s step 1477/19560 | loss 4.021117 (-0.97z)| norm 0.3780 (+0.24z)| lr 5.97e-04 | 322.73 ms | 52.3% bf16 MFU | 1623635 tok/s step 1478/19560 | loss 4.009720 (-1.21z)| norm 0.3820 (+0.31z)| lr 5.97e-04 | 322.43 ms | 52.3% bf16 MFU | 1623756 tok/s step 1479/19560 | loss 4.036293 (-0.61z)| norm 0.3748 (+0.17z)| lr 5.97e-04 | 322.71 ms | 52.3% bf16 MFU | 1623801 tok/s step 1480/19560 | loss 4.027168 (-0.80z)| norm 0.3744 (+0.15z)| lr 5.97e-04 | 323.15 ms | 52.2% bf16 MFU | 1623731 tok/s step 1481/19560 | loss 3.981533 (-1.77z)| norm 0.3479 (-0.34z)| lr 5.97e-04 | 322.84 ms | 52.3% bf16 MFU | 1623743 tok/s step 1482/19560 | loss 4.061494 (-0.01z)| norm 0.3821 (+0.32z)| lr 5.97e-04 | 323.20 ms | 52.2% bf16 MFU | 1623664 tok/s step 1483/19560 | loss 3.972362 (-2.00z)| norm 0.4244 (+1.27z)| lr 5.97e-04 | 322.72 ms | 52.3% bf16 MFU | 1623709 tok/s step 1484/19560 | loss 4.001307 (-1.32z)| norm 0.4823 (+2.53z)| lr 5.97e-04 | 322.31 ms | 52.4% bf16 MFU | 1623856 tok/s step 1485/19560 | loss 4.052230 (-0.16z)| norm 0.4281 (+1.40z)| lr 5.97e-04 | 322.59 ms | 52.3% bf16 MFU | 1623924 tok/s step 1486/19560 | loss 4.071679 (+0.30z)| norm 0.3311 (-0.68z)| lr 5.97e-04 | 322.84 ms | 52.3% bf16 MFU | 1623928 tok/s step 1487/19560 | loss 4.108179 (+1.12z)| norm 0.3507 (-0.24z)| lr 5.97e-04 | 322.58 ms | 52.3% bf16 MFU | 1623998 tok/s step 1488/19560 | loss 4.038028 (-0.48z)| norm 0.3453 (-0.35z)| lr 5.97e-04 | 323.11 ms | 52.2% bf16 MFU | 1623928 tok/s step 1489/19560 | loss 4.084297 (+0.57z)| norm 0.3554 (-0.13z)| lr 5.97e-04 | 322.58 ms | 52.3% bf16 MFU | 1623998 tok/s step 1490/19560 | loss 4.053097 (-0.13z)| norm 0.4125 (+1.10z)| lr 5.97e-04 | 322.89 ms | 52.3% bf16 MFU | 1623984 tok/s step 1491/19560 | 
loss 4.010375 (-1.10z)| norm 0.4080 (+0.99z)| lr 5.97e-04 | 322.79 ms | 52.3% bf16 MFU | 1623996 tok/s step 1492/19560 | loss 4.094645 (+0.82z)| norm 0.3581 (-0.10z)| lr 5.97e-04 | 322.31 ms | 52.4% bf16 MFU | 1624128 tok/s step 1493/19560 | loss 4.067115 (+0.19z)| norm 0.3631 (-0.00z)| lr 5.97e-04 | 322.86 ms | 52.3% bf16 MFU | 1624116 tok/s step 1494/19560 | loss 4.044972 (-0.31z)| norm 0.3130 (-1.13z)| lr 5.97e-04 | 322.62 ms | 52.3% bf16 MFU | 1624165 tok/s step 1495/19560 | loss 4.085721 (+0.63z)| norm 0.3105 (-1.20z)| lr 5.97e-04 | 322.57 ms | 52.3% bf16 MFU | 1624224 tok/s step 1496/19560 | loss 4.057797 (-0.01z)| norm 0.3263 (-0.86z)| lr 5.97e-04 | 322.78 ms | 52.3% bf16 MFU | 1624226 tok/s step 1497/19560 | loss 4.053786 (-0.09z)| norm 0.3339 (-0.69z)| lr 5.97e-04 | 322.78 ms | 52.3% bf16 MFU | 1624228 tok/s step 1498/19560 | loss 4.042284 (-0.36z)| norm 0.3315 (-0.73z)| lr 5.97e-04 | 322.73 ms | 52.3% bf16 MFU | 1624243 tok/s step 1499/19560 | loss 4.054980 (-0.07z)| norm 0.3127 (-1.15z)| lr 5.97e-04 | 322.64 ms | 52.3% bf16 MFU | 1624281 tok/s step 1500/19560 | loss 4.055475 (-0.06z)| norm 0.2910 (-1.62z)| lr 5.97e-04 | 322.79 ms | 52.3% bf16 MFU | 1624278 tok/s val loss 4.011433 HellaSwag: 2592/10042 = 0.258116 step 1501/19560 | loss 4.159521 (+2.28z)| norm 0.4034 (+0.89z)| lr 5.97e-04 | 322.09 ms | 52.4% bf16 MFU | 1624451 tok/s step 1502/19560 | loss 3.990705 (-1.52z)| norm 0.3291 (-0.77z)| lr 5.97e-04 | 323.34 ms | 52.2% bf16 MFU | 1624302 tok/s step 1503/19560 | loss 4.041358 (-0.38z)| norm 0.3859 (+0.50z)| lr 5.97e-04 | 323.17 ms | 52.2% bf16 MFU | 1624203 tok/s step 1504/19560 | loss 3.992959 (-1.45z)| norm 0.4343 (+1.55z)| lr 5.97e-04 | 323.87 ms | 52.1% bf16 MFU | 1623933 tok/s step 1505/19560 | loss 3.961343 (-2.10z)| norm 
0.5339 (+3.55z)| lr 5.97e-04 | 323.16 ms | 52.2% bf16 MFU | 1623856 tok/s step 1506/19560 | loss 4.059728 (+0.05z)| norm 0.5041 (+2.80z)| lr 5.97e-04 | 322.74 ms | 52.3% bf16 MFU | 1623887 tok/s step 1507/19560 | loss 4.048226 (-0.18z)| norm 0.4055 (+0.78z)| lr 5.97e-04 | 323.18 ms | 52.2% bf16 MFU | 1623806 tok/s step 1508/19560 | loss 4.049907 (-0.15z)| norm 0.4110 (+0.89z)| lr 5.97e-04 | 323.01 ms | 52.3% bf16 MFU | 1623774 tok/s step 1509/19560 | loss 4.042882 (-0.31z)| norm 0.3609 (-0.12z)| lr 5.97e-04 | 323.35 ms | 52.2% bf16 MFU | 1623658 tok/s step 1510/19560 | loss 3.994079 (-1.38z)| norm 0.3147 (-1.06z)| lr 5.97e-04 | 323.06 ms | 52.2% bf16 MFU | 1623619 tok/s step 1511/19560 | loss 3.996654 (-1.34z)| norm 0.3265 (-0.82z)| lr 5.97e-04 | 322.48 ms | 52.3% bf16 MFU | 1623729 tok/s step 1512/19560 | loss 4.015581 (-0.90z)| norm 0.3089 (-1.16z)| lr 5.97e-04 | 322.96 ms | 52.3% bf16 MFU | 1623711 tok/s step 1513/19560 | loss 4.018122 (-0.83z)| norm 0.2922 (-1.48z)| lr 5.97e-04 | 322.98 ms | 52.3% bf16 MFU | 1623690 tok/s step 1514/19560 | loss 4.085457 (+0.67z)| norm 0.2788 (-1.72z)| lr 5.97e-04 | 322.33 ms | 52.4% bf16 MFU | 1623834 tok/s step 1515/19560 | loss 3.973032 (-1.82z)| norm 0.3051 (-1.17z)| lr 5.97e-04 | 322.92 ms | 52.3% bf16 MFU | 1623821 tok/s step 1516/19560 | loss 4.023579 (-0.68z)| norm 0.3237 (-0.79z)| lr 5.97e-04 | 323.63 ms | 52.1% bf16 MFU | 1623630 tok/s step 1517/19560 | loss 4.125345 (+1.66z)| norm 0.3714 (+0.20z)| lr 5.97e-04 | 322.42 ms | 52.3% bf16 MFU | 1623755 tok/s step 1518/19560 | loss 4.001955 (-1.17z)| norm 0.3693 (+0.17z)| lr 5.97e-04 | 323.06 ms | 52.2% bf16 MFU | 1623712 tok/s step 1519/19560 | loss 4.035736 (-0.38z)| norm 0.3158 (-0.94z)| lr 5.97e-04 | 323.10 ms | 52.2% bf16 MFU | 1623660 tok/s step 1520/19560 | loss 3.977811 (-1.70z)| norm 0.3328 (-0.58z)| lr 5.97e-04 | 322.89 ms | 52.3% bf16 MFU | 1623664 tok/s step 1521/19560 | loss 3.966904 (-1.93z)| norm 0.3748 (+0.31z)| lr 5.97e-04 | 322.45 ms | 52.3% bf16 MFU | 
1623778 tok/s step 1522/19560 | loss 4.022779 (-0.65z)| norm 0.4269 (+1.39z)| lr 5.97e-04 | 322.75 ms | 52.3% bf16 MFU | 1623811 tok/s step 1523/19560 | loss 4.022706 (-0.64z)| norm 0.4721 (+2.27z)| lr 5.97e-04 | 322.63 ms | 52.3% bf16 MFU | 1623873 tok/s step 1524/19560 | loss 4.026923 (-0.53z)| norm 0.4967 (+2.68z)| lr 5.97e-04 | 322.29 ms | 52.4% bf16 MFU | 1624016 tok/s step 1525/19560 | loss 4.003592 (-1.05z)| norm 0.4110 (+0.95z)| lr 5.97e-04 | 323.23 ms | 52.2% bf16 MFU | 1623916 tok/s step 1526/19560 | loss 4.021257 (-0.65z)| norm 0.3451 (-0.37z)| lr 5.97e-04 | 322.93 ms | 52.3% bf16 MFU | 1623898 tok/s step 1527/19560 | loss 4.031202 (-0.41z)| norm 0.3179 (-0.90z)| lr 5.97e-04 | 322.78 ms | 52.3% bf16 MFU | 1623917 tok/s step 1528/19560 | loss 4.041647 (-0.16z)| norm 0.3257 (-0.74z)| lr 5.97e-04 | 322.88 ms | 52.3% bf16 MFU | 1623910 tok/s step 1529/19560 | loss 4.028530 (-0.45z)| norm 0.3403 (-0.45z)| lr 5.97e-04 | 323.10 ms | 52.2% bf16 MFU | 1623849 tok/s step 1530/19560 | loss 4.014847 (-0.76z)| norm 0.3172 (-0.90z)| lr 5.97e-04 | 323.17 ms | 52.2% bf16 MFU | 1623774 tok/s step 1531/19560 | loss 3.942752 (-2.37z)| norm 0.2906 (-1.40z)| lr 5.97e-04 | 323.04 ms | 52.2% bf16 MFU | 1623734 tok/s step 1532/19560 | loss 4.049927 (+0.08z)| norm 0.2939 (-1.32z)| lr 5.97e-04 | 322.85 ms | 52.3% bf16 MFU | 1623745 tok/s step 1533/19560 | loss 4.017544 (-0.65z)| norm 0.2962 (-1.26z)| lr 5.97e-04 | 322.61 ms | 52.3% bf16 MFU | 1623815 tok/s step 1534/19560 | loss 4.004326 (-0.95z)| norm 0.2933 (-1.31z)| lr 5.97e-04 | 322.40 ms | 52.3% bf16 MFU | 1623934 tok/s step 1535/19560 | loss 4.084575 (+0.88z)| norm 0.3083 (-1.01z)| lr 5.97e-04 | 322.63 ms | 52.3% bf16 MFU | 1623988 tok/s step 1536/19560 | loss 4.048764 (+0.06z)| norm 0.2864 (-1.42z)| lr 5.97e-04 | 322.69 ms | 52.3% bf16 MFU | 1624025 tok/s step 1537/19560 | loss 4.078271 (+0.72z)| norm 0.3809 (+0.38z)| lr 5.97e-04 | 322.70 ms | 52.3% bf16 MFU | 1624058 tok/s step 1538/19560 | loss 3.988653 (-1.30z)| norm 
0.3347 (-0.50z)| lr 5.97e-04 | 322.98 ms | 52.3% bf16 MFU | 1624018 tok/s step 1539/19560 | loss 4.074364 (+0.63z)| norm 0.3536 (-0.13z)| lr 5.97e-04 | 322.56 ms | 52.3% bf16 MFU | 1624086 tok/s step 1540/19560 | loss 4.009256 (-0.82z)| norm 0.3984 (+0.72z)| lr 5.97e-04 | 323.20 ms | 52.2% bf16 MFU | 1623991 tok/s step 1541/19560 | loss 4.024399 (-0.48z)| norm 0.4513 (+1.71z)| lr 5.97e-04 | 322.73 ms | 52.3% bf16 MFU | 1624019 tok/s step 1542/19560 | loss 4.016889 (-0.64z)| norm 0.4290 (+1.27z)| lr 5.97e-04 | 322.40 ms | 52.3% bf16 MFU | 1624128 tok/s step 1543/19560 | loss 4.030420 (-0.33z)| norm 0.3966 (+0.65z)| lr 5.97e-04 | 322.62 ms | 52.3% bf16 MFU | 1624177 tok/s step 1544/19560 | loss 4.065173 (+0.46z)| norm 0.4079 (+0.86z)| lr 5.97e-04 | 322.95 ms | 52.3% bf16 MFU | 1624139 tok/s step 1545/19560 | loss 4.024431 (-0.45z)| norm 0.3410 (-0.39z)| lr 5.97e-04 | 323.23 ms | 52.2% bf16 MFU | 1624033 tok/s step 1546/19560 | loss 3.985888 (-1.34z)| norm 0.3409 (-0.38z)| lr 5.97e-04 | 322.99 ms | 52.3% bf16 MFU | 1623992 tok/s step 1547/19560 | loss 3.944701 (-2.24z)| norm 0.3020 (-1.12z)| lr 5.97e-04 | 322.28 ms | 52.4% bf16 MFU | 1624133 tok/s step 1548/19560 | loss 4.028181 (-0.31z)| norm 0.3132 (-0.89z)| lr 5.97e-04 | 322.74 ms | 52.3% bf16 MFU | 1624152 tok/s step 1549/19560 | loss 3.957532 (-1.90z)| norm 0.3497 (-0.15z)| lr 5.97e-04 | 323.11 ms | 52.2% bf16 MFU | 1624076 tok/s step 1550/19560 | loss 4.085234 (+1.02z)| norm 0.3437 (-0.26z)| lr 5.97e-04 | 323.57 ms | 52.2% bf16 MFU | 1623888 tok/s step 1551/19560 | loss 4.008932 (-0.72z)| norm 0.3551 (-0.02z)| lr 5.97e-04 | 322.92 ms | 52.3% bf16 MFU | 1623872 tok/s step 1552/19560 | loss 4.042807 (+0.06z)| norm 0.4116 (+1.13z)| lr 5.97e-04 | 322.74 ms | 52.3% bf16 MFU | 1623904 tok/s step 1553/19560 | loss 4.068636 (+0.70z)| norm 0.3655 (+0.18z)| lr 5.97e-04 | 322.19 ms | 52.4% bf16 MFU | 1624072 tok/s step 1554/19560 | loss 4.005545 (-0.80z)| norm 0.3091 (-0.96z)| lr 5.97e-04 | 323.25 ms | 52.2% bf16 MFU | 
1623965 tok/s step 1555/19560 | loss 3.998467 (-0.95z)| norm 0.3148 (-0.85z)| lr 5.97e-04 | 322.84 ms | 52.3% bf16 MFU | 1623967 tok/s step 1556/19560 | loss 3.952214 (-2.00z)| norm 0.2859 (-1.42z)| lr 5.97e-04 | 322.85 ms | 52.3% bf16 MFU | 1623965 tok/s step 1557/19560 | loss 3.932813 (-2.39z)| norm 0.2675 (-1.78z)| lr 5.97e-04 | 322.54 ms | 52.3% bf16 MFU | 1624042 tok/s step 1558/19560 | loss 4.036528 (-0.02z)| norm 0.3168 (-0.79z)| lr 5.97e-04 | 322.68 ms | 52.3% bf16 MFU | 1624080 tok/s step 1559/19560 | loss 3.983883 (-1.22z)| norm 0.3149 (-0.83z)| lr 5.97e-04 | 322.91 ms | 52.3% bf16 MFU | 1624057 tok/s step 1560/19560 | loss 3.932744 (-2.35z)| norm 0.3162 (-0.80z)| lr 5.97e-04 | 323.59 ms | 52.2% bf16 MFU | 1623864 tok/s step 1561/19560 | loss 3.994828 (-0.93z)| norm 0.3051 (-1.00z)| lr 5.97e-04 | 322.58 ms | 52.3% bf16 MFU | 1623936 tok/s step 1562/19560 | loss 4.020146 (-0.34z)| norm 0.3052 (-0.99z)| lr 5.97e-04 | 322.71 ms | 52.3% bf16 MFU | 1623972 tok/s step 1563/19560 | loss 3.982113 (-1.21z)| norm 0.3182 (-0.72z)| lr 5.97e-04 | 323.50 ms | 52.2% bf16 MFU | 1623806 tok/s step 1564/19560 | loss 4.003666 (-0.69z)| norm 0.3073 (-0.93z)| lr 5.97e-04 | 322.89 ms | 52.3% bf16 MFU | 1623802 tok/s step 1565/19560 | loss 4.124558 (+2.13z)| norm 0.2763 (-1.53z)| lr 5.97e-04 | 322.77 ms | 52.3% bf16 MFU | 1623829 tok/s step 1566/19560 | loss 3.965366 (-1.56z)| norm 0.3157 (-0.73z)| lr 5.97e-04 | 322.80 ms | 52.3% bf16 MFU | 1623847 tok/s step 1567/19560 | loss 4.020566 (-0.28z)| norm 0.3424 (-0.19z)| lr 5.97e-04 | 322.85 ms | 52.3% bf16 MFU | 1623852 tok/s step 1568/19560 | loss 4.053872 (+0.49z)| norm 0.3425 (-0.19z)| lr 5.97e-04 | 323.14 ms | 52.2% bf16 MFU | 1623784 tok/s step 1569/19560 | loss 3.970419 (-1.42z)| norm 0.3323 (-0.39z)| lr 5.97e-04 | 322.52 ms | 52.3% bf16 MFU | 1623873 tok/s step 1570/19560 | loss 3.997555 (-0.78z)| norm 0.3288 (-0.46z)| lr 5.97e-04 | 322.66 ms | 52.3% bf16 MFU | 1623924 tok/s step 1571/19560 | loss 4.006727 (-0.57z)| norm 
0.3697 (+0.37z)| lr 5.97e-04 | 323.49 ms | 52.2% bf16 MFU | 1623764 tok/s step 1572/19560 | loss 4.033582 (+0.05z)| norm 0.4407 (+1.78z)| lr 5.97e-04 | 322.91 ms | 52.3% bf16 MFU | 1623758 tok/s step 1573/19560 | loss 3.952165 (-1.80z)| norm 0.4248 (+1.46z)| lr 5.97e-04 | 322.78 ms | 52.3% bf16 MFU | 1623786 tok/s step 1574/19560 | loss 4.069812 (+0.89z)| norm 0.3524 (+0.01z)| lr 5.97e-04 | 323.31 ms | 52.2% bf16 MFU | 1623678 tok/s step 1575/19560 | loss 4.009238 (-0.49z)| norm 0.3859 (+0.67z)| lr 5.97e-04 | 323.25 ms | 52.2% bf16 MFU | 1623590 tok/s step 1576/19560 | loss 3.968860 (-1.39z)| norm 0.3509 (-0.04z)| lr 5.97e-04 | 322.62 ms | 52.3% bf16 MFU | 1623664 tok/s step 1577/19560 | loss 4.010723 (-0.44z)| norm 0.2999 (-1.05z)| lr 5.97e-04 | 322.85 ms | 52.3% bf16 MFU | 1623677 tok/s step 1578/19560 | loss 3.984411 (-1.02z)| norm 0.3143 (-0.76z)| lr 5.97e-04 | 322.98 ms | 52.3% bf16 MFU | 1623658 tok/s step 1579/19560 | loss 3.948206 (-1.82z)| norm 0.3101 (-0.83z)| lr 5.97e-04 | 322.96 ms | 52.3% bf16 MFU | 1623645 tok/s step 1580/19560 | loss 4.013085 (-0.33z)| norm 0.3308 (-0.41z)| lr 5.97e-04 | 322.86 ms | 52.3% bf16 MFU | 1623657 tok/s step 1581/19560 | loss 3.995130 (-0.73z)| norm 0.3425 (-0.17z)| lr 5.97e-04 | 323.62 ms | 52.2% bf16 MFU | 1623479 tok/s step 1582/19560 | loss 3.936509 (-2.03z)| norm 0.3673 (+0.33z)| lr 5.97e-04 | 322.44 ms | 52.3% bf16 MFU | 1623604 tok/s step 1583/19560 | loss 4.022608 (-0.06z)| norm 0.3388 (-0.24z)| lr 5.97e-04 | 323.37 ms | 52.2% bf16 MFU | 1623492 tok/s step 1584/19560 | loss 3.897214 (-2.83z)| norm 0.3388 (-0.25z)| lr 5.97e-04 | 323.48 ms | 52.2% bf16 MFU | 1623355 tok/s step 1585/19560 | loss 4.025544 (+0.02z)| norm 0.3767 (+0.51z)| lr 5.97e-04 | 322.99 ms | 52.3% bf16 MFU | 1623350 tok/s step 1586/19560 | loss 3.974672 (-1.10z)| norm 0.3850 (+0.66z)| lr 5.97e-04 | 322.69 ms | 52.3% bf16 MFU | 1623419 tok/s step 1587/19560 | loss 3.973110 (-1.11z)| norm 0.3574 (+0.10z)| lr 5.97e-04 | 322.87 ms | 52.3% bf16 MFU | 
1623440 tok/s step 1588/19560 | loss 3.984173 (-0.86z)| norm 0.3848 (+0.64z)| lr 5.97e-04 | 323.02 ms | 52.2% bf16 MFU | 1623421 tok/s step 1589/19560 | loss 4.098431 (+1.69z)| norm 0.3483 (-0.11z)| lr 5.97e-04 | 323.76 ms | 52.1% bf16 MFU | 1623220 tok/s step 1590/19560 | loss 4.027652 (+0.12z)| norm 0.3505 (-0.07z)| lr 5.97e-04 | 322.89 ms | 52.3% bf16 MFU | 1623245 tok/s step 1591/19560 | loss 3.976610 (-1.01z)| norm 0.3291 (-0.50z)| lr 5.97e-04 | 322.77 ms | 52.3% bf16 MFU | 1623300 tok/s step 1592/19560 | loss 3.997116 (-0.54z)| norm 0.3109 (-0.86z)| lr 5.97e-04 | 322.74 ms | 52.3% bf16 MFU | 1623360 tok/s step 1593/19560 | loss 3.969239 (-1.15z)| norm 0.2914 (-1.24z)| lr 5.97e-04 | 322.86 ms | 52.3% bf16 MFU | 1623388 tok/s step 1594/19560 | loss 4.030815 (+0.23z)| norm 0.3115 (-0.82z)| lr 5.97e-04 | 323.13 ms | 52.2% bf16 MFU | 1623344 tok/s step 1595/19560 | loss 3.998285 (-0.49z)| norm 0.3084 (-0.88z)| lr 5.97e-04 | 323.13 ms | 52.2% bf16 MFU | 1623304 tok/s step 1596/19560 | loss 3.970103 (-1.11z)| norm 0.3005 (-1.02z)| lr 5.97e-04 | 323.39 ms | 52.2% bf16 MFU | 1623201 tok/s step 1597/19560 | loss 3.962514 (-1.26z)| norm 0.3020 (-0.98z)| lr 5.97e-04 | 322.53 ms | 52.3% bf16 MFU | 1623318 tok/s step 1598/19560 | loss 4.003498 (-0.34z)| norm 0.3245 (-0.52z)| lr 5.97e-04 | 322.83 ms | 52.3% bf16 MFU | 1623353 tok/s step 1599/19560 | loss 4.048811 (+0.67z)| norm 0.3091 (-0.82z)| lr 5.97e-04 | 322.95 ms | 52.3% bf16 MFU | 1623358 tok/s step 1600/19560 | loss 3.958730 (-1.33z)| norm 0.3082 (-0.83z)| lr 5.97e-04 | 323.28 ms | 52.2% bf16 MFU | 1623279 tok/s step 1601/19560 | loss 3.988135 (-0.70z)| norm 0.2961 (-1.06z)| lr 5.97e-04 | 322.80 ms | 52.3% bf16 MFU | 1623325 tok/s step 1602/19560 | loss 3.984780 (-0.76z)| norm 0.3078 (-0.83z)| lr 5.97e-04 | 322.77 ms | 52.3% bf16 MFU | 1623375 tok/s step 1603/19560 | loss 4.010561 (-0.17z)| norm 0.3218 (-0.55z)| lr 5.97e-04 | 323.75 ms | 52.1% bf16 MFU | 1623178 tok/s step 1604/19560 | loss 4.020858 (+0.07z)| norm 
0.3680 (+0.36z)| lr 5.97e-04 | 322.96 ms | 52.3% bf16 MFU | 1623190 tok/s step 1605/19560 | loss 3.999475 (-0.41z)| norm 0.3460 (-0.07z)| lr 5.97e-04 | 322.82 ms | 52.3% bf16 MFU | 1623236 tok/s step 1606/19560 | loss 3.953088 (-1.45z)| norm 0.3502 (+0.02z)| lr 5.97e-04 | 323.50 ms | 52.2% bf16 MFU | 1623107 tok/s step 1607/19560 | loss 4.024491 (+0.17z)| norm 0.3539 (+0.10z)| lr 5.97e-04 | 322.84 ms | 52.3% bf16 MFU | 1623150 tok/s step 1608/19560 | loss 3.951769 (-1.45z)| norm 0.3833 (+0.68z)| lr 5.97e-04 | 323.27 ms | 52.2% bf16 MFU | 1623084 tok/s step 1609/19560 | loss 4.013806 (-0.06z)| norm 0.3936 (+0.88z)| lr 5.97e-04 | 323.26 ms | 52.2% bf16 MFU | 1623023 tok/s step 1610/19560 | loss 3.975573 (-0.91z)| norm 0.3950 (+0.90z)| lr 5.97e-04 | 323.15 ms | 52.2% bf16 MFU | 1622993 tok/s step 1611/19560 | loss 4.029557 (+0.30z)| norm 0.3967 (+0.95z)| lr 5.97e-04 | 322.85 ms | 52.3% bf16 MFU | 1623040 tok/s step 1612/19560 | loss 4.010869 (-0.13z)| norm 0.3250 (-0.47z)| lr 5.97e-04 | 323.41 ms | 52.2% bf16 MFU | 1622944 tok/s step 1613/19560 | loss 3.943769 (-1.61z)| norm 0.2905 (-1.16z)| lr 5.97e-04 | 322.82 ms | 52.3% bf16 MFU | 1623001 tok/s step 1614/19560 | loss 3.991206 (-0.54z)| norm 0.3072 (-0.81z)| lr 5.97e-04 | 323.11 ms | 52.2% bf16 MFU | 1622982 tok/s step 1615/19560 | loss 3.957269 (-1.29z)| norm 0.2884 (-1.18z)| lr 5.97e-04 | 323.77 ms | 52.1% bf16 MFU | 1622800 tok/s step 1616/19560 | loss 4.023487 (+0.22z)| norm 0.2787 (-1.36z)| lr 5.97e-04 | 323.05 ms | 52.2% bf16 MFU | 1622807 tok/s step 1617/19560 | loss 3.986166 (-0.62z)| norm 0.3023 (-0.87z)| lr 5.97e-04 | 322.91 ms | 52.3% bf16 MFU | 1622847 tok/s step 1618/19560 | loss 4.073366 (+1.38z)| norm 0.3012 (-0.88z)| lr 5.97e-04 | 323.98 ms | 52.1% bf16 MFU | 1622619 tok/s step 1619/19560 | loss 3.983947 (-0.66z)| norm 0.3256 (-0.37z)| lr 5.96e-04 | 323.50 ms | 52.2% bf16 MFU | 1622521 tok/s step 1620/19560 | loss 3.931635 (-1.83z)| norm 0.3206 (-0.47z)| lr 5.96e-04 | 323.36 ms | 52.2% bf16 MFU | 
1622462 tok/s step 1621/19560 | loss 4.033907 (+0.52z)| norm 0.3491 (+0.12z)| lr 5.96e-04 | 323.66 ms | 52.1% bf16 MFU | 1622334 tok/s step 1622/19560 | loss 3.982190 (-0.66z)| norm 0.3873 (+0.88z)| lr 5.96e-04 | 323.16 ms | 52.2% bf16 MFU | 1622335 tok/s step 1623/19560 | loss 3.973523 (-0.85z)| norm 0.4696 (+2.48z)| lr 5.96e-04 | 323.16 ms | 52.2% bf16 MFU | 1622338 tok/s step 1624/19560 | loss 3.940369 (-1.59z)| norm 0.4894 (+2.76z)| lr 5.96e-04 | 322.05 ms | 52.4% bf16 MFU | 1622620 tok/s step 1625/19560 | loss 3.978081 (-0.70z)| norm 0.4086 (+1.18z)| lr 5.96e-04 | 322.97 ms | 52.3% bf16 MFU | 1622656 tok/s step 1626/19560 | loss 3.967330 (-0.94z)| norm 0.3592 (+0.23z)| lr 5.96e-04 | 323.21 ms | 52.2% bf16 MFU | 1622628 tok/s step 1627/19560 | loss 3.956390 (-1.17z)| norm 0.3250 (-0.43z)| lr 5.96e-04 | 322.56 ms | 52.3% bf16 MFU | 1622766 tok/s step 1628/19560 | loss 3.981700 (-0.58z)| norm 0.3358 (-0.23z)| lr 5.96e-04 | 322.39 ms | 52.4% bf16 MFU | 1622941 tok/s step 1629/19560 | loss 4.023356 (+0.43z)| norm 0.3289 (-0.36z)| lr 5.96e-04 | 322.38 ms | 52.4% bf16 MFU | 1623108 tok/s step 1630/19560 | loss 3.968607 (-0.89z)| norm 0.3100 (-0.72z)| lr 5.96e-04 | 322.53 ms | 52.3% bf16 MFU | 1623230 tok/s step 1631/19560 | loss 3.978902 (-0.63z)| norm 0.3035 (-0.83z)| lr 5.96e-04 | 323.04 ms | 52.2% bf16 MFU | 1623217 tok/s step 1632/19560 | loss 3.949059 (-1.34z)| norm 0.2993 (-0.90z)| lr 5.96e-04 | 323.19 ms | 52.2% bf16 MFU | 1623168 tok/s step 1633/19560 | loss 4.007390 (+0.06z)| norm 0.2798 (-1.31z)| lr 5.96e-04 | 322.90 ms | 52.3% bf16 MFU | 1623194 tok/s step 1634/19560 | loss 3.980777 (-0.57z)| norm 0.3227 (-0.41z)| lr 5.96e-04 | 322.30 ms | 52.4% bf16 MFU | 1623368 tok/s step 1635/19560 | loss 4.029692 (+0.63z)| norm 0.3390 (-0.05z)| lr 5.96e-04 | 323.18 ms | 52.2% bf16 MFU | 1623314 tok/s step 1636/19560 | loss 4.027854 (+0.59z)| norm 0.3384 (-0.05z)| lr 5.96e-04 | 322.47 ms | 52.3% bf16 MFU | 1623441 tok/s step 1637/19560 | loss 3.940619 (-1.53z)| norm 
0.3717 (+0.67z)| lr 5.96e-04 | 323.38 ms | 52.2% bf16 MFU | 1623332 tok/s step 1638/19560 | loss 4.008698 (+0.13z)| norm 0.4178 (+1.64z)| lr 5.96e-04 | 322.88 ms | 52.3% bf16 MFU | 1623354 tok/s step 1639/19560 | loss 3.901658 (-2.41z)| norm 0.4008 (+1.26z)| lr 5.96e-04 | 322.81 ms | 52.3% bf16 MFU | 1623394 tok/s step 1640/19560 | loss 3.983058 (-0.46z)| norm 0.3526 (+0.21z)| lr 5.96e-04 | 323.18 ms | 52.2% bf16 MFU | 1623338 tok/s step 1641/19560 | loss 4.200288 (+4.34z)| norm 0.3666 (+0.50z)| lr 5.96e-04 | 323.25 ms | 52.2% bf16 MFU | 1623266 tok/s step 1642/19560 | loss 3.935713 (-1.48z)| norm 0.3090 (-0.75z)| lr 5.96e-04 | 322.78 ms | 52.3% bf16 MFU | 1623319 tok/s step 1643/19560 | loss 4.013534 (+0.24z)| norm 0.3405 (-0.07z)| lr 5.96e-04 | 322.75 ms | 52.3% bf16 MFU | 1623375 tok/s step 1644/19560 | loss 3.966593 (-0.79z)| norm 0.3027 (-0.89z)| lr 5.96e-04 | 323.05 ms | 52.2% bf16 MFU | 1623353 tok/s step 1645/19560 | loss 4.058642 (+1.29z)| norm 0.3842 (+0.88z)| lr 5.96e-04 | 323.12 ms | 52.2% bf16 MFU | 1623316 tok/s step 1646/19560 | loss 3.960019 (-0.94z)| norm 0.3023 (-0.89z)| lr 5.96e-04 | 322.78 ms | 52.3% bf16 MFU | 1623364 tok/s step 1647/19560 | loss 3.986022 (-0.34z)| norm 0.3045 (-0.84z)| lr 5.96e-04 | 322.62 ms | 52.3% bf16 MFU | 1623450 tok/s step 1648/19560 | loss 3.977446 (-0.54z)| norm 0.3117 (-0.68z)| lr 5.96e-04 | 322.39 ms | 52.4% bf16 MFU | 1623591 tok/s step 1649/19560 | loss 4.036729 (+0.79z)| norm 0.3265 (-0.35z)| lr 5.96e-04 | 323.03 ms | 52.2% bf16 MFU | 1623564 tok/s step 1650/19560 | loss 4.039316 (+0.85z)| norm 0.3486 (+0.15z)| lr 5.96e-04 | 322.95 ms | 52.3% bf16 MFU | 1623556 tok/s step 1651/19560 | loss 3.995047 (-0.15z)| norm 0.4114 (+1.57z)| lr 5.96e-04 | 322.53 ms | 52.3% bf16 MFU | 1623657 tok/s step 1652/19560 | loss 3.965794 (-0.80z)| norm 0.4220 (+1.89z)| lr 5.96e-04 | 322.45 ms | 52.3% bf16 MFU | 1623772 tok/s step 1653/19560 | loss 3.975358 (-0.58z)| norm 0.4498 (+2.49z)| lr 5.96e-04 | 323.23 ms | 52.2% bf16 MFU | 
1623685 tok/s step 1654/19560 | loss 3.953948 (-1.04z)| norm 0.4460 (+2.33z)| lr 5.96e-04 | 322.28 ms | 52.4% bf16 MFU | 1623840 tok/s step 1655/19560 | loss 4.025266 (+0.56z)| norm 0.3799 (+0.84z)| lr 5.96e-04 | 323.06 ms | 52.2% bf16 MFU | 1623793 tok/s step 1656/19560 | loss 4.040297 (+0.90z)| norm 0.3859 (+0.96z)| lr 5.96e-04 | 322.20 ms | 52.4% bf16 MFU | 1623963 tok/s step 1657/19560 | loss 3.969197 (-0.69z)| norm 0.3242 (-0.41z)| lr 5.96e-04 | 323.31 ms | 52.2% bf16 MFU | 1623847 tok/s step 1658/19560 | loss 4.037359 (+0.84z)| norm 0.2946 (-1.07z)| lr 5.96e-04 | 323.19 ms | 52.2% bf16 MFU | 1623766 tok/s step 1659/19560 | loss 4.005958 (+0.12z)| norm 0.2933 (-1.10z)| lr 5.96e-04 | 322.76 ms | 52.3% bf16 MFU | 1623798 tok/s step 1660/19560 | loss 4.047435 (+1.06z)| norm 0.3147 (-0.63z)| lr 5.96e-04 | 322.68 ms | 52.3% bf16 MFU | 1623848 tok/s step 1661/19560 | loss 4.005736 (+0.12z)| norm 0.3496 (+0.14z)| lr 5.96e-04 | 322.52 ms | 52.3% bf16 MFU | 1623936 tok/s step 1662/19560 | loss 4.209295 (+4.33z)| norm 0.3669 (+0.52z)| lr 5.96e-04 | 322.66 ms | 52.3% bf16 MFU | 1623984 tok/s step 1663/19560 | loss 3.986740 (-0.31z)| norm 0.3797 (+0.80z)| lr 5.96e-04 | 322.33 ms | 52.4% bf16 MFU | 1624112 tok/s step 1664/19560 | loss 4.013573 (+0.27z)| norm 0.3281 (-0.38z)| lr 5.96e-04 | 322.44 ms | 52.3% bf16 MFU | 1624208 tok/s step 1665/19560 | loss 3.989362 (-0.23z)| norm 0.2948 (-1.11z)| lr 5.96e-04 | 322.77 ms | 52.3% bf16 MFU | 1624214 tok/s step 1666/19560 | loss 3.928823 (-1.51z)| norm 0.3292 (-0.33z)| lr 5.96e-04 | 323.21 ms | 52.2% bf16 MFU | 1624110 tok/s step 1667/19560 | loss 3.965026 (-0.73z)| norm 0.3149 (-0.65z)| lr 5.96e-04 | 322.55 ms | 52.3% bf16 MFU | 1624176 tok/s step 1668/19560 | loss 3.982849 (-0.34z)| norm 0.2976 (-1.02z)| lr 5.96e-04 | 322.33 ms | 52.4% bf16 MFU | 1624294 tok/s step 1669/19560 | loss 4.001373 (+0.06z)| norm 0.2872 (-1.25z)| lr 5.96e-04 | 323.04 ms | 52.2% bf16 MFU | 1624228 tok/s step 1670/19560 | loss 3.954597 (-0.93z)| norm 
0.3406 (-0.01z)| lr 5.96e-04 | 322.97 ms | 52.3% bf16 MFU | 1624184 tok/s step 1671/19560 | loss 3.995460 (-0.05z)| norm 0.3653 (+0.58z)| lr 5.96e-04 | 322.69 ms | 52.3% bf16 MFU | 1624213 tok/s step 1672/19560 | loss 3.946181 (-1.09z)| norm 0.3702 (+0.71z)| lr 5.96e-04 | 322.18 ms | 52.4% bf16 MFU | 1624369 tok/s step 1673/19560 | loss 3.939324 (-1.22z)| norm 0.3331 (-0.17z)| lr 5.96e-04 | 322.92 ms | 52.3% bf16 MFU | 1624330 tok/s step 1674/19560 | loss 3.945903 (-1.07z)| norm 0.3285 (-0.28z)| lr 5.96e-04 | 322.54 ms | 52.3% bf16 MFU | 1624388 tok/s step 1675/19560 | loss 4.016770 (+0.43z)| norm 0.3218 (-0.44z)| lr 5.96e-04 | 322.58 ms | 52.3% bf16 MFU | 1624433 tok/s step 1676/19560 | loss 3.982213 (-0.30z)| norm 0.3783 (+0.89z)| lr 5.96e-04 | 322.27 ms | 52.4% bf16 MFU | 1624554 tok/s step 1677/19560 | loss 3.935833 (-1.28z)| norm 0.3172 (-0.56z)| lr 5.96e-04 | 322.96 ms | 52.3% bf16 MFU | 1624496 tok/s step 1678/19560 | loss 3.907446 (-1.86z)| norm 0.3104 (-0.71z)| lr 5.96e-04 | 322.61 ms | 52.3% bf16 MFU | 1624528 tok/s step 1679/19560 | loss 3.887442 (-2.22z)| norm 0.3098 (-0.71z)| lr 5.96e-04 | 322.68 ms | 52.3% bf16 MFU | 1624542 tok/s step 1680/19560 | loss 4.061501 (+1.41z)| norm 0.2991 (-0.95z)| lr 5.96e-04 | 322.48 ms | 52.3% bf16 MFU | 1624605 tok/s step 1681/19560 | loss 3.983044 (-0.21z)| norm 0.3036 (-0.83z)| lr 5.96e-04 | 322.29 ms | 52.4% bf16 MFU | 1624712 tok/s step 1682/19560 | loss 3.982893 (-0.21z)| norm 0.3371 (-0.04z)| lr 5.96e-04 | 322.76 ms | 52.3% bf16 MFU | 1624696 tok/s step 1683/19560 | loss 3.904428 (-1.82z)| norm 0.3266 (-0.30z)| lr 5.96e-04 | 322.36 ms | 52.4% bf16 MFU | 1624782 tok/s step 1684/19560 | loss 3.947376 (-0.93z)| norm 0.3155 (-0.57z)| lr 5.96e-04 | 322.94 ms | 52.3% bf16 MFU | 1624716 tok/s step 1685/19560 | loss 3.989165 (-0.07z)| norm 0.2847 (-1.31z)| lr 5.96e-04 | 322.85 ms | 52.3% bf16 MFU | 1624678 tok/s step 1686/19560 | loss 4.164039 (+3.40z)| norm 0.3062 (-0.79z)| lr 5.96e-04 | 322.29 ms | 52.4% bf16 MFU | 
1624781 tok/s step 1687/19560 | loss 3.964754 (-0.57z)| norm 0.3215 (-0.43z)| lr 5.96e-04 | 322.69 ms | 52.3% bf16 MFU | 1624778 tok/s step 1688/19560 | loss 3.944545 (-0.98z)| norm 0.2993 (-0.96z)| lr 5.96e-04 | 322.75 ms | 52.3% bf16 MFU | 1624762 tok/s step 1689/19560 | loss 3.911718 (-1.61z)| norm 0.3027 (-0.87z)| lr 5.96e-04 | 322.98 ms | 52.3% bf16 MFU | 1624687 tok/s step 1690/19560 | loss 3.981247 (-0.22z)| norm 0.3025 (-0.88z)| lr 5.96e-04 | 322.61 ms | 52.3% bf16 MFU | 1624709 tok/s step 1691/19560 | loss 3.959027 (-0.66z)| norm 0.3044 (-0.83z)| lr 5.96e-04 | 322.40 ms | 52.3% bf16 MFU | 1624784 tok/s step 1692/19560 | loss 3.964952 (-0.54z)| norm 0.3057 (-0.80z)| lr 5.96e-04 | 322.73 ms | 52.3% bf16 MFU | 1624770 tok/s step 1693/19560 | loss 3.961044 (-0.60z)| norm 0.3389 (-0.02z)| lr 5.96e-04 | 322.63 ms | 52.3% bf16 MFU | 1624784 tok/s step 1694/19560 | loss 3.966521 (-0.49z)| norm 0.3542 (+0.35z)| lr 5.96e-04 | 322.57 ms | 52.3% bf16 MFU | 1624811 tok/s step 1695/19560 | loss 3.990681 (+0.00z)| norm 0.3797 (+0.95z)| lr 5.96e-04 | 322.74 ms | 52.3% bf16 MFU | 1624795 tok/s step 1696/19560 | loss 4.024958 (+0.71z)| norm 0.3754 (+0.84z)| lr 5.96e-04 | 322.91 ms | 52.3% bf16 MFU | 1624738 tok/s step 1697/19560 | loss 3.977563 (-0.26z)| norm 0.3555 (+0.36z)| lr 5.96e-04 | 322.43 ms | 52.3% bf16 MFU | 1624804 tok/s step 1698/19560 | loss 3.971622 (-0.38z)| norm 0.3278 (-0.31z)| lr 5.96e-04 | 322.87 ms | 52.3% bf16 MFU | 1624754 tok/s step 1699/19560 | loss 3.952085 (-0.77z)| norm 0.3207 (-0.47z)| lr 5.96e-04 | 322.87 ms | 52.3% bf16 MFU | 1624709 tok/s step 1700/19560 | loss 3.932538 (-1.15z)| norm 0.3110 (-0.69z)| lr 5.96e-04 | 322.59 ms | 52.3% bf16 MFU | 1624737 tok/s step 1701/19560 | loss 3.897227 (-1.84z)| norm 0.3154 (-0.57z)| lr 5.96e-04 | 322.69 ms | 52.3% bf16 MFU | 1624737 tok/s step 1702/19560 | loss 4.001628 (+0.28z)| norm 0.3179 (-0.50z)| lr 5.96e-04 | 322.88 ms | 52.3% bf16 MFU | 1624690 tok/s step 1703/19560 | loss 3.922822 (-1.30z)| norm 
0.3271 (-0.26z)| lr 5.96e-04 | 322.76 ms | 52.3% bf16 MFU | 1624673 tok/s step 1704/19560 | loss 3.975940 (-0.23z)| norm 0.3281 (-0.23z)| lr 5.96e-04 | 323.14 ms | 52.2% bf16 MFU | 1624562 tok/s step 1705/19560 | loss 3.962553 (-0.49z)| norm 0.3716 (+0.84z)| lr 5.96e-04 | 322.77 ms | 52.3% bf16 MFU | 1624552 tok/s step 1706/19560 | loss 3.956603 (-0.61z)| norm 0.3726 (+0.86z)| lr 5.96e-04 | 322.99 ms | 52.3% bf16 MFU | 1624487 tok/s step 1707/19560 | loss 3.948868 (-0.77z)| norm 0.3376 (-0.02z)| lr 5.96e-04 | 323.01 ms | 52.2% bf16 MFU | 1624418 tok/s step 1708/19560 | loss 3.971791 (-0.30z)| norm 0.3265 (-0.30z)| lr 5.96e-04 | 322.71 ms | 52.3% bf16 MFU | 1624429 tok/s step 1709/19560 | loss 3.948129 (-0.77z)| norm 0.3272 (-0.28z)| lr 5.96e-04 | 323.45 ms | 52.2% bf16 MFU | 1624253 tok/s step 1710/19560 | loss 3.974335 (-0.24z)| norm 0.3181 (-0.50z)| lr 5.96e-04 | 322.51 ms | 52.3% bf16 MFU | 1624322 tok/s step 1711/19560 | loss 4.030900 (+0.90z)| norm 0.3418 (+0.09z)| lr 5.96e-04 | 323.44 ms | 52.2% bf16 MFU | 1624154 tok/s step 1712/19560 | loss 3.956485 (-0.62z)| norm 0.3346 (-0.09z)| lr 5.96e-04 | 322.80 ms | 52.3% bf16 MFU | 1624155 tok/s step 1713/19560 | loss 3.945597 (-0.83z)| norm 0.3096 (-0.70z)| lr 5.96e-04 | 322.68 ms | 52.3% bf16 MFU | 1624186 tok/s step 1714/19560 | loss 3.966821 (-0.40z)| norm 0.3189 (-0.46z)| lr 5.96e-04 | 323.39 ms | 52.2% bf16 MFU | 1624038 tok/s step 1715/19560 | loss 3.933908 (-1.06z)| norm 0.2935 (-1.08z)| lr 5.96e-04 | 322.81 ms | 52.3% bf16 MFU | 1624042 tok/s step 1716/19560 | loss 3.923087 (-1.26z)| norm 0.2691 (-1.67z)| lr 5.96e-04 | 322.12 ms | 52.4% bf16 MFU | 1624221 tok/s step 1717/19560 | loss 3.919364 (-1.33z)| norm 0.2898 (-1.13z)| lr 5.96e-04 | 322.87 ms | 52.3% bf16 MFU | 1624202 tok/s step 1718/19560 | loss 3.939133 (-0.91z)| norm 0.2798 (-1.36z)| lr 5.96e-04 | 323.07 ms | 52.2% bf16 MFU | 1624132 tok/s step 1719/19560 | loss 3.951615 (-0.65z)| norm 0.3092 (-0.62z)| lr 5.96e-04 | 323.17 ms | 52.2% bf16 MFU | 
1624042 tok/s step 1720/19560 | loss 3.954625 (-0.58z)| norm 0.2833 (-1.25z)| lr 5.96e-04 | 322.79 ms | 52.3% bf16 MFU | 1624051 tok/s step 1721/19560 | loss 3.901308 (-1.64z)| norm 0.2874 (-1.15z)| lr 5.96e-04 | 323.37 ms | 52.2% bf16 MFU | 1623915 tok/s step 1722/19560 | loss 3.955902 (-0.52z)| norm 0.2998 (-0.84z)| lr 5.96e-04 | 322.29 ms | 52.4% bf16 MFU | 1624058 tok/s step 1723/19560 | loss 3.911971 (-1.39z)| norm 0.3074 (-0.66z)| lr 5.96e-04 | 322.48 ms | 52.3% bf16 MFU | 1624146 tok/s step 1724/19560 | loss 3.869134 (-2.19z)| norm 0.3143 (-0.49z)| lr 5.96e-04 | 323.08 ms | 52.2% bf16 MFU | 1624078 tok/s step 1725/19560 | loss 3.977250 (-0.06z)| norm 0.3322 (-0.06z)| lr 5.96e-04 | 322.55 ms | 52.3% bf16 MFU | 1624145 tok/s step 1726/19560 | loss 3.953190 (-0.53z)| norm 0.3428 (+0.20z)| lr 5.96e-04 | 322.99 ms | 52.3% bf16 MFU | 1624100 tok/s step 1727/19560 | loss 3.952786 (-0.53z)| norm 0.3804 (+1.11z)| lr 5.96e-04 | 322.69 ms | 52.3% bf16 MFU | 1624132 tok/s step 1728/19560 | loss 3.955787 (-0.47z)| norm 0.3658 (+0.74z)| lr 5.96e-04 | 322.65 ms | 52.3% bf16 MFU | 1624173 tok/s step 1729/19560 | loss 3.859198 (-2.31z)| norm 0.3443 (+0.21z)| lr 5.96e-04 | 322.81 ms | 52.3% bf16 MFU | 1624170 tok/s step 1730/19560 | loss 4.001730 (+0.45z)| norm 0.3605 (+0.59z)| lr 5.96e-04 | 322.62 ms | 52.3% bf16 MFU | 1624217 tok/s step 1731/19560 | loss 3.985606 (+0.14z)| norm 0.3784 (+1.02z)| lr 5.96e-04 | 323.04 ms | 52.2% bf16 MFU | 1624156 tok/s step 1732/19560 | loss 4.035286 (+1.11z)| norm 0.3443 (+0.19z)| lr 5.96e-04 | 322.78 ms | 52.3% bf16 MFU | 1624162 tok/s step 1733/19560 | loss 3.917593 (-1.16z)| norm 0.3304 (-0.15z)| lr 5.96e-04 | 322.63 ms | 52.3% bf16 MFU | 1624205 tok/s step 1734/19560 | loss 3.992927 (+0.29z)| norm 0.3264 (-0.25z)| lr 5.96e-04 | 323.02 ms | 52.2% bf16 MFU | 1624149 tok/s step 1735/19560 | loss 3.938751 (-0.75z)| norm 0.3576 (+0.52z)| lr 5.96e-04 | 322.50 ms | 52.3% bf16 MFU | 1624227 tok/s step 1736/19560 | loss 4.001000 (+0.45z)| norm 
0.3323 (-0.09z)| lr 5.96e-04 | 323.08 ms | 52.2% bf16 MFU | 1624154 tok/s step 1737/19560 | loss 3.943174 (-0.66z)| norm 0.3183 (-0.43z)| lr 5.96e-04 | 322.75 ms | 52.3% bf16 MFU | 1624169 tok/s step 1738/19560 | loss 3.925190 (-0.99z)| norm 0.3005 (-0.86z)| lr 5.96e-04 | 322.57 ms | 52.3% bf16 MFU | 1624228 tok/s step 1739/19560 | loss 3.994501 (+0.35z)| norm 0.3321 (-0.05z)| lr 5.96e-04 | 322.49 ms | 52.3% bf16 MFU | 1624304 tok/s step 1740/19560 | loss 3.994535 (+0.35z)| norm 0.3039 (-0.76z)| lr 5.96e-04 | 322.94 ms | 52.3% bf16 MFU | 1624263 tok/s step 1741/19560 | loss 3.909254 (-1.29z)| norm 0.3109 (-0.59z)| lr 5.96e-04 | 322.67 ms | 52.3% bf16 MFU | 1624291 tok/s step 1742/19560 | loss 3.933182 (-0.82z)| norm 0.2984 (-0.90z)| lr 5.96e-04 | 322.50 ms | 52.3% bf16 MFU | 1624362 tok/s step 1743/19560 | loss 3.964703 (-0.21z)| norm 0.2680 (-1.66z)| lr 5.95e-04 | 323.20 ms | 52.2% bf16 MFU | 1624252 tok/s step 1744/19560 | loss 3.917544 (-1.10z)| norm 0.3057 (-0.72z)| lr 5.95e-04 | 322.79 ms | 52.3% bf16 MFU | 1624252 tok/s step 1745/19560 | loss 4.047411 (+1.37z)| norm 0.3056 (-0.72z)| lr 5.95e-04 | 322.45 ms | 52.3% bf16 MFU | 1624337 tok/s step 1746/19560 | loss 3.919715 (-1.05z)| norm 0.3095 (-0.63z)| lr 5.95e-04 | 323.02 ms | 52.2% bf16 MFU | 1624274 tok/s step 1747/19560 | loss 3.946610 (-0.52z)| norm 0.3109 (-0.59z)| lr 5.95e-04 | 322.68 ms | 52.3% bf16 MFU | 1624300 tok/s step 1748/19560 | loss 4.028149 (+1.03z)| norm 0.3109 (-0.59z)| lr 5.95e-04 | 322.23 ms | 52.4% bf16 MFU | 1624438 tok/s step 1749/19560 | loss 4.002176 (+0.54z)| norm 0.3373 (+0.08z)| lr 5.95e-04 | 322.77 ms | 52.3% bf16 MFU | 1624434 tok/s step 1750/19560 | loss 3.914291 (-1.14z)| norm 0.3826 (+1.24z)| lr 5.95e-04 | 323.36 ms | 52.2% bf16 MFU | 1624282 tok/s val loss 3.932409 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating 
HellaSwag: 70/79 HellaSwag: 2574/10042 = 0.256323 step 1751/19560 | loss 3.973516 (-0.01z)| norm 0.3537 (+0.55z)| lr 5.95e-04 | 321.96 ms | 52.4% bf16 MFU | 1624490 tok/s step 1752/19560 | loss 4.047817 (+1.40z)| norm 0.3508 (+0.54z)| lr 5.95e-04 | 322.62 ms | 52.3% bf16 MFU | 1624519 tok/s step 1753/19560 | loss 3.914773 (-1.13z)| norm 0.3371 (+0.17z)| lr 5.95e-04 | 323.73 ms | 52.1% bf16 MFU | 1624270 tok/s step 1754/19560 | loss 3.913014 (-1.15z)| norm 0.3311 (+0.00z)| lr 5.95e-04 | 323.02 ms | 52.2% bf16 MFU | 1624210 tok/s step 1755/19560 | loss 3.929302 (-0.83z)| norm 0.3238 (-0.21z)| lr 5.95e-04 | 322.85 ms | 52.3% bf16 MFU | 1624198 tok/s step 1756/19560 | loss 3.968888 (-0.09z)| norm 0.3441 (+0.38z)| lr 5.95e-04 | 322.72 ms | 52.3% bf16 MFU | 1624218 tok/s step 1757/19560 | loss 3.887860 (-1.58z)| norm 0.3715 (+1.16z)| lr 5.95e-04 | 322.83 ms | 52.3% bf16 MFU | 1624210 tok/s step 1758/19560 | loss 3.918562 (-1.00z)| norm 0.3692 (+1.08z)| lr 5.95e-04 | 322.87 ms | 52.3% bf16 MFU | 1624190 tok/s step 1759/19560 | loss 3.916976 (-1.01z)| norm 0.3794 (+1.35z)| lr 5.95e-04 | 323.20 ms | 52.2% bf16 MFU | 1624089 tok/s step 1760/19560 | loss 3.972547 (+0.02z)| norm 0.3431 (+0.29z)| lr 5.95e-04 | 322.78 ms | 52.3% bf16 MFU | 1624100 tok/s step 1761/19560 | loss 4.004533 (+0.61z)| norm 0.2857 (-1.37z)| lr 5.95e-04 | 323.39 ms | 52.2% bf16 MFU | 1623957 tok/s step 1762/19560 | loss 3.910335 (-1.12z)| norm 0.2769 (-1.60z)| lr 5.95e-04 | 322.64 ms | 52.3% bf16 MFU | 1624008 tok/s step 1763/19560 | loss 3.939862 (-0.57z)| norm 0.3111 (-0.61z)| lr 5.95e-04 | 323.19 ms | 52.2% bf16 MFU | 1623919 tok/s step 1764/19560 | loss 3.935099 (-0.64z)| norm 0.3156 (-0.48z)| lr 5.95e-04 | 323.34 ms | 52.2% bf16 MFU | 1623796 tok/s step 1765/19560 | loss 3.947445 (-0.41z)| norm 0.3149 (-0.48z)| lr 5.95e-04 | 322.86 ms | 52.3% bf16 MFU | 1623800 tok/s step 1766/19560 | loss 3.875075 (-1.73z)| norm 0.3069 (-0.71z)| lr 5.95e-04 | 323.30 ms | 52.2% bf16 MFU | 1623693 tok/s step 
1767/19560 | loss 4.003951 (+0.64z)| norm 0.2931 (-1.10z)| lr 5.95e-04 | 323.29 ms | 52.2% bf16 MFU | 1623595 tok/s step 1768/19560 | loss 3.938674 (-0.56z)| norm 0.3179 (-0.35z)| lr 5.95e-04 | 322.54 ms | 52.3% bf16 MFU | 1623691 tok/s step 1769/19560 | loss 3.908736 (-1.16z)| norm 0.3408 (+0.34z)| lr 5.95e-04 | 322.73 ms | 52.3% bf16 MFU | 1623733 tok/s step 1770/19560 | loss 3.906401 (-1.19z)| norm 0.3180 (-0.35z)| lr 5.95e-04 | 322.96 ms | 52.3% bf16 MFU | 1623716 tok/s step 1771/19560 | loss 3.928017 (-0.75z)| norm 0.2968 (-0.97z)| lr 5.95e-04 | 323.31 ms | 52.2% bf16 MFU | 1623613 tok/s step 1772/19560 | loss 3.925140 (-0.80z)| norm 0.3211 (-0.25z)| lr 5.95e-04 | 322.77 ms | 52.3% bf16 MFU | 1623650 tok/s step 1773/19560 | loss 3.971457 (+0.13z)| norm 0.3311 (+0.06z)| lr 5.95e-04 | 323.21 ms | 52.2% bf16 MFU | 1623574 tok/s step 1774/19560 | loss 3.937717 (-0.54z)| norm 0.3388 (+0.29z)| lr 5.95e-04 | 322.49 ms | 52.3% bf16 MFU | 1623681 tok/s step 1775/19560 | loss 3.907554 (-1.13z)| norm 0.3488 (+0.58z)| lr 5.95e-04 | 322.94 ms | 52.3% bf16 MFU | 1623672 tok/s step 1776/19560 | loss 3.918024 (-0.91z)| norm 0.3103 (-0.59z)| lr 5.95e-04 | 322.95 ms | 52.3% bf16 MFU | 1623661 tok/s step 1777/19560 | loss 3.901931 (-1.21z)| norm 0.3781 (+1.45z)| lr 5.95e-04 | 322.46 ms | 52.3% bf16 MFU | 1623774 tok/s step 1778/19560 | loss 3.960621 (-0.03z)| norm 0.3323 (+0.07z)| lr 5.95e-04 | 323.46 ms | 52.2% bf16 MFU | 1623628 tok/s step 1779/19560 | loss 3.944809 (-0.34z)| norm 0.2961 (-1.01z)| lr 5.95e-04 | 322.33 ms | 52.4% bf16 MFU | 1623776 tok/s step 1780/19560 | loss 3.919430 (-0.84z)| norm 0.2841 (-1.38z)| lr 5.95e-04 | 323.57 ms | 52.2% bf16 MFU | 1623603 tok/s step 1781/19560 | loss 3.924814 (-0.72z)| norm 0.2901 (-1.22z)| lr 5.95e-04 | 323.36 ms | 52.2% bf16 MFU | 1623491 tok/s step 1782/19560 | loss 3.948969 (-0.24z)| norm 0.3001 (-0.90z)| lr 5.95e-04 | 323.02 ms | 52.2% bf16 MFU | 1623470 tok/s step 1783/19560 | loss 3.906819 (-1.07z)| norm 0.3273 (+0.08z)| lr 
5.95e-04 | 323.52 ms | 52.2% bf16 MFU | 1623324 tok/s step 1784/19560 | loss 3.946754 (-0.25z)| norm 0.3747 (+1.80z)| lr 5.95e-04 | 322.63 ms | 52.3% bf16 MFU | 1623411 tok/s step 1785/19560 | loss 3.949878 (-0.19z)| norm 0.4170 (+3.18z)| lr 5.95e-04 | 323.11 ms | 52.2% bf16 MFU | 1623372 tok/s step 1786/19560 | loss 3.925636 (-0.67z)| norm 0.4112 (+2.86z)| lr 5.95e-04 | 322.96 ms | 52.3% bf16 MFU | 1623372 tok/s step 1787/19560 | loss 3.944310 (-0.27z)| norm 0.3702 (+1.45z)| lr 5.95e-04 | 322.58 ms | 52.3% bf16 MFU | 1623469 tok/s step 1788/19560 | loss 3.881239 (-1.55z)| norm 0.3612 (+1.13z)| lr 5.95e-04 | 322.93 ms | 52.3% bf16 MFU | 1623472 tok/s step 1789/19560 | loss 3.895774 (-1.23z)| norm 0.3244 (-0.10z)| lr 5.95e-04 | 322.79 ms | 52.3% bf16 MFU | 1623509 tok/s step 1790/19560 | loss 3.885234 (-1.56z)| norm 0.2961 (-1.04z)| lr 5.95e-04 | 322.80 ms | 52.3% bf16 MFU | 1623544 tok/s step 1791/19560 | loss 3.911340 (-0.94z)| norm 0.2881 (-1.29z)| lr 5.95e-04 | 322.56 ms | 52.3% bf16 MFU | 1623637 tok/s step 1792/19560 | loss 3.899828 (-1.19z)| norm 0.2816 (-1.49z)| lr 5.95e-04 | 322.41 ms | 52.3% bf16 MFU | 1623762 tok/s step 1793/19560 | loss 3.929955 (-0.49z)| norm 0.2855 (-1.35z)| lr 5.95e-04 | 322.52 ms | 52.3% bf16 MFU | 1623854 tok/s step 1794/19560 | loss 3.934066 (-0.39z)| norm 0.3154 (-0.34z)| lr 5.95e-04 | 323.28 ms | 52.2% bf16 MFU | 1623750 tok/s step 1795/19560 | loss 3.872092 (-1.79z)| norm 0.3173 (-0.28z)| lr 5.95e-04 | 323.37 ms | 52.2% bf16 MFU | 1623629 tok/s step 1796/19560 | loss 3.878901 (-1.60z)| norm 0.3227 (-0.11z)| lr 5.95e-04 | 323.06 ms | 52.2% bf16 MFU | 1623590 tok/s step 1797/19560 | loss 3.881090 (-1.52z)| norm 0.3074 (-0.63z)| lr 5.95e-04 | 322.29 ms | 52.4% bf16 MFU | 1623749 tok/s step 1798/19560 | loss 3.941275 (-0.16z)| norm 0.2911 (-1.16z)| lr 5.95e-04 | 322.84 ms | 52.3% bf16 MFU | 1623762 tok/s step 1799/19560 | loss 3.941587 (-0.15z)| norm 0.3070 (-0.62z)| lr 5.95e-04 | 322.99 ms | 52.3% bf16 MFU | 1623734 tok/s step 
1800/19560 | loss 3.958652 (+0.24z)| norm 0.3189 (-0.20z)| lr 5.95e-04 | 323.10 ms | 52.2% bf16 MFU | 1623682 tok/s step 1801/19560 | loss 3.909759 (-0.86z)| norm 0.3194 (-0.18z)| lr 5.95e-04 | 323.12 ms | 52.2% bf16 MFU | 1623627 tok/s step 1802/19560 | loss 3.953363 (+0.12z)| norm 0.3429 (+0.62z)| lr 5.95e-04 | 323.06 ms | 52.2% bf16 MFU | 1623591 tok/s step 1803/19560 | loss 3.929342 (-0.41z)| norm 0.3287 (+0.13z)| lr 5.95e-04 | 322.85 ms | 52.3% bf16 MFU | 1623608 tok/s step 1804/19560 | loss 3.978669 (+0.71z)| norm 0.3234 (-0.03z)| lr 5.95e-04 | 323.14 ms | 52.2% bf16 MFU | 1623551 tok/s step 1805/19560 | loss 3.915376 (-0.72z)| norm 0.3418 (+0.59z)| lr 5.95e-04 | 323.27 ms | 52.2% bf16 MFU | 1623464 tok/s step 1806/19560 | loss 3.851506 (-2.13z)| norm 0.3077 (-0.58z)| lr 5.95e-04 | 322.74 ms | 52.3% bf16 MFU | 1623514 tok/s step 1807/19560 | loss 3.945408 (-0.04z)| norm 0.2872 (-1.28z)| lr 5.95e-04 | 322.62 ms | 52.3% bf16 MFU | 1623592 tok/s step 1808/19560 | loss 3.960146 (+0.32z)| norm 0.3020 (-0.77z)| lr 5.95e-04 | 323.17 ms | 52.2% bf16 MFU | 1623530 tok/s step 1809/19560 | loss 3.900874 (-1.04z)| norm 0.3000 (-0.84z)| lr 5.95e-04 | 323.40 ms | 52.2% bf16 MFU | 1623411 tok/s step 1810/19560 | loss 3.917804 (-0.64z)| norm 0.2875 (-1.25z)| lr 5.95e-04 | 323.66 ms | 52.1% bf16 MFU | 1623233 tok/s step 1811/19560 | loss 3.930100 (-0.36z)| norm 0.2934 (-1.03z)| lr 5.95e-04 | 322.50 ms | 52.3% bf16 MFU | 1623357 tok/s step 1812/19560 | loss 3.942937 (-0.06z)| norm 0.3335 (+0.33z)| lr 5.95e-04 | 322.54 ms | 52.3% bf16 MFU | 1623464 tok/s step 1813/19560 | loss 3.872732 (-1.66z)| norm 0.3165 (-0.26z)| lr 5.95e-04 | 322.90 ms | 52.3% bf16 MFU | 1623476 tok/s step 1814/19560 | loss 4.034369 (+2.30z)| norm 0.3144 (-0.34z)| lr 5.95e-04 | 323.22 ms | 52.2% bf16 MFU | 1623405 tok/s step 1815/19560 | loss 3.940471 (-0.07z)| norm 0.3441 (+0.68z)| lr 5.95e-04 | 323.44 ms | 52.2% bf16 MFU | 1623284 tok/s step 1816/19560 | loss 3.980687 (+0.94z)| norm 0.3537 (+0.99z)| lr 
5.95e-04 | 323.32 ms | 52.2% bf16 MFU | 1623200 tok/s step 1817/19560 | loss 3.907422 (-0.91z)| norm 0.3382 (+0.45z)| lr 5.95e-04 | 323.09 ms | 52.2% bf16 MFU | 1623177 tok/s step 1818/19560 | loss 3.943439 (+0.01z)| norm 0.3894 (+2.15z)| lr 5.95e-04 | 323.25 ms | 52.2% bf16 MFU | 1623114 tok/s step 1819/19560 | loss 3.958316 (+0.38z)| norm 0.4237 (+3.15z)| lr 5.95e-04 | 322.94 ms | 52.3% bf16 MFU | 1623133 tok/s step 1820/19560 | loss 3.938196 (-0.12z)| norm 0.3917 (+2.06z)| lr 5.95e-04 | 323.62 ms | 52.2% bf16 MFU | 1622980 tok/s step 1821/19560 | loss 3.941909 (-0.02z)| norm 0.3587 (+0.99z)| lr 5.95e-04 | 322.93 ms | 52.3% bf16 MFU | 1623008 tok/s step 1822/19560 | loss 3.894263 (-1.21z)| norm 0.3793 (+1.63z)| lr 5.95e-04 | 322.80 ms | 52.3% bf16 MFU | 1623068 tok/s step 1823/19560 | loss 3.882454 (-1.49z)| norm 0.3259 (-0.04z)| lr 5.95e-04 | 323.46 ms | 52.2% bf16 MFU | 1622960 tok/s step 1824/19560 | loss 3.882908 (-1.46z)| norm 0.3015 (-0.81z)| lr 5.95e-04 | 323.28 ms | 52.2% bf16 MFU | 1622901 tok/s step 1825/19560 | loss 3.910388 (-0.75z)| norm 0.2777 (-1.55z)| lr 5.95e-04 | 323.24 ms | 52.2% bf16 MFU | 1622855 tok/s step 1826/19560 | loss 3.876657 (-1.58z)| norm 0.2910 (-1.11z)| lr 5.95e-04 | 322.68 ms | 52.3% bf16 MFU | 1622953 tok/s step 1827/19560 | loss 3.914424 (-0.61z)| norm 0.3273 (+0.05z)| lr 5.95e-04 | 322.97 ms | 52.3% bf16 MFU | 1622971 tok/s step 1828/19560 | loss 3.902818 (-0.90z)| norm 0.3375 (+0.36z)| lr 5.95e-04 | 323.04 ms | 52.2% bf16 MFU | 1622971 tok/s step 1829/19560 | loss 3.912383 (-0.66z)| norm 0.3121 (-0.45z)| lr 5.95e-04 | 323.06 ms | 52.2% bf16 MFU | 1622967 tok/s step 1830/19560 | loss 3.931211 (-0.18z)| norm 0.2855 (-1.27z)| lr 5.95e-04 | 322.82 ms | 52.3% bf16 MFU | 1623022 tok/s step 1831/19560 | loss 3.938343 (+0.00z)| norm 0.3069 (-0.59z)| lr 5.95e-04 | 323.03 ms | 52.2% bf16 MFU | 1623023 tok/s step 1832/19560 | loss 3.935064 (-0.07z)| norm 0.3185 (-0.22z)| lr 5.95e-04 | 322.87 ms | 52.3% bf16 MFU | 1623064 tok/s step 
1833/19560 | loss 3.909375 (-0.72z)| norm 0.3151 (-0.32z)| lr 5.95e-04 | 322.86 ms | 52.3% bf16 MFU | 1623106 tok/s step 1834/19560 | loss 3.903347 (-0.86z)| norm 0.3105 (-0.45z)| lr 5.95e-04 | 322.73 ms | 52.3% bf16 MFU | 1623177 tok/s step 1835/19560 | loss 3.906613 (-0.77z)| norm 0.3454 (+0.66z)| lr 5.95e-04 | 323.22 ms | 52.2% bf16 MFU | 1623121 tok/s step 1836/19560 | loss 3.926259 (-0.26z)| norm 0.3150 (-0.31z)| lr 5.95e-04 | 322.76 ms | 52.3% bf16 MFU | 1623185 tok/s step 1837/19560 | loss 3.965204 (+0.73z)| norm 0.2931 (-1.00z)| lr 5.95e-04 | 322.52 ms | 52.3% bf16 MFU | 1623306 tok/s step 1838/19560 | loss 3.891656 (-1.13z)| norm 0.2976 (-0.85z)| lr 5.95e-04 | 322.94 ms | 52.3% bf16 MFU | 1623316 tok/s step 1839/19560 | loss 3.899467 (-0.92z)| norm 0.3045 (-0.62z)| lr 5.95e-04 | 323.09 ms | 52.2% bf16 MFU | 1623286 tok/s step 1840/19560 | loss 3.867416 (-1.72z)| norm 0.3417 (+0.57z)| lr 5.95e-04 | 322.94 ms | 52.3% bf16 MFU | 1623297 tok/s step 1841/19560 | loss 3.910877 (-0.59z)| norm 0.3445 (+0.65z)| lr 5.95e-04 | 322.45 ms | 52.3% bf16 MFU | 1623430 tok/s step 1842/19560 | loss 3.967297 (+0.86z)| norm 0.3743 (+1.56z)| lr 5.95e-04 | 322.63 ms | 52.3% bf16 MFU | 1623512 tok/s step 1843/19560 | loss 3.869354 (-1.63z)| norm 0.3459 (+0.66z)| lr 5.95e-04 | 323.13 ms | 52.2% bf16 MFU | 1623462 tok/s step 1844/19560 | loss 3.904262 (-0.74z)| norm 0.2904 (-1.11z)| lr 5.95e-04 | 323.24 ms | 52.2% bf16 MFU | 1623386 tok/s step 1845/19560 | loss 3.925210 (-0.21z)| norm 0.2721 (-1.67z)| lr 5.95e-04 | 322.93 ms | 52.3% bf16 MFU | 1623394 tok/s step 1846/19560 | loss 3.882395 (-1.27z)| norm 0.2917 (-1.06z)| lr 5.95e-04 | 323.01 ms | 52.2% bf16 MFU | 1623381 tok/s step 1847/19560 | loss 3.865370 (-1.67z)| norm 0.2919 (-1.05z)| lr 5.95e-04 | 322.69 ms | 52.3% bf16 MFU | 1623449 tok/s step 1848/19560 | loss 3.910219 (-0.54z)| norm 0.3008 (-0.77z)| lr 5.95e-04 | 322.85 ms | 52.3% bf16 MFU | 1623473 tok/s step 1849/19560 | loss 3.945799 (+0.34z)| norm 0.3083 (-0.54z)| lr 
5.95e-04 | 322.78 ms | 52.3% bf16 MFU | 1623513 tok/s step 1850/19560 | loss 3.850249 (-2.00z)| norm 0.3277 (+0.07z)| lr 5.95e-04 | 323.18 ms | 52.2% bf16 MFU | 1623452 tok/s step 1851/19560 | loss 3.881564 (-1.21z)| norm 0.3264 (+0.02z)| lr 5.95e-04 | 322.45 ms | 52.3% bf16 MFU | 1623578 tok/s step 1852/19560 | loss 3.885060 (-1.14z)| norm 0.3092 (-0.53z)| lr 5.95e-04 | 322.51 ms | 52.3% bf16 MFU | 1623682 tok/s step 1853/19560 | loss 3.992932 (+1.51z)| norm 0.3211 (-0.14z)| lr 5.94e-04 | 322.61 ms | 52.3% bf16 MFU | 1623754 tok/s step 1854/19560 | loss 3.916271 (-0.36z)| norm 0.3303 (+0.15z)| lr 5.94e-04 | 322.81 ms | 52.3% bf16 MFU | 1623772 tok/s step 1855/19560 | loss 3.855841 (-1.81z)| norm 0.3004 (-0.79z)| lr 5.94e-04 | 323.05 ms | 52.2% bf16 MFU | 1623731 tok/s step 1856/19560 | loss 3.975829 (+1.10z)| norm 0.2803 (-1.42z)| lr 5.94e-04 | 323.12 ms | 52.2% bf16 MFU | 1623673 tok/s step 1857/19560 | loss 3.874674 (-1.36z)| norm 0.2767 (-1.51z)| lr 5.94e-04 | 323.02 ms | 52.2% bf16 MFU | 1623642 tok/s step 1858/19560 | loss 3.930558 (+0.01z)| norm 0.2973 (-0.83z)| lr 5.94e-04 | 322.94 ms | 52.3% bf16 MFU | 1623634 tok/s step 1859/19560 | loss 3.876902 (-1.29z)| norm 0.3214 (-0.04z)| lr 5.94e-04 | 323.25 ms | 52.2% bf16 MFU | 1623548 tok/s step 1860/19560 | loss 3.867698 (-1.51z)| norm 0.3491 (+0.86z)| lr 5.94e-04 | 322.44 ms | 52.3% bf16 MFU | 1623670 tok/s step 1861/19560 | loss 3.833389 (-2.30z)| norm 0.3681 (+1.46z)| lr 5.94e-04 | 322.83 ms | 52.3% bf16 MFU | 1623689 tok/s step 1862/19560 | loss 3.882550 (-1.08z)| norm 0.3496 (+0.85z)| lr 5.94e-04 | 322.51 ms | 52.3% bf16 MFU | 1623788 tok/s step 1863/19560 | loss 3.846069 (-1.93z)| norm 0.3735 (+1.60z)| lr 5.94e-04 | 322.60 ms | 52.3% bf16 MFU | 1623858 tok/s step 1864/19560 | loss 3.919854 (-0.13z)| norm 0.3609 (+1.19z)| lr 5.94e-04 | 322.88 ms | 52.3% bf16 MFU | 1623854 tok/s step 1865/19560 | loss 3.910881 (-0.34z)| norm 0.3345 (+0.34z)| lr 5.94e-04 | 322.98 ms | 52.3% bf16 MFU | 1623826 tok/s step 
1866/19560 | loss 3.873638 (-1.24z)| norm 0.3429 (+0.60z)| lr 5.94e-04 | 322.45 ms | 52.3% bf16 MFU | 1623934 tok/s step 1867/19560 | loss 3.949182 (+0.62z)| norm 0.3498 (+0.81z)| lr 5.94e-04 | 323.16 ms | 52.2% bf16 MFU | 1623857 tok/s step 1868/19560 | loss 3.921269 (-0.05z)| norm 0.3486 (+0.76z)| lr 5.94e-04 | 323.00 ms | 52.3% bf16 MFU | 1623824 tok/s step 1869/19560 | loss 3.873145 (-1.24z)| norm 0.3156 (-0.29z)| lr 5.94e-04 | 322.97 ms | 52.3% bf16 MFU | 1623800 tok/s step 1870/19560 | loss 3.806360 (-2.79z)| norm 0.3117 (-0.41z)| lr 5.94e-04 | 322.86 ms | 52.3% bf16 MFU | 1623803 tok/s step 1871/19560 | loss 3.860638 (-1.46z)| norm 0.3132 (-0.38z)| lr 5.94e-04 | 322.99 ms | 52.3% bf16 MFU | 1623775 tok/s step 1872/19560 | loss 3.903416 (-0.43z)| norm 0.3187 (-0.21z)| lr 5.94e-04 | 322.68 ms | 52.3% bf16 MFU | 1623825 tok/s step 1873/19560 | loss 3.929187 (+0.22z)| norm 0.3516 (+0.84z)| lr 5.94e-04 | 322.73 ms | 52.3% bf16 MFU | 1623860 tok/s step 1874/19560 | loss 3.961948 (+1.03z)| norm 0.3337 (+0.26z)| lr 5.94e-04 | 323.06 ms | 52.2% bf16 MFU | 1623812 tok/s step 1875/19560 | loss 3.973602 (+1.30z)| norm 0.3173 (-0.27z)| lr 5.94e-04 | 322.67 ms | 52.3% bf16 MFU | 1623864 tok/s step 1876/19560 | loss 3.916710 (-0.08z)| norm 0.3349 (+0.29z)| lr 5.94e-04 | 322.59 ms | 52.3% bf16 MFU | 1623932 tok/s step 1877/19560 | loss 3.941557 (+0.57z)| norm 0.2899 (-1.15z)| lr 5.94e-04 | 323.38 ms | 52.2% bf16 MFU | 1623800 tok/s step 1878/19560 | loss 3.937272 (+0.45z)| norm 0.2808 (-1.42z)| lr 5.94e-04 | 322.53 ms | 52.3% bf16 MFU | 1623887 tok/s step 1879/19560 | loss 3.902525 (-0.43z)| norm 0.2754 (-1.57z)| lr 5.94e-04 | 322.45 ms | 52.3% bf16 MFU | 1623991 tok/s step 1880/19560 | loss 3.886342 (-0.85z)| norm 0.2637 (-1.90z)| lr 5.94e-04 | 322.74 ms | 52.3% bf16 MFU | 1624017 tok/s step 1881/19560 | loss 3.929120 (+0.30z)| norm 0.2692 (-1.69z)| lr 5.94e-04 | 323.08 ms | 52.2% bf16 MFU | 1623955 tok/s step 1882/19560 | loss 4.052195 (+3.44z)| norm 0.3070 (-0.50z)| lr 
5.94e-04 | 322.69 ms | 52.3% bf16 MFU | 1623995 tok/s step 1883/19560 | loss 3.831885 (-2.19z)| norm 0.2975 (-0.78z)| lr 5.94e-04 | 322.50 ms | 52.3% bf16 MFU | 1624079 tok/s step 1884/19560 | loss 3.848399 (-1.74z)| norm 0.2941 (-0.88z)| lr 5.94e-04 | 322.96 ms | 52.3% bf16 MFU | 1624044 tok/s step 1885/19560 | loss 3.863045 (-1.36z)| norm 0.2962 (-0.80z)| lr 5.94e-04 | 322.64 ms | 52.3% bf16 MFU | 1624092 tok/s step 1886/19560 | loss 3.847478 (-1.71z)| norm 0.2802 (-1.28z)| lr 5.94e-04 | 323.00 ms | 52.3% bf16 MFU | 1624047 tok/s step 1887/19560 | loss 3.913686 (-0.07z)| norm 0.3163 (-0.13z)| lr 5.94e-04 | 322.82 ms | 52.3% bf16 MFU | 1624050 tok/s step 1888/19560 | loss 3.931793 (+0.39z)| norm 0.3306 (+0.33z)| lr 5.94e-04 | 322.39 ms | 52.4% bf16 MFU | 1624161 tok/s step 1889/19560 | loss 3.921515 (+0.15z)| norm 0.2931 (-0.87z)| lr 5.94e-04 | 323.07 ms | 52.2% bf16 MFU | 1624095 tok/s step 1890/19560 | loss 3.814682 (-2.49z)| norm 0.3132 (-0.24z)| lr 5.94e-04 | 322.74 ms | 52.3% bf16 MFU | 1624114 tok/s step 1891/19560 | loss 3.859995 (-1.34z)| norm 0.3508 (+0.96z)| lr 5.94e-04 | 322.55 ms | 52.3% bf16 MFU | 1624180 tok/s step 1892/19560 | loss 3.896742 (-0.42z)| norm 0.3154 (-0.18z)| lr 5.94e-04 | 322.62 ms | 52.3% bf16 MFU | 1624226 tok/s step 1893/19560 | loss 3.858229 (-1.35z)| norm 0.3100 (-0.35z)| lr 5.94e-04 | 322.46 ms | 52.3% bf16 MFU | 1624310 tok/s step 1894/19560 | loss 3.887149 (-0.64z)| norm 0.2902 (-0.98z)| lr 5.94e-04 | 323.17 ms | 52.2% bf16 MFU | 1624212 tok/s step 1895/19560 | loss 3.868073 (-1.11z)| norm 0.2875 (-1.07z)| lr 5.94e-04 | 322.56 ms | 52.3% bf16 MFU | 1624271 tok/s step 1896/19560 | loss 3.946312 (+0.85z)| norm 0.2790 (-1.32z)| lr 5.94e-04 | 322.81 ms | 52.3% bf16 MFU | 1624264 tok/s step 1897/19560 | loss 3.832264 (-1.95z)| norm 0.2664 (-1.69z)| lr 5.94e-04 | 322.18 ms | 52.4% bf16 MFU | 1624415 tok/s step 1898/19560 | loss 4.018414 (+2.54z)| norm 0.3098 (-0.32z)| lr 5.94e-04 | 322.75 ms | 52.3% bf16 MFU | 1624417 tok/s step 
1899/19560 | loss 3.825441 (-2.04z)| norm 0.3599 (+1.25z)| lr 5.94e-04 | 323.15 ms | 52.2% bf16 MFU | 1624318 tok/s step 1900/19560 | loss 3.947583 (+0.84z)| norm 0.3844 (+1.97z)| lr 5.94e-04 | 323.24 ms | 52.2% bf16 MFU | 1624202 tok/s step 1901/19560 | loss 3.908469 (-0.07z)| norm 0.3731 (+1.60z)| lr 5.94e-04 | 322.38 ms | 52.4% bf16 MFU | 1624308 tok/s step 1902/19560 | loss 3.913713 (+0.06z)| norm 0.7522 (+8.55z)| lr 5.94e-04 | 322.32 ms | 52.4% bf16 MFU | 1624423 tok/s step 1903/19560 | loss 3.955807 (+1.04z)| norm 0.6184 (+5.19z)| lr 5.94e-04 | 322.85 ms | 52.3% bf16 MFU | 1624397 tok/s step 1904/19560 | loss 4.034506 (+2.79z)| norm 0.6207 (+4.71z)| lr 5.94e-04 | 323.14 ms | 52.2% bf16 MFU | 1624301 tok/s step 1905/19560 | loss 4.047697 (+2.96z)| norm 0.3706 (+0.67z)| lr 5.94e-04 | 322.88 ms | 52.3% bf16 MFU | 1624276 tok/s step 1906/19560 | loss 3.823776 (-1.94z)| norm 0.4209 (+1.46z)| lr 5.94e-04 | 322.80 ms | 52.3% bf16 MFU | 1624271 tok/s step 1907/19560 | loss 3.913228 (+0.02z)| norm 0.3394 (+0.15z)| lr 5.94e-04 | 322.44 ms | 52.3% bf16 MFU | 1624358 tok/s step 1908/19560 | loss 3.914479 (+0.05z)| norm 0.3340 (+0.06z)| lr 5.94e-04 | 322.55 ms | 52.3% bf16 MFU | 1624414 tok/s step 1909/19560 | loss 3.866066 (-1.00z)| norm 0.3941 (+1.01z)| lr 5.94e-04 | 323.02 ms | 52.2% bf16 MFU | 1624347 tok/s step 1910/19560 | loss 3.924031 (+0.27z)| norm 0.4175 (+1.36z)| lr 5.94e-04 | 322.78 ms | 52.3% bf16 MFU | 1624343 tok/s step 1911/19560 | loss 3.933537 (+0.47z)| norm 0.3850 (+0.84z)| lr 5.94e-04 | 322.78 ms | 52.3% bf16 MFU | 1624340 tok/s step 1912/19560 | loss 3.926575 (+0.32z)| norm 0.3258 (-0.10z)| lr 5.94e-04 | 322.66 ms | 52.3% bf16 MFU | 1624367 tok/s step 1913/19560 | loss 3.931008 (+0.43z)| norm 0.3397 (+0.13z)| lr 5.94e-04 | 322.63 ms | 52.3% bf16 MFU | 1624401 tok/s step 1914/19560 | loss 3.888435 (-0.50z)| norm 0.3404 (+0.15z)| lr 5.94e-04 | 322.67 ms | 52.3% bf16 MFU | 1624424 tok/s step 1915/19560 | loss 3.927322 (+0.35z)| norm 0.3227 (-0.13z)| lr 
5.94e-04 | 323.75 ms | 52.1% bf16 MFU | 1624173 tok/s step 1916/19560 | loss 3.895730 (-0.34z)| norm 0.2983 (-0.51z)| lr 5.94e-04 | 322.53 ms | 52.3% bf16 MFU | 1624243 tok/s step 1917/19560 | loss 3.820195 (-1.96z)| norm 0.3190 (-0.18z)| lr 5.94e-04 | 322.42 ms | 52.3% bf16 MFU | 1624336 tok/s step 1918/19560 | loss 3.908283 (-0.06z)| norm 0.3015 (-0.46z)| lr 5.94e-04 | 323.00 ms | 52.3% bf16 MFU | 1624278 tok/s step 1919/19560 | loss 3.893671 (-0.37z)| norm 0.3264 (-0.06z)| lr 5.94e-04 | 323.26 ms | 52.2% bf16 MFU | 1624158 tok/s step 1920/19560 | loss 3.863209 (-1.02z)| norm 0.3086 (-0.36z)| lr 5.94e-04 | 322.89 ms | 52.3% bf16 MFU | 1624136 tok/s step 1921/19560 | loss 3.905039 (-0.11z)| norm 0.3029 (-0.45z)| lr 5.94e-04 | 322.64 ms | 52.3% bf16 MFU | 1624180 tok/s step 1922/19560 | loss 3.849390 (-1.29z)| norm 0.2809 (-0.80z)| lr 5.94e-04 | 322.34 ms | 52.4% bf16 MFU | 1624296 tok/s step 1923/19560 | loss 3.922589 (+0.27z)| norm 0.3059 (-0.39z)| lr 5.94e-04 | 322.57 ms | 52.3% bf16 MFU | 1624350 tok/s step 1924/19560 | loss 3.964665 (+1.16z)| norm 0.3497 (+0.31z)| lr 5.94e-04 | 322.94 ms | 52.3% bf16 MFU | 1624307 tok/s step 1925/19560 | loss 3.906454 (-0.10z)| norm 0.3261 (-0.07z)| lr 5.94e-04 | 322.87 ms | 52.3% bf16 MFU | 1624284 tok/s step 1926/19560 | loss 4.056831 (+3.01z)| norm 0.3007 (-0.49z)| lr 5.94e-04 | 322.50 ms | 52.3% bf16 MFU | 1624355 tok/s step 1927/19560 | loss 3.935499 (+0.49z)| norm 0.3091 (-0.35z)| lr 5.94e-04 | 322.94 ms | 52.3% bf16 MFU | 1624312 tok/s step 1928/19560 | loss 3.847380 (-1.31z)| norm 0.3241 (-0.11z)| lr 5.94e-04 | 323.00 ms | 52.3% bf16 MFU | 1624257 tok/s step 1929/19560 | loss 3.935276 (+0.50z)| norm 0.2967 (-0.55z)| lr 5.94e-04 | 322.49 ms | 52.3% bf16 MFU | 1624331 tok/s step 1930/19560 | loss 3.861609 (-1.01z)| norm 0.2861 (-0.71z)| lr 5.94e-04 | 323.42 ms | 52.2% bf16 MFU | 1624167 tok/s step 1931/19560 | loss 3.916395 (+0.13z)| norm 0.2787 (-0.82z)| lr 5.94e-04 | 323.26 ms | 52.2% bf16 MFU | 1624053 tok/s step 
1932/19560 | loss 3.874068 (-0.74z)| norm 0.2779 (-0.83z)| lr 5.94e-04 | 322.48 ms | 52.3% bf16 MFU | 1624140 tok/s step 1933/19560 | loss 3.866444 (-0.88z)| norm 0.2869 (-0.68z)| lr 5.94e-04 | 322.68 ms | 52.3% bf16 MFU | 1624171 tok/s step 1934/19560 | loss 3.821479 (-1.80z)| norm 0.2624 (-1.06z)| lr 5.94e-04 | 323.22 ms | 52.2% bf16 MFU | 1624066 tok/s step 1935/19560 | loss 3.867861 (-0.83z)| norm 0.2800 (-0.77z)| lr 5.94e-04 | 322.93 ms | 52.3% bf16 MFU | 1624040 tok/s step 1936/19560 | loss 3.894511 (-0.27z)| norm 0.3080 (-0.33z)| lr 5.94e-04 | 322.42 ms | 52.3% bf16 MFU | 1624145 tok/s step 1937/19560 | loss 3.916310 (+0.18z)| norm 0.3102 (-0.30z)| lr 5.94e-04 | 322.94 ms | 52.3% bf16 MFU | 1624112 tok/s step 1938/19560 | loss 3.899676 (-0.17z)| norm 0.3072 (-0.35z)| lr 5.94e-04 | 323.25 ms | 52.2% bf16 MFU | 1624003 tok/s step 1939/19560 | loss 3.867883 (-0.81z)| norm 0.2845 (-0.71z)| lr 5.94e-04 | 322.93 ms | 52.3% bf16 MFU | 1623980 tok/s step 1940/19560 | loss 3.816818 (-1.82z)| norm 0.2908 (-0.60z)| lr 5.94e-04 | 323.07 ms | 52.2% bf16 MFU | 1623922 tok/s step 1941/19560 | loss 3.863210 (-0.87z)| norm 0.3208 (-0.12z)| lr 5.94e-04 | 323.22 ms | 52.2% bf16 MFU | 1623829 tok/s step 1942/19560 | loss 3.909719 (+0.10z)| norm 0.3400 (+0.18z)| lr 5.94e-04 | 322.70 ms | 52.3% bf16 MFU | 1623871 tok/s step 1943/19560 | loss 3.934447 (+0.61z)| norm 0.3093 (-0.31z)| lr 5.94e-04 | 323.36 ms | 52.2% bf16 MFU | 1623746 tok/s step 1944/19560 | loss 3.887289 (-0.36z)| norm 0.2886 (-0.63z)| lr 5.94e-04 | 323.09 ms | 52.2% bf16 MFU | 1623696 tok/s step 1945/19560 | loss 3.892253 (-0.25z)| norm 0.3019 (-0.41z)| lr 5.94e-04 | 322.50 ms | 52.3% bf16 MFU | 1623796 tok/s step 1946/19560 | loss 3.982024 (+1.63z)| norm 0.3231 (-0.06z)| lr 5.94e-04 | 323.45 ms | 52.2% bf16 MFU | 1623653 tok/s step 1947/19560 | loss 3.894596 (-0.20z)| norm 0.3210 (-0.09z)| lr 5.94e-04 | 322.73 ms | 52.3% bf16 MFU | 1623697 tok/s step 1948/19560 | loss 3.866687 (-0.78z)| norm 0.3029 (-0.37z)| lr 
5.94e-04 | 322.91 ms | 52.3% bf16 MFU | 1623693 tok/s step 1949/19560 | loss 3.836724 (-1.38z)| norm 0.2899 (-0.57z)| lr 5.94e-04 | 322.98 ms | 52.3% bf16 MFU | 1623673 tok/s step 1950/19560 | loss 3.896240 (-0.14z)| norm 0.2706 (-0.87z)| lr 5.94e-04 | 323.03 ms | 52.2% bf16 MFU | 1623641 tok/s step 1951/19560 | loss 3.858839 (-0.91z)| norm 0.2670 (-0.92z)| lr 5.94e-04 | 322.44 ms | 52.3% bf16 MFU | 1623758 tok/s step 1952/19560 | loss 3.903010 (+0.01z)| norm 0.3332 (+0.15z)| lr 5.94e-04 | 323.72 ms | 52.1% bf16 MFU | 1623549 tok/s step 1953/19560 | loss 4.008102 (+2.15z)| norm 0.3828 (+0.94z)| lr 5.93e-04 | 322.63 ms | 52.3% bf16 MFU | 1623624 tok/s step 1954/19560 | loss 3.861987 (-0.85z)| norm 0.3595 (+0.55z)| lr 5.93e-04 | 323.77 ms | 52.1% bf16 MFU | 1623408 tok/s step 1955/19560 | loss 3.878477 (-0.50z)| norm 0.3167 (-0.14z)| lr 5.93e-04 | 322.82 ms | 52.3% bf16 MFU | 1623442 tok/s step 1956/19560 | loss 3.833225 (-1.41z)| norm 0.3062 (-0.30z)| lr 5.93e-04 | 322.95 ms | 52.3% bf16 MFU | 1623443 tok/s step 1957/19560 | loss 3.872427 (-0.60z)| norm 0.3203 (-0.08z)| lr 5.93e-04 | 322.80 ms | 52.3% bf16 MFU | 1623479 tok/s step 1958/19560 | loss 3.861245 (-0.82z)| norm 0.2985 (-0.43z)| lr 5.93e-04 | 322.85 ms | 52.3% bf16 MFU | 1623502 tok/s step 1959/19560 | loss 3.839693 (-1.24z)| norm 0.3030 (-0.36z)| lr 5.93e-04 | 323.52 ms | 52.2% bf16 MFU | 1623357 tok/s step 1960/19560 | loss 3.889944 (-0.21z)| norm 0.3019 (-0.37z)| lr 5.93e-04 | 322.89 ms | 52.3% bf16 MFU | 1623376 tok/s step 1961/19560 | loss 3.875113 (-0.51z)| norm 0.3018 (-0.37z)| lr 5.93e-04 | 322.69 ms | 52.3% bf16 MFU | 1623444 tok/s step 1962/19560 | loss 3.869118 (-0.62z)| norm 0.2917 (-0.53z)| lr 5.93e-04 | 322.83 ms | 52.3% bf16 MFU | 1623474 tok/s step 1963/19560 | loss 3.826785 (-1.45z)| norm 0.2993 (-0.40z)| lr 5.93e-04 | 323.01 ms | 52.2% bf16 MFU | 1623455 tok/s step 1964/19560 | loss 3.858535 (-0.81z)| norm 0.3175 (-0.11z)| lr 5.93e-04 | 323.41 ms | 52.2% bf16 MFU | 1623338 tok/s step 
1965/19560 | loss 3.854115 (-0.88z)| norm 0.3116 (-0.21z)| lr 5.93e-04 | 322.97 ms | 52.3% bf16 MFU | 1623338 tok/s step 1966/19560 | loss 3.851043 (-0.93z)| norm 0.3385 (+0.22z)| lr 5.93e-04 | 322.65 ms | 52.3% bf16 MFU | 1623418 tok/s step 1967/19560 | loss 3.847920 (-0.98z)| norm 0.3483 (+0.37z)| lr 5.93e-04 | 322.69 ms | 52.3% bf16 MFU | 1623483 tok/s step 1968/19560 | loss 3.885864 (-0.23z)| norm 0.3274 (+0.04z)| lr 5.93e-04 | 323.22 ms | 52.2% bf16 MFU | 1623414 tok/s step 1969/19560 | loss 3.904807 (+0.15z)| norm 0.3404 (+0.25z)| lr 5.93e-04 | 323.07 ms | 52.2% bf16 MFU | 1623386 tok/s step 1970/19560 | loss 3.905075 (+0.17z)| norm 0.3189 (-0.09z)| lr 5.93e-04 | 322.89 ms | 52.3% bf16 MFU | 1623403 tok/s step 1971/19560 | loss 3.876718 (-0.41z)| norm 0.3411 (+0.27z)| lr 5.93e-04 | 323.45 ms | 52.2% bf16 MFU | 1623278 tok/s step 1972/19560 | loss 3.840563 (-1.12z)| norm 0.3519 (+0.43z)| lr 5.93e-04 | 323.12 ms | 52.2% bf16 MFU | 1623242 tok/s step 1973/19560 | loss 3.836907 (-1.17z)| norm 0.3308 (+0.09z)| lr 5.93e-04 | 322.91 ms | 52.3% bf16 MFU | 1623260 tok/s step 1974/19560 | loss 3.875861 (-0.40z)| norm 0.3076 (-0.29z)| lr 5.93e-04 | 322.66 ms | 52.3% bf16 MFU | 1623342 tok/s step 1975/19560 | loss 3.887004 (-0.18z)| norm 0.3034 (-0.36z)| lr 5.93e-04 | 323.16 ms | 52.2% bf16 MFU | 1623295 tok/s step 1976/19560 | loss 3.876280 (-0.39z)| norm 0.2872 (-0.63z)| lr 5.93e-04 | 322.73 ms | 52.3% bf16 MFU | 1623358 tok/s step 1977/19560 | loss 3.840518 (-1.08z)| norm 0.2749 (-0.82z)| lr 5.93e-04 | 323.60 ms | 52.2% bf16 MFU | 1623198 tok/s step 1978/19560 | loss 3.926577 (+0.62z)| norm 0.2913 (-0.55z)| lr 5.93e-04 | 323.25 ms | 52.2% bf16 MFU | 1623134 tok/s step 1979/19560 | loss 3.883982 (-0.23z)| norm 0.2924 (-0.53z)| lr 5.93e-04 | 323.06 ms | 52.2% bf16 MFU | 1623121 tok/s step 1980/19560 | loss 3.901013 (+0.11z)| norm 0.2993 (-0.41z)| lr 5.93e-04 | 323.23 ms | 52.2% bf16 MFU | 1623066 tok/s step 1981/19560 | loss 3.866685 (-0.56z)| norm 0.2943 (-0.49z)| lr 
5.93e-04 | 322.72 ms | 52.3% bf16 MFU | 1623144 tok/s step 1982/19560 | loss 3.910547 (+0.33z)| norm 0.2879 (-0.59z)| lr 5.93e-04 | 322.73 ms | 52.3% bf16 MFU | 1623213 tok/s step 1983/19560 | loss 3.940312 (+0.92z)| norm 0.2759 (-0.78z)| lr 5.93e-04 | 323.13 ms | 52.2% bf16 MFU | 1623179 tok/s step 1984/19560 | loss 3.823101 (-1.44z)| norm 0.3136 (-0.17z)| lr 5.93e-04 | 323.19 ms | 52.2% bf16 MFU | 1623132 tok/s step 1985/19560 | loss 3.898646 (+0.09z)| norm 0.3475 (+0.37z)| lr 5.93e-04 | 323.39 ms | 52.2% bf16 MFU | 1623036 tok/s step 1986/19560 | loss 3.897266 (+0.07z)| norm 0.3420 (+0.27z)| lr 5.93e-04 | 322.82 ms | 52.3% bf16 MFU | 1623089 tok/s step 1987/19560 | loss 3.900568 (+0.13z)| norm 0.2936 (-0.51z)| lr 5.93e-04 | 322.98 ms | 52.3% bf16 MFU | 1623099 tok/s step 1988/19560 | loss 3.941552 (+0.95z)| norm 0.2669 (-0.93z)| lr 5.93e-04 | 323.12 ms | 52.2% bf16 MFU | 1623074 tok/s step 1989/19560 | loss 3.851890 (-0.87z)| norm 0.3024 (-0.35z)| lr 5.93e-04 | 322.73 ms | 52.3% bf16 MFU | 1623148 tok/s step 1990/19560 | loss 3.836714 (-1.17z)| norm 0.3143 (-0.15z)| lr 5.93e-04 | 322.96 ms | 52.3% bf16 MFU | 1623160 tok/s step 1991/19560 | loss 4.094044 (+3.79z)| norm 0.3531 (+0.48z)| lr 5.93e-04 | 322.99 ms | 52.3% bf16 MFU | 1623163 tok/s step 1992/19560 | loss 3.924194 (+0.53z)| norm 0.4741 (+2.38z)| lr 5.93e-04 | 322.87 ms | 52.3% bf16 MFU | 1623196 tok/s step 1993/19560 | loss 3.864517 (-0.60z)| norm 0.4408 (+1.81z)| lr 5.93e-04 | 323.20 ms | 52.2% bf16 MFU | 1623144 tok/s step 1994/19560 | loss 3.896804 (+0.01z)| norm 0.4129 (+1.36z)| lr 5.93e-04 | 323.26 ms | 52.2% bf16 MFU | 1623082 tok/s step 1995/19560 | loss 3.870105 (-0.49z)| norm 0.3603 (+0.54z)| lr 5.93e-04 | 322.76 ms | 52.3% bf16 MFU | 1623146 tok/s step 1996/19560 | loss 3.870321 (-0.48z)| norm 0.3313 (+0.09z)| lr 5.93e-04 | 323.05 ms | 52.2% bf16 MFU | 1623137 tok/s step 1997/19560 | loss 3.960653 (+1.24z)| norm 0.3060 (-0.30z)| lr 5.93e-04 | 323.67 ms | 52.1% bf16 MFU | 1622970 tok/s step 
1998/19560 | loss 3.821547 (-1.43z)| norm 0.2950 (-0.47z)| lr 5.93e-04 | 323.08 ms | 52.2% bf16 MFU | 1622961 tok/s step 1999/19560 | loss 3.897863 (+0.03z)| norm 0.2681 (-0.88z)| lr 5.93e-04 | 322.80 ms | 52.3% bf16 MFU | 1623021 tok/s step 2000/19560 | loss 3.996846 (+1.89z)| norm 0.2901 (-0.54z)| lr 5.93e-04 | 323.66 ms | 52.1% bf16 MFU | 1622865 tok/s val loss 3.872994 HellaSwag: 2615/10042 = 0.260406 step 2001/19560 | loss 3.838551 (-1.09z)| norm 0.3036 (-0.32z)| lr 5.93e-04 | 322.66 ms | 52.3% bf16 MFU | 1622966 tok/s step 2002/19560 | loss 3.884687 (-0.21z)| norm 0.2866 (-0.58z)| lr 5.93e-04 | 323.42 ms | 52.2% bf16 MFU | 1622873 tok/s step 2003/19560 | loss 3.879128 (-0.30z)| norm 0.2736 (-0.77z)| lr 5.93e-04 | 322.83 ms | 52.3% bf16 MFU | 1622930 tok/s step 2004/19560 | loss 3.855515 (-0.75z)| norm 0.2690 (-0.83z)| lr 5.93e-04 | 323.23 ms | 52.2% bf16 MFU | 1622885 tok/s step 2005/19560 | loss 3.896732 (+0.05z)| norm 0.2715 (-0.79z)| lr 5.93e-04 | 322.48 ms | 52.3% bf16 MFU | 1623031 tok/s step 2006/19560 | loss 3.877994 (-0.30z)| norm 0.2844 (-0.59z)| lr 5.93e-04 | 323.04 ms | 52.2% bf16 MFU | 1623028 tok/s step 2007/19560 | loss 3.872452 (-0.40z)| norm 0.2988 (-0.38z)| lr 5.93e-04 | 323.00 ms | 52.3% bf16 MFU | 1623035 tok/s step 2008/19560 | loss 3.822048 (-1.35z)| norm 0.2720 (-0.79z)| lr 5.93e-04 | 322.91 ms | 52.3% bf16 MFU | 1623064 tok/s step 2009/19560 | loss 3.858433 (-0.65z)| norm 0.2906 (-0.51z)| lr 5.93e-04 | 323.09 ms | 52.2% bf16 MFU | 1623048 tok/s step 2010/19560 | loss 3.829288 (-1.21z)| norm 0.2953 (-0.43z)| lr 5.93e-04 | 323.12 ms | 52.2% bf16 MFU | 1623024 tok/s step 2011/19560 | loss 3.839217 (-1.02z)| norm 0.2938 (-0.46z)| lr 5.93e-04 | 323.02 ms | 52.2% bf16 MFU | 1623027 tok/s step 2012/19560 | loss 3.918465
(+0.54z)| norm 0.3124 (-0.17z)| lr 5.93e-04 | 322.41 ms | 52.3% bf16 MFU | 1623184 tok/s step 2013/19560 | loss 3.840703 (-0.99z)| norm 0.2846 (-0.60z)| lr 5.93e-04 | 322.73 ms | 52.3% bf16 MFU | 1623251 tok/s step 2014/19560 | loss 3.854673 (-0.72z)| norm 0.2798 (-0.67z)| lr 5.93e-04 | 323.04 ms | 52.2% bf16 MFU | 1623237 tok/s step 2015/19560 | loss 3.903294 (+0.24z)| norm 0.3095 (-0.21z)| lr 5.93e-04 | 323.07 ms | 52.2% bf16 MFU | 1623216 tok/s step 2016/19560 | loss 3.861357 (-0.58z)| norm 0.2857 (-0.57z)| lr 5.93e-04 | 323.11 ms | 52.2% bf16 MFU | 1623186 tok/s step 2017/19560 | loss 3.866676 (-0.46z)| norm 0.2877 (-0.54z)| lr 5.93e-04 | 322.60 ms | 52.3% bf16 MFU | 1623288 tok/s step 2018/19560 | loss 3.859258 (-0.62z)| norm 0.3158 (-0.11z)| lr 5.93e-04 | 323.18 ms | 52.2% bf16 MFU | 1623236 tok/s step 2019/19560 | loss 3.926078 (+0.70z)| norm 0.3870 (+0.98z)| lr 5.93e-04 | 322.66 ms | 52.3% bf16 MFU | 1623319 tok/s step 2020/19560 | loss 3.903627 (+0.25z)| norm 0.3906 (+1.02z)| lr 5.93e-04 | 322.72 ms | 52.3% bf16 MFU | 1623382 tok/s step 2021/19560 | loss 3.849583 (-0.82z)| norm 0.3324 (+0.13z)| lr 5.93e-04 | 323.02 ms | 52.2% bf16 MFU | 1623366 tok/s step 2022/19560 | loss 3.799003 (-1.79z)| norm 0.3376 (+0.20z)| lr 5.93e-04 | 323.24 ms | 52.2% bf16 MFU | 1623297 tok/s step 2023/19560 | loss 3.969550 (+1.53z)| norm 0.2994 (-0.38z)| lr 5.93e-04 | 322.33 ms | 52.4% bf16 MFU | 1623460 tok/s step 2024/19560 | loss 3.819745 (-1.36z)| norm 0.2658 (-0.90z)| lr 5.93e-04 | 322.85 ms | 52.3% bf16 MFU | 1623484 tok/s step 2025/19560 | loss 3.894693 (+0.08z)| norm 0.3056 (-0.29z)| lr 5.93e-04 | 323.15 ms | 52.2% bf16 MFU | 1623432 tok/s step 2026/19560 | loss 3.859398 (-0.60z)| norm 0.3050 (-0.30z)| lr 5.93e-04 | 322.92 ms | 52.3% bf16 MFU | 1623439 tok/s step 2027/19560 | loss 3.910039 (+0.40z)| norm 0.3072 (-0.26z)| lr 5.93e-04 | 322.60 ms | 52.3% bf16 MFU | 1623526 tok/s step 2028/19560 | loss 3.872269 (-0.34z)| norm 0.3475 (+0.36z)| lr 5.93e-04 | 322.91 ms | 52.3% 
bf16 MFU | 1623530 tok/s step 2029/19560 | loss 3.831859 (-1.14z)| norm 0.4520 (+1.93z)| lr 5.93e-04 | 323.02 ms | 52.2% bf16 MFU | 1623509 tok/s step 2030/19560 | loss 3.885267 (-0.07z)| norm 0.4174 (+1.75z)| lr 5.93e-04 | 323.20 ms | 52.2% bf16 MFU | 1623441 tok/s step 2031/19560 | loss 3.875842 (-0.24z)| norm 0.3900 (+1.45z)| lr 5.93e-04 | 322.91 ms | 52.3% bf16 MFU | 1623450 tok/s step 2032/19560 | loss 3.902372 (+0.33z)| norm 0.3707 (+1.30z)| lr 5.93e-04 | 322.48 ms | 52.3% bf16 MFU | 1623567 tok/s step 2033/19560 | loss 3.887965 (+0.05z)| norm 0.2847 (-0.82z)| lr 5.93e-04 | 323.05 ms | 52.2% bf16 MFU | 1623534 tok/s step 2034/19560 | loss 3.883620 (-0.05z)| norm 0.3262 (+0.24z)| lr 5.93e-04 | 322.53 ms | 52.3% bf16 MFU | 1623635 tok/s step 2035/19560 | loss 3.973438 (+1.90z)| norm 0.2853 (-0.79z)| lr 5.93e-04 | 323.74 ms | 52.1% bf16 MFU | 1623426 tok/s step 2036/19560 | loss 3.838825 (-1.03z)| norm 0.2970 (-0.49z)| lr 5.93e-04 | 322.61 ms | 52.3% bf16 MFU | 1623511 tok/s step 2037/19560 | loss 3.824172 (-1.33z)| norm 0.2999 (-0.40z)| lr 5.93e-04 | 322.93 ms | 52.3% bf16 MFU | 1623512 tok/s step 2038/19560 | loss 3.877923 (-0.16z)| norm 0.2930 (-0.57z)| lr 5.93e-04 | 322.24 ms | 52.4% bf16 MFU | 1623687 tok/s step 2039/19560 | loss 3.868313 (-0.36z)| norm 0.3200 (+0.16z)| lr 5.93e-04 | 323.04 ms | 52.2% bf16 MFU | 1623651 tok/s step 2040/19560 | loss 3.916786 (+0.70z)| norm 0.3583 (+1.18z)| lr 5.93e-04 | 323.36 ms | 52.2% bf16 MFU | 1623536 tok/s step 2041/19560 | loss 3.895571 (+0.24z)| norm 0.3228 (+0.24z)| lr 5.93e-04 | 322.47 ms | 52.3% bf16 MFU | 1623652 tok/s step 2042/19560 | loss 3.945998 (+1.33z)| norm 0.3114 (-0.06z)| lr 5.93e-04 | 322.88 ms | 52.3% bf16 MFU | 1623659 tok/s step 2043/19560 | loss 3.891880 (+0.16z)| norm 0.2996 (-0.38z)| lr 5.93e-04 | 322.97 ms | 52.3% bf16 MFU | 1623643 tok/s step 2044/19560 | loss 3.952746 (+1.47z)| norm 0.2717 (-1.12z)| lr 5.93e-04 | 323.02 ms | 52.2% bf16 MFU | 1623614 tok/s step 2045/19560 | loss 3.866626 
(-0.41z)| norm 0.2773 (-0.96z)| lr 5.93e-04 | 322.27 ms | 52.4% bf16 MFU | 1623776 tok/s step 2046/19560 | loss 3.868048 (-0.37z)| norm 0.2789 (-0.91z)| lr 5.93e-04 | 322.54 ms | 52.3% bf16 MFU | 1623862 tok/s step 2047/19560 | loss 3.841606 (-0.94z)| norm 0.2721 (-1.07z)| lr 5.92e-04 | 323.55 ms | 52.2% bf16 MFU | 1623690 tok/s step 2048/19560 | loss 3.878742 (-0.13z)| norm 0.2756 (-0.97z)| lr 5.92e-04 | 323.39 ms | 52.2% bf16 MFU | 1623568 tok/s step 2049/19560 | loss 3.815015 (-1.49z)| norm 0.2705 (-1.09z)| lr 5.92e-04 | 322.53 ms | 52.3% bf16 MFU | 1623668 tok/s step 2050/19560 | loss 4.051280 (+3.42z)| norm 0.2838 (-0.74z)| lr 5.92e-04 | 322.23 ms | 52.4% bf16 MFU | 1623837 tok/s step 2051/19560 | loss 3.842019 (-0.89z)| norm 0.2749 (-0.97z)| lr 5.92e-04 | 323.56 ms | 52.2% bf16 MFU | 1623665 tok/s step 2052/19560 | loss 3.861690 (-0.47z)| norm 0.3038 (-0.20z)| lr 5.92e-04 | 322.52 ms | 52.3% bf16 MFU | 1623760 tok/s step 2053/19560 | loss 3.868600 (-0.32z)| norm 0.2982 (-0.34z)| lr 5.92e-04 | 322.65 ms | 52.3% bf16 MFU | 1623819 tok/s step 2054/19560 | loss 3.872980 (-0.21z)| norm 0.3303 (+0.50z)| lr 5.92e-04 | 322.41 ms | 52.3% bf16 MFU | 1623936 tok/s step 2055/19560 | loss 3.864410 (-0.39z)| norm 0.3373 (+0.67z)| lr 5.92e-04 | 322.53 ms | 52.3% bf16 MFU | 1624016 tok/s step 2056/19560 | loss 3.891681 (+0.21z)| norm 0.3099 (-0.04z)| lr 5.92e-04 | 322.75 ms | 52.3% bf16 MFU | 1624038 tok/s step 2057/19560 | loss 3.951708 (+1.53z)| norm 0.2908 (-0.54z)| lr 5.92e-04 | 322.79 ms | 52.3% bf16 MFU | 1624047 tok/s step 2058/19560 | loss 3.916007 (+0.73z)| norm 0.2722 (-1.03z)| lr 5.92e-04 | 322.87 ms | 52.3% bf16 MFU | 1624038 tok/s step 2059/19560 | loss 3.894839 (+0.27z)| norm 0.3103 (-0.03z)| lr 5.92e-04 | 322.51 ms | 52.3% bf16 MFU | 1624118 tok/s step 2060/19560 | loss 3.864342 (-0.40z)| norm 0.3012 (-0.28z)| lr 5.92e-04 | 323.21 ms | 52.2% bf16 MFU | 1624019 tok/s step 2061/19560 | loss 3.848931 (-0.74z)| norm 0.3078 (-0.11z)| lr 5.92e-04 | 322.60 ms | 52.3% 
bf16 MFU | 1624077 tok/s step 2062/19560 | loss 3.796524 (-1.88z)| norm 0.3276 (+0.40z)| lr 5.92e-04 | 323.05 ms | 52.2% bf16 MFU | 1624019 tok/s step 2063/19560 | loss 3.831339 (-1.10z)| norm 0.3342 (+0.57z)| lr 5.92e-04 | 322.72 ms | 52.3% bf16 MFU | 1624048 tok/s step 2064/19560 | loss 3.836989 (-0.97z)| norm 0.2842 (-0.76z)| lr 5.92e-04 | 322.48 ms | 52.3% bf16 MFU | 1624135 tok/s step 2065/19560 | loss 3.867804 (-0.29z)| norm 0.2892 (-0.62z)| lr 5.92e-04 | 322.69 ms | 52.3% bf16 MFU | 1624166 tok/s step 2066/19560 | loss 3.889737 (+0.19z)| norm 0.3007 (-0.31z)| lr 5.92e-04 | 322.66 ms | 52.3% bf16 MFU | 1624203 tok/s step 2067/19560 | loss 3.899692 (+0.40z)| norm 0.3419 (+0.77z)| lr 5.92e-04 | 322.96 ms | 52.3% bf16 MFU | 1624161 tok/s step 2068/19560 | loss 3.847639 (-0.74z)| norm 0.3475 (+0.90z)| lr 5.92e-04 | 322.60 ms | 52.3% bf16 MFU | 1624212 tok/s step 2069/19560 | loss 3.928022 (+1.00z)| norm 0.3372 (+0.62z)| lr 5.92e-04 | 322.63 ms | 52.3% bf16 MFU | 1624253 tok/s step 2070/19560 | loss 3.859410 (-0.49z)| norm 0.3707 (+1.49z)| lr 5.92e-04 | 322.84 ms | 52.3% bf16 MFU | 1624240 tok/s step 2071/19560 | loss 3.914354 (+0.72z)| norm 0.3597 (+1.19z)| lr 5.92e-04 | 322.68 ms | 52.3% bf16 MFU | 1624266 tok/s step 2072/19560 | loss 3.849498 (-0.69z)| norm 0.3617 (+1.22z)| lr 5.92e-04 | 322.26 ms | 52.4% bf16 MFU | 1624397 tok/s step 2073/19560 | loss 3.884192 (+0.07z)| norm 0.3370 (+0.57z)| lr 5.92e-04 | 322.79 ms | 52.3% bf16 MFU | 1624390 tok/s step 2074/19560 | loss 3.825682 (-1.20z)| norm 0.2898 (-0.65z)| lr 5.92e-04 | 323.13 ms | 52.2% bf16 MFU | 1624296 tok/s step 2075/19560 | loss 3.839286 (-0.89z)| norm 0.2864 (-0.73z)| lr 5.92e-04 | 323.04 ms | 52.2% bf16 MFU | 1624229 tok/s step 2076/19560 | loss 3.837759 (-0.91z)| norm 0.2955 (-0.49z)| lr 5.92e-04 | 322.90 ms | 52.3% bf16 MFU | 1624202 tok/s step 2077/19560 | loss 3.856917 (-0.50z)| norm 0.2942 (-0.52z)| lr 5.92e-04 | 322.71 ms | 52.3% bf16 MFU | 1624224 tok/s step 2078/19560 | loss 3.841416 
(-0.83z)| norm 0.2997 (-0.39z)| lr 5.92e-04 | 322.37 ms | 52.4% bf16 MFU | 1624331 tok/s step 2079/19560 | loss 3.831806 (-1.03z)| norm 0.2599 (-1.42z)| lr 5.92e-04 | 323.08 ms | 52.2% bf16 MFU | 1624254 tok/s step 2080/19560 | loss 3.873190 (-0.12z)| norm 0.2700 (-1.14z)| lr 5.92e-04 | 323.04 ms | 52.2% bf16 MFU | 1624192 tok/s step 2081/19560 | loss 3.897665 (+0.45z)| norm 0.2685 (-1.16z)| lr 5.92e-04 | 322.38 ms | 52.4% bf16 MFU | 1624296 tok/s step 2082/19560 | loss 3.818328 (-1.33z)| norm 0.2893 (-0.61z)| lr 5.92e-04 | 323.32 ms | 52.2% bf16 MFU | 1624161 tok/s step 2083/19560 | loss 3.850338 (-0.60z)| norm 0.2869 (-0.67z)| lr 5.92e-04 | 322.52 ms | 52.3% bf16 MFU | 1624233 tok/s step 2084/19560 | loss 3.822008 (-1.24z)| norm 0.2992 (-0.34z)| lr 5.92e-04 | 323.34 ms | 52.2% bf16 MFU | 1624096 tok/s step 2085/19560 | loss 3.867054 (-0.22z)| norm 0.3383 (+0.67z)| lr 5.92e-04 | 322.38 ms | 52.4% bf16 MFU | 1624206 tok/s step 2086/19560 | loss 3.850869 (-0.59z)| norm 0.3143 (+0.04z)| lr 5.92e-04 | 323.08 ms | 52.2% bf16 MFU | 1624135 tok/s step 2087/19560 | loss 3.859416 (-0.40z)| norm 0.2774 (-0.91z)| lr 5.92e-04 | 322.37 ms | 52.4% bf16 MFU | 1624246 tok/s step 2088/19560 | loss 3.809022 (-1.50z)| norm 0.3016 (-0.28z)| lr 5.92e-04 | 322.49 ms | 52.3% bf16 MFU | 1624322 tok/s step 2089/19560 | loss 3.820754 (-1.23z)| norm 0.2917 (-0.54z)| lr 5.92e-04 | 322.77 ms | 52.3% bf16 MFU | 1624322 tok/s step 2090/19560 | loss 3.833531 (-0.93z)| norm 0.2882 (-0.63z)| lr 5.92e-04 | 323.08 ms | 52.2% bf16 MFU | 1624245 tok/s step 2091/19560 | loss 3.842823 (-0.73z)| norm 0.3069 (-0.14z)| lr 5.92e-04 | 322.58 ms | 52.3% bf16 MFU | 1624297 tok/s step 2092/19560 | loss 3.901645 (+0.56z)| norm 0.3368 (+0.63z)| lr 5.92e-04 | 322.40 ms | 52.3% bf16 MFU | 1624393 tok/s step 2093/19560 | loss 3.868407 (-0.18z)| norm 0.3960 (+2.11z)| lr 5.92e-04 | 322.85 ms | 52.3% bf16 MFU | 1624369 tok/s step 2094/19560 | loss 3.840100 (-0.80z)| norm 0.3756 (+1.57z)| lr 5.92e-04 | 322.40 ms | 52.3% 
bf16 MFU | 1624461 tok/s step 2095/19560 | loss 3.840324 (-0.79z)| norm 0.3463 (+0.83z)| lr 5.92e-04 | 323.21 ms | 52.2% bf16 MFU | 1624344 tok/s step 2096/19560 | loss 3.824178 (-1.13z)| norm 0.3116 (-0.05z)| lr 5.92e-04 | 322.94 ms | 52.3% bf16 MFU | 1624301 tok/s step 2097/19560 | loss 3.875939 (+0.01z)| norm 0.3089 (-0.11z)| lr 5.92e-04 | 322.63 ms | 52.3% bf16 MFU | 1624339 tok/s step 2098/19560 | loss 3.860743 (-0.32z)| norm 0.2829 (-0.76z)| lr 5.92e-04 | 322.65 ms | 52.3% bf16 MFU | 1624370 tok/s step 2099/19560 | loss 3.896521 (+0.47z)| norm 0.2845 (-0.70z)| lr 5.92e-04 | 322.69 ms | 52.3% bf16 MFU | 1624389 tok/s step 2100/19560 | loss 3.885157 (+0.21z)| norm 0.3058 (-0.16z)| lr 5.92e-04 | 322.67 ms | 52.3% bf16 MFU | 1624411 tok/s step 2101/19560 | loss 3.868900 (-0.15z)| norm 0.3125 (+0.01z)| lr 5.92e-04 | 322.57 ms | 52.3% bf16 MFU | 1624458 tok/s step 2102/19560 | loss 3.790573 (-1.85z)| norm 0.2842 (-0.70z)| lr 5.92e-04 | 322.62 ms | 52.3% bf16 MFU | 1624491 tok/s step 2103/19560 | loss 3.798814 (-1.64z)| norm 0.2830 (-0.72z)| lr 5.92e-04 | 322.96 ms | 52.3% bf16 MFU | 1624435 tok/s step 2104/19560 | loss 3.860708 (-0.30z)| norm 0.2667 (-1.13z)| lr 5.92e-04 | 322.67 ms | 52.3% bf16 MFU | 1624455 tok/s step 2105/19560 | loss 3.814388 (-1.29z)| norm 0.2666 (-1.13z)| lr 5.92e-04 | 322.61 ms | 52.3% bf16 MFU | 1624489 tok/s step 2106/19560 | loss 3.853290 (-0.44z)| norm 0.2745 (-0.92z)| lr 5.92e-04 | 322.55 ms | 52.3% bf16 MFU | 1624537 tok/s step 2107/19560 | loss 3.865197 (-0.18z)| norm 0.3138 (+0.06z)| lr 5.92e-04 | 323.27 ms | 52.2% bf16 MFU | 1624402 tok/s step 2108/19560 | loss 3.862202 (-0.24z)| norm 0.3069 (-0.11z)| lr 5.92e-04 | 322.41 ms | 52.3% bf16 MFU | 1624489 tok/s step 2109/19560 | loss 3.837365 (-0.77z)| norm 0.3066 (-0.12z)| lr 5.92e-04 | 322.88 ms | 52.3% bf16 MFU | 1624455 tok/s step 2110/19560 | loss 3.806658 (-1.41z)| norm 0.2807 (-0.77z)| lr 5.92e-04 | 323.10 ms | 52.2% bf16 MFU | 1624367 tok/s step 2111/19560 | loss 4.015674 
(+2.99z)| norm 0.3194 (+0.19z)| lr 5.92e-04 | 322.77 ms | 52.3% bf16 MFU | 1624366 tok/s step 2112/19560 | loss 3.901825 (+0.60z)| norm 0.3531 (+1.03z)| lr 5.92e-04 | 323.15 ms | 52.2% bf16 MFU | 1624269 tok/s step 2113/19560 | loss 3.866952 (-0.13z)| norm 0.3801 (+1.68z)| lr 5.92e-04 | 322.77 ms | 52.3% bf16 MFU | 1624272 tok/s step 2114/19560 | loss 3.843457 (-0.61z)| norm 0.3482 (+0.89z)| lr 5.92e-04 | 323.33 ms | 52.2% bf16 MFU | 1624136 tok/s step 2115/19560 | loss 3.815472 (-1.18z)| norm 0.2966 (-0.39z)| lr 5.92e-04 | 323.40 ms | 52.2% bf16 MFU | 1623989 tok/s step 2116/19560 | loss 3.853722 (-0.37z)| norm 0.2727 (-0.99z)| lr 5.92e-04 | 322.31 ms | 52.4% bf16 MFU | 1624123 tok/s step 2117/19560 | loss 3.877978 (+0.14z)| norm 0.2683 (-1.09z)| lr 5.92e-04 | 322.59 ms | 52.3% bf16 MFU | 1624178 tok/s step 2118/19560 | loss 3.810423 (-1.28z)| norm 0.2859 (-0.65z)| lr 5.92e-04 | 323.29 ms | 52.2% bf16 MFU | 1624055 tok/s step 2119/19560 | loss 3.877544 (+0.18z)| norm 0.3037 (-0.20z)| lr 5.92e-04 | 322.60 ms | 52.3% bf16 MFU | 1624111 tok/s step 2120/19560 | loss 3.799153 (-1.59z)| norm 0.2989 (-0.30z)| lr 5.92e-04 | 322.95 ms | 52.3% bf16 MFU | 1624076 tok/s step 2121/19560 | loss 3.793444 (-1.69z)| norm 0.3016 (-0.21z)| lr 5.92e-04 | 323.00 ms | 52.3% bf16 MFU | 1624032 tok/s step 2122/19560 | loss 3.887735 (+0.45z)| norm 0.3016 (-0.19z)| lr 5.92e-04 | 322.93 ms | 52.3% bf16 MFU | 1624008 tok/s step 2123/19560 | loss 3.844864 (-0.52z)| norm 0.3206 (+0.37z)| lr 5.92e-04 | 323.16 ms | 52.2% bf16 MFU | 1623927 tok/s step 2124/19560 | loss 3.889866 (+0.50z)| norm 0.3261 (+0.53z)| lr 5.92e-04 | 322.91 ms | 52.3% bf16 MFU | 1623912 tok/s step 2125/19560 | loss 3.820311 (-1.07z)| norm 0.3112 (+0.09z)| lr 5.92e-04 | 323.16 ms | 52.2% bf16 MFU | 1623834 tok/s step 2126/19560 | loss 3.860945 (-0.14z)| norm 0.2929 (-0.44z)| lr 5.92e-04 | 323.10 ms | 52.2% bf16 MFU | 1623776 tok/s step 2127/19560 | loss 3.856990 (-0.23z)| norm 0.2847 (-0.68z)| lr 5.92e-04 | 322.68 ms | 52.3% 
bf16 MFU | 1623826 tok/s step 2128/19560 | loss 3.856740 (-0.22z)| norm 0.3230 (+0.43z)| lr 5.92e-04 | 323.46 ms | 52.2% bf16 MFU | 1623678 tok/s step 2129/19560 | loss 3.840557 (-0.60z)| norm 0.3158 (+0.21z)| lr 5.92e-04 | 322.84 ms | 52.3% bf16 MFU | 1623693 tok/s step 2130/19560 | loss 3.929665 (+1.51z)| norm 0.2956 (-0.38z)| lr 5.92e-04 | 323.34 ms | 52.2% bf16 MFU | 1623582 tok/s step 2131/19560 | loss 3.841668 (-0.57z)| norm 0.2958 (-0.38z)| lr 5.92e-04 | 322.75 ms | 52.3% bf16 MFU | 1623626 tok/s step 2132/19560 | loss 3.845230 (-0.49z)| norm 0.2925 (-0.48z)| lr 5.92e-04 | 323.11 ms | 52.2% bf16 MFU | 1623575 tok/s step 2133/19560 | loss 3.830211 (-0.83z)| norm 0.2586 (-1.47z)| lr 5.92e-04 | 322.61 ms | 52.3% bf16 MFU | 1623652 tok/s step 2134/19560 | loss 3.821730 (-1.02z)| norm 0.2641 (-1.30z)| lr 5.91e-04 | 323.30 ms | 52.2% bf16 MFU | 1623554 tok/s step 2135/19560 | loss 3.865021 (+0.01z)| norm 0.2747 (-0.98z)| lr 5.91e-04 | 322.99 ms | 52.3% bf16 MFU | 1623538 tok/s step 2136/19560 | loss 3.844307 (-0.49z)| norm 0.2777 (-0.90z)| lr 5.91e-04 | 323.02 ms | 52.2% bf16 MFU | 1623516 tok/s step 2137/19560 | loss 3.906429 (+0.97z)| norm 0.2828 (-0.75z)| lr 5.91e-04 | 323.29 ms | 52.2% bf16 MFU | 1623425 tok/s step 2138/19560 | loss 3.955154 (+2.07z)| norm 0.2948 (-0.40z)| lr 5.91e-04 | 322.70 ms | 52.3% bf16 MFU | 1623488 tok/s step 2139/19560 | loss 3.857857 (-0.20z)| norm 0.3637 (+1.58z)| lr 5.91e-04 | 322.72 ms | 52.3% bf16 MFU | 1623543 tok/s step 2140/19560 | loss 3.859879 (-0.14z)| norm 0.4356 (+3.46z)| lr 5.91e-04 | 322.74 ms | 52.3% bf16 MFU | 1623591 tok/s step 2141/19560 | loss 3.858809 (-0.17z)| norm 0.4112 (+2.68z)| lr 5.91e-04 | 322.78 ms | 52.3% bf16 MFU | 1623626 tok/s step 2142/19560 | loss 3.899968 (+0.78z)| norm 0.3572 (+1.22z)| lr 5.91e-04 | 323.02 ms | 52.2% bf16 MFU | 1623599 tok/s step 2143/19560 | loss 3.835336 (-0.72z)| norm 0.3359 (+0.64z)| lr 5.91e-04 | 323.08 ms | 52.2% bf16 MFU | 1623558 tok/s step 2144/19560 | loss 3.842565 
(-0.54z)| norm 0.3126 (+0.02z)| lr 5.91e-04 | 322.51 ms | 52.3% bf16 MFU | 1623662 tok/s step 2145/19560 | loss 3.865440 (-0.01z)| norm 0.3476 (+0.94z)| lr 5.91e-04 | 322.84 ms | 52.3% bf16 MFU | 1623678 tok/s step 2146/19560 | loss 3.805069 (-1.40z)| norm 0.3116 (-0.02z)| lr 5.91e-04 | 323.06 ms | 52.2% bf16 MFU | 1623637 tok/s step 2147/19560 | loss 3.893787 (+0.67z)| norm 0.2892 (-0.61z)| lr 5.91e-04 | 322.78 ms | 52.3% bf16 MFU | 1623669 tok/s step 2148/19560 | loss 3.867718 (+0.07z)| norm 0.2897 (-0.58z)| lr 5.91e-04 | 323.30 ms | 52.2% bf16 MFU | 1623570 tok/s step 2149/19560 | loss 3.859550 (-0.13z)| norm 0.2994 (-0.31z)| lr 5.91e-04 | 322.80 ms | 52.3% bf16 MFU | 1623602 tok/s step 2150/19560 | loss 3.829487 (-0.84z)| norm 0.3010 (-0.26z)| lr 5.91e-04 | 322.84 ms | 52.3% bf16 MFU | 1623620 tok/s step 2151/19560 | loss 3.839760 (-0.59z)| norm 0.2844 (-0.71z)| lr 5.91e-04 | 323.27 ms | 52.2% bf16 MFU | 1623531 tok/s step 2152/19560 | loss 3.799077 (-1.56z)| norm 0.2761 (-0.94z)| lr 5.91e-04 | 322.81 ms | 52.3% bf16 MFU | 1623561 tok/s step 2153/19560 | loss 3.897567 (+0.81z)| norm 0.2845 (-0.70z)| lr 5.91e-04 | 323.40 ms | 52.2% bf16 MFU | 1623441 tok/s step 2154/19560 | loss 3.795753 (-1.61z)| norm 0.2799 (-0.82z)| lr 5.91e-04 | 322.93 ms | 52.3% bf16 MFU | 1623445 tok/s step 2155/19560 | loss 3.899330 (+0.86z)| norm 0.2608 (-1.33z)| lr 5.91e-04 | 322.86 ms | 52.3% bf16 MFU | 1623468 tok/s step 2156/19560 | loss 3.841872 (-0.51z)| norm 0.3035 (-0.16z)| lr 5.91e-04 | 323.16 ms | 52.2% bf16 MFU | 1623413 tok/s step 2157/19560 | loss 3.851319 (-0.29z)| norm 0.2970 (-0.32z)| lr 5.91e-04 | 323.03 ms | 52.2% bf16 MFU | 1623394 tok/s step 2158/19560 | loss 3.863230 (+0.00z)| norm 0.2926 (-0.44z)| lr 5.91e-04 | 323.84 ms | 52.1% bf16 MFU | 1623174 tok/s step 2159/19560 | loss 3.828242 (-0.82z)| norm 0.2865 (-0.61z)| lr 5.91e-04 | 322.85 ms | 52.3% bf16 MFU | 1623212 tok/s step 2160/19560 | loss 3.860061 (-0.06z)| norm 0.2600 (-1.42z)| lr 5.91e-04 | 322.94 ms | 52.3% 
bf16 MFU | 1623226 tok/s step 2161/19560 | loss 3.889133 (+0.64z)| norm 0.2669 (-1.20z)| lr 5.91e-04 | 322.76 ms | 52.3% bf16 MFU | 1623284 tok/s step 2162/19560 | loss 3.862992 (+0.02z)| norm 0.2622 (-1.32z)| lr 5.91e-04 | 323.14 ms | 52.2% bf16 MFU | 1623243 tok/s step 2163/19560 | loss 3.896971 (+0.87z)| norm 0.2703 (-1.06z)| lr 5.91e-04 | 322.62 ms | 52.3% bf16 MFU | 1623336 tok/s step 2164/19560 | loss 3.865563 (+0.09z)| norm 0.2937 (-0.34z)| lr 5.91e-04 | 323.12 ms | 52.2% bf16 MFU | 1623297 tok/s step 2165/19560 | loss 3.873277 (+0.27z)| norm 0.3053 (+0.02z)| lr 5.91e-04 | 323.03 ms | 52.2% bf16 MFU | 1623283 tok/s step 2166/19560 | loss 3.893630 (+0.77z)| norm 0.3662 (+1.87z)| lr 5.91e-04 | 323.63 ms | 52.1% bf16 MFU | 1623121 tok/s step 2167/19560 | loss 3.893423 (+0.76z)| norm 0.4119 (+3.12z)| lr 5.91e-04 | 322.99 ms | 52.3% bf16 MFU | 1623127 tok/s step 2168/19560 | loss 3.871274 (+0.22z)| norm 0.3406 (+1.03z)| lr 5.91e-04 | 323.53 ms | 52.2% bf16 MFU | 1622996 tok/s step 2169/19560 | loss 3.853224 (-0.22z)| norm 0.3140 (+0.25z)| lr 5.91e-04 | 322.83 ms | 52.3% bf16 MFU | 1623048 tok/s step 2170/19560 | loss 3.835666 (-0.64z)| norm 0.3093 (+0.11z)| lr 5.91e-04 | 322.75 ms | 52.3% bf16 MFU | 1623116 tok/s step 2171/19560 | loss 3.850403 (-0.26z)| norm 0.3010 (-0.14z)| lr 5.91e-04 | 323.03 ms | 52.2% bf16 MFU | 1623112 tok/s step 2172/19560 | loss 3.788124 (-1.82z)| norm 0.2666 (-1.15z)| lr 5.91e-04 | 323.63 ms | 52.2% bf16 MFU | 1622959 tok/s step 2173/19560 | loss 3.828601 (-0.78z)| norm 0.2808 (-0.74z)| lr 5.91e-04 | 323.17 ms | 52.2% bf16 MFU | 1622928 tok/s step 2174/19560 | loss 3.790626 (-1.71z)| norm 0.3061 (+0.01z)| lr 5.91e-04 | 322.90 ms | 52.3% bf16 MFU | 1622966 tok/s step 2175/19560 | loss 3.886532 (+0.69z)| norm 0.3035 (-0.08z)| lr 5.91e-04 | 323.16 ms | 52.2% bf16 MFU | 1622936 tok/s step 2176/19560 | loss 3.772528 (-2.11z)| norm 0.2872 (-0.57z)| lr 5.91e-04 | 323.30 ms | 52.2% bf16 MFU | 1622874 tok/s step 2177/19560 | loss 3.846866 
(-0.28z)| norm 0.3145 (+0.24z)| lr 5.91e-04 | 322.89 ms | 52.3% bf16 MFU | 1622916 tok/s step 2178/19560 | loss 3.813419 (-1.17z)| norm 0.3080 (+0.04z)| lr 5.91e-04 | 323.67 ms | 52.1% bf16 MFU | 1622761 tok/s step 2179/19560 | loss 3.853672 (-0.08z)| norm 0.2896 (-0.52z)| lr 5.91e-04 | 323.53 ms | 52.2% bf16 MFU | 1622648 tok/s step 2180/19560 | loss 3.813949 (-1.14z)| norm 0.2865 (-0.61z)| lr 5.91e-04 | 323.18 ms | 52.2% bf16 MFU | 1622630 tok/s step 2181/19560 | loss 3.866473 (+0.28z)| norm 0.2920 (-0.44z)| lr 5.91e-04 | 323.11 ms | 52.2% bf16 MFU | 1622629 tok/s step 2182/19560 | loss 3.895161 (+1.05z)| norm 0.2800 (-0.79z)| lr 5.91e-04 | 323.97 ms | 52.1% bf16 MFU | 1622414 tok/s step 2183/19560 | loss 3.817982 (-1.02z)| norm 0.2697 (-1.08z)| lr 5.91e-04 | 322.85 ms | 52.3% bf16 MFU | 1622489 tok/s step 2184/19560 | loss 3.789605 (-1.75z)| norm 0.2787 (-0.80z)| lr 5.91e-04 | 323.27 ms | 52.2% bf16 MFU | 1622456 tok/s step 2185/19560 | loss 3.860663 (+0.17z)| norm 0.2843 (-0.63z)| lr 5.91e-04 | 323.01 ms | 52.3% bf16 MFU | 1622490 tok/s step 2186/19560 | loss 3.832847 (-0.58z)| norm 0.2921 (-0.40z)| lr 5.91e-04 | 323.09 ms | 52.2% bf16 MFU | 1622503 tok/s step 2187/19560 | loss 3.922686 (+1.88z)| norm 0.3194 (+0.41z)| lr 5.91e-04 | 322.56 ms | 52.3% bf16 MFU | 1622648 tok/s step 2188/19560 | loss 3.848661 (-0.14z)| norm 0.3100 (+0.13z)| lr 5.91e-04 | 322.64 ms | 52.3% bf16 MFU | 1622765 tok/s step 2189/19560 | loss 3.900276 (+1.25z)| norm 0.2838 (-0.65z)| lr 5.91e-04 | 322.54 ms | 52.3% bf16 MFU | 1622902 tok/s step 2190/19560 | loss 3.804972 (-1.35z)| norm 0.2723 (-0.98z)| lr 5.91e-04 | 323.45 ms | 52.2% bf16 MFU | 1622803 tok/s step 2191/19560 | loss 3.813070 (-1.12z)| norm 0.2898 (-0.45z)| lr 5.91e-04 | 323.16 ms | 52.2% bf16 MFU | 1622781 tok/s step 2192/19560 | loss 3.817869 (-0.98z)| norm 0.3110 (+0.18z)| lr 5.91e-04 | 323.13 ms | 52.2% bf16 MFU | 1622768 tok/s step 2193/19560 | loss 3.833385 (-0.55z)| norm 0.3090 (+0.11z)| lr 5.91e-04 | 323.33 ms | 52.2% 
bf16 MFU | 1622707 tok/s step 2194/19560 | loss 3.885960 (+0.87z)| norm 0.3153 (+0.30z)| lr 5.91e-04 | 323.10 ms | 52.2% bf16 MFU | 1622704 tok/s step 2195/19560 | loss 3.833728 (-0.53z)| norm 0.3120 (+0.21z)| lr 5.91e-04 | 323.18 ms | 52.2% bf16 MFU | 1622684 tok/s step 2196/19560 | loss 3.783632 (-1.86z)| norm 0.3393 (+1.04z)| lr 5.91e-04 | 323.19 ms | 52.2% bf16 MFU | 1622660 tok/s step 2197/19560 | loss 3.837052 (-0.41z)| norm 0.3180 (+0.40z)| lr 5.91e-04 | 322.69 ms | 52.3% bf16 MFU | 1622763 tok/s step 2198/19560 | loss 3.813322 (-1.04z)| norm 0.3516 (+1.43z)| lr 5.91e-04 | 323.41 ms | 52.2% bf16 MFU | 1622681 tok/s step 2199/19560 | loss 3.815918 (-0.96z)| norm 0.2921 (-0.37z)| lr 5.91e-04 | 322.25 ms | 52.4% bf16 MFU | 1622894 tok/s step 2200/19560 | loss 3.883500 (+0.88z)| norm 0.2934 (-0.32z)| lr 5.91e-04 | 323.13 ms | 52.2% bf16 MFU | 1622876 tok/s step 2201/19560 | loss 3.892226 (+1.12z)| norm 0.3303 (+0.84z)| lr 5.91e-04 | 323.00 ms | 52.3% bf16 MFU | 1622891 tok/s step 2202/19560 | loss 3.824568 (-0.73z)| norm 0.3401 (+1.13z)| lr 5.91e-04 | 323.61 ms | 52.2% bf16 MFU | 1622753 tok/s step 2203/19560 | loss 3.891141 (+1.07z)| norm 0.3224 (+0.57z)| lr 5.91e-04 | 323.16 ms | 52.2% bf16 MFU | 1622734 tok/s step 2204/19560 | loss 3.853151 (+0.04z)| norm 0.3159 (+0.36z)| lr 5.91e-04 | 322.77 ms | 52.3% bf16 MFU | 1622815 tok/s step 2205/19560 | loss 3.838854 (-0.35z)| norm 0.3355 (+0.96z)| lr 5.91e-04 | 322.87 ms | 52.3% bf16 MFU | 1622867 tok/s step 2206/19560 | loss 3.879784 (+0.76z)| norm 0.3489 (+1.35z)| lr 5.91e-04 | 323.38 ms | 52.2% bf16 MFU | 1622787 tok/s step 2207/19560 | loss 3.903367 (+1.37z)| norm 0.3190 (+0.42z)| lr 5.91e-04 | 322.94 ms | 52.3% bf16 MFU | 1622822 tok/s step 2208/19560 | loss 3.885102 (+0.87z)| norm 0.3130 (+0.22z)| lr 5.91e-04 | 323.37 ms | 52.2% bf16 MFU | 1622748 tok/s step 2209/19560 | loss 3.888732 (+0.98z)| norm 0.2874 (-0.58z)| lr 5.91e-04 | 322.87 ms | 52.3% bf16 MFU | 1622802 tok/s step 2210/19560 | loss 3.836384 
(-0.44z)| norm 0.2731 (-1.02z)| lr 5.91e-04 | 323.11 ms | 52.2% bf16 MFU | 1622795 tok/s step 2211/19560 | loss 3.810434 (-1.13z)| norm 0.3021 (-0.12z)| lr 5.91e-04 | 323.36 ms | 52.2% bf16 MFU | 1622725 tok/s step 2212/19560 | loss 3.858227 (+0.15z)| norm 0.3233 (+0.53z)| lr 5.91e-04 | 322.94 ms | 52.3% bf16 MFU | 1622762 tok/s step 2213/19560 | loss 3.808764 (-1.17z)| norm 0.3266 (+0.64z)| lr 5.91e-04 | 323.08 ms | 52.2% bf16 MFU | 1622762 tok/s step 2214/19560 | loss 3.828126 (-0.64z)| norm 0.3016 (-0.14z)| lr 5.91e-04 | 322.66 ms | 52.3% bf16 MFU | 1622870 tok/s step 2215/19560 | loss 3.865067 (+0.35z)| norm 0.2749 (-0.97z)| lr 5.91e-04 | 322.82 ms | 52.3% bf16 MFU | 1622931 tok/s step 2216/19560 | loss 3.887672 (+0.94z)| norm 0.2910 (-0.47z)| lr 5.90e-04 | 322.58 ms | 52.3% bf16 MFU | 1623049 tok/s step 2217/19560 | loss 3.870350 (+0.47z)| norm 0.3133 (+0.23z)| lr 5.90e-04 | 323.27 ms | 52.2% bf16 MFU | 1622987 tok/s step 2218/19560 | loss 3.780512 (-1.91z)| norm 0.2884 (-0.55z)| lr 5.90e-04 | 322.62 ms | 52.3% bf16 MFU | 1623092 tok/s step 2219/19560 | loss 3.793497 (-1.54z)| norm 0.2782 (-0.86z)| lr 5.90e-04 | 322.64 ms | 52.3% bf16 MFU | 1623187 tok/s step 2220/19560 | loss 3.793355 (-1.52z)| norm 0.3107 (+0.16z)| lr 5.90e-04 | 322.88 ms | 52.3% bf16 MFU | 1623217 tok/s step 2221/19560 | loss 3.835339 (-0.41z)| norm 0.3114 (+0.21z)| lr 5.90e-04 | 323.22 ms | 52.2% bf16 MFU | 1623161 tok/s step 2222/19560 | loss 3.825064 (-0.68z)| norm 0.2591 (-1.47z)| lr 5.90e-04 | 323.30 ms | 52.2% bf16 MFU | 1623088 tok/s step 2223/19560 | loss 3.879581 (+0.74z)| norm 0.3004 (-0.11z)| lr 5.90e-04 | 323.10 ms | 52.2% bf16 MFU | 1623068 tok/s step 2224/19560 | loss 3.832315 (-0.50z)| norm 0.2996 (-0.13z)| lr 5.90e-04 | 323.06 ms | 52.2% bf16 MFU | 1623058 tok/s step 2225/19560 | loss 3.814955 (-0.94z)| norm 0.2778 (-0.84z)| lr 5.90e-04 | 322.59 ms | 52.3% bf16 MFU | 1623168 tok/s step 2226/19560 | loss 3.843934 (-0.18z)| norm 0.3219 (+0.60z)| lr 5.90e-04 | 322.62 ms | 52.3% 
bf16 MFU | 1623264 tok/s step 2227/19560 | loss 3.792624 (-1.49z)| norm 0.3071 (+0.11z)| lr 5.90e-04 | 323.44 ms | 52.2% bf16 MFU | 1623150 tok/s step 2228/19560 | loss 3.802410 (-1.22z)| norm 0.3037 (-0.01z)| lr 5.90e-04 | 322.07 ms | 52.4% bf16 MFU | 1623387 tok/s step 2229/19560 | loss 3.841020 (-0.21z)| norm 0.3148 (+0.36z)| lr 5.90e-04 | 322.62 ms | 52.3% bf16 MFU | 1623472 tok/s step 2230/19560 | loss 3.818304 (-0.81z)| norm 0.3011 (-0.10z)| lr 5.90e-04 | 322.92 ms | 52.3% bf16 MFU | 1623476 tok/s step 2231/19560 | loss 3.832697 (-0.44z)| norm 0.3087 (+0.15z)| lr 5.90e-04 | 322.92 ms | 52.3% bf16 MFU | 1623483 tok/s step 2232/19560 | loss 3.797839 (-1.34z)| norm 0.3200 (+0.51z)| lr 5.90e-04 | 322.48 ms | 52.3% bf16 MFU | 1623599 tok/s step 2233/19560 | loss 3.809361 (-1.04z)| norm 0.3040 (-0.03z)| lr 5.90e-04 | 322.88 ms | 52.3% bf16 MFU | 1623607 tok/s step 2234/19560 | loss 4.021517 (+4.16z)| norm 0.2956 (-0.32z)| lr 5.90e-04 | 322.59 ms | 52.3% bf16 MFU | 1623689 tok/s step 2235/19560 | loss 3.837027 (-0.32z)| norm 0.3024 (-0.09z)| lr 5.90e-04 | 322.38 ms | 52.4% bf16 MFU | 1623819 tok/s step 2236/19560 | loss 3.808284 (-1.00z)| norm 0.2687 (-1.20z)| lr 5.90e-04 | 323.07 ms | 52.2% bf16 MFU | 1623771 tok/s step 2237/19560 | loss 3.845932 (-0.09z)| norm 0.2962 (-0.28z)| lr 5.90e-04 | 322.43 ms | 52.3% bf16 MFU | 1623886 tok/s step 2238/19560 | loss 3.837286 (-0.31z)| norm 0.3048 (+0.00z)| lr 5.90e-04 | 322.98 ms | 52.3% bf16 MFU | 1623856 tok/s step 2239/19560 | loss 3.904487 (+1.43z)| norm 0.3178 (+0.44z)| lr 5.90e-04 | 322.71 ms | 52.3% bf16 MFU | 1623894 tok/s step 2240/19560 | loss 3.854264 (+0.14z)| norm 0.3365 (+1.07z)| lr 5.90e-04 | 322.49 ms | 52.3% bf16 MFU | 1623986 tok/s step 2241/19560 | loss 3.818579 (-0.77z)| norm 0.3069 (+0.10z)| lr 5.90e-04 | 322.40 ms | 52.3% bf16 MFU | 1624096 tok/s step 2242/19560 | loss 3.793741 (-1.40z)| norm 0.2556 (-1.65z)| lr 5.90e-04 | 323.01 ms | 52.2% bf16 MFU | 1624047 tok/s step 2243/19560 | loss 3.784155 
(-1.63z)| norm 0.2929 (-0.36z)| lr 5.90e-04 | 322.84 ms | 52.3% bf16 MFU | 1624044 tok/s step 2244/19560 | loss 3.882524 (+0.88z)| norm 0.2817 (-0.75z)| lr 5.90e-04 | 322.61 ms | 52.3% bf16 MFU | 1624100 tok/s step 2245/19560 | loss 3.794253 (-1.35z)| norm 0.2976 (-0.21z)| lr 5.90e-04 | 322.87 ms | 52.3% bf16 MFU | 1624087 tok/s step 2246/19560 | loss 3.858137 (+0.26z)| norm 0.3059 (+0.07z)| lr 5.90e-04 | 322.48 ms | 52.3% bf16 MFU | 1624172 tok/s step 2247/19560 | loss 3.869950 (+0.57z)| norm 0.2976 (-0.21z)| lr 5.90e-04 | 323.32 ms | 52.2% bf16 MFU | 1624044 tok/s step 2248/19560 | loss 3.840655 (-0.19z)| norm 0.3077 (+0.14z)| lr 5.90e-04 | 322.16 ms | 52.4% bf16 MFU | 1624211 tok/s step 2249/19560 | loss 3.831407 (-0.44z)| norm 0.2830 (-0.72z)| lr 5.90e-04 | 322.79 ms | 52.3% bf16 MFU | 1624212 tok/s step 2250/19560 | loss 3.810333 (-0.96z)| norm 0.3563 (+1.79z)| lr 5.90e-04 | 322.85 ms | 52.3% bf16 MFU | 1624197 tok/s val loss 3.819037 HellaSwag: 2609/10042 = 0.259809 step 2251/19560 | loss 3.845198 (-0.06z)| norm 0.3583 (+1.83z)| lr 5.90e-04 | 323.30 ms | 52.2% bf16 MFU | 1624071 tok/s step 2252/19560 | loss 3.795984 (-1.31z)| norm 0.3113 (+0.24z)| lr 5.90e-04 | 323.14 ms | 52.2% bf16 MFU | 1623992 tok/s step 2253/19560 | loss 3.800017 (-1.20z)| norm 0.3121 (+0.27z)| lr 5.90e-04 | 322.75 ms | 52.3% bf16 MFU | 1624013 tok/s step 2254/19560 | loss 3.877542 (+0.78z)| norm 0.2932 (-0.38z)| lr 5.90e-04 | 323.23 ms | 52.2% bf16 MFU | 1623914 tok/s step 2255/19560 | loss 3.796699 (-1.27z)| norm 0.3094 (+0.17z)| lr 5.90e-04 | 322.87 ms | 52.3% bf16 MFU | 1623910 tok/s step 2256/19560 | loss 3.833616 (-0.32z)| norm 0.3144 (+0.34z)| lr 5.90e-04 | 323.00 ms | 52.3% bf16 MFU | 1623873 tok/s step 2257/19560 | loss 3.777520 (-1.72z)| norm 0.2840 (-0.69z)|
lr 5.90e-04 | 322.71 ms | 52.3% bf16 MFU | 1623910 tok/s step 2258/19560 | loss 3.782048 (-1.59z)| norm 0.2626 (-1.39z)| lr 5.90e-04 | 322.94 ms | 52.3% bf16 MFU | 1623890 tok/s step 2259/19560 | loss 3.946272 (+2.49z)| norm 0.2724 (-1.05z)| lr 5.90e-04 | 322.93 ms | 52.3% bf16 MFU | 1623871 tok/s step 2260/19560 | loss 3.777153 (-1.66z)| norm 0.3275 (+0.79z)| lr 5.90e-04 | 322.71 ms | 52.3% bf16 MFU | 1623910 tok/s step 2261/19560 | loss 3.831545 (-0.33z)| norm 0.4035 (+3.20z)| lr 5.90e-04 | 322.94 ms | 52.3% bf16 MFU | 1623889 tok/s step 2262/19560 | loss 3.900838 (+1.34z)| norm 0.4147 (+3.38z)| lr 5.90e-04 | 322.91 ms | 52.3% bf16 MFU | 1623875 tok/s step 2263/19560 | loss 3.835582 (-0.24z)| norm 0.4569 (+4.32z)| lr 5.90e-04 | 322.60 ms | 52.3% bf16 MFU | 1623942 tok/s step 2264/19560 | loss 3.810823 (-0.83z)| norm 0.3638 (+1.59z)| lr 5.90e-04 | 323.29 ms | 52.2% bf16 MFU | 1623832 tok/s step 2265/19560 | loss 3.789056 (-1.34z)| norm 0.3416 (+0.94z)| lr 5.90e-04 | 323.22 ms | 52.2% bf16 MFU | 1623745 tok/s step 2266/19560 | loss 3.814459 (-0.71z)| norm 0.2626 (-1.31z)| lr 5.90e-04 | 322.85 ms | 52.3% bf16 MFU | 1623755 tok/s step 2267/19560 | loss 3.777929 (-1.60z)| norm 0.2796 (-0.81z)| lr 5.90e-04 | 322.65 ms | 52.3% bf16 MFU | 1623815 tok/s step 2268/19560 | loss 3.848052 (+0.14z)| norm 0.2564 (-1.51z)| lr 5.90e-04 | 322.54 ms | 52.3% bf16 MFU | 1623899 tok/s step 2269/19560 | loss 3.996967 (+3.60z)| norm 0.2641 (-1.28z)| lr 5.90e-04 | 322.56 ms | 52.3% bf16 MFU | 1623974 tok/s step 2270/19560 | loss 3.850561 (+0.18z)| norm 0.2670 (-1.18z)| lr 5.90e-04 | 323.14 ms | 52.2% bf16 MFU | 1623900 tok/s step 2271/19560 | loss 3.839243 (-0.09z)| norm 0.2436 (-1.87z)| lr 5.90e-04 | 322.97 ms | 52.3% bf16 MFU | 1623872 tok/s step 2272/19560 | loss 3.744771 (-2.26z)| norm 0.2511 (-1.61z)| lr 5.90e-04 | 322.78 ms | 52.3% bf16 MFU | 1623894 tok/s step 2273/19560 | loss 3.814921 (-0.62z)| norm 0.2397 (-1.92z)| lr 5.90e-04 | 323.09 ms | 52.2% bf16 MFU | 1623836 tok/s step 
2274/19560 | loss 3.805678 (-0.84z)| norm 0.2497 (-1.58z)| lr 5.90e-04 | 322.56 ms | 52.3% bf16 MFU | 1623913 tok/s step 2275/19560 | loss 3.886713 (+1.04z)| norm 0.2717 (-0.91z)| lr 5.90e-04 | 322.64 ms | 52.3% bf16 MFU | 1623967 tok/s step 2276/19560 | loss 3.830768 (-0.25z)| norm 0.2593 (-1.27z)| lr 5.90e-04 | 322.72 ms | 52.3% bf16 MFU | 1623997 tok/s step 2277/19560 | loss 3.899174 (+1.32z)| norm 0.3019 (+0.00z)| lr 5.90e-04 | 323.03 ms | 52.2% bf16 MFU | 1623949 tok/s step 2278/19560 | loss 3.820830 (-0.48z)| norm 0.2966 (-0.16z)| lr 5.90e-04 | 322.50 ms | 52.3% bf16 MFU | 1624035 tok/s step 2279/19560 | loss 3.784799 (-1.30z)| norm 0.3386 (+1.09z)| lr 5.90e-04 | 322.61 ms | 52.3% bf16 MFU | 1624091 tok/s step 2280/19560 | loss 3.864542 (+0.52z)| norm 0.3116 (+0.27z)| lr 5.90e-04 | 322.86 ms | 52.3% bf16 MFU | 1624080 tok/s step 2281/19560 | loss 3.848513 (+0.16z)| norm 0.2841 (-0.55z)| lr 5.90e-04 | 323.02 ms | 52.2% bf16 MFU | 1624031 tok/s step 2282/19560 | loss 3.806200 (-0.82z)| norm 0.2934 (-0.27z)| lr 5.90e-04 | 323.06 ms | 52.2% bf16 MFU | 1623972 tok/s step 2283/19560 | loss 3.811474 (-0.68z)| norm 0.2777 (-0.75z)| lr 5.90e-04 | 322.97 ms | 52.3% bf16 MFU | 1623942 tok/s step 2284/19560 | loss 3.780030 (-1.39z)| norm 0.2924 (-0.31z)| lr 5.90e-04 | 323.37 ms | 52.2% bf16 MFU | 1623811 tok/s step 2285/19560 | loss 3.782219 (-1.32z)| norm 0.3282 (+0.76z)| lr 5.90e-04 | 322.56 ms | 52.3% bf16 MFU | 1623892 tok/s step 2286/19560 | loss 3.812474 (-0.62z)| norm 0.2943 (-0.26z)| lr 5.90e-04 | 322.86 ms | 52.3% bf16 MFU | 1623892 tok/s step 2287/19560 | loss 3.839877 (+0.01z)| norm 0.2852 (-0.53z)| lr 5.90e-04 | 323.11 ms | 52.2% bf16 MFU | 1623828 tok/s step 2288/19560 | loss 3.827256 (-0.28z)| norm 0.3392 (+1.07z)| lr 5.90e-04 | 322.61 ms | 52.3% bf16 MFU | 1623893 tok/s step 2289/19560 | loss 3.826686 (-0.28z)| norm 0.3448 (+1.22z)| lr 5.90e-04 | 322.88 ms | 52.3% bf16 MFU | 1623887 tok/s step 2290/19560 | loss 3.840833 (+0.05z)| norm 0.2791 (-0.75z)| lr 
5.90e-04 | 323.56 ms | 52.2% bf16 MFU | 1623711 tok/s step 2291/19560 | loss 3.809868 (-0.65z)| norm 0.2886 (-0.48z)| lr 5.90e-04 | 322.58 ms | 52.3% bf16 MFU | 1623790 tok/s step 2292/19560 | loss 3.820103 (-0.41z)| norm 0.2697 (-1.04z)| lr 5.90e-04 | 322.72 ms | 52.3% bf16 MFU | 1623832 tok/s step 2293/19560 | loss 3.836496 (-0.02z)| norm 0.2784 (-0.77z)| lr 5.90e-04 | 323.47 ms | 52.2% bf16 MFU | 1623681 tok/s step 2294/19560 | loss 3.848221 (+0.26z)| norm 0.2602 (-1.30z)| lr 5.90e-04 | 322.68 ms | 52.3% bf16 MFU | 1623737 tok/s step 2295/19560 | loss 3.815930 (-0.48z)| norm 0.2736 (-0.90z)| lr 5.89e-04 | 323.45 ms | 52.2% bf16 MFU | 1623596 tok/s step 2296/19560 | loss 3.748985 (-2.01z)| norm 0.2628 (-1.22z)| lr 5.89e-04 | 322.84 ms | 52.3% bf16 MFU | 1623615 tok/s step 2297/19560 | loss 3.873532 (+0.88z)| norm 0.2983 (-0.10z)| lr 5.89e-04 | 323.03 ms | 52.2% bf16 MFU | 1623586 tok/s step 2298/19560 | loss 3.789026 (-1.07z)| norm 0.2778 (-0.73z)| lr 5.89e-04 | 322.95 ms | 52.3% bf16 MFU | 1623578 tok/s step 2299/19560 | loss 3.798212 (-0.84z)| norm 0.2902 (-0.34z)| lr 5.89e-04 | 322.22 ms | 52.4% bf16 MFU | 1623754 tok/s step 2300/19560 | loss 3.852995 (+0.41z)| norm 0.2905 (-0.34z)| lr 5.89e-04 | 323.70 ms | 52.1% bf16 MFU | 1623550 tok/s step 2301/19560 | loss 3.803387 (-0.73z)| norm 0.2798 (-0.67z)| lr 5.89e-04 | 323.45 ms | 52.2% bf16 MFU | 1623419 tok/s step 2302/19560 | loss 3.835579 (+0.00z)| norm 0.2946 (-0.21z)| lr 5.89e-04 | 323.08 ms | 52.2% bf16 MFU | 1623387 tok/s step 2303/19560 | loss 3.765421 (-1.60z)| norm 0.3149 (+0.43z)| lr 5.89e-04 | 323.24 ms | 52.2% bf16 MFU | 1623317 tok/s step 2304/19560 | loss 3.788161 (-1.08z)| norm 0.3428 (+1.29z)| lr 5.89e-04 | 322.83 ms | 52.3% bf16 MFU | 1623352 tok/s step 2305/19560 | loss 3.824896 (-0.22z)| norm 0.3241 (+0.70z)| lr 5.89e-04 | 322.50 ms | 52.3% bf16 MFU | 1623470 tok/s step 2306/19560 | loss 3.799635 (-0.80z)| norm 0.3065 (+0.15z)| lr 5.89e-04 | 323.19 ms | 52.2% bf16 MFU | 1623408 tok/s step 
2307/19560 | loss 3.820621 (-0.31z)| norm 0.2820 (-0.61z)| lr 5.89e-04 | 323.01 ms | 52.2% bf16 MFU | 1623393 tok/s step 2308/19560 | loss 3.860807 (+0.61z)| norm 0.2687 (-1.02z)| lr 5.89e-04 | 323.01 ms | 52.3% bf16 MFU | 1623381 tok/s step 2309/19560 | loss 3.809770 (-0.56z)| norm 0.2728 (-0.89z)| lr 5.89e-04 | 322.90 ms | 52.3% bf16 MFU | 1623397 tok/s step 2310/19560 | loss 3.809581 (-0.55z)| norm 0.2755 (-0.80z)| lr 5.89e-04 | 323.13 ms | 52.2% bf16 MFU | 1623354 tok/s step 2311/19560 | loss 3.738820 (-2.15z)| norm 0.2704 (-0.96z)| lr 5.89e-04 | 323.14 ms | 52.2% bf16 MFU | 1623310 tok/s step 2312/19560 | loss 3.718814 (-2.54z)| norm 0.2817 (-0.61z)| lr 5.89e-04 | 322.80 ms | 52.3% bf16 MFU | 1623354 tok/s step 2313/19560 | loss 3.778263 (-1.19z)| norm 0.2793 (-0.68z)| lr 5.89e-04 | 322.87 ms | 52.3% bf16 MFU | 1623378 tok/s step 2314/19560 | loss 3.836004 (+0.10z)| norm 0.2859 (-0.48z)| lr 5.89e-04 | 323.15 ms | 52.2% bf16 MFU | 1623330 tok/s step 2315/19560 | loss 3.871470 (+0.91z)| norm 0.2781 (-0.71z)| lr 5.89e-04 | 323.01 ms | 52.3% bf16 MFU | 1623321 tok/s step 2316/19560 | loss 3.855022 (+0.54z)| norm 0.2909 (-0.31z)| lr 5.89e-04 | 323.11 ms | 52.2% bf16 MFU | 1623286 tok/s step 2317/19560 | loss 3.893150 (+1.40z)| norm 0.3008 (-0.00z)| lr 5.89e-04 | 322.67 ms | 52.3% bf16 MFU | 1623364 tok/s step 2318/19560 | loss 3.776022 (-1.24z)| norm 0.2944 (-0.21z)| lr 5.89e-04 | 323.09 ms | 52.2% bf16 MFU | 1623332 tok/s step 2319/19560 | loss 3.793742 (-0.83z)| norm 0.2874 (-0.42z)| lr 5.89e-04 | 322.99 ms | 52.3% bf16 MFU | 1623327 tok/s step 2320/19560 | loss 3.753922 (-1.70z)| norm 0.3050 (+0.12z)| lr 5.89e-04 | 322.64 ms | 52.3% bf16 MFU | 1623412 tok/s step 2321/19560 | loss 3.800998 (-0.64z)| norm 0.2829 (-0.56z)| lr 5.89e-04 | 323.58 ms | 52.2% bf16 MFU | 1623255 tok/s step 2322/19560 | loss 3.851397 (+0.48z)| norm 0.2946 (-0.19z)| lr 5.89e-04 | 322.97 ms | 52.3% bf16 MFU | 1623258 tok/s step 2323/19560 | loss 3.804776 (-0.55z)| norm 0.3291 (+0.88z)| lr 
5.89e-04 | 323.23 ms | 52.2% bf16 MFU | 1623198 tok/s step 2324/19560 | loss 3.802111 (-0.62z)| norm 0.3297 (+0.90z)| lr 5.89e-04 | 323.03 ms | 52.2% bf16 MFU | 1623190 tok/s step 2325/19560 | loss 3.917960 (+1.93z)| norm 0.3151 (+0.45z)| lr 5.89e-04 | 322.96 ms | 52.3% bf16 MFU | 1623199 tok/s step 2326/19560 | loss 3.851144 (+0.45z)| norm 0.3080 (+0.24z)| lr 5.89e-04 | 323.02 ms | 52.2% bf16 MFU | 1623193 tok/s step 2327/19560 | loss 3.824095 (-0.14z)| norm 0.3502 (+1.54z)| lr 5.89e-04 | 323.23 ms | 52.2% bf16 MFU | 1623133 tok/s step 2328/19560 | loss 3.834930 (+0.10z)| norm 0.3183 (+0.54z)| lr 5.89e-04 | 323.19 ms | 52.2% bf16 MFU | 1623089 tok/s step 2329/19560 | loss 3.797446 (-0.71z)| norm 0.2935 (-0.22z)| lr 5.89e-04 | 322.68 ms | 52.3% bf16 MFU | 1623176 tok/s step 2330/19560 | loss 3.797718 (-0.70z)| norm 0.2921 (-0.26z)| lr 5.89e-04 | 322.93 ms | 52.3% bf16 MFU | 1623193 tok/s step 2331/19560 | loss 3.775640 (-1.18z)| norm 0.2734 (-0.83z)| lr 5.89e-04 | 323.44 ms | 52.2% bf16 MFU | 1623083 tok/s step 2332/19560 | loss 3.833792 (+0.12z)| norm 0.2921 (-0.24z)| lr 5.89e-04 | 322.67 ms | 52.3% bf16 MFU | 1623170 tok/s step 2333/19560 | loss 3.828542 (+0.01z)| norm 0.2754 (-0.75z)| lr 5.89e-04 | 323.52 ms | 52.2% bf16 MFU | 1623042 tok/s step 2334/19560 | loss 3.775408 (-1.16z)| norm 0.2730 (-0.81z)| lr 5.89e-04 | 322.88 ms | 52.3% bf16 MFU | 1623080 tok/s step 2335/19560 | loss 3.787076 (-0.89z)| norm 0.2938 (-0.15z)| lr 5.89e-04 | 322.60 ms | 52.3% bf16 MFU | 1623186 tok/s step 2336/19560 | loss 3.817269 (-0.20z)| norm 0.2830 (-0.48z)| lr 5.89e-04 | 323.27 ms | 52.2% bf16 MFU | 1623118 tok/s step 2337/19560 | loss 3.751443 (-1.66z)| norm 0.2774 (-0.66z)| lr 5.89e-04 | 322.80 ms | 52.3% bf16 MFU | 1623171 tok/s step 2338/19560 | loss 3.786089 (-0.86z)| norm 0.3036 (+0.17z)| lr 5.89e-04 | 323.30 ms | 52.2% bf16 MFU | 1623097 tok/s step 2339/19560 | loss 3.784770 (-0.89z)| norm 0.3057 (+0.23z)| lr 5.89e-04 | 322.71 ms | 52.3% bf16 MFU | 1623174 tok/s step 
2340/19560 | loss 3.799827 (-0.54z)| norm 0.2784 (-0.63z)| lr 5.89e-04 | 322.89 ms | 52.3% bf16 MFU | 1623202 tok/s step 2341/19560 | loss 3.802803 (-0.47z)| norm 0.2891 (-0.28z)| lr 5.89e-04 | 324.41 ms | 52.0% bf16 MFU | 1622848 tok/s step 2342/19560 | loss 3.833994 (+0.23z)| norm 0.3075 (+0.31z)| lr 5.89e-04 | 322.19 ms | 52.4% bf16 MFU | 1623069 tok/s step 2343/19560 | loss 3.844184 (+0.46z)| norm 0.2989 (+0.03z)| lr 5.89e-04 | 322.96 ms | 52.3% bf16 MFU | 1623085 tok/s step 2344/19560 | loss 3.799263 (-0.54z)| norm 0.2915 (-0.21z)| lr 5.89e-04 | 323.66 ms | 52.1% bf16 MFU | 1622925 tok/s step 2345/19560 | loss 3.785121 (-0.84z)| norm 0.3078 (+0.31z)| lr 5.89e-04 | 322.95 ms | 52.3% bf16 MFU | 1622950 tok/s step 2346/19560 | loss 3.766305 (-1.27z)| norm 0.3555 (+1.80z)| lr 5.89e-04 | 323.39 ms | 52.2% bf16 MFU | 1622862 tok/s step 2347/19560 | loss 3.826642 (+0.10z)| norm 0.3450 (+1.44z)| lr 5.89e-04 | 323.18 ms | 52.2% bf16 MFU | 1622833 tok/s step 2348/19560 | loss 3.747540 (-1.68z)| norm 0.3019 (+0.09z)| lr 5.89e-04 | 322.82 ms | 52.3% bf16 MFU | 1622896 tok/s step 2349/19560 | loss 3.791402 (-0.68z)| norm 0.3018 (+0.09z)| lr 5.89e-04 | 322.89 ms | 52.3% bf16 MFU | 1622938 tok/s step 2350/19560 | loss 3.812991 (-0.19z)| norm 0.2945 (-0.15z)| lr 5.89e-04 | 323.51 ms | 52.2% bf16 MFU | 1622821 tok/s step 2351/19560 | loss 3.773593 (-1.06z)| norm 0.2711 (-0.88z)| lr 5.89e-04 | 322.81 ms | 52.3% bf16 MFU | 1622888 tok/s step 2352/19560 | loss 3.791281 (-0.66z)| norm 0.2837 (-0.47z)| lr 5.89e-04 | 322.77 ms | 52.3% bf16 MFU | 1622961 tok/s step 2353/19560 | loss 3.775766 (-0.99z)| norm 0.2745 (-0.76z)| lr 5.89e-04 | 322.47 ms | 52.3% bf16 MFU | 1623104 tok/s step 2354/19560 | loss 3.837745 (+0.39z)| norm 0.5705 (+6.79z)| lr 5.89e-04 | 322.92 ms | 52.3% bf16 MFU | 1623128 tok/s step 2355/19560 | loss 3.787466 (-0.73z)| norm 0.4111 (+2.68z)| lr 5.89e-04 | 322.33 ms | 52.4% bf16 MFU | 1623299 tok/s step 2356/19560 | loss 3.764937 (-1.22z)| norm 0.4760 (+3.96z)| lr 
5.89e-04 | 323.57 ms | 52.2% bf16 MFU | 1623151 tok/s step 2357/19560 | loss 3.837166 (+0.39z)| norm 0.3706 (+1.53z)| lr 5.89e-04 | 322.50 ms | 52.3% bf16 MFU | 1623278 tok/s step 2358/19560 | loss 3.838485 (+0.41z)| norm 0.3141 (+0.24z)| lr 5.89e-04 | 322.82 ms | 52.3% bf16 MFU | 1623319 tok/s step 2359/19560 | loss 3.788975 (-0.68z)| norm 0.3170 (+0.31z)| lr 5.89e-04 | 322.90 ms | 52.3% bf16 MFU | 1623337 tok/s step 2360/19560 | loss 3.770358 (-1.08z)| norm 0.2954 (-0.18z)| lr 5.89e-04 | 322.73 ms | 52.3% bf16 MFU | 1623398 tok/s step 2361/19560 | loss 3.830543 (+0.24z)| norm 0.2908 (-0.28z)| lr 5.89e-04 | 322.75 ms | 52.3% bf16 MFU | 1623451 tok/s step 2362/19560 | loss 3.790244 (-0.66z)| norm 0.2812 (-0.49z)| lr 5.89e-04 | 323.16 ms | 52.2% bf16 MFU | 1623398 tok/s step 2363/19560 | loss 3.826760 (+0.22z)| norm 0.2603 (-0.96z)| lr 5.89e-04 | 323.06 ms | 52.2% bf16 MFU | 1623374 tok/s step 2364/19560 | loss 3.809740 (-0.19z)| norm 0.2552 (-1.07z)| lr 5.89e-04 | 323.10 ms | 52.2% bf16 MFU | 1623340 tok/s step 2365/19560 | loss 3.823277 (+0.14z)| norm 0.2717 (-0.69z)| lr 5.89e-04 | 322.55 ms | 52.3% bf16 MFU | 1623445 tok/s step 2366/19560 | loss 3.809357 (-0.19z)| norm 0.3656 (+1.40z)| lr 5.89e-04 | 323.49 ms | 52.2% bf16 MFU | 1623308 tok/s step 2367/19560 | loss 3.872110 (+1.34z)| norm 0.2986 (-0.09z)| lr 5.89e-04 | 323.32 ms | 52.2% bf16 MFU | 1623221 tok/s step 2368/19560 | loss 3.824281 (+0.18z)| norm 0.3327 (+0.67z)| lr 5.89e-04 | 322.92 ms | 52.3% bf16 MFU | 1623240 tok/s step 2369/19560 | loss 3.863683 (+1.13z)| norm 0.3029 (+0.00z)| lr 5.88e-04 | 323.01 ms | 52.3% bf16 MFU | 1623235 tok/s step 2370/19560 | loss 3.846398 (+0.70z)| norm 0.3277 (+0.55z)| lr 5.88e-04 | 322.59 ms | 52.3% bf16 MFU | 1623336 tok/s step 2371/19560 | loss 3.859132 (+1.00z)| norm 0.2999 (-0.08z)| lr 5.88e-04 | 322.96 ms | 52.3% bf16 MFU | 1623338 tok/s step 2372/19560 | loss 3.838985 (+0.52z)| norm 0.2880 (-0.34z)| lr 5.88e-04 | 323.11 ms | 52.2% bf16 MFU | 1623303 tok/s step 
2373/19560 | loss 3.785880 (-0.78z)| norm 0.3224 (+0.42z)| lr 5.88e-04 | 323.25 ms | 52.2% bf16 MFU | 1623234 tok/s step 2374/19560 | loss 3.821207 (+0.09z)| norm 0.3339 (+0.67z)| lr 5.88e-04 | 322.71 ms | 52.3% bf16 MFU | 1623304 tok/s step 2375/19560 | loss 3.796314 (-0.51z)| norm 0.3029 (-0.02z)| lr 5.88e-04 | 323.33 ms | 52.2% bf16 MFU | 1623216 tok/s step 2376/19560 | loss 3.814389 (-0.06z)| norm 0.2543 (-1.09z)| lr 5.88e-04 | 323.05 ms | 52.2% bf16 MFU | 1623202 tok/s step 2377/19560 | loss 3.787122 (-0.72z)| norm 0.2588 (-0.99z)| lr 5.88e-04 | 322.96 ms | 52.3% bf16 MFU | 1623211 tok/s step 2378/19560 | loss 3.823843 (+0.18z)| norm 0.2405 (-1.37z)| lr 5.88e-04 | 323.02 ms | 52.2% bf16 MFU | 1623205 tok/s step 2379/19560 | loss 3.746772 (-1.69z)| norm 0.2749 (-0.60z)| lr 5.88e-04 | 322.53 ms | 52.3% bf16 MFU | 1623321 tok/s step 2380/19560 | loss 3.818555 (+0.07z)| norm 0.2667 (-0.77z)| lr 5.88e-04 | 323.41 ms | 52.2% bf16 MFU | 1623211 tok/s step 2381/19560 | loss 3.800004 (-0.39z)| norm 0.2770 (-0.53z)| lr 5.88e-04 | 323.18 ms | 52.2% bf16 MFU | 1623165 tok/s step 2382/19560 | loss 3.845452 (+0.74z)| norm 0.2671 (-0.75z)| lr 5.88e-04 | 323.30 ms | 52.2% bf16 MFU | 1623092 tok/s step 2383/19560 | loss 3.795957 (-0.48z)| norm 0.2675 (-0.73z)| lr 5.88e-04 | 322.86 ms | 52.3% bf16 MFU | 1623131 tok/s step 2384/19560 | loss 3.842443 (+0.66z)| norm 0.2994 (-0.02z)| lr 5.88e-04 | 323.01 ms | 52.2% bf16 MFU | 1623130 tok/s step 2385/19560 | loss 3.821863 (+0.15z)| norm 0.2847 (-0.35z)| lr 5.88e-04 | 322.79 ms | 52.3% bf16 MFU | 1623185 tok/s step 2386/19560 | loss 3.793858 (-0.55z)| norm 0.2771 (-0.52z)| lr 5.88e-04 | 322.89 ms | 52.3% bf16 MFU | 1623212 tok/s step 2387/19560 | loss 3.761661 (-1.36z)| norm 0.3401 (+0.86z)| lr 5.88e-04 | 322.73 ms | 52.3% bf16 MFU | 1623278 tok/s step 2388/19560 | loss 3.780739 (-0.87z)| norm 0.3011 (+0.01z)| lr 5.88e-04 | 322.80 ms | 52.3% bf16 MFU | 1623323 tok/s step 2389/19560 | loss 3.786953 (-0.70z)| norm 0.2989 (-0.03z)| lr 
5.88e-04 | 323.09 ms | 52.2% bf16 MFU | 1623294 tok/s step 2390/19560 | loss 3.771441 (-1.09z)| norm 0.2935 (-0.13z)| lr 5.88e-04 | 322.69 ms | 52.3% bf16 MFU | 1623366 tok/s step 2391/19560 | loss 3.807467 (-0.15z)| norm 0.3219 (+0.59z)| lr 5.88e-04 | 322.74 ms | 52.3% bf16 MFU | 1623421 tok/s step 2392/19560 | loss 3.784799 (-0.73z)| norm 0.2858 (-0.29z)| lr 5.88e-04 | 323.48 ms | 52.2% bf16 MFU | 1623287 tok/s step 2393/19560 | loss 3.703033 (-2.76z)| norm 0.3578 (+1.49z)| lr 5.88e-04 | 322.78 ms | 52.3% bf16 MFU | 1623338 tok/s step 2394/19560 | loss 3.806237 (-0.15z)| norm 0.3625 (+1.57z)| lr 5.88e-04 | 322.51 ms | 52.3% bf16 MFU | 1623454 tok/s step 2395/19560 | loss 3.721497 (-2.24z)| norm 0.3354 (+0.90z)| lr 5.88e-04 | 323.32 ms | 52.2% bf16 MFU | 1623360 tok/s step 2396/19560 | loss 3.891065 (+1.94z)| norm 0.3359 (+0.89z)| lr 5.88e-04 | 322.93 ms | 52.3% bf16 MFU | 1623368 tok/s step 2397/19560 | loss 3.799025 (-0.31z)| norm 0.2990 (-0.02z)| lr 5.88e-04 | 322.85 ms | 52.3% bf16 MFU | 1623397 tok/s step 2398/19560 | loss 3.798177 (-0.32z)| norm 0.3009 (+0.02z)| lr 5.88e-04 | 323.25 ms | 52.2% bf16 MFU | 1623323 tok/s step 2399/19560 | loss 3.811796 (+0.05z)| norm 0.3016 (+0.03z)| lr 5.88e-04 | 323.45 ms | 52.2% bf16 MFU | 1623202 tok/s step 2400/19560 | loss 3.813511 (+0.08z)| norm 0.2987 (-0.05z)| lr 5.88e-04 | 322.97 ms | 52.3% bf16 MFU | 1623208 tok/s step 2401/19560 | loss 3.821557 (+0.30z)| norm 0.3088 (+0.19z)| lr 5.88e-04 | 322.40 ms | 52.3% bf16 MFU | 1623358 tok/s step 2402/19560 | loss 3.801061 (-0.26z)| norm 0.2818 (-0.50z)| lr 5.88e-04 | 322.58 ms | 52.3% bf16 MFU | 1623454 tok/s step 2403/19560 | loss 3.803312 (-0.18z)| norm 0.2773 (-0.61z)| lr 5.88e-04 | 322.91 ms | 52.3% bf16 MFU | 1623463 tok/s step 2404/19560 | loss 3.736094 (-2.00z)| norm 0.2821 (-0.50z)| lr 5.88e-04 | 323.07 ms | 52.2% bf16 MFU | 1623432 tok/s step 2405/19560 | loss 3.805214 (-0.09z)| norm 0.2872 (-0.37z)| lr 5.88e-04 | 322.88 ms | 52.3% bf16 MFU | 1623450 tok/s step 
2406/19560 | loss 3.777059 (-0.86z)| norm 0.2879 (-0.35z)| lr 5.88e-04 | 322.79 ms | 52.3% bf16 MFU | 1623489 tok/s step 2407/19560 | loss 3.828737 (+0.57z)| norm 0.2694 (-0.80z)| lr 5.88e-04 | 323.12 ms | 52.2% bf16 MFU | 1623444 tok/s step 2408/19560 | loss 3.743880 (-1.77z)| norm 0.2904 (-0.27z)| lr 5.88e-04 | 322.35 ms | 52.4% bf16 MFU | 1623596 tok/s step 2409/19560 | loss 3.807560 (+0.01z)| norm 0.2850 (-0.40z)| lr 5.88e-04 | 322.51 ms | 52.3% bf16 MFU | 1623698 tok/s step 2410/19560 | loss 3.777904 (-0.81z)| norm 0.2821 (-0.47z)| lr 5.88e-04 | 322.82 ms | 52.3% bf16 MFU | 1623718 tok/s step 2411/19560 | loss 3.816594 (+0.27z)| norm 0.2554 (-1.15z)| lr 5.88e-04 | 323.27 ms | 52.2% bf16 MFU | 1623623 tok/s step 2412/19560 | loss 3.746941 (-1.65z)| norm 0.2679 (-0.82z)| lr 5.88e-04 | 322.84 ms | 52.3% bf16 MFU | 1623640 tok/s step 2413/19560 | loss 3.733947 (-1.98z)| norm 0.2846 (-0.39z)| lr 5.88e-04 | 322.67 ms | 52.3% bf16 MFU | 1623701 tok/s step 2414/19560 | loss 3.731327 (-2.00z)| norm 0.2999 (-0.00z)| lr 5.88e-04 | 323.22 ms | 52.2% bf16 MFU | 1623620 tok/s step 2415/19560 | loss 3.788291 (-0.46z)| norm 0.3147 (+0.36z)| lr 5.88e-04 | 322.42 ms | 52.3% bf16 MFU | 1623743 tok/s step 2416/19560 | loss 3.796772 (-0.22z)| norm 0.3365 (+0.91z)| lr 5.88e-04 | 322.29 ms | 52.4% bf16 MFU | 1623894 tok/s step 2417/19560 | loss 3.781199 (-0.63z)| norm 0.3116 (+0.29z)| lr 5.88e-04 | 323.42 ms | 52.2% bf16 MFU | 1623751 tok/s step 2418/19560 | loss 3.810256 (+0.16z)| norm 0.3031 (+0.07z)| lr 5.88e-04 | 322.47 ms | 52.3% bf16 MFU | 1623856 tok/s step 2419/19560 | loss 3.768143 (-0.97z)| norm 0.2855 (-0.37z)| lr 5.88e-04 | 322.99 ms | 52.3% bf16 MFU | 1623825 tok/s step 2420/19560 | loss 3.745945 (-1.54z)| norm 0.2908 (-0.25z)| lr 5.88e-04 | 323.04 ms | 52.2% bf16 MFU | 1623782 tok/s step 2421/19560 | loss 3.760942 (-1.12z)| norm 0.2766 (-0.61z)| lr 5.88e-04 | 322.72 ms | 52.3% bf16 MFU | 1623823 tok/s step 2422/19560 | loss 3.802832 (+0.01z)| norm 0.2711 (-0.75z)| lr 
5.88e-04 | 322.96 ms | 52.3% bf16 MFU | 1623802 tok/s step 2423/19560 | loss 3.718697 (-2.19z)| norm 0.2896 (-0.28z)| lr 5.88e-04 | 322.63 ms | 52.3% bf16 MFU | 1623864 tok/s step 2424/19560 | loss 3.781273 (-0.55z)| norm 0.2681 (-0.83z)| lr 5.88e-04 | 323.06 ms | 52.2% bf16 MFU | 1623815 tok/s step 2425/19560 | loss 3.829536 (+0.75z)| norm 0.3037 (+0.08z)| lr 5.88e-04 | 322.95 ms | 52.3% bf16 MFU | 1623797 tok/s step 2426/19560 | loss 3.792250 (-0.25z)| norm 0.3200 (+0.49z)| lr 5.88e-04 | 322.57 ms | 52.3% bf16 MFU | 1623874 tok/s step 2427/19560 | loss 3.777773 (-0.64z)| norm 0.3545 (+1.35z)| lr 5.88e-04 | 323.00 ms | 52.3% bf16 MFU | 1623839 tok/s step 2428/19560 | loss 3.848796 (+1.27z)| norm 0.3890 (+2.17z)| lr 5.88e-04 | 322.86 ms | 52.3% bf16 MFU | 1623842 tok/s step 2429/19560 | loss 3.702137 (-2.58z)| norm 0.3351 (+0.81z)| lr 5.88e-04 | 323.09 ms | 52.2% bf16 MFU | 1623786 tok/s step 2430/19560 | loss 3.786343 (-0.37z)| norm 0.3211 (+0.45z)| lr 5.88e-04 | 322.51 ms | 52.3% bf16 MFU | 1623880 tok/s step 2431/19560 | loss 3.892754 (+2.35z)| norm 0.3752 (+1.77z)| lr 5.88e-04 | 322.94 ms | 52.3% bf16 MFU | 1623861 tok/s step 2432/19560 | loss 3.822172 (+0.53z)| norm 0.3024 (-0.02z)| lr 5.88e-04 | 323.45 ms | 52.2% bf16 MFU | 1623714 tok/s step 2433/19560 | loss 3.783959 (-0.45z)| norm 0.2761 (-0.66z)| lr 5.88e-04 | 322.62 ms | 52.3% bf16 MFU | 1623783 tok/s step 2434/19560 | loss 3.843686 (+1.08z)| norm 0.2674 (-0.86z)| lr 5.88e-04 | 323.48 ms | 52.2% bf16 MFU | 1623632 tok/s step 2435/19560 | loss 3.753422 (-1.22z)| norm 0.3129 (+0.25z)| lr 5.88e-04 | 322.81 ms | 52.3% bf16 MFU | 1623658 tok/s step 2436/19560 | loss 3.817996 (+0.44z)| norm 0.3133 (+0.26z)| lr 5.88e-04 | 322.71 ms | 52.3% bf16 MFU | 1623707 tok/s step 2437/19560 | loss 3.898489 (+2.44z)| norm 0.3149 (+0.29z)| lr 5.88e-04 | 322.83 ms | 52.3% bf16 MFU | 1623725 tok/s step 2438/19560 | loss 3.801486 (+0.00z)| norm 0.3374 (+0.83z)| lr 5.88e-04 | 322.99 ms | 52.3% bf16 MFU | 1623699 tok/s step 
2439/19560 | loss 3.836968 (+0.88z)| norm 0.3077 (+0.09z)| lr 5.88e-04 | 323.26 ms | 52.2% bf16 MFU | 1623608 tok/s step 2440/19560 | loss 3.835354 (+0.83z)| norm 0.2971 (-0.18z)| lr 5.88e-04 | 322.96 ms | 52.3% bf16 MFU | 1623598 tok/s step 2441/19560 | loss 3.762598 (-1.03z)| norm 0.3062 (+0.05z)| lr 5.87e-04 | 322.05 ms | 52.4% bf16 MFU | 1623817 tok/s step 2442/19560 | loss 3.747158 (-1.40z)| norm 0.2832 (-0.53z)| lr 5.87e-04 | 322.90 ms | 52.3% bf16 MFU | 1623812 tok/s step 2443/19560 | loss 3.814126 (+0.32z)| norm 0.2765 (-0.69z)| lr 5.87e-04 | 322.90 ms | 52.3% bf16 MFU | 1623804 tok/s step 2444/19560 | loss 3.816314 (+0.39z)| norm 0.2924 (-0.30z)| lr 5.87e-04 | 322.71 ms | 52.3% bf16 MFU | 1623845 tok/s step 2445/19560 | loss 3.792135 (-0.23z)| norm 0.2857 (-0.46z)| lr 5.87e-04 | 322.90 ms | 52.3% bf16 MFU | 1623837 tok/s step 2446/19560 | loss 3.821135 (+0.53z)| norm 0.2627 (-1.02z)| lr 5.87e-04 | 322.45 ms | 52.3% bf16 MFU | 1623944 tok/s step 2447/19560 | loss 3.817029 (+0.42z)| norm 0.2672 (-0.90z)| lr 5.87e-04 | 323.19 ms | 52.2% bf16 MFU | 1623859 tok/s step 2448/19560 | loss 3.804526 (+0.08z)| norm 0.2972 (-0.16z)| lr 5.87e-04 | 322.58 ms | 52.3% bf16 MFU | 1623930 tok/s step 2449/19560 | loss 3.799471 (-0.06z)| norm 0.3367 (+0.80z)| lr 5.87e-04 | 323.20 ms | 52.2% bf16 MFU | 1623842 tok/s step 2450/19560 | loss 3.764297 (-0.98z)| norm 0.3217 (+0.43z)| lr 5.87e-04 | 322.38 ms | 52.4% bf16 MFU | 1623965 tok/s step 2451/19560 | loss 3.801799 (+0.02z)| norm 0.3449 (+0.99z)| lr 5.87e-04 | 322.56 ms | 52.3% bf16 MFU | 1624036 tok/s step 2452/19560 | loss 3.752546 (-1.28z)| norm 0.2941 (-0.25z)| lr 5.87e-04 | 322.83 ms | 52.3% bf16 MFU | 1624035 tok/s step 2453/19560 | loss 3.839753 (+1.10z)| norm 0.2628 (-1.01z)| lr 5.87e-04 | 323.27 ms | 52.2% bf16 MFU | 1623923 tok/s step 2454/19560 | loss 3.790058 (-0.26z)| norm 0.2620 (-1.01z)| lr 5.87e-04 | 322.71 ms | 52.3% bf16 MFU | 1623959 tok/s step 2455/19560 | loss 3.747726 (-1.41z)| norm 0.2911 (-0.29z)| lr 
5.87e-04 | 323.03 ms | 52.2% bf16 MFU | 1623913 tok/s step 2456/19560 | loss 3.799814 (+0.03z)| norm 0.2902 (-0.31z)| lr 5.87e-04 | 322.50 ms | 52.3% bf16 MFU | 1624003 tok/s step 2457/19560 | loss 3.787413 (-0.31z)| norm 0.2768 (-0.63z)| lr 5.87e-04 | 322.80 ms | 52.3% bf16 MFU | 1624012 tok/s step 2458/19560 | loss 3.826602 (+0.77z)| norm 0.3063 (+0.09z)| lr 5.87e-04 | 322.77 ms | 52.3% bf16 MFU | 1624029 tok/s step 2459/19560 | loss 3.799251 (+0.01z)| norm 0.2843 (-0.46z)| lr 5.87e-04 | 322.80 ms | 52.3% bf16 MFU | 1624037 tok/s step 2460/19560 | loss 3.815897 (+0.48z)| norm 0.2573 (-1.11z)| lr 5.87e-04 | 322.68 ms | 52.3% bf16 MFU | 1624075 tok/s step 2461/19560 | loss 3.819885 (+0.59z)| norm 0.2567 (-1.12z)| lr 5.87e-04 | 322.98 ms | 52.3% bf16 MFU | 1624036 tok/s step 2462/19560 | loss 3.760215 (-1.06z)| norm 0.2641 (-0.93z)| lr 5.87e-04 | 322.64 ms | 52.3% bf16 MFU | 1624085 tok/s step 2463/19560 | loss 3.754056 (-1.22z)| norm 0.2947 (-0.19z)| lr 5.87e-04 | 323.16 ms | 52.2% bf16 MFU | 1624000 tok/s step 2464/19560 | loss 3.745152 (-1.44z)| norm 0.2828 (-0.48z)| lr 5.87e-04 | 322.61 ms | 52.3% bf16 MFU | 1624058 tok/s step 2465/19560 | loss 3.758494 (-1.08z)| norm 0.2614 (-0.99z)| lr 5.87e-04 | 322.36 ms | 52.4% bf16 MFU | 1624175 tok/s step 2466/19560 | loss 3.754497 (-1.18z)| norm 0.2682 (-0.82z)| lr 5.87e-04 | 323.27 ms | 52.2% bf16 MFU | 1624059 tok/s step 2467/19560 | loss 3.766269 (-0.85z)| norm 0.2851 (-0.40z)| lr 5.87e-04 | 323.02 ms | 52.2% bf16 MFU | 1624010 tok/s step 2468/19560 | loss 3.829383 (+0.86z)| norm 0.3111 (+0.22z)| lr 5.87e-04 | 322.78 ms | 52.3% bf16 MFU | 1624023 tok/s step 2469/19560 | loss 3.764190 (-0.90z)| norm 0.2768 (-0.61z)| lr 5.87e-04 | 322.66 ms | 52.3% bf16 MFU | 1624065 tok/s step 2470/19560 | loss 3.774196 (-0.62z)| norm 0.2573 (-1.07z)| lr 5.87e-04 | 322.72 ms | 52.3% bf16 MFU | 1624091 tok/s step 2471/19560 | loss 3.791071 (-0.15z)| norm 0.3126 (+0.26z)| lr 5.87e-04 | 322.43 ms | 52.3% bf16 MFU | 1624189 tok/s step 
2472/19560 | loss 3.786976 (-0.26z)| norm 0.2780 (-0.57z)| lr 5.87e-04 | 323.10 ms | 52.2% bf16 MFU | 1624113 tok/s step 2473/19560 | loss 3.762369 (-0.92z)| norm 0.2931 (-0.20z)| lr 5.87e-04 | 322.69 ms | 52.3% bf16 MFU | 1624145 tok/s step 2474/19560 | loss 3.805726 (+0.25z)| norm 0.3322 (+0.75z)| lr 5.87e-04 | 322.72 ms | 52.3% bf16 MFU | 1624167 tok/s step 2475/19560 | loss 3.750921 (-1.22z)| norm 0.3430 (+1.01z)| lr 5.87e-04 | 322.67 ms | 52.3% bf16 MFU | 1624200 tok/s step 2476/19560 | loss 3.787066 (-0.25z)| norm 0.3554 (+1.29z)| lr 5.87e-04 | 322.62 ms | 52.3% bf16 MFU | 1624245 tok/s step 2477/19560 | loss 3.816283 (+0.54z)| norm 0.3069 (+0.12z)| lr 5.87e-04 | 322.98 ms | 52.3% bf16 MFU | 1624196 tok/s step 2478/19560 | loss 3.862461 (+1.78z)| norm 0.3078 (+0.14z)| lr 5.87e-04 | 322.94 ms | 52.3% bf16 MFU | 1624159 tok/s step 2479/19560 | loss 3.752958 (-1.18z)| norm 0.3202 (+0.43z)| lr 5.87e-04 | 322.82 ms | 52.3% bf16 MFU | 1624155 tok/s step 2480/19560 | loss 3.908223 (+2.88z)| norm 0.3426 (+0.95z)| lr 5.87e-04 | 322.50 ms | 52.3% bf16 MFU | 1624232 tok/s step 2481/19560 | loss 3.769568 (-0.73z)| norm 0.2594 (-1.03z)| lr 5.87e-04 | 322.44 ms | 52.3% bf16 MFU | 1624320 tok/s step 2482/19560 | loss 3.828462 (+0.81z)| norm 0.3101 (+0.28z)| lr 5.87e-04 | 323.13 ms | 52.2% bf16 MFU | 1624230 tok/s step 2483/19560 | loss 3.857288 (+1.53z)| norm 0.3293 (+0.89z)| lr 5.87e-04 | 323.02 ms | 52.2% bf16 MFU | 1624173 tok/s step 2484/19560 | loss 3.874381 (+1.93z)| norm 0.3151 (+0.56z)| lr 5.87e-04 | 323.00 ms | 52.3% bf16 MFU | 1624123 tok/s step 2485/19560 | loss 3.779583 (-0.48z)| norm 0.2735 (-0.85z)| lr 5.87e-04 | 322.60 ms | 52.3% bf16 MFU | 1624177 tok/s step 2486/19560 | loss 3.779656 (-0.47z)| norm 0.2786 (-0.66z)| lr 5.87e-04 | 322.39 ms | 52.3% bf16 MFU | 1624280 tok/s step 2487/19560 | loss 3.789139 (-0.23z)| norm 0.2780 (-0.68z)| lr 5.87e-04 | 323.12 ms | 52.2% bf16 MFU | 1624195 tok/s step 2488/19560 | loss 3.850091 (+1.32z)| norm 0.2988 (+0.05z)| lr 
5.87e-04 | 323.49 ms | 52.2% bf16 MFU | 1624021 tok/s step 2489/19560 | loss 3.758869 (-1.00z)| norm 0.2623 (-1.21z)| lr 5.87e-04 | 322.56 ms | 52.3% bf16 MFU | 1624091 tok/s step 2490/19560 | loss 3.851647 (+1.35z)| norm 0.2770 (-0.70z)| lr 5.87e-04 | 322.83 ms | 52.3% bf16 MFU | 1624088 tok/s step 2491/19560 | loss 3.816258 (+0.45z)| norm 0.2856 (-0.41z)| lr 5.87e-04 | 322.94 ms | 52.3% bf16 MFU | 1624057 tok/s step 2492/19560 | loss 3.778680 (-0.49z)| norm 0.2740 (-0.82z)| lr 5.87e-04 | 322.45 ms | 52.3% bf16 MFU | 1624152 tok/s step 2493/19560 | loss 3.769901 (-0.71z)| norm 0.2901 (-0.26z)| lr 5.87e-04 | 323.12 ms | 52.2% bf16 MFU | 1624075 tok/s step 2494/19560 | loss 3.799367 (+0.04z)| norm 0.2715 (-0.91z)| lr 5.87e-04 | 323.09 ms | 52.2% bf16 MFU | 1624007 tok/s step 2495/19560 | loss 3.751661 (-1.15z)| norm 0.2786 (-0.65z)| lr 5.87e-04 | 322.74 ms | 52.3% bf16 MFU | 1624031 tok/s step 2496/19560 | loss 3.812238 (+0.40z)| norm 0.3121 (+0.56z)| lr 5.87e-04 | 322.66 ms | 52.3% bf16 MFU | 1624074 tok/s step 2497/19560 | loss 3.781444 (-0.38z)| norm 0.3179 (+0.76z)| lr 5.87e-04 | 322.49 ms | 52.3% bf16 MFU | 1624158 tok/s step 2498/19560 | loss 3.815129 (+0.51z)| norm 0.3127 (+0.58z)| lr 5.87e-04 | 322.90 ms | 52.3% bf16 MFU | 1624135 tok/s step 2499/19560 | loss 3.825183 (+0.78z)| norm 0.3556 (+2.08z)| lr 5.87e-04 | 322.41 ms | 52.3% bf16 MFU | 1624236 tok/s step 2500/19560 | loss 3.776311 (-0.49z)| norm 0.3219 (+0.87z)| lr 5.87e-04 | 322.70 ms | 52.3% bf16 MFU | 1624260 tok/s val loss 3.772850
HellaSwag: 2674/10042 = 0.266282 step 2501/19560 | loss 3.722334 (-1.88z)| norm 0.2911 (-0.21z)| lr 5.87e-04 | 322.79 ms | 52.3% bf16 MFU | 1624260 tok/s step 2502/19560 | loss 3.795560 (+0.03z)| norm 0.2827 (-0.50z)| lr 5.87e-04 | 323.15 ms | 52.2% bf16 MFU | 1624168 tok/s step 2503/19560 | loss 3.767299 (-0.70z)| norm 0.2758 (-0.73z)| lr 5.87e-04 | 323.17 ms | 52.2% bf16 MFU | 1624077 tok/s step 2504/19560 | loss 3.769343 (-0.63z)| norm 0.2702 (-0.94z)| lr 5.87e-04 | 323.45 ms | 52.2% bf16 MFU | 1623919 tok/s step 2505/19560 | loss 3.815243 (+0.55z)| norm 0.2762 (-0.74z)| lr 5.87e-04 | 322.71 ms | 52.3% bf16 MFU | 1623954 tok/s step 2506/19560 | loss 3.802663 (+0.23z)| norm 0.2666 (-1.10z)| lr 5.87e-04 | 322.96 ms | 52.3% bf16 MFU | 1623926 tok/s step 2507/19560 | loss 3.794680 (+0.01z)| norm 0.2615 (-1.28z)| lr 5.87e-04 | 323.17 ms | 52.2% bf16 MFU | 1623846 tok/s step 2508/19560 | loss 3.724365 (-1.79z)| norm 0.2712 (-0.93z)| lr 5.87e-04 | 323.13 ms | 52.2% bf16 MFU | 1623780 tok/s step 2509/19560 | loss 3.802876 (+0.25z)| norm 0.2658 (-1.12z)| 
lr 5.86e-04 | 323.05 ms | 52.2% bf16 MFU | 1623737 tok/s step 2510/19560 | loss 3.786878 (-0.16z)| norm 0.3197 (+0.82z)| lr 5.86e-04 | 322.66 ms | 52.3% bf16 MFU | 1623795 tok/s step 2511/19560 | loss 3.743241 (-1.28z)| norm 0.2840 (-0.48z)| lr 5.86e-04 | 323.68 ms | 52.1% bf16 MFU | 1623593 tok/s step 2512/19560 | loss 3.775994 (-0.42z)| norm 0.3025 (+0.19z)| lr 5.86e-04 | 322.66 ms | 52.3% bf16 MFU | 1623658 tok/s step 2513/19560 | loss 3.739677 (-1.34z)| norm 0.2648 (-1.17z)| lr 5.86e-04 | 322.62 ms | 52.3% bf16 MFU | 1623729 tok/s step 2514/19560 | loss 3.749341 (-1.08z)| norm 0.2524 (-1.60z)| lr 5.86e-04 | 323.13 ms | 52.2% bf16 MFU | 1623670 tok/s step 2515/19560 | loss 3.768361 (-0.59z)| norm 0.2587 (-1.36z)| lr 5.86e-04 | 322.97 ms | 52.3% bf16 MFU | 1623654 tok/s step 2516/19560 | loss 3.800784 (+0.25z)| norm 0.2819 (-0.51z)| lr 5.86e-04 | 323.30 ms | 52.2% bf16 MFU | 1623554 tok/s step 2517/19560 | loss 3.740984 (-1.28z)| norm 0.2843 (-0.42z)| lr 5.86e-04 | 323.42 ms | 52.2% bf16 MFU | 1623430 tok/s step 2518/19560 | loss 3.728158 (-1.59z)| norm 0.2667 (-1.04z)| lr 5.86e-04 | 323.01 ms | 52.2% bf16 MFU | 1623414 tok/s step 2519/19560 | loss 3.761664 (-0.73z)| norm 0.2780 (-0.63z)| lr 5.86e-04 | 322.99 ms | 52.3% bf16 MFU | 1623406 tok/s step 2520/19560 | loss 3.794909 (+0.12z)| norm 0.3408 (+1.60z)| lr 5.86e-04 | 323.52 ms | 52.2% bf16 MFU | 1623264 tok/s step 2521/19560 | loss 3.767959 (-0.59z)| norm 0.3395 (+1.57z)| lr 5.86e-04 | 323.15 ms | 52.2% bf16 MFU | 1623221 tok/s step 2522/19560 | loss 3.774223 (-0.42z)| norm 0.3187 (+0.85z)| lr 5.86e-04 | 323.65 ms | 52.1% bf16 MFU | 1623056 tok/s step 2523/19560 | loss 3.752446 (-1.00z)| norm 0.2893 (-0.21z)| lr 5.86e-04 | 323.26 ms | 52.2% bf16 MFU | 1622997 tok/s step 2524/19560 | loss 3.843059 (+1.40z)| norm 0.3304 (+1.31z)| lr 5.86e-04 | 322.81 ms | 52.3% bf16 MFU | 1623054 tok/s step 2525/19560 | loss 3.894445 (+2.68z)| norm 0.3435 (+1.76z)| lr 5.86e-04 | 322.93 ms | 52.3% bf16 MFU | 1623077 tok/s step 
2526/19560 | loss 3.827628 (+0.94z)| norm 0.3492 (+1.93z)| lr 5.86e-04 | 322.91 ms | 52.3% bf16 MFU | 1623106 tok/s step 2527/19560 | loss 3.791441 (+0.01z)| norm 0.3024 (+0.24z)| lr 5.86e-04 | 322.99 ms | 52.3% bf16 MFU | 1623112 tok/s step 2528/19560 | loss 3.743499 (-1.21z)| norm 0.2571 (-1.37z)| lr 5.86e-04 | 323.16 ms | 52.2% bf16 MFU | 1623074 tok/s step 2529/19560 | loss 3.826960 (+0.93z)| norm 0.2831 (-0.43z)| lr 5.86e-04 | 323.07 ms | 52.2% bf16 MFU | 1623062 tok/s step 2530/19560 | loss 3.759873 (-0.78z)| norm 0.2599 (-1.25z)| lr 5.86e-04 | 322.71 ms | 52.3% bf16 MFU | 1623140 tok/s step 2531/19560 | loss 3.737898 (-1.32z)| norm 0.2398 (-1.93z)| lr 5.86e-04 | 322.51 ms | 52.3% bf16 MFU | 1623266 tok/s step 2532/19560 | loss 3.830634 (+1.02z)| norm 0.2851 (-0.34z)| lr 5.86e-04 | 323.00 ms | 52.3% bf16 MFU | 1623262 tok/s step 2533/19560 | loss 3.770744 (-0.50z)| norm 0.2748 (-0.70z)| lr 5.86e-04 | 322.84 ms | 52.3% bf16 MFU | 1623298 tok/s step 2534/19560 | loss 3.766336 (-0.61z)| norm 0.2988 (+0.14z)| lr 5.86e-04 | 323.06 ms | 52.2% bf16 MFU | 1623278 tok/s step 2535/19560 | loss 3.823014 (+0.84z)| norm 0.3177 (+0.79z)| lr 5.86e-04 | 322.60 ms | 52.3% bf16 MFU | 1623375 tok/s step 2536/19560 | loss 3.756813 (-0.86z)| norm 0.3016 (+0.22z)| lr 5.86e-04 | 322.81 ms | 52.3% bf16 MFU | 1623413 tok/s step 2537/19560 | loss 3.774621 (-0.40z)| norm 0.2418 (-1.84z)| lr 5.86e-04 | 322.78 ms | 52.3% bf16 MFU | 1623457 tok/s step 2538/19560 | loss 3.827524 (+0.95z)| norm 0.3084 (+0.46z)| lr 5.86e-04 | 323.05 ms | 52.2% bf16 MFU | 1623430 tok/s step 2539/19560 | loss 3.769520 (-0.53z)| norm 0.3935 (+3.25z)| lr 5.86e-04 | 323.38 ms | 52.2% bf16 MFU | 1623323 tok/s step 2540/19560 | loss 3.771699 (-0.48z)| norm 0.4029 (+3.38z)| lr 5.86e-04 | 322.77 ms | 52.3% bf16 MFU | 1623375 tok/s step 2541/19560 | loss 3.886908 (+2.41z)| norm 0.4037 (+3.24z)| lr 5.86e-04 | 323.42 ms | 52.2% bf16 MFU | 1623260 tok/s step 2542/19560 | loss 3.762893 (-0.73z)| norm 0.3164 (+0.56z)| lr 
5.86e-04 | 323.44 ms | 52.2% bf16 MFU | 1623145 tok/s step 2543/19560 | loss 3.759208 (-0.82z)| norm 0.2976 (-0.02z)| lr 5.86e-04 | 322.67 ms | 52.3% bf16 MFU | 1623228 tok/s step 2544/19560 | loss 3.858714 (+1.68z)| norm 0.2980 (+0.00z)| lr 5.86e-04 | 322.67 ms | 52.3% bf16 MFU | 1623310 tok/s step 2545/19560 | loss 3.749262 (-1.06z)| norm 0.2696 (-0.86z)| lr 5.86e-04 | 322.97 ms | 52.3% bf16 MFU | 1623310 tok/s step 2546/19560 | loss 3.896489 (+2.54z)| norm 0.2777 (-0.60z)| lr 5.86e-04 | 322.96 ms | 52.3% bf16 MFU | 1623315 tok/s step 2547/19560 | loss 3.686396 (-2.51z)| norm 0.2882 (-0.28z)| lr 5.86e-04 | 322.94 ms | 52.3% bf16 MFU | 1623323 tok/s step 2548/19560 | loss 3.794493 (+0.06z)| norm 0.2913 (-0.18z)| lr 5.86e-04 | 322.54 ms | 52.3% bf16 MFU | 1623432 tok/s step 2549/19560 | loss 3.784196 (-0.19z)| norm 0.2705 (-0.82z)| lr 5.86e-04 | 322.76 ms | 52.3% bf16 MFU | 1623480 tok/s step 2550/19560 | loss 3.830750 (+0.91z)| norm 0.2860 (-0.35z)| lr 5.86e-04 | 322.67 ms | 52.3% bf16 MFU | 1623547 tok/s step 2551/19560 | loss 3.865186 (+1.71z)| norm 0.2655 (-0.97z)| lr 5.86e-04 | 323.24 ms | 52.2% bf16 MFU | 1623468 tok/s step 2552/19560 | loss 3.774605 (-0.45z)| norm 0.2634 (-1.04z)| lr 5.86e-04 | 323.05 ms | 52.2% bf16 MFU | 1623442 tok/s step 2553/19560 | loss 3.774303 (-0.45z)| norm 0.2746 (-0.68z)| lr 5.86e-04 | 323.27 ms | 52.2% bf16 MFU | 1623361 tok/s step 2554/19560 | loss 3.786272 (-0.16z)| norm 0.2620 (-1.05z)| lr 5.86e-04 | 322.62 ms | 52.3% bf16 MFU | 1623447 tok/s step 2555/19560 | loss 3.826340 (+0.78z)| norm 0.2778 (-0.56z)| lr 5.86e-04 | 322.39 ms | 52.3% bf16 MFU | 1623586 tok/s step 2556/19560 | loss 3.744342 (-1.16z)| norm 0.2715 (-0.75z)| lr 5.86e-04 | 323.66 ms | 52.1% bf16 MFU | 1623399 tok/s step 2557/19560 | loss 3.691392 (-2.41z)| norm 0.2952 (+0.02z)| lr 5.86e-04 | 322.83 ms | 52.3% bf16 MFU | 1623432 tok/s step 2558/19560 | loss 3.779439 (-0.31z)| norm 0.2815 (-0.41z)| lr 5.86e-04 | 323.01 ms | 52.2% bf16 MFU | 1623417 tok/s step 
2559/19560 | loss 3.708851 (-1.97z)| norm 0.2813 (-0.40z)| lr 5.86e-04 | 322.92 ms | 52.3% bf16 MFU | 1623425 tok/s step 2560/19560 | loss 3.773670 (-0.41z)| norm 0.2633 (-0.99z)| lr 5.86e-04 | 323.08 ms | 52.2% bf16 MFU | 1623392 tok/s step 2561/19560 | loss 3.755945 (-0.83z)| norm 0.2724 (-0.69z)| lr 5.86e-04 | 323.10 ms | 52.2% bf16 MFU | 1623356 tok/s step 2562/19560 | loss 3.709023 (-1.91z)| norm 0.2666 (-0.88z)| lr 5.86e-04 | 323.49 ms | 52.2% bf16 MFU | 1623226 tok/s step 2563/19560 | loss 3.808296 (+0.44z)| norm 0.2635 (-0.96z)| lr 5.86e-04 | 322.98 ms | 52.3% bf16 MFU | 1623230 tok/s step 2564/19560 | loss 3.723370 (-1.56z)| norm 0.2674 (-0.82z)| lr 5.86e-04 | 323.03 ms | 52.2% bf16 MFU | 1623221 tok/s step 2565/19560 | loss 3.780755 (-0.18z)| norm 0.2965 (+0.13z)| lr 5.86e-04 | 322.77 ms | 52.3% bf16 MFU | 1623276 tok/s step 2566/19560 | loss 3.777951 (-0.25z)| norm 0.2943 (+0.07z)| lr 5.86e-04 | 322.56 ms | 52.3% bf16 MFU | 1623382 tok/s step 2567/19560 | loss 3.770980 (-0.40z)| norm 0.3471 (+1.79z)| lr 5.86e-04 | 322.93 ms | 52.3% bf16 MFU | 1623390 tok/s step 2568/19560 | loss 3.761218 (-0.63z)| norm 0.2960 (+0.12z)| lr 5.86e-04 | 323.27 ms | 52.2% bf16 MFU | 1623310 tok/s step 2569/19560 | loss 3.724352 (-1.52z)| norm 0.2638 (-0.92z)| lr 5.86e-04 | 322.51 ms | 52.3% bf16 MFU | 1623429 tok/s step 2570/19560 | loss 3.772637 (-0.35z)| norm 0.2868 (-0.17z)| lr 5.86e-04 | 322.94 ms | 52.3% bf16 MFU | 1623432 tok/s step 2571/19560 | loss 3.824644 (+0.92z)| norm 0.3117 (+0.63z)| lr 5.86e-04 | 323.34 ms | 52.2% bf16 MFU | 1623335 tok/s step 2572/19560 | loss 3.718022 (-1.65z)| norm 0.3293 (+1.19z)| lr 5.86e-04 | 322.78 ms | 52.3% bf16 MFU | 1623381 tok/s step 2573/19560 | loss 3.828629 (+1.01z)| norm 0.2860 (-0.22z)| lr 5.86e-04 | 322.77 ms | 52.3% bf16 MFU | 1623429 tok/s step 2574/19560 | loss 3.752099 (-0.81z)| norm 0.3100 (+0.55z)| lr 5.86e-04 | 323.18 ms | 52.2% bf16 MFU | 1623371 tok/s step 2575/19560 | loss 3.842842 (+1.36z)| norm 0.5265 (+6.27z)| lr 
5.86e-04 | 322.64 ms | 52.3% bf16 MFU | 1623452 tok/s step 2576/19560 | loss 3.771036 (-0.35z)| norm 0.3815 (+2.28z)| lr 5.85e-04 | 323.14 ms | 52.2% bf16 MFU | 1623404 tok/s step 2577/19560 | loss 3.797517 (+0.28z)| norm 0.3438 (+1.27z)| lr 5.85e-04 | 323.32 ms | 52.2% bf16 MFU | 1623313 tok/s step 2578/19560 | loss 3.859514 (+1.73z)| norm 0.3311 (+0.93z)| lr 5.85e-04 | 322.81 ms | 52.3% bf16 MFU | 1623355 tok/s step 2579/19560 | loss 3.775180 (-0.27z)| norm 0.3699 (+1.94z)| lr 5.85e-04 | 322.35 ms | 52.4% bf16 MFU | 1623510 tok/s step 2580/19560 | loss 3.853699 (+1.56z)| norm 0.3580 (+1.60z)| lr 5.85e-04 | 323.17 ms | 52.2% bf16 MFU | 1623451 tok/s step 2581/19560 | loss 3.851680 (+1.51z)| norm 0.3552 (+1.50z)| lr 5.85e-04 | 322.70 ms | 52.3% bf16 MFU | 1623512 tok/s step 2582/19560 | loss 3.778661 (-0.20z)| norm 0.3428 (+1.16z)| lr 5.85e-04 | 323.31 ms | 52.2% bf16 MFU | 1623418 tok/s step 2583/19560 | loss 3.757520 (-0.70z)| norm 0.3567 (+1.49z)| lr 5.85e-04 | 322.76 ms | 52.3% bf16 MFU | 1623466 tok/s step 2584/19560 | loss 3.735546 (-1.20z)| norm 0.3457 (+1.19z)| lr 5.85e-04 | 322.45 ms | 52.3% bf16 MFU | 1623589 tok/s step 2585/19560 | loss 3.788047 (+0.03z)| norm 0.2738 (-0.63z)| lr 5.85e-04 | 323.23 ms | 52.2% bf16 MFU | 1623511 tok/s step 2586/19560 | loss 3.766297 (-0.47z)| norm 0.2658 (-0.83z)| lr 5.85e-04 | 322.51 ms | 52.3% bf16 MFU | 1623617 tok/s step 2587/19560 | loss 3.718997 (-1.55z)| norm 0.2909 (-0.19z)| lr 5.85e-04 | 322.94 ms | 52.3% bf16 MFU | 1623612 tok/s step 2588/19560 | loss 3.840653 (+1.27z)| norm 0.2595 (-0.99z)| lr 5.85e-04 | 322.59 ms | 52.3% bf16 MFU | 1623692 tok/s step 2589/19560 | loss 3.744183 (-0.95z)| norm 0.2687 (-0.76z)| lr 5.85e-04 | 322.97 ms | 52.3% bf16 MFU | 1623676 tok/s step 2590/19560 | loss 3.798287 (+0.29z)| norm 0.2639 (-0.88z)| lr 5.85e-04 | 322.75 ms | 52.3% bf16 MFU | 1623714 tok/s step 2591/19560 | loss 3.765145 (-0.47z)| norm 0.2696 (-0.73z)| lr 5.85e-04 | 322.64 ms | 52.3% bf16 MFU | 1623777 tok/s step 
2592/19560 | loss 3.742227 (-1.00z)| norm 0.2750 (-0.59z)| lr 5.85e-04 | 322.87 ms | 52.3% bf16 MFU | 1623778 tok/s step 2593/19560 | loss 3.779512 (-0.15z)| norm 0.2660 (-0.82z)| lr 5.85e-04 | 322.47 ms | 52.3% bf16 MFU | 1623882 tok/s step 2594/19560 | loss 3.801584 (+0.36z)| norm 0.2956 (-0.08z)| lr 5.85e-04 | 323.13 ms | 52.2% bf16 MFU | 1623816 tok/s step 2595/19560 | loss 3.793080 (+0.16z)| norm 0.2754 (-0.59z)| lr 5.85e-04 | 323.10 ms | 52.2% bf16 MFU | 1623760 tok/s step 2596/19560 | loss 3.753289 (-0.75z)| norm 0.2539 (-1.12z)| lr 5.85e-04 | 322.71 ms | 52.3% bf16 MFU | 1623804 tok/s step 2597/19560 | loss 3.796544 (+0.24z)| norm 0.2878 (-0.26z)| lr 5.85e-04 | 323.19 ms | 52.2% bf16 MFU | 1623725 tok/s step 2598/19560 | loss 3.786915 (+0.02z)| norm 0.2469 (-1.29z)| lr 5.85e-04 | 323.05 ms | 52.2% bf16 MFU | 1623685 tok/s step 2599/19560 | loss 3.854761 (+1.57z)| norm 0.2550 (-1.07z)| lr 5.85e-04 | 322.81 ms | 52.3% bf16 MFU | 1623707 tok/s step 2600/19560 | loss 3.776588 (-0.23z)| norm 0.2955 (-0.06z)| lr 5.85e-04 | 322.16 ms | 52.4% bf16 MFU | 1623893 tok/s step 2601/19560 | loss 3.734419 (-1.19z)| norm 0.2892 (-0.21z)| lr 5.85e-04 | 322.81 ms | 52.3% bf16 MFU | 1623905 tok/s step 2602/19560 | loss 3.778741 (-0.17z)| norm 0.2831 (-0.36z)| lr 5.85e-04 | 322.55 ms | 52.3% bf16 MFU | 1623982 tok/s step 2603/19560 | loss 3.771581 (-0.34z)| norm 0.2774 (-0.49z)| lr 5.85e-04 | 323.22 ms | 52.2% bf16 MFU | 1623888 tok/s step 2604/19560 | loss 3.757559 (-0.65z)| norm 0.2688 (-0.70z)| lr 5.85e-04 | 322.59 ms | 52.3% bf16 MFU | 1623955 tok/s step 2605/19560 | loss 3.796062 (+0.23z)| norm 0.2731 (-0.58z)| lr 5.85e-04 | 322.66 ms | 52.3% bf16 MFU | 1624002 tok/s step 2606/19560 | loss 3.700093 (-1.94z)| norm 0.2431 (-1.32z)| lr 5.85e-04 | 322.29 ms | 52.4% bf16 MFU | 1624140 tok/s step 2607/19560 | loss 3.797020 (+0.28z)| norm 0.2480 (-1.18z)| lr 5.85e-04 | 322.66 ms | 52.3% bf16 MFU | 1624177 tok/s step 2608/19560 | loss 3.764773 (-0.45z)| norm 0.2611 (-0.84z)| lr 
5.85e-04 | 322.47 ms | 52.3% bf16 MFU | 1624261 tok/s step 2609/19560 | loss 3.765101 (-0.44z)| norm 0.2845 (-0.25z)| lr 5.85e-04 | 322.75 ms | 52.3% bf16 MFU | 1624271 tok/s step 2610/19560 | loss 3.773821 (-0.23z)| norm 0.3102 (+0.40z)| lr 5.85e-04 | 322.64 ms | 52.3% bf16 MFU | 1624307 tok/s step 2611/19560 | loss 3.754994 (-0.66z)| norm 0.3350 (+1.03z)| lr 5.85e-04 | 322.91 ms | 52.3% bf16 MFU | 1624273 tok/s step 2612/19560 | loss 3.693582 (-2.11z)| norm 0.3281 (+0.85z)| lr 5.85e-04 | 322.94 ms | 52.3% bf16 MFU | 1624234 tok/s step 2613/19560 | loss 3.705609 (-1.78z)| norm 0.3085 (+0.34z)| lr 5.85e-04 | 322.86 ms | 52.3% bf16 MFU | 1624216 tok/s step 2614/19560 | loss 3.755591 (-0.59z)| norm 0.2855 (-0.24z)| lr 5.85e-04 | 323.40 ms | 52.2% bf16 MFU | 1624063 tok/s step 2615/19560 | loss 3.773777 (-0.15z)| norm 0.2858 (-0.23z)| lr 5.85e-04 | 322.75 ms | 52.3% bf16 MFU | 1624081 tok/s step 2616/19560 | loss 3.769355 (-0.25z)| norm 0.2870 (-0.20z)| lr 5.85e-04 | 322.46 ms | 52.3% bf16 MFU | 1624171 tok/s step 2617/19560 | loss 3.776553 (-0.08z)| norm 0.2764 (-0.47z)| lr 5.85e-04 | 322.91 ms | 52.3% bf16 MFU | 1624144 tok/s step 2618/19560 | loss 3.791133 (+0.29z)| norm 0.2820 (-0.33z)| lr 5.85e-04 | 322.27 ms | 52.4% bf16 MFU | 1624281 tok/s step 2619/19560 | loss 3.816413 (+0.90z)| norm 0.2948 (-0.01z)| lr 5.85e-04 | 323.30 ms | 52.2% bf16 MFU | 1624151 tok/s step 2620/19560 | loss 3.745913 (-0.80z)| norm 0.3202 (+0.63z)| lr 5.85e-04 | 322.90 ms | 52.3% bf16 MFU | 1624128 tok/s step 2621/19560 | loss 3.774220 (-0.12z)| norm 0.3026 (+0.18z)| lr 5.85e-04 | 322.63 ms | 52.3% bf16 MFU | 1624173 tok/s step 2622/19560 | loss 3.727488 (-1.23z)| norm 0.2621 (-0.85z)| lr 5.85e-04 | 322.56 ms | 52.3% bf16 MFU | 1624234 tok/s step 2623/19560 | loss 3.808542 (+0.71z)| norm 0.2942 (-0.03z)| lr 5.85e-04 | 323.03 ms | 52.2% bf16 MFU | 1624174 tok/s step 2624/19560 | loss 3.792736 (+0.34z)| norm 0.2948 (-0.02z)| lr 5.85e-04 | 322.13 ms | 52.4% bf16 MFU | 1624344 tok/s step 
2625/19560 | loss 3.768869 (-0.24z)| norm 0.3070 (+0.29z)| lr 5.85e-04 | 322.66 ms | 52.3% bf16 MFU | 1624372 tok/s step 2626/19560 | loss 3.724691 (-1.28z)| norm 0.2756 (-0.50z)| lr 5.85e-04 | 322.71 ms | 52.3% bf16 MFU | 1624386 tok/s step 2627/19560 | loss 3.817633 (+0.96z)| norm 0.2917 (-0.07z)| lr 5.85e-04 | 322.41 ms | 52.3% bf16 MFU | 1624474 tok/s step 2628/19560 | loss 3.806212 (+0.68z)| norm 0.2950 (+0.02z)| lr 5.85e-04 | 322.56 ms | 52.3% bf16 MFU | 1624520 tok/s step 2629/19560 | loss 3.802204 (+0.57z)| norm 0.2923 (-0.05z)| lr 5.85e-04 | 323.56 ms | 52.2% bf16 MFU | 1624312 tok/s step 2630/19560 | loss 3.821895 (+1.04z)| norm 0.2882 (-0.16z)| lr 5.85e-04 | 322.85 ms | 52.3% bf16 MFU | 1624293 tok/s step 2631/19560 | loss 3.796062 (+0.41z)| norm 0.2858 (-0.22z)| lr 5.85e-04 | 322.68 ms | 52.3% bf16 MFU | 1624318 tok/s step 2632/19560 | loss 3.742098 (-0.89z)| norm 0.2800 (-0.37z)| lr 5.85e-04 | 322.90 ms | 52.3% bf16 MFU | 1624286 tok/s step 2633/19560 | loss 3.907814 (+2.98z)| norm 0.2788 (-0.41z)| lr 5.85e-04 | 322.83 ms | 52.3% bf16 MFU | 1624275 tok/s step 2634/19560 | loss 3.795686 (+0.37z)| norm 0.2671 (-0.71z)| lr 5.85e-04 | 322.93 ms | 52.3% bf16 MFU | 1624238 tok/s step 2635/19560 | loss 3.753474 (-0.60z)| norm 0.2645 (-0.78z)| lr 5.85e-04 | 322.89 ms | 52.3% bf16 MFU | 1624214 tok/s step 2636/19560 | loss 3.765162 (-0.34z)| norm 0.2791 (-0.40z)| lr 5.85e-04 | 322.52 ms | 52.3% bf16 MFU | 1624284 tok/s step 2637/19560 | loss 3.719581 (-1.38z)| norm 0.2829 (-0.31z)| lr 5.85e-04 | 322.60 ms | 52.3% bf16 MFU | 1624329 tok/s step 2638/19560 | loss 3.752830 (-0.60z)| norm 0.2555 (-1.00z)| lr 5.85e-04 | 322.95 ms | 52.3% bf16 MFU | 1624285 tok/s step 2639/19560 | loss 3.785881 (+0.16z)| norm 0.2757 (-0.48z)| lr 5.85e-04 | 322.52 ms | 52.3% bf16 MFU | 1624350 tok/s step 2640/19560 | loss 3.804918 (+0.60z)| norm 0.3036 (+0.24z)| lr 5.84e-04 | 323.01 ms | 52.2% bf16 MFU | 1624289 tok/s step 2641/19560 | loss 3.799178 (+0.45z)| norm 0.3035 (+0.23z)| lr 
5.84e-04 | 322.43 ms | 52.3% bf16 MFU | 1624377 tok/s step 2642/19560 | loss 3.816880 (+0.85z)| norm 0.3268 (+0.82z)| lr 5.84e-04 | 322.67 ms | 52.3% bf16 MFU | 1624399 tok/s step 2643/19560 | loss 3.814546 (+0.79z)| norm 0.2817 (-0.35z)| lr 5.84e-04 | 322.69 ms | 52.3% bf16 MFU | 1624417 tok/s step 2644/19560 | loss 3.842889 (+1.43z)| norm 0.2701 (-0.65z)| lr 5.84e-04 | 323.40 ms | 52.2% bf16 MFU | 1624254 tok/s step 2645/19560 | loss 3.768966 (-0.28z)| norm 0.2945 (-0.02z)| lr 5.84e-04 | 322.57 ms | 52.3% bf16 MFU | 1624310 tok/s step 2646/19560 | loss 3.705042 (-1.75z)| norm 0.2978 (+0.06z)| lr 5.84e-04 | 322.27 ms | 52.4% bf16 MFU | 1624438 tok/s step 2647/19560 | loss 3.708200 (-1.65z)| norm 0.2731 (-0.58z)| lr 5.84e-04 | 322.55 ms | 52.3% bf16 MFU | 1624489 tok/s step 2648/19560 | loss 3.691811 (-1.97z)| norm 0.2654 (-0.77z)| lr 5.84e-04 | 322.66 ms | 52.3% bf16 MFU | 1624509 tok/s step 2649/19560 | loss 3.719347 (-1.34z)| norm 0.2830 (-0.30z)| lr 5.84e-04 | 323.11 ms | 52.2% bf16 MFU | 1624416 tok/s step 2650/19560 | loss 3.830149 (+1.11z)| norm 0.2566 (-0.98z)| lr 5.84e-04 | 322.54 ms | 52.3% bf16 MFU | 1624470 tok/s step 2651/19560 | loss 3.740602 (-0.87z)| norm 0.2744 (-0.51z)| lr 5.84e-04 | 322.19 ms | 52.4% bf16 MFU | 1624610 tok/s step 2652/19560 | loss 3.757222 (-0.49z)| norm 0.2688 (-0.64z)| lr 5.84e-04 | 323.05 ms | 52.2% bf16 MFU | 1624526 tok/s step 2653/19560 | loss 3.749304 (-0.66z)| norm 0.2798 (-0.34z)| lr 5.84e-04 | 322.97 ms | 52.3% bf16 MFU | 1624466 tok/s step 2654/19560 | loss 3.769788 (-0.18z)| norm 0.3237 (+0.82z)| lr 5.84e-04 | 323.16 ms | 52.2% bf16 MFU | 1624361 tok/s step 2655/19560 | loss 3.744332 (-0.75z)| norm 0.3197 (+0.71z)| lr 5.84e-04 | 322.76 ms | 52.3% bf16 MFU | 1624362 tok/s step 2656/19560 | loss 3.766641 (-0.25z)| norm 0.2971 (+0.10z)| lr 5.84e-04 | 322.55 ms | 52.3% bf16 MFU | 1624417 tok/s step 2657/19560 | loss 3.810688 (+0.77z)| norm 0.2955 (+0.06z)| lr 5.84e-04 | 322.98 ms | 52.3% bf16 MFU | 1624360 tok/s step 
2658/19560 | loss 3.735004 (-0.97z)| norm 0.2918 (-0.04z)| lr 5.84e-04 | 324.82 ms | 52.0% bf16 MFU | 1623846 tok/s step 2659/19560 | loss 3.736697 (-0.93z)| norm 0.2501 (-1.16z)| lr 5.84e-04 | 321.85 ms | 52.4% bf16 MFU | 1624103 tok/s step 2660/19560 | loss 3.782734 (+0.14z)| norm 0.2835 (-0.27z)| lr 5.84e-04 | 322.79 ms | 52.3% bf16 MFU | 1624111 tok/s step 2661/19560 | loss 3.758779 (-0.41z)| norm 0.3272 (+0.88z)| lr 5.84e-04 | 324.11 ms | 52.1% bf16 MFU | 1623788 tok/s step 2662/19560 | loss 3.780130 (+0.08z)| norm 0.3283 (+0.90z)| lr 5.84e-04 | 323.06 ms | 52.2% bf16 MFU | 1623742 tok/s step 2663/19560 | loss 3.822543 (+1.06z)| norm 0.2999 (+0.15z)| lr 5.84e-04 | 322.41 ms | 52.3% bf16 MFU | 1623862 tok/s step 2664/19560 | loss 3.849291 (+1.64z)| norm 0.2927 (-0.03z)| lr 5.84e-04 | 322.32 ms | 52.4% bf16 MFU | 1623998 tok/s step 2665/19560 | loss 3.763551 (-0.31z)| norm 0.2765 (-0.47z)| lr 5.84e-04 | 322.92 ms | 52.3% bf16 MFU | 1623978 tok/s step 2666/19560 | loss 3.835428 (+1.33z)| norm 0.2681 (-0.69z)| lr 5.84e-04 | 322.25 ms | 52.4% bf16 MFU | 1624126 tok/s step 2667/19560 | loss 3.733766 (-0.98z)| norm 0.3127 (+0.53z)| lr 5.84e-04 | 323.26 ms | 52.2% bf16 MFU | 1624013 tok/s step 2668/19560 | loss 3.747480 (-0.67z)| norm 0.3009 (+0.24z)| lr 5.84e-04 | 323.20 ms | 52.2% bf16 MFU | 1623922 tok/s step 2669/19560 | loss 3.738165 (-0.87z)| norm 0.3301 (+1.13z)| lr 5.84e-04 | 323.07 ms | 52.2% bf16 MFU | 1623867 tok/s step 2670/19560 | loss 3.703669 (-1.64z)| norm 0.3345 (+1.25z)| lr 5.84e-04 | 322.28 ms | 52.4% bf16 MFU | 1624014 tok/s step 2671/19560 | loss 3.847188 (+1.62z)| norm 0.3225 (+0.89z)| lr 5.84e-04 | 323.18 ms | 52.2% bf16 MFU | 1623928 tok/s step 2672/19560 | loss 3.732490 (-0.98z)| norm 0.2996 (+0.22z)| lr 5.84e-04 | 323.17 ms | 52.2% bf16 MFU | 1623848 tok/s step 2673/19560 | loss 3.766713 (-0.19z)| norm 0.3005 (+0.23z)| lr 5.84e-04 | 323.27 ms | 52.2% bf16 MFU | 1623746 tok/s step 2674/19560 | loss 3.739775 (-0.81z)| norm 0.2781 (-0.42z)| lr 
5.84e-04 | 322.96 ms | 52.3% bf16 MFU | 1623727 tok/s step 2675/19560 | loss 3.773391 (-0.03z)| norm 0.2794 (-0.38z)| lr 5.84e-04 | 322.80 ms | 52.3% bf16 MFU | 1623749 tok/s step 2676/19560 | loss 3.718816 (-1.32z)| norm 0.2675 (-0.72z)| lr 5.84e-04 | 322.78 ms | 52.3% bf16 MFU | 1623776 tok/s step 2677/19560 | loss 3.795676 (+0.52z)| norm 0.2863 (-0.18z)| lr 5.84e-04 | 322.70 ms | 52.3% bf16 MFU | 1623821 tok/s step 2678/19560 | loss 3.677610 (-2.25z)| norm 0.2649 (-0.80z)| lr 5.84e-04 | 322.79 ms | 52.3% bf16 MFU | 1623842 tok/s step 2679/19560 | loss 3.824577 (+1.25z)| norm 0.3052 (+0.37z)| lr 5.84e-04 | 322.94 ms | 52.3% bf16 MFU | 1623824 tok/s step 2680/19560 | loss 3.718174 (-1.28z)| norm 0.3086 (+0.46z)| lr 5.84e-04 | 322.93 ms | 52.3% bf16 MFU | 1623809 tok/s step 2681/19560 | loss 3.757776 (-0.34z)| norm 0.3013 (+0.24z)| lr 5.84e-04 | 322.78 ms | 52.3% bf16 MFU | 1623832 tok/s step 2682/19560 | loss 3.756133 (-0.37z)| norm 0.2983 (+0.15z)| lr 5.84e-04 | 322.80 ms | 52.3% bf16 MFU | 1623849 tok/s step 2683/19560 | loss 3.802548 (+0.74z)| norm 0.2878 (-0.17z)| lr 5.84e-04 | 323.23 ms | 52.2% bf16 MFU | 1623758 tok/s step 2684/19560 | loss 3.757248 (-0.34z)| norm 0.2534 (-1.17z)| lr 5.84e-04 | 322.98 ms | 52.3% bf16 MFU | 1623734 tok/s step 2685/19560 | loss 3.764011 (-0.20z)| norm 0.2525 (-1.18z)| lr 5.84e-04 | 322.23 ms | 52.4% bf16 MFU | 1623899 tok/s step 2686/19560 | loss 3.736913 (-0.85z)| norm 0.2624 (-0.89z)| lr 5.84e-04 | 323.35 ms | 52.2% bf16 MFU | 1623776 tok/s step 2687/19560 | loss 3.788915 (+0.40z)| norm 0.2790 (-0.40z)| lr 5.84e-04 | 322.77 ms | 52.3% bf16 MFU | 1623803 tok/s step 2688/19560 | loss 3.717040 (-1.33z)| norm 0.2799 (-0.38z)| lr 5.84e-04 | 323.15 ms | 52.2% bf16 MFU | 1623735 tok/s step 2689/19560 | loss 3.723266 (-1.17z)| norm 0.2862 (-0.20z)| lr 5.84e-04 | 323.20 ms | 52.2% bf16 MFU | 1623657 tok/s step 2690/19560 | loss 3.783320 (+0.27z)| norm 0.2976 (+0.12z)| lr 5.84e-04 | 322.51 ms | 52.3% bf16 MFU | 1623758 tok/s step 
2691/19560 | loss 3.727760 (-1.07z)| norm 0.2767 (-0.49z)| lr 5.84e-04 | 322.74 ms | 52.3% bf16 MFU | 1623794 tok/s step 2692/19560 | loss 3.720562 (-1.24z)| norm 0.2791 (-0.42z)| lr 5.84e-04 | 323.69 ms | 52.1% bf16 MFU | 1623590 tok/s step 2693/19560 | loss 3.746448 (-0.61z)| norm 0.2933 (-0.01z)| lr 5.84e-04 | 323.00 ms | 52.3% bf16 MFU | 1623569 tok/s step 2694/19560 | loss 3.810347 (+0.93z)| norm 0.3174 (+0.70z)| lr 5.84e-04 | 322.69 ms | 52.3% bf16 MFU | 1623629 tok/s step 2695/19560 | loss 3.769260 (-0.06z)| norm 0.3644 (+2.06z)| lr 5.84e-04 | 323.00 ms | 52.3% bf16 MFU | 1623607 tok/s step 2696/19560 | loss 3.839933 (+1.62z)| norm 0.3639 (+2.00z)| lr 5.84e-04 | 322.81 ms | 52.3% bf16 MFU | 1623633 tok/s step 2697/19560 | loss 3.750230 (-0.54z)| norm 0.3380 (+1.23z)| lr 5.84e-04 | 323.01 ms | 52.2% bf16 MFU | 1623607 tok/s step 2698/19560 | loss 3.731635 (-0.97z)| norm 0.2983 (+0.09z)| lr 5.84e-04 | 322.68 ms | 52.3% bf16 MFU | 1623667 tok/s step 2699/19560 | loss 3.762305 (-0.23z)| norm 0.3019 (+0.20z)| lr 5.84e-04 | 322.80 ms | 52.3% bf16 MFU | 1623694 tok/s step 2700/19560 | loss 3.724539 (-1.14z)| norm 0.2820 (-0.36z)| lr 5.84e-04 | 323.16 ms | 52.2% bf16 MFU | 1623627 tok/s step 2701/19560 | loss 3.823443 (+1.25z)| norm 0.2893 (-0.15z)| lr 5.84e-04 | 323.34 ms | 52.2% bf16 MFU | 1623520 tok/s step 2702/19560 | loss 3.746871 (-0.60z)| norm 0.2825 (-0.34z)| lr 5.83e-04 | 323.23 ms | 52.2% bf16 MFU | 1623446 tok/s step 2703/19560 | loss 3.794259 (+0.56z)| norm 0.3196 (+0.95z)| lr 5.83e-04 | 323.01 ms | 52.2% bf16 MFU | 1623430 tok/s step 2704/19560 | loss 3.685556 (-2.04z)| norm 0.3290 (+1.34z)| lr 5.83e-04 | 322.97 ms | 52.3% bf16 MFU | 1623425 tok/s step 2705/19560 | loss 3.747341 (-0.55z)| norm 0.3277 (+1.31z)| lr 5.83e-04 | 323.11 ms | 52.2% bf16 MFU | 1623385 tok/s step 2706/19560 | loss 3.732234 (-0.90z)| norm 0.2974 (+0.20z)| lr 5.83e-04 | 322.59 ms | 52.3% bf16 MFU | 1623477 tok/s step 2707/19560 | loss 3.868419 (+2.34z)| norm 0.3247 (+1.27z)| lr 
5.83e-04 | 323.13 ms | 52.2% bf16 MFU | 1623430 tok/s step 2708/19560 | loss 3.744354 (-0.60z)| norm 0.2904 (-0.03z)| lr 5.83e-04 | 323.01 ms | 52.2% bf16 MFU | 1623414 tok/s step 2709/19560 | loss 3.739653 (-0.70z)| norm 0.2750 (-0.62z)| lr 5.83e-04 | 322.91 ms | 52.3% bf16 MFU | 1623425 tok/s step 2710/19560 | loss 3.700207 (-1.64z)| norm 0.2680 (-0.89z)| lr 5.83e-04 | 322.60 ms | 52.3% bf16 MFU | 1623514 tok/s step 2711/19560 | loss 3.690817 (-1.83z)| norm 0.2732 (-0.67z)| lr 5.83e-04 | 322.59 ms | 52.3% bf16 MFU | 1623599 tok/s step 2712/19560 | loss 3.788317 (+0.50z)| norm 0.2518 (-1.56z)| lr 5.83e-04 | 323.22 ms | 52.2% bf16 MFU | 1623523 tok/s step 2713/19560 | loss 3.777300 (+0.24z)| norm 0.2606 (-1.18z)| lr 5.83e-04 | 322.86 ms | 52.3% bf16 MFU | 1623541 tok/s step 2714/19560 | loss 3.719654 (-1.13z)| norm 0.2964 (+0.33z)| lr 5.83e-04 | 322.95 ms | 52.3% bf16 MFU | 1623535 tok/s step 2715/19560 | loss 3.748741 (-0.45z)| norm 0.2572 (-1.32z)| lr 5.83e-04 | 323.03 ms | 52.2% bf16 MFU | 1623511 tok/s step 2716/19560 | loss 3.693574 (-1.74z)| norm 0.2572 (-1.31z)| lr 5.83e-04 | 323.39 ms | 52.2% bf16 MFU | 1623398 tok/s step 2717/19560 | loss 3.760852 (-0.13z)| norm 0.2599 (-1.20z)| lr 5.83e-04 | 322.85 ms | 52.3% bf16 MFU | 1623424 tok/s step 2718/19560 | loss 3.785863 (+0.47z)| norm 0.2824 (-0.26z)| lr 5.83e-04 | 323.40 ms | 52.2% bf16 MFU | 1623312 tok/s step 2719/19560 | loss 3.759660 (-0.16z)| norm 0.2761 (-0.53z)| lr 5.83e-04 | 322.87 ms | 52.3% bf16 MFU | 1623339 tok/s step 2720/19560 | loss 3.767508 (+0.03z)| norm 0.2876 (-0.04z)| lr 5.83e-04 | 322.99 ms | 52.3% bf16 MFU | 1623334 tok/s step 2721/19560 | loss 3.811087 (+1.07z)| norm 0.2893 (+0.02z)| lr 5.83e-04 | 323.20 ms | 52.2% bf16 MFU | 1623277 tok/s step 2722/19560 | loss 3.722528 (-1.04z)| norm 0.2689 (-0.83z)| lr 5.83e-04 | 323.23 ms | 52.2% bf16 MFU | 1623215 tok/s step 2723/19560 | loss 3.795530 (+0.71z)| norm 0.2860 (-0.11z)| lr 5.83e-04 | 322.74 ms | 52.3% bf16 MFU | 1623279 tok/s step 
2724/19560 | loss 3.723794 (-1.00z)| norm 0.2954 (+0.28z)| lr 5.83e-04 | 323.20 ms | 52.2% bf16 MFU | 1623223 tok/s step 2725/19560 | loss 3.733434 (-0.76z)| norm 0.3330 (+1.85z)| lr 5.83e-04 | 323.45 ms | 52.2% bf16 MFU | 1623109 tok/s step 2726/19560 | loss 3.753031 (-0.29z)| norm 0.3204 (+1.30z)| lr 5.83e-04 | 322.78 ms | 52.3% bf16 MFU | 1623167 tok/s step 2727/19560 | loss 3.751918 (-0.30z)| norm 0.2999 (+0.42z)| lr 5.83e-04 | 323.01 ms | 52.3% bf16 MFU | 1623166 tok/s step 2728/19560 | loss 3.771125 (+0.17z)| norm 0.3253 (+1.48z)| lr 5.83e-04 | 323.29 ms | 52.2% bf16 MFU | 1623093 tok/s step 2729/19560 | loss 3.743728 (-0.50z)| norm 0.3100 (+0.83z)| lr 5.83e-04 | 323.07 ms | 52.2% bf16 MFU | 1623080 tok/s step 2730/19560 | loss 3.723992 (-0.97z)| norm 0.3189 (+1.18z)| lr 5.83e-04 | 322.67 ms | 52.3% bf16 MFU | 1623167 tok/s step 2731/19560 | loss 3.727577 (-0.87z)| norm 0.3084 (+0.73z)| lr 5.83e-04 | 323.01 ms | 52.3% bf16 MFU | 1623166 tok/s step 2732/19560 | loss 3.852580 (+2.10z)| norm 0.2952 (+0.16z)| lr 5.83e-04 | 322.58 ms | 52.3% bf16 MFU | 1623272 tok/s step 2733/19560 | loss 3.779510 (+0.37z)| norm 0.2818 (-0.41z)| lr 5.83e-04 | 324.23 ms | 52.1% bf16 MFU | 1622960 tok/s step 2734/19560 | loss 3.749238 (-0.37z)| norm 0.2785 (-0.57z)| lr 5.83e-04 | 323.58 ms | 52.2% bf16 MFU | 1622826 tok/s step 2735/19560 | loss 3.768931 (+0.11z)| norm 0.2847 (-0.32z)| lr 5.83e-04 | 323.26 ms | 52.2% bf16 MFU | 1622777 tok/s step 2736/19560 | loss 3.746121 (-0.43z)| norm 0.2511 (-1.77z)| lr 5.83e-04 | 323.35 ms | 52.2% bf16 MFU | 1622710 tok/s step 2737/19560 | loss 3.796119 (+0.76z)| norm 0.2754 (-0.71z)| lr 5.83e-04 | 323.12 ms | 52.2% bf16 MFU | 1622704 tok/s step 2738/19560 | loss 3.784251 (+0.47z)| norm 0.2677 (-1.03z)| lr 5.83e-04 | 322.75 ms | 52.3% bf16 MFU | 1622790 tok/s step 2739/19560 | loss 3.764137 (-0.01z)| norm 0.2459 (-1.94z)| lr 5.83e-04 | 322.82 ms | 52.3% bf16 MFU | 1622855 tok/s step 2740/19560 | loss 3.724480 (-0.97z)| norm 0.2550 (-1.53z)| lr 
5.83e-04 | 322.94 ms | 52.3% bf16 MFU | 1622888 tok/s step 2741/19560 | loss 3.689567 (-1.80z)| norm 0.2391 (-2.16z)| lr 5.83e-04 | 322.67 ms | 52.3% bf16 MFU | 1622985 tok/s step 2742/19560 | loss 3.825511 (+1.44z)| norm 0.2819 (-0.33z)| lr 5.83e-04 | 322.53 ms | 52.3% bf16 MFU | 1623114 tok/s step 2743/19560 | loss 3.701335 (-1.50z)| norm 0.3311 (+1.73z)| lr 5.83e-04 | 322.91 ms | 52.3% bf16 MFU | 1623141 tok/s step 2744/19560 | loss 3.809417 (+1.05z)| norm 0.3355 (+1.87z)| lr 5.83e-04 | 322.56 ms | 52.3% bf16 MFU | 1623253 tok/s step 2745/19560 | loss 3.727582 (-0.87z)| norm 0.3408 (+2.04z)| lr 5.83e-04 | 323.21 ms | 52.2% bf16 MFU | 1623197 tok/s step 2746/19560 | loss 3.701106 (-1.46z)| norm 0.3072 (+0.66z)| lr 5.83e-04 | 322.86 ms | 52.3% bf16 MFU | 1623231 tok/s step 2747/19560 | loss 3.739161 (-0.56z)| norm 0.3124 (+0.86z)| lr 5.83e-04 | 322.59 ms | 52.3% bf16 MFU | 1623332 tok/s step 2748/19560 | loss 3.903172 (+3.12z)| norm 0.2764 (-0.59z)| lr 5.83e-04 | 322.74 ms | 52.3% bf16 MFU | 1623389 tok/s step 2749/19560 | loss 3.699767 (-1.43z)| norm 0.2917 (+0.04z)| lr 5.83e-04 | 323.22 ms | 52.2% bf16 MFU | 1623324 tok/s step 2750/19560 | loss 3.823112 (+1.30z)| norm 0.3000 (+0.37z)| lr 5.83e-04 | 322.96 ms | 52.3% bf16 MFU | 1623328 tok/s val loss 3.739780 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating HellaSwag: 70/79 HellaSwag: 2645/10042 = 0.263394 step 2751/19560 | loss 3.771020 (+0.15z)| norm 0.3100 (+0.77z)| lr 5.83e-04 | 322.62 ms | 52.3% bf16 MFU | 1623416 tok/s step 2752/19560 | loss 3.746324 (-0.39z)| norm 0.2942 (+0.12z)| lr 5.83e-04 | 322.98 ms | 52.3% bf16 MFU | 1623409 tok/s step 2753/19560 | loss 3.734823 (-0.64z)| norm 0.2805 (-0.43z)| lr 5.83e-04 | 323.23 ms | 52.2% bf16 MFU | 1623340 tok/s step 2754/19560 | loss 3.793948 (+0.66z)| norm 0.2784 (-0.52z)| lr 5.83e-04 | 322.87 ms | 52.3% 
bf16 MFU | 1623365 tok/s step 2755/19560 | loss 3.744542 (-0.43z)| norm 0.2580 (-1.33z)| lr 5.83e-04 | 323.70 ms | 52.1% bf16 MFU | 1623181 tok/s step 2756/19560 | loss 3.758360 (-0.11z)| norm 0.2727 (-0.73z)| lr 5.83e-04 | 322.75 ms | 52.3% bf16 MFU | 1623244 tok/s step 2757/19560 | loss 3.720559 (-0.95z)| norm 0.2505 (-1.60z)| lr 5.83e-04 | 322.99 ms | 52.3% bf16 MFU | 1623244 tok/s step 2758/19560 | loss 3.645978 (-2.55z)| norm 0.2625 (-1.11z)| lr 5.83e-04 | 323.18 ms | 52.2% bf16 MFU | 1623195 tok/s step 2759/19560 | loss 3.684530 (-1.66z)| norm 0.3170 (+1.07z)| lr 5.83e-04 | 322.88 ms | 52.3% bf16 MFU | 1623224 tok/s step 2760/19560 | loss 3.790644 (+0.65z)| norm 0.3459 (+2.16z)| lr 5.83e-04 | 323.08 ms | 52.2% bf16 MFU | 1623203 tok/s step 2761/19560 | loss 3.762017 (+0.05z)| norm 0.3278 (+1.43z)| lr 5.83e-04 | 322.74 ms | 52.3% bf16 MFU | 1623267 tok/s step 2762/19560 | loss 3.765527 (+0.14z)| norm 0.3082 (+0.65z)| lr 5.82e-04 | 322.64 ms | 52.3% bf16 MFU | 1623355 tok/s step 2763/19560 | loss 3.719879 (-0.90z)| norm 0.3215 (+1.15z)| lr 5.82e-04 | 323.28 ms | 52.2% bf16 MFU | 1623275 tok/s step 2764/19560 | loss 3.748742 (-0.23z)| norm 0.3060 (+0.54z)| lr 5.82e-04 | 322.80 ms | 52.3% bf16 MFU | 1623322 tok/s step 2765/19560 | loss 3.750151 (-0.21z)| norm 0.2690 (-0.90z)| lr 5.82e-04 | 323.28 ms | 52.2% bf16 MFU | 1623244 tok/s step 2766/19560 | loss 3.743104 (-0.37z)| norm 0.2835 (-0.34z)| lr 5.82e-04 | 322.67 ms | 52.3% bf16 MFU | 1623325 tok/s step 2767/19560 | loss 3.756400 (-0.06z)| norm 0.2705 (-0.85z)| lr 5.82e-04 | 322.98 ms | 52.3% bf16 MFU | 1623322 tok/s step 2768/19560 | loss 3.747297 (-0.26z)| norm 0.2914 (-0.03z)| lr 5.82e-04 | 323.23 ms | 52.2% bf16 MFU | 1623258 tok/s step 2769/19560 | loss 3.824649 (+1.51z)| norm 0.2953 (+0.13z)| lr 5.82e-04 | 323.08 ms | 52.2% bf16 MFU | 1623235 tok/s step 2770/19560 | loss 3.754273 (-0.09z)| norm 0.2938 (+0.08z)| lr 5.82e-04 | 323.05 ms | 52.2% bf16 MFU | 1623221 tok/s step 2771/19560 | loss 3.783105 
(+0.58z)| norm 0.2662 (-1.00z)| lr 5.82e-04 | 323.12 ms | 52.2% bf16 MFU | 1623189 tok/s step 2772/19560 | loss 3.725506 (-0.74z)| norm 0.3088 (+0.66z)| lr 5.82e-04 | 323.67 ms | 52.1% bf16 MFU | 1623022 tok/s step 2773/19560 | loss 3.778456 (+0.50z)| norm 0.3554 (+2.42z)| lr 5.82e-04 | 322.06 ms | 52.4% bf16 MFU | 1623266 tok/s step 2774/19560 | loss 3.686740 (-1.64z)| norm 0.3359 (+1.64z)| lr 5.82e-04 | 322.69 ms | 52.3% bf16 MFU | 1623340 tok/s step 2775/19560 | loss 3.714483 (-0.99z)| norm 0.3137 (+0.78z)| lr 5.82e-04 | 323.31 ms | 52.2% bf16 MFU | 1623253 tok/s step 2776/19560 | loss 3.723993 (-0.78z)| norm 0.3099 (+0.63z)| lr 5.82e-04 | 323.24 ms | 52.2% bf16 MFU | 1623190 tok/s step 2777/19560 | loss 3.785683 (+0.66z)| norm 0.3611 (+2.50z)| lr 5.82e-04 | 322.80 ms | 52.3% bf16 MFU | 1623238 tok/s step 2778/19560 | loss 3.783050 (+0.61z)| norm 0.3299 (+1.32z)| lr 5.82e-04 | 323.20 ms | 52.2% bf16 MFU | 1623186 tok/s step 2779/19560 | loss 3.784127 (+0.63z)| norm 0.2826 (-0.45z)| lr 5.82e-04 | 323.13 ms | 52.2% bf16 MFU | 1623154 tok/s step 2780/19560 | loss 3.657238 (-2.32z)| norm 0.3179 (+0.85z)| lr 5.82e-04 | 323.12 ms | 52.2% bf16 MFU | 1623125 tok/s step 2781/19560 | loss 3.685107 (-1.64z)| norm 0.3303 (+1.30z)| lr 5.82e-04 | 323.16 ms | 52.2% bf16 MFU | 1623088 tok/s step 2782/19560 | loss 3.752918 (-0.08z)| norm 0.2967 (+0.05z)| lr 5.82e-04 | 322.27 ms | 52.4% bf16 MFU | 1623276 tok/s step 2783/19560 | loss 3.754962 (-0.03z)| norm 0.3173 (+0.82z)| lr 5.82e-04 | 322.64 ms | 52.3% bf16 MFU | 1623362 tok/s step 2784/19560 | loss 3.762016 (+0.13z)| norm 0.3185 (+0.86z)| lr 5.82e-04 | 323.03 ms | 52.2% bf16 MFU | 1623345 tok/s step 2785/19560 | loss 3.711478 (-1.02z)| norm 0.3231 (+1.02z)| lr 5.82e-04 | 323.12 ms | 52.2% bf16 MFU | 1623307 tok/s step 2786/19560 | loss 3.768352 (+0.29z)| norm 0.2995 (+0.14z)| lr 5.82e-04 | 322.39 ms | 52.3% bf16 MFU | 1623453 tok/s step 2787/19560 | loss 3.783440 (+0.63z)| norm 0.3165 (+0.76z)| lr 5.82e-04 | 323.37 ms | 52.2% 
bf16 MFU | 1623348 tok/s step 2788/19560 | loss 3.851493 (+2.15z)| norm 0.3297 (+1.23z)| lr 5.82e-04 | 323.02 ms | 52.2% bf16 MFU | 1623334 tok/s step 2789/19560 | loss 3.733188 (-0.53z)| norm 0.2696 (-0.99z)| lr 5.82e-04 | 322.92 ms | 52.3% bf16 MFU | 1623346 tok/s step 2790/19560 | loss 3.839342 (+1.84z)| norm 0.3001 (+0.16z)| lr 5.82e-04 | 322.56 ms | 52.3% bf16 MFU | 1623449 tok/s step 2791/19560 | loss 3.666276 (-1.99z)| norm 0.2962 (+0.01z)| lr 5.82e-04 | 322.59 ms | 52.3% bf16 MFU | 1623539 tok/s step 2792/19560 | loss 3.663516 (-2.02z)| norm 0.2874 (-0.31z)| lr 5.82e-04 | 323.06 ms | 52.2% bf16 MFU | 1623507 tok/s step 2793/19560 | loss 3.739332 (-0.33z)| norm 0.2647 (-1.16z)| lr 5.82e-04 | 323.00 ms | 52.3% bf16 MFU | 1623492 tok/s step 2794/19560 | loss 3.743019 (-0.24z)| norm 0.2772 (-0.70z)| lr 5.82e-04 | 322.58 ms | 52.3% bf16 MFU | 1623581 tok/s step 2795/19560 | loss 3.759586 (+0.13z)| norm 0.2685 (-1.00z)| lr 5.82e-04 | 323.20 ms | 52.2% bf16 MFU | 1623512 tok/s step 2796/19560 | loss 3.753053 (-0.02z)| norm 0.2673 (-1.04z)| lr 5.82e-04 | 322.69 ms | 52.3% bf16 MFU | 1623574 tok/s step 2797/19560 | loss 3.692133 (-1.37z)| norm 0.2709 (-0.89z)| lr 5.82e-04 | 323.07 ms | 52.2% bf16 MFU | 1623537 tok/s step 2798/19560 | loss 3.728142 (-0.57z)| norm 0.2484 (-1.70z)| lr 5.82e-04 | 322.99 ms | 52.3% bf16 MFU | 1623523 tok/s step 2799/19560 | loss 3.911849 (+3.44z)| norm 0.2594 (-1.27z)| lr 5.82e-04 | 323.10 ms | 52.2% bf16 MFU | 1623481 tok/s step 2800/19560 | loss 3.722550 (-0.68z)| norm 0.2888 (-0.17z)| lr 5.82e-04 | 322.56 ms | 52.3% bf16 MFU | 1623578 tok/s step 2801/19560 | loss 3.727488 (-0.57z)| norm 0.3073 (+0.51z)| lr 5.82e-04 | 322.50 ms | 52.3% bf16 MFU | 1623684 tok/s step 2802/19560 | loss 3.749223 (-0.10z)| norm 0.2721 (-0.79z)| lr 5.82e-04 | 323.27 ms | 52.2% bf16 MFU | 1623592 tok/s step 2803/19560 | loss 3.759329 (+0.12z)| norm 0.2785 (-0.55z)| lr 5.82e-04 | 322.43 ms | 52.3% bf16 MFU | 1623716 tok/s step 2804/19560 | loss 3.719505 
(-0.74z)| norm 0.2553 (-1.41z)| lr 5.82e-04 | 322.74 ms | 52.3% bf16 MFU | 1623754 tok/s step 2805/19560 | loss 3.740845 (-0.27z)| norm 0.2566 (-1.34z)| lr 5.82e-04 | 322.97 ms | 52.3% bf16 MFU | 1623732 tok/s step 2806/19560 | loss 3.731171 (-0.50z)| norm 0.2611 (-1.17z)| lr 5.82e-04 | 322.50 ms | 52.3% bf16 MFU | 1623832 tok/s step 2807/19560 | loss 3.742512 (-0.23z)| norm 0.2403 (-1.89z)| lr 5.82e-04 | 322.65 ms | 52.3% bf16 MFU | 1623887 tok/s step 2808/19560 | loss 3.772869 (+0.43z)| norm 0.2524 (-1.43z)| lr 5.82e-04 | 322.66 ms | 52.3% bf16 MFU | 1623939 tok/s step 2809/19560 | loss 3.711201 (-0.93z)| norm 0.2648 (-0.97z)| lr 5.82e-04 | 322.34 ms | 52.4% bf16 MFU | 1624067 tok/s step 2810/19560 | loss 3.742381 (-0.23z)| norm 0.2844 (-0.27z)| lr 5.82e-04 | 322.86 ms | 52.3% bf16 MFU | 1624058 tok/s step 2811/19560 | loss 3.764773 (+0.27z)| norm 0.3203 (+1.01z)| lr 5.82e-04 | 323.14 ms | 52.2% bf16 MFU | 1623980 tok/s step 2812/19560 | loss 3.699181 (-1.18z)| norm 0.3103 (+0.64z)| lr 5.82e-04 | 322.57 ms | 52.3% bf16 MFU | 1624048 tok/s step 2813/19560 | loss 3.774446 (+0.49z)| norm 0.2745 (-0.65z)| lr 5.82e-04 | 322.58 ms | 52.3% bf16 MFU | 1624110 tok/s step 2814/19560 | loss 3.745945 (-0.14z)| norm 0.3336 (+1.45z)| lr 5.82e-04 | 322.56 ms | 52.3% bf16 MFU | 1624174 tok/s step 2815/19560 | loss 3.690374 (-1.35z)| norm 0.3153 (+0.78z)| lr 5.82e-04 | 322.88 ms | 52.3% bf16 MFU | 1624155 tok/s step 2816/19560 | loss 3.717651 (-0.75z)| norm 0.3048 (+0.40z)| lr 5.82e-04 | 323.18 ms | 52.2% bf16 MFU | 1624062 tok/s step 2817/19560 | loss 3.760496 (+0.19z)| norm 0.3194 (+0.91z)| lr 5.82e-04 | 322.69 ms | 52.3% bf16 MFU | 1624096 tok/s step 2818/19560 | loss 3.727116 (-0.54z)| norm 0.2704 (-0.83z)| lr 5.82e-04 | 322.44 ms | 52.3% bf16 MFU | 1624191 tok/s step 2819/19560 | loss 3.779703 (+0.62z)| norm 0.2768 (-0.60z)| lr 5.82e-04 | 322.69 ms | 52.3% bf16 MFU | 1624218 tok/s step 2820/19560 | loss 3.871721 (+2.56z)| norm 0.2660 (-0.98z)| lr 5.82e-04 | 322.44 ms | 52.3% 
bf16 MFU | 1624307 tok/s step 2821/19560 | loss 3.703601 (-1.06z)| norm 0.2826 (-0.39z)| lr 5.81e-04 | 323.47 ms | 52.2% bf16 MFU | 1624133 tok/s step 2822/19560 | loss 3.747183 (-0.11z)| norm 0.2873 (-0.21z)| lr 5.81e-04 | 322.88 ms | 52.3% bf16 MFU | 1624115 tok/s step 2823/19560 | loss 3.713122 (-0.83z)| norm 0.2759 (-0.61z)| lr 5.81e-04 | 322.82 ms | 52.3% bf16 MFU | 1624114 tok/s step 2824/19560 | loss 3.782605 (+0.68z)| norm 0.3098 (+0.66z)| lr 5.81e-04 | 322.69 ms | 52.3% bf16 MFU | 1624147 tok/s step 2825/19560 | loss 3.776878 (+0.55z)| norm 0.3032 (+0.43z)| lr 5.81e-04 | 322.48 ms | 52.3% bf16 MFU | 1624229 tok/s step 2826/19560 | loss 3.757584 (+0.13z)| norm 0.2869 (-0.19z)| lr 5.81e-04 | 323.04 ms | 52.2% bf16 MFU | 1624166 tok/s step 2827/19560 | loss 3.699047 (-1.13z)| norm 0.2443 (-1.76z)| lr 5.81e-04 | 323.01 ms | 52.2% bf16 MFU | 1624114 tok/s step 2828/19560 | loss 3.687774 (-1.36z)| norm 0.2718 (-0.73z)| lr 5.81e-04 | 322.66 ms | 52.3% bf16 MFU | 1624154 tok/s step 2829/19560 | loss 3.764784 (+0.31z)| norm 0.3207 (+1.09z)| lr 5.81e-04 | 322.61 ms | 52.3% bf16 MFU | 1624203 tok/s step 2830/19560 | loss 3.737046 (-0.29z)| norm 0.3395 (+1.75z)| lr 5.81e-04 | 323.03 ms | 52.2% bf16 MFU | 1624145 tok/s step 2831/19560 | loss 3.793384 (+0.94z)| norm 0.3203 (+1.04z)| lr 5.81e-04 | 323.35 ms | 52.2% bf16 MFU | 1624008 tok/s step 2832/19560 | loss 3.713360 (-0.82z)| norm 0.2637 (-1.03z)| lr 5.81e-04 | 322.59 ms | 52.3% bf16 MFU | 1624071 tok/s step 2833/19560 | loss 3.754216 (+0.08z)| norm 0.2785 (-0.47z)| lr 5.81e-04 | 322.52 ms | 52.3% bf16 MFU | 1624148 tok/s step 2834/19560 | loss 3.690750 (-1.30z)| norm 0.2881 (-0.11z)| lr 5.81e-04 | 323.46 ms | 52.2% bf16 MFU | 1623984 tok/s step 2835/19560 | loss 3.702110 (-1.05z)| norm 0.2502 (-1.49z)| lr 5.81e-04 | 322.51 ms | 52.3% bf16 MFU | 1624067 tok/s step 2836/19560 | loss 3.720030 (-0.64z)| norm 0.2824 (-0.30z)| lr 5.81e-04 | 323.40 ms | 52.2% bf16 MFU | 1623923 tok/s step 2837/19560 | loss 3.746528 
(-0.05z)| norm 0.2614 (-1.07z)| lr 5.81e-04 | 322.74 ms | 52.3% bf16 MFU | 1623953 tok/s step 2838/19560 | loss 3.734298 (-0.33z)| norm 0.2925 (+0.08z)| lr 5.81e-04 | 322.42 ms | 52.3% bf16 MFU | 1624060 tok/s step 2839/19560 | loss 3.713338 (-0.81z)| norm 0.3145 (+0.87z)| lr 5.81e-04 | 322.92 ms | 52.3% bf16 MFU | 1624035 tok/s step 2840/19560 | loss 3.678433 (-1.56z)| norm 0.2921 (+0.04z)| lr 5.81e-04 | 323.19 ms | 52.2% bf16 MFU | 1623944 tok/s step 2841/19560 | loss 3.754615 (+0.14z)| norm 0.2734 (-0.67z)| lr 5.81e-04 | 323.15 ms | 52.2% bf16 MFU | 1623868 tok/s step 2842/19560 | loss 3.702112 (-1.03z)| norm 0.2906 (-0.02z)| lr 5.81e-04 | 322.99 ms | 52.3% bf16 MFU | 1623836 tok/s step 2843/19560 | loss 3.728867 (-0.43z)| norm 0.2454 (-1.70z)| lr 5.81e-04 | 322.34 ms | 52.4% bf16 MFU | 1623968 tok/s step 2844/19560 | loss 3.785123 (+0.81z)| norm 0.2749 (-0.61z)| lr 5.81e-04 | 323.22 ms | 52.2% bf16 MFU | 1623873 tok/s step 2845/19560 | loss 3.682388 (-1.46z)| norm 0.3167 (+0.94z)| lr 5.81e-04 | 323.13 ms | 52.2% bf16 MFU | 1623806 tok/s step 2846/19560 | loss 3.746674 (-0.03z)| norm 0.3134 (+0.81z)| lr 5.81e-04 | 323.04 ms | 52.2% bf16 MFU | 1623765 tok/s step 2847/19560 | loss 3.723579 (-0.53z)| norm 0.2888 (-0.12z)| lr 5.81e-04 | 322.73 ms | 52.3% bf16 MFU | 1623805 tok/s step 2848/19560 | loss 3.747608 (+0.01z)| norm 0.2881 (-0.14z)| lr 5.81e-04 | 323.06 ms | 52.2% bf16 MFU | 1623759 tok/s step 2849/19560 | loss 3.686808 (-1.32z)| norm 0.3079 (+0.59z)| lr 5.81e-04 | 322.56 ms | 52.3% bf16 MFU | 1623840 tok/s step 2850/19560 | loss 3.764825 (+0.40z)| norm 0.3378 (+1.67z)| lr 5.81e-04 | 323.39 ms | 52.2% bf16 MFU | 1623710 tok/s step 2851/19560 | loss 3.689434 (-1.25z)| norm 0.3336 (+1.49z)| lr 5.81e-04 | 322.53 ms | 52.3% bf16 MFU | 1623801 tok/s step 2852/19560 | loss 3.701958 (-0.97z)| norm 0.2832 (-0.36z)| lr 5.81e-04 | 322.75 ms | 52.3% bf16 MFU | 1623832 tok/s step 2853/19560 | loss 3.725263 (-0.45z)| norm 0.2299 (-2.26z)| lr 5.81e-04 | 322.97 ms | 52.3% 
bf16 MFU | 1623807 tok/s step 2854/19560 | loss 3.650445 (-2.05z)| norm 0.2514 (-1.45z)| lr 5.81e-04 | 323.29 ms | 52.2% bf16 MFU | 1623703 tok/s step 2855/19560 | loss 3.818886 (+1.58z)| norm 0.2938 (+0.08z)| lr 5.81e-04 | 322.62 ms | 52.3% bf16 MFU | 1623772 tok/s step 2856/19560 | loss 3.726985 (-0.39z)| norm 0.3041 (+0.46z)| lr 5.81e-04 | 323.28 ms | 52.2% bf16 MFU | 1623672 tok/s step 2857/19560 | loss 3.710419 (-0.74z)| norm 0.2821 (-0.33z)| lr 5.81e-04 | 323.99 ms | 52.1% bf16 MFU | 1623400 tok/s step 2858/19560 | loss 3.739505 (-0.12z)| norm 0.2709 (-0.73z)| lr 5.81e-04 | 322.30 ms | 52.4% bf16 MFU | 1623565 tok/s step 2859/19560 | loss 3.825562 (+1.70z)| norm 0.3024 (+0.42z)| lr 5.81e-04 | 322.79 ms | 52.3% bf16 MFU | 1623599 tok/s step 2860/19560 | loss 3.694658 (-1.08z)| norm 0.3107 (+0.72z)| lr 5.81e-04 | 323.11 ms | 52.2% bf16 MFU | 1623551 tok/s step 2861/19560 | loss 3.713740 (-0.65z)| norm 0.3076 (+0.60z)| lr 5.81e-04 | 323.42 ms | 52.2% bf16 MFU | 1623427 tok/s step 2862/19560 | loss 3.711244 (-0.70z)| norm 0.2763 (-0.54z)| lr 5.81e-04 | 322.66 ms | 52.3% bf16 MFU | 1623499 tok/s step 2863/19560 | loss 3.791805 (+1.03z)| norm 0.2604 (-1.10z)| lr 5.81e-04 | 323.17 ms | 52.2% bf16 MFU | 1623441 tok/s step 2864/19560 | loss 3.743989 (+0.00z)| norm 0.2866 (-0.17z)| lr 5.81e-04 | 322.98 ms | 52.3% bf16 MFU | 1623432 tok/s step 2865/19560 | loss 3.695203 (-1.03z)| norm 0.2653 (-0.94z)| lr 5.81e-04 | 322.70 ms | 52.3% bf16 MFU | 1623496 tok/s step 2866/19560 | loss 3.694207 (-1.04z)| norm 0.3042 (+0.47z)| lr 5.81e-04 | 323.66 ms | 52.1% bf16 MFU | 1623314 tok/s step 2867/19560 | loss 3.745579 (+0.07z)| norm 0.3131 (+0.78z)| lr 5.81e-04 | 323.52 ms | 52.2% bf16 MFU | 1623176 tok/s step 2868/19560 | loss 3.709235 (-0.71z)| norm 0.2934 (+0.04z)| lr 5.81e-04 | 322.59 ms | 52.3% bf16 MFU | 1623280 tok/s step 2869/19560 | loss 3.709957 (-0.70z)| norm 0.3350 (+1.57z)| lr 5.81e-04 | 322.65 ms | 52.3% bf16 MFU | 1623362 tok/s step 2870/19560 | loss 3.757650 
(+0.35z)| norm 0.3408 (+1.75z)| lr 5.81e-04 | 322.77 ms | 52.3% bf16 MFU | 1623411 tok/s step 2871/19560 | loss 3.731882 (-0.22z)| norm 0.3417 (+1.77z)| lr 5.81e-04 | 322.45 ms | 52.3% bf16 MFU | 1623538 tok/s step 2872/19560 | loss 3.691308 (-1.09z)| norm 0.2885 (-0.17z)| lr 5.81e-04 | 322.80 ms | 52.3% bf16 MFU | 1623571 tok/s step 2873/19560 | loss 3.750730 (+0.21z)| norm 0.2719 (-0.77z)| lr 5.81e-04 | 323.39 ms | 52.2% bf16 MFU | 1623454 tok/s step 2874/19560 | loss 3.746755 (+0.12z)| norm 0.3060 (+0.50z)| lr 5.81e-04 | 323.22 ms | 52.2% bf16 MFU | 1623386 tok/s step 2875/19560 | loss 3.670473 (-1.54z)| norm 0.2594 (-1.22z)| lr 5.81e-04 | 322.99 ms | 52.3% bf16 MFU | 1623377 tok/s step 2876/19560 | loss 3.720363 (-0.44z)| norm 0.2495 (-1.57z)| lr 5.81e-04 | 322.71 ms | 52.3% bf16 MFU | 1623439 tok/s step 2877/19560 | loss 3.735169 (-0.11z)| norm 0.2707 (-0.78z)| lr 5.81e-04 | 323.60 ms | 52.2% bf16 MFU | 1623275 tok/s step 2878/19560 | loss 3.757975 (+0.44z)| norm 0.2869 (-0.18z)| lr 5.80e-04 | 322.48 ms | 52.3% bf16 MFU | 1623401 tok/s step 2879/19560 | loss 3.781589 (+0.99z)| norm 0.2687 (-0.83z)| lr 5.80e-04 | 322.80 ms | 52.3% bf16 MFU | 1623440 tok/s step 2880/19560 | loss 3.662468 (-1.76z)| norm 0.2571 (-1.24z)| lr 5.80e-04 | 323.88 ms | 52.1% bf16 MFU | 1623206 tok/s step 2881/19560 | loss 3.749111 (+0.24z)| norm 0.2582 (-1.19z)| lr 5.80e-04 | 323.31 ms | 52.2% bf16 MFU | 1623128 tok/s step 2882/19560 | loss 3.752007 (+0.31z)| norm 0.2618 (-1.05z)| lr 5.80e-04 | 322.40 ms | 52.3% bf16 MFU | 1623281 tok/s step 2883/19560 | loss 3.740732 (+0.05z)| norm 0.2820 (-0.33z)| lr 5.80e-04 | 323.06 ms | 52.2% bf16 MFU | 1623261 tok/s step 2884/19560 | loss 3.779354 (+0.94z)| norm 0.2746 (-0.60z)| lr 5.80e-04 | 322.50 ms | 52.3% bf16 MFU | 1623383 tok/s step 2885/19560 | loss 3.702133 (-0.84z)| norm 0.2666 (-0.90z)| lr 5.80e-04 | 322.96 ms | 52.3% bf16 MFU | 1623384 tok/s step 2886/19560 | loss 3.746030 (+0.16z)| norm 0.2768 (-0.53z)| lr 5.80e-04 | 322.70 ms | 52.3% 
bf16 MFU | 1623448 tok/s step 2887/19560 | loss 3.720223 (-0.46z)| norm 0.2751 (-0.58z)| lr 5.80e-04 | 322.61 ms | 52.3% bf16 MFU | 1623532 tok/s step 2888/19560 | loss 3.680507 (-1.38z)| norm 0.2749 (-0.58z)| lr 5.80e-04 | 323.18 ms | 52.2% bf16 MFU | 1623471 tok/s step 2889/19560 | loss 3.757445 (+0.44z)| norm 0.2681 (-0.82z)| lr 5.80e-04 | 322.82 ms | 52.3% bf16 MFU | 1623502 tok/s step 2890/19560 | loss 3.675918 (-1.46z)| norm 0.3047 (+0.56z)| lr 5.80e-04 | 322.73 ms | 52.3% bf16 MFU | 1623554 tok/s step 2891/19560 | loss 3.691824 (-1.08z)| norm 0.3079 (+0.69z)| lr 5.80e-04 | 322.84 ms | 52.3% bf16 MFU | 1623575 tok/s step 2892/19560 | loss 3.868124 (+2.93z)| norm 0.3173 (+1.04z)| lr 5.80e-04 | 323.19 ms | 52.2% bf16 MFU | 1623508 tok/s step 2893/19560 | loss 3.744205 (+0.13z)| norm 0.3417 (+1.92z)| lr 5.80e-04 | 322.69 ms | 52.3% bf16 MFU | 1623569 tok/s step 2894/19560 | loss 3.773355 (+0.78z)| norm 0.3171 (+0.98z)| lr 5.80e-04 | 323.06 ms | 52.2% bf16 MFU | 1623534 tok/s step 2895/19560 | loss 3.663753 (-1.66z)| norm 0.3407 (+1.82z)| lr 5.80e-04 | 323.10 ms | 52.2% bf16 MFU | 1623491 tok/s step 2896/19560 | loss 3.741116 (+0.07z)| norm 0.3109 (+0.72z)| lr 5.80e-04 | 323.09 ms | 52.2% bf16 MFU | 1623453 tok/s step 2897/19560 | loss 3.765071 (+0.62z)| norm 0.2887 (-0.10z)| lr 5.80e-04 | 323.50 ms | 52.2% bf16 MFU | 1623314 tok/s step 2898/19560 | loss 3.713593 (-0.54z)| norm 0.2782 (-0.47z)| lr 5.80e-04 | 322.63 ms | 52.3% bf16 MFU | 1623400 tok/s step 2899/19560 | loss 3.716004 (-0.47z)| norm 0.2867 (-0.17z)| lr 5.80e-04 | 323.09 ms | 52.2% bf16 MFU | 1623367 tok/s step 2900/19560 | loss 3.683182 (-1.20z)| norm 0.2396 (-1.86z)| lr 5.80e-04 | 323.03 ms | 52.2% bf16 MFU | 1623350 tok/s step 2901/19560 | loss 3.731867 (-0.10z)| norm 0.2606 (-1.09z)| lr 5.80e-04 | 323.35 ms | 52.2% bf16 MFU | 1623255 tok/s step 2902/19560 | loss 3.690414 (-1.04z)| norm 0.3047 (+0.56z)| lr 5.80e-04 | 323.16 ms | 52.2% bf16 MFU | 1623210 tok/s step 2903/19560 | loss 3.674229 
(-1.39z)| norm 0.2745 (-0.56z)| lr 5.80e-04 | 323.55 ms | 52.2% bf16 MFU | 1623070 tok/s step 2904/19560 | loss 3.716165 (-0.44z)| norm 0.2745 (-0.55z)| lr 5.80e-04 | 324.23 ms | 52.1% bf16 MFU | 1622767 tok/s step 2905/19560 | loss 3.688177 (-1.05z)| norm 0.2751 (-0.52z)| lr 5.80e-04 | 323.01 ms | 52.2% bf16 MFU | 1622785 tok/s step 2906/19560 | loss 3.755404 (+0.47z)| norm 0.2841 (-0.16z)| lr 5.80e-04 | 323.26 ms | 52.2% bf16 MFU | 1622739 tok/s step 2907/19560 | loss 3.755227 (+0.47z)| norm 0.2678 (-0.79z)| lr 5.80e-04 | 322.96 ms | 52.3% bf16 MFU | 1622771 tok/s step 2908/19560 | loss 3.695385 (-0.90z)| norm 0.2565 (-1.21z)| lr 5.80e-04 | 323.34 ms | 52.2% bf16 MFU | 1622705 tok/s step 2909/19560 | loss 3.721891 (-0.31z)| norm 0.2749 (-0.48z)| lr 5.80e-04 | 323.68 ms | 52.1% bf16 MFU | 1622558 tok/s step 2910/19560 | loss 3.712732 (-0.51z)| norm 0.2823 (-0.19z)| lr 5.80e-04 | 323.20 ms | 52.2% bf16 MFU | 1622539 tok/s step 2911/19560 | loss 3.841750 (+2.39z)| norm 0.2729 (-0.54z)| lr 5.80e-04 | 323.29 ms | 52.2% bf16 MFU | 1622498 tok/s step 2912/19560 | loss 3.806421 (+1.57z)| norm 0.2797 (-0.27z)| lr 5.80e-04 | 323.49 ms | 52.2% bf16 MFU | 1622409 tok/s step 2913/19560 | loss 3.888103 (+3.23z)| norm 0.3031 (+0.67z)| lr 5.80e-04 | 322.70 ms | 52.3% bf16 MFU | 1622523 tok/s step 2914/19560 | loss 3.741333 (+0.09z)| norm 0.3639 (+2.97z)| lr 5.80e-04 | 322.16 ms | 52.4% bf16 MFU | 1622768 tok/s step 2915/19560 | loss 3.703396 (-0.71z)| norm 0.3550 (+2.56z)| lr 5.80e-04 | 322.96 ms | 52.3% bf16 MFU | 1622799 tok/s step 2916/19560 | loss 3.686385 (-1.07z)| norm 0.3386 (+1.93z)| lr 5.80e-04 | 322.92 ms | 52.3% bf16 MFU | 1622837 tok/s step 2917/19560 | loss 3.741388 (+0.14z)| norm 0.2758 (-0.43z)| lr 5.80e-04 | 322.32 ms | 52.4% bf16 MFU | 1623027 tok/s step 2918/19560 | loss 3.727317 (-0.16z)| norm 0.3167 (+1.10z)| lr 5.80e-04 | 322.62 ms | 52.3% bf16 MFU | 1623129 tok/s step 2919/19560 | loss 3.759880 (+0.56z)| norm 0.3303 (+1.58z)| lr 5.80e-04 | 322.88 ms | 52.3% 
bf16 MFU | 1623162 tok/s step 2920/19560 | loss 3.699652 (-0.81z)| norm 0.3062 (+0.68z)| lr 5.80e-04 | 322.55 ms | 52.3% bf16 MFU | 1623277 tok/s step 2921/19560 | loss 3.734389 (-0.02z)| norm 0.2778 (-0.37z)| lr 5.80e-04 | 323.05 ms | 52.2% bf16 MFU | 1623259 tok/s step 2922/19560 | loss 3.794522 (+1.33z)| norm 0.2874 (-0.02z)| lr 5.80e-04 | 323.12 ms | 52.2% bf16 MFU | 1623226 tok/s step 2923/19560 | loss 3.742994 (+0.17z)| norm 0.2848 (-0.12z)| lr 5.80e-04 | 323.37 ms | 52.2% bf16 MFU | 1623131 tok/s step 2924/19560 | loss 3.678155 (-1.28z)| norm 0.3035 (+0.56z)| lr 5.80e-04 | 322.94 ms | 52.3% bf16 MFU | 1623148 tok/s step 2925/19560 | loss 3.675274 (-1.33z)| norm 0.3171 (+1.06z)| lr 5.80e-04 | 322.97 ms | 52.3% bf16 MFU | 1623156 tok/s step 2926/19560 | loss 3.723248 (-0.26z)| norm 0.2696 (-0.72z)| lr 5.80e-04 | 323.38 ms | 52.2% bf16 MFU | 1623063 tok/s step 2927/19560 | loss 3.690573 (-1.01z)| norm 0.2823 (-0.25z)| lr 5.80e-04 | 322.96 ms | 52.3% bf16 MFU | 1623080 tok/s step 2928/19560 | loss 3.747028 (+0.33z)| norm 0.2721 (-0.63z)| lr 5.80e-04 | 323.37 ms | 52.2% bf16 MFU | 1622992 tok/s step 2929/19560 | loss 3.701383 (-0.75z)| norm 0.2607 (-1.04z)| lr 5.80e-04 | 323.14 ms | 52.2% bf16 MFU | 1622966 tok/s step 2930/19560 | loss 3.661227 (-1.67z)| norm 0.2582 (-1.13z)| lr 5.80e-04 | 323.39 ms | 52.2% bf16 MFU | 1622880 tok/s step 2931/19560 | loss 3.712629 (-0.46z)| norm 0.2725 (-0.59z)| lr 5.80e-04 | 322.81 ms | 52.3% bf16 MFU | 1622943 tok/s step 2932/19560 | loss 3.667543 (-1.49z)| norm 0.2769 (-0.44z)| lr 5.80e-04 | 322.96 ms | 52.3% bf16 MFU | 1622964 tok/s step 2933/19560 | loss 3.721946 (-0.22z)| norm 0.2405 (-1.78z)| lr 5.80e-04 | 323.07 ms | 52.2% bf16 MFU | 1622957 tok/s step 2934/19560 | loss 3.713171 (-0.42z)| norm 0.2690 (-0.72z)| lr 5.79e-04 | 322.49 ms | 52.3% bf16 MFU | 1623097 tok/s step 2935/19560 | loss 3.719267 (-0.28z)| norm 0.2763 (-0.47z)| lr 5.79e-04 | 323.24 ms | 52.2% bf16 MFU | 1623042 tok/s step 2936/19560 | loss 3.678944 
(-1.20z)| norm 0.2506 (-1.44z)| lr 5.79e-04 | 322.98 ms | 52.3% bf16 MFU | 1623053 tok/s step 2937/19560 | loss 3.699280 (-0.72z)| norm 0.2480 (-1.52z)| lr 5.79e-04 | 322.59 ms | 52.3% bf16 MFU | 1623163 tok/s step 2938/19560 | loss 3.704854 (-0.58z)| norm 0.2510 (-1.39z)| lr 5.79e-04 | 323.03 ms | 52.2% bf16 MFU | 1623157 tok/s step 2939/19560 | loss 3.772979 (+0.99z)| norm 0.2713 (-0.62z)| lr 5.79e-04 | 323.23 ms | 52.2% bf16 MFU | 1623100 tok/s step 2940/19560 | loss 3.718306 (-0.28z)| norm 0.2720 (-0.59z)| lr 5.79e-04 | 323.08 ms | 52.2% bf16 MFU | 1623084 tok/s step 2941/19560 | loss 3.737068 (+0.17z)| norm 0.2582 (-1.09z)| lr 5.79e-04 | 322.57 ms | 52.3% bf16 MFU | 1623198 tok/s step 2942/19560 | loss 3.855021 (+2.81z)| norm 0.3075 (+0.76z)| lr 5.79e-04 | 322.67 ms | 52.3% bf16 MFU | 1623280 tok/s step 2943/19560 | loss 3.741509 (+0.23z)| norm 0.3066 (+0.73z)| lr 5.79e-04 | 322.61 ms | 52.3% bf16 MFU | 1623374 tok/s step 2944/19560 | loss 3.772003 (+0.91z)| norm 0.3260 (+1.44z)| lr 5.79e-04 | 322.83 ms | 52.3% bf16 MFU | 1623407 tok/s step 2945/19560 | loss 3.707462 (-0.54z)| norm 0.3010 (+0.51z)| lr 5.79e-04 | 322.92 ms | 52.3% bf16 MFU | 1623416 tok/s step 2946/19560 | loss 3.744477 (+0.30z)| norm 0.3134 (+0.97z)| lr 5.79e-04 | 322.77 ms | 52.3% bf16 MFU | 1623463 tok/s step 2947/19560 | loss 3.719668 (-0.25z)| norm 0.3030 (+0.57z)| lr 5.79e-04 | 322.79 ms | 52.3% bf16 MFU | 1623501 tok/s step 2948/19560 | loss 3.743231 (+0.32z)| norm 0.2930 (+0.19z)| lr 5.79e-04 | 323.03 ms | 52.2% bf16 MFU | 1623477 tok/s step 2949/19560 | loss 3.700984 (-0.68z)| norm 0.2834 (-0.18z)| lr 5.79e-04 | 322.40 ms | 52.3% bf16 MFU | 1623614 tok/s step 2950/19560 | loss 3.792044 (+1.45z)| norm 0.3132 (+0.93z)| lr 5.79e-04 | 322.69 ms | 52.3% bf16 MFU | 1623671 tok/s step 2951/19560 | loss 3.760403 (+0.70z)| norm 0.2756 (-0.47z)| lr 5.79e-04 | 323.08 ms | 52.2% bf16 MFU | 1623626 tok/s step 2952/19560 | loss 3.733781 (+0.09z)| norm 0.2454 (-1.57z)| lr 5.79e-04 | 322.81 ms | 52.3% 
bf16 MFU | 1623651 tok/s step 2953/19560 | loss 3.649804 (-1.86z)| norm 0.2716 (-0.59z)| lr 5.79e-04 | 322.54 ms | 52.3% bf16 MFU | 1623742 tok/s step 2954/19560 | loss 3.731605 (+0.06z)| norm 0.2847 (-0.10z)| lr 5.79e-04 | 322.57 ms | 52.3% bf16 MFU | 1623822 tok/s step 2955/19560 | loss 3.686919 (-0.98z)| norm 0.2570 (-1.14z)| lr 5.79e-04 | 322.98 ms | 52.3% bf16 MFU | 1623795 tok/s step 2956/19560 | loss 3.779972 (+1.18z)| norm 0.2769 (-0.40z)| lr 5.79e-04 | 322.22 ms | 52.4% bf16 MFU | 1623960 tok/s step 2957/19560 | loss 3.658932 (-1.62z)| norm 0.2830 (-0.16z)| lr 5.79e-04 | 322.49 ms | 52.3% bf16 MFU | 1624050 tok/s step 2958/19560 | loss 3.781906 (+1.22z)| norm 0.3039 (+0.64z)| lr 5.79e-04 | 322.67 ms | 52.3% bf16 MFU | 1624090 tok/s step 2959/19560 | loss 3.707869 (-0.48z)| norm 0.2836 (-0.12z)| lr 5.79e-04 | 322.31 ms | 52.4% bf16 MFU | 1624219 tok/s step 2960/19560 | loss 3.679030 (-1.14z)| norm 0.2777 (-0.35z)| lr 5.79e-04 | 322.74 ms | 52.3% bf16 MFU | 1624231 tok/s step 2961/19560 | loss 3.667477 (-1.38z)| norm 0.2974 (+0.40z)| lr 5.79e-04 | 322.51 ms | 52.3% bf16 MFU | 1624301 tok/s step 2962/19560 | loss 3.672512 (-1.26z)| norm 0.3079 (+0.80z)| lr 5.79e-04 | 322.49 ms | 52.3% bf16 MFU | 1624372 tok/s step 2963/19560 | loss 3.682366 (-1.02z)| norm 0.3031 (+0.60z)| lr 5.79e-04 | 322.70 ms | 52.3% bf16 MFU | 1624388 tok/s step 2964/19560 | loss 3.694747 (-0.73z)| norm 0.3049 (+0.66z)| lr 5.79e-04 | 322.78 ms | 52.3% bf16 MFU | 1624384 tok/s step 2965/19560 | loss 3.689155 (-0.85z)| norm 0.2641 (-0.91z)| lr 5.79e-04 | 322.86 ms | 52.3% bf16 MFU | 1624360 tok/s step 2966/19560 | loss 3.631439 (-2.11z)| norm 0.2729 (-0.57z)| lr 5.79e-04 | 323.03 ms | 52.2% bf16 MFU | 1624294 tok/s step 2967/19560 | loss 3.717647 (-0.18z)| norm 0.2787 (-0.33z)| lr 5.79e-04 | 322.64 ms | 52.3% bf16 MFU | 1624328 tok/s step 2968/19560 | loss 3.769530 (+0.96z)| norm 0.3033 (+0.61z)| lr 5.79e-04 | 322.50 ms | 52.3% bf16 MFU | 1624396 tok/s step 2969/19560 | loss 3.703057 
(-0.52z)| norm 0.3237 (+1.38z)| lr 5.79e-04 | 322.80 ms | 52.3% bf16 MFU | 1624385 tok/s step 2970/19560 | loss 3.748514 (+0.49z)| norm 0.3010 (+0.50z)| lr 5.79e-04 | 323.23 ms | 52.2% bf16 MFU | 1624268 tok/s step 2971/19560 | loss 3.708203 (-0.40z)| norm 0.2959 (+0.30z)| lr 5.79e-04 | 322.57 ms | 52.3% bf16 MFU | 1624322 tok/s step 2972/19560 | loss 3.724105 (-0.04z)| norm 0.3013 (+0.50z)| lr 5.79e-04 | 322.83 ms | 52.3% bf16 MFU | 1624308 tok/s step 2973/19560 | loss 3.671284 (-1.22z)| norm 0.2840 (-0.17z)| lr 5.79e-04 | 322.89 ms | 52.3% bf16 MFU | 1624279 tok/s step 2974/19560 | loss 3.690986 (-0.77z)| norm 0.2858 (-0.09z)| lr 5.79e-04 | 323.32 ms | 52.2% bf16 MFU | 1624144 tok/s step 2975/19560 | loss 3.775170 (+1.10z)| norm 0.2754 (-0.49z)| lr 5.79e-04 | 322.48 ms | 52.3% bf16 MFU | 1624227 tok/s step 2976/19560 | loss 3.750643 (+0.56z)| norm 0.2869 (-0.04z)| lr 5.79e-04 | 322.84 ms | 52.3% bf16 MFU | 1624215 tok/s step 2977/19560 | loss 3.726292 (+0.01z)| norm 0.2566 (-1.21z)| lr 5.79e-04 | 323.00 ms | 52.3% bf16 MFU | 1624163 tok/s step 2978/19560 | loss 3.691401 (-0.76z)| norm 0.2797 (-0.29z)| lr 5.79e-04 | 322.89 ms | 52.3% bf16 MFU | 1624140 tok/s step 2979/19560 | loss 3.763691 (+0.84z)| norm 0.2922 (+0.22z)| lr 5.79e-04 | 322.81 ms | 52.3% bf16 MFU | 1624140 tok/s step 2980/19560 | loss 3.689962 (-0.80z)| norm 0.2570 (-1.18z)| lr 5.79e-04 | 322.98 ms | 52.3% bf16 MFU | 1624097 tok/s step 2981/19560 | loss 3.768402 (+0.94z)| norm 0.2695 (-0.70z)| lr 5.79e-04 | 322.83 ms | 52.3% bf16 MFU | 1624094 tok/s step 2982/19560 | loss 3.748224 (+0.48z)| norm 0.2844 (-0.11z)| lr 5.79e-04 | 322.85 ms | 52.3% bf16 MFU | 1624087 tok/s step 2983/19560 | loss 3.691540 (-0.79z)| norm 0.2966 (+0.39z)| lr 5.79e-04 | 323.02 ms | 52.2% bf16 MFU | 1624036 tok/s step 2984/19560 | loss 3.765627 (+0.89z)| norm 0.2693 (-0.72z)| lr 5.79e-04 | 322.67 ms | 52.3% bf16 MFU | 1624077 tok/s step 2985/19560 | loss 3.651749 (-1.67z)| norm 0.2848 (-0.08z)| lr 5.79e-04 | 322.42 ms | 52.3% 
bf16 MFU | 1624178 tok/s step 2986/19560 | loss 3.675994 (-1.11z)| norm 0.2585 (-1.15z)| lr 5.79e-04 | 322.96 ms | 52.3% bf16 MFU | 1624140 tok/s step 2987/19560 | loss 3.702832 (-0.49z)| norm 0.2686 (-0.73z)| lr 5.79e-04 | 322.81 ms | 52.3% bf16 MFU | 1624138 tok/s step 2988/19560 | loss 3.766839 (+0.95z)| norm 0.2908 (+0.18z)| lr 5.78e-04 | 323.29 ms | 52.2% bf16 MFU | 1624018 tok/s step 2989/19560 | loss 3.777137 (+1.17z)| norm 0.2799 (-0.26z)| lr 5.78e-04 | 322.64 ms | 52.3% bf16 MFU | 1624067 tok/s step 2990/19560 | loss 3.837072 (+2.45z)| norm 0.2675 (-0.76z)| lr 5.78e-04 | 322.72 ms | 52.3% bf16 MFU | 1624092 tok/s step 2991/19560 | loss 3.752678 (+0.59z)| norm 0.2850 (-0.05z)| lr 5.78e-04 | 322.65 ms | 52.3% bf16 MFU | 1624135 tok/s step 2992/19560 | loss 3.702930 (-0.51z)| norm 0.2768 (-0.39z)| lr 5.78e-04 | 323.27 ms | 52.2% bf16 MFU | 1624019 tok/s step 2993/19560 | loss 3.731890 (+0.13z)| norm 0.3196 (+1.35z)| lr 5.78e-04 | 322.50 ms | 52.3% bf16 MFU | 1624102 tok/s step 2994/19560 | loss 3.738987 (+0.28z)| norm 0.3051 (+0.76z)| lr 5.78e-04 | 322.87 ms | 52.3% bf16 MFU | 1624088 tok/s step 2995/19560 | loss 3.710408 (-0.35z)| norm 0.2963 (+0.40z)| lr 5.78e-04 | 322.58 ms | 52.3% bf16 MFU | 1624147 tok/s step 2996/19560 | loss 3.716156 (-0.23z)| norm 0.3166 (+1.22z)| lr 5.78e-04 | 322.71 ms | 52.3% bf16 MFU | 1624173 tok/s step 2997/19560 | loss 3.698983 (-0.61z)| norm 0.3024 (+0.66z)| lr 5.78e-04 | 323.14 ms | 52.2% bf16 MFU | 1624087 tok/s step 2998/19560 | loss 3.720685 (-0.12z)| norm 0.3095 (+0.98z)| lr 5.78e-04 | 322.48 ms | 52.3% bf16 MFU | 1624172 tok/s step 2999/19560 | loss 3.735607 (+0.22z)| norm 0.3274 (+1.76z)| lr 5.78e-04 | 322.44 ms | 52.3% bf16 MFU | 1624262 tok/s step 3000/19560 | loss 3.728986 (+0.06z)| norm 0.2829 (-0.13z)| lr 5.78e-04 | 322.83 ms | 52.3% bf16 MFU | 1624251 tok/s
val loss 3.705422
evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating HellaSwag: 70/79
HellaSwag: 2663/10042 = 0.265186
step 3001/19560 | loss 3.615312 (-2.42z)| norm 0.2612 (-1.05z)| lr 5.78e-04 | 322.77 ms | 52.3% bf16 MFU | 1624256 tok/s step 3002/19560 | loss 3.705487 (-0.43z)| norm 0.2489 (-1.54z)| lr 5.78e-04 | 323.30 ms | 52.2% bf16 MFU | 1624128 tok/s step 3003/19560 | loss 3.706055 (-0.42z)| norm 0.2694 (-0.68z)| lr 5.78e-04 | 322.98 ms | 52.3% bf16 MFU | 1624087 tok/s step 3004/19560 | loss 3.725370 (+0.00z)| norm 0.2756 (-0.43z)| lr 5.78e-04 | 322.97 ms | 52.3% bf16 MFU | 1624049 tok/s step 3005/19560 | loss 3.753065 (+0.61z)| norm 0.2657 (-0.85z)| lr 5.78e-04 | 323.18 ms | 52.2% bf16 MFU | 1623960 tok/s step 3006/19560 | loss 3.701484 (-0.52z)| norm 0.2891 (+0.14z)| lr 5.78e-04 | 324.03 ms | 52.1% bf16 MFU | 1623663 tok/s step 3007/19560 | loss 3.780184 (+1.22z)| norm 0.2993 (+0.57z)| lr 5.78e-04 | 323.21 ms | 52.2% bf16 MFU | 1623586 tok/s step 3008/19560 | loss 3.733690 (+0.18z)| norm 0.3004 (+0.60z)| lr 5.78e-04 | 322.94 ms | 52.3% bf16 MFU | 1623581 tok/s step 3009/19560 | loss 3.761492 (+0.80z)| norm 0.2967 (+0.44z)| lr 5.78e-04 | 323.11 ms | 52.2% bf16 MFU | 1623533 tok/s step 3010/19560 | loss 3.770710 (+1.00z)| norm 0.2614 (-1.09z)| lr 5.78e-04 | 323.23 ms | 52.2% bf16 MFU | 1623458 tok/s step 3011/19560 | loss 3.768864 (+0.95z)| norm 0.2824 (-0.18z)| lr 5.78e-04 | 323.37 ms | 52.2% bf16 MFU | 1623352 tok/s step 3012/19560 | loss 3.734732 (+0.20z)| norm 0.2883 (+0.07z)| lr 5.78e-04 | 322.59 ms | 52.3% bf16 MFU | 1623447 tok/s step 3013/19560 | loss 3.706581 (-0.42z)| norm 0.2744 (-0.53z)| lr 5.78e-04 | 322.92 ms | 52.3% bf16 MFU | 1623455 tok/s step 3014/19560 | loss 3.751149 (+0.57z)| norm 0.2840 (-0.12z)| lr 5.78e-04 | 323.10 ms | 52.2% bf16 MFU | 1623417 tok/s step 3015/19560 | loss 3.775253 (+1.09z)| norm 0.2956 (+0.37z)| lr 5.78e-04 | 323.17 ms | 52.2% bf16 MFU | 1623363 tok/s step 3016/19560 | loss 3.722136 (-0.09z)| norm 0.2889 (+0.08z)| lr 
5.78e-04 | 323.08 ms | 52.2% bf16 MFU | 1623333 tok/s step 3017/19560 | loss 3.692103 (-0.75z)| norm 0.2672 (-0.86z)| lr 5.78e-04 | 322.72 ms | 52.3% bf16 MFU | 1623396 tok/s step 3018/19560 | loss 3.730867 (+0.10z)| norm 0.2908 (+0.16z)| lr 5.78e-04 | 323.33 ms | 52.2% bf16 MFU | 1623302 tok/s step 3019/19560 | loss 3.716696 (-0.22z)| norm 0.2949 (+0.35z)| lr 5.78e-04 | 322.93 ms | 52.3% bf16 MFU | 1623315 tok/s step 3020/19560 | loss 3.682963 (-0.98z)| norm 0.2632 (-1.01z)| lr 5.78e-04 | 323.70 ms | 52.1% bf16 MFU | 1623132 tok/s step 3021/19560 | loss 3.735471 (+0.24z)| norm 0.2692 (-0.74z)| lr 5.78e-04 | 323.03 ms | 52.2% bf16 MFU | 1623126 tok/s step 3022/19560 | loss 3.721995 (-0.06z)| norm 0.2896 (+0.18z)| lr 5.78e-04 | 323.11 ms | 52.2% bf16 MFU | 1623101 tok/s step 3023/19560 | loss 3.758198 (+0.77z)| norm 0.3094 (+1.09z)| lr 5.78e-04 | 322.04 ms | 52.4% bf16 MFU | 1623346 tok/s step 3024/19560 | loss 3.706384 (-0.44z)| norm 0.3076 (+1.02z)| lr 5.78e-04 | 323.53 ms | 52.2% bf16 MFU | 1623205 tok/s step 3025/19560 | loss 3.718027 (-0.16z)| norm 0.3093 (+1.08z)| lr 5.78e-04 | 323.09 ms | 52.2% bf16 MFU | 1623181 tok/s step 3026/19560 | loss 3.726819 (+0.05z)| norm 0.3345 (+2.17z)| lr 5.78e-04 | 323.55 ms | 52.2% bf16 MFU | 1623044 tok/s step 3027/19560 | loss 3.689518 (-0.82z)| norm 0.3112 (+1.11z)| lr 5.78e-04 | 322.38 ms | 52.4% bf16 MFU | 1623206 tok/s step 3028/19560 | loss 3.850797 (+2.85z)| norm 0.2949 (+0.37z)| lr 5.78e-04 | 322.97 ms | 52.3% bf16 MFU | 1623213 tok/s step 3029/19560 | loss 3.776739 (+1.15z)| norm 0.2680 (-0.85z)| lr 5.78e-04 | 323.27 ms | 52.2% bf16 MFU | 1623143 tok/s step 3030/19560 | loss 3.721025 (-0.12z)| norm 0.2620 (-1.10z)| lr 5.78e-04 | 323.34 ms | 52.2% bf16 MFU | 1623059 tok/s step 3031/19560 | loss 3.689968 (-0.84z)| norm 0.2912 (+0.22z)| lr 5.78e-04 | 323.37 ms | 52.2% bf16 MFU | 1622972 tok/s step 3032/19560 | loss 3.708446 (-0.41z)| norm 0.2696 (-0.76z)| lr 5.78e-04 | 322.47 ms | 52.3% bf16 MFU | 1623116 tok/s step 
3033/19560 | loss 3.640597 (-1.93z)| norm 0.2758 (-0.48z)| lr 5.78e-04 | 322.45 ms | 52.3% bf16 MFU | 1623259 tok/s step 3034/19560 | loss 3.732359 (+0.14z)| norm 0.2776 (-0.40z)| lr 5.78e-04 | 323.02 ms | 52.2% bf16 MFU | 1623251 tok/s step 3035/19560 | loss 3.664042 (-1.37z)| norm 0.3274 (+1.81z)| lr 5.78e-04 | 323.00 ms | 52.3% bf16 MFU | 1623247 tok/s step 3036/19560 | loss 3.678757 (-1.04z)| norm 0.2708 (-0.73z)| lr 5.78e-04 | 322.60 ms | 52.3% bf16 MFU | 1623344 tok/s step 3037/19560 | loss 3.708128 (-0.38z)| norm 0.2836 (-0.16z)| lr 5.78e-04 | 323.19 ms | 52.2% bf16 MFU | 1623287 tok/s step 3038/19560 | loss 3.664842 (-1.33z)| norm 0.3423 (+2.41z)| lr 5.78e-04 | 322.98 ms | 52.3% bf16 MFU | 1623287 tok/s step 3039/19560 | loss 3.693046 (-0.69z)| norm 0.3332 (+1.96z)| lr 5.78e-04 | 323.46 ms | 52.2% bf16 MFU | 1623167 tok/s step 3040/19560 | loss 3.755420 (+0.75z)| norm 0.3121 (+1.03z)| lr 5.78e-04 | 323.38 ms | 52.2% bf16 MFU | 1623073 tok/s step 3041/19560 | loss 3.708917 (-0.31z)| norm 0.3212 (+1.41z)| lr 5.77e-04 | 322.75 ms | 52.3% bf16 MFU | 1623142 tok/s step 3042/19560 | loss 3.684100 (-0.91z)| norm 0.3114 (+1.05z)| lr 5.77e-04 | 322.94 ms | 52.3% bf16 MFU | 1623159 tok/s step 3043/19560 | loss 3.652668 (-1.65z)| norm 0.2885 (+0.05z)| lr 5.77e-04 | 322.75 ms | 52.3% bf16 MFU | 1623222 tok/s step 3044/19560 | loss 3.793298 (+1.72z)| norm 0.2838 (-0.15z)| lr 5.77e-04 | 323.05 ms | 52.2% bf16 MFU | 1623206 tok/s step 3045/19560 | loss 3.741689 (+0.48z)| norm 0.3190 (+1.49z)| lr 5.77e-04 | 323.29 ms | 52.2% bf16 MFU | 1623131 tok/s step 3046/19560 | loss 3.698350 (-0.55z)| norm 0.2966 (+0.44z)| lr 5.77e-04 | 322.70 ms | 52.3% bf16 MFU | 1623210 tok/s step 3047/19560 | loss 3.679331 (-0.99z)| norm 0.2807 (-0.29z)| lr 5.77e-04 | 323.28 ms | 52.2% bf16 MFU | 1623140 tok/s step 3048/19560 | loss 3.749581 (+0.68z)| norm 0.2641 (-1.08z)| lr 5.77e-04 | 322.59 ms | 52.3% bf16 MFU | 1623246 tok/s step 3049/19560 | loss 3.765181 (+1.04z)| norm 0.2933 (+0.32z)| lr 
5.77e-04 | 323.16 ms | 52.2% bf16 MFU | 1623202 tok/s step 3050/19560 | loss 3.645940 (-1.77z)| norm 0.3165 (+1.42z)| lr 5.77e-04 | 323.36 ms | 52.2% bf16 MFU | 1623110 tok/s step 3051/19560 | loss 3.731107 (+0.26z)| norm 0.2721 (-0.70z)| lr 5.77e-04 | 322.78 ms | 52.3% bf16 MFU | 1623168 tok/s step 3052/19560 | loss 3.694185 (-0.62z)| norm 0.2880 (+0.07z)| lr 5.77e-04 | 322.63 ms | 52.3% bf16 MFU | 1623263 tok/s step 3053/19560 | loss 3.698130 (-0.54z)| norm 0.2613 (-1.19z)| lr 5.77e-04 | 322.84 ms | 52.3% bf16 MFU | 1623300 tok/s step 3054/19560 | loss 3.749548 (+0.69z)| norm 0.2548 (-1.49z)| lr 5.77e-04 | 322.76 ms | 52.3% bf16 MFU | 1623354 tok/s step 3055/19560 | loss 3.655128 (-1.55z)| norm 0.2616 (-1.15z)| lr 5.77e-04 | 323.21 ms | 52.2% bf16 MFU | 1623292 tok/s step 3056/19560 | loss 3.722019 (+0.04z)| norm 0.2623 (-1.11z)| lr 5.77e-04 | 322.69 ms | 52.3% bf16 MFU | 1623365 tok/s step 3057/19560 | loss 3.679040 (-0.97z)| norm 0.2674 (-0.87z)| lr 5.77e-04 | 323.18 ms | 52.2% bf16 MFU | 1623310 tok/s step 3058/19560 | loss 3.663565 (-1.34z)| norm 0.2673 (-0.89z)| lr 5.77e-04 | 323.14 ms | 52.2% bf16 MFU | 1623268 tok/s step 3059/19560 | loss 3.743421 (+0.55z)| norm 0.2643 (-1.02z)| lr 5.77e-04 | 323.03 ms | 52.2% bf16 MFU | 1623255 tok/s step 3060/19560 | loss 3.757103 (+0.86z)| norm 0.2542 (-1.48z)| lr 5.77e-04 | 323.04 ms | 52.2% bf16 MFU | 1623240 tok/s step 3061/19560 | loss 3.709077 (-0.28z)| norm 0.2637 (-1.06z)| lr 5.77e-04 | 323.02 ms | 52.2% bf16 MFU | 1623234 tok/s step 3062/19560 | loss 3.662332 (-1.37z)| norm 0.2713 (-0.70z)| lr 5.77e-04 | 323.05 ms | 52.2% bf16 MFU | 1623218 tok/s step 3063/19560 | loss 3.711976 (-0.20z)| norm 0.2670 (-0.90z)| lr 5.77e-04 | 322.77 ms | 52.3% bf16 MFU | 1623274 tok/s step 3064/19560 | loss 3.688640 (-0.75z)| norm 0.2660 (-0.96z)| lr 5.77e-04 | 323.45 ms | 52.2% bf16 MFU | 1623158 tok/s step 3065/19560 | loss 3.722360 (+0.04z)| norm 0.2757 (-0.51z)| lr 5.77e-04 | 323.21 ms | 52.2% bf16 MFU | 1623107 tok/s step 
3066/19560 | loss 3.700021 (-0.49z)| norm 0.2810 (-0.27z)| lr 5.77e-04 | 322.75 ms | 52.3% bf16 MFU | 1623175 tok/s step 3067/19560 | loss 3.708552 (-0.28z)| norm 0.6634 (+9.61z)| lr 5.77e-04 | 322.23 ms | 52.4% bf16 MFU | 1623370 tok/s step 3068/19560 | loss 3.761899 (+0.98z)| norm 0.3724 (+2.08z)| lr 5.77e-04 | 323.29 ms | 52.2% bf16 MFU | 1623290 tok/s step 3069/19560 | loss 3.707692 (-0.30z)| norm 0.3847 (+2.32z)| lr 5.77e-04 | 322.80 ms | 52.3% bf16 MFU | 1623334 tok/s step 3070/19560 | loss 3.718183 (-0.03z)| norm 0.3910 (+2.40z)| lr 5.77e-04 | 323.16 ms | 52.2% bf16 MFU | 1623287 tok/s step 3071/19560 | loss 3.696507 (-0.55z)| norm 0.3297 (+0.91z)| lr 5.77e-04 | 323.08 ms | 52.2% bf16 MFU | 1623262 tok/s step 3072/19560 | loss 3.749359 (+0.76z)| norm 0.2897 (-0.05z)| lr 5.77e-04 | 322.16 ms | 52.4% bf16 MFU | 1623470 tok/s step 3073/19560 | loss 3.734599 (+0.39z)| norm 0.3137 (+0.53z)| lr 5.77e-04 | 323.59 ms | 52.2% bf16 MFU | 1623308 tok/s step 3074/19560 | loss 3.613250 (-2.54z)| norm 0.2811 (-0.25z)| lr 5.77e-04 | 323.54 ms | 52.2% bf16 MFU | 1623165 tok/s step 3075/19560 | loss 3.737735 (+0.48z)| norm 0.2945 (+0.07z)| lr 5.77e-04 | 323.14 ms | 52.2% bf16 MFU | 1623131 tok/s step 3076/19560 | loss 3.701699 (-0.39z)| norm 0.3158 (+0.58z)| lr 5.77e-04 | 322.41 ms | 52.3% bf16 MFU | 1623282 tok/s step 3077/19560 | loss 3.703601 (-0.34z)| norm 0.3650 (+1.74z)| lr 5.77e-04 | 323.45 ms | 52.2% bf16 MFU | 1623163 tok/s step 3078/19560 | loss 3.704483 (-0.31z)| norm 0.3143 (+0.52z)| lr 5.77e-04 | 323.31 ms | 52.2% bf16 MFU | 1623085 tok/s step 3079/19560 | loss 3.784996 (+1.65z)| norm 0.2720 (-0.49z)| lr 5.77e-04 | 323.09 ms | 52.2% bf16 MFU | 1623067 tok/s step 3080/19560 | loss 3.735521 (+0.45z)| norm 0.2980 (+0.12z)| lr 5.77e-04 | 323.27 ms | 52.2% bf16 MFU | 1623006 tok/s step 3081/19560 | loss 3.770789 (+1.29z)| norm 0.2720 (-0.50z)| lr 5.77e-04 | 323.38 ms | 52.2% bf16 MFU | 1622919 tok/s step 3082/19560 | loss 3.695321 (-0.55z)| norm 0.2708 (-0.53z)| lr 
5.77e-04 | 323.45 ms | 52.2% bf16 MFU | 1622820 tok/s step 3083/19560 | loss 3.755412 (+0.90z)| norm 0.2785 (-0.34z)| lr 5.77e-04 | 323.08 ms | 52.2% bf16 MFU | 1622819 tok/s step 3084/19560 | loss 3.798028 (+1.93z)| norm 0.2900 (-0.07z)| lr 5.77e-04 | 322.01 ms | 52.4% bf16 MFU | 1623085 tok/s step 3085/19560 | loss 3.737390 (+0.45z)| norm 0.2747 (-0.44z)| lr 5.77e-04 | 323.06 ms | 52.2% bf16 MFU | 1623075 tok/s step 3086/19560 | loss 3.766998 (+1.18z)| norm 0.2637 (-0.69z)| lr 5.77e-04 | 322.61 ms | 52.3% bf16 MFU | 1623177 tok/s step 3087/19560 | loss 3.704152 (-0.37z)| norm 0.3163 (+0.56z)| lr 5.77e-04 | 322.76 ms | 52.3% bf16 MFU | 1623239 tok/s step 3088/19560 | loss 3.671156 (-1.17z)| norm 0.3130 (+0.48z)| lr 5.77e-04 | 322.60 ms | 52.3% bf16 MFU | 1623336 tok/s step 3089/19560 | loss 3.675415 (-1.07z)| norm 0.2629 (-0.72z)| lr 5.77e-04 | 322.68 ms | 52.3% bf16 MFU | 1623408 tok/s step 3090/19560 | loss 3.676587 (-1.05z)| norm 0.2541 (-0.91z)| lr 5.77e-04 | 322.97 ms | 52.3% bf16 MFU | 1623405 tok/s step 3091/19560 | loss 3.719886 (+0.01z)| norm 0.2609 (-0.74z)| lr 5.77e-04 | 322.65 ms | 52.3% bf16 MFU | 1623482 tok/s step 3092/19560 | loss 3.667376 (-1.27z)| norm 0.2628 (-0.69z)| lr 5.77e-04 | 323.00 ms | 52.3% bf16 MFU | 1623466 tok/s step 3093/19560 | loss 3.703721 (-0.38z)| norm 0.2686 (-0.55z)| lr 5.76e-04 | 322.99 ms | 52.3% bf16 MFU | 1623456 tok/s step 3094/19560 | loss 3.731722 (+0.29z)| norm 0.2757 (-0.38z)| lr 5.76e-04 | 322.88 ms | 52.3% bf16 MFU | 1623472 tok/s step 3095/19560 | loss 3.672190 (-1.18z)| norm 0.2703 (-0.51z)| lr 5.76e-04 | 322.90 ms | 52.3% bf16 MFU | 1623481 tok/s step 3096/19560 | loss 3.642269 (-1.89z)| norm 0.2596 (-0.76z)| lr 5.76e-04 | 323.27 ms | 52.2% bf16 MFU | 1623399 tok/s step 3097/19560 | loss 3.730669 (+0.29z)| norm 0.2562 (-0.82z)| lr 5.76e-04 | 322.95 ms | 52.3% bf16 MFU | 1623400 tok/s step 3098/19560 | loss 3.683877 (-0.85z)| norm 0.2557 (-0.83z)| lr 5.76e-04 | 323.03 ms | 52.2% bf16 MFU | 1623381 tok/s step 
3099/19560 | loss 3.689938 (-0.70z)| norm 0.2509 (-0.93z)| lr 5.76e-04 | 323.21 ms | 52.2% bf16 MFU | 1623320 tok/s step 3100/19560 | loss 3.673412 (-1.09z)| norm 0.2462 (-1.03z)| lr 5.76e-04 | 323.43 ms | 52.2% bf16 MFU | 1623205 tok/s step 3101/19560 | loss 3.674001 (-1.08z)| norm 0.2529 (-0.86z)| lr 5.76e-04 | 323.19 ms | 52.2% bf16 MFU | 1623157 tok/s step 3102/19560 | loss 3.712869 (-0.13z)| norm 0.2407 (-1.13z)| lr 5.76e-04 | 323.05 ms | 52.2% bf16 MFU | 1623145 tok/s step 3103/19560 | loss 3.689792 (-0.68z)| norm 0.2694 (-0.46z)| lr 5.76e-04 | 323.18 ms | 52.2% bf16 MFU | 1623101 tok/s step 3104/19560 | loss 3.741635 (+0.60z)| norm 0.2829 (-0.14z)| lr 5.76e-04 | 323.50 ms | 52.2% bf16 MFU | 1622979 tok/s step 3105/19560 | loss 3.668437 (-1.19z)| norm 0.3068 (+0.40z)| lr 5.76e-04 | 322.80 ms | 52.3% bf16 MFU | 1623038 tok/s step 3106/19560 | loss 3.704102 (-0.32z)| norm 0.3210 (+0.73z)| lr 5.76e-04 | 323.21 ms | 52.2% bf16 MFU | 1622992 tok/s step 3107/19560 | loss 3.696824 (-0.49z)| norm 0.2829 (-0.16z)| lr 5.76e-04 | 323.16 ms | 52.2% bf16 MFU | 1622960 tok/s step 3108/19560 | loss 3.689852 (-0.66z)| norm 0.2882 (-0.04z)| lr 5.76e-04 | 323.16 ms | 52.2% bf16 MFU | 1622932 tok/s step 3109/19560 | loss 3.687230 (-0.71z)| norm 0.2737 (-0.38z)| lr 5.76e-04 | 323.64 ms | 52.1% bf16 MFU | 1622784 tok/s step 3110/19560 | loss 3.741080 (+0.63z)| norm 0.2857 (-0.10z)| lr 5.76e-04 | 323.19 ms | 52.2% bf16 MFU | 1622755 tok/s step 3111/19560 | loss 3.789825 (+1.81z)| norm 0.3749 (+1.94z)| lr 5.76e-04 | 322.76 ms | 52.3% bf16 MFU | 1622838 tok/s step 3112/19560 | loss 3.711464 (-0.12z)| norm 0.2779 (-0.29z)| lr 5.76e-04 | 323.24 ms | 52.2% bf16 MFU | 1622795 tok/s step 3113/19560 | loss 3.678874 (-0.93z)| norm 0.2669 (-0.54z)| lr 5.76e-04 | 323.11 ms | 52.2% bf16 MFU | 1622788 tok/s step 3114/19560 | loss 3.774331 (+1.42z)| norm 0.2905 (-0.01z)| lr 5.76e-04 | 322.34 ms | 52.4% bf16 MFU | 1622975 tok/s step 3115/19560 | loss 3.697179 (-0.49z)| norm 0.3158 (+0.57z)| lr 
5.76e-04 | 323.34 ms | 52.2% bf16 MFU | 1622900 tok/s step 3116/19560 | loss 3.664237 (-1.29z)| norm 0.2974 (+0.14z)| lr 5.76e-04 | 323.00 ms | 52.3% bf16 MFU | 1622913 tok/s step 3117/19560 | loss 3.708980 (-0.17z)| norm 0.2786 (-0.29z)| lr 5.76e-04 | 323.10 ms | 52.2% bf16 MFU | 1622901 tok/s step 3118/19560 | loss 3.646459 (-1.74z)| norm 0.2753 (-0.37z)| lr 5.76e-04 | 323.02 ms | 52.2% bf16 MFU | 1622909 tok/s step 3119/19560 | loss 3.663316 (-1.29z)| norm 0.3010 (+0.22z)| lr 5.76e-04 | 323.11 ms | 52.2% bf16 MFU | 1622895 tok/s step 3120/19560 | loss 3.743481 (+0.76z)| norm 0.3324 (+0.93z)| lr 5.76e-04 | 322.91 ms | 52.3% bf16 MFU | 1622931 tok/s step 3121/19560 | loss 3.851073 (+3.33z)| norm 0.3438 (+1.19z)| lr 5.76e-04 | 322.57 ms | 52.3% bf16 MFU | 1623052 tok/s step 3122/19560 | loss 3.697652 (-0.41z)| norm 0.3249 (+0.75z)| lr 5.76e-04 | 323.12 ms | 52.2% bf16 MFU | 1623028 tok/s step 3123/19560 | loss 3.638912 (-1.81z)| norm 0.2948 (+0.06z)| lr 5.76e-04 | 322.87 ms | 52.3% bf16 MFU | 1623068 tok/s step 3124/19560 | loss 3.701250 (-0.30z)| norm 0.2772 (-0.33z)| lr 5.76e-04 | 323.21 ms | 52.2% bf16 MFU | 1623021 tok/s step 3125/19560 | loss 3.760973 (+1.12z)| norm 0.2914 (-0.01z)| lr 5.76e-04 | 323.24 ms | 52.2% bf16 MFU | 1622969 tok/s step 3126/19560 | loss 3.730271 (+0.38z)| norm 0.2886 (-0.07z)| lr 5.76e-04 | 322.89 ms | 52.3% bf16 MFU | 1623009 tok/s step 3127/19560 | loss 3.649679 (-1.52z)| norm 0.2605 (-0.70z)| lr 5.76e-04 | 322.57 ms | 52.3% bf16 MFU | 1623125 tok/s step 3128/19560 | loss 3.838799 (+2.87z)| norm 0.2933 (+0.05z)| lr 5.76e-04 | 322.85 ms | 52.3% bf16 MFU | 1623164 tok/s step 3129/19560 | loss 3.704227 (-0.26z)| norm 0.2974 (+0.14z)| lr 5.76e-04 | 322.90 ms | 52.3% bf16 MFU | 1623190 tok/s step 3130/19560 | loss 3.703085 (-0.28z)| norm 0.2558 (-0.82z)| lr 5.76e-04 | 322.62 ms | 52.3% bf16 MFU | 1623284 tok/s step 3131/19560 | loss 3.767364 (+1.21z)| norm 0.2671 (-0.56z)| lr 5.76e-04 | 323.09 ms | 52.2% bf16 MFU | 1623257 tok/s step 
3132/19560 | loss 3.715461 (-0.00z)| norm 0.2716 (-0.46z)| lr 5.76e-04 | 322.57 ms | 52.3% bf16 MFU | 1623360 tok/s step 3133/19560 | loss 3.700706 (-0.34z)| norm 0.2672 (-0.56z)| lr 5.76e-04 | 322.57 ms | 52.3% bf16 MFU | 1623458 tok/s step 3134/19560 | loss 3.793281 (+1.79z)| norm 0.3018 (+0.24z)| lr 5.76e-04 | 323.06 ms | 52.2% bf16 MFU | 1623429 tok/s step 3135/19560 | loss 3.687685 (-0.64z)| norm 0.2739 (-0.40z)| lr 5.76e-04 | 322.66 ms | 52.3% bf16 MFU | 1623501 tok/s step 3136/19560 | loss 3.679383 (-0.82z)| norm 0.3006 (+0.21z)| lr 5.76e-04 | 322.55 ms | 52.3% bf16 MFU | 1623599 tok/s step 3137/19560 | loss 3.696192 (-0.42z)| norm 0.3152 (+0.54z)| lr 5.76e-04 | 322.87 ms | 52.3% bf16 MFU | 1623612 tok/s step 3138/19560 | loss 3.657652 (-1.30z)| norm 0.3327 (+0.93z)| lr 5.76e-04 | 322.55 ms | 52.3% bf16 MFU | 1623703 tok/s step 3139/19560 | loss 3.699987 (-0.30z)| norm 0.3231 (+0.70z)| lr 5.76e-04 | 323.06 ms | 52.2% bf16 MFU | 1623661 tok/s step 3140/19560 | loss 3.668320 (-1.03z)| norm 0.3270 (+0.78z)| lr 5.76e-04 | 322.80 ms | 52.3% bf16 MFU | 1623689 tok/s step 3141/19560 | loss 3.703471 (-0.21z)| norm 0.2996 (+0.15z)| lr 5.76e-04 | 323.06 ms | 52.2% bf16 MFU | 1623649 tok/s step 3142/19560 | loss 3.741819 (+0.70z)| norm 0.3137 (+0.47z)| lr 5.76e-04 | 322.79 ms | 52.3% bf16 MFU | 1623678 tok/s step 3143/19560 | loss 3.683063 (-0.67z)| norm 0.3126 (+0.44z)| lr 5.76e-04 | 323.00 ms | 52.3% bf16 MFU | 1623654 tok/s step 3144/19560 | loss 3.666512 (-1.05z)| norm 0.2970 (+0.08z)| lr 5.76e-04 | 323.25 ms | 52.2% bf16 MFU | 1623568 tok/s step 3145/19560 | loss 3.683398 (-0.65z)| norm 0.3004 (+0.16z)| lr 5.75e-04 | 322.62 ms | 52.3% bf16 MFU | 1623644 tok/s step 3146/19560 | loss 3.729162 (+0.43z)| norm 0.2642 (-0.66z)| lr 5.75e-04 | 323.21 ms | 52.2% bf16 MFU | 1623568 tok/s step 3147/19560 | loss 3.718845 (+0.18z)| norm 0.2923 (-0.02z)| lr 5.75e-04 | 323.13 ms | 52.2% bf16 MFU | 1623517 tok/s step 3148/19560 | loss 3.668958 (-0.99z)| norm 0.2715 (-0.50z)| lr 
5.75e-04 | 322.53 ms | 52.3% bf16 MFU | 1623618 tok/s step 3149/19560 | loss 3.681247 (-0.69z)| norm 0.2512 (-0.96z)| lr 5.75e-04 | 322.84 ms | 52.3% bf16 MFU | 1623636 tok/s step 3150/19560 | loss 3.656999 (-1.24z)| norm 0.2677 (-0.58z)| lr 5.75e-04 | 322.88 ms | 52.3% bf16 MFU | 1623644 tok/s step 3151/19560 | loss 3.814508 (+2.39z)| norm 0.2910 (-0.04z)| lr 5.75e-04 | 322.91 ms | 52.3% bf16 MFU | 1623643 tok/s step 3152/19560 | loss 3.620931 (-2.00z)| norm 0.2708 (-0.50z)| lr 5.75e-04 | 322.72 ms | 52.3% bf16 MFU | 1623691 tok/s step 3153/19560 | loss 3.682433 (-0.61z)| norm 0.2682 (-0.55z)| lr 5.75e-04 | 322.91 ms | 52.3% bf16 MFU | 1623687 tok/s step 3154/19560 | loss 3.735785 (+0.59z)| norm 0.2772 (-0.33z)| lr 5.75e-04 | 323.03 ms | 52.2% bf16 MFU | 1623654 tok/s step 3155/19560 | loss 3.689265 (-0.46z)| norm 0.2682 (-0.53z)| lr 5.75e-04 | 322.56 ms | 52.3% bf16 MFU | 1623741 tok/s step 3156/19560 | loss 3.713635 (+0.12z)| norm 0.2919 (+0.01z)| lr 5.75e-04 | 323.25 ms | 52.2% bf16 MFU | 1623651 tok/s step 3157/19560 | loss 3.645545 (-1.46z)| norm 0.2667 (-0.56z)| lr 5.75e-04 | 322.69 ms | 52.3% bf16 MFU | 1623706 tok/s step 3158/19560 | loss 3.681895 (-0.59z)| norm 0.2873 (-0.10z)| lr 5.75e-04 | 322.92 ms | 52.3% bf16 MFU | 1623700 tok/s step 3159/19560 | loss 3.661126 (-1.07z)| norm 0.3038 (+0.27z)| lr 5.75e-04 | 323.54 ms | 52.2% bf16 MFU | 1623538 tok/s step 3160/19560 | loss 3.761309 (+1.26z)| norm 0.2462 (-1.03z)| lr 5.75e-04 | 322.40 ms | 52.3% bf16 MFU | 1623672 tok/s step 3161/19560 | loss 3.708754 (+0.02z)| norm 0.2878 (-0.09z)| lr 5.75e-04 | 323.31 ms | 52.2% bf16 MFU | 1623570 tok/s step 3162/19560 | loss 3.679554 (-0.65z)| norm 0.2752 (-0.38z)| lr 5.75e-04 | 322.83 ms | 52.3% bf16 MFU | 1623595 tok/s step 3163/19560 | loss 3.703671 (-0.10z)| norm 0.2823 (-0.21z)| lr 5.75e-04 | 322.71 ms | 52.3% bf16 MFU | 1623647 tok/s step 3164/19560 | loss 3.674170 (-0.79z)| norm 0.2716 (-0.45z)| lr 5.75e-04 | 322.94 ms | 52.3% bf16 MFU | 1623638 tok/s step 
3165/19560 | loss 3.690962 (-0.39z)| norm 0.2896 (-0.04z)| lr 5.75e-04 | 322.73 ms | 52.3% bf16 MFU | 1623684 tok/s step 3166/19560 | loss 3.669657 (-0.89z)| norm 0.3024 (+0.26z)| lr 5.75e-04 | 322.75 ms | 52.3% bf16 MFU | 1623721 tok/s step 3167/19560 | loss 3.695136 (-0.29z)| norm 0.3007 (+0.23z)| lr 5.75e-04 | 322.84 ms | 52.3% bf16 MFU | 1623734 tok/s step 3168/19560 | loss 3.708281 (+0.02z)| norm 0.3105 (+0.45z)| lr 5.75e-04 | 322.88 ms | 52.3% bf16 MFU | 1623736 tok/s step 3169/19560 | loss 3.655650 (-1.21z)| norm 0.2939 (+0.08z)| lr 5.75e-04 | 323.23 ms | 52.2% bf16 MFU | 1623650 tok/s step 3170/19560 | loss 3.691179 (-0.37z)| norm 0.2857 (-0.11z)| lr 5.75e-04 | 322.33 ms | 52.4% bf16 MFU | 1623794 tok/s step 3171/19560 | loss 3.678432 (-0.68z)| norm 0.3284 (+0.87z)| lr 5.75e-04 | 323.38 ms | 52.2% bf16 MFU | 1623667 tok/s step 3172/19560 | loss 3.665226 (-0.98z)| norm 0.2892 (-0.04z)| lr 5.75e-04 | 322.38 ms | 52.4% bf16 MFU | 1623799 tok/s step 3173/19560 | loss 3.740482 (+0.83z)| norm 0.2845 (-0.14z)| lr 5.75e-04 | 323.04 ms | 52.2% bf16 MFU | 1623758 tok/s step 3174/19560 | loss 3.696718 (-0.22z)| norm 0.2976 (+0.16z)| lr 5.75e-04 | 322.30 ms | 52.4% bf16 MFU | 1623906 tok/s step 3175/19560 | loss 3.595816 (-2.56z)| norm 0.2573 (-0.76z)| lr 5.75e-04 | 322.72 ms | 52.3% bf16 MFU | 1623939 tok/s step 3176/19560 | loss 3.634440 (-1.63z)| norm 0.2912 (+0.02z)| lr 5.75e-04 | 323.09 ms | 52.2% bf16 MFU | 1623880 tok/s step 3177/19560 | loss 3.663131 (-0.95z)| norm 0.2839 (-0.15z)| lr 5.75e-04 | 323.07 ms | 52.2% bf16 MFU | 1623828 tok/s step 3178/19560 | loss 3.691218 (-0.30z)| norm 0.2721 (-0.41z)| lr 5.75e-04 | 322.31 ms | 52.4% bf16 MFU | 1623968 tok/s step 3179/19560 | loss 3.693584 (-0.24z)| norm 0.2671 (-0.53z)| lr 5.75e-04 | 322.83 ms | 52.3% bf16 MFU | 1623972 tok/s step 3180/19560 | loss 3.687956 (-0.37z)| norm 0.2707 (-0.44z)| lr 5.75e-04 | 322.74 ms | 52.3% bf16 MFU | 1623998 tok/s step 3181/19560 | loss 3.680115 (-0.55z)| norm 0.2802 (-0.23z)| lr 
5.75e-04 | 323.08 ms | 52.2% bf16 MFU | 1623938 tok/s step 3182/19560 | loss 3.704582 (+0.03z)| norm 0.3015 (+0.25z)| lr 5.75e-04 | 322.85 ms | 52.3% bf16 MFU | 1623937 tok/s step 3183/19560 | loss 3.657591 (-1.08z)| norm 0.2666 (-0.55z)| lr 5.75e-04 | 322.48 ms | 52.3% bf16 MFU | 1624030 tok/s step 3184/19560 | loss 3.686415 (-0.39z)| norm 0.2514 (-0.90z)| lr 5.75e-04 | 323.16 ms | 52.2% bf16 MFU | 1623946 tok/s step 3185/19560 | loss 3.632821 (-1.63z)| norm 0.2794 (-0.26z)| lr 5.75e-04 | 322.29 ms | 52.4% bf16 MFU | 1624086 tok/s step 3186/19560 | loss 3.674569 (-0.66z)| norm 0.2775 (-0.30z)| lr 5.75e-04 | 323.36 ms | 52.2% bf16 MFU | 1623951 tok/s step 3187/19560 | loss 3.711024 (+0.20z)| norm 0.2877 (-0.07z)| lr 5.75e-04 | 322.38 ms | 52.4% bf16 MFU | 1624068 tok/s step 3188/19560 | loss 3.691062 (-0.26z)| norm 0.3095 (+0.42z)| lr 5.75e-04 | 322.68 ms | 52.3% bf16 MFU | 1624104 tok/s step 3189/19560 | loss 3.699739 (-0.05z)| norm 0.2802 (-0.26z)| lr 5.75e-04 | 323.15 ms | 52.2% bf16 MFU | 1624020 tok/s step 3190/19560 | loss 3.678455 (-0.56z)| norm 0.3144 (+0.53z)| lr 5.75e-04 | 322.95 ms | 52.3% bf16 MFU | 1623991 tok/s step 3191/19560 | loss 3.715856 (+0.33z)| norm 0.3324 (+0.93z)| lr 5.75e-04 | 322.76 ms | 52.3% bf16 MFU | 1624012 tok/s step 3192/19560 | loss 3.751519 (+1.16z)| norm 0.3127 (+0.47z)| lr 5.75e-04 | 322.80 ms | 52.3% bf16 MFU | 1624020 tok/s step 3193/19560 | loss 3.712334 (+0.23z)| norm 0.2874 (-0.12z)| lr 5.75e-04 | 322.78 ms | 52.3% bf16 MFU | 1624033 tok/s step 3194/19560 | loss 3.664166 (-0.89z)| norm 0.2553 (-0.86z)| lr 5.75e-04 | 322.85 ms | 52.3% bf16 MFU | 1624029 tok/s step 3195/19560 | loss 3.609713 (-2.12z)| norm 0.2793 (-0.36z)| lr 5.74e-04 | 322.53 ms | 52.3% bf16 MFU | 1624105 tok/s step 3196/19560 | loss 3.707566 (+0.15z)| norm 0.2843 (-0.16z)| lr 5.74e-04 | 322.82 ms | 52.3% bf16 MFU | 1624104 tok/s step 3197/19560 | loss 3.639666 (-1.41z)| norm 0.2453 (-1.63z)| lr 5.74e-04 | 322.50 ms | 52.3% bf16 MFU | 1624184 tok/s step 
3198/19560 | loss 3.713089 (+0.29z)| norm 0.2436 (-1.74z)| lr 5.74e-04 | 322.86 ms | 52.3% bf16 MFU | 1624168 tok/s step 3199/19560 | loss 3.645424 (-1.25z)| norm 0.2652 (-0.86z)| lr 5.74e-04 | 322.61 ms | 52.3% bf16 MFU | 1624216 tok/s step 3200/19560 | loss 3.682593 (-0.39z)| norm 0.2506 (-1.43z)| lr 5.74e-04 | 322.86 ms | 52.3% bf16 MFU | 1624199 tok/s step 3201/19560 | loss 3.631286 (-1.55z)| norm 0.2334 (-2.08z)| lr 5.74e-04 | 322.78 ms | 52.3% bf16 MFU | 1624203 tok/s step 3202/19560 | loss 3.672474 (-0.62z)| norm 0.2414 (-1.72z)| lr 5.74e-04 | 322.88 ms | 52.3% bf16 MFU | 1624183 tok/s step 3203/19560 | loss 3.700020 (+0.03z)| norm 0.2450 (-1.55z)| lr 5.74e-04 | 322.88 ms | 52.3% bf16 MFU | 1624162 tok/s step 3204/19560 | loss 3.725207 (+0.61z)| norm 0.2829 (-0.05z)| lr 5.74e-04 | 322.79 ms | 52.3% bf16 MFU | 1624165 tok/s step 3205/19560 | loss 3.726603 (+0.64z)| norm 0.2872 (+0.15z)| lr 5.74e-04 | 322.57 ms | 52.3% bf16 MFU | 1624224 tok/s step 3206/19560 | loss 3.667387 (-0.73z)| norm 0.2800 (-0.13z)| lr 5.74e-04 | 322.93 ms | 52.3% bf16 MFU | 1624188 tok/s step 3207/19560 | loss 3.667254 (-0.72z)| norm 0.2905 (+0.30z)| lr 5.74e-04 | 323.03 ms | 52.2% bf16 MFU | 1624130 tok/s step 3208/19560 | loss 3.727790 (+0.70z)| norm 0.2895 (+0.26z)| lr 5.74e-04 | 322.85 ms | 52.3% bf16 MFU | 1624121 tok/s step 3209/19560 | loss 3.684154 (-0.31z)| norm 0.2826 (-0.03z)| lr 5.74e-04 | 322.89 ms | 52.3% bf16 MFU | 1624101 tok/s step 3210/19560 | loss 3.692698 (-0.11z)| norm 0.2786 (-0.20z)| lr 5.74e-04 | 323.10 ms | 52.2% bf16 MFU | 1624031 tok/s step 3211/19560 | loss 3.621390 (-1.77z)| norm 0.2673 (-0.67z)| lr 5.74e-04 | 322.80 ms | 52.3% bf16 MFU | 1624039 tok/s step 3212/19560 | loss 3.691805 (-0.09z)| norm 0.2384 (-1.83z)| lr 5.74e-04 | 322.87 ms | 52.3% bf16 MFU | 1624030 tok/s step 3213/19560 | loss 3.728368 (+0.81z)| norm 0.2556 (-1.12z)| lr 5.74e-04 | 322.66 ms | 52.3% bf16 MFU | 1624073 tok/s step 3214/19560 | loss 3.701577 (+0.17z)| norm 0.2359 (-1.89z)| lr 
5.74e-04 | 322.78 ms | 52.3% bf16 MFU | 1624084 tok/s step 3215/19560 | loss 3.642787 (-1.26z)| norm 0.2546 (-1.12z)| lr 5.74e-04 | 322.92 ms | 52.3% bf16 MFU | 1624058 tok/s step 3216/19560 | loss 3.680385 (-0.34z)| norm 0.2608 (-0.85z)| lr 5.74e-04 | 323.03 ms | 52.2% bf16 MFU | 1624008 tok/s step 3217/19560 | loss 3.756570 (+1.50z)| norm 0.2978 (+0.64z)| lr 5.74e-04 | 322.82 ms | 52.3% bf16 MFU | 1624011 tok/s step 3218/19560 | loss 3.683888 (-0.27z)| norm 0.3179 (+1.44z)| lr 5.74e-04 | 322.78 ms | 52.3% bf16 MFU | 1624024 tok/s step 3219/19560 | loss 3.689615 (-0.13z)| norm 0.3216 (+1.56z)| lr 5.74e-04 | 322.97 ms | 52.3% bf16 MFU | 1623989 tok/s step 3220/19560 | loss 3.662769 (-0.78z)| norm 0.2888 (+0.23z)| lr 5.74e-04 | 323.08 ms | 52.2% bf16 MFU | 1623929 tok/s step 3221/19560 | loss 3.700569 (+0.14z)| norm 0.2489 (-1.37z)| lr 5.74e-04 | 323.33 ms | 52.2% bf16 MFU | 1623809 tok/s step 3222/19560 | loss 3.715732 (+0.51z)| norm 0.2649 (-0.72z)| lr 5.74e-04 | 322.72 ms | 52.3% bf16 MFU | 1623848 tok/s step 3223/19560 | loss 3.688141 (-0.16z)| norm 0.2863 (+0.13z)| lr 5.74e-04 | 322.69 ms | 52.3% bf16 MFU | 1623892 tok/s step 3224/19560 | loss 3.654345 (-0.99z)| norm 0.2856 (+0.09z)| lr 5.74e-04 | 322.69 ms | 52.3% bf16 MFU | 1623935 tok/s step 3225/19560 | loss 3.696919 (+0.06z)| norm 0.2824 (-0.04z)| lr 5.74e-04 | 323.36 ms | 52.2% bf16 MFU | 1623807 tok/s step 3226/19560 | loss 3.668742 (-0.63z)| norm 0.2755 (-0.33z)| lr 5.74e-04 | 323.50 ms | 52.2% bf16 MFU | 1623650 tok/s step 3227/19560 | loss 3.695070 (+0.01z)| norm 0.3123 (+1.15z)| lr 5.74e-04 | 322.49 ms | 52.3% bf16 MFU | 1623756 tok/s step 3228/19560 | loss 3.723274 (+0.70z)| norm 0.3223 (+1.53z)| lr 5.74e-04 | 322.62 ms | 52.3% bf16 MFU | 1623822 tok/s step 3229/19560 | loss 3.613341 (-1.96z)| norm 0.3216 (+1.48z)| lr 5.74e-04 | 322.70 ms | 52.3% bf16 MFU | 1623864 tok/s step 3230/19560 | loss 3.764103 (+1.66z)| norm 0.3207 (+1.43z)| lr 5.74e-04 | 323.39 ms | 52.2% bf16 MFU | 1623731 tok/s step 
3231/19560 | loss 3.759909 (+1.53z)| norm 0.2970 (+0.45z)| lr 5.74e-04 | 322.98 ms | 52.3% bf16 MFU | 1623708 tok/s step 3232/19560 | loss 3.706000 (+0.26z)| norm 0.2940 (+0.32z)| lr 5.74e-04 | 323.03 ms | 52.2% bf16 MFU | 1623675 tok/s step 3233/19560 | loss 3.689104 (-0.15z)| norm 0.2726 (-0.54z)| lr 5.74e-04 | 323.22 ms | 52.2% bf16 MFU | 1623597 tok/s step 3234/19560 | loss 3.747208 (+1.22z)| norm 0.2732 (-0.51z)| lr 5.74e-04 | 323.01 ms | 52.2% bf16 MFU | 1623574 tok/s step 3235/19560 | loss 3.644205 (-1.20z)| norm 0.2835 (-0.08z)| lr 5.74e-04 | 323.33 ms | 52.2% bf16 MFU | 1623472 tok/s step 3236/19560 | loss 3.755828 (+1.41z)| norm 0.2781 (-0.30z)| lr 5.74e-04 | 322.58 ms | 52.3% bf16 MFU | 1623563 tok/s step 3237/19560 | loss 3.729925 (+0.79z)| norm 0.2985 (+0.53z)| lr 5.74e-04 | 322.65 ms | 52.3% bf16 MFU | 1623633 tok/s step 3238/19560 | loss 3.794547 (+2.25z)| norm 0.3131 (+1.12z)| lr 5.74e-04 | 323.49 ms | 52.2% bf16 MFU | 1623487 tok/s step 3239/19560 | loss 3.729255 (+0.78z)| norm 0.2598 (-1.09z)| lr 5.74e-04 | 323.11 ms | 52.2% bf16 MFU | 1623445 tok/s step 3240/19560 | loss 3.701605 (+0.13z)| norm 0.2714 (-0.58z)| lr 5.74e-04 | 322.71 ms | 52.3% bf16 MFU | 1623504 tok/s step 3241/19560 | loss 3.666129 (-0.69z)| norm 0.3031 (+0.77z)| lr 5.74e-04 | 322.72 ms | 52.3% bf16 MFU | 1623560 tok/s step 3242/19560 | loss 3.668031 (-0.63z)| norm 0.3104 (+1.08z)| lr 5.74e-04 | 323.20 ms | 52.2% bf16 MFU | 1623491 tok/s step 3243/19560 | loss 3.696621 (+0.04z)| norm 0.2985 (+0.57z)| lr 5.74e-04 | 322.50 ms | 52.3% bf16 MFU | 1623602 tok/s step 3244/19560 | loss 3.711979 (+0.39z)| norm 0.2989 (+0.59z)| lr 5.73e-04 | 322.98 ms | 52.3% bf16 MFU | 1623586 tok/s step 3245/19560 | loss 3.775847 (+1.86z)| norm 0.2633 (-0.94z)| lr 5.73e-04 | 322.87 ms | 52.3% bf16 MFU | 1623597 tok/s step 3246/19560 | loss 3.706805 (+0.25z)| norm 0.2680 (-0.73z)| lr 5.73e-04 | 322.75 ms | 52.3% bf16 MFU | 1623639 tok/s step 3247/19560 | loss 3.709409 (+0.30z)| norm 0.2639 (-0.90z)| lr 
5.73e-04 | 323.42 ms | 52.2% bf16 MFU | 1623512 tok/s step 3248/19560 | loss 3.634936 (-1.42z)| norm 0.2419 (-1.82z)| lr 5.73e-04 | 322.98 ms | 52.3% bf16 MFU | 1623501 tok/s step 3249/19560 | loss 3.740180 (+1.11z)| norm 0.2559 (-1.21z)| lr 5.73e-04 | 323.46 ms | 52.2% bf16 MFU | 1623370 tok/s step 3250/19560 | loss 3.688173 (-0.16z)| norm 0.2729 (-0.45z)| lr 5.73e-04 | 322.96 ms | 52.3% bf16 MFU | 1623370 tok/s val loss 3.678645 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating HellaSwag: 70/79 HellaSwag: 2683/10042 = 0.267178 step 3251/19560 | loss 3.696390 (+0.03z)| norm 0.2589 (-1.05z)| lr 5.73e-04 | 322.11 ms | 52.4% bf16 MFU | 1623586 tok/s step 3252/19560 | loss 3.702756 (+0.18z)| norm 0.2548 (-1.22z)| lr 5.73e-04 | 323.02 ms | 52.2% bf16 MFU | 1623561 tok/s step 3253/19560 | loss 3.764822 (+1.72z)| norm 0.4818 (+6.91z)| lr 5.73e-04 | 323.66 ms | 52.1% bf16 MFU | 1623376 tok/s step 3254/19560 | loss 3.742220 (+1.15z)| norm 0.3282 (+1.52z)| lr 5.73e-04 | 323.36 ms | 52.2% bf16 MFU | 1623275 tok/s step 3255/19560 | loss 3.713723 (+0.44z)| norm 0.2985 (+0.48z)| lr 5.73e-04 | 323.22 ms | 52.2% bf16 MFU | 1623216 tok/s step 3256/19560 | loss 3.655802 (-1.01z)| norm 0.2845 (+0.00z)| lr 5.73e-04 | 323.51 ms | 52.2% bf16 MFU | 1623088 tok/s step 3257/19560 | loss 3.627860 (-1.70z)| norm 0.2892 (+0.17z)| lr 5.73e-04 | 323.49 ms | 52.2% bf16 MFU | 1622969 tok/s step 3258/19560 | loss 3.671021 (-0.58z)| norm 0.2831 (-0.06z)| lr 5.73e-04 | 323.51 ms | 52.2% bf16 MFU | 1622851 tok/s step 3259/19560 | loss 3.824357 (+3.25z)| norm 0.2858 (+0.04z)| lr 5.73e-04 | 323.41 ms | 52.2% bf16 MFU | 1622765 tok/s step 3260/19560 | loss 3.655288 (-0.95z)| norm 0.2963 (+0.40z)| lr 5.73e-04 | 323.15 ms | 52.2% bf16 MFU | 1622748 tok/s step 3261/19560 | loss 3.693375 (-0.01z)| norm 0.2779 (-0.25z)| lr 5.73e-04 | 323.53 ms | 52.2% 
bf16 MFU | 1622638 tok/s step 3262/19560 | loss 3.683002 (-0.25z)| norm 0.2697 (-0.53z)| lr 5.73e-04 | 322.96 ms | 52.3% bf16 MFU | 1622676 tok/s step 3263/19560 | loss 3.783340 (+2.24z)| norm 0.3003 (+0.53z)| lr 5.73e-04 | 323.82 ms | 52.1% bf16 MFU | 1622495 tok/s step 3264/19560 | loss 3.681397 (-0.30z)| norm 0.2974 (+0.43z)| lr 5.73e-04 | 323.26 ms | 52.2% bf16 MFU | 1622463 tok/s step 3265/19560 | loss 3.689314 (-0.10z)| norm 0.2855 (+0.02z)| lr 5.73e-04 | 322.75 ms | 52.3% bf16 MFU | 1622561 tok/s step 3266/19560 | loss 3.763644 (+1.72z)| norm 0.3186 (+1.20z)| lr 5.73e-04 | 322.72 ms | 52.3% bf16 MFU | 1622662 tok/s step 3267/19560 | loss 3.694453 (+0.01z)| norm 0.3338 (+1.72z)| lr 5.73e-04 | 323.55 ms | 52.2% bf16 MFU | 1622551 tok/s step 3268/19560 | loss 3.729711 (+0.87z)| norm 0.3129 (+1.00z)| lr 5.73e-04 | 322.80 ms | 52.3% bf16 MFU | 1622632 tok/s step 3269/19560 | loss 3.695463 (+0.02z)| norm 0.2616 (-0.80z)| lr 5.73e-04 | 322.63 ms | 52.3% bf16 MFU | 1622752 tok/s step 3270/19560 | loss 3.710131 (+0.39z)| norm 0.2604 (-0.83z)| lr 5.73e-04 | 322.51 ms | 52.3% bf16 MFU | 1622897 tok/s step 3271/19560 | loss 3.678530 (-0.39z)| norm 0.2728 (-0.38z)| lr 5.73e-04 | 322.84 ms | 52.3% bf16 MFU | 1622950 tok/s step 3272/19560 | loss 3.666641 (-0.69z)| norm 0.2730 (-0.37z)| lr 5.73e-04 | 322.71 ms | 52.3% bf16 MFU | 1623034 tok/s step 3273/19560 | loss 3.723111 (+0.71z)| norm 0.2390 (-1.54z)| lr 5.73e-04 | 322.99 ms | 52.3% bf16 MFU | 1623044 tok/s step 3274/19560 | loss 3.701970 (+0.19z)| norm 0.2647 (-0.64z)| lr 5.73e-04 | 323.23 ms | 52.2% bf16 MFU | 1622994 tok/s step 3275/19560 | loss 3.760213 (+1.61z)| norm 0.2770 (-0.20z)| lr 5.73e-04 | 322.80 ms | 52.3% bf16 MFU | 1623054 tok/s step 3276/19560 | loss 3.698072 (+0.08z)| norm 0.2742 (-0.31z)| lr 5.73e-04 | 322.50 ms | 52.3% bf16 MFU | 1623187 tok/s step 3277/19560 | loss 3.624154 (-1.72z)| norm 0.2896 (+0.23z)| lr 5.73e-04 | 323.11 ms | 52.2% bf16 MFU | 1623158 tok/s step 3278/19560 | loss 3.760365 
(+1.58z)| norm 0.2846 (+0.05z)| lr 5.73e-04 | 322.72 ms | 52.3% bf16 MFU | 1623231 tok/s step 3279/19560 | loss 3.684399 (-0.25z)| norm 0.2832 (-0.00z)| lr 5.73e-04 | 322.75 ms | 52.3% bf16 MFU | 1623292 tok/s step 3280/19560 | loss 3.649263 (-1.15z)| norm 0.2755 (-0.28z)| lr 5.73e-04 | 323.27 ms | 52.2% bf16 MFU | 1623220 tok/s step 3281/19560 | loss 3.687056 (-0.19z)| norm 0.2907 (+0.26z)| lr 5.73e-04 | 322.92 ms | 52.3% bf16 MFU | 1623238 tok/s step 3282/19560 | loss 3.753514 (+1.48z)| norm 0.3137 (+1.06z)| lr 5.73e-04 | 323.89 ms | 52.1% bf16 MFU | 1623013 tok/s step 3283/19560 | loss 3.743462 (+1.21z)| norm 0.2928 (+0.31z)| lr 5.73e-04 | 322.79 ms | 52.3% bf16 MFU | 1623073 tok/s step 3284/19560 | loss 3.705314 (+0.26z)| norm 0.2542 (-1.04z)| lr 5.73e-04 | 323.00 ms | 52.3% bf16 MFU | 1623080 tok/s step 3285/19560 | loss 3.699313 (+0.09z)| norm 0.2625 (-0.74z)| lr 5.73e-04 | 322.63 ms | 52.3% bf16 MFU | 1623178 tok/s step 3286/19560 | loss 3.721731 (+0.65z)| norm 0.2794 (-0.14z)| lr 5.73e-04 | 322.78 ms | 52.3% bf16 MFU | 1623233 tok/s step 3287/19560 | loss 3.670036 (-0.65z)| norm 0.2723 (-0.39z)| lr 5.73e-04 | 322.92 ms | 52.3% bf16 MFU | 1623251 tok/s step 3288/19560 | loss 3.671425 (-0.61z)| norm 0.2600 (-0.83z)| lr 5.73e-04 | 322.77 ms | 52.3% bf16 MFU | 1623306 tok/s step 3289/19560 | loss 3.685275 (-0.25z)| norm 0.2526 (-1.08z)| lr 5.73e-04 | 322.79 ms | 52.3% bf16 MFU | 1623354 tok/s step 3290/19560 | loss 3.684609 (-0.27z)| norm 0.2772 (-0.21z)| lr 5.73e-04 | 322.83 ms | 52.3% bf16 MFU | 1623387 tok/s step 3291/19560 | loss 3.718636 (+0.60z)| norm 0.2931 (+0.35z)| lr 5.73e-04 | 322.69 ms | 52.3% bf16 MFU | 1623456 tok/s step 3292/19560 | loss 3.725147 (+0.75z)| norm 0.2709 (-0.43z)| lr 5.72e-04 | 323.13 ms | 52.2% bf16 MFU | 1623409 tok/s step 3293/19560 | loss 3.655217 (-1.02z)| norm 0.2700 (-0.46z)| lr 5.72e-04 | 322.91 ms | 52.3% bf16 MFU | 1623420 tok/s step 3294/19560 | loss 3.673260 (-0.56z)| norm 0.2636 (-0.67z)| lr 5.72e-04 | 322.80 ms | 52.3% 
bf16 MFU | 1623459 tok/s step 3295/19560 | loss 3.704463 (+0.23z)| norm 0.2607 (-0.76z)| lr 5.72e-04 | 322.70 ms | 52.3% bf16 MFU | 1623522 tok/s step 3296/19560 | loss 3.643950 (-1.28z)| norm 0.3078 (+0.89z)| lr 5.72e-04 | 322.98 ms | 52.3% bf16 MFU | 1623510 tok/s step 3297/19560 | loss 3.719840 (+0.62z)| norm 0.3690 (+2.92z)| lr 5.72e-04 | 322.59 ms | 52.3% bf16 MFU | 1623596 tok/s step 3298/19560 | loss 3.706294 (+0.27z)| norm 0.3527 (+2.30z)| lr 5.72e-04 | 323.24 ms | 52.2% bf16 MFU | 1623516 tok/s step 3299/19560 | loss 3.690620 (-0.13z)| norm 0.2869 (+0.12z)| lr 5.72e-04 | 322.74 ms | 52.3% bf16 MFU | 1623564 tok/s step 3300/19560 | loss 3.587313 (-2.65z)| norm 0.2759 (-0.24z)| lr 5.72e-04 | 322.92 ms | 52.3% bf16 MFU | 1623565 tok/s step 3301/19560 | loss 3.681112 (-0.33z)| norm 0.3202 (+1.23z)| lr 5.72e-04 | 322.56 ms | 52.3% bf16 MFU | 1623658 tok/s step 3302/19560 | loss 3.667115 (-0.67z)| norm 0.3034 (+0.67z)| lr 5.72e-04 | 322.84 ms | 52.3% bf16 MFU | 1623675 tok/s step 3303/19560 | loss 3.694960 (-0.00z)| norm 0.3061 (+0.74z)| lr 5.72e-04 | 322.82 ms | 52.3% bf16 MFU | 1623695 tok/s step 3304/19560 | loss 3.714496 (+0.48z)| norm 0.3406 (+1.86z)| lr 5.72e-04 | 322.40 ms | 52.3% bf16 MFU | 1623821 tok/s step 3305/19560 | loss 3.651899 (-1.11z)| norm 0.2892 (+0.16z)| lr 5.72e-04 | 323.26 ms | 52.2% bf16 MFU | 1623725 tok/s step 3306/19560 | loss 3.647086 (-1.22z)| norm 0.3721 (+2.78z)| lr 5.72e-04 | 322.69 ms | 52.3% bf16 MFU | 1623775 tok/s step 3307/19560 | loss 3.692151 (-0.08z)| norm 0.3296 (+1.40z)| lr 5.72e-04 | 323.17 ms | 52.2% bf16 MFU | 1623703 tok/s step 3308/19560 | loss 3.682999 (-0.31z)| norm 0.3513 (+2.04z)| lr 5.72e-04 | 322.54 ms | 52.3% bf16 MFU | 1623793 tok/s step 3309/19560 | loss 3.677485 (-0.45z)| norm 0.3093 (+0.71z)| lr 5.72e-04 | 322.72 ms | 52.3% bf16 MFU | 1623834 tok/s step 3310/19560 | loss 3.641174 (-1.34z)| norm 0.3087 (+0.70z)| lr 5.72e-04 | 323.21 ms | 52.2% bf16 MFU | 1623748 tok/s step 3311/19560 | loss 3.739415 
(+1.10z)| norm 0.2619 (-0.76z)| lr 5.72e-04 | 322.54 ms | 52.3% bf16 MFU | 1623836 tok/s step 3312/19560 | loss 3.741284 (+1.13z)| norm 0.2649 (-0.67z)| lr 5.72e-04 | 323.58 ms | 52.2% bf16 MFU | 1623659 tok/s step 3313/19560 | loss 3.618570 (-1.91z)| norm 0.2602 (-0.81z)| lr 5.72e-04 | 322.77 ms | 52.3% bf16 MFU | 1623692 tok/s step 3314/19560 | loss 3.689782 (-0.15z)| norm 0.2528 (-1.03z)| lr 5.72e-04 | 323.12 ms | 52.2% bf16 MFU | 1623635 tok/s step 3315/19560 | loss 3.740806 (+1.11z)| norm 0.2673 (-0.58z)| lr 5.72e-04 | 322.98 ms | 52.3% bf16 MFU | 1623617 tok/s step 3316/19560 | loss 3.683110 (-0.32z)| norm 0.2726 (-0.41z)| lr 5.72e-04 | 322.77 ms | 52.3% bf16 MFU | 1623652 tok/s step 3317/19560 | loss 3.684201 (-0.29z)| norm 0.2774 (-0.26z)| lr 5.72e-04 | 322.94 ms | 52.3% bf16 MFU | 1623643 tok/s step 3318/19560 | loss 3.683357 (-0.31z)| norm 0.2395 (-1.41z)| lr 5.72e-04 | 322.57 ms | 52.3% bf16 MFU | 1623728 tok/s step 3319/19560 | loss 3.650646 (-1.10z)| norm 0.2357 (-1.50z)| lr 5.72e-04 | 323.49 ms | 52.2% bf16 MFU | 1623577 tok/s step 3320/19560 | loss 3.657915 (-0.91z)| norm 0.2517 (-0.99z)| lr 5.72e-04 | 322.49 ms | 52.3% bf16 MFU | 1623686 tok/s step 3321/19560 | loss 3.633368 (-1.49z)| norm 0.2355 (-1.46z)| lr 5.72e-04 | 322.62 ms | 52.3% bf16 MFU | 1623757 tok/s step 3322/19560 | loss 3.605494 (-2.13z)| norm 0.2562 (-0.83z)| lr 5.72e-04 | 322.84 ms | 52.3% bf16 MFU | 1623768 tok/s step 3323/19560 | loss 3.604757 (-2.14z)| norm 0.2550 (-0.86z)| lr 5.72e-04 | 323.36 ms | 52.2% bf16 MFU | 1623650 tok/s step 3324/19560 | loss 3.674684 (-0.45z)| norm 0.2698 (-0.41z)| lr 5.72e-04 | 322.29 ms | 52.4% bf16 MFU | 1623804 tok/s step 3325/19560 | loss 3.732823 (+0.94z)| norm 0.3040 (+0.62z)| lr 5.72e-04 | 322.74 ms | 52.3% bf16 MFU | 1623839 tok/s step 3326/19560 | loss 3.755291 (+1.46z)| norm 0.2798 (-0.12z)| lr 5.72e-04 | 323.14 ms | 52.2% bf16 MFU | 1623770 tok/s step 3327/19560 | loss 3.661168 (-0.80z)| norm 0.2632 (-0.63z)| lr 5.72e-04 | 323.03 ms | 52.2% 
bf16 MFU | 1623732 tok/s step 3328/19560 | loss 3.615398 (-1.86z)| norm 0.2719 (-0.37z)| lr 5.72e-04 | 322.67 ms | 52.3% bf16 MFU | 1623788 tok/s step 3329/19560 | loss 3.672793 (-0.51z)| norm 0.2785 (-0.18z)| lr 5.72e-04 | 323.02 ms | 52.2% bf16 MFU | 1623752 tok/s step 3330/19560 | loss 3.731031 (+0.87z)| norm 0.2493 (-1.10z)| lr 5.72e-04 | 322.61 ms | 52.3% bf16 MFU | 1623823 tok/s step 3331/19560 | loss 3.624873 (-1.64z)| norm 0.2445 (-1.25z)| lr 5.72e-04 | 322.81 ms | 52.3% bf16 MFU | 1623838 tok/s step 3332/19560 | loss 3.652117 (-0.98z)| norm 0.2559 (-0.88z)| lr 5.72e-04 | 323.22 ms | 52.2% bf16 MFU | 1623751 tok/s step 3333/19560 | loss 3.644719 (-1.13z)| norm 0.2552 (-0.89z)| lr 5.72e-04 | 322.91 ms | 52.3% bf16 MFU | 1623744 tok/s step 3334/19560 | loss 3.758904 (+1.52z)| norm 0.2711 (-0.40z)| lr 5.72e-04 | 322.39 ms | 52.3% bf16 MFU | 1623869 tok/s step 3335/19560 | loss 3.656636 (-0.86z)| norm 0.2773 (-0.20z)| lr 5.72e-04 | 322.98 ms | 52.3% bf16 MFU | 1623840 tok/s step 3336/19560 | loss 3.710903 (+0.41z)| norm 0.2866 (+0.09z)| lr 5.72e-04 | 323.10 ms | 52.2% bf16 MFU | 1623782 tok/s step 3337/19560 | loss 3.660943 (-0.75z)| norm 0.2874 (+0.11z)| lr 5.72e-04 | 322.87 ms | 52.3% bf16 MFU | 1623785 tok/s step 3338/19560 | loss 3.684973 (-0.19z)| norm 0.2969 (+0.40z)| lr 5.72e-04 | 323.27 ms | 52.2% bf16 MFU | 1623688 tok/s step 3339/19560 | loss 3.736895 (+1.00z)| norm 0.2979 (+0.42z)| lr 5.71e-04 | 322.79 ms | 52.3% bf16 MFU | 1623716 tok/s step 3340/19560 | loss 3.680922 (-0.31z)| norm 0.3163 (+0.98z)| lr 5.71e-04 | 323.06 ms | 52.2% bf16 MFU | 1623676 tok/s step 3341/19560 | loss 3.721260 (+0.64z)| norm 0.3133 (+0.87z)| lr 5.71e-04 | 322.61 ms | 52.3% bf16 MFU | 1623748 tok/s step 3342/19560 | loss 3.651060 (-0.99z)| norm 0.3242 (+1.19z)| lr 5.71e-04 | 322.95 ms | 52.3% bf16 MFU | 1623734 tok/s step 3343/19560 | loss 3.656392 (-0.87z)| norm 0.3022 (+0.50z)| lr 5.71e-04 | 322.11 ms | 52.4% bf16 MFU | 1623931 tok/s step 3344/19560 | loss 3.685981 
(-0.18z)| norm 0.2756 (-0.34z)| lr 5.71e-04 | 323.17 ms | 52.2% bf16 MFU | 1623851 tok/s step 3345/19560 | loss 3.724131 (+0.72z)| norm 0.2734 (-0.40z)| lr 5.71e-04 | 322.29 ms | 52.4% bf16 MFU | 1623996 tok/s step 3346/19560 | loss 3.696491 (+0.07z)| norm 0.2453 (-1.27z)| lr 5.71e-04 | 322.35 ms | 52.4% bf16 MFU | 1624120 tok/s step 3347/19560 | loss 3.706540 (+0.30z)| norm 0.2556 (-0.93z)| lr 5.71e-04 | 323.54 ms | 52.2% bf16 MFU | 1623938 tok/s step 3348/19560 | loss 3.712831 (+0.44z)| norm 0.2636 (-0.67z)| lr 5.71e-04 | 322.56 ms | 52.3% bf16 MFU | 1624011 tok/s step 3349/19560 | loss 3.681122 (-0.30z)| norm 0.2726 (-0.39z)| lr 5.71e-04 | 322.26 ms | 52.4% bf16 MFU | 1624156 tok/s step 3350/19560 | loss 3.713390 (+0.46z)| norm 0.2814 (-0.12z)| lr 5.71e-04 | 323.24 ms | 52.2% bf16 MFU | 1624048 tok/s step 3351/19560 | loss 3.668691 (-0.59z)| norm 0.2715 (-0.43z)| lr 5.71e-04 | 322.84 ms | 52.3% bf16 MFU | 1624044 tok/s step 3352/19560 | loss 3.655758 (-0.90z)| norm 0.2568 (-0.88z)| lr 5.71e-04 | 322.93 ms | 52.3% bf16 MFU | 1624018 tok/s step 3353/19560 | loss 3.700764 (+0.16z)| norm 0.2471 (-1.17z)| lr 5.71e-04 | 322.84 ms | 52.3% bf16 MFU | 1624016 tok/s step 3354/19560 | loss 3.657199 (-0.86z)| norm 0.2408 (-1.35z)| lr 5.71e-04 | 322.33 ms | 52.4% bf16 MFU | 1624144 tok/s step 3355/19560 | loss 3.684429 (-0.22z)| norm 0.2465 (-1.16z)| lr 5.71e-04 | 322.95 ms | 52.3% bf16 MFU | 1624109 tok/s step 3356/19560 | loss 3.725117 (+0.74z)| norm 0.2512 (-1.00z)| lr 5.71e-04 | 322.65 ms | 52.3% bf16 MFU | 1624150 tok/s step 3357/19560 | loss 3.607476 (-2.02z)| norm 0.2821 (-0.03z)| lr 5.71e-04 | 322.77 ms | 52.3% bf16 MFU | 1624160 tok/s step 3358/19560 | loss 3.744815 (+1.21z)| norm 0.3170 (+1.06z)| lr 5.71e-04 | 322.46 ms | 52.3% bf16 MFU | 1624246 tok/s step 3359/19560 | loss 3.660413 (-0.77z)| norm 0.3319 (+1.50z)| lr 5.71e-04 | 323.05 ms | 52.2% bf16 MFU | 1624180 tok/s step 3360/19560 | loss 3.669435 (-0.55z)| norm 0.2865 (+0.10z)| lr 5.71e-04 | 322.64 ms | 52.3% 
bf16 MFU | 1624221 tok/s step 3361/19560 | loss 3.676309 (-0.38z)| norm 0.2898 (+0.20z)| lr 5.71e-04 | 322.18 ms | 52.4% bf16 MFU | 1624376 tok/s step 3362/19560 | loss 3.694860 (+0.07z)| norm 0.3438 (+1.83z)| lr 5.71e-04 | 322.62 ms | 52.3% bf16 MFU | 1624411 tok/s step 3363/19560 | loss 3.695092 (+0.07z)| norm 0.3205 (+1.10z)| lr 5.71e-04 | 322.61 ms | 52.3% bf16 MFU | 1624448 tok/s step 3364/19560 | loss 3.646910 (-1.07z)| norm 0.2998 (+0.47z)| lr 5.71e-04 | 322.46 ms | 52.3% bf16 MFU | 1624522 tok/s step 3365/19560 | loss 3.637894 (-1.27z)| norm 0.2907 (+0.19z)| lr 5.71e-04 | 322.55 ms | 52.3% bf16 MFU | 1624568 tok/s step 3366/19560 | loss 3.657141 (-0.80z)| norm 0.2873 (+0.10z)| lr 5.71e-04 | 322.55 ms | 52.3% bf16 MFU | 1624612 tok/s step 3367/19560 | loss 3.675282 (-0.34z)| norm 0.3052 (+0.63z)| lr 5.71e-04 | 322.97 ms | 52.3% bf16 MFU | 1624549 tok/s step 3368/19560 | loss 3.741784 (+1.28z)| norm 0.3287 (+1.33z)| lr 5.71e-04 | 322.56 ms | 52.3% bf16 MFU | 1624590 tok/s step 3369/19560 | loss 3.651825 (-0.92z)| norm 0.3198 (+1.05z)| lr 5.71e-04 | 322.42 ms | 52.3% bf16 MFU | 1624667 tok/s step 3370/19560 | loss 3.677831 (-0.29z)| norm 0.3098 (+0.75z)| lr 5.71e-04 | 322.64 ms | 52.3% bf16 MFU | 1624684 tok/s step 3371/19560 | loss 3.698433 (+0.22z)| norm 0.2655 (-0.58z)| lr 5.71e-04 | 322.88 ms | 52.3% bf16 MFU | 1624639 tok/s step 3372/19560 | loss 3.731972 (+1.04z)| norm 0.2596 (-0.75z)| lr 5.71e-04 | 323.06 ms | 52.2% bf16 MFU | 1624552 tok/s step 3373/19560 | loss 3.663380 (-0.63z)| norm 0.2685 (-0.48z)| lr 5.71e-04 | 322.34 ms | 52.4% bf16 MFU | 1624651 tok/s step 3374/19560 | loss 3.697206 (+0.21z)| norm 0.2462 (-1.15z)| lr 5.71e-04 | 323.00 ms | 52.3% bf16 MFU | 1624577 tok/s step 3375/19560 | loss 3.738912 (+1.24z)| norm 0.2479 (-1.09z)| lr 5.71e-04 | 322.88 ms | 52.3% bf16 MFU | 1624538 tok/s step 3376/19560 | loss 3.643558 (-1.13z)| norm 0.2658 (-0.56z)| lr 5.71e-04 | 322.88 ms | 52.3% bf16 MFU | 1624500 tok/s step 3377/19560 | loss 3.659608 
(-0.72z)| norm 0.2527 (-0.95z)| lr 5.71e-04 | 322.95 ms | 52.3% bf16 MFU | 1624447 tok/s step 3378/19560 | loss 3.649061 (-0.97z)| norm 0.2780 (-0.19z)| lr 5.71e-04 | 323.20 ms | 52.2% bf16 MFU | 1624332 tok/s step 3379/19560 | loss 3.704080 (+0.40z)| norm 0.2944 (+0.29z)| lr 5.71e-04 | 322.74 ms | 52.3% bf16 MFU | 1624339 tok/s step 3380/19560 | loss 3.649323 (-0.95z)| norm 0.2755 (-0.28z)| lr 5.71e-04 | 322.72 ms | 52.3% bf16 MFU | 1624351 tok/s step 3381/19560 | loss 3.683176 (-0.10z)| norm 0.2933 (+0.35z)| lr 5.71e-04 | 322.74 ms | 52.3% bf16 MFU | 1624358 tok/s step 3382/19560 | loss 3.700035 (+0.34z)| norm 0.3057 (+0.81z)| lr 5.71e-04 | 323.08 ms | 52.2% bf16 MFU | 1624279 tok/s step 3383/19560 | loss 3.676276 (-0.26z)| norm 0.3207 (+1.33z)| lr 5.71e-04 | 322.64 ms | 52.3% bf16 MFU | 1624315 tok/s step 3384/19560 | loss 3.665138 (-0.54z)| norm 0.3060 (+0.80z)| lr 5.71e-04 | 322.76 ms | 52.3% bf16 MFU | 1624318 tok/s step 3385/19560 | loss 3.663056 (-0.61z)| norm 0.2916 (+0.29z)| lr 5.71e-04 | 323.35 ms | 52.2% bf16 MFU | 1624174 tok/s step 3386/19560 | loss 3.653588 (-0.84z)| norm 0.2633 (-0.71z)| lr 5.70e-04 | 322.71 ms | 52.3% bf16 MFU | 1624197 tok/s step 3387/19560 | loss 3.621528 (-1.68z)| norm 0.2580 (-0.89z)| lr 5.70e-04 | 323.13 ms | 52.2% bf16 MFU | 1624114 tok/s step 3388/19560 | loss 3.668841 (-0.43z)| norm 0.2669 (-0.57z)| lr 5.70e-04 | 322.91 ms | 52.3% bf16 MFU | 1624092 tok/s step 3389/19560 | loss 3.663946 (-0.56z)| norm 0.2502 (-1.15z)| lr 5.70e-04 | 323.01 ms | 52.2% bf16 MFU | 1624043 tok/s step 3390/19560 | loss 3.680719 (-0.11z)| norm 0.2395 (-1.50z)| lr 5.70e-04 | 322.41 ms | 52.3% bf16 MFU | 1624149 tok/s step 3391/19560 | loss 3.668332 (-0.43z)| norm 0.2614 (-0.73z)| lr 5.70e-04 | 322.59 ms | 52.3% bf16 MFU | 1624205 tok/s step 3392/19560 | loss 3.623150 (-1.63z)| norm 0.2557 (-0.91z)| lr 5.70e-04 | 322.70 ms | 52.3% bf16 MFU | 1624231 tok/s step 3393/19560 | loss 3.677561 (-0.16z)| norm 0.2690 (-0.44z)| lr 5.70e-04 | 322.46 ms | 52.3% 
bf16 MFU | 1624313 tok/s step 3394/19560 | loss 3.692694 (+0.27z)| norm 0.2579 (-0.82z)| lr 5.70e-04 | 323.11 ms | 52.2% bf16 MFU | 1624230 tok/s step 3395/19560 | loss 3.638581 (-1.20z)| norm 0.2860 (+0.18z)| lr 5.70e-04 | 322.94 ms | 52.3% bf16 MFU | 1624193 tok/s step 3396/19560 | loss 3.719738 (+1.02z)| norm 0.2760 (-0.16z)| lr 5.70e-04 | 322.80 ms | 52.3% bf16 MFU | 1624192 tok/s step 3397/19560 | loss 3.666102 (-0.44z)| norm 0.2765 (-0.15z)| lr 5.70e-04 | 322.90 ms | 52.3% bf16 MFU | 1624167 tok/s step 3398/19560 | loss 3.644530 (-1.01z)| norm 0.2669 (-0.49z)| lr 5.70e-04 | 322.33 ms | 52.4% bf16 MFU | 1624286 tok/s step 3399/19560 | loss 3.714964 (+0.90z)| norm 0.2797 (-0.04z)| lr 5.70e-04 | 323.55 ms | 52.2% bf16 MFU | 1624093 tok/s step 3400/19560 | loss 3.693952 (+0.32z)| norm 0.2861 (+0.19z)| lr 5.70e-04 | 323.44 ms | 52.2% bf16 MFU | 1623938 tok/s step 3401/19560 | loss 3.659525 (-0.61z)| norm 0.2823 (+0.04z)| lr 5.70e-04 | 322.91 ms | 52.3% bf16 MFU | 1623921 tok/s step 3402/19560 | loss 3.605315 (-2.04z)| norm 0.3053 (+0.85z)| lr 5.70e-04 | 322.64 ms | 52.3% bf16 MFU | 1623976 tok/s step 3403/19560 | loss 3.717487 (+1.01z)| norm 0.3475 (+2.30z)| lr 5.70e-04 | 322.82 ms | 52.3% bf16 MFU | 1623981 tok/s step 3404/19560 | loss 3.659146 (-0.58z)| norm 0.2878 (+0.20z)| lr 5.70e-04 | 322.98 ms | 52.3% bf16 MFU | 1623947 tok/s step 3405/19560 | loss 3.658441 (-0.61z)| norm 0.2763 (-0.21z)| lr 5.70e-04 | 323.13 ms | 52.2% bf16 MFU | 1623875 tok/s step 3406/19560 | loss 3.666961 (-0.36z)| norm 0.3150 (+1.14z)| lr 5.70e-04 | 323.76 ms | 52.1% bf16 MFU | 1623650 tok/s step 3407/19560 | loss 3.734056 (+1.50z)| norm 0.2674 (-0.52z)| lr 5.70e-04 | 322.84 ms | 52.3% bf16 MFU | 1623666 tok/s step 3408/19560 | loss 3.638472 (-1.15z)| norm 0.2803 (-0.07z)| lr 5.70e-04 | 322.99 ms | 52.3% bf16 MFU | 1623645 tok/s step 3409/19560 | loss 3.676239 (-0.10z)| norm 0.2984 (+0.56z)| lr 5.70e-04 | 322.63 ms | 52.3% bf16 MFU | 1623715 tok/s step 3410/19560 | loss 3.723595 
(+1.23z)| norm 0.3046 (+0.78z)| lr 5.70e-04 | 323.17 ms | 52.2% bf16 MFU | 1623645 tok/s step 3411/19560 | loss 3.600152 (-2.19z)| norm 0.2651 (-0.59z)| lr 5.70e-04 | 323.20 ms | 52.2% bf16 MFU | 1623572 tok/s step 3412/19560 | loss 3.605825 (-1.98z)| norm 0.2758 (-0.22z)| lr 5.70e-04 | 322.68 ms | 52.3% bf16 MFU | 1623632 tok/s step 3413/19560 | loss 3.665535 (-0.33z)| norm 0.2920 (+0.33z)| lr 5.70e-04 | 322.98 ms | 52.3% bf16 MFU | 1623614 tok/s step 3414/19560 | loss 3.724487 (+1.29z)| norm 0.3020 (+0.68z)| lr 5.70e-04 | 323.15 ms | 52.2% bf16 MFU | 1623556 tok/s step 3415/19560 | loss 3.691626 (+0.38z)| norm 0.2919 (+0.32z)| lr 5.70e-04 | 322.58 ms | 52.3% bf16 MFU | 1623642 tok/s step 3416/19560 | loss 3.659560 (-0.50z)| norm 0.2862 (+0.11z)| lr 5.70e-04 | 322.86 ms | 52.3% bf16 MFU | 1623656 tok/s step 3417/19560 | loss 3.716787 (+1.06z)| norm 0.2962 (+0.46z)| lr 5.70e-04 | 323.16 ms | 52.2% bf16 MFU | 1623593 tok/s step 3418/19560 | loss 3.693579 (+0.42z)| norm 0.2586 (-0.87z)| lr 5.70e-04 | 323.16 ms | 52.2% bf16 MFU | 1623532 tok/s step 3419/19560 | loss 3.676621 (-0.03z)| norm 0.2727 (-0.36z)| lr 5.70e-04 | 323.39 ms | 52.2% bf16 MFU | 1623416 tok/s step 3420/19560 | loss 3.670767 (-0.18z)| norm 0.2627 (-0.71z)| lr 5.70e-04 | 322.62 ms | 52.3% bf16 MFU | 1623500 tok/s step 3421/19560 | loss 3.681642 (+0.11z)| norm 0.3189 (+1.24z)| lr 5.70e-04 | 322.39 ms | 52.3% bf16 MFU | 1623637 tok/s step 3422/19560 | loss 3.737501 (+1.63z)| norm 0.2539 (-1.03z)| lr 5.70e-04 | 323.59 ms | 52.2% bf16 MFU | 1623466 tok/s step 3423/19560 | loss 3.737751 (+1.62z)| norm 0.3045 (+0.73z)| lr 5.70e-04 | 323.63 ms | 52.1% bf16 MFU | 1623294 tok/s step 3424/19560 | loss 3.669468 (-0.24z)| norm 0.2957 (+0.43z)| lr 5.70e-04 | 322.72 ms | 52.3% bf16 MFU | 1623359 tok/s step 3425/19560 | loss 3.612000 (-1.77z)| norm 0.2670 (-0.57z)| lr 5.70e-04 | 322.42 ms | 52.3% bf16 MFU | 1623495 tok/s step 3426/19560 | loss 3.694709 (+0.47z)| norm 0.2793 (-0.10z)| lr 5.70e-04 | 323.78 ms | 52.1% 
bf16 MFU | 1623284 tok/s step 3427/19560 | loss 3.663936 (-0.36z)| norm 0.2763 (-0.21z)| lr 5.70e-04 | 322.72 ms | 52.3% bf16 MFU | 1623349 tok/s step 3428/19560 | loss 3.666409 (-0.32z)| norm 0.2643 (-0.66z)| lr 5.70e-04 | 322.98 ms | 52.3% bf16 MFU | 1623346 tok/s step 3429/19560 | loss 3.645461 (-0.89z)| norm 0.2587 (-0.85z)| lr 5.70e-04 | 323.39 ms | 52.2% bf16 MFU | 1623239 tok/s step 3430/19560 | loss 3.708720 (+0.85z)| norm 0.2401 (-1.52z)| lr 5.70e-04 | 323.40 ms | 52.2% bf16 MFU | 1623135 tok/s step 3431/19560 | loss 3.661163 (-0.46z)| norm 0.2566 (-0.89z)| lr 5.70e-04 | 322.51 ms | 52.3% bf16 MFU | 1623262 tok/s step 3432/19560 | loss 3.709402 (+0.88z)| norm 0.2749 (-0.20z)| lr 5.69e-04 | 322.83 ms | 52.3% bf16 MFU | 1623300 tok/s step 3433/19560 | loss 3.650425 (-0.75z)| norm 0.2737 (-0.24z)| lr 5.69e-04 | 324.01 ms | 52.1% bf16 MFU | 1623042 tok/s step 3434/19560 | loss 3.688077 (+0.28z)| norm 0.2863 (+0.28z)| lr 5.69e-04 | 323.39 ms | 52.2% bf16 MFU | 1622950 tok/s step 3435/19560 | loss 3.634910 (-1.17z)| norm 0.4376 (+5.55z)| lr 5.69e-04 | 323.08 ms | 52.2% bf16 MFU | 1622942 tok/s step 3436/19560 | loss 3.766613 (+2.38z)| norm 0.2590 (-0.74z)| lr 5.69e-04 | 322.69 ms | 52.3% bf16 MFU | 1623031 tok/s step 3437/19560 | loss 3.719837 (+1.11z)| norm 0.2738 (-0.20z)| lr 5.69e-04 | 323.46 ms | 52.2% bf16 MFU | 1622922 tok/s step 3438/19560 | loss 3.668324 (-0.28z)| norm 0.2796 (+0.02z)| lr 5.69e-04 | 323.01 ms | 52.3% bf16 MFU | 1622934 tok/s step 3439/19560 | loss 3.667648 (-0.29z)| norm 0.3000 (+0.75z)| lr 5.69e-04 | 322.88 ms | 52.3% bf16 MFU | 1622977 tok/s step 3440/19560 | loss 3.631783 (-1.24z)| norm 0.3087 (+1.06z)| lr 5.69e-04 | 322.80 ms | 52.3% bf16 MFU | 1623038 tok/s step 3441/19560 | loss 3.656391 (-0.58z)| norm 0.3161 (+1.30z)| lr 5.69e-04 | 322.89 ms | 52.3% bf16 MFU | 1623072 tok/s step 3442/19560 | loss 3.726743 (+1.34z)| norm 0.2674 (-0.46z)| lr 5.69e-04 | 322.90 ms | 52.3% bf16 MFU | 1623102 tok/s step 3443/19560 | loss 3.709272 
(+0.87z)| norm 0.2450 (-1.26z)| lr 5.69e-04 | 322.70 ms | 52.3% bf16 MFU | 1623182 tok/s step 3444/19560 | loss 3.676770 (-0.02z)| norm 0.2527 (-0.97z)| lr 5.69e-04 | 323.09 ms | 52.2% bf16 MFU | 1623158 tok/s step 3445/19560 | loss 3.765381 (+2.35z)| norm 0.4190 (+4.53z)| lr 5.69e-04 | 323.19 ms | 52.2% bf16 MFU | 1623112 tok/s step 3446/19560 | loss 3.680556 (+0.06z)| norm 0.3657 (+2.69z)| lr 5.69e-04 | 322.24 ms | 52.4% bf16 MFU | 1623306 tok/s step 3447/19560 | loss 3.668627 (-0.26z)| norm 0.3449 (+1.98z)| lr 5.69e-04 | 322.93 ms | 52.3% bf16 MFU | 1623318 tok/s step 3448/19560 | loss 3.707021 (+0.76z)| norm 0.2991 (+0.51z)| lr 5.69e-04 | 322.92 ms | 52.3% bf16 MFU | 1623331 tok/s step 3449/19560 | loss 3.682602 (+0.09z)| norm 0.2799 (-0.11z)| lr 5.69e-04 | 322.95 ms | 52.3% bf16 MFU | 1623336 tok/s step 3450/19560 | loss 3.711821 (+0.88z)| norm 0.3065 (+0.73z)| lr 5.69e-04 | 322.81 ms | 52.3% bf16 MFU | 1623375 tok/s step 3451/19560 | loss 3.724052 (+1.20z)| norm 0.2980 (+0.45z)| lr 5.69e-04 | 322.77 ms | 52.3% bf16 MFU | 1623423 tok/s step 3452/19560 | loss 3.618874 (-1.70z)| norm 0.2766 (-0.25z)| lr 5.69e-04 | 322.76 ms | 52.3% bf16 MFU | 1623471 tok/s step 3453/19560 | loss 3.631230 (-1.34z)| norm 0.2544 (-0.95z)| lr 5.69e-04 | 323.39 ms | 52.2% bf16 MFU | 1623360 tok/s step 3454/19560 | loss 3.652442 (-0.74z)| norm 0.2848 (+0.03z)| lr 5.69e-04 | 323.01 ms | 52.3% bf16 MFU | 1623349 tok/s step 3455/19560 | loss 3.676032 (-0.08z)| norm 0.2730 (-0.35z)| lr 5.69e-04 | 322.88 ms | 52.3% bf16 MFU | 1623372 tok/s step 3456/19560 | loss 3.733888 (+1.52z)| norm 0.2937 (+0.31z)| lr 5.69e-04 | 322.76 ms | 52.3% bf16 MFU | 1623423 tok/s step 3457/19560 | loss 3.647980 (-0.89z)| norm 0.2757 (-0.27z)| lr 5.69e-04 | 322.88 ms | 52.3% bf16 MFU | 1623442 tok/s step 3458/19560 | loss 3.681678 (+0.07z)| norm 0.2795 (-0.16z)| lr 5.69e-04 | 323.12 ms | 52.2% bf16 MFU | 1623398 tok/s step 3459/19560 | loss 3.691869 (+0.34z)| norm 0.3028 (+0.59z)| lr 5.69e-04 | 323.35 ms | 52.2% 
bf16 MFU | 1623298 tok/s step 3460/19560 | loss 3.702503 (+0.64z)| norm 0.3048 (+0.64z)| lr 5.69e-04 | 323.18 ms | 52.2% bf16 MFU | 1623247 tok/s step 3461/19560 | loss 3.666767 (-0.39z)| norm 0.3252 (+1.29z)| lr 5.69e-04 | 323.40 ms | 52.2% bf16 MFU | 1623143 tok/s step 3462/19560 | loss 3.689463 (+0.28z)| norm 0.2940 (+0.27z)| lr 5.69e-04 | 323.34 ms | 52.2% bf16 MFU | 1623059 tok/s step 3463/19560 | loss 3.688694 (+0.25z)| norm 0.3160 (+0.97z)| lr 5.69e-04 | 323.14 ms | 52.2% bf16 MFU | 1623029 tok/s step 3464/19560 | loss 3.655601 (-0.71z)| norm 0.2963 (+0.33z)| lr 5.69e-04 | 323.20 ms | 52.2% bf16 MFU | 1622986 tok/s step 3465/19560 | loss 3.723301 (+1.26z)| norm 0.2736 (-0.41z)| lr 5.69e-04 | 324.36 ms | 52.0% bf16 MFU | 1622655 tok/s step 3466/19560 | loss 3.701477 (+0.62z)| norm 0.2864 (+0.01z)| lr 5.69e-04 | 323.03 ms | 52.2% bf16 MFU | 1622673 tok/s step 3467/19560 | loss 3.689270 (+0.28z)| norm 0.2939 (+0.26z)| lr 5.69e-04 | 323.35 ms | 52.2% bf16 MFU | 1622610 tok/s step 3468/19560 | loss 3.716680 (+1.07z)| norm 0.2884 (+0.08z)| lr 5.69e-04 | 323.82 ms | 52.1% bf16 MFU | 1622432 tok/s step 3469/19560 | loss 3.649822 (-0.88z)| norm 0.2929 (+0.24z)| lr 5.69e-04 | 322.57 ms | 52.3% bf16 MFU | 1622576 tok/s step 3470/19560 | loss 3.635066 (-1.30z)| norm 0.3050 (+0.64z)| lr 5.69e-04 | 322.87 ms | 52.3% bf16 MFU | 1622639 tok/s step 3471/19560 | loss 3.692979 (+0.39z)| norm 0.2636 (-0.71z)| lr 5.69e-04 | 323.45 ms | 52.2% bf16 MFU | 1622554 tok/s step 3472/19560 | loss 3.629451 (-1.45z)| norm 0.2594 (-0.84z)| lr 5.69e-04 | 322.54 ms | 52.3% bf16 MFU | 1622702 tok/s step 3473/19560 | loss 3.735872 (+1.63z)| norm 0.3274 (+1.37z)| lr 5.69e-04 | 322.73 ms | 52.3% bf16 MFU | 1622793 tok/s step 3474/19560 | loss 3.686114 (+0.20z)| norm 0.2928 (+0.23z)| lr 5.69e-04 | 322.75 ms | 52.3% bf16 MFU | 1622875 tok/s step 3475/19560 | loss 3.651187 (-0.80z)| norm 0.2682 (-0.59z)| lr 5.69e-04 | 322.98 ms | 52.3% bf16 MFU | 1622894 tok/s step 3476/19560 | loss 3.651219 
(-0.79z)| norm 0.2526 (-1.09z)| lr 5.69e-04 | 322.44 ms | 52.3% bf16 MFU | 1623050 tok/s step 3477/19560 | loss 3.638497 (-1.14z)| norm 0.2958 (+0.32z)| lr 5.68e-04 | 322.76 ms | 52.3% bf16 MFU | 1623117 tok/s step 3478/19560 | loss 3.597924 (-2.25z)| norm 0.2808 (-0.17z)| lr 5.68e-04 | 323.17 ms | 52.2% bf16 MFU | 1623078 tok/s step 3479/19560 | loss 3.625063 (-1.46z)| norm 0.2817 (-0.15z)| lr 5.68e-04 | 323.02 ms | 52.2% bf16 MFU | 1623078 tok/s step 3480/19560 | loss 3.635489 (-1.16z)| norm 0.2602 (-0.86z)| lr 5.68e-04 | 322.66 ms | 52.3% bf16 MFU | 1623169 tok/s step 3481/19560 | loss 3.642164 (-0.96z)| norm 0.2709 (-0.51z)| lr 5.68e-04 | 322.74 ms | 52.3% bf16 MFU | 1623234 tok/s step 3482/19560 | loss 3.615742 (-1.67z)| norm 0.2921 (+0.18z)| lr 5.68e-04 | 322.84 ms | 52.3% bf16 MFU | 1623272 tok/s step 3483/19560 | loss 3.696653 (+0.57z)| norm 0.2824 (-0.16z)| lr 5.68e-04 | 323.61 ms | 52.2% bf16 MFU | 1623114 tok/s step 3484/19560 | loss 3.643339 (-0.89z)| norm 0.2892 (+0.06z)| lr 5.68e-04 | 322.64 ms | 52.3% bf16 MFU | 1623208 tok/s step 3485/19560 | loss 3.673242 (-0.08z)| norm 0.3113 (+0.80z)| lr 5.68e-04 | 323.30 ms | 52.2% bf16 MFU | 1623132 tok/s step 3486/19560 | loss 3.635597 (-1.12z)| norm 0.2836 (-0.13z)| lr 5.68e-04 | 323.33 ms | 52.2% bf16 MFU | 1623051 tok/s step 3487/19560 | loss 3.661675 (-0.38z)| norm 0.2849 (-0.07z)| lr 5.68e-04 | 323.28 ms | 52.2% bf16 MFU | 1622986 tok/s step 3488/19560 | loss 3.641285 (-0.95z)| norm 0.2954 (+0.29z)| lr 5.68e-04 | 323.03 ms | 52.2% bf16 MFU | 1622989 tok/s step 3489/19560 | loss 3.725842 (+1.43z)| norm 0.2868 (-0.01z)| lr 5.68e-04 | 322.86 ms | 52.3% bf16 MFU | 1623033 tok/s step 3490/19560 | loss 3.602313 (-2.00z)| norm 0.2882 (+0.06z)| lr 5.68e-04 | 323.53 ms | 52.2% bf16 MFU | 1622908 tok/s step 3491/19560 | loss 3.735977 (+1.68z)| norm 0.2924 (+0.21z)| lr 5.68e-04 | 323.02 ms | 52.2% bf16 MFU | 1622916 tok/s step 3492/19560 | loss 3.672425 (-0.07z)| norm 0.2907 (+0.15z)| lr 5.68e-04 | 322.89 ms | 52.3% 
bf16 MFU | 1622958 tok/s step 3493/19560 | loss 3.634572 (-1.11z)| norm 0.2714 (-0.51z)| lr 5.68e-04 | 323.51 ms | 52.2% bf16 MFU | 1622843 tok/s step 3494/19560 | loss 3.656166 (-0.52z)| norm 0.2764 (-0.34z)| lr 5.68e-04 | 322.76 ms | 52.3% bf16 MFU | 1622919 tok/s step 3495/19560 | loss 3.658243 (-0.46z)| norm 0.2682 (-0.61z)| lr 5.68e-04 | 323.07 ms | 52.2% bf16 MFU | 1622915 tok/s step 3496/19560 | loss 3.752013 (+2.11z)| norm 0.2605 (-0.87z)| lr 5.68e-04 | 323.21 ms | 52.2% bf16 MFU | 1622874 tok/s step 3497/19560 | loss 3.649086 (-0.71z)| norm 0.2598 (-0.88z)| lr 5.68e-04 | 322.85 ms | 52.3% bf16 MFU | 1622928 tok/s step 3498/19560 | loss 3.629924 (-1.21z)| norm 0.2487 (-1.25z)| lr 5.68e-04 | 322.97 ms | 52.3% bf16 MFU | 1622948 tok/s step 3499/19560 | loss 3.648370 (-0.70z)| norm 0.2588 (-0.89z)| lr 5.68e-04 | 323.10 ms | 52.2% bf16 MFU | 1622935 tok/s step 3500/19560 | loss 3.765308 (+2.44z)| norm 0.2523 (-1.11z)| lr 5.68e-04 | 322.48 ms | 52.3% bf16 MFU | 1623078 tok/s val loss 3.657746
HellaSwag: 2687/10042 = 0.267576 step 3501/19560 | loss 3.617394 (-1.51z)| norm 0.2761 (-0.28z)| lr 5.68e-04 | 322.41 ms | 52.3% bf16 MFU | 1623231 tok/s step 3502/19560 | loss 3.660687 (-0.35z)| norm 0.2808 (-0.13z)| lr 5.68e-04 | 323.01 ms | 52.3% bf16 MFU | 1623227 tok/s step 3503/19560 | loss 3.743374 (+1.85z)| norm 0.3100 (+0.88z)| lr 5.68e-04 | 323.01 ms | 52.2% bf16 MFU | 1623222 tok/s step 3504/19560 | loss 3.712737 (+1.02z)| norm 0.2813 (-0.13z)| lr 5.68e-04 | 322.17 ms | 52.4% bf16 MFU | 1623429 tok/s step 3505/19560 | loss 3.631363 (-1.13z)| norm 0.2797 (-0.20z)| lr 5.68e-04 | 323.43 ms | 52.2% bf16 MFU | 1623309 tok/s step 3506/19560 | loss 3.638841 (-0.93z)| norm 0.3008 (+0.54z)| lr 5.68e-04 | 323.26 ms | 52.2% bf16 MFU | 1623238 tok/s step 3507/19560 | loss 3.645107 (-0.75z)| norm 0.2956 (+0.36z)| lr 5.68e-04 | 322.88 ms | 52.3% bf16 MFU | 1623266 tok/s step 3508/19560 | loss 3.731641 (+1.50z)| norm 0.3056 (+0.70z)| lr 5.68e-04 | 322.75 ms | 52.3% bf16 MFU | 1623325 tok/s step 3509/19560 | loss 3.728935 (+1.41z)| norm 0.3303 (+1.56z)| lr 5.68e-04 | 323.45 ms | 52.2% bf16 MFU | 1623204 tok/s step 3510/19560 | loss 3.672247 (-0.06z)| norm 0.2865 (+0.02z)| lr 5.68e-04 | 323.17 ms | 52.2% bf16 MFU | 1623159 tok/s step 3511/19560 | loss 3.664184 (-0.26z)| norm 0.3180 (+1.13z)| lr 5.68e-04 | 322.70 ms | 52.3% bf16 MFU | 1623235 tok/s step 3512/19560 | loss 3.719232 (+1.15z)| norm 0.3224 (+1.28z)| lr 5.68e-04 | 322.97 ms | 52.3% bf16 MFU | 1623240 tok/s step 3513/19560 | loss 3.695384 (+0.53z)| norm 0.2719 (-0.49z)| lr 5.68e-04 | 322.48 ms | 
52.3% bf16 MFU | 1623368 tok/s step 3514/19560 | loss 3.695477 (+0.52z)| norm 0.2990 (+0.45z)| lr 5.68e-04 | 323.02 ms | 52.2% bf16 MFU | 1623354 tok/s step 3515/19560 | loss 3.709172 (+0.86z)| norm 0.3266 (+1.40z)| lr 5.68e-04 | 322.82 ms | 52.3% bf16 MFU | 1623391 tok/s step 3516/19560 | loss 3.670799 (-0.13z)| norm 0.5283 (+6.74z)| lr 5.68e-04 | 322.32 ms | 52.4% bf16 MFU | 1623552 tok/s step 3517/19560 | loss 3.624639 (-1.32z)| norm 0.3147 (+0.72z)| lr 5.68e-04 | 323.13 ms | 52.2% bf16 MFU | 1623500 tok/s step 3518/19560 | loss 3.712430 (+0.94z)| norm 0.3078 (+0.51z)| lr 5.68e-04 | 322.60 ms | 52.3% bf16 MFU | 1623586 tok/s step 3519/19560 | loss 3.594564 (-2.05z)| norm 0.2812 (-0.25z)| lr 5.68e-04 | 322.97 ms | 52.3% bf16 MFU | 1623574 tok/s step 3520/19560 | loss 3.683169 (+0.19z)| norm 0.2806 (-0.27z)| lr 5.68e-04 | 322.89 ms | 52.3% bf16 MFU | 1623582 tok/s step 3521/19560 | loss 3.641939 (-0.85z)| norm 0.3698 (+2.21z)| lr 5.68e-04 | 322.77 ms | 52.3% bf16 MFU | 1623621 tok/s step 3522/19560 | loss 3.717300 (+1.05z)| norm 0.3326 (+1.15z)| lr 5.67e-04 | 322.67 ms | 52.3% bf16 MFU | 1623682 tok/s step 3523/19560 | loss 3.701891 (+0.65z)| norm 0.3460 (+1.50z)| lr 5.67e-04 | 322.53 ms | 52.3% bf16 MFU | 1623776 tok/s step 3524/19560 | loss 3.688797 (+0.33z)| norm 0.3089 (+0.46z)| lr 5.67e-04 | 322.67 ms | 52.3% bf16 MFU | 1623829 tok/s step 3525/19560 | loss 3.649810 (-0.66z)| norm 0.3113 (+0.52z)| lr 5.67e-04 | 323.18 ms | 52.2% bf16 MFU | 1623752 tok/s step 3526/19560 | loss 3.728164 (+1.31z)| norm 0.2870 (-0.16z)| lr 5.67e-04 | 322.94 ms | 52.3% bf16 MFU | 1623739 tok/s step 3527/19560 | loss 3.761778 (+2.12z)| norm 0.2685 (-0.67z)| lr 5.67e-04 | 322.59 ms | 52.3% bf16 MFU | 1623814 tok/s step 3528/19560 | loss 3.731305 (+1.34z)| norm 0.2671 (-0.70z)| lr 5.67e-04 | 322.63 ms | 52.3% bf16 MFU | 1623875 tok/s step 3529/19560 | loss 3.693401 (+0.40z)| norm 0.2567 (-0.98z)| lr 5.67e-04 | 322.79 ms | 52.3% bf16 MFU | 1623894 tok/s step 3530/19560 | loss 3.609279 
(-1.69z)| norm 0.2658 (-0.72z)| lr 5.67e-04 | 323.18 ms | 52.2% bf16 MFU | 1623813 tok/s step 3531/19560 | loss 3.803757 (+3.02z)| norm 0.2976 (+0.17z)| lr 5.67e-04 | 323.17 ms | 52.2% bf16 MFU | 1623740 tok/s step 3532/19560 | loss 3.732564 (+1.29z)| norm 0.3191 (+0.76z)| lr 5.67e-04 | 322.89 ms | 52.3% bf16 MFU | 1623741 tok/s step 3533/19560 | loss 3.658366 (-0.49z)| norm 0.2767 (-0.42z)| lr 5.67e-04 | 322.54 ms | 52.3% bf16 MFU | 1623828 tok/s step 3534/19560 | loss 3.708018 (+0.69z)| norm 0.2912 (-0.01z)| lr 5.67e-04 | 322.73 ms | 52.3% bf16 MFU | 1623863 tok/s step 3535/19560 | loss 3.711666 (+0.79z)| norm 0.2676 (-0.67z)| lr 5.67e-04 | 322.87 ms | 52.3% bf16 MFU | 1623863 tok/s step 3536/19560 | loss 3.688311 (+0.22z)| norm 0.2460 (-1.25z)| lr 5.67e-04 | 322.96 ms | 52.3% bf16 MFU | 1623839 tok/s step 3537/19560 | loss 3.675902 (-0.08z)| norm 0.2901 (-0.03z)| lr 5.67e-04 | 323.24 ms | 52.2% bf16 MFU | 1623746 tok/s step 3538/19560 | loss 3.651445 (-0.66z)| norm 0.2730 (-0.50z)| lr 5.67e-04 | 322.94 ms | 52.3% bf16 MFU | 1623733 tok/s step 3539/19560 | loss 3.715102 (+0.87z)| norm 0.2611 (-0.82z)| lr 5.67e-04 | 322.34 ms | 52.4% bf16 MFU | 1623872 tok/s step 3540/19560 | loss 3.697473 (+0.42z)| norm 0.2866 (-0.12z)| lr 5.67e-04 | 322.68 ms | 52.3% bf16 MFU | 1623919 tok/s step 3541/19560 | loss 3.645177 (-0.86z)| norm 0.2859 (-0.14z)| lr 5.67e-04 | 323.15 ms | 52.2% bf16 MFU | 1623844 tok/s step 3542/19560 | loss 3.774791 (+2.28z)| norm 0.3062 (+0.42z)| lr 5.67e-04 | 322.55 ms | 52.3% bf16 MFU | 1623923 tok/s step 3543/19560 | loss 3.694628 (+0.34z)| norm 0.3054 (+0.39z)| lr 5.67e-04 | 322.75 ms | 52.3% bf16 MFU | 1623949 tok/s step 3544/19560 | loss 3.680361 (-0.01z)| norm 0.2902 (-0.03z)| lr 5.67e-04 | 322.91 ms | 52.3% bf16 MFU | 1623932 tok/s step 3545/19560 | loss 3.747041 (+1.59z)| norm 0.2945 (+0.09z)| lr 5.67e-04 | 322.90 ms | 52.3% bf16 MFU | 1623920 tok/s step 3546/19560 | loss 3.589914 (-2.14z)| norm 0.2887 (-0.07z)| lr 5.67e-04 | 322.87 ms | 52.3% 
bf16 MFU | 1623915 tok/s step 3547/19560 | loss 3.688187 (+0.19z)| norm 0.3356 (+1.21z)| lr 5.67e-04 | 322.47 ms | 52.3% bf16 MFU | 1624011 tok/s step 3548/19560 | loss 3.677904 (-0.06z)| norm 0.3337 (+1.13z)| lr 5.67e-04 | 323.29 ms | 52.2% bf16 MFU | 1623896 tok/s step 3549/19560 | loss 3.754322 (+1.72z)| norm 0.3261 (+0.92z)| lr 5.67e-04 | 322.70 ms | 52.3% bf16 MFU | 1623937 tok/s step 3550/19560 | loss 3.673514 (-0.16z)| norm 0.2718 (-0.57z)| lr 5.67e-04 | 323.48 ms | 52.2% bf16 MFU | 1623780 tok/s step 3551/19560 | loss 3.601372 (-1.82z)| norm 0.3187 (+0.71z)| lr 5.67e-04 | 323.14 ms | 52.2% bf16 MFU | 1623714 tok/s step 3552/19560 | loss 3.689033 (+0.22z)| norm 0.2841 (-0.23z)| lr 5.67e-04 | 322.51 ms | 52.3% bf16 MFU | 1623810 tok/s step 3553/19560 | loss 3.679593 (-0.01z)| norm 0.2709 (-0.60z)| lr 5.67e-04 | 323.08 ms | 52.2% bf16 MFU | 1623760 tok/s step 3554/19560 | loss 3.667016 (-0.30z)| norm 0.2676 (-0.69z)| lr 5.67e-04 | 322.50 ms | 52.3% bf16 MFU | 1623858 tok/s step 3555/19560 | loss 3.718745 (+0.91z)| norm 0.2711 (-0.59z)| lr 5.67e-04 | 323.04 ms | 52.2% bf16 MFU | 1623816 tok/s step 3556/19560 | loss 3.641137 (-0.91z)| norm 0.2663 (-0.72z)| lr 5.67e-04 | 322.95 ms | 52.3% bf16 MFU | 1623798 tok/s step 3557/19560 | loss 3.656965 (-0.54z)| norm 0.2828 (-0.27z)| lr 5.67e-04 | 322.69 ms | 52.3% bf16 MFU | 1623844 tok/s step 3558/19560 | loss 3.684798 (+0.12z)| norm 0.2573 (-0.98z)| lr 5.67e-04 | 323.18 ms | 52.2% bf16 MFU | 1623765 tok/s step 3559/19560 | loss 3.705243 (+0.59z)| norm 0.2510 (-1.15z)| lr 5.67e-04 | 322.79 ms | 52.3% bf16 MFU | 1623788 tok/s step 3560/19560 | loss 3.720216 (+0.94z)| norm 0.2633 (-0.81z)| lr 5.67e-04 | 322.81 ms | 52.3% bf16 MFU | 1623805 tok/s step 3561/19560 | loss 3.611530 (-1.60z)| norm 0.2625 (-0.83z)| lr 5.67e-04 | 322.98 ms | 52.3% bf16 MFU | 1623779 tok/s step 3562/19560 | loss 3.618780 (-1.41z)| norm 0.2635 (-0.79z)| lr 5.67e-04 | 322.46 ms | 52.3% bf16 MFU | 1623885 tok/s step 3563/19560 | loss 3.731630 
(+1.19z)| norm 0.2718 (-0.57z)| lr 5.67e-04 | 322.70 ms | 52.3% bf16 MFU | 1623925 tok/s step 3564/19560 | loss 3.667905 (-0.27z)| norm 0.2556 (-1.04z)| lr 5.67e-04 | 323.00 ms | 52.3% bf16 MFU | 1623887 tok/s step 3565/19560 | loss 3.658940 (-0.47z)| norm 0.2675 (-0.69z)| lr 5.67e-04 | 323.10 ms | 52.2% bf16 MFU | 1623827 tok/s step 3566/19560 | loss 3.750008 (+1.64z)| norm 0.2526 (-1.12z)| lr 5.66e-04 | 322.71 ms | 52.3% bf16 MFU | 1623866 tok/s step 3567/19560 | loss 3.665475 (-0.33z)| norm 0.2658 (-0.72z)| lr 5.66e-04 | 322.28 ms | 52.4% bf16 MFU | 1624015 tok/s step 3568/19560 | loss 3.668571 (-0.27z)| norm 0.2882 (-0.07z)| lr 5.66e-04 | 322.62 ms | 52.3% bf16 MFU | 1624069 tok/s step 3569/19560 | loss 3.562182 (-2.66z)| norm 0.2964 (+0.18z)| lr 5.66e-04 | 323.15 ms | 52.2% bf16 MFU | 1623986 tok/s step 3570/19560 | loss 3.681739 (+0.07z)| norm 0.2702 (-0.59z)| lr 5.66e-04 | 322.75 ms | 52.3% bf16 MFU | 1624008 tok/s step 3571/19560 | loss 3.621551 (-1.29z)| norm 0.2781 (-0.37z)| lr 5.66e-04 | 323.78 ms | 52.1% bf16 MFU | 1623771 tok/s step 3572/19560 | loss 3.654600 (-0.53z)| norm 0.2856 (-0.15z)| lr 5.66e-04 | 322.46 ms | 52.3% bf16 MFU | 1623877 tok/s step 3573/19560 | loss 3.691231 (+0.32z)| norm 0.2890 (-0.03z)| lr 5.66e-04 | 322.73 ms | 52.3% bf16 MFU | 1623910 tok/s step 3574/19560 | loss 3.660940 (-0.38z)| norm 0.2695 (-0.63z)| lr 5.66e-04 | 322.90 ms | 52.3% bf16 MFU | 1623899 tok/s step 3575/19560 | loss 3.648249 (-0.66z)| norm 0.2588 (-0.95z)| lr 5.66e-04 | 322.68 ms | 52.3% bf16 MFU | 1623944 tok/s step 3576/19560 | loss 3.683464 (+0.15z)| norm 0.3101 (+0.70z)| lr 5.66e-04 | 323.54 ms | 52.2% bf16 MFU | 1623769 tok/s step 3577/19560 | loss 3.753423 (+1.74z)| norm 0.3228 (+1.09z)| lr 5.66e-04 | 322.52 ms | 52.3% bf16 MFU | 1623862 tok/s step 3578/19560 | loss 3.629082 (-1.09z)| norm 0.3057 (+0.54z)| lr 5.66e-04 | 322.30 ms | 52.4% bf16 MFU | 1624004 tok/s step 3579/19560 | loss 3.606296 (-1.58z)| norm 0.2431 (-1.44z)| lr 5.66e-04 | 323.00 ms | 52.3% 
bf16 MFU | 1623964 tok/s step 3580/19560 | loss 3.621768 (-1.23z)| norm 0.2918 (+0.10z)| lr 5.66e-04 | 322.16 ms | 52.4% bf16 MFU | 1624136 tok/s step 3581/19560 | loss 3.711468 (+0.79z)| norm 0.2553 (-1.06z)| lr 5.66e-04 | 322.76 ms | 52.3% bf16 MFU | 1624148 tok/s step 3582/19560 | loss 3.697450 (+0.47z)| norm 0.2706 (-0.57z)| lr 5.66e-04 | 322.50 ms | 52.3% bf16 MFU | 1624226 tok/s step 3583/19560 | loss 3.629316 (-1.07z)| norm 0.2599 (-0.90z)| lr 5.66e-04 | 323.12 ms | 52.2% bf16 MFU | 1624144 tok/s step 3584/19560 | loss 3.724694 (+1.09z)| norm 0.2558 (-1.02z)| lr 5.66e-04 | 323.23 ms | 52.2% bf16 MFU | 1624039 tok/s step 3585/19560 | loss 3.638425 (-0.86z)| norm 0.2717 (-0.52z)| lr 5.66e-04 | 322.91 ms | 52.3% bf16 MFU | 1624019 tok/s step 3586/19560 | loss 3.699886 (+0.53z)| norm 0.2735 (-0.46z)| lr 5.66e-04 | 322.85 ms | 52.3% bf16 MFU | 1624015 tok/s step 3587/19560 | loss 3.722135 (+1.02z)| norm 0.2808 (-0.22z)| lr 5.66e-04 | 322.61 ms | 52.3% bf16 MFU | 1624072 tok/s step 3588/19560 | loss 3.693297 (+0.38z)| norm 0.2790 (-0.27z)| lr 5.66e-04 | 323.25 ms | 52.2% bf16 MFU | 1623966 tok/s step 3589/19560 | loss 3.621727 (-1.22z)| norm 0.2724 (-0.47z)| lr 5.66e-04 | 323.17 ms | 52.2% bf16 MFU | 1623883 tok/s step 3590/19560 | loss 3.662674 (-0.30z)| norm 0.3224 (+1.11z)| lr 5.66e-04 | 322.93 ms | 52.3% bf16 MFU | 1623865 tok/s step 3591/19560 | loss 3.810460 (+2.89z)| norm 0.3493 (+1.93z)| lr 5.66e-04 | 322.76 ms | 52.3% bf16 MFU | 1623892 tok/s step 3592/19560 | loss 3.661386 (-0.34z)| norm 0.2617 (-0.80z)| lr 5.66e-04 | 323.01 ms | 52.3% bf16 MFU | 1623854 tok/s step 3593/19560 | loss 3.629669 (-1.01z)| norm 0.2761 (-0.35z)| lr 5.66e-04 | 322.62 ms | 52.3% bf16 MFU | 1623917 tok/s step 3594/19560 | loss 3.643033 (-0.71z)| norm 0.2640 (-0.72z)| lr 5.66e-04 | 323.58 ms | 52.2% bf16 MFU | 1623734 tok/s step 3595/19560 | loss 3.644464 (-0.67z)| norm 0.2709 (-0.50z)| lr 5.66e-04 | 322.86 ms | 52.3% bf16 MFU | 1623741 tok/s step 3596/19560 | loss 3.652606 
(-0.49z)| norm 0.2726 (-0.45z)| lr 5.66e-04 | 322.87 ms | 52.3% bf16 MFU | 1623746 tok/s step 3597/19560 | loss 3.660758 (-0.31z)| norm 0.2625 (-0.75z)| lr 5.66e-04 | 322.83 ms | 52.3% bf16 MFU | 1623761 tok/s step 3598/19560 | loss 3.685028 (+0.21z)| norm 0.2623 (-0.75z)| lr 5.66e-04 | 323.08 ms | 52.2% bf16 MFU | 1623712 tok/s step 3599/19560 | loss 3.635247 (-0.86z)| norm 0.2708 (-0.48z)| lr 5.66e-04 | 323.44 ms | 52.2% bf16 MFU | 1623574 tok/s step 3600/19560 | loss 3.618408 (-1.23z)| norm 0.2927 (+0.19z)| lr 5.66e-04 | 323.11 ms | 52.2% bf16 MFU | 1623527 tok/s step 3601/19560 | loss 3.653492 (-0.45z)| norm 0.2898 (+0.11z)| lr 5.66e-04 | 322.73 ms | 52.3% bf16 MFU | 1623577 tok/s step 3602/19560 | loss 3.704870 (+0.66z)| norm 0.2690 (-0.54z)| lr 5.66e-04 | 322.57 ms | 52.3% bf16 MFU | 1623664 tok/s step 3603/19560 | loss 3.634503 (-0.87z)| norm 0.2749 (-0.35z)| lr 5.66e-04 | 323.40 ms | 52.2% bf16 MFU | 1623538 tok/s step 3604/19560 | loss 3.766853 (+1.96z)| norm 0.3156 (+0.91z)| lr 5.66e-04 | 323.16 ms | 52.2% bf16 MFU | 1623480 tok/s step 3605/19560 | loss 3.737541 (+1.31z)| norm 0.2970 (+0.32z)| lr 5.66e-04 | 323.08 ms | 52.2% bf16 MFU | 1623444 tok/s step 3606/19560 | loss 3.698489 (+0.47z)| norm 0.2983 (+0.36z)| lr 5.66e-04 | 322.65 ms | 52.3% bf16 MFU | 1623520 tok/s step 3607/19560 | loss 3.729347 (+1.11z)| norm 0.2999 (+0.41z)| lr 5.66e-04 | 323.00 ms | 52.3% bf16 MFU | 1623502 tok/s step 3608/19560 | loss 3.672470 (-0.12z)| norm 0.2846 (-0.08z)| lr 5.66e-04 | 324.12 ms | 52.1% bf16 MFU | 1623205 tok/s step 3609/19560 | loss 3.619717 (-1.25z)| norm 0.2813 (-0.19z)| lr 5.65e-04 | 322.91 ms | 52.3% bf16 MFU | 1623226 tok/s step 3610/19560 | loss 3.702425 (+0.52z)| norm 0.2633 (-0.74z)| lr 5.65e-04 | 322.82 ms | 52.3% bf16 MFU | 1623270 tok/s step 3611/19560 | loss 3.725451 (+1.01z)| norm 0.2826 (-0.14z)| lr 5.65e-04 | 322.81 ms | 52.3% bf16 MFU | 1623313 tok/s step 3612/19560 | loss 3.667240 (-0.25z)| norm 0.2722 (-0.46z)| lr 5.65e-04 | 323.10 ms | 52.2% 
bf16 MFU | 1623282 tok/s step 3613/19560 | loss 3.660913 (-0.39z)| norm 0.2591 (-0.86z)| lr 5.65e-04 | 323.14 ms | 52.2% bf16 MFU | 1623241 tok/s step 3614/19560 | loss 3.627745 (-1.10z)| norm 0.2685 (-0.56z)| lr 5.65e-04 | 322.83 ms | 52.3% bf16 MFU | 1623281 tok/s step 3615/19560 | loss 3.610760 (-1.45z)| norm 0.2519 (-1.07z)| lr 5.65e-04 | 322.73 ms | 52.3% bf16 MFU | 1623345 tok/s step 3616/19560 | loss 3.705536 (+0.57z)| norm 0.2474 (-1.19z)| lr 5.65e-04 | 323.88 ms | 52.1% bf16 MFU | 1623116 tok/s step 3617/19560 | loss 3.658176 (-0.43z)| norm 0.2495 (-1.11z)| lr 5.65e-04 | 322.41 ms | 52.3% bf16 MFU | 1623267 tok/s step 3618/19560 | loss 3.680024 (+0.03z)| norm 0.2673 (-0.56z)| lr 5.65e-04 | 322.43 ms | 52.3% bf16 MFU | 1623407 tok/s step 3619/19560 | loss 3.645299 (-0.72z)| norm 0.2367 (-1.47z)| lr 5.65e-04 | 323.13 ms | 52.2% bf16 MFU | 1623364 tok/s step 3620/19560 | loss 3.657030 (-0.46z)| norm 0.2327 (-1.57z)| lr 5.65e-04 | 322.91 ms | 52.3% bf16 MFU | 1623378 tok/s step 3621/19560 | loss 3.660207 (-0.39z)| norm 0.2503 (-1.02z)| lr 5.65e-04 | 322.91 ms | 52.3% bf16 MFU | 1623391 tok/s step 3622/19560 | loss 3.628361 (-1.08z)| norm 0.2438 (-1.21z)| lr 5.65e-04 | 322.70 ms | 52.3% bf16 MFU | 1623456 tok/s step 3623/19560 | loss 3.705199 (+0.58z)| norm 0.2577 (-0.78z)| lr 5.65e-04 | 322.95 ms | 52.3% bf16 MFU | 1623456 tok/s step 3624/19560 | loss 3.711195 (+0.73z)| norm 0.2543 (-0.88z)| lr 5.65e-04 | 323.11 ms | 52.2% bf16 MFU | 1623415 tok/s step 3625/19560 | loss 3.756629 (+1.69z)| norm 0.2792 (-0.14z)| lr 5.65e-04 | 322.88 ms | 52.3% bf16 MFU | 1623433 tok/s step 3626/19560 | loss 3.667435 (-0.26z)| norm 0.3051 (+0.62z)| lr 5.65e-04 | 322.94 ms | 52.3% bf16 MFU | 1623434 tok/s step 3627/19560 | loss 3.648014 (-0.68z)| norm 0.3087 (+0.72z)| lr 5.65e-04 | 323.22 ms | 52.2% bf16 MFU | 1623366 tok/s step 3628/19560 | loss 3.725416 (+1.03z)| norm 0.2761 (-0.27z)| lr 5.65e-04 | 322.52 ms | 52.3% bf16 MFU | 1623477 tok/s step 3629/19560 | loss 3.671014 
(-0.18z)| norm 0.2797 (-0.16z)| lr 5.65e-04 | 323.72 ms | 52.1% bf16 MFU | 1623282 tok/s step 3630/19560 | loss 3.685443 (+0.13z)| norm 0.2769 (-0.25z)| lr 5.65e-04 | 322.73 ms | 52.3% bf16 MFU | 1623345 tok/s step 3631/19560 | loss 3.709129 (+0.67z)| norm 0.2915 (+0.20z)| lr 5.65e-04 | 323.23 ms | 52.2% bf16 MFU | 1623278 tok/s step 3632/19560 | loss 3.669146 (-0.22z)| norm 0.2772 (-0.23z)| lr 5.65e-04 | 323.25 ms | 52.2% bf16 MFU | 1623211 tok/s step 3633/19560 | loss 3.642221 (-0.82z)| norm 0.2972 (+0.37z)| lr 5.65e-04 | 323.27 ms | 52.2% bf16 MFU | 1623141 tok/s step 3634/19560 | loss 3.642330 (-0.82z)| norm 0.3009 (+0.48z)| lr 5.65e-04 | 323.24 ms | 52.2% bf16 MFU | 1623081 tok/s step 3635/19560 | loss 3.700344 (+0.47z)| norm 0.2689 (-0.48z)| lr 5.65e-04 | 323.56 ms | 52.2% bf16 MFU | 1622945 tok/s step 3636/19560 | loss 3.722425 (+0.97z)| norm 0.2685 (-0.48z)| lr 5.65e-04 | 322.90 ms | 52.3% bf16 MFU | 1622982 tok/s step 3637/19560 | loss 3.647052 (-0.72z)| norm 0.2725 (-0.35z)| lr 5.65e-04 | 323.41 ms | 52.2% bf16 MFU | 1622888 tok/s step 3638/19560 | loss 3.640153 (-0.86z)| norm 0.2706 (-0.41z)| lr 5.65e-04 | 323.70 ms | 52.1% bf16 MFU | 1622727 tok/s step 3639/19560 | loss 3.584045 (-2.08z)| norm 0.2570 (-0.80z)| lr 5.65e-04 | 323.12 ms | 52.2% bf16 MFU | 1622718 tok/s step 3640/19560 | loss 3.699117 (+0.48z)| norm 0.2950 (+0.36z)| lr 5.65e-04 | 323.80 ms | 52.1% bf16 MFU | 1622541 tok/s step 3641/19560 | loss 3.697595 (+0.44z)| norm 0.3348 (+1.55z)| lr 5.65e-04 | 323.77 ms | 52.1% bf16 MFU | 1622379 tok/s step 3642/19560 | loss 3.658769 (-0.41z)| norm 0.3344 (+1.52z)| lr 5.65e-04 | 323.02 ms | 52.2% bf16 MFU | 1622413 tok/s step 3643/19560 | loss 3.858369 (+3.77z)| norm 0.3591 (+2.23z)| lr 5.65e-04 | 323.22 ms | 52.2% bf16 MFU | 1622395 tok/s step 3644/19560 | loss 3.736358 (+1.19z)| norm 0.3588 (+2.85z)| lr 5.65e-04 | 323.36 ms | 52.2% bf16 MFU | 1622345 tok/s step 3645/19560 | loss 3.682651 (+0.06z)| norm 0.3525 (+2.55z)| lr 5.65e-04 | 323.27 ms | 52.2% 
bf16 MFU | 1622318 tok/s step 3646/19560 | loss 3.607031 (-1.49z)| norm 0.3214 (+1.40z)| lr 5.65e-04 | 323.11 ms | 52.2% bf16 MFU | 1622333 tok/s step 3647/19560 | loss 3.679637 (+0.01z)| norm 0.3023 (+0.69z)| lr 5.65e-04 | 322.58 ms | 52.3% bf16 MFU | 1622481 tok/s step 3648/19560 | loss 3.661587 (-0.37z)| norm 0.3285 (+1.62z)| lr 5.65e-04 | 322.49 ms | 52.3% bf16 MFU | 1622644 tok/s step 3649/19560 | loss 3.655878 (-0.49z)| norm 0.3011 (+0.67z)| lr 5.65e-04 | 322.07 ms | 52.4% bf16 MFU | 1622904 tok/s step 3650/19560 | loss 3.655916 (-0.48z)| norm 0.3053 (+0.84z)| lr 5.65e-04 | 323.04 ms | 52.2% bf16 MFU | 1622909 tok/s step 3651/19560 | loss 3.636170 (-0.89z)| norm 0.2611 (-0.83z)| lr 5.65e-04 | 323.15 ms | 52.2% bf16 MFU | 1622885 tok/s step 3652/19560 | loss 3.674745 (-0.07z)| norm 0.2988 (+0.64z)| lr 5.64e-04 | 323.03 ms | 52.2% bf16 MFU | 1622894 tok/s step 3653/19560 | loss 3.635424 (-0.90z)| norm 0.2820 (-0.00z)| lr 5.64e-04 | 323.06 ms | 52.2% bf16 MFU | 1622892 tok/s step 3654/19560 | loss 3.756694 (+1.64z)| norm 0.3042 (+0.86z)| lr 5.64e-04 | 322.81 ms | 52.3% bf16 MFU | 1622956 tok/s step 3655/19560 | loss 3.687736 (+0.21z)| norm 0.2798 (-0.10z)| lr 5.64e-04 | 322.69 ms | 52.3% bf16 MFU | 1623046 tok/s step 3656/19560 | loss 3.658149 (-0.40z)| norm 0.2647 (-0.69z)| lr 5.64e-04 | 323.31 ms | 52.2% bf16 MFU | 1622976 tok/s step 3657/19560 | loss 3.712457 (+0.74z)| norm 0.2538 (-1.11z)| lr 5.64e-04 | 322.91 ms | 52.3% bf16 MFU | 1623008 tok/s step 3658/19560 | loss 3.665302 (-0.27z)| norm 0.2448 (-1.45z)| lr 5.64e-04 | 322.43 ms | 52.3% bf16 MFU | 1623159 tok/s step 3659/19560 | loss 3.660909 (-0.35z)| norm 0.2713 (-0.41z)| lr 5.64e-04 | 323.29 ms | 52.2% bf16 MFU | 1623087 tok/s step 3660/19560 | loss 3.702199 (+0.57z)| norm 0.2742 (-0.29z)| lr 5.64e-04 | 323.00 ms | 52.3% bf16 MFU | 1623091 tok/s step 3661/19560 | loss 3.720089 (+0.95z)| norm 0.2907 (+0.35z)| lr 5.64e-04 | 323.72 ms | 52.1% bf16 MFU | 1622914 tok/s step 3662/19560 | loss 3.722805 
(+1.01z)| norm 0.2606 (-0.81z)| lr 5.64e-04 | 323.27 ms | 52.2% bf16 MFU | 1622859 tok/s step 3663/19560 | loss 3.673752 (-0.07z)| norm 0.2587 (-0.88z)| lr 5.64e-04 | 323.10 ms | 52.2% bf16 MFU | 1622851 tok/s step 3664/19560 | loss 3.719105 (+0.93z)| norm 0.2987 (+0.66z)| lr 5.64e-04 | 322.84 ms | 52.3% bf16 MFU | 1622908 tok/s step 3665/19560 | loss 3.708408 (+0.68z)| norm 0.2750 (-0.26z)| lr 5.64e-04 | 323.37 ms | 52.2% bf16 MFU | 1622828 tok/s step 3666/19560 | loss 3.633447 (-0.96z)| norm 0.2533 (-1.10z)| lr 5.64e-04 | 322.88 ms | 52.3% bf16 MFU | 1622876 tok/s step 3667/19560 | loss 3.630573 (-1.00z)| norm 0.2570 (-0.95z)| lr 5.64e-04 | 322.62 ms | 52.3% bf16 MFU | 1622986 tok/s step 3668/19560 | loss 3.631576 (-0.97z)| norm 0.2459 (-1.36z)| lr 5.64e-04 | 323.13 ms | 52.2% bf16 MFU | 1622964 tok/s step 3669/19560 | loss 3.593766 (-1.76z)| norm 0.2572 (-0.91z)| lr 5.64e-04 | 323.01 ms | 52.2% bf16 MFU | 1622972 tok/s step 3670/19560 | loss 3.736791 (+1.34z)| norm 0.2693 (-0.44z)| lr 5.64e-04 | 323.14 ms | 52.2% bf16 MFU | 1622948 tok/s step 3671/19560 | loss 3.725659 (+1.09z)| norm 0.2908 (+0.40z)| lr 5.64e-04 | 323.30 ms | 52.2% bf16 MFU | 1622886 tok/s step 3672/19560 | loss 3.720199 (+0.96z)| norm 0.2777 (-0.11z)| lr 5.64e-04 | 322.79 ms | 52.3% bf16 MFU | 1622952 tok/s step 3673/19560 | loss 3.636162 (-0.84z)| norm 0.3067 (+1.01z)| lr 5.64e-04 | 322.99 ms | 52.3% bf16 MFU | 1622966 tok/s step 3674/19560 | loss 3.589453 (-1.86z)| norm 0.3102 (+1.13z)| lr 5.64e-04 | 322.74 ms | 52.3% bf16 MFU | 1623041 tok/s step 3675/19560 | loss 3.669109 (-0.12z)| norm 0.2780 (-0.09z)| lr 5.64e-04 | 323.55 ms | 52.2% bf16 MFU | 1622910 tok/s step 3676/19560 | loss 3.661311 (-0.29z)| norm 0.3017 (+0.86z)| lr 5.64e-04 | 322.91 ms | 52.3% bf16 MFU | 1622946 tok/s step 3677/19560 | loss 3.643021 (-0.68z)| norm 0.2964 (+0.67z)| lr 5.64e-04 | 322.98 ms | 52.3% bf16 MFU | 1622963 tok/s step 3678/19560 | loss 3.451085 (-4.47z)| norm 0.5565 (+7.88z)| lr 5.64e-04 | 322.69 ms | 52.3% 
bf16 MFU | 1623053 tok/s step 3679/19560 | loss 3.614644 (-1.16z)| norm 0.2728 (-0.25z)| lr 5.64e-04 | 322.77 ms | 52.3% bf16 MFU | 1623116 tok/s step 3680/19560 | loss 3.650567 (-0.43z)| norm 0.2712 (-0.30z)| lr 5.64e-04 | 322.47 ms | 52.3% bf16 MFU | 1623253 tok/s step 3681/19560 | loss 3.639132 (-0.65z)| norm 0.3228 (+1.17z)| lr 5.64e-04 | 323.13 ms | 52.2% bf16 MFU | 1623217 tok/s step 3682/19560 | loss 3.636233 (-0.71z)| norm 0.4445 (+4.28z)| lr 5.64e-04 | 322.82 ms | 52.3% bf16 MFU | 1623261 tok/s step 3683/19560 | loss 3.665777 (-0.10z)| norm 0.2599 (-0.62z)| lr 5.64e-04 | 322.63 ms | 52.3% bf16 MFU | 1623351 tok/s step 3684/19560 | loss 3.706988 (+0.72z)| norm 0.3273 (+1.15z)| lr 5.64e-04 | 323.07 ms | 52.2% bf16 MFU | 1623326 tok/s step 3685/19560 | loss 3.687790 (+0.33z)| norm 0.2955 (+0.31z)| lr 5.64e-04 | 322.87 ms | 52.3% bf16 MFU | 1623351 tok/s step 3686/19560 | loss 3.663200 (-0.17z)| norm 0.2841 (+0.00z)| lr 5.64e-04 | 322.96 ms | 52.3% bf16 MFU | 1623352 tok/s step 3687/19560 | loss 3.711071 (+0.80z)| norm 0.2551 (-0.77z)| lr 5.64e-04 | 323.07 ms | 52.2% bf16 MFU | 1623325 tok/s step 3688/19560 | loss 3.676479 (+0.11z)| norm 0.2798 (-0.12z)| lr 5.64e-04 | 323.35 ms | 52.2% bf16 MFU | 1623230 tok/s step 3689/19560 | loss 3.672940 (+0.03z)| norm 0.2629 (-0.56z)| lr 5.64e-04 | 322.80 ms | 52.3% bf16 MFU | 1623277 tok/s step 3690/19560 | loss 3.635348 (-0.75z)| norm 0.2676 (-0.44z)| lr 5.64e-04 | 322.64 ms | 52.3% bf16 MFU | 1623362 tok/s step 3691/19560 | loss 3.703012 (+0.65z)| norm 0.2532 (-0.82z)| lr 5.64e-04 | 322.86 ms | 52.3% bf16 MFU | 1623388 tok/s step 3692/19560 | loss 3.739142 (+1.37z)| norm 0.2967 (+0.33z)| lr 5.64e-04 | 322.59 ms | 52.3% bf16 MFU | 1623480 tok/s step 3693/19560 | loss 3.668287 (-0.08z)| norm 0.2874 (+0.08z)| lr 5.64e-04 | 323.24 ms | 52.2% bf16 MFU | 1623406 tok/s step 3694/19560 | loss 3.626527 (-0.92z)| norm 0.2650 (-0.52z)| lr 5.63e-04 | 322.71 ms | 52.3% bf16 MFU | 1623468 tok/s step 3695/19560 | loss 3.685738 
(+0.30z)| norm 0.2677 (-0.45z)| lr 5.63e-04 | 322.71 ms | 52.3% bf16 MFU | 1623527 tok/s step 3696/19560 | loss 3.697922 (+0.54z)| norm 0.2818 (-0.08z)| lr 5.63e-04 | 322.84 ms | 52.3% bf16 MFU | 1623549 tok/s step 3697/19560 | loss 3.652001 (-0.42z)| norm 0.3198 (+0.93z)| lr 5.63e-04 | 323.43 ms | 52.2% bf16 MFU | 1623422 tok/s step 3698/19560 | loss 3.676887 (+0.10z)| norm 0.3076 (+0.60z)| lr 5.63e-04 | 323.22 ms | 52.2% bf16 MFU | 1623356 tok/s step 3699/19560 | loss 3.691283 (+0.39z)| norm 0.2769 (-0.22z)| lr 5.63e-04 | 322.77 ms | 52.3% bf16 MFU | 1623405 tok/s step 3700/19560 | loss 3.624268 (-1.01z)| norm 0.2811 (-0.10z)| lr 5.63e-04 | 322.58 ms | 52.3% bf16 MFU | 1623500 tok/s step 3701/19560 | loss 3.714941 (+0.88z)| norm 0.2771 (-0.21z)| lr 5.63e-04 | 323.16 ms | 52.2% bf16 MFU | 1623444 tok/s step 3702/19560 | loss 3.651188 (-0.45z)| norm 0.2808 (-0.11z)| lr 5.63e-04 | 322.70 ms | 52.3% bf16 MFU | 1623506 tok/s step 3703/19560 | loss 3.704969 (+0.67z)| norm 0.2839 (-0.03z)| lr 5.63e-04 | 322.72 ms | 52.3% bf16 MFU | 1623560 tok/s step 3704/19560 | loss 3.680736 (+0.16z)| norm 0.3323 (+1.24z)| lr 5.63e-04 | 322.16 ms | 52.4% bf16 MFU | 1623752 tok/s step 3705/19560 | loss 3.600591 (-1.49z)| norm 0.3206 (+0.93z)| lr 5.63e-04 | 322.78 ms | 52.3% bf16 MFU | 1623780 tok/s step 3706/19560 | loss 3.631083 (-0.85z)| norm 0.2733 (-0.31z)| lr 5.63e-04 | 322.99 ms | 52.3% bf16 MFU | 1623753 tok/s step 3707/19560 | loss 3.667774 (-0.10z)| norm 0.2603 (-0.66z)| lr 5.63e-04 | 322.35 ms | 52.4% bf16 MFU | 1623889 tok/s step 3708/19560 | loss 3.710727 (+0.80z)| norm 0.2590 (-0.69z)| lr 5.63e-04 | 322.87 ms | 52.3% bf16 MFU | 1623886 tok/s step 3709/19560 | loss 3.698312 (+0.54z)| norm 0.2618 (-0.62z)| lr 5.63e-04 | 322.93 ms | 52.3% bf16 MFU | 1623869 tok/s step 3710/19560 | loss 3.670493 (-0.05z)| norm 0.2636 (-0.57z)| lr 5.63e-04 | 322.67 ms | 52.3% bf16 MFU | 1623917 tok/s step 3711/19560 | loss 3.628466 (-0.94z)| norm 0.3260 (+1.07z)| lr 5.63e-04 | 322.55 ms | 52.3% 
bf16 MFU | 1623993 tok/s step 3712/19560 | loss 3.627172 (-0.95z)| norm 0.3341 (+1.26z)| lr 5.63e-04 | 322.96 ms | 52.3% bf16 MFU | 1623962 tok/s step 3713/19560 | loss 3.662467 (-0.21z)| norm 0.2973 (+0.29z)| lr 5.63e-04 | 322.89 ms | 52.3% bf16 MFU | 1623951 tok/s step 3714/19560 | loss 3.702573 (+0.65z)| norm 0.2977 (+0.29z)| lr 5.63e-04 | 322.56 ms | 52.3% bf16 MFU | 1624024 tok/s step 3715/19560 | loss 3.722736 (+1.07z)| norm 0.2810 (-0.15z)| lr 5.63e-04 | 323.02 ms | 52.2% bf16 MFU | 1623978 tok/s step 3716/19560 | loss 3.711448 (+0.83z)| norm 0.3093 (+0.59z)| lr 5.63e-04 | 323.25 ms | 52.2% bf16 MFU | 1623876 tok/s step 3717/19560 | loss 3.726911 (+1.14z)| norm 0.2797 (-0.19z)| lr 5.63e-04 | 322.92 ms | 52.3% bf16 MFU | 1623861 tok/s step 3718/19560 | loss 3.606706 (-1.39z)| norm 0.2872 (+0.02z)| lr 5.63e-04 | 322.83 ms | 52.3% bf16 MFU | 1623870 tok/s step 3719/19560 | loss 3.641021 (-0.66z)| norm 0.2893 (+0.09z)| lr 5.63e-04 | 322.78 ms | 52.3% bf16 MFU | 1623890 tok/s step 3720/19560 | loss 3.669548 (-0.04z)| norm 0.2699 (-0.43z)| lr 5.63e-04 | 322.82 ms | 52.3% bf16 MFU | 1623900 tok/s step 3721/19560 | loss 3.737176 (+1.40z)| norm 0.2755 (-0.28z)| lr 5.63e-04 | 322.97 ms | 52.3% bf16 MFU | 1623872 tok/s step 3722/19560 | loss 3.641972 (-0.66z)| norm 0.2489 (-0.99z)| lr 5.63e-04 | 322.91 ms | 52.3% bf16 MFU | 1623860 tok/s step 3723/19560 | loss 3.673823 (+0.03z)| norm 0.2945 (+0.22z)| lr 5.63e-04 | 322.75 ms | 52.3% bf16 MFU | 1623888 tok/s step 3724/19560 | loss 3.637520 (-0.75z)| norm 0.2823 (-0.11z)| lr 5.63e-04 | 322.64 ms | 52.3% bf16 MFU | 1623945 tok/s step 3725/19560 | loss 3.625556 (-1.00z)| norm 0.2592 (-0.72z)| lr 5.63e-04 | 322.98 ms | 52.3% bf16 MFU | 1623911 tok/s step 3726/19560 | loss 3.659479 (-0.27z)| norm 0.2639 (-0.60z)| lr 5.63e-04 | 322.43 ms | 52.3% bf16 MFU | 1624017 tok/s step 3727/19560 | loss 3.660875 (-0.24z)| norm 0.2686 (-0.47z)| lr 5.63e-04 | 322.57 ms | 52.3% bf16 MFU | 1624084 tok/s step 3728/19560 | loss 3.683945 
(+0.25z)| norm 0.2719 (-0.38z)| lr 5.63e-04 | 323.25 ms | 52.2% bf16 MFU | 1623975 tok/s step 3729/19560 | loss 3.620582 (-1.12z)| norm 0.2481 (-1.00z)| lr 5.63e-04 | 322.83 ms | 52.3% bf16 MFU | 1623977 tok/s step 3730/19560 | loss 3.645226 (-0.58z)| norm 0.2767 (-0.24z)| lr 5.63e-04 | 322.90 ms | 52.3% bf16 MFU | 1623963 tok/s step 3731/19560 | loss 3.627522 (-0.96z)| norm 0.2651 (-0.55z)| lr 5.63e-04 | 323.36 ms | 52.2% bf16 MFU | 1623834 tok/s step 3732/19560 | loss 3.643894 (-0.59z)| norm 0.2974 (+0.31z)| lr 5.63e-04 | 322.79 ms | 52.3% bf16 MFU | 1623855 tok/s step 3733/19560 | loss 3.649383 (-0.46z)| norm 0.3014 (+0.42z)| lr 5.63e-04 | 322.45 ms | 52.3% bf16 MFU | 1623959 tok/s step 3734/19560 | loss 3.671402 (+0.03z)| norm 0.2809 (-0.12z)| lr 5.63e-04 | 322.90 ms | 52.3% bf16 MFU | 1623944 tok/s step 3735/19560 | loss 3.665178 (-0.10z)| norm 0.2760 (-0.25z)| lr 5.62e-04 | 322.86 ms | 52.3% bf16 MFU | 1623942 tok/s step 3736/19560 | loss 3.682496 (+0.29z)| norm 0.2695 (-0.42z)| lr 5.62e-04 | 322.58 ms | 52.3% bf16 MFU | 1624011 tok/s step 3737/19560 | loss 3.684072 (+0.31z)| norm 0.2831 (-0.06z)| lr 5.62e-04 | 322.75 ms | 52.3% bf16 MFU | 1624033 tok/s step 3738/19560 | loss 3.692793 (+0.51z)| norm 0.2703 (-0.40z)| lr 5.62e-04 | 322.71 ms | 52.3% bf16 MFU | 1624064 tok/s step 3739/19560 | loss 3.653551 (-0.36z)| norm 0.2752 (-0.27z)| lr 5.62e-04 | 322.81 ms | 52.3% bf16 MFU | 1624068 tok/s step 3740/19560 | loss 3.617956 (-1.15z)| norm 0.2604 (-0.66z)| lr 5.62e-04 | 323.04 ms | 52.2% bf16 MFU | 1624014 tok/s step 3741/19560 | loss 3.623093 (-1.02z)| norm 0.2681 (-0.46z)| lr 5.62e-04 | 323.30 ms | 52.2% bf16 MFU | 1623897 tok/s step 3742/19560 | loss 3.667682 (-0.03z)| norm 0.2630 (-0.59z)| lr 5.62e-04 | 322.28 ms | 52.4% bf16 MFU | 1624043 tok/s step 3743/19560 | loss 3.655007 (-0.32z)| norm 0.2619 (-0.62z)| lr 5.62e-04 | 322.74 ms | 52.3% bf16 MFU | 1624066 tok/s step 3744/19560 | loss 3.575800 (-2.06z)| norm 0.2388 (-1.23z)| lr 5.62e-04 | 322.99 ms | 52.3% 
bf16 MFU | 1624025 tok/s step 3745/19560 | loss 3.717788 (+1.08z)| norm 0.2913 (+0.15z)| lr 5.62e-04 | 322.55 ms | 52.3% bf16 MFU | 1624096 tok/s step 3746/19560 | loss 3.734930 (+1.44z)| norm 0.2945 (+0.23z)| lr 5.62e-04 | 322.76 ms | 52.3% bf16 MFU | 1624109 tok/s step 3747/19560 | loss 3.679275 (+0.21z)| norm 0.2899 (+0.10z)| lr 5.62e-04 | 322.88 ms | 52.3% bf16 MFU | 1624092 tok/s step 3748/19560 | loss 3.677579 (+0.17z)| norm 0.2642 (-0.60z)| lr 5.62e-04 | 322.88 ms | 52.3% bf16 MFU | 1624078 tok/s step 3749/19560 | loss 3.640811 (-0.63z)| norm 0.2678 (-0.51z)| lr 5.62e-04 | 322.69 ms | 52.3% bf16 MFU | 1624111 tok/s step 3750/19560 | loss 3.647289 (-0.49z)| norm 0.2803 (-0.18z)| lr 5.62e-04 | 323.52 ms | 52.2% bf16 MFU | 1623935 tok/s val loss 3.639310 HellaSwag: 2717/10042 = 0.270564 step 3751/19560 | loss 3.637692 (-0.69z)| norm 0.3625 (+2.01z)| lr 5.62e-04 | 322.59 ms | 52.3% bf16 MFU | 1624001 tok/s step 3752/19560 | loss 3.617714 (-1.11z)| norm 0.3324 (+1.18z)| lr 5.62e-04 | 322.83 ms | 52.3% bf16 MFU | 1624002 tok/s step 3753/19560 | loss 3.613930 (-1.18z)| norm 0.2959 (+0.20z)| lr 5.62e-04 | 323.10 ms | 52.2% bf16 MFU | 1623935 tok/s step 3754/19560 | loss 3.663908 (-0.08z)| norm 0.3005 (+0.33z)| lr 5.62e-04 | 323.48 ms | 52.2% bf16 MFU | 1623778 tok/s step 3755/19560 | loss 3.649072 (-0.40z)| norm 0.3041 (+0.42z)| lr 5.62e-04 | 322.73 ms | 52.3% bf16 MFU | 1623815 tok/s step 3756/19560 | loss 3.679752 (+0.29z)| norm 0.2448 (-1.15z)| lr 5.62e-04 | 322.29 ms | 52.4% bf16 MFU | 1623962 tok/s step 3757/19560 | loss 3.624314 (-0.94z)| norm 0.2897 (+0.04z)| lr 5.62e-04 | 323.12 ms | 52.2% bf16 MFU | 1623892 tok/s step 3758/19560 | loss 3.632936 (-0.74z)| norm 0.2900 (+0.05z)| lr 5.62e-04 | 322.92 ms | 52.3% bf16 MFU | 1623876 tok/s step 
3759/19560 | loss 3.666486 (+0.01z)| norm 0.3511 (+1.65z)| lr 5.62e-04 | 322.66 ms | 52.3% bf16 MFU | 1623925 tok/s step 3760/19560 | loss 3.671982 (+0.14z)| norm 0.3178 (+0.76z)| lr 5.62e-04 | 323.04 ms | 52.2% bf16 MFU | 1623878 tok/s step 3761/19560 | loss 3.635179 (-0.68z)| norm 0.2849 (-0.11z)| lr 5.62e-04 | 322.62 ms | 52.3% bf16 MFU | 1623939 tok/s step 3762/19560 | loss 3.585231 (-1.76z)| norm 0.3368 (+1.25z)| lr 5.62e-04 | 322.88 ms | 52.3% bf16 MFU | 1623932 tok/s step 3763/19560 | loss 3.609230 (-1.22z)| norm 0.3236 (+0.89z)| lr 5.62e-04 | 323.00 ms | 52.3% bf16 MFU | 1623894 tok/s step 3764/19560 | loss 3.674091 (+0.22z)| norm 0.3261 (+0.94z)| lr 5.62e-04 | 323.08 ms | 52.2% bf16 MFU | 1623839 tok/s step 3765/19560 | loss 3.532030 (-2.81z)| norm 0.3303 (+1.04z)| lr 5.62e-04 | 322.46 ms | 52.3% bf16 MFU | 1623943 tok/s step 3766/19560 | loss 3.652826 (-0.23z)| norm 0.3531 (+1.60z)| lr 5.62e-04 | 323.06 ms | 52.2% bf16 MFU | 1623890 tok/s step 3767/19560 | loss 3.625540 (-0.83z)| norm 0.3455 (+1.38z)| lr 5.62e-04 | 322.75 ms | 52.3% bf16 MFU | 1623918 tok/s step 3768/19560 | loss 3.696029 (+0.70z)| norm 0.2693 (-0.57z)| lr 5.62e-04 | 322.60 ms | 52.3% bf16 MFU | 1623982 tok/s step 3769/19560 | loss 3.636214 (-0.59z)| norm 0.2831 (-0.21z)| lr 5.62e-04 | 323.52 ms | 52.2% bf16 MFU | 1623812 tok/s step 3770/19560 | loss 3.661832 (-0.03z)| norm 0.2673 (-0.61z)| lr 5.62e-04 | 323.27 ms | 52.2% bf16 MFU | 1623712 tok/s step 3771/19560 | loss 3.620548 (-0.95z)| norm 0.2802 (-0.26z)| lr 5.62e-04 | 322.56 ms | 52.3% bf16 MFU | 1623796 tok/s step 3772/19560 | loss 3.655952 (-0.12z)| norm 0.2681 (-0.56z)| lr 5.62e-04 | 322.55 ms | 52.3% bf16 MFU | 1623879 tok/s step 3773/19560 | loss 3.707346 (+1.08z)| norm 0.2394 (-1.31z)| lr 5.62e-04 | 323.61 ms | 52.2% bf16 MFU | 1623692 tok/s step 3774/19560 | loss 3.661674 (+0.00z)| norm 0.2655 (-0.60z)| lr 5.62e-04 | 322.80 ms | 52.3% bf16 MFU | 1623717 tok/s step 3775/19560 | loss 3.650508 (-0.25z)| norm 0.2508 (-0.98z)| lr 
5.62e-04 | 322.97 ms | 52.3% bf16 MFU | 1623698 tok/s step 3776/19560 | loss 3.684157 (+0.54z)| norm 0.2694 (-0.47z)| lr 5.61e-04 | 322.62 ms | 52.3% bf16 MFU | 1623768 tok/s step 3777/19560 | loss 3.751866 (+2.08z)| norm 0.2754 (-0.31z)| lr 5.61e-04 | 322.95 ms | 52.3% bf16 MFU | 1623751 tok/s step 3778/19560 | loss 3.675031 (+0.29z)| norm 0.2318 (-1.44z)| lr 5.61e-04 | 322.79 ms | 52.3% bf16 MFU | 1623774 tok/s step 3779/19560 | loss 3.670704 (+0.19z)| norm 0.3181 (+0.83z)| lr 5.61e-04 | 322.78 ms | 52.3% bf16 MFU | 1623799 tok/s step 3780/19560 | loss 3.624263 (-0.88z)| norm 0.2894 (+0.07z)| lr 5.61e-04 | 323.27 ms | 52.2% bf16 MFU | 1623701 tok/s step 3781/19560 | loss 3.678533 (+0.37z)| norm 0.3298 (+1.12z)| lr 5.61e-04 | 323.37 ms | 52.2% bf16 MFU | 1623581 tok/s step 3782/19560 | loss 3.673080 (+0.26z)| norm 0.3467 (+1.54z)| lr 5.61e-04 | 322.95 ms | 52.3% bf16 MFU | 1623574 tok/s step 3783/19560 | loss 3.667310 (+0.13z)| norm 0.2838 (-0.10z)| lr 5.61e-04 | 323.43 ms | 52.2% bf16 MFU | 1623446 tok/s step 3784/19560 | loss 3.613449 (-1.13z)| norm 0.2582 (-0.76z)| lr 5.61e-04 | 323.13 ms | 52.2% bf16 MFU | 1623401 tok/s step 3785/19560 | loss 3.683435 (+0.53z)| norm 0.2821 (-0.15z)| lr 5.61e-04 | 323.38 ms | 52.2% bf16 MFU | 1623294 tok/s step 3786/19560 | loss 3.664803 (+0.09z)| norm 0.2824 (-0.15z)| lr 5.61e-04 | 323.43 ms | 52.2% bf16 MFU | 1623180 tok/s step 3787/19560 | loss 3.633282 (-0.65z)| norm 0.2984 (+0.27z)| lr 5.61e-04 | 322.70 ms | 52.3% bf16 MFU | 1623256 tok/s step 3788/19560 | loss 3.685585 (+0.59z)| norm 0.2690 (-0.50z)| lr 5.61e-04 | 323.41 ms | 52.2% bf16 MFU | 1623150 tok/s step 3789/19560 | loss 3.658369 (-0.05z)| norm 0.2611 (-0.70z)| lr 5.61e-04 | 322.86 ms | 52.3% bf16 MFU | 1623186 tok/s step 3790/19560 | loss 3.635900 (-0.57z)| norm 0.2752 (-0.34z)| lr 5.61e-04 | 323.18 ms | 52.2% bf16 MFU | 1623141 tok/s step 3791/19560 | loss 3.645669 (-0.33z)| norm 0.2590 (-0.76z)| lr 5.61e-04 | 323.04 ms | 52.2% bf16 MFU | 1623134 tok/s step 
3792/19560 | loss 3.658548 (-0.01z)| norm 0.2636 (-0.63z)| lr 5.61e-04 | 322.90 ms | 52.3% bf16 MFU | 1623161 tok/s step 3793/19560 | loss 3.674253 (+0.38z)| norm 0.2743 (-0.35z)| lr 5.61e-04 | 322.84 ms | 52.3% bf16 MFU | 1623204 tok/s step 3794/19560 | loss 3.697062 (+0.92z)| norm 0.2773 (-0.28z)| lr 5.61e-04 | 323.19 ms | 52.2% bf16 MFU | 1623154 tok/s step 3795/19560 | loss 3.704190 (+1.08z)| norm 0.2832 (-0.13z)| lr 5.61e-04 | 323.08 ms | 52.2% bf16 MFU | 1623135 tok/s step 3796/19560 | loss 3.692914 (+0.79z)| norm 0.2539 (-0.91z)| lr 5.61e-04 | 322.55 ms | 52.3% bf16 MFU | 1623251 tok/s step 3797/19560 | loss 3.680575 (+0.48z)| norm 0.3011 (+0.33z)| lr 5.61e-04 | 322.83 ms | 52.3% bf16 MFU | 1623290 tok/s step 3798/19560 | loss 3.647364 (-0.32z)| norm 0.2783 (-0.27z)| lr 5.61e-04 | 323.01 ms | 52.2% bf16 MFU | 1623281 tok/s step 3799/19560 | loss 3.708030 (+1.19z)| norm 0.3240 (+0.93z)| lr 5.61e-04 | 323.30 ms | 52.2% bf16 MFU | 1623200 tok/s step 3800/19560 | loss 3.637475 (-0.55z)| norm 0.3251 (+0.94z)| lr 5.61e-04 | 323.06 ms | 52.2% bf16 MFU | 1623184 tok/s step 3801/19560 | loss 3.739770 (+1.96z)| norm 0.2830 (-0.16z)| lr 5.61e-04 | 322.81 ms | 52.3% bf16 MFU | 1623232 tok/s step 3802/19560 | loss 3.655882 (-0.12z)| norm 0.2713 (-0.46z)| lr 5.61e-04 | 323.14 ms | 52.2% bf16 MFU | 1623193 tok/s step 3803/19560 | loss 3.646423 (-0.35z)| norm 0.2617 (-0.71z)| lr 5.61e-04 | 322.96 ms | 52.3% bf16 MFU | 1623203 tok/s step 3804/19560 | loss 3.652139 (-0.21z)| norm 0.2629 (-0.67z)| lr 5.61e-04 | 323.14 ms | 52.2% bf16 MFU | 1623166 tok/s step 3805/19560 | loss 3.654244 (-0.16z)| norm 0.2892 (+0.02z)| lr 5.61e-04 | 322.79 ms | 52.3% bf16 MFU | 1623219 tok/s step 3806/19560 | loss 3.625976 (-1.01z)| norm 0.2799 (-0.21z)| lr 5.61e-04 | 322.65 ms | 52.3% bf16 MFU | 1623305 tok/s step 3807/19560 | loss 3.675209 (+0.36z)| norm 0.2621 (-0.81z)| lr 5.61e-04 | 323.66 ms | 52.1% bf16 MFU | 1623132 tok/s step 3808/19560 | loss 3.587528 (-2.07z)| norm 0.2764 (-0.33z)| lr 
5.61e-04 | 322.49 ms | 52.3% bf16 MFU | 1623262 tok/s step 3809/19560 | loss 3.637567 (-0.68z)| norm 0.2661 (-0.66z)| lr 5.61e-04 | 323.69 ms | 52.1% bf16 MFU | 1623086 tok/s step 3810/19560 | loss 3.700073 (+1.05z)| norm 0.2445 (-1.51z)| lr 5.61e-04 | 322.71 ms | 52.3% bf16 MFU | 1623164 tok/s step 3811/19560 | loss 3.640188 (-0.61z)| norm 0.2481 (-1.36z)| lr 5.61e-04 | 323.08 ms | 52.2% bf16 MFU | 1623144 tok/s step 3812/19560 | loss 3.615474 (-1.28z)| norm 0.2570 (-1.01z)| lr 5.61e-04 | 324.36 ms | 52.0% bf16 MFU | 1622807 tok/s step 3813/19560 | loss 3.652056 (-0.25z)| norm 0.3029 (+0.74z)| lr 5.61e-04 | 323.36 ms | 52.2% bf16 MFU | 1622735 tok/s step 3814/19560 | loss 3.723362 (+1.69z)| norm 0.3787 (+3.43z)| lr 5.61e-04 | 323.26 ms | 52.2% bf16 MFU | 1622692 tok/s step 3815/19560 | loss 3.691061 (+0.81z)| norm 0.3938 (+3.73z)| lr 5.61e-04 | 323.14 ms | 52.2% bf16 MFU | 1622681 tok/s step 3816/19560 | loss 3.676440 (+0.41z)| norm 0.3275 (+1.43z)| lr 5.61e-04 | 322.94 ms | 52.3% bf16 MFU | 1622722 tok/s step 3817/19560 | loss 3.664554 (+0.08z)| norm 0.3285 (+1.44z)| lr 5.60e-04 | 323.96 ms | 52.1% bf16 MFU | 1622505 tok/s step 3818/19560 | loss 3.611359 (-1.37z)| norm 0.3031 (+0.56z)| lr 5.60e-04 | 322.35 ms | 52.4% bf16 MFU | 1622704 tok/s step 3819/19560 | loss 3.632732 (-0.77z)| norm 0.2810 (-0.19z)| lr 5.60e-04 | 323.90 ms | 52.1% bf16 MFU | 1622502 tok/s step 3820/19560 | loss 3.627015 (-0.92z)| norm 0.2783 (-0.28z)| lr 5.60e-04 | 323.80 ms | 52.1% bf16 MFU | 1622337 tok/s step 3821/19560 | loss 3.679521 (+0.55z)| norm 0.2784 (-0.27z)| lr 5.60e-04 | 323.32 ms | 52.2% bf16 MFU | 1622298 tok/s step 3822/19560 | loss 3.617105 (-1.19z)| norm 0.3094 (+0.77z)| lr 5.60e-04 | 322.86 ms | 52.3% bf16 MFU | 1622378 tok/s step 3823/19560 | loss 3.608981 (-1.39z)| norm 0.2715 (-0.52z)| lr 5.60e-04 | 323.84 ms | 52.1% bf16 MFU | 1622207 tok/s step 3824/19560 | loss 3.640787 (-0.50z)| norm 0.2629 (-0.81z)| lr 5.60e-04 | 322.79 ms | 52.3% bf16 MFU | 1622310 tok/s step 
3825/19560 | loss 3.635431 (-0.65z)| norm 0.2614 (-0.85z)| lr 5.60e-04 | 322.81 ms | 52.3% bf16 MFU | 1622402 tok/s step 3826/19560 | loss 3.642083 (-0.45z)| norm 0.2880 (+0.06z)| lr 5.60e-04 | 323.48 ms | 52.2% bf16 MFU | 1622320 tok/s step 3827/19560 | loss 3.655002 (-0.09z)| norm 0.2599 (-0.89z)| lr 5.60e-04 | 323.30 ms | 52.2% bf16 MFU | 1622287 tok/s step 3828/19560 | loss 3.617495 (-1.13z)| norm 0.2421 (-1.47z)| lr 5.60e-04 | 323.93 ms | 52.1% bf16 MFU | 1622100 tok/s step 3829/19560 | loss 3.680330 (+0.63z)| norm 0.2624 (-0.78z)| lr 5.60e-04 | 322.58 ms | 52.3% bf16 MFU | 1622258 tok/s step 3830/19560 | loss 3.838093 (+4.58z)| norm 0.3137 (+0.94z)| lr 5.60e-04 | 323.18 ms | 52.2% bf16 MFU | 1622260 tok/s step 3831/19560 | loss 3.542315 (-2.88z)| norm 0.7182 (+8.88z)| lr 5.60e-04 | 323.43 ms | 52.2% bf16 MFU | 1622199 tok/s step 3832/19560 | loss 3.640293 (-0.43z)| norm 0.3268 (+0.78z)| lr 5.60e-04 | 323.08 ms | 52.2% bf16 MFU | 1622229 tok/s step 3833/19560 | loss 3.632845 (-0.63z)| norm 0.3411 (+1.07z)| lr 5.60e-04 | 322.79 ms | 52.3% bf16 MFU | 1622331 tok/s step 3834/19560 | loss 3.677379 (+0.48z)| norm 0.3298 (+0.82z)| lr 5.60e-04 | 323.67 ms | 52.1% bf16 MFU | 1622205 tok/s step 3835/19560 | loss 3.657063 (-0.03z)| norm 0.3076 (+0.36z)| lr 5.60e-04 | 322.51 ms | 52.3% bf16 MFU | 1622378 tok/s step 3836/19560 | loss 3.613303 (-1.11z)| norm 0.2965 (+0.13z)| lr 5.60e-04 | 322.99 ms | 52.3% bf16 MFU | 1622421 tok/s step 3837/19560 | loss 3.598778 (-1.45z)| norm 0.2980 (+0.15z)| lr 5.60e-04 | 323.28 ms | 52.2% bf16 MFU | 1622389 tok/s step 3838/19560 | loss 3.665151 (+0.21z)| norm 0.3028 (+0.24z)| lr 5.60e-04 | 323.12 ms | 52.2% bf16 MFU | 1622398 tok/s step 3839/19560 | loss 3.616221 (-1.01z)| norm 0.3080 (+0.35z)| lr 5.60e-04 | 322.93 ms | 52.3% bf16 MFU | 1622455 tok/s step 3840/19560 | loss 3.586105 (-1.74z)| norm 0.3292 (+0.80z)| lr 5.60e-04 | 322.57 ms | 52.3% bf16 MFU | 1622599 tok/s step 3841/19560 | loss 3.630872 (-0.62z)| norm 0.2843 (-0.13z)| lr 
5.60e-04 | 323.27 ms | 52.2% bf16 MFU | 1622560 tok/s step 3842/19560 | loss 3.687037 (+0.77z)| norm 0.2857 (-0.10z)| lr 5.60e-04 | 323.02 ms | 52.2% bf16 MFU | 1622586 tok/s step 3843/19560 | loss 3.640893 (-0.36z)| norm 0.2797 (-0.23z)| lr 5.60e-04 | 322.51 ms | 52.3% bf16 MFU | 1622739 tok/s step 3844/19560 | loss 3.673693 (+0.47z)| norm 0.2934 (+0.06z)| lr 5.60e-04 | 323.13 ms | 52.2% bf16 MFU | 1622727 tok/s step 3845/19560 | loss 3.656204 (+0.05z)| norm 0.2918 (+0.03z)| lr 5.60e-04 | 322.59 ms | 52.3% bf16 MFU | 1622854 tok/s step 3846/19560 | loss 3.616329 (-0.98z)| norm 0.2658 (-0.51z)| lr 5.60e-04 | 322.79 ms | 52.3% bf16 MFU | 1622923 tok/s step 3847/19560 | loss 3.703316 (+1.23z)| norm 0.2859 (-0.09z)| lr 5.60e-04 | 323.00 ms | 52.3% bf16 MFU | 1622935 tok/s step 3848/19560 | loss 3.658825 (+0.10z)| norm 0.2806 (-0.21z)| lr 5.60e-04 | 322.94 ms | 52.3% bf16 MFU | 1622963 tok/s step 3849/19560 | loss 3.705988 (+1.32z)| norm 0.2858 (-0.10z)| lr 5.60e-04 | 323.21 ms | 52.2% bf16 MFU | 1622921 tok/s step 3850/19560 | loss 3.681422 (+0.68z)| norm 0.2480 (-0.88z)| lr 5.60e-04 | 322.69 ms | 52.3% bf16 MFU | 1623011 tok/s step 3851/19560 | loss 3.606838 (-1.22z)| norm 0.2621 (-0.59z)| lr 5.60e-04 | 322.40 ms | 52.3% bf16 MFU | 1623170 tok/s step 3852/19560 | loss 3.617697 (-0.93z)| norm 0.2654 (-0.51z)| lr 5.60e-04 | 323.09 ms | 52.2% bf16 MFU | 1623148 tok/s step 3853/19560 | loss 3.735027 (+2.01z)| norm 0.2783 (-0.25z)| lr 5.60e-04 | 322.99 ms | 52.3% bf16 MFU | 1623152 tok/s step 3854/19560 | loss 3.628605 (-0.66z)| norm 0.2632 (-0.56z)| lr 5.60e-04 | 323.22 ms | 52.2% bf16 MFU | 1623098 tok/s step 3855/19560 | loss 3.628300 (-0.66z)| norm 0.2427 (-0.98z)| lr 5.60e-04 | 322.95 ms | 52.3% bf16 MFU | 1623116 tok/s step 3856/19560 | loss 3.662941 (+0.21z)| norm 0.2669 (-0.48z)| lr 5.60e-04 | 322.49 ms | 52.3% bf16 MFU | 1623248 tok/s step 3857/19560 | loss 3.705261 (+1.26z)| norm 0.2886 (-0.04z)| lr 5.59e-04 | 322.92 ms | 52.3% bf16 MFU | 1623265 tok/s step 
3858/19560 | loss 3.749468 (+2.30z)| norm 0.2531 (-0.77z)| lr 5.59e-04 | 322.79 ms | 52.3% bf16 MFU | 1623312 tok/s step 3859/19560 | loss 3.694173 (+0.92z)| norm 0.2796 (-0.22z)| lr 5.59e-04 | 322.90 ms | 52.3% bf16 MFU | 1623330 tok/s step 3860/19560 | loss 3.625353 (-0.76z)| norm 0.2976 (+0.15z)| lr 5.59e-04 | 322.63 ms | 52.3% bf16 MFU | 1623415 tok/s step 3861/19560 | loss 3.664726 (+0.20z)| norm 0.2620 (-0.58z)| lr 5.59e-04 | 322.91 ms | 52.3% bf16 MFU | 1623426 tok/s step 3862/19560 | loss 3.807526 (+3.49z)| norm 0.2712 (-0.39z)| lr 5.59e-04 | 323.31 ms | 52.2% bf16 MFU | 1623337 tok/s step 3863/19560 | loss 3.612859 (-1.02z)| norm 0.3044 (+0.29z)| lr 5.59e-04 | 322.85 ms | 52.3% bf16 MFU | 1623368 tok/s step 3864/19560 | loss 3.627631 (-0.67z)| norm 0.2825 (-0.16z)| lr 5.59e-04 | 323.06 ms | 52.2% bf16 MFU | 1623344 tok/s step 3865/19560 | loss 3.680854 (+0.56z)| norm 0.2767 (-0.28z)| lr 5.59e-04 | 322.86 ms | 52.3% bf16 MFU | 1623370 tok/s step 3866/19560 | loss 3.667696 (+0.26z)| norm 0.2824 (-0.16z)| lr 5.59e-04 | 322.42 ms | 52.3% bf16 MFU | 1623506 tok/s step 3867/19560 | loss 3.648294 (-0.19z)| norm 0.2755 (-0.31z)| lr 5.59e-04 | 323.48 ms | 52.2% bf16 MFU | 1623368 tok/s step 3868/19560 | loss 3.575311 (-1.86z)| norm 0.2699 (-0.42z)| lr 5.59e-04 | 322.94 ms | 52.3% bf16 MFU | 1623374 tok/s step 3869/19560 | loss 3.663209 (+0.16z)| norm 0.2521 (-0.79z)| lr 5.59e-04 | 322.93 ms | 52.3% bf16 MFU | 1623381 tok/s step 3870/19560 | loss 3.654096 (-0.05z)| norm 0.2933 (+0.06z)| lr 5.59e-04 | 322.90 ms | 52.3% bf16 MFU | 1623396 tok/s step 3871/19560 | loss 3.628921 (-0.62z)| norm 0.3078 (+0.35z)| lr 5.59e-04 | 323.08 ms | 52.2% bf16 MFU | 1623366 tok/s step 3872/19560 | loss 3.632710 (-0.55z)| norm 0.2631 (-0.58z)| lr 5.59e-04 | 323.15 ms | 52.2% bf16 MFU | 1623319 tok/s step 3873/19560 | loss 3.764026 (+2.46z)| norm 0.2900 (-0.02z)| lr 5.59e-04 | 322.90 ms | 52.3% bf16 MFU | 1623338 tok/s step 3874/19560 | loss 3.633051 (-0.53z)| norm 0.2800 (-0.23z)| lr 
5.59e-04 | 323.17 ms | 52.2% bf16 MFU | 1623288 tok/s step 3875/19560 | loss 3.603438 (-1.20z)| norm 0.2717 (-0.40z)| lr 5.59e-04 | 322.94 ms | 52.3% bf16 MFU | 1623299 tok/s step 3876/19560 | loss 3.584350 (-1.61z)| norm 0.2639 (-0.56z)| lr 5.59e-04 | 323.09 ms | 52.2% bf16 MFU | 1623270 tok/s step 3877/19560 | loss 3.617919 (-0.84z)| norm 0.2678 (-0.48z)| lr 5.59e-04 | 322.93 ms | 52.3% bf16 MFU | 1623284 tok/s step 3878/19560 | loss 3.605305 (-1.11z)| norm 0.2778 (-0.27z)| lr 5.59e-04 | 322.55 ms | 52.3% bf16 MFU | 1623392 tok/s step 3879/19560 | loss 3.790177 (+2.95z)| norm 0.3577 (+1.39z)| lr 5.59e-04 | 323.22 ms | 52.2% bf16 MFU | 1623327 tok/s step 3880/19560 | loss 3.611782 (-0.95z)| norm 0.3866 (+1.96z)| lr 5.59e-04 | 323.01 ms | 52.2% bf16 MFU | 1623316 tok/s step 3881/19560 | loss 3.776150 (+2.56z)| norm 0.5175 (+4.27z)| lr 5.59e-04 | 322.88 ms | 52.3% bf16 MFU | 1623340 tok/s step 3882/19560 | loss 3.627346 (-0.62z)| norm 0.3364 (+0.82z)| lr 5.59e-04 | 322.82 ms | 52.3% bf16 MFU | 1623377 tok/s step 3883/19560 | loss 3.702940 (+0.98z)| norm 0.3670 (+1.38z)| lr 5.59e-04 | 322.76 ms | 52.3% bf16 MFU | 1623429 tok/s step 3884/19560 | loss 3.641616 (-0.32z)| norm 0.3082 (+0.26z)| lr 5.59e-04 | 323.10 ms | 52.2% bf16 MFU | 1623391 tok/s step 3885/19560 | loss 3.571486 (-1.78z)| norm 0.2632 (-0.58z)| lr 5.59e-04 | 322.40 ms | 52.3% bf16 MFU | 1623532 tok/s step 3886/19560 | loss 3.572664 (-1.73z)| norm 0.9843 (+8.49z)| lr 5.59e-04 | 322.69 ms | 52.3% bf16 MFU | 1623594 tok/s step 3887/19560 | loss 3.637591 (-0.37z)| norm 0.4306 (+1.61z)| lr 5.59e-04 | 322.98 ms | 52.3% bf16 MFU | 1623579 tok/s step 3888/19560 | loss 3.679749 (+0.51z)| norm 0.4372 (+1.66z)| lr 5.59e-04 | 322.64 ms | 52.3% bf16 MFU | 1623651 tok/s step 3889/19560 | loss 3.658504 (+0.06z)| norm 0.3161 (+0.18z)| lr 5.59e-04 | 322.96 ms | 52.3% bf16 MFU | 1623637 tok/s step 3890/19560 | loss 3.618050 (-0.79z)| norm 0.3313 (+0.37z)| lr 5.59e-04 | 323.49 ms | 52.2% bf16 MFU | 1623492 tok/s step 
3891/19560 | loss 3.623894 (-0.67z)| norm 0.3100 (+0.11z)| lr 5.59e-04 | 322.78 ms | 52.3% bf16 MFU | 1623532 tok/s step 3892/19560 | loss 3.646134 (-0.20z)| norm 0.3057 (+0.06z)| lr 5.59e-04 | 322.36 ms | 52.4% bf16 MFU | 1623675 tok/s step 3893/19560 | loss 3.651131 (-0.12z)| norm 0.2932 (-0.09z)| lr 5.59e-04 | 322.92 ms | 52.3% bf16 MFU | 1623670 tok/s step 3894/19560 | loss 3.618277 (-0.82z)| norm 0.2621 (-0.46z)| lr 5.59e-04 | 323.05 ms | 52.2% bf16 MFU | 1623633 tok/s step 3895/19560 | loss 3.657338 (+0.01z)| norm 0.2758 (-0.29z)| lr 5.59e-04 | 322.89 ms | 52.3% bf16 MFU | 1623638 tok/s step 3896/19560 | loss 3.647271 (-0.20z)| norm 0.2607 (-0.47z)| lr 5.59e-04 | 322.94 ms | 52.3% bf16 MFU | 1623630 tok/s step 3897/19560 | loss 3.616188 (-0.86z)| norm 0.2754 (-0.29z)| lr 5.58e-04 | 322.38 ms | 52.4% bf16 MFU | 1623764 tok/s step 3898/19560 | loss 3.573119 (-1.76z)| norm 0.2565 (-0.52z)| lr 5.58e-04 | 322.54 ms | 52.3% bf16 MFU | 1623851 tok/s step 3899/19560 | loss 3.658716 (+0.06z)| norm 0.2661 (-0.40z)| lr 5.58e-04 | 323.46 ms | 52.2% bf16 MFU | 1623701 tok/s step 3900/19560 | loss 3.804778 (+3.04z)| norm 0.2913 (-0.10z)| lr 5.58e-04 | 322.98 ms | 52.3% bf16 MFU | 1623680 tok/s step 3901/19560 | loss 3.633183 (-0.48z)| norm 0.2902 (-0.11z)| lr 5.58e-04 | 322.65 ms | 52.3% bf16 MFU | 1623743 tok/s step 3902/19560 | loss 3.692009 (+0.73z)| norm 0.2493 (-0.61z)| lr 5.58e-04 | 322.24 ms | 52.4% bf16 MFU | 1623906 tok/s step 3903/19560 | loss 3.644405 (-0.25z)| norm 0.2755 (-0.29z)| lr 5.58e-04 | 323.11 ms | 52.2% bf16 MFU | 1623843 tok/s step 3904/19560 | loss 3.672253 (+0.33z)| norm 0.2852 (-0.18z)| lr 5.58e-04 | 323.08 ms | 52.2% bf16 MFU | 1623791 tok/s step 3905/19560 | loss 3.647147 (-0.18z)| norm 0.2818 (-0.22z)| lr 5.58e-04 | 323.83 ms | 52.1% bf16 MFU | 1623553 tok/s step 3906/19560 | loss 3.659176 (+0.08z)| norm 0.2643 (-0.44z)| lr 5.58e-04 | 322.53 ms | 52.3% bf16 MFU | 1623652 tok/s step 3907/19560 | loss 3.695656 (+0.84z)| norm 0.2552 (-0.54z)| lr 
5.58e-04 | 322.19 ms | 52.4% bf16 MFU | 1623833 tok/s step 3908/19560 | loss 3.667979 (+0.25z)| norm 0.2774 (-0.27z)| lr 5.58e-04 | 323.24 ms | 52.2% bf16 MFU | 1623739 tok/s step 3909/19560 | loss 3.635420 (-0.42z)| norm 0.2816 (-0.21z)| lr 5.58e-04 | 322.56 ms | 52.3% bf16 MFU | 1623821 tok/s step 3910/19560 | loss 3.649520 (-0.13z)| norm 0.2806 (-0.22z)| lr 5.58e-04 | 322.53 ms | 52.3% bf16 MFU | 1623908 tok/s step 3911/19560 | loss 3.647323 (-0.17z)| norm 0.2740 (-0.30z)| lr 5.58e-04 | 322.45 ms | 52.3% bf16 MFU | 1624011 tok/s step 3912/19560 | loss 3.549495 (-2.17z)| norm 0.2872 (-0.14z)| lr 5.58e-04 | 322.90 ms | 52.3% bf16 MFU | 1623995 tok/s step 3913/19560 | loss 3.596874 (-1.18z)| norm 0.3055 (+0.08z)| lr 5.58e-04 | 322.54 ms | 52.3% bf16 MFU | 1624069 tok/s step 3914/19560 | loss 3.727256 (+1.48z)| norm 0.2862 (-0.16z)| lr 5.58e-04 | 322.60 ms | 52.3% bf16 MFU | 1624125 tok/s step 3915/19560 | loss 3.592384 (-1.25z)| norm 0.2648 (-0.41z)| lr 5.58e-04 | 322.68 ms | 52.3% bf16 MFU | 1624157 tok/s step 3916/19560 | loss 3.571592 (-1.64z)| norm 0.2383 (-0.74z)| lr 5.58e-04 | 322.84 ms | 52.3% bf16 MFU | 1624150 tok/s step 3917/19560 | loss 3.616771 (-0.73z)| norm 0.2661 (-0.40z)| lr 5.58e-04 | 322.69 ms | 52.3% bf16 MFU | 1624180 tok/s step 3918/19560 | loss 3.629938 (-0.46z)| norm 0.2777 (-0.25z)| lr 5.58e-04 | 323.09 ms | 52.2% bf16 MFU | 1624108 tok/s step 3919/19560 | loss 3.637212 (-0.32z)| norm 0.2581 (-0.49z)| lr 5.58e-04 | 323.13 ms | 52.2% bf16 MFU | 1624029 tok/s step 3920/19560 | loss 3.662820 (+0.20z)| norm 0.2911 (-0.09z)| lr 5.58e-04 | 322.51 ms | 52.3% bf16 MFU | 1624111 tok/s step 3921/19560 | loss 3.597711 (-1.09z)| norm 0.3110 (+0.15z)| lr 5.58e-04 | 322.66 ms | 52.3% bf16 MFU | 1624151 tok/s step 3922/19560 | loss 3.627851 (-0.48z)| norm 0.3146 (+0.19z)| lr 5.58e-04 | 323.00 ms | 52.3% bf16 MFU | 1624102 tok/s step 3923/19560 | loss 3.641779 (-0.19z)| norm 0.3052 (+0.07z)| lr 5.58e-04 | 323.44 ms | 52.2% bf16 MFU | 1623945 tok/s step 
3924/19560 | loss 3.581506 (-1.38z)| norm 0.2727 (-0.33z)| lr 5.58e-04 | 322.68 ms | 52.3% bf16 MFU | 1623987 tok/s step 3925/19560 | loss 3.625554 (-0.49z)| norm 0.2892 (-0.13z)| lr 5.58e-04 | 322.92 ms | 52.3% bf16 MFU | 1623968 tok/s step 3926/19560 | loss 3.625952 (-0.48z)| norm 0.2647 (-0.43z)| lr 5.58e-04 | 322.56 ms | 52.3% bf16 MFU | 1624039 tok/s step 3927/19560 | loss 3.636221 (-0.26z)| norm 0.2509 (-0.59z)| lr 5.58e-04 | 322.70 ms | 52.3% bf16 MFU | 1624072 tok/s step 3928/19560 | loss 3.592215 (-1.13z)| norm 0.2495 (-0.60z)| lr 5.58e-04 | 323.45 ms | 52.2% bf16 MFU | 1623914 tok/s step 3929/19560 | loss 3.605160 (-0.86z)| norm 0.2386 (-0.72z)| lr 5.58e-04 | 323.07 ms | 52.2% bf16 MFU | 1623861 tok/s step 3930/19560 | loss 3.640105 (-0.16z)| norm 0.2448 (-0.64z)| lr 5.58e-04 | 322.72 ms | 52.3% bf16 MFU | 1623899 tok/s step 3931/19560 | loss 3.679430 (+0.63z)| norm 0.2521 (-0.55z)| lr 5.58e-04 | 322.89 ms | 52.3% bf16 MFU | 1623890 tok/s step 3932/19560 | loss 3.676025 (+0.56z)| norm 0.2466 (-0.62z)| lr 5.58e-04 | 322.52 ms | 52.3% bf16 MFU | 1623976 tok/s step 3933/19560 | loss 3.593951 (-1.08z)| norm 0.2450 (-0.63z)| lr 5.58e-04 | 322.50 ms | 52.3% bf16 MFU | 1624061 tok/s step 3934/19560 | loss 3.646760 (-0.02z)| norm 0.2639 (-0.40z)| lr 5.58e-04 | 322.92 ms | 52.3% bf16 MFU | 1624038 tok/s step 3935/19560 | loss 3.668716 (+0.42z)| norm 0.2792 (-0.22z)| lr 5.58e-04 | 322.87 ms | 52.3% bf16 MFU | 1624029 tok/s step 3936/19560 | loss 3.674552 (+0.52z)| norm 0.2706 (-0.32z)| lr 5.57e-04 | 322.75 ms | 52.3% bf16 MFU | 1624049 tok/s step 3937/19560 | loss 3.583191 (-1.30z)| norm 0.2917 (-0.07z)| lr 5.57e-04 | 322.46 ms | 52.3% bf16 MFU | 1624141 tok/s step 3938/19560 | loss 3.658196 (+0.21z)| norm 0.2747 (-0.28z)| lr 5.57e-04 | 323.14 ms | 52.2% bf16 MFU | 1624058 tok/s step 3939/19560 | loss 3.615187 (-0.65z)| norm 0.2818 (-0.20z)| lr 5.57e-04 | 322.71 ms | 52.3% bf16 MFU | 1624088 tok/s step 3940/19560 | loss 3.629362 (-0.37z)| norm 0.3175 (+0.24z)| lr 
5.57e-04 | 322.72 ms | 52.3% bf16 MFU | 1624114 tok/s
step 3941/19560 | loss 3.605606 (-0.84z)| norm 0.3116 (+0.16z)| lr 5.57e-04 | 322.67 ms | 52.3% bf16 MFU | 1624150 tok/s
step 3942/19560 | loss 3.625524 (-0.43z)| norm 0.2894 (-0.10z)| lr 5.57e-04 | 322.68 ms | 52.3% bf16 MFU | 1624181 tok/s
step 3943/19560 | loss 3.598871 (-0.95z)| norm 0.2495 (-0.58z)| lr 5.57e-04 | 323.11 ms | 52.2% bf16 MFU | 1624102 tok/s
step 3944/19560 | loss 3.626048 (-0.39z)| norm 0.2651 (-0.38z)| lr 5.57e-04 | 322.56 ms | 52.3% bf16 MFU | 1624167 tok/s
step 3945/19560 | loss 3.607510 (-0.76z)| norm 0.2640 (-0.39z)| lr 5.57e-04 | 322.68 ms | 52.3% bf16 MFU | 1624197 tok/s
step 3946/19560 | loss 3.676498 (+0.62z)| norm 0.2730 (-0.27z)| lr 5.57e-04 | 322.68 ms | 52.3% bf16 MFU | 1624225 tok/s
step 3947/19560 | loss 3.708236 (+1.24z)| norm 0.2663 (-0.35z)| lr 5.57e-04 | 323.52 ms | 52.2% bf16 MFU | 1624042 tok/s
step 3948/19560 | loss 3.680824 (+0.69z)| norm 0.2777 (-0.22z)| lr 5.57e-04 | 323.02 ms | 52.2% bf16 MFU | 1623994 tok/s
step 3949/19560 | loss 3.546893 (-1.95z)| norm 0.3378 (+0.52z)| lr 5.57e-04 | 322.91 ms | 52.3% bf16 MFU | 1623976 tok/s
step 3950/19560 | loss 3.628535 (-0.34z)| norm 0.3264 (+0.38z)| lr 5.57e-04 | 322.24 ms | 52.4% bf16 MFU | 1624129 tok/s
step 3951/19560 | loss 3.646073 (+0.00z)| norm 0.2812 (-0.18z)| lr 5.57e-04 | 322.95 ms | 52.3% bf16 MFU | 1624096 tok/s
step 3952/19560 | loss 3.640282 (-0.11z)| norm 0.3044 (+0.10z)| lr 5.57e-04 | 322.66 ms | 52.3% bf16 MFU | 1624135 tok/s
step 3953/19560 | loss 3.624948 (-0.41z)| norm 0.3123 (+0.19z)| lr 5.57e-04 | 322.43 ms | 52.3% bf16 MFU | 1624231 tok/s
step 3954/19560 | loss 3.628107 (-0.35z)| norm 0.2701 (-0.32z)| lr 5.57e-04 | 323.32 ms | 52.2% bf16 MFU | 1624099 tok/s
step 3955/19560 | loss 3.610330 (-0.69z)| norm 0.2740 (-0.28z)| lr 5.57e-04 | 322.94 ms | 52.3% bf16 MFU | 1624068 tok/s
step 3956/19560 | loss 3.659822 (+0.28z)| norm 0.2643 (-0.40z)| lr 5.57e-04 | 322.77 ms | 52.3% bf16 MFU | 1624081 tok/s
step 3957/19560 | loss 3.597030 (-0.95z)| norm 0.2478 (-0.60z)| lr 5.57e-04 | 322.69 ms | 52.3% bf16 MFU | 1624113 tok/s
step 3958/19560 | loss 3.623905 (-0.41z)| norm 0.2401 (-0.69z)| lr 5.57e-04 | 322.92 ms | 52.3% bf16 MFU | 1624087 tok/s
step 3959/19560 | loss 3.591009 (-1.12z)| norm 0.2882 (-0.06z)| lr 5.57e-04 | 323.42 ms | 52.2% bf16 MFU | 1623937 tok/s
step 3960/19560 | loss 3.636445 (-0.16z)| norm 0.2562 (-0.50z)| lr 5.57e-04 | 323.00 ms | 52.3% bf16 MFU | 1623900 tok/s
step 3961/19560 | loss 3.660935 (+0.36z)| norm 0.2640 (-0.38z)| lr 5.57e-04 | 323.13 ms | 52.2% bf16 MFU | 1623830 tok/s
step 3962/19560 | loss 3.629664 (-0.30z)| norm 0.2803 (-0.15z)| lr 5.57e-04 | 323.23 ms | 52.2% bf16 MFU | 1623739 tok/s
step 3963/19560 | loss 3.627267 (-0.34z)| norm 0.2790 (-0.16z)| lr 5.57e-04 | 322.40 ms | 52.3% bf16 MFU | 1623862 tok/s
step 3964/19560 | loss 3.607651 (-0.76z)| norm 0.2640 (-0.37z)| lr 5.57e-04 | 322.79 ms | 52.3% bf16 MFU | 1623881 tok/s
step 3965/19560 | loss 3.600327 (-0.91z)| norm 0.2709 (-0.27z)| lr 5.57e-04 | 323.28 ms | 52.2% bf16 MFU | 1623776 tok/s
step 3966/19560 | loss 3.612387 (-0.65z)| norm 0.2598 (-0.42z)| lr 5.57e-04 | 322.53 ms | 52.3% bf16 MFU | 1623864 tok/s
step 3967/19560 | loss 3.602875 (-0.85z)| norm 0.2711 (-0.26z)| lr 5.57e-04 | 323.22 ms | 52.2% bf16 MFU | 1623775 tok/s
step 3968/19560 | loss 3.678384 (+0.74z)| norm 0.2589 (-0.42z)| lr 5.57e-04 | 322.51 ms | 52.3% bf16 MFU | 1623869 tok/s
step 3969/19560 | loss 3.725144 (+1.70z)| norm 0.2705 (-0.26z)| lr 5.57e-04 | 322.65 ms | 52.3% bf16 MFU | 1623923 tok/s
step 3970/19560 | loss 3.677567 (+0.70z)| norm 0.2834 (-0.08z)| lr 5.57e-04 | 323.39 ms | 52.2% bf16 MFU | 1623789 tok/s
step 3971/19560 | loss 3.658990 (+0.31z)| norm 0.2975 (+0.11z)| lr 5.57e-04 | 323.23 ms | 52.2% bf16 MFU | 1623700 tok/s
step 3972/19560 | loss 3.703323 (+1.23z)| norm 0.2941 (+0.07z)| lr 5.57e-04 | 323.03 ms | 52.2% bf16 MFU | 1623665 tok/s
step 3973/19560 | loss 3.644889 (+0.01z)| norm 0.2802 (-0.12z)| lr 5.57e-04 | 323.31 ms | 52.2% bf16 MFU | 1623562 tok/s
step 3974/19560 | loss 3.609233 (-0.74z)| norm 0.2491 (-0.55z)| lr 5.57e-04 | 322.55 ms | 52.3% bf16 MFU | 1623656 tok/s
step 3975/19560 | loss 3.659533 (+0.33z)| norm 0.2922 (+0.04z)| lr 5.56e-04 | 323.06 ms | 52.2% bf16 MFU | 1623616 tok/s
step 3976/19560 | loss 3.613551 (-0.64z)| norm 0.2988 (+0.13z)| lr 5.56e-04 | 323.33 ms | 52.2% bf16 MFU | 1623512 tok/s
step 3977/19560 | loss 3.639420 (-0.08z)| norm 0.3196 (+0.42z)| lr 5.56e-04 | 322.85 ms | 52.3% bf16 MFU | 1623533 tok/s
step 3978/19560 | loss 3.619028 (-0.50z)| norm 0.2959 (+0.08z)| lr 5.56e-04 | 323.41 ms | 52.2% bf16 MFU | 1623412 tok/s
step 3979/19560 | loss 3.595455 (-1.00z)| norm 0.3093 (+0.26z)| lr 5.56e-04 | 322.83 ms | 52.3% bf16 MFU | 1623444 tok/s
step 3980/19560 | loss 3.688497 (+0.96z)| norm 0.2929 (+0.03z)| lr 5.56e-04 | 322.87 ms | 52.3% bf16 MFU | 1623463 tok/s
step 3981/19560 | loss 3.573047 (-1.47z)| norm 0.3087 (+0.25z)| lr 5.56e-04 | 323.96 ms | 52.1% bf16 MFU | 1623209 tok/s
step 3982/19560 | loss 3.639865 (-0.05z)| norm 0.3175 (+0.37z)| lr 5.56e-04 | 322.94 ms | 52.3% bf16 MFU | 1623223 tok/s
step 3983/19560 | loss 3.656484 (+0.30z)| norm 0.3295 (+0.52z)| lr 5.56e-04 | 323.16 ms | 52.2% bf16 MFU | 1623181 tok/s
step 3984/19560 | loss 3.606674 (-0.75z)| norm 0.2987 (+0.09z)| lr 5.56e-04 | 323.03 ms | 52.2% bf16 MFU | 1623174 tok/s
step 3985/19560 | loss 3.618655 (-0.48z)| norm 0.2974 (+0.07z)| lr 5.56e-04 | 322.42 ms | 52.3% bf16 MFU | 1623319 tok/s
step 3986/19560 | loss 3.613683 (-0.58z)| norm 0.2553 (-0.51z)| lr 5.56e-04 | 323.61 ms | 52.2% bf16 MFU | 1623159 tok/s
step 3987/19560 | loss 3.662063 (+0.49z)| norm 0.2772 (-0.21z)| lr 5.56e-04 | 322.56 ms | 52.3% bf16 MFU | 1623272 tok/s
step 3988/19560 | loss 3.731322 (+1.96z)| norm 0.2626 (-0.41z)| lr 5.56e-04 | 323.15 ms | 52.2% bf16 MFU | 1623231 tok/s
step 3989/19560 | loss 3.703114 (+1.33z)| norm 0.2755 (-0.23z)| lr 5.56e-04 | 323.08 ms | 52.2% bf16 MFU | 1623209 tok/s
step 3990/19560 | loss 3.586591 (-1.18z)| norm 0.2805 (-0.16z)| lr 5.56e-04 | 322.63 ms | 52.3% bf16 MFU | 1623302 tok/s
step 3991/19560 | loss 3.595653 (-0.98z)| norm 0.2536 (-0.53z)| lr 5.56e-04 | 322.79 ms | 52.3% bf16 MFU | 1623348 tok/s
step 3992/19560 | loss 3.642853 (+0.08z)| norm 0.2980 (+0.09z)| lr 5.56e-04 | 322.89 ms | 52.3% bf16 MFU | 1623366 tok/s
step 3993/19560 | loss 3.638317 (-0.01z)| norm 0.2961 (+0.06z)| lr 5.56e-04 | 323.67 ms | 52.1% bf16 MFU | 1623190 tok/s
step 3994/19560 | loss 3.666627 (+0.63z)| norm 0.2734 (-0.25z)| lr 5.56e-04 | 323.15 ms | 52.2% bf16 MFU | 1623151 tok/s
step 3995/19560 | loss 3.742148 (+2.27z)| norm 0.3110 (+0.26z)| lr 5.56e-04 | 322.32 ms | 52.4% bf16 MFU | 1623324 tok/s
step 3996/19560 | loss 3.589920 (-1.10z)| norm 0.3364 (+0.61z)| lr 5.56e-04 | 323.07 ms | 52.2% bf16 MFU | 1623299 tok/s
step 3997/19560 | loss 3.690125 (+1.11z)| norm 0.3116 (+0.26z)| lr 5.56e-04 | 323.24 ms | 52.2% bf16 MFU | 1623232 tok/s
step 3998/19560 | loss 3.637177 (-0.06z)| norm 0.3201 (+0.37z)| lr 5.56e-04 | 322.66 ms | 52.3% bf16 MFU | 1623316 tok/s
step 3999/19560 | loss 3.663365 (+0.51z)| norm 0.2875 (-0.08z)| lr 5.56e-04 | 322.83 ms | 52.3% bf16 MFU | 1623351 tok/s
step 4000/19560 | loss 3.665452 (+0.55z)| norm 0.2955 (+0.03z)| lr 5.56e-04 | 322.88 ms | 52.3% bf16 MFU | 1623372 tok/s
val loss 3.624114
evaluating HellaSwag: 0/79
evaluating HellaSwag: 10/79
evaluating HellaSwag: 20/79
evaluating HellaSwag: 30/79
evaluating HellaSwag: 40/79
evaluating HellaSwag: 50/79
evaluating HellaSwag: 60/79
evaluating HellaSwag: 70/79
HellaSwag: 2745/10042 = 0.273352
step 4001/19560 | loss 3.593017 (-1.04z)| norm 0.2969 (+0.05z)| lr 5.56e-04 | 322.27 ms | 52.4% bf16 MFU | 1623546 tok/s
step 4002/19560 | loss 3.655041 (+0.36z)| norm 0.2813 (-0.17z)| lr 5.56e-04 | 322.97 ms | 52.3% bf16 MFU | 1623535 tok/s
step 4003/19560 | loss 3.594423 (-1.01z)| norm 0.2633 (-0.42z)| lr 5.56e-04 | 323.16 ms | 52.2% bf16 MFU | 1623476 tok/s
step 4004/19560 | loss 3.614021 (-0.57z)| norm 0.2821 (-0.16z)| lr 5.56e-04 | 322.54 ms | 52.3% bf16 MFU | 1623578 tok/s
step 4005/19560 | loss 3.593353 (-1.03z)| norm 0.3035 (+0.13z)| lr 5.56e-04 | 322.69 ms | 52.3% bf16 MFU | 1623636 tok/s
step 4006/19560 | loss 3.650147 (+0.24z)| norm 0.2715 (-0.31z)| lr 5.56e-04 | 323.30 ms | 52.2% bf16 MFU | 1623537 tok/s
step 4007/19560 | loss 3.629933 (-0.20z)| norm 0.2393 (-0.74z)| lr 5.56e-04 | 322.71 ms | 52.3% bf16 MFU | 1623593 tok/s
step 4008/19560 | loss 3.642234 (+0.09z)| norm 0.2895 (-0.04z)| lr 5.56e-04 | 323.05 ms | 52.2% bf16 MFU | 1623560 tok/s
step 4009/19560 | loss 3.608412 (-0.71z)| norm 0.2626 (-0.40z)| lr 5.56e-04 | 323.03 ms | 52.2% bf16 MFU | 1623534 tok/s
step 4010/19560 | loss 3.601538 (-0.88z)| norm 0.2632 (-0.38z)| lr 5.56e-04 | 323.34 ms | 52.2% bf16 MFU | 1623431 tok/s
step 4011/19560 | loss 3.587490 (-1.21z)| norm 0.2492 (-0.58z)| lr 5.56e-04 | 322.44 ms | 52.3% bf16 MFU | 1623560 tok/s
step 4012/19560 | loss 3.615353 (-0.51z)| norm 0.2425 (-0.66z)| lr 5.56e-04 | 323.22 ms | 52.2% bf16 MFU | 1623486 tok/s
step 4013/19560 | loss 3.591149 (-1.12z)| norm 0.2634 (-0.36z)| lr 5.55e-04 | 322.95 ms | 52.3% bf16 MFU | 1623483 tok/s
step 4014/19560 | loss 3.713456 (+1.90z)| norm 0.3150 (+1.08z)| lr 5.55e-04 | 323.28 ms | 52.2% bf16 MFU | 1623399 tok/s
step 4015/19560 | loss 3.644744 (+0.19z)| norm 0.2763 (-0.20z)| lr 5.55e-04 | 322.40 ms | 52.3% bf16 MFU | 1623539 tok/s
step 4016/19560 | loss 3.650062 (+0.33z)| norm 0.2735 (-0.30z)| lr 5.55e-04 | 323.50 ms | 52.2% bf16 MFU | 1623395 tok/s
step 4017/19560 | loss 3.611217 (-0.63z)| norm 0.2889 (+0.38z)| lr 5.55e-04 | 322.76 ms | 52.3% bf16 MFU | 1623445 tok/s
step 4018/19560 | loss 3.580933 (-1.37z)| norm 0.2496 (-1.33z)| lr 5.55e-04 | 323.15 ms | 52.2% bf16 MFU | 1623395 tok/s
step 4019/19560 | loss 3.637861 (+0.04z)| norm 0.2490 (-1.34z)| lr 5.55e-04 | 322.77 ms | 52.3% bf16 MFU | 1623443 tok/s
step 4020/19560 | loss 3.607496 (-0.71z)| norm 0.2617 (-0.76z)| lr 5.55e-04 | 323.01 ms | 52.2% bf16 MFU | 1623427 tok/s
step 4021/19560 | loss 3.628456 (-0.18z)| norm 0.2423 (-1.59z)| lr 5.55e-04 | 323.07 ms | 52.2% bf16 MFU | 1623396 tok/s
step 4022/19560 | loss 3.667736 (+0.78z)| norm 0.2518 (-1.17z)| lr 5.55e-04 | 323.17 ms | 52.2% bf16 MFU | 1623344 tok/s
step 4023/19560 | loss 3.672827 (+0.90z)| norm 0.2572 (-0.92z)| lr 5.55e-04 | 322.87 ms | 52.3% bf16 MFU | 1623368 tok/s
step 4024/19560 | loss 3.638294 (+0.05z)| norm 0.2877 (+0.41z)| lr 5.55e-04 | 322.72 ms | 52.3% bf16 MFU | 1623430 tok/s
step 4025/19560 | loss 3.616563 (-0.49z)| norm 0.2563 (-0.96z)| lr 5.55e-04 | 322.80 ms | 52.3% bf16 MFU | 1623469 tok/s
step 4026/19560 | loss 3.665863 (+0.72z)| norm 0.2940 (+0.68z)| lr 5.55e-04 | 322.99 ms | 52.3% bf16 MFU | 1623456 tok/s
step 4027/19560 | loss 3.585206 (-1.27z)| norm 0.2986 (+0.87z)| lr 5.55e-04 | 322.82 ms | 52.3% bf16 MFU | 1623487 tok/s
step 4028/19560 | loss 3.613826 (-0.56z)| norm 0.2726 (-0.26z)| lr 5.55e-04 | 322.60 ms | 52.3% bf16 MFU | 1623572 tok/s
step 4029/19560 | loss 3.616753 (-0.48z)| norm 0.2598 (-0.81z)| lr 5.55e-04 | 322.24 ms | 52.4% bf16 MFU | 1623743 tok/s
step 4030/19560 | loss 3.682773 (+1.28z)| norm 0.2974 (+0.82z)| lr 5.55e-04 | 322.98 ms | 52.3% bf16 MFU | 1623721 tok/s
step 4031/19560 | loss 3.654122 (+0.51z)| norm 0.3191 (+1.74z)| lr 5.55e-04 | 322.91 ms | 52.3% bf16 MFU | 1623716 tok/s
step 4032/19560 | loss 3.645009 (+0.28z)| norm 0.2729 (-0.26z)| lr 5.55e-04 | 322.52 ms | 52.3% bf16 MFU | 1623811 tok/s
step 4033/19560 | loss 3.702865 (+1.79z)| norm 0.2976 (+0.80z)| lr 5.55e-04 | 323.02 ms | 52.2% bf16 MFU | 1623775 tok/s
step 4034/19560 | loss 3.665070 (+0.79z)| norm 0.3243 (+1.92z)| lr 5.55e-04 | 322.76 ms | 52.3% bf16 MFU | 1623805 tok/s
step 4035/19560 | loss 3.565159 (-1.81z)| norm 0.3134 (+1.43z)| lr 5.55e-04 | 322.61 ms | 52.3% bf16 MFU | 1623872 tok/s
step 4036/19560 | loss 3.618596 (-0.40z)| norm 0.3074 (+1.15z)| lr 5.55e-04 | 322.54 ms | 52.3% bf16 MFU | 1623954 tok/s
step 4037/19560 | loss 3.643104 (+0.25z)| norm 0.2772 (-0.13z)| lr 5.55e-04 | 322.99 ms | 52.3% bf16 MFU | 1623918 tok/s
step 4038/19560 | loss 3.616293 (-0.45z)| norm 0.3027 (+0.94z)| lr 5.55e-04 | 322.36 ms | 52.4% bf16 MFU | 1624041 tok/s
step 4039/19560 | loss 3.653923 (+0.54z)| norm 0.3027 (+0.93z)| lr 5.55e-04 | 322.32 ms | 52.4% bf16 MFU | 1624171 tok/s
step 4040/19560 | loss 3.650120 (+0.42z)| norm 0.2928 (+0.51z)| lr 5.55e-04 | 323.33 ms | 52.2% bf16 MFU | 1624038 tok/s
step 4041/19560 | loss 3.620551 (-0.38z)| norm 0.2627 (-0.74z)| lr 5.55e-04 | 322.34 ms | 52.4% bf16 MFU | 1624161 tok/s
step 4042/19560 | loss 3.641786 (+0.22z)| norm 0.3145 (+1.42z)| lr 5.55e-04 | 322.90 ms | 52.3% bf16 MFU | 1624139 tok/s
step 4043/19560 | loss 3.576379 (-1.58z)| norm 0.2728 (-0.32z)| lr 5.55e-04 | 322.88 ms | 52.3% bf16 MFU | 1624122 tok/s
step 4044/19560 | loss 3.602767 (-0.87z)| norm 0.2997 (+0.79z)| lr 5.55e-04 | 322.57 ms | 52.3% bf16 MFU | 1624184 tok/s
step 4045/19560 | loss 3.544895 (-2.40z)| norm 0.2456 (-1.48z)| lr 5.55e-04 | 322.64 ms | 52.3% bf16 MFU | 1624224 tok/s
step 4046/19560 | loss 3.595658 (-1.01z)| norm 0.2607 (-0.84z)| lr 5.55e-04 | 322.76 ms | 52.3% bf16 MFU | 1624232 tok/s
step 4047/19560 | loss 3.628889 (-0.11z)| norm 0.2854 (+0.19z)| lr 5.55e-04 | 323.09 ms | 52.2% bf16 MFU | 1624156 tok/s
step 4048/19560 | loss 3.711565 (+2.08z)| norm 0.2841 (+0.13z)| lr 5.55e-04 | 322.89 ms | 52.3% bf16 MFU | 1624136 tok/s
step 4049/19560 | loss 3.613978 (-0.52z)| norm 0.2467 (-1.42z)| lr 5.55e-04 | 322.54 ms | 52.3% bf16 MFU | 1624205 tok/s
step 4050/19560 | loss 3.687603 (+1.42z)| norm 0.2742 (-0.25z)| lr 5.55e-04 | 322.71 ms | 52.3% bf16 MFU | 1624226 tok/s
step 4051/19560 | loss 3.619924 (-0.37z)| norm 0.3114 (+1.32z)| lr 5.54e-04 | 323.07 ms | 52.2% bf16 MFU | 1624156 tok/s
step 4052/19560 | loss 3.645320 (+0.29z)| norm 0.2692 (-0.46z)| lr 5.54e-04 | 322.82 ms | 52.3% bf16 MFU | 1624152 tok/s
step 4053/19560 | loss 3.621313 (-0.35z)| norm 0.2559 (-1.01z)| lr 5.54e-04 | 323.05 ms | 52.2% bf16 MFU | 1624090 tok/s
step 4054/19560 | loss 3.656263 (+0.58z)| norm 0.2672 (-0.53z)| lr 5.54e-04 | 322.25 ms | 52.4% bf16 MFU | 1624234 tok/s
step 4055/19560 | loss 3.651044 (+0.43z)| norm 0.2782 (-0.08z)| lr 5.54e-04 | 322.58 ms | 52.3% bf16 MFU | 1624287 tok/s
step 4056/19560 | loss 3.664264 (+0.77z)| norm 0.3634 (+3.35z)| lr 5.54e-04 | 322.81 ms | 52.3% bf16 MFU | 1624280 tok/s
step 4057/19560 | loss 3.571424 (-1.68z)| norm 0.3098 (+1.16z)| lr 5.54e-04 | 322.68 ms | 52.3% bf16 MFU | 1624306 tok/s
step 4058/19560 | loss 3.644574 (+0.25z)| norm 0.3031 (+0.87z)| lr 5.54e-04 | 323.11 ms | 52.2% bf16 MFU | 1624221 tok/s
step 4059/19560 | loss 3.642491 (+0.21z)| norm 0.2965 (+0.59z)| lr 5.54e-04 | 323.11 ms | 52.2% bf16 MFU | 1624140 tok/s
step 4060/19560 | loss 3.653914 (+0.52z)| norm 0.2872 (+0.19z)| lr 5.54e-04 | 322.77 ms | 52.3% bf16 MFU | 1624150 tok/s
step 4061/19560 | loss 3.618897 (-0.43z)| norm 0.2947 (+0.49z)| lr 5.54e-04 | 323.06 ms | 52.2% bf16 MFU | 1624086 tok/s
step 4062/19560 | loss 3.627296 (-0.20z)| norm 0.2899 (+0.28z)| lr 5.54e-04 | 323.08 ms | 52.2% bf16 MFU | 1624021 tok/s
step 4063/19560 | loss 3.608376 (-0.69z)| norm 0.2725 (-0.45z)| lr 5.54e-04 | 322.34 ms | 52.4% bf16 MFU | 1624144 tok/s
step 4064/19560 | loss 3.590637 (-1.15z)| norm 0.2811 (-0.09z)| lr 5.54e-04 | 323.32 ms | 52.2% bf16 MFU | 1624016 tok/s
step 4065/19560 | loss 3.580518 (-1.42z)| norm 0.2655 (-0.74z)| lr 5.54e-04 | 322.09 ms | 52.4% bf16 MFU | 1624204 tok/s
step 4066/19560 | loss 3.727915 (+2.46z)| norm 0.3258 (+1.77z)| lr 5.54e-04 | 322.94 ms | 52.3% bf16 MFU | 1624166 tok/s
step 4067/19560 | loss 3.679946 (+1.18z)| norm 0.2959 (+0.51z)| lr 5.54e-04 | 323.07 ms | 52.2% bf16 MFU | 1624098 tok/s
step 4068/19560 | loss 3.588843 (-1.18z)| norm 0.2962 (+0.54z)| lr 5.54e-04 | 322.54 ms | 52.3% bf16 MFU | 1624169 tok/s
step 4069/19560 | loss 3.629129 (-0.14z)| norm 0.2968 (+0.57z)| lr 5.54e-04 | 323.10 ms | 52.2% bf16 MFU | 1624096 tok/s
step 4070/19560 | loss 3.626384 (-0.21z)| norm 0.3028 (+0.81z)| lr 5.54e-04 | 322.80 ms | 52.3% bf16 MFU | 1624101 tok/s
step 4071/19560 | loss 3.702760 (+1.74z)| norm 0.3383 (+2.25z)| lr 5.54e-04 | 322.71 ms | 52.3% bf16 MFU | 1624127 tok/s
step 4072/19560 | loss 3.549523 (-2.15z)| norm 0.2866 (+0.10z)| lr 5.54e-04 | 323.04 ms | 52.2% bf16 MFU | 1624070 tok/s
step 4073/19560 | loss 3.563224 (-1.78z)| norm 0.2937 (+0.38z)| lr 5.54e-04 | 323.29 ms | 52.2% bf16 MFU | 1623954 tok/s
step 4074/19560 | loss 3.646889 (+0.32z)| norm 0.2792 (-0.22z)| lr 5.54e-04 | 322.70 ms | 52.3% bf16 MFU | 1623990 tok/s
step 4075/19560 | loss 3.608414 (-0.63z)| norm 0.2676 (-0.71z)| lr 5.54e-04 | 322.61 ms | 52.3% bf16 MFU | 1624046 tok/s
step 4076/19560 | loss 3.604089 (-0.73z)| norm 0.2779 (-0.28z)| lr 5.54e-04 | 322.58 ms | 52.3% bf16 MFU | 1624108 tok/s
step 4077/19560 | loss 3.597394 (-0.93z)| norm 0.2933 (+0.39z)| lr 5.54e-04 | 323.14 ms | 52.2% bf16 MFU | 1624027 tok/s
step 4078/19560 | loss 3.612101 (-0.54z)| norm 0.2720 (-0.51z)| lr 5.54e-04 | 322.44 ms | 52.3% bf16 MFU | 1624125 tok/s
step 4079/19560 | loss 3.581243 (-1.32z)| norm 0.2474 (-1.54z)| lr 5.54e-04 | 322.58 ms | 52.3% bf16 MFU | 1624183 tok/s
step 4080/19560 | loss 3.630858 (-0.04z)| norm 0.2585 (-1.05z)| lr 5.54e-04 | 322.94 ms | 52.3% bf16 MFU | 1624148 tok/s
step 4081/19560 | loss 3.624035 (-0.21z)| norm 0.2609 (-0.93z)| lr 5.54e-04 | 322.36 ms | 52.4% bf16 MFU | 1624261 tok/s
step 4082/19560 | loss 3.597451 (-0.89z)| norm 0.2489 (-1.43z)| lr 5.54e-04 | 323.08 ms | 52.2% bf16 MFU | 1624188 tok/s
step 4083/19560 | loss 3.639673 (+0.19z)| norm 0.2660 (-0.70z)| lr 5.54e-04 | 322.52 ms | 52.3% bf16 MFU | 1624259 tok/s
step 4084/19560 | loss 3.626272 (-0.15z)| norm 0.2638 (-0.79z)| lr 5.54e-04 | 322.65 ms | 52.3% bf16 MFU | 1624293 tok/s
step 4085/19560 | loss 3.641899 (+0.24z)| norm 0.2809 (-0.08z)| lr 5.54e-04 | 322.88 ms | 52.3% bf16 MFU | 1624269 tok/s
step 4086/19560 | loss 3.630301 (-0.06z)| norm 0.2863 (+0.14z)| lr 5.54e-04 | 322.86 ms | 52.3% bf16 MFU | 1624250 tok/s
step 4087/19560 | loss 3.638008 (+0.13z)| norm 0.2599 (-0.99z)| lr 5.54e-04 | 322.44 ms | 52.3% bf16 MFU | 1624337 tok/s
step 4088/19560 | loss 3.568530 (-1.64z)| norm 0.2638 (-0.83z)| lr 5.54e-04 | 322.50 ms | 52.3% bf16 MFU | 1624404 tok/s
step 4089/19560 | loss 3.585500 (-1.18z)| norm 0.2721 (-0.47z)| lr 5.53e-04 | 323.50 ms | 52.2% bf16 MFU | 1624218 tok/s
step 4090/19560 | loss 3.594686 (-0.94z)| norm 0.2594 (-1.01z)| lr 5.53e-04 | 322.56 ms | 52.3% bf16 MFU | 1624278 tok/s
step 4091/19560 | loss 3.649438 (+0.45z)| norm 0.2629 (-0.85z)| lr 5.53e-04 | 322.63 ms | 52.3% bf16 MFU | 1624316 tok/s
step 4092/19560 | loss 3.605963 (-0.65z)| norm 0.2731 (-0.42z)| lr 5.53e-04 | 323.18 ms | 52.2% bf16 MFU | 1624214 tok/s
step 4093/19560 | loss 3.589080 (-1.08z)| norm 0.2532 (-1.26z)| lr 5.53e-04 | 322.62 ms | 52.3% bf16 MFU | 1624258 tok/s
step 4094/19560 | loss 3.612295 (-0.49z)| norm 0.2507 (-1.36z)| lr 5.53e-04 | 322.41 ms | 52.3% bf16 MFU | 1624353 tok/s
step 4095/19560 | loss 3.616704 (-0.38z)| norm 0.2703 (-0.52z)| lr 5.53e-04 | 322.88 ms | 52.3% bf16 MFU | 1624324 tok/s
step 4096/19560 | loss 3.634635 (+0.09z)| norm 0.2627 (-0.85z)| lr 5.53e-04 | 323.06 ms | 52.2% bf16 MFU | 1624252 tok/s
step 4097/19560 | loss 3.609053 (-0.56z)| norm 0.2419 (-1.71z)| lr 5.53e-04 | 323.14 ms | 52.2% bf16 MFU | 1624162 tok/s
step 4098/19560 | loss 3.557957 (-1.86z)| norm 0.2742 (-0.35z)| lr 5.53e-04 | 322.88 ms | 52.3% bf16 MFU | 1624145 tok/s
step 4099/19560 | loss 3.572682 (-1.45z)| norm 0.2359 (-1.92z)| lr 5.53e-04 | 323.50 ms | 52.2% bf16 MFU | 1623971 tok/s
step 4100/19560 | loss 3.604041 (-0.63z)| norm 0.2469 (-1.43z)| lr 5.53e-04 | 322.42 ms | 52.3% bf16 MFU | 1624079 tok/s
step 4101/19560 | loss 3.579966 (-1.24z)| norm 0.2688 (-0.52z)| lr 5.53e-04 | 322.70 ms | 52.3% bf16 MFU | 1624109 tok/s
step 4102/19560 | loss 3.605384 (-0.57z)| norm 0.2736 (-0.33z)| lr 5.53e-04 | 322.92 ms | 52.3% bf16 MFU | 1624082 tok/s
step 4103/19560 | loss 3.641677 (+0.37z)| norm 0.3008 (+0.80z)| lr 5.53e-04 | 322.42 ms | 52.3% bf16 MFU | 1624183 tok/s
step 4104/19560 | loss 3.587112 (-1.04z)| norm 0.2789 (-0.11z)| lr 5.53e-04 | 322.92 ms | 52.3% bf16 MFU | 1624153 tok/s
step 4105/19560 | loss 3.593425 (-0.86z)| norm 0.2579 (-0.97z)| lr 5.53e-04 | 322.77 ms | 52.3% bf16 MFU | 1624163 tok/s
step 4106/19560 | loss 3.634370 (+0.19z)| norm 0.2990 (+0.75z)| lr 5.53e-04 | 322.47 ms | 52.3% bf16 MFU | 1624247 tok/s
step 4107/19560 | loss 3.632100 (+0.13z)| norm 0.3096 (+1.20z)| lr 5.53e-04 | 322.78 ms | 52.3% bf16 MFU | 1624250 tok/s
step 4108/19560 | loss 3.646509 (+0.52z)| norm 0.2906 (+0.40z)| lr 5.53e-04 | 323.11 ms | 52.2% bf16 MFU | 1624169 tok/s
step 4109/19560 | loss 3.511515 (-2.93z)| norm 0.2697 (-0.47z)| lr 5.53e-04 | 322.76 ms | 52.3% bf16 MFU | 1624179 tok/s
step 4110/19560 | loss 3.597666 (-0.72z)| norm 0.2470 (-1.40z)| lr 5.53e-04 | 322.59 ms | 52.3% bf16 MFU | 1624234 tok/s
step 4111/19560 | loss 3.578649 (-1.19z)| norm 0.2943 (+0.62z)| lr 5.53e-04 | 323.01 ms | 52.2% bf16 MFU | 1624179 tok/s
step 4112/19560 | loss 3.559188 (-1.65z)| norm 0.2547 (-1.06z)| lr 5.53e-04 | 322.93 ms | 52.3% bf16 MFU | 1624148 tok/s
step 4113/19560 | loss 3.516365 (-2.63z)| norm 0.2969 (+0.74z)| lr 5.53e-04 | 323.24 ms | 52.2% bf16 MFU | 1624040 tok/s
step 4114/19560 | loss 3.614078 (-0.25z)| norm 0.3136 (+1.43z)| lr 5.53e-04 | 322.80 ms | 52.3% bf16 MFU | 1624048 tok/s
step 4115/19560 | loss 3.643498 (+0.47z)| norm 0.2928 (+0.54z)| lr 5.53e-04 | 322.55 ms | 52.3% bf16 MFU | 1624118 tok/s
step 4116/19560 | loss 3.813227 (+4.37z)| norm 0.2974 (+0.73z)| lr 5.53e-04 | 322.70 ms | 52.3% bf16 MFU | 1624146 tok/s
step 4117/19560 | loss 3.583923 (-0.93z)| norm 0.3049 (+1.03z)| lr 5.53e-04 | 322.74 ms | 52.3% bf16 MFU | 1624163 tok/s
step 4118/19560 | loss 3.550241 (-1.70z)| norm 0.2954 (+0.62z)| lr 5.53e-04 | 322.83 ms | 52.3% bf16 MFU | 1624156 tok/s
step 4119/19560 | loss 3.629765 (+0.14z)| norm 0.3321 (+2.12z)| lr 5.53e-04 | 322.59 ms | 52.3% bf16 MFU | 1624212 tok/s
step 4120/19560 | loss 3.624552 (+0.02z)| norm 0.2791 (-0.09z)| lr 5.53e-04 | 322.39 ms | 52.3% bf16 MFU | 1624313 tok/s
step 4121/19560 | loss 3.607483 (-0.37z)| norm 0.2877 (+0.27z)| lr 5.53e-04 | 322.21 ms | 52.4% bf16 MFU | 1624457 tok/s
step 4122/19560 | loss 3.599349 (-0.55z)| norm 0.2958 (+0.61z)| lr 5.53e-04 | 322.91 ms | 52.3% bf16 MFU | 1624416 tok/s
step 4123/19560 | loss 3.569139 (-1.25z)| norm 0.2762 (-0.20z)| lr 5.53e-04 | 322.42 ms | 52.3% bf16 MFU | 1624500 tok/s
step 4124/19560 | loss 3.611614 (-0.24z)| norm 0.2638 (-0.72z)| lr 5.53e-04 | 323.10 ms | 52.2% bf16 MFU | 1624408 tok/s
step 4125/19560 | loss 3.563298 (-1.38z)| norm 0.2702 (-0.43z)| lr 5.53e-04 | 322.92 ms | 52.3% bf16 MFU | 1624368 tok/s
step 4126/19560 | loss 3.587513 (-0.79z)| norm 0.2755 (-0.19z)| lr 5.52e-04 | 322.26 ms | 52.4% bf16 MFU | 1624494 tok/s
step 4127/19560 | loss 3.593126 (-0.64z)| norm 0.2853 (+0.24z)| lr 5.52e-04 | 323.09 ms | 52.2% bf16 MFU | 1624405 tok/s
step 4128/19560 | loss 3.687693 (+1.62z)| norm 0.2788 (-0.04z)| lr 5.52e-04 | 322.82 ms | 52.3% bf16 MFU | 1624388 tok/s
step 4129/19560 | loss 3.573114 (-1.11z)| norm 0.2681 (-0.50z)| lr 5.52e-04 | 323.08 ms | 52.2% bf16 MFU | 1624307 tok/s
step 4130/19560 | loss 3.703206 (+1.96z)| norm 0.2942 (+0.64z)| lr 5.52e-04 | 322.56 ms | 52.3% bf16 MFU | 1624361 tok/s
step 4131/19560 | loss 3.726842 (+2.44z)| norm 0.3054 (+1.12z)| lr 5.52e-04 | 322.35 ms | 52.4% bf16 MFU | 1624465 tok/s
step 4132/19560 | loss 3.580533 (-0.93z)| norm 0.2941 (+0.62z)| lr 5.52e-04 | 323.10 ms | 52.2% bf16 MFU | 1624376 tok/s
step 4133/19560 | loss 3.583084 (-0.87z)| norm 0.2588 (-0.91z)| lr 5.52e-04 | 323.74 ms | 52.1% bf16 MFU | 1624132 tok/s
step 4134/19560 | loss 3.584373 (-0.82z)| norm 0.2694 (-0.44z)| lr 5.52e-04 | 322.63 ms | 52.3% bf16 MFU | 1624177 tok/s
step 4135/19560 | loss 3.681573 (+1.38z)| norm 0.2756 (-0.19z)| lr 5.52e-04 | 322.44 ms | 52.3% bf16 MFU | 1624268 tok/s
step 4136/19560 | loss 3.662638 (+0.95z)| norm 0.2739 (-0.26z)| lr 5.52e-04 | 323.07 ms | 52.2% bf16 MFU | 1624197 tok/s
step 4137/19560 | loss 3.628286 (+0.16z)| norm 0.2699 (-0.44z)| lr 5.52e-04 | 323.06 ms | 52.2% bf16 MFU | 1624132 tok/s
step 4138/19560 | loss 3.608376 (-0.29z)| norm 0.2631 (-0.74z)| lr 5.52e-04 | 322.48 ms | 52.3% bf16 MFU | 1624214 tok/s
step 4139/19560 | loss 3.649589 (+0.64z)| norm 0.2695 (-0.47z)| lr 5.52e-04 | 322.70 ms | 52.3% bf16 MFU | 1624239 tok/s
step 4140/19560 | loss 3.640210 (+0.42z)| norm 0.2719 (-0.37z)| lr 5.52e-04 | 322.93 ms | 52.3% bf16 MFU | 1624204 tok/s
step 4141/19560 | loss 3.704181 (+1.83z)| norm 0.2737 (-0.30z)| lr 5.52e-04 | 323.31 ms | 52.2% bf16 MFU | 1624075 tok/s
step 4142/19560 | loss 3.573242 (-1.10z)| norm 0.2548 (-1.13z)| lr 5.52e-04 | 322.99 ms | 52.3% bf16 MFU | 1624033 tok/s
step 4143/19560 | loss 3.639866 (+0.42z)| norm 0.2686 (-0.50z)| lr 5.52e-04 | 322.78 ms | 52.3% bf16 MFU | 1624045 tok/s
step 4144/19560 | loss 3.598634 (-0.51z)| norm 0.2962 (+0.74z)| lr 5.52e-04 | 322.91 ms | 52.3% bf16 MFU | 1624024 tok/s
step 4145/19560 | loss 3.597631 (-0.53z)| norm 0.2870 (+0.32z)| lr 5.52e-04 | 323.67 ms | 52.1% bf16 MFU | 1623814 tok/s
step 4146/19560 | loss 3.564604 (-1.28z)| norm 0.2580 (-1.00z)| lr 5.52e-04 | 322.03 ms | 52.4% bf16 MFU | 1624027 tok/s
step 4147/19560 | loss 3.643326 (+0.51z)| norm 0.3021 (+0.99z)| lr 5.52e-04 | 322.89 ms | 52.3% bf16 MFU | 1624012 tok/s
step 4148/19560 | loss 3.564430 (-1.26z)| norm 0.2838 (+0.15z)| lr 5.52e-04 | 323.11 ms | 52.2% bf16 MFU | 1623941 tok/s
step 4149/19560 | loss 3.641378 (+0.46z)| norm 0.2611 (-0.91z)| lr 5.52e-04 | 322.84 ms | 52.3% bf16 MFU | 1623944 tok/s
step 4150/19560 | loss 3.544142 (-1.69z)| norm 0.2869 (+0.27z)| lr 5.52e-04 | 322.82 ms | 52.3% bf16 MFU | 1623952 tok/s
step 4151/19560 | loss 3.581851 (-0.83z)| norm 0.2795 (-0.08z)| lr 5.52e-04 | 322.78 ms | 52.3% bf16 MFU | 1623968 tok/s
step 4152/19560 | loss 3.638568 (+0.44z)| norm 0.2725 (-0.40z)| lr 5.52e-04 | 322.84 ms | 52.3% bf16 MFU | 1623970 tok/s
step 4153/19560 | loss 3.532677 (-1.89z)| norm 0.2742 (-0.33z)| lr 5.52e-04 | 322.75 ms | 52.3% bf16 MFU | 1623993 tok/s
step 4154/19560 | loss 3.601058 (-0.37z)| norm 0.2769 (-0.19z)| lr 5.52e-04 | 323.06 ms | 52.2% bf16 MFU | 1623938 tok/s
step 4155/19560 | loss 3.650210 (+0.71z)| norm 0.2992 (+0.85z)| lr 5.52e-04 | 323.24 ms | 52.2% bf16 MFU | 1623840 tok/s
step 4156/19560 | loss 3.593151 (-0.56z)| norm 0.3412 (+2.71z)| lr 5.52e-04 | 322.54 ms | 52.3% bf16 MFU | 1623923 tok/s
step 4157/19560 | loss 3.554747 (-1.38z)| norm 0.2733 (-0.39z)| lr 5.52e-04 | 323.13 ms | 52.2% bf16 MFU | 1623853 tok/s
step 4158/19560 | loss 3.647127 (+0.66z)| norm 0.2837 (+0.09z)| lr 5.52e-04 | 322.93 ms | 52.3% bf16 MFU | 1623838 tok/s
step 4159/19560 | loss 3.644315 (+0.60z)| norm 0.2691 (-0.56z)| lr 5.52e-04 | 322.64 ms | 52.3% bf16 MFU | 1623896 tok/s
step 4160/19560 | loss 3.560921 (-1.23z)| norm 0.2448 (-1.66z)| lr 5.52e-04 | 322.69 ms | 52.3% bf16 MFU | 1623938 tok/s
step 4161/19560 | loss 3.572039 (-0.97z)| norm 0.2661 (-0.67z)| lr 5.52e-04 | 322.91 ms | 52.3% bf16 MFU | 1623923 tok/s
step 4162/19560 | loss 3.627092 (+0.26z)| norm 0.2523 (-1.29z)| lr 5.52e-04 | 323.03 ms | 52.2% bf16 MFU | 1623879 tok/s
step 4163/19560 | loss 3.744982 (+2.79z)| norm 0.2740 (-0.28z)| lr 5.51e-04 | 323.04 ms | 52.2% bf16 MFU | 1623834 tok/s
step 4164/19560 | loss 3.638915 (+0.48z)| norm 0.2573 (-1.04z)| lr 5.51e-04 | 322.85 ms | 52.3% bf16 MFU | 1623839 tok/s
step 4165/19560 | loss 3.601285 (-0.33z)| norm 0.2874 (+0.37z)| lr 5.51e-04 | 322.97 ms | 52.3% bf16 MFU | 1623813 tok/s
step 4166/19560 | loss 3.597833 (-0.41z)| norm 0.2493 (-1.39z)| lr 5.51e-04 | 323.00 ms | 52.3% bf16 MFU | 1623782 tok/s
step 4167/19560 | loss 3.589050 (-0.59z)| norm 0.2794 (+0.02z)| lr 5.51e-04 | 323.22 ms | 52.2% bf16 MFU | 1623697 tok/s
step 4168/19560 | loss 3.627823 (+0.26z)| norm 0.2817 (+0.13z)| lr 5.51e-04 | 323.10 ms | 52.2% bf16 MFU | 1623645 tok/s
step 4169/19560 | loss 3.666702 (+1.10z)| norm 0.2396 (-1.81z)| lr 5.51e-04 | 322.84 ms | 52.3% bf16 MFU | 1623661 tok/s
step 4170/19560 | loss 3.612550 (-0.07z)| norm 0.2746 (-0.18z)| lr 5.51e-04 | 322.73 ms | 52.3% bf16 MFU | 1623706 tok/s
step 4171/19560 | loss 3.636220 (+0.43z)| norm 0.2834 (+0.23z)| lr 5.51e-04 | 322.76 ms | 52.3% bf16 MFU | 1623741 tok/s
step 4172/19560 | loss 3.704060 (+1.87z)| norm 0.2916 (+0.62z)| lr 5.51e-04 | 323.48 ms | 52.2% bf16 MFU | 1623592 tok/s
step 4173/19560 | loss 3.598729 (-0.41z)| norm 0.2903 (+0.55z)| lr 5.51e-04 | 323.98 ms | 52.1% bf16 MFU | 1623326 tok/s
step 4174/19560 | loss 3.666920 (+1.05z)| norm 0.2686 (-0.48z)| lr 5.51e-04 | 322.20 ms | 52.4% bf16 MFU | 1623522 tok/s
step 4175/19560 | loss 3.627440 (+0.20z)| norm 0.3054 (+1.25z)| lr 5.51e-04 | 323.16 ms | 52.2% bf16 MFU | 1623464 tok/s
step 4176/19560 | loss 3.558077 (-1.29z)| norm 0.3142 (+1.63z)| lr 5.51e-04 | 323.08 ms | 52.2% bf16 MFU | 1623428 tok/s
step 4177/19560 | loss 3.660540 (+0.94z)| norm 0.3146 (+1.63z)| lr 5.51e-04 | 322.77 ms | 52.3% bf16 MFU | 1623473 tok/s
step 4178/19560 | loss 3.563473 (-1.16z)| norm 0.2900 (+0.47z)| lr 5.51e-04 | 323.11 ms | 52.2% bf16 MFU | 1623431 tok/s
step 4179/19560 | loss 3.632526 (+0.35z)| norm 0.2794 (-0.01z)| lr 5.51e-04 | 322.64 ms | 52.3% bf16 MFU | 1623510 tok/s
step 4180/19560 | loss 3.674088 (+1.25z)| norm 0.2890 (+0.44z)| lr 5.51e-04 | 322.82 ms | 52.3% bf16 MFU | 1623539 tok/s
step 4181/19560 | loss 3.567482 (-1.06z)| norm 0.2827 (+0.13z)| lr 5.51e-04 | 323.37 ms | 52.2% bf16 MFU | 1623428 tok/s
step 4182/19560 | loss 3.595584 (-0.44z)| norm 0.2962 (+0.76z)| lr 5.51e-04 | 322.76 ms | 52.3% bf16 MFU | 1623477 tok/s
step 4183/19560 | loss 3.643428 (+0.60z)| norm 0.2891 (+0.42z)| lr 5.51e-04 | 322.62 ms | 52.3% bf16 MFU | 1623557 tok/s
step 4184/19560 | loss 3.615085 (-0.01z)| norm 0.2909 (+0.56z)| lr 5.51e-04 | 323.01 ms | 52.2% bf16 MFU | 1623535 tok/s
step 4185/19560 | loss 3.576340 (-0.85z)| norm 0.2546 (-1.25z)| lr 5.51e-04 | 323.11 ms | 52.2% bf16 MFU | 1623491 tok/s
step 4186/19560 | loss 3.644481 (+0.63z)| norm 0.2705 (-0.43z)| lr 5.51e-04 | 322.92 ms | 52.3% bf16 MFU | 1623495 tok/s
step 4187/19560 | loss 3.574047 (-0.89z)| norm 0.2869 (+0.41z)| lr 5.51e-04 | 322.72 ms | 52.3% bf16 MFU | 1623550 tok/s
step 4188/19560 | loss 3.622020 (+0.16z)| norm 0.2436 (-1.76z)| lr 5.51e-04 | 322.96 ms | 52.3% bf16 MFU | 1623541 tok/s
step 4189/19560 | loss 3.571959 (-0.92z)| norm 0.2899 (+0.57z)| lr 5.51e-04 | 323.08 ms | 52.2% bf16 MFU | 1623502 tok/s
step 4190/19560 | loss 3.555382 (-1.26z)| norm 0.2695 (-0.45z)| lr 5.51e-04 | 322.97 ms | 52.3% bf16 MFU | 1623493 tok/s
step 4191/19560 | loss 3.630452 (+0.36z)| norm 0.2668 (-0.59z)| lr 5.51e-04 | 323.14 ms | 52.2% bf16 MFU | 1623442 tok/s
step 4192/19560 | loss 3.601705 (-0.27z)| norm 0.2674 (-0.54z)| lr 5.51e-04 | 323.11 ms | 52.2% bf16 MFU | 1623403 tok/s
step 4193/19560 | loss 3.692235 (+1.66z)| norm 0.2803 (+0.10z)| lr 5.51e-04 | 323.90 ms | 52.1% bf16 MFU | 1623166 tok/s
step 4194/19560 | loss 3.597083 (-0.37z)| norm 0.2724 (-0.29z)| lr 5.51e-04 | 322.93 ms | 52.3% bf16 MFU | 1623184 tok/s
step 4195/19560 | loss 3.709996 (+2.09z)| norm 0.2763 (-0.08z)| lr 5.51e-04 | 323.42 ms | 52.2% bf16 MFU | 1623080 tok/s
step 4196/19560 | loss 3.652454 (+0.82z)| norm 0.2667 (-0.56z)| lr 5.51e-04 | 323.65 ms | 52.1% bf16 MFU | 1622923 tok/s
step 4197/19560 | loss 3.582052 (-0.70z)| norm 0.2667 (-0.55z)| lr 5.51e-04 | 323.05 ms | 52.2% bf16 MFU | 1622922 tok/s
step 4198/19560 | loss 3.628586 (+0.31z)| norm 0.2564 (-1.07z)| lr 5.51e-04 | 324.29 ms | 52.0% bf16 MFU | 1622613 tok/s
step 4199/19560 | loss 3.596868 (-0.36z)| norm 0.2727 (-0.20z)| lr 5.50e-04 | 322.84 ms | 52.3% bf16 MFU | 1622683 tok/s
step 4200/19560 | loss 3.611749 (-0.05z)| norm 0.2428 (-1.79z)| lr 5.50e-04 | 322.37 ms | 52.4% bf16 MFU | 1622866 tok/s
step 4201/19560 | loss 3.562517 (-1.14z)| norm 0.2606 (-0.82z)| lr 5.50e-04 | 323.14 ms | 52.2% bf16 MFU | 1622847 tok/s
step 4202/19560 | loss 3.523859 (-1.95z)| norm 0.2811 (+0.28z)| lr 5.50e-04 | 323.95 ms | 52.1% bf16 MFU | 1622625 tok/s
step 4203/19560 | loss 3.606107 (-0.15z)| norm 0.2812 (+0.28z)| lr 5.50e-04 | 323.31 ms | 52.2% bf16 MFU | 1622574 tok/s
step 4204/19560 | loss 3.554815 (-1.26z)| norm 0.2573 (-0.99z)| lr 5.50e-04 | 323.16 ms | 52.2% bf16 MFU | 1622564 tok/s
step 4205/19560 | loss 3.613616 (+0.02z)| norm 0.2582 (-0.93z)| lr 5.50e-04 | 323.12 ms | 52.2% bf16 MFU | 1622565 tok/s
step 4206/19560 | loss 3.623657 (+0.24z)| norm 0.2805 (+0.26z)| lr 5.50e-04 | 324.08 ms | 52.1% bf16 MFU | 1622325 tok/s
step 4207/19560 | loss 3.655248 (+0.91z)| norm 0.2869 (+0.59z)| lr 5.50e-04 | 322.80 ms | 52.3% bf16 MFU | 1622417 tok/s
step 4208/19560 | loss 3.626151 (+0.28z)| norm 0.3135 (+1.98z)| lr 5.50e-04 | 322.81 ms | 52.3% bf16 MFU | 1622503 tok/s
step 4209/19560 | loss 3.584778 (-0.61z)| norm 0.3019 (+1.34z)| lr 5.50e-04 | 323.57 ms | 52.2% bf16 MFU | 1622394 tok/s
step 4210/19560 | loss 3.585268 (-0.60z)| norm 0.2965 (+1.04z)| lr 5.50e-04 | 323.02 ms | 52.2% bf16 MFU | 1622429 tok/s
step 4211/19560 | loss 3.576159 (-0.78z)| norm 0.2615 (-0.82z)| lr 5.50e-04 | 322.83 ms | 52.3% bf16 MFU | 1622509 tok/s
step 4212/19560 | loss 3.577580 (-0.74z)| norm 0.2660 (-0.59z)| lr 5.50e-04 | 323.59 ms | 52.2% bf16 MFU | 1622395 tok/s
step 4213/19560 | loss 3.566657 (-0.97z)| norm 0.2728 (-0.22z)| lr 5.50e-04 | 323.31 ms | 52.2% bf16 MFU | 1622356 tok/s
step 4214/19560 | loss 3.522942 (-1.87z)| norm 0.2533 (-1.24z)| lr 5.50e-04 | 323.34 ms | 52.2% bf16 MFU | 1622312 tok/s
step 4215/19560 | loss 3.546978 (-1.33z)| norm 0.2836 (+0.35z)| lr 5.50e-04 | 323.29 ms | 52.2% bf16 MFU | 1622283 tok/s
step 4216/19560 | loss 3.575742 (-0.72z)| norm 0.3001 (+1.21z)| lr 5.50e-04 | 323.50 ms | 52.2% bf16 MFU | 1622201 tok/s
step 4217/19560 | loss 3.620598 (+0.22z)| norm 0.2959 (+0.98z)| lr 5.50e-04 | 323.82 ms | 52.1% bf16 MFU | 1622044 tok/s
step 4218/19560 | loss 3.603521 (-0.14z)| norm 0.2851 (+0.40z)| lr 5.50e-04 | 322.72 ms | 52.3% bf16 MFU | 1622172 tok/s
step 4219/19560 | loss 3.600808 (-0.19z)| norm 0.2722 (-0.29z)| lr 5.50e-04 | 322.83 ms | 52.3% bf16 MFU | 1622265 tok/s
step 4220/19560 | loss 3.566720 (-0.91z)| norm 0.2919 (+0.74z)| lr 5.50e-04 | 323.40 ms | 52.2% bf16 MFU | 1622210 tok/s
step 4221/19560 | loss 3.711264 (+2.10z)| norm 0.3233 (+2.34z)| lr 5.50e-04 | 323.11 ms | 52.2% bf16 MFU | 1622232 tok/s
step 4222/19560 | loss 3.600213 (-0.21z)| norm 0.3254 (+2.39z)| lr 5.50e-04 | 323.21 ms | 52.2% bf16 MFU | 1622225 tok/s
step 4223/19560 | loss 3.547559 (-1.29z)| norm 0.3079 (+1.46z)| lr 5.50e-04 | 322.84 ms | 52.3% bf16 MFU | 1622312 tok/s
step 4224/19560 | loss 3.629761 (+0.41z)| norm 0.3244 (+2.24z)| lr 5.50e-04 | 322.88 ms | 52.3% bf16 MFU | 1622387 tok/s
step 4225/19560 | loss 3.586833 (-0.47z)| norm 0.2967 (+0.84z)| lr 5.50e-04 | 323.30 ms | 52.2% bf16 MFU | 1622352 tok/s
step 4226/19560 | loss 3.539608 (-1.44z)| norm 0.2739 (-0.31z)| lr 5.50e-04 | 322.45 ms | 52.3% bf16 MFU | 1622531 tok/s
step 4227/19560 | loss 3.600838 (-0.18z)| norm 0.2671 (-0.68z)| lr 5.50e-04 | 323.46 ms | 52.2% bf16 MFU | 1622447 tok/s
step 4228/19560 | loss 3.697851 (+1.78z)| norm 0.2971 (+0.85z)| lr 5.50e-04 | 323.10 ms | 52.2% bf16 MFU | 1622460 tok/s
step 4229/19560 | loss 3.636356 (+0.52z)| norm 0.2607 (-1.04z)| lr 5.50e-04 | 322.84 ms | 52.3% bf16 MFU | 1622536 tok/s
step 4230/19560 | loss 3.676650 (+1.32z)| norm 0.2742 (-0.34z)| lr 5.50e-04 | 322.69 ms | 52.3% bf16 MFU | 1622646 tok/s
step 4231/19560 | loss 3.611029 (-0.00z)| norm 0.2652 (-0.79z)| lr 5.50e-04 | 323.46 ms | 52.2% bf16 MFU | 1622557 tok/s
step 4232/19560 | loss 3.525924 (-1.70z)| norm 0.2778 (-0.14z)| lr 5.50e-04 | 322.93 ms | 52.3% bf16 MFU | 1622605 tok/s
step 4233/19560 | loss 3.610830 (-0.00z)| norm 0.2677 (-0.67z)| lr 5.50e-04 | 323.10 ms | 52.2% bf16 MFU | 1622608 tok/s
step 4234/19560 | loss 3.523885 (-1.71z)| norm 0.9589 (+10.73z)| lr 5.50e-04 | 322.41 ms | 52.3% bf16 MFU | 1622786 tok/s
step 4235/19560 | loss 3.665288 (+1.09z)| norm 0.3516 (+1.04z)| lr 5.50e-04 | 322.79 ms | 52.3% bf16 MFU | 1622859 tok/s
step 4236/19560 | loss 3.707662 (+1.90z)| norm 0.3345 (+0.76z)| lr 5.49e-04 | 323.25 ms | 52.2% bf16 MFU | 1622811 tok/s
step 4237/19560 | loss 3.547343 (-1.26z)| norm 0.3123 (+0.41z)| lr 5.49e-04 | 323.06 ms | 52.2% bf16 MFU | 1622815 tok/s
step 4238/19560 | loss 3.579733 (-0.61z)| norm 0.2893 (+0.04z)| lr 5.49e-04 | 322.26 ms | 52.4% bf16 MFU | 1623019 tok/s
step 4239/19560 | loss 3.590334 (-0.41z)| norm 0.3225 (+0.56z)| lr 5.49e-04 | 322.57 ms | 52.3% bf16 MFU | 1623136 tok/s
step 4240/19560 | loss 3.619441 (+0.16z)| norm 0.2887 (+0.02z)| lr 5.49e-04 | 323.68 ms | 52.1% bf16 MFU | 1622968 tok/s
step 4241/19560 | loss 3.639861 (+0.55z)| norm 0.2837 (-0.06z)| lr 5.49e-04 | 322.27 ms | 52.4% bf16 MFU | 1623162 tok/s
step 4242/19560 | loss 3.650142 (+0.75z)| norm 0.2769 (-0.16z)| lr 5.49e-04 | 322.68 ms | 52.3% bf16 MFU | 1623244 tok/s
step 4243/19560 | loss 3.678531 (+1.31z)| norm 0.2791 (-0.12z)| lr 5.49e-04 | 323.96 ms | 52.1% bf16 MFU | 1623000 tok/s
step 4244/19560 | loss 3.586909 (-0.52z)| norm 0.2927 (+0.09z)| lr 5.49e-04 | 322.32 ms | 52.4% bf16 MFU | 1623180 tok/s
step 4245/19560 | loss 3.597649 (-0.29z)| norm 0.2908 (+0.06z)| lr 5.49e-04 | 323.20 ms | 52.2% bf16 MFU | 1623130 tok/s
step 4246/19560 | loss 3.683958 (+1.52z)| norm 0.2891 (+0.04z)| lr 5.49e-04 | 323.00 ms | 52.3% bf16 MFU | 1623134 tok/s
step 4247/19560 | loss 3.609339 (-0.06z)| norm 0.2636 (-0.36z)| lr 5.49e-04 | 322.41 ms | 52.3% bf16 MFU | 1623285 tok/s
step 4248/19560 | loss 3.606239 (-0.12z)| norm 0.2791 (-0.11z)| lr 5.49e-04 | 322.53 ms | 52.3% bf16 MFU | 1623398 tok/s
step 4249/19560 | loss 3.612806 (+0.02z)| norm 0.2877 (+0.02z)| lr 5.49e-04 | 322.86 ms | 52.3% bf16 MFU | 1623422 tok/s
step 4250/19560 | loss 3.643720 (+0.66z)| norm 0.3073 (+0.33z)| lr 5.49e-04 | 323.10 ms | 52.2% bf16 MFU | 1623386 tok/s
val loss 3.606812
evaluating HellaSwag: 0/79
evaluating HellaSwag: 10/79
evaluating HellaSwag: 20/79
evaluating HellaSwag: 30/79
evaluating HellaSwag: 40/79
evaluating HellaSwag: 50/79
evaluating HellaSwag: 60/79
evaluating HellaSwag: 70/79
HellaSwag: 2756/10042 = 0.274447
step 4251/19560 | loss 3.657717 (+0.95z)| norm 0.3025 (+0.25z)| lr 5.49e-04 | 322.33 ms | 52.4% bf16 MFU | 1623544 tok/s
step 4252/19560 | loss 3.633719 (+0.43z)| norm 0.2784 (-0.13z)| lr 5.49e-04 | 322.96 ms | 52.3% bf16 MFU | 1623536 tok/s
step 4253/19560 | loss 3.515887 (-2.03z)| norm 0.2628 (-0.38z)| lr 5.49e-04 | 322.70 ms | 52.3% bf16 MFU | 1623594 tok/s
step 4254/19560 | loss 3.603077 (-0.21z)| norm 0.2782 (-0.13z)| lr 5.49e-04 | 322.56 ms | 52.3% bf16 MFU | 1623684 tok/s
step 4255/19560 | loss 3.534624 (-1.62z)| norm 0.2785 (-0.13z)| lr 5.49e-04 | 323.56 ms | 52.2% bf16 MFU | 1623519 tok/s
step 4256/19560 | loss 3.624863 (+0.27z)| norm 0.2354 (-0.81z)| lr 5.49e-04 | 323.15 ms | 52.2% bf16 MFU | 1623465 tok/s
step 4257/19560 | loss 3.629890 (+0.36z)| norm 0.2715 (-0.23z)| lr 5.49e-04 | 323.00 ms | 52.3% bf16 MFU | 1623452 tok/s
step 4258/19560 | loss 3.670438 (+1.23z)| norm 0.2597 (-0.42z)| lr 5.49e-04 | 322.79 ms | 52.3% bf16 MFU | 1623492 tok/s
step 4259/19560 | loss 3.604557 (-0.15z)| norm 0.2503 (-0.56z)| lr 5.49e-04 | 323.14 ms | 52.2% bf16 MFU | 1623440 tok/s
step 4260/19560 | loss 3.636955 (+0.55z)| norm 0.2665 (-0.30z)| lr 5.49e-04 | 322.26 ms | 52.4% bf16 MFU | 1623614 tok/s
step 4261/19560 | loss 3.661551 (+1.07z)| norm 0.2683 (-0.27z)| lr 5.49e-04 | 323.42 ms | 52.2% bf16 MFU | 1623487 tok/s
step 4262/19560 | loss 3.585674 (-0.58z)| norm 0.2496 (-0.56z)| lr 5.49e-04 | 323.09 ms | 52.2% bf16 MFU | 1623449 tok/s
step 4263/19560 | loss 3.566324 (-0.98z)| norm 0.2599 (-0.40z)| lr 5.49e-04 | 323.06 ms | 52.2% bf16 MFU | 1623420 tok/s
step 4264/19560 | loss 3.618252 (+0.16z)| norm 0.2507 (-0.54z)| lr 5.49e-04 | 322.74 ms | 52.3% bf16 MFU | 1623474 tok/s
step 4265/19560 | loss 3.591113 (-0.43z)| norm 0.2588 (-0.41z)| lr 5.49e-04 | 323.49 ms | 52.2% bf16 MFU | 1623336 tok/s
step
4266/19560 | loss 3.635366 (+0.53z)| norm 0.2817 (-0.05z)| lr 5.49e-04 | 322.52 ms | 52.3% bf16 MFU | 1623449 tok/s step 4267/19560 | loss 3.679836 (+1.49z)| norm 0.2879 (+0.04z)| lr 5.49e-04 | 323.14 ms | 52.2% bf16 MFU | 1623400 tok/s step 4268/19560 | loss 3.727754 (+2.46z)| norm 0.2627 (-0.35z)| lr 5.49e-04 | 323.38 ms | 52.2% bf16 MFU | 1623294 tok/s step 4269/19560 | loss 3.619023 (+0.17z)| norm 0.2581 (-0.42z)| lr 5.49e-04 | 323.51 ms | 52.2% bf16 MFU | 1623160 tok/s step 4270/19560 | loss 3.710155 (+2.08z)| norm 0.2616 (-0.37z)| lr 5.49e-04 | 323.26 ms | 52.2% bf16 MFU | 1623097 tok/s step 4271/19560 | loss 3.587511 (-0.52z)| norm 0.2637 (-0.34z)| lr 5.48e-04 | 322.76 ms | 52.3% bf16 MFU | 1623163 tok/s step 4272/19560 | loss 3.576011 (-0.76z)| norm 0.2824 (-0.04z)| lr 5.48e-04 | 323.04 ms | 52.2% bf16 MFU | 1623155 tok/s step 4273/19560 | loss 3.591442 (-0.43z)| norm 0.2808 (-0.06z)| lr 5.48e-04 | 323.01 ms | 52.3% bf16 MFU | 1623155 tok/s step 4274/19560 | loss 3.563662 (-1.02z)| norm 0.2715 (-0.21z)| lr 5.48e-04 | 322.84 ms | 52.3% bf16 MFU | 1623196 tok/s step 4275/19560 | loss 3.679440 (+1.42z)| norm 0.2967 (+0.19z)| lr 5.48e-04 | 322.64 ms | 52.3% bf16 MFU | 1623286 tok/s step 4276/19560 | loss 3.642288 (+0.63z)| norm 0.2843 (-0.01z)| lr 5.48e-04 | 323.40 ms | 52.2% bf16 MFU | 1623181 tok/s step 4277/19560 | loss 3.581959 (-0.64z)| norm 0.2723 (-0.20z)| lr 5.48e-04 | 322.92 ms | 52.3% bf16 MFU | 1623202 tok/s step 4278/19560 | loss 3.643831 (+0.66z)| norm 0.2789 (-0.09z)| lr 5.48e-04 | 323.11 ms | 52.2% bf16 MFU | 1623173 tok/s step 4279/19560 | loss 3.608216 (-0.11z)| norm 0.2640 (-0.33z)| lr 5.48e-04 | 323.20 ms | 52.2% bf16 MFU | 1623124 tok/s step 4280/19560 | loss 3.537209 (-1.59z)| norm 0.3384 (+0.84z)| lr 5.48e-04 | 322.87 ms | 52.3% bf16 MFU | 1623159 tok/s step 4281/19560 | loss 3.562867 (-1.06z)| norm 0.3518 (+1.03z)| lr 5.48e-04 | 323.54 ms | 52.2% bf16 MFU | 1623024 tok/s step 4282/19560 | loss 3.586231 (-0.56z)| norm 0.3467 (+0.94z)| lr 
5.48e-04 | 322.67 ms | 52.3% bf16 MFU | 1623116 tok/s step 4283/19560 | loss 3.649903 (+0.79z)| norm 0.3307 (+0.69z)| lr 5.48e-04 | 322.97 ms | 52.3% bf16 MFU | 1623127 tok/s step 4284/19560 | loss 3.584245 (-0.60z)| norm 0.3466 (+0.93z)| lr 5.48e-04 | 323.51 ms | 52.2% bf16 MFU | 1623003 tok/s step 4285/19560 | loss 3.644145 (+0.66z)| norm 0.3004 (+0.21z)| lr 5.48e-04 | 322.84 ms | 52.3% bf16 MFU | 1623053 tok/s step 4286/19560 | loss 3.623230 (+0.22z)| norm 0.3185 (+0.49z)| lr 5.48e-04 | 322.41 ms | 52.3% bf16 MFU | 1623207 tok/s step 4287/19560 | loss 3.653535 (+0.86z)| norm 0.3339 (+0.72z)| lr 5.48e-04 | 323.05 ms | 52.2% bf16 MFU | 1623192 tok/s step 4288/19560 | loss 3.606359 (-0.15z)| norm 0.2942 (+0.09z)| lr 5.48e-04 | 322.81 ms | 52.3% bf16 MFU | 1623240 tok/s step 4289/19560 | loss 3.577531 (-0.77z)| norm 0.2887 (+0.01z)| lr 5.48e-04 | 322.92 ms | 52.3% bf16 MFU | 1623257 tok/s step 4290/19560 | loss 3.632134 (+0.40z)| norm 0.2968 (+0.13z)| lr 5.48e-04 | 322.88 ms | 52.3% bf16 MFU | 1623284 tok/s step 4291/19560 | loss 3.552191 (-1.31z)| norm 0.2796 (-0.14z)| lr 5.48e-04 | 322.87 ms | 52.3% bf16 MFU | 1623312 tok/s step 4292/19560 | loss 3.608504 (-0.07z)| norm 0.3051 (+0.25z)| lr 5.48e-04 | 323.14 ms | 52.2% bf16 MFU | 1623271 tok/s step 4293/19560 | loss 3.608592 (-0.07z)| norm 0.2989 (+0.15z)| lr 5.48e-04 | 322.59 ms | 52.3% bf16 MFU | 1623369 tok/s step 4294/19560 | loss 3.675426 (+1.38z)| norm 0.3083 (+0.29z)| lr 5.48e-04 | 322.94 ms | 52.3% bf16 MFU | 1623374 tok/s step 4295/19560 | loss 3.515800 (-2.06z)| norm 0.2854 (-0.07z)| lr 5.48e-04 | 323.29 ms | 52.2% bf16 MFU | 1623291 tok/s step 4296/19560 | loss 3.555757 (-1.19z)| norm 0.2833 (-0.10z)| lr 5.48e-04 | 323.14 ms | 52.2% bf16 MFU | 1623249 tok/s step 4297/19560 | loss 3.551528 (-1.26z)| norm 0.2792 (-0.17z)| lr 5.48e-04 | 322.49 ms | 52.3% bf16 MFU | 1623374 tok/s step 4298/19560 | loss 3.692727 (+1.73z)| norm 0.2744 (-0.24z)| lr 5.48e-04 | 323.11 ms | 52.2% bf16 MFU | 1623336 tok/s step 
4299/19560 | loss 3.658694 (+1.00z)| norm 0.2795 (-0.16z)| lr 5.48e-04 | 322.57 ms | 52.3% bf16 MFU | 1623435 tok/s step 4300/19560 | loss 3.567279 (-0.91z)| norm 0.3043 (+0.22z)| lr 5.48e-04 | 322.89 ms | 52.3% bf16 MFU | 1623449 tok/s step 4301/19560 | loss 3.602048 (-0.17z)| norm 0.2813 (-0.14z)| lr 5.48e-04 | 323.08 ms | 52.2% bf16 MFU | 1623416 tok/s step 4302/19560 | loss 3.619236 (+0.20z)| norm 0.2776 (-0.20z)| lr 5.48e-04 | 322.90 ms | 52.3% bf16 MFU | 1623428 tok/s step 4303/19560 | loss 3.600095 (-0.20z)| norm 0.2632 (-0.42z)| lr 5.48e-04 | 322.40 ms | 52.3% bf16 MFU | 1623568 tok/s step 4304/19560 | loss 3.594891 (-0.32z)| norm 0.2762 (-0.21z)| lr 5.48e-04 | 323.21 ms | 52.2% bf16 MFU | 1623496 tok/s step 4305/19560 | loss 3.651468 (+0.90z)| norm 0.2699 (-0.30z)| lr 5.48e-04 | 322.55 ms | 52.3% bf16 MFU | 1623594 tok/s step 4306/19560 | loss 3.550086 (-1.28z)| norm 0.2581 (-0.48z)| lr 5.48e-04 | 323.03 ms | 52.2% bf16 MFU | 1623565 tok/s step 4307/19560 | loss 3.621755 (+0.26z)| norm 0.2812 (-0.12z)| lr 5.47e-04 | 323.70 ms | 52.1% bf16 MFU | 1623371 tok/s step 4308/19560 | loss 3.597847 (-0.24z)| norm 0.2770 (-0.18z)| lr 5.47e-04 | 322.40 ms | 52.3% bf16 MFU | 1623513 tok/s step 4309/19560 | loss 3.551745 (-1.23z)| norm 0.2918 (+0.05z)| lr 5.47e-04 | 323.42 ms | 52.2% bf16 MFU | 1623392 tok/s step 4310/19560 | loss 3.637038 (+0.60z)| norm 0.2697 (-0.30z)| lr 5.47e-04 | 323.39 ms | 52.2% bf16 MFU | 1623284 tok/s step 4311/19560 | loss 3.647645 (+0.83z)| norm 0.2574 (-0.48z)| lr 5.47e-04 | 322.85 ms | 52.3% bf16 MFU | 1623316 tok/s step 4312/19560 | loss 3.597891 (-0.24z)| norm 0.2709 (-0.27z)| lr 5.47e-04 | 322.94 ms | 52.3% bf16 MFU | 1623324 tok/s step 4313/19560 | loss 3.596636 (-0.27z)| norm 0.2713 (-0.27z)| lr 5.47e-04 | 322.91 ms | 52.3% bf16 MFU | 1623339 tok/s step 4314/19560 | loss 3.596867 (-0.26z)| norm 0.2493 (-0.61z)| lr 5.47e-04 | 322.58 ms | 52.3% bf16 MFU | 1623438 tok/s step 4315/19560 | loss 3.620206 (+0.24z)| norm 0.2821 (-0.09z)| lr 
5.47e-04 | 323.16 ms | 52.2% bf16 MFU | 1623384 tok/s step 4316/19560 | loss 3.594434 (-0.32z)| norm 0.2960 (+0.12z)| lr 5.47e-04 | 322.50 ms | 52.3% bf16 MFU | 1623500 tok/s step 4317/19560 | loss 3.640478 (+0.67z)| norm 0.2725 (-0.25z)| lr 5.47e-04 | 323.02 ms | 52.2% bf16 MFU | 1623479 tok/s step 4318/19560 | loss 3.555356 (-1.18z)| norm 0.2756 (-0.20z)| lr 5.47e-04 | 322.75 ms | 52.3% bf16 MFU | 1623528 tok/s step 4319/19560 | loss 3.642560 (+0.72z)| norm 0.2557 (-0.51z)| lr 5.47e-04 | 322.87 ms | 52.3% bf16 MFU | 1623544 tok/s step 4320/19560 | loss 3.594782 (-0.32z)| norm 0.2732 (-0.24z)| lr 5.47e-04 | 323.20 ms | 52.2% bf16 MFU | 1623476 tok/s step 4321/19560 | loss 3.607779 (-0.02z)| norm 0.2603 (-0.44z)| lr 5.47e-04 | 322.87 ms | 52.3% bf16 MFU | 1623494 tok/s step 4322/19560 | loss 3.530621 (-1.69z)| norm 0.2708 (-0.27z)| lr 5.47e-04 | 322.61 ms | 52.3% bf16 MFU | 1623578 tok/s step 4323/19560 | loss 3.641984 (+0.76z)| norm 0.3295 (+0.64z)| lr 5.47e-04 | 322.58 ms | 52.3% bf16 MFU | 1623664 tok/s step 4324/19560 | loss 3.617965 (+0.23z)| norm 0.2864 (-0.04z)| lr 5.47e-04 | 323.39 ms | 52.2% bf16 MFU | 1623542 tok/s step 4325/19560 | loss 3.614511 (+0.15z)| norm 0.2685 (-0.32z)| lr 5.47e-04 | 322.52 ms | 52.3% bf16 MFU | 1623644 tok/s step 4326/19560 | loss 3.711533 (+2.25z)| norm 0.2639 (-0.39z)| lr 5.47e-04 | 323.07 ms | 52.2% bf16 MFU | 1623604 tok/s step 4327/19560 | loss 3.533474 (-1.61z)| norm 0.2701 (-0.29z)| lr 5.47e-04 | 323.22 ms | 52.2% bf16 MFU | 1623528 tok/s step 4328/19560 | loss 3.640454 (+0.70z)| norm 0.2927 (+0.05z)| lr 5.47e-04 | 322.27 ms | 52.4% bf16 MFU | 1623696 tok/s step 4329/19560 | loss 3.629741 (+0.45z)| norm 0.3160 (+0.41z)| lr 5.47e-04 | 323.18 ms | 52.2% bf16 MFU | 1623625 tok/s step 4330/19560 | loss 3.624812 (+0.34z)| norm 0.3299 (+0.62z)| lr 5.47e-04 | 323.46 ms | 52.2% bf16 MFU | 1623487 tok/s step 4331/19560 | loss 3.635322 (+0.56z)| norm 0.3374 (+0.73z)| lr 5.47e-04 | 322.98 ms | 52.3% bf16 MFU | 1623476 tok/s step 
4332/19560 | loss 3.667142 (+1.24z)| norm 0.2808 (-0.15z)| lr 5.47e-04 | 322.92 ms | 52.3% bf16 MFU | 1623480 tok/s step 4333/19560 | loss 3.681546 (+1.53z)| norm 0.2936 (+0.04z)| lr 5.47e-04 | 322.03 ms | 52.4% bf16 MFU | 1623709 tok/s step 4334/19560 | loss 3.581997 (-0.62z)| norm 0.2779 (-0.20z)| lr 5.47e-04 | 322.90 ms | 52.3% bf16 MFU | 1623708 tok/s step 4335/19560 | loss 3.566482 (-0.94z)| norm 0.2905 (-0.01z)| lr 5.47e-04 | 323.30 ms | 52.2% bf16 MFU | 1623607 tok/s step 4336/19560 | loss 3.595219 (-0.32z)| norm 0.2656 (-0.39z)| lr 5.47e-04 | 323.57 ms | 52.2% bf16 MFU | 1623443 tok/s step 4337/19560 | loss 3.583574 (-0.57z)| norm 0.2580 (-0.50z)| lr 5.47e-04 | 322.51 ms | 52.3% bf16 MFU | 1623554 tok/s step 4338/19560 | loss 3.650576 (+0.87z)| norm 0.2685 (-0.34z)| lr 5.47e-04 | 323.20 ms | 52.2% bf16 MFU | 1623486 tok/s step 4339/19560 | loss 3.604853 (-0.12z)| norm 0.2470 (-0.67z)| lr 5.47e-04 | 323.15 ms | 52.2% bf16 MFU | 1623433 tok/s step 4340/19560 | loss 3.597119 (-0.30z)| norm 0.2468 (-0.67z)| lr 5.47e-04 | 323.36 ms | 52.2% bf16 MFU | 1623330 tok/s step 4341/19560 | loss 3.606673 (-0.10z)| norm 0.2747 (-0.23z)| lr 5.47e-04 | 322.99 ms | 52.3% bf16 MFU | 1623325 tok/s step 4342/19560 | loss 3.612479 (+0.02z)| norm 0.2380 (-0.80z)| lr 5.46e-04 | 323.08 ms | 52.2% bf16 MFU | 1623298 tok/s step 4343/19560 | loss 3.691236 (+1.72z)| norm 0.2953 (+0.09z)| lr 5.46e-04 | 323.56 ms | 52.2% bf16 MFU | 1623151 tok/s step 4344/19560 | loss 3.688223 (+1.62z)| norm 0.2882 (-0.02z)| lr 5.46e-04 | 323.23 ms | 52.2% bf16 MFU | 1623094 tok/s step 4345/19560 | loss 3.586503 (-0.59z)| norm 0.2777 (-0.18z)| lr 5.46e-04 | 322.97 ms | 52.3% bf16 MFU | 1623105 tok/s step 4346/19560 | loss 3.632637 (+0.41z)| norm 0.2839 (-0.09z)| lr 5.46e-04 | 323.01 ms | 52.3% bf16 MFU | 1623108 tok/s step 4347/19560 | loss 3.583999 (-0.64z)| norm 0.3033 (+0.21z)| lr 5.46e-04 | 322.50 ms | 52.3% bf16 MFU | 1623236 tok/s step 4348/19560 | loss 3.639725 (+0.56z)| norm 0.2711 (-0.29z)| lr 
5.46e-04 | 323.42 ms | 52.2% bf16 MFU | 1623129 tok/s step 4349/19560 | loss 3.633917 (+0.45z)| norm 0.2821 (-0.11z)| lr 5.46e-04 | 323.45 ms | 52.2% bf16 MFU | 1623018 tok/s step 4350/19560 | loss 3.615116 (+0.03z)| norm 0.2609 (-0.43z)| lr 5.46e-04 | 323.08 ms | 52.2% bf16 MFU | 1623006 tok/s step 4351/19560 | loss 3.549410 (-1.43z)| norm 0.2587 (-0.46z)| lr 5.46e-04 | 322.86 ms | 52.3% bf16 MFU | 1623049 tok/s step 4352/19560 | loss 3.568752 (-0.98z)| norm 0.2722 (-0.25z)| lr 5.46e-04 | 323.03 ms | 52.2% bf16 MFU | 1623047 tok/s step 4353/19560 | loss 3.599916 (-0.30z)| norm 0.2479 (-0.62z)| lr 5.46e-04 | 322.61 ms | 52.3% bf16 MFU | 1623153 tok/s step 4354/19560 | loss 3.570683 (-0.96z)| norm 0.2807 (-0.11z)| lr 5.46e-04 | 322.84 ms | 52.3% bf16 MFU | 1623194 tok/s step 4355/19560 | loss 3.642940 (+0.65z)| norm 0.2977 (+0.15z)| lr 5.46e-04 | 323.26 ms | 52.2% bf16 MFU | 1623128 tok/s step 4356/19560 | loss 3.580966 (-0.72z)| norm 0.2809 (-0.11z)| lr 5.46e-04 | 323.09 ms | 52.2% bf16 MFU | 1623107 tok/s step 4357/19560 | loss 3.681344 (+1.52z)| norm 0.2709 (-0.26z)| lr 5.46e-04 | 322.57 ms | 52.3% bf16 MFU | 1623218 tok/s step 4358/19560 | loss 3.570584 (-0.94z)| norm 0.2976 (+0.15z)| lr 5.46e-04 | 323.00 ms | 52.3% bf16 MFU | 1623217 tok/s step 4359/19560 | loss 3.625823 (+0.30z)| norm 0.3057 (+0.27z)| lr 5.46e-04 | 322.84 ms | 52.3% bf16 MFU | 1623255 tok/s step 4360/19560 | loss 3.639495 (+0.59z)| norm 0.3003 (+0.18z)| lr 5.46e-04 | 322.88 ms | 52.3% bf16 MFU | 1623282 tok/s step 4361/19560 | loss 3.569885 (-0.98z)| norm 0.2826 (-0.09z)| lr 5.46e-04 | 323.27 ms | 52.2% bf16 MFU | 1623209 tok/s step 4362/19560 | loss 3.567374 (-1.06z)| norm 0.2767 (-0.28z)| lr 5.46e-04 | 322.95 ms | 52.3% bf16 MFU | 1623219 tok/s step 4363/19560 | loss 3.578224 (-0.80z)| norm 0.2914 (+0.37z)| lr 5.46e-04 | 322.53 ms | 52.3% bf16 MFU | 1623335 tok/s step 4364/19560 | loss 3.608848 (-0.08z)| norm 0.2587 (-1.04z)| lr 5.46e-04 | 323.56 ms | 52.2% bf16 MFU | 1623188 tok/s step 
4365/19560 | loss 3.613565 (+0.02z)| norm 0.2640 (-0.79z)| lr 5.46e-04 | 323.08 ms | 52.2% bf16 MFU | 1623167 tok/s step 4366/19560 | loss 3.690310 (+1.80z)| norm 0.2902 (+0.37z)| lr 5.46e-04 | 323.69 ms | 52.1% bf16 MFU | 1622994 tok/s step 4367/19560 | loss 3.636428 (+0.53z)| norm 0.3125 (+1.37z)| lr 5.46e-04 | 323.14 ms | 52.2% bf16 MFU | 1622968 tok/s step 4368/19560 | loss 3.666706 (+1.22z)| norm 0.3514 (+2.97z)| lr 5.46e-04 | 322.66 ms | 52.3% bf16 MFU | 1623064 tok/s step 4369/19560 | loss 3.608751 (-0.12z)| norm 0.3138 (+1.34z)| lr 5.46e-04 | 323.24 ms | 52.2% bf16 MFU | 1623009 tok/s step 4370/19560 | loss 3.621764 (+0.19z)| norm 0.2516 (-1.30z)| lr 5.46e-04 | 323.42 ms | 52.2% bf16 MFU | 1622911 tok/s step 4371/19560 | loss 3.592314 (-0.49z)| norm 0.2921 (+0.41z)| lr 5.46e-04 | 322.74 ms | 52.3% bf16 MFU | 1622991 tok/s step 4372/19560 | loss 3.616275 (+0.07z)| norm 0.2759 (-0.27z)| lr 5.46e-04 | 322.48 ms | 52.3% bf16 MFU | 1623131 tok/s step 4373/19560 | loss 3.637085 (+0.56z)| norm 0.2683 (-0.59z)| lr 5.46e-04 | 323.01 ms | 52.3% bf16 MFU | 1623132 tok/s step 4374/19560 | loss 3.574289 (-0.91z)| norm 0.2536 (-1.19z)| lr 5.46e-04 | 323.67 ms | 52.1% bf16 MFU | 1622966 tok/s step 4375/19560 | loss 3.665800 (+1.25z)| norm 0.2854 (+0.14z)| lr 5.46e-04 | 323.26 ms | 52.2% bf16 MFU | 1622912 tok/s step 4376/19560 | loss 3.700242 (+2.01z)| norm 0.2938 (+0.49z)| lr 5.46e-04 | 323.99 ms | 52.1% bf16 MFU | 1622677 tok/s step 4377/19560 | loss 3.611730 (-0.05z)| norm 0.2664 (-0.66z)| lr 5.45e-04 | 323.10 ms | 52.2% bf16 MFU | 1622676 tok/s step 4378/19560 | loss 3.599649 (-0.32z)| norm 0.2730 (-0.37z)| lr 5.45e-04 | 323.23 ms | 52.2% bf16 MFU | 1622643 tok/s step 4379/19560 | loss 3.701562 (+2.02z)| norm 0.2595 (-0.93z)| lr 5.45e-04 | 323.78 ms | 52.1% bf16 MFU | 1622474 tok/s step 4380/19560 | loss 3.698237 (+1.90z)| norm 0.2842 (+0.12z)| lr 5.45e-04 | 323.10 ms | 52.2% bf16 MFU | 1622484 tok/s step 4381/19560 | loss 3.610438 (-0.11z)| norm 0.2796 (-0.08z)| lr 
5.45e-04 | 322.90 ms | 52.3% bf16 MFU | 1622544 tok/s step 4382/19560 | loss 3.590709 (-0.56z)| norm 0.2563 (-1.06z)| lr 5.45e-04 | 323.80 ms | 52.1% bf16 MFU | 1622376 tok/s step 4383/19560 | loss 3.727215 (+2.53z)| norm 0.3034 (+0.92z)| lr 5.45e-04 | 322.62 ms | 52.3% bf16 MFU | 1622513 tok/s step 4384/19560 | loss 3.606011 (-0.24z)| norm 0.3207 (+1.63z)| lr 5.45e-04 | 322.33 ms | 52.4% bf16 MFU | 1622716 tok/s step 4385/19560 | loss 3.620506 (+0.10z)| norm 0.2760 (-0.26z)| lr 5.45e-04 | 323.19 ms | 52.2% bf16 MFU | 1622691 tok/s step 4386/19560 | loss 3.554870 (-1.38z)| norm 0.3062 (+1.00z)| lr 5.45e-04 | 322.72 ms | 52.3% bf16 MFU | 1622786 tok/s step 4387/19560 | loss 3.657111 (+0.94z)| norm 0.2767 (-0.26z)| lr 5.45e-04 | 322.66 ms | 52.3% bf16 MFU | 1622893 tok/s step 4388/19560 | loss 3.636754 (+0.48z)| norm 0.2856 (+0.11z)| lr 5.45e-04 | 322.51 ms | 52.3% bf16 MFU | 1623030 tok/s step 4389/19560 | loss 3.685083 (+1.56z)| norm 0.2897 (+0.28z)| lr 5.45e-04 | 323.12 ms | 52.2% bf16 MFU | 1623008 tok/s step 4390/19560 | loss 3.590594 (-0.58z)| norm 0.2767 (-0.29z)| lr 5.45e-04 | 323.03 ms | 52.2% bf16 MFU | 1623009 tok/s step 4391/19560 | loss 3.680339 (+1.43z)| norm 0.2741 (-0.41z)| lr 5.45e-04 | 323.05 ms | 52.2% bf16 MFU | 1623006 tok/s step 4392/19560 | loss 3.567895 (-1.09z)| norm 0.2742 (-0.41z)| lr 5.45e-04 | 322.97 ms | 52.3% bf16 MFU | 1623022 tok/s step 4393/19560 | loss 3.573573 (-0.96z)| norm 0.2451 (-1.66z)| lr 5.45e-04 | 323.49 ms | 52.2% bf16 MFU | 1622906 tok/s step 4394/19560 | loss 3.605349 (-0.24z)| norm 0.2789 (-0.20z)| lr 5.45e-04 | 323.37 ms | 52.2% bf16 MFU | 1622826 tok/s step 4395/19560 | loss 3.713892 (+2.16z)| norm 0.2803 (-0.14z)| lr 5.45e-04 | 322.87 ms | 52.3% bf16 MFU | 1622876 tok/s step 4396/19560 | loss 3.659828 (+1.00z)| norm 0.2714 (-0.53z)| lr 5.45e-04 | 322.94 ms | 52.3% bf16 MFU | 1622907 tok/s step 4397/19560 | loss 3.489590 (-2.76z)| norm 0.2673 (-0.71z)| lr 5.45e-04 | 323.25 ms | 52.2% bf16 MFU | 1622858 tok/s step 
4398/19560 | loss 3.594442 (-0.44z)| norm 0.2700 (-0.60z)| lr 5.45e-04 | 322.82 ms | 52.3% bf16 MFU | 1622920 tok/s step 4399/19560 | loss 3.595922 (-0.41z)| norm 0.2905 (+0.29z)| lr 5.45e-04 | 322.73 ms | 52.3% bf16 MFU | 1623001 tok/s step 4400/19560 | loss 3.552989 (-1.36z)| norm 0.2551 (-1.24z)| lr 5.45e-04 | 322.96 ms | 52.3% bf16 MFU | 1623019 tok/s step 4401/19560 | loss 3.644774 (+0.68z)| norm 0.2854 (+0.07z)| lr 5.45e-04 | 322.88 ms | 52.3% bf16 MFU | 1623057 tok/s step 4402/19560 | loss 3.661746 (+1.04z)| norm 0.2962 (+0.54z)| lr 5.45e-04 | 323.77 ms | 52.1% bf16 MFU | 1622871 tok/s step 4403/19560 | loss 3.608547 (-0.13z)| norm 0.2820 (-0.08z)| lr 5.45e-04 | 323.00 ms | 52.3% bf16 MFU | 1622886 tok/s step 4404/19560 | loss 3.611626 (-0.06z)| norm 0.2819 (-0.08z)| lr 5.45e-04 | 323.19 ms | 52.2% bf16 MFU | 1622853 tok/s step 4405/19560 | loss 3.573374 (-0.92z)| norm 0.2635 (-0.88z)| lr 5.45e-04 | 322.66 ms | 52.3% bf16 MFU | 1622954 tok/s step 4406/19560 | loss 3.590078 (-0.53z)| norm 0.2660 (-0.76z)| lr 5.45e-04 | 322.69 ms | 52.3% bf16 MFU | 1623044 tok/s step 4407/19560 | loss 3.554733 (-1.31z)| norm 0.2606 (-0.99z)| lr 5.45e-04 | 322.98 ms | 52.3% bf16 MFU | 1623056 tok/s step 4408/19560 | loss 3.563469 (-1.13z)| norm 0.2596 (-1.03z)| lr 5.45e-04 | 323.27 ms | 52.2% bf16 MFU | 1622995 tok/s step 4409/19560 | loss 3.583587 (-0.68z)| norm 0.2550 (-1.24z)| lr 5.45e-04 | 322.98 ms | 52.3% bf16 MFU | 1623009 tok/s step 4410/19560 | loss 3.605933 (-0.18z)| norm 0.2561 (-1.19z)| lr 5.45e-04 | 323.17 ms | 52.2% bf16 MFU | 1622976 tok/s step 4411/19560 | loss 3.623275 (+0.22z)| norm 0.2581 (-1.09z)| lr 5.45e-04 | 323.17 ms | 52.2% bf16 MFU | 1622942 tok/s step 4412/19560 | loss 3.584075 (-0.67z)| norm 0.2617 (-0.92z)| lr 5.44e-04 | 322.67 ms | 52.3% bf16 MFU | 1623038 tok/s step 4413/19560 | loss 3.597902 (-0.35z)| norm 0.2620 (-0.89z)| lr 5.44e-04 | 323.30 ms | 52.2% bf16 MFU | 1622969 tok/s step 4414/19560 | loss 3.564673 (-1.09z)| norm 0.2815 (+0.09z)| lr 
5.44e-04 | 322.94 ms | 52.3% bf16 MFU | 1622995 tok/s step 4415/19560 | loss 3.644712 (+0.72z)| norm 0.2834 (+0.21z)| lr 5.44e-04 | 323.32 ms | 52.2% bf16 MFU | 1622924 tok/s step 4416/19560 | loss 3.560727 (-1.16z)| norm 0.2962 (+0.87z)| lr 5.44e-04 | 322.70 ms | 52.3% bf16 MFU | 1623012 tok/s step 4417/19560 | loss 3.610704 (-0.04z)| norm 0.2989 (+1.00z)| lr 5.44e-04 | 323.06 ms | 52.2% bf16 MFU | 1623004 tok/s step 4418/19560 | loss 3.586375 (-0.58z)| norm 0.2886 (+0.48z)| lr 5.44e-04 | 322.74 ms | 52.3% bf16 MFU | 1623080 tok/s step 4419/19560 | loss 3.632085 (+0.43z)| norm 0.3118 (+1.65z)| lr 5.44e-04 | 322.83 ms | 52.3% bf16 MFU | 1623127 tok/s step 4420/19560 | loss 3.656071 (+0.96z)| norm 0.3217 (+2.12z)| lr 5.44e-04 | 322.71 ms | 52.3% bf16 MFU | 1623203 tok/s step 4421/19560 | loss 3.600680 (-0.28z)| norm 0.3060 (+1.32z)| lr 5.44e-04 | 322.64 ms | 52.3% bf16 MFU | 1623293 tok/s step 4422/19560 | loss 3.566115 (-1.05z)| norm 0.2942 (+0.74z)| lr 5.44e-04 | 322.82 ms | 52.3% bf16 MFU | 1623333 tok/s step 4423/19560 | loss 3.630291 (+0.39z)| norm 0.2807 (+0.06z)| lr 5.44e-04 | 322.96 ms | 52.3% bf16 MFU | 1623335 tok/s step 4424/19560 | loss 3.600591 (-0.30z)| norm 0.3208 (+2.04z)| lr 5.44e-04 | 322.64 ms | 52.3% bf16 MFU | 1623418 tok/s step 4425/19560 | loss 3.557308 (-1.31z)| norm 0.2785 (-0.07z)| lr 5.44e-04 | 322.94 ms | 52.3% bf16 MFU | 1623422 tok/s step 4426/19560 | loss 3.622908 (+0.23z)| norm 0.2699 (-0.49z)| lr 5.44e-04 | 322.79 ms | 52.3% bf16 MFU | 1623463 tok/s step 4427/19560 | loss 3.560084 (-1.23z)| norm 0.2695 (-0.51z)| lr 5.44e-04 | 322.52 ms | 52.3% bf16 MFU | 1623570 tok/s step 4428/19560 | loss 3.609362 (-0.08z)| norm 0.2755 (-0.20z)| lr 5.44e-04 | 322.44 ms | 52.3% bf16 MFU | 1623690 tok/s step 4429/19560 | loss 3.614216 (+0.03z)| norm 0.2805 (+0.05z)| lr 5.44e-04 | 323.14 ms | 52.2% bf16 MFU | 1623629 tok/s step 4430/19560 | loss 3.617361 (+0.11z)| norm 0.2348 (-2.18z)| lr 5.44e-04 | 322.96 ms | 52.3% bf16 MFU | 1623616 tok/s step 
4431/19560 | loss 3.554557 (-1.36z)| norm 0.2671 (-0.60z)| lr 5.44e-04 | 322.68 ms | 52.3% bf16 MFU | 1623676 tok/s step 4432/19560 | loss 3.601496 (-0.26z)| norm 0.2643 (-0.73z)| lr 5.44e-04 | 322.50 ms | 52.3% bf16 MFU | 1623778 tok/s step 4433/19560 | loss 3.561870 (-1.17z)| norm 0.2540 (-1.22z)| lr 5.44e-04 | 323.07 ms | 52.2% bf16 MFU | 1623731 tok/s step 4434/19560 | loss 3.549912 (-1.45z)| norm 0.2573 (-1.06z)| lr 5.44e-04 | 323.12 ms | 52.2% bf16 MFU | 1623672 tok/s step 4435/19560 | loss 3.597879 (-0.32z)| norm 0.2531 (-1.25z)| lr 5.44e-04 | 322.54 ms | 52.3% bf16 MFU | 1623765 tok/s step 4436/19560 | loss 3.629691 (+0.42z)| norm 0.2667 (-0.58z)| lr 5.44e-04 | 322.88 ms | 52.3% bf16 MFU | 1623766 tok/s step 4437/19560 | loss 3.646283 (+0.79z)| norm 0.2788 (+0.01z)| lr 5.44e-04 | 322.50 ms | 52.3% bf16 MFU | 1623861 tok/s step 4438/19560 | loss 3.591426 (-0.49z)| norm 0.2974 (+0.90z)| lr 5.44e-04 | 322.73 ms | 52.3% bf16 MFU | 1623896 tok/s step 4439/19560 | loss 3.596687 (-0.36z)| norm 0.3076 (+1.37z)| lr 5.44e-04 | 322.45 ms | 52.3% bf16 MFU | 1623999 tok/s step 4440/19560 | loss 3.583553 (-0.66z)| norm 0.2899 (+0.51z)| lr 5.44e-04 | 322.40 ms | 52.3% bf16 MFU | 1624110 tok/s step 4441/19560 | loss 3.570989 (-0.95z)| norm 0.2627 (-0.80z)| lr 5.44e-04 | 322.51 ms | 52.3% bf16 MFU | 1624187 tok/s step 4442/19560 | loss 3.598618 (-0.30z)| norm 0.2590 (-0.98z)| lr 5.44e-04 | 323.10 ms | 52.2% bf16 MFU | 1624112 tok/s step 4443/19560 | loss 3.629704 (+0.43z)| norm 0.2648 (-0.69z)| lr 5.44e-04 | 323.53 ms | 52.2% bf16 MFU | 1623932 tok/s step 4444/19560 | loss 3.577414 (-0.80z)| norm 0.2693 (-0.47z)| lr 5.44e-04 | 322.25 ms | 52.4% bf16 MFU | 1624084 tok/s step 4445/19560 | loss 3.590708 (-0.48z)| norm 0.2700 (-0.43z)| lr 5.44e-04 | 322.38 ms | 52.4% bf16 MFU | 1624195 tok/s step 4446/19560 | loss 3.593846 (-0.41z)| norm 0.2829 (+0.19z)| lr 5.43e-04 | 322.84 ms | 52.3% bf16 MFU | 1624185 tok/s step 4447/19560 | loss 3.581846 (-0.69z)| norm 0.2670 (-0.59z)| lr 
5.43e-04 | 322.94 ms | 52.3% bf16 MFU | 1624149 tok/s step 4448/19560 | loss 3.644226 (+0.78z)| norm 0.2489 (-1.44z)| lr 5.43e-04 | 322.95 ms | 52.3% bf16 MFU | 1624114 tok/s step 4449/19560 | loss 3.581000 (-0.71z)| norm 0.2518 (-1.30z)| lr 5.43e-04 | 323.13 ms | 52.2% bf16 MFU | 1624033 tok/s step 4450/19560 | loss 3.549918 (-1.45z)| norm 0.2634 (-0.74z)| lr 5.43e-04 | 322.70 ms | 52.3% bf16 MFU | 1624067 tok/s step 4451/19560 | loss 3.587421 (-0.55z)| norm 0.2905 (+0.58z)| lr 5.43e-04 | 322.95 ms | 52.3% bf16 MFU | 1624035 tok/s step 4452/19560 | loss 3.603090 (-0.18z)| norm 0.3017 (+1.12z)| lr 5.43e-04 | 322.65 ms | 52.3% bf16 MFU | 1624080 tok/s step 4453/19560 | loss 3.628360 (+0.42z)| norm 0.2891 (+0.50z)| lr 5.43e-04 | 322.97 ms | 52.3% bf16 MFU | 1624042 tok/s step 4454/19560 | loss 3.707264 (+2.29z)| norm 0.2908 (+0.57z)| lr 5.43e-04 | 322.64 ms | 52.3% bf16 MFU | 1624090 tok/s step 4455/19560 | loss 3.524452 (-2.04z)| norm 0.2775 (-0.08z)| lr 5.43e-04 | 322.57 ms | 52.3% bf16 MFU | 1624154 tok/s step 4456/19560 | loss 3.657982 (+1.11z)| norm 0.2535 (-1.23z)| lr 5.43e-04 | 322.67 ms | 52.3% bf16 MFU | 1624189 tok/s step 4457/19560 | loss 3.606180 (-0.11z)| norm 0.2632 (-0.74z)| lr 5.43e-04 | 323.27 ms | 52.2% bf16 MFU | 1624072 tok/s step 4458/19560 | loss 3.668894 (+1.36z)| norm 0.2875 (+0.48z)| lr 5.43e-04 | 322.57 ms | 52.3% bf16 MFU | 1624135 tok/s step 4459/19560 | loss 3.594274 (-0.39z)| norm 0.2991 (+1.11z)| lr 5.43e-04 | 323.18 ms | 52.2% bf16 MFU | 1624043 tok/s step 4460/19560 | loss 3.634315 (+0.56z)| norm 0.2921 (+0.74z)| lr 5.43e-04 | 322.97 ms | 52.3% bf16 MFU | 1624008 tok/s step 4461/19560 | loss 3.626019 (+0.38z)| norm 0.2678 (-0.51z)| lr 5.43e-04 | 322.68 ms | 52.3% bf16 MFU | 1624048 tok/s step 4462/19560 | loss 3.612043 (+0.04z)| norm 0.2895 (+0.61z)| lr 5.43e-04 | 322.91 ms | 52.3% bf16 MFU | 1624028 tok/s step 4463/19560 | loss 3.682787 (+1.70z)| norm 0.3009 (+1.19z)| lr 5.43e-04 | 323.11 ms | 52.2% bf16 MFU | 1623958 tok/s step 
4464/19560 | loss 3.663871 (+1.23z)| norm 0.2706 (-0.37z)| lr 5.43e-04 | 322.78 ms | 52.3% bf16 MFU | 1623975 tok/s step 4465/19560 | loss 3.618473 (+0.15z)| norm 0.3176 (+2.01z)| lr 5.43e-04 | 323.05 ms | 52.2% bf16 MFU | 1623922 tok/s step 4466/19560 | loss 3.652853 (+0.97z)| norm 0.3236 (+2.25z)| lr 5.43e-04 | 322.42 ms | 52.3% bf16 MFU | 1624030 tok/s step 4467/19560 | loss 3.576649 (-0.83z)| norm 0.3046 (+1.28z)| lr 5.43e-04 | 323.50 ms | 52.2% bf16 MFU | 1623862 tok/s step 4468/19560 | loss 3.594845 (-0.40z)| norm 0.3021 (+1.14z)| lr 5.43e-04 | 322.77 ms | 52.3% bf16 MFU | 1623886 tok/s step 4469/19560 | loss 3.580420 (-0.73z)| norm 0.2770 (-0.13z)| lr 5.43e-04 | 323.04 ms | 52.2% bf16 MFU | 1623842 tok/s step 4470/19560 | loss 3.568639 (-1.00z)| norm 0.2750 (-0.25z)| lr 5.43e-04 | 322.87 ms | 52.3% bf16 MFU | 1623840 tok/s step 4471/19560 | loss 3.681537 (+1.65z)| norm 0.2851 (+0.27z)| lr 5.43e-04 | 322.57 ms | 52.3% bf16 MFU | 1623916 tok/s step 4472/19560 | loss 3.565122 (-1.07z)| norm 0.2633 (-0.84z)| lr 5.43e-04 | 323.02 ms | 52.2% bf16 MFU | 1623873 tok/s step 4473/19560 | loss 3.680251 (+1.63z)| norm 0.2515 (-1.43z)| lr 5.43e-04 | 322.39 ms | 52.3% bf16 MFU | 1623991 tok/s step 4474/19560 | loss 3.591791 (-0.44z)| norm 0.2750 (-0.23z)| lr 5.43e-04 | 322.46 ms | 52.3% bf16 MFU | 1624087 tok/s step 4475/19560 | loss 3.590223 (-0.48z)| norm 0.2494 (-1.50z)| lr 5.43e-04 | 323.16 ms | 52.2% bf16 MFU | 1624000 tok/s step 4476/19560 | loss 3.636388 (+0.61z)| norm 0.2626 (-0.83z)| lr 5.43e-04 | 322.63 ms | 52.3% bf16 MFU | 1624053 tok/s step 4477/19560 | loss 3.541068 (-1.60z)| norm 0.2400 (-1.93z)| lr 5.43e-04 | 323.03 ms | 52.2% bf16 MFU | 1624001 tok/s step 4478/19560 | loss 3.614717 (+0.11z)| norm 0.2689 (-0.48z)| lr 5.43e-04 | 323.41 ms | 52.2% bf16 MFU | 1623858 tok/s step 4479/19560 | loss 3.648349 (+0.88z)| norm 0.2785 (-0.01z)| lr 5.43e-04 | 322.67 ms | 52.3% bf16 MFU | 1623906 tok/s step 4480/19560 | loss 3.597796 (-0.31z)| norm 0.2857 (+0.34z)| lr 
5.42e-04 | 322.79 ms | 52.3% bf16 MFU | 1623923 tok/s step 4481/19560 | loss 3.603783 (-0.17z)| norm 0.2593 (-0.99z)| lr 5.42e-04 | 323.23 ms | 52.2% bf16 MFU | 1623827 tok/s step 4482/19560 | loss 3.605877 (-0.12z)| norm 0.2867 (+0.39z)| lr 5.42e-04 | 322.92 ms | 52.3% bf16 MFU | 1623815 tok/s step 4483/19560 | loss 3.526031 (-1.96z)| norm 0.3140 (+1.75z)| lr 5.42e-04 | 322.52 ms | 52.3% bf16 MFU | 1623905 tok/s step 4484/19560 | loss 3.592550 (-0.41z)| norm 0.3081 (+1.43z)| lr 5.42e-04 | 322.22 ms | 52.4% bf16 MFU | 1624064 tok/s step 4485/19560 | loss 3.615191 (+0.13z)| norm 0.3201 (+1.98z)| lr 5.42e-04 | 323.13 ms | 52.2% bf16 MFU | 1623989 tok/s step 4486/19560 | loss 3.643631 (+0.78z)| norm 0.3148 (+1.70z)| lr 5.42e-04 | 322.82 ms | 52.3% bf16 MFU | 1623994 tok/s step 4487/19560 | loss 3.627628 (+0.41z)| norm 0.3616 (+3.75z)| lr 5.42e-04 | 322.84 ms | 52.3% bf16 MFU | 1623994 tok/s step 4488/19560 | loss 3.562008 (-1.12z)| norm 0.3097 (+1.35z)| lr 5.42e-04 | 323.35 ms | 52.2% bf16 MFU | 1623865 tok/s step 4489/19560 | loss 3.604702 (-0.13z)| norm 0.2929 (+0.57z)| lr 5.42e-04 | 322.62 ms | 52.3% bf16 MFU | 1623927 tok/s step 4490/19560 | loss 3.547265 (-1.47z)| norm 0.2676 (-0.59z)| lr 5.42e-04 | 322.68 ms | 52.3% bf16 MFU | 1623970 tok/s step 4491/19560 | loss 3.664709 (+1.26z)| norm 0.3075 (+1.23z)| lr 5.42e-04 | 323.39 ms | 52.2% bf16 MFU | 1623833 tok/s step 4492/19560 | loss 3.601287 (-0.22z)| norm 0.2736 (-0.32z)| lr 5.42e-04 | 322.83 ms | 52.3% bf16 MFU | 1623843 tok/s step 4493/19560 | loss 3.609632 (-0.02z)| norm 0.2593 (-0.97z)| lr 5.42e-04 | 323.05 ms | 52.2% bf16 MFU | 1623799 tok/s step 4494/19560 | loss 3.548989 (-1.42z)| norm 0.2792 (-0.06z)| lr 5.42e-04 | 323.09 ms | 52.2% bf16 MFU | 1623746 tok/s step 4495/19560 | loss 3.653472 (+1.03z)| norm 0.2782 (-0.09z)| lr 5.42e-04 | 322.36 ms | 52.4% bf16 MFU | 1623879 tok/s step 4496/19560 | loss 3.559134 (-1.16z)| norm 0.2619 (-0.85z)| lr 5.42e-04 | 322.89 ms | 52.3% bf16 MFU | 1623873 tok/s step 
4497/19560 | loss 3.623951 (+0.36z)| norm 0.2778 (-0.07z)| lr 5.42e-04 | 323.29 ms | 52.2% bf16 MFU | 1623766 tok/s step 4498/19560 | loss 3.564142 (-1.03z)| norm 0.2760 (-0.17z)| lr 5.42e-04 | 322.81 ms | 52.3% bf16 MFU | 1623785 tok/s step 4499/19560 | loss 3.560209 (-1.11z)| norm 0.2849 (+0.27z)| lr 5.42e-04 | 323.03 ms | 52.2% bf16 MFU | 1623748 tok/s step 4500/19560 | loss 3.545851 (-1.42z)| norm 0.2559 (-1.14z)| lr 5.42e-04 | 322.53 ms | 52.3% bf16 MFU | 1623838 tok/s val loss 3.586649 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating HellaSwag: 70/79 HellaSwag: 2780/10042 = 0.276837 step 4501/19560 | loss 3.610704 (+0.08z)| norm 0.2612 (-0.87z)| lr 5.42e-04 | 322.65 ms | 52.3% bf16 MFU | 1623893 tok/s step 4502/19560 | loss 3.562644 (-1.03z)| norm 0.2655 (-0.67z)| lr 5.42e-04 | 322.44 ms | 52.3% bf16 MFU | 1623998 tok/s step 4503/19560 | loss 3.549914 (-1.30z)| norm 0.2536 (-1.23z)| lr 5.42e-04 | 323.31 ms | 52.2% bf16 MFU | 1623879 tok/s step 4504/19560 | loss 3.640954 (+0.82z)| norm 0.2476 (-1.50z)| lr 5.42e-04 | 323.00 ms | 52.3% bf16 MFU | 1623846 tok/s step 4505/19560 | loss 3.627991 (+0.52z)| norm 0.2656 (-0.63z)| lr 5.42e-04 | 323.87 ms | 52.1% bf16 MFU | 1623594 tok/s step 4506/19560 | loss 3.571037 (-0.81z)| norm 0.2667 (-0.57z)| lr 5.42e-04 | 322.81 ms | 52.3% bf16 MFU | 1623621 tok/s step 4507/19560 | loss 3.565683 (-0.93z)| norm 0.2500 (-1.37z)| lr 5.42e-04 | 322.72 ms | 52.3% bf16 MFU | 1623670 tok/s step 4508/19560 | loss 3.658538 (+1.30z)| norm 0.2848 (+0.30z)| lr 5.42e-04 | 322.88 ms | 52.3% bf16 MFU | 1623676 tok/s step 4509/19560 | loss 3.524763 (-1.88z)| norm 0.2797 (+0.06z)| lr 5.42e-04 | 322.95 ms | 52.3% bf16 MFU | 1623663 tok/s step 4510/19560 | loss 3.576801 (-0.64z)| norm 0.2682 (-0.50z)| lr 5.42e-04 | 322.66 ms | 52.3% bf16 MFU | 1623725 tok/s step 4511/19560 | loss 3.561192 
(-1.01z)| norm 0.2632 (-0.73z)| lr 5.42e-04 | 322.54 ms | 52.3% bf16 MFU | 1623813 tok/s step 4512/19560 | loss 3.531725 (-1.70z)| norm 0.2635 (-0.71z)| lr 5.42e-04 | 322.86 ms | 52.3% bf16 MFU | 1623817 tok/s step 4513/19560 | loss 3.570914 (-0.74z)| norm 0.2575 (-0.99z)| lr 5.42e-04 | 323.13 ms | 52.2% bf16 MFU | 1623754 tok/s step 4514/19560 | loss 3.632274 (+0.73z)| norm 0.2557 (-1.06z)| lr 5.41e-04 | 323.37 ms | 52.2% bf16 MFU | 1623634 tok/s step 4515/19560 | loss 3.595145 (-0.16z)| norm 0.3011 (+1.15z)| lr 5.41e-04 | 322.85 ms | 52.3% bf16 MFU | 1623648 tok/s step 4516/19560 | loss 3.559551 (-1.01z)| norm 0.2782 (+0.04z)| lr 5.41e-04 | 322.61 ms | 52.3% bf16 MFU | 1623724 tok/s step 4517/19560 | loss 3.629785 (+0.73z)| norm 0.2755 (-0.09z)| lr 5.41e-04 | 323.18 ms | 52.2% bf16 MFU | 1623650 tok/s step 4518/19560 | loss 3.717923 (+2.79z)| norm 0.2925 (+0.73z)| lr 5.41e-04 | 322.75 ms | 52.3% bf16 MFU | 1623691 tok/s step 4519/19560 | loss 3.631183 (+0.73z)| norm 0.2835 (+0.29z)| lr 5.41e-04 | 322.79 ms | 52.3% bf16 MFU | 1623719 tok/s step 4520/19560 | loss 3.656678 (+1.33z)| norm 0.2949 (+0.84z)| lr 5.41e-04 | 322.85 ms | 52.3% bf16 MFU | 1623729 tok/s step 4521/19560 | loss 3.724595 (+2.85z)| norm 0.3036 (+1.24z)| lr 5.41e-04 | 322.92 ms | 52.3% bf16 MFU | 1623722 tok/s step 4522/19560 | loss 3.611003 (+0.19z)| norm 0.3012 (+1.11z)| lr 5.41e-04 | 322.91 ms | 52.3% bf16 MFU | 1623718 tok/s step 4523/19560 | loss 3.604650 (+0.06z)| norm 0.2625 (-0.76z)| lr 5.41e-04 | 323.30 ms | 52.2% bf16 MFU | 1623615 tok/s step 4524/19560 | loss 3.597713 (-0.10z)| norm 0.2747 (-0.17z)| lr 5.41e-04 | 322.41 ms | 52.3% bf16 MFU | 1623741 tok/s step 4525/19560 | loss 3.650746 (+1.19z)| norm 0.2856 (+0.35z)| lr 5.41e-04 | 323.01 ms | 52.2% bf16 MFU | 1623710 tok/s step 4526/19560 | loss 3.596760 (-0.15z)| norm 0.2660 (-0.60z)| lr 5.41e-04 | 323.70 ms | 52.1% bf16 MFU | 1623508 tok/s step 4527/19560 | loss 3.631533 (+0.70z)| norm 0.2719 (-0.31z)| lr 5.41e-04 | 322.16 ms | 52.4% 
bf16 MFU | 1623704 tok/s step 4528/19560 | loss 3.531225 (-1.77z)| norm 0.2928 (+0.70z)| lr 5.41e-04 | 323.39 ms | 52.2% bf16 MFU | 1623579 tok/s step 4529/19560 | loss 3.632918 (+0.74z)| norm 0.3101 (+1.52z)| lr 5.41e-04 | 323.10 ms | 52.2% bf16 MFU | 1623535 tok/s step 4530/19560 | loss 3.569797 (-0.81z)| norm 0.2824 (+0.18z)| lr 5.41e-04 | 322.92 ms | 52.3% bf16 MFU | 1623537 tok/s step 4531/19560 | loss 3.615904 (+0.34z)| norm 0.3127 (+1.62z)| lr 5.41e-04 | 323.23 ms | 52.2% bf16 MFU | 1623462 tok/s step 4532/19560 | loss 3.643363 (+1.01z)| norm 0.3001 (+1.01z)| lr 5.41e-04 | 322.59 ms | 52.3% bf16 MFU | 1623550 tok/s step 4533/19560 | loss 3.659714 (+1.39z)| norm 0.2779 (-0.06z)| lr 5.41e-04 | 322.95 ms | 52.3% bf16 MFU | 1623545 tok/s step 4534/19560 | loss 3.548027 (-1.34z)| norm 0.2732 (-0.28z)| lr 5.41e-04 | 322.89 ms | 52.3% bf16 MFU | 1623554 tok/s step 4535/19560 | loss 3.620009 (+0.41z)| norm 0.3023 (+1.09z)| lr 5.41e-04 | 323.13 ms | 52.2% bf16 MFU | 1623503 tok/s step 4536/19560 | loss 3.666755 (+1.53z)| norm 0.2822 (+0.13z)| lr 5.41e-04 | 323.01 ms | 52.3% bf16 MFU | 1623486 tok/s step 4537/19560 | loss 3.598349 (-0.15z)| norm 0.3109 (+1.47z)| lr 5.41e-04 | 322.72 ms | 52.3% bf16 MFU | 1623541 tok/s step 4538/19560 | loss 3.518523 (-2.05z)| norm 0.2851 (+0.23z)| lr 5.41e-04 | 322.58 ms | 52.3% bf16 MFU | 1623629 tok/s step 4539/19560 | loss 3.560149 (-1.03z)| norm 0.2910 (+0.50z)| lr 5.41e-04 | 323.66 ms | 52.1% bf16 MFU | 1623441 tok/s step 4540/19560 | loss 3.786277 (+4.06z)| norm 0.2942 (+0.65z)| lr 5.41e-04 | 322.36 ms | 52.4% bf16 MFU | 1623589 tok/s step 4541/19560 | loss 3.543461 (-1.35z)| norm 0.3136 (+1.56z)| lr 5.41e-04 | 323.10 ms | 52.2% bf16 MFU | 1623543 tok/s step 4542/19560 | loss 3.540149 (-1.41z)| norm 0.2799 (-0.06z)| lr 5.41e-04 | 323.08 ms | 52.2% bf16 MFU | 1623506 tok/s step 4543/19560 | loss 3.559802 (-0.96z)| norm 0.2916 (+0.50z)| lr 5.41e-04 | 322.56 ms | 52.3% bf16 MFU | 1623600 tok/s step 4544/19560 | loss 3.600287 
(-0.08z)| norm 0.3605 (+3.59z)| lr 5.41e-04 | 323.71 ms | 52.1% bf16 MFU | 1623400 tok/s step 4545/19560 | loss 3.575628 (-0.62z)| norm 0.2917 (+0.46z)| lr 5.41e-04 | 323.44 ms | 52.2% bf16 MFU | 1623279 tok/s step 4546/19560 | loss 3.585833 (-0.39z)| norm 0.2681 (-0.61z)| lr 5.41e-04 | 322.49 ms | 52.3% bf16 MFU | 1623402 tok/s step 4547/19560 | loss 3.577451 (-0.57z)| norm 0.2614 (-0.90z)| lr 5.41e-04 | 323.19 ms | 52.2% bf16 MFU | 1623343 tok/s step 4548/19560 | loss 3.617676 (+0.33z)| norm 0.2770 (-0.18z)| lr 5.40e-04 | 323.50 ms | 52.2% bf16 MFU | 1623210 tok/s step 4549/19560 | loss 3.605289 (+0.06z)| norm 0.2912 (+0.49z)| lr 5.40e-04 | 323.34 ms | 52.2% bf16 MFU | 1623124 tok/s step 4550/19560 | loss 3.559408 (-0.96z)| norm 0.2649 (-0.72z)| lr 5.40e-04 | 322.73 ms | 52.3% bf16 MFU | 1623193 tok/s step 4551/19560 | loss 3.576767 (-0.57z)| norm 0.2808 (+0.02z)| lr 5.40e-04 | 322.57 ms | 52.3% bf16 MFU | 1623300 tok/s step 4552/19560 | loss 3.538781 (-1.39z)| norm 0.3079 (+1.29z)| lr 5.40e-04 | 322.91 ms | 52.3% bf16 MFU | 1623316 tok/s step 4553/19560 | loss 3.754114 (+3.20z)| norm 0.2705 (-0.46z)| lr 5.40e-04 | 322.80 ms | 52.3% bf16 MFU | 1623359 tok/s step 4554/19560 | loss 3.565832 (-0.79z)| norm 0.3080 (+1.28z)| lr 5.40e-04 | 322.42 ms | 52.3% bf16 MFU | 1623497 tok/s step 4555/19560 | loss 3.625802 (+0.48z)| norm 0.2869 (+0.29z)| lr 5.40e-04 | 322.90 ms | 52.3% bf16 MFU | 1623507 tok/s step 4556/19560 | loss 3.623303 (+0.42z)| norm 0.3147 (+1.56z)| lr 5.40e-04 | 322.84 ms | 52.3% bf16 MFU | 1623531 tok/s step 4557/19560 | loss 3.574011 (-0.62z)| norm 0.2946 (+0.62z)| lr 5.40e-04 | 322.39 ms | 52.4% bf16 MFU | 1623668 tok/s step 4558/19560 | loss 3.573852 (-0.62z)| norm 0.2518 (-1.37z)| lr 5.40e-04 | 323.59 ms | 52.2% bf16 MFU | 1623496 tok/s step 4559/19560 | loss 3.585269 (-0.38z)| norm 0.2666 (-0.68z)| lr 5.40e-04 | 322.67 ms | 52.3% bf16 MFU | 1623564 tok/s step 4560/19560 | loss 3.674439 (+1.49z)| norm 0.2551 (-1.21z)| lr 5.40e-04 | 322.99 ms | 52.3% 
bf16 MFU | 1623548 tok/s step 4561/19560 | loss 3.585932 (-0.38z)| norm 0.2652 (-0.75z)| lr 5.40e-04 | 322.50 ms | 52.3% bf16 MFU | 1623656 tok/s step 4562/19560 | loss 3.587245 (-0.36z)| norm 0.2715 (-0.47z)| lr 5.40e-04 | 323.14 ms | 52.2% bf16 MFU | 1623596 tok/s step 4563/19560 | loss 3.563575 (-0.85z)| norm 0.2436 (-1.76z)| lr 5.40e-04 | 322.85 ms | 52.3% bf16 MFU | 1623613 tok/s step 4564/19560 | loss 3.693384 (+1.86z)| norm 0.2470 (-1.58z)| lr 5.40e-04 | 322.81 ms | 52.3% bf16 MFU | 1623640 tok/s step 4565/19560 | loss 3.552518 (-1.07z)| norm 0.2494 (-1.45z)| lr 5.40e-04 | 323.11 ms | 52.2% bf16 MFU | 1623590 tok/s step 4566/19560 | loss 3.611734 (+0.17z)| norm 0.2633 (-0.80z)| lr 5.40e-04 | 322.82 ms | 52.3% bf16 MFU | 1623616 tok/s step 4567/19560 | loss 3.625111 (+0.44z)| norm 0.2523 (-1.28z)| lr 5.40e-04 | 323.51 ms | 52.2% bf16 MFU | 1623466 tok/s step 4568/19560 | loss 3.615442 (+0.23z)| norm 0.2577 (-1.02z)| lr 5.40e-04 | 323.35 ms | 52.2% bf16 MFU | 1623363 tok/s step 4569/19560 | loss 3.635651 (+0.64z)| norm 0.2772 (-0.13z)| lr 5.40e-04 | 323.05 ms | 52.2% bf16 MFU | 1623341 tok/s step 4570/19560 | loss 3.573121 (-0.66z)| norm 0.3177 (+1.69z)| lr 5.40e-04 | 322.95 ms | 52.3% bf16 MFU | 1623344 tok/s step 4571/19560 | loss 3.564879 (-0.82z)| norm 0.3172 (+1.63z)| lr 5.40e-04 | 323.40 ms | 52.2% bf16 MFU | 1623237 tok/s step 4572/19560 | loss 3.549466 (-1.13z)| norm 0.2716 (-0.42z)| lr 5.40e-04 | 322.84 ms | 52.3% bf16 MFU | 1623275 tok/s step 4573/19560 | loss 3.548316 (-1.14z)| norm 0.2519 (-1.30z)| lr 5.40e-04 | 323.12 ms | 52.2% bf16 MFU | 1623239 tok/s step 4574/19560 | loss 3.610386 (+0.14z)| norm 0.2647 (-0.71z)| lr 5.40e-04 | 322.52 ms | 52.3% bf16 MFU | 1623357 tok/s step 4575/19560 | loss 3.615915 (+0.25z)| norm 0.2436 (-1.64z)| lr 5.40e-04 | 323.25 ms | 52.2% bf16 MFU | 1623286 tok/s step 4576/19560 | loss 3.547756 (-1.14z)| norm 0.2576 (-1.02z)| lr 5.40e-04 | 322.83 ms | 52.3% bf16 MFU | 1623325 tok/s step 4577/19560 | loss 3.611266 
(+0.16z)| norm 0.2377 (-1.89z)| lr 5.40e-04 | 322.89 ms | 52.3% bf16 MFU | 1623346 tok/s step 4578/19560 | loss 3.590649 (-0.27z)| norm 0.2571 (-1.03z)| lr 5.40e-04 | 322.88 ms | 52.3% bf16 MFU | 1623368 tok/s step 4579/19560 | loss 3.555427 (-0.99z)| norm 0.2540 (-1.15z)| lr 5.40e-04 | 322.92 ms | 52.3% bf16 MFU | 1623380 tok/s step 4580/19560 | loss 3.616011 (+0.26z)| norm 0.2825 (+0.11z)| lr 5.40e-04 | 323.25 ms | 52.2% bf16 MFU | 1623307 tok/s step 4581/19560 | loss 3.645131 (+0.85z)| norm 0.2697 (-0.45z)| lr 5.39e-04 | 322.36 ms | 52.4% bf16 MFU | 1623461 tok/s step 4582/19560 | loss 3.598383 (-0.09z)| norm 0.2503 (-1.28z)| lr 5.39e-04 | 323.29 ms | 52.2% bf16 MFU | 1623374 tok/s step 4583/19560 | loss 3.693352 (+1.86z)| norm 0.6999 (+9.59z)| lr 5.39e-04 | 322.82 ms | 52.3% bf16 MFU | 1623409 tok/s step 4584/19560 | loss 3.567060 (-0.76z)| norm 0.3848 (+2.28z)| lr 5.39e-04 | 322.78 ms | 52.3% bf16 MFU | 1623452 tok/s step 4585/19560 | loss 3.591753 (-0.24z)| norm 0.3535 (+1.54z)| lr 5.39e-04 | 322.62 ms | 52.3% bf16 MFU | 1623534 tok/s step 4586/19560 | loss 3.620050 (+0.36z)| norm 0.2943 (+0.22z)| lr 5.39e-04 | 323.11 ms | 52.2% bf16 MFU | 1623488 tok/s step 4587/19560 | loss 3.577180 (-0.54z)| norm 0.2970 (+0.28z)| lr 5.39e-04 | 323.24 ms | 52.2% bf16 MFU | 1623412 tok/s step 4588/19560 | loss 3.560382 (-0.88z)| norm 0.3126 (+0.62z)| lr 5.39e-04 | 323.08 ms | 52.2% bf16 MFU | 1623380 tok/s step 4589/19560 | loss 3.587319 (-0.31z)| norm 0.3124 (+0.61z)| lr 5.39e-04 | 322.42 ms | 52.3% bf16 MFU | 1623516 tok/s step 4590/19560 | loss 3.594256 (-0.16z)| norm 0.3010 (+0.35z)| lr 5.39e-04 | 323.04 ms | 52.2% bf16 MFU | 1623489 tok/s step 4591/19560 | loss 3.582680 (-0.39z)| norm 0.2712 (-0.31z)| lr 5.39e-04 | 322.54 ms | 52.3% bf16 MFU | 1623589 tok/s step 4592/19560 | loss 3.599794 (-0.02z)| norm 0.2763 (-0.19z)| lr 5.39e-04 | 322.57 ms | 52.3% bf16 MFU | 1623676 tok/s step 4593/19560 | loss 3.587986 (-0.26z)| norm 0.3152 (+0.67z)| lr 5.39e-04 | 322.44 ms | 52.3% 
bf16 MFU | 1623793 tok/s step 4594/19560 | loss 3.603683 (+0.08z)| norm 0.3127 (+0.62z)| lr 5.39e-04 | 322.97 ms | 52.3% bf16 MFU | 1623770 tok/s step 4595/19560 | loss 3.706323 (+2.23z)| norm 0.2690 (-0.35z)| lr 5.39e-04 | 322.33 ms | 52.4% bf16 MFU | 1623909 tok/s step 4596/19560 | loss 3.609588 (+0.18z)| norm 0.2596 (-0.55z)| lr 5.39e-04 | 322.76 ms | 52.3% bf16 MFU | 1623934 tok/s step 4597/19560 | loss 3.660240 (+1.23z)| norm 0.2659 (-0.41z)| lr 5.39e-04 | 323.04 ms | 52.2% bf16 MFU | 1623887 tok/s step 4598/19560 | loss 3.573137 (-0.60z)| norm 0.2666 (-0.39z)| lr 5.39e-04 | 323.27 ms | 52.2% bf16 MFU | 1623785 tok/s step 4599/19560 | loss 3.587550 (-0.29z)| norm 0.2593 (-0.55z)| lr 5.39e-04 | 322.72 ms | 52.3% bf16 MFU | 1623825 tok/s step 4600/19560 | loss 3.578443 (-0.48z)| norm 0.2605 (-0.52z)| lr 5.39e-04 | 322.63 ms | 52.3% bf16 MFU | 1623885 tok/s step 4601/19560 | loss 3.623243 (+0.49z)| norm 0.2753 (-0.20z)| lr 5.39e-04 | 323.27 ms | 52.2% bf16 MFU | 1623783 tok/s step 4602/19560 | loss 3.617306 (+0.35z)| norm 0.2745 (-0.22z)| lr 5.39e-04 | 322.43 ms | 52.3% bf16 MFU | 1623897 tok/s step 4603/19560 | loss 3.598840 (-0.04z)| norm 0.2718 (-0.28z)| lr 5.39e-04 | 322.72 ms | 52.3% bf16 MFU | 1623932 tok/s step 4604/19560 | loss 3.583447 (-0.37z)| norm 0.2698 (-0.32z)| lr 5.39e-04 | 323.04 ms | 52.2% bf16 MFU | 1623884 tok/s step 4605/19560 | loss 3.567991 (-0.71z)| norm 0.2690 (-0.35z)| lr 5.39e-04 | 322.72 ms | 52.3% bf16 MFU | 1623918 tok/s step 4606/19560 | loss 3.665652 (+1.39z)| norm 0.2620 (-0.50z)| lr 5.39e-04 | 323.01 ms | 52.2% bf16 MFU | 1623878 tok/s step 4607/19560 | loss 3.587442 (-0.28z)| norm 0.2528 (-0.71z)| lr 5.39e-04 | 322.56 ms | 52.3% bf16 MFU | 1623954 tok/s step 4608/19560 | loss 3.596278 (-0.09z)| norm 0.2836 (-0.02z)| lr 5.39e-04 | 322.96 ms | 52.3% bf16 MFU | 1623924 tok/s step 4609/19560 | loss 3.589587 (-0.24z)| norm 0.2800 (-0.10z)| lr 5.39e-04 | 323.59 ms | 52.2% bf16 MFU | 1623740 tok/s step 4610/19560 | loss 3.625787 
(+0.54z)| norm 0.3069 (+0.50z)| lr 5.39e-04 | 322.63 ms | 52.3% bf16 MFU | 1623806 tok/s step 4611/19560 | loss 3.602898 (+0.04z)| norm 0.2716 (-0.28z)| lr 5.39e-04 | 322.89 ms | 52.3% bf16 MFU | 1623803 tok/s step 4612/19560 | loss 3.707010 (+2.24z)| norm 0.2892 (+0.11z)| lr 5.39e-04 | 322.80 ms | 52.3% bf16 MFU | 1623821 tok/s step 4613/19560 | loss 3.539931 (-1.31z)| norm 0.2710 (-0.29z)| lr 5.39e-04 | 322.69 ms | 52.3% bf16 MFU | 1623868 tok/s step 4614/19560 | loss 3.601176 (-0.00z)| norm 0.3022 (+0.42z)| lr 5.38e-04 | 322.96 ms | 52.3% bf16 MFU | 1623844 tok/s step 4615/19560 | loss 3.572410 (-0.60z)| norm 0.2555 (-0.62z)| lr 5.38e-04 | 322.83 ms | 52.3% bf16 MFU | 1623854 tok/s step 4616/19560 | loss 3.608420 (+0.15z)| norm 0.3016 (+0.43z)| lr 5.38e-04 | 322.75 ms | 52.3% bf16 MFU | 1623884 tok/s step 4617/19560 | loss 3.563900 (-0.79z)| norm 0.3124 (+0.67z)| lr 5.38e-04 | 322.61 ms | 52.3% bf16 MFU | 1623946 tok/s step 4618/19560 | loss 3.554888 (-0.98z)| norm 0.2809 (-0.05z)| lr 5.38e-04 | 322.42 ms | 52.3% bf16 MFU | 1624054 tok/s step 4619/19560 | loss 3.569369 (-0.66z)| norm 0.2613 (-0.49z)| lr 5.38e-04 | 322.56 ms | 52.3% bf16 MFU | 1624122 tok/s step 4620/19560 | loss 3.651784 (+1.09z)| norm 0.2546 (-0.63z)| lr 5.38e-04 | 323.53 ms | 52.2% bf16 MFU | 1623940 tok/s step 4621/19560 | loss 3.597777 (-0.06z)| norm 0.2801 (-0.06z)| lr 5.38e-04 | 322.71 ms | 52.3% bf16 MFU | 1623975 tok/s step 4622/19560 | loss 3.584711 (-0.34z)| norm 0.2881 (+0.12z)| lr 5.38e-04 | 322.67 ms | 52.3% bf16 MFU | 1624018 tok/s step 4623/19560 | loss 3.562605 (-0.80z)| norm 0.2878 (+0.11z)| lr 5.38e-04 | 323.16 ms | 52.2% bf16 MFU | 1623936 tok/s step 4624/19560 | loss 3.553930 (-0.99z)| norm 0.2920 (+0.20z)| lr 5.38e-04 | 322.59 ms | 52.3% bf16 MFU | 1624001 tok/s step 4625/19560 | loss 3.593531 (-0.13z)| norm 0.2537 (-0.66z)| lr 5.38e-04 | 322.54 ms | 52.3% bf16 MFU | 1624076 tok/s step 4626/19560 | loss 3.621909 (+0.47z)| norm 0.2729 (-0.23z)| lr 5.38e-04 | 323.07 ms | 52.2% 
bf16 MFU | 1624015 tok/s step 4627/19560 | loss 3.561170 (-0.84z)| norm 0.2583 (-0.55z)| lr 5.38e-04 | 323.36 ms | 52.2% bf16 MFU | 1623884 tok/s step 4628/19560 | loss 3.570371 (-0.65z)| norm 0.2617 (-0.48z)| lr 5.38e-04 | 322.70 ms | 52.3% bf16 MFU | 1623924 tok/s step 4629/19560 | loss 3.588876 (-0.25z)| norm 0.2746 (-0.19z)| lr 5.38e-04 | 322.78 ms | 52.3% bf16 MFU | 1623942 tok/s step 4630/19560 | loss 3.641197 (+0.87z)| norm 0.2621 (-0.47z)| lr 5.38e-04 | 322.86 ms | 52.3% bf16 MFU | 1623940 tok/s step 4631/19560 | loss 3.651772 (+1.08z)| norm 0.2637 (-0.44z)| lr 5.38e-04 | 322.06 ms | 52.4% bf16 MFU | 1624139 tok/s step 4632/19560 | loss 3.726856 (+2.62z)| norm 0.2837 (+0.01z)| lr 5.38e-04 | 322.65 ms | 52.3% bf16 MFU | 1624180 tok/s step 4633/19560 | loss 3.625121 (+0.48z)| norm 0.2691 (-0.32z)| lr 5.38e-04 | 323.23 ms | 52.2% bf16 MFU | 1624074 tok/s step 4634/19560 | loss 3.562155 (-0.84z)| norm 0.2612 (-0.50z)| lr 5.38e-04 | 322.64 ms | 52.3% bf16 MFU | 1624119 tok/s step 4635/19560 | loss 3.546630 (-1.16z)| norm 0.2851 (+0.04z)| lr 5.38e-04 | 322.79 ms | 52.3% bf16 MFU | 1624124 tok/s step 4636/19560 | loss 3.852825 (+4.77z)| norm 0.3309 (+1.08z)| lr 5.38e-04 | 322.84 ms | 52.3% bf16 MFU | 1624116 tok/s step 4637/19560 | loss 3.524935 (-1.50z)| norm 0.3165 (+0.74z)| lr 5.38e-04 | 322.99 ms | 52.3% bf16 MFU | 1624072 tok/s step 4638/19560 | loss 3.562304 (-0.79z)| norm 0.3084 (+0.55z)| lr 5.38e-04 | 322.79 ms | 52.3% bf16 MFU | 1624081 tok/s step 4639/19560 | loss 3.612137 (+0.16z)| norm 0.3152 (+0.69z)| lr 5.38e-04 | 322.45 ms | 52.3% bf16 MFU | 1624174 tok/s step 4640/19560 | loss 3.575202 (-0.56z)| norm 0.3538 (+1.54z)| lr 5.38e-04 | 323.00 ms | 52.3% bf16 MFU | 1624123 tok/s step 4641/19560 | loss 3.582076 (-0.43z)| norm 0.3379 (+1.16z)| lr 5.38e-04 | 322.64 ms | 52.3% bf16 MFU | 1624167 tok/s step 4642/19560 | loss 3.577291 (-0.51z)| norm 0.3396 (+1.18z)| lr 5.38e-04 | 323.01 ms | 52.3% bf16 MFU | 1624117 tok/s step 4643/19560 | loss 3.528325 
(-1.44z)| norm 0.3045 (+0.40z)| lr 5.38e-04 | 322.56 ms | 52.3% bf16 MFU | 1624180 tok/s step 4644/19560 | loss 3.548182 (-1.05z)| norm 0.2889 (+0.05z)| lr 5.38e-04 | 322.48 ms | 52.3% bf16 MFU | 1624261 tok/s step 4645/19560 | loss 3.510258 (-1.74z)| norm 0.2740 (-0.29z)| lr 5.38e-04 | 323.06 ms | 52.2% bf16 MFU | 1624191 tok/s step 4646/19560 | loss 3.512586 (-1.68z)| norm 0.2735 (-0.29z)| lr 5.38e-04 | 323.63 ms | 52.1% bf16 MFU | 1623983 tok/s step 4647/19560 | loss 3.599269 (-0.02z)| norm 0.3181 (+0.70z)| lr 5.37e-04 | 322.93 ms | 52.3% bf16 MFU | 1623961 tok/s step 4648/19560 | loss 3.548632 (-0.97z)| norm 0.2447 (-0.93z)| lr 5.37e-04 | 322.39 ms | 52.3% bf16 MFU | 1624075 tok/s step 4649/19560 | loss 3.531399 (-1.29z)| norm 0.2749 (-0.25z)| lr 5.37e-04 | 322.24 ms | 52.4% bf16 MFU | 1624220 tok/s step 4650/19560 | loss 3.766064 (+3.11z)| norm 0.2686 (-0.39z)| lr 5.37e-04 | 323.44 ms | 52.2% bf16 MFU | 1624058 tok/s step 4651/19560 | loss 3.602116 (+0.05z)| norm 0.2815 (-0.10z)| lr 5.37e-04 | 322.49 ms | 52.3% bf16 MFU | 1624143 tok/s step 4652/19560 | loss 3.562711 (-0.68z)| norm 0.2855 (-0.02z)| lr 5.37e-04 | 322.99 ms | 52.3% bf16 MFU | 1624097 tok/s step 4653/19560 | loss 3.577845 (-0.39z)| norm 0.3185 (+0.71z)| lr 5.37e-04 | 322.49 ms | 52.3% bf16 MFU | 1624179 tok/s step 4654/19560 | loss 3.548726 (-0.92z)| norm 0.2727 (-0.31z)| lr 5.37e-04 | 322.53 ms | 52.3% bf16 MFU | 1624247 tok/s step 4655/19560 | loss 3.609303 (+0.21z)| norm 0.2472 (-0.87z)| lr 5.37e-04 | 323.20 ms | 52.2% bf16 MFU | 1624144 tok/s step 4656/19560 | loss 3.621356 (+0.43z)| norm 0.2629 (-0.52z)| lr 5.37e-04 | 322.98 ms | 52.3% bf16 MFU | 1624100 tok/s step 4657/19560 | loss 3.607958 (+0.18z)| norm 0.2776 (-0.18z)| lr 5.37e-04 | 323.01 ms | 52.2% bf16 MFU | 1624051 tok/s step 4658/19560 | loss 3.537790 (-1.13z)| norm 0.2419 (-0.96z)| lr 5.37e-04 | 323.25 ms | 52.2% bf16 MFU | 1623945 tok/s step 4659/19560 | loss 3.594902 (-0.06z)| norm 0.2510 (-0.75z)| lr 5.37e-04 | 322.72 ms | 52.3% 
bf16 MFU | 1623978 tok/s step 4660/19560 | loss 3.580780 (-0.31z)| norm 0.2554 (-0.65z)| lr 5.37e-04 | 322.42 ms | 52.3% bf16 MFU | 1624085 tok/s step 4661/19560 | loss 3.622022 (+0.47z)| norm 0.2540 (-0.67z)| lr 5.37e-04 | 322.62 ms | 52.3% bf16 MFU | 1624137 tok/s step 4662/19560 | loss 3.563051 (-0.65z)| norm 0.2784 (-0.14z)| lr 5.37e-04 | 323.02 ms | 52.2% bf16 MFU | 1624085 tok/s step 4663/19560 | loss 3.563556 (-0.63z)| norm 0.2566 (-0.61z)| lr 5.37e-04 | 322.67 ms | 52.3% bf16 MFU | 1624122 tok/s step 4664/19560 | loss 3.522380 (-1.38z)| norm 0.2568 (-0.60z)| lr 5.37e-04 | 322.61 ms | 52.3% bf16 MFU | 1624173 tok/s step 4665/19560 | loss 3.567149 (-0.53z)| norm 0.2799 (-0.09z)| lr 5.37e-04 | 322.31 ms | 52.4% bf16 MFU | 1624297 tok/s step 4666/19560 | loss 3.556747 (-0.74z)| norm 0.3240 (+0.87z)| lr 5.37e-04 | 322.92 ms | 52.3% bf16 MFU | 1624262 tok/s step 4667/19560 | loss 3.595470 (-0.01z)| norm 0.3111 (+0.59z)| lr 5.37e-04 | 323.39 ms | 52.2% bf16 MFU | 1624109 tok/s step 4668/19560 | loss 3.625587 (+0.61z)| norm 0.3029 (+0.41z)| lr 5.37e-04 | 322.61 ms | 52.3% bf16 MFU | 1624161 tok/s step 4669/19560 | loss 3.536304 (-1.17z)| norm 0.2863 (+0.05z)| lr 5.37e-04 | 322.72 ms | 52.3% bf16 MFU | 1624181 tok/s step 4670/19560 | loss 3.608769 (+0.27z)| norm 0.3637 (+1.71z)| lr 5.37e-04 | 322.43 ms | 52.3% bf16 MFU | 1624274 tok/s step 4671/19560 | loss 3.681275 (+1.69z)| norm 0.3148 (+0.64z)| lr 5.37e-04 | 323.49 ms | 52.2% bf16 MFU | 1624097 tok/s step 4672/19560 | loss 3.623213 (+0.53z)| norm 0.3123 (+0.60z)| lr 5.37e-04 | 322.79 ms | 52.3% bf16 MFU | 1624104 tok/s step 4673/19560 | loss 3.598323 (+0.03z)| norm 0.3122 (+0.60z)| lr 5.37e-04 | 322.93 ms | 52.3% bf16 MFU | 1624076 tok/s step 4674/19560 | loss 3.599467 (+0.05z)| norm 0.2756 (-0.20z)| lr 5.37e-04 | 323.04 ms | 52.2% bf16 MFU | 1624022 tok/s step 4675/19560 | loss 3.626198 (+0.58z)| norm 0.2923 (+0.16z)| lr 5.37e-04 | 322.90 ms | 52.3% bf16 MFU | 1624005 tok/s step 4676/19560 | loss 3.598665 
(+0.03z)| norm 0.2646 (-0.44z)| lr 5.37e-04 | 322.81 ms | 52.3% bf16 MFU | 1624011 tok/s step 4677/19560 | loss 3.627992 (+0.61z)| norm 0.2884 (+0.08z)| lr 5.37e-04 | 323.34 ms | 52.2% bf16 MFU | 1623883 tok/s step 4678/19560 | loss 3.533018 (-1.26z)| norm 0.2985 (+0.29z)| lr 5.37e-04 | 323.03 ms | 52.2% bf16 MFU | 1623840 tok/s step 4679/19560 | loss 3.587343 (-0.19z)| norm 0.2649 (-0.44z)| lr 5.37e-04 | 322.68 ms | 52.3% bf16 MFU | 1623887 tok/s step 4680/19560 | loss 3.581693 (-0.31z)| norm 0.2575 (-0.59z)| lr 5.36e-04 | 322.11 ms | 52.4% bf16 MFU | 1624077 tok/s step 4681/19560 | loss 3.633577 (+0.77z)| norm 0.2623 (-0.49z)| lr 5.36e-04 | 323.45 ms | 52.2% bf16 MFU | 1623920 tok/s step 4682/19560 | loss 3.596406 (-0.01z)| norm 0.2497 (-0.75z)| lr 5.36e-04 | 323.15 ms | 52.2% bf16 MFU | 1623845 tok/s step 4683/19560 | loss 3.459914 (-2.72z)| norm 0.2606 (-0.51z)| lr 5.36e-04 | 322.91 ms | 52.3% bf16 MFU | 1623835 tok/s step 4684/19560 | loss 3.528020 (-1.33z)| norm 0.2869 (+0.07z)| lr 5.36e-04 | 322.80 ms | 52.3% bf16 MFU | 1623853 tok/s step 4685/19560 | loss 3.570405 (-0.48z)| norm 0.2785 (-0.11z)| lr 5.36e-04 | 322.79 ms | 52.3% bf16 MFU | 1623873 tok/s step 4686/19560 | loss 3.557185 (-0.74z)| norm 0.2579 (-0.56z)| lr 5.36e-04 | 323.19 ms | 52.2% bf16 MFU | 1623790 tok/s step 4687/19560 | loss 3.587110 (-0.15z)| norm 0.3125 (+0.62z)| lr 5.36e-04 | 323.04 ms | 52.2% bf16 MFU | 1623750 tok/s step 4688/19560 | loss 3.618086 (+0.48z)| norm 0.3069 (+0.49z)| lr 5.36e-04 | 322.72 ms | 52.3% bf16 MFU | 1623792 tok/s step 4689/19560 | loss 3.576736 (-0.35z)| norm 0.4042 (+2.52z)| lr 5.36e-04 | 322.41 ms | 52.3% bf16 MFU | 1623910 tok/s step 4690/19560 | loss 3.532905 (-1.21z)| norm 0.2738 (-0.25z)| lr 5.36e-04 | 322.75 ms | 52.3% bf16 MFU | 1623937 tok/s step 4691/19560 | loss 3.515144 (-1.55z)| norm 0.2763 (-0.20z)| lr 5.36e-04 | 323.59 ms | 52.2% bf16 MFU | 1623752 tok/s step 4692/19560 | loss 3.589289 (-0.06z)| norm 0.2847 (-0.03z)| lr 5.36e-04 | 323.19 ms | 52.2% 
bf16 MFU | 1623675 tok/s step 4693/19560 | loss 3.619322 (+0.53z)| norm 0.2644 (-0.47z)| lr 5.36e-04 | 322.57 ms | 52.3% bf16 MFU | 1623759 tok/s step 4694/19560 | loss 3.554792 (-0.76z)| norm 0.2802 (-0.13z)| lr 5.36e-04 | 322.62 ms | 52.3% bf16 MFU | 1623825 tok/s step 4695/19560 | loss 3.624643 (+0.65z)| norm 0.3000 (+0.29z)| lr 5.36e-04 | 322.91 ms | 52.3% bf16 MFU | 1623816 tok/s step 4696/19560 | loss 3.589857 (-0.05z)| norm 0.2628 (-0.51z)| lr 5.36e-04 | 323.30 ms | 52.2% bf16 MFU | 1623709 tok/s step 4697/19560 | loss 3.559165 (-0.65z)| norm 0.2651 (-0.46z)| lr 5.36e-04 | 323.06 ms | 52.2% bf16 MFU | 1623667 tok/s step 4698/19560 | loss 3.610356 (+0.37z)| norm 0.2778 (-0.18z)| lr 5.36e-04 | 322.53 ms | 52.3% bf16 MFU | 1623762 tok/s step 4699/19560 | loss 3.564275 (-0.56z)| norm 0.2676 (-0.39z)| lr 5.36e-04 | 322.83 ms | 52.3% bf16 MFU | 1623775 tok/s step 4700/19560 | loss 3.522334 (-1.39z)| norm 0.2681 (-0.38z)| lr 5.36e-04 | 323.30 ms | 52.2% bf16 MFU | 1623671 tok/s step 4701/19560 | loss 3.582036 (-0.20z)| norm 0.2782 (-0.17z)| lr 5.36e-04 | 323.44 ms | 52.2% bf16 MFU | 1623537 tok/s step 4702/19560 | loss 3.561688 (-0.60z)| norm 0.2908 (+0.10z)| lr 5.36e-04 | 322.84 ms | 52.3% bf16 MFU | 1623560 tok/s step 4703/19560 | loss 3.673246 (+1.62z)| norm 0.2747 (-0.26z)| lr 5.36e-04 | 322.57 ms | 52.3% bf16 MFU | 1623650 tok/s step 4704/19560 | loss 3.534267 (-1.15z)| norm 0.2663 (-0.44z)| lr 5.36e-04 | 322.39 ms | 52.3% bf16 MFU | 1623779 tok/s step 4705/19560 | loss 3.597525 (+0.11z)| norm 0.2729 (-0.30z)| lr 5.36e-04 | 322.88 ms | 52.3% bf16 MFU | 1623781 tok/s step 4706/19560 | loss 3.651686 (+1.17z)| norm 0.2748 (-0.27z)| lr 5.36e-04 | 323.29 ms | 52.2% bf16 MFU | 1623678 tok/s step 4707/19560 | loss 3.558134 (-0.68z)| norm 0.2603 (-0.58z)| lr 5.36e-04 | 322.45 ms | 52.3% bf16 MFU | 1623792 tok/s step 4708/19560 | loss 3.576828 (-0.30z)| norm 0.2854 (-0.04z)| lr 5.36e-04 | 323.16 ms | 52.2% bf16 MFU | 1623721 tok/s step 4709/19560 | loss 3.902802 
(+5.39z)| norm 0.2819 (-0.12z)| lr 5.36e-04 | 322.77 ms | 52.3% bf16 MFU | 1623751 tok/s step 4710/19560 | loss 3.585321 (-0.15z)| norm 0.2799 (-0.16z)| lr 5.36e-04 | 323.10 ms | 52.2% bf16 MFU | 1623698 tok/s step 4711/19560 | loss 3.627463 (+0.60z)| norm 0.2736 (-0.38z)| lr 5.36e-04 | 322.94 ms | 52.3% bf16 MFU | 1623687 tok/s step 4712/19560 | loss 3.612420 (+0.33z)| norm 0.3457 (+2.32z)| lr 5.35e-04 | 322.57 ms | 52.3% bf16 MFU | 1623770 tok/s step 4713/19560 | loss 3.635081 (+0.72z)| norm 0.3161 (+1.25z)| lr 5.35e-04 | 322.91 ms | 52.3% bf16 MFU | 1623764 tok/s step 4714/19560 | loss 3.588962 (-0.09z)| norm 0.2657 (-0.68z)| lr 5.35e-04 | 322.50 ms | 52.3% bf16 MFU | 1623862 tok/s step 4715/19560 | loss 3.610085 (+0.28z)| norm 0.2865 (+0.12z)| lr 5.35e-04 | 323.43 ms | 52.2% bf16 MFU | 1623720 tok/s step 4716/19560 | loss 3.592628 (-0.03z)| norm 0.2853 (+0.09z)| lr 5.35e-04 | 322.88 ms | 52.3% bf16 MFU | 1623724 tok/s step 4717/19560 | loss 3.562859 (-0.55z)| norm 0.2897 (+0.27z)| lr 5.35e-04 | 322.81 ms | 52.3% bf16 MFU | 1623746 tok/s step 4718/19560 | loss 3.636452 (+0.74z)| norm 0.2511 (-1.21z)| lr 5.35e-04 | 323.18 ms | 52.2% bf16 MFU | 1623674 tok/s step 4719/19560 | loss 3.623971 (+0.51z)| norm 0.2877 (+0.20z)| lr 5.35e-04 | 322.86 ms | 52.3% bf16 MFU | 1623684 tok/s step 4720/19560 | loss 3.586424 (-0.15z)| norm 0.2944 (+0.45z)| lr 5.35e-04 | 323.30 ms | 52.2% bf16 MFU | 1623584 tok/s step 4721/19560 | loss 3.578290 (-0.29z)| norm 0.2687 (-0.53z)| lr 5.35e-04 | 322.48 ms | 52.3% bf16 MFU | 1623696 tok/s step 4722/19560 | loss 3.633223 (+0.67z)| norm 0.2785 (-0.14z)| lr 5.35e-04 | 322.77 ms | 52.3% bf16 MFU | 1623727 tok/s step 4723/19560 | loss 3.566336 (-0.49z)| norm 0.2718 (-0.40z)| lr 5.35e-04 | 323.26 ms | 52.2% bf16 MFU | 1623636 tok/s step 4724/19560 | loss 3.616079 (+0.40z)| norm 0.2621 (-0.78z)| lr 5.35e-04 | 322.80 ms | 52.3% bf16 MFU | 1623664 tok/s step 4725/19560 | loss 3.666106 (+1.28z)| norm 0.2648 (-0.68z)| lr 5.35e-04 | 322.85 ms | 52.3% 
bf16 MFU | 1623676 tok/s step 4726/19560 | loss 3.535304 (-1.03z)| norm 0.2477 (-1.33z)| lr 5.35e-04 | 323.09 ms | 52.2% bf16 MFU | 1623629 tok/s step 4727/19560 | loss 3.592214 (-0.03z)| norm 0.2572 (-0.96z)| lr 5.35e-04 | 323.23 ms | 52.2% bf16 MFU | 1623548 tok/s step 4728/19560 | loss 3.537985 (-0.98z)| norm 0.2685 (-0.53z)| lr 5.35e-04 | 323.41 ms | 52.2% bf16 MFU | 1623426 tok/s step 4729/19560 | loss 3.551153 (-0.73z)| norm 0.2659 (-0.62z)| lr 5.35e-04 | 322.61 ms | 52.3% bf16 MFU | 1623512 tok/s step 4730/19560 | loss 3.570170 (-0.39z)| norm 0.2780 (-0.16z)| lr 5.35e-04 | 323.05 ms | 52.2% bf16 MFU | 1623482 tok/s step 4731/19560 | loss 3.598907 (+0.12z)| norm 0.2923 (+0.40z)| lr 5.35e-04 | 322.77 ms | 52.3% bf16 MFU | 1623527 tok/s step 4732/19560 | loss 3.675697 (+1.44z)| norm 0.2654 (-0.65z)| lr 5.35e-04 | 322.73 ms | 52.3% bf16 MFU | 1623578 tok/s step 4733/19560 | loss 3.557533 (-0.62z)| norm 0.2940 (+0.45z)| lr 5.35e-04 | 323.14 ms | 52.2% bf16 MFU | 1623523 tok/s step 4734/19560 | loss 3.592129 (-0.01z)| norm 0.3045 (+0.85z)| lr 5.35e-04 | 322.77 ms | 52.3% bf16 MFU | 1623565 tok/s step 4735/19560 | loss 3.586573 (-0.10z)| norm 0.2946 (+0.45z)| lr 5.35e-04 | 323.87 ms | 52.1% bf16 MFU | 1623329 tok/s step 4736/19560 | loss 3.563667 (-0.50z)| norm 0.2618 (-0.82z)| lr 5.35e-04 | 323.20 ms | 52.2% bf16 MFU | 1623272 tok/s step 4737/19560 | loss 3.597226 (+0.09z)| norm 0.2572 (-0.98z)| lr 5.35e-04 | 322.46 ms | 52.3% bf16 MFU | 1623403 tok/s step 4738/19560 | loss 3.635536 (+0.76z)| norm 0.2746 (-0.30z)| lr 5.35e-04 | 323.09 ms | 52.2% bf16 MFU | 1623369 tok/s step 4739/19560 | loss 3.606118 (+0.24z)| norm 0.2942 (+0.45z)| lr 5.35e-04 | 323.49 ms | 52.2% bf16 MFU | 1623238 tok/s step 4740/19560 | loss 3.571105 (-0.36z)| norm 0.2543 (-1.08z)| lr 5.35e-04 | 323.07 ms | 52.2% bf16 MFU | 1623218 tok/s step 4741/19560 | loss 3.586225 (-0.10z)| norm 0.2899 (+0.29z)| lr 5.35e-04 | 323.30 ms | 52.2% bf16 MFU | 1623140 tok/s step 4742/19560 | loss 3.598306 
(+0.12z)| norm 0.2693 (-0.50z)| lr 5.35e-04 | 322.82 ms | 52.3% bf16 MFU | 1623188 tok/s step 4743/19560 | loss 3.564715 (-0.48z)| norm 0.2753 (-0.27z)| lr 5.35e-04 | 323.00 ms | 52.3% bf16 MFU | 1623188 tok/s step 4744/19560 | loss 3.584325 (-0.13z)| norm 0.2644 (-0.69z)| lr 5.35e-04 | 323.08 ms | 52.2% bf16 MFU | 1623169 tok/s step 4745/19560 | loss 3.572302 (-0.34z)| norm 0.2665 (-0.59z)| lr 5.34e-04 | 322.86 ms | 52.3% bf16 MFU | 1623204 tok/s step 4746/19560 | loss 3.584930 (-0.12z)| norm 0.2733 (-0.32z)| lr 5.34e-04 | 323.78 ms | 52.1% bf16 MFU | 1623008 tok/s step 4747/19560 | loss 3.523077 (-1.22z)| norm 0.2612 (-0.80z)| lr 5.34e-04 | 323.05 ms | 52.2% bf16 MFU | 1623004 tok/s step 4748/19560 | loss 3.597201 (+0.11z)| norm 0.2648 (-0.66z)| lr 5.34e-04 | 322.46 ms | 52.3% bf16 MFU | 1623150 tok/s step 4749/19560 | loss 3.564998 (-0.46z)| norm 0.2756 (-0.24z)| lr 5.34e-04 | 323.28 ms | 52.2% bf16 MFU | 1623082 tok/s step 4750/19560 | loss 3.595903 (+0.09z)| norm 0.2752 (-0.25z)| lr 5.34e-04 | 323.67 ms | 52.1% bf16 MFU | 1622918 tok/s val loss 3.571055 ting HellaSwag: 60/79 evaluating HellaSwag: 70/79 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating HellaSwag: 70/79 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating HellaSwag: 70/79 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating HellaSwag: 70/79 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating 
HellaSwag: 70/79 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79ting HellaSwag: 60/79 evaluating HellaSwag: 70/79 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating HellaSwag: 70/79 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating HellaSwag: 70/79 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating HellaSwag: 70/79 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating HellaSwag: 70/79 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79ting HellaSwag: 60/79 evaluating HellaSwag: 70/79 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating HellaSwag: 70/79 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating HellaSwag: 70/79 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating HellaSwag: 70/79 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating 
HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating HellaSwag: 70/79 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79ting HellaSwag: 60/79 evaluating HellaSwag: 70/79 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating HellaSwag: 70/79 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating HellaSwag: 70/79 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating HellaSwag: 70/79 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating HellaSwag: 70/79 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79ting HellaSwag: 60/79 evaluating HellaSwag: 70/79 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating HellaSwag: 70/79 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating HellaSwag: 70/79 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating HellaSwag: 70/79 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating 
HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating HellaSwag: 70/79 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79ting HellaSwag: 60/79 evaluating HellaSwag: 70/79 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating HellaSwag: 70/79 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating HellaSwag: 70/79 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating HellaSwag: 70/79 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating HellaSwag: 70/79 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79ting HellaSwag: 60/79 evaluating HellaSwag: 70/79 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating HellaSwag: 70/79 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating HellaSwag: 70/79 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating HellaSwag: 70/79 evaluating HellaSwag: 0/79 evaluating 
HellaSwag: 2776/10042 = 0.276439 step 4751/19560 | loss 3.575073 (-0.28z)| norm 0.3090 (+1.06z)| lr 5.34e-04 | 322.20 ms | 52.4% bf16 MFU | 1623132 tok/s step 4752/19560 | loss 3.558228 (-0.58z)| norm 0.2912 (+0.37z)| lr 5.34e-04 | 323.23 ms | 52.2% bf16 MFU | 1623077 tok/s step 4753/19560 | loss 3.642257 (+0.91z)| norm 0.2460 (-1.39z)| lr 5.34e-04 | 322.98 ms | 52.3% bf16 MFU | 1623088 tok/s step 4754/19560 | loss 3.607160 (+0.29z)| norm 0.2812 (-0.02z)| lr 5.34e-04 | 322.83 ms | 52.3% bf16 MFU | 1623135 tok/s step 4755/19560 | loss 3.620661 (+0.52z)| norm 0.2783 (-0.14z)| lr 5.34e-04 | 322.74 ms | 52.3% bf16 MFU | 1623202 tok/s step 4756/19560 | loss 3.572160 (-0.35z)| norm 0.2917 (+0.38z)| lr 5.34e-04 | 322.97 ms | 52.3% bf16 MFU | 1623208 tok/s step 4757/19560 | loss 3.639639 (+0.85z)| norm 0.3120 (+1.15z)| lr 5.34e-04 | 322.83 ms | 52.3% bf16 MFU | 1623250 tok/s step 4758/19560 | loss 3.630886 (+0.70z)| norm 0.3233 (+1.56z)| lr 5.34e-04 | 322.52 ms | 52.3% bf16 MFU | 1623367 tok/s step 4759/19560 | loss 3.625693 (+0.61z)| norm 0.3157 (+1.25z)| lr 5.34e-04 | 322.84 ms | 52.3% bf16 MFU | 1623399 tok/s step 4760/19560 | loss 3.578603 (-0.22z)| norm 0.3269 (+1.65z)| lr 5.34e-04 | 322.65 ms | 52.3% bf16 MFU | 1623477 tok/s step 4761/19560 | loss 3.589500 (-0.01z)| norm 0.3108 (+1.02z)| lr 5.34e-04 | 322.61 ms | 52.3% bf16 MFU | 1623561 tok/s step 4762/19560 | loss 3.611462 (+0.38z)| norm 0.3386 (+2.03z)| lr 5.34e-04 | 322.88 ms | 52.3% bf16 MFU | 1623573 tok/s step 4763/19560 | loss
3.626578 (+0.65z)| norm 0.3531 (+2.48z)| lr 5.34e-04 | 322.35 ms | 52.4% bf16 MFU | 1623718 tok/s step 4764/19560 | loss 3.571175 (-0.36z)| norm 0.2984 (+0.50z)| lr 5.34e-04 | 322.48 ms | 52.3% bf16 MFU | 1623822 tok/s step 4765/19560 | loss 3.553120 (-0.74z)| norm 0.3110 (+0.97z)| lr 5.34e-04 | 322.64 ms | 52.3% bf16 MFU | 1623880 tok/s step 4766/19560 | loss 3.649274 (+1.20z)| norm 0.3117 (+0.99z)| lr 5.34e-04 | 322.95 ms | 52.3% bf16 MFU | 1623858 tok/s step 4767/19560 | loss 3.585071 (-0.10z)| norm 0.2700 (-0.54z)| lr 5.34e-04 | 322.60 ms | 52.3% bf16 MFU | 1623926 tok/s step 4768/19560 | loss 3.670307 (+1.61z)| norm 0.2777 (-0.23z)| lr 5.34e-04 | 323.23 ms | 52.2% bf16 MFU | 1623831 tok/s step 4769/19560 | loss 3.588175 (-0.05z)| norm 0.2878 (+0.17z)| lr 5.34e-04 | 322.76 ms | 52.3% bf16 MFU | 1623858 tok/s step 4770/19560 | loss 3.640055 (+0.98z)| norm 0.2655 (-0.69z)| lr 5.34e-04 | 321.98 ms | 52.4% bf16 MFU | 1624080 tok/s step 4771/19560 | loss 3.574090 (-0.35z)| norm 0.2752 (-0.30z)| lr 5.34e-04 | 322.38 ms | 52.4% bf16 MFU | 1624190 tok/s step 4772/19560 | loss 3.583602 (-0.16z)| norm 0.2573 (-0.99z)| lr 5.34e-04 | 323.35 ms | 52.2% bf16 MFU | 1624053 tok/s step 4773/19560 | loss 3.590667 (-0.03z)| norm 0.2672 (-0.59z)| lr 5.34e-04 | 322.90 ms | 52.3% bf16 MFU | 1624034 tok/s step 4774/19560 | loss 3.595508 (+0.05z)| norm 0.2641 (-0.71z)| lr 5.34e-04 | 322.94 ms | 52.3% bf16 MFU | 1624006 tok/s step 4775/19560 | loss 3.675673 (+1.68z)| norm 0.2569 (-0.98z)| lr 5.34e-04 | 322.34 ms | 52.4% bf16 MFU | 1624131 tok/s step 4776/19560 | loss 3.594363 (+0.01z)| norm 0.2800 (-0.08z)| lr 5.33e-04 | 322.74 ms | 52.3% bf16 MFU | 1624149 tok/s step 4777/19560 | loss 3.634792 (+0.82z)| norm 0.2667 (-0.61z)| lr 5.33e-04 | 322.79 ms | 52.3% bf16 MFU | 1624152 tok/s step 4778/19560 | loss 3.583088 (-0.22z)| norm 0.2705 (-0.46z)| lr 5.33e-04 | 322.61 ms | 52.3% bf16 MFU | 1624203 tok/s step 4779/19560 | loss 3.575545 (-0.38z)| norm 0.2535 (-1.12z)| lr 5.33e-04 | 322.75 
ms | 52.3% bf16 MFU | 1624216 tok/s step 4780/19560 | loss 3.541058 (-1.12z)| norm 0.2578 (-0.94z)| lr 5.33e-04 | 322.88 ms | 52.3% bf16 MFU | 1624193 tok/s step 4781/19560 | loss 3.611009 (+0.39z)| norm 0.2687 (-0.49z)| lr 5.33e-04 | 322.76 ms | 52.3% bf16 MFU | 1624202 tok/s step 4782/19560 | loss 3.697007 (+2.18z)| norm 0.2931 (+0.46z)| lr 5.33e-04 | 322.39 ms | 52.4% bf16 MFU | 1624306 tok/s step 4783/19560 | loss 3.598027 (+0.08z)| norm 0.2650 (-0.66z)| lr 5.33e-04 | 322.83 ms | 52.3% bf16 MFU | 1624293 tok/s step 4784/19560 | loss 3.525195 (-1.44z)| norm 0.3570 (+2.89z)| lr 5.33e-04 | 322.76 ms | 52.3% bf16 MFU | 1624297 tok/s step 4785/19560 | loss 3.559966 (-0.70z)| norm 0.2866 (+0.17z)| lr 5.33e-04 | 322.58 ms | 52.3% bf16 MFU | 1624347 tok/s step 4786/19560 | loss 3.631635 (+0.80z)| norm 0.3078 (+0.97z)| lr 5.33e-04 | 321.91 ms | 52.4% bf16 MFU | 1624564 tok/s step 4787/19560 | loss 3.515207 (-1.63z)| norm 0.2804 (-0.10z)| lr 5.33e-04 | 323.16 ms | 52.2% bf16 MFU | 1624456 tok/s step 4788/19560 | loss 3.584501 (-0.18z)| norm 0.2666 (-0.65z)| lr 5.33e-04 | 322.76 ms | 52.3% bf16 MFU | 1624451 tok/s step 4789/19560 | loss 3.575223 (-0.37z)| norm 0.2584 (-0.97z)| lr 5.33e-04 | 323.19 ms | 52.2% bf16 MFU | 1624340 tok/s step 4790/19560 | loss 3.590783 (-0.05z)| norm 0.2829 (-0.01z)| lr 5.33e-04 | 323.02 ms | 52.2% bf16 MFU | 1624277 tok/s step 4791/19560 | loss 3.621015 (+0.58z)| norm 0.3143 (+1.20z)| lr 5.33e-04 | 322.32 ms | 52.4% bf16 MFU | 1624395 tok/s step 4792/19560 | loss 3.582456 (-0.24z)| norm 0.3102 (+1.03z)| lr 5.33e-04 | 322.67 ms | 52.3% bf16 MFU | 1624418 tok/s step 4793/19560 | loss 3.636764 (+0.89z)| norm 0.2840 (-0.00z)| lr 5.33e-04 | 322.59 ms | 52.3% bf16 MFU | 1624459 tok/s step 4794/19560 | loss 3.599574 (+0.10z)| norm 0.2708 (-0.51z)| lr 5.33e-04 | 323.38 ms | 52.2% bf16 MFU | 1624300 tok/s step 4795/19560 | loss 3.633637 (+0.81z)| norm 0.2742 (-0.37z)| lr 5.33e-04 | 322.31 ms | 52.4% bf16 MFU | 1624418 tok/s step 4796/19560 | loss 
3.677675 (+1.72z)| norm 0.3109 (+1.09z)| lr 5.33e-04 | 322.63 ms | 52.3% bf16 MFU | 1624448 tok/s step 4797/19560 | loss 3.592599 (-0.07z)| norm 0.2942 (+0.42z)| lr 5.33e-04 | 323.28 ms | 52.2% bf16 MFU | 1624314 tok/s step 4798/19560 | loss 3.654103 (+1.21z)| norm 0.2588 (-0.98z)| lr 5.33e-04 | 322.63 ms | 52.3% bf16 MFU | 1624350 tok/s step 4799/19560 | loss 3.627180 (+0.66z)| norm 0.2722 (-0.42z)| lr 5.33e-04 | 322.77 ms | 52.3% bf16 MFU | 1624349 tok/s step 4800/19560 | loss 3.582022 (-0.29z)| norm 0.3125 (+1.25z)| lr 5.33e-04 | 323.01 ms | 52.2% bf16 MFU | 1624289 tok/s step 4801/19560 | loss 3.531198 (-1.34z)| norm 0.2833 (+0.05z)| lr 5.33e-04 | 322.61 ms | 52.3% bf16 MFU | 1624332 tok/s step 4802/19560 | loss 3.531682 (-1.31z)| norm 0.2799 (-0.10z)| lr 5.33e-04 | 323.37 ms | 52.2% bf16 MFU | 1624181 tok/s step 4803/19560 | loss 3.554588 (-0.82z)| norm 0.2802 (-0.08z)| lr 5.33e-04 | 322.64 ms | 52.3% bf16 MFU | 1624222 tok/s step 4804/19560 | loss 3.530538 (-1.30z)| norm 0.3035 (+0.88z)| lr 5.33e-04 | 322.56 ms | 52.3% bf16 MFU | 1624280 tok/s step 4805/19560 | loss 3.602895 (+0.20z)| norm 0.2840 (+0.07z)| lr 5.33e-04 | 322.73 ms | 52.3% bf16 MFU | 1624293 tok/s step 4806/19560 | loss 3.532785 (-1.26z)| norm 0.2583 (-0.99z)| lr 5.33e-04 | 323.34 ms | 52.2% bf16 MFU | 1624151 tok/s step 4807/19560 | loss 3.618541 (+0.52z)| norm 0.2816 (-0.02z)| lr 5.33e-04 | 322.93 ms | 52.3% bf16 MFU | 1624121 tok/s step 4808/19560 | loss 3.557525 (-0.74z)| norm 0.2852 (+0.12z)| lr 5.32e-04 | 322.90 ms | 52.3% bf16 MFU | 1624098 tok/s step 4809/19560 | loss 3.578798 (-0.29z)| norm 0.2538 (-1.19z)| lr 5.32e-04 | 322.86 ms | 52.3% bf16 MFU | 1624086 tok/s step 4810/19560 | loss 3.635619 (+0.88z)| norm 0.2641 (-0.77z)| lr 5.32e-04 | 323.32 ms | 52.2% bf16 MFU | 1623961 tok/s step 4811/19560 | loss 3.656896 (+1.32z)| norm 0.3102 (+1.15z)| lr 5.32e-04 | 322.35 ms | 52.4% bf16 MFU | 1624085 tok/s step 4812/19560 | loss 3.568115 (-0.58z)| norm 0.2579 (-1.03z)| lr 5.32e-04 | 322.42 
ms | 52.3% bf16 MFU | 1624186 tok/s step 4813/19560 | loss 3.616278 (+0.44z)| norm 0.2725 (-0.42z)| lr 5.32e-04 | 322.95 ms | 52.3% bf16 MFU | 1624149 tok/s step 4814/19560 | loss 3.586123 (-0.20z)| norm 0.2629 (-0.82z)| lr 5.32e-04 | 322.59 ms | 52.3% bf16 MFU | 1624203 tok/s step 4815/19560 | loss 3.593143 (-0.06z)| norm 0.2706 (-0.49z)| lr 5.32e-04 | 323.02 ms | 52.2% bf16 MFU | 1624146 tok/s step 4816/19560 | loss 3.611032 (+0.33z)| norm 0.2698 (-0.51z)| lr 5.32e-04 | 322.93 ms | 52.3% bf16 MFU | 1624116 tok/s step 4817/19560 | loss 3.572990 (-0.48z)| norm 0.2863 (+0.25z)| lr 5.32e-04 | 322.87 ms | 52.3% bf16 MFU | 1624102 tok/s step 4818/19560 | loss 3.606188 (+0.21z)| norm 0.2710 (-0.47z)| lr 5.32e-04 | 322.36 ms | 52.4% bf16 MFU | 1624218 tok/s step 4819/19560 | loss 3.525672 (-1.53z)| norm 0.2726 (-0.40z)| lr 5.32e-04 | 322.96 ms | 52.3% bf16 MFU | 1624175 tok/s step 4820/19560 | loss 3.553436 (-0.92z)| norm 0.2647 (-0.76z)| lr 5.32e-04 | 322.77 ms | 52.3% bf16 MFU | 1624184 tok/s step 4821/19560 | loss 3.595619 (-0.00z)| norm 0.2594 (-1.01z)| lr 5.32e-04 | 322.32 ms | 52.4% bf16 MFU | 1624304 tok/s step 4822/19560 | loss 3.583164 (-0.28z)| norm 0.2568 (-1.12z)| lr 5.32e-04 | 322.79 ms | 52.3% bf16 MFU | 1624300 tok/s step 4823/19560 | loss 3.551986 (-0.94z)| norm 0.2512 (-1.36z)| lr 5.32e-04 | 322.68 ms | 52.3% bf16 MFU | 1624324 tok/s step 4824/19560 | loss 3.625840 (+0.65z)| norm 0.3168 (+1.67z)| lr 5.32e-04 | 322.46 ms | 52.3% bf16 MFU | 1624403 tok/s step 4825/19560 | loss 3.578736 (-0.37z)| norm 0.3070 (+1.20z)| lr 5.32e-04 | 322.95 ms | 52.3% bf16 MFU | 1624356 tok/s step 4826/19560 | loss 3.536916 (-1.25z)| norm 0.2985 (+0.80z)| lr 5.32e-04 | 323.60 ms | 52.2% bf16 MFU | 1624146 tok/s step 4827/19560 | loss 3.572714 (-0.49z)| norm 0.2757 (-0.25z)| lr 5.32e-04 | 322.57 ms | 52.3% bf16 MFU | 1624204 tok/s step 4828/19560 | loss 3.645069 (+1.05z)| norm 0.2701 (-0.51z)| lr 5.32e-04 | 322.37 ms | 52.4% bf16 MFU | 1624312 tok/s step 4829/19560 | loss 
3.575351 (-0.45z)| norm 0.2958 (+0.66z)| lr 5.32e-04 | 322.76 ms | 52.3% bf16 MFU | 1624315 tok/s step 4830/19560 | loss 3.622540 (+0.56z)| norm 0.2442 (-1.68z)| lr 5.32e-04 | 322.98 ms | 52.3% bf16 MFU | 1624263 tok/s step 4831/19560 | loss 3.570714 (-0.55z)| norm 0.2682 (-0.58z)| lr 5.32e-04 | 323.05 ms | 52.2% bf16 MFU | 1624196 tok/s step 4832/19560 | loss 3.588078 (-0.18z)| norm 0.2709 (-0.46z)| lr 5.32e-04 | 322.75 ms | 52.3% bf16 MFU | 1624208 tok/s step 4833/19560 | loss 3.576826 (-0.43z)| norm 0.2962 (+0.68z)| lr 5.32e-04 | 322.70 ms | 52.3% bf16 MFU | 1624233 tok/s step 4834/19560 | loss 3.601878 (+0.13z)| norm 0.3379 (+2.49z)| lr 5.32e-04 | 322.18 ms | 52.4% bf16 MFU | 1624386 tok/s step 4835/19560 | loss 3.583302 (-0.28z)| norm 0.3163 (+1.50z)| lr 5.32e-04 | 323.14 ms | 52.2% bf16 MFU | 1624291 tok/s step 4836/19560 | loss 3.690921 (+2.05z)| norm 0.3014 (+0.84z)| lr 5.32e-04 | 322.86 ms | 52.3% bf16 MFU | 1624270 tok/s step 4837/19560 | loss 3.577185 (-0.46z)| norm 0.3328 (+2.16z)| lr 5.32e-04 | 322.46 ms | 52.3% bf16 MFU | 1624352 tok/s step 4838/19560 | loss 3.482328 (-2.90z)| norm 0.2749 (-0.33z)| lr 5.32e-04 | 322.68 ms | 52.3% bf16 MFU | 1624373 tok/s step 4839/19560 | loss 3.508560 (-2.16z)| norm 0.3002 (+0.75z)| lr 5.32e-04 | 322.78 ms | 52.3% bf16 MFU | 1624368 tok/s step 4840/19560 | loss 3.654976 (+1.58z)| norm 0.2979 (+0.68z)| lr 5.31e-04 | 322.69 ms | 52.3% bf16 MFU | 1624387 tok/s step 4841/19560 | loss 3.595369 (+0.07z)| norm 0.2631 (-0.84z)| lr 5.31e-04 | 323.46 ms | 52.2% bf16 MFU | 1624212 tok/s step 4842/19560 | loss 3.590986 (-0.04z)| norm 0.2904 (+0.36z)| lr 5.31e-04 | 322.91 ms | 52.3% bf16 MFU | 1624182 tok/s step 4843/19560 | loss 3.549397 (-1.09z)| norm 0.2632 (-0.84z)| lr 5.31e-04 | 322.63 ms | 52.3% bf16 MFU | 1624225 tok/s step 4844/19560 | loss 3.629720 (+0.94z)| norm 0.2787 (-0.15z)| lr 5.31e-04 | 322.54 ms | 52.3% bf16 MFU | 1624290 tok/s step 4845/19560 | loss 3.613577 (+0.52z)| norm 0.3253 (+1.89z)| lr 5.31e-04 | 322.33 
ms | 52.4% bf16 MFU | 1624403 tok/s step 4846/19560 | loss 3.636283 (+1.10z)| norm 0.3328 (+2.17z)| lr 5.31e-04 | 322.90 ms | 52.3% bf16 MFU | 1624368 tok/s step 4847/19560 | loss 3.566817 (-0.65z)| norm 0.2790 (-0.17z)| lr 5.31e-04 | 323.36 ms | 52.2% bf16 MFU | 1624220 tok/s step 4848/19560 | loss 3.635456 (+1.08z)| norm 0.2910 (+0.36z)| lr 5.31e-04 | 322.65 ms | 52.3% bf16 MFU | 1624256 tok/s step 4849/19560 | loss 3.528602 (-1.60z)| norm 0.2986 (+0.67z)| lr 5.31e-04 | 322.59 ms | 52.3% bf16 MFU | 1624306 tok/s step 4850/19560 | loss 3.582729 (-0.24z)| norm 0.3256 (+1.81z)| lr 5.31e-04 | 323.27 ms | 52.2% bf16 MFU | 1624182 tok/s step 4851/19560 | loss 3.567109 (-0.63z)| norm 0.3098 (+1.11z)| lr 5.31e-04 | 322.35 ms | 52.4% bf16 MFU | 1624296 tok/s step 4852/19560 | loss 3.553664 (-0.95z)| norm 0.2984 (+0.61z)| lr 5.31e-04 | 323.38 ms | 52.2% bf16 MFU | 1624146 tok/s step 4853/19560 | loss 3.569546 (-0.54z)| norm 0.2939 (+0.42z)| lr 5.31e-04 | 322.52 ms | 52.3% bf16 MFU | 1624218 tok/s step 4854/19560 | loss 3.611784 (+0.52z)| norm 0.2717 (-0.55z)| lr 5.31e-04 | 322.40 ms | 52.3% bf16 MFU | 1624318 tok/s step 4855/19560 | loss 3.606586 (+0.38z)| norm 0.2761 (-0.37z)| lr 5.31e-04 | 322.94 ms | 52.3% bf16 MFU | 1624276 tok/s step 4856/19560 | loss 3.573113 (-0.48z)| norm 0.2789 (-0.25z)| lr 5.31e-04 | 322.95 ms | 52.3% bf16 MFU | 1624234 tok/s step 4857/19560 | loss 3.583066 (-0.23z)| norm 0.2653 (-0.84z)| lr 5.31e-04 | 322.69 ms | 52.3% bf16 MFU | 1624260 tok/s step 4858/19560 | loss 3.570724 (-0.55z)| norm 0.2769 (-0.33z)| lr 5.31e-04 | 322.68 ms | 52.3% bf16 MFU | 1624286 tok/s step 4859/19560 | loss 3.528383 (-1.62z)| norm 0.2762 (-0.36z)| lr 5.31e-04 | 322.37 ms | 52.4% bf16 MFU | 1624390 tok/s step 4860/19560 | loss 3.605096 (+0.37z)| norm 0.2633 (-0.92z)| lr 5.31e-04 | 322.41 ms | 52.3% bf16 MFU | 1624479 tok/s step 4861/19560 | loss 3.565886 (-0.66z)| norm 0.3282 (+1.86z)| lr 5.31e-04 | 322.91 ms | 52.3% bf16 MFU | 1624436 tok/s step 4862/19560 | loss 
3.547728 (-1.12z)| norm 0.2931 (+0.36z)| lr 5.31e-04 | 323.45 ms | 52.2% bf16 MFU | 1624260 tok/s step 4863/19560 | loss 3.781799 (+4.51z)| norm 0.2726 (-0.51z)| lr 5.31e-04 | 322.36 ms | 52.4% bf16 MFU | 1624368 tok/s step 4864/19560 | loss 3.571758 (-0.49z)| norm 0.3034 (+0.80z)| lr 5.31e-04 | 322.71 ms | 52.3% bf16 MFU | 1624382 tok/s step 4865/19560 | loss 3.638919 (+1.10z)| norm 0.2823 (-0.12z)| lr 5.31e-04 | 322.99 ms | 52.3% bf16 MFU | 1624324 tok/s step 4866/19560 | loss 3.566359 (-0.61z)| norm 0.3097 (+1.05z)| lr 5.31e-04 | 323.25 ms | 52.2% bf16 MFU | 1624205 tok/s step 4867/19560 | loss 3.570832 (-0.50z)| norm 0.3382 (+2.22z)| lr 5.31e-04 | 322.64 ms | 52.3% bf16 MFU | 1624244 tok/s step 4868/19560 | loss 3.550492 (-0.97z)| norm 0.3252 (+1.64z)| lr 5.31e-04 | 322.17 ms | 52.4% bf16 MFU | 1624400 tok/s step 4869/19560 | loss 3.532954 (-1.37z)| norm 0.2892 (+0.13z)| lr 5.31e-04 | 323.11 ms | 52.2% bf16 MFU | 1624311 tok/s step 4870/19560 | loss 3.539689 (-1.19z)| norm 0.2864 (+0.01z)| lr 5.31e-04 | 323.27 ms | 52.2% bf16 MFU | 1624188 tok/s step 4871/19560 | loss 3.608091 (+0.40z)| norm 0.2877 (+0.06z)| lr 5.30e-04 | 322.52 ms | 52.3% bf16 MFU | 1624259 tok/s step 4872/19560 | loss 3.636342 (+1.04z)| norm 0.3107 (+1.01z)| lr 5.30e-04 | 322.57 ms | 52.3% bf16 MFU | 1624313 tok/s step 4873/19560 | loss 3.561223 (-0.70z)| norm 0.3029 (+0.67z)| lr 5.30e-04 | 324.91 ms | 51.9% bf16 MFU | 1623780 tok/s step 4874/19560 | loss 3.588705 (-0.06z)| norm 0.3252 (+1.58z)| lr 5.30e-04 | 322.19 ms | 52.4% bf16 MFU | 1623956 tok/s step 4875/19560 | loss 3.645760 (+1.25z)| norm 0.3588 (+2.87z)| lr 5.30e-04 | 322.73 ms | 52.3% bf16 MFU | 1623985 tok/s step 4876/19560 | loss 3.550030 (-0.98z)| norm 0.2793 (-0.37z)| lr 5.30e-04 | 322.07 ms | 52.4% bf16 MFU | 1624178 tok/s step 4877/19560 | loss 3.604089 (+0.27z)| norm 0.2982 (+0.40z)| lr 5.30e-04 | 323.40 ms | 52.2% bf16 MFU | 1624028 tok/s step 4878/19560 | loss 3.529286 (-1.45z)| norm 0.2880 (-0.03z)| lr 5.30e-04 | 323.02 
ms | 52.2% bf16 MFU | 1623980 tok/s step 4879/19560 | loss 3.546999 (-1.03z)| norm 0.2849 (-0.14z)| lr 5.30e-04 | 323.15 ms | 52.2% bf16 MFU | 1623903 tok/s step 4880/19560 | loss 3.637097 (+1.03z)| norm 0.2527 (-1.44z)| lr 5.30e-04 | 323.60 ms | 52.2% bf16 MFU | 1623716 tok/s step 4881/19560 | loss 3.698140 (+2.38z)| norm 0.2833 (-0.21z)| lr 5.30e-04 | 323.23 ms | 52.2% bf16 MFU | 1623633 tok/s step 4882/19560 | loss 3.580644 (-0.27z)| norm 0.2635 (-1.01z)| lr 5.30e-04 | 323.15 ms | 52.2% bf16 MFU | 1623574 tok/s step 4883/19560 | loss 3.549857 (-0.95z)| norm 0.2525 (-1.45z)| lr 5.30e-04 | 322.99 ms | 52.3% bf16 MFU | 1623558 tok/s step 4884/19560 | loss 3.614168 (+0.49z)| norm 0.2953 (+0.29z)| lr 5.30e-04 | 323.51 ms | 52.2% bf16 MFU | 1623412 tok/s step 4885/19560 | loss 3.579690 (-0.27z)| norm 0.3137 (+1.04z)| lr 5.30e-04 | 323.25 ms | 52.2% bf16 MFU | 1623338 tok/s step 4886/19560 | loss 3.504712 (-1.93z)| norm 0.2746 (-0.54z)| lr 5.30e-04 | 322.60 ms | 52.3% bf16 MFU | 1623432 tok/s step 4887/19560 | loss 3.491343 (-2.17z)| norm 0.2709 (-0.68z)| lr 5.30e-04 | 323.19 ms | 52.2% bf16 MFU | 1623370 tok/s step 4888/19560 | loss 3.621322 (+0.69z)| norm 0.3132 (+1.07z)| lr 5.30e-04 | 322.84 ms | 52.3% bf16 MFU | 1623401 tok/s step 4889/19560 | loss 3.585600 (-0.10z)| norm 0.2963 (+0.38z)| lr 5.30e-04 | 323.09 ms | 52.2% bf16 MFU | 1623368 tok/s step 4890/19560 | loss 3.515809 (-1.60z)| norm 0.2879 (+0.05z)| lr 5.30e-04 | 322.91 ms | 52.3% bf16 MFU | 1623382 tok/s step 4891/19560 | loss 3.586554 (-0.05z)| norm 0.2624 (-1.02z)| lr 5.30e-04 | 323.13 ms | 52.2% bf16 MFU | 1623338 tok/s step 4892/19560 | loss 3.612346 (+0.50z)| norm 0.2828 (-0.13z)| lr 5.30e-04 | 323.18 ms | 52.2% bf16 MFU | 1623286 tok/s step 4893/19560 | loss 3.622996 (+0.72z)| norm 0.2977 (+0.52z)| lr 5.30e-04 | 323.01 ms | 52.3% bf16 MFU | 1623279 tok/s step 4894/19560 | loss 3.604001 (+0.32z)| norm 0.2916 (+0.26z)| lr 5.30e-04 | 322.86 ms | 52.3% bf16 MFU | 1623308 tok/s step 4895/19560 | loss 
3.551603 (-0.82z)| norm 0.2967 (+0.47z)| lr 5.30e-04 | 323.75 ms | 52.1% bf16 MFU | 1623115 tok/s step 4896/19560 | loss 3.616151 (+0.61z)| norm 0.2706 (-0.67z)| lr 5.30e-04 | 323.39 ms | 52.2% bf16 MFU | 1623021 tok/s step 4897/19560 | loss 3.623251 (+0.76z)| norm 0.3030 (+0.75z)| lr 5.30e-04 | 322.35 ms | 52.4% bf16 MFU | 1623194 tok/s step 4898/19560 | loss 3.534936 (-1.18z)| norm 0.2693 (-0.73z)| lr 5.30e-04 | 323.51 ms | 52.2% bf16 MFU | 1623066 tok/s step 4899/19560 | loss 3.639417 (+1.11z)| norm 0.3103 (+1.05z)| lr 5.30e-04 | 322.87 ms | 52.3% bf16 MFU | 1623105 tok/s step 4900/19560 | loss 3.576063 (-0.28z)| norm 0.2657 (-0.90z)| lr 5.30e-04 | 322.60 ms | 52.3% bf16 MFU | 1623211 tok/s step 4901/19560 | loss 3.596001 (+0.16z)| norm 0.2800 (-0.28z)| lr 5.30e-04 | 322.97 ms | 52.3% bf16 MFU | 1623218 tok/s step 4902/19560 | loss 3.577723 (-0.24z)| norm 0.2661 (-0.89z)| lr 5.29e-04 | 323.55 ms | 52.2% bf16 MFU | 1623079 tok/s step 4903/19560 | loss 3.601409 (+0.30z)| norm 0.2749 (-0.52z)| lr 5.29e-04 | 323.13 ms | 52.2% bf16 MFU | 1623052 tok/s step 4904/19560 | loss 3.501549 (-1.89z)| norm 0.2572 (-1.28z)| lr 5.29e-04 | 323.12 ms | 52.2% bf16 MFU | 1623029 tok/s step 4905/19560 | loss 3.568003 (-0.41z)| norm 0.2802 (-0.28z)| lr 5.29e-04 | 323.61 ms | 52.2% bf16 MFU | 1622883 tok/s step 4906/19560 | loss 3.507767 (-1.71z)| norm 0.2595 (-1.18z)| lr 5.29e-04 | 322.70 ms | 52.3% bf16 MFU | 1622975 tok/s step 4907/19560 | loss 3.513027 (-1.57z)| norm 0.2774 (-0.40z)| lr 5.29e-04 | 323.33 ms | 52.2% bf16 MFU | 1622901 tok/s step 4908/19560 | loss 3.551007 (-0.75z)| norm 0.2970 (+0.45z)| lr 5.29e-04 | 322.97 ms | 52.3% bf16 MFU | 1622924 tok/s step 4909/19560 | loss 3.606787 (+0.46z)| norm 0.2700 (-0.75z)| lr 5.29e-04 | 323.22 ms | 52.2% bf16 MFU | 1622881 tok/s step 4910/19560 | loss 3.547951 (-0.81z)| norm 0.2727 (-0.62z)| lr 5.29e-04 | 323.16 ms | 52.2% bf16 MFU | 1622856 tok/s step 4911/19560 | loss 3.627520 (+0.94z)| norm 0.2455 (-1.81z)| lr 5.29e-04 | 323.30 
ms | 52.2% bf16 MFU | 1622796 tok/s step 4912/19560 | loss 3.528666 (-1.24z)| norm 0.2505 (-1.59z)| lr 5.29e-04 | 322.93 ms | 52.3% bf16 MFU | 1622834 tok/s step 4913/19560 | loss 3.524654 (-1.31z)| norm 0.2703 (-0.69z)| lr 5.29e-04 | 322.61 ms | 52.3% bf16 MFU | 1622950 tok/s step 4914/19560 | loss 3.600785 (+0.36z)| norm 0.2673 (-0.82z)| lr 5.29e-04 | 322.56 ms | 52.3% bf16 MFU | 1623071 tok/s step 4915/19560 | loss 3.591660 (+0.15z)| norm 0.2730 (-0.55z)| lr 5.29e-04 | 323.10 ms | 52.2% bf16 MFU | 1623052 tok/s step 4916/19560 | loss 3.528792 (-1.23z)| norm 0.2494 (-1.60z)| lr 5.29e-04 | 323.15 ms | 52.2% bf16 MFU | 1623021 tok/s step 4917/19560 | loss 3.677633 (+2.01z)| norm 0.2758 (-0.42z)| lr 5.29e-04 | 322.49 ms | 52.3% bf16 MFU | 1623158 tok/s step 4918/19560 | loss 3.613519 (+0.61z)| norm 0.2883 (+0.14z)| lr 5.29e-04 | 322.99 ms | 52.3% bf16 MFU | 1623162 tok/s step 4919/19560 | loss 3.529027 (-1.20z)| norm 0.2978 (+0.57z)| lr 5.29e-04 | 323.15 ms | 52.2% bf16 MFU | 1623125 tok/s step 4920/19560 | loss 3.607602 (+0.49z)| norm 0.2684 (-0.75z)| lr 5.29e-04 | 322.61 ms | 52.3% bf16 MFU | 1623226 tok/s step 4921/19560 | loss 3.640103 (+1.19z)| norm 0.2953 (+0.47z)| lr 5.29e-04 | 323.21 ms | 52.2% bf16 MFU | 1623171 tok/s step 4922/19560 | loss 3.544150 (-0.87z)| norm 0.3098 (+1.11z)| lr 5.29e-04 | 323.28 ms | 52.2% bf16 MFU | 1623102 tok/s step 4923/19560 | loss 3.587632 (+0.07z)| norm 0.2967 (+0.51z)| lr 5.29e-04 | 322.82 ms | 52.3% bf16 MFU | 1623152 tok/s step 4924/19560 | loss 3.568313 (-0.33z)| norm 0.3032 (+0.81z)| lr 5.29e-04 | 322.66 ms | 52.3% bf16 MFU | 1623238 tok/s step 4925/19560 | loss 3.586571 (+0.07z)| norm 0.2881 (+0.13z)| lr 5.29e-04 | 323.00 ms | 52.3% bf16 MFU | 1623234 tok/s step 4926/19560 | loss 3.582288 (-0.01z)| norm 0.2916 (+0.28z)| lr 5.29e-04 | 323.79 ms | 52.1% bf16 MFU | 1623033 tok/s step 4927/19560 | loss 3.599340 (+0.38z)| norm 0.3238 (+1.71z)| lr 5.29e-04 | 323.53 ms | 52.2% bf16 MFU | 1622907 tok/s step 4928/19560 | loss 
3.582597 (+0.00z)| norm 0.3206 (+1.55z)| lr 5.29e-04 | 323.15 ms | 52.2% bf16 MFU | 1622884 tok/s step 4929/19560 | loss 3.575550 (-0.16z)| norm 0.3549 (+2.96z)| lr 5.29e-04 | 323.04 ms | 52.2% bf16 MFU | 1622888 tok/s step 4930/19560 | loss 3.627229 (+0.98z)| norm 0.3499 (+2.65z)| lr 5.29e-04 | 323.91 ms | 52.1% bf16 MFU | 1622676 tok/s step 4931/19560 | loss 3.652840 (+1.52z)| norm 0.3404 (+2.18z)| lr 5.29e-04 | 322.92 ms | 52.3% bf16 MFU | 1622721 tok/s step 4932/19560 | loss 3.452798 (-2.83z)| norm 0.3814 (+3.66z)| lr 5.29e-04 | 322.86 ms | 52.3% bf16 MFU | 1622778 tok/s step 4933/19560 | loss 3.565119 (-0.40z)| norm 0.3202 (+1.24z)| lr 5.28e-04 | 323.93 ms | 52.1% bf16 MFU | 1622565 tok/s step 4934/19560 | loss 3.621637 (+0.81z)| norm 0.3314 (+1.64z)| lr 5.28e-04 | 322.66 ms | 52.3% bf16 MFU | 1622682 tok/s step 4935/19560 | loss 3.598449 (+0.31z)| norm 0.3030 (+0.53z)| lr 5.28e-04 | 323.13 ms | 52.2% bf16 MFU | 1622674 tok/s step 4936/19560 | loss 3.559046 (-0.54z)| norm 0.3319 (+1.62z)| lr 5.28e-04 | 323.75 ms | 52.1% bf16 MFU | 1622513 tok/s step 4937/19560 | loss 3.601956 (+0.39z)| norm 0.3170 (+1.04z)| lr 5.28e-04 | 323.11 ms | 52.2% bf16 MFU | 1622519 tok/s step 4938/19560 | loss 3.601838 (+0.39z)| norm 0.3353 (+1.71z)| lr 5.28e-04 | 322.98 ms | 52.3% bf16 MFU | 1622558 tok/s step 4939/19560 | loss 3.568251 (-0.33z)| norm 0.2823 (-0.31z)| lr 5.28e-04 | 323.68 ms | 52.1% bf16 MFU | 1622420 tok/s step 4940/19560 | loss 3.554327 (-0.63z)| norm 0.2985 (+0.30z)| lr 5.28e-04 | 322.95 ms | 52.3% bf16 MFU | 1622469 tok/s step 4941/19560 | loss 3.617737 (+0.76z)| norm 0.2706 (-0.78z)| lr 5.28e-04 | 323.05 ms | 52.2% bf16 MFU | 1622492 tok/s step 4942/19560 | loss 3.534097 (-1.07z)| norm 0.2505 (-1.53z)| lr 5.28e-04 | 322.66 ms | 52.3% bf16 MFU | 1622612 tok/s step 4943/19560 | loss 3.505379 (-1.66z)| norm 0.2566 (-1.29z)| lr 5.28e-04 | 323.49 ms | 52.2% bf16 MFU | 1622519 tok/s step 4944/19560 | loss 3.621428 (+0.85z)| norm 0.2606 (-1.13z)| lr 5.28e-04 | 322.99 
ms | 52.3% bf16 MFU | 1622555 tok/s step 4945/19560 | loss 3.558269 (-0.51z)| norm 0.2568 (-1.26z)| lr 5.28e-04 | 323.06 ms | 52.2% bf16 MFU | 1622571 tok/s step 4946/19560 | loss 3.674625 (+1.96z)| norm 0.2649 (-0.95z)| lr 5.28e-04 | 323.42 ms | 52.2% bf16 MFU | 1622497 tok/s step 4947/19560 | loss 3.589201 (+0.13z)| norm 0.2550 (-1.32z)| lr 5.28e-04 | 324.27 ms | 52.0% bf16 MFU | 1622213 tok/s step 4948/19560 | loss 3.517303 (-1.39z)| norm 0.2487 (-1.54z)| lr 5.28e-04 | 323.43 ms | 52.2% bf16 MFU | 1622154 tok/s step 4949/19560 | loss 3.673353 (+1.89z)| norm 0.2674 (-0.85z)| lr 5.28e-04 | 322.54 ms | 52.3% bf16 MFU | 1622321 tok/s step 4950/19560 | loss 3.529061 (-1.13z)| norm 0.2805 (-0.36z)| lr 5.28e-04 | 323.18 ms | 52.2% bf16 MFU | 1622320 tok/s step 4951/19560 | loss 3.579705 (-0.07z)| norm 0.2550 (-1.32z)| lr 5.28e-04 | 323.20 ms | 52.2% bf16 MFU | 1622314 tok/s step 4952/19560 | loss 3.611620 (+0.60z)| norm 0.2553 (-1.29z)| lr 5.28e-04 | 323.15 ms | 52.2% bf16 MFU | 1622320 tok/s step 4953/19560 | loss 3.537588 (-0.94z)| norm 0.2549 (-1.28z)| lr 5.28e-04 | 323.35 ms | 52.2% bf16 MFU | 1622275 tok/s step 4954/19560 | loss 3.671439 (+1.81z)| norm 0.2678 (-0.79z)| lr 5.28e-04 | 323.54 ms | 52.2% bf16 MFU | 1622186 tok/s step 4955/19560 | loss 3.555200 (-0.59z)| norm 0.2645 (-0.91z)| lr 5.28e-04 | 323.28 ms | 52.2% bf16 MFU | 1622165 tok/s step 4956/19560 | loss 3.566313 (-0.35z)| norm 0.2934 (+0.16z)| lr 5.28e-04 | 322.61 ms | 52.3% bf16 MFU | 1622313 tok/s step 4957/19560 | loss 3.547706 (-0.73z)| norm 0.2588 (-1.11z)| lr 5.28e-04 | 322.95 ms | 52.3% bf16 MFU | 1622369 tok/s step 4958/19560 | loss 3.519458 (-1.29z)| norm 0.2779 (-0.42z)| lr 5.28e-04 | 323.75 ms | 52.1% bf16 MFU | 1622222 tok/s step 4959/19560 | loss 3.628613 (+0.95z)| norm 0.2694 (-0.74z)| lr 5.28e-04 | 322.70 ms | 52.3% bf16 MFU | 1622346 tok/s step 4960/19560 | loss 3.746482 (+3.22z)| norm 0.2522 (-1.37z)| lr 5.28e-04 | 322.74 ms | 52.3% bf16 MFU | 1622452 tok/s step 4961/19560 | loss 
3.558087 (-0.50z)| norm 0.2758 (-0.49z)| lr 5.28e-04 | 322.56 ms | 52.3% bf16 MFU | 1622599 tok/s step 4962/19560 | loss 3.608388 (+0.49z)| norm 0.2553 (-1.23z)| lr 5.28e-04 | 323.01 ms | 52.2% bf16 MFU | 1622626 tok/s step 4963/19560 | loss 3.525233 (-1.14z)| norm 0.2755 (-0.47z)| lr 5.28e-04 | 322.93 ms | 52.3% bf16 MFU | 1622672 tok/s step 4964/19560 | loss 3.525616 (-1.12z)| norm 0.2585 (-1.09z)| lr 5.27e-04 | 323.23 ms | 52.2% bf16 MFU | 1622639 tok/s step 4965/19560 | loss 3.571973 (-0.20z)| norm 0.2629 (-0.91z)| lr 5.27e-04 | 323.04 ms | 52.2% bf16 MFU | 1622656 tok/s step 4966/19560 | loss 3.533950 (-0.97z)| norm 0.2883 (+0.04z)| lr 5.27e-04 | 322.85 ms | 52.3% bf16 MFU | 1622721 tok/s step 4967/19560 | loss 3.591981 (+0.19z)| norm 0.2456 (-1.54z)| lr 5.27e-04 | 322.73 ms | 52.3% bf16 MFU | 1622811 tok/s step 4968/19560 | loss 3.588772 (+0.13z)| norm 0.2707 (-0.59z)| lr 5.27e-04 | 322.71 ms | 52.3% bf16 MFU | 1622904 tok/s step 4969/19560 | loss 3.529004 (-1.08z)| norm 0.2896 (+0.11z)| lr 5.27e-04 | 323.05 ms | 52.2% bf16 MFU | 1622906 tok/s step 4970/19560 | loss 3.593068 (+0.23z)| norm 0.2907 (+0.15z)| lr 5.27e-04 | 322.84 ms | 52.3% bf16 MFU | 1622960 tok/s step 4971/19560 | loss 3.477177 (-2.09z)| norm 0.2517 (-1.31z)| lr 5.27e-04 | 323.32 ms | 52.2% bf16 MFU | 1622890 tok/s step 4972/19560 | loss 3.516619 (-1.28z)| norm 0.2732 (-0.50z)| lr 5.27e-04 | 322.88 ms | 52.3% bf16 MFU | 1622934 tok/s step 4973/19560 | loss 3.540050 (-0.79z)| norm 0.3163 (+1.12z)| lr 5.27e-04 | 322.49 ms | 52.3% bf16 MFU | 1623075 tok/s step 4974/19560 | loss 3.608448 (+0.58z)| norm 0.2803 (-0.22z)| lr 5.27e-04 | 322.78 ms | 52.3% bf16 MFU | 1623136 tok/s step 4975/19560 | loss 3.516687 (-1.25z)| norm 0.2773 (-0.33z)| lr 5.27e-04 | 323.71 ms | 52.1% bf16 MFU | 1622961 tok/s step 4976/19560 | loss 3.537732 (-0.81z)| norm 0.3302 (+1.65z)| lr 5.27e-04 | 322.52 ms | 52.3% bf16 MFU | 1623093 tok/s step 4977/19560 | loss 3.524551 (-1.08z)| norm 0.3224 (+1.34z)| lr 5.27e-04 | 322.82 
ms | 52.3% bf16 MFU | 1623143 tok/s step 4978/19560 | loss 3.510547 (-1.34z)| norm 0.2729 (-0.50z)| lr 5.27e-04 | 323.45 ms | 52.2% bf16 MFU | 1623034 tok/s step 4979/19560 | loss 3.550963 (-0.53z)| norm 0.2845 (-0.05z)| lr 5.27e-04 | 322.91 ms | 52.3% bf16 MFU | 1623064 tok/s step 4980/19560 | loss 3.577769 (-0.00z)| norm 0.2955 (+0.36z)| lr 5.27e-04 | 322.69 ms | 52.3% bf16 MFU | 1623147 tok/s step 4981/19560 | loss 3.633533 (+1.09z)| norm 0.2766 (-0.35z)| lr 5.27e-04 | 322.90 ms | 52.3% bf16 MFU | 1623173 tok/s step 4982/19560 | loss 3.624046 (+0.90z)| norm 0.2950 (+0.34z)| lr 5.27e-04 | 323.31 ms | 52.2% bf16 MFU | 1623096 tok/s step 4983/19560 | loss 3.618081 (+0.78z)| norm 0.2840 (-0.08z)| lr 5.27e-04 | 322.58 ms | 52.3% bf16 MFU | 1623206 tok/s step 4984/19560 | loss 3.559576 (-0.37z)| norm 0.2791 (-0.26z)| lr 5.27e-04 | 323.16 ms | 52.2% bf16 MFU | 1623164 tok/s step 4985/19560 | loss 3.547417 (-0.61z)| norm 0.2783 (-0.29z)| lr 5.27e-04 | 322.67 ms | 52.3% bf16 MFU | 1623247 tok/s step 4986/19560 | loss 3.499551 (-1.52z)| norm 0.2727 (-0.51z)| lr 5.27e-04 | 323.03 ms | 52.2% bf16 MFU | 1623235 tok/s step 4987/19560 | loss 3.570168 (-0.15z)| norm 0.2519 (-1.28z)| lr 5.27e-04 | 322.55 ms | 52.3% bf16 MFU | 1623347 tok/s step 4988/19560 | loss 3.474462 (-1.97z)| norm 0.2602 (-0.97z)| lr 5.27e-04 | 323.61 ms | 52.2% bf16 MFU | 1623185 tok/s step 4989/19560 | loss 3.518886 (-1.11z)| norm 0.2536 (-1.20z)| lr 5.27e-04 | 322.43 ms | 52.3% bf16 MFU | 1623328 tok/s step 4990/19560 | loss 3.598417 (+0.41z)| norm 0.2596 (-0.96z)| lr 5.27e-04 | 322.99 ms | 52.3% bf16 MFU | 1623322 tok/s step 4991/19560 | loss 3.611427 (+0.73z)| norm 0.2547 (-1.13z)| lr 5.27e-04 | 323.10 ms | 52.2% bf16 MFU | 1623291 tok/s step 4992/19560 | loss 3.600289 (+0.50z)| norm 0.2612 (-0.87z)| lr 5.27e-04 | 323.18 ms | 52.2% bf16 MFU | 1623241 tok/s step 4993/19560 | loss 3.554197 (-0.43z)| norm 0.2897 (+0.19z)| lr 5.27e-04 | 323.00 ms | 52.3% bf16 MFU | 1623238 tok/s step 4994/19560 | loss 
3.540402 (-0.71z)| norm 0.3165 (+1.19z)| lr 5.27e-04 | 322.64 ms | 52.3% bf16 MFU | 1623325 tok/s step 4995/19560 | loss 3.554491 (-0.42z)| norm 0.2996 (+0.58z)| lr 5.26e-04 | 322.44 ms | 52.3% bf16 MFU | 1623460 tok/s step 4996/19560 | loss 3.582642 (+0.15z)| norm 0.2930 (+0.34z)| lr 5.26e-04 | 322.94 ms | 52.3% bf16 MFU | 1623462 tok/s step 4997/19560 | loss 3.515320 (-1.22z)| norm 0.2476 (-1.38z)| lr 5.26e-04 | 323.43 ms | 52.2% bf16 MFU | 1623339 tok/s step 4998/19560 | loss 3.513011 (-1.26z)| norm 0.2700 (-0.52z)| lr 5.26e-04 | 323.19 ms | 52.2% bf16 MFU | 1623284 tok/s step 4999/19560 | loss 3.578470 (+0.08z)| norm 0.2655 (-0.68z)| lr 5.26e-04 | 323.57 ms | 52.2% bf16 MFU | 1623135 tok/s step 5000/19560 | loss 3.579595 (+0.11z)| norm 0.2620 (-0.80z)| lr 5.26e-04 | 322.47 ms | 52.3% bf16 MFU | 1623271 tok/s val loss 3.558775 Writing state to log124M/state_00005000_00001.bin Writing state to log124M/state_00005000_00004.bin HellaSwag: 2761/10042 = 0.274945 Writing checkpoint at step 5000 Writing model to log124M/model_00005000.bin
Writing state to log124M/state_00005000_00003.bin Writing state to log124M/state_00005000_00002.bin Writing state to log124M/state_00005000_00005.bin Writing state to log124M/state_00005000_00007.bin evaluating
HellaSwag: 70/79 Writing state to log124M/state_00005000_00006.bin Writing state to log124M/state_00005000_00000.bin step 5001/19560 | loss 3.563595 (-0.22z)| norm 0.2856 (+0.10z)| lr 5.26e-04 | 318.87 ms | 52.9% bf16 MFU | 1624317 tok/s step 5002/19560 | loss 3.536031 (-0.77z)| norm 0.2709 (-0.45z)| lr 5.26e-04 | 320.99 ms | 52.6% bf16 MFU | 1624768 tok/s step 5003/19560 | loss 3.514350 (-1.20z)| norm 0.2677 (-0.56z)| lr 5.26e-04 | 322.17 ms | 52.4% bf16 MFU | 1624898 tok/s step 5004/19560 | loss 3.658193 (+1.72z)| norm 0.2699 (-0.47z)| lr 5.26e-04 | 321.72 ms | 52.5% bf16 MFU | 1625134 tok/s step 5005/19560 | loss 3.565235 (-0.16z)| norm 0.2928 (+0.44z)| lr 5.26e-04 | 321.92 ms | 52.4% bf16 MFU | 1625309 tok/s step 5006/19560 | loss 3.576202 (+0.05z)| norm 0.2744 (-0.29z)| lr 5.26e-04 | 322.74 ms | 52.3% bf16 MFU | 1625268 tok/s step 5007/19560 | loss 3.535709 (-0.77z)| norm 0.2998 (+0.72z)| lr 5.26e-04 | 321.84 ms | 52.4% bf16 MFU | 1625458 tok/s step 5008/19560 | loss 3.595331 (+0.45z)| norm 0.2771 (-0.19z)| lr 5.26e-04 | 322.17 ms | 52.4% bf16 MFU | 1625552 tok/s step 5009/19560 | loss 3.576960 (+0.10z)| norm 0.2869 (+0.20z)| lr 5.26e-04 | 322.25 ms | 52.4% bf16 MFU | 1625624 tok/s step 5010/19560 | loss 3.511806 (-1.25z)| norm 0.2900 (+0.32z)| lr 5.26e-04 | 322.71 ms | 52.3% bf16 MFU | 1625574 tok/s step 5011/19560 | loss 3.577993 (+0.13z)| norm 0.3069 (+0.97z)| lr 5.26e-04 | 322.48 ms | 52.3% bf16 MFU | 1625586 tok/s step 5012/19560 | loss 3.610289 (+0.80z)| norm 0.2695 (-0.51z)| lr 5.26e-04 | 322.10 ms | 52.4% bf16 MFU | 1625694 tok/s step 5013/19560 | loss 3.542117 (-0.62z)| norm 0.2747 (-0.30z)| lr 5.26e-04 | 322.69 ms | 52.3% bf16 MFU | 1625645 tok/s step 5014/19560 | loss 3.552410 (-0.41z)| norm 0.2806 (-0.06z)| lr 5.26e-04 | 322.51 ms | 52.3% bf16 MFU | 1625645 tok/s step 5015/19560 | loss 3.575680 (+0.06z)| norm 0.2677 (-0.58z)| lr 5.26e-04 | 322.26 ms | 52.4% bf16 MFU | 1625710 tok/s step 5016/19560 | loss 3.494785 (-1.63z)| norm 0.2646 (-0.69z)| lr 
5.26e-04 | 322.19 ms | 52.4% bf16 MFU | 1625786 tok/s step 5017/19560 | loss 3.573226 (+0.04z)| norm 0.2681 (-0.54z)| lr 5.26e-04 | 323.52 ms | 52.2% bf16 MFU | 1625525 tok/s step 5018/19560 | loss 3.523031 (-1.03z)| norm 0.2777 (-0.15z)| lr 5.26e-04 | 322.47 ms | 52.3% bf16 MFU | 1625542 tok/s step 5019/19560 | loss 3.626008 (+1.14z)| norm 0.2919 (+0.41z)| lr 5.26e-04 | 322.46 ms | 52.3% bf16 MFU | 1625561 tok/s step 5020/19560 | loss 3.536155 (-0.74z)| norm 0.2874 (+0.23z)| lr 5.26e-04 | 322.80 ms | 52.3% bf16 MFU | 1625492 tok/s step 5021/19560 | loss 3.512032 (-1.23z)| norm 0.2754 (-0.25z)| lr 5.26e-04 | 322.77 ms | 52.3% bf16 MFU | 1625435 tok/s step 5022/19560 | loss 3.561490 (-0.18z)| norm 0.2946 (+0.53z)| lr 5.26e-04 | 322.54 ms | 52.3% bf16 MFU | 1625437 tok/s step 5023/19560 | loss 3.587629 (+0.36z)| norm 0.3035 (+0.89z)| lr 5.26e-04 | 322.78 ms | 52.3% bf16 MFU | 1625379 tok/s step 5024/19560 | loss 3.474690 (-1.98z)| norm 0.3271 (+1.81z)| lr 5.26e-04 | 322.50 ms | 52.3% bf16 MFU | 1625394 tok/s step 5025/19560 | loss 3.630334 (+1.27z)| norm 0.2975 (+0.62z)| lr 5.25e-04 | 322.36 ms | 52.4% bf16 MFU | 1625445 tok/s step 5026/19560 | loss 3.506300 (-1.30z)| norm 0.2726 (-0.37z)| lr 5.25e-04 | 322.90 ms | 52.3% bf16 MFU | 1625358 tok/s step 5027/19560 | loss 3.523304 (-0.94z)| norm 0.2772 (-0.18z)| lr 5.25e-04 | 322.48 ms | 52.3% bf16 MFU | 1625380 tok/s step 5028/19560 | loss 3.572634 (+0.09z)| norm 0.2502 (-1.26z)| lr 5.25e-04 | 323.27 ms | 52.2% bf16 MFU | 1625202 tok/s step 5029/19560 | loss 3.549759 (-0.38z)| norm 0.2812 (-0.02z)| lr 5.25e-04 | 322.05 ms | 52.4% bf16 MFU | 1625340 tok/s step 5030/19560 | loss 3.603853 (+0.75z)| norm 0.2843 (+0.10z)| lr 5.25e-04 | 323.11 ms | 52.2% bf16 MFU | 1625204 tok/s step 5031/19560 | loss 3.566382 (-0.03z)| norm 0.2820 (+0.01z)| lr 5.25e-04 | 323.23 ms | 52.2% bf16 MFU | 1625045 tok/s step 5032/19560 | loss 3.549970 (-0.38z)| norm 0.2563 (-1.02z)| lr 5.25e-04 | 322.49 ms | 52.3% bf16 MFU | 1625081 tok/s step 
5033/19560 | loss 3.572929 (+0.10z)| norm 0.2679 (-0.55z)| lr 5.25e-04 | 322.72 ms | 52.3% bf16 MFU | 1625055 tok/s step 5034/19560 | loss 3.529986 (-0.81z)| norm 0.2679 (-0.56z)| lr 5.25e-04 | 322.70 ms | 52.3% bf16 MFU | 1625038 tok/s step 5035/19560 | loss 3.540280 (-0.60z)| norm 0.2741 (-0.31z)| lr 5.25e-04 | 322.53 ms | 52.3% bf16 MFU | 1625065 tok/s step 5036/19560 | loss 3.537796 (-0.65z)| norm 0.2785 (-0.13z)| lr 5.25e-04 | 322.80 ms | 52.3% bf16 MFU | 1625021 tok/s step 5037/19560 | loss 3.563147 (-0.11z)| norm 0.2692 (-0.50z)| lr 5.25e-04 | 322.56 ms | 52.3% bf16 MFU | 1625039 tok/s step 5038/19560 | loss 3.583719 (+0.33z)| norm 0.2555 (-1.04z)| lr 5.25e-04 | 322.85 ms | 52.3% bf16 MFU | 1624984 tok/s step 5039/19560 | loss 3.557359 (-0.22z)| norm 0.2653 (-0.66z)| lr 5.25e-04 | 322.54 ms | 52.3% bf16 MFU | 1625010 tok/s step 5040/19560 | loss 3.551732 (-0.35z)| norm 0.2726 (-0.37z)| lr 5.25e-04 | 322.62 ms | 52.3% bf16 MFU | 1625014 tok/s step 5041/19560 | loss 3.587013 (+0.40z)| norm 0.2455 (-1.45z)| lr 5.25e-04 | 323.18 ms | 52.2% bf16 MFU | 1624876 tok/s step 5042/19560 | loss 3.589133 (+0.44z)| norm 0.3088 (+1.08z)| lr 5.25e-04 | 322.32 ms | 52.4% bf16 MFU | 1624961 tok/s step 5043/19560 | loss 3.562057 (-0.13z)| norm 0.3172 (+1.39z)| lr 5.25e-04 | 323.23 ms | 52.2% bf16 MFU | 1624813 tok/s step 5044/19560 | loss 3.562728 (-0.12z)| norm 0.2918 (+0.37z)| lr 5.25e-04 | 322.60 ms | 52.3% bf16 MFU | 1624832 tok/s step 5045/19560 | loss 3.547813 (-0.43z)| norm 0.2669 (-0.63z)| lr 5.25e-04 | 322.37 ms | 52.4% bf16 MFU | 1624907 tok/s step 5046/19560 | loss 3.583993 (+0.37z)| norm 0.2696 (-0.51z)| lr 5.25e-04 | 323.05 ms | 52.2% bf16 MFU | 1624809 tok/s step 5047/19560 | loss 3.661667 (+2.04z)| norm 0.2655 (-0.66z)| lr 5.25e-04 | 322.79 ms | 52.3% bf16 MFU | 1624780 tok/s step 5048/19560 | loss 3.569735 (+0.04z)| norm 0.2786 (-0.14z)| lr 5.25e-04 | 323.43 ms | 52.2% bf16 MFU | 1624593 tok/s step 5049/19560 | loss 3.519613 (-1.04z)| norm 0.2904 (+0.33z)| lr 
5.25e-04 | 323.00 ms | 52.3% bf16 MFU | 1624521 tok/s step 5050/19560 | loss 3.534418 (-0.71z)| norm 0.2684 (-0.54z)| lr 5.25e-04 | 322.38 ms | 52.4% bf16 MFU | 1624611 tok/s step 5051/19560 | loss 3.622623 (+1.21z)| norm 0.2828 (+0.04z)| lr 5.25e-04 | 322.19 ms | 52.4% bf16 MFU | 1624744 tok/s step 5052/19560 | loss 3.516656 (-1.09z)| norm 0.2592 (-0.89z)| lr 5.25e-04 | 323.61 ms | 52.2% bf16 MFU | 1624514 tok/s step 5053/19560 | loss 3.883112 (+5.85z)| norm 0.2807 (-0.03z)| lr 5.25e-04 | 323.21 ms | 52.2% bf16 MFU | 1624396 tok/s step 5054/19560 | loss 3.590499 (+0.40z)| norm 0.3009 (+0.78z)| lr 5.25e-04 | 322.73 ms | 52.3% bf16 MFU | 1624404 tok/s step 5055/19560 | loss 3.548207 (-0.38z)| norm 0.2760 (-0.21z)| lr 5.24e-04 | 322.64 ms | 52.3% bf16 MFU | 1624433 tok/s step 5056/19560 | loss 3.528412 (-0.74z)| norm 0.2772 (-0.14z)| lr 5.24e-04 | 323.17 ms | 52.2% bf16 MFU | 1624328 tok/s step 5057/19560 | loss 3.539780 (-0.53z)| norm 0.2565 (-1.00z)| lr 5.24e-04 | 322.97 ms | 52.3% bf16 MFU | 1624279 tok/s step 5058/19560 | loss 3.598954 (+0.58z)| norm 0.2895 (+0.44z)| lr 5.24e-04 | 322.46 ms | 52.3% bf16 MFU | 1624360 tok/s step 5059/19560 | loss 3.597025 (+0.56z)| norm 0.2853 (+0.28z)| lr 5.24e-04 | 323.41 ms | 52.2% bf16 MFU | 1624200 tok/s step 5060/19560 | loss 3.586210 (+0.34z)| norm 0.2722 (-0.30z)| lr 5.24e-04 | 322.66 ms | 52.3% bf16 MFU | 1624234 tok/s step 5061/19560 | loss 3.650757 (+1.55z)| norm 0.2689 (-0.45z)| lr 5.24e-04 | 322.93 ms | 52.3% bf16 MFU | 1624198 tok/s step 5062/19560 | loss 3.505530 (-1.19z)| norm 0.2499 (-1.41z)| lr 5.24e-04 | 323.33 ms | 52.2% bf16 MFU | 1624064 tok/s step 5063/19560 | loss 3.502185 (-1.23z)| norm 0.2907 (+0.71z)| lr 5.24e-04 | 322.61 ms | 52.3% bf16 MFU | 1624118 tok/s step 5064/19560 | loss 3.525702 (-0.78z)| norm 0.2789 (+0.12z)| lr 5.24e-04 | 323.21 ms | 52.2% bf16 MFU | 1624018 tok/s step 5065/19560 | loss 3.583863 (+0.32z)| norm 0.2684 (-0.43z)| lr 5.24e-04 | 322.80 ms | 52.3% bf16 MFU | 1624026 tok/s step 
5066/19560 | loss 3.523441 (-0.81z)| norm 0.2743 (-0.09z)| lr 5.24e-04 | 322.68 ms | 52.3% bf16 MFU | 1624066 tok/s step 5067/19560 | loss 3.593980 (+0.51z)| norm 0.2774 (+0.09z)| lr 5.24e-04 | 323.38 ms | 52.2% bf16 MFU | 1623925 tok/s step 5068/19560 | loss 3.609179 (+0.79z)| norm 0.2902 (+0.83z)| lr 5.24e-04 | 322.15 ms | 52.4% bf16 MFU | 1624102 tok/s step 5069/19560 | loss 3.551098 (-0.29z)| norm 0.2914 (+0.89z)| lr 5.24e-04 | 322.88 ms | 52.3% bf16 MFU | 1624086 tok/s step 5070/19560 | loss 3.560156 (-0.12z)| norm 0.3000 (+1.36z)| lr 5.24e-04 | 322.85 ms | 52.3% bf16 MFU | 1624078 tok/s step 5071/19560 | loss 3.603966 (+0.69z)| norm 0.2595 (-0.96z)| lr 5.24e-04 | 323.05 ms | 52.2% bf16 MFU | 1624020 tok/s step 5072/19560 | loss 3.598878 (+0.60z)| norm 0.2786 (+0.13z)| lr 5.24e-04 | 323.47 ms | 52.2% bf16 MFU | 1623859 tok/s step 5073/19560 | loss 3.558131 (-0.18z)| norm 0.2662 (-0.59z)| lr 5.24e-04 | 322.47 ms | 52.3% bf16 MFU | 1623960 tok/s step 5074/19560 | loss 3.536629 (-0.57z)| norm 0.2952 (+1.07z)| lr 5.24e-04 | 322.62 ms | 52.3% bf16 MFU | 1624018 tok/s step 5075/19560 | loss 3.526793 (-0.75z)| norm 0.2836 (+0.39z)| lr 5.24e-04 | 323.27 ms | 52.2% bf16 MFU | 1623907 tok/s step 5076/19560 | loss 3.557763 (-0.16z)| norm 0.2827 (+0.32z)| lr 5.24e-04 | 323.31 ms | 52.2% bf16 MFU | 1623794 tok/s step 5077/19560 | loss 3.533645 (-0.61z)| norm 0.2974 (+1.16z)| lr 5.24e-04 | 323.27 ms | 52.2% bf16 MFU | 1623696 tok/s step 5078/19560 | loss 3.542974 (-0.43z)| norm 0.2810 (+0.21z)| lr 5.24e-04 | 323.05 ms | 52.2% bf16 MFU | 1623658 tok/s step 5079/19560 | loss 3.542827 (-0.43z)| norm 0.2667 (-0.63z)| lr 5.24e-04 | 322.84 ms | 52.3% bf16 MFU | 1623675 tok/s step 5080/19560 | loss 3.543820 (-0.40z)| norm 0.2828 (+0.30z)| lr 5.24e-04 | 322.86 ms | 52.3% bf16 MFU | 1623686 tok/s step 5081/19560 | loss 3.598244 (+0.66z)| norm 0.2717 (-0.36z)| lr 5.24e-04 | 322.61 ms | 52.3% bf16 MFU | 1623760 tok/s step 5082/19560 | loss 3.588640 (+0.49z)| norm 0.2915 (+0.80z)| lr 
5.24e-04 | 322.65 ms | 52.3% bf16 MFU | 1623820 tok/s step 5083/19560 | loss 3.553211 (-0.22z)| norm 0.2450 (-1.92z)| lr 5.24e-04 | 323.20 ms | 52.2% bf16 MFU | 1623739 tok/s step 5084/19560 | loss 3.566316 (+0.04z)| norm 0.2685 (-0.54z)| lr 5.24e-04 | 322.28 ms | 52.4% bf16 MFU | 1623892 tok/s step 5085/19560 | loss 3.542235 (-0.44z)| norm 0.2665 (-0.66z)| lr 5.23e-04 | 322.36 ms | 52.4% bf16 MFU | 1624019 tok/s step 5086/19560 | loss 3.586257 (+0.43z)| norm 0.2863 (+0.49z)| lr 5.23e-04 | 322.79 ms | 52.3% bf16 MFU | 1624030 tok/s step 5087/19560 | loss 3.552439 (-0.23z)| norm 0.2812 (+0.19z)| lr 5.23e-04 | 322.98 ms | 52.3% bf16 MFU | 1623994 tok/s step 5088/19560 | loss 3.584903 (+0.47z)| norm 0.2750 (-0.18z)| lr 5.23e-04 | 322.63 ms | 52.3% bf16 MFU | 1624045 tok/s step 5089/19560 | loss 3.587380 (+0.52z)| norm 0.2789 (+0.05z)| lr 5.23e-04 | 322.69 ms | 52.3% bf16 MFU | 1624079 tok/s step 5090/19560 | loss 3.599436 (+0.78z)| norm 0.2921 (+0.82z)| lr 5.23e-04 | 323.11 ms | 52.2% bf16 MFU | 1624008 tok/s step 5091/19560 | loss 3.612947 (+1.05z)| norm 0.2517 (-1.57z)| lr 5.23e-04 | 322.57 ms | 52.3% bf16 MFU | 1624074 tok/s step 5092/19560 | loss 3.572202 (+0.18z)| norm 0.2868 (+0.49z)| lr 5.23e-04 | 322.91 ms | 52.3% bf16 MFU | 1624052 tok/s step 5093/19560 | loss 3.551220 (-0.27z)| norm 0.2857 (+0.42z)| lr 5.23e-04 | 323.35 ms | 52.2% bf16 MFU | 1623922 tok/s step 5094/19560 | loss 3.554911 (-0.19z)| norm 0.2558 (-1.33z)| lr 5.23e-04 | 322.43 ms | 52.3% bf16 MFU | 1624029 tok/s step 5095/19560 | loss 3.524089 (-0.84z)| norm 0.2760 (-0.16z)| lr 5.23e-04 | 322.79 ms | 52.3% bf16 MFU | 1624038 tok/s step 5096/19560 | loss 3.552409 (-0.23z)| norm 0.2625 (-0.96z)| lr 5.23e-04 | 322.63 ms | 52.3% bf16 MFU | 1624089 tok/s step 5097/19560 | loss 3.545948 (-0.37z)| norm 0.2694 (-0.54z)| lr 5.23e-04 | 324.05 ms | 52.1% bf16 MFU | 1623782 tok/s step 5098/19560 | loss 3.642319 (+1.66z)| norm 0.2540 (-1.44z)| lr 5.23e-04 | 323.01 ms | 52.2% bf16 MFU | 1623749 tok/s step 
5099/19560 | loss 3.701152 (+2.82z)| norm 0.3938 (+5.87z)| lr 5.23e-04 | 322.72 ms | 52.3% bf16 MFU | 1623792 tok/s step 5100/19560 | loss 3.566303 (+0.01z)| norm 0.3267 (+2.36z)| lr 5.23e-04 | 323.40 ms | 52.2% bf16 MFU | 1623661 tok/s step 5101/19560 | loss 3.547609 (-0.38z)| norm 0.2932 (+0.70z)| lr 5.23e-04 | 322.96 ms | 52.3% bf16 MFU | 1623648 tok/s step 5102/19560 | loss 3.595885 (+0.63z)| norm 0.3146 (+1.75z)| lr 5.23e-04 | 322.56 ms | 52.3% bf16 MFU | 1623735 tok/s step 5103/19560 | loss 3.640934 (+1.54z)| norm 0.2978 (+0.90z)| lr 5.23e-04 | 323.32 ms | 52.2% bf16 MFU | 1623628 tok/s step 5104/19560 | loss 3.567411 (+0.01z)| norm 0.2927 (+0.67z)| lr 5.23e-04 | 322.99 ms | 52.3% bf16 MFU | 1623608 tok/s step 5105/19560 | loss 3.553794 (-0.28z)| norm 0.2976 (+0.95z)| lr 5.23e-04 | 322.82 ms | 52.3% bf16 MFU | 1623633 tok/s step 5106/19560 | loss 3.656878 (+1.83z)| norm 0.4049 (+5.61z)| lr 5.23e-04 | 323.18 ms | 52.2% bf16 MFU | 1623566 tok/s step 5107/19560 | loss 3.512159 (-1.15z)| norm 0.3752 (+3.97z)| lr 5.23e-04 | 322.44 ms | 52.3% bf16 MFU | 1623687 tok/s step 5108/19560 | loss 3.583725 (+0.32z)| norm 0.3334 (+2.16z)| lr 5.23e-04 | 322.82 ms | 52.3% bf16 MFU | 1623707 tok/s step 5109/19560 | loss 3.547131 (-0.42z)| norm 0.3251 (+1.77z)| lr 5.23e-04 | 322.68 ms | 52.3% bf16 MFU | 1623761 tok/s step 5110/19560 | loss 3.507401 (-1.22z)| norm 0.3088 (+1.10z)| lr 5.23e-04 | 323.03 ms | 52.2% bf16 MFU | 1623726 tok/s step 5111/19560 | loss 3.589016 (+0.47z)| norm 0.2886 (+0.27z)| lr 5.23e-04 | 322.57 ms | 52.3% bf16 MFU | 1623806 tok/s step 5112/19560 | loss 3.539262 (-0.56z)| norm 0.2812 (-0.03z)| lr 5.23e-04 | 322.52 ms | 52.3% bf16 MFU | 1623895 tok/s step 5113/19560 | loss 3.488577 (-1.59z)| norm 0.2721 (-0.40z)| lr 5.23e-04 | 322.61 ms | 52.3% bf16 MFU | 1623958 tok/s step 5114/19560 | loss 3.574961 (+0.18z)| norm 0.2614 (-0.83z)| lr 5.23e-04 | 322.87 ms | 52.3% bf16 MFU | 1623952 tok/s step 5115/19560 | loss 3.631826 (+1.34z)| norm 0.2718 (-0.42z)| lr 
5.22e-04 | 322.88 ms | 52.3% bf16 MFU | 1623943 tok/s step 5116/19560 | loss 3.589478 (+0.46z)| norm 0.2857 (+0.14z)| lr 5.22e-04 | 323.10 ms | 52.2% bf16 MFU | 1623879 tok/s step 5117/19560 | loss 3.651671 (+1.72z)| norm 0.3049 (+0.92z)| lr 5.22e-04 | 323.27 ms | 52.2% bf16 MFU | 1623777 tok/s step 5118/19560 | loss 3.550671 (-0.37z)| norm 0.2942 (+0.47z)| lr 5.22e-04 | 322.75 ms | 52.3% bf16 MFU | 1623809 tok/s step 5119/19560 | loss 3.612156 (+0.91z)| norm 0.3136 (+1.25z)| lr 5.22e-04 | 323.30 ms | 52.2% bf16 MFU | 1623703 tok/s step 5120/19560 | loss 3.563345 (-0.10z)| norm 0.2800 (-0.14z)| lr 5.22e-04 | 322.51 ms | 52.3% bf16 MFU | 1623800 tok/s step 5121/19560 | loss 3.521057 (-0.97z)| norm 0.2799 (-0.14z)| lr 5.22e-04 | 322.71 ms | 52.3% bf16 MFU | 1623843 tok/s step 5122/19560 | loss 3.602790 (+0.71z)| norm 0.2636 (-0.80z)| lr 5.22e-04 | 322.47 ms | 52.3% bf16 MFU | 1623942 tok/s step 5123/19560 | loss 3.577318 (+0.18z)| norm 0.2582 (-1.01z)| lr 5.22e-04 | 322.61 ms | 52.3% bf16 MFU | 1624002 tok/s step 5124/19560 | loss 3.562851 (-0.11z)| norm 0.2672 (-0.63z)| lr 5.22e-04 | 322.79 ms | 52.3% bf16 MFU | 1624014 tok/s step 5125/19560 | loss 3.510437 (-1.19z)| norm 0.2759 (-0.28z)| lr 5.22e-04 | 322.49 ms | 52.3% bf16 MFU | 1624100 tok/s step 5126/19560 | loss 3.542228 (-0.54z)| norm 0.2588 (-0.99z)| lr 5.22e-04 | 322.87 ms | 52.3% bf16 MFU | 1624088 tok/s step 5127/19560 | loss 3.538454 (-0.61z)| norm 0.2646 (-0.75z)| lr 5.22e-04 | 322.64 ms | 52.3% bf16 MFU | 1624133 tok/s step 5128/19560 | loss 3.538321 (-0.61z)| norm 0.2483 (-1.42z)| lr 5.22e-04 | 322.74 ms | 52.3% bf16 MFU | 1624150 tok/s step 5129/19560 | loss 3.536960 (-0.63z)| norm 0.2629 (-0.80z)| lr 5.22e-04 | 323.03 ms | 52.2% bf16 MFU | 1624094 tok/s step 5130/19560 | loss 3.500163 (-1.38z)| norm 0.2566 (-1.05z)| lr 5.22e-04 | 322.34 ms | 52.4% bf16 MFU | 1624216 tok/s step 5131/19560 | loss 3.613039 (+0.92z)| norm 0.2685 (-0.56z)| lr 5.22e-04 | 322.95 ms | 52.3% bf16 MFU | 1624176 tok/s step 
5132/19560 | loss 3.580036 (+0.26z)| norm 0.2811 (-0.05z)| lr 5.22e-04 | 322.52 ms | 52.3% bf16 MFU | 1624248 tok/s step 5133/19560 | loss 3.560100 (-0.15z)| norm 0.2775 (-0.19z)| lr 5.22e-04 | 322.16 ms | 52.4% bf16 MFU | 1624407 tok/s step 5134/19560 | loss 3.562082 (-0.11z)| norm 0.2691 (-0.53z)| lr 5.22e-04 | 322.64 ms | 52.3% bf16 MFU | 1624438 tok/s step 5135/19560 | loss 3.555401 (-0.25z)| norm 0.2628 (-0.78z)| lr 5.22e-04 | 322.61 ms | 52.3% bf16 MFU | 1624473 tok/s step 5136/19560 | loss 3.589890 (+0.47z)| norm 0.2743 (-0.31z)| lr 5.22e-04 | 323.20 ms | 52.2% bf16 MFU | 1624358 tok/s step 5137/19560 | loss 3.535004 (-0.67z)| norm 0.2780 (-0.15z)| lr 5.22e-04 | 322.59 ms | 52.3% bf16 MFU | 1624403 tok/s step 5138/19560 | loss 3.509778 (-1.19z)| norm 0.2843 (+0.11z)| lr 5.22e-04 | 322.32 ms | 52.4% bf16 MFU | 1624514 tok/s step 5139/19560 | loss 3.536853 (-0.62z)| norm 0.2746 (-0.28z)| lr 5.22e-04 | 322.57 ms | 52.3% bf16 MFU | 1624557 tok/s step 5140/19560 | loss 3.494505 (-1.48z)| norm 0.2578 (-0.97z)| lr 5.22e-04 | 322.95 ms | 52.3% bf16 MFU | 1624502 tok/s step 5141/19560 | loss 3.490445 (-1.54z)| norm 0.2654 (-0.65z)| lr 5.22e-04 | 322.70 ms | 52.3% bf16 MFU | 1624510 tok/s step 5142/19560 | loss 3.564527 (-0.02z)| norm 0.2441 (-1.51z)| lr 5.22e-04 | 322.38 ms | 52.4% bf16 MFU | 1624600 tok/s step 5143/19560 | loss 3.635724 (+1.42z)| norm 0.2708 (-0.42z)| lr 5.22e-04 | 322.65 ms | 52.3% bf16 MFU | 1624616 tok/s step 5144/19560 | loss 3.530494 (-0.73z)| norm 0.2766 (-0.18z)| lr 5.22e-04 | 323.13 ms | 52.2% bf16 MFU | 1624512 tok/s step 5145/19560 | loss 3.596481 (+0.61z)| norm 0.3289 (+1.91z)| lr 5.21e-04 | 322.91 ms | 52.3% bf16 MFU | 1624468 tok/s step 5146/19560 | loss 3.626298 (+1.20z)| norm 0.3279 (+1.83z)| lr 5.21e-04 | 322.76 ms | 52.3% bf16 MFU | 1624465 tok/s step 5147/19560 | loss 3.592811 (+0.53z)| norm 0.3032 (+0.84z)| lr 5.21e-04 | 322.10 ms | 52.4% bf16 MFU | 1624626 tok/s step 5148/19560 | loss 3.561825 (-0.11z)| norm 0.2631 (-0.74z)| lr 
5.21e-04 | 322.13 ms | 52.4% bf16 MFU | 1624773 tok/s step 5149/19560 | loss 3.527388 (-0.82z)| norm 0.3008 (+0.74z)| lr 5.21e-04 | 323.49 ms | 52.2% bf16 MFU | 1624571 tok/s step 5150/19560 | loss 3.554258 (-0.27z)| norm 0.2944 (+0.49z)| lr 5.21e-04 | 322.62 ms | 52.3% bf16 MFU | 1624596 tok/s step 5151/19560 | loss 3.503586 (-1.29z)| norm 0.2623 (-0.77z)| lr 5.21e-04 | 322.62 ms | 52.3% bf16 MFU | 1624622 tok/s step 5152/19560 | loss 3.584303 (+0.35z)| norm 0.2735 (-0.31z)| lr 5.21e-04 | 322.88 ms | 52.3% bf16 MFU | 1624579 tok/s step 5153/19560 | loss 3.499645 (-1.39z)| norm 0.2536 (-1.10z)| lr 5.21e-04 | 323.13 ms | 52.2% bf16 MFU | 1624476 tok/s step 5154/19560 | loss 3.539051 (-0.58z)| norm 0.2879 (+0.27z)| lr 5.21e-04 | 322.79 ms | 52.3% bf16 MFU | 1624464 tok/s step 5155/19560 | loss 3.556036 (-0.23z)| norm 0.2715 (-0.38z)| lr 5.21e-04 | 322.78 ms | 52.3% bf16 MFU | 1624456 tok/s step 5156/19560 | loss 3.526185 (-0.84z)| norm 0.3007 (+0.77z)| lr 5.21e-04 | 322.75 ms | 52.3% bf16 MFU | 1624455 tok/s step 5157/19560 | loss 3.531451 (-0.73z)| norm 0.3203 (+1.53z)| lr 5.21e-04 | 323.06 ms | 52.2% bf16 MFU | 1624375 tok/s step 5158/19560 | loss 3.520626 (-0.94z)| norm 0.2667 (-0.59z)| lr 5.21e-04 | 322.52 ms | 52.3% bf16 MFU | 1624437 tok/s step 5159/19560 | loss 3.565184 (-0.01z)| norm 0.2590 (-0.89z)| lr 5.21e-04 | 323.21 ms | 52.2% bf16 MFU | 1624321 tok/s step 5160/19560 | loss 3.594676 (+0.59z)| norm 0.2531 (-1.12z)| lr 5.21e-04 | 322.86 ms | 52.3% bf16 MFU | 1624299 tok/s step 5161/19560 | loss 3.538949 (-0.56z)| norm 0.2658 (-0.62z)| lr 5.21e-04 | 322.78 ms | 52.3% bf16 MFU | 1624299 tok/s step 5162/19560 | loss 3.574415 (+0.17z)| norm 0.2407 (-1.58z)| lr 5.21e-04 | 322.76 ms | 52.3% bf16 MFU | 1624304 tok/s step 5163/19560 | loss 3.599506 (+0.68z)| norm 0.2647 (-0.64z)| lr 5.21e-04 | 322.58 ms | 52.3% bf16 MFU | 1624352 tok/s step 5164/19560 | loss 3.499092 (-1.39z)| norm 0.2514 (-1.15z)| lr 5.21e-04 | 322.61 ms | 52.3% bf16 MFU | 1624391 tok/s step 
5165/19560 | loss 3.574523 (+0.16z)| norm 0.2734 (-0.30z)| lr 5.21e-04 | 322.90 ms | 52.3% bf16 MFU | 1624355 tok/s step 5166/19560 | loss 3.528922 (-0.77z)| norm 0.2704 (-0.42z)| lr 5.21e-04 | 322.73 ms | 52.3% bf16 MFU | 1624364 tok/s step 5167/19560 | loss 3.518542 (-0.97z)| norm 0.2663 (-0.58z)| lr 5.21e-04 | 322.68 ms | 52.3% bf16 MFU | 1624385 tok/s step 5168/19560 | loss 3.562917 (-0.06z)| norm 0.2814 (+0.01z)| lr 5.21e-04 | 323.35 ms | 52.2% bf16 MFU | 1624236 tok/s step 5169/19560 | loss 3.550551 (-0.31z)| norm 0.2674 (-0.55z)| lr 5.21e-04 | 322.63 ms | 52.3% bf16 MFU | 1624276 tok/s step 5170/19560 | loss 3.513635 (-1.05z)| norm 0.2544 (-1.04z)| lr 5.21e-04 | 322.53 ms | 52.3% bf16 MFU | 1624339 tok/s step 5171/19560 | loss 3.484232 (-1.62z)| norm 0.2786 (-0.08z)| lr 5.21e-04 | 323.28 ms | 52.2% bf16 MFU | 1624210 tok/s step 5172/19560 | loss 3.553542 (-0.22z)| norm 0.3196 (+1.53z)| lr 5.21e-04 | 322.65 ms | 52.3% bf16 MFU | 1624247 tok/s step 5173/19560 | loss 3.553214 (-0.23z)| norm 0.3161 (+1.37z)| lr 5.21e-04 | 322.58 ms | 52.3% bf16 MFU | 1624299 tok/s step 5174/19560 | loss 3.520158 (-0.88z)| norm 0.2635 (-0.69z)| lr 5.21e-04 | 322.69 ms | 52.3% bf16 MFU | 1624322 tok/s step 5175/19560 | loss 3.526005 (-0.75z)| norm 0.3037 (+0.87z)| lr 5.20e-04 | 322.87 ms | 52.3% bf16 MFU | 1624298 tok/s step 5176/19560 | loss 3.525905 (-0.75z)| norm 0.2909 (+0.36z)| lr 5.20e-04 | 322.15 ms | 52.4% bf16 MFU | 1624456 tok/s step 5177/19560 | loss 3.549495 (-0.27z)| norm 0.2848 (+0.13z)| lr 5.20e-04 | 322.83 ms | 52.3% bf16 MFU | 1624435 tok/s step 5178/19560 | loss 3.495093 (-1.37z)| norm 0.2835 (+0.07z)| lr 5.20e-04 | 322.88 ms | 52.3% bf16 MFU | 1624402 tok/s step 5179/19560 | loss 3.554199 (-0.16z)| norm 0.2820 (+0.01z)| lr 5.20e-04 | 322.52 ms | 52.3% bf16 MFU | 1624461 tok/s step 5180/19560 | loss 3.531236 (-0.63z)| norm 0.2901 (+0.32z)| lr 5.20e-04 | 322.68 ms | 52.3% bf16 MFU | 1624478 tok/s step 5181/19560 | loss 3.552514 (-0.17z)| norm 0.2922 (+0.40z)| lr 
5.20e-04 | 322.44 ms | 52.3% bf16 MFU | 1624554 tok/s step 5182/19560 | loss 3.537111 (-0.55z)| norm 0.2753 (-0.25z)| lr 5.20e-04 | 322.43 ms | 52.3% bf16 MFU | 1624629 tok/s step 5183/19560 | loss 3.565777 (+0.17z)| norm 0.2756 (-0.24z)| lr 5.20e-04 | 323.40 ms | 52.2% bf16 MFU | 1624456 tok/s step 5184/19560 | loss 3.549966 (-0.24z)| norm 0.2688 (-0.50z)| lr 5.20e-04 | 322.59 ms | 52.3% bf16 MFU | 1624494 tok/s step 5185/19560 | loss 3.560349 (+0.02z)| norm 0.2601 (-0.85z)| lr 5.20e-04 | 322.60 ms | 52.3% bf16 MFU | 1624528 tok/s step 5186/19560 | loss 3.538330 (-0.53z)| norm 0.2405 (-1.58z)| lr 5.20e-04 | 322.69 ms | 52.3% bf16 MFU | 1624538 tok/s step 5187/19560 | loss 3.610138 (+1.30z)| norm 0.2700 (-0.43z)| lr 5.20e-04 | 322.73 ms | 52.3% bf16 MFU | 1624537 tok/s step 5188/19560 | loss 3.538123 (-0.52z)| norm 0.2719 (-0.36z)| lr 5.20e-04 | 322.45 ms | 52.3% bf16 MFU | 1624608 tok/s step 5189/19560 | loss 3.514983 (-1.10z)| norm 0.2510 (-1.16z)| lr 5.20e-04 | 322.80 ms | 52.3% bf16 MFU | 1624587 tok/s step 5190/19560 | loss 3.543308 (-0.38z)| norm 0.2680 (-0.51z)| lr 5.20e-04 | 322.82 ms | 52.3% bf16 MFU | 1624562 tok/s step 5191/19560 | loss 3.569236 (+0.28z)| norm 0.2702 (-0.42z)| lr 5.20e-04 | 322.72 ms | 52.3% bf16 MFU | 1624562 tok/s step 5192/19560 | loss 3.589444 (+0.80z)| norm 0.2834 (+0.09z)| lr 5.20e-04 | 322.28 ms | 52.4% bf16 MFU | 1624676 tok/s step 5193/19560 | loss 3.533970 (-0.65z)| norm 0.2925 (+0.44z)| lr 5.20e-04 | 322.57 ms | 52.3% bf16 MFU | 1624708 tok/s step 5194/19560 | loss 3.578761 (+0.52z)| norm 0.2893 (+0.31z)| lr 5.20e-04 | 322.66 ms | 52.3% bf16 MFU | 1624717 tok/s step 5195/19560 | loss 3.578291 (+0.51z)| norm 0.2647 (-0.64z)| lr 5.20e-04 | 323.04 ms | 52.2% bf16 MFU | 1624631 tok/s step 5196/19560 | loss 3.547928 (-0.28z)| norm 0.2609 (-0.78z)| lr 5.20e-04 | 323.25 ms | 52.2% bf16 MFU | 1624495 tok/s step 5197/19560 | loss 3.524135 (-0.90z)| norm 0.2937 (+0.49z)| lr 5.20e-04 | 322.47 ms | 52.3% bf16 MFU | 1624563 tok/s step 
5198/19560 | loss 3.516896 (-1.08z)| norm 0.2523 (-1.09z)| lr 5.20e-04 | 322.49 ms | 52.3% bf16 MFU | 1624623 tok/s step 5199/19560 | loss 3.543178 (-0.38z)| norm 0.2591 (-0.83z)| lr 5.20e-04 | 322.88 ms | 52.3% bf16 MFU | 1624581 tok/s step 5200/19560 | loss 3.533157 (-0.63z)| norm 0.2528 (-1.06z)| lr 5.20e-04 | 322.60 ms | 52.3% bf16 MFU | 1624613 tok/s step 5201/19560 | loss 3.539741 (-0.45z)| norm 0.2980 (+0.66z)| lr 5.20e-04 | 322.22 ms | 52.4% bf16 MFU | 1624737 tok/s step 5202/19560 | loss 3.557936 (+0.03z)| norm 0.2638 (-0.64z)| lr 5.20e-04 | 323.25 ms | 52.2% bf16 MFU | 1624597 tok/s step 5203/19560 | loss 3.707136 (+3.74z)| norm 0.3202 (+1.50z)| lr 5.20e-04 | 322.68 ms | 52.3% bf16 MFU | 1624607 tok/s step 5204/19560 | loss 3.524845 (-0.83z)| norm 0.4245 (+4.88z)| lr 5.19e-04 | 322.49 ms | 52.3% bf16 MFU | 1624663 tok/s step 5205/19560 | loss 3.562383 (+0.10z)| norm 0.3121 (+1.03z)| lr 5.19e-04 | 322.58 ms | 52.3% bf16 MFU | 1624694 tok/s step 5206/19560 | loss 3.551236 (-0.18z)| norm 0.2801 (-0.06z)| lr 5.19e-04 | 322.98 ms | 52.3% bf16 MFU | 1624624 tok/s step 5207/19560 | loss 3.572062 (+0.34z)| norm 0.2777 (-0.15z)| lr 5.19e-04 | 322.15 ms | 52.4% bf16 MFU | 1624765 tok/s step 5208/19560 | loss 3.464305 (-2.30z)| norm 0.2609 (-0.71z)| lr 5.19e-04 | 322.76 ms | 52.3% bf16 MFU | 1624746 tok/s step 5209/19560 | loss 3.535965 (-0.53z)| norm 0.2721 (-0.33z)| lr 5.19e-04 | 322.52 ms | 52.3% bf16 MFU | 1624788 tok/s step 5210/19560 | loss 3.509251 (-1.17z)| norm 0.2547 (-0.92z)| lr 5.19e-04 | 324.18 ms | 52.1% bf16 MFU | 1624413 tok/s step 5211/19560 | loss 3.516793 (-0.97z)| norm 0.2736 (-0.28z)| lr 5.19e-04 | 322.44 ms | 52.3% bf16 MFU | 1624493 tok/s step 5212/19560 | loss 3.569086 (+0.31z)| norm 0.2747 (-0.24z)| lr 5.19e-04 | 322.13 ms | 52.4% bf16 MFU | 1624648 tok/s step 5213/19560 | loss 3.506691 (-1.21z)| norm 0.2777 (-0.15z)| lr 5.19e-04 | 322.59 ms | 52.3% bf16 MFU | 1624678 tok/s step 5214/19560 | loss 3.493516 (-1.50z)| norm 0.2917 (+0.33z)| lr 
5.19e-04 | 322.51 ms | 52.3% bf16 MFU | 1624725 tok/s step 5215/19560 | loss 3.607897 (+1.25z)| norm 0.2578 (-0.82z)| lr 5.19e-04 | 322.50 ms | 52.3% bf16 MFU | 1624774 tok/s step 5216/19560 | loss 3.559024 (+0.08z)| norm 0.2752 (-0.23z)| lr 5.19e-04 | 322.83 ms | 52.3% bf16 MFU | 1624737 tok/s step 5217/19560 | loss 3.537531 (-0.43z)| norm 0.2636 (-0.62z)| lr 5.19e-04 | 322.32 ms | 52.4% bf16 MFU | 1624830 tok/s step 5218/19560 | loss 3.507312 (-1.14z)| norm 0.2619 (-0.67z)| lr 5.19e-04 | 322.69 ms | 52.3% bf16 MFU | 1624826 tok/s step 5219/19560 | loss 3.510864 (-1.04z)| norm 0.2704 (-0.38z)| lr 5.19e-04 | 322.77 ms | 52.3% bf16 MFU | 1624801 tok/s step 5220/19560 | loss 3.637280 (+1.98z)| norm 0.2790 (-0.09z)| lr 5.19e-04 | 322.55 ms | 52.3% bf16 MFU | 1624832 tok/s step 5221/19560 | loss 3.567577 (+0.31z)| norm 0.2862 (+0.16z)| lr 5.19e-04 | 322.75 ms | 52.3% bf16 MFU | 1624813 tok/s step 5222/19560 | loss 3.527090 (-0.65z)| norm 0.2770 (-0.16z)| lr 5.19e-04 | 322.19 ms | 52.4% bf16 MFU | 1624936 tok/s step 5223/19560 | loss 3.576400 (+0.52z)| norm 0.2946 (+0.44z)| lr 5.19e-04 | 323.26 ms | 52.2% bf16 MFU | 1624782 tok/s step 5224/19560 | loss 3.544830 (-0.23z)| norm 0.2794 (-0.09z)| lr 5.19e-04 | 322.62 ms | 52.3% bf16 MFU | 1624797 tok/s step 5225/19560 | loss 3.542746 (-0.28z)| norm 0.2694 (-0.43z)| lr 5.19e-04 | 322.56 ms | 52.3% bf16 MFU | 1624827 tok/s step 5226/19560 | loss 3.516887 (-0.89z)| norm 0.2738 (-0.29z)| lr 5.19e-04 | 322.22 ms | 52.4% bf16 MFU | 1624941 tok/s step 5227/19560 | loss 3.554927 (+0.06z)| norm 0.2988 (+0.64z)| lr 5.19e-04 | 322.53 ms | 52.3% bf16 MFU | 1624970 tok/s step 5228/19560 | loss 3.577560 (+0.64z)| norm 0.2878 (+0.25z)| lr 5.19e-04 | 322.29 ms | 52.4% bf16 MFU | 1625060 tok/s step 5229/19560 | loss 3.622991 (+1.75z)| norm 0.2838 (+0.10z)| lr 5.19e-04 | 323.10 ms | 52.2% bf16 MFU | 1624941 tok/s step 5230/19560 | loss 3.489702 (-1.57z)| norm 0.2768 (-0.15z)| lr 5.19e-04 | 322.38 ms | 52.4% bf16 MFU | 1625009 tok/s step 
5231/19560 | loss 3.545860 (-0.15z)| norm 0.3087 (+1.03z)| lr 5.19e-04 | 322.59 ms | 52.3% bf16 MFU | 1625020 tok/s step 5232/19560 | loss 3.517608 (-0.85z)| norm 0.2844 (+0.14z)| lr 5.19e-04 | 323.60 ms | 52.2% bf16 MFU | 1624777 tok/s step 5233/19560 | loss 3.506004 (-1.13z)| norm 0.2781 (-0.09z)| lr 5.18e-04 | 322.38 ms | 52.4% bf16 MFU | 1624855 tok/s step 5234/19560 | loss 3.496230 (-1.38z)| norm 0.2594 (-0.82z)| lr 5.18e-04 | 322.25 ms | 52.4% bf16 MFU | 1624959 tok/s step 5235/19560 | loss 3.568862 (+0.49z)| norm 0.2653 (-0.58z)| lr 5.18e-04 | 323.20 ms | 52.2% bf16 MFU | 1624820 tok/s step 5236/19560 | loss 3.542883 (-0.18z)| norm 0.2654 (-0.56z)| lr 5.18e-04 | 322.56 ms | 52.3% bf16 MFU | 1624848 tok/s step 5237/19560 | loss 3.559570 (+0.25z)| norm 0.2758 (-0.09z)| lr 5.18e-04 | 322.82 ms | 52.3% bf16 MFU | 1624810 tok/s step 5238/19560 | loss 3.550585 (+0.01z)| norm 0.2431 (-1.53z)| lr 5.18e-04 | 322.53 ms | 52.3% bf16 MFU | 1624847 tok/s step 5239/19560 | loss 3.564058 (+0.37z)| norm 0.2532 (-1.06z)| lr 5.18e-04 | 323.07 ms | 52.2% bf16 MFU | 1624746 tok/s step 5240/19560 | loss 3.590827 (+1.06z)| norm 0.2535 (-1.04z)| lr 5.18e-04 | 322.31 ms | 52.4% bf16 MFU | 1624842 tok/s step 5241/19560 | loss 3.584331 (+0.87z)| norm 0.2823 (+0.25z)| lr 5.18e-04 | 322.51 ms | 52.3% bf16 MFU | 1624883 tok/s step 5242/19560 | loss 3.603237 (+1.36z)| norm 0.2508 (-1.15z)| lr 5.18e-04 | 323.46 ms | 52.2% bf16 MFU | 1624684 tok/s step 5243/19560 | loss 3.505053 (-1.20z)| norm 0.2610 (-0.69z)| lr 5.18e-04 | 322.43 ms | 52.3% bf16 MFU | 1624751 tok/s step 5244/19560 | loss 3.631899 (+2.12z)| norm 0.2602 (-0.72z)| lr 5.18e-04 | 322.55 ms | 52.3% bf16 MFU | 1624785 tok/s step 5245/19560 | loss 3.640508 (+2.36z)| norm 0.2757 (-0.02z)| lr 5.18e-04 | 323.09 ms | 52.2% bf16 MFU | 1624682 tok/s step 5246/19560 | loss 3.575975 (+0.66z)| norm 0.2502 (-1.14z)| lr 5.18e-04 | 323.13 ms | 52.2% bf16 MFU | 1624573 tok/s step 5247/19560 | loss 3.487313 (-1.64z)| norm 0.2703 (-0.23z)| lr 
5.18e-04 | 322.95 ms | 52.3% bf16 MFU | 1624515 tok/s step 5248/19560 | loss 3.541054 (-0.22z)| norm 0.2653 (-0.45z)| lr 5.18e-04 | 322.73 ms | 52.3% bf16 MFU | 1624517 tok/s step 5249/19560 | loss 3.588393 (+1.00z)| norm 0.2653 (-0.45z)| lr 5.18e-04 | 322.71 ms | 52.3% bf16 MFU | 1624524 tok/s step 5250/19560 | loss 3.493776 (-1.45z)| norm 0.2719 (-0.16z)| lr 5.18e-04 | 323.20 ms | 52.2% bf16 MFU | 1624408 tok/s val loss 3.545979 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating HellaSwag: 70/79 HellaSwag: 2754/10042 = 0.274248 step 5251/19560 | loss 3.517582 (-0.82z)| norm 0.2879 (+0.55z)| lr 5.18e-04 | 321.76 ms | 52.5% bf16 MFU | 1624660 tok/s step 5252/19560 | loss 3.498264 (-1.30z)| norm 0.2846 (+0.40z)| lr 5.18e-04 | 322.67 ms | 52.3% bf16 MFU | 1624668 tok/s step 5253/19560 | loss 3.599802 (+1.31z)| norm 0.2538 (-0.97z)| lr 5.18e-04 | 323.33 ms | 52.2% bf16 MFU | 1624511 tok/s step 5254/19560 | loss 3.535271 (-0.35z)| norm 0.2912 (+0.69z)| lr 5.18e-04 | 322.81 ms | 52.3% bf16 MFU | 1624492 tok/s step 5255/19560 | loss 3.552275 (+0.08z)| norm 0.2813 (+0.24z)| lr 5.18e-04 | 322.91 ms | 52.3% bf16 MFU | 1624450 tok/s step 5256/19560 | loss 3.578015 (+0.74z)| norm 0.2806 (+0.20z)| lr 5.18e-04 | 322.58 ms | 52.3% bf16 MFU | 1624491 tok/s step 5257/19560 | loss 3.580423 (+0.79z)| norm 0.2834 (+0.32z)| lr 5.18e-04 | 323.09 ms | 52.2% bf16 MFU | 1624403 tok/s step 5258/19560 | loss 3.584064 (+0.87z)| norm 0.2797 (+0.15z)| lr 5.18e-04 | 323.01 ms | 52.2% bf16 MFU | 1624340 tok/s step 5259/19560 | loss 3.567622 (+0.46z)| norm 0.2590 (-0.79z)| lr 5.18e-04 | 323.39 ms | 52.2% bf16 MFU | 1624184 tok/s step 5260/19560 | loss 3.542408 (-0.19z)| norm 0.2690 (-0.33z)| lr 5.18e-04 | 322.86 ms | 52.3% bf16 MFU | 1624170 tok/s step 5261/19560 | loss 3.571701 (+0.57z)| norm 0.2812 (+0.22z)| lr 5.18e-04 | 324.03 ms | 52.1% 
bf16 MFU | 1623861 tok/s step 5262/19560 | loss 3.522394 (-0.71z)| norm 0.2729 (-0.16z)| lr 5.18e-04 | 322.88 ms | 52.3% bf16 MFU | 1623856 tok/s step 5263/19560 | loss 3.677195 (+3.18z)| norm 0.2783 (+0.08z)| lr 5.17e-04 | 322.12 ms | 52.4% bf16 MFU | 1624044 tok/s step 5264/19560 | loss 3.549039 (-0.03z)| norm 0.2902 (+0.61z)| lr 5.17e-04 | 323.11 ms | 52.2% bf16 MFU | 1623973 tok/s step 5265/19560 | loss 3.536417 (-0.35z)| norm 0.2802 (+0.16z)| lr 5.17e-04 | 322.42 ms | 52.3% bf16 MFU | 1624081 tok/s step 5266/19560 | loss 3.594908 (+1.11z)| norm 0.2827 (+0.28z)| lr 5.17e-04 | 322.93 ms | 52.3% bf16 MFU | 1624053 tok/s step 5267/19560 | loss 3.516133 (-0.87z)| norm 0.2503 (-1.18z)| lr 5.17e-04 | 322.95 ms | 52.3% bf16 MFU | 1624022 tok/s step 5268/19560 | loss 3.529371 (-0.54z)| norm 0.2582 (-0.82z)| lr 5.17e-04 | 322.92 ms | 52.3% bf16 MFU | 1624001 tok/s step 5269/19560 | loss 3.553147 (+0.04z)| norm 0.2843 (+0.35z)| lr 5.17e-04 | 323.25 ms | 52.2% bf16 MFU | 1623897 tok/s step 5270/19560 | loss 3.554713 (+0.09z)| norm 0.2785 (+0.07z)| lr 5.17e-04 | 322.57 ms | 52.3% bf16 MFU | 1623970 tok/s step 5271/19560 | loss 3.517354 (-0.86z)| norm 0.2559 (-0.94z)| lr 5.17e-04 | 322.38 ms | 52.4% bf16 MFU | 1624088 tok/s step 5272/19560 | loss 3.581236 (+0.79z)| norm 0.2618 (-0.67z)| lr 5.17e-04 | 322.74 ms | 52.3% bf16 MFU | 1624108 tok/s step 5273/19560 | loss 3.531739 (-0.48z)| norm 0.2619 (-0.65z)| lr 5.17e-04 | 323.51 ms | 52.2% bf16 MFU | 1623934 tok/s step 5274/19560 | loss 3.587431 (+0.98z)| norm 0.2464 (-1.36z)| lr 5.17e-04 | 322.89 ms | 52.3% bf16 MFU | 1623925 tok/s step 5275/19560 | loss 3.559848 (+0.27z)| norm 0.2534 (-1.02z)| lr 5.17e-04 | 322.54 ms | 52.3% bf16 MFU | 1624004 tok/s step 5276/19560 | loss 3.530156 (-0.51z)| norm 0.2654 (-0.46z)| lr 5.17e-04 | 323.47 ms | 52.2% bf16 MFU | 1623846 tok/s step 5277/19560 | loss 3.539078 (-0.28z)| norm 0.2592 (-0.73z)| lr 5.17e-04 | 322.37 ms | 52.4% bf16 MFU | 1623972 tok/s step 5278/19560 | loss 3.518686 
(-0.81z)| norm 0.2675 (-0.33z)| lr 5.17e-04 | 323.42 ms | 52.2% bf16 MFU | 1623828 tok/s step 5279/19560 | loss 3.499359 (-1.31z)| norm 0.2904 (+0.74z)| lr 5.17e-04 | 323.21 ms | 52.2% bf16 MFU | 1623744 tok/s step 5280/19560 | loss 3.599795 (+1.32z)| norm 0.3025 (+1.29z)| lr 5.17e-04 | 322.68 ms | 52.3% bf16 MFU | 1623796 tok/s step 5281/19560 | loss 3.497254 (-1.37z)| norm 0.3162 (+1.89z)| lr 5.17e-04 | 322.87 ms | 52.3% bf16 MFU | 1623797 tok/s step 5282/19560 | loss 3.506152 (-1.12z)| norm 0.3042 (+1.32z)| lr 5.17e-04 | 322.50 ms | 52.3% bf16 MFU | 1623892 tok/s step 5283/19560 | loss 3.568954 (+0.51z)| norm 0.2829 (+0.33z)| lr 5.17e-04 | 323.70 ms | 52.1% bf16 MFU | 1623681 tok/s step 5284/19560 | loss 3.605422 (+1.44z)| norm 0.3209 (+2.05z)| lr 5.17e-04 | 323.07 ms | 52.2% bf16 MFU | 1623637 tok/s step 5285/19560 | loss 3.540723 (-0.24z)| norm 0.3118 (+1.65z)| lr 5.17e-04 | 322.93 ms | 52.3% bf16 MFU | 1623633 tok/s step 5286/19560 | loss 3.548955 (-0.03z)| norm 0.2846 (+0.40z)| lr 5.17e-04 | 322.90 ms | 52.3% bf16 MFU | 1623635 tok/s step 5287/19560 | loss 3.511089 (-1.00z)| norm 0.3351 (+2.62z)| lr 5.17e-04 | 322.64 ms | 52.3% bf16 MFU | 1623704 tok/s step 5288/19560 | loss 3.694086 (+3.55z)| norm 0.3125 (+1.58z)| lr 5.17e-04 | 323.96 ms | 52.1% bf16 MFU | 1623436 tok/s step 5289/19560 | loss 3.571779 (+0.52z)| norm 0.3262 (+2.13z)| lr 5.17e-04 | 323.14 ms | 52.2% bf16 MFU | 1623388 tok/s step 5290/19560 | loss 3.635437 (+2.05z)| norm 0.3445 (+2.84z)| lr 5.17e-04 | 322.46 ms | 52.3% bf16 MFU | 1623514 tok/s step 5291/19560 | loss 3.590143 (+0.95z)| norm 0.3054 (+1.14z)| lr 5.17e-04 | 322.31 ms | 52.4% bf16 MFU | 1623670 tok/s step 5292/19560 | loss 3.749637 (+4.43z)| norm 0.2755 (-0.14z)| lr 5.16e-04 | 322.88 ms | 52.3% bf16 MFU | 1623675 tok/s step 5293/19560 | loss 3.571384 (+0.41z)| norm 0.2962 (+0.74z)| lr 5.16e-04 | 323.11 ms | 52.2% bf16 MFU | 1623624 tok/s step 5294/19560 | loss 3.509710 (-0.97z)| norm 0.2630 (-0.68z)| lr 5.16e-04 | 322.90 ms | 52.3% 
bf16 MFU | 1623626 tok/s step 5295/19560 | loss 3.556040 (+0.06z)| norm 0.2790 (-0.00z)| lr 5.16e-04 | 322.72 ms | 52.3% bf16 MFU | 1623674 tok/s step 5296/19560 | loss 3.574795 (+0.48z)| norm 0.2631 (-0.67z)| lr 5.16e-04 | 322.52 ms | 52.3% bf16 MFU | 1623770 tok/s step 5297/19560 | loss 3.562970 (+0.21z)| norm 0.2545 (-1.03z)| lr 5.16e-04 | 322.45 ms | 52.3% bf16 MFU | 1623880 tok/s step 5298/19560 | loss 3.555919 (+0.05z)| norm 0.2651 (-0.59z)| lr 5.16e-04 | 322.52 ms | 52.3% bf16 MFU | 1623966 tok/s step 5299/19560 | loss 3.529009 (-0.57z)| norm 0.2447 (-1.43z)| lr 5.16e-04 | 322.89 ms | 52.3% bf16 MFU | 1623953 tok/s step 5300/19560 | loss 3.501764 (-1.18z)| norm 0.2678 (-0.45z)| lr 5.16e-04 | 322.89 ms | 52.3% bf16 MFU | 1623941 tok/s step 5301/19560 | loss 3.515114 (-0.86z)| norm 0.2660 (-0.51z)| lr 5.16e-04 | 322.46 ms | 52.3% bf16 MFU | 1624040 tok/s step 5302/19560 | loss 3.527977 (-0.58z)| norm 0.2483 (-1.27z)| lr 5.16e-04 | 322.06 ms | 52.4% bf16 MFU | 1624235 tok/s step 5303/19560 | loss 3.557565 (+0.09z)| norm 0.2855 (+0.34z)| lr 5.16e-04 | 323.43 ms | 52.2% bf16 MFU | 1624074 tok/s step 5304/19560 | loss 3.614842 (+1.36z)| norm 0.3040 (+1.13z)| lr 5.16e-04 | 322.29 ms | 52.4% bf16 MFU | 1624209 tok/s step 5305/19560 | loss 3.557415 (+0.07z)| norm 0.3460 (+2.83z)| lr 5.16e-04 | 322.25 ms | 52.4% bf16 MFU | 1624347 tok/s step 5306/19560 | loss 3.487227 (-1.51z)| norm 0.2899 (+0.49z)| lr 5.16e-04 | 322.10 ms | 52.4% bf16 MFU | 1624516 tok/s step 5307/19560 | loss 3.631088 (+1.69z)| norm 0.2663 (-0.49z)| lr 5.16e-04 | 322.78 ms | 52.3% bf16 MFU | 1624504 tok/s step 5308/19560 | loss 3.524769 (-0.67z)| norm 0.2808 (+0.12z)| lr 5.16e-04 | 322.87 ms | 52.3% bf16 MFU | 1624472 tok/s step 5309/19560 | loss 3.556198 (+0.03z)| norm 0.2927 (+0.61z)| lr 5.16e-04 | 322.48 ms | 52.3% bf16 MFU | 1624539 tok/s step 5310/19560 | loss 3.505163 (-1.10z)| norm 0.2813 (+0.14z)| lr 5.16e-04 | 322.63 ms | 52.3% bf16 MFU | 1624564 tok/s step 5311/19560 | loss 3.633034 
(+1.70z)| norm 0.2835 (+0.22z)| lr 5.16e-04 | 322.73 ms | 52.3% bf16 MFU | 1624564 tok/s step 5312/19560 | loss 3.575373 (+0.44z)| norm 0.2718 (-0.26z)| lr 5.16e-04 | 322.48 ms | 52.3% bf16 MFU | 1624626 tok/s step 5313/19560 | loss 3.564428 (+0.20z)| norm 0.2746 (-0.15z)| lr 5.16e-04 | 322.84 ms | 52.3% bf16 MFU | 1624594 tok/s step 5314/19560 | loss 3.546864 (-0.19z)| norm 0.2865 (+0.33z)| lr 5.16e-04 | 322.71 ms | 52.3% bf16 MFU | 1624597 tok/s step 5315/19560 | loss 3.520399 (-0.76z)| norm 0.2581 (-0.86z)| lr 5.16e-04 | 322.65 ms | 52.3% bf16 MFU | 1624615 tok/s step 5316/19560 | loss 3.544314 (-0.23z)| norm 0.2545 (-1.00z)| lr 5.16e-04 | 323.03 ms | 52.2% bf16 MFU | 1624537 tok/s step 5317/19560 | loss 3.539090 (-0.35z)| norm 0.2802 (+0.07z)| lr 5.16e-04 | 322.77 ms | 52.3% bf16 MFU | 1624526 tok/s step 5318/19560 | loss 3.492039 (-1.37z)| norm 0.2636 (-0.63z)| lr 5.16e-04 | 322.72 ms | 52.3% bf16 MFU | 1624528 tok/s step 5319/19560 | loss 3.595263 (+0.88z)| norm 0.2699 (-0.36z)| lr 5.16e-04 | 323.31 ms | 52.2% bf16 MFU | 1624384 tok/s step 5320/19560 | loss 3.540812 (-0.30z)| norm 0.2886 (+0.42z)| lr 5.15e-04 | 322.85 ms | 52.3% bf16 MFU | 1624361 tok/s step 5321/19560 | loss 3.494916 (-1.29z)| norm 0.2701 (-0.35z)| lr 5.15e-04 | 322.39 ms | 52.4% bf16 MFU | 1624456 tok/s step 5322/19560 | loss 3.490009 (-1.37z)| norm 0.2942 (+0.66z)| lr 5.15e-04 | 322.83 ms | 52.3% bf16 MFU | 1624436 tok/s step 5323/19560 | loss 3.578983 (+0.55z)| norm 0.2860 (+0.31z)| lr 5.15e-04 | 322.93 ms | 52.3% bf16 MFU | 1624392 tok/s step 5324/19560 | loss 3.545892 (-0.16z)| norm 0.3041 (+1.05z)| lr 5.15e-04 | 322.59 ms | 52.3% bf16 MFU | 1624434 tok/s step 5325/19560 | loss 3.565512 (+0.25z)| norm 0.3348 (+2.29z)| lr 5.15e-04 | 323.01 ms | 52.2% bf16 MFU | 1624369 tok/s step 5326/19560 | loss 3.599744 (+0.98z)| norm 0.3249 (+1.84z)| lr 5.15e-04 | 322.82 ms | 52.3% bf16 MFU | 1624356 tok/s step 5327/19560 | loss 3.507960 (-0.99z)| norm 0.2735 (-0.26z)| lr 5.15e-04 | 322.71 ms | 52.3% 
bf16 MFU | 1624370 tok/s step 5328/19560 | loss 3.583802 (+0.63z)| norm 0.2692 (-0.45z)| lr 5.15e-04 | 322.56 ms | 52.3% bf16 MFU | 1624420 tok/s step 5329/19560 | loss 3.610614 (+1.18z)| norm 0.2962 (+0.66z)| lr 5.15e-04 | 323.10 ms | 52.2% bf16 MFU | 1624333 tok/s step 5330/19560 | loss 3.604243 (+1.04z)| norm 0.2790 (-0.05z)| lr 5.15e-04 | 322.54 ms | 52.3% bf16 MFU | 1624391 tok/s step 5331/19560 | loss 3.558762 (+0.10z)| norm 0.2708 (-0.38z)| lr 5.15e-04 | 323.05 ms | 52.2% bf16 MFU | 1624319 tok/s step 5332/19560 | loss 3.560011 (+0.12z)| norm 0.2711 (-0.37z)| lr 5.15e-04 | 322.63 ms | 52.3% bf16 MFU | 1624356 tok/s step 5333/19560 | loss 3.550981 (-0.08z)| norm 0.2681 (-0.50z)| lr 5.15e-04 | 322.37 ms | 52.4% bf16 MFU | 1624455 tok/s step 5334/19560 | loss 3.452497 (-2.21z)| norm 0.2674 (-0.54z)| lr 5.15e-04 | 323.45 ms | 52.2% bf16 MFU | 1624278 tok/s step 5335/19560 | loss 3.623730 (+1.50z)| norm 0.2504 (-1.36z)| lr 5.15e-04 | 322.19 ms | 52.4% bf16 MFU | 1624428 tok/s step 5336/19560 | loss 3.588418 (+0.73z)| norm 0.2859 (+0.39z)| lr 5.15e-04 | 322.66 ms | 52.3% bf16 MFU | 1624450 tok/s step 5337/19560 | loss 3.569559 (+0.31z)| norm 0.2683 (-0.49z)| lr 5.15e-04 | 322.18 ms | 52.4% bf16 MFU | 1624593 tok/s step 5338/19560 | loss 3.601763 (+1.00z)| norm 0.2971 (+0.92z)| lr 5.15e-04 | 323.25 ms | 52.2% bf16 MFU | 1624460 tok/s step 5339/19560 | loss 3.527221 (-0.64z)| norm 0.2794 (+0.05z)| lr 5.15e-04 | 322.85 ms | 52.3% bf16 MFU | 1624435 tok/s step 5340/19560 | loss 3.595975 (+0.87z)| norm 0.2727 (-0.29z)| lr 5.15e-04 | 322.84 ms | 52.3% bf16 MFU | 1624414 tok/s step 5341/19560 | loss 3.562923 (+0.13z)| norm 0.2955 (+0.83z)| lr 5.15e-04 | 322.72 ms | 52.3% bf16 MFU | 1624422 tok/s step 5342/19560 | loss 3.557790 (+0.01z)| norm 0.2865 (+0.39z)| lr 5.15e-04 | 322.77 ms | 52.3% bf16 MFU | 1624418 tok/s step 5343/19560 | loss 3.496271 (-1.33z)| norm 0.2553 (-1.15z)| lr 5.15e-04 | 322.27 ms | 52.4% bf16 MFU | 1624540 tok/s step 5344/19560 | loss 3.553119 
(-0.07z)| norm 0.2562 (-1.09z)| lr 5.15e-04 | 322.92 ms | 52.3% bf16 MFU | 1624493 tok/s step 5345/19560 | loss 3.581805 (+0.55z)| norm 0.2973 (+0.91z)| lr 5.15e-04 | 323.14 ms | 52.2% bf16 MFU | 1624393 tok/s step 5346/19560 | loss 3.550881 (-0.14z)| norm 0.2538 (-1.21z)| lr 5.15e-04 | 322.76 ms | 52.3% bf16 MFU | 1624393 tok/s step 5347/19560 | loss 3.608925 (+1.13z)| norm 0.2450 (-1.62z)| lr 5.15e-04 | 322.68 ms | 52.3% bf16 MFU | 1624412 tok/s step 5348/19560 | loss 3.519993 (-0.83z)| norm 0.2727 (-0.27z)| lr 5.15e-04 | 323.09 ms | 52.2% bf16 MFU | 1624327 tok/s step 5349/19560 | loss 3.572505 (+0.35z)| norm 0.2537 (-1.18z)| lr 5.14e-04 | 322.22 ms | 52.4% bf16 MFU | 1624466 tok/s step 5350/19560 | loss 3.554004 (-0.07z)| norm 0.2426 (-1.68z)| lr 5.14e-04 | 322.51 ms | 52.3% bf16 MFU | 1624525 tok/s step 5351/19560 | loss 3.562165 (+0.11z)| norm 0.2796 (+0.09z)| lr 5.14e-04 | 322.77 ms | 52.3% bf16 MFU | 1624516 tok/s step 5352/19560 | loss 3.557684 (+0.01z)| norm 0.2517 (-1.22z)| lr 5.14e-04 | 322.53 ms | 52.3% bf16 MFU | 1624569 tok/s step 5353/19560 | loss 3.563207 (+0.13z)| norm 0.2479 (-1.39z)| lr 5.14e-04 | 322.40 ms | 52.3% bf16 MFU | 1624651 tok/s step 5354/19560 | loss 3.543724 (-0.31z)| norm 0.2622 (-0.71z)| lr 5.14e-04 | 323.36 ms | 52.2% bf16 MFU | 1624487 tok/s step 5355/19560 | loss 3.513421 (-0.98z)| norm 0.2644 (-0.59z)| lr 5.14e-04 | 322.98 ms | 52.3% bf16 MFU | 1624427 tok/s step 5356/19560 | loss 3.535829 (-0.47z)| norm 0.2768 (-0.00z)| lr 5.14e-04 | 322.66 ms | 52.3% bf16 MFU | 1624451 tok/s step 5357/19560 | loss 3.594517 (+0.85z)| norm 0.2661 (-0.50z)| lr 5.14e-04 | 322.71 ms | 52.3% bf16 MFU | 1624461 tok/s step 5358/19560 | loss 3.568552 (+0.26z)| norm 0.2690 (-0.36z)| lr 5.14e-04 | 322.29 ms | 52.4% bf16 MFU | 1624575 tok/s step 5359/19560 | loss 3.505526 (-1.17z)| norm 0.2825 (+0.29z)| lr 5.14e-04 | 323.02 ms | 52.2% bf16 MFU | 1624501 tok/s step 5360/19560 | loss 3.562422 (+0.11z)| norm 0.2652 (-0.53z)| lr 5.14e-04 | 322.40 ms | 52.3% 
bf16 MFU | 1624586 tok/s step 5361/19560 | loss 3.589415 (+0.72z)| norm 0.2641 (-0.58z)| lr 5.14e-04 | 322.80 ms | 52.3% bf16 MFU | 1624565 tok/s step 5362/19560 | loss 3.537037 (-0.49z)| norm 0.2761 (-0.01z)| lr 5.14e-04 | 322.69 ms | 52.3% bf16 MFU | 1624574 tok/s step 5363/19560 | loss 3.733758 (+3.77z)| norm 0.2835 (+0.33z)| lr 5.14e-04 | 322.98 ms | 52.3% bf16 MFU | 1624509 tok/s step 5364/19560 | loss 3.634745 (+1.59z)| norm 0.3058 (+1.37z)| lr 5.14e-04 | 322.77 ms | 52.3% bf16 MFU | 1624500 tok/s step 5365/19560 | loss 3.554918 (-0.12z)| norm 0.3011 (+1.13z)| lr 5.14e-04 | 322.70 ms | 52.3% bf16 MFU | 1624510 tok/s step 5366/19560 | loss 3.557214 (-0.07z)| norm 0.3186 (+1.92z)| lr 5.14e-04 | 322.40 ms | 52.3% bf16 MFU | 1624593 tok/s step 5367/19560 | loss 3.515231 (-0.96z)| norm 0.3083 (+1.41z)| lr 5.14e-04 | 322.76 ms | 52.3% bf16 MFU | 1624582 tok/s step 5368/19560 | loss 3.572930 (+0.28z)| norm 0.3041 (+1.20z)| lr 5.14e-04 | 322.75 ms | 52.3% bf16 MFU | 1624576 tok/s step 5369/19560 | loss 3.535794 (-0.51z)| norm 0.2813 (+0.13z)| lr 5.14e-04 | 322.97 ms | 52.3% bf16 MFU | 1624513 tok/s step 5370/19560 | loss 3.547271 (-0.25z)| norm 0.2828 (+0.19z)| lr 5.14e-04 | 323.01 ms | 52.2% bf16 MFU | 1624444 tok/s step 5371/19560 | loss 3.553489 (-0.13z)| norm 0.2873 (+0.40z)| lr 5.14e-04 | 322.57 ms | 52.3% bf16 MFU | 1624488 tok/s step 5372/19560 | loss 3.614343 (+1.19z)| norm 0.2755 (-0.16z)| lr 5.14e-04 | 322.83 ms | 52.3% bf16 MFU | 1624466 tok/s step 5373/19560 | loss 3.601526 (+0.93z)| norm 0.2710 (-0.38z)| lr 5.14e-04 | 322.43 ms | 52.3% bf16 MFU | 1624545 tok/s step 5374/19560 | loss 3.526575 (-0.70z)| norm 0.2906 (+0.54z)| lr 5.14e-04 | 322.41 ms | 52.3% bf16 MFU | 1624626 tok/s step 5375/19560 | loss 3.480446 (-1.70z)| norm 0.2630 (-0.77z)| lr 5.14e-04 | 323.00 ms | 52.3% bf16 MFU | 1624554 tok/s step 5376/19560 | loss 3.522349 (-0.78z)| norm 0.3115 (+1.50z)| lr 5.14e-04 | 322.29 ms | 52.4% bf16 MFU | 1624663 tok/s step 5377/19560 | loss 3.572608 
(+0.31z)| norm 0.2982 (+0.86z)| lr 5.14e-04 | 322.89 ms | 52.3% bf16 MFU | 1624617 tok/s step 5378/19560 | loss 3.611979 (+1.15z)| norm 0.2845 (+0.22z)| lr 5.13e-04 | 323.42 ms | 52.2% bf16 MFU | 1624439 tok/s step 5379/19560 | loss 3.510165 (-1.07z)| norm 0.2918 (+0.56z)| lr 5.13e-04 | 322.17 ms | 52.4% bf16 MFU | 1624585 tok/s step 5380/19560 | loss 3.575166 (+0.34z)| norm 0.2829 (+0.14z)| lr 5.13e-04 | 322.50 ms | 52.3% bf16 MFU | 1624640 tok/s step 5381/19560 | loss 3.534879 (-0.53z)| norm 0.2615 (-0.87z)| lr 5.13e-04 | 323.30 ms | 52.2% bf16 MFU | 1624491 tok/s step 5382/19560 | loss 3.553035 (-0.14z)| norm 0.3047 (+1.15z)| lr 5.13e-04 | 322.73 ms | 52.3% bf16 MFU | 1624492 tok/s step 5383/19560 | loss 3.539091 (-0.44z)| norm 0.2797 (-0.02z)| lr 5.13e-04 | 322.58 ms | 52.3% bf16 MFU | 1624532 tok/s step 5384/19560 | loss 3.563011 (+0.09z)| norm 0.2697 (-0.48z)| lr 5.13e-04 | 322.70 ms | 52.3% bf16 MFU | 1624539 tok/s step 5385/19560 | loss 3.517781 (-0.90z)| norm 0.2732 (-0.32z)| lr 5.13e-04 | 322.27 ms | 52.4% bf16 MFU | 1624656 tok/s step 5386/19560 | loss 3.551480 (-0.15z)| norm 0.2870 (+0.33z)| lr 5.13e-04 | 322.58 ms | 52.3% bf16 MFU | 1624689 tok/s step 5387/19560 | loss 3.520817 (-0.82z)| norm 0.2836 (+0.16z)| lr 5.13e-04 | 322.74 ms | 52.3% bf16 MFU | 1624679 tok/s step 5388/19560 | loss 3.534656 (-0.51z)| norm 0.3048 (+1.14z)| lr 5.13e-04 | 323.02 ms | 52.2% bf16 MFU | 1624599 tok/s step 5389/19560 | loss 3.564748 (+0.15z)| norm 0.2820 (+0.07z)| lr 5.13e-04 | 322.81 ms | 52.3% bf16 MFU | 1624575 tok/s step 5390/19560 | loss 3.597627 (+0.86z)| norm 0.2560 (-1.14z)| lr 5.13e-04 | 322.67 ms | 52.3% bf16 MFU | 1624590 tok/s step 5391/19560 | loss 3.558505 (+0.02z)| norm 0.2898 (+0.44z)| lr 5.13e-04 | 322.29 ms | 52.4% bf16 MFU | 1624698 tok/s step 5392/19560 | loss 3.539473 (-0.41z)| norm 0.2933 (+0.60z)| lr 5.13e-04 | 322.22 ms | 52.4% bf16 MFU | 1624818 tok/s step 5393/19560 | loss 3.571787 (+0.32z)| norm 0.2573 (-1.07z)| lr 5.13e-04 | 323.03 ms | 52.2% 
bf16 MFU | 1624729 tok/s step 5394/19560 | loss 3.532529 (-0.56z)| norm 0.2888 (+0.39z)| lr 5.13e-04 | 322.51 ms | 52.3% bf16 MFU | 1624774 tok/s step 5395/19560 | loss 3.540651 (-0.38z)| norm 0.2764 (-0.19z)| lr 5.13e-04 | 323.45 ms | 52.2% bf16 MFU | 1624582 tok/s step 5396/19560 | loss 3.512108 (-1.02z)| norm 0.2616 (-0.89z)| lr 5.13e-04 | 322.23 ms | 52.4% bf16 MFU | 1624706 tok/s step 5397/19560 | loss 3.649284 (+2.02z)| norm 0.2996 (+0.88z)| lr 5.13e-04 | 323.13 ms | 52.2% bf16 MFU | 1624597 tok/s step 5398/19560 | loss 3.578979 (+0.46z)| norm 0.2798 (-0.04z)| lr 5.13e-04 | 322.90 ms | 52.3% bf16 MFU | 1624552 tok/s step 5399/19560 | loss 3.498211 (-1.32z)| norm 0.2754 (-0.25z)| lr 5.13e-04 | 322.62 ms | 52.3% bf16 MFU | 1624580 tok/s step 5400/19560 | loss 3.500764 (-1.25z)| norm 0.2557 (-1.18z)| lr 5.13e-04 | 323.00 ms | 52.3% bf16 MFU | 1624509 tok/s step 5401/19560 | loss 3.478700 (-1.70z)| norm 0.2512 (-1.37z)| lr 5.13e-04 | 322.60 ms | 52.3% bf16 MFU | 1624544 tok/s step 5402/19560 | loss 3.472888 (-1.79z)| norm 0.2610 (-0.93z)| lr 5.13e-04 | 322.55 ms | 52.3% bf16 MFU | 1624590 tok/s step 5403/19560 | loss 3.530012 (-0.56z)| norm 0.2596 (-1.00z)| lr 5.13e-04 | 322.93 ms | 52.3% bf16 MFU | 1624538 tok/s step 5404/19560 | loss 3.597949 (+0.89z)| norm 0.2815 (+0.02z)| lr 5.13e-04 | 322.42 ms | 52.3% bf16 MFU | 1624616 tok/s step 5405/19560 | loss 3.514081 (-0.90z)| norm 0.2670 (-0.66z)| lr 5.13e-04 | 322.85 ms | 52.3% bf16 MFU | 1624582 tok/s step 5406/19560 | loss 3.540176 (-0.35z)| norm 0.2672 (-0.66z)| lr 5.12e-04 | 322.90 ms | 52.3% bf16 MFU | 1624536 tok/s step 5407/19560 | loss 3.602530 (+0.97z)| norm 0.2867 (+0.27z)| lr 5.12e-04 | 322.74 ms | 52.3% bf16 MFU | 1624534 tok/s step 5408/19560 | loss 3.521799 (-0.75z)| norm 0.2749 (-0.28z)| lr 5.12e-04 | 322.29 ms | 52.4% bf16 MFU | 1624645 tok/s step 5409/19560 | loss 3.503317 (-1.15z)| norm 0.2625 (-0.85z)| lr 5.12e-04 | 323.36 ms | 52.2% bf16 MFU | 1624480 tok/s step 5410/19560 | loss 3.498151 
(-1.26z)| norm 0.2623 (-0.85z)| lr 5.12e-04 | 322.62 ms | 52.3% bf16 MFU | 1624511 tok/s step 5411/19560 | loss 3.610260 (+1.14z)| norm 0.2605 (-0.92z)| lr 5.12e-04 | 322.96 ms | 52.3% bf16 MFU | 1624454 tok/s step 5412/19560 | loss 3.525589 (-0.66z)| norm 0.2522 (-1.31z)| lr 5.12e-04 | 322.60 ms | 52.3% bf16 MFU | 1624491 tok/s step 5413/19560 | loss 3.516598 (-0.85z)| norm 0.2954 (+0.78z)| lr 5.12e-04 | 322.14 ms | 52.4% bf16 MFU | 1624642 tok/s step 5414/19560 | loss 3.551395 (-0.10z)| norm 0.3158 (+1.74z)| lr 5.12e-04 | 323.52 ms | 52.2% bf16 MFU | 1624439 tok/s step 5415/19560 | loss 3.564439 (+0.17z)| norm 0.2663 (-0.62z)| lr 5.12e-04 | 322.68 ms | 52.3% bf16 MFU | 1624456 tok/s step 5416/19560 | loss 3.551075 (-0.10z)| norm 0.2627 (-0.79z)| lr 5.12e-04 | 322.63 ms | 52.3% bf16 MFU | 1624486 tok/s step 5417/19560 | loss 3.562599 (+0.16z)| norm 0.3069 (+1.44z)| lr 5.12e-04 | 322.52 ms | 52.3% bf16 MFU | 1624542 tok/s step 5418/19560 | loss 3.622739 (+1.51z)| norm 0.2798 (+0.10z)| lr 5.12e-04 | 322.11 ms | 52.4% bf16 MFU | 1624698 tok/s step 5419/19560 | loss 3.545404 (-0.21z)| norm 0.2518 (-1.35z)| lr 5.12e-04 | 322.56 ms | 52.3% bf16 MFU | 1624734 tok/s step 5420/19560 | loss 3.599513 (+1.10z)| norm 0.2517 (-1.34z)| lr 5.12e-04 | 322.49 ms | 52.3% bf16 MFU | 1624785 tok/s step 5421/19560 | loss 3.512062 (-0.99z)| norm 0.2653 (-0.62z)| lr 5.12e-04 | 322.65 ms | 52.3% bf16 MFU | 1624793 tok/s step 5422/19560 | loss 3.567056 (+0.32z)| norm 0.2547 (-1.17z)| lr 5.12e-04 | 322.99 ms | 52.3% bf16 MFU | 1624715 tok/s step 5423/19560 | loss 3.555243 (+0.04z)| norm 0.2786 (+0.09z)| lr 5.12e-04 | 322.56 ms | 52.3% bf16 MFU | 1624749 tok/s step 5424/19560 | loss 3.548721 (-0.12z)| norm 0.2780 (+0.05z)| lr 5.12e-04 | 322.14 ms | 52.4% bf16 MFU | 1624887 tok/s step 5425/19560 | loss 3.551072 (-0.06z)| norm 0.2937 (+0.86z)| lr 5.12e-04 | 322.76 ms | 52.3% bf16 MFU | 1624860 tok/s step 5426/19560 | loss 3.504177 (-1.18z)| norm 0.2741 (-0.18z)| lr 5.12e-04 | 322.66 ms | 52.3% 
bf16 MFU | 1624862 tok/s step 5427/19560 | loss 3.547531 (-0.14z)| norm 0.3202 (+2.21z)| lr 5.12e-04 | 322.83 ms | 52.3% bf16 MFU | 1624821 tok/s step 5428/19560 | loss 3.582510 (+0.69z)| norm 0.3165 (+1.96z)| lr 5.12e-04 | 322.71 ms | 52.3% bf16 MFU | 1624812 tok/s step 5429/19560 | loss 3.523992 (-0.72z)| norm 0.2926 (+0.72z)| lr 5.12e-04 | 322.35 ms | 52.4% bf16 MFU | 1624895 tok/s step 5430/19560 | loss 3.544384 (-0.23z)| norm 0.2898 (+0.56z)| lr 5.12e-04 | 323.43 ms | 52.2% bf16 MFU | 1624701 tok/s step 5431/19560 | loss 3.539718 (-0.34z)| norm 0.3218 (+2.17z)| lr 5.12e-04 | 322.73 ms | 52.3% bf16 MFU | 1624694 tok/s step 5432/19560 | loss 3.527532 (-0.63z)| norm 0.2948 (+0.80z)| lr 5.12e-04 | 322.70 ms | 52.3% bf16 MFU | 1624694 tok/s step 5433/19560 | loss 3.561379 (+0.20z)| norm 0.3062 (+1.46z)| lr 5.12e-04 | 322.49 ms | 52.3% bf16 MFU | 1624748 tok/s step 5434/19560 | loss 3.682755 (+3.04z)| norm 0.3063 (+1.45z)| lr 5.11e-04 | 322.82 ms | 52.3% bf16 MFU | 1624714 tok/s step 5435/19560 | loss 3.525420 (-0.68z)| norm 0.2957 (+0.87z)| lr 5.11e-04 | 323.25 ms | 52.2% bf16 MFU | 1624576 tok/s step 5436/19560 | loss 3.513022 (-0.98z)| norm 0.2910 (+0.62z)| lr 5.11e-04 | 322.71 ms | 52.3% bf16 MFU | 1624581 tok/s step 5437/19560 | loss 3.625833 (+1.69z)| norm 0.3069 (+1.44z)| lr 5.11e-04 | 323.12 ms | 52.2% bf16 MFU | 1624481 tok/s step 5438/19560 | loss 3.572232 (+0.41z)| norm 0.2917 (+0.64z)| lr 5.11e-04 | 323.57 ms | 52.2% bf16 MFU | 1624273 tok/s step 5439/19560 | loss 3.583199 (+0.69z)| norm 0.2578 (-1.12z)| lr 5.11e-04 | 323.23 ms | 52.2% bf16 MFU | 1624161 tok/s step 5440/19560 | loss 3.575111 (+0.50z)| norm 0.2982 (+0.97z)| lr 5.11e-04 | 322.67 ms | 52.3% bf16 MFU | 1624196 tok/s step 5441/19560 | loss 3.506401 (-1.14z)| norm 0.2867 (+0.37z)| lr 5.11e-04 | 322.05 ms | 52.4% bf16 MFU | 1624384 tok/s step 5442/19560 | loss 3.523521 (-0.73z)| norm 0.2742 (-0.27z)| lr 5.11e-04 | 322.84 ms | 52.3% bf16 MFU | 1624365 tok/s step 5443/19560 | loss 3.528146 
(-0.62z)| norm 0.2459 (-1.73z)| lr 5.11e-04 | 322.92 ms | 52.3% bf16 MFU | 1624326 tok/s step 5444/19560 | loss 3.509380 (-1.06z)| norm 0.2669 (-0.66z)| lr 5.11e-04 | 322.70 ms | 52.3% bf16 MFU | 1624344 tok/s step 5445/19560 | loss 3.506470 (-1.12z)| norm 0.2630 (-0.85z)| lr 5.11e-04 | 322.60 ms | 52.3% bf16 MFU | 1624387 tok/s step 5446/19560 | loss 3.594720 (+0.97z)| norm 0.2741 (-0.28z)| lr 5.11e-04 | 322.79 ms | 52.3% bf16 MFU | 1624378 tok/s step 5447/19560 | loss 3.529777 (-0.57z)| norm 0.2442 (-1.80z)| lr 5.11e-04 | 323.07 ms | 52.2% bf16 MFU | 1624300 tok/s step 5448/19560 | loss 3.559403 (+0.13z)| norm 0.2809 (+0.09z)| lr 5.11e-04 | 322.88 ms | 52.3% bf16 MFU | 1624274 tok/s step 5449/19560 | loss 3.526876 (-0.66z)| norm 0.2922 (+0.66z)| lr 5.11e-04 | 322.91 ms | 52.3% bf16 MFU | 1624242 tok/s step 5450/19560 | loss 3.491418 (-1.51z)| norm 0.2625 (-0.85z)| lr 5.11e-04 | 322.39 ms | 52.3% bf16 MFU | 1624342 tok/s step 5451/19560 | loss 3.573553 (+0.47z)| norm 0.2656 (-0.68z)| lr 5.11e-04 | 322.91 ms | 52.3% bf16 MFU | 1624306 tok/s step 5452/19560 | loss 3.528135 (-0.62z)| norm 0.2683 (-0.54z)| lr 5.11e-04 | 323.06 ms | 52.2% bf16 MFU | 1624236 tok/s step 5453/19560 | loss 3.628042 (+1.75z)| norm 0.2618 (-0.86z)| lr 5.11e-04 | 324.21 ms | 52.1% bf16 MFU | 1623880 tok/s step 5454/19560 | loss 3.546920 (-0.17z)| norm 0.2642 (-0.73z)| lr 5.11e-04 | 322.70 ms | 52.3% bf16 MFU | 1623922 tok/s step 5455/19560 | loss 3.601409 (+1.12z)| norm 0.2633 (-0.77z)| lr 5.11e-04 | 322.59 ms | 52.3% bf16 MFU | 1623989 tok/s step 5456/19560 | loss 3.532445 (-0.52z)| norm 0.2627 (-0.80z)| lr 5.11e-04 | 322.82 ms | 52.3% bf16 MFU | 1623995 tok/s step 5457/19560 | loss 3.639817 (+2.02z)| norm 0.2717 (-0.30z)| lr 5.11e-04 | 322.53 ms | 52.3% bf16 MFU | 1624072 tok/s step 5458/19560 | loss 3.543013 (-0.26z)| norm 0.2825 (+0.28z)| lr 5.11e-04 | 323.11 ms | 52.2% bf16 MFU | 1623999 tok/s step 5459/19560 | loss 3.547168 (-0.16z)| norm 0.2518 (-1.37z)| lr 5.11e-04 | 322.67 ms | 52.3% 
bf16 MFU | 1624041 tok/s step 5460/19560 | loss 3.556522 (+0.06z)| norm 0.2951 (+0.95z)| lr 5.11e-04 | 322.95 ms | 52.3% bf16 MFU | 1624010 tok/s step 5461/19560 | loss 3.544346 (-0.23z)| norm 0.2737 (-0.20z)| lr 5.11e-04 | 322.96 ms | 52.3% bf16 MFU | 1623978 tok/s step 5462/19560 | loss 3.509470 (-1.09z)| norm 0.2541 (-1.24z)| lr 5.11e-04 | 323.56 ms | 52.2% bf16 MFU | 1623798 tok/s step 5463/19560 | loss 3.487976 (-1.59z)| norm 0.2792 (+0.09z)| lr 5.10e-04 | 322.70 ms | 52.3% bf16 MFU | 1623842 tok/s step 5464/19560 | loss 3.526941 (-0.63z)| norm 0.2673 (-0.54z)| lr 5.10e-04 | 322.19 ms | 52.4% bf16 MFU | 1624012 tok/s step 5465/19560 | loss 3.538555 (-0.34z)| norm 0.2700 (-0.40z)| lr 5.10e-04 | 322.89 ms | 52.3% bf16 MFU | 1623999 tok/s step 5466/19560 | loss 3.678081 (+2.95z)| norm 0.2696 (-0.41z)| lr 5.10e-04 | 322.90 ms | 52.3% bf16 MFU | 1623984 tok/s step 5467/19560 | loss 3.528852 (-0.57z)| norm 0.2782 (+0.05z)| lr 5.10e-04 | 322.85 ms | 52.3% bf16 MFU | 1623982 tok/s step 5468/19560 | loss 3.637142 (+1.95z)| norm 0.2817 (+0.24z)| lr 5.10e-04 | 322.67 ms | 52.3% bf16 MFU | 1624024 tok/s step 5469/19560 | loss 3.566304 (+0.30z)| norm 0.2822 (+0.27z)| lr 5.10e-04 | 322.87 ms | 52.3% bf16 MFU | 1624014 tok/s step 5470/19560 | loss 3.460992 (-2.11z)| norm 0.2576 (-1.05z)| lr 5.10e-04 | 322.77 ms | 52.3% bf16 MFU | 1624031 tok/s step 5471/19560 | loss 3.495056 (-1.32z)| norm 0.2989 (+1.17z)| lr 5.10e-04 | 322.81 ms | 52.3% bf16 MFU | 1624035 tok/s step 5472/19560 | loss 3.550556 (-0.05z)| norm 0.3008 (+1.26z)| lr 5.10e-04 | 322.55 ms | 52.3% bf16 MFU | 1624106 tok/s step 5473/19560 | loss 3.506293 (-1.05z)| norm 0.3151 (+2.00z)| lr 5.10e-04 | 322.71 ms | 52.3% bf16 MFU | 1624132 tok/s step 5474/19560 | loss 3.610292 (+1.31z)| norm 0.2640 (-0.74z)| lr 5.10e-04 | 322.91 ms | 52.3% bf16 MFU | 1624106 tok/s step 5475/19560 | loss 3.483529 (-1.54z)| norm 0.3172 (+2.07z)| lr 5.10e-04 | 322.35 ms | 52.4% bf16 MFU | 1624222 tok/s step 5476/19560 | loss 3.564998 
(+0.30z)| norm 0.2915 (+0.69z)| lr 5.10e-04 | 322.79 ms | 52.3% bf16 MFU | 1624222 tok/s step 5477/19560 | loss 3.588181 (+0.82z)| norm 0.2831 (+0.24z)| lr 5.10e-04 | 322.97 ms | 52.3% bf16 MFU | 1624179 tok/s step 5478/19560 | loss 3.561527 (+0.21z)| norm 0.3087 (+1.59z)| lr 5.10e-04 | 322.80 ms | 52.3% bf16 MFU | 1624178 tok/s step 5479/19560 | loss 3.543291 (-0.20z)| norm 0.2935 (+0.76z)| lr 5.10e-04 | 323.16 ms | 52.2% bf16 MFU | 1624089 tok/s step 5480/19560 | loss 3.499234 (-1.18z)| norm 0.3015 (+1.18z)| lr 5.10e-04 | 322.80 ms | 52.3% bf16 MFU | 1624094 tok/s step 5481/19560 | loss 3.522523 (-0.65z)| norm 0.3026 (+1.22z)| lr 5.10e-04 | 322.62 ms | 52.3% bf16 MFU | 1624144 tok/s step 5482/19560 | loss 3.604449 (+1.18z)| norm 0.3410 (+3.15z)| lr 5.10e-04 | 323.36 ms | 52.2% bf16 MFU | 1624005 tok/s step 5483/19560 | loss 3.504721 (-1.05z)| norm 0.2927 (+0.61z)| lr 5.10e-04 | 322.92 ms | 52.3% bf16 MFU | 1623983 tok/s step 5484/19560 | loss 3.550860 (-0.02z)| norm 0.2657 (-0.80z)| lr 5.10e-04 | 322.82 ms | 52.3% bf16 MFU | 1623988 tok/s step 5485/19560 | loss 3.522622 (-0.64z)| norm 0.2677 (-0.70z)| lr 5.10e-04 | 323.00 ms | 52.3% bf16 MFU | 1623949 tok/s step 5486/19560 | loss 3.529866 (-0.47z)| norm 0.2724 (-0.45z)| lr 5.10e-04 | 322.76 ms | 52.3% bf16 MFU | 1623972 tok/s step 5487/19560 | loss 3.471596 (-1.76z)| norm 0.2720 (-0.47z)| lr 5.10e-04 | 322.69 ms | 52.3% bf16 MFU | 1624010 tok/s step 5488/19560 | loss 3.548548 (-0.04z)| norm 0.2624 (-0.97z)| lr 5.10e-04 | 322.93 ms | 52.3% bf16 MFU | 1623987 tok/s step 5489/19560 | loss 3.580507 (+0.67z)| norm 0.2807 (-0.02z)| lr 5.10e-04 | 322.64 ms | 52.3% bf16 MFU | 1624037 tok/s step 5490/19560 | loss 3.526921 (-0.52z)| norm 0.2987 (+0.92z)| lr 5.10e-04 | 323.16 ms | 52.2% bf16 MFU | 1623955 tok/s step 5491/19560 | loss 3.561368 (+0.29z)| norm 0.2613 (-1.03z)| lr 5.09e-04 | 322.36 ms | 52.4% bf16 MFU | 1624078 tok/s step 5492/19560 | loss 3.539690 (-0.21z)| norm 0.2830 (+0.11z)| lr 5.09e-04 | 322.29 ms | 52.4% 
bf16 MFU | 1624212 tok/s step 5493/19560 | loss 3.530181 (-0.43z)| norm 0.2625 (-0.95z)| lr 5.09e-04 | 321.90 ms | 52.4% bf16 MFU | 1624438 tok/s step 5494/19560 | loss 3.558466 (+0.25z)| norm 0.2803 (+0.00z)| lr 5.09e-04 | 322.58 ms | 52.3% bf16 MFU | 1624482 tok/s step 5495/19560 | loss 3.529996 (-0.44z)| norm 0.2694 (-0.57z)| lr 5.09e-04 | 322.29 ms | 52.4% bf16 MFU | 1624596 tok/s step 5496/19560 | loss 3.498839 (-1.18z)| norm 0.2896 (+0.53z)| lr 5.09e-04 | 322.20 ms | 52.4% bf16 MFU | 1624727 tok/s step 5497/19560 | loss 3.542816 (-0.12z)| norm 0.2903 (+0.56z)| lr 5.09e-04 | 322.38 ms | 52.4% bf16 MFU | 1624806 tok/s step 5498/19560 | loss 3.559383 (+0.28z)| norm 0.2789 (-0.05z)| lr 5.09e-04 | 322.73 ms | 52.3% bf16 MFU | 1624792 tok/s step 5499/19560 | loss 3.568217 (+0.49z)| norm 0.2961 (+0.87z)| lr 5.09e-04 | 322.84 ms | 52.3% bf16 MFU | 1624751 tok/s step 5500/19560 | loss 3.568776 (+0.52z)| norm 0.2859 (+0.32z)| lr 5.09e-04 | 322.81 ms | 52.3% bf16 MFU | 1624720 tok/s val loss 3.535431 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating HellaSwag: 70/79 HellaSwag: 2806/10042 = 0.279426 step 5501/19560 | loss 3.587350 (+0.98z)| norm 0.2744 (-0.31z)| lr 5.09e-04 | 322.00 ms | 52.4% bf16 MFU | 1624896 tok/s step 5502/19560 | loss 3.689809 (+3.30z)| norm 0.2624 (-0.94z)| lr 5.09e-04 | 322.31 ms | 52.4% bf16 MFU | 1624984 tok/s step 5503/19560 | loss 3.602938 (+1.25z)| norm 0.2970 (+0.91z)| lr 5.09e-04 | 323.18 ms | 52.2% bf16 MFU | 1624848 tok/s step 5504/19560 | loss 3.439763 (-2.51z)| norm 0.2701 (-0.52z)| lr 5.09e-04 | 323.06 ms | 52.2% bf16 MFU | 1624750 tok/s step 5505/19560 | loss 3.605810 (+1.29z)| norm 0.2783 (-0.07z)| lr 5.09e-04 | 322.90 ms | 52.3% bf16 MFU | 1624696 tok/s step 5506/19560 | loss 3.573493 (+0.56z)| norm 0.2730 (-0.36z)| lr 5.09e-04 | 322.65 ms | 52.3% bf16 MFU | 1624708 tok/s step 
5507/19560 | loss 3.551366 (+0.05z)| norm 0.2730 (-0.35z)| lr 5.09e-04 | 322.92 ms | 52.3% bf16 MFU | 1624653 tok/s step 5508/19560 | loss 3.595881 (+1.07z)| norm 0.2680 (-0.61z)| lr 5.09e-04 | 323.51 ms | 52.2% bf16 MFU | 1624451 tok/s step 5509/19560 | loss 3.542668 (-0.16z)| norm 0.2577 (-1.17z)| lr 5.09e-04 | 322.88 ms | 52.3% bf16 MFU | 1624416 tok/s step 5510/19560 | loss 3.535421 (-0.32z)| norm 0.2754 (-0.20z)| lr 5.09e-04 | 323.14 ms | 52.2% bf16 MFU | 1624320 tok/s step 5511/19560 | loss 3.556803 (+0.17z)| norm 0.2858 (+0.37z)| lr 5.09e-04 | 323.27 ms | 52.2% bf16 MFU | 1624194 tok/s step 5512/19560 | loss 3.570205 (+0.47z)| norm 0.2706 (-0.46z)| lr 5.09e-04 | 322.54 ms | 52.3% bf16 MFU | 1624258 tok/s step 5513/19560 | loss 3.603978 (+1.23z)| norm 0.2806 (+0.08z)| lr 5.09e-04 | 323.31 ms | 52.2% bf16 MFU | 1624126 tok/s step 5514/19560 | loss 3.558006 (+0.18z)| norm 0.2611 (-0.98z)| lr 5.09e-04 | 323.00 ms | 52.3% bf16 MFU | 1624080 tok/s step 5515/19560 | loss 3.534722 (-0.36z)| norm 0.2641 (-0.80z)| lr 5.09e-04 | 323.35 ms | 52.2% bf16 MFU | 1623946 tok/s step 5516/19560 | loss 3.555783 (+0.12z)| norm 0.2482 (-1.64z)| lr 5.09e-04 | 322.90 ms | 52.3% bf16 MFU | 1623933 tok/s step 5517/19560 | loss 3.572726 (+0.51z)| norm 0.2572 (-1.14z)| lr 5.09e-04 | 322.50 ms | 52.3% bf16 MFU | 1624021 tok/s step 5518/19560 | loss 3.566573 (+0.37z)| norm 0.2579 (-1.10z)| lr 5.08e-04 | 323.14 ms | 52.2% bf16 MFU | 1623944 tok/s step 5519/19560 | loss 3.532855 (-0.40z)| norm 0.2724 (-0.30z)| lr 5.08e-04 | 322.17 ms | 52.4% bf16 MFU | 1624116 tok/s step 5520/19560 | loss 3.588879 (+0.88z)| norm 0.2920 (+0.76z)| lr 5.08e-04 | 322.60 ms | 52.3% bf16 MFU | 1624170 tok/s step 5521/19560 | loss 3.561538 (+0.25z)| norm 0.2732 (-0.27z)| lr 5.08e-04 | 322.83 ms | 52.3% bf16 MFU | 1624162 tok/s step 5522/19560 | loss 3.566983 (+0.37z)| norm 0.2777 (-0.02z)| lr 5.08e-04 | 322.64 ms | 52.3% bf16 MFU | 1624204 tok/s step 5523/19560 | loss 3.562703 (+0.27z)| norm 0.2549 (-1.25z)| lr 
5.08e-04 | 323.34 ms | 52.2% bf16 MFU | 1624067 tok/s step 5524/19560 | loss 3.502635 (-1.10z)| norm 0.2738 (-0.23z)| lr 5.08e-04 | 322.07 ms | 52.4% bf16 MFU | 1624257 tok/s step 5525/19560 | loss 3.519829 (-0.70z)| norm 0.2805 (+0.15z)| lr 5.08e-04 | 322.62 ms | 52.3% bf16 MFU | 1624298 tok/s step 5526/19560 | loss 3.552129 (+0.06z)| norm 0.2683 (-0.51z)| lr 5.08e-04 | 322.67 ms | 52.3% bf16 MFU | 1624325 tok/s step 5527/19560 | loss 3.584273 (+0.80z)| norm 0.2412 (-1.95z)| lr 5.08e-04 | 322.29 ms | 52.4% bf16 MFU | 1624446 tok/s step 5528/19560 | loss 3.486130 (-1.50z)| norm 0.2727 (-0.26z)| lr 5.08e-04 | 323.48 ms | 52.2% bf16 MFU | 1624263 tok/s step 5529/19560 | loss 3.523917 (-0.63z)| norm 0.2926 (+0.80z)| lr 5.08e-04 | 322.83 ms | 52.3% bf16 MFU | 1624251 tok/s step 5530/19560 | loss 3.485593 (-1.54z)| norm 0.2851 (+0.38z)| lr 5.08e-04 | 322.65 ms | 52.3% bf16 MFU | 1624286 tok/s step 5531/19560 | loss 3.588271 (+0.88z)| norm 0.2723 (-0.32z)| lr 5.08e-04 | 323.27 ms | 52.2% bf16 MFU | 1624164 tok/s step 5532/19560 | loss 3.522126 (-0.67z)| norm 0.2774 (-0.04z)| lr 5.08e-04 | 322.37 ms | 52.4% bf16 MFU | 1624274 tok/s step 5533/19560 | loss 3.494080 (-1.33z)| norm 0.2653 (-0.70z)| lr 5.08e-04 | 322.73 ms | 52.3% bf16 MFU | 1624286 tok/s step 5534/19560 | loss 3.518134 (-0.76z)| norm 0.2763 (-0.11z)| lr 5.08e-04 | 322.47 ms | 52.3% bf16 MFU | 1624365 tok/s step 5535/19560 | loss 3.524988 (-0.58z)| norm 0.2776 (-0.03z)| lr 5.08e-04 | 323.00 ms | 52.3% bf16 MFU | 1624307 tok/s step 5536/19560 | loss 3.524267 (-0.60z)| norm 0.2519 (-1.42z)| lr 5.08e-04 | 322.32 ms | 52.4% bf16 MFU | 1624422 tok/s step 5537/19560 | loss 3.518286 (-0.75z)| norm 0.3046 (+1.42z)| lr 5.08e-04 | 322.45 ms | 52.3% bf16 MFU | 1624498 tok/s step 5538/19560 | loss 3.562814 (+0.30z)| norm 0.2855 (+0.38z)| lr 5.08e-04 | 323.39 ms | 52.2% bf16 MFU | 1624334 tok/s step 5539/19560 | loss 3.545013 (-0.11z)| norm 0.2569 (-1.17z)| lr 5.08e-04 | 322.88 ms | 52.3% bf16 MFU | 1624306 tok/s step 
5540/19560 | loss 3.539993 (-0.24z)| norm 0.2977 (+1.03z)| lr 5.08e-04 | 322.87 ms | 52.3% bf16 MFU | 1624283 tok/s step 5541/19560 | loss 3.580481 (+0.73z)| norm 0.2580 (-1.12z)| lr 5.08e-04 | 322.43 ms | 52.3% bf16 MFU | 1624372 tok/s step 5542/19560 | loss 3.622410 (+1.70z)| norm 0.2961 (+0.98z)| lr 5.08e-04 | 322.43 ms | 52.3% bf16 MFU | 1624456 tok/s step 5543/19560 | loss 3.597130 (+1.09z)| norm 0.2705 (-0.43z)| lr 5.08e-04 | 323.12 ms | 52.2% bf16 MFU | 1624361 tok/s step 5544/19560 | loss 3.545496 (-0.13z)| norm 0.2764 (-0.12z)| lr 5.08e-04 | 322.77 ms | 52.3% bf16 MFU | 1624360 tok/s step 5545/19560 | loss 3.473356 (-1.80z)| norm 0.2852 (+0.38z)| lr 5.08e-04 | 322.37 ms | 52.4% bf16 MFU | 1624459 tok/s step 5546/19560 | loss 3.554437 (+0.11z)| norm 0.2741 (-0.24z)| lr 5.07e-04 | 323.16 ms | 52.2% bf16 MFU | 1624356 tok/s step 5547/19560 | loss 3.522269 (-0.65z)| norm 0.2598 (-1.04z)| lr 5.07e-04 | 322.85 ms | 52.3% bf16 MFU | 1624336 tok/s step 5548/19560 | loss 3.552979 (+0.09z)| norm 0.2703 (-0.47z)| lr 5.07e-04 | 322.39 ms | 52.4% bf16 MFU | 1624432 tok/s step 5549/19560 | loss 3.502609 (-1.11z)| norm 0.3328 (+2.94z)| lr 5.07e-04 | 323.22 ms | 52.2% bf16 MFU | 1624315 tok/s step 5550/19560 | loss 3.510655 (-0.90z)| norm 0.3236 (+2.37z)| lr 5.07e-04 | 322.80 ms | 52.3% bf16 MFU | 1624309 tok/s step 5551/19560 | loss 3.536957 (-0.28z)| norm 0.2646 (-0.80z)| lr 5.07e-04 | 322.77 ms | 52.3% bf16 MFU | 1624310 tok/s step 5552/19560 | loss 3.543479 (-0.12z)| norm 0.2709 (-0.46z)| lr 5.07e-04 | 323.19 ms | 52.2% bf16 MFU | 1624207 tok/s step 5553/19560 | loss 3.548745 (+0.00z)| norm 0.2812 (+0.10z)| lr 5.07e-04 | 322.77 ms | 52.3% bf16 MFU | 1624213 tok/s step 5554/19560 | loss 3.545292 (-0.09z)| norm 0.2706 (-0.47z)| lr 5.07e-04 | 322.44 ms | 52.3% bf16 MFU | 1624304 tok/s step 5555/19560 | loss 3.518772 (-0.71z)| norm 0.2681 (-0.59z)| lr 5.07e-04 | 322.73 ms | 52.3% bf16 MFU | 1624315 tok/s step 5556/19560 | loss 3.677022 (+2.93z)| norm 0.2665 (-0.67z)| lr 
5.07e-04 | 322.47 ms | 52.3% bf16 MFU | 1624392 tok/s step 5557/19560 | loss 3.576635 (+0.61z)| norm 0.2634 (-0.82z)| lr 5.07e-04 | 322.79 ms | 52.3% bf16 MFU | 1624384 tok/s step 5558/19560 | loss 3.529827 (-0.46z)| norm 0.3300 (+2.77z)| lr 5.07e-04 | 322.68 ms | 52.3% bf16 MFU | 1624403 tok/s step 5559/19560 | loss 3.537335 (-0.28z)| norm 0.3015 (+1.26z)| lr 5.07e-04 | 322.41 ms | 52.3% bf16 MFU | 1624490 tok/s step 5560/19560 | loss 3.535274 (-0.33z)| norm 0.2717 (-0.36z)| lr 5.07e-04 | 323.03 ms | 52.2% bf16 MFU | 1624417 tok/s step 5561/19560 | loss 3.560294 (+0.24z)| norm 0.2682 (-0.54z)| lr 5.07e-04 | 322.76 ms | 52.3% bf16 MFU | 1624415 tok/s step 5562/19560 | loss 3.464063 (-1.97z)| norm 0.2713 (-0.36z)| lr 5.07e-04 | 322.55 ms | 52.3% bf16 MFU | 1624466 tok/s step 5563/19560 | loss 3.538057 (-0.24z)| norm 0.2887 (+0.62z)| lr 5.07e-04 | 322.54 ms | 52.3% bf16 MFU | 1624518 tok/s step 5564/19560 | loss 3.497082 (-1.19z)| norm 0.2821 (+0.26z)| lr 5.07e-04 | 322.66 ms | 52.3% bf16 MFU | 1624537 tok/s step 5565/19560 | loss 3.597974 (+1.18z)| norm 0.3019 (+1.37z)| lr 5.07e-04 | 322.59 ms | 52.3% bf16 MFU | 1624573 tok/s step 5566/19560 | loss 3.537189 (-0.25z)| norm 0.3425 (+3.47z)| lr 5.07e-04 | 322.71 ms | 52.3% bf16 MFU | 1624578 tok/s step 5567/19560 | loss 3.597078 (+1.16z)| norm 0.3079 (+1.58z)| lr 5.07e-04 | 323.08 ms | 52.2% bf16 MFU | 1624488 tok/s step 5568/19560 | loss 3.521240 (-0.61z)| norm 0.2949 (+0.89z)| lr 5.07e-04 | 322.11 ms | 52.4% bf16 MFU | 1624646 tok/s step 5569/19560 | loss 3.674544 (+2.88z)| norm 0.2890 (+0.57z)| lr 5.07e-04 | 322.31 ms | 52.4% bf16 MFU | 1624748 tok/s step 5570/19560 | loss 3.537106 (-0.26z)| norm 0.2948 (+0.87z)| lr 5.07e-04 | 323.28 ms | 52.2% bf16 MFU | 1624600 tok/s step 5571/19560 | loss 3.553171 (+0.10z)| norm 0.3012 (+1.20z)| lr 5.07e-04 | 322.89 ms | 52.3% bf16 MFU | 1624557 tok/s step 5572/19560 | loss 3.522363 (-0.61z)| norm 0.3283 (+2.56z)| lr 5.07e-04 | 322.40 ms | 52.3% bf16 MFU | 1624639 tok/s step 
5573/19560 | loss 3.532572 (-0.38z)| norm 0.2936 (+0.73z)| lr 5.07e-04 | 323.09 ms | 52.2% bf16 MFU | 1624544 tok/s step 5574/19560 | loss 3.544080 (-0.11z)| norm 0.2798 (+0.01z)| lr 5.06e-04 | 322.21 ms | 52.4% bf16 MFU | 1624676 tok/s step 5575/19560 | loss 3.531888 (-0.39z)| norm 0.2797 (-0.01z)| lr 5.06e-04 | 322.18 ms | 52.4% bf16 MFU | 1624807 tok/s step 5576/19560 | loss 3.537797 (-0.25z)| norm 0.3099 (+1.57z)| lr 5.06e-04 | 322.88 ms | 52.3% bf16 MFU | 1624756 tok/s step 5577/19560 | loss 3.491370 (-1.31z)| norm 0.2852 (+0.27z)| lr 5.06e-04 | 323.13 ms | 52.2% bf16 MFU | 1624644 tok/s step 5578/19560 | loss 3.608209 (+1.36z)| norm 0.2754 (-0.25z)| lr 5.06e-04 | 322.45 ms | 52.3% bf16 MFU | 1624710 tok/s step 5579/19560 | loss 3.553614 (+0.10z)| norm 0.2787 (-0.08z)| lr 5.06e-04 | 322.70 ms | 52.3% bf16 MFU | 1624709 tok/s step 5580/19560 | loss 3.533902 (-0.35z)| norm 0.2583 (-1.15z)| lr 5.06e-04 | 322.87 ms | 52.3% bf16 MFU | 1624665 tok/s step 5581/19560 | loss 3.498250 (-1.16z)| norm 0.2516 (-1.50z)| lr 5.06e-04 | 322.31 ms | 52.4% bf16 MFU | 1624763 tok/s step 5582/19560 | loss 3.584560 (+0.84z)| norm 0.2587 (-1.12z)| lr 5.06e-04 | 322.88 ms | 52.3% bf16 MFU | 1624715 tok/s step 5583/19560 | loss 3.517011 (-0.71z)| norm 0.2589 (-1.11z)| lr 5.06e-04 | 323.09 ms | 52.2% bf16 MFU | 1624615 tok/s step 5584/19560 | loss 3.538913 (-0.21z)| norm 0.2686 (-0.60z)| lr 5.06e-04 | 322.84 ms | 52.3% bf16 MFU | 1624584 tok/s step 5585/19560 | loss 3.573328 (+0.62z)| norm 0.2569 (-1.20z)| lr 5.06e-04 | 322.87 ms | 52.3% bf16 MFU | 1624548 tok/s step 5586/19560 | loss 3.617868 (+1.64z)| norm 0.2614 (-0.95z)| lr 5.06e-04 | 322.44 ms | 52.3% bf16 MFU | 1624620 tok/s step 5587/19560 | loss 3.515764 (-0.74z)| norm 0.2789 (-0.06z)| lr 5.06e-04 | 322.60 ms | 52.3% bf16 MFU | 1624649 tok/s step 5588/19560 | loss 3.488220 (-1.36z)| norm 0.2330 (-2.39z)| lr 5.06e-04 | 322.90 ms | 52.3% bf16 MFU | 1624602 tok/s step 5589/19560 | loss 3.506207 (-0.94z)| norm 0.2672 (-0.63z)| lr 
5.06e-04 | 322.43 ms | 52.3% bf16 MFU | 1624673 tok/s step 5590/19560 | loss 3.566453 (+0.44z)| norm 0.2726 (-0.36z)| lr 5.06e-04 | 322.44 ms | 52.3% bf16 MFU | 1624739 tok/s step 5591/19560 | loss 3.443221 (-2.36z)| norm 0.2560 (-1.20z)| lr 5.06e-04 | 322.84 ms | 52.3% bf16 MFU | 1624702 tok/s step 5592/19560 | loss 3.564350 (+0.39z)| norm 0.2985 (+0.96z)| lr 5.06e-04 | 322.74 ms | 52.3% bf16 MFU | 1624692 tok/s step 5593/19560 | loss 3.573084 (+0.58z)| norm 0.2591 (-1.04z)| lr 5.06e-04 | 322.83 ms | 52.3% bf16 MFU | 1624659 tok/s step 5594/19560 | loss 3.538115 (-0.19z)| norm 0.2860 (+0.32z)| lr 5.06e-04 | 322.97 ms | 52.3% bf16 MFU | 1624592 tok/s step 5595/19560 | loss 3.517065 (-0.69z)| norm 0.3021 (+1.13z)| lr 5.06e-04 | 323.51 ms | 52.2% bf16 MFU | 1624393 tok/s step 5596/19560 | loss 3.562371 (+0.40z)| norm 0.3131 (+1.65z)| lr 5.06e-04 | 322.75 ms | 52.3% bf16 MFU | 1624396 tok/s step 5597/19560 | loss 3.511175 (-0.82z)| norm 0.2956 (+0.77z)| lr 5.06e-04 | 322.64 ms | 52.3% bf16 MFU | 1624426 tok/s step 5598/19560 | loss 3.521746 (-0.58z)| norm 0.3065 (+1.29z)| lr 5.06e-04 | 322.66 ms | 52.3% bf16 MFU | 1624451 tok/s step 5599/19560 | loss 3.550453 (+0.10z)| norm 0.2792 (-0.06z)| lr 5.06e-04 | 323.55 ms | 52.2% bf16 MFU | 1624250 tok/s step 5600/19560 | loss 3.564370 (+0.44z)| norm 0.2719 (-0.42z)| lr 5.06e-04 | 322.38 ms | 52.4% bf16 MFU | 1624353 tok/s step 5601/19560 | loss 3.580525 (+0.82z)| norm 0.2533 (-1.34z)| lr 5.05e-04 | 322.15 ms | 52.4% bf16 MFU | 1624508 tok/s step 5602/19560 | loss 3.533915 (-0.30z)| norm 0.2682 (-0.59z)| lr 5.05e-04 | 323.01 ms | 52.2% bf16 MFU | 1624439 tok/s step 5603/19560 | loss 3.504232 (-1.05z)| norm 0.2989 (+0.99z)| lr 5.05e-04 | 322.86 ms | 52.3% bf16 MFU | 1624410 tok/s step 5604/19560 | loss 3.561188 (+0.37z)| norm 0.2844 (+0.24z)| lr 5.05e-04 | 322.79 ms | 52.3% bf16 MFU | 1624401 tok/s step 5605/19560 | loss 3.529927 (-0.40z)| norm 0.2394 (-2.01z)| lr 5.05e-04 | 323.93 ms | 52.1% bf16 MFU | 1624107 tok/s step 
5606/19560 | loss 3.518752 (-0.67z)| norm 0.2753 (-0.19z)| lr 5.05e-04 | 323.38 ms | 52.2% bf16 MFU | 1623964 tok/s step 5607/19560 | loss 3.639977 (+2.28z)| norm 0.3063 (+1.37z)| lr 5.05e-04 | 322.18 ms | 52.4% bf16 MFU | 1624131 tok/s step 5608/19560 | loss 3.525391 (-0.52z)| norm 0.2584 (-1.03z)| lr 5.05e-04 | 322.42 ms | 52.3% bf16 MFU | 1624229 tok/s step 5609/19560 | loss 3.523923 (-0.55z)| norm 0.2759 (-0.14z)| lr 5.05e-04 | 323.49 ms | 52.2% bf16 MFU | 1624053 tok/s step 5610/19560 | loss 3.537649 (-0.21z)| norm 0.2526 (-1.33z)| lr 5.05e-04 | 322.65 ms | 52.3% bf16 MFU | 1624097 tok/s step 5611/19560 | loss 3.509822 (-0.90z)| norm 0.2672 (-0.55z)| lr 5.05e-04 | 322.08 ms | 52.4% bf16 MFU | 1624284 tok/s step 5612/19560 | loss 3.513795 (-0.79z)| norm 0.2687 (-0.48z)| lr 5.05e-04 | 323.29 ms | 52.2% bf16 MFU | 1624155 tok/s step 5613/19560 | loss 3.551083 (+0.12z)| norm 0.2687 (-0.48z)| lr 5.05e-04 | 323.36 ms | 52.2% bf16 MFU | 1624017 tok/s step 5614/19560 | loss 3.500224 (-1.12z)| norm 0.2830 (+0.28z)| lr 5.05e-04 | 322.87 ms | 52.3% bf16 MFU | 1624007 tok/s step 5615/19560 | loss 3.548241 (+0.05z)| norm 0.2938 (+0.84z)| lr 5.05e-04 | 323.08 ms | 52.2% bf16 MFU | 1623946 tok/s step 5616/19560 | loss 3.561930 (+0.38z)| norm 0.2807 (+0.14z)| lr 5.05e-04 | 322.11 ms | 52.4% bf16 MFU | 1624131 tok/s step 5617/19560 | loss 3.571746 (+0.63z)| norm 0.3031 (+1.30z)| lr 5.05e-04 | 323.06 ms | 52.2% bf16 MFU | 1624069 tok/s step 5618/19560 | loss 3.518059 (-0.70z)| norm 0.3019 (+1.24z)| lr 5.05e-04 | 322.65 ms | 52.3% bf16 MFU | 1624114 tok/s step 5619/19560 | loss 3.794532 (+5.39z)| norm 0.3656 (+4.22z)| lr 5.05e-04 | 322.66 ms | 52.3% bf16 MFU | 1624154 tok/s step 5620/19560 | loss 3.546651 (-0.03z)| norm 0.3104 (+1.50z)| lr 5.05e-04 | 322.77 ms | 52.3% bf16 MFU | 1624162 tok/s step 5621/19560 | loss 3.515968 (-0.70z)| norm 0.3302 (+2.38z)| lr 5.05e-04 | 323.41 ms | 52.2% bf16 MFU | 1624010 tok/s step 5622/19560 | loss 3.595445 (+1.02z)| norm 0.3117 (+1.48z)| lr 
5.05e-04 | 322.51 ms | 52.3% bf16 MFU | 1624092 tok/s step 5623/19560 | loss 3.545611 (-0.06z)| norm 0.2854 (+0.24z)| lr 5.05e-04 | 323.02 ms | 52.2% bf16 MFU | 1624040 tok/s step 5624/19560 | loss 3.580284 (+0.68z)| norm 0.2797 (-0.02z)| lr 5.05e-04 | 323.06 ms | 52.2% bf16 MFU | 1623984 tok/s step 5625/19560 | loss 3.530408 (-0.41z)| norm 0.2804 (+0.01z)| lr 5.05e-04 | 322.92 ms | 52.3% bf16 MFU | 1623963 tok/s step 5626/19560 | loss 3.492110 (-1.22z)| norm 0.2751 (-0.23z)| lr 5.05e-04 | 322.84 ms | 52.3% bf16 MFU | 1623965 tok/s step 5627/19560 | loss 3.484244 (-1.37z)| norm 0.2776 (-0.11z)| lr 5.05e-04 | 322.80 ms | 52.3% bf16 MFU | 1623976 tok/s step 5628/19560 | loss 3.603228 (+1.18z)| norm 0.2584 (-1.00z)| lr 5.05e-04 | 322.82 ms | 52.3% bf16 MFU | 1623982 tok/s step 5629/19560 | loss 3.595483 (+1.01z)| norm 0.2752 (-0.21z)| lr 5.04e-04 | 322.98 ms | 52.3% bf16 MFU | 1623948 tok/s step 5630/19560 | loss 3.533176 (-0.31z)| norm 0.2719 (-0.37z)| lr 5.04e-04 | 322.75 ms | 52.3% bf16 MFU | 1623973 tok/s step 5631/19560 | loss 3.543188 (-0.07z)| norm 0.2753 (-0.20z)| lr 5.04e-04 | 322.74 ms | 52.3% bf16 MFU | 1624000 tok/s step 5632/19560 | loss 3.540103 (-0.17z)| norm 0.2782 (-0.07z)| lr 5.04e-04 | 322.64 ms | 52.3% bf16 MFU | 1624050 tok/s step 5633/19560 | loss 3.568120 (+0.49z)| norm 0.2729 (-0.32z)| lr 5.04e-04 | 322.96 ms | 52.3% bf16 MFU | 1624018 tok/s step 5634/19560 | loss 3.539543 (-0.17z)| norm 0.2854 (+0.27z)| lr 5.04e-04 | 322.81 ms | 52.3% bf16 MFU | 1624024 tok/s step 5635/19560 | loss 3.503392 (-0.99z)| norm 0.2535 (-1.22z)| lr 5.04e-04 | 322.97 ms | 52.3% bf16 MFU | 1623990 tok/s step 5636/19560 | loss 3.548148 (+0.05z)| norm 0.2824 (+0.13z)| lr 5.04e-04 | 322.79 ms | 52.3% bf16 MFU | 1624002 tok/s step 5637/19560 | loss 3.549762 (+0.09z)| norm 0.3426 (+2.84z)| lr 5.04e-04 | 322.77 ms | 52.3% bf16 MFU | 1624020 tok/s step 5638/19560 | loss 3.655515 (+2.45z)| norm 0.2700 (-0.47z)| lr 5.04e-04 | 323.36 ms | 52.2% bf16 MFU | 1623888 tok/s step 
5639/19560 | loss 3.492695 (-1.21z)| norm 0.2775 (-0.12z)| lr 5.04e-04 | 322.90 ms | 52.3% bf16 MFU | 1623877 tok/s step 5640/19560 | loss 3.586693 (+0.90z)| norm 0.2781 (-0.10z)| lr 5.04e-04 | 322.43 ms | 52.3% bf16 MFU | 1623985 tok/s step 5641/19560 | loss 3.510386 (-0.80z)| norm 0.2736 (-0.30z)| lr 5.04e-04 | 323.40 ms | 52.2% bf16 MFU | 1623843 tok/s step 5642/19560 | loss 3.528214 (-0.39z)| norm 0.3074 (+1.22z)| lr 5.04e-04 | 322.45 ms | 52.3% bf16 MFU | 1623948 tok/s step 5643/19560 | loss 3.522720 (-0.51z)| norm 0.2794 (-0.06z)| lr 5.04e-04 | 322.88 ms | 52.3% bf16 MFU | 1623940 tok/s step 5644/19560 | loss 3.521676 (-0.53z)| norm 0.2528 (-1.28z)| lr 5.04e-04 | 322.66 ms | 52.3% bf16 MFU | 1623988 tok/s step 5645/19560 | loss 3.522490 (-0.50z)| norm 0.2549 (-1.18z)| lr 5.04e-04 | 322.57 ms | 52.3% bf16 MFU | 1624056 tok/s step 5646/19560 | loss 3.556815 (+0.27z)| norm 0.2750 (-0.27z)| lr 5.04e-04 | 322.91 ms | 52.3% bf16 MFU | 1624034 tok/s step 5647/19560 | loss 3.552042 (+0.16z)| norm 0.3028 (+0.99z)| lr 5.04e-04 | 322.83 ms | 52.3% bf16 MFU | 1624034 tok/s step 5648/19560 | loss 3.592956 (+1.08z)| norm 0.2973 (+0.74z)| lr 5.04e-04 | 322.46 ms | 52.3% bf16 MFU | 1624127 tok/s step 5649/19560 | loss 3.507677 (-0.83z)| norm 0.2661 (-0.68z)| lr 5.04e-04 | 323.30 ms | 52.2% bf16 MFU | 1624004 tok/s step 5650/19560 | loss 3.576226 (+0.71z)| norm 0.2788 (-0.10z)| lr 5.04e-04 | 322.71 ms | 52.3% bf16 MFU | 1624036 tok/s step 5651/19560 | loss 3.550003 (+0.12z)| norm 0.2997 (+0.83z)| lr 5.04e-04 | 322.81 ms | 52.3% bf16 MFU | 1624040 tok/s step 5652/19560 | loss 3.499047 (-1.02z)| norm 0.2993 (+0.81z)| lr 5.04e-04 | 322.77 ms | 52.3% bf16 MFU | 1624054 tok/s step 5653/19560 | loss 3.564889 (+0.45z)| norm 0.2633 (-0.83z)| lr 5.04e-04 | 323.12 ms | 52.2% bf16 MFU | 1623980 tok/s step 5654/19560 | loss 3.461927 (-1.82z)| norm 0.2744 (-0.33z)| lr 5.04e-04 | 323.43 ms | 52.2% bf16 MFU | 1623833 tok/s step 5655/19560 | loss 3.544269 (+0.01z)| norm 0.2572 (-1.12z)| lr 
5.04e-04 | 322.43 ms | 52.3% bf16 MFU | 1623943 tok/s step 5656/19560 | loss 3.560527 (+0.36z)| norm 0.2567 (-1.14z)| lr 5.03e-04 | 323.22 ms | 52.2% bf16 MFU | 1623851 tok/s step 5657/19560 | loss 3.529962 (-0.32z)| norm 0.2713 (-0.46z)| lr 5.03e-04 | 323.34 ms | 52.2% bf16 MFU | 1623732 tok/s step 5658/19560 | loss 3.501465 (-0.97z)| norm 0.2480 (-1.50z)| lr 5.03e-04 | 323.11 ms | 52.2% bf16 MFU | 1623677 tok/s step 5659/19560 | loss 3.502812 (-0.92z)| norm 0.2658 (-0.69z)| lr 5.03e-04 | 323.42 ms | 52.2% bf16 MFU | 1623547 tok/s step 5660/19560 | loss 3.559983 (+0.35z)| norm 0.2450 (-1.61z)| lr 5.03e-04 | 322.94 ms | 52.3% bf16 MFU | 1623544 tok/s step 5661/19560 | loss 3.550565 (+0.13z)| norm 0.2581 (-1.01z)| lr 5.03e-04 | 322.59 ms | 52.3% bf16 MFU | 1623629 tok/s step 5662/19560 | loss 3.507322 (-0.84z)| norm 0.2710 (-0.43z)| lr 5.03e-04 | 323.06 ms | 52.2% bf16 MFU | 1623591 tok/s step 5663/19560 | loss 3.585789 (+0.91z)| norm 0.2745 (-0.28z)| lr 5.03e-04 | 323.04 ms | 52.2% bf16 MFU | 1623561 tok/s step 5664/19560 | loss 3.496573 (-1.08z)| norm 0.2632 (-0.79z)| lr 5.03e-04 | 322.62 ms | 52.3% bf16 MFU | 1623638 tok/s step 5665/19560 | loss 3.571286 (+0.58z)| norm 0.2807 (+0.00z)| lr 5.03e-04 | 323.34 ms | 52.2% bf16 MFU | 1623530 tok/s step 5666/19560 | loss 3.492299 (-1.16z)| norm 0.2931 (+0.56z)| lr 5.03e-04 | 322.85 ms | 52.3% bf16 MFU | 1623549 tok/s step 5667/19560 | loss 3.496222 (-1.06z)| norm 0.3138 (+1.47z)| lr 5.03e-04 | 322.98 ms | 52.3% bf16 MFU | 1623537 tok/s step 5668/19560 | loss 3.510243 (-0.75z)| norm 0.2675 (-0.60z)| lr 5.03e-04 | 322.80 ms | 52.3% bf16 MFU | 1623571 tok/s step 5669/19560 | loss 3.582397 (+0.84z)| norm 0.2953 (+0.64z)| lr 5.03e-04 | 322.80 ms | 52.3% bf16 MFU | 1623602 tok/s step 5670/19560 | loss 3.560243 (+0.37z)| norm 0.3338 (+2.31z)| lr 5.03e-04 | 323.82 ms | 52.1% bf16 MFU | 1623375 tok/s step 5671/19560 | loss 3.509292 (-0.75z)| norm 0.2793 (-0.10z)| lr 5.03e-04 | 323.00 ms | 52.3% bf16 MFU | 1623365 tok/s step 
5672/19560 | loss 3.595326 (+1.16z)| norm 0.3084 (+1.17z)| lr 5.03e-04 | 322.50 ms | 52.3% bf16 MFU | 1623481 tok/s step 5673/19560 | loss 3.511297 (-0.73z)| norm 0.2530 (-1.24z)| lr 5.03e-04 | 322.29 ms | 52.4% bf16 MFU | 1623645 tok/s step 5674/19560 | loss 3.545919 (+0.05z)| norm 0.2698 (-0.51z)| lr 5.03e-04 | 322.54 ms | 52.3% bf16 MFU | 1623737 tok/s step 5675/19560 | loss 3.563416 (+0.44z)| norm 0.2687 (-0.56z)| lr 5.03e-04 | 322.72 ms | 52.3% bf16 MFU | 1623781 tok/s step 5676/19560 | loss 3.567522 (+0.53z)| norm 0.2997 (+0.78z)| lr 5.03e-04 | 322.76 ms | 52.3% bf16 MFU | 1623812 tok/s step 5677/19560 | loss 3.509952 (-0.77z)| norm 0.2661 (-0.67z)| lr 5.03e-04 | 322.10 ms | 52.4% bf16 MFU | 1624006 tok/s step 5678/19560 | loss 3.560049 (+0.35z)| norm 0.2626 (-0.82z)| lr 5.03e-04 | 322.26 ms | 52.4% bf16 MFU | 1624151 tok/s step 5679/19560 | loss 3.556067 (+0.26z)| norm 0.2819 (+0.05z)| lr 5.03e-04 | 323.15 ms | 52.2% bf16 MFU | 1624064 tok/s step 5680/19560 | loss 3.566732 (+0.49z)| norm 0.2777 (-0.15z)| lr 5.03e-04 | 323.40 ms | 52.2% bf16 MFU | 1623919 tok/s step 5681/19560 | loss 3.488246 (-1.25z)| norm 0.2756 (-0.24z)| lr 5.03e-04 | 322.71 ms | 52.3% bf16 MFU | 1623954 tok/s step 5682/19560 | loss 3.571944 (+0.61z)| norm 0.2927 (+0.52z)| lr 5.03e-04 | 322.48 ms | 52.3% bf16 MFU | 1624047 tok/s step 5683/19560 | loss 3.454887 (-1.96z)| norm 0.3305 (+2.17z)| lr 5.02e-04 | 322.52 ms | 52.3% bf16 MFU | 1624125 tok/s step 5684/19560 | loss 3.590594 (+1.07z)| norm 0.2719 (-0.43z)| lr 5.02e-04 | 322.99 ms | 52.3% bf16 MFU | 1624081 tok/s step 5685/19560 | loss 3.633410 (+2.00z)| norm 0.3285 (+2.03z)| lr 5.02e-04 | 323.10 ms | 52.2% bf16 MFU | 1624012 tok/s step 5686/19560 | loss 3.529132 (-0.33z)| norm 0.2980 (+0.72z)| lr 5.02e-04 | 322.95 ms | 52.3% bf16 MFU | 1623983 tok/s step 5687/19560 | loss 3.573321 (+0.65z)| norm 0.2765 (-0.23z)| lr 5.02e-04 | 322.93 ms | 52.3% bf16 MFU | 1623962 tok/s step 5688/19560 | loss 3.706463 (+3.42z)| norm 0.3118 (+1.32z)| lr 
5.02e-04 | 322.50 ms | 52.3% bf16 MFU | 1624050 tok/s step 5689/19560 | loss 3.513038 (-0.68z)| norm 0.3053 (+1.02z)| lr 5.02e-04 | 322.48 ms | 52.3% bf16 MFU | 1624138 tok/s step 5690/19560 | loss 3.638873 (+1.96z)| norm 0.2878 (+0.24z)| lr 5.02e-04 | 322.63 ms | 52.3% bf16 MFU | 1624183 tok/s step 5691/19560 | loss 3.544356 (-0.04z)| norm 0.2771 (-0.23z)| lr 5.02e-04 | 323.37 ms | 52.2% bf16 MFU | 1624040 tok/s step 5692/19560 | loss 3.539525 (-0.15z)| norm 0.2831 (+0.03z)| lr 5.02e-04 | 322.34 ms | 52.4% bf16 MFU | 1624165 tok/s step 5693/19560 | loss 3.538520 (-0.17z)| norm 0.2714 (-0.47z)| lr 5.02e-04 | 322.39 ms | 52.4% bf16 MFU | 1624269 tok/s step 5694/19560 | loss 3.499393 (-0.99z)| norm 0.2421 (-1.77z)| lr 5.02e-04 | 323.23 ms | 52.2% bf16 MFU | 1624156 tok/s step 5695/19560 | loss 3.503807 (-0.88z)| norm 0.2768 (-0.19z)| lr 5.02e-04 | 322.88 ms | 52.3% bf16 MFU | 1624138 tok/s step 5696/19560 | loss 3.572204 (+0.56z)| norm 0.2667 (-0.64z)| lr 5.02e-04 | 322.90 ms | 52.3% bf16 MFU | 1624116 tok/s step 5697/19560 | loss 3.579941 (+0.77z)| norm 0.2499 (-1.38z)| lr 5.02e-04 | 322.45 ms | 52.3% bf16 MFU | 1624207 tok/s step 5698/19560 | loss 3.556056 (+0.24z)| norm 0.2687 (-0.53z)| lr 5.02e-04 | 322.69 ms | 52.3% bf16 MFU | 1624234 tok/s step 5699/19560 | loss 3.536323 (-0.19z)| norm 0.2386 (-1.84z)| lr 5.02e-04 | 322.24 ms | 52.4% bf16 MFU | 1624372 tok/s step 5700/19560 | loss 3.595500 (+1.09z)| norm 0.2521 (-1.22z)| lr 5.02e-04 | 323.18 ms | 52.2% bf16 MFU | 1624269 tok/s step 5701/19560 | loss 3.543591 (-0.04z)| norm 0.2742 (-0.22z)| lr 5.02e-04 | 322.63 ms | 52.3% bf16 MFU | 1624307 tok/s step 5702/19560 | loss 3.594902 (+1.06z)| norm 0.2849 (+0.26z)| lr 5.02e-04 | 322.66 ms | 52.3% bf16 MFU | 1624337 tok/s step 5703/19560 | loss 3.542629 (-0.08z)| norm 0.2680 (-0.50z)| lr 5.02e-04 | 323.32 ms | 52.2% bf16 MFU | 1624200 tok/s step 5704/19560 | loss 3.546777 (+0.01z)| norm 0.2539 (-1.12z)| lr 5.02e-04 | 323.11 ms | 52.2% bf16 MFU | 1624120 tok/s step 
5705/19560 | loss 3.599096 (+1.13z)| norm 0.2795 (+0.04z)| lr 5.02e-04 | 322.27 ms | 52.4% bf16 MFU | 1624257 tok/s step 5706/19560 | loss 3.567874 (+0.46z)| norm 0.2659 (-0.57z)| lr 5.02e-04 | 322.80 ms | 52.3% bf16 MFU | 1624253 tok/s step 5707/19560 | loss 3.531075 (-0.34z)| norm 0.3032 (+1.10z)| lr 5.02e-04 | 322.71 ms | 52.3% bf16 MFU | 1624272 tok/s step 5708/19560 | loss 3.554488 (+0.17z)| norm 0.2706 (-0.37z)| lr 5.02e-04 | 322.66 ms | 52.3% bf16 MFU | 1624303 tok/s step 5709/19560 | loss 3.602003 (+1.19z)| norm 0.2859 (+0.31z)| lr 5.02e-04 | 323.25 ms | 52.2% bf16 MFU | 1624183 tok/s step 5710/19560 | loss 3.545082 (-0.05z)| norm 0.3240 (+1.99z)| lr 5.01e-04 | 322.44 ms | 52.3% bf16 MFU | 1624275 tok/s step 5711/19560 | loss 3.470096 (-1.67z)| norm 0.3026 (+1.02z)| lr 5.01e-04 | 323.34 ms | 52.2% bf16 MFU | 1624134 tok/s step 5712/19560 | loss 3.471175 (-1.61z)| norm 0.2623 (-0.79z)| lr 5.01e-04 | 322.75 ms | 52.3% bf16 MFU | 1624150 tok/s step 5713/19560 | loss 3.582667 (+0.78z)| norm 0.2957 (+0.69z)| lr 5.01e-04 | 322.27 ms | 52.4% bf16 MFU | 1624286 tok/s step 5714/19560 | loss 3.518944 (-0.58z)| norm 0.2774 (-0.13z)| lr 5.01e-04 | 322.38 ms | 52.4% bf16 MFU | 1624388 tok/s step 5715/19560 | loss 3.606868 (+1.30z)| norm 0.2752 (-0.23z)| lr 5.01e-04 | 323.04 ms | 52.2% bf16 MFU | 1624319 tok/s step 5716/19560 | loss 3.595792 (+1.05z)| norm 0.2703 (-0.47z)| lr 5.01e-04 | 322.80 ms | 52.3% bf16 MFU | 1624313 tok/s step 5717/19560 | loss 3.512728 (-0.74z)| norm 0.2725 (-0.37z)| lr 5.01e-04 | 322.77 ms | 52.3% bf16 MFU | 1624315 tok/s step 5718/19560 | loss 3.542065 (-0.11z)| norm 0.2729 (-0.35z)| lr 5.01e-04 | 322.54 ms | 52.3% bf16 MFU | 1624374 tok/s step 5719/19560 | loss 3.559468 (+0.25z)| norm 0.2753 (-0.25z)| lr 5.01e-04 | 323.11 ms | 52.2% bf16 MFU | 1624286 tok/s step 5720/19560 | loss 3.540909 (-0.15z)| norm 0.2549 (-1.17z)| lr 5.01e-04 | 322.98 ms | 52.3% bf16 MFU | 1624237 tok/s step 5721/19560 | loss 3.613412 (+1.43z)| norm 0.2724 (-0.37z)| lr 
5.01e-04 | 322.54 ms | 52.3% bf16 MFU | 1624299 tok/s step 5722/19560 | loss 3.542602 (-0.12z)| norm 0.2519 (-1.30z)| lr 5.01e-04 | 323.03 ms | 52.2% bf16 MFU | 1624235 tok/s step 5723/19560 | loss 3.519557 (-0.63z)| norm 0.2765 (-0.16z)| lr 5.01e-04 | 323.10 ms | 52.2% bf16 MFU | 1624157 tok/s step 5724/19560 | loss 3.630843 (+1.78z)| norm 0.2517 (-1.28z)| lr 5.01e-04 | 322.68 ms | 52.3% bf16 MFU | 1624188 tok/s step 5725/19560 | loss 3.533533 (-0.33z)| norm 0.2708 (-0.40z)| lr 5.01e-04 | 322.83 ms | 52.3% bf16 MFU | 1624181 tok/s step 5726/19560 | loss 3.600836 (+1.11z)| norm 0.2711 (-0.37z)| lr 5.01e-04 | 322.99 ms | 52.3% bf16 MFU | 1624133 tok/s step 5727/19560 | loss 3.488751 (-1.29z)| norm 0.2806 (+0.07z)| lr 5.01e-04 | 322.08 ms | 52.4% bf16 MFU | 1624317 tok/s step 5728/19560 | loss 3.541285 (-0.16z)| norm 0.3002 (+0.96z)| lr 5.01e-04 | 322.26 ms | 52.4% bf16 MFU | 1624446 tok/s step 5729/19560 | loss 3.526868 (-0.46z)| norm 0.2871 (+0.35z)| lr 5.01e-04 | 323.38 ms | 52.2% bf16 MFU | 1624289 tok/s step 5730/19560 | loss 3.505353 (-0.92z)| norm 0.2612 (-0.85z)| lr 5.01e-04 | 322.31 ms | 52.4% bf16 MFU | 1624408 tok/s step 5731/19560 | loss 3.506777 (-0.89z)| norm 0.3012 (+1.00z)| lr 5.01e-04 | 323.01 ms | 52.3% bf16 MFU | 1624344 tok/s step 5732/19560 | loss 3.538265 (-0.21z)| norm 0.2840 (+0.21z)| lr 5.01e-04 | 322.32 ms | 52.4% bf16 MFU | 1624458 tok/s step 5733/19560 | loss 3.580223 (+0.68z)| norm 0.2533 (-1.23z)| lr 5.01e-04 | 322.74 ms | 52.3% bf16 MFU | 1624460 tok/s step 5734/19560 | loss 3.456889 (-1.92z)| norm 0.2448 (-1.60z)| lr 5.01e-04 | 323.06 ms | 52.2% bf16 MFU | 1624380 tok/s step 5735/19560 | loss 3.542108 (-0.11z)| norm 0.2598 (-0.89z)| lr 5.01e-04 | 322.85 ms | 52.3% bf16 MFU | 1624358 tok/s step 5736/19560 | loss 3.529030 (-0.39z)| norm 0.2576 (-0.99z)| lr 5.01e-04 | 322.55 ms | 52.3% bf16 MFU | 1624414 tok/s step 5737/19560 | loss 3.520097 (-0.58z)| norm 0.2958 (+0.77z)| lr 5.00e-04 | 322.97 ms | 52.3% bf16 MFU | 1624359 tok/s step 
5738/19560 | loss 3.500521 (-0.99z)| norm 0.3682 (+3.85z)| lr 5.00e-04 | 322.46 ms | 52.3% bf16 MFU | 1624435 tok/s step 5739/19560 | loss 3.541598 (-0.12z)| norm 0.4009 (+4.75z)| lr 5.00e-04 | 322.76 ms | 52.3% bf16 MFU | 1624433 tok/s step 5740/19560 | loss 3.570470 (+0.49z)| norm 0.3070 (+1.01z)| lr 5.00e-04 | 322.67 ms | 52.3% bf16 MFU | 1624453 tok/s step 5741/19560 | loss 3.592205 (+0.95z)| norm 0.2825 (+0.04z)| lr 5.00e-04 | 322.56 ms | 52.3% bf16 MFU | 1624501 tok/s step 5742/19560 | loss 3.521357 (-0.57z)| norm 0.3330 (+1.99z)| lr 5.00e-04 | 322.90 ms | 52.3% bf16 MFU | 1624461 tok/s step 5743/19560 | loss 3.503341 (-0.95z)| norm 0.2634 (-0.71z)| lr 5.00e-04 | 323.28 ms | 52.2% bf16 MFU | 1624327 tok/s step 5744/19560 | loss 3.606695 (+1.25z)| norm 0.3370 (+2.10z)| lr 5.00e-04 | 323.17 ms | 52.2% bf16 MFU | 1624228 tok/s step 5745/19560 | loss 3.570370 (+0.48z)| norm 0.2714 (-0.40z)| lr 5.00e-04 | 322.51 ms | 52.3% bf16 MFU | 1624298 tok/s step 5746/19560 | loss 3.526376 (-0.46z)| norm 0.2838 (+0.08z)| lr 5.00e-04 | 322.70 ms | 52.3% bf16 MFU | 1624318 tok/s step 5747/19560 | loss 3.511241 (-0.83z)| norm 0.2631 (-0.72z)| lr 5.00e-04 | 323.42 ms | 52.2% bf16 MFU | 1624156 tok/s step 5748/19560 | loss 3.535874 (-0.24z)| norm 0.2794 (-0.05z)| lr 5.00e-04 | 322.77 ms | 52.3% bf16 MFU | 1624164 tok/s step 5749/19560 | loss 3.492636 (-1.26z)| norm 0.2629 (-0.71z)| lr 5.00e-04 | 322.88 ms | 52.3% bf16 MFU | 1624146 tok/s step 5750/19560 | loss 3.537158 (-0.19z)| norm 0.2914 (+0.47z)| lr 5.00e-04 | 323.33 ms | 52.2% bf16 MFU | 1624015 tok/s val loss 3.524983 HellaSwag: 2774/10042 = 0.276240 step 5751/19560 | loss 3.504372 (-0.97z)| norm 0.2621 (-0.73z)| lr 5.00e-04 | 321.81 ms | 52.4% bf16 MFU | 1624273 tok/s step 5752/19560 | loss 3.593611 
(+1.16z)| norm 0.2688 (-0.45z)| lr 5.00e-04 | 322.20 ms | 52.4% bf16 MFU | 1624421 tok/s step 5753/19560 | loss 3.508547 (-0.86z)| norm 0.2773 (-0.10z)| lr 5.00e-04 | 322.62 ms | 52.3% bf16 MFU | 1624454 tok/s step 5754/19560 | loss 3.498852 (-1.09z)| norm 0.2655 (-0.58z)| lr 5.00e-04 | 322.99 ms | 52.3% bf16 MFU | 1624392 tok/s step 5755/19560 | loss 3.548357 (+0.07z)| norm 0.2906 (+0.44z)| lr 5.00e-04 | 322.38 ms | 52.4% bf16 MFU | 1624487 tok/s step 5756/19560 | loss 3.542691 (-0.05z)| norm 0.2730 (-0.28z)| lr 5.00e-04 | 322.42 ms | 52.3% bf16 MFU | 1624568 tok/s step 5757/19560 | loss 3.468046 (-1.82z)| norm 0.2410 (-1.57z)| lr 5.00e-04 | 323.09 ms | 52.2% bf16 MFU | 1624476 tok/s step 5758/19560 | loss 3.619132 (+1.77z)| norm 0.2551 (-0.99z)| lr 5.00e-04 | 322.87 ms | 52.3% bf16 MFU | 1624445 tok/s step 5759/19560 | loss 3.475362 (-1.61z)| norm 0.2864 (+0.28z)| lr 5.00e-04 | 322.96 ms | 52.3% bf16 MFU | 1624391 tok/s step 5760/19560 | loss 3.506507 (-0.87z)| norm 0.3325 (+2.08z)| lr 5.00e-04 | 323.44 ms | 52.2% bf16 MFU | 1624221 tok/s step 5761/19560 | loss 3.561348 (+0.41z)| norm 0.2810 (+0.04z)| lr 5.00e-04 | 322.20 ms | 52.4% bf16 MFU | 1624371 tok/s step 5762/19560 | loss 3.497765 (-1.06z)| norm 0.2757 (-0.17z)| lr 5.00e-04 | 322.62 ms | 52.3% bf16 MFU | 1624408 tok/s step 5763/19560 | loss 3.512290 (-0.73z)| norm 0.2558 (-0.96z)| lr 5.00e-04 | 323.53 ms | 52.2% bf16 MFU | 1624215 tok/s step 5764/19560 | loss 3.567520 (+0.56z)| norm 0.2857 (+0.23z)| lr 4.99e-04 | 322.51 ms | 52.3% bf16 MFU | 1624286 tok/s step 5765/19560 | loss 3.525950 (-0.41z)| norm 0.2830 (+0.14z)| lr 4.99e-04 | 324.77 ms | 52.0% bf16 MFU | 1623790 tok/s step 5766/19560 | loss 3.565522 (+0.55z)| norm 0.3030 (+0.94z)| lr 4.99e-04 | 322.34 ms | 52.4% bf16 MFU | 1623924 tok/s step 5767/19560 | loss 3.446872 (-2.25z)| norm 0.2723 (-0.30z)| lr 4.99e-04 | 323.04 ms | 52.2% bf16 MFU | 1623878 tok/s step 5768/19560 | loss 3.449064 (-2.14z)| norm 0.2787 (-0.04z)| lr 4.99e-04 | 322.82 ms | 52.3% 
bf16 MFU | 1623888 tok/s step 5769/19560 | loss 3.555623 (+0.33z)| norm 0.2987 (+0.76z)| lr 4.99e-04 | 322.75 ms | 52.3% bf16 MFU | 1623916 tok/s step 5770/19560 | loss 3.609354 (+1.55z)| norm 0.2685 (-0.46z)| lr 4.99e-04 | 322.32 ms | 52.4% bf16 MFU | 1624051 tok/s step 5771/19560 | loss 3.525099 (-0.40z)| norm 0.2603 (-0.78z)| lr 4.99e-04 | 322.76 ms | 52.3% bf16 MFU | 1624067 tok/s step 5772/19560 | loss 3.524408 (-0.41z)| norm 0.2685 (-0.46z)| lr 4.99e-04 | 322.52 ms | 52.3% bf16 MFU | 1624143 tok/s step 5773/19560 | loss 3.519845 (-0.52z)| norm 0.2680 (-0.48z)| lr 4.99e-04 | 322.67 ms | 52.3% bf16 MFU | 1624179 tok/s step 5774/19560 | loss 3.505251 (-0.84z)| norm 0.2433 (-1.47z)| lr 4.99e-04 | 322.58 ms | 52.3% bf16 MFU | 1624234 tok/s step 5775/19560 | loss 3.446459 (-2.14z)| norm 0.2717 (-0.31z)| lr 4.99e-04 | 322.51 ms | 52.3% bf16 MFU | 1624304 tok/s step 5776/19560 | loss 3.542271 (+0.04z)| norm 0.2524 (-1.08z)| lr 4.99e-04 | 322.75 ms | 52.3% bf16 MFU | 1624312 tok/s step 5777/19560 | loss 3.635333 (+2.10z)| norm 0.2613 (-0.71z)| lr 4.99e-04 | 323.06 ms | 52.2% bf16 MFU | 1624240 tok/s step 5778/19560 | loss 3.564378 (+0.51z)| norm 0.2891 (+0.41z)| lr 4.99e-04 | 322.32 ms | 52.4% bf16 MFU | 1624358 tok/s step 5779/19560 | loss 3.529796 (-0.26z)| norm 0.2486 (-1.21z)| lr 4.99e-04 | 322.84 ms | 52.3% bf16 MFU | 1624340 tok/s step 5780/19560 | loss 3.603652 (+1.37z)| norm 0.2607 (-0.71z)| lr 4.99e-04 | 322.96 ms | 52.3% bf16 MFU | 1624293 tok/s step 5781/19560 | loss 3.484716 (-1.26z)| norm 0.2695 (-0.36z)| lr 4.99e-04 | 322.63 ms | 52.3% bf16 MFU | 1624330 tok/s step 5782/19560 | loss 3.501682 (-0.90z)| norm 0.2636 (-0.59z)| lr 4.99e-04 | 323.37 ms | 52.2% bf16 MFU | 1624180 tok/s step 5783/19560 | loss 3.504729 (-0.82z)| norm 0.2714 (-0.28z)| lr 4.99e-04 | 322.54 ms | 52.3% bf16 MFU | 1624247 tok/s step 5784/19560 | loss 3.518870 (-0.50z)| norm 0.2624 (-0.65z)| lr 4.99e-04 | 322.38 ms | 52.4% bf16 MFU | 1624350 tok/s step 5785/19560 | loss 3.523270 
(-0.40z)| norm 0.2826 (+0.17z)| lr 4.99e-04 | 322.78 ms | 52.3% bf16 MFU | 1624347 tok/s step 5786/19560 | loss 3.521429 (-0.45z)| norm 0.2519 (-1.08z)| lr 4.99e-04 | 322.67 ms | 52.3% bf16 MFU | 1624372 tok/s step 5787/19560 | loss 3.513407 (-0.63z)| norm 0.2683 (-0.42z)| lr 4.99e-04 | 323.20 ms | 52.2% bf16 MFU | 1624262 tok/s step 5788/19560 | loss 3.508074 (-0.74z)| norm 0.3102 (+1.27z)| lr 4.99e-04 | 322.41 ms | 52.3% bf16 MFU | 1624358 tok/s step 5789/19560 | loss 3.519129 (-0.48z)| norm 0.2982 (+0.77z)| lr 4.99e-04 | 322.73 ms | 52.3% bf16 MFU | 1624367 tok/s step 5790/19560 | loss 3.535779 (-0.12z)| norm 0.2665 (-0.52z)| lr 4.99e-04 | 322.51 ms | 52.3% bf16 MFU | 1624432 tok/s step 5791/19560 | loss 3.602765 (+1.38z)| norm 0.2713 (-0.33z)| lr 4.98e-04 | 322.76 ms | 52.3% bf16 MFU | 1624430 tok/s step 5792/19560 | loss 3.509217 (-0.72z)| norm 0.2531 (-1.06z)| lr 4.98e-04 | 322.73 ms | 52.3% bf16 MFU | 1624435 tok/s step 5793/19560 | loss 3.553810 (+0.29z)| norm 0.2820 (+0.11z)| lr 4.98e-04 | 322.72 ms | 52.3% bf16 MFU | 1624444 tok/s step 5794/19560 | loss 3.528567 (-0.29z)| norm 0.2836 (+0.18z)| lr 4.98e-04 | 322.98 ms | 52.3% bf16 MFU | 1624387 tok/s step 5795/19560 | loss 3.512312 (-0.66z)| norm 0.2715 (-0.30z)| lr 4.98e-04 | 323.15 ms | 52.2% bf16 MFU | 1624289 tok/s step 5796/19560 | loss 3.555901 (+0.32z)| norm 0.2643 (-0.59z)| lr 4.98e-04 | 322.60 ms | 52.3% bf16 MFU | 1624335 tok/s step 5797/19560 | loss 3.646858 (+2.32z)| norm 0.2918 (+0.54z)| lr 4.98e-04 | 322.49 ms | 52.3% bf16 MFU | 1624405 tok/s step 5798/19560 | loss 3.569967 (+0.61z)| norm 0.3086 (+1.25z)| lr 4.98e-04 | 322.97 ms | 52.3% bf16 MFU | 1624352 tok/s step 5799/19560 | loss 3.542630 (-0.00z)| norm 0.3017 (+0.95z)| lr 4.98e-04 | 322.39 ms | 52.4% bf16 MFU | 1624448 tok/s step 5800/19560 | loss 3.526597 (-0.35z)| norm 0.3178 (+1.61z)| lr 4.98e-04 | 322.86 ms | 52.3% bf16 MFU | 1624420 tok/s step 5801/19560 | loss 3.509332 (-0.73z)| norm 0.2803 (+0.05z)| lr 4.98e-04 | 323.12 ms | 52.2% 
bf16 MFU | 1624327 tok/s step 5802/19560 | loss 3.523178 (-0.42z)| norm 0.2850 (+0.24z)| lr 4.98e-04 | 322.63 ms | 52.3% bf16 MFU | 1624364 tok/s step 5803/19560 | loss 3.507357 (-0.76z)| norm 0.2645 (-0.60z)| lr 4.98e-04 | 322.96 ms | 52.3% bf16 MFU | 1624316 tok/s step 5804/19560 | loss 3.554576 (+0.29z)| norm 0.2910 (+0.50z)| lr 4.98e-04 | 322.41 ms | 52.3% bf16 MFU | 1624407 tok/s step 5805/19560 | loss 3.517968 (-0.52z)| norm 0.2997 (+0.85z)| lr 4.98e-04 | 323.17 ms | 52.2% bf16 MFU | 1624303 tok/s step 5806/19560 | loss 3.567641 (+0.58z)| norm 0.2484 (-1.27z)| lr 4.98e-04 | 322.60 ms | 52.3% bf16 MFU | 1624348 tok/s step 5807/19560 | loss 3.520411 (-0.46z)| norm 0.2670 (-0.50z)| lr 4.98e-04 | 322.55 ms | 52.3% bf16 MFU | 1624402 tok/s step 5808/19560 | loss 3.510997 (-0.66z)| norm 0.2576 (-0.87z)| lr 4.98e-04 | 322.88 ms | 52.3% bf16 MFU | 1624372 tok/s step 5809/19560 | loss 3.548375 (+0.16z)| norm 0.2664 (-0.51z)| lr 4.98e-04 | 322.49 ms | 52.3% bf16 MFU | 1624442 tok/s step 5810/19560 | loss 3.517336 (-0.53z)| norm 0.2486 (-1.22z)| lr 4.98e-04 | 322.82 ms | 52.3% bf16 MFU | 1624423 tok/s step 5811/19560 | loss 3.750514 (+4.35z)| norm 0.3091 (+1.27z)| lr 4.98e-04 | 323.14 ms | 52.2% bf16 MFU | 1624327 tok/s step 5812/19560 | loss 3.567928 (+0.52z)| norm 0.3126 (+1.39z)| lr 4.98e-04 | 322.95 ms | 52.3% bf16 MFU | 1624281 tok/s step 5813/19560 | loss 3.535763 (-0.14z)| norm 0.2986 (+0.84z)| lr 4.98e-04 | 322.52 ms | 52.3% bf16 MFU | 1624346 tok/s step 5814/19560 | loss 3.552363 (+0.21z)| norm 0.2797 (+0.06z)| lr 4.98e-04 | 322.99 ms | 52.3% bf16 MFU | 1624292 tok/s step 5815/19560 | loss 3.559419 (+0.37z)| norm 0.3148 (+1.49z)| lr 4.98e-04 | 322.50 ms | 52.3% bf16 MFU | 1624362 tok/s step 5816/19560 | loss 3.488314 (-1.17z)| norm 0.2657 (-0.52z)| lr 4.98e-04 | 323.08 ms | 52.2% bf16 MFU | 1624282 tok/s step 5817/19560 | loss 3.564167 (+0.52z)| norm 0.3141 (+1.48z)| lr 4.97e-04 | 322.49 ms | 52.3% bf16 MFU | 1624356 tok/s step 5818/19560 | loss 3.593755 
(+1.21z)| norm 0.2524 (-1.05z)| lr 4.97e-04 | 322.49 ms | 52.3% bf16 MFU | 1624426 tok/s step 5819/19560 | loss 3.492993 (-1.07z)| norm 0.2662 (-0.48z)| lr 4.97e-04 | 323.27 ms | 52.2% bf16 MFU | 1624296 tok/s step 5820/19560 | loss 3.563544 (+0.52z)| norm 0.2594 (-0.75z)| lr 4.97e-04 | 322.77 ms | 52.3% bf16 MFU | 1624298 tok/s step 5821/19560 | loss 3.493435 (-1.05z)| norm 0.2912 (+0.55z)| lr 4.97e-04 | 323.06 ms | 52.2% bf16 MFU | 1624227 tok/s step 5822/19560 | loss 3.542490 (+0.05z)| norm 0.2600 (-0.75z)| lr 4.97e-04 | 322.66 ms | 52.3% bf16 MFU | 1624260 tok/s step 5823/19560 | loss 3.455739 (-1.88z)| norm 0.2830 (+0.20z)| lr 4.97e-04 | 322.86 ms | 52.3% bf16 MFU | 1624241 tok/s step 5824/19560 | loss 3.500601 (-0.87z)| norm 0.2826 (+0.18z)| lr 4.97e-04 | 323.18 ms | 52.2% bf16 MFU | 1624143 tok/s step 5825/19560 | loss 3.477642 (-1.35z)| norm 0.2815 (+0.13z)| lr 4.97e-04 | 323.05 ms | 52.2% bf16 MFU | 1624084 tok/s step 5826/19560 | loss 3.573573 (+0.77z)| norm 0.2862 (+0.32z)| lr 4.97e-04 | 322.57 ms | 52.3% bf16 MFU | 1624146 tok/s step 5827/19560 | loss 3.527065 (-0.26z)| norm 0.3023 (+0.97z)| lr 4.97e-04 | 322.18 ms | 52.4% bf16 MFU | 1624305 tok/s step 5828/19560 | loss 3.569427 (+0.69z)| norm 0.2853 (+0.25z)| lr 4.97e-04 | 322.52 ms | 52.3% bf16 MFU | 1624370 tok/s step 5829/19560 | loss 3.525578 (-0.28z)| norm 0.2663 (-0.55z)| lr 4.97e-04 | 323.41 ms | 52.2% bf16 MFU | 1624209 tok/s step 5830/19560 | loss 3.523003 (-0.33z)| norm 0.2646 (-0.61z)| lr 4.97e-04 | 323.01 ms | 52.3% bf16 MFU | 1624155 tok/s step 5831/19560 | loss 3.547089 (+0.21z)| norm 0.2677 (-0.48z)| lr 4.97e-04 | 322.68 ms | 52.3% bf16 MFU | 1624186 tok/s step 5832/19560 | loss 3.532481 (-0.12z)| norm 0.2644 (-0.62z)| lr 4.97e-04 | 323.02 ms | 52.2% bf16 MFU | 1624132 tok/s step 5833/19560 | loss 3.515849 (-0.48z)| norm 0.2675 (-0.49z)| lr 4.97e-04 | 322.70 ms | 52.3% bf16 MFU | 1624158 tok/s step 5834/19560 | loss 3.510398 (-0.59z)| norm 0.2859 (+0.28z)| lr 4.97e-04 | 323.19 ms | 52.2% 
bf16 MFU | 1624063 tok/s step 5835/19560 | loss 3.535010 (-0.04z)| norm 0.2742 (-0.21z)| lr 4.97e-04 | 322.93 ms | 52.3% bf16 MFU | 1624037 tok/s step 5836/19560 | loss 3.543676 (+0.16z)| norm 0.2871 (+0.34z)| lr 4.97e-04 | 323.55 ms | 52.2% bf16 MFU | 1623857 tok/s step 5837/19560 | loss 3.583549 (+1.06z)| norm 0.2767 (-0.10z)| lr 4.97e-04 | 323.25 ms | 52.2% bf16 MFU | 1623759 tok/s step 5838/19560 | loss 3.519485 (-0.38z)| norm 0.2913 (+0.53z)| lr 4.97e-04 | 322.53 ms | 52.3% bf16 MFU | 1623850 tok/s step 5839/19560 | loss 3.501975 (-0.79z)| norm 0.2810 (+0.10z)| lr 4.97e-04 | 323.45 ms | 52.2% bf16 MFU | 1623703 tok/s step 5840/19560 | loss 3.496423 (-0.92z)| norm 0.2631 (-0.67z)| lr 4.97e-04 | 323.30 ms | 52.2% bf16 MFU | 1623602 tok/s step 5841/19560 | loss 3.593250 (+1.29z)| norm 0.3205 (+1.77z)| lr 4.97e-04 | 322.31 ms | 52.4% bf16 MFU | 1623756 tok/s step 5842/19560 | loss 3.494437 (-0.96z)| norm 0.3380 (+2.44z)| lr 4.97e-04 | 322.58 ms | 52.3% bf16 MFU | 1623832 tok/s step 5843/19560 | loss 3.491259 (-1.02z)| norm 0.3351 (+2.26z)| lr 4.97e-04 | 323.14 ms | 52.2% bf16 MFU | 1623765 tok/s step 5844/19560 | loss 3.536764 (+0.04z)| norm 0.2946 (+0.59z)| lr 4.96e-04 | 323.07 ms | 52.2% bf16 MFU | 1623717 tok/s step 5845/19560 | loss 3.531086 (-0.10z)| norm 0.2746 (-0.22z)| lr 4.96e-04 | 323.27 ms | 52.2% bf16 MFU | 1623622 tok/s step 5846/19560 | loss 3.523231 (-0.28z)| norm 0.3069 (+1.08z)| lr 4.96e-04 | 322.57 ms | 52.3% bf16 MFU | 1623709 tok/s step 5847/19560 | loss 3.472752 (-1.42z)| norm 0.2658 (-0.58z)| lr 4.96e-04 | 322.82 ms | 52.3% bf16 MFU | 1623729 tok/s step 5848/19560 | loss 3.479836 (-1.24z)| norm 0.2743 (-0.25z)| lr 4.96e-04 | 322.50 ms | 52.3% bf16 MFU | 1623827 tok/s step 5849/19560 | loss 3.475021 (-1.33z)| norm 0.2645 (-0.64z)| lr 4.96e-04 | 322.82 ms | 52.3% bf16 MFU | 1623841 tok/s step 5850/19560 | loss 3.534080 (+0.03z)| norm 0.2608 (-0.80z)| lr 4.96e-04 | 323.25 ms | 52.2% bf16 MFU | 1623746 tok/s step 5851/19560 | loss 3.531141 
(-0.04z)| norm 0.2549 (-1.03z)| lr 4.96e-04 | 322.88 ms | 52.3% bf16 MFU | 1623747 tok/s step 5852/19560 | loss 3.506083 (-0.61z)| norm 0.2366 (-1.75z)| lr 4.96e-04 | 322.91 ms | 52.3% bf16 MFU | 1623742 tok/s step 5853/19560 | loss 3.497403 (-0.80z)| norm 0.2427 (-1.49z)| lr 4.96e-04 | 322.77 ms | 52.3% bf16 MFU | 1623770 tok/s step 5854/19560 | loss 3.566204 (+0.82z)| norm 0.2635 (-0.65z)| lr 4.96e-04 | 322.89 ms | 52.3% bf16 MFU | 1623768 tok/s step 5855/19560 | loss 3.525188 (-0.16z)| norm 0.2755 (-0.17z)| lr 4.96e-04 | 323.29 ms | 52.2% bf16 MFU | 1623665 tok/s step 5856/19560 | loss 3.483101 (-1.13z)| norm 0.2475 (-1.27z)| lr 4.96e-04 | 322.38 ms | 52.4% bf16 MFU | 1623797 tok/s step 5857/19560 | loss 3.568392 (+0.86z)| norm 0.2714 (-0.31z)| lr 4.96e-04 | 322.75 ms | 52.3% bf16 MFU | 1623830 tok/s step 5858/19560 | loss 3.535365 (+0.08z)| norm 0.2740 (-0.21z)| lr 4.96e-04 | 323.05 ms | 52.2% bf16 MFU | 1623784 tok/s step 5859/19560 | loss 3.494559 (-0.87z)| norm 0.2894 (+0.40z)| lr 4.96e-04 | 322.78 ms | 52.3% bf16 MFU | 1623809 tok/s step 5860/19560 | loss 3.540977 (+0.21z)| norm 0.2823 (+0.12z)| lr 4.96e-04 | 323.00 ms | 52.3% bf16 MFU | 1623776 tok/s step 5861/19560 | loss 3.581802 (+1.17z)| norm 0.2570 (-0.89z)| lr 4.96e-04 | 322.39 ms | 52.4% bf16 MFU | 1623901 tok/s step 5862/19560 | loss 3.559906 (+0.65z)| norm 0.2702 (-0.37z)| lr 4.96e-04 | 322.68 ms | 52.3% bf16 MFU | 1623947 tok/s step 5863/19560 | loss 3.524808 (-0.18z)| norm 0.2700 (-0.38z)| lr 4.96e-04 | 323.00 ms | 52.3% bf16 MFU | 1623908 tok/s step 5864/19560 | loss 3.480418 (-1.22z)| norm 0.2894 (+0.39z)| lr 4.96e-04 | 322.47 ms | 52.3% bf16 MFU | 1624004 tok/s step 5865/19560 | loss 3.522665 (-0.22z)| norm 0.2728 (-0.28z)| lr 4.96e-04 | 322.67 ms | 52.3% bf16 MFU | 1624047 tok/s step 5866/19560 | loss 3.528188 (-0.10z)| norm 0.2695 (-0.40z)| lr 4.96e-04 | 322.70 ms | 52.3% bf16 MFU | 1624079 tok/s step 5867/19560 | loss 3.490055 (-0.98z)| norm 0.2967 (+0.90z)| lr 4.96e-04 | 322.75 ms | 52.3% 
bf16 MFU | 1624096 tok/s step 5868/19560 | loss 3.560101 (+0.66z)| norm 0.2595 (-0.88z)| lr 4.96e-04 | 322.70 ms | 52.3% bf16 MFU | 1624127 tok/s step 5869/19560 | loss 3.525265 (-0.14z)| norm 0.2771 (-0.03z)| lr 4.96e-04 | 322.35 ms | 52.4% bf16 MFU | 1624243 tok/s step 5870/19560 | loss 3.616351 (+1.97z)| norm 0.2956 (+0.91z)| lr 4.95e-04 | 322.21 ms | 52.4% bf16 MFU | 1624388 tok/s step 5871/19560 | loss 3.532564 (+0.01z)| norm 0.2892 (+0.58z)| lr 4.95e-04 | 322.60 ms | 52.3% bf16 MFU | 1624428 tok/s step 5872/19560 | loss 3.497577 (-0.80z)| norm 0.2901 (+0.66z)| lr 4.95e-04 | 322.43 ms | 52.3% bf16 MFU | 1624508 tok/s step 5873/19560 | loss 3.494828 (-0.85z)| norm 0.2967 (+0.99z)| lr 4.95e-04 | 322.91 ms | 52.3% bf16 MFU | 1624464 tok/s step 5874/19560 | loss 3.533055 (+0.05z)| norm 0.3096 (+1.62z)| lr 4.95e-04 | 322.77 ms | 52.3% bf16 MFU | 1624456 tok/s step 5875/19560 | loss 3.550386 (+0.45z)| norm 0.2777 (+0.00z)| lr 4.95e-04 | 322.92 ms | 52.3% bf16 MFU | 1624413 tok/s step 5876/19560 | loss 3.506880 (-0.57z)| norm 0.2582 (-0.97z)| lr 4.95e-04 | 322.77 ms | 52.3% bf16 MFU | 1624410 tok/s step 5877/19560 | loss 3.593691 (+1.45z)| norm 0.2668 (-0.54z)| lr 4.95e-04 | 322.53 ms | 52.3% bf16 MFU | 1624466 tok/s step 5878/19560 | loss 3.497147 (-0.80z)| norm 0.2829 (+0.27z)| lr 4.95e-04 | 322.45 ms | 52.3% bf16 MFU | 1624540 tok/s step 5879/19560 | loss 3.577707 (+1.07z)| norm 0.2563 (-1.07z)| lr 4.95e-04 | 322.33 ms | 52.4% bf16 MFU | 1624641 tok/s step 5880/19560 | loss 3.505780 (-0.60z)| norm 0.2772 (-0.01z)| lr 4.95e-04 | 322.56 ms | 52.3% bf16 MFU | 1624678 tok/s step 5881/19560 | loss 3.543015 (+0.27z)| norm 0.2669 (-0.53z)| lr 4.95e-04 | 322.61 ms | 52.3% bf16 MFU | 1624702 tok/s step 5882/19560 | loss 3.488867 (-1.00z)| norm 0.2657 (-0.59z)| lr 4.95e-04 | 322.24 ms | 52.4% bf16 MFU | 1624818 tok/s step 5883/19560 | loss 3.553995 (+0.53z)| norm 0.2851 (+0.39z)| lr 4.95e-04 | 322.34 ms | 52.4% bf16 MFU | 1624904 tok/s step 5884/19560 | loss 3.478456 
(-1.23z)| norm 0.2787 (+0.06z)| lr 4.95e-04 | 322.40 ms | 52.3% bf16 MFU | 1624968 tok/s step 5885/19560 | loss 3.480732 (-1.18z)| norm 0.2560 (-1.09z)| lr 4.95e-04 | 322.73 ms | 52.3% bf16 MFU | 1624947 tok/s step 5886/19560 | loss 3.531143 (+0.02z)| norm 0.2831 (+0.28z)| lr 4.95e-04 | 322.75 ms | 52.3% bf16 MFU | 1624921 tok/s step 5887/19560 | loss 3.535535 (+0.11z)| norm 0.2652 (-0.63z)| lr 4.95e-04 | 322.63 ms | 52.3% bf16 MFU | 1624928 tok/s step 5888/19560 | loss 3.495725 (-0.84z)| norm 0.2477 (-1.53z)| lr 4.95e-04 | 322.25 ms | 52.4% bf16 MFU | 1625030 tok/s step 5889/19560 | loss 3.535809 (+0.12z)| norm 0.2721 (-0.25z)| lr 4.95e-04 | 322.36 ms | 52.4% bf16 MFU | 1625099 tok/s step 5890/19560 | loss 3.496213 (-0.83z)| norm 0.2639 (-0.67z)| lr 4.95e-04 | 323.02 ms | 52.2% bf16 MFU | 1624998 tok/s step 5891/19560 | loss 3.472025 (-1.39z)| norm 0.2871 (+0.53z)| lr 4.95e-04 | 322.40 ms | 52.3% bf16 MFU | 1625058 tok/s step 5892/19560 | loss 3.535020 (+0.12z)| norm 0.2875 (+0.55z)| lr 4.95e-04 | 322.65 ms | 52.3% bf16 MFU | 1625051 tok/s step 5893/19560 | loss 3.541480 (+0.27z)| norm 0.2479 (-1.50z)| lr 4.95e-04 | 322.53 ms | 52.3% bf16 MFU | 1625075 tok/s step 5894/19560 | loss 3.489318 (-0.96z)| norm 0.3013 (+1.28z)| lr 4.95e-04 | 323.10 ms | 52.2% bf16 MFU | 1624955 tok/s step 5895/19560 | loss 3.538837 (+0.21z)| norm 0.3064 (+1.52z)| lr 4.95e-04 | 322.63 ms | 52.3% bf16 MFU | 1624959 tok/s step 5896/19560 | loss 3.451047 (-1.92z)| norm 0.2830 (+0.31z)| lr 4.95e-04 | 322.45 ms | 52.3% bf16 MFU | 1625009 tok/s step 5897/19560 | loss 3.491419 (-0.93z)| norm 0.2852 (+0.43z)| lr 4.94e-04 | 322.18 ms | 52.4% bf16 MFU | 1625124 tok/s step 5898/19560 | loss 3.504430 (-0.60z)| norm 0.2949 (+0.92z)| lr 4.94e-04 | 322.81 ms | 52.3% bf16 MFU | 1625074 tok/s step 5899/19560 | loss 3.510805 (-0.44z)| norm 0.2637 (-0.70z)| lr 4.94e-04 | 322.48 ms | 52.3% bf16 MFU | 1625110 tok/s step 5900/19560 | loss 3.519631 (-0.23z)| norm 0.2939 (+0.85z)| lr 4.94e-04 | 322.79 ms | 52.3% 
bf16 MFU | 1625065 tok/s step 5901/19560 | loss 3.483026 (-1.11z)| norm 0.2848 (+0.38z)| lr 4.94e-04 | 322.65 ms | 52.3% bf16 MFU | 1625060 tok/s step 5902/19560 | loss 3.577681 (+1.18z)| norm 0.2474 (-1.56z)| lr 4.94e-04 | 322.11 ms | 52.4% bf16 MFU | 1625190 tok/s step 5903/19560 | loss 3.460845 (-1.67z)| norm 0.3022 (+1.26z)| lr 4.94e-04 | 323.21 ms | 52.2% bf16 MFU | 1625036 tok/s step 5904/19560 | loss 3.565181 (+0.87z)| norm 0.3353 (+2.85z)| lr 4.94e-04 | 322.61 ms | 52.3% bf16 MFU | 1625041 tok/s step 5905/19560 | loss 3.517899 (-0.27z)| norm 0.2785 (-0.00z)| lr 4.94e-04 | 322.92 ms | 52.3% bf16 MFU | 1624968 tok/s step 5906/19560 | loss 3.554256 (+0.65z)| norm 0.2774 (-0.05z)| lr 4.94e-04 | 322.46 ms | 52.3% bf16 MFU | 1625015 tok/s step 5907/19560 | loss 3.561752 (+0.83z)| norm 0.2799 (+0.06z)| lr 4.94e-04 | 322.63 ms | 52.3% bf16 MFU | 1625016 tok/s step 5908/19560 | loss 3.537766 (+0.24z)| norm 0.2623 (-0.83z)| lr 4.94e-04 | 322.21 ms | 52.4% bf16 MFU | 1625122 tok/s step 5909/19560 | loss 3.542649 (+0.36z)| norm 0.2522 (-1.33z)| lr 4.94e-04 | 322.89 ms | 52.3% bf16 MFU | 1625053 tok/s step 5910/19560 | loss 3.506055 (-0.58z)| norm 0.2608 (-0.89z)| lr 4.94e-04 | 322.88 ms | 52.3% bf16 MFU | 1624991 tok/s step 5911/19560 | loss 3.534041 (+0.13z)| norm 0.2490 (-1.47z)| lr 4.94e-04 | 322.28 ms | 52.4% bf16 MFU | 1625081 tok/s step 5912/19560 | loss 3.515672 (-0.34z)| norm 0.2457 (-1.61z)| lr 4.94e-04 | 322.93 ms | 52.3% bf16 MFU | 1625005 tok/s step 5913/19560 | loss 3.483792 (-1.14z)| norm 0.2475 (-1.50z)| lr 4.94e-04 | 322.23 ms | 52.4% bf16 MFU | 1625108 tok/s step 5914/19560 | loss 3.512931 (-0.40z)| norm 0.2617 (-0.81z)| lr 4.94e-04 | 322.70 ms | 52.3% bf16 MFU | 1625087 tok/s step 5915/19560 | loss 3.471915 (-1.42z)| norm 0.2728 (-0.26z)| lr 4.94e-04 | 323.17 ms | 52.2% bf16 MFU | 1624949 tok/s step 5916/19560 | loss 3.464450 (-1.58z)| norm 0.2625 (-0.76z)| lr 4.94e-04 | 323.15 ms | 52.2% bf16 MFU | 1624824 tok/s step 5917/19560 | loss 3.468393 
(-1.46z)| norm 0.2579 (-0.97z)| lr 4.94e-04 | 322.95 ms | 52.3% bf16 MFU | 1624755 tok/s step 5918/19560 | loss 3.541689 (+0.35z)| norm 0.2798 (+0.12z)| lr 4.94e-04 | 322.25 ms | 52.4% bf16 MFU | 1624866 tok/s step 5919/19560 | loss 3.532879 (+0.15z)| norm 0.2672 (-0.51z)| lr 4.94e-04 | 323.04 ms | 52.2% bf16 MFU | 1624772 tok/s step 5920/19560 | loss 3.474389 (-1.30z)| norm 0.2862 (+0.42z)| lr 4.94e-04 | 322.12 ms | 52.4% bf16 MFU | 1624914 tok/s step 5921/19560 | loss 3.572506 (+1.13z)| norm 0.2831 (+0.27z)| lr 4.94e-04 | 322.67 ms | 52.3% bf16 MFU | 1624910 tok/s step 5922/19560 | loss 3.482916 (-1.08z)| norm 0.2848 (+0.35z)| lr 4.94e-04 | 322.57 ms | 52.3% bf16 MFU | 1624933 tok/s step 5923/19560 | loss 3.540843 (+0.35z)| norm 0.2769 (-0.04z)| lr 4.93e-04 | 322.84 ms | 52.3% bf16 MFU | 1624887 tok/s step 5924/19560 | loss 3.568248 (+1.02z)| norm 0.2659 (-0.59z)| lr 4.93e-04 | 322.41 ms | 52.3% bf16 MFU | 1624950 tok/s step 5925/19560 | loss 3.509773 (-0.41z)| norm 0.2779 (+0.01z)| lr 4.93e-04 | 323.32 ms | 52.2% bf16 MFU | 1624780 tok/s step 5926/19560 | loss 3.544495 (+0.49z)| norm 0.2885 (+0.56z)| lr 4.93e-04 | 322.22 ms | 52.4% bf16 MFU | 1624897 tok/s step 5927/19560 | loss 3.464214 (-1.54z)| norm 0.2671 (-0.52z)| lr 4.93e-04 | 322.54 ms | 52.3% bf16 MFU | 1624928 tok/s step 5928/19560 | loss 3.558372 (+0.84z)| norm 0.2837 (+0.35z)| lr 4.93e-04 | 323.32 ms | 52.2% bf16 MFU | 1624761 tok/s step 5929/19560 | loss 3.574977 (+1.24z)| norm 0.2934 (+0.84z)| lr 4.93e-04 | 322.78 ms | 52.3% bf16 MFU | 1624737 tok/s step 5930/19560 | loss 3.519397 (-0.16z)| norm 0.2948 (+0.91z)| lr 4.93e-04 | 322.33 ms | 52.4% bf16 MFU | 1624828 tok/s step 5931/19560 | loss 3.537176 (+0.28z)| norm 0.2595 (-0.91z)| lr 4.93e-04 | 322.69 ms | 52.3% bf16 MFU | 1624823 tok/s step 5932/19560 | loss 3.540595 (+0.37z)| norm 0.2714 (-0.28z)| lr 4.93e-04 | 322.86 ms | 52.3% bf16 MFU | 1624776 tok/s step 5933/19560 | loss 3.524101 (-0.04z)| norm 0.2820 (+0.27z)| lr 4.93e-04 | 322.27 ms | 52.4% 
bf16 MFU | 1624880 tok/s step 5934/19560 | loss 3.525645 (+0.00z)| norm 0.2707 (-0.33z)| lr 4.93e-04 | 322.53 ms | 52.3% bf16 MFU | 1624914 tok/s step 5935/19560 | loss 3.482669 (-1.07z)| norm 0.2849 (+0.40z)| lr 4.93e-04 | 322.98 ms | 52.3% bf16 MFU | 1624833 tok/s step 5936/19560 | loss 3.533634 (+0.21z)| norm 0.2610 (-0.85z)| lr 4.93e-04 | 322.54 ms | 52.3% bf16 MFU | 1624866 tok/s step 5937/19560 | loss 3.559695 (+0.86z)| norm 0.2657 (-0.60z)| lr 4.93e-04 | 323.16 ms | 52.2% bf16 MFU | 1624741 tok/s step 5938/19560 | loss 3.560961 (+0.88z)| norm 0.2654 (-0.63z)| lr 4.93e-04 | 322.50 ms | 52.3% bf16 MFU | 1624790 tok/s step 5939/19560 | loss 3.414486 (-3.04z)| norm 0.2684 (-0.45z)| lr 4.93e-04 | 322.94 ms | 52.3% bf16 MFU | 1624726 tok/s step 5940/19560 | loss 3.498421 (-0.68z)| norm 0.2677 (-0.48z)| lr 4.93e-04 | 322.88 ms | 52.3% bf16 MFU | 1624677 tok/s step 5941/19560 | loss 3.537693 (+0.42z)| norm 0.2763 (-0.01z)| lr 4.93e-04 | 323.46 ms | 52.2% bf16 MFU | 1624487 tok/s step 5942/19560 | loss 3.550923 (+0.80z)| norm 0.2788 (+0.13z)| lr 4.93e-04 | 322.54 ms | 52.3% bf16 MFU | 1624538 tok/s step 5943/19560 | loss 3.490911 (-0.88z)| norm 0.2864 (+0.56z)| lr 4.93e-04 | 322.28 ms | 52.4% bf16 MFU | 1624651 tok/s step 5944/19560 | loss 3.504229 (-0.51z)| norm 0.2676 (-0.48z)| lr 4.93e-04 | 322.37 ms | 52.4% bf16 MFU | 1624736 tok/s step 5945/19560 | loss 3.490545 (-0.88z)| norm 0.2586 (-0.96z)| lr 4.93e-04 | 322.50 ms | 52.3% bf16 MFU | 1624784 tok/s step 5946/19560 | loss 3.512280 (-0.25z)| norm 0.2784 (+0.13z)| lr 4.93e-04 | 322.58 ms | 52.3% bf16 MFU | 1624809 tok/s step 5947/19560 | loss 3.508130 (-0.38z)| norm 0.2790 (+0.16z)| lr 4.93e-04 | 322.28 ms | 52.4% bf16 MFU | 1624909 tok/s step 5948/19560 | loss 3.535526 (+0.42z)| norm 0.2764 (+0.01z)| lr 4.93e-04 | 322.75 ms | 52.3% bf16 MFU | 1624885 tok/s step 5949/19560 | loss 3.541368 (+0.58z)| norm 0.2755 (-0.03z)| lr 4.92e-04 | 322.38 ms | 52.4% bf16 MFU | 1624956 tok/s step 5950/19560 | loss 3.597895 
(+2.17z)| norm 0.2622 (-0.79z)| lr 4.92e-04 | 322.48 ms | 52.3% bf16 MFU | 1624997 tok/s step 5951/19560 | loss 3.643533 (+3.33z)| norm 0.2652 (-0.61z)| lr 4.92e-04 | 322.74 ms | 52.3% bf16 MFU | 1624973 tok/s step 5952/19560 | loss 3.514839 (-0.23z)| norm 0.2851 (+0.52z)| lr 4.92e-04 | 322.55 ms | 52.3% bf16 MFU | 1624996 tok/s step 5953/19560 | loss 3.517778 (-0.16z)| norm 0.2886 (+0.71z)| lr 4.92e-04 | 322.64 ms | 52.3% bf16 MFU | 1624996 tok/s step 5954/19560 | loss 3.528296 (+0.14z)| norm 0.2719 (-0.23z)| lr 4.92e-04 | 322.43 ms | 52.3% bf16 MFU | 1625049 tok/s step 5955/19560 | loss 3.490710 (-0.90z)| norm 0.2917 (+0.90z)| lr 4.92e-04 | 322.71 ms | 52.3% bf16 MFU | 1625028 tok/s step 5956/19560 | loss 3.502033 (-0.57z)| norm 0.2782 (+0.14z)| lr 4.92e-04 | 322.88 ms | 52.3% bf16 MFU | 1624967 tok/s step 5957/19560 | loss 3.576561 (+1.50z)| norm 0.2643 (-0.65z)| lr 4.92e-04 | 323.02 ms | 52.2% bf16 MFU | 1624872 tok/s step 5958/19560 | loss 3.521957 (-0.02z)| norm 0.2855 (+0.54z)| lr 4.92e-04 | 322.88 ms | 52.3% bf16 MFU | 1624818 tok/s step 5959/19560 | loss 3.517151 (-0.15z)| norm 0.3157 (+2.20z)| lr 4.92e-04 | 322.22 ms | 52.4% bf16 MFU | 1624931 tok/s step 5960/19560 | loss 3.501932 (-0.57z)| norm 0.3307 (+2.91z)| lr 4.92e-04 | 322.74 ms | 52.3% bf16 MFU | 1624908 tok/s step 5961/19560 | loss 3.413793 (-2.91z)| norm 0.2906 (+0.73z)| lr 4.92e-04 | 322.12 ms | 52.4% bf16 MFU | 1625044 tok/s step 5962/19560 | loss 3.539547 (+0.48z)| norm 0.2872 (+0.55z)| lr 4.92e-04 | 322.56 ms | 52.3% bf16 MFU | 1625061 tok/s step 5963/19560 | loss 3.472621 (-1.30z)| norm 0.3118 (+1.84z)| lr 4.92e-04 | 323.10 ms | 52.2% bf16 MFU | 1624943 tok/s step 5964/19560 | loss 3.500244 (-0.56z)| norm 0.2808 (+0.19z)| lr 4.92e-04 | 322.45 ms | 52.3% bf16 MFU | 1624993 tok/s step 5965/19560 | loss 3.523507 (+0.08z)| norm 0.2855 (+0.44z)| lr 4.92e-04 | 322.53 ms | 52.3% bf16 MFU | 1625020 tok/s step 5966/19560 | loss 3.563248 (+1.14z)| norm 0.2995 (+1.17z)| lr 4.92e-04 | 322.32 ms | 52.4% 
bf16 MFU | 1625100 tok/s step 5967/19560 | loss 3.489478 (-0.84z)| norm 0.2845 (+0.37z)| lr 4.92e-04 | 322.76 ms | 52.3% bf16 MFU | 1625064 tok/s step 5968/19560 | loss 3.547944 (+0.72z)| norm 0.2802 (+0.14z)| lr 4.92e-04 | 322.58 ms | 52.3% bf16 MFU | 1625075 tok/s step 5969/19560 | loss 3.495433 (-0.68z)| norm 0.2893 (+0.65z)| lr 4.92e-04 | 322.46 ms | 52.3% bf16 MFU | 1625116 tok/s step 5970/19560 | loss 3.595084 (+1.99z)| norm 0.2759 (-0.06z)| lr 4.92e-04 | 322.89 ms | 52.3% bf16 MFU | 1625047 tok/s step 5971/19560 | loss 3.573549 (+1.38z)| norm 0.2844 (+0.47z)| lr 4.92e-04 | 322.73 ms | 52.3% bf16 MFU | 1625022 tok/s step 5972/19560 | loss 3.508233 (-0.36z)| norm 0.2622 (-0.83z)| lr 4.92e-04 | 322.36 ms | 52.4% bf16 MFU | 1625093 tok/s step 5973/19560 | loss 3.438251 (-2.17z)| norm 0.2833 (+0.42z)| lr 4.92e-04 | 323.33 ms | 52.2% bf16 MFU | 1624915 tok/s step 5974/19560 | loss 3.587581 (+1.72z)| norm 0.2690 (-0.42z)| lr 4.92e-04 | 322.38 ms | 52.4% bf16 MFU | 1624984 tok/s step 5975/19560 | loss 3.504165 (-0.46z)| norm 0.3003 (+1.44z)| lr 4.91e-04 | 323.29 ms | 52.2% bf16 MFU | 1624821 tok/s step 5976/19560 | loss 3.482122 (-1.03z)| norm 0.2659 (-0.61z)| lr 4.91e-04 | 322.28 ms | 52.4% bf16 MFU | 1624919 tok/s step 5977/19560 | loss 3.475552 (-1.20z)| norm 0.2800 (+0.22z)| lr 4.91e-04 | 322.64 ms | 52.3% bf16 MFU | 1624924 tok/s step 5978/19560 | loss 3.561479 (+1.03z)| norm 0.2986 (+1.31z)| lr 4.91e-04 | 322.83 ms | 52.3% bf16 MFU | 1624880 tok/s step 5979/19560 | loss 3.479750 (-1.08z)| norm 0.2892 (+0.74z)| lr 4.91e-04 | 323.01 ms | 52.2% bf16 MFU | 1624793 tok/s step 5980/19560 | loss 3.464725 (-1.45z)| norm 0.3088 (+1.89z)| lr 4.91e-04 | 322.63 ms | 52.3% bf16 MFU | 1624806 tok/s step 5981/19560 | loss 3.538651 (+0.44z)| norm 0.2568 (-1.27z)| lr 4.91e-04 | 322.60 ms | 52.3% bf16 MFU | 1624824 tok/s step 5982/19560 | loss 3.578002 (+1.45z)| norm 0.3403 (+3.60z)| lr 4.91e-04 | 322.74 ms | 52.3% bf16 MFU | 1624807 tok/s step 5983/19560 | loss 3.608797 
(+2.18z)| norm 0.2866 (+0.49z)| lr 4.91e-04 | 322.72 ms | 52.3% bf16 MFU | 1624795 tok/s step 5984/19560 | loss 3.607148 (+2.08z)| norm 0.3354 (+3.18z)| lr 4.91e-04 | 322.83 ms | 52.3% bf16 MFU | 1624757 tok/s step 5985/19560 | loss 3.510692 (-0.30z)| norm 0.3140 (+1.93z)| lr 4.91e-04 | 322.55 ms | 52.3% bf16 MFU | 1624792 tok/s step 5986/19560 | loss 3.485442 (-0.92z)| norm 0.2681 (-0.62z)| lr 4.91e-04 | 322.36 ms | 52.4% bf16 MFU | 1624871 tok/s step 5987/19560 | loss 3.524290 (+0.04z)| norm 0.2805 (+0.07z)| lr 4.91e-04 | 323.29 ms | 52.2% bf16 MFU | 1624715 tok/s step 5988/19560 | loss 3.487153 (-0.87z)| norm 0.2768 (-0.13z)| lr 4.91e-04 | 322.68 ms | 52.3% bf16 MFU | 1624719 tok/s step 5989/19560 | loss 3.544226 (+0.56z)| norm 0.2711 (-0.45z)| lr 4.91e-04 | 322.67 ms | 52.3% bf16 MFU | 1624726 tok/s step 5990/19560 | loss 3.518535 (-0.08z)| norm 0.2656 (-0.75z)| lr 4.91e-04 | 322.85 ms | 52.3% bf16 MFU | 1624686 tok/s step 5991/19560 | loss 3.512186 (-0.23z)| norm 0.2618 (-0.96z)| lr 4.91e-04 | 322.49 ms | 52.3% bf16 MFU | 1624740 tok/s step 5992/19560 | loss 3.433489 (-2.17z)| norm 0.2577 (-1.17z)| lr 4.91e-04 | 323.32 ms | 52.2% bf16 MFU | 1624583 tok/s step 5993/19560 | loss 3.533838 (+0.31z)| norm 0.2752 (-0.20z)| lr 4.91e-04 | 322.56 ms | 52.3% bf16 MFU | 1624624 tok/s step 5994/19560 | loss 3.584208 (+1.53z)| norm 0.2459 (-1.80z)| lr 4.91e-04 | 323.02 ms | 52.2% bf16 MFU | 1624546 tok/s step 5995/19560 | loss 3.532179 (+0.25z)| norm 0.2810 (+0.14z)| lr 4.91e-04 | 322.57 ms | 52.3% bf16 MFU | 1624585 tok/s step 5996/19560 | loss 3.460047 (-1.49z)| norm 0.2461 (-1.76z)| lr 4.91e-04 | 322.69 ms | 52.3% bf16 MFU | 1624593 tok/s step 5997/19560 | loss 3.492780 (-0.69z)| norm 0.2661 (-0.67z)| lr 4.91e-04 | 323.57 ms | 52.2% bf16 MFU | 1624378 tok/s step 5998/19560 | loss 3.476058 (-1.08z)| norm 0.2689 (-0.50z)| lr 4.91e-04 | 322.23 ms | 52.4% bf16 MFU | 1624513 tok/s step 5999/19560 | loss 3.488585 (-0.76z)| norm 0.2635 (-0.79z)| lr 4.91e-04 | 323.01 ms | 52.2% 
bf16 MFU | 1624444 tok/s step 6000/19560 | loss 3.540606 (+0.52z)| norm 0.2717 (-0.34z)| lr 4.91e-04 | 323.28 ms | 52.2% bf16 MFU | 1624309 tok/s val loss 3.510486 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating HellaSwag: 70/79 HellaSwag: 2792/10042 = 0.278032 step 6001/19560 | loss 3.484311 (-0.87z)| norm 0.2468 (-1.66z)| lr 4.90e-04 | 322.89 ms | 52.3% bf16 MFU | 1624281 tok/s step 6002/19560 | loss 3.530425 (+0.27z)| norm 0.2703 (-0.37z)| lr 4.90e-04 | 322.34 ms | 52.4% bf16 MFU | 1624391 tok/s step 6003/19560 | loss 3.574306 (+1.34z)| norm 0.2598 (-0.94z)| lr 4.90e-04 | 325.07 ms | 51.9% bf16 MFU | 1623814 tok/s step 6004/19560 | loss 3.541531 (+0.53z)| norm 0.2902 (+0.71z)| lr 4.90e-04 | 322.46 ms | 52.3% bf16 MFU | 1623918 tok/s step 6005/19560 | loss 3.564774 (+1.11z)| norm 0.2952 (+0.97z)| lr 4.90e-04 | 323.21 ms | 52.2% bf16 MFU | 1623828 tok/s step 6006/19560 | loss 3.486656 (-0.82z)| norm 0.3002 (+1.23z)| lr 4.90e-04 | 322.28 ms | 52.4% bf16 MFU | 1623978 tok/s step 6007/19560 | loss 3.531697 (+0.31z)| norm 0.3198 (+2.23z)| lr 4.90e-04 | 322.63 ms | 52.3% bf16 MFU | 1624032 tok/s step 6008/19560 | loss 3.496106 (-0.58z)| norm 0.2557 (-1.18z)| lr 4.90e-04 | 323.15 ms | 52.2% bf16 MFU | 1623953 tok/s step 6009/19560 | loss 3.544112 (+0.62z)| norm 0.2929 (+0.78z)| lr 4.90e-04 | 322.67 ms | 52.3% bf16 MFU | 1623999 tok/s step 6010/19560 | loss 3.524221 (+0.11z)| norm 0.2668 (-0.60z)| lr 4.90e-04 | 323.00 ms | 52.3% bf16 MFU | 1623958 tok/s step 6011/19560 | loss 3.627606 (+2.62z)| norm 0.2939 (+0.83z)| lr 4.90e-04 | 322.88 ms | 52.3% bf16 MFU | 1623949 tok/s step 6012/19560 | loss 3.545660 (+0.61z)| norm 0.3469 (+3.44z)| lr 4.90e-04 | 322.73 ms | 52.3% bf16 MFU | 1623980 tok/s step 6013/19560 | loss 3.521091 (+0.00z)| norm 0.3324 (+2.62z)| lr 4.90e-04 | 322.68 ms | 52.3% bf16 MFU | 1624020 tok/s step 
6014/19560 | loss 3.488998 (-0.78z)| norm 0.3171 (+1.83z)| lr 4.90e-04 | 323.05 ms | 52.2% bf16 MFU | 1623964 tok/s step 6015/19560 | loss 3.464579 (-1.35z)| norm 0.3092 (+1.42z)| lr 4.90e-04 | 323.05 ms | 52.2% bf16 MFU | 1623912 tok/s step 6016/19560 | loss 3.516647 (-0.09z)| norm 0.2982 (+0.87z)| lr 4.90e-04 | 322.92 ms | 52.3% bf16 MFU | 1623895 tok/s step 6017/19560 | loss 3.495374 (-0.60z)| norm 0.3006 (+0.97z)| lr 4.90e-04 | 322.87 ms | 52.3% bf16 MFU | 1623892 tok/s step 6018/19560 | loss 3.500168 (-0.48z)| norm 0.2823 (+0.08z)| lr 4.90e-04 | 323.33 ms | 52.2% bf16 MFU | 1623773 tok/s step 6019/19560 | loss 3.523855 (+0.08z)| norm 0.2909 (+0.50z)| lr 4.90e-04 | 322.39 ms | 52.3% bf16 MFU | 1623897 tok/s step 6020/19560 | loss 3.543225 (+0.55z)| norm 0.2840 (+0.16z)| lr 4.90e-04 | 323.39 ms | 52.2% bf16 MFU | 1623764 tok/s step 6021/19560 | loss 3.552980 (+0.79z)| norm 0.2711 (-0.48z)| lr 4.90e-04 | 323.22 ms | 52.2% bf16 MFU | 1623680 tok/s step 6022/19560 | loss 3.521562 (+0.02z)| norm 0.2760 (-0.23z)| lr 4.90e-04 | 323.21 ms | 52.2% bf16 MFU | 1623603 tok/s step 6023/19560 | loss 3.660158 (+3.24z)| norm 0.3285 (+2.32z)| lr 4.90e-04 | 322.70 ms | 52.3% bf16 MFU | 1623657 tok/s step 6024/19560 | loss 3.506659 (-0.37z)| norm 0.2815 (+0.03z)| lr 4.90e-04 | 323.51 ms | 52.2% bf16 MFU | 1623506 tok/s step 6025/19560 | loss 3.454045 (-1.60z)| norm 0.2638 (-0.82z)| lr 4.90e-04 | 322.80 ms | 52.3% bf16 MFU | 1623541 tok/s step 6026/19560 | loss 3.443453 (-1.81z)| norm 0.2837 (+0.15z)| lr 4.90e-04 | 322.79 ms | 52.3% bf16 MFU | 1623575 tok/s step 6027/19560 | loss 3.539989 (+0.42z)| norm 0.2808 (+0.00z)| lr 4.89e-04 | 323.33 ms | 52.2% bf16 MFU | 1623472 tok/s step 6028/19560 | loss 3.595517 (+1.68z)| norm 0.2697 (-0.53z)| lr 4.89e-04 | 323.13 ms | 52.2% bf16 MFU | 1623425 tok/s step 6029/19560 | loss 3.483003 (-0.90z)| norm 0.3070 (+1.27z)| lr 4.89e-04 | 322.98 ms | 52.3% bf16 MFU | 1623416 tok/s step 6030/19560 | loss 3.498702 (-0.53z)| norm 0.2830 (+0.10z)| lr 
4.89e-04 | 323.18 ms | 52.2% bf16 MFU | 1623360 tok/s step 6031/19560 | loss 3.495098 (-0.62z)| norm 0.2832 (+0.12z)| lr 4.89e-04 | 322.86 ms | 52.3% bf16 MFU | 1623388 tok/s step 6032/19560 | loss 3.554514 (+0.76z)| norm 0.2751 (-0.27z)| lr 4.89e-04 | 323.13 ms | 52.2% bf16 MFU | 1623344 tok/s step 6033/19560 | loss 3.508757 (-0.30z)| norm 0.3128 (+1.61z)| lr 4.89e-04 | 322.70 ms | 52.3% bf16 MFU | 1623411 tok/s step 6034/19560 | loss 3.544198 (+0.52z)| norm 0.2770 (-0.18z)| lr 4.89e-04 | 323.24 ms | 52.2% bf16 MFU | 1623340 tok/s step 6035/19560 | loss 3.538201 (+0.39z)| norm 0.2465 (-1.68z)| lr 4.89e-04 | 323.40 ms | 52.2% bf16 MFU | 1623230 tok/s step 6036/19560 | loss 3.555290 (+0.78z)| norm 0.2591 (-1.05z)| lr 4.89e-04 | 322.80 ms | 52.3% bf16 MFU | 1623279 tok/s step 6037/19560 | loss 3.505181 (-0.38z)| norm 0.2715 (-0.45z)| lr 4.89e-04 | 322.85 ms | 52.3% bf16 MFU | 1623313 tok/s step 6038/19560 | loss 3.546719 (+0.58z)| norm 0.2422 (-1.88z)| lr 4.89e-04 | 322.91 ms | 52.3% bf16 MFU | 1623330 tok/s step 6039/19560 | loss 3.535565 (+0.32z)| norm 0.2445 (-1.76z)| lr 4.89e-04 | 322.95 ms | 52.3% bf16 MFU | 1623336 tok/s step 6040/19560 | loss 3.509073 (-0.29z)| norm 0.2371 (-2.11z)| lr 4.89e-04 | 323.13 ms | 52.2% bf16 MFU | 1623296 tok/s step 6041/19560 | loss 3.541135 (+0.44z)| norm 0.2383 (-2.03z)| lr 4.89e-04 | 322.95 ms | 52.3% bf16 MFU | 1623303 tok/s step 6042/19560 | loss 3.471799 (-1.16z)| norm 0.2679 (-0.60z)| lr 4.89e-04 | 323.16 ms | 52.2% bf16 MFU | 1623257 tok/s step 6043/19560 | loss 3.551459 (+0.68z)| norm 0.2515 (-1.38z)| lr 4.89e-04 | 322.76 ms | 52.3% bf16 MFU | 1623314 tok/s step 6044/19560 | loss 3.567906 (+1.04z)| norm 0.2896 (+0.45z)| lr 4.89e-04 | 322.85 ms | 52.3% bf16 MFU | 1623344 tok/s step 6045/19560 | loss 3.526599 (+0.07z)| norm 0.2819 (+0.07z)| lr 4.89e-04 | 322.71 ms | 52.3% bf16 MFU | 1623408 tok/s step 6046/19560 | loss 3.530402 (+0.16z)| norm 0.2563 (-1.16z)| lr 4.89e-04 | 323.01 ms | 52.2% bf16 MFU | 1623395 tok/s step 
6047/19560 | loss 3.506407 (-0.40z)| norm 0.2949 (+0.70z)| lr 4.89e-04 | 322.98 ms | 52.3% bf16 MFU | 1623389 tok/s step 6048/19560 | loss 3.522407 (-0.03z)| norm 0.2992 (+0.90z)| lr 4.89e-04 | 322.81 ms | 52.3% bf16 MFU | 1623427 tok/s step 6049/19560 | loss 3.479282 (-1.04z)| norm 0.2476 (-1.56z)| lr 4.89e-04 | 322.44 ms | 52.3% bf16 MFU | 1623556 tok/s step 6050/19560 | loss 3.507765 (-0.37z)| norm 0.2952 (+0.71z)| lr 4.89e-04 | 322.71 ms | 52.3% bf16 MFU | 1623609 tok/s step 6051/19560 | loss 3.516262 (-0.16z)| norm 0.2753 (-0.24z)| lr 4.89e-04 | 322.90 ms | 52.3% bf16 MFU | 1623612 tok/s step 6052/19560 | loss 3.528806 (+0.14z)| norm 0.3107 (+1.42z)| lr 4.89e-04 | 322.56 ms | 52.3% bf16 MFU | 1623702 tok/s step 6053/19560 | loss 3.538189 (+0.36z)| norm 0.3226 (+1.93z)| lr 4.88e-04 | 322.42 ms | 52.3% bf16 MFU | 1623822 tok/s step 6054/19560 | loss 3.519213 (-0.08z)| norm 0.3047 (+1.09z)| lr 4.88e-04 | 322.95 ms | 52.3% bf16 MFU | 1623804 tok/s step 6055/19560 | loss 3.481010 (-1.00z)| norm 0.2721 (-0.42z)| lr 4.88e-04 | 323.36 ms | 52.2% bf16 MFU | 1623682 tok/s step 6056/19560 | loss 3.532065 (+0.23z)| norm 0.2567 (-1.13z)| lr 4.88e-04 | 323.12 ms | 52.2% bf16 MFU | 1623627 tok/s step 6057/19560 | loss 3.479688 (-1.02z)| norm 0.2594 (-0.99z)| lr 4.88e-04 | 322.26 ms | 52.4% bf16 MFU | 1623791 tok/s step 6058/19560 | loss 3.511237 (-0.26z)| norm 0.2680 (-0.58z)| lr 4.88e-04 | 322.39 ms | 52.4% bf16 MFU | 1623914 tok/s step 6059/19560 | loss 3.454922 (-1.58z)| norm 0.2503 (-1.39z)| lr 4.88e-04 | 322.71 ms | 52.3% bf16 MFU | 1623950 tok/s step 6060/19560 | loss 3.523625 (+0.06z)| norm 0.2497 (-1.40z)| lr 4.88e-04 | 323.19 ms | 52.2% bf16 MFU | 1623864 tok/s step 6061/19560 | loss 3.520831 (-0.01z)| norm 0.2644 (-0.72z)| lr 4.88e-04 | 322.64 ms | 52.3% bf16 MFU | 1623920 tok/s step 6062/19560 | loss 3.523437 (+0.06z)| norm 0.2666 (-0.62z)| lr 4.88e-04 | 322.59 ms | 52.3% bf16 MFU | 1623988 tok/s step 6063/19560 | loss 3.582136 (+1.43z)| norm 0.2627 (-0.78z)| lr 
4.88e-04 | 322.34 ms | 52.4% bf16 MFU | 1624114 tok/s step 6064/19560 | loss 3.533760 (+0.28z)| norm 0.2386 (-1.85z)| lr 4.88e-04 | 322.50 ms | 52.3% bf16 MFU | 1624193 tok/s step 6065/19560 | loss 3.586009 (+1.51z)| norm 0.2799 (-0.00z)| lr 4.88e-04 | 322.71 ms | 52.3% bf16 MFU | 1624215 tok/s step 6066/19560 | loss 3.564393 (+1.00z)| norm 0.2749 (-0.23z)| lr 4.88e-04 | 323.15 ms | 52.2% bf16 MFU | 1624125 tok/s step 6067/19560 | loss 3.572976 (+1.20z)| norm 0.2849 (+0.22z)| lr 4.88e-04 | 322.54 ms | 52.3% bf16 MFU | 1624195 tok/s step 6068/19560 | loss 3.532608 (+0.22z)| norm 0.2550 (-1.12z)| lr 4.88e-04 | 322.48 ms | 52.3% bf16 MFU | 1624275 tok/s step 6069/19560 | loss 3.551728 (+0.68z)| norm 0.2482 (-1.41z)| lr 4.88e-04 | 322.73 ms | 52.3% bf16 MFU | 1624288 tok/s step 6070/19560 | loss 3.541960 (+0.44z)| norm 0.2539 (-1.14z)| lr 4.88e-04 | 322.19 ms | 52.4% bf16 MFU | 1624438 tok/s step 6071/19560 | loss 3.524829 (+0.02z)| norm 0.2609 (-0.82z)| lr 4.88e-04 | 322.47 ms | 52.3% bf16 MFU | 1624508 tok/s step 6072/19560 | loss 3.519537 (-0.11z)| norm 0.2630 (-0.73z)| lr 4.88e-04 | 323.04 ms | 52.2% bf16 MFU | 1624431 tok/s step 6073/19560 | loss 3.506227 (-0.43z)| norm 0.2750 (-0.20z)| lr 4.88e-04 | 322.62 ms | 52.3% bf16 MFU | 1624463 tok/s step 6074/19560 | loss 3.692810 (+3.82z)| norm 0.3136 (+1.49z)| lr 4.88e-04 | 322.57 ms | 52.3% bf16 MFU | 1624508 tok/s step 6075/19560 | loss 3.491546 (-0.77z)| norm 0.2870 (+0.32z)| lr 4.88e-04 | 322.37 ms | 52.4% bf16 MFU | 1624599 tok/s step 6076/19560 | loss 3.512904 (-0.28z)| norm 0.2750 (-0.21z)| lr 4.88e-04 | 322.68 ms | 52.3% bf16 MFU | 1624608 tok/s step 6077/19560 | loss 3.538340 (+0.30z)| norm 0.2958 (+0.70z)| lr 4.88e-04 | 322.75 ms | 52.3% bf16 MFU | 1624599 tok/s step 6078/19560 | loss 3.479317 (-1.03z)| norm 0.2632 (-0.74z)| lr 4.87e-04 | 322.85 ms | 52.3% bf16 MFU | 1624566 tok/s step 6079/19560 | loss 3.578030 (+1.27z)| norm 0.2651 (-0.66z)| lr 4.87e-04 | 322.71 ms | 52.3% bf16 MFU | 1624569 tok/s step 
6080/19560 | loss 3.518061 (-0.13z)| norm 0.2958 (+0.69z)| lr 4.87e-04 | 322.13 ms | 52.4% bf16 MFU | 1624719 tok/s step 6081/19560 | loss 3.515830 (-0.19z)| norm 0.2912 (+0.48z)| lr 4.87e-04 | 322.60 ms | 52.3% bf16 MFU | 1624743 tok/s step 6082/19560 | loss 3.533210 (+0.22z)| norm 0.2451 (-1.51z)| lr 4.87e-04 | 322.71 ms | 52.3% bf16 MFU | 1624739 tok/s step 6083/19560 | loss 3.472730 (-1.19z)| norm 0.2650 (-0.64z)| lr 4.87e-04 | 322.58 ms | 52.3% bf16 MFU | 1624766 tok/s step 6084/19560 | loss 3.511689 (-0.28z)| norm 0.2660 (-0.59z)| lr 4.87e-04 | 322.45 ms | 52.3% bf16 MFU | 1624824 tok/s step 6085/19560 | loss 3.547532 (+0.56z)| norm 0.2687 (-0.48z)| lr 4.87e-04 | 322.45 ms | 52.3% bf16 MFU | 1624880 tok/s step 6086/19560 | loss 3.563497 (+0.93z)| norm 0.2711 (-0.37z)| lr 4.87e-04 | 322.57 ms | 52.3% bf16 MFU | 1624903 tok/s step 6087/19560 | loss 3.567352 (+1.01z)| norm 0.2640 (-0.66z)| lr 4.87e-04 | 322.45 ms | 52.3% bf16 MFU | 1624954 tok/s step 6088/19560 | loss 3.511268 (-0.30z)| norm 0.2839 (+0.23z)| lr 4.87e-04 | 322.86 ms | 52.3% bf16 MFU | 1624901 tok/s step 6089/19560 | loss 3.565090 (+0.95z)| norm 0.3276 (+2.13z)| lr 4.87e-04 | 322.17 ms | 52.4% bf16 MFU | 1625025 tok/s step 6090/19560 | loss 3.528689 (+0.08z)| norm 0.2877 (+0.38z)| lr 4.87e-04 | 322.63 ms | 52.3% bf16 MFU | 1625026 tok/s step 6091/19560 | loss 3.506853 (-0.45z)| norm 0.2785 (-0.01z)| lr 4.87e-04 | 322.75 ms | 52.3% bf16 MFU | 1624996 tok/s step 6092/19560 | loss 3.492683 (-0.79z)| norm 0.2933 (+0.64z)| lr 4.87e-04 | 322.70 ms | 52.3% bf16 MFU | 1624980 tok/s step 6093/19560 | loss 3.488347 (-0.89z)| norm 0.2683 (-0.46z)| lr 4.87e-04 | 322.58 ms | 52.3% bf16 MFU | 1624997 tok/s step 6094/19560 | loss 3.455925 (-1.63z)| norm 0.2564 (-0.97z)| lr 4.87e-04 | 322.33 ms | 52.4% bf16 MFU | 1625074 tok/s step 6095/19560 | loss 3.512374 (-0.29z)| norm 0.2674 (-0.48z)| lr 4.87e-04 | 323.12 ms | 52.2% bf16 MFU | 1624951 tok/s step 6096/19560 | loss 3.520465 (-0.10z)| norm 0.2720 (-0.27z)| lr 
4.87e-04 | 323.11 ms | 52.2% bf16 MFU | 1624835 tok/s step 6097/19560 | loss 3.487828 (-0.87z)| norm 0.2610 (-0.75z)| lr 4.87e-04 | 322.65 ms | 52.3% bf16 MFU | 1624840 tok/s step 6098/19560 | loss 3.663235 (+3.20z)| norm 0.2826 (+0.20z)| lr 4.87e-04 | 322.80 ms | 52.3% bf16 MFU | 1624808 tok/s step 6099/19560 | loss 3.433664 (-2.06z)| norm 0.2674 (-0.46z)| lr 4.87e-04 | 322.66 ms | 52.3% bf16 MFU | 1624811 tok/s step 6100/19560 | loss 3.495999 (-0.63z)| norm 0.2644 (-0.59z)| lr 4.87e-04 | 322.44 ms | 52.3% bf16 MFU | 1624872 tok/s step 6101/19560 | loss 3.513325 (-0.26z)| norm 0.2766 (-0.05z)| lr 4.87e-04 | 322.69 ms | 52.3% bf16 MFU | 1624866 tok/s step 6102/19560 | loss 3.491906 (-0.74z)| norm 0.2586 (-0.84z)| lr 4.87e-04 | 322.33 ms | 52.4% bf16 MFU | 1624951 tok/s step 6103/19560 | loss 3.505099 (-0.43z)| norm 0.2901 (+0.55z)| lr 4.87e-04 | 322.85 ms | 52.3% bf16 MFU | 1624900 tok/s step 6104/19560 | loss 3.477809 (-1.07z)| norm 0.3650 (+3.61z)| lr 4.86e-04 | 322.71 ms | 52.3% bf16 MFU | 1624887 tok/s step 6105/19560 | loss 3.542732 (+0.44z)| norm 0.4952 (+7.02z)| lr 4.86e-04 | 322.12 ms | 52.4% bf16 MFU | 1625022 tok/s step 6106/19560 | loss 3.581515 (+1.34z)| norm 0.3726 (+2.90z)| lr 4.86e-04 | 322.90 ms | 52.3% bf16 MFU | 1624956 tok/s step 6107/19560 | loss 3.534658 (+0.23z)| norm 0.3385 (+1.79z)| lr 4.86e-04 | 322.35 ms | 52.4% bf16 MFU | 1625031 tok/s step 6108/19560 | loss 3.509223 (-0.37z)| norm 0.3320 (+1.57z)| lr 4.86e-04 | 322.76 ms | 52.3% bf16 MFU | 1624998 tok/s step 6109/19560 | loss 3.513317 (-0.27z)| norm 0.3109 (+0.90z)| lr 4.86e-04 | 322.73 ms | 52.3% bf16 MFU | 1624976 tok/s step 6110/19560 | loss 3.554829 (+0.71z)| norm 0.3328 (+1.59z)| lr 4.86e-04 | 322.28 ms | 52.4% bf16 MFU | 1625068 tok/s step 6111/19560 | loss 3.545760 (+0.52z)| norm 0.3108 (+0.89z)| lr 4.86e-04 | 322.62 ms | 52.3% bf16 MFU | 1625068 tok/s step 6112/19560 | loss 3.605615 (+1.96z)| norm 0.2857 (+0.13z)| lr 4.86e-04 | 322.39 ms | 52.4% bf16 MFU | 1625128 tok/s step 
6113/19560 | loss 3.510928 (-0.32z)| norm 0.3148 (+1.04z)| lr 4.86e-04 | 322.83 ms | 52.3% bf16 MFU | 1625074 tok/s step 6114/19560 | loss 3.479756 (-1.07z)| norm 0.2712 (-0.32z)| lr 4.86e-04 | 322.57 ms | 52.3% bf16 MFU | 1625087 tok/s step 6115/19560 | loss 3.497251 (-0.64z)| norm 0.2739 (-0.24z)| lr 4.86e-04 | 322.54 ms | 52.3% bf16 MFU | 1625108 tok/s step 6116/19560 | loss 3.489020 (-0.84z)| norm 0.2664 (-0.47z)| lr 4.86e-04 | 322.69 ms | 52.3% bf16 MFU | 1625088 tok/s step 6117/19560 | loss 3.521870 (-0.04z)| norm 0.2631 (-0.57z)| lr 4.86e-04 | 322.18 ms | 52.4% bf16 MFU | 1625199 tok/s step 6118/19560 | loss 3.663100 (+3.18z)| norm 0.2599 (-0.67z)| lr 4.86e-04 | 322.46 ms | 52.3% bf16 MFU | 1625234 tok/s step 6119/19560 | loss 3.535633 (+0.24z)| norm 0.3014 (+0.62z)| lr 4.86e-04 | 323.71 ms | 52.1% bf16 MFU | 1624953 tok/s step 6120/19560 | loss 3.534025 (+0.19z)| norm 0.2736 (-0.25z)| lr 4.86e-04 | 322.28 ms | 52.4% bf16 MFU | 1625047 tok/s step 6121/19560 | loss 3.548064 (+0.52z)| norm 0.2727 (-0.28z)| lr 4.86e-04 | 322.56 ms | 52.3% bf16 MFU | 1625064 tok/s step 6122/19560 | loss 3.577471 (+1.21z)| norm 0.2626 (-0.60z)| lr 4.86e-04 | 322.81 ms | 52.3% bf16 MFU | 1625018 tok/s step 6123/19560 | loss 3.463645 (-1.44z)| norm 0.2485 (-1.03z)| lr 4.86e-04 | 322.64 ms | 52.3% bf16 MFU | 1625016 tok/s step 6124/19560 | loss 3.471851 (-1.25z)| norm 0.2873 (+0.17z)| lr 4.86e-04 | 322.88 ms | 52.3% bf16 MFU | 1624955 tok/s step 6125/19560 | loss 3.499713 (-0.60z)| norm 0.2491 (-1.02z)| lr 4.86e-04 | 322.37 ms | 52.4% bf16 MFU | 1625026 tok/s step 6126/19560 | loss 3.520275 (-0.13z)| norm 0.2813 (-0.02z)| lr 4.86e-04 | 322.19 ms | 52.4% bf16 MFU | 1625137 tok/s step 6127/19560 | loss 3.452787 (-1.70z)| norm 0.2768 (-0.16z)| lr 4.86e-04 | 322.60 ms | 52.3% bf16 MFU | 1625140 tok/s step 6128/19560 | loss 3.489867 (-0.82z)| norm 0.2609 (-0.66z)| lr 4.86e-04 | 322.93 ms | 52.3% bf16 MFU | 1625060 tok/s step 6129/19560 | loss 3.561852 (+0.84z)| norm 0.2627 (-0.61z)| lr 
4.86e-04 | 322.56 ms | 52.3% bf16 MFU | 1625077 tok/s step 6130/19560 | loss 3.490465 (-0.81z)| norm 0.2560 (-0.81z)| lr 4.85e-04 | 322.47 ms | 52.3% bf16 MFU | 1625115 tok/s step 6131/19560 | loss 3.472311 (-1.21z)| norm 0.2613 (-0.64z)| lr 4.85e-04 | 323.54 ms | 52.2% bf16 MFU | 1624884 tok/s step 6132/19560 | loss 3.498892 (-0.59z)| norm 0.2664 (-0.48z)| lr 4.85e-04 | 322.20 ms | 52.4% bf16 MFU | 1624999 tok/s step 6133/19560 | loss 3.443300 (-1.84z)| norm 0.2758 (-0.18z)| lr 4.85e-04 | 323.00 ms | 52.3% bf16 MFU | 1624908 tok/s step 6134/19560 | loss 3.526862 (+0.07z)| norm 0.2558 (-0.80z)| lr 4.85e-04 | 322.43 ms | 52.3% bf16 MFU | 1624965 tok/s step 6135/19560 | loss 3.517921 (-0.13z)| norm 0.2575 (-0.73z)| lr 4.85e-04 | 322.33 ms | 52.4% bf16 MFU | 1625044 tok/s step 6136/19560 | loss 3.447248 (-1.73z)| norm 0.2681 (-0.40z)| lr 4.85e-04 | 322.47 ms | 52.3% bf16 MFU | 1625086 tok/s step 6137/19560 | loss 3.421906 (-2.24z)| norm 0.2646 (-0.50z)| lr 4.85e-04 | 323.07 ms | 52.2% bf16 MFU | 1624972 tok/s step 6138/19560 | loss 3.535302 (+0.29z)| norm 0.2682 (-0.39z)| lr 4.85e-04 | 322.70 ms | 52.3% bf16 MFU | 1624959 tok/s step 6139/19560 | loss 3.501842 (-0.44z)| norm 0.2836 (+0.10z)| lr 4.85e-04 | 322.79 ms | 52.3% bf16 MFU | 1624922 tok/s step 6140/19560 | loss 3.488800 (-0.73z)| norm 0.2597 (-0.64z)| lr 4.85e-04 | 322.33 ms | 52.4% bf16 MFU | 1625003 tok/s step 6141/19560 | loss 3.509117 (-0.27z)| norm 0.2536 (-0.83z)| lr 4.85e-04 | 322.36 ms | 52.4% bf16 MFU | 1625073 tok/s step 6142/19560 | loss 3.477624 (-0.98z)| norm 0.2519 (-0.87z)| lr 4.85e-04 | 322.74 ms | 52.3% bf16 MFU | 1625045 tok/s step 6143/19560 | loss 3.521912 (+0.02z)| norm 0.2748 (-0.12z)| lr 4.85e-04 | 322.51 ms | 52.3% bf16 MFU | 1625075 tok/s step 6144/19560 | loss 3.449893 (-1.60z)| norm 0.2810 (+0.09z)| lr 4.85e-04 | 322.90 ms | 52.3% bf16 MFU | 1625005 tok/s step 6145/19560 | loss 3.495197 (-0.58z)| norm 0.2549 (-0.75z)| lr 4.85e-04 | 322.69 ms | 52.3% bf16 MFU | 1624991 tok/s step 
6146/19560 | loss 3.522421 (+0.04z)| norm 0.2667 (-0.36z)| lr 4.85e-04 | 322.82 ms | 52.3% bf16 MFU | 1624945 tok/s step 6147/19560 | loss 3.533541 (+0.29z)| norm 0.2770 (-0.02z)| lr 4.85e-04 | 322.52 ms | 52.3% bf16 MFU | 1624977 tok/s step 6148/19560 | loss 3.557308 (+0.82z)| norm 0.2968 (+0.62z)| lr 4.85e-04 | 322.92 ms | 52.3% bf16 MFU | 1624908 tok/s step 6149/19560 | loss 3.514586 (-0.14z)| norm 0.3096 (+1.02z)| lr 4.85e-04 | 322.69 ms | 52.3% bf16 MFU | 1624901 tok/s step 6150/19560 | loss 3.509228 (-0.26z)| norm 0.2847 (+0.21z)| lr 4.85e-04 | 322.51 ms | 52.3% bf16 MFU | 1624938 tok/s step 6151/19560 | loss 3.502858 (-0.39z)| norm 0.2631 (-0.48z)| lr 4.85e-04 | 322.64 ms | 52.3% bf16 MFU | 1624941 tok/s step 6152/19560 | loss 3.475156 (-1.03z)| norm 0.2754 (-0.07z)| lr 4.85e-04 | 322.72 ms | 52.3% bf16 MFU | 1624923 tok/s step 6153/19560 | loss 3.497189 (-0.53z)| norm 0.3109 (+1.07z)| lr 4.85e-04 | 322.75 ms | 52.3% bf16 MFU | 1624898 tok/s step 6154/19560 | loss 3.560340 (+0.96z)| norm 0.2779 (-0.00z)| lr 4.85e-04 | 322.28 ms | 52.4% bf16 MFU | 1624993 tok/s step 6155/19560 | loss 3.517431 (-0.07z)| norm 0.2632 (-0.47z)| lr 4.84e-04 | 322.73 ms | 52.3% bf16 MFU | 1624972 tok/s step 6156/19560 | loss 3.603163 (+1.98z)| norm 0.2918 (+0.45z)| lr 4.84e-04 | 322.33 ms | 52.4% bf16 MFU | 1625052 tok/s step 6157/19560 | loss 3.539005 (+0.44z)| norm 0.2906 (+0.42z)| lr 4.84e-04 | 323.39 ms | 52.2% bf16 MFU | 1624860 tok/s step 6158/19560 | loss 3.502486 (-0.44z)| norm 0.3004 (+0.73z)| lr 4.84e-04 | 322.25 ms | 52.4% bf16 MFU | 1624965 tok/s step 6159/19560 | loss 3.484146 (-0.87z)| norm 0.2917 (+0.44z)| lr 4.84e-04 | 322.63 ms | 52.3% bf16 MFU | 1624968 tok/s step 6160/19560 | loss 3.508214 (-0.29z)| norm 0.2869 (+0.28z)| lr 4.84e-04 | 323.18 ms | 52.2% bf16 MFU | 1624834 tok/s step 6161/19560 | loss 3.522713 (+0.06z)| norm 0.2958 (+0.58z)| lr 4.84e-04 | 322.64 ms | 52.3% bf16 MFU | 1624842 tok/s step 6162/19560 | loss 3.569642 (+1.17z)| norm 0.2779 (-0.01z)| lr 
4.84e-04 | 322.63 ms | 52.3% bf16 MFU | 1624851 tok/s step 6163/19560 | loss 3.566295 (+1.08z)| norm 0.2789 (+0.02z)| lr 4.84e-04 | 322.70 ms | 52.3% bf16 MFU | 1624842 tok/s step 6164/19560 | loss 3.492146 (-0.67z)| norm 0.2751 (-0.11z)| lr 4.84e-04 | 322.95 ms | 52.3% bf16 MFU | 1624771 tok/s step 6165/19560 | loss 3.475609 (-1.05z)| norm 0.2931 (+0.48z)| lr 4.84e-04 | 322.42 ms | 52.3% bf16 MFU | 1624839 tok/s step 6166/19560 | loss 3.486301 (-0.79z)| norm 0.2566 (-0.73z)| lr 4.84e-04 | 322.51 ms | 52.3% bf16 MFU | 1624879 tok/s step 6167/19560 | loss 3.482877 (-0.86z)| norm 0.2783 (-0.02z)| lr 4.84e-04 | 323.02 ms | 52.2% bf16 MFU | 1624789 tok/s step 6168/19560 | loss 3.555391 (+0.84z)| norm 0.2702 (-0.30z)| lr 4.84e-04 | 322.35 ms | 52.4% bf16 MFU | 1624873 tok/s step 6169/19560 | loss 3.476210 (-1.01z)| norm 0.2544 (-0.84z)| lr 4.84e-04 | 323.33 ms | 52.2% bf16 MFU | 1624706 tok/s step 6170/19560 | loss 3.548679 (+0.68z)| norm 0.2647 (-0.49z)| lr 4.84e-04 | 322.43 ms | 52.3% bf16 MFU | 1624774 tok/s step 6171/19560 | loss 3.457679 (-1.44z)| norm 0.2868 (+0.24z)| lr 4.84e-04 | 322.57 ms | 52.3% bf16 MFU | 1624803 tok/s step 6172/19560 | loss 3.542548 (+0.56z)| norm 0.2475 (-1.06z)| lr 4.84e-04 | 323.50 ms | 52.2% bf16 MFU | 1624597 tok/s step 6173/19560 | loss 3.496663 (-0.51z)| norm 0.2604 (-0.62z)| lr 4.84e-04 | 322.66 ms | 52.3% bf16 MFU | 1624611 tok/s step 6174/19560 | loss 3.504820 (-0.32z)| norm 0.2619 (-0.57z)| lr 4.84e-04 | 322.81 ms | 52.3% bf16 MFU | 1624588 tok/s step 6175/19560 | loss 3.630444 (+2.55z)| norm 0.3272 (+1.59z)| lr 4.84e-04 | 322.47 ms | 52.3% bf16 MFU | 1624651 tok/s step 6176/19560 | loss 3.489400 (-0.68z)| norm 0.3308 (+1.68z)| lr 4.84e-04 | 322.43 ms | 52.3% bf16 MFU | 1624722 tok/s step 6177/19560 | loss 3.509143 (-0.23z)| norm 0.3045 (+0.80z)| lr 4.84e-04 | 323.42 ms | 52.2% bf16 MFU | 1624539 tok/s step 6178/19560 | loss 3.493599 (-0.58z)| norm 0.2602 (-0.64z)| lr 4.84e-04 | 323.09 ms | 52.2% bf16 MFU | 1624449 tok/s step 
6179/19560 | loss 3.484330 (-0.79z)| norm 0.3004 (+0.67z)| lr 4.84e-04 | 322.29 ms | 52.4% bf16 MFU | 1624564 tok/s step 6180/19560 | loss 3.566733 (+1.08z)| norm 0.2727 (-0.23z)| lr 4.83e-04 | 322.82 ms | 52.3% bf16 MFU | 1624540 tok/s step 6181/19560 | loss 3.496519 (-0.51z)| norm 0.2810 (+0.05z)| lr 4.83e-04 | 322.46 ms | 52.3% bf16 MFU | 1624609 tok/s step 6182/19560 | loss 3.529753 (+0.25z)| norm 0.3011 (+0.72z)| lr 4.83e-04 | 322.37 ms | 52.4% bf16 MFU | 1624697 tok/s step 6183/19560 | loss 3.549799 (+0.69z)| norm 0.3132 (+1.11z)| lr 4.83e-04 | 323.12 ms | 52.2% bf16 MFU | 1624591 tok/s step 6184/19560 | loss 3.561198 (+0.94z)| norm 0.2850 (+0.17z)| lr 4.83e-04 | 323.12 ms | 52.2% bf16 MFU | 1624491 tok/s step 6185/19560 | loss 3.480786 (-0.88z)| norm 0.2835 (+0.11z)| lr 4.83e-04 | 322.83 ms | 52.3% bf16 MFU | 1624467 tok/s step 6186/19560 | loss 3.501924 (-0.40z)| norm 0.2744 (-0.19z)| lr 4.83e-04 | 322.88 ms | 52.3% bf16 MFU | 1624432 tok/s step 6187/19560 | loss 3.483122 (-0.84z)| norm 0.2901 (+0.32z)| lr 4.83e-04 | 322.83 ms | 52.3% bf16 MFU | 1624413 tok/s step 6188/19560 | loss 3.542738 (+0.52z)| norm 0.2889 (+0.27z)| lr 4.83e-04 | 322.42 ms | 52.3% bf16 MFU | 1624497 tok/s step 6189/19560 | loss 3.541184 (+0.48z)| norm 0.2880 (+0.24z)| lr 4.83e-04 | 323.05 ms | 52.2% bf16 MFU | 1624418 tok/s step 6190/19560 | loss 3.487460 (-0.74z)| norm 0.2855 (+0.15z)| lr 4.83e-04 | 322.96 ms | 52.3% bf16 MFU | 1624366 tok/s step 6191/19560 | loss 3.524841 (+0.12z)| norm 0.2811 (-0.01z)| lr 4.83e-04 | 322.74 ms | 52.3% bf16 MFU | 1624373 tok/s step 6192/19560 | loss 3.473091 (-1.05z)| norm 0.2756 (-0.20z)| lr 4.83e-04 | 322.59 ms | 52.3% bf16 MFU | 1624416 tok/s step 6193/19560 | loss 3.567389 (+1.11z)| norm 0.2974 (+0.53z)| lr 4.83e-04 | 322.41 ms | 52.3% bf16 MFU | 1624504 tok/s step 6194/19560 | loss 3.487311 (-0.71z)| norm 0.2933 (+0.39z)| lr 4.83e-04 | 323.01 ms | 52.3% bf16 MFU | 1624437 tok/s step 6195/19560 | loss 3.517513 (-0.01z)| norm 0.3001 (+0.61z)| lr 
4.83e-04 | 323.07 ms | 52.2% bf16 MFU | 1624357 tok/s step 6196/19560 | loss 3.563147 (+1.04z)| norm 0.3030 (+0.70z)| lr 4.83e-04 | 322.69 ms | 52.3% bf16 MFU | 1624375 tok/s step 6197/19560 | loss 3.483327 (-0.79z)| norm 0.2855 (+0.10z)| lr 4.83e-04 | 323.04 ms | 52.2% bf16 MFU | 1624305 tok/s step 6198/19560 | loss 3.444286 (-1.65z)| norm 0.2683 (-0.49z)| lr 4.83e-04 | 322.60 ms | 52.3% bf16 MFU | 1624350 tok/s step 6199/19560 | loss 3.460839 (-1.26z)| norm 0.3199 (+1.25z)| lr 4.83e-04 | 322.41 ms | 52.3% bf16 MFU | 1624440 tok/s step 6200/19560 | loss 3.461489 (-1.22z)| norm 0.2665 (-0.57z)| lr 4.83e-04 | 322.90 ms | 52.3% bf16 MFU | 1624401 tok/s step 6201/19560 | loss 3.518786 (+0.07z)| norm 0.2646 (-0.63z)| lr 4.83e-04 | 323.39 ms | 52.2% bf16 MFU | 1624242 tok/s step 6202/19560 | loss 3.493136 (-0.51z)| norm 0.2929 (+0.34z)| lr 4.83e-04 | 322.56 ms | 52.3% bf16 MFU | 1624299 tok/s step 6203/19560 | loss 3.557180 (+1.02z)| norm 0.2761 (-0.23z)| lr 4.83e-04 | 323.37 ms | 52.2% bf16 MFU | 1624151 tok/s step 6204/19560 | loss 3.525204 (+0.25z)| norm 0.2842 (+0.04z)| lr 4.83e-04 | 323.13 ms | 52.2% bf16 MFU | 1624070 tok/s step 6205/19560 | loss 3.508545 (-0.15z)| norm 0.2781 (-0.16z)| lr 4.83e-04 | 322.51 ms | 52.3% bf16 MFU | 1624149 tok/s step 6206/19560 | loss 3.522210 (+0.17z)| norm 0.3141 (+1.05z)| lr 4.82e-04 | 322.64 ms | 52.3% bf16 MFU | 1624190 tok/s step 6207/19560 | loss 3.466894 (-1.14z)| norm 0.3471 (+2.12z)| lr 4.82e-04 | 323.05 ms | 52.2% bf16 MFU | 1624126 tok/s step 6208/19560 | loss 3.503570 (-0.25z)| norm 0.2756 (-0.27z)| lr 4.82e-04 | 322.77 ms | 52.3% bf16 MFU | 1624136 tok/s step 6209/19560 | loss 3.500726 (-0.32z)| norm 0.3107 (+0.89z)| lr 4.82e-04 | 322.89 ms | 52.3% bf16 MFU | 1624117 tok/s step 6210/19560 | loss 3.578284 (+1.53z)| norm 0.3527 (+2.24z)| lr 4.82e-04 | 323.06 ms | 52.2% bf16 MFU | 1624055 tok/s step 6211/19560 | loss 3.456916 (-1.37z)| norm 0.3071 (+0.73z)| lr 4.82e-04 | 323.21 ms | 52.2% bf16 MFU | 1623960 tok/s step 
6212/19560 | loss 3.533046 (+0.45z)| norm 0.2918 (+0.22z)| lr 4.82e-04 | 322.62 ms | 52.3% bf16 MFU | 1624017 tok/s step 6213/19560 | loss 3.482838 (-0.74z)| norm 0.3084 (+0.75z)| lr 4.82e-04 | 322.68 ms | 52.3% bf16 MFU | 1624055 tok/s step 6214/19560 | loss 3.422364 (-2.13z)| norm 0.3157 (+0.98z)| lr 4.82e-04 | 323.27 ms | 52.2% bf16 MFU | 1623945 tok/s step 6215/19560 | loss 3.528778 (+0.39z)| norm 0.2682 (-0.58z)| lr 4.82e-04 | 322.80 ms | 52.3% bf16 MFU | 1623957 tok/s step 6216/19560 | loss 3.546431 (+0.80z)| norm 0.3234 (+1.21z)| lr 4.82e-04 | 322.76 ms | 52.3% bf16 MFU | 1623980 tok/s step 6217/19560 | loss 3.529563 (+0.41z)| norm 0.2756 (-0.33z)| lr 4.82e-04 | 322.85 ms | 52.3% bf16 MFU | 1623976 tok/s step 6218/19560 | loss 3.531831 (+0.46z)| norm 0.2657 (-0.65z)| lr 4.82e-04 | 323.17 ms | 52.2% bf16 MFU | 1623895 tok/s step 6219/19560 | loss 3.459624 (-1.24z)| norm 0.3066 (+0.68z)| lr 4.82e-04 | 322.96 ms | 52.3% bf16 MFU | 1623868 tok/s step 6220/19560 | loss 3.544886 (+0.76z)| norm 0.2843 (-0.05z)| lr 4.82e-04 | 323.01 ms | 52.2% bf16 MFU | 1623831 tok/s step 6221/19560 | loss 3.530892 (+0.43z)| norm 0.3066 (+0.67z)| lr 4.82e-04 | 323.12 ms | 52.2% bf16 MFU | 1623769 tok/s step 6222/19560 | loss 3.501812 (-0.27z)| norm 0.2873 (+0.03z)| lr 4.82e-04 | 323.41 ms | 52.2% bf16 MFU | 1623636 tok/s step 6223/19560 | loss 3.511832 (-0.03z)| norm 0.3153 (+0.94z)| lr 4.82e-04 | 322.93 ms | 52.3% bf16 MFU | 1623630 tok/s step 6224/19560 | loss 3.521857 (+0.21z)| norm 0.3094 (+0.74z)| lr 4.82e-04 | 322.70 ms | 52.3% bf16 MFU | 1623682 tok/s step 6225/19560 | loss 3.463304 (-1.18z)| norm 0.2961 (+0.29z)| lr 4.82e-04 | 323.43 ms | 52.2% bf16 MFU | 1623548 tok/s step 6226/19560 | loss 3.548068 (+0.90z)| norm 0.3118 (+0.80z)| lr 4.82e-04 | 323.44 ms | 52.2% bf16 MFU | 1623420 tok/s step 6227/19560 | loss 3.598430 (+2.11z)| norm 0.3154 (+0.90z)| lr 4.82e-04 | 322.64 ms | 52.3% bf16 MFU | 1623499 tok/s step 6228/19560 | loss 3.501799 (-0.29z)| norm 0.2630 (-0.81z)| lr 
4.82e-04 | 322.72 ms | 52.3% bf16 MFU | 1623553 tok/s step 6229/19560 | loss 3.553926 (+0.99z)| norm 0.3265 (+1.25z)| lr 4.82e-04 | 322.79 ms | 52.3% bf16 MFU | 1623589 tok/s step 6230/19560 | loss 3.515609 (+0.04z)| norm 0.2613 (-0.88z)| lr 4.82e-04 | 322.44 ms | 52.3% bf16 MFU | 1623710 tok/s step 6231/19560 | loss 3.497813 (-0.40z)| norm 0.2674 (-0.67z)| lr 4.81e-04 | 322.72 ms | 52.3% bf16 MFU | 1623754 tok/s step 6232/19560 | loss 3.510133 (-0.10z)| norm 0.2753 (-0.40z)| lr 4.81e-04 | 322.96 ms | 52.3% bf16 MFU | 1623737 tok/s step 6233/19560 | loss 3.632555 (+2.83z)| norm 0.2589 (-1.12z)| lr 4.81e-04 | 323.04 ms | 52.2% bf16 MFU | 1623700 tok/s step 6234/19560 | loss 3.499293 (-0.36z)| norm 0.2939 (+0.40z)| lr 4.81e-04 | 323.15 ms | 52.2% bf16 MFU | 1623637 tok/s step 6235/19560 | loss 3.485709 (-0.68z)| norm 0.2557 (-1.29z)| lr 4.81e-04 | 323.02 ms | 52.2% bf16 MFU | 1623608 tok/s step 6236/19560 | loss 3.481407 (-0.78z)| norm 0.2733 (-0.48z)| lr 4.81e-04 | 323.20 ms | 52.2% bf16 MFU | 1623536 tok/s step 6237/19560 | loss 3.498694 (-0.36z)| norm 0.2573 (-1.20z)| lr 4.81e-04 | 323.54 ms | 52.2% bf16 MFU | 1623381 tok/s step 6238/19560 | loss 3.482422 (-0.74z)| norm 0.2708 (-0.57z)| lr 4.81e-04 | 323.06 ms | 52.2% bf16 MFU | 1623357 tok/s step 6239/19560 | loss 3.539457 (+0.65z)| norm 0.2971 (+0.67z)| lr 4.81e-04 | 323.82 ms | 52.1% bf16 MFU | 1623142 tok/s step 6240/19560 | loss 3.551489 (+0.96z)| norm 0.3045 (+1.01z)| lr 4.81e-04 | 323.08 ms | 52.2% bf16 MFU | 1623123 tok/s step 6241/19560 | loss 3.557109 (+1.09z)| norm 0.2665 (-0.76z)| lr 4.81e-04 | 322.80 ms | 52.3% bf16 MFU | 1623177 tok/s step 6242/19560 | loss 3.540012 (+0.66z)| norm 0.2618 (-0.97z)| lr 4.81e-04 | 323.59 ms | 52.2% bf16 MFU | 1623028 tok/s step 6243/19560 | loss 3.431859 (-1.96z)| norm 0.2677 (-0.69z)| lr 4.81e-04 | 322.90 ms | 52.3% bf16 MFU | 1623061 tok/s step 6244/19560 | loss 3.464074 (-1.17z)| norm 0.2811 (-0.07z)| lr 4.81e-04 | 323.25 ms | 52.2% bf16 MFU | 1623005 tok/s step 
6245/19560 | loss 3.685606 (+3.89z)| norm 0.2709 (-0.55z)| lr 4.81e-04 | 322.78 ms | 52.3% bf16 MFU | 1623069 tok/s step 6246/19560 | loss 3.492090 (-0.48z)| norm 0.2918 (+0.42z)| lr 4.81e-04 | 323.55 ms | 52.2% bf16 MFU | 1622937 tok/s step 6247/19560 | loss 3.499373 (-0.30z)| norm 0.2497 (-1.54z)| lr 4.81e-04 | 323.48 ms | 52.2% bf16 MFU | 1622828 tok/s step 6248/19560 | loss 3.463372 (-1.14z)| norm 0.2872 (+0.22z)| lr 4.81e-04 | 323.57 ms | 52.2% bf16 MFU | 1622702 tok/s step 6249/19560 | loss 3.483726 (-0.65z)| norm 0.2401 (-1.96z)| lr 4.81e-04 | 322.81 ms | 52.3% bf16 MFU | 1622773 tok/s step 6250/19560 | loss 3.507961 (-0.06z)| norm 0.2687 (-0.63z)| lr 4.81e-04 | 322.54 ms | 52.3% bf16 MFU | 1622909 tok/s val loss 3.503370
HellaSwag: 2794/10042 = 0.278231 step 6251/19560 | loss 3.579242 (+1.61z)| norm 0.2371 (-2.08z)| lr 4.81e-04 | 323.85 ms | 52.1% bf16 MFU | 1622709 tok/s step 6252/19560 | loss 3.442970 (-1.61z)| norm 0.3622 (+3.48z)| lr 4.81e-04 | 322.18 ms | 52.4% bf16 MFU | 1622939 tok/s step 6253/19560 | loss 3.506087 (-0.12z)| norm 0.3077 (+1.08z)| lr 4.81e-04 | 322.61 ms | 52.3% bf16 MFU | 1623049 tok/s step 6254/19560 | loss 3.560266 (+1.14z)| norm 0.3157 (+1.40z)| lr 4.81e-04 | 322.55 ms | 52.3% bf16 MFU | 1623170 tok/s step 6255/19560 | loss 3.490836 (-0.50z)| norm 0.2822 (-0.06z)| lr 4.81e-04 | 322.19 ms | 52.4% bf16 MFU | 1623374 tok/s step 6256/19560 | loss 3.476043 (-0.85z)| norm 0.2842 (+0.02z)| lr 4.80e-04 | 322.09 ms | 52.4% bf16 MFU | 1623593 tok/s step 6257/19560 | loss 3.492730 (-0.44z)| norm 0.2785 (-0.24z)| lr 4.80e-04 | 322.54 ms | 52.3% bf16 MFU | 1623688 tok/s step 6258/19560 | loss 3.510886 (-0.01z)| norm 0.2720 (-0.53z)| lr 4.80e-04 | 322.32 ms | 52.4% bf16 MFU | 1623835 tok/s step 6259/19560 | loss 3.518633 (+0.16z)| norm 0.2564 (-1.22z)| lr 4.80e-04 | 322.69 ms | 52.3% bf16 MFU | 1623882 tok/s step 6260/19560 | loss 3.477183 (-0.82z)| norm 0.2573 (-1.17z)| lr 4.80e-04 | 322.61 ms | 52.3% bf16 MFU | 1623945 tok/s step 6261/19560 | loss 3.644023 (+3.03z)| norm 0.2911 (+0.31z)| lr 4.80e-04 | 322.75 ms | 52.3% bf16 MFU | 1623968 tok/s step 6262/19560 | loss 3.529314 (+0.37z)| norm 0.3598 (+3.18z)| lr 4.80e-04 | 322.41 ms | 52.3% bf16 MFU | 1624079 tok/s step 6263/19560 | loss 3.510019 (-0.07z)| norm 0.2862 (+0.05z)| lr 4.80e-04 | 322.83 ms | 52.3% bf16 MFU | 1624077 tok/s step 6264/19560 | loss 3.410561 (-2.34z)| norm 0.2669 (-0.78z)| lr 4.80e-04 | 322.63 ms | 52.3% bf16 MFU | 1624125 tok/s step 6265/19560 | loss 3.529505 (+0.37z)| norm 0.2527 (-1.37z)| lr 4.80e-04 | 322.29 ms | 52.4% bf16 MFU | 1624256
tok/s step 6266/19560 | loss 3.498094 (-0.36z)| norm 0.2963 (+0.47z)| lr 4.80e-04 | 322.58 ms | 52.3% bf16 MFU | 1624308 tok/s step 6267/19560 | loss 3.497514 (-0.37z)| norm 0.2666 (-0.78z)| lr 4.80e-04 | 322.66 ms | 52.3% bf16 MFU | 1624337 tok/s step 6268/19560 | loss 3.471319 (-0.97z)| norm 0.2893 (+0.17z)| lr 4.80e-04 | 322.65 ms | 52.3% bf16 MFU | 1624367 tok/s step 6269/19560 | loss 3.501080 (-0.28z)| norm 0.2613 (-1.03z)| lr 4.80e-04 | 322.55 ms | 52.3% bf16 MFU | 1624421 tok/s step 6270/19560 | loss 3.461044 (-1.20z)| norm 0.2730 (-0.54z)| lr 4.80e-04 | 322.59 ms | 52.3% bf16 MFU | 1624462 tok/s step 6271/19560 | loss 3.834426 (+6.17z)| norm 0.3057 (+0.86z)| lr 4.80e-04 | 322.89 ms | 52.3% bf16 MFU | 1624425 tok/s step 6272/19560 | loss 3.558135 (+0.81z)| norm 0.2900 (+0.18z)| lr 4.80e-04 | 322.93 ms | 52.3% bf16 MFU | 1624380 tok/s step 6273/19560 | loss 3.484955 (-0.61z)| norm 0.2566 (-1.26z)| lr 4.80e-04 | 322.50 ms | 52.3% bf16 MFU | 1624447 tok/s step 6274/19560 | loss 3.464944 (-0.98z)| norm 0.2474 (-1.64z)| lr 4.80e-04 | 322.91 ms | 52.3% bf16 MFU | 1624406 tok/s step 6275/19560 | loss 3.488451 (-0.52z)| norm 0.2744 (-0.48z)| lr 4.80e-04 | 322.50 ms | 52.3% bf16 MFU | 1624472 tok/s step 6276/19560 | loss 3.519331 (+0.08z)| norm 0.2512 (-1.45z)| lr 4.80e-04 | 322.55 ms | 52.3% bf16 MFU | 1624519 tok/s step 6277/19560 | loss 3.640235 (+2.35z)| norm 0.3289 (+1.82z)| lr 4.80e-04 | 322.94 ms | 52.3% bf16 MFU | 1624467 tok/s step 6278/19560 | loss 3.530055 (+0.26z)| norm 0.2817 (-0.16z)| lr 4.80e-04 | 322.50 ms | 52.3% bf16 MFU | 1624529 tok/s step 6279/19560 | loss 3.493374 (-0.43z)| norm 0.2446 (-1.69z)| lr 4.80e-04 | 322.18 ms | 52.4% bf16 MFU | 1624669 tok/s step 6280/19560 | loss 3.484412 (-0.60z)| norm 0.2752 (-0.42z)| lr 4.80e-04 | 322.93 ms | 52.3% bf16 MFU | 1624611 tok/s step 6281/19560 | loss 3.499402 (-0.32z)| norm 0.3374 (+2.13z)| lr 4.79e-04 | 322.74 ms | 52.3% bf16 MFU | 1624604 tok/s step 6282/19560 | loss 3.551251 (+0.66z)| norm 0.2897 
(+0.17z)| lr 4.79e-04 | 322.21 ms | 52.4% bf16 MFU | 1624732 tok/s step 6283/19560 | loss 3.510391 (-0.11z)| norm 0.2629 (-0.93z)| lr 4.79e-04 | 322.35 ms | 52.4% bf16 MFU | 1624818 tok/s step 6284/19560 | loss 3.466023 (-0.94z)| norm 0.2653 (-0.82z)| lr 4.79e-04 | 322.45 ms | 52.3% bf16 MFU | 1624875 tok/s step 6285/19560 | loss 3.555554 (+0.77z)| norm 0.2844 (-0.04z)| lr 4.79e-04 | 322.48 ms | 52.3% bf16 MFU | 1624922 tok/s step 6286/19560 | loss 3.515291 (-0.00z)| norm 0.3000 (+0.60z)| lr 4.79e-04 | 322.59 ms | 52.3% bf16 MFU | 1624937 tok/s step 6287/19560 | loss 3.484585 (-0.59z)| norm 0.3020 (+0.68z)| lr 4.79e-04 | 322.66 ms | 52.3% bf16 MFU | 1624935 tok/s step 6288/19560 | loss 3.504854 (-0.20z)| norm 0.2927 (+0.30z)| lr 4.79e-04 | 322.27 ms | 52.4% bf16 MFU | 1625030 tok/s step 6289/19560 | loss 3.506565 (-0.17z)| norm 0.2872 (+0.07z)| lr 4.79e-04 | 322.32 ms | 52.4% bf16 MFU | 1625110 tok/s step 6290/19560 | loss 3.447981 (-1.26z)| norm 0.2931 (+0.31z)| lr 4.79e-04 | 322.96 ms | 52.3% bf16 MFU | 1625024 tok/s step 6291/19560 | loss 3.518585 (+0.09z)| norm 0.2668 (-0.76z)| lr 4.79e-04 | 322.89 ms | 52.3% bf16 MFU | 1624961 tok/s step 6292/19560 | loss 3.474599 (-0.75z)| norm 0.3097 (+0.97z)| lr 4.79e-04 | 323.10 ms | 52.2% bf16 MFU | 1624847 tok/s step 6293/19560 | loss 3.553777 (+0.75z)| norm 0.2841 (-0.06z)| lr 4.79e-04 | 322.28 ms | 52.4% bf16 MFU | 1624946 tok/s step 6294/19560 | loss 3.481963 (-0.62z)| norm 0.2793 (-0.27z)| lr 4.79e-04 | 323.20 ms | 52.2% bf16 MFU | 1624808 tok/s step 6295/19560 | loss 3.527758 (+0.25z)| norm 0.3033 (+0.71z)| lr 4.79e-04 | 322.40 ms | 52.3% bf16 MFU | 1624878 tok/s step 6296/19560 | loss 3.506857 (-0.14z)| norm 0.2780 (-0.33z)| lr 4.79e-04 | 322.60 ms | 52.3% bf16 MFU | 1624894 tok/s step 6297/19560 | loss 3.404366 (-2.06z)| norm 0.2648 (-0.88z)| lr 4.79e-04 | 321.94 ms | 52.4% bf16 MFU | 1625075 tok/s step 6298/19560 | loss 3.521998 (+0.16z)| norm 0.2797 (-0.27z)| lr 4.79e-04 | 323.00 ms | 52.3% bf16 MFU | 1624980 
tok/s step 6299/19560 | loss 3.531684 (+0.33z)| norm 0.2565 (-1.21z)| lr 4.79e-04 | 322.66 ms | 52.3% bf16 MFU | 1624976 tok/s step 6300/19560 | loss 3.454922 (-1.11z)| norm 0.2364 (-2.02z)| lr 4.79e-04 | 322.64 ms | 52.3% bf16 MFU | 1624976 tok/s step 6301/19560 | loss 3.593203 (+1.48z)| norm 0.2832 (-0.12z)| lr 4.79e-04 | 322.90 ms | 52.3% bf16 MFU | 1624912 tok/s step 6302/19560 | loss 3.440724 (-1.36z)| norm 0.2899 (+0.15z)| lr 4.79e-04 | 322.18 ms | 52.4% bf16 MFU | 1625033 tok/s step 6303/19560 | loss 3.482370 (-0.57z)| norm 0.2736 (-0.51z)| lr 4.79e-04 | 322.58 ms | 52.3% bf16 MFU | 1625046 tok/s step 6304/19560 | loss 3.484906 (-0.52z)| norm 0.2826 (-0.12z)| lr 4.79e-04 | 322.93 ms | 52.3% bf16 MFU | 1624970 tok/s step 6305/19560 | loss 3.529806 (+0.32z)| norm 0.2598 (-1.06z)| lr 4.79e-04 | 322.58 ms | 52.3% bf16 MFU | 1624987 tok/s step 6306/19560 | loss 3.455774 (-1.07z)| norm 0.2697 (-0.66z)| lr 4.78e-04 | 322.28 ms | 52.4% bf16 MFU | 1625078 tok/s step 6307/19560 | loss 3.511266 (-0.02z)| norm 0.2896 (+0.18z)| lr 4.78e-04 | 323.25 ms | 52.2% bf16 MFU | 1624920 tok/s step 6308/19560 | loss 3.508466 (-0.07z)| norm 0.2400 (-1.86z)| lr 4.78e-04 | 322.49 ms | 52.3% bf16 MFU | 1624961 tok/s step 6309/19560 | loss 3.486229 (-0.49z)| norm 0.3005 (+0.64z)| lr 4.78e-04 | 322.39 ms | 52.4% bf16 MFU | 1625026 tok/s step 6310/19560 | loss 3.550902 (+0.73z)| norm 0.2547 (-1.24z)| lr 4.78e-04 | 322.72 ms | 52.3% bf16 MFU | 1625004 tok/s step 6311/19560 | loss 3.556291 (+0.83z)| norm 0.2708 (-0.56z)| lr 4.78e-04 | 322.57 ms | 52.3% bf16 MFU | 1625022 tok/s step 6312/19560 | loss 3.544097 (+0.61z)| norm 0.3052 (+0.86z)| lr 4.78e-04 | 322.29 ms | 52.4% bf16 MFU | 1625108 tok/s step 6313/19560 | loss 3.529960 (+0.33z)| norm 0.2680 (-0.68z)| lr 4.78e-04 | 322.61 ms | 52.3% bf16 MFU | 1625111 tok/s step 6314/19560 | loss 3.557943 (+0.85z)| norm 0.2828 (-0.07z)| lr 4.78e-04 | 322.49 ms | 52.3% bf16 MFU | 1625142 tok/s step 6315/19560 | loss 3.471315 (-0.78z)| norm 0.3099 
(+1.04z)| lr 4.78e-04 | 323.02 ms | 52.2% bf16 MFU | 1625038 tok/s step 6316/19560 | loss 3.474493 (-0.71z)| norm 0.2693 (-0.62z)| lr 4.78e-04 | 322.89 ms | 52.3% bf16 MFU | 1624972 tok/s step 6317/19560 | loss 3.445765 (-1.23z)| norm 0.2577 (-1.09z)| lr 4.78e-04 | 323.11 ms | 52.2% bf16 MFU | 1624855 tok/s step 6318/19560 | loss 3.504561 (-0.13z)| norm 0.2786 (-0.23z)| lr 4.78e-04 | 322.89 ms | 52.3% bf16 MFU | 1624799 tok/s step 6319/19560 | loss 3.477028 (-0.64z)| norm 0.2865 (+0.09z)| lr 4.78e-04 | 322.37 ms | 52.4% bf16 MFU | 1624878 tok/s step 6320/19560 | loss 3.532232 (+0.38z)| norm 0.2718 (-0.51z)| lr 4.78e-04 | 322.19 ms | 52.4% bf16 MFU | 1624996 tok/s step 6321/19560 | loss 3.507910 (-0.06z)| norm 0.2926 (+0.34z)| lr 4.78e-04 | 322.32 ms | 52.4% bf16 MFU | 1625076 tok/s step 6322/19560 | loss 3.591430 (+1.48z)| norm 0.2840 (-0.00z)| lr 4.78e-04 | 323.02 ms | 52.2% bf16 MFU | 1624976 tok/s step 6323/19560 | loss 3.527610 (+0.29z)| norm 0.3034 (+0.79z)| lr 4.78e-04 | 322.80 ms | 52.3% bf16 MFU | 1624935 tok/s step 6324/19560 | loss 3.484008 (-0.52z)| norm 0.3241 (+1.61z)| lr 4.78e-04 | 322.52 ms | 52.3% bf16 MFU | 1624969 tok/s step 6325/19560 | loss 3.529647 (+0.33z)| norm 0.3012 (+0.68z)| lr 4.78e-04 | 322.55 ms | 52.3% bf16 MFU | 1624993 tok/s step 6326/19560 | loss 3.512085 (-0.01z)| norm 0.3076 (+0.92z)| lr 4.78e-04 | 323.10 ms | 52.2% bf16 MFU | 1624876 tok/s step 6327/19560 | loss 3.557331 (+0.83z)| norm 0.2926 (+0.33z)| lr 4.78e-04 | 322.74 ms | 52.3% bf16 MFU | 1624856 tok/s step 6328/19560 | loss 3.536378 (+0.43z)| norm 0.2823 (-0.10z)| lr 4.78e-04 | 322.24 ms | 52.4% bf16 MFU | 1624962 tok/s step 6329/19560 | loss 3.529413 (+0.29z)| norm 0.2643 (-0.83z)| lr 4.78e-04 | 322.46 ms | 52.3% bf16 MFU | 1625010 tok/s step 6330/19560 | loss 3.522672 (+0.16z)| norm 0.2959 (+0.46z)| lr 4.78e-04 | 322.96 ms | 52.3% bf16 MFU | 1624928 tok/s step 6331/19560 | loss 3.475789 (-0.72z)| norm 0.2983 (+0.55z)| lr 4.77e-04 | 322.91 ms | 52.3% bf16 MFU | 1624865 
tok/s step 6332/19560 | loss 3.522699 (+0.17z)| norm 0.2539 (-1.24z)| lr 4.77e-04 | 323.04 ms | 52.2% bf16 MFU | 1624769 tok/s step 6333/19560 | loss 3.412520 (-1.88z)| norm 0.2777 (-0.28z)| lr 4.77e-04 | 323.13 ms | 52.2% bf16 MFU | 1624656 tok/s step 6334/19560 | loss 3.522545 (+0.18z)| norm 0.2480 (-1.46z)| lr 4.77e-04 | 322.45 ms | 52.3% bf16 MFU | 1624722 tok/s step 6335/19560 | loss 3.504967 (-0.15z)| norm 0.2700 (-0.56z)| lr 4.77e-04 | 322.37 ms | 52.4% bf16 MFU | 1624805 tok/s step 6336/19560 | loss 3.483689 (-0.55z)| norm 0.2960 (+0.51z)| lr 4.77e-04 | 322.81 ms | 52.3% bf16 MFU | 1624771 tok/s step 6337/19560 | loss 3.531753 (+0.35z)| norm 0.2521 (-1.28z)| lr 4.77e-04 | 323.23 ms | 52.2% bf16 MFU | 1624634 tok/s step 6338/19560 | loss 3.608474 (+1.77z)| norm 0.2768 (-0.25z)| lr 4.77e-04 | 322.99 ms | 52.3% bf16 MFU | 1624563 tok/s step 6339/19560 | loss 3.464659 (-0.91z)| norm 0.2662 (-0.69z)| lr 4.77e-04 | 322.53 ms | 52.3% bf16 MFU | 1624612 tok/s step 6340/19560 | loss 3.522377 (+0.17z)| norm 0.3090 (+1.14z)| lr 4.77e-04 | 322.92 ms | 52.3% bf16 MFU | 1624562 tok/s step 6341/19560 | loss 3.468266 (-0.84z)| norm 0.2556 (-1.13z)| lr 4.77e-04 | 322.71 ms | 52.3% bf16 MFU | 1624566 tok/s step 6342/19560 | loss 3.492636 (-0.40z)| norm 0.2709 (-0.46z)| lr 4.77e-04 | 323.13 ms | 52.2% bf16 MFU | 1624464 tok/s step 6343/19560 | loss 3.462322 (-0.96z)| norm 0.2590 (-0.97z)| lr 4.77e-04 | 322.98 ms | 52.3% bf16 MFU | 1624405 tok/s step 6344/19560 | loss 3.527515 (+0.27z)| norm 0.2444 (-1.57z)| lr 4.77e-04 | 322.85 ms | 52.3% bf16 MFU | 1624381 tok/s step 6345/19560 | loss 3.442552 (-1.30z)| norm 0.2616 (-0.83z)| lr 4.77e-04 | 323.56 ms | 52.2% bf16 MFU | 1624180 tok/s step 6346/19560 | loss 3.504592 (-0.14z)| norm 0.2792 (-0.07z)| lr 4.77e-04 | 322.35 ms | 52.4% bf16 MFU | 1624295 tok/s step 6347/19560 | loss 3.517442 (+0.09z)| norm 0.2729 (-0.33z)| lr 4.77e-04 | 322.71 ms | 52.3% bf16 MFU | 1624312 tok/s step 6348/19560 | loss 3.512998 (+0.01z)| norm 0.2414 
(-1.66z)| lr 4.77e-04 | 323.63 ms | 52.1% bf16 MFU | 1624098 tok/s step 6349/19560 | loss 3.518306 (+0.11z)| norm 0.2642 (-0.68z)| lr 4.77e-04 | 322.31 ms | 52.4% bf16 MFU | 1624226 tok/s step 6350/19560 | loss 3.460321 (-0.97z)| norm 0.2560 (-1.01z)| lr 4.77e-04 | 323.07 ms | 52.2% bf16 MFU | 1624157 tok/s step 6351/19560 | loss 3.544518 (+0.60z)| norm 0.2555 (-1.02z)| lr 4.77e-04 | 322.89 ms | 52.3% bf16 MFU | 1624135 tok/s step 6352/19560 | loss 3.512131 (-0.00z)| norm 0.2812 (+0.09z)| lr 4.77e-04 | 322.85 ms | 52.3% bf16 MFU | 1624126 tok/s step 6353/19560 | loss 3.523556 (+0.20z)| norm 0.2494 (-1.26z)| lr 4.77e-04 | 322.38 ms | 52.4% bf16 MFU | 1624236 tok/s step 6354/19560 | loss 3.478415 (-0.63z)| norm 0.2654 (-0.56z)| lr 4.77e-04 | 322.46 ms | 52.3% bf16 MFU | 1624318 tok/s step 6355/19560 | loss 3.490077 (-0.40z)| norm 0.2699 (-0.35z)| lr 4.76e-04 | 323.16 ms | 52.2% bf16 MFU | 1624220 tok/s step 6356/19560 | loss 3.502719 (-0.16z)| norm 0.2594 (-0.81z)| lr 4.76e-04 | 322.99 ms | 52.3% bf16 MFU | 1624170 tok/s step 6357/19560 | loss 3.509610 (-0.02z)| norm 0.2836 (+0.27z)| lr 4.76e-04 | 323.57 ms | 52.2% bf16 MFU | 1623978 tok/s step 6358/19560 | loss 3.447659 (-1.18z)| norm 0.2561 (-0.95z)| lr 4.76e-04 | 322.72 ms | 52.3% bf16 MFU | 1624008 tok/s step 6359/19560 | loss 3.528549 (+0.34z)| norm 0.2796 (+0.08z)| lr 4.76e-04 | 323.06 ms | 52.2% bf16 MFU | 1623950 tok/s step 6360/19560 | loss 3.485134 (-0.48z)| norm 0.2599 (-0.78z)| lr 4.76e-04 | 322.86 ms | 52.3% bf16 MFU | 1623947 tok/s step 6361/19560 | loss 3.500717 (-0.17z)| norm 0.2701 (-0.34z)| lr 4.76e-04 | 323.59 ms | 52.2% bf16 MFU | 1623762 tok/s step 6362/19560 | loss 3.468061 (-0.79z)| norm 0.2732 (-0.19z)| lr 4.76e-04 | 322.49 ms | 52.3% bf16 MFU | 1623861 tok/s step 6363/19560 | loss 3.689054 (+3.28z)| norm 0.3375 (+2.58z)| lr 4.76e-04 | 322.79 ms | 52.3% bf16 MFU | 1623881 tok/s step 6364/19560 | loss 3.504886 (-0.11z)| norm 0.3427 (+2.70z)| lr 4.76e-04 | 322.77 ms | 52.3% bf16 MFU | 1623904 
tok/s step 6365/19560 | loss 3.497630 (-0.24z)| norm 0.3080 (+1.22z)| lr 4.76e-04 | 322.68 ms | 52.3% bf16 MFU | 1623949 tok/s step 6366/19560 | loss 3.501081 (-0.18z)| norm 0.2563 (-0.95z)| lr 4.76e-04 | 323.06 ms | 52.2% bf16 MFU | 1623894 tok/s step 6367/19560 | loss 3.475090 (-0.65z)| norm 0.3575 (+3.15z)| lr 4.76e-04 | 322.51 ms | 52.3% bf16 MFU | 1623983 tok/s step 6368/19560 | loss 3.512761 (+0.05z)| norm 0.3382 (+2.32z)| lr 4.76e-04 | 323.08 ms | 52.2% bf16 MFU | 1623923 tok/s step 6369/19560 | loss 3.551306 (+0.76z)| norm 0.3931 (+4.15z)| lr 4.76e-04 | 323.20 ms | 52.2% bf16 MFU | 1623834 tok/s step 6370/19560 | loss 3.481289 (-0.53z)| norm 0.2713 (-0.35z)| lr 4.76e-04 | 322.76 ms | 52.3% bf16 MFU | 1623863 tok/s step 6371/19560 | loss 3.517744 (+0.14z)| norm 0.3176 (+1.34z)| lr 4.76e-04 | 323.41 ms | 52.2% bf16 MFU | 1623726 tok/s step 6372/19560 | loss 3.516292 (+0.10z)| norm 0.2922 (+0.40z)| lr 4.76e-04 | 322.69 ms | 52.3% bf16 MFU | 1623778 tok/s step 6373/19560 | loss 3.520826 (+0.22z)| norm 0.2904 (+0.33z)| lr 4.76e-04 | 323.38 ms | 52.2% bf16 MFU | 1623652 tok/s step 6374/19560 | loss 3.549261 (+0.77z)| norm 0.3013 (+0.73z)| lr 4.76e-04 | 322.70 ms | 52.3% bf16 MFU | 1623703 tok/s step 6375/19560 | loss 3.491912 (-0.35z)| norm 0.2660 (-0.58z)| lr 4.76e-04 | 323.14 ms | 52.2% bf16 MFU | 1623643 tok/s step 6376/19560 | loss 3.569279 (+1.14z)| norm 0.2790 (-0.09z)| lr 4.76e-04 | 323.12 ms | 52.2% bf16 MFU | 1623590 tok/s step 6377/19560 | loss 3.475944 (-0.68z)| norm 0.2675 (-0.53z)| lr 4.76e-04 | 323.11 ms | 52.2% bf16 MFU | 1623542 tok/s step 6378/19560 | loss 3.497231 (-0.26z)| norm 0.2477 (-1.25z)| lr 4.76e-04 | 323.14 ms | 52.2% bf16 MFU | 1623489 tok/s step 6379/19560 | loss 3.556972 (+0.91z)| norm 0.2649 (-0.63z)| lr 4.76e-04 | 322.44 ms | 52.3% bf16 MFU | 1623615 tok/s step 6380/19560 | loss 3.475874 (-0.68z)| norm 0.2641 (-0.65z)| lr 4.75e-04 | 322.49 ms | 52.3% bf16 MFU | 1623722 tok/s step 6381/19560 | loss 3.494365 (-0.32z)| norm 0.2586 
(-0.85z)| lr 4.75e-04 | 322.70 ms | 52.3% bf16 MFU | 1623770 tok/s step 6382/19560 | loss 3.519013 (+0.17z)| norm 0.2534 (-1.04z)| lr 4.75e-04 | 322.54 ms | 52.3% bf16 MFU | 1623855 tok/s step 6383/19560 | loss 3.508252 (-0.04z)| norm 0.2550 (-0.96z)| lr 4.75e-04 | 323.54 ms | 52.2% bf16 MFU | 1623686 tok/s step 6384/19560 | loss 3.418260 (-1.78z)| norm 0.2533 (-1.02z)| lr 4.75e-04 | 322.71 ms | 52.3% bf16 MFU | 1623732 tok/s step 6385/19560 | loss 3.521356 (+0.22z)| norm 0.2553 (-0.93z)| lr 4.75e-04 | 322.75 ms | 52.3% bf16 MFU | 1623769 tok/s step 6386/19560 | loss 3.465423 (-0.86z)| norm 0.2559 (-0.90z)| lr 4.75e-04 | 322.34 ms | 52.4% bf16 MFU | 1623906 tok/s step 6387/19560 | loss 3.552522 (+0.82z)| norm 0.2725 (-0.27z)| lr 4.75e-04 | 322.83 ms | 52.3% bf16 MFU | 1623911 tok/s step 6388/19560 | loss 3.538181 (+0.53z)| norm 0.2917 (+0.46z)| lr 4.75e-04 | 323.88 ms | 52.1% bf16 MFU | 1623655 tok/s step 6389/19560 | loss 3.489854 (-0.39z)| norm 0.2709 (-0.33z)| lr 4.75e-04 | 322.79 ms | 52.3% bf16 MFU | 1623684 tok/s step 6390/19560 | loss 3.553474 (+0.87z)| norm 0.2485 (-1.20z)| lr 4.75e-04 | 323.13 ms | 52.2% bf16 MFU | 1623626 tok/s step 6391/19560 | loss 3.483795 (-0.51z)| norm 0.2483 (-1.19z)| lr 4.75e-04 | 322.70 ms | 52.3% bf16 MFU | 1623680 tok/s step 6392/19560 | loss 3.499374 (-0.21z)| norm 0.2622 (-0.64z)| lr 4.75e-04 | 323.04 ms | 52.2% bf16 MFU | 1623646 tok/s step 6393/19560 | loss 3.514546 (+0.09z)| norm 0.2596 (-0.75z)| lr 4.75e-04 | 323.09 ms | 52.2% bf16 MFU | 1623601 tok/s step 6394/19560 | loss 3.561776 (+1.03z)| norm 0.2457 (-1.27z)| lr 4.75e-04 | 322.99 ms | 52.3% bf16 MFU | 1623583 tok/s step 6395/19560 | loss 3.413295 (-1.91z)| norm 0.2946 (+0.64z)| lr 4.75e-04 | 323.04 ms | 52.2% bf16 MFU | 1623554 tok/s step 6396/19560 | loss 3.521156 (+0.22z)| norm 0.3163 (+1.48z)| lr 4.75e-04 | 323.29 ms | 52.2% bf16 MFU | 1623462 tok/s step 6397/19560 | loss 3.471858 (-0.75z)| norm 0.3055 (+1.04z)| lr 4.75e-04 | 322.67 ms | 52.3% bf16 MFU | 1623532 
tok/s step 6398/19560 | loss 3.503160 (-0.14z)| norm 0.3050 (+1.01z)| lr 4.75e-04 | 323.71 ms | 52.1% bf16 MFU | 1623336 tok/s step 6399/19560 | loss 3.559131 (+1.22z)| norm 0.2817 (+0.11z)| lr 4.75e-04 | 322.65 ms | 52.3% bf16 MFU | 1623416 tok/s step 6400/19560 | loss 3.461855 (-1.09z)| norm 0.2930 (+0.55z)| lr 4.75e-04 | 322.56 ms | 52.3% bf16 MFU | 1623515 tok/s step 6401/19560 | loss 3.481793 (-0.61z)| norm 0.2646 (-0.56z)| lr 4.75e-04 | 322.82 ms | 52.3% bf16 MFU | 1623544 tok/s step 6402/19560 | loss 3.532154 (+0.59z)| norm 0.2892 (+0.39z)| lr 4.75e-04 | 322.62 ms | 52.3% bf16 MFU | 1623622 tok/s step 6403/19560 | loss 3.488174 (-0.47z)| norm 0.2575 (-0.85z)| lr 4.75e-04 | 322.72 ms | 52.3% bf16 MFU | 1623671 tok/s step 6404/19560 | loss 3.466388 (-0.99z)| norm 0.2538 (-1.00z)| lr 4.75e-04 | 323.05 ms | 52.2% bf16 MFU | 1623634 tok/s step 6405/19560 | loss 3.459461 (-1.16z)| norm 0.2661 (-0.50z)| lr 4.74e-04 | 322.92 ms | 52.3% bf16 MFU | 1623633 tok/s step 6406/19560 | loss 3.501045 (-0.12z)| norm 0.2805 (+0.07z)| lr 4.74e-04 | 322.99 ms | 52.3% bf16 MFU | 1623613 tok/s step 6407/19560 | loss 3.473480 (-0.80z)| norm 0.2608 (-0.72z)| lr 4.74e-04 | 323.13 ms | 52.2% bf16 MFU | 1623558 tok/s step 6408/19560 | loss 3.584207 (+1.91z)| norm 0.3056 (+1.06z)| lr 4.74e-04 | 323.33 ms | 52.2% bf16 MFU | 1623458 tok/s step 6409/19560 | loss 3.516081 (+0.23z)| norm 0.2735 (-0.21z)| lr 4.74e-04 | 322.62 ms | 52.3% bf16 MFU | 1623540 tok/s step 6410/19560 | loss 3.519814 (+0.33z)| norm 0.3081 (+1.19z)| lr 4.74e-04 | 322.95 ms | 52.3% bf16 MFU | 1623536 tok/s step 6411/19560 | loss 3.525673 (+0.47z)| norm 0.2922 (+0.54z)| lr 4.74e-04 | 322.74 ms | 52.3% bf16 MFU | 1623583 tok/s step 6412/19560 | loss 3.540098 (+0.82z)| norm 0.2946 (+0.62z)| lr 4.74e-04 | 322.93 ms | 52.3% bf16 MFU | 1623581 tok/s step 6413/19560 | loss 3.499604 (-0.17z)| norm 0.2801 (+0.04z)| lr 4.74e-04 | 323.14 ms | 52.2% bf16 MFU | 1623525 tok/s step 6414/19560 | loss 3.458833 (-1.17z)| norm 0.2822 
(+0.13z)| lr 4.74e-04 | 323.19 ms | 52.2% bf16 MFU | 1623461 tok/s step 6415/19560 | loss 3.514653 (+0.20z)| norm 0.3057 (+1.08z)| lr 4.74e-04 | 322.27 ms | 52.4% bf16 MFU | 1623630 tok/s step 6416/19560 | loss 3.525124 (+0.46z)| norm 0.2812 (+0.09z)| lr 4.74e-04 | 322.53 ms | 52.3% bf16 MFU | 1623726 tok/s step 6417/19560 | loss 3.478228 (-0.69z)| norm 0.2837 (+0.20z)| lr 4.74e-04 | 322.76 ms | 52.3% bf16 MFU | 1623759 tok/s step 6418/19560 | loss 3.545582 (+0.95z)| norm 0.2675 (-0.46z)| lr 4.74e-04 | 322.60 ms | 52.3% bf16 MFU | 1623831 tok/s step 6419/19560 | loss 3.534852 (+0.68z)| norm 0.3255 (+1.85z)| lr 4.74e-04 | 322.57 ms | 52.3% bf16 MFU | 1623907 tok/s step 6420/19560 | loss 3.439663 (-1.65z)| norm 0.2615 (-0.70z)| lr 4.74e-04 | 322.32 ms | 52.4% bf16 MFU | 1624043 tok/s step 6421/19560 | loss 3.530633 (+0.59z)| norm 0.2881 (+0.37z)| lr 4.74e-04 | 322.46 ms | 52.3% bf16 MFU | 1624136 tok/s step 6422/19560 | loss 3.525791 (+0.46z)| norm 0.2793 (+0.02z)| lr 4.74e-04 | 322.54 ms | 52.3% bf16 MFU | 1624203 tok/s step 6423/19560 | loss 3.470103 (-0.90z)| norm 0.2748 (-0.15z)| lr 4.74e-04 | 322.48 ms | 52.3% bf16 MFU | 1624283 tok/s step 6424/19560 | loss 3.461529 (-1.09z)| norm 0.2546 (-0.96z)| lr 4.74e-04 | 323.04 ms | 52.2% bf16 MFU | 1624217 tok/s step 6425/19560 | loss 3.450467 (-1.40z)| norm 0.2531 (-1.01z)| lr 4.74e-04 | 322.59 ms | 52.3% bf16 MFU | 1624269 tok/s step 6426/19560 | loss 3.514693 (+0.20z)| norm 0.2542 (-0.96z)| lr 4.74e-04 | 322.52 ms | 52.3% bf16 MFU | 1624335 tok/s step 6427/19560 | loss 3.524664 (+0.45z)| norm 0.2561 (-0.88z)| lr 4.74e-04 | 322.39 ms | 52.4% bf16 MFU | 1624431 tok/s step 6428/19560 | loss 3.515229 (+0.21z)| norm 0.2747 (-0.15z)| lr 4.74e-04 | 323.26 ms | 52.2% bf16 MFU | 1624303 tok/s step 6429/19560 | loss 3.469795 (-0.92z)| norm 0.2541 (-0.97z)| lr 4.73e-04 | 322.14 ms | 52.4% bf16 MFU | 1624462 tok/s step 6430/19560 | loss 3.447809 (-1.49z)| norm 0.2603 (-0.71z)| lr 4.73e-04 | 322.70 ms | 52.3% bf16 MFU | 1624473 
tok/s step 6431/19560 | loss 3.462637 (-1.10z)| norm 0.2744 (-0.14z)| lr 4.73e-04 | 322.67 ms | 52.3% bf16 MFU | 1624490 tok/s step 6432/19560 | loss 3.501885 (-0.11z)| norm 0.3132 (+1.40z)| lr 4.73e-04 | 322.49 ms | 52.3% bf16 MFU | 1624553 tok/s step 6433/19560 | loss 3.520355 (+0.37z)| norm 0.2818 (+0.14z)| lr 4.73e-04 | 323.15 ms | 52.2% bf16 MFU | 1624447 tok/s step 6434/19560 | loss 3.490232 (-0.41z)| norm 0.3042 (+1.02z)| lr 4.73e-04 | 322.40 ms | 52.3% bf16 MFU | 1624536 tok/s step 6435/19560 | loss 3.489789 (-0.42z)| norm 0.2735 (-0.20z)| lr 4.73e-04 | 322.90 ms | 52.3% bf16 MFU | 1624494 tok/s step 6436/19560 | loss 3.515174 (+0.23z)| norm 0.2713 (-0.30z)| lr 4.73e-04 | 322.76 ms | 52.3% bf16 MFU | 1624488 tok/s step 6437/19560 | loss 3.492805 (-0.34z)| norm 0.2981 (+0.78z)| lr 4.73e-04 | 322.53 ms | 52.3% bf16 MFU | 1624541 tok/s step 6438/19560 | loss 3.569281 (+1.60z)| norm 0.2786 (-0.01z)| lr 4.73e-04 | 322.57 ms | 52.3% bf16 MFU | 1624582 tok/s step 6439/19560 | loss 3.454220 (-1.31z)| norm 0.2950 (+0.64z)| lr 4.73e-04 | 322.80 ms | 52.3% bf16 MFU | 1624562 tok/s step 6440/19560 | loss 3.591343 (+2.14z)| norm 0.2797 (+0.03z)| lr 4.73e-04 | 322.52 ms | 52.3% bf16 MFU | 1624614 tok/s step 6441/19560 | loss 3.536117 (+0.76z)| norm 0.2635 (-0.62z)| lr 4.73e-04 | 322.70 ms | 52.3% bf16 MFU | 1624618 tok/s step 6442/19560 | loss 3.502386 (-0.08z)| norm 0.2768 (-0.08z)| lr 4.73e-04 | 322.14 ms | 52.4% bf16 MFU | 1624761 tok/s step 6443/19560 | loss 3.553759 (+1.20z)| norm 0.2660 (-0.51z)| lr 4.73e-04 | 323.19 ms | 52.2% bf16 MFU | 1624635 tok/s step 6444/19560 | loss 3.491665 (-0.37z)| norm 0.2975 (+0.76z)| lr 4.73e-04 | 322.67 ms | 52.3% bf16 MFU | 1624647 tok/s step 6445/19560 | loss 3.443342 (-1.59z)| norm 0.2667 (-0.49z)| lr 4.73e-04 | 322.34 ms | 52.4% bf16 MFU | 1624739 tok/s step 6446/19560 | loss 3.413198 (-2.28z)| norm 0.2659 (-0.52z)| lr 4.73e-04 | 322.62 ms | 52.3% bf16 MFU | 1624758 tok/s step 6447/19560 | loss 3.530772 (+0.61z)| norm 0.2643 
(-0.58z)| lr 4.73e-04 | 322.46 ms | 52.3% bf16 MFU | 1624815 tok/s step 6448/19560 | loss 3.535524 (+0.73z)| norm 0.2945 (+0.64z)| lr 4.73e-04 | 322.62 ms | 52.3% bf16 MFU | 1624829 tok/s step 6449/19560 | loss 3.532621 (+0.65z)| norm 0.2839 (+0.21z)| lr 4.73e-04 | 322.65 ms | 52.3% bf16 MFU | 1624835 tok/s step 6450/19560 | loss 3.599233 (+2.28z)| norm 0.2764 (-0.09z)| lr 4.73e-04 | 322.56 ms | 52.3% bf16 MFU | 1624862 tok/s step 6451/19560 | loss 3.494525 (-0.28z)| norm 0.2801 (+0.07z)| lr 4.73e-04 | 322.50 ms | 52.3% bf16 MFU | 1624904 tok/s step 6452/19560 | loss 3.547541 (+1.01z)| norm 0.2717 (-0.26z)| lr 4.73e-04 | 322.59 ms | 52.3% bf16 MFU | 1624919 tok/s step 6453/19560 | loss 3.509308 (+0.07z)| norm 0.2666 (-0.46z)| lr 4.73e-04 | 322.55 ms | 52.3% bf16 MFU | 1624946 tok/s step 6454/19560 | loss 3.495454 (-0.26z)| norm 0.2661 (-0.47z)| lr 4.72e-04 | 322.10 ms | 52.4% bf16 MFU | 1625084 tok/s step 6455/19560 | loss 3.513277 (+0.18z)| norm 0.2507 (-1.10z)| lr 4.72e-04 | 322.56 ms | 52.3% bf16 MFU | 1625099 tok/s step 6456/19560 | loss 3.508008 (+0.06z)| norm 0.2522 (-1.02z)| lr 4.72e-04 | 322.18 ms | 52.4% bf16 MFU | 1625209 tok/s step 6457/19560 | loss 3.631407 (+2.99z)| norm 0.2899 (+0.53z)| lr 4.72e-04 | 322.62 ms | 52.3% bf16 MFU | 1625204 tok/s step 6458/19560 | loss 3.555541 (+1.16z)| norm 0.3072 (+1.24z)| lr 4.72e-04 | 322.52 ms | 52.3% bf16 MFU | 1625222 tok/s step 6459/19560 | loss 3.550219 (+1.02z)| norm 0.2652 (-0.48z)| lr 4.72e-04 | 322.45 ms | 52.3% bf16 MFU | 1625260 tok/s step 6460/19560 | loss 3.504487 (-0.06z)| norm 0.2878 (+0.44z)| lr 4.72e-04 | 322.57 ms | 52.3% bf16 MFU | 1625265 tok/s step 6461/19560 | loss 3.457102 (-1.21z)| norm 0.2608 (-0.67z)| lr 4.72e-04 | 322.22 ms | 52.4% bf16 MFU | 1625357 tok/s step 6462/19560 | loss 3.476606 (-0.73z)| norm 0.2880 (+0.44z)| lr 4.72e-04 | 322.70 ms | 52.3% bf16 MFU | 1625323 tok/s step 6463/19560 | loss 3.542258 (+0.84z)| norm 0.3231 (+1.86z)| lr 4.72e-04 | 322.74 ms | 52.3% bf16 MFU | 1625282 
tok/s step 6464/19560 | loss 3.524570 (+0.40z)| norm 0.2984 (+0.85z)| lr 4.72e-04 | 322.50 ms | 52.3% bf16 MFU | 1625303 tok/s step 6465/19560 | loss 3.458667 (-1.16z)| norm 0.2672 (-0.44z)| lr 4.72e-04 | 322.44 ms | 52.3% bf16 MFU | 1625338 tok/s step 6466/19560 | loss 3.409258 (-2.31z)| norm 0.2972 (+0.79z)| lr 4.72e-04 | 322.54 ms | 52.3% bf16 MFU | 1625347 tok/s step 6467/19560 | loss 3.471888 (-0.81z)| norm 0.2742 (-0.16z)| lr 4.72e-04 | 322.82 ms | 52.3% bf16 MFU | 1625283 tok/s step 6468/19560 | loss 3.532907 (+0.65z)| norm 0.2744 (-0.14z)| lr 4.72e-04 | 322.76 ms | 52.3% bf16 MFU | 1625238 tok/s step 6469/19560 | loss 3.455320 (-1.20z)| norm 0.2898 (+0.49z)| lr 4.72e-04 | 322.37 ms | 52.4% bf16 MFU | 1625295 tok/s step 6470/19560 | loss 3.435313 (-1.65z)| norm 0.2529 (-1.03z)| lr 4.72e-04 | 322.50 ms | 52.3% bf16 MFU | 1625316 tok/s step 6471/19560 | loss 3.524921 (+0.46z)| norm 0.2825 (+0.18z)| lr 4.72e-04 | 322.74 ms | 52.3% bf16 MFU | 1625274 tok/s step 6472/19560 | loss 3.413302 (-2.13z)| norm 0.2652 (-0.54z)| lr 4.72e-04 | 322.37 ms | 52.4% bf16 MFU | 1625328 tok/s step 6473/19560 | loss 3.502580 (-0.06z)| norm 0.2616 (-0.70z)| lr 4.72e-04 | 322.53 ms | 52.3% bf16 MFU | 1625340 tok/s step 6474/19560 | loss 3.454141 (-1.19z)| norm 0.2487 (-1.22z)| lr 4.72e-04 | 322.89 ms | 52.3% bf16 MFU | 1625260 tok/s step 6475/19560 | loss 3.468341 (-0.84z)| norm 0.2746 (-0.14z)| lr 4.72e-04 | 322.45 ms | 52.3% bf16 MFU | 1625294 tok/s step 6476/19560 | loss 3.465165 (-0.91z)| norm 0.3201 (+1.71z)| lr 4.72e-04 | 322.59 ms | 52.3% bf16 MFU | 1625292 tok/s step 6477/19560 | loss 3.447645 (-1.29z)| norm 0.2697 (-0.37z)| lr 4.72e-04 | 322.98 ms | 52.3% bf16 MFU | 1625191 tok/s step 6478/19560 | loss 3.550739 (+1.07z)| norm 0.2546 (-1.00z)| lr 4.71e-04 | 322.77 ms | 52.3% bf16 MFU | 1625148 tok/s step 6479/19560 | loss 3.546536 (+0.98z)| norm 0.2700 (-0.37z)| lr 4.71e-04 | 322.56 ms | 52.3% bf16 MFU | 1625162 tok/s step 6480/19560 | loss 3.483613 (-0.47z)| norm 0.2611 
(-0.73z)| lr 4.71e-04 | 322.33 ms | 52.4% bf16 MFU | 1625231 tok/s step 6481/19560 | loss 3.489116 (-0.34z)| norm 0.2863 (+0.31z)| lr 4.71e-04 | 322.68 ms | 52.3% bf16 MFU | 1625210 tok/s step 6482/19560 | loss 3.485727 (-0.42z)| norm 0.2626 (-0.68z)| lr 4.71e-04 | 322.39 ms | 52.3% bf16 MFU | 1625262 tok/s step 6483/19560 | loss 3.536664 (+0.75z)| norm 0.3015 (+0.92z)| lr 4.71e-04 | 322.63 ms | 52.3% bf16 MFU | 1625251 tok/s step 6484/19560 | loss 3.485210 (-0.44z)| norm 0.2919 (+0.52z)| lr 4.71e-04 | 322.46 ms | 52.3% bf16 MFU | 1625282 tok/s step 6485/19560 | loss 3.443942 (-1.36z)| norm 0.2783 (-0.04z)| lr 4.71e-04 | 322.83 ms | 52.3% bf16 MFU | 1625219 tok/s step 6486/19560 | loss 3.515218 (+0.26z)| norm 0.2922 (+0.52z)| lr 4.71e-04 | 322.99 ms | 52.3% bf16 MFU | 1625120 tok/s step 6487/19560 | loss 3.520895 (+0.39z)| norm 0.2796 (-0.00z)| lr 4.71e-04 | 322.49 ms | 52.3% bf16 MFU | 1625151 tok/s step 6488/19560 | loss 3.482985 (-0.48z)| norm 0.2987 (+0.78z)| lr 4.71e-04 | 322.74 ms | 52.3% bf16 MFU | 1625118 tok/s step 6489/19560 | loss 3.459921 (-1.00z)| norm 0.3127 (+1.34z)| lr 4.71e-04 | 323.18 ms | 52.2% bf16 MFU | 1624975 tok/s step 6490/19560 | loss 3.526147 (+0.50z)| norm 0.2744 (-0.24z)| lr 4.71e-04 | 322.90 ms | 52.3% bf16 MFU | 1624910 tok/s step 6491/19560 | loss 3.526846 (+0.59z)| norm 0.2669 (-0.54z)| lr 4.71e-04 | 322.42 ms | 52.3% bf16 MFU | 1624971 tok/s step 6492/19560 | loss 3.517663 (+0.36z)| norm 0.2555 (-1.02z)| lr 4.71e-04 | 322.94 ms | 52.3% bf16 MFU | 1624897 tok/s step 6493/19560 | loss 3.484746 (-0.45z)| norm 0.2696 (-0.40z)| lr 4.71e-04 | 322.93 ms | 52.3% bf16 MFU | 1624828 tok/s step 6494/19560 | loss 3.468108 (-0.85z)| norm 0.2589 (-0.87z)| lr 4.71e-04 | 322.80 ms | 52.3% bf16 MFU | 1624794 tok/s step 6495/19560 | loss 3.494842 (-0.20z)| norm 0.2422 (-1.62z)| lr 4.71e-04 | 322.83 ms | 52.3% bf16 MFU | 1624758 tok/s step 6496/19560 | loss 3.483396 (-0.47z)| norm 0.2726 (-0.22z)| lr 4.71e-04 | 322.47 ms | 52.3% bf16 MFU | 1624813 
tok/s step 6497/19560 | loss 3.452151 (-1.23z)| norm 0.2710 (-0.29z)| lr 4.71e-04 | 322.92 ms | 52.3% bf16 MFU | 1624751 tok/s step 6498/19560 | loss 3.487114 (-0.36z)| norm 0.2706 (-0.31z)| lr 4.71e-04 | 322.54 ms | 52.3% bf16 MFU | 1624788 tok/s step 6499/19560 | loss 3.543884 (+1.03z)| norm 0.2563 (-1.06z)| lr 4.71e-04 | 322.81 ms | 52.3% bf16 MFU | 1624756 tok/s step 6500/19560 | loss 3.563413 (+1.49z)| norm 0.3553 (+3.99z)| lr 4.71e-04 | 322.26 ms | 52.4% bf16 MFU | 1624865 tok/s val loss 3.497015 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating HellaSwag: 70/79 HellaSwag: 2820/10042 = 0.280821 step 6501/19560 | loss 3.558582 (+1.35z)| norm 0.3088 (+1.62z)| lr 4.71e-04 | 323.24 ms | 52.2% bf16 MFU | 1624721 tok/s step 6502/19560 | loss 3.508764 (+0.16z)| norm 0.2917 (+0.77z)| lr 4.71e-04 | 322.18 ms | 52.4% bf16 MFU | 1624851 tok/s step 6503/19560 | loss 3.439563 (-1.51z)| norm 0.2967 (+1.00z)| lr 4.70e-04 | 322.27 ms | 52.4% bf16 MFU | 1624951 tok/s step 6504/19560 | loss 3.500138 (-0.03z)| norm 0.3068 (+1.48z)| lr 4.70e-04 | 322.32 ms | 52.4% bf16 MFU | 1625033 tok/s step 6505/19560 | loss 3.428964 (-1.74z)| norm 0.2887 (+0.57z)| lr 4.70e-04 | 322.69 ms | 52.3% bf16 MFU | 1625018 tok/s step 6506/19560 | loss 3.448035 (-1.26z)| norm 0.2841 (+0.33z)| lr 4.70e-04 | 323.33 ms | 52.2% bf16 MFU | 1624843 tok/s step 6507/19560 | loss 3.519491 (+0.46z)| norm 0.2843 (+0.34z)| lr 4.70e-04 | 321.96 ms | 52.4% bf16 MFU | 1625021 tok/s step 6508/19560 | loss 3.523232 (+0.54z)| norm 0.2641 (-0.68z)| lr 4.70e-04 | 322.78 ms | 52.3% bf16 MFU | 1624984 tok/s step 6509/19560 | loss 3.517900 (+0.41z)| norm 0.2666 (-0.55z)| lr 4.70e-04 | 321.87 ms | 52.4% bf16 MFU | 1625180 tok/s step 6510/19560 | loss 3.528118 (+0.65z)| norm 0.2696 (-0.41z)| lr 4.70e-04 | 322.54 ms | 52.3% bf16 MFU | 1625196 tok/s step 6511/19560 | loss 
3.585454 (+1.99z)| norm 0.2862 (+0.42z)| lr 4.70e-04 | 322.69 ms | 52.3% bf16 MFU | 1625172 tok/s step 6512/19560 | loss 3.479215 (-0.55z)| norm 0.2904 (+0.62z)| lr 4.70e-04 | 322.88 ms | 52.3% bf16 MFU | 1625104 tok/s step 6513/19560 | loss 3.454189 (-1.14z)| norm 0.2622 (-0.83z)| lr 4.70e-04 | 322.39 ms | 52.4% bf16 MFU | 1625162 tok/s step 6514/19560 | loss 3.463359 (-0.92z)| norm 0.2823 (+0.19z)| lr 4.70e-04 | 323.12 ms | 52.2% bf16 MFU | 1625032 tok/s step 6515/19560 | loss 3.509010 (+0.19z)| norm 0.2919 (+0.68z)| lr 4.70e-04 | 322.30 ms | 52.4% bf16 MFU | 1625114 tok/s step 6516/19560 | loss 3.474294 (-0.64z)| norm 0.2769 (-0.09z)| lr 4.70e-04 | 322.75 ms | 52.3% bf16 MFU | 1625081 tok/s step 6517/19560 | loss 3.471639 (-0.70z)| norm 0.2796 (+0.05z)| lr 4.70e-04 | 323.69 ms | 52.1% bf16 MFU | 1624813 tok/s step 6518/19560 | loss 3.474386 (-0.62z)| norm 0.2514 (-1.41z)| lr 4.70e-04 | 322.95 ms | 52.3% bf16 MFU | 1624745 tok/s step 6519/19560 | loss 3.492502 (-0.18z)| norm 0.2499 (-1.49z)| lr 4.70e-04 | 322.63 ms | 52.3% bf16 MFU | 1624759 tok/s step 6520/19560 | loss 3.468044 (-0.77z)| norm 0.2778 (-0.05z)| lr 4.70e-04 | 322.58 ms | 52.3% bf16 MFU | 1624786 tok/s step 6521/19560 | loss 3.479229 (-0.49z)| norm 0.2812 (+0.11z)| lr 4.70e-04 | 322.39 ms | 52.3% bf16 MFU | 1624859 tok/s step 6522/19560 | loss 3.458899 (-0.97z)| norm 0.2667 (-0.66z)| lr 4.70e-04 | 323.22 ms | 52.2% bf16 MFU | 1624721 tok/s step 6523/19560 | loss 3.535202 (+0.88z)| norm 0.2888 (+0.51z)| lr 4.70e-04 | 322.61 ms | 52.3% bf16 MFU | 1624741 tok/s step 6524/19560 | loss 3.507007 (+0.18z)| norm 0.2838 (+0.26z)| lr 4.70e-04 | 322.81 ms | 52.3% bf16 MFU | 1624710 tok/s step 6525/19560 | loss 3.496344 (-0.08z)| norm 0.2721 (-0.35z)| lr 4.70e-04 | 323.10 ms | 52.2% bf16 MFU | 1624609 tok/s step 6526/19560 | loss 3.448396 (-1.25z)| norm 0.2631 (-0.82z)| lr 4.70e-04 | 322.48 ms | 52.3% bf16 MFU | 1624670 tok/s step 6527/19560 | loss 3.524749 (+0.64z)| norm 0.2900 (+0.63z)| lr 4.69e-04 | 322.74 
ms | 52.3% bf16 MFU | 1624662 tok/s step 6528/19560 | loss 3.510035 (+0.26z)| norm 0.2675 (-0.57z)| lr 4.69e-04 | 322.34 ms | 52.4% bf16 MFU | 1624754 tok/s step 6529/19560 | loss 3.487035 (-0.31z)| norm 0.2890 (+0.58z)| lr 4.69e-04 | 323.17 ms | 52.2% bf16 MFU | 1624634 tok/s step 6530/19560 | loss 3.508197 (+0.22z)| norm 0.2865 (+0.45z)| lr 4.69e-04 | 322.85 ms | 52.3% bf16 MFU | 1624599 tok/s step 6531/19560 | loss 3.529490 (+0.74z)| norm 0.2870 (+0.46z)| lr 4.69e-04 | 322.35 ms | 52.4% bf16 MFU | 1624690 tok/s step 6532/19560 | loss 3.513547 (+0.34z)| norm 0.3283 (+2.62z)| lr 4.69e-04 | 323.26 ms | 52.2% bf16 MFU | 1624549 tok/s step 6533/19560 | loss 3.533015 (+0.81z)| norm 0.3055 (+1.38z)| lr 4.69e-04 | 322.76 ms | 52.3% bf16 MFU | 1624540 tok/s step 6534/19560 | loss 3.557156 (+1.39z)| norm 0.2650 (-0.76z)| lr 4.69e-04 | 323.11 ms | 52.2% bf16 MFU | 1624446 tok/s step 6535/19560 | loss 3.545856 (+1.09z)| norm 0.2737 (-0.30z)| lr 4.69e-04 | 322.43 ms | 52.3% bf16 MFU | 1624528 tok/s step 6536/19560 | loss 3.535966 (+0.87z)| norm 0.2744 (-0.26z)| lr 4.69e-04 | 322.51 ms | 52.3% bf16 MFU | 1624585 tok/s step 6537/19560 | loss 3.505132 (+0.10z)| norm 0.3270 (+2.48z)| lr 4.69e-04 | 323.78 ms | 52.1% bf16 MFU | 1624319 tok/s step 6538/19560 | loss 3.458930 (-1.04z)| norm 0.2826 (+0.17z)| lr 4.69e-04 | 322.35 ms | 52.4% bf16 MFU | 1624425 tok/s step 6539/19560 | loss 3.465077 (-0.87z)| norm 0.2643 (-0.78z)| lr 4.69e-04 | 322.86 ms | 52.3% bf16 MFU | 1624398 tok/s step 6540/19560 | loss 3.470192 (-0.73z)| norm 0.2845 (+0.29z)| lr 4.69e-04 | 322.86 ms | 52.3% bf16 MFU | 1624372 tok/s step 6541/19560 | loss 3.490594 (-0.22z)| norm 0.2664 (-0.66z)| lr 4.69e-04 | 323.08 ms | 52.2% bf16 MFU | 1624292 tok/s step 6542/19560 | loss 3.448600 (-1.27z)| norm 0.2828 (+0.20z)| lr 4.69e-04 | 323.07 ms | 52.2% bf16 MFU | 1624218 tok/s step 6543/19560 | loss 3.504993 (+0.14z)| norm 0.2447 (-1.77z)| lr 4.69e-04 | 322.59 ms | 52.3% bf16 MFU | 1624270 tok/s step 6544/19560 | loss 
3.537560 (+0.95z)| norm 0.2927 (+0.74z)| lr 4.69e-04 | 322.78 ms | 52.3% bf16 MFU | 1624270 tok/s step 6545/19560 | loss 3.518069 (+0.46z)| norm 0.2328 (-2.32z)| lr 4.69e-04 | 323.36 ms | 52.2% bf16 MFU | 1624126 tok/s step 6546/19560 | loss 3.560839 (+1.51z)| norm 0.2843 (+0.31z)| lr 4.69e-04 | 322.36 ms | 52.4% bf16 MFU | 1624241 tok/s step 6547/19560 | loss 3.464539 (-0.86z)| norm 0.2751 (-0.15z)| lr 4.69e-04 | 322.80 ms | 52.3% bf16 MFU | 1624237 tok/s step 6548/19560 | loss 3.532404 (+0.81z)| norm 0.2673 (-0.56z)| lr 4.69e-04 | 322.60 ms | 52.3% bf16 MFU | 1624286 tok/s step 6549/19560 | loss 3.527705 (+0.69z)| norm 0.2948 (+0.88z)| lr 4.69e-04 | 322.94 ms | 52.3% bf16 MFU | 1624247 tok/s step 6550/19560 | loss 3.543763 (+1.08z)| norm 0.3232 (+2.30z)| lr 4.69e-04 | 322.38 ms | 52.4% bf16 MFU | 1624351 tok/s step 6551/19560 | loss 3.551081 (+1.24z)| norm 0.2797 (+0.06z)| lr 4.68e-04 | 322.94 ms | 52.3% bf16 MFU | 1624307 tok/s step 6552/19560 | loss 3.480423 (-0.51z)| norm 0.2895 (+0.56z)| lr 4.68e-04 | 322.53 ms | 52.3% bf16 MFU | 1624370 tok/s step 6553/19560 | loss 3.512870 (+0.29z)| norm 0.2923 (+0.69z)| lr 4.68e-04 | 323.07 ms | 52.2% bf16 MFU | 1624292 tok/s step 6554/19560 | loss 3.529078 (+0.69z)| norm 0.2662 (-0.67z)| lr 4.68e-04 | 322.87 ms | 52.3% bf16 MFU | 1624269 tok/s step 6555/19560 | loss 3.523256 (+0.54z)| norm 0.2766 (-0.14z)| lr 4.68e-04 | 322.57 ms | 52.3% bf16 MFU | 1624323 tok/s step 6556/19560 | loss 3.480772 (-0.51z)| norm 0.2722 (-0.37z)| lr 4.68e-04 | 322.60 ms | 52.3% bf16 MFU | 1624367 tok/s step 6557/19560 | loss 3.459044 (-1.05z)| norm 0.2590 (-1.06z)| lr 4.68e-04 | 322.57 ms | 52.3% bf16 MFU | 1624416 tok/s step 6558/19560 | loss 3.532449 (+0.76z)| norm 0.2826 (+0.17z)| lr 4.68e-04 | 323.48 ms | 52.2% bf16 MFU | 1624234 tok/s step 6559/19560 | loss 3.515288 (+0.33z)| norm 0.2908 (+0.59z)| lr 4.68e-04 | 323.01 ms | 52.2% bf16 MFU | 1624179 tok/s step 6560/19560 | loss 3.444659 (-1.42z)| norm 0.2713 (-0.42z)| lr 4.68e-04 | 322.59 
ms | 52.3% bf16 MFU | 1624232 tok/s step 6561/19560 | loss 3.501350 (-0.01z)| norm 0.2645 (-0.77z)| lr 4.68e-04 | 322.72 ms | 52.3% bf16 MFU | 1624250 tok/s step 6562/19560 | loss 3.594924 (+2.26z)| norm 0.2728 (-0.32z)| lr 4.68e-04 | 323.05 ms | 52.2% bf16 MFU | 1624184 tok/s step 6563/19560 | loss 3.448598 (-1.30z)| norm 0.2655 (-0.71z)| lr 4.68e-04 | 323.10 ms | 52.2% bf16 MFU | 1624108 tok/s step 6564/19560 | loss 3.472637 (-0.71z)| norm 0.2481 (-1.61z)| lr 4.68e-04 | 322.44 ms | 52.3% bf16 MFU | 1624202 tok/s step 6565/19560 | loss 3.505350 (+0.08z)| norm 0.2473 (-1.62z)| lr 4.68e-04 | 323.05 ms | 52.2% bf16 MFU | 1624139 tok/s step 6566/19560 | loss 3.451577 (-1.20z)| norm 0.2514 (-1.39z)| lr 4.68e-04 | 322.60 ms | 52.3% bf16 MFU | 1624191 tok/s step 6567/19560 | loss 3.477405 (-0.58z)| norm 0.2495 (-1.46z)| lr 4.68e-04 | 323.18 ms | 52.2% bf16 MFU | 1624096 tok/s step 6568/19560 | loss 3.529045 (+0.70z)| norm 0.2640 (-0.70z)| lr 4.68e-04 | 323.03 ms | 52.2% bf16 MFU | 1624043 tok/s step 6569/19560 | loss 3.463958 (-0.90z)| norm 0.2508 (-1.37z)| lr 4.68e-04 | 322.82 ms | 52.3% bf16 MFU | 1624046 tok/s step 6570/19560 | loss 3.559810 (+1.46z)| norm 0.2624 (-0.77z)| lr 4.68e-04 | 322.86 ms | 52.3% bf16 MFU | 1624037 tok/s step 6571/19560 | loss 3.487891 (-0.30z)| norm 0.2541 (-1.18z)| lr 4.68e-04 | 322.53 ms | 52.3% bf16 MFU | 1624112 tok/s step 6572/19560 | loss 3.502866 (+0.07z)| norm 0.2519 (-1.27z)| lr 4.68e-04 | 322.25 ms | 52.4% bf16 MFU | 1624255 tok/s step 6573/19560 | loss 3.526915 (+0.65z)| norm 0.2614 (-0.79z)| lr 4.68e-04 | 323.12 ms | 52.2% bf16 MFU | 1624171 tok/s step 6574/19560 | loss 3.480379 (-0.53z)| norm 0.2563 (-1.04z)| lr 4.68e-04 | 323.12 ms | 52.2% bf16 MFU | 1624092 tok/s step 6575/19560 | loss 3.515765 (+0.37z)| norm 0.2372 (-1.97z)| lr 4.67e-04 | 322.53 ms | 52.3% bf16 MFU | 1624164 tok/s step 6576/19560 | loss 3.479016 (-0.56z)| norm 0.3084 (+1.58z)| lr 4.67e-04 | 322.73 ms | 52.3% bf16 MFU | 1624183 tok/s step 6577/19560 | loss 
3.537131 (+0.93z)| norm 0.3031 (+1.30z)| lr 4.67e-04 | 322.88 ms | 52.3% bf16 MFU | 1624164 tok/s step 6578/19560 | loss 3.452691 (-1.22z)| norm 0.2911 (+0.70z)| lr 4.67e-04 | 323.75 ms | 52.1% bf16 MFU | 1623927 tok/s step 6579/19560 | loss 3.574175 (+1.89z)| norm 0.2667 (-0.50z)| lr 4.67e-04 | 322.59 ms | 52.3% bf16 MFU | 1623994 tok/s step 6580/19560 | loss 3.524738 (+0.63z)| norm 0.2893 (+0.61z)| lr 4.67e-04 | 322.67 ms | 52.3% bf16 MFU | 1624035 tok/s step 6581/19560 | loss 3.472053 (-0.71z)| norm 0.2854 (+0.41z)| lr 4.67e-04 | 322.93 ms | 52.3% bf16 MFU | 1624009 tok/s step 6582/19560 | loss 3.553572 (+1.36z)| norm 0.2931 (+0.78z)| lr 4.67e-04 | 322.72 ms | 52.3% bf16 MFU | 1624038 tok/s step 6583/19560 | loss 3.497040 (-0.08z)| norm 0.2917 (+0.70z)| lr 4.67e-04 | 323.27 ms | 52.2% bf16 MFU | 1623928 tok/s step 6584/19560 | loss 3.453845 (-1.16z)| norm 0.2734 (-0.22z)| lr 4.67e-04 | 323.04 ms | 52.2% bf16 MFU | 1623881 tok/s step 6585/19560 | loss 3.487979 (-0.28z)| norm 0.2513 (-1.29z)| lr 4.67e-04 | 323.09 ms | 52.2% bf16 MFU | 1623824 tok/s step 6586/19560 | loss 3.442163 (-1.48z)| norm 0.2789 (+0.08z)| lr 4.67e-04 | 322.84 ms | 52.3% bf16 MFU | 1623832 tok/s step 6587/19560 | loss 3.463895 (-0.89z)| norm 0.2593 (-0.89z)| lr 4.67e-04 | 322.03 ms | 52.4% bf16 MFU | 1624045 tok/s step 6588/19560 | loss 3.546850 (+1.31z)| norm 0.2768 (-0.01z)| lr 4.67e-04 | 323.25 ms | 52.2% bf16 MFU | 1623940 tok/s step 6589/19560 | loss 3.467142 (-0.81z)| norm 0.2584 (-0.93z)| lr 4.67e-04 | 323.09 ms | 52.2% bf16 MFU | 1623880 tok/s step 6590/19560 | loss 3.491081 (-0.17z)| norm 0.2850 (+0.39z)| lr 4.67e-04 | 321.89 ms | 52.4% bf16 MFU | 1624126 tok/s step 6591/19560 | loss 3.490320 (-0.18z)| norm 0.2560 (-1.04z)| lr 4.67e-04 | 322.86 ms | 52.3% bf16 MFU | 1624115 tok/s step 6592/19560 | loss 3.480217 (-0.44z)| norm 0.2650 (-0.57z)| lr 4.67e-04 | 322.04 ms | 52.4% bf16 MFU | 1624310 tok/s step 6593/19560 | loss 3.577281 (+2.10z)| norm 0.2979 (+1.08z)| lr 4.67e-04 | 322.99 
ms | 52.3% bf16 MFU | 1624257 tok/s step 6594/19560 | loss 3.478175 (-0.54z)| norm 0.3153 (+1.94z)| lr 4.67e-04 | 322.89 ms | 52.3% bf16 MFU | 1624231 tok/s step 6595/19560 | loss 3.525792 (+0.73z)| norm 0.2583 (-0.91z)| lr 4.67e-04 | 322.54 ms | 52.3% bf16 MFU | 1624295 tok/s step 6596/19560 | loss 3.493429 (-0.13z)| norm 0.2756 (-0.05z)| lr 4.67e-04 | 322.81 ms | 52.3% bf16 MFU | 1624288 tok/s step 6597/19560 | loss 3.473796 (-0.67z)| norm 0.3084 (+1.57z)| lr 4.67e-04 | 322.88 ms | 52.3% bf16 MFU | 1624263 tok/s step 6598/19560 | loss 3.491648 (-0.20z)| norm 0.3022 (+1.25z)| lr 4.67e-04 | 322.69 ms | 52.3% bf16 MFU | 1624288 tok/s step 6599/19560 | loss 3.538761 (+1.09z)| norm 0.3038 (+1.31z)| lr 4.66e-04 | 322.66 ms | 52.3% bf16 MFU | 1624319 tok/s step 6600/19560 | loss 3.480623 (-0.53z)| norm 0.3418 (+3.04z)| lr 4.66e-04 | 322.46 ms | 52.3% bf16 MFU | 1624397 tok/s step 6601/19560 | loss 3.510003 (+0.29z)| norm 0.3007 (+1.07z)| lr 4.66e-04 | 322.44 ms | 52.3% bf16 MFU | 1624478 tok/s step 6602/19560 | loss 3.518906 (+0.53z)| norm 0.2907 (+0.58z)| lr 4.66e-04 | 322.25 ms | 52.4% bf16 MFU | 1624600 tok/s step 6603/19560 | loss 3.554514 (+1.50z)| norm 0.3026 (+1.13z)| lr 4.66e-04 | 322.49 ms | 52.3% bf16 MFU | 1624658 tok/s step 6604/19560 | loss 3.469921 (-0.87z)| norm 0.2952 (+0.80z)| lr 4.66e-04 | 322.22 ms | 52.4% bf16 MFU | 1624781 tok/s step 6605/19560 | loss 3.509401 (+0.23z)| norm 0.3035 (+1.18z)| lr 4.66e-04 | 322.12 ms | 52.4% bf16 MFU | 1624922 tok/s step 6606/19560 | loss 3.508693 (+0.22z)| norm 0.2693 (-0.46z)| lr 4.66e-04 | 322.61 ms | 52.3% bf16 MFU | 1624932 tok/s step 6607/19560 | loss 3.510992 (+0.29z)| norm 0.2790 (-0.00z)| lr 4.66e-04 | 322.80 ms | 52.3% bf16 MFU | 1624895 tok/s step 6608/19560 | loss 3.467982 (-0.93z)| norm 0.2645 (-0.70z)| lr 4.66e-04 | 322.26 ms | 52.4% bf16 MFU | 1624995 tok/s step 6609/19560 | loss 3.542950 (+1.19z)| norm 0.2634 (-0.74z)| lr 4.66e-04 | 322.60 ms | 52.3% bf16 MFU | 1625006 tok/s step 6610/19560 | loss 
3.504114 (+0.08z)| norm 0.2866 (+0.37z)| lr 4.66e-04 | 322.56 ms | 52.3% bf16 MFU | 1625026 tok/s step 6611/19560 | loss 3.479142 (-0.62z)| norm 0.2759 (-0.14z)| lr 4.66e-04 | 322.49 ms | 52.3% bf16 MFU | 1625062 tok/s step 6612/19560 | loss 3.552871 (+1.46z)| norm 0.2749 (-0.18z)| lr 4.66e-04 | 322.49 ms | 52.3% bf16 MFU | 1625097 tok/s step 6613/19560 | loss 3.490554 (-0.32z)| norm 0.3900 (+4.83z)| lr 4.66e-04 | 322.39 ms | 52.3% bf16 MFU | 1625153 tok/s step 6614/19560 | loss 3.519622 (+0.51z)| norm 0.2920 (+0.54z)| lr 4.66e-04 | 322.67 ms | 52.3% bf16 MFU | 1625139 tok/s step 6615/19560 | loss 3.505332 (+0.11z)| norm 0.2649 (-0.63z)| lr 4.66e-04 | 322.74 ms | 52.3% bf16 MFU | 1625107 tok/s step 6616/19560 | loss 3.513598 (+0.34z)| norm 0.2739 (-0.23z)| lr 4.66e-04 | 322.90 ms | 52.3% bf16 MFU | 1625036 tok/s step 6617/19560 | loss 3.464465 (-1.07z)| norm 0.2855 (+0.28z)| lr 4.66e-04 | 321.99 ms | 52.4% bf16 MFU | 1625199 tok/s step 6618/19560 | loss 3.540899 (+1.11z)| norm 0.2548 (-1.06z)| lr 4.66e-04 | 322.55 ms | 52.3% bf16 MFU | 1625212 tok/s step 6619/19560 | loss 3.458830 (-1.21z)| norm 0.2677 (-0.49z)| lr 4.66e-04 | 322.67 ms | 52.3% bf16 MFU | 1625194 tok/s step 6620/19560 | loss 3.521378 (+0.57z)| norm 0.2895 (+0.46z)| lr 4.66e-04 | 322.70 ms | 52.3% bf16 MFU | 1625170 tok/s step 6621/19560 | loss 3.490653 (-0.31z)| norm 0.2727 (-0.29z)| lr 4.66e-04 | 322.58 ms | 52.3% bf16 MFU | 1625175 tok/s step 6622/19560 | loss 3.487088 (-0.42z)| norm 0.2648 (-0.64z)| lr 4.66e-04 | 322.30 ms | 52.4% bf16 MFU | 1625251 tok/s step 6623/19560 | loss 3.452518 (-1.38z)| norm 0.2424 (-1.63z)| lr 4.65e-04 | 322.53 ms | 52.3% bf16 MFU | 1625264 tok/s step 6624/19560 | loss 3.552236 (+1.42z)| norm 0.2886 (+0.41z)| lr 4.65e-04 | 322.66 ms | 52.3% bf16 MFU | 1625246 tok/s step 6625/19560 | loss 3.472018 (-0.85z)| norm 0.2823 (+0.13z)| lr 4.65e-04 | 322.29 ms | 52.4% bf16 MFU | 1625321 tok/s step 6626/19560 | loss 3.500343 (-0.05z)| norm 0.2621 (-0.76z)| lr 4.65e-04 | 322.86 
ms | 52.3% bf16 MFU | 1625250 tok/s step 6627/19560 | loss 3.481896 (-0.56z)| norm 0.2624 (-0.75z)| lr 4.65e-04 | 322.80 ms | 52.3% bf16 MFU | 1625198 tok/s step 6628/19560 | loss 3.526955 (+0.73z)| norm 0.2524 (-1.21z)| lr 4.65e-04 | 322.51 ms | 52.3% bf16 MFU | 1625220 tok/s step 6629/19560 | loss 3.456405 (-1.27z)| norm 0.2471 (-1.43z)| lr 4.65e-04 | 322.60 ms | 52.3% bf16 MFU | 1625219 tok/s step 6630/19560 | loss 3.535326 (+0.99z)| norm 0.2642 (-0.63z)| lr 4.65e-04 | 322.10 ms | 52.4% bf16 MFU | 1625343 tok/s step 6631/19560 | loss 3.475835 (-0.73z)| norm 0.2388 (-1.76z)| lr 4.65e-04 | 322.88 ms | 52.3% bf16 MFU | 1625264 tok/s step 6632/19560 | loss 3.476428 (-0.71z)| norm 0.2430 (-1.54z)| lr 4.65e-04 | 322.52 ms | 52.3% bf16 MFU | 1625282 tok/s step 6633/19560 | loss 3.551034 (+1.44z)| norm 0.2595 (-0.78z)| lr 4.65e-04 | 322.60 ms | 52.3% bf16 MFU | 1625277 tok/s step 6634/19560 | loss 3.445080 (-1.65z)| norm 0.2640 (-0.57z)| lr 4.65e-04 | 323.19 ms | 52.2% bf16 MFU | 1625125 tok/s step 6635/19560 | loss 3.458560 (-1.24z)| norm 0.3393 (+2.75z)| lr 4.65e-04 | 322.37 ms | 52.4% bf16 MFU | 1625188 tok/s step 6636/19560 | loss 3.560980 (+1.71z)| norm 0.3043 (+1.18z)| lr 4.65e-04 | 322.25 ms | 52.4% bf16 MFU | 1625276 tok/s step 6637/19560 | loss 3.492843 (-0.25z)| norm 0.2756 (-0.08z)| lr 4.65e-04 | 323.08 ms | 52.2% bf16 MFU | 1625150 tok/s step 6638/19560 | loss 3.522826 (+0.62z)| norm 0.3149 (+1.61z)| lr 4.65e-04 | 322.39 ms | 52.3% bf16 MFU | 1625205 tok/s step 6639/19560 | loss 3.508227 (+0.22z)| norm 0.2695 (-0.35z)| lr 4.65e-04 | 322.22 ms | 52.4% bf16 MFU | 1625300 tok/s step 6640/19560 | loss 3.482026 (-0.55z)| norm 0.2828 (+0.23z)| lr 4.65e-04 | 323.14 ms | 52.2% bf16 MFU | 1625159 tok/s step 6641/19560 | loss 3.514511 (+0.39z)| norm 0.2557 (-0.95z)| lr 4.65e-04 | 322.21 ms | 52.4% bf16 MFU | 1625259 tok/s step 6642/19560 | loss 3.467980 (-0.99z)| norm 0.2781 (+0.03z)| lr 4.65e-04 | 322.58 ms | 52.3% bf16 MFU | 1625261 tok/s step 6643/19560 | loss 
3.507825 (+0.19z)| norm 0.2518 (-1.10z)| lr 4.65e-04 | 323.02 ms | 52.2% bf16 MFU | 1625152 tok/s step 6644/19560 | loss 3.596210 (+2.72z)| norm 0.2659 (-0.48z)| lr 4.65e-04 | 322.33 ms | 52.4% bf16 MFU | 1625222 tok/s step 6645/19560 | loss 3.518582 (+0.46z)| norm 0.2854 (+0.36z)| lr 4.65e-04 | 322.70 ms | 52.3% bf16 MFU | 1625194 tok/s step 6646/19560 | loss 3.512366 (+0.28z)| norm 0.2756 (-0.08z)| lr 4.65e-04 | 322.78 ms | 52.3% bf16 MFU | 1625148 tok/s step 6647/19560 | loss 3.487982 (-0.43z)| norm 0.3076 (+1.30z)| lr 4.64e-04 | 322.57 ms | 52.3% bf16 MFU | 1625159 tok/s step 6648/19560 | loss 3.518211 (+0.44z)| norm 0.2736 (-0.18z)| lr 4.64e-04 | 322.37 ms | 52.4% bf16 MFU | 1625220 tok/s step 6649/19560 | loss 3.536519 (+0.95z)| norm 0.2881 (+0.45z)| lr 4.64e-04 | 322.57 ms | 52.3% bf16 MFU | 1625227 tok/s step 6650/19560 | loss 3.459710 (-1.28z)| norm 0.2889 (+0.47z)| lr 4.64e-04 | 322.42 ms | 52.3% bf16 MFU | 1625271 tok/s step 6651/19560 | loss 3.480239 (-0.67z)| norm 0.2795 (+0.07z)| lr 4.64e-04 | 322.32 ms | 52.4% bf16 MFU | 1625337 tok/s step 6652/19560 | loss 3.475178 (-0.81z)| norm 0.2674 (-0.45z)| lr 4.64e-04 | 322.86 ms | 52.3% bf16 MFU | 1625265 tok/s step 6653/19560 | loss 3.576542 (+2.08z)| norm 0.2826 (+0.21z)| lr 4.64e-04 | 322.39 ms | 52.3% bf16 MFU | 1625314 tok/s step 6654/19560 | loss 3.430833 (-2.06z)| norm 0.2695 (-0.37z)| lr 4.64e-04 | 322.99 ms | 52.3% bf16 MFU | 1625210 tok/s step 6655/19560 | loss 3.488024 (-0.43z)| norm 0.2644 (-0.58z)| lr 4.64e-04 | 322.39 ms | 52.4% bf16 MFU | 1625263 tok/s step 6656/19560 | loss 3.502698 (-0.01z)| norm 0.2581 (-0.84z)| lr 4.64e-04 | 322.28 ms | 52.4% bf16 MFU | 1625340 tok/s step 6657/19560 | loss 3.525173 (+0.62z)| norm 0.2891 (+0.50z)| lr 4.64e-04 | 322.56 ms | 52.3% bf16 MFU | 1625342 tok/s step 6658/19560 | loss 3.427463 (-2.10z)| norm 0.2775 (-0.00z)| lr 4.64e-04 | 322.51 ms | 52.3% bf16 MFU | 1625358 tok/s step 6659/19560 | loss 3.458046 (-1.23z)| norm 0.2410 (-1.55z)| lr 4.64e-04 | 322.43 
ms | 52.3% bf16 MFU | 1625392 tok/s step 6660/19560 | loss 3.400176 (-2.73z)| norm 0.2604 (-0.71z)| lr 4.64e-04 | 322.33 ms | 52.4% bf16 MFU | 1625451 tok/s step 6661/19560 | loss 3.460627 (-1.08z)| norm 0.2981 (+0.94z)| lr 4.64e-04 | 322.38 ms | 52.4% bf16 MFU | 1625495 tok/s step 6662/19560 | loss 3.596516 (+2.52z)| norm 0.2846 (+0.35z)| lr 4.64e-04 | 322.54 ms | 52.3% bf16 MFU | 1625495 tok/s step 6663/19560 | loss 3.449653 (-1.34z)| norm 0.2987 (+0.95z)| lr 4.64e-04 | 322.61 ms | 52.3% bf16 MFU | 1625478 tok/s step 6664/19560 | loss 3.580180 (+2.07z)| norm 0.2931 (+0.70z)| lr 4.64e-04 | 322.51 ms | 52.3% bf16 MFU | 1625485 tok/s step 6665/19560 | loss 3.491798 (-0.23z)| norm 0.2653 (-0.50z)| lr 4.64e-04 | 322.38 ms | 52.4% bf16 MFU | 1625527 tok/s step 6666/19560 | loss 3.486781 (-0.37z)| norm 0.2791 (+0.11z)| lr 4.64e-04 | 322.59 ms | 52.3% bf16 MFU | 1625513 tok/s step 6667/19560 | loss 3.500940 (-0.00z)| norm 0.2983 (+0.95z)| lr 4.64e-04 | 322.69 ms | 52.3% bf16 MFU | 1625474 tok/s step 6668/19560 | loss 3.488102 (-0.35z)| norm 0.2680 (-0.39z)| lr 4.64e-04 | 322.78 ms | 52.3% bf16 MFU | 1625414 tok/s step 6669/19560 | loss 3.495736 (-0.15z)| norm 0.2580 (-0.82z)| lr 4.64e-04 | 322.20 ms | 52.4% bf16 MFU | 1625503 tok/s step 6670/19560 | loss 3.538178 (+0.96z)| norm 0.2780 (+0.06z)| lr 4.64e-04 | 322.76 ms | 52.3% bf16 MFU | 1625447 tok/s step 6671/19560 | loss 3.504640 (+0.07z)| norm 0.2812 (+0.19z)| lr 4.63e-04 | 322.93 ms | 52.3% bf16 MFU | 1625352 tok/s step 6672/19560 | loss 3.517670 (+0.42z)| norm 0.2859 (+0.40z)| lr 4.63e-04 | 322.45 ms | 52.3% bf16 MFU | 1625382 tok/s step 6673/19560 | loss 3.516538 (+0.39z)| norm 0.2773 (+0.01z)| lr 4.63e-04 | 322.52 ms | 52.3% bf16 MFU | 1625392 tok/s step 6674/19560 | loss 3.514565 (+0.35z)| norm 0.2588 (-0.82z)| lr 4.63e-04 | 322.29 ms | 52.4% bf16 MFU | 1625460 tok/s step 6675/19560 | loss 3.519954 (+0.48z)| norm 0.2852 (+0.37z)| lr 4.63e-04 | 322.69 ms | 52.3% bf16 MFU | 1625423 tok/s step 6676/19560 | loss 
3.467175 (-0.92z)| norm 0.2642 (-0.58z)| lr 4.63e-04 | 322.40 ms | 52.3% bf16 MFU | 1625461 tok/s step 6677/19560 | loss 3.517790 (+0.44z)| norm 0.3000 (+1.03z)| lr 4.63e-04 | 322.58 ms | 52.3% bf16 MFU | 1625454 tok/s step 6678/19560 | loss 3.470116 (-0.82z)| norm 0.2750 (-0.08z)| lr 4.63e-04 | 323.35 ms | 52.2% bf16 MFU | 1625253 tok/s step 6679/19560 | loss 3.558372 (+1.55z)| norm 0.2586 (-0.82z)| lr 4.63e-04 | 322.13 ms | 52.4% bf16 MFU | 1625369 tok/s step 6680/19560 | loss 3.460495 (-1.07z)| norm 0.3080 (+1.43z)| lr 4.63e-04 | 322.20 ms | 52.4% bf16 MFU | 1625461 tok/s step 6681/19560 | loss 3.519598 (+0.51z)| norm 0.3121 (+1.59z)| lr 4.63e-04 | 322.85 ms | 52.3% bf16 MFU | 1625384 tok/s step 6682/19560 | loss 3.514651 (+0.38z)| norm 0.2948 (+0.80z)| lr 4.63e-04 | 322.38 ms | 52.4% bf16 MFU | 1625431 tok/s step 6683/19560 | loss 3.467777 (-0.87z)| norm 0.2591 (-0.80z)| lr 4.63e-04 | 322.36 ms | 52.4% bf16 MFU | 1625480 tok/s step 6684/19560 | loss 3.503370 (+0.08z)| norm 0.3141 (+1.64z)| lr 4.63e-04 | 322.57 ms | 52.3% bf16 MFU | 1625473 tok/s step 6685/19560 | loss 3.499237 (-0.04z)| norm 0.2942 (+0.74z)| lr 4.63e-04 | 322.41 ms | 52.3% bf16 MFU | 1625507 tok/s step 6686/19560 | loss 3.488815 (-0.31z)| norm 0.2754 (-0.09z)| lr 4.63e-04 | 322.22 ms | 52.4% bf16 MFU | 1625586 tok/s step 6687/19560 | loss 3.400768 (-2.59z)| norm 0.2961 (+0.83z)| lr 4.63e-04 | 322.49 ms | 52.3% bf16 MFU | 1625595 tok/s step 6688/19560 | loss 3.439278 (-1.58z)| norm 0.3080 (+1.33z)| lr 4.63e-04 | 322.45 ms | 52.3% bf16 MFU | 1625612 tok/s step 6689/19560 | loss 3.508993 (+0.25z)| norm 0.2675 (-0.46z)| lr 4.63e-04 | 322.68 ms | 52.3% bf16 MFU | 1625570 tok/s step 6690/19560 | loss 3.471644 (-0.72z)| norm 0.2834 (+0.25z)| lr 4.63e-04 | 322.56 ms | 52.3% bf16 MFU | 1625561 tok/s step 6691/19560 | loss 3.442619 (-1.49z)| norm 0.2685 (-0.42z)| lr 4.63e-04 | 322.42 ms | 52.3% bf16 MFU | 1625588 tok/s step 6692/19560 | loss 3.517848 (+0.51z)| norm 0.2871 (+0.39z)| lr 4.63e-04 | 322.63 
ms | 52.3% bf16 MFU | 1625560 tok/s step 6693/19560 | loss 3.466387 (-0.86z)| norm 0.2854 (+0.31z)| lr 4.63e-04 | 322.85 ms | 52.3% bf16 MFU | 1625478 tok/s step 6694/19560 | loss 3.510281 (+0.31z)| norm 0.2907 (+0.53z)| lr 4.63e-04 | 323.18 ms | 52.2% bf16 MFU | 1625318 tok/s step 6695/19560 | loss 3.539831 (+1.08z)| norm 0.2976 (+0.83z)| lr 4.62e-04 | 322.37 ms | 52.4% bf16 MFU | 1625369 tok/s step 6696/19560 | loss 3.438366 (-1.60z)| norm 0.3046 (+1.13z)| lr 4.62e-04 | 322.71 ms | 52.3% bf16 MFU | 1625333 tok/s step 6697/19560 | loss 3.489522 (-0.25z)| norm 0.3109 (+1.39z)| lr 4.62e-04 | 322.21 ms | 52.4% bf16 MFU | 1625423 tok/s step 6698/19560 | loss 3.466406 (-0.85z)| norm 0.3245 (+1.96z)| lr 4.62e-04 | 322.80 ms | 52.3% bf16 MFU | 1625361 tok/s step 6699/19560 | loss 3.509037 (+0.29z)| norm 0.2873 (+0.29z)| lr 4.62e-04 | 322.55 ms | 52.3% bf16 MFU | 1625365 tok/s step 6700/19560 | loss 3.482016 (-0.43z)| norm 0.2790 (-0.09z)| lr 4.62e-04 | 322.43 ms | 52.3% bf16 MFU | 1625399 tok/s step 6701/19560 | loss 3.518151 (+0.54z)| norm 0.2726 (-0.38z)| lr 4.62e-04 | 322.51 ms | 52.3% bf16 MFU | 1625413 tok/s step 6702/19560 | loss 3.522736 (+0.65z)| norm 0.2809 (-0.02z)| lr 4.62e-04 | 322.43 ms | 52.3% bf16 MFU | 1625445 tok/s step 6703/19560 | loss 3.485003 (-0.35z)| norm 0.2635 (-0.82z)| lr 4.62e-04 | 322.56 ms | 52.3% bf16 MFU | 1625444 tok/s step 6704/19560 | loss 3.514797 (+0.44z)| norm 0.2768 (-0.20z)| lr 4.62e-04 | 322.70 ms | 52.3% bf16 MFU | 1625408 tok/s step 6705/19560 | loss 3.491994 (-0.17z)| norm 0.2550 (-1.19z)| lr 4.62e-04 | 322.29 ms | 52.4% bf16 MFU | 1625475 tok/s step 6706/19560 | loss 3.645883 (+3.75z)| norm 0.2852 (+0.21z)| lr 4.62e-04 | 322.57 ms | 52.3% bf16 MFU | 1625470 tok/s step 6707/19560 | loss 3.509345 (+0.27z)| norm 0.2958 (+0.68z)| lr 4.62e-04 | 322.58 ms | 52.3% bf16 MFU | 1625461 tok/s step 6708/19560 | loss 3.559676 (+1.55z)| norm 0.2407 (-1.82z)| lr 4.62e-04 | 322.88 ms | 52.3% bf16 MFU | 1625377 tok/s step 6709/19560 | loss 
3.417052 (-2.08z)| norm 0.3019 (+0.96z)| lr 4.62e-04 | 322.48 ms | 52.3% bf16 MFU | 1625399 tok/s step 6710/19560 | loss 3.484537 (-0.36z)| norm 0.3014 (+0.93z)| lr 4.62e-04 | 322.43 ms | 52.3% bf16 MFU | 1625432 tok/s step 6711/19560 | loss 3.500098 (+0.04z)| norm 0.2453 (-1.58z)| lr 4.62e-04 | 322.31 ms | 52.4% bf16 MFU | 1625493 tok/s step 6712/19560 | loss 3.474078 (-0.63z)| norm 0.2664 (-0.63z)| lr 4.62e-04 | 322.87 ms | 52.3% bf16 MFU | 1625410 tok/s step 6713/19560 | loss 3.465206 (-0.85z)| norm 0.2678 (-0.57z)| lr 4.62e-04 | 322.38 ms | 52.4% bf16 MFU | 1625454 tok/s step 6714/19560 | loss 3.499906 (+0.03z)| norm 0.2566 (-1.07z)| lr 4.62e-04 | 322.86 ms | 52.3% bf16 MFU | 1625376 tok/s step 6715/19560 | loss 3.519318 (+0.52z)| norm 0.2503 (-1.34z)| lr 4.62e-04 | 322.64 ms | 52.3% bf16 MFU | 1625358 tok/s step 6716/19560 | loss 3.468877 (-0.77z)| norm 0.2573 (-1.02z)| lr 4.62e-04 | 322.65 ms | 52.3% bf16 MFU | 1625338 tok/s step 6717/19560 | loss 3.524401 (+0.66z)| norm 0.2574 (-1.01z)| lr 4.62e-04 | 322.83 ms | 52.3% bf16 MFU | 1625272 tok/s step 6718/19560 | loss 3.536846 (+0.97z)| norm 0.2594 (-0.91z)| lr 4.62e-04 | 322.97 ms | 52.3% bf16 MFU | 1625174 tok/s step 6719/19560 | loss 3.466706 (-0.84z)| norm 0.2664 (-0.60z)| lr 4.61e-04 | 322.64 ms | 52.3% bf16 MFU | 1625164 tok/s step 6720/19560 | loss 3.457237 (-1.08z)| norm 0.2763 (-0.17z)| lr 4.61e-04 | 322.72 ms | 52.3% bf16 MFU | 1625136 tok/s step 6721/19560 | loss 3.637910 (+3.44z)| norm 0.2663 (-0.61z)| lr 4.61e-04 | 322.61 ms | 52.3% bf16 MFU | 1625137 tok/s step 6722/19560 | loss 3.455959 (-1.08z)| norm 0.3117 (+1.42z)| lr 4.61e-04 | 322.59 ms | 52.3% bf16 MFU | 1625144 tok/s step 6723/19560 | loss 3.432995 (-1.62z)| norm 0.2587 (-0.95z)| lr 4.61e-04 | 322.55 ms | 52.3% bf16 MFU | 1625159 tok/s step 6724/19560 | loss 3.420255 (-1.89z)| norm 0.2852 (+0.24z)| lr 4.61e-04 | 322.84 ms | 52.3% bf16 MFU | 1625101 tok/s step 6725/19560 | loss 3.461946 (-0.87z)| norm 0.2884 (+0.39z)| lr 4.61e-04 | 322.62 
ms | 52.3% bf16 MFU | 1625100 tok/s step 6726/19560 | loss 3.552720 (+1.30z)| norm 0.2973 (+0.79z)| lr 4.61e-04 | 322.67 ms | 52.3% bf16 MFU | 1625086 tok/s step 6727/19560 | loss 3.527181 (+0.69z)| norm 0.2766 (-0.13z)| lr 4.61e-04 | 323.17 ms | 52.2% bf16 MFU | 1624947 tok/s step 6728/19560 | loss 3.494922 (-0.09z)| norm 0.2931 (+0.66z)| lr 4.61e-04 | 323.34 ms | 52.2% bf16 MFU | 1624775 tok/s step 6729/19560 | loss 3.463733 (-0.83z)| norm 0.3517 (+3.24z)| lr 4.61e-04 | 322.50 ms | 52.3% bf16 MFU | 1624820 tok/s step 6730/19560 | loss 3.435173 (-1.49z)| norm 0.3142 (+1.53z)| lr 4.61e-04 | 322.62 ms | 52.3% bf16 MFU | 1624834 tok/s step 6731/19560 | loss 3.479147 (-0.43z)| norm 0.2760 (-0.16z)| lr 4.61e-04 | 323.00 ms | 52.3% bf16 MFU | 1624751 tok/s step 6732/19560 | loss 3.446381 (-1.21z)| norm 0.2917 (+0.55z)| lr 4.61e-04 | 322.35 ms | 52.4% bf16 MFU | 1624836 tok/s step 6733/19560 | loss 3.443881 (-1.25z)| norm 0.2751 (-0.18z)| lr 4.61e-04 | 322.67 ms | 52.3% bf16 MFU | 1624835 tok/s step 6734/19560 | loss 3.507665 (+0.27z)| norm 0.2670 (-0.55z)| lr 4.61e-04 | 322.83 ms | 52.3% bf16 MFU | 1624796 tok/s step 6735/19560 | loss 3.513796 (+0.42z)| norm 0.2716 (-0.34z)| lr 4.61e-04 | 322.57 ms | 52.3% bf16 MFU | 1624823 tok/s step 6736/19560 | loss 3.468586 (-0.66z)| norm 0.2826 (+0.15z)| lr 4.61e-04 | 323.07 ms | 52.2% bf16 MFU | 1624723 tok/s step 6737/19560 | loss 3.449000 (-1.11z)| norm 0.2366 (-1.89z)| lr 4.61e-04 | 323.62 ms | 52.2% bf16 MFU | 1624490 tok/s step 6738/19560 | loss 3.561591 (+1.55z)| norm 0.2850 (+0.26z)| lr 4.61e-04 | 322.85 ms | 52.3% bf16 MFU | 1624462 tok/s step 6739/19560 | loss 3.544661 (+1.13z)| norm 0.2950 (+0.70z)| lr 4.61e-04 | 322.56 ms | 52.3% bf16 MFU | 1624510 tok/s step 6740/19560 | loss 3.518570 (+0.53z)| norm 0.2818 (+0.11z)| lr 4.61e-04 | 322.92 ms | 52.3% bf16 MFU | 1624464 tok/s step 6741/19560 | loss 3.496872 (+0.01z)| norm 0.3025 (+1.17z)| lr 4.61e-04 | 322.77 ms | 52.3% bf16 MFU | 1624458 tok/s step 6742/19560 | loss 
3.567273 (+1.65z)| norm 0.2731 (-0.26z)| lr 4.61e-04 | 322.54 ms | 52.3% bf16 MFU | 1624511 tok/s step 6743/19560 | loss 3.503056 (+0.15z)| norm 0.2574 (-1.03z)| lr 4.60e-04 | 322.45 ms | 52.3% bf16 MFU | 1624584 tok/s step 6744/19560 | loss 3.514887 (+0.43z)| norm 0.2882 (+0.48z)| lr 4.60e-04 | 322.89 ms | 52.3% bf16 MFU | 1624542 tok/s step 6745/19560 | loss 3.482365 (-0.34z)| norm 0.2652 (-0.64z)| lr 4.60e-04 | 322.81 ms | 52.3% bf16 MFU | 1624523 tok/s step 6746/19560 | loss 3.469053 (-0.64z)| norm 0.2467 (-1.54z)| lr 4.60e-04 | 322.49 ms | 52.3% bf16 MFU | 1624583 tok/s step 6747/19560 | loss 3.493753 (-0.06z)| norm 0.2410 (-1.78z)| lr 4.60e-04 | 323.14 ms | 52.2% bf16 MFU | 1624477 tok/s step 6748/19560 | loss 3.419895 (-1.77z)| norm 0.2628 (-0.72z)| lr 4.60e-04 | 323.48 ms | 52.2% bf16 MFU | 1624292 tok/s step 6749/19560 | loss 3.481181 (-0.34z)| norm 0.2826 (+0.23z)| lr 4.60e-04 | 322.54 ms | 52.3% bf16 MFU | 1624352 tok/s step 6750/19560 | loss 3.456567 (-0.90z)| norm 0.2520 (-1.24z)| lr 4.60e-04 | 322.65 ms | 52.3% bf16 MFU | 1624380 tok/s val loss 3.486723 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating HellaSwag: 70/79 HellaSwag: 2801/10042 = 0.278928 step 6751/19560 | loss 3.475295 (-0.47z)| norm 0.2979 (+0.95z)| lr 4.60e-04 | 322.60 ms | 52.3% bf16 MFU | 1624420 tok/s step 6752/19560 | loss 3.508076 (+0.30z)| norm 0.2501 (-1.34z)| lr 4.60e-04 | 322.65 ms | 52.3% bf16 MFU | 1624445 tok/s step 6753/19560 | loss 3.599795 (+2.38z)| norm 0.2825 (+0.22z)| lr 4.60e-04 | 322.94 ms | 52.3% bf16 MFU | 1624398 tok/s step 6754/19560 | loss 3.508654 (+0.28z)| norm 0.2681 (-0.47z)| lr 4.60e-04 | 322.92 ms | 52.3% bf16 MFU | 1624356 tok/s step 6755/19560 | loss 3.603796 (+2.39z)| norm 0.2824 (+0.20z)| lr 4.60e-04 | 322.42 ms | 52.3% bf16 MFU | 1624444 tok/s step 6756/19560 | loss 3.459548 (-0.83z)| norm 0.2637 
(-0.70z)| lr 4.60e-04 | 322.92 ms | 52.3% bf16 MFU | 1624400 tok/s step 6757/19560 | loss 3.505834 (+0.20z)| norm 0.2678 (-0.52z)| lr 4.60e-04 | 322.29 ms | 52.4% bf16 MFU | 1624518 tok/s step 6758/19560 | loss 3.499484 (+0.06z)| norm 0.2531 (-1.22z)| lr 4.60e-04 | 322.13 ms | 52.4% bf16 MFU | 1624671 tok/s step 6759/19560 | loss 3.561264 (+1.43z)| norm 0.2764 (-0.11z)| lr 4.60e-04 | 322.48 ms | 52.3% bf16 MFU | 1624728 tok/s step 6760/19560 | loss 3.466963 (-0.68z)| norm 0.2871 (+0.41z)| lr 4.60e-04 | 322.64 ms | 52.3% bf16 MFU | 1624742 tok/s step 6761/19560 | loss 3.527918 (+0.69z)| norm 0.2805 (+0.07z)| lr 4.60e-04 | 322.21 ms | 52.4% bf16 MFU | 1624862 tok/s step 6762/19560 | loss 3.472256 (-0.57z)| norm 0.3053 (+1.28z)| lr 4.60e-04 | 321.94 ms | 52.4% bf16 MFU | 1625045 tok/s step 6763/19560 | loss 3.492811 (-0.11z)| norm 0.2915 (+0.64z)| lr 4.60e-04 | 322.70 ms | 52.3% bf16 MFU | 1625027 tok/s step 6764/19560 | loss 3.489198 (-0.18z)| norm 0.2473 (-1.61z)| lr 4.60e-04 | 322.80 ms | 52.3% bf16 MFU | 1624984 tok/s step 6765/19560 | loss 3.488575 (-0.19z)| norm 0.2915 (+0.65z)| lr 4.60e-04 | 322.56 ms | 52.3% bf16 MFU | 1625003 tok/s step 6766/19560 | loss 3.483749 (-0.30z)| norm 0.2791 (+0.03z)| lr 4.59e-04 | 322.47 ms | 52.3% bf16 MFU | 1625046 tok/s step 6767/19560 | loss 3.479025 (-0.40z)| norm 0.2790 (+0.02z)| lr 4.59e-04 | 322.68 ms | 52.3% bf16 MFU | 1625034 tok/s step 6768/19560 | loss 3.520070 (+0.53z)| norm 0.2857 (+0.37z)| lr 4.59e-04 | 322.35 ms | 52.4% bf16 MFU | 1625105 tok/s step 6769/19560 | loss 3.489075 (-0.17z)| norm 0.2710 (-0.40z)| lr 4.59e-04 | 323.09 ms | 52.2% bf16 MFU | 1624985 tok/s step 6770/19560 | loss 3.509100 (+0.28z)| norm 0.2990 (+1.05z)| lr 4.59e-04 | 323.11 ms | 52.2% bf16 MFU | 1624868 tok/s step 6771/19560 | loss 3.524869 (+0.63z)| norm 0.2648 (-0.74z)| lr 4.59e-04 | 322.25 ms | 52.4% bf16 MFU | 1624972 tok/s step 6772/19560 | loss 3.543674 (+1.09z)| norm 0.2505 (-1.47z)| lr 4.59e-04 | 322.76 ms | 52.3% bf16 MFU | 1624944 
tok/s step 6773/19560 | loss 3.467055 (-0.68z)| norm 0.2602 (-0.96z)| lr 4.59e-04 | 322.66 ms | 52.3% bf16 MFU | 1624941 tok/s step 6774/19560 | loss 3.438929 (-1.31z)| norm 0.2504 (-1.44z)| lr 4.59e-04 | 322.90 ms | 52.3% bf16 MFU | 1624879 tok/s step 6775/19560 | loss 3.520355 (+0.56z)| norm 0.2606 (-0.90z)| lr 4.59e-04 | 322.97 ms | 52.3% bf16 MFU | 1624802 tok/s step 6776/19560 | loss 3.443587 (-1.18z)| norm 0.2571 (-1.07z)| lr 4.59e-04 | 322.55 ms | 52.3% bf16 MFU | 1624833 tok/s step 6777/19560 | loss 3.503270 (+0.19z)| norm 0.2666 (-0.58z)| lr 4.59e-04 | 322.95 ms | 52.3% bf16 MFU | 1624763 tok/s step 6778/19560 | loss 3.456707 (-0.88z)| norm 0.2644 (-0.68z)| lr 4.59e-04 | 322.53 ms | 52.3% bf16 MFU | 1624803 tok/s step 6779/19560 | loss 3.468144 (-0.62z)| norm 0.2759 (-0.09z)| lr 4.59e-04 | 322.69 ms | 52.3% bf16 MFU | 1624800 tok/s step 6780/19560 | loss 3.469519 (-0.58z)| norm 0.2867 (+0.46z)| lr 4.59e-04 | 322.74 ms | 52.3% bf16 MFU | 1624784 tok/s step 6781/19560 | loss 3.520396 (+0.60z)| norm 0.2814 (+0.19z)| lr 4.59e-04 | 322.23 ms | 52.4% bf16 MFU | 1624899 tok/s step 6782/19560 | loss 3.447641 (-1.09z)| norm 0.3429 (+3.19z)| lr 4.59e-04 | 322.23 ms | 52.4% bf16 MFU | 1625007 tok/s step 6783/19560 | loss 3.450187 (-1.02z)| norm 0.3069 (+1.39z)| lr 4.59e-04 | 322.39 ms | 52.3% bf16 MFU | 1625069 tok/s step 6784/19560 | loss 3.500263 (+0.14z)| norm 0.2838 (+0.24z)| lr 4.59e-04 | 323.18 ms | 52.2% bf16 MFU | 1624930 tok/s step 6785/19560 | loss 3.472871 (-0.49z)| norm 0.2844 (+0.28z)| lr 4.59e-04 | 322.60 ms | 52.3% bf16 MFU | 1624944 tok/s step 6786/19560 | loss 3.517496 (+0.53z)| norm 0.2688 (-0.49z)| lr 4.59e-04 | 322.46 ms | 52.3% bf16 MFU | 1624992 tok/s step 6787/19560 | loss 3.488091 (-0.16z)| norm 0.2726 (-0.32z)| lr 4.59e-04 | 322.43 ms | 52.3% bf16 MFU | 1625046 tok/s step 6788/19560 | loss 3.466285 (-0.70z)| norm 0.2685 (-0.52z)| lr 4.59e-04 | 322.65 ms | 52.3% bf16 MFU | 1625042 tok/s step 6789/19560 | loss 3.534447 (+0.92z)| norm 0.2635 
(-0.76z)| lr 4.59e-04 | 322.65 ms | 52.3% bf16 MFU | 1625036 tok/s step 6790/19560 | loss 3.501965 (+0.16z)| norm 0.2615 (-0.85z)| lr 4.58e-04 | 322.78 ms | 52.3% bf16 MFU | 1624998 tok/s step 6791/19560 | loss 3.500459 (+0.12z)| norm 0.2871 (+0.43z)| lr 4.58e-04 | 322.65 ms | 52.3% bf16 MFU | 1624996 tok/s step 6792/19560 | loss 3.530068 (+0.87z)| norm 0.2712 (-0.36z)| lr 4.58e-04 | 322.38 ms | 52.4% bf16 MFU | 1625061 tok/s step 6793/19560 | loss 3.435005 (-1.48z)| norm 0.2577 (-1.03z)| lr 4.58e-04 | 322.43 ms | 52.3% bf16 MFU | 1625110 tok/s step 6794/19560 | loss 3.545527 (+1.23z)| norm 0.2905 (+0.61z)| lr 4.58e-04 | 322.54 ms | 52.3% bf16 MFU | 1625129 tok/s step 6795/19560 | loss 3.438847 (-1.36z)| norm 0.2928 (+0.73z)| lr 4.58e-04 | 322.72 ms | 52.3% bf16 MFU | 1625103 tok/s step 6796/19560 | loss 3.539912 (+1.08z)| norm 0.2736 (-0.24z)| lr 4.58e-04 | 322.75 ms | 52.3% bf16 MFU | 1625070 tok/s step 6797/19560 | loss 3.531914 (+0.88z)| norm 0.2917 (+0.66z)| lr 4.58e-04 | 322.35 ms | 52.4% bf16 MFU | 1625140 tok/s step 6798/19560 | loss 3.426182 (-1.64z)| norm 0.2724 (-0.31z)| lr 4.58e-04 | 322.64 ms | 52.3% bf16 MFU | 1625133 tok/s step 6799/19560 | loss 3.459944 (-0.82z)| norm 0.2712 (-0.37z)| lr 4.58e-04 | 322.08 ms | 52.4% bf16 MFU | 1625267 tok/s step 6800/19560 | loss 3.519532 (+0.60z)| norm 0.2651 (-0.66z)| lr 4.58e-04 | 322.41 ms | 52.3% bf16 MFU | 1625310 tok/s step 6801/19560 | loss 3.472391 (-0.52z)| norm 0.2704 (-0.39z)| lr 4.58e-04 | 322.74 ms | 52.3% bf16 MFU | 1625270 tok/s step 6802/19560 | loss 3.441894 (-1.23z)| norm 0.2583 (-1.00z)| lr 4.58e-04 | 322.31 ms | 52.4% bf16 MFU | 1625339 tok/s step 6803/19560 | loss 3.491915 (-0.03z)| norm 0.2744 (-0.19z)| lr 4.58e-04 | 322.58 ms | 52.3% bf16 MFU | 1625336 tok/s step 6804/19560 | loss 3.484266 (-0.22z)| norm 0.2685 (-0.49z)| lr 4.58e-04 | 322.56 ms | 52.3% bf16 MFU | 1625340 tok/s step 6805/19560 | loss 3.496660 (+0.08z)| norm 0.2800 (+0.10z)| lr 4.58e-04 | 322.86 ms | 52.3% bf16 MFU | 1625268 
tok/s step 6806/19560 | loss 3.511967 (+0.44z)| norm 0.2945 (+0.82z)| lr 4.58e-04 | 322.50 ms | 52.3% bf16 MFU | 1625290 tok/s step 6807/19560 | loss 3.511570 (+0.45z)| norm 0.2899 (+0.58z)| lr 4.58e-04 | 322.45 ms | 52.3% bf16 MFU | 1625324 tok/s step 6808/19560 | loss 3.476666 (-0.40z)| norm 0.2715 (-0.34z)| lr 4.58e-04 | 322.12 ms | 52.4% bf16 MFU | 1625438 tok/s step 6809/19560 | loss 3.461906 (-0.75z)| norm 0.2834 (+0.28z)| lr 4.58e-04 | 322.66 ms | 52.3% bf16 MFU | 1625410 tok/s step 6810/19560 | loss 3.545987 (+1.27z)| norm 0.2491 (-1.46z)| lr 4.58e-04 | 322.46 ms | 52.3% bf16 MFU | 1625434 tok/s step 6811/19560 | loss 3.551167 (+1.38z)| norm 0.3116 (+1.70z)| lr 4.58e-04 | 322.67 ms | 52.3% bf16 MFU | 1625405 tok/s step 6812/19560 | loss 3.525974 (+0.77z)| norm 0.3098 (+1.62z)| lr 4.58e-04 | 322.62 ms | 52.3% bf16 MFU | 1625390 tok/s step 6813/19560 | loss 3.477684 (-0.38z)| norm 0.2746 (-0.16z)| lr 4.57e-04 | 322.44 ms | 52.3% bf16 MFU | 1625419 tok/s step 6814/19560 | loss 3.488848 (-0.11z)| norm 0.3065 (+1.44z)| lr 4.57e-04 | 322.57 ms | 52.3% bf16 MFU | 1625415 tok/s step 6815/19560 | loss 3.491323 (-0.07z)| norm 0.2798 (+0.09z)| lr 4.57e-04 | 322.36 ms | 52.4% bf16 MFU | 1625465 tok/s step 6816/19560 | loss 3.501237 (+0.16z)| norm 0.3180 (+2.02z)| lr 4.57e-04 | 322.88 ms | 52.3% bf16 MFU | 1625382 tok/s step 6817/19560 | loss 3.526509 (+0.77z)| norm 0.2848 (+0.34z)| lr 4.57e-04 | 322.88 ms | 52.3% bf16 MFU | 1625301 tok/s step 6818/19560 | loss 3.484483 (-0.26z)| norm 0.2831 (+0.25z)| lr 4.57e-04 | 322.24 ms | 52.4% bf16 MFU | 1625388 tok/s step 6819/19560 | loss 3.510565 (+0.37z)| norm 0.2878 (+0.48z)| lr 4.57e-04 | 322.73 ms | 52.3% bf16 MFU | 1625346 tok/s step 6820/19560 | loss 3.478621 (-0.41z)| norm 0.2782 (+0.00z)| lr 4.57e-04 | 322.75 ms | 52.3% bf16 MFU | 1625301 tok/s step 6821/19560 | loss 3.531646 (+0.88z)| norm 0.2803 (+0.11z)| lr 4.57e-04 | 322.20 ms | 52.4% bf16 MFU | 1625397 tok/s step 6822/19560 | loss 3.458330 (-0.91z)| norm 0.2810 
(+0.15z)| lr 4.57e-04 | 322.85 ms | 52.3% bf16 MFU | 1625324 tok/s step 6823/19560 | loss 3.518617 (+0.58z)| norm 0.3014 (+1.17z)| lr 4.57e-04 | 322.69 ms | 52.3% bf16 MFU | 1625294 tok/s step 6824/19560 | loss 3.501267 (+0.14z)| norm 0.2763 (-0.08z)| lr 4.57e-04 | 322.33 ms | 52.4% bf16 MFU | 1625357 tok/s step 6825/19560 | loss 3.471463 (-0.60z)| norm 0.2744 (-0.17z)| lr 4.57e-04 | 322.78 ms | 52.3% bf16 MFU | 1625304 tok/s step 6826/19560 | loss 3.556985 (+1.49z)| norm 0.2832 (+0.31z)| lr 4.57e-04 | 322.74 ms | 52.3% bf16 MFU | 1625263 tok/s step 6827/19560 | loss 3.498891 (+0.07z)| norm 0.2810 (+0.19z)| lr 4.57e-04 | 322.60 ms | 52.3% bf16 MFU | 1625261 tok/s step 6828/19560 | loss 3.452570 (-1.07z)| norm 0.2531 (-1.26z)| lr 4.57e-04 | 323.18 ms | 52.2% bf16 MFU | 1625112 tok/s step 6829/19560 | loss 3.441584 (-1.31z)| norm 0.2763 (-0.04z)| lr 4.57e-04 | 322.45 ms | 52.3% bf16 MFU | 1625154 tok/s step 6830/19560 | loss 3.498032 (+0.07z)| norm 0.2433 (-1.73z)| lr 4.57e-04 | 322.33 ms | 52.4% bf16 MFU | 1625224 tok/s step 6831/19560 | loss 3.449819 (-1.10z)| norm 0.2680 (-0.46z)| lr 4.57e-04 | 322.17 ms | 52.4% bf16 MFU | 1625332 tok/s step 6832/19560 | loss 3.505334 (+0.25z)| norm 0.2896 (+0.65z)| lr 4.57e-04 | 322.97 ms | 52.3% bf16 MFU | 1625232 tok/s step 6833/19560 | loss 3.468842 (-0.63z)| norm 0.2971 (+1.03z)| lr 4.57e-04 | 322.86 ms | 52.3% bf16 MFU | 1625165 tok/s step 6834/19560 | loss 3.518825 (+0.65z)| norm 0.2719 (-0.27z)| lr 4.57e-04 | 322.41 ms | 52.3% bf16 MFU | 1625214 tok/s step 6835/19560 | loss 3.476518 (-0.43z)| norm 0.2718 (-0.27z)| lr 4.57e-04 | 322.54 ms | 52.3% bf16 MFU | 1625230 tok/s step 6836/19560 | loss 3.405436 (-2.21z)| norm 0.2647 (-0.65z)| lr 4.57e-04 | 322.69 ms | 52.3% bf16 MFU | 1625206 tok/s step 6837/19560 | loss 3.657363 (+3.95z)| norm 0.2932 (+0.85z)| lr 4.56e-04 | 322.89 ms | 52.3% bf16 MFU | 1625132 tok/s step 6838/19560 | loss 3.541077 (+1.12z)| norm 0.2924 (+0.81z)| lr 4.56e-04 | 322.51 ms | 52.3% bf16 MFU | 1625158 
tok/s step 6839/19560 | loss 3.565532 (+1.68z)| norm 0.2601 (-0.91z)| lr 4.56e-04 | 322.89 ms | 52.3% bf16 MFU | 1625086 tok/s step 6840/19560 | loss 3.506731 (+0.27z)| norm 0.2720 (-0.28z)| lr 4.56e-04 | 323.00 ms | 52.3% bf16 MFU | 1624991 tok/s step 6841/19560 | loss 3.481271 (-0.34z)| norm 0.2755 (-0.09z)| lr 4.56e-04 | 322.58 ms | 52.3% bf16 MFU | 1625007 tok/s step 6842/19560 | loss 3.426930 (-1.60z)| norm 0.2727 (-0.25z)| lr 4.56e-04 | 322.46 ms | 52.3% bf16 MFU | 1625051 tok/s step 6843/19560 | loss 3.474356 (-0.48z)| norm 0.2736 (-0.21z)| lr 4.56e-04 | 322.48 ms | 52.3% bf16 MFU | 1625090 tok/s step 6844/19560 | loss 3.439246 (-1.29z)| norm 0.2747 (-0.16z)| lr 4.56e-04 | 322.43 ms | 52.3% bf16 MFU | 1625137 tok/s step 6845/19560 | loss 3.456505 (-0.87z)| norm 0.2494 (-1.53z)| lr 4.56e-04 | 322.75 ms | 52.3% bf16 MFU | 1625103 tok/s step 6846/19560 | loss 3.502976 (+0.22z)| norm 0.2623 (-0.83z)| lr 4.56e-04 | 323.54 ms | 52.2% bf16 MFU | 1624871 tok/s step 6847/19560 | loss 3.481261 (-0.29z)| norm 0.2849 (+0.39z)| lr 4.56e-04 | 322.53 ms | 52.3% bf16 MFU | 1624904 tok/s step 6848/19560 | loss 3.499225 (+0.13z)| norm 0.2468 (-1.65z)| lr 4.56e-04 | 322.47 ms | 52.3% bf16 MFU | 1624952 tok/s step 6849/19560 | loss 3.430677 (-1.52z)| norm 0.2719 (-0.30z)| lr 4.56e-04 | 322.29 ms | 52.4% bf16 MFU | 1625041 tok/s step 6850/19560 | loss 3.531845 (+0.96z)| norm 0.2619 (-0.83z)| lr 4.56e-04 | 322.81 ms | 52.3% bf16 MFU | 1624996 tok/s step 6851/19560 | loss 3.535801 (+1.04z)| norm 0.2613 (-0.87z)| lr 4.56e-04 | 322.36 ms | 52.4% bf16 MFU | 1625066 tok/s step 6852/19560 | loss 3.459198 (-0.87z)| norm 0.2504 (-1.43z)| lr 4.56e-04 | 322.66 ms | 52.3% bf16 MFU | 1625056 tok/s step 6853/19560 | loss 3.545730 (+1.27z)| norm 0.2763 (-0.03z)| lr 4.56e-04 | 322.57 ms | 52.3% bf16 MFU | 1625072 tok/s step 6854/19560 | loss 3.463933 (-0.75z)| norm 0.2627 (-0.76z)| lr 4.56e-04 | 322.85 ms | 52.3% bf16 MFU | 1625014 tok/s step 6855/19560 | loss 3.491165 (-0.06z)| norm 0.2966 
(+1.07z)| lr 4.56e-04 | 322.56 ms | 52.3% bf16 MFU | 1625032 tok/s step 6856/19560 | loss 3.503642 (+0.25z)| norm 0.2563 (-1.09z)| lr 4.56e-04 | 322.58 ms | 52.3% bf16 MFU | 1625045 tok/s step 6857/19560 | loss 3.440865 (-1.32z)| norm 0.2833 (+0.43z)| lr 4.56e-04 | 322.82 ms | 52.3% bf16 MFU | 1624998 tok/s step 6858/19560 | loss 3.513021 (+0.47z)| norm 0.2736 (-0.12z)| lr 4.56e-04 | 322.35 ms | 52.4% bf16 MFU | 1625071 tok/s step 6859/19560 | loss 3.493312 (-0.02z)| norm 0.2809 (+0.31z)| lr 4.56e-04 | 322.29 ms | 52.4% bf16 MFU | 1625156 tok/s step 6860/19560 | loss 3.532228 (+0.94z)| norm 0.2706 (-0.29z)| lr 4.55e-04 | 322.73 ms | 52.3% bf16 MFU | 1625124 tok/s step 6861/19560 | loss 3.558112 (+1.56z)| norm 0.2693 (-0.36z)| lr 4.55e-04 | 322.46 ms | 52.3% bf16 MFU | 1625164 tok/s step 6862/19560 | loss 3.495742 (-0.00z)| norm 0.2656 (-0.58z)| lr 4.55e-04 | 322.52 ms | 52.3% bf16 MFU | 1625184 tok/s step 6863/19560 | loss 3.442302 (-1.32z)| norm 0.2507 (-1.44z)| lr 4.55e-04 | 322.72 ms | 52.3% bf16 MFU | 1625154 tok/s step 6864/19560 | loss 3.453007 (-1.05z)| norm 0.2644 (-0.63z)| lr 4.55e-04 | 322.46 ms | 52.3% bf16 MFU | 1625191 tok/s step 6865/19560 | loss 3.438065 (-1.42z)| norm 0.2799 (+0.27z)| lr 4.55e-04 | 322.52 ms | 52.3% bf16 MFU | 1625211 tok/s step 6866/19560 | loss 3.465520 (-0.72z)| norm 0.2590 (-0.97z)| lr 4.55e-04 | 322.53 ms | 52.3% bf16 MFU | 1625228 tok/s step 6867/19560 | loss 3.552359 (+1.45z)| norm 0.2760 (+0.06z)| lr 4.55e-04 | 322.57 ms | 52.3% bf16 MFU | 1625233 tok/s step 6868/19560 | loss 3.550191 (+1.38z)| norm 0.2949 (+1.17z)| lr 4.55e-04 | 322.43 ms | 52.3% bf16 MFU | 1625273 tok/s step 6869/19560 | loss 3.536535 (+1.03z)| norm 0.2925 (+1.04z)| lr 4.55e-04 | 322.32 ms | 52.4% bf16 MFU | 1625339 tok/s step 6870/19560 | loss 3.463897 (-0.76z)| norm 0.2667 (-0.50z)| lr 4.55e-04 | 322.50 ms | 52.3% bf16 MFU | 1625358 tok/s step 6871/19560 | loss 3.511284 (+0.43z)| norm 0.2778 (+0.15z)| lr 4.55e-04 | 322.30 ms | 52.4% bf16 MFU | 1625425 
tok/s step 6872/19560 | loss 3.411741 (-2.01z)| norm 0.2861 (+0.65z)| lr 4.55e-04 | 322.72 ms | 52.3% bf16 MFU | 1625383 tok/s step 6873/19560 | loss 3.530428 (+0.90z)| norm 0.2745 (-0.05z)| lr 4.55e-04 | 322.54 ms | 52.3% bf16 MFU | 1625388 tok/s step 6874/19560 | loss 3.451283 (-1.04z)| norm 0.2631 (-0.75z)| lr 4.55e-04 | 322.89 ms | 52.3% bf16 MFU | 1625306 tok/s step 6875/19560 | loss 3.456066 (-0.91z)| norm 0.2641 (-0.71z)| lr 4.55e-04 | 322.81 ms | 52.3% bf16 MFU | 1625248 tok/s step 6876/19560 | loss 3.474778 (-0.47z)| norm 0.2834 (+0.47z)| lr 4.55e-04 | 322.42 ms | 52.3% bf16 MFU | 1625292 tok/s step 6877/19560 | loss 3.484997 (-0.22z)| norm 0.2808 (+0.31z)| lr 4.55e-04 | 322.45 ms | 52.3% bf16 MFU | 1625325 tok/s step 6878/19560 | loss 3.480564 (-0.33z)| norm 0.2749 (-0.06z)| lr 4.55e-04 | 322.53 ms | 52.3% bf16 MFU | 1625337 tok/s step 6879/19560 | loss 3.459260 (-0.85z)| norm 0.2942 (+1.14z)| lr 4.55e-04 | 322.25 ms | 52.4% bf16 MFU | 1625418 tok/s step 6880/19560 | loss 3.546427 (+1.29z)| norm 0.2612 (-0.93z)| lr 4.55e-04 | 322.23 ms | 52.4% bf16 MFU | 1625501 tok/s step 6881/19560 | loss 3.477955 (-0.38z)| norm 0.2908 (+0.93z)| lr 4.55e-04 | 322.55 ms | 52.3% bf16 MFU | 1625498 tok/s step 6882/19560 | loss 3.534217 (+1.03z)| norm 0.3027 (+1.64z)| lr 4.55e-04 | 322.98 ms | 52.3% bf16 MFU | 1625388 tok/s step 6883/19560 | loss 3.475558 (-0.44z)| norm 0.2627 (-0.84z)| lr 4.55e-04 | 322.41 ms | 52.3% bf16 MFU | 1625427 tok/s step 6884/19560 | loss 3.485410 (-0.19z)| norm 0.2826 (+0.39z)| lr 4.54e-04 | 322.77 ms | 52.3% bf16 MFU | 1625374 tok/s step 6885/19560 | loss 3.507263 (+0.38z)| norm 0.2718 (-0.28z)| lr 4.54e-04 | 322.82 ms | 52.3% bf16 MFU | 1625309 tok/s step 6886/19560 | loss 3.554814 (+1.59z)| norm 0.2682 (-0.52z)| lr 4.54e-04 | 322.61 ms | 52.3% bf16 MFU | 1625301 tok/s step 6887/19560 | loss 3.521684 (+0.76z)| norm 0.2848 (+0.52z)| lr 4.54e-04 | 322.75 ms | 52.3% bf16 MFU | 1625258 tok/s step 6888/19560 | loss 3.480327 (-0.32z)| norm 0.2899 
(+0.83z)| lr 4.54e-04 | 322.93 ms | 52.3% bf16 MFU | 1625172 tok/s step 6889/19560 | loss 3.435595 (-1.46z)| norm 0.2927 (+1.00z)| lr 4.54e-04 | 323.11 ms | 52.2% bf16 MFU | 1625043 tok/s step 6890/19560 | loss 3.498550 (+0.16z)| norm 0.2991 (+1.41z)| lr 4.54e-04 | 322.86 ms | 52.3% bf16 MFU | 1624984 tok/s step 6891/19560 | loss 3.580312 (+2.22z)| norm 0.2753 (-0.07z)| lr 4.54e-04 | 322.31 ms | 52.4% bf16 MFU | 1625068 tok/s step 6892/19560 | loss 3.453740 (-0.99z)| norm 0.2696 (-0.45z)| lr 4.54e-04 | 322.64 ms | 52.3% bf16 MFU | 1625064 tok/s step 6893/19560 | loss 3.521354 (+0.72z)| norm 0.2731 (-0.22z)| lr 4.54e-04 | 322.75 ms | 52.3% bf16 MFU | 1625033 tok/s step 6894/19560 | loss 3.488527 (-0.11z)| norm 0.2986 (+1.39z)| lr 4.54e-04 | 322.54 ms | 52.3% bf16 MFU | 1625056 tok/s step 6895/19560 | loss 3.497050 (+0.10z)| norm 0.3205 (+2.69z)| lr 4.54e-04 | 322.64 ms | 52.3% bf16 MFU | 1625053 tok/s step 6896/19560 | loss 3.510036 (+0.43z)| norm 0.3190 (+2.51z)| lr 4.54e-04 | 323.20 ms | 52.2% bf16 MFU | 1624909 tok/s step 6897/19560 | loss 3.487315 (-0.14z)| norm 0.3420 (+3.66z)| lr 4.54e-04 | 323.35 ms | 52.2% bf16 MFU | 1624734 tok/s step 6898/19560 | loss 3.473776 (-0.48z)| norm 0.2748 (-0.16z)| lr 4.54e-04 | 322.45 ms | 52.3% bf16 MFU | 1624794 tok/s step 6899/19560 | loss 3.522583 (+0.76z)| norm 0.3004 (+1.28z)| lr 4.54e-04 | 322.74 ms | 52.3% bf16 MFU | 1624779 tok/s step 6900/19560 | loss 3.511836 (+0.49z)| norm 0.2700 (-0.46z)| lr 4.54e-04 | 322.99 ms | 52.3% bf16 MFU | 1624703 tok/s step 6901/19560 | loss 3.479114 (-0.34z)| norm 0.2737 (-0.26z)| lr 4.54e-04 | 322.98 ms | 52.3% bf16 MFU | 1624632 tok/s step 6902/19560 | loss 3.532627 (+1.01z)| norm 0.2862 (+0.45z)| lr 4.54e-04 | 322.45 ms | 52.3% bf16 MFU | 1624697 tok/s step 6903/19560 | loss 3.552431 (+1.50z)| norm 0.3007 (+1.28z)| lr 4.54e-04 | 322.74 ms | 52.3% bf16 MFU | 1624688 tok/s step 6904/19560 | loss 3.612228 (+2.90z)| norm 0.3151 (+2.07z)| lr 4.54e-04 | 322.60 ms | 52.3% bf16 MFU | 1624713 
tok/s step 6905/19560 | loss 3.474136 (-0.51z)| norm 0.3198 (+2.27z)| lr 4.54e-04 | 322.38 ms | 52.4% bf16 MFU | 1624791 tok/s step 6906/19560 | loss 3.475890 (-0.47z)| norm 0.2936 (+0.78z)| lr 4.54e-04 | 322.98 ms | 52.3% bf16 MFU | 1624715 tok/s step 6907/19560 | loss 3.508359 (+0.33z)| norm 0.2815 (+0.09z)| lr 4.53e-04 | 322.86 ms | 52.3% bf16 MFU | 1624674 tok/s step 6908/19560 | loss 3.471355 (-0.59z)| norm 0.2523 (-1.53z)| lr 4.53e-04 | 323.11 ms | 52.2% bf16 MFU | 1624572 tok/s step 6909/19560 | loss 3.442322 (-1.29z)| norm 0.2799 (+0.02z)| lr 4.53e-04 | 322.89 ms | 52.3% bf16 MFU | 1624530 tok/s step 6910/19560 | loss 3.531732 (+0.90z)| norm 0.2620 (-1.00z)| lr 4.53e-04 | 322.36 ms | 52.4% bf16 MFU | 1624624 tok/s step 6911/19560 | loss 3.442884 (-1.29z)| norm 0.2997 (+1.23z)| lr 4.53e-04 | 323.63 ms | 52.1% bf16 MFU | 1624393 tok/s step 6912/19560 | loss 3.477998 (-0.42z)| norm 0.2704 (-0.49z)| lr 4.53e-04 | 322.57 ms | 52.3% bf16 MFU | 1624442 tok/s step 6913/19560 | loss 3.527159 (+0.78z)| norm 0.2938 (+0.88z)| lr 4.53e-04 | 322.82 ms | 52.3% bf16 MFU | 1624426 tok/s step 6914/19560 | loss 3.487436 (-0.19z)| norm 0.2739 (-0.30z)| lr 4.53e-04 | 322.60 ms | 52.3% bf16 MFU | 1624465 tok/s step 6915/19560 | loss 3.534223 (+0.95z)| norm 0.2702 (-0.51z)| lr 4.53e-04 | 323.22 ms | 52.2% bf16 MFU | 1624345 tok/s step 6916/19560 | loss 3.516490 (+0.51z)| norm 0.2932 (+0.83z)| lr 4.53e-04 | 322.74 ms | 52.3% bf16 MFU | 1624352 tok/s step 6917/19560 | loss 3.470546 (-0.61z)| norm 0.2556 (-1.37z)| lr 4.53e-04 | 322.77 ms | 52.3% bf16 MFU | 1624352 tok/s step 6918/19560 | loss 3.442747 (-1.28z)| norm 0.2643 (-0.86z)| lr 4.53e-04 | 323.12 ms | 52.2% bf16 MFU | 1624263 tok/s step 6919/19560 | loss 3.612068 (+2.76z)| norm 0.2623 (-0.97z)| lr 4.53e-04 | 322.16 ms | 52.4% bf16 MFU | 1624421 tok/s step 6920/19560 | loss 3.510791 (+0.36z)| norm 0.2737 (-0.31z)| lr 4.53e-04 | 322.77 ms | 52.3% bf16 MFU | 1624418 tok/s step 6921/19560 | loss 3.520546 (+0.58z)| norm 0.2516 
(-1.59z)| lr 4.53e-04 | 323.30 ms | 52.2% bf16 MFU | 1624282 tok/s step 6922/19560 | loss 3.475132 (-0.50z)| norm 0.2618 (-0.98z)| lr 4.53e-04 | 322.57 ms | 52.3% bf16 MFU | 1624335 tok/s step 6923/19560 | loss 3.498110 (+0.05z)| norm 0.2870 (+0.49z)| lr 4.53e-04 | 322.55 ms | 52.3% bf16 MFU | 1624390 tok/s step 6924/19560 | loss 3.447552 (-1.16z)| norm 0.2858 (+0.41z)| lr 4.53e-04 | 323.37 ms | 52.2% bf16 MFU | 1624238 tok/s step 6925/19560 | loss 3.445638 (-1.19z)| norm 0.2647 (-0.80z)| lr 4.53e-04 | 322.82 ms | 52.3% bf16 MFU | 1624230 tok/s step 6926/19560 | loss 3.529849 (+0.83z)| norm 0.2621 (-0.94z)| lr 4.53e-04 | 322.61 ms | 52.3% bf16 MFU | 1624276 tok/s step 6927/19560 | loss 3.452056 (-1.06z)| norm 0.2548 (-1.35z)| lr 4.53e-04 | 323.25 ms | 52.2% bf16 MFU | 1624159 tok/s step 6928/19560 | loss 3.442902 (-1.26z)| norm 0.2597 (-1.06z)| lr 4.53e-04 | 322.91 ms | 52.3% bf16 MFU | 1624132 tok/s step 6929/19560 | loss 3.524874 (+0.71z)| norm 0.2860 (+0.44z)| lr 4.53e-04 | 323.71 ms | 52.1% bf16 MFU | 1623907 tok/s step 6930/19560 | loss 3.507968 (+0.29z)| norm 0.2641 (-0.82z)| lr 4.52e-04 | 322.50 ms | 52.3% bf16 MFU | 1623996 tok/s step 6931/19560 | loss 3.474658 (-0.51z)| norm 0.2701 (-0.47z)| lr 4.52e-04 | 322.24 ms | 52.4% bf16 MFU | 1624148 tok/s step 6932/19560 | loss 3.473806 (-0.53z)| norm 0.2525 (-1.47z)| lr 4.52e-04 | 322.82 ms | 52.3% bf16 MFU | 1624144 tok/s step 6933/19560 | loss 3.519510 (+0.57z)| norm 0.2745 (-0.21z)| lr 4.52e-04 | 323.01 ms | 52.2% bf16 MFU | 1624093 tok/s step 6934/19560 | loss 3.532444 (+0.88z)| norm 0.2829 (+0.28z)| lr 4.52e-04 | 322.45 ms | 52.3% bf16 MFU | 1624186 tok/s step 6935/19560 | loss 3.480659 (-0.37z)| norm 0.2804 (+0.14z)| lr 4.52e-04 | 323.16 ms | 52.2% bf16 MFU | 1624095 tok/s step 6936/19560 | loss 3.503911 (+0.19z)| norm 0.2725 (-0.32z)| lr 4.52e-04 | 322.68 ms | 52.3% bf16 MFU | 1624130 tok/s step 6937/19560 | loss 3.504928 (+0.21z)| norm 0.2792 (+0.07z)| lr 4.52e-04 | 322.54 ms | 52.3% bf16 MFU | 1624198 
tok/s step 6938/19560 | loss 3.508514 (+0.31z)| norm 0.2753 (-0.17z)| lr 4.52e-04 | 323.63 ms | 52.1% bf16 MFU | 1623989 tok/s step 6939/19560 | loss 3.466447 (-0.71z)| norm 0.2908 (+0.75z)| lr 4.52e-04 | 322.59 ms | 52.3% bf16 MFU | 1624052 tok/s step 6940/19560 | loss 3.576307 (+1.95z)| norm 0.2824 (+0.27z)| lr 4.52e-04 | 322.40 ms | 52.3% bf16 MFU | 1624160 tok/s step 6941/19560 | loss 3.509759 (+0.33z)| norm 0.2979 (+1.18z)| lr 4.52e-04 | 322.79 ms | 52.3% bf16 MFU | 1624163 tok/s step 6942/19560 | loss 3.409518 (-2.05z)| norm 0.2572 (-1.22z)| lr 4.52e-04 | 322.91 ms | 52.3% bf16 MFU | 1624137 tok/s step 6943/19560 | loss 3.521243 (+0.61z)| norm 0.2974 (+1.17z)| lr 4.52e-04 | 322.52 ms | 52.3% bf16 MFU | 1624212 tok/s step 6944/19560 | loss 3.462069 (-0.79z)| norm 0.2974 (+1.20z)| lr 4.52e-04 | 322.74 ms | 52.3% bf16 MFU | 1624225 tok/s step 6945/19560 | loss 3.460054 (-0.82z)| norm 0.3064 (+1.71z)| lr 4.52e-04 | 323.14 ms | 52.2% bf16 MFU | 1624138 tok/s step 6946/19560 | loss 3.544975 (+1.17z)| norm 0.2658 (-0.70z)| lr 4.52e-04 | 322.78 ms | 52.3% bf16 MFU | 1624144 tok/s step 6947/19560 | loss 3.484695 (-0.24z)| norm 0.3073 (+1.75z)| lr 4.52e-04 | 322.65 ms | 52.3% bf16 MFU | 1624183 tok/s step 6948/19560 | loss 3.549983 (+1.28z)| norm 0.2587 (-1.11z)| lr 4.52e-04 | 323.19 ms | 52.2% bf16 MFU | 1624085 tok/s step 6949/19560 | loss 3.498419 (+0.07z)| norm 0.3099 (+1.86z)| lr 4.52e-04 | 322.95 ms | 52.3% bf16 MFU | 1624052 tok/s step 6950/19560 | loss 3.457528 (-0.89z)| norm 0.2571 (-1.19z)| lr 4.52e-04 | 322.54 ms | 52.3% bf16 MFU | 1624124 tok/s step 6951/19560 | loss 3.470725 (-0.57z)| norm 0.2518 (-1.47z)| lr 4.52e-04 | 322.38 ms | 52.4% bf16 MFU | 1624232 tok/s step 6952/19560 | loss 3.421834 (-1.69z)| norm 0.2496 (-1.57z)| lr 4.52e-04 | 322.50 ms | 52.3% bf16 MFU | 1624306 tok/s step 6953/19560 | loss 3.495889 (+0.03z)| norm 0.2733 (-0.21z)| lr 4.51e-04 | 322.94 ms | 52.3% bf16 MFU | 1624266 tok/s step 6954/19560 | loss 3.526709 (+0.76z)| norm 0.2452 
(-1.78z)| lr 4.51e-04 | 322.37 ms | 52.4% bf16 MFU | 1624370 tok/s step 6955/19560 | loss 3.458878 (-0.82z)| norm 0.2538 (-1.27z)| lr 4.51e-04 | 322.46 ms | 52.3% bf16 MFU | 1624445 tok/s step 6956/19560 | loss 3.442696 (-1.20z)| norm 0.2664 (-0.58z)| lr 4.51e-04 | 322.46 ms | 52.3% bf16 MFU | 1624519 tok/s step 6957/19560 | loss 3.472251 (-0.51z)| norm 0.2575 (-1.07z)| lr 4.51e-04 | 322.17 ms | 52.4% bf16 MFU | 1624660 tok/s step 6958/19560 | loss 3.474192 (-0.46z)| norm 0.2587 (-1.02z)| lr 4.51e-04 | 323.03 ms | 52.2% bf16 MFU | 1624579 tok/s step 6959/19560 | loss 3.487190 (-0.17z)| norm 0.2643 (-0.69z)| lr 4.51e-04 | 323.20 ms | 52.2% bf16 MFU | 1624460 tok/s step 6960/19560 | loss 3.474201 (-0.47z)| norm 0.2515 (-1.39z)| lr 4.51e-04 | 323.03 ms | 52.2% bf16 MFU | 1624389 tok/s step 6961/19560 | loss 3.486456 (-0.18z)| norm 0.2731 (-0.17z)| lr 4.51e-04 | 322.82 ms | 52.3% bf16 MFU | 1624374 tok/s step 6962/19560 | loss 3.506605 (+0.30z)| norm 0.3196 (+2.38z)| lr 4.51e-04 | 322.37 ms | 52.4% bf16 MFU | 1624473 tok/s step 6963/19560 | loss 3.457970 (-0.85z)| norm 0.3301 (+2.85z)| lr 4.51e-04 | 323.54 ms | 52.2% bf16 MFU | 1624272 tok/s step 6964/19560 | loss 3.473580 (-0.50z)| norm 0.3083 (+1.65z)| lr 4.51e-04 | 322.97 ms | 52.3% bf16 MFU | 1624225 tok/s step 6965/19560 | loss 3.488000 (-0.13z)| norm 0.2814 (+0.22z)| lr 4.51e-04 | 322.96 ms | 52.3% bf16 MFU | 1624183 tok/s step 6966/19560 | loss 3.502813 (+0.26z)| norm 0.2973 (+1.06z)| lr 4.51e-04 | 322.74 ms | 52.3% bf16 MFU | 1624199 tok/s step 6967/19560 | loss 3.466242 (-0.67z)| norm 0.2811 (+0.20z)| lr 4.51e-04 | 322.10 ms | 52.4% bf16 MFU | 1624375 tok/s step 6968/19560 | loss 3.473089 (-0.49z)| norm 0.2983 (+1.10z)| lr 4.51e-04 | 322.28 ms | 52.4% bf16 MFU | 1624496 tok/s step 6969/19560 | loss 3.474605 (-0.44z)| norm 0.2547 (-1.20z)| lr 4.51e-04 | 322.18 ms | 52.4% bf16 MFU | 1624636 tok/s step 6970/19560 | loss 3.485709 (-0.17z)| norm 0.2935 (+0.84z)| lr 4.51e-04 | 322.78 ms | 52.3% bf16 MFU | 1624618 
tok/s step 6971/19560 | loss 3.500283 (+0.21z)| norm 0.2498 (-1.44z)| lr 4.51e-04 | 322.91 ms | 52.3% bf16 MFU | 1624568 tok/s step 6972/19560 | loss 3.497704 (+0.13z)| norm 0.2722 (-0.27z)| lr 4.51e-04 | 322.71 ms | 52.3% bf16 MFU | 1624572 tok/s step 6973/19560 | loss 3.502263 (+0.24z)| norm 0.2982 (+1.07z)| lr 4.51e-04 | 322.66 ms | 52.3% bf16 MFU | 1624588 tok/s step 6974/19560 | loss 3.460302 (-0.86z)| norm 0.2717 (-0.32z)| lr 4.51e-04 | 323.11 ms | 52.2% bf16 MFU | 1624490 tok/s step 6975/19560 | loss 3.467854 (-0.66z)| norm 0.2964 (+0.97z)| lr 4.51e-04 | 322.70 ms | 52.3% bf16 MFU | 1624501 tok/s step 6976/19560 | loss 3.524014 (+0.82z)| norm 0.3112 (+1.71z)| lr 4.51e-04 | 322.77 ms | 52.3% bf16 MFU | 1624492 tok/s step 6977/19560 | loss 3.571095 (+2.02z)| norm 0.2997 (+1.09z)| lr 4.50e-04 | 323.14 ms | 52.2% bf16 MFU | 1624392 tok/s step 6978/19560 | loss 3.456659 (-0.97z)| norm 0.2885 (+0.50z)| lr 4.50e-04 | 322.43 ms | 52.3% bf16 MFU | 1624476 tok/s step 6979/19560 | loss 3.525181 (+0.83z)| norm 0.3047 (+1.33z)| lr 4.50e-04 | 322.45 ms | 52.3% bf16 MFU | 1624549 tok/s step 6980/19560 | loss 3.504424 (+0.28z)| norm 0.3070 (+1.42z)| lr 4.50e-04 | 322.70 ms | 52.3% bf16 MFU | 1624555 tok/s step 6981/19560 | loss 3.461157 (-0.84z)| norm 0.2695 (-0.52z)| lr 4.50e-04 | 322.65 ms | 52.3% bf16 MFU | 1624574 tok/s step 6982/19560 | loss 3.493861 (+0.01z)| norm 0.2936 (+0.72z)| lr 4.50e-04 | 322.99 ms | 52.3% bf16 MFU | 1624507 tok/s step 6983/19560 | loss 3.448590 (-1.17z)| norm 0.2537 (-1.34z)| lr 4.50e-04 | 322.50 ms | 52.3% bf16 MFU | 1624567 tok/s step 6984/19560 | loss 3.535797 (+1.12z)| norm 0.2677 (-0.62z)| lr 4.50e-04 | 322.56 ms | 52.3% bf16 MFU | 1624608 tok/s step 6985/19560 | loss 3.535745 (+1.10z)| norm 0.2622 (-0.89z)| lr 4.50e-04 | 323.30 ms | 52.2% bf16 MFU | 1624461 tok/s step 6986/19560 | loss 3.484138 (-0.25z)| norm 0.2565 (-1.17z)| lr 4.50e-04 | 323.22 ms | 52.2% bf16 MFU | 1624341 tok/s step 6987/19560 | loss 3.489564 (-0.11z)| norm 0.2692 
(-0.52z)| lr 4.50e-04 | 322.37 ms | 52.4% bf16 MFU | 1624441 tok/s step 6988/19560 | loss 3.471586 (-0.57z)| norm 0.2745 (-0.24z)| lr 4.50e-04 | 322.75 ms | 52.3% bf16 MFU | 1624440 tok/s step 6989/19560 | loss 3.469873 (-0.61z)| norm 0.2717 (-0.39z)| lr 4.50e-04 | 322.63 ms | 52.3% bf16 MFU | 1624469 tok/s step 6990/19560 | loss 3.476742 (-0.42z)| norm 0.2625 (-0.86z)| lr 4.50e-04 | 322.86 ms | 52.3% bf16 MFU | 1624439 tok/s step 6991/19560 | loss 3.510668 (+0.48z)| norm 0.2700 (-0.49z)| lr 4.50e-04 | 322.80 ms | 52.3% bf16 MFU | 1624426 tok/s step 6992/19560 | loss 3.514968 (+0.58z)| norm 0.2490 (-1.56z)| lr 4.50e-04 | 322.71 ms | 52.3% bf16 MFU | 1624437 tok/s step 6993/19560 | loss 3.498301 (+0.12z)| norm 0.2864 (+0.36z)| lr 4.50e-04 | 322.75 ms | 52.3% bf16 MFU | 1624438 tok/s step 6994/19560 | loss 3.486735 (-0.20z)| norm 0.2965 (+0.87z)| lr 4.50e-04 | 322.29 ms | 52.4% bf16 MFU | 1624555 tok/s step 6995/19560 | loss 3.491066 (-0.07z)| norm 0.2595 (-1.03z)| lr 4.50e-04 | 322.92 ms | 52.3% bf16 MFU | 1624508 tok/s step 6996/19560 | loss 3.469768 (-0.64z)| norm 0.2595 (-1.01z)| lr 4.50e-04 | 322.57 ms | 52.3% bf16 MFU | 1624550 tok/s step 6997/19560 | loss 3.483881 (-0.24z)| norm 0.2385 (-2.03z)| lr 4.50e-04 | 322.66 ms | 52.3% bf16 MFU | 1624567 tok/s step 6998/19560 | loss 3.447558 (-1.25z)| norm 0.2971 (+0.92z)| lr 4.50e-04 | 322.60 ms | 52.3% bf16 MFU | 1624598 tok/s step 6999/19560 | loss 3.480840 (-0.32z)| norm 0.2874 (+0.42z)| lr 4.50e-04 | 322.59 ms | 52.3% bf16 MFU | 1624631 tok/s step 7000/19560 | loss 3.499639 (+0.19z)| norm 0.2568 (-1.11z)| lr 4.49e-04 | 322.60 ms | 52.3% bf16 MFU | 1624659 tok/s val loss 3.476349 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating HellaSwag: 70/79 HellaSwag: 2852/10042 = 0.284007 step 7001/19560 | loss 3.475997 (-0.47z)| norm 0.2837 (+0.24z)| lr 4.49e-04 | 321.93 
ms | 52.4% bf16 MFU | 1624855 tok/s step 7002/19560 | loss 3.462398 (-0.86z)| norm 0.2856 (+0.33z)| lr 4.49e-04 | 322.71 ms | 52.3% bf16 MFU | 1624845 tok/s step 7003/19560 | loss 3.516041 (+0.66z)| norm 0.2838 (+0.23z)| lr 4.49e-04 | 322.55 ms | 52.3% bf16 MFU | 1624874 tok/s step 7004/19560 | loss 3.485075 (-0.23z)| norm 0.2794 (+0.01z)| lr 4.49e-04 | 322.67 ms | 52.3% bf16 MFU | 1624872 tok/s step 7005/19560 | loss 3.515252 (+0.63z)| norm 0.2725 (-0.33z)| lr 4.49e-04 | 322.31 ms | 52.4% bf16 MFU | 1624962 tok/s step 7006/19560 | loss 3.570453 (+2.15z)| norm 0.2867 (+0.38z)| lr 4.49e-04 | 322.34 ms | 52.4% bf16 MFU | 1625040 tok/s step 7007/19560 | loss 3.563116 (+1.90z)| norm 0.2853 (+0.31z)| lr 4.49e-04 | 322.07 ms | 52.4% bf16 MFU | 1625182 tok/s step 7008/19560 | loss 3.555975 (+1.69z)| norm 0.2592 (-1.01z)| lr 4.49e-04 | 322.93 ms | 52.3% bf16 MFU | 1625099 tok/s step 7009/19560 | loss 3.412119 (-2.23z)| norm 0.2749 (-0.21z)| lr 4.49e-04 | 322.62 ms | 52.3% bf16 MFU | 1625099 tok/s step 7010/19560 | loss 3.418093 (-2.02z)| norm 0.2664 (-0.63z)| lr 4.49e-04 | 322.15 ms | 52.4% bf16 MFU | 1625218 tok/s step 7011/19560 | loss 3.541593 (+1.27z)| norm 0.2502 (-1.43z)| lr 4.49e-04 | 323.19 ms | 52.2% bf16 MFU | 1625068 tok/s step 7012/19560 | loss 3.456484 (-0.99z)| norm 0.2825 (+0.19z)| lr 4.49e-04 | 322.63 ms | 52.3% bf16 MFU | 1625068 tok/s step 7013/19560 | loss 3.475049 (-0.49z)| norm 0.2688 (-0.50z)| lr 4.49e-04 | 323.43 ms | 52.2% bf16 MFU | 1624866 tok/s step 7014/19560 | loss 3.493813 (+0.02z)| norm 0.2592 (-0.97z)| lr 4.49e-04 | 322.20 ms | 52.4% bf16 MFU | 1624983 tok/s step 7015/19560 | loss 3.522416 (+0.79z)| norm 0.2595 (-0.95z)| lr 4.49e-04 | 322.65 ms | 52.3% bf16 MFU | 1624982 tok/s step 7016/19560 | loss 3.486429 (-0.18z)| norm 0.2350 (-2.11z)| lr 4.49e-04 | 322.36 ms | 52.4% bf16 MFU | 1625052 tok/s step 7017/19560 | loss 3.495540 (+0.05z)| norm 0.2671 (-0.53z)| lr 4.49e-04 | 323.14 ms | 52.2% bf16 MFU | 1624923 tok/s step 7018/19560 | loss 
3.402888 (-2.39z)| norm 0.2660 (-0.57z)| lr 4.49e-04 | 322.94 ms | 52.3% bf16 MFU | 1624850 tok/s step 7019/19560 | loss 3.517307 (+0.68z)| norm 0.2659 (-0.57z)| lr 4.49e-04 | 322.73 ms | 52.3% bf16 MFU | 1624836 tok/s step 7020/19560 | loss 3.500924 (+0.22z)| norm 0.2512 (-1.28z)| lr 4.49e-04 | 322.60 ms | 52.3% bf16 MFU | 1624854 tok/s step 7021/19560 | loss 3.592958 (+2.64z)| norm 0.2606 (-0.81z)| lr 4.49e-04 | 322.94 ms | 52.3% bf16 MFU | 1624786 tok/s step 7022/19560 | loss 3.494717 (+0.04z)| norm 0.2541 (-1.11z)| lr 4.49e-04 | 322.49 ms | 52.3% bf16 MFU | 1624834 tok/s step 7023/19560 | loss 3.516870 (+0.62z)| norm 0.2547 (-1.07z)| lr 4.48e-04 | 322.21 ms | 52.4% bf16 MFU | 1624951 tok/s step 7024/19560 | loss 3.470592 (-0.60z)| norm 0.2826 (+0.34z)| lr 4.48e-04 | 322.99 ms | 52.3% bf16 MFU | 1624866 tok/s step 7025/19560 | loss 3.498310 (+0.13z)| norm 0.2676 (-0.41z)| lr 4.48e-04 | 322.94 ms | 52.3% bf16 MFU | 1624796 tok/s step 7026/19560 | loss 3.522703 (+0.77z)| norm 0.2619 (-0.71z)| lr 4.48e-04 | 322.66 ms | 52.3% bf16 MFU | 1624802 tok/s step 7027/19560 | loss 3.431662 (-1.60z)| norm 0.2553 (-1.04z)| lr 4.48e-04 | 322.50 ms | 52.3% bf16 MFU | 1624848 tok/s step 7028/19560 | loss 3.497717 (+0.13z)| norm 0.2647 (-0.54z)| lr 4.48e-04 | 322.15 ms | 52.4% bf16 MFU | 1624978 tok/s step 7029/19560 | loss 3.511412 (+0.48z)| norm 0.3492 (+3.68z)| lr 4.48e-04 | 322.57 ms | 52.3% bf16 MFU | 1624996 tok/s step 7030/19560 | loss 3.467401 (-0.66z)| norm 0.3002 (+1.22z)| lr 4.48e-04 | 322.87 ms | 52.3% bf16 MFU | 1624938 tok/s step 7031/19560 | loss 3.566648 (+1.93z)| norm 0.2526 (-1.13z)| lr 4.48e-04 | 322.65 ms | 52.3% bf16 MFU | 1624940 tok/s step 7032/19560 | loss 3.456856 (-0.94z)| norm 0.3059 (+1.54z)| lr 4.48e-04 | 322.75 ms | 52.3% bf16 MFU | 1624914 tok/s step 7033/19560 | loss 3.476744 (-0.40z)| norm 0.2751 (+0.01z)| lr 4.48e-04 | 323.27 ms | 52.2% bf16 MFU | 1624759 tok/s step 7034/19560 | loss 3.513142 (+0.58z)| norm 0.2635 (-0.57z)| lr 4.48e-04 | 321.60 
ms | 52.5% bf16 MFU | 1625033 tok/s step 7035/19560 | loss 3.548352 (+1.51z)| norm 0.2819 (+0.37z)| lr 4.48e-04 | 321.98 ms | 52.4% bf16 MFU | 1625196 tok/s step 7036/19560 | loss 3.477593 (-0.39z)| norm 0.2526 (-1.13z)| lr 4.48e-04 | 323.08 ms | 52.2% bf16 MFU | 1625076 tok/s step 7037/19560 | loss 3.468711 (-0.64z)| norm 0.2778 (+0.17z)| lr 4.48e-04 | 322.29 ms | 52.4% bf16 MFU | 1625162 tok/s step 7038/19560 | loss 3.488058 (-0.11z)| norm 0.2885 (+0.70z)| lr 4.48e-04 | 322.76 ms | 52.3% bf16 MFU | 1625123 tok/s step 7039/19560 | loss 3.426331 (-1.77z)| norm 0.3148 (+2.02z)| lr 4.48e-04 | 322.73 ms | 52.3% bf16 MFU | 1625094 tok/s step 7040/19560 | loss 3.468660 (-0.62z)| norm 0.2469 (-1.40z)| lr 4.48e-04 | 322.34 ms | 52.4% bf16 MFU | 1625163 tok/s step 7041/19560 | loss 3.502194 (+0.29z)| norm 0.2826 (+0.40z)| lr 4.48e-04 | 322.57 ms | 52.3% bf16 MFU | 1625171 tok/s step 7042/19560 | loss 3.497350 (+0.15z)| norm 0.2775 (+0.14z)| lr 4.48e-04 | 322.55 ms | 52.3% bf16 MFU | 1625186 tok/s step 7043/19560 | loss 3.529111 (+1.02z)| norm 0.2830 (+0.41z)| lr 4.48e-04 | 322.92 ms | 52.3% bf16 MFU | 1625105 tok/s step 7044/19560 | loss 3.454286 (-1.00z)| norm 0.2599 (-0.74z)| lr 4.48e-04 | 322.67 ms | 52.3% bf16 MFU | 1625093 tok/s step 7045/19560 | loss 3.495002 (+0.10z)| norm 0.2657 (-0.45z)| lr 4.48e-04 | 322.05 ms | 52.4% bf16 MFU | 1625236 tok/s step 7046/19560 | loss 3.533698 (+1.13z)| norm 0.2556 (-0.96z)| lr 4.47e-04 | 322.82 ms | 52.3% bf16 MFU | 1625178 tok/s step 7047/19560 | loss 3.518781 (+0.78z)| norm 0.2663 (-0.42z)| lr 4.47e-04 | 322.05 ms | 52.4% bf16 MFU | 1625318 tok/s step 7048/19560 | loss 3.448436 (-1.19z)| norm 0.2675 (-0.35z)| lr 4.47e-04 | 322.92 ms | 52.3% bf16 MFU | 1625232 tok/s step 7049/19560 | loss 3.467374 (-0.65z)| norm 0.2501 (-1.23z)| lr 4.47e-04 | 322.74 ms | 52.3% bf16 MFU | 1625194 tok/s step 7050/19560 | loss 3.490098 (-0.01z)| norm 0.2973 (+1.13z)| lr 4.47e-04 | 322.25 ms | 52.4% bf16 MFU | 1625282 tok/s step 7051/19560 | loss 
3.607737 (+3.15z)| norm 0.2587 (-0.80z)| lr 4.47e-04 | 322.66 ms | 52.3% bf16 MFU | 1625262 tok/s step 7052/19560 | loss 3.462056 (-0.80z)| norm 0.2536 (-1.04z)| lr 4.47e-04 | 322.45 ms | 52.3% bf16 MFU | 1625296 tok/s step 7053/19560 | loss 3.475845 (-0.43z)| norm 0.2496 (-1.23z)| lr 4.47e-04 | 322.55 ms | 52.3% bf16 MFU | 1625303 tok/s step 7054/19560 | loss 3.382007 (-2.88z)| norm 0.2621 (-0.60z)| lr 4.47e-04 | 322.73 ms | 52.3% bf16 MFU | 1625264 tok/s step 7055/19560 | loss 3.448666 (-1.11z)| norm 0.2724 (-0.10z)| lr 4.47e-04 | 322.23 ms | 52.4% bf16 MFU | 1625353 tok/s step 7056/19560 | loss 3.562500 (+1.87z)| norm 0.2675 (-0.34z)| lr 4.47e-04 | 322.31 ms | 52.4% bf16 MFU | 1625419 tok/s step 7057/19560 | loss 3.479932 (-0.30z)| norm 0.2680 (-0.31z)| lr 4.47e-04 | 322.30 ms | 52.4% bf16 MFU | 1625485 tok/s step 7058/19560 | loss 3.499239 (+0.22z)| norm 0.2956 (+1.06z)| lr 4.47e-04 | 322.97 ms | 52.3% bf16 MFU | 1625376 tok/s step 7059/19560 | loss 3.444917 (-1.20z)| norm 0.2732 (-0.07z)| lr 4.47e-04 | 322.38 ms | 52.4% bf16 MFU | 1625422 tok/s step 7060/19560 | loss 3.470477 (-0.53z)| norm 0.2607 (-0.70z)| lr 4.47e-04 | 322.68 ms | 52.3% bf16 MFU | 1625391 tok/s step 7061/19560 | loss 3.537723 (+1.22z)| norm 0.2802 (+0.28z)| lr 4.47e-04 | 323.08 ms | 52.2% bf16 MFU | 1625260 tok/s step 7062/19560 | loss 3.485593 (-0.13z)| norm 0.2600 (-0.72z)| lr 4.47e-04 | 322.62 ms | 52.3% bf16 MFU | 1625253 tok/s step 7063/19560 | loss 3.446059 (-1.16z)| norm 0.2853 (+0.54z)| lr 4.47e-04 | 322.22 ms | 52.4% bf16 MFU | 1625346 tok/s step 7064/19560 | loss 3.538726 (+1.25z)| norm 0.2647 (-0.49z)| lr 4.47e-04 | 322.70 ms | 52.3% bf16 MFU | 1625314 tok/s step 7065/19560 | loss 3.463916 (-0.68z)| norm 0.2678 (-0.32z)| lr 4.47e-04 | 322.88 ms | 52.3% bf16 MFU | 1625239 tok/s step 7066/19560 | loss 3.452459 (-0.97z)| norm 0.2756 (+0.06z)| lr 4.47e-04 | 322.81 ms | 52.3% bf16 MFU | 1625184 tok/s step 7067/19560 | loss 3.415561 (-1.89z)| norm 0.2984 (+1.20z)| lr 4.47e-04 | 322.03 
ms | 52.4% bf16 MFU | 1625328 tok/s step 7068/19560 | loss 3.476361 (-0.32z)| norm 0.3037 (+1.44z)| lr 4.47e-04 | 323.12 ms | 52.2% bf16 MFU | 1625190 tok/s step 7069/19560 | loss 3.425318 (-1.62z)| norm 0.2657 (-0.43z)| lr 4.46e-04 | 322.56 ms | 52.3% bf16 MFU | 1625200 tok/s step 7070/19560 | loss 3.482223 (-0.17z)| norm 0.2926 (+0.89z)| lr 4.46e-04 | 322.64 ms | 52.3% bf16 MFU | 1625189 tok/s step 7071/19560 | loss 3.551441 (+1.63z)| norm 0.2737 (-0.03z)| lr 4.46e-04 | 322.97 ms | 52.3% bf16 MFU | 1625097 tok/s step 7072/19560 | loss 3.611722 (+3.06z)| norm 0.3123 (+1.87z)| lr 4.46e-04 | 322.50 ms | 52.3% bf16 MFU | 1625128 tok/s step 7073/19560 | loss 3.610548 (+2.90z)| norm 0.3146 (+1.97z)| lr 4.46e-04 | 322.62 ms | 52.3% bf16 MFU | 1625127 tok/s step 7074/19560 | loss 3.515979 (+0.61z)| norm 0.2950 (+0.99z)| lr 4.46e-04 | 323.00 ms | 52.3% bf16 MFU | 1625030 tok/s step 7075/19560 | loss 3.636874 (+3.37z)| norm 0.3241 (+2.38z)| lr 4.46e-04 | 322.59 ms | 52.3% bf16 MFU | 1625039 tok/s step 7076/19560 | loss 3.457959 (-0.78z)| norm 0.2972 (+1.06z)| lr 4.46e-04 | 322.65 ms | 52.3% bf16 MFU | 1625034 tok/s step 7077/19560 | loss 3.470912 (-0.47z)| norm 0.2996 (+1.19z)| lr 4.46e-04 | 322.49 ms | 52.3% bf16 MFU | 1625069 tok/s step 7078/19560 | loss 3.470374 (-0.49z)| norm 0.2974 (+1.07z)| lr 4.46e-04 | 322.99 ms | 52.3% bf16 MFU | 1624977 tok/s step 7079/19560 | loss 3.450234 (-0.96z)| norm 0.2825 (+0.33z)| lr 4.46e-04 | 322.55 ms | 52.3% bf16 MFU | 1625002 tok/s step 7080/19560 | loss 3.468656 (-0.54z)| norm 0.2990 (+1.12z)| lr 4.46e-04 | 322.78 ms | 52.3% bf16 MFU | 1624965 tok/s step 7081/19560 | loss 3.537258 (+1.07z)| norm 0.2726 (-0.17z)| lr 4.46e-04 | 322.68 ms | 52.3% bf16 MFU | 1624957 tok/s step 7082/19560 | loss 3.451079 (-0.94z)| norm 0.2967 (+0.99z)| lr 4.46e-04 | 322.40 ms | 52.3% bf16 MFU | 1625020 tok/s step 7083/19560 | loss 3.470022 (-0.50z)| norm 0.2898 (+0.64z)| lr 4.46e-04 | 322.57 ms | 52.3% bf16 MFU | 1625038 tok/s step 7084/19560 | loss 
3.446525 (-1.05z)| norm 0.2827 (+0.29z)| lr 4.46e-04 | 322.90 ms | 52.3% bf16 MFU | 1624970 tok/s step 7085/19560 | loss 3.427419 (-1.48z)| norm 0.2694 (-0.38z)| lr 4.46e-04 | 322.45 ms | 52.3% bf16 MFU | 1625020 tok/s step 7086/19560 | loss 3.469782 (-0.49z)| norm 0.2684 (-0.43z)| lr 4.46e-04 | 323.59 ms | 52.2% bf16 MFU | 1624779 tok/s step 7087/19560 | loss 3.503857 (+0.30z)| norm 0.2727 (-0.22z)| lr 4.46e-04 | 323.13 ms | 52.2% bf16 MFU | 1624668 tok/s step 7088/19560 | loss 3.528035 (+0.85z)| norm 0.2627 (-0.72z)| lr 4.46e-04 | 322.72 ms | 52.3% bf16 MFU | 1624663 tok/s step 7089/19560 | loss 3.472984 (-0.43z)| norm 0.2812 (+0.20z)| lr 4.46e-04 | 322.79 ms | 52.3% bf16 MFU | 1624641 tok/s step 7090/19560 | loss 3.508574 (+0.40z)| norm 0.2412 (-1.78z)| lr 4.46e-04 | 322.78 ms | 52.3% bf16 MFU | 1624624 tok/s step 7091/19560 | loss 3.445980 (-1.05z)| norm 0.2616 (-0.75z)| lr 4.46e-04 | 322.53 ms | 52.3% bf16 MFU | 1624670 tok/s step 7092/19560 | loss 3.501568 (+0.23z)| norm 0.3101 (+1.75z)| lr 4.45e-04 | 323.46 ms | 52.2% bf16 MFU | 1624480 tok/s step 7093/19560 | loss 3.436006 (-1.27z)| norm 0.3591 (+3.97z)| lr 4.45e-04 | 323.29 ms | 52.2% bf16 MFU | 1624344 tok/s step 7094/19560 | loss 3.463275 (-0.63z)| norm 0.2889 (+0.59z)| lr 4.45e-04 | 323.11 ms | 52.2% bf16 MFU | 1624258 tok/s step 7095/19560 | loss 3.489089 (-0.04z)| norm 0.2934 (+0.80z)| lr 4.45e-04 | 323.35 ms | 52.2% bf16 MFU | 1624117 tok/s step 7096/19560 | loss 3.494730 (+0.08z)| norm 0.3087 (+1.53z)| lr 4.45e-04 | 323.20 ms | 52.2% bf16 MFU | 1624021 tok/s step 7097/19560 | loss 3.458663 (-0.75z)| norm 0.2584 (-0.89z)| lr 4.45e-04 | 322.44 ms | 52.3% bf16 MFU | 1624119 tok/s step 7098/19560 | loss 3.468185 (-0.52z)| norm 0.2802 (+0.16z)| lr 4.45e-04 | 322.87 ms | 52.3% bf16 MFU | 1624105 tok/s step 7099/19560 | loss 3.501913 (+0.25z)| norm 0.2657 (-0.54z)| lr 4.45e-04 | 322.77 ms | 52.3% bf16 MFU | 1624116 tok/s step 7100/19560 | loss 3.431844 (-1.34z)| norm 0.2815 (+0.22z)| lr 4.45e-04 | 323.47 
ms | 52.2% bf16 MFU | 1623951 tok/s step 7101/19560 | loss 3.455716 (-0.78z)| norm 0.2589 (-0.86z)| lr 4.45e-04 | 323.46 ms | 52.2% bf16 MFU | 1623798 tok/s step 7102/19560 | loss 3.519160 (+0.65z)| norm 0.2584 (-0.88z)| lr 4.45e-04 | 322.00 ms | 52.4% bf16 MFU | 1624018 tok/s step 7103/19560 | loss 3.464129 (-0.60z)| norm 0.2518 (-1.18z)| lr 4.45e-04 | 323.21 ms | 52.2% bf16 MFU | 1623924 tok/s step 7104/19560 | loss 3.428688 (-1.38z)| norm 0.2465 (-1.41z)| lr 4.45e-04 | 322.92 ms | 52.3% bf16 MFU | 1623906 tok/s step 7105/19560 | loss 3.468863 (-0.46z)| norm 0.2671 (-0.41z)| lr 4.45e-04 | 322.84 ms | 52.3% bf16 MFU | 1623910 tok/s step 7106/19560 | loss 3.482658 (-0.15z)| norm 0.2584 (-0.82z)| lr 4.45e-04 | 322.74 ms | 52.3% bf16 MFU | 1623937 tok/s step 7107/19560 | loss 3.491338 (+0.06z)| norm 0.2594 (-0.75z)| lr 4.45e-04 | 322.34 ms | 52.4% bf16 MFU | 1624065 tok/s step 7108/19560 | loss 3.473000 (-0.36z)| norm 0.2621 (-0.61z)| lr 4.45e-04 | 323.04 ms | 52.2% bf16 MFU | 1624010 tok/s step 7109/19560 | loss 3.511696 (+0.52z)| norm 0.2686 (-0.29z)| lr 4.45e-04 | 322.49 ms | 52.3% bf16 MFU | 1624098 tok/s step 7110/19560 | loss 3.455377 (-0.77z)| norm 0.2793 (+0.24z)| lr 4.45e-04 | 322.25 ms | 52.4% bf16 MFU | 1624241 tok/s step 7111/19560 | loss 3.445338 (-1.00z)| norm 0.2610 (-0.67z)| lr 4.45e-04 | 322.84 ms | 52.3% bf16 MFU | 1624227 tok/s step 7112/19560 | loss 3.438311 (-1.14z)| norm 0.2516 (-1.12z)| lr 4.45e-04 | 323.63 ms | 52.1% bf16 MFU | 1624016 tok/s step 7113/19560 | loss 3.466897 (-0.47z)| norm 0.2350 (-1.90z)| lr 4.45e-04 | 322.70 ms | 52.3% bf16 MFU | 1624051 tok/s step 7114/19560 | loss 3.520052 (+0.75z)| norm 0.2419 (-1.55z)| lr 4.44e-04 | 322.25 ms | 52.4% bf16 MFU | 1624195 tok/s step 7115/19560 | loss 3.541599 (+1.23z)| norm 0.2747 (+0.03z)| lr 4.44e-04 | 322.55 ms | 52.3% bf16 MFU | 1624257 tok/s step 7116/19560 | loss 3.498447 (+0.23z)| norm 0.2480 (-1.24z)| lr 4.44e-04 | 322.51 ms | 52.3% bf16 MFU | 1624326 tok/s step 7117/19560 | loss 
3.685386 (+4.17z)| norm 0.2834 (+0.45z)| lr 4.44e-04 | 323.38 ms | 52.2% bf16 MFU | 1624174 tok/s step 7118/19560 | loss 3.543140 (+1.12z)| norm 0.2975 (+1.11z)| lr 4.44e-04 | 322.43 ms | 52.3% bf16 MFU | 1624267 tok/s step 7119/19560 | loss 3.500941 (+0.22z)| norm 0.3099 (+1.67z)| lr 4.44e-04 | 322.75 ms | 52.3% bf16 MFU | 1624275 tok/s step 7120/19560 | loss 3.450876 (-0.83z)| norm 0.3004 (+1.21z)| lr 4.44e-04 | 322.74 ms | 52.3% bf16 MFU | 1624285 tok/s step 7121/19560 | loss 3.483530 (-0.13z)| norm 0.2986 (+1.11z)| lr 4.44e-04 | 322.75 ms | 52.3% bf16 MFU | 1624293 tok/s step 7122/19560 | loss 3.491228 (+0.03z)| norm 0.2730 (-0.09z)| lr 4.44e-04 | 323.34 ms | 52.2% bf16 MFU | 1624152 tok/s step 7123/19560 | loss 3.465630 (-0.51z)| norm 0.2752 (+0.01z)| lr 4.44e-04 | 322.86 ms | 52.3% bf16 MFU | 1624138 tok/s step 7124/19560 | loss 3.454601 (-0.74z)| norm 0.2799 (+0.23z)| lr 4.44e-04 | 322.85 ms | 52.3% bf16 MFU | 1624127 tok/s step 7125/19560 | loss 3.488403 (-0.02z)| norm 0.2853 (+0.47z)| lr 4.44e-04 | 323.29 ms | 52.2% bf16 MFU | 1624006 tok/s step 7126/19560 | loss 3.500342 (+0.22z)| norm 0.3017 (+1.25z)| lr 4.44e-04 | 322.58 ms | 52.3% bf16 MFU | 1624072 tok/s step 7127/19560 | loss 3.487815 (-0.05z)| norm 0.2863 (+0.52z)| lr 4.44e-04 | 323.12 ms | 52.2% bf16 MFU | 1623997 tok/s step 7128/19560 | loss 3.465282 (-0.52z)| norm 0.3142 (+1.82z)| lr 4.44e-04 | 322.91 ms | 52.3% bf16 MFU | 1623978 tok/s step 7129/19560 | loss 3.575362 (+1.78z)| norm 0.2758 (-0.00z)| lr 4.44e-04 | 322.80 ms | 52.3% bf16 MFU | 1623990 tok/s step 7130/19560 | loss 3.514983 (+0.50z)| norm 0.2990 (+1.09z)| lr 4.44e-04 | 322.96 ms | 52.3% bf16 MFU | 1623960 tok/s step 7131/19560 | loss 3.520775 (+0.62z)| norm 0.2634 (-0.59z)| lr 4.44e-04 | 322.81 ms | 52.3% bf16 MFU | 1623970 tok/s step 7132/19560 | loss 3.544139 (+1.10z)| norm 0.2814 (+0.26z)| lr 4.44e-04 | 323.03 ms | 52.2% bf16 MFU | 1623922 tok/s step 7133/19560 | loss 3.476376 (-0.31z)| norm 0.2748 (-0.05z)| lr 4.44e-04 | 322.82 
ms | 52.3% bf16 MFU | 1623931 tok/s step 7134/19560 | loss 3.538632 (+1.00z)| norm 0.2780 (+0.10z)| lr 4.44e-04 | 322.03 ms | 52.4% bf16 MFU | 1624139 tok/s step 7135/19560 | loss 3.486558 (-0.08z)| norm 0.2718 (-0.18z)| lr 4.44e-04 | 322.77 ms | 52.3% bf16 MFU | 1624149 tok/s step 7136/19560 | loss 3.491779 (+0.04z)| norm 0.2664 (-0.44z)| lr 4.44e-04 | 322.84 ms | 52.3% bf16 MFU | 1624141 tok/s step 7137/19560 | loss 3.506290 (+0.34z)| norm 0.2576 (-0.85z)| lr 4.43e-04 | 322.91 ms | 52.3% bf16 MFU | 1624114 tok/s step 7138/19560 | loss 3.461262 (-0.64z)| norm 0.2751 (-0.03z)| lr 4.43e-04 | 322.93 ms | 52.3% bf16 MFU | 1624086 tok/s step 7139/19560 | loss 3.452312 (-0.82z)| norm 0.2854 (+0.45z)| lr 4.43e-04 | 322.57 ms | 52.3% bf16 MFU | 1624149 tok/s step 7140/19560 | loss 3.447547 (-0.93z)| norm 0.2683 (-0.36z)| lr 4.43e-04 | 322.97 ms | 52.3% bf16 MFU | 1624109 tok/s step 7141/19560 | loss 3.446848 (-0.93z)| norm 0.2613 (-0.69z)| lr 4.43e-04 | 323.01 ms | 52.3% bf16 MFU | 1624061 tok/s step 7142/19560 | loss 3.466431 (-0.50z)| norm 0.2484 (-1.29z)| lr 4.43e-04 | 322.92 ms | 52.3% bf16 MFU | 1624038 tok/s step 7143/19560 | loss 3.557983 (+1.47z)| norm 0.2740 (-0.08z)| lr 4.43e-04 | 322.94 ms | 52.3% bf16 MFU | 1624010 tok/s step 7144/19560 | loss 3.537006 (+1.00z)| norm 0.2973 (+1.00z)| lr 4.43e-04 | 322.61 ms | 52.3% bf16 MFU | 1624068 tok/s step 7145/19560 | loss 3.616507 (+2.61z)| norm 0.2911 (+0.70z)| lr 4.43e-04 | 322.44 ms | 52.3% bf16 MFU | 1624163 tok/s step 7146/19560 | loss 3.505594 (+0.29z)| norm 0.2967 (+0.95z)| lr 4.43e-04 | 323.25 ms | 52.2% bf16 MFU | 1624050 tok/s step 7147/19560 | loss 3.491524 (-0.01z)| norm 0.3208 (+2.05z)| lr 4.43e-04 | 322.80 ms | 52.3% bf16 MFU | 1624058 tok/s step 7148/19560 | loss 3.498971 (+0.15z)| norm 0.2735 (-0.18z)| lr 4.43e-04 | 322.76 ms | 52.3% bf16 MFU | 1624074 tok/s step 7149/19560 | loss 3.505087 (+0.30z)| norm 0.2821 (+0.22z)| lr 4.43e-04 | 321.99 ms | 52.4% bf16 MFU | 1624283 tok/s step 7150/19560 | loss 
3.430464 (-1.29z)| norm 0.2945 (+0.79z)| lr 4.43e-04 | 322.81 ms | 52.3% bf16 MFU | 1624277 tok/s step 7151/19560 | loss 3.459765 (-0.65z)| norm 0.2813 (+0.16z)| lr 4.43e-04 | 322.75 ms | 52.3% bf16 MFU | 1624285 tok/s step 7152/19560 | loss 3.478554 (-0.25z)| norm 0.2794 (+0.07z)| lr 4.43e-04 | 323.05 ms | 52.2% bf16 MFU | 1624218 tok/s step 7153/19560 | loss 3.513167 (+0.49z)| norm 0.2775 (-0.03z)| lr 4.43e-04 | 322.02 ms | 52.4% bf16 MFU | 1624413 tok/s step 7154/19560 | loss 3.455768 (-0.73z)| norm 0.2517 (-1.25z)| lr 4.43e-04 | 322.56 ms | 52.3% bf16 MFU | 1624463 tok/s step 7155/19560 | loss 3.487040 (-0.07z)| norm 0.2611 (-0.81z)| lr 4.43e-04 | 322.74 ms | 52.3% bf16 MFU | 1624463 tok/s step 7156/19560 | loss 3.452472 (-0.80z)| norm 0.2614 (-0.79z)| lr 4.43e-04 | 322.54 ms | 52.3% bf16 MFU | 1624515 tok/s step 7157/19560 | loss 3.627530 (+2.84z)| norm 0.3981 (+5.29z)| lr 4.43e-04 | 322.06 ms | 52.4% bf16 MFU | 1624687 tok/s step 7158/19560 | loss 3.487537 (-0.07z)| norm 0.2831 (+0.22z)| lr 4.43e-04 | 322.54 ms | 52.3% bf16 MFU | 1624728 tok/s step 7159/19560 | loss 3.552338 (+1.28z)| norm 0.3054 (+1.18z)| lr 4.43e-04 | 322.14 ms | 52.4% bf16 MFU | 1624868 tok/s step 7160/19560 | loss 3.460609 (-0.63z)| norm 0.2800 (+0.07z)| lr 4.42e-04 | 322.50 ms | 52.3% bf16 MFU | 1624909 tok/s step 7161/19560 | loss 3.480021 (-0.23z)| norm 0.2946 (+0.71z)| lr 4.42e-04 | 322.50 ms | 52.3% bf16 MFU | 1624948 tok/s step 7162/19560 | loss 3.455851 (-0.72z)| norm 0.3058 (+1.19z)| lr 4.42e-04 | 322.56 ms | 52.3% bf16 MFU | 1624972 tok/s step 7163/19560 | loss 3.498172 (+0.17z)| norm 0.3250 (+1.99z)| lr 4.42e-04 | 322.57 ms | 52.3% bf16 MFU | 1624991 tok/s step 7164/19560 | loss 3.489698 (-0.01z)| norm 0.3254 (+1.97z)| lr 4.42e-04 | 321.98 ms | 52.4% bf16 MFU | 1625157 tok/s step 7165/19560 | loss 3.485497 (-0.10z)| norm 0.3193 (+1.67z)| lr 4.42e-04 | 322.37 ms | 52.4% bf16 MFU | 1625218 tok/s step 7166/19560 | loss 3.459040 (-0.65z)| norm 0.2949 (+0.63z)| lr 4.42e-04 | 322.76 
ms | 52.3% bf16 MFU | 1625176 tok/s step 7167/19560 | loss 3.455248 (-0.74z)| norm 0.3046 (+1.04z)| lr 4.42e-04 | 322.46 ms | 52.3% bf16 MFU | 1625212 tok/s step 7168/19560 | loss 3.431167 (-1.23z)| norm 0.2875 (+0.30z)| lr 4.42e-04 | 322.74 ms | 52.3% bf16 MFU | 1625175 tok/s step 7169/19560 | loss 3.440797 (-1.02z)| norm 0.2685 (-0.51z)| lr 4.42e-04 | 321.96 ms | 52.4% bf16 MFU | 1625336 tok/s step 7170/19560 | loss 3.470902 (-0.39z)| norm 0.2916 (+0.48z)| lr 4.42e-04 | 323.05 ms | 52.2% bf16 MFU | 1625216 tok/s step 7171/19560 | loss 3.533939 (+0.93z)| norm 0.2734 (-0.30z)| lr 4.42e-04 | 322.71 ms | 52.3% bf16 MFU | 1625188 tok/s step 7172/19560 | loss 3.479546 (-0.21z)| norm 0.2848 (+0.18z)| lr 4.42e-04 | 322.31 ms | 52.4% bf16 MFU | 1625263 tok/s step 7173/19560 | loss 3.412029 (-1.59z)| norm 0.2621 (-0.80z)| lr 4.42e-04 | 322.52 ms | 52.3% bf16 MFU | 1625279 tok/s step 7174/19560 | loss 3.510180 (+0.44z)| norm 0.3018 (+0.90z)| lr 4.42e-04 | 323.15 ms | 52.2% bf16 MFU | 1625136 tok/s step 7175/19560 | loss 3.441205 (-0.97z)| norm 0.2676 (-0.58z)| lr 4.42e-04 | 322.52 ms | 52.3% bf16 MFU | 1625160 tok/s step 7176/19560 | loss 3.451150 (-0.77z)| norm 0.2631 (-0.77z)| lr 4.42e-04 | 322.13 ms | 52.4% bf16 MFU | 1625281 tok/s step 7177/19560 | loss 3.479210 (-0.19z)| norm 0.2864 (+0.23z)| lr 4.42e-04 | 322.76 ms | 52.3% bf16 MFU | 1625236 tok/s step 7178/19560 | loss 3.460668 (-0.57z)| norm 0.2427 (-1.64z)| lr 4.42e-04 | 322.44 ms | 52.3% bf16 MFU | 1625274 tok/s step 7179/19560 | loss 3.480625 (-0.14z)| norm 0.2910 (+0.43z)| lr 4.42e-04 | 322.73 ms | 52.3% bf16 MFU | 1625238 tok/s step 7180/19560 | loss 3.489852 (+0.06z)| norm 0.2782 (-0.13z)| lr 4.42e-04 | 323.00 ms | 52.3% bf16 MFU | 1625135 tok/s step 7181/19560 | loss 3.624527 (+2.80z)| norm 0.3002 (+0.81z)| lr 4.42e-04 | 322.29 ms | 52.4% bf16 MFU | 1625217 tok/s step 7182/19560 | loss 3.462041 (-0.57z)| norm 0.2628 (-0.82z)| lr 4.42e-04 | 321.90 ms | 52.4% bf16 MFU | 1625394 tok/s step 7183/19560 | loss 
3.492702 (+0.07z)| norm 0.2840 (+0.10z)| lr 4.41e-04 | 322.63 ms | 52.3% bf16 MFU | 1625375 tok/s step 7184/19560 | loss 3.509589 (+0.44z)| norm 0.2659 (-0.69z)| lr 4.41e-04 | 322.49 ms | 52.3% bf16 MFU | 1625394 tok/s step 7185/19560 | loss 3.463547 (-0.54z)| norm 0.2698 (-0.52z)| lr 4.41e-04 | 322.55 ms | 52.3% bf16 MFU | 1625397 tok/s step 7186/19560 | loss 3.433472 (-1.16z)| norm 0.2760 (-0.24z)| lr 4.41e-04 | 322.76 ms | 52.3% bf16 MFU | 1625346 tok/s step 7187/19560 | loss 3.564937 (+1.58z)| norm 0.3212 (+1.70z)| lr 4.41e-04 | 322.32 ms | 52.4% bf16 MFU | 1625408 tok/s step 7188/19560 | loss 3.442663 (-0.97z)| norm 0.2768 (-0.23z)| lr 4.41e-04 | 322.53 ms | 52.3% bf16 MFU | 1625414 tok/s step 7189/19560 | loss 3.571009 (+1.69z)| norm 0.2815 (-0.02z)| lr 4.41e-04 | 322.83 ms | 52.3% bf16 MFU | 1625346 tok/s step 7190/19560 | loss 3.452312 (-0.76z)| norm 0.2520 (-1.30z)| lr 4.41e-04 | 322.96 ms | 52.3% bf16 MFU | 1625247 tok/s step 7191/19560 | loss 3.564978 (+1.54z)| norm 0.5727 (+8.36z)| lr 4.41e-04 | 322.51 ms | 52.3% bf16 MFU | 1625266 tok/s step 7192/19560 | loss 3.449044 (-0.83z)| norm 0.3116 (+0.78z)| lr 4.41e-04 | 322.58 ms | 52.3% bf16 MFU | 1625267 tok/s step 7193/19560 | loss 3.596269 (+2.14z)| norm 0.3158 (+0.89z)| lr 4.41e-04 | 322.58 ms | 52.3% bf16 MFU | 1625268 tok/s step 7194/19560 | loss 3.430813 (-1.19z)| norm 0.3251 (+1.14z)| lr 4.41e-04 | 322.67 ms | 52.3% bf16 MFU | 1625247 tok/s step 7195/19560 | loss 3.447108 (-0.88z)| norm 0.3009 (+0.45z)| lr 4.41e-04 | 322.54 ms | 52.3% bf16 MFU | 1625259 tok/s step 7196/19560 | loss 3.498634 (+0.16z)| norm 0.3122 (+0.77z)| lr 4.41e-04 | 322.03 ms | 52.4% bf16 MFU | 1625400 tok/s step 7197/19560 | loss 3.522844 (+0.64z)| norm 0.2984 (+0.36z)| lr 4.41e-04 | 322.62 ms | 52.3% bf16 MFU | 1625385 tok/s step 7198/19560 | loss 3.435575 (-1.12z)| norm 0.2902 (+0.13z)| lr 4.41e-04 | 322.46 ms | 52.3% bf16 MFU | 1625411 tok/s step 7199/19560 | loss 3.490169 (-0.01z)| norm 0.2758 (-0.29z)| lr 4.41e-04 | 322.72 
ms | 52.3% bf16 MFU | 1625370 tok/s step 7200/19560 | loss 3.390737 (-2.02z)| norm 0.3083 (+0.65z)| lr 4.41e-04 | 322.87 ms | 52.3% bf16 MFU | 1625292 tok/s step 7201/19560 | loss 3.485691 (-0.04z)| norm 0.2708 (-0.42z)| lr 4.41e-04 | 322.62 ms | 52.3% bf16 MFU | 1625282 tok/s step 7202/19560 | loss 3.501874 (+0.30z)| norm 0.2896 (+0.12z)| lr 4.41e-04 | 322.30 ms | 52.4% bf16 MFU | 1625353 tok/s step 7203/19560 | loss 3.431268 (-1.20z)| norm 0.2816 (-0.10z)| lr 4.41e-04 | 322.54 ms | 52.3% bf16 MFU | 1625361 tok/s step 7204/19560 | loss 3.441157 (-0.98z)| norm 0.2695 (-0.44z)| lr 4.41e-04 | 322.37 ms | 52.4% bf16 MFU | 1625410 tok/s step 7205/19560 | loss 3.467438 (-0.40z)| norm 0.2603 (-0.70z)| lr 4.40e-04 | 322.84 ms | 52.3% bf16 MFU | 1625340 tok/s step 7206/19560 | loss 3.474076 (-0.26z)| norm 0.2672 (-0.49z)| lr 4.40e-04 | 322.92 ms | 52.3% bf16 MFU | 1625253 tok/s step 7207/19560 | loss 3.590276 (+2.21z)| norm 0.3099 (+0.73z)| lr 4.40e-04 | 322.94 ms | 52.3% bf16 MFU | 1625165 tok/s step 7208/19560 | loss 3.421425 (-1.39z)| norm 0.2888 (+0.13z)| lr 4.40e-04 | 323.00 ms | 52.3% bf16 MFU | 1625065 tok/s step 7209/19560 | loss 3.533367 (+1.00z)| norm 0.3036 (+0.55z)| lr 4.40e-04 | 322.23 ms | 52.4% bf16 MFU | 1625165 tok/s step 7210/19560 | loss 3.490002 (+0.07z)| norm 0.2680 (-0.47z)| lr 4.40e-04 | 322.63 ms | 52.3% bf16 MFU | 1625158 tok/s step 7211/19560 | loss 3.464162 (-0.49z)| norm 0.2639 (-0.58z)| lr 4.40e-04 | 322.59 ms | 52.3% bf16 MFU | 1625162 tok/s step 7212/19560 | loss 3.427044 (-1.27z)| norm 0.2785 (-0.16z)| lr 4.40e-04 | 322.83 ms | 52.3% bf16 MFU | 1625105 tok/s step 7213/19560 | loss 3.471140 (-0.34z)| norm 0.2803 (-0.11z)| lr 4.40e-04 | 322.53 ms | 52.3% bf16 MFU | 1625128 tok/s step 7214/19560 | loss 3.361641 (-2.60z)| norm 0.2509 (-0.95z)| lr 4.40e-04 | 322.29 ms | 52.4% bf16 MFU | 1625210 tok/s step 7215/19560 | loss 3.492634 (+0.14z)| norm 0.2765 (-0.22z)| lr 4.40e-04 | 322.98 ms | 52.3% bf16 MFU | 1625114 tok/s step 7216/19560 | loss 
3.559509 (+1.51z)| norm 0.2630 (-0.61z)| lr 4.40e-04 | 322.67 ms | 52.3% bf16 MFU | 1625100 tok/s step 7217/19560 | loss 3.589471 (+2.08z)| norm 0.2697 (-0.41z)| lr 4.40e-04 | 322.38 ms | 52.4% bf16 MFU | 1625162 tok/s step 7218/19560 | loss 3.433770 (-1.07z)| norm 0.2834 (-0.03z)| lr 4.40e-04 | 322.35 ms | 52.4% bf16 MFU | 1625226 tok/s step 7219/19560 | loss 3.523649 (+0.74z)| norm 0.2616 (-0.66z)| lr 4.40e-04 | 323.33 ms | 52.2% bf16 MFU | 1625040 tok/s step 7220/19560 | loss 3.467506 (-0.40z)| norm 0.2678 (-0.47z)| lr 4.40e-04 | 322.41 ms | 52.3% bf16 MFU | 1625096 tok/s step 7221/19560 | loss 3.469066 (-0.37z)| norm 0.2781 (-0.16z)| lr 4.40e-04 | 322.60 ms | 52.3% bf16 MFU | 1625102 tok/s step 7222/19560 | loss 3.452058 (-0.72z)| norm 0.2913 (+0.23z)| lr 4.40e-04 | 322.58 ms | 52.3% bf16 MFU | 1625113 tok/s step 7223/19560 | loss 3.417910 (-1.39z)| norm 0.2501 (-0.97z)| lr 4.40e-04 | 322.21 ms | 52.4% bf16 MFU | 1625214 tok/s step 7224/19560 | loss 3.473463 (-0.26z)| norm 0.2806 (-0.07z)| lr 4.40e-04 | 322.63 ms | 52.3% bf16 MFU | 1625207 tok/s step 7225/19560 | loss 3.483101 (-0.07z)| norm 0.2475 (-1.04z)| lr 4.40e-04 | 322.58 ms | 52.3% bf16 MFU | 1625211 tok/s step 7226/19560 | loss 3.513634 (+0.54z)| norm 0.3037 (+0.61z)| lr 4.40e-04 | 322.34 ms | 52.4% bf16 MFU | 1625277 tok/s step 7227/19560 | loss 3.405846 (-1.61z)| norm 0.2721 (-0.32z)| lr 4.40e-04 | 322.38 ms | 52.4% bf16 MFU | 1625328 tok/s step 7228/19560 | loss 3.431983 (-1.09z)| norm 0.2841 (+0.03z)| lr 4.39e-04 | 322.39 ms | 52.3% bf16 MFU | 1625373 tok/s step 7229/19560 | loss 3.438780 (-0.95z)| norm 0.2639 (-0.56z)| lr 4.39e-04 | 322.36 ms | 52.4% bf16 MFU | 1625425 tok/s step 7230/19560 | loss 3.479969 (-0.12z)| norm 0.2623 (-0.61z)| lr 4.39e-04 | 322.54 ms | 52.3% bf16 MFU | 1625427 tok/s step 7231/19560 | loss 3.444539 (-0.82z)| norm 0.2759 (-0.22z)| lr 4.39e-04 | 322.25 ms | 52.4% bf16 MFU | 1625503 tok/s step 7232/19560 | loss 3.464966 (-0.42z)| norm 0.2488 (-1.02z)| lr 4.39e-04 | 322.49 
ms | 52.3% bf16 MFU | 1625516 tok/s step 7233/19560 | loss 3.437322 (-0.97z)| norm 0.2849 (+0.04z)| lr 4.39e-04 | 322.72 ms | 52.3% bf16 MFU | 1625470 tok/s step 7234/19560 | loss 3.477724 (-0.16z)| norm 0.3038 (+0.59z)| lr 4.39e-04 | 322.38 ms | 52.4% bf16 MFU | 1625511 tok/s step 7235/19560 | loss 3.502806 (+0.34z)| norm 0.2886 (+0.14z)| lr 4.39e-04 | 322.53 ms | 52.3% bf16 MFU | 1625512 tok/s step 7236/19560 | loss 3.474487 (-0.23z)| norm 0.2803 (-0.11z)| lr 4.39e-04 | 322.64 ms | 52.3% bf16 MFU | 1625486 tok/s step 7237/19560 | loss 3.554421 (+1.35z)| norm 0.3053 (+0.62z)| lr 4.39e-04 | 322.67 ms | 52.3% bf16 MFU | 1625453 tok/s step 7238/19560 | loss 3.473180 (-0.26z)| norm 0.2761 (-0.25z)| lr 4.39e-04 | 322.27 ms | 52.4% bf16 MFU | 1625524 tok/s step 7239/19560 | loss 3.466214 (-0.40z)| norm 0.3016 (+0.50z)| lr 4.39e-04 | 323.19 ms | 52.2% bf16 MFU | 1625358 tok/s step 7240/19560 | loss 3.509569 (+0.45z)| norm 0.3052 (+0.60z)| lr 4.39e-04 | 322.51 ms | 52.3% bf16 MFU | 1625373 tok/s step 7241/19560 | loss 3.496273 (+0.18z)| norm 0.2809 (-0.14z)| lr 4.39e-04 | 322.53 ms | 52.3% bf16 MFU | 1625382 tok/s step 7242/19560 | loss 3.457455 (-0.59z)| norm 0.3021 (+0.49z)| lr 4.39e-04 | 322.74 ms | 52.3% bf16 MFU | 1625337 tok/s step 7243/19560 | loss 3.466427 (-0.40z)| norm 0.2995 (+0.40z)| lr 4.39e-04 | 322.28 ms | 52.4% bf16 MFU | 1625412 tok/s step 7244/19560 | loss 3.503286 (+0.34z)| norm 0.2703 (-0.49z)| lr 4.39e-04 | 322.76 ms | 52.3% bf16 MFU | 1625359 tok/s step 7245/19560 | loss 3.458557 (-0.55z)| norm 0.3093 (+0.69z)| lr 4.39e-04 | 322.48 ms | 52.3% bf16 MFU | 1625381 tok/s step 7246/19560 | loss 3.478098 (-0.13z)| norm 0.2673 (-0.58z)| lr 4.39e-04 | 322.31 ms | 52.4% bf16 MFU | 1625444 tok/s step 7247/19560 | loss 3.450899 (-0.70z)| norm 0.2923 (+0.18z)| lr 4.39e-04 | 322.61 ms | 52.3% bf16 MFU | 1625429 tok/s step 7248/19560 | loss 3.447001 (-0.79z)| norm 0.2799 (-0.18z)| lr 4.39e-04 | 322.56 ms | 52.3% bf16 MFU | 1625427 tok/s step 7249/19560 | loss 
3.530072 (+0.99z)| norm 0.2827 (-0.10z)| lr 4.39e-04 | 322.84 ms | 52.3% bf16 MFU | 1625353 tok/s step 7250/19560 | loss 3.501363 (+0.37z)| norm 0.2924 (+0.19z)| lr 4.39e-04 | 322.19 ms | 52.4% bf16 MFU | 1625449 tok/s
val loss 3.470035
evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating HellaSwag: 70/79
HellaSwag: 2831/10042 = 0.281916
step 7251/19560 | loss 3.441615 (-0.90z)| norm 0.3020 (+0.48z)| lr 4.38e-04 | 322.78 ms | 52.3% bf16 MFU | 1625391 tok/s step 7252/19560 | loss 3.462667 (-0.45z)| norm 0.2803 (-0.18z)| lr 4.38e-04 | 322.37 ms | 52.4% bf16 MFU | 1625438 tok/s step 7253/19560 | loss 3.520429 (+0.78z)| norm 0.2823 (-0.12z)| lr 4.38e-04 | 322.77 ms | 52.3% bf16 MFU | 1625383 tok/s step 7254/19560 | loss 3.515687 (+0.67z)| norm 0.2748 (-0.34z)| lr 4.38e-04 | 322.72 ms | 52.3% bf16 MFU | 1625345 tok/s step 7255/19560 | loss 3.496807 (+0.27z)| norm 0.2491 (-1.11z)| lr 4.38e-04 | 322.70 ms | 52.3% bf16 MFU | 1625311 tok/s step 7256/19560 | loss 3.490825 (+0.14z)| norm 0.2751 (-0.31z)| lr 4.38e-04 | 322.42 ms | 52.3% bf16 MFU | 1625350 tok/s step 7257/19560 | loss 3.393240 (-1.91z)| norm 0.2516 (-1.01z)| lr 4.38e-04 | 322.63 ms | 52.3% bf16 MFU | 1625335 tok/s step 7258/19560 | loss 3.481246 (-0.03z)| norm 0.2722 (-0.39z)| lr 4.38e-04 | 322.93 ms | 52.3% bf16 MFU | 1625245 tok/s step 7259/19560 | loss 3.475977 (-0.14z)| norm 0.2750 (-0.31z)| lr 4.38e-04 | 322.63 ms | 52.3% bf16 MFU | 1625236 tok/s step 7260/19560 | loss 3.476937 (-0.11z)| norm 0.2722 (-0.39z)| lr 4.38e-04 | 322.98 ms | 52.3% bf16 MFU | 1625139 tok/s step 7261/19560 | loss 3.491877 (+0.21z)| norm 0.2838 (-0.04z)| lr 4.38e-04 | 322.59 ms | 52.3% bf16 MFU | 1625144 tok/s step 7262/19560 | loss 3.476505 (-0.11z)| norm 0.3072 (+0.66z)| lr 4.38e-04 | 322.25 ms | 52.4% bf16 MFU | 1625235 tok/s step 7263/19560 | loss 3.423966 (-1.23z)| norm 0.2617 
(-0.71z)| lr 4.38e-04 | 323.52 ms | 52.2% bf16 MFU | 1625003 tok/s step 7264/19560 | loss 3.506929 (+0.55z)| norm 0.2686 (-0.50z)| lr 4.38e-04 | 323.39 ms | 52.2% bf16 MFU | 1624814 tok/s step 7265/19560 | loss 3.463127 (-0.38z)| norm 0.2921 (+0.20z)| lr 4.38e-04 | 322.73 ms | 52.3% bf16 MFU | 1624800 tok/s step 7266/19560 | loss 3.462936 (-0.39z)| norm 0.2652 (-0.61z)| lr 4.38e-04 | 322.48 ms | 52.3% bf16 MFU | 1624849 tok/s step 7267/19560 | loss 3.538099 (+1.21z)| norm 0.2762 (-0.28z)| lr 4.38e-04 | 323.10 ms | 52.2% bf16 MFU | 1624741 tok/s step 7268/19560 | loss 3.482383 (+0.01z)| norm 0.2931 (+0.23z)| lr 4.38e-04 | 323.15 ms | 52.2% bf16 MFU | 1624626 tok/s step 7269/19560 | loss 3.511497 (+0.63z)| norm 0.3087 (+0.69z)| lr 4.38e-04 | 322.28 ms | 52.4% bf16 MFU | 1624734 tok/s step 7270/19560 | loss 3.532793 (+1.07z)| norm 0.2977 (+0.34z)| lr 4.38e-04 | 323.16 ms | 52.2% bf16 MFU | 1624616 tok/s step 7271/19560 | loss 3.454285 (-0.60z)| norm 0.2638 (-0.68z)| lr 4.38e-04 | 322.52 ms | 52.3% bf16 MFU | 1624664 tok/s step 7272/19560 | loss 3.455448 (-0.56z)| norm 0.2964 (+0.31z)| lr 4.38e-04 | 322.57 ms | 52.3% bf16 MFU | 1624699 tok/s step 7273/19560 | loss 3.487471 (+0.16z)| norm 0.2530 (-1.00z)| lr 4.37e-04 | 323.18 ms | 52.2% bf16 MFU | 1624578 tok/s step 7274/19560 | loss 3.472332 (-0.17z)| norm 0.2789 (-0.21z)| lr 4.37e-04 | 322.09 ms | 52.4% bf16 MFU | 1624736 tok/s step 7275/19560 | loss 3.447038 (-0.73z)| norm 0.2837 (-0.06z)| lr 4.37e-04 | 322.72 ms | 52.3% bf16 MFU | 1624728 tok/s step 7276/19560 | loss 3.454100 (-0.57z)| norm 0.2835 (-0.06z)| lr 4.37e-04 | 323.10 ms | 52.2% bf16 MFU | 1624625 tok/s step 7277/19560 | loss 3.471077 (-0.18z)| norm 0.2813 (-0.13z)| lr 4.37e-04 | 322.72 ms | 52.3% bf16 MFU | 1624623 tok/s step 7278/19560 | loss 3.507125 (+0.62z)| norm 0.2750 (-0.32z)| lr 4.37e-04 | 323.08 ms | 52.2% bf16 MFU | 1624530 tok/s step 7279/19560 | loss 3.548743 (+1.53z)| norm 0.3322 (+1.39z)| lr 4.37e-04 | 322.66 ms | 52.3% bf16 MFU | 1624549 
tok/s step 7280/19560 | loss 3.509979 (+0.65z)| norm 0.3194 (+1.00z)| lr 4.37e-04 | 323.03 ms | 52.2% bf16 MFU | 1624474 tok/s step 7281/19560 | loss 3.449338 (-0.69z)| norm 0.3024 (+0.48z)| lr 4.37e-04 | 323.19 ms | 52.2% bf16 MFU | 1624363 tok/s step 7282/19560 | loss 3.414909 (-1.44z)| norm 0.2727 (-0.42z)| lr 4.37e-04 | 322.57 ms | 52.3% bf16 MFU | 1624412 tok/s step 7283/19560 | loss 3.445139 (-0.76z)| norm 0.3187 (+0.95z)| lr 4.37e-04 | 322.71 ms | 52.3% bf16 MFU | 1624424 tok/s step 7284/19560 | loss 3.497700 (+0.39z)| norm 0.3037 (+0.49z)| lr 4.37e-04 | 322.69 ms | 52.3% bf16 MFU | 1624439 tok/s step 7285/19560 | loss 3.518555 (+0.91z)| norm 0.2911 (+0.15z)| lr 4.37e-04 | 322.40 ms | 52.3% bf16 MFU | 1624526 tok/s step 7286/19560 | loss 3.461222 (-0.41z)| norm 0.3018 (+0.48z)| lr 4.37e-04 | 322.43 ms | 52.3% bf16 MFU | 1624604 tok/s step 7287/19560 | loss 3.445940 (-0.75z)| norm 0.2677 (-0.58z)| lr 4.37e-04 | 323.11 ms | 52.2% bf16 MFU | 1624504 tok/s step 7288/19560 | loss 3.450936 (-0.63z)| norm 0.2732 (-0.41z)| lr 4.37e-04 | 322.83 ms | 52.3% bf16 MFU | 1624481 tok/s step 7289/19560 | loss 3.480612 (+0.06z)| norm 0.2829 (-0.10z)| lr 4.37e-04 | 323.25 ms | 52.2% bf16 MFU | 1624353 tok/s step 7290/19560 | loss 3.473442 (-0.11z)| norm 0.2776 (-0.26z)| lr 4.37e-04 | 322.28 ms | 52.4% bf16 MFU | 1624476 tok/s step 7291/19560 | loss 3.702245 (+4.71z)| norm 0.3427 (+1.77z)| lr 4.37e-04 | 322.79 ms | 52.3% bf16 MFU | 1624463 tok/s step 7292/19560 | loss 3.506065 (+0.56z)| norm 0.3766 (+2.75z)| lr 4.37e-04 | 323.00 ms | 52.3% bf16 MFU | 1624400 tok/s step 7293/19560 | loss 3.503928 (+0.51z)| norm 0.2997 (+0.41z)| lr 4.37e-04 | 322.50 ms | 52.3% bf16 MFU | 1624465 tok/s step 7294/19560 | loss 3.473432 (-0.14z)| norm 0.2747 (-0.35z)| lr 4.37e-04 | 323.27 ms | 52.2% bf16 MFU | 1624333 tok/s step 7295/19560 | loss 3.460823 (-0.41z)| norm 0.2626 (-0.71z)| lr 4.37e-04 | 322.66 ms | 52.3% bf16 MFU | 1624360 tok/s step 7296/19560 | loss 3.492409 (+0.25z)| norm 0.2588 
(-0.82z)| lr 4.36e-04 | 323.49 ms | 52.2% bf16 MFU | 1624178 tok/s step 7297/19560 | loss 3.488590 (+0.16z)| norm 0.2587 (-0.82z)| lr 4.36e-04 | 322.54 ms | 52.3% bf16 MFU | 1624245 tok/s step 7298/19560 | loss 3.472330 (-0.18z)| norm 0.2592 (-0.79z)| lr 4.36e-04 | 322.60 ms | 52.3% bf16 MFU | 1624292 tok/s step 7299/19560 | loss 3.514491 (+0.72z)| norm 0.2692 (-0.49z)| lr 4.36e-04 | 322.92 ms | 52.3% bf16 MFU | 1624258 tok/s step 7300/19560 | loss 3.429751 (-1.08z)| norm 0.2757 (-0.29z)| lr 4.36e-04 | 322.96 ms | 52.3% bf16 MFU | 1624214 tok/s step 7301/19560 | loss 3.467528 (-0.28z)| norm 0.2670 (-0.55z)| lr 4.36e-04 | 322.88 ms | 52.3% bf16 MFU | 1624193 tok/s step 7302/19560 | loss 3.442820 (-0.80z)| norm 0.2748 (-0.31z)| lr 4.36e-04 | 323.18 ms | 52.2% bf16 MFU | 1624097 tok/s step 7303/19560 | loss 3.501840 (+0.45z)| norm 0.2815 (-0.11z)| lr 4.36e-04 | 323.05 ms | 52.2% bf16 MFU | 1624038 tok/s step 7304/19560 | loss 3.462310 (-0.40z)| norm 0.2580 (-0.82z)| lr 4.36e-04 | 322.63 ms | 52.3% bf16 MFU | 1624090 tok/s step 7305/19560 | loss 3.535858 (+1.16z)| norm 0.2938 (+0.26z)| lr 4.36e-04 | 323.24 ms | 52.2% bf16 MFU | 1623984 tok/s step 7306/19560 | loss 3.485526 (+0.09z)| norm 0.2746 (-0.33z)| lr 4.36e-04 | 322.20 ms | 52.4% bf16 MFU | 1624145 tok/s step 7307/19560 | loss 3.512468 (+0.66z)| norm 0.3199 (+1.04z)| lr 4.36e-04 | 322.91 ms | 52.3% bf16 MFU | 1624120 tok/s step 7308/19560 | loss 3.477933 (-0.08z)| norm 0.2531 (-0.98z)| lr 4.36e-04 | 323.22 ms | 52.2% bf16 MFU | 1624018 tok/s step 7309/19560 | loss 3.436656 (-0.96z)| norm 0.3533 (+2.02z)| lr 4.36e-04 | 322.95 ms | 52.3% bf16 MFU | 1623990 tok/s step 7310/19560 | loss 3.460954 (-0.42z)| norm 0.2813 (-0.14z)| lr 4.36e-04 | 322.46 ms | 52.3% bf16 MFU | 1624086 tok/s step 7311/19560 | loss 3.587653 (+2.31z)| norm 0.3181 (+0.95z)| lr 4.36e-04 | 322.23 ms | 52.4% bf16 MFU | 1624236 tok/s step 7312/19560 | loss 3.481982 (+0.03z)| norm 0.2918 (+0.16z)| lr 4.36e-04 | 322.53 ms | 52.3% bf16 MFU | 1624301 
tok/s step 7313/19560 | loss 3.504189 (+0.50z)| norm 0.2693 (-0.51z)| lr 4.36e-04 | 323.17 ms | 52.2% bf16 MFU | 1624203 tok/s step 7314/19560 | loss 3.470157 (-0.24z)| norm 0.2792 (-0.22z)| lr 4.36e-04 | 322.91 ms | 52.3% bf16 MFU | 1624174 tok/s step 7315/19560 | loss 3.481248 (+0.01z)| norm 0.2702 (-0.48z)| lr 4.36e-04 | 322.31 ms | 52.4% bf16 MFU | 1624298 tok/s step 7316/19560 | loss 3.497610 (+0.36z)| norm 0.2814 (-0.14z)| lr 4.36e-04 | 322.95 ms | 52.3% bf16 MFU | 1624254 tok/s step 7317/19560 | loss 3.474633 (-0.13z)| norm 0.2592 (-0.80z)| lr 4.36e-04 | 322.61 ms | 52.3% bf16 MFU | 1624300 tok/s step 7318/19560 | loss 3.491229 (+0.24z)| norm 0.2944 (+0.24z)| lr 4.35e-04 | 323.76 ms | 52.1% bf16 MFU | 1624053 tok/s step 7319/19560 | loss 3.470484 (-0.21z)| norm 0.2536 (-1.39z)| lr 4.35e-04 | 322.81 ms | 52.3% bf16 MFU | 1624058 tok/s step 7320/19560 | loss 3.472198 (-0.18z)| norm 0.2748 (-0.40z)| lr 4.35e-04 | 323.13 ms | 52.2% bf16 MFU | 1623981 tok/s step 7321/19560 | loss 3.516837 (+0.87z)| norm 0.2766 (-0.31z)| lr 4.35e-04 | 322.97 ms | 52.3% bf16 MFU | 1623950 tok/s step 7322/19560 | loss 3.556156 (+1.76z)| norm 0.2719 (-0.52z)| lr 4.35e-04 | 323.06 ms | 52.2% bf16 MFU | 1623895 tok/s step 7323/19560 | loss 3.568581 (+2.00z)| norm 0.2636 (-0.90z)| lr 4.35e-04 | 322.95 ms | 52.3% bf16 MFU | 1623871 tok/s step 7324/19560 | loss 3.531396 (+1.13z)| norm 0.2753 (-0.33z)| lr 4.35e-04 | 323.03 ms | 52.2% bf16 MFU | 1623829 tok/s step 7325/19560 | loss 3.471732 (-0.22z)| norm 0.2767 (-0.25z)| lr 4.35e-04 | 322.74 ms | 52.3% bf16 MFU | 1623863 tok/s step 7326/19560 | loss 3.434312 (-1.07z)| norm 0.2607 (-1.01z)| lr 4.35e-04 | 322.94 ms | 52.3% bf16 MFU | 1623845 tok/s step 7327/19560 | loss 3.452204 (-0.66z)| norm 0.2806 (-0.06z)| lr 4.35e-04 | 323.18 ms | 52.2% bf16 MFU | 1623767 tok/s step 7328/19560 | loss 3.506214 (+0.57z)| norm 0.2841 (+0.12z)| lr 4.35e-04 | 323.09 ms | 52.2% bf16 MFU | 1623714 tok/s step 7329/19560 | loss 3.459551 (-0.51z)| norm 0.2678 
(-0.66z)| lr 4.35e-04 | 322.73 ms | 52.3% bf16 MFU | 1623755 tok/s step 7330/19560 | loss 3.532681 (+1.17z)| norm 0.2597 (-1.04z)| lr 4.35e-04 | 322.69 ms | 52.3% bf16 MFU | 1623803 tok/s step 7331/19560 | loss 3.523112 (+0.94z)| norm 0.2789 (-0.12z)| lr 4.35e-04 | 322.47 ms | 52.3% bf16 MFU | 1623906 tok/s step 7332/19560 | loss 3.457210 (-0.59z)| norm 0.2793 (-0.10z)| lr 4.35e-04 | 322.98 ms | 52.3% bf16 MFU | 1623874 tok/s step 7333/19560 | loss 3.516031 (+0.76z)| norm 0.2634 (-0.87z)| lr 4.35e-04 | 322.92 ms | 52.3% bf16 MFU | 1623860 tok/s step 7334/19560 | loss 3.445814 (-0.85z)| norm 0.2585 (-1.09z)| lr 4.35e-04 | 322.75 ms | 52.3% bf16 MFU | 1623889 tok/s step 7335/19560 | loss 3.497120 (+0.35z)| norm 0.2996 (+0.88z)| lr 4.35e-04 | 322.62 ms | 52.3% bf16 MFU | 1623948 tok/s step 7336/19560 | loss 3.557849 (+1.76z)| norm 0.3764 (+4.20z)| lr 4.35e-04 | 322.79 ms | 52.3% bf16 MFU | 1623962 tok/s step 7337/19560 | loss 3.462911 (-0.47z)| norm 0.2996 (+0.79z)| lr 4.35e-04 | 322.73 ms | 52.3% bf16 MFU | 1623992 tok/s step 7338/19560 | loss 3.481480 (-0.03z)| norm 0.2670 (-0.67z)| lr 4.35e-04 | 322.91 ms | 52.3% bf16 MFU | 1623974 tok/s step 7339/19560 | loss 3.437548 (-1.06z)| norm 0.3129 (+1.36z)| lr 4.35e-04 | 322.47 ms | 52.3% bf16 MFU | 1624067 tok/s step 7340/19560 | loss 3.434525 (-1.13z)| norm 0.2469 (-1.55z)| lr 4.35e-04 | 322.76 ms | 52.3% bf16 MFU | 1624082 tok/s step 7341/19560 | loss 3.470320 (-0.29z)| norm 0.2857 (+0.16z)| lr 4.34e-04 | 322.83 ms | 52.3% bf16 MFU | 1624080 tok/s step 7342/19560 | loss 3.469817 (-0.33z)| norm 0.2764 (-0.26z)| lr 4.34e-04 | 322.29 ms | 52.4% bf16 MFU | 1624213 tok/s step 7343/19560 | loss 3.508824 (+0.62z)| norm 0.2771 (-0.23z)| lr 4.34e-04 | 322.73 ms | 52.3% bf16 MFU | 1624229 tok/s step 7344/19560 | loss 3.487556 (+0.12z)| norm 0.2992 (+0.73z)| lr 4.34e-04 | 322.76 ms | 52.3% bf16 MFU | 1624237 tok/s step 7345/19560 | loss 3.403887 (-1.94z)| norm 0.2966 (+0.61z)| lr 4.34e-04 | 322.28 ms | 52.4% bf16 MFU | 1624366 
tok/s step 7346/19560 | loss 3.584624 (+2.50z)| norm 0.2755 (-0.32z)| lr 4.34e-04 | 322.96 ms | 52.3% bf16 MFU | 1624316 tok/s step 7347/19560 | loss 3.467800 (-0.35z)| norm 0.2870 (+0.18z)| lr 4.34e-04 | 322.45 ms | 52.3% bf16 MFU | 1624398 tok/s step 7348/19560 | loss 3.461369 (-0.51z)| norm 0.2840 (+0.04z)| lr 4.34e-04 | 322.60 ms | 52.3% bf16 MFU | 1624439 tok/s step 7349/19560 | loss 3.521805 (+0.96z)| norm 0.2585 (-1.08z)| lr 4.34e-04 | 322.76 ms | 52.3% bf16 MFU | 1624436 tok/s step 7350/19560 | loss 3.489720 (+0.17z)| norm 0.2903 (+0.33z)| lr 4.34e-04 | 322.48 ms | 52.3% bf16 MFU | 1624504 tok/s step 7351/19560 | loss 3.478588 (-0.12z)| norm 0.2727 (-0.47z)| lr 4.34e-04 | 323.03 ms | 52.2% bf16 MFU | 1624429 tok/s step 7352/19560 | loss 3.525682 (+1.04z)| norm 0.2616 (-0.95z)| lr 4.34e-04 | 322.90 ms | 52.3% bf16 MFU | 1624392 tok/s step 7353/19560 | loss 3.474037 (-0.24z)| norm 0.2600 (-1.03z)| lr 4.34e-04 | 322.66 ms | 52.3% bf16 MFU | 1624417 tok/s step 7354/19560 | loss 3.484130 (+0.02z)| norm 0.2416 (-1.82z)| lr 4.34e-04 | 322.54 ms | 52.3% bf16 MFU | 1624472 tok/s step 7355/19560 | loss 3.479311 (-0.12z)| norm 0.2805 (-0.10z)| lr 4.34e-04 | 322.72 ms | 52.3% bf16 MFU | 1624477 tok/s step 7356/19560 | loss 3.546118 (+1.54z)| norm 0.2660 (-0.73z)| lr 4.34e-04 | 322.33 ms | 52.4% bf16 MFU | 1624582 tok/s step 7357/19560 | loss 3.485685 (+0.01z)| norm 0.2356 (-2.04z)| lr 4.34e-04 | 322.52 ms | 52.3% bf16 MFU | 1624632 tok/s step 7358/19560 | loss 3.517376 (+0.80z)| norm 0.2825 (+0.00z)| lr 4.34e-04 | 322.81 ms | 52.3% bf16 MFU | 1624608 tok/s step 7359/19560 | loss 3.450349 (-0.89z)| norm 0.2655 (-0.74z)| lr 4.34e-04 | 322.30 ms | 52.4% bf16 MFU | 1624712 tok/s step 7360/19560 | loss 3.558950 (+1.81z)| norm 0.2669 (-0.69z)| lr 4.34e-04 | 322.87 ms | 52.3% bf16 MFU | 1624668 tok/s step 7361/19560 | loss 3.526586 (+0.99z)| norm 0.2991 (+0.72z)| lr 4.34e-04 | 322.61 ms | 52.3% bf16 MFU | 1624691 tok/s step 7362/19560 | loss 3.454883 (-0.79z)| norm 0.2584 
(-1.05z)| lr 4.34e-04 | 322.28 ms | 52.4% bf16 MFU | 1624796 tok/s step 7363/19560 | loss 3.433077 (-1.31z)| norm 0.2439 (-1.65z)| lr 4.33e-04 | 322.18 ms | 52.4% bf16 MFU | 1624922 tok/s step 7364/19560 | loss 3.471557 (-0.36z)| norm 0.2658 (-0.69z)| lr 4.33e-04 | 322.68 ms | 52.3% bf16 MFU | 1624914 tok/s step 7365/19560 | loss 3.502150 (+0.41z)| norm 0.2967 (+0.65z)| lr 4.33e-04 | 322.88 ms | 52.3% bf16 MFU | 1624858 tok/s step 7366/19560 | loss 3.496036 (+0.25z)| norm 0.2756 (-0.27z)| lr 4.33e-04 | 322.54 ms | 52.3% bf16 MFU | 1624891 tok/s step 7367/19560 | loss 3.471843 (-0.35z)| norm 0.2697 (-0.51z)| lr 4.33e-04 | 322.78 ms | 52.3% bf16 MFU | 1624862 tok/s step 7368/19560 | loss 3.462815 (-0.57z)| norm 0.3163 (+1.50z)| lr 4.33e-04 | 322.50 ms | 52.3% bf16 MFU | 1624905 tok/s step 7369/19560 | loss 3.482403 (-0.08z)| norm 0.2667 (-0.64z)| lr 4.33e-04 | 322.03 ms | 52.4% bf16 MFU | 1625064 tok/s step 7370/19560 | loss 3.458779 (-0.67z)| norm 0.2443 (-1.58z)| lr 4.33e-04 | 322.64 ms | 52.3% bf16 MFU | 1625061 tok/s step 7371/19560 | loss 3.498421 (+0.31z)| norm 0.2742 (-0.28z)| lr 4.33e-04 | 322.52 ms | 52.3% bf16 MFU | 1625087 tok/s step 7372/19560 | loss 3.470583 (-0.38z)| norm 0.2466 (-1.45z)| lr 4.33e-04 | 322.70 ms | 52.3% bf16 MFU | 1625067 tok/s step 7373/19560 | loss 3.482553 (-0.08z)| norm 0.2644 (-0.68z)| lr 4.33e-04 | 322.90 ms | 52.3% bf16 MFU | 1624997 tok/s step 7374/19560 | loss 3.540284 (+1.34z)| norm 0.2586 (-0.92z)| lr 4.33e-04 | 322.18 ms | 52.4% bf16 MFU | 1625113 tok/s step 7375/19560 | loss 3.470260 (-0.40z)| norm 0.2626 (-0.74z)| lr 4.33e-04 | 322.53 ms | 52.3% bf16 MFU | 1625134 tok/s step 7376/19560 | loss 3.440438 (-1.14z)| norm 0.2653 (-0.62z)| lr 4.33e-04 | 322.66 ms | 52.3% bf16 MFU | 1625123 tok/s step 7377/19560 | loss 3.410913 (-1.84z)| norm 0.2797 (-0.00z)| lr 4.33e-04 | 322.69 ms | 52.3% bf16 MFU | 1625105 tok/s step 7378/19560 | loss 3.448683 (-0.89z)| norm 0.2651 (-0.62z)| lr 4.33e-04 | 322.07 ms | 52.4% bf16 MFU | 1625244 
tok/s step 7379/19560 | loss 3.528054 (+1.04z)| norm 0.2774 (-0.09z)| lr 4.33e-04 | 322.27 ms | 52.4% bf16 MFU | 1625324 tok/s step 7380/19560 | loss 3.459162 (-0.65z)| norm 0.2693 (-0.43z)| lr 4.33e-04 | 322.47 ms | 52.3% bf16 MFU | 1625350 tok/s step 7381/19560 | loss 3.538990 (+1.30z)| norm 0.2855 (+0.26z)| lr 4.33e-04 | 322.21 ms | 52.4% bf16 MFU | 1625441 tok/s step 7382/19560 | loss 3.511961 (+0.64z)| norm 0.2561 (-0.99z)| lr 4.33e-04 | 322.54 ms | 52.3% bf16 MFU | 1625445 tok/s step 7383/19560 | loss 3.484202 (-0.04z)| norm 0.2552 (-1.03z)| lr 4.33e-04 | 322.39 ms | 52.4% bf16 MFU | 1625486 tok/s step 7384/19560 | loss 3.440481 (-1.09z)| norm 0.2611 (-0.77z)| lr 4.33e-04 | 322.64 ms | 52.3% bf16 MFU | 1625462 tok/s step 7385/19560 | loss 3.492107 (+0.15z)| norm 0.2650 (-0.61z)| lr 4.32e-04 | 322.53 ms | 52.3% bf16 MFU | 1625465 tok/s step 7386/19560 | loss 3.477168 (-0.22z)| norm 0.2830 (+0.16z)| lr 4.32e-04 | 322.69 ms | 52.3% bf16 MFU | 1625428 tok/s step 7387/19560 | loss 3.456424 (-0.73z)| norm 0.2714 (-0.34z)| lr 4.32e-04 | 322.30 ms | 52.4% bf16 MFU | 1625492 tok/s step 7388/19560 | loss 3.450174 (-0.88z)| norm 0.2599 (-0.83z)| lr 4.32e-04 | 322.64 ms | 52.3% bf16 MFU | 1625468 tok/s step 7389/19560 | loss 3.445964 (-0.97z)| norm 0.2755 (-0.16z)| lr 4.32e-04 | 322.34 ms | 52.4% bf16 MFU | 1625520 tok/s step 7390/19560 | loss 3.458999 (-0.65z)| norm 0.2471 (-1.35z)| lr 4.32e-04 | 322.28 ms | 52.4% bf16 MFU | 1625584 tok/s step 7391/19560 | loss 3.472490 (-0.32z)| norm 0.2771 (-0.07z)| lr 4.32e-04 | 322.74 ms | 52.3% bf16 MFU | 1625529 tok/s step 7392/19560 | loss 3.446589 (-0.96z)| norm 0.2584 (-0.87z)| lr 4.32e-04 | 322.63 ms | 52.3% bf16 MFU | 1625504 tok/s step 7393/19560 | loss 3.474637 (-0.26z)| norm 0.2736 (-0.21z)| lr 4.32e-04 | 322.70 ms | 52.3% bf16 MFU | 1625464 tok/s step 7394/19560 | loss 3.428958 (-1.38z)| norm 0.2745 (-0.18z)| lr 4.32e-04 | 322.31 ms | 52.4% bf16 MFU | 1625523 tok/s step 7395/19560 | loss 3.503986 (+0.48z)| norm 0.2787 
(+0.00z)| lr 4.32e-04 | 322.65 ms | 52.3% bf16 MFU | 1625493 tok/s step 7396/19560 | loss 3.503420 (+0.46z)| norm 0.2919 (+0.57z)| lr 4.32e-04 | 322.50 ms | 52.3% bf16 MFU | 1625504 tok/s step 7397/19560 | loss 3.518158 (+0.83z)| norm 0.3065 (+1.19z)| lr 4.32e-04 | 323.03 ms | 52.2% bf16 MFU | 1625380 tok/s step 7398/19560 | loss 3.569457 (+2.07z)| norm 0.3180 (+1.66z)| lr 4.32e-04 | 322.61 ms | 52.3% bf16 MFU | 1625367 tok/s step 7399/19560 | loss 3.460880 (-0.60z)| norm 0.3436 (+2.65z)| lr 4.32e-04 | 322.35 ms | 52.4% bf16 MFU | 1625422 tok/s step 7400/19560 | loss 3.486076 (+0.02z)| norm 0.3094 (+1.23z)| lr 4.32e-04 | 322.49 ms | 52.3% bf16 MFU | 1625439 tok/s step 7401/19560 | loss 3.468163 (-0.42z)| norm 0.2847 (+0.20z)| lr 4.32e-04 | 322.50 ms | 52.3% bf16 MFU | 1625453 tok/s step 7402/19560 | loss 3.414362 (-1.71z)| norm 0.2639 (-0.65z)| lr 4.32e-04 | 322.73 ms | 52.3% bf16 MFU | 1625406 tok/s step 7403/19560 | loss 3.454083 (-0.75z)| norm 0.3133 (+1.37z)| lr 4.32e-04 | 322.38 ms | 52.4% bf16 MFU | 1625452 tok/s step 7404/19560 | loss 3.457326 (-0.67z)| norm 0.2731 (-0.28z)| lr 4.32e-04 | 322.98 ms | 52.3% bf16 MFU | 1625344 tok/s step 7405/19560 | loss 3.496888 (+0.29z)| norm 0.3079 (+1.14z)| lr 4.32e-04 | 322.23 ms | 52.4% bf16 MFU | 1625428 tok/s step 7406/19560 | loss 3.433553 (-1.23z)| norm 0.2974 (+0.70z)| lr 4.32e-04 | 322.54 ms | 52.3% bf16 MFU | 1625432 tok/s step 7407/19560 | loss 3.544193 (+1.45z)| norm 0.2970 (+0.71z)| lr 4.32e-04 | 322.77 ms | 52.3% bf16 MFU | 1625376 tok/s step 7408/19560 | loss 3.476407 (-0.19z)| norm 0.2926 (+0.54z)| lr 4.31e-04 | 322.58 ms | 52.3% bf16 MFU | 1625373 tok/s step 7409/19560 | loss 3.576971 (+2.19z)| norm 0.3125 (+1.36z)| lr 4.31e-04 | 322.30 ms | 52.4% bf16 MFU | 1625439 tok/s step 7410/19560 | loss 3.470149 (-0.37z)| norm 0.2693 (-0.43z)| lr 4.31e-04 | 322.41 ms | 52.3% bf16 MFU | 1625474 tok/s step 7411/19560 | loss 3.463865 (-0.53z)| norm 0.2721 (-0.30z)| lr 4.31e-04 | 322.65 ms | 52.3% bf16 MFU | 1625446 
tok/s step 7412/19560 | loss 3.535124 (+1.18z)| norm 0.2771 (-0.09z)| lr 4.31e-04 | 322.36 ms | 52.4% bf16 MFU | 1625493 tok/s step 7413/19560 | loss 3.512844 (+0.65z)| norm 0.3238 (+1.84z)| lr 4.31e-04 | 322.74 ms | 52.3% bf16 MFU | 1625443 tok/s step 7414/19560 | loss 3.490559 (+0.10z)| norm 0.3159 (+1.50z)| lr 4.31e-04 | 322.34 ms | 52.4% bf16 MFU | 1625498 tok/s step 7415/19560 | loss 3.527061 (+0.97z)| norm 0.2655 (-0.58z)| lr 4.31e-04 | 322.85 ms | 52.3% bf16 MFU | 1625419 tok/s step 7416/19560 | loss 3.439564 (-1.14z)| norm 0.3039 (+0.99z)| lr 4.31e-04 | 322.32 ms | 52.4% bf16 MFU | 1625478 tok/s step 7417/19560 | loss 3.538588 (+1.23z)| norm 0.2727 (-0.29z)| lr 4.31e-04 | 322.53 ms | 52.3% bf16 MFU | 1625482 tok/s step 7418/19560 | loss 3.470183 (-0.41z)| norm 0.2937 (+0.57z)| lr 4.31e-04 | 322.73 ms | 52.3% bf16 MFU | 1625434 tok/s step 7419/19560 | loss 3.441031 (-1.18z)| norm 0.2677 (-0.49z)| lr 4.31e-04 | 322.68 ms | 52.3% bf16 MFU | 1625403 tok/s step 7420/19560 | loss 3.501100 (+0.43z)| norm 0.2851 (+0.30z)| lr 4.31e-04 | 322.42 ms | 52.3% bf16 MFU | 1625437 tok/s step 7421/19560 | loss 3.446522 (-1.02z)| norm 0.2660 (-0.56z)| lr 4.31e-04 | 322.61 ms | 52.3% bf16 MFU | 1625423 tok/s step 7422/19560 | loss 3.454422 (-0.80z)| norm 0.3118 (+1.50z)| lr 4.31e-04 | 322.36 ms | 52.4% bf16 MFU | 1625472 tok/s step 7423/19560 | loss 3.500783 (+0.42z)| norm 0.2883 (+0.43z)| lr 4.31e-04 | 322.82 ms | 52.3% bf16 MFU | 1625402 tok/s step 7424/19560 | loss 3.476756 (-0.21z)| norm 0.2843 (+0.24z)| lr 4.31e-04 | 322.81 ms | 52.3% bf16 MFU | 1625338 tok/s step 7425/19560 | loss 3.565748 (+2.10z)| norm 0.3064 (+1.22z)| lr 4.31e-04 | 322.47 ms | 52.3% bf16 MFU | 1625365 tok/s step 7426/19560 | loss 3.385681 (-2.53z)| norm 0.2623 (-0.77z)| lr 4.31e-04 | 322.56 ms | 52.3% bf16 MFU | 1625368 tok/s step 7427/19560 | loss 3.500276 (+0.40z)| norm 0.2733 (-0.28z)| lr 4.31e-04 | 322.52 ms | 52.3% bf16 MFU | 1625378 tok/s step 7428/19560 | loss 3.483820 (-0.03z)| norm 0.2780 
(-0.06z)| lr 4.31e-04 | 323.00 ms | 52.3% bf16 MFU | 1625269 tok/s step 7429/19560 | loss 3.472501 (-0.32z)| norm 0.2514 (-1.25z)| lr 4.31e-04 | 322.50 ms | 52.3% bf16 MFU | 1625289 tok/s step 7430/19560 | loss 3.494024 (+0.22z)| norm 0.2652 (-0.63z)| lr 4.30e-04 | 322.72 ms | 52.3% bf16 MFU | 1625254 tok/s step 7431/19560 | loss 3.427742 (-1.47z)| norm 0.2636 (-0.69z)| lr 4.30e-04 | 322.83 ms | 52.3% bf16 MFU | 1625194 tok/s step 7432/19560 | loss 3.503459 (+0.47z)| norm 0.2582 (-0.94z)| lr 4.30e-04 | 322.51 ms | 52.3% bf16 MFU | 1625216 tok/s step 7433/19560 | loss 3.471519 (-0.34z)| norm 0.2579 (-0.93z)| lr 4.30e-04 | 322.85 ms | 52.3% bf16 MFU | 1625152 tok/s step 7434/19560 | loss 3.468566 (-0.41z)| norm 0.2604 (-0.81z)| lr 4.30e-04 | 322.85 ms | 52.3% bf16 MFU | 1625091 tok/s step 7435/19560 | loss 3.503077 (+0.48z)| norm 0.2714 (-0.31z)| lr 4.30e-04 | 322.54 ms | 52.3% bf16 MFU | 1625112 tok/s step 7436/19560 | loss 3.730028 (+5.50z)| norm 1.3569 (+10.98z)| lr 4.30e-04 | 322.34 ms | 52.4% bf16 MFU | 1625181 tok/s step 7437/19560 | loss 3.486502 (-0.01z)| norm 0.3937 (+1.09z)| lr 4.30e-04 | 322.79 ms | 52.3% bf16 MFU | 1625134 tok/s step 7438/19560 | loss 3.582431 (+2.11z)| norm 0.3232 (+0.36z)| lr 4.30e-04 | 322.78 ms | 52.3% bf16 MFU | 1625092 tok/s step 7439/19560 | loss 3.551729 (+1.45z)| norm 0.3391 (+0.52z)| lr 4.30e-04 | 322.89 ms | 52.3% bf16 MFU | 1625026 tok/s step 7440/19560 | loss 3.403364 (-1.86z)| norm 0.3027 (+0.15z)| lr 4.30e-04 | 322.51 ms | 52.3% bf16 MFU | 1625056 tok/s step 7441/19560 | loss 3.473424 (-0.29z)| norm 0.3042 (+0.17z)| lr 4.30e-04 | 322.57 ms | 52.3% bf16 MFU | 1625070 tok/s step 7442/19560 | loss 3.472989 (-0.31z)| norm 0.2957 (+0.08z)| lr 4.30e-04 | 323.14 ms | 52.2% bf16 MFU | 1624940 tok/s step 7443/19560 | loss 3.514832 (+0.62z)| norm 0.3277 (+0.40z)| lr 4.30e-04 | 322.24 ms | 52.4% bf16 MFU | 1625044 tok/s step 7444/19560 | loss 3.491895 (+0.11z)| norm 0.3154 (+0.27z)| lr 4.30e-04 | 322.93 ms | 52.3% bf16 MFU | 1624970 
tok/s step 7445/19560 | loss 3.458344 (-0.63z)| norm 0.3003 (+0.11z)| lr 4.30e-04 | 322.29 ms | 52.4% bf16 MFU | 1625059 tok/s step 7446/19560 | loss 3.520060 (+0.73z)| norm 0.3016 (+0.13z)| lr 4.30e-04 | 322.61 ms | 52.3% bf16 MFU | 1625064 tok/s step 7447/19560 | loss 3.516748 (+0.65z)| norm 0.3141 (+0.25z)| lr 4.30e-04 | 322.49 ms | 52.3% bf16 MFU | 1625098 tok/s step 7448/19560 | loss 3.467539 (-0.44z)| norm 0.2813 (-0.09z)| lr 4.30e-04 | 322.97 ms | 52.3% bf16 MFU | 1625010 tok/s step 7449/19560 | loss 3.517983 (+0.68z)| norm 0.2749 (-0.15z)| lr 4.30e-04 | 322.61 ms | 52.3% bf16 MFU | 1625017 tok/s step 7450/19560 | loss 3.459615 (-0.60z)| norm 0.2684 (-0.22z)| lr 4.30e-04 | 322.26 ms | 52.4% bf16 MFU | 1625112 tok/s step 7451/19560 | loss 3.464949 (-0.47z)| norm 0.2723 (-0.18z)| lr 4.30e-04 | 322.75 ms | 52.3% bf16 MFU | 1625077 tok/s step 7452/19560 | loss 3.591568 (+2.34z)| norm 0.2463 (-0.44z)| lr 4.29e-04 | 322.34 ms | 52.4% bf16 MFU | 1625148 tok/s step 7453/19560 | loss 3.453253 (-0.73z)| norm 0.2840 (-0.06z)| lr 4.29e-04 | 322.57 ms | 52.3% bf16 MFU | 1625158 tok/s step 7454/19560 | loss 3.494510 (+0.18z)| norm 0.2850 (-0.05z)| lr 4.29e-04 | 322.82 ms | 52.3% bf16 MFU | 1625106 tok/s step 7455/19560 | loss 3.487755 (+0.02z)| norm 0.2785 (-0.12z)| lr 4.29e-04 | 323.03 ms | 52.2% bf16 MFU | 1625002 tok/s step 7456/19560 | loss 3.513123 (+0.59z)| norm 0.2761 (-0.14z)| lr 4.29e-04 | 322.87 ms | 52.3% bf16 MFU | 1624943 tok/s step 7457/19560 | loss 3.558356 (+1.56z)| norm 0.2739 (-0.16z)| lr 4.29e-04 | 322.85 ms | 52.3% bf16 MFU | 1624894 tok/s step 7458/19560 | loss 3.485378 (-0.04z)| norm 0.2548 (-0.36z)| lr 4.29e-04 | 322.74 ms | 52.3% bf16 MFU | 1624873 tok/s step 7459/19560 | loss 3.405048 (-1.79z)| norm 0.2647 (-0.26z)| lr 4.29e-04 | 322.53 ms | 52.3% bf16 MFU | 1624907 tok/s step 7460/19560 | loss 3.463202 (-0.51z)| norm 0.2776 (-0.12z)| lr 4.29e-04 | 322.66 ms | 52.3% bf16 MFU | 1624906 tok/s step 7461/19560 | loss 3.526364 (+0.88z)| norm 0.2571 
(-0.33z)| lr 4.29e-04 | 322.71 ms | 52.3% bf16 MFU | 1624893 tok/s step 7462/19560 | loss 3.487513 (+0.02z)| norm 0.2691 (-0.21z)| lr 4.29e-04 | 322.77 ms | 52.3% bf16 MFU | 1624865 tok/s step 7463/19560 | loss 3.494219 (+0.16z)| norm 0.2507 (-0.39z)| lr 4.29e-04 | 322.94 ms | 52.3% bf16 MFU | 1624797 tok/s step 7464/19560 | loss 3.411032 (-1.64z)| norm 0.2619 (-0.27z)| lr 4.29e-04 | 322.72 ms | 52.3% bf16 MFU | 1624787 tok/s step 7465/19560 | loss 3.505459 (+0.43z)| norm 1.7556 (+8.99z)| lr 4.29e-04 | 322.64 ms | 52.3% bf16 MFU | 1624798 tok/s step 7466/19560 | loss 3.468656 (-0.38z)| norm 0.3323 (+0.20z)| lr 4.29e-04 | 322.56 ms | 52.3% bf16 MFU | 1624828 tok/s step 7467/19560 | loss 3.467438 (-0.41z)| norm 0.3020 (+0.01z)| lr 4.29e-04 | 322.83 ms | 52.3% bf16 MFU | 1624790 tok/s step 7468/19560 | loss 3.473934 (-0.28z)| norm 0.3157 (+0.09z)| lr 4.29e-04 | 323.00 ms | 52.3% bf16 MFU | 1624710 tok/s step 7469/19560 | loss 3.572142 (+1.86z)| norm 0.3146 (+0.08z)| lr 4.29e-04 | 323.11 ms | 52.2% bf16 MFU | 1624607 tok/s step 7470/19560 | loss 3.467366 (-0.43z)| norm 0.2809 (-0.12z)| lr 4.29e-04 | 322.52 ms | 52.3% bf16 MFU | 1624658 tok/s step 7471/19560 | loss 3.525319 (+0.83z)| norm 0.3189 (+0.11z)| lr 4.29e-04 | 322.76 ms | 52.3% bf16 MFU | 1624644 tok/s step 7472/19560 | loss 3.453495 (-0.73z)| norm 0.2458 (-0.34z)| lr 4.29e-04 | 323.22 ms | 52.2% bf16 MFU | 1624515 tok/s step 7473/19560 | loss 3.503307 (+0.34z)| norm 0.2993 (-0.01z)| lr 4.29e-04 | 322.87 ms | 52.3% bf16 MFU | 1624480 tok/s step 7474/19560 | loss 3.441595 (-1.01z)| norm 0.2806 (-0.13z)| lr 4.28e-04 | 322.70 ms | 52.3% bf16 MFU | 1624491 tok/s step 7475/19560 | loss 3.468525 (-0.41z)| norm 0.2608 (-0.25z)| lr 4.28e-04 | 322.71 ms | 52.3% bf16 MFU | 1624499 tok/s step 7476/19560 | loss 3.467577 (-0.43z)| norm 0.2668 (-0.21z)| lr 4.28e-04 | 323.25 ms | 52.2% bf16 MFU | 1624369 tok/s step 7477/19560 | loss 3.503539 (+0.38z)| norm 0.2518 (-0.30z)| lr 4.28e-04 | 322.33 ms | 52.4% bf16 MFU | 1624480 
tok/s step 7478/19560 | loss 3.461555 (-0.56z)| norm 0.2599 (-0.25z)| lr 4.28e-04 | 322.93 ms | 52.3% bf16 MFU | 1624432 tok/s step 7479/19560 | loss 3.501419 (+0.33z)| norm 0.2670 (-0.21z)| lr 4.28e-04 | 322.42 ms | 52.3% bf16 MFU | 1624516 tok/s step 7480/19560 | loss 3.432065 (-1.20z)| norm 0.2679 (-0.20z)| lr 4.28e-04 | 322.97 ms | 52.3% bf16 MFU | 1624456 tok/s step 7481/19560 | loss 3.446584 (-0.87z)| norm 0.3044 (+0.02z)| lr 4.28e-04 | 323.47 ms | 52.2% bf16 MFU | 1624275 tok/s step 7482/19560 | loss 3.480419 (-0.12z)| norm 0.3006 (-0.00z)| lr 4.28e-04 | 322.59 ms | 52.3% bf16 MFU | 1624325 tok/s step 7483/19560 | loss 3.411740 (-1.62z)| norm 0.2779 (-0.14z)| lr 4.28e-04 | 323.08 ms | 52.2% bf16 MFU | 1624246 tok/s step 7484/19560 | loss 3.541507 (+1.25z)| norm 0.2977 (-0.02z)| lr 4.28e-04 | 322.29 ms | 52.4% bf16 MFU | 1624371 tok/s step 7485/19560 | loss 3.439955 (-0.99z)| norm 0.2607 (-0.25z)| lr 4.28e-04 | 323.05 ms | 52.2% bf16 MFU | 1624298 tok/s step 7486/19560 | loss 3.455677 (-0.63z)| norm 0.2978 (-0.02z)| lr 4.28e-04 | 323.49 ms | 52.2% bf16 MFU | 1624120 tok/s step 7487/19560 | loss 3.465180 (-0.42z)| norm 0.2963 (-0.04z)| lr 4.28e-04 | 322.79 ms | 52.3% bf16 MFU | 1624125 tok/s step 7488/19560 | loss 3.459656 (-0.53z)| norm 0.2641 (-0.23z)| lr 4.28e-04 | 322.53 ms | 52.3% bf16 MFU | 1624196 tok/s step 7489/19560 | loss 3.526182 (+0.95z)| norm 0.2957 (-0.04z)| lr 4.28e-04 | 323.52 ms | 52.2% bf16 MFU | 1624014 tok/s step 7490/19560 | loss 3.398221 (-1.87z)| norm 0.2779 (-0.15z)| lr 4.28e-04 | 322.99 ms | 52.3% bf16 MFU | 1623976 tok/s step 7491/19560 | loss 3.516058 (+0.71z)| norm 0.2572 (-0.28z)| lr 4.28e-04 | 322.73 ms | 52.3% bf16 MFU | 1624005 tok/s step 7492/19560 | loss 3.466509 (-0.38z)| norm 0.2781 (-0.15z)| lr 4.28e-04 | 322.81 ms | 52.3% bf16 MFU | 1624011 tok/s step 7493/19560 | loss 3.513237 (+0.65z)| norm 0.2550 (-0.29z)| lr 4.28e-04 | 322.83 ms | 52.3% bf16 MFU | 1624013 tok/s step 7494/19560 | loss 3.472831 (-0.24z)| norm 0.2761 
(-0.16z)| lr 4.28e-04 | 322.75 ms | 52.3% bf16 MFU | 1624033 tok/s step 7495/19560 | loss 3.444532 (-0.86z)| norm 0.2692 (-0.20z)| lr 4.28e-04 | 322.49 ms | 52.3% bf16 MFU | 1624120 tok/s step 7496/19560 | loss 3.424088 (-1.29z)| norm 0.2539 (-0.29z)| lr 4.27e-04 | 323.14 ms | 52.2% bf16 MFU | 1624038 tok/s step 7497/19560 | loss 3.452343 (-0.67z)| norm 0.2864 (-0.09z)| lr 4.27e-04 | 323.20 ms | 52.2% bf16 MFU | 1623944 tok/s step 7498/19560 | loss 3.469438 (-0.30z)| norm 0.2612 (-0.25z)| lr 4.27e-04 | 322.89 ms | 52.3% bf16 MFU | 1623933 tok/s step 7499/19560 | loss 3.416064 (-1.44z)| norm 0.2453 (-0.35z)| lr 4.27e-04 | 322.74 ms | 52.3% bf16 MFU | 1623960 tok/s step 7500/19560 | loss 3.509397 (+0.58z)| norm 0.2625 (-0.24z)| lr 4.27e-04 | 322.76 ms | 52.3% bf16 MFU | 1623981 tok/s
val loss 3.460540 HellaSwag: 2818/10042 = 0.280621
step 7501/19560 | loss 3.488135 (+0.12z)| norm 0.2576 (-0.27z)| lr 4.27e-04 | 321.61 ms | 52.5% bf16 MFU | 1624291 tok/s step 7502/19560 | loss 3.475910 (-0.14z)| norm 0.2541 (-0.29z)| lr 4.27e-04 | 322.72 ms | 52.3% bf16 MFU | 1624306 tok/s step 7503/19560 | loss 3.495042 (+0.28z)| norm 0.2683 (-0.21z)| lr 4.27e-04 | 323.15 ms | 52.2% bf16 MFU | 1624213 tok/s step 7504/19560 | loss 3.504112 (+0.46z)| norm 0.2820 (-0.12z)| lr 4.27e-04 | 322.48 ms | 52.3% bf16 MFU | 1624292 tok/s step 7505/19560 | loss 3.503468 (+0.44z)| norm 0.2822 (-0.12z)| lr 4.27e-04 | 322.15 ms | 52.4% bf16 MFU | 1624451 tok/s step 7506/19560 | loss 3.452683 (-0.68z)| norm 0.2665 (-0.22z)| lr 4.27e-04 | 322.65 ms | 52.3% bf16 MFU | 1624476 tok/s step 7507/19560 | loss 3.515402 (+0.70z)| norm 0.3000 (-0.01z)| lr 4.27e-04 | 322.73 ms | 52.3% bf16 MFU | 1624480 tok/s step 7508/19560 | loss 3.452843 (-0.68z)| norm 0.2670 (-0.22z)| lr 4.27e-04 | 322.70 ms | 52.3% bf16 MFU | 1624491 tok/s step 7509/19560 | loss 3.481057 (-0.04z)| norm 0.2867 (-0.09z)| lr 4.27e-04 | 323.13 ms | 52.2% bf16 MFU | 1624393 tok/s step 7510/19560 | loss 3.541069 (+1.28z)| norm 0.2431 (-0.36z)| lr 4.27e-04 | 323.22 ms | 52.2% bf16 MFU | 1624278 tok/s step 7511/19560 | loss 3.474930 (-0.18z)| norm 0.3069 (+0.03z)| lr 4.27e-04 | 322.57 ms | 52.3% bf16 MFU | 1624331 tok/s step 7512/19560 | loss 3.427350 (-1.23z)| norm 0.2569 (-0.28z)| lr 4.27e-04 | 322.53 ms | 52.3% bf16 MFU | 1624392 tok/s step 7513/19560 | loss 3.449711 (-0.73z)| norm 0.2866 (-0.10z)| lr 4.27e-04 | 323.08 ms | 52.2% bf16 MFU | 1624312 tok/s step 7514/19560 | loss 3.461193 (-0.47z)| norm 0.3290 (+0.16z)| lr 4.27e-04 | 323.08 ms | 52.2% bf16 MFU | 1624234 tok/s step 7515/19560 | loss 3.466264 (-0.36z)| norm 
0.2907 (-0.08z)| lr 4.27e-04 | 322.86 ms | 52.3% bf16 MFU | 1624218 tok/s step 7516/19560 | loss 3.491520 (+0.19z)| norm 0.2603 (-0.26z)| lr 4.27e-04 | 323.04 ms | 52.2% bf16 MFU | 1624157 tok/s step 7517/19560 | loss 3.439269 (-0.96z)| norm 0.2880 (-0.09z)| lr 4.27e-04 | 322.74 ms | 52.3% bf16 MFU | 1624173 tok/s step 7518/19560 | loss 3.524270 (+0.90z)| norm 0.2886 (-0.09z)| lr 4.26e-04 | 322.76 ms | 52.3% bf16 MFU | 1624185 tok/s step 7519/19560 | loss 3.524688 (+0.89z)| norm 0.2929 (-0.07z)| lr 4.26e-04 | 322.44 ms | 52.3% bf16 MFU | 1624274 tok/s step 7520/19560 | loss 3.545145 (+1.32z)| norm 0.2676 (-0.22z)| lr 4.26e-04 | 322.89 ms | 52.3% bf16 MFU | 1624248 tok/s step 7521/19560 | loss 3.506743 (+0.48z)| norm 0.2667 (-0.23z)| lr 4.26e-04 | 322.58 ms | 52.3% bf16 MFU | 1624300 tok/s step 7522/19560 | loss 3.471605 (-0.30z)| norm 0.2770 (-0.16z)| lr 4.26e-04 | 322.60 ms | 52.3% bf16 MFU | 1624345 tok/s step 7523/19560 | loss 3.519191 (+0.74z)| norm 0.2679 (-0.22z)| lr 4.26e-04 | 322.71 ms | 52.3% bf16 MFU | 1624361 tok/s step 7524/19560 | loss 3.500424 (+0.33z)| norm 0.2538 (-0.30z)| lr 4.26e-04 | 322.83 ms | 52.3% bf16 MFU | 1624344 tok/s step 7525/19560 | loss 3.479339 (-0.12z)| norm 0.2730 (-0.18z)| lr 4.26e-04 | 323.16 ms | 52.2% bf16 MFU | 1624247 tok/s step 7526/19560 | loss 3.482734 (-0.04z)| norm 0.2631 (-0.24z)| lr 4.26e-04 | 322.86 ms | 52.3% bf16 MFU | 1624230 tok/s step 7527/19560 | loss 3.420710 (-1.40z)| norm 0.2395 (-0.38z)| lr 4.26e-04 | 322.27 ms | 52.4% bf16 MFU | 1624360 tok/s step 7528/19560 | loss 3.470795 (-0.29z)| norm 0.2597 (-0.26z)| lr 4.26e-04 | 322.85 ms | 52.3% bf16 MFU | 1624338 tok/s step 7529/19560 | loss 3.463962 (-0.44z)| norm 0.2525 (-0.30z)| lr 4.26e-04 | 322.37 ms | 52.4% bf16 MFU | 1624438 tok/s step 7530/19560 | loss 3.473017 (-0.25z)| norm 0.2632 (-0.23z)| lr 4.26e-04 | 323.22 ms | 52.2% bf16 MFU | 1624320 tok/s step 7531/19560 | loss 3.444314 (-0.89z)| norm 0.2598 (-0.25z)| lr 4.26e-04 | 322.19 ms | 52.4% bf16 MFU | 
1624467 tok/s step 7532/19560 | loss 3.463984 (-0.45z)| norm 0.2537 (-0.29z)| lr 4.26e-04 | 322.53 ms | 52.3% bf16 MFU | 1624521 tok/s step 7533/19560 | loss 3.486531 (+0.05z)| norm 0.2691 (-0.19z)| lr 4.26e-04 | 322.42 ms | 52.3% bf16 MFU | 1624600 tok/s step 7534/19560 | loss 3.478544 (-0.13z)| norm 0.2743 (-0.16z)| lr 4.26e-04 | 322.51 ms | 52.3% bf16 MFU | 1624652 tok/s step 7535/19560 | loss 3.485429 (+0.03z)| norm 0.2543 (-0.28z)| lr 4.26e-04 | 322.64 ms | 52.3% bf16 MFU | 1624669 tok/s step 7536/19560 | loss 3.509858 (+0.57z)| norm 0.2771 (-0.14z)| lr 4.26e-04 | 323.47 ms | 52.2% bf16 MFU | 1624476 tok/s step 7537/19560 | loss 3.499628 (+0.36z)| norm 0.2558 (-0.27z)| lr 4.26e-04 | 322.70 ms | 52.3% bf16 MFU | 1624486 tok/s step 7538/19560 | loss 3.449165 (-0.79z)| norm 0.2571 (-0.26z)| lr 4.26e-04 | 322.65 ms | 52.3% bf16 MFU | 1624510 tok/s step 7539/19560 | loss 3.486773 (+0.07z)| norm 0.2650 (-0.21z)| lr 4.26e-04 | 322.44 ms | 52.3% bf16 MFU | 1624585 tok/s step 7540/19560 | loss 3.494390 (+0.25z)| norm 0.2698 (-0.18z)| lr 4.25e-04 | 322.59 ms | 52.3% bf16 MFU | 1624618 tok/s step 7541/19560 | loss 3.515757 (+0.74z)| norm 0.2606 (-0.23z)| lr 4.25e-04 | 322.66 ms | 52.3% bf16 MFU | 1624630 tok/s step 7542/19560 | loss 3.498458 (+0.34z)| norm 0.2788 (-0.12z)| lr 4.25e-04 | 322.15 ms | 52.4% bf16 MFU | 1624773 tok/s step 7543/19560 | loss 3.639514 (+3.41z)| norm 0.3084 (+0.06z)| lr 4.25e-04 | 323.49 ms | 52.2% bf16 MFU | 1624570 tok/s step 7544/19560 | loss 3.520510 (+0.78z)| norm 0.2802 (-0.11z)| lr 4.25e-04 | 322.37 ms | 52.4% bf16 MFU | 1624659 tok/s step 7545/19560 | loss 3.498022 (+0.29z)| norm 0.2737 (-0.15z)| lr 4.25e-04 | 323.51 ms | 52.2% bf16 MFU | 1624456 tok/s step 7546/19560 | loss 3.423497 (-1.34z)| norm 0.2472 (-0.31z)| lr 4.25e-04 | 322.30 ms | 52.4% bf16 MFU | 1624569 tok/s step 7547/19560 | loss 3.470941 (-0.30z)| norm 0.2864 (-0.07z)| lr 4.25e-04 | 322.55 ms | 52.3% bf16 MFU | 1624612 tok/s step 7548/19560 | loss 3.454324 (-0.66z)| norm 
0.2995 (+0.01z)| lr 4.25e-04 | 322.81 ms | 52.3% bf16 MFU | 1624589 tok/s step 7549/19560 | loss 3.509896 (+0.55z)| norm 0.2666 (-0.19z)| lr 4.25e-04 | 323.17 ms | 52.2% bf16 MFU | 1624476 tok/s step 7550/19560 | loss 3.388880 (-2.07z)| norm 0.2948 (-0.02z)| lr 4.25e-04 | 323.06 ms | 52.2% bf16 MFU | 1624396 tok/s step 7551/19560 | loss 3.493275 (+0.20z)| norm 0.2912 (-0.04z)| lr 4.25e-04 | 323.44 ms | 52.2% bf16 MFU | 1624224 tok/s step 7552/19560 | loss 3.501743 (+0.38z)| norm 0.2899 (-0.05z)| lr 4.25e-04 | 321.97 ms | 52.4% bf16 MFU | 1624432 tok/s step 7553/19560 | loss 3.487569 (+0.08z)| norm 0.3142 (+0.10z)| lr 4.25e-04 | 323.03 ms | 52.2% bf16 MFU | 1624362 tok/s step 7554/19560 | loss 3.496408 (+0.26z)| norm 0.2929 (-0.03z)| lr 4.25e-04 | 323.02 ms | 52.2% bf16 MFU | 1624298 tok/s step 7555/19560 | loss 3.491627 (+0.16z)| norm 0.2835 (-0.09z)| lr 4.25e-04 | 323.19 ms | 52.2% bf16 MFU | 1624195 tok/s step 7556/19560 | loss 3.501615 (+0.38z)| norm 0.2760 (-0.14z)| lr 4.25e-04 | 322.60 ms | 52.3% bf16 MFU | 1624246 tok/s step 7557/19560 | loss 3.448429 (-0.81z)| norm 0.2829 (-0.10z)| lr 4.25e-04 | 322.55 ms | 52.3% bf16 MFU | 1624305 tok/s step 7558/19560 | loss 3.425333 (-1.30z)| norm 0.2765 (-0.14z)| lr 4.25e-04 | 322.93 ms | 52.3% bf16 MFU | 1624267 tok/s step 7559/19560 | loss 3.505175 (+0.46z)| norm 0.2680 (-0.19z)| lr 4.25e-04 | 322.80 ms | 52.3% bf16 MFU | 1624263 tok/s step 7560/19560 | loss 3.494768 (+0.23z)| norm 0.2835 (-0.09z)| lr 4.25e-04 | 322.29 ms | 52.4% bf16 MFU | 1624388 tok/s step 7561/19560 | loss 3.443092 (-0.92z)| norm 0.2821 (-0.11z)| lr 4.25e-04 | 322.71 ms | 52.3% bf16 MFU | 1624402 tok/s step 7562/19560 | loss 3.475733 (-0.19z)| norm 0.2655 (-0.21z)| lr 4.24e-04 | 322.80 ms | 52.3% bf16 MFU | 1624391 tok/s step 7563/19560 | loss 3.458675 (-0.56z)| norm 0.2660 (-0.20z)| lr 4.24e-04 | 322.50 ms | 52.3% bf16 MFU | 1624457 tok/s step 7564/19560 | loss 3.525988 (+1.10z)| norm 0.2708 (-0.15z)| lr 4.24e-04 | 322.84 ms | 52.3% bf16 MFU | 
1624433 tok/s step 7565/19560 | loss 3.457080 (-0.63z)| norm 0.2624 (-0.21z)| lr 4.24e-04 | 322.21 ms | 52.4% bf16 MFU | 1624570 tok/s step 7566/19560 | loss 3.446198 (-0.90z)| norm 0.2810 (-0.06z)| lr 4.24e-04 | 323.53 ms | 52.2% bf16 MFU | 1624367 tok/s step 7567/19560 | loss 3.410817 (-1.79z)| norm 0.2885 (-0.00z)| lr 4.24e-04 | 322.73 ms | 52.3% bf16 MFU | 1624377 tok/s step 7568/19560 | loss 3.425370 (-1.43z)| norm 0.2444 (-0.33z)| lr 4.24e-04 | 322.67 ms | 52.3% bf16 MFU | 1624400 tok/s step 7569/19560 | loss 3.452495 (-0.72z)| norm 0.2568 (-0.24z)| lr 4.24e-04 | 322.99 ms | 52.3% bf16 MFU | 1624343 tok/s step 7570/19560 | loss 3.464442 (-0.40z)| norm 0.2589 (-0.22z)| lr 4.24e-04 | 322.37 ms | 52.4% bf16 MFU | 1624444 tok/s step 7571/19560 | loss 3.488703 (+0.23z)| norm 0.2883 (+0.01z)| lr 4.24e-04 | 323.01 ms | 52.3% bf16 MFU | 1624379 tok/s step 7572/19560 | loss 3.439988 (-1.02z)| norm 0.2772 (-0.08z)| lr 4.24e-04 | 323.55 ms | 52.2% bf16 MFU | 1624182 tok/s step 7573/19560 | loss 3.495163 (+0.40z)| norm 0.2573 (-0.22z)| lr 4.24e-04 | 322.52 ms | 52.3% bf16 MFU | 1624254 tok/s step 7574/19560 | loss 3.500288 (+0.54z)| norm 0.2692 (-0.13z)| lr 4.24e-04 | 322.68 ms | 52.3% bf16 MFU | 1624281 tok/s step 7575/19560 | loss 3.481999 (+0.07z)| norm 0.2757 (-0.08z)| lr 4.24e-04 | 322.14 ms | 52.4% bf16 MFU | 1624442 tok/s step 7576/19560 | loss 3.416209 (-1.62z)| norm 0.2480 (-0.29z)| lr 4.24e-04 | 322.95 ms | 52.3% bf16 MFU | 1624392 tok/s step 7577/19560 | loss 3.493524 (+0.39z)| norm 0.2898 (+0.03z)| lr 4.24e-04 | 322.60 ms | 52.3% bf16 MFU | 1624431 tok/s step 7578/19560 | loss 3.490751 (+0.31z)| norm 0.2738 (-0.09z)| lr 4.24e-04 | 322.63 ms | 52.3% bf16 MFU | 1624461 tok/s step 7579/19560 | loss 3.461673 (-0.45z)| norm 0.2857 (-0.00z)| lr 4.24e-04 | 322.45 ms | 52.3% bf16 MFU | 1624536 tok/s step 7580/19560 | loss 3.443441 (-0.92z)| norm 0.2812 (-0.04z)| lr 4.24e-04 | 323.11 ms | 52.2% bf16 MFU | 1624440 tok/s step 7581/19560 | loss 3.419132 (-1.55z)| norm 
0.2868 (+0.00z)| lr 4.24e-04 | 322.94 ms | 52.3% bf16 MFU | 1624393 tok/s step 7582/19560 | loss 3.461384 (-0.42z)| norm 0.2743 (-0.09z)| lr 4.24e-04 | 322.74 ms | 52.3% bf16 MFU | 1624399 tok/s step 7583/19560 | loss 3.461581 (-0.41z)| norm 0.2703 (-0.12z)| lr 4.24e-04 | 322.41 ms | 52.3% bf16 MFU | 1624486 tok/s step 7584/19560 | loss 3.470994 (-0.15z)| norm 0.2773 (-0.07z)| lr 4.23e-04 | 322.71 ms | 52.3% bf16 MFU | 1624494 tok/s step 7585/19560 | loss 3.508095 (+0.87z)| norm 0.2816 (-0.04z)| lr 4.23e-04 | 322.65 ms | 52.3% bf16 MFU | 1624517 tok/s step 7586/19560 | loss 3.490369 (+0.38z)| norm 0.2728 (-0.10z)| lr 4.23e-04 | 322.64 ms | 52.3% bf16 MFU | 1624542 tok/s step 7587/19560 | loss 3.527858 (+1.39z)| norm 0.2885 (+0.01z)| lr 4.23e-04 | 322.59 ms | 52.3% bf16 MFU | 1624576 tok/s step 7588/19560 | loss 3.451343 (-0.70z)| norm 0.3214 (+0.26z)| lr 4.23e-04 | 322.79 ms | 52.3% bf16 MFU | 1624559 tok/s step 7589/19560 | loss 3.498439 (+0.59z)| norm 0.2838 (-0.03z)| lr 4.23e-04 | 322.63 ms | 52.3% bf16 MFU | 1624582 tok/s step 7590/19560 | loss 3.448795 (-0.76z)| norm 0.2806 (-0.05z)| lr 4.23e-04 | 323.11 ms | 52.2% bf16 MFU | 1624485 tok/s step 7591/19560 | loss 3.515235 (+1.05z)| norm 0.3098 (+0.17z)| lr 4.23e-04 | 322.51 ms | 52.3% bf16 MFU | 1624542 tok/s step 7592/19560 | loss 3.506035 (+0.79z)| norm 0.2952 (+0.05z)| lr 4.23e-04 | 322.47 ms | 52.3% bf16 MFU | 1624607 tok/s step 7593/19560 | loss 3.485794 (+0.23z)| norm 0.3026 (+1.36z)| lr 4.23e-04 | 323.18 ms | 52.2% bf16 MFU | 1624490 tok/s step 7594/19560 | loss 3.445050 (-0.89z)| norm 0.2940 (+0.95z)| lr 4.23e-04 | 322.47 ms | 52.3% bf16 MFU | 1624559 tok/s step 7595/19560 | loss 3.425329 (-1.41z)| norm 0.2935 (+0.93z)| lr 4.23e-04 | 322.39 ms | 52.4% bf16 MFU | 1624643 tok/s step 7596/19560 | loss 3.473894 (-0.08z)| norm 0.3152 (+2.13z)| lr 4.23e-04 | 323.16 ms | 52.2% bf16 MFU | 1624531 tok/s step 7597/19560 | loss 3.498942 (+0.64z)| norm 0.2872 (+0.61z)| lr 4.23e-04 | 322.61 ms | 52.3% bf16 MFU | 
1624562 tok/s step 7598/19560 | loss 3.463452 (-0.36z)| norm 0.2992 (+1.27z)| lr 4.23e-04 | 322.40 ms | 52.3% bf16 MFU | 1624644 tok/s step 7599/19560 | loss 3.555327 (+2.19z)| norm 0.2943 (+1.02z)| lr 4.23e-04 | 322.92 ms | 52.3% bf16 MFU | 1624592 tok/s step 7600/19560 | loss 3.483146 (+0.18z)| norm 0.2791 (+0.15z)| lr 4.23e-04 | 322.61 ms | 52.3% bf16 MFU | 1624618 tok/s step 7601/19560 | loss 3.548363 (+1.96z)| norm 0.2984 (+1.26z)| lr 4.23e-04 | 322.92 ms | 52.3% bf16 MFU | 1624568 tok/s step 7602/19560 | loss 3.501449 (+0.66z)| norm 0.3434 (+3.60z)| lr 4.23e-04 | 322.62 ms | 52.3% bf16 MFU | 1624594 tok/s step 7603/19560 | loss 3.409552 (-1.83z)| norm 0.3155 (+2.04z)| lr 4.23e-04 | 322.27 ms | 52.4% bf16 MFU | 1624708 tok/s step 7604/19560 | loss 3.474319 (-0.08z)| norm 0.2874 (+0.53z)| lr 4.23e-04 | 323.01 ms | 52.2% bf16 MFU | 1624629 tok/s step 7605/19560 | loss 3.447690 (-0.79z)| norm 0.2962 (+0.98z)| lr 4.23e-04 | 322.72 ms | 52.3% bf16 MFU | 1624626 tok/s step 7606/19560 | loss 3.443444 (-0.90z)| norm 0.2823 (+0.23z)| lr 4.22e-04 | 322.32 ms | 52.4% bf16 MFU | 1624724 tok/s step 7607/19560 | loss 3.411807 (-1.72z)| norm 0.2842 (+0.32z)| lr 4.22e-04 | 322.96 ms | 52.3% bf16 MFU | 1624657 tok/s step 7608/19560 | loss 3.475864 (-0.01z)| norm 0.2906 (+0.66z)| lr 4.22e-04 | 322.22 ms | 52.4% bf16 MFU | 1624779 tok/s step 7609/19560 | loss 3.487842 (+0.31z)| norm 0.2844 (+0.34z)| lr 4.22e-04 | 322.96 ms | 52.3% bf16 MFU | 1624709 tok/s step 7610/19560 | loss 3.515569 (+1.04z)| norm 0.2878 (+0.53z)| lr 4.22e-04 | 322.32 ms | 52.4% bf16 MFU | 1624805 tok/s step 7611/19560 | loss 3.511553 (+0.92z)| norm 0.2659 (-0.66z)| lr 4.22e-04 | 322.76 ms | 52.3% bf16 MFU | 1624784 tok/s step 7612/19560 | loss 3.455641 (-0.58z)| norm 0.2594 (-1.00z)| lr 4.22e-04 | 322.60 ms | 52.3% bf16 MFU | 1624804 tok/s step 7613/19560 | loss 3.469336 (-0.21z)| norm 0.2958 (+0.97z)| lr 4.22e-04 | 322.33 ms | 52.4% bf16 MFU | 1624892 tok/s step 7614/19560 | loss 3.524805 (+1.29z)| norm 
0.2541 (-1.28z)| lr 4.22e-04 | 322.31 ms | 52.4% bf16 MFU | 1624980 tok/s step 7615/19560 | loss 3.442108 (-0.97z)| norm 0.2803 (+0.15z)| lr 4.22e-04 | 322.47 ms | 52.3% bf16 MFU | 1625024 tok/s step 7616/19560 | loss 3.483118 (+0.15z)| norm 0.2471 (-1.64z)| lr 4.22e-04 | 322.50 ms | 52.3% bf16 MFU | 1625059 tok/s step 7617/19560 | loss 3.427937 (-1.34z)| norm 0.2934 (+0.87z)| lr 4.22e-04 | 322.72 ms | 52.3% bf16 MFU | 1625034 tok/s step 7618/19560 | loss 3.491647 (+0.39z)| norm 0.2653 (-0.64z)| lr 4.22e-04 | 322.60 ms | 52.3% bf16 MFU | 1625041 tok/s step 7619/19560 | loss 3.513757 (+1.01z)| norm 0.2815 (+0.22z)| lr 4.22e-04 | 322.58 ms | 52.3% bf16 MFU | 1625053 tok/s step 7620/19560 | loss 3.384749 (-2.51z)| norm 0.3048 (+1.46z)| lr 4.22e-04 | 322.38 ms | 52.4% bf16 MFU | 1625115 tok/s step 7621/19560 | loss 3.486424 (+0.26z)| norm 0.2868 (+0.48z)| lr 4.22e-04 | 323.07 ms | 52.2% bf16 MFU | 1625000 tok/s step 7622/19560 | loss 3.589976 (+2.95z)| norm 0.2814 (+0.19z)| lr 4.22e-04 | 322.37 ms | 52.4% bf16 MFU | 1625067 tok/s step 7623/19560 | loss 3.470289 (-0.20z)| norm 0.3369 (+3.04z)| lr 4.22e-04 | 322.85 ms | 52.3% bf16 MFU | 1625011 tok/s step 7624/19560 | loss 3.435439 (-1.13z)| norm 0.3013 (+1.17z)| lr 4.22e-04 | 322.98 ms | 52.3% bf16 MFU | 1624926 tok/s step 7625/19560 | loss 3.450749 (-0.72z)| norm 0.2680 (-0.56z)| lr 4.22e-04 | 322.81 ms | 52.3% bf16 MFU | 1624887 tok/s step 7626/19560 | loss 3.478642 (+0.02z)| norm 0.2788 (+0.00z)| lr 4.22e-04 | 322.86 ms | 52.3% bf16 MFU | 1624837 tok/s step 7627/19560 | loss 3.533335 (+1.44z)| norm 0.2979 (+0.98z)| lr 4.22e-04 | 322.25 ms | 52.4% bf16 MFU | 1624944 tok/s step 7628/19560 | loss 3.530472 (+1.36z)| norm 0.2787 (-0.04z)| lr 4.21e-04 | 323.08 ms | 52.2% bf16 MFU | 1624835 tok/s step 7629/19560 | loss 3.485505 (+0.17z)| norm 0.2901 (+0.55z)| lr 4.21e-04 | 322.82 ms | 52.3% bf16 MFU | 1624799 tok/s step 7630/19560 | loss 3.448451 (-0.80z)| norm 0.2503 (-1.55z)| lr 4.21e-04 | 322.56 ms | 52.3% bf16 MFU | 
1624830 tok/s step 7631/19560 | loss 3.464247 (-0.38z)| norm 0.2875 (+0.41z)| lr 4.21e-04 | 322.62 ms | 52.3% bf16 MFU | 1624844 tok/s step 7632/19560 | loss 3.456968 (-0.56z)| norm 0.2749 (-0.25z)| lr 4.21e-04 | 322.71 ms | 52.3% bf16 MFU | 1624834 tok/s step 7633/19560 | loss 3.466927 (-0.29z)| norm 0.2760 (-0.19z)| lr 4.21e-04 | 322.86 ms | 52.3% bf16 MFU | 1624788 tok/s step 7634/19560 | loss 3.425634 (-1.37z)| norm 0.2614 (-0.96z)| lr 4.21e-04 | 323.29 ms | 52.2% bf16 MFU | 1624635 tok/s step 7635/19560 | loss 3.448934 (-0.74z)| norm 0.3008 (+1.12z)| lr 4.21e-04 | 322.99 ms | 52.3% bf16 MFU | 1624564 tok/s step 7636/19560 | loss 3.472743 (-0.12z)| norm 0.2455 (-1.78z)| lr 4.21e-04 | 322.38 ms | 52.4% bf16 MFU | 1624651 tok/s step 7637/19560 | loss 3.434659 (-1.11z)| norm 0.8041 (+10.41z)| lr 4.21e-04 | 322.87 ms | 52.3% bf16 MFU | 1624609 tok/s step 7638/19560 | loss 3.482837 (+0.17z)| norm 0.2990 (+0.30z)| lr 4.21e-04 | 322.81 ms | 52.3% bf16 MFU | 1624585 tok/s step 7639/19560 | loss 3.466732 (-0.26z)| norm 0.3029 (+0.38z)| lr 4.21e-04 | 322.86 ms | 52.3% bf16 MFU | 1624549 tok/s step 7640/19560 | loss 3.479941 (+0.08z)| norm 0.3142 (+0.60z)| lr 4.21e-04 | 322.95 ms | 52.3% bf16 MFU | 1624492 tok/s step 7641/19560 | loss 3.500957 (+0.63z)| norm 0.2874 (+0.06z)| lr 4.21e-04 | 323.60 ms | 52.2% bf16 MFU | 1624278 tok/s step 7642/19560 | loss 3.457766 (-0.52z)| norm 0.3069 (+0.46z)| lr 4.21e-04 | 323.34 ms | 52.2% bf16 MFU | 1624138 tok/s step 7643/19560 | loss 3.467585 (-0.26z)| norm 0.2839 (-0.00z)| lr 4.21e-04 | 322.91 ms | 52.3% bf16 MFU | 1624113 tok/s step 7644/19560 | loss 3.450986 (-0.69z)| norm 0.3045 (+0.40z)| lr 4.21e-04 | 322.51 ms | 52.3% bf16 MFU | 1624191 tok/s step 7645/19560 | loss 3.501736 (+0.65z)| norm 0.2710 (-0.27z)| lr 4.21e-04 | 323.28 ms | 52.2% bf16 MFU | 1624069 tok/s step 7646/19560 | loss 3.476072 (-0.03z)| norm 0.2696 (-0.29z)| lr 4.21e-04 | 322.38 ms | 52.4% bf16 MFU | 1624180 tok/s step 7647/19560 | loss 3.453546 (-0.62z)| norm 
0.2755 (-0.17z)| lr 4.21e-04 | 322.60 ms | 52.3% bf16 MFU | 1624232 tok/s step 7648/19560 | loss 3.475951 (-0.00z)| norm 0.2666 (-0.35z)| lr 4.21e-04 | 322.59 ms | 52.3% bf16 MFU | 1624283 tok/s step 7649/19560 | loss 3.558747 (+2.22z)| norm 0.2615 (-0.45z)| lr 4.21e-04 | 322.74 ms | 52.3% bf16 MFU | 1624292 tok/s step 7650/19560 | loss 3.436996 (-1.05z)| norm 0.2561 (-0.56z)| lr 4.20e-04 | 322.43 ms | 52.3% bf16 MFU | 1624380 tok/s step 7651/19560 | loss 3.491018 (+0.41z)| norm 0.2677 (-0.32z)| lr 4.20e-04 | 322.76 ms | 52.3% bf16 MFU | 1624382 tok/s step 7652/19560 | loss 3.401136 (-1.97z)| norm 0.2591 (-0.50z)| lr 4.20e-04 | 322.68 ms | 52.3% bf16 MFU | 1624401 tok/s step 7653/19560 | loss 3.411398 (-1.66z)| norm 0.2622 (-0.43z)| lr 4.20e-04 | 322.93 ms | 52.3% bf16 MFU | 1624357 tok/s step 7654/19560 | loss 3.356769 (-2.96z)| norm 0.2518 (-0.64z)| lr 4.20e-04 | 323.10 ms | 52.2% bf16 MFU | 1624274 tok/s step 7655/19560 | loss 3.532209 (+1.46z)| norm 0.2667 (-0.35z)| lr 4.20e-04 | 323.00 ms | 52.3% bf16 MFU | 1624219 tok/s step 7656/19560 | loss 3.504997 (+0.76z)| norm 0.2942 (+0.20z)| lr 4.20e-04 | 322.79 ms | 52.3% bf16 MFU | 1624220 tok/s step 7657/19560 | loss 3.461269 (-0.34z)| norm 0.2775 (-0.14z)| lr 4.20e-04 | 322.71 ms | 52.3% bf16 MFU | 1624241 tok/s step 7658/19560 | loss 3.533179 (+1.45z)| norm 0.2634 (-0.42z)| lr 4.20e-04 | 323.41 ms | 52.2% bf16 MFU | 1624086 tok/s step 7659/19560 | loss 3.454147 (-0.53z)| norm 0.2796 (-0.10z)| lr 4.20e-04 | 322.47 ms | 52.3% bf16 MFU | 1624173 tok/s step 7660/19560 | loss 3.546793 (+1.76z)| norm 0.2588 (-0.52z)| lr 4.20e-04 | 323.10 ms | 52.2% bf16 MFU | 1624098 tok/s step 7661/19560 | loss 3.392974 (-2.01z)| norm 0.2608 (-0.48z)| lr 4.20e-04 | 322.42 ms | 52.3% bf16 MFU | 1624198 tok/s step 7662/19560 | loss 3.450683 (-0.59z)| norm 0.2543 (-0.61z)| lr 4.20e-04 | 322.82 ms | 52.3% bf16 MFU | 1624192 tok/s step 7663/19560 | loss 3.372712 (-2.41z)| norm 0.2624 (-0.44z)| lr 4.20e-04 | 322.60 ms | 52.3% bf16 MFU | 
1624242 tok/s step 7664/19560 | loss 3.519462 (+1.08z)| norm 0.3102 (+0.51z)| lr 4.20e-04 | 322.86 ms | 52.3% bf16 MFU | 1624223 tok/s step 7665/19560 | loss 3.468284 (-0.13z)| norm 0.2873 (+0.05z)| lr 4.20e-04 | 322.36 ms | 52.4% bf16 MFU | 1624334 tok/s step 7666/19560 | loss 3.432456 (-0.98z)| norm 0.2735 (-0.23z)| lr 4.20e-04 | 322.86 ms | 52.3% bf16 MFU | 1624311 tok/s step 7667/19560 | loss 3.478229 (+0.11z)| norm 0.2921 (+0.14z)| lr 4.20e-04 | 322.79 ms | 52.3% bf16 MFU | 1624309 tok/s step 7668/19560 | loss 3.502335 (+0.68z)| norm 0.2807 (-0.09z)| lr 4.20e-04 | 322.90 ms | 52.3% bf16 MFU | 1624277 tok/s step 7669/19560 | loss 3.487845 (+0.34z)| norm 0.2991 (+0.27z)| lr 4.20e-04 | 323.08 ms | 52.2% bf16 MFU | 1624203 tok/s step 7670/19560 | loss 3.481047 (+0.18z)| norm 0.3245 (+0.78z)| lr 4.20e-04 | 322.46 ms | 52.3% bf16 MFU | 1624288 tok/s step 7671/19560 | loss 3.486811 (+0.37z)| norm 0.2854 (-0.01z)| lr 4.20e-04 | 323.00 ms | 52.3% bf16 MFU | 1624234 tok/s step 7672/19560 | loss 3.539850 (+1.71z)| norm 0.3055 (+0.39z)| lr 4.19e-04 | 323.04 ms | 52.2% bf16 MFU | 1624172 tok/s step 7673/19560 | loss 3.458583 (-0.34z)| norm 0.3083 (+0.44z)| lr 4.19e-04 | 322.83 ms | 52.3% bf16 MFU | 1624164 tok/s step 7674/19560 | loss 3.506612 (+0.86z)| norm 0.2906 (+0.08z)| lr 4.19e-04 | 322.72 ms | 52.3% bf16 MFU | 1624185 tok/s step 7675/19560 | loss 3.465280 (-0.19z)| norm 0.3015 (+0.30z)| lr 4.19e-04 | 322.65 ms | 52.3% bf16 MFU | 1624224 tok/s step 7676/19560 | loss 3.495165 (+0.56z)| norm 0.3497 (+1.26z)| lr 4.19e-04 | 322.99 ms | 52.3% bf16 MFU | 1624174 tok/s step 7677/19560 | loss 3.555973 (+2.07z)| norm 0.3005 (+0.26z)| lr 4.19e-04 | 322.94 ms | 52.3% bf16 MFU | 1624139 tok/s step 7678/19560 | loss 3.461149 (-0.33z)| norm 0.2894 (+0.04z)| lr 4.19e-04 | 323.13 ms | 52.2% bf16 MFU | 1624057 tok/s step 7679/19560 | loss 3.455684 (-0.46z)| norm 0.2809 (-0.13z)| lr 4.19e-04 | 323.15 ms | 52.2% bf16 MFU | 1623975 tok/s step 7680/19560 | loss 3.485776 (+0.31z)| norm 
0.3286 (+0.82z)| lr 4.19e-04 | 322.87 ms | 52.3% bf16 MFU | 1623968 tok/s step 7681/19560 | loss 3.492003 (+0.47z)| norm 0.2584 (-0.57z)| lr 4.19e-04 | 323.21 ms | 52.2% bf16 MFU | 1623877 tok/s step 7682/19560 | loss 3.438343 (-0.89z)| norm 0.2728 (-0.28z)| lr 4.19e-04 | 323.13 ms | 52.2% bf16 MFU | 1623809 tok/s step 7683/19560 | loss 3.513401 (+1.02z)| norm 0.2670 (-0.40z)| lr 4.19e-04 | 323.15 ms | 52.2% bf16 MFU | 1623740 tok/s step 7684/19560 | loss 3.470618 (-0.06z)| norm 0.2607 (-0.52z)| lr 4.19e-04 | 322.99 ms | 52.3% bf16 MFU | 1623714 tok/s step 7685/19560 | loss 3.481942 (+0.22z)| norm 0.2539 (-0.65z)| lr 4.19e-04 | 323.12 ms | 52.2% bf16 MFU | 1623659 tok/s step 7686/19560 | loss 3.480189 (+0.17z)| norm 0.2554 (-0.62z)| lr 4.19e-04 | 322.99 ms | 52.3% bf16 MFU | 1623637 tok/s step 7687/19560 | loss 3.524579 (+1.29z)| norm 0.2723 (-0.28z)| lr 4.19e-04 | 323.38 ms | 52.2% bf16 MFU | 1623518 tok/s step 7688/19560 | loss 3.312791 (-3.84z)| norm 0.6710 (+6.31z)| lr 4.19e-04 | 322.18 ms | 52.4% bf16 MFU | 1623709 tok/s step 7689/19560 | loss 3.504685 (+0.76z)| norm 0.3762 (+1.41z)| lr 4.19e-04 | 322.74 ms | 52.3% bf16 MFU | 1623747 tok/s step 7690/19560 | loss 3.457823 (-0.36z)| norm 0.3410 (+0.82z)| lr 4.19e-04 | 323.00 ms | 52.3% bf16 MFU | 1623718 tok/s step 7691/19560 | loss 3.489522 (+0.40z)| norm 0.2892 (-0.03z)| lr 4.19e-04 | 322.78 ms | 52.3% bf16 MFU | 1623748 tok/s step 7692/19560 | loss 3.397562 (-1.78z)| norm 0.3170 (+0.42z)| lr 4.19e-04 | 323.14 ms | 52.2% bf16 MFU | 1623685 tok/s step 7693/19560 | loss 3.440647 (-0.74z)| norm 0.2938 (+0.04z)| lr 4.19e-04 | 322.70 ms | 52.3% bf16 MFU | 1623736 tok/s step 7694/19560 | loss 3.445387 (-0.63z)| norm 0.2860 (-0.09z)| lr 4.18e-04 | 322.51 ms | 52.3% bf16 MFU | 1623832 tok/s step 7695/19560 | loss 3.473830 (+0.03z)| norm 0.2673 (-0.40z)| lr 4.18e-04 | 322.60 ms | 52.3% bf16 MFU | 1623900 tok/s step 7696/19560 | loss 3.447539 (-0.60z)| norm 0.2723 (-0.32z)| lr 4.18e-04 | 322.57 ms | 52.3% bf16 MFU | 
1623971 tok/s step 7697/19560 | loss 3.436345 (-0.87z)| norm 0.2638 (-0.46z)| lr 4.18e-04 | 322.76 ms | 52.3% bf16 MFU | 1623992 tok/s step 7698/19560 | loss 3.447267 (-0.60z)| norm 0.2743 (-0.29z)| lr 4.18e-04 | 323.20 ms | 52.2% bf16 MFU | 1623902 tok/s step 7699/19560 | loss 3.476727 (+0.11z)| norm 0.2836 (-0.13z)| lr 4.18e-04 | 322.77 ms | 52.3% bf16 MFU | 1623924 tok/s step 7700/19560 | loss 3.456969 (-0.37z)| norm 0.2724 (-0.32z)| lr 4.18e-04 | 322.89 ms | 52.3% bf16 MFU | 1623916 tok/s step 7701/19560 | loss 3.444601 (-0.66z)| norm 0.2794 (-0.21z)| lr 4.18e-04 | 322.74 ms | 52.3% bf16 MFU | 1623944 tok/s step 7702/19560 | loss 3.436295 (-0.85z)| norm 0.2767 (-0.25z)| lr 4.18e-04 | 323.33 ms | 52.2% bf16 MFU | 1623822 tok/s step 7703/19560 | loss 3.473963 (+0.06z)| norm 0.2814 (-0.17z)| lr 4.18e-04 | 322.97 ms | 52.3% bf16 MFU | 1623799 tok/s step 7704/19560 | loss 3.457790 (-0.34z)| norm 0.2617 (-0.50z)| lr 4.18e-04 | 323.14 ms | 52.2% bf16 MFU | 1623734 tok/s step 7705/19560 | loss 3.395522 (-1.80z)| norm 0.2878 (-0.07z)| lr 4.18e-04 | 323.01 ms | 52.3% bf16 MFU | 1623705 tok/s step 7706/19560 | loss 3.418830 (-1.23z)| norm 0.2683 (-0.39z)| lr 4.18e-04 | 322.76 ms | 52.3% bf16 MFU | 1623740 tok/s step 7707/19560 | loss 3.465689 (-0.11z)| norm 0.2605 (-0.52z)| lr 4.18e-04 | 323.31 ms | 52.2% bf16 MFU | 1623635 tok/s step 7708/19560 | loss 3.630994 (+3.59z)| norm 0.2821 (-0.16z)| lr 4.18e-04 | 322.84 ms | 52.3% bf16 MFU | 1623652 tok/s step 7709/19560 | loss 3.436283 (-0.81z)| norm 0.2835 (-0.14z)| lr 4.18e-04 | 322.68 ms | 52.3% bf16 MFU | 1623709 tok/s step 7710/19560 | loss 3.439220 (-0.74z)| norm 0.2473 (-0.73z)| lr 4.18e-04 | 322.86 ms | 52.3% bf16 MFU | 1623716 tok/s step 7711/19560 | loss 3.421266 (-1.13z)| norm 0.2852 (-0.11z)| lr 4.18e-04 | 322.87 ms | 52.3% bf16 MFU | 1623723 tok/s step 7712/19560 | loss 3.522003 (+1.12z)| norm 0.2929 (+0.02z)| lr 4.18e-04 | 322.68 ms | 52.3% bf16 MFU | 1623777 tok/s step 7713/19560 | loss 3.426791 (-1.00z)| norm 
0.2762 (-0.26z)| lr 4.18e-04 | 322.81 ms | 52.3% bf16 MFU | 1623795 tok/s step 7714/19560 | loss 3.541429 (+1.55z)| norm 0.7694 (+6.40z)| lr 4.18e-04 | 322.63 ms | 52.3% bf16 MFU | 1623857 tok/s step 7715/19560 | loss 3.431389 (-0.88z)| norm 0.4537 (+2.08z)| lr 4.18e-04 | 323.28 ms | 52.2% bf16 MFU | 1623753 tok/s step 7716/19560 | loss 3.516754 (+1.00z)| norm 0.3188 (+0.29z)| lr 4.17e-04 | 322.48 ms | 52.3% bf16 MFU | 1623855 tok/s step 7717/19560 | loss 3.553869 (+1.80z)| norm 0.3272 (+0.40z)| lr 4.17e-04 | 322.49 ms | 52.3% bf16 MFU | 1623950 tok/s step 7718/19560 | loss 3.498801 (+0.58z)| norm 0.2727 (-0.33z)| lr 4.17e-04 | 322.49 ms | 52.3% bf16 MFU | 1624038 tok/s step 7719/19560 | loss 3.482821 (+0.24z)| norm 0.2914 (-0.08z)| lr 4.17e-04 | 322.17 ms | 52.4% bf16 MFU | 1624204 tok/s step 7720/19560 | loss 3.489964 (+0.40z)| norm 0.2650 (-0.42z)| lr 4.17e-04 | 322.28 ms | 52.4% bf16 MFU | 1624336 tok/s step 7721/19560 | loss 3.500844 (+0.63z)| norm 0.2871 (-0.13z)| lr 4.17e-04 | 322.73 ms | 52.3% bf16 MFU | 1624345 tok/s step 7722/19560 | loss 3.474000 (+0.04z)| norm 0.2802 (-0.22z)| lr 4.17e-04 | 322.00 ms | 52.4% bf16 MFU | 1624540 tok/s step 7723/19560 | loss 3.531298 (+1.28z)| norm 0.2760 (-0.27z)| lr 4.17e-04 | 322.69 ms | 52.3% bf16 MFU | 1624551 tok/s step 7724/19560 | loss 3.467746 (-0.12z)| norm 0.2763 (-0.26z)| lr 4.17e-04 | 322.94 ms | 52.3% bf16 MFU | 1624497 tok/s step 7725/19560 | loss 3.501866 (+0.63z)| norm 0.2764 (-0.26z)| lr 4.17e-04 | 322.53 ms | 52.3% bf16 MFU | 1624549 tok/s step 7726/19560 | loss 3.559880 (+1.86z)| norm 0.2877 (-0.11z)| lr 4.17e-04 | 322.22 ms | 52.4% bf16 MFU | 1624677 tok/s step 7727/19560 | loss 3.497446 (+0.53z)| norm 0.2907 (-0.07z)| lr 4.17e-04 | 322.87 ms | 52.3% bf16 MFU | 1624635 tok/s step 7728/19560 | loss 3.478065 (+0.10z)| norm 0.2615 (-0.46z)| lr 4.17e-04 | 322.69 ms | 52.3% bf16 MFU | 1624639 tok/s step 7729/19560 | loss 3.430252 (-0.93z)| norm 0.2729 (-0.30z)| lr 4.17e-04 | 322.41 ms | 52.3% bf16 MFU | 
1624715 tok/s step 7730/19560 | loss 3.475800 (+0.08z)| norm 0.2552 (-0.53z)| lr 4.17e-04 | 322.65 ms | 52.3% bf16 MFU | 1624726 tok/s step 7731/19560 | loss 3.433334 (-0.87z)| norm 0.2782 (-0.22z)| lr 4.17e-04 | 323.31 ms | 52.2% bf16 MFU | 1624572 tok/s step 7732/19560 | loss 3.502560 (+0.66z)| norm 0.2480 (-0.61z)| lr 4.17e-04 | 322.97 ms | 52.3% bf16 MFU | 1624510 tok/s step 7733/19560 | loss 3.425704 (-1.03z)| norm 0.7410 (+5.20z)| lr 4.17e-04 | 322.84 ms | 52.3% bf16 MFU | 1624485 tok/s step 7734/19560 | loss 3.463694 (-0.20z)| norm 0.2738 (-0.28z)| lr 4.17e-04 | 322.31 ms | 52.4% bf16 MFU | 1624595 tok/s step 7735/19560 | loss 3.443008 (-0.66z)| norm 0.2667 (-0.36z)| lr 4.17e-04 | 322.69 ms | 52.3% bf16 MFU | 1624602 tok/s step 7736/19560 | loss 3.497499 (+0.54z)| norm 0.2640 (-0.39z)| lr 4.17e-04 | 323.28 ms | 52.2% bf16 MFU | 1624460 tok/s step 7737/19560 | loss 3.471334 (-0.03z)| norm 0.2606 (-0.43z)| lr 4.16e-04 | 322.48 ms | 52.3% bf16 MFU | 1624526 tok/s step 7738/19560 | loss 3.511318 (+0.86z)| norm 0.2681 (-0.34z)| lr 4.16e-04 | 322.76 ms | 52.3% bf16 MFU | 1624520 tok/s step 7739/19560 | loss 3.484562 (+0.27z)| norm 0.2751 (-0.26z)| lr 4.16e-04 | 322.35 ms | 52.4% bf16 MFU | 1624616 tok/s step 7740/19560 | loss 3.453469 (-0.43z)| norm 0.2720 (-0.30z)| lr 4.16e-04 | 322.72 ms | 52.3% bf16 MFU | 1624615 tok/s step 7741/19560 | loss 3.476687 (+0.09z)| norm 0.2608 (-0.42z)| lr 4.16e-04 | 322.54 ms | 52.3% bf16 MFU | 1624660 tok/s step 7742/19560 | loss 3.454341 (-0.40z)| norm 0.2653 (-0.37z)| lr 4.16e-04 | 322.62 ms | 52.3% bf16 MFU | 1624681 tok/s step 7743/19560 | loss 3.454838 (-0.39z)| norm 0.3052 (+0.09z)| lr 4.16e-04 | 322.77 ms | 52.3% bf16 MFU | 1624663 tok/s step 7744/19560 | loss 3.507249 (+0.78z)| norm 0.2470 (-0.59z)| lr 4.16e-04 | 323.04 ms | 52.2% bf16 MFU | 1624580 tok/s step 7745/19560 | loss 3.439893 (-0.73z)| norm 0.2803 (-0.20z)| lr 4.16e-04 | 322.31 ms | 52.4% bf16 MFU | 1624683 tok/s step 7746/19560 | loss 3.462511 (-0.22z)| norm 
0.2656 (-0.37z)| lr 4.16e-04 | 322.67 ms | 52.3% bf16 MFU | 1624691 tok/s step 7747/19560 | loss 3.424262 (-1.06z)| norm 0.2760 (-0.25z)| lr 4.16e-04 | 322.63 ms | 52.3% bf16 MFU | 1624708 tok/s step 7748/19560 | loss 3.541608 (+1.55z)| norm 0.3045 (+0.09z)| lr 4.16e-04 | 322.78 ms | 52.3% bf16 MFU | 1624687 tok/s step 7749/19560 | loss 3.476271 (+0.08z)| norm 0.2677 (-0.34z)| lr 4.16e-04 | 323.17 ms | 52.2% bf16 MFU | 1624569 tok/s step 7750/19560 | loss 3.493259 (+0.49z)| norm 0.2712 (-0.30z)| lr 4.16e-04 | 322.82 ms | 52.3% bf16 MFU | 1624546 tok/s
val loss 3.458447
HellaSwag: 2865/10042 = 0.285302
step 7751/19560 | loss 3.502538 (+0.70z)| norm 0.2727 (-0.28z)| lr 4.16e-04 | 322.92 ms | 52.3% bf16 MFU | 1624497 tok/s step 7752/19560 | loss 3.468060 (-0.10z)| norm 0.2645 (-0.37z)| lr 4.16e-04 | 322.63 ms | 52.3% bf16 MFU | 1624525 tok/s step 7753/19560 | loss 3.486884 (+0.33z)| norm 0.2677 (-0.33z)| lr 4.16e-04 | 322.65 ms | 52.3% bf16 MFU | 1624547 tok/s step 7754/19560 | loss 3.483437 (+0.25z)| norm 0.2563 (-0.46z)| lr 4.16e-04 | 322.72 ms | 52.3% bf16 MFU | 1624550 tok/s step 7755/19560 | loss 3.545006 (+1.67z)| norm 0.2794 (-0.19z)| lr 4.16e-04 | 322.56 ms | 52.3% bf16 MFU | 1624593 tok/s step 7756/19560 | loss 3.432822 (-0.91z)| norm 0.3136 (+0.20z)| lr 4.16e-04 | 322.88 ms | 52.3% bf16 MFU | 1624553 tok/s step 7757/19560 | loss 3.492687 (+0.48z)| norm 0.2701 (-0.30z)| lr 4.16e-04 | 322.70 ms | 52.3% bf16 MFU | 1624560 tok/s step 7758/19560 | loss 3.478095 (+0.13z)| norm 0.3247 (+0.33z)| lr 4.16e-04 | 322.74 ms | 52.3% bf16 MFU | 1624556 tok/s step 7759/19560 | loss 3.453982 (-0.43z)| norm 0.2923 (-0.05z)| lr 4.15e-04 | 322.47 ms | 52.3% bf16 MFU | 1624620 tok/s step 7760/19560 | loss 3.475974 (+0.08z)| norm 0.2660 (-0.36z)| lr 4.15e-04 |
322.64 ms | 52.3% bf16 MFU | 1624639 tok/s step 7761/19560 | loss 3.401071 (-1.63z)| norm 0.2855 (-0.13z)| lr 4.15e-04 | 322.58 ms | 52.3% bf16 MFU | 1624672 tok/s step 7762/19560 | loss 3.453789 (-0.42z)| norm 0.2952 (-0.02z)| lr 4.15e-04 | 322.76 ms | 52.3% bf16 MFU | 1624657 tok/s step 7763/19560 | loss 3.507470 (+0.80z)| norm 0.2747 (-0.26z)| lr 4.15e-04 | 322.13 ms | 52.4% bf16 MFU | 1624802 tok/s step 7764/19560 | loss 3.529624 (+1.30z)| norm 0.3033 (+0.07z)| lr 4.15e-04 | 322.61 ms | 52.3% bf16 MFU | 1624820 tok/s step 7765/19560 | loss 3.472114 (-0.03z)| norm 0.2863 (-0.09z)| lr 4.15e-04 | 323.16 ms | 52.2% bf16 MFU | 1624698 tok/s step 7766/19560 | loss 3.447708 (-0.58z)| norm 0.2540 (-0.53z)| lr 4.15e-04 | 322.15 ms | 52.4% bf16 MFU | 1624838 tok/s step 7767/19560 | loss 3.426052 (-1.07z)| norm 0.2827 (-0.13z)| lr 4.15e-04 | 322.34 ms | 52.4% bf16 MFU | 1624920 tok/s step 7768/19560 | loss 3.434054 (-0.87z)| norm 0.2550 (-0.51z)| lr 4.15e-04 | 323.50 ms | 52.2% bf16 MFU | 1624708 tok/s step 7769/19560 | loss 3.481716 (+0.22z)| norm 0.2837 (-0.11z)| lr 4.15e-04 | 322.65 ms | 52.3% bf16 MFU | 1624719 tok/s step 7770/19560 | loss 3.440681 (-0.72z)| norm 0.2849 (-0.09z)| lr 4.15e-04 | 322.11 ms | 52.4% bf16 MFU | 1624865 tok/s step 7771/19560 | loss 3.578872 (+2.36z)| norm 0.2841 (-0.11z)| lr 4.15e-04 | 322.57 ms | 52.3% bf16 MFU | 1624888 tok/s step 7772/19560 | loss 3.396573 (-1.68z)| norm 0.2913 (-0.01z)| lr 4.15e-04 | 323.25 ms | 52.2% bf16 MFU | 1624740 tok/s step 7773/19560 | loss 3.449806 (-0.49z)| norm 0.3027 (+0.15z)| lr 4.15e-04 | 322.39 ms | 52.3% bf16 MFU | 1624816 tok/s step 7774/19560 | loss 3.603040 (+2.78z)| norm 0.2915 (-0.01z)| lr 4.15e-04 | 322.50 ms | 52.3% bf16 MFU | 1624860 tok/s step 7775/19560 | loss 3.454629 (-0.40z)| norm 0.2942 (+0.03z)| lr 4.15e-04 | 323.08 ms | 52.2% bf16 MFU | 1624757 tok/s step 7776/19560 | loss 3.481237 (+0.17z)| norm 0.2549 (-0.51z)| lr 4.15e-04 | 322.69 ms | 52.3% bf16 MFU | 1624755 tok/s step 7777/19560 | 
loss 3.438710 (-0.73z)| norm 0.2912 (-0.02z)| lr 4.15e-04 | 323.08 ms | 52.2% bf16 MFU | 1624657 tok/s step 7778/19560 | loss 3.418087 (-1.16z)| norm 0.2924 (-0.00z)| lr 4.15e-04 | 323.01 ms | 52.2% bf16 MFU | 1624580 tok/s step 7779/19560 | loss 3.440459 (-0.67z)| norm 0.2768 (-0.22z)| lr 4.15e-04 | 322.35 ms | 52.4% bf16 MFU | 1624674 tok/s step 7780/19560 | loss 3.613685 (+2.94z)| norm 0.2723 (-0.28z)| lr 4.15e-04 | 322.19 ms | 52.4% bf16 MFU | 1624804 tok/s step 7781/19560 | loss 3.473325 (-0.01z)| norm 0.2723 (-0.29z)| lr 4.14e-04 | 322.37 ms | 52.4% bf16 MFU | 1624881 tok/s step 7782/19560 | loss 3.468560 (-0.13z)| norm 0.2704 (-0.31z)| lr 4.14e-04 | 322.45 ms | 52.3% bf16 MFU | 1624935 tok/s step 7783/19560 | loss 3.450373 (-0.52z)| norm 0.2890 (-0.06z)| lr 4.14e-04 | 322.55 ms | 52.3% bf16 MFU | 1624959 tok/s step 7784/19560 | loss 3.482784 (+0.19z)| norm 0.3178 (+0.34z)| lr 4.14e-04 | 322.31 ms | 52.4% bf16 MFU | 1625044 tok/s step 7785/19560 | loss 3.400757 (-1.57z)| norm 0.2901 (-0.05z)| lr 4.14e-04 | 322.30 ms | 52.4% bf16 MFU | 1625128 tok/s step 7786/19560 | loss 3.486933 (+0.30z)| norm 0.3115 (+0.24z)| lr 4.14e-04 | 323.46 ms | 52.2% bf16 MFU | 1624915 tok/s step 7787/19560 | loss 3.525103 (+1.11z)| norm 0.2839 (-0.14z)| lr 4.14e-04 | 322.11 ms | 52.4% bf16 MFU | 1625054 tok/s step 7788/19560 | loss 3.478503 (+0.12z)| norm 0.2934 (-0.01z)| lr 4.14e-04 | 322.48 ms | 52.3% bf16 MFU | 1625091 tok/s step 7789/19560 | loss 3.513923 (+0.88z)| norm 0.2607 (-0.46z)| lr 4.14e-04 | 321.90 ms | 52.4% bf16 MFU | 1625273 tok/s step 7790/19560 | loss 3.494957 (+0.45z)| norm 0.2870 (-0.10z)| lr 4.14e-04 | 322.52 ms | 52.3% bf16 MFU | 1625289 tok/s step 7791/19560 | loss 3.482192 (+0.16z)| norm 0.2732 (-0.30z)| lr 4.14e-04 | 323.27 ms | 52.2% bf16 MFU | 1625117 tok/s step 7792/19560 | loss 3.473770 (-0.03z)| norm 0.2795 (-0.21z)| lr 4.14e-04 | 322.68 ms | 52.3% bf16 MFU | 1625100 tok/s step 7793/19560 | loss 3.499402 (+0.55z)| norm 0.2698 (-0.34z)| lr 4.14e-04 | 
322.27 ms | 52.4% bf16 MFU | 1625190 tok/s step 7794/19560 | loss 3.422839 (-1.18z)| norm 0.2684 (-0.36z)| lr 4.14e-04 | 322.58 ms | 52.3% bf16 MFU | 1625194 tok/s step 7795/19560 | loss 3.425026 (-1.11z)| norm 0.2712 (-0.32z)| lr 4.14e-04 | 322.19 ms | 52.4% bf16 MFU | 1625299 tok/s step 7796/19560 | loss 3.464377 (-0.22z)| norm 0.2796 (-0.20z)| lr 4.14e-04 | 323.04 ms | 52.2% bf16 MFU | 1625183 tok/s step 7797/19560 | loss 3.441555 (-0.73z)| norm 0.2925 (-0.02z)| lr 4.14e-04 | 322.62 ms | 52.3% bf16 MFU | 1625177 tok/s step 7798/19560 | loss 3.494361 (+0.45z)| norm 0.2812 (-0.17z)| lr 4.14e-04 | 322.95 ms | 52.3% bf16 MFU | 1625090 tok/s step 7799/19560 | loss 3.472399 (-0.04z)| norm 0.2741 (-0.27z)| lr 4.14e-04 | 323.37 ms | 52.2% bf16 MFU | 1624900 tok/s step 7800/19560 | loss 3.444489 (-0.65z)| norm 0.2881 (-0.07z)| lr 4.14e-04 | 323.22 ms | 52.2% bf16 MFU | 1624758 tok/s step 7801/19560 | loss 3.412124 (-1.36z)| norm 0.2572 (-0.50z)| lr 4.14e-04 | 322.85 ms | 52.3% bf16 MFU | 1624718 tok/s step 7802/19560 | loss 3.455162 (-0.39z)| norm 0.3209 (+0.38z)| lr 4.13e-04 | 322.51 ms | 52.3% bf16 MFU | 1624765 tok/s step 7803/19560 | loss 3.462641 (-0.22z)| norm 0.2775 (-0.22z)| lr 4.13e-04 | 322.63 ms | 52.3% bf16 MFU | 1624780 tok/s step 7804/19560 | loss 3.440581 (-0.70z)| norm 0.2731 (-0.27z)| lr 4.13e-04 | 322.26 ms | 52.4% bf16 MFU | 1624886 tok/s step 7805/19560 | loss 3.490616 (+0.43z)| norm 0.2751 (-0.24z)| lr 4.13e-04 | 322.76 ms | 52.3% bf16 MFU | 1624861 tok/s step 7806/19560 | loss 3.457839 (-0.31z)| norm 0.2813 (-0.15z)| lr 4.13e-04 | 322.89 ms | 52.3% bf16 MFU | 1624804 tok/s step 7807/19560 | loss 3.469260 (-0.05z)| norm 0.2802 (-0.17z)| lr 4.13e-04 | 322.96 ms | 52.3% bf16 MFU | 1624732 tok/s step 7808/19560 | loss 3.474701 (+0.07z)| norm 0.2909 (-0.01z)| lr 4.13e-04 | 322.39 ms | 52.4% bf16 MFU | 1624808 tok/s step 7809/19560 | loss 3.449122 (-0.50z)| norm 0.2924 (+0.00z)| lr 4.13e-04 | 322.32 ms | 52.4% bf16 MFU | 1624897 tok/s step 7810/19560 | 
loss 3.418583 (-1.19z)| norm 0.2691 (-0.32z)| lr 4.13e-04 | 322.83 ms | 52.3% bf16 MFU | 1624855 tok/s step 7811/19560 | loss 3.496118 (+0.57z)| norm 0.3147 (+0.31z)| lr 4.13e-04 | 322.59 ms | 52.3% bf16 MFU | 1624876 tok/s step 7812/19560 | loss 3.444174 (-0.60z)| norm 0.2903 (-0.03z)| lr 4.13e-04 | 322.50 ms | 52.3% bf16 MFU | 1624916 tok/s step 7813/19560 | loss 3.503360 (+0.74z)| norm 0.2619 (-0.43z)| lr 4.13e-04 | 323.07 ms | 52.2% bf16 MFU | 1624813 tok/s step 7814/19560 | loss 3.513059 (+0.95z)| norm 0.3018 (+0.12z)| lr 4.13e-04 | 322.30 ms | 52.4% bf16 MFU | 1624907 tok/s step 7815/19560 | loss 3.465094 (-0.13z)| norm 0.2997 (+0.09z)| lr 4.13e-04 | 322.42 ms | 52.3% bf16 MFU | 1624968 tok/s step 7816/19560 | loss 3.458105 (-0.33z)| norm 0.2924 (+0.03z)| lr 4.13e-04 | 323.22 ms | 52.2% bf16 MFU | 1624822 tok/s step 7817/19560 | loss 3.514205 (+1.01z)| norm 0.3312 (+0.65z)| lr 4.13e-04 | 322.34 ms | 52.4% bf16 MFU | 1624907 tok/s step 7818/19560 | loss 3.478066 (+0.14z)| norm 0.2821 (-0.12z)| lr 4.13e-04 | 322.79 ms | 52.3% bf16 MFU | 1624874 tok/s step 7819/19560 | loss 3.453506 (-0.44z)| norm 0.3034 (+0.22z)| lr 4.13e-04 | 322.64 ms | 52.3% bf16 MFU | 1624880 tok/s step 7820/19560 | loss 3.469297 (-0.07z)| norm 0.2784 (-0.17z)| lr 4.13e-04 | 322.63 ms | 52.3% bf16 MFU | 1624888 tok/s step 7821/19560 | loss 3.464135 (-0.20z)| norm 0.2782 (-0.18z)| lr 4.13e-04 | 322.97 ms | 52.3% bf16 MFU | 1624811 tok/s step 7822/19560 | loss 3.406914 (-1.57z)| norm 0.2963 (+0.11z)| lr 4.13e-04 | 322.52 ms | 52.3% bf16 MFU | 1624849 tok/s step 7823/19560 | loss 3.457557 (-0.35z)| norm 0.2522 (-0.59z)| lr 4.13e-04 | 322.56 ms | 52.3% bf16 MFU | 1624876 tok/s step 7824/19560 | loss 3.453121 (-0.46z)| norm 0.2787 (-0.17z)| lr 4.12e-04 | 322.76 ms | 52.3% bf16 MFU | 1624852 tok/s step 7825/19560 | loss 3.524683 (+1.24z)| norm 0.2722 (-0.27z)| lr 4.12e-04 | 323.15 ms | 52.2% bf16 MFU | 1624732 tok/s step 7826/19560 | loss 3.510461 (+0.89z)| norm 0.2789 (-0.17z)| lr 4.12e-04 | 
322.63 ms | 52.3% bf16 MFU | 1624747 tok/s step 7827/19560 | loss 3.452139 (-0.50z)| norm 0.2877 (-0.03z)| lr 4.12e-04 | 322.62 ms | 52.3% bf16 MFU | 1624764 tok/s step 7828/19560 | loss 3.465417 (-0.18z)| norm 0.2668 (-0.36z)| lr 4.12e-04 | 322.53 ms | 52.3% bf16 MFU | 1624803 tok/s step 7829/19560 | loss 3.426736 (-1.10z)| norm 0.2776 (-0.19z)| lr 4.12e-04 | 322.62 ms | 52.3% bf16 MFU | 1624817 tok/s step 7830/19560 | loss 3.480091 (+0.16z)| norm 0.2729 (-0.26z)| lr 4.12e-04 | 322.78 ms | 52.3% bf16 MFU | 1624790 tok/s step 7831/19560 | loss 3.425339 (-1.13z)| norm 0.2871 (-0.04z)| lr 4.12e-04 | 322.90 ms | 52.3% bf16 MFU | 1624734 tok/s step 7832/19560 | loss 3.501539 (+0.67z)| norm 0.3392 (+0.78z)| lr 4.12e-04 | 322.61 ms | 52.3% bf16 MFU | 1624755 tok/s step 7833/19560 | loss 3.457784 (-0.39z)| norm 0.2638 (-0.41z)| lr 4.12e-04 | 322.74 ms | 52.3% bf16 MFU | 1624743 tok/s step 7834/19560 | loss 3.514471 (+0.96z)| norm 0.2868 (-0.05z)| lr 4.12e-04 | 322.65 ms | 52.3% bf16 MFU | 1624754 tok/s step 7835/19560 | loss 3.407466 (-1.59z)| norm 0.2634 (-0.42z)| lr 4.12e-04 | 322.84 ms | 52.3% bf16 MFU | 1624715 tok/s step 7836/19560 | loss 3.466027 (-0.17z)| norm 0.2767 (-0.21z)| lr 4.12e-04 | 322.79 ms | 52.3% bf16 MFU | 1624692 tok/s step 7837/19560 | loss 3.449988 (-0.58z)| norm 0.2591 (-0.49z)| lr 4.12e-04 | 322.62 ms | 52.3% bf16 MFU | 1624711 tok/s step 7838/19560 | loss 3.517545 (+1.12z)| norm 0.2647 (-0.40z)| lr 4.12e-04 | 322.69 ms | 52.3% bf16 MFU | 1624713 tok/s step 7839/19560 | loss 3.431109 (-1.08z)| norm 0.2673 (-0.36z)| lr 4.12e-04 | 322.94 ms | 52.3% bf16 MFU | 1624652 tok/s step 7840/19560 | loss 3.460984 (-0.31z)| norm 0.2727 (-0.27z)| lr 4.12e-04 | 322.81 ms | 52.3% bf16 MFU | 1624625 tok/s step 7841/19560 | loss 3.526651 (+1.35z)| norm 0.2639 (-0.40z)| lr 4.12e-04 | 322.36 ms | 52.4% bf16 MFU | 1624714 tok/s step 7842/19560 | loss 3.451733 (-0.55z)| norm 0.2629 (-0.48z)| lr 4.12e-04 | 323.21 ms | 52.2% bf16 MFU | 1624586 tok/s step 7843/19560 | 
loss 3.486268 (+0.33z)| norm 0.2770 (-0.16z)| lr 4.12e-04 | 323.02 ms | 52.2% bf16 MFU | 1624512 tok/s step 7844/19560 | loss 3.492863 (+0.50z)| norm 0.2903 (+0.15z)| lr 4.12e-04 | 322.68 ms | 52.3% bf16 MFU | 1624526 tok/s step 7845/19560 | loss 3.465366 (-0.20z)| norm 0.2508 (-0.74z)| lr 4.11e-04 | 322.66 ms | 52.3% bf16 MFU | 1624544 tok/s step 7846/19560 | loss 3.484667 (+0.32z)| norm 0.2658 (-0.40z)| lr 4.11e-04 | 322.79 ms | 52.3% bf16 MFU | 1624528 tok/s step 7847/19560 | loss 3.590823 (+3.00z)| norm 0.2627 (-0.46z)| lr 4.11e-04 | 323.10 ms | 52.2% bf16 MFU | 1624435 tok/s step 7848/19560 | loss 3.531599 (+1.46z)| norm 0.2948 (+0.26z)| lr 4.11e-04 | 322.22 ms | 52.4% bf16 MFU | 1624568 tok/s step 7849/19560 | loss 3.458889 (-0.37z)| norm 0.3025 (+0.43z)| lr 4.11e-04 | 322.75 ms | 52.3% bf16 MFU | 1624561 tok/s step 7850/19560 | loss 3.436411 (-0.93z)| norm 0.2779 (-0.12z)| lr 4.11e-04 | 323.14 ms | 52.2% bf16 MFU | 1624457 tok/s step 7851/19560 | loss 3.457641 (-0.38z)| norm 0.2688 (-0.33z)| lr 4.11e-04 | 322.92 ms | 52.3% bf16 MFU | 1624413 tok/s step 7852/19560 | loss 3.415754 (-1.43z)| norm 0.2954 (+0.27z)| lr 4.11e-04 | 322.80 ms | 52.3% bf16 MFU | 1624401 tok/s step 7853/19560 | loss 3.483060 (+0.28z)| norm 0.2926 (+0.21z)| lr 4.11e-04 | 322.88 ms | 52.3% bf16 MFU | 1624370 tok/s step 7854/19560 | loss 3.491656 (+0.52z)| norm 0.2860 (+0.06z)| lr 4.11e-04 | 322.51 ms | 52.3% bf16 MFU | 1624433 tok/s step 7855/19560 | loss 3.466112 (-0.13z)| norm 0.2772 (-0.14z)| lr 4.11e-04 | 322.57 ms | 52.3% bf16 MFU | 1624478 tok/s step 7856/19560 | loss 3.452226 (-0.49z)| norm 0.2628 (-0.47z)| lr 4.11e-04 | 322.65 ms | 52.3% bf16 MFU | 1624502 tok/s step 7857/19560 | loss 3.809952 (+6.89z)| norm 0.3638 (+1.78z)| lr 4.11e-04 | 322.37 ms | 52.4% bf16 MFU | 1624593 tok/s step 7858/19560 | loss 3.392169 (-1.65z)| norm 0.3247 (+0.89z)| lr 4.11e-04 | 322.80 ms | 52.3% bf16 MFU | 1624572 tok/s step 7859/19560 | loss 3.528068 (+1.09z)| norm 0.2883 (+0.08z)| lr 4.11e-04 | 
322.93 ms | 52.3% bf16 MFU | 1624519 tok/s step 7860/19560 | loss 3.503512 (+0.59z)| norm 0.2962 (+0.25z)| lr 4.11e-04 | 322.90 ms | 52.3% bf16 MFU | 1624477 tok/s step 7861/19560 | loss 3.444061 (-0.61z)| norm 0.2966 (+0.80z)| lr 4.11e-04 | 322.48 ms | 52.3% bf16 MFU | 1624543 tok/s step 7862/19560 | loss 3.470542 (-0.08z)| norm 0.3367 (+2.82z)| lr 4.11e-04 | 322.85 ms | 52.3% bf16 MFU | 1624512 tok/s step 7863/19560 | loss 3.470750 (-0.08z)| norm 0.2870 (+0.24z)| lr 4.11e-04 | 323.62 ms | 52.2% bf16 MFU | 1624290 tok/s step 7864/19560 | loss 3.474771 (+0.01z)| norm 0.2689 (-0.70z)| lr 4.11e-04 | 322.50 ms | 52.3% bf16 MFU | 1624361 tok/s step 7865/19560 | loss 3.494049 (+0.40z)| norm 0.2945 (+0.62z)| lr 4.11e-04 | 322.30 ms | 52.4% bf16 MFU | 1624479 tok/s step 7866/19560 | loss 3.463101 (-0.23z)| norm 0.2958 (+0.67z)| lr 4.11e-04 | 322.82 ms | 52.3% bf16 MFU | 1624458 tok/s step 7867/19560 | loss 3.477680 (+0.07z)| norm 0.2969 (+0.72z)| lr 4.10e-04 | 323.91 ms | 52.1% bf16 MFU | 1624166 tok/s step 7868/19560 | loss 3.430127 (-0.89z)| norm 0.2804 (-0.14z)| lr 4.10e-04 | 323.20 ms | 52.2% bf16 MFU | 1624068 tok/s step 7869/19560 | loss 3.487322 (+0.27z)| norm 0.3252 (+2.14z)| lr 4.10e-04 | 322.79 ms | 52.3% bf16 MFU | 1624076 tok/s step 7870/19560 | loss 3.458160 (-0.32z)| norm 0.2760 (-0.40z)| lr 4.10e-04 | 322.74 ms | 52.3% bf16 MFU | 1624096 tok/s step 7871/19560 | loss 3.478211 (+0.08z)| norm 0.2827 (-0.04z)| lr 4.10e-04 | 323.32 ms | 52.2% bf16 MFU | 1623970 tok/s step 7872/19560 | loss 3.429508 (-0.90z)| norm 0.3070 (+1.20z)| lr 4.10e-04 | 323.24 ms | 52.2% bf16 MFU | 1623872 tok/s step 7873/19560 | loss 3.613600 (+2.74z)| norm 0.2671 (-0.88z)| lr 4.10e-04 | 322.94 ms | 52.3% bf16 MFU | 1623852 tok/s step 7874/19560 | loss 3.433341 (-0.82z)| norm 0.3014 (+0.90z)| lr 4.10e-04 | 322.82 ms | 52.3% bf16 MFU | 1623863 tok/s step 7875/19560 | loss 3.525707 (+0.99z)| norm 0.2751 (-0.47z)| lr 4.10e-04 | 322.81 ms | 52.3% bf16 MFU | 1623878 tok/s step 7876/19560 | 
loss 3.457846 (-0.34z)| norm 0.2822 (-0.09z)| lr 4.10e-04 | 322.85 ms | 52.3% bf16 MFU | 1623882 tok/s step 7877/19560 | loss 3.409043 (-1.29z)| norm 0.2772 (-0.36z)| lr 4.10e-04 | 322.95 ms | 52.3% bf16 MFU | 1623859 tok/s step 7878/19560 | loss 3.417546 (-1.10z)| norm 0.2761 (-0.42z)| lr 4.10e-04 | 323.23 ms | 52.2% bf16 MFU | 1623767 tok/s step 7879/19560 | loss 3.508371 (+0.68z)| norm 0.2736 (-0.55z)| lr 4.10e-04 | 323.23 ms | 52.2% bf16 MFU | 1623681 tok/s step 7880/19560 | loss 3.470830 (-0.06z)| norm 0.2695 (-0.77z)| lr 4.10e-04 | 323.04 ms | 52.2% bf16 MFU | 1623647 tok/s step 7881/19560 | loss 3.413951 (-1.16z)| norm 0.2663 (-0.94z)| lr 4.10e-04 | 322.46 ms | 52.3% bf16 MFU | 1623758 tok/s step 7882/19560 | loss 3.525878 (+1.02z)| norm 0.2705 (-0.73z)| lr 4.10e-04 | 322.71 ms | 52.3% bf16 MFU | 1623803 tok/s step 7883/19560 | loss 3.478911 (+0.11z)| norm 0.2556 (-1.49z)| lr 4.10e-04 | 323.07 ms | 52.2% bf16 MFU | 1623756 tok/s step 7884/19560 | loss 3.449029 (-0.47z)| norm 0.2716 (-0.64z)| lr 4.10e-04 | 322.76 ms | 52.3% bf16 MFU | 1623787 tok/s step 7885/19560 | loss 3.470592 (-0.05z)| norm 0.2590 (-1.30z)| lr 4.10e-04 | 323.06 ms | 52.2% bf16 MFU | 1623741 tok/s step 7886/19560 | loss 3.488428 (+0.30z)| norm 0.2683 (-0.80z)| lr 4.10e-04 | 322.36 ms | 52.4% bf16 MFU | 1623874 tok/s step 7887/19560 | loss 3.554693 (+1.57z)| norm 0.2681 (-0.80z)| lr 4.10e-04 | 322.99 ms | 52.3% bf16 MFU | 1623843 tok/s step 7888/19560 | loss 3.462258 (-0.22z)| norm 0.2784 (-0.25z)| lr 4.09e-04 | 323.59 ms | 52.2% bf16 MFU | 1623662 tok/s step 7889/19560 | loss 3.480033 (+0.11z)| norm 0.2683 (-0.78z)| lr 4.09e-04 | 322.41 ms | 52.3% bf16 MFU | 1623788 tok/s step 7890/19560 | loss 3.436007 (-0.75z)| norm 0.2737 (-0.49z)| lr 4.09e-04 | 322.78 ms | 52.3% bf16 MFU | 1623814 tok/s step 7891/19560 | loss 3.369410 (-2.00z)| norm 0.2557 (-1.43z)| lr 4.09e-04 | 322.79 ms | 52.3% bf16 MFU | 1623835 tok/s step 7892/19560 | loss 3.550978 (+1.49z)| norm 0.2738 (-0.46z)| lr 4.09e-04 | 
322.47 ms | 52.3% bf16 MFU | 1623936 tok/s step 7893/19560 | loss 3.462432 (-0.21z)| norm 0.2626 (-1.05z)| lr 4.09e-04 | 323.19 ms | 52.2% bf16 MFU | 1623850 tok/s step 7894/19560 | loss 3.461431 (-0.23z)| norm 0.2847 (+0.12z)| lr 4.09e-04 | 323.08 ms | 52.2% bf16 MFU | 1623796 tok/s step 7895/19560 | loss 3.466129 (-0.15z)| norm 0.2701 (-0.66z)| lr 4.09e-04 | 322.69 ms | 52.3% bf16 MFU | 1623843 tok/s step 7896/19560 | loss 3.454420 (-0.38z)| norm 0.2959 (+0.71z)| lr 4.09e-04 | 322.98 ms | 52.3% bf16 MFU | 1623816 tok/s step 7897/19560 | loss 3.483143 (+0.18z)| norm 0.2787 (-0.21z)| lr 4.09e-04 | 322.96 ms | 52.3% bf16 MFU | 1623793 tok/s step 7898/19560 | loss 3.461788 (-0.24z)| norm 0.2661 (-0.88z)| lr 4.09e-04 | 322.72 ms | 52.3% bf16 MFU | 1623833 tok/s step 7899/19560 | loss 3.443616 (-0.58z)| norm 0.2935 (+0.59z)| lr 4.09e-04 | 322.91 ms | 52.3% bf16 MFU | 1623822 tok/s step 7900/19560 | loss 3.434251 (-0.77z)| norm 0.3012 (+0.99z)| lr 4.09e-04 | 322.60 ms | 52.3% bf16 MFU | 1623892 tok/s step 7901/19560 | loss 3.458102 (-0.30z)| norm 0.3112 (+1.51z)| lr 4.09e-04 | 322.43 ms | 52.3% bf16 MFU | 1624001 tok/s step 7902/19560 | loss 3.506633 (+0.69z)| norm 0.2812 (-0.08z)| lr 4.09e-04 | 322.46 ms | 52.3% bf16 MFU | 1624096 tok/s step 7903/19560 | loss 3.564315 (+1.82z)| norm 0.2962 (+0.72z)| lr 4.09e-04 | 323.10 ms | 52.2% bf16 MFU | 1624027 tok/s step 7904/19560 | loss 3.462658 (-0.21z)| norm 0.2831 (+0.01z)| lr 4.09e-04 | 322.79 ms | 52.3% bf16 MFU | 1624038 tok/s step 7905/19560 | loss 3.497016 (+0.47z)| norm 0.2686 (-0.75z)| lr 4.09e-04 | 322.82 ms | 52.3% bf16 MFU | 1624041 tok/s step 7906/19560 | loss 3.439740 (-0.69z)| norm 0.2970 (+0.76z)| lr 4.09e-04 | 322.39 ms | 52.3% bf16 MFU | 1624151 tok/s step 7907/19560 | loss 3.494539 (+0.41z)| norm 0.2495 (-1.75z)| lr 4.09e-04 | 322.71 ms | 52.3% bf16 MFU | 1624174 tok/s step 7908/19560 | loss 3.452461 (-0.43z)| norm 0.2792 (-0.18z)| lr 4.09e-04 | 323.00 ms | 52.3% bf16 MFU | 1624124 tok/s step 7909/19560 | 
loss 3.483327 (+0.21z)| norm 0.2671 (-0.82z)| lr 4.09e-04 | 322.67 ms | 52.3% bf16 MFU | 1624160 tok/s step 7910/19560 | loss 3.436212 (-0.76z)| norm 0.2578 (-1.30z)| lr 4.08e-04 | 322.82 ms | 52.3% bf16 MFU | 1624155 tok/s step 7911/19560 | loss 3.451707 (-0.44z)| norm 0.2757 (-0.35z)| lr 4.08e-04 | 322.84 ms | 52.3% bf16 MFU | 1624146 tok/s step 7912/19560 | loss 3.489891 (+0.35z)| norm 0.2515 (-1.60z)| lr 4.08e-04 | 322.18 ms | 52.4% bf16 MFU | 1624304 tok/s step 7913/19560 | loss 3.462612 (-0.23z)| norm 0.2772 (-0.24z)| lr 4.08e-04 | 322.75 ms | 52.3% bf16 MFU | 1624311 tok/s step 7914/19560 | loss 3.396452 (-1.58z)| norm 0.2385 (-2.23z)| lr 4.08e-04 | 322.50 ms | 52.3% bf16 MFU | 1624380 tok/s step 7915/19560 | loss 3.452547 (-0.41z)| norm 0.2587 (-1.16z)| lr 4.08e-04 | 322.28 ms | 52.4% bf16 MFU | 1624503 tok/s step 7916/19560 | loss 3.461627 (-0.22z)| norm 0.2555 (-1.30z)| lr 4.08e-04 | 322.56 ms | 52.3% bf16 MFU | 1624548 tok/s step 7917/19560 | loss 3.451636 (-0.41z)| norm 0.2905 (+0.50z)| lr 4.08e-04 | 323.01 ms | 52.3% bf16 MFU | 1624478 tok/s step 7918/19560 | loss 3.502649 (+0.64z)| norm 0.3042 (+1.20z)| lr 4.08e-04 | 322.44 ms | 52.3% bf16 MFU | 1624554 tok/s step 7919/19560 | loss 3.441085 (-0.63z)| norm 0.2663 (-0.76z)| lr 4.08e-04 | 324.15 ms | 52.1% bf16 MFU | 1624198 tok/s step 7920/19560 | loss 3.485659 (+0.30z)| norm 0.3004 (+0.99z)| lr 4.08e-04 | 322.01 ms | 52.4% bf16 MFU | 1624396 tok/s step 7921/19560 | loss 3.450046 (-0.44z)| norm 0.2835 (+0.11z)| lr 4.08e-04 | 322.78 ms | 52.3% bf16 MFU | 1624392 tok/s step 7922/19560 | loss 3.452549 (-0.39z)| norm 0.2796 (-0.09z)| lr 4.08e-04 | 322.87 ms | 52.3% bf16 MFU | 1624365 tok/s step 7923/19560 | loss 3.523495 (+1.07z)| norm 0.2937 (+0.63z)| lr 4.08e-04 | 322.67 ms | 52.3% bf16 MFU | 1624389 tok/s step 7924/19560 | loss 3.420128 (-1.07z)| norm 0.2852 (+0.19z)| lr 4.08e-04 | 322.41 ms | 52.3% bf16 MFU | 1624478 tok/s step 7925/19560 | loss 3.426541 (-0.93z)| norm 0.2849 (+0.17z)| lr 4.08e-04 | 
322.95 ms | 52.3% bf16 MFU | 1624427 tok/s step 7926/19560 | loss 3.494754 (+0.48z)| norm 0.2657 (-0.81z)| lr 4.08e-04 | 322.50 ms | 52.3% bf16 MFU | 1624490 tok/s step 7927/19560 | loss 3.502648 (+0.64z)| norm 0.2692 (-0.62z)| lr 4.08e-04 | 322.19 ms | 52.4% bf16 MFU | 1624629 tok/s step 7928/19560 | loss 3.576786 (+2.11z)| norm 0.2775 (-0.19z)| lr 4.08e-04 | 322.48 ms | 52.3% bf16 MFU | 1624688 tok/s step 7929/19560 | loss 3.497065 (+0.48z)| norm 0.2579 (-1.20z)| lr 4.08e-04 | 322.68 ms | 52.3% bf16 MFU | 1624692 tok/s step 7930/19560 | loss 3.515600 (+0.85z)| norm 0.2502 (-1.58z)| lr 4.08e-04 | 322.25 ms | 52.4% bf16 MFU | 1624806 tok/s step 7931/19560 | loss 3.415571 (-1.18z)| norm 0.2661 (-0.75z)| lr 4.07e-04 | 322.59 ms | 52.3% bf16 MFU | 1624828 tok/s step 7932/19560 | loss 3.469397 (-0.09z)| norm 0.3004 (+1.01z)| lr 4.07e-04 | 322.19 ms | 52.4% bf16 MFU | 1624950 tok/s step 7933/19560 | loss 3.453298 (-0.41z)| norm 0.2800 (-0.04z)| lr 4.07e-04 | 322.20 ms | 52.4% bf16 MFU | 1625063 tok/s step 7934/19560 | loss 3.431337 (-0.85z)| norm 0.2805 (-0.02z)| lr 4.07e-04 | 323.13 ms | 52.2% bf16 MFU | 1624936 tok/s step 7935/19560 | loss 3.431833 (-0.83z)| norm 0.2961 (+0.78z)| lr 4.07e-04 | 322.61 ms | 52.3% bf16 MFU | 1624946 tok/s step 7936/19560 | loss 3.506372 (+0.67z)| norm 0.3053 (+1.24z)| lr 4.07e-04 | 322.75 ms | 52.3% bf16 MFU | 1624921 tok/s step 7937/19560 | loss 3.485136 (+0.23z)| norm 0.2698 (-0.57z)| lr 4.07e-04 | 322.72 ms | 52.3% bf16 MFU | 1624905 tok/s step 7938/19560 | loss 3.441585 (-0.65z)| norm 0.2870 (+0.30z)| lr 4.07e-04 | 322.21 ms | 52.4% bf16 MFU | 1625017 tok/s step 7939/19560 | loss 3.414375 (-1.18z)| norm 0.2657 (-0.78z)| lr 4.07e-04 | 322.87 ms | 52.3% bf16 MFU | 1624957 tok/s step 7940/19560 | loss 3.528814 (+1.10z)| norm 0.2869 (+0.32z)| lr 4.07e-04 | 322.73 ms | 52.3% bf16 MFU | 1624936 tok/s step 7941/19560 | loss 3.497042 (+0.47z)| norm 0.2728 (-0.41z)| lr 4.07e-04 | 322.88 ms | 52.3% bf16 MFU | 1624879 tok/s step 7942/19560 | 
loss 3.354467 (-2.32z)| norm 0.2570 (-1.21z)| lr 4.07e-04 | 322.09 ms | 52.4% bf16 MFU | 1625023 tok/s step 7943/19560 | loss 3.471577 (-0.02z)| norm 0.2652 (-0.78z)| lr 4.07e-04 | 322.49 ms | 52.3% bf16 MFU | 1625058 tok/s step 7944/19560 | loss 3.479426 (+0.13z)| norm 0.2793 (-0.04z)| lr 4.07e-04 | 322.55 ms | 52.3% bf16 MFU | 1625077 tok/s step 7945/19560 | loss 3.451165 (-0.42z)| norm 0.2937 (+0.74z)| lr 4.07e-04 | 322.66 ms | 52.3% bf16 MFU | 1625069 tok/s step 7946/19560 | loss 3.495407 (+0.45z)| norm 0.2924 (+0.67z)| lr 4.07e-04 | 322.60 ms | 52.3% bf16 MFU | 1625075 tok/s step 7947/19560 | loss 3.425227 (-0.92z)| norm 0.2755 (-0.22z)| lr 4.07e-04 | 322.59 ms | 52.3% bf16 MFU | 1625084 tok/s step 7948/19560 | loss 3.574897 (+1.97z)| norm 0.2912 (+0.62z)| lr 4.07e-04 | 322.55 ms | 52.3% bf16 MFU | 1625103 tok/s step 7949/19560 | loss 3.468737 (-0.08z)| norm 0.2702 (-0.50z)| lr 4.07e-04 | 323.11 ms | 52.2% bf16 MFU | 1624979 tok/s step 7950/19560 | loss 3.518071 (+0.86z)| norm 0.2798 (+0.02z)| lr 4.07e-04 | 323.07 ms | 52.2% bf16 MFU | 1624873 tok/s step 7951/19560 | loss 3.508594 (+0.67z)| norm 0.2667 (-0.69z)| lr 4.07e-04 | 323.28 ms | 52.2% bf16 MFU | 1624717 tok/s step 7952/19560 | loss 3.419104 (-1.06z)| norm 0.2933 (+0.73z)| lr 4.07e-04 | 322.31 ms | 52.4% bf16 MFU | 1624814 tok/s step 7953/19560 | loss 3.493788 (+0.39z)| norm 0.3063 (+1.41z)| lr 4.06e-04 | 322.28 ms | 52.4% bf16 MFU | 1624914 tok/s step 7954/19560 | loss 3.477740 (+0.08z)| norm 0.2651 (-0.79z)| lr 4.06e-04 | 322.27 ms | 52.4% bf16 MFU | 1625011 tok/s step 7955/19560 | loss 3.494067 (+0.39z)| norm 0.2955 (+0.83z)| lr 4.06e-04 | 322.84 ms | 52.3% bf16 MFU | 1624960 tok/s step 7956/19560 | loss 3.483289 (+0.18z)| norm 0.2760 (-0.21z)| lr 4.06e-04 | 322.47 ms | 52.3% bf16 MFU | 1625006 tok/s step 7957/19560 | loss 3.498377 (+0.46z)| norm 0.2883 (+0.44z)| lr 4.06e-04 | 322.61 ms | 52.3% bf16 MFU | 1625012 tok/s step 7958/19560 | loss 3.498023 (+0.45z)| norm 0.2912 (+0.59z)| lr 4.06e-04 | 
322.78 ms | 52.3% bf16 MFU | 1624976 tok/s step 7959/19560 | loss 3.463426 (-0.23z)| norm 0.2719 (-0.44z)| lr 4.06e-04 | 322.39 ms | 52.4% bf16 MFU | 1625040 tok/s step 7960/19560 | loss 3.470291 (-0.09z)| norm 0.2823 (+0.14z)| lr 4.06e-04 | 322.47 ms | 52.3% bf16 MFU | 1625081 tok/s step 7961/19560 | loss 3.424570 (-0.97z)| norm 0.2887 (+0.49z)| lr 4.06e-04 | 322.12 ms | 52.4% bf16 MFU | 1625207 tok/s step 7962/19560 | loss 3.515980 (+0.81z)| norm 0.2987 (+1.04z)| lr 4.06e-04 | 322.24 ms | 52.4% bf16 MFU | 1625297 tok/s step 7963/19560 | loss 3.496860 (+0.43z)| norm 0.2802 (+0.00z)| lr 4.06e-04 | 323.14 ms | 52.2% bf16 MFU | 1625157 tok/s step 7964/19560 | loss 3.473211 (-0.04z)| norm 0.2782 (-0.11z)| lr 4.06e-04 | 322.54 ms | 52.3% bf16 MFU | 1625174 tok/s step 7965/19560 | loss 3.432247 (-0.84z)| norm 0.2781 (-0.12z)| lr 4.06e-04 | 322.67 ms | 52.3% bf16 MFU | 1625157 tok/s step 7966/19560 | loss 3.518589 (+0.85z)| norm 0.2946 (+0.79z)| lr 4.06e-04 | 322.44 ms | 52.3% bf16 MFU | 1625199 tok/s step 7967/19560 | loss 3.424242 (-0.99z)| norm 0.3130 (+1.78z)| lr 4.06e-04 | 322.71 ms | 52.3% bf16 MFU | 1625172 tok/s step 7968/19560 | loss 3.525758 (+0.98z)| norm 0.2934 (+0.68z)| lr 4.06e-04 | 322.50 ms | 52.3% bf16 MFU | 1625199 tok/s step 7969/19560 | loss 3.457010 (-0.35z)| norm 0.2797 (-0.08z)| lr 4.06e-04 | 322.76 ms | 52.3% bf16 MFU | 1625159 tok/s step 7970/19560 | loss 3.550543 (+1.45z)| norm 0.3051 (+1.31z)| lr 4.06e-04 | 322.88 ms | 52.3% bf16 MFU | 1625091 tok/s step 7971/19560 | loss 3.499519 (+0.46z)| norm 0.2934 (+0.65z)| lr 4.06e-04 | 322.47 ms | 52.3% bf16 MFU | 1625128 tok/s step 7972/19560 | loss 3.456284 (-0.37z)| norm 0.2842 (+0.14z)| lr 4.06e-04 | 322.33 ms | 52.4% bf16 MFU | 1625201 tok/s step 7973/19560 | loss 3.463039 (-0.24z)| norm 0.2632 (-1.03z)| lr 4.06e-04 | 322.68 ms | 52.3% bf16 MFU | 1625180 tok/s step 7974/19560 | loss 3.473336 (-0.04z)| norm 0.2976 (+0.88z)| lr 4.05e-04 | 322.16 ms | 52.4% bf16 MFU | 1625292 tok/s step 7975/19560 | 
loss 3.454445 (-0.39z)| norm 0.2814 (-0.04z)| lr 4.05e-04 | 322.52 ms | 52.3% bf16 MFU | 1625307 tok/s step 7976/19560 | loss 3.381644 (-1.80z)| norm 0.2617 (-1.12z)| lr 4.05e-04 | 322.39 ms | 52.4% bf16 MFU | 1625355 tok/s step 7977/19560 | loss 3.511580 (+0.74z)| norm 0.2896 (+0.45z)| lr 4.05e-04 | 322.58 ms | 52.3% bf16 MFU | 1625351 tok/s step 7978/19560 | loss 3.415625 (-1.13z)| norm 0.2800 (-0.09z)| lr 4.05e-04 | 322.57 ms | 52.3% bf16 MFU | 1625350 tok/s step 7979/19560 | loss 3.441562 (-0.62z)| norm 0.2828 (+0.06z)| lr 4.05e-04 | 322.53 ms | 52.3% bf16 MFU | 1625359 tok/s step 7980/19560 | loss 3.509588 (+0.69z)| norm 0.3176 (+1.97z)| lr 4.05e-04 | 322.24 ms | 52.4% bf16 MFU | 1625443 tok/s step 7981/19560 | loss 3.482393 (+0.16z)| norm 0.3089 (+1.47z)| lr 4.05e-04 | 322.45 ms | 52.3% bf16 MFU | 1625469 tok/s step 7982/19560 | loss 3.436014 (-0.73z)| norm 0.2972 (+0.82z)| lr 4.05e-04 | 322.38 ms | 52.4% bf16 MFU | 1625511 tok/s step 7983/19560 | loss 3.460808 (-0.25z)| norm 0.2776 (-0.25z)| lr 4.05e-04 | 322.30 ms | 52.4% bf16 MFU | 1625570 tok/s step 7984/19560 | loss 3.514491 (+0.79z)| norm 0.2624 (-1.08z)| lr 4.05e-04 | 322.77 ms | 52.3% bf16 MFU | 1625507 tok/s step 7985/19560 | loss 3.517333 (+1.08z)| norm 0.2709 (-0.63z)| lr 4.05e-04 | 322.85 ms | 52.3% bf16 MFU | 1625429 tok/s step 7986/19560 | loss 3.461009 (-0.27z)| norm 0.2658 (-0.93z)| lr 4.05e-04 | 322.85 ms | 52.3% bf16 MFU | 1625354 tok/s step 7987/19560 | loss 3.513745 (+1.01z)| norm 0.3021 (+1.28z)| lr 4.05e-04 | 322.56 ms | 52.3% bf16 MFU | 1625355 tok/s step 7988/19560 | loss 3.488988 (+0.41z)| norm 0.2646 (-0.99z)| lr 4.05e-04 | 322.70 ms | 52.3% bf16 MFU | 1625321 tok/s step 7989/19560 | loss 3.471036 (-0.03z)| norm 0.2722 (-0.51z)| lr 4.05e-04 | 322.19 ms | 52.4% bf16 MFU | 1625420 tok/s step 7990/19560 | loss 3.456442 (-0.38z)| norm 0.2828 (+0.16z)| lr 4.05e-04 | 322.64 ms | 52.3% bf16 MFU | 1625397 tok/s step 7991/19560 | loss 3.420627 (-1.24z)| norm 0.3316 (+3.13z)| lr 4.05e-04 | 
322.70 ms | 52.3% bf16 MFU | 1625361 tok/s step 7992/19560 | loss 3.410456 (-1.46z)| norm 0.2974 (+1.01z)| lr 4.05e-04 | 322.86 ms | 52.3% bf16 MFU | 1625287 tok/s step 7993/19560 | loss 3.444955 (-0.62z)| norm 0.3073 (+1.60z)| lr 4.05e-04 | 322.77 ms | 52.3% bf16 MFU | 1625239 tok/s step 7994/19560 | loss 3.463541 (-0.18z)| norm 0.2643 (-1.00z)| lr 4.05e-04 | 322.63 ms | 52.3% bf16 MFU | 1625231 tok/s step 7995/19560 | loss 3.501474 (+0.72z)| norm 0.2860 (+0.33z)| lr 4.05e-04 | 322.92 ms | 52.3% bf16 MFU | 1625148 tok/s step 7996/19560 | loss 3.452245 (-0.46z)| norm 0.2744 (-0.37z)| lr 4.04e-04 | 322.30 ms | 52.4% bf16 MFU | 1625225 tok/s step 7997/19560 | loss 3.500881 (+0.71z)| norm 0.2603 (-1.23z)| lr 4.04e-04 | 322.64 ms | 52.3% bf16 MFU | 1625214 tok/s step 7998/19560 | loss 3.494665 (+0.55z)| norm 0.2544 (-1.58z)| lr 4.04e-04 | 322.55 ms | 52.3% bf16 MFU | 1625226 tok/s step 7999/19560 | loss 3.515371 (+1.03z)| norm 0.2709 (-0.55z)| lr 4.04e-04 | 322.26 ms | 52.4% bf16 MFU | 1625311 tok/s step 8000/19560 | loss 3.477132 (+0.11z)| norm 0.2598 (-1.21z)| lr 4.04e-04 | 322.29 ms | 52.4% bf16 MFU | 1625382 tok/s val loss 3.448355 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating HellaSwag: 70/79 HellaSwag: 2846/10042 = 0.283410 step 8001/19560 | loss 3.452466 (-0.47z)| norm 0.2498 (-1.81z)| lr 4.04e-04 | 322.16 ms | 52.4% bf16 MFU | 1625485 tok/s step 8002/19560 | loss 3.538170 (+1.65z)| norm 0.2909 (+0.73z)| lr 4.04e-04 | 322.43 ms | 52.3% bf16 MFU | 1625514 tok/s step 8003/19560 | loss 3.476079 (+0.11z)| norm 0.2951 (+0.97z)| lr 4.04e-04 | 322.61 ms | 52.3% bf16 MFU | 1625496 tok/s step 8004/19560 | loss 3.437048 (-0.86z)| norm 0.2767 (-0.16z)| lr 4.04e-04 | 322.67 ms | 52.3% bf16 MFU | 1625464 tok/s step 8005/19560 | loss 3.526115 (+1.35z)| norm 0.2934 (+0.86z)| lr 4.04e-04 | 322.77 ms | 52.3% bf16 MFU | 
1625407 tok/s step 8006/19560 | loss 3.506467 (+0.84z)| norm 0.2755 (-0.24z)| lr 4.04e-04 | 322.62 ms | 52.3% bf16 MFU | 1625390 tok/s step 8007/19560 | loss 3.478049 (+0.13z)| norm 0.2800 (+0.03z)| lr 4.04e-04 | 322.48 ms | 52.3% bf16 MFU | 1625409 tok/s step 8008/19560 | loss 3.493196 (+0.51z)| norm 0.2489 (-1.85z)| lr 4.04e-04 | 322.76 ms | 52.3% bf16 MFU | 1625359 tok/s step 8009/19560 | loss 3.432867 (-1.02z)| norm 0.2821 (+0.16z)| lr 4.04e-04 | 322.61 ms | 52.3% bf16 MFU | 1625348 tok/s step 8010/19560 | loss 3.402009 (-1.77z)| norm 0.2598 (-1.19z)| lr 4.04e-04 | 322.32 ms | 52.4% bf16 MFU | 1625412 tok/s step 8011/19560 | loss 3.450325 (-0.54z)| norm 0.2756 (-0.24z)| lr 4.04e-04 | 322.16 ms | 52.4% bf16 MFU | 1625513 tok/s step 8012/19560 | loss 3.479403 (+0.18z)| norm 0.2761 (-0.21z)| lr 4.04e-04 | 322.50 ms | 52.3% bf16 MFU | 1625524 tok/s step 8013/19560 | loss 3.542377 (+1.74z)| norm 0.2973 (+1.07z)| lr 4.04e-04 | 322.73 ms | 52.3% bf16 MFU | 1625474 tok/s step 8014/19560 | loss 3.468172 (-0.11z)| norm 0.2647 (-0.93z)| lr 4.04e-04 | 322.59 ms | 52.3% bf16 MFU | 1625463 tok/s step 8015/19560 | loss 3.423878 (-1.20z)| norm 0.3008 (+1.26z)| lr 4.04e-04 | 322.29 ms | 52.4% bf16 MFU | 1625528 tok/s step 8016/19560 | loss 3.417363 (-1.35z)| norm 0.2784 (-0.10z)| lr 4.04e-04 | 322.59 ms | 52.3% bf16 MFU | 1625514 tok/s step 8017/19560 | loss 3.457446 (-0.34z)| norm 0.3094 (+1.75z)| lr 4.03e-04 | 322.65 ms | 52.3% bf16 MFU | 1625486 tok/s step 8018/19560 | loss 3.460347 (-0.27z)| norm 0.2514 (-1.72z)| lr 4.03e-04 | 322.77 ms | 52.3% bf16 MFU | 1625430 tok/s step 8019/19560 | loss 3.497928 (+0.66z)| norm 0.3159 (+2.08z)| lr 4.03e-04 | 322.93 ms | 52.3% bf16 MFU | 1625336 tok/s step 8020/19560 | loss 3.491471 (+0.52z)| norm 0.2789 (-0.11z)| lr 4.03e-04 | 322.08 ms | 52.4% bf16 MFU | 1625460 tok/s step 8021/19560 | loss 3.432832 (-1.00z)| norm 0.2923 (+0.67z)| lr 4.03e-04 | 322.36 ms | 52.4% bf16 MFU | 1625506 tok/s step 8022/19560 | loss 3.493039 (+0.55z)| norm 
0.3088 (+1.62z)| lr 4.03e-04 | 323.06 ms | 52.2% bf16 MFU | 1625375 tok/s step 8023/19560 | loss 3.436702 (-0.90z)| norm 0.2731 (-0.47z)| lr 4.03e-04 | 322.58 ms | 52.3% bf16 MFU | 1625372 tok/s step 8024/19560 | loss 3.439655 (-0.82z)| norm 0.2671 (-0.81z)| lr 4.03e-04 | 322.12 ms | 52.4% bf16 MFU | 1625485 tok/s step 8025/19560 | loss 3.432504 (-0.99z)| norm 0.2834 (+0.14z)| lr 4.03e-04 | 322.74 ms | 52.3% bf16 MFU | 1625435 tok/s step 8026/19560 | loss 3.442844 (-0.72z)| norm 0.2536 (-1.59z)| lr 4.03e-04 | 322.53 ms | 52.3% bf16 MFU | 1625440 tok/s step 8027/19560 | loss 3.420719 (-1.28z)| norm 0.4697 (+7.85z)| lr 4.03e-04 | 322.70 ms | 52.3% bf16 MFU | 1625402 tok/s step 8028/19560 | loss 3.489376 (+0.47z)| norm 0.2728 (-0.39z)| lr 4.03e-04 | 322.95 ms | 52.3% bf16 MFU | 1625304 tok/s step 8029/19560 | loss 3.528022 (+1.44z)| norm 0.2971 (+0.64z)| lr 4.03e-04 | 322.52 ms | 52.3% bf16 MFU | 1625317 tok/s step 8030/19560 | loss 3.424855 (-1.17z)| norm 0.2678 (-0.59z)| lr 4.03e-04 | 322.67 ms | 52.3% bf16 MFU | 1625294 tok/s step 8031/19560 | loss 3.447334 (-0.59z)| norm 0.2688 (-0.54z)| lr 4.03e-04 | 322.16 ms | 52.4% bf16 MFU | 1625399 tok/s step 8032/19560 | loss 3.475389 (+0.14z)| norm 0.2676 (-0.59z)| lr 4.03e-04 | 322.76 ms | 52.3% bf16 MFU | 1625348 tok/s step 8033/19560 | loss 3.469728 (-0.01z)| norm 0.2814 (-0.01z)| lr 4.03e-04 | 322.66 ms | 52.3% bf16 MFU | 1625326 tok/s step 8034/19560 | loss 3.437891 (-0.83z)| norm 0.2680 (-0.56z)| lr 4.03e-04 | 322.94 ms | 52.3% bf16 MFU | 1625235 tok/s step 8035/19560 | loss 3.451470 (-0.47z)| norm 0.2691 (-0.52z)| lr 4.03e-04 | 322.73 ms | 52.3% bf16 MFU | 1625199 tok/s step 8036/19560 | loss 3.436028 (-0.87z)| norm 0.2542 (-1.14z)| lr 4.03e-04 | 323.28 ms | 52.2% bf16 MFU | 1625029 tok/s step 8037/19560 | loss 3.428103 (-1.06z)| norm 0.2667 (-0.62z)| lr 4.03e-04 | 322.64 ms | 52.3% bf16 MFU | 1625028 tok/s step 8038/19560 | loss 3.497617 (+0.73z)| norm 0.5088 (+7.28z)| lr 4.02e-04 | 323.51 ms | 52.2% bf16 MFU | 
1624807 tok/s step 8039/19560 | loss 3.444383 (-0.65z)| norm 0.2998 (+0.53z)| lr 4.02e-04 | 322.49 ms | 52.3% bf16 MFU | 1624855 tok/s step 8040/19560 | loss 3.479012 (+0.25z)| norm 0.2769 (-0.22z)| lr 4.02e-04 | 322.99 ms | 52.3% bf16 MFU | 1624774 tok/s step 8041/19560 | loss 3.477558 (+0.21z)| norm 0.2847 (+0.03z)| lr 4.02e-04 | 322.75 ms | 52.3% bf16 MFU | 1624758 tok/s step 8042/19560 | loss 3.447442 (-0.59z)| norm 0.2732 (-0.35z)| lr 4.02e-04 | 322.51 ms | 52.3% bf16 MFU | 1624803 tok/s step 8043/19560 | loss 3.379241 (-2.31z)| norm 0.2724 (-0.38z)| lr 4.02e-04 | 323.06 ms | 52.2% bf16 MFU | 1624706 tok/s step 8044/19560 | loss 3.413341 (-1.42z)| norm 0.2670 (-0.56z)| lr 4.02e-04 | 322.47 ms | 52.3% bf16 MFU | 1624763 tok/s step 8045/19560 | loss 3.526695 (+1.44z)| norm 0.6381 (+8.06z)| lr 4.02e-04 | 322.26 ms | 52.4% bf16 MFU | 1624871 tok/s step 8046/19560 | loss 3.447789 (-0.54z)| norm 0.2692 (-0.40z)| lr 4.02e-04 | 323.67 ms | 52.1% bf16 MFU | 1624618 tok/s step 8047/19560 | loss 3.448811 (-0.52z)| norm 0.2490 (-0.86z)| lr 4.02e-04 | 322.89 ms | 52.3% bf16 MFU | 1624573 tok/s step 8048/19560 | loss 3.462379 (-0.17z)| norm 0.2662 (-0.46z)| lr 4.02e-04 | 322.60 ms | 52.3% bf16 MFU | 1624604 tok/s step 8049/19560 | loss 3.492195 (+0.58z)| norm 0.2785 (-0.18z)| lr 4.02e-04 | 322.96 ms | 52.3% bf16 MFU | 1624542 tok/s step 8050/19560 | loss 3.416892 (-1.31z)| norm 0.2666 (-0.45z)| lr 4.02e-04 | 322.62 ms | 52.3% bf16 MFU | 1624571 tok/s step 8051/19560 | loss 3.441258 (-0.69z)| norm 0.2732 (-0.29z)| lr 4.02e-04 | 323.17 ms | 52.2% bf16 MFU | 1624459 tok/s step 8052/19560 | loss 3.448405 (-0.51z)| norm 0.2523 (-0.76z)| lr 4.02e-04 | 322.38 ms | 52.4% bf16 MFU | 1624550 tok/s step 8053/19560 | loss 3.499894 (+0.78z)| norm 0.2951 (+0.21z)| lr 4.02e-04 | 323.23 ms | 52.2% bf16 MFU | 1624424 tok/s step 8054/19560 | loss 3.561571 (+2.30z)| norm 0.3005 (+0.33z)| lr 4.02e-04 | 322.91 ms | 52.3% bf16 MFU | 1624385 tok/s step 8055/19560 | loss 3.445504 (-0.60z)| norm 
0.2812 (-0.11z)| lr 4.02e-04 | 322.46 ms | 52.3% bf16 MFU | 1624461 tok/s step 8056/19560 | loss 3.478900 (+0.27z)| norm 0.2844 (-0.04z)| lr 4.02e-04 | 323.02 ms | 52.2% bf16 MFU | 1624392 tok/s step 8057/19560 | loss 3.494650 (+0.67z)| norm 0.3039 (+0.40z)| lr 4.02e-04 | 322.60 ms | 52.3% bf16 MFU | 1624431 tok/s step 8058/19560 | loss 3.553986 (+2.17z)| norm 0.2631 (-0.54z)| lr 4.02e-04 | 322.49 ms | 52.3% bf16 MFU | 1624496 tok/s step 8059/19560 | loss 3.445965 (-0.59z)| norm 0.2970 (+0.23z)| lr 4.01e-04 | 322.93 ms | 52.3% bf16 MFU | 1624447 tok/s step 8060/19560 | loss 3.445854 (-0.59z)| norm 0.2716 (-0.34z)| lr 4.01e-04 | 323.01 ms | 52.2% bf16 MFU | 1624380 tok/s step 8061/19560 | loss 3.431827 (-0.94z)| norm 0.2735 (-0.30z)| lr 4.01e-04 | 322.81 ms | 52.3% bf16 MFU | 1624368 tok/s step 8062/19560 | loss 3.492123 (+0.59z)| norm 0.2833 (-0.08z)| lr 4.01e-04 | 322.25 ms | 52.4% bf16 MFU | 1624496 tok/s step 8063/19560 | loss 3.444322 (-0.64z)| norm 0.3031 (+0.38z)| lr 4.01e-04 | 323.53 ms | 52.2% bf16 MFU | 1624297 tok/s step 8064/19560 | loss 3.468634 (-0.01z)| norm 0.2555 (-0.70z)| lr 4.01e-04 | 322.68 ms | 52.3% bf16 MFU | 1624321 tok/s step 8065/19560 | loss 3.413982 (-1.39z)| norm 0.2842 (-0.05z)| lr 4.01e-04 | 322.56 ms | 52.3% bf16 MFU | 1624375 tok/s step 8066/19560 | loss 3.484256 (+0.40z)| norm 0.2788 (-0.17z)| lr 4.01e-04 | 322.67 ms | 52.3% bf16 MFU | 1624398 tok/s step 8067/19560 | loss 3.503905 (+0.89z)| norm 0.2657 (-0.47z)| lr 4.01e-04 | 322.27 ms | 52.4% bf16 MFU | 1624521 tok/s step 8068/19560 | loss 3.496096 (+0.70z)| norm 0.2974 (+0.25z)| lr 4.01e-04 | 322.54 ms | 52.3% bf16 MFU | 1624571 tok/s step 8069/19560 | loss 3.460140 (-0.23z)| norm 0.2768 (-0.22z)| lr 4.01e-04 | 322.38 ms | 52.4% bf16 MFU | 1624656 tok/s step 8070/19560 | loss 3.460537 (-0.25z)| norm 0.2635 (-0.53z)| lr 4.01e-04 | 322.49 ms | 52.3% bf16 MFU | 1624710 tok/s step 8071/19560 | loss 3.465949 (-0.10z)| norm 0.2706 (-0.37z)| lr 4.01e-04 | 322.57 ms | 52.3% bf16 MFU | 
1624742 tok/s step 8072/19560 | loss 3.438629 (-0.82z)| norm 0.2426 (-1.00z)| lr 4.01e-04 | 322.55 ms | 52.3% bf16 MFU | 1624777 tok/s step 8073/19560 | loss 3.391223 (-2.05z)| norm 0.2608 (-0.57z)| lr 4.01e-04 | 322.45 ms | 52.3% bf16 MFU | 1624836 tok/s step 8074/19560 | loss 3.431296 (-0.98z)| norm 0.2468 (-0.88z)| lr 4.01e-04 | 322.44 ms | 52.3% bf16 MFU | 1624894 tok/s step 8075/19560 | loss 3.499957 (+0.82z)| norm 0.2521 (-0.76z)| lr 4.01e-04 | 322.79 ms | 52.3% bf16 MFU | 1624860 tok/s step 8076/19560 | loss 3.457575 (-0.29z)| norm 0.2401 (-1.01z)| lr 4.01e-04 | 321.82 ms | 52.4% bf16 MFU | 1625073 tok/s step 8077/19560 | loss 3.443516 (-0.66z)| norm 0.2721 (-0.29z)| lr 4.01e-04 | 322.71 ms | 52.3% bf16 MFU | 1625051 tok/s step 8078/19560 | loss 3.523494 (+1.51z)| norm 0.2581 (-0.60z)| lr 4.01e-04 | 322.77 ms | 52.3% bf16 MFU | 1625015 tok/s step 8079/19560 | loss 3.409162 (-1.56z)| norm 0.2535 (-0.70z)| lr 4.01e-04 | 322.57 ms | 52.3% bf16 MFU | 1625033 tok/s step 8080/19560 | loss 3.389805 (-2.06z)| norm 0.2553 (-0.66z)| lr 4.01e-04 | 322.30 ms | 52.4% bf16 MFU | 1625116 tok/s step 8081/19560 | loss 3.437672 (-0.77z)| norm 0.2628 (-0.48z)| lr 4.00e-04 | 322.43 ms | 52.3% bf16 MFU | 1625164 tok/s step 8082/19560 | loss 3.433413 (-0.87z)| norm 0.3414 (+1.27z)| lr 4.00e-04 | 322.66 ms | 52.3% bf16 MFU | 1625150 tok/s step 8083/19560 | loss 3.423380 (-1.12z)| norm 0.2818 (-0.06z)| lr 4.00e-04 | 322.55 ms | 52.3% bf16 MFU | 1625164 tok/s step 8084/19560 | loss 3.462369 (-0.08z)| norm 0.2665 (-0.40z)| lr 4.00e-04 | 323.17 ms | 52.2% bf16 MFU | 1625024 tok/s step 8085/19560 | loss 3.466531 (+0.04z)| norm 0.2691 (-0.34z)| lr 4.00e-04 | 322.68 ms | 52.3% bf16 MFU | 1625014 tok/s step 8086/19560 | loss 3.445737 (-0.51z)| norm 0.2621 (-0.49z)| lr 4.00e-04 | 322.70 ms | 52.3% bf16 MFU | 1624998 tok/s step 8087/19560 | loss 3.407174 (-1.51z)| norm 0.2860 (+0.04z)| lr 4.00e-04 | 322.61 ms | 52.3% bf16 MFU | 1625005 tok/s step 8088/19560 | loss 3.506322 (+1.10z)| norm 
0.2704 (-0.31z)| lr 4.00e-04 | 322.68 ms | 52.3% bf16 MFU | 1624994 tok/s step 8089/19560 | loss 3.426090 (-1.02z)| norm 0.2547 (-0.65z)| lr 4.00e-04 | 322.23 ms | 52.4% bf16 MFU | 1625098 tok/s step 8090/19560 | loss 3.444021 (-0.53z)| norm 0.2800 (-0.08z)| lr 4.00e-04 | 322.34 ms | 52.4% bf16 MFU | 1625169 tok/s step 8091/19560 | loss 3.463555 (-0.01z)| norm 0.2673 (-0.36z)| lr 4.00e-04 | 322.64 ms | 52.3% bf16 MFU | 1625161 tok/s step 8092/19560 | loss 3.528061 (+1.68z)| norm 0.2905 (+0.15z)| lr 4.00e-04 | 323.03 ms | 52.2% bf16 MFU | 1625053 tok/s step 8093/19560 | loss 3.482784 (+0.48z)| norm 0.2814 (-0.05z)| lr 4.00e-04 | 322.69 ms | 52.3% bf16 MFU | 1625039 tok/s step 8094/19560 | loss 3.523967 (+1.56z)| norm 0.3148 (+0.68z)| lr 4.00e-04 | 322.54 ms | 52.3% bf16 MFU | 1625061 tok/s step 8095/19560 | loss 3.449972 (-0.39z)| norm 0.3383 (+1.20z)| lr 4.00e-04 | 322.44 ms | 52.3% bf16 MFU | 1625109 tok/s step 8096/19560 | loss 3.498468 (+0.90z)| norm 0.3150 (+0.68z)| lr 4.00e-04 | 322.22 ms | 52.4% bf16 MFU | 1625209 tok/s step 8097/19560 | loss 3.500377 (+0.94z)| norm 0.2798 (-0.10z)| lr 4.00e-04 | 323.23 ms | 52.2% bf16 MFU | 1625050 tok/s step 8098/19560 | loss 3.457263 (-0.19z)| norm 0.3030 (+0.41z)| lr 4.00e-04 | 322.69 ms | 52.3% bf16 MFU | 1625034 tok/s step 8099/19560 | loss 3.426305 (-1.01z)| norm 0.2839 (-0.01z)| lr 4.00e-04 | 322.89 ms | 52.3% bf16 MFU | 1624969 tok/s step 8100/19560 | loss 3.476431 (+0.34z)| norm 0.2989 (+0.32z)| lr 4.00e-04 | 322.11 ms | 52.4% bf16 MFU | 1625105 tok/s step 8101/19560 | loss 3.479382 (+0.42z)| norm 0.2906 (+0.13z)| lr 4.00e-04 | 322.60 ms | 52.3% bf16 MFU | 1625108 tok/s step 8102/19560 | loss 3.508260 (+1.18z)| norm 0.2934 (+0.20z)| lr 3.99e-04 | 322.74 ms | 52.3% bf16 MFU | 1625077 tok/s step 8103/19560 | loss 3.486901 (+0.60z)| norm 0.2804 (-0.09z)| lr 3.99e-04 | 322.47 ms | 52.3% bf16 MFU | 1625114 tok/s step 8104/19560 | loss 3.489400 (+0.66z)| norm 0.3162 (+0.69z)| lr 3.99e-04 | 322.54 ms | 52.3% bf16 MFU | 
1625133 tok/s step 8105/19560 | loss 3.474257 (+0.25z)| norm 0.2978 (+0.28z)| lr 3.99e-04 | 322.80 ms | 52.3% bf16 MFU | 1625087 tok/s step 8106/19560 | loss 3.453030 (-0.34z)| norm 0.2583 (-0.59z)| lr 3.99e-04 | 322.36 ms | 52.4% bf16 MFU | 1625152 tok/s step 8107/19560 | loss 3.430432 (-0.96z)| norm 0.2748 (-0.22z)| lr 3.99e-04 | 322.14 ms | 52.4% bf16 MFU | 1625270 tok/s step 8108/19560 | loss 3.408405 (-1.54z)| norm 0.2630 (-0.47z)| lr 3.99e-04 | 322.88 ms | 52.3% bf16 MFU | 1625196 tok/s step 8109/19560 | loss 3.460008 (-0.12z)| norm 0.2592 (-0.55z)| lr 3.99e-04 | 322.52 ms | 52.3% bf16 MFU | 1625217 tok/s step 8110/19560 | loss 3.499227 (+0.95z)| norm 0.2607 (-0.51z)| lr 3.99e-04 | 322.76 ms | 52.3% bf16 MFU | 1625176 tok/s step 8111/19560 | loss 3.460575 (-0.11z)| norm 0.2670 (-0.37z)| lr 3.99e-04 | 323.03 ms | 52.2% bf16 MFU | 1625068 tok/s step 8112/19560 | loss 3.452050 (-0.34z)| norm 0.2769 (-0.15z)| lr 3.99e-04 | 322.11 ms | 52.4% bf16 MFU | 1625198 tok/s step 8113/19560 | loss 3.419296 (-1.23z)| norm 0.2544 (-0.64z)| lr 3.99e-04 | 322.40 ms | 52.3% bf16 MFU | 1625249 tok/s step 8114/19560 | loss 3.408864 (-1.50z)| norm 0.2602 (-0.52z)| lr 3.99e-04 | 322.96 ms | 52.3% bf16 MFU | 1625157 tok/s step 8115/19560 | loss 3.429491 (-0.91z)| norm 0.2940 (+0.23z)| lr 3.99e-04 | 322.63 ms | 52.3% bf16 MFU | 1625152 tok/s step 8116/19560 | loss 3.476048 (+0.38z)| norm 0.2634 (-0.44z)| lr 3.99e-04 | 322.40 ms | 52.3% bf16 MFU | 1625205 tok/s step 8117/19560 | loss 3.494967 (+0.90z)| norm 0.2663 (-0.38z)| lr 3.99e-04 | 322.52 ms | 52.3% bf16 MFU | 1625223 tok/s step 8118/19560 | loss 3.474776 (+0.34z)| norm 0.2964 (+0.28z)| lr 3.99e-04 | 322.77 ms | 52.3% bf16 MFU | 1625180 tok/s step 8119/19560 | loss 3.391868 (-1.93z)| norm 0.2607 (-0.49z)| lr 3.99e-04 | 322.35 ms | 52.4% bf16 MFU | 1625243 tok/s step 8120/19560 | loss 3.472376 (+0.26z)| norm 0.2917 (+0.19z)| lr 3.99e-04 | 322.66 ms | 52.3% bf16 MFU | 1625225 tok/s step 8121/19560 | loss 3.408832 (-1.47z)| norm 
0.2925 (+0.21z)| lr 3.99e-04 | 322.50 ms | 52.3% bf16 MFU | 1625250 tok/s step 8122/19560 | loss 3.384706 (-2.08z)| norm 0.2704 (-0.28z)| lr 3.99e-04 | 322.44 ms | 52.3% bf16 MFU | 1625286 tok/s step 8123/19560 | loss 3.538389 (+2.02z)| norm 0.2737 (-0.20z)| lr 3.98e-04 | 322.44 ms | 52.3% bf16 MFU | 1625322 tok/s step 8124/19560 | loss 3.459145 (-0.09z)| norm 0.2971 (+0.31z)| lr 3.98e-04 | 322.30 ms | 52.4% bf16 MFU | 1625391 tok/s step 8125/19560 | loss 3.507561 (+1.20z)| norm 0.2922 (+0.20z)| lr 3.98e-04 | 323.16 ms | 52.2% bf16 MFU | 1625241 tok/s step 8126/19560 | loss 3.483025 (+0.55z)| norm 0.2717 (-0.26z)| lr 3.98e-04 | 322.63 ms | 52.3% bf16 MFU | 1625232 tok/s step 8127/19560 | loss 3.411466 (-1.33z)| norm 0.3341 (+1.11z)| lr 3.98e-04 | 322.50 ms | 52.3% bf16 MFU | 1625256 tok/s step 8128/19560 | loss 3.413821 (-1.25z)| norm 0.2660 (-0.40z)| lr 3.98e-04 | 323.07 ms | 52.2% bf16 MFU | 1625136 tok/s step 8129/19560 | loss 3.398829 (-1.62z)| norm 0.2733 (-0.24z)| lr 3.98e-04 | 322.15 ms | 52.4% bf16 MFU | 1625251 tok/s step 8130/19560 | loss 3.438719 (-0.56z)| norm 0.2654 (-0.41z)| lr 3.98e-04 | 322.72 ms | 52.3% bf16 MFU | 1625217 tok/s step 8131/19560 | loss 3.432166 (-0.73z)| norm 0.2683 (-0.34z)| lr 3.98e-04 | 322.61 ms | 52.3% bf16 MFU | 1625214 tok/s step 8132/19560 | loss 3.470989 (+0.30z)| norm 0.2634 (-0.45z)| lr 3.98e-04 | 322.60 ms | 52.3% bf16 MFU | 1625213 tok/s step 8133/19560 | loss 3.531966 (+1.92z)| norm 0.2585 (-0.55z)| lr 3.98e-04 | 322.78 ms | 52.3% bf16 MFU | 1625166 tok/s step 8134/19560 | loss 3.494095 (+0.92z)| norm 0.2941 (+0.24z)| lr 3.98e-04 | 321.93 ms | 52.4% bf16 MFU | 1625338 tok/s step 8135/19560 | loss 3.429931 (-0.78z)| norm 0.2725 (-0.24z)| lr 3.98e-04 | 322.60 ms | 52.3% bf16 MFU | 1625330 tok/s step 8136/19560 | loss 3.456835 (-0.06z)| norm 0.2811 (-0.06z)| lr 3.98e-04 | 322.85 ms | 52.3% bf16 MFU | 1625260 tok/s step 8137/19560 | loss 3.473119 (+0.37z)| norm 0.2534 (-0.66z)| lr 3.98e-04 | 322.40 ms | 52.3% bf16 MFU | 
1625308 tok/s step 8138/19560 | loss 3.438217 (-0.58z)| norm 0.2815 (-0.05z)| lr 3.98e-04 | 322.45 ms | 52.3% bf16 MFU | 1625340 tok/s step 8139/19560 | loss 3.388414 (-1.88z)| norm 0.2697 (-0.31z)| lr 3.98e-04 | 322.28 ms | 52.4% bf16 MFU | 1625412 tok/s step 8140/19560 | loss 3.469594 (+0.28z)| norm 0.2541 (-0.65z)| lr 3.98e-04 | 322.69 ms | 52.3% bf16 MFU | 1625380 tok/s step 8141/19560 | loss 3.410595 (-1.28z)| norm 0.2896 (+0.14z)| lr 3.98e-04 | 322.87 ms | 52.3% bf16 MFU | 1625302 tok/s step 8142/19560 | loss 3.521964 (+1.69z)| norm 0.2739 (-0.21z)| lr 3.98e-04 | 322.83 ms | 52.3% bf16 MFU | 1625238 tok/s step 8143/19560 | loss 3.453334 (-0.14z)| norm 0.2861 (+0.06z)| lr 3.98e-04 | 322.68 ms | 52.3% bf16 MFU | 1625216 tok/s step 8144/19560 | loss 3.545900 (+2.27z)| norm 0.3006 (+0.38z)| lr 3.97e-04 | 322.67 ms | 52.3% bf16 MFU | 1625198 tok/s step 8145/19560 | loss 3.427080 (-0.85z)| norm 0.3305 (+1.03z)| lr 3.97e-04 | 323.02 ms | 52.2% bf16 MFU | 1625093 tok/s step 8146/19560 | loss 3.465783 (+0.16z)| norm 0.2852 (+0.03z)| lr 3.97e-04 | 322.61 ms | 52.3% bf16 MFU | 1625096 tok/s step 8147/19560 | loss 3.488647 (+0.77z)| norm 0.2979 (+0.31z)| lr 3.97e-04 | 322.16 ms | 52.4% bf16 MFU | 1625211 tok/s step 8148/19560 | loss 3.486408 (+0.71z)| norm 0.2847 (+0.02z)| lr 3.97e-04 | 322.91 ms | 52.3% bf16 MFU | 1625131 tok/s step 8149/19560 | loss 3.398932 (-1.57z)| norm 0.3099 (+0.57z)| lr 3.97e-04 | 323.21 ms | 52.2% bf16 MFU | 1624981 tok/s step 8150/19560 | loss 3.532151 (+1.88z)| norm 0.2736 (-0.22z)| lr 3.97e-04 | 323.04 ms | 52.2% bf16 MFU | 1624881 tok/s step 8151/19560 | loss 3.425650 (-0.87z)| norm 0.2783 (-0.12z)| lr 3.97e-04 | 322.20 ms | 52.4% bf16 MFU | 1624998 tok/s step 8152/19560 | loss 3.444342 (-0.39z)| norm 0.3045 (+0.46z)| lr 3.97e-04 | 322.21 ms | 52.4% bf16 MFU | 1625106 tok/s step 8153/19560 | loss 3.522467 (+1.59z)| norm 0.2843 (+0.01z)| lr 3.97e-04 | 322.65 ms | 52.3% bf16 MFU | 1625098 tok/s step 8154/19560 | loss 3.467944 (+0.20z)| norm 
0.2810 (-0.07z)| lr 3.97e-04 | 322.58 ms | 52.3% bf16 MFU | 1625107 tok/s step 8155/19560 | loss 3.424740 (-0.91z)| norm 0.2898 (+0.17z)| lr 3.97e-04 | 322.67 ms | 52.3% bf16 MFU | 1625095 tok/s step 8156/19560 | loss 3.465559 (+0.14z)| norm 0.2694 (-0.32z)| lr 3.97e-04 | 322.29 ms | 52.4% bf16 MFU | 1625178 tok/s step 8157/19560 | loss 3.489519 (+0.77z)| norm 0.2864 (+0.09z)| lr 3.97e-04 | 322.38 ms | 52.4% bf16 MFU | 1625234 tok/s step 8158/19560 | loss 3.476874 (+0.43z)| norm 0.2630 (-0.47z)| lr 3.97e-04 | 322.72 ms | 52.3% bf16 MFU | 1625202 tok/s step 8159/19560 | loss 3.492916 (+0.84z)| norm 0.2789 (-0.09z)| lr 3.97e-04 | 322.88 ms | 52.3% bf16 MFU | 1625132 tok/s step 8160/19560 | loss 3.445937 (-0.37z)| norm 0.2505 (-0.76z)| lr 3.97e-04 | 322.36 ms | 52.4% bf16 MFU | 1625194 tok/s step 8161/19560 | loss 3.472691 (+0.32z)| norm 0.2669 (-0.37z)| lr 3.97e-04 | 322.16 ms | 52.4% bf16 MFU | 1625305 tok/s step 8162/19560 | loss 3.474415 (+0.36z)| norm 0.2798 (-0.07z)| lr 3.97e-04 | 322.58 ms | 52.3% bf16 MFU | 1625305 tok/s step 8163/19560 | loss 3.391366 (-1.76z)| norm 0.2622 (-0.48z)| lr 3.97e-04 | 322.14 ms | 52.4% bf16 MFU | 1625417 tok/s step 8164/19560 | loss 3.436774 (-0.60z)| norm 0.2683 (-0.34z)| lr 3.97e-04 | 322.26 ms | 52.4% bf16 MFU | 1625491 tok/s step 8165/19560 | loss 3.446352 (-0.36z)| norm 0.2812 (-0.04z)| lr 3.96e-04 | 322.72 ms | 52.3% bf16 MFU | 1625446 tok/s step 8166/19560 | loss 3.518652 (+1.49z)| norm 0.2765 (-0.12z)| lr 3.96e-04 | 322.41 ms | 52.3% bf16 MFU | 1625482 tok/s step 8167/19560 | loss 3.416748 (-1.11z)| norm 0.2812 (+0.01z)| lr 3.96e-04 | 322.62 ms | 52.3% bf16 MFU | 1625462 tok/s step 8168/19560 | loss 3.456068 (-0.10z)| norm 0.2982 (+0.47z)| lr 3.96e-04 | 322.40 ms | 52.3% bf16 MFU | 1625499 tok/s step 8169/19560 | loss 3.416970 (-1.08z)| norm 0.2514 (-0.79z)| lr 3.96e-04 | 322.46 ms | 52.3% bf16 MFU | 1625519 tok/s step 8170/19560 | loss 3.431715 (-0.70z)| norm 0.2981 (+0.46z)| lr 3.96e-04 | 322.51 ms | 52.3% bf16 MFU | 
1625526 tok/s step 8171/19560 | loss 3.370661 (-2.24z)| norm 0.2576 (-0.63z)| lr 3.96e-04 | 322.72 ms | 52.3% bf16 MFU | 1625479 tok/s step 8172/19560 | loss 3.565393 (+2.59z)| norm 0.2689 (-0.32z)| lr 3.96e-04 | 322.64 ms | 52.3% bf16 MFU | 1625456 tok/s step 8173/19560 | loss 3.433549 (-0.66z)| norm 0.2745 (-0.18z)| lr 3.96e-04 | 322.42 ms | 52.3% bf16 MFU | 1625488 tok/s step 8174/19560 | loss 3.512874 (+1.30z)| norm 0.2851 (+0.37z)| lr 3.96e-04 | 322.60 ms | 52.3% bf16 MFU | 1625473 tok/s step 8175/19560 | loss 3.481140 (+0.51z)| norm 0.2944 (+0.83z)| lr 3.96e-04 | 322.85 ms | 52.3% bf16 MFU | 1625396 tok/s step 8176/19560 | loss 3.446451 (-0.35z)| norm 0.2828 (+0.22z)| lr 3.96e-04 | 322.50 ms | 52.3% bf16 MFU | 1625411 tok/s step 8177/19560 | loss 3.652543 (+4.36z)| norm 0.3201 (+2.12z)| lr 3.96e-04 | 322.42 ms | 52.3% bf16 MFU | 1625445 tok/s step 8178/19560 | loss 3.410560 (-1.17z)| norm 0.2988 (+1.01z)| lr 3.96e-04 | 322.27 ms | 52.4% bf16 MFU | 1625515 tok/s step 8179/19560 | loss 3.456918 (-0.11z)| norm 0.3133 (+1.72z)| lr 3.96e-04 | 322.68 ms | 52.3% bf16 MFU | 1625479 tok/s step 8180/19560 | loss 3.492445 (+0.69z)| norm 0.2930 (+0.67z)| lr 3.96e-04 | 322.68 ms | 52.3% bf16 MFU | 1625444 tok/s step 8181/19560 | loss 3.485914 (+0.54z)| norm 0.3188 (+1.95z)| lr 3.96e-04 | 322.64 ms | 52.3% bf16 MFU | 1625421 tok/s step 8182/19560 | loss 3.446149 (-0.35z)| norm 0.3178 (+1.88z)| lr 3.96e-04 | 322.89 ms | 52.3% bf16 MFU | 1625337 tok/s step 8183/19560 | loss 3.450830 (-0.24z)| norm 0.3219 (+2.03z)| lr 3.96e-04 | 322.36 ms | 52.4% bf16 MFU | 1625390 tok/s step 8184/19560 | loss 3.406577 (-1.25z)| norm 0.2994 (+0.92z)| lr 3.96e-04 | 323.02 ms | 52.2% bf16 MFU | 1625275 tok/s step 8185/19560 | loss 3.322572 (-3.06z)| norm 0.3046 (+1.18z)| lr 3.96e-04 | 322.91 ms | 52.3% bf16 MFU | 1625193 tok/s step 8186/19560 | loss 3.458709 (+0.00z)| norm 0.2681 (-0.61z)| lr 3.96e-04 | 322.74 ms | 52.3% bf16 MFU | 1625157 tok/s step 8187/19560 | loss 3.431992 (-0.60z)| norm 
0.3119 (+1.52z)| lr 3.95e-04 | 322.82 ms | 52.3% bf16 MFU | 1625104 tok/s step 8188/19560 | loss 3.531547 (+1.63z)| norm 0.2878 (+0.34z)| lr 3.95e-04 | 322.80 ms | 52.3% bf16 MFU | 1625058 tok/s step 8189/19560 | loss 3.513242 (+1.20z)| norm 0.2823 (+0.07z)| lr 3.95e-04 | 323.26 ms | 52.2% bf16 MFU | 1624900 tok/s step 8190/19560 | loss 3.446017 (-0.30z)| norm 0.2967 (+0.76z)| lr 3.95e-04 | 322.37 ms | 52.4% bf16 MFU | 1624972 tok/s step 8191/19560 | loss 3.455307 (-0.09z)| norm 0.2884 (+0.37z)| lr 3.95e-04 | 322.83 ms | 52.3% bf16 MFU | 1624924 tok/s step 8192/19560 | loss 3.471393 (+0.27z)| norm 0.2761 (-0.24z)| lr 3.95e-04 | 323.19 ms | 52.2% bf16 MFU | 1624789 tok/s step 8193/19560 | loss 3.496039 (+0.81z)| norm 0.2995 (+0.90z)| lr 3.95e-04 | 322.83 ms | 52.3% bf16 MFU | 1624752 tok/s step 8194/19560 | loss 3.459618 (-0.01z)| norm 0.2838 (+0.13z)| lr 3.95e-04 | 322.64 ms | 52.3% bf16 MFU | 1624765 tok/s step 8195/19560 | loss 3.445241 (-0.32z)| norm 0.2619 (-0.94z)| lr 3.95e-04 | 323.06 ms | 52.2% bf16 MFU | 1624671 tok/s step 8196/19560 | loss 3.451730 (-0.17z)| norm 0.2790 (-0.10z)| lr 3.95e-04 | 322.76 ms | 52.3% bf16 MFU | 1624656 tok/s step 8197/19560 | loss 3.562030 (+2.27z)| norm 0.3020 (+1.02z)| lr 3.95e-04 | 322.50 ms | 52.3% bf16 MFU | 1624709 tok/s step 8198/19560 | loss 3.439908 (-0.44z)| norm 0.3012 (+0.96z)| lr 3.95e-04 | 323.03 ms | 52.2% bf16 MFU | 1624625 tok/s step 8199/19560 | loss 3.452605 (-0.16z)| norm 0.3012 (+0.95z)| lr 3.95e-04 | 323.27 ms | 52.2% bf16 MFU | 1624486 tok/s step 8200/19560 | loss 3.391875 (-1.49z)| norm 0.3171 (+1.70z)| lr 3.95e-04 | 322.79 ms | 52.3% bf16 MFU | 1624472 tok/s step 8201/19560 | loss 3.391084 (-1.50z)| norm 0.2809 (-0.08z)| lr 3.95e-04 | 322.45 ms | 52.3% bf16 MFU | 1624547 tok/s step 8202/19560 | loss 3.444769 (-0.32z)| norm 0.2700 (-0.62z)| lr 3.95e-04 | 323.52 ms | 52.2% bf16 MFU | 1624350 tok/s step 8203/19560 | loss 3.436304 (-0.50z)| norm 0.2710 (-0.59z)| lr 3.95e-04 | 323.45 ms | 52.2% bf16 MFU | 
1624178 tok/s step 8204/19560 | loss 3.407697 (-1.12z)| norm 0.2585 (-1.23z)| lr 3.95e-04 | 322.67 ms | 52.3% bf16 MFU | 1624212 tok/s step 8205/19560 | loss 3.468642 (+0.22z)| norm 0.2573 (-1.28z)| lr 3.95e-04 | 322.84 ms | 52.3% bf16 MFU | 1624200 tok/s step 8206/19560 | loss 3.447823 (-0.23z)| norm 0.2703 (-0.63z)| lr 3.95e-04 | 323.08 ms | 52.2% bf16 MFU | 1624128 tok/s step 8207/19560 | loss 3.351173 (-2.32z)| norm 0.2798 (-0.17z)| lr 3.95e-04 | 322.83 ms | 52.3% bf16 MFU | 1624122 tok/s step 8208/19560 | loss 3.575961 (+2.50z)| norm 0.2528 (-1.54z)| lr 3.94e-04 | 322.27 ms | 52.4% bf16 MFU | 1624258 tok/s step 8209/19560 | loss 3.504668 (+0.96z)| norm 0.2839 (+0.03z)| lr 3.94e-04 | 322.85 ms | 52.3% bf16 MFU | 1624242 tok/s step 8210/19560 | loss 3.467369 (+0.16z)| norm 0.2764 (-0.34z)| lr 3.94e-04 | 322.37 ms | 52.4% bf16 MFU | 1624348 tok/s step 8211/19560 | loss 3.468399 (+0.17z)| norm 0.2725 (-0.54z)| lr 3.94e-04 | 323.06 ms | 52.2% bf16 MFU | 1624274 tok/s step 8212/19560 | loss 3.470813 (+0.22z)| norm 0.2901 (+0.38z)| lr 3.94e-04 | 322.93 ms | 52.3% bf16 MFU | 1624236 tok/s step 8213/19560 | loss 3.471918 (+0.25z)| norm 0.2830 (+0.00z)| lr 3.94e-04 | 322.90 ms | 52.3% bf16 MFU | 1624209 tok/s step 8214/19560 | loss 3.497344 (+0.78z)| norm 0.2758 (-0.39z)| lr 3.94e-04 | 321.94 ms | 52.4% bf16 MFU | 1624426 tok/s step 8215/19560 | loss 3.408718 (-1.11z)| norm 0.2896 (+0.34z)| lr 3.94e-04 | 322.83 ms | 52.3% bf16 MFU | 1624407 tok/s step 8216/19560 | loss 3.374466 (-1.81z)| norm 0.2922 (+0.47z)| lr 3.94e-04 | 323.32 ms | 52.2% bf16 MFU | 1624266 tok/s step 8217/19560 | loss 3.451830 (-0.17z)| norm 0.2970 (+0.72z)| lr 3.94e-04 | 322.80 ms | 52.3% bf16 MFU | 1624263 tok/s step 8218/19560 | loss 3.470942 (+0.23z)| norm 0.3041 (+1.08z)| lr 3.94e-04 | 322.74 ms | 52.3% bf16 MFU | 1624275 tok/s step 8219/19560 | loss 3.449311 (-0.23z)| norm 0.2993 (+0.81z)| lr 3.94e-04 | 322.86 ms | 52.3% bf16 MFU | 1624256 tok/s step 8220/19560 | loss 3.442408 (-0.36z)| norm 
0.3072 (+1.22z)| lr 3.94e-04 | 323.18 ms | 52.2% bf16 MFU | 1624156 tok/s step 8221/19560 | loss 3.489245 (+0.64z)| norm 0.2815 (-0.14z)| lr 3.94e-04 | 322.35 ms | 52.4% bf16 MFU | 1624270 tok/s step 8222/19560 | loss 3.415968 (-0.92z)| norm 0.3331 (+2.55z)| lr 3.94e-04 | 322.83 ms | 52.3% bf16 MFU | 1624260 tok/s step 8223/19560 | loss 3.469101 (+0.22z)| norm 0.2502 (-1.78z)| lr 3.94e-04 | 323.59 ms | 52.2% bf16 MFU | 1624059 tok/s step 8224/19560 | loss 3.477499 (+0.41z)| norm 0.2987 (+0.82z)| lr 3.94e-04 | 322.99 ms | 52.3% bf16 MFU | 1624018 tok/s step 8225/19560 | loss 3.456656 (-0.04z)| norm 0.2444 (-2.06z)| lr 3.94e-04 | 322.81 ms | 52.3% bf16 MFU | 1624024 tok/s step 8226/19560 | loss 3.507570 (+1.05z)| norm 0.2868 (+0.19z)| lr 3.94e-04 | 322.45 ms | 52.3% bf16 MFU | 1624120 tok/s step 8227/19560 | loss 3.454659 (-0.09z)| norm 0.2663 (-0.88z)| lr 3.94e-04 | 322.49 ms | 52.3% bf16 MFU | 1624202 tok/s step 8228/19560 | loss 3.473277 (+0.31z)| norm 0.2671 (-0.83z)| lr 3.94e-04 | 322.45 ms | 52.3% bf16 MFU | 1624290 tok/s step 8229/19560 | loss 3.470618 (+0.25z)| norm 0.2510 (-1.65z)| lr 3.93e-04 | 322.48 ms | 52.3% bf16 MFU | 1624367 tok/s step 8230/19560 | loss 3.441821 (-0.36z)| norm 0.2855 (+0.17z)| lr 3.93e-04 | 322.87 ms | 52.3% bf16 MFU | 1624340 tok/s step 8231/19560 | loss 3.435473 (-0.49z)| norm 0.2688 (-0.71z)| lr 3.93e-04 | 322.78 ms | 52.3% bf16 MFU | 1624338 tok/s step 8232/19560 | loss 3.468102 (+0.23z)| norm 0.2689 (-0.69z)| lr 3.93e-04 | 322.42 ms | 52.3% bf16 MFU | 1624426 tok/s step 8233/19560 | loss 3.353908 (-2.19z)| norm 0.2570 (-1.30z)| lr 3.93e-04 | 322.95 ms | 52.3% bf16 MFU | 1624376 tok/s step 8234/19560 | loss 3.504952 (+1.01z)| norm 0.2643 (-0.91z)| lr 3.93e-04 | 322.75 ms | 52.3% bf16 MFU | 1624380 tok/s step 8235/19560 | loss 3.433596 (-0.50z)| norm 0.2571 (-1.28z)| lr 3.93e-04 | 322.50 ms | 52.3% bf16 MFU | 1624445 tok/s step 8236/19560 | loss 3.392683 (-1.36z)| norm 0.2586 (-1.20z)| lr 3.93e-04 | 322.59 ms | 52.3% bf16 MFU | 
1624484 tok/s step 8237/19560 | loss 3.420137 (-0.77z)| norm 0.2714 (-0.54z)| lr 3.93e-04 | 322.61 ms | 52.3% bf16 MFU | 1624516 tok/s step 8238/19560 | loss 3.458055 (+0.03z)| norm 0.2680 (-0.72z)| lr 3.93e-04 | 323.51 ms | 52.2% bf16 MFU | 1624323 tok/s step 8239/19560 | loss 3.473243 (+0.35z)| norm 0.2574 (-1.27z)| lr 3.93e-04 | 322.92 ms | 52.3% bf16 MFU | 1624287 tok/s step 8240/19560 | loss 3.442923 (-0.29z)| norm 0.2868 (+0.27z)| lr 3.93e-04 | 323.22 ms | 52.2% bf16 MFU | 1624176 tok/s step 8241/19560 | loss 3.415781 (-0.86z)| norm 0.2550 (-1.40z)| lr 3.93e-04 | 323.36 ms | 52.2% bf16 MFU | 1624036 tok/s step 8242/19560 | loss 3.526121 (+1.45z)| norm 0.2962 (+0.75z)| lr 3.93e-04 | 322.48 ms | 52.3% bf16 MFU | 1624124 tok/s step 8243/19560 | loss 3.472931 (+0.32z)| norm 0.2672 (-0.77z)| lr 3.93e-04 | 322.70 ms | 52.3% bf16 MFU | 1624152 tok/s step 8244/19560 | loss 3.436982 (-0.43z)| norm 0.2940 (+0.64z)| lr 3.93e-04 | 323.09 ms | 52.2% bf16 MFU | 1624080 tok/s step 8245/19560 | loss 3.447840 (-0.19z)| norm 0.3418 (+3.03z)| lr 3.93e-04 | 322.15 ms | 52.4% bf16 MFU | 1624250 tok/s step 8246/19560 | loss 3.453453 (-0.07z)| norm 0.2846 (+0.11z)| lr 3.93e-04 | 323.28 ms | 52.2% bf16 MFU | 1624126 tok/s step 8247/19560 | loss 3.415910 (-0.87z)| norm 0.3131 (+1.54z)| lr 3.93e-04 | 322.86 ms | 52.3% bf16 MFU | 1624115 tok/s step 8248/19560 | loss 3.430942 (-0.55z)| norm 0.3371 (+2.67z)| lr 3.93e-04 | 322.34 ms | 52.4% bf16 MFU | 1624234 tok/s step 8249/19560 | loss 3.489754 (+0.69z)| norm 0.2824 (-0.03z)| lr 3.93e-04 | 322.94 ms | 52.3% bf16 MFU | 1624196 tok/s step 8250/19560 | loss 3.484787 (+0.57z)| norm 0.3119 (+1.40z)| lr 3.92e-04 | 322.46 ms | 52.3% bf16 MFU | 1624281 tok/s val loss 3.443434 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating HellaSwag: 70/79 HellaSwag: 2853/10042 = 0.284107 step 8251/19560 | 
loss 3.441388 (-0.35z)| norm 0.2853 (+0.09z)| lr 3.92e-04 | 322.24 ms | 52.4% bf16 MFU | 1624418 tok/s step 8252/19560 | loss 3.451162 (-0.13z)| norm 0.2757 (-0.38z)| lr 3.92e-04 | 322.07 ms | 52.4% bf16 MFU | 1624590 tok/s step 8253/19560 | loss 3.435404 (-0.46z)| norm 0.2975 (+0.69z)| lr 3.92e-04 | 322.92 ms | 52.3% bf16 MFU | 1624540 tok/s step 8254/19560 | loss 3.493716 (+0.80z)| norm 0.2902 (+0.33z)| lr 3.92e-04 | 321.77 ms | 52.5% bf16 MFU | 1624781 tok/s step 8255/19560 | loss 3.476844 (+0.42z)| norm 0.2902 (+0.35z)| lr 3.92e-04 | 322.44 ms | 52.3% bf16 MFU | 1624842 tok/s step 8256/19560 | loss 3.433516 (-0.52z)| norm 0.2866 (+0.16z)| lr 3.92e-04 | 322.33 ms | 52.4% bf16 MFU | 1624928 tok/s step 8257/19560 | loss 3.427677 (-0.66z)| norm 0.2818 (-0.08z)| lr 3.92e-04 | 322.90 ms | 52.3% bf16 MFU | 1624867 tok/s step 8258/19560 | loss 3.479217 (+0.46z)| norm 0.2582 (-1.27z)| lr 3.92e-04 | 322.49 ms | 52.3% bf16 MFU | 1624911 tok/s step 8259/19560 | loss 3.450292 (-0.17z)| norm 0.3237 (+1.99z)| lr 3.92e-04 | 322.66 ms | 52.3% bf16 MFU | 1624911 tok/s step 8260/19560 | loss 3.436181 (-0.48z)| norm 0.2526 (-1.55z)| lr 3.92e-04 | 321.68 ms | 52.5% bf16 MFU | 1625156 tok/s step 8261/19560 | loss 3.496125 (+0.85z)| norm 0.2574 (-1.30z)| lr 3.92e-04 | 322.73 ms | 52.3% bf16 MFU | 1625125 tok/s step 8262/19560 | loss 3.564317 (+2.30z)| norm 0.2528 (-1.51z)| lr 3.92e-04 | 322.48 ms | 52.3% bf16 MFU | 1625158 tok/s step 8263/19560 | loss 3.425432 (-0.71z)| norm 0.2811 (-0.12z)| lr 3.92e-04 | 322.17 ms | 52.4% bf16 MFU | 1625269 tok/s step 8264/19560 | loss 3.445041 (-0.28z)| norm 0.2653 (-0.88z)| lr 3.92e-04 | 322.89 ms | 52.3% bf16 MFU | 1625192 tok/s step 8265/19560 | loss 3.466116 (+0.18z)| norm 0.2772 (-0.31z)| lr 3.92e-04 | 322.44 ms | 52.3% bf16 MFU | 1625233 tok/s step 8266/19560 | loss 3.430400 (-0.60z)| norm 0.2539 (-1.44z)| lr 3.92e-04 | 322.67 ms | 52.3% bf16 MFU | 1625214 tok/s step 8267/19560 | loss 3.488580 (+0.65z)| norm 0.2647 (-0.91z)| lr 3.92e-04 | 
322.56 ms | 52.3% bf16 MFU | 1625224 tok/s step 8268/19560 | loss 3.459253 (+0.01z)| norm 0.2640 (-0.95z)| lr 3.92e-04 | 322.51 ms | 52.3% bf16 MFU | 1625246 tok/s step 8269/19560 | loss 3.464638 (+0.12z)| norm 0.2582 (-1.22z)| lr 3.92e-04 | 322.36 ms | 52.4% bf16 MFU | 1625305 tok/s step 8270/19560 | loss 3.457935 (-0.01z)| norm 0.2967 (+0.66z)| lr 3.92e-04 | 322.28 ms | 52.4% bf16 MFU | 1625380 tok/s step 8271/19560 | loss 3.448553 (-0.22z)| norm 0.2823 (-0.05z)| lr 3.91e-04 | 322.88 ms | 52.3% bf16 MFU | 1625301 tok/s step 8272/19560 | loss 3.475574 (+0.39z)| norm 0.2959 (+0.62z)| lr 3.91e-04 | 322.62 ms | 52.3% bf16 MFU | 1625292 tok/s step 8273/19560 | loss 3.433386 (-0.55z)| norm 0.2920 (+0.46z)| lr 3.91e-04 | 322.46 ms | 52.3% bf16 MFU | 1625322 tok/s step 8274/19560 | loss 3.461739 (+0.08z)| norm 0.2734 (-0.47z)| lr 3.91e-04 | 322.69 ms | 52.3% bf16 MFU | 1625292 tok/s step 8275/19560 | loss 3.538874 (+1.79z)| norm 0.2931 (+0.52z)| lr 3.91e-04 | 322.80 ms | 52.3% bf16 MFU | 1625236 tok/s step 8276/19560 | loss 3.419496 (-0.85z)| norm 0.2729 (-0.49z)| lr 3.91e-04 | 322.41 ms | 52.3% bf16 MFU | 1625281 tok/s step 8277/19560 | loss 3.518191 (+1.32z)| norm 0.2860 (+0.18z)| lr 3.91e-04 | 322.55 ms | 52.3% bf16 MFU | 1625289 tok/s step 8278/19560 | loss 3.457811 (-0.01z)| norm 0.2856 (+0.15z)| lr 3.91e-04 | 322.08 ms | 52.4% bf16 MFU | 1625414 tok/s step 8279/19560 | loss 3.435988 (-0.50z)| norm 0.2908 (+0.41z)| lr 3.91e-04 | 322.19 ms | 52.4% bf16 MFU | 1625507 tok/s step 8280/19560 | loss 3.493035 (+0.77z)| norm 0.2942 (+0.58z)| lr 3.91e-04 | 322.31 ms | 52.4% bf16 MFU | 1625565 tok/s step 8281/19560 | loss 3.469747 (+0.26z)| norm 0.3451 (+3.02z)| lr 3.91e-04 | 322.68 ms | 52.3% bf16 MFU | 1625525 tok/s step 8282/19560 | loss 3.453058 (-0.11z)| norm 0.3160 (+1.57z)| lr 3.91e-04 | 322.37 ms | 52.4% bf16 MFU | 1625567 tok/s step 8283/19560 | loss 3.435590 (-0.51z)| norm 0.2724 (-0.52z)| lr 3.91e-04 | 322.19 ms | 52.4% bf16 MFU | 1625652 tok/s step 8284/19560 | 
loss 3.427620 (-0.68z)| norm 0.3598 (+3.49z)| lr 3.91e-04 | 322.71 ms | 52.3% bf16 MFU | 1625602 tok/s step 8285/19560 | loss 3.425292 (-0.72z)| norm 0.2887 (+0.22z)| lr 3.91e-04 | 322.55 ms | 52.3% bf16 MFU | 1625595 tok/s step 8286/19560 | loss 3.404144 (-1.18z)| norm 0.2773 (-0.31z)| lr 3.91e-04 | 322.85 ms | 52.3% bf16 MFU | 1625513 tok/s step 8287/19560 | loss 3.492144 (+0.79z)| norm 0.2990 (+0.68z)| lr 3.91e-04 | 322.83 ms | 52.3% bf16 MFU | 1625438 tok/s step 8288/19560 | loss 3.394749 (-1.38z)| norm 0.2680 (-0.76z)| lr 3.91e-04 | 322.40 ms | 52.3% bf16 MFU | 1625478 tok/s step 8289/19560 | loss 3.418315 (-0.84z)| norm 0.2821 (-0.11z)| lr 3.91e-04 | 322.36 ms | 52.4% bf16 MFU | 1625524 tok/s step 8290/19560 | loss 3.464115 (+0.18z)| norm 0.2726 (-0.55z)| lr 3.91e-04 | 322.57 ms | 52.3% bf16 MFU | 1625515 tok/s step 8291/19560 | loss 3.432455 (-0.53z)| norm 0.2758 (-0.41z)| lr 3.91e-04 | 322.47 ms | 52.3% bf16 MFU | 1625532 tok/s step 8292/19560 | loss 3.466100 (+0.21z)| norm 0.2679 (-0.77z)| lr 3.90e-04 | 322.11 ms | 52.4% bf16 MFU | 1625640 tok/s step 8293/19560 | loss 3.453865 (-0.06z)| norm 0.2797 (-0.22z)| lr 3.90e-04 | 322.65 ms | 52.3% bf16 MFU | 1625606 tok/s step 8294/19560 | loss 3.436904 (-0.43z)| norm 0.2667 (-0.82z)| lr 3.90e-04 | 322.19 ms | 52.4% bf16 MFU | 1625688 tok/s step 8295/19560 | loss 3.431001 (-0.57z)| norm 0.2851 (+0.03z)| lr 3.90e-04 | 322.57 ms | 52.3% bf16 MFU | 1625671 tok/s step 8296/19560 | loss 3.442141 (-0.31z)| norm 0.2752 (-0.42z)| lr 3.90e-04 | 323.22 ms | 52.2% bf16 MFU | 1625492 tok/s step 8297/19560 | loss 3.424779 (-0.71z)| norm 0.2695 (-0.70z)| lr 3.90e-04 | 322.73 ms | 52.3% bf16 MFU | 1625444 tok/s step 8298/19560 | loss 3.373734 (-1.83z)| norm 0.2448 (-1.81z)| lr 3.90e-04 | 322.25 ms | 52.4% bf16 MFU | 1625521 tok/s step 8299/19560 | loss 3.451361 (-0.11z)| norm 0.2745 (-0.45z)| lr 3.90e-04 | 322.64 ms | 52.3% bf16 MFU | 1625494 tok/s step 8300/19560 | loss 3.366256 (-2.02z)| norm 0.3837 (+4.26z)| lr 3.90e-04 | 
322.50 ms | 52.3% bf16 MFU | 1625503 tok/s step 8301/19560 | loss 3.446693 (-0.18z)| norm 0.3164 (+1.33z)| lr 3.90e-04 | 323.06 ms | 52.2% bf16 MFU | 1625373 tok/s step 8302/19560 | loss 3.451704 (-0.06z)| norm 0.2883 (+0.12z)| lr 3.90e-04 | 322.63 ms | 52.3% bf16 MFU | 1625356 tok/s step 8303/19560 | loss 3.488733 (+0.79z)| norm 0.3006 (+0.65z)| lr 3.90e-04 | 322.57 ms | 52.3% bf16 MFU | 1625356 tok/s step 8304/19560 | loss 3.447015 (-0.17z)| norm 0.2805 (-0.21z)| lr 3.90e-04 | 322.62 ms | 52.3% bf16 MFU | 1625344 tok/s step 8305/19560 | loss 3.460317 (+0.19z)| norm 0.2831 (-0.09z)| lr 3.90e-04 | 322.54 ms | 52.3% bf16 MFU | 1625350 tok/s step 8306/19560 | loss 3.548315 (+2.34z)| norm 0.2760 (-0.39z)| lr 3.90e-04 | 322.47 ms | 52.3% bf16 MFU | 1625375 tok/s step 8307/19560 | loss 3.443582 (-0.25z)| norm 0.2913 (+0.28z)| lr 3.90e-04 | 322.80 ms | 52.3% bf16 MFU | 1625315 tok/s step 8308/19560 | loss 3.423398 (-0.74z)| norm 0.2818 (-0.12z)| lr 3.90e-04 | 322.53 ms | 52.3% bf16 MFU | 1625326 tok/s step 8309/19560 | loss 3.495597 (+1.05z)| norm 0.2649 (-0.85z)| lr 3.90e-04 | 322.57 ms | 52.3% bf16 MFU | 1625327 tok/s step 8310/19560 | loss 3.424668 (-0.70z)| norm 0.3143 (+1.32z)| lr 3.90e-04 | 323.02 ms | 52.2% bf16 MFU | 1625216 tok/s step 8311/19560 | loss 3.451561 (-0.04z)| norm 0.2890 (+0.22z)| lr 3.90e-04 | 323.30 ms | 52.2% bf16 MFU | 1625038 tok/s step 8312/19560 | loss 3.447221 (-0.15z)| norm 0.2694 (-0.64z)| lr 3.90e-04 | 321.93 ms | 52.4% bf16 MFU | 1625215 tok/s step 8313/19560 | loss 3.423456 (-0.80z)| norm 0.3266 (+1.87z)| lr 3.89e-04 | 322.64 ms | 52.3% bf16 MFU | 1625204 tok/s step 8314/19560 | loss 3.417074 (-0.95z)| norm 0.2729 (-0.48z)| lr 3.89e-04 | 322.64 ms | 52.3% bf16 MFU | 1625193 tok/s step 8315/19560 | loss 3.401819 (-1.33z)| norm 0.2853 (+0.07z)| lr 3.89e-04 | 322.42 ms | 52.3% bf16 MFU | 1625239 tok/s step 8316/19560 | loss 3.541657 (+2.24z)| norm 0.2748 (-0.39z)| lr 3.89e-04 | 322.27 ms | 52.4% bf16 MFU | 1625319 tok/s step 8317/19560 | 
loss 3.423695 (-0.75z)| norm 0.2877 (+0.18z)| lr 3.89e-04 | 322.45 ms | 52.3% bf16 MFU | 1625350 tok/s step 8318/19560 | loss 3.485181 (+0.81z)| norm 0.2659 (-0.77z)| lr 3.89e-04 | 322.61 ms | 52.3% bf16 MFU | 1625340 tok/s step 8319/19560 | loss 3.430441 (-0.58z)| norm 0.2594 (-1.05z)| lr 3.89e-04 | 323.41 ms | 52.2% bf16 MFU | 1625130 tok/s step 8320/19560 | loss 3.460008 (+0.18z)| norm 0.2706 (-0.55z)| lr 3.89e-04 | 322.69 ms | 52.3% bf16 MFU | 1625111 tok/s step 8321/19560 | loss 3.464454 (+0.30z)| norm 0.2633 (-0.86z)| lr 3.89e-04 | 322.24 ms | 52.4% bf16 MFU | 1625206 tok/s step 8322/19560 | loss 3.421218 (-0.80z)| norm 0.2634 (-0.84z)| lr 3.89e-04 | 322.46 ms | 52.3% bf16 MFU | 1625240 tok/s step 8323/19560 | loss 3.417117 (-0.90z)| norm 0.2666 (-0.71z)| lr 3.89e-04 | 322.66 ms | 52.3% bf16 MFU | 1625224 tok/s step 8324/19560 | loss 3.481019 (+0.73z)| norm 0.2595 (-1.01z)| lr 3.89e-04 | 322.16 ms | 52.4% bf16 MFU | 1625332 tok/s step 8325/19560 | loss 3.426050 (-0.67z)| norm 0.2570 (-1.10z)| lr 3.89e-04 | 322.78 ms | 52.3% bf16 MFU | 1625280 tok/s step 8326/19560 | loss 3.422550 (-0.76z)| norm 0.2497 (-1.39z)| lr 3.89e-04 | 322.48 ms | 52.3% bf16 MFU | 1625306 tok/s step 8327/19560 | loss 3.470056 (+0.49z)| norm 0.2734 (-0.36z)| lr 3.89e-04 | 322.45 ms | 52.3% bf16 MFU | 1625338 tok/s step 8328/19560 | loss 3.429636 (-0.59z)| norm 0.2719 (-0.41z)| lr 3.89e-04 | 322.17 ms | 52.4% bf16 MFU | 1625438 tok/s step 8329/19560 | loss 3.360088 (-2.39z)| norm 0.2865 (+0.23z)| lr 3.89e-04 | 323.33 ms | 52.2% bf16 MFU | 1625242 tok/s step 8330/19560 | loss 3.481439 (+0.77z)| norm 0.2852 (+0.16z)| lr 3.89e-04 | 323.14 ms | 52.2% bf16 MFU | 1625104 tok/s step 8331/19560 | loss 3.462125 (+0.26z)| norm 0.2800 (-0.07z)| lr 3.89e-04 | 322.78 ms | 52.3% bf16 MFU | 1625063 tok/s step 8332/19560 | loss 3.455955 (+0.09z)| norm 0.2711 (-0.46z)| lr 3.89e-04 | 322.27 ms | 52.4% bf16 MFU | 1625152 tok/s step 8333/19560 | loss 3.474870 (+0.59z)| norm 0.2888 (+0.31z)| lr 3.89e-04 | 
322.22 ms | 52.4% bf16 MFU | 1625250 tok/s step 8334/19560 | loss 3.387517 (-1.67z)| norm 0.2887 (+0.30z)| lr 3.88e-04 | 322.71 ms | 52.3% bf16 MFU | 1625219 tok/s step 8335/19560 | loss 3.455197 (+0.06z)| norm 0.2809 (-0.05z)| lr 3.88e-04 | 323.22 ms | 52.2% bf16 MFU | 1625062 tok/s step 8336/19560 | loss 3.446591 (-0.14z)| norm 0.2532 (-1.27z)| lr 3.88e-04 | 322.78 ms | 52.3% bf16 MFU | 1625022 tok/s step 8337/19560 | loss 3.460354 (+0.25z)| norm 0.2634 (-0.81z)| lr 3.88e-04 | 322.01 ms | 52.4% bf16 MFU | 1625179 tok/s step 8338/19560 | loss 3.473516 (+0.62z)| norm 0.2571 (-1.08z)| lr 3.88e-04 | 323.09 ms | 52.2% bf16 MFU | 1625057 tok/s step 8339/19560 | loss 3.478887 (+0.76z)| norm 0.2641 (-0.77z)| lr 3.88e-04 | 323.02 ms | 52.2% bf16 MFU | 1624959 tok/s step 8340/19560 | loss 3.432971 (-0.51z)| norm 0.2811 (-0.02z)| lr 3.88e-04 | 322.58 ms | 52.3% bf16 MFU | 1624975 tok/s step 8341/19560 | loss 3.457366 (+0.17z)| norm 0.2757 (-0.25z)| lr 3.88e-04 | 322.30 ms | 52.4% bf16 MFU | 1625061 tok/s step 8342/19560 | loss 3.459638 (+0.25z)| norm 0.2827 (+0.05z)| lr 3.88e-04 | 322.82 ms | 52.3% bf16 MFU | 1625012 tok/s step 8343/19560 | loss 3.489995 (+1.09z)| norm 0.2728 (-0.38z)| lr 3.88e-04 | 322.65 ms | 52.3% bf16 MFU | 1625008 tok/s step 8344/19560 | loss 3.508508 (+1.59z)| norm 0.2579 (-1.02z)| lr 3.88e-04 | 322.26 ms | 52.4% bf16 MFU | 1625102 tok/s step 8345/19560 | loss 3.443472 (-0.26z)| norm 0.2896 (+0.37z)| lr 3.88e-04 | 322.76 ms | 52.3% bf16 MFU | 1625066 tok/s step 8346/19560 | loss 3.435963 (-0.46z)| norm 0.2652 (-0.68z)| lr 3.88e-04 | 323.27 ms | 52.2% bf16 MFU | 1624904 tok/s step 8347/19560 | loss 3.522331 (+1.95z)| norm 0.3056 (+1.08z)| lr 3.88e-04 | 322.66 ms | 52.3% bf16 MFU | 1624904 tok/s step 8348/19560 | loss 3.464010 (+0.31z)| norm 0.2734 (-0.32z)| lr 3.88e-04 | 322.54 ms | 52.3% bf16 MFU | 1624934 tok/s step 8349/19560 | loss 3.549232 (+2.63z)| norm 0.2669 (-0.60z)| lr 3.88e-04 | 322.22 ms | 52.4% bf16 MFU | 1625042 tok/s step 8350/19560 | 
loss 3.453987 (+0.01z)| norm 0.2788 (-0.05z)| lr 3.88e-04 | 322.66 ms | 52.3% bf16 MFU | 1625034 tok/s step 8351/19560 | loss 3.399072 (-1.48z)| norm 0.2821 (+0.08z)| lr 3.88e-04 | 322.77 ms | 52.3% bf16 MFU | 1625000 tok/s step 8352/19560 | loss 3.499687 (+1.26z)| norm 0.2684 (-0.53z)| lr 3.88e-04 | 322.46 ms | 52.3% bf16 MFU | 1625046 tok/s step 8353/19560 | loss 3.460122 (+0.18z)| norm 0.3415 (+2.70z)| lr 3.88e-04 | 322.61 ms | 52.3% bf16 MFU | 1625050 tok/s step 8354/19560 | loss 3.436177 (-0.46z)| norm 0.3797 (+4.07z)| lr 3.88e-04 | 322.35 ms | 52.4% bf16 MFU | 1625119 tok/s step 8355/19560 | loss 3.442657 (-0.28z)| norm 0.3004 (+0.77z)| lr 3.87e-04 | 322.70 ms | 52.3% bf16 MFU | 1625098 tok/s step 8356/19560 | loss 3.446002 (-0.18z)| norm 0.3599 (+3.09z)| lr 3.87e-04 | 322.87 ms | 52.3% bf16 MFU | 1625034 tok/s step 8357/19560 | loss 3.438862 (-0.37z)| norm 0.3052 (+0.89z)| lr 3.87e-04 | 322.94 ms | 52.3% bf16 MFU | 1624956 tok/s step 8358/19560 | loss 3.428980 (-0.63z)| norm 0.3066 (+0.93z)| lr 3.87e-04 | 322.35 ms | 52.4% bf16 MFU | 1625030 tok/s step 8359/19560 | loss 3.458006 (+0.16z)| norm 0.3317 (+1.89z)| lr 3.87e-04 | 322.88 ms | 52.3% bf16 MFU | 1624968 tok/s step 8360/19560 | loss 3.473414 (+0.58z)| norm 0.2978 (+0.55z)| lr 3.87e-04 | 322.36 ms | 52.4% bf16 MFU | 1625041 tok/s step 8361/19560 | loss 3.446155 (-0.20z)| norm 0.3264 (+1.64z)| lr 3.87e-04 | 323.12 ms | 52.2% bf16 MFU | 1624918 tok/s step 8362/19560 | loss 3.429260 (-0.66z)| norm 0.2723 (-0.48z)| lr 3.87e-04 | 322.39 ms | 52.4% bf16 MFU | 1624985 tok/s step 8363/19560 | loss 3.558977 (+2.90z)| norm 0.2869 (+0.09z)| lr 3.87e-04 | 322.62 ms | 52.3% bf16 MFU | 1624991 tok/s step 8364/19560 | loss 3.424964 (-0.80z)| norm 0.2973 (+0.49z)| lr 3.87e-04 | 322.88 ms | 52.3% bf16 MFU | 1624930 tok/s step 8365/19560 | loss 3.440032 (-0.39z)| norm 0.2777 (-0.29z)| lr 3.87e-04 | 322.48 ms | 52.3% bf16 MFU | 1624974 tok/s step 8366/19560 | loss 3.442032 (-0.33z)| norm 0.2970 (+0.46z)| lr 3.87e-04 | 
322.94 ms | 52.3% bf16 MFU | 1624900 tok/s step 8367/19560 | loss 3.459357 (+0.16z)| norm 0.2823 (-0.13z)| lr 3.87e-04 | 322.97 ms | 52.3% bf16 MFU | 1624821 tok/s step 8368/19560 | loss 3.523827 (+1.91z)| norm 0.2950 (+0.38z)| lr 3.87e-04 | 322.53 ms | 52.3% bf16 MFU | 1624858 tok/s step 8369/19560 | loss 3.507865 (+1.44z)| norm 0.2891 (+0.13z)| lr 3.87e-04 | 322.55 ms | 52.3% bf16 MFU | 1624886 tok/s step 8370/19560 | loss 3.432399 (-0.61z)| norm 0.2716 (-0.56z)| lr 3.87e-04 | 322.31 ms | 52.4% bf16 MFU | 1624975 tok/s step 8371/19560 | loss 3.550349 (+2.58z)| norm 0.2833 (-0.10z)| lr 3.87e-04 | 322.50 ms | 52.3% bf16 MFU | 1625012 tok/s step 8372/19560 | loss 3.495560 (+1.08z)| norm 0.2737 (-0.48z)| lr 3.87e-04 | 322.87 ms | 52.3% bf16 MFU | 1624954 tok/s step 8373/19560 | loss 3.441838 (-0.36z)| norm 0.2707 (-0.58z)| lr 3.87e-04 | 322.46 ms | 52.3% bf16 MFU | 1625002 tok/s step 8374/19560 | loss 3.450577 (-0.13z)| norm 0.3016 (+0.67z)| lr 3.87e-04 | 322.89 ms | 52.3% bf16 MFU | 1624939 tok/s step 8375/19560 | loss 3.460181 (+0.12z)| norm 0.3008 (+0.64z)| lr 3.87e-04 | 322.97 ms | 52.3% bf16 MFU | 1624860 tok/s step 8376/19560 | loss 3.440199 (-0.42z)| norm 0.2842 (-0.02z)| lr 3.86e-04 | 322.83 ms | 52.3% bf16 MFU | 1624819 tok/s step 8377/19560 | loss 3.453549 (-0.05z)| norm 0.3026 (+0.74z)| lr 3.86e-04 | 323.32 ms | 52.2% bf16 MFU | 1624657 tok/s step 8378/19560 | loss 3.417519 (-1.01z)| norm 0.2948 (+0.42z)| lr 3.86e-04 | 322.31 ms | 52.4% bf16 MFU | 1624756 tok/s step 8379/19560 | loss 3.423790 (-0.84z)| norm 0.2538 (-1.27z)| lr 3.86e-04 | 323.22 ms | 52.2% bf16 MFU | 1624622 tok/s step 8380/19560 | loss 3.426159 (-0.76z)| norm 0.2910 (+0.27z)| lr 3.86e-04 | 323.22 ms | 52.2% bf16 MFU | 1624494 tok/s step 8381/19560 | loss 3.455189 (+0.01z)| norm 0.2755 (-0.37z)| lr 3.86e-04 | 322.78 ms | 52.3% bf16 MFU | 1624485 tok/s step 8382/19560 | loss 3.450104 (-0.12z)| norm 0.2601 (-0.99z)| lr 3.86e-04 | 322.67 ms | 52.3% bf16 MFU | 1624502 tok/s step 8383/19560 | 
loss 3.397811 (-1.50z)| norm 0.2652 (-0.77z)| lr 3.86e-04 | 322.37 ms | 52.4% bf16 MFU | 1624596 tok/s step 8384/19560 | loss 3.462017 (+0.22z)| norm 0.2761 (-0.32z)| lr 3.86e-04 | 323.14 ms | 52.2% bf16 MFU | 1624489 tok/s step 8385/19560 | loss 3.600127 (+3.69z)| norm 0.2781 (-0.23z)| lr 3.86e-04 | 323.01 ms | 52.3% bf16 MFU | 1624422 tok/s step 8386/19560 | loss 3.478111 (+0.58z)| norm 0.2736 (-0.43z)| lr 3.86e-04 | 322.39 ms | 52.3% bf16 MFU | 1624513 tok/s step 8387/19560 | loss 3.343314 (-2.75z)| norm 0.2551 (-1.18z)| lr 3.86e-04 | 322.70 ms | 52.3% bf16 MFU | 1624522 tok/s step 8388/19560 | loss 3.444894 (-0.24z)| norm 0.2962 (+0.52z)| lr 3.86e-04 | 323.21 ms | 52.2% bf16 MFU | 1624402 tok/s step 8389/19560 | loss 3.455651 (+0.04z)| norm 0.2813 (-0.11z)| lr 3.86e-04 | 322.86 ms | 52.3% bf16 MFU | 1624377 tok/s step 8390/19560 | loss 3.444383 (-0.23z)| norm 0.2638 (-0.85z)| lr 3.86e-04 | 322.80 ms | 52.3% bf16 MFU | 1624367 tok/s step 8391/19560 | loss 3.467017 (+0.34z)| norm 0.3029 (+0.79z)| lr 3.86e-04 | 322.39 ms | 52.4% bf16 MFU | 1624462 tok/s step 8392/19560 | loss 3.455477 (+0.05z)| norm 0.2886 (+0.18z)| lr 3.86e-04 | 322.65 ms | 52.3% bf16 MFU | 1624485 tok/s step 8393/19560 | loss 3.512126 (+1.48z)| norm 0.2754 (-0.38z)| lr 3.86e-04 | 323.02 ms | 52.2% bf16 MFU | 1624415 tok/s step 8394/19560 | loss 3.392346 (-1.55z)| norm 0.2793 (-0.22z)| lr 3.86e-04 | 323.16 ms | 52.2% bf16 MFU | 1624314 tok/s step 8395/19560 | loss 3.436028 (-0.44z)| norm 0.2992 (+0.61z)| lr 3.86e-04 | 322.98 ms | 52.3% bf16 MFU | 1624263 tok/s step 8396/19560 | loss 3.472973 (+0.49z)| norm 0.2791 (-0.25z)| lr 3.86e-04 | 322.92 ms | 52.3% bf16 MFU | 1624230 tok/s step 8397/19560 | loss 3.522881 (+1.72z)| norm 0.2617 (-0.99z)| lr 3.85e-04 | 322.81 ms | 52.3% bf16 MFU | 1624226 tok/s step 8398/19560 | loss 3.476664 (+0.56z)| norm 0.2639 (-0.89z)| lr 3.85e-04 | 322.95 ms | 52.3% bf16 MFU | 1624187 tok/s step 8399/19560 | loss 3.471429 (+0.43z)| norm 0.2691 (-0.66z)| lr 3.85e-04 | 
322.76 ms | 52.3% bf16 MFU | 1624198 tok/s step 8400/19560 | loss 3.449655 (-0.11z)| norm 0.2616 (-0.96z)| lr 3.85e-04 | 322.53 ms | 52.3% bf16 MFU | 1624266 tok/s step 8401/19560 | loss 3.412081 (-1.04z)| norm 0.2614 (-0.96z)| lr 3.85e-04 | 322.62 ms | 52.3% bf16 MFU | 1624308 tok/s step 8402/19560 | loss 3.400519 (-1.31z)| norm 0.2684 (-0.66z)| lr 3.85e-04 | 322.82 ms | 52.3% bf16 MFU | 1624297 tok/s step 8403/19560 | loss 3.489544 (+0.92z)| norm 0.3093 (+1.05z)| lr 3.85e-04 | 323.00 ms | 52.3% bf16 MFU | 1624242 tok/s step 8404/19560 | loss 3.386874 (-1.64z)| norm 0.3016 (+0.72z)| lr 3.85e-04 | 322.70 ms | 52.3% bf16 MFU | 1624263 tok/s step 8405/19560 | loss 3.449513 (-0.07z)| norm 0.2801 (-0.18z)| lr 3.85e-04 | 322.40 ms | 52.3% bf16 MFU | 1624359 tok/s step 8406/19560 | loss 3.444527 (-0.19z)| norm 0.2674 (-0.71z)| lr 3.85e-04 | 322.50 ms | 52.3% bf16 MFU | 1624427 tok/s step 8407/19560 | loss 3.449511 (-0.07z)| norm 0.2639 (-0.84z)| lr 3.85e-04 | 322.97 ms | 52.3% bf16 MFU | 1624373 tok/s step 8408/19560 | loss 3.501146 (+1.23z)| norm 0.2598 (-1.00z)| lr 3.85e-04 | 323.32 ms | 52.2% bf16 MFU | 1624234 tok/s step 8409/19560 | loss 3.487082 (+0.87z)| norm 0.2616 (-0.92z)| lr 3.85e-04 | 322.09 ms | 52.4% bf16 MFU | 1624410 tok/s step 8410/19560 | loss 3.433921 (-0.46z)| norm 0.2848 (+0.08z)| lr 3.85e-04 | 322.53 ms | 52.3% bf16 MFU | 1624467 tok/s step 8411/19560 | loss 3.409017 (-1.07z)| norm 0.2752 (-0.33z)| lr 3.85e-04 | 322.82 ms | 52.3% bf16 MFU | 1624447 tok/s step 8412/19560 | loss 3.451704 (-0.01z)| norm 0.3024 (+0.90z)| lr 3.85e-04 | 323.06 ms | 52.2% bf16 MFU | 1624368 tok/s step 8413/19560 | loss 3.373901 (-1.92z)| norm 0.2526 (-1.32z)| lr 3.85e-04 | 323.34 ms | 52.2% bf16 MFU | 1624223 tok/s step 8414/19560 | loss 3.413520 (-0.95z)| norm 0.2866 (+0.20z)| lr 3.85e-04 | 322.43 ms | 52.3% bf16 MFU | 1624314 tok/s step 8415/19560 | loss 3.375419 (-1.85z)| norm 0.2702 (-0.52z)| lr 3.85e-04 | 322.73 ms | 52.3% bf16 MFU | 1624324 tok/s step 8416/19560 | 
loss 3.401160 (-1.22z)| norm 0.2901 (+0.35z)| lr 3.85e-04 | 322.82 ms | 52.3% bf16 MFU | 1624313 tok/s step 8417/19560 | loss 3.438438 (-0.31z)| norm 0.2842 (+0.09z)| lr 3.84e-04 | 322.76 ms | 52.3% bf16 MFU | 1624318 tok/s step 8418/19560 | loss 3.407290 (-1.06z)| norm 0.2775 (-0.21z)| lr 3.84e-04 | 323.49 ms | 52.2% bf16 MFU | 1624138 tok/s step 8419/19560 | loss 3.425704 (-0.61z)| norm 0.2793 (-0.13z)| lr 3.84e-04 | 322.34 ms | 52.4% bf16 MFU | 1624256 tok/s step 8420/19560 | loss 3.431163 (-0.47z)| norm 0.2837 (+0.06z)| lr 3.84e-04 | 322.38 ms | 52.4% bf16 MFU | 1624359 tok/s step 8421/19560 | loss 3.505784 (+1.33z)| norm 0.2680 (-0.64z)| lr 3.84e-04 | 323.23 ms | 52.2% bf16 MFU | 1624243 tok/s step 8422/19560 | loss 3.472014 (+0.51z)| norm 0.2785 (-0.17z)| lr 3.84e-04 | 322.22 ms | 52.4% bf16 MFU | 1624386 tok/s step 8423/19560 | loss 3.482303 (+0.75z)| norm 0.2882 (+0.26z)| lr 3.84e-04 | 322.80 ms | 52.3% bf16 MFU | 1624375 tok/s step 8424/19560 | loss 3.436960 (-0.35z)| norm 0.2594 (-1.02z)| lr 3.84e-04 | 322.51 ms | 52.3% bf16 MFU | 1624439 tok/s step 8425/19560 | loss 3.449271 (-0.06z)| norm 0.2793 (-0.14z)| lr 3.84e-04 | 322.42 ms | 52.3% bf16 MFU | 1624521 tok/s step 8426/19560 | loss 3.451677 (-0.01z)| norm 0.2634 (-0.86z)| lr 3.84e-04 | 323.23 ms | 52.2% bf16 MFU | 1624397 tok/s step 8427/19560 | loss 3.464553 (+0.30z)| norm 0.2618 (-0.92z)| lr 3.84e-04 | 322.77 ms | 52.3% bf16 MFU | 1624394 tok/s step 8428/19560 | loss 3.437223 (-0.39z)| norm 0.2717 (-0.48z)| lr 3.84e-04 | 322.23 ms | 52.4% bf16 MFU | 1624528 tok/s step 8429/19560 | loss 3.438772 (-0.35z)| norm 0.2776 (-0.18z)| lr 3.84e-04 | 322.37 ms | 52.4% bf16 MFU | 1624620 tok/s step 8430/19560 | loss 3.426370 (-0.66z)| norm 0.3096 (+1.39z)| lr 3.84e-04 | 322.97 ms | 52.3% bf16 MFU | 1624555 tok/s step 8431/19560 | loss 3.417284 (-0.87z)| norm 0.2634 (-0.87z)| lr 3.84e-04 | 323.21 ms | 52.2% bf16 MFU | 1624434 tok/s step 8432/19560 | loss 3.479135 (+0.67z)| norm 0.2937 (+0.62z)| lr 3.84e-04 | 
322.67 ms | 52.3% bf16 MFU | 1624455 tok/s step 8433/19560 | loss 3.477282 (+0.62z)| norm 0.2766 (-0.22z)| lr 3.84e-04 | 323.19 ms | 52.2% bf16 MFU | 1624344 tok/s step 8434/19560 | loss 3.385433 (-1.66z)| norm 0.2723 (-0.43z)| lr 3.84e-04 | 323.02 ms | 52.2% bf16 MFU | 1624280 tok/s step 8435/19560 | loss 3.406347 (-1.12z)| norm 0.3121 (+1.50z)| lr 3.84e-04 | 322.78 ms | 52.3% bf16 MFU | 1624279 tok/s step 8436/19560 | loss 3.548350 (+2.37z)| norm 0.3036 (+1.07z)| lr 3.84e-04 | 323.00 ms | 52.3% bf16 MFU | 1624225 tok/s step 8437/19560 | loss 3.469294 (+0.43z)| norm 0.2984 (+0.81z)| lr 3.84e-04 | 322.46 ms | 52.3% bf16 MFU | 1624309 tok/s step 8438/19560 | loss 3.451080 (-0.02z)| norm 0.3190 (+1.80z)| lr 3.83e-04 | 322.91 ms | 52.3% bf16 MFU | 1624274 tok/s step 8439/19560 | loss 3.407965 (-1.07z)| norm 0.2990 (+0.83z)| lr 3.83e-04 | 322.47 ms | 52.3% bf16 MFU | 1624352 tok/s step 8440/19560 | loss 3.451876 (+0.01z)| norm 0.3165 (+1.64z)| lr 3.83e-04 | 322.78 ms | 52.3% bf16 MFU | 1624348 tok/s step 8441/19560 | loss 3.411050 (-0.99z)| norm 0.3038 (+1.06z)| lr 3.83e-04 | 322.91 ms | 52.3% bf16 MFU | 1624312 tok/s step 8442/19560 | loss 3.454168 (+0.06z)| norm 0.2979 (+0.76z)| lr 3.83e-04 | 322.46 ms | 52.3% bf16 MFU | 1624392 tok/s step 8443/19560 | loss 3.469710 (+0.43z)| norm 0.2939 (+0.56z)| lr 3.83e-04 | 323.39 ms | 52.2% bf16 MFU | 1624234 tok/s step 8444/19560 | loss 3.366294 (-2.10z)| norm 0.2684 (-0.67z)| lr 3.83e-04 | 322.65 ms | 52.3% bf16 MFU | 1624269 tok/s step 8445/19560 | loss 3.527098 (+1.84z)| norm 0.3089 (+1.27z)| lr 3.83e-04 | 322.81 ms | 52.3% bf16 MFU | 1624264 tok/s step 8446/19560 | loss 3.415856 (-0.87z)| norm 0.2683 (-0.68z)| lr 3.83e-04 | 322.35 ms | 52.4% bf16 MFU | 1624373 tok/s step 8447/19560 | loss 3.431856 (-0.47z)| norm 0.2717 (-0.52z)| lr 3.83e-04 | 322.53 ms | 52.3% bf16 MFU | 1624432 tok/s step 8448/19560 | loss 3.478153 (+0.65z)| norm 0.2940 (+0.54z)| lr 3.83e-04 | 322.94 ms | 52.3% bf16 MFU | 1624385 tok/s step 8449/19560 | 
loss 3.446053 (-0.13z)| norm 0.2688 (-0.67z)| lr 3.83e-04 | 322.43 ms | 52.3% bf16 MFU | 1624468 tok/s step 8450/19560 | loss 3.407290 (-1.07z)| norm 0.2720 (-0.52z)| lr 3.83e-04 | 322.60 ms | 52.3% bf16 MFU | 1624504 tok/s step 8451/19560 | loss 3.512158 (+1.46z)| norm 0.2701 (-0.62z)| lr 3.83e-04 | 322.60 ms | 52.3% bf16 MFU | 1624540 tok/s step 8452/19560 | loss 3.455086 (+0.08z)| norm 0.2902 (+0.35z)| lr 3.83e-04 | 322.34 ms | 52.4% bf16 MFU | 1624638 tok/s step 8453/19560 | loss 3.418601 (-0.80z)| norm 0.2997 (+0.80z)| lr 3.83e-04 | 322.41 ms | 52.3% bf16 MFU | 1624714 tok/s step 8454/19560 | loss 3.435163 (-0.40z)| norm 0.2611 (-1.10z)| lr 3.83e-04 | 322.58 ms | 52.3% bf16 MFU | 1624742 tok/s step 8455/19560 | loss 3.477635 (+0.63z)| norm 0.2786 (-0.24z)| lr 3.83e-04 | 322.26 ms | 52.4% bf16 MFU | 1624850 tok/s step 8456/19560 | loss 3.451124 (-0.02z)| norm 0.2823 (-0.06z)| lr 3.83e-04 | 322.78 ms | 52.3% bf16 MFU | 1624823 tok/s step 8457/19560 | loss 3.468943 (+0.40z)| norm 0.2762 (-0.36z)| lr 3.83e-04 | 322.17 ms | 52.4% bf16 MFU | 1624951 tok/s step 8458/19560 | loss 3.445643 (-0.17z)| norm 0.2755 (-0.39z)| lr 3.83e-04 | 322.52 ms | 52.3% bf16 MFU | 1624984 tok/s step 8459/19560 | loss 3.375776 (-1.86z)| norm 0.2917 (+0.40z)| lr 3.82e-04 | 323.04 ms | 52.2% bf16 MFU | 1624885 tok/s step 8460/19560 | loss 3.394618 (-1.38z)| norm 0.2578 (-1.26z)| lr 3.82e-04 | 322.21 ms | 52.4% bf16 MFU | 1624998 tok/s step 8461/19560 | loss 3.409702 (-1.00z)| norm 0.2860 (+0.12z)| lr 3.82e-04 | 323.01 ms | 52.2% bf16 MFU | 1624905 tok/s step 8462/19560 | loss 3.471562 (+0.49z)| norm 0.2630 (-0.99z)| lr 3.82e-04 | 322.10 ms | 52.4% bf16 MFU | 1625046 tok/s step 8463/19560 | loss 3.365463 (-2.05z)| norm 0.2574 (-1.24z)| lr 3.82e-04 | 323.17 ms | 52.2% bf16 MFU | 1624909 tok/s step 8464/19560 | loss 3.464033 (+0.31z)| norm 0.2734 (-0.48z)| lr 3.82e-04 | 322.04 ms | 52.4% bf16 MFU | 1625063 tok/s step 8465/19560 | loss 3.423291 (-0.66z)| norm 0.2630 (-0.99z)| lr 3.82e-04 | 
322.38 ms | 52.4% bf16 MFU | 1625125 tok/s step 8466/19560 | loss 3.434638 (-0.38z)| norm 0.2677 (-0.76z)| lr 3.82e-04 | 323.09 ms | 52.2% bf16 MFU | 1625007 tok/s step 8467/19560 | loss 3.400043 (-1.19z)| norm 0.2700 (-0.65z)| lr 3.82e-04 | 322.36 ms | 52.4% bf16 MFU | 1625076 tok/s step 8468/19560 | loss 3.408908 (-0.97z)| norm 0.2778 (-0.27z)| lr 3.82e-04 | 322.79 ms | 52.3% bf16 MFU | 1625035 tok/s step 8469/19560 | loss 3.526629 (+1.80z)| norm 0.2670 (-0.80z)| lr 3.82e-04 | 322.36 ms | 52.4% bf16 MFU | 1625103 tok/s step 8470/19560 | loss 3.429895 (-0.47z)| norm 0.2781 (-0.25z)| lr 3.82e-04 | 322.64 ms | 52.3% bf16 MFU | 1625097 tok/s step 8471/19560 | loss 3.431057 (-0.43z)| norm 0.2620 (-1.03z)| lr 3.82e-04 | 322.44 ms | 52.3% bf16 MFU | 1625141 tok/s step 8472/19560 | loss 3.406031 (-1.01z)| norm 0.2745 (-0.43z)| lr 3.82e-04 | 322.64 ms | 52.3% bf16 MFU | 1625134 tok/s step 8473/19560 | loss 3.450429 (+0.04z)| norm 0.2665 (-0.81z)| lr 3.82e-04 | 322.40 ms | 52.3% bf16 MFU | 1625189 tok/s step 8474/19560 | loss 3.432849 (-0.37z)| norm 0.2686 (-0.71z)| lr 3.82e-04 | 323.03 ms | 52.2% bf16 MFU | 1625082 tok/s step 8475/19560 | loss 3.412231 (-0.85z)| norm 0.2851 (+0.11z)| lr 3.82e-04 | 322.55 ms | 52.3% bf16 MFU | 1625100 tok/s step 8476/19560 | loss 3.326467 (-2.79z)| norm 0.2521 (-1.50z)| lr 3.82e-04 | 322.70 ms | 52.3% bf16 MFU | 1625080 tok/s step 8477/19560 | loss 3.433089 (-0.30z)| norm 0.2898 (+0.33z)| lr 3.82e-04 | 322.41 ms | 52.3% bf16 MFU | 1625134 tok/s step 8478/19560 | loss 3.427930 (-0.42z)| norm 0.2806 (-0.11z)| lr 3.82e-04 | 322.82 ms | 52.3% bf16 MFU | 1625083 tok/s step 8479/19560 | loss 3.425676 (-0.48z)| norm 0.2891 (+0.30z)| lr 3.82e-04 | 322.22 ms | 52.4% bf16 MFU | 1625184 tok/s step 8480/19560 | loss 3.422585 (-0.54z)| norm 0.2875 (+0.21z)| lr 3.81e-04 | 322.80 ms | 52.3% bf16 MFU | 1625134 tok/s step 8481/19560 | loss 3.463497 (+0.44z)| norm 0.2831 (+0.02z)| lr 3.81e-04 | 322.53 ms | 52.3% bf16 MFU | 1625155 tok/s step 8482/19560 | 
loss 3.434541 (-0.25z)| norm 0.2940 (+0.67z)| lr 3.81e-04 | 322.68 ms | 52.3% bf16 MFU | 1625137 tok/s step 8483/19560 | loss 3.429801 (-0.37z)| norm 0.2977 (+0.88z)| lr 3.81e-04 | 322.68 ms | 52.3% bf16 MFU | 1625121 tok/s step 8484/19560 | loss 3.509105 (+1.50z)| norm 0.2865 (+0.31z)| lr 3.81e-04 | 322.53 ms | 52.3% bf16 MFU | 1625142 tok/s step 8485/19560 | loss 3.505738 (+1.40z)| norm 0.2649 (-0.99z)| lr 3.81e-04 | 322.43 ms | 52.3% bf16 MFU | 1625188 tok/s step 8486/19560 | loss 3.459270 (+0.30z)| norm 0.2963 (+0.94z)| lr 3.81e-04 | 322.56 ms | 52.3% bf16 MFU | 1625199 tok/s step 8487/19560 | loss 3.407431 (-0.90z)| norm 0.2703 (-0.66z)| lr 3.81e-04 | 322.31 ms | 52.4% bf16 MFU | 1625272 tok/s step 8488/19560 | loss 3.442542 (-0.07z)| norm 0.2909 (+0.67z)| lr 3.81e-04 | 322.81 ms | 52.3% bf16 MFU | 1625216 tok/s step 8489/19560 | loss 3.464875 (+0.45z)| norm 0.2890 (+0.59z)| lr 3.81e-04 | 322.04 ms | 52.4% bf16 MFU | 1625357 tok/s step 8490/19560 | loss 3.585331 (+3.12z)| norm 0.2798 (-0.03z)| lr 3.81e-04 | 322.54 ms | 52.3% bf16 MFU | 1625365 tok/s step 8491/19560 | loss 3.446706 (+0.01z)| norm 0.3209 (+2.62z)| lr 3.81e-04 | 322.72 ms | 52.3% bf16 MFU | 1625325 tok/s step 8492/19560 | loss 3.451406 (+0.12z)| norm 0.2639 (-1.07z)| lr 3.81e-04 | 322.57 ms | 52.3% bf16 MFU | 1625328 tok/s step 8493/19560 | loss 3.489359 (+0.98z)| norm 0.2885 (+0.53z)| lr 3.81e-04 | 322.54 ms | 52.3% bf16 MFU | 1625336 tok/s step 8494/19560 | loss 3.476391 (+0.68z)| norm 0.2687 (-0.74z)| lr 3.81e-04 | 322.41 ms | 52.3% bf16 MFU | 1625376 tok/s step 8495/19560 | loss 3.490137 (+0.98z)| norm 0.2890 (+0.57z)| lr 3.81e-04 | 322.68 ms | 52.3% bf16 MFU | 1625348 tok/s step 8496/19560 | loss 3.452627 (+0.14z)| norm 0.2727 (-0.47z)| lr 3.81e-04 | 322.87 ms | 52.3% bf16 MFU | 1625272 tok/s step 8497/19560 | loss 3.448830 (+0.06z)| norm 0.2535 (-1.69z)| lr 3.81e-04 | 322.10 ms | 52.4% bf16 MFU | 1625395 tok/s step 8498/19560 | loss 3.449033 (+0.06z)| norm 0.2916 (+0.75z)| lr 3.81e-04 | 
322.47 ms | 52.3% bf16 MFU | 1625417 tok/s step 8499/19560 | loss 3.405685 (-0.95z)| norm 0.2753 (-0.29z)| lr 3.81e-04 | 322.22 ms | 52.4% bf16 MFU | 1625500 tok/s step 8500/19560 | loss 3.408258 (-0.87z)| norm 0.2776 (-0.15z)| lr 3.81e-04 | 322.96 ms | 52.3% bf16 MFU | 1625394 tok/s val loss 3.433615
HellaSwag: 2847/10042 = 0.283509 step 8501/19560 | loss 3.397943 (-1.11z)| norm 0.2896 (+0.62z)| lr 3.80e-04 | 323.71 ms | 52.1% bf16 MFU | 1625104 tok/s step 8502/19560 | loss 3.467796 (+0.56z)| norm 0.2792 (-0.04z)| lr 3.80e-04 | 322.08 ms | 52.4% bf16 MFU | 1625240 tok/s step 
8503/19560 | loss 3.405304 (-0.92z)| norm 0.2462 (-2.13z)| lr 3.80e-04 | 322.70 ms | 52.3% bf16 MFU | 1625213 tok/s step 8504/19560 | loss 3.464673 (+0.49z)| norm 0.2817 (+0.15z)| lr 3.80e-04 | 322.73 ms | 52.3% bf16 MFU | 1625179 tok/s step 8505/19560 | loss 3.466044 (+0.52z)| norm 0.2759 (-0.22z)| lr 3.80e-04 | 322.35 ms | 52.4% bf16 MFU | 1625244 tok/s step 8506/19560 | loss 3.453981 (+0.22z)| norm 0.2770 (-0.13z)| lr 3.80e-04 | 323.00 ms | 52.3% bf16 MFU | 1625141 tok/s step 8507/19560 | loss 3.457299 (+0.30z)| norm 0.2869 (+0.50z)| lr 3.80e-04 | 322.66 ms | 52.3% bf16 MFU | 1625129 tok/s step 8508/19560 | loss 3.458462 (+0.32z)| norm 0.2839 (+0.31z)| lr 3.80e-04 | 322.60 ms | 52.3% bf16 MFU | 1625132 tok/s step 8509/19560 | loss 3.379881 (-1.52z)| norm 0.2684 (-0.71z)| lr 3.80e-04 | 322.49 ms | 52.3% bf16 MFU | 1625162 tok/s step 8510/19560 | loss 3.453718 (+0.22z)| norm 0.2923 (+0.85z)| lr 3.80e-04 | 321.90 ms | 52.4% bf16 MFU | 1625340 tok/s step 8511/19560 | loss 3.434658 (-0.24z)| norm 0.3239 (+2.82z)| lr 3.80e-04 | 323.27 ms | 52.2% bf16 MFU | 1625164 tok/s step 8512/19560 | loss 3.384659 (-1.40z)| norm 0.2821 (+0.14z)| lr 3.80e-04 | 322.30 ms | 52.4% bf16 MFU | 1625242 tok/s step 8513/19560 | loss 3.437086 (-0.14z)| norm 0.3491 (+4.09z)| lr 3.80e-04 | 322.48 ms | 52.3% bf16 MFU | 1625271 tok/s step 8514/19560 | loss 3.431070 (-0.29z)| norm 0.3093 (+1.69z)| lr 3.80e-04 | 322.48 ms | 52.3% bf16 MFU | 1625298 tok/s step 8515/19560 | loss 3.478909 (+0.90z)| norm 0.2994 (+1.08z)| lr 3.80e-04 | 323.05 ms | 52.2% bf16 MFU | 1625180 tok/s step 8516/19560 | loss 3.454384 (+0.27z)| norm 0.3141 (+1.93z)| lr 3.80e-04 | 322.41 ms | 52.3% bf16 MFU | 1625229 tok/s step 8517/19560 | loss 3.445824 (+0.06z)| norm 0.2727 (-0.50z)| lr 3.80e-04 | 322.60 ms | 52.3% bf16 MFU | 1625226 tok/s step 8518/19560 | loss 3.439834 (-0.09z)| norm 0.2928 (+0.67z)| lr 3.80e-04 | 322.81 ms | 52.3% bf16 MFU | 1625172 tok/s step 8519/19560 | loss 3.413303 (-0.76z)| norm 0.3270 (+2.60z)| lr 
3.80e-04 | 322.58 ms | 52.3% bf16 MFU | 1625177 tok/s step 8520/19560 | loss 3.421385 (-0.55z)| norm 0.2776 (-0.22z)| lr 3.80e-04 | 322.78 ms | 52.3% bf16 MFU | 1625133 tok/s step 8521/19560 | loss 3.455906 (+0.35z)| norm 0.2835 (+0.11z)| lr 3.79e-04 | 322.38 ms | 52.4% bf16 MFU | 1625193 tok/s step 8522/19560 | loss 3.414950 (-0.71z)| norm 0.2908 (+0.53z)| lr 3.79e-04 | 322.72 ms | 52.3% bf16 MFU | 1625164 tok/s step 8523/19560 | loss 3.446068 (+0.09z)| norm 0.2724 (-0.53z)| lr 3.79e-04 | 323.08 ms | 52.2% bf16 MFU | 1625045 tok/s step 8524/19560 | loss 3.429233 (-0.34z)| norm 0.2904 (+0.51z)| lr 3.79e-04 | 322.94 ms | 52.3% bf16 MFU | 1624967 tok/s step 8525/19560 | loss 3.444229 (+0.07z)| norm 0.2968 (+0.86z)| lr 3.79e-04 | 322.24 ms | 52.4% bf16 MFU | 1625069 tok/s step 8526/19560 | loss 3.414863 (-0.70z)| norm 0.2873 (+0.31z)| lr 3.79e-04 | 322.78 ms | 52.3% bf16 MFU | 1625030 tok/s step 8527/19560 | loss 3.417370 (-0.62z)| norm 0.2685 (-0.78z)| lr 3.79e-04 | 323.41 ms | 52.2% bf16 MFU | 1624835 tok/s step 8528/19560 | loss 3.420123 (-0.54z)| norm 0.2795 (-0.16z)| lr 3.79e-04 | 322.37 ms | 52.4% bf16 MFU | 1624911 tok/s step 8529/19560 | loss 3.433796 (-0.18z)| norm 0.2716 (-0.62z)| lr 3.79e-04 | 322.75 ms | 52.3% bf16 MFU | 1624888 tok/s step 8530/19560 | loss 3.430567 (-0.28z)| norm 0.2914 (+0.53z)| lr 3.79e-04 | 323.04 ms | 52.2% bf16 MFU | 1624793 tok/s step 8531/19560 | loss 3.435646 (-0.13z)| norm 0.2584 (-1.38z)| lr 3.79e-04 | 322.76 ms | 52.3% bf16 MFU | 1624774 tok/s step 8532/19560 | loss 3.429736 (-0.30z)| norm 0.2973 (+0.90z)| lr 3.79e-04 | 322.35 ms | 52.4% bf16 MFU | 1624859 tok/s step 8533/19560 | loss 3.433757 (-0.19z)| norm 0.2817 (-0.02z)| lr 3.79e-04 | 322.52 ms | 52.3% bf16 MFU | 1624896 tok/s step 8534/19560 | loss 3.413630 (-0.72z)| norm 0.2710 (-0.65z)| lr 3.79e-04 | 322.59 ms | 52.3% bf16 MFU | 1624913 tok/s step 8535/19560 | loss 3.444435 (+0.11z)| norm 0.2918 (+0.56z)| lr 3.79e-04 | 322.49 ms | 52.3% bf16 MFU | 1624956 tok/s step 
8536/19560 | loss 3.429999 (-0.27z)| norm 0.2700 (-0.73z)| lr 3.79e-04 | 323.25 ms | 52.2% bf16 MFU | 1624804 tok/s step 8537/19560 | loss 3.436089 (-0.09z)| norm 0.2648 (-1.04z)| lr 3.79e-04 | 322.73 ms | 52.3% bf16 MFU | 1624790 tok/s step 8538/19560 | loss 3.411350 (-0.76z)| norm 0.2557 (-1.56z)| lr 3.79e-04 | 323.40 ms | 52.2% bf16 MFU | 1624608 tok/s step 8539/19560 | loss 3.438534 (-0.03z)| norm 0.2687 (-0.79z)| lr 3.79e-04 | 323.39 ms | 52.2% bf16 MFU | 1624440 tok/s step 8540/19560 | loss 3.509233 (+1.87z)| norm 0.2782 (-0.21z)| lr 3.79e-04 | 322.32 ms | 52.4% bf16 MFU | 1624547 tok/s step 8541/19560 | loss 3.471739 (+0.84z)| norm 0.2557 (-1.55z)| lr 3.79e-04 | 322.58 ms | 52.3% bf16 MFU | 1624584 tok/s step 8542/19560 | loss 3.519291 (+2.09z)| norm 0.2702 (-0.68z)| lr 3.78e-04 | 322.34 ms | 52.4% bf16 MFU | 1624680 tok/s step 8543/19560 | loss 3.461208 (+0.51z)| norm 0.2682 (-0.80z)| lr 3.78e-04 | 322.50 ms | 52.3% bf16 MFU | 1624731 tok/s step 8544/19560 | loss 3.436311 (-0.17z)| norm 0.2517 (-1.74z)| lr 3.78e-04 | 322.32 ms | 52.4% bf16 MFU | 1624824 tok/s step 8545/19560 | loss 3.393215 (-1.33z)| norm 0.2493 (-1.84z)| lr 3.78e-04 | 322.40 ms | 52.3% bf16 MFU | 1624892 tok/s step 8546/19560 | loss 3.434732 (-0.21z)| norm 0.2531 (-1.59z)| lr 3.78e-04 | 322.96 ms | 52.3% bf16 MFU | 1624818 tok/s step 8547/19560 | loss 3.403682 (-1.05z)| norm 0.2556 (-1.43z)| lr 3.78e-04 | 322.59 ms | 52.3% bf16 MFU | 1624839 tok/s step 8548/19560 | loss 3.447400 (+0.14z)| norm 0.2601 (-1.16z)| lr 3.78e-04 | 322.96 ms | 52.3% bf16 MFU | 1624767 tok/s step 8549/19560 | loss 3.414876 (-0.73z)| norm 0.2461 (-1.91z)| lr 3.78e-04 | 322.50 ms | 52.3% bf16 MFU | 1624814 tok/s step 8550/19560 | loss 3.451184 (+0.26z)| norm 0.2898 (+0.52z)| lr 3.78e-04 | 323.05 ms | 52.2% bf16 MFU | 1624721 tok/s step 8551/19560 | loss 3.403583 (-1.03z)| norm 0.2931 (+0.70z)| lr 3.78e-04 | 322.64 ms | 52.3% bf16 MFU | 1624735 tok/s step 8552/19560 | loss 3.417947 (-0.63z)| norm 0.2535 (-1.50z)| lr 
3.78e-04 | 322.94 ms | 52.3% bf16 MFU | 1624673 tok/s step 8553/19560 | loss 3.586523 (+3.74z)| norm 0.3202 (+2.14z)| lr 3.78e-04 | 323.35 ms | 52.2% bf16 MFU | 1624510 tok/s step 8554/19560 | loss 3.423902 (-0.46z)| norm 0.3035 (+1.21z)| lr 3.78e-04 | 322.68 ms | 52.3% bf16 MFU | 1624526 tok/s step 8555/19560 | loss 3.475312 (+0.87z)| norm 0.2987 (+0.94z)| lr 3.78e-04 | 323.66 ms | 52.1% bf16 MFU | 1624294 tok/s step 8556/19560 | loss 3.388183 (-1.36z)| norm 0.2642 (-0.93z)| lr 3.78e-04 | 322.43 ms | 52.3% bf16 MFU | 1624382 tok/s step 8557/19560 | loss 3.428090 (-0.34z)| norm 0.2903 (+0.48z)| lr 3.78e-04 | 323.22 ms | 52.2% bf16 MFU | 1624266 tok/s step 8558/19560 | loss 3.385494 (-1.41z)| norm 0.2565 (-1.33z)| lr 3.78e-04 | 322.79 ms | 52.3% bf16 MFU | 1624265 tok/s step 8559/19560 | loss 3.473921 (+0.83z)| norm 0.2878 (+0.36z)| lr 3.78e-04 | 322.88 ms | 52.3% bf16 MFU | 1624240 tok/s step 8560/19560 | loss 3.425565 (-0.39z)| norm 0.2676 (-0.73z)| lr 3.78e-04 | 322.93 ms | 52.3% bf16 MFU | 1624204 tok/s step 8561/19560 | loss 3.446987 (+0.16z)| norm 0.2629 (-0.98z)| lr 3.78e-04 | 322.88 ms | 52.3% bf16 MFU | 1624182 tok/s step 8562/19560 | loss 3.413682 (-0.70z)| norm 0.2554 (-1.37z)| lr 3.78e-04 | 322.74 ms | 52.3% bf16 MFU | 1624198 tok/s step 8563/19560 | loss 3.452948 (+0.30z)| norm 0.2818 (+0.07z)| lr 3.77e-04 | 322.71 ms | 52.3% bf16 MFU | 1624220 tok/s step 8564/19560 | loss 3.440291 (-0.00z)| norm 0.2619 (-1.00z)| lr 3.77e-04 | 322.81 ms | 52.3% bf16 MFU | 1624216 tok/s step 8565/19560 | loss 3.424538 (-0.41z)| norm 0.2879 (+0.42z)| lr 3.77e-04 | 323.08 ms | 52.2% bf16 MFU | 1624144 tok/s step 8566/19560 | loss 3.480145 (+1.06z)| norm 0.2778 (-0.11z)| lr 3.77e-04 | 322.70 ms | 52.3% bf16 MFU | 1624172 tok/s step 8567/19560 | loss 3.409414 (-0.82z)| norm 0.2745 (-0.29z)| lr 3.77e-04 | 322.81 ms | 52.3% bf16 MFU | 1624170 tok/s step 8568/19560 | loss 3.406154 (-0.90z)| norm 0.2811 (+0.10z)| lr 3.77e-04 | 322.74 ms | 52.3% bf16 MFU | 1624185 tok/s step 
8569/19560 | loss 3.402071 (-1.00z)| norm 0.2930 (+0.78z)| lr 3.77e-04 | 322.81 ms | 52.3% bf16 MFU | 1624183 tok/s step 8570/19560 | loss 3.431119 (-0.23z)| norm 0.2600 (-1.09z)| lr 3.77e-04 | 322.64 ms | 52.3% bf16 MFU | 1624223 tok/s step 8571/19560 | loss 3.437090 (-0.06z)| norm 0.3050 (+1.47z)| lr 3.77e-04 | 322.56 ms | 52.3% bf16 MFU | 1624280 tok/s step 8572/19560 | loss 3.476249 (+0.96z)| norm 0.2860 (+0.39z)| lr 3.77e-04 | 322.49 ms | 52.3% bf16 MFU | 1624354 tok/s step 8573/19560 | loss 3.440788 (+0.03z)| norm 0.2906 (+0.66z)| lr 3.77e-04 | 322.90 ms | 52.3% bf16 MFU | 1624322 tok/s step 8574/19560 | loss 3.506953 (+1.80z)| norm 0.2983 (+1.09z)| lr 3.77e-04 | 322.64 ms | 52.3% bf16 MFU | 1624355 tok/s step 8575/19560 | loss 3.438990 (-0.04z)| norm 0.2866 (+0.41z)| lr 3.77e-04 | 322.85 ms | 52.3% bf16 MFU | 1624335 tok/s step 8576/19560 | loss 3.426897 (-0.36z)| norm 0.2837 (+0.25z)| lr 3.77e-04 | 322.62 ms | 52.3% bf16 MFU | 1624372 tok/s step 8577/19560 | loss 3.454490 (+0.39z)| norm 0.2970 (+1.00z)| lr 3.77e-04 | 323.04 ms | 52.2% bf16 MFU | 1624304 tok/s step 8578/19560 | loss 3.396943 (-1.17z)| norm 0.3297 (+2.76z)| lr 3.77e-04 | 322.62 ms | 52.3% bf16 MFU | 1624344 tok/s step 8579/19560 | loss 3.464473 (+0.68z)| norm 0.2775 (-0.14z)| lr 3.77e-04 | 322.85 ms | 52.3% bf16 MFU | 1624325 tok/s step 8580/19560 | loss 3.441208 (+0.05z)| norm 0.3012 (+1.16z)| lr 3.77e-04 | 322.61 ms | 52.3% bf16 MFU | 1624365 tok/s step 8581/19560 | loss 3.466941 (+0.74z)| norm 0.2756 (-0.24z)| lr 3.77e-04 | 322.34 ms | 52.4% bf16 MFU | 1624471 tok/s step 8582/19560 | loss 3.515663 (+2.03z)| norm 0.3014 (+1.17z)| lr 3.77e-04 | 323.14 ms | 52.2% bf16 MFU | 1624372 tok/s step 8583/19560 | loss 3.468759 (+0.77z)| norm 0.2540 (-1.44z)| lr 3.77e-04 | 323.08 ms | 52.2% bf16 MFU | 1624292 tok/s step 8584/19560 | loss 3.434744 (-0.15z)| norm 0.2848 (+0.26z)| lr 3.76e-04 | 322.37 ms | 52.4% bf16 MFU | 1624395 tok/s step 8585/19560 | loss 3.409931 (-0.81z)| norm 0.2718 (-0.45z)| lr 
3.76e-04 | 323.40 ms | 52.2% bf16 MFU | 1624233 tok/s step 8586/19560 | loss 3.533392 (+2.45z)| norm 0.2635 (-0.90z)| lr 3.76e-04 | 322.67 ms | 52.3% bf16 MFU | 1624263 tok/s step 8587/19560 | loss 3.396340 (-1.18z)| norm 0.2734 (-0.35z)| lr 3.76e-04 | 322.74 ms | 52.3% bf16 MFU | 1624275 tok/s step 8588/19560 | loss 3.477778 (+0.97z)| norm 0.2630 (-0.93z)| lr 3.76e-04 | 323.39 ms | 52.2% bf16 MFU | 1624122 tok/s step 8589/19560 | loss 3.447051 (+0.14z)| norm 0.4058 (+5.88z)| lr 3.76e-04 | 322.68 ms | 52.3% bf16 MFU | 1624155 tok/s step 8590/19560 | loss 3.456711 (+0.41z)| norm 0.2853 (+0.20z)| lr 3.76e-04 | 323.04 ms | 52.2% bf16 MFU | 1624097 tok/s step 8591/19560 | loss 3.451324 (+0.25z)| norm 0.2975 (+0.77z)| lr 3.76e-04 | 323.09 ms | 52.2% bf16 MFU | 1624028 tok/s step 8592/19560 | loss 3.426154 (-0.43z)| norm 0.2693 (-0.57z)| lr 3.76e-04 | 322.37 ms | 52.4% bf16 MFU | 1624146 tok/s step 8593/19560 | loss 3.490326 (+1.30z)| norm 0.2854 (+0.19z)| lr 3.76e-04 | 322.71 ms | 52.3% bf16 MFU | 1624170 tok/s step 8594/19560 | loss 3.394976 (-1.27z)| norm 0.2566 (-1.17z)| lr 3.76e-04 | 322.88 ms | 52.3% bf16 MFU | 1624150 tok/s step 8595/19560 | loss 3.497386 (+1.46z)| norm 0.2726 (-0.42z)| lr 3.76e-04 | 322.46 ms | 52.3% bf16 MFU | 1624236 tok/s step 8596/19560 | loss 3.517821 (+1.96z)| norm 0.2754 (-0.28z)| lr 3.76e-04 | 322.88 ms | 52.3% bf16 MFU | 1624214 tok/s step 8597/19560 | loss 3.462304 (+0.51z)| norm 0.2614 (-0.94z)| lr 3.76e-04 | 322.99 ms | 52.3% bf16 MFU | 1624165 tok/s step 8598/19560 | loss 3.496711 (+1.42z)| norm 0.2955 (+0.66z)| lr 3.76e-04 | 322.75 ms | 52.3% bf16 MFU | 1624178 tok/s step 8599/19560 | loss 3.466253 (+0.59z)| norm 0.2569 (-1.15z)| lr 3.76e-04 | 322.96 ms | 52.3% bf16 MFU | 1624139 tok/s step 8600/19560 | loss 3.437773 (-0.18z)| norm 0.2766 (-0.23z)| lr 3.76e-04 | 322.78 ms | 52.3% bf16 MFU | 1624146 tok/s step 8601/19560 | loss 3.412980 (-0.83z)| norm 0.2602 (-0.99z)| lr 3.76e-04 | 322.89 ms | 52.3% bf16 MFU | 1624126 tok/s step 
8602/19560 | loss 3.414978 (-0.77z)| norm 0.2759 (-0.26z)| lr 3.76e-04 | 322.54 ms | 52.3% bf16 MFU | 1624194 tok/s step 8603/19560 | loss 3.395175 (-1.29z)| norm 0.2541 (-1.27z)| lr 3.76e-04 | 322.78 ms | 52.3% bf16 MFU | 1624199 tok/s step 8604/19560 | loss 3.436646 (-0.22z)| norm 0.2466 (-1.61z)| lr 3.75e-04 | 323.01 ms | 52.3% bf16 MFU | 1624146 tok/s step 8605/19560 | loss 3.471761 (+0.74z)| norm 0.2625 (-0.86z)| lr 3.75e-04 | 322.75 ms | 52.3% bf16 MFU | 1624161 tok/s step 8606/19560 | loss 3.434410 (-0.29z)| norm 0.2737 (-0.33z)| lr 3.75e-04 | 322.87 ms | 52.3% bf16 MFU | 1624144 tok/s step 8607/19560 | loss 3.435482 (-0.26z)| norm 0.2753 (-0.26z)| lr 3.75e-04 | 322.79 ms | 52.3% bf16 MFU | 1624148 tok/s step 8608/19560 | loss 3.421261 (-0.66z)| norm 0.2716 (-0.42z)| lr 3.75e-04 | 323.02 ms | 52.2% bf16 MFU | 1624095 tok/s step 8609/19560 | loss 3.371888 (-1.98z)| norm 0.2511 (-1.35z)| lr 3.75e-04 | 322.61 ms | 52.3% bf16 MFU | 1624148 tok/s step 8610/19560 | loss 3.440477 (-0.11z)| norm 0.2909 (+0.48z)| lr 3.75e-04 | 322.94 ms | 52.3% bf16 MFU | 1624114 tok/s step 8611/19560 | loss 3.442532 (-0.05z)| norm 0.2570 (-1.06z)| lr 3.75e-04 | 322.90 ms | 52.3% bf16 MFU | 1624093 tok/s step 8612/19560 | loss 3.479153 (+0.96z)| norm 0.2861 (+0.28z)| lr 3.75e-04 | 322.25 ms | 52.4% bf16 MFU | 1624236 tok/s step 8613/19560 | loss 3.552805 (+2.92z)| norm 0.2888 (+0.39z)| lr 3.75e-04 | 322.67 ms | 52.3% bf16 MFU | 1624266 tok/s step 8614/19560 | loss 3.459034 (+0.39z)| norm 0.2878 (+0.35z)| lr 3.75e-04 | 322.64 ms | 52.3% bf16 MFU | 1624302 tok/s step 8615/19560 | loss 3.501216 (+1.50z)| norm 0.2841 (+0.18z)| lr 3.75e-04 | 322.65 ms | 52.3% bf16 MFU | 1624335 tok/s step 8616/19560 | loss 3.430509 (-0.39z)| norm 0.2828 (+0.12z)| lr 3.75e-04 | 322.34 ms | 52.4% bf16 MFU | 1624444 tok/s step 8617/19560 | loss 3.515721 (+1.85z)| norm 0.3052 (+1.14z)| lr 3.75e-04 | 322.08 ms | 52.4% bf16 MFU | 1624612 tok/s step 8618/19560 | loss 3.373001 (-1.96z)| norm 0.2829 (+0.12z)| lr 
3.75e-04 | 322.63 ms | 52.3% bf16 MFU | 1624633 tok/s step 8619/19560 | loss 3.432909 (-0.30z)| norm 0.2775 (-0.12z)| lr 3.75e-04 | 322.70 ms | 52.3% bf16 MFU | 1624635 tok/s step 8620/19560 | loss 3.491978 (+1.31z)| norm 0.2802 (+0.00z)| lr 3.75e-04 | 322.30 ms | 52.4% bf16 MFU | 1624738 tok/s step 8621/19560 | loss 3.411290 (-0.89z)| norm 0.2985 (+0.85z)| lr 3.75e-04 | 322.52 ms | 52.3% bf16 MFU | 1624782 tok/s step 8622/19560 | loss 3.410945 (-0.88z)| norm 0.2878 (+0.35z)| lr 3.75e-04 | 322.32 ms | 52.4% bf16 MFU | 1624873 tok/s step 8623/19560 | loss 3.495973 (+1.45z)| norm 0.2911 (+0.50z)| lr 3.75e-04 | 322.37 ms | 52.4% bf16 MFU | 1624946 tok/s step 8624/19560 | loss 3.454773 (+0.32z)| norm 0.2613 (-0.89z)| lr 3.75e-04 | 322.52 ms | 52.3% bf16 MFU | 1624978 tok/s step 8625/19560 | loss 3.397222 (-1.24z)| norm 0.3085 (+1.29z)| lr 3.74e-04 | 322.70 ms | 52.3% bf16 MFU | 1624964 tok/s step 8626/19560 | loss 3.455181 (+0.34z)| norm 0.2706 (-0.46z)| lr 3.74e-04 | 322.88 ms | 52.3% bf16 MFU | 1624906 tok/s step 8627/19560 | loss 3.373213 (-1.87z)| norm 0.2699 (-0.49z)| lr 3.74e-04 | 322.17 ms | 52.4% bf16 MFU | 1625028 tok/s step 8628/19560 | loss 3.414404 (-0.76z)| norm 0.3020 (+0.98z)| lr 3.74e-04 | 322.48 ms | 52.3% bf16 MFU | 1625065 tok/s step 8629/19560 | loss 3.463347 (+0.55z)| norm 0.2588 (-1.00z)| lr 3.74e-04 | 322.58 ms | 52.3% bf16 MFU | 1625077 tok/s step 8630/19560 | loss 3.436069 (-0.18z)| norm 0.2708 (-0.45z)| lr 3.74e-04 | 322.92 ms | 52.3% bf16 MFU | 1625001 tok/s step 8631/19560 | loss 3.490862 (+1.28z)| norm 0.2617 (-0.88z)| lr 3.74e-04 | 322.63 ms | 52.3% bf16 MFU | 1625002 tok/s step 8632/19560 | loss 3.482356 (+1.05z)| norm 0.2724 (-0.38z)| lr 3.74e-04 | 322.59 ms | 52.3% bf16 MFU | 1625015 tok/s step 8633/19560 | loss 3.440956 (-0.07z)| norm 0.2814 (+0.04z)| lr 3.74e-04 | 322.68 ms | 52.3% bf16 MFU | 1625004 tok/s step 8634/19560 | loss 3.422919 (-0.55z)| norm 0.2826 (+0.09z)| lr 3.74e-04 | 322.72 ms | 52.3% bf16 MFU | 1624983 tok/s step 
8635/19560 | loss 3.469795 (+0.72z)| norm 0.2723 (-0.38z)| lr 3.74e-04 | 322.66 ms | 52.3% bf16 MFU | 1624978 tok/s step 8636/19560 | loss 3.411516 (-0.85z)| norm 0.2776 (-0.13z)| lr 3.74e-04 | 322.71 ms | 52.3% bf16 MFU | 1624962 tok/s step 8637/19560 | loss 3.426938 (-0.44z)| norm 0.2572 (-1.07z)| lr 3.74e-04 | 323.06 ms | 52.2% bf16 MFU | 1624856 tok/s step 8638/19560 | loss 3.418968 (-0.65z)| norm 0.2687 (-0.53z)| lr 3.74e-04 | 322.62 ms | 52.3% bf16 MFU | 1624868 tok/s step 8639/19560 | loss 3.485788 (+1.15z)| norm 0.2573 (-1.04z)| lr 3.74e-04 | 322.98 ms | 52.3% bf16 MFU | 1624788 tok/s step 8640/19560 | loss 3.445615 (+0.05z)| norm 0.2782 (-0.06z)| lr 3.74e-04 | 323.32 ms | 52.2% bf16 MFU | 1624628 tok/s step 8641/19560 | loss 3.405326 (-1.04z)| norm 0.2915 (+0.60z)| lr 3.74e-04 | 322.33 ms | 52.4% bf16 MFU | 1624724 tok/s step 8642/19560 | loss 3.451714 (+0.22z)| norm 0.2775 (-0.07z)| lr 3.74e-04 | 322.20 ms | 52.4% bf16 MFU | 1624847 tok/s step 8643/19560 | loss 3.458167 (+0.40z)| norm 0.2608 (-0.88z)| lr 3.74e-04 | 322.49 ms | 52.3% bf16 MFU | 1624891 tok/s step 8644/19560 | loss 3.381133 (-1.67z)| norm 0.2610 (-0.85z)| lr 3.74e-04 | 322.80 ms | 52.3% bf16 MFU | 1624857 tok/s step 8645/19560 | loss 3.384728 (-1.55z)| norm 0.2508 (-1.34z)| lr 3.74e-04 | 322.21 ms | 52.4% bf16 MFU | 1624973 tok/s step 8646/19560 | loss 3.408725 (-0.90z)| norm 0.2659 (-0.59z)| lr 3.73e-04 | 323.30 ms | 52.2% bf16 MFU | 1624809 tok/s step 8647/19560 | loss 3.492563 (+1.32z)| norm 0.2762 (-0.06z)| lr 3.73e-04 | 322.25 ms | 52.4% bf16 MFU | 1624917 tok/s step 8648/19560 | loss 3.464746 (+0.57z)| norm 0.2835 (+0.31z)| lr 3.73e-04 | 322.66 ms | 52.3% bf16 MFU | 1624917 tok/s step 8649/19560 | loss 3.470128 (+0.71z)| norm 0.2957 (+0.92z)| lr 3.73e-04 | 322.16 ms | 52.4% bf16 MFU | 1625041 tok/s step 8650/19560 | loss 3.485414 (+1.10z)| norm 0.2834 (+0.30z)| lr 3.73e-04 | 322.73 ms | 52.3% bf16 MFU | 1625015 tok/s step 8651/19560 | loss 3.433724 (-0.27z)| norm 0.2786 (+0.05z)| lr 
3.73e-04 | 322.77 ms | 52.3% bf16 MFU | 1624982 tok/s step 8652/19560 | loss 3.405980 (-0.99z)| norm 0.2725 (-0.25z)| lr 3.73e-04 | 322.24 ms | 52.4% bf16 MFU | 1625084 tok/s step 8653/19560 | loss 3.397414 (-1.20z)| norm 0.2805 (+0.17z)| lr 3.73e-04 | 322.28 ms | 52.4% bf16 MFU | 1625169 tok/s step 8654/19560 | loss 3.398101 (-1.18z)| norm 0.2600 (-0.87z)| lr 3.73e-04 | 322.43 ms | 52.3% bf16 MFU | 1625212 tok/s step 8655/19560 | loss 3.453868 (+0.27z)| norm 0.2925 (+0.78z)| lr 3.73e-04 | 322.90 ms | 52.3% bf16 MFU | 1625135 tok/s step 8656/19560 | loss 3.435668 (-0.21z)| norm 0.2938 (+0.83z)| lr 3.73e-04 | 322.64 ms | 52.3% bf16 MFU | 1625127 tok/s step 8657/19560 | loss 3.406382 (-0.97z)| norm 0.3053 (+1.39z)| lr 3.73e-04 | 322.63 ms | 52.3% bf16 MFU | 1625122 tok/s step 8658/19560 | loss 3.403144 (-1.04z)| norm 0.2883 (+0.54z)| lr 3.73e-04 | 322.48 ms | 52.3% bf16 MFU | 1625157 tok/s step 8659/19560 | loss 3.412189 (-0.80z)| norm 0.2971 (+0.97z)| lr 3.73e-04 | 322.63 ms | 52.3% bf16 MFU | 1625151 tok/s step 8660/19560 | loss 3.418075 (-0.64z)| norm 0.2629 (-0.74z)| lr 3.73e-04 | 322.30 ms | 52.4% bf16 MFU | 1625228 tok/s step 8661/19560 | loss 3.477532 (+0.89z)| norm 0.2879 (+0.51z)| lr 3.73e-04 | 323.05 ms | 52.2% bf16 MFU | 1625114 tok/s step 8662/19560 | loss 3.486359 (+1.10z)| norm 0.2906 (+0.64z)| lr 3.73e-04 | 322.48 ms | 52.3% bf16 MFU | 1625148 tok/s step 8663/19560 | loss 3.461837 (+0.46z)| norm 0.2865 (+0.44z)| lr 3.73e-04 | 322.43 ms | 52.3% bf16 MFU | 1625194 tok/s step 8664/19560 | loss 3.437306 (-0.17z)| norm 0.2919 (+0.70z)| lr 3.73e-04 | 322.97 ms | 52.3% bf16 MFU | 1625100 tok/s step 8665/19560 | loss 3.393809 (-1.28z)| norm 0.2701 (-0.39z)| lr 3.73e-04 | 322.64 ms | 52.3% bf16 MFU | 1625094 tok/s step 8666/19560 | loss 3.434295 (-0.24z)| norm 0.2864 (+0.41z)| lr 3.72e-04 | 322.35 ms | 52.4% bf16 MFU | 1625163 tok/s step 8667/19560 | loss 3.377280 (-1.68z)| norm 0.2713 (-0.35z)| lr 3.72e-04 | 323.05 ms | 52.2% bf16 MFU | 1625052 tok/s step 
8668/19560 | loss 3.385245 (-1.45z)| norm 0.2651 (-0.66z)| lr 3.72e-04 | 322.34 ms | 52.4% bf16 MFU | 1625126 tok/s step 8669/19560 | loss 3.498227 (+1.41z)| norm 0.2774 (-0.05z)| lr 3.72e-04 | 322.60 ms | 52.3% bf16 MFU | 1625130 tok/s step 8670/19560 | loss 3.535441 (+2.33z)| norm 0.2559 (-1.12z)| lr 3.72e-04 | 322.76 ms | 52.3% bf16 MFU | 1625093 tok/s step 8671/19560 | loss 3.416380 (-0.65z)| norm 0.2915 (+0.66z)| lr 3.72e-04 | 322.57 ms | 52.3% bf16 MFU | 1625105 tok/s step 8672/19560 | loss 3.395637 (-1.16z)| norm 0.2451 (-1.67z)| lr 3.72e-04 | 322.36 ms | 52.4% bf16 MFU | 1625170 tok/s step 8673/19560 | loss 3.432580 (-0.25z)| norm 0.2966 (+0.90z)| lr 3.72e-04 | 322.71 ms | 52.3% bf16 MFU | 1625144 tok/s step 8674/19560 | loss 3.435187 (-0.18z)| norm 0.2662 (-0.64z)| lr 3.72e-04 | 322.49 ms | 52.3% bf16 MFU | 1625174 tok/s step 8675/19560 | loss 3.459862 (+0.43z)| norm 0.2896 (+0.53z)| lr 3.72e-04 | 322.65 ms | 52.3% bf16 MFU | 1625163 tok/s step 8676/19560 | loss 3.423178 (-0.49z)| norm 0.2769 (-0.12z)| lr 3.72e-04 | 322.96 ms | 52.3% bf16 MFU | 1625074 tok/s step 8677/19560 | loss 3.485338 (+1.06z)| norm 0.2973 (+0.91z)| lr 3.72e-04 | 322.42 ms | 52.3% bf16 MFU | 1625125 tok/s step 8678/19560 | loss 3.421507 (-0.54z)| norm 0.2828 (+0.17z)| lr 3.72e-04 | 322.65 ms | 52.3% bf16 MFU | 1625117 tok/s step 8679/19560 | loss 3.426128 (-0.43z)| norm 0.2850 (+0.29z)| lr 3.72e-04 | 322.78 ms | 52.3% bf16 MFU | 1625076 tok/s step 8680/19560 | loss 3.322441 (-2.92z)| norm 0.4592 (+7.16z)| lr 3.72e-04 | 322.80 ms | 52.3% bf16 MFU | 1625033 tok/s step 8681/19560 | loss 3.407261 (-0.86z)| norm 0.3218 (+1.63z)| lr 3.72e-04 | 322.79 ms | 52.3% bf16 MFU | 1624994 tok/s step 8682/19560 | loss 3.415633 (-0.64z)| norm 0.2907 (+0.39z)| lr 3.72e-04 | 322.27 ms | 52.4% bf16 MFU | 1625086 tok/s step 8683/19560 | loss 3.511847 (+1.79z)| norm 0.2687 (-0.49z)| lr 3.72e-04 | 322.38 ms | 52.4% bf16 MFU | 1625147 tok/s step 8684/19560 | loss 3.443531 (+0.05z)| norm 0.2961 (+0.61z)| lr 
3.72e-04 | 323.01 ms | 52.3% bf16 MFU | 1625047 tok/s step 8685/19560 | loss 3.417397 (-0.61z)| norm 0.2586 (-0.89z)| lr 3.72e-04 | 322.34 ms | 52.4% bf16 MFU | 1625120 tok/s step 8686/19560 | loss 3.499760 (+1.46z)| norm 0.2939 (+0.52z)| lr 3.72e-04 | 322.74 ms | 52.3% bf16 MFU | 1625089 tok/s step 8687/19560 | loss 3.403970 (-0.96z)| norm 0.2558 (-1.01z)| lr 3.71e-04 | 322.60 ms | 52.3% bf16 MFU | 1625093 tok/s step 8688/19560 | loss 3.434091 (-0.20z)| norm 0.2643 (-0.66z)| lr 3.71e-04 | 322.70 ms | 52.3% bf16 MFU | 1625072 tok/s step 8689/19560 | loss 3.467283 (+0.64z)| norm 0.2866 (+0.23z)| lr 3.71e-04 | 321.89 ms | 52.4% bf16 MFU | 1625257 tok/s step 8690/19560 | loss 3.403867 (-0.97z)| norm 0.2649 (-0.65z)| lr 3.71e-04 | 322.92 ms | 52.3% bf16 MFU | 1625172 tok/s step 8691/19560 | loss 3.469811 (+0.70z)| norm 0.2825 (+0.06z)| lr 3.71e-04 | 322.83 ms | 52.3% bf16 MFU | 1625116 tok/s step 8692/19560 | loss 3.442722 (+0.01z)| norm 0.2445 (-1.46z)| lr 3.71e-04 | 322.95 ms | 52.3% bf16 MFU | 1625032 tok/s step 8693/19560 | loss 3.438106 (-0.11z)| norm 0.2753 (-0.22z)| lr 3.71e-04 | 322.85 ms | 52.3% bf16 MFU | 1624979 tok/s step 8694/19560 | loss 3.454275 (+0.31z)| norm 0.2508 (-1.19z)| lr 3.71e-04 | 322.31 ms | 52.4% bf16 MFU | 1625062 tok/s step 8695/19560 | loss 3.464174 (+0.55z)| norm 0.2605 (-0.80z)| lr 3.71e-04 | 322.65 ms | 52.3% bf16 MFU | 1625057 tok/s step 8696/19560 | loss 3.499264 (+1.42z)| norm 0.2587 (-0.86z)| lr 3.71e-04 | 323.35 ms | 52.2% bf16 MFU | 1624874 tok/s step 8697/19560 | loss 3.412081 (-0.79z)| norm 0.2474 (-1.28z)| lr 3.71e-04 | 322.30 ms | 52.4% bf16 MFU | 1624966 tok/s step 8698/19560 | loss 3.437270 (-0.15z)| norm 0.2726 (-0.30z)| lr 3.71e-04 | 322.96 ms | 52.3% bf16 MFU | 1624887 tok/s step 8699/19560 | loss 3.454540 (+0.28z)| norm 0.2529 (-1.06z)| lr 3.71e-04 | 322.63 ms | 52.3% bf16 MFU | 1624894 tok/s step 8700/19560 | loss 3.474526 (+0.79z)| norm 0.2656 (-0.55z)| lr 3.71e-04 | 322.84 ms | 52.3% bf16 MFU | 1624850 tok/s step 
8701/19560 | loss 3.472495 (+0.73z)| norm 0.2709 (-0.33z)| lr 3.71e-04 | 322.66 ms | 52.3% bf16 MFU | 1624851 tok/s step 8702/19560 | loss 3.432425 (-0.27z)| norm 0.2673 (-0.47z)| lr 3.71e-04 | 322.50 ms | 52.3% bf16 MFU | 1624894 tok/s step 8703/19560 | loss 3.441824 (-0.03z)| norm 0.2656 (-0.53z)| lr 3.71e-04 | 322.69 ms | 52.3% bf16 MFU | 1624886 tok/s step 8704/19560 | loss 3.428979 (-0.36z)| norm 0.2826 (+0.14z)| lr 3.71e-04 | 322.21 ms | 52.4% bf16 MFU | 1625001 tok/s step 8705/19560 | loss 3.379802 (-1.59z)| norm 0.2998 (+0.82z)| lr 3.71e-04 | 322.56 ms | 52.3% bf16 MFU | 1625020 tok/s step 8706/19560 | loss 3.553038 (+2.70z)| norm 0.3310 (+2.05z)| lr 3.71e-04 | 322.59 ms | 52.3% bf16 MFU | 1625031 tok/s step 8707/19560 | loss 3.449639 (+0.15z)| norm 0.2985 (+0.76z)| lr 3.70e-04 | 322.37 ms | 52.4% bf16 MFU | 1625098 tok/s step 8708/19560 | loss 3.456744 (+0.32z)| norm 0.2872 (+0.32z)| lr 3.70e-04 | 322.74 ms | 52.3% bf16 MFU | 1625069 tok/s step 8709/19560 | loss 3.473979 (+0.74z)| norm 0.2836 (+0.18z)| lr 3.70e-04 | 322.61 ms | 52.3% bf16 MFU | 1625072 tok/s step 8710/19560 | loss 3.494655 (+1.27z)| norm 0.2841 (+0.20z)| lr 3.70e-04 | 322.62 ms | 52.3% bf16 MFU | 1625072 tok/s step 8711/19560 | loss 3.510301 (+1.63z)| norm 0.2962 (+0.67z)| lr 3.70e-04 | 322.81 ms | 52.3% bf16 MFU | 1625025 tok/s step 8712/19560 | loss 3.380667 (-1.54z)| norm 0.2831 (+0.15z)| lr 3.70e-04 | 322.69 ms | 52.3% bf16 MFU | 1625012 tok/s step 8713/19560 | loss 3.456641 (+0.31z)| norm 0.2786 (-0.03z)| lr 3.70e-04 | 322.62 ms | 52.3% bf16 MFU | 1625016 tok/s step 8714/19560 | loss 3.435593 (-0.19z)| norm 0.2819 (+0.10z)| lr 3.70e-04 | 322.53 ms | 52.3% bf16 MFU | 1625042 tok/s step 8715/19560 | loss 3.511577 (+1.67z)| norm 0.2931 (+0.54z)| lr 3.70e-04 | 322.56 ms | 52.3% bf16 MFU | 1625059 tok/s step 8716/19560 | loss 3.480107 (+0.89z)| norm 0.2839 (+0.16z)| lr 3.70e-04 | 322.51 ms | 52.3% bf16 MFU | 1625088 tok/s step 8717/19560 | loss 3.365321 (-1.91z)| norm 0.2820 (+0.14z)| lr 
3.70e-04 | 322.24 ms | 52.4% bf16 MFU | 1625186 tok/s step 8718/19560 | loss 3.502020 (+1.41z)| norm 0.2742 (-0.20z)| lr 3.70e-04 | 322.09 ms | 52.4% bf16 MFU | 1625315 tok/s step 8719/19560 | loss 3.441332 (-0.06z)| norm 0.2753 (-0.15z)| lr 3.70e-04 | 322.56 ms | 52.3% bf16 MFU | 1625318 tok/s step 8720/19560 | loss 3.485180 (+0.99z)| norm 0.2655 (-0.58z)| lr 3.70e-04 | 322.45 ms | 52.3% bf16 MFU | 1625349 tok/s step 8721/19560 | loss 3.431324 (-0.30z)| norm 0.2732 (-0.23z)| lr 3.70e-04 | 322.42 ms | 52.3% bf16 MFU | 1625387 tok/s step 8722/19560 | loss 3.514557 (+1.69z)| norm 0.2526 (-1.15z)| lr 3.70e-04 | 322.95 ms | 52.3% bf16 MFU | 1625288 tok/s step 8723/19560 | loss 3.421293 (-0.55z)| norm 0.2585 (-0.88z)| lr 3.70e-04 | 322.70 ms | 52.3% bf16 MFU | 1625258 tok/s step 8724/19560 | loss 3.423205 (-0.50z)| norm 0.2582 (-0.88z)| lr 3.70e-04 | 322.63 ms | 52.3% bf16 MFU | 1625248 tok/s step 8725/19560 | loss 3.430664 (-0.31z)| norm 0.2543 (-1.05z)| lr 3.70e-04 | 322.75 ms | 52.3% bf16 MFU | 1625207 tok/s step 8726/19560 | loss 3.456556 (+0.34z)| norm 0.2600 (-0.79z)| lr 3.70e-04 | 322.37 ms | 52.4% bf16 MFU | 1625264 tok/s step 8727/19560 | loss 3.502815 (+1.47z)| norm 0.2565 (-0.94z)| lr 3.70e-04 | 322.22 ms | 52.4% bf16 MFU | 1625357 tok/s step 8728/19560 | loss 3.464571 (+0.52z)| norm 0.2749 (-0.13z)| lr 3.69e-04 | 322.57 ms | 52.3% bf16 MFU | 1625356 tok/s step 8729/19560 | loss 3.461939 (+0.45z)| norm 0.2574 (-0.90z)| lr 3.69e-04 | 322.37 ms | 52.4% bf16 MFU | 1625408 tok/s step 8730/19560 | loss 3.406213 (-0.92z)| norm 0.2597 (-0.79z)| lr 3.69e-04 | 323.17 ms | 52.2% bf16 MFU | 1625254 tok/s step 8731/19560 | loss 3.413454 (-0.75z)| norm 0.2650 (-0.56z)| lr 3.69e-04 | 322.63 ms | 52.3% bf16 MFU | 1625244 tok/s step 8732/19560 | loss 3.492554 (+1.18z)| norm 0.2834 (+0.24z)| lr 3.69e-04 | 322.51 ms | 52.3% bf16 MFU | 1625264 tok/s step 8733/19560 | loss 3.426785 (-0.42z)| norm 0.2376 (-1.77z)| lr 3.69e-04 | 322.87 ms | 52.3% bf16 MFU | 1625192 tok/s step 
8734/19560 | loss 3.415202 (-0.70z)| norm 0.2730 (-0.21z)| lr 3.69e-04 | 322.55 ms | 52.3% bf16 MFU | 1625205 tok/s step 8735/19560 | loss 3.469146 (+0.62z)| norm 0.2477 (-1.31z)| lr 3.69e-04 | 323.44 ms | 52.2% bf16 MFU | 1624993 tok/s step 8736/19560 | loss 3.440381 (-0.09z)| norm 0.2527 (-1.08z)| lr 3.69e-04 | 322.62 ms | 52.3% bf16 MFU | 1624999 tok/s step 8737/19560 | loss 3.391716 (-1.29z)| norm 0.2751 (-0.11z)| lr 3.69e-04 | 322.58 ms | 52.3% bf16 MFU | 1625015 tok/s step 8738/19560 | loss 3.384222 (-1.45z)| norm 0.2726 (-0.21z)| lr 3.69e-04 | 322.42 ms | 52.3% bf16 MFU | 1625070 tok/s step 8739/19560 | loss 3.437695 (-0.15z)| norm 0.2672 (-0.45z)| lr 3.69e-04 | 322.66 ms | 52.3% bf16 MFU | 1625062 tok/s step 8740/19560 | loss 3.411296 (-0.78z)| norm 0.2809 (+0.15z)| lr 3.69e-04 | 322.61 ms | 52.3% bf16 MFU | 1625067 tok/s step 8741/19560 | loss 3.489194 (+1.16z)| norm 0.2733 (-0.18z)| lr 3.69e-04 | 323.20 ms | 52.2% bf16 MFU | 1624922 tok/s step 8742/19560 | loss 3.400721 (-1.04z)| norm 0.2866 (+0.40z)| lr 3.69e-04 | 322.28 ms | 52.4% bf16 MFU | 1625016 tok/s step 8743/19560 | loss 3.363865 (-1.91z)| norm 0.2524 (-1.08z)| lr 3.69e-04 | 322.76 ms | 52.3% bf16 MFU | 1624985 tok/s step 8744/19560 | loss 3.559370 (+2.81z)| norm 0.2963 (+0.83z)| lr 3.69e-04 | 322.57 ms | 52.3% bf16 MFU | 1625004 tok/s step 8745/19560 | loss 3.425246 (-0.40z)| norm 0.2766 (-0.02z)| lr 3.69e-04 | 322.73 ms | 52.3% bf16 MFU | 1624981 tok/s step 8746/19560 | loss 3.441261 (-0.02z)| norm 0.2723 (-0.20z)| lr 3.69e-04 | 322.59 ms | 52.3% bf16 MFU | 1624994 tok/s step 8747/19560 | loss 3.439434 (-0.07z)| norm 0.2733 (-0.16z)| lr 3.69e-04 | 322.56 ms | 52.3% bf16 MFU | 1625013 tok/s step 8748/19560 | loss 3.482214 (+0.99z)| norm 0.2849 (+0.35z)| lr 3.69e-04 | 322.93 ms | 52.3% bf16 MFU | 1624940 tok/s step 8749/19560 | loss 3.460958 (+0.46z)| norm 0.2574 (-0.84z)| lr 3.68e-04 | 322.54 ms | 52.3% bf16 MFU | 1624967 tok/s step 8750/19560 | loss 3.427425 (-0.37z)| norm 0.2872 (+0.46z)| lr 
3.68e-04 | 322.60 ms | 52.3% bf16 MFU | 1624979 tok/s val loss 3.428181 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating HellaSwag: 70/79 HellaSwag: 2876/10042 = 0.286397 step 8751/19560 | loss 3.407825 (-0.84z)| norm 0.2922 (+0.68z)| lr 3.68e-04 | 322.18 ms | 52.4% bf16 MFU | 1625095 tok/s step 8752/19560 | loss 3.471816 (+0.74z)| norm 0.3002 (+1.02z)| lr 3.68e-04 | 322.61 ms | 52.3% bf16 MFU | 1625097 tok/s step 8753/19560 | loss 3.425977 (-0.40z)| norm 0.2710 (-0.25z)| lr 3.68e-04 | 322.29 ms | 52.4% bf16 MFU | 1625180 tok/s step 8754/19560 | loss 3.435452 (-0.16z)| norm 0.2863 (+0.42z)| lr 3.68e-04 | 323.33 ms | 52.2% bf16 MFU | 1624998 tok/s step 8755/19560 | loss 3.424991 (-0.44z)| norm 0.2684 (-0.37z)| lr 3.68e-04 | 323.04 ms | 52.2% bf16 MFU | 1624897 tok/s step 8756/19560 | loss 3.469100 (+0.66z)| norm 0.2937 (+0.75z)| lr 3.68e-04 | 322.59 ms | 52.3% bf16 MFU | 1624916 tok/s step 8757/19560 | loss 3.408222 (-0.86z)| norm 0.2731 (-0.16z)| lr 3.68e-04 | 322.74 ms | 52.3% bf16 MFU | 1624894 tok/s step 8758/19560 | loss 3.427280 (-0.38z)| norm 0.2686 (-0.36z)| lr 3.68e-04 | 322.44 ms | 52.3% bf16 MFU | 1624949 tok/s step 8759/19560 | loss 3.445343 (+0.08z)| norm 0.2811 (+0.18z)| lr 3.68e-04 | 322.87 ms | 52.3% bf16 MFU | 1624894 tok/s step 8760/19560 | loss 3.392836 (-1.22z)| norm 0.2583 (-0.82z)| lr 3.68e-04 | 322.53 ms | 52.3% bf16 MFU | 1624928 tok/s step 8761/19560 | loss 3.408125 (-0.83z)| norm 0.2912 (+0.63z)| lr 3.68e-04 | 322.18 ms | 52.4% bf16 MFU | 1625046 tok/s step 8762/19560 | loss 3.378864 (-1.54z)| norm 0.2499 (-1.17z)| lr 3.68e-04 | 322.42 ms | 52.3% bf16 MFU | 1625100 tok/s step 8763/19560 | loss 3.384586 (-1.37z)| norm 0.2968 (+0.87z)| lr 3.68e-04 | 322.24 ms | 52.4% bf16 MFU | 1625196 tok/s step 8764/19560 | loss 3.365040 (-1.83z)| norm 0.3355 (+2.49z)| lr 3.68e-04 | 323.57 ms | 52.2% 
bf16 MFU | 1624953 tok/s step 8765/19560 | loss 3.450251 (+0.25z)| norm 0.2981 (+0.87z)| lr 3.68e-04 | 323.05 ms | 52.2% bf16 MFU | 1624852 tok/s step 8766/19560 | loss 3.494607 (+1.32z)| norm 0.3414 (+2.63z)| lr 3.68e-04 | 322.50 ms | 52.3% bf16 MFU | 1624894 tok/s step 8767/19560 | loss 3.462483 (+0.54z)| norm 0.2741 (-0.18z)| lr 3.68e-04 | 322.88 ms | 52.3% bf16 MFU | 1624839 tok/s step 8768/19560 | loss 3.417066 (-0.56z)| norm 0.3120 (+1.38z)| lr 3.68e-04 | 322.80 ms | 52.3% bf16 MFU | 1624807 tok/s step 8769/19560 | loss 3.460970 (+0.50z)| norm 0.2842 (+0.23z)| lr 3.67e-04 | 323.28 ms | 52.2% bf16 MFU | 1624655 tok/s step 8770/19560 | loss 3.511663 (+1.71z)| norm 0.2764 (-0.09z)| lr 3.67e-04 | 322.42 ms | 52.3% bf16 MFU | 1624728 tok/s step 8771/19560 | loss 3.444846 (+0.10z)| norm 0.2785 (-0.01z)| lr 3.67e-04 | 322.91 ms | 52.3% bf16 MFU | 1624674 tok/s step 8772/19560 | loss 3.483955 (+1.03z)| norm 0.2863 (+0.31z)| lr 3.67e-04 | 322.32 ms | 52.4% bf16 MFU | 1624772 tok/s step 8773/19560 | loss 3.509178 (+1.61z)| norm 0.3006 (+0.89z)| lr 3.67e-04 | 322.56 ms | 52.3% bf16 MFU | 1624804 tok/s step 8774/19560 | loss 3.451959 (+0.22z)| norm 0.3280 (+1.98z)| lr 3.67e-04 | 322.32 ms | 52.4% bf16 MFU | 1624895 tok/s step 8775/19560 | loss 3.384933 (-1.39z)| norm 0.2513 (-1.15z)| lr 3.67e-04 | 322.43 ms | 52.3% bf16 MFU | 1624952 tok/s step 8776/19560 | loss 3.483584 (+1.00z)| norm 0.3264 (+1.87z)| lr 3.67e-04 | 322.08 ms | 52.4% bf16 MFU | 1625096 tok/s step 8777/19560 | loss 3.413937 (-0.67z)| norm 0.2745 (-0.21z)| lr 3.67e-04 | 322.92 ms | 52.3% bf16 MFU | 1625019 tok/s step 8778/19560 | loss 3.408093 (-0.80z)| norm 0.2865 (+0.27z)| lr 3.67e-04 | 322.46 ms | 52.3% bf16 MFU | 1625063 tok/s step 8779/19560 | loss 3.466022 (+0.59z)| norm 0.2631 (-0.66z)| lr 3.67e-04 | 322.27 ms | 52.4% bf16 MFU | 1625153 tok/s step 8780/19560 | loss 3.438791 (-0.07z)| norm 0.2841 (+0.18z)| lr 3.67e-04 | 323.06 ms | 52.2% bf16 MFU | 1625039 tok/s step 8781/19560 | loss 3.466761 
(+0.59z)| norm 0.2797 (+0.00z)| lr 3.67e-04 | 322.75 ms | 52.3% bf16 MFU | 1625009 tok/s step 8782/19560 | loss 3.371473 (-1.70z)| norm 0.3156 (+1.42z)| lr 3.67e-04 | 322.53 ms | 52.3% bf16 MFU | 1625036 tok/s step 8783/19560 | loss 3.427069 (-0.36z)| norm 0.2754 (-0.18z)| lr 3.67e-04 | 322.78 ms | 52.3% bf16 MFU | 1624998 tok/s step 8784/19560 | loss 3.503256 (+1.46z)| norm 0.2941 (+0.56z)| lr 3.67e-04 | 322.45 ms | 52.3% bf16 MFU | 1625045 tok/s step 8785/19560 | loss 3.487238 (+1.06z)| norm 0.2845 (+0.19z)| lr 3.67e-04 | 323.00 ms | 52.3% bf16 MFU | 1624952 tok/s step 8786/19560 | loss 3.497473 (+1.28z)| norm 0.2821 (+0.09z)| lr 3.67e-04 | 322.90 ms | 52.3% bf16 MFU | 1624889 tok/s step 8787/19560 | loss 3.477722 (+0.80z)| norm 0.2908 (+0.45z)| lr 3.67e-04 | 322.91 ms | 52.3% bf16 MFU | 1624827 tok/s step 8788/19560 | loss 3.478541 (+0.80z)| norm 0.2616 (-0.73z)| lr 3.67e-04 | 322.87 ms | 52.3% bf16 MFU | 1624779 tok/s step 8789/19560 | loss 3.342118 (-2.37z)| norm 0.2694 (-0.41z)| lr 3.67e-04 | 322.95 ms | 52.3% bf16 MFU | 1624710 tok/s step 8790/19560 | loss 3.470212 (+0.62z)| norm 0.2722 (-0.29z)| lr 3.66e-04 | 323.18 ms | 52.2% bf16 MFU | 1624590 tok/s step 8791/19560 | loss 3.393988 (-1.14z)| norm 0.2596 (-0.79z)| lr 3.66e-04 | 322.53 ms | 52.3% bf16 MFU | 1624638 tok/s step 8792/19560 | loss 3.497056 (+1.24z)| norm 0.2735 (-0.22z)| lr 3.66e-04 | 322.56 ms | 52.3% bf16 MFU | 1624677 tok/s step 8793/19560 | loss 3.441606 (-0.05z)| norm 0.2946 (+0.61z)| lr 3.66e-04 | 322.96 ms | 52.3% bf16 MFU | 1624612 tok/s step 8794/19560 | loss 3.452300 (+0.19z)| norm 0.2784 (-0.03z)| lr 3.66e-04 | 323.04 ms | 52.2% bf16 MFU | 1624530 tok/s step 8795/19560 | loss 3.470620 (+0.61z)| norm 0.2742 (-0.20z)| lr 3.66e-04 | 322.89 ms | 52.3% bf16 MFU | 1624491 tok/s step 8796/19560 | loss 3.470849 (+0.60z)| norm 0.2983 (+0.75z)| lr 3.66e-04 | 322.80 ms | 52.3% bf16 MFU | 1624476 tok/s step 8797/19560 | loss 3.409572 (-0.83z)| norm 0.2888 (+0.37z)| lr 3.66e-04 | 323.02 ms | 52.2% 
bf16 MFU | 1624407 tok/s step 8798/19560 | loss 3.465658 (+0.52z)| norm 0.2793 (-0.02z)| lr 3.66e-04 | 322.68 ms | 52.3% bf16 MFU | 1624426 tok/s step 8799/19560 | loss 3.395717 (-1.16z)| norm 0.2711 (-0.34z)| lr 3.66e-04 | 322.95 ms | 52.3% bf16 MFU | 1624377 tok/s step 8800/19560 | loss 3.395290 (-1.17z)| norm 0.2750 (-0.20z)| lr 3.66e-04 | 322.48 ms | 52.3% bf16 MFU | 1624447 tok/s step 8801/19560 | loss 3.435674 (-0.20z)| norm 0.2820 (+0.09z)| lr 3.66e-04 | 322.35 ms | 52.4% bf16 MFU | 1624547 tok/s step 8802/19560 | loss 3.433220 (-0.26z)| norm 0.2990 (+0.77z)| lr 3.66e-04 | 322.26 ms | 52.4% bf16 MFU | 1624664 tok/s step 8803/19560 | loss 3.415141 (-0.68z)| norm 0.2976 (+0.71z)| lr 3.66e-04 | 322.81 ms | 52.3% bf16 MFU | 1624639 tok/s step 8804/19560 | loss 3.413068 (-0.73z)| norm 0.2987 (+0.75z)| lr 3.66e-04 | 322.34 ms | 52.4% bf16 MFU | 1624732 tok/s step 8805/19560 | loss 3.473010 (+0.71z)| norm 0.2909 (+0.44z)| lr 3.66e-04 | 322.62 ms | 52.3% bf16 MFU | 1624751 tok/s step 8806/19560 | loss 3.476702 (+0.78z)| norm 0.2779 (-0.09z)| lr 3.66e-04 | 322.74 ms | 52.3% bf16 MFU | 1624738 tok/s step 8807/19560 | loss 3.460063 (+0.38z)| norm 0.2884 (+0.33z)| lr 3.66e-04 | 322.30 ms | 52.4% bf16 MFU | 1624836 tok/s step 8808/19560 | loss 3.439105 (-0.15z)| norm 0.2869 (+0.43z)| lr 3.66e-04 | 322.56 ms | 52.3% bf16 MFU | 1624864 tok/s step 8809/19560 | loss 3.434112 (-0.28z)| norm 0.2797 (+0.07z)| lr 3.66e-04 | 322.26 ms | 52.4% bf16 MFU | 1624966 tok/s step 8810/19560 | loss 3.468459 (+0.57z)| norm 0.4048 (+5.82z)| lr 3.65e-04 | 323.36 ms | 52.2% bf16 MFU | 1624787 tok/s step 8811/19560 | loss 3.436867 (-0.21z)| norm 0.3710 (+3.95z)| lr 3.65e-04 | 322.87 ms | 52.3% bf16 MFU | 1624739 tok/s step 8812/19560 | loss 3.445322 (+0.00z)| norm 0.3041 (+1.03z)| lr 3.65e-04 | 322.66 ms | 52.3% bf16 MFU | 1624748 tok/s step 8813/19560 | loss 3.348622 (-2.37z)| norm 0.3340 (+2.27z)| lr 3.65e-04 | 322.54 ms | 52.3% bf16 MFU | 1624785 tok/s step 8814/19560 | loss 3.441570 
(-0.07z)| norm 0.2927 (+0.51z)| lr 3.65e-04 | 322.23 ms | 52.4% bf16 MFU | 1624898 tok/s step 8815/19560 | loss 3.354689 (-2.18z)| norm 0.2729 (-0.35z)| lr 3.65e-04 | 323.01 ms | 52.2% bf16 MFU | 1624810 tok/s step 8816/19560 | loss 3.469795 (+0.63z)| norm 0.2725 (-0.36z)| lr 3.65e-04 | 322.44 ms | 52.3% bf16 MFU | 1624871 tok/s step 8817/19560 | loss 3.479502 (+0.86z)| norm 0.2682 (-0.54z)| lr 3.65e-04 | 323.10 ms | 52.2% bf16 MFU | 1624762 tok/s step 8818/19560 | loss 3.528237 (+2.00z)| norm 0.2731 (-0.34z)| lr 3.65e-04 | 322.09 ms | 52.4% bf16 MFU | 1624912 tok/s step 8819/19560 | loss 3.400769 (-1.05z)| norm 0.2699 (-0.47z)| lr 3.65e-04 | 322.32 ms | 52.4% bf16 MFU | 1624997 tok/s step 8820/19560 | loss 3.461381 (+0.40z)| norm 0.2986 (+0.75z)| lr 3.65e-04 | 322.45 ms | 52.3% bf16 MFU | 1625046 tok/s step 8821/19560 | loss 3.470102 (+0.60z)| norm 0.2736 (-0.33z)| lr 3.65e-04 | 322.41 ms | 52.3% bf16 MFU | 1625102 tok/s step 8822/19560 | loss 3.464617 (+0.47z)| norm 0.2709 (-0.46z)| lr 3.65e-04 | 322.79 ms | 52.3% bf16 MFU | 1625059 tok/s step 8823/19560 | loss 3.433441 (-0.27z)| norm 0.3123 (+1.32z)| lr 3.65e-04 | 323.13 ms | 52.2% bf16 MFU | 1624932 tok/s step 8824/19560 | loss 3.589712 (+3.32z)| norm 0.2722 (-0.42z)| lr 3.65e-04 | 323.03 ms | 52.2% bf16 MFU | 1624837 tok/s step 8825/19560 | loss 3.460890 (+0.35z)| norm 0.2781 (-0.18z)| lr 3.65e-04 | 322.47 ms | 52.3% bf16 MFU | 1624887 tok/s step 8826/19560 | loss 3.417469 (-0.65z)| norm 0.2527 (-1.27z)| lr 3.65e-04 | 322.52 ms | 52.3% bf16 MFU | 1624922 tok/s step 8827/19560 | loss 3.521033 (+1.70z)| norm 0.2747 (-0.33z)| lr 3.65e-04 | 322.47 ms | 52.3% bf16 MFU | 1624968 tok/s step 8828/19560 | loss 3.412377 (-0.76z)| norm 0.2780 (-0.19z)| lr 3.65e-04 | 322.28 ms | 52.4% bf16 MFU | 1625060 tok/s step 8829/19560 | loss 3.420574 (-0.57z)| norm 0.2717 (-0.46z)| lr 3.65e-04 | 322.72 ms | 52.3% bf16 MFU | 1625037 tok/s step 8830/19560 | loss 3.393234 (-1.18z)| norm 0.2478 (-1.49z)| lr 3.65e-04 | 322.67 ms | 52.3% 
bf16 MFU | 1625027 tok/s step 8831/19560 | loss 3.448792 (+0.08z)| norm 0.2687 (-0.58z)| lr 3.64e-04 | 322.46 ms | 52.3% bf16 MFU | 1625071 tok/s step 8832/19560 | loss 3.413512 (-0.71z)| norm 0.2806 (-0.07z)| lr 3.64e-04 | 322.62 ms | 52.3% bf16 MFU | 1625073 tok/s step 8833/19560 | loss 3.431435 (-0.32z)| norm 0.2918 (+0.43z)| lr 3.64e-04 | 322.60 ms | 52.3% bf16 MFU | 1625078 tok/s step 8834/19560 | loss 3.385299 (-1.37z)| norm 0.2718 (-0.44z)| lr 3.64e-04 | 322.62 ms | 52.3% bf16 MFU | 1625079 tok/s step 8835/19560 | loss 3.521108 (+1.75z)| norm 0.3257 (+1.93z)| lr 3.64e-04 | 323.05 ms | 52.2% bf16 MFU | 1624972 tok/s step 8836/19560 | loss 3.539125 (+2.11z)| norm 0.2948 (+0.57z)| lr 3.64e-04 | 322.83 ms | 52.3% bf16 MFU | 1624926 tok/s step 8837/19560 | loss 3.408901 (-0.81z)| norm 0.2834 (+0.07z)| lr 3.64e-04 | 322.38 ms | 52.4% bf16 MFU | 1624995 tok/s step 8838/19560 | loss 3.465849 (+0.48z)| norm 0.2991 (+0.75z)| lr 3.64e-04 | 322.23 ms | 52.4% bf16 MFU | 1625099 tok/s step 8839/19560 | loss 3.431461 (-0.29z)| norm 0.2756 (-0.27z)| lr 3.64e-04 | 322.55 ms | 52.3% bf16 MFU | 1625116 tok/s step 8840/19560 | loss 3.412944 (-0.72z)| norm 0.2790 (-0.12z)| lr 3.64e-04 | 323.83 ms | 52.1% bf16 MFU | 1624812 tok/s step 8841/19560 | loss 3.470102 (+0.59z)| norm 0.2769 (-0.21z)| lr 3.64e-04 | 322.13 ms | 52.4% bf16 MFU | 1624950 tok/s step 8842/19560 | loss 3.477692 (+0.76z)| norm 0.2689 (-0.56z)| lr 3.64e-04 | 322.01 ms | 52.4% bf16 MFU | 1625111 tok/s step 8843/19560 | loss 3.411615 (-0.74z)| norm 0.2640 (-0.76z)| lr 3.64e-04 | 322.58 ms | 52.3% bf16 MFU | 1625119 tok/s step 8844/19560 | loss 3.414520 (-0.67z)| norm 0.2866 (+0.23z)| lr 3.64e-04 | 322.51 ms | 52.3% bf16 MFU | 1625146 tok/s step 8845/19560 | loss 3.395822 (-1.11z)| norm 0.2669 (-0.63z)| lr 3.64e-04 | 322.38 ms | 52.4% bf16 MFU | 1625204 tok/s step 8846/19560 | loss 3.454199 (+0.26z)| norm 0.2785 (-0.13z)| lr 3.64e-04 | 322.60 ms | 52.3% bf16 MFU | 1625205 tok/s step 8847/19560 | loss 3.413811 
(-0.68z)| norm 0.2957 (+0.62z)| lr 3.64e-04 | 322.23 ms | 52.4% bf16 MFU | 1625297 tok/s step 8848/19560 | loss 3.480820 (+0.88z)| norm 0.2716 (-0.43z)| lr 3.64e-04 | 322.55 ms | 52.3% bf16 MFU | 1625304 tok/s step 8849/19560 | loss 3.501454 (+1.35z)| norm 0.2718 (-0.43z)| lr 3.64e-04 | 322.90 ms | 52.3% bf16 MFU | 1625223 tok/s step 8850/19560 | loss 3.475128 (+0.75z)| norm 0.2965 (+0.64z)| lr 3.64e-04 | 322.89 ms | 52.3% bf16 MFU | 1625149 tok/s step 8851/19560 | loss 3.409016 (-0.80z)| norm 0.2791 (-0.13z)| lr 3.63e-04 | 322.84 ms | 52.3% bf16 MFU | 1625091 tok/s step 8852/19560 | loss 3.495074 (+1.20z)| norm 0.2607 (-0.94z)| lr 3.63e-04 | 322.56 ms | 52.3% bf16 MFU | 1625107 tok/s step 8853/19560 | loss 3.462538 (+0.43z)| norm 0.3030 (+0.91z)| lr 3.63e-04 | 322.36 ms | 52.4% bf16 MFU | 1625172 tok/s step 8854/19560 | loss 3.383745 (-1.38z)| norm 0.2418 (-1.78z)| lr 3.63e-04 | 322.61 ms | 52.3% bf16 MFU | 1625171 tok/s step 8855/19560 | loss 3.397996 (-1.03z)| norm 0.2855 (+0.13z)| lr 3.63e-04 | 322.39 ms | 52.3% bf16 MFU | 1625225 tok/s step 8856/19560 | loss 3.388087 (-1.24z)| norm 0.2421 (-1.75z)| lr 3.63e-04 | 322.29 ms | 52.4% bf16 MFU | 1625301 tok/s step 8857/19560 | loss 3.392888 (-1.11z)| norm 0.2765 (-0.26z)| lr 3.63e-04 | 322.65 ms | 52.3% bf16 MFU | 1625284 tok/s step 8858/19560 | loss 3.495675 (+1.23z)| norm 0.2776 (-0.22z)| lr 3.63e-04 | 322.55 ms | 52.3% bf16 MFU | 1625293 tok/s step 8859/19560 | loss 3.480313 (+0.86z)| norm 0.2779 (-0.21z)| lr 3.63e-04 | 322.40 ms | 52.3% bf16 MFU | 1625339 tok/s step 8860/19560 | loss 3.425961 (-0.37z)| norm 0.2828 (+0.00z)| lr 3.63e-04 | 322.72 ms | 52.3% bf16 MFU | 1625300 tok/s step 8861/19560 | loss 3.493790 (+1.17z)| norm 0.2713 (-0.52z)| lr 3.63e-04 | 322.58 ms | 52.3% bf16 MFU | 1625299 tok/s step 8862/19560 | loss 3.438894 (-0.09z)| norm 0.2711 (-0.53z)| lr 3.63e-04 | 322.57 ms | 52.3% bf16 MFU | 1625300 tok/s step 8863/19560 | loss 3.425257 (-0.39z)| norm 0.2818 (-0.07z)| lr 3.63e-04 | 322.42 ms | 52.3% 
bf16 MFU | 1625341 tok/s step 8864/19560 | loss 3.440145 (-0.05z)| norm 0.2625 (-0.94z)| lr 3.63e-04 | 322.65 ms | 52.3% bf16 MFU | 1625320 tok/s step 8865/19560 | loss 3.443184 (+0.01z)| norm 0.2723 (-0.50z)| lr 3.63e-04 | 322.37 ms | 52.4% bf16 MFU | 1625372 tok/s step 8866/19560 | loss 3.460804 (+0.40z)| norm 0.2480 (-1.57z)| lr 3.63e-04 | 322.65 ms | 52.3% bf16 MFU | 1625350 tok/s step 8867/19560 | loss 3.433975 (-0.22z)| norm 0.2550 (-1.25z)| lr 3.63e-04 | 322.73 ms | 52.3% bf16 MFU | 1625309 tok/s step 8868/19560 | loss 3.418096 (-0.59z)| norm 0.2865 (+0.15z)| lr 3.63e-04 | 322.19 ms | 52.4% bf16 MFU | 1625406 tok/s step 8869/19560 | loss 3.405058 (-0.88z)| norm 0.2732 (-0.44z)| lr 3.63e-04 | 323.14 ms | 52.2% bf16 MFU | 1625260 tok/s step 8870/19560 | loss 3.404686 (-0.89z)| norm 0.2723 (-0.48z)| lr 3.63e-04 | 322.98 ms | 52.3% bf16 MFU | 1625161 tok/s step 8871/19560 | loss 3.409397 (-0.80z)| norm 0.2753 (-0.35z)| lr 3.63e-04 | 322.62 ms | 52.3% bf16 MFU | 1625157 tok/s step 8872/19560 | loss 3.402478 (-0.95z)| norm 0.2761 (-0.31z)| lr 3.62e-04 | 322.40 ms | 52.3% bf16 MFU | 1625210 tok/s step 8873/19560 | loss 3.383002 (-1.41z)| norm 0.2676 (-0.68z)| lr 3.62e-04 | 322.58 ms | 52.3% bf16 MFU | 1625214 tok/s step 8874/19560 | loss 3.442644 (+0.02z)| norm 0.2719 (-0.49z)| lr 3.62e-04 | 323.08 ms | 52.2% bf16 MFU | 1625092 tok/s step 8875/19560 | loss 3.366915 (-1.76z)| norm 0.2701 (-0.57z)| lr 3.62e-04 | 322.04 ms | 52.4% bf16 MFU | 1625239 tok/s step 8876/19560 | loss 3.424840 (-0.38z)| norm 0.2849 (+0.09z)| lr 3.62e-04 | 322.42 ms | 52.3% bf16 MFU | 1625282 tok/s step 8877/19560 | loss 3.452245 (+0.28z)| norm 0.2743 (-0.39z)| lr 3.62e-04 | 323.03 ms | 52.2% bf16 MFU | 1625169 tok/s step 8878/19560 | loss 3.432686 (-0.19z)| norm 0.2624 (-0.92z)| lr 3.62e-04 | 322.16 ms | 52.4% bf16 MFU | 1625282 tok/s step 8879/19560 | loss 3.452299 (+0.27z)| norm 0.2645 (-0.81z)| lr 3.62e-04 | 322.35 ms | 52.4% bf16 MFU | 1625341 tok/s step 8880/19560 | loss 3.453877 
(+0.31z)| norm 0.2634 (-0.84z)| lr 3.62e-04 | 323.14 ms | 52.2% bf16 MFU | 1625197 tok/s step 8881/19560 | loss 3.461178 (+0.48z)| norm 0.2586 (-1.05z)| lr 3.62e-04 | 322.44 ms | 52.3% bf16 MFU | 1625238 tok/s step 8882/19560 | loss 3.416324 (-0.59z)| norm 0.2625 (-0.87z)| lr 3.62e-04 | 322.13 ms | 52.4% bf16 MFU | 1625354 tok/s step 8883/19560 | loss 3.389269 (-1.22z)| norm 0.2645 (-0.78z)| lr 3.62e-04 | 322.31 ms | 52.4% bf16 MFU | 1625420 tok/s step 8884/19560 | loss 3.498570 (+1.36z)| norm 0.3017 (+0.87z)| lr 3.62e-04 | 322.62 ms | 52.3% bf16 MFU | 1625404 tok/s step 8885/19560 | loss 3.391124 (-1.17z)| norm 0.2686 (-0.60z)| lr 3.62e-04 | 322.26 ms | 52.4% bf16 MFU | 1625481 tok/s step 8886/19560 | loss 3.403497 (-0.87z)| norm 0.2683 (-0.61z)| lr 3.62e-04 | 322.67 ms | 52.3% bf16 MFU | 1625450 tok/s step 8887/19560 | loss 3.475668 (+0.81z)| norm 0.2694 (-0.56z)| lr 3.62e-04 | 322.60 ms | 52.3% bf16 MFU | 1625436 tok/s step 8888/19560 | loss 3.508762 (+1.56z)| norm 0.2610 (-0.93z)| lr 3.62e-04 | 322.51 ms | 52.3% bf16 MFU | 1625448 tok/s step 8889/19560 | loss 3.418881 (-0.54z)| norm 0.2652 (-0.73z)| lr 3.62e-04 | 322.41 ms | 52.3% bf16 MFU | 1625484 tok/s step 8890/19560 | loss 3.405162 (-0.86z)| norm 0.2884 (+0.28z)| lr 3.62e-04 | 322.44 ms | 52.3% bf16 MFU | 1625510 tok/s step 8891/19560 | loss 3.349230 (-2.14z)| norm 0.2861 (+0.18z)| lr 3.62e-04 | 322.68 ms | 52.3% bf16 MFU | 1625473 tok/s step 8892/19560 | loss 3.415139 (-0.63z)| norm 0.3081 (+1.20z)| lr 3.61e-04 | 322.48 ms | 52.3% bf16 MFU | 1625488 tok/s step 8893/19560 | loss 3.410538 (-0.73z)| norm 0.2601 (-0.97z)| lr 3.61e-04 | 322.81 ms | 52.3% bf16 MFU | 1625420 tok/s step 8894/19560 | loss 3.443739 (+0.05z)| norm 0.2787 (-0.10z)| lr 3.61e-04 | 322.78 ms | 52.3% bf16 MFU | 1625362 tok/s step 8895/19560 | loss 3.432724 (-0.20z)| norm 0.2847 (+0.17z)| lr 3.61e-04 | 322.18 ms | 52.4% bf16 MFU | 1625459 tok/s step 8896/19560 | loss 3.439173 (-0.05z)| norm 0.2786 (-0.10z)| lr 3.61e-04 | 322.67 ms | 52.3% 
bf16 MFU | 1625428 tok/s step 8897/19560 | loss 3.469169 (+0.65z)| norm 0.2517 (-1.35z)| lr 3.61e-04 | 322.78 ms | 52.3% bf16 MFU | 1625371 tok/s step 8898/19560 | loss 3.424594 (-0.39z)| norm 0.2591 (-0.99z)| lr 3.61e-04 | 322.53 ms | 52.3% bf16 MFU | 1625379 tok/s step 8899/19560 | loss 3.407840 (-0.78z)| norm 0.2719 (-0.39z)| lr 3.61e-04 | 322.55 ms | 52.3% bf16 MFU | 1625382 tok/s step 8900/19560 | loss 3.434444 (-0.14z)| norm 0.2517 (-1.31z)| lr 3.61e-04 | 323.05 ms | 52.2% bf16 MFU | 1625258 tok/s step 8901/19560 | loss 3.416077 (-0.56z)| norm 0.2661 (-0.64z)| lr 3.61e-04 | 322.41 ms | 52.3% bf16 MFU | 1625302 tok/s step 8902/19560 | loss 3.473243 (+0.81z)| norm 0.2705 (-0.42z)| lr 3.61e-04 | 322.26 ms | 52.4% bf16 MFU | 1625382 tok/s step 8903/19560 | loss 3.387339 (-1.25z)| norm 0.2516 (-1.31z)| lr 3.61e-04 | 322.46 ms | 52.3% bf16 MFU | 1625408 tok/s step 8904/19560 | loss 3.475152 (+0.86z)| norm 0.2777 (-0.06z)| lr 3.61e-04 | 323.98 ms | 52.1% bf16 MFU | 1625052 tok/s step 8905/19560 | loss 3.426559 (-0.31z)| norm 0.2583 (-0.99z)| lr 3.61e-04 | 321.77 ms | 52.5% bf16 MFU | 1625269 tok/s step 8906/19560 | loss 3.518939 (+1.87z)| norm 0.2683 (-0.50z)| lr 3.61e-04 | 323.02 ms | 52.2% bf16 MFU | 1625160 tok/s step 8907/19560 | loss 3.461851 (+0.51z)| norm 0.2509 (-1.32z)| lr 3.61e-04 | 322.30 ms | 52.4% bf16 MFU | 1625238 tok/s step 8908/19560 | loss 3.407568 (-0.78z)| norm 0.2789 (+0.02z)| lr 3.61e-04 | 323.26 ms | 52.2% bf16 MFU | 1625070 tok/s step 8909/19560 | loss 3.405862 (-0.80z)| norm 0.2419 (-1.72z)| lr 3.61e-04 | 322.54 ms | 52.3% bf16 MFU | 1625091 tok/s step 8910/19560 | loss 3.471332 (+0.74z)| norm 0.2491 (-1.36z)| lr 3.61e-04 | 322.35 ms | 52.4% bf16 MFU | 1625159 tok/s step 8911/19560 | loss 3.448135 (+0.18z)| norm 0.2575 (-0.95z)| lr 3.61e-04 | 322.32 ms | 52.4% bf16 MFU | 1625231 tok/s step 8912/19560 | loss 3.588362 (+3.39z)| norm 0.3025 (+1.17z)| lr 3.60e-04 | 322.38 ms | 52.4% bf16 MFU | 1625284 tok/s step 8913/19560 | loss 3.444009 
(+0.07z)| norm 0.2622 (-0.72z)| lr 3.60e-04 | 322.51 ms | 52.3% bf16 MFU | 1625303 tok/s step 8914/19560 | loss 3.366094 (-1.70z)| norm 0.2878 (+0.48z)| lr 3.60e-04 | 322.28 ms | 52.4% bf16 MFU | 1625377 tok/s step 8915/19560 | loss 3.446282 (+0.15z)| norm 0.2810 (+0.17z)| lr 3.60e-04 | 322.42 ms | 52.3% bf16 MFU | 1625413 tok/s step 8916/19560 | loss 3.405958 (-0.77z)| norm 0.2943 (+0.78z)| lr 3.60e-04 | 322.73 ms | 52.3% bf16 MFU | 1625371 tok/s step 8917/19560 | loss 3.406327 (-0.78z)| norm 0.2910 (+0.62z)| lr 3.60e-04 | 323.37 ms | 52.2% bf16 MFU | 1625169 tok/s step 8918/19560 | loss 3.377916 (-1.43z)| norm 0.2757 (-0.10z)| lr 3.60e-04 | 322.87 ms | 52.3% bf16 MFU | 1625102 tok/s step 8919/19560 | loss 3.503366 (+1.48z)| norm 0.2695 (-0.40z)| lr 3.60e-04 | 322.39 ms | 52.3% bf16 MFU | 1625159 tok/s step 8920/19560 | loss 3.558239 (+2.69z)| norm 0.2752 (-0.13z)| lr 3.60e-04 | 323.57 ms | 52.2% bf16 MFU | 1624917 tok/s step 8921/19560 | loss 3.440918 (+0.02z)| norm 0.2844 (+0.31z)| lr 3.60e-04 | 322.74 ms | 52.3% bf16 MFU | 1624896 tok/s step 8922/19560 | loss 3.433374 (-0.15z)| norm 0.3127 (+1.62z)| lr 3.60e-04 | 322.11 ms | 52.4% bf16 MFU | 1625034 tok/s step 8923/19560 | loss 3.442149 (+0.05z)| norm 0.2793 (+0.05z)| lr 3.60e-04 | 322.97 ms | 52.3% bf16 MFU | 1624949 tok/s step 8924/19560 | loss 3.499680 (+1.35z)| norm 0.3119 (+1.56z)| lr 3.60e-04 | 323.03 ms | 52.2% bf16 MFU | 1624854 tok/s step 8925/19560 | loss 3.450813 (+0.24z)| norm 0.3154 (+1.70z)| lr 3.60e-04 | 323.51 ms | 52.2% bf16 MFU | 1624643 tok/s step 8926/19560 | loss 3.442741 (+0.06z)| norm 0.2681 (-0.47z)| lr 3.60e-04 | 322.66 ms | 52.3% bf16 MFU | 1624655 tok/s step 8927/19560 | loss 3.439507 (-0.02z)| norm 0.2933 (+0.68z)| lr 3.60e-04 | 322.64 ms | 52.3% bf16 MFU | 1624673 tok/s step 8928/19560 | loss 3.380825 (-1.36z)| norm 0.2802 (+0.07z)| lr 3.60e-04 | 322.62 ms | 52.3% bf16 MFU | 1624693 tok/s step 8929/19560 | loss 3.386668 (-1.21z)| norm 0.2945 (+0.72z)| lr 3.60e-04 | 322.52 ms | 52.3% 
bf16 MFU | 1624739 tok/s step 8930/19560 | loss 3.443421 (+0.07z)| norm 0.2580 (-0.94z)| lr 3.60e-04 | 322.82 ms | 52.3% bf16 MFU | 1624706 tok/s step 8931/19560 | loss 3.407630 (-0.74z)| norm 0.3157 (+1.69z)| lr 3.60e-04 | 322.71 ms | 52.3% bf16 MFU | 1624702 tok/s step 8932/19560 | loss 3.439260 (-0.02z)| norm 0.2683 (-0.46z)| lr 3.60e-04 | 322.75 ms | 52.3% bf16 MFU | 1624689 tok/s step 8933/19560 | loss 3.419868 (-0.46z)| norm 0.2733 (-0.22z)| lr 3.59e-04 | 322.98 ms | 52.3% bf16 MFU | 1624619 tok/s step 8934/19560 | loss 3.416588 (-0.52z)| norm 0.2729 (-0.24z)| lr 3.59e-04 | 322.47 ms | 52.3% bf16 MFU | 1624681 tok/s step 8935/19560 | loss 3.507649 (+1.53z)| norm 0.3247 (+2.08z)| lr 3.59e-04 | 322.63 ms | 52.3% bf16 MFU | 1624698 tok/s step 8936/19560 | loss 3.408108 (-0.71z)| norm 0.2862 (+0.35z)| lr 3.59e-04 | 322.63 ms | 52.3% bf16 MFU | 1624716 tok/s step 8937/19560 | loss 3.409893 (-0.66z)| norm 0.3084 (+1.33z)| lr 3.59e-04 | 323.38 ms | 52.2% bf16 MFU | 1624544 tok/s step 8938/19560 | loss 3.413041 (-0.58z)| norm 0.2851 (+0.38z)| lr 3.59e-04 | 323.61 ms | 52.2% bf16 MFU | 1624322 tok/s step 8939/19560 | loss 3.506184 (+1.49z)| norm 0.2778 (+0.05z)| lr 3.59e-04 | 323.68 ms | 52.1% bf16 MFU | 1624095 tok/s step 8940/19560 | loss 3.427557 (-0.26z)| norm 0.2794 (+0.15z)| lr 3.59e-04 | 322.97 ms | 52.3% bf16 MFU | 1624058 tok/s step 8941/19560 | loss 3.389699 (-1.13z)| norm 0.2984 (+1.30z)| lr 3.59e-04 | 323.14 ms | 52.2% bf16 MFU | 1623978 tok/s step 8942/19560 | loss 3.398412 (-0.92z)| norm 0.2920 (+0.92z)| lr 3.59e-04 | 322.44 ms | 52.3% bf16 MFU | 1624078 tok/s step 8943/19560 | loss 3.412790 (-0.61z)| norm 0.2722 (-0.26z)| lr 3.59e-04 | 322.89 ms | 52.3% bf16 MFU | 1624062 tok/s step 8944/19560 | loss 3.446712 (+0.16z)| norm 0.2873 (+0.63z)| lr 3.59e-04 | 323.06 ms | 52.2% bf16 MFU | 1624002 tok/s step 8945/19560 | loss 3.436824 (-0.06z)| norm 0.2953 (+1.10z)| lr 3.59e-04 | 322.17 ms | 52.4% bf16 MFU | 1624170 tok/s step 8946/19560 | loss 3.410948 
(-0.64z)| norm 0.2754 (-0.09z)| lr 3.59e-04 | 323.72 ms | 52.1% bf16 MFU | 1623941 tok/s step 8947/19560 | loss 3.417913 (-0.48z)| norm 0.2821 (+0.31z)| lr 3.59e-04 | 323.05 ms | 52.2% bf16 MFU | 1623892 tok/s step 8948/19560 | loss 3.407659 (-0.71z)| norm 0.2665 (-0.61z)| lr 3.59e-04 | 322.89 ms | 52.3% bf16 MFU | 1623884 tok/s step 8949/19560 | loss 3.468031 (+0.70z)| norm 0.2637 (-0.77z)| lr 3.59e-04 | 322.41 ms | 52.3% bf16 MFU | 1623997 tok/s step 8950/19560 | loss 3.478948 (+0.95z)| norm 0.2736 (-0.18z)| lr 3.59e-04 | 322.62 ms | 52.3% bf16 MFU | 1624052 tok/s step 8951/19560 | loss 3.424880 (-0.31z)| norm 0.2835 (+0.43z)| lr 3.59e-04 | 322.61 ms | 52.3% bf16 MFU | 1624106 tok/s step 8952/19560 | loss 3.400236 (-0.89z)| norm 0.2595 (-1.02z)| lr 3.59e-04 | 323.11 ms | 52.2% bf16 MFU | 1624031 tok/s step 8953/19560 | loss 3.477831 (+1.00z)| norm 0.2721 (-0.25z)| lr 3.58e-04 | 322.93 ms | 52.3% bf16 MFU | 1624006 tok/s step 8954/19560 | loss 3.422528 (-0.35z)| norm 0.2786 (+0.13z)| lr 3.58e-04 | 322.63 ms | 52.3% bf16 MFU | 1624058 tok/s step 8955/19560 | loss 3.415101 (-0.51z)| norm 0.2467 (-1.77z)| lr 3.58e-04 | 322.67 ms | 52.3% bf16 MFU | 1624098 tok/s step 8956/19560 | loss 3.409843 (-0.64z)| norm 0.2741 (-0.13z)| lr 3.58e-04 | 322.55 ms | 52.3% bf16 MFU | 1624165 tok/s step 8957/19560 | loss 3.392240 (-1.07z)| norm 0.2470 (-1.72z)| lr 3.58e-04 | 323.35 ms | 52.2% bf16 MFU | 1624027 tok/s step 8958/19560 | loss 3.485990 (+1.22z)| norm 0.2685 (-0.46z)| lr 3.58e-04 | 322.65 ms | 52.3% bf16 MFU | 1624074 tok/s step 8959/19560 | loss 3.403657 (-0.79z)| norm 0.3094 (+1.95z)| lr 3.58e-04 | 323.43 ms | 52.2% bf16 MFU | 1623923 tok/s step 8960/19560 | loss 3.371307 (-1.56z)| norm 0.2811 (+0.27z)| lr 3.58e-04 | 322.58 ms | 52.3% bf16 MFU | 1623992 tok/s step 8961/19560 | loss 3.444086 (+0.20z)| norm 0.2581 (-1.07z)| lr 3.58e-04 | 322.99 ms | 52.3% bf16 MFU | 1623953 tok/s step 8962/19560 | loss 3.406603 (-0.72z)| norm 0.2793 (+0.18z)| lr 3.58e-04 | 322.58 ms | 52.3% 
bf16 MFU | 1624020 tok/s step 8963/19560 | loss 3.425295 (-0.25z)| norm 0.2571 (-1.14z)| lr 3.58e-04 | 322.71 ms | 52.3% bf16 MFU | 1624050 tok/s step 8964/19560 | loss 3.449435 (+0.38z)| norm 0.2661 (-0.58z)| lr 3.58e-04 | 323.20 ms | 52.2% bf16 MFU | 1623957 tok/s step 8965/19560 | loss 3.452588 (+0.45z)| norm 0.2507 (-1.49z)| lr 3.58e-04 | 322.19 ms | 52.4% bf16 MFU | 1624123 tok/s step 8966/19560 | loss 3.486574 (+1.31z)| norm 0.2696 (-0.34z)| lr 3.58e-04 | 322.53 ms | 52.3% bf16 MFU | 1624195 tok/s step 8967/19560 | loss 3.362250 (-1.81z)| norm 0.2610 (-0.85z)| lr 3.58e-04 | 322.99 ms | 52.3% bf16 MFU | 1624148 tok/s step 8968/19560 | loss 3.465017 (+0.75z)| norm 0.2626 (-0.74z)| lr 3.58e-04 | 322.66 ms | 52.3% bf16 MFU | 1624185 tok/s step 8969/19560 | loss 3.370696 (-1.58z)| norm 0.2732 (-0.10z)| lr 3.58e-04 | 323.58 ms | 52.2% bf16 MFU | 1623990 tok/s step 8970/19560 | loss 3.417867 (-0.39z)| norm 0.2765 (+0.10z)| lr 3.58e-04 | 322.28 ms | 52.4% bf16 MFU | 1624131 tok/s step 8971/19560 | loss 3.419619 (-0.35z)| norm 0.2701 (-0.29z)| lr 3.58e-04 | 322.68 ms | 52.3% bf16 MFU | 1624165 tok/s step 8972/19560 | loss 3.387263 (-1.15z)| norm 0.2677 (-0.43z)| lr 3.58e-04 | 322.73 ms | 52.3% bf16 MFU | 1624184 tok/s step 8973/19560 | loss 3.391004 (-1.05z)| norm 0.2743 (-0.03z)| lr 3.58e-04 | 322.87 ms | 52.3% bf16 MFU | 1624166 tok/s step 8974/19560 | loss 3.499007 (+1.60z)| norm 0.2801 (+0.32z)| lr 3.57e-04 | 323.07 ms | 52.2% bf16 MFU | 1624098 tok/s step 8975/19560 | loss 3.453792 (+0.48z)| norm 0.2671 (-0.46z)| lr 3.57e-04 | 323.12 ms | 52.2% bf16 MFU | 1624022 tok/s step 8976/19560 | loss 3.366905 (-1.62z)| norm 0.2773 (+0.16z)| lr 3.57e-04 | 322.50 ms | 52.3% bf16 MFU | 1624106 tok/s step 8977/19560 | loss 3.399822 (-0.81z)| norm 0.2666 (-0.49z)| lr 3.57e-04 | 323.46 ms | 52.2% bf16 MFU | 1623944 tok/s step 8978/19560 | loss 3.423532 (-0.21z)| norm 0.2726 (-0.11z)| lr 3.57e-04 | 322.88 ms | 52.3% bf16 MFU | 1623935 tok/s step 8979/19560 | loss 3.408410 
(-0.59z)| norm 0.2806 (+0.38z)| lr 3.57e-04 | 322.99 ms | 52.3% bf16 MFU | 1623900 tok/s step 8980/19560 | loss 3.413599 (-0.45z)| norm 0.3158 (+2.47z)| lr 3.57e-04 | 323.08 ms | 52.2% bf16 MFU | 1623843 tok/s step 8981/19560 | loss 3.423694 (-0.19z)| norm 0.3273 (+3.07z)| lr 3.57e-04 | 323.43 ms | 52.2% bf16 MFU | 1623703 tok/s step 8982/19560 | loss 3.501943 (+1.74z)| norm 0.2908 (+0.91z)| lr 3.57e-04 | 323.27 ms | 52.2% bf16 MFU | 1623609 tok/s step 8983/19560 | loss 3.414044 (-0.45z)| norm 0.3042 (+1.69z)| lr 3.57e-04 | 322.95 ms | 52.3% bf16 MFU | 1623599 tok/s step 8984/19560 | loss 3.417197 (-0.38z)| norm 0.3075 (+1.86z)| lr 3.57e-04 | 323.24 ms | 52.2% bf16 MFU | 1623518 tok/s step 8985/19560 | loss 3.530268 (+2.38z)| norm 0.3297 (+3.02z)| lr 3.57e-04 | 322.93 ms | 52.3% bf16 MFU | 1623520 tok/s step 8986/19560 | loss 3.388352 (-1.09z)| norm 0.3203 (+2.42z)| lr 3.57e-04 | 322.95 ms | 52.3% bf16 MFU | 1623517 tok/s step 8987/19560 | loss 3.547812 (+2.76z)| norm 0.2938 (+0.93z)| lr 3.57e-04 | 323.30 ms | 52.2% bf16 MFU | 1623424 tok/s step 8988/19560 | loss 3.447506 (+0.34z)| norm 0.2789 (+0.11z)| lr 3.57e-04 | 322.94 ms | 52.3% bf16 MFU | 1623428 tok/s step 8989/19560 | loss 3.440611 (+0.19z)| norm 0.2760 (-0.06z)| lr 3.57e-04 | 322.74 ms | 52.3% bf16 MFU | 1623480 tok/s step 8990/19560 | loss 3.382853 (-1.20z)| norm 0.2740 (-0.16z)| lr 3.57e-04 | 322.33 ms | 52.4% bf16 MFU | 1623635 tok/s step 8991/19560 | loss 3.384134 (-1.15z)| norm 0.2786 (+0.09z)| lr 3.57e-04 | 323.24 ms | 52.2% bf16 MFU | 1623552 tok/s step 8992/19560 | loss 3.442890 (+0.26z)| norm 0.2707 (-0.35z)| lr 3.57e-04 | 323.46 ms | 52.2% bf16 MFU | 1623417 tok/s step 8993/19560 | loss 3.457104 (+0.59z)| norm 0.2570 (-1.10z)| lr 3.57e-04 | 322.77 ms | 52.3% bf16 MFU | 1623463 tok/s step 8994/19560 | loss 3.430238 (-0.04z)| norm 0.2815 (+0.24z)| lr 3.56e-04 | 323.36 ms | 52.2% bf16 MFU | 1623360 tok/s step 8995/19560 | loss 3.347271 (-1.99z)| norm 0.2774 (+0.00z)| lr 3.56e-04 | 323.53 ms | 52.2% 
bf16 MFU | 1623217 tok/s step 8996/19560 | loss 3.469496 (+0.89z)| norm 0.2610 (-0.91z)| lr 3.56e-04 | 322.94 ms | 52.3% bf16 MFU | 1623231 tok/s step 8997/19560 | loss 3.455777 (+0.56z)| norm 0.2808 (+0.20z)| lr 3.56e-04 | 322.66 ms | 52.3% bf16 MFU | 1623313 tok/s step 8998/19560 | loss 3.359455 (-1.69z)| norm 0.2722 (-0.28z)| lr 3.56e-04 | 323.11 ms | 52.2% bf16 MFU | 1623278 tok/s step 8999/19560 | loss 3.410799 (-0.49z)| norm 0.2594 (-0.99z)| lr 3.56e-04 | 322.44 ms | 52.3% bf16 MFU | 1623415 tok/s step 9000/19560 | loss 3.435229 (+0.07z)| norm 0.2708 (-0.35z)| lr 3.56e-04 | 322.95 ms | 52.3% bf16 MFU | 1623416 tok/s val loss 3.419980 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating HellaSwag: 70/79 HellaSwag: 2860/10042 = 0.284804 step 9001/19560 | loss 3.470970 (+0.89z)| norm 0.2728 (-0.24z)| lr 3.56e-04 | 322.11 ms | 52.4% bf16 MFU | 1623628 tok/s step 9002/19560 | loss 3.458451 (+0.60z)| norm 0.2611 (-0.89z)| lr 3.56e-04 | 322.51 ms | 52.3% bf16 MFU | 1623728 tok/s step 9003/19560 | loss 3.430133 (-0.08z)| norm 0.2644 (-0.70z)| lr 3.56e-04 | 322.74 ms | 52.3% bf16 MFU | 1623766 tok/s step 9004/19560 | loss 3.402656 (-0.72z)| norm 0.2560 (-1.15z)| lr 3.56e-04 | 322.57 ms | 52.3% bf16 MFU | 1623847 tok/s step 9005/19560 | loss 3.385705 (-1.10z)| norm 0.2578 (-1.04z)| lr 3.56e-04 | 322.85 ms | 52.3% bf16 MFU | 1623850 tok/s step 9006/19560 | loss 3.411183 (-0.50z)| norm 0.2547 (-1.20z)| lr 3.56e-04 | 322.49 ms | 52.3% bf16 MFU | 1623945 tok/s step 9007/19560 | loss 3.487280 (+1.27z)| norm 0.4384 (+6.96z)| lr 3.56e-04 | 322.61 ms | 52.3% bf16 MFU | 1624006 tok/s step 9008/19560 | loss 3.448496 (+0.37z)| norm 0.2983 (+0.87z)| lr 3.56e-04 | 322.69 ms | 52.3% bf16 MFU | 1624044 tok/s step 9009/19560 | loss 3.488756 (+1.30z)| norm 0.3056 (+1.17z)| lr 3.56e-04 | 322.02 ms | 52.4% bf16 MFU | 1624248 tok/s step 
9010/19560 | loss 3.435748 (+0.06z)| norm 0.2761 (-0.11z)| lr 3.56e-04 | 322.63 ms | 52.3% bf16 MFU | 1624287 tok/s step 9011/19560 | loss 3.487025 (+1.23z)| norm 0.2778 (-0.04z)| lr 3.56e-04 | 323.03 ms | 52.2% bf16 MFU | 1624225 tok/s step 9012/19560 | loss 3.433494 (+0.00z)| norm 0.2831 (+0.19z)| lr 3.56e-04 | 321.97 ms | 52.4% bf16 MFU | 1624432 tok/s step 9013/19560 | loss 3.396920 (-0.86z)| norm 0.2845 (+0.25z)| lr 3.56e-04 | 322.34 ms | 52.4% bf16 MFU | 1624537 tok/s step 9014/19560 | loss 3.441000 (+0.17z)| norm 0.2641 (-0.63z)| lr 3.55e-04 | 322.96 ms | 52.3% bf16 MFU | 1624479 tok/s step 9015/19560 | loss 3.398735 (-0.81z)| norm 0.2878 (+0.39z)| lr 3.55e-04 | 322.53 ms | 52.3% bf16 MFU | 1624533 tok/s step 9016/19560 | loss 3.428229 (-0.10z)| norm 0.2774 (-0.07z)| lr 3.55e-04 | 322.76 ms | 52.3% bf16 MFU | 1624526 tok/s step 9017/19560 | loss 3.465700 (+0.78z)| norm 0.2668 (-0.53z)| lr 3.55e-04 | 322.67 ms | 52.3% bf16 MFU | 1624542 tok/s step 9018/19560 | loss 3.446106 (+0.31z)| norm 0.2699 (-0.39z)| lr 3.55e-04 | 322.25 ms | 52.4% bf16 MFU | 1624663 tok/s step 9019/19560 | loss 3.342871 (-2.14z)| norm 0.2934 (+0.63z)| lr 3.55e-04 | 322.88 ms | 52.3% bf16 MFU | 1624618 tok/s step 9020/19560 | loss 3.459299 (+0.61z)| norm 0.2682 (-0.45z)| lr 3.55e-04 | 322.98 ms | 52.3% bf16 MFU | 1624552 tok/s step 9021/19560 | loss 3.448261 (+0.34z)| norm 0.3014 (+0.98z)| lr 3.55e-04 | 323.01 ms | 52.2% bf16 MFU | 1624480 tok/s step 9022/19560 | loss 3.419070 (-0.34z)| norm 0.2680 (-0.47z)| lr 3.55e-04 | 322.06 ms | 52.4% bf16 MFU | 1624651 tok/s step 9023/19560 | loss 3.363715 (-1.63z)| norm 0.2758 (-0.13z)| lr 3.55e-04 | 322.74 ms | 52.3% bf16 MFU | 1624643 tok/s step 9024/19560 | loss 3.434957 (+0.05z)| norm 0.2585 (-0.87z)| lr 3.55e-04 | 322.94 ms | 52.3% bf16 MFU | 1624585 tok/s step 9025/19560 | loss 3.394803 (-0.88z)| norm 0.2787 (-0.00z)| lr 3.55e-04 | 323.32 ms | 52.2% bf16 MFU | 1624434 tok/s step 9026/19560 | loss 3.494932 (+1.44z)| norm 0.2775 (-0.06z)| lr 
3.55e-04 | 322.53 ms | 52.3% bf16 MFU | 1624490 tok/s step 9027/19560 | loss 3.464741 (+0.73z)| norm 0.2833 (+0.19z)| lr 3.55e-04 | 322.39 ms | 52.4% bf16 MFU | 1624579 tok/s step 9028/19560 | loss 3.376192 (-1.31z)| norm 0.2885 (+0.40z)| lr 3.55e-04 | 322.42 ms | 52.3% bf16 MFU | 1624655 tok/s step 9029/19560 | loss 3.425854 (-0.17z)| norm 0.2985 (+0.83z)| lr 3.55e-04 | 322.79 ms | 52.3% bf16 MFU | 1624635 tok/s step 9030/19560 | loss 3.417778 (-0.34z)| norm 0.2808 (+0.05z)| lr 3.55e-04 | 322.82 ms | 52.3% bf16 MFU | 1624608 tok/s step 9031/19560 | loss 3.386141 (-1.08z)| norm 0.2807 (+0.03z)| lr 3.55e-04 | 322.29 ms | 52.4% bf16 MFU | 1624716 tok/s step 9032/19560 | loss 3.455273 (+0.53z)| norm 0.2829 (+0.13z)| lr 3.55e-04 | 322.75 ms | 52.3% bf16 MFU | 1624701 tok/s step 9033/19560 | loss 3.438693 (+0.14z)| norm 0.2740 (-0.27z)| lr 3.55e-04 | 322.56 ms | 52.3% bf16 MFU | 1624736 tok/s step 9034/19560 | loss 3.423903 (-0.19z)| norm 0.2736 (-0.29z)| lr 3.55e-04 | 322.31 ms | 52.4% bf16 MFU | 1624832 tok/s step 9035/19560 | loss 3.426158 (-0.13z)| norm 0.2436 (-1.62z)| lr 3.54e-04 | 322.89 ms | 52.3% bf16 MFU | 1624777 tok/s step 9036/19560 | loss 3.407568 (-0.57z)| norm 0.2654 (-0.64z)| lr 3.54e-04 | 323.13 ms | 52.2% bf16 MFU | 1624664 tok/s step 9037/19560 | loss 3.513742 (+1.90z)| norm 0.2724 (-0.35z)| lr 3.54e-04 | 322.12 ms | 52.4% bf16 MFU | 1624813 tok/s step 9038/19560 | loss 3.520290 (+2.02z)| norm 0.2952 (+0.66z)| lr 3.54e-04 | 322.83 ms | 52.3% bf16 MFU | 1624775 tok/s step 9039/19560 | loss 3.388427 (-1.01z)| norm 0.2711 (-0.43z)| lr 3.54e-04 | 322.36 ms | 52.4% bf16 MFU | 1624857 tok/s step 9040/19560 | loss 3.373554 (-1.37z)| norm 0.2504 (-1.34z)| lr 3.54e-04 | 322.47 ms | 52.3% bf16 MFU | 1624906 tok/s step 9041/19560 | loss 3.359939 (-1.67z)| norm 0.2770 (-0.15z)| lr 3.54e-04 | 323.30 ms | 52.2% bf16 MFU | 1624744 tok/s step 9042/19560 | loss 3.411143 (-0.46z)| norm 0.2841 (+0.17z)| lr 3.54e-04 | 322.65 ms | 52.3% bf16 MFU | 1624754 tok/s step 
9043/19560 | loss 3.469146 (+0.93z)| norm 0.2676 (-0.57z)| lr 3.54e-04 | 322.66 ms | 52.3% bf16 MFU | 1624762 tok/s step 9044/19560 | loss 3.336827 (-2.19z)| norm 0.2897 (+0.43z)| lr 3.54e-04 | 322.16 ms | 52.4% bf16 MFU | 1624896 tok/s step 9045/19560 | loss 3.378646 (-1.20z)| norm 0.2803 (+0.01z)| lr 3.54e-04 | 322.78 ms | 52.3% bf16 MFU | 1624865 tok/s step 9046/19560 | loss 3.463053 (+0.77z)| norm 0.2789 (-0.05z)| lr 3.54e-04 | 323.11 ms | 52.2% bf16 MFU | 1624753 tok/s step 9047/19560 | loss 3.443476 (+0.32z)| norm 0.2906 (+0.47z)| lr 3.54e-04 | 322.75 ms | 52.3% bf16 MFU | 1624737 tok/s step 9048/19560 | loss 3.432596 (+0.09z)| norm 0.2840 (+0.17z)| lr 3.54e-04 | 322.25 ms | 52.4% bf16 MFU | 1624850 tok/s step 9049/19560 | loss 3.403382 (-0.62z)| norm 0.2748 (-0.25z)| lr 3.54e-04 | 322.34 ms | 52.4% bf16 MFU | 1624932 tok/s step 9050/19560 | loss 3.356039 (-1.75z)| norm 0.2795 (-0.02z)| lr 3.54e-04 | 322.71 ms | 52.3% bf16 MFU | 1624917 tok/s step 9051/19560 | loss 3.464681 (+0.89z)| norm 0.2533 (-1.21z)| lr 3.54e-04 | 322.48 ms | 52.3% bf16 MFU | 1624962 tok/s step 9052/19560 | loss 3.419859 (-0.19z)| norm 0.2657 (-0.63z)| lr 3.54e-04 | 322.20 ms | 52.4% bf16 MFU | 1625075 tok/s step 9053/19560 | loss 3.453368 (+0.63z)| norm 0.2616 (-0.81z)| lr 3.54e-04 | 322.93 ms | 52.3% bf16 MFU | 1624997 tok/s step 9054/19560 | loss 3.399624 (-0.68z)| norm 0.2664 (-0.58z)| lr 3.54e-04 | 322.22 ms | 52.4% bf16 MFU | 1625103 tok/s step 9055/19560 | loss 3.423254 (-0.10z)| norm 0.2664 (-0.57z)| lr 3.53e-04 | 322.72 ms | 52.3% bf16 MFU | 1625079 tok/s step 9056/19560 | loss 3.408752 (-0.46z)| norm 0.2765 (-0.10z)| lr 3.53e-04 | 322.49 ms | 52.3% bf16 MFU | 1625112 tok/s step 9057/19560 | loss 3.466049 (+0.94z)| norm 0.2845 (+0.27z)| lr 3.53e-04 | 322.58 ms | 52.3% bf16 MFU | 1625121 tok/s step 9058/19560 | loss 3.489961 (+1.51z)| norm 0.2594 (-0.89z)| lr 3.53e-04 | 322.56 ms | 52.3% bf16 MFU | 1625136 tok/s step 9059/19560 | loss 3.356688 (-1.72z)| norm 0.3006 (+1.02z)| lr 
3.53e-04 | 322.46 ms | 52.3% bf16 MFU | 1625174 tok/s step 9060/19560 | loss 3.420888 (-0.17z)| norm 0.2900 (+0.52z)| lr 3.53e-04 | 323.09 ms | 52.2% bf16 MFU | 1625052 tok/s step 9061/19560 | loss 3.400290 (-0.66z)| norm 0.2874 (+0.40z)| lr 3.53e-04 | 322.65 ms | 52.3% bf16 MFU | 1625046 tok/s step 9062/19560 | loss 3.471522 (+1.04z)| norm 0.2670 (-0.55z)| lr 3.53e-04 | 322.08 ms | 52.4% bf16 MFU | 1625185 tok/s step 9063/19560 | loss 3.421244 (-0.15z)| norm 0.2744 (-0.19z)| lr 3.53e-04 | 322.99 ms | 52.3% bf16 MFU | 1625087 tok/s step 9064/19560 | loss 3.413169 (-0.35z)| norm 0.2513 (-1.27z)| lr 3.53e-04 | 322.87 ms | 52.3% bf16 MFU | 1625024 tok/s step 9065/19560 | loss 3.445798 (+0.44z)| norm 0.2713 (-0.31z)| lr 3.53e-04 | 322.33 ms | 52.4% bf16 MFU | 1625101 tok/s step 9066/19560 | loss 3.413229 (-0.35z)| norm 0.2558 (-1.03z)| lr 3.53e-04 | 322.35 ms | 52.4% bf16 MFU | 1625168 tok/s step 9067/19560 | loss 3.446687 (+0.48z)| norm 0.2883 (+0.50z)| lr 3.53e-04 | 322.90 ms | 52.3% bf16 MFU | 1625094 tok/s step 9068/19560 | loss 3.482928 (+1.36z)| norm 0.2604 (-0.81z)| lr 3.53e-04 | 322.21 ms | 52.4% bf16 MFU | 1625196 tok/s step 9069/19560 | loss 3.447944 (+0.49z)| norm 0.2619 (-0.73z)| lr 3.53e-04 | 322.38 ms | 52.4% bf16 MFU | 1625250 tok/s step 9070/19560 | loss 3.437739 (+0.23z)| norm 0.2582 (-0.88z)| lr 3.53e-04 | 322.84 ms | 52.3% bf16 MFU | 1625187 tok/s step 9071/19560 | loss 3.493863 (+1.58z)| norm 0.3029 (+1.20z)| lr 3.53e-04 | 322.50 ms | 52.3% bf16 MFU | 1625212 tok/s step 9072/19560 | loss 3.376220 (-1.27z)| norm 0.2879 (+0.50z)| lr 3.53e-04 | 322.22 ms | 52.4% bf16 MFU | 1625308 tok/s step 9073/19560 | loss 3.504932 (+1.82z)| norm 0.2808 (+0.17z)| lr 3.53e-04 | 322.23 ms | 52.4% bf16 MFU | 1625397 tok/s step 9074/19560 | loss 3.491325 (+1.47z)| norm 0.2786 (+0.06z)| lr 3.53e-04 | 322.56 ms | 52.3% bf16 MFU | 1625397 tok/s step 9075/19560 | loss 3.409640 (-0.48z)| norm 0.2636 (-0.63z)| lr 3.52e-04 | 322.66 ms | 52.3% bf16 MFU | 1625372 tok/s step 
9076/19560 | loss 3.434485 (+0.11z)| norm 0.2601 (-0.79z)| lr 3.52e-04 | 322.24 ms | 52.4% bf16 MFU | 1625453 tok/s step 9077/19560 | loss 3.374799 (-1.29z)| norm 0.2722 (-0.23z)| lr 3.52e-04 | 322.41 ms | 52.3% bf16 MFU | 1625489 tok/s step 9078/19560 | loss 3.368887 (-1.41z)| norm 0.2752 (-0.09z)| lr 3.52e-04 | 322.57 ms | 52.3% bf16 MFU | 1625483 tok/s step 9079/19560 | loss 3.386582 (-0.98z)| norm 0.2732 (-0.18z)| lr 3.52e-04 | 322.36 ms | 52.4% bf16 MFU | 1625530 tok/s step 9080/19560 | loss 3.449842 (+0.51z)| norm 0.2823 (+0.24z)| lr 3.52e-04 | 322.53 ms | 52.3% bf16 MFU | 1625531 tok/s step 9081/19560 | loss 3.406512 (-0.51z)| norm 0.2745 (-0.13z)| lr 3.52e-04 | 322.43 ms | 52.3% bf16 MFU | 1625556 tok/s step 9082/19560 | loss 3.405335 (-0.53z)| norm 0.2625 (-0.68z)| lr 3.52e-04 | 322.88 ms | 52.3% bf16 MFU | 1625467 tok/s step 9083/19560 | loss 3.387475 (-0.94z)| norm 0.2755 (-0.08z)| lr 3.52e-04 | 322.51 ms | 52.3% bf16 MFU | 1625476 tok/s step 9084/19560 | loss 3.441464 (+0.33z)| norm 0.2961 (+0.88z)| lr 3.52e-04 | 322.88 ms | 52.3% bf16 MFU | 1625392 tok/s step 9085/19560 | loss 3.460471 (+0.76z)| norm 0.2706 (-0.34z)| lr 3.52e-04 | 322.04 ms | 52.4% bf16 MFU | 1625524 tok/s step 9086/19560 | loss 3.377740 (-1.18z)| norm 0.2753 (-0.11z)| lr 3.52e-04 | 322.76 ms | 52.3% bf16 MFU | 1625466 tok/s step 9087/19560 | loss 3.442625 (+0.35z)| norm 0.2601 (-0.82z)| lr 3.52e-04 | 323.16 ms | 52.2% bf16 MFU | 1625311 tok/s step 9088/19560 | loss 3.390009 (-0.90z)| norm 0.2738 (-0.17z)| lr 3.52e-04 | 322.51 ms | 52.3% bf16 MFU | 1625327 tok/s step 9089/19560 | loss 3.427941 (+0.01z)| norm 0.2879 (+0.50z)| lr 3.52e-04 | 322.39 ms | 52.3% bf16 MFU | 1625372 tok/s step 9090/19560 | loss 3.405437 (-0.53z)| norm 0.2792 (+0.08z)| lr 3.52e-04 | 322.86 ms | 52.3% bf16 MFU | 1625298 tok/s step 9091/19560 | loss 3.449969 (+0.52z)| norm 0.2526 (-1.19z)| lr 3.52e-04 | 322.37 ms | 52.4% bf16 MFU | 1625350 tok/s step 9092/19560 | loss 3.457292 (+0.70z)| norm 0.2596 (-0.85z)| lr 
3.52e-04 | 322.54 ms | 52.3% bf16 MFU | 1625358 tok/s step 9093/19560 | loss 3.401749 (-0.61z)| norm 0.2844 (+0.32z)| lr 3.52e-04 | 322.63 ms | 52.3% bf16 MFU | 1625342 tok/s step 9094/19560 | loss 3.457084 (+0.71z)| norm 0.3126 (+1.65z)| lr 3.52e-04 | 323.09 ms | 52.2% bf16 MFU | 1625212 tok/s step 9095/19560 | loss 3.443080 (+0.36z)| norm 0.3060 (+1.31z)| lr 3.52e-04 | 322.79 ms | 52.3% bf16 MFU | 1625164 tok/s step 9096/19560 | loss 3.502909 (+1.78z)| norm 0.2748 (-0.17z)| lr 3.51e-04 | 322.63 ms | 52.3% bf16 MFU | 1625157 tok/s step 9097/19560 | loss 3.518124 (+2.10z)| norm 0.3453 (+3.04z)| lr 3.51e-04 | 322.81 ms | 52.3% bf16 MFU | 1625107 tok/s step 9098/19560 | loss 3.406384 (-0.54z)| norm 0.2671 (-0.54z)| lr 3.51e-04 | 322.53 ms | 52.3% bf16 MFU | 1625129 tok/s step 9099/19560 | loss 3.469831 (+0.94z)| norm 0.2771 (-0.09z)| lr 3.51e-04 | 322.90 ms | 52.3% bf16 MFU | 1625056 tok/s step 9100/19560 | loss 3.421618 (-0.20z)| norm 0.2810 (+0.09z)| lr 3.51e-04 | 322.43 ms | 52.3% bf16 MFU | 1625107 tok/s step 9101/19560 | loss 3.370795 (-1.39z)| norm 0.2734 (-0.26z)| lr 3.51e-04 | 322.76 ms | 52.3% bf16 MFU | 1625072 tok/s step 9102/19560 | loss 3.442948 (+0.32z)| norm 0.2830 (+0.18z)| lr 3.51e-04 | 323.47 ms | 52.2% bf16 MFU | 1624860 tok/s step 9103/19560 | loss 3.442721 (+0.32z)| norm 0.2727 (-0.30z)| lr 3.51e-04 | 322.79 ms | 52.3% bf16 MFU | 1624828 tok/s step 9104/19560 | loss 3.493329 (+1.50z)| norm 0.2668 (-0.56z)| lr 3.51e-04 | 322.55 ms | 52.3% bf16 MFU | 1624860 tok/s step 9105/19560 | loss 3.408119 (-0.53z)| norm 0.2675 (-0.53z)| lr 3.51e-04 | 322.63 ms | 52.3% bf16 MFU | 1624869 tok/s step 9106/19560 | loss 3.392351 (-0.90z)| norm 0.2561 (-1.04z)| lr 3.51e-04 | 322.94 ms | 52.3% bf16 MFU | 1624800 tok/s step 9107/19560 | loss 3.444484 (+0.33z)| norm 0.2648 (-0.64z)| lr 3.51e-04 | 322.70 ms | 52.3% bf16 MFU | 1624795 tok/s step 9108/19560 | loss 3.360943 (-1.62z)| norm 0.2711 (-0.34z)| lr 3.51e-04 | 323.21 ms | 52.2% bf16 MFU | 1624662 tok/s step 
9109/19560 | loss 3.499036 (+1.59z)| norm 0.2624 (-0.73z)| lr 3.51e-04 | 322.60 ms | 52.3% bf16 MFU | 1624688 tok/s step 9110/19560 | loss 3.395336 (-0.81z)| norm 0.2703 (-0.35z)| lr 3.51e-04 | 322.37 ms | 52.4% bf16 MFU | 1624772 tok/s step 9111/19560 | loss 3.414057 (-0.37z)| norm 0.2574 (-0.94z)| lr 3.51e-04 | 322.93 ms | 52.3% bf16 MFU | 1624711 tok/s step 9112/19560 | loss 3.392458 (-0.87z)| norm 0.2664 (-0.51z)| lr 3.51e-04 | 323.00 ms | 52.3% bf16 MFU | 1624634 tok/s step 9113/19560 | loss 3.459565 (+0.73z)| norm 0.2608 (-0.77z)| lr 3.51e-04 | 322.86 ms | 52.3% bf16 MFU | 1624596 tok/s step 9114/19560 | loss 3.379359 (-1.18z)| norm 0.2498 (-1.29z)| lr 3.51e-04 | 322.56 ms | 52.3% bf16 MFU | 1624637 tok/s step 9115/19560 | loss 3.423426 (-0.11z)| norm 0.2692 (-0.33z)| lr 3.51e-04 | 323.08 ms | 52.2% bf16 MFU | 1624545 tok/s step 9116/19560 | loss 3.424205 (-0.09z)| norm 0.2639 (-0.58z)| lr 3.50e-04 | 322.24 ms | 52.4% bf16 MFU | 1624668 tok/s step 9117/19560 | loss 3.400369 (-0.67z)| norm 0.2599 (-0.77z)| lr 3.50e-04 | 323.02 ms | 52.2% bf16 MFU | 1624588 tok/s step 9118/19560 | loss 3.423598 (-0.10z)| norm 0.2584 (-0.83z)| lr 3.50e-04 | 322.73 ms | 52.3% bf16 MFU | 1624584 tok/s step 9119/19560 | loss 3.507665 (+1.93z)| norm 0.2643 (-0.54z)| lr 3.50e-04 | 323.03 ms | 52.2% bf16 MFU | 1624506 tok/s step 9120/19560 | loss 3.435697 (+0.17z)| norm 0.2720 (-0.17z)| lr 3.50e-04 | 322.93 ms | 52.3% bf16 MFU | 1624457 tok/s step 9121/19560 | loss 3.353178 (-1.81z)| norm 0.2661 (-0.46z)| lr 3.50e-04 | 322.74 ms | 52.3% bf16 MFU | 1624459 tok/s step 9122/19560 | loss 3.391810 (-0.86z)| norm 0.2725 (-0.14z)| lr 3.50e-04 | 322.64 ms | 52.3% bf16 MFU | 1624485 tok/s step 9123/19560 | loss 3.433521 (+0.13z)| norm 0.2732 (-0.10z)| lr 3.50e-04 | 322.71 ms | 52.3% bf16 MFU | 1624493 tok/s step 9124/19560 | loss 3.438283 (+0.25z)| norm 0.2848 (+0.45z)| lr 3.50e-04 | 322.76 ms | 52.3% bf16 MFU | 1624487 tok/s step 9125/19560 | loss 3.458185 (+0.74z)| norm 0.2798 (+0.21z)| lr 
3.50e-04 | 322.45 ms | 52.3% bf16 MFU | 1624562 tok/s step 9126/19560 | loss 3.462075 (+0.83z)| norm 0.2678 (-0.38z)| lr 3.50e-04 | 323.17 ms | 52.2% bf16 MFU | 1624450 tok/s step 9127/19560 | loss 3.502004 (+1.78z)| norm 0.2596 (-0.78z)| lr 3.50e-04 | 323.11 ms | 52.2% bf16 MFU | 1624360 tok/s step 9128/19560 | loss 3.377989 (-1.25z)| norm 0.2520 (-1.13z)| lr 3.50e-04 | 323.54 ms | 52.2% bf16 MFU | 1624166 tok/s step 9129/19560 | loss 3.425636 (-0.08z)| norm 0.2784 (+0.15z)| lr 3.50e-04 | 322.75 ms | 52.3% bf16 MFU | 1624179 tok/s step 9130/19560 | loss 3.440199 (+0.29z)| norm 0.2732 (-0.11z)| lr 3.50e-04 | 322.52 ms | 52.3% bf16 MFU | 1624250 tok/s step 9131/19560 | loss 3.412935 (-0.38z)| norm 0.2665 (-0.44z)| lr 3.50e-04 | 322.74 ms | 52.3% bf16 MFU | 1624262 tok/s step 9132/19560 | loss 3.377934 (-1.23z)| norm 0.2932 (+0.85z)| lr 3.50e-04 | 322.40 ms | 52.3% bf16 MFU | 1624360 tok/s step 9133/19560 | loss 3.433290 (+0.11z)| norm 0.2676 (-0.41z)| lr 3.50e-04 | 323.07 ms | 52.2% bf16 MFU | 1624283 tok/s step 9134/19560 | loss 3.406786 (-0.53z)| norm 0.2872 (+0.55z)| lr 3.50e-04 | 323.16 ms | 52.2% bf16 MFU | 1624188 tok/s step 9135/19560 | loss 3.415364 (-0.31z)| norm 0.2767 (+0.13z)| lr 3.50e-04 | 322.70 ms | 52.3% bf16 MFU | 1624213 tok/s step 9136/19560 | loss 3.442364 (+0.36z)| norm 0.2739 (-0.05z)| lr 3.49e-04 | 322.83 ms | 52.3% bf16 MFU | 1624205 tok/s step 9137/19560 | loss 3.431284 (+0.09z)| norm 0.2654 (-0.64z)| lr 3.49e-04 | 323.48 ms | 52.2% bf16 MFU | 1624035 tok/s step 9138/19560 | loss 3.436693 (+0.23z)| norm 0.2853 (+0.78z)| lr 3.49e-04 | 323.08 ms | 52.2% bf16 MFU | 1623973 tok/s step 9139/19560 | loss 3.402441 (-0.61z)| norm 0.2627 (-0.82z)| lr 3.49e-04 | 323.21 ms | 52.2% bf16 MFU | 1623881 tok/s step 9140/19560 | loss 3.402139 (-0.61z)| norm 0.2563 (-1.26z)| lr 3.49e-04 | 322.33 ms | 52.4% bf16 MFU | 1624014 tok/s step 9141/19560 | loss 3.344909 (-2.01z)| norm 0.2578 (-1.13z)| lr 3.49e-04 | 322.59 ms | 52.3% bf16 MFU | 1624075 tok/s step 
9142/19560 | loss 3.446065 (+0.49z)| norm 0.2900 (+1.11z)| lr 3.49e-04 | 322.80 ms | 52.3% bf16 MFU | 1624082 tok/s step 9143/19560 | loss 3.383790 (-1.04z)| norm 0.2682 (-0.41z)| lr 3.49e-04 | 323.12 ms | 52.2% bf16 MFU | 1624006 tok/s step 9144/19560 | loss 3.446683 (+0.50z)| norm 0.2813 (+0.51z)| lr 3.49e-04 | 322.97 ms | 52.3% bf16 MFU | 1623972 tok/s step 9145/19560 | loss 3.381692 (-1.08z)| norm 0.2735 (-0.04z)| lr 3.49e-04 | 323.21 ms | 52.2% bf16 MFU | 1623879 tok/s step 9146/19560 | loss 3.448449 (+0.56z)| norm 0.2679 (-0.43z)| lr 3.49e-04 | 322.40 ms | 52.3% bf16 MFU | 1623995 tok/s step 9147/19560 | loss 3.448891 (+0.56z)| norm 0.2706 (-0.23z)| lr 3.49e-04 | 323.51 ms | 52.2% bf16 MFU | 1623826 tok/s step 9148/19560 | loss 3.442932 (+0.42z)| norm 0.2740 (+0.01z)| lr 3.49e-04 | 323.43 ms | 52.2% bf16 MFU | 1623685 tok/s step 9149/19560 | loss 3.392788 (-0.83z)| norm 0.2912 (+1.25z)| lr 3.49e-04 | 322.90 ms | 52.3% bf16 MFU | 1623686 tok/s step 9150/19560 | loss 3.479666 (+1.32z)| norm 0.2711 (-0.20z)| lr 3.49e-04 | 322.60 ms | 52.3% bf16 MFU | 1623761 tok/s step 9151/19560 | loss 3.378798 (-1.19z)| norm 0.2845 (+0.76z)| lr 3.49e-04 | 323.70 ms | 52.1% bf16 MFU | 1623556 tok/s step 9152/19560 | loss 3.491558 (+1.59z)| norm 0.2577 (-1.16z)| lr 3.49e-04 | 322.70 ms | 52.3% bf16 MFU | 1623612 tok/s step 9153/19560 | loss 3.463516 (+0.89z)| norm 0.2762 (+0.17z)| lr 3.49e-04 | 322.76 ms | 52.3% bf16 MFU | 1623652 tok/s step 9154/19560 | loss 3.358200 (-1.68z)| norm 0.2695 (-0.31z)| lr 3.49e-04 | 322.98 ms | 52.3% bf16 MFU | 1623634 tok/s step 9155/19560 | loss 3.453367 (+0.67z)| norm 0.2782 (+0.32z)| lr 3.49e-04 | 323.31 ms | 52.2% bf16 MFU | 1623534 tok/s step 9156/19560 | loss 3.498749 (+1.75z)| norm 0.3003 (+1.87z)| lr 3.49e-04 | 322.84 ms | 52.3% bf16 MFU | 1623558 tok/s step 9157/19560 | loss 3.440573 (+0.32z)| norm 0.3633 (+5.56z)| lr 3.48e-04 | 322.34 ms | 52.4% bf16 MFU | 1623707 tok/s step 9158/19560 | loss 3.464627 (+0.90z)| norm 0.2874 (+0.81z)| lr 
3.48e-04 | 323.32 ms | 52.2% bf16 MFU | 1623599 tok/s step 9159/19560 | loss 3.459910 (+0.77z)| norm 0.3220 (+2.85z)| lr 3.48e-04 | 322.67 ms | 52.3% bf16 MFU | 1623661 tok/s step 9160/19560 | loss 3.410057 (-0.44z)| norm 0.2768 (+0.13z)| lr 3.48e-04 | 322.36 ms | 52.4% bf16 MFU | 1623799 tok/s step 9161/19560 | loss 3.530537 (+2.44z)| norm 0.3003 (+1.52z)| lr 3.48e-04 | 322.85 ms | 52.3% bf16 MFU | 1623805 tok/s step 9162/19560 | loss 3.438850 (+0.24z)| norm 0.2752 (+0.02z)| lr 3.48e-04 | 322.62 ms | 52.3% bf16 MFU | 1623868 tok/s step 9163/19560 | loss 3.356667 (-1.70z)| norm 0.2712 (-0.24z)| lr 3.48e-04 | 322.71 ms | 52.3% bf16 MFU | 1623906 tok/s step 9164/19560 | loss 3.488943 (+1.41z)| norm 0.3060 (+1.84z)| lr 3.48e-04 | 323.14 ms | 52.2% bf16 MFU | 1623835 tok/s step 9165/19560 | loss 3.429494 (+0.03z)| norm 0.2966 (+1.25z)| lr 3.48e-04 | 322.53 ms | 52.3% bf16 MFU | 1623920 tok/s step 9166/19560 | loss 3.398105 (-0.71z)| norm 0.2988 (+1.38z)| lr 3.48e-04 | 323.22 ms | 52.2% bf16 MFU | 1623828 tok/s step 9167/19560 | loss 3.435847 (+0.20z)| norm 0.2997 (+1.41z)| lr 3.48e-04 | 322.55 ms | 52.3% bf16 MFU | 1623909 tok/s step 9168/19560 | loss 3.446254 (+0.44z)| norm 0.2820 (+0.35z)| lr 3.48e-04 | 323.02 ms | 52.2% bf16 MFU | 1623868 tok/s step 9169/19560 | loss 3.415394 (-0.33z)| norm 0.2978 (+1.27z)| lr 3.48e-04 | 322.77 ms | 52.3% bf16 MFU | 1623892 tok/s step 9170/19560 | loss 3.406975 (-0.54z)| norm 0.2873 (+0.65z)| lr 3.48e-04 | 322.50 ms | 52.3% bf16 MFU | 1623982 tok/s step 9171/19560 | loss 3.510423 (+2.00z)| norm 0.2933 (+0.99z)| lr 3.48e-04 | 323.17 ms | 52.2% bf16 MFU | 1623900 tok/s step 9172/19560 | loss 3.399031 (-0.76z)| norm 0.2773 (+0.06z)| lr 3.48e-04 | 323.19 ms | 52.2% bf16 MFU | 1623817 tok/s step 9173/19560 | loss 3.451574 (+0.54z)| norm 0.3073 (+1.79z)| lr 3.48e-04 | 323.03 ms | 52.2% bf16 MFU | 1623778 tok/s step 9174/19560 | loss 3.393782 (-0.90z)| norm 0.2824 (+0.34z)| lr 3.48e-04 | 322.79 ms | 52.3% bf16 MFU | 1623802 tok/s step 
9175/19560 | loss 3.425767 (-0.09z)| norm 0.3293 (+2.96z)| lr 3.48e-04 | 322.77 ms | 52.3% bf16 MFU | 1623828 tok/s step 9176/19560 | loss 3.432225 (+0.07z)| norm 0.2999 (+1.28z)| lr 3.48e-04 | 323.36 ms | 52.2% bf16 MFU | 1623705 tok/s step 9177/19560 | loss 3.350318 (-1.95z)| norm 0.2757 (-0.08z)| lr 3.47e-04 | 322.81 ms | 52.3% bf16 MFU | 1623727 tok/s step 9178/19560 | loss 3.427531 (-0.05z)| norm 0.2866 (+0.53z)| lr 3.47e-04 | 323.10 ms | 52.2% bf16 MFU | 1623674 tok/s step 9179/19560 | loss 3.471669 (+1.06z)| norm 0.2734 (-0.22z)| lr 3.47e-04 | 323.05 ms | 52.2% bf16 MFU | 1623636 tok/s step 9180/19560 | loss 3.397167 (-0.80z)| norm 0.2949 (+0.98z)| lr 3.47e-04 | 322.89 ms | 52.3% bf16 MFU | 1623642 tok/s step 9181/19560 | loss 3.394390 (-0.86z)| norm 0.2492 (-1.58z)| lr 3.47e-04 | 323.23 ms | 52.2% bf16 MFU | 1623561 tok/s step 9182/19560 | loss 3.486110 (+1.40z)| norm 0.3102 (+1.79z)| lr 3.47e-04 | 322.60 ms | 52.3% bf16 MFU | 1623642 tok/s step 9183/19560 | loss 3.445012 (+0.38z)| norm 0.3128 (+1.89z)| lr 3.47e-04 | 323.25 ms | 52.2% bf16 MFU | 1623556 tok/s step 9184/19560 | loss 3.421500 (-0.21z)| norm 0.2775 (-0.03z)| lr 3.47e-04 | 323.28 ms | 52.2% bf16 MFU | 1623466 tok/s step 9185/19560 | loss 3.472219 (+1.05z)| norm 0.2921 (+0.76z)| lr 3.47e-04 | 322.82 ms | 52.3% bf16 MFU | 1623497 tok/s step 9186/19560 | loss 3.459774 (+0.75z)| norm 0.2856 (+0.39z)| lr 3.47e-04 | 322.54 ms | 52.3% bf16 MFU | 1623596 tok/s step 9187/19560 | loss 3.477762 (+1.19z)| norm 0.2894 (+0.61z)| lr 3.47e-04 | 322.75 ms | 52.3% bf16 MFU | 1623637 tok/s step 9188/19560 | loss 3.352968 (-1.91z)| norm 0.2828 (+0.25z)| lr 3.47e-04 | 322.90 ms | 52.3% bf16 MFU | 1623641 tok/s step 9189/19560 | loss 3.473250 (+1.05z)| norm 0.2896 (+0.62z)| lr 3.47e-04 | 322.57 ms | 52.3% bf16 MFU | 1623727 tok/s step 9190/19560 | loss 3.490620 (+1.47z)| norm 0.2776 (-0.04z)| lr 3.47e-04 | 322.75 ms | 52.3% bf16 MFU | 1623762 tok/s step 9191/19560 | loss 3.426172 (-0.11z)| norm 0.2893 (+0.60z)| lr 
3.47e-04 | 323.01 ms | 52.3% bf16 MFU | 1623731 tok/s step 9192/19560 | loss 3.453965 (+0.56z)| norm 0.2883 (+0.53z)| lr 3.47e-04 | 322.75 ms | 52.3% bf16 MFU | 1623768 tok/s step 9193/19560 | loss 3.368112 (-1.52z)| norm 0.2651 (-0.75z)| lr 3.47e-04 | 322.55 ms | 52.3% bf16 MFU | 1623853 tok/s step 9194/19560 | loss 3.439085 (+0.20z)| norm 0.2886 (+0.53z)| lr 3.47e-04 | 322.23 ms | 52.4% bf16 MFU | 1624014 tok/s step 9195/19560 | loss 3.470748 (+0.97z)| norm 0.2486 (-1.66z)| lr 3.47e-04 | 323.35 ms | 52.2% bf16 MFU | 1623886 tok/s step 9196/19560 | loss 3.406990 (-0.57z)| norm 0.3026 (+1.29z)| lr 3.47e-04 | 323.22 ms | 52.2% bf16 MFU | 1623796 tok/s step 9197/19560 | loss 3.467245 (+0.90z)| norm 0.2815 (+0.13z)| lr 3.46e-04 | 322.82 ms | 52.3% bf16 MFU | 1623809 tok/s step 9198/19560 | loss 3.343799 (-2.06z)| norm 0.2606 (-1.02z)| lr 3.46e-04 | 322.85 ms | 52.3% bf16 MFU | 1623815 tok/s step 9199/19560 | loss 3.426730 (-0.06z)| norm 0.2753 (-0.21z)| lr 3.46e-04 | 322.50 ms | 52.3% bf16 MFU | 1623908 tok/s step 9200/19560 | loss 3.363559 (-1.58z)| norm 0.2946 (+0.86z)| lr 3.46e-04 | 322.47 ms | 52.3% bf16 MFU | 1624006 tok/s step 9201/19560 | loss 3.418890 (-0.23z)| norm 0.2998 (+1.14z)| lr 3.46e-04 | 322.61 ms | 52.3% bf16 MFU | 1624062 tok/s step 9202/19560 | loss 3.439525 (+0.28z)| norm 0.2888 (+0.53z)| lr 3.46e-04 | 322.85 ms | 52.3% bf16 MFU | 1624057 tok/s step 9203/19560 | loss 3.414042 (-0.35z)| norm 0.2686 (-0.59z)| lr 3.46e-04 | 322.94 ms | 52.3% bf16 MFU | 1624028 tok/s step 9204/19560 | loss 3.406036 (-0.54z)| norm 0.2788 (-0.04z)| lr 3.46e-04 | 322.71 ms | 52.3% bf16 MFU | 1624058 tok/s step 9205/19560 | loss 3.423432 (-0.12z)| norm 0.2773 (-0.12z)| lr 3.46e-04 | 322.52 ms | 52.3% bf16 MFU | 1624136 tok/s step 9206/19560 | loss 3.380184 (-1.20z)| norm 0.2788 (-0.04z)| lr 3.46e-04 | 323.12 ms | 52.2% bf16 MFU | 1624057 tok/s step 9207/19560 | loss 3.426051 (-0.07z)| norm 0.2665 (-0.72z)| lr 3.46e-04 | 322.72 ms | 52.3% bf16 MFU | 1624085 tok/s step 
9208/19560 | loss 3.425822 (-0.07z)| norm 0.2747 (-0.26z)| lr 3.46e-04 | 322.80 ms | 52.3% bf16 MFU | 1624089 tok/s step 9209/19560 | loss 3.418406 (-0.25z)| norm 0.2588 (-1.13z)| lr 3.46e-04 | 322.58 ms | 52.3% bf16 MFU | 1624151 tok/s step 9210/19560 | loss 3.412421 (-0.41z)| norm 0.2486 (-1.67z)| lr 3.46e-04 | 322.81 ms | 52.3% bf16 MFU | 1624150 tok/s step 9211/19560 | loss 3.436935 (+0.20z)| norm 0.2663 (-0.70z)| lr 3.46e-04 | 323.05 ms | 52.2% bf16 MFU | 1624088 tok/s step 9212/19560 | loss 3.340471 (-2.17z)| norm 0.2795 (+0.03z)| lr 3.46e-04 | 322.20 ms | 52.4% bf16 MFU | 1624244 tok/s step 9213/19560 | loss 3.302354 (-2.98z)| norm 0.2718 (-0.39z)| lr 3.46e-04 | 322.92 ms | 52.3% bf16 MFU | 1624210 tok/s step 9214/19560 | loss 3.398011 (-0.70z)| norm 0.2561 (-1.24z)| lr 3.46e-04 | 322.34 ms | 52.4% bf16 MFU | 1624324 tok/s step 9215/19560 | loss 3.399969 (-0.64z)| norm 0.2794 (+0.03z)| lr 3.46e-04 | 322.96 ms | 52.3% bf16 MFU | 1624278 tok/s step 9216/19560 | loss 3.470179 (+1.02z)| norm 0.2988 (+1.07z)| lr 3.46e-04 | 322.56 ms | 52.3% bf16 MFU | 1624334 tok/s step 9217/19560 | loss 3.455409 (+0.66z)| norm 0.3007 (+1.16z)| lr 3.45e-04 | 322.79 ms | 52.3% bf16 MFU | 1624331 tok/s step 9218/19560 | loss 3.389890 (-0.90z)| norm 0.2926 (+0.72z)| lr 3.45e-04 | 322.49 ms | 52.3% bf16 MFU | 1624402 tok/s step 9219/19560 | loss 3.370745 (-1.33z)| norm 0.2767 (-0.15z)| lr 3.45e-04 | 323.09 ms | 52.2% bf16 MFU | 1624319 tok/s step 9220/19560 | loss 3.361611 (-1.52z)| norm 0.2935 (+0.75z)| lr 3.45e-04 | 322.81 ms | 52.3% bf16 MFU | 1624309 tok/s step 9221/19560 | loss 3.433234 (+0.16z)| norm 0.3187 (+2.08z)| lr 3.45e-04 | 322.54 ms | 52.3% bf16 MFU | 1624368 tok/s step 9222/19560 | loss 3.399155 (-0.63z)| norm 0.2757 (-0.22z)| lr 3.45e-04 | 322.87 ms | 52.3% bf16 MFU | 1624342 tok/s step 9223/19560 | loss 3.412042 (-0.32z)| norm 0.3015 (+1.19z)| lr 3.45e-04 | 323.18 ms | 52.2% bf16 MFU | 1624239 tok/s step 9224/19560 | loss 3.441200 (+0.38z)| norm 0.2888 (+0.49z)| lr 
3.45e-04 | 322.85 ms | 52.3% bf16 MFU | 1624223 tok/s step 9225/19560 | loss 3.420488 (-0.10z)| norm 0.2865 (+0.41z)| lr 3.45e-04 | 322.18 ms | 52.4% bf16 MFU | 1624378 tok/s step 9226/19560 | loss 3.371516 (-1.28z)| norm 0.2719 (-0.43z)| lr 3.45e-04 | 322.47 ms | 52.3% bf16 MFU | 1624452 tok/s step 9227/19560 | loss 3.458061 (+0.82z)| norm 0.2587 (-1.18z)| lr 3.45e-04 | 322.36 ms | 52.4% bf16 MFU | 1624551 tok/s step 9228/19560 | loss 3.434796 (+0.26z)| norm 0.2782 (-0.06z)| lr 3.45e-04 | 323.30 ms | 52.2% bf16 MFU | 1624408 tok/s step 9229/19560 | loss 3.356945 (-1.62z)| norm 0.2713 (-0.46z)| lr 3.45e-04 | 322.34 ms | 52.4% bf16 MFU | 1624513 tok/s step 9230/19560 | loss 3.465318 (+0.99z)| norm 0.3009 (+1.22z)| lr 3.45e-04 | 322.85 ms | 52.3% bf16 MFU | 1624483 tok/s step 9231/19560 | loss 3.464561 (+0.96z)| norm 0.2739 (-0.31z)| lr 3.45e-04 | 322.60 ms | 52.3% bf16 MFU | 1624519 tok/s step 9232/19560 | loss 3.432316 (+0.20z)| norm 0.2865 (+0.40z)| lr 3.45e-04 | 322.63 ms | 52.3% bf16 MFU | 1624546 tok/s step 9233/19560 | loss 3.483938 (+1.43z)| norm 0.2602 (-1.09z)| lr 3.45e-04 | 322.48 ms | 52.3% bf16 MFU | 1624607 tok/s step 9234/19560 | loss 3.399992 (-0.59z)| norm 0.2908 (+0.63z)| lr 3.45e-04 | 322.72 ms | 52.3% bf16 MFU | 1624608 tok/s step 9235/19560 | loss 3.456805 (+0.77z)| norm 0.2794 (-0.03z)| lr 3.45e-04 | 322.35 ms | 52.4% bf16 MFU | 1624700 tok/s step 9236/19560 | loss 3.443612 (+0.44z)| norm 0.2751 (-0.28z)| lr 3.45e-04 | 322.51 ms | 52.3% bf16 MFU | 1624747 tok/s step 9237/19560 | loss 3.416724 (-0.20z)| norm 0.2673 (-0.73z)| lr 3.45e-04 | 322.34 ms | 52.4% bf16 MFU | 1624835 tok/s step 9238/19560 | loss 3.496682 (+1.73z)| norm 0.3120 (+1.80z)| lr 3.44e-04 | 323.10 ms | 52.2% bf16 MFU | 1624728 tok/s step 9239/19560 | loss 3.389639 (-0.87z)| norm 0.2661 (-0.81z)| lr 3.44e-04 | 322.32 ms | 52.4% bf16 MFU | 1624822 tok/s step 9240/19560 | loss 3.417375 (-0.20z)| norm 0.2972 (+0.94z)| lr 3.44e-04 | 322.78 ms | 52.3% bf16 MFU | 1624795 tok/s step 
9241/19560 | loss 3.372125 (-1.28z)| norm 0.2559 (-1.40z)| lr 3.44e-04 | 322.56 ms | 52.3% bf16 MFU | 1624825 tok/s step 9242/19560 | loss 3.405973 (-0.46z)| norm 0.3103 (+1.67z)| lr 3.44e-04 | 322.39 ms | 52.3% bf16 MFU | 1624896 tok/s step 9243/19560 | loss 3.454005 (+0.70z)| norm 0.2957 (+0.82z)| lr 3.44e-04 | 322.38 ms | 52.4% bf16 MFU | 1624966 tok/s step 9244/19560 | loss 3.424214 (-0.03z)| norm 0.2876 (+0.35z)| lr 3.44e-04 | 322.53 ms | 52.3% bf16 MFU | 1624996 tok/s step 9245/19560 | loss 3.403584 (-0.53z)| norm 0.2734 (-0.47z)| lr 3.44e-04 | 322.82 ms | 52.3% bf16 MFU | 1624950 tok/s step 9246/19560 | loss 3.446127 (+0.50z)| norm 0.2910 (+0.53z)| lr 3.44e-04 | 322.30 ms | 52.4% bf16 MFU | 1625038 tok/s step 9247/19560 | loss 3.462542 (+0.92z)| norm 0.2921 (+0.59z)| lr 3.44e-04 | 322.51 ms | 52.3% bf16 MFU | 1625069 tok/s step 9248/19560 | loss 3.366416 (-1.42z)| norm 0.2867 (+0.27z)| lr 3.44e-04 | 322.28 ms | 52.4% bf16 MFU | 1625156 tok/s step 9249/19560 | loss 3.458063 (+0.81z)| norm 0.2573 (-1.43z)| lr 3.44e-04 | 322.72 ms | 52.3% bf16 MFU | 1625128 tok/s step 9250/19560 | loss 3.393614 (-0.78z)| norm 0.2679 (-0.81z)| lr 3.44e-04 | 322.43 ms | 52.3% bf16 MFU | 1625175 tok/s val loss 3.413161 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating HellaSwag: 70/79 HellaSwag: 2870/10042 = 0.285800 step 9251/19560 | loss 3.445570 (+0.49z)| norm 0.2453 (-2.06z)| lr 3.44e-04 | 322.37 ms | 52.4% bf16 MFU | 1625233 tok/s step 9252/19560 | loss 3.399996 (-0.62z)| norm 0.2691 (-0.71z)| lr 3.44e-04 | 322.64 ms | 52.3% bf16 MFU | 1625222 tok/s step 9253/19560 | loss 3.416083 (-0.22z)| norm 0.2600 (-1.21z)| lr 3.44e-04 | 322.92 ms | 52.3% bf16 MFU | 1625140 tok/s step 9254/19560 | loss 3.399903 (-0.61z)| norm 0.2719 (-0.54z)| lr 3.44e-04 | 322.68 ms | 52.3% bf16 MFU | 1625122 tok/s step 9255/19560 | loss 3.414478 
(-0.23z)| norm 0.2579 (-1.33z)| lr 3.44e-04 | 322.81 ms | 52.3% bf16 MFU | 1625074 tok/s step 9256/19560 | loss 3.407038 (-0.43z)| norm 0.2789 (-0.16z)| lr 3.44e-04 | 322.82 ms | 52.3% bf16 MFU | 1625026 tok/s step 9257/19560 | loss 3.389021 (-0.87z)| norm 0.2743 (-0.42z)| lr 3.44e-04 | 322.89 ms | 52.3% bf16 MFU | 1624960 tok/s step 9258/19560 | loss 3.447663 (+0.60z)| norm 0.2799 (-0.10z)| lr 3.43e-04 | 322.75 ms | 52.3% bf16 MFU | 1624935 tok/s step 9259/19560 | loss 3.470957 (+1.17z)| norm 0.2639 (-1.01z)| lr 3.43e-04 | 322.89 ms | 52.3% bf16 MFU | 1624876 tok/s step 9260/19560 | loss 3.382469 (-1.04z)| norm 0.2933 (+0.66z)| lr 3.43e-04 | 322.59 ms | 52.3% bf16 MFU | 1624895 tok/s step 9261/19560 | loss 3.421422 (-0.07z)| norm 0.2697 (-0.68z)| lr 3.43e-04 | 322.73 ms | 52.3% bf16 MFU | 1624878 tok/s step 9262/19560 | loss 3.392023 (-0.80z)| norm 0.2837 (+0.11z)| lr 3.43e-04 | 322.72 ms | 52.3% bf16 MFU | 1624863 tok/s step 9263/19560 | loss 3.494771 (+1.73z)| norm 0.2643 (-0.98z)| lr 3.43e-04 | 322.68 ms | 52.3% bf16 MFU | 1624860 tok/s step 9264/19560 | loss 3.344530 (-1.93z)| norm 0.2889 (+0.40z)| lr 3.43e-04 | 322.64 ms | 52.3% bf16 MFU | 1624868 tok/s step 9265/19560 | loss 3.465139 (+0.99z)| norm 0.2751 (-0.38z)| lr 3.43e-04 | 322.63 ms | 52.3% bf16 MFU | 1624875 tok/s step 9266/19560 | loss 3.412289 (-0.28z)| norm 0.2984 (+0.93z)| lr 3.43e-04 | 323.07 ms | 52.2% bf16 MFU | 1624774 tok/s step 9267/19560 | loss 3.382241 (-1.00z)| norm 0.2907 (+0.49z)| lr 3.43e-04 | 322.93 ms | 52.3% bf16 MFU | 1624711 tok/s step 9268/19560 | loss 3.425657 (+0.04z)| norm 0.2943 (+0.68z)| lr 3.43e-04 | 322.26 ms | 52.4% bf16 MFU | 1624822 tok/s step 9269/19560 | loss 3.438115 (+0.33z)| norm 0.2716 (-0.63z)| lr 3.43e-04 | 322.86 ms | 52.3% bf16 MFU | 1624774 tok/s step 9270/19560 | loss 3.449979 (+0.62z)| norm 0.2712 (-0.64z)| lr 3.43e-04 | 323.14 ms | 52.2% bf16 MFU | 1624658 tok/s step 9271/19560 | loss 3.374284 (-1.23z)| norm 0.2584 (-1.37z)| lr 3.43e-04 | 322.66 ms | 52.3% 
bf16 MFU | 1624669 tok/s step 9272/19560 | loss 3.409103 (-0.37z)| norm 0.2674 (-0.85z)| lr 3.43e-04 | 323.03 ms | 52.2% bf16 MFU | 1624588 tok/s step 9273/19560 | loss 3.348776 (-1.83z)| norm 0.2713 (-0.62z)| lr 3.43e-04 | 323.08 ms | 52.2% bf16 MFU | 1624498 tok/s step 9274/19560 | loss 3.442684 (+0.45z)| norm 0.2719 (-0.59z)| lr 3.43e-04 | 322.69 ms | 52.3% bf16 MFU | 1624511 tok/s step 9275/19560 | loss 3.440725 (+0.41z)| norm 0.2706 (-0.66z)| lr 3.43e-04 | 322.86 ms | 52.3% bf16 MFU | 1624479 tok/s step 9276/19560 | loss 3.467121 (+1.04z)| norm 0.2512 (-1.74z)| lr 3.43e-04 | 322.46 ms | 52.3% bf16 MFU | 1624551 tok/s step 9277/19560 | loss 3.432765 (+0.20z)| norm 0.2533 (-1.59z)| lr 3.43e-04 | 322.94 ms | 52.3% bf16 MFU | 1624497 tok/s step 9278/19560 | loss 3.435764 (+0.28z)| norm 0.3047 (+1.26z)| lr 3.42e-04 | 322.83 ms | 52.3% bf16 MFU | 1624474 tok/s step 9279/19560 | loss 3.470587 (+1.12z)| norm 0.3236 (+2.25z)| lr 3.42e-04 | 322.72 ms | 52.3% bf16 MFU | 1624479 tok/s step 9280/19560 | loss 3.436845 (+0.31z)| norm 0.2571 (-1.37z)| lr 3.42e-04 | 322.63 ms | 52.3% bf16 MFU | 1624508 tok/s step 9281/19560 | loss 3.485362 (+1.49z)| norm 0.2842 (+0.10z)| lr 3.42e-04 | 323.30 ms | 52.2% bf16 MFU | 1624366 tok/s step 9282/19560 | loss 3.443853 (+0.46z)| norm 0.2977 (+0.82z)| lr 3.42e-04 | 322.70 ms | 52.3% bf16 MFU | 1624383 tok/s step 9283/19560 | loss 3.419714 (-0.13z)| norm 0.2495 (-1.77z)| lr 3.42e-04 | 323.48 ms | 52.2% bf16 MFU | 1624203 tok/s step 9284/19560 | loss 3.389152 (-0.88z)| norm 0.3105 (+1.50z)| lr 3.42e-04 | 322.59 ms | 52.3% bf16 MFU | 1624256 tok/s step 9285/19560 | loss 3.424518 (+0.01z)| norm 0.2972 (+0.88z)| lr 3.42e-04 | 322.48 ms | 52.3% bf16 MFU | 1624332 tok/s step 9286/19560 | loss 3.380416 (-1.08z)| norm 0.2609 (-1.20z)| lr 3.42e-04 | 323.00 ms | 52.3% bf16 MFU | 1624274 tok/s step 9287/19560 | loss 3.424868 (+0.05z)| norm 0.2949 (+0.78z)| lr 3.42e-04 | 323.82 ms | 52.1% bf16 MFU | 1624013 tok/s step 9288/19560 | loss 3.389166 
(-0.85z)| norm 0.2502 (-1.80z)| lr 3.42e-04 | 322.31 ms | 52.4% bf16 MFU | 1624146 tok/s step 9289/19560 | loss 3.423180 (+0.03z)| norm 0.3037 (+1.29z)| lr 3.42e-04 | 323.13 ms | 52.2% bf16 MFU | 1624064 tok/s step 9290/19560 | loss 3.397919 (-0.62z)| norm 0.2612 (-1.15z)| lr 3.42e-04 | 322.83 ms | 52.3% bf16 MFU | 1624064 tok/s step 9291/19560 | loss 3.409080 (-0.34z)| norm 0.2662 (-0.86z)| lr 3.42e-04 | 322.75 ms | 52.3% bf16 MFU | 1624081 tok/s step 9292/19560 | loss 3.354329 (-1.75z)| norm 0.2740 (-0.40z)| lr 3.42e-04 | 322.80 ms | 52.3% bf16 MFU | 1624087 tok/s step 9293/19560 | loss 3.453323 (+0.83z)| norm 0.2734 (-0.43z)| lr 3.42e-04 | 323.25 ms | 52.2% bf16 MFU | 1623980 tok/s step 9294/19560 | loss 3.447568 (+0.67z)| norm 0.2576 (-1.32z)| lr 3.42e-04 | 322.67 ms | 52.3% bf16 MFU | 1624022 tok/s step 9295/19560 | loss 3.389339 (-0.83z)| norm 0.2697 (-0.61z)| lr 3.42e-04 | 322.56 ms | 52.3% bf16 MFU | 1624090 tok/s step 9296/19560 | loss 3.409102 (-0.31z)| norm 0.2635 (-0.96z)| lr 3.42e-04 | 323.24 ms | 52.2% bf16 MFU | 1623984 tok/s step 9297/19560 | loss 3.377214 (-1.13z)| norm 0.2709 (-0.52z)| lr 3.42e-04 | 322.47 ms | 52.3% bf16 MFU | 1624077 tok/s step 9298/19560 | loss 3.419411 (-0.04z)| norm 0.2507 (-1.66z)| lr 3.41e-04 | 322.83 ms | 52.3% bf16 MFU | 1624075 tok/s step 9299/19560 | loss 3.419945 (-0.00z)| norm 0.2495 (-1.69z)| lr 3.41e-04 | 323.22 ms | 52.2% bf16 MFU | 1623976 tok/s step 9300/19560 | loss 3.478252 (+1.51z)| norm 0.2645 (-0.83z)| lr 3.41e-04 | 322.86 ms | 52.3% bf16 MFU | 1623970 tok/s step 9301/19560 | loss 3.397611 (-0.60z)| norm 0.2483 (-1.72z)| lr 3.41e-04 | 323.24 ms | 52.2% bf16 MFU | 1623870 tok/s step 9302/19560 | loss 3.426388 (+0.15z)| norm 0.2756 (-0.17z)| lr 3.41e-04 | 322.24 ms | 52.4% bf16 MFU | 1624026 tok/s step 9303/19560 | loss 3.312884 (-2.73z)| norm 0.2788 (+0.04z)| lr 3.41e-04 | 322.63 ms | 52.3% bf16 MFU | 1624077 tok/s step 9304/19560 | loss 3.379971 (-1.00z)| norm 0.2783 (+0.01z)| lr 3.41e-04 | 323.72 ms | 52.1% 
bf16 MFU | 1623853 tok/s step 9305/19560 | loss 3.415244 (-0.12z)| norm 0.2700 (-0.47z)| lr 3.41e-04 | 322.57 ms | 52.3% bf16 MFU | 1623927 tok/s step 9306/19560 | loss 3.395101 (-0.63z)| norm 0.2928 (+0.87z)| lr 3.41e-04 | 323.10 ms | 52.2% bf16 MFU | 1623864 tok/s step 9307/19560 | loss 3.533846 (+2.86z)| norm 0.2440 (-1.96z)| lr 3.41e-04 | 322.74 ms | 52.3% bf16 MFU | 1623896 tok/s step 9308/19560 | loss 3.431159 (+0.27z)| norm 0.3143 (+2.08z)| lr 3.41e-04 | 322.48 ms | 52.3% bf16 MFU | 1623990 tok/s step 9309/19560 | loss 3.417886 (-0.07z)| norm 0.2892 (+0.63z)| lr 3.41e-04 | 322.30 ms | 52.4% bf16 MFU | 1624126 tok/s step 9310/19560 | loss 3.422184 (+0.06z)| norm 0.2912 (+0.76z)| lr 3.41e-04 | 322.73 ms | 52.3% bf16 MFU | 1624148 tok/s step 9311/19560 | loss 3.485607 (+1.65z)| norm 0.3011 (+1.36z)| lr 3.41e-04 | 323.08 ms | 52.2% bf16 MFU | 1624079 tok/s step 9312/19560 | loss 3.553445 (+3.19z)| norm 0.2779 (-0.01z)| lr 3.41e-04 | 322.72 ms | 52.3% bf16 MFU | 1624104 tok/s step 9313/19560 | loss 3.422571 (+0.04z)| norm 0.2992 (+1.24z)| lr 3.41e-04 | 322.61 ms | 52.3% bf16 MFU | 1624155 tok/s step 9314/19560 | loss 3.472648 (+1.25z)| norm 0.2728 (-0.30z)| lr 3.41e-04 | 322.99 ms | 52.3% bf16 MFU | 1624109 tok/s step 9315/19560 | loss 3.450166 (+0.72z)| norm 0.2952 (+1.01z)| lr 3.41e-04 | 322.32 ms | 52.4% bf16 MFU | 1624235 tok/s step 9316/19560 | loss 3.455505 (+0.83z)| norm 0.2913 (+0.77z)| lr 3.41e-04 | 323.12 ms | 52.2% bf16 MFU | 1624152 tok/s step 9317/19560 | loss 3.423570 (+0.06z)| norm 0.2869 (+0.52z)| lr 3.41e-04 | 322.58 ms | 52.3% bf16 MFU | 1624209 tok/s step 9318/19560 | loss 3.428918 (+0.20z)| norm 0.2774 (-0.04z)| lr 3.41e-04 | 322.65 ms | 52.3% bf16 MFU | 1624246 tok/s step 9319/19560 | loss 3.570322 (+3.53z)| norm 0.2864 (+0.49z)| lr 3.40e-04 | 323.03 ms | 52.2% bf16 MFU | 1624186 tok/s step 9320/19560 | loss 3.482838 (+1.44z)| norm 0.2759 (-0.12z)| lr 3.40e-04 | 323.48 ms | 52.2% bf16 MFU | 1624016 tok/s step 9321/19560 | loss 3.464703 
(+0.99z)| norm 0.2972 (+1.11z)| lr 3.40e-04 | 322.55 ms | 52.3% bf16 MFU | 1624088 tok/s step 9322/19560 | loss 3.459089 (+0.85z)| norm 0.3077 (+1.70z)| lr 3.40e-04 | 322.95 ms | 52.3% bf16 MFU | 1624055 tok/s step 9323/19560 | loss 3.409983 (-0.30z)| norm 0.3035 (+1.44z)| lr 3.40e-04 | 323.28 ms | 52.2% bf16 MFU | 1623941 tok/s step 9324/19560 | loss 3.383781 (-0.91z)| norm 0.3057 (+1.56z)| lr 3.40e-04 | 322.86 ms | 52.3% bf16 MFU | 1623939 tok/s step 9325/19560 | loss 3.470131 (+1.13z)| norm 0.2943 (+0.89z)| lr 3.40e-04 | 323.24 ms | 52.2% bf16 MFU | 1623841 tok/s step 9326/19560 | loss 3.467571 (+1.06z)| norm 0.2957 (+0.96z)| lr 3.40e-04 | 322.90 ms | 52.3% bf16 MFU | 1623832 tok/s step 9327/19560 | loss 3.431810 (+0.20z)| norm 0.2625 (-0.96z)| lr 3.40e-04 | 322.60 ms | 52.3% bf16 MFU | 1623901 tok/s step 9328/19560 | loss 3.405968 (-0.43z)| norm 0.3096 (+1.73z)| lr 3.40e-04 | 323.61 ms | 52.2% bf16 MFU | 1623712 tok/s step 9329/19560 | loss 3.451692 (+0.67z)| norm 0.2535 (-1.45z)| lr 3.40e-04 | 322.51 ms | 52.3% bf16 MFU | 1623810 tok/s step 9330/19560 | loss 3.391675 (-0.77z)| norm 0.2816 (+0.16z)| lr 3.40e-04 | 322.84 ms | 52.3% bf16 MFU | 1623818 tok/s step 9331/19560 | loss 3.436102 (+0.30z)| norm 0.2723 (-0.37z)| lr 3.40e-04 | 323.57 ms | 52.2% bf16 MFU | 1623643 tok/s step 9332/19560 | loss 3.454785 (+0.73z)| norm 0.2711 (-0.44z)| lr 3.40e-04 | 323.17 ms | 52.2% bf16 MFU | 1623576 tok/s step 9333/19560 | loss 3.399625 (-0.58z)| norm 0.2883 (+0.54z)| lr 3.40e-04 | 322.87 ms | 52.3% bf16 MFU | 1623590 tok/s step 9334/19560 | loss 3.446181 (+0.52z)| norm 0.2572 (-1.22z)| lr 3.40e-04 | 322.48 ms | 52.3% bf16 MFU | 1623699 tok/s step 9335/19560 | loss 3.440565 (+0.38z)| norm 0.2810 (+0.12z)| lr 3.40e-04 | 322.77 ms | 52.3% bf16 MFU | 1623732 tok/s step 9336/19560 | loss 3.418458 (-0.15z)| norm 0.2833 (+0.25z)| lr 3.40e-04 | 323.24 ms | 52.2% bf16 MFU | 1623644 tok/s step 9337/19560 | loss 3.403785 (-0.49z)| norm 0.2877 (+0.49z)| lr 3.40e-04 | 322.87 ms | 52.3% 
bf16 MFU | 1623653 tok/s step 9338/19560 | loss 3.408731 (-0.38z)| norm 0.2692 (-0.58z)| lr 3.40e-04 | 322.55 ms | 52.3% bf16 MFU | 1623742 tok/s step 9339/19560 | loss 3.459961 (+0.85z)| norm 0.2673 (-0.69z)| lr 3.39e-04 | 322.94 ms | 52.3% bf16 MFU | 1623728 tok/s step 9340/19560 | loss 3.417983 (-0.17z)| norm 0.2861 (+0.39z)| lr 3.39e-04 | 322.47 ms | 52.3% bf16 MFU | 1623834 tok/s step 9341/19560 | loss 3.452569 (+0.66z)| norm 0.2835 (+0.24z)| lr 3.39e-04 | 322.70 ms | 52.3% bf16 MFU | 1623876 tok/s step 9342/19560 | loss 3.395315 (-0.78z)| norm 0.2862 (+0.38z)| lr 3.39e-04 | 322.78 ms | 52.3% bf16 MFU | 1623898 tok/s step 9343/19560 | loss 3.474441 (+1.19z)| norm 0.2919 (+0.71z)| lr 3.39e-04 | 322.57 ms | 52.3% bf16 MFU | 1623971 tok/s step 9344/19560 | loss 3.414359 (-0.30z)| norm 0.2647 (-0.86z)| lr 3.39e-04 | 322.33 ms | 52.4% bf16 MFU | 1624099 tok/s step 9345/19560 | loss 3.389789 (-0.91z)| norm 0.2799 (+0.03z)| lr 3.39e-04 | 323.72 ms | 52.1% bf16 MFU | 1623873 tok/s step 9346/19560 | loss 3.456420 (+0.75z)| norm 0.2665 (-0.74z)| lr 3.39e-04 | 323.14 ms | 52.2% bf16 MFU | 1623804 tok/s step 9347/19560 | loss 3.457751 (+0.77z)| norm 0.2839 (+0.27z)| lr 3.39e-04 | 323.35 ms | 52.2% bf16 MFU | 1623684 tok/s step 9348/19560 | loss 3.455708 (+0.71z)| norm 0.2876 (+0.50z)| lr 3.39e-04 | 323.12 ms | 52.2% bf16 MFU | 1623628 tok/s step 9349/19560 | loss 3.373840 (-1.36z)| norm 0.2708 (-0.47z)| lr 3.39e-04 | 323.24 ms | 52.2% bf16 MFU | 1623545 tok/s step 9350/19560 | loss 3.441908 (+0.36z)| norm 0.2816 (+0.17z)| lr 3.39e-04 | 323.16 ms | 52.2% bf16 MFU | 1623486 tok/s step 9351/19560 | loss 3.381757 (-1.16z)| norm 0.2806 (+0.12z)| lr 3.39e-04 | 322.62 ms | 52.3% bf16 MFU | 1623568 tok/s step 9352/19560 | loss 3.416733 (-0.27z)| norm 0.2697 (-0.53z)| lr 3.39e-04 | 323.56 ms | 52.2% bf16 MFU | 1623408 tok/s step 9353/19560 | loss 3.409210 (-0.46z)| norm 0.2755 (-0.17z)| lr 3.39e-04 | 322.87 ms | 52.3% bf16 MFU | 1623429 tok/s step 9354/19560 | loss 3.422941 
(-0.12z)| norm 0.2769 (-0.09z)| lr 3.39e-04 | 323.16 ms | 52.2% bf16 MFU | 1623376 tok/s step 9355/19560 | loss 3.361411 (-1.65z)| norm 0.2803 (+0.10z)| lr 3.39e-04 | 323.16 ms | 52.2% bf16 MFU | 1623325 tok/s step 9356/19560 | loss 3.409941 (-0.42z)| norm 0.3015 (+1.36z)| lr 3.39e-04 | 322.51 ms | 52.3% bf16 MFU | 1623441 tok/s step 9357/19560 | loss 3.412963 (-0.36z)| norm 0.2787 (-0.01z)| lr 3.39e-04 | 323.34 ms | 52.2% bf16 MFU | 1623341 tok/s step 9358/19560 | loss 3.401350 (-0.65z)| norm 0.2948 (+0.97z)| lr 3.39e-04 | 323.19 ms | 52.2% bf16 MFU | 1623287 tok/s step 9359/19560 | loss 3.406418 (-0.51z)| norm 0.3045 (+1.52z)| lr 3.38e-04 | 322.38 ms | 52.4% bf16 MFU | 1623438 tok/s step 9360/19560 | loss 3.482491 (+1.42z)| norm 0.2540 (-1.47z)| lr 3.38e-04 | 323.06 ms | 52.2% bf16 MFU | 1623411 tok/s step 9361/19560 | loss 3.415249 (-0.28z)| norm 0.2976 (+1.10z)| lr 3.38e-04 | 323.34 ms | 52.2% bf16 MFU | 1623313 tok/s step 9362/19560 | loss 3.458653 (+0.82z)| norm 0.2604 (-1.09z)| lr 3.38e-04 | 322.92 ms | 52.3% bf16 MFU | 1623327 tok/s step 9363/19560 | loss 3.468327 (+1.07z)| norm 0.2734 (-0.32z)| lr 3.38e-04 | 323.02 ms | 52.2% bf16 MFU | 1623315 tok/s step 9364/19560 | loss 3.516856 (+2.25z)| norm 0.2843 (+0.32z)| lr 3.38e-04 | 323.32 ms | 52.2% bf16 MFU | 1623227 tok/s step 9365/19560 | loss 3.435175 (+0.20z)| norm 0.2723 (-0.39z)| lr 3.38e-04 | 323.13 ms | 52.2% bf16 MFU | 1623193 tok/s step 9366/19560 | loss 3.392496 (-0.86z)| norm 0.2853 (+0.40z)| lr 3.38e-04 | 323.03 ms | 52.2% bf16 MFU | 1623184 tok/s step 9367/19560 | loss 3.390326 (-0.91z)| norm 0.2549 (-1.41z)| lr 3.38e-04 | 322.70 ms | 52.3% bf16 MFU | 1623261 tok/s step 9368/19560 | loss 3.473814 (+1.18z)| norm 0.2647 (-0.82z)| lr 3.38e-04 | 322.54 ms | 52.3% bf16 MFU | 1623372 tok/s step 9369/19560 | loss 3.458562 (+0.78z)| norm 0.2694 (-0.54z)| lr 3.38e-04 | 323.03 ms | 52.2% bf16 MFU | 1623355 tok/s step 9370/19560 | loss 3.462722 (+0.87z)| norm 0.2637 (-0.88z)| lr 3.38e-04 | 323.62 ms | 52.2% 
bf16 MFU | 1623190 tok/s step 9371/19560 | loss 3.389178 (-0.97z)| norm 0.2624 (-0.94z)| lr 3.38e-04 | 322.64 ms | 52.3% bf16 MFU | 1623281 tok/s step 9372/19560 | loss 3.401764 (-0.64z)| norm 0.2531 (-1.48z)| lr 3.38e-04 | 322.36 ms | 52.4% bf16 MFU | 1623437 tok/s step 9373/19560 | loss 3.531305 (+2.53z)| norm 0.2926 (+0.90z)| lr 3.38e-04 | 323.30 ms | 52.2% bf16 MFU | 1623350 tok/s step 9374/19560 | loss 3.499157 (+1.71z)| norm 0.2692 (-0.50z)| lr 3.38e-04 | 322.95 ms | 52.3% bf16 MFU | 1623353 tok/s step 9375/19560 | loss 3.446391 (+0.43z)| norm 0.2590 (-1.10z)| lr 3.38e-04 | 322.97 ms | 52.3% bf16 MFU | 1623352 tok/s step 9376/19560 | loss 3.399298 (-0.73z)| norm 0.2674 (-0.59z)| lr 3.38e-04 | 322.76 ms | 52.3% bf16 MFU | 1623405 tok/s step 9377/19560 | loss 3.394077 (-0.84z)| norm 0.2645 (-0.77z)| lr 3.38e-04 | 322.49 ms | 52.3% bf16 MFU | 1623522 tok/s step 9378/19560 | loss 3.365987 (-1.51z)| norm 0.2751 (-0.13z)| lr 3.38e-04 | 322.65 ms | 52.3% bf16 MFU | 1623594 tok/s step 9379/19560 | loss 3.392663 (-0.85z)| norm 0.2570 (-1.25z)| lr 3.37e-04 | 322.76 ms | 52.3% bf16 MFU | 1623633 tok/s step 9380/19560 | loss 3.421810 (-0.15z)| norm 0.2831 (+0.35z)| lr 3.37e-04 | 322.29 ms | 52.4% bf16 MFU | 1623790 tok/s step 9381/19560 | loss 3.430404 (+0.06z)| norm 0.2790 (+0.08z)| lr 3.37e-04 | 322.53 ms | 52.3% bf16 MFU | 1623877 tok/s step 9382/19560 | loss 3.417909 (-0.25z)| norm 0.2601 (-1.07z)| lr 3.37e-04 | 323.57 ms | 52.2% bf16 MFU | 1623699 tok/s step 9383/19560 | loss 3.405516 (-0.55z)| norm 0.2620 (-0.95z)| lr 3.37e-04 | 322.59 ms | 52.3% bf16 MFU | 1623777 tok/s step 9384/19560 | loss 3.417656 (-0.26z)| norm 0.2822 (+0.28z)| lr 3.37e-04 | 322.86 ms | 52.3% bf16 MFU | 1623783 tok/s step 9385/19560 | loss 3.398551 (-0.73z)| norm 0.2791 (+0.10z)| lr 3.37e-04 | 322.70 ms | 52.3% bf16 MFU | 1623829 tok/s step 9386/19560 | loss 3.366122 (-1.49z)| norm 0.2790 (+0.09z)| lr 3.37e-04 | 322.70 ms | 52.3% bf16 MFU | 1623872 tok/s step 9387/19560 | loss 3.399670 
(-0.67z)| norm 0.2779 (+0.01z)| lr 3.37e-04 | 323.52 ms | 52.2% bf16 MFU | 1623707 tok/s step 9388/19560 | loss 3.408010 (-0.47z)| norm 0.2697 (-0.48z)| lr 3.37e-04 | 322.75 ms | 52.3% bf16 MFU | 1623743 tok/s step 9389/19560 | loss 3.368790 (-1.40z)| norm 0.2706 (-0.43z)| lr 3.37e-04 | 322.55 ms | 52.3% bf16 MFU | 1623829 tok/s step 9390/19560 | loss 3.385008 (-1.01z)| norm 0.2707 (-0.41z)| lr 3.37e-04 | 322.55 ms | 52.3% bf16 MFU | 1623910 tok/s step 9391/19560 | loss 3.479974 (+1.29z)| norm 0.2598 (-1.09z)| lr 3.37e-04 | 322.21 ms | 52.4% bf16 MFU | 1624073 tok/s step 9392/19560 | loss 3.413160 (-0.35z)| norm 0.2688 (-0.52z)| lr 3.37e-04 | 322.96 ms | 52.3% bf16 MFU | 1624037 tok/s step 9393/19560 | loss 3.408783 (-0.45z)| norm 0.2395 (-2.26z)| lr 3.37e-04 | 322.77 ms | 52.3% bf16 MFU | 1624052 tok/s step 9394/19560 | loss 3.439651 (+0.31z)| norm 0.2867 (+0.60z)| lr 3.37e-04 | 322.45 ms | 52.3% bf16 MFU | 1624148 tok/s step 9395/19560 | loss 3.398242 (-0.72z)| norm 0.2630 (-0.83z)| lr 3.37e-04 | 322.98 ms | 52.3% bf16 MFU | 1624106 tok/s step 9396/19560 | loss 3.384566 (-1.04z)| norm 0.2377 (-2.30z)| lr 3.37e-04 | 322.57 ms | 52.3% bf16 MFU | 1624167 tok/s step 9397/19560 | loss 3.398888 (-0.68z)| norm 0.2698 (-0.38z)| lr 3.37e-04 | 323.17 ms | 52.2% bf16 MFU | 1624076 tok/s step 9398/19560 | loss 3.446718 (+0.50z)| norm 0.2464 (-1.74z)| lr 3.37e-04 | 322.45 ms | 52.3% bf16 MFU | 1624170 tok/s step 9399/19560 | loss 3.483299 (+1.38z)| norm 0.2461 (-1.74z)| lr 3.36e-04 | 322.45 ms | 52.3% bf16 MFU | 1624260 tok/s step 9400/19560 | loss 3.433453 (+0.15z)| norm 0.2692 (-0.39z)| lr 3.36e-04 | 322.78 ms | 52.3% bf16 MFU | 1624261 tok/s step 9401/19560 | loss 3.425845 (-0.06z)| norm 0.2363 (-2.26z)| lr 3.36e-04 | 323.02 ms | 52.2% bf16 MFU | 1624202 tok/s step 9402/19560 | loss 3.444388 (+0.41z)| norm 0.2742 (-0.08z)| lr 3.36e-04 | 322.89 ms | 52.3% bf16 MFU | 1624179 tok/s step 9403/19560 | loss 3.391920 (-0.89z)| norm 0.2726 (-0.18z)| lr 3.36e-04 | 322.10 ms | 52.4% 
bf16 MFU | 1624355 tok/s step 9404/19560 | loss 3.391119 (-0.90z)| norm 0.2521 (-1.36z)| lr 3.36e-04 | 322.41 ms | 52.3% bf16 MFU | 1624444 tok/s step 9405/19560 | loss 3.478980 (+1.27z)| norm 0.2788 (+0.17z)| lr 3.36e-04 | 322.54 ms | 52.3% bf16 MFU | 1624498 tok/s step 9406/19560 | loss 3.472588 (+1.10z)| norm 0.2843 (+0.50z)| lr 3.36e-04 | 322.43 ms | 52.3% bf16 MFU | 1624577 tok/s step 9407/19560 | loss 3.441433 (+0.34z)| norm 0.2930 (+1.06z)| lr 3.36e-04 | 322.36 ms | 52.4% bf16 MFU | 1624668 tok/s step 9408/19560 | loss 3.421423 (-0.15z)| norm 0.2808 (+0.31z)| lr 3.36e-04 | 322.49 ms | 52.3% bf16 MFU | 1624723 tok/s step 9409/19560 | loss 3.420683 (-0.16z)| norm 0.2779 (+0.14z)| lr 3.36e-04 | 322.55 ms | 52.3% bf16 MFU | 1624758 tok/s step 9410/19560 | loss 3.418828 (-0.20z)| norm 0.3056 (+1.80z)| lr 3.36e-04 | 322.61 ms | 52.3% bf16 MFU | 1624777 tok/s step 9411/19560 | loss 3.454445 (+0.68z)| norm 0.2952 (+1.16z)| lr 3.36e-04 | 322.43 ms | 52.3% bf16 MFU | 1624841 tok/s step 9412/19560 | loss 3.457003 (+0.73z)| norm 0.2788 (+0.19z)| lr 3.36e-04 | 322.62 ms | 52.3% bf16 MFU | 1624854 tok/s step 9413/19560 | loss 3.340709 (-2.11z)| norm 0.3122 (+2.20z)| lr 3.36e-04 | 323.16 ms | 52.2% bf16 MFU | 1624730 tok/s step 9414/19560 | loss 3.435354 (+0.20z)| norm 0.2526 (-1.40z)| lr 3.36e-04 | 322.63 ms | 52.3% bf16 MFU | 1624747 tok/s step 9415/19560 | loss 3.374478 (-1.28z)| norm 0.2839 (+0.50z)| lr 3.36e-04 | 322.74 ms | 52.3% bf16 MFU | 1624735 tok/s step 9416/19560 | loss 3.395993 (-0.76z)| norm 0.2609 (-0.91z)| lr 3.36e-04 | 322.48 ms | 52.3% bf16 MFU | 1624787 tok/s step 9417/19560 | loss 3.441001 (+0.34z)| norm 0.2832 (+0.46z)| lr 3.36e-04 | 322.36 ms | 52.4% bf16 MFU | 1624867 tok/s step 9418/19560 | loss 3.459995 (+0.79z)| norm 0.2900 (+0.87z)| lr 3.36e-04 | 322.56 ms | 52.3% bf16 MFU | 1624894 tok/s step 9419/19560 | loss 3.443259 (+0.38z)| norm 0.2689 (-0.43z)| lr 3.35e-04 | 322.89 ms | 52.3% bf16 MFU | 1624835 tok/s step 9420/19560 | loss 3.370189 
(-1.42z)| norm 0.2564 (-1.19z)| lr 3.35e-04 | 322.65 ms | 52.3% bf16 MFU | 1624840 tok/s step 9421/19560 | loss 3.402342 (-0.62z)| norm 0.2702 (-0.34z)| lr 3.35e-04 | 322.50 ms | 52.3% bf16 MFU | 1624883 tok/s step 9422/19560 | loss 3.385766 (-1.02z)| norm 0.2843 (+0.51z)| lr 3.35e-04 | 322.61 ms | 52.3% bf16 MFU | 1624895 tok/s step 9423/19560 | loss 3.429618 (+0.05z)| norm 0.2556 (-1.24z)| lr 3.35e-04 | 322.32 ms | 52.4% bf16 MFU | 1624980 tok/s step 9424/19560 | loss 3.502450 (+1.81z)| norm 0.2836 (+0.47z)| lr 3.35e-04 | 322.93 ms | 52.3% bf16 MFU | 1624906 tok/s step 9425/19560 | loss 3.564057 (+3.16z)| norm 0.2613 (-0.89z)| lr 3.35e-04 | 322.80 ms | 52.3% bf16 MFU | 1624869 tok/s step 9426/19560 | loss 3.432077 (+0.05z)| norm 0.2718 (-0.27z)| lr 3.35e-04 | 322.54 ms | 52.3% bf16 MFU | 1624901 tok/s step 9427/19560 | loss 3.393736 (-0.84z)| norm 0.2678 (-0.52z)| lr 3.35e-04 | 322.59 ms | 52.3% bf16 MFU | 1624917 tok/s step 9428/19560 | loss 3.428361 (-0.02z)| norm 0.2731 (-0.20z)| lr 3.35e-04 | 322.67 ms | 52.3% bf16 MFU | 1624914 tok/s step 9429/19560 | loss 3.431750 (+0.05z)| norm 0.2528 (-1.48z)| lr 3.35e-04 | 322.52 ms | 52.3% bf16 MFU | 1624947 tok/s step 9430/19560 | loss 3.443682 (+0.33z)| norm 0.2759 (-0.03z)| lr 3.35e-04 | 322.50 ms | 52.3% bf16 MFU | 1624985 tok/s step 9431/19560 | loss 3.410707 (-0.48z)| norm 0.2359 (-2.45z)| lr 3.35e-04 | 322.52 ms | 52.3% bf16 MFU | 1625017 tok/s step 9432/19560 | loss 3.477991 (+1.14z)| norm 0.2963 (+1.22z)| lr 3.35e-04 | 322.47 ms | 52.3% bf16 MFU | 1625059 tok/s step 9433/19560 | loss 3.358592 (-1.73z)| norm 0.2626 (-0.81z)| lr 3.35e-04 | 322.63 ms | 52.3% bf16 MFU | 1625059 tok/s step 9434/19560 | loss 3.444680 (+0.33z)| norm 0.2774 (+0.08z)| lr 3.35e-04 | 323.12 ms | 52.2% bf16 MFU | 1624935 tok/s step 9435/19560 | loss 3.425455 (-0.12z)| norm 0.2741 (-0.13z)| lr 3.35e-04 | 322.17 ms | 52.4% bf16 MFU | 1625057 tok/s step 9436/19560 | loss 3.422276 (-0.19z)| norm 0.2813 (+0.34z)| lr 3.35e-04 | 322.83 ms | 52.3% 
bf16 MFU | 1625006 tok/s step 9437/19560 | loss 3.421096 (-0.22z)| norm 0.2694 (-0.41z)| lr 3.35e-04 | 322.68 ms | 52.3% bf16 MFU | 1624996 tok/s step 9438/19560 | loss 3.432518 (+0.06z)| norm 0.3164 (+2.50z)| lr 3.35e-04 | 322.31 ms | 52.4% bf16 MFU | 1625080 tok/s step 9439/19560 | loss 3.445301 (+0.38z)| norm 0.2706 (-0.32z)| lr 3.35e-04 | 322.75 ms | 52.3% bf16 MFU | 1625049 tok/s step 9440/19560 | loss 3.462773 (+0.87z)| norm 0.2674 (-0.51z)| lr 3.34e-04 | 323.13 ms | 52.2% bf16 MFU | 1624924 tok/s step 9441/19560 | loss 3.484311 (+1.40z)| norm 0.3059 (+1.88z)| lr 3.34e-04 | 322.33 ms | 52.4% bf16 MFU | 1625006 tok/s step 9442/19560 | loss 3.398172 (-0.79z)| norm 0.2740 (-0.11z)| lr 3.34e-04 | 322.39 ms | 52.3% bf16 MFU | 1625068 tok/s step 9443/19560 | loss 3.488088 (+1.49z)| norm 0.2973 (+1.34z)| lr 3.34e-04 | 322.81 ms | 52.3% bf16 MFU | 1625022 tok/s step 9444/19560 | loss 3.411378 (-0.45z)| norm 0.3081 (+1.97z)| lr 3.34e-04 | 322.77 ms | 52.3% bf16 MFU | 1624987 tok/s step 9445/19560 | loss 3.364311 (-1.62z)| norm 0.2872 (+0.70z)| lr 3.34e-04 | 322.83 ms | 52.3% bf16 MFU | 1624940 tok/s step 9446/19560 | loss 3.414347 (-0.36z)| norm 0.3146 (+2.31z)| lr 3.34e-04 | 322.73 ms | 52.3% bf16 MFU | 1624921 tok/s step 9447/19560 | loss 3.487561 (+1.57z)| norm 0.3160 (+2.33z)| lr 3.34e-04 | 322.81 ms | 52.3% bf16 MFU | 1624881 tok/s step 9448/19560 | loss 3.450596 (+0.61z)| norm 0.2810 (+0.27z)| lr 3.34e-04 | 322.77 ms | 52.3% bf16 MFU | 1624854 tok/s step 9449/19560 | loss 3.452591 (+0.67z)| norm 0.2977 (+1.25z)| lr 3.34e-04 | 322.31 ms | 52.4% bf16 MFU | 1624944 tok/s step 9450/19560 | loss 3.432281 (+0.13z)| norm 0.2647 (-0.68z)| lr 3.34e-04 | 322.68 ms | 52.3% bf16 MFU | 1624936 tok/s step 9451/19560 | loss 3.429868 (+0.06z)| norm 0.3088 (+1.93z)| lr 3.34e-04 | 322.95 ms | 52.3% bf16 MFU | 1624862 tok/s step 9452/19560 | loss 3.467342 (+1.05z)| norm 0.2773 (+0.08z)| lr 3.34e-04 | 322.47 ms | 52.3% bf16 MFU | 1624912 tok/s step 9453/19560 | loss 3.381235 
(-1.23z)| norm 0.2795 (+0.22z)| lr 3.34e-04 | 322.92 ms | 52.3% bf16 MFU | 1624846 tok/s step 9454/19560 | loss 3.419372 (-0.20z)| norm 0.2797 (+0.24z)| lr 3.34e-04 | 322.98 ms | 52.3% bf16 MFU | 1624769 tok/s step 9455/19560 | loss 3.433297 (+0.17z)| norm 0.2793 (+0.21z)| lr 3.34e-04 | 322.58 ms | 52.3% bf16 MFU | 1624796 tok/s step 9456/19560 | loss 3.448240 (+0.56z)| norm 0.2668 (-0.54z)| lr 3.34e-04 | 322.34 ms | 52.4% bf16 MFU | 1624880 tok/s step 9457/19560 | loss 3.472896 (+1.21z)| norm 0.2819 (+0.38z)| lr 3.34e-04 | 322.93 ms | 52.3% bf16 MFU | 1624812 tok/s step 9458/19560 | loss 3.389090 (-1.02z)| norm 0.3017 (+1.59z)| lr 3.34e-04 | 322.82 ms | 52.3% bf16 MFU | 1624776 tok/s step 9459/19560 | loss 3.414068 (-0.35z)| norm 0.2844 (+0.52z)| lr 3.34e-04 | 322.26 ms | 52.4% bf16 MFU | 1624883 tok/s step 9460/19560 | loss 3.422533 (-0.12z)| norm 0.3049 (+1.74z)| lr 3.33e-04 | 322.77 ms | 52.3% bf16 MFU | 1624855 tok/s step 9461/19560 | loss 3.446362 (+0.51z)| norm 0.2727 (-0.21z)| lr 3.33e-04 | 322.69 ms | 52.3% bf16 MFU | 1624848 tok/s step 9462/19560 | loss 3.412710 (-0.39z)| norm 0.2787 (+0.15z)| lr 3.33e-04 | 322.47 ms | 52.3% bf16 MFU | 1624898 tok/s step 9463/19560 | loss 3.506064 (+2.06z)| norm 0.2792 (+0.18z)| lr 3.33e-04 | 323.10 ms | 52.2% bf16 MFU | 1624788 tok/s step 9464/19560 | loss 3.395244 (-0.85z)| norm 0.2961 (+1.20z)| lr 3.33e-04 | 322.39 ms | 52.3% bf16 MFU | 1624860 tok/s step 9465/19560 | loss 3.402099 (-0.67z)| norm 0.2477 (-1.71z)| lr 3.33e-04 | 322.63 ms | 52.3% bf16 MFU | 1624869 tok/s step 9466/19560 | loss 3.391122 (-0.95z)| norm 0.3037 (+1.63z)| lr 3.33e-04 | 322.53 ms | 52.3% bf16 MFU | 1624902 tok/s step 9467/19560 | loss 3.465262 (+0.99z)| norm 0.2829 (+0.39z)| lr 3.33e-04 | 322.75 ms | 52.3% bf16 MFU | 1624880 tok/s step 9468/19560 | loss 3.398269 (-0.76z)| norm 0.2879 (+0.68z)| lr 3.33e-04 | 322.75 ms | 52.3% bf16 MFU | 1624859 tok/s step 9469/19560 | loss 3.430142 (+0.08z)| norm 0.2687 (-0.45z)| lr 3.33e-04 | 322.57 ms | 52.3% 
bf16 MFU | 1624882 tok/s step 9470/19560 | loss 3.417952 (-0.24z)| norm 0.2757 (-0.03z)| lr 3.33e-04 | 322.78 ms | 52.3% bf16 MFU | 1624853 tok/s step 9471/19560 | loss 3.451258 (+0.64z)| norm 0.2632 (-0.77z)| lr 3.33e-04 | 323.50 ms | 52.2% bf16 MFU | 1624645 tok/s step 9472/19560 | loss 3.421206 (-0.16z)| norm 0.2822 (+0.36z)| lr 3.33e-04 | 322.81 ms | 52.3% bf16 MFU | 1624621 tok/s step 9473/19560 | loss 3.405468 (-0.57z)| norm 0.2669 (-0.55z)| lr 3.33e-04 | 323.29 ms | 52.2% bf16 MFU | 1624475 tok/s step 9474/19560 | loss 3.392671 (-0.90z)| norm 0.2717 (-0.26z)| lr 3.33e-04 | 322.85 ms | 52.3% bf16 MFU | 1624447 tok/s step 9475/19560 | loss 3.420556 (-0.16z)| norm 0.2725 (-0.21z)| lr 3.33e-04 | 322.96 ms | 52.3% bf16 MFU | 1624395 tok/s step 9476/19560 | loss 3.370253 (-1.46z)| norm 0.3004 (+1.44z)| lr 3.33e-04 | 322.56 ms | 52.3% bf16 MFU | 1624446 tok/s step 9477/19560 | loss 3.523122 (+2.48z)| norm 0.2938 (+1.04z)| lr 3.33e-04 | 322.47 ms | 52.3% bf16 MFU | 1624516 tok/s step 9478/19560 | loss 3.456463 (+0.76z)| norm 0.3110 (+2.01z)| lr 3.33e-04 | 322.31 ms | 52.4% bf16 MFU | 1624624 tok/s step 9479/19560 | loss 3.396101 (-0.80z)| norm 0.2805 (+0.23z)| lr 3.33e-04 | 322.71 ms | 52.3% bf16 MFU | 1624624 tok/s step 9480/19560 | loss 3.379694 (-1.21z)| norm 0.2859 (+0.54z)| lr 3.32e-04 | 322.48 ms | 52.3% bf16 MFU | 1624684 tok/s step 9481/19560 | loss 3.370990 (-1.42z)| norm 0.2971 (+1.18z)| lr 3.32e-04 | 323.10 ms | 52.2% bf16 MFU | 1624582 tok/s step 9482/19560 | loss 3.437275 (+0.27z)| norm 0.2822 (+0.31z)| lr 3.32e-04 | 322.87 ms | 52.3% bf16 MFU | 1624545 tok/s step 9483/19560 | loss 3.503578 (+1.92z)| norm 0.3011 (+1.39z)| lr 3.32e-04 | 322.46 ms | 52.3% bf16 MFU | 1624614 tok/s step 9484/19560 | loss 3.432920 (+0.12z)| norm 0.2929 (+0.92z)| lr 3.32e-04 | 322.55 ms | 52.3% bf16 MFU | 1624654 tok/s step 9485/19560 | loss 3.457134 (+0.73z)| norm 0.3087 (+1.80z)| lr 3.32e-04 | 323.06 ms | 52.2% bf16 MFU | 1624566 tok/s step 9486/19560 | loss 3.386091 
(-1.07z)| norm 0.2799 (+0.16z)| lr 3.32e-04 | 322.12 ms | 52.4% bf16 MFU | 1624718 tok/s step 9487/19560 | loss 3.378986 (-1.24z)| norm 0.2861 (+0.53z)| lr 3.32e-04 | 322.39 ms | 52.3% bf16 MFU | 1624794 tok/s step 9488/19560 | loss 3.395498 (-0.81z)| norm 0.2801 (+0.17z)| lr 3.32e-04 | 322.82 ms | 52.3% bf16 MFU | 1624758 tok/s step 9489/19560 | loss 3.416463 (-0.28z)| norm 0.2705 (-0.38z)| lr 3.32e-04 | 323.39 ms | 52.2% bf16 MFU | 1624581 tok/s step 9490/19560 | loss 3.479631 (+1.31z)| norm 0.2818 (+0.28z)| lr 3.32e-04 | 322.58 ms | 52.3% bf16 MFU | 1624616 tok/s step 9491/19560 | loss 3.437658 (+0.26z)| norm 0.2676 (-0.55z)| lr 3.32e-04 | 322.81 ms | 52.3% bf16 MFU | 1624592 tok/s step 9492/19560 | loss 3.362331 (-1.63z)| norm 0.2812 (+0.25z)| lr 3.32e-04 | 322.64 ms | 52.3% bf16 MFU | 1624612 tok/s step 9493/19560 | loss 3.424560 (-0.04z)| norm 0.2568 (-1.17z)| lr 3.32e-04 | 322.96 ms | 52.3% bf16 MFU | 1624550 tok/s step 9494/19560 | loss 3.413147 (-0.33z)| norm 0.2741 (-0.16z)| lr 3.32e-04 | 322.86 ms | 52.3% bf16 MFU | 1624516 tok/s step 9495/19560 | loss 3.348840 (-1.95z)| norm 0.3289 (+2.93z)| lr 3.32e-04 | 323.19 ms | 52.2% bf16 MFU | 1624401 tok/s step 9496/19560 | loss 3.489193 (+1.60z)| norm 0.3127 (+1.96z)| lr 3.32e-04 | 322.79 ms | 52.3% bf16 MFU | 1624392 tok/s step 9497/19560 | loss 3.370372 (-1.38z)| norm 0.2914 (+0.75z)| lr 3.32e-04 | 322.65 ms | 52.3% bf16 MFU | 1624418 tok/s step 9498/19560 | loss 3.414044 (-0.27z)| norm 0.2753 (-0.15z)| lr 3.32e-04 | 322.67 ms | 52.3% bf16 MFU | 1624438 tok/s step 9499/19560 | loss 3.340934 (-2.07z)| norm 0.2967 (+1.03z)| lr 3.32e-04 | 322.23 ms | 52.4% bf16 MFU | 1624570 tok/s step 9500/19560 | loss 3.457988 (+0.82z)| norm 0.2829 (+0.25z)| lr 3.31e-04 | 323.65 ms | 52.1% bf16 MFU | 1624338 tok/s val loss 3.408037 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 
evaluating HellaSwag: 70/79 HellaSwag: 2870/10042 = 0.285800 step 9501/19560 | loss 3.398712 (-0.64z)| norm 0.2971 (+1.04z)| lr 3.31e-04 | 322.53 ms | 52.3% bf16 MFU | 1624398 tok/s step 9502/19560 | loss 3.445105 (+0.56z)| norm 0.2628 (-0.88z)| lr 3.31e-04 | 322.60 ms | 52.3% bf16 MFU | 1624438 tok/s step 9503/19560 | loss 3.410753 (-0.32z)| norm 0.2675 (-0.63z)| lr 3.31e-04 | 323.18 ms | 52.2% bf16 MFU | 1624329 tok/s step 9504/19560 | loss 3.390606 (-0.84z)| norm 0.2527 (-1.44z)| lr 3.31e-04 | 323.72 ms | 52.1% bf16 MFU | 1624091 tok/s step 9505/19560 | loss 3.410942 (-0.32z)| norm 0.2714 (-0.40z)| lr 3.31e-04 | 322.98 ms | 52.3% bf16 MFU | 1624050 tok/s step 9506/19560 | loss 3.357731 (-1.69z)| norm 0.2807 (+0.12z)| lr 3.31e-04 | 323.17 ms | 52.2% bf16 MFU | 1623963 tok/s step 9507/19560 | loss 3.389888 (-0.86z)| norm 0.2544 (-1.35z)| lr 3.31e-04 | 322.80 ms | 52.3% bf16 MFU | 1623974 tok/s step 9508/19560 | loss 3.437574 (+0.37z)| norm 0.2530 (-1.41z)| lr 3.31e-04 | 323.01 ms | 52.2% bf16 MFU | 1623932 tok/s step 9509/19560 | loss 3.419964 (-0.08z)| norm 0.2771 (-0.07z)| lr 3.31e-04 | 323.19 ms | 52.2% bf16 MFU | 1623848 tok/s step 9510/19560 | loss 3.413815 (-0.24z)| norm 0.2705 (-0.44z)| lr 3.31e-04 | 322.96 ms | 52.3% bf16 MFU | 1623824 tok/s step 9511/19560 | loss 3.431769 (+0.22z)| norm 0.2834 (+0.27z)| lr 3.31e-04 | 323.27 ms | 52.2% bf16 MFU | 1623725 tok/s step 9512/19560 | loss 3.430604 (+0.18z)| norm 0.2881 (+0.53z)| lr 3.31e-04 | 323.14 ms | 52.2% bf16 MFU | 1623662 tok/s step 9513/19560 | loss 3.381630 (-1.07z)| norm 0.2529 (-1.41z)| lr 3.31e-04 | 323.10 ms | 52.2% bf16 MFU | 1623614 tok/s step 9514/19560 | loss 3.487438 (+1.62z)| norm 0.2912 (+0.70z)| lr 3.31e-04 | 323.11 ms | 52.2% bf16 MFU | 1623565 tok/s step 9515/19560 | loss 3.356934 (-1.70z)| norm 0.2852 (+0.37z)| lr 3.31e-04 | 322.93 ms | 52.3% bf16 MFU | 1623563 tok/s step 9516/19560 | loss 3.425980 (+0.05z)| norm 0.2887 (+0.55z)| lr 3.31e-04 | 322.67 ms | 52.3% bf16 MFU | 1623628 tok/s 
step 9517/19560 | loss 3.378067 (-1.17z)| norm 0.3037 (+1.36z)| lr 3.31e-04 | 322.79 ms | 52.3% bf16 MFU | 1623658 tok/s step 9518/19560 | loss 3.417153 (-0.19z)| norm 0.2831 (+0.22z)| lr 3.31e-04 | 322.97 ms | 52.3% bf16 MFU | 1623641 tok/s step 9519/19560 | loss 3.363410 (-1.53z)| norm 0.2638 (-0.84z)| lr 3.31e-04 | 323.29 ms | 52.2% bf16 MFU | 1623545 tok/s step 9520/19560 | loss 3.470274 (+1.18z)| norm 0.2994 (+1.10z)| lr 3.30e-04 | 323.38 ms | 52.2% bf16 MFU | 1623431 tok/s step 9521/19560 | loss 3.402656 (-0.54z)| norm 0.2699 (-0.54z)| lr 3.30e-04 | 322.39 ms | 52.4% bf16 MFU | 1623573 tok/s step 9522/19560 | loss 3.420963 (-0.07z)| norm 0.2783 (-0.07z)| lr 3.30e-04 | 322.99 ms | 52.3% bf16 MFU | 1623557 tok/s step 9523/19560 | loss 3.404759 (-0.48z)| norm 0.2726 (-0.39z)| lr 3.30e-04 | 323.75 ms | 52.1% bf16 MFU | 1623351 tok/s step 9524/19560 | loss 3.417156 (-0.18z)| norm 0.2863 (+0.36z)| lr 3.30e-04 | 323.12 ms | 52.2% bf16 MFU | 1623311 tok/s step 9525/19560 | loss 3.452929 (+0.72z)| norm 0.2840 (+0.23z)| lr 3.30e-04 | 322.81 ms | 52.3% bf16 MFU | 1623353 tok/s step 9526/19560 | loss 3.382871 (-1.05z)| norm 0.2656 (-0.85z)| lr 3.30e-04 | 322.90 ms | 52.3% bf16 MFU | 1623369 tok/s step 9527/19560 | loss 3.452681 (+0.74z)| norm 0.2880 (+0.44z)| lr 3.30e-04 | 323.31 ms | 52.2% bf16 MFU | 1623282 tok/s step 9528/19560 | loss 3.415507 (-0.21z)| norm 0.2782 (-0.14z)| lr 3.30e-04 | 322.65 ms | 52.3% bf16 MFU | 1623366 tok/s step 9529/19560 | loss 3.358057 (-1.65z)| norm 0.3546 (+4.13z)| lr 3.30e-04 | 322.86 ms | 52.3% bf16 MFU | 1623391 tok/s step 9530/19560 | loss 3.426531 (+0.09z)| norm 0.3087 (+1.51z)| lr 3.30e-04 | 323.10 ms | 52.2% bf16 MFU | 1623355 tok/s step 9531/19560 | loss 3.471906 (+1.22z)| norm 0.2744 (-0.42z)| lr 3.30e-04 | 322.90 ms | 52.3% bf16 MFU | 1623372 tok/s step 9532/19560 | loss 3.513239 (+2.20z)| norm 0.2928 (+0.60z)| lr 3.30e-04 | 322.35 ms | 52.4% bf16 MFU | 1623526 tok/s step 9533/19560 | loss 3.499510 (+1.85z)| norm 0.2848 (+0.15z)| 
lr 3.30e-04 | 322.78 ms | 52.3% bf16 MFU | 1623564 tok/s step 9534/19560 | loss 3.420678 (-0.09z)| norm 0.2687 (-0.75z)| lr 3.30e-04 | 322.77 ms | 52.3% bf16 MFU | 1623602 tok/s step 9535/19560 | loss 3.407461 (-0.41z)| norm 0.2589 (-1.28z)| lr 3.30e-04 | 322.81 ms | 52.3% bf16 MFU | 1623630 tok/s step 9536/19560 | loss 3.378646 (-1.11z)| norm 0.2682 (-0.75z)| lr 3.30e-04 | 322.81 ms | 52.3% bf16 MFU | 1623656 tok/s step 9537/19560 | loss 3.447454 (+0.58z)| norm 0.2508 (-1.70z)| lr 3.30e-04 | 323.17 ms | 52.2% bf16 MFU | 1623590 tok/s step 9538/19560 | loss 3.448501 (+0.60z)| norm 0.2877 (+0.36z)| lr 3.30e-04 | 322.98 ms | 52.3% bf16 MFU | 1623574 tok/s step 9539/19560 | loss 3.383805 (-0.98z)| norm 0.2726 (-0.48z)| lr 3.30e-04 | 322.44 ms | 52.3% bf16 MFU | 1623695 tok/s step 9540/19560 | loss 3.386996 (-0.89z)| norm 0.2851 (+0.22z)| lr 3.29e-04 | 322.47 ms | 52.3% bf16 MFU | 1623801 tok/s step 9541/19560 | loss 3.405843 (-0.44z)| norm 0.2619 (-1.06z)| lr 3.29e-04 | 323.20 ms | 52.2% bf16 MFU | 1623721 tok/s step 9542/19560 | loss 3.408195 (-0.38z)| norm 0.3054 (+1.36z)| lr 3.29e-04 | 322.83 ms | 52.3% bf16 MFU | 1623736 tok/s step 9543/19560 | loss 3.497842 (+1.82z)| norm 0.2833 (+0.12z)| lr 3.29e-04 | 322.95 ms | 52.3% bf16 MFU | 1623721 tok/s step 9544/19560 | loss 3.473863 (+1.21z)| norm 0.2939 (+0.70z)| lr 3.29e-04 | 322.90 ms | 52.3% bf16 MFU | 1623719 tok/s step 9545/19560 | loss 3.450199 (+0.62z)| norm 0.2683 (-0.74z)| lr 3.29e-04 | 322.36 ms | 52.4% bf16 MFU | 1623853 tok/s step 9546/19560 | loss 3.388040 (-0.90z)| norm 0.2980 (+0.93z)| lr 3.29e-04 | 322.72 ms | 52.3% bf16 MFU | 1623890 tok/s step 9547/19560 | loss 3.427848 (+0.09z)| norm 0.3167 (+1.94z)| lr 3.29e-04 | 322.79 ms | 52.3% bf16 MFU | 1623908 tok/s step 9548/19560 | loss 3.439544 (+0.37z)| norm 0.2899 (+0.44z)| lr 3.29e-04 | 323.39 ms | 52.2% bf16 MFU | 1623774 tok/s step 9549/19560 | loss 3.352725 (-1.76z)| norm 0.2819 (-0.01z)| lr 3.29e-04 | 322.71 ms | 52.3% bf16 MFU | 1623818 tok/s step 
9550/19560 | loss 3.443189 (+0.45z)| norm 0.3039 (+1.21z)| lr 3.29e-04 | 322.73 ms | 52.3% bf16 MFU | 1623855 tok/s step 9551/19560 | loss 3.378059 (-1.14z)| norm 0.2841 (+0.09z)| lr 3.29e-04 | 322.84 ms | 52.3% bf16 MFU | 1623862 tok/s step 9552/19560 | loss 3.483960 (+1.47z)| norm 0.3115 (+1.60z)| lr 3.29e-04 | 322.51 ms | 52.3% bf16 MFU | 1623951 tok/s step 9553/19560 | loss 3.404797 (-0.47z)| norm 0.2963 (+0.74z)| lr 3.29e-04 | 322.64 ms | 52.3% bf16 MFU | 1624004 tok/s step 9554/19560 | loss 3.389563 (-0.86z)| norm 0.2833 (+0.01z)| lr 3.29e-04 | 322.65 ms | 52.3% bf16 MFU | 1624052 tok/s step 9555/19560 | loss 3.433996 (+0.28z)| norm 0.2644 (-1.05z)| lr 3.29e-04 | 322.86 ms | 52.3% bf16 MFU | 1624044 tok/s step 9556/19560 | loss 3.442869 (+0.51z)| norm 0.2988 (+0.87z)| lr 3.29e-04 | 322.45 ms | 52.3% bf16 MFU | 1624139 tok/s step 9557/19560 | loss 3.474622 (+1.31z)| norm 0.2587 (-1.38z)| lr 3.29e-04 | 322.97 ms | 52.3% bf16 MFU | 1624098 tok/s step 9558/19560 | loss 3.495613 (+1.82z)| norm 0.2848 (+0.08z)| lr 3.29e-04 | 322.49 ms | 52.3% bf16 MFU | 1624179 tok/s step 9559/19560 | loss 3.334553 (-2.21z)| norm 0.2526 (-1.76z)| lr 3.29e-04 | 322.65 ms | 52.3% bf16 MFU | 1624218 tok/s step 9560/19560 | loss 3.402310 (-0.51z)| norm 0.2607 (-1.28z)| lr 3.28e-04 | 322.43 ms | 52.3% bf16 MFU | 1624310 tok/s step 9561/19560 | loss 3.378043 (-1.13z)| norm 0.2859 (+0.14z)| lr 3.28e-04 | 322.47 ms | 52.3% bf16 MFU | 1624387 tok/s step 9562/19560 | loss 3.457706 (+0.87z)| norm 0.2830 (-0.03z)| lr 3.28e-04 | 323.06 ms | 52.2% bf16 MFU | 1624312 tok/s step 9563/19560 | loss 3.415602 (-0.18z)| norm 0.2741 (-0.53z)| lr 3.28e-04 | 322.16 ms | 52.4% bf16 MFU | 1624466 tok/s step 9564/19560 | loss 3.451457 (+0.71z)| norm 0.2803 (-0.18z)| lr 3.28e-04 | 322.10 ms | 52.4% bf16 MFU | 1624630 tok/s step 9565/19560 | loss 3.379339 (-1.09z)| norm 0.2815 (-0.12z)| lr 3.28e-04 | 322.62 ms | 52.3% bf16 MFU | 1624652 tok/s step 9566/19560 | loss 3.456500 (+0.83z)| norm 0.2536 (-1.69z)| lr 
3.28e-04 | 322.86 ms | 52.3% bf16 MFU | 1624614 tok/s step 9567/19560 | loss 3.421494 (-0.03z)| norm 0.2576 (-1.45z)| lr 3.28e-04 | 322.47 ms | 52.3% bf16 MFU | 1624676 tok/s step 9568/19560 | loss 3.437414 (+0.37z)| norm 0.2525 (-1.72z)| lr 3.28e-04 | 322.32 ms | 52.4% bf16 MFU | 1624773 tok/s step 9569/19560 | loss 3.432430 (+0.26z)| norm 0.2607 (-1.23z)| lr 3.28e-04 | 322.35 ms | 52.4% bf16 MFU | 1624856 tok/s step 9570/19560 | loss 3.503955 (+2.01z)| norm 0.2553 (-1.52z)| lr 3.28e-04 | 322.45 ms | 52.3% bf16 MFU | 1624911 tok/s step 9571/19560 | loss 3.390919 (-0.79z)| norm 0.2698 (-0.69z)| lr 3.28e-04 | 322.32 ms | 52.4% bf16 MFU | 1624996 tok/s step 9572/19560 | loss 3.393704 (-0.71z)| norm 0.2661 (-0.89z)| lr 3.28e-04 | 322.86 ms | 52.3% bf16 MFU | 1624941 tok/s step 9573/19560 | loss 3.419688 (-0.07z)| norm 0.2611 (-1.15z)| lr 3.28e-04 | 322.44 ms | 52.3% bf16 MFU | 1624994 tok/s step 9574/19560 | loss 3.440633 (+0.45z)| norm 0.2997 (+1.03z)| lr 3.28e-04 | 322.67 ms | 52.3% bf16 MFU | 1624987 tok/s step 9575/19560 | loss 3.408379 (-0.35z)| norm 0.2487 (-1.83z)| lr 3.28e-04 | 322.32 ms | 52.4% bf16 MFU | 1625069 tok/s step 9576/19560 | loss 3.450395 (+0.72z)| norm 0.2820 (+0.06z)| lr 3.28e-04 | 322.22 ms | 52.4% bf16 MFU | 1625171 tok/s step 9577/19560 | loss 3.389889 (-0.81z)| norm 0.2592 (-1.22z)| lr 3.28e-04 | 322.37 ms | 52.4% bf16 MFU | 1625230 tok/s step 9578/19560 | loss 3.357328 (-1.61z)| norm 0.2739 (-0.39z)| lr 3.28e-04 | 322.49 ms | 52.3% bf16 MFU | 1625255 tok/s step 9579/19560 | loss 3.438962 (+0.45z)| norm 0.2754 (-0.29z)| lr 3.28e-04 | 322.96 ms | 52.3% bf16 MFU | 1625162 tok/s step 9580/19560 | loss 3.403343 (-0.44z)| norm 0.3030 (+1.27z)| lr 3.27e-04 | 323.16 ms | 52.2% bf16 MFU | 1625023 tok/s step 9581/19560 | loss 3.392036 (-0.73z)| norm 0.2803 (-0.02z)| lr 3.27e-04 | 322.34 ms | 52.4% bf16 MFU | 1625098 tok/s step 9582/19560 | loss 3.433764 (+0.33z)| norm 0.2541 (-1.49z)| lr 3.27e-04 | 322.82 ms | 52.3% bf16 MFU | 1625048 tok/s step 
9583/19560 | loss 3.405005 (-0.40z)| norm 0.2921 (+0.65z)| lr 3.27e-04 | 322.52 ms | 52.3% bf16 MFU | 1625075 tok/s step 9584/19560 | loss 3.427612 (+0.18z)| norm 0.2695 (-0.63z)| lr 3.27e-04 | 322.48 ms | 52.3% bf16 MFU | 1625112 tok/s step 9585/19560 | loss 3.419753 (-0.01z)| norm 0.2765 (-0.23z)| lr 3.27e-04 | 322.31 ms | 52.4% bf16 MFU | 1625189 tok/s step 9586/19560 | loss 3.376671 (-1.10z)| norm 0.2745 (-0.33z)| lr 3.27e-04 | 322.62 ms | 52.3% bf16 MFU | 1625184 tok/s step 9587/19560 | loss 3.430777 (+0.27z)| norm 0.2897 (+0.53z)| lr 3.27e-04 | 322.72 ms | 52.3% bf16 MFU | 1625155 tok/s step 9588/19560 | loss 3.386595 (-0.84z)| norm 0.2476 (-1.82z)| lr 3.27e-04 | 323.10 ms | 52.2% bf16 MFU | 1625031 tok/s step 9589/19560 | loss 3.412277 (-0.19z)| norm 0.2861 (+0.34z)| lr 3.27e-04 | 322.28 ms | 52.4% bf16 MFU | 1625120 tok/s step 9590/19560 | loss 3.412431 (-0.18z)| norm 0.2816 (+0.09z)| lr 3.27e-04 | 322.19 ms | 52.4% bf16 MFU | 1625226 tok/s step 9591/19560 | loss 3.411419 (-0.19z)| norm 0.2841 (+0.22z)| lr 3.27e-04 | 322.69 ms | 52.3% bf16 MFU | 1625203 tok/s step 9592/19560 | loss 3.439967 (+0.54z)| norm 0.2766 (-0.19z)| lr 3.27e-04 | 322.73 ms | 52.3% bf16 MFU | 1625171 tok/s step 9593/19560 | loss 3.431129 (+0.31z)| norm 0.2789 (-0.07z)| lr 3.27e-04 | 322.63 ms | 52.3% bf16 MFU | 1625164 tok/s step 9594/19560 | loss 3.416606 (-0.08z)| norm 0.2683 (-0.67z)| lr 3.27e-04 | 322.77 ms | 52.3% bf16 MFU | 1625123 tok/s step 9595/19560 | loss 3.357013 (-1.60z)| norm 0.2670 (-0.74z)| lr 3.27e-04 | 322.55 ms | 52.3% bf16 MFU | 1625140 tok/s step 9596/19560 | loss 3.469311 (+1.29z)| norm 0.3049 (+1.43z)| lr 3.27e-04 | 322.29 ms | 52.4% bf16 MFU | 1625222 tok/s step 9597/19560 | loss 3.319048 (-2.50z)| norm 0.2748 (-0.30z)| lr 3.27e-04 | 322.93 ms | 52.3% bf16 MFU | 1625136 tok/s step 9598/19560 | loss 3.426394 (+0.20z)| norm 0.2720 (-0.45z)| lr 3.27e-04 | 322.51 ms | 52.3% bf16 MFU | 1625162 tok/s step 9599/19560 | loss 3.398706 (-0.49z)| norm 0.2724 (-0.44z)| lr 
3.27e-04 | 322.29 ms | 52.4% bf16 MFU | 1625243 tok/s step 9600/19560 | loss 3.516841 (+2.41z)| norm 0.2780 (-0.11z)| lr 3.27e-04 | 322.74 ms | 52.3% bf16 MFU | 1625206 tok/s step 9601/19560 | loss 3.467596 (+1.18z)| norm 0.2989 (+1.07z)| lr 3.26e-04 | 322.48 ms | 52.3% bf16 MFU | 1625235 tok/s step 9602/19560 | loss 3.374290 (-1.10z)| norm 0.2793 (-0.06z)| lr 3.26e-04 | 322.68 ms | 52.3% bf16 MFU | 1625213 tok/s step 9603/19560 | loss 3.390387 (-0.70z)| norm 0.2762 (-0.24z)| lr 3.26e-04 | 322.42 ms | 52.3% bf16 MFU | 1625256 tok/s step 9604/19560 | loss 3.341550 (-1.86z)| norm 0.2918 (+0.67z)| lr 3.26e-04 | 322.54 ms | 52.3% bf16 MFU | 1625269 tok/s step 9605/19560 | loss 3.390992 (-0.66z)| norm 0.3027 (+1.28z)| lr 3.26e-04 | 323.32 ms | 52.2% bf16 MFU | 1625084 tok/s step 9606/19560 | loss 3.572043 (+3.61z)| norm 0.2770 (-0.18z)| lr 3.26e-04 | 322.11 ms | 52.4% bf16 MFU | 1625213 tok/s step 9607/19560 | loss 3.410288 (-0.20z)| norm 0.3096 (+1.68z)| lr 3.26e-04 | 322.81 ms | 52.3% bf16 MFU | 1625159 tok/s step 9608/19560 | loss 3.407198 (-0.28z)| norm 0.2905 (+0.58z)| lr 3.26e-04 | 322.48 ms | 52.3% bf16 MFU | 1625191 tok/s step 9609/19560 | loss 3.412926 (-0.15z)| norm 0.2706 (-0.54z)| lr 3.26e-04 | 322.49 ms | 52.3% bf16 MFU | 1625218 tok/s step 9610/19560 | loss 3.496989 (+1.81z)| norm 0.3049 (+1.40z)| lr 3.26e-04 | 322.72 ms | 52.3% bf16 MFU | 1625186 tok/s step 9611/19560 | loss 3.463747 (+1.05z)| norm 0.2823 (+0.12z)| lr 3.26e-04 | 322.26 ms | 52.4% bf16 MFU | 1625271 tok/s step 9612/19560 | loss 3.417464 (-0.04z)| norm 0.2945 (+0.82z)| lr 3.26e-04 | 322.62 ms | 52.3% bf16 MFU | 1625261 tok/s step 9613/19560 | loss 3.419121 (+0.00z)| norm 0.2960 (+0.92z)| lr 3.26e-04 | 322.87 ms | 52.3% bf16 MFU | 1625191 tok/s step 9614/19560 | loss 3.437354 (+0.43z)| norm 0.2916 (+0.66z)| lr 3.26e-04 | 322.59 ms | 52.3% bf16 MFU | 1625194 tok/s step 9615/19560 | loss 3.409099 (-0.25z)| norm 0.2623 (-1.01z)| lr 3.26e-04 | 322.54 ms | 52.3% bf16 MFU | 1625209 tok/s step 
9616/19560 | loss 3.368961 (-1.20z)| norm 0.2766 (-0.19z)| lr 3.26e-04 | 322.57 ms | 52.3% bf16 MFU | 1625216 tok/s step 9617/19560 | loss 3.448697 (+0.69z)| norm 0.2681 (-0.68z)| lr 3.26e-04 | 323.11 ms | 52.2% bf16 MFU | 1625086 tok/s step 9618/19560 | loss 3.414952 (-0.10z)| norm 0.2377 (-2.34z)| lr 3.26e-04 | 322.85 ms | 52.3% bf16 MFU | 1625028 tok/s step 9619/19560 | loss 3.419129 (+0.00z)| norm 0.2704 (-0.52z)| lr 3.26e-04 | 322.51 ms | 52.3% bf16 MFU | 1625060 tok/s step 9620/19560 | loss 3.391438 (-0.67z)| norm 0.2648 (-0.82z)| lr 3.26e-04 | 322.71 ms | 52.3% bf16 MFU | 1625039 tok/s step 9621/19560 | loss 3.486199 (+1.59z)| norm 0.2896 (+0.55z)| lr 3.25e-04 | 322.54 ms | 52.3% bf16 MFU | 1625061 tok/s step 9622/19560 | loss 3.402745 (-0.40z)| norm 0.2591 (-1.15z)| lr 3.25e-04 | 322.88 ms | 52.3% bf16 MFU | 1624998 tok/s step 9623/19560 | loss 3.464547 (+1.06z)| norm 0.2860 (+0.39z)| lr 3.25e-04 | 322.58 ms | 52.3% bf16 MFU | 1625012 tok/s step 9624/19560 | loss 3.343124 (-1.83z)| norm 0.2689 (-0.59z)| lr 3.25e-04 | 323.18 ms | 52.2% bf16 MFU | 1624875 tok/s step 9625/19560 | loss 3.628503 (+4.56z)| norm 0.2757 (-0.18z)| lr 3.25e-04 | 322.59 ms | 52.3% bf16 MFU | 1624893 tok/s step 9626/19560 | loss 3.359375 (-1.35z)| norm 0.3009 (+1.27z)| lr 3.25e-04 | 322.64 ms | 52.3% bf16 MFU | 1624899 tok/s step 9627/19560 | loss 3.435404 (+0.30z)| norm 0.2887 (+0.57z)| lr 3.25e-04 | 322.62 ms | 52.3% bf16 MFU | 1624909 tok/s step 9628/19560 | loss 3.428100 (+0.15z)| norm 0.2860 (+0.41z)| lr 3.25e-04 | 322.82 ms | 52.3% bf16 MFU | 1624869 tok/s step 9629/19560 | loss 3.367707 (-1.18z)| norm 0.3015 (+1.31z)| lr 3.25e-04 | 322.41 ms | 52.3% bf16 MFU | 1624934 tok/s step 9630/19560 | loss 3.447815 (+0.59z)| norm 0.2877 (+0.50z)| lr 3.25e-04 | 322.41 ms | 52.3% bf16 MFU | 1624996 tok/s step 9631/19560 | loss 3.406980 (-0.31z)| norm 0.2917 (+0.72z)| lr 3.25e-04 | 322.42 ms | 52.3% bf16 MFU | 1625052 tok/s step 9632/19560 | loss 3.407991 (-0.30z)| norm 0.2939 (+0.83z)| lr 
3.25e-04 | 322.67 ms | 52.3% bf16 MFU | 1625042 tok/s step 9633/19560 | loss 3.389342 (-0.70z)| norm 0.2712 (-0.50z)| lr 3.25e-04 | 322.64 ms | 52.3% bf16 MFU | 1625038 tok/s step 9634/19560 | loss 3.400355 (-0.47z)| norm 0.3093 (+1.70z)| lr 3.25e-04 | 322.54 ms | 52.3% bf16 MFU | 1625061 tok/s step 9635/19560 | loss 3.383321 (-0.85z)| norm 0.2712 (-0.52z)| lr 3.25e-04 | 322.81 ms | 52.3% bf16 MFU | 1625016 tok/s step 9636/19560 | loss 3.382001 (-0.87z)| norm 0.3639 (+4.49z)| lr 3.25e-04 | 322.84 ms | 52.3% bf16 MFU | 1624964 tok/s step 9637/19560 | loss 3.516180 (+2.06z)| norm 0.3050 (+1.28z)| lr 3.25e-04 | 322.32 ms | 52.4% bf16 MFU | 1625046 tok/s step 9638/19560 | loss 3.457141 (+0.76z)| norm 0.3384 (+2.95z)| lr 3.25e-04 | 322.24 ms | 52.4% bf16 MFU | 1625144 tok/s step 9639/19560 | loss 3.370181 (-1.11z)| norm 0.2743 (-0.38z)| lr 3.25e-04 | 323.46 ms | 52.2% bf16 MFU | 1624931 tok/s step 9640/19560 | loss 3.413703 (-0.17z)| norm 0.3029 (+1.10z)| lr 3.25e-04 | 322.23 ms | 52.4% bf16 MFU | 1625038 tok/s step 9641/19560 | loss 3.484541 (+1.34z)| norm 0.2839 (+0.10z)| lr 3.24e-04 | 322.33 ms | 52.4% bf16 MFU | 1625114 tok/s step 9642/19560 | loss 3.422968 (+0.02z)| norm 0.2885 (+0.34z)| lr 3.24e-04 | 322.21 ms | 52.4% bf16 MFU | 1625217 tok/s step 9643/19560 | loss 3.493830 (+1.54z)| norm 0.2742 (-0.40z)| lr 3.24e-04 | 322.79 ms | 52.3% bf16 MFU | 1625167 tok/s step 9644/19560 | loss 3.426471 (+0.08z)| norm 0.2541 (-1.43z)| lr 3.24e-04 | 322.25 ms | 52.4% bf16 MFU | 1625257 tok/s step 9645/19560 | loss 3.377286 (-0.99z)| norm 0.2498 (-1.62z)| lr 3.24e-04 | 323.40 ms | 52.2% bf16 MFU | 1625053 tok/s step 9646/19560 | loss 3.345280 (-1.65z)| norm 0.2523 (-1.46z)| lr 3.24e-04 | 322.98 ms | 52.3% bf16 MFU | 1624964 tok/s step 9647/19560 | loss 3.440675 (+0.38z)| norm 0.2634 (-0.90z)| lr 3.24e-04 | 322.79 ms | 52.3% bf16 MFU | 1624928 tok/s step 9648/19560 | loss 3.407854 (-0.32z)| norm 0.2807 (-0.01z)| lr 3.24e-04 | 322.80 ms | 52.3% bf16 MFU | 1624891 tok/s step 
9649/19560 | loss 3.383898 (-0.83z)| norm 0.2565 (-1.24z)| lr 3.24e-04 | 322.59 ms | 52.3% bf16 MFU | 1624908 tok/s step 9650/19560 | loss 3.406762 (-0.33z)| norm 0.2887 (+0.41z)| lr 3.24e-04 | 322.55 ms | 52.3% bf16 MFU | 1624936 tok/s step 9651/19560 | loss 3.403361 (-0.41z)| norm 0.2572 (-1.19z)| lr 3.24e-04 | 322.99 ms | 52.3% bf16 MFU | 1624850 tok/s step 9652/19560 | loss 3.411327 (-0.23z)| norm 0.2658 (-0.75z)| lr 3.24e-04 | 322.70 ms | 52.3% bf16 MFU | 1624840 tok/s step 9653/19560 | loss 3.391505 (-0.65z)| norm 0.2662 (-0.72z)| lr 3.24e-04 | 322.63 ms | 52.3% bf16 MFU | 1624850 tok/s step 9654/19560 | loss 3.421204 (-0.02z)| norm 0.2601 (-1.02z)| lr 3.24e-04 | 322.87 ms | 52.3% bf16 MFU | 1624799 tok/s step 9655/19560 | loss 3.348032 (-1.57z)| norm 0.2600 (-1.01z)| lr 3.24e-04 | 322.51 ms | 52.3% bf16 MFU | 1624842 tok/s step 9656/19560 | loss 3.421371 (+0.00z)| norm 0.2930 (+0.64z)| lr 3.24e-04 | 323.18 ms | 52.2% bf16 MFU | 1624714 tok/s step 9657/19560 | loss 3.482184 (+1.29z)| norm 0.2802 (+0.03z)| lr 3.24e-04 | 322.19 ms | 52.4% bf16 MFU | 1624840 tok/s step 9658/19560 | loss 3.466781 (+0.95z)| norm 0.3122 (+1.73z)| lr 3.24e-04 | 323.38 ms | 52.2% bf16 MFU | 1624663 tok/s step 9659/19560 | loss 3.446273 (+0.52z)| norm 0.2983 (+0.98z)| lr 3.24e-04 | 322.70 ms | 52.3% bf16 MFU | 1624663 tok/s step 9660/19560 | loss 3.509075 (+1.87z)| norm 0.2971 (+0.91z)| lr 3.24e-04 | 322.57 ms | 52.3% bf16 MFU | 1624698 tok/s step 9661/19560 | loss 3.387250 (-0.74z)| norm 0.2634 (-0.86z)| lr 3.23e-04 | 323.53 ms | 52.2% bf16 MFU | 1624490 tok/s step 9662/19560 | loss 3.385260 (-0.78z)| norm 0.2866 (+0.36z)| lr 3.23e-04 | 322.98 ms | 52.3% bf16 MFU | 1624428 tok/s step 9663/19560 | loss 3.382156 (-0.84z)| norm 0.2481 (-1.66z)| lr 3.23e-04 | 323.63 ms | 52.1% bf16 MFU | 1624208 tok/s step 9664/19560 | loss 3.544446 (+2.58z)| norm 0.2874 (+0.40z)| lr 3.23e-04 | 322.52 ms | 52.3% bf16 MFU | 1624277 tok/s step 9665/19560 | loss 3.454575 (+0.68z)| norm 0.2761 (-0.21z)| lr 
3.23e-04 | 322.79 ms | 52.3% bf16 MFU | 1624274 tok/s step 9666/19560 | loss 3.428499 (+0.14z)| norm 0.3032 (+1.21z)| lr 3.23e-04 | 323.85 ms | 52.1% bf16 MFU | 1624006 tok/s step 9667/19560 | loss 3.432108 (+0.20z)| norm 0.2841 (+0.20z)| lr 3.23e-04 | 322.32 ms | 52.4% bf16 MFU | 1624136 tok/s step 9668/19560 | loss 3.344786 (-1.62z)| norm 0.3172 (+1.90z)| lr 3.23e-04 | 323.11 ms | 52.2% bf16 MFU | 1624062 tok/s step 9669/19560 | loss 3.395907 (-0.55z)| norm 0.2625 (-0.94z)| lr 3.23e-04 | 322.67 ms | 52.3% bf16 MFU | 1624100 tok/s step 9670/19560 | loss 3.527834 (+2.16z)| norm 0.2867 (+0.33z)| lr 3.23e-04 | 323.02 ms | 52.2% bf16 MFU | 1624050 tok/s step 9671/19560 | loss 3.380484 (-0.86z)| norm 0.2688 (-0.60z)| lr 3.23e-04 | 322.93 ms | 52.3% bf16 MFU | 1624024 tok/s step 9672/19560 | loss 3.407234 (-0.30z)| norm 0.2720 (-0.43z)| lr 3.23e-04 | 323.24 ms | 52.2% bf16 MFU | 1623923 tok/s step 9673/19560 | loss 3.430533 (+0.19z)| norm 0.2854 (+0.27z)| lr 3.23e-04 | 322.92 ms | 52.3% bf16 MFU | 1623906 tok/s step 9674/19560 | loss 3.497948 (+1.56z)| norm 0.2529 (-1.41z)| lr 3.23e-04 | 322.96 ms | 52.3% bf16 MFU | 1623881 tok/s step 9675/19560 | loss 3.403937 (-0.37z)| norm 0.2775 (-0.11z)| lr 3.23e-04 | 322.42 ms | 52.3% bf16 MFU | 1623992 tok/s step 9676/19560 | loss 3.420984 (-0.02z)| norm 0.2536 (-1.35z)| lr 3.23e-04 | 322.97 ms | 52.3% bf16 MFU | 1623958 tok/s step 9677/19560 | loss 3.356939 (-1.35z)| norm 0.2667 (-0.66z)| lr 3.23e-04 | 322.54 ms | 52.3% bf16 MFU | 1624034 tok/s step 9678/19560 | loss 3.396925 (-0.51z)| norm 0.2757 (-0.17z)| lr 3.23e-04 | 323.22 ms | 52.2% bf16 MFU | 1623936 tok/s step 9679/19560 | loss 3.410880 (-0.23z)| norm 0.2490 (-1.56z)| lr 3.23e-04 | 322.97 ms | 52.3% bf16 MFU | 1623907 tok/s step 9680/19560 | loss 3.391210 (-0.62z)| norm 0.2485 (-1.56z)| lr 3.23e-04 | 322.88 ms | 52.3% bf16 MFU | 1623900 tok/s step 9681/19560 | loss 3.440197 (+0.39z)| norm 0.2706 (-0.39z)| lr 3.22e-04 | 323.18 ms | 52.2% bf16 MFU | 1623819 tok/s step 
9682/19560 | loss 3.419582 (-0.04z)| norm 0.2502 (-1.44z)| lr 3.22e-04 | 323.11 ms | 52.2% bf16 MFU | 1623760 tok/s step 9683/19560 | loss 3.387608 (-0.70z)| norm 0.2807 (+0.15z)| lr 3.22e-04 | 322.68 ms | 52.3% bf16 MFU | 1623811 tok/s step 9684/19560 | loss 3.390433 (-0.63z)| norm 0.2535 (-1.26z)| lr 3.22e-04 | 322.59 ms | 52.3% bf16 MFU | 1623882 tok/s step 9685/19560 | loss 3.405862 (-0.30z)| norm 0.2776 (-0.00z)| lr 3.22e-04 | 323.40 ms | 52.2% bf16 MFU | 1623746 tok/s step 9686/19560 | loss 3.457834 (+0.80z)| norm 0.2981 (+1.06z)| lr 3.22e-04 | 322.75 ms | 52.3% bf16 MFU | 1623780 tok/s step 9687/19560 | loss 3.391748 (-0.61z)| norm 0.2680 (-0.52z)| lr 3.22e-04 | 322.61 ms | 52.3% bf16 MFU | 1623850 tok/s step 9688/19560 | loss 3.372512 (-1.01z)| norm 0.2710 (-0.37z)| lr 3.22e-04 | 323.93 ms | 52.1% bf16 MFU | 1623584 tok/s step 9689/19560 | loss 3.452368 (+0.67z)| norm 0.2890 (+0.58z)| lr 3.22e-04 | 322.31 ms | 52.4% bf16 MFU | 1623737 tok/s step 9690/19560 | loss 3.388012 (-0.69z)| norm 0.2825 (+0.24z)| lr 3.22e-04 | 323.05 ms | 52.2% bf16 MFU | 1623697 tok/s step 9691/19560 | loss 3.331596 (-1.85z)| norm 0.2819 (+0.20z)| lr 3.22e-04 | 322.94 ms | 52.3% bf16 MFU | 1623686 tok/s step 9692/19560 | loss 3.409013 (-0.22z)| norm 0.2670 (-0.58z)| lr 3.22e-04 | 323.13 ms | 52.2% bf16 MFU | 1623627 tok/s step 9693/19560 | loss 3.429181 (+0.20z)| norm 0.2715 (-0.33z)| lr 3.22e-04 | 322.90 ms | 52.3% bf16 MFU | 1623630 tok/s step 9694/19560 | loss 3.402130 (-0.36z)| norm 0.2761 (-0.10z)| lr 3.22e-04 | 322.54 ms | 52.3% bf16 MFU | 1623724 tok/s step 9695/19560 | loss 3.372706 (-0.97z)| norm 0.2732 (-0.26z)| lr 3.22e-04 | 323.41 ms | 52.2% bf16 MFU | 1623595 tok/s step 9696/19560 | loss 3.409619 (-0.19z)| norm 0.2725 (-0.31z)| lr 3.22e-04 | 322.92 ms | 52.3% bf16 MFU | 1623594 tok/s step 9697/19560 | loss 3.382456 (-0.75z)| norm 0.2696 (-0.47z)| lr 3.22e-04 | 323.66 ms | 52.1% bf16 MFU | 1623407 tok/s step 9698/19560 | loss 3.392912 (-0.52z)| norm 0.2684 (-0.55z)| lr 
3.22e-04 | 322.42 ms | 52.3% bf16 MFU | 1623542 tok/s step 9699/19560 | loss 3.379471 (-0.80z)| norm 0.2678 (-0.58z)| lr 3.22e-04 | 322.66 ms | 52.3% bf16 MFU | 1623610 tok/s step 9700/19560 | loss 3.389682 (-0.59z)| norm 0.2626 (-0.86z)| lr 3.22e-04 | 322.71 ms | 52.3% bf16 MFU | 1623662 tok/s step 9701/19560 | loss 3.406093 (-0.24z)| norm 0.2873 (+0.47z)| lr 3.21e-04 | 322.68 ms | 52.3% bf16 MFU | 1623718 tok/s step 9702/19560 | loss 3.435103 (+0.38z)| norm 0.2516 (-1.44z)| lr 3.21e-04 | 322.90 ms | 52.3% bf16 MFU | 1623716 tok/s step 9703/19560 | loss 3.409084 (-0.17z)| norm 0.2750 (-0.19z)| lr 3.21e-04 | 322.79 ms | 52.3% bf16 MFU | 1623742 tok/s step 9704/19560 | loss 3.437378 (+0.43z)| norm 0.2608 (-0.95z)| lr 3.21e-04 | 322.16 ms | 52.4% bf16 MFU | 1623926 tok/s step 9705/19560 | loss 3.447916 (+0.65z)| norm 0.2825 (+0.22z)| lr 3.21e-04 | 322.79 ms | 52.3% bf16 MFU | 1623941 tok/s step 9706/19560 | loss 3.406907 (-0.23z)| norm 0.2653 (-0.71z)| lr 3.21e-04 | 323.43 ms | 52.2% bf16 MFU | 1623796 tok/s step 9707/19560 | loss 3.376355 (-0.88z)| norm 0.2908 (+0.66z)| lr 3.21e-04 | 323.18 ms | 52.2% bf16 MFU | 1623720 tok/s step 9708/19560 | loss 3.438224 (+0.44z)| norm 0.2794 (+0.06z)| lr 3.21e-04 | 322.48 ms | 52.3% bf16 MFU | 1623824 tok/s step 9709/19560 | loss 3.457088 (+0.83z)| norm 0.3003 (+1.18z)| lr 3.21e-04 | 322.67 ms | 52.3% bf16 MFU | 1623876 tok/s step 9710/19560 | loss 3.429690 (+0.25z)| norm 0.2498 (-1.56z)| lr 3.21e-04 | 323.05 ms | 52.2% bf16 MFU | 1623828 tok/s step 9711/19560 | loss 3.380527 (-0.80z)| norm 0.2702 (-0.44z)| lr 3.21e-04 | 323.07 ms | 52.2% bf16 MFU | 1623779 tok/s step 9712/19560 | loss 3.469054 (+1.08z)| norm 0.2650 (-0.72z)| lr 3.21e-04 | 322.70 ms | 52.3% bf16 MFU | 1623824 tok/s step 9713/19560 | loss 3.597787 (+3.58z)| norm 0.2897 (+0.61z)| lr 3.21e-04 | 322.43 ms | 52.3% bf16 MFU | 1623935 tok/s step 9714/19560 | loss 3.447893 (+0.56z)| norm 0.2594 (-1.02z)| lr 3.21e-04 | 322.60 ms | 52.3% bf16 MFU | 1623999 tok/s step 
9715/19560 | loss 3.379702 (-0.81z)| norm 0.2988 (+1.10z)| lr 3.21e-04 | 322.44 ms | 52.3% bf16 MFU | 1624098 tok/s step 9716/19560 | loss 3.459443 (+0.78z)| norm 0.2882 (+0.51z)| lr 3.21e-04 | 322.86 ms | 52.3% bf16 MFU | 1624088 tok/s step 9717/19560 | loss 3.397753 (-0.45z)| norm 0.2888 (+0.55z)| lr 3.21e-04 | 322.42 ms | 52.3% bf16 MFU | 1624189 tok/s step 9718/19560 | loss 3.401414 (-0.38z)| norm 0.2677 (-0.59z)| lr 3.21e-04 | 323.04 ms | 52.2% bf16 MFU | 1624128 tok/s step 9719/19560 | loss 3.440849 (+0.41z)| norm 0.2937 (+0.81z)| lr 3.21e-04 | 322.03 ms | 52.4% bf16 MFU | 1624326 tok/s step 9720/19560 | loss 3.421198 (+0.02z)| norm 0.2625 (-0.86z)| lr 3.21e-04 | 322.46 ms | 52.3% bf16 MFU | 1624406 tok/s step 9721/19560 | loss 3.382502 (-0.75z)| norm 0.3047 (+1.38z)| lr 3.20e-04 | 323.09 ms | 52.2% bf16 MFU | 1624323 tok/s step 9722/19560 | loss 3.443902 (+0.48z)| norm 0.2632 (-0.83z)| lr 3.20e-04 | 322.51 ms | 52.3% bf16 MFU | 1624389 tok/s step 9723/19560 | loss 3.404350 (-0.32z)| norm 0.2906 (+0.62z)| lr 3.20e-04 | 323.13 ms | 52.2% bf16 MFU | 1624297 tok/s step 9724/19560 | loss 3.427516 (+0.15z)| norm 0.2628 (-0.85z)| lr 3.20e-04 | 322.95 ms | 52.3% bf16 MFU | 1624255 tok/s step 9725/19560 | loss 3.411612 (-0.19z)| norm 0.2753 (-0.17z)| lr 3.20e-04 | 322.60 ms | 52.3% bf16 MFU | 1624302 tok/s step 9726/19560 | loss 3.386694 (-0.69z)| norm 0.2727 (-0.31z)| lr 3.20e-04 | 322.81 ms | 52.3% bf16 MFU | 1624293 tok/s step 9727/19560 | loss 3.452976 (+0.65z)| norm 0.2725 (-0.32z)| lr 3.20e-04 | 323.27 ms | 52.2% bf16 MFU | 1624168 tok/s step 9728/19560 | loss 3.366651 (-1.10z)| norm 0.2801 (+0.08z)| lr 3.20e-04 | 323.17 ms | 52.2% bf16 MFU | 1624076 tok/s step 9729/19560 | loss 3.452117 (+0.67z)| norm 0.2694 (-0.48z)| lr 3.20e-04 | 322.73 ms | 52.3% bf16 MFU | 1624098 tok/s step 9730/19560 | loss 3.385055 (-0.72z)| norm 0.2822 (+0.21z)| lr 3.20e-04 | 322.73 ms | 52.3% bf16 MFU | 1624119 tok/s step 9731/19560 | loss 3.397027 (-0.47z)| norm 0.3044 (+1.38z)| lr 
3.20e-04 | 322.76 ms | 52.3% bf16 MFU | 1624133 tok/s step 9732/19560 | loss 3.437696 (+0.36z)| norm 0.2668 (-0.62z)| lr 3.20e-04 | 322.82 ms | 52.3% bf16 MFU | 1624131 tok/s step 9733/19560 | loss 3.407078 (-0.29z)| norm 0.2964 (+0.97z)| lr 3.20e-04 | 323.34 ms | 52.2% bf16 MFU | 1624000 tok/s step 9734/19560 | loss 3.429811 (+0.22z)| norm 0.2631 (-0.81z)| lr 3.20e-04 | 322.17 ms | 52.4% bf16 MFU | 1624168 tok/s step 9735/19560 | loss 3.392042 (-0.60z)| norm 0.2616 (-0.88z)| lr 3.20e-04 | 322.30 ms | 52.4% bf16 MFU | 1624296 tok/s step 9736/19560 | loss 3.362450 (-1.23z)| norm 0.2873 (+0.51z)| lr 3.20e-04 | 322.84 ms | 52.3% bf16 MFU | 1624279 tok/s step 9737/19560 | loss 3.436856 (+0.38z)| norm 0.2639 (-0.75z)| lr 3.20e-04 | 322.43 ms | 52.3% bf16 MFU | 1624369 tok/s step 9738/19560 | loss 3.478938 (+1.30z)| norm 0.2924 (+0.80z)| lr 3.20e-04 | 322.76 ms | 52.3% bf16 MFU | 1624369 tok/s step 9739/19560 | loss 3.462500 (+0.95z)| norm 0.2893 (+0.62z)| lr 3.20e-04 | 322.72 ms | 52.3% bf16 MFU | 1624380 tok/s step 9740/19560 | loss 3.457505 (+0.83z)| norm 0.2816 (+0.22z)| lr 3.20e-04 | 322.91 ms | 52.3% bf16 MFU | 1624343 tok/s step 9741/19560 | loss 3.374072 (-0.98z)| norm 0.2576 (-1.07z)| lr 3.19e-04 | 322.84 ms | 52.3% bf16 MFU | 1624325 tok/s step 9742/19560 | loss 3.470052 (+1.09z)| norm 0.2770 (-0.01z)| lr 3.19e-04 | 322.27 ms | 52.4% bf16 MFU | 1624452 tok/s step 9743/19560 | loss 3.423038 (+0.08z)| norm 0.3106 (+1.77z)| lr 3.19e-04 | 322.30 ms | 52.4% bf16 MFU | 1624565 tok/s step 9744/19560 | loss 3.457068 (+0.80z)| norm 0.2994 (+1.16z)| lr 3.19e-04 | 322.98 ms | 52.3% bf16 MFU | 1624502 tok/s step 9745/19560 | loss 3.375030 (-0.96z)| norm 0.2679 (-0.53z)| lr 3.19e-04 | 322.80 ms | 52.3% bf16 MFU | 1624486 tok/s step 9746/19560 | loss 3.417289 (-0.05z)| norm 0.2807 (+0.14z)| lr 3.19e-04 | 322.73 ms | 52.3% bf16 MFU | 1624490 tok/s step 9747/19560 | loss 3.440297 (+0.44z)| norm 0.3131 (+1.86z)| lr 3.19e-04 | 322.32 ms | 52.4% bf16 MFU | 1624597 tok/s step 
9748/19560 | loss 3.471790 (+1.10z)| norm 0.3091 (+1.62z)| lr 3.19e-04 | 322.56 ms | 52.3% bf16 MFU | 1624638 tok/s step 9749/19560 | loss 3.403504 (-0.35z)| norm 0.2839 (+0.28z)| lr 3.19e-04 | 322.62 ms | 52.3% bf16 MFU | 1624662 tok/s step 9750/19560 | loss 3.408429 (-0.25z)| norm 0.3074 (+1.50z)| lr 3.19e-04 | 322.43 ms | 52.3% bf16 MFU | 1624731 tok/s val loss 3.402709 HellaSwag: 2888/10042 = 0.287592 step 9751/19560 | loss 3.484550 (+1.39z)| norm 0.3038 (+1.29z)| lr 3.19e-04 | 322.95 ms | 52.3% bf16 MFU | 1624667 tok/s step 9752/19560 | loss 3.447915 (+0.59z)| norm 0.3209
(+2.14z)| lr 3.19e-04 | 322.43 ms | 52.3% bf16 MFU | 1624737 tok/s step 9753/19560 | loss 3.419355 (+0.01z)| norm 0.3153 (+1.81z)| lr 3.19e-04 | 322.50 ms | 52.3% bf16 MFU | 1624785 tok/s step 9754/19560 | loss 3.411340 (-0.20z)| norm 0.3134 (+1.70z)| lr 3.19e-04 | 322.40 ms | 52.3% bf16 MFU | 1624857 tok/s step 9755/19560 | loss 3.394213 (-0.60z)| norm 0.2845 (+0.23z)| lr 3.19e-04 | 322.88 ms | 52.3% bf16 MFU | 1624803 tok/s step 9756/19560 | loss 3.407912 (-0.27z)| norm 0.2915 (+0.58z)| lr 3.19e-04 | 322.87 ms | 52.3% bf16 MFU | 1624756 tok/s step 9757/19560 | loss 3.397938 (-0.51z)| norm 0.2873 (+0.37z)| lr 3.19e-04 | 323.33 ms | 52.2% bf16 MFU | 1624594 tok/s step 9758/19560 | loss 3.341270 (-1.84z)| norm 0.2647 (-0.77z)| lr 3.19e-04 | 322.67 ms | 52.3% bf16 MFU | 1624607 tok/s step 9759/19560 | loss 3.457714 (+0.92z)| norm 0.2775 (-0.11z)| lr 3.19e-04 | 322.24 ms | 52.4% bf16 MFU | 1624729 tok/s step 9760/19560 | loss 3.384669 (-0.80z)| norm 0.2673 (-0.62z)| lr 3.19e-04 | 322.70 ms | 52.3% bf16 MFU | 1624728 tok/s step 9761/19560 | loss 3.414160 (-0.11z)| norm 0.2800 (+0.02z)| lr 3.18e-04 | 322.72 ms | 52.3% bf16 MFU | 1624721 tok/s step 9762/19560 | loss 3.408455 (-0.25z)| norm 0.2669 (-0.63z)| lr 3.18e-04 | 322.75 ms | 52.3% bf16 MFU | 1624707 tok/s step 9763/19560 | loss 3.404265 (-0.35z)| norm 0.2663 (-0.66z)| lr 3.18e-04 | 322.50 ms | 52.3% bf16 MFU | 1624756 tok/s step 9764/19560 | loss 3.420811 (+0.03z)| norm 0.2882 (+0.54z)| lr 3.18e-04 | 322.11 ms | 52.4% bf16 MFU | 1624901 tok/s step 9765/19560 | loss 3.420801 (+0.05z)| norm 0.2962 (+0.99z)| lr 3.18e-04 | 322.54 ms | 52.3% bf16 MFU | 1624932 tok/s step 9766/19560 | loss 3.420154 (+0.04z)| norm 0.2722 (-0.34z)| lr 3.18e-04 | 322.68 ms | 52.3% bf16 MFU | 1624924 tok/s step 9767/19560 | loss 3.353555 (-1.57z)| norm 0.2751 (-0.17z)| lr 3.18e-04 | 322.54 ms | 52.3% bf16 MFU | 1624953 tok/s step 9768/19560 | loss 3.504362 (+2.05z)| norm 0.2488 (-1.68z)| lr 3.18e-04 | 322.52 ms | 52.3% bf16 MFU | 1624986 
tok/s step 9769/19560 | loss 3.431028 (+0.30z)| norm 0.2572 (-1.17z)| lr 3.18e-04 | 322.75 ms | 52.3% bf16 MFU | 1624958 tok/s step 9770/19560 | loss 3.450794 (+0.77z)| norm 0.2733 (-0.23z)| lr 3.18e-04 | 323.03 ms | 52.2% bf16 MFU | 1624862 tok/s step 9771/19560 | loss 3.409138 (-0.22z)| norm 0.2550 (-1.28z)| lr 3.18e-04 | 322.23 ms | 52.4% bf16 MFU | 1624973 tok/s step 9772/19560 | loss 3.530210 (+2.64z)| norm 0.2960 (+1.08z)| lr 3.18e-04 | 322.67 ms | 52.3% bf16 MFU | 1624965 tok/s step 9773/19560 | loss 3.396685 (-0.53z)| norm 0.2780 (+0.02z)| lr 3.18e-04 | 323.15 ms | 52.2% bf16 MFU | 1624838 tok/s step 9774/19560 | loss 3.445486 (+0.62z)| norm 0.2920 (+0.83z)| lr 3.18e-04 | 322.93 ms | 52.3% bf16 MFU | 1624774 tok/s step 9775/19560 | loss 3.424430 (+0.11z)| norm 0.2929 (+0.87z)| lr 3.18e-04 | 322.68 ms | 52.3% bf16 MFU | 1624775 tok/s step 9776/19560 | loss 3.428991 (+0.22z)| norm 0.2583 (-1.16z)| lr 3.18e-04 | 322.71 ms | 52.3% bf16 MFU | 1624769 tok/s step 9777/19560 | loss 3.465631 (+1.09z)| norm 0.2702 (-0.47z)| lr 3.18e-04 | 322.73 ms | 52.3% bf16 MFU | 1624759 tok/s step 9778/19560 | loss 3.369946 (-1.20z)| norm 0.2471 (-1.80z)| lr 3.18e-04 | 322.18 ms | 52.4% bf16 MFU | 1624887 tok/s step 9779/19560 | loss 3.480680 (+1.42z)| norm 0.2638 (-0.82z)| lr 3.18e-04 | 322.91 ms | 52.3% bf16 MFU | 1624825 tok/s step 9780/19560 | loss 3.446385 (+0.60z)| norm 0.2617 (-0.94z)| lr 3.18e-04 | 322.55 ms | 52.3% bf16 MFU | 1624857 tok/s step 9781/19560 | loss 3.441490 (+0.48z)| norm 0.2608 (-0.99z)| lr 3.17e-04 | 322.07 ms | 52.4% bf16 MFU | 1625007 tok/s step 9782/19560 | loss 3.421106 (-0.01z)| norm 0.2614 (-0.96z)| lr 3.17e-04 | 322.74 ms | 52.3% bf16 MFU | 1624981 tok/s step 9783/19560 | loss 3.507353 (+2.00z)| norm 0.2726 (-0.31z)| lr 3.17e-04 | 322.58 ms | 52.3% bf16 MFU | 1624995 tok/s step 9784/19560 | loss 3.364250 (-1.36z)| norm 0.2653 (-0.73z)| lr 3.17e-04 | 322.93 ms | 52.3% bf16 MFU | 1624921 tok/s step 9785/19560 | loss 3.327962 (-2.16z)| norm 0.2644 
(-0.77z)| lr 3.17e-04 | 322.63 ms | 52.3% bf16 MFU | 1624926 tok/s step 9786/19560 | loss 3.459730 (+0.90z)| norm 0.2409 (-2.11z)| lr 3.17e-04 | 322.33 ms | 52.4% bf16 MFU | 1625008 tok/s step 9787/19560 | loss 3.374418 (-1.07z)| norm 0.2839 (+0.41z)| lr 3.17e-04 | 322.45 ms | 52.3% bf16 MFU | 1625056 tok/s step 9788/19560 | loss 3.355455 (-1.49z)| norm 0.2664 (-0.60z)| lr 3.17e-04 | 322.69 ms | 52.3% bf16 MFU | 1625041 tok/s step 9789/19560 | loss 3.385365 (-0.79z)| norm 0.2596 (-1.01z)| lr 3.17e-04 | 322.76 ms | 52.3% bf16 MFU | 1625009 tok/s step 9790/19560 | loss 3.398296 (-0.49z)| norm 0.2797 (+0.19z)| lr 3.17e-04 | 322.41 ms | 52.3% bf16 MFU | 1625066 tok/s step 9791/19560 | loss 3.419753 (+0.00z)| norm 0.2457 (-1.82z)| lr 3.17e-04 | 322.08 ms | 52.4% bf16 MFU | 1625204 tok/s step 9792/19560 | loss 3.447514 (+0.69z)| norm 0.2841 (+0.45z)| lr 3.17e-04 | 323.17 ms | 52.2% bf16 MFU | 1625059 tok/s step 9793/19560 | loss 3.483580 (+1.55z)| norm 0.2854 (+0.52z)| lr 3.17e-04 | 322.75 ms | 52.3% bf16 MFU | 1625028 tok/s step 9794/19560 | loss 3.324063 (-2.23z)| norm 0.2568 (-1.15z)| lr 3.17e-04 | 322.36 ms | 52.4% bf16 MFU | 1625097 tok/s step 9795/19560 | loss 3.409392 (-0.21z)| norm 0.2708 (-0.32z)| lr 3.17e-04 | 322.81 ms | 52.3% bf16 MFU | 1625048 tok/s step 9796/19560 | loss 3.352537 (-1.55z)| norm 0.2516 (-1.45z)| lr 3.17e-04 | 322.81 ms | 52.3% bf16 MFU | 1625004 tok/s step 9797/19560 | loss 3.460135 (+0.98z)| norm 0.2874 (+0.69z)| lr 3.17e-04 | 322.95 ms | 52.3% bf16 MFU | 1624925 tok/s step 9798/19560 | loss 3.373882 (-1.05z)| norm 0.2722 (-0.21z)| lr 3.17e-04 | 322.40 ms | 52.3% bf16 MFU | 1624990 tok/s step 9799/19560 | loss 3.382027 (-0.86z)| norm 0.2876 (+0.70z)| lr 3.17e-04 | 322.85 ms | 52.3% bf16 MFU | 1624937 tok/s step 9800/19560 | loss 3.345006 (-1.72z)| norm 0.2690 (-0.41z)| lr 3.17e-04 | 322.59 ms | 52.3% bf16 MFU | 1624952 tok/s step 9801/19560 | loss 3.398746 (-0.43z)| norm 0.2673 (-0.51z)| lr 3.16e-04 | 322.46 ms | 52.3% bf16 MFU | 1625000 
tok/s step 9802/19560 | loss 3.458173 (+1.01z)| norm 0.2531 (-1.36z)| lr 3.16e-04 | 322.39 ms | 52.3% bf16 MFU | 1625061 tok/s step 9803/19560 | loss 3.329202 (-2.06z)| norm 0.3017 (+1.54z)| lr 3.16e-04 | 322.47 ms | 52.3% bf16 MFU | 1625100 tok/s step 9804/19560 | loss 3.350571 (-1.52z)| norm 0.2669 (-0.54z)| lr 3.16e-04 | 322.61 ms | 52.3% bf16 MFU | 1625103 tok/s step 9805/19560 | loss 3.352553 (-1.47z)| norm 0.2939 (+1.06z)| lr 3.16e-04 | 322.72 ms | 52.3% bf16 MFU | 1625078 tok/s step 9806/19560 | loss 3.397637 (-0.41z)| norm 0.3597 (+4.53z)| lr 3.16e-04 | 322.71 ms | 52.3% bf16 MFU | 1625055 tok/s step 9807/19560 | loss 3.412928 (-0.06z)| norm 0.2681 (-0.49z)| lr 3.16e-04 | 322.82 ms | 52.3% bf16 MFU | 1625007 tok/s step 9808/19560 | loss 3.397870 (-0.41z)| norm 0.2621 (-0.83z)| lr 3.16e-04 | 322.59 ms | 52.3% bf16 MFU | 1625020 tok/s step 9809/19560 | loss 3.345418 (-1.61z)| norm 0.2963 (+1.05z)| lr 3.16e-04 | 322.55 ms | 52.3% bf16 MFU | 1625041 tok/s step 9810/19560 | loss 3.446320 (+0.73z)| norm 0.2492 (-1.55z)| lr 3.16e-04 | 322.44 ms | 52.3% bf16 MFU | 1625090 tok/s step 9811/19560 | loss 3.423201 (+0.19z)| norm 0.2731 (-0.23z)| lr 3.16e-04 | 322.57 ms | 52.3% bf16 MFU | 1625103 tok/s step 9812/19560 | loss 3.404634 (-0.25z)| norm 0.2710 (-0.35z)| lr 3.16e-04 | 322.69 ms | 52.3% bf16 MFU | 1625086 tok/s step 9813/19560 | loss 3.366769 (-1.12z)| norm 0.2806 (+0.18z)| lr 3.16e-04 | 322.16 ms | 52.4% bf16 MFU | 1625202 tok/s step 9814/19560 | loss 3.413543 (-0.02z)| norm 0.2824 (+0.29z)| lr 3.16e-04 | 323.07 ms | 52.2% bf16 MFU | 1625085 tok/s step 9815/19560 | loss 3.495379 (+1.84z)| norm 0.2867 (+0.52z)| lr 3.16e-04 | 322.83 ms | 52.3% bf16 MFU | 1625032 tok/s step 9816/19560 | loss 3.359944 (-1.27z)| norm 0.2649 (-0.70z)| lr 3.16e-04 | 322.75 ms | 52.3% bf16 MFU | 1625002 tok/s step 9817/19560 | loss 3.432825 (+0.41z)| norm 0.2738 (-0.20z)| lr 3.16e-04 | 322.54 ms | 52.3% bf16 MFU | 1625028 tok/s step 9818/19560 | loss 3.433241 (+0.41z)| norm 0.2657 
(-0.64z)| lr 3.16e-04 | 322.56 ms | 52.3% bf16 MFU | 1625045 tok/s step 9819/19560 | loss 3.345512 (-1.62z)| norm 0.2729 (-0.23z)| lr 3.16e-04 | 322.26 ms | 52.4% bf16 MFU | 1625138 tok/s step 9820/19560 | loss 3.421173 (+0.13z)| norm 0.2829 (+0.32z)| lr 3.16e-04 | 322.89 ms | 52.3% bf16 MFU | 1625069 tok/s step 9821/19560 | loss 3.380313 (-0.81z)| norm 0.2596 (-0.98z)| lr 3.15e-04 | 322.41 ms | 52.3% bf16 MFU | 1625124 tok/s step 9822/19560 | loss 3.376333 (-0.89z)| norm 0.3140 (+2.01z)| lr 3.15e-04 | 323.70 ms | 52.1% bf16 MFU | 1624851 tok/s step 9823/19560 | loss 3.414340 (-0.02z)| norm 0.2833 (+0.32z)| lr 3.15e-04 | 322.77 ms | 52.3% bf16 MFU | 1624826 tok/s step 9824/19560 | loss 3.360106 (-1.26z)| norm 0.2739 (-0.20z)| lr 3.15e-04 | 322.70 ms | 52.3% bf16 MFU | 1624819 tok/s step 9825/19560 | loss 3.405941 (-0.21z)| norm 0.2893 (+0.64z)| lr 3.15e-04 | 322.75 ms | 52.3% bf16 MFU | 1624801 tok/s step 9826/19560 | loss 3.448039 (+0.74z)| norm 0.2886 (+0.59z)| lr 3.15e-04 | 322.63 ms | 52.3% bf16 MFU | 1624812 tok/s step 9827/19560 | loss 3.367552 (-1.10z)| norm 0.2858 (+0.43z)| lr 3.15e-04 | 322.68 ms | 52.3% bf16 MFU | 1624811 tok/s step 9828/19560 | loss 3.430592 (+0.34z)| norm 0.2803 (+0.12z)| lr 3.15e-04 | 322.47 ms | 52.3% bf16 MFU | 1624864 tok/s step 9829/19560 | loss 3.426294 (+0.24z)| norm 0.2742 (-0.21z)| lr 3.15e-04 | 323.00 ms | 52.3% bf16 MFU | 1624779 tok/s step 9830/19560 | loss 3.473379 (+1.30z)| norm 0.3309 (+2.81z)| lr 3.15e-04 | 322.27 ms | 52.4% bf16 MFU | 1624882 tok/s step 9831/19560 | loss 3.409755 (-0.15z)| norm 0.3597 (+4.04z)| lr 3.15e-04 | 323.02 ms | 52.2% bf16 MFU | 1624793 tok/s step 9832/19560 | loss 3.417242 (+0.02z)| norm 0.3734 (+4.33z)| lr 3.15e-04 | 322.50 ms | 52.3% bf16 MFU | 1624837 tok/s step 9833/19560 | loss 3.373582 (-0.96z)| norm 0.3193 (+1.78z)| lr 3.15e-04 | 323.24 ms | 52.2% bf16 MFU | 1624695 tok/s step 9834/19560 | loss 3.379063 (-0.83z)| norm 0.2822 (+0.08z)| lr 3.15e-04 | 322.62 ms | 52.3% bf16 MFU | 1624715 
tok/s step 9835/19560 | loss 3.409308 (-0.14z)| norm 0.3126 (+1.45z)| lr 3.15e-04 | 323.14 ms | 52.2% bf16 MFU | 1624604 tok/s step 9836/19560 | loss 3.382964 (-0.73z)| norm 0.2880 (+0.33z)| lr 3.15e-04 | 323.17 ms | 52.2% bf16 MFU | 1624490 tok/s step 9837/19560 | loss 3.366533 (-1.09z)| norm 0.2927 (+0.55z)| lr 3.15e-04 | 322.54 ms | 52.3% bf16 MFU | 1624542 tok/s step 9838/19560 | loss 3.385661 (-0.65z)| norm 0.2881 (+0.33z)| lr 3.15e-04 | 322.77 ms | 52.3% bf16 MFU | 1624531 tok/s step 9839/19560 | loss 3.476488 (+1.39z)| norm 0.3076 (+1.20z)| lr 3.15e-04 | 322.69 ms | 52.3% bf16 MFU | 1624543 tok/s step 9840/19560 | loss 3.537253 (+2.69z)| norm 0.2756 (-0.27z)| lr 3.15e-04 | 323.03 ms | 52.2% bf16 MFU | 1624468 tok/s step 9841/19560 | loss 3.398174 (-0.37z)| norm 0.2958 (+0.66z)| lr 3.14e-04 | 323.44 ms | 52.2% bf16 MFU | 1624294 tok/s step 9842/19560 | loss 3.437152 (+0.55z)| norm 0.2833 (+0.08z)| lr 3.14e-04 | 322.82 ms | 52.3% bf16 MFU | 1624284 tok/s step 9843/19560 | loss 3.375061 (-0.92z)| norm 0.2824 (+0.04z)| lr 3.14e-04 | 322.37 ms | 52.4% bf16 MFU | 1624388 tok/s step 9844/19560 | loss 3.430334 (+0.40z)| norm 0.2814 (-0.00z)| lr 3.14e-04 | 322.91 ms | 52.3% bf16 MFU | 1624350 tok/s step 9845/19560 | loss 3.359562 (-1.27z)| norm 0.2661 (-0.70z)| lr 3.14e-04 | 322.98 ms | 52.3% bf16 MFU | 1624298 tok/s step 9846/19560 | loss 3.381593 (-0.74z)| norm 0.2854 (+0.19z)| lr 3.14e-04 | 322.44 ms | 52.3% bf16 MFU | 1624384 tok/s step 9847/19560 | loss 3.400727 (-0.28z)| norm 0.2915 (+0.47z)| lr 3.14e-04 | 322.73 ms | 52.3% bf16 MFU | 1624392 tok/s step 9848/19560 | loss 3.523227 (+2.52z)| norm 0.3050 (+1.07z)| lr 3.14e-04 | 322.76 ms | 52.3% bf16 MFU | 1624392 tok/s step 9849/19560 | loss 3.383839 (-0.68z)| norm 0.3067 (+1.15z)| lr 3.14e-04 | 322.98 ms | 52.3% bf16 MFU | 1624336 tok/s step 9850/19560 | loss 3.387843 (-0.58z)| norm 0.2874 (+0.25z)| lr 3.14e-04 | 322.81 ms | 52.3% bf16 MFU | 1624325 tok/s step 9851/19560 | loss 3.470280 (+1.30z)| norm 0.2920 
(+0.47z)| lr 3.14e-04 | 323.40 ms | 52.2% bf16 MFU | 1624167 tok/s step 9852/19560 | loss 3.400861 (-0.29z)| norm 0.2889 (+0.31z)| lr 3.14e-04 | 322.53 ms | 52.3% bf16 MFU | 1624236 tok/s step 9853/19560 | loss 3.404794 (-0.20z)| norm 0.2792 (-0.14z)| lr 3.14e-04 | 322.80 ms | 52.3% bf16 MFU | 1624233 tok/s step 9854/19560 | loss 3.419583 (+0.14z)| norm 0.2666 (-0.72z)| lr 3.14e-04 | 322.79 ms | 52.3% bf16 MFU | 1624233 tok/s step 9855/19560 | loss 3.490555 (+1.74z)| norm 0.2751 (-0.32z)| lr 3.14e-04 | 322.65 ms | 52.3% bf16 MFU | 1624269 tok/s step 9856/19560 | loss 3.356803 (-1.29z)| norm 0.2637 (-0.84z)| lr 3.14e-04 | 322.67 ms | 52.3% bf16 MFU | 1624297 tok/s step 9857/19560 | loss 3.428078 (+0.33z)| norm 0.2717 (-0.48z)| lr 3.14e-04 | 322.88 ms | 52.3% bf16 MFU | 1624272 tok/s step 9858/19560 | loss 3.387767 (-0.59z)| norm 0.2711 (-0.50z)| lr 3.14e-04 | 322.79 ms | 52.3% bf16 MFU | 1624269 tok/s step 9859/19560 | loss 3.425839 (+0.27z)| norm 0.2758 (-0.27z)| lr 3.14e-04 | 322.35 ms | 52.4% bf16 MFU | 1624379 tok/s step 9860/19560 | loss 3.419418 (+0.13z)| norm 0.2961 (+0.65z)| lr 3.14e-04 | 323.63 ms | 52.2% bf16 MFU | 1624163 tok/s step 9861/19560 | loss 3.358366 (-1.24z)| norm 0.2915 (+0.44z)| lr 3.13e-04 | 322.62 ms | 52.3% bf16 MFU | 1624208 tok/s step 9862/19560 | loss 3.374186 (-0.87z)| norm 0.2866 (+0.21z)| lr 3.13e-04 | 322.49 ms | 52.3% bf16 MFU | 1624286 tok/s step 9863/19560 | loss 3.423794 (+0.24z)| norm 0.3176 (+1.62z)| lr 3.13e-04 | 323.11 ms | 52.2% bf16 MFU | 1624203 tok/s step 9864/19560 | loss 3.425268 (+0.26z)| norm 0.3007 (+0.83z)| lr 3.13e-04 | 322.83 ms | 52.3% bf16 MFU | 1624196 tok/s step 9865/19560 | loss 3.444249 (+0.69z)| norm 0.2981 (+0.70z)| lr 3.13e-04 | 322.59 ms | 52.3% bf16 MFU | 1624248 tok/s step 9866/19560 | loss 3.360647 (-1.19z)| norm 0.3222 (+1.78z)| lr 3.13e-04 | 323.07 ms | 52.2% bf16 MFU | 1624176 tok/s step 9867/19560 | loss 3.356771 (-1.25z)| norm 0.2841 (+0.04z)| lr 3.13e-04 | 322.58 ms | 52.3% bf16 MFU | 1624233 
tok/s step 9868/19560 | loss 3.410315 (-0.03z)| norm 0.3019 (+0.85z)| lr 3.13e-04 | 322.76 ms | 52.3% bf16 MFU | 1624240 tok/s step 9869/19560 | loss 3.424704 (+0.29z)| norm 0.2682 (-0.69z)| lr 3.13e-04 | 323.66 ms | 52.1% bf16 MFU | 1624023 tok/s step 9870/19560 | loss 3.363992 (-1.08z)| norm 0.3106 (+1.22z)| lr 3.13e-04 | 322.57 ms | 52.3% bf16 MFU | 1624090 tok/s step 9871/19560 | loss 3.420181 (+0.21z)| norm 0.3168 (+1.50z)| lr 3.13e-04 | 322.75 ms | 52.3% bf16 MFU | 1624108 tok/s step 9872/19560 | loss 3.437461 (+0.61z)| norm 0.2778 (-0.26z)| lr 3.13e-04 | 322.79 ms | 52.3% bf16 MFU | 1624116 tok/s step 9873/19560 | loss 3.414689 (+0.08z)| norm 0.3314 (+2.11z)| lr 3.13e-04 | 322.60 ms | 52.3% bf16 MFU | 1624169 tok/s step 9874/19560 | loss 3.457574 (+1.05z)| norm 0.2795 (-0.20z)| lr 3.13e-04 | 323.06 ms | 52.2% bf16 MFU | 1624104 tok/s step 9875/19560 | loss 3.406516 (-0.11z)| norm 0.3188 (+1.54z)| lr 3.13e-04 | 322.58 ms | 52.3% bf16 MFU | 1624163 tok/s step 9876/19560 | loss 3.446620 (+0.82z)| norm 0.2873 (+0.15z)| lr 3.13e-04 | 322.65 ms | 52.3% bf16 MFU | 1624202 tok/s step 9877/19560 | loss 3.464963 (+1.22z)| norm 0.2875 (+0.16z)| lr 3.13e-04 | 323.26 ms | 52.2% bf16 MFU | 1624086 tok/s step 9878/19560 | loss 3.459225 (+1.08z)| norm 0.2959 (+0.54z)| lr 3.13e-04 | 322.34 ms | 52.4% bf16 MFU | 1624207 tok/s step 9879/19560 | loss 3.366185 (-1.03z)| norm 0.2875 (+0.17z)| lr 3.13e-04 | 322.88 ms | 52.3% bf16 MFU | 1624186 tok/s step 9880/19560 | loss 3.466543 (+1.27z)| norm 0.3043 (+0.94z)| lr 3.13e-04 | 323.06 ms | 52.2% bf16 MFU | 1624121 tok/s step 9881/19560 | loss 3.367527 (-0.99z)| norm 0.2625 (-0.94z)| lr 3.12e-04 | 322.81 ms | 52.3% bf16 MFU | 1624122 tok/s step 9882/19560 | loss 3.398580 (-0.28z)| norm 0.2801 (-0.12z)| lr 3.12e-04 | 323.43 ms | 52.2% bf16 MFU | 1623968 tok/s step 9883/19560 | loss 3.461655 (+1.15z)| norm 0.3160 (+1.49z)| lr 3.12e-04 | 322.58 ms | 52.3% bf16 MFU | 1624035 tok/s step 9884/19560 | loss 3.332651 (-1.75z)| norm 0.2793 
(-0.17z)| lr 3.12e-04 | 322.72 ms | 52.3% bf16 MFU | 1624062 tok/s step 9885/19560 | loss 3.410095 (-0.02z)| norm 0.2909 (+0.36z)| lr 3.12e-04 | 322.98 ms | 52.3% bf16 MFU | 1624022 tok/s step 9886/19560 | loss 3.347725 (-1.42z)| norm 0.2864 (+0.15z)| lr 3.12e-04 | 322.91 ms | 52.3% bf16 MFU | 1624003 tok/s step 9887/19560 | loss 3.371966 (-0.86z)| norm 0.2729 (-0.47z)| lr 3.12e-04 | 323.60 ms | 52.2% bf16 MFU | 1623811 tok/s step 9888/19560 | loss 3.391099 (-0.43z)| norm 0.2780 (-0.24z)| lr 3.12e-04 | 323.11 ms | 52.2% bf16 MFU | 1623753 tok/s step 9889/19560 | loss 3.347435 (-1.39z)| norm 0.2849 (+0.07z)| lr 3.12e-04 | 323.30 ms | 52.2% bf16 MFU | 1623649 tok/s step 9890/19560 | loss 3.417593 (+0.17z)| norm 0.3067 (+1.05z)| lr 3.12e-04 | 322.63 ms | 52.3% bf16 MFU | 1623718 tok/s step 9891/19560 | loss 3.331573 (-1.72z)| norm 0.3136 (+1.34z)| lr 3.12e-04 | 322.79 ms | 52.3% bf16 MFU | 1623743 tok/s step 9892/19560 | loss 3.397535 (-0.25z)| norm 0.3167 (+1.46z)| lr 3.12e-04 | 323.14 ms | 52.2% bf16 MFU | 1623680 tok/s step 9893/19560 | loss 3.411121 (+0.05z)| norm 0.3018 (+0.79z)| lr 3.12e-04 | 323.06 ms | 52.2% bf16 MFU | 1623640 tok/s step 9894/19560 | loss 3.392609 (-0.36z)| norm 0.2778 (-0.29z)| lr 3.12e-04 | 322.79 ms | 52.3% bf16 MFU | 1623671 tok/s step 9895/19560 | loss 3.437742 (+0.63z)| norm 0.2873 (+0.13z)| lr 3.12e-04 | 322.97 ms | 52.3% bf16 MFU | 1623655 tok/s step 9896/19560 | loss 3.362039 (-1.04z)| norm 0.2651 (-0.88z)| lr 3.12e-04 | 322.55 ms | 52.3% bf16 MFU | 1623745 tok/s step 9897/19560 | loss 3.358058 (-1.11z)| norm 0.2921 (+0.33z)| lr 3.12e-04 | 322.26 ms | 52.4% bf16 MFU | 1623904 tok/s step 9898/19560 | loss 3.398170 (-0.21z)| norm 0.2618 (-1.04z)| lr 3.12e-04 | 322.95 ms | 52.3% bf16 MFU | 1623879 tok/s step 9899/19560 | loss 3.467883 (+1.34z)| norm 0.2917 (+0.31z)| lr 3.12e-04 | 322.35 ms | 52.4% bf16 MFU | 1624007 tok/s step 9900/19560 | loss 3.362881 (-1.00z)| norm 0.2575 (-1.23z)| lr 3.12e-04 | 322.95 ms | 52.3% bf16 MFU | 1623979 
tok/s step 9901/19560 | loss 3.380918 (-0.58z)| norm 0.3085 (+1.07z)| lr 3.11e-04 | 322.90 ms | 52.3% bf16 MFU | 1623964 tok/s step 9902/19560 | loss 3.366138 (-0.91z)| norm 0.2588 (-1.16z)| lr 3.11e-04 | 322.24 ms | 52.4% bf16 MFU | 1624116 tok/s step 9903/19560 | loss 3.408998 (+0.08z)| norm 0.2855 (+0.04z)| lr 3.11e-04 | 323.30 ms | 52.2% bf16 MFU | 1623994 tok/s step 9904/19560 | loss 3.390144 (-0.35z)| norm 0.2631 (-0.97z)| lr 3.11e-04 | 322.24 ms | 52.4% bf16 MFU | 1624144 tok/s step 9905/19560 | loss 3.416725 (+0.27z)| norm 0.2815 (-0.15z)| lr 3.11e-04 | 322.62 ms | 52.3% bf16 MFU | 1624192 tok/s step 9906/19560 | loss 3.304705 (-2.26z)| norm 0.2643 (-0.93z)| lr 3.11e-04 | 323.17 ms | 52.2% bf16 MFU | 1624100 tok/s step 9907/19560 | loss 3.365031 (-0.88z)| norm 0.2573 (-1.25z)| lr 3.11e-04 | 322.33 ms | 52.4% bf16 MFU | 1624222 tok/s step 9908/19560 | loss 3.492005 (+2.00z)| norm 0.2702 (-0.67z)| lr 3.11e-04 | 322.21 ms | 52.4% bf16 MFU | 1624368 tok/s step 9909/19560 | loss 3.434887 (+0.70z)| norm 0.2847 (-0.02z)| lr 3.11e-04 | 322.59 ms | 52.3% bf16 MFU | 1624410 tok/s step 9910/19560 | loss 3.339010 (-1.44z)| norm 0.3032 (+0.82z)| lr 3.11e-04 | 323.20 ms | 52.2% bf16 MFU | 1624298 tok/s step 9911/19560 | loss 3.387230 (-0.34z)| norm 0.2893 (+0.17z)| lr 3.11e-04 | 322.48 ms | 52.3% bf16 MFU | 1624374 tok/s step 9912/19560 | loss 3.375338 (-0.62z)| norm 0.2883 (+0.12z)| lr 3.11e-04 | 322.40 ms | 52.3% bf16 MFU | 1624465 tok/s step 9913/19560 | loss 3.398224 (-0.11z)| norm 0.2787 (-0.33z)| lr 3.11e-04 | 322.43 ms | 52.3% bf16 MFU | 1624544 tok/s step 9914/19560 | loss 3.434472 (+0.74z)| norm 0.3026 (+0.76z)| lr 3.11e-04 | 323.15 ms | 52.2% bf16 MFU | 1624438 tok/s step 9915/19560 | loss 3.429431 (+0.62z)| norm 0.3264 (+1.84z)| lr 3.11e-04 | 322.29 ms | 52.4% bf16 MFU | 1624554 tok/s step 9916/19560 | loss 3.381327 (-0.52z)| norm 0.2973 (+0.49z)| lr 3.11e-04 | 322.19 ms | 52.4% bf16 MFU | 1624690 tok/s step 9917/19560 | loss 3.404878 (+0.03z)| norm 0.3162 
(+1.34z)| lr 3.11e-04 | 322.63 ms | 52.3% bf16 MFU | 1624707 tok/s step 9918/19560 | loss 3.430924 (+0.64z)| norm 0.3068 (+0.89z)| lr 3.11e-04 | 322.70 ms | 52.3% bf16 MFU | 1624707 tok/s step 9919/19560 | loss 3.493377 (+2.06z)| norm 0.3074 (+0.91z)| lr 3.11e-04 | 322.80 ms | 52.3% bf16 MFU | 1624682 tok/s step 9920/19560 | loss 3.354851 (-1.12z)| norm 0.3100 (+1.02z)| lr 3.11e-04 | 322.36 ms | 52.4% bf16 MFU | 1624768 tok/s step 9921/19560 | loss 3.478767 (+1.74z)| norm 0.3436 (+2.50z)| lr 3.10e-04 | 322.75 ms | 52.3% bf16 MFU | 1624752 tok/s step 9922/19560 | loss 3.416354 (+0.28z)| norm 0.3227 (+1.52z)| lr 3.10e-04 | 323.25 ms | 52.2% bf16 MFU | 1624611 tok/s step 9923/19560 | loss 3.400590 (-0.08z)| norm 0.3232 (+1.52z)| lr 3.10e-04 | 322.57 ms | 52.3% bf16 MFU | 1624648 tok/s step 9924/19560 | loss 3.392999 (-0.27z)| norm 0.2985 (+0.39z)| lr 3.10e-04 | 322.25 ms | 52.4% bf16 MFU | 1624763 tok/s step 9925/19560 | loss 3.424168 (+0.47z)| norm 0.2960 (+0.27z)| lr 3.10e-04 | 322.24 ms | 52.4% bf16 MFU | 1624876 tok/s step 9926/19560 | loss 3.442568 (+0.89z)| norm 0.3035 (+0.60z)| lr 3.10e-04 | 322.91 ms | 52.3% bf16 MFU | 1624813 tok/s step 9927/19560 | loss 3.415967 (+0.26z)| norm 0.2726 (-0.80z)| lr 3.10e-04 | 322.60 ms | 52.3% bf16 MFU | 1624831 tok/s step 9928/19560 | loss 3.379057 (-0.62z)| norm 0.2876 (-0.12z)| lr 3.10e-04 | 322.60 ms | 52.3% bf16 MFU | 1624849 tok/s step 9929/19560 | loss 3.404902 (-0.01z)| norm 0.2826 (-0.36z)| lr 3.10e-04 | 322.36 ms | 52.4% bf16 MFU | 1624927 tok/s step 9930/19560 | loss 3.409867 (+0.12z)| norm 0.2644 (-1.21z)| lr 3.10e-04 | 322.24 ms | 52.4% bf16 MFU | 1625032 tok/s step 9931/19560 | loss 3.400942 (-0.11z)| norm 0.2712 (-0.88z)| lr 3.10e-04 | 323.12 ms | 52.2% bf16 MFU | 1624909 tok/s step 9932/19560 | loss 3.366341 (-0.96z)| norm 0.2605 (-1.37z)| lr 3.10e-04 | 322.93 ms | 52.3% bf16 MFU | 1624839 tok/s step 9933/19560 | loss 3.415756 (+0.24z)| norm 0.2722 (-0.82z)| lr 3.10e-04 | 322.39 ms | 52.4% bf16 MFU | 1624910 
tok/s step 9934/19560 | loss 3.416163 (+0.24z)| norm 0.2531 (-1.70z)| lr 3.10e-04 | 322.41 ms | 52.3% bf16 MFU | 1624973 tok/s step 9935/19560 | loss 3.512588 (+2.52z)| norm 0.2860 (-0.16z)| lr 3.10e-04 | 323.17 ms | 52.2% bf16 MFU | 1624841 tok/s step 9936/19560 | loss 3.390676 (-0.39z)| norm 0.2707 (-0.89z)| lr 3.10e-04 | 322.78 ms | 52.3% bf16 MFU | 1624814 tok/s step 9937/19560 | loss 3.428324 (+0.50z)| norm 0.2524 (-1.73z)| lr 3.10e-04 | 322.81 ms | 52.3% bf16 MFU | 1624780 tok/s step 9938/19560 | loss 3.396899 (-0.25z)| norm 0.2626 (-1.26z)| lr 3.10e-04 | 322.38 ms | 52.4% bf16 MFU | 1624856 tok/s step 9939/19560 | loss 3.405452 (-0.04z)| norm 0.2511 (-1.78z)| lr 3.10e-04 | 322.68 ms | 52.3% bf16 MFU | 1624852 tok/s step 9940/19560 | loss 3.362383 (-1.07z)| norm 0.2704 (-0.87z)| lr 3.10e-04 | 322.53 ms | 52.3% bf16 MFU | 1624887 tok/s step 9941/19560 | loss 3.406262 (-0.02z)| norm 0.2684 (-0.96z)| lr 3.09e-04 | 322.82 ms | 52.3% bf16 MFU | 1624846 tok/s step 9942/19560 | loss 3.401573 (-0.13z)| norm 0.3061 (+0.80z)| lr 3.09e-04 | 322.12 ms | 52.4% bf16 MFU | 1624984 tok/s step 9943/19560 | loss 3.523767 (+2.78z)| norm 0.2693 (-0.91z)| lr 3.09e-04 | 322.46 ms | 52.3% bf16 MFU | 1625030 tok/s step 9944/19560 | loss 3.401977 (-0.13z)| norm 0.3169 (+1.28z)| lr 3.09e-04 | 322.27 ms | 52.4% bf16 MFU | 1625121 tok/s step 9945/19560 | loss 3.437684 (+0.72z)| norm 0.2748 (-0.68z)| lr 3.09e-04 | 322.59 ms | 52.3% bf16 MFU | 1625127 tok/s step 9946/19560 | loss 3.424180 (+0.40z)| norm 0.2847 (-0.22z)| lr 3.09e-04 | 322.29 ms | 52.4% bf16 MFU | 1625208 tok/s step 9947/19560 | loss 3.398819 (-0.22z)| norm 0.2908 (+0.05z)| lr 3.09e-04 | 322.45 ms | 52.3% bf16 MFU | 1625246 tok/s step 9948/19560 | loss 3.372057 (-0.86z)| norm 0.2727 (-0.79z)| lr 3.09e-04 | 322.41 ms | 52.3% bf16 MFU | 1625291 tok/s step 9949/19560 | loss 3.420967 (+0.32z)| norm 0.2781 (-0.55z)| lr 3.09e-04 | 322.66 ms | 52.3% bf16 MFU | 1625271 tok/s step 9950/19560 | loss 3.407738 (-0.01z)| norm 0.3113 
(+1.02z)| lr 3.09e-04 | 323.12 ms | 52.2% bf16 MFU | 1625136 tok/s step 9951/19560 | loss 3.344256 (-1.52z)| norm 0.2911 (+0.06z)| lr 3.09e-04 | 322.45 ms | 52.3% bf16 MFU | 1625176 tok/s step 9952/19560 | loss 3.418759 (+0.26z)| norm 0.2801 (-0.46z)| lr 3.09e-04 | 322.57 ms | 52.3% bf16 MFU | 1625184 tok/s step 9953/19560 | loss 3.408945 (+0.02z)| norm 0.2946 (+0.22z)| lr 3.09e-04 | 322.34 ms | 52.4% bf16 MFU | 1625249 tok/s step 9954/19560 | loss 3.379551 (-0.67z)| norm 0.2709 (-0.89z)| lr 3.09e-04 | 322.66 ms | 52.3% bf16 MFU | 1625231 tok/s step 9955/19560 | loss 3.356708 (-1.22z)| norm 0.9020 (+10.48z)| lr 3.09e-04 | 322.87 ms | 52.3% bf16 MFU | 1625161 tok/s step 9956/19560 | loss 3.392318 (-0.36z)| norm 0.3005 (+0.10z)| lr 3.09e-04 | 322.74 ms | 52.3% bf16 MFU | 1625127 tok/s step 9957/19560 | loss 3.414131 (+0.17z)| norm 0.2877 (-0.12z)| lr 3.09e-04 | 322.16 ms | 52.4% bf16 MFU | 1625240 tok/s step 9958/19560 | loss 3.365431 (-0.99z)| norm 0.2711 (-0.40z)| lr 3.09e-04 | 322.64 ms | 52.3% bf16 MFU | 1625227 tok/s step 9959/19560 | loss 3.393036 (-0.32z)| norm 0.2807 (-0.23z)| lr 3.09e-04 | 322.50 ms | 52.3% bf16 MFU | 1625250 tok/s step 9960/19560 | loss 3.536603 (+3.03z)| norm 0.2787 (-0.25z)| lr 3.09e-04 | 322.30 ms | 52.4% bf16 MFU | 1625323 tok/s step 9961/19560 | loss 3.421930 (+0.34z)| norm 0.2657 (-0.47z)| lr 3.08e-04 | 322.98 ms | 52.3% bf16 MFU | 1625221 tok/s step 9962/19560 | loss 3.382137 (-0.59z)| norm 0.2780 (-0.25z)| lr 3.08e-04 | 322.48 ms | 52.3% bf16 MFU | 1625251 tok/s step 9963/19560 | loss 3.396649 (-0.25z)| norm 0.2628 (-0.51z)| lr 3.08e-04 | 322.62 ms | 52.3% bf16 MFU | 1625244 tok/s step 9964/19560 | loss 3.333853 (-1.69z)| norm 0.2763 (-0.28z)| lr 3.08e-04 | 322.80 ms | 52.3% bf16 MFU | 1625191 tok/s step 9965/19560 | loss 3.375028 (-0.74z)| norm 0.2676 (-0.42z)| lr 3.08e-04 | 322.48 ms | 52.3% bf16 MFU | 1625220 tok/s step 9966/19560 | loss 3.397249 (-0.23z)| norm 0.2780 (-0.24z)| lr 3.08e-04 | 322.74 ms | 52.3% bf16 MFU | 1625183 
tok/s step 9967/19560 | loss 3.418505 (+0.28z)| norm 0.2784 (-0.23z)| lr 3.08e-04 | 322.51 ms | 52.3% bf16 MFU | 1625205 tok/s step 9968/19560 | loss 3.435609 (+0.73z)| norm 0.2733 (-0.32z)| lr 3.08e-04 | 322.24 ms | 52.4% bf16 MFU | 1625295 tok/s step 9969/19560 | loss 3.352612 (-1.28z)| norm 0.3023 (+0.19z)| lr 3.08e-04 | 322.48 ms | 52.3% bf16 MFU | 1625319 tok/s step 9970/19560 | loss 3.352092 (-1.27z)| norm 0.2885 (-0.05z)| lr 3.08e-04 | 322.80 ms | 52.3% bf16 MFU | 1625263 tok/s step 9971/19560 | loss 3.360571 (-1.06z)| norm 0.2886 (-0.05z)| lr 3.08e-04 | 322.51 ms | 52.3% bf16 MFU | 1625283 tok/s step 9972/19560 | loss 3.437040 (+0.78z)| norm 0.2819 (-0.17z)| lr 3.08e-04 | 322.18 ms | 52.4% bf16 MFU | 1625385 tok/s step 9973/19560 | loss 3.343439 (-1.46z)| norm 0.2579 (-0.59z)| lr 3.08e-04 | 322.83 ms | 52.3% bf16 MFU | 1625318 tok/s step 9974/19560 | loss 3.402462 (-0.06z)| norm 0.2903 (-0.02z)| lr 3.08e-04 | 322.48 ms | 52.3% bf16 MFU | 1625341 tok/s step 9975/19560 | loss 3.423048 (+0.43z)| norm 0.2708 (-0.36z)| lr 3.08e-04 | 322.37 ms | 52.4% bf16 MFU | 1625392 tok/s step 9976/19560 | loss 3.428644 (+0.60z)| norm 0.2742 (-0.30z)| lr 3.08e-04 | 322.41 ms | 52.3% bf16 MFU | 1625431 tok/s step 9977/19560 | loss 3.581136 (+4.04z)| norm 0.3130 (+0.38z)| lr 3.08e-04 | 322.74 ms | 52.3% bf16 MFU | 1625385 tok/s step 9978/19560 | loss 3.362245 (-1.00z)| norm 0.2759 (-0.27z)| lr 3.08e-04 | 322.42 ms | 52.3% bf16 MFU | 1625422 tok/s step 9979/19560 | loss 3.403642 (-0.03z)| norm 0.3013 (+0.17z)| lr 3.08e-04 | 322.32 ms | 52.4% bf16 MFU | 1625482 tok/s step 9980/19560 | loss 3.415594 (+0.24z)| norm 0.2702 (-0.36z)| lr 3.08e-04 | 322.55 ms | 52.3% bf16 MFU | 1625480 tok/s step 9981/19560 | loss 3.425350 (+0.46z)| norm 0.2484 (-0.74z)| lr 3.07e-04 | 321.95 ms | 52.4% bf16 MFU | 1625629 tok/s step 9982/19560 | loss 3.392904 (-0.28z)| norm 0.2643 (-0.46z)| lr 3.07e-04 | 322.54 ms | 52.3% bf16 MFU | 1625622 tok/s step 9983/19560 | loss 3.385045 (-0.45z)| norm 0.2469 
(-0.76z)| lr 3.07e-04 | 322.68 ms | 52.3% bf16 MFU | 1625580 tok/s step 9984/19560 | loss 3.629923 (+4.78z)| norm 0.2690 (-0.38z)| lr 3.07e-04 | 322.50 ms | 52.3% bf16 MFU | 1625587 tok/s step 9985/19560 | loss 3.426475 (+0.43z)| norm 0.2825 (-0.14z)| lr 3.07e-04 | 322.96 ms | 52.3% bf16 MFU | 1625478 tok/s step 9986/19560 | loss 3.410400 (+0.08z)| norm 0.2780 (-0.22z)| lr 3.07e-04 | 322.60 ms | 52.3% bf16 MFU | 1625465 tok/s step 9987/19560 | loss 3.374501 (-0.68z)| norm 0.2708 (-0.35z)| lr 3.07e-04 | 322.10 ms | 52.4% bf16 MFU | 1625578 tok/s step 9988/19560 | loss 3.425613 (+0.41z)| norm 0.2708 (-0.34z)| lr 3.07e-04 | 322.45 ms | 52.3% bf16 MFU | 1625597 tok/s step 9989/19560 | loss 3.423953 (+0.37z)| norm 0.2903 (-0.00z)| lr 3.07e-04 | 322.63 ms | 52.3% bf16 MFU | 1625569 tok/s step 9990/19560 | loss 3.391456 (-0.33z)| norm 0.2650 (-0.44z)| lr 3.07e-04 | 322.99 ms | 52.3% bf16 MFU | 1625451 tok/s step 9991/19560 | loss 3.378156 (-0.61z)| norm 0.2902 (+0.00z)| lr 3.07e-04 | 322.40 ms | 52.3% bf16 MFU | 1625489 tok/s step 9992/19560 | loss 3.381907 (-0.52z)| norm 0.2736 (-0.28z)| lr 3.07e-04 | 322.27 ms | 52.4% bf16 MFU | 1625558 tok/s step 9993/19560 | loss 3.437747 (+0.68z)| norm 0.2677 (-0.38z)| lr 3.07e-04 | 322.51 ms | 52.3% bf16 MFU | 1625562 tok/s step 9994/19560 | loss 3.405512 (-0.02z)| norm 0.2808 (-0.15z)| lr 3.07e-04 | 322.58 ms | 52.3% bf16 MFU | 1625548 tok/s step 9995/19560 | loss 3.379626 (-0.58z)| norm 0.2708 (-0.32z)| lr 3.07e-04 | 322.38 ms | 52.4% bf16 MFU | 1625585 tok/s step 9996/19560 | loss 3.405016 (-0.03z)| norm 0.2650 (-0.42z)| lr 3.07e-04 | 322.63 ms | 52.3% bf16 MFU | 1625558 tok/s step 9997/19560 | loss 3.406706 (+0.01z)| norm 0.2766 (-0.22z)| lr 3.07e-04 | 322.38 ms | 52.4% bf16 MFU | 1625595 tok/s step 9998/19560 | loss 3.384115 (-0.49z)| norm 0.2680 (-0.36z)| lr 3.07e-04 | 322.96 ms | 52.3% bf16 MFU | 1625486 tok/s step 9999/19560 | loss 3.374701 (-0.68z)| norm 0.3041 (+0.27z)| lr 3.07e-04 | 322.52 ms | 52.3% bf16 MFU | 1625492 
tok/s step 10000/19560 | loss 3.419989 (+0.30z)| norm 0.2602 (-0.49z)| lr 3.07e-04 | 322.77 ms | 52.3% bf16 MFU | 1625436 tok/s val loss 3.396394 Writing state to log124M/state_00010000_00007.bin Writing state to log124M/state_00010000_00004.bin HellaSwag: 2880/10042 = 0.286795 Writing checkpoint at step 10000 Writing model to log124M/model_00010000.bin Writing state to log124M/state_00010000_00003.bin Writing state to log124M/state_00010000_00005.bin Writing state to log124M/state_00010000_00002.bin Writing state to log124M/state_00010000_00001.bin Writing state to log124M/state_00010000_00006.bin Writing state to log124M/state_00010000_00000.bin step 10001/19560 | loss 3.387290 (-0.40z)| norm 0.3112 (+0.40z)| lr 3.06e-04 | 318.39 ms | 53.0% bf16 MFU | 1626498 tok/s step 10002/19560 | loss 3.454981 (+1.06z)| norm 0.3007 (+0.21z)| lr 3.06e-04 | 321.28 ms | 52.5% bf16 MFU | 1626766 tok/s step 10003/19560 | loss 3.372087 (-0.73z)| norm 0.2700 (-0.32z)| lr 3.06e-04 | 323.64 ms | 52.1% bf16 MFU | 1626425 tok/s step 10004/19560 | loss 3.443544 (+0.82z)| norm 0.3114 (+0.40z)| lr 3.06e-04 | 322.44 ms | 52.3% bf16 MFU | 1626405 tok/s step 10005/19560 | loss 3.457691 (+1.13z)| norm 0.2748 (-0.23z)| lr 3.06e-04 | 322.32 ms 
| 52.4% bf16 MFU | 1626416 tok/s step 10006/19560 | loss 3.385425 (-0.43z)| norm 0.2844 (-0.07z)| lr 3.06e-04 | 323.77 ms | 52.1% bf16 MFU | 1626062 tok/s step 10007/19560 | loss 3.411654 (+0.14z)| norm 0.2965 (+0.14z)| lr 3.06e-04 | 322.46 ms | 52.3% bf16 MFU | 1626053 tok/s step 10008/19560 | loss 3.370457 (-0.75z)| norm 0.3042 (+0.28z)| lr 3.06e-04 | 322.65 ms | 52.3% bf16 MFU | 1625998 tok/s step 10009/19560 | loss 3.442785 (+0.83z)| norm 0.2777 (-0.19z)| lr 3.06e-04 | 322.50 ms | 52.3% bf16 MFU | 1625982 tok/s step 10010/19560 | loss 3.423753 (+0.40z)| norm 0.2865 (-0.03z)| lr 3.06e-04 | 322.70 ms | 52.3% bf16 MFU | 1625917 tok/s step 10011/19560 | loss 3.422318 (+0.38z)| norm 0.2755 (-0.22z)| lr 3.06e-04 | 322.45 ms | 52.3% bf16 MFU | 1625918 tok/s step 10012/19560 | loss 3.437294 (+0.70z)| norm 0.2946 (+0.11z)| lr 3.06e-04 | 322.81 ms | 52.3% bf16 MFU | 1625830 tok/s step 10013/19560 | loss 3.367734 (-0.84z)| norm 0.2624 (-0.44z)| lr 3.06e-04 | 323.10 ms | 52.2% bf16 MFU | 1625673 tok/s step 10014/19560 | loss 3.418018 (+0.27z)| norm 0.2853 (-0.05z)| lr 3.06e-04 | 322.65 ms | 52.3% bf16 MFU | 1625636 tok/s step 10015/19560 | loss 3.367171 (-0.87z)| norm 0.2496 (-0.66z)| lr 3.06e-04 | 323.19 ms | 52.2% bf16 MFU | 1625466 tok/s step 10016/19560 | loss 3.444044 (+0.84z)| norm 0.2672 (-0.36z)| lr 3.06e-04 | 323.16 ms | 52.2% bf16 MFU | 1625310 tok/s step 10017/19560 | loss 3.392286 (-0.33z)| norm 0.3000 (+0.21z)| lr 3.06e-04 | 322.81 ms | 52.3% bf16 MFU | 1625251 tok/s step 10018/19560 | loss 3.422940 (+0.36z)| norm 0.2612 (-0.45z)| lr 3.06e-04 | 322.69 ms | 52.3% bf16 MFU | 1625224 tok/s step 10019/19560 | loss 3.440383 (+0.74z)| norm 0.2929 (+0.10z)| lr 3.06e-04 | 323.28 ms | 52.2% bf16 MFU | 1625051 tok/s step 10020/19560 | loss 3.414966 (+0.16z)| norm 0.2794 (-0.13z)| lr 3.06e-04 | 322.93 ms | 52.3% bf16 MFU | 1624975 tok/s step 10021/19560 | loss 3.444747 (+0.83z)| norm 0.2745 (-0.21z)| lr 3.05e-04 | 322.98 ms | 52.3% bf16 MFU | 1624890 tok/s step 
10022/19560 | loss 3.446858 (+0.86z)| norm 0.2697 (-0.30z)| lr 3.05e-04 | 323.05 ms | 52.2% bf16 MFU | 1624792 tok/s step 10023/19560 | loss 3.587350 (+3.77z)| norm 0.2952 (+0.15z)| lr 3.05e-04 | 323.10 ms | 52.2% bf16 MFU | 1624687 tok/s step 10024/19560 | loss 3.371219 (-0.82z)| norm 0.3197 (+0.56z)| lr 3.05e-04 | 323.36 ms | 52.2% bf16 MFU | 1624521 tok/s step 10025/19560 | loss 3.383323 (-0.57z)| norm 0.2536 (-0.58z)| lr 3.05e-04 | 322.79 ms | 52.3% bf16 MFU | 1624508 tok/s step 10026/19560 | loss 3.408937 (-0.02z)| norm 0.2836 (-0.06z)| lr 3.05e-04 | 322.90 ms | 52.3% bf16 MFU | 1624466 tok/s step 10027/19560 | loss 3.450514 (+0.87z)| norm 0.2717 (-0.26z)| lr 3.05e-04 | 323.08 ms | 52.2% bf16 MFU | 1624382 tok/s step 10028/19560 | loss 3.521168 (+2.31z)| norm 0.2735 (-0.23z)| lr 3.05e-04 | 323.87 ms | 52.1% bf16 MFU | 1624105 tok/s step 10029/19560 | loss 3.420774 (+0.20z)| norm 0.2532 (-0.58z)| lr 3.05e-04 | 323.30 ms | 52.2% bf16 MFU | 1623985 tok/s step 10030/19560 | loss 3.370452 (-0.86z)| norm 0.2551 (-0.55z)| lr 3.05e-04 | 322.61 ms | 52.3% bf16 MFU | 1624042 tok/s step 10031/19560 | loss 3.416344 (+0.10z)| norm 0.2555 (-0.54z)| lr 3.05e-04 | 322.84 ms | 52.3% bf16 MFU | 1624040 tok/s step 10032/19560 | loss 3.456600 (+0.94z)| norm 0.2553 (-0.54z)| lr 3.05e-04 | 322.91 ms | 52.3% bf16 MFU | 1624021 tok/s step 10033/19560 | loss 3.351793 (-1.25z)| norm 0.2539 (-0.56z)| lr 3.05e-04 | 323.07 ms | 52.2% bf16 MFU | 1623962 tok/s step 10034/19560 | loss 3.369354 (-0.91z)| norm 0.2448 (-0.71z)| lr 3.05e-04 | 323.20 ms | 52.2% bf16 MFU | 1623873 tok/s step 10035/19560 | loss 3.373998 (-0.81z)| norm 0.2538 (-0.55z)| lr 3.05e-04 | 323.14 ms | 52.2% bf16 MFU | 1623804 tok/s step 10036/19560 | loss 3.396340 (-0.32z)| norm 0.2614 (-0.42z)| lr 3.05e-04 | 322.91 ms | 52.3% bf16 MFU | 1623796 tok/s step 10037/19560 | loss 3.411421 (+0.00z)| norm 0.2583 (-0.47z)| lr 3.05e-04 | 322.95 ms | 52.3% bf16 MFU | 1623778 tok/s step 10038/19560 | loss 3.410031 (-0.04z)| norm 
0.2619 (-0.40z)| lr 3.05e-04 | 322.90 ms | 52.3% bf16 MFU | 1623775 tok/s step 10039/19560 | loss 3.385334 (-0.57z)| norm 0.2706 (-0.25z)| lr 3.05e-04 | 322.87 ms | 52.3% bf16 MFU | 1623777 tok/s step 10040/19560 | loss 3.399068 (-0.28z)| norm 0.2881 (+0.05z)| lr 3.05e-04 | 323.15 ms | 52.2% bf16 MFU | 1623709 tok/s step 10041/19560 | loss 3.414889 (+0.06z)| norm 0.2555 (-0.51z)| lr 3.04e-04 | 323.47 ms | 52.2% bf16 MFU | 1623565 tok/s step 10042/19560 | loss 3.387003 (-0.54z)| norm 0.2735 (-0.19z)| lr 3.04e-04 | 322.45 ms | 52.3% bf16 MFU | 1623686 tok/s step 10043/19560 | loss 3.410193 (-0.03z)| norm 0.2637 (-0.35z)| lr 3.04e-04 | 323.33 ms | 52.2% bf16 MFU | 1623577 tok/s step 10044/19560 | loss 3.460828 (+1.06z)| norm 0.3017 (+0.30z)| lr 3.04e-04 | 322.32 ms | 52.4% bf16 MFU | 1623729 tok/s step 10045/19560 | loss 3.390901 (-0.46z)| norm 0.3011 (+0.29z)| lr 3.04e-04 | 324.21 ms | 52.1% bf16 MFU | 1623399 tok/s step 10046/19560 | loss 3.428874 (+0.37z)| norm 0.2507 (-0.57z)| lr 3.04e-04 | 322.72 ms | 52.3% bf16 MFU | 1623458 tok/s step 10047/19560 | loss 3.383642 (-0.60z)| norm 0.2968 (+0.23z)| lr 3.04e-04 | 322.79 ms | 52.3% bf16 MFU | 1623496 tok/s step 10048/19560 | loss 3.467684 (+1.22z)| norm 0.2901 (+0.11z)| lr 3.04e-04 | 323.78 ms | 52.1% bf16 MFU | 1623286 tok/s step 10049/19560 | loss 3.366791 (-0.98z)| norm 0.2752 (-0.13z)| lr 3.04e-04 | 322.40 ms | 52.3% bf16 MFU | 1623432 tok/s step 10050/19560 | loss 3.421862 (+0.23z)| norm 0.2925 (+0.17z)| lr 3.04e-04 | 322.67 ms | 52.3% bf16 MFU | 1623503 tok/s step 10051/19560 | loss 3.395822 (-0.34z)| norm 0.2868 (+0.08z)| lr 3.04e-04 | 323.31 ms | 52.2% bf16 MFU | 1623409 tok/s step 10052/19560 | loss 3.397180 (-0.31z)| norm 0.2727 (-0.17z)| lr 3.04e-04 | 322.80 ms | 52.3% bf16 MFU | 1623447 tok/s step 10053/19560 | loss 3.426479 (+0.34z)| norm 0.2773 (-0.08z)| lr 3.04e-04 | 323.26 ms | 52.2% bf16 MFU | 1623369 tok/s step 10054/19560 | loss 3.421022 (+0.22z)| norm 0.2964 (+0.25z)| lr 3.04e-04 | 322.49 ms | 
52.3% bf16 MFU | 1623489 tok/s step 10055/19560 | loss 3.362828 (-1.05z)| norm 0.2710 (-0.19z)| lr 3.04e-04 | 323.39 ms | 52.2% bf16 MFU | 1623376 tok/s step 10056/19560 | loss 3.392787 (-0.40z)| norm 0.2570 (-0.43z)| lr 3.04e-04 | 322.43 ms | 52.3% bf16 MFU | 1623509 tok/s step 10057/19560 | loss 3.380988 (-0.65z)| norm 0.2781 (-0.06z)| lr 3.04e-04 | 323.24 ms | 52.2% bf16 MFU | 1623432 tok/s step 10058/19560 | loss 3.447933 (+0.81z)| norm 0.3076 (+0.44z)| lr 3.04e-04 | 323.54 ms | 52.2% bf16 MFU | 1623283 tok/s step 10059/19560 | loss 3.422152 (+0.24z)| norm 0.2989 (+0.29z)| lr 3.04e-04 | 322.93 ms | 52.3% bf16 MFU | 1623295 tok/s step 10060/19560 | loss 3.405226 (-0.13z)| norm 0.3333 (+0.88z)| lr 3.04e-04 | 323.42 ms | 52.2% bf16 MFU | 1623185 tok/s step 10061/19560 | loss 3.429962 (+0.41z)| norm 0.2712 (-0.20z)| lr 3.03e-04 | 323.05 ms | 52.2% bf16 MFU | 1623171 tok/s step 10062/19560 | loss 3.382407 (-0.63z)| norm 0.2940 (+0.19z)| lr 3.03e-04 | 323.05 ms | 52.2% bf16 MFU | 1623159 tok/s step 10063/19560 | loss 3.480401 (+1.54z)| norm 0.2764 (-0.12z)| lr 3.03e-04 | 323.95 ms | 52.1% bf16 MFU | 1622921 tok/s step 10064/19560 | loss 3.416005 (+0.11z)| norm 0.2879 (+0.08z)| lr 3.03e-04 | 323.11 ms | 52.2% bf16 MFU | 1622907 tok/s step 10065/19560 | loss 3.366560 (-0.98z)| norm 0.2992 (+0.27z)| lr 3.03e-04 | 323.26 ms | 52.2% bf16 MFU | 1622856 tok/s step 10066/19560 | loss 3.387463 (-0.51z)| norm 0.2522 (-0.54z)| lr 3.03e-04 | 322.97 ms | 52.3% bf16 MFU | 1622878 tok/s step 10067/19560 | loss 3.345057 (-1.43z)| norm 0.3368 (+0.91z)| lr 3.03e-04 | 323.35 ms | 52.2% bf16 MFU | 1622805 tok/s step 10068/19560 | loss 3.376671 (-0.74z)| norm 0.2683 (-0.28z)| lr 3.03e-04 | 323.51 ms | 52.2% bf16 MFU | 1622697 tok/s step 10069/19560 | loss 3.385193 (-0.55z)| norm 0.3149 (+0.53z)| lr 3.03e-04 | 322.90 ms | 52.3% bf16 MFU | 1622747 tok/s step 10070/19560 | loss 3.385457 (-0.54z)| norm 0.2723 (-0.21z)| lr 3.03e-04 | 323.75 ms | 52.1% bf16 MFU | 1622581 tok/s step 10071/19560 
| loss 3.369804 (-0.87z)| norm 0.3106 (+0.45z)| lr 3.03e-04 | 323.51 ms | 52.2% bf16 MFU | 1622484 tok/s step 10072/19560 | loss 3.389155 (-0.44z)| norm 0.3128 (+0.49z)| lr 3.03e-04 | 322.81 ms | 52.3% bf16 MFU | 1622566 tok/s step 10073/19560 | loss 3.382064 (-0.59z)| norm 0.2983 (+0.23z)| lr 3.03e-04 | 323.21 ms | 52.2% bf16 MFU | 1622543 tok/s step 10074/19560 | loss 3.410367 (+0.05z)| norm 0.3139 (+0.50z)| lr 3.03e-04 | 322.20 ms | 52.4% bf16 MFU | 1622776 tok/s step 10075/19560 | loss 3.402290 (-0.13z)| norm 0.2704 (-0.25z)| lr 3.03e-04 | 324.17 ms | 52.1% bf16 MFU | 1622505 tok/s step 10076/19560 | loss 3.443523 (+0.78z)| norm 0.3027 (+0.31z)| lr 3.03e-04 | 323.14 ms | 52.2% bf16 MFU | 1622504 tok/s step 10077/19560 | loss 3.446202 (+0.84z)| norm 0.2619 (-0.40z)| lr 3.03e-04 | 322.96 ms | 52.3% bf16 MFU | 1622547 tok/s step 10078/19560 | loss 3.420761 (+0.26z)| norm 0.2702 (-0.25z)| lr 3.03e-04 | 323.45 ms | 52.2% bf16 MFU | 1622465 tok/s step 10079/19560 | loss 3.461098 (+1.15z)| norm 0.2941 (+0.16z)| lr 3.03e-04 | 323.11 ms | 52.2% bf16 MFU | 1622474 tok/s step 10080/19560 | loss 3.402225 (-0.17z)| norm 0.2449 (-0.68z)| lr 3.03e-04 | 322.96 ms | 52.3% bf16 MFU | 1622518 tok/s step 10081/19560 | loss 3.382689 (-0.60z)| norm 0.2955 (+0.19z)| lr 3.02e-04 | 322.63 ms | 52.3% bf16 MFU | 1622644 tok/s step 10082/19560 | loss 3.396438 (-0.30z)| norm 0.2608 (-0.41z)| lr 3.02e-04 | 324.18 ms | 52.1% bf16 MFU | 1622375 tok/s step 10083/19560 | loss 3.364377 (-1.02z)| norm 0.2641 (-0.81z)| lr 3.02e-04 | 323.06 ms | 52.2% bf16 MFU | 1622400 tok/s step 10084/19560 | loss 3.367034 (-0.95z)| norm 0.2877 (+0.45z)| lr 3.02e-04 | 322.88 ms | 52.3% bf16 MFU | 1622468 tok/s step 10085/19560 | loss 3.395104 (-0.32z)| norm 0.2682 (-0.58z)| lr 3.02e-04 | 322.50 ms | 52.3% bf16 MFU | 1622629 tok/s step 10086/19560 | loss 3.446591 (+0.82z)| norm 0.2666 (-0.66z)| lr 3.02e-04 | 323.11 ms | 52.2% bf16 MFU | 1622630 tok/s step 10087/19560 | loss 3.411141 (+0.02z)| norm 0.2763 (-0.14z)| 
lr 3.02e-04 | 323.21 ms | 52.2% bf16 MFU | 1622606 tok/s step 10088/19560 | loss 3.408317 (-0.02z)| norm 0.2799 (+0.05z)| lr 3.02e-04 | 323.00 ms | 52.3% bf16 MFU | 1622634 tok/s step 10089/19560 | loss 3.382884 (-0.60z)| norm 0.2798 (+0.04z)| lr 3.02e-04 | 323.67 ms | 52.1% bf16 MFU | 1622492 tok/s step 10090/19560 | loss 3.357160 (-1.19z)| norm 0.2820 (+0.16z)| lr 3.02e-04 | 323.18 ms | 52.2% bf16 MFU | 1622481 tok/s step 10091/19560 | loss 3.414435 (+0.13z)| norm 0.2846 (+0.29z)| lr 3.02e-04 | 323.04 ms | 52.2% bf16 MFU | 1622507 tok/s step 10092/19560 | loss 3.359769 (-1.15z)| norm 0.2912 (+0.63z)| lr 3.02e-04 | 323.08 ms | 52.2% bf16 MFU | 1622521 tok/s step 10093/19560 | loss 3.429482 (+0.47z)| norm 0.2800 (+0.03z)| lr 3.02e-04 | 322.97 ms | 52.3% bf16 MFU | 1622562 tok/s step 10094/19560 | loss 3.424188 (+0.34z)| norm 0.2968 (+0.92z)| lr 3.02e-04 | 322.69 ms | 52.3% bf16 MFU | 1622671 tok/s step 10095/19560 | loss 3.367521 (-0.97z)| norm 0.2823 (+0.14z)| lr 3.02e-04 | 322.98 ms | 52.3% bf16 MFU | 1622701 tok/s step 10096/19560 | loss 3.373178 (-0.83z)| norm 0.3346 (+2.81z)| lr 3.02e-04 | 322.78 ms | 52.3% bf16 MFU | 1622779 tok/s step 10097/19560 | loss 3.415446 (+0.14z)| norm 0.2879 (+0.41z)| lr 3.02e-04 | 322.87 ms | 52.3% bf16 MFU | 1622833 tok/s step 10098/19560 | loss 3.381071 (-0.67z)| norm 0.3358 (+2.79z)| lr 3.02e-04 | 323.21 ms | 52.2% bf16 MFU | 1622799 tok/s step 10099/19560 | loss 3.397817 (-0.28z)| norm 0.2884 (+0.40z)| lr 3.02e-04 | 323.14 ms | 52.2% bf16 MFU | 1622784 tok/s step 10100/19560 | loss 3.427428 (+0.42z)| norm 0.3165 (+1.78z)| lr 3.02e-04 | 323.39 ms | 52.2% bf16 MFU | 1622707 tok/s step 10101/19560 | loss 3.402761 (-0.18z)| norm 0.2745 (-0.31z)| lr 3.01e-04 | 322.36 ms | 52.4% bf16 MFU | 1622892 tok/s step 10102/19560 | loss 3.400195 (-0.24z)| norm 0.3105 (+1.46z)| lr 3.01e-04 | 323.10 ms | 52.2% bf16 MFU | 1622881 tok/s step 10103/19560 | loss 3.388227 (-0.52z)| norm 0.2869 (+0.29z)| lr 3.01e-04 | 322.99 ms | 52.3% bf16 MFU | 
1622899 tok/s step 10104/19560 | loss 3.449624 (+0.94z)| norm 0.2592 (-1.08z)| lr 3.01e-04 | 323.39 ms | 52.2% bf16 MFU | 1622815 tok/s step 10105/19560 | loss 3.356107 (-1.32z)| norm 0.3425 (+2.95z)| lr 3.01e-04 | 323.13 ms | 52.2% bf16 MFU | 1622802 tok/s step 10106/19560 | loss 3.449329 (+1.02z)| norm 0.2609 (-0.97z)| lr 3.01e-04 | 322.57 ms | 52.3% bf16 MFU | 1622929 tok/s step 10107/19560 | loss 3.452118 (+1.08z)| norm 0.2998 (+0.90z)| lr 3.01e-04 | 322.98 ms | 52.3% bf16 MFU | 1622947 tok/s step 10108/19560 | loss 3.439776 (+0.76z)| norm 0.2721 (-0.43z)| lr 3.01e-04 | 322.98 ms | 52.3% bf16 MFU | 1622964 tok/s step 10109/19560 | loss 3.433910 (+0.61z)| norm 0.2915 (+0.49z)| lr 3.01e-04 | 322.96 ms | 52.3% bf16 MFU | 1622986 tok/s step 10110/19560 | loss 3.437836 (+0.70z)| norm 0.2781 (-0.17z)| lr 3.01e-04 | 322.73 ms | 52.3% bf16 MFU | 1623064 tok/s step 10111/19560 | loss 3.459588 (+1.23z)| norm 0.2907 (+0.43z)| lr 3.01e-04 | 322.96 ms | 52.3% bf16 MFU | 1623080 tok/s step 10112/19560 | loss 3.549150 (+3.75z)| norm 0.3011 (+0.93z)| lr 3.01e-04 | 322.70 ms | 52.3% bf16 MFU | 1623160 tok/s step 10113/19560 | loss 3.380008 (-0.79z)| norm 0.2681 (-0.68z)| lr 3.01e-04 | 322.51 ms | 52.3% bf16 MFU | 1623284 tok/s step 10114/19560 | loss 3.437291 (+0.74z)| norm 0.2954 (+0.65z)| lr 3.01e-04 | 322.50 ms | 52.3% bf16 MFU | 1623404 tok/s step 10115/19560 | loss 3.407177 (-0.08z)| norm 0.2794 (-0.14z)| lr 3.01e-04 | 322.81 ms | 52.3% bf16 MFU | 1623440 tok/s step 10116/19560 | loss 3.423483 (+0.36z)| norm 0.2870 (+0.23z)| lr 3.01e-04 | 323.12 ms | 52.2% bf16 MFU | 1623396 tok/s step 10117/19560 | loss 3.380255 (-0.79z)| norm 0.2732 (-0.44z)| lr 3.01e-04 | 322.28 ms | 52.4% bf16 MFU | 1623566 tok/s step 10118/19560 | loss 3.365291 (-1.18z)| norm 0.2897 (+0.36z)| lr 3.01e-04 | 323.35 ms | 52.2% bf16 MFU | 1623460 tok/s step 10119/19560 | loss 3.443975 (+0.91z)| norm 0.2874 (+0.25z)| lr 3.01e-04 | 322.46 ms | 52.3% bf16 MFU | 1623581 tok/s step 10120/19560 | loss 3.449111 
(+1.03z)| norm 0.2775 (-0.24z)| lr 3.01e-04 | 323.12 ms | 52.2% bf16 MFU | 1623530 tok/s step 10121/19560 | loss 3.371733 (-1.02z)| norm 0.2899 (+0.36z)| lr 3.00e-04 | 322.82 ms | 52.3% bf16 MFU | 1623557 tok/s step 10122/19560 | loss 3.336457 (-1.92z)| norm 0.2802 (-0.12z)| lr 3.00e-04 | 322.24 ms | 52.4% bf16 MFU | 1623730 tok/s step 10123/19560 | loss 3.341967 (-1.75z)| norm 0.2729 (-0.48z)| lr 3.00e-04 | 322.53 ms | 52.3% bf16 MFU | 1623821 tok/s step 10124/19560 | loss 3.464136 (+1.41z)| norm 0.2843 (+0.08z)| lr 3.00e-04 | 322.74 ms | 52.3% bf16 MFU | 1623854 tok/s step 10125/19560 | loss 3.364014 (-1.16z)| norm 0.2546 (-1.37z)| lr 3.00e-04 | 322.66 ms | 52.3% bf16 MFU | 1623907 tok/s step 10126/19560 | loss 3.402504 (-0.18z)| norm 0.2792 (-0.17z)| lr 3.00e-04 | 322.87 ms | 52.3% bf16 MFU | 1623903 tok/s step 10127/19560 | loss 3.395495 (-0.36z)| norm 0.2711 (-0.56z)| lr 3.00e-04 | 322.16 ms | 52.4% bf16 MFU | 1624077 tok/s step 10128/19560 | loss 3.415736 (+0.16z)| norm 0.2793 (-0.16z)| lr 3.00e-04 | 322.83 ms | 52.3% bf16 MFU | 1624076 tok/s step 10129/19560 | loss 3.366895 (-1.09z)| norm 0.2720 (-0.51z)| lr 3.00e-04 | 322.61 ms | 52.3% bf16 MFU | 1624131 tok/s step 10130/19560 | loss 3.357682 (-1.31z)| norm 0.2836 (+0.07z)| lr 3.00e-04 | 322.39 ms | 52.3% bf16 MFU | 1624236 tok/s step 10131/19560 | loss 3.371634 (-0.95z)| norm 0.2471 (-1.72z)| lr 3.00e-04 | 322.87 ms | 52.3% bf16 MFU | 1624216 tok/s step 10132/19560 | loss 3.408292 (-0.00z)| norm 0.2620 (-0.97z)| lr 3.00e-04 | 322.37 ms | 52.4% bf16 MFU | 1624323 tok/s step 10133/19560 | loss 3.432796 (+0.64z)| norm 0.2645 (-0.84z)| lr 3.00e-04 | 322.67 ms | 52.3% bf16 MFU | 1624348 tok/s step 10134/19560 | loss 3.340226 (-1.73z)| norm 0.2865 (+0.25z)| lr 3.00e-04 | 322.61 ms | 52.3% bf16 MFU | 1624389 tok/s step 10135/19560 | loss 3.395983 (-0.30z)| norm 0.2781 (-0.16z)| lr 3.00e-04 | 322.37 ms | 52.4% bf16 MFU | 1624489 tok/s step 10136/19560 | loss 3.356447 (-1.30z)| norm 0.2684 (-0.63z)| lr 3.00e-04 | 
322.32 ms | 52.4% bf16 MFU | 1624594 tok/s step 10137/19560 | loss 3.343072 (-1.61z)| norm 0.2778 (-0.16z)| lr 3.00e-04 | 322.91 ms | 52.3% bf16 MFU | 1624547 tok/s step 10138/19560 | loss 3.467633 (+1.52z)| norm 0.2844 (+0.16z)| lr 3.00e-04 | 322.49 ms | 52.3% bf16 MFU | 1624607 tok/s step 10139/19560 | loss 3.443596 (+0.91z)| norm 0.2704 (-0.53z)| lr 3.00e-04 | 322.67 ms | 52.3% bf16 MFU | 1624618 tok/s step 10140/19560 | loss 3.370003 (-0.92z)| norm 0.2696 (-0.56z)| lr 3.00e-04 | 323.12 ms | 52.2% bf16 MFU | 1624516 tok/s step 10141/19560 | loss 3.394017 (-0.32z)| norm 0.2754 (-0.28z)| lr 3.00e-04 | 322.53 ms | 52.3% bf16 MFU | 1624568 tok/s step 10142/19560 | loss 3.391545 (-0.38z)| norm 0.3039 (+1.13z)| lr 2.99e-04 | 323.02 ms | 52.2% bf16 MFU | 1624492 tok/s step 10143/19560 | loss 3.355554 (-1.28z)| norm 0.2654 (-0.79z)| lr 2.99e-04 | 322.61 ms | 52.3% bf16 MFU | 1624525 tok/s step 10144/19560 | loss 3.492547 (+2.11z)| norm 0.2871 (+0.29z)| lr 2.99e-04 | 323.10 ms | 52.2% bf16 MFU | 1624433 tok/s step 10145/19560 | loss 3.346537 (-1.47z)| norm 0.2673 (-0.70z)| lr 2.99e-04 | 322.72 ms | 52.3% bf16 MFU | 1624442 tok/s step 10146/19560 | loss 3.435891 (+0.71z)| norm 0.2580 (-1.16z)| lr 2.99e-04 | 322.71 ms | 52.3% bf16 MFU | 1624453 tok/s step 10147/19560 | loss 3.404161 (-0.06z)| norm 0.2744 (-0.33z)| lr 2.99e-04 | 322.47 ms | 52.3% bf16 MFU | 1624522 tok/s step 10148/19560 | loss 3.471923 (+1.58z)| norm 0.2660 (-0.74z)| lr 2.99e-04 | 322.61 ms | 52.3% bf16 MFU | 1624554 tok/s step 10149/19560 | loss 3.408821 (+0.05z)| norm 0.2730 (-0.39z)| lr 2.99e-04 | 323.27 ms | 52.2% bf16 MFU | 1624417 tok/s step 10150/19560 | loss 3.449536 (+1.04z)| norm 0.2678 (-0.65z)| lr 2.99e-04 | 322.62 ms | 52.3% bf16 MFU | 1624452 tok/s step 10151/19560 | loss 3.421459 (+0.42z)| norm 0.2720 (-0.43z)| lr 2.99e-04 | 323.72 ms | 52.1% bf16 MFU | 1624208 tok/s step 10152/19560 | loss 3.417914 (+0.32z)| norm 0.2747 (-0.28z)| lr 2.99e-04 | 322.30 ms | 52.4% bf16 MFU | 1624334 tok/s step 
10153/19560 | loss 3.386708 (-0.51z)| norm 0.2748 (-0.29z)| lr 2.99e-04 | 322.39 ms | 52.3% bf16 MFU | 1624429 tok/s step 10154/19560 | loss 3.413061 (+0.19z)| norm 0.2530 (-1.38z)| lr 2.99e-04 | 323.10 ms | 52.2% bf16 MFU | 1624342 tok/s step 10155/19560 | loss 3.407744 (+0.06z)| norm 0.2719 (-0.42z)| lr 2.99e-04 | 323.00 ms | 52.3% bf16 MFU | 1624285 tok/s step 10156/19560 | loss 3.424638 (+0.55z)| norm 0.2476 (-1.63z)| lr 2.99e-04 | 323.00 ms | 52.3% bf16 MFU | 1624230 tok/s step 10157/19560 | loss 3.404423 (-0.00z)| norm 0.2744 (-0.29z)| lr 2.99e-04 | 322.48 ms | 52.3% bf16 MFU | 1624310 tok/s step 10158/19560 | loss 3.365352 (-1.08z)| norm 0.2588 (-1.09z)| lr 2.99e-04 | 322.50 ms | 52.3% bf16 MFU | 1624380 tok/s step 10159/19560 | loss 3.404846 (+0.01z)| norm 0.2782 (-0.11z)| lr 2.99e-04 | 323.28 ms | 52.2% bf16 MFU | 1624250 tok/s step 10160/19560 | loss 3.393644 (-0.29z)| norm 0.2752 (-0.27z)| lr 2.99e-04 | 322.39 ms | 52.4% bf16 MFU | 1624350 tok/s step 10161/19560 | loss 3.375583 (-0.80z)| norm 0.2770 (-0.19z)| lr 2.99e-04 | 323.48 ms | 52.2% bf16 MFU | 1624171 tok/s step 10162/19560 | loss 3.469923 (+1.81z)| norm 0.2537 (-1.41z)| lr 2.98e-04 | 322.66 ms | 52.3% bf16 MFU | 1624207 tok/s step 10163/19560 | loss 3.364522 (-1.12z)| norm 0.2666 (-0.75z)| lr 2.98e-04 | 322.62 ms | 52.3% bf16 MFU | 1624252 tok/s step 10164/19560 | loss 3.410353 (+0.15z)| norm 0.2318 (-2.51z)| lr 2.98e-04 | 321.82 ms | 52.4% bf16 MFU | 1624497 tok/s step 10165/19560 | loss 3.400705 (-0.12z)| norm 0.2707 (-0.52z)| lr 2.98e-04 | 323.92 ms | 52.1% bf16 MFU | 1624201 tok/s step 10166/19560 | loss 3.423496 (+0.51z)| norm 0.2766 (-0.22z)| lr 2.98e-04 | 322.84 ms | 52.3% bf16 MFU | 1624190 tok/s step 10167/19560 | loss 3.384475 (-0.57z)| norm 0.2679 (-0.67z)| lr 2.98e-04 | 322.40 ms | 52.3% bf16 MFU | 1624290 tok/s step 10168/19560 | loss 3.403498 (-0.04z)| norm 0.2882 (+0.38z)| lr 2.98e-04 | 322.77 ms | 52.3% bf16 MFU | 1624294 tok/s step 10169/19560 | loss 3.441831 (+1.01z)| norm 
0.2905 (+0.49z)| lr 2.98e-04 | 322.75 ms | 52.3% bf16 MFU | 1624301 tok/s step 10170/19560 | loss 3.422220 (+0.46z)| norm 0.2995 (+0.95z)| lr 2.98e-04 | 323.10 ms | 52.2% bf16 MFU | 1624219 tok/s step 10171/19560 | loss 3.366889 (-1.05z)| norm 0.2928 (+0.59z)| lr 2.98e-04 | 322.44 ms | 52.3% bf16 MFU | 1624309 tok/s step 10172/19560 | loss 3.350686 (-1.48z)| norm 0.2901 (+0.45z)| lr 2.98e-04 | 323.25 ms | 52.2% bf16 MFU | 1624189 tok/s step 10173/19560 | loss 3.329122 (-2.02z)| norm 0.2792 (-0.11z)| lr 2.98e-04 | 323.06 ms | 52.2% bf16 MFU | 1624123 tok/s step 10174/19560 | loss 3.446707 (+1.15z)| norm 0.2837 (+0.11z)| lr 2.98e-04 | 322.20 ms | 52.4% bf16 MFU | 1624278 tok/s step 10175/19560 | loss 3.386755 (-0.47z)| norm 0.2623 (-1.00z)| lr 2.98e-04 | 322.67 ms | 52.3% bf16 MFU | 1624308 tok/s step 10176/19560 | loss 3.432918 (+0.80z)| norm 0.2907 (+0.50z)| lr 2.98e-04 | 323.09 ms | 52.2% bf16 MFU | 1624229 tok/s step 10177/19560 | loss 3.406979 (+0.08z)| norm 0.2642 (-0.89z)| lr 2.98e-04 | 323.05 ms | 52.2% bf16 MFU | 1624165 tok/s step 10178/19560 | loss 3.394552 (-0.25z)| norm 0.2662 (-0.78z)| lr 2.98e-04 | 322.56 ms | 52.3% bf16 MFU | 1624226 tok/s step 10179/19560 | loss 3.350427 (-1.44z)| norm 0.2804 (-0.03z)| lr 2.98e-04 | 322.68 ms | 52.3% bf16 MFU | 1624254 tok/s step 10180/19560 | loss 3.476495 (+1.94z)| norm 0.2619 (-0.99z)| lr 2.98e-04 | 322.38 ms | 52.4% bf16 MFU | 1624357 tok/s step 10181/19560 | loss 3.338516 (-1.72z)| norm 0.2847 (+0.20z)| lr 2.98e-04 | 322.82 ms | 52.3% bf16 MFU | 1624344 tok/s step 10182/19560 | loss 3.345769 (-1.50z)| norm 0.2797 (-0.05z)| lr 2.97e-04 | 322.81 ms | 52.3% bf16 MFU | 1624334 tok/s step 10183/19560 | loss 3.429076 (+0.68z)| norm 0.3119 (+1.60z)| lr 2.97e-04 | 322.34 ms | 52.4% bf16 MFU | 1624442 tok/s step 10184/19560 | loss 3.439487 (+0.94z)| norm 0.2853 (+0.21z)| lr 2.97e-04 | 322.72 ms | 52.3% bf16 MFU | 1624449 tok/s step 10185/19560 | loss 3.452711 (+1.27z)| norm 0.2759 (-0.28z)| lr 2.97e-04 | 323.22 ms | 
52.2% bf16 MFU | 1624330 tok/s step 10186/19560 | loss 3.512410 (+2.74z)| norm 0.2755 (-0.29z)| lr 2.97e-04 | 322.74 ms | 52.3% bf16 MFU | 1624337 tok/s step 10187/19560 | loss 3.467059 (+1.56z)| norm 0.2897 (+0.46z)| lr 2.97e-04 | 322.91 ms | 52.3% bf16 MFU | 1624303 tok/s step 10188/19560 | loss 3.368812 (-0.91z)| norm 0.2679 (-0.68z)| lr 2.97e-04 | 322.39 ms | 52.4% bf16 MFU | 1624401 tok/s step 10189/19560 | loss 3.383230 (-0.53z)| norm 0.2950 (+0.78z)| lr 2.97e-04 | 322.56 ms | 52.3% bf16 MFU | 1624451 tok/s step 10190/19560 | loss 3.333957 (-1.75z)| norm 0.2693 (-0.61z)| lr 2.97e-04 | 323.21 ms | 52.2% bf16 MFU | 1624334 tok/s step 10191/19560 | loss 3.556389 (+3.63z)| norm 0.2750 (-0.30z)| lr 2.97e-04 | 322.91 ms | 52.3% bf16 MFU | 1624298 tok/s step 10192/19560 | loss 3.401381 (-0.08z)| norm 0.2665 (-0.75z)| lr 2.97e-04 | 322.47 ms | 52.3% bf16 MFU | 1624376 tok/s step 10193/19560 | loss 3.352057 (-1.25z)| norm 0.2891 (+0.48z)| lr 2.97e-04 | 323.01 ms | 52.2% bf16 MFU | 1624314 tok/s step 10194/19560 | loss 3.382709 (-0.52z)| norm 0.2817 (+0.07z)| lr 2.97e-04 | 322.91 ms | 52.3% bf16 MFU | 1624280 tok/s step 10195/19560 | loss 3.402284 (-0.06z)| norm 0.3100 (+1.69z)| lr 2.97e-04 | 322.34 ms | 52.4% bf16 MFU | 1624393 tok/s step 10196/19560 | loss 3.345845 (-1.40z)| norm 0.2859 (+0.32z)| lr 2.97e-04 | 322.45 ms | 52.3% bf16 MFU | 1624471 tok/s step 10197/19560 | loss 3.368101 (-0.87z)| norm 0.2908 (+0.61z)| lr 2.97e-04 | 323.64 ms | 52.1% bf16 MFU | 1624245 tok/s step 10198/19560 | loss 3.423469 (+0.45z)| norm 0.2597 (-1.17z)| lr 2.97e-04 | 322.57 ms | 52.3% bf16 MFU | 1624300 tok/s step 10199/19560 | loss 3.382286 (-0.54z)| norm 0.2863 (+0.38z)| lr 2.97e-04 | 322.86 ms | 52.3% bf16 MFU | 1624279 tok/s step 10200/19560 | loss 3.361192 (-1.03z)| norm 0.2673 (-0.72z)| lr 2.97e-04 | 322.95 ms | 52.3% bf16 MFU | 1624237 tok/s step 10201/19560 | loss 3.381669 (-0.55z)| norm 0.2819 (+0.15z)| lr 2.97e-04 | 322.69 ms | 52.3% bf16 MFU | 1624261 tok/s step 10202/19560 
| loss 3.386502 (-0.43z)| norm 0.2964 (+1.03z)| lr 2.96e-04 | 322.69 ms | 52.3% bf16 MFU | 1624285 tok/s step 10203/19560 | loss 3.401790 (-0.06z)| norm 0.2767 (-0.16z)| lr 2.96e-04 | 323.12 ms | 52.2% bf16 MFU | 1624199 tok/s step 10204/19560 | loss 3.440336 (+0.85z)| norm 0.2866 (+0.44z)| lr 2.96e-04 | 322.83 ms | 52.3% bf16 MFU | 1624191 tok/s step 10205/19560 | loss 3.372369 (-0.75z)| norm 0.2873 (+0.48z)| lr 2.96e-04 | 323.24 ms | 52.2% bf16 MFU | 1624080 tok/s step 10206/19560 | loss 3.403750 (+0.00z)| norm 0.3538 (+4.14z)| lr 2.96e-04 | 322.39 ms | 52.4% bf16 MFU | 1624188 tok/s step 10207/19560 | loss 3.411956 (+0.21z)| norm 0.3621 (+4.26z)| lr 2.96e-04 | 322.55 ms | 52.3% bf16 MFU | 1624250 tok/s step 10208/19560 | loss 3.418391 (+0.36z)| norm 0.2967 (+0.83z)| lr 2.96e-04 | 323.58 ms | 52.2% bf16 MFU | 1624052 tok/s step 10209/19560 | loss 3.371027 (-0.78z)| norm 0.2754 (-0.28z)| lr 2.96e-04 | 322.42 ms | 52.3% bf16 MFU | 1624154 tok/s step 10210/19560 | loss 3.410616 (+0.17z)| norm 0.2802 (-0.04z)| lr 2.96e-04 | 322.90 ms | 52.3% bf16 MFU | 1624130 tok/s step 10211/19560 | loss 3.330790 (-1.72z)| norm 0.2757 (-0.29z)| lr 2.96e-04 | 323.16 ms | 52.2% bf16 MFU | 1624043 tok/s step 10212/19560 | loss 3.367306 (-0.85z)| norm 0.2772 (-0.20z)| lr 2.96e-04 | 322.35 ms | 52.4% bf16 MFU | 1624163 tok/s step 10213/19560 | loss 3.421704 (+0.43z)| norm 0.2501 (-1.62z)| lr 2.96e-04 | 322.83 ms | 52.3% bf16 MFU | 1624156 tok/s step 10214/19560 | loss 3.348931 (-1.27z)| norm 0.2821 (+0.06z)| lr 2.96e-04 | 322.62 ms | 52.3% bf16 MFU | 1624203 tok/s step 10215/19560 | loss 3.397424 (-0.12z)| norm 0.2660 (-0.78z)| lr 2.96e-04 | 323.33 ms | 52.2% bf16 MFU | 1624070 tok/s step 10216/19560 | loss 3.403931 (+0.03z)| norm 0.2796 (-0.07z)| lr 2.96e-04 | 322.18 ms | 52.4% bf16 MFU | 1624234 tok/s step 10217/19560 | loss 3.400023 (-0.06z)| norm 0.2667 (-0.74z)| lr 2.96e-04 | 323.46 ms | 52.2% bf16 MFU | 1624065 tok/s step 10218/19560 | loss 3.371965 (-0.73z)| norm 0.2595 (-1.11z)| 
lr 2.96e-04 | 323.45 ms | 52.2% bf16 MFU | 1623908 tok/s step 10219/19560 | loss 3.379685 (-0.54z)| norm 0.2829 (+0.12z)| lr 2.96e-04 | 322.35 ms | 52.4% bf16 MFU | 1624036 tok/s step 10220/19560 | loss 3.367248 (-0.84z)| norm 0.2580 (-1.16z)| lr 2.96e-04 | 322.76 ms | 52.3% bf16 MFU | 1624054 tok/s step 10221/19560 | loss 3.407696 (+0.13z)| norm 0.2825 (+0.11z)| lr 2.96e-04 | 322.96 ms | 52.3% bf16 MFU | 1624022 tok/s step 10222/19560 | loss 3.396454 (-0.14z)| norm 0.2677 (-0.65z)| lr 2.95e-04 | 322.71 ms | 52.3% bf16 MFU | 1624053 tok/s step 10223/19560 | loss 3.376299 (-0.62z)| norm 0.2538 (-1.35z)| lr 2.95e-04 | 322.76 ms | 52.3% bf16 MFU | 1624070 tok/s step 10224/19560 | loss 3.418604 (+0.38z)| norm 0.2669 (-0.66z)| lr 2.95e-04 | 324.21 ms | 52.1% bf16 MFU | 1623722 tok/s step 10225/19560 | loss 3.403007 (+0.01z)| norm 0.2884 (+0.49z)| lr 2.95e-04 | 322.13 ms | 52.4% bf16 MFU | 1623915 tok/s step 10226/19560 | loss 3.403640 (+0.02z)| norm 0.2784 (-0.03z)| lr 2.95e-04 | 322.81 ms | 52.3% bf16 MFU | 1623927 tok/s step 10227/19560 | loss 3.412499 (+0.23z)| norm 0.2753 (-0.19z)| lr 2.95e-04 | 323.12 ms | 52.2% bf16 MFU | 1623860 tok/s step 10228/19560 | loss 3.423842 (+0.50z)| norm 0.2777 (-0.05z)| lr 2.95e-04 | 323.13 ms | 52.2% bf16 MFU | 1623794 tok/s step 10229/19560 | loss 3.432160 (+0.70z)| norm 0.2664 (-0.68z)| lr 2.95e-04 | 322.46 ms | 52.3% bf16 MFU | 1623900 tok/s step 10230/19560 | loss 3.408857 (+0.14z)| norm 0.2733 (-0.28z)| lr 2.95e-04 | 322.74 ms | 52.3% bf16 MFU | 1623928 tok/s step 10231/19560 | loss 3.416287 (+0.31z)| norm 0.2675 (-0.60z)| lr 2.95e-04 | 323.30 ms | 52.2% bf16 MFU | 1623816 tok/s step 10232/19560 | loss 3.433891 (+0.74z)| norm 0.2583 (-1.13z)| lr 2.95e-04 | 323.29 ms | 52.2% bf16 MFU | 1623712 tok/s step 10233/19560 | loss 3.438338 (+0.83z)| norm 0.2664 (-0.67z)| lr 2.95e-04 | 323.18 ms | 52.2% bf16 MFU | 1623640 tok/s step 10234/19560 | loss 3.425485 (+0.53z)| norm 0.2528 (-1.47z)| lr 2.95e-04 | 322.55 ms | 52.3% bf16 MFU | 
1623731 tok/s step 10235/19560 | loss 3.460366 (+1.36z)| norm 0.2622 (-0.90z)| lr 2.95e-04 | 323.19 ms | 52.2% bf16 MFU | 1623656 tok/s step 10236/19560 | loss 3.436856 (+0.80z)| norm 0.2610 (-0.96z)| lr 2.95e-04 | 322.68 ms | 52.3% bf16 MFU | 1623713 tok/s step 10237/19560 | loss 3.513681 (+2.57z)| norm 0.2893 (+0.75z)| lr 2.95e-04 | 322.46 ms | 52.3% bf16 MFU | 1623822 tok/s step 10238/19560 | loss 3.415410 (+0.27z)| norm 0.2579 (-1.13z)| lr 2.95e-04 | 323.19 ms | 52.2% bf16 MFU | 1623741 tok/s step 10239/19560 | loss 3.400225 (-0.08z)| norm 0.3091 (+1.91z)| lr 2.95e-04 | 322.98 ms | 52.3% bf16 MFU | 1623718 tok/s step 10240/19560 | loss 3.402577 (+0.00z)| norm 0.2681 (-0.51z)| lr 2.95e-04 | 323.29 ms | 52.2% bf16 MFU | 1623618 tok/s step 10241/19560 | loss 3.352726 (-1.23z)| norm 0.2844 (+0.46z)| lr 2.95e-04 | 322.93 ms | 52.3% bf16 MFU | 1623614 tok/s step 10242/19560 | loss 3.541848 (+3.30z)| norm 0.2753 (-0.08z)| lr 2.94e-04 | 323.26 ms | 52.2% bf16 MFU | 1623527 tok/s step 10243/19560 | loss 3.385368 (-0.42z)| norm 0.2681 (-0.51z)| lr 2.94e-04 | 323.27 ms | 52.2% bf16 MFU | 1623441 tok/s step 10244/19560 | loss 3.367690 (-0.83z)| norm 0.2781 (+0.10z)| lr 2.94e-04 | 323.10 ms | 52.2% bf16 MFU | 1623403 tok/s step 10245/19560 | loss 3.394134 (-0.20z)| norm 0.2792 (+0.16z)| lr 2.94e-04 | 323.23 ms | 52.2% bf16 MFU | 1623335 tok/s step 10246/19560 | loss 3.359515 (-1.02z)| norm 0.2716 (-0.29z)| lr 2.94e-04 | 322.60 ms | 52.3% bf16 MFU | 1623428 tok/s step 10247/19560 | loss 3.410253 (+0.19z)| norm 0.2696 (-0.40z)| lr 2.94e-04 | 322.72 ms | 52.3% bf16 MFU | 1623485 tok/s step 10248/19560 | loss 3.382840 (-0.45z)| norm 0.2829 (+0.40z)| lr 2.94e-04 | 323.23 ms | 52.2% bf16 MFU | 1623412 tok/s step 10249/19560 | loss 3.399061 (-0.07z)| norm 0.2847 (+0.51z)| lr 2.94e-04 | 323.08 ms | 52.2% bf16 MFU | 1623382 tok/s step 10250/19560 | loss 3.361068 (-0.99z)| norm 0.3005 (+1.44z)| lr 2.94e-04 | 323.04 ms | 52.2% bf16 MFU | 1623362 tok/s val loss 3.391018 evaluating 
HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating HellaSwag: 70/79 HellaSwag: 2933/10042 = 0.292073 step 10251/19560 | loss 3.392031 (-0.26z)| norm 0.3107 (+2.00z)| lr 2.94e-04 | 322.85 ms | 52.3% bf16 MFU | 1623392 tok/s step 10252/19560 | loss 3.388530 (-0.33z)| norm 0.3287 (+2.94z)| lr 2.94e-04 | 322.70 ms | 52.3% bf16 MFU | 1623458 tok/s step 10253/19560 | loss 3.535133 (+3.11z)| norm 0.3302 (+2.91z)| lr 2.94e-04 | 323.48 ms | 52.2% bf16 MFU | 1623324 tok/s step 10254/19560 | loss 3.453739 (+1.17z)| norm 0.3181 (+2.18z)| lr 2.94e-04 | 323.48 ms | 52.2% bf16 MFU | 1623196 tok/s step 10255/19560 | loss 3.399994 (-0.09z)| norm 0.3077 (+1.58z)| lr 2.94e-04 | 322.99 ms | 52.3% bf16 MFU | 1623198 tok/s step 10256/19560 | loss 3.424707 (+0.49z)| norm 0.2824 (+0.22z)| lr 2.94e-04 | 322.49 ms | 52.3% bf16 MFU | 1623326 tok/s step 10257/19560 | loss 3.381806 (-0.52z)| norm 0.3593 (+4.04z)| lr 2.94e-04 | 323.50 ms | 52.2% bf16 MFU | 1623193 tok/s step 10258/19560 | loss 3.324053 (-1.86z)| norm 0.2804 (+0.08z)| lr 2.94e-04 | 322.41 ms | 52.3% bf16 MFU | 1623340 tok/s step 10259/19560 | loss 3.406758 (+0.07z)| norm 0.3127 (+1.67z)| lr 2.94e-04 | 322.55 ms | 52.3% bf16 MFU | 1623444 tok/s step 10260/19560 | loss 3.429786 (+0.60z)| norm 0.2992 (+0.98z)| lr 2.94e-04 | 322.05 ms | 52.4% bf16 MFU | 1623672 tok/s step 10261/19560 | loss 3.392919 (-0.25z)| norm 0.2983 (+0.92z)| lr 2.94e-04 | 323.13 ms | 52.2% bf16 MFU | 1623615 tok/s step 10262/19560 | loss 3.466471 (+1.44z)| norm 0.3024 (+1.11z)| lr 2.93e-04 | 323.28 ms | 52.2% bf16 MFU | 1623523 tok/s step 10263/19560 | loss 3.338270 (-1.53z)| norm 0.2990 (+0.93z)| lr 2.93e-04 | 323.05 ms | 52.2% bf16 MFU | 1623493 tok/s step 10264/19560 | loss 3.383070 (-0.50z)| norm 0.2754 (-0.24z)| lr 2.93e-04 | 322.49 ms | 52.3% bf16 MFU | 1623607 tok/s step 10265/19560 | loss 3.351059 (-1.25z)| 
norm 0.3098 (+1.44z)| lr 2.93e-04 | 323.29 ms | 52.2% bf16 MFU | 1623513 tok/s step 10266/19560 | loss 3.342610 (-1.42z)| norm 0.2735 (-0.34z)| lr 2.93e-04 | 323.28 ms | 52.2% bf16 MFU | 1623427 tok/s step 10267/19560 | loss 3.408967 (+0.13z)| norm 0.3107 (+1.46z)| lr 2.93e-04 | 322.82 ms | 52.3% bf16 MFU | 1623460 tok/s step 10268/19560 | loss 3.417200 (+0.32z)| norm 0.2811 (+0.01z)| lr 2.93e-04 | 323.23 ms | 52.2% bf16 MFU | 1623390 tok/s step 10269/19560 | loss 3.385556 (-0.42z)| norm 0.3543 (+3.40z)| lr 2.93e-04 | 323.01 ms | 52.3% bf16 MFU | 1623378 tok/s step 10270/19560 | loss 3.393845 (-0.23z)| norm 0.3200 (+1.78z)| lr 2.93e-04 | 322.95 ms | 52.3% bf16 MFU | 1623381 tok/s step 10271/19560 | loss 3.323777 (-1.85z)| norm 0.3798 (+4.19z)| lr 2.93e-04 | 322.73 ms | 52.3% bf16 MFU | 1623438 tok/s step 10272/19560 | loss 3.448701 (+1.07z)| norm 0.3256 (+1.82z)| lr 2.93e-04 | 322.90 ms | 52.3% bf16 MFU | 1623449 tok/s step 10273/19560 | loss 3.371027 (-0.76z)| norm 0.3336 (+2.10z)| lr 2.93e-04 | 322.57 ms | 52.3% bf16 MFU | 1623543 tok/s step 10274/19560 | loss 3.405609 (+0.06z)| norm 0.3236 (+1.65z)| lr 2.93e-04 | 322.77 ms | 52.3% bf16 MFU | 1623584 tok/s step 10275/19560 | loss 3.400434 (-0.06z)| norm 0.2989 (+0.62z)| lr 2.93e-04 | 323.07 ms | 52.2% bf16 MFU | 1623545 tok/s step 10276/19560 | loss 3.460319 (+1.36z)| norm 0.3689 (+3.34z)| lr 2.93e-04 | 322.79 ms | 52.3% bf16 MFU | 1623579 tok/s step 10277/19560 | loss 3.452221 (+1.16z)| norm 0.3199 (+1.37z)| lr 2.93e-04 | 322.54 ms | 52.3% bf16 MFU | 1623674 tok/s step 10278/19560 | loss 3.425927 (+0.54z)| norm 0.3186 (+1.30z)| lr 2.93e-04 | 322.78 ms | 52.3% bf16 MFU | 1623704 tok/s step 10279/19560 | loss 3.350157 (-1.24z)| norm 0.2995 (+0.54z)| lr 2.93e-04 | 322.74 ms | 52.3% bf16 MFU | 1623744 tok/s step 10280/19560 | loss 3.414237 (+0.28z)| norm 0.3004 (+0.57z)| lr 2.93e-04 | 322.98 ms | 52.3% bf16 MFU | 1623720 tok/s step 10281/19560 | loss 3.412311 (+0.23z)| norm 0.3060 (+0.77z)| lr 2.93e-04 | 323.23 ms | 
52.2% bf16 MFU | 1623637 tok/s step 10282/19560 | loss 3.446305 (+1.02z)| norm 0.2822 (-0.17z)| lr 2.92e-04 | 322.70 ms | 52.3% bf16 MFU | 1623690 tok/s step 10283/19560 | loss 3.420545 (+0.41z)| norm 0.2922 (+0.22z)| lr 2.92e-04 | 322.90 ms | 52.3% bf16 MFU | 1623688 tok/s step 10284/19560 | loss 3.372627 (-0.71z)| norm 0.2774 (-0.38z)| lr 2.92e-04 | 322.86 ms | 52.3% bf16 MFU | 1623697 tok/s step 10285/19560 | loss 3.395025 (-0.18z)| norm 0.2926 (+0.22z)| lr 2.92e-04 | 322.90 ms | 52.3% bf16 MFU | 1623698 tok/s step 10286/19560 | loss 3.477884 (+1.73z)| norm 0.2920 (+0.19z)| lr 2.92e-04 | 322.89 ms | 52.3% bf16 MFU | 1623698 tok/s step 10287/19560 | loss 3.331223 (-1.65z)| norm 0.2737 (-0.54z)| lr 2.92e-04 | 322.90 ms | 52.3% bf16 MFU | 1623697 tok/s step 10288/19560 | loss 3.383301 (-0.45z)| norm 0.2820 (-0.21z)| lr 2.92e-04 | 322.94 ms | 52.3% bf16 MFU | 1623685 tok/s step 10289/19560 | loss 3.328901 (-1.68z)| norm 0.2648 (-0.90z)| lr 2.92e-04 | 322.61 ms | 52.3% bf16 MFU | 1623759 tok/s step 10290/19560 | loss 3.423574 (+0.49z)| norm 0.2545 (-1.31z)| lr 2.92e-04 | 323.07 ms | 52.2% bf16 MFU | 1623714 tok/s step 10291/19560 | loss 3.433373 (+0.71z)| norm 0.2625 (-0.99z)| lr 2.92e-04 | 322.63 ms | 52.3% bf16 MFU | 1623780 tok/s step 10292/19560 | loss 3.420125 (+0.40z)| norm 0.2567 (-1.24z)| lr 2.92e-04 | 323.00 ms | 52.3% bf16 MFU | 1623751 tok/s step 10293/19560 | loss 3.357872 (-1.02z)| norm 0.2709 (-0.67z)| lr 2.92e-04 | 322.56 ms | 52.3% bf16 MFU | 1623833 tok/s step 10294/19560 | loss 3.453434 (+1.16z)| norm 0.2678 (-0.79z)| lr 2.92e-04 | 322.63 ms | 52.3% bf16 MFU | 1623894 tok/s step 10295/19560 | loss 3.351514 (-1.15z)| norm 0.2536 (-1.35z)| lr 2.92e-04 | 323.14 ms | 52.2% bf16 MFU | 1623824 tok/s step 10296/19560 | loss 3.382116 (-0.45z)| norm 0.2529 (-1.36z)| lr 2.92e-04 | 322.72 ms | 52.3% bf16 MFU | 1623862 tok/s step 10297/19560 | loss 3.408166 (+0.14z)| norm 0.2478 (-1.54z)| lr 2.92e-04 | 322.72 ms | 52.3% bf16 MFU | 1623898 tok/s step 10298/19560 
| loss 3.419576 (+0.40z)| norm 0.2595 (-1.06z)| lr 2.92e-04 | 323.03 ms | 52.2% bf16 MFU | 1623855 tok/s step 10299/19560 | loss 3.384824 (-0.39z)| norm 0.2439 (-1.64z)| lr 2.92e-04 | 322.84 ms | 52.3% bf16 MFU | 1623861 tok/s step 10300/19560 | loss 3.365147 (-0.85z)| norm 0.2605 (-0.98z)| lr 2.92e-04 | 322.62 ms | 52.3% bf16 MFU | 1623922 tok/s step 10301/19560 | loss 3.427049 (+0.56z)| norm 0.2461 (-1.51z)| lr 2.92e-04 | 322.71 ms | 52.3% bf16 MFU | 1623958 tok/s step 10302/19560 | loss 3.351948 (-1.16z)| norm 0.2758 (-0.37z)| lr 2.91e-04 | 323.38 ms | 52.2% bf16 MFU | 1623825 tok/s step 10303/19560 | loss 3.402617 (+0.01z)| norm 0.2555 (-1.15z)| lr 2.91e-04 | 323.12 ms | 52.2% bf16 MFU | 1623764 tok/s step 10304/19560 | loss 3.343785 (-1.33z)| norm 0.2619 (-0.89z)| lr 2.91e-04 | 322.76 ms | 52.3% bf16 MFU | 1623795 tok/s step 10305/19560 | loss 3.404347 (+0.06z)| norm 0.2740 (-0.43z)| lr 2.91e-04 | 322.74 ms | 52.3% bf16 MFU | 1623829 tok/s step 10306/19560 | loss 3.470975 (+1.56z)| norm 0.2622 (-0.88z)| lr 2.91e-04 | 323.47 ms | 52.2% bf16 MFU | 1623679 tok/s step 10307/19560 | loss 3.511176 (+2.41z)| norm 0.2779 (-0.28z)| lr 2.91e-04 | 323.12 ms | 52.2% bf16 MFU | 1623624 tok/s step 10308/19560 | loss 3.396020 (-0.15z)| norm 0.2496 (-1.35z)| lr 2.91e-04 | 323.12 ms | 52.2% bf16 MFU | 1623573 tok/s step 10309/19560 | loss 3.420400 (+0.39z)| norm 0.2815 (-0.13z)| lr 2.91e-04 | 322.50 ms | 52.3% bf16 MFU | 1623679 tok/s step 10310/19560 | loss 3.422299 (+0.42z)| norm 0.2518 (-1.25z)| lr 2.91e-04 | 322.59 ms | 52.3% bf16 MFU | 1623758 tok/s step 10311/19560 | loss 3.387854 (-0.36z)| norm 0.3028 (+0.69z)| lr 2.91e-04 | 323.35 ms | 52.2% bf16 MFU | 1623642 tok/s step 10312/19560 | loss 3.532535 (+2.84z)| norm 0.2527 (-1.20z)| lr 2.91e-04 | 322.99 ms | 52.3% bf16 MFU | 1623621 tok/s step 10313/19560 | loss 3.453920 (+1.10z)| norm 0.2799 (-0.17z)| lr 2.91e-04 | 323.03 ms | 52.2% bf16 MFU | 1623592 tok/s step 10314/19560 | loss 3.404608 (+0.02z)| norm 0.2610 (-0.88z)| 
lr 2.91e-04 | 323.15 ms | 52.2% bf16 MFU | 1623534 tok/s step 10315/19560 | loss 3.425292 (+0.50z)| norm 0.2869 (+0.10z)| lr 2.91e-04 | 322.82 ms | 52.3% bf16 MFU | 1623563 tok/s step 10316/19560 | loss 3.378580 (-0.57z)| norm 0.2883 (+0.15z)| lr 2.91e-04 | 322.52 ms | 52.3% bf16 MFU | 1623665 tok/s step 10317/19560 | loss 3.378691 (-0.56z)| norm 0.2638 (-0.77z)| lr 2.91e-04 | 323.65 ms | 52.1% bf16 MFU | 1623478 tok/s step 10318/19560 | loss 3.411563 (+0.18z)| norm 0.3367 (+1.94z)| lr 2.91e-04 | 322.80 ms | 52.3% bf16 MFU | 1623513 tok/s step 10319/19560 | loss 3.579107 (+3.99z)| norm 0.2974 (+0.46z)| lr 2.91e-04 | 322.43 ms | 52.3% bf16 MFU | 1623640 tok/s step 10320/19560 | loss 3.439759 (+0.80z)| norm 0.2973 (+0.45z)| lr 2.91e-04 | 323.02 ms | 52.2% bf16 MFU | 1623612 tok/s step 10321/19560 | loss 3.409502 (+0.11z)| norm 0.3157 (+1.13z)| lr 2.91e-04 | 323.53 ms | 52.2% bf16 MFU | 1623457 tok/s step 10322/19560 | loss 3.512889 (+2.39z)| norm 0.2808 (-0.17z)| lr 2.90e-04 | 322.25 ms | 52.4% bf16 MFU | 1623632 tok/s step 10323/19560 | loss 3.519967 (+2.47z)| norm 0.3156 (+1.12z)| lr 2.90e-04 | 322.73 ms | 52.3% bf16 MFU | 1623676 tok/s step 10324/19560 | loss 3.460085 (+1.14z)| norm 0.2943 (+0.33z)| lr 2.90e-04 | 323.12 ms | 52.2% bf16 MFU | 1623620 tok/s step 10325/19560 | loss 3.383724 (-0.53z)| norm 0.2958 (+0.38z)| lr 2.90e-04 | 322.59 ms | 52.3% bf16 MFU | 1623702 tok/s step 10326/19560 | loss 3.459568 (+1.12z)| norm 0.2996 (+0.51z)| lr 2.90e-04 | 323.62 ms | 52.2% bf16 MFU | 1623520 tok/s step 10327/19560 | loss 3.383298 (-0.54z)| norm 0.3276 (+1.53z)| lr 2.90e-04 | 323.00 ms | 52.3% bf16 MFU | 1623503 tok/s step 10328/19560 | loss 3.325290 (-1.78z)| norm 0.3196 (+1.21z)| lr 2.90e-04 | 322.97 ms | 52.3% bf16 MFU | 1623496 tok/s step 10329/19560 | loss 3.509674 (+2.14z)| norm 0.3150 (+1.03z)| lr 2.90e-04 | 322.52 ms | 52.3% bf16 MFU | 1623600 tok/s step 10330/19560 | loss 3.437422 (+0.60z)| norm 0.3290 (+1.52z)| lr 2.90e-04 | 322.59 ms | 52.3% bf16 MFU | 
1623682 tok/s step 10331/19560 | loss 3.369223 (-0.84z)| norm 0.3117 (+0.88z)| lr 2.90e-04 | 322.76 ms | 52.3% bf16 MFU | 1623718 tok/s step 10332/19560 | loss 3.349545 (-1.24z)| norm 0.2805 (-0.24z)| lr 2.90e-04 | 323.31 ms | 52.2% bf16 MFU | 1623612 tok/s step 10333/19560 | loss 3.398796 (-0.20z)| norm 0.2886 (+0.05z)| lr 2.90e-04 | 322.19 ms | 52.4% bf16 MFU | 1623796 tok/s step 10334/19560 | loss 3.420596 (+0.25z)| norm 0.2954 (+0.31z)| lr 2.90e-04 | 322.76 ms | 52.3% bf16 MFU | 1623826 tok/s step 10335/19560 | loss 3.376471 (-0.67z)| norm 0.3391 (+1.97z)| lr 2.90e-04 | 322.75 ms | 52.3% bf16 MFU | 1623857 tok/s step 10336/19560 | loss 3.376290 (-0.67z)| norm 0.2627 (-0.89z)| lr 2.90e-04 | 322.75 ms | 52.3% bf16 MFU | 1623888 tok/s step 10337/19560 | loss 3.410424 (+0.04z)| norm 0.2803 (-0.23z)| lr 2.90e-04 | 323.01 ms | 52.2% bf16 MFU | 1623848 tok/s step 10338/19560 | loss 3.381178 (-0.57z)| norm 0.2938 (+0.27z)| lr 2.90e-04 | 322.93 ms | 52.3% bf16 MFU | 1623834 tok/s step 10339/19560 | loss 3.437498 (+0.61z)| norm 0.2798 (-0.25z)| lr 2.90e-04 | 323.16 ms | 52.2% bf16 MFU | 1623762 tok/s step 10340/19560 | loss 3.420348 (+0.24z)| norm 0.3010 (+0.53z)| lr 2.90e-04 | 322.56 ms | 52.3% bf16 MFU | 1623844 tok/s step 10341/19560 | loss 3.387137 (-0.47z)| norm 0.2448 (-1.57z)| lr 2.90e-04 | 322.72 ms | 52.3% bf16 MFU | 1623881 tok/s step 10342/19560 | loss 3.375831 (-0.72z)| norm 0.3215 (+1.28z)| lr 2.89e-04 | 322.71 ms | 52.3% bf16 MFU | 1623918 tok/s step 10343/19560 | loss 3.397388 (-0.25z)| norm 0.2816 (-0.21z)| lr 2.89e-04 | 322.95 ms | 52.3% bf16 MFU | 1623893 tok/s step 10344/19560 | loss 3.391769 (-0.37z)| norm 0.2754 (-0.44z)| lr 2.89e-04 | 322.28 ms | 52.4% bf16 MFU | 1624039 tok/s step 10345/19560 | loss 3.383533 (-0.55z)| norm 0.2889 (+0.06z)| lr 2.89e-04 | 322.69 ms | 52.3% bf16 MFU | 1624073 tok/s step 10346/19560 | loss 3.332181 (-1.62z)| norm 0.2534 (-1.26z)| lr 2.89e-04 | 323.16 ms | 52.2% bf16 MFU | 1623987 tok/s step 10347/19560 | loss 3.413492 
(+0.10z)| norm 0.2869 (-0.01z)| lr 2.89e-04 | 322.46 ms | 52.3% bf16 MFU | 1624082 tok/s step 10348/19560 | loss 3.370379 (-0.82z)| norm 0.2846 (-0.11z)| lr 2.89e-04 | 322.91 ms | 52.3% bf16 MFU | 1624058 tok/s step 10349/19560 | loss 3.343875 (-1.36z)| norm 0.2885 (+0.04z)| lr 2.89e-04 | 322.70 ms | 52.3% bf16 MFU | 1624091 tok/s step 10350/19560 | loss 3.429615 (+0.44z)| norm 0.2699 (-0.66z)| lr 2.89e-04 | 322.52 ms | 52.3% bf16 MFU | 1624166 tok/s step 10351/19560 | loss 3.381390 (-0.58z)| norm 0.3030 (+0.56z)| lr 2.89e-04 | 322.66 ms | 52.3% bf16 MFU | 1624203 tok/s step 10352/19560 | loss 3.326123 (-1.71z)| norm 0.2647 (-0.87z)| lr 2.89e-04 | 322.55 ms | 52.3% bf16 MFU | 1624264 tok/s step 10353/19560 | loss 3.393462 (-0.30z)| norm 0.2809 (-0.26z)| lr 2.89e-04 | 322.73 ms | 52.3% bf16 MFU | 1624278 tok/s step 10354/19560 | loss 3.401007 (-0.15z)| norm 0.2795 (-0.31z)| lr 2.89e-04 | 322.38 ms | 52.4% bf16 MFU | 1624379 tok/s step 10355/19560 | loss 3.376009 (-0.66z)| norm 0.2827 (-0.19z)| lr 2.89e-04 | 322.29 ms | 52.4% bf16 MFU | 1624498 tok/s step 10356/19560 | loss 3.374834 (-0.68z)| norm 0.2722 (-0.59z)| lr 2.89e-04 | 322.66 ms | 52.3% bf16 MFU | 1624518 tok/s step 10357/19560 | loss 3.435853 (+0.59z)| norm 0.2917 (+0.14z)| lr 2.89e-04 | 322.97 ms | 52.3% bf16 MFU | 1624458 tok/s step 10358/19560 | loss 3.462766 (+1.14z)| norm 0.2655 (-0.85z)| lr 2.89e-04 | 322.67 ms | 52.3% bf16 MFU | 1624478 tok/s step 10359/19560 | loss 3.363037 (-0.91z)| norm 0.2807 (-0.28z)| lr 2.89e-04 | 322.40 ms | 52.3% bf16 MFU | 1624565 tok/s step 10360/19560 | loss 3.346876 (-1.23z)| norm 0.2609 (-1.03z)| lr 2.89e-04 | 322.86 ms | 52.3% bf16 MFU | 1624530 tok/s step 10361/19560 | loss 3.395375 (-0.23z)| norm 0.2697 (-0.69z)| lr 2.89e-04 | 323.02 ms | 52.2% bf16 MFU | 1624456 tok/s step 10362/19560 | loss 3.409004 (+0.06z)| norm 0.2953 (+0.26z)| lr 2.88e-04 | 321.72 ms | 52.5% bf16 MFU | 1624716 tok/s step 10363/19560 | loss 3.305111 (-2.03z)| norm 0.2956 (+0.26z)| lr 2.88e-04 | 
322.98 ms | 52.3% bf16 MFU | 1624644 tok/s step 10364/19560 | loss 3.360792 (-0.88z)| norm 0.2778 (-0.43z)| lr 2.88e-04 | 322.91 ms | 52.3% bf16 MFU | 1624594 tok/s step 10365/19560 | loss 3.421930 (+0.38z)| norm 0.3005 (+0.44z)| lr 2.88e-04 | 322.59 ms | 52.3% bf16 MFU | 1624627 tok/s step 10366/19560 | loss 3.415609 (+0.25z)| norm 0.2658 (-0.89z)| lr 2.88e-04 | 322.76 ms | 52.3% bf16 MFU | 1624615 tok/s step 10367/19560 | loss 3.389968 (-0.28z)| norm 0.2797 (-0.35z)| lr 2.88e-04 | 322.52 ms | 52.3% bf16 MFU | 1624666 tok/s step 10368/19560 | loss 3.421594 (+0.37z)| norm 0.2594 (-1.12z)| lr 2.88e-04 | 322.99 ms | 52.3% bf16 MFU | 1624594 tok/s step 10369/19560 | loss 3.323472 (-1.64z)| norm 0.2624 (-1.00z)| lr 2.88e-04 | 323.07 ms | 52.2% bf16 MFU | 1624505 tok/s step 10370/19560 | loss 3.390385 (-0.25z)| norm 0.2701 (-0.70z)| lr 2.88e-04 | 322.77 ms | 52.3% bf16 MFU | 1624496 tok/s step 10371/19560 | loss 3.335990 (-1.39z)| norm 0.2608 (-1.05z)| lr 2.88e-04 | 322.33 ms | 52.4% bf16 MFU | 1624600 tok/s step 10372/19560 | loss 3.352979 (-1.03z)| norm 0.2808 (-0.29z)| lr 2.88e-04 | 323.09 ms | 52.2% bf16 MFU | 1624506 tok/s step 10373/19560 | loss 3.449191 (+0.98z)| norm 0.2916 (+0.11z)| lr 2.88e-04 | 323.26 ms | 52.2% bf16 MFU | 1624374 tok/s step 10374/19560 | loss 3.386197 (-0.34z)| norm 0.2927 (+0.15z)| lr 2.88e-04 | 322.47 ms | 52.3% bf16 MFU | 1624447 tok/s step 10375/19560 | loss 3.409414 (+0.15z)| norm 0.2628 (-0.98z)| lr 2.88e-04 | 322.92 ms | 52.3% bf16 MFU | 1624403 tok/s step 10376/19560 | loss 3.420921 (+0.38z)| norm 0.2738 (-0.56z)| lr 2.88e-04 | 322.60 ms | 52.3% bf16 MFU | 1624441 tok/s step 10377/19560 | loss 3.351826 (-1.06z)| norm 0.2722 (-0.62z)| lr 2.88e-04 | 322.82 ms | 52.3% bf16 MFU | 1624424 tok/s step 10378/19560 | loss 3.383608 (-0.40z)| norm 0.2567 (-1.19z)| lr 2.88e-04 | 322.58 ms | 52.3% bf16 MFU | 1624467 tok/s step 10379/19560 | loss 3.370326 (-0.67z)| norm 0.2521 (-1.34z)| lr 2.88e-04 | 323.13 ms | 52.2% bf16 MFU | 1624369 tok/s step 
10380/19560 | loss 3.405961 (+0.07z)| norm 0.2622 (-0.95z)| lr 2.88e-04 | 322.60 ms | 52.3% bf16 MFU | 1624412 tok/s step 10381/19560 | loss 3.420540 (+0.41z)| norm 0.2632 (-0.89z)| lr 2.88e-04 | 322.67 ms | 52.3% bf16 MFU | 1624434 tok/s step 10382/19560 | loss 3.418366 (+0.37z)| norm 0.2749 (-0.44z)| lr 2.87e-04 | 323.09 ms | 52.2% bf16 MFU | 1624349 tok/s step 10383/19560 | loss 3.386736 (-0.31z)| norm 0.2628 (-0.89z)| lr 2.87e-04 | 322.42 ms | 52.3% bf16 MFU | 1624436 tok/s step 10384/19560 | loss 3.442666 (+0.89z)| norm 0.3039 (+0.68z)| lr 2.87e-04 | 322.49 ms | 52.3% bf16 MFU | 1624501 tok/s step 10385/19560 | loss 3.434072 (+0.70z)| norm 0.2836 (-0.08z)| lr 2.87e-04 | 323.10 ms | 52.2% bf16 MFU | 1624411 tok/s step 10386/19560 | loss 3.433186 (+0.66z)| norm 0.3017 (+0.63z)| lr 2.87e-04 | 322.97 ms | 52.3% bf16 MFU | 1624357 tok/s step 10387/19560 | loss 3.351513 (-1.10z)| norm 0.2992 (+0.54z)| lr 2.87e-04 | 322.27 ms | 52.4% bf16 MFU | 1624482 tok/s step 10388/19560 | loss 3.374281 (-0.60z)| norm 0.2935 (+0.31z)| lr 2.87e-04 | 322.97 ms | 52.3% bf16 MFU | 1624424 tok/s step 10389/19560 | loss 3.478691 (+1.64z)| norm 0.2975 (+0.47z)| lr 2.87e-04 | 322.88 ms | 52.3% bf16 MFU | 1624391 tok/s step 10390/19560 | loss 3.347431 (-1.16z)| norm 0.2957 (+0.40z)| lr 2.87e-04 | 323.05 ms | 52.2% bf16 MFU | 1624317 tok/s step 10391/19560 | loss 3.368057 (-0.73z)| norm 0.3108 (+0.99z)| lr 2.87e-04 | 322.87 ms | 52.3% bf16 MFU | 1624294 tok/s step 10392/19560 | loss 3.404609 (+0.06z)| norm 0.2683 (-0.68z)| lr 2.87e-04 | 322.69 ms | 52.3% bf16 MFU | 1624316 tok/s step 10393/19560 | loss 3.428813 (+0.57z)| norm 0.2784 (-0.27z)| lr 2.87e-04 | 322.95 ms | 52.3% bf16 MFU | 1624272 tok/s step 10394/19560 | loss 3.411651 (+0.19z)| norm 0.2597 (-1.01z)| lr 2.87e-04 | 323.69 ms | 52.1% bf16 MFU | 1624045 tok/s step 10395/19560 | loss 3.395932 (-0.15z)| norm 0.2646 (-0.80z)| lr 2.87e-04 | 322.78 ms | 52.3% bf16 MFU | 1624057 tok/s step 10396/19560 | loss 3.359146 (-0.95z)| norm 
0.2532 (-1.23z)| lr 2.87e-04 | 322.90 ms | 52.3% bf16 MFU | 1624039 tok/s step 10397/19560 | loss 3.448559 (+0.99z)| norm 0.2614 (-0.91z)| lr 2.87e-04 | 323.03 ms | 52.2% bf16 MFU | 1623990 tok/s step 10398/19560 | loss 3.446652 (+0.94z)| norm 0.2631 (-0.83z)| lr 2.87e-04 | 322.87 ms | 52.3% bf16 MFU | 1623982 tok/s step 10399/19560 | loss 3.462704 (+1.27z)| norm 0.2617 (-0.90z)| lr 2.87e-04 | 323.17 ms | 52.2% bf16 MFU | 1623898 tok/s step 10400/19560 | loss 3.429826 (+0.55z)| norm 0.2639 (-0.79z)| lr 2.87e-04 | 323.59 ms | 52.2% bf16 MFU | 1623714 tok/s step 10401/19560 | loss 3.402732 (-0.04z)| norm 0.2514 (-1.33z)| lr 2.87e-04 | 323.49 ms | 52.2% bf16 MFU | 1623564 tok/s step 10402/19560 | loss 3.376420 (-0.61z)| norm 0.2776 (-0.16z)| lr 2.86e-04 | 322.46 ms | 52.3% bf16 MFU | 1623680 tok/s step 10403/19560 | loss 3.397403 (-0.15z)| norm 0.2652 (-0.70z)| lr 2.86e-04 | 322.39 ms | 52.4% bf16 MFU | 1623809 tok/s step 10404/19560 | loss 3.394708 (-0.20z)| norm 0.2719 (-0.39z)| lr 2.86e-04 | 323.91 ms | 52.1% bf16 MFU | 1623549 tok/s step 10405/19560 | loss 3.343357 (-1.31z)| norm 0.2573 (-1.07z)| lr 2.86e-04 | 322.71 ms | 52.3% bf16 MFU | 1623604 tok/s step 10406/19560 | loss 3.410275 (+0.16z)| norm 0.2821 (+0.14z)| lr 2.86e-04 | 323.07 ms | 52.2% bf16 MFU | 1623565 tok/s step 10407/19560 | loss 3.388117 (-0.33z)| norm 0.2661 (-0.63z)| lr 2.86e-04 | 323.12 ms | 52.2% bf16 MFU | 1623517 tok/s step 10408/19560 | loss 3.402112 (-0.02z)| norm 0.2934 (+0.71z)| lr 2.86e-04 | 323.05 ms | 52.2% bf16 MFU | 1623488 tok/s step 10409/19560 | loss 3.362937 (-0.87z)| norm 0.2612 (-0.86z)| lr 2.86e-04 | 322.93 ms | 52.3% bf16 MFU | 1623490 tok/s step 10410/19560 | loss 3.410594 (+0.18z)| norm 0.2935 (+0.73z)| lr 2.86e-04 | 323.29 ms | 52.2% bf16 MFU | 1623402 tok/s step 10411/19560 | loss 3.342582 (-1.30z)| norm 0.2819 (+0.16z)| lr 2.86e-04 | 322.97 ms | 52.3% bf16 MFU | 1623398 tok/s step 10412/19560 | loss 3.391511 (-0.23z)| norm 0.2721 (-0.32z)| lr 2.86e-04 | 322.92 ms | 
52.3% bf16 MFU | 1623406 tok/s step 10413/19560 | loss 3.411507 (+0.21z)| norm 0.2828 (+0.21z)| lr 2.86e-04 | 323.40 ms | 52.2% bf16 MFU | 1623294 tok/s step 10414/19560 | loss 3.424620 (+0.51z)| norm 0.2589 (-0.95z)| lr 2.86e-04 | 322.98 ms | 52.3% bf16 MFU | 1623293 tok/s step 10415/19560 | loss 3.359536 (-0.95z)| norm 0.2617 (-0.81z)| lr 2.86e-04 | 323.18 ms | 52.2% bf16 MFU | 1623241 tok/s step 10416/19560 | loss 3.395655 (-0.14z)| norm 0.2880 (+0.48z)| lr 2.86e-04 | 323.11 ms | 52.2% bf16 MFU | 1623210 tok/s step 10417/19560 | loss 3.399389 (-0.07z)| norm 0.2802 (+0.09z)| lr 2.86e-04 | 322.78 ms | 52.3% bf16 MFU | 1623263 tok/s step 10418/19560 | loss 3.398056 (-0.10z)| norm 0.2768 (-0.09z)| lr 2.86e-04 | 322.75 ms | 52.3% bf16 MFU | 1623321 tok/s step 10419/19560 | loss 3.399356 (-0.06z)| norm 0.2783 (-0.02z)| lr 2.86e-04 | 323.64 ms | 52.1% bf16 MFU | 1623154 tok/s step 10420/19560 | loss 3.383693 (-0.41z)| norm 0.2608 (-0.89z)| lr 2.86e-04 | 323.56 ms | 52.2% bf16 MFU | 1623014 tok/s step 10421/19560 | loss 3.404852 (+0.06z)| norm 0.2590 (-0.97z)| lr 2.86e-04 | 323.21 ms | 52.2% bf16 MFU | 1622970 tok/s step 10422/19560 | loss 3.352409 (-1.11z)| norm 0.2632 (-0.76z)| lr 2.85e-04 | 322.95 ms | 52.3% bf16 MFU | 1622993 tok/s step 10423/19560 | loss 3.395986 (-0.13z)| norm 0.2634 (-0.76z)| lr 2.85e-04 | 323.35 ms | 52.2% bf16 MFU | 1622914 tok/s step 10424/19560 | loss 3.390321 (-0.26z)| norm 0.2656 (-0.66z)| lr 2.85e-04 | 322.53 ms | 52.3% bf16 MFU | 1623045 tok/s step 10425/19560 | loss 3.407266 (+0.13z)| norm 0.2692 (-0.49z)| lr 2.85e-04 | 322.27 ms | 52.4% bf16 MFU | 1623237 tok/s step 10426/19560 | loss 3.401907 (+0.01z)| norm 0.2776 (-0.07z)| lr 2.85e-04 | 323.83 ms | 52.1% bf16 MFU | 1623027 tok/s step 10427/19560 | loss 3.375858 (-0.59z)| norm 0.2901 (+0.55z)| lr 2.85e-04 | 323.45 ms | 52.2% bf16 MFU | 1622922 tok/s step 10428/19560 | loss 3.432302 (+0.69z)| norm 0.2541 (-1.29z)| lr 2.85e-04 | 322.52 ms | 52.3% bf16 MFU | 1623055 tok/s step 10429/19560 
| loss 3.358898 (-0.97z)| norm 0.2734 (-0.32z)| lr 2.85e-04 | 322.48 ms | 52.3% bf16 MFU | 1623193 tok/s step 10430/19560 | loss 3.444359 (+0.96z)| norm 0.2910 (+0.58z)| lr 2.85e-04 | 323.26 ms | 52.2% bf16 MFU | 1623127 tok/s step 10431/19560 | loss 3.400130 (-0.05z)| norm 0.2861 (+0.32z)| lr 2.85e-04 | 323.19 ms | 52.2% bf16 MFU | 1623083 tok/s step 10432/19560 | loss 3.395468 (-0.17z)| norm 0.3035 (+1.20z)| lr 2.85e-04 | 322.88 ms | 52.3% bf16 MFU | 1623117 tok/s step 10433/19560 | loss 3.453726 (+1.16z)| norm 0.2801 (-0.01z)| lr 2.85e-04 | 322.96 ms | 52.3% bf16 MFU | 1623131 tok/s step 10434/19560 | loss 3.368471 (-0.78z)| norm 0.2816 (+0.06z)| lr 2.85e-04 | 322.81 ms | 52.3% bf16 MFU | 1623181 tok/s step 10435/19560 | loss 3.306954 (-2.17z)| norm 0.2754 (-0.26z)| lr 2.85e-04 | 322.84 ms | 52.3% bf16 MFU | 1623222 tok/s step 10436/19560 | loss 3.440907 (+0.92z)| norm 0.2646 (-0.83z)| lr 2.85e-04 | 322.64 ms | 52.3% bf16 MFU | 1623311 tok/s step 10437/19560 | loss 3.433879 (+0.76z)| norm 0.2755 (-0.26z)| lr 2.85e-04 | 323.37 ms | 52.2% bf16 MFU | 1623211 tok/s step 10438/19560 | loss 3.419759 (+0.43z)| norm 0.2662 (-0.76z)| lr 2.85e-04 | 322.79 ms | 52.3% bf16 MFU | 1623261 tok/s step 10439/19560 | loss 3.445274 (+1.01z)| norm 0.2607 (-1.03z)| lr 2.85e-04 | 323.93 ms | 52.1% bf16 MFU | 1623024 tok/s step 10440/19560 | loss 3.353015 (-1.12z)| norm 0.2764 (-0.21z)| lr 2.85e-04 | 323.02 ms | 52.2% bf16 MFU | 1623026 tok/s step 10441/19560 | loss 3.383676 (-0.38z)| norm 0.2655 (-0.79z)| lr 2.85e-04 | 323.09 ms | 52.2% bf16 MFU | 1623010 tok/s step 10442/19560 | loss 3.391340 (-0.19z)| norm 0.2872 (+0.35z)| lr 2.84e-04 | 322.97 ms | 52.3% bf16 MFU | 1623028 tok/s step 10443/19560 | loss 3.335047 (-1.51z)| norm 0.2737 (-0.36z)| lr 2.84e-04 | 323.84 ms | 52.1% bf16 MFU | 1622824 tok/s step 10444/19560 | loss 3.437990 (+0.92z)| norm 0.2579 (-1.18z)| lr 2.84e-04 | 323.32 ms | 52.2% bf16 MFU | 1622762 tok/s step 10445/19560 | loss 3.464849 (+1.52z)| norm 0.2758 (-0.24z)| 
lr 2.84e-04 | 322.99 ms | 52.3% bf16 MFU | 1622786 tok/s step 10446/19560 | loss 3.384938 (-0.35z)| norm 0.2496 (-1.63z)| lr 2.84e-04 | 323.32 ms | 52.2% bf16 MFU | 1622725 tok/s step 10447/19560 | loss 3.440897 (+1.06z)| norm 0.2973 (+0.96z)| lr 2.84e-04 | 324.11 ms | 52.1% bf16 MFU | 1622471 tok/s step 10448/19560 | loss 3.479510 (+2.00z)| norm 0.2669 (-0.68z)| lr 2.84e-04 | 322.54 ms | 52.3% bf16 MFU | 1622623 tok/s step 10449/19560 | loss 3.340700 (-1.42z)| norm 0.2733 (-0.32z)| lr 2.84e-04 | 324.15 ms | 52.1% bf16 MFU | 1622362 tok/s step 10450/19560 | loss 3.330201 (-1.68z)| norm 0.2616 (-0.95z)| lr 2.84e-04 | 323.38 ms | 52.2% bf16 MFU | 1622309 tok/s step 10451/19560 | loss 3.415502 (+0.51z)| norm 0.2740 (-0.26z)| lr 2.84e-04 | 323.30 ms | 52.2% bf16 MFU | 1622278 tok/s step 10452/19560 | loss 3.421350 (+0.67z)| norm 0.2655 (-0.72z)| lr 2.84e-04 | 324.07 ms | 52.1% bf16 MFU | 1622056 tok/s step 10453/19560 | loss 3.430697 (+0.91z)| norm 0.2762 (-0.11z)| lr 2.84e-04 | 322.70 ms | 52.3% bf16 MFU | 1622187 tok/s step 10454/19560 | loss 3.368309 (-0.72z)| norm 0.2784 (+0.02z)| lr 2.84e-04 | 322.99 ms | 52.3% bf16 MFU | 1622239 tok/s step 10455/19560 | loss 3.388556 (-0.19z)| norm 0.2750 (-0.15z)| lr 2.84e-04 | 323.70 ms | 52.1% bf16 MFU | 1622110 tok/s step 10456/19560 | loss 3.390249 (-0.16z)| norm 0.3064 (+1.71z)| lr 2.84e-04 | 323.26 ms | 52.2% bf16 MFU | 1622098 tok/s step 10457/19560 | loss 3.384296 (-0.30z)| norm 0.2641 (-0.79z)| lr 2.84e-04 | 323.17 ms | 52.2% bf16 MFU | 1622111 tok/s step 10458/19560 | loss 3.403877 (+0.25z)| norm 0.2894 (+0.79z)| lr 2.84e-04 | 322.98 ms | 52.3% bf16 MFU | 1622169 tok/s step 10459/19560 | loss 3.465446 (+1.94z)| norm 0.2757 (-0.05z)| lr 2.84e-04 | 323.30 ms | 52.2% bf16 MFU | 1622143 tok/s step 10460/19560 | loss 3.366281 (-0.82z)| norm 0.2820 (+0.35z)| lr 2.84e-04 | 323.15 ms | 52.2% bf16 MFU | 1622158 tok/s step 10461/19560 | loss 3.375916 (-0.55z)| norm 0.2863 (+0.62z)| lr 2.84e-04 | 323.05 ms | 52.2% bf16 MFU | 
1622197 tok/s step 10462/19560 | loss 3.420750 (+0.70z)| norm 0.2761 (-0.01z)| lr 2.83e-04 | 323.01 ms | 52.3% bf16 MFU | 1622244 tok/s step 10463/19560 | loss 3.412811 (+0.47z)| norm 0.2491 (-1.80z)| lr 2.83e-04 | 322.69 ms | 52.3% bf16 MFU | 1622369 tok/s step 10464/19560 | loss 3.457704 (+1.69z)| norm 0.2646 (-0.75z)| lr 2.83e-04 | 323.12 ms | 52.2% bf16 MFU | 1622381 tok/s step 10465/19560 | loss 3.378021 (-0.50z)| norm 0.2614 (-0.96z)| lr 2.83e-04 | 323.24 ms | 52.2% bf16 MFU | 1622362 tok/s step 10466/19560 | loss 3.350063 (-1.26z)| norm 0.2585 (-1.14z)| lr 2.83e-04 | 323.23 ms | 52.2% bf16 MFU | 1622345 tok/s step 10467/19560 | loss 3.379678 (-0.44z)| norm 0.2703 (-0.33z)| lr 2.83e-04 | 322.51 ms | 52.3% bf16 MFU | 1622511 tok/s step 10468/19560 | loss 3.356075 (-1.07z)| norm 0.2678 (-0.48z)| lr 2.83e-04 | 323.17 ms | 52.2% bf16 MFU | 1622501 tok/s step 10469/19560 | loss 3.372047 (-0.63z)| norm 0.2769 (+0.12z)| lr 2.83e-04 | 322.87 ms | 52.3% bf16 MFU | 1622568 tok/s step 10470/19560 | loss 3.386011 (-0.25z)| norm 0.2441 (-2.18z)| lr 2.83e-04 | 322.57 ms | 52.3% bf16 MFU | 1622707 tok/s step 10471/19560 | loss 3.407826 (+0.35z)| norm 0.2623 (-0.86z)| lr 2.83e-04 | 323.19 ms | 52.2% bf16 MFU | 1622684 tok/s step 10472/19560 | loss 3.337674 (-1.55z)| norm 0.2522 (-1.56z)| lr 2.83e-04 | 323.05 ms | 52.2% bf16 MFU | 1622696 tok/s step 10473/19560 | loss 3.451560 (+1.52z)| norm 0.2679 (-0.44z)| lr 2.83e-04 | 323.00 ms | 52.3% bf16 MFU | 1622719 tok/s step 10474/19560 | loss 3.358530 (-1.00z)| norm 0.2649 (-0.66z)| lr 2.83e-04 | 322.81 ms | 52.3% bf16 MFU | 1622791 tok/s step 10475/19560 | loss 3.379099 (-0.43z)| norm 0.2932 (+1.35z)| lr 2.83e-04 | 322.80 ms | 52.3% bf16 MFU | 1622861 tok/s step 10476/19560 | loss 3.479181 (+2.22z)| norm 0.2867 (+0.89z)| lr 2.83e-04 | 322.69 ms | 52.3% bf16 MFU | 1622955 tok/s step 10477/19560 | loss 3.396650 (+0.01z)| norm 0.2886 (+1.02z)| lr 2.83e-04 | 322.76 ms | 52.3% bf16 MFU | 1623027 tok/s step 10478/19560 | loss 3.412914 
(+0.45z)| norm 0.2678 (-0.46z)| lr 2.83e-04 | 323.31 ms | 52.2% bf16 MFU | 1622957 tok/s step 10479/19560 | loss 3.380953 (-0.41z)| norm 0.3218 (+3.28z)| lr 2.83e-04 | 322.48 ms | 52.3% bf16 MFU | 1623100 tok/s step 10480/19560 | loss 3.412162 (+0.42z)| norm 0.2848 (+0.71z)| lr 2.83e-04 | 322.70 ms | 52.3% bf16 MFU | 1623180 tok/s step 10481/19560 | loss 3.407699 (+0.29z)| norm 0.2852 (+0.74z)| lr 2.83e-04 | 323.56 ms | 52.2% bf16 MFU | 1623040 tok/s step 10482/19560 | loss 3.372974 (-0.65z)| norm 0.3041 (+2.00z)| lr 2.82e-04 | 322.86 ms | 52.3% bf16 MFU | 1623082 tok/s step 10483/19560 | loss 3.391312 (-0.15z)| norm 0.3707 (+5.62z)| lr 2.82e-04 | 322.40 ms | 52.3% bf16 MFU | 1623237 tok/s step 10484/19560 | loss 3.415012 (+0.49z)| norm 0.2829 (+0.44z)| lr 2.82e-04 | 323.27 ms | 52.2% bf16 MFU | 1623167 tok/s step 10485/19560 | loss 3.403718 (+0.19z)| norm 0.3150 (+2.28z)| lr 2.82e-04 | 322.97 ms | 52.3% bf16 MFU | 1623175 tok/s step 10486/19560 | loss 3.357488 (-1.07z)| norm 0.3122 (+2.06z)| lr 2.82e-04 | 322.39 ms | 52.4% bf16 MFU | 1623329 tok/s step 10487/19560 | loss 3.421847 (+0.70z)| norm 0.3097 (+1.88z)| lr 2.82e-04 | 323.12 ms | 52.2% bf16 MFU | 1623291 tok/s step 10488/19560 | loss 3.371572 (-0.70z)| norm 0.2781 (+0.10z)| lr 2.82e-04 | 322.77 ms | 52.3% bf16 MFU | 1623345 tok/s step 10489/19560 | loss 3.367743 (-0.80z)| norm 0.3168 (+2.21z)| lr 2.82e-04 | 323.01 ms | 52.2% bf16 MFU | 1623334 tok/s step 10490/19560 | loss 3.402845 (+0.18z)| norm 0.2715 (-0.28z)| lr 2.82e-04 | 322.85 ms | 52.3% bf16 MFU | 1623364 tok/s step 10491/19560 | loss 3.376569 (-0.58z)| norm 0.3210 (+2.41z)| lr 2.82e-04 | 323.07 ms | 52.2% bf16 MFU | 1623338 tok/s step 10492/19560 | loss 3.378131 (-0.55z)| norm 0.2741 (-0.15z)| lr 2.82e-04 | 323.27 ms | 52.2% bf16 MFU | 1623263 tok/s step 10493/19560 | loss 3.419372 (+0.63z)| norm 0.2708 (-0.31z)| lr 2.82e-04 | 322.65 ms | 52.3% bf16 MFU | 1623348 tok/s step 10494/19560 | loss 3.351162 (-1.29z)| norm 0.2839 (+0.39z)| lr 2.82e-04 | 
323.18 ms | 52.2% bf16 MFU | 1623295 tok/s
step 10495/19560 | loss 3.383709 (-0.37z)| norm 0.2725 (-0.23z)| lr 2.82e-04 | 322.29 ms | 52.4% bf16 MFU | 1623468 tok/s
step 10496/19560 | loss 3.376975 (-0.55z)| norm 0.2668 (-0.54z)| lr 2.82e-04 | 323.18 ms | 52.2% bf16 MFU | 1623410 tok/s
step 10497/19560 | loss 3.334727 (-1.76z)| norm 0.2743 (-0.14z)| lr 2.82e-04 | 322.67 ms | 52.3% bf16 MFU | 1623481 tok/s
step 10498/19560 | loss 3.441298 (+1.26z)| norm 0.2577 (-1.04z)| lr 2.82e-04 | 322.98 ms | 52.3% bf16 MFU | 1623471 tok/s
step 10499/19560 | loss 3.370346 (-0.77z)| norm 0.2615 (-0.83z)| lr 2.82e-04 | 322.60 ms | 52.3% bf16 MFU | 1623558 tok/s
step 10500/19560 | loss 3.413627 (+0.46z)| norm 0.2464 (-1.63z)| lr 2.82e-04 | 322.54 ms | 52.3% bf16 MFU | 1623654 tok/s
val loss 3.384710
evaluating HellaSwag: 0/79
evaluating HellaSwag: 10/79
evaluating HellaSwag: 20/79
evaluating HellaSwag: 30/79
evaluating HellaSwag: 40/79
evaluating HellaSwag: 50/79
evaluating HellaSwag: 60/79
evaluating HellaSwag: 70/79
HellaSwag: 2921/10042 = 0.290878
step 10501/19560 | loss 3.435599 (+1.10z)| norm 0.2781 (+0.10z)| lr 2.82e-04 | 322.10 ms | 52.4% bf16 MFU | 1623857 tok/s
step 10502/19560 | loss 3.385600 (-0.34z)| norm 0.2614 (-0.80z)| lr 2.81e-04 | 322.94 ms | 52.3% bf16 MFU | 1623838 tok/s
step 10503/19560 | loss 3.422592 (+0.72z)| norm 0.2651 (-0.60z)| lr 2.81e-04 | 323.41 ms | 52.2% bf16 MFU | 1623702 tok/s
step 10504/19560 | loss 3.335410 (-1.76z)| norm 0.2615 (-0.79z)| lr 2.81e-04 | 323.10 ms | 52.2% bf16 MFU | 1623651 tok/s
step 10505/19560 | loss 3.414510 (+0.49z)| norm 0.2687 (-0.40z)| lr 2.81e-04 | 322.75 ms | 52.3% bf16 MFU | 1623689 tok/s
step 10506/19560 | loss 3.423794 (+0.75z)| norm 0.2663 (-0.53z)| lr 2.81e-04 | 322.79 ms | 52.3% bf16 MFU | 1623716 tok/s
step 10507/19560 | loss 3.404167 (+0.18z)| norm 0.2661 (-0.55z)| lr 2.81e-04 | 322.90 ms | 52.3% bf16 MFU | 1623715 tok/s
step 10508/19560 | loss 3.349413 (-1.37z)| norm 0.2554 (-1.13z)| lr 2.81e-04 | 323.20 ms | 52.2%
bf16 MFU | 1623638 tok/s step 10509/19560 | loss 3.428109 (+0.87z)| norm 0.2621 (-0.76z)| lr 2.81e-04 | 323.04 ms | 52.2% bf16 MFU | 1623606 tok/s step 10510/19560 | loss 3.353843 (-1.23z)| norm 0.2651 (-0.59z)| lr 2.81e-04 | 322.91 ms | 52.3% bf16 MFU | 1623607 tok/s step 10511/19560 | loss 3.362791 (-0.96z)| norm 0.2727 (-0.19z)| lr 2.81e-04 | 322.74 ms | 52.3% bf16 MFU | 1623652 tok/s step 10512/19560 | loss 3.340265 (-1.57z)| norm 0.2434 (-1.76z)| lr 2.81e-04 | 323.12 ms | 52.2% bf16 MFU | 1623597 tok/s step 10513/19560 | loss 3.416067 (+0.57z)| norm 0.2680 (-0.41z)| lr 2.81e-04 | 322.68 ms | 52.3% bf16 MFU | 1623656 tok/s step 10514/19560 | loss 3.314539 (-2.24z)| norm 0.2678 (-0.41z)| lr 2.81e-04 | 322.65 ms | 52.3% bf16 MFU | 1623721 tok/s step 10515/19560 | loss 3.427630 (+0.90z)| norm 0.2902 (+0.83z)| lr 2.81e-04 | 322.85 ms | 52.3% bf16 MFU | 1623731 tok/s step 10516/19560 | loss 3.443986 (+1.33z)| norm 0.2972 (+1.22z)| lr 2.81e-04 | 322.95 ms | 52.3% bf16 MFU | 1623717 tok/s step 10517/19560 | loss 3.417633 (+0.62z)| norm 0.2729 (-0.12z)| lr 2.81e-04 | 322.53 ms | 52.3% bf16 MFU | 1623808 tok/s step 10518/19560 | loss 3.544463 (+3.94z)| norm 0.3044 (+1.62z)| lr 2.81e-04 | 323.18 ms | 52.2% bf16 MFU | 1623732 tok/s step 10519/19560 | loss 3.323203 (-1.94z)| norm 0.3002 (+1.41z)| lr 2.81e-04 | 323.14 ms | 52.2% bf16 MFU | 1623669 tok/s step 10520/19560 | loss 3.349583 (-1.23z)| norm 0.2776 (+0.14z)| lr 2.81e-04 | 322.26 ms | 52.4% bf16 MFU | 1623831 tok/s step 10521/19560 | loss 3.390671 (-0.14z)| norm 0.2801 (+0.28z)| lr 2.81e-04 | 323.14 ms | 52.2% bf16 MFU | 1623763 tok/s step 10522/19560 | loss 3.392869 (-0.08z)| norm 0.2803 (+0.28z)| lr 2.80e-04 | 322.65 ms | 52.3% bf16 MFU | 1623822 tok/s step 10523/19560 | loss 3.325493 (-1.82z)| norm 0.2720 (-0.18z)| lr 2.80e-04 | 322.87 ms | 52.3% bf16 MFU | 1623823 tok/s step 10524/19560 | loss 3.377193 (-0.48z)| norm 0.2621 (-0.74z)| lr 2.80e-04 | 322.94 ms | 52.3% bf16 MFU | 1623807 tok/s step 10525/19560 | 
loss 3.366385 (-0.75z)| norm 0.2830 (+0.42z)| lr 2.80e-04 | 323.11 ms | 52.2% bf16 MFU | 1623748 tok/s step 10526/19560 | loss 3.542967 (+3.68z)| norm 0.2876 (+0.67z)| lr 2.80e-04 | 322.69 ms | 52.3% bf16 MFU | 1623798 tok/s step 10527/19560 | loss 3.397644 (+0.06z)| norm 0.2704 (-0.30z)| lr 2.80e-04 | 322.86 ms | 52.3% bf16 MFU | 1623803 tok/s step 10528/19560 | loss 3.478710 (+2.07z)| norm 0.3018 (+1.44z)| lr 2.80e-04 | 323.38 ms | 52.2% bf16 MFU | 1623675 tok/s step 10529/19560 | loss 3.420772 (+0.62z)| norm 0.2877 (+0.64z)| lr 2.80e-04 | 322.25 ms | 52.4% bf16 MFU | 1623838 tok/s step 10530/19560 | loss 3.448711 (+1.30z)| norm 0.3030 (+1.48z)| lr 2.80e-04 | 323.00 ms | 52.3% bf16 MFU | 1623806 tok/s step 10531/19560 | loss 3.353371 (-1.05z)| norm 0.3382 (+3.27z)| lr 2.80e-04 | 322.37 ms | 52.4% bf16 MFU | 1623935 tok/s step 10532/19560 | loss 3.426147 (+0.74z)| norm 0.2836 (+0.34z)| lr 2.80e-04 | 322.59 ms | 52.3% bf16 MFU | 1624001 tok/s step 10533/19560 | loss 3.571831 (+4.01z)| norm 0.3258 (+2.51z)| lr 2.80e-04 | 323.14 ms | 52.2% bf16 MFU | 1623925 tok/s step 10534/19560 | loss 3.399270 (+0.03z)| norm 0.2719 (-0.30z)| lr 2.80e-04 | 323.01 ms | 52.2% bf16 MFU | 1623886 tok/s step 10535/19560 | loss 3.403959 (+0.14z)| norm 0.2977 (+1.03z)| lr 2.80e-04 | 322.21 ms | 52.4% bf16 MFU | 1624049 tok/s step 10536/19560 | loss 3.385666 (-0.28z)| norm 0.2928 (+0.78z)| lr 2.80e-04 | 323.26 ms | 52.2% bf16 MFU | 1623942 tok/s step 10537/19560 | loss 3.357148 (-0.94z)| norm 0.2952 (+0.89z)| lr 2.80e-04 | 323.25 ms | 52.2% bf16 MFU | 1623841 tok/s step 10538/19560 | loss 3.405941 (+0.19z)| norm 0.2824 (+0.23z)| lr 2.80e-04 | 322.39 ms | 52.4% bf16 MFU | 1623963 tok/s step 10539/19560 | loss 3.344558 (-1.23z)| norm 0.2969 (+0.98z)| lr 2.80e-04 | 322.70 ms | 52.3% bf16 MFU | 1624000 tok/s step 10540/19560 | loss 3.443327 (+1.04z)| norm 0.2883 (+0.52z)| lr 2.80e-04 | 322.30 ms | 52.4% bf16 MFU | 1624136 tok/s step 10541/19560 | loss 3.371336 (-0.61z)| norm 0.2808 (+0.13z)| 
lr 2.80e-04 | 322.66 ms | 52.3% bf16 MFU | 1624173 tok/s step 10542/19560 | loss 3.399459 (+0.04z)| norm 0.2840 (+0.29z)| lr 2.79e-04 | 323.12 ms | 52.2% bf16 MFU | 1624093 tok/s step 10543/19560 | loss 3.486651 (+2.00z)| norm 0.2890 (+0.54z)| lr 2.79e-04 | 322.88 ms | 52.3% bf16 MFU | 1624078 tok/s step 10544/19560 | loss 3.428638 (+0.67z)| norm 0.2936 (+0.78z)| lr 2.79e-04 | 322.92 ms | 52.3% bf16 MFU | 1624053 tok/s step 10545/19560 | loss 3.504817 (+2.33z)| norm 0.3025 (+1.23z)| lr 2.79e-04 | 322.50 ms | 52.3% bf16 MFU | 1624135 tok/s step 10546/19560 | loss 3.433175 (+0.73z)| norm 0.3043 (+1.30z)| lr 2.79e-04 | 322.83 ms | 52.3% bf16 MFU | 1624129 tok/s step 10547/19560 | loss 3.416236 (+0.36z)| norm 0.2993 (+1.03z)| lr 2.79e-04 | 322.90 ms | 52.3% bf16 MFU | 1624107 tok/s step 10548/19560 | loss 3.394439 (-0.13z)| norm 0.2814 (+0.10z)| lr 2.79e-04 | 322.51 ms | 52.3% bf16 MFU | 1624184 tok/s step 10549/19560 | loss 3.411248 (+0.24z)| norm 0.2706 (-0.47z)| lr 2.79e-04 | 322.87 ms | 52.3% bf16 MFU | 1624166 tok/s step 10550/19560 | loss 3.453293 (+1.15z)| norm 0.2850 (+0.27z)| lr 2.79e-04 | 322.71 ms | 52.3% bf16 MFU | 1624189 tok/s step 10551/19560 | loss 3.347005 (-1.18z)| norm 0.2874 (+0.39z)| lr 2.79e-04 | 323.43 ms | 52.2% bf16 MFU | 1624031 tok/s step 10552/19560 | loss 3.451589 (+1.10z)| norm 0.2703 (-0.50z)| lr 2.79e-04 | 322.37 ms | 52.4% bf16 MFU | 1624147 tok/s step 10553/19560 | loss 3.520417 (+2.52z)| norm 0.2929 (+0.67z)| lr 2.79e-04 | 322.95 ms | 52.3% bf16 MFU | 1624111 tok/s step 10554/19560 | loss 3.435887 (+0.71z)| norm 0.2899 (+0.50z)| lr 2.79e-04 | 322.39 ms | 52.3% bf16 MFU | 1624217 tok/s step 10555/19560 | loss 3.412436 (+0.21z)| norm 0.2697 (-0.54z)| lr 2.79e-04 | 323.16 ms | 52.2% bf16 MFU | 1624126 tok/s step 10556/19560 | loss 3.396241 (-0.13z)| norm 0.2993 (+0.99z)| lr 2.79e-04 | 322.60 ms | 52.3% bf16 MFU | 1624180 tok/s step 10557/19560 | loss 3.429553 (+0.57z)| norm 0.2811 (+0.03z)| lr 2.79e-04 | 322.29 ms | 52.4% bf16 MFU | 
1624308 tok/s step 10558/19560 | loss 3.378884 (-0.50z)| norm 0.2939 (+0.70z)| lr 2.79e-04 | 323.33 ms | 52.2% bf16 MFU | 1624170 tok/s step 10559/19560 | loss 3.396460 (-0.13z)| norm 0.2899 (+0.49z)| lr 2.79e-04 | 323.11 ms | 52.2% bf16 MFU | 1624092 tok/s step 10560/19560 | loss 3.404680 (+0.05z)| norm 0.2650 (-0.80z)| lr 2.79e-04 | 323.26 ms | 52.2% bf16 MFU | 1623980 tok/s step 10561/19560 | loss 3.388521 (-0.29z)| norm 0.2871 (+0.36z)| lr 2.79e-04 | 322.07 ms | 52.4% bf16 MFU | 1624174 tok/s step 10562/19560 | loss 3.346339 (-1.19z)| norm 0.2716 (-0.45z)| lr 2.78e-04 | 323.68 ms | 52.1% bf16 MFU | 1623953 tok/s step 10563/19560 | loss 3.350066 (-1.13z)| norm 0.2590 (-1.10z)| lr 2.78e-04 | 323.20 ms | 52.2% bf16 MFU | 1623865 tok/s step 10564/19560 | loss 3.406353 (+0.10z)| norm 0.2567 (-1.21z)| lr 2.78e-04 | 322.84 ms | 52.3% bf16 MFU | 1623871 tok/s step 10565/19560 | loss 3.341236 (-1.30z)| norm 0.2883 (+0.42z)| lr 2.78e-04 | 322.92 ms | 52.3% bf16 MFU | 1623856 tok/s step 10566/19560 | loss 3.444220 (+0.93z)| norm 0.2634 (-0.86z)| lr 2.78e-04 | 322.57 ms | 52.3% bf16 MFU | 1623930 tok/s step 10567/19560 | loss 3.485652 (+1.80z)| norm 0.3305 (+2.53z)| lr 2.78e-04 | 323.17 ms | 52.2% bf16 MFU | 1623850 tok/s step 10568/19560 | loss 3.380045 (-0.47z)| norm 0.3281 (+2.33z)| lr 2.78e-04 | 323.16 ms | 52.2% bf16 MFU | 1623776 tok/s step 10569/19560 | loss 3.385063 (-0.36z)| norm 0.2716 (-0.47z)| lr 2.78e-04 | 322.79 ms | 52.3% bf16 MFU | 1623800 tok/s step 10570/19560 | loss 3.436783 (+0.74z)| norm 0.2883 (+0.36z)| lr 2.78e-04 | 323.07 ms | 52.2% bf16 MFU | 1623751 tok/s step 10571/19560 | loss 3.387167 (-0.33z)| norm 0.2752 (-0.30z)| lr 2.78e-04 | 322.75 ms | 52.3% bf16 MFU | 1623784 tok/s step 10572/19560 | loss 3.387872 (-0.31z)| norm 0.3018 (+1.01z)| lr 2.78e-04 | 323.62 ms | 52.2% bf16 MFU | 1623599 tok/s step 10573/19560 | loss 3.464168 (+1.34z)| norm 0.2926 (+0.54z)| lr 2.78e-04 | 322.86 ms | 52.3% bf16 MFU | 1623613 tok/s step 10574/19560 | loss 3.396019 
(-0.13z)| norm 0.2612 (-1.02z)| lr 2.78e-04 | 322.83 ms | 52.3% bf16 MFU | 1623635 tok/s step 10575/19560 | loss 3.444334 (+0.91z)| norm 0.3076 (+1.29z)| lr 2.78e-04 | 323.16 ms | 52.2% bf16 MFU | 1623573 tok/s step 10576/19560 | loss 3.400764 (-0.02z)| norm 0.2519 (-1.47z)| lr 2.78e-04 | 323.19 ms | 52.2% bf16 MFU | 1623506 tok/s step 10577/19560 | loss 3.409422 (+0.16z)| norm 0.2681 (-0.67z)| lr 2.78e-04 | 322.55 ms | 52.3% bf16 MFU | 1623604 tok/s step 10578/19560 | loss 3.409169 (+0.14z)| norm 0.2590 (-1.12z)| lr 2.78e-04 | 323.22 ms | 52.2% bf16 MFU | 1623528 tok/s step 10579/19560 | loss 3.393572 (-0.20z)| norm 0.2695 (-0.60z)| lr 2.78e-04 | 323.02 ms | 52.2% bf16 MFU | 1623504 tok/s step 10580/19560 | loss 3.357503 (-0.99z)| norm 0.2530 (-1.40z)| lr 2.78e-04 | 323.25 ms | 52.2% bf16 MFU | 1623426 tok/s step 10581/19560 | loss 3.553186 (+3.20z)| norm 0.2820 (+0.03z)| lr 2.78e-04 | 323.24 ms | 52.2% bf16 MFU | 1623353 tok/s step 10582/19560 | loss 3.335814 (-1.42z)| norm 0.2620 (-0.95z)| lr 2.77e-04 | 322.39 ms | 52.4% bf16 MFU | 1623498 tok/s step 10583/19560 | loss 3.383623 (-0.41z)| norm 0.2633 (-0.88z)| lr 2.77e-04 | 322.88 ms | 52.3% bf16 MFU | 1623514 tok/s step 10584/19560 | loss 3.369281 (-0.71z)| norm 0.2592 (-1.06z)| lr 2.77e-04 | 323.03 ms | 52.2% bf16 MFU | 1623489 tok/s step 10585/19560 | loss 3.471130 (+1.42z)| norm 0.2818 (+0.04z)| lr 2.77e-04 | 322.80 ms | 52.3% bf16 MFU | 1623525 tok/s step 10586/19560 | loss 3.391852 (-0.24z)| norm 0.2874 (+0.31z)| lr 2.77e-04 | 323.10 ms | 52.2% bf16 MFU | 1623482 tok/s step 10587/19560 | loss 3.410735 (+0.17z)| norm 0.2746 (-0.32z)| lr 2.77e-04 | 322.65 ms | 52.3% bf16 MFU | 1623556 tok/s step 10588/19560 | loss 3.406009 (+0.06z)| norm 0.2792 (-0.09z)| lr 2.77e-04 | 323.05 ms | 52.2% bf16 MFU | 1623524 tok/s step 10589/19560 | loss 3.431438 (+0.59z)| norm 0.2673 (-0.66z)| lr 2.77e-04 | 323.36 ms | 52.2% bf16 MFU | 1623418 tok/s step 10590/19560 | loss 3.341713 (-1.29z)| norm 0.2742 (-0.32z)| lr 2.77e-04 | 
322.49 ms | 52.3% bf16 MFU | 1623533 tok/s step 10591/19560 | loss 3.436241 (+0.70z)| norm 0.2697 (-0.56z)| lr 2.77e-04 | 323.15 ms | 52.2% bf16 MFU | 1623477 tok/s step 10592/19560 | loss 3.437225 (+0.72z)| norm 0.2867 (+0.27z)| lr 2.77e-04 | 323.63 ms | 52.1% bf16 MFU | 1623303 tok/s step 10593/19560 | loss 3.381245 (-0.46z)| norm 0.2701 (-0.55z)| lr 2.77e-04 | 323.16 ms | 52.2% bf16 MFU | 1623257 tok/s step 10594/19560 | loss 3.412632 (+0.19z)| norm 0.2760 (-0.27z)| lr 2.77e-04 | 322.79 ms | 52.3% bf16 MFU | 1623305 tok/s step 10595/19560 | loss 3.377810 (-0.54z)| norm 0.2726 (-0.44z)| lr 2.77e-04 | 323.06 ms | 52.2% bf16 MFU | 1623283 tok/s step 10596/19560 | loss 3.404307 (+0.01z)| norm 0.2905 (+0.44z)| lr 2.77e-04 | 323.01 ms | 52.2% bf16 MFU | 1623276 tok/s step 10597/19560 | loss 3.398154 (-0.13z)| norm 0.2818 (+0.01z)| lr 2.77e-04 | 322.67 ms | 52.3% bf16 MFU | 1623353 tok/s step 10598/19560 | loss 3.471868 (+1.42z)| norm 0.2645 (-0.87z)| lr 2.77e-04 | 323.04 ms | 52.2% bf16 MFU | 1623335 tok/s step 10599/19560 | loss 3.352904 (-1.08z)| norm 0.2924 (+0.53z)| lr 2.77e-04 | 322.49 ms | 52.3% bf16 MFU | 1623456 tok/s step 10600/19560 | loss 3.414057 (+0.19z)| norm 0.2698 (-0.63z)| lr 2.77e-04 | 322.68 ms | 52.3% bf16 MFU | 1623522 tok/s step 10601/19560 | loss 3.401369 (-0.07z)| norm 0.2717 (-0.53z)| lr 2.77e-04 | 322.73 ms | 52.3% bf16 MFU | 1623573 tok/s step 10602/19560 | loss 3.401865 (-0.06z)| norm 0.2626 (-1.00z)| lr 2.76e-04 | 322.84 ms | 52.3% bf16 MFU | 1623594 tok/s step 10603/19560 | loss 3.427603 (+0.48z)| norm 0.2801 (-0.10z)| lr 2.76e-04 | 323.48 ms | 52.2% bf16 MFU | 1623453 tok/s step 10604/19560 | loss 3.402611 (-0.04z)| norm 0.2742 (-0.40z)| lr 2.76e-04 | 322.51 ms | 52.3% bf16 MFU | 1623561 tok/s step 10605/19560 | loss 3.451191 (+0.99z)| norm 0.2563 (-1.29z)| lr 2.76e-04 | 322.75 ms | 52.3% bf16 MFU | 1623606 tok/s step 10606/19560 | loss 3.393765 (-0.24z)| norm 0.2783 (-0.18z)| lr 2.76e-04 | 323.12 ms | 52.2% bf16 MFU | 1623554 tok/s step 
10607/19560 | loss 3.404373 (-0.01z)| norm 0.2609 (-1.05z)| lr 2.76e-04 | 322.49 ms | 52.3% bf16 MFU | 1623663 tok/s step 10608/19560 | loss 3.363288 (-0.89z)| norm 0.2660 (-0.78z)| lr 2.76e-04 | 322.36 ms | 52.4% bf16 MFU | 1623800 tok/s step 10609/19560 | loss 3.349326 (-1.17z)| norm 0.2646 (-0.84z)| lr 2.76e-04 | 322.93 ms | 52.3% bf16 MFU | 1623787 tok/s step 10610/19560 | loss 3.442530 (+0.81z)| norm 0.2863 (+0.28z)| lr 2.76e-04 | 322.58 ms | 52.3% bf16 MFU | 1623863 tok/s step 10611/19560 | loss 3.387735 (-0.36z)| norm 0.2531 (-1.50z)| lr 2.76e-04 | 323.27 ms | 52.2% bf16 MFU | 1623761 tok/s step 10612/19560 | loss 3.383370 (-0.45z)| norm 0.2725 (-0.41z)| lr 2.76e-04 | 322.80 ms | 52.3% bf16 MFU | 1623783 tok/s step 10613/19560 | loss 3.339187 (-1.37z)| norm 0.2387 (-2.26z)| lr 2.76e-04 | 323.10 ms | 52.2% bf16 MFU | 1623729 tok/s step 10614/19560 | loss 3.341651 (-1.31z)| norm 0.2563 (-1.26z)| lr 2.76e-04 | 323.59 ms | 52.2% bf16 MFU | 1623554 tok/s step 10615/19560 | loss 3.427078 (+0.49z)| norm 0.2569 (-1.21z)| lr 2.76e-04 | 323.30 ms | 52.2% bf16 MFU | 1623460 tok/s step 10616/19560 | loss 3.382172 (-0.46z)| norm 0.2537 (-1.37z)| lr 2.76e-04 | 323.27 ms | 52.2% bf16 MFU | 1623379 tok/s step 10617/19560 | loss 3.476957 (+1.51z)| norm 0.2794 (+0.08z)| lr 2.76e-04 | 323.09 ms | 52.2% bf16 MFU | 1623348 tok/s step 10618/19560 | loss 3.428929 (+0.50z)| norm 0.2928 (+0.84z)| lr 2.76e-04 | 323.00 ms | 52.3% bf16 MFU | 1623340 tok/s step 10619/19560 | loss 3.412087 (+0.14z)| norm 0.2604 (-1.00z)| lr 2.76e-04 | 323.28 ms | 52.2% bf16 MFU | 1623262 tok/s step 10620/19560 | loss 3.398341 (-0.15z)| norm 0.2784 (+0.04z)| lr 2.76e-04 | 323.10 ms | 52.2% bf16 MFU | 1623233 tok/s step 10621/19560 | loss 3.370250 (-0.73z)| norm 0.2708 (-0.40z)| lr 2.76e-04 | 323.67 ms | 52.1% bf16 MFU | 1623063 tok/s step 10622/19560 | loss 3.388854 (-0.35z)| norm 0.2801 (+0.14z)| lr 2.75e-04 | 323.15 ms | 52.2% bf16 MFU | 1623031 tok/s step 10623/19560 | loss 3.327197 (-1.62z)| norm 
0.2720 (-0.33z)| lr 2.75e-04 | 323.20 ms | 52.2% bf16 MFU | 1622989 tok/s step 10624/19560 | loss 3.378855 (-0.54z)| norm 0.2718 (-0.34z)| lr 2.75e-04 | 323.52 ms | 52.2% bf16 MFU | 1622867 tok/s step 10625/19560 | loss 3.413596 (+0.17z)| norm 0.2841 (+0.37z)| lr 2.75e-04 | 323.27 ms | 52.2% bf16 MFU | 1622816 tok/s step 10626/19560 | loss 3.395108 (-0.21z)| norm 0.2526 (-1.45z)| lr 2.75e-04 | 323.43 ms | 52.2% bf16 MFU | 1622727 tok/s step 10627/19560 | loss 3.374343 (-0.65z)| norm 0.2838 (+0.34z)| lr 2.75e-04 | 323.09 ms | 52.2% bf16 MFU | 1622728 tok/s step 10628/19560 | loss 3.440241 (+0.73z)| norm 0.2607 (-1.01z)| lr 2.75e-04 | 323.81 ms | 52.1% bf16 MFU | 1622547 tok/s step 10629/19560 | loss 3.449539 (+0.92z)| norm 0.2899 (+0.69z)| lr 2.75e-04 | 323.18 ms | 52.2% bf16 MFU | 1622534 tok/s step 10630/19560 | loss 3.324717 (-1.67z)| norm 0.2511 (-1.56z)| lr 2.75e-04 | 323.62 ms | 52.2% bf16 MFU | 1622410 tok/s step 10631/19560 | loss 3.404038 (-0.02z)| norm 0.2897 (+0.67z)| lr 2.75e-04 | 323.55 ms | 52.2% bf16 MFU | 1622311 tok/s step 10632/19560 | loss 3.330280 (-1.55z)| norm 0.2690 (-0.54z)| lr 2.75e-04 | 323.09 ms | 52.2% bf16 MFU | 1622333 tok/s step 10633/19560 | loss 3.546464 (+2.83z)| norm 0.2920 (+0.79z)| lr 2.75e-04 | 323.63 ms | 52.1% bf16 MFU | 1622217 tok/s step 10634/19560 | loss 3.398584 (-0.14z)| norm 0.3100 (+1.80z)| lr 2.75e-04 | 322.82 ms | 52.3% bf16 MFU | 1622312 tok/s step 10635/19560 | loss 3.441256 (+0.71z)| norm 0.3136 (+1.95z)| lr 2.75e-04 | 323.29 ms | 52.2% bf16 MFU | 1622283 tok/s step 10636/19560 | loss 3.471297 (+1.29z)| norm 0.2924 (+0.74z)| lr 2.75e-04 | 323.23 ms | 52.2% bf16 MFU | 1622271 tok/s step 10637/19560 | loss 3.466181 (+1.18z)| norm 0.2972 (+1.00z)| lr 2.75e-04 | 322.70 ms | 52.3% bf16 MFU | 1622392 tok/s step 10638/19560 | loss 3.451528 (+0.87z)| norm 0.3103 (+1.71z)| lr 2.75e-04 | 322.79 ms | 52.3% bf16 MFU | 1622484 tok/s step 10639/19560 | loss 3.411135 (+0.05z)| norm 0.2742 (-0.33z)| lr 2.75e-04 | 323.51 ms | 
52.2% bf16 MFU | 1622391 tok/s step 10640/19560 | loss 3.392468 (-0.33z)| norm 0.2987 (+1.04z)| lr 2.75e-04 | 323.07 ms | 52.2% bf16 MFU | 1622412 tok/s step 10641/19560 | loss 3.367826 (-0.82z)| norm 0.2972 (+0.94z)| lr 2.75e-04 | 323.61 ms | 52.2% bf16 MFU | 1622298 tok/s step 10642/19560 | loss 3.620166 (+4.00z)| norm 0.2702 (-0.60z)| lr 2.74e-04 | 322.74 ms | 52.3% bf16 MFU | 1622409 tok/s step 10643/19560 | loss 3.464312 (+1.01z)| norm 0.2829 (+0.12z)| lr 2.74e-04 | 323.31 ms | 52.2% bf16 MFU | 1622370 tok/s step 10644/19560 | loss 3.403600 (-0.14z)| norm 0.2876 (+0.40z)| lr 2.74e-04 | 323.04 ms | 52.2% bf16 MFU | 1622401 tok/s step 10645/19560 | loss 3.385879 (-0.47z)| norm 0.2756 (-0.29z)| lr 2.74e-04 | 322.56 ms | 52.3% bf16 MFU | 1622550 tok/s step 10646/19560 | loss 3.346598 (-1.21z)| norm 0.3007 (+1.16z)| lr 2.74e-04 | 323.40 ms | 52.2% bf16 MFU | 1622481 tok/s step 10647/19560 | loss 3.388628 (-0.41z)| norm 0.2833 (+0.16z)| lr 2.74e-04 | 323.12 ms | 52.2% bf16 MFU | 1622486 tok/s step 10648/19560 | loss 3.377681 (-0.63z)| norm 0.2819 (+0.08z)| lr 2.74e-04 | 323.04 ms | 52.2% bf16 MFU | 1622510 tok/s step 10649/19560 | loss 3.525964 (+2.23z)| norm 0.3005 (+1.14z)| lr 2.74e-04 | 323.21 ms | 52.2% bf16 MFU | 1622490 tok/s step 10650/19560 | loss 3.402624 (-0.16z)| norm 0.3021 (+1.22z)| lr 2.74e-04 | 323.02 ms | 52.2% bf16 MFU | 1622520 tok/s step 10651/19560 | loss 3.355084 (-1.09z)| norm 0.2587 (-1.25z)| lr 2.74e-04 | 323.08 ms | 52.2% bf16 MFU | 1622533 tok/s step 10652/19560 | loss 3.405024 (-0.12z)| norm 0.3044 (+1.32z)| lr 2.74e-04 | 322.60 ms | 52.3% bf16 MFU | 1622666 tok/s step 10653/19560 | loss 3.356997 (-1.06z)| norm 0.2669 (-0.80z)| lr 2.74e-04 | 322.86 ms | 52.3% bf16 MFU | 1622726 tok/s step 10654/19560 | loss 3.400246 (-0.20z)| norm 0.2924 (+0.65z)| lr 2.74e-04 | 323.41 ms | 52.2% bf16 MFU | 1622646 tok/s step 10655/19560 | loss 3.400630 (-0.19z)| norm 0.2955 (+0.81z)| lr 2.74e-04 | 322.45 ms | 52.3% bf16 MFU | 1622811 tok/s step 10656/19560 
| loss 3.424180 (+0.29z)| norm 0.2847 (+0.21z)| lr 2.74e-04 | 322.80 ms | 52.3% bf16 MFU | 1622881 tok/s step 10657/19560 | loss 3.376660 (-0.66z)| norm 0.3108 (+1.67z)| lr 2.74e-04 | 322.77 ms | 52.3% bf16 MFU | 1622953 tok/s step 10658/19560 | loss 3.401601 (-0.15z)| norm 0.2908 (+0.55z)| lr 2.74e-04 | 323.20 ms | 52.2% bf16 MFU | 1622914 tok/s step 10659/19560 | loss 3.380474 (-0.58z)| norm 0.2863 (+0.33z)| lr 2.74e-04 | 323.15 ms | 52.2% bf16 MFU | 1622888 tok/s step 10660/19560 | loss 3.353937 (-1.10z)| norm 0.2674 (-0.77z)| lr 2.74e-04 | 322.30 ms | 52.4% bf16 MFU | 1623080 tok/s step 10661/19560 | loss 3.427356 (+0.42z)| norm 0.2975 (+1.03z)| lr 2.74e-04 | 322.76 ms | 52.3% bf16 MFU | 1623146 tok/s step 10662/19560 | loss 3.439938 (+0.67z)| norm 0.2901 (+0.58z)| lr 2.73e-04 | 322.48 ms | 52.3% bf16 MFU | 1623279 tok/s step 10663/19560 | loss 3.372608 (-0.73z)| norm 0.2999 (+1.16z)| lr 2.73e-04 | 322.73 ms | 52.3% bf16 MFU | 1623343 tok/s step 10664/19560 | loss 3.423221 (+0.32z)| norm 0.2917 (+0.67z)| lr 2.73e-04 | 322.69 ms | 52.3% bf16 MFU | 1623413 tok/s step 10665/19560 | loss 3.415965 (+0.16z)| norm 0.2746 (-0.35z)| lr 2.73e-04 | 322.95 ms | 52.3% bf16 MFU | 1623415 tok/s step 10666/19560 | loss 3.361058 (-0.98z)| norm 0.2887 (+0.50z)| lr 2.73e-04 | 323.15 ms | 52.2% bf16 MFU | 1623365 tok/s step 10667/19560 | loss 3.461561 (+1.11z)| norm 0.2825 (+0.13z)| lr 2.73e-04 | 322.30 ms | 52.4% bf16 MFU | 1623533 tok/s step 10668/19560 | loss 3.442847 (+0.71z)| norm 0.2926 (+0.74z)| lr 2.73e-04 | 322.58 ms | 52.3% bf16 MFU | 1623621 tok/s step 10669/19560 | loss 3.358073 (-1.06z)| norm 0.2732 (-0.43z)| lr 2.73e-04 | 322.61 ms | 52.3% bf16 MFU | 1623698 tok/s step 10670/19560 | loss 3.385269 (-0.49z)| norm 0.2894 (+0.55z)| lr 2.73e-04 | 323.65 ms | 52.1% bf16 MFU | 1623510 tok/s step 10671/19560 | loss 3.352719 (-1.16z)| norm 0.2550 (-1.50z)| lr 2.73e-04 | 322.97 ms | 52.3% bf16 MFU | 1623501 tok/s step 10672/19560 | loss 3.444701 (+0.78z)| norm 0.2926 (+0.75z)| 
lr 2.73e-04 | 322.75 ms | 52.3% bf16 MFU | 1623548 tok/s step 10673/19560 | loss 3.373425 (-0.71z)| norm 0.2594 (-1.22z)| lr 2.73e-04 | 322.90 ms | 52.3% bf16 MFU | 1623554 tok/s step 10674/19560 | loss 3.400724 (-0.12z)| norm 0.2605 (-1.14z)| lr 2.73e-04 | 322.73 ms | 52.3% bf16 MFU | 1623604 tok/s step 10675/19560 | loss 3.391866 (-0.31z)| norm 0.2802 (+0.06z)| lr 2.73e-04 | 322.95 ms | 52.3% bf16 MFU | 1623596 tok/s step 10676/19560 | loss 3.401643 (-0.10z)| norm 0.2660 (-0.79z)| lr 2.73e-04 | 322.98 ms | 52.3% bf16 MFU | 1623581 tok/s step 10677/19560 | loss 3.428911 (+0.48z)| norm 0.2781 (-0.06z)| lr 2.73e-04 | 322.75 ms | 52.3% bf16 MFU | 1623623 tok/s step 10678/19560 | loss 3.428447 (+0.47z)| norm 0.2708 (-0.50z)| lr 2.73e-04 | 322.51 ms | 52.3% bf16 MFU | 1623724 tok/s step 10679/19560 | loss 3.377926 (-0.62z)| norm 0.2701 (-0.53z)| lr 2.73e-04 | 322.94 ms | 52.3% bf16 MFU | 1623711 tok/s step 10680/19560 | loss 3.351926 (-1.16z)| norm 0.2562 (-1.36z)| lr 2.73e-04 | 322.80 ms | 52.3% bf16 MFU | 1623735 tok/s step 10681/19560 | loss 3.377589 (-0.60z)| norm 0.2622 (-0.99z)| lr 2.73e-04 | 323.06 ms | 52.2% bf16 MFU | 1623692 tok/s step 10682/19560 | loss 3.365414 (-0.85z)| norm 0.2535 (-1.48z)| lr 2.73e-04 | 322.63 ms | 52.3% bf16 MFU | 1623760 tok/s step 10683/19560 | loss 3.404235 (+0.00z)| norm 0.2879 (+0.56z)| lr 2.72e-04 | 322.72 ms | 52.3% bf16 MFU | 1623802 tok/s step 10684/19560 | loss 3.375926 (-0.61z)| norm 0.2656 (-0.75z)| lr 2.72e-04 | 322.87 ms | 52.3% bf16 MFU | 1623804 tok/s step 10685/19560 | loss 3.362299 (-0.90z)| norm 0.2758 (-0.14z)| lr 2.72e-04 | 322.80 ms | 52.3% bf16 MFU | 1623822 tok/s step 10686/19560 | loss 3.370076 (-0.73z)| norm 0.2861 (+0.48z)| lr 2.72e-04 | 322.82 ms | 52.3% bf16 MFU | 1623837 tok/s step 10687/19560 | loss 3.414963 (+0.25z)| norm 0.2672 (-0.64z)| lr 2.72e-04 | 322.67 ms | 52.3% bf16 MFU | 1623886 tok/s step 10688/19560 | loss 3.388234 (-0.33z)| norm 0.2543 (-1.40z)| lr 2.72e-04 | 323.06 ms | 52.2% bf16 MFU | 
1623836 tok/s step 10689/19560 | loss 3.394121 (-0.20z)| norm 0.2583 (-1.14z)| lr 2.72e-04 | 322.53 ms | 52.3% bf16 MFU | 1623922 tok/s step 10690/19560 | loss 3.456031 (+1.13z)| norm 0.2657 (-0.70z)| lr 2.72e-04 | 322.99 ms | 52.3% bf16 MFU | 1623887 tok/s step 10691/19560 | loss 3.352018 (-1.14z)| norm 0.2526 (-1.47z)| lr 2.72e-04 | 322.88 ms | 52.3% bf16 MFU | 1623882 tok/s step 10692/19560 | loss 3.382294 (-0.48z)| norm 0.2785 (+0.05z)| lr 2.72e-04 | 322.65 ms | 52.3% bf16 MFU | 1623935 tok/s step 10693/19560 | loss 3.428077 (+0.51z)| norm 0.2636 (-0.82z)| lr 2.72e-04 | 322.64 ms | 52.3% bf16 MFU | 1623988 tok/s step 10694/19560 | loss 3.377477 (-0.59z)| norm 0.2839 (+0.38z)| lr 2.72e-04 | 322.89 ms | 52.3% bf16 MFU | 1623977 tok/s step 10695/19560 | loss 3.332013 (-1.57z)| norm 0.2607 (-1.01z)| lr 2.72e-04 | 322.93 ms | 52.3% bf16 MFU | 1623954 tok/s step 10696/19560 | loss 3.422458 (+0.42z)| norm 0.2915 (+0.95z)| lr 2.72e-04 | 322.72 ms | 52.3% bf16 MFU | 1623986 tok/s step 10697/19560 | loss 3.383605 (-0.44z)| norm 0.2890 (+0.77z)| lr 2.72e-04 | 322.71 ms | 52.3% bf16 MFU | 1624018 tok/s step 10698/19560 | loss 3.410459 (+0.16z)| norm 0.2669 (-0.63z)| lr 2.72e-04 | 322.57 ms | 52.3% bf16 MFU | 1624084 tok/s step 10699/19560 | loss 3.367115 (-0.79z)| norm 0.2656 (-0.71z)| lr 2.72e-04 | 323.44 ms | 52.2% bf16 MFU | 1623928 tok/s step 10700/19560 | loss 3.415569 (+0.27z)| norm 0.2940 (+1.12z)| lr 2.72e-04 | 322.66 ms | 52.3% bf16 MFU | 1623975 tok/s step 10701/19560 | loss 3.436249 (+0.74z)| norm 0.3233 (+2.90z)| lr 2.72e-04 | 323.10 ms | 52.2% bf16 MFU | 1623910 tok/s step 10702/19560 | loss 3.378518 (-0.54z)| norm 0.2839 (+0.43z)| lr 2.72e-04 | 323.20 ms | 52.2% bf16 MFU | 1623823 tok/s step 10703/19560 | loss 3.455290 (+1.16z)| norm 0.3033 (+1.65z)| lr 2.71e-04 | 322.17 ms | 52.4% bf16 MFU | 1624001 tok/s step 10704/19560 | loss 3.357970 (-0.99z)| norm 0.2919 (+0.92z)| lr 2.71e-04 | 322.69 ms | 52.3% bf16 MFU | 1624038 tok/s step 10705/19560 | loss 3.436163 
(+0.74z)| norm 0.2991 (+1.36z)| lr 2.71e-04 | 322.88 ms | 52.3% bf16 MFU | 1624025 tok/s step 10706/19560 | loss 3.441918 (+0.86z)| norm 0.2746 (-0.19z)| lr 2.71e-04 | 322.62 ms | 52.3% bf16 MFU | 1624080 tok/s step 10707/19560 | loss 3.375228 (-0.61z)| norm 0.2935 (+0.99z)| lr 2.71e-04 | 322.84 ms | 52.3% bf16 MFU | 1624076 tok/s step 10708/19560 | loss 3.444368 (+0.90z)| norm 0.2766 (-0.09z)| lr 2.71e-04 | 322.71 ms | 52.3% bf16 MFU | 1624105 tok/s step 10709/19560 | loss 3.400588 (-0.04z)| norm 0.2768 (-0.08z)| lr 2.71e-04 | 322.93 ms | 52.3% bf16 MFU | 1624077 tok/s step 10710/19560 | loss 3.384720 (-0.42z)| norm 0.2651 (-0.82z)| lr 2.71e-04 | 322.75 ms | 52.3% bf16 MFU | 1624095 tok/s step 10711/19560 | loss 3.454783 (+1.19z)| norm 0.3207 (+2.62z)| lr 2.71e-04 | 322.90 ms | 52.3% bf16 MFU | 1624076 tok/s step 10712/19560 | loss 3.488813 (+1.93z)| norm 0.2760 (-0.17z)| lr 2.71e-04 | 322.72 ms | 52.3% bf16 MFU | 1624102 tok/s step 10713/19560 | loss 3.382797 (-0.48z)| norm 0.2859 (+0.45z)| lr 2.71e-04 | 322.64 ms | 52.3% bf16 MFU | 1624147 tok/s step 10714/19560 | loss 3.339990 (-1.44z)| norm 0.2694 (-0.57z)| lr 2.71e-04 | 322.17 ms | 52.4% bf16 MFU | 1624307 tok/s step 10715/19560 | loss 3.362694 (-0.91z)| norm 0.2651 (-0.83z)| lr 2.71e-04 | 323.30 ms | 52.2% bf16 MFU | 1624174 tok/s step 10716/19560 | loss 3.456952 (+1.22z)| norm 0.2745 (-0.24z)| lr 2.71e-04 | 322.67 ms | 52.3% bf16 MFU | 1624208 tok/s step 10717/19560 | loss 3.299334 (-2.28z)| norm 0.2687 (-0.61z)| lr 2.71e-04 | 322.66 ms | 52.3% bf16 MFU | 1624243 tok/s step 10718/19560 | loss 3.407059 (+0.10z)| norm 0.3283 (+2.97z)| lr 2.71e-04 | 322.03 ms | 52.4% bf16 MFU | 1624434 tok/s step 10719/19560 | loss 3.452195 (+1.10z)| norm 0.2897 (+0.64z)| lr 2.71e-04 | 323.62 ms | 52.2% bf16 MFU | 1624216 tok/s step 10720/19560 | loss 3.447496 (+0.99z)| norm 0.3097 (+1.81z)| lr 2.71e-04 | 322.81 ms | 52.3% bf16 MFU | 1624213 tok/s step 10721/19560 | loss 3.378647 (-0.54z)| norm 0.2891 (+0.58z)| lr 2.71e-04 | 
322.81 ms | 52.3% bf16 MFU | 1624209 tok/s step 10722/19560 | loss 3.459506 (+1.24z)| norm 0.2951 (+0.93z)| lr 2.71e-04 | 322.50 ms | 52.3% bf16 MFU | 1624284 tok/s step 10723/19560 | loss 3.382895 (-0.45z)| norm 0.2849 (+0.32z)| lr 2.70e-04 | 322.31 ms | 52.4% bf16 MFU | 1624402 tok/s step 10724/19560 | loss 3.314046 (-1.93z)| norm 0.2428 (-2.12z)| lr 2.70e-04 | 322.64 ms | 52.3% bf16 MFU | 1624431 tok/s step 10725/19560 | loss 3.387422 (-0.33z)| norm 0.2814 (+0.13z)| lr 2.70e-04 | 322.51 ms | 52.3% bf16 MFU | 1624491 tok/s step 10726/19560 | loss 3.393522 (-0.18z)| norm 0.2703 (-0.52z)| lr 2.70e-04 | 322.52 ms | 52.3% bf16 MFU | 1624547 tok/s step 10727/19560 | loss 3.346948 (-1.20z)| norm 0.2581 (-1.21z)| lr 2.70e-04 | 322.88 ms | 52.3% bf16 MFU | 1624508 tok/s step 10728/19560 | loss 3.523655 (+2.58z)| norm 0.2869 (+0.46z)| lr 2.70e-04 | 322.55 ms | 52.3% bf16 MFU | 1624554 tok/s step 10729/19560 | loss 3.394302 (-0.18z)| norm 0.2971 (+1.03z)| lr 2.70e-04 | 322.41 ms | 52.3% bf16 MFU | 1624633 tok/s step 10730/19560 | loss 3.314786 (-1.84z)| norm 0.2635 (-0.92z)| lr 2.70e-04 | 322.56 ms | 52.3% bf16 MFU | 1624671 tok/s step 10731/19560 | loss 3.408947 (+0.15z)| norm 0.3088 (+1.68z)| lr 2.70e-04 | 322.54 ms | 52.3% bf16 MFU | 1624712 tok/s step 10732/19560 | loss 3.395818 (-0.12z)| norm 0.2898 (+0.58z)| lr 2.70e-04 | 323.29 ms | 52.2% bf16 MFU | 1624563 tok/s step 10733/19560 | loss 3.393825 (-0.16z)| norm 0.2805 (+0.04z)| lr 2.70e-04 | 322.55 ms | 52.3% bf16 MFU | 1624608 tok/s step 10734/19560 | loss 3.369970 (-0.66z)| norm 0.2651 (-0.84z)| lr 2.70e-04 | 322.62 ms | 52.3% bf16 MFU | 1624633 tok/s step 10735/19560 | loss 3.329175 (-1.49z)| norm 0.2592 (-1.18z)| lr 2.70e-04 | 322.47 ms | 52.3% bf16 MFU | 1624694 tok/s step 10736/19560 | loss 3.358791 (-0.87z)| norm 0.2746 (-0.30z)| lr 2.70e-04 | 323.00 ms | 52.3% bf16 MFU | 1624619 tok/s step 10737/19560 | loss 3.391263 (-0.20z)| norm 0.2509 (-1.64z)| lr 2.70e-04 | 322.86 ms | 52.3% bf16 MFU | 1624581 tok/s step 
10738/19560 | loss 3.477076 (+1.59z)| norm 0.2631 (-0.94z)| lr 2.70e-04 | 322.95 ms | 52.3% bf16 MFU | 1624523 tok/s step 10739/19560 | loss 3.399070 (-0.04z)| norm 0.2771 (-0.15z)| lr 2.70e-04 | 323.01 ms | 52.2% bf16 MFU | 1624454 tok/s step 10740/19560 | loss 3.386901 (-0.30z)| norm 0.2538 (-1.47z)| lr 2.70e-04 | 323.07 ms | 52.2% bf16 MFU | 1624374 tok/s step 10741/19560 | loss 3.408275 (+0.14z)| norm 0.2808 (+0.06z)| lr 2.70e-04 | 322.26 ms | 52.4% bf16 MFU | 1624500 tok/s step 10742/19560 | loss 3.436077 (+0.71z)| norm 0.2718 (-0.48z)| lr 2.70e-04 | 322.48 ms | 52.3% bf16 MFU | 1624564 tok/s step 10743/19560 | loss 3.466893 (+1.35z)| norm 0.2594 (-1.21z)| lr 2.69e-04 | 323.40 ms | 52.2% bf16 MFU | 1624395 tok/s step 10744/19560 | loss 3.332922 (-1.45z)| norm 0.2825 (+0.14z)| lr 2.69e-04 | 322.46 ms | 52.3% bf16 MFU | 1624471 tok/s step 10745/19560 | loss 3.430089 (+0.59z)| norm 0.2720 (-0.48z)| lr 2.69e-04 | 322.30 ms | 52.4% bf16 MFU | 1624582 tok/s step 10746/19560 | loss 3.352609 (-1.02z)| norm 0.3014 (+1.25z)| lr 2.69e-04 | 322.96 ms | 52.3% bf16 MFU | 1624521 tok/s step 10747/19560 | loss 3.384270 (-0.35z)| norm 0.2866 (+0.37z)| lr 2.69e-04 | 323.01 ms | 52.2% bf16 MFU | 1624451 tok/s step 10748/19560 | loss 3.396918 (-0.09z)| norm 0.2656 (-0.87z)| lr 2.69e-04 | 322.84 ms | 52.3% bf16 MFU | 1624428 tok/s step 10749/19560 | loss 3.348980 (-1.09z)| norm 0.2798 (-0.04z)| lr 2.69e-04 | 322.54 ms | 52.3% bf16 MFU | 1624481 tok/s step 10750/19560 | loss 3.479374 (+1.61z)| norm 0.2975 (+1.00z)| lr 2.69e-04 | 322.91 ms | 52.3% bf16 MFU | 1624440 tok/s val loss 3.378517 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating HellaSwag: 70/79 HellaSwag: 2950/10042 = 0.293766 step 10751/19560 | loss 3.475396 (+1.50z)| norm 0.2900 (+0.55z)| lr 2.69e-04 | 322.82 ms | 52.3% bf16 MFU | 1624422 tok/s step 10752/19560 | 
loss 3.434999 (+0.65z)| norm 0.2632 (-1.03z)| lr 2.69e-04 | 322.90 ms | 52.3% bf16 MFU | 1624385 tok/s step 10753/19560 | loss 3.501738 (+1.99z)| norm 0.2915 (+0.63z)| lr 2.69e-04 | 323.01 ms | 52.2% bf16 MFU | 1624322 tok/s step 10754/19560 | loss 3.390269 (-0.28z)| norm 0.2694 (-0.68z)| lr 2.69e-04 | 323.46 ms | 52.2% bf16 MFU | 1624151 tok/s step 10755/19560 | loss 3.416140 (+0.24z)| norm 0.2568 (-1.40z)| lr 2.69e-04 | 323.46 ms | 52.2% bf16 MFU | 1623986 tok/s step 10756/19560 | loss 3.391645 (-0.25z)| norm 0.2684 (-0.72z)| lr 2.69e-04 | 322.95 ms | 52.3% bf16 MFU | 1623958 tok/s step 10757/19560 | loss 3.472360 (+1.39z)| norm 0.2710 (-0.56z)| lr 2.69e-04 | 322.69 ms | 52.3% bf16 MFU | 1623999 tok/s step 10758/19560 | loss 3.436030 (+0.64z)| norm 0.2652 (-0.92z)| lr 2.69e-04 | 323.17 ms | 52.2% bf16 MFU | 1623914 tok/s step 10759/19560 | loss 3.381620 (-0.47z)| norm 0.2978 (+1.01z)| lr 2.69e-04 | 323.47 ms | 52.2% bf16 MFU | 1623760 tok/s step 10760/19560 | loss 3.389561 (-0.32z)| norm 0.3004 (+1.15z)| lr 2.69e-04 | 323.08 ms | 52.2% bf16 MFU | 1623711 tok/s step 10761/19560 | loss 3.322088 (-1.72z)| norm 0.2851 (+0.25z)| lr 2.69e-04 | 322.91 ms | 52.3% bf16 MFU | 1623708 tok/s step 10762/19560 | loss 3.366170 (-0.78z)| norm 0.2641 (-0.98z)| lr 2.69e-04 | 322.63 ms | 52.3% bf16 MFU | 1623774 tok/s step 10763/19560 | loss 3.374590 (-0.59z)| norm 0.2728 (-0.45z)| lr 2.68e-04 | 323.09 ms | 52.2% bf16 MFU | 1623722 tok/s step 10764/19560 | loss 3.451791 (+1.04z)| norm 0.2771 (-0.18z)| lr 2.68e-04 | 322.51 ms | 52.3% bf16 MFU | 1623818 tok/s step 10765/19560 | loss 3.414336 (+0.26z)| norm 0.2660 (-0.84z)| lr 2.68e-04 | 323.11 ms | 52.2% bf16 MFU | 1623758 tok/s step 10766/19560 | loss 3.408113 (+0.13z)| norm 0.2794 (-0.02z)| lr 2.68e-04 | 322.87 ms | 52.3% bf16 MFU | 1623762 tok/s step 10767/19560 | loss 3.317378 (-1.77z)| norm 0.2639 (-0.96z)| lr 2.68e-04 | 322.73 ms | 52.3% bf16 MFU | 1623801 tok/s step 10768/19560 | loss 3.369987 (-0.65z)| norm 0.2465 (-1.99z)| 
lr 2.68e-04 | 322.73 ms | 52.3% bf16 MFU | 1623837 tok/s step 10769/19560 | loss 3.387951 (-0.28z)| norm 0.2579 (-1.27z)| lr 2.68e-04 | 323.06 ms | 52.2% bf16 MFU | 1623789 tok/s step 10770/19560 | loss 3.386754 (-0.29z)| norm 0.2525 (-1.58z)| lr 2.68e-04 | 322.83 ms | 52.3% bf16 MFU | 1623803 tok/s step 10771/19560 | loss 3.393100 (-0.13z)| norm 0.2564 (-1.32z)| lr 2.68e-04 | 323.00 ms | 52.3% bf16 MFU | 1623770 tok/s step 10772/19560 | loss 3.400262 (+0.04z)| norm 0.2701 (-0.49z)| lr 2.68e-04 | 323.30 ms | 52.2% bf16 MFU | 1623665 tok/s step 10773/19560 | loss 3.363621 (-0.82z)| norm 0.2529 (-1.49z)| lr 2.68e-04 | 323.32 ms | 52.2% bf16 MFU | 1623561 tok/s step 10774/19560 | loss 3.336144 (-1.45z)| norm 0.2893 (+0.67z)| lr 2.68e-04 | 323.05 ms | 52.2% bf16 MFU | 1623529 tok/s step 10775/19560 | loss 3.426497 (+0.65z)| norm 0.2712 (-0.40z)| lr 2.68e-04 | 323.22 ms | 52.2% bf16 MFU | 1623458 tok/s step 10776/19560 | loss 3.345715 (-1.22z)| norm 0.2649 (-0.77z)| lr 2.68e-04 | 323.16 ms | 52.2% bf16 MFU | 1623404 tok/s step 10777/19560 | loss 3.328927 (-1.62z)| norm 0.2980 (+1.20z)| lr 2.68e-04 | 322.66 ms | 52.3% bf16 MFU | 1623479 tok/s step 10778/19560 | loss 3.389166 (-0.18z)| norm 0.2687 (-0.53z)| lr 2.68e-04 | 322.92 ms | 52.3% bf16 MFU | 1623484 tok/s step 10779/19560 | loss 3.367229 (-0.71z)| norm 0.2895 (+0.71z)| lr 2.68e-04 | 322.89 ms | 52.3% bf16 MFU | 1623497 tok/s step 10780/19560 | loss 3.308817 (-2.05z)| norm 0.2660 (-0.70z)| lr 2.68e-04 | 323.36 ms | 52.2% bf16 MFU | 1623391 tok/s step 10781/19560 | loss 3.342303 (-1.26z)| norm 0.2827 (+0.31z)| lr 2.68e-04 | 322.82 ms | 52.3% bf16 MFU | 1623425 tok/s step 10782/19560 | loss 3.368060 (-0.65z)| norm 0.2609 (-1.00z)| lr 2.68e-04 | 322.60 ms | 52.3% bf16 MFU | 1623514 tok/s step 10783/19560 | loss 3.420115 (+0.56z)| norm 0.2749 (-0.14z)| lr 2.67e-04 | 323.30 ms | 52.2% bf16 MFU | 1623423 tok/s step 10784/19560 | loss 3.383595 (-0.28z)| norm 0.2742 (-0.18z)| lr 2.67e-04 | 322.66 ms | 52.3% bf16 MFU | 
1623495 tok/s step 10785/19560 | loss 3.387244 (-0.20z)| norm 0.2965 (+1.20z)| lr 2.67e-04 | 322.87 ms | 52.3% bf16 MFU | 1623512 tok/s step 10786/19560 | loss 3.414399 (+0.43z)| norm 0.2661 (-0.66z)| lr 2.67e-04 | 322.70 ms | 52.3% bf16 MFU | 1623571 tok/s step 10787/19560 | loss 3.360445 (-0.82z)| norm 0.2842 (+0.46z)| lr 2.67e-04 | 323.25 ms | 52.2% bf16 MFU | 1623488 tok/s step 10788/19560 | loss 3.388759 (-0.17z)| norm 0.2683 (-0.52z)| lr 2.67e-04 | 323.32 ms | 52.2% bf16 MFU | 1623393 tok/s step 10789/19560 | loss 3.434206 (+0.89z)| norm 0.2766 (-0.01z)| lr 2.67e-04 | 322.63 ms | 52.3% bf16 MFU | 1623476 tok/s step 10790/19560 | loss 3.527509 (+2.96z)| norm 0.3062 (+1.81z)| lr 2.67e-04 | 323.62 ms | 52.2% bf16 MFU | 1623305 tok/s step 10791/19560 | loss 3.387893 (-0.20z)| norm 0.2820 (+0.33z)| lr 2.67e-04 | 322.77 ms | 52.3% bf16 MFU | 1623358 tok/s step 10792/19560 | loss 3.387521 (-0.20z)| norm 0.2861 (+0.59z)| lr 2.67e-04 | 323.83 ms | 52.1% bf16 MFU | 1623142 tok/s step 10793/19560 | loss 3.415329 (+0.43z)| norm 0.3120 (+2.14z)| lr 2.67e-04 | 323.32 ms | 52.2% bf16 MFU | 1623063 tok/s step 10794/19560 | loss 3.407875 (+0.25z)| norm 0.2832 (+0.39z)| lr 2.67e-04 | 322.94 ms | 52.3% bf16 MFU | 1623084 tok/s step 10795/19560 | loss 3.411982 (+0.35z)| norm 0.2961 (+1.17z)| lr 2.67e-04 | 323.79 ms | 52.1% bf16 MFU | 1622890 tok/s step 10796/19560 | loss 3.388630 (-0.17z)| norm 0.2860 (+0.56z)| lr 2.67e-04 | 322.37 ms | 52.4% bf16 MFU | 1623063 tok/s step 10797/19560 | loss 3.413072 (+0.38z)| norm 0.2875 (+0.64z)| lr 2.67e-04 | 323.99 ms | 52.1% bf16 MFU | 1622822 tok/s step 10798/19560 | loss 3.301751 (-2.13z)| norm 0.2869 (+0.60z)| lr 2.67e-04 | 323.66 ms | 52.1% bf16 MFU | 1622675 tok/s step 10799/19560 | loss 3.331262 (-1.45z)| norm 0.3050 (+1.68z)| lr 2.67e-04 | 323.16 ms | 52.2% bf16 MFU | 1622659 tok/s step 10800/19560 | loss 3.385589 (-0.22z)| norm 0.2824 (+0.31z)| lr 2.67e-04 | 323.30 ms | 52.2% bf16 MFU | 1622611 tok/s step 10801/19560 | loss 3.340526 
(-1.23z)| norm 0.2814 (+0.24z)| lr 2.67e-04 | 322.82 ms | 52.3% bf16 MFU | 1622686 tok/s step 10802/19560 | loss 3.370793 (-0.54z)| norm 0.2857 (+0.49z)| lr 2.67e-04 | 323.64 ms | 52.1% bf16 MFU | 1622550 tok/s step 10803/19560 | loss 3.408600 (+0.31z)| norm 0.2819 (+0.26z)| lr 2.66e-04 | 322.85 ms | 52.3% bf16 MFU | 1622620 tok/s step 10804/19560 | loss 3.444967 (+1.11z)| norm 0.2876 (+0.60z)| lr 2.66e-04 | 323.45 ms | 52.2% bf16 MFU | 1622536 tok/s step 10805/19560 | loss 3.381202 (-0.31z)| norm 0.2782 (+0.02z)| lr 2.66e-04 | 324.24 ms | 52.1% bf16 MFU | 1622257 tok/s step 10806/19560 | loss 3.382278 (-0.27z)| norm 0.2937 (+0.96z)| lr 2.66e-04 | 322.29 ms | 52.4% bf16 MFU | 1622482 tok/s step 10807/19560 | loss 3.323643 (-1.57z)| norm 0.2713 (-0.41z)| lr 2.66e-04 | 323.11 ms | 52.2% bf16 MFU | 1622490 tok/s step 10808/19560 | loss 3.412133 (+0.39z)| norm 0.3076 (+1.78z)| lr 2.66e-04 | 324.25 ms | 52.1% bf16 MFU | 1622212 tok/s step 10809/19560 | loss 3.388397 (-0.14z)| norm 0.2887 (+0.61z)| lr 2.66e-04 | 322.87 ms | 52.3% bf16 MFU | 1622293 tok/s step 10810/19560 | loss 3.419570 (+0.55z)| norm 0.2948 (+0.97z)| lr 2.66e-04 | 322.37 ms | 52.4% bf16 MFU | 1622497 tok/s step 10811/19560 | loss 3.404265 (+0.21z)| norm 0.2721 (-0.42z)| lr 2.66e-04 | 323.51 ms | 52.2% bf16 MFU | 1622404 tok/s step 10812/19560 | loss 3.373948 (-0.47z)| norm 0.3166 (+2.25z)| lr 2.66e-04 | 323.51 ms | 52.2% bf16 MFU | 1622315 tok/s step 10813/19560 | loss 3.383276 (-0.27z)| norm 0.2867 (+0.44z)| lr 2.66e-04 | 322.84 ms | 52.3% bf16 MFU | 1622398 tok/s step 10814/19560 | loss 3.356490 (-0.86z)| norm 0.2955 (+0.97z)| lr 2.66e-04 | 323.06 ms | 52.2% bf16 MFU | 1622422 tok/s step 10815/19560 | loss 3.368933 (-0.58z)| norm 0.2694 (-0.60z)| lr 2.66e-04 | 322.77 ms | 52.3% bf16 MFU | 1622517 tok/s step 10816/19560 | loss 3.314888 (-1.75z)| norm 0.3036 (+1.43z)| lr 2.66e-04 | 323.14 ms | 52.2% bf16 MFU | 1622515 tok/s step 10817/19560 | loss 3.421405 (+0.60z)| norm 0.2882 (+0.49z)| lr 2.66e-04 | 
323.03 ms | 52.2% bf16 MFU | 1622542 tok/s step 10818/19560 | loss 3.373011 (-0.46z)| norm 0.3161 (+2.13z)| lr 2.66e-04 | 322.82 ms | 52.3% bf16 MFU | 1622620 tok/s step 10819/19560 | loss 3.500968 (+2.31z)| norm 0.2716 (-0.54z)| lr 2.66e-04 | 322.81 ms | 52.3% bf16 MFU | 1622697 tok/s step 10820/19560 | loss 3.401895 (+0.15z)| norm 0.3023 (+1.29z)| lr 2.66e-04 | 322.88 ms | 52.3% bf16 MFU | 1622751 tok/s step 10821/19560 | loss 3.367666 (-0.59z)| norm 0.2821 (+0.07z)| lr 2.66e-04 | 323.38 ms | 52.2% bf16 MFU | 1622676 tok/s step 10822/19560 | loss 3.420724 (+0.56z)| norm 0.2628 (-1.07z)| lr 2.66e-04 | 322.80 ms | 52.3% bf16 MFU | 1622753 tok/s step 10823/19560 | loss 3.436229 (+0.89z)| norm 0.2805 (-0.02z)| lr 2.65e-04 | 322.56 ms | 52.3% bf16 MFU | 1622886 tok/s step 10824/19560 | loss 3.360623 (-0.76z)| norm 0.2751 (-0.34z)| lr 2.65e-04 | 322.74 ms | 52.3% bf16 MFU | 1622966 tok/s step 10825/19560 | loss 3.440873 (+0.99z)| norm 0.2743 (-0.38z)| lr 2.65e-04 | 323.43 ms | 52.2% bf16 MFU | 1622869 tok/s step 10826/19560 | loss 3.368038 (-0.59z)| norm 0.2642 (-0.99z)| lr 2.65e-04 | 322.58 ms | 52.3% bf16 MFU | 1622989 tok/s step 10827/19560 | loss 3.380231 (-0.33z)| norm 0.2796 (-0.07z)| lr 2.65e-04 | 323.18 ms | 52.2% bf16 MFU | 1622955 tok/s step 10828/19560 | loss 3.358214 (-0.80z)| norm 0.2626 (-1.08z)| lr 2.65e-04 | 322.46 ms | 52.3% bf16 MFU | 1623102 tok/s step 10829/19560 | loss 3.393135 (-0.03z)| norm 0.2735 (-0.41z)| lr 2.65e-04 | 322.74 ms | 52.3% bf16 MFU | 1623172 tok/s step 10830/19560 | loss 3.377544 (-0.37z)| norm 0.2790 (-0.07z)| lr 2.65e-04 | 323.11 ms | 52.2% bf16 MFU | 1623145 tok/s step 10831/19560 | loss 3.372752 (-0.47z)| norm 0.2699 (-0.62z)| lr 2.65e-04 | 322.78 ms | 52.3% bf16 MFU | 1623202 tok/s step 10832/19560 | loss 3.371239 (-0.50z)| norm 0.2757 (-0.25z)| lr 2.65e-04 | 323.15 ms | 52.2% bf16 MFU | 1623163 tok/s step 10833/19560 | loss 3.390530 (-0.07z)| norm 0.2659 (-0.85z)| lr 2.65e-04 | 322.54 ms | 52.3% bf16 MFU | 1623279 tok/s step 
10834/19560 | loss 3.357924 (-0.78z)| norm 0.2711 (-0.52z)| lr 2.65e-04 | 322.82 ms | 52.3% bf16 MFU | 1623320 tok/s step 10835/19560 | loss 3.384309 (-0.20z)| norm 0.2866 (+0.45z)| lr 2.65e-04 | 322.12 ms | 52.4% bf16 MFU | 1623535 tok/s step 10836/19560 | loss 3.387490 (-0.12z)| norm 0.2937 (+0.89z)| lr 2.65e-04 | 323.71 ms | 52.1% bf16 MFU | 1623339 tok/s step 10837/19560 | loss 3.362585 (-0.66z)| norm 0.2567 (-1.40z)| lr 2.65e-04 | 323.01 ms | 52.2% bf16 MFU | 1623329 tok/s step 10838/19560 | loss 3.540397 (+3.13z)| norm 0.2685 (-0.67z)| lr 2.65e-04 | 322.63 ms | 52.3% bf16 MFU | 1623414 tok/s step 10839/19560 | loss 3.356410 (-0.78z)| norm 0.3137 (+2.16z)| lr 2.65e-04 | 322.85 ms | 52.3% bf16 MFU | 1623440 tok/s step 10840/19560 | loss 3.401062 (+0.19z)| norm 0.3095 (+1.85z)| lr 2.65e-04 | 322.80 ms | 52.3% bf16 MFU | 1623478 tok/s step 10841/19560 | loss 3.420581 (+0.61z)| norm 0.2957 (+0.99z)| lr 2.65e-04 | 322.92 ms | 52.3% bf16 MFU | 1623482 tok/s step 10842/19560 | loss 3.429640 (+0.80z)| norm 0.2745 (-0.32z)| lr 2.65e-04 | 322.90 ms | 52.3% bf16 MFU | 1623492 tok/s step 10843/19560 | loss 3.352988 (-0.87z)| norm 0.3672 (+4.84z)| lr 2.65e-04 | 322.55 ms | 52.3% bf16 MFU | 1623590 tok/s step 10844/19560 | loss 3.367725 (-0.54z)| norm 0.3137 (+1.81z)| lr 2.64e-04 | 323.05 ms | 52.2% bf16 MFU | 1623557 tok/s step 10845/19560 | loss 3.408656 (+0.34z)| norm 0.3281 (+2.52z)| lr 2.64e-04 | 323.02 ms | 52.2% bf16 MFU | 1623534 tok/s step 10846/19560 | loss 3.378869 (-0.32z)| norm 0.2847 (+0.21z)| lr 2.64e-04 | 322.93 ms | 52.3% bf16 MFU | 1623535 tok/s step 10847/19560 | loss 3.313420 (-1.74z)| norm 0.3043 (+1.28z)| lr 2.64e-04 | 322.84 ms | 52.3% bf16 MFU | 1623556 tok/s step 10848/19560 | loss 3.333252 (-1.28z)| norm 0.2734 (-0.40z)| lr 2.64e-04 | 323.84 ms | 52.1% bf16 MFU | 1623326 tok/s step 10849/19560 | loss 3.441890 (+1.11z)| norm 0.2928 (+0.67z)| lr 2.64e-04 | 322.36 ms | 52.4% bf16 MFU | 1623480 tok/s step 10850/19560 | loss 3.365376 (-0.57z)| norm 
0.3062 (+1.40z)| lr 2.64e-04 | 322.66 ms | 52.3% bf16 MFU | 1623551 tok/s step 10851/19560 | loss 3.380898 (-0.22z)| norm 0.2649 (-0.87z)| lr 2.64e-04 | 322.68 ms | 52.3% bf16 MFU | 1623612 tok/s step 10852/19560 | loss 3.472203 (+1.78z)| norm 0.3088 (+1.53z)| lr 2.64e-04 | 322.98 ms | 52.3% bf16 MFU | 1623594 tok/s step 10853/19560 | loss 3.424512 (+0.71z)| norm 0.2679 (-0.73z)| lr 2.64e-04 | 322.67 ms | 52.3% bf16 MFU | 1623656 tok/s step 10854/19560 | loss 3.597891 (+4.20z)| norm 0.4598 (+7.40z)| lr 2.64e-04 | 322.80 ms | 52.3% bf16 MFU | 1623684 tok/s step 10855/19560 | loss 3.364655 (-0.61z)| norm 0.3146 (+1.31z)| lr 2.64e-04 | 322.83 ms | 52.3% bf16 MFU | 1623702 tok/s step 10856/19560 | loss 3.420402 (+0.58z)| norm 0.3818 (+3.84z)| lr 2.64e-04 | 322.46 ms | 52.3% bf16 MFU | 1623813 tok/s step 10857/19560 | loss 3.469375 (+1.59z)| norm 0.3224 (+1.49z)| lr 2.64e-04 | 322.75 ms | 52.3% bf16 MFU | 1623844 tok/s step 10858/19560 | loss 3.417418 (+0.48z)| norm 0.3284 (+1.69z)| lr 2.64e-04 | 323.18 ms | 52.2% bf16 MFU | 1623766 tok/s step 10859/19560 | loss 3.366975 (-0.58z)| norm 0.3074 (+0.88z)| lr 2.64e-04 | 322.79 ms | 52.3% bf16 MFU | 1623788 tok/s step 10860/19560 | loss 3.396113 (+0.04z)| norm 0.2970 (+0.48z)| lr 2.64e-04 | 322.54 ms | 52.3% bf16 MFU | 1623874 tok/s step 10861/19560 | loss 3.361443 (-0.69z)| norm 0.3032 (+0.71z)| lr 2.64e-04 | 322.95 ms | 52.3% bf16 MFU | 1623852 tok/s step 10862/19560 | loss 3.315992 (-1.63z)| norm 0.2881 (+0.12z)| lr 2.64e-04 | 322.70 ms | 52.3% bf16 MFU | 1623893 tok/s step 10863/19560 | loss 3.411374 (+0.36z)| norm 0.2888 (+0.14z)| lr 2.64e-04 | 323.24 ms | 52.2% bf16 MFU | 1623796 tok/s step 10864/19560 | loss 3.430251 (+0.75z)| norm 0.2937 (+0.33z)| lr 2.63e-04 | 322.86 ms | 52.3% bf16 MFU | 1623801 tok/s step 10865/19560 | loss 3.299731 (-1.96z)| norm 0.3047 (+0.74z)| lr 2.63e-04 | 323.24 ms | 52.2% bf16 MFU | 1623711 tok/s step 10866/19560 | loss 3.424506 (+0.65z)| norm 0.2688 (-0.66z)| lr 2.63e-04 | 322.66 ms | 
52.3% bf16 MFU | 1623769 tok/s step 10867/19560 | loss 3.434247 (+0.84z)| norm 0.2887 (+0.11z)| lr 2.63e-04 | 323.02 ms | 52.2% bf16 MFU | 1623734 tok/s step 10868/19560 | loss 3.312038 (-1.68z)| norm 0.2641 (-0.85z)| lr 2.63e-04 | 322.40 ms | 52.3% bf16 MFU | 1623857 tok/s step 10869/19560 | loss 3.389585 (-0.08z)| norm 0.2655 (-0.79z)| lr 2.63e-04 | 322.77 ms | 52.3% bf16 MFU | 1623882 tok/s step 10870/19560 | loss 3.486928 (+1.91z)| norm 0.2805 (-0.21z)| lr 2.63e-04 | 322.44 ms | 52.3% bf16 MFU | 1623986 tok/s step 10871/19560 | loss 3.388424 (-0.09z)| norm 0.2570 (-1.12z)| lr 2.63e-04 | 322.90 ms | 52.3% bf16 MFU | 1623971 tok/s step 10872/19560 | loss 3.377426 (-0.33z)| norm 0.2716 (-0.55z)| lr 2.63e-04 | 322.80 ms | 52.3% bf16 MFU | 1623982 tok/s step 10873/19560 | loss 3.351853 (-0.85z)| norm 0.2681 (-0.68z)| lr 2.63e-04 | 322.96 ms | 52.3% bf16 MFU | 1623952 tok/s step 10874/19560 | loss 3.406264 (+0.27z)| norm 0.2600 (-0.99z)| lr 2.63e-04 | 322.89 ms | 52.3% bf16 MFU | 1623942 tok/s step 10875/19560 | loss 3.365117 (-0.58z)| norm 0.2633 (-0.85z)| lr 2.63e-04 | 322.56 ms | 52.3% bf16 MFU | 1624016 tok/s step 10876/19560 | loss 3.429837 (+0.76z)| norm 0.2502 (-1.34z)| lr 2.63e-04 | 323.46 ms | 52.2% bf16 MFU | 1623858 tok/s step 10877/19560 | loss 3.522551 (+2.59z)| norm 0.2677 (-0.67z)| lr 2.63e-04 | 322.83 ms | 52.3% bf16 MFU | 1623866 tok/s step 10878/19560 | loss 3.470643 (+1.54z)| norm 0.2696 (-0.58z)| lr 2.63e-04 | 322.68 ms | 52.3% bf16 MFU | 1623913 tok/s step 10879/19560 | loss 3.358827 (-0.71z)| norm 0.2703 (-0.55z)| lr 2.63e-04 | 322.96 ms | 52.3% bf16 MFU | 1623886 tok/s step 10880/19560 | loss 3.432891 (+0.80z)| norm 0.2883 (+0.13z)| lr 2.63e-04 | 322.89 ms | 52.3% bf16 MFU | 1623878 tok/s step 10881/19560 | loss 3.412041 (+0.40z)| norm 0.2770 (-0.29z)| lr 2.63e-04 | 322.50 ms | 52.3% bf16 MFU | 1623970 tok/s step 10882/19560 | loss 3.372268 (-0.43z)| norm 0.2851 (+0.01z)| lr 2.63e-04 | 323.32 ms | 52.2% bf16 MFU | 1623850 tok/s step 10883/19560 
| loss 3.362860 (-0.62z)| norm 0.2879 (+0.11z)| lr 2.63e-04 | 323.06 ms | 52.2% bf16 MFU | 1623803 tok/s step 10884/19560 | loss 3.368454 (-0.49z)| norm 0.2866 (+0.05z)| lr 2.62e-04 | 323.01 ms | 52.2% bf16 MFU | 1623769 tok/s step 10885/19560 | loss 3.378842 (-0.27z)| norm 0.2707 (-0.56z)| lr 2.62e-04 | 322.69 ms | 52.3% bf16 MFU | 1623817 tok/s step 10886/19560 | loss 3.564500 (+3.46z)| norm 0.3499 (+2.43z)| lr 2.62e-04 | 322.68 ms | 52.3% bf16 MFU | 1623864 tok/s step 10887/19560 | loss 3.425547 (+0.66z)| norm 0.3145 (+1.08z)| lr 2.62e-04 | 323.10 ms | 52.2% bf16 MFU | 1623806 tok/s step 10888/19560 | loss 3.379311 (-0.27z)| norm 0.2935 (+0.29z)| lr 2.62e-04 | 322.91 ms | 52.3% bf16 MFU | 1623797 tok/s step 10889/19560 | loss 3.357249 (-0.72z)| norm 0.2869 (+0.04z)| lr 2.62e-04 | 322.48 ms | 52.3% bf16 MFU | 1623897 tok/s step 10890/19560 | loss 3.366346 (-0.54z)| norm 0.2849 (-0.05z)| lr 2.62e-04 | 323.05 ms | 52.2% bf16 MFU | 1623848 tok/s step 10891/19560 | loss 3.381555 (-0.23z)| norm 0.2738 (-0.47z)| lr 2.62e-04 | 322.95 ms | 52.3% bf16 MFU | 1623829 tok/s step 10892/19560 | loss 3.360115 (-0.65z)| norm 0.2773 (-0.33z)| lr 2.62e-04 | 322.79 ms | 52.3% bf16 MFU | 1623849 tok/s step 10893/19560 | loss 3.466302 (+1.48z)| norm 0.2779 (-0.32z)| lr 2.62e-04 | 322.50 ms | 52.3% bf16 MFU | 1623940 tok/s step 10894/19560 | loss 3.417287 (+0.49z)| norm 0.2695 (-0.63z)| lr 2.62e-04 | 323.17 ms | 52.2% bf16 MFU | 1623860 tok/s step 10895/19560 | loss 3.391749 (-0.03z)| norm 0.2980 (+0.44z)| lr 2.62e-04 | 322.86 ms | 52.3% bf16 MFU | 1623862 tok/s step 10896/19560 | loss 3.356591 (-0.74z)| norm 0.2885 (+0.07z)| lr 2.62e-04 | 322.51 ms | 52.3% bf16 MFU | 1623951 tok/s step 10897/19560 | loss 3.475636 (+1.63z)| norm 0.2815 (-0.21z)| lr 2.62e-04 | 323.52 ms | 52.2% bf16 MFU | 1623782 tok/s step 10898/19560 | loss 3.505582 (+2.17z)| norm 0.2954 (+0.32z)| lr 2.62e-04 | 322.89 ms | 52.3% bf16 MFU | 1623780 tok/s step 10899/19560 | loss 3.382524 (-0.24z)| norm 0.2766 (-0.42z)| 
lr 2.62e-04 | 322.69 ms | 52.3% bf16 MFU | 1623828 tok/s step 10900/19560 | loss 3.377697 (-0.33z)| norm 0.2763 (-0.44z)| lr 2.62e-04 | 322.37 ms | 52.4% bf16 MFU | 1623954 tok/s step 10901/19560 | loss 3.382364 (-0.24z)| norm 0.2596 (-1.09z)| lr 2.62e-04 | 323.49 ms | 52.2% bf16 MFU | 1623792 tok/s step 10902/19560 | loss 3.463415 (+1.33z)| norm 0.2858 (-0.07z)| lr 2.62e-04 | 323.01 ms | 52.2% bf16 MFU | 1623758 tok/s step 10903/19560 | loss 3.332949 (-1.21z)| norm 0.2698 (-0.69z)| lr 2.62e-04 | 323.30 ms | 52.2% bf16 MFU | 1623655 tok/s step 10904/19560 | loss 3.445967 (+0.98z)| norm 0.2812 (-0.25z)| lr 2.61e-04 | 322.83 ms | 52.3% bf16 MFU | 1623674 tok/s step 10905/19560 | loss 3.398821 (+0.05z)| norm 0.2595 (-1.09z)| lr 2.61e-04 | 322.51 ms | 52.3% bf16 MFU | 1623772 tok/s step 10906/19560 | loss 3.357802 (-0.75z)| norm 0.2773 (-0.39z)| lr 2.61e-04 | 323.18 ms | 52.2% bf16 MFU | 1623698 tok/s step 10907/19560 | loss 3.376558 (-0.39z)| norm 0.3491 (+2.35z)| lr 2.61e-04 | 322.83 ms | 52.3% bf16 MFU | 1623716 tok/s step 10908/19560 | loss 3.378163 (-0.37z)| norm 0.2838 (-0.16z)| lr 2.61e-04 | 322.47 ms | 52.3% bf16 MFU | 1623822 tok/s step 10909/19560 | loss 3.360914 (-0.72z)| norm 0.2690 (-0.72z)| lr 2.61e-04 | 323.41 ms | 52.2% bf16 MFU | 1623686 tok/s step 10910/19560 | loss 3.378889 (-0.36z)| norm 0.2852 (-0.11z)| lr 2.61e-04 | 322.57 ms | 52.3% bf16 MFU | 1623769 tok/s step 10911/19560 | loss 3.328840 (-1.34z)| norm 0.2694 (-0.72z)| lr 2.61e-04 | 322.73 ms | 52.3% bf16 MFU | 1623808 tok/s step 10912/19560 | loss 3.463553 (+1.31z)| norm 0.3005 (+0.47z)| lr 2.61e-04 | 322.57 ms | 52.3% bf16 MFU | 1623884 tok/s step 10913/19560 | loss 3.460267 (+1.23z)| norm 0.2626 (-0.97z)| lr 2.61e-04 | 323.02 ms | 52.2% bf16 MFU | 1623844 tok/s step 10914/19560 | loss 3.312918 (-1.62z)| norm 0.2784 (-0.37z)| lr 2.61e-04 | 323.13 ms | 52.2% bf16 MFU | 1623778 tok/s step 10915/19560 | loss 3.401324 (+0.08z)| norm 0.2645 (-0.90z)| lr 2.61e-04 | 322.81 ms | 52.3% bf16 MFU | 
1623797 tok/s step 10916/19560 | loss 3.405751 (+0.17z)| norm 0.2833 (-0.18z)| lr 2.61e-04 | 322.93 ms | 52.3% bf16 MFU | 1623785 tok/s step 10917/19560 | loss 3.405307 (+0.16z)| norm 0.2748 (-0.51z)| lr 2.61e-04 | 323.01 ms | 52.3% bf16 MFU | 1623753 tok/s step 10918/19560 | loss 3.336568 (-1.17z)| norm 0.2988 (+0.42z)| lr 2.61e-04 | 322.47 ms | 52.3% bf16 MFU | 1623856 tok/s step 10919/19560 | loss 3.353712 (-0.82z)| norm 0.2748 (-0.50z)| lr 2.61e-04 | 323.11 ms | 52.2% bf16 MFU | 1623795 tok/s step 10920/19560 | loss 3.397216 (+0.04z)| norm 0.2788 (-0.34z)| lr 2.61e-04 | 322.73 ms | 52.3% bf16 MFU | 1623832 tok/s step 10921/19560 | loss 3.431175 (+0.71z)| norm 0.2805 (-0.27z)| lr 2.61e-04 | 323.18 ms | 52.2% bf16 MFU | 1623753 tok/s step 10922/19560 | loss 3.342671 (-1.03z)| norm 0.2546 (-1.25z)| lr 2.61e-04 | 322.77 ms | 52.3% bf16 MFU | 1623782 tok/s step 10923/19560 | loss 3.405630 (+0.21z)| norm 0.2785 (-0.33z)| lr 2.61e-04 | 322.22 ms | 52.4% bf16 MFU | 1623950 tok/s step 10924/19560 | loss 3.409636 (+0.29z)| norm 0.2812 (-0.23z)| lr 2.60e-04 | 323.41 ms | 52.2% bf16 MFU | 1623809 tok/s step 10925/19560 | loss 3.476847 (+1.59z)| norm 0.2607 (-1.00z)| lr 2.60e-04 | 322.38 ms | 52.4% bf16 MFU | 1623935 tok/s step 10926/19560 | loss 3.418782 (+0.44z)| norm 0.2533 (-1.26z)| lr 2.60e-04 | 323.62 ms | 52.2% bf16 MFU | 1623741 tok/s step 10927/19560 | loss 3.352304 (-0.88z)| norm 0.2652 (-0.80z)| lr 2.60e-04 | 322.50 ms | 52.3% bf16 MFU | 1623838 tok/s step 10928/19560 | loss 3.406275 (+0.19z)| norm 0.2817 (-0.18z)| lr 2.60e-04 | 322.48 ms | 52.3% bf16 MFU | 1623936 tok/s step 10929/19560 | loss 3.382541 (-0.29z)| norm 0.2920 (+0.21z)| lr 2.60e-04 | 323.49 ms | 52.2% bf16 MFU | 1623774 tok/s step 10930/19560 | loss 3.409341 (+0.24z)| norm 0.2808 (-0.21z)| lr 2.60e-04 | 322.34 ms | 52.4% bf16 MFU | 1623910 tok/s step 10931/19560 | loss 3.417996 (+0.41z)| norm 0.2534 (-1.24z)| lr 2.60e-04 | 322.53 ms | 52.3% bf16 MFU | 1623992 tok/s step 10932/19560 | loss 3.398843 
(+0.04z)| norm 0.2802 (-0.23z)| lr 2.60e-04 | 323.18 ms | 52.2% bf16 MFU | 1623905 tok/s step 10933/19560 | loss 3.395838 (-0.03z)| norm 0.2569 (-1.09z)| lr 2.60e-04 | 322.86 ms | 52.3% bf16 MFU | 1623904 tok/s step 10934/19560 | loss 3.416376 (+0.38z)| norm 0.2755 (-0.39z)| lr 2.60e-04 | 322.27 ms | 52.4% bf16 MFU | 1624051 tok/s step 10935/19560 | loss 3.368089 (-0.60z)| norm 0.2512 (-1.28z)| lr 2.60e-04 | 322.80 ms | 52.3% bf16 MFU | 1624057 tok/s step 10936/19560 | loss 3.385764 (-0.24z)| norm 0.2768 (-0.32z)| lr 2.60e-04 | 323.10 ms | 52.2% bf16 MFU | 1623987 tok/s step 10937/19560 | loss 3.453818 (+1.12z)| norm 0.2637 (-0.80z)| lr 2.60e-04 | 323.30 ms | 52.2% bf16 MFU | 1623873 tok/s step 10938/19560 | loss 3.382271 (-0.31z)| norm 0.2758 (-0.35z)| lr 2.60e-04 | 323.31 ms | 52.2% bf16 MFU | 1623761 tok/s step 10939/19560 | loss 3.347185 (-1.00z)| norm 0.2688 (-0.61z)| lr 2.60e-04 | 322.46 ms | 52.3% bf16 MFU | 1623867 tok/s step 10940/19560 | loss 3.382258 (-0.30z)| norm 0.2885 (+0.14z)| lr 2.60e-04 | 322.98 ms | 52.3% bf16 MFU | 1623839 tok/s step 10941/19560 | loss 3.387803 (-0.19z)| norm 0.2778 (-0.26z)| lr 2.60e-04 | 323.06 ms | 52.2% bf16 MFU | 1623792 tok/s step 10942/19560 | loss 3.391911 (-0.12z)| norm 0.2609 (-0.88z)| lr 2.60e-04 | 322.69 ms | 52.3% bf16 MFU | 1623840 tok/s step 10943/19560 | loss 3.340930 (-1.13z)| norm 0.2848 (+0.00z)| lr 2.60e-04 | 323.62 ms | 52.2% bf16 MFU | 1623653 tok/s step 10944/19560 | loss 3.367364 (-0.62z)| norm 0.2790 (-0.20z)| lr 2.59e-04 | 322.22 ms | 52.4% bf16 MFU | 1623826 tok/s step 10945/19560 | loss 3.353403 (-0.88z)| norm 0.2698 (-0.54z)| lr 2.59e-04 | 322.76 ms | 52.3% bf16 MFU | 1623853 tok/s step 10946/19560 | loss 3.560903 (+3.13z)| norm 0.2849 (+0.03z)| lr 2.59e-04 | 323.57 ms | 52.2% bf16 MFU | 1623676 tok/s step 10947/19560 | loss 3.424632 (+0.52z)| norm 0.2851 (+0.04z)| lr 2.59e-04 | 322.86 ms | 52.3% bf16 MFU | 1623685 tok/s step 10948/19560 | loss 3.372797 (-0.49z)| norm 0.2709 (-0.49z)| lr 2.59e-04 | 
322.70 ms | 52.3% bf16 MFU | 1623736 tok/s step 10949/19560 | loss 3.443588 (+0.88z)| norm 0.2957 (+0.44z)| lr 2.59e-04 | 322.39 ms | 52.4% bf16 MFU | 1623861 tok/s step 10950/19560 | loss 3.379422 (-0.37z)| norm 0.2771 (-0.27z)| lr 2.59e-04 | 323.34 ms | 52.2% bf16 MFU | 1623743 tok/s step 10951/19560 | loss 3.359477 (-0.75z)| norm 0.2785 (-0.21z)| lr 2.59e-04 | 322.62 ms | 52.3% bf16 MFU | 1623812 tok/s step 10952/19560 | loss 3.426171 (+0.55z)| norm 0.2962 (+0.45z)| lr 2.59e-04 | 322.64 ms | 52.3% bf16 MFU | 1623870 tok/s step 10953/19560 | loss 3.402125 (+0.08z)| norm 0.3020 (+0.66z)| lr 2.59e-04 | 322.90 ms | 52.3% bf16 MFU | 1623860 tok/s step 10954/19560 | loss 3.398101 (-0.00z)| norm 0.2884 (+0.14z)| lr 2.59e-04 | 323.40 ms | 52.2% bf16 MFU | 1623725 tok/s step 10955/19560 | loss 3.398775 (+0.01z)| norm 0.2885 (+0.14z)| lr 2.59e-04 | 323.05 ms | 52.2% bf16 MFU | 1623686 tok/s step 10956/19560 | loss 3.406848 (+0.16z)| norm 0.2688 (-0.61z)| lr 2.59e-04 | 322.81 ms | 52.3% bf16 MFU | 1623708 tok/s step 10957/19560 | loss 3.441510 (+0.84z)| norm 0.2826 (-0.09z)| lr 2.59e-04 | 322.56 ms | 52.3% bf16 MFU | 1623792 tok/s step 10958/19560 | loss 3.359668 (-0.77z)| norm 0.2657 (-0.72z)| lr 2.59e-04 | 323.07 ms | 52.2% bf16 MFU | 1623744 tok/s step 10959/19560 | loss 3.442680 (+0.85z)| norm 0.3063 (+0.80z)| lr 2.59e-04 | 323.29 ms | 52.2% bf16 MFU | 1623642 tok/s step 10960/19560 | loss 3.431309 (+0.62z)| norm 0.3136 (+1.06z)| lr 2.59e-04 | 322.67 ms | 52.3% bf16 MFU | 1623703 tok/s step 10961/19560 | loss 3.445880 (+0.89z)| norm 0.2882 (+0.10z)| lr 2.59e-04 | 322.79 ms | 52.3% bf16 MFU | 1623730 tok/s step 10962/19560 | loss 3.415834 (+0.29z)| norm 0.2969 (+0.42z)| lr 2.59e-04 | 322.76 ms | 52.3% bf16 MFU | 1623763 tok/s step 10963/19560 | loss 3.419170 (+0.35z)| norm 0.2846 (-0.04z)| lr 2.59e-04 | 323.02 ms | 52.2% bf16 MFU | 1623728 tok/s step 10964/19560 | loss 3.483056 (+1.58z)| norm 0.2832 (-0.09z)| lr 2.59e-04 | 323.34 ms | 52.2% bf16 MFU | 1623615 tok/s step 
10965/19560 | loss 3.398981 (-0.06z)| norm 0.2817 (-0.16z)| lr 2.58e-04 | 322.37 ms | 52.4% bf16 MFU | 1623753 tok/s step 10966/19560 | loss 3.483033 (+1.61z)| norm 0.3137 (+1.04z)| lr 2.58e-04 | 322.98 ms | 52.3% bf16 MFU | 1623729 tok/s step 10967/19560 | loss 3.333787 (-1.34z)| norm 0.3070 (+0.79z)| lr 2.58e-04 | 323.30 ms | 52.2% bf16 MFU | 1623626 tok/s step 10968/19560 | loss 3.375840 (-0.50z)| norm 0.3036 (+0.66z)| lr 2.58e-04 | 322.45 ms | 52.3% bf16 MFU | 1623742 tok/s step 10969/19560 | loss 3.437755 (+0.72z)| norm 0.2905 (+0.17z)| lr 2.58e-04 | 323.16 ms | 52.2% bf16 MFU | 1623674 tok/s step 10970/19560 | loss 3.451494 (+0.98z)| norm 0.3258 (+1.48z)| lr 2.58e-04 | 323.65 ms | 52.1% bf16 MFU | 1623487 tok/s step 10971/19560 | loss 3.406440 (+0.09z)| norm 0.2931 (+0.28z)| lr 2.58e-04 | 322.73 ms | 52.3% bf16 MFU | 1623540 tok/s step 10972/19560 | loss 3.423167 (+0.41z)| norm 0.2865 (+0.03z)| lr 2.58e-04 | 322.88 ms | 52.3% bf16 MFU | 1623553 tok/s step 10973/19560 | loss 3.391899 (-0.21z)| norm 0.2913 (+0.24z)| lr 2.58e-04 | 322.92 ms | 52.3% bf16 MFU | 1623554 tok/s step 10974/19560 | loss 3.405877 (+0.07z)| norm 0.2854 (+0.00z)| lr 2.58e-04 | 323.41 ms | 52.2% bf16 MFU | 1623433 tok/s step 10975/19560 | loss 3.406671 (+0.07z)| norm 0.2662 (-0.75z)| lr 2.58e-04 | 322.70 ms | 52.3% bf16 MFU | 1623496 tok/s step 10976/19560 | loss 3.377272 (-0.53z)| norm 0.2823 (-0.11z)| lr 2.58e-04 | 322.88 ms | 52.3% bf16 MFU | 1623512 tok/s step 10977/19560 | loss 3.372987 (-0.61z)| norm 0.2699 (-0.60z)| lr 2.58e-04 | 323.02 ms | 52.2% bf16 MFU | 1623489 tok/s step 10978/19560 | loss 3.394653 (-0.17z)| norm 0.2683 (-0.65z)| lr 2.58e-04 | 322.95 ms | 52.3% bf16 MFU | 1623487 tok/s step 10979/19560 | loss 3.436208 (+0.66z)| norm 0.2747 (-0.40z)| lr 2.58e-04 | 323.45 ms | 52.2% bf16 MFU | 1623359 tok/s step 10980/19560 | loss 3.345667 (-1.15z)| norm 0.2952 (+0.42z)| lr 2.58e-04 | 322.82 ms | 52.3% bf16 MFU | 1623395 tok/s step 10981/19560 | loss 3.362231 (-0.81z)| norm 
0.2807 (-0.16z)| lr 2.58e-04 | 322.71 ms | 52.3% bf16 MFU | 1623457 tok/s step 10982/19560 | loss 3.366450 (-0.73z)| norm 0.2835 (+0.00z)| lr 2.58e-04 | 322.62 ms | 52.3% bf16 MFU | 1623538 tok/s step 10983/19560 | loss 3.409961 (+0.20z)| norm 0.2762 (-0.35z)| lr 2.58e-04 | 323.00 ms | 52.3% bf16 MFU | 1623521 tok/s step 10984/19560 | loss 3.373641 (-0.58z)| norm 0.2777 (-0.26z)| lr 2.58e-04 | 323.24 ms | 52.2% bf16 MFU | 1623443 tok/s step 10985/19560 | loss 3.400505 (+0.01z)| norm 0.2612 (-1.20z)| lr 2.57e-04 | 322.91 ms | 52.3% bf16 MFU | 1623454 tok/s step 10986/19560 | loss 3.399309 (-0.01z)| norm 0.2747 (-0.40z)| lr 2.57e-04 | 323.17 ms | 52.2% bf16 MFU | 1623398 tok/s step 10987/19560 | loss 3.394558 (-0.12z)| norm 0.2555 (-1.53z)| lr 2.57e-04 | 323.04 ms | 52.2% bf16 MFU | 1623378 tok/s step 10988/19560 | loss 3.320066 (-1.71z)| norm 0.3726 (+4.93z)| lr 2.57e-04 | 322.88 ms | 52.3% bf16 MFU | 1623397 tok/s step 10989/19560 | loss 3.333389 (-1.41z)| norm 0.2922 (+0.58z)| lr 2.57e-04 | 322.80 ms | 52.3% bf16 MFU | 1623436 tok/s step 10990/19560 | loss 3.371871 (-0.60z)| norm 0.2997 (+0.99z)| lr 2.57e-04 | 322.89 ms | 52.3% bf16 MFU | 1623451 tok/s step 10991/19560 | loss 3.368840 (-0.66z)| norm 0.3094 (+1.49z)| lr 2.57e-04 | 323.25 ms | 52.2% bf16 MFU | 1623374 tok/s step 10992/19560 | loss 3.395851 (-0.07z)| norm 0.2951 (+0.72z)| lr 2.57e-04 | 322.99 ms | 52.3% bf16 MFU | 1623367 tok/s step 10993/19560 | loss 3.413866 (+0.31z)| norm 0.2829 (+0.07z)| lr 2.57e-04 | 322.62 ms | 52.3% bf16 MFU | 1623453 tok/s step 10994/19560 | loss 3.298395 (-2.18z)| norm 0.2698 (-0.63z)| lr 2.57e-04 | 323.31 ms | 52.2% bf16 MFU | 1623362 tok/s step 10995/19560 | loss 3.370652 (-0.60z)| norm 0.3012 (+1.05z)| lr 2.57e-04 | 323.07 ms | 52.2% bf16 MFU | 1623336 tok/s step 10996/19560 | loss 3.427664 (+0.62z)| norm 0.2603 (-1.15z)| lr 2.57e-04 | 324.02 ms | 52.1% bf16 MFU | 1623074 tok/s step 10997/19560 | loss 3.378148 (-0.46z)| norm 0.2710 (-0.57z)| lr 2.57e-04 | 323.45 ms | 
52.2% bf16 MFU | 1622966 tok/s
step 10998/19560 | loss 3.423182 (+0.54z)| norm 0.2748 (-0.37z)| lr 2.57e-04 | 323.26 ms | 52.2% bf16 MFU | 1622911 tok/s
step 10999/19560 | loss 3.396835 (-0.04z)| norm 0.2591 (-1.22z)| lr 2.57e-04 | 323.29 ms | 52.2% bf16 MFU | 1622852 tok/s
step 11000/19560 | loss 3.354498 (-0.98z)| norm 0.2777 (-0.22z)| lr 2.57e-04 | 323.06 ms | 52.2% bf16 MFU | 1622852 tok/s
val loss 3.374903
evaluating HellaSwag: 0/79
evaluating HellaSwag: 10/79
evaluating HellaSwag: 20/79
evaluating HellaSwag: 30/79
evaluating HellaSwag: 40/79
evaluating HellaSwag: 50/79
evaluating HellaSwag: 60/79
evaluating HellaSwag: 70/79
HellaSwag: 2939/10042 = 0.292671
step 11001/19560 | loss 3.407538 (+0.19z)| norm 0.2926 (+0.58z)| lr 2.57e-04 | 322.12 ms | 52.4% bf16 MFU | 1623090 tok/s
step 11002/19560 | loss 3.368099 (-0.68z)| norm 0.2902 (+0.44z)| lr 2.57e-04 | 323.44 ms | 52.2% bf16 MFU | 1622985 tok/s
step 11003/19560 | loss 3.314169 (-1.85z)| norm 0.3070 (+1.33z)| lr 2.57e-04 | 322.99 ms | 52.3% bf16 MFU | 1622997 tok/s
step 11004/19560 | loss 3.369700 (-0.62z)| norm 0.2693 (-0.73z)| lr 2.57e-04 | 323.10 ms | 52.2% bf16 MFU | 1622982 tok/s
step 11005/19560 | loss 3.359953 (-0.83z)| norm 0.2876 (+0.26z)| lr 2.56e-04 | 323.74 ms | 52.1% bf16 MFU | 1622806 tok/s
step 11006/19560 | loss 3.394910 (-0.03z)| norm 0.2651 (-0.96z)| lr 2.56e-04 | 322.67 ms | 52.3% bf16 MFU | 1622908 tok/s
step 11007/19560 | loss 3.366891 (-0.67z)| norm 0.2943 (+0.62z)| lr 2.56e-04 | 323.30 ms | 52.2% bf16 MFU | 1622848 tok/s
step 11008/19560 | loss 3.355477 (-0.92z)| norm 0.2773 (-0.30z)| lr 2.56e-04 | 323.30 ms | 52.2% bf16 MFU | 1622790 tok/s
step 11009/19560 | loss 3.414553 (+0.44z)| norm 0.2767 (-0.33z)| lr 2.56e-04 | 322.81 ms | 52.3% bf16 MFU | 1622858 tok/s
step 11010/19560 | loss 3.346579 (-1.11z)| norm 0.2814 (-0.08z)| lr 2.56e-04 | 322.99 ms | 52.3% bf16 MFU | 1622877 tok/s
step 11011/19560 | loss 3.372274 (-0.53z)| norm 0.2839 (+0.06z)| lr 2.56e-04 | 322.70 ms | 52.3% bf16 MFU |
1622968 tok/s step 11012/19560 | loss 3.412208 (+0.38z)| norm 0.2671 (-0.85z)| lr 2.56e-04 | 322.97 ms | 52.3% bf16 MFU | 1622985 tok/s step 11013/19560 | loss 3.319765 (-1.71z)| norm 0.2691 (-0.74z)| lr 2.56e-04 | 322.86 ms | 52.3% bf16 MFU | 1623031 tok/s step 11014/19560 | loss 3.307441 (-2.03z)| norm 0.2754 (-0.38z)| lr 2.56e-04 | 322.88 ms | 52.3% bf16 MFU | 1623068 tok/s step 11015/19560 | loss 3.401705 (+0.21z)| norm 0.2712 (-0.61z)| lr 2.56e-04 | 322.59 ms | 52.3% bf16 MFU | 1623176 tok/s step 11016/19560 | loss 3.365945 (-0.64z)| norm 0.2704 (-0.65z)| lr 2.56e-04 | 323.11 ms | 52.2% bf16 MFU | 1623148 tok/s step 11017/19560 | loss 3.432911 (+0.93z)| norm 0.3232 (+2.37z)| lr 2.56e-04 | 323.23 ms | 52.2% bf16 MFU | 1623092 tok/s step 11018/19560 | loss 3.352254 (-0.97z)| norm 0.3093 (+1.55z)| lr 2.56e-04 | 322.78 ms | 52.3% bf16 MFU | 1623151 tok/s step 11019/19560 | loss 3.401351 (+0.18z)| norm 0.3146 (+1.81z)| lr 2.56e-04 | 322.16 ms | 52.4% bf16 MFU | 1623363 tok/s step 11020/19560 | loss 3.394053 (+0.01z)| norm 0.2652 (-0.95z)| lr 2.56e-04 | 323.07 ms | 52.2% bf16 MFU | 1623336 tok/s step 11021/19560 | loss 3.344449 (-1.15z)| norm 0.2941 (+0.65z)| lr 2.56e-04 | 322.27 ms | 52.4% bf16 MFU | 1623512 tok/s step 11022/19560 | loss 3.363213 (-0.70z)| norm 0.2575 (-1.37z)| lr 2.56e-04 | 322.62 ms | 52.3% bf16 MFU | 1623591 tok/s step 11023/19560 | loss 3.312337 (-1.86z)| norm 0.3169 (+1.89z)| lr 2.56e-04 | 323.31 ms | 52.2% bf16 MFU | 1623493 tok/s step 11024/19560 | loss 3.413115 (+0.49z)| norm 0.3202 (+2.02z)| lr 2.56e-04 | 322.48 ms | 52.3% bf16 MFU | 1623609 tok/s step 11025/19560 | loss 3.365387 (-0.62z)| norm 0.2826 (+0.00z)| lr 2.55e-04 | 323.16 ms | 52.2% bf16 MFU | 1623546 tok/s step 11026/19560 | loss 3.396526 (+0.15z)| norm 0.2867 (+0.22z)| lr 2.55e-04 | 322.25 ms | 52.4% bf16 MFU | 1623718 tok/s step 11027/19560 | loss 3.482574 (+2.20z)| norm 0.2625 (-1.07z)| lr 2.55e-04 | 323.17 ms | 52.2% bf16 MFU | 1623650 tok/s step 11028/19560 | loss 3.360198 
(-0.75z)| norm 0.2946 (+0.65z)| lr 2.55e-04 | 322.88 ms | 52.3% bf16 MFU | 1623657 tok/s step 11029/19560 | loss 3.371231 (-0.48z)| norm 0.2541 (-1.52z)| lr 2.55e-04 | 322.55 ms | 52.3% bf16 MFU | 1623746 tok/s step 11030/19560 | loss 3.369794 (-0.50z)| norm 0.2945 (+0.63z)| lr 2.55e-04 | 322.82 ms | 52.3% bf16 MFU | 1623763 tok/s step 11031/19560 | loss 3.381154 (-0.24z)| norm 0.2706 (-0.64z)| lr 2.55e-04 | 322.55 ms | 52.3% bf16 MFU | 1623847 tok/s step 11032/19560 | loss 3.437511 (+1.15z)| norm 0.2905 (+0.42z)| lr 2.55e-04 | 322.70 ms | 52.3% bf16 MFU | 1623889 tok/s step 11033/19560 | loss 3.365253 (-0.62z)| norm 0.2723 (-0.56z)| lr 2.55e-04 | 322.71 ms | 52.3% bf16 MFU | 1623926 tok/s step 11034/19560 | loss 3.295172 (-2.28z)| norm 0.2765 (-0.34z)| lr 2.55e-04 | 323.12 ms | 52.2% bf16 MFU | 1623860 tok/s step 11035/19560 | loss 3.377822 (-0.29z)| norm 0.2892 (+0.39z)| lr 2.55e-04 | 322.62 ms | 52.3% bf16 MFU | 1623921 tok/s step 11036/19560 | loss 3.429087 (+0.93z)| norm 0.2706 (-0.65z)| lr 2.55e-04 | 322.91 ms | 52.3% bf16 MFU | 1623906 tok/s step 11037/19560 | loss 3.333941 (-1.34z)| norm 0.2788 (-0.20z)| lr 2.55e-04 | 322.72 ms | 52.3% bf16 MFU | 1623940 tok/s step 11038/19560 | loss 3.348154 (-0.99z)| norm 0.2949 (+0.71z)| lr 2.55e-04 | 322.75 ms | 52.3% bf16 MFU | 1623964 tok/s step 11039/19560 | loss 3.389822 (-0.01z)| norm 0.2618 (-1.15z)| lr 2.55e-04 | 322.85 ms | 52.3% bf16 MFU | 1623962 tok/s step 11040/19560 | loss 3.399871 (+0.24z)| norm 0.2643 (-1.00z)| lr 2.55e-04 | 322.65 ms | 52.3% bf16 MFU | 1624011 tok/s step 11041/19560 | loss 3.372859 (-0.40z)| norm 0.2548 (-1.52z)| lr 2.55e-04 | 323.12 ms | 52.2% bf16 MFU | 1623939 tok/s step 11042/19560 | loss 3.379656 (-0.25z)| norm 0.2716 (-0.58z)| lr 2.55e-04 | 323.04 ms | 52.2% bf16 MFU | 1623892 tok/s step 11043/19560 | loss 3.399301 (+0.24z)| norm 0.2530 (-1.60z)| lr 2.55e-04 | 322.32 ms | 52.4% bf16 MFU | 1624027 tok/s step 11044/19560 | loss 3.367078 (-0.55z)| norm 0.2713 (-0.58z)| lr 2.55e-04 | 
322.45 ms | 52.3% bf16 MFU | 1624123 tok/s step 11045/19560 | loss 3.413878 (+0.61z)| norm 0.2514 (-1.66z)| lr 2.55e-04 | 322.96 ms | 52.3% bf16 MFU | 1624086 tok/s step 11046/19560 | loss 3.350343 (-0.98z)| norm 0.2630 (-1.00z)| lr 2.54e-04 | 322.59 ms | 52.3% bf16 MFU | 1624145 tok/s step 11047/19560 | loss 3.388526 (-0.03z)| norm 0.2819 (+0.03z)| lr 2.54e-04 | 322.94 ms | 52.3% bf16 MFU | 1624111 tok/s step 11048/19560 | loss 3.346682 (-1.06z)| norm 0.2761 (-0.29z)| lr 2.54e-04 | 322.48 ms | 52.3% bf16 MFU | 1624196 tok/s step 11049/19560 | loss 3.408198 (+0.47z)| norm 0.2426 (-2.07z)| lr 2.54e-04 | 323.35 ms | 52.2% bf16 MFU | 1624058 tok/s step 11050/19560 | loss 3.396147 (+0.16z)| norm 0.2572 (-1.29z)| lr 2.54e-04 | 322.51 ms | 52.3% bf16 MFU | 1624139 tok/s step 11051/19560 | loss 3.381431 (-0.20z)| norm 0.2763 (-0.25z)| lr 2.54e-04 | 323.22 ms | 52.2% bf16 MFU | 1624037 tok/s step 11052/19560 | loss 3.361299 (-0.70z)| norm 0.2493 (-1.68z)| lr 2.54e-04 | 322.74 ms | 52.3% bf16 MFU | 1624060 tok/s step 11053/19560 | loss 3.387221 (-0.03z)| norm 0.2712 (-0.52z)| lr 2.54e-04 | 323.45 ms | 52.2% bf16 MFU | 1623904 tok/s step 11054/19560 | loss 3.436425 (+1.22z)| norm 0.2834 (+0.13z)| lr 2.54e-04 | 322.88 ms | 52.3% bf16 MFU | 1623899 tok/s step 11055/19560 | loss 3.335997 (-1.33z)| norm 0.2783 (-0.15z)| lr 2.54e-04 | 322.57 ms | 52.3% bf16 MFU | 1623973 tok/s step 11056/19560 | loss 3.353018 (-0.89z)| norm 0.2852 (+0.22z)| lr 2.54e-04 | 322.67 ms | 52.3% bf16 MFU | 1624015 tok/s step 11057/19560 | loss 3.385742 (-0.06z)| norm 0.2796 (-0.08z)| lr 2.54e-04 | 323.34 ms | 52.2% bf16 MFU | 1623889 tok/s step 11058/19560 | loss 3.535572 (+3.53z)| norm 0.3307 (+2.60z)| lr 2.54e-04 | 323.04 ms | 52.2% bf16 MFU | 1623844 tok/s step 11059/19560 | loss 3.336288 (-1.25z)| norm 0.3092 (+1.44z)| lr 2.54e-04 | 322.53 ms | 52.3% bf16 MFU | 1623929 tok/s step 11060/19560 | loss 3.323988 (-1.52z)| norm 0.2971 (+0.79z)| lr 2.54e-04 | 322.88 ms | 52.3% bf16 MFU | 1623922 tok/s step 
11061/19560 | loss 3.395892 (+0.19z)| norm 0.2897 (+0.39z)| lr 2.54e-04 | 322.37 ms | 52.4% bf16 MFU | 1624044 tok/s step 11062/19560 | loss 3.384055 (-0.08z)| norm 0.2774 (-0.26z)| lr 2.54e-04 | 323.00 ms | 52.3% bf16 MFU | 1624000 tok/s step 11063/19560 | loss 3.332805 (-1.29z)| norm 0.2996 (+0.90z)| lr 2.54e-04 | 322.78 ms | 52.3% bf16 MFU | 1624014 tok/s step 11064/19560 | loss 3.344814 (-0.99z)| norm 0.2837 (+0.05z)| lr 2.54e-04 | 323.05 ms | 52.2% bf16 MFU | 1623961 tok/s step 11065/19560 | loss 3.394881 (+0.20z)| norm 0.2904 (+0.40z)| lr 2.54e-04 | 322.72 ms | 52.3% bf16 MFU | 1623993 tok/s step 11066/19560 | loss 3.360167 (-0.62z)| norm 0.2766 (-0.34z)| lr 2.53e-04 | 323.15 ms | 52.2% bf16 MFU | 1623916 tok/s step 11067/19560 | loss 3.378589 (-0.19z)| norm 0.2694 (-0.73z)| lr 2.53e-04 | 322.48 ms | 52.3% bf16 MFU | 1624010 tok/s step 11068/19560 | loss 3.388614 (+0.05z)| norm 0.2903 (+0.39z)| lr 2.53e-04 | 322.51 ms | 52.3% bf16 MFU | 1624092 tok/s step 11069/19560 | loss 3.320098 (-1.56z)| norm 0.2547 (-1.50z)| lr 2.53e-04 | 323.02 ms | 52.2% bf16 MFU | 1624043 tok/s step 11070/19560 | loss 3.335566 (-1.18z)| norm 0.2825 (-0.03z)| lr 2.53e-04 | 322.58 ms | 52.3% bf16 MFU | 1624105 tok/s step 11071/19560 | loss 3.395051 (+0.21z)| norm 0.2731 (-0.52z)| lr 2.53e-04 | 322.43 ms | 52.3% bf16 MFU | 1624201 tok/s step 11072/19560 | loss 3.379403 (-0.16z)| norm 0.2842 (+0.07z)| lr 2.53e-04 | 322.85 ms | 52.3% bf16 MFU | 1624187 tok/s step 11073/19560 | loss 3.377643 (-0.21z)| norm 0.2637 (-1.03z)| lr 2.53e-04 | 323.29 ms | 52.2% bf16 MFU | 1624064 tok/s step 11074/19560 | loss 3.346910 (-0.96z)| norm 0.2833 (+0.02z)| lr 2.53e-04 | 323.08 ms | 52.2% bf16 MFU | 1624001 tok/s step 11075/19560 | loss 3.447192 (+1.57z)| norm 0.2655 (-0.91z)| lr 2.53e-04 | 322.26 ms | 52.4% bf16 MFU | 1624146 tok/s step 11076/19560 | loss 3.420214 (+0.88z)| norm 0.2909 (+0.43z)| lr 2.53e-04 | 323.47 ms | 52.2% bf16 MFU | 1623979 tok/s step 11077/19560 | loss 3.310877 (-1.83z)| norm 
0.2524 (-1.59z)| lr 2.53e-04 | 322.75 ms | 52.3% bf16 MFU | 1624002 tok/s step 11078/19560 | loss 3.377047 (-0.18z)| norm 0.2640 (-0.97z)| lr 2.53e-04 | 323.36 ms | 52.2% bf16 MFU | 1623871 tok/s step 11079/19560 | loss 3.447648 (+1.56z)| norm 0.2893 (+0.36z)| lr 2.53e-04 | 322.70 ms | 52.3% bf16 MFU | 1623912 tok/s step 11080/19560 | loss 3.409815 (+0.62z)| norm 0.2626 (-1.03z)| lr 2.53e-04 | 323.22 ms | 52.2% bf16 MFU | 1623820 tok/s step 11081/19560 | loss 3.335525 (-1.21z)| norm 0.2669 (-0.79z)| lr 2.53e-04 | 322.55 ms | 52.3% bf16 MFU | 1623902 tok/s step 11082/19560 | loss 3.324372 (-1.46z)| norm 0.2674 (-0.75z)| lr 2.53e-04 | 322.65 ms | 52.3% bf16 MFU | 1623954 tok/s step 11083/19560 | loss 3.427368 (+1.06z)| norm 0.2743 (-0.39z)| lr 2.53e-04 | 322.43 ms | 52.3% bf16 MFU | 1624059 tok/s step 11084/19560 | loss 3.359340 (-0.59z)| norm 0.2671 (-0.77z)| lr 2.53e-04 | 323.00 ms | 52.3% bf16 MFU | 1624014 tok/s step 11085/19560 | loss 3.441623 (+1.42z)| norm 0.2944 (+0.66z)| lr 2.53e-04 | 322.81 ms | 52.3% bf16 MFU | 1624021 tok/s step 11086/19560 | loss 3.441297 (+1.39z)| norm 0.2801 (-0.09z)| lr 2.52e-04 | 323.31 ms | 52.2% bf16 MFU | 1623901 tok/s step 11087/19560 | loss 3.389755 (+0.15z)| norm 0.2766 (-0.27z)| lr 2.52e-04 | 322.71 ms | 52.3% bf16 MFU | 1623938 tok/s step 11088/19560 | loss 3.399422 (+0.39z)| norm 0.2610 (-1.08z)| lr 2.52e-04 | 322.84 ms | 52.3% bf16 MFU | 1623940 tok/s step 11089/19560 | loss 3.344364 (-0.95z)| norm 0.2782 (-0.16z)| lr 2.52e-04 | 322.57 ms | 52.3% bf16 MFU | 1624010 tok/s step 11090/19560 | loss 3.343909 (-0.94z)| norm 0.2678 (-0.70z)| lr 2.52e-04 | 323.13 ms | 52.2% bf16 MFU | 1623936 tok/s step 11091/19560 | loss 3.319340 (-1.52z)| norm 0.3024 (+1.13z)| lr 2.52e-04 | 323.48 ms | 52.2% bf16 MFU | 1623779 tok/s step 11092/19560 | loss 3.437980 (+1.42z)| norm 0.2769 (-0.22z)| lr 2.52e-04 | 322.90 ms | 52.3% bf16 MFU | 1623774 tok/s step 11093/19560 | loss 3.431418 (+1.25z)| norm 0.2884 (+0.39z)| lr 2.52e-04 | 322.40 ms | 
52.3% bf16 MFU | 1623896 tok/s step 11094/19560 | loss 3.376319 (-0.10z)| norm 0.2747 (-0.33z)| lr 2.52e-04 | 323.02 ms | 52.2% bf16 MFU | 1623855 tok/s step 11095/19560 | loss 3.324702 (-1.42z)| norm 0.2536 (-1.43z)| lr 2.52e-04 | 322.58 ms | 52.3% bf16 MFU | 1623926 tok/s step 11096/19560 | loss 3.424867 (+1.12z)| norm 0.2915 (+0.60z)| lr 2.52e-04 | 323.49 ms | 52.2% bf16 MFU | 1623765 tok/s step 11097/19560 | loss 3.344105 (-0.91z)| norm 0.2955 (+0.82z)| lr 2.52e-04 | 322.68 ms | 52.3% bf16 MFU | 1623815 tok/s step 11098/19560 | loss 3.325717 (-1.37z)| norm 0.2671 (-0.70z)| lr 2.52e-04 | 323.04 ms | 52.2% bf16 MFU | 1623773 tok/s step 11099/19560 | loss 3.378455 (-0.01z)| norm 0.2644 (-0.84z)| lr 2.52e-04 | 322.42 ms | 52.3% bf16 MFU | 1623890 tok/s step 11100/19560 | loss 3.410278 (+0.81z)| norm 0.2636 (-0.87z)| lr 2.52e-04 | 323.52 ms | 52.2% bf16 MFU | 1623724 tok/s step 11101/19560 | loss 3.344845 (-0.86z)| norm 0.2588 (-1.11z)| lr 2.52e-04 | 322.59 ms | 52.3% bf16 MFU | 1623800 tok/s step 11102/19560 | loss 3.404319 (+0.67z)| norm 0.2632 (-0.86z)| lr 2.52e-04 | 322.26 ms | 52.4% bf16 MFU | 1623956 tok/s step 11103/19560 | loss 3.409029 (+0.79z)| norm 0.2742 (-0.27z)| lr 2.52e-04 | 323.21 ms | 52.2% bf16 MFU | 1623864 tok/s step 11104/19560 | loss 3.463022 (+2.12z)| norm 0.2916 (+0.68z)| lr 2.52e-04 | 323.06 ms | 52.2% bf16 MFU | 1623816 tok/s step 11105/19560 | loss 3.384086 (+0.13z)| norm 0.3049 (+1.38z)| lr 2.52e-04 | 323.09 ms | 52.2% bf16 MFU | 1623762 tok/s step 11106/19560 | loss 3.381534 (+0.06z)| norm 0.2804 (+0.05z)| lr 2.51e-04 | 322.40 ms | 52.3% bf16 MFU | 1623884 tok/s step 11107/19560 | loss 3.430892 (+1.31z)| norm 0.3178 (+2.02z)| lr 2.51e-04 | 322.75 ms | 52.3% bf16 MFU | 1623912 tok/s step 11108/19560 | loss 3.454001 (+1.85z)| norm 0.2885 (+0.47z)| lr 2.51e-04 | 322.70 ms | 52.3% bf16 MFU | 1623950 tok/s step 11109/19560 | loss 3.405077 (+0.62z)| norm 0.2863 (+0.34z)| lr 2.51e-04 | 322.78 ms | 52.3% bf16 MFU | 1623967 tok/s step 11110/19560 
| loss 3.383351 (+0.08z)| norm 0.2774 (-0.13z)| lr 2.51e-04 | 322.72 ms | 52.3% bf16 MFU | 1623999 tok/s step 11111/19560 | loss 3.392233 (+0.30z)| norm 0.2740 (-0.31z)| lr 2.51e-04 | 322.33 ms | 52.4% bf16 MFU | 1624127 tok/s step 11112/19560 | loss 3.397425 (+0.43z)| norm 0.2896 (+0.52z)| lr 2.51e-04 | 322.64 ms | 52.3% bf16 MFU | 1624170 tok/s step 11113/19560 | loss 3.432684 (+1.30z)| norm 0.2845 (+0.24z)| lr 2.51e-04 | 323.18 ms | 52.2% bf16 MFU | 1624077 tok/s step 11114/19560 | loss 3.375187 (-0.13z)| norm 0.2749 (-0.28z)| lr 2.51e-04 | 323.11 ms | 52.2% bf16 MFU | 1624005 tok/s step 11115/19560 | loss 3.403397 (+0.57z)| norm 0.2800 (-0.02z)| lr 2.51e-04 | 322.37 ms | 52.4% bf16 MFU | 1624124 tok/s step 11116/19560 | loss 3.342804 (-0.95z)| norm 0.2626 (-1.00z)| lr 2.51e-04 | 322.31 ms | 52.4% bf16 MFU | 1624250 tok/s step 11117/19560 | loss 3.402813 (+0.54z)| norm 0.2786 (-0.04z)| lr 2.51e-04 | 323.44 ms | 52.2% bf16 MFU | 1624086 tok/s step 11118/19560 | loss 3.382585 (+0.03z)| norm 0.2640 (-0.91z)| lr 2.51e-04 | 322.86 ms | 52.3% bf16 MFU | 1624076 tok/s step 11119/19560 | loss 3.418824 (+0.93z)| norm 0.2771 (-0.10z)| lr 2.51e-04 | 322.56 ms | 52.3% bf16 MFU | 1624143 tok/s step 11120/19560 | loss 3.358804 (-0.56z)| norm 0.2794 (+0.05z)| lr 2.51e-04 | 322.67 ms | 52.3% bf16 MFU | 1624178 tok/s step 11121/19560 | loss 3.400712 (+0.49z)| norm 0.2698 (-0.53z)| lr 2.51e-04 | 322.99 ms | 52.3% bf16 MFU | 1624131 tok/s step 11122/19560 | loss 3.321480 (-1.51z)| norm 0.2800 (+0.08z)| lr 2.51e-04 | 322.87 ms | 52.3% bf16 MFU | 1624117 tok/s step 11123/19560 | loss 3.377382 (-0.10z)| norm 0.2714 (-0.43z)| lr 2.51e-04 | 322.53 ms | 52.3% bf16 MFU | 1624189 tok/s step 11124/19560 | loss 3.385196 (+0.10z)| norm 0.2949 (+1.00z)| lr 2.51e-04 | 322.79 ms | 52.3% bf16 MFU | 1624192 tok/s step 11125/19560 | loss 3.293700 (-2.16z)| norm 0.2811 (+0.14z)| lr 2.51e-04 | 323.31 ms | 52.2% bf16 MFU | 1624063 tok/s step 11126/19560 | loss 3.378428 (-0.04z)| norm 0.2863 (+0.46z)| 
lr 2.51e-04 | 323.26 ms | 52.2% bf16 MFU | 1623952 tok/s step 11127/19560 | loss 3.359345 (-0.51z)| norm 0.2854 (+0.39z)| lr 2.50e-04 | 322.57 ms | 52.3% bf16 MFU | 1624023 tok/s step 11128/19560 | loss 3.398314 (+0.45z)| norm 0.2830 (+0.25z)| lr 2.50e-04 | 323.04 ms | 52.2% bf16 MFU | 1623971 tok/s step 11129/19560 | loss 3.395881 (+0.40z)| norm 0.2947 (+0.97z)| lr 2.50e-04 | 322.43 ms | 52.3% bf16 MFU | 1624074 tok/s step 11130/19560 | loss 3.367864 (-0.31z)| norm 0.2678 (-0.69z)| lr 2.50e-04 | 323.27 ms | 52.2% bf16 MFU | 1623963 tok/s step 11131/19560 | loss 3.392962 (+0.31z)| norm 0.2955 (+1.04z)| lr 2.50e-04 | 323.21 ms | 52.2% bf16 MFU | 1623872 tok/s step 11132/19560 | loss 3.375574 (-0.13z)| norm 0.2673 (-0.72z)| lr 2.50e-04 | 322.76 ms | 52.3% bf16 MFU | 1623898 tok/s step 11133/19560 | loss 3.375100 (-0.15z)| norm 0.2651 (-0.85z)| lr 2.50e-04 | 323.07 ms | 52.2% bf16 MFU | 1623844 tok/s step 11134/19560 | loss 3.336477 (-1.11z)| norm 0.2638 (-0.93z)| lr 2.50e-04 | 323.04 ms | 52.2% bf16 MFU | 1623802 tok/s step 11135/19560 | loss 3.368875 (-0.29z)| norm 0.2534 (-1.54z)| lr 2.50e-04 | 323.30 ms | 52.2% bf16 MFU | 1623697 tok/s step 11136/19560 | loss 3.355823 (-0.62z)| norm 0.2593 (-1.16z)| lr 2.50e-04 | 323.09 ms | 52.2% bf16 MFU | 1623649 tok/s step 11137/19560 | loss 3.352977 (-0.68z)| norm 0.2810 (+0.17z)| lr 2.50e-04 | 323.00 ms | 52.3% bf16 MFU | 1623626 tok/s step 11138/19560 | loss 3.384356 (+0.10z)| norm 0.2635 (-0.89z)| lr 2.50e-04 | 322.66 ms | 52.3% bf16 MFU | 1623690 tok/s step 11139/19560 | loss 3.435555 (+1.38z)| norm 0.2623 (-0.96z)| lr 2.50e-04 | 323.56 ms | 52.2% bf16 MFU | 1623523 tok/s step 11140/19560 | loss 3.357614 (-0.57z)| norm 0.2544 (-1.42z)| lr 2.50e-04 | 323.37 ms | 52.2% bf16 MFU | 1623414 tok/s step 11141/19560 | loss 3.331727 (-1.23z)| norm 0.2916 (+0.83z)| lr 2.50e-04 | 323.01 ms | 52.3% bf16 MFU | 1623401 tok/s step 11142/19560 | loss 3.384552 (+0.09z)| norm 0.2670 (-0.66z)| lr 2.50e-04 | 323.11 ms | 52.2% bf16 MFU | 
1623362 tok/s step 11143/19560 | loss 3.375335 (-0.14z)| norm 0.2841 (+0.37z)| lr 2.50e-04 | 322.95 ms | 52.3% bf16 MFU | 1623365 tok/s step 11144/19560 | loss 3.399145 (+0.46z)| norm 0.2811 (+0.18z)| lr 2.50e-04 | 322.68 ms | 52.3% bf16 MFU | 1623436 tok/s step 11145/19560 | loss 3.361019 (-0.50z)| norm 0.2887 (+0.68z)| lr 2.50e-04 | 323.29 ms | 52.2% bf16 MFU | 1623349 tok/s step 11146/19560 | loss 3.361370 (-0.50z)| norm 0.2633 (-0.89z)| lr 2.50e-04 | 322.70 ms | 52.3% bf16 MFU | 1623417 tok/s step 11147/19560 | loss 3.329726 (-1.29z)| norm 0.2758 (-0.09z)| lr 2.49e-04 | 323.00 ms | 52.3% bf16 MFU | 1623405 tok/s step 11148/19560 | loss 3.395396 (+0.40z)| norm 0.2535 (-1.51z)| lr 2.49e-04 | 323.23 ms | 52.2% bf16 MFU | 1623335 tok/s step 11149/19560 | loss 3.360506 (-0.51z)| norm 0.2598 (-1.09z)| lr 2.49e-04 | 323.36 ms | 52.2% bf16 MFU | 1623236 tok/s step 11150/19560 | loss 3.390432 (+0.26z)| norm 0.2666 (-0.66z)| lr 2.49e-04 | 323.13 ms | 52.2% bf16 MFU | 1623200 tok/s step 11151/19560 | loss 3.338159 (-1.10z)| norm 0.2496 (-1.75z)| lr 2.49e-04 | 323.12 ms | 52.2% bf16 MFU | 1623170 tok/s step 11152/19560 | loss 3.390570 (+0.27z)| norm 0.2458 (-2.00z)| lr 2.49e-04 | 322.85 ms | 52.3% bf16 MFU | 1623208 tok/s step 11153/19560 | loss 3.359953 (-0.53z)| norm 0.2806 (+0.32z)| lr 2.49e-04 | 322.57 ms | 52.3% bf16 MFU | 1623315 tok/s step 11154/19560 | loss 3.384495 (+0.11z)| norm 0.2492 (-1.73z)| lr 2.49e-04 | 322.85 ms | 52.3% bf16 MFU | 1623347 tok/s step 11155/19560 | loss 3.355989 (-0.62z)| norm 0.2682 (-0.49z)| lr 2.49e-04 | 323.29 ms | 52.2% bf16 MFU | 1623266 tok/s step 11156/19560 | loss 3.382443 (+0.08z)| norm 0.2753 (-0.00z)| lr 2.49e-04 | 322.70 ms | 52.3% bf16 MFU | 1623336 tok/s step 11157/19560 | loss 3.362922 (-0.44z)| norm 0.2576 (-1.18z)| lr 2.49e-04 | 323.23 ms | 52.2% bf16 MFU | 1623269 tok/s step 11158/19560 | loss 3.409071 (+0.79z)| norm 0.2598 (-1.02z)| lr 2.49e-04 | 323.28 ms | 52.2% bf16 MFU | 1623194 tok/s step 11159/19560 | loss 3.370152 
(-0.25z)| norm 0.2701 (-0.33z)| lr 2.49e-04 | 322.85 ms | 52.3% bf16 MFU | 1623231 tok/s step 11160/19560 | loss 3.395981 (+0.45z)| norm 0.2670 (-0.53z)| lr 2.49e-04 | 322.65 ms | 52.3% bf16 MFU | 1623317 tok/s step 11161/19560 | loss 3.349141 (-0.81z)| norm 0.2569 (-1.19z)| lr 2.49e-04 | 323.06 ms | 52.2% bf16 MFU | 1623295 tok/s step 11162/19560 | loss 3.322082 (-1.56z)| norm 0.2644 (-0.69z)| lr 2.49e-04 | 323.15 ms | 52.2% bf16 MFU | 1623251 tok/s step 11163/19560 | loss 3.385277 (+0.16z)| norm 0.2716 (-0.20z)| lr 2.49e-04 | 322.97 ms | 52.3% bf16 MFU | 1623255 tok/s step 11164/19560 | loss 3.382079 (+0.08z)| norm 0.2709 (-0.25z)| lr 2.49e-04 | 322.76 ms | 52.3% bf16 MFU | 1623312 tok/s step 11165/19560 | loss 3.398408 (+0.52z)| norm 0.3077 (+2.15z)| lr 2.49e-04 | 323.23 ms | 52.2% bf16 MFU | 1623248 tok/s step 11166/19560 | loss 3.359005 (-0.57z)| norm 0.2823 (+0.50z)| lr 2.49e-04 | 322.81 ms | 52.3% bf16 MFU | 1623292 tok/s step 11167/19560 | loss 3.301183 (-2.11z)| norm 0.2669 (-0.52z)| lr 2.48e-04 | 322.81 ms | 52.3% bf16 MFU | 1623334 tok/s step 11168/19560 | loss 3.351744 (-0.73z)| norm 0.2682 (-0.44z)| lr 2.48e-04 | 323.47 ms | 52.2% bf16 MFU | 1623208 tok/s step 11169/19560 | loss 3.508693 (+3.34z)| norm 0.2869 (+0.79z)| lr 2.48e-04 | 322.58 ms | 52.3% bf16 MFU | 1623312 tok/s step 11170/19560 | loss 3.330128 (-1.26z)| norm 0.3372 (+3.85z)| lr 2.48e-04 | 322.88 ms | 52.3% bf16 MFU | 1623335 tok/s step 11171/19560 | loss 3.341911 (-0.94z)| norm 0.2910 (+0.95z)| lr 2.48e-04 | 323.35 ms | 52.2% bf16 MFU | 1623241 tok/s step 11172/19560 | loss 3.396732 (+0.45z)| norm 0.2910 (+0.94z)| lr 2.48e-04 | 322.62 ms | 52.3% bf16 MFU | 1623334 tok/s step 11173/19560 | loss 3.414036 (+0.90z)| norm 0.3091 (+2.03z)| lr 2.48e-04 | 323.39 ms | 52.2% bf16 MFU | 1623229 tok/s step 11174/19560 | loss 3.371596 (-0.19z)| norm 0.2866 (+0.62z)| lr 2.48e-04 | 323.16 ms | 52.2% bf16 MFU | 1623186 tok/s step 11175/19560 | loss 3.385973 (+0.17z)| norm 0.3051 (+1.74z)| lr 2.48e-04 | 
323.33 ms | 52.2% bf16 MFU | 1623102 tok/s step 11176/19560 | loss 3.364126 (-0.39z)| norm 0.3068 (+1.80z)| lr 2.48e-04 | 323.30 ms | 52.2% bf16 MFU | 1623030 tok/s step 11177/19560 | loss 3.362599 (-0.42z)| norm 0.2862 (+0.54z)| lr 2.48e-04 | 323.49 ms | 52.2% bf16 MFU | 1622915 tok/s step 11178/19560 | loss 3.429489 (+1.29z)| norm 0.3155 (+2.28z)| lr 2.48e-04 | 323.16 ms | 52.2% bf16 MFU | 1622887 tok/s step 11179/19560 | loss 3.354352 (-0.63z)| norm 0.2857 (+0.47z)| lr 2.48e-04 | 323.27 ms | 52.2% bf16 MFU | 1622834 tok/s step 11180/19560 | loss 3.438532 (+1.49z)| norm 0.2790 (+0.05z)| lr 2.48e-04 | 322.38 ms | 52.4% bf16 MFU | 1623007 tok/s step 11181/19560 | loss 3.442825 (+1.58z)| norm 0.2835 (+0.32z)| lr 2.48e-04 | 324.19 ms | 52.1% bf16 MFU | 1622718 tok/s step 11182/19560 | loss 3.383060 (+0.09z)| norm 0.2790 (+0.05z)| lr 2.48e-04 | 322.62 ms | 52.3% bf16 MFU | 1622837 tok/s step 11183/19560 | loss 3.420355 (+1.02z)| norm 0.2549 (-1.41z)| lr 2.48e-04 | 323.64 ms | 52.1% bf16 MFU | 1622695 tok/s step 11184/19560 | loss 3.375225 (-0.13z)| norm 0.2686 (-0.57z)| lr 2.48e-04 | 322.75 ms | 52.3% bf16 MFU | 1622784 tok/s step 11185/19560 | loss 3.386590 (+0.16z)| norm 0.2650 (-0.78z)| lr 2.48e-04 | 322.92 ms | 52.3% bf16 MFU | 1622824 tok/s step 11186/19560 | loss 3.402696 (+0.63z)| norm 0.2913 (+0.88z)| lr 2.48e-04 | 323.67 ms | 52.1% bf16 MFU | 1622675 tok/s step 11187/19560 | loss 3.347572 (-0.87z)| norm 0.2887 (+0.73z)| lr 2.48e-04 | 323.63 ms | 52.2% bf16 MFU | 1622544 tok/s step 11188/19560 | loss 3.412267 (+0.87z)| norm 0.2904 (+0.85z)| lr 2.47e-04 | 323.05 ms | 52.2% bf16 MFU | 1622564 tok/s step 11189/19560 | loss 3.463483 (+2.21z)| norm 0.2716 (-0.36z)| lr 2.47e-04 | 322.71 ms | 52.3% bf16 MFU | 1622668 tok/s step 11190/19560 | loss 3.363306 (-0.46z)| norm 0.2845 (+0.47z)| lr 2.47e-04 | 323.17 ms | 52.2% bf16 MFU | 1622652 tok/s step 11191/19560 | loss 3.324026 (-1.51z)| norm 0.2661 (-0.70z)| lr 2.47e-04 | 323.19 ms | 52.2% bf16 MFU | 1622630 tok/s step 
11192/19560 | loss 3.398089 (+0.46z)| norm 0.3120 (+2.22z)| lr 2.47e-04 | 323.12 ms | 52.2% bf16 MFU | 1622627 tok/s step 11193/19560 | loss 3.384099 (+0.09z)| norm 0.2776 (+0.04z)| lr 2.47e-04 | 323.69 ms | 52.1% bf16 MFU | 1622482 tok/s step 11194/19560 | loss 3.371670 (-0.25z)| norm 0.2580 (-1.20z)| lr 2.47e-04 | 322.75 ms | 52.3% bf16 MFU | 1622580 tok/s step 11195/19560 | loss 3.377854 (-0.08z)| norm 0.2708 (-0.39z)| lr 2.47e-04 | 322.65 ms | 52.3% bf16 MFU | 1622698 tok/s step 11196/19560 | loss 3.375607 (-0.14z)| norm 0.2656 (-0.71z)| lr 2.47e-04 | 322.98 ms | 52.3% bf16 MFU | 1622727 tok/s step 11197/19560 | loss 3.388822 (+0.20z)| norm 0.2909 (+0.89z)| lr 2.47e-04 | 323.28 ms | 52.2% bf16 MFU | 1622679 tok/s step 11198/19560 | loss 3.365231 (-0.44z)| norm 0.2672 (-0.61z)| lr 2.47e-04 | 322.98 ms | 52.3% bf16 MFU | 1622710 tok/s step 11199/19560 | loss 3.398880 (+0.47z)| norm 0.3086 (+1.97z)| lr 2.47e-04 | 323.03 ms | 52.2% bf16 MFU | 1622727 tok/s step 11200/19560 | loss 3.407948 (+0.71z)| norm 0.2862 (+0.57z)| lr 2.47e-04 | 322.30 ms | 52.4% bf16 MFU | 1622925 tok/s step 11201/19560 | loss 3.382478 (+0.02z)| norm 0.2584 (-1.17z)| lr 2.47e-04 | 323.03 ms | 52.2% bf16 MFU | 1622931 tok/s step 11202/19560 | loss 3.366732 (-0.42z)| norm 0.2829 (+0.36z)| lr 2.47e-04 | 322.92 ms | 52.3% bf16 MFU | 1622964 tok/s step 11203/19560 | loss 3.334994 (-1.26z)| norm 0.2549 (-1.38z)| lr 2.47e-04 | 323.15 ms | 52.2% bf16 MFU | 1622936 tok/s step 11204/19560 | loss 3.434685 (+1.46z)| norm 0.2847 (+0.48z)| lr 2.47e-04 | 323.06 ms | 52.2% bf16 MFU | 1622934 tok/s step 11205/19560 | loss 3.371099 (-0.29z)| norm 0.2723 (-0.30z)| lr 2.47e-04 | 323.38 ms | 52.2% bf16 MFU | 1622853 tok/s step 11206/19560 | loss 3.360428 (-0.59z)| norm 0.2714 (-0.36z)| lr 2.47e-04 | 322.83 ms | 52.3% bf16 MFU | 1622913 tok/s step 11207/19560 | loss 3.408439 (+0.76z)| norm 0.2903 (+0.82z)| lr 2.47e-04 | 322.93 ms | 52.3% bf16 MFU | 1622944 tok/s step 11208/19560 | loss 3.373779 (-0.20z)| norm 
0.2671 (-0.64z)| lr 2.46e-04 | 323.50 ms | 52.2% bf16 MFU | 1622830 tok/s step 11209/19560 | loss 3.496914 (+3.11z)| norm 0.3040 (+1.66z)| lr 2.46e-04 | 323.04 ms | 52.2% bf16 MFU | 1622838 tok/s step 11210/19560 | loss 3.387493 (+0.13z)| norm 0.2755 (-0.13z)| lr 2.46e-04 | 322.66 ms | 52.3% bf16 MFU | 1622940 tok/s step 11211/19560 | loss 3.416011 (+0.91z)| norm 0.3085 (+1.89z)| lr 2.46e-04 | 323.58 ms | 52.2% bf16 MFU | 1622807 tok/s step 11212/19560 | loss 3.386681 (+0.10z)| norm 0.3153 (+2.25z)| lr 2.46e-04 | 323.22 ms | 52.2% bf16 MFU | 1622770 tok/s step 11213/19560 | loss 3.452996 (+1.92z)| norm 0.2972 (+1.14z)| lr 2.46e-04 | 323.11 ms | 52.2% bf16 MFU | 1622764 tok/s step 11214/19560 | loss 3.361632 (-0.57z)| norm 0.2917 (+0.81z)| lr 2.46e-04 | 322.67 ms | 52.3% bf16 MFU | 1622867 tok/s step 11215/19560 | loss 3.395097 (+0.35z)| norm 0.2663 (-0.72z)| lr 2.46e-04 | 323.01 ms | 52.2% bf16 MFU | 1622879 tok/s step 11216/19560 | loss 3.351115 (-0.85z)| norm 0.2848 (+0.38z)| lr 2.46e-04 | 322.58 ms | 52.3% bf16 MFU | 1623000 tok/s step 11217/19560 | loss 3.435658 (+1.45z)| norm 0.2889 (+0.62z)| lr 2.46e-04 | 322.62 ms | 52.3% bf16 MFU | 1623105 tok/s step 11218/19560 | loss 3.367878 (-0.41z)| norm 0.2736 (-0.30z)| lr 2.46e-04 | 323.22 ms | 52.2% bf16 MFU | 1623055 tok/s step 11219/19560 | loss 3.396858 (+0.37z)| norm 0.2650 (-0.80z)| lr 2.46e-04 | 322.85 ms | 52.3% bf16 MFU | 1623099 tok/s step 11220/19560 | loss 3.382779 (-0.01z)| norm 0.2780 (-0.02z)| lr 2.46e-04 | 322.93 ms | 52.3% bf16 MFU | 1623122 tok/s step 11221/19560 | loss 3.429561 (+1.31z)| norm 0.2968 (+1.12z)| lr 2.46e-04 | 322.85 ms | 52.3% bf16 MFU | 1623162 tok/s step 11222/19560 | loss 3.421540 (+1.07z)| norm 0.2714 (-0.42z)| lr 2.46e-04 | 323.20 ms | 52.2% bf16 MFU | 1623114 tok/s step 11223/19560 | loss 3.378380 (-0.15z)| norm 0.2678 (-0.65z)| lr 2.46e-04 | 323.13 ms | 52.2% bf16 MFU | 1623084 tok/s step 11224/19560 | loss 3.502557 (+3.22z)| norm 0.2646 (-0.83z)| lr 2.46e-04 | 322.65 ms | 
52.3% bf16 MFU | 1623177 tok/s step 11225/19560 | loss 3.428027 (+1.17z)| norm 0.2841 (+0.36z)| lr 2.46e-04 | 323.01 ms | 52.2% bf16 MFU | 1623175 tok/s step 11226/19560 | loss 3.407452 (+0.60z)| norm 0.2815 (+0.20z)| lr 2.46e-04 | 322.61 ms | 52.3% bf16 MFU | 1623274 tok/s step 11227/19560 | loss 3.394652 (+0.24z)| norm 0.2607 (-1.07z)| lr 2.46e-04 | 322.86 ms | 52.3% bf16 MFU | 1623304 tok/s step 11228/19560 | loss 3.394917 (+0.25z)| norm 0.2649 (-0.82z)| lr 2.45e-04 | 322.64 ms | 52.3% bf16 MFU | 1623388 tok/s step 11229/19560 | loss 3.364452 (-0.59z)| norm 0.2531 (-1.53z)| lr 2.45e-04 | 323.52 ms | 52.2% bf16 MFU | 1623248 tok/s step 11230/19560 | loss 3.343951 (-1.14z)| norm 0.2615 (-1.02z)| lr 2.45e-04 | 322.68 ms | 52.3% bf16 MFU | 1623326 tok/s step 11231/19560 | loss 3.407263 (+0.60z)| norm 0.2504 (-1.67z)| lr 2.45e-04 | 322.81 ms | 52.3% bf16 MFU | 1623366 tok/s step 11232/19560 | loss 3.436740 (+1.43z)| norm 0.2585 (-1.16z)| lr 2.45e-04 | 323.49 ms | 52.2% bf16 MFU | 1623232 tok/s step 11233/19560 | loss 3.416280 (+0.85z)| norm 0.2685 (-0.55z)| lr 2.45e-04 | 323.05 ms | 52.2% bf16 MFU | 1623217 tok/s step 11234/19560 | loss 3.448468 (+1.71z)| norm 0.2755 (-0.12z)| lr 2.45e-04 | 322.25 ms | 52.4% bf16 MFU | 1623404 tok/s step 11235/19560 | loss 3.467580 (+2.20z)| norm 0.2853 (+0.50z)| lr 2.45e-04 | 322.99 ms | 52.3% bf16 MFU | 1623395 tok/s step 11236/19560 | loss 3.423439 (+1.02z)| norm 0.2760 (-0.07z)| lr 2.45e-04 | 322.00 ms | 52.4% bf16 MFU | 1623636 tok/s step 11237/19560 | loss 3.367768 (-0.49z)| norm 0.2726 (-0.27z)| lr 2.45e-04 | 323.55 ms | 52.2% bf16 MFU | 1623476 tok/s step 11238/19560 | loss 3.435862 (+1.35z)| norm 0.2914 (+0.89z)| lr 2.45e-04 | 323.14 ms | 52.2% bf16 MFU | 1623427 tok/s step 11239/19560 | loss 3.448722 (+1.67z)| norm 0.2814 (+0.27z)| lr 2.45e-04 | 323.00 ms | 52.3% bf16 MFU | 1623415 tok/s step 11240/19560 | loss 3.339283 (-1.25z)| norm 0.3023 (+1.55z)| lr 2.45e-04 | 323.05 ms | 52.2% bf16 MFU | 1623392 tok/s step 11241/19560 
| loss 3.375595 (-0.27z)| norm 0.3035 (+1.60z)| lr 2.45e-04 | 323.03 ms | 52.2% bf16 MFU | 1623372 tok/s step 11242/19560 | loss 3.374951 (-0.29z)| norm 0.2808 (+0.21z)| lr 2.45e-04 | 322.88 ms | 52.3% bf16 MFU | 1623393 tok/s step 11243/19560 | loss 3.369713 (-0.42z)| norm 0.3053 (+1.67z)| lr 2.45e-04 | 322.65 ms | 52.3% bf16 MFU | 1623471 tok/s step 11244/19560 | loss 3.386519 (+0.02z)| norm 0.2783 (+0.03z)| lr 2.45e-04 | 322.29 ms | 52.4% bf16 MFU | 1623635 tok/s step 11245/19560 | loss 3.365448 (-0.54z)| norm 0.2871 (+0.56z)| lr 2.45e-04 | 322.68 ms | 52.3% bf16 MFU | 1623692 tok/s step 11246/19560 | loss 3.399605 (+0.38z)| norm 0.2832 (+0.32z)| lr 2.45e-04 | 322.94 ms | 52.3% bf16 MFU | 1623681 tok/s step 11247/19560 | loss 3.384084 (-0.03z)| norm 0.2761 (-0.11z)| lr 2.45e-04 | 322.72 ms | 52.3% bf16 MFU | 1623727 tok/s step 11248/19560 | loss 3.327023 (-1.56z)| norm 0.2746 (-0.20z)| lr 2.45e-04 | 322.61 ms | 52.3% bf16 MFU | 1623797 tok/s step 11249/19560 | loss 3.438670 (+1.42z)| norm 0.2866 (+0.52z)| lr 2.44e-04 | 322.39 ms | 52.4% bf16 MFU | 1623920 tok/s step 11250/19560 | loss 3.393407 (+0.20z)| norm 0.2638 (-0.85z)| lr 2.44e-04 | 323.46 ms | 52.2% bf16 MFU | 1623768 tok/s
val loss 3.367762 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating HellaSwag: 70/79 HellaSwag: 2940/10042 = 0.292770
step 11251/19560 | loss 3.349939 (-0.96z)| norm 0.2933 (+0.92z)| lr 2.44e-04 | 322.25 ms | 52.4% bf16 MFU | 1623928 tok/s step 11252/19560 | loss 3.400962 (+0.40z)| norm 0.2875 (+0.57z)| lr 2.44e-04 | 323.09 ms | 52.2% bf16 MFU | 1623869 tok/s step 11253/19560 | loss 3.378553 (-0.22z)| norm 0.2725 (-0.33z)| lr 2.44e-04 | 324.25 ms | 52.0% bf16 MFU | 1623521 tok/s step 11254/19560 | loss 3.410399 (+0.65z)| norm 0.2827 (+0.29z)| lr 2.44e-04 | 322.84 ms | 52.3% bf16 MFU | 1623544 tok/s step 11255/19560 | loss 3.339971 (-1.27z)| norm 0.2612 (-1.00z)| lr 2.44e-04 | 323.42 ms | 52.2% bf16 MFU | 1623419 tok/s step 11256/19560 | loss 3.400652 (+0.38z)| norm 0.2748 (-0.18z)| lr 2.44e-04 | 324.22 ms | 52.1% bf16 MFU | 1623102 tok/s step 11257/19560 | loss 3.406282 (+0.53z)| norm 0.2663 (-0.68z)| lr 2.44e-04 | 323.50 ms | 52.2% bf16 MFU | 1622981 tok/s step 11258/19560 | loss 3.392722 (+0.16z)| norm 0.2619 (-0.94z)| lr 2.44e-04 | 323.10 ms | 52.2% bf16 MFU | 1622965 tok/s step 11259/19560 | loss 3.343314 (-1.17z)| norm 0.2780 (+0.05z)| lr 2.44e-04 | 323.50 ms | 52.2% bf16 MFU | 1622851 tok/s step 11260/19560 | loss 3.374221 (-0.33z)| norm 0.2720 (-0.32z)| lr 2.44e-04 | 322.92 ms | 52.3% bf16 MFU | 1622887 tok/s step 11261/19560 | loss 3.347782 (-1.04z)| norm 0.2719 (-0.33z)| lr 2.44e-04 | 323.25 ms | 52.2% bf16 MFU | 
1622839 tok/s step 11262/19560 | loss 3.391224 (+0.12z)| norm 0.2727 (-0.29z)| lr 2.44e-04 | 323.12 ms | 52.2% bf16 MFU | 1622827 tok/s step 11263/19560 | loss 3.425869 (+1.05z)| norm 0.2714 (-0.38z)| lr 2.44e-04 | 323.09 ms | 52.2% bf16 MFU | 1622822 tok/s step 11264/19560 | loss 3.356398 (-0.84z)| norm 0.3235 (+2.72z)| lr 2.44e-04 | 323.19 ms | 52.2% bf16 MFU | 1622792 tok/s step 11265/19560 | loss 3.418895 (+0.85z)| norm 0.2597 (-1.09z)| lr 2.44e-04 | 323.03 ms | 52.2% bf16 MFU | 1622803 tok/s step 11266/19560 | loss 3.390304 (+0.07z)| norm 0.3046 (+1.56z)| lr 2.44e-04 | 322.15 ms | 52.4% bf16 MFU | 1623036 tok/s step 11267/19560 | loss 3.360223 (-0.74z)| norm 0.2939 (+0.91z)| lr 2.44e-04 | 323.05 ms | 52.2% bf16 MFU | 1623030 tok/s step 11268/19560 | loss 3.367178 (-0.55z)| norm 0.2896 (+0.65z)| lr 2.44e-04 | 322.87 ms | 52.3% bf16 MFU | 1623071 tok/s step 11269/19560 | loss 3.398496 (+0.30z)| norm 0.3030 (+1.43z)| lr 2.43e-04 | 322.71 ms | 52.3% bf16 MFU | 1623150 tok/s step 11270/19560 | loss 3.372551 (-0.42z)| norm 0.2824 (+0.20z)| lr 2.43e-04 | 322.63 ms | 52.3% bf16 MFU | 1623245 tok/s step 11271/19560 | loss 3.426566 (+1.06z)| norm 0.2808 (+0.11z)| lr 2.43e-04 | 322.27 ms | 52.4% bf16 MFU | 1623427 tok/s step 11272/19560 | loss 3.370248 (-0.48z)| norm 0.3004 (+1.26z)| lr 2.43e-04 | 322.75 ms | 52.3% bf16 MFU | 1623478 tok/s step 11273/19560 | loss 3.461196 (+1.97z)| norm 0.2795 (+0.03z)| lr 2.43e-04 | 323.61 ms | 52.2% bf16 MFU | 1623310 tok/s step 11274/19560 | loss 3.338926 (-1.33z)| norm 0.2813 (+0.13z)| lr 2.43e-04 | 322.57 ms | 52.3% bf16 MFU | 1623412 tok/s step 11275/19560 | loss 3.371026 (-0.48z)| norm 0.3451 (+3.67z)| lr 2.43e-04 | 322.87 ms | 52.3% bf16 MFU | 1623432 tok/s step 11276/19560 | loss 3.364171 (-0.66z)| norm 0.2726 (-0.41z)| lr 2.43e-04 | 322.64 ms | 52.3% bf16 MFU | 1623511 tok/s step 11277/19560 | loss 3.337303 (-1.38z)| norm 0.3002 (+1.14z)| lr 2.43e-04 | 322.46 ms | 52.3% bf16 MFU | 1623630 tok/s step 11278/19560 | loss 3.353714 
(-0.92z)| norm 0.2716 (-0.48z)| lr 2.43e-04 | 322.66 ms | 52.3% bf16 MFU | 1623693 tok/s step 11279/19560 | loss 3.428047 (+1.06z)| norm 0.2816 (+0.07z)| lr 2.43e-04 | 322.96 ms | 52.3% bf16 MFU | 1623678 tok/s step 11280/19560 | loss 3.398269 (+0.26z)| norm 0.2649 (-0.91z)| lr 2.43e-04 | 322.94 ms | 52.3% bf16 MFU | 1623669 tok/s step 11281/19560 | loss 3.349710 (-1.05z)| norm 0.2827 (+0.12z)| lr 2.43e-04 | 322.68 ms | 52.3% bf16 MFU | 1623726 tok/s step 11282/19560 | loss 3.369274 (-0.52z)| norm 0.2625 (-1.07z)| lr 2.43e-04 | 322.28 ms | 52.4% bf16 MFU | 1623880 tok/s step 11283/19560 | loss 3.353822 (-0.94z)| norm 0.2777 (-0.18z)| lr 2.43e-04 | 323.21 ms | 52.2% bf16 MFU | 1623794 tok/s step 11284/19560 | loss 3.350379 (-1.02z)| norm 0.2704 (-0.61z)| lr 2.43e-04 | 322.32 ms | 52.4% bf16 MFU | 1623935 tok/s step 11285/19560 | loss 3.336398 (-1.38z)| norm 0.3257 (+2.56z)| lr 2.43e-04 | 322.44 ms | 52.3% bf16 MFU | 1624037 tok/s step 11286/19560 | loss 3.386799 (-0.03z)| norm 0.2974 (+0.91z)| lr 2.43e-04 | 323.23 ms | 52.2% bf16 MFU | 1623937 tok/s step 11287/19560 | loss 3.319681 (-1.79z)| norm 0.2618 (-1.13z)| lr 2.43e-04 | 322.60 ms | 52.3% bf16 MFU | 1624000 tok/s step 11288/19560 | loss 3.431484 (+1.14z)| norm 0.3062 (+1.40z)| lr 2.43e-04 | 323.48 ms | 52.2% bf16 MFU | 1623839 tok/s step 11289/19560 | loss 3.341716 (-1.21z)| norm 0.2549 (-1.54z)| lr 2.42e-04 | 322.70 ms | 52.3% bf16 MFU | 1623881 tok/s step 11290/19560 | loss 3.366098 (-0.58z)| norm 0.2955 (+0.77z)| lr 2.42e-04 | 322.76 ms | 52.3% bf16 MFU | 1623905 tok/s step 11291/19560 | loss 3.357759 (-0.80z)| norm 0.2834 (+0.07z)| lr 2.42e-04 | 322.93 ms | 52.3% bf16 MFU | 1623888 tok/s step 11292/19560 | loss 3.422836 (+0.91z)| norm 0.2948 (+0.72z)| lr 2.42e-04 | 322.83 ms | 52.3% bf16 MFU | 1623895 tok/s step 11293/19560 | loss 3.416729 (+0.74z)| norm 0.2902 (+0.46z)| lr 2.42e-04 | 323.03 ms | 52.2% bf16 MFU | 1623851 tok/s step 11294/19560 | loss 3.393452 (+0.13z)| norm 0.2644 (-1.02z)| lr 2.42e-04 | 
322.54 ms | 52.3% bf16 MFU | 1623935 tok/s step 11295/19560 | loss 3.375315 (-0.37z)| norm 0.2824 (+0.01z)| lr 2.42e-04 | 323.19 ms | 52.2% bf16 MFU | 1623848 tok/s step 11296/19560 | loss 3.428082 (+1.03z)| norm 0.2765 (-0.33z)| lr 2.42e-04 | 322.96 ms | 52.3% bf16 MFU | 1623824 tok/s step 11297/19560 | loss 3.338974 (-1.38z)| norm 0.2770 (-0.30z)| lr 2.42e-04 | 322.75 ms | 52.3% bf16 MFU | 1623854 tok/s step 11298/19560 | loss 3.351276 (-1.05z)| norm 0.2729 (-0.53z)| lr 2.42e-04 | 322.90 ms | 52.3% bf16 MFU | 1623846 tok/s step 11299/19560 | loss 3.415585 (+0.74z)| norm 0.2721 (-0.57z)| lr 2.42e-04 | 323.27 ms | 52.2% bf16 MFU | 1623746 tok/s step 11300/19560 | loss 3.383120 (-0.17z)| norm 0.3186 (+2.18z)| lr 2.42e-04 | 322.53 ms | 52.3% bf16 MFU | 1623836 tok/s step 11301/19560 | loss 3.348356 (-1.13z)| norm 0.2945 (+0.77z)| lr 2.42e-04 | 323.07 ms | 52.2% bf16 MFU | 1623785 tok/s step 11302/19560 | loss 3.420952 (+0.89z)| norm 0.3061 (+1.44z)| lr 2.42e-04 | 322.41 ms | 52.3% bf16 MFU | 1623903 tok/s step 11303/19560 | loss 3.352058 (-1.02z)| norm 0.2882 (+0.39z)| lr 2.42e-04 | 323.18 ms | 52.2% bf16 MFU | 1623821 tok/s step 11304/19560 | loss 3.377208 (-0.32z)| norm 0.2951 (+0.81z)| lr 2.42e-04 | 323.08 ms | 52.2% bf16 MFU | 1623770 tok/s step 11305/19560 | loss 3.385046 (-0.11z)| norm 0.3095 (+1.65z)| lr 2.42e-04 | 322.67 ms | 52.3% bf16 MFU | 1623824 tok/s step 11306/19560 | loss 3.399029 (+0.29z)| norm 0.3440 (+3.55z)| lr 2.42e-04 | 322.75 ms | 52.3% bf16 MFU | 1623855 tok/s step 11307/19560 | loss 3.514551 (+3.35z)| norm 0.3470 (+3.51z)| lr 2.42e-04 | 323.04 ms | 52.2% bf16 MFU | 1623812 tok/s step 11308/19560 | loss 3.351064 (-1.03z)| norm 0.2745 (-0.43z)| lr 2.42e-04 | 322.79 ms | 52.3% bf16 MFU | 1623834 tok/s step 11309/19560 | loss 3.367267 (-0.58z)| norm 0.2863 (+0.21z)| lr 2.42e-04 | 322.83 ms | 52.3% bf16 MFU | 1623844 tok/s step 11310/19560 | loss 3.443287 (+1.46z)| norm 0.2865 (+0.22z)| lr 2.41e-04 | 323.10 ms | 52.2% bf16 MFU | 1623786 tok/s step 
11311/19560 | loss 3.286265 (-2.67z)| norm 0.2874 (+0.26z)| lr 2.41e-04 | 322.98 ms | 52.3% bf16 MFU | 1623762 tok/s step 11312/19560 | loss 3.346215 (-1.09z)| norm 0.2930 (+0.55z)| lr 2.41e-04 | 323.11 ms | 52.2% bf16 MFU | 1623705 tok/s step 11313/19560 | loss 3.359225 (-0.74z)| norm 0.2623 (-1.13z)| lr 2.41e-04 | 322.85 ms | 52.3% bf16 MFU | 1623716 tok/s step 11314/19560 | loss 3.477875 (+2.29z)| norm 0.3320 (+2.60z)| lr 2.41e-04 | 323.18 ms | 52.2% bf16 MFU | 1623645 tok/s step 11315/19560 | loss 3.410455 (+0.55z)| norm 0.2864 (+0.17z)| lr 2.41e-04 | 322.61 ms | 52.3% bf16 MFU | 1623720 tok/s step 11316/19560 | loss 3.390046 (+0.04z)| norm 0.3026 (+1.02z)| lr 2.41e-04 | 322.38 ms | 52.4% bf16 MFU | 1623849 tok/s step 11317/19560 | loss 3.330055 (-1.49z)| norm 0.2659 (-0.92z)| lr 2.41e-04 | 323.22 ms | 52.2% bf16 MFU | 1623761 tok/s step 11318/19560 | loss 3.409535 (+0.56z)| norm 0.2878 (+0.24z)| lr 2.41e-04 | 323.17 ms | 52.2% bf16 MFU | 1623690 tok/s step 11319/19560 | loss 3.391452 (+0.08z)| norm 0.2651 (-0.96z)| lr 2.41e-04 | 322.88 ms | 52.3% bf16 MFU | 1623695 tok/s step 11320/19560 | loss 3.435986 (+1.22z)| norm 0.2773 (-0.31z)| lr 2.41e-04 | 322.79 ms | 52.3% bf16 MFU | 1623721 tok/s step 11321/19560 | loss 3.394653 (+0.15z)| norm 0.2725 (-0.56z)| lr 2.41e-04 | 322.63 ms | 52.3% bf16 MFU | 1623787 tok/s step 11322/19560 | loss 3.394236 (+0.13z)| norm 0.2767 (-0.35z)| lr 2.41e-04 | 323.45 ms | 52.2% bf16 MFU | 1623644 tok/s step 11323/19560 | loss 3.357311 (-0.82z)| norm 0.2754 (-0.41z)| lr 2.41e-04 | 322.69 ms | 52.3% bf16 MFU | 1623699 tok/s step 11324/19560 | loss 3.359451 (-0.76z)| norm 0.2697 (-0.73z)| lr 2.41e-04 | 323.37 ms | 52.2% bf16 MFU | 1623580 tok/s step 11325/19560 | loss 3.361821 (-0.69z)| norm 0.2720 (-0.59z)| lr 2.41e-04 | 322.81 ms | 52.3% bf16 MFU | 1623608 tok/s step 11326/19560 | loss 3.356405 (-0.83z)| norm 0.2590 (-1.29z)| lr 2.41e-04 | 322.61 ms | 52.3% bf16 MFU | 1623685 tok/s step 11327/19560 | loss 3.416913 (+0.73z)| norm 
0.2962 (+0.72z)| lr 2.41e-04 | 323.69 ms | 52.1% bf16 MFU | 1623486 tok/s step 11328/19560 | loss 3.407787 (+0.49z)| norm 0.2615 (-1.14z)| lr 2.41e-04 | 323.50 ms | 52.2% bf16 MFU | 1623346 tok/s step 11329/19560 | loss 3.352269 (-0.93z)| norm 0.3045 (+1.16z)| lr 2.41e-04 | 322.75 ms | 52.3% bf16 MFU | 1623400 tok/s step 11330/19560 | loss 3.390600 (+0.05z)| norm 0.2736 (-0.50z)| lr 2.40e-04 | 322.75 ms | 52.3% bf16 MFU | 1623451 tok/s step 11331/19560 | loss 3.360388 (-0.73z)| norm 0.3115 (+1.51z)| lr 2.40e-04 | 322.68 ms | 52.3% bf16 MFU | 1623518 tok/s step 11332/19560 | loss 3.372962 (-0.40z)| norm 0.2659 (-0.93z)| lr 2.40e-04 | 323.25 ms | 52.2% bf16 MFU | 1623440 tok/s step 11333/19560 | loss 3.429404 (+1.05z)| norm 0.2787 (-0.25z)| lr 2.40e-04 | 322.86 ms | 52.3% bf16 MFU | 1623461 tok/s step 11334/19560 | loss 3.358346 (-0.79z)| norm 0.2697 (-0.73z)| lr 2.40e-04 | 323.80 ms | 52.1% bf16 MFU | 1623247 tok/s step 11335/19560 | loss 3.436059 (+1.21z)| norm 0.2886 (+0.29z)| lr 2.40e-04 | 322.56 ms | 52.3% bf16 MFU | 1623354 tok/s step 11336/19560 | loss 3.353584 (-0.90z)| norm 0.2640 (-1.03z)| lr 2.40e-04 | 323.06 ms | 52.2% bf16 MFU | 1623331 tok/s step 11337/19560 | loss 3.395318 (+0.19z)| norm 0.2874 (+0.23z)| lr 2.40e-04 | 322.97 ms | 52.3% bf16 MFU | 1623332 tok/s step 11338/19560 | loss 3.351158 (-0.96z)| norm 0.2702 (-0.69z)| lr 2.40e-04 | 322.87 ms | 52.3% bf16 MFU | 1623356 tok/s step 11339/19560 | loss 3.339150 (-1.26z)| norm 0.2786 (-0.23z)| lr 2.40e-04 | 322.82 ms | 52.3% bf16 MFU | 1623393 tok/s step 11340/19560 | loss 3.424953 (+0.98z)| norm 0.2595 (-1.25z)| lr 2.40e-04 | 322.44 ms | 52.3% bf16 MFU | 1623524 tok/s step 11341/19560 | loss 3.456024 (+1.79z)| norm 0.2866 (+0.23z)| lr 2.40e-04 | 323.05 ms | 52.2% bf16 MFU | 1623495 tok/s step 11342/19560 | loss 3.412059 (+0.63z)| norm 0.2845 (+0.12z)| lr 2.40e-04 | 322.61 ms | 52.3% bf16 MFU | 1623577 tok/s step 11343/19560 | loss 3.309792 (-1.99z)| norm 0.2619 (-1.11z)| lr 2.40e-04 | 323.16 ms | 
52.2% bf16 MFU | 1623518 tok/s step 11344/19560 | loss 3.433188 (+1.17z)| norm 0.2921 (+0.54z)| lr 2.40e-04 | 322.66 ms | 52.3% bf16 MFU | 1623586 tok/s step 11345/19560 | loss 3.412740 (+0.65z)| norm 0.2807 (-0.08z)| lr 2.40e-04 | 322.72 ms | 52.3% bf16 MFU | 1623637 tok/s step 11346/19560 | loss 3.537464 (+3.63z)| norm 0.2774 (-0.26z)| lr 2.40e-04 | 323.36 ms | 52.2% bf16 MFU | 1623524 tok/s step 11347/19560 | loss 3.358624 (-0.73z)| norm 0.2908 (+0.46z)| lr 2.40e-04 | 322.45 ms | 52.3% bf16 MFU | 1623645 tok/s step 11348/19560 | loss 3.386822 (-0.05z)| norm 0.2626 (-1.07z)| lr 2.40e-04 | 323.88 ms | 52.1% bf16 MFU | 1623402 tok/s step 11349/19560 | loss 3.369936 (-0.45z)| norm 0.3025 (+1.09z)| lr 2.40e-04 | 323.59 ms | 52.2% bf16 MFU | 1623242 tok/s step 11350/19560 | loss 3.341420 (-1.13z)| norm 0.2918 (+0.50z)| lr 2.40e-04 | 323.15 ms | 52.2% bf16 MFU | 1623200 tok/s step 11351/19560 | loss 3.413208 (+0.62z)| norm 0.2783 (-0.23z)| lr 2.39e-04 | 323.74 ms | 52.1% bf16 MFU | 1623015 tok/s step 11352/19560 | loss 3.382321 (-0.12z)| norm 0.2777 (-0.28z)| lr 2.39e-04 | 323.66 ms | 52.1% bf16 MFU | 1622858 tok/s step 11353/19560 | loss 3.410635 (+0.60z)| norm 0.2888 (+0.33z)| lr 2.39e-04 | 323.25 ms | 52.2% bf16 MFU | 1622812 tok/s step 11354/19560 | loss 3.442511 (+1.39z)| norm 0.2719 (-0.59z)| lr 2.39e-04 | 322.70 ms | 52.3% bf16 MFU | 1622905 tok/s step 11355/19560 | loss 3.342445 (-1.10z)| norm 0.2625 (-1.10z)| lr 2.39e-04 | 323.29 ms | 52.2% bf16 MFU | 1622845 tok/s step 11356/19560 | loss 3.411258 (+0.61z)| norm 0.2813 (-0.09z)| lr 2.39e-04 | 323.36 ms | 52.2% bf16 MFU | 1622772 tok/s step 11357/19560 | loss 3.403519 (+0.41z)| norm 0.2556 (-1.49z)| lr 2.39e-04 | 322.83 ms | 52.3% bf16 MFU | 1622836 tok/s step 11358/19560 | loss 3.429505 (+1.04z)| norm 0.2542 (-1.56z)| lr 2.39e-04 | 323.20 ms | 52.2% bf16 MFU | 1622803 tok/s step 11359/19560 | loss 3.345632 (-1.04z)| norm 0.2552 (-1.52z)| lr 2.39e-04 | 323.61 ms | 52.2% bf16 MFU | 1622669 tok/s step 11360/19560 
| loss 3.361594 (-0.63z)| norm 0.2463 (-1.98z)| lr 2.39e-04 | 323.52 ms | 52.2% bf16 MFU | 1622565 tok/s step 11361/19560 | loss 3.421134 (+0.86z)| norm 0.2747 (-0.44z)| lr 2.39e-04 | 322.98 ms | 52.3% bf16 MFU | 1622600 tok/s step 11362/19560 | loss 3.390769 (+0.11z)| norm 0.2475 (-1.89z)| lr 2.39e-04 | 323.27 ms | 52.2% bf16 MFU | 1622560 tok/s step 11363/19560 | loss 3.354399 (-0.79z)| norm 0.2818 (-0.04z)| lr 2.39e-04 | 322.77 ms | 52.3% bf16 MFU | 1622648 tok/s step 11364/19560 | loss 3.337059 (-1.21z)| norm 0.2511 (-1.66z)| lr 2.39e-04 | 323.17 ms | 52.2% bf16 MFU | 1622632 tok/s step 11365/19560 | loss 3.374363 (-0.27z)| norm 0.2636 (-0.99z)| lr 2.39e-04 | 323.95 ms | 52.1% bf16 MFU | 1622421 tok/s step 11366/19560 | loss 3.440189 (+1.41z)| norm 0.3081 (+1.36z)| lr 2.39e-04 | 323.67 ms | 52.1% bf16 MFU | 1622291 tok/s step 11367/19560 | loss 3.378811 (-0.14z)| norm 0.2656 (-0.87z)| lr 2.39e-04 | 323.82 ms | 52.1% bf16 MFU | 1622129 tok/s step 11368/19560 | loss 3.377575 (-0.18z)| norm 0.3063 (+1.26z)| lr 2.39e-04 | 323.21 ms | 52.2% bf16 MFU | 1622128 tok/s step 11369/19560 | loss 3.394562 (+0.26z)| norm 0.2662 (-0.83z)| lr 2.39e-04 | 322.80 ms | 52.3% bf16 MFU | 1622231 tok/s step 11370/19560 | loss 3.434824 (+1.28z)| norm 0.2924 (+0.54z)| lr 2.39e-04 | 323.48 ms | 52.2% bf16 MFU | 1622159 tok/s step 11371/19560 | loss 3.347746 (-0.96z)| norm 0.2709 (-0.58z)| lr 2.38e-04 | 323.45 ms | 52.2% bf16 MFU | 1622097 tok/s step 11372/19560 | loss 3.593471 (+4.80z)| norm 0.2985 (+0.87z)| lr 2.38e-04 | 323.12 ms | 52.2% bf16 MFU | 1622121 tok/s step 11373/19560 | loss 3.312921 (-1.68z)| norm 0.2712 (-0.56z)| lr 2.38e-04 | 323.85 ms | 52.1% bf16 MFU | 1621962 tok/s step 11374/19560 | loss 3.331257 (-1.24z)| norm 0.2883 (+0.34z)| lr 2.38e-04 | 322.90 ms | 52.3% bf16 MFU | 1622048 tok/s step 11375/19560 | loss 3.404339 (+0.42z)| norm 0.2776 (-0.23z)| lr 2.38e-04 | 322.90 ms | 52.3% bf16 MFU | 1622128 tok/s step 11376/19560 | loss 3.380255 (-0.14z)| norm 0.2767 (-0.28z)| 
lr 2.38e-04 | 322.98 ms | 52.3% bf16 MFU | 1622185 tok/s step 11377/19560 | loss 3.360752 (-0.57z)| norm 0.2875 (+0.29z)| lr 2.38e-04 | 323.24 ms | 52.2% bf16 MFU | 1622174 tok/s step 11378/19560 | loss 3.345601 (-0.91z)| norm 0.2734 (-0.45z)| lr 2.38e-04 | 323.08 ms | 52.2% bf16 MFU | 1622205 tok/s step 11379/19560 | loss 3.420688 (+0.80z)| norm 0.3040 (+1.15z)| lr 2.38e-04 | 322.61 ms | 52.3% bf16 MFU | 1622352 tok/s step 11380/19560 | loss 3.405348 (+0.45z)| norm 0.3101 (+1.45z)| lr 2.38e-04 | 322.90 ms | 52.3% bf16 MFU | 1622420 tok/s step 11381/19560 | loss 3.380025 (-0.13z)| norm 0.2847 (+0.12z)| lr 2.38e-04 | 322.85 ms | 52.3% bf16 MFU | 1622495 tok/s step 11382/19560 | loss 3.402356 (+0.38z)| norm 0.2883 (+0.31z)| lr 2.38e-04 | 323.39 ms | 52.2% bf16 MFU | 1622432 tok/s step 11383/19560 | loss 3.452492 (+1.51z)| norm 0.3087 (+1.35z)| lr 2.38e-04 | 323.25 ms | 52.2% bf16 MFU | 1622406 tok/s step 11384/19560 | loss 3.444139 (+1.30z)| norm 0.2716 (-0.58z)| lr 2.38e-04 | 322.43 ms | 52.3% bf16 MFU | 1622588 tok/s step 11385/19560 | loss 3.378070 (-0.20z)| norm 0.2626 (-1.04z)| lr 2.38e-04 | 322.36 ms | 52.4% bf16 MFU | 1622780 tok/s step 11386/19560 | loss 3.420846 (+0.77z)| norm 0.2947 (+0.61z)| lr 2.38e-04 | 323.58 ms | 52.2% bf16 MFU | 1622655 tok/s step 11387/19560 | loss 3.470077 (+1.85z)| norm 0.2703 (-0.66z)| lr 2.38e-04 | 322.89 ms | 52.3% bf16 MFU | 1622708 tok/s step 11388/19560 | loss 3.426740 (+0.86z)| norm 0.2693 (-0.71z)| lr 2.38e-04 | 323.09 ms | 52.2% bf16 MFU | 1622709 tok/s step 11389/19560 | loss 3.389289 (+0.01z)| norm 0.2807 (-0.12z)| lr 2.38e-04 | 323.03 ms | 52.2% bf16 MFU | 1622726 tok/s step 11390/19560 | loss 3.316951 (-1.59z)| norm 0.2682 (-0.76z)| lr 2.38e-04 | 322.60 ms | 52.3% bf16 MFU | 1622848 tok/s step 11391/19560 | loss 3.432075 (+0.98z)| norm 0.2902 (+0.37z)| lr 2.37e-04 | 322.53 ms | 52.3% bf16 MFU | 1622983 tok/s step 11392/19560 | loss 3.433199 (+0.99z)| norm 0.2692 (-0.71z)| lr 2.37e-04 | 322.94 ms | 52.3% bf16 MFU | 
1623010 tok/s step 11393/19560 | loss 3.410547 (+0.49z)| norm 0.2847 (+0.10z)| lr 2.37e-04 | 322.85 ms | 52.3% bf16 MFU | 1623057 tok/s step 11394/19560 | loss 3.423481 (+0.77z)| norm 0.3124 (+1.56z)| lr 2.37e-04 | 322.37 ms | 52.4% bf16 MFU | 1623222 tok/s step 11395/19560 | loss 3.362748 (-0.58z)| norm 0.2981 (+0.80z)| lr 2.37e-04 | 322.78 ms | 52.3% bf16 MFU | 1623275 tok/s step 11396/19560 | loss 3.373092 (-0.35z)| norm 0.2791 (-0.20z)| lr 2.37e-04 | 323.32 ms | 52.2% bf16 MFU | 1623191 tok/s step 11397/19560 | loss 3.366408 (-0.50z)| norm 0.2815 (-0.07z)| lr 2.37e-04 | 322.36 ms | 52.4% bf16 MFU | 1623352 tok/s step 11398/19560 | loss 3.349081 (-0.88z)| norm 0.2862 (+0.18z)| lr 2.37e-04 | 322.40 ms | 52.3% bf16 MFU | 1623493 tok/s step 11399/19560 | loss 3.358907 (-0.65z)| norm 0.2639 (-0.99z)| lr 2.37e-04 | 323.26 ms | 52.2% bf16 MFU | 1623413 tok/s step 11400/19560 | loss 3.386358 (-0.04z)| norm 0.2976 (+0.80z)| lr 2.37e-04 | 322.77 ms | 52.3% bf16 MFU | 1623460 tok/s step 11401/19560 | loss 3.365861 (-0.48z)| norm 0.2780 (-0.24z)| lr 2.37e-04 | 322.92 ms | 52.3% bf16 MFU | 1623466 tok/s step 11402/19560 | loss 3.335799 (-1.16z)| norm 0.2950 (+0.65z)| lr 2.37e-04 | 322.95 ms | 52.3% bf16 MFU | 1623464 tok/s step 11403/19560 | loss 3.577778 (+3.96z)| norm 0.2767 (-0.30z)| lr 2.37e-04 | 323.36 ms | 52.2% bf16 MFU | 1623360 tok/s step 11404/19560 | loss 3.413153 (+0.50z)| norm 0.3099 (+1.50z)| lr 2.37e-04 | 322.67 ms | 52.3% bf16 MFU | 1623434 tok/s step 11405/19560 | loss 3.413753 (+0.50z)| norm 0.2783 (-0.22z)| lr 2.37e-04 | 322.38 ms | 52.4% bf16 MFU | 1623577 tok/s step 11406/19560 | loss 3.363430 (-0.56z)| norm 0.2741 (-0.45z)| lr 2.37e-04 | 322.92 ms | 52.3% bf16 MFU | 1623576 tok/s step 11407/19560 | loss 3.345068 (-0.94z)| norm 0.2990 (+0.91z)| lr 2.37e-04 | 322.84 ms | 52.3% bf16 MFU | 1623595 tok/s step 11408/19560 | loss 3.314730 (-1.55z)| norm 0.2633 (-1.04z)| lr 2.37e-04 | 323.00 ms | 52.3% bf16 MFU | 1623574 tok/s step 11409/19560 | loss 3.345699 
(-0.90z)| norm 0.2959 (+0.73z)| lr 2.37e-04 | 323.23 ms | 52.2% bf16 MFU | 1623498 tok/s step 11410/19560 | loss 3.385134 (-0.08z)| norm 0.2853 (+0.14z)| lr 2.37e-04 | 322.76 ms | 52.3% bf16 MFU | 1623542 tok/s step 11411/19560 | loss 3.387252 (-0.04z)| norm 0.2855 (+0.15z)| lr 2.37e-04 | 322.93 ms | 52.3% bf16 MFU | 1623541 tok/s step 11412/19560 | loss 3.383226 (-0.13z)| norm 0.2659 (-0.92z)| lr 2.36e-04 | 322.66 ms | 52.3% bf16 MFU | 1623608 tok/s step 11413/19560 | loss 3.375375 (-0.30z)| norm 0.2774 (-0.28z)| lr 2.36e-04 | 322.46 ms | 52.3% bf16 MFU | 1623724 tok/s step 11414/19560 | loss 3.400792 (+0.23z)| norm 0.2906 (+0.46z)| lr 2.36e-04 | 322.38 ms | 52.4% bf16 MFU | 1623853 tok/s step 11415/19560 | loss 3.401385 (+0.23z)| norm 0.2803 (-0.12z)| lr 2.36e-04 | 323.17 ms | 52.2% bf16 MFU | 1623776 tok/s step 11416/19560 | loss 3.389885 (-0.01z)| norm 0.3076 (+1.42z)| lr 2.36e-04 | 322.97 ms | 52.3% bf16 MFU | 1623754 tok/s step 11417/19560 | loss 3.330446 (-1.27z)| norm 0.3183 (+1.98z)| lr 2.36e-04 | 322.36 ms | 52.4% bf16 MFU | 1623886 tok/s step 11418/19560 | loss 3.398022 (+0.16z)| norm 0.2728 (-0.56z)| lr 2.36e-04 | 322.49 ms | 52.3% bf16 MFU | 1623980 tok/s step 11419/19560 | loss 3.401576 (+0.23z)| norm 0.2899 (+0.40z)| lr 2.36e-04 | 322.80 ms | 52.3% bf16 MFU | 1623990 tok/s step 11420/19560 | loss 3.373033 (-0.37z)| norm 0.2690 (-0.76z)| lr 2.36e-04 | 322.71 ms | 52.3% bf16 MFU | 1624022 tok/s step 11421/19560 | loss 3.359185 (-0.65z)| norm 0.2806 (-0.11z)| lr 2.36e-04 | 323.06 ms | 52.2% bf16 MFU | 1623964 tok/s step 11422/19560 | loss 3.404334 (+0.31z)| norm 0.2671 (-0.87z)| lr 2.36e-04 | 322.75 ms | 52.3% bf16 MFU | 1623987 tok/s step 11423/19560 | loss 3.370276 (-0.42z)| norm 0.2617 (-1.16z)| lr 2.36e-04 | 322.25 ms | 52.4% bf16 MFU | 1624135 tok/s step 11424/19560 | loss 3.410569 (+0.44z)| norm 0.2696 (-0.71z)| lr 2.36e-04 | 322.84 ms | 52.3% bf16 MFU | 1624129 tok/s step 11425/19560 | loss 3.421516 (+0.67z)| norm 0.3639 (+4.19z)| lr 2.36e-04 | 
322.70 ms | 52.3% bf16 MFU | 1624156 tok/s step 11426/19560 | loss 3.350587 (-0.85z)| norm 0.2945 (+0.59z)| lr 2.36e-04 | 322.99 ms | 52.3% bf16 MFU | 1624109 tok/s step 11427/19560 | loss 3.385116 (-0.11z)| norm 0.2750 (-0.43z)| lr 2.36e-04 | 322.58 ms | 52.3% bf16 MFU | 1624168 tok/s step 11428/19560 | loss 3.403124 (+0.27z)| norm 0.2702 (-0.66z)| lr 2.36e-04 | 322.99 ms | 52.3% bf16 MFU | 1624121 tok/s step 11429/19560 | loss 3.400949 (+0.22z)| norm 0.2575 (-1.30z)| lr 2.36e-04 | 322.94 ms | 52.3% bf16 MFU | 1624089 tok/s step 11430/19560 | loss 3.354250 (-0.77z)| norm 0.2771 (-0.27z)| lr 2.36e-04 | 322.71 ms | 52.3% bf16 MFU | 1624117 tok/s step 11431/19560 | loss 3.388403 (-0.04z)| norm 0.2696 (-0.66z)| lr 2.36e-04 | 322.61 ms | 52.3% bf16 MFU | 1624169 tok/s step 11432/19560 | loss 3.368444 (-0.47z)| norm 0.2637 (-0.96z)| lr 2.35e-04 | 322.76 ms | 52.3% bf16 MFU | 1624181 tok/s step 11433/19560 | loss 3.447565 (+1.21z)| norm 0.2987 (+0.88z)| lr 2.35e-04 | 323.06 ms | 52.2% bf16 MFU | 1624116 tok/s step 11434/19560 | loss 3.370270 (-0.44z)| norm 0.2916 (+0.56z)| lr 2.35e-04 | 322.72 ms | 52.3% bf16 MFU | 1624141 tok/s step 11435/19560 | loss 3.383688 (-0.13z)| norm 0.2879 (+0.40z)| lr 2.35e-04 | 323.16 ms | 52.2% bf16 MFU | 1624054 tok/s step 11436/19560 | loss 3.516373 (+2.69z)| norm 0.2979 (+0.96z)| lr 2.35e-04 | 322.62 ms | 52.3% bf16 MFU | 1624106 tok/s step 11437/19560 | loss 3.385255 (-0.12z)| norm 0.2921 (+0.63z)| lr 2.35e-04 | 322.71 ms | 52.3% bf16 MFU | 1624133 tok/s step 11438/19560 | loss 3.359485 (-0.67z)| norm 0.2836 (+0.14z)| lr 2.35e-04 | 322.56 ms | 52.3% bf16 MFU | 1624197 tok/s step 11439/19560 | loss 3.374127 (-0.37z)| norm 0.2927 (+0.66z)| lr 2.35e-04 | 322.74 ms | 52.3% bf16 MFU | 1624210 tok/s step 11440/19560 | loss 3.414571 (+0.50z)| norm 0.2723 (-0.51z)| lr 2.35e-04 | 322.64 ms | 52.3% bf16 MFU | 1624250 tok/s step 11441/19560 | loss 3.340610 (-1.12z)| norm 0.2730 (-0.47z)| lr 2.35e-04 | 322.28 ms | 52.4% bf16 MFU | 1624378 tok/s step 
11442/19560 | loss 3.285095 (-2.29z)| norm 0.2564 (-1.43z)| lr 2.35e-04 | 322.71 ms | 52.3% bf16 MFU | 1624391 tok/s step 11443/19560 | loss 3.423633 (+0.73z)| norm 0.2699 (-0.63z)| lr 2.35e-04 | 323.06 ms | 52.2% bf16 MFU | 1624316 tok/s step 11444/19560 | loss 3.403936 (+0.30z)| norm 0.2633 (-1.00z)| lr 2.35e-04 | 323.15 ms | 52.2% bf16 MFU | 1624222 tok/s step 11445/19560 | loss 3.388769 (-0.04z)| norm 0.2685 (-0.69z)| lr 2.35e-04 | 322.68 ms | 52.3% bf16 MFU | 1624251 tok/s step 11446/19560 | loss 3.354194 (-0.79z)| norm 0.2563 (-1.39z)| lr 2.35e-04 | 322.79 ms | 52.3% bf16 MFU | 1624252 tok/s step 11447/19560 | loss 3.377363 (-0.28z)| norm 0.2640 (-0.94z)| lr 2.35e-04 | 323.07 ms | 52.2% bf16 MFU | 1624180 tok/s step 11448/19560 | loss 3.408795 (+0.41z)| norm 0.2861 (+0.36z)| lr 2.35e-04 | 322.50 ms | 52.3% bf16 MFU | 1624256 tok/s step 11449/19560 | loss 3.364482 (-0.55z)| norm 0.2549 (-1.46z)| lr 2.35e-04 | 322.95 ms | 52.3% bf16 MFU | 1624215 tok/s step 11450/19560 | loss 3.345817 (-0.95z)| norm 0.2612 (-1.08z)| lr 2.35e-04 | 322.95 ms | 52.3% bf16 MFU | 1624176 tok/s step 11451/19560 | loss 3.406130 (+0.36z)| norm 0.2798 (+0.01z)| lr 2.35e-04 | 322.77 ms | 52.3% bf16 MFU | 1624184 tok/s step 11452/19560 | loss 3.385176 (-0.10z)| norm 0.2667 (-0.76z)| lr 2.35e-04 | 322.97 ms | 52.3% bf16 MFU | 1624141 tok/s step 11453/19560 | loss 3.408204 (+0.39z)| norm 0.2666 (-0.76z)| lr 2.34e-04 | 323.29 ms | 52.2% bf16 MFU | 1624019 tok/s step 11454/19560 | loss 3.414871 (+0.53z)| norm 0.2656 (-0.82z)| lr 2.34e-04 | 323.56 ms | 52.2% bf16 MFU | 1623837 tok/s step 11455/19560 | loss 3.347389 (-0.94z)| norm 0.2642 (-0.89z)| lr 2.34e-04 | 322.99 ms | 52.3% bf16 MFU | 1623807 tok/s step 11456/19560 | loss 3.350380 (-0.86z)| norm 0.2678 (-0.69z)| lr 2.34e-04 | 322.79 ms | 52.3% bf16 MFU | 1623829 tok/s step 11457/19560 | loss 3.395406 (+0.12z)| norm 0.2597 (-1.15z)| lr 2.34e-04 | 323.12 ms | 52.2% bf16 MFU | 1623767 tok/s step 11458/19560 | loss 3.344121 (-0.99z)| norm 
0.2601 (-1.11z)| lr 2.34e-04 | 323.01 ms | 52.2% bf16 MFU | 1623734 tok/s step 11459/19560 | loss 3.397514 (+0.16z)| norm 0.2644 (-0.84z)| lr 2.34e-04 | 323.18 ms | 52.2% bf16 MFU | 1623661 tok/s step 11460/19560 | loss 3.385059 (-0.11z)| norm 0.2781 (-0.04z)| lr 2.34e-04 | 323.00 ms | 52.3% bf16 MFU | 1623638 tok/s step 11461/19560 | loss 3.351810 (-0.82z)| norm 0.2666 (-0.72z)| lr 2.34e-04 | 322.88 ms | 52.3% bf16 MFU | 1623645 tok/s step 11462/19560 | loss 3.438334 (+1.05z)| norm 0.2783 (-0.03z)| lr 2.34e-04 | 322.60 ms | 52.3% bf16 MFU | 1623722 tok/s step 11463/19560 | loss 3.366551 (-0.50z)| norm 0.2564 (-1.30z)| lr 2.34e-04 | 323.03 ms | 52.2% bf16 MFU | 1623687 tok/s step 11464/19560 | loss 3.368012 (-0.47z)| norm 0.2719 (-0.39z)| lr 2.34e-04 | 322.50 ms | 52.3% bf16 MFU | 1623787 tok/s step 11465/19560 | loss 3.369509 (-0.44z)| norm 0.2544 (-1.40z)| lr 2.34e-04 | 322.95 ms | 52.3% bf16 MFU | 1623769 tok/s step 11466/19560 | loss 3.338507 (-1.11z)| norm 0.2663 (-0.70z)| lr 2.34e-04 | 322.76 ms | 52.3% bf16 MFU | 1623800 tok/s step 11467/19560 | loss 3.369225 (-0.45z)| norm 0.2632 (-0.87z)| lr 2.34e-04 | 322.60 ms | 52.3% bf16 MFU | 1623870 tok/s step 11468/19560 | loss 3.368531 (-0.45z)| norm 0.2782 (-0.01z)| lr 2.34e-04 | 322.83 ms | 52.3% bf16 MFU | 1623879 tok/s step 11469/19560 | loss 3.323531 (-1.42z)| norm 0.2681 (-0.59z)| lr 2.34e-04 | 323.00 ms | 52.3% bf16 MFU | 1623844 tok/s step 11470/19560 | loss 3.300549 (-1.88z)| norm 0.2649 (-0.77z)| lr 2.34e-04 | 322.59 ms | 52.3% bf16 MFU | 1623914 tok/s step 11471/19560 | loss 3.358073 (-0.65z)| norm 0.2688 (-0.54z)| lr 2.34e-04 | 322.88 ms | 52.3% bf16 MFU | 1623907 tok/s step 11472/19560 | loss 3.457983 (+1.53z)| norm 0.2752 (-0.16z)| lr 2.34e-04 | 322.51 ms | 52.3% bf16 MFU | 1623993 tok/s step 11473/19560 | loss 3.399292 (+0.25z)| norm 0.2633 (-0.85z)| lr 2.33e-04 | 323.06 ms | 52.2% bf16 MFU | 1623937 tok/s step 11474/19560 | loss 3.381338 (-0.12z)| norm 0.2662 (-0.67z)| lr 2.33e-04 | 322.98 ms | 
52.3% bf16 MFU | 1623905 tok/s step 11475/19560 | loss 3.391780 (+0.11z)| norm 0.2702 (-0.43z)| lr 2.33e-04 | 322.63 ms | 52.3% bf16 MFU | 1623963 tok/s step 11476/19560 | loss 3.340000 (-1.05z)| norm 0.2722 (-0.32z)| lr 2.33e-04 | 322.86 ms | 52.3% bf16 MFU | 1623959 tok/s step 11477/19560 | loss 3.345318 (-0.93z)| norm 0.2715 (-0.35z)| lr 2.33e-04 | 322.63 ms | 52.3% bf16 MFU | 1624012 tok/s step 11478/19560 | loss 3.411126 (+0.55z)| norm 0.2683 (-0.53z)| lr 2.33e-04 | 322.78 ms | 52.3% bf16 MFU | 1624025 tok/s step 11479/19560 | loss 3.386445 (-0.00z)| norm 0.2699 (-0.43z)| lr 2.33e-04 | 322.34 ms | 52.4% bf16 MFU | 1624148 tok/s step 11480/19560 | loss 3.385356 (-0.03z)| norm 0.2637 (-0.79z)| lr 2.33e-04 | 322.85 ms | 52.3% bf16 MFU | 1624138 tok/s step 11481/19560 | loss 3.342505 (-0.99z)| norm 0.2705 (-0.38z)| lr 2.33e-04 | 323.20 ms | 52.2% bf16 MFU | 1624038 tok/s step 11482/19560 | loss 3.354010 (-0.71z)| norm 0.2502 (-1.56z)| lr 2.33e-04 | 322.44 ms | 52.3% bf16 MFU | 1624136 tok/s step 11483/19560 | loss 3.395000 (+0.21z)| norm 0.2703 (-0.38z)| lr 2.33e-04 | 322.67 ms | 52.3% bf16 MFU | 1624170 tok/s step 11484/19560 | loss 3.331232 (-1.23z)| norm 0.2624 (-0.83z)| lr 2.33e-04 | 322.82 ms | 52.3% bf16 MFU | 1624167 tok/s step 11485/19560 | loss 3.366970 (-0.41z)| norm 0.2642 (-0.73z)| lr 2.33e-04 | 322.42 ms | 52.3% bf16 MFU | 1624264 tok/s step 11486/19560 | loss 3.437535 (+1.20z)| norm 0.2817 (+0.28z)| lr 2.33e-04 | 322.74 ms | 52.3% bf16 MFU | 1624275 tok/s step 11487/19560 | loss 3.435642 (+1.14z)| norm 0.2720 (-0.30z)| lr 2.33e-04 | 323.00 ms | 52.3% bf16 MFU | 1624220 tok/s step 11488/19560 | loss 3.387986 (+0.05z)| norm 0.2912 (+0.84z)| lr 2.33e-04 | 323.11 ms | 52.2% bf16 MFU | 1624140 tok/s step 11489/19560 | loss 3.341389 (-0.99z)| norm 0.2819 (+0.27z)| lr 2.33e-04 | 322.99 ms | 52.3% bf16 MFU | 1624096 tok/s step 11490/19560 | loss 3.320331 (-1.45z)| norm 0.3923 (+5.92z)| lr 2.33e-04 | 322.42 ms | 52.3% bf16 MFU | 1624195 tok/s step 11491/19560 
| loss 3.424619 (+0.89z)| norm 0.3212 (+2.16z)| lr 2.33e-04 | 322.82 ms | 52.3% bf16 MFU | 1624190 tok/s step 11492/19560 | loss 3.332024 (-1.19z)| norm 0.2876 (+0.43z)| lr 2.33e-04 | 322.98 ms | 52.3% bf16 MFU | 1624145 tok/s step 11493/19560 | loss 3.373626 (-0.26z)| norm 0.3286 (+2.46z)| lr 2.33e-04 | 322.10 ms | 52.4% bf16 MFU | 1624324 tok/s step 11494/19560 | loss 3.402189 (+0.39z)| norm 0.3098 (+1.51z)| lr 2.32e-04 | 322.84 ms | 52.3% bf16 MFU | 1624306 tok/s step 11495/19560 | loss 3.673583 (+5.61z)| norm 0.3205 (+2.00z)| lr 2.32e-04 | 322.60 ms | 52.3% bf16 MFU | 1624350 tok/s step 11496/19560 | loss 3.432878 (+0.88z)| norm 0.2984 (+0.91z)| lr 2.32e-04 | 322.86 ms | 52.3% bf16 MFU | 1624327 tok/s step 11497/19560 | loss 3.359678 (-0.54z)| norm 0.2902 (+0.50z)| lr 2.32e-04 | 322.72 ms | 52.3% bf16 MFU | 1624341 tok/s step 11498/19560 | loss 3.380915 (-0.12z)| norm 0.2857 (+0.28z)| lr 2.32e-04 | 322.78 ms | 52.3% bf16 MFU | 1624338 tok/s step 11499/19560 | loss 3.348270 (-0.76z)| norm 0.2818 (+0.07z)| lr 2.32e-04 | 323.04 ms | 52.2% bf16 MFU | 1624270 tok/s step 11500/19560 | loss 3.421759 (+0.76z)| norm 0.2827 (+0.13z)| lr 2.32e-04 | 323.21 ms | 52.2% bf16 MFU | 1624162 tok/s
val loss 3.365113
evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating HellaSwag: 70/79
HellaSwag: 2983/10042 = 0.297052
step 11501/19560 | loss 3.412868 (+0.56z)| norm 0.3055 (+1.25z)| lr 2.32e-04 | 322.59 ms | 52.3% bf16 MFU | 1624216 tok/s step 11502/19560 | loss 3.433424 (+0.98z)| norm 0.2948 (+0.71z)| lr 2.32e-04 | 322.49 ms | 52.3% bf16 MFU | 1624293 tok/s step 11503/19560 | loss 3.390292 (+0.07z)| norm 0.2979 (+0.86z)| lr 2.32e-04 | 323.17 ms | 52.2% bf16 MFU | 1624196 tok/s step 11504/19560 | loss 3.388114 (+0.02z)| norm 0.2800 (-0.03z)| lr 2.32e-04 | 324.16 ms | 52.1% bf16 MFU | 1623854 tok/s step 11505/19560 | loss 3.345405 
(-0.88z)| norm 0.2693 (-0.56z)| lr 2.32e-04 | 323.23 ms | 52.2% bf16 MFU | 1623762 tok/s step 11506/19560 | loss 3.336669 (-1.06z)| norm 0.2964 (+0.78z)| lr 2.32e-04 | 323.99 ms | 52.1% bf16 MFU | 1623484 tok/s step 11507/19560 | loss 3.375862 (-0.22z)| norm 0.2766 (-0.19z)| lr 2.32e-04 | 322.97 ms | 52.3% bf16 MFU | 1623476 tok/s step 11508/19560 | loss 3.376040 (-0.22z)| norm 0.2975 (+0.86z)| lr 2.32e-04 | 323.30 ms | 52.2% bf16 MFU | 1623386 tok/s step 11509/19560 | loss 3.342101 (-0.92z)| norm 0.2716 (-0.43z)| lr 2.32e-04 | 323.17 ms | 52.2% bf16 MFU | 1623334 tok/s step 11510/19560 | loss 3.386523 (+0.01z)| norm 0.2658 (-0.71z)| lr 2.32e-04 | 323.57 ms | 52.2% bf16 MFU | 1623184 tok/s step 11511/19560 | loss 3.391728 (+0.13z)| norm 0.2835 (+0.18z)| lr 2.32e-04 | 323.25 ms | 52.2% bf16 MFU | 1623120 tok/s step 11512/19560 | loss 3.494066 (+2.26z)| norm 0.2837 (+0.19z)| lr 2.32e-04 | 322.99 ms | 52.3% bf16 MFU | 1623125 tok/s step 11513/19560 | loss 3.318833 (-1.38z)| norm 0.2826 (+0.12z)| lr 2.32e-04 | 322.79 ms | 52.3% bf16 MFU | 1623181 tok/s step 11514/19560 | loss 3.395224 (+0.21z)| norm 0.2759 (-0.21z)| lr 2.31e-04 | 322.91 ms | 52.3% bf16 MFU | 1623204 tok/s step 11515/19560 | loss 3.308434 (-1.57z)| norm 0.2750 (-0.26z)| lr 2.31e-04 | 323.51 ms | 52.2% bf16 MFU | 1623075 tok/s step 11516/19560 | loss 3.426590 (+0.89z)| norm 0.2669 (-0.66z)| lr 2.31e-04 | 323.02 ms | 52.2% bf16 MFU | 1623074 tok/s step 11517/19560 | loss 3.392845 (+0.19z)| norm 0.2779 (-0.11z)| lr 2.31e-04 | 323.17 ms | 52.2% bf16 MFU | 1623036 tok/s step 11518/19560 | loss 3.340656 (-0.91z)| norm 0.2646 (-0.78z)| lr 2.31e-04 | 322.98 ms | 52.3% bf16 MFU | 1623048 tok/s step 11519/19560 | loss 3.376030 (-0.16z)| norm 0.2624 (-0.88z)| lr 2.31e-04 | 323.33 ms | 52.2% bf16 MFU | 1622973 tok/s step 11520/19560 | loss 3.407479 (+0.51z)| norm 0.2664 (-0.67z)| lr 2.31e-04 | 323.07 ms | 52.2% bf16 MFU | 1622965 tok/s step 11521/19560 | loss 3.414968 (+0.66z)| norm 0.3099 (+1.50z)| lr 2.31e-04 | 
322.77 ms | 52.3% bf16 MFU | 1623035 tok/s step 11522/19560 | loss 3.366289 (-0.35z)| norm 0.2746 (-0.25z)| lr 2.31e-04 | 324.07 ms | 52.1% bf16 MFU | 1622774 tok/s step 11523/19560 | loss 3.428325 (+0.94z)| norm 0.2823 (+0.14z)| lr 2.31e-04 | 323.45 ms | 52.2% bf16 MFU | 1622680 tok/s step 11524/19560 | loss 3.425727 (+0.88z)| norm 0.4955 (+7.82z)| lr 2.31e-04 | 322.70 ms | 52.3% bf16 MFU | 1622780 tok/s step 11525/19560 | loss 3.356403 (-0.58z)| norm 0.2836 (+0.09z)| lr 2.31e-04 | 323.39 ms | 52.2% bf16 MFU | 1622703 tok/s step 11526/19560 | loss 3.484123 (+2.05z)| norm 0.2843 (+0.11z)| lr 2.31e-04 | 323.58 ms | 52.2% bf16 MFU | 1622582 tok/s step 11527/19560 | loss 3.342587 (-0.87z)| norm 0.2641 (-0.63z)| lr 2.31e-04 | 323.16 ms | 52.2% bf16 MFU | 1622571 tok/s step 11528/19560 | loss 3.421784 (+0.76z)| norm 0.2822 (+0.04z)| lr 2.31e-04 | 323.22 ms | 52.2% bf16 MFU | 1622547 tok/s step 11529/19560 | loss 3.387116 (+0.04z)| norm 0.2794 (-0.06z)| lr 2.31e-04 | 323.30 ms | 52.2% bf16 MFU | 1622504 tok/s step 11530/19560 | loss 3.314882 (-1.44z)| norm 0.2702 (-0.39z)| lr 2.31e-04 | 324.23 ms | 52.1% bf16 MFU | 1622231 tok/s step 11531/19560 | loss 3.364389 (-0.42z)| norm 0.2814 (+0.02z)| lr 2.31e-04 | 322.93 ms | 52.3% bf16 MFU | 1622297 tok/s step 11532/19560 | loss 3.408499 (+0.55z)| norm 0.2763 (-0.16z)| lr 2.31e-04 | 322.27 ms | 52.4% bf16 MFU | 1622526 tok/s step 11533/19560 | loss 3.339916 (-0.94z)| norm 0.2880 (+0.27z)| lr 2.31e-04 | 323.76 ms | 52.1% bf16 MFU | 1622368 tok/s step 11534/19560 | loss 3.379692 (-0.07z)| norm 0.2637 (-0.62z)| lr 2.31e-04 | 323.72 ms | 52.1% bf16 MFU | 1622228 tok/s step 11535/19560 | loss 3.391135 (+0.17z)| norm 0.2562 (-0.88z)| lr 2.30e-04 | 322.90 ms | 52.3% bf16 MFU | 1622301 tok/s step 11536/19560 | loss 3.298742 (-1.84z)| norm 0.2517 (-1.04z)| lr 2.30e-04 | 322.72 ms | 52.3% bf16 MFU | 1622416 tok/s step 11537/19560 | loss 3.309729 (-1.59z)| norm 0.2610 (-0.69z)| lr 2.30e-04 | 323.50 ms | 52.2% bf16 MFU | 1622330 tok/s step 
11538/19560 | loss 3.448892 (+1.41z)| norm 0.2569 (-0.83z)| lr 2.30e-04 | 323.05 ms | 52.2% bf16 MFU | 1622360 tok/s step 11539/19560 | loss 3.367922 (-0.33z)| norm 0.2564 (-0.84z)| lr 2.30e-04 | 323.55 ms | 52.2% bf16 MFU | 1622263 tok/s step 11540/19560 | loss 3.408462 (+0.54z)| norm 0.2542 (-0.91z)| lr 2.30e-04 | 323.47 ms | 52.2% bf16 MFU | 1622191 tok/s step 11541/19560 | loss 3.369484 (-0.30z)| norm 0.2751 (-0.16z)| lr 2.30e-04 | 323.16 ms | 52.2% bf16 MFU | 1622200 tok/s step 11542/19560 | loss 3.381276 (-0.04z)| norm 0.2774 (-0.07z)| lr 2.30e-04 | 322.94 ms | 52.3% bf16 MFU | 1622264 tok/s step 11543/19560 | loss 3.385683 (+0.06z)| norm 0.2786 (-0.02z)| lr 2.30e-04 | 323.59 ms | 52.2% bf16 MFU | 1622163 tok/s step 11544/19560 | loss 3.414531 (+0.67z)| norm 0.2703 (-0.32z)| lr 2.30e-04 | 323.70 ms | 52.1% bf16 MFU | 1622038 tok/s step 11545/19560 | loss 3.325227 (-1.24z)| norm 0.2762 (-0.09z)| lr 2.30e-04 | 323.30 ms | 52.2% bf16 MFU | 1622020 tok/s step 11546/19560 | loss 3.357004 (-0.56z)| norm 0.2875 (+0.32z)| lr 2.30e-04 | 323.03 ms | 52.2% bf16 MFU | 1622071 tok/s step 11547/19560 | loss 3.354640 (-0.60z)| norm 0.2683 (-0.38z)| lr 2.30e-04 | 322.48 ms | 52.3% bf16 MFU | 1622259 tok/s step 11548/19560 | loss 3.430720 (+1.02z)| norm 0.2734 (-0.19z)| lr 2.30e-04 | 323.17 ms | 52.2% bf16 MFU | 1622262 tok/s step 11549/19560 | loss 3.383335 (+0.00z)| norm 0.2892 (+0.39z)| lr 2.30e-04 | 322.80 ms | 52.3% bf16 MFU | 1622360 tok/s step 11550/19560 | loss 3.364897 (-0.38z)| norm 0.2762 (-0.10z)| lr 2.30e-04 | 323.37 ms | 52.2% bf16 MFU | 1622309 tok/s step 11551/19560 | loss 3.377399 (-0.12z)| norm 0.3023 (+0.85z)| lr 2.30e-04 | 322.97 ms | 52.3% bf16 MFU | 1622360 tok/s step 11552/19560 | loss 3.361221 (-0.46z)| norm 0.2709 (-0.30z)| lr 2.30e-04 | 322.56 ms | 52.3% bf16 MFU | 1622511 tok/s step 11553/19560 | loss 3.426252 (+0.93z)| norm 0.3249 (+1.74z)| lr 2.30e-04 | 323.44 ms | 52.2% bf16 MFU | 1622435 tok/s step 11554/19560 | loss 3.458990 (+1.60z)| norm 
0.2869 (+0.31z)| lr 2.30e-04 | 323.33 ms | 52.2% bf16 MFU | 1622389 tok/s step 11555/19560 | loss 3.381195 (-0.05z)| norm 0.2937 (+0.56z)| lr 2.30e-04 | 323.12 ms | 52.2% bf16 MFU | 1622398 tok/s step 11556/19560 | loss 3.370533 (-0.27z)| norm 0.2818 (+0.10z)| lr 2.29e-04 | 322.54 ms | 52.3% bf16 MFU | 1622553 tok/s step 11557/19560 | loss 3.341264 (-0.88z)| norm 0.2930 (+0.52z)| lr 2.29e-04 | 323.71 ms | 52.1% bf16 MFU | 1622406 tok/s step 11558/19560 | loss 3.323035 (-1.25z)| norm 0.2574 (-0.82z)| lr 2.29e-04 | 322.52 ms | 52.3% bf16 MFU | 1622565 tok/s step 11559/19560 | loss 3.366991 (-0.32z)| norm 0.2928 (+0.51z)| lr 2.29e-04 | 322.99 ms | 52.3% bf16 MFU | 1622598 tok/s step 11560/19560 | loss 3.418665 (+0.76z)| norm 0.2782 (-0.05z)| lr 2.29e-04 | 322.87 ms | 52.3% bf16 MFU | 1622660 tok/s step 11561/19560 | loss 3.351030 (-0.65z)| norm 0.2911 (+0.44z)| lr 2.29e-04 | 323.10 ms | 52.2% bf16 MFU | 1622661 tok/s step 11562/19560 | loss 3.365094 (-0.35z)| norm 0.2879 (+0.32z)| lr 2.29e-04 | 322.78 ms | 52.3% bf16 MFU | 1622744 tok/s step 11563/19560 | loss 3.358886 (-0.48z)| norm 0.3348 (+2.04z)| lr 2.29e-04 | 323.33 ms | 52.2% bf16 MFU | 1622684 tok/s step 11564/19560 | loss 3.405023 (+0.53z)| norm 0.2996 (+0.74z)| lr 2.29e-04 | 322.65 ms | 52.3% bf16 MFU | 1622796 tok/s step 11565/19560 | loss 3.428378 (+1.02z)| norm 0.2907 (+0.41z)| lr 2.29e-04 | 322.88 ms | 52.3% bf16 MFU | 1622846 tok/s step 11566/19560 | loss 3.402225 (+0.45z)| norm 0.2793 (-0.01z)| lr 2.29e-04 | 322.51 ms | 52.3% bf16 MFU | 1622985 tok/s step 11567/19560 | loss 3.279952 (-2.14z)| norm 0.2877 (+0.30z)| lr 2.29e-04 | 323.58 ms | 52.2% bf16 MFU | 1622849 tok/s step 11568/19560 | loss 3.358862 (-0.46z)| norm 0.2724 (-0.27z)| lr 2.29e-04 | 323.21 ms | 52.2% bf16 MFU | 1622813 tok/s step 11569/19560 | loss 3.370562 (-0.21z)| norm 0.2749 (-0.18z)| lr 2.29e-04 | 322.77 ms | 52.3% bf16 MFU | 1622889 tok/s step 11570/19560 | loss 3.376958 (-0.09z)| norm 0.2627 (-0.63z)| lr 2.29e-04 | 322.25 ms | 
52.4% bf16 MFU | 1623093 tok/s step 11571/19560 | loss 3.343477 (-0.81z)| norm 0.2669 (-0.48z)| lr 2.29e-04 | 322.75 ms | 52.3% bf16 MFU | 1623160 tok/s step 11572/19560 | loss 3.368424 (-0.26z)| norm 0.3030 (+0.86z)| lr 2.29e-04 | 322.92 ms | 52.3% bf16 MFU | 1623181 tok/s step 11573/19560 | loss 3.386029 (+0.12z)| norm 0.2622 (-0.66z)| lr 2.29e-04 | 322.90 ms | 52.3% bf16 MFU | 1623205 tok/s step 11574/19560 | loss 3.399543 (+0.41z)| norm 0.2978 (+0.65z)| lr 2.29e-04 | 322.45 ms | 52.3% bf16 MFU | 1623342 tok/s step 11575/19560 | loss 3.326036 (-1.17z)| norm 0.2856 (+0.19z)| lr 2.29e-04 | 323.20 ms | 52.2% bf16 MFU | 1623284 tok/s step 11576/19560 | loss 3.367195 (-0.28z)| norm 0.2810 (+0.02z)| lr 2.28e-04 | 322.87 ms | 52.3% bf16 MFU | 1623312 tok/s step 11577/19560 | loss 3.357614 (-0.48z)| norm 0.2605 (-0.74z)| lr 2.28e-04 | 322.36 ms | 52.4% bf16 MFU | 1623467 tok/s step 11578/19560 | loss 3.388114 (+0.17z)| norm 0.2663 (-0.53z)| lr 2.28e-04 | 322.98 ms | 52.3% bf16 MFU | 1623458 tok/s step 11579/19560 | loss 3.339710 (-0.87z)| norm 0.2727 (-0.29z)| lr 2.28e-04 | 322.74 ms | 52.3% bf16 MFU | 1623510 tok/s step 11580/19560 | loss 3.402504 (+0.49z)| norm 0.2690 (-0.43z)| lr 2.28e-04 | 322.76 ms | 52.3% bf16 MFU | 1623555 tok/s step 11581/19560 | loss 3.374039 (-0.12z)| norm 0.2733 (-0.27z)| lr 2.28e-04 | 322.55 ms | 52.3% bf16 MFU | 1623650 tok/s step 11582/19560 | loss 3.371597 (-0.17z)| norm 0.2742 (-0.24z)| lr 2.28e-04 | 322.54 ms | 52.3% bf16 MFU | 1623743 tok/s step 11583/19560 | loss 3.385871 (+0.14z)| norm 0.2909 (+0.38z)| lr 2.28e-04 | 323.57 ms | 52.2% bf16 MFU | 1623573 tok/s step 11584/19560 | loss 3.312613 (-1.44z)| norm 0.2802 (-0.02z)| lr 2.28e-04 | 322.23 ms | 52.4% bf16 MFU | 1623747 tok/s step 11585/19560 | loss 3.358049 (-0.45z)| norm 0.2859 (+0.18z)| lr 2.28e-04 | 322.30 ms | 52.4% bf16 MFU | 1623894 tok/s step 11586/19560 | loss 3.360893 (-0.39z)| norm 0.2662 (-0.56z)| lr 2.28e-04 | 322.82 ms | 52.3% bf16 MFU | 1623904 tok/s step 11587/19560 
| loss 3.424874 (+0.98z)| norm 0.2961 (+0.55z)| lr 2.28e-04 | 322.42 ms | 52.3% bf16 MFU | 1624013 tok/s step 11588/19560 | loss 3.458968 (+1.69z)| norm 0.3009 (+0.73z)| lr 2.28e-04 | 322.36 ms | 52.4% bf16 MFU | 1624132 tok/s step 11589/19560 | loss 3.319964 (-1.27z)| norm 0.2734 (-0.31z)| lr 2.28e-04 | 322.49 ms | 52.3% bf16 MFU | 1624211 tok/s step 11590/19560 | loss 3.330614 (-1.03z)| norm 0.2986 (+0.63z)| lr 2.28e-04 | 323.29 ms | 52.2% bf16 MFU | 1624087 tok/s step 11591/19560 | loss 3.361605 (-0.37z)| norm 0.2930 (+0.41z)| lr 2.28e-04 | 322.75 ms | 52.3% bf16 MFU | 1624106 tok/s step 11592/19560 | loss 3.395790 (+0.36z)| norm 0.2714 (-0.40z)| lr 2.28e-04 | 322.40 ms | 52.3% bf16 MFU | 1624210 tok/s step 11593/19560 | loss 3.338048 (-0.86z)| norm 0.2892 (+0.26z)| lr 2.28e-04 | 323.25 ms | 52.2% bf16 MFU | 1624095 tok/s step 11594/19560 | loss 3.364635 (-0.30z)| norm 0.2725 (-0.37z)| lr 2.28e-04 | 322.70 ms | 52.3% bf16 MFU | 1624126 tok/s step 11595/19560 | loss 3.370606 (-0.18z)| norm 0.2809 (-0.06z)| lr 2.28e-04 | 322.59 ms | 52.3% bf16 MFU | 1624182 tok/s step 11596/19560 | loss 3.436621 (+1.21z)| norm 0.2898 (+0.27z)| lr 2.28e-04 | 322.07 ms | 52.4% bf16 MFU | 1624366 tok/s step 11597/19560 | loss 3.354019 (-0.55z)| norm 0.2869 (+0.16z)| lr 2.27e-04 | 323.01 ms | 52.2% bf16 MFU | 1624303 tok/s step 11598/19560 | loss 3.387758 (+0.16z)| norm 0.2903 (+0.28z)| lr 2.27e-04 | 322.78 ms | 52.3% bf16 MFU | 1624301 tok/s step 11599/19560 | loss 3.347367 (-0.71z)| norm 0.2761 (-0.26z)| lr 2.27e-04 | 322.61 ms | 52.3% bf16 MFU | 1624344 tok/s step 11600/19560 | loss 3.376473 (-0.07z)| norm 0.2691 (-0.52z)| lr 2.27e-04 | 322.66 ms | 52.3% bf16 MFU | 1624371 tok/s step 11601/19560 | loss 3.368370 (-0.24z)| norm 0.2726 (-0.39z)| lr 2.27e-04 | 323.42 ms | 52.2% bf16 MFU | 1624206 tok/s step 11602/19560 | loss 3.419567 (+0.86z)| norm 0.2830 (-0.00z)| lr 2.27e-04 | 322.29 ms | 52.4% bf16 MFU | 1624334 tok/s step 11603/19560 | loss 3.373982 (-0.12z)| norm 0.2954 (+0.46z)| 
lr 2.27e-04 | 322.33 ms | 52.4% bf16 MFU | 1624444 tok/s step 11604/19560 | loss 3.332423 (-1.02z)| norm 0.2926 (+0.35z)| lr 2.27e-04 | 323.25 ms | 52.2% bf16 MFU | 1624319 tok/s step 11605/19560 | loss 3.408136 (+0.61z)| norm 0.2774 (-0.23z)| lr 2.27e-04 | 323.12 ms | 52.2% bf16 MFU | 1624232 tok/s step 11606/19560 | loss 3.370544 (-0.20z)| norm 0.2700 (-0.52z)| lr 2.27e-04 | 322.32 ms | 52.4% bf16 MFU | 1624351 tok/s step 11607/19560 | loss 3.304738 (-1.59z)| norm 0.2722 (-0.43z)| lr 2.27e-04 | 323.22 ms | 52.2% bf16 MFU | 1624238 tok/s step 11608/19560 | loss 3.298992 (-1.68z)| norm 0.2782 (-0.21z)| lr 2.27e-04 | 322.93 ms | 52.3% bf16 MFU | 1624202 tok/s step 11609/19560 | loss 3.308874 (-1.46z)| norm 0.2704 (-0.50z)| lr 2.27e-04 | 322.65 ms | 52.3% bf16 MFU | 1624239 tok/s step 11610/19560 | loss 3.364081 (-0.30z)| norm 0.2829 (-0.04z)| lr 2.27e-04 | 322.96 ms | 52.3% bf16 MFU | 1624198 tok/s step 11611/19560 | loss 3.378966 (+0.02z)| norm 0.2696 (-0.55z)| lr 2.27e-04 | 322.48 ms | 52.3% bf16 MFU | 1624278 tok/s step 11612/19560 | loss 3.369446 (-0.19z)| norm 0.2769 (-0.27z)| lr 2.27e-04 | 322.49 ms | 52.3% bf16 MFU | 1624353 tok/s step 11613/19560 | loss 3.371689 (-0.14z)| norm 0.2984 (+0.54z)| lr 2.27e-04 | 322.55 ms | 52.3% bf16 MFU | 1624406 tok/s step 11614/19560 | loss 3.390197 (+0.26z)| norm 0.2764 (-0.30z)| lr 2.27e-04 | 323.34 ms | 52.2% bf16 MFU | 1624259 tok/s step 11615/19560 | loss 3.420133 (+0.90z)| norm 0.2843 (-0.00z)| lr 2.27e-04 | 322.58 ms | 52.3% bf16 MFU | 1624311 tok/s step 11616/19560 | loss 3.417359 (+0.84z)| norm 0.2930 (+0.33z)| lr 2.27e-04 | 322.96 ms | 52.3% bf16 MFU | 1624264 tok/s step 11617/19560 | loss 3.368086 (-0.22z)| norm 0.2786 (-0.22z)| lr 2.26e-04 | 322.71 ms | 52.3% bf16 MFU | 1624282 tok/s step 11618/19560 | loss 3.408612 (+0.63z)| norm 0.2797 (-0.15z)| lr 2.26e-04 | 322.82 ms | 52.3% bf16 MFU | 1624272 tok/s step 11619/19560 | loss 3.383200 (+0.10z)| norm 0.3213 (+1.57z)| lr 2.26e-04 | 322.88 ms | 52.3% bf16 MFU | 
1624248 tok/s step 11620/19560 | loss 3.357155 (-0.47z)| norm 0.2622 (-0.87z)| lr 2.26e-04 | 323.01 ms | 52.2% bf16 MFU | 1624192 tok/s step 11621/19560 | loss 3.337745 (-0.88z)| norm 0.3086 (+1.06z)| lr 2.26e-04 | 322.51 ms | 52.3% bf16 MFU | 1624265 tok/s step 11622/19560 | loss 3.382522 (+0.09z)| norm 0.2752 (-0.32z)| lr 2.26e-04 | 322.35 ms | 52.4% bf16 MFU | 1624375 tok/s step 11623/19560 | loss 3.320500 (-1.43z)| norm 0.2700 (-0.53z)| lr 2.26e-04 | 322.74 ms | 52.3% bf16 MFU | 1624380 tok/s step 11624/19560 | loss 3.356355 (-0.49z)| norm 0.2785 (-0.16z)| lr 2.26e-04 | 322.82 ms | 52.3% bf16 MFU | 1624366 tok/s step 11625/19560 | loss 3.358222 (-0.44z)| norm 0.2665 (-0.66z)| lr 2.26e-04 | 322.55 ms | 52.3% bf16 MFU | 1624421 tok/s step 11626/19560 | loss 3.413247 (+0.98z)| norm 0.2790 (-0.13z)| lr 2.26e-04 | 323.47 ms | 52.2% bf16 MFU | 1624240 tok/s step 11627/19560 | loss 3.374641 (-0.02z)| norm 0.2637 (-0.77z)| lr 2.26e-04 | 322.54 ms | 52.3% bf16 MFU | 1624303 tok/s step 11628/19560 | loss 3.412019 (+0.95z)| norm 0.2934 (+0.48z)| lr 2.26e-04 | 322.44 ms | 52.3% bf16 MFU | 1624388 tok/s step 11629/19560 | loss 3.376526 (+0.03z)| norm 0.2747 (-0.30z)| lr 2.26e-04 | 322.75 ms | 52.3% bf16 MFU | 1624391 tok/s step 11630/19560 | loss 3.352063 (-0.60z)| norm 0.2904 (+0.37z)| lr 2.26e-04 | 322.60 ms | 52.3% bf16 MFU | 1624431 tok/s step 11631/19560 | loss 3.361922 (-0.33z)| norm 0.2601 (-0.90z)| lr 2.26e-04 | 322.87 ms | 52.3% bf16 MFU | 1624402 tok/s step 11632/19560 | loss 3.374759 (+0.01z)| norm 0.2782 (-0.14z)| lr 2.26e-04 | 323.39 ms | 52.2% bf16 MFU | 1624244 tok/s step 11633/19560 | loss 3.380482 (+0.16z)| norm 0.2671 (-0.60z)| lr 2.26e-04 | 322.80 ms | 52.3% bf16 MFU | 1624241 tok/s step 11634/19560 | loss 3.364611 (-0.27z)| norm 0.2694 (-0.50z)| lr 2.26e-04 | 322.14 ms | 52.4% bf16 MFU | 1624404 tok/s step 11635/19560 | loss 3.355416 (-0.51z)| norm 0.2628 (-0.77z)| lr 2.26e-04 | 323.24 ms | 52.2% bf16 MFU | 1624281 tok/s step 11636/19560 | loss 3.329989 
(-1.17z)| norm 0.2812 (+0.01z)| lr 2.26e-04 | 322.58 ms | 52.3% bf16 MFU | 1624333 tok/s step 11637/19560 | loss 3.373452 (-0.03z)| norm 0.2580 (-0.96z)| lr 2.26e-04 | 323.77 ms | 52.1% bf16 MFU | 1624083 tok/s step 11638/19560 | loss 3.360100 (-0.38z)| norm 0.2915 (+0.44z)| lr 2.25e-04 | 322.84 ms | 52.3% bf16 MFU | 1624079 tok/s step 11639/19560 | loss 3.344408 (-0.78z)| norm 0.2473 (-1.40z)| lr 2.25e-04 | 322.98 ms | 52.3% bf16 MFU | 1624040 tok/s step 11640/19560 | loss 3.351846 (-0.58z)| norm 0.2948 (+0.59z)| lr 2.25e-04 | 322.29 ms | 52.4% bf16 MFU | 1624176 tok/s step 11641/19560 | loss 3.355870 (-0.48z)| norm 0.2726 (-0.34z)| lr 2.25e-04 | 322.63 ms | 52.3% bf16 MFU | 1624218 tok/s step 11642/19560 | loss 3.338950 (-0.93z)| norm 0.2860 (+0.21z)| lr 2.25e-04 | 322.50 ms | 52.3% bf16 MFU | 1624293 tok/s step 11643/19560 | loss 3.343666 (-0.82z)| norm 0.2641 (-0.70z)| lr 2.25e-04 | 323.72 ms | 52.1% bf16 MFU | 1624056 tok/s step 11644/19560 | loss 3.348054 (-0.68z)| norm 0.2795 (-0.06z)| lr 2.25e-04 | 322.51 ms | 52.3% bf16 MFU | 1624136 tok/s step 11645/19560 | loss 3.309404 (-1.74z)| norm 0.2803 (-0.02z)| lr 2.25e-04 | 322.45 ms | 52.3% bf16 MFU | 1624226 tok/s step 11646/19560 | loss 3.428029 (+1.54z)| norm 0.2643 (-0.69z)| lr 2.25e-04 | 323.24 ms | 52.2% bf16 MFU | 1624114 tok/s step 11647/19560 | loss 3.334484 (-1.04z)| norm 0.3042 (+0.96z)| lr 2.25e-04 | 323.06 ms | 52.2% bf16 MFU | 1624053 tok/s step 11648/19560 | loss 3.361902 (-0.27z)| norm 0.2540 (-1.13z)| lr 2.25e-04 | 322.64 ms | 52.3% bf16 MFU | 1624099 tok/s step 11649/19560 | loss 3.377205 (+0.16z)| norm 0.2823 (+0.06z)| lr 2.25e-04 | 322.38 ms | 52.4% bf16 MFU | 1624208 tok/s step 11650/19560 | loss 3.342810 (-0.79z)| norm 0.2699 (-0.46z)| lr 2.25e-04 | 322.97 ms | 52.3% bf16 MFU | 1624164 tok/s step 11651/19560 | loss 3.341643 (-0.81z)| norm 0.2681 (-0.53z)| lr 2.25e-04 | 323.22 ms | 52.2% bf16 MFU | 1624061 tok/s step 11652/19560 | loss 3.363844 (-0.17z)| norm 0.2808 (+0.12z)| lr 2.25e-04 | 
322.65 ms | 52.3% bf16 MFU | 1624104 tok/s step 11653/19560 | loss 3.392519 (+0.63z)| norm 0.2723 (-0.46z)| lr 2.25e-04 | 322.20 ms | 52.4% bf16 MFU | 1624261 tok/s step 11654/19560 | loss 3.323015 (-1.35z)| norm 0.2772 (-0.12z)| lr 2.25e-04 | 323.20 ms | 52.2% bf16 MFU | 1624158 tok/s step 11655/19560 | loss 3.358885 (-0.30z)| norm 0.2892 (+0.70z)| lr 2.25e-04 | 323.08 ms | 52.2% bf16 MFU | 1624088 tok/s step 11656/19560 | loss 3.386463 (+0.52z)| norm 0.2777 (-0.10z)| lr 2.25e-04 | 322.20 ms | 52.4% bf16 MFU | 1624244 tok/s step 11657/19560 | loss 3.428622 (+1.73z)| norm 0.2933 (+0.97z)| lr 2.25e-04 | 322.81 ms | 52.3% bf16 MFU | 1624238 tok/s step 11658/19560 | loss 3.342825 (-0.78z)| norm 0.2731 (-0.42z)| lr 2.25e-04 | 322.69 ms | 52.3% bf16 MFU | 1624263 tok/s step 11659/19560 | loss 3.371045 (+0.05z)| norm 0.2885 (+0.63z)| lr 2.24e-04 | 322.97 ms | 52.3% bf16 MFU | 1624216 tok/s step 11660/19560 | loss 3.425349 (+1.63z)| norm 0.2647 (-0.99z)| lr 2.24e-04 | 322.78 ms | 52.3% bf16 MFU | 1624219 tok/s step 11661/19560 | loss 3.344609 (-0.73z)| norm 0.2778 (-0.09z)| lr 2.24e-04 | 322.80 ms | 52.3% bf16 MFU | 1624218 tok/s step 11662/19560 | loss 3.374864 (+0.15z)| norm 0.2651 (-0.96z)| lr 2.24e-04 | 323.20 ms | 52.2% bf16 MFU | 1624115 tok/s step 11663/19560 | loss 3.428171 (+1.69z)| norm 0.2634 (-1.09z)| lr 2.24e-04 | 322.13 ms | 52.4% bf16 MFU | 1624288 tok/s step 11664/19560 | loss 3.399472 (+0.85z)| norm 0.2683 (-0.77z)| lr 2.24e-04 | 322.65 ms | 52.3% bf16 MFU | 1624320 tok/s step 11665/19560 | loss 3.375658 (+0.13z)| norm 0.2476 (-2.19z)| lr 2.24e-04 | 322.99 ms | 52.3% bf16 MFU | 1624265 tok/s step 11666/19560 | loss 3.367616 (-0.09z)| norm 0.2893 (+0.68z)| lr 2.24e-04 | 323.21 ms | 52.2% bf16 MFU | 1624158 tok/s step 11667/19560 | loss 3.373301 (+0.08z)| norm 0.2739 (-0.40z)| lr 2.24e-04 | 322.43 ms | 52.3% bf16 MFU | 1624252 tok/s step 11668/19560 | loss 3.342618 (-0.84z)| norm 0.2707 (-0.64z)| lr 2.24e-04 | 322.53 ms | 52.3% bf16 MFU | 1624317 tok/s step 
11669/19560 | loss 3.390920 (+0.63z)| norm 0.2770 (-0.19z)| lr 2.24e-04 | 322.85 ms | 52.3% bf16 MFU | 1624299 tok/s step 11670/19560 | loss 3.364587 (-0.17z)| norm 0.2616 (-1.27z)| lr 2.24e-04 | 322.47 ms | 52.3% bf16 MFU | 1624376 tok/s step 11671/19560 | loss 3.339281 (-0.93z)| norm 0.2880 (+0.59z)| lr 2.24e-04 | 323.35 ms | 52.2% bf16 MFU | 1624228 tok/s step 11672/19560 | loss 3.323861 (-1.38z)| norm 0.2607 (-1.33z)| lr 2.24e-04 | 323.10 ms | 52.2% bf16 MFU | 1624150 tok/s step 11673/19560 | loss 3.372296 (+0.09z)| norm 0.2990 (+1.34z)| lr 2.24e-04 | 323.04 ms | 52.2% bf16 MFU | 1624092 tok/s step 11674/19560 | loss 3.343718 (-0.78z)| norm 0.2692 (-0.72z)| lr 2.24e-04 | 322.60 ms | 52.3% bf16 MFU | 1624146 tok/s step 11675/19560 | loss 3.348763 (-0.63z)| norm 0.2751 (-0.32z)| lr 2.24e-04 | 322.83 ms | 52.3% bf16 MFU | 1624141 tok/s step 11676/19560 | loss 3.377898 (+0.28z)| norm 0.2739 (-0.41z)| lr 2.24e-04 | 323.28 ms | 52.2% bf16 MFU | 1624024 tok/s step 11677/19560 | loss 3.291862 (-2.32z)| norm 0.2726 (-0.49z)| lr 2.24e-04 | 322.62 ms | 52.3% bf16 MFU | 1624077 tok/s step 11678/19560 | loss 3.363690 (-0.14z)| norm 0.2944 (+1.02z)| lr 2.24e-04 | 322.67 ms | 52.3% bf16 MFU | 1624116 tok/s step 11679/19560 | loss 3.369271 (+0.04z)| norm 0.2730 (-0.45z)| lr 2.23e-04 | 322.75 ms | 52.3% bf16 MFU | 1624133 tok/s step 11680/19560 | loss 3.378152 (+0.30z)| norm 0.3336 (+3.57z)| lr 2.23e-04 | 323.20 ms | 52.2% bf16 MFU | 1624036 tok/s step 11681/19560 | loss 3.378802 (+0.34z)| norm 0.3171 (+2.51z)| lr 2.23e-04 | 322.91 ms | 52.3% bf16 MFU | 1624017 tok/s step 11682/19560 | loss 3.358146 (-0.28z)| norm 0.2895 (+0.65z)| lr 2.23e-04 | 323.14 ms | 52.2% bf16 MFU | 1623941 tok/s step 11683/19560 | loss 3.450800 (+2.57z)| norm 0.2973 (+1.17z)| lr 2.23e-04 | 322.78 ms | 52.3% bf16 MFU | 1623959 tok/s step 11684/19560 | loss 3.396230 (+0.87z)| norm 0.2940 (+0.93z)| lr 2.23e-04 | 323.55 ms | 52.2% bf16 MFU | 1623781 tok/s step 11685/19560 | loss 3.350852 (-0.53z)| norm 
0.2812 (+0.08z)| lr 2.23e-04 | 323.14 ms | 52.2% bf16 MFU | 1623717 tok/s step 11686/19560 | loss 3.389237 (+0.65z)| norm 0.2854 (+0.35z)| lr 2.23e-04 | 322.72 ms | 52.3% bf16 MFU | 1623759 tok/s step 11687/19560 | loss 3.373013 (+0.14z)| norm 0.2902 (+0.68z)| lr 2.23e-04 | 323.04 ms | 52.2% bf16 MFU | 1623722 tok/s step 11688/19560 | loss 3.362770 (-0.16z)| norm 0.2723 (-0.53z)| lr 2.23e-04 | 323.77 ms | 52.1% bf16 MFU | 1623502 tok/s step 11689/19560 | loss 3.313875 (-1.67z)| norm 0.2743 (-0.38z)| lr 2.23e-04 | 322.74 ms | 52.3% bf16 MFU | 1623551 tok/s step 11690/19560 | loss 3.351736 (-0.49z)| norm 0.2995 (+1.31z)| lr 2.23e-04 | 322.73 ms | 52.3% bf16 MFU | 1623601 tok/s step 11691/19560 | loss 3.339646 (-0.86z)| norm 0.2793 (-0.02z)| lr 2.23e-04 | 322.73 ms | 52.3% bf16 MFU | 1623647 tok/s step 11692/19560 | loss 3.323178 (-1.35z)| norm 0.3377 (+3.90z)| lr 2.23e-04 | 323.19 ms | 52.2% bf16 MFU | 1623576 tok/s step 11693/19560 | loss 3.382940 (+0.51z)| norm 0.2937 (+0.93z)| lr 2.23e-04 | 322.74 ms | 52.3% bf16 MFU | 1623623 tok/s step 11694/19560 | loss 3.407115 (+1.27z)| norm 0.2729 (-0.47z)| lr 2.23e-04 | 322.92 ms | 52.3% bf16 MFU | 1623621 tok/s step 11695/19560 | loss 3.406719 (+1.25z)| norm 0.2874 (+0.51z)| lr 2.23e-04 | 323.00 ms | 52.3% bf16 MFU | 1623599 tok/s step 11696/19560 | loss 3.361948 (-0.18z)| norm 0.2693 (-0.71z)| lr 2.23e-04 | 323.45 ms | 52.2% bf16 MFU | 1623465 tok/s step 11697/19560 | loss 3.381848 (+0.45z)| norm 0.2716 (-0.55z)| lr 2.23e-04 | 322.63 ms | 52.3% bf16 MFU | 1623542 tok/s step 11698/19560 | loss 3.323764 (-1.38z)| norm 0.2822 (+0.15z)| lr 2.23e-04 | 323.50 ms | 52.2% bf16 MFU | 1623398 tok/s step 11699/19560 | loss 3.353710 (-0.43z)| norm 0.2830 (+0.20z)| lr 2.23e-04 | 322.88 ms | 52.3% bf16 MFU | 1623417 tok/s step 11700/19560 | loss 3.350926 (-0.51z)| norm 0.2852 (+0.35z)| lr 2.22e-04 | 322.52 ms | 52.3% bf16 MFU | 1623527 tok/s step 11701/19560 | loss 3.382207 (+0.48z)| norm 0.2782 (-0.13z)| lr 2.22e-04 | 323.13 ms | 
52.2% bf16 MFU | 1623477 tok/s step 11702/19560 | loss 3.376737 (+0.31z)| norm 0.2917 (+0.80z)| lr 2.22e-04 | 322.99 ms | 52.3% bf16 MFU | 1623464 tok/s step 11703/19560 | loss 3.371452 (+0.13z)| norm 0.2731 (-0.47z)| lr 2.22e-04 | 322.85 ms | 52.3% bf16 MFU | 1623487 tok/s step 11704/19560 | loss 3.343738 (-0.75z)| norm 0.2864 (+0.44z)| lr 2.22e-04 | 322.96 ms | 52.3% bf16 MFU | 1623482 tok/s step 11705/19560 | loss 3.404856 (+1.19z)| norm 0.2742 (-0.41z)| lr 2.22e-04 | 323.04 ms | 52.2% bf16 MFU | 1623456 tok/s step 11706/19560 | loss 3.394147 (+0.85z)| norm 0.3112 (+2.11z)| lr 2.22e-04 | 323.24 ms | 52.2% bf16 MFU | 1623381 tok/s step 11707/19560 | loss 3.407759 (+1.26z)| norm 0.2803 (-0.01z)| lr 2.22e-04 | 323.29 ms | 52.2% bf16 MFU | 1623298 tok/s step 11708/19560 | loss 3.362218 (-0.18z)| norm 0.2885 (+0.53z)| lr 2.22e-04 | 322.65 ms | 52.3% bf16 MFU | 1623380 tok/s step 11709/19560 | loss 3.341731 (-0.82z)| norm 0.2906 (+0.67z)| lr 2.22e-04 | 322.83 ms | 52.3% bf16 MFU | 1623413 tok/s step 11710/19560 | loss 3.361692 (-0.18z)| norm 0.2702 (-0.73z)| lr 2.22e-04 | 323.35 ms | 52.2% bf16 MFU | 1623315 tok/s step 11711/19560 | loss 3.345947 (-0.67z)| norm 0.2701 (-0.72z)| lr 2.22e-04 | 323.14 ms | 52.2% bf16 MFU | 1623273 tok/s step 11712/19560 | loss 3.366821 (-0.02z)| norm 0.2699 (-0.73z)| lr 2.22e-04 | 322.96 ms | 52.3% bf16 MFU | 1623279 tok/s step 11713/19560 | loss 3.362463 (-0.16z)| norm 0.2890 (+0.57z)| lr 2.22e-04 | 323.08 ms | 52.2% bf16 MFU | 1623253 tok/s step 11714/19560 | loss 3.359731 (-0.25z)| norm 0.2616 (-1.29z)| lr 2.22e-04 | 322.81 ms | 52.3% bf16 MFU | 1623297 tok/s step 11715/19560 | loss 3.382281 (+0.49z)| norm 0.3301 (+3.23z)| lr 2.22e-04 | 323.19 ms | 52.2% bf16 MFU | 1623243 tok/s step 11716/19560 | loss 3.404343 (+1.26z)| norm 0.2648 (-1.03z)| lr 2.22e-04 | 322.65 ms | 52.3% bf16 MFU | 1623329 tok/s step 11717/19560 | loss 3.335706 (-1.05z)| norm 0.3288 (+3.03z)| lr 2.22e-04 | 322.78 ms | 52.3% bf16 MFU | 1623377 tok/s step 11718/19560 
| loss 3.441865 (+2.45z)| norm 0.2709 (-0.62z)| lr 2.22e-04 | 322.59 ms | 52.3% bf16 MFU | 1623470 tok/s step 11719/19560 | loss 3.341033 (-0.88z)| norm 0.2875 (+0.44z)| lr 2.22e-04 | 323.58 ms | 52.2% bf16 MFU | 1623310 tok/s step 11720/19560 | loss 3.406782 (+1.29z)| norm 0.2741 (-0.42z)| lr 2.22e-04 | 322.75 ms | 52.3% bf16 MFU | 1623365 tok/s step 11721/19560 | loss 3.301222 (-2.15z)| norm 0.2907 (+0.64z)| lr 2.21e-04 | 322.65 ms | 52.3% bf16 MFU | 1623443 tok/s step 11722/19560 | loss 3.329763 (-1.21z)| norm 0.2676 (-0.83z)| lr 2.21e-04 | 322.95 ms | 52.3% bf16 MFU | 1623443 tok/s step 11723/19560 | loss 3.406124 (+1.24z)| norm 0.2937 (+0.82z)| lr 2.21e-04 | 322.88 ms | 52.3% bf16 MFU | 1623460 tok/s step 11724/19560 | loss 3.366101 (-0.03z)| norm 0.2550 (-1.60z)| lr 2.21e-04 | 322.93 ms | 52.3% bf16 MFU | 1623464 tok/s step 11725/19560 | loss 3.341667 (-0.82z)| norm 0.2595 (-1.30z)| lr 2.21e-04 | 322.86 ms | 52.3% bf16 MFU | 1623486 tok/s step 11726/19560 | loss 3.375121 (+0.28z)| norm 0.2592 (-1.30z)| lr 2.21e-04 | 323.61 ms | 52.2% bf16 MFU | 1623318 tok/s step 11727/19560 | loss 3.324353 (-1.37z)| norm 0.3229 (+2.57z)| lr 2.21e-04 | 322.91 ms | 52.3% bf16 MFU | 1623334 tok/s step 11728/19560 | loss 3.300823 (-2.08z)| norm 0.2755 (-0.30z)| lr 2.21e-04 | 322.46 ms | 52.3% bf16 MFU | 1623461 tok/s step 11729/19560 | loss 3.361119 (-0.15z)| norm 0.2658 (-0.88z)| lr 2.21e-04 | 322.94 ms | 52.3% bf16 MFU | 1623462 tok/s step 11730/19560 | loss 3.336909 (-0.91z)| norm 0.2657 (-0.88z)| lr 2.21e-04 | 323.56 ms | 52.2% bf16 MFU | 1623306 tok/s step 11731/19560 | loss 3.345448 (-0.63z)| norm 0.2813 (+0.07z)| lr 2.21e-04 | 322.94 ms | 52.3% bf16 MFU | 1623316 tok/s step 11732/19560 | loss 3.395150 (+0.95z)| norm 0.2690 (-0.66z)| lr 2.21e-04 | 323.53 ms | 52.2% bf16 MFU | 1623176 tok/s step 11733/19560 | loss 3.364545 (-0.02z)| norm 0.2745 (-0.33z)| lr 2.21e-04 | 323.19 ms | 52.2% bf16 MFU | 1623129 tok/s step 11734/19560 | loss 3.358648 (-0.21z)| norm 0.2715 (-0.51z)| 
lr 2.21e-04 | 323.20 ms | 52.2% bf16 MFU | 1623082 tok/s step 11735/19560 | loss 3.374291 (+0.29z)| norm 0.2694 (-0.64z)| lr 2.21e-04 | 323.23 ms | 52.2% bf16 MFU | 1623029 tok/s step 11736/19560 | loss 3.377133 (+0.37z)| norm 0.2733 (-0.40z)| lr 2.21e-04 | 322.64 ms | 52.3% bf16 MFU | 1623128 tok/s step 11737/19560 | loss 3.366659 (+0.00z)| norm 0.2713 (-0.52z)| lr 2.21e-04 | 323.75 ms | 52.1% bf16 MFU | 1622943 tok/s step 11738/19560 | loss 3.434196 (+2.23z)| norm 0.2482 (-1.87z)| lr 2.21e-04 | 323.96 ms | 52.1% bf16 MFU | 1622714 tok/s step 11739/19560 | loss 3.439586 (+2.35z)| norm 0.2790 (-0.04z)| lr 2.21e-04 | 323.18 ms | 52.2% bf16 MFU | 1622692 tok/s step 11740/19560 | loss 3.420244 (+1.68z)| norm 0.2854 (+0.33z)| lr 2.21e-04 | 323.44 ms | 52.2% bf16 MFU | 1622605 tok/s step 11741/19560 | loss 3.373590 (+0.18z)| norm 0.2648 (-0.88z)| lr 2.21e-04 | 322.63 ms | 52.3% bf16 MFU | 1622728 tok/s step 11742/19560 | loss 3.350705 (-0.55z)| norm 0.2854 (+0.35z)| lr 2.20e-04 | 323.29 ms | 52.2% bf16 MFU | 1622678 tok/s step 11743/19560 | loss 3.357352 (-0.32z)| norm 0.2743 (-0.32z)| lr 2.20e-04 | 322.83 ms | 52.3% bf16 MFU | 1622748 tok/s step 11744/19560 | loss 3.365309 (-0.05z)| norm 0.2899 (+0.62z)| lr 2.20e-04 | 322.89 ms | 52.3% bf16 MFU | 1622796 tok/s step 11745/19560 | loss 3.351702 (-0.50z)| norm 0.2871 (+0.45z)| lr 2.20e-04 | 322.31 ms | 52.4% bf16 MFU | 1622990 tok/s step 11746/19560 | loss 3.364617 (-0.06z)| norm 0.2761 (-0.21z)| lr 2.20e-04 | 323.45 ms | 52.2% bf16 MFU | 1622888 tok/s step 11747/19560 | loss 3.275718 (-2.89z)| norm 0.3127 (+2.01z)| lr 2.20e-04 | 322.68 ms | 52.3% bf16 MFU | 1622984 tok/s step 11748/19560 | loss 3.456981 (+2.82z)| norm 0.2839 (+0.26z)| lr 2.20e-04 | 322.57 ms | 52.3% bf16 MFU | 1623102 tok/s step 11749/19560 | loss 3.467746 (+3.02z)| norm 0.3088 (+1.77z)| lr 2.20e-04 | 323.06 ms | 52.2% bf16 MFU | 1623090 tok/s step 11750/19560 | loss 3.374060 (+0.20z)| norm 0.2805 (+0.05z)| lr 2.20e-04 | 322.49 ms | 52.3% bf16 MFU | 
1623224 tok/s val loss 3.358420 HellaSwag: 2957/10042 = 0.294463 step 11751/19560 | loss 3.367248 (-0.01z)| norm 0.2943 (+0.87z)| lr 2.20e-04 | 322.85 ms | 52.3% bf16 MFU | 1623261 tok/s step 11752/19560 | loss 3.378587 (+0.33z)| norm 0.2918 (+0.71z)| lr 2.20e-04 | 323.83 ms | 52.1% bf16 MFU | 1623049 tok/s step 11753/19560 | loss 3.354220 (-0.41z)| norm 0.2717 (-0.51z)| lr 2.20e-04 | 322.85 ms | 52.3% bf16 MFU | 1623093 tok/s step 11754/19560 | loss 3.357875 (-0.29z)| norm 0.2968 (+1.00z)| lr 2.20e-04 | 322.98 ms | 52.3% bf16 MFU | 1623102 tok/s step 11755/19560 | loss 3.322591 (-1.35z)| norm 0.2778 (-0.15z)| lr 2.20e-04 | 323.38 ms | 52.2% bf16 MFU | 1623009 tok/s step 11756/19560 | loss 3.358232 (-0.25z)| norm 0.2811 (+0.06z)| lr 2.20e-04 | 322.86 ms | 52.3% bf16 MFU | 1623052 tok/s step 11757/19560 | loss 3.315955 (-1.52z)| norm 0.2531 (-1.61z)| lr 2.20e-04 | 322.61 ms | 52.3% bf16 MFU | 1623158 tok/s step 11758/19560 | loss 3.317384 (-1.46z)| norm 0.2844 (+0.27z)| lr 2.20e-04 | 322.76 ms | 52.3% bf16 MFU | 1623218 tok/s step 11759/19560 | loss 3.350128 (-0.47z)| norm 0.2818 (+0.10z)| lr 2.20e-04 | 323.32 ms | 52.2% bf16 MFU | 1623135 tok/s step 11760/19560 | loss 3.412834 (+1.40z)| norm 0.2844 (+0.25z)| lr 2.20e-04 | 322.59 ms | 52.3% bf16 MFU | 1623240 tok/s step 11761/19560 | loss 3.377955 (+0.36z)| norm 0.2807 (+0.02z)| lr 2.20e-04 | 322.65 ms | 52.3% bf16 MFU | 1623324 tok/s step 11762/19560 | loss 3.425631 (+1.74z)| norm 0.2991 (+1.12z)| lr 2.19e-04 | 322.84 ms | 52.3% bf16 MFU | 1623358 tok/s step 11763/19560 | loss 3.368107 (+0.05z)| norm 0.2722 (-0.51z)| lr 2.19e-04 | 322.29 ms | 52.4% bf16 MFU | 1623529 tok/s step 11764/19560 | loss 3.302938 (-1.85z)| norm 0.2783 (-0.14z)| lr 2.19e-04 | 322.84 ms | 52.3% bf16 MFU | 1623552 tok/s 
step 11765/19560 | loss 3.350144 (-0.47z)| norm 0.2557 (-1.50z)| lr 2.19e-04 | 322.77 ms | 52.3% bf16 MFU | 1623591 tok/s step 11766/19560 | loss 3.383065 (+0.49z)| norm 0.2863 (+0.35z)| lr 2.19e-04 | 322.72 ms | 52.3% bf16 MFU | 1623641 tok/s step 11767/19560 | loss 3.394155 (+0.80z)| norm 0.2591 (-1.31z)| lr 2.19e-04 | 323.04 ms | 52.2% bf16 MFU | 1623609 tok/s step 11768/19560 | loss 3.390310 (+0.68z)| norm 0.2701 (-0.63z)| lr 2.19e-04 | 322.56 ms | 52.3% bf16 MFU | 1623698 tok/s step 11769/19560 | loss 3.364845 (-0.07z)| norm 0.2854 (+0.30z)| lr 2.19e-04 | 322.66 ms | 52.3% bf16 MFU | 1623758 tok/s step 11770/19560 | loss 3.369418 (+0.06z)| norm 0.2679 (-0.76z)| lr 2.19e-04 | 322.75 ms | 52.3% bf16 MFU | 1623792 tok/s step 11771/19560 | loss 3.350046 (-0.51z)| norm 0.2745 (-0.37z)| lr 2.19e-04 | 322.54 ms | 52.3% bf16 MFU | 1623876 tok/s step 11772/19560 | loss 3.471065 (+2.90z)| norm 0.2698 (-0.65z)| lr 2.19e-04 | 322.92 ms | 52.3% bf16 MFU | 1623860 tok/s step 11773/19560 | loss 3.347878 (-0.59z)| norm 0.3031 (+1.37z)| lr 2.19e-04 | 322.86 ms | 52.3% bf16 MFU | 1623862 tok/s step 11774/19560 | loss 3.410412 (+1.20z)| norm 0.2773 (-0.20z)| lr 2.19e-04 | 322.32 ms | 52.4% bf16 MFU | 1623998 tok/s step 11775/19560 | loss 3.328821 (-1.14z)| norm 0.2834 (+0.18z)| lr 2.19e-04 | 322.71 ms | 52.3% bf16 MFU | 1624032 tok/s step 11776/19560 | loss 3.393966 (+0.72z)| norm 0.2827 (+0.13z)| lr 2.19e-04 | 322.82 ms | 52.3% bf16 MFU | 1624033 tok/s step 11777/19560 | loss 3.358152 (-0.30z)| norm 0.2852 (+0.27z)| lr 2.19e-04 | 323.07 ms | 52.2% bf16 MFU | 1623973 tok/s step 11778/19560 | loss 3.397625 (+0.82z)| norm 0.2779 (-0.18z)| lr 2.19e-04 | 322.32 ms | 52.4% bf16 MFU | 1624106 tok/s step 11779/19560 | loss 3.368808 (-0.01z)| norm 0.2709 (-0.62z)| lr 2.19e-04 | 323.04 ms | 52.2% bf16 MFU | 1624049 tok/s step 11780/19560 | loss 3.354644 (-0.42z)| norm 0.2785 (-0.14z)| lr 2.19e-04 | 322.49 ms | 52.3% bf16 MFU | 1624134 tok/s step 11781/19560 | loss 3.328889 (-1.14z)| norm 
0.2620 (-1.16z)| lr 2.19e-04 | 322.34 ms | 52.4% bf16 MFU | 1624252 tok/s step 11782/19560 | loss 3.335946 (-0.94z)| norm 0.2717 (-0.55z)| lr 2.19e-04 | 322.76 ms | 52.3% bf16 MFU | 1624259 tok/s step 11783/19560 | loss 3.427028 (+1.64z)| norm 0.2624 (-1.12z)| lr 2.18e-04 | 322.88 ms | 52.3% bf16 MFU | 1624234 tok/s step 11784/19560 | loss 3.417953 (+1.36z)| norm 0.2835 (+0.18z)| lr 2.18e-04 | 322.62 ms | 52.3% bf16 MFU | 1624276 tok/s step 11785/19560 | loss 3.389315 (+0.57z)| norm 0.2655 (-0.91z)| lr 2.18e-04 | 322.81 ms | 52.3% bf16 MFU | 1624269 tok/s step 11786/19560 | loss 3.399157 (+0.84z)| norm 0.2618 (-1.13z)| lr 2.18e-04 | 322.56 ms | 52.3% bf16 MFU | 1624324 tok/s step 11787/19560 | loss 3.362298 (-0.21z)| norm 0.2791 (-0.06z)| lr 2.18e-04 | 323.41 ms | 52.2% bf16 MFU | 1624164 tok/s step 11788/19560 | loss 3.323040 (-1.30z)| norm 0.2696 (-0.65z)| lr 2.18e-04 | 322.24 ms | 52.4% bf16 MFU | 1624307 tok/s step 11789/19560 | loss 3.315237 (-1.51z)| norm 0.2849 (+0.29z)| lr 2.18e-04 | 323.20 ms | 52.2% bf16 MFU | 1624199 tok/s step 11790/19560 | loss 3.331322 (-1.04z)| norm 0.2610 (-1.18z)| lr 2.18e-04 | 322.56 ms | 52.3% bf16 MFU | 1624259 tok/s step 11791/19560 | loss 3.341327 (-0.75z)| norm 0.2742 (-0.37z)| lr 2.18e-04 | 322.55 ms | 52.3% bf16 MFU | 1624318 tok/s step 11792/19560 | loss 3.339777 (-0.78z)| norm 0.2546 (-1.56z)| lr 2.18e-04 | 322.56 ms | 52.3% bf16 MFU | 1624371 tok/s step 11793/19560 | loss 3.422205 (+1.55z)| norm 0.2936 (+0.81z)| lr 2.18e-04 | 323.01 ms | 52.3% bf16 MFU | 1624310 tok/s step 11794/19560 | loss 3.413553 (+1.28z)| norm 0.3134 (+2.00z)| lr 2.18e-04 | 322.82 ms | 52.3% bf16 MFU | 1624300 tok/s step 11795/19560 | loss 3.286255 (-2.22z)| norm 0.2752 (-0.34z)| lr 2.18e-04 | 322.81 ms | 52.3% bf16 MFU | 1624291 tok/s step 11796/19560 | loss 3.327517 (-1.08z)| norm 0.2753 (-0.34z)| lr 2.18e-04 | 323.09 ms | 52.2% bf16 MFU | 1624213 tok/s step 11797/19560 | loss 3.357020 (-0.27z)| norm 0.3193 (+2.29z)| lr 2.18e-04 | 322.68 ms | 
52.3% bf16 MFU | 1624241 tok/s step 11798/19560 | loss 3.367982 (+0.03z)| norm 0.3044 (+1.37z)| lr 2.18e-04 | 322.77 ms | 52.3% bf16 MFU | 1624247 tok/s step 11799/19560 | loss 3.369147 (+0.06z)| norm 0.3014 (+1.18z)| lr 2.18e-04 | 322.19 ms | 52.4% bf16 MFU | 1624397 tok/s step 11800/19560 | loss 3.305568 (-1.68z)| norm 0.3235 (+2.43z)| lr 2.18e-04 | 322.94 ms | 52.3% bf16 MFU | 1624353 tok/s step 11801/19560 | loss 3.371876 (+0.14z)| norm 0.2972 (+0.89z)| lr 2.18e-04 | 322.97 ms | 52.3% bf16 MFU | 1624302 tok/s step 11802/19560 | loss 3.390900 (+0.65z)| norm 0.3020 (+1.15z)| lr 2.18e-04 | 322.52 ms | 52.3% bf16 MFU | 1624367 tok/s step 11803/19560 | loss 3.295591 (-1.92z)| norm 0.3002 (+1.03z)| lr 2.18e-04 | 322.76 ms | 52.3% bf16 MFU | 1624368 tok/s step 11804/19560 | loss 3.376541 (+0.26z)| norm 0.2722 (-0.59z)| lr 2.17e-04 | 322.38 ms | 52.4% bf16 MFU | 1624464 tok/s step 11805/19560 | loss 3.413548 (+1.25z)| norm 0.3018 (+1.11z)| lr 2.17e-04 | 323.03 ms | 52.2% bf16 MFU | 1624391 tok/s step 11806/19560 | loss 3.370406 (+0.07z)| norm 0.3029 (+1.17z)| lr 2.17e-04 | 323.09 ms | 52.2% bf16 MFU | 1624308 tok/s step 11807/19560 | loss 3.373189 (+0.15z)| norm 0.2813 (-0.09z)| lr 2.17e-04 | 322.57 ms | 52.3% bf16 MFU | 1624359 tok/s step 11808/19560 | loss 3.338197 (-0.80z)| norm 0.2875 (+0.30z)| lr 2.17e-04 | 322.29 ms | 52.4% bf16 MFU | 1624480 tok/s step 11809/19560 | loss 3.330487 (-0.99z)| norm 0.2738 (-0.50z)| lr 2.17e-04 | 322.67 ms | 52.3% bf16 MFU | 1624498 tok/s step 11810/19560 | loss 3.457848 (+2.38z)| norm 0.2741 (-0.48z)| lr 2.17e-04 | 322.55 ms | 52.3% bf16 MFU | 1624546 tok/s step 11811/19560 | loss 3.439367 (+1.90z)| norm 0.2913 (+0.57z)| lr 2.17e-04 | 323.10 ms | 52.2% bf16 MFU | 1624452 tok/s step 11812/19560 | loss 3.369267 (+0.04z)| norm 0.2732 (-0.52z)| lr 2.17e-04 | 323.00 ms | 52.3% bf16 MFU | 1624390 tok/s step 11813/19560 | loss 3.334694 (-0.87z)| norm 0.2948 (+0.78z)| lr 2.17e-04 | 322.35 ms | 52.4% bf16 MFU | 1624492 tok/s step 11814/19560 
| loss 3.300142 (-1.76z)| norm 0.2944 (+0.76z)| lr 2.17e-04 | 323.06 ms | 52.2% bf16 MFU | 1624413 tok/s step 11815/19560 | loss 3.348671 (-0.47z)| norm 0.2735 (-0.50z)| lr 2.17e-04 | 322.44 ms | 52.3% bf16 MFU | 1624493 tok/s step 11816/19560 | loss 3.358333 (-0.22z)| norm 0.2961 (+0.86z)| lr 2.17e-04 | 322.62 ms | 52.3% bf16 MFU | 1624524 tok/s step 11817/19560 | loss 3.396930 (+0.79z)| norm 0.2791 (-0.18z)| lr 2.17e-04 | 322.21 ms | 52.4% bf16 MFU | 1624655 tok/s step 11818/19560 | loss 3.395519 (+0.74z)| norm 0.2996 (+1.07z)| lr 2.17e-04 | 322.41 ms | 52.3% bf16 MFU | 1624730 tok/s step 11819/19560 | loss 3.403993 (+0.95z)| norm 0.2827 (+0.04z)| lr 2.17e-04 | 322.61 ms | 52.3% bf16 MFU | 1624750 tok/s step 11820/19560 | loss 3.397955 (+0.78z)| norm 0.2871 (+0.34z)| lr 2.17e-04 | 322.58 ms | 52.3% bf16 MFU | 1624778 tok/s step 11821/19560 | loss 3.341526 (-0.71z)| norm 0.2769 (-0.30z)| lr 2.17e-04 | 322.51 ms | 52.3% bf16 MFU | 1624822 tok/s step 11822/19560 | loss 3.352844 (-0.40z)| norm 0.2753 (-0.40z)| lr 2.17e-04 | 323.40 ms | 52.2% bf16 MFU | 1624639 tok/s step 11823/19560 | loss 3.447942 (+2.09z)| norm 0.3373 (+3.37z)| lr 2.17e-04 | 322.66 ms | 52.3% bf16 MFU | 1624651 tok/s step 11824/19560 | loss 3.337808 (-0.79z)| norm 0.3011 (+1.15z)| lr 2.17e-04 | 323.05 ms | 52.2% bf16 MFU | 1624564 tok/s step 11825/19560 | loss 3.518690 (+3.69z)| norm 0.3342 (+3.01z)| lr 2.16e-04 | 322.72 ms | 52.3% bf16 MFU | 1624566 tok/s step 11826/19560 | loss 3.338880 (-0.75z)| norm 0.2808 (-0.11z)| lr 2.16e-04 | 322.84 ms | 52.3% bf16 MFU | 1624538 tok/s step 11827/19560 | loss 3.348250 (-0.52z)| norm 0.3511 (+3.75z)| lr 2.16e-04 | 322.95 ms | 52.3% bf16 MFU | 1624482 tok/s step 11828/19560 | loss 3.408663 (+0.96z)| norm 0.3002 (+0.93z)| lr 2.16e-04 | 322.42 ms | 52.3% bf16 MFU | 1624564 tok/s step 11829/19560 | loss 3.391565 (+0.54z)| norm 0.3075 (+1.31z)| lr 2.16e-04 | 323.05 ms | 52.2% bf16 MFU | 1624482 tok/s step 11830/19560 | loss 3.322619 (-1.14z)| norm 0.2939 (+0.56z)| 
lr 2.16e-04 | 322.22 ms | 52.4% bf16 MFU | 1624612 tok/s step 11831/19560 | loss 3.339476 (-0.72z)| norm 0.3083 (+1.33z)| lr 2.16e-04 | 322.49 ms | 52.3% bf16 MFU | 1624669 tok/s step 11832/19560 | loss 3.439024 (+1.68z)| norm 0.2766 (-0.39z)| lr 2.16e-04 | 322.54 ms | 52.3% bf16 MFU | 1624711 tok/s step 11833/19560 | loss 3.425245 (+1.33z)| norm 0.2945 (+0.58z)| lr 2.16e-04 | 322.84 ms | 52.3% bf16 MFU | 1624675 tok/s step 11834/19560 | loss 3.346456 (-0.56z)| norm 0.2911 (+0.40z)| lr 2.16e-04 | 322.75 ms | 52.3% bf16 MFU | 1624663 tok/s step 11835/19560 | loss 3.343071 (-0.63z)| norm 0.2925 (+0.47z)| lr 2.16e-04 | 322.88 ms | 52.3% bf16 MFU | 1624619 tok/s step 11836/19560 | loss 3.391098 (+0.53z)| norm 0.2976 (+0.74z)| lr 2.16e-04 | 323.58 ms | 52.2% bf16 MFU | 1624400 tok/s step 11837/19560 | loss 3.323399 (-1.10z)| norm 0.2907 (+0.37z)| lr 2.16e-04 | 322.50 ms | 52.3% bf16 MFU | 1624464 tok/s step 11838/19560 | loss 3.438125 (+1.63z)| norm 0.2799 (-0.23z)| lr 2.16e-04 | 322.93 ms | 52.3% bf16 MFU | 1624418 tok/s step 11839/19560 | loss 3.342568 (-0.65z)| norm 0.2708 (-0.73z)| lr 2.16e-04 | 323.09 ms | 52.2% bf16 MFU | 1624332 tok/s step 11840/19560 | loss 3.392647 (+0.54z)| norm 0.2668 (-0.94z)| lr 2.16e-04 | 322.63 ms | 52.3% bf16 MFU | 1624368 tok/s step 11841/19560 | loss 3.360598 (-0.22z)| norm 0.2685 (-0.83z)| lr 2.16e-04 | 322.91 ms | 52.3% bf16 MFU | 1624332 tok/s step 11842/19560 | loss 3.337232 (-0.77z)| norm 0.2764 (-0.42z)| lr 2.16e-04 | 322.52 ms | 52.3% bf16 MFU | 1624395 tok/s step 11843/19560 | loss 3.327680 (-0.98z)| norm 0.2496 (-1.87z)| lr 2.16e-04 | 322.48 ms | 52.3% bf16 MFU | 1624465 tok/s step 11844/19560 | loss 3.381721 (+0.30z)| norm 0.2672 (-0.90z)| lr 2.16e-04 | 322.60 ms | 52.3% bf16 MFU | 1624501 tok/s step 11845/19560 | loss 3.340085 (-0.69z)| norm 0.2684 (-0.82z)| lr 2.16e-04 | 323.30 ms | 52.2% bf16 MFU | 1624360 tok/s step 11846/19560 | loss 3.378902 (+0.25z)| norm 0.2802 (-0.16z)| lr 2.15e-04 | 323.01 ms | 52.2% bf16 MFU | 
1624298 tok/s step 11847/19560 | loss 3.372317 (+0.08z)| norm 0.2529 (-1.67z)| lr 2.15e-04 | 322.45 ms | 52.3% bf16 MFU | 1624380 tok/s step 11848/19560 | loss 3.341013 (-0.66z)| norm 0.2820 (-0.04z)| lr 2.15e-04 | 322.42 ms | 52.3% bf16 MFU | 1624466 tok/s step 11849/19560 | loss 3.393254 (+0.59z)| norm 0.2545 (-1.56z)| lr 2.15e-04 | 322.61 ms | 52.3% bf16 MFU | 1624499 tok/s step 11850/19560 | loss 3.397972 (+0.69z)| norm 0.2542 (-1.56z)| lr 2.15e-04 | 323.35 ms | 52.2% bf16 MFU | 1624344 tok/s step 11851/19560 | loss 3.369483 (+0.00z)| norm 0.2530 (-1.60z)| lr 2.15e-04 | 322.91 ms | 52.3% bf16 MFU | 1624308 tok/s step 11852/19560 | loss 3.368240 (-0.03z)| norm 0.2675 (-0.81z)| lr 2.15e-04 | 323.17 ms | 52.2% bf16 MFU | 1624210 tok/s step 11853/19560 | loss 3.370648 (+0.03z)| norm 0.2612 (-1.16z)| lr 2.15e-04 | 321.94 ms | 52.4% bf16 MFU | 1624424 tok/s step 11854/19560 | loss 3.387015 (+0.42z)| norm 0.2719 (-0.57z)| lr 2.15e-04 | 323.19 ms | 52.2% bf16 MFU | 1624315 tok/s step 11855/19560 | loss 3.409971 (+0.97z)| norm 0.2655 (-0.92z)| lr 2.15e-04 | 324.12 ms | 52.1% bf16 MFU | 1623979 tok/s step 11856/19560 | loss 3.390919 (+0.49z)| norm 0.2533 (-1.59z)| lr 2.15e-04 | 322.82 ms | 52.3% bf16 MFU | 1623984 tok/s step 11857/19560 | loss 3.361041 (-0.25z)| norm 0.2685 (-0.74z)| lr 2.15e-04 | 322.52 ms | 52.3% bf16 MFU | 1624063 tok/s step 11858/19560 | loss 3.316904 (-1.33z)| norm 0.2453 (-2.00z)| lr 2.15e-04 | 322.38 ms | 52.4% bf16 MFU | 1624176 tok/s step 11859/19560 | loss 3.377518 (+0.16z)| norm 0.2736 (-0.43z)| lr 2.15e-04 | 322.93 ms | 52.3% bf16 MFU | 1624144 tok/s step 11860/19560 | loss 3.392987 (+0.54z)| norm 0.2725 (-0.50z)| lr 2.15e-04 | 323.89 ms | 52.1% bf16 MFU | 1623873 tok/s step 11861/19560 | loss 3.386509 (+0.37z)| norm 0.2833 (+0.10z)| lr 2.15e-04 | 322.83 ms | 52.3% bf16 MFU | 1623881 tok/s step 11862/19560 | loss 3.302817 (-1.66z)| norm 0.2446 (-2.00z)| lr 2.15e-04 | 322.60 ms | 52.3% bf16 MFU | 1623946 tok/s step 11863/19560 | loss 3.495317 
(+2.91z)| norm 0.3174 (+1.93z)| lr 2.15e-04 | 322.64 ms | 52.3% bf16 MFU | 1623998 tok/s step 11864/19560 | loss 3.368888 (-0.07z)| norm 0.2855 (+0.20z)| lr 2.15e-04 | 322.89 ms | 52.3% bf16 MFU | 1623984 tok/s step 11865/19560 | loss 3.323391 (-1.13z)| norm 0.3217 (+2.10z)| lr 2.15e-04 | 323.55 ms | 52.2% bf16 MFU | 1623808 tok/s step 11866/19560 | loss 3.422311 (+1.20z)| norm 0.2821 (-0.02z)| lr 2.14e-04 | 322.80 ms | 52.3% bf16 MFU | 1623826 tok/s step 11867/19560 | loss 3.397010 (+0.62z)| norm 0.3013 (+1.00z)| lr 2.14e-04 | 322.74 ms | 52.3% bf16 MFU | 1623860 tok/s step 11868/19560 | loss 3.376730 (+0.15z)| norm 0.3178 (+1.85z)| lr 2.14e-04 | 323.33 ms | 52.2% bf16 MFU | 1623743 tok/s step 11869/19560 | loss 3.316895 (-1.26z)| norm 0.2828 (-0.01z)| lr 2.14e-04 | 323.20 ms | 52.2% bf16 MFU | 1623665 tok/s step 11870/19560 | loss 3.362658 (-0.18z)| norm 0.3062 (+1.21z)| lr 2.14e-04 | 322.57 ms | 52.3% bf16 MFU | 1623749 tok/s step 11871/19560 | loss 3.363848 (-0.15z)| norm 0.2708 (-0.65z)| lr 2.14e-04 | 323.04 ms | 52.2% bf16 MFU | 1623711 tok/s step 11872/19560 | loss 3.354651 (-0.37z)| norm 0.2885 (+0.28z)| lr 2.14e-04 | 323.35 ms | 52.2% bf16 MFU | 1623597 tok/s step 11873/19560 | loss 3.367292 (-0.07z)| norm 0.2814 (-0.09z)| lr 2.14e-04 | 322.77 ms | 52.3% bf16 MFU | 1623634 tok/s step 11874/19560 | loss 3.357759 (-0.30z)| norm 0.2930 (+0.51z)| lr 2.14e-04 | 322.50 ms | 52.3% bf16 MFU | 1623736 tok/s step 11875/19560 | loss 3.359250 (-0.28z)| norm 0.2695 (-0.71z)| lr 2.14e-04 | 323.35 ms | 52.2% bf16 MFU | 1623622 tok/s step 11876/19560 | loss 3.349331 (-0.51z)| norm 0.2901 (+0.38z)| lr 2.14e-04 | 322.91 ms | 52.3% bf16 MFU | 1623623 tok/s step 11877/19560 | loss 3.364008 (-0.13z)| norm 0.2695 (-0.70z)| lr 2.14e-04 | 322.83 ms | 52.3% bf16 MFU | 1623643 tok/s step 11878/19560 | loss 3.398333 (+0.72z)| norm 0.2797 (-0.15z)| lr 2.14e-04 | 322.82 ms | 52.3% bf16 MFU | 1623664 tok/s step 11879/19560 | loss 3.354474 (-0.38z)| norm 0.2878 (+0.28z)| lr 2.14e-04 | 
323.10 ms | 52.2% bf16 MFU | 1623616 tok/s step 11880/19560 | loss 3.368752 (-0.02z)| norm 0.2679 (-0.77z)| lr 2.14e-04 | 323.61 ms | 52.2% bf16 MFU | 1623442 tok/s step 11881/19560 | loss 3.367196 (-0.06z)| norm 0.2744 (-0.43z)| lr 2.14e-04 | 322.65 ms | 52.3% bf16 MFU | 1623518 tok/s step 11882/19560 | loss 3.352559 (-0.42z)| norm 0.2838 (+0.08z)| lr 2.14e-04 | 322.90 ms | 52.3% bf16 MFU | 1623526 tok/s step 11883/19560 | loss 3.373750 (+0.10z)| norm 0.2616 (-1.10z)| lr 2.14e-04 | 323.02 ms | 52.2% bf16 MFU | 1623503 tok/s step 11884/19560 | loss 3.346561 (-0.58z)| norm 0.2659 (-0.86z)| lr 2.14e-04 | 323.15 ms | 52.2% bf16 MFU | 1623449 tok/s step 11885/19560 | loss 3.370491 (+0.01z)| norm 0.2765 (-0.31z)| lr 2.14e-04 | 322.44 ms | 52.3% bf16 MFU | 1623575 tok/s step 11886/19560 | loss 3.374539 (+0.10z)| norm 0.2838 (+0.09z)| lr 2.14e-04 | 323.03 ms | 52.2% bf16 MFU | 1623548 tok/s step 11887/19560 | loss 3.313077 (-1.46z)| norm 0.2758 (-0.34z)| lr 2.13e-04 | 323.17 ms | 52.2% bf16 MFU | 1623488 tok/s step 11888/19560 | loss 3.357137 (-0.33z)| norm 0.2899 (+0.41z)| lr 2.13e-04 | 323.01 ms | 52.2% bf16 MFU | 1623470 tok/s step 11889/19560 | loss 3.396751 (+0.68z)| norm 0.2684 (-0.73z)| lr 2.13e-04 | 323.21 ms | 52.2% bf16 MFU | 1623403 tok/s step 11890/19560 | loss 3.316029 (-1.36z)| norm 0.2951 (+0.70z)| lr 2.13e-04 | 322.89 ms | 52.3% bf16 MFU | 1623420 tok/s step 11891/19560 | loss 3.368402 (-0.02z)| norm 0.2972 (+0.80z)| lr 2.13e-04 | 323.40 ms | 52.2% bf16 MFU | 1623307 tok/s step 11892/19560 | loss 3.350457 (-0.49z)| norm 0.2776 (-0.25z)| lr 2.13e-04 | 323.03 ms | 52.2% bf16 MFU | 1623294 tok/s step 11893/19560 | loss 3.357803 (-0.30z)| norm 0.2881 (+0.30z)| lr 2.13e-04 | 322.95 ms | 52.3% bf16 MFU | 1623300 tok/s step 11894/19560 | loss 3.397547 (+0.72z)| norm 0.2856 (+0.16z)| lr 2.13e-04 | 323.50 ms | 52.2% bf16 MFU | 1623169 tok/s step 11895/19560 | loss 3.345141 (-0.62z)| norm 0.2974 (+0.79z)| lr 2.13e-04 | 322.93 ms | 52.3% bf16 MFU | 1623188 tok/s step 
11896/19560 | loss 3.318885 (-1.28z)| norm 0.2808 (-0.12z)| lr 2.13e-04 | 322.94 ms | 52.3% bf16 MFU | 1623202 tok/s step 11897/19560 | loss 3.348319 (-0.52z)| norm 0.2848 (+0.10z)| lr 2.13e-04 | 322.89 ms | 52.3% bf16 MFU | 1623228 tok/s step 11898/19560 | loss 3.416137 (+1.20z)| norm 0.2780 (-0.27z)| lr 2.13e-04 | 323.48 ms | 52.2% bf16 MFU | 1623105 tok/s step 11899/19560 | loss 3.398105 (+0.73z)| norm 0.2999 (+0.90z)| lr 2.13e-04 | 323.13 ms | 52.2% bf16 MFU | 1623077 tok/s step 11900/19560 | loss 3.341838 (-0.69z)| norm 0.3018 (+0.99z)| lr 2.13e-04 | 322.66 ms | 52.3% bf16 MFU | 1623166 tok/s step 11901/19560 | loss 3.316873 (-1.33z)| norm 0.3033 (+1.07z)| lr 2.13e-04 | 322.67 ms | 52.3% bf16 MFU | 1623250 tok/s step 11902/19560 | loss 3.375550 (+0.20z)| norm 0.2796 (-0.21z)| lr 2.13e-04 | 323.54 ms | 52.2% bf16 MFU | 1623112 tok/s step 11903/19560 | loss 3.350419 (-0.46z)| norm 0.3457 (+3.20z)| lr 2.13e-04 | 323.49 ms | 52.2% bf16 MFU | 1622994 tok/s step 11904/19560 | loss 3.339124 (-0.74z)| norm 0.3019 (+0.92z)| lr 2.13e-04 | 322.51 ms | 52.3% bf16 MFU | 1623126 tok/s step 11905/19560 | loss 3.351222 (-0.43z)| norm 0.3038 (+1.01z)| lr 2.13e-04 | 323.90 ms | 52.1% bf16 MFU | 1622904 tok/s step 11906/19560 | loss 3.378366 (+0.29z)| norm 0.2956 (+0.58z)| lr 2.13e-04 | 322.92 ms | 52.3% bf16 MFU | 1622936 tok/s step 11907/19560 | loss 3.386327 (+0.49z)| norm 0.2755 (-0.46z)| lr 2.13e-04 | 322.34 ms | 52.4% bf16 MFU | 1623116 tok/s step 11908/19560 | loss 3.336542 (-0.80z)| norm 0.2855 (+0.05z)| lr 2.12e-04 | 323.31 ms | 52.2% bf16 MFU | 1623040 tok/s step 11909/19560 | loss 3.366901 (-0.02z)| norm 0.2738 (-0.55z)| lr 2.12e-04 | 323.40 ms | 52.2% bf16 MFU | 1622948 tok/s step 11910/19560 | loss 3.448219 (+2.06z)| norm 0.2952 (+0.54z)| lr 2.12e-04 | 324.02 ms | 52.1% bf16 MFU | 1622705 tok/s step 11911/19560 | loss 3.393095 (+0.64z)| norm 0.2631 (-1.12z)| lr 2.12e-04 | 322.81 ms | 52.3% bf16 MFU | 1622776 tok/s step 11912/19560 | loss 3.444681 (+1.96z)| norm 
0.3334 (+2.44z)| lr 2.12e-04 | 323.24 ms | 52.2% bf16 MFU | 1622737 tok/s step 11913/19560 | loss 3.353172 (-0.39z)| norm 0.2649 (-1.02z)| lr 2.12e-04 | 322.94 ms | 52.3% bf16 MFU | 1622774 tok/s step 11914/19560 | loss 3.335444 (-0.83z)| norm 0.2916 (+0.32z)| lr 2.12e-04 | 323.66 ms | 52.1% bf16 MFU | 1622630 tok/s step 11915/19560 | loss 3.326068 (-1.06z)| norm 0.2750 (-0.53z)| lr 2.12e-04 | 323.11 ms | 52.2% bf16 MFU | 1622631 tok/s step 11916/19560 | loss 3.332842 (-0.89z)| norm 0.2908 (+0.27z)| lr 2.12e-04 | 323.28 ms | 52.2% bf16 MFU | 1622587 tok/s step 11917/19560 | loss 3.440438 (+1.84z)| norm 0.3335 (+2.36z)| lr 2.12e-04 | 323.55 ms | 52.2% bf16 MFU | 1622480 tok/s step 11918/19560 | loss 3.327613 (-1.04z)| norm 0.3529 (+3.18z)| lr 2.12e-04 | 323.55 ms | 52.2% bf16 MFU | 1622377 tok/s step 11919/19560 | loss 3.410243 (+1.05z)| norm 0.2825 (-0.20z)| lr 2.12e-04 | 322.45 ms | 52.3% bf16 MFU | 1622556 tok/s step 11920/19560 | loss 3.385355 (+0.41z)| norm 0.3413 (+2.55z)| lr 2.12e-04 | 323.77 ms | 52.1% bf16 MFU | 1622395 tok/s step 11921/19560 | loss 3.316012 (-1.34z)| norm 0.2797 (-0.35z)| lr 2.12e-04 | 323.63 ms | 52.1% bf16 MFU | 1622277 tok/s step 11922/19560 | loss 3.277735 (-2.25z)| norm 0.3196 (+1.52z)| lr 2.12e-04 | 323.23 ms | 52.2% bf16 MFU | 1622264 tok/s step 11923/19560 | loss 3.335763 (-0.82z)| norm 0.2936 (+0.29z)| lr 2.12e-04 | 323.83 ms | 52.1% bf16 MFU | 1622102 tok/s step 11924/19560 | loss 3.334882 (-0.84z)| norm 0.3028 (+0.71z)| lr 2.12e-04 | 323.07 ms | 52.2% bf16 MFU | 1622138 tok/s step 11925/19560 | loss 3.385900 (+0.45z)| norm 0.2729 (-0.68z)| lr 2.12e-04 | 322.81 ms | 52.3% bf16 MFU | 1622238 tok/s step 11926/19560 | loss 3.339539 (-0.72z)| norm 0.3178 (+1.43z)| lr 2.12e-04 | 323.23 ms | 52.2% bf16 MFU | 1622226 tok/s step 11927/19560 | loss 3.412942 (+1.13z)| norm 0.2751 (-0.57z)| lr 2.12e-04 | 323.28 ms | 52.2% bf16 MFU | 1622205 tok/s step 11928/19560 | loss 3.407321 (+0.98z)| norm 0.2768 (-0.48z)| lr 2.12e-04 | 322.86 ms | 
52.3% bf16 MFU | 1622290 tok/s step 11929/19560 | loss 3.344667 (-0.61z)| norm 0.2646 (-1.04z)| lr 2.11e-04 | 322.83 ms | 52.3% bf16 MFU | 1622376 tok/s step 11930/19560 | loss 3.352283 (-0.41z)| norm 0.2715 (-0.70z)| lr 2.11e-04 | 323.52 ms | 52.2% bf16 MFU | 1622286 tok/s step 11931/19560 | loss 3.350886 (-0.47z)| norm 0.2730 (-0.62z)| lr 2.11e-04 | 322.87 ms | 52.3% bf16 MFU | 1622363 tok/s step 11932/19560 | loss 3.388663 (+0.51z)| norm 0.2723 (-0.66z)| lr 2.11e-04 | 323.03 ms | 52.2% bf16 MFU | 1622396 tok/s step 11933/19560 | loss 3.349439 (-0.49z)| norm 0.2556 (-1.42z)| lr 2.11e-04 | 323.25 ms | 52.2% bf16 MFU | 1622372 tok/s step 11934/19560 | loss 3.307181 (-1.56z)| norm 0.2530 (-1.52z)| lr 2.11e-04 | 322.49 ms | 52.3% bf16 MFU | 1622542 tok/s step 11935/19560 | loss 3.361942 (-0.15z)| norm 0.2633 (-1.02z)| lr 2.11e-04 | 323.70 ms | 52.1% bf16 MFU | 1622399 tok/s step 11936/19560 | loss 3.390789 (+0.57z)| norm 0.2552 (-1.38z)| lr 2.11e-04 | 322.64 ms | 52.3% bf16 MFU | 1622529 tok/s step 11937/19560 | loss 3.346406 (-0.57z)| norm 0.2593 (-1.18z)| lr 2.11e-04 | 322.79 ms | 52.3% bf16 MFU | 1622615 tok/s step 11938/19560 | loss 3.306235 (-1.59z)| norm 0.2605 (-1.12z)| lr 2.11e-04 | 323.41 ms | 52.2% bf16 MFU | 1622542 tok/s step 11939/19560 | loss 3.402494 (+0.93z)| norm 0.2532 (-1.43z)| lr 2.11e-04 | 322.91 ms | 52.3% bf16 MFU | 1622597 tok/s step 11940/19560 | loss 3.351524 (-0.40z)| norm 0.2686 (-0.72z)| lr 2.11e-04 | 322.66 ms | 52.3% bf16 MFU | 1622712 tok/s step 11941/19560 | loss 3.334248 (-0.86z)| norm 0.2677 (-0.75z)| lr 2.11e-04 | 323.02 ms | 52.2% bf16 MFU | 1622731 tok/s step 11942/19560 | loss 3.326012 (-1.09z)| norm 0.2563 (-1.25z)| lr 2.11e-04 | 322.67 ms | 52.3% bf16 MFU | 1622836 tok/s step 11943/19560 | loss 3.326190 (-1.07z)| norm 0.2677 (-0.73z)| lr 2.11e-04 | 322.67 ms | 52.3% bf16 MFU | 1622936 tok/s step 11944/19560 | loss 3.346773 (-0.53z)| norm 0.2784 (-0.24z)| lr 2.11e-04 | 323.49 ms | 52.2% bf16 MFU | 1622825 tok/s step 11945/19560 
| loss 3.366046 (-0.01z)| norm 0.2746 (-0.41z)| lr 2.11e-04 | 323.03 ms | 52.2% bf16 MFU | 1622836 tok/s step 11946/19560 | loss 3.333846 (-0.85z)| norm 0.2834 (-0.01z)| lr 2.11e-04 | 322.54 ms | 52.3% bf16 MFU | 1622968 tok/s step 11947/19560 | loss 3.419566 (+1.40z)| norm 0.2936 (+0.45z)| lr 2.11e-04 | 322.97 ms | 52.3% bf16 MFU | 1622986 tok/s step 11948/19560 | loss 3.382573 (+0.43z)| norm 0.3013 (+0.80z)| lr 2.11e-04 | 322.90 ms | 52.3% bf16 MFU | 1623020 tok/s step 11949/19560 | loss 3.402342 (+0.94z)| norm 0.2718 (-0.54z)| lr 2.11e-04 | 322.44 ms | 52.3% bf16 MFU | 1623168 tok/s step 11950/19560 | loss 3.330600 (-0.94z)| norm 0.2795 (-0.19z)| lr 2.10e-04 | 322.81 ms | 52.3% bf16 MFU | 1623216 tok/s step 11951/19560 | loss 3.373002 (+0.19z)| norm 0.2733 (-0.46z)| lr 2.10e-04 | 323.22 ms | 52.2% bf16 MFU | 1623159 tok/s step 11952/19560 | loss 3.432609 (+1.75z)| norm 0.2774 (-0.26z)| lr 2.10e-04 | 322.71 ms | 52.3% bf16 MFU | 1623232 tok/s step 11953/19560 | loss 3.327451 (-1.06z)| norm 0.2849 (+0.11z)| lr 2.10e-04 | 322.88 ms | 52.3% bf16 MFU | 1623260 tok/s step 11954/19560 | loss 3.338314 (-0.75z)| norm 0.2745 (-0.38z)| lr 2.10e-04 | 323.07 ms | 52.2% bf16 MFU | 1623239 tok/s step 11955/19560 | loss 3.368973 (+0.11z)| norm 0.2665 (-0.76z)| lr 2.10e-04 | 322.95 ms | 52.3% bf16 MFU | 1623250 tok/s step 11956/19560 | loss 3.413903 (+1.37z)| norm 0.2602 (-1.06z)| lr 2.10e-04 | 322.92 ms | 52.3% bf16 MFU | 1623267 tok/s step 11957/19560 | loss 3.333749 (-0.87z)| norm 0.2765 (-0.25z)| lr 2.10e-04 | 322.59 ms | 52.3% bf16 MFU | 1623365 tok/s step 11958/19560 | loss 3.360177 (-0.14z)| norm 0.2603 (-1.03z)| lr 2.10e-04 | 323.07 ms | 52.2% bf16 MFU | 1623337 tok/s step 11959/19560 | loss 3.383301 (+0.51z)| norm 0.2596 (-1.05z)| lr 2.10e-04 | 322.85 ms | 52.3% bf16 MFU | 1623367 tok/s step 11960/19560 | loss 3.364073 (-0.02z)| norm 0.2696 (-0.55z)| lr 2.10e-04 | 323.50 ms | 52.2% bf16 MFU | 1623231 tok/s step 11961/19560 | loss 3.396752 (+0.93z)| norm 0.2619 (-0.92z)| 
lr 2.10e-04 | 322.83 ms | 52.3% bf16 MFU | 1623270 tok/s step 11962/19560 | loss 3.355009 (-0.28z)| norm 0.2657 (-0.72z)| lr 2.10e-04 | 322.94 ms | 52.3% bf16 MFU | 1623281 tok/s step 11963/19560 | loss 3.333709 (-0.90z)| norm 0.2511 (-1.42z)| lr 2.10e-04 | 322.67 ms | 52.3% bf16 MFU | 1623358 tok/s step 11964/19560 | loss 3.350341 (-0.41z)| norm 0.2597 (-0.98z)| lr 2.10e-04 | 322.94 ms | 52.3% bf16 MFU | 1623364 tok/s step 11965/19560 | loss 3.400911 (+1.05z)| norm 0.2479 (-1.53z)| lr 2.10e-04 | 322.51 ms | 52.3% bf16 MFU | 1623478 tok/s step 11966/19560 | loss 3.345899 (-0.54z)| norm 0.2635 (-0.76z)| lr 2.10e-04 | 323.07 ms | 52.2% bf16 MFU | 1623446 tok/s step 11967/19560 | loss 3.368697 (+0.13z)| norm 0.2576 (-1.04z)| lr 2.10e-04 | 322.39 ms | 52.4% bf16 MFU | 1623587 tok/s step 11968/19560 | loss 3.359636 (-0.13z)| norm 0.2750 (-0.20z)| lr 2.10e-04 | 322.55 ms | 52.3% bf16 MFU | 1623680 tok/s step 11969/19560 | loss 3.335336 (-0.85z)| norm 0.2705 (-0.42z)| lr 2.10e-04 | 323.01 ms | 52.2% bf16 MFU | 1623652 tok/s step 11970/19560 | loss 3.304538 (-1.74z)| norm 0.2762 (-0.14z)| lr 2.10e-04 | 322.90 ms | 52.3% bf16 MFU | 1623653 tok/s step 11971/19560 | loss 3.336266 (-0.81z)| norm 0.2694 (-0.48z)| lr 2.09e-04 | 322.43 ms | 52.3% bf16 MFU | 1623772 tok/s step 11972/19560 | loss 3.304355 (-1.71z)| norm 0.2746 (-0.23z)| lr 2.09e-04 | 322.75 ms | 52.3% bf16 MFU | 1623804 tok/s step 11973/19560 | loss 3.298469 (-1.85z)| norm 0.2682 (-0.54z)| lr 2.09e-04 | 323.51 ms | 52.2% bf16 MFU | 1623645 tok/s step 11974/19560 | loss 3.306229 (-1.60z)| norm 0.2955 (+0.78z)| lr 2.09e-04 | 322.29 ms | 52.4% bf16 MFU | 1623801 tok/s step 11975/19560 | loss 3.341227 (-0.59z)| norm 0.2666 (-0.64z)| lr 2.09e-04 | 322.34 ms | 52.4% bf16 MFU | 1623936 tok/s step 11976/19560 | loss 3.388513 (+0.74z)| norm 0.2797 (+0.00z)| lr 2.09e-04 | 323.32 ms | 52.2% bf16 MFU | 1623817 tok/s step 11977/19560 | loss 3.363074 (+0.03z)| norm 0.3266 (+2.24z)| lr 2.09e-04 | 322.51 ms | 52.3% bf16 MFU | 
1623909 tok/s step 11978/19560 | loss 3.351599 (-0.29z)| norm 0.2833 (+0.14z)| lr 2.09e-04 | 322.45 ms | 52.3% bf16 MFU | 1624012 tok/s step 11979/19560 | loss 3.344551 (-0.49z)| norm 0.2918 (+0.54z)| lr 2.09e-04 | 323.44 ms | 52.2% bf16 MFU | 1623859 tok/s step 11980/19560 | loss 3.343552 (-0.51z)| norm 0.2946 (+0.67z)| lr 2.09e-04 | 322.80 ms | 52.3% bf16 MFU | 1623875 tok/s step 11981/19560 | loss 3.405247 (+1.24z)| norm 0.3100 (+1.40z)| lr 2.09e-04 | 322.78 ms | 52.3% bf16 MFU | 1623896 tok/s step 11982/19560 | loss 3.359020 (-0.07z)| norm 0.2953 (+0.67z)| lr 2.09e-04 | 322.51 ms | 52.3% bf16 MFU | 1623983 tok/s step 11983/19560 | loss 3.278760 (-2.30z)| norm 0.3961 (+4.97z)| lr 2.09e-04 | 322.82 ms | 52.3% bf16 MFU | 1623988 tok/s step 11984/19560 | loss 3.362751 (+0.07z)| norm 0.2996 (+0.74z)| lr 2.09e-04 | 322.69 ms | 52.3% bf16 MFU | 1624026 tok/s step 11985/19560 | loss 3.332687 (-0.77z)| norm 0.3121 (+1.26z)| lr 2.09e-04 | 322.23 ms | 52.4% bf16 MFU | 1624178 tok/s step 11986/19560 | loss 3.362145 (+0.05z)| norm 0.3049 (+0.93z)| lr 2.09e-04 | 323.18 ms | 52.2% bf16 MFU | 1624083 tok/s step 11987/19560 | loss 3.276647 (-2.30z)| norm 0.3104 (+1.16z)| lr 2.09e-04 | 322.66 ms | 52.3% bf16 MFU | 1624124 tok/s step 11988/19560 | loss 3.333833 (-0.70z)| norm 0.2965 (+0.54z)| lr 2.09e-04 | 322.39 ms | 52.3% bf16 MFU | 1624230 tok/s step 11989/19560 | loss 3.453955 (+2.56z)| norm 0.3196 (+1.53z)| lr 2.09e-04 | 323.00 ms | 52.3% bf16 MFU | 1624177 tok/s step 11990/19560 | loss 3.330733 (-0.79z)| norm 0.3093 (+1.07z)| lr 2.09e-04 | 322.56 ms | 52.3% bf16 MFU | 1624238 tok/s step 11991/19560 | loss 3.356019 (-0.08z)| norm 0.3144 (+1.29z)| lr 2.09e-04 | 323.18 ms | 52.2% bf16 MFU | 1624139 tok/s step 11992/19560 | loss 3.344005 (-0.42z)| norm 0.3306 (+1.96z)| lr 2.08e-04 | 322.52 ms | 52.3% bf16 MFU | 1624212 tok/s step 11993/19560 | loss 3.399895 (+1.17z)| norm 0.3108 (+1.11z)| lr 2.08e-04 | 322.54 ms | 52.3% bf16 MFU | 1624275 tok/s step 11994/19560 | loss 3.375541 
(+0.49z)| norm 0.3086 (+1.00z)| lr 2.08e-04 | 322.74 ms | 52.3% bf16 MFU | 1624287 tok/s step 11995/19560 | loss 3.409511 (+1.47z)| norm 0.2889 (+0.16z)| lr 2.08e-04 | 322.81 ms | 52.3% bf16 MFU | 1624280 tok/s step 11996/19560 | loss 3.397898 (+1.13z)| norm 0.2863 (+0.06z)| lr 2.08e-04 | 323.28 ms | 52.2% bf16 MFU | 1624155 tok/s step 11997/19560 | loss 3.388883 (+0.85z)| norm 0.2786 (-0.28z)| lr 2.08e-04 | 322.43 ms | 52.3% bf16 MFU | 1624249 tok/s step 11998/19560 | loss 3.382978 (+0.67z)| norm 0.2798 (-0.22z)| lr 2.08e-04 | 322.71 ms | 52.3% bf16 MFU | 1624268 tok/s step 11999/19560 | loss 3.462801 (+2.87z)| norm 0.2803 (-0.20z)| lr 2.08e-04 | 322.31 ms | 52.4% bf16 MFU | 1624388 tok/s step 12000/19560 | loss 3.385080 (+0.68z)| norm 0.2853 (+0.02z)| lr 2.08e-04 | 322.67 ms | 52.3% bf16 MFU | 1624410 tok/s val loss 3.353923 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating HellaSwag: 70/79 HellaSwag: 2974/10042 = 0.296156 step 12001/19560 | loss 3.334459 (-0.73z)| norm 0.2804 (-0.19z)| lr 2.08e-04 | 322.05 ms | 52.4% bf16 MFU | 1624590 tok/s step 12002/19560 | loss 3.407122 (+1.28z)| norm 0.2702 (-0.63z)| lr 2.08e-04 | 321.82 ms | 52.4% bf16 MFU | 1624816 tok/s step 12003/19560 | loss 3.417277 (+1.54z)| norm 0.3147 (+1.30z)| lr 2.08e-04 | 323.03 ms | 52.2% bf16 MFU | 1624727 tok/s step 12004/19560 | loss 3.421587 (+1.62z)| norm 0.2789 (-0.26z)| lr 2.08e-04 | 322.45 ms | 52.3% bf16 MFU | 1624788 tok/s step 12005/19560 | loss 3.313620 (-1.30z)| norm 0.2795 (-0.24z)| lr 2.08e-04 | 323.18 ms | 52.2% bf16 MFU | 1624663 tok/s step 12006/19560 | loss 3.390612 (+0.79z)| norm 0.2813 (-0.16z)| lr 2.08e-04 | 322.48 ms | 52.3% bf16 MFU | 1624720 tok/s step 12007/19560 | loss 3.358922 (-0.07z)| norm 0.2615 (-1.01z)| lr 2.08e-04 | 322.80 ms | 52.3% bf16 MFU | 1624694 tok/s step 12008/19560 | loss 3.415715 (+1.45z)| norm 
0.2724 (-0.54z)| lr 2.08e-04 | 322.51 ms | 52.3% bf16 MFU | 1624741 tok/s step 12009/19560 | loss 3.344281 (-0.47z)| norm 0.2936 (+0.38z)| lr 2.08e-04 | 322.50 ms | 52.3% bf16 MFU | 1624790 tok/s step 12010/19560 | loss 3.415092 (+1.41z)| norm 0.2696 (-0.66z)| lr 2.08e-04 | 322.42 ms | 52.3% bf16 MFU | 1624855 tok/s step 12011/19560 | loss 3.317549 (-1.17z)| norm 0.2721 (-0.56z)| lr 2.08e-04 | 322.97 ms | 52.3% bf16 MFU | 1624779 tok/s step 12012/19560 | loss 3.432914 (+1.85z)| norm 0.2696 (-0.67z)| lr 2.08e-04 | 322.65 ms | 52.3% bf16 MFU | 1624786 tok/s step 12013/19560 | loss 3.313637 (-1.26z)| norm 0.2705 (-0.63z)| lr 2.07e-04 | 322.64 ms | 52.3% bf16 MFU | 1624796 tok/s step 12014/19560 | loss 3.346080 (-0.41z)| norm 0.2896 (+0.20z)| lr 2.07e-04 | 323.07 ms | 52.2% bf16 MFU | 1624697 tok/s step 12015/19560 | loss 3.355511 (-0.17z)| norm 0.2970 (+0.52z)| lr 2.07e-04 | 322.68 ms | 52.3% bf16 MFU | 1624700 tok/s step 12016/19560 | loss 3.325326 (-0.95z)| norm 0.2907 (+0.24z)| lr 2.07e-04 | 322.97 ms | 52.3% bf16 MFU | 1624633 tok/s step 12017/19560 | loss 3.392092 (+0.79z)| norm 0.2858 (+0.03z)| lr 2.07e-04 | 322.57 ms | 52.3% bf16 MFU | 1624669 tok/s step 12018/19560 | loss 3.293463 (-1.77z)| norm 0.3177 (+1.40z)| lr 2.07e-04 | 322.95 ms | 52.3% bf16 MFU | 1624607 tok/s step 12019/19560 | loss 3.330750 (-0.79z)| norm 0.3126 (+1.17z)| lr 2.07e-04 | 322.57 ms | 52.3% bf16 MFU | 1624644 tok/s step 12020/19560 | loss 3.398470 (+0.95z)| norm 0.3065 (+0.89z)| lr 2.07e-04 | 322.31 ms | 52.4% bf16 MFU | 1624744 tok/s step 12021/19560 | loss 3.317990 (-1.11z)| norm 0.3261 (+1.70z)| lr 2.07e-04 | 323.15 ms | 52.2% bf16 MFU | 1624628 tok/s step 12022/19560 | loss 3.335978 (-0.64z)| norm 0.2822 (-0.16z)| lr 2.07e-04 | 322.15 ms | 52.4% bf16 MFU | 1624769 tok/s step 12023/19560 | loss 3.518086 (+3.78z)| norm 0.3039 (+0.76z)| lr 2.07e-04 | 322.95 ms | 52.3% bf16 MFU | 1624703 tok/s step 12024/19560 | loss 3.308812 (-1.29z)| norm 0.2707 (-0.65z)| lr 2.07e-04 | 323.11 ms | 
52.2% bf16 MFU | 1624598 tok/s step 12025/19560 | loss 3.517297 (+3.53z)| norm 0.3004 (+0.60z)| lr 2.07e-04 | 322.67 ms | 52.3% bf16 MFU | 1624610 tok/s step 12026/19560 | loss 3.478903 (+2.58z)| norm 0.2877 (+0.06z)| lr 2.07e-04 | 323.27 ms | 52.2% bf16 MFU | 1624472 tok/s step 12027/19560 | loss 3.370632 (+0.15z)| norm 0.3007 (+0.61z)| lr 2.07e-04 | 322.39 ms | 52.3% bf16 MFU | 1624560 tok/s step 12028/19560 | loss 3.318407 (-1.02z)| norm 0.2970 (+0.46z)| lr 2.07e-04 | 323.32 ms | 52.2% bf16 MFU | 1624412 tok/s step 12029/19560 | loss 3.293004 (-1.57z)| norm 0.2950 (+0.38z)| lr 2.07e-04 | 322.66 ms | 52.3% bf16 MFU | 1624436 tok/s step 12030/19560 | loss 3.314255 (-1.08z)| norm 0.2748 (-0.48z)| lr 2.07e-04 | 322.41 ms | 52.3% bf16 MFU | 1624521 tok/s step 12031/19560 | loss 3.379436 (+0.36z)| norm 0.2751 (-0.46z)| lr 2.07e-04 | 323.15 ms | 52.2% bf16 MFU | 1624417 tok/s step 12032/19560 | loss 3.349136 (-0.31z)| norm 0.2693 (-0.69z)| lr 2.07e-04 | 322.64 ms | 52.3% bf16 MFU | 1624445 tok/s step 12033/19560 | loss 3.397084 (+0.74z)| norm 0.2649 (-0.87z)| lr 2.07e-04 | 323.21 ms | 52.2% bf16 MFU | 1624330 tok/s step 12034/19560 | loss 3.391475 (+0.62z)| norm 0.2709 (-0.60z)| lr 2.06e-04 | 323.29 ms | 52.2% bf16 MFU | 1624200 tok/s step 12035/19560 | loss 3.416673 (+1.17z)| norm 0.2748 (-0.44z)| lr 2.06e-04 | 322.67 ms | 52.3% bf16 MFU | 1624233 tok/s step 12036/19560 | loss 3.377261 (+0.29z)| norm 0.2683 (-0.71z)| lr 2.06e-04 | 322.85 ms | 52.3% bf16 MFU | 1624219 tok/s step 12037/19560 | loss 3.351990 (-0.27z)| norm 0.2709 (-0.59z)| lr 2.06e-04 | 322.89 ms | 52.3% bf16 MFU | 1624196 tok/s step 12038/19560 | loss 3.366657 (+0.07z)| norm 0.2527 (-1.36z)| lr 2.06e-04 | 322.94 ms | 52.3% bf16 MFU | 1624159 tok/s step 12039/19560 | loss 3.359455 (-0.09z)| norm 0.2783 (-0.26z)| lr 2.06e-04 | 323.50 ms | 52.2% bf16 MFU | 1623986 tok/s step 12040/19560 | loss 3.365938 (+0.08z)| norm 0.2456 (-1.66z)| lr 2.06e-04 | 322.65 ms | 52.3% bf16 MFU | 1624033 tok/s step 12041/19560 
| loss 3.441185 (+1.75z)| norm 0.2962 (+0.53z)| lr 2.06e-04 | 323.08 ms | 52.2% bf16 MFU | 1623970 tok/s step 12042/19560 | loss 3.383689 (+0.45z)| norm 0.2595 (-1.05z)| lr 2.06e-04 | 323.10 ms | 52.2% bf16 MFU | 1623907 tok/s step 12043/19560 | loss 3.374076 (+0.23z)| norm 0.2703 (-0.58z)| lr 2.06e-04 | 322.67 ms | 52.3% bf16 MFU | 1623954 tok/s step 12044/19560 | loss 3.440803 (+1.70z)| norm 0.2589 (-1.06z)| lr 2.06e-04 | 322.83 ms | 52.3% bf16 MFU | 1623958 tok/s step 12045/19560 | loss 3.450075 (+1.90z)| norm 0.2791 (-0.17z)| lr 2.06e-04 | 322.95 ms | 52.3% bf16 MFU | 1623931 tok/s step 12046/19560 | loss 3.390596 (+0.56z)| norm 0.2821 (-0.02z)| lr 2.06e-04 | 322.82 ms | 52.3% bf16 MFU | 1623939 tok/s step 12047/19560 | loss 3.388643 (+0.52z)| norm 0.2540 (-1.28z)| lr 2.06e-04 | 322.95 ms | 52.3% bf16 MFU | 1623914 tok/s step 12048/19560 | loss 3.448864 (+1.84z)| norm 0.2655 (-0.75z)| lr 2.06e-04 | 322.93 ms | 52.3% bf16 MFU | 1623894 tok/s step 12049/19560 | loss 3.388791 (+0.50z)| norm 0.2873 (+0.26z)| lr 2.06e-04 | 323.26 ms | 52.2% bf16 MFU | 1623792 tok/s step 12050/19560 | loss 3.391762 (+0.55z)| norm 0.2425 (-1.80z)| lr 2.06e-04 | 323.00 ms | 52.3% bf16 MFU | 1623762 tok/s step 12051/19560 | loss 3.370423 (+0.06z)| norm 0.2777 (-0.15z)| lr 2.06e-04 | 322.44 ms | 52.3% bf16 MFU | 1623873 tok/s step 12052/19560 | loss 3.349772 (-0.40z)| norm 0.2770 (-0.18z)| lr 2.06e-04 | 323.15 ms | 52.2% bf16 MFU | 1623802 tok/s step 12053/19560 | loss 3.319880 (-1.06z)| norm 0.2648 (-0.74z)| lr 2.06e-04 | 322.89 ms | 52.3% bf16 MFU | 1623798 tok/s step 12054/19560 | loss 3.371293 (+0.09z)| norm 0.2746 (-0.27z)| lr 2.06e-04 | 322.54 ms | 52.3% bf16 MFU | 1623884 tok/s step 12055/19560 | loss 3.323990 (-0.96z)| norm 0.2815 (+0.05z)| lr 2.05e-04 | 323.03 ms | 52.2% bf16 MFU | 1623841 tok/s step 12056/19560 | loss 3.323969 (-0.95z)| norm 0.2857 (+0.25z)| lr 2.05e-04 | 322.83 ms | 52.3% bf16 MFU | 1623851 tok/s step 12057/19560 | loss 3.399577 (+0.75z)| norm 0.2824 (+0.08z)| 
lr 2.05e-04 | 322.88 ms | 52.3% bf16 MFU | 1623849 tok/s step 12058/19560 | loss 3.302571 (-1.42z)| norm 0.2898 (+0.43z)| lr 2.05e-04 | 322.82 ms | 52.3% bf16 MFU | 1623860 tok/s step 12059/19560 | loss 3.355513 (-0.24z)| norm 0.2766 (-0.20z)| lr 2.05e-04 | 323.12 ms | 52.2% bf16 MFU | 1623797 tok/s step 12060/19560 | loss 3.374137 (+0.18z)| norm 0.2777 (-0.15z)| lr 2.05e-04 | 322.64 ms | 52.3% bf16 MFU | 1623856 tok/s step 12061/19560 | loss 3.373752 (+0.17z)| norm 0.2688 (-0.58z)| lr 2.05e-04 | 323.17 ms | 52.2% bf16 MFU | 1623779 tok/s step 12062/19560 | loss 3.316663 (-1.11z)| norm 0.2905 (+0.45z)| lr 2.05e-04 | 323.05 ms | 52.2% bf16 MFU | 1623736 tok/s step 12063/19560 | loss 3.373282 (+0.16z)| norm 0.2802 (-0.06z)| lr 2.05e-04 | 323.22 ms | 52.2% bf16 MFU | 1623652 tok/s step 12064/19560 | loss 3.367082 (+0.02z)| norm 0.2826 (+0.05z)| lr 2.05e-04 | 323.67 ms | 52.1% bf16 MFU | 1623460 tok/s step 12065/19560 | loss 3.418952 (+1.17z)| norm 0.2683 (-0.65z)| lr 2.05e-04 | 322.37 ms | 52.4% bf16 MFU | 1623605 tok/s step 12066/19560 | loss 3.387585 (+0.45z)| norm 0.2850 (+0.15z)| lr 2.05e-04 | 322.43 ms | 52.3% bf16 MFU | 1623727 tok/s step 12067/19560 | loss 3.336356 (-0.69z)| norm 0.2587 (-1.14z)| lr 2.05e-04 | 323.60 ms | 52.2% bf16 MFU | 1623550 tok/s step 12068/19560 | loss 3.292544 (-1.64z)| norm 0.2918 (+0.47z)| lr 2.05e-04 | 322.78 ms | 52.3% bf16 MFU | 1623588 tok/s step 12069/19560 | loss 3.343897 (-0.50z)| norm 0.2880 (+0.28z)| lr 2.05e-04 | 323.55 ms | 52.2% bf16 MFU | 1623429 tok/s step 12070/19560 | loss 3.309953 (-1.25z)| norm 0.2586 (-1.16z)| lr 2.05e-04 | 322.61 ms | 52.3% bf16 MFU | 1623515 tok/s step 12071/19560 | loss 3.375491 (+0.20z)| norm 0.2794 (-0.14z)| lr 2.05e-04 | 323.10 ms | 52.2% bf16 MFU | 1623472 tok/s step 12072/19560 | loss 3.370610 (+0.08z)| norm 0.2858 (+0.17z)| lr 2.05e-04 | 322.87 ms | 52.3% bf16 MFU | 1623491 tok/s step 12073/19560 | loss 3.415926 (+1.08z)| norm 0.2613 (-1.03z)| lr 2.05e-04 | 322.48 ms | 52.3% bf16 MFU | 
1623607 tok/s step 12074/19560 | loss 3.376726 (+0.20z)| norm 0.3048 (+1.09z)| lr 2.05e-04 | 323.41 ms | 52.2% bf16 MFU | 1623484 tok/s step 12075/19560 | loss 3.418490 (+1.13z)| norm 0.2851 (+0.13z)| lr 2.05e-04 | 322.34 ms | 52.4% bf16 MFU | 1623634 tok/s step 12076/19560 | loss 3.426930 (+1.30z)| norm 0.2746 (-0.37z)| lr 2.04e-04 | 322.55 ms | 52.3% bf16 MFU | 1623724 tok/s step 12077/19560 | loss 3.396002 (+0.62z)| norm 0.2863 (+0.20z)| lr 2.04e-04 | 323.58 ms | 52.2% bf16 MFU | 1623552 tok/s step 12078/19560 | loss 3.307772 (-1.32z)| norm 0.2720 (-0.50z)| lr 2.04e-04 | 322.67 ms | 52.3% bf16 MFU | 1623616 tok/s step 12079/19560 | loss 3.381027 (+0.29z)| norm 0.2689 (-0.65z)| lr 2.04e-04 | 323.60 ms | 52.2% bf16 MFU | 1623444 tok/s step 12080/19560 | loss 3.351159 (-0.36z)| norm 0.2862 (+0.19z)| lr 2.04e-04 | 323.15 ms | 52.2% bf16 MFU | 1623392 tok/s step 12081/19560 | loss 3.371932 (+0.10z)| norm 0.2715 (-0.52z)| lr 2.04e-04 | 322.89 ms | 52.3% bf16 MFU | 1623409 tok/s step 12082/19560 | loss 3.321880 (-1.01z)| norm 0.2757 (-0.32z)| lr 2.04e-04 | 322.87 ms | 52.3% bf16 MFU | 1623432 tok/s step 12083/19560 | loss 3.300477 (-1.46z)| norm 0.2792 (-0.15z)| lr 2.04e-04 | 323.74 ms | 52.1% bf16 MFU | 1623233 tok/s step 12084/19560 | loss 3.303330 (-1.38z)| norm 0.2738 (-0.42z)| lr 2.04e-04 | 322.97 ms | 52.3% bf16 MFU | 1623238 tok/s step 12085/19560 | loss 3.373786 (+0.16z)| norm 0.2778 (-0.23z)| lr 2.04e-04 | 323.04 ms | 52.2% bf16 MFU | 1623224 tok/s step 12086/19560 | loss 3.340338 (-0.57z)| norm 0.2615 (-1.03z)| lr 2.04e-04 | 323.30 ms | 52.2% bf16 MFU | 1623148 tok/s step 12087/19560 | loss 3.339432 (-0.58z)| norm 0.2828 (+0.01z)| lr 2.04e-04 | 323.62 ms | 52.2% bf16 MFU | 1622993 tok/s step 12088/19560 | loss 3.300784 (-1.41z)| norm 0.2612 (-1.05z)| lr 2.04e-04 | 322.03 ms | 52.4% bf16 MFU | 1623248 tok/s step 12089/19560 | loss 3.373118 (+0.17z)| norm 0.2551 (-1.35z)| lr 2.04e-04 | 323.72 ms | 52.1% bf16 MFU | 1623064 tok/s step 12090/19560 | loss 3.409425 
(+0.95z)| norm 0.2783 (-0.21z)| lr 2.04e-04 | 323.91 ms | 52.1% bf16 MFU | 1622842 tok/s step 12091/19560 | loss 3.348674 (-0.37z)| norm 0.2666 (-0.80z)| lr 2.04e-04 | 322.91 ms | 52.3% bf16 MFU | 1622881 tok/s step 12092/19560 | loss 3.409345 (+0.94z)| norm 0.2922 (+0.46z)| lr 2.04e-04 | 323.89 ms | 52.1% bf16 MFU | 1622672 tok/s step 12093/19560 | loss 3.355405 (-0.23z)| norm 0.2704 (-0.64z)| lr 2.04e-04 | 322.83 ms | 52.3% bf16 MFU | 1622741 tok/s step 12094/19560 | loss 3.367908 (+0.04z)| norm 0.2676 (-0.78z)| lr 2.04e-04 | 322.22 ms | 52.4% bf16 MFU | 1622960 tok/s step 12095/19560 | loss 3.323087 (-0.92z)| norm 0.2544 (-1.45z)| lr 2.04e-04 | 323.17 ms | 52.2% bf16 MFU | 1622928 tok/s step 12096/19560 | loss 3.384596 (+0.41z)| norm 0.2702 (-0.65z)| lr 2.04e-04 | 322.44 ms | 52.3% bf16 MFU | 1623081 tok/s step 12097/19560 | loss 3.341923 (-0.52z)| norm 0.2605 (-1.13z)| lr 2.04e-04 | 322.72 ms | 52.3% bf16 MFU | 1623156 tok/s step 12098/19560 | loss 3.339465 (-0.58z)| norm 0.2894 (+0.31z)| lr 2.03e-04 | 322.82 ms | 52.3% bf16 MFU | 1623203 tok/s step 12099/19560 | loss 3.328698 (-0.81z)| norm 0.2628 (-1.01z)| lr 2.03e-04 | 322.68 ms | 52.3% bf16 MFU | 1623281 tok/s step 12100/19560 | loss 3.353439 (-0.29z)| norm 0.2841 (+0.05z)| lr 2.03e-04 | 322.13 ms | 52.4% bf16 MFU | 1623496 tok/s step 12101/19560 | loss 3.410899 (+0.96z)| norm 0.2623 (-1.04z)| lr 2.03e-04 | 323.45 ms | 52.2% bf16 MFU | 1623368 tok/s step 12102/19560 | loss 3.344084 (-0.52z)| norm 0.2693 (-0.68z)| lr 2.03e-04 | 323.42 ms | 52.2% bf16 MFU | 1623253 tok/s step 12103/19560 | loss 3.353050 (-0.33z)| norm 0.2927 (+0.48z)| lr 2.03e-04 | 322.80 ms | 52.3% bf16 MFU | 1623300 tok/s step 12104/19560 | loss 3.286738 (-1.76z)| norm 0.2889 (+0.29z)| lr 2.03e-04 | 322.97 ms | 52.3% bf16 MFU | 1623303 tok/s step 12105/19560 | loss 3.360147 (-0.15z)| norm 0.3033 (+1.03z)| lr 2.03e-04 | 322.63 ms | 52.3% bf16 MFU | 1623390 tok/s step 12106/19560 | loss 3.397216 (+0.66z)| norm 0.2894 (+0.32z)| lr 2.03e-04 | 
322.90 ms | 52.3% bf16 MFU | 1623404 tok/s step 12107/19560 | loss 3.333190 (-0.74z)| norm 0.2852 (+0.11z)| lr 2.03e-04 | 323.11 ms | 52.2% bf16 MFU | 1623364 tok/s step 12108/19560 | loss 3.362122 (-0.11z)| norm 0.2981 (+0.77z)| lr 2.03e-04 | 322.81 ms | 52.3% bf16 MFU | 1623401 tok/s step 12109/19560 | loss 3.382690 (+0.34z)| norm 0.3075 (+1.25z)| lr 2.03e-04 | 322.92 ms | 52.3% bf16 MFU | 1623411 tok/s step 12110/19560 | loss 3.358845 (-0.18z)| norm 0.2867 (+0.19z)| lr 2.03e-04 | 323.17 ms | 52.2% bf16 MFU | 1623357 tok/s step 12111/19560 | loss 3.386797 (+0.42z)| norm 0.2950 (+0.76z)| lr 2.03e-04 | 322.82 ms | 52.3% bf16 MFU | 1623394 tok/s step 12112/19560 | loss 3.334975 (-0.73z)| norm 0.2948 (+0.75z)| lr 2.03e-04 | 322.67 ms | 52.3% bf16 MFU | 1623465 tok/s step 12113/19560 | loss 3.357993 (-0.22z)| norm 0.3029 (+1.24z)| lr 2.03e-04 | 322.52 ms | 52.3% bf16 MFU | 1623572 tok/s step 12114/19560 | loss 3.327696 (-0.89z)| norm 0.2754 (-0.38z)| lr 2.03e-04 | 323.43 ms | 52.2% bf16 MFU | 1623446 tok/s step 12115/19560 | loss 3.401780 (+0.75z)| norm 0.2748 (-0.41z)| lr 2.03e-04 | 322.81 ms | 52.3% bf16 MFU | 1623482 tok/s step 12116/19560 | loss 3.324247 (-1.00z)| norm 0.2824 (+0.06z)| lr 2.03e-04 | 323.05 ms | 52.2% bf16 MFU | 1623453 tok/s step 12117/19560 | loss 3.364136 (-0.09z)| norm 0.2720 (-0.56z)| lr 2.03e-04 | 322.66 ms | 52.3% bf16 MFU | 1623525 tok/s step 12118/19560 | loss 3.383195 (+0.34z)| norm 0.3362 (+3.30z)| lr 2.03e-04 | 322.84 ms | 52.3% bf16 MFU | 1623549 tok/s step 12119/19560 | loss 3.377796 (+0.21z)| norm 0.3107 (+1.78z)| lr 2.02e-04 | 322.91 ms | 52.3% bf16 MFU | 1623553 tok/s step 12120/19560 | loss 3.400772 (+0.73z)| norm 0.2956 (+0.92z)| lr 2.02e-04 | 322.61 ms | 52.3% bf16 MFU | 1623633 tok/s step 12121/19560 | loss 3.393004 (+0.55z)| norm 0.3154 (+2.14z)| lr 2.02e-04 | 322.74 ms | 52.3% bf16 MFU | 1623677 tok/s step 12122/19560 | loss 3.321167 (-1.08z)| norm 0.3085 (+1.71z)| lr 2.02e-04 | 323.09 ms | 52.2% bf16 MFU | 1623628 tok/s step 
12123/19560 | loss 3.391785 (+0.54z)| norm 0.3106 (+1.81z)| lr 2.02e-04 | 322.59 ms | 52.3% bf16 MFU | 1623710 tok/s step 12124/19560 | loss 3.455712 (+1.96z)| norm 0.2927 (+0.70z)| lr 2.02e-04 | 322.66 ms | 52.3% bf16 MFU | 1623769 tok/s step 12125/19560 | loss 3.358261 (-0.23z)| norm 0.2909 (+0.59z)| lr 2.02e-04 | 322.59 ms | 52.3% bf16 MFU | 1623842 tok/s step 12126/19560 | loss 3.353777 (-0.33z)| norm 0.2876 (+0.38z)| lr 2.02e-04 | 323.09 ms | 52.2% bf16 MFU | 1623787 tok/s step 12127/19560 | loss 3.317370 (-1.14z)| norm 0.2780 (-0.20z)| lr 2.02e-04 | 322.69 ms | 52.3% bf16 MFU | 1623835 tok/s step 12128/19560 | loss 3.379774 (+0.29z)| norm 0.3034 (+1.33z)| lr 2.02e-04 | 322.19 ms | 52.4% bf16 MFU | 1624005 tok/s step 12129/19560 | loss 3.374469 (+0.16z)| norm 0.2625 (-1.14z)| lr 2.02e-04 | 323.29 ms | 52.2% bf16 MFU | 1623890 tok/s step 12130/19560 | loss 3.387407 (+0.46z)| norm 0.2961 (+0.88z)| lr 2.02e-04 | 322.55 ms | 52.3% bf16 MFU | 1623969 tok/s step 12131/19560 | loss 3.353549 (-0.31z)| norm 0.2654 (-0.96z)| lr 2.02e-04 | 322.73 ms | 52.3% bf16 MFU | 1623998 tok/s step 12132/19560 | loss 3.346644 (-0.45z)| norm 0.3035 (+1.35z)| lr 2.02e-04 | 322.49 ms | 52.3% bf16 MFU | 1624086 tok/s step 12133/19560 | loss 3.318495 (-1.11z)| norm 0.2963 (+0.89z)| lr 2.02e-04 | 322.84 ms | 52.3% bf16 MFU | 1624081 tok/s step 12134/19560 | loss 3.414367 (+1.11z)| norm 0.2792 (-0.14z)| lr 2.02e-04 | 322.49 ms | 52.3% bf16 MFU | 1624163 tok/s step 12135/19560 | loss 3.369307 (+0.07z)| norm 0.2882 (+0.39z)| lr 2.02e-04 | 322.82 ms | 52.3% bf16 MFU | 1624160 tok/s step 12136/19560 | loss 3.315089 (-1.17z)| norm 0.2866 (+0.29z)| lr 2.02e-04 | 322.77 ms | 52.3% bf16 MFU | 1624170 tok/s step 12137/19560 | loss 3.378379 (+0.29z)| norm 0.3062 (+1.47z)| lr 2.02e-04 | 322.80 ms | 52.3% bf16 MFU | 1624171 tok/s step 12138/19560 | loss 3.347023 (-0.43z)| norm 0.2797 (-0.13z)| lr 2.02e-04 | 322.44 ms | 52.3% bf16 MFU | 1624261 tok/s step 12139/19560 | loss 3.374060 (+0.19z)| norm 
0.2836 (+0.10z)| lr 2.02e-04 | 323.09 ms | 52.2% bf16 MFU | 1624185 tok/s step 12140/19560 | loss 3.376065 (+0.25z)| norm 0.2672 (-0.90z)| lr 2.01e-04 | 322.90 ms | 52.3% bf16 MFU | 1624159 tok/s step 12141/19560 | loss 3.394755 (+0.68z)| norm 0.2820 (-0.01z)| lr 2.01e-04 | 322.35 ms | 52.4% bf16 MFU | 1624273 tok/s step 12142/19560 | loss 3.361763 (-0.10z)| norm 0.3068 (+1.48z)| lr 2.01e-04 | 322.45 ms | 52.3% bf16 MFU | 1624357 tok/s step 12143/19560 | loss 3.308603 (-1.35z)| norm 0.2924 (+0.62z)| lr 2.01e-04 | 322.51 ms | 52.3% bf16 MFU | 1624422 tok/s step 12144/19560 | loss 3.338401 (-0.65z)| norm 0.2901 (+0.48z)| lr 2.01e-04 | 322.58 ms | 52.3% bf16 MFU | 1624466 tok/s step 12145/19560 | loss 3.355142 (-0.25z)| norm 0.2644 (-1.06z)| lr 2.01e-04 | 322.72 ms | 52.3% bf16 MFU | 1624473 tok/s step 12146/19560 | loss 3.310272 (-1.32z)| norm 0.2774 (-0.26z)| lr 2.01e-04 | 322.44 ms | 52.3% bf16 MFU | 1624548 tok/s step 12147/19560 | loss 3.389016 (+0.54z)| norm 0.2717 (-0.60z)| lr 2.01e-04 | 322.76 ms | 52.3% bf16 MFU | 1624539 tok/s step 12148/19560 | loss 3.373145 (+0.17z)| norm 0.3027 (+1.32z)| lr 2.01e-04 | 322.84 ms | 52.3% bf16 MFU | 1624512 tok/s step 12149/19560 | loss 3.311889 (-1.29z)| norm 0.2677 (-0.84z)| lr 2.01e-04 | 322.82 ms | 52.3% bf16 MFU | 1624490 tok/s step 12150/19560 | loss 3.312374 (-1.27z)| norm 0.2611 (-1.25z)| lr 2.01e-04 | 322.65 ms | 52.3% bf16 MFU | 1624512 tok/s step 12151/19560 | loss 3.284690 (-1.95z)| norm 0.2656 (-0.95z)| lr 2.01e-04 | 323.36 ms | 52.2% bf16 MFU | 1624357 tok/s step 12152/19560 | loss 3.295732 (-1.67z)| norm 0.2658 (-0.93z)| lr 2.01e-04 | 322.83 ms | 52.3% bf16 MFU | 1624340 tok/s step 12153/19560 | loss 3.357028 (-0.15z)| norm 0.2654 (-0.94z)| lr 2.01e-04 | 322.36 ms | 52.4% bf16 MFU | 1624442 tok/s step 12154/19560 | loss 3.342314 (-0.52z)| norm 0.2837 (+0.23z)| lr 2.01e-04 | 322.98 ms | 52.3% bf16 MFU | 1624384 tok/s step 12155/19560 | loss 3.377710 (+0.43z)| norm 0.2617 (-1.16z)| lr 2.01e-04 | 323.05 ms | 
52.2% bf16 MFU | 1624310 tok/s step 12156/19560 | loss 3.406937 (+1.20z)| norm 0.3075 (+1.76z)| lr 2.01e-04 | 322.46 ms | 52.3% bf16 MFU | 1624391 tok/s step 12157/19560 | loss 3.358100 (-0.13z)| norm 0.2582 (-1.36z)| lr 2.01e-04 | 323.17 ms | 52.2% bf16 MFU | 1624286 tok/s step 12158/19560 | loss 3.327735 (-0.97z)| norm 0.2884 (+0.55z)| lr 2.01e-04 | 322.72 ms | 52.3% bf16 MFU | 1624302 tok/s step 12159/19560 | loss 3.319539 (-1.17z)| norm 0.2773 (-0.16z)| lr 2.01e-04 | 322.86 ms | 52.3% bf16 MFU | 1624281 tok/s step 12160/19560 | loss 3.354390 (-0.22z)| norm 0.3046 (+1.55z)| lr 2.01e-04 | 322.59 ms | 52.3% bf16 MFU | 1624330 tok/s step 12161/19560 | loss 3.343033 (-0.52z)| norm 0.2636 (-1.03z)| lr 2.00e-04 | 322.89 ms | 52.3% bf16 MFU | 1624299 tok/s step 12162/19560 | loss 3.375848 (+0.38z)| norm 0.2927 (+0.79z)| lr 2.00e-04 | 322.95 ms | 52.3% bf16 MFU | 1624256 tok/s step 12163/19560 | loss 3.290775 (-1.92z)| norm 0.2649 (-0.95z)| lr 2.00e-04 | 323.10 ms | 52.2% bf16 MFU | 1624178 tok/s step 12164/19560 | loss 3.358599 (-0.06z)| norm 0.2686 (-0.72z)| lr 2.00e-04 | 322.67 ms | 52.3% bf16 MFU | 1624212 tok/s step 12165/19560 | loss 3.334640 (-0.71z)| norm 0.2688 (-0.70z)| lr 2.00e-04 | 322.61 ms | 52.3% bf16 MFU | 1624258 tok/s step 12166/19560 | loss 3.383734 (+0.62z)| norm 0.2958 (+0.97z)| lr 2.00e-04 | 322.90 ms | 52.3% bf16 MFU | 1624228 tok/s step 12167/19560 | loss 3.400257 (+1.06z)| norm 0.2545 (-1.60z)| lr 2.00e-04 | 322.74 ms | 52.3% bf16 MFU | 1624242 tok/s step 12168/19560 | loss 3.367468 (+0.17z)| norm 0.2935 (+0.81z)| lr 2.00e-04 | 322.85 ms | 52.3% bf16 MFU | 1624226 tok/s step 12169/19560 | loss 3.376684 (+0.44z)| norm 0.2647 (-0.99z)| lr 2.00e-04 | 323.19 ms | 52.2% bf16 MFU | 1624127 tok/s step 12170/19560 | loss 3.406393 (+1.25z)| norm 0.2804 (-0.00z)| lr 2.00e-04 | 322.57 ms | 52.3% bf16 MFU | 1624187 tok/s step 12171/19560 | loss 3.383149 (+0.61z)| norm 0.3087 (+1.76z)| lr 2.00e-04 | 322.93 ms | 52.3% bf16 MFU | 1624154 tok/s step 12172/19560 
| loss 3.357607 (-0.07z)| norm 0.2735 (-0.47z)| lr 2.00e-04 | 322.17 ms | 52.4% bf16 MFU | 1624314 tok/s step 12173/19560 | loss 3.374005 (+0.41z)| norm 0.3106 (+1.84z)| lr 2.00e-04 | 322.77 ms | 52.3% bf16 MFU | 1624316 tok/s step 12174/19560 | loss 3.300607 (-1.66z)| norm 0.2789 (-0.14z)| lr 2.00e-04 | 322.24 ms | 52.4% bf16 MFU | 1624450 tok/s step 12175/19560 | loss 3.306684 (-1.46z)| norm 0.2826 (+0.08z)| lr 2.00e-04 | 322.86 ms | 52.3% bf16 MFU | 1624421 tok/s step 12176/19560 | loss 3.416264 (+1.67z)| norm 0.2749 (-0.42z)| lr 2.00e-04 | 322.94 ms | 52.3% bf16 MFU | 1624374 tok/s step 12177/19560 | loss 3.329770 (-0.80z)| norm 0.3017 (+1.27z)| lr 2.00e-04 | 322.53 ms | 52.3% bf16 MFU | 1624434 tok/s step 12178/19560 | loss 3.307237 (-1.42z)| norm 0.2721 (-0.63z)| lr 2.00e-04 | 323.31 ms | 52.2% bf16 MFU | 1624295 tok/s step 12179/19560 | loss 3.396601 (+1.13z)| norm 0.2778 (-0.26z)| lr 2.00e-04 | 322.70 ms | 52.3% bf16 MFU | 1624316 tok/s step 12180/19560 | loss 3.317616 (-1.11z)| norm 0.2699 (-0.77z)| lr 2.00e-04 | 322.67 ms | 52.3% bf16 MFU | 1624342 tok/s step 12181/19560 | loss 3.348468 (-0.25z)| norm 0.2916 (+0.62z)| lr 2.00e-04 | 322.83 ms | 52.3% bf16 MFU | 1624328 tok/s step 12182/19560 | loss 3.450931 (+2.59z)| norm 0.2866 (+0.29z)| lr 1.99e-04 | 322.14 ms | 52.4% bf16 MFU | 1624487 tok/s step 12183/19560 | loss 3.311615 (-1.27z)| norm 0.2936 (+0.74z)| lr 1.99e-04 | 322.68 ms | 52.3% bf16 MFU | 1624503 tok/s step 12184/19560 | loss 3.388596 (+0.84z)| norm 0.2946 (+0.80z)| lr 1.99e-04 | 322.32 ms | 52.4% bf16 MFU | 1624608 tok/s step 12185/19560 | loss 3.327501 (-0.83z)| norm 0.2689 (-0.85z)| lr 1.99e-04 | 322.70 ms | 52.3% bf16 MFU | 1624613 tok/s step 12186/19560 | loss 3.348510 (-0.26z)| norm 0.2729 (-0.58z)| lr 1.99e-04 | 322.82 ms | 52.3% bf16 MFU | 1624586 tok/s step 12187/19560 | loss 3.343785 (-0.39z)| norm 0.2723 (-0.62z)| lr 1.99e-04 | 322.25 ms | 52.4% bf16 MFU | 1624705 tok/s step 12188/19560 | loss 3.335399 (-0.62z)| norm 0.2944 (+0.79z)| 
lr 1.99e-04 | 322.87 ms | 52.3% bf16 MFU | 1624662 tok/s step 12189/19560 | loss 3.375156 (+0.49z)| norm 0.2852 (+0.19z)| lr 1.99e-04 | 322.55 ms | 52.3% bf16 MFU | 1624702 tok/s step 12190/19560 | loss 3.311560 (-1.28z)| norm 0.2859 (+0.24z)| lr 1.99e-04 | 322.51 ms | 52.3% bf16 MFU | 1624749 tok/s step 12191/19560 | loss 3.440938 (+2.27z)| norm 0.2696 (-0.80z)| lr 1.99e-04 | 322.56 ms | 52.3% bf16 MFU | 1624782 tok/s step 12192/19560 | loss 3.378308 (+0.55z)| norm 0.2757 (-0.41z)| lr 1.99e-04 | 322.79 ms | 52.3% bf16 MFU | 1624755 tok/s step 12193/19560 | loss 3.310572 (-1.28z)| norm 0.2869 (+0.31z)| lr 1.99e-04 | 322.67 ms | 52.3% bf16 MFU | 1624760 tok/s step 12194/19560 | loss 3.364901 (+0.22z)| norm 0.2640 (-1.15z)| lr 1.99e-04 | 324.54 ms | 52.0% bf16 MFU | 1624295 tok/s step 12195/19560 | loss 3.396215 (+1.06z)| norm 0.2979 (+1.00z)| lr 1.99e-04 | 321.90 ms | 52.4% bf16 MFU | 1624518 tok/s step 12196/19560 | loss 3.324260 (-0.93z)| norm 0.2716 (-0.68z)| lr 1.99e-04 | 322.97 ms | 52.3% bf16 MFU | 1624459 tok/s step 12197/19560 | loss 3.350285 (-0.21z)| norm 0.2784 (-0.23z)| lr 1.99e-04 | 322.46 ms | 52.3% bf16 MFU | 1624532 tok/s step 12198/19560 | loss 3.358005 (-0.01z)| norm 0.2851 (+0.18z)| lr 1.99e-04 | 322.46 ms | 52.3% bf16 MFU | 1624600 tok/s step 12199/19560 | loss 3.276259 (-2.22z)| norm 0.2724 (-0.64z)| lr 1.99e-04 | 322.82 ms | 52.3% bf16 MFU | 1624574 tok/s step 12200/19560 | loss 3.397375 (+1.08z)| norm 0.2910 (+0.56z)| lr 1.99e-04 | 322.52 ms | 52.3% bf16 MFU | 1624625 tok/s step 12201/19560 | loss 3.333959 (-0.63z)| norm 0.2703 (-0.79z)| lr 1.99e-04 | 322.87 ms | 52.3% bf16 MFU | 1624586 tok/s step 12202/19560 | loss 3.360454 (+0.10z)| norm 0.2777 (-0.29z)| lr 1.99e-04 | 322.23 ms | 52.4% bf16 MFU | 1624711 tok/s step 12203/19560 | loss 3.365143 (+0.24z)| norm 0.2619 (-1.30z)| lr 1.99e-04 | 322.22 ms | 52.4% bf16 MFU | 1624832 tok/s step 12204/19560 | loss 3.306797 (-1.37z)| norm 0.2877 (+0.37z)| lr 1.98e-04 | 322.60 ms | 52.3% bf16 MFU | 
1624849 tok/s step 12205/19560 | loss 3.353886 (-0.04z)| norm 0.2743 (-0.50z)| lr 1.98e-04 | 322.55 ms | 52.3% bf16 MFU | 1624879 tok/s step 12206/19560 | loss 3.402159 (+1.30z)| norm 0.2771 (-0.32z)| lr 1.98e-04 | 322.53 ms | 52.3% bf16 MFU | 1624913 tok/s step 12207/19560 | loss 3.361086 (+0.15z)| norm 0.2724 (-0.63z)| lr 1.98e-04 | 322.78 ms | 52.3% bf16 MFU | 1624881 tok/s step 12208/19560 | loss 3.308746 (-1.31z)| norm 0.2774 (-0.30z)| lr 1.98e-04 | 322.56 ms | 52.3% bf16 MFU | 1624908 tok/s step 12209/19560 | loss 3.421848 (+1.83z)| norm 0.2751 (-0.45z)| lr 1.98e-04 | 322.32 ms | 52.4% bf16 MFU | 1624992 tok/s step 12210/19560 | loss 3.318627 (-1.03z)| norm 0.2615 (-1.32z)| lr 1.98e-04 | 323.36 ms | 52.2% bf16 MFU | 1624812 tok/s step 12211/19560 | loss 3.406935 (+1.40z)| norm 0.2700 (-0.77z)| lr 1.98e-04 | 323.27 ms | 52.2% bf16 MFU | 1624663 tok/s step 12212/19560 | loss 3.423076 (+1.81z)| norm 0.2763 (-0.36z)| lr 1.98e-04 | 322.52 ms | 52.3% bf16 MFU | 1624709 tok/s step 12213/19560 | loss 3.303588 (-1.47z)| norm 0.2693 (-0.81z)| lr 1.98e-04 | 322.15 ms | 52.4% bf16 MFU | 1624846 tok/s step 12214/19560 | loss 3.425238 (+1.83z)| norm 0.2762 (-0.37z)| lr 1.98e-04 | 322.78 ms | 52.3% bf16 MFU | 1624817 tok/s step 12215/19560 | loss 3.362451 (+0.12z)| norm 0.2638 (-1.16z)| lr 1.98e-04 | 323.43 ms | 52.2% bf16 MFU | 1624628 tok/s step 12216/19560 | loss 3.417209 (+1.58z)| norm 0.2980 (+1.04z)| lr 1.98e-04 | 323.05 ms | 52.2% bf16 MFU | 1624543 tok/s step 12217/19560 | loss 3.445335 (+2.28z)| norm 0.2739 (-0.54z)| lr 1.98e-04 | 322.13 ms | 52.4% bf16 MFU | 1624694 tok/s step 12218/19560 | loss 3.369786 (+0.29z)| norm 0.2651 (-1.11z)| lr 1.98e-04 | 322.56 ms | 52.3% bf16 MFU | 1624728 tok/s step 12219/19560 | loss 3.321808 (-0.99z)| norm 0.2678 (-0.93z)| lr 1.98e-04 | 322.73 ms | 52.3% bf16 MFU | 1624719 tok/s step 12220/19560 | loss 3.361213 (+0.07z)| norm 0.2702 (-0.77z)| lr 1.98e-04 | 322.68 ms | 52.3% bf16 MFU | 1624722 tok/s step 12221/19560 | loss 3.498924 
(+3.55z)| norm 0.2792 (-0.18z)| lr 1.98e-04 | 323.03 ms | 52.2% bf16 MFU | 1624637 tok/s step 12222/19560 | loss 3.330949 (-0.72z)| norm 0.2679 (-0.92z)| lr 1.98e-04 | 322.34 ms | 52.4% bf16 MFU | 1624730 tok/s step 12223/19560 | loss 3.330184 (-0.74z)| norm 0.2599 (-1.45z)| lr 1.98e-04 | 322.59 ms | 52.3% bf16 MFU | 1624757 tok/s step 12224/19560 | loss 3.327383 (-0.80z)| norm 0.2645 (-1.14z)| lr 1.98e-04 | 322.65 ms | 52.3% bf16 MFU | 1624765 tok/s step 12225/19560 | loss 3.319908 (-0.98z)| norm 0.2547 (-1.78z)| lr 1.97e-04 | 323.01 ms | 52.2% bf16 MFU | 1624683 tok/s step 12226/19560 | loss 3.316710 (-1.06z)| norm 0.2673 (-0.94z)| lr 1.97e-04 | 322.38 ms | 52.4% bf16 MFU | 1624765 tok/s step 12227/19560 | loss 3.280988 (-1.93z)| norm 0.2921 (+0.67z)| lr 1.97e-04 | 322.87 ms | 52.3% bf16 MFU | 1624718 tok/s step 12228/19560 | loss 3.351234 (-0.17z)| norm 0.2916 (+0.62z)| lr 1.97e-04 | 322.95 ms | 52.3% bf16 MFU | 1624654 tok/s step 12229/19560 | loss 3.362517 (+0.12z)| norm 0.2791 (-0.20z)| lr 1.97e-04 | 322.84 ms | 52.3% bf16 MFU | 1624619 tok/s step 12230/19560 | loss 3.377821 (+0.50z)| norm 0.2905 (+0.54z)| lr 1.97e-04 | 322.75 ms | 52.3% bf16 MFU | 1624611 tok/s step 12231/19560 | loss 3.346873 (-0.28z)| norm 0.2515 (-1.98z)| lr 1.97e-04 | 323.30 ms | 52.2% bf16 MFU | 1624464 tok/s step 12232/19560 | loss 3.347295 (-0.28z)| norm 0.3074 (+1.62z)| lr 1.97e-04 | 322.51 ms | 52.3% bf16 MFU | 1624524 tok/s step 12233/19560 | loss 3.359926 (+0.04z)| norm 0.2623 (-1.26z)| lr 1.97e-04 | 323.03 ms | 52.2% bf16 MFU | 1624448 tok/s step 12234/19560 | loss 3.362510 (+0.11z)| norm 0.2913 (+0.61z)| lr 1.97e-04 | 322.48 ms | 52.3% bf16 MFU | 1624515 tok/s step 12235/19560 | loss 3.334467 (-0.61z)| norm 0.2656 (-1.03z)| lr 1.97e-04 | 322.68 ms | 52.3% bf16 MFU | 1624530 tok/s step 12236/19560 | loss 3.393133 (+0.88z)| norm 0.2807 (-0.05z)| lr 1.97e-04 | 323.00 ms | 52.3% bf16 MFU | 1624462 tok/s step 12237/19560 | loss 3.418959 (+1.52z)| norm 0.2720 (-0.61z)| lr 1.97e-04 | 
323.25 ms | 52.2% bf16 MFU | 1624334 tok/s step 12238/19560 | loss 3.428695 (+1.73z)| norm 0.2976 (+1.05z)| lr 1.97e-04 | 322.90 ms | 52.3% bf16 MFU | 1624302 tok/s step 12239/19560 | loss 3.401848 (+1.06z)| norm 0.2971 (+1.02z)| lr 1.97e-04 | 323.19 ms | 52.2% bf16 MFU | 1624197 tok/s step 12240/19560 | loss 3.271717 (-2.13z)| norm 0.2936 (+0.79z)| lr 1.97e-04 | 323.06 ms | 52.2% bf16 MFU | 1624132 tok/s step 12241/19560 | loss 3.341808 (-0.41z)| norm 0.3004 (+1.23z)| lr 1.97e-04 | 323.60 ms | 52.2% bf16 MFU | 1623935 tok/s step 12242/19560 | loss 3.323366 (-0.86z)| norm 0.3034 (+1.40z)| lr 1.97e-04 | 322.76 ms | 52.3% bf16 MFU | 1623958 tok/s step 12243/19560 | loss 3.346624 (-0.29z)| norm 0.3006 (+1.20z)| lr 1.97e-04 | 323.03 ms | 52.2% bf16 MFU | 1623913 tok/s step 12244/19560 | loss 3.352183 (-0.16z)| norm 0.3030 (+1.34z)| lr 1.97e-04 | 322.61 ms | 52.3% bf16 MFU | 1623974 tok/s step 12245/19560 | loss 3.359501 (+0.02z)| norm 0.2891 (+0.45z)| lr 1.97e-04 | 323.16 ms | 52.2% bf16 MFU | 1623893 tok/s step 12246/19560 | loss 3.415526 (+1.39z)| norm 0.2794 (-0.15z)| lr 1.96e-04 | 322.93 ms | 52.3% bf16 MFU | 1623875 tok/s step 12247/19560 | loss 3.369259 (+0.26z)| norm 0.3024 (+1.41z)| lr 1.96e-04 | 322.79 ms | 52.3% bf16 MFU | 1623892 tok/s step 12248/19560 | loss 3.324588 (-0.82z)| norm 0.2927 (+0.75z)| lr 1.96e-04 | 322.45 ms | 52.3% bf16 MFU | 1623995 tok/s step 12249/19560 | loss 3.495600 (+3.22z)| norm 0.2880 (+0.46z)| lr 1.96e-04 | 323.25 ms | 52.2% bf16 MFU | 1623893 tok/s step 12250/19560 | loss 3.358180 (-0.02z)| norm 0.2869 (+0.40z)| lr 1.96e-04 | 322.90 ms | 52.3% bf16 MFU | 1623883 tok/s
val loss 3.349317
evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating HellaSwag: 70/79
HellaSwag: 3002/10042 = 0.298944
step 12251/19560 | loss 3.376878 (+0.42z)| norm 0.2930 (+0.85z)| lr 1.96e-04 | 322.77 ms | 52.3%
bf16 MFU | 1623905 tok/s step 12252/19560 | loss 3.376822 (+0.44z)| norm 0.2791 (-0.13z)| lr 1.96e-04 | 323.97 ms | 52.1% bf16 MFU | 1623627 tok/s step 12253/19560 | loss 3.373501 (+0.36z)| norm 0.3132 (+2.23z)| lr 1.96e-04 | 323.15 ms | 52.2% bf16 MFU | 1623566 tok/s step 12254/19560 | loss 3.378597 (+0.48z)| norm 0.2655 (-1.07z)| lr 1.96e-04 | 323.88 ms | 52.1% bf16 MFU | 1623326 tok/s step 12255/19560 | loss 3.357457 (-0.04z)| norm 0.3136 (+2.21z)| lr 1.96e-04 | 322.88 ms | 52.3% bf16 MFU | 1623349 tok/s step 12256/19560 | loss 3.355645 (-0.08z)| norm 0.2656 (-1.04z)| lr 1.96e-04 | 323.60 ms | 52.2% bf16 MFU | 1623191 tok/s step 12257/19560 | loss 3.346144 (-0.30z)| norm 0.2962 (+1.03z)| lr 1.96e-04 | 323.38 ms | 52.2% bf16 MFU | 1623096 tok/s step 12258/19560 | loss 3.381057 (+0.54z)| norm 0.2813 (+0.02z)| lr 1.96e-04 | 322.19 ms | 52.4% bf16 MFU | 1623305 tok/s step 12259/19560 | loss 3.363433 (+0.12z)| norm 0.2624 (-1.28z)| lr 1.96e-04 | 323.52 ms | 52.2% bf16 MFU | 1623168 tok/s step 12260/19560 | loss 3.374812 (+0.39z)| norm 0.2783 (-0.18z)| lr 1.96e-04 | 323.64 ms | 52.1% bf16 MFU | 1623008 tok/s step 12261/19560 | loss 3.310582 (-1.17z)| norm 0.2632 (-1.20z)| lr 1.96e-04 | 322.18 ms | 52.4% bf16 MFU | 1623224 tok/s step 12262/19560 | loss 3.323811 (-0.83z)| norm 0.2708 (-0.67z)| lr 1.96e-04 | 323.86 ms | 52.1% bf16 MFU | 1623006 tok/s step 12263/19560 | loss 3.414795 (+1.36z)| norm 0.2749 (-0.38z)| lr 1.96e-04 | 322.82 ms | 52.3% bf16 MFU | 1623059 tok/s step 12264/19560 | loss 3.382386 (+0.57z)| norm 0.2554 (-1.69z)| lr 1.96e-04 | 322.66 ms | 52.3% bf16 MFU | 1623151 tok/s step 12265/19560 | loss 3.308942 (-1.19z)| norm 0.3061 (+1.78z)| lr 1.96e-04 | 322.92 ms | 52.3% bf16 MFU | 1623173 tok/s step 12266/19560 | loss 3.328909 (-0.71z)| norm 0.2578 (-1.51z)| lr 1.96e-04 | 322.25 ms | 52.4% bf16 MFU | 1623362 tok/s step 12267/19560 | loss 3.356207 (-0.05z)| norm 0.2930 (+0.87z)| lr 1.95e-04 | 322.99 ms | 52.3% bf16 MFU | 1623355 tok/s step 12268/19560 | 
loss 3.434761 (+1.81z)| norm 0.3058 (+1.70z)| lr 1.95e-04 | 322.86 ms | 52.3% bf16 MFU | 1623382 tok/s step 12269/19560 | loss 3.352599 (-0.14z)| norm 0.3036 (+1.53z)| lr 1.95e-04 | 322.34 ms | 52.4% bf16 MFU | 1623537 tok/s step 12270/19560 | loss 3.346701 (-0.27z)| norm 0.2791 (-0.08z)| lr 1.95e-04 | 322.93 ms | 52.3% bf16 MFU | 1623537 tok/s step 12271/19560 | loss 3.365190 (+0.16z)| norm 0.3037 (+1.56z)| lr 1.95e-04 | 323.60 ms | 52.2% bf16 MFU | 1623369 tok/s step 12272/19560 | loss 3.398155 (+0.93z)| norm 0.2809 (+0.04z)| lr 1.95e-04 | 323.37 ms | 52.2% bf16 MFU | 1623267 tok/s step 12273/19560 | loss 3.343040 (-0.38z)| norm 0.3146 (+2.23z)| lr 1.95e-04 | 323.15 ms | 52.2% bf16 MFU | 1623224 tok/s step 12274/19560 | loss 3.395762 (+0.86z)| norm 0.2968 (+1.05z)| lr 1.95e-04 | 322.61 ms | 52.3% bf16 MFU | 1623321 tok/s step 12275/19560 | loss 3.356498 (-0.07z)| norm 0.2980 (+1.11z)| lr 1.95e-04 | 323.31 ms | 52.2% bf16 MFU | 1623237 tok/s step 12276/19560 | loss 3.375750 (+0.39z)| norm 0.2857 (+0.32z)| lr 1.95e-04 | 323.37 ms | 52.2% bf16 MFU | 1623142 tok/s step 12277/19560 | loss 3.355076 (-0.11z)| norm 0.3063 (+1.64z)| lr 1.95e-04 | 323.20 ms | 52.2% bf16 MFU | 1623095 tok/s step 12278/19560 | loss 3.358598 (-0.04z)| norm 0.2952 (+0.90z)| lr 1.95e-04 | 322.79 ms | 52.3% bf16 MFU | 1623152 tok/s step 12279/19560 | loss 3.393050 (+0.79z)| norm 0.2817 (+0.00z)| lr 1.95e-04 | 323.19 ms | 52.2% bf16 MFU | 1623105 tok/s step 12280/19560 | loss 3.332858 (-0.70z)| norm 0.2868 (+0.33z)| lr 1.95e-04 | 323.07 ms | 52.2% bf16 MFU | 1623093 tok/s step 12281/19560 | loss 3.341533 (-0.48z)| norm 0.2993 (+1.14z)| lr 1.95e-04 | 323.04 ms | 52.2% bf16 MFU | 1623087 tok/s step 12282/19560 | loss 3.393568 (+0.79z)| norm 0.2833 (+0.08z)| lr 1.95e-04 | 322.88 ms | 52.3% bf16 MFU | 1623123 tok/s step 12283/19560 | loss 3.342999 (-0.45z)| norm 0.2757 (-0.43z)| lr 1.95e-04 | 323.28 ms | 52.2% bf16 MFU | 1623054 tok/s step 12284/19560 | loss 3.369969 (+0.22z)| norm 0.2803 (-0.11z)| 
lr 1.95e-04 | 322.83 ms | 52.3% bf16 MFU | 1623103 tok/s step 12285/19560 | loss 3.366883 (+0.14z)| norm 0.2826 (+0.03z)| lr 1.95e-04 | 323.21 ms | 52.2% bf16 MFU | 1623054 tok/s step 12286/19560 | loss 3.349201 (-0.30z)| norm 0.2892 (+0.48z)| lr 1.95e-04 | 323.49 ms | 52.2% bf16 MFU | 1622938 tok/s step 12287/19560 | loss 3.415365 (+1.32z)| norm 0.2723 (-0.67z)| lr 1.95e-04 | 322.56 ms | 52.3% bf16 MFU | 1623060 tok/s step 12288/19560 | loss 3.292051 (-1.70z)| norm 0.2945 (+0.85z)| lr 1.95e-04 | 322.80 ms | 52.3% bf16 MFU | 1623117 tok/s step 12289/19560 | loss 3.338646 (-0.56z)| norm 0.2869 (+0.32z)| lr 1.94e-04 | 322.90 ms | 52.3% bf16 MFU | 1623146 tok/s step 12290/19560 | loss 3.329845 (-0.76z)| norm 0.3000 (+1.21z)| lr 1.94e-04 | 323.45 ms | 52.2% bf16 MFU | 1623036 tok/s step 12291/19560 | loss 3.478024 (+2.77z)| norm 0.2964 (+0.95z)| lr 1.94e-04 | 322.53 ms | 52.3% bf16 MFU | 1623162 tok/s step 12292/19560 | loss 3.326125 (-0.86z)| norm 0.2670 (-1.06z)| lr 1.94e-04 | 323.09 ms | 52.2% bf16 MFU | 1623140 tok/s step 12293/19560 | loss 3.373653 (+0.26z)| norm 0.2837 (+0.07z)| lr 1.94e-04 | 323.63 ms | 52.1% bf16 MFU | 1622985 tok/s step 12294/19560 | loss 3.360372 (-0.05z)| norm 0.2739 (-0.59z)| lr 1.94e-04 | 323.24 ms | 52.2% bf16 MFU | 1622934 tok/s step 12295/19560 | loss 3.357230 (-0.12z)| norm 0.2752 (-0.52z)| lr 1.94e-04 | 323.00 ms | 52.3% bf16 MFU | 1622947 tok/s step 12296/19560 | loss 3.367839 (+0.14z)| norm 0.2824 (-0.01z)| lr 1.94e-04 | 322.72 ms | 52.3% bf16 MFU | 1623030 tok/s step 12297/19560 | loss 3.345739 (-0.39z)| norm 0.2741 (-0.60z)| lr 1.94e-04 | 323.19 ms | 52.2% bf16 MFU | 1622990 tok/s step 12298/19560 | loss 3.355520 (-0.14z)| norm 0.2789 (-0.26z)| lr 1.94e-04 | 322.67 ms | 52.3% bf16 MFU | 1623082 tok/s step 12299/19560 | loss 3.374949 (+0.33z)| norm 0.2829 (+0.04z)| lr 1.94e-04 | 323.06 ms | 52.2% bf16 MFU | 1623071 tok/s step 12300/19560 | loss 3.330519 (-0.74z)| norm 0.2601 (-1.57z)| lr 1.94e-04 | 322.86 ms | 52.3% bf16 MFU | 
1623111 tok/s step 12301/19560 | loss 3.385769 (+0.59z)| norm 0.2745 (-0.54z)| lr 1.94e-04 | 322.49 ms | 52.3% bf16 MFU | 1623242 tok/s step 12302/19560 | loss 3.347240 (-0.35z)| norm 0.2798 (-0.16z)| lr 1.94e-04 | 322.72 ms | 52.3% bf16 MFU | 1623311 tok/s step 12303/19560 | loss 3.392328 (+0.73z)| norm 0.2810 (-0.07z)| lr 1.94e-04 | 322.47 ms | 52.3% bf16 MFU | 1623439 tok/s step 12304/19560 | loss 3.347253 (-0.36z)| norm 0.2723 (-0.69z)| lr 1.94e-04 | 322.49 ms | 52.3% bf16 MFU | 1623554 tok/s step 12305/19560 | loss 3.417546 (+1.34z)| norm 0.2756 (-0.44z)| lr 1.94e-04 | 322.80 ms | 52.3% bf16 MFU | 1623584 tok/s step 12306/19560 | loss 3.320048 (-1.04z)| norm 0.2806 (-0.09z)| lr 1.94e-04 | 322.86 ms | 52.3% bf16 MFU | 1623600 tok/s step 12307/19560 | loss 3.360092 (-0.05z)| norm 0.2918 (+0.71z)| lr 1.94e-04 | 322.67 ms | 52.3% bf16 MFU | 1623662 tok/s step 12308/19560 | loss 3.438507 (+1.83z)| norm 0.2791 (-0.21z)| lr 1.94e-04 | 322.50 ms | 52.3% bf16 MFU | 1623765 tok/s step 12309/19560 | loss 3.407312 (+1.06z)| norm 0.2806 (-0.09z)| lr 1.94e-04 | 323.29 ms | 52.2% bf16 MFU | 1623663 tok/s step 12310/19560 | loss 3.380962 (+0.44z)| norm 0.2682 (-0.98z)| lr 1.93e-04 | 322.51 ms | 52.3% bf16 MFU | 1623763 tok/s step 12311/19560 | loss 3.380608 (+0.42z)| norm 0.2707 (-0.79z)| lr 1.93e-04 | 322.96 ms | 52.3% bf16 MFU | 1623744 tok/s step 12312/19560 | loss 3.286120 (-1.88z)| norm 0.2777 (-0.28z)| lr 1.93e-04 | 322.49 ms | 52.3% bf16 MFU | 1623844 tok/s step 12313/19560 | loss 3.335927 (-0.66z)| norm 0.2674 (-1.02z)| lr 1.93e-04 | 322.46 ms | 52.3% bf16 MFU | 1623947 tok/s step 12314/19560 | loss 3.452900 (+2.15z)| norm 0.2816 (+0.00z)| lr 1.93e-04 | 322.63 ms | 52.3% bf16 MFU | 1624003 tok/s step 12315/19560 | loss 3.389527 (+0.61z)| norm 0.2899 (+0.60z)| lr 1.93e-04 | 322.58 ms | 52.3% bf16 MFU | 1624067 tok/s step 12316/19560 | loss 3.332509 (-0.76z)| norm 0.2829 (+0.09z)| lr 1.93e-04 | 323.13 ms | 52.2% bf16 MFU | 1623990 tok/s step 12317/19560 | loss 3.328779 
(-0.84z)| norm 0.2664 (-1.09z)| lr 1.93e-04 | 322.55 ms | 52.3% bf16 MFU | 1624062 tok/s step 12318/19560 | loss 3.343055 (-0.51z)| norm 0.3106 (+2.07z)| lr 1.93e-04 | 322.62 ms | 52.3% bf16 MFU | 1624113 tok/s step 12319/19560 | loss 3.385134 (+0.53z)| norm 0.2648 (-1.19z)| lr 1.93e-04 | 323.11 ms | 52.2% bf16 MFU | 1624039 tok/s step 12320/19560 | loss 3.356452 (-0.17z)| norm 0.3013 (+1.38z)| lr 1.93e-04 | 322.75 ms | 52.3% bf16 MFU | 1624060 tok/s step 12321/19560 | loss 3.344427 (-0.47z)| norm 0.2674 (-1.00z)| lr 1.93e-04 | 322.27 ms | 52.4% bf16 MFU | 1624199 tok/s step 12322/19560 | loss 3.346762 (-0.41z)| norm 0.2914 (+0.68z)| lr 1.93e-04 | 322.97 ms | 52.3% bf16 MFU | 1624157 tok/s step 12323/19560 | loss 3.383566 (+0.50z)| norm 0.2858 (+0.29z)| lr 1.93e-04 | 322.64 ms | 52.3% bf16 MFU | 1624199 tok/s step 12324/19560 | loss 3.338164 (-0.63z)| norm 0.2802 (-0.12z)| lr 1.93e-04 | 322.73 ms | 52.3% bf16 MFU | 1624217 tok/s step 12325/19560 | loss 3.341958 (-0.53z)| norm 0.2768 (-0.36z)| lr 1.93e-04 | 322.45 ms | 52.3% bf16 MFU | 1624304 tok/s step 12326/19560 | loss 3.355189 (-0.20z)| norm 0.2738 (-0.56z)| lr 1.93e-04 | 322.52 ms | 52.3% bf16 MFU | 1624369 tok/s step 12327/19560 | loss 3.329982 (-0.85z)| norm 0.2869 (+0.36z)| lr 1.93e-04 | 322.97 ms | 52.3% bf16 MFU | 1624316 tok/s step 12328/19560 | loss 3.493544 (+3.11z)| norm 0.2874 (+0.40z)| lr 1.93e-04 | 323.26 ms | 52.2% bf16 MFU | 1624194 tok/s step 12329/19560 | loss 3.368064 (+0.08z)| norm 0.3003 (+1.30z)| lr 1.93e-04 | 323.26 ms | 52.2% bf16 MFU | 1624078 tok/s step 12330/19560 | loss 3.324189 (-0.97z)| norm 0.2905 (+0.59z)| lr 1.93e-04 | 322.70 ms | 52.3% bf16 MFU | 1624110 tok/s step 12331/19560 | loss 3.351475 (-0.31z)| norm 0.2786 (-0.26z)| lr 1.93e-04 | 323.22 ms | 52.2% bf16 MFU | 1624007 tok/s step 12332/19560 | loss 3.391479 (+0.64z)| norm 0.2984 (+1.14z)| lr 1.92e-04 | 323.02 ms | 52.2% bf16 MFU | 1623961 tok/s step 12333/19560 | loss 3.428155 (+1.50z)| norm 0.2796 (-0.20z)| lr 1.92e-04 | 
322.65 ms | 52.3% bf16 MFU | 1624009 tok/s step 12334/19560 | loss 3.393874 (+0.68z)| norm 0.2751 (-0.52z)| lr 1.92e-04 | 322.95 ms | 52.3% bf16 MFU | 1623982 tok/s step 12335/19560 | loss 3.312383 (-1.26z)| norm 0.2877 (+0.37z)| lr 1.92e-04 | 322.85 ms | 52.3% bf16 MFU | 1623978 tok/s step 12336/19560 | loss 3.399471 (+0.80z)| norm 0.2677 (-1.05z)| lr 1.92e-04 | 322.84 ms | 52.3% bf16 MFU | 1623978 tok/s step 12337/19560 | loss 3.369362 (+0.09z)| norm 0.2623 (-1.41z)| lr 1.92e-04 | 322.46 ms | 52.3% bf16 MFU | 1624074 tok/s step 12338/19560 | loss 3.338154 (-0.67z)| norm 0.2708 (-0.82z)| lr 1.92e-04 | 322.47 ms | 52.3% bf16 MFU | 1624164 tok/s step 12339/19560 | loss 3.336264 (-0.70z)| norm 0.2648 (-1.24z)| lr 1.92e-04 | 323.05 ms | 52.2% bf16 MFU | 1624102 tok/s step 12340/19560 | loss 3.358108 (-0.16z)| norm 0.2682 (-0.99z)| lr 1.92e-04 | 323.04 ms | 52.2% bf16 MFU | 1624047 tok/s step 12341/19560 | loss 3.341763 (-0.57z)| norm 0.2741 (-0.58z)| lr 1.92e-04 | 322.75 ms | 52.3% bf16 MFU | 1624066 tok/s step 12342/19560 | loss 3.385547 (+0.52z)| norm 0.3009 (+1.29z)| lr 1.92e-04 | 322.93 ms | 52.3% bf16 MFU | 1624040 tok/s step 12343/19560 | loss 3.375304 (+0.26z)| norm 0.2629 (-1.37z)| lr 1.92e-04 | 322.75 ms | 52.3% bf16 MFU | 1624058 tok/s step 12344/19560 | loss 3.352536 (-0.29z)| norm 0.2606 (-1.51z)| lr 1.92e-04 | 323.23 ms | 52.2% bf16 MFU | 1623958 tok/s step 12345/19560 | loss 3.395302 (+0.80z)| norm 0.2826 (+0.02z)| lr 1.92e-04 | 322.52 ms | 52.3% bf16 MFU | 1624040 tok/s step 12346/19560 | loss 3.278639 (-2.11z)| norm 0.2614 (-1.45z)| lr 1.92e-04 | 322.76 ms | 52.3% bf16 MFU | 1624057 tok/s step 12347/19560 | loss 3.373736 (+0.26z)| norm 0.3110 (+1.96z)| lr 1.92e-04 | 322.93 ms | 52.3% bf16 MFU | 1624029 tok/s step 12348/19560 | loss 3.357435 (-0.15z)| norm 0.2910 (+0.57z)| lr 1.92e-04 | 323.22 ms | 52.2% bf16 MFU | 1623931 tok/s step 12349/19560 | loss 3.392050 (+0.77z)| norm 0.2616 (-1.44z)| lr 1.92e-04 | 322.09 ms | 52.4% bf16 MFU | 1624122 tok/s step 
12350/19560 | loss 3.381136 (+0.47z)| norm 0.2929 (+0.69z)| lr 1.92e-04 | 322.96 ms | 52.3% bf16 MFU | 1624085 tok/s step 12351/19560 | loss 3.341607 (-0.57z)| norm 0.2630 (-1.37z)| lr 1.92e-04 | 323.53 ms | 52.2% bf16 MFU | 1623906 tok/s step 12352/19560 | loss 3.422936 (+1.54z)| norm 0.2904 (+0.51z)| lr 1.92e-04 | 322.35 ms | 52.4% bf16 MFU | 1624035 tok/s step 12353/19560 | loss 3.327640 (-0.95z)| norm 0.2778 (-0.38z)| lr 1.91e-04 | 322.96 ms | 52.3% bf16 MFU | 1624003 tok/s step 12354/19560 | loss 3.366210 (+0.05z)| norm 0.2850 (+0.12z)| lr 1.91e-04 | 322.75 ms | 52.3% bf16 MFU | 1624025 tok/s step 12355/19560 | loss 3.325097 (-1.06z)| norm 0.2819 (-0.10z)| lr 1.91e-04 | 322.37 ms | 52.4% bf16 MFU | 1624142 tok/s step 12356/19560 | loss 3.340722 (-0.64z)| norm 0.2552 (-1.94z)| lr 1.91e-04 | 323.12 ms | 52.2% bf16 MFU | 1624065 tok/s step 12357/19560 | loss 3.383791 (+0.51z)| norm 0.2837 (+0.05z)| lr 1.91e-04 | 322.52 ms | 52.3% bf16 MFU | 1624142 tok/s step 12358/19560 | loss 3.386497 (+0.58z)| norm 0.2913 (+0.57z)| lr 1.91e-04 | 322.55 ms | 52.3% bf16 MFU | 1624208 tok/s step 12359/19560 | loss 3.379600 (+0.39z)| norm 0.2737 (-0.68z)| lr 1.91e-04 | 322.73 ms | 52.3% bf16 MFU | 1624224 tok/s step 12360/19560 | loss 3.342657 (-0.60z)| norm 0.2720 (-0.78z)| lr 1.91e-04 | 323.11 ms | 52.2% bf16 MFU | 1624145 tok/s step 12361/19560 | loss 3.305131 (-1.57z)| norm 0.2731 (-0.72z)| lr 1.91e-04 | 322.66 ms | 52.3% bf16 MFU | 1624182 tok/s step 12362/19560 | loss 3.338954 (-0.67z)| norm 0.2749 (-0.57z)| lr 1.91e-04 | 322.37 ms | 52.4% bf16 MFU | 1624291 tok/s step 12363/19560 | loss 3.356157 (-0.22z)| norm 0.2464 (-2.57z)| lr 1.91e-04 | 322.64 ms | 52.3% bf16 MFU | 1624325 tok/s step 12364/19560 | loss 3.351835 (-0.33z)| norm 0.2721 (-0.74z)| lr 1.91e-04 | 322.42 ms | 52.3% bf16 MFU | 1624413 tok/s step 12365/19560 | loss 3.406812 (+1.13z)| norm 0.2809 (-0.13z)| lr 1.91e-04 | 322.76 ms | 52.3% bf16 MFU | 1624411 tok/s step 12366/19560 | loss 3.296863 (-1.76z)| norm 
0.2608 (-1.52z)| lr 1.91e-04 | 323.00 ms | 52.3% bf16 MFU | 1624348 tok/s step 12367/19560 | loss 3.397004 (+0.90z)| norm 0.2875 (+0.36z)| lr 1.91e-04 | 322.81 ms | 52.3% bf16 MFU | 1624338 tok/s step 12368/19560 | loss 3.295751 (-1.81z)| norm 0.2494 (-2.27z)| lr 1.91e-04 | 323.25 ms | 52.2% bf16 MFU | 1624218 tok/s step 12369/19560 | loss 3.388996 (+0.68z)| norm 0.2763 (-0.39z)| lr 1.91e-04 | 322.68 ms | 52.3% bf16 MFU | 1624246 tok/s step 12370/19560 | loss 3.343711 (-0.54z)| norm 0.2839 (+0.15z)| lr 1.91e-04 | 323.18 ms | 52.2% bf16 MFU | 1624148 tok/s step 12371/19560 | loss 3.386861 (+0.61z)| norm 0.2513 (-2.09z)| lr 1.91e-04 | 323.18 ms | 52.2% bf16 MFU | 1624056 tok/s step 12372/19560 | loss 3.359577 (-0.13z)| norm 0.2809 (-0.02z)| lr 1.91e-04 | 322.27 ms | 52.4% bf16 MFU | 1624195 tok/s step 12373/19560 | loss 3.325971 (-1.02z)| norm 0.2579 (-1.60z)| lr 1.91e-04 | 323.00 ms | 52.3% bf16 MFU | 1624145 tok/s step 12374/19560 | loss 3.349406 (-0.38z)| norm 0.2806 (-0.02z)| lr 1.91e-04 | 323.05 ms | 52.2% bf16 MFU | 1624084 tok/s step 12375/19560 | loss 3.414473 (+1.36z)| norm 0.2640 (-1.16z)| lr 1.90e-04 | 322.86 ms | 52.3% bf16 MFU | 1624075 tok/s step 12376/19560 | loss 3.302724 (-1.62z)| norm 0.2856 (+0.36z)| lr 1.90e-04 | 323.04 ms | 52.2% bf16 MFU | 1624021 tok/s step 12377/19560 | loss 3.305539 (-1.57z)| norm 0.2536 (-1.84z)| lr 1.90e-04 | 322.69 ms | 52.3% bf16 MFU | 1624057 tok/s step 12378/19560 | loss 3.315079 (-1.29z)| norm 0.2751 (-0.35z)| lr 1.90e-04 | 322.50 ms | 52.3% bf16 MFU | 1624140 tok/s step 12379/19560 | loss 3.493771 (+3.44z)| norm 0.2864 (+0.44z)| lr 1.90e-04 | 323.08 ms | 52.2% bf16 MFU | 1624071 tok/s step 12380/19560 | loss 3.395673 (+0.86z)| norm 0.3300 (+3.28z)| lr 1.90e-04 | 323.13 ms | 52.2% bf16 MFU | 1623994 tok/s step 12381/19560 | loss 3.402077 (+1.02z)| norm 0.2710 (-0.62z)| lr 1.90e-04 | 322.20 ms | 52.4% bf16 MFU | 1624155 tok/s step 12382/19560 | loss 3.331713 (-0.81z)| norm 0.3041 (+1.58z)| lr 1.90e-04 | 323.06 ms | 
52.2% bf16 MFU | 1624093 tok/s step 12383/19560 | loss 3.367775 (+0.13z)| norm 0.2762 (-0.27z)| lr 1.90e-04 | 322.83 ms | 52.3% bf16 MFU | 1624090 tok/s step 12384/19560 | loss 3.366108 (+0.08z)| norm 0.3088 (+1.91z)| lr 1.90e-04 | 322.70 ms | 52.3% bf16 MFU | 1624121 tok/s step 12385/19560 | loss 3.296696 (-1.70z)| norm 0.2675 (-0.87z)| lr 1.90e-04 | 323.08 ms | 52.2% bf16 MFU | 1624053 tok/s step 12386/19560 | loss 3.320260 (-1.07z)| norm 0.2896 (+0.62z)| lr 1.90e-04 | 322.89 ms | 52.3% bf16 MFU | 1624036 tok/s step 12387/19560 | loss 3.363266 (+0.03z)| norm 0.2841 (+0.24z)| lr 1.90e-04 | 322.90 ms | 52.3% bf16 MFU | 1624018 tok/s step 12388/19560 | loss 3.393480 (+0.80z)| norm 0.2874 (+0.46z)| lr 1.90e-04 | 323.13 ms | 52.2% bf16 MFU | 1623943 tok/s step 12389/19560 | loss 3.325140 (-0.96z)| norm 0.2747 (-0.41z)| lr 1.90e-04 | 322.61 ms | 52.3% bf16 MFU | 1624002 tok/s step 12390/19560 | loss 3.500414 (+3.37z)| norm 0.3151 (+2.28z)| lr 1.90e-04 | 322.82 ms | 52.3% bf16 MFU | 1624007 tok/s step 12391/19560 | loss 3.380238 (+0.42z)| norm 0.3129 (+2.08z)| lr 1.90e-04 | 323.27 ms | 52.2% bf16 MFU | 1623899 tok/s step 12392/19560 | loss 3.361805 (-0.04z)| norm 0.2905 (+0.59z)| lr 1.90e-04 | 322.85 ms | 52.3% bf16 MFU | 1623900 tok/s step 12393/19560 | loss 3.404200 (+1.00z)| norm 0.3151 (+2.21z)| lr 1.90e-04 | 323.05 ms | 52.2% bf16 MFU | 1623852 tok/s step 12394/19560 | loss 3.342469 (-0.54z)| norm 0.2817 (-0.01z)| lr 1.90e-04 | 322.80 ms | 52.3% bf16 MFU | 1623869 tok/s step 12395/19560 | loss 3.370690 (+0.16z)| norm 0.2922 (+0.69z)| lr 1.90e-04 | 322.46 ms | 52.3% bf16 MFU | 1623971 tok/s step 12396/19560 | loss 3.374794 (+0.28z)| norm 0.2864 (+0.31z)| lr 1.89e-04 | 323.04 ms | 52.2% bf16 MFU | 1623922 tok/s step 12397/19560 | loss 3.390745 (+0.67z)| norm 0.3058 (+1.62z)| lr 1.89e-04 | 322.86 ms | 52.3% bf16 MFU | 1623919 tok/s step 12398/19560 | loss 3.337223 (-0.67z)| norm 0.2671 (-0.97z)| lr 1.89e-04 | 322.96 ms | 52.3% bf16 MFU | 1623893 tok/s step 12399/19560 
| loss 3.331537 (-0.81z)| norm 0.2815 (+0.00z)| lr 1.89e-04 | 323.19 ms | 52.2% bf16 MFU | 1623810 tok/s step 12400/19560 | loss 3.345057 (-0.46z)| norm 0.2612 (-1.35z)| lr 1.89e-04 | 323.12 ms | 52.2% bf16 MFU | 1623748 tok/s step 12401/19560 | loss 3.289443 (-1.82z)| norm 0.2710 (-0.69z)| lr 1.89e-04 | 322.31 ms | 52.4% bf16 MFU | 1623893 tok/s step 12402/19560 | loss 3.311672 (-1.25z)| norm 0.2790 (-0.13z)| lr 1.89e-04 | 322.49 ms | 52.3% bf16 MFU | 1623987 tok/s step 12403/19560 | loss 3.377158 (+0.37z)| norm 0.2573 (-1.59z)| lr 1.89e-04 | 323.38 ms | 52.2% bf16 MFU | 1623852 tok/s step 12404/19560 | loss 3.353102 (-0.22z)| norm 0.3033 (+1.53z)| lr 1.89e-04 | 322.88 ms | 52.3% bf16 MFU | 1623850 tok/s step 12405/19560 | loss 3.367858 (+0.14z)| norm 0.2602 (-1.37z)| lr 1.89e-04 | 322.83 ms | 52.3% bf16 MFU | 1623859 tok/s step 12406/19560 | loss 3.345612 (-0.41z)| norm 0.2965 (+1.10z)| lr 1.89e-04 | 322.70 ms | 52.3% bf16 MFU | 1623901 tok/s step 12407/19560 | loss 3.324094 (-0.93z)| norm 0.2451 (-2.32z)| lr 1.89e-04 | 322.62 ms | 52.3% bf16 MFU | 1623960 tok/s step 12408/19560 | loss 3.401475 (+0.97z)| norm 0.2726 (-0.49z)| lr 1.89e-04 | 322.24 ms | 52.4% bf16 MFU | 1624111 tok/s step 12409/19560 | loss 3.300910 (-1.49z)| norm 0.2875 (+0.51z)| lr 1.89e-04 | 322.99 ms | 52.3% bf16 MFU | 1624067 tok/s step 12410/19560 | loss 3.365242 (+0.09z)| norm 0.2633 (-1.09z)| lr 1.89e-04 | 322.87 ms | 52.3% bf16 MFU | 1624055 tok/s step 12411/19560 | loss 3.285125 (-1.84z)| norm 0.2593 (-1.34z)| lr 1.89e-04 | 322.56 ms | 52.3% bf16 MFU | 1624123 tok/s step 12412/19560 | loss 3.452148 (+2.15z)| norm 0.2743 (-0.35z)| lr 1.89e-04 | 322.75 ms | 52.3% bf16 MFU | 1624139 tok/s step 12413/19560 | loss 3.392653 (+0.73z)| norm 0.2718 (-0.50z)| lr 1.89e-04 | 323.00 ms | 52.3% bf16 MFU | 1624091 tok/s step 12414/19560 | loss 3.545488 (+4.03z)| norm 0.2810 (+0.11z)| lr 1.89e-04 | 323.35 ms | 52.2% bf16 MFU | 1623958 tok/s step 12415/19560 | loss 3.336207 (-0.60z)| norm 0.2806 (+0.08z)| 
lr 1.89e-04 | 322.82 ms | 52.3% bf16 MFU | 1623964 tok/s step 12416/19560 | loss 3.308980 (-1.21z)| norm 0.2807 (+0.10z)| lr 1.89e-04 | 322.70 ms | 52.3% bf16 MFU | 1624001 tok/s step 12417/19560 | loss 3.319654 (-0.96z)| norm 0.2721 (-0.47z)| lr 1.89e-04 | 322.35 ms | 52.4% bf16 MFU | 1624124 tok/s step 12418/19560 | loss 3.370776 (+0.17z)| norm 0.2880 (+0.59z)| lr 1.88e-04 | 323.31 ms | 52.2% bf16 MFU | 1623999 tok/s step 12419/19560 | loss 3.306249 (-1.27z)| norm 0.2549 (-1.59z)| lr 1.88e-04 | 323.12 ms | 52.2% bf16 MFU | 1623927 tok/s step 12420/19560 | loss 3.358249 (-0.09z)| norm 0.2891 (+0.68z)| lr 1.88e-04 | 323.00 ms | 52.3% bf16 MFU | 1623889 tok/s step 12421/19560 | loss 3.351431 (-0.24z)| norm 0.2647 (-0.94z)| lr 1.88e-04 | 322.36 ms | 52.4% bf16 MFU | 1624015 tok/s step 12422/19560 | loss 3.352004 (-0.23z)| norm 0.2818 (+0.20z)| lr 1.88e-04 | 323.06 ms | 52.2% bf16 MFU | 1623959 tok/s step 12423/19560 | loss 3.377074 (+0.34z)| norm 0.2688 (-0.67z)| lr 1.88e-04 | 322.73 ms | 52.3% bf16 MFU | 1623988 tok/s step 12424/19560 | loss 3.358829 (-0.07z)| norm 0.2804 (+0.11z)| lr 1.88e-04 | 323.24 ms | 52.2% bf16 MFU | 1623888 tok/s step 12425/19560 | loss 3.368218 (+0.14z)| norm 0.2621 (-1.10z)| lr 1.88e-04 | 323.00 ms | 52.3% bf16 MFU | 1623853 tok/s step 12426/19560 | loss 3.411455 (+1.11z)| norm 0.2736 (-0.33z)| lr 1.88e-04 | 322.83 ms | 52.3% bf16 MFU | 1623862 tok/s step 12427/19560 | loss 3.397330 (+0.78z)| norm 0.2848 (+0.40z)| lr 1.88e-04 | 322.73 ms | 52.3% bf16 MFU | 1623896 tok/s step 12428/19560 | loss 3.392767 (+0.67z)| norm 0.2722 (-0.43z)| lr 1.88e-04 | 322.95 ms | 52.3% bf16 MFU | 1623874 tok/s step 12429/19560 | loss 3.330285 (-0.74z)| norm 0.2766 (-0.14z)| lr 1.88e-04 | 322.90 ms | 52.3% bf16 MFU | 1623865 tok/s step 12430/19560 | loss 3.348773 (-0.32z)| norm 0.2785 (-0.02z)| lr 1.88e-04 | 323.09 ms | 52.2% bf16 MFU | 1623808 tok/s step 12431/19560 | loss 3.331570 (-0.70z)| norm 0.2738 (-0.33z)| lr 1.88e-04 | 323.62 ms | 52.2% bf16 MFU | 
1623621 tok/s step 12432/19560 | loss 3.378697 (+0.36z)| norm 0.2623 (-1.08z)| lr 1.88e-04 | 323.00 ms | 52.3% bf16 MFU | 1623600 tok/s step 12433/19560 | loss 3.360303 (-0.04z)| norm 0.2726 (-0.40z)| lr 1.88e-04 | 322.84 ms | 52.3% bf16 MFU | 1623619 tok/s step 12434/19560 | loss 3.319079 (-0.98z)| norm 0.2589 (-1.28z)| lr 1.88e-04 | 323.59 ms | 52.2% bf16 MFU | 1623450 tok/s step 12435/19560 | loss 3.362679 (+0.01z)| norm 0.2670 (-0.74z)| lr 1.88e-04 | 322.87 ms | 52.3% bf16 MFU | 1623469 tok/s step 12436/19560 | loss 3.310524 (-1.16z)| norm 0.2908 (+0.82z)| lr 1.88e-04 | 323.30 ms | 52.2% bf16 MFU | 1623379 tok/s step 12437/19560 | loss 3.250125 (-2.47z)| norm 0.2572 (-1.37z)| lr 1.88e-04 | 322.51 ms | 52.3% bf16 MFU | 1623494 tok/s step 12438/19560 | loss 3.332463 (-0.61z)| norm 0.2681 (-0.65z)| lr 1.88e-04 | 322.79 ms | 52.3% bf16 MFU | 1623532 tok/s step 12439/19560 | loss 3.429592 (+1.55z)| norm 0.2916 (+0.86z)| lr 1.87e-04 | 323.40 ms | 52.2% bf16 MFU | 1623414 tok/s step 12440/19560 | loss 3.355060 (-0.12z)| norm 0.2897 (+0.73z)| lr 1.87e-04 | 323.24 ms | 52.2% bf16 MFU | 1623342 tok/s step 12441/19560 | loss 3.345165 (-0.35z)| norm 0.2967 (+1.17z)| lr 1.87e-04 | 322.70 ms | 52.3% bf16 MFU | 1623411 tok/s step 12442/19560 | loss 3.310522 (-1.12z)| norm 0.2715 (-0.45z)| lr 1.87e-04 | 323.06 ms | 52.2% bf16 MFU | 1623384 tok/s step 12443/19560 | loss 3.368606 (+0.21z)| norm 0.2991 (+1.32z)| lr 1.87e-04 | 323.34 ms | 52.2% bf16 MFU | 1623289 tok/s step 12444/19560 | loss 3.424917 (+1.47z)| norm 0.2691 (-0.60z)| lr 1.87e-04 | 323.18 ms | 52.2% bf16 MFU | 1623239 tok/s step 12445/19560 | loss 3.368263 (+0.18z)| norm 0.2740 (-0.30z)| lr 1.87e-04 | 323.00 ms | 52.3% bf16 MFU | 1623236 tok/s step 12446/19560 | loss 3.384497 (+0.54z)| norm 0.2677 (-0.69z)| lr 1.87e-04 | 322.41 ms | 52.3% bf16 MFU | 1623382 tok/s step 12447/19560 | loss 3.318595 (-0.94z)| norm 0.2682 (-0.66z)| lr 1.87e-04 | 322.60 ms | 52.3% bf16 MFU | 1623474 tok/s step 12448/19560 | loss 3.329208 
(-0.70z)| norm 0.2831 (+0.33z)| lr 1.87e-04 | 323.48 ms | 52.2% bf16 MFU | 1623340 tok/s step 12449/19560 | loss 3.306198 (-1.20z)| norm 0.2548 (-1.52z)| lr 1.87e-04 | 322.98 ms | 52.3% bf16 MFU | 1623337 tok/s step 12450/19560 | loss 3.296593 (-1.40z)| norm 0.2776 (-0.02z)| lr 1.87e-04 | 322.55 ms | 52.3% bf16 MFU | 1623442 tok/s step 12451/19560 | loss 3.366845 (+0.17z)| norm 0.2594 (-1.20z)| lr 1.87e-04 | 323.19 ms | 52.2% bf16 MFU | 1623382 tok/s step 12452/19560 | loss 3.364623 (+0.12z)| norm 0.2766 (-0.07z)| lr 1.87e-04 | 323.90 ms | 52.1% bf16 MFU | 1623146 tok/s step 12453/19560 | loss 3.363578 (+0.09z)| norm 0.2756 (-0.14z)| lr 1.87e-04 | 322.97 ms | 52.3% bf16 MFU | 1623154 tok/s step 12454/19560 | loss 3.268311 (-2.00z)| norm 0.3746 (+5.49z)| lr 1.87e-04 | 323.37 ms | 52.2% bf16 MFU | 1623062 tok/s step 12455/19560 | loss 3.367948 (+0.20z)| norm 0.3724 (+4.81z)| lr 1.87e-04 | 322.83 ms | 52.3% bf16 MFU | 1623112 tok/s step 12456/19560 | loss 3.364747 (+0.15z)| norm 0.2901 (+0.57z)| lr 1.87e-04 | 322.87 ms | 52.3% bf16 MFU | 1623148 tok/s step 12457/19560 | loss 3.429933 (+1.62z)| norm 0.3293 (+2.52z)| lr 1.87e-04 | 323.72 ms | 52.1% bf16 MFU | 1622970 tok/s step 12458/19560 | loss 3.383776 (+0.56z)| norm 0.2871 (+0.39z)| lr 1.87e-04 | 322.57 ms | 52.3% bf16 MFU | 1623088 tok/s step 12459/19560 | loss 3.322460 (-0.82z)| norm 0.2932 (+0.69z)| lr 1.87e-04 | 323.91 ms | 52.1% bf16 MFU | 1622864 tok/s step 12460/19560 | loss 3.370080 (+0.26z)| norm 0.2975 (+0.91z)| lr 1.87e-04 | 322.58 ms | 52.3% bf16 MFU | 1622984 tok/s step 12461/19560 | loss 3.347875 (-0.23z)| norm 0.2636 (-0.79z)| lr 1.86e-04 | 322.67 ms | 52.3% bf16 MFU | 1623078 tok/s step 12462/19560 | loss 3.391257 (+0.76z)| norm 0.2857 (+0.32z)| lr 1.86e-04 | 323.74 ms | 52.1% bf16 MFU | 1622898 tok/s step 12463/19560 | loss 3.351280 (-0.16z)| norm 0.2843 (+0.25z)| lr 1.86e-04 | 323.17 ms | 52.2% bf16 MFU | 1622869 tok/s step 12464/19560 | loss 3.381248 (+0.53z)| norm 0.2602 (-0.96z)| lr 1.86e-04 | 
322.80 ms | 52.3% bf16 MFU | 1622934 tok/s step 12465/19560 | loss 3.427447 (+1.57z)| norm 0.2747 (-0.24z)| lr 1.86e-04 | 323.37 ms | 52.2% bf16 MFU | 1622855 tok/s step 12466/19560 | loss 3.299635 (-1.33z)| norm 0.2740 (-0.28z)| lr 1.86e-04 | 323.29 ms | 52.2% bf16 MFU | 1622800 tok/s step 12467/19560 | loss 3.388371 (+0.67z)| norm 0.2643 (-0.76z)| lr 1.86e-04 | 323.33 ms | 52.2% bf16 MFU | 1622736 tok/s step 12468/19560 | loss 3.329186 (-0.66z)| norm 0.2719 (-0.38z)| lr 1.86e-04 | 323.27 ms | 52.2% bf16 MFU | 1622690 tok/s step 12469/19560 | loss 3.397188 (+0.86z)| norm 0.2790 (-0.03z)| lr 1.86e-04 | 324.01 ms | 52.1% bf16 MFU | 1622463 tok/s step 12470/19560 | loss 3.353432 (-0.12z)| norm 0.2828 (+0.17z)| lr 1.86e-04 | 322.83 ms | 52.3% bf16 MFU | 1622542 tok/s step 12471/19560 | loss 3.350206 (-0.19z)| norm 0.3225 (+2.13z)| lr 1.86e-04 | 322.68 ms | 52.3% bf16 MFU | 1622656 tok/s step 12472/19560 | loss 3.425222 (+1.48z)| norm 0.2718 (-0.41z)| lr 1.86e-04 | 324.50 ms | 52.0% bf16 MFU | 1622307 tok/s step 12473/19560 | loss 3.364862 (+0.14z)| norm 0.2637 (-0.80z)| lr 1.86e-04 | 322.31 ms | 52.4% bf16 MFU | 1622524 tok/s step 12474/19560 | loss 3.371495 (+0.27z)| norm 0.2711 (-0.44z)| lr 1.86e-04 | 323.00 ms | 52.3% bf16 MFU | 1622557 tok/s step 12475/19560 | loss 3.312188 (-1.06z)| norm 0.2626 (-0.85z)| lr 1.86e-04 | 323.63 ms | 52.2% bf16 MFU | 1622431 tok/s step 12476/19560 | loss 3.310949 (-1.08z)| norm 0.2590 (-1.02z)| lr 1.86e-04 | 322.24 ms | 52.4% bf16 MFU | 1622659 tok/s step 12477/19560 | loss 3.416883 (+1.30z)| norm 0.2915 (+0.60z)| lr 1.86e-04 | 323.20 ms | 52.2% bf16 MFU | 1622634 tok/s step 12478/19560 | loss 3.346442 (-0.27z)| norm 0.2619 (-0.87z)| lr 1.86e-04 | 323.00 ms | 52.3% bf16 MFU | 1622661 tok/s step 12479/19560 | loss 3.379366 (+0.46z)| norm 0.3050 (+1.27z)| lr 1.86e-04 | 323.04 ms | 52.2% bf16 MFU | 1622678 tok/s step 12480/19560 | loss 3.315825 (-0.95z)| norm 0.2943 (+0.73z)| lr 1.86e-04 | 322.47 ms | 52.3% bf16 MFU | 1622837 tok/s step 
12481/19560 | loss 3.365606 (+0.16z)| norm 0.2741 (-0.27z)| lr 1.86e-04 | 322.57 ms | 52.3% bf16 MFU | 1622963 tok/s step 12482/19560 | loss 3.334651 (-0.53z)| norm 0.2843 (+0.24z)| lr 1.85e-04 | 322.59 ms | 52.3% bf16 MFU | 1623076 tok/s step 12483/19560 | loss 3.340383 (-0.40z)| norm 0.2721 (-0.37z)| lr 1.85e-04 | 322.67 ms | 52.3% bf16 MFU | 1623165 tok/s step 12484/19560 | loss 3.415587 (+1.28z)| norm 0.2782 (-0.07z)| lr 1.85e-04 | 323.84 ms | 52.1% bf16 MFU | 1622956 tok/s step 12485/19560 | loss 3.464124 (+2.31z)| norm 0.2551 (-1.22z)| lr 1.85e-04 | 323.51 ms | 52.2% bf16 MFU | 1622839 tok/s step 12486/19560 | loss 3.361283 (+0.04z)| norm 0.2963 (+0.84z)| lr 1.85e-04 | 323.28 ms | 52.2% bf16 MFU | 1622786 tok/s step 12487/19560 | loss 3.430362 (+1.55z)| norm 0.2627 (-0.83z)| lr 1.85e-04 | 322.98 ms | 52.3% bf16 MFU | 1622811 tok/s step 12488/19560 | loss 3.341773 (-0.39z)| norm 0.2733 (-0.30z)| lr 1.85e-04 | 322.55 ms | 52.3% bf16 MFU | 1622943 tok/s step 12489/19560 | loss 3.353037 (-0.15z)| norm 0.2648 (-0.72z)| lr 1.85e-04 | 323.49 ms | 52.2% bf16 MFU | 1622833 tok/s step 12490/19560 | loss 3.385444 (+0.55z)| norm 0.2846 (+0.26z)| lr 1.85e-04 | 323.11 ms | 52.2% bf16 MFU | 1622823 tok/s step 12491/19560 | loss 3.582665 (+4.46z)| norm 0.2983 (+0.92z)| lr 1.85e-04 | 322.56 ms | 52.3% bf16 MFU | 1622950 tok/s step 12492/19560 | loss 3.331209 (-0.62z)| norm 0.2824 (+0.12z)| lr 1.85e-04 | 323.09 ms | 52.2% bf16 MFU | 1622940 tok/s step 12493/19560 | loss 3.250731 (-2.19z)| norm 0.3017 (+1.08z)| lr 1.85e-04 | 322.58 ms | 52.3% bf16 MFU | 1623057 tok/s step 12494/19560 | loss 3.331849 (-0.58z)| norm 0.2709 (-0.46z)| lr 1.85e-04 | 322.69 ms | 52.3% bf16 MFU | 1623142 tok/s step 12495/19560 | loss 3.456931 (+1.88z)| norm 0.2845 (+0.22z)| lr 1.85e-04 | 322.91 ms | 52.3% bf16 MFU | 1623166 tok/s step 12496/19560 | loss 3.364840 (+0.06z)| norm 0.2722 (-0.41z)| lr 1.85e-04 | 323.19 ms | 52.2% bf16 MFU | 1623120 tok/s step 12497/19560 | loss 3.453805 (+1.79z)| norm 
0.3133 (+1.64z)| lr 1.85e-04 | 322.22 ms | 52.4% bf16 MFU | 1623318 tok/s step 12498/19560 | loss 3.336666 (-0.51z)| norm 0.2794 (-0.06z)| lr 1.85e-04 | 323.05 ms | 52.2% bf16 MFU | 1623298 tok/s step 12499/19560 | loss 3.353638 (-0.17z)| norm 0.2741 (-0.34z)| lr 1.85e-04 | 322.81 ms | 52.3% bf16 MFU | 1623340 tok/s step 12500/19560 | loss 3.357971 (-0.08z)| norm 0.2813 (+0.03z)| lr 1.85e-04 | 323.10 ms | 52.2% bf16 MFU | 1623308 tok/s val loss 3.344419 HellaSwag: 2990/10042 = 0.297749 step 12501/19560 | loss 3.397574 (+0.68z)| norm 0.2679 (-0.65z)| lr 1.85e-04 | 321.91 ms |
52.4% bf16 MFU | 1623576 tok/s step 12502/19560 | loss 3.367697 (+0.09z)| norm 0.2781 (-0.14z)| lr 1.85e-04 | 323.47 ms | 52.2% bf16 MFU | 1623439 tok/s step 12503/19560 | loss 3.376879 (+0.28z)| norm 0.2644 (-0.83z)| lr 1.85e-04 | 322.88 ms | 52.3% bf16 MFU | 1623457 tok/s step 12504/19560 | loss 3.289445 (-1.44z)| norm 0.2962 (+0.77z)| lr 1.84e-04 | 322.92 ms | 52.3% bf16 MFU | 1623465 tok/s step 12505/19560 | loss 3.396997 (+0.67z)| norm 0.2762 (-0.25z)| lr 1.84e-04 | 322.88 ms | 52.3% bf16 MFU | 1623480 tok/s step 12506/19560 | loss 3.408532 (+0.88z)| norm 0.2954 (+0.72z)| lr 1.84e-04 | 322.90 ms | 52.3% bf16 MFU | 1623490 tok/s step 12507/19560 | loss 3.355777 (-0.14z)| norm 0.2665 (-0.74z)| lr 1.84e-04 | 322.91 ms | 52.3% bf16 MFU | 1623497 tok/s step 12508/19560 | loss 3.351036 (-0.23z)| norm 0.2676 (-0.67z)| lr 1.84e-04 | 322.51 ms | 52.3% bf16 MFU | 1623606 tok/s step 12509/19560 | loss 3.361803 (-0.01z)| norm 0.2649 (-0.81z)| lr 1.84e-04 | 323.05 ms | 52.2% bf16 MFU | 1623571 tok/s step 12510/19560 | loss 3.328659 (-0.68z)| norm 0.2580 (-1.15z)| lr 1.84e-04 | 322.70 ms | 52.3% bf16 MFU | 1623628 tok/s step 12511/19560 | loss 3.377151 (+0.30z)| norm 0.2639 (-0.83z)| lr 1.84e-04 | 322.31 ms | 52.4% bf16 MFU | 1623779 tok/s step 12512/19560 | loss 3.349317 (-0.26z)| norm 0.2472 (-1.67z)| lr 1.84e-04 | 323.09 ms | 52.2% bf16 MFU | 1623727 tok/s step 12513/19560 | loss 3.365549 (+0.06z)| norm 0.2535 (-1.33z)| lr 1.84e-04 | 322.53 ms | 52.3% bf16 MFU | 1623818 tok/s step 12514/19560 | loss 3.303495 (-1.21z)| norm 0.2544 (-1.27z)| lr 1.84e-04 | 323.17 ms | 52.2% bf16 MFU | 1623744 tok/s step 12515/19560 | loss 3.393181 (+0.62z)| norm 0.2630 (-0.82z)| lr 1.84e-04 | 322.40 ms | 52.3% bf16 MFU | 1623868 tok/s step 12516/19560 | loss 3.386265 (+0.48z)| norm 0.2706 (-0.42z)| lr 1.84e-04 | 322.73 ms | 52.3% bf16 MFU | 1623902 tok/s step 12517/19560 | loss 3.359986 (-0.06z)| norm 0.2800 (+0.05z)| lr 1.84e-04 | 323.23 ms | 52.2% bf16 MFU | 1623808 tok/s step 12518/19560 
| loss 3.373072 (+0.23z)| norm 0.2612 (-0.89z)| lr 1.84e-04 | 322.40 ms | 52.3% bf16 MFU | 1623926 tok/s step 12519/19560 | loss 3.310583 (-1.07z)| norm 0.2576 (-1.06z)| lr 1.84e-04 | 322.82 ms | 52.3% bf16 MFU | 1623936 tok/s step 12520/19560 | loss 3.263021 (-2.02z)| norm 0.2769 (-0.06z)| lr 1.84e-04 | 322.84 ms | 52.3% bf16 MFU | 1623940 tok/s step 12521/19560 | loss 3.443741 (+1.70z)| norm 0.2592 (-0.97z)| lr 1.84e-04 | 323.08 ms | 52.2% bf16 MFU | 1623882 tok/s step 12522/19560 | loss 3.401040 (+0.81z)| norm 0.2806 (+0.16z)| lr 1.84e-04 | 322.89 ms | 52.3% bf16 MFU | 1623875 tok/s step 12523/19560 | loss 3.321222 (-0.81z)| norm 0.2912 (+0.72z)| lr 1.84e-04 | 322.15 ms | 52.4% bf16 MFU | 1624056 tok/s step 12524/19560 | loss 3.344141 (-0.34z)| norm 0.2651 (-0.64z)| lr 1.84e-04 | 323.47 ms | 52.2% bf16 MFU | 1623894 tok/s step 12525/19560 | loss 3.394375 (+0.68z)| norm 0.2827 (+0.29z)| lr 1.84e-04 | 322.26 ms | 52.4% bf16 MFU | 1624045 tok/s step 12526/19560 | loss 3.321397 (-0.80z)| norm 0.2875 (+0.54z)| lr 1.83e-04 | 322.89 ms | 52.3% bf16 MFU | 1624030 tok/s step 12527/19560 | loss 3.353487 (-0.15z)| norm 0.2751 (-0.12z)| lr 1.83e-04 | 323.01 ms | 52.2% bf16 MFU | 1623985 tok/s step 12528/19560 | loss 3.332924 (-0.57z)| norm 0.2860 (+0.45z)| lr 1.83e-04 | 322.58 ms | 52.3% bf16 MFU | 1624050 tok/s step 12529/19560 | loss 3.335288 (-0.53z)| norm 0.2684 (-0.48z)| lr 1.83e-04 | 323.16 ms | 52.2% bf16 MFU | 1623968 tok/s step 12530/19560 | loss 3.393558 (+0.65z)| norm 0.2710 (-0.34z)| lr 1.83e-04 | 322.61 ms | 52.3% bf16 MFU | 1624026 tok/s step 12531/19560 | loss 3.372673 (+0.22z)| norm 0.2922 (+0.77z)| lr 1.83e-04 | 322.76 ms | 52.3% bf16 MFU | 1624044 tok/s step 12532/19560 | loss 3.314045 (-0.97z)| norm 0.2697 (-0.41z)| lr 1.83e-04 | 322.97 ms | 52.3% bf16 MFU | 1624007 tok/s step 12533/19560 | loss 3.419054 (+1.17z)| norm 0.2862 (+0.46z)| lr 1.83e-04 | 322.53 ms | 52.3% bf16 MFU | 1624084 tok/s step 12534/19560 | loss 3.405105 (+0.87z)| norm 0.2717 (-0.31z)| 
lr 1.83e-04 | 323.48 ms | 52.2% bf16 MFU | 1623919 tok/s step 12535/19560 | loss 3.493047 (+2.57z)| norm 0.2885 (+0.59z)| lr 1.83e-04 | 322.20 ms | 52.4% bf16 MFU | 1624082 tok/s step 12536/19560 | loss 3.348921 (-0.29z)| norm 0.2662 (-0.63z)| lr 1.83e-04 | 322.55 ms | 52.3% bf16 MFU | 1624150 tok/s step 12537/19560 | loss 3.279914 (-1.65z)| norm 0.2851 (+0.40z)| lr 1.83e-04 | 323.07 ms | 52.2% bf16 MFU | 1624085 tok/s step 12538/19560 | loss 3.319849 (-0.85z)| norm 0.2691 (-0.47z)| lr 1.83e-04 | 322.64 ms | 52.3% bf16 MFU | 1624132 tok/s step 12539/19560 | loss 3.404449 (+0.81z)| norm 0.3019 (+1.29z)| lr 1.83e-04 | 323.39 ms | 52.2% bf16 MFU | 1623986 tok/s step 12540/19560 | loss 3.336108 (-0.54z)| norm 0.2787 (+0.03z)| lr 1.83e-04 | 322.82 ms | 52.3% bf16 MFU | 1623991 tok/s step 12541/19560 | loss 3.360634 (-0.04z)| norm 0.2960 (+0.96z)| lr 1.83e-04 | 322.95 ms | 52.3% bf16 MFU | 1623963 tok/s step 12542/19560 | loss 3.407924 (+0.99z)| norm 0.2781 (-0.01z)| lr 1.83e-04 | 323.02 ms | 52.2% bf16 MFU | 1623920 tok/s step 12543/19560 | loss 3.393648 (+0.67z)| norm 0.2919 (+0.73z)| lr 1.83e-04 | 322.47 ms | 52.3% bf16 MFU | 1624017 tok/s step 12544/19560 | loss 3.381210 (+0.40z)| norm 0.2766 (-0.09z)| lr 1.83e-04 | 322.92 ms | 52.3% bf16 MFU | 1623996 tok/s step 12545/19560 | loss 3.422563 (+1.26z)| norm 0.2862 (+0.42z)| lr 1.83e-04 | 322.57 ms | 52.3% bf16 MFU | 1624063 tok/s step 12546/19560 | loss 3.353617 (-0.20z)| norm 0.2889 (+0.57z)| lr 1.83e-04 | 322.35 ms | 52.4% bf16 MFU | 1624182 tok/s step 12547/19560 | loss 3.403317 (+0.84z)| norm 0.2952 (+0.89z)| lr 1.82e-04 | 323.00 ms | 52.3% bf16 MFU | 1624131 tok/s step 12548/19560 | loss 3.284798 (-1.66z)| norm 0.2795 (+0.04z)| lr 1.82e-04 | 322.23 ms | 52.4% bf16 MFU | 1624277 tok/s step 12549/19560 | loss 3.336424 (-0.56z)| norm 0.2931 (+0.77z)| lr 1.82e-04 | 322.86 ms | 52.3% bf16 MFU | 1624257 tok/s step 12550/19560 | loss 3.373806 (+0.22z)| norm 0.2879 (+0.48z)| lr 1.82e-04 | 323.38 ms | 52.2% bf16 MFU | 
1624108 tok/s step 12551/19560 | loss 3.506761 (+2.90z)| norm 0.3236 (+2.34z)| lr 1.82e-04 | 322.56 ms | 52.3% bf16 MFU | 1624173 tok/s step 12552/19560 | loss 3.355093 (-0.19z)| norm 0.2702 (-0.48z)| lr 1.82e-04 | 322.99 ms | 52.3% bf16 MFU | 1624126 tok/s step 12553/19560 | loss 3.362253 (-0.04z)| norm 0.2978 (+0.96z)| lr 1.82e-04 | 322.43 ms | 52.3% bf16 MFU | 1624223 tok/s step 12554/19560 | loss 3.366033 (+0.04z)| norm 0.2670 (-0.67z)| lr 1.82e-04 | 323.05 ms | 52.2% bf16 MFU | 1624158 tok/s step 12555/19560 | loss 3.429612 (+1.33z)| norm 0.2865 (+0.37z)| lr 1.82e-04 | 322.45 ms | 52.3% bf16 MFU | 1624248 tok/s step 12556/19560 | loss 3.377901 (+0.28z)| norm 0.2721 (-0.39z)| lr 1.82e-04 | 322.77 ms | 52.3% bf16 MFU | 1624253 tok/s step 12557/19560 | loss 3.330944 (-0.67z)| norm 0.2682 (-0.60z)| lr 1.82e-04 | 323.48 ms | 52.2% bf16 MFU | 1624078 tok/s step 12558/19560 | loss 3.425023 (+1.22z)| norm 0.3125 (+1.71z)| lr 1.82e-04 | 323.19 ms | 52.2% bf16 MFU | 1623987 tok/s step 12559/19560 | loss 3.383039 (+0.36z)| norm 0.2872 (+0.38z)| lr 1.82e-04 | 322.31 ms | 52.4% bf16 MFU | 1624120 tok/s step 12560/19560 | loss 3.309494 (-1.11z)| norm 0.2690 (-0.57z)| lr 1.82e-04 | 322.46 ms | 52.3% bf16 MFU | 1624210 tok/s step 12561/19560 | loss 3.366351 (+0.04z)| norm 0.2783 (-0.09z)| lr 1.82e-04 | 323.00 ms | 52.3% bf16 MFU | 1624159 tok/s step 12562/19560 | loss 3.365879 (+0.02z)| norm 0.2720 (-0.42z)| lr 1.82e-04 | 323.01 ms | 52.2% bf16 MFU | 1624106 tok/s step 12563/19560 | loss 3.381361 (+0.33z)| norm 0.2680 (-0.63z)| lr 1.82e-04 | 322.63 ms | 52.3% bf16 MFU | 1624154 tok/s step 12564/19560 | loss 3.368520 (+0.06z)| norm 0.2845 (+0.24z)| lr 1.82e-04 | 322.78 ms | 52.3% bf16 MFU | 1624161 tok/s step 12565/19560 | loss 3.353103 (-0.28z)| norm 0.2893 (+0.48z)| lr 1.82e-04 | 322.65 ms | 52.3% bf16 MFU | 1624200 tok/s step 12566/19560 | loss 3.367679 (+0.02z)| norm 0.2609 (-1.02z)| lr 1.82e-04 | 322.53 ms | 52.3% bf16 MFU | 1624267 tok/s step 12567/19560 | loss 3.359642 
(-0.14z)| norm 0.2921 (+0.63z)| lr 1.82e-04 | 322.76 ms | 52.3% bf16 MFU | 1624272 tok/s step 12568/19560 | loss 3.335078 (-0.65z)| norm 0.2501 (-1.56z)| lr 1.82e-04 | 322.49 ms | 52.3% bf16 MFU | 1624345 tok/s step 12569/19560 | loss 3.278957 (-1.79z)| norm 0.2726 (-0.37z)| lr 1.81e-04 | 322.71 ms | 52.3% bf16 MFU | 1624359 tok/s step 12570/19560 | loss 3.357702 (-0.17z)| norm 0.2692 (-0.55z)| lr 1.81e-04 | 322.70 ms | 52.3% bf16 MFU | 1624375 tok/s step 12571/19560 | loss 3.369958 (+0.09z)| norm 0.2604 (-1.00z)| lr 1.81e-04 | 322.56 ms | 52.3% bf16 MFU | 1624426 tok/s step 12572/19560 | loss 3.446034 (+1.66z)| norm 0.2767 (-0.14z)| lr 1.81e-04 | 322.94 ms | 52.3% bf16 MFU | 1624379 tok/s step 12573/19560 | loss 3.345256 (-0.42z)| norm 0.2592 (-1.05z)| lr 1.81e-04 | 322.71 ms | 52.3% bf16 MFU | 1624392 tok/s step 12574/19560 | loss 3.401856 (+0.74z)| norm 0.2902 (+0.56z)| lr 1.81e-04 | 322.57 ms | 52.3% bf16 MFU | 1624441 tok/s step 12575/19560 | loss 3.411320 (+0.92z)| norm 0.2898 (+0.53z)| lr 1.81e-04 | 322.89 ms | 52.3% bf16 MFU | 1624407 tok/s step 12576/19560 | loss 3.522263 (+3.07z)| norm 0.2632 (-0.85z)| lr 1.81e-04 | 323.05 ms | 52.2% bf16 MFU | 1624333 tok/s step 12577/19560 | loss 3.366346 (-0.05z)| norm 0.2901 (+0.54z)| lr 1.81e-04 | 322.55 ms | 52.3% bf16 MFU | 1624388 tok/s step 12578/19560 | loss 3.319671 (-0.99z)| norm 0.3007 (+1.08z)| lr 1.81e-04 | 322.78 ms | 52.3% bf16 MFU | 1624384 tok/s step 12579/19560 | loss 3.350756 (-0.36z)| norm 0.2881 (+0.41z)| lr 1.81e-04 | 323.00 ms | 52.3% bf16 MFU | 1624323 tok/s step 12580/19560 | loss 3.326957 (-0.83z)| norm 0.2706 (-0.50z)| lr 1.81e-04 | 322.77 ms | 52.3% bf16 MFU | 1624323 tok/s step 12581/19560 | loss 3.421091 (+1.04z)| norm 0.2871 (+0.36z)| lr 1.81e-04 | 323.05 ms | 52.2% bf16 MFU | 1624253 tok/s step 12582/19560 | loss 3.374718 (+0.10z)| norm 0.2696 (-0.57z)| lr 1.81e-04 | 322.75 ms | 52.3% bf16 MFU | 1624263 tok/s step 12583/19560 | loss 3.369173 (-0.01z)| norm 0.2655 (-0.87z)| lr 1.81e-04 | 
323.02 ms | 52.2% bf16 MFU | 1624204 tok/s step 12584/19560 | loss 3.366000 (-0.07z)| norm 0.2817 (+0.21z)| lr 1.81e-04 | 322.92 ms | 52.3% bf16 MFU | 1624173 tok/s step 12585/19560 | loss 3.405756 (+0.74z)| norm 0.2578 (-1.39z)| lr 1.81e-04 | 322.63 ms | 52.3% bf16 MFU | 1624217 tok/s step 12586/19560 | loss 3.402876 (+0.68z)| norm 0.2750 (-0.20z)| lr 1.81e-04 | 322.69 ms | 52.3% bf16 MFU | 1624243 tok/s step 12587/19560 | loss 3.399749 (+0.60z)| norm 0.2698 (-0.55z)| lr 1.81e-04 | 323.25 ms | 52.2% bf16 MFU | 1624126 tok/s step 12588/19560 | loss 3.407358 (+0.75z)| norm 0.2761 (-0.10z)| lr 1.81e-04 | 323.57 ms | 52.2% bf16 MFU | 1623936 tok/s step 12589/19560 | loss 3.342702 (-0.56z)| norm 0.2701 (-0.52z)| lr 1.81e-04 | 323.10 ms | 52.2% bf16 MFU | 1623873 tok/s step 12590/19560 | loss 3.368216 (-0.04z)| norm 0.2668 (-0.74z)| lr 1.81e-04 | 322.91 ms | 52.3% bf16 MFU | 1623862 tok/s step 12591/19560 | loss 3.500997 (+2.57z)| norm 0.3306 (+3.50z)| lr 1.80e-04 | 322.99 ms | 52.3% bf16 MFU | 1623830 tok/s step 12592/19560 | loss 3.354111 (-0.34z)| norm 0.3082 (+1.97z)| lr 1.80e-04 | 323.01 ms | 52.2% bf16 MFU | 1623795 tok/s step 12593/19560 | loss 3.383279 (+0.25z)| norm 0.2835 (+0.35z)| lr 1.80e-04 | 323.05 ms | 52.2% bf16 MFU | 1623751 tok/s step 12594/19560 | loss 3.370256 (-0.02z)| norm 0.2984 (+1.30z)| lr 1.80e-04 | 322.99 ms | 52.3% bf16 MFU | 1623726 tok/s step 12595/19560 | loss 3.355904 (-0.31z)| norm 0.2966 (+1.17z)| lr 1.80e-04 | 322.15 ms | 52.4% bf16 MFU | 1623913 tok/s step 12596/19560 | loss 3.349019 (-0.45z)| norm 0.2882 (+0.61z)| lr 1.80e-04 | 322.54 ms | 52.3% bf16 MFU | 1623992 tok/s step 12597/19560 | loss 3.343328 (-0.56z)| norm 0.3014 (+1.44z)| lr 1.80e-04 | 323.49 ms | 52.2% bf16 MFU | 1623829 tok/s step 12598/19560 | loss 3.362696 (-0.17z)| norm 0.2814 (+0.16z)| lr 1.80e-04 | 322.64 ms | 52.3% bf16 MFU | 1623887 tok/s step 12599/19560 | loss 3.345431 (-0.51z)| norm 0.2878 (+0.60z)| lr 1.80e-04 | 323.40 ms | 52.2% bf16 MFU | 1623752 tok/s step 
12600/19560 | loss 3.428448 (+1.15z)| norm 0.3359 (+3.58z)| lr 1.80e-04 | 323.14 ms | 52.2% bf16 MFU | 1623688 tok/s step 12601/19560 | loss 3.329965 (-0.82z)| norm 0.2829 (+0.23z)| lr 1.80e-04 | 322.34 ms | 52.4% bf16 MFU | 1623828 tok/s step 12602/19560 | loss 3.416776 (+0.91z)| norm 0.3123 (+2.04z)| lr 1.80e-04 | 323.31 ms | 52.2% bf16 MFU | 1623717 tok/s step 12603/19560 | loss 3.382224 (+0.21z)| norm 0.2980 (+1.12z)| lr 1.80e-04 | 323.62 ms | 52.2% bf16 MFU | 1623536 tok/s step 12604/19560 | loss 3.358971 (-0.26z)| norm 0.3188 (+2.36z)| lr 1.80e-04 | 322.60 ms | 52.3% bf16 MFU | 1623618 tok/s step 12605/19560 | loss 3.342363 (-0.59z)| norm 0.3105 (+1.82z)| lr 1.80e-04 | 322.75 ms | 52.3% bf16 MFU | 1623658 tok/s step 12606/19560 | loss 3.396173 (+0.49z)| norm 0.2877 (+0.42z)| lr 1.80e-04 | 322.96 ms | 52.3% bf16 MFU | 1623643 tok/s step 12607/19560 | loss 3.471767 (+1.98z)| norm 0.3111 (+1.84z)| lr 1.80e-04 | 323.05 ms | 52.2% bf16 MFU | 1623608 tok/s step 12608/19560 | loss 3.427908 (+1.09z)| norm 0.3094 (+1.71z)| lr 1.80e-04 | 323.01 ms | 52.2% bf16 MFU | 1623583 tok/s step 12609/19560 | loss 3.389919 (+0.32z)| norm 0.3161 (+2.06z)| lr 1.80e-04 | 323.49 ms | 52.2% bf16 MFU | 1623441 tok/s step 12610/19560 | loss 3.375751 (+0.04z)| norm 0.2872 (+0.35z)| lr 1.80e-04 | 322.61 ms | 52.3% bf16 MFU | 1623526 tok/s step 12611/19560 | loss 3.330059 (-0.87z)| norm 0.3018 (+1.19z)| lr 1.80e-04 | 322.71 ms | 52.3% bf16 MFU | 1623582 tok/s step 12612/19560 | loss 3.313056 (-1.19z)| norm 0.2801 (-0.08z)| lr 1.80e-04 | 323.01 ms | 52.3% bf16 MFU | 1623561 tok/s step 12613/19560 | loss 3.291218 (-1.61z)| norm 0.3004 (+1.09z)| lr 1.79e-04 | 322.83 ms | 52.3% bf16 MFU | 1623585 tok/s step 12614/19560 | loss 3.400788 (+0.57z)| norm 0.3188 (+2.14z)| lr 1.79e-04 | 322.36 ms | 52.4% bf16 MFU | 1623725 tok/s step 12615/19560 | loss 3.388808 (+0.34z)| norm 0.3280 (+2.59z)| lr 1.79e-04 | 323.49 ms | 52.2% bf16 MFU | 1623574 tok/s step 12616/19560 | loss 3.364971 (-0.14z)| norm 
0.2974 (+0.83z)| lr 1.79e-04 | 323.36 ms | 52.2% bf16 MFU | 1623465 tok/s step 12617/19560 | loss 3.334997 (-0.74z)| norm 0.3325 (+2.73z)| lr 1.79e-04 | 322.60 ms | 52.3% bf16 MFU | 1623551 tok/s step 12618/19560 | loss 3.411429 (+0.79z)| norm 0.3029 (+1.08z)| lr 1.79e-04 | 323.57 ms | 52.2% bf16 MFU | 1623390 tok/s step 12619/19560 | loss 3.399843 (+0.63z)| norm 0.3212 (+2.04z)| lr 1.79e-04 | 323.43 ms | 52.2% bf16 MFU | 1623273 tok/s step 12620/19560 | loss 3.355922 (-0.32z)| norm 0.2771 (-0.35z)| lr 1.79e-04 | 323.04 ms | 52.2% bf16 MFU | 1623258 tok/s step 12621/19560 | loss 3.279839 (-1.98z)| norm 0.3054 (+1.19z)| lr 1.79e-04 | 322.78 ms | 52.3% bf16 MFU | 1623311 tok/s step 12622/19560 | loss 3.402620 (+0.68z)| norm 0.2678 (-0.86z)| lr 1.79e-04 | 323.48 ms | 52.2% bf16 MFU | 1623184 tok/s step 12623/19560 | loss 3.412094 (+0.90z)| norm 0.2993 (+0.84z)| lr 1.79e-04 | 322.58 ms | 52.3% bf16 MFU | 1623289 tok/s step 12624/19560 | loss 3.416102 (+0.98z)| norm 0.2955 (+0.63z)| lr 1.79e-04 | 322.96 ms | 52.3% bf16 MFU | 1623293 tok/s step 12625/19560 | loss 3.367846 (-0.07z)| norm 0.2949 (+0.61z)| lr 1.79e-04 | 322.67 ms | 52.3% bf16 MFU | 1623371 tok/s step 12626/19560 | loss 3.333675 (-0.83z)| norm 0.2856 (+0.10z)| lr 1.79e-04 | 322.68 ms | 52.3% bf16 MFU | 1623443 tok/s step 12627/19560 | loss 3.320131 (-1.12z)| norm 0.2826 (-0.07z)| lr 1.79e-04 | 323.56 ms | 52.2% bf16 MFU | 1623290 tok/s step 12628/19560 | loss 3.378468 (+0.17z)| norm 0.2796 (-0.23z)| lr 1.79e-04 | 323.28 ms | 52.2% bf16 MFU | 1623215 tok/s step 12629/19560 | loss 3.348752 (-0.48z)| norm 0.2663 (-0.95z)| lr 1.79e-04 | 322.89 ms | 52.3% bf16 MFU | 1623241 tok/s step 12630/19560 | loss 3.390006 (+0.43z)| norm 0.2824 (-0.08z)| lr 1.79e-04 | 322.87 ms | 52.3% bf16 MFU | 1623270 tok/s step 12631/19560 | loss 3.322721 (-1.04z)| norm 0.2698 (-0.77z)| lr 1.79e-04 | 323.05 ms | 52.2% bf16 MFU | 1623253 tok/s step 12632/19560 | loss 3.320790 (-1.10z)| norm 0.2761 (-0.42z)| lr 1.79e-04 | 322.99 ms | 
52.3% bf16 MFU | 1623252 tok/s step 12633/19560 | loss 3.405154 (+0.77z)| norm 0.2541 (-1.60z)| lr 1.79e-04 | 322.92 ms | 52.3% bf16 MFU | 1623270 tok/s step 12634/19560 | loss 3.307287 (-1.38z)| norm 0.2702 (-0.72z)| lr 1.79e-04 | 322.93 ms | 52.3% bf16 MFU | 1623282 tok/s step 12635/19560 | loss 3.359483 (-0.22z)| norm 0.2654 (-0.98z)| lr 1.78e-04 | 322.84 ms | 52.3% bf16 MFU | 1623319 tok/s step 12636/19560 | loss 3.353319 (-0.36z)| norm 0.2615 (-1.18z)| lr 1.78e-04 | 323.78 ms | 52.1% bf16 MFU | 1623118 tok/s step 12637/19560 | loss 3.588689 (+4.42z)| norm 0.2853 (+0.10z)| lr 1.78e-04 | 322.72 ms | 52.3% bf16 MFU | 1623190 tok/s step 12638/19560 | loss 3.445369 (+1.48z)| norm 0.2551 (-1.54z)| lr 1.78e-04 | 323.25 ms | 52.2% bf16 MFU | 1623128 tok/s step 12639/19560 | loss 3.336322 (-0.72z)| norm 0.2808 (-0.15z)| lr 1.78e-04 | 322.85 ms | 52.3% bf16 MFU | 1623169 tok/s step 12640/19560 | loss 3.349047 (-0.46z)| norm 0.2674 (-0.90z)| lr 1.78e-04 | 323.01 ms | 52.3% bf16 MFU | 1623168 tok/s step 12641/19560 | loss 3.386092 (+0.28z)| norm 0.3002 (+0.90z)| lr 1.78e-04 | 323.64 ms | 52.1% bf16 MFU | 1623007 tok/s step 12642/19560 | loss 3.388842 (+0.32z)| norm 0.2933 (+0.50z)| lr 1.78e-04 | 322.88 ms | 52.3% bf16 MFU | 1623047 tok/s step 12643/19560 | loss 3.336862 (-0.72z)| norm 0.2713 (-0.74z)| lr 1.78e-04 | 323.86 ms | 52.1% bf16 MFU | 1622838 tok/s step 12644/19560 | loss 3.364201 (-0.16z)| norm 0.2753 (-0.52z)| lr 1.78e-04 | 323.28 ms | 52.2% bf16 MFU | 1622783 tok/s step 12645/19560 | loss 3.332624 (-0.80z)| norm 0.2887 (+0.23z)| lr 1.78e-04 | 322.95 ms | 52.3% bf16 MFU | 1622816 tok/s step 12646/19560 | loss 3.345787 (-0.53z)| norm 0.2792 (-0.31z)| lr 1.78e-04 | 323.38 ms | 52.2% bf16 MFU | 1622740 tok/s step 12647/19560 | loss 3.388291 (+0.32z)| norm 0.3195 (+1.94z)| lr 1.78e-04 | 323.71 ms | 52.1% bf16 MFU | 1622584 tok/s step 12648/19560 | loss 3.415847 (+0.87z)| norm 0.2901 (+0.27z)| lr 1.78e-04 | 323.46 ms | 52.2% bf16 MFU | 1622498 tok/s step 12649/19560 
| loss 3.322598 (-1.04z)| norm 0.2988 (+0.75z)| lr 1.78e-04 | 322.55 ms | 52.3% bf16 MFU | 1622646 tok/s step 12650/19560 | loss 3.373704 (+0.03z)| norm 0.2837 (-0.11z)| lr 1.78e-04 | 323.22 ms | 52.2% bf16 MFU | 1622617 tok/s step 12651/19560 | loss 3.396755 (+0.49z)| norm 0.2898 (+0.24z)| lr 1.78e-04 | 323.93 ms | 52.1% bf16 MFU | 1622413 tok/s step 12652/19560 | loss 3.362393 (-0.23z)| norm 0.2848 (-0.06z)| lr 1.78e-04 | 322.95 ms | 52.3% bf16 MFU | 1622465 tok/s step 12653/19560 | loss 3.330345 (-0.88z)| norm 0.2693 (-0.93z)| lr 1.78e-04 | 323.58 ms | 52.2% bf16 MFU | 1622355 tok/s step 12654/19560 | loss 3.374210 (+0.02z)| norm 0.2806 (-0.29z)| lr 1.78e-04 | 323.28 ms | 52.2% bf16 MFU | 1622327 tok/s step 12655/19560 | loss 3.382211 (+0.19z)| norm 0.2713 (-0.81z)| lr 1.78e-04 | 322.68 ms | 52.3% bf16 MFU | 1622451 tok/s step 12656/19560 | loss 3.371849 (-0.04z)| norm 0.2872 (+0.09z)| lr 1.78e-04 | 323.16 ms | 52.2% bf16 MFU | 1622447 tok/s step 12657/19560 | loss 3.339768 (-0.71z)| norm 0.2783 (-0.42z)| lr 1.77e-04 | 323.72 ms | 52.1% bf16 MFU | 1622303 tok/s step 12658/19560 | loss 3.381070 (+0.16z)| norm 0.2821 (-0.21z)| lr 1.77e-04 | 322.82 ms | 52.3% bf16 MFU | 1622392 tok/s step 12659/19560 | loss 3.381829 (+0.17z)| norm 0.2524 (-1.87z)| lr 1.77e-04 | 323.29 ms | 52.2% bf16 MFU | 1622359 tok/s step 12660/19560 | loss 3.364977 (-0.19z)| norm 0.2600 (-1.43z)| lr 1.77e-04 | 322.99 ms | 52.3% bf16 MFU | 1622402 tok/s step 12661/19560 | loss 3.325653 (-1.01z)| norm 0.2843 (-0.06z)| lr 1.77e-04 | 323.22 ms | 52.2% bf16 MFU | 1622387 tok/s step 12662/19560 | loss 3.351732 (-0.45z)| norm 0.2642 (-1.18z)| lr 1.77e-04 | 323.06 ms | 52.2% bf16 MFU | 1622413 tok/s step 12663/19560 | loss 3.372595 (+0.01z)| norm 0.2808 (-0.25z)| lr 1.77e-04 | 322.70 ms | 52.3% bf16 MFU | 1622527 tok/s step 12664/19560 | loss 3.338334 (-0.73z)| norm 0.2655 (-1.10z)| lr 1.77e-04 | 323.38 ms | 52.2% bf16 MFU | 1622464 tok/s step 12665/19560 | loss 3.351244 (-0.47z)| norm 0.2910 (+0.32z)| 
lr 1.77e-04 | 322.89 ms | 52.3% bf16 MFU | 1622528 tok/s step 12666/19560 | loss 3.411908 (+0.86z)| norm 0.2770 (-0.47z)| lr 1.77e-04 | 323.47 ms | 52.2% bf16 MFU | 1622444 tok/s step 12667/19560 | loss 3.322962 (-1.09z)| norm 0.2731 (-0.68z)| lr 1.77e-04 | 323.30 ms | 52.2% bf16 MFU | 1622406 tok/s step 12668/19560 | loss 3.337188 (-0.78z)| norm 0.2548 (-1.67z)| lr 1.77e-04 | 323.40 ms | 52.2% bf16 MFU | 1622344 tok/s step 12669/19560 | loss 3.321909 (-1.10z)| norm 0.2645 (-1.12z)| lr 1.77e-04 | 323.09 ms | 52.2% bf16 MFU | 1622364 tok/s step 12670/19560 | loss 3.382960 (+0.24z)| norm 0.2722 (-0.69z)| lr 1.77e-04 | 322.61 ms | 52.3% bf16 MFU | 1622503 tok/s step 12671/19560 | loss 3.408882 (+0.81z)| norm 0.2690 (-0.85z)| lr 1.77e-04 | 323.53 ms | 52.2% bf16 MFU | 1622405 tok/s step 12672/19560 | loss 3.373940 (+0.04z)| norm 0.2601 (-1.33z)| lr 1.77e-04 | 322.72 ms | 52.3% bf16 MFU | 1622515 tok/s step 12673/19560 | loss 3.317054 (-1.19z)| norm 0.2813 (-0.17z)| lr 1.77e-04 | 323.03 ms | 52.2% bf16 MFU | 1622541 tok/s step 12674/19560 | loss 3.493832 (+2.60z)| norm 0.2640 (-1.10z)| lr 1.77e-04 | 322.77 ms | 52.3% bf16 MFU | 1622632 tok/s step 12675/19560 | loss 3.385758 (+0.29z)| norm 0.2819 (-0.12z)| lr 1.77e-04 | 322.98 ms | 52.3% bf16 MFU | 1622665 tok/s step 12676/19560 | loss 3.337005 (-0.77z)| norm 0.2514 (-1.75z)| lr 1.77e-04 | 323.38 ms | 52.2% bf16 MFU | 1622597 tok/s step 12677/19560 | loss 3.429588 (+1.21z)| norm 0.2730 (-0.57z)| lr 1.77e-04 | 323.00 ms | 52.3% bf16 MFU | 1622627 tok/s step 12678/19560 | loss 3.352154 (-0.45z)| norm 0.2707 (-0.69z)| lr 1.77e-04 | 322.47 ms | 52.3% bf16 MFU | 1622789 tok/s step 12679/19560 | loss 3.364305 (-0.17z)| norm 0.2563 (-1.45z)| lr 1.76e-04 | 322.52 ms | 52.3% bf16 MFU | 1622930 tok/s step 12680/19560 | loss 3.337254 (-0.77z)| norm 0.2705 (-0.68z)| lr 1.76e-04 | 323.14 ms | 52.2% bf16 MFU | 1622908 tok/s step 12681/19560 | loss 3.365562 (-0.14z)| norm 0.3114 (+1.53z)| lr 1.76e-04 | 323.04 ms | 52.2% bf16 MFU | 
1622912 tok/s step 12682/19560 | loss 3.323144 (-1.07z)| norm 0.3078 (+1.32z)| lr 1.76e-04 | 322.70 ms | 52.3% bf16 MFU | 1623000 tok/s step 12683/19560 | loss 3.386126 (+0.33z)| norm 0.2749 (-0.45z)| lr 1.76e-04 | 322.55 ms | 52.3% bf16 MFU | 1623124 tok/s step 12684/19560 | loss 3.320489 (-1.12z)| norm 0.3290 (+2.39z)| lr 1.76e-04 | 323.11 ms | 52.2% bf16 MFU | 1623099 tok/s step 12685/19560 | loss 3.346438 (-0.54z)| norm 0.3022 (+0.96z)| lr 1.76e-04 | 322.80 ms | 52.3% bf16 MFU | 1623154 tok/s step 12686/19560 | loss 3.368522 (-0.04z)| norm 0.3023 (+0.97z)| lr 1.76e-04 | 323.03 ms | 52.2% bf16 MFU | 1623147 tok/s step 12687/19560 | loss 3.337900 (-0.72z)| norm 0.3035 (+1.02z)| lr 1.76e-04 | 322.95 ms | 52.3% bf16 MFU | 1623162 tok/s step 12688/19560 | loss 3.371196 (+0.01z)| norm 0.2930 (+0.46z)| lr 1.76e-04 | 322.84 ms | 52.3% bf16 MFU | 1623204 tok/s step 12689/19560 | loss 3.362803 (-0.18z)| norm 0.3279 (+2.24z)| lr 1.76e-04 | 322.76 ms | 52.3% bf16 MFU | 1623264 tok/s step 12690/19560 | loss 3.320982 (-1.10z)| norm 0.2922 (+0.39z)| lr 1.76e-04 | 322.84 ms | 52.3% bf16 MFU | 1623299 tok/s step 12691/19560 | loss 3.403399 (+0.74z)| norm 0.2906 (+0.29z)| lr 1.76e-04 | 322.70 ms | 52.3% bf16 MFU | 1623369 tok/s step 12692/19560 | loss 3.347940 (-0.50z)| norm 0.3043 (+0.99z)| lr 1.76e-04 | 323.20 ms | 52.2% bf16 MFU | 1623308 tok/s step 12693/19560 | loss 3.273614 (-2.11z)| norm 0.2730 (-0.62z)| lr 1.76e-04 | 322.91 ms | 52.3% bf16 MFU | 1623326 tok/s step 12694/19560 | loss 3.398502 (+0.63z)| norm 0.3030 (+0.92z)| lr 1.76e-04 | 322.56 ms | 52.3% bf16 MFU | 1623428 tok/s step 12695/19560 | loss 3.319339 (-1.09z)| norm 0.2818 (-0.18z)| lr 1.76e-04 | 322.92 ms | 52.3% bf16 MFU | 1623437 tok/s step 12696/19560 | loss 3.336508 (-0.72z)| norm 0.2730 (-0.65z)| lr 1.76e-04 | 322.78 ms | 52.3% bf16 MFU | 1623479 tok/s step 12697/19560 | loss 3.377958 (+0.17z)| norm 0.2746 (-0.57z)| lr 1.76e-04 | 322.48 ms | 52.3% bf16 MFU | 1623596 tok/s step 12698/19560 | loss 3.339671 
(-0.68z)| norm 0.2581 (-1.42z)| lr 1.76e-04 | 323.01 ms | 52.3% bf16 MFU | 1623574 tok/s step 12699/19560 | loss 3.340447 (-0.65z)| norm 0.2782 (-0.38z)| lr 1.76e-04 | 322.82 ms | 52.3% bf16 MFU | 1623600 tok/s step 12700/19560 | loss 3.298674 (-1.55z)| norm 0.2729 (-0.66z)| lr 1.76e-04 | 323.14 ms | 52.2% bf16 MFU | 1623544 tok/s step 12701/19560 | loss 3.433329 (+1.40z)| norm 0.3332 (+2.44z)| lr 1.75e-04 | 322.71 ms | 52.3% bf16 MFU | 1623598 tok/s step 12702/19560 | loss 3.419287 (+1.09z)| norm 0.3234 (+1.89z)| lr 1.75e-04 | 322.61 ms | 52.3% bf16 MFU | 1623675 tok/s step 12703/19560 | loss 3.443274 (+1.60z)| norm 0.2979 (+0.59z)| lr 1.75e-04 | 322.62 ms | 52.3% bf16 MFU | 1623747 tok/s step 12704/19560 | loss 3.448734 (+1.78z)| norm 0.2771 (-0.48z)| lr 1.75e-04 | 323.32 ms | 52.2% bf16 MFU | 1623640 tok/s step 12705/19560 | loss 3.350915 (-0.41z)| norm 0.2920 (+0.28z)| lr 1.75e-04 | 322.98 ms | 52.3% bf16 MFU | 1623623 tok/s step 12706/19560 | loss 3.332241 (-0.83z)| norm 0.2901 (+0.19z)| lr 1.75e-04 | 322.60 ms | 52.3% bf16 MFU | 1623702 tok/s step 12707/19560 | loss 3.368881 (-0.01z)| norm 0.2698 (-0.84z)| lr 1.75e-04 | 322.88 ms | 52.3% bf16 MFU | 1623706 tok/s step 12708/19560 | loss 3.274854 (-2.09z)| norm 0.2886 (+0.12z)| lr 1.75e-04 | 323.18 ms | 52.2% bf16 MFU | 1623636 tok/s step 12709/19560 | loss 3.293517 (-1.65z)| norm 0.2830 (-0.17z)| lr 1.75e-04 | 322.78 ms | 52.3% bf16 MFU | 1623670 tok/s step 12710/19560 | loss 3.343036 (-0.55z)| norm 0.2758 (-0.54z)| lr 1.75e-04 | 322.52 ms | 52.3% bf16 MFU | 1623766 tok/s step 12711/19560 | loss 3.330394 (-0.82z)| norm 0.2888 (+0.12z)| lr 1.75e-04 | 322.98 ms | 52.3% bf16 MFU | 1623742 tok/s step 12712/19560 | loss 3.342337 (-0.55z)| norm 0.2817 (-0.25z)| lr 1.75e-04 | 322.65 ms | 52.3% bf16 MFU | 1623801 tok/s step 12713/19560 | loss 3.372405 (+0.12z)| norm 0.2719 (-0.76z)| lr 1.75e-04 | 322.87 ms | 52.3% bf16 MFU | 1623804 tok/s step 12714/19560 | loss 3.364581 (-0.05z)| norm 0.3272 (+2.05z)| lr 1.75e-04 | 
322.65 ms | 52.3% bf16 MFU | 1623861 tok/s step 12715/19560 | loss 3.313624 (-1.16z)| norm 0.2895 (+0.11z)| lr 1.75e-04 | 322.93 ms | 52.3% bf16 MFU | 1623845 tok/s step 12716/19560 | loss 3.367673 (+0.04z)| norm 0.3399 (+2.60z)| lr 1.75e-04 | 322.43 ms | 52.3% bf16 MFU | 1623954 tok/s step 12717/19560 | loss 3.438736 (+1.58z)| norm 0.2759 (-0.59z)| lr 1.75e-04 | 322.57 ms | 52.3% bf16 MFU | 1624025 tok/s step 12718/19560 | loss 3.383949 (+0.38z)| norm 0.2837 (-0.21z)| lr 1.75e-04 | 323.10 ms | 52.2% bf16 MFU | 1623958 tok/s step 12719/19560 | loss 3.329483 (-0.81z)| norm 0.2742 (-0.68z)| lr 1.75e-04 | 322.85 ms | 52.3% bf16 MFU | 1623958 tok/s step 12720/19560 | loss 3.400661 (+0.79z)| norm 0.2964 (+0.46z)| lr 1.75e-04 | 322.99 ms | 52.3% bf16 MFU | 1623922 tok/s step 12721/19560 | loss 3.421280 (+1.24z)| norm 0.3000 (+0.64z)| lr 1.75e-04 | 322.82 ms | 52.3% bf16 MFU | 1623930 tok/s step 12722/19560 | loss 3.393961 (+0.62z)| norm 0.3120 (+1.24z)| lr 1.75e-04 | 323.29 ms | 52.2% bf16 MFU | 1623819 tok/s step 12723/19560 | loss 3.333270 (-0.73z)| norm 0.2920 (+0.22z)| lr 1.74e-04 | 322.08 ms | 52.4% bf16 MFU | 1624021 tok/s step 12724/19560 | loss 3.389295 (+0.51z)| norm 0.3251 (+1.87z)| lr 1.74e-04 | 322.97 ms | 52.3% bf16 MFU | 1623986 tok/s step 12725/19560 | loss 3.347388 (-0.42z)| norm 0.2934 (+0.28z)| lr 1.74e-04 | 322.72 ms | 52.3% bf16 MFU | 1624016 tok/s step 12726/19560 | loss 3.311545 (-1.21z)| norm 0.2965 (+0.43z)| lr 1.74e-04 | 322.87 ms | 52.3% bf16 MFU | 1624005 tok/s step 12727/19560 | loss 3.366612 (+0.01z)| norm 0.3063 (+0.91z)| lr 1.74e-04 | 322.89 ms | 52.3% bf16 MFU | 1623992 tok/s step 12728/19560 | loss 3.414501 (+1.08z)| norm 0.2663 (-1.08z)| lr 1.74e-04 | 322.67 ms | 52.3% bf16 MFU | 1624034 tok/s step 12729/19560 | loss 3.358295 (-0.18z)| norm 0.3506 (+3.07z)| lr 1.74e-04 | 323.30 ms | 52.2% bf16 MFU | 1623916 tok/s step 12730/19560 | loss 3.341164 (-0.55z)| norm 0.3059 (+0.88z)| lr 1.74e-04 | 322.62 ms | 52.3% bf16 MFU | 1623975 tok/s step 
12731/19560 | loss 3.320783 (-0.99z)| norm 0.3059 (+0.88z)| lr 1.74e-04 | 323.12 ms | 52.2% bf16 MFU | 1623905 tok/s step 12732/19560 | loss 3.342038 (-0.51z)| norm 0.3000 (+0.60z)| lr 1.74e-04 | 323.18 ms | 52.2% bf16 MFU | 1623823 tok/s step 12733/19560 | loss 3.362096 (-0.07z)| norm 0.2808 (-0.34z)| lr 1.74e-04 | 322.62 ms | 52.3% bf16 MFU | 1623886 tok/s step 12734/19560 | loss 3.425498 (+1.34z)| norm 0.2832 (-0.22z)| lr 1.74e-04 | 322.73 ms | 52.3% bf16 MFU | 1623919 tok/s step 12735/19560 | loss 3.376431 (+0.27z)| norm 0.2701 (-0.86z)| lr 1.74e-04 | 322.95 ms | 52.3% bf16 MFU | 1623894 tok/s step 12736/19560 | loss 3.351938 (-0.28z)| norm 0.2694 (-0.88z)| lr 1.74e-04 | 323.49 ms | 52.2% bf16 MFU | 1623737 tok/s step 12737/19560 | loss 3.413030 (+1.12z)| norm 0.2639 (-1.14z)| lr 1.74e-04 | 323.16 ms | 52.2% bf16 MFU | 1623669 tok/s step 12738/19560 | loss 3.358356 (-0.13z)| norm 0.2617 (-1.23z)| lr 1.74e-04 | 322.88 ms | 52.3% bf16 MFU | 1623675 tok/s step 12739/19560 | loss 3.361107 (-0.07z)| norm 0.2548 (-1.54z)| lr 1.74e-04 | 322.60 ms | 52.3% bf16 MFU | 1623752 tok/s step 12740/19560 | loss 3.397143 (+0.74z)| norm 0.2510 (-1.70z)| lr 1.74e-04 | 322.59 ms | 52.3% bf16 MFU | 1623827 tok/s step 12741/19560 | loss 3.445978 (+1.83z)| norm 0.3056 (+0.96z)| lr 1.74e-04 | 323.27 ms | 52.2% bf16 MFU | 1623726 tok/s step 12742/19560 | loss 3.352269 (-0.31z)| norm 0.2779 (-0.38z)| lr 1.74e-04 | 323.36 ms | 52.2% bf16 MFU | 1623609 tok/s step 12743/19560 | loss 3.369574 (+0.09z)| norm 0.2545 (-1.52z)| lr 1.74e-04 | 323.04 ms | 52.2% bf16 MFU | 1623578 tok/s step 12744/19560 | loss 3.419126 (+1.21z)| norm 0.2823 (-0.13z)| lr 1.74e-04 | 323.66 ms | 52.1% bf16 MFU | 1623393 tok/s step 12745/19560 | loss 3.358156 (-0.19z)| norm 0.2678 (-0.84z)| lr 1.73e-04 | 322.74 ms | 52.3% bf16 MFU | 1623448 tok/s step 12746/19560 | loss 3.405358 (+0.90z)| norm 0.2773 (-0.35z)| lr 1.73e-04 | 323.10 ms | 52.2% bf16 MFU | 1623409 tok/s step 12747/19560 | loss 3.324502 (-0.94z)| norm 
0.2632 (-1.05z)| lr 1.73e-04 | 322.89 ms | 52.3% bf16 MFU | 1623425 tok/s
step 12748/19560 | loss 3.352284 (-0.31z)| norm 0.2681 (-0.80z)| lr 1.73e-04 | 322.66 ms | 52.3% bf16 MFU | 1623498 tok/s
step 12749/19560 | loss 3.301863 (-1.48z)| norm 0.2602 (-1.18z)| lr 1.73e-04 | 322.79 ms | 52.3% bf16 MFU | 1623536 tok/s
step 12750/19560 | loss 3.362746 (-0.06z)| norm 0.2618 (-1.10z)| lr 1.73e-04 | 322.62 ms | 52.3% bf16 MFU | 1623613 tok/s
val loss 3.338635
evaluating HellaSwag: 0/79
evaluating HellaSwag: 10/79
evaluating HellaSwag: 20/79
evaluating HellaSwag: 30/79
evaluating HellaSwag: 40/79
evaluating HellaSwag: 50/79
evaluating HellaSwag: 60/79
evaluating HellaSwag: 70/79
HellaSwag: 2992/10042 = 0.297949
step 12751/19560 | loss 3.321074 (-1.01z)| norm 0.2685 (-0.74z)| lr 1.73e-04 | 322.84 ms | 52.3% bf16 MFU | 1623631 tok/s
step 12752/19560 | loss 3.349110 (-0.35z)| norm 0.2570 (-1.31z)| lr 1.73e-04 | 323.15 ms | 52.2% bf16 MFU | 1623571 tok/s
step 12753/19560 | loss 3.305482 (-1.35z)| norm 0.2601 (-1.13z)| lr 1.73e-04 | 322.93 ms | 52.3% bf16 MFU | 1623569 tok/s
step 12754/19560 | loss 3.342967 (-0.48z)| norm 0.2657 (-0.84z)| lr 1.73e-04 | 323.26 ms | 52.2% bf16 MFU | 1623483 tok/s
step 12755/19560 | loss 3.402103 (+0.87z)| norm 0.2693 (-0.65z)| lr 1.73e-04 | 322.72 ms | 52.3% bf16 MFU | 1623538 tok/s
step 12756/19560 | loss 3.385600 (+0.49z)| norm 0.2655 (-0.83z)| lr 1.73e-04 | 323.57 ms | 52.2% bf16 MFU | 1623376 tok/s
step 12757/19560 | loss 3.336638 (-0.64z)| norm 0.2663 (-0.79z)| lr 1.73e-04 | 322.97 ms | 52.3% bf16 MFU | 1623375 tok/s
step 12758/19560 | loss 3.372143 (+0.18z)| norm 0.2732 (-0.44z)| lr 1.73e-04 | 323.35 ms | 52.2% bf16 MFU | 1623279 tok/s
step 12759/19560 | loss 3.337937 (-0.61z)| norm 0.2782 (-0.19z)| lr 1.73e-04 | 323.25 ms | 52.2% bf16 MFU | 1623212 tok/s
step 12760/19560 | loss 3.355800 (-0.21z)| norm 0.2639 (-0.91z)| lr 1.73e-04 | 322.93 ms | 52.3% bf16 MFU | 1623229 tok/s
step 12761/19560 | loss 3.392302 (+0.65z)| norm 0.2487 (-1.67z)|
lr 1.73e-04 | 322.94 ms | 52.3% bf16 MFU | 1623243 tok/s step 12762/19560 | loss 3.336947 (-0.65z)| norm 0.2694 (-0.62z)| lr 1.73e-04 | 322.99 ms | 52.3% bf16 MFU | 1623242 tok/s step 12763/19560 | loss 3.326664 (-0.89z)| norm 0.2609 (-1.05z)| lr 1.73e-04 | 323.62 ms | 52.2% bf16 MFU | 1623085 tok/s step 12764/19560 | loss 3.351354 (-0.31z)| norm 0.2611 (-1.04z)| lr 1.73e-04 | 323.00 ms | 52.3% bf16 MFU | 1623090 tok/s step 12765/19560 | loss 3.336415 (-0.69z)| norm 0.2538 (-1.38z)| lr 1.73e-04 | 322.90 ms | 52.3% bf16 MFU | 1623120 tok/s step 12766/19560 | loss 3.354888 (-0.19z)| norm 0.2543 (-1.36z)| lr 1.73e-04 | 322.97 ms | 52.3% bf16 MFU | 1623130 tok/s step 12767/19560 | loss 3.320436 (-1.11z)| norm 0.2614 (-0.99z)| lr 1.72e-04 | 323.22 ms | 52.2% bf16 MFU | 1623077 tok/s step 12768/19560 | loss 3.275859 (-2.24z)| norm 0.2596 (-1.07z)| lr 1.72e-04 | 322.96 ms | 52.3% bf16 MFU | 1623092 tok/s step 12769/19560 | loss 3.319344 (-1.08z)| norm 0.2856 (+0.21z)| lr 1.72e-04 | 323.03 ms | 52.2% bf16 MFU | 1623089 tok/s step 12770/19560 | loss 3.345730 (-0.38z)| norm 0.2838 (+0.13z)| lr 1.72e-04 | 322.72 ms | 52.3% bf16 MFU | 1623163 tok/s step 12771/19560 | loss 3.345754 (-0.38z)| norm 0.2690 (-0.61z)| lr 1.72e-04 | 322.54 ms | 52.3% bf16 MFU | 1623280 tok/s step 12772/19560 | loss 3.343257 (-0.45z)| norm 0.2751 (-0.30z)| lr 1.72e-04 | 322.92 ms | 52.3% bf16 MFU | 1623296 tok/s step 12773/19560 | loss 3.359570 (-0.02z)| norm 0.2652 (-0.78z)| lr 1.72e-04 | 323.01 ms | 52.3% bf16 MFU | 1623288 tok/s step 12774/19560 | loss 3.393395 (+0.86z)| norm 0.2689 (-0.59z)| lr 1.72e-04 | 322.59 ms | 52.3% bf16 MFU | 1623386 tok/s step 12775/19560 | loss 3.344192 (-0.43z)| norm 0.2571 (-1.16z)| lr 1.72e-04 | 323.14 ms | 52.2% bf16 MFU | 1623342 tok/s step 12776/19560 | loss 3.414279 (+1.42z)| norm 0.2765 (-0.19z)| lr 1.72e-04 | 323.22 ms | 52.2% bf16 MFU | 1623279 tok/s step 12777/19560 | loss 3.396237 (+0.93z)| norm 0.2676 (-0.62z)| lr 1.72e-04 | 323.26 ms | 52.2% bf16 MFU | 
1623208 tok/s step 12778/19560 | loss 3.383690 (+0.59z)| norm 0.2562 (-1.18z)| lr 1.72e-04 | 322.37 ms | 52.4% bf16 MFU | 1623366 tok/s step 12779/19560 | loss 3.359922 (-0.02z)| norm 0.2629 (-0.83z)| lr 1.72e-04 | 322.57 ms | 52.3% bf16 MFU | 1623466 tok/s step 12780/19560 | loss 3.349768 (-0.29z)| norm 0.2642 (-0.76z)| lr 1.72e-04 | 323.31 ms | 52.2% bf16 MFU | 1623375 tok/s step 12781/19560 | loss 3.354527 (-0.17z)| norm 0.2769 (-0.13z)| lr 1.72e-04 | 323.03 ms | 52.2% bf16 MFU | 1623358 tok/s step 12782/19560 | loss 3.334883 (-0.68z)| norm 0.2647 (-0.73z)| lr 1.72e-04 | 322.56 ms | 52.3% bf16 MFU | 1623460 tok/s step 12783/19560 | loss 3.369936 (+0.25z)| norm 0.3077 (+1.37z)| lr 1.72e-04 | 322.76 ms | 52.3% bf16 MFU | 1623506 tok/s step 12784/19560 | loss 3.350720 (-0.25z)| norm 0.2604 (-0.93z)| lr 1.72e-04 | 322.56 ms | 52.3% bf16 MFU | 1623601 tok/s step 12785/19560 | loss 3.309727 (-1.32z)| norm 0.3245 (+2.14z)| lr 1.72e-04 | 323.02 ms | 52.2% bf16 MFU | 1623574 tok/s step 12786/19560 | loss 3.359282 (-0.02z)| norm 0.2722 (-0.36z)| lr 1.72e-04 | 323.15 ms | 52.2% bf16 MFU | 1623517 tok/s step 12787/19560 | loss 3.352383 (-0.19z)| norm 0.3243 (+2.08z)| lr 1.72e-04 | 323.54 ms | 52.2% bf16 MFU | 1623366 tok/s step 12788/19560 | loss 3.377707 (+0.47z)| norm 0.2956 (+0.71z)| lr 1.72e-04 | 322.68 ms | 52.3% bf16 MFU | 1623437 tok/s step 12789/19560 | loss 3.348397 (-0.31z)| norm 0.2963 (+0.74z)| lr 1.71e-04 | 323.40 ms | 52.2% bf16 MFU | 1623325 tok/s step 12790/19560 | loss 3.371620 (+0.30z)| norm 0.2804 (-0.02z)| lr 1.71e-04 | 322.85 ms | 52.3% bf16 MFU | 1623356 tok/s step 12791/19560 | loss 3.350203 (-0.26z)| norm 0.2823 (+0.07z)| lr 1.71e-04 | 322.93 ms | 52.3% bf16 MFU | 1623365 tok/s step 12792/19560 | loss 3.385792 (+0.67z)| norm 0.2916 (+0.50z)| lr 1.71e-04 | 323.55 ms | 52.2% bf16 MFU | 1623217 tok/s step 12793/19560 | loss 3.309868 (-1.31z)| norm 0.2754 (-0.27z)| lr 1.71e-04 | 322.80 ms | 52.3% bf16 MFU | 1623264 tok/s step 12794/19560 | loss 3.322360 
(-0.97z)| norm 0.2904 (+0.45z)| lr 1.71e-04 | 323.45 ms | 52.2% bf16 MFU | 1623147 tok/s step 12795/19560 | loss 3.291427 (-1.76z)| norm 0.2796 (-0.07z)| lr 1.71e-04 | 322.90 ms | 52.3% bf16 MFU | 1623174 tok/s step 12796/19560 | loss 3.322289 (-0.95z)| norm 0.2867 (+0.26z)| lr 1.71e-04 | 323.63 ms | 52.1% bf16 MFU | 1623016 tok/s step 12797/19560 | loss 3.363019 (+0.10z)| norm 0.2618 (-0.93z)| lr 1.71e-04 | 323.02 ms | 52.2% bf16 MFU | 1623019 tok/s step 12798/19560 | loss 3.408757 (+1.28z)| norm 0.2756 (-0.28z)| lr 1.71e-04 | 322.49 ms | 52.3% bf16 MFU | 1623155 tok/s step 12799/19560 | loss 3.404733 (+1.18z)| norm 0.3023 (+0.99z)| lr 1.71e-04 | 323.10 ms | 52.2% bf16 MFU | 1623132 tok/s step 12800/19560 | loss 3.327591 (-0.82z)| norm 0.2678 (-0.67z)| lr 1.71e-04 | 323.93 ms | 52.1% bf16 MFU | 1622901 tok/s step 12801/19560 | loss 3.315552 (-1.13z)| norm 0.2868 (+0.24z)| lr 1.71e-04 | 323.03 ms | 52.2% bf16 MFU | 1622908 tok/s step 12802/19560 | loss 3.332793 (-0.68z)| norm 0.2674 (-0.69z)| lr 1.71e-04 | 323.72 ms | 52.1% bf16 MFU | 1622740 tok/s step 12803/19560 | loss 3.353339 (-0.11z)| norm 0.3153 (+1.58z)| lr 1.71e-04 | 322.41 ms | 52.3% bf16 MFU | 1622911 tok/s step 12804/19560 | loss 3.349325 (-0.23z)| norm 0.3005 (+0.86z)| lr 1.71e-04 | 323.45 ms | 52.2% bf16 MFU | 1622812 tok/s step 12805/19560 | loss 3.344522 (-0.35z)| norm 0.2843 (+0.09z)| lr 1.71e-04 | 323.08 ms | 52.2% bf16 MFU | 1622811 tok/s step 12806/19560 | loss 3.294454 (-1.70z)| norm 0.2862 (+0.17z)| lr 1.71e-04 | 322.82 ms | 52.3% bf16 MFU | 1622876 tok/s step 12807/19560 | loss 3.386494 (+0.82z)| norm 0.2891 (+0.30z)| lr 1.71e-04 | 323.17 ms | 52.2% bf16 MFU | 1622848 tok/s step 12808/19560 | loss 3.404604 (+1.29z)| norm 0.2965 (+0.64z)| lr 1.71e-04 | 322.97 ms | 52.3% bf16 MFU | 1622871 tok/s step 12809/19560 | loss 3.372216 (+0.41z)| norm 0.2818 (-0.05z)| lr 1.71e-04 | 322.71 ms | 52.3% bf16 MFU | 1622961 tok/s step 12810/19560 | loss 3.399821 (+1.14z)| norm 0.2841 (+0.07z)| lr 1.71e-04 | 
323.32 ms | 52.2% bf16 MFU | 1622892 tok/s step 12811/19560 | loss 3.378600 (+0.56z)| norm 0.2731 (-0.46z)| lr 1.70e-04 | 323.47 ms | 52.2% bf16 MFU | 1622790 tok/s step 12812/19560 | loss 3.378435 (+0.55z)| norm 0.3026 (+1.00z)| lr 1.70e-04 | 322.73 ms | 52.3% bf16 MFU | 1622878 tok/s step 12813/19560 | loss 3.395978 (+1.01z)| norm 0.2753 (-0.35z)| lr 1.70e-04 | 323.09 ms | 52.2% bf16 MFU | 1622872 tok/s step 12814/19560 | loss 3.304538 (-1.44z)| norm 0.2766 (-0.27z)| lr 1.70e-04 | 323.42 ms | 52.2% bf16 MFU | 1622783 tok/s step 12815/19560 | loss 3.390567 (+0.86z)| norm 0.3230 (+2.01z)| lr 1.70e-04 | 323.07 ms | 52.2% bf16 MFU | 1622785 tok/s step 12816/19560 | loss 3.323518 (-0.93z)| norm 0.3027 (+1.01z)| lr 1.70e-04 | 323.32 ms | 52.2% bf16 MFU | 1622724 tok/s step 12817/19560 | loss 3.309188 (-1.29z)| norm 0.3005 (+0.92z)| lr 1.70e-04 | 324.31 ms | 52.0% bf16 MFU | 1622418 tok/s step 12818/19560 | loss 3.381931 (+0.63z)| norm 0.2995 (+0.87z)| lr 1.70e-04 | 322.76 ms | 52.3% bf16 MFU | 1622517 tok/s step 12819/19560 | loss 3.309758 (-1.27z)| norm 0.3135 (+1.55z)| lr 1.70e-04 | 323.13 ms | 52.2% bf16 MFU | 1622519 tok/s step 12820/19560 | loss 3.343984 (-0.36z)| norm 0.2976 (+0.76z)| lr 1.70e-04 | 323.40 ms | 52.2% bf16 MFU | 1622453 tok/s step 12821/19560 | loss 3.332915 (-0.68z)| norm 0.2830 (+0.04z)| lr 1.70e-04 | 323.45 ms | 52.2% bf16 MFU | 1622376 tok/s step 12822/19560 | loss 3.340649 (-0.46z)| norm 0.2844 (+0.11z)| lr 1.70e-04 | 323.13 ms | 52.2% bf16 MFU | 1622383 tok/s step 12823/19560 | loss 3.321648 (-0.98z)| norm 0.2935 (+0.56z)| lr 1.70e-04 | 323.53 ms | 52.2% bf16 MFU | 1622290 tok/s step 12824/19560 | loss 3.347668 (-0.27z)| norm 0.3026 (+1.00z)| lr 1.70e-04 | 322.80 ms | 52.3% bf16 MFU | 1622385 tok/s step 12825/19560 | loss 3.365653 (+0.22z)| norm 0.2832 (+0.03z)| lr 1.70e-04 | 323.81 ms | 52.1% bf16 MFU | 1622222 tok/s step 12826/19560 | loss 3.398651 (+1.11z)| norm 0.2896 (+0.34z)| lr 1.70e-04 | 322.90 ms | 52.3% bf16 MFU | 1622295 tok/s step 
12827/19560 | loss 3.299766 (-1.56z)| norm 0.2998 (+0.84z)| lr 1.70e-04 | 322.94 ms | 52.3% bf16 MFU | 1622354 tok/s step 12828/19560 | loss 3.347747 (-0.28z)| norm 0.2754 (-0.38z)| lr 1.70e-04 | 323.84 ms | 52.1% bf16 MFU | 1622185 tok/s step 12829/19560 | loss 3.320119 (-1.02z)| norm 0.3059 (+1.18z)| lr 1.70e-04 | 322.87 ms | 52.3% bf16 MFU | 1622267 tok/s step 12830/19560 | loss 3.293153 (-1.74z)| norm 0.2761 (-0.33z)| lr 1.70e-04 | 323.23 ms | 52.2% bf16 MFU | 1622256 tok/s step 12831/19560 | loss 3.332291 (-0.65z)| norm 0.2759 (-0.33z)| lr 1.70e-04 | 323.37 ms | 52.2% bf16 MFU | 1622208 tok/s step 12832/19560 | loss 3.362396 (+0.23z)| norm 0.3136 (+1.60z)| lr 1.70e-04 | 323.02 ms | 52.2% bf16 MFU | 1622252 tok/s step 12833/19560 | loss 3.335119 (-0.56z)| norm 0.2696 (-0.65z)| lr 1.69e-04 | 323.60 ms | 52.2% bf16 MFU | 1622149 tok/s step 12834/19560 | loss 3.370581 (+0.46z)| norm 0.3272 (+2.24z)| lr 1.69e-04 | 323.24 ms | 52.2% bf16 MFU | 1622140 tok/s step 12835/19560 | loss 3.311179 (-1.24z)| norm 0.2845 (+0.09z)| lr 1.69e-04 | 323.20 ms | 52.2% bf16 MFU | 1622142 tok/s step 12836/19560 | loss 3.358840 (+0.11z)| norm 0.3352 (+2.55z)| lr 1.69e-04 | 323.34 ms | 52.2% bf16 MFU | 1622108 tok/s step 12837/19560 | loss 3.334000 (-0.63z)| norm 0.2627 (-0.99z)| lr 1.69e-04 | 323.14 ms | 52.2% bf16 MFU | 1622126 tok/s step 12838/19560 | loss 3.362798 (+0.22z)| norm 0.3181 (+1.68z)| lr 1.69e-04 | 323.75 ms | 52.1% bf16 MFU | 1621991 tok/s step 12839/19560 | loss 3.359040 (+0.10z)| norm 0.2789 (-0.21z)| lr 1.69e-04 | 323.25 ms | 52.2% bf16 MFU | 1621988 tok/s step 12840/19560 | loss 3.296512 (-1.74z)| norm 0.2741 (-0.44z)| lr 1.69e-04 | 323.16 ms | 52.2% bf16 MFU | 1622009 tok/s step 12841/19560 | loss 3.317036 (-1.11z)| norm 0.2948 (+0.56z)| lr 1.69e-04 | 323.57 ms | 52.2% bf16 MFU | 1621924 tok/s step 12842/19560 | loss 3.321392 (-0.97z)| norm 0.2608 (-1.08z)| lr 1.69e-04 | 323.18 ms | 52.2% bf16 MFU | 1621941 tok/s step 12843/19560 | loss 3.364950 (+0.29z)| norm 
0.2925 (+0.47z)| lr 1.69e-04 | 322.84 ms | 52.3% bf16 MFU | 1622043 tok/s step 12844/19560 | loss 3.404977 (+1.45z)| norm 0.2830 (+0.03z)| lr 1.69e-04 | 322.81 ms | 52.3% bf16 MFU | 1622148 tok/s step 12845/19560 | loss 3.344563 (-0.30z)| norm 0.2739 (-0.43z)| lr 1.69e-04 | 323.67 ms | 52.1% bf16 MFU | 1622033 tok/s step 12846/19560 | loss 3.349336 (-0.15z)| norm 0.2694 (-0.65z)| lr 1.69e-04 | 323.20 ms | 52.2% bf16 MFU | 1622041 tok/s step 12847/19560 | loss 3.346803 (-0.23z)| norm 0.2739 (-0.42z)| lr 1.69e-04 | 323.02 ms | 52.2% bf16 MFU | 1622094 tok/s step 12848/19560 | loss 3.367169 (+0.40z)| norm 0.2622 (-1.00z)| lr 1.69e-04 | 322.78 ms | 52.3% bf16 MFU | 1622204 tok/s step 12849/19560 | loss 3.293733 (-1.81z)| norm 0.2605 (-1.06z)| lr 1.69e-04 | 322.62 ms | 52.3% bf16 MFU | 1622349 tok/s step 12850/19560 | loss 3.344975 (-0.24z)| norm 0.3026 (+1.06z)| lr 1.69e-04 | 323.50 ms | 52.2% bf16 MFU | 1622264 tok/s step 12851/19560 | loss 3.349749 (-0.10z)| norm 0.2635 (-0.90z)| lr 1.69e-04 | 322.78 ms | 52.3% bf16 MFU | 1622366 tok/s step 12852/19560 | loss 3.328543 (-0.73z)| norm 0.2795 (-0.08z)| lr 1.69e-04 | 323.76 ms | 52.1% bf16 MFU | 1622216 tok/s step 12853/19560 | loss 3.366283 (+0.42z)| norm 0.2518 (-1.47z)| lr 1.69e-04 | 322.48 ms | 52.3% bf16 MFU | 1622396 tok/s step 12854/19560 | loss 3.395098 (+1.29z)| norm 0.2720 (-0.43z)| lr 1.69e-04 | 323.30 ms | 52.2% bf16 MFU | 1622360 tok/s step 12855/19560 | loss 3.360612 (+0.23z)| norm 0.2711 (-0.47z)| lr 1.68e-04 | 323.33 ms | 52.2% bf16 MFU | 1622318 tok/s step 12856/19560 | loss 3.382817 (+0.93z)| norm 0.2677 (-0.64z)| lr 1.68e-04 | 322.66 ms | 52.3% bf16 MFU | 1622448 tok/s step 12857/19560 | loss 3.351530 (-0.04z)| norm 0.2722 (-0.40z)| lr 1.68e-04 | 322.99 ms | 52.3% bf16 MFU | 1622486 tok/s step 12858/19560 | loss 3.333888 (-0.59z)| norm 0.2888 (+0.51z)| lr 1.68e-04 | 323.23 ms | 52.2% bf16 MFU | 1622464 tok/s step 12859/19560 | loss 3.367935 (+0.46z)| norm 0.2813 (+0.11z)| lr 1.68e-04 | 323.09 ms | 
52.2% bf16 MFU | 1622476 tok/s step 12860/19560 | loss 3.313126 (-1.23z)| norm 0.2967 (+0.96z)| lr 1.68e-04 | 322.67 ms | 52.3% bf16 MFU | 1622594 tok/s step 12861/19560 | loss 3.295205 (-1.75z)| norm 0.2974 (+0.99z)| lr 1.68e-04 | 322.78 ms | 52.3% bf16 MFU | 1622678 tok/s step 12862/19560 | loss 3.340927 (-0.34z)| norm 0.3135 (+1.84z)| lr 1.68e-04 | 323.69 ms | 52.1% bf16 MFU | 1622529 tok/s step 12863/19560 | loss 3.302392 (-1.51z)| norm 0.2847 (+0.27z)| lr 1.68e-04 | 323.21 ms | 52.2% bf16 MFU | 1622508 tok/s step 12864/19560 | loss 3.452501 (+3.00z)| norm 0.3205 (+2.15z)| lr 1.68e-04 | 322.75 ms | 52.3% bf16 MFU | 1622604 tok/s step 12865/19560 | loss 3.337619 (-0.42z)| norm 0.3597 (+3.95z)| lr 1.68e-04 | 323.37 ms | 52.2% bf16 MFU | 1622540 tok/s step 12866/19560 | loss 3.328727 (-0.68z)| norm 0.2922 (+0.55z)| lr 1.68e-04 | 322.94 ms | 52.3% bf16 MFU | 1622588 tok/s step 12867/19560 | loss 3.292527 (-1.73z)| norm 0.3250 (+2.15z)| lr 1.68e-04 | 322.73 ms | 52.3% bf16 MFU | 1622685 tok/s step 12868/19560 | loss 3.363705 (+0.40z)| norm 0.2874 (+0.27z)| lr 1.68e-04 | 323.20 ms | 52.2% bf16 MFU | 1622660 tok/s step 12869/19560 | loss 3.409559 (+1.82z)| norm 0.3303 (+2.37z)| lr 1.68e-04 | 322.56 ms | 52.3% bf16 MFU | 1622796 tok/s step 12870/19560 | loss 3.380419 (+0.92z)| norm 0.3114 (+1.42z)| lr 1.68e-04 | 323.66 ms | 52.1% bf16 MFU | 1622650 tok/s step 12871/19560 | loss 3.315167 (-1.06z)| norm 0.2845 (+0.09z)| lr 1.68e-04 | 322.61 ms | 52.3% bf16 MFU | 1622774 tok/s step 12872/19560 | loss 3.368020 (+0.57z)| norm 0.2923 (+0.47z)| lr 1.68e-04 | 322.33 ms | 52.4% bf16 MFU | 1622964 tok/s step 12873/19560 | loss 3.385099 (+1.09z)| norm 0.3030 (+0.98z)| lr 1.68e-04 | 323.79 ms | 52.1% bf16 MFU | 1622778 tok/s step 12874/19560 | loss 3.365867 (+0.51z)| norm 0.2949 (+0.57z)| lr 1.68e-04 | 322.77 ms | 52.3% bf16 MFU | 1622857 tok/s step 12875/19560 | loss 3.289527 (-1.83z)| norm 0.2814 (-0.09z)| lr 1.68e-04 | 323.02 ms | 52.2% bf16 MFU | 1622868 tok/s step 12876/19560 
| loss 3.335312 (-0.42z)| norm 0.2791 (-0.21z)| lr 1.68e-04 | 322.75 ms | 52.3% bf16 MFU | 1622948 tok/s step 12877/19560 | loss 3.378490 (+0.90z)| norm 0.2867 (+0.15z)| lr 1.68e-04 | 322.77 ms | 52.3% bf16 MFU | 1623017 tok/s step 12878/19560 | loss 3.466913 (+3.44z)| norm 0.3787 (+4.33z)| lr 1.67e-04 | 323.10 ms | 52.2% bf16 MFU | 1623000 tok/s step 12879/19560 | loss 3.327309 (-0.68z)| norm 0.3220 (+1.69z)| lr 1.67e-04 | 323.19 ms | 52.2% bf16 MFU | 1622960 tok/s step 12880/19560 | loss 3.317770 (-0.95z)| norm 0.2861 (+0.04z)| lr 1.67e-04 | 322.68 ms | 52.3% bf16 MFU | 1623051 tok/s step 12881/19560 | loss 3.314723 (-1.05z)| norm 0.2912 (+0.27z)| lr 1.67e-04 | 323.05 ms | 52.2% bf16 MFU | 1623046 tok/s step 12882/19560 | loss 3.320127 (-0.88z)| norm 0.2923 (+0.31z)| lr 1.67e-04 | 323.10 ms | 52.2% bf16 MFU | 1623027 tok/s step 12883/19560 | loss 3.434870 (+2.45z)| norm 0.2795 (-0.29z)| lr 1.67e-04 | 322.77 ms | 52.3% bf16 MFU | 1623092 tok/s step 12884/19560 | loss 3.309380 (-1.17z)| norm 0.3129 (+1.24z)| lr 1.67e-04 | 322.34 ms | 52.4% bf16 MFU | 1623263 tok/s step 12885/19560 | loss 3.366899 (+0.49z)| norm 0.2576 (-1.30z)| lr 1.67e-04 | 323.03 ms | 52.2% bf16 MFU | 1623252 tok/s step 12886/19560 | loss 3.338614 (-0.32z)| norm 0.2702 (-0.73z)| lr 1.67e-04 | 322.73 ms | 52.3% bf16 MFU | 1623315 tok/s step 12887/19560 | loss 3.341298 (-0.24z)| norm 0.2716 (-0.66z)| lr 1.67e-04 | 323.03 ms | 52.2% bf16 MFU | 1623301 tok/s step 12888/19560 | loss 3.395963 (+1.32z)| norm 0.2639 (-1.01z)| lr 1.67e-04 | 322.81 ms | 52.3% bf16 MFU | 1623341 tok/s step 12889/19560 | loss 3.347970 (-0.05z)| norm 0.2719 (-0.66z)| lr 1.67e-04 | 322.58 ms | 52.3% bf16 MFU | 1623438 tok/s step 12890/19560 | loss 3.319777 (-0.86z)| norm 0.2621 (-1.11z)| lr 1.67e-04 | 323.01 ms | 52.2% bf16 MFU | 1623423 tok/s step 12891/19560 | loss 3.405450 (+1.58z)| norm 0.2598 (-1.21z)| lr 1.67e-04 | 322.41 ms | 52.3% bf16 MFU | 1623560 tok/s step 12892/19560 | loss 3.337666 (-0.35z)| norm 0.2613 (-1.14z)| 
lr 1.67e-04 | 323.08 ms | 52.2% bf16 MFU | 1623522 tok/s step 12893/19560 | loss 3.429973 (+2.22z)| norm 0.2752 (-0.51z)| lr 1.67e-04 | 322.93 ms | 52.3% bf16 MFU | 1623522 tok/s step 12894/19560 | loss 3.570937 (+5.39z)| norm 0.3430 (+2.57z)| lr 1.67e-04 | 322.84 ms | 52.3% bf16 MFU | 1623546 tok/s step 12895/19560 | loss 3.289437 (-1.54z)| norm 0.2796 (-0.34z)| lr 1.67e-04 | 323.30 ms | 52.2% bf16 MFU | 1623453 tok/s step 12896/19560 | loss 3.417961 (+1.59z)| norm 0.3207 (+1.52z)| lr 1.67e-04 | 322.86 ms | 52.3% bf16 MFU | 1623475 tok/s step 12897/19560 | loss 3.323943 (-0.72z)| norm 0.2790 (-0.39z)| lr 1.67e-04 | 322.98 ms | 52.3% bf16 MFU | 1623465 tok/s step 12898/19560 | loss 3.400726 (+1.15z)| norm 0.2896 (+0.09z)| lr 1.67e-04 | 322.65 ms | 52.3% bf16 MFU | 1623538 tok/s step 12899/19560 | loss 3.356946 (+0.07z)| norm 0.2939 (+0.28z)| lr 1.67e-04 | 322.72 ms | 52.3% bf16 MFU | 1623592 tok/s step 12900/19560 | loss 3.422724 (+1.65z)| norm 0.2925 (+0.22z)| lr 1.66e-04 | 323.06 ms | 52.2% bf16 MFU | 1623556 tok/s step 12901/19560 | loss 3.356792 (+0.06z)| norm 0.2843 (-0.17z)| lr 1.66e-04 | 322.51 ms | 52.3% bf16 MFU | 1623659 tok/s step 12902/19560 | loss 3.432036 (+1.85z)| norm 0.2946 (+0.30z)| lr 1.66e-04 | 322.88 ms | 52.3% bf16 MFU | 1623666 tok/s step 12903/19560 | loss 3.358388 (+0.08z)| norm 0.2741 (-0.66z)| lr 1.66e-04 | 323.19 ms | 52.2% bf16 MFU | 1623595 tok/s step 12904/19560 | loss 3.350992 (-0.08z)| norm 0.2800 (-0.39z)| lr 1.66e-04 | 323.16 ms | 52.2% bf16 MFU | 1623533 tok/s step 12905/19560 | loss 3.389081 (+0.84z)| norm 0.2858 (-0.12z)| lr 1.66e-04 | 323.37 ms | 52.2% bf16 MFU | 1623424 tok/s step 12906/19560 | loss 3.368573 (+0.35z)| norm 0.2935 (+0.22z)| lr 1.66e-04 | 323.28 ms | 52.2% bf16 MFU | 1623343 tok/s step 12907/19560 | loss 3.388973 (+0.83z)| norm 0.3189 (+1.40z)| lr 1.66e-04 | 321.98 ms | 52.4% bf16 MFU | 1623593 tok/s step 12908/19560 | loss 3.335765 (-0.45z)| norm 0.2825 (-0.32z)| lr 1.66e-04 | 322.85 ms | 52.3% bf16 MFU | 
1623610 tok/s step 12909/19560 | loss 3.341013 (-0.32z)| norm 0.2636 (-1.21z)| lr 1.66e-04 | 323.17 ms | 52.2% bf16 MFU | 1623545 tok/s step 12910/19560 | loss 3.385536 (+0.74z)| norm 0.3019 (+0.59z)| lr 1.66e-04 | 322.57 ms | 52.3% bf16 MFU | 1623636 tok/s step 12911/19560 | loss 3.371932 (+0.42z)| norm 0.2805 (-0.42z)| lr 1.66e-04 | 322.51 ms | 52.3% bf16 MFU | 1623738 tok/s step 12912/19560 | loss 3.359172 (+0.11z)| norm 0.2764 (-0.63z)| lr 1.66e-04 | 323.37 ms | 52.2% bf16 MFU | 1623618 tok/s step 12913/19560 | loss 3.351024 (-0.10z)| norm 0.2924 (+0.15z)| lr 1.66e-04 | 323.59 ms | 52.2% bf16 MFU | 1623450 tok/s step 12914/19560 | loss 3.450909 (+2.26z)| norm 0.2953 (+0.29z)| lr 1.66e-04 | 322.71 ms | 52.3% bf16 MFU | 1623508 tok/s step 12915/19560 | loss 3.377946 (+0.52z)| norm 0.2898 (+0.03z)| lr 1.66e-04 | 322.85 ms | 52.3% bf16 MFU | 1623531 tok/s step 12916/19560 | loss 3.269551 (-2.00z)| norm 0.2881 (-0.05z)| lr 1.66e-04 | 322.48 ms | 52.3% bf16 MFU | 1623644 tok/s step 12917/19560 | loss 3.355820 (+0.01z)| norm 0.2787 (-0.50z)| lr 1.66e-04 | 322.84 ms | 52.3% bf16 MFU | 1623662 tok/s step 12918/19560 | loss 3.324581 (-0.71z)| norm 0.2848 (-0.21z)| lr 1.66e-04 | 322.97 ms | 52.3% bf16 MFU | 1623646 tok/s step 12919/19560 | loss 3.326750 (-0.65z)| norm 0.2642 (-1.20z)| lr 1.66e-04 | 322.44 ms | 52.3% bf16 MFU | 1623764 tok/s step 12920/19560 | loss 3.403472 (+1.13z)| norm 0.2779 (-0.53z)| lr 1.66e-04 | 322.98 ms | 52.3% bf16 MFU | 1623739 tok/s step 12921/19560 | loss 3.382609 (+0.63z)| norm 0.2725 (-0.79z)| lr 1.66e-04 | 322.61 ms | 52.3% bf16 MFU | 1623810 tok/s step 12922/19560 | loss 3.362519 (+0.16z)| norm 0.2961 (+0.36z)| lr 1.65e-04 | 323.47 ms | 52.2% bf16 MFU | 1623662 tok/s step 12923/19560 | loss 3.395610 (+0.92z)| norm 0.2698 (-0.92z)| lr 1.65e-04 | 323.24 ms | 52.2% bf16 MFU | 1623578 tok/s step 12924/19560 | loss 3.397078 (+0.94z)| norm 0.2760 (-0.61z)| lr 1.65e-04 | 322.15 ms | 52.4% bf16 MFU | 1623772 tok/s step 12925/19560 | loss 3.371905 
(+0.35z)| norm 0.2866 (-0.11z)| lr 1.65e-04 | 323.43 ms | 52.2% bf16 MFU | 1623634 tok/s step 12926/19560 | loss 3.291753 (-1.51z)| norm 0.2732 (-0.76z)| lr 1.65e-04 | 322.65 ms | 52.3% bf16 MFU | 1623699 tok/s step 12927/19560 | loss 3.350884 (-0.12z)| norm 0.2956 (+0.34z)| lr 1.65e-04 | 323.20 ms | 52.2% bf16 MFU | 1623623 tok/s step 12928/19560 | loss 3.361259 (+0.12z)| norm 0.2603 (-1.38z)| lr 1.65e-04 | 323.30 ms | 52.2% bf16 MFU | 1623525 tok/s step 12929/19560 | loss 3.372807 (+0.38z)| norm 0.2871 (-0.07z)| lr 1.65e-04 | 323.02 ms | 52.2% bf16 MFU | 1623504 tok/s step 12930/19560 | loss 3.471953 (+2.63z)| norm 0.3018 (+0.63z)| lr 1.65e-04 | 322.81 ms | 52.3% bf16 MFU | 1623535 tok/s step 12931/19560 | loss 3.389156 (+0.72z)| norm 0.2823 (-0.31z)| lr 1.65e-04 | 322.94 ms | 52.3% bf16 MFU | 1623531 tok/s step 12932/19560 | loss 3.345228 (-0.29z)| norm 0.2911 (+0.12z)| lr 1.65e-04 | 322.80 ms | 52.3% bf16 MFU | 1623565 tok/s step 12933/19560 | loss 3.428925 (+1.60z)| norm 0.3108 (+1.08z)| lr 1.65e-04 | 322.88 ms | 52.3% bf16 MFU | 1623575 tok/s step 12934/19560 | loss 3.320220 (-0.88z)| norm 0.2808 (-0.39z)| lr 1.65e-04 | 322.91 ms | 52.3% bf16 MFU | 1623578 tok/s step 12935/19560 | loss 3.433278 (+1.68z)| norm 0.2792 (-0.46z)| lr 1.65e-04 | 322.97 ms | 52.3% bf16 MFU | 1623565 tok/s step 12936/19560 | loss 3.376552 (+0.40z)| norm 0.2712 (-0.84z)| lr 1.65e-04 | 323.47 ms | 52.2% bf16 MFU | 1623429 tok/s step 12937/19560 | loss 3.340243 (-0.42z)| norm 0.2784 (-0.49z)| lr 1.65e-04 | 322.22 ms | 52.4% bf16 MFU | 1623613 tok/s step 12938/19560 | loss 3.353826 (-0.10z)| norm 0.2712 (-0.83z)| lr 1.65e-04 | 323.01 ms | 52.2% bf16 MFU | 1623590 tok/s step 12939/19560 | loss 3.380126 (+0.50z)| norm 0.2771 (-0.55z)| lr 1.65e-04 | 323.90 ms | 52.1% bf16 MFU | 1623344 tok/s step 12940/19560 | loss 3.374002 (+0.36z)| norm 0.2835 (-0.23z)| lr 1.65e-04 | 322.62 ms | 52.3% bf16 MFU | 1623432 tok/s step 12941/19560 | loss 3.410602 (+1.19z)| norm 0.2623 (-1.25z)| lr 1.65e-04 | 
322.40 ms | 52.3% bf16 MFU | 1623571 tok/s step 12942/19560 | loss 3.482011 (+2.71z)| norm 0.3257 (+1.78z)| lr 1.65e-04 | 322.74 ms | 52.3% bf16 MFU | 1623616 tok/s step 12943/19560 | loss 3.361061 (+0.03z)| norm 0.2759 (-0.59z)| lr 1.65e-04 | 322.99 ms | 52.3% bf16 MFU | 1623596 tok/s step 12944/19560 | loss 3.439446 (+1.74z)| norm 0.2565 (-1.50z)| lr 1.65e-04 | 322.63 ms | 52.3% bf16 MFU | 1623669 tok/s step 12945/19560 | loss 3.306556 (-1.18z)| norm 0.3689 (+3.67z)| lr 1.64e-04 | 322.86 ms | 52.3% bf16 MFU | 1623680 tok/s step 12946/19560 | loss 3.397948 (+0.82z)| norm 0.2869 (-0.06z)| lr 1.64e-04 | 323.14 ms | 52.2% bf16 MFU | 1623619 tok/s step 12947/19560 | loss 3.330782 (-0.66z)| norm 0.2843 (-0.17z)| lr 1.64e-04 | 323.23 ms | 52.2% bf16 MFU | 1623540 tok/s step 12948/19560 | loss 3.327525 (-0.73z)| norm 0.2981 (+0.46z)| lr 1.64e-04 | 323.02 ms | 52.2% bf16 MFU | 1623518 tok/s step 12949/19560 | loss 3.329727 (-0.68z)| norm 0.3114 (+1.06z)| lr 1.64e-04 | 322.35 ms | 52.4% bf16 MFU | 1623664 tok/s step 12950/19560 | loss 3.348587 (-0.26z)| norm 0.3082 (+0.90z)| lr 1.64e-04 | 323.24 ms | 52.2% bf16 MFU | 1623580 tok/s step 12951/19560 | loss 3.347742 (-0.29z)| norm 0.2831 (-0.24z)| lr 1.64e-04 | 323.63 ms | 52.1% bf16 MFU | 1623402 tok/s step 12952/19560 | loss 3.362114 (+0.03z)| norm 0.2735 (-0.66z)| lr 1.64e-04 | 322.15 ms | 52.4% bf16 MFU | 1623606 tok/s step 12953/19560 | loss 3.387497 (+0.58z)| norm 0.2768 (-0.51z)| lr 1.64e-04 | 323.00 ms | 52.3% bf16 MFU | 1623584 tok/s step 12954/19560 | loss 3.346343 (-0.32z)| norm 0.2691 (-0.85z)| lr 1.64e-04 | 322.46 ms | 52.3% bf16 MFU | 1623700 tok/s step 12955/19560 | loss 3.364313 (+0.07z)| norm 0.2767 (-0.50z)| lr 1.64e-04 | 322.32 ms | 52.4% bf16 MFU | 1623845 tok/s step 12956/19560 | loss 3.366176 (+0.11z)| norm 0.2717 (-0.72z)| lr 1.64e-04 | 323.93 ms | 52.1% bf16 MFU | 1623579 tok/s step 12957/19560 | loss 3.320511 (-0.91z)| norm 0.2722 (-0.69z)| lr 1.64e-04 | 322.96 ms | 52.3% bf16 MFU | 1623568 tok/s step 
12958/19560 | loss 3.315964 (-1.02z)| norm 0.2794 (-0.36z)| lr 1.64e-04 | 323.97 ms | 52.1% bf16 MFU | 1623305 tok/s step 12959/19560 | loss 3.315796 (-1.02z)| norm 0.2829 (-0.21z)| lr 1.64e-04 | 323.02 ms | 52.2% bf16 MFU | 1623294 tok/s step 12960/19560 | loss 3.311631 (-1.09z)| norm 0.2981 (+0.49z)| lr 1.64e-04 | 322.90 ms | 52.3% bf16 MFU | 1623313 tok/s step 12961/19560 | loss 3.345068 (-0.35z)| norm 0.2927 (+0.24z)| lr 1.64e-04 | 323.29 ms | 52.2% bf16 MFU | 1623233 tok/s step 12962/19560 | loss 3.401308 (+0.89z)| norm 0.2923 (+0.23z)| lr 1.64e-04 | 322.41 ms | 52.3% bf16 MFU | 1623379 tok/s step 12963/19560 | loss 3.377665 (+0.35z)| norm 0.2748 (-0.57z)| lr 1.64e-04 | 322.78 ms | 52.3% bf16 MFU | 1623425 tok/s step 12964/19560 | loss 3.290928 (-1.55z)| norm 0.3003 (+0.63z)| lr 1.64e-04 | 322.90 ms | 52.3% bf16 MFU | 1623439 tok/s step 12965/19560 | loss 3.404953 (+0.95z)| norm 0.2782 (-0.42z)| lr 1.64e-04 | 322.98 ms | 52.3% bf16 MFU | 1623430 tok/s step 12966/19560 | loss 3.359761 (-0.04z)| norm 0.2914 (+0.22z)| lr 1.64e-04 | 322.77 ms | 52.3% bf16 MFU | 1623475 tok/s step 12967/19560 | loss 3.326742 (-0.76z)| norm 0.2783 (-0.40z)| lr 1.63e-04 | 323.28 ms | 52.2% bf16 MFU | 1623390 tok/s step 12968/19560 | loss 3.417374 (+1.21z)| norm 0.2874 (+0.02z)| lr 1.63e-04 | 323.17 ms | 52.2% bf16 MFU | 1623336 tok/s step 12969/19560 | loss 3.317314 (-0.99z)| norm 0.2706 (-0.77z)| lr 1.63e-04 | 322.36 ms | 52.4% bf16 MFU | 1623490 tok/s step 12970/19560 | loss 3.382093 (+0.42z)| norm 0.2727 (-0.68z)| lr 1.63e-04 | 323.47 ms | 52.2% bf16 MFU | 1623356 tok/s step 12971/19560 | loss 3.342679 (-0.44z)| norm 0.2708 (-0.76z)| lr 1.63e-04 | 323.24 ms | 52.2% bf16 MFU | 1623286 tok/s step 12972/19560 | loss 3.355575 (-0.15z)| norm 0.2582 (-1.34z)| lr 1.63e-04 | 323.18 ms | 52.2% bf16 MFU | 1623237 tok/s step 12973/19560 | loss 3.303201 (-1.29z)| norm 0.2741 (-0.59z)| lr 1.63e-04 | 322.86 ms | 52.3% bf16 MFU | 1623270 tok/s step 12974/19560 | loss 3.356440 (-0.12z)| norm 
0.2897 (+0.14z)| lr 1.63e-04 | 322.91 ms | 52.3% bf16 MFU | 1623289 tok/s step 12975/19560 | loss 3.355211 (-0.15z)| norm 0.2639 (-1.07z)| lr 1.63e-04 | 323.26 ms | 52.2% bf16 MFU | 1623217 tok/s step 12976/19560 | loss 3.382821 (+0.45z)| norm 0.2771 (-0.45z)| lr 1.63e-04 | 322.58 ms | 52.3% bf16 MFU | 1623320 tok/s step 12977/19560 | loss 3.391998 (+0.64z)| norm 0.2752 (-0.55z)| lr 1.63e-04 | 323.03 ms | 52.2% bf16 MFU | 1623306 tok/s step 12978/19560 | loss 3.356313 (-0.15z)| norm 0.2557 (-1.46z)| lr 1.63e-04 | 323.30 ms | 52.2% bf16 MFU | 1623224 tok/s step 12979/19560 | loss 3.486633 (+2.63z)| norm 0.2893 (+0.13z)| lr 1.63e-04 | 322.73 ms | 52.3% bf16 MFU | 1623289 tok/s step 12980/19560 | loss 3.352143 (-0.26z)| norm 0.2669 (-0.93z)| lr 1.63e-04 | 322.68 ms | 52.3% bf16 MFU | 1623364 tok/s step 12981/19560 | loss 3.290632 (-1.56z)| norm 0.2579 (-1.37z)| lr 1.63e-04 | 323.29 ms | 52.2% bf16 MFU | 1623282 tok/s step 12982/19560 | loss 3.449127 (+1.79z)| norm 0.2722 (-0.68z)| lr 1.63e-04 | 322.92 ms | 52.3% bf16 MFU | 1623297 tok/s step 12983/19560 | loss 3.506631 (+2.88z)| norm 0.2903 (+0.17z)| lr 1.63e-04 | 323.36 ms | 52.2% bf16 MFU | 1623202 tok/s step 12984/19560 | loss 3.349512 (-0.32z)| norm 0.2873 (+0.02z)| lr 1.63e-04 | 322.12 ms | 52.4% bf16 MFU | 1623423 tok/s step 12985/19560 | loss 3.324737 (-0.82z)| norm 0.2556 (-1.49z)| lr 1.63e-04 | 323.25 ms | 52.2% bf16 MFU | 1623349 tok/s step 12986/19560 | loss 3.406124 (+0.83z)| norm 0.2816 (-0.24z)| lr 1.63e-04 | 323.12 ms | 52.2% bf16 MFU | 1623310 tok/s step 12987/19560 | loss 3.362045 (-0.07z)| norm 0.2638 (-1.08z)| lr 1.63e-04 | 322.54 ms | 52.3% bf16 MFU | 1623419 tok/s step 12988/19560 | loss 3.432011 (+1.33z)| norm 0.2801 (-0.30z)| lr 1.63e-04 | 323.26 ms | 52.2% bf16 MFU | 1623341 tok/s step 12989/19560 | loss 3.314664 (-1.06z)| norm 0.2630 (-1.09z)| lr 1.63e-04 | 323.17 ms | 52.2% bf16 MFU | 1623291 tok/s step 12990/19560 | loss 3.379909 (+0.27z)| norm 0.3016 (+0.74z)| lr 1.62e-04 | 323.32 ms | 
52.2% bf16 MFU | 1623206 tok/s step 12991/19560 | loss 3.288329 (-1.59z)| norm 0.2717 (-0.68z)| lr 1.62e-04 | 322.63 ms | 52.3% bf16 MFU | 1623298 tok/s step 12992/19560 | loss 3.412222 (+0.94z)| norm 0.2911 (+0.26z)| lr 1.62e-04 | 323.01 ms | 52.3% bf16 MFU | 1623291 tok/s step 12993/19560 | loss 3.305523 (-1.24z)| norm 0.2599 (-1.26z)| lr 1.62e-04 | 323.27 ms | 52.2% bf16 MFU | 1623219 tok/s step 12994/19560 | loss 3.333998 (-0.66z)| norm 0.2850 (+0.01z)| lr 1.62e-04 | 322.84 ms | 52.3% bf16 MFU | 1623256 tok/s step 12995/19560 | loss 3.296555 (-1.42z)| norm 0.2604 (-1.22z)| lr 1.62e-04 | 323.03 ms | 52.2% bf16 MFU | 1623244 tok/s step 12996/19560 | loss 3.380033 (+0.28z)| norm 0.2795 (-0.25z)| lr 1.62e-04 | 322.77 ms | 52.3% bf16 MFU | 1623298 tok/s step 12997/19560 | loss 3.355542 (-0.21z)| norm 0.2826 (-0.07z)| lr 1.62e-04 | 323.48 ms | 52.2% bf16 MFU | 1623171 tok/s step 12998/19560 | loss 3.287052 (-1.59z)| norm 0.2605 (-1.20z)| lr 1.62e-04 | 323.09 ms | 52.2% bf16 MFU | 1623150 tok/s step 12999/19560 | loss 3.361727 (-0.08z)| norm 0.2781 (-0.28z)| lr 1.62e-04 | 322.95 ms | 52.3% bf16 MFU | 1623164 tok/s step 13000/19560 | loss 3.427708 (+1.25z)| norm 0.2819 (-0.08z)| lr 1.62e-04 | 322.97 ms | 52.3% bf16 MFU | 1623172 tok/s val loss 3.333582 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating HellaSwag: 70/79 HellaSwag: 3016/10042 = 0.300339 step 13001/19560 | loss 3.398769 (+0.66z)| norm 0.2868 (+0.19z)| lr 1.62e-04 | 322.14 ms | 52.4% bf16 MFU | 1623388 tok/s step 13002/19560 | loss 3.416483 (+1.01z)| norm 0.2764 (-0.35z)| lr 1.62e-04 | 322.43 ms | 52.3% bf16 MFU | 1623521 tok/s step 13003/19560 | loss 3.330786 (-0.73z)| norm 0.2825 (-0.03z)| lr 1.62e-04 | 323.30 ms | 52.2% bf16 MFU | 1623430 tok/s step 13004/19560 | loss 3.342239 (-0.50z)| norm 0.2759 (-0.38z)| lr 1.62e-04 | 322.30 ms | 52.4% bf16 MFU | 
1623592 tok/s step 13005/19560 | loss 3.331124 (-0.72z)| norm 0.2636 (-1.01z)| lr 1.62e-04 | 322.81 ms | 52.3% bf16 MFU | 1623619 tok/s step 13006/19560 | loss 3.408017 (+0.86z)| norm 0.2983 (+0.92z)| lr 1.62e-04 | 322.83 ms | 52.3% bf16 MFU | 1623638 tok/s step 13007/19560 | loss 3.316964 (-1.01z)| norm 0.2888 (+0.40z)| lr 1.62e-04 | 322.56 ms | 52.3% bf16 MFU | 1623725 tok/s step 13008/19560 | loss 3.353476 (-0.26z)| norm 0.2666 (-0.90z)| lr 1.62e-04 | 322.67 ms | 52.3% bf16 MFU | 1623781 tok/s step 13009/19560 | loss 3.341581 (-0.51z)| norm 0.2758 (-0.35z)| lr 1.62e-04 | 323.18 ms | 52.2% bf16 MFU | 1623707 tok/s step 13010/19560 | loss 3.393551 (+0.55z)| norm 0.2748 (-0.41z)| lr 1.62e-04 | 323.20 ms | 52.2% bf16 MFU | 1623630 tok/s step 13011/19560 | loss 3.381678 (+0.31z)| norm 0.2947 (+0.76z)| lr 1.62e-04 | 323.07 ms | 52.2% bf16 MFU | 1623589 tok/s step 13012/19560 | loss 3.382652 (+0.32z)| norm 0.2576 (-1.40z)| lr 1.61e-04 | 322.54 ms | 52.3% bf16 MFU | 1623684 tok/s step 13013/19560 | loss 3.394907 (+0.58z)| norm 0.2921 (+0.62z)| lr 1.61e-04 | 322.15 ms | 52.4% bf16 MFU | 1623874 tok/s step 13014/19560 | loss 3.365237 (-0.05z)| norm 0.2554 (-1.54z)| lr 1.61e-04 | 323.84 ms | 52.1% bf16 MFU | 1623628 tok/s step 13015/19560 | loss 3.364198 (-0.08z)| norm 0.2858 (+0.25z)| lr 1.61e-04 | 323.64 ms | 52.1% bf16 MFU | 1623444 tok/s step 13016/19560 | loss 3.310138 (-1.19z)| norm 0.2722 (-0.56z)| lr 1.61e-04 | 322.55 ms | 52.3% bf16 MFU | 1623546 tok/s step 13017/19560 | loss 3.331293 (-0.75z)| norm 0.3014 (+1.16z)| lr 1.61e-04 | 323.12 ms | 52.2% bf16 MFU | 1623497 tok/s step 13018/19560 | loss 3.444736 (+1.59z)| norm 0.2781 (-0.23z)| lr 1.61e-04 | 322.55 ms | 52.3% bf16 MFU | 1623593 tok/s step 13019/19560 | loss 3.335629 (-0.66z)| norm 0.2932 (+0.65z)| lr 1.61e-04 | 322.73 ms | 52.3% bf16 MFU | 1623640 tok/s step 13020/19560 | loss 3.360404 (-0.15z)| norm 0.3021 (+1.17z)| lr 1.61e-04 | 323.19 ms | 52.2% bf16 MFU | 1623570 tok/s step 13021/19560 | loss 3.323184 
(-0.91z)| norm 0.2699 (-0.76z)| lr 1.61e-04 | 322.97 ms | 52.3% bf16 MFU | 1623558 tok/s step 13022/19560 | loss 3.381720 (+0.37z)| norm 0.3063 (+1.50z)| lr 1.61e-04 | 323.12 ms | 52.2% bf16 MFU | 1623510 tok/s step 13023/19560 | loss 3.296122 (-1.56z)| norm 0.2875 (+0.33z)| lr 1.61e-04 | 323.22 ms | 52.2% bf16 MFU | 1623439 tok/s step 13024/19560 | loss 3.353704 (-0.25z)| norm 0.2908 (+0.55z)| lr 1.61e-04 | 322.77 ms | 52.3% bf16 MFU | 1623483 tok/s step 13025/19560 | loss 3.360581 (-0.10z)| norm 0.2982 (+1.01z)| lr 1.61e-04 | 322.98 ms | 52.3% bf16 MFU | 1623472 tok/s step 13026/19560 | loss 3.360998 (-0.09z)| norm 0.2821 (-0.01z)| lr 1.61e-04 | 323.09 ms | 52.2% bf16 MFU | 1623435 tok/s step 13027/19560 | loss 3.292133 (-1.63z)| norm 0.2823 (+0.01z)| lr 1.61e-04 | 322.88 ms | 52.3% bf16 MFU | 1623452 tok/s step 13028/19560 | loss 3.381003 (+0.39z)| norm 0.2987 (+1.05z)| lr 1.61e-04 | 322.06 ms | 52.4% bf16 MFU | 1623675 tok/s step 13029/19560 | loss 3.424357 (+1.35z)| norm 0.2639 (-1.14z)| lr 1.61e-04 | 323.11 ms | 52.2% bf16 MFU | 1623623 tok/s step 13030/19560 | loss 3.330118 (-0.76z)| norm 0.2792 (-0.17z)| lr 1.61e-04 | 323.06 ms | 52.2% bf16 MFU | 1623586 tok/s step 13031/19560 | loss 3.320408 (-0.98z)| norm 0.2681 (-0.87z)| lr 1.61e-04 | 323.08 ms | 52.2% bf16 MFU | 1623544 tok/s step 13032/19560 | loss 3.299084 (-1.44z)| norm 0.2819 (+0.01z)| lr 1.61e-04 | 323.26 ms | 52.2% bf16 MFU | 1623462 tok/s step 13033/19560 | loss 3.331971 (-0.69z)| norm 0.2678 (-0.88z)| lr 1.61e-04 | 322.53 ms | 52.3% bf16 MFU | 1623566 tok/s step 13034/19560 | loss 3.340349 (-0.49z)| norm 0.2699 (-0.74z)| lr 1.61e-04 | 322.73 ms | 52.3% bf16 MFU | 1623614 tok/s step 13035/19560 | loss 3.342603 (-0.44z)| norm 0.2758 (-0.35z)| lr 1.60e-04 | 323.21 ms | 52.2% bf16 MFU | 1623539 tok/s step 13036/19560 | loss 3.356776 (-0.12z)| norm 0.2664 (-0.94z)| lr 1.60e-04 | 322.82 ms | 52.3% bf16 MFU | 1623566 tok/s step 13037/19560 | loss 3.380744 (+0.41z)| norm 0.2542 (-1.71z)| lr 1.60e-04 | 
322.80 ms | 52.3% bf16 MFU | 1623596 tok/s step 13038/19560 | loss 3.391483 (+0.65z)| norm 0.2644 (-1.05z)| lr 1.60e-04 | 322.73 ms | 52.3% bf16 MFU | 1623644 tok/s step 13039/19560 | loss 3.359743 (-0.06z)| norm 0.2504 (-1.90z)| lr 1.60e-04 | 323.16 ms | 52.2% bf16 MFU | 1623579 tok/s step 13040/19560 | loss 3.325712 (-0.82z)| norm 0.2706 (-0.62z)| lr 1.60e-04 | 323.31 ms | 52.2% bf16 MFU | 1623481 tok/s step 13041/19560 | loss 3.305321 (-1.26z)| norm 0.3136 (+2.06z)| lr 1.60e-04 | 322.86 ms | 52.3% bf16 MFU | 1623502 tok/s step 13042/19560 | loss 3.327654 (-0.75z)| norm 0.2579 (-1.39z)| lr 1.60e-04 | 322.82 ms | 52.3% bf16 MFU | 1623532 tok/s step 13043/19560 | loss 3.323494 (-0.83z)| norm 0.3105 (+1.84z)| lr 1.60e-04 | 322.59 ms | 52.3% bf16 MFU | 1623617 tok/s step 13044/19560 | loss 3.356520 (-0.11z)| norm 0.2690 (-0.69z)| lr 1.60e-04 | 323.01 ms | 52.2% bf16 MFU | 1623592 tok/s step 13045/19560 | loss 3.315086 (-1.04z)| norm 0.2673 (-0.78z)| lr 1.60e-04 | 322.79 ms | 52.3% bf16 MFU | 1623625 tok/s step 13046/19560 | loss 3.371473 (+0.24z)| norm 0.2655 (-0.88z)| lr 1.60e-04 | 322.50 ms | 52.3% bf16 MFU | 1623729 tok/s step 13047/19560 | loss 3.280761 (-1.81z)| norm 0.2595 (-1.24z)| lr 1.60e-04 | 322.77 ms | 52.3% bf16 MFU | 1623760 tok/s step 13048/19560 | loss 3.385883 (+0.57z)| norm 0.2561 (-1.43z)| lr 1.60e-04 | 323.14 ms | 52.2% bf16 MFU | 1623696 tok/s step 13049/19560 | loss 3.308207 (-1.17z)| norm 0.2716 (-0.49z)| lr 1.60e-04 | 323.66 ms | 52.1% bf16 MFU | 1623504 tok/s step 13050/19560 | loss 3.378753 (+0.42z)| norm 0.2658 (-0.83z)| lr 1.60e-04 | 323.29 ms | 52.2% bf16 MFU | 1623416 tok/s step 13051/19560 | loss 3.304804 (-1.23z)| norm 0.2835 (+0.23z)| lr 1.60e-04 | 322.42 ms | 52.3% bf16 MFU | 1623550 tok/s step 13052/19560 | loss 3.372380 (+0.29z)| norm 0.2592 (-1.22z)| lr 1.60e-04 | 323.07 ms | 52.2% bf16 MFU | 1623513 tok/s step 13053/19560 | loss 3.356151 (-0.07z)| norm 0.3445 (+3.66z)| lr 1.60e-04 | 323.05 ms | 52.2% bf16 MFU | 1623483 tok/s step 
13054/19560 | loss 3.267982 (-2.04z)| norm 0.2783 (-0.10z)| lr 1.60e-04 | 322.85 ms | 52.3% bf16 MFU | 1623506 tok/s step 13055/19560 | loss 3.330063 (-0.64z)| norm 0.2739 (-0.34z)| lr 1.60e-04 | 322.86 ms | 52.3% bf16 MFU | 1623525 tok/s step 13056/19560 | loss 3.377459 (+0.41z)| norm 0.2895 (+0.53z)| lr 1.60e-04 | 322.90 ms | 52.3% bf16 MFU | 1623532 tok/s step 13057/19560 | loss 3.328577 (-0.67z)| norm 0.2593 (-1.17z)| lr 1.60e-04 | 322.43 ms | 52.3% bf16 MFU | 1623658 tok/s step 13058/19560 | loss 3.379849 (+0.50z)| norm 0.2801 (+0.02z)| lr 1.59e-04 | 323.07 ms | 52.2% bf16 MFU | 1623617 tok/s step 13059/19560 | loss 3.292924 (-1.46z)| norm 0.2672 (-0.71z)| lr 1.59e-04 | 322.77 ms | 52.3% bf16 MFU | 1623653 tok/s step 13060/19560 | loss 3.386051 (+0.65z)| norm 0.2782 (-0.07z)| lr 1.59e-04 | 323.11 ms | 52.2% bf16 MFU | 1623602 tok/s step 13061/19560 | loss 3.342391 (-0.33z)| norm 0.2851 (+0.33z)| lr 1.59e-04 | 323.13 ms | 52.2% bf16 MFU | 1623548 tok/s step 13062/19560 | loss 3.397649 (+0.92z)| norm 0.2823 (+0.17z)| lr 1.59e-04 | 323.12 ms | 52.2% bf16 MFU | 1623500 tok/s step 13063/19560 | loss 3.377798 (+0.48z)| norm 0.3015 (+1.26z)| lr 1.59e-04 | 322.68 ms | 52.3% bf16 MFU | 1623564 tok/s step 13064/19560 | loss 3.391975 (+0.81z)| norm 0.2592 (-1.16z)| lr 1.59e-04 | 322.89 ms | 52.3% bf16 MFU | 1623574 tok/s step 13065/19560 | loss 3.376673 (+0.45z)| norm 0.2633 (-0.91z)| lr 1.59e-04 | 322.62 ms | 52.3% bf16 MFU | 1623650 tok/s step 13066/19560 | loss 3.418906 (+1.40z)| norm 0.2673 (-0.68z)| lr 1.59e-04 | 322.79 ms | 52.3% bf16 MFU | 1623680 tok/s step 13067/19560 | loss 3.360286 (+0.06z)| norm 0.2666 (-0.72z)| lr 1.59e-04 | 322.91 ms | 52.3% bf16 MFU | 1623678 tok/s step 13068/19560 | loss 3.330187 (-0.62z)| norm 0.2682 (-0.62z)| lr 1.59e-04 | 323.02 ms | 52.2% bf16 MFU | 1623649 tok/s step 13069/19560 | loss 3.335124 (-0.50z)| norm 0.2585 (-1.16z)| lr 1.59e-04 | 322.66 ms | 52.3% bf16 MFU | 1623712 tok/s step 13070/19560 | loss 3.392785 (+0.87z)| norm 
0.2569 (-1.26z)| lr 1.59e-04 | 322.91 ms | 52.3% bf16 MFU | 1623707 tok/s step 13071/19560 | loss 3.404177 (+1.13z)| norm 0.2737 (-0.28z)| lr 1.59e-04 | 323.10 ms | 52.2% bf16 MFU | 1623655 tok/s step 13072/19560 | loss 3.380085 (+0.58z)| norm 0.2557 (-1.32z)| lr 1.59e-04 | 323.06 ms | 52.2% bf16 MFU | 1623617 tok/s step 13073/19560 | loss 3.349551 (-0.16z)| norm 0.2678 (-0.65z)| lr 1.59e-04 | 322.83 ms | 52.3% bf16 MFU | 1623638 tok/s step 13074/19560 | loss 3.324100 (-0.76z)| norm 0.2615 (-1.04z)| lr 1.59e-04 | 322.62 ms | 52.3% bf16 MFU | 1623709 tok/s step 13075/19560 | loss 3.319758 (-0.87z)| norm 0.2650 (-0.80z)| lr 1.59e-04 | 323.09 ms | 52.2% bf16 MFU | 1623660 tok/s step 13076/19560 | loss 3.321557 (-0.82z)| norm 0.2564 (-1.34z)| lr 1.59e-04 | 322.93 ms | 52.3% bf16 MFU | 1623653 tok/s step 13077/19560 | loss 3.325208 (-0.73z)| norm 0.2680 (-0.57z)| lr 1.59e-04 | 323.10 ms | 52.2% bf16 MFU | 1623604 tok/s step 13078/19560 | loss 3.349223 (-0.15z)| norm 0.3518 (+4.61z)| lr 1.59e-04 | 322.88 ms | 52.3% bf16 MFU | 1623612 tok/s step 13079/19560 | loss 3.369303 (+0.33z)| norm 0.3045 (+1.67z)| lr 1.59e-04 | 322.68 ms | 52.3% bf16 MFU | 1623672 tok/s step 13080/19560 | loss 3.390674 (+0.83z)| norm 0.2927 (+0.94z)| lr 1.58e-04 | 322.71 ms | 52.3% bf16 MFU | 1623721 tok/s step 13081/19560 | loss 3.331457 (-0.58z)| norm 0.2956 (+1.10z)| lr 1.58e-04 | 322.72 ms | 52.3% bf16 MFU | 1623763 tok/s step 13082/19560 | loss 3.302481 (-1.26z)| norm 0.3017 (+1.44z)| lr 1.58e-04 | 322.73 ms | 52.3% bf16 MFU | 1623803 tok/s step 13083/19560 | loss 3.319521 (-0.84z)| norm 0.2816 (+0.23z)| lr 1.58e-04 | 322.86 ms | 52.3% bf16 MFU | 1623808 tok/s step 13084/19560 | loss 3.331134 (-0.56z)| norm 0.3061 (+1.66z)| lr 1.58e-04 | 323.13 ms | 52.2% bf16 MFU | 1623743 tok/s step 13085/19560 | loss 3.291545 (-1.49z)| norm 0.3076 (+1.72z)| lr 1.58e-04 | 323.29 ms | 52.2% bf16 MFU | 1623643 tok/s step 13086/19560 | loss 3.293485 (-1.43z)| norm 0.2840 (+0.33z)| lr 1.58e-04 | 322.38 ms | 
52.4% bf16 MFU | 1623776 tok/s step 13087/19560 | loss 3.302801 (-1.20z)| norm 0.2949 (+0.96z)| lr 1.58e-04 | 323.44 ms | 52.2% bf16 MFU | 1623635 tok/s step 13088/19560 | loss 3.340015 (-0.34z)| norm 0.2811 (+0.17z)| lr 1.58e-04 | 322.72 ms | 52.3% bf16 MFU | 1623683 tok/s step 13089/19560 | loss 3.332996 (-0.50z)| norm 0.2940 (+0.92z)| lr 1.58e-04 | 322.90 ms | 52.3% bf16 MFU | 1623683 tok/s step 13090/19560 | loss 3.323113 (-0.72z)| norm 0.2920 (+0.80z)| lr 1.58e-04 | 323.22 ms | 52.2% bf16 MFU | 1623603 tok/s step 13091/19560 | loss 3.341410 (-0.28z)| norm 0.2667 (-0.68z)| lr 1.58e-04 | 322.85 ms | 52.3% bf16 MFU | 1623619 tok/s step 13092/19560 | loss 3.325218 (-0.67z)| norm 0.2954 (+1.01z)| lr 1.58e-04 | 323.91 ms | 52.1% bf16 MFU | 1623368 tok/s step 13093/19560 | loss 3.368968 (+0.38z)| norm 0.2754 (-0.17z)| lr 1.58e-04 | 322.74 ms | 52.3% bf16 MFU | 1623426 tok/s step 13094/19560 | loss 3.346832 (-0.15z)| norm 0.2790 (+0.05z)| lr 1.58e-04 | 322.92 ms | 52.3% bf16 MFU | 1623433 tok/s step 13095/19560 | loss 3.302007 (-1.22z)| norm 0.2660 (-0.71z)| lr 1.58e-04 | 322.86 ms | 52.3% bf16 MFU | 1623456 tok/s step 13096/19560 | loss 3.377635 (+0.60z)| norm 0.2798 (+0.11z)| lr 1.58e-04 | 322.81 ms | 52.3% bf16 MFU | 1623490 tok/s step 13097/19560 | loss 3.364291 (+0.27z)| norm 0.2665 (-0.67z)| lr 1.58e-04 | 323.89 ms | 52.1% bf16 MFU | 1623251 tok/s step 13098/19560 | loss 3.287202 (-1.56z)| norm 0.2595 (-1.07z)| lr 1.58e-04 | 323.02 ms | 52.2% bf16 MFU | 1623243 tok/s step 13099/19560 | loss 3.316705 (-0.84z)| norm 0.2644 (-0.78z)| lr 1.58e-04 | 322.80 ms | 52.3% bf16 MFU | 1623290 tok/s step 13100/19560 | loss 3.318317 (-0.80z)| norm 0.2838 (+0.35z)| lr 1.58e-04 | 322.76 ms | 52.3% bf16 MFU | 1623345 tok/s step 13101/19560 | loss 3.338438 (-0.33z)| norm 0.2859 (+0.46z)| lr 1.58e-04 | 322.79 ms | 52.3% bf16 MFU | 1623389 tok/s step 13102/19560 | loss 3.397157 (+1.07z)| norm 0.2961 (+1.05z)| lr 1.58e-04 | 323.25 ms | 52.2% bf16 MFU | 1623316 tok/s step 13103/19560 
| loss 3.469940 (+2.70z)| norm 0.2931 (+0.87z)| lr 1.57e-04 | 323.11 ms | 52.2% bf16 MFU | 1623282 tok/s step 13104/19560 | loss 3.433181 (+1.82z)| norm 0.2869 (+0.50z)| lr 1.57e-04 | 322.22 ms | 52.4% bf16 MFU | 1623473 tok/s step 13105/19560 | loss 3.368700 (+0.35z)| norm 0.2816 (+0.18z)| lr 1.57e-04 | 323.00 ms | 52.3% bf16 MFU | 1623459 tok/s step 13106/19560 | loss 3.346808 (-0.15z)| norm 0.2661 (-0.73z)| lr 1.57e-04 | 322.77 ms | 52.3% bf16 MFU | 1623503 tok/s step 13107/19560 | loss 3.436244 (+1.95z)| norm 0.2757 (-0.16z)| lr 1.57e-04 | 323.28 ms | 52.2% bf16 MFU | 1623417 tok/s step 13108/19560 | loss 3.493541 (+3.14z)| norm 0.2760 (-0.15z)| lr 1.57e-04 | 322.53 ms | 52.3% bf16 MFU | 1623523 tok/s step 13109/19560 | loss 3.307009 (-1.07z)| norm 0.2850 (+0.37z)| lr 1.57e-04 | 322.77 ms | 52.3% bf16 MFU | 1623565 tok/s step 13110/19560 | loss 3.390278 (+0.84z)| norm 0.2824 (+0.21z)| lr 1.57e-04 | 323.37 ms | 52.2% bf16 MFU | 1623452 tok/s step 13111/19560 | loss 3.385828 (+0.79z)| norm 0.2768 (-0.11z)| lr 1.57e-04 | 323.02 ms | 52.2% bf16 MFU | 1623434 tok/s step 13112/19560 | loss 3.354802 (+0.05z)| norm 0.2977 (+1.12z)| lr 1.57e-04 | 322.42 ms | 52.3% bf16 MFU | 1623566 tok/s step 13113/19560 | loss 3.360506 (+0.18z)| norm 0.2891 (+0.60z)| lr 1.57e-04 | 322.99 ms | 52.3% bf16 MFU | 1623549 tok/s step 13114/19560 | loss 3.341045 (-0.28z)| norm 0.2993 (+1.19z)| lr 1.57e-04 | 322.39 ms | 52.4% bf16 MFU | 1623685 tok/s step 13115/19560 | loss 3.348724 (-0.09z)| norm 0.2988 (+1.15z)| lr 1.57e-04 | 322.53 ms | 52.3% bf16 MFU | 1623778 tok/s step 13116/19560 | loss 3.330835 (-0.52z)| norm 0.2884 (+0.52z)| lr 1.57e-04 | 323.01 ms | 52.3% bf16 MFU | 1623746 tok/s step 13117/19560 | loss 3.390482 (+0.94z)| norm 0.2986 (+1.11z)| lr 1.57e-04 | 323.22 ms | 52.2% bf16 MFU | 1623662 tok/s step 13118/19560 | loss 3.311962 (-0.98z)| norm 0.2670 (-0.74z)| lr 1.57e-04 | 322.73 ms | 52.3% bf16 MFU | 1623705 tok/s step 13119/19560 | loss 3.288753 (-1.55z)| norm 0.2775 (-0.12z)| 
lr 1.57e-04 | 322.33 ms | 52.4% bf16 MFU | 1623849 tok/s step 13120/19560 | loss 3.368214 (+0.41z)| norm 0.2883 (+0.52z)| lr 1.57e-04 | 323.07 ms | 52.2% bf16 MFU | 1623798 tok/s step 13121/19560 | loss 3.376071 (+0.60z)| norm 0.2753 (-0.26z)| lr 1.57e-04 | 322.92 ms | 52.3% bf16 MFU | 1623788 tok/s step 13122/19560 | loss 3.343151 (-0.22z)| norm 0.3014 (+1.28z)| lr 1.57e-04 | 322.90 ms | 52.3% bf16 MFU | 1623783 tok/s step 13123/19560 | loss 3.402382 (+1.23z)| norm 0.2754 (-0.27z)| lr 1.57e-04 | 322.34 ms | 52.4% bf16 MFU | 1623920 tok/s step 13124/19560 | loss 3.320817 (-0.79z)| norm 0.2640 (-0.94z)| lr 1.57e-04 | 322.85 ms | 52.3% bf16 MFU | 1623921 tok/s step 13125/19560 | loss 3.327661 (-0.61z)| norm 0.2854 (+0.33z)| lr 1.57e-04 | 323.35 ms | 52.2% bf16 MFU | 1623796 tok/s step 13126/19560 | loss 3.298225 (-1.35z)| norm 0.2817 (+0.10z)| lr 1.56e-04 | 322.97 ms | 52.3% bf16 MFU | 1623774 tok/s step 13127/19560 | loss 3.343908 (-0.21z)| norm 0.2779 (-0.12z)| lr 1.56e-04 | 322.47 ms | 52.3% bf16 MFU | 1623877 tok/s step 13128/19560 | loss 3.341499 (-0.25z)| norm 0.2936 (+0.81z)| lr 1.56e-04 | 323.06 ms | 52.2% bf16 MFU | 1623828 tok/s step 13129/19560 | loss 3.348983 (-0.06z)| norm 0.2804 (+0.03z)| lr 1.56e-04 | 322.72 ms | 52.3% bf16 MFU | 1623867 tok/s step 13130/19560 | loss 3.369853 (+0.49z)| norm 0.2687 (-0.67z)| lr 1.56e-04 | 322.43 ms | 52.3% bf16 MFU | 1623977 tok/s step 13131/19560 | loss 3.331285 (-0.50z)| norm 0.2970 (+1.01z)| lr 1.56e-04 | 322.62 ms | 52.3% bf16 MFU | 1624034 tok/s step 13132/19560 | loss 3.318481 (-0.82z)| norm 0.2640 (-0.94z)| lr 1.56e-04 | 323.53 ms | 52.2% bf16 MFU | 1623858 tok/s step 13133/19560 | loss 3.354921 (+0.11z)| norm 0.2884 (+0.49z)| lr 1.56e-04 | 322.12 ms | 52.4% bf16 MFU | 1624046 tok/s step 13134/19560 | loss 3.310757 (-1.01z)| norm 0.2777 (-0.14z)| lr 1.56e-04 | 322.25 ms | 52.4% bf16 MFU | 1624193 tok/s step 13135/19560 | loss 3.334502 (-0.40z)| norm 0.2677 (-0.72z)| lr 1.56e-04 | 323.50 ms | 52.2% bf16 MFU | 
1624016 tok/s step 13136/19560 | loss 3.331988 (-0.46z)| norm 0.2831 (+0.19z)| lr 1.56e-04 | 323.19 ms | 52.2% bf16 MFU | 1623927 tok/s step 13137/19560 | loss 3.313402 (-0.94z)| norm 0.2908 (+0.64z)| lr 1.56e-04 | 323.32 ms | 52.2% bf16 MFU | 1623810 tok/s step 13138/19560 | loss 3.340302 (-0.23z)| norm 0.2924 (+0.73z)| lr 1.56e-04 | 323.18 ms | 52.2% bf16 MFU | 1623733 tok/s step 13139/19560 | loss 3.351012 (+0.05z)| norm 0.3172 (+2.15z)| lr 1.56e-04 | 322.90 ms | 52.3% bf16 MFU | 1623730 tok/s step 13140/19560 | loss 3.342140 (-0.17z)| norm 0.2721 (-0.49z)| lr 1.56e-04 | 323.39 ms | 52.2% bf16 MFU | 1623606 tok/s step 13141/19560 | loss 3.369702 (+0.55z)| norm 0.3186 (+2.19z)| lr 1.56e-04 | 322.22 ms | 52.4% bf16 MFU | 1623781 tok/s step 13142/19560 | loss 3.347184 (-0.03z)| norm 0.2562 (-1.42z)| lr 1.56e-04 | 323.29 ms | 52.2% bf16 MFU | 1623678 tok/s step 13143/19560 | loss 3.302464 (-1.18z)| norm 0.2807 (+0.00z)| lr 1.56e-04 | 323.04 ms | 52.2% bf16 MFU | 1623644 tok/s step 13144/19560 | loss 3.339898 (-0.22z)| norm 0.2691 (-0.67z)| lr 1.56e-04 | 322.21 ms | 52.4% bf16 MFU | 1623821 tok/s step 13145/19560 | loss 3.355040 (+0.17z)| norm 0.2925 (+0.69z)| lr 1.56e-04 | 322.70 ms | 52.3% bf16 MFU | 1623865 tok/s step 13146/19560 | loss 3.284242 (-1.67z)| norm 0.3986 (+5.82z)| lr 1.56e-04 | 323.47 ms | 52.2% bf16 MFU | 1623713 tok/s step 13147/19560 | loss 3.271193 (-1.97z)| norm 0.3113 (+1.46z)| lr 1.56e-04 | 322.38 ms | 52.4% bf16 MFU | 1623843 tok/s step 13148/19560 | loss 3.341526 (-0.13z)| norm 0.2968 (+0.75z)| lr 1.56e-04 | 323.04 ms | 52.2% bf16 MFU | 1623800 tok/s step 13149/19560 | loss 3.348227 (+0.04z)| norm 0.3340 (+2.51z)| lr 1.55e-04 | 323.01 ms | 52.2% bf16 MFU | 1623766 tok/s step 13150/19560 | loss 3.410889 (+1.66z)| norm 0.3202 (+1.82z)| lr 1.55e-04 | 322.80 ms | 52.3% bf16 MFU | 1623788 tok/s step 13151/19560 | loss 3.312001 (-0.91z)| norm 0.3283 (+2.15z)| lr 1.55e-04 | 322.95 ms | 52.3% bf16 MFU | 1623769 tok/s step 13152/19560 | loss 3.342292 
(-0.12z)| norm 0.2808 (-0.08z)| lr 1.55e-04 | 322.75 ms | 52.3% bf16 MFU | 1623801 tok/s step 13153/19560 | loss 3.317799 (-0.75z)| norm 0.2913 (+0.42z)| lr 1.55e-04 | 323.02 ms | 52.2% bf16 MFU | 1623766 tok/s step 13154/19560 | loss 3.333736 (-0.33z)| norm 0.3016 (+0.89z)| lr 1.55e-04 | 323.24 ms | 52.2% bf16 MFU | 1623676 tok/s step 13155/19560 | loss 3.390222 (+1.12z)| norm 0.2674 (-0.70z)| lr 1.55e-04 | 322.79 ms | 52.3% bf16 MFU | 1623705 tok/s step 13156/19560 | loss 3.330747 (-0.42z)| norm 0.2825 (+0.01z)| lr 1.55e-04 | 323.15 ms | 52.2% bf16 MFU | 1623640 tok/s step 13157/19560 | loss 3.344424 (-0.05z)| norm 0.2996 (+0.80z)| lr 1.55e-04 | 322.77 ms | 52.3% bf16 MFU | 1623676 tok/s step 13158/19560 | loss 3.283849 (-1.63z)| norm 0.2566 (-1.21z)| lr 1.55e-04 | 322.85 ms | 52.3% bf16 MFU | 1623688 tok/s step 13159/19560 | loss 3.303838 (-1.10z)| norm 0.2922 (+0.45z)| lr 1.55e-04 | 323.06 ms | 52.2% bf16 MFU | 1623648 tok/s step 13160/19560 | loss 3.326055 (-0.52z)| norm 0.3155 (+1.51z)| lr 1.55e-04 | 323.32 ms | 52.2% bf16 MFU | 1623544 tok/s step 13161/19560 | loss 3.402936 (+1.48z)| norm 0.2870 (+0.18z)| lr 1.55e-04 | 322.83 ms | 52.3% bf16 MFU | 1623568 tok/s step 13162/19560 | loss 3.337548 (-0.23z)| norm 0.2921 (+0.41z)| lr 1.55e-04 | 323.41 ms | 52.2% bf16 MFU | 1623446 tok/s step 13163/19560 | loss 3.388220 (+1.08z)| norm 0.2640 (-0.88z)| lr 1.55e-04 | 322.98 ms | 52.3% bf16 MFU | 1623439 tok/s step 13164/19560 | loss 3.379533 (+0.85z)| norm 0.2682 (-0.69z)| lr 1.55e-04 | 323.31 ms | 52.2% bf16 MFU | 1623348 tok/s step 13165/19560 | loss 3.307349 (-1.01z)| norm 0.2775 (-0.27z)| lr 1.55e-04 | 322.96 ms | 52.3% bf16 MFU | 1623349 tok/s step 13166/19560 | loss 3.300474 (-1.17z)| norm 0.2779 (-0.26z)| lr 1.55e-04 | 323.72 ms | 52.1% bf16 MFU | 1623159 tok/s step 13167/19560 | loss 3.337467 (-0.21z)| norm 0.2628 (-0.97z)| lr 1.55e-04 | 322.58 ms | 52.3% bf16 MFU | 1623266 tok/s step 13168/19560 | loss 3.371635 (+0.67z)| norm 0.2879 (+0.20z)| lr 1.55e-04 | 
323.23 ms | 52.2% bf16 MFU | 1623204 tok/s step 13169/19560 | loss 3.336385 (-0.25z)| norm 0.2612 (-1.04z)| lr 1.55e-04 | 323.14 ms | 52.2% bf16 MFU | 1623168 tok/s step 13170/19560 | loss 3.338675 (-0.20z)| norm 0.2759 (-0.36z)| lr 1.55e-04 | 323.01 ms | 52.3% bf16 MFU | 1623167 tok/s step 13171/19560 | loss 3.398988 (+1.35z)| norm 0.2675 (-0.74z)| lr 1.54e-04 | 322.78 ms | 52.3% bf16 MFU | 1623224 tok/s step 13172/19560 | loss 3.326303 (-0.52z)| norm 0.2772 (-0.28z)| lr 1.54e-04 | 322.85 ms | 52.3% bf16 MFU | 1623258 tok/s step 13173/19560 | loss 3.272974 (-1.87z)| norm 0.2541 (-1.37z)| lr 1.54e-04 | 322.76 ms | 52.3% bf16 MFU | 1623315 tok/s step 13174/19560 | loss 3.343777 (-0.06z)| norm 0.3043 (+1.00z)| lr 1.54e-04 | 323.21 ms | 52.2% bf16 MFU | 1623255 tok/s step 13175/19560 | loss 3.351463 (+0.13z)| norm 0.2630 (-0.97z)| lr 1.54e-04 | 322.91 ms | 52.3% bf16 MFU | 1623273 tok/s step 13176/19560 | loss 3.406867 (+1.55z)| norm 0.2872 (+0.17z)| lr 1.54e-04 | 323.54 ms | 52.2% bf16 MFU | 1623134 tok/s step 13177/19560 | loss 3.329792 (-0.44z)| norm 0.2930 (+0.44z)| lr 1.54e-04 | 322.46 ms | 52.3% bf16 MFU | 1623273 tok/s step 13178/19560 | loss 3.331733 (-0.38z)| norm 0.2613 (-1.07z)| lr 1.54e-04 | 323.33 ms | 52.2% bf16 MFU | 1623186 tok/s step 13179/19560 | loss 3.372852 (+0.67z)| norm 0.2831 (-0.03z)| lr 1.54e-04 | 323.56 ms | 52.2% bf16 MFU | 1623044 tok/s step 13180/19560 | loss 3.368658 (+0.56z)| norm 0.2915 (+0.36z)| lr 1.54e-04 | 323.16 ms | 52.2% bf16 MFU | 1623012 tok/s step 13181/19560 | loss 3.374020 (+0.70z)| norm 0.2688 (-0.73z)| lr 1.54e-04 | 322.61 ms | 52.3% bf16 MFU | 1623118 tok/s step 13182/19560 | loss 3.343636 (-0.11z)| norm 0.2782 (-0.26z)| lr 1.54e-04 | 322.97 ms | 52.3% bf16 MFU | 1623128 tok/s step 13183/19560 | loss 3.441850 (+2.41z)| norm 0.2710 (-0.61z)| lr 1.54e-04 | 322.90 ms | 52.3% bf16 MFU | 1623156 tok/s step 13184/19560 | loss 3.328897 (-0.50z)| norm 0.2772 (-0.30z)| lr 1.54e-04 | 322.74 ms | 52.3% bf16 MFU | 1623224 tok/s step 
13185/19560 | loss 3.315398 (-0.84z)| norm 0.2540 (-1.45z)| lr 1.54e-04 | 322.89 ms | 52.3% bf16 MFU | 1623249 tok/s step 13186/19560 | loss 3.335391 (-0.32z)| norm 0.2861 (+0.14z)| lr 1.54e-04 | 323.05 ms | 52.2% bf16 MFU | 1623232 tok/s step 13187/19560 | loss 3.375480 (+0.70z)| norm 0.2813 (-0.10z)| lr 1.54e-04 | 323.01 ms | 52.2% bf16 MFU | 1623226 tok/s step 13188/19560 | loss 3.360125 (+0.31z)| norm 0.2674 (-0.78z)| lr 1.54e-04 | 322.76 ms | 52.3% bf16 MFU | 1623284 tok/s step 13189/19560 | loss 3.373705 (+0.66z)| norm 0.2796 (-0.18z)| lr 1.54e-04 | 322.94 ms | 52.3% bf16 MFU | 1623294 tok/s step 13190/19560 | loss 3.310280 (-0.98z)| norm 0.2672 (-0.78z)| lr 1.54e-04 | 322.96 ms | 52.3% bf16 MFU | 1623299 tok/s step 13191/19560 | loss 3.338978 (-0.22z)| norm 0.2789 (-0.20z)| lr 1.54e-04 | 323.45 ms | 52.2% bf16 MFU | 1623180 tok/s step 13192/19560 | loss 3.341116 (-0.16z)| norm 0.2639 (-0.95z)| lr 1.54e-04 | 322.69 ms | 52.3% bf16 MFU | 1623258 tok/s step 13193/19560 | loss 3.338231 (-0.23z)| norm 0.2770 (-0.30z)| lr 1.54e-04 | 323.20 ms | 52.2% bf16 MFU | 1623204 tok/s step 13194/19560 | loss 3.310763 (-0.94z)| norm 0.2785 (-0.24z)| lr 1.53e-04 | 322.93 ms | 52.3% bf16 MFU | 1623220 tok/s step 13195/19560 | loss 3.352880 (+0.19z)| norm 0.2711 (-0.61z)| lr 1.53e-04 | 323.32 ms | 52.2% bf16 MFU | 1623138 tok/s step 13196/19560 | loss 3.352690 (+0.18z)| norm 0.2627 (-1.02z)| lr 1.53e-04 | 323.22 ms | 52.2% bf16 MFU | 1623086 tok/s step 13197/19560 | loss 3.395186 (+1.29z)| norm 0.3178 (+1.70z)| lr 1.53e-04 | 323.01 ms | 52.2% bf16 MFU | 1623088 tok/s step 13198/19560 | loss 3.411505 (+1.71z)| norm 0.2892 (+0.27z)| lr 1.53e-04 | 323.13 ms | 52.2% bf16 MFU | 1623060 tok/s step 13199/19560 | loss 3.312102 (-0.90z)| norm 0.3176 (+1.65z)| lr 1.53e-04 | 323.48 ms | 52.2% bf16 MFU | 1622947 tok/s step 13200/19560 | loss 3.268938 (-1.99z)| norm 0.2822 (-0.11z)| lr 1.53e-04 | 323.19 ms | 52.2% bf16 MFU | 1622912 tok/s step 13201/19560 | loss 3.433969 (+2.26z)| norm 
0.2949 (+0.51z)| lr 1.53e-04 | 323.29 ms | 52.2% bf16 MFU | 1622852 tok/s step 13202/19560 | loss 3.311382 (-0.88z)| norm 0.2922 (+0.37z)| lr 1.53e-04 | 323.09 ms | 52.2% bf16 MFU | 1622846 tok/s step 13203/19560 | loss 3.308841 (-0.94z)| norm 0.2665 (-0.92z)| lr 1.53e-04 | 323.54 ms | 52.2% bf16 MFU | 1622727 tok/s step 13204/19560 | loss 3.374510 (+0.73z)| norm 0.2927 (+0.38z)| lr 1.53e-04 | 323.03 ms | 52.2% bf16 MFU | 1622742 tok/s step 13205/19560 | loss 3.370029 (+0.61z)| norm 0.2924 (+0.36z)| lr 1.53e-04 | 323.12 ms | 52.2% bf16 MFU | 1622733 tok/s step 13206/19560 | loss 3.313938 (-0.82z)| norm 0.2866 (+0.09z)| lr 1.53e-04 | 322.99 ms | 52.3% bf16 MFU | 1622758 tok/s step 13207/19560 | loss 3.443592 (+2.42z)| norm 0.2844 (-0.02z)| lr 1.53e-04 | 322.95 ms | 52.3% bf16 MFU | 1622791 tok/s step 13208/19560 | loss 3.356078 (+0.24z)| norm 0.2893 (+0.25z)| lr 1.53e-04 | 322.33 ms | 52.4% bf16 MFU | 1622979 tok/s step 13209/19560 | loss 3.360491 (+0.35z)| norm 0.2730 (-0.62z)| lr 1.53e-04 | 322.35 ms | 52.4% bf16 MFU | 1623153 tok/s step 13210/19560 | loss 3.335787 (-0.28z)| norm 0.2770 (-0.39z)| lr 1.53e-04 | 323.55 ms | 52.2% bf16 MFU | 1623016 tok/s step 13211/19560 | loss 3.344670 (-0.06z)| norm 0.2603 (-1.27z)| lr 1.53e-04 | 322.51 ms | 52.3% bf16 MFU | 1623148 tok/s step 13212/19560 | loss 3.355027 (+0.20z)| norm 0.2716 (-0.65z)| lr 1.53e-04 | 323.36 ms | 52.2% bf16 MFU | 1623058 tok/s step 13213/19560 | loss 3.335066 (-0.32z)| norm 0.2861 (+0.13z)| lr 1.53e-04 | 323.16 ms | 52.2% bf16 MFU | 1623023 tok/s step 13214/19560 | loss 3.391491 (+1.10z)| norm 0.2745 (-0.49z)| lr 1.53e-04 | 323.01 ms | 52.3% bf16 MFU | 1623030 tok/s step 13215/19560 | loss 3.270062 (-1.97z)| norm 0.2684 (-0.81z)| lr 1.53e-04 | 323.10 ms | 52.2% bf16 MFU | 1623012 tok/s step 13216/19560 | loss 3.355533 (+0.18z)| norm 0.2936 (+0.54z)| lr 1.53e-04 | 322.96 ms | 52.3% bf16 MFU | 1623031 tok/s step 13217/19560 | loss 3.342635 (-0.14z)| norm 0.2656 (-0.95z)| lr 1.52e-04 | 323.55 ms | 
52.2% bf16 MFU | 1622901 tok/s step 13218/19560 | loss 3.332864 (-0.39z)| norm 0.2838 (+0.03z)| lr 1.52e-04 | 323.60 ms | 52.2% bf16 MFU | 1622764 tok/s step 13219/19560 | loss 3.315835 (-0.81z)| norm 0.2749 (-0.45z)| lr 1.52e-04 | 322.84 ms | 52.3% bf16 MFU | 1622825 tok/s step 13220/19560 | loss 3.350419 (+0.05z)| norm 0.2670 (-0.86z)| lr 1.52e-04 | 323.35 ms | 52.2% bf16 MFU | 1622754 tok/s step 13221/19560 | loss 3.369273 (+0.53z)| norm 0.2792 (-0.21z)| lr 1.52e-04 | 322.28 ms | 52.4% bf16 MFU | 1622956 tok/s step 13222/19560 | loss 3.357223 (+0.22z)| norm 0.2609 (-1.18z)| lr 1.52e-04 | 323.25 ms | 52.2% bf16 MFU | 1622905 tok/s step 13223/19560 | loss 3.368250 (+0.49z)| norm 0.2764 (-0.35z)| lr 1.52e-04 | 323.16 ms | 52.2% bf16 MFU | 1622880 tok/s step 13224/19560 | loss 3.366239 (+0.44z)| norm 0.2862 (+0.17z)| lr 1.52e-04 | 322.99 ms | 52.3% bf16 MFU | 1622898 tok/s step 13225/19560 | loss 3.380651 (+0.80z)| norm 0.2616 (-1.14z)| lr 1.52e-04 | 322.81 ms | 52.3% bf16 MFU | 1622961 tok/s step 13226/19560 | loss 3.447257 (+2.42z)| norm 0.2851 (+0.10z)| lr 1.52e-04 | 322.56 ms | 52.3% bf16 MFU | 1623084 tok/s step 13227/19560 | loss 3.351877 (+0.03z)| norm 0.2645 (-1.01z)| lr 1.52e-04 | 323.24 ms | 52.2% bf16 MFU | 1623028 tok/s step 13228/19560 | loss 3.346725 (-0.10z)| norm 0.2857 (+0.13z)| lr 1.52e-04 | 323.17 ms | 52.2% bf16 MFU | 1622992 tok/s step 13229/19560 | loss 3.400778 (+1.24z)| norm 0.2636 (-1.04z)| lr 1.52e-04 | 323.02 ms | 52.2% bf16 MFU | 1622998 tok/s step 13230/19560 | loss 3.398883 (+1.19z)| norm 0.2690 (-0.74z)| lr 1.52e-04 | 322.86 ms | 52.3% bf16 MFU | 1623041 tok/s step 13231/19560 | loss 3.334824 (-0.40z)| norm 0.2772 (-0.30z)| lr 1.52e-04 | 322.45 ms | 52.3% bf16 MFU | 1623186 tok/s step 13232/19560 | loss 3.331903 (-0.46z)| norm 0.2747 (-0.43z)| lr 1.52e-04 | 323.50 ms | 52.2% bf16 MFU | 1623059 tok/s step 13233/19560 | loss 3.392551 (+1.13z)| norm 0.2907 (+0.42z)| lr 1.52e-04 | 322.58 ms | 52.3% bf16 MFU | 1623171 tok/s step 13234/19560 
| loss 3.247423 (-2.58z)| norm 0.2670 (-0.84z)| lr 1.52e-04 | 323.00 ms | 52.3% bf16 MFU | 1623170 tok/s step 13235/19560 | loss 3.333443 (-0.38z)| norm 0.2811 (-0.09z)| lr 1.52e-04 | 322.99 ms | 52.3% bf16 MFU | 1623172 tok/s step 13236/19560 | loss 3.336062 (-0.30z)| norm 0.2729 (-0.53z)| lr 1.52e-04 | 323.31 ms | 52.2% bf16 MFU | 1623095 tok/s step 13237/19560 | loss 3.348660 (+0.04z)| norm 0.2614 (-1.13z)| lr 1.52e-04 | 323.12 ms | 52.2% bf16 MFU | 1623070 tok/s step 13238/19560 | loss 3.318679 (-0.77z)| norm 0.2850 (+0.13z)| lr 1.52e-04 | 323.08 ms | 52.2% bf16 MFU | 1623055 tok/s step 13239/19560 | loss 3.331538 (-0.41z)| norm 0.2587 (-1.25z)| lr 1.52e-04 | 322.55 ms | 52.3% bf16 MFU | 1623174 tok/s step 13240/19560 | loss 3.334723 (-0.31z)| norm 0.2967 (+0.75z)| lr 1.51e-04 | 322.70 ms | 52.3% bf16 MFU | 1623250 tok/s step 13241/19560 | loss 3.314700 (-0.86z)| norm 0.2655 (-0.88z)| lr 1.51e-04 | 323.21 ms | 52.2% bf16 MFU | 1623193 tok/s step 13242/19560 | loss 3.356667 (+0.30z)| norm 0.2729 (-0.48z)| lr 1.51e-04 | 322.92 ms | 52.3% bf16 MFU | 1623212 tok/s step 13243/19560 | loss 3.368518 (+0.63z)| norm 0.2637 (-0.96z)| lr 1.51e-04 | 323.61 ms | 52.2% bf16 MFU | 1623056 tok/s step 13244/19560 | loss 3.409691 (+1.74z)| norm 0.2656 (-0.84z)| lr 1.51e-04 | 323.03 ms | 52.2% bf16 MFU | 1623055 tok/s step 13245/19560 | loss 3.316786 (-0.80z)| norm 0.2765 (-0.26z)| lr 1.51e-04 | 322.96 ms | 52.3% bf16 MFU | 1623071 tok/s step 13246/19560 | loss 3.328049 (-0.50z)| norm 0.2835 (+0.10z)| lr 1.51e-04 | 322.66 ms | 52.3% bf16 MFU | 1623162 tok/s step 13247/19560 | loss 3.308448 (-1.05z)| norm 0.2645 (-0.90z)| lr 1.51e-04 | 322.78 ms | 52.3% bf16 MFU | 1623217 tok/s step 13248/19560 | loss 3.327574 (-0.51z)| norm 0.2752 (-0.33z)| lr 1.51e-04 | 322.84 ms | 52.3% bf16 MFU | 1623256 tok/s step 13249/19560 | loss 3.337945 (-0.21z)| norm 0.2619 (-1.02z)| lr 1.51e-04 | 322.91 ms | 52.3% bf16 MFU | 1623276 tok/s step 13250/19560 | loss 3.293020 (-1.44z)| norm 0.2855 (+0.23z)| 
lr 1.51e-04 | 322.94 ms | 52.3% bf16 MFU | 1623287 tok/s val loss 3.329666 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating HellaSwag: 70/79 HellaSwag: 3003/10042 = 0.299044 step 13251/19560 | loss 3.339493 (-0.15z)| norm 0.2541 (-1.41z)| lr 1.51e-04 | 322.59 ms | 52.3% bf16 MFU | 1623384 tok/s step 13252/19560 | loss 3.294064 (-1.40z)| norm 0.2796 (-0.08z)| lr 1.51e-04 | 323.35 ms | 52.2% bf16 MFU | 1623287 tok/s step 13253/19560 | loss 3.282135 (-1.70z)| norm 0.2626 (-0.96z)| lr 1.51e-04 | 322.88 ms | 52.3% bf16 MFU | 1623311 tok/s step 13254/19560 | loss 3.571083 (+5.44z)| norm 0.2791 (-0.09z)| lr 1.51e-04 | 322.60 ms | 52.3% bf16 MFU | 1623406 tok/s step 13255/19560 | loss 3.285615 (-1.44z)| norm 0.2840 (+0.16z)| lr 1.51e-04 | 322.47 ms | 52.3% bf16 MFU | 1623527 tok/s step 13256/19560 | loss 3.365157 (+0.46z)| norm 0.2731 (-0.40z)| lr 1.51e-04 | 322.80 ms | 52.3% bf16 MFU | 1623560 tok/s step 13257/19560 | loss 3.294878 (-1.21z)| norm 0.2700 (-0.56z)| lr 1.51e-04 | 322.82 ms | 52.3% bf16 MFU | 1623587 tok/s step 13258/19560 | loss 3.309510 (-0.85z)| norm 0.2885 (+0.40z)| lr 1.51e-04 | 323.24 ms | 52.2% bf16 MFU | 1623506 tok/s step 13259/19560 | loss 3.339152 (-0.14z)| norm 0.2756 (-0.27z)| lr 1.51e-04 | 322.92 ms | 52.3% bf16 MFU | 1623511 tok/s step 13260/19560 | loss 3.300207 (-1.06z)| norm 0.3026 (+1.14z)| lr 1.51e-04 | 322.58 ms | 52.3% bf16 MFU | 1623600 tok/s step 13261/19560 | loss 3.376960 (+0.75z)| norm 0.2936 (+0.66z)| lr 1.51e-04 | 322.51 ms | 52.3% bf16 MFU | 1623703 tok/s step 13262/19560 | loss 3.272938 (-1.69z)| norm 0.2817 (+0.04z)| lr 1.51e-04 | 323.11 ms | 52.2% bf16 MFU | 1623649 tok/s step 13263/19560 | loss 3.331372 (-0.32z)| norm 0.2769 (-0.22z)| lr 1.50e-04 | 322.93 ms | 52.3% bf16 MFU | 1623645 tok/s step 13264/19560 | loss 3.292324 (-1.22z)| norm 0.2953 (+0.74z)| lr 1.50e-04 | 
322.80 ms | 52.3% bf16 MFU | 1623672 tok/s step 13265/19560 | loss 3.375664 (+0.72z)| norm 0.2930 (+0.62z)| lr 1.50e-04 | 322.29 ms | 52.4% bf16 MFU | 1623827 tok/s step 13266/19560 | loss 3.270020 (-1.72z)| norm 0.2796 (-0.08z)| lr 1.50e-04 | 322.89 ms | 52.3% bf16 MFU | 1623823 tok/s step 13267/19560 | loss 3.399227 (+1.25z)| norm 0.2856 (+0.25z)| lr 1.50e-04 | 322.87 ms | 52.3% bf16 MFU | 1623825 tok/s step 13268/19560 | loss 3.336763 (-0.18z)| norm 0.3924 (+5.22z)| lr 1.50e-04 | 322.18 ms | 52.4% bf16 MFU | 1623998 tok/s step 13269/19560 | loss 3.359411 (+0.34z)| norm 0.2935 (+0.57z)| lr 1.50e-04 | 322.78 ms | 52.3% bf16 MFU | 1624014 tok/s step 13270/19560 | loss 3.333028 (-0.27z)| norm 0.3286 (+2.19z)| lr 1.50e-04 | 323.27 ms | 52.2% bf16 MFU | 1623904 tok/s step 13271/19560 | loss 3.315077 (-0.68z)| norm 0.3003 (+0.84z)| lr 1.50e-04 | 322.59 ms | 52.3% bf16 MFU | 1623972 tok/s step 13272/19560 | loss 3.312236 (-0.74z)| norm 0.3105 (+1.30z)| lr 1.50e-04 | 322.63 ms | 52.3% bf16 MFU | 1624026 tok/s step 13273/19560 | loss 3.352365 (+0.18z)| norm 0.2990 (+0.76z)| lr 1.50e-04 | 323.42 ms | 52.2% bf16 MFU | 1623878 tok/s step 13274/19560 | loss 3.332426 (-0.29z)| norm 0.3086 (+1.40z)| lr 1.50e-04 | 322.69 ms | 52.3% bf16 MFU | 1623921 tok/s step 13275/19560 | loss 3.327506 (-0.42z)| norm 0.3131 (+1.63z)| lr 1.50e-04 | 322.35 ms | 52.4% bf16 MFU | 1624047 tok/s step 13276/19560 | loss 3.354770 (+0.22z)| norm 0.3089 (+1.40z)| lr 1.50e-04 | 322.92 ms | 52.3% bf16 MFU | 1624023 tok/s step 13277/19560 | loss 3.386980 (+0.96z)| norm 0.2912 (+0.51z)| lr 1.50e-04 | 322.87 ms | 52.3% bf16 MFU | 1624013 tok/s step 13278/19560 | loss 3.148583 (-4.25z)| norm 0.4444 (+6.97z)| lr 1.50e-04 | 322.66 ms | 52.3% bf16 MFU | 1624057 tok/s step 13279/19560 | loss 3.344130 (+0.00z)| norm 0.2922 (+0.43z)| lr 1.50e-04 | 323.41 ms | 52.2% bf16 MFU | 1623910 tok/s step 13280/19560 | loss 3.303004 (-0.88z)| norm 0.2987 (+0.70z)| lr 1.50e-04 | 322.47 ms | 52.3% bf16 MFU | 1624005 tok/s step 
13281/19560 | loss 3.288344 (-1.19z)| norm 0.2689 (-0.59z)| lr 1.50e-04 | 323.27 ms | 52.2% bf16 MFU | 1623898 tok/s step 13282/19560 | loss 3.291433 (-1.11z)| norm 0.2806 (-0.08z)| lr 1.50e-04 | 322.77 ms | 52.3% bf16 MFU | 1623920 tok/s step 13283/19560 | loss 3.385062 (+0.91z)| norm 0.2679 (-0.63z)| lr 1.50e-04 | 322.46 ms | 52.3% bf16 MFU | 1624019 tok/s step 13284/19560 | loss 3.313063 (-0.64z)| norm 0.2818 (-0.02z)| lr 1.50e-04 | 322.80 ms | 52.3% bf16 MFU | 1624026 tok/s step 13285/19560 | loss 3.361894 (+0.41z)| norm 0.2945 (+0.53z)| lr 1.50e-04 | 323.26 ms | 52.2% bf16 MFU | 1623918 tok/s step 13286/19560 | loss 3.264485 (-1.68z)| norm 0.2696 (-0.56z)| lr 1.49e-04 | 322.67 ms | 52.3% bf16 MFU | 1623965 tok/s step 13287/19560 | loss 3.460093 (+2.43z)| norm 0.3136 (+1.36z)| lr 1.49e-04 | 322.77 ms | 52.3% bf16 MFU | 1623982 tok/s step 13288/19560 | loss 3.315486 (-0.60z)| norm 0.2954 (+0.57z)| lr 1.49e-04 | 322.44 ms | 52.3% bf16 MFU | 1624083 tok/s step 13289/19560 | loss 3.357882 (+0.30z)| norm 0.2693 (-0.57z)| lr 1.49e-04 | 322.62 ms | 52.3% bf16 MFU | 1624134 tok/s step 13290/19560 | loss 3.284454 (-1.23z)| norm 0.3034 (+0.93z)| lr 1.49e-04 | 323.24 ms | 52.2% bf16 MFU | 1624026 tok/s step 13291/19560 | loss 3.357602 (+0.31z)| norm 0.3013 (+0.82z)| lr 1.49e-04 | 322.89 ms | 52.3% bf16 MFU | 1624011 tok/s step 13292/19560 | loss 3.345352 (+0.06z)| norm 0.2552 (-1.19z)| lr 1.49e-04 | 323.37 ms | 52.2% bf16 MFU | 1623877 tok/s step 13293/19560 | loss 3.302577 (-0.84z)| norm 0.2936 (+0.48z)| lr 1.49e-04 | 322.81 ms | 52.3% bf16 MFU | 1623890 tok/s step 13294/19560 | loss 3.346548 (+0.07z)| norm 0.2761 (-0.28z)| lr 1.49e-04 | 323.24 ms | 52.2% bf16 MFU | 1623795 tok/s step 13295/19560 | loss 3.362423 (+0.40z)| norm 0.2656 (-0.74z)| lr 1.49e-04 | 323.23 ms | 52.2% bf16 MFU | 1623706 tok/s step 13296/19560 | loss 3.292336 (-1.06z)| norm 0.2618 (-0.90z)| lr 1.49e-04 | 322.99 ms | 52.3% bf16 MFU | 1623682 tok/s step 13297/19560 | loss 3.346937 (+0.09z)| norm 
0.2721 (-0.46z)| lr 1.49e-04 | 323.29 ms | 52.2% bf16 MFU | 1623585 tok/s step 13298/19560 | loss 3.322954 (-0.41z)| norm 0.3750 (+3.77z)| lr 1.49e-04 | 322.49 ms | 52.3% bf16 MFU | 1623694 tok/s step 13299/19560 | loss 3.391516 (+1.03z)| norm 0.2943 (+0.45z)| lr 1.49e-04 | 322.83 ms | 52.3% bf16 MFU | 1623710 tok/s step 13300/19560 | loss 3.312261 (-0.63z)| norm 0.2620 (-0.88z)| lr 1.49e-04 | 323.12 ms | 52.2% bf16 MFU | 1623654 tok/s step 13301/19560 | loss 3.448501 (+2.18z)| norm 0.2634 (-0.83z)| lr 1.49e-04 | 322.54 ms | 52.3% bf16 MFU | 1623746 tok/s step 13302/19560 | loss 3.381381 (+0.77z)| norm 0.2827 (-0.02z)| lr 1.49e-04 | 322.40 ms | 52.3% bf16 MFU | 1623869 tok/s step 13303/19560 | loss 3.365498 (+0.44z)| norm 0.2726 (-0.45z)| lr 1.49e-04 | 322.55 ms | 52.3% bf16 MFU | 1623947 tok/s step 13304/19560 | loss 3.358951 (+0.32z)| norm 0.2812 (-0.09z)| lr 1.49e-04 | 322.87 ms | 52.3% bf16 MFU | 1623942 tok/s step 13305/19560 | loss 3.381014 (+0.77z)| norm 0.2813 (-0.08z)| lr 1.49e-04 | 322.84 ms | 52.3% bf16 MFU | 1623945 tok/s step 13306/19560 | loss 3.322793 (-0.44z)| norm 0.2794 (-0.17z)| lr 1.49e-04 | 322.42 ms | 52.3% bf16 MFU | 1624052 tok/s step 13307/19560 | loss 3.371900 (+0.58z)| norm 0.2578 (-1.05z)| lr 1.49e-04 | 322.82 ms | 52.3% bf16 MFU | 1624053 tok/s step 13308/19560 | loss 3.332172 (-0.24z)| norm 0.2956 (+0.51z)| lr 1.49e-04 | 323.54 ms | 52.2% bf16 MFU | 1623873 tok/s step 13309/19560 | loss 3.318428 (-0.52z)| norm 0.2737 (-0.40z)| lr 1.49e-04 | 323.17 ms | 52.2% bf16 MFU | 1623795 tok/s step 13310/19560 | loss 3.319021 (-0.50z)| norm 0.2902 (+0.28z)| lr 1.48e-04 | 322.64 ms | 52.3% bf16 MFU | 1623856 tok/s step 13311/19560 | loss 3.379584 (+0.78z)| norm 0.2909 (+0.31z)| lr 1.48e-04 | 322.81 ms | 52.3% bf16 MFU | 1623869 tok/s step 13312/19560 | loss 3.274375 (-1.43z)| norm 0.2766 (-0.28z)| lr 1.48e-04 | 322.39 ms | 52.4% bf16 MFU | 1623989 tok/s step 13313/19560 | loss 3.313108 (-0.61z)| norm 0.2860 (+0.09z)| lr 1.48e-04 | 322.49 ms | 
52.3% bf16 MFU | 1624076 tok/s step 13314/19560 | loss 3.378565 (+0.75z)| norm 0.3712 (+3.44z)| lr 1.48e-04 | 322.87 ms | 52.3% bf16 MFU | 1624064 tok/s step 13315/19560 | loss 3.296846 (-0.94z)| norm 0.3051 (+0.81z)| lr 1.48e-04 | 322.95 ms | 52.3% bf16 MFU | 1624033 tok/s step 13316/19560 | loss 3.338048 (-0.08z)| norm 0.2844 (-0.01z)| lr 1.48e-04 | 323.48 ms | 52.2% bf16 MFU | 1623871 tok/s step 13317/19560 | loss 3.332285 (-0.19z)| norm 0.2789 (-0.23z)| lr 1.48e-04 | 322.99 ms | 52.3% bf16 MFU | 1623838 tok/s step 13318/19560 | loss 3.383408 (+0.86z)| norm 0.2621 (-0.89z)| lr 1.48e-04 | 323.01 ms | 52.3% bf16 MFU | 1623804 tok/s step 13319/19560 | loss 3.276959 (-1.34z)| norm 0.2780 (-0.26z)| lr 1.48e-04 | 322.64 ms | 52.3% bf16 MFU | 1623863 tok/s step 13320/19560 | loss 3.389135 (+0.98z)| norm 0.2795 (-0.21z)| lr 1.48e-04 | 323.34 ms | 52.2% bf16 MFU | 1623742 tok/s step 13321/19560 | loss 3.284868 (-1.17z)| norm 0.2799 (-0.19z)| lr 1.48e-04 | 323.12 ms | 52.2% bf16 MFU | 1623683 tok/s step 13322/19560 | loss 3.367884 (+0.53z)| norm 0.3019 (+0.67z)| lr 1.48e-04 | 323.25 ms | 52.2% bf16 MFU | 1623596 tok/s step 13323/19560 | loss 3.437415 (+1.92z)| norm 0.2725 (-0.50z)| lr 1.48e-04 | 322.90 ms | 52.3% bf16 MFU | 1623602 tok/s step 13324/19560 | loss 3.313076 (-0.59z)| norm 0.3002 (+0.59z)| lr 1.48e-04 | 323.10 ms | 52.2% bf16 MFU | 1623555 tok/s step 13325/19560 | loss 3.368297 (+0.53z)| norm 0.2711 (-0.55z)| lr 1.48e-04 | 322.63 ms | 52.3% bf16 MFU | 1623631 tok/s step 13326/19560 | loss 3.322444 (-0.39z)| norm 0.2727 (-0.48z)| lr 1.48e-04 | 323.34 ms | 52.2% bf16 MFU | 1623523 tok/s step 13327/19560 | loss 3.464168 (+2.43z)| norm 0.2889 (+0.17z)| lr 1.48e-04 | 322.75 ms | 52.3% bf16 MFU | 1623570 tok/s step 13328/19560 | loss 3.320343 (-0.46z)| norm 0.2859 (+0.05z)| lr 1.48e-04 | 322.61 ms | 52.3% bf16 MFU | 1623648 tok/s step 13329/19560 | loss 3.354076 (+0.24z)| norm 0.2931 (+0.34z)| lr 1.48e-04 | 323.21 ms | 52.2% bf16 MFU | 1623572 tok/s step 13330/19560 
| loss 3.317674 (-0.51z)| norm 0.2963 (+0.47z)| lr 1.48e-04 | 323.55 ms | 52.2% bf16 MFU | 1623414 tok/s step 13331/19560 | loss 3.358287 (+0.32z)| norm 0.2625 (-0.89z)| lr 1.48e-04 | 322.77 ms | 52.3% bf16 MFU | 1623460 tok/s step 13332/19560 | loss 3.315583 (-0.55z)| norm 0.2761 (-0.34z)| lr 1.48e-04 | 323.12 ms | 52.2% bf16 MFU | 1623417 tok/s step 13333/19560 | loss 3.355958 (+0.28z)| norm 0.2717 (-0.51z)| lr 1.47e-04 | 322.86 ms | 52.3% bf16 MFU | 1623442 tok/s step 13334/19560 | loss 3.317071 (-0.52z)| norm 0.2860 (+0.07z)| lr 1.47e-04 | 322.85 ms | 52.3% bf16 MFU | 1623467 tok/s step 13335/19560 | loss 3.397318 (+1.15z)| norm 0.2806 (-0.15z)| lr 1.47e-04 | 323.58 ms | 52.2% bf16 MFU | 1623307 tok/s step 13336/19560 | loss 3.390915 (+1.01z)| norm 0.2923 (+0.32z)| lr 1.47e-04 | 322.90 ms | 52.3% bf16 MFU | 1623327 tok/s step 13337/19560 | loss 3.294164 (-0.98z)| norm 0.3195 (+1.39z)| lr 1.47e-04 | 322.81 ms | 52.3% bf16 MFU | 1623368 tok/s step 13338/19560 | loss 3.333239 (-0.17z)| norm 0.2875 (+0.11z)| lr 1.47e-04 | 322.94 ms | 52.3% bf16 MFU | 1623373 tok/s step 13339/19560 | loss 3.329916 (-0.24z)| norm 0.2740 (-0.43z)| lr 1.47e-04 | 323.42 ms | 52.2% bf16 MFU | 1623258 tok/s step 13340/19560 | loss 3.328771 (-0.26z)| norm 0.2762 (-0.35z)| lr 1.47e-04 | 322.99 ms | 52.3% bf16 MFU | 1623257 tok/s step 13341/19560 | loss 3.322578 (-0.39z)| norm 0.2967 (+0.47z)| lr 1.47e-04 | 322.62 ms | 52.3% bf16 MFU | 1623350 tok/s step 13342/19560 | loss 3.290073 (-1.04z)| norm 0.2792 (-0.23z)| lr 1.47e-04 | 322.87 ms | 52.3% bf16 MFU | 1623373 tok/s step 13343/19560 | loss 3.304858 (-0.74z)| norm 0.2649 (-0.80z)| lr 1.47e-04 | 323.40 ms | 52.2% bf16 MFU | 1623263 tok/s step 13344/19560 | loss 3.301628 (-0.80z)| norm 0.2717 (-0.52z)| lr 1.47e-04 | 322.58 ms | 52.3% bf16 MFU | 1623365 tok/s step 13345/19560 | loss 3.388762 (+0.99z)| norm 0.2841 (-0.03z)| lr 1.47e-04 | 322.50 ms | 52.3% bf16 MFU | 1623480 tok/s step 13346/19560 | loss 3.319866 (-0.43z)| norm 0.2592 (-1.02z)| 
lr 1.47e-04 | 323.39 ms | 52.2% bf16 MFU | 1623368 tok/s step 13347/19560 | loss 3.288490 (-1.07z)| norm 0.2815 (-0.13z)| lr 1.47e-04 | 323.14 ms | 52.2% bf16 MFU | 1623324 tok/s step 13348/19560 | loss 3.407331 (+1.36z)| norm 0.2600 (-0.98z)| lr 1.47e-04 | 322.91 ms | 52.3% bf16 MFU | 1623340 tok/s step 13349/19560 | loss 3.387903 (+0.96z)| norm 0.2803 (-0.18z)| lr 1.47e-04 | 322.54 ms | 52.3% bf16 MFU | 1623448 tok/s step 13350/19560 | loss 3.376728 (+0.72z)| norm 0.2636 (-0.84z)| lr 1.47e-04 | 323.74 ms | 52.1% bf16 MFU | 1623248 tok/s step 13351/19560 | loss 3.457234 (+2.30z)| norm 0.2985 (+0.54z)| lr 1.47e-04 | 322.87 ms | 52.3% bf16 MFU | 1623278 tok/s step 13352/19560 | loss 3.347347 (+0.11z)| norm 0.2721 (-0.51z)| lr 1.47e-04 | 322.51 ms | 52.3% bf16 MFU | 1623396 tok/s step 13353/19560 | loss 3.284844 (-1.11z)| norm 0.2782 (-0.27z)| lr 1.47e-04 | 322.98 ms | 52.3% bf16 MFU | 1623389 tok/s step 13354/19560 | loss 3.319557 (-0.41z)| norm 0.2593 (-1.01z)| lr 1.47e-04 | 322.58 ms | 52.3% bf16 MFU | 1623485 tok/s step 13355/19560 | loss 3.354674 (+0.30z)| norm 0.2665 (-0.72z)| lr 1.47e-04 | 323.46 ms | 52.2% bf16 MFU | 1623355 tok/s step 13356/19560 | loss 3.335806 (-0.08z)| norm 0.2800 (-0.19z)| lr 1.46e-04 | 323.09 ms | 52.2% bf16 MFU | 1623325 tok/s step 13357/19560 | loss 3.344708 (+0.11z)| norm 0.2738 (-0.44z)| lr 1.46e-04 | 323.16 ms | 52.2% bf16 MFU | 1623277 tok/s step 13358/19560 | loss 3.329097 (-0.20z)| norm 0.2674 (-0.69z)| lr 1.46e-04 | 322.96 ms | 52.3% bf16 MFU | 1623283 tok/s step 13359/19560 | loss 3.313119 (-0.52z)| norm 0.2871 (+0.09z)| lr 1.46e-04 | 322.62 ms | 52.3% bf16 MFU | 1623374 tok/s step 13360/19560 | loss 3.336162 (-0.05z)| norm 0.2704 (-0.57z)| lr 1.46e-04 | 323.23 ms | 52.2% bf16 MFU | 1623305 tok/s step 13361/19560 | loss 3.427573 (+1.80z)| norm 0.2702 (-0.58z)| lr 1.46e-04 | 322.89 ms | 52.3% bf16 MFU | 1623327 tok/s step 13362/19560 | loss 3.341713 (+0.04z)| norm 0.2690 (-0.62z)| lr 1.46e-04 | 323.51 ms | 52.2% bf16 MFU | 
1623191 tok/s step 13363/19560 | loss 3.341402 (+0.03z)| norm 0.2649 (-0.78z)| lr 1.46e-04 | 322.57 ms | 52.3% bf16 MFU | 1623300 tok/s step 13364/19560 | loss 3.369656 (+0.61z)| norm 0.2854 (+0.03z)| lr 1.46e-04 | 322.88 ms | 52.3% bf16 MFU | 1623324 tok/s step 13365/19560 | loss 3.336039 (-0.08z)| norm 0.2777 (-0.28z)| lr 1.46e-04 | 323.27 ms | 52.2% bf16 MFU | 1623250 tok/s step 13366/19560 | loss 3.480013 (+2.76z)| norm 0.2699 (-0.59z)| lr 1.46e-04 | 323.14 ms | 52.2% bf16 MFU | 1623211 tok/s step 13367/19560 | loss 3.299399 (-0.83z)| norm 0.3050 (+0.79z)| lr 1.46e-04 | 322.72 ms | 52.3% bf16 MFU | 1623281 tok/s step 13368/19560 | loss 3.310830 (-0.59z)| norm 0.2628 (-0.87z)| lr 1.46e-04 | 322.82 ms | 52.3% bf16 MFU | 1623321 tok/s step 13369/19560 | loss 3.387210 (+0.91z)| norm 0.2797 (-0.21z)| lr 1.46e-04 | 323.28 ms | 52.2% bf16 MFU | 1623243 tok/s step 13370/19560 | loss 3.418399 (+1.50z)| norm 0.2835 (-0.06z)| lr 1.46e-04 | 323.25 ms | 52.2% bf16 MFU | 1623176 tok/s step 13371/19560 | loss 3.268576 (-1.41z)| norm 0.2879 (+0.11z)| lr 1.46e-04 | 323.20 ms | 52.2% bf16 MFU | 1623126 tok/s step 13372/19560 | loss 3.338026 (-0.05z)| norm 0.2784 (-0.27z)| lr 1.46e-04 | 323.13 ms | 52.2% bf16 MFU | 1623098 tok/s step 13373/19560 | loss 3.334307 (-0.12z)| norm 0.2874 (+0.08z)| lr 1.46e-04 | 323.45 ms | 52.2% bf16 MFU | 1622989 tok/s step 13374/19560 | loss 3.311552 (-0.57z)| norm 0.2660 (-0.77z)| lr 1.46e-04 | 323.11 ms | 52.2% bf16 MFU | 1622971 tok/s step 13375/19560 | loss 3.281489 (-1.15z)| norm 0.2832 (-0.09z)| lr 1.46e-04 | 323.45 ms | 52.2% bf16 MFU | 1622870 tok/s step 13376/19560 | loss 3.315327 (-0.49z)| norm 0.2649 (-0.82z)| lr 1.46e-04 | 323.47 ms | 52.2% bf16 MFU | 1622768 tok/s step 13377/19560 | loss 3.597018 (+4.55z)| norm 0.3097 (+0.96z)| lr 1.46e-04 | 322.85 ms | 52.3% bf16 MFU | 1622826 tok/s step 13378/19560 | loss 3.355805 (+0.23z)| norm 0.2732 (-0.49z)| lr 1.46e-04 | 323.16 ms | 52.2% bf16 MFU | 1622803 tok/s step 13379/19560 | loss 3.345864 
(+0.06z)| norm 0.2581 (-1.10z)| lr 1.45e-04 | 323.47 ms | 52.2% bf16 MFU | 1622705 tok/s step 13380/19560 | loss 3.409395 (+1.18z)| norm 0.2811 (-0.18z)| lr 1.45e-04 | 323.63 ms | 52.1% bf16 MFU | 1622571 tok/s step 13381/19560 | loss 3.412142 (+1.20z)| norm 0.2672 (-0.74z)| lr 1.45e-04 | 322.92 ms | 52.3% bf16 MFU | 1622620 tok/s step 13382/19560 | loss 3.268701 (-1.40z)| norm 0.2747 (-0.43z)| lr 1.45e-04 | 323.10 ms | 52.2% bf16 MFU | 1622624 tok/s step 13383/19560 | loss 3.320151 (-0.43z)| norm 0.2620 (-0.94z)| lr 1.45e-04 | 323.07 ms | 52.2% bf16 MFU | 1622634 tok/s step 13384/19560 | loss 3.275295 (-1.26z)| norm 0.2759 (-0.38z)| lr 1.45e-04 | 323.18 ms | 52.2% bf16 MFU | 1622616 tok/s step 13385/19560 | loss 3.368925 (+0.50z)| norm 0.2704 (-0.60z)| lr 1.45e-04 | 322.51 ms | 52.3% bf16 MFU | 1622768 tok/s step 13386/19560 | loss 3.312900 (-0.56z)| norm 0.2730 (-0.49z)| lr 1.45e-04 | 322.61 ms | 52.3% bf16 MFU | 1622886 tok/s step 13387/19560 | loss 3.372114 (+0.56z)| norm 0.2821 (-0.13z)| lr 1.45e-04 | 322.65 ms | 52.3% bf16 MFU | 1622989 tok/s step 13388/19560 | loss 3.376225 (+0.62z)| norm 0.2632 (-0.87z)| lr 1.45e-04 | 322.82 ms | 52.3% bf16 MFU | 1623043 tok/s step 13389/19560 | loss 3.336812 (-0.12z)| norm 0.2720 (-0.51z)| lr 1.45e-04 | 323.37 ms | 52.2% bf16 MFU | 1622957 tok/s step 13390/19560 | loss 3.312468 (-0.59z)| norm 0.2637 (-0.83z)| lr 1.45e-04 | 322.70 ms | 52.3% bf16 MFU | 1623042 tok/s step 13391/19560 | loss 3.345261 (+0.03z)| norm 0.2738 (-0.43z)| lr 1.45e-04 | 323.04 ms | 52.2% bf16 MFU | 1623038 tok/s step 13392/19560 | loss 3.268651 (-1.42z)| norm 0.2650 (-0.77z)| lr 1.45e-04 | 323.31 ms | 52.2% bf16 MFU | 1622967 tok/s step 13393/19560 | loss 3.405200 (+1.17z)| norm 0.2532 (-1.22z)| lr 1.45e-04 | 322.99 ms | 52.3% bf16 MFU | 1622981 tok/s step 13394/19560 | loss 3.434842 (+1.70z)| norm 0.3085 (+0.95z)| lr 1.45e-04 | 323.57 ms | 52.2% bf16 MFU | 1622849 tok/s step 13395/19560 | loss 3.291610 (-0.99z)| norm 0.2611 (-0.90z)| lr 1.45e-04 | 
322.88 ms | 52.3% bf16 MFU | 1622896 tok/s step 13396/19560 | loss 3.278905 (-1.21z)| norm 0.2757 (-0.32z)| lr 1.45e-04 | 322.91 ms | 52.3% bf16 MFU | 1622932 tok/s step 13397/19560 | loss 3.354682 (+0.21z)| norm 0.2721 (-0.47z)| lr 1.45e-04 | 323.72 ms | 52.1% bf16 MFU | 1622765 tok/s step 13398/19560 | loss 3.300188 (-0.81z)| norm 0.2533 (-1.24z)| lr 1.45e-04 | 323.18 ms | 52.2% bf16 MFU | 1622741 tok/s step 13399/19560 | loss 3.318858 (-0.46z)| norm 0.2797 (-0.12z)| lr 1.45e-04 | 323.40 ms | 52.2% bf16 MFU | 1622663 tok/s step 13400/19560 | loss 3.367361 (+0.45z)| norm 0.2736 (-0.36z)| lr 1.45e-04 | 322.75 ms | 52.3% bf16 MFU | 1622753 tok/s step 13401/19560 | loss 3.325644 (-0.33z)| norm 0.3139 (+1.35z)| lr 1.45e-04 | 323.02 ms | 52.2% bf16 MFU | 1622769 tok/s step 13402/19560 | loss 3.301346 (-0.78z)| norm 0.2669 (-0.64z)| lr 1.45e-04 | 322.90 ms | 52.3% bf16 MFU | 1622815 tok/s step 13403/19560 | loss 3.403786 (+1.12z)| norm 0.2981 (+0.70z)| lr 1.44e-04 | 323.22 ms | 52.2% bf16 MFU | 1622780 tok/s step 13404/19560 | loss 3.285129 (-1.08z)| norm 0.2866 (+0.21z)| lr 1.44e-04 | 323.34 ms | 52.2% bf16 MFU | 1622715 tok/s step 13405/19560 | loss 3.353858 (+0.20z)| norm 0.2878 (+0.27z)| lr 1.44e-04 | 322.64 ms | 52.3% bf16 MFU | 1622830 tok/s step 13406/19560 | loss 3.326581 (-0.35z)| norm 0.3140 (+1.83z)| lr 1.44e-04 | 323.32 ms | 52.2% bf16 MFU | 1622769 tok/s step 13407/19560 | loss 3.351904 (+0.15z)| norm 0.2661 (-0.78z)| lr 1.44e-04 | 322.95 ms | 52.3% bf16 MFU | 1622802 tok/s step 13408/19560 | loss 3.341806 (-0.06z)| norm 0.3016 (+1.16z)| lr 1.44e-04 | 323.39 ms | 52.2% bf16 MFU | 1622724 tok/s step 13409/19560 | loss 3.321174 (-0.47z)| norm 0.2632 (-0.94z)| lr 1.44e-04 | 322.59 ms | 52.3% bf16 MFU | 1622851 tok/s step 13410/19560 | loss 3.296313 (-0.96z)| norm 0.3149 (+1.84z)| lr 1.44e-04 | 322.67 ms | 52.3% bf16 MFU | 1622952 tok/s step 13411/19560 | loss 3.363358 (+0.37z)| norm 0.2753 (-0.29z)| lr 1.44e-04 | 322.71 ms | 52.3% bf16 MFU | 1623037 tok/s step 
13412/19560 | loss 3.311138 (-0.67z)| norm 0.2916 (+0.59z)| lr 1.44e-04 | 322.87 ms | 52.3% bf16 MFU | 1623077 tok/s step 13413/19560 | loss 3.380229 (+0.70z)| norm 0.2986 (+0.96z)| lr 1.44e-04 | 323.21 ms | 52.2% bf16 MFU | 1623028 tok/s step 13414/19560 | loss 3.289514 (-1.11z)| norm 0.2654 (-0.82z)| lr 1.44e-04 | 322.32 ms | 52.4% bf16 MFU | 1623208 tok/s step 13415/19560 | loss 3.394792 (+1.01z)| norm 0.3205 (+2.12z)| lr 1.44e-04 | 322.71 ms | 52.3% bf16 MFU | 1623278 tok/s step 13416/19560 | loss 3.352185 (+0.15z)| norm 0.2847 (+0.21z)| lr 1.44e-04 | 323.30 ms | 52.2% bf16 MFU | 1623198 tok/s step 13417/19560 | loss 3.327472 (-0.35z)| norm 0.2714 (-0.50z)| lr 1.44e-04 | 322.65 ms | 52.3% bf16 MFU | 1623287 tok/s step 13418/19560 | loss 3.384723 (+0.80z)| norm 0.2853 (+0.25z)| lr 1.44e-04 | 322.97 ms | 52.3% bf16 MFU | 1623288 tok/s step 13419/19560 | loss 3.374832 (+0.59z)| norm 0.2648 (-0.84z)| lr 1.44e-04 | 322.02 ms | 52.4% bf16 MFU | 1623529 tok/s step 13420/19560 | loss 3.300743 (-0.90z)| norm 0.3078 (+1.46z)| lr 1.44e-04 | 323.27 ms | 52.2% bf16 MFU | 1623443 tok/s step 13421/19560 | loss 3.321430 (-0.49z)| norm 0.2938 (+0.71z)| lr 1.44e-04 | 322.74 ms | 52.3% bf16 MFU | 1623496 tok/s step 13422/19560 | loss 3.299649 (-0.92z)| norm 0.2573 (-1.24z)| lr 1.44e-04 | 322.23 ms | 52.4% bf16 MFU | 1623674 tok/s step 13423/19560 | loss 3.375338 (+0.61z)| norm 0.3099 (+1.54z)| lr 1.44e-04 | 323.01 ms | 52.2% bf16 MFU | 1623646 tok/s step 13424/19560 | loss 3.290782 (-1.10z)| norm 0.2774 (-0.19z)| lr 1.44e-04 | 322.46 ms | 52.3% bf16 MFU | 1623759 tok/s step 13425/19560 | loss 3.379995 (+0.70z)| norm 0.2953 (+0.75z)| lr 1.44e-04 | 322.70 ms | 52.3% bf16 MFU | 1623804 tok/s step 13426/19560 | loss 3.433867 (+1.74z)| norm 0.2837 (+0.19z)| lr 1.43e-04 | 322.44 ms | 52.3% bf16 MFU | 1623915 tok/s step 13427/19560 | loss 3.367676 (+0.43z)| norm 0.2779 (-0.14z)| lr 1.43e-04 | 323.07 ms | 52.2% bf16 MFU | 1623860 tok/s step 13428/19560 | loss 3.297295 (-0.97z)| norm 
0.2773 (-0.19z)| lr 1.43e-04 | 323.05 ms | 52.2% bf16 MFU | 1623813 tok/s step 13429/19560 | loss 3.308659 (-0.73z)| norm 0.2949 (+0.85z)| lr 1.43e-04 | 322.62 ms | 52.3% bf16 MFU | 1623877 tok/s step 13430/19560 | loss 3.312784 (-0.64z)| norm 0.2920 (+0.67z)| lr 1.43e-04 | 322.42 ms | 52.3% bf16 MFU | 1623988 tok/s step 13431/19560 | loss 3.279363 (-1.29z)| norm 0.2912 (+0.61z)| lr 1.43e-04 | 323.05 ms | 52.2% bf16 MFU | 1623935 tok/s step 13432/19560 | loss 3.292781 (-1.01z)| norm 0.2815 (+0.03z)| lr 1.43e-04 | 323.05 ms | 52.2% bf16 MFU | 1623885 tok/s step 13433/19560 | loss 3.326392 (-0.33z)| norm 0.2749 (-0.36z)| lr 1.43e-04 | 322.68 ms | 52.3% bf16 MFU | 1623930 tok/s step 13434/19560 | loss 3.295665 (-0.94z)| norm 0.3484 (+3.77z)| lr 1.43e-04 | 322.42 ms | 52.3% bf16 MFU | 1624040 tok/s step 13435/19560 | loss 3.322101 (-0.40z)| norm 0.3067 (+1.40z)| lr 1.43e-04 | 322.68 ms | 52.3% bf16 MFU | 1624078 tok/s step 13436/19560 | loss 3.284727 (-1.14z)| norm 0.2740 (-0.43z)| lr 1.43e-04 | 323.08 ms | 52.2% bf16 MFU | 1624012 tok/s step 13437/19560 | loss 3.372572 (+0.61z)| norm 0.2739 (-0.44z)| lr 1.43e-04 | 322.80 ms | 52.3% bf16 MFU | 1624021 tok/s step 13438/19560 | loss 3.271555 (-1.39z)| norm 0.2767 (-0.27z)| lr 1.43e-04 | 322.54 ms | 52.3% bf16 MFU | 1624095 tok/s step 13439/19560 | loss 3.296166 (-0.89z)| norm 0.2757 (-0.32z)| lr 1.43e-04 | 322.49 ms | 52.3% bf16 MFU | 1624179 tok/s step 13440/19560 | loss 3.326634 (-0.30z)| norm 0.2481 (-1.84z)| lr 1.43e-04 | 323.15 ms | 52.2% bf16 MFU | 1624092 tok/s step 13441/19560 | loss 3.324727 (-0.34z)| norm 0.2656 (-0.86z)| lr 1.43e-04 | 322.94 ms | 52.3% bf16 MFU | 1624061 tok/s step 13442/19560 | loss 3.304435 (-0.73z)| norm 0.2732 (-0.44z)| lr 1.43e-04 | 322.99 ms | 52.3% bf16 MFU | 1624021 tok/s step 13443/19560 | loss 3.353634 (+0.24z)| norm 0.2745 (-0.35z)| lr 1.43e-04 | 322.97 ms | 52.3% bf16 MFU | 1623987 tok/s step 13444/19560 | loss 3.341381 (-0.00z)| norm 0.2616 (-1.13z)| lr 1.43e-04 | 322.43 ms | 
52.3% bf16 MFU | 1624089 tok/s step 13445/19560 | loss 3.338376 (-0.06z)| norm 0.2939 (+0.86z)| lr 1.43e-04 | 323.07 ms | 52.2% bf16 MFU | 1624027 tok/s step 13446/19560 | loss 3.307636 (-0.67z)| norm 0.3032 (+1.41z)| lr 1.43e-04 | 322.69 ms | 52.3% bf16 MFU | 1624064 tok/s step 13447/19560 | loss 3.364026 (+0.45z)| norm 0.2826 (+0.14z)| lr 1.43e-04 | 322.83 ms | 52.3% bf16 MFU | 1624063 tok/s step 13448/19560 | loss 3.304246 (-0.74z)| norm 0.2950 (+0.90z)| lr 1.43e-04 | 322.91 ms | 52.3% bf16 MFU | 1624041 tok/s step 13449/19560 | loss 3.360997 (+0.39z)| norm 0.2708 (-0.59z)| lr 1.43e-04 | 322.54 ms | 52.3% bf16 MFU | 1624112 tok/s step 13450/19560 | loss 3.391961 (+1.01z)| norm 0.2714 (-0.54z)| lr 1.42e-04 | 322.98 ms | 52.3% bf16 MFU | 1624070 tok/s step 13451/19560 | loss 3.299464 (-0.84z)| norm 0.2802 (-0.00z)| lr 1.42e-04 | 322.95 ms | 52.3% bf16 MFU | 1624038 tok/s step 13452/19560 | loss 3.302936 (-0.77z)| norm 0.2707 (-0.58z)| lr 1.42e-04 | 322.95 ms | 52.3% bf16 MFU | 1624008 tok/s step 13453/19560 | loss 3.331864 (-0.17z)| norm 0.2687 (-0.70z)| lr 1.42e-04 | 323.28 ms | 52.2% bf16 MFU | 1623896 tok/s step 13454/19560 | loss 3.328482 (-0.24z)| norm 0.2805 (+0.03z)| lr 1.42e-04 | 322.12 ms | 52.4% bf16 MFU | 1624082 tok/s step 13455/19560 | loss 3.250924 (-1.82z)| norm 0.2610 (-1.17z)| lr 1.42e-04 | 322.60 ms | 52.3% bf16 MFU | 1624138 tok/s step 13456/19560 | loss 3.305270 (-0.69z)| norm 0.2904 (+0.65z)| lr 1.42e-04 | 322.93 ms | 52.3% bf16 MFU | 1624107 tok/s step 13457/19560 | loss 3.372988 (+0.71z)| norm 0.2762 (-0.22z)| lr 1.42e-04 | 322.66 ms | 52.3% bf16 MFU | 1624147 tok/s step 13458/19560 | loss 3.286178 (-1.08z)| norm 0.2692 (-0.64z)| lr 1.42e-04 | 322.46 ms | 52.3% bf16 MFU | 1624234 tok/s step 13459/19560 | loss 3.319429 (-0.39z)| norm 0.2689 (-0.67z)| lr 1.42e-04 | 323.49 ms | 52.2% bf16 MFU | 1624058 tok/s step 13460/19560 | loss 3.383361 (+0.91z)| norm 0.2791 (-0.03z)| lr 1.42e-04 | 322.18 ms | 52.4% bf16 MFU | 1624220 tok/s step 13461/19560 
| loss 3.323150 (-0.32z)| norm 0.2717 (-0.49z)| lr 1.42e-04 | 322.87 ms | 52.3% bf16 MFU | 1624200 tok/s step 13462/19560 | loss 3.337984 (-0.01z)| norm 0.2797 (+0.01z)| lr 1.42e-04 | 322.79 ms | 52.3% bf16 MFU | 1624201 tok/s step 13463/19560 | loss 3.351117 (+0.26z)| norm 0.2631 (-1.01z)| lr 1.42e-04 | 322.36 ms | 52.4% bf16 MFU | 1624310 tok/s step 13464/19560 | loss 3.333121 (-0.10z)| norm 0.3174 (+2.31z)| lr 1.42e-04 | 322.66 ms | 52.3% bf16 MFU | 1624339 tok/s step 13465/19560 | loss 3.300390 (-0.78z)| norm 0.2624 (-1.04z)| lr 1.42e-04 | 323.56 ms | 52.2% bf16 MFU | 1624141 tok/s step 13466/19560 | loss 3.329214 (-0.18z)| norm 0.2836 (+0.28z)| lr 1.42e-04 | 322.38 ms | 52.4% bf16 MFU | 1624249 tok/s step 13467/19560 | loss 3.259771 (-1.59z)| norm 0.2669 (-0.76z)| lr 1.42e-04 | 322.95 ms | 52.3% bf16 MFU | 1624209 tok/s step 13468/19560 | loss 3.308601 (-0.59z)| norm 0.2724 (-0.41z)| lr 1.42e-04 | 322.81 ms | 52.3% bf16 MFU | 1624206 tok/s step 13469/19560 | loss 3.331746 (-0.11z)| norm 0.2676 (-0.70z)| lr 1.42e-04 | 322.89 ms | 52.3% bf16 MFU | 1624182 tok/s step 13470/19560 | loss 3.382826 (+0.92z)| norm 0.2894 (+0.65z)| lr 1.42e-04 | 322.82 ms | 52.3% bf16 MFU | 1624176 tok/s step 13471/19560 | loss 3.318268 (-0.41z)| norm 0.3024 (+1.44z)| lr 1.42e-04 | 322.56 ms | 52.3% bf16 MFU | 1624236 tok/s step 13472/19560 | loss 3.273623 (-1.31z)| norm 0.2634 (-0.97z)| lr 1.42e-04 | 323.18 ms | 52.2% bf16 MFU | 1624137 tok/s step 13473/19560 | loss 3.341900 (+0.09z)| norm 0.3229 (+2.61z)| lr 1.41e-04 | 322.85 ms | 52.3% bf16 MFU | 1624126 tok/s step 13474/19560 | loss 3.311180 (-0.54z)| norm 0.2855 (+0.36z)| lr 1.41e-04 | 322.44 ms | 52.3% bf16 MFU | 1624219 tok/s step 13475/19560 | loss 3.337524 (-0.01z)| norm 0.2781 (-0.09z)| lr 1.41e-04 | 323.11 ms | 52.2% bf16 MFU | 1624141 tok/s step 13476/19560 | loss 3.309378 (-0.57z)| norm 0.3156 (+2.12z)| lr 1.41e-04 | 322.53 ms | 52.3% bf16 MFU | 1624212 tok/s step 13477/19560 | loss 3.351689 (+0.31z)| norm 0.2776 (-0.14z)| 
lr 1.41e-04 | 323.19 ms | 52.2% bf16 MFU | 1624112 tok/s step 13478/19560 | loss 3.315226 (-0.44z)| norm 0.3475 (+3.77z)| lr 1.41e-04 | 322.87 ms | 52.3% bf16 MFU | 1624098 tok/s step 13479/19560 | loss 3.350393 (+0.32z)| norm 0.3122 (+1.76z)| lr 1.41e-04 | 322.35 ms | 52.4% bf16 MFU | 1624215 tok/s step 13480/19560 | loss 3.334234 (-0.02z)| norm 0.3139 (+1.81z)| lr 1.41e-04 | 322.71 ms | 52.3% bf16 MFU | 1624236 tok/s step 13481/19560 | loss 3.405134 (+1.47z)| norm 0.3095 (+1.54z)| lr 1.41e-04 | 322.69 ms | 52.3% bf16 MFU | 1624261 tok/s step 13482/19560 | loss 3.400102 (+1.34z)| norm 0.2835 (+0.11z)| lr 1.41e-04 | 322.99 ms | 52.3% bf16 MFU | 1624209 tok/s step 13483/19560 | loss 3.270660 (-1.38z)| norm 0.2980 (+0.89z)| lr 1.41e-04 | 323.08 ms | 52.2% bf16 MFU | 1624138 tok/s step 13484/19560 | loss 3.347564 (+0.24z)| norm 0.2726 (-0.50z)| lr 1.41e-04 | 322.56 ms | 52.3% bf16 MFU | 1624201 tok/s step 13485/19560 | loss 3.337035 (+0.02z)| norm 0.2708 (-0.60z)| lr 1.41e-04 | 323.07 ms | 52.2% bf16 MFU | 1624131 tok/s step 13486/19560 | loss 3.319358 (-0.35z)| norm 0.2792 (-0.14z)| lr 1.41e-04 | 323.03 ms | 52.2% bf16 MFU | 1624077 tok/s step 13487/19560 | loss 3.315124 (-0.44z)| norm 0.2737 (-0.44z)| lr 1.41e-04 | 323.42 ms | 52.2% bf16 MFU | 1623928 tok/s step 13488/19560 | loss 3.363985 (+0.58z)| norm 0.2533 (-1.55z)| lr 1.41e-04 | 322.11 ms | 52.4% bf16 MFU | 1624115 tok/s step 13489/19560 | loss 3.487486 (+3.09z)| norm 0.2854 (+0.20z)| lr 1.41e-04 | 322.47 ms | 52.3% bf16 MFU | 1624202 tok/s step 13490/19560 | loss 3.289455 (-0.96z)| norm 0.2715 (-0.56z)| lr 1.41e-04 | 323.05 ms | 52.2% bf16 MFU | 1624138 tok/s step 13491/19560 | loss 3.294957 (-0.84z)| norm 0.2819 (+0.01z)| lr 1.41e-04 | 322.02 ms | 52.4% bf16 MFU | 1624337 tok/s step 13492/19560 | loss 3.336154 (+0.01z)| norm 0.2700 (-0.64z)| lr 1.41e-04 | 323.52 ms | 52.2% bf16 MFU | 1624150 tok/s step 13493/19560 | loss 3.306384 (-0.60z)| norm 0.2625 (-1.04z)| lr 1.41e-04 | 323.47 ms | 52.2% bf16 MFU | 
1623983 tok/s step 13494/19560 | loss 3.251548 (-1.72z)| norm 0.2747 (-0.38z)| lr 1.41e-04 | 322.42 ms | 52.3% bf16 MFU | 1624090 tok/s step 13495/19560 | loss 3.335159 (+0.02z)| norm 0.2748 (-0.36z)| lr 1.41e-04 | 322.77 ms | 52.3% bf16 MFU | 1624103 tok/s step 13496/19560 | loss 3.337528 (+0.07z)| norm 0.2701 (-0.62z)| lr 1.41e-04 | 322.43 ms | 52.3% bf16 MFU | 1624201 tok/s step 13497/19560 | loss 3.329214 (-0.10z)| norm 0.2883 (+0.37z)| lr 1.40e-04 | 323.19 ms | 52.2% bf16 MFU | 1624102 tok/s step 13498/19560 | loss 3.239049 (-1.96z)| norm 0.2696 (-0.65z)| lr 1.40e-04 | 323.05 ms | 52.2% bf16 MFU | 1624043 tok/s step 13499/19560 | loss 3.314416 (-0.39z)| norm 0.2809 (-0.02z)| lr 1.40e-04 | 322.49 ms | 52.3% bf16 MFU | 1624127 tok/s step 13500/19560 | loss 3.336457 (+0.08z)| norm 0.2852 (+0.21z)| lr 1.40e-04 | 322.83 ms | 52.3% bf16 MFU | 1624123 tok/s val loss 3.325452
HellaSwag: 3011/10042 = 0.299841 step 13501/19560 | loss 3.327560 (-0.11z)| norm 0.2679 (-0.73z)| lr 1.40e-04 | 322.74 ms | 52.3% bf16 MFU | 1624141 tok/s step 13502/19560 | loss 3.351025 (+0.38z)| norm 0.2592 (-1.20z)| lr 1.40e-04 | 322.67 ms | 52.3% bf16 MFU | 1624176 tok/s step 13503/19560 | loss 3.439312 (+2.18z)| norm 0.3001 (+1.03z)| lr 1.40e-04 | 323.53 ms | 52.2% bf16 MFU | 1623992 tok/s step 13504/19560 | loss 3.291939 (-0.88z)| norm 0.2737 (-0.42z)| lr 1.40e-04 | 322.45 ms | 52.3% bf16 MFU | 1624089 tok/s step 13505/19560 | loss 3.306277 (-0.61z)| norm 0.2651 (-0.88z)| lr 1.40e-04 | 322.60 ms | 52.3% bf16 MFU | 1624143 tok/s step 13506/19560 | loss 3.341771 (+0.24z)| norm 0.2581 (-1.25z)| lr 1.40e-04 | 322.92 ms | 52.3% bf16 MFU | 1624114 tok/s step 13507/19560 | loss 3.332553 (+0.02z)| norm 0.2694 (-0.64z)| lr 1.40e-04 | 322.61 ms | 52.3% bf16 MFU | 1624167 tok/s step 13508/19560 | loss 3.306077 (-0.60z)| norm 0.2795 (-0.08z)| lr 1.40e-04 | 322.82 ms | 52.3% bf16 MFU | 1624163 tok/s step 13509/19560 | loss 3.299871 (-0.73z)| norm 0.2623 (-1.02z)| lr 1.40e-04 | 322.84 ms | 52.3% bf16 MFU | 1624154 tok/s step 13510/19560 | loss 3.353617 (+0.56z)| norm 0.2619 (-1.04z)| lr 1.40e-04 | 322.95 ms | 52.3% bf16 MFU | 1624119 tok/s step 13511/19560 | loss 3.337765 (+0.17z)| norm 0.2584 (-1.23z)| lr 1.40e-04 | 322.95 ms | 52.3% bf16 MFU | 1624086 tok/s step 13512/19560 | loss 3.356378 (+0.62z)| norm 0.2907 (+0.53z)| lr 1.40e-04 | 323.21 ms | 52.2% bf16 MFU | 1623988 tok/s step 13513/19560 | loss 3.300133 (-0.76z)| norm 0.3166 (+1.90z)| lr 1.40e-04 | 323.09 ms | 52.2% bf16 MFU | 1623925 tok/s step 13514/19560 | loss 3.363967 (+0.81z)| norm 0.3074 (+1.38z)| lr 1.40e-04 |
322.75 ms | 52.3% bf16 MFU | 1623950 tok/s step 13515/19560 | loss 3.325037 (-0.15z)| norm 0.3086 (+1.42z)| lr 1.40e-04 | 323.33 ms | 52.2% bf16 MFU | 1623829 tok/s step 13516/19560 | loss 3.355296 (+0.61z)| norm 0.3063 (+1.28z)| lr 1.40e-04 | 322.67 ms | 52.3% bf16 MFU | 1623880 tok/s step 13517/19560 | loss 3.278906 (-1.27z)| norm 0.3123 (+1.56z)| lr 1.40e-04 | 322.69 ms | 52.3% bf16 MFU | 1623923 tok/s step 13518/19560 | loss 3.279961 (-1.23z)| norm 0.2878 (+0.27z)| lr 1.40e-04 | 322.92 ms | 52.3% bf16 MFU | 1623905 tok/s step 13519/19560 | loss 3.303179 (-0.65z)| norm 0.2968 (+0.73z)| lr 1.40e-04 | 323.08 ms | 52.2% bf16 MFU | 1623849 tok/s step 13520/19560 | loss 3.404385 (+1.80z)| norm 0.2945 (+0.60z)| lr 1.39e-04 | 323.30 ms | 52.2% bf16 MFU | 1623740 tok/s step 13521/19560 | loss 3.269141 (-1.49z)| norm 0.2995 (+0.85z)| lr 1.39e-04 | 323.43 ms | 52.2% bf16 MFU | 1623606 tok/s step 13522/19560 | loss 3.352109 (+0.58z)| norm 0.2937 (+0.56z)| lr 1.39e-04 | 323.32 ms | 52.2% bf16 MFU | 1623505 tok/s step 13523/19560 | loss 3.361570 (+0.81z)| norm 0.2928 (+0.50z)| lr 1.39e-04 | 323.21 ms | 52.2% bf16 MFU | 1623436 tok/s step 13524/19560 | loss 3.336541 (+0.16z)| norm 0.2942 (+0.57z)| lr 1.39e-04 | 323.01 ms | 52.3% bf16 MFU | 1623422 tok/s step 13525/19560 | loss 3.358854 (+0.73z)| norm 0.2670 (-0.89z)| lr 1.39e-04 | 323.05 ms | 52.2% bf16 MFU | 1623399 tok/s step 13526/19560 | loss 3.308054 (-0.56z)| norm 0.2963 (+0.66z)| lr 1.39e-04 | 323.36 ms | 52.2% bf16 MFU | 1623297 tok/s step 13527/19560 | loss 3.342258 (+0.30z)| norm 0.2796 (-0.24z)| lr 1.39e-04 | 322.91 ms | 52.3% bf16 MFU | 1623314 tok/s step 13528/19560 | loss 3.378618 (+1.22z)| norm 0.2963 (+0.66z)| lr 1.39e-04 | 323.09 ms | 52.2% bf16 MFU | 1623286 tok/s step 13529/19560 | loss 3.414023 (+2.06z)| norm 0.3001 (+0.87z)| lr 1.39e-04 | 322.57 ms | 52.3% bf16 MFU | 1623389 tok/s step 13530/19560 | loss 3.345017 (+0.34z)| norm 0.3126 (+1.53z)| lr 1.39e-04 | 322.42 ms | 52.3% bf16 MFU | 1623524 tok/s step 
13531/19560 | loss 3.325356 (-0.14z)| norm 0.2784 (-0.31z)| lr 1.39e-04 | 323.27 ms | 52.2% bf16 MFU | 1623439 tok/s step 13532/19560 | loss 3.374982 (+1.10z)| norm 0.2995 (+0.82z)| lr 1.39e-04 | 323.18 ms | 52.2% bf16 MFU | 1623381 tok/s step 13533/19560 | loss 3.284274 (-1.18z)| norm 0.3061 (+1.16z)| lr 1.39e-04 | 322.88 ms | 52.3% bf16 MFU | 1623402 tok/s step 13534/19560 | loss 3.304409 (-0.66z)| norm 0.2959 (+0.63z)| lr 1.39e-04 | 322.61 ms | 52.3% bf16 MFU | 1623488 tok/s step 13535/19560 | loss 3.290236 (-1.00z)| norm 0.3132 (+1.54z)| lr 1.39e-04 | 322.96 ms | 52.3% bf16 MFU | 1623484 tok/s step 13536/19560 | loss 3.329749 (-0.01z)| norm 0.2971 (+0.67z)| lr 1.39e-04 | 323.16 ms | 52.2% bf16 MFU | 1623429 tok/s step 13537/19560 | loss 3.300566 (-0.74z)| norm 0.2785 (-0.34z)| lr 1.39e-04 | 323.26 ms | 52.2% bf16 MFU | 1623352 tok/s step 13538/19560 | loss 3.326464 (-0.10z)| norm 0.2755 (-0.49z)| lr 1.39e-04 | 323.07 ms | 52.2% bf16 MFU | 1623325 tok/s step 13539/19560 | loss 3.267157 (-1.55z)| norm 0.2896 (+0.27z)| lr 1.39e-04 | 323.15 ms | 52.2% bf16 MFU | 1623281 tok/s step 13540/19560 | loss 3.400898 (+1.74z)| norm 0.3291 (+2.37z)| lr 1.39e-04 | 322.65 ms | 52.3% bf16 MFU | 1623364 tok/s step 13541/19560 | loss 3.327194 (-0.07z)| norm 0.2809 (-0.21z)| lr 1.39e-04 | 323.07 ms | 52.2% bf16 MFU | 1623338 tok/s step 13542/19560 | loss 3.334133 (+0.10z)| norm 0.3112 (+1.39z)| lr 1.39e-04 | 323.51 ms | 52.2% bf16 MFU | 1623202 tok/s step 13543/19560 | loss 3.282119 (-1.18z)| norm 0.2746 (-0.55z)| lr 1.39e-04 | 322.99 ms | 52.3% bf16 MFU | 1623204 tok/s step 13544/19560 | loss 3.342654 (+0.33z)| norm 0.3119 (+1.45z)| lr 1.38e-04 | 322.62 ms | 52.3% bf16 MFU | 1623299 tok/s step 13545/19560 | loss 3.295046 (-0.85z)| norm 0.2732 (-0.63z)| lr 1.38e-04 | 323.57 ms | 52.2% bf16 MFU | 1623150 tok/s step 13546/19560 | loss 3.368838 (+1.00z)| norm 0.3042 (+1.02z)| lr 1.38e-04 | 322.66 ms | 52.3% bf16 MFU | 1623237 tok/s step 13547/19560 | loss 3.309032 (-0.49z)| norm 
0.3006 (+0.81z)| lr 1.38e-04 | 323.00 ms | 52.3% bf16 MFU | 1623235 tok/s step 13548/19560 | loss 3.337713 (+0.23z)| norm 0.2807 (-0.24z)| lr 1.38e-04 | 323.74 ms | 52.1% bf16 MFU | 1623046 tok/s step 13549/19560 | loss 3.384156 (+1.37z)| norm 0.2900 (+0.26z)| lr 1.38e-04 | 322.92 ms | 52.3% bf16 MFU | 1623072 tok/s step 13550/19560 | loss 3.345390 (+0.40z)| norm 0.2948 (+0.51z)| lr 1.38e-04 | 323.34 ms | 52.2% bf16 MFU | 1622993 tok/s step 13551/19560 | loss 3.387805 (+1.45z)| norm 0.2749 (-0.56z)| lr 1.38e-04 | 323.55 ms | 52.2% bf16 MFU | 1622864 tok/s step 13552/19560 | loss 3.355412 (+0.63z)| norm 0.3087 (+1.27z)| lr 1.38e-04 | 323.35 ms | 52.2% bf16 MFU | 1622792 tok/s step 13553/19560 | loss 3.320156 (-0.24z)| norm 0.2770 (-0.45z)| lr 1.38e-04 | 323.41 ms | 52.2% bf16 MFU | 1622709 tok/s step 13554/19560 | loss 3.443106 (+2.84z)| norm 0.2697 (-0.84z)| lr 1.38e-04 | 324.26 ms | 52.0% bf16 MFU | 1622417 tok/s step 13555/19560 | loss 3.334356 (+0.12z)| norm 0.2784 (-0.37z)| lr 1.38e-04 | 323.06 ms | 52.2% bf16 MFU | 1622440 tok/s step 13556/19560 | loss 3.272827 (-1.41z)| norm 0.2760 (-0.50z)| lr 1.38e-04 | 323.46 ms | 52.2% bf16 MFU | 1622362 tok/s step 13557/19560 | loss 3.299808 (-0.73z)| norm 0.2581 (-1.45z)| lr 1.38e-04 | 323.63 ms | 52.2% bf16 MFU | 1622246 tok/s step 13558/19560 | loss 3.325429 (-0.10z)| norm 0.2604 (-1.30z)| lr 1.38e-04 | 323.48 ms | 52.2% bf16 MFU | 1622172 tok/s step 13559/19560 | loss 3.303550 (-0.65z)| norm 0.2610 (-1.25z)| lr 1.38e-04 | 323.38 ms | 52.2% bf16 MFU | 1622126 tok/s step 13560/19560 | loss 3.295396 (-0.86z)| norm 0.2602 (-1.27z)| lr 1.38e-04 | 323.25 ms | 52.2% bf16 MFU | 1622116 tok/s step 13561/19560 | loss 3.388424 (+1.45z)| norm 0.2695 (-0.77z)| lr 1.38e-04 | 322.95 ms | 52.3% bf16 MFU | 1622183 tok/s step 13562/19560 | loss 3.376046 (+1.13z)| norm 0.2688 (-0.82z)| lr 1.38e-04 | 323.51 ms | 52.2% bf16 MFU | 1622105 tok/s step 13563/19560 | loss 3.341413 (+0.26z)| norm 0.2582 (-1.38z)| lr 1.38e-04 | 323.52 ms | 
52.2% bf16 MFU | 1622028 tok/s step 13564/19560 | loss 3.276009 (-1.35z)| norm 0.2732 (-0.55z)| lr 1.38e-04 | 323.13 ms | 52.2% bf16 MFU | 1622053 tok/s step 13565/19560 | loss 3.269097 (-1.50z)| norm 0.2637 (-1.07z)| lr 1.38e-04 | 323.13 ms | 52.2% bf16 MFU | 1622076 tok/s step 13566/19560 | loss 3.347245 (+0.42z)| norm 0.2793 (-0.21z)| lr 1.38e-04 | 323.20 ms | 52.2% bf16 MFU | 1622080 tok/s step 13567/19560 | loss 3.388440 (+1.41z)| norm 0.2922 (+0.49z)| lr 1.38e-04 | 323.14 ms | 52.2% bf16 MFU | 1622100 tok/s step 13568/19560 | loss 3.317989 (-0.32z)| norm 0.2844 (+0.05z)| lr 1.37e-04 | 323.96 ms | 52.1% bf16 MFU | 1621914 tok/s step 13569/19560 | loss 3.294068 (-0.91z)| norm 0.2920 (+0.46z)| lr 1.37e-04 | 322.90 ms | 52.3% bf16 MFU | 1622002 tok/s step 13570/19560 | loss 3.363969 (+0.80z)| norm 0.2668 (-0.94z)| lr 1.37e-04 | 322.87 ms | 52.3% bf16 MFU | 1622093 tok/s step 13571/19560 | loss 3.330364 (-0.02z)| norm 0.2651 (-1.03z)| lr 1.37e-04 | 323.94 ms | 52.1% bf16 MFU | 1621911 tok/s step 13572/19560 | loss 3.308720 (-0.55z)| norm 0.2777 (-0.34z)| lr 1.37e-04 | 323.22 ms | 52.2% bf16 MFU | 1621920 tok/s step 13573/19560 | loss 3.325483 (-0.13z)| norm 0.2722 (-0.64z)| lr 1.37e-04 | 323.04 ms | 52.2% bf16 MFU | 1621974 tok/s step 13574/19560 | loss 3.297391 (-0.82z)| norm 0.2657 (-0.99z)| lr 1.37e-04 | 323.62 ms | 52.2% bf16 MFU | 1621879 tok/s step 13575/19560 | loss 3.343586 (+0.32z)| norm 0.2748 (-0.47z)| lr 1.37e-04 | 322.88 ms | 52.3% bf16 MFU | 1621974 tok/s step 13576/19560 | loss 3.426519 (+2.29z)| norm 0.2909 (+0.43z)| lr 1.37e-04 | 323.40 ms | 52.2% bf16 MFU | 1621933 tok/s step 13577/19560 | loss 3.285349 (-1.10z)| norm 0.2867 (+0.19z)| lr 1.37e-04 | 323.32 ms | 52.2% bf16 MFU | 1621916 tok/s step 13578/19560 | loss 3.310770 (-0.47z)| norm 0.2944 (+0.61z)| lr 1.37e-04 | 322.45 ms | 52.3% bf16 MFU | 1622117 tok/s step 13579/19560 | loss 3.346356 (+0.38z)| norm 0.2930 (+0.53z)| lr 1.37e-04 | 323.06 ms | 52.2% bf16 MFU | 1622155 tok/s step 13580/19560 
| loss 3.358038 (+0.66z)| norm 0.2998 (+0.89z)| lr 1.37e-04 | 322.77 ms | 52.3% bf16 MFU | 1622263 tok/s step 13581/19560 | loss 3.310095 (-0.51z)| norm 0.2890 (+0.28z)| lr 1.37e-04 | 322.95 ms | 52.3% bf16 MFU | 1622321 tok/s step 13582/19560 | loss 3.348957 (+0.43z)| norm 0.2809 (-0.17z)| lr 1.37e-04 | 323.45 ms | 52.2% bf16 MFU | 1622250 tok/s step 13583/19560 | loss 3.349916 (+0.44z)| norm 0.2652 (-1.06z)| lr 1.37e-04 | 322.74 ms | 52.3% bf16 MFU | 1622364 tok/s step 13584/19560 | loss 3.285688 (-1.13z)| norm 0.2844 (+0.02z)| lr 1.37e-04 | 322.69 ms | 52.3% bf16 MFU | 1622482 tok/s step 13585/19560 | loss 3.432811 (+2.42z)| norm 0.2715 (-0.70z)| lr 1.37e-04 | 322.73 ms | 52.3% bf16 MFU | 1622586 tok/s step 13586/19560 | loss 3.269520 (-1.50z)| norm 0.2679 (-0.90z)| lr 1.37e-04 | 323.19 ms | 52.2% bf16 MFU | 1622568 tok/s step 13587/19560 | loss 3.385551 (+1.26z)| norm 0.2744 (-0.54z)| lr 1.37e-04 | 323.22 ms | 52.2% bf16 MFU | 1622543 tok/s step 13588/19560 | loss 3.378325 (+1.09z)| norm 0.2882 (+0.23z)| lr 1.37e-04 | 322.61 ms | 52.3% bf16 MFU | 1622674 tok/s step 13589/19560 | loss 3.312015 (-0.49z)| norm 0.2847 (+0.03z)| lr 1.37e-04 | 322.73 ms | 52.3% bf16 MFU | 1622766 tok/s step 13590/19560 | loss 3.346784 (+0.34z)| norm 0.2675 (-0.93z)| lr 1.37e-04 | 323.13 ms | 52.2% bf16 MFU | 1622753 tok/s step 13591/19560 | loss 3.328646 (-0.09z)| norm 0.2676 (-0.93z)| lr 1.37e-04 | 322.63 ms | 52.3% bf16 MFU | 1622868 tok/s step 13592/19560 | loss 3.329097 (-0.08z)| norm 0.2757 (-0.46z)| lr 1.36e-04 | 323.26 ms | 52.2% bf16 MFU | 1622820 tok/s step 13593/19560 | loss 3.330085 (-0.06z)| norm 0.2646 (-1.10z)| lr 1.36e-04 | 323.31 ms | 52.2% bf16 MFU | 1622760 tok/s step 13594/19560 | loss 3.378780 (+1.09z)| norm 0.2586 (-1.42z)| lr 1.36e-04 | 322.36 ms | 52.4% bf16 MFU | 1622942 tok/s step 13595/19560 | loss 3.250936 (-1.94z)| norm 0.2767 (-0.40z)| lr 1.36e-04 | 322.76 ms | 52.3% bf16 MFU | 1623013 tok/s step 13596/19560 | loss 3.287750 (-1.06z)| norm 0.2570 (-1.50z)| 
lr 1.36e-04 | 323.91 ms | 52.1% bf16 MFU | 1622792 tok/s step 13597/19560 | loss 3.330168 (-0.06z)| norm 0.2799 (-0.21z)| lr 1.36e-04 | 323.02 ms | 52.2% bf16 MFU | 1622808 tok/s step 13598/19560 | loss 3.369594 (+0.88z)| norm 0.2850 (+0.08z)| lr 1.36e-04 | 322.37 ms | 52.4% bf16 MFU | 1622984 tok/s step 13599/19560 | loss 3.361287 (+0.67z)| norm 0.2705 (-0.73z)| lr 1.36e-04 | 323.17 ms | 52.2% bf16 MFU | 1622953 tok/s step 13600/19560 | loss 3.334436 (+0.03z)| norm 0.2746 (-0.51z)| lr 1.36e-04 | 323.27 ms | 52.2% bf16 MFU | 1622898 tok/s step 13601/19560 | loss 3.365677 (+0.76z)| norm 0.2776 (-0.32z)| lr 1.36e-04 | 322.70 ms | 52.3% bf16 MFU | 1622986 tok/s step 13602/19560 | loss 3.332706 (-0.02z)| norm 0.2722 (-0.63z)| lr 1.36e-04 | 323.31 ms | 52.2% bf16 MFU | 1622918 tok/s step 13603/19560 | loss 3.251107 (-1.92z)| norm 0.2629 (-1.15z)| lr 1.36e-04 | 322.74 ms | 52.3% bf16 MFU | 1622996 tok/s step 13604/19560 | loss 3.331508 (-0.04z)| norm 0.2821 (-0.03z)| lr 1.36e-04 | 323.05 ms | 52.2% bf16 MFU | 1622993 tok/s step 13605/19560 | loss 3.335992 (+0.07z)| norm 0.2681 (-0.85z)| lr 1.36e-04 | 322.80 ms | 52.3% bf16 MFU | 1623053 tok/s step 13606/19560 | loss 3.323848 (-0.22z)| norm 0.2835 (+0.09z)| lr 1.36e-04 | 322.65 ms | 52.3% bf16 MFU | 1623147 tok/s step 13607/19560 | loss 3.327574 (-0.13z)| norm 0.2829 (+0.07z)| lr 1.36e-04 | 322.79 ms | 52.3% bf16 MFU | 1623201 tok/s step 13608/19560 | loss 3.357059 (+0.56z)| norm 0.2724 (-0.58z)| lr 1.36e-04 | 323.13 ms | 52.2% bf16 MFU | 1623167 tok/s step 13609/19560 | loss 3.305533 (-0.64z)| norm 0.2970 (+1.01z)| lr 1.36e-04 | 322.99 ms | 52.3% bf16 MFU | 1623171 tok/s step 13610/19560 | loss 3.445098 (+2.62z)| norm 0.2755 (-0.38z)| lr 1.36e-04 | 322.86 ms | 52.3% bf16 MFU | 1623208 tok/s step 13611/19560 | loss 3.397798 (+1.49z)| norm 0.2833 (+0.14z)| lr 1.36e-04 | 322.50 ms | 52.3% bf16 MFU | 1623332 tok/s step 13612/19560 | loss 3.387586 (+1.24z)| norm 0.2717 (-0.62z)| lr 1.36e-04 | 322.89 ms | 52.3% bf16 MFU | 
1623353 tok/s step 13613/19560 | loss 3.389348 (+1.27z)| norm 0.2712 (-0.65z)| lr 1.36e-04 | 323.19 ms | 52.2% bf16 MFU | 1623296 tok/s step 13614/19560 | loss 3.363147 (+0.65z)| norm 0.2788 (-0.16z)| lr 1.36e-04 | 323.00 ms | 52.3% bf16 MFU | 1623291 tok/s step 13615/19560 | loss 3.395801 (+1.38z)| norm 0.2947 (+0.87z)| lr 1.36e-04 | 323.12 ms | 52.2% bf16 MFU | 1623256 tok/s step 13616/19560 | loss 3.374963 (+0.90z)| norm 0.2942 (+0.82z)| lr 1.35e-04 | 322.84 ms | 52.3% bf16 MFU | 1623292 tok/s step 13617/19560 | loss 3.348601 (+0.34z)| norm 0.3075 (+1.66z)| lr 1.35e-04 | 323.10 ms | 52.2% bf16 MFU | 1623262 tok/s step 13618/19560 | loss 3.306411 (-0.68z)| norm 0.2740 (-0.51z)| lr 1.35e-04 | 322.41 ms | 52.3% bf16 MFU | 1623407 tok/s step 13619/19560 | loss 3.298054 (-0.88z)| norm 0.3071 (+1.61z)| lr 1.35e-04 | 323.07 ms | 52.2% bf16 MFU | 1623378 tok/s step 13620/19560 | loss 3.347902 (+0.32z)| norm 0.2777 (-0.28z)| lr 1.35e-04 | 322.70 ms | 52.3% bf16 MFU | 1623444 tok/s step 13621/19560 | loss 3.351382 (+0.39z)| norm 0.2962 (+0.89z)| lr 1.35e-04 | 322.64 ms | 52.3% bf16 MFU | 1623522 tok/s step 13622/19560 | loss 3.375439 (+0.96z)| norm 0.2816 (-0.05z)| lr 1.35e-04 | 322.69 ms | 52.3% bf16 MFU | 1623583 tok/s step 13623/19560 | loss 3.353936 (+0.43z)| norm 0.3036 (+1.34z)| lr 1.35e-04 | 322.76 ms | 52.3% bf16 MFU | 1623623 tok/s step 13624/19560 | loss 3.300499 (-0.86z)| norm 0.2847 (+0.12z)| lr 1.35e-04 | 322.92 ms | 52.3% bf16 MFU | 1623620 tok/s step 13625/19560 | loss 3.355102 (+0.46z)| norm 0.2662 (-1.05z)| lr 1.35e-04 | 322.72 ms | 52.3% bf16 MFU | 1623670 tok/s step 13626/19560 | loss 3.364964 (+0.69z)| norm 0.2945 (+0.75z)| lr 1.35e-04 | 323.28 ms | 52.2% bf16 MFU | 1623576 tok/s step 13627/19560 | loss 3.257204 (-1.94z)| norm 0.2911 (+0.53z)| lr 1.35e-04 | 323.05 ms | 52.2% bf16 MFU | 1623542 tok/s step 13628/19560 | loss 3.365954 (+0.71z)| norm 0.3060 (+1.46z)| lr 1.35e-04 | 322.81 ms | 52.3% bf16 MFU | 1623573 tok/s step 13629/19560 | loss 3.298488 
(-0.93z)| norm 0.2864 (+0.21z)| lr 1.35e-04 | 323.15 ms | 52.2% bf16 MFU | 1623514 tok/s step 13630/19560 | loss 3.337186 (+0.02z)| norm 0.3067 (+1.48z)| lr 1.35e-04 | 322.93 ms | 52.3% bf16 MFU | 1623516 tok/s step 13631/19560 | loss 3.316965 (-0.46z)| norm 0.2746 (-0.56z)| lr 1.35e-04 | 322.27 ms | 52.4% bf16 MFU | 1623684 tok/s step 13632/19560 | loss 3.349117 (+0.33z)| norm 0.2860 (+0.16z)| lr 1.35e-04 | 323.02 ms | 52.2% bf16 MFU | 1623655 tok/s step 13633/19560 | loss 3.560307 (+5.00z)| norm 0.2848 (+0.08z)| lr 1.35e-04 | 322.36 ms | 52.4% bf16 MFU | 1623792 tok/s step 13634/19560 | loss 3.368132 (+0.67z)| norm 0.2871 (+0.21z)| lr 1.35e-04 | 323.27 ms | 52.2% bf16 MFU | 1623693 tok/s step 13635/19560 | loss 3.330663 (-0.17z)| norm 0.2752 (-0.57z)| lr 1.35e-04 | 322.70 ms | 52.3% bf16 MFU | 1623743 tok/s step 13636/19560 | loss 3.334734 (-0.08z)| norm 0.2775 (-0.41z)| lr 1.35e-04 | 322.96 ms | 52.3% bf16 MFU | 1623725 tok/s step 13637/19560 | loss 3.289046 (-1.11z)| norm 0.2788 (-0.34z)| lr 1.35e-04 | 323.02 ms | 52.2% bf16 MFU | 1623692 tok/s step 13638/19560 | loss 3.376869 (+0.86z)| norm 0.2774 (-0.44z)| lr 1.35e-04 | 322.47 ms | 52.3% bf16 MFU | 1623800 tok/s step 13639/19560 | loss 3.346157 (+0.17z)| norm 0.3097 (+1.67z)| lr 1.35e-04 | 322.86 ms | 52.3% bf16 MFU | 1623804 tok/s step 13640/19560 | loss 3.337537 (-0.02z)| norm 0.2758 (-0.57z)| lr 1.34e-04 | 323.03 ms | 52.2% bf16 MFU | 1623767 tok/s step 13641/19560 | loss 3.289509 (-1.09z)| norm 0.2905 (+0.43z)| lr 1.34e-04 | 322.45 ms | 52.3% bf16 MFU | 1623876 tok/s step 13642/19560 | loss 3.281205 (-1.26z)| norm 0.2686 (-1.03z)| lr 1.34e-04 | 323.00 ms | 52.3% bf16 MFU | 1623840 tok/s step 13643/19560 | loss 3.355716 (+0.40z)| norm 0.3016 (+1.21z)| lr 1.34e-04 | 323.29 ms | 52.2% bf16 MFU | 1623733 tok/s step 13644/19560 | loss 3.358130 (+0.45z)| norm 0.2863 (+0.18z)| lr 1.34e-04 | 322.44 ms | 52.3% bf16 MFU | 1623847 tok/s step 13645/19560 | loss 3.326339 (-0.27z)| norm 0.2820 (-0.10z)| lr 1.34e-04 | 
322.88 ms | 52.3% bf16 MFU | 1623845 tok/s step 13646/19560 | loss 3.405047 (+1.47z)| norm 0.2849 (+0.10z)| lr 1.34e-04 | 323.23 ms | 52.2% bf16 MFU | 1623755 tok/s step 13647/19560 | loss 3.337479 (-0.05z)| norm 0.2593 (-1.65z)| lr 1.34e-04 | 323.22 ms | 52.2% bf16 MFU | 1623671 tok/s step 13648/19560 | loss 3.277039 (-1.38z)| norm 0.2716 (-0.79z)| lr 1.34e-04 | 322.44 ms | 52.3% bf16 MFU | 1623788 tok/s step 13649/19560 | loss 3.318874 (-0.45z)| norm 0.2718 (-0.76z)| lr 1.34e-04 | 322.34 ms | 52.4% bf16 MFU | 1623923 tok/s step 13650/19560 | loss 3.443286 (+2.30z)| norm 0.2810 (-0.11z)| lr 1.34e-04 | 322.74 ms | 52.3% bf16 MFU | 1623951 tok/s step 13651/19560 | loss 3.330625 (-0.20z)| norm 0.2601 (-1.54z)| lr 1.34e-04 | 322.91 ms | 52.3% bf16 MFU | 1623935 tok/s step 13652/19560 | loss 3.299245 (-0.88z)| norm 0.2877 (+0.37z)| lr 1.34e-04 | 323.06 ms | 52.2% bf16 MFU | 1623883 tok/s step 13653/19560 | loss 3.291646 (-1.03z)| norm 0.2666 (-1.09z)| lr 1.34e-04 | 322.05 ms | 52.4% bf16 MFU | 1624088 tok/s step 13654/19560 | loss 3.410331 (+1.55z)| norm 0.2901 (+0.54z)| lr 1.34e-04 | 323.17 ms | 52.2% bf16 MFU | 1624001 tok/s step 13655/19560 | loss 3.347671 (+0.18z)| norm 0.2618 (-1.40z)| lr 1.34e-04 | 323.01 ms | 52.2% bf16 MFU | 1623958 tok/s step 13656/19560 | loss 3.261919 (-1.66z)| norm 0.2846 (+0.18z)| lr 1.34e-04 | 322.52 ms | 52.3% bf16 MFU | 1624039 tok/s step 13657/19560 | loss 3.310427 (-0.60z)| norm 0.2634 (-1.26z)| lr 1.34e-04 | 323.09 ms | 52.2% bf16 MFU | 1623973 tok/s step 13658/19560 | loss 3.306569 (-0.67z)| norm 0.2546 (-1.86z)| lr 1.34e-04 | 323.19 ms | 52.2% bf16 MFU | 1623886 tok/s step 13659/19560 | loss 3.427528 (+1.92z)| norm 0.2927 (+0.78z)| lr 1.34e-04 | 322.51 ms | 52.3% bf16 MFU | 1623975 tok/s step 13660/19560 | loss 3.348820 (+0.23z)| norm 0.2626 (-1.28z)| lr 1.34e-04 | 323.03 ms | 52.2% bf16 MFU | 1623927 tok/s step 13661/19560 | loss 3.386419 (+1.03z)| norm 0.2768 (-0.29z)| lr 1.34e-04 | 322.45 ms | 52.3% bf16 MFU | 1624028 tok/s step 
13662/19560 | loss 3.490816 (+3.13z)| norm 0.3065 (+1.77z)| lr 1.34e-04 | 323.22 ms | 52.2% bf16 MFU | 1623931 tok/s step 13663/19560 | loss 3.437341 (+1.97z)| norm 0.2977 (+1.19z)| lr 1.34e-04 | 322.47 ms | 52.3% bf16 MFU | 1624027 tok/s step 13664/19560 | loss 3.330607 (-0.22z)| norm 0.2858 (+0.36z)| lr 1.33e-04 | 323.24 ms | 52.2% bf16 MFU | 1623924 tok/s step 13665/19560 | loss 3.290232 (-1.05z)| norm 0.2749 (-0.42z)| lr 1.33e-04 | 323.23 ms | 52.2% bf16 MFU | 1623830 tok/s step 13666/19560 | loss 3.328588 (-0.26z)| norm 0.2641 (-1.16z)| lr 1.33e-04 | 322.83 ms | 52.3% bf16 MFU | 1623840 tok/s step 13667/19560 | loss 3.378339 (+0.75z)| norm 0.2944 (+0.96z)| lr 1.33e-04 | 322.58 ms | 52.3% bf16 MFU | 1623912 tok/s step 13668/19560 | loss 3.300769 (-0.84z)| norm 0.2981 (+1.29z)| lr 1.33e-04 | 323.19 ms | 52.2% bf16 MFU | 1623828 tok/s step 13669/19560 | loss 3.272086 (-1.42z)| norm 0.2750 (-0.39z)| lr 1.33e-04 | 322.59 ms | 52.3% bf16 MFU | 1623900 tok/s step 13670/19560 | loss 3.313466 (-0.56z)| norm 0.2781 (-0.15z)| lr 1.33e-04 | 322.93 ms | 52.3% bf16 MFU | 1623882 tok/s step 13671/19560 | loss 3.362097 (+0.43z)| norm 0.2904 (+0.75z)| lr 1.33e-04 | 322.79 ms | 52.3% bf16 MFU | 1623899 tok/s step 13672/19560 | loss 3.322010 (-0.40z)| norm 0.2753 (-0.35z)| lr 1.33e-04 | 322.86 ms | 52.3% bf16 MFU | 1623898 tok/s step 13673/19560 | loss 3.330117 (-0.24z)| norm 0.2866 (+0.50z)| lr 1.33e-04 | 322.57 ms | 52.3% bf16 MFU | 1623969 tok/s step 13674/19560 | loss 3.317471 (-0.49z)| norm 0.2692 (-0.82z)| lr 1.33e-04 | 322.87 ms | 52.3% bf16 MFU | 1623962 tok/s step 13675/19560 | loss 3.268086 (-1.50z)| norm 0.2864 (+0.52z)| lr 1.33e-04 | 323.06 ms | 52.2% bf16 MFU | 1623908 tok/s step 13676/19560 | loss 3.307734 (-0.68z)| norm 0.2828 (+0.24z)| lr 1.33e-04 | 323.04 ms | 52.2% bf16 MFU | 1623861 tok/s step 13677/19560 | loss 3.291057 (-1.00z)| norm 0.3095 (+2.27z)| lr 1.33e-04 | 322.60 ms | 52.3% bf16 MFU | 1623928 tok/s step 13678/19560 | loss 3.303586 (-0.74z)| norm 
0.2980 (+1.38z)| lr 1.33e-04 | 322.77 ms | 52.3% bf16 MFU | 1623948 tok/s step 13679/19560 | loss 3.361298 (+0.45z)| norm 0.3386 (+4.13z)| lr 1.33e-04 | 322.14 ms | 52.4% bf16 MFU | 1624127 tok/s step 13680/19560 | loss 3.380270 (+0.83z)| norm 0.3375 (+3.85z)| lr 1.33e-04 | 323.26 ms | 52.2% bf16 MFU | 1624015 tok/s step 13681/19560 | loss 3.303345 (-0.74z)| norm 0.2895 (+0.59z)| lr 1.33e-04 | 322.69 ms | 52.3% bf16 MFU | 1624052 tok/s step 13682/19560 | loss 3.297071 (-0.86z)| norm 0.3330 (+3.35z)| lr 1.33e-04 | 322.03 ms | 52.4% bf16 MFU | 1624253 tok/s step 13683/19560 | loss 3.318316 (-0.41z)| norm 0.3029 (+1.38z)| lr 1.33e-04 | 323.47 ms | 52.2% bf16 MFU | 1624083 tok/s step 13684/19560 | loss 3.311013 (-0.57z)| norm 0.2804 (-0.07z)| lr 1.33e-04 | 322.89 ms | 52.3% bf16 MFU | 1624065 tok/s step 13685/19560 | loss 3.372054 (+0.69z)| norm 0.3029 (+1.36z)| lr 1.33e-04 | 322.73 ms | 52.3% bf16 MFU | 1624088 tok/s step 13686/19560 | loss 3.279313 (-1.23z)| norm 0.2897 (+0.50z)| lr 1.33e-04 | 322.20 ms | 52.4% bf16 MFU | 1624246 tok/s step 13687/19560 | loss 3.311789 (-0.56z)| norm 0.2718 (-0.67z)| lr 1.33e-04 | 323.61 ms | 52.2% bf16 MFU | 1624039 tok/s step 13688/19560 | loss 3.261241 (-1.60z)| norm 0.2945 (+0.80z)| lr 1.32e-04 | 323.33 ms | 52.2% bf16 MFU | 1623913 tok/s step 13689/19560 | loss 3.331000 (-0.15z)| norm 0.2879 (+0.36z)| lr 1.32e-04 | 322.59 ms | 52.3% bf16 MFU | 1623979 tok/s step 13690/19560 | loss 3.349953 (+0.25z)| norm 0.2845 (+0.13z)| lr 1.32e-04 | 323.28 ms | 52.2% bf16 MFU | 1623869 tok/s step 13691/19560 | loss 3.364311 (+0.55z)| norm 0.2712 (-0.76z)| lr 1.32e-04 | 322.54 ms | 52.3% bf16 MFU | 1623952 tok/s step 13692/19560 | loss 3.336226 (-0.05z)| norm 0.2837 (+0.06z)| lr 1.32e-04 | 323.04 ms | 52.2% bf16 MFU | 1623904 tok/s step 13693/19560 | loss 3.275955 (-1.31z)| norm 0.2970 (+0.93z)| lr 1.32e-04 | 323.26 ms | 52.2% bf16 MFU | 1623802 tok/s step 13694/19560 | loss 3.332779 (-0.12z)| norm 0.2790 (-0.27z)| lr 1.32e-04 | 322.84 ms | 
52.3% bf16 MFU | 1623812 tok/s step 13695/19560 | loss 3.326612 (-0.24z)| norm 0.2736 (-0.62z)| lr 1.32e-04 | 323.23 ms | 52.2% bf16 MFU | 1623723 tok/s step 13696/19560 | loss 3.349712 (+0.24z)| norm 0.2616 (-1.40z)| lr 1.32e-04 | 322.93 ms | 52.3% bf16 MFU | 1623714 tok/s step 13697/19560 | loss 3.278766 (-1.24z)| norm 0.2746 (-0.53z)| lr 1.32e-04 | 322.86 ms | 52.3% bf16 MFU | 1623723 tok/s step 13698/19560 | loss 3.393667 (+1.16z)| norm 0.2715 (-0.74z)| lr 1.32e-04 | 322.63 ms | 52.3% bf16 MFU | 1623790 tok/s step 13699/19560 | loss 3.273815 (-1.33z)| norm 0.2660 (-1.10z)| lr 1.32e-04 | 322.95 ms | 52.3% bf16 MFU | 1623773 tok/s step 13700/19560 | loss 3.273969 (-1.31z)| norm 0.2788 (-0.26z)| lr 1.32e-04 | 323.29 ms | 52.2% bf16 MFU | 1623670 tok/s step 13701/19560 | loss 3.365180 (+0.56z)| norm 0.2665 (-1.07z)| lr 1.32e-04 | 323.01 ms | 52.2% bf16 MFU | 1623642 tok/s step 13702/19560 | loss 3.309470 (-0.59z)| norm 0.2820 (-0.05z)| lr 1.32e-04 | 323.19 ms | 52.2% bf16 MFU | 1623571 tok/s step 13703/19560 | loss 3.349674 (+0.24z)| norm 0.2738 (-0.60z)| lr 1.32e-04 | 322.77 ms | 52.3% bf16 MFU | 1623609 tok/s step 13704/19560 | loss 3.293581 (-0.91z)| norm 0.2723 (-0.69z)| lr 1.32e-04 | 322.35 ms | 52.4% bf16 MFU | 1623752 tok/s step 13705/19560 | loss 3.292097 (-0.94z)| norm 0.2656 (-1.11z)| lr 1.32e-04 | 323.12 ms | 52.2% bf16 MFU | 1623694 tok/s step 13706/19560 | loss 3.302092 (-0.73z)| norm 0.2667 (-1.03z)| lr 1.32e-04 | 323.14 ms | 52.2% bf16 MFU | 1623632 tok/s step 13707/19560 | loss 3.318701 (-0.38z)| norm 0.2870 (+0.32z)| lr 1.32e-04 | 322.99 ms | 52.3% bf16 MFU | 1623612 tok/s step 13708/19560 | loss 3.277295 (-1.22z)| norm 0.2759 (-0.41z)| lr 1.32e-04 | 322.84 ms | 52.3% bf16 MFU | 1623631 tok/s step 13709/19560 | loss 3.290828 (-0.93z)| norm 0.2532 (-1.87z)| lr 1.32e-04 | 322.80 ms | 52.3% bf16 MFU | 1623658 tok/s step 13710/19560 | loss 3.312067 (-0.49z)| norm 0.2727 (-0.58z)| lr 1.32e-04 | 323.06 ms | 52.2% bf16 MFU | 1623618 tok/s step 13711/19560 
| loss 3.304911 (-0.63z)| norm 0.2712 (-0.69z)| lr 1.32e-04 | 323.04 ms | 52.2% bf16 MFU | 1623586 tok/s step 13712/19560 | loss 3.278140 (-1.18z)| norm 0.2513 (-1.95z)| lr 1.31e-04 | 322.87 ms | 52.3% bf16 MFU | 1623599 tok/s step 13713/19560 | loss 3.355063 (+0.43z)| norm 0.2776 (-0.25z)| lr 1.31e-04 | 323.09 ms | 52.2% bf16 MFU | 1623556 tok/s step 13714/19560 | loss 3.297175 (-0.79z)| norm 0.2615 (-1.29z)| lr 1.31e-04 | 322.83 ms | 52.3% bf16 MFU | 1623580 tok/s step 13715/19560 | loss 3.409823 (+1.57z)| norm 0.2971 (+1.00z)| lr 1.31e-04 | 323.02 ms | 52.2% bf16 MFU | 1623556 tok/s step 13716/19560 | loss 3.357793 (+0.48z)| norm 0.2905 (+0.57z)| lr 1.31e-04 | 323.49 ms | 52.2% bf16 MFU | 1623414 tok/s step 13717/19560 | loss 3.311526 (-0.49z)| norm 0.2596 (-1.39z)| lr 1.31e-04 | 323.33 ms | 52.2% bf16 MFU | 1623321 tok/s step 13718/19560 | loss 3.318915 (-0.33z)| norm 0.2750 (-0.42z)| lr 1.31e-04 | 322.82 ms | 52.3% bf16 MFU | 1623359 tok/s step 13719/19560 | loss 3.320236 (-0.30z)| norm 0.2736 (-0.51z)| lr 1.31e-04 | 322.77 ms | 52.3% bf16 MFU | 1623408 tok/s step 13720/19560 | loss 3.346876 (+0.25z)| norm 0.2650 (-1.05z)| lr 1.31e-04 | 323.15 ms | 52.2% bf16 MFU | 1623360 tok/s step 13721/19560 | loss 3.335070 (+0.01z)| norm 0.2719 (-0.62z)| lr 1.31e-04 | 322.72 ms | 52.3% bf16 MFU | 1623420 tok/s step 13722/19560 | loss 3.359338 (+0.52z)| norm 0.2738 (-0.51z)| lr 1.31e-04 | 322.95 ms | 52.3% bf16 MFU | 1623422 tok/s step 13723/19560 | loss 3.288985 (-0.97z)| norm 0.2677 (-0.89z)| lr 1.31e-04 | 323.02 ms | 52.2% bf16 MFU | 1623405 tok/s step 13724/19560 | loss 3.318907 (-0.35z)| norm 0.2814 (-0.02z)| lr 1.31e-04 | 322.93 ms | 52.3% bf16 MFU | 1623411 tok/s step 13725/19560 | loss 3.344103 (+0.19z)| norm 0.2754 (-0.41z)| lr 1.31e-04 | 322.65 ms | 52.3% bf16 MFU | 1623487 tok/s step 13726/19560 | loss 3.349695 (+0.31z)| norm 0.2677 (-0.90z)| lr 1.31e-04 | 323.17 ms | 52.2% bf16 MFU | 1623428 tok/s step 13727/19560 | loss 3.290787 (-0.93z)| norm 0.2786 (-0.20z)| 
lr 1.31e-04 | 323.25 ms | 52.2% bf16 MFU | 1623352 tok/s step 13728/19560 | loss 3.319041 (-0.33z)| norm 0.2668 (-0.96z)| lr 1.31e-04 | 323.30 ms | 52.2% bf16 MFU | 1623267 tok/s step 13729/19560 | loss 3.418211 (+1.75z)| norm 0.3062 (+1.57z)| lr 1.31e-04 | 322.78 ms | 52.3% bf16 MFU | 1623318 tok/s step 13730/19560 | loss 3.336362 (+0.03z)| norm 0.2943 (+0.79z)| lr 1.31e-04 | 322.94 ms | 52.3% bf16 MFU | 1623326 tok/s step 13731/19560 | loss 3.315730 (-0.42z)| norm 0.2969 (+0.94z)| lr 1.31e-04 | 323.08 ms | 52.2% bf16 MFU | 1623300 tok/s step 13732/19560 | loss 3.347969 (+0.26z)| norm 0.2707 (-0.73z)| lr 1.31e-04 | 322.59 ms | 52.3% bf16 MFU | 1623397 tok/s step 13733/19560 | loss 3.351933 (+0.35z)| norm 0.2933 (+0.70z)| lr 1.31e-04 | 323.32 ms | 52.2% bf16 MFU | 1623305 tok/s step 13734/19560 | loss 3.362563 (+0.56z)| norm 0.2891 (+0.43z)| lr 1.31e-04 | 322.59 ms | 52.3% bf16 MFU | 1623402 tok/s step 13735/19560 | loss 3.315371 (-0.44z)| norm 0.2736 (-0.56z)| lr 1.31e-04 | 323.36 ms | 52.2% bf16 MFU | 1623301 tok/s step 13736/19560 | loss 3.383746 (+1.01z)| norm 0.2888 (+0.41z)| lr 1.30e-04 | 323.12 ms | 52.2% bf16 MFU | 1623266 tok/s step 13737/19560 | loss 3.304288 (-0.67z)| norm 0.2905 (+0.52z)| lr 1.30e-04 | 323.60 ms | 52.2% bf16 MFU | 1623112 tok/s step 13738/19560 | loss 3.453538 (+2.48z)| norm 0.2845 (+0.13z)| lr 1.30e-04 | 323.06 ms | 52.2% bf16 MFU | 1623102 tok/s step 13739/19560 | loss 3.286562 (-1.03z)| norm 0.2827 (+0.01z)| lr 1.30e-04 | 322.93 ms | 52.3% bf16 MFU | 1623123 tok/s step 13740/19560 | loss 3.401368 (+1.39z)| norm 0.2690 (-0.87z)| lr 1.30e-04 | 323.03 ms | 52.2% bf16 MFU | 1623119 tok/s step 13741/19560 | loss 3.325094 (-0.21z)| norm 0.2611 (-1.36z)| lr 1.30e-04 | 322.88 ms | 52.3% bf16 MFU | 1623152 tok/s step 13742/19560 | loss 3.331391 (-0.07z)| norm 0.2559 (-1.66z)| lr 1.30e-04 | 322.99 ms | 52.3% bf16 MFU | 1623157 tok/s step 13743/19560 | loss 3.328673 (-0.12z)| norm 0.2705 (-0.72z)| lr 1.30e-04 | 322.40 ms | 52.3% bf16 MFU | 
1623310 tok/s step 13744/19560 | loss 3.385743 (+1.10z)| norm 0.2741 (-0.49z)| lr 1.30e-04 | 323.35 ms | 52.2% bf16 MFU | 1623215 tok/s step 13745/19560 | loss 3.328125 (-0.13z)| norm 0.2653 (-1.03z)| lr 1.30e-04 | 323.64 ms | 52.1% bf16 MFU | 1623053 tok/s step 13746/19560 | loss 3.311029 (-0.49z)| norm 0.2565 (-1.57z)| lr 1.30e-04 | 323.18 ms | 52.2% bf16 MFU | 1623016 tok/s step 13747/19560 | loss 3.355503 (+0.45z)| norm 0.2826 (+0.09z)| lr 1.30e-04 | 322.66 ms | 52.3% bf16 MFU | 1623110 tok/s step 13748/19560 | loss 3.269360 (-1.37z)| norm 0.2806 (-0.04z)| lr 1.30e-04 | 322.87 ms | 52.3% bf16 MFU | 1623146 tok/s step 13749/19560 | loss 3.300990 (-0.69z)| norm 0.2804 (-0.04z)| lr 1.30e-04 | 323.23 ms | 52.2% bf16 MFU | 1623090 tok/s step 13750/19560 | loss 3.348850 (+0.33z)| norm 0.2663 (-0.94z)| lr 1.30e-04 | 322.44 ms | 52.3% bf16 MFU | 1623236 tok/s
val loss 3.321792
evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating HellaSwag: 70/79
HellaSwag: 3000/10042 = 0.298745
step 13751/19560 | loss 3.340814 (+0.16z)| norm 0.2730 (-0.50z)| lr 1.30e-04 | 323.11 ms | 52.2% bf16 MFU | 1623207 tok/s step 13752/19560 | loss 3.348986 (+0.33z)| norm 0.2710 (-0.62z)| lr 1.30e-04 | 322.94 ms | 52.3% bf16 MFU | 1623221 tok/s step 13753/19560 | loss 3.420696 (+1.82z)| norm 0.2853 (+0.29z)| lr 1.30e-04 | 322.32 ms | 52.4% bf16 MFU | 1623392 tok/s step 13754/19560 | loss 3.286921 (-0.98z)| norm 0.2813 (+0.04z)| lr 1.30e-04 | 323.09 ms | 52.2% bf16 MFU | 1623359 tok/s step 13755/19560 | loss 3.289020 (-0.94z)| norm 0.2844 (+0.25z)| lr 1.30e-04 | 322.92 ms | 52.3% bf16 MFU | 1623371 tok/s step 13756/19560 | loss 3.342429 (+0.19z)| norm 0.2662 (-0.92z)| lr 1.30e-04 | 322.50 ms | 52.3% bf16 MFU | 1623489 tok/s step 13757/19560 | loss 3.321210 (-0.27z)| norm 0.2727 (-0.49z)| lr 1.30e-04 | 322.94 ms | 52.3% bf16 MFU | 1623489 tok/s
step 13758/19560 | loss 3.324263 (-0.20z)| norm 0.2767 (-0.21z)| lr 1.30e-04 | 322.84 ms | 52.3% bf16 MFU | 1623514 tok/s step 13759/19560 | loss 3.350331 (+0.35z)| norm 0.2593 (-1.34z)| lr 1.30e-04 | 323.19 ms | 52.2% bf16 MFU | 1623450 tok/s step 13760/19560 | loss 3.332965 (-0.02z)| norm 0.2600 (-1.28z)| lr 1.29e-04 | 322.97 ms | 52.3% bf16 MFU | 1623444 tok/s step 13761/19560 | loss 3.400998 (+1.59z)| norm 0.2656 (-0.90z)| lr 1.29e-04 | 322.76 ms | 52.3% bf16 MFU | 1623490 tok/s step 13762/19560 | loss 3.350286 (+0.42z)| norm 0.2756 (-0.25z)| lr 1.29e-04 | 322.73 ms | 52.3% bf16 MFU | 1623544 tok/s step 13763/19560 | loss 3.336277 (+0.09z)| norm 0.2635 (-1.02z)| lr 1.29e-04 | 322.80 ms | 52.3% bf16 MFU | 1623574 tok/s step 13764/19560 | loss 3.351248 (+0.43z)| norm 0.2772 (-0.14z)| lr 1.29e-04 | 322.77 ms | 52.3% bf16 MFU | 1623612 tok/s step 13765/19560 | loss 3.333608 (+0.02z)| norm 0.2683 (-0.71z)| lr 1.29e-04 | 323.10 ms | 52.2% bf16 MFU | 1623565 tok/s step 13766/19560 | loss 3.330102 (-0.06z)| norm 0.2909 (+0.75z)| lr 1.29e-04 | 322.79 ms | 52.3% bf16 MFU | 1623597 tok/s step 13767/19560 | loss 3.278894 (-1.24z)| norm 0.2609 (-1.18z)| lr 1.29e-04 | 322.94 ms | 52.3% bf16 MFU | 1623591 tok/s step 13768/19560 | loss 3.325134 (-0.16z)| norm 0.2848 (+0.38z)| lr 1.29e-04 | 322.71 ms | 52.3% bf16 MFU | 1623643 tok/s step 13769/19560 | loss 3.342790 (+0.24z)| norm 0.2773 (-0.11z)| lr 1.29e-04 | 322.69 ms | 52.3% bf16 MFU | 1623699 tok/s step 13770/19560 | loss 3.338101 (+0.13z)| norm 0.2802 (+0.08z)| lr 1.29e-04 | 322.79 ms | 52.3% bf16 MFU | 1623726 tok/s step 13771/19560 | loss 3.310549 (-0.51z)| norm 0.2878 (+0.59z)| lr 1.29e-04 | 322.63 ms | 52.3% bf16 MFU | 1623792 tok/s step 13772/19560 | loss 3.473574 (+3.17z)| norm 0.3143 (+2.27z)| lr 1.29e-04 | 322.74 ms | 52.3% bf16 MFU | 1623828 tok/s step 13773/19560 | loss 3.306646 (-0.60z)| norm 0.2946 (+0.99z)| lr 1.29e-04 | 322.69 ms | 52.3% bf16 MFU | 1623872 tok/s step 13774/19560 | loss 3.327677 (-0.11z)| norm 
0.2578 (-1.36z)| lr 1.29e-04 | 322.64 ms | 52.3% bf16 MFU | 1623928 tok/s step 13775/19560 | loss 3.358135 (+0.58z)| norm 0.2722 (-0.44z)| lr 1.29e-04 | 323.21 ms | 52.2% bf16 MFU | 1623838 tok/s step 13776/19560 | loss 3.369012 (+0.81z)| norm 0.2754 (-0.24z)| lr 1.29e-04 | 322.46 ms | 52.3% bf16 MFU | 1623942 tok/s step 13777/19560 | loss 3.289974 (-0.99z)| norm 0.2724 (-0.43z)| lr 1.29e-04 | 323.00 ms | 52.3% bf16 MFU | 1623904 tok/s step 13778/19560 | loss 3.338710 (+0.15z)| norm 0.2622 (-1.07z)| lr 1.29e-04 | 322.88 ms | 52.3% bf16 MFU | 1623898 tok/s step 13779/19560 | loss 3.244661 (-2.00z)| norm 0.2738 (-0.34z)| lr 1.29e-04 | 322.85 ms | 52.3% bf16 MFU | 1623900 tok/s step 13780/19560 | loss 3.270797 (-1.39z)| norm 0.2551 (-1.52z)| lr 1.29e-04 | 322.94 ms | 52.3% bf16 MFU | 1623880 tok/s step 13781/19560 | loss 3.359772 (+0.64z)| norm 0.2819 (+0.19z)| lr 1.29e-04 | 322.66 ms | 52.3% bf16 MFU | 1623932 tok/s step 13782/19560 | loss 3.320123 (-0.26z)| norm 0.2875 (+0.55z)| lr 1.29e-04 | 322.42 ms | 52.3% bf16 MFU | 1624040 tok/s step 13783/19560 | loss 3.299752 (-0.72z)| norm 0.2744 (-0.30z)| lr 1.29e-04 | 322.91 ms | 52.3% bf16 MFU | 1624019 tok/s step 13784/19560 | loss 3.258894 (-1.67z)| norm 0.2912 (+0.78z)| lr 1.29e-04 | 322.31 ms | 52.4% bf16 MFU | 1624150 tok/s step 13785/19560 | loss 3.332629 (+0.04z)| norm 0.2818 (+0.17z)| lr 1.28e-04 | 322.85 ms | 52.3% bf16 MFU | 1624140 tok/s step 13786/19560 | loss 3.278262 (-1.21z)| norm 0.2723 (-0.46z)| lr 1.28e-04 | 322.50 ms | 52.3% bf16 MFU | 1624219 tok/s step 13787/19560 | loss 3.358421 (+0.66z)| norm 0.2815 (+0.15z)| lr 1.28e-04 | 322.64 ms | 52.3% bf16 MFU | 1624258 tok/s step 13788/19560 | loss 3.297245 (-0.77z)| norm 0.2747 (-0.30z)| lr 1.28e-04 | 323.09 ms | 52.2% bf16 MFU | 1624180 tok/s step 13789/19560 | loss 3.275877 (-1.25z)| norm 0.2892 (+0.64z)| lr 1.28e-04 | 322.65 ms | 52.3% bf16 MFU | 1624220 tok/s step 13790/19560 | loss 3.350945 (+0.57z)| norm 0.2891 (+0.65z)| lr 1.28e-04 | 322.49 ms | 
52.3% bf16 MFU | 1624296 tok/s step 13791/19560 | loss 3.324158 (-0.08z)| norm 0.2740 (-0.34z)| lr 1.28e-04 | 322.79 ms | 52.3% bf16 MFU | 1624294 tok/s step 13792/19560 | loss 3.401032 (+1.86z)| norm 0.2856 (+0.43z)| lr 1.28e-04 | 322.27 ms | 52.4% bf16 MFU | 1624423 tok/s step 13793/19560 | loss 3.283458 (-1.12z)| norm 0.3064 (+1.77z)| lr 1.28e-04 | 322.98 ms | 52.3% bf16 MFU | 1624365 tok/s step 13794/19560 | loss 3.299859 (-0.69z)| norm 0.2727 (-0.45z)| lr 1.28e-04 | 322.00 ms | 52.4% bf16 MFU | 1624557 tok/s step 13795/19560 | loss 3.340737 (+0.35z)| norm 0.2834 (+0.27z)| lr 1.28e-04 | 322.99 ms | 52.3% bf16 MFU | 1624491 tok/s step 13796/19560 | loss 3.295832 (-0.79z)| norm 0.2764 (-0.19z)| lr 1.28e-04 | 322.57 ms | 52.3% bf16 MFU | 1624533 tok/s step 13797/19560 | loss 3.316079 (-0.29z)| norm 0.2721 (-0.47z)| lr 1.28e-04 | 322.90 ms | 52.3% bf16 MFU | 1624491 tok/s step 13798/19560 | loss 3.358488 (+0.78z)| norm 0.2896 (+0.68z)| lr 1.28e-04 | 322.99 ms | 52.3% bf16 MFU | 1624427 tok/s step 13799/19560 | loss 3.294733 (-0.83z)| norm 0.2818 (+0.17z)| lr 1.28e-04 | 322.40 ms | 52.3% bf16 MFU | 1624516 tok/s step 13800/19560 | loss 3.286599 (-1.02z)| norm 0.2674 (-0.78z)| lr 1.28e-04 | 322.99 ms | 52.3% bf16 MFU | 1624452 tok/s step 13801/19560 | loss 3.257959 (-1.72z)| norm 0.2592 (-1.31z)| lr 1.28e-04 | 322.50 ms | 52.3% bf16 MFU | 1624515 tok/s step 13802/19560 | loss 3.302181 (-0.60z)| norm 0.2664 (-0.82z)| lr 1.28e-04 | 322.87 ms | 52.3% bf16 MFU | 1624481 tok/s step 13803/19560 | loss 3.339918 (+0.33z)| norm 0.2898 (+0.72z)| lr 1.28e-04 | 323.05 ms | 52.2% bf16 MFU | 1624403 tok/s step 13804/19560 | loss 3.321218 (-0.14z)| norm 0.2880 (+0.59z)| lr 1.28e-04 | 322.63 ms | 52.3% bf16 MFU | 1624435 tok/s step 13805/19560 | loss 3.325907 (-0.03z)| norm 0.2913 (+0.83z)| lr 1.28e-04 | 323.07 ms | 52.2% bf16 MFU | 1624355 tok/s step 13806/19560 | loss 3.394292 (+1.67z)| norm 0.3086 (+1.96z)| lr 1.28e-04 | 322.46 ms | 52.3% bf16 MFU | 1624433 tok/s step 13807/19560 
| loss 3.376171 (+1.21z)| norm 0.2754 (-0.21z)| lr 1.28e-04 | 322.63 ms | 52.3% bf16 MFU | 1624464 tok/s step 13808/19560 | loss 3.306713 (-0.52z)| norm 0.2711 (-0.52z)| lr 1.28e-04 | 322.59 ms | 52.3% bf16 MFU | 1624504 tok/s step 13809/19560 | loss 3.320732 (-0.17z)| norm 0.2806 (+0.21z)| lr 1.27e-04 | 323.02 ms | 52.2% bf16 MFU | 1624432 tok/s step 13810/19560 | loss 3.278138 (-1.24z)| norm 0.2990 (+1.74z)| lr 1.27e-04 | 322.84 ms | 52.3% bf16 MFU | 1624409 tok/s step 13811/19560 | loss 3.552944 (+5.03z)| norm 0.3405 (+4.70z)| lr 1.27e-04 | 323.16 ms | 52.2% bf16 MFU | 1624307 tok/s step 13812/19560 | loss 3.353434 (+0.54z)| norm 0.2673 (-0.78z)| lr 1.27e-04 | 323.23 ms | 52.2% bf16 MFU | 1624193 tok/s step 13813/19560 | loss 3.312381 (-0.38z)| norm 0.3088 (+2.30z)| lr 1.27e-04 | 322.56 ms | 52.3% bf16 MFU | 1624255 tok/s step 13814/19560 | loss 3.292623 (-0.83z)| norm 0.2751 (-0.19z)| lr 1.27e-04 | 322.61 ms | 52.3% bf16 MFU | 1624299 tok/s step 13815/19560 | loss 3.435244 (+2.33z)| norm 0.2859 (+0.60z)| lr 1.27e-04 | 322.86 ms | 52.3% bf16 MFU | 1624277 tok/s step 13816/19560 | loss 3.371862 (+0.91z)| norm 0.3012 (+1.73z)| lr 1.27e-04 | 322.16 ms | 52.4% bf16 MFU | 1624435 tok/s step 13817/19560 | loss 3.315904 (-0.33z)| norm 0.2656 (-0.89z)| lr 1.27e-04 | 322.83 ms | 52.3% bf16 MFU | 1624415 tok/s step 13818/19560 | loss 3.251509 (-1.73z)| norm 0.2783 (+0.05z)| lr 1.27e-04 | 322.83 ms | 52.3% bf16 MFU | 1624396 tok/s step 13819/19560 | loss 3.312081 (-0.39z)| norm 0.2970 (+1.41z)| lr 1.27e-04 | 322.95 ms | 52.3% bf16 MFU | 1624348 tok/s step 13820/19560 | loss 3.398083 (+1.48z)| norm 0.2940 (+1.18z)| lr 1.27e-04 | 322.70 ms | 52.3% bf16 MFU | 1624365 tok/s step 13821/19560 | loss 3.307393 (-0.51z)| norm 0.2880 (+0.75z)| lr 1.27e-04 | 322.86 ms | 52.3% bf16 MFU | 1624342 tok/s step 13822/19560 | loss 3.361029 (+0.66z)| norm 0.2869 (+0.66z)| lr 1.27e-04 | 322.99 ms | 52.3% bf16 MFU | 1624287 tok/s step 13823/19560 | loss 3.316962 (-0.30z)| norm 0.2848 (+0.50z)| 
lr 1.27e-04 | 323.15 ms | 52.2% bf16 MFU | 1624195 tok/s step 13824/19560 | loss 3.358554 (+0.61z)| norm 0.2816 (+0.25z)| lr 1.27e-04 | 322.20 ms | 52.4% bf16 MFU | 1624344 tok/s step 13825/19560 | loss 3.369469 (+0.84z)| norm 0.2994 (+1.54z)| lr 1.27e-04 | 323.10 ms | 52.2% bf16 MFU | 1624260 tok/s step 13826/19560 | loss 3.314450 (-0.36z)| norm 0.2706 (-0.56z)| lr 1.27e-04 | 322.52 ms | 52.3% bf16 MFU | 1624328 tok/s step 13827/19560 | loss 3.267669 (-1.39z)| norm 0.2880 (+0.69z)| lr 1.27e-04 | 323.02 ms | 52.2% bf16 MFU | 1624265 tok/s step 13828/19560 | loss 3.339035 (+0.17z)| norm 0.2834 (+0.36z)| lr 1.27e-04 | 322.53 ms | 52.3% bf16 MFU | 1624330 tok/s step 13829/19560 | loss 3.358451 (+0.61z)| norm 0.2725 (-0.44z)| lr 1.27e-04 | 322.74 ms | 52.3% bf16 MFU | 1624338 tok/s step 13830/19560 | loss 3.289436 (-0.92z)| norm 0.2802 (+0.12z)| lr 1.27e-04 | 322.85 ms | 52.3% bf16 MFU | 1624318 tok/s step 13831/19560 | loss 3.351670 (+0.46z)| norm 0.3267 (+3.34z)| lr 1.27e-04 | 323.13 ms | 52.2% bf16 MFU | 1624229 tok/s step 13832/19560 | loss 3.296968 (-0.76z)| norm 0.3256 (+3.11z)| lr 1.27e-04 | 323.08 ms | 52.2% bf16 MFU | 1624157 tok/s step 13833/19560 | loss 3.271167 (-1.32z)| norm 0.2670 (-0.83z)| lr 1.27e-04 | 322.67 ms | 52.3% bf16 MFU | 1624192 tok/s step 13834/19560 | loss 3.338314 (+0.16z)| norm 0.3106 (+2.05z)| lr 1.26e-04 | 322.92 ms | 52.3% bf16 MFU | 1624162 tok/s step 13835/19560 | loss 3.374503 (+0.94z)| norm 0.2851 (+0.36z)| lr 1.26e-04 | 323.21 ms | 52.2% bf16 MFU | 1624061 tok/s step 13836/19560 | loss 3.330725 (-0.03z)| norm 0.2735 (-0.42z)| lr 1.26e-04 | 322.43 ms | 52.3% bf16 MFU | 1624161 tok/s step 13837/19560 | loss 3.389804 (+1.26z)| norm 0.2873 (+0.49z)| lr 1.26e-04 | 322.80 ms | 52.3% bf16 MFU | 1624164 tok/s step 13838/19560 | loss 3.437819 (+2.25z)| norm 0.2815 (+0.10z)| lr 1.26e-04 | 322.13 ms | 52.4% bf16 MFU | 1624334 tok/s step 13839/19560 | loss 3.329142 (-0.11z)| norm 0.2899 (+0.65z)| lr 1.26e-04 | 322.93 ms | 52.3% bf16 MFU | 
1624293 tok/s step 13840/19560 | loss 3.313881 (-0.45z)| norm 0.2994 (+1.27z)| lr 1.26e-04 | 323.62 ms | 52.2% bf16 MFU | 1624083 tok/s step 13841/19560 | loss 3.364501 (+0.66z)| norm 0.2908 (+0.69z)| lr 1.26e-04 | 322.67 ms | 52.3% bf16 MFU | 1624120 tok/s step 13842/19560 | loss 3.298657 (-0.78z)| norm 0.2828 (+0.13z)| lr 1.26e-04 | 322.24 ms | 52.4% bf16 MFU | 1624266 tok/s step 13843/19560 | loss 3.281227 (-1.14z)| norm 0.2963 (+1.05z)| lr 1.26e-04 | 323.24 ms | 52.2% bf16 MFU | 1624152 tok/s step 13844/19560 | loss 3.232378 (-2.16z)| norm 0.2798 (-0.07z)| lr 1.26e-04 | 322.60 ms | 52.3% bf16 MFU | 1624203 tok/s step 13845/19560 | loss 3.326520 (-0.13z)| norm 0.2811 (+0.01z)| lr 1.26e-04 | 322.62 ms | 52.3% bf16 MFU | 1624247 tok/s step 13846/19560 | loss 3.346042 (+0.29z)| norm 0.2850 (+0.28z)| lr 1.26e-04 | 323.28 ms | 52.2% bf16 MFU | 1624123 tok/s step 13847/19560 | loss 3.349241 (+0.35z)| norm 0.2670 (-0.96z)| lr 1.26e-04 | 322.31 ms | 52.4% bf16 MFU | 1624251 tok/s step 13848/19560 | loss 3.362264 (+0.63z)| norm 0.3049 (+1.62z)| lr 1.26e-04 | 322.56 ms | 52.3% bf16 MFU | 1624309 tok/s step 13849/19560 | loss 3.313780 (-0.41z)| norm 0.2808 (-0.04z)| lr 1.26e-04 | 323.13 ms | 52.2% bf16 MFU | 1624220 tok/s step 13850/19560 | loss 3.340624 (+0.17z)| norm 0.2673 (-0.96z)| lr 1.26e-04 | 322.88 ms | 52.3% bf16 MFU | 1624199 tok/s step 13851/19560 | loss 3.398698 (+1.40z)| norm 0.3078 (+1.77z)| lr 1.26e-04 | 322.72 ms | 52.3% bf16 MFU | 1624218 tok/s step 13852/19560 | loss 3.353925 (+0.43z)| norm 0.2870 (+0.37z)| lr 1.26e-04 | 322.60 ms | 52.3% bf16 MFU | 1624267 tok/s step 13853/19560 | loss 3.394666 (+1.29z)| norm 0.2728 (-0.59z)| lr 1.26e-04 | 322.71 ms | 52.3% bf16 MFU | 1624284 tok/s step 13854/19560 | loss 3.260176 (-1.55z)| norm 0.2803 (-0.10z)| lr 1.26e-04 | 322.65 ms | 52.3% bf16 MFU | 1624318 tok/s step 13855/19560 | loss 3.254383 (-1.65z)| norm 0.2711 (-0.72z)| lr 1.26e-04 | 322.84 ms | 52.3% bf16 MFU | 1624301 tok/s step 13856/19560 | loss 3.345561 
(+0.25z)| norm 0.2931 (+0.76z)| lr 1.26e-04 | 323.28 ms | 52.2% bf16 MFU | 1624174 tok/s step 13857/19560 | loss 3.366923 (+0.72z)| norm 0.2603 (-1.44z)| lr 1.26e-04 | 322.83 ms | 52.3% bf16 MFU | 1624167 tok/s step 13858/19560 | loss 3.311054 (-0.46z)| norm 0.2770 (-0.30z)| lr 1.25e-04 | 322.85 ms | 52.3% bf16 MFU | 1624156 tok/s step 13859/19560 | loss 3.291815 (-0.86z)| norm 0.2707 (-0.71z)| lr 1.25e-04 | 322.31 ms | 52.4% bf16 MFU | 1624283 tok/s step 13860/19560 | loss 3.341458 (+0.19z)| norm 0.2742 (-0.48z)| lr 1.25e-04 | 322.65 ms | 52.3% bf16 MFU | 1624315 tok/s step 13861/19560 | loss 3.292765 (-0.83z)| norm 0.2734 (-0.53z)| lr 1.25e-04 | 322.93 ms | 52.3% bf16 MFU | 1624275 tok/s step 13862/19560 | loss 3.301684 (-0.63z)| norm 0.2659 (-1.02z)| lr 1.25e-04 | 322.88 ms | 52.3% bf16 MFU | 1624251 tok/s step 13863/19560 | loss 3.374392 (+0.89z)| norm 0.2651 (-1.07z)| lr 1.25e-04 | 322.93 ms | 52.3% bf16 MFU | 1624214 tok/s step 13864/19560 | loss 3.308547 (-0.49z)| norm 0.2699 (-0.73z)| lr 1.25e-04 | 322.73 ms | 52.3% bf16 MFU | 1624232 tok/s step 13865/19560 | loss 3.368207 (+0.76z)| norm 0.3026 (+1.47z)| lr 1.25e-04 | 323.17 ms | 52.2% bf16 MFU | 1624137 tok/s step 13866/19560 | loss 3.314481 (-0.36z)| norm 0.2946 (+0.93z)| lr 1.25e-04 | 322.79 ms | 52.3% bf16 MFU | 1624141 tok/s step 13867/19560 | loss 3.310019 (-0.46z)| norm 0.2757 (-0.34z)| lr 1.25e-04 | 322.85 ms | 52.3% bf16 MFU | 1624130 tok/s step 13868/19560 | loss 3.282454 (-1.04z)| norm 0.2733 (-0.51z)| lr 1.25e-04 | 322.89 ms | 52.3% bf16 MFU | 1624112 tok/s step 13869/19560 | loss 3.253078 (-1.65z)| norm 0.2570 (-1.60z)| lr 1.25e-04 | 322.30 ms | 52.4% bf16 MFU | 1624241 tok/s step 13870/19560 | loss 3.376941 (+1.01z)| norm 0.2680 (-0.87z)| lr 1.25e-04 | 323.11 ms | 52.2% bf16 MFU | 1624160 tok/s step 13871/19560 | loss 3.362615 (+0.69z)| norm 0.2884 (+0.50z)| lr 1.25e-04 | 323.00 ms | 52.3% bf16 MFU | 1624113 tok/s step 13872/19560 | loss 3.340695 (+0.23z)| norm 0.2620 (-1.28z)| lr 1.25e-04 | 
323.26 ms | 52.2% bf16 MFU | 1624001 tok/s step 13873/19560 | loss 3.337907 (+0.17z)| norm 0.2624 (-1.24z)| lr 1.25e-04 | 322.62 ms | 52.3% bf16 MFU | 1624055 tok/s step 13874/19560 | loss 3.226172 (-2.18z)| norm 0.2701 (-0.74z)| lr 1.25e-04 | 322.14 ms | 52.4% bf16 MFU | 1624227 tok/s step 13875/19560 | loss 3.307448 (-0.46z)| norm 0.2718 (-0.61z)| lr 1.25e-04 | 323.05 ms | 52.2% bf16 MFU | 1624161 tok/s step 13876/19560 | loss 3.306024 (-0.50z)| norm 0.2502 (-2.03z)| lr 1.25e-04 | 323.23 ms | 52.2% bf16 MFU | 1624054 tok/s step 13877/19560 | loss 3.345786 (+0.34z)| norm 0.2835 (+0.19z)| lr 1.25e-04 | 323.07 ms | 52.2% bf16 MFU | 1623992 tok/s step 13878/19560 | loss 3.321709 (-0.16z)| norm 0.2608 (-1.32z)| lr 1.25e-04 | 322.85 ms | 52.3% bf16 MFU | 1623988 tok/s step 13879/19560 | loss 3.356484 (+0.57z)| norm 0.2667 (-0.92z)| lr 1.25e-04 | 322.45 ms | 52.3% bf16 MFU | 1624085 tok/s step 13880/19560 | loss 3.307650 (-0.46z)| norm 0.2972 (+1.08z)| lr 1.25e-04 | 322.91 ms | 52.3% bf16 MFU | 1624062 tok/s step 13881/19560 | loss 3.354403 (+0.55z)| norm 0.2568 (-1.56z)| lr 1.25e-04 | 323.27 ms | 52.2% bf16 MFU | 1623952 tok/s step 13882/19560 | loss 3.330840 (+0.04z)| norm 0.2903 (+0.63z)| lr 1.25e-04 | 323.22 ms | 52.2% bf16 MFU | 1623859 tok/s step 13883/19560 | loss 3.321471 (-0.17z)| norm 0.2778 (-0.18z)| lr 1.24e-04 | 322.63 ms | 52.3% bf16 MFU | 1623917 tok/s step 13884/19560 | loss 3.359651 (+0.65z)| norm 0.2555 (-1.63z)| lr 1.24e-04 | 322.72 ms | 52.3% bf16 MFU | 1623950 tok/s step 13885/19560 | loss 3.288198 (-0.89z)| norm 0.2697 (-0.70z)| lr 1.24e-04 | 323.42 ms | 52.2% bf16 MFU | 1623806 tok/s step 13886/19560 | loss 3.343183 (+0.30z)| norm 0.2600 (-1.31z)| lr 1.24e-04 | 322.72 ms | 52.3% bf16 MFU | 1623846 tok/s step 13887/19560 | loss 3.355037 (+0.55z)| norm 0.2565 (-1.53z)| lr 1.24e-04 | 323.69 ms | 52.1% bf16 MFU | 1623640 tok/s step 13888/19560 | loss 3.283825 (-0.97z)| norm 0.2667 (-0.88z)| lr 1.24e-04 | 322.65 ms | 52.3% bf16 MFU | 1623706 tok/s step 
13889/19560 | loss 3.340995 (+0.27z)| norm 0.2852 (+0.30z)| lr 1.24e-04 | 322.62 ms | 52.3% bf16 MFU | 1623775 tok/s step 13890/19560 | loss 3.353127 (+0.53z)| norm 0.2750 (-0.36z)| lr 1.24e-04 | 323.27 ms | 52.2% bf16 MFU | 1623676 tok/s step 13891/19560 | loss 3.357613 (+0.63z)| norm 0.2785 (-0.14z)| lr 1.24e-04 | 323.19 ms | 52.2% bf16 MFU | 1623603 tok/s step 13892/19560 | loss 3.366181 (+0.81z)| norm 0.2720 (-0.56z)| lr 1.24e-04 | 322.93 ms | 52.3% bf16 MFU | 1623600 tok/s step 13893/19560 | loss 3.343920 (+0.32z)| norm 0.2650 (-1.01z)| lr 1.24e-04 | 322.49 ms | 52.3% bf16 MFU | 1623706 tok/s step 13894/19560 | loss 3.365121 (+0.78z)| norm 0.2849 (+0.28z)| lr 1.24e-04 | 323.21 ms | 52.2% bf16 MFU | 1623627 tok/s step 13895/19560 | loss 3.296726 (-0.71z)| norm 0.2529 (-1.78z)| lr 1.24e-04 | 322.34 ms | 52.4% bf16 MFU | 1623771 tok/s step 13896/19560 | loss 3.316845 (-0.27z)| norm 0.2807 (+0.02z)| lr 1.24e-04 | 323.28 ms | 52.2% bf16 MFU | 1623672 tok/s step 13897/19560 | loss 3.285208 (-0.94z)| norm 0.2804 (-0.01z)| lr 1.24e-04 | 322.85 ms | 52.3% bf16 MFU | 1623686 tok/s step 13898/19560 | loss 3.458482 (+2.70z)| norm 0.2988 (+1.17z)| lr 1.24e-04 | 323.32 ms | 52.2% bf16 MFU | 1623580 tok/s step 13899/19560 | loss 3.289827 (-0.83z)| norm 0.2977 (+1.09z)| lr 1.24e-04 | 322.85 ms | 52.3% bf16 MFU | 1623598 tok/s step 13900/19560 | loss 3.299056 (-0.63z)| norm 0.2905 (+0.65z)| lr 1.24e-04 | 322.89 ms | 52.3% bf16 MFU | 1623606 tok/s step 13901/19560 | loss 3.335663 (+0.16z)| norm 0.2716 (-0.57z)| lr 1.24e-04 | 322.37 ms | 52.4% bf16 MFU | 1623743 tok/s step 13902/19560 | loss 3.424497 (+2.03z)| norm 0.2900 (+0.62z)| lr 1.24e-04 | 322.94 ms | 52.3% bf16 MFU | 1623730 tok/s step 13903/19560 | loss 3.317561 (-0.24z)| norm 0.2929 (+0.80z)| lr 1.24e-04 | 323.34 ms | 52.2% bf16 MFU | 1623616 tok/s step 13904/19560 | loss 3.344115 (+0.33z)| norm 0.2718 (-0.59z)| lr 1.24e-04 | 323.28 ms | 52.2% bf16 MFU | 1623524 tok/s step 13905/19560 | loss 3.290251 (-0.82z)| norm 
0.2690 (-0.77z)| lr 1.24e-04 | 323.07 ms | 52.2% bf16 MFU | 1623491 tok/s step 13906/19560 | loss 3.301040 (-0.59z)| norm 0.2655 (-1.00z)| lr 1.24e-04 | 323.26 ms | 52.2% bf16 MFU | 1623409 tok/s step 13907/19560 | loss 3.341361 (+0.26z)| norm 0.2588 (-1.42z)| lr 1.24e-04 | 322.78 ms | 52.3% bf16 MFU | 1623452 tok/s step 13908/19560 | loss 3.371691 (+0.91z)| norm 0.2784 (-0.16z)| lr 1.23e-04 | 322.92 ms | 52.3% bf16 MFU | 1623458 tok/s step 13909/19560 | loss 3.379506 (+1.07z)| norm 0.2631 (-1.15z)| lr 1.23e-04 | 323.00 ms | 52.3% bf16 MFU | 1623443 tok/s step 13910/19560 | loss 3.322134 (-0.17z)| norm 0.3414 (+3.73z)| lr 1.23e-04 | 323.58 ms | 52.2% bf16 MFU | 1623284 tok/s step 13911/19560 | loss 3.268174 (-1.33z)| norm 0.2675 (-0.83z)| lr 1.23e-04 | 322.88 ms | 52.3% bf16 MFU | 1623308 tok/s step 13912/19560 | loss 3.310811 (-0.42z)| norm 0.2994 (+1.13z)| lr 1.23e-04 | 322.63 ms | 52.3% bf16 MFU | 1623396 tok/s step 13913/19560 | loss 3.299602 (-0.66z)| norm 0.2788 (-0.14z)| lr 1.23e-04 | 323.28 ms | 52.2% bf16 MFU | 1623314 tok/s step 13914/19560 | loss 3.329348 (-0.02z)| norm 0.2872 (+0.37z)| lr 1.23e-04 | 322.97 ms | 52.3% bf16 MFU | 1623316 tok/s step 13915/19560 | loss 3.317441 (-0.28z)| norm 0.2925 (+0.69z)| lr 1.23e-04 | 322.53 ms | 52.3% bf16 MFU | 1623428 tok/s step 13916/19560 | loss 3.237564 (-1.98z)| norm 0.2976 (+0.99z)| lr 1.23e-04 | 323.51 ms | 52.2% bf16 MFU | 1623287 tok/s step 13917/19560 | loss 3.323709 (-0.14z)| norm 0.2697 (-0.71z)| lr 1.23e-04 | 322.97 ms | 52.3% bf16 MFU | 1623289 tok/s step 13918/19560 | loss 3.343153 (+0.29z)| norm 0.2920 (+0.65z)| lr 1.23e-04 | 322.42 ms | 52.3% bf16 MFU | 1623429 tok/s step 13919/19560 | loss 3.373352 (+0.93z)| norm 0.2887 (+0.45z)| lr 1.23e-04 | 323.18 ms | 52.2% bf16 MFU | 1623371 tok/s step 13920/19560 | loss 3.316827 (-0.28z)| norm 0.2758 (-0.34z)| lr 1.23e-04 | 323.34 ms | 52.2% bf16 MFU | 1623275 tok/s step 13921/19560 | loss 3.324671 (-0.12z)| norm 0.2915 (+0.63z)| lr 1.23e-04 | 323.09 ms | 
52.2% bf16 MFU | 1623249 tok/s step 13922/19560 | loss 3.331573 (+0.03z)| norm 0.2786 (-0.17z)| lr 1.23e-04 | 322.80 ms | 52.3% bf16 MFU | 1623296 tok/s step 13923/19560 | loss 3.461599 (+2.77z)| norm 0.3040 (+1.38z)| lr 1.23e-04 | 322.99 ms | 52.3% bf16 MFU | 1623293 tok/s step 13924/19560 | loss 3.369933 (+0.81z)| norm 0.2669 (-0.88z)| lr 1.23e-04 | 323.25 ms | 52.2% bf16 MFU | 1623225 tok/s step 13925/19560 | loss 3.344810 (+0.27z)| norm 0.2894 (+0.48z)| lr 1.23e-04 | 322.65 ms | 52.3% bf16 MFU | 1623310 tok/s step 13926/19560 | loss 3.353874 (+0.47z)| norm 0.2689 (-0.76z)| lr 1.23e-04 | 323.44 ms | 52.2% bf16 MFU | 1623193 tok/s step 13927/19560 | loss 3.363746 (+0.66z)| norm 0.2622 (-1.15z)| lr 1.23e-04 | 322.88 ms | 52.3% bf16 MFU | 1623223 tok/s step 13928/19560 | loss 3.300310 (-0.69z)| norm 0.2684 (-0.77z)| lr 1.23e-04 | 322.52 ms | 52.3% bf16 MFU | 1623343 tok/s step 13929/19560 | loss 3.324683 (-0.18z)| norm 0.2700 (-0.69z)| lr 1.23e-04 | 323.60 ms | 52.2% bf16 MFU | 1623186 tok/s step 13930/19560 | loss 3.295001 (-0.82z)| norm 0.2559 (-1.53z)| lr 1.23e-04 | 323.26 ms | 52.2% bf16 MFU | 1623119 tok/s step 13931/19560 | loss 3.333643 (+0.01z)| norm 0.2650 (-0.97z)| lr 1.23e-04 | 323.02 ms | 52.2% bf16 MFU | 1623117 tok/s step 13932/19560 | loss 3.298033 (-0.75z)| norm 0.2727 (-0.49z)| lr 1.22e-04 | 323.35 ms | 52.2% bf16 MFU | 1623032 tok/s step 13933/19560 | loss 3.388669 (+1.18z)| norm 0.2796 (-0.07z)| lr 1.22e-04 | 323.18 ms | 52.2% bf16 MFU | 1622995 tok/s step 13934/19560 | loss 3.516646 (+3.70z)| norm 0.3202 (+2.36z)| lr 1.22e-04 | 322.73 ms | 52.3% bf16 MFU | 1623071 tok/s step 13935/19560 | loss 3.328699 (-0.11z)| norm 0.2814 (+0.03z)| lr 1.22e-04 | 323.18 ms | 52.2% bf16 MFU | 1623031 tok/s step 13936/19560 | loss 3.262239 (-1.44z)| norm 0.3011 (+1.19z)| lr 1.22e-04 | 323.71 ms | 52.1% bf16 MFU | 1622861 tok/s step 13937/19560 | loss 3.325179 (-0.17z)| norm 0.2812 (+0.01z)| lr 1.22e-04 | 324.05 ms | 52.1% bf16 MFU | 1622613 tok/s step 13938/19560 
| loss 3.307178 (-0.54z)| norm 0.2742 (-0.40z)| lr 1.22e-04 | 322.79 ms | 52.3% bf16 MFU | 1622695 tok/s step 13939/19560 | loss 3.333094 (+0.02z)| norm 0.3332 (+3.17z)| lr 1.22e-04 | 323.29 ms | 52.2% bf16 MFU | 1622645 tok/s step 13940/19560 | loss 3.363918 (+0.70z)| norm 0.2896 (+0.52z)| lr 1.22e-04 | 322.87 ms | 52.3% bf16 MFU | 1622705 tok/s step 13941/19560 | loss 3.346755 (+0.32z)| norm 0.2807 (-0.01z)| lr 1.22e-04 | 323.05 ms | 52.2% bf16 MFU | 1622716 tok/s step 13942/19560 | loss 3.548380 (+4.36z)| norm 0.3671 (+4.76z)| lr 1.22e-04 | 322.76 ms | 52.3% bf16 MFU | 1622798 tok/s step 13943/19560 | loss 3.356596 (+0.47z)| norm 0.2852 (+0.20z)| lr 1.22e-04 | 322.69 ms | 52.3% bf16 MFU | 1622896 tok/s step 13944/19560 | loss 3.364569 (+0.64z)| norm 0.3063 (+1.37z)| lr 1.22e-04 | 322.93 ms | 52.3% bf16 MFU | 1622929 tok/s step 13945/19560 | loss 3.317692 (-0.33z)| norm 0.2978 (+0.88z)| lr 1.22e-04 | 322.94 ms | 52.3% bf16 MFU | 1622956 tok/s step 13946/19560 | loss 3.310639 (-0.50z)| norm 0.3167 (+1.89z)| lr 1.22e-04 | 322.86 ms | 52.3% bf16 MFU | 1623002 tok/s step 13947/19560 | loss 3.357650 (+0.48z)| norm 0.2848 (+0.15z)| lr 1.22e-04 | 322.91 ms | 52.3% bf16 MFU | 1623033 tok/s step 13948/19560 | loss 3.279324 (-1.14z)| norm 0.3078 (+1.40z)| lr 1.22e-04 | 322.83 ms | 52.3% bf16 MFU | 1623083 tok/s step 13949/19560 | loss 3.296560 (-0.78z)| norm 0.2748 (-0.40z)| lr 1.22e-04 | 323.21 ms | 52.2% bf16 MFU | 1623036 tok/s step 13950/19560 | loss 3.294338 (-0.81z)| norm 0.2818 (-0.01z)| lr 1.22e-04 | 323.39 ms | 52.2% bf16 MFU | 1622945 tok/s step 13951/19560 | loss 3.330242 (-0.06z)| norm 0.2814 (-0.03z)| lr 1.22e-04 | 322.70 ms | 52.3% bf16 MFU | 1623033 tok/s step 13952/19560 | loss 3.330821 (-0.05z)| norm 0.2596 (-1.20z)| lr 1.22e-04 | 323.18 ms | 52.2% bf16 MFU | 1622996 tok/s step 13953/19560 | loss 3.303900 (-0.60z)| norm 0.2789 (-0.15z)| lr 1.22e-04 | 323.18 ms | 52.2% bf16 MFU | 1622959 tok/s step 13954/19560 | loss 3.323561 (-0.19z)| norm 0.2887 (+0.37z)| 
lr 1.22e-04 | 323.02 ms | 52.2% bf16 MFU | 1622964 tok/s step 13955/19560 | loss 3.278767 (-1.13z)| norm 0.2836 (+0.10z)| lr 1.22e-04 | 322.64 ms | 52.3% bf16 MFU | 1623064 tok/s step 13956/19560 | loss 3.388419 (+1.16z)| norm 0.2879 (+0.33z)| lr 1.22e-04 | 322.68 ms | 52.3% bf16 MFU | 1623152 tok/s step 13957/19560 | loss 3.309320 (-0.49z)| norm 0.2800 (-0.10z)| lr 1.21e-04 | 322.94 ms | 52.3% bf16 MFU | 1623168 tok/s step 13958/19560 | loss 3.367843 (+0.73z)| norm 0.2763 (-0.30z)| lr 1.21e-04 | 323.26 ms | 52.2% bf16 MFU | 1623103 tok/s step 13959/19560 | loss 3.330907 (-0.05z)| norm 0.2746 (-0.38z)| lr 1.21e-04 | 322.79 ms | 52.3% bf16 MFU | 1623160 tok/s step 13960/19560 | loss 3.352596 (+0.40z)| norm 0.2758 (-0.30z)| lr 1.21e-04 | 322.82 ms | 52.3% bf16 MFU | 1623207 tok/s step 13961/19560 | loss 3.292963 (-0.86z)| norm 0.2630 (-1.03z)| lr 1.21e-04 | 323.01 ms | 52.3% bf16 MFU | 1623204 tok/s step 13962/19560 | loss 3.275123 (-1.22z)| norm 0.2746 (-0.35z)| lr 1.21e-04 | 322.33 ms | 52.4% bf16 MFU | 1623372 tok/s step 13963/19560 | loss 3.323547 (-0.19z)| norm 0.2731 (-0.43z)| lr 1.21e-04 | 323.30 ms | 52.2% bf16 MFU | 1623288 tok/s step 13964/19560 | loss 3.298615 (-0.71z)| norm 0.2731 (-0.43z)| lr 1.21e-04 | 322.34 ms | 52.4% bf16 MFU | 1623448 tok/s step 13965/19560 | loss 3.332891 (+0.02z)| norm 0.2796 (-0.05z)| lr 1.21e-04 | 323.24 ms | 52.2% bf16 MFU | 1623374 tok/s step 13966/19560 | loss 3.295292 (-0.77z)| norm 0.2729 (-0.44z)| lr 1.21e-04 | 322.85 ms | 52.3% bf16 MFU | 1623402 tok/s step 13967/19560 | loss 3.306411 (-0.52z)| norm 0.2842 (+0.22z)| lr 1.21e-04 | 323.05 ms | 52.2% bf16 MFU | 1623378 tok/s step 13968/19560 | loss 3.334299 (+0.07z)| norm 0.3214 (+2.31z)| lr 1.21e-04 | 322.46 ms | 52.3% bf16 MFU | 1623503 tok/s step 13969/19560 | loss 3.352386 (+0.46z)| norm 0.2919 (+0.64z)| lr 1.21e-04 | 322.72 ms | 52.3% bf16 MFU | 1623558 tok/s step 13970/19560 | loss 3.311671 (-0.42z)| norm 0.2933 (+0.71z)| lr 1.21e-04 | 323.12 ms | 52.2% bf16 MFU | 
1623510 tok/s step 13971/19560 | loss 3.271853 (-1.27z)| norm 0.2553 (-1.41z)| lr 1.21e-04 | 322.16 ms | 52.4% bf16 MFU | 1623706 tok/s step 13972/19560 | loss 3.289060 (-0.92z)| norm 0.2984 (+1.00z)| lr 1.21e-04 | 323.26 ms | 52.2% bf16 MFU | 1623615 tok/s step 13973/19560 | loss 3.388655 (+1.23z)| norm 0.3041 (+1.30z)| lr 1.21e-04 | 322.69 ms | 52.3% bf16 MFU | 1623671 tok/s step 13974/19560 | loss 3.344496 (+0.28z)| norm 0.2696 (-0.61z)| lr 1.21e-04 | 322.89 ms | 52.3% bf16 MFU | 1623674 tok/s step 13975/19560 | loss 3.276182 (-1.19z)| norm 0.2798 (-0.05z)| lr 1.21e-04 | 322.53 ms | 52.3% bf16 MFU | 1623768 tok/s step 13976/19560 | loss 3.315046 (-0.34z)| norm 0.2838 (+0.18z)| lr 1.21e-04 | 322.70 ms | 52.3% bf16 MFU | 1623815 tok/s step 13977/19560 | loss 3.383871 (+1.13z)| norm 0.2768 (-0.21z)| lr 1.21e-04 | 322.68 ms | 52.3% bf16 MFU | 1623864 tok/s step 13978/19560 | loss 3.431495 (+2.10z)| norm 0.3210 (+2.22z)| lr 1.21e-04 | 322.35 ms | 52.4% bf16 MFU | 1623994 tok/s step 13979/19560 | loss 3.315174 (-0.35z)| norm 0.2947 (+0.77z)| lr 1.21e-04 | 323.23 ms | 52.2% bf16 MFU | 1623896 tok/s step 13980/19560 | loss 3.359424 (+0.59z)| norm 0.2696 (-0.61z)| lr 1.21e-04 | 323.07 ms | 52.2% bf16 MFU | 1623844 tok/s step 13981/19560 | loss 3.365727 (+0.74z)| norm 0.2889 (+0.45z)| lr 1.21e-04 | 323.04 ms | 52.2% bf16 MFU | 1623800 tok/s step 13982/19560 | loss 3.334464 (+0.06z)| norm 0.2743 (-0.36z)| lr 1.20e-04 | 322.94 ms | 52.3% bf16 MFU | 1623783 tok/s step 13983/19560 | loss 3.436969 (+2.22z)| norm 0.2947 (+0.76z)| lr 1.20e-04 | 322.85 ms | 52.3% bf16 MFU | 1623790 tok/s step 13984/19560 | loss 3.345350 (+0.26z)| norm 0.2726 (-0.45z)| lr 1.20e-04 | 322.79 ms | 52.3% bf16 MFU | 1623812 tok/s step 13985/19560 | loss 3.347952 (+0.32z)| norm 0.2739 (-0.39z)| lr 1.20e-04 | 323.17 ms | 52.2% bf16 MFU | 1623738 tok/s step 13986/19560 | loss 3.276513 (-1.20z)| norm 0.2696 (-0.62z)| lr 1.20e-04 | 322.70 ms | 52.3% bf16 MFU | 1623787 tok/s step 13987/19560 | loss 3.344913 
(+0.25z)| norm 0.2838 (+0.16z)| lr 1.20e-04 | 322.37 ms | 52.4% bf16 MFU | 1623914 tok/s step 13988/19560 | loss 3.318631 (-0.31z)| norm 0.2738 (-0.40z)| lr 1.20e-04 | 322.99 ms | 52.3% bf16 MFU | 1623880 tok/s step 13989/19560 | loss 3.325671 (-0.16z)| norm 0.2882 (+0.40z)| lr 1.20e-04 | 322.79 ms | 52.3% bf16 MFU | 1623897 tok/s step 13990/19560 | loss 3.370512 (+0.78z)| norm 0.2667 (-0.80z)| lr 1.20e-04 | 322.73 ms | 52.3% bf16 MFU | 1623930 tok/s step 13991/19560 | loss 3.304899 (-0.61z)| norm 0.2673 (-0.77z)| lr 1.20e-04 | 322.66 ms | 52.3% bf16 MFU | 1623977 tok/s step 13992/19560 | loss 3.355564 (+0.47z)| norm 0.2780 (-0.18z)| lr 1.20e-04 | 322.84 ms | 52.3% bf16 MFU | 1623978 tok/s step 13993/19560 | loss 3.302797 (-0.65z)| norm 0.2540 (-1.49z)| lr 1.20e-04 | 322.13 ms | 52.4% bf16 MFU | 1624157 tok/s step 13994/19560 | loss 3.309601 (-0.51z)| norm 0.2665 (-0.78z)| lr 1.20e-04 | 323.08 ms | 52.2% bf16 MFU | 1624088 tok/s step 13995/19560 | loss 3.307176 (-0.56z)| norm 0.2621 (-1.02z)| lr 1.20e-04 | 322.96 ms | 52.3% bf16 MFU | 1624052 tok/s step 13996/19560 | loss 3.256313 (-1.64z)| norm 0.2697 (-0.60z)| lr 1.20e-04 | 322.62 ms | 52.3% bf16 MFU | 1624104 tok/s step 13997/19560 | loss 3.408674 (+1.59z)| norm 0.2658 (-0.82z)| lr 1.20e-04 | 322.98 ms | 52.3% bf16 MFU | 1624063 tok/s step 13998/19560 | loss 3.317378 (-0.35z)| norm 0.2726 (-0.44z)| lr 1.20e-04 | 322.73 ms | 52.3% bf16 MFU | 1624087 tok/s step 13999/19560 | loss 3.276218 (-1.21z)| norm 0.2595 (-1.15z)| lr 1.20e-04 | 322.54 ms | 52.3% bf16 MFU | 1624158 tok/s step 14000/19560 | loss 3.323000 (-0.21z)| norm 0.2705 (-0.55z)| lr 1.20e-04 | 323.74 ms | 52.1% bf16 MFU | 1623923 tok/s val loss 3.317825 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating HellaSwag: 70/79 HellaSwag: 3012/10042 = 0.299940 step 14001/19560 | loss 3.298245 (-0.73z)| norm 
0.2706 (-0.55z)| lr 1.20e-04 | 322.44 ms | 52.3% bf16 MFU | 1624027 tok/s step 14002/19560 | loss 3.350191 (+0.36z)| norm 0.2787 (-0.10z)| lr 1.20e-04 | 322.60 ms | 52.3% bf16 MFU | 1624086 tok/s step 14003/19560 | loss 3.315331 (-0.40z)| norm 0.2698 (-0.60z)| lr 1.20e-04 | 322.92 ms | 52.3% bf16 MFU | 1624060 tok/s step 14004/19560 | loss 3.285808 (-1.03z)| norm 0.2742 (-0.37z)| lr 1.20e-04 | 322.42 ms | 52.3% bf16 MFU | 1624163 tok/s step 14005/19560 | loss 3.316741 (-0.36z)| norm 0.2925 (+0.66z)| lr 1.20e-04 | 322.50 ms | 52.3% bf16 MFU | 1624241 tok/s step 14006/19560 | loss 3.303274 (-0.65z)| norm 0.2641 (-0.94z)| lr 1.20e-04 | 323.04 ms | 52.2% bf16 MFU | 1624177 tok/s step 14007/19560 | loss 3.310030 (-0.49z)| norm 0.2764 (-0.25z)| lr 1.19e-04 | 323.01 ms | 52.2% bf16 MFU | 1624125 tok/s step 14008/19560 | loss 3.401561 (+1.46z)| norm 0.2847 (+0.22z)| lr 1.19e-04 | 322.37 ms | 52.4% bf16 MFU | 1624235 tok/s step 14009/19560 | loss 3.375992 (+0.91z)| norm 0.2676 (-0.76z)| lr 1.19e-04 | 322.46 ms | 52.3% bf16 MFU | 1624318 tok/s step 14010/19560 | loss 3.288927 (-0.95z)| norm 0.2705 (-0.58z)| lr 1.19e-04 | 323.10 ms | 52.2% bf16 MFU | 1624236 tok/s step 14011/19560 | loss 3.328336 (-0.11z)| norm 0.2792 (-0.09z)| lr 1.19e-04 | 322.64 ms | 52.3% bf16 MFU | 1624275 tok/s step 14012/19560 | loss 3.260869 (-1.52z)| norm 0.2740 (-0.39z)| lr 1.19e-04 | 322.75 ms | 52.3% bf16 MFU | 1624285 tok/s step 14013/19560 | loss 3.280860 (-1.09z)| norm 0.2527 (-1.59z)| lr 1.19e-04 | 323.10 ms | 52.2% bf16 MFU | 1624205 tok/s step 14014/19560 | loss 3.387874 (+1.16z)| norm 0.2783 (-0.14z)| lr 1.19e-04 | 322.48 ms | 52.3% bf16 MFU | 1624285 tok/s step 14015/19560 | loss 3.343776 (+0.23z)| norm 0.2802 (-0.04z)| lr 1.19e-04 | 323.44 ms | 52.2% bf16 MFU | 1624120 tok/s step 14016/19560 | loss 3.284104 (-1.03z)| norm 0.2528 (-1.61z)| lr 1.19e-04 | 322.47 ms | 52.3% bf16 MFU | 1624208 tok/s step 14017/19560 | loss 3.339962 (+0.15z)| norm 0.2924 (+0.66z)| lr 1.19e-04 | 322.96 ms | 
52.3% bf16 MFU | 1624167 tok/s step 14018/19560 | loss 3.369094 (+0.76z)| norm 0.2699 (-0.63z)| lr 1.19e-04 | 322.80 ms | 52.3% bf16 MFU | 1624169 tok/s step 14019/19560 | loss 3.340400 (+0.16z)| norm 0.3035 (+1.27z)| lr 1.19e-04 | 322.37 ms | 52.4% bf16 MFU | 1624279 tok/s step 14020/19560 | loss 3.393204 (+1.26z)| norm 0.3181 (+2.04z)| lr 1.19e-04 | 323.34 ms | 52.2% bf16 MFU | 1624139 tok/s step 14021/19560 | loss 3.330977 (-0.04z)| norm 0.2994 (+0.98z)| lr 1.19e-04 | 322.87 ms | 52.3% bf16 MFU | 1624123 tok/s step 14022/19560 | loss 3.322808 (-0.20z)| norm 0.2923 (+0.59z)| lr 1.19e-04 | 322.59 ms | 52.3% bf16 MFU | 1624179 tok/s step 14023/19560 | loss 3.374597 (+0.87z)| norm 0.3258 (+2.39z)| lr 1.19e-04 | 322.64 ms | 52.3% bf16 MFU | 1624221 tok/s step 14024/19560 | loss 3.318940 (-0.30z)| norm 0.2802 (-0.12z)| lr 1.19e-04 | 322.56 ms | 52.3% bf16 MFU | 1624280 tok/s step 14025/19560 | loss 3.297426 (-0.75z)| norm 0.2842 (+0.10z)| lr 1.19e-04 | 323.18 ms | 52.2% bf16 MFU | 1624180 tok/s step 14026/19560 | loss 3.315622 (-0.36z)| norm 0.2876 (+0.29z)| lr 1.19e-04 | 322.28 ms | 52.4% bf16 MFU | 1624311 tok/s step 14027/19560 | loss 3.364309 (+0.68z)| norm 0.3281 (+2.46z)| lr 1.19e-04 | 322.89 ms | 52.3% bf16 MFU | 1624282 tok/s step 14028/19560 | loss 3.281745 (-1.10z)| norm 0.2643 (-0.97z)| lr 1.19e-04 | 323.25 ms | 52.2% bf16 MFU | 1624165 tok/s step 14029/19560 | loss 3.326450 (-0.13z)| norm 0.2798 (-0.14z)| lr 1.19e-04 | 322.15 ms | 52.4% bf16 MFU | 1624331 tok/s step 14030/19560 | loss 3.332259 (+0.01z)| norm 0.2817 (-0.03z)| lr 1.19e-04 | 323.39 ms | 52.2% bf16 MFU | 1624177 tok/s step 14031/19560 | loss 3.387872 (+1.21z)| norm 0.2937 (+0.62z)| lr 1.19e-04 | 322.87 ms | 52.3% bf16 MFU | 1624160 tok/s step 14032/19560 | loss 3.344112 (+0.26z)| norm 0.2854 (+0.16z)| lr 1.18e-04 | 323.33 ms | 52.2% bf16 MFU | 1624028 tok/s step 14033/19560 | loss 3.352919 (+0.44z)| norm 0.2668 (-0.84z)| lr 1.18e-04 | 322.69 ms | 52.3% bf16 MFU | 1624064 tok/s step 14034/19560 
| loss 3.292271 (-0.88z)| norm 0.2695 (-0.70z)| lr 1.18e-04 | 322.70 ms | 52.3% bf16 MFU | 1624094 tok/s step 14035/19560 | loss 3.348239 (+0.34z)| norm 0.2690 (-0.73z)| lr 1.18e-04 | 322.34 ms | 52.4% bf16 MFU | 1624215 tok/s step 14036/19560 | loss 3.339957 (+0.16z)| norm 0.2685 (-0.76z)| lr 1.18e-04 | 322.82 ms | 52.3% bf16 MFU | 1624209 tok/s step 14037/19560 | loss 3.289270 (-0.93z)| norm 0.2653 (-0.93z)| lr 1.18e-04 | 321.97 ms | 52.4% bf16 MFU | 1624417 tok/s step 14038/19560 | loss 3.265213 (-1.44z)| norm 0.2893 (+0.41z)| lr 1.18e-04 | 323.31 ms | 52.2% bf16 MFU | 1624276 tok/s step 14039/19560 | loss 3.335123 (+0.07z)| norm 0.2695 (-0.71z)| lr 1.18e-04 | 322.29 ms | 52.4% bf16 MFU | 1624401 tok/s step 14040/19560 | loss 3.353635 (+0.47z)| norm 0.2668 (-0.85z)| lr 1.18e-04 | 322.54 ms | 52.3% bf16 MFU | 1624455 tok/s step 14041/19560 | loss 3.395917 (+1.37z)| norm 0.2636 (-1.03z)| lr 1.18e-04 | 322.48 ms | 52.3% bf16 MFU | 1624522 tok/s step 14042/19560 | loss 3.333601 (+0.01z)| norm 0.2630 (-1.04z)| lr 1.18e-04 | 323.04 ms | 52.2% bf16 MFU | 1624445 tok/s step 14043/19560 | loss 3.346754 (+0.29z)| norm 0.2656 (-0.88z)| lr 1.18e-04 | 322.46 ms | 52.3% bf16 MFU | 1624518 tok/s step 14044/19560 | loss 3.376853 (+0.94z)| norm 0.2604 (-1.15z)| lr 1.18e-04 | 322.47 ms | 52.3% bf16 MFU | 1624586 tok/s step 14045/19560 | loss 3.311500 (-0.50z)| norm 0.2539 (-1.50z)| lr 1.18e-04 | 322.83 ms | 52.3% bf16 MFU | 1624558 tok/s step 14046/19560 | loss 3.354945 (+0.45z)| norm 0.2653 (-0.86z)| lr 1.18e-04 | 322.98 ms | 52.3% bf16 MFU | 1624494 tok/s step 14047/19560 | loss 3.290483 (-0.95z)| norm 0.2621 (-1.02z)| lr 1.18e-04 | 323.02 ms | 52.2% bf16 MFU | 1624424 tok/s step 14048/19560 | loss 3.340961 (+0.16z)| norm 0.2700 (-0.58z)| lr 1.18e-04 | 322.69 ms | 52.3% bf16 MFU | 1624440 tok/s step 14049/19560 | loss 3.342719 (+0.19z)| norm 0.2749 (-0.30z)| lr 1.18e-04 | 322.62 ms | 52.3% bf16 MFU | 1624472 tok/s step 14050/19560 | loss 3.272576 (-1.34z)| norm 0.2599 (-1.12z)| 
lr 1.18e-04 | 323.12 ms | 52.2% bf16 MFU | 1624378 tok/s step 14051/19560 | loss 3.330893 (-0.04z)| norm 0.2720 (-0.44z)| lr 1.18e-04 | 322.60 ms | 52.3% bf16 MFU | 1624420 tok/s step 14052/19560 | loss 3.309248 (-0.52z)| norm 0.2725 (-0.41z)| lr 1.18e-04 | 323.18 ms | 52.2% bf16 MFU | 1624313 tok/s step 14053/19560 | loss 3.363680 (+0.71z)| norm 0.2900 (+0.56z)| lr 1.18e-04 | 322.59 ms | 52.3% bf16 MFU | 1624359 tok/s step 14054/19560 | loss 3.384230 (+1.16z)| norm 0.2812 (+0.06z)| lr 1.18e-04 | 323.13 ms | 52.2% bf16 MFU | 1624266 tok/s step 14055/19560 | loss 3.290983 (-0.92z)| norm 0.2933 (+0.72z)| lr 1.18e-04 | 322.89 ms | 52.3% bf16 MFU | 1624241 tok/s step 14056/19560 | loss 3.328060 (-0.09z)| norm 0.2795 (-0.05z)| lr 1.18e-04 | 323.76 ms | 52.1% bf16 MFU | 1623998 tok/s step 14057/19560 | loss 3.180851 (-3.24z)| norm 0.2740 (-0.36z)| lr 1.17e-04 | 322.99 ms | 52.3% bf16 MFU | 1623958 tok/s step 14058/19560 | loss 3.345722 (+0.31z)| norm 0.2991 (+1.03z)| lr 1.17e-04 | 322.81 ms | 52.3% bf16 MFU | 1623968 tok/s step 14059/19560 | loss 3.352967 (+0.46z)| norm 0.2522 (-1.59z)| lr 1.17e-04 | 323.64 ms | 52.1% bf16 MFU | 1623767 tok/s step 14060/19560 | loss 3.266270 (-1.40z)| norm 0.2956 (+0.82z)| lr 1.17e-04 | 323.06 ms | 52.2% bf16 MFU | 1623722 tok/s step 14061/19560 | loss 3.320467 (-0.22z)| norm 0.2693 (-0.63z)| lr 1.17e-04 | 323.08 ms | 52.2% bf16 MFU | 1623675 tok/s step 14062/19560 | loss 3.364221 (+0.80z)| norm 0.3507 (+3.72z)| lr 1.17e-04 | 322.60 ms | 52.3% bf16 MFU | 1623750 tok/s step 14063/19560 | loss 3.322852 (-0.15z)| norm 0.2821 (+0.06z)| lr 1.17e-04 | 322.72 ms | 52.3% bf16 MFU | 1623791 tok/s step 14064/19560 | loss 3.324468 (-0.13z)| norm 0.2852 (+0.23z)| lr 1.17e-04 | 323.52 ms | 52.2% bf16 MFU | 1623630 tok/s step 14065/19560 | loss 3.301973 (-0.65z)| norm 0.3059 (+1.32z)| lr 1.17e-04 | 323.05 ms | 52.2% bf16 MFU | 1623596 tok/s step 14066/19560 | loss 3.401860 (+1.64z)| norm 0.3021 (+1.10z)| lr 1.17e-04 | 322.86 ms | 52.3% bf16 MFU | 
1623611 tok/s step 14067/19560 | loss 3.320134 (-0.24z)| norm 0.2884 (+0.41z)| lr 1.17e-04 | 323.01 ms | 52.3% bf16 MFU | 1623588 tok/s step 14068/19560 | loss 3.356383 (+0.60z)| norm 0.2714 (-0.51z)| lr 1.17e-04 | 322.82 ms | 52.3% bf16 MFU | 1623613 tok/s step 14069/19560 | loss 3.287099 (-0.98z)| norm 0.3082 (+1.47z)| lr 1.17e-04 | 323.09 ms | 52.2% bf16 MFU | 1623570 tok/s step 14070/19560 | loss 3.308634 (-0.50z)| norm 0.2809 (+0.03z)| lr 1.17e-04 | 323.57 ms | 52.2% bf16 MFU | 1623406 tok/s step 14071/19560 | loss 3.298944 (-0.74z)| norm 0.2876 (+0.43z)| lr 1.17e-04 | 322.98 ms | 52.3% bf16 MFU | 1623400 tok/s step 14072/19560 | loss 3.320139 (-0.18z)| norm 0.2871 (+0.41z)| lr 1.17e-04 | 322.93 ms | 52.3% bf16 MFU | 1623408 tok/s step 14073/19560 | loss 3.273933 (-1.35z)| norm 0.2692 (-0.65z)| lr 1.17e-04 | 323.31 ms | 52.2% bf16 MFU | 1623318 tok/s step 14074/19560 | loss 3.285946 (-1.04z)| norm 0.2973 (+1.07z)| lr 1.17e-04 | 323.67 ms | 52.1% bf16 MFU | 1623143 tok/s step 14075/19560 | loss 3.344711 (+0.46z)| norm 0.2796 (-0.01z)| lr 1.17e-04 | 322.56 ms | 52.3% bf16 MFU | 1623257 tok/s step 14076/19560 | loss 3.336578 (+0.24z)| norm 0.2898 (+0.63z)| lr 1.17e-04 | 323.17 ms | 52.2% bf16 MFU | 1623211 tok/s step 14077/19560 | loss 3.356413 (+0.74z)| norm 0.2923 (+0.77z)| lr 1.17e-04 | 322.91 ms | 52.3% bf16 MFU | 1623232 tok/s step 14078/19560 | loss 3.337855 (+0.26z)| norm 0.2847 (+0.30z)| lr 1.17e-04 | 322.99 ms | 52.3% bf16 MFU | 1623231 tok/s step 14079/19560 | loss 3.340540 (+0.32z)| norm 0.2681 (-0.71z)| lr 1.17e-04 | 323.95 ms | 52.1% bf16 MFU | 1622991 tok/s step 14080/19560 | loss 3.353834 (+0.66z)| norm 0.2956 (+0.96z)| lr 1.17e-04 | 322.98 ms | 52.3% bf16 MFU | 1623006 tok/s step 14081/19560 | loss 3.342983 (+0.37z)| norm 0.3030 (+1.39z)| lr 1.17e-04 | 323.23 ms | 52.2% bf16 MFU | 1622957 tok/s step 14082/19560 | loss 3.272157 (-1.42z)| norm 0.2873 (+0.43z)| lr 1.17e-04 | 322.87 ms | 52.3% bf16 MFU | 1623000 tok/s step 14083/19560 | loss 3.310712 
(-0.45z)| norm 0.2701 (-0.61z)| lr 1.16e-04 | 323.11 ms | 52.2% bf16 MFU | 1622982 tok/s step 14084/19560 | loss 3.273499 (-1.38z)| norm 0.2825 (+0.15z)| lr 1.16e-04 | 323.12 ms | 52.2% bf16 MFU | 1622963 tok/s step 14085/19560 | loss 3.337327 (+0.25z)| norm 0.2826 (+0.15z)| lr 1.16e-04 | 323.30 ms | 52.2% bf16 MFU | 1622898 tok/s step 14086/19560 | loss 3.276414 (-1.29z)| norm 0.2902 (+0.61z)| lr 1.16e-04 | 322.88 ms | 52.3% bf16 MFU | 1622942 tok/s step 14087/19560 | loss 3.294107 (-0.83z)| norm 0.2831 (+0.18z)| lr 1.16e-04 | 322.77 ms | 52.3% bf16 MFU | 1623012 tok/s step 14088/19560 | loss 3.315148 (-0.29z)| norm 0.2769 (-0.20z)| lr 1.16e-04 | 323.22 ms | 52.2% bf16 MFU | 1622965 tok/s step 14089/19560 | loss 3.313587 (-0.33z)| norm 0.2887 (+0.50z)| lr 1.16e-04 | 322.89 ms | 52.3% bf16 MFU | 1623004 tok/s step 14090/19560 | loss 3.379635 (+1.34z)| norm 0.2900 (+0.58z)| lr 1.16e-04 | 323.32 ms | 52.2% bf16 MFU | 1622934 tok/s step 14091/19560 | loss 3.268061 (-1.49z)| norm 0.2828 (+0.13z)| lr 1.16e-04 | 323.34 ms | 52.2% bf16 MFU | 1622860 tok/s step 14092/19560 | loss 3.388343 (+1.53z)| norm 0.2612 (-1.17z)| lr 1.16e-04 | 323.40 ms | 52.2% bf16 MFU | 1622776 tok/s step 14093/19560 | loss 3.372018 (+1.11z)| norm 0.3029 (+1.34z)| lr 1.16e-04 | 322.76 ms | 52.3% bf16 MFU | 1622858 tok/s step 14094/19560 | loss 3.372842 (+1.11z)| norm 0.2758 (-0.30z)| lr 1.16e-04 | 323.10 ms | 52.2% bf16 MFU | 1622848 tok/s step 14095/19560 | loss 3.319450 (-0.23z)| norm 0.2944 (+0.82z)| lr 1.16e-04 | 322.56 ms | 52.3% bf16 MFU | 1622977 tok/s step 14096/19560 | loss 3.327371 (-0.03z)| norm 0.2786 (-0.12z)| lr 1.16e-04 | 322.57 ms | 52.3% bf16 MFU | 1623095 tok/s step 14097/19560 | loss 3.316841 (-0.29z)| norm 0.2877 (+0.45z)| lr 1.16e-04 | 322.71 ms | 52.3% bf16 MFU | 1623171 tok/s step 14098/19560 | loss 3.336947 (+0.21z)| norm 0.2937 (+0.82z)| lr 1.16e-04 | 323.14 ms | 52.2% bf16 MFU | 1623137 tok/s step 14099/19560 | loss 3.304093 (-0.62z)| norm 0.2659 (-0.91z)| lr 1.16e-04 | 
323.37 ms | 52.2% bf16 MFU | 1623047 tok/s step 14100/19560 | loss 3.302311 (-0.67z)| norm 0.2794 (-0.06z)| lr 1.16e-04 | 322.75 ms | 52.3% bf16 MFU | 1623118 tok/s step 14101/19560 | loss 3.344302 (+0.40z)| norm 0.2713 (-0.56z)| lr 1.16e-04 | 322.60 ms | 52.3% bf16 MFU | 1623221 tok/s step 14102/19560 | loss 3.299888 (-0.72z)| norm 0.2997 (+1.22z)| lr 1.16e-04 | 323.11 ms | 52.2% bf16 MFU | 1623192 tok/s step 14103/19560 | loss 3.344870 (+0.42z)| norm 0.2874 (+0.44z)| lr 1.16e-04 | 323.27 ms | 52.2% bf16 MFU | 1623124 tok/s step 14104/19560 | loss 3.288192 (-1.03z)| norm 0.2902 (+0.61z)| lr 1.16e-04 | 323.16 ms | 52.2% bf16 MFU | 1623087 tok/s step 14105/19560 | loss 3.332894 (+0.12z)| norm 0.2809 (+0.02z)| lr 1.16e-04 | 322.37 ms | 52.4% bf16 MFU | 1623249 tok/s step 14106/19560 | loss 3.354282 (+0.71z)| norm 0.2874 (+0.46z)| lr 1.16e-04 | 323.87 ms | 52.1% bf16 MFU | 1623029 tok/s step 14107/19560 | loss 3.350541 (+0.60z)| norm 0.3019 (+1.38z)| lr 1.16e-04 | 323.40 ms | 52.2% bf16 MFU | 1622936 tok/s step 14108/19560 | loss 3.360310 (+0.86z)| norm 0.2694 (-0.69z)| lr 1.15e-04 | 322.73 ms | 52.3% bf16 MFU | 1623016 tok/s step 14109/19560 | loss 3.336608 (+0.24z)| norm 0.2990 (+1.19z)| lr 1.15e-04 | 323.20 ms | 52.2% bf16 MFU | 1622974 tok/s step 14110/19560 | loss 3.308435 (-0.50z)| norm 0.2766 (-0.24z)| lr 1.15e-04 | 323.02 ms | 52.2% bf16 MFU | 1622980 tok/s step 14111/19560 | loss 3.305765 (-0.56z)| norm 0.2615 (-1.18z)| lr 1.15e-04 | 324.08 ms | 52.1% bf16 MFU | 1622720 tok/s step 14112/19560 | loss 3.317224 (-0.24z)| norm 0.2804 (+0.01z)| lr 1.15e-04 | 323.20 ms | 52.2% bf16 MFU | 1622693 tok/s step 14113/19560 | loss 3.386513 (+1.63z)| norm 0.2617 (-1.17z)| lr 1.15e-04 | 322.58 ms | 52.3% bf16 MFU | 1622824 tok/s step 14114/19560 | loss 3.355646 (+0.78z)| norm 0.2636 (-1.04z)| lr 1.15e-04 | 323.42 ms | 52.2% bf16 MFU | 1622736 tok/s step 14115/19560 | loss 3.303669 (-0.63z)| norm 0.2716 (-0.53z)| lr 1.15e-04 | 323.48 ms | 52.2% bf16 MFU | 1622638 tok/s step 
14116/19560 | loss 3.305959 (-0.56z)| norm 0.2527 (-1.69z)| lr 1.15e-04 | 322.78 ms | 52.3% bf16 MFU | 1622720 tok/s step 14117/19560 | loss 3.314254 (-0.33z)| norm 0.3048 (+1.54z)| lr 1.15e-04 | 323.02 ms | 52.2% bf16 MFU | 1622737 tok/s step 14118/19560 | loss 3.239240 (-2.31z)| norm 0.2580 (-1.35z)| lr 1.15e-04 | 322.56 ms | 52.3% bf16 MFU | 1622870 tok/s step 14119/19560 | loss 3.281102 (-1.18z)| norm 0.3269 (+2.79z)| lr 1.15e-04 | 323.04 ms | 52.2% bf16 MFU | 1622876 tok/s step 14120/19560 | loss 3.360022 (+0.93z)| norm 0.2632 (-1.02z)| lr 1.15e-04 | 322.71 ms | 52.3% bf16 MFU | 1622965 tok/s step 14121/19560 | loss 3.307479 (-0.48z)| norm 0.2689 (-0.69z)| lr 1.15e-04 | 322.39 ms | 52.4% bf16 MFU | 1623129 tok/s step 14122/19560 | loss 3.367025 (+1.10z)| norm 0.3433 (+3.56z)| lr 1.15e-04 | 322.77 ms | 52.3% bf16 MFU | 1623189 tok/s step 14123/19560 | loss 3.363962 (+1.00z)| norm 0.2598 (-1.21z)| lr 1.15e-04 | 322.62 ms | 52.3% bf16 MFU | 1623286 tok/s step 14124/19560 | loss 3.341010 (+0.38z)| norm 0.2870 (+0.34z)| lr 1.15e-04 | 323.03 ms | 52.2% bf16 MFU | 1623272 tok/s step 14125/19560 | loss 3.308404 (-0.48z)| norm 0.2906 (+0.53z)| lr 1.15e-04 | 323.05 ms | 52.2% bf16 MFU | 1623256 tok/s step 14126/19560 | loss 3.272192 (-1.45z)| norm 0.2814 (+0.00z)| lr 1.15e-04 | 322.91 ms | 52.3% bf16 MFU | 1623275 tok/s step 14127/19560 | loss 3.341940 (+0.43z)| norm 0.2956 (+0.81z)| lr 1.15e-04 | 322.79 ms | 52.3% bf16 MFU | 1623324 tok/s step 14128/19560 | loss 3.315019 (-0.31z)| norm 0.2807 (-0.06z)| lr 1.15e-04 | 323.32 ms | 52.2% bf16 MFU | 1623236 tok/s step 14129/19560 | loss 3.313834 (-0.34z)| norm 0.2956 (+0.79z)| lr 1.15e-04 | 323.08 ms | 52.2% bf16 MFU | 1623212 tok/s step 14130/19560 | loss 3.328103 (+0.05z)| norm 0.2693 (-0.72z)| lr 1.15e-04 | 322.60 ms | 52.3% bf16 MFU | 1623311 tok/s step 14131/19560 | loss 3.385409 (+1.59z)| norm 0.3143 (+1.83z)| lr 1.15e-04 | 322.74 ms | 52.3% bf16 MFU | 1623370 tok/s step 14132/19560 | loss 3.292066 (-0.94z)| norm 
0.2850 (+0.16z)| lr 1.15e-04 | 323.28 ms | 52.2% bf16 MFU | 1623289 tok/s step 14133/19560 | loss 3.367958 (+1.10z)| norm 0.2750 (-0.40z)| lr 1.14e-04 | 323.21 ms | 52.2% bf16 MFU | 1623231 tok/s step 14134/19560 | loss 3.324466 (-0.08z)| norm 0.3052 (+1.29z)| lr 1.14e-04 | 322.87 ms | 52.3% bf16 MFU | 1623262 tok/s step 14135/19560 | loss 3.334403 (+0.19z)| norm 0.2778 (-0.26z)| lr 1.14e-04 | 323.07 ms | 52.2% bf16 MFU | 1623239 tok/s step 14136/19560 | loss 3.334037 (+0.19z)| norm 0.2934 (+0.62z)| lr 1.14e-04 | 323.59 ms | 52.2% bf16 MFU | 1623087 tok/s step 14137/19560 | loss 3.295674 (-0.85z)| norm 0.2795 (-0.17z)| lr 1.14e-04 | 323.06 ms | 52.2% bf16 MFU | 1623077 tok/s step 14138/19560 | loss 3.337204 (+0.29z)| norm 0.3014 (+1.05z)| lr 1.14e-04 | 322.86 ms | 52.3% bf16 MFU | 1623118 tok/s step 14139/19560 | loss 3.355072 (+0.78z)| norm 0.2959 (+0.73z)| lr 1.14e-04 | 322.86 ms | 52.3% bf16 MFU | 1623156 tok/s step 14140/19560 | loss 3.306514 (-0.58z)| norm 0.2684 (-0.82z)| lr 1.14e-04 | 322.84 ms | 52.3% bf16 MFU | 1623196 tok/s step 14141/19560 | loss 3.382636 (+1.53z)| norm 0.2974 (+0.80z)| lr 1.14e-04 | 322.38 ms | 52.4% bf16 MFU | 1623352 tok/s step 14142/19560 | loss 3.314351 (-0.37z)| norm 0.2726 (-0.61z)| lr 1.14e-04 | 323.04 ms | 52.2% bf16 MFU | 1623333 tok/s step 14143/19560 | loss 3.327447 (+0.00z)| norm 0.3086 (+1.42z)| lr 1.14e-04 | 322.89 ms | 52.3% bf16 MFU | 1623353 tok/s step 14144/19560 | loss 3.367082 (+1.11z)| norm 0.2754 (-0.47z)| lr 1.14e-04 | 322.78 ms | 52.3% bf16 MFU | 1623400 tok/s step 14145/19560 | loss 3.384434 (+1.57z)| norm 0.3131 (+1.66z)| lr 1.14e-04 | 322.52 ms | 52.3% bf16 MFU | 1623510 tok/s step 14146/19560 | loss 3.300690 (-0.77z)| norm 0.3031 (+1.08z)| lr 1.14e-04 | 322.87 ms | 52.3% bf16 MFU | 1623525 tok/s step 14147/19560 | loss 3.311049 (-0.47z)| norm 0.3306 (+2.56z)| lr 1.14e-04 | 322.94 ms | 52.3% bf16 MFU | 1623523 tok/s step 14148/19560 | loss 3.331418 (+0.12z)| norm 0.2873 (+0.19z)| lr 1.14e-04 | 322.58 ms | 
52.3% bf16 MFU | 1623612 tok/s step 14149/19560 | loss 3.377367 (+1.41z)| norm 0.3053 (+1.18z)| lr 1.14e-04 | 322.76 ms | 52.3% bf16 MFU | 1623652 tok/s step 14150/19560 | loss 3.326991 (-0.02z)| norm 0.2769 (-0.39z)| lr 1.14e-04 | 323.26 ms | 52.2% bf16 MFU | 1623564 tok/s step 14151/19560 | loss 3.345023 (+0.50z)| norm 0.3121 (+1.60z)| lr 1.14e-04 | 322.50 ms | 52.3% bf16 MFU | 1623671 tok/s step 14152/19560 | loss 3.367961 (+1.14z)| norm 0.2935 (+0.54z)| lr 1.14e-04 | 322.60 ms | 52.3% bf16 MFU | 1623748 tok/s step 14153/19560 | loss 3.337338 (+0.27z)| norm 0.2800 (-0.22z)| lr 1.14e-04 | 322.80 ms | 52.3% bf16 MFU | 1623769 tok/s step 14154/19560 | loss 3.428993 (+2.76z)| norm 0.3157 (+1.76z)| lr 1.14e-04 | 322.98 ms | 52.3% bf16 MFU | 1623746 tok/s step 14155/19560 | loss 3.349818 (+0.58z)| norm 0.2740 (-0.55z)| lr 1.14e-04 | 322.38 ms | 52.4% bf16 MFU | 1623874 tok/s step 14156/19560 | loss 3.356455 (+0.75z)| norm 0.2965 (+0.72z)| lr 1.14e-04 | 322.92 ms | 52.3% bf16 MFU | 1623860 tok/s step 14157/19560 | loss 3.307134 (-0.61z)| norm 0.2992 (+0.86z)| lr 1.14e-04 | 322.96 ms | 52.3% bf16 MFU | 1623835 tok/s step 14158/19560 | loss 3.306260 (-0.63z)| norm 0.2655 (-1.05z)| lr 1.14e-04 | 322.51 ms | 52.3% bf16 MFU | 1623927 tok/s step 14159/19560 | loss 3.323752 (-0.13z)| norm 0.2961 (+0.69z)| lr 1.13e-04 | 322.75 ms | 52.3% bf16 MFU | 1623954 tok/s step 14160/19560 | loss 3.306221 (-0.61z)| norm 0.2666 (-0.97z)| lr 1.13e-04 | 323.34 ms | 52.2% bf16 MFU | 1623828 tok/s step 14161/19560 | loss 3.292856 (-0.97z)| norm 0.2717 (-0.69z)| lr 1.13e-04 | 322.97 ms | 52.3% bf16 MFU | 1623803 tok/s step 14162/19560 | loss 3.342580 (+0.40z)| norm 0.2720 (-0.68z)| lr 1.13e-04 | 322.71 ms | 52.3% bf16 MFU | 1623845 tok/s step 14163/19560 | loss 3.395538 (+1.85z)| norm 0.2706 (-0.76z)| lr 1.13e-04 | 322.70 ms | 52.3% bf16 MFU | 1623888 tok/s step 14164/19560 | loss 3.359900 (+0.86z)| norm 0.2776 (-0.36z)| lr 1.13e-04 | 323.09 ms | 52.2% bf16 MFU | 1623831 tok/s step 14165/19560 
| loss 3.342155 (+0.36z)| norm 0.2688 (-0.86z)| lr 1.13e-04 | 322.53 ms | 52.3% bf16 MFU | 1623918 tok/s step 14166/19560 | loss 3.332550 (+0.08z)| norm 0.2720 (-0.68z)| lr 1.13e-04 | 322.81 ms | 52.3% bf16 MFU | 1623929 tok/s step 14167/19560 | loss 3.402846 (+2.01z)| norm 0.2945 (+0.60z)| lr 1.13e-04 | 323.05 ms | 52.2% bf16 MFU | 1623878 tok/s step 14168/19560 | loss 3.454607 (+3.27z)| norm 0.2582 (-1.46z)| lr 1.13e-04 | 322.52 ms | 52.3% bf16 MFU | 1623965 tok/s step 14169/19560 | loss 3.304384 (-0.69z)| norm 0.2728 (-0.64z)| lr 1.13e-04 | 322.43 ms | 52.3% bf16 MFU | 1624069 tok/s step 14170/19560 | loss 3.338022 (+0.21z)| norm 0.2716 (-0.72z)| lr 1.13e-04 | 322.99 ms | 52.3% bf16 MFU | 1624027 tok/s step 14171/19560 | loss 3.315893 (-0.38z)| norm 0.2630 (-1.21z)| lr 1.13e-04 | 323.82 ms | 52.1% bf16 MFU | 1623779 tok/s step 14172/19560 | loss 3.255866 (-1.94z)| norm 0.2688 (-0.88z)| lr 1.13e-04 | 322.47 ms | 52.3% bf16 MFU | 1623882 tok/s step 14173/19560 | loss 3.299174 (-0.79z)| norm 0.2814 (-0.17z)| lr 1.13e-04 | 322.57 ms | 52.3% bf16 MFU | 1623954 tok/s step 14174/19560 | loss 3.372974 (+1.16z)| norm 0.2711 (-0.77z)| lr 1.13e-04 | 322.67 ms | 52.3% bf16 MFU | 1623998 tok/s step 14175/19560 | loss 3.326555 (-0.07z)| norm 0.2784 (-0.36z)| lr 1.13e-04 | 323.43 ms | 52.2% bf16 MFU | 1623849 tok/s step 14176/19560 | loss 3.291039 (-1.00z)| norm 0.2792 (-0.32z)| lr 1.13e-04 | 322.46 ms | 52.3% bf16 MFU | 1623952 tok/s step 14177/19560 | loss 3.328450 (-0.01z)| norm 0.2697 (-0.87z)| lr 1.13e-04 | 322.86 ms | 52.3% bf16 MFU | 1623948 tok/s step 14178/19560 | loss 3.410426 (+2.10z)| norm 0.3448 (+3.36z)| lr 1.13e-04 | 322.93 ms | 52.3% bf16 MFU | 1623928 tok/s step 14179/19560 | loss 3.349337 (+0.50z)| norm 0.2860 (+0.04z)| lr 1.13e-04 | 322.83 ms | 52.3% bf16 MFU | 1623934 tok/s step 14180/19560 | loss 3.330376 (+0.00z)| norm 0.3199 (+1.90z)| lr 1.13e-04 | 322.52 ms | 52.3% bf16 MFU | 1624018 tok/s step 14181/19560 | loss 3.442619 (+2.84z)| norm 0.2929 (+0.40z)| 
lr 1.13e-04 | 322.66 ms | 52.3% bf16 MFU | 1624062 tok/s step 14182/19560 | loss 3.316651 (-0.35z)| norm 0.2845 (-0.07z)| lr 1.13e-04 | 323.05 ms | 52.2% bf16 MFU | 1624004 tok/s step 14183/19560 | loss 3.290931 (-1.01z)| norm 0.2629 (-1.26z)| lr 1.13e-04 | 322.92 ms | 52.3% bf16 MFU | 1623984 tok/s step 14184/19560 | loss 3.344337 (+0.35z)| norm 0.2782 (-0.41z)| lr 1.13e-04 | 322.68 ms | 52.3% bf16 MFU | 1624024 tok/s step 14185/19560 | loss 3.289546 (-1.13z)| norm 0.2771 (-0.47z)| lr 1.12e-04 | 322.53 ms | 52.3% bf16 MFU | 1624101 tok/s step 14186/19560 | loss 3.352615 (+0.57z)| norm 0.2964 (+0.61z)| lr 1.12e-04 | 323.27 ms | 52.2% bf16 MFU | 1623986 tok/s step 14187/19560 | loss 3.326744 (-0.12z)| norm 0.2953 (+0.53z)| lr 1.12e-04 | 323.33 ms | 52.2% bf16 MFU | 1623864 tok/s step 14188/19560 | loss 3.358019 (+0.71z)| norm 0.2925 (+0.37z)| lr 1.12e-04 | 323.37 ms | 52.2% bf16 MFU | 1623738 tok/s step 14189/19560 | loss 3.350279 (+0.50z)| norm 0.2801 (-0.33z)| lr 1.12e-04 | 322.71 ms | 52.3% bf16 MFU | 1623783 tok/s step 14190/19560 | loss 3.312021 (-0.54z)| norm 0.2900 (+0.27z)| lr 1.12e-04 | 322.95 ms | 52.3% bf16 MFU | 1623765 tok/s step 14191/19560 | loss 3.293386 (-1.04z)| norm 0.2934 (+0.46z)| lr 1.12e-04 | 323.01 ms | 52.3% bf16 MFU | 1623735 tok/s step 14192/19560 | loss 3.281198 (-1.35z)| norm 0.2894 (+0.23z)| lr 1.12e-04 | 322.85 ms | 52.3% bf16 MFU | 1623745 tok/s step 14193/19560 | loss 3.336324 (+0.13z)| norm 0.2876 (+0.13z)| lr 1.12e-04 | 322.62 ms | 52.3% bf16 MFU | 1623814 tok/s step 14194/19560 | loss 3.323690 (-0.20z)| norm 0.2919 (+0.39z)| lr 1.12e-04 | 322.41 ms | 52.3% bf16 MFU | 1623932 tok/s step 14195/19560 | loss 3.381007 (+1.36z)| norm 0.2876 (+0.14z)| lr 1.12e-04 | 323.09 ms | 52.2% bf16 MFU | 1623872 tok/s step 14196/19560 | loss 3.308372 (-0.62z)| norm 0.2983 (+0.76z)| lr 1.12e-04 | 322.62 ms | 52.3% bf16 MFU | 1623933 tok/s step 14197/19560 | loss 3.296196 (-0.95z)| norm 0.2875 (+0.13z)| lr 1.12e-04 | 323.34 ms | 52.2% bf16 MFU | 
1623812 tok/s step 14198/19560 | loss 3.405409 (+1.99z)| norm 0.2726 (-0.77z)| lr 1.12e-04 | 323.50 ms | 52.2% bf16 MFU | 1623656 tok/s step 14199/19560 | loss 3.342231 (+0.27z)| norm 0.2833 (-0.12z)| lr 1.12e-04 | 322.87 ms | 52.3% bf16 MFU | 1623664 tok/s step 14200/19560 | loss 3.301174 (-0.83z)| norm 0.2803 (-0.30z)| lr 1.12e-04 | 322.44 ms | 52.3% bf16 MFU | 1623781 tok/s step 14201/19560 | loss 3.256469 (-2.02z)| norm 0.2979 (+0.75z)| lr 1.12e-04 | 322.62 ms | 52.3% bf16 MFU | 1623847 tok/s step 14202/19560 | loss 3.360441 (+0.76z)| norm 0.2770 (-0.50z)| lr 1.12e-04 | 322.71 ms | 52.3% bf16 MFU | 1623886 tok/s step 14203/19560 | loss 3.444383 (+2.89z)| norm 0.2869 (+0.09z)| lr 1.12e-04 | 322.84 ms | 52.3% bf16 MFU | 1623892 tok/s step 14204/19560 | loss 3.305210 (-0.72z)| norm 0.2677 (-1.06z)| lr 1.12e-04 | 323.28 ms | 52.2% bf16 MFU | 1623785 tok/s step 14205/19560 | loss 3.310685 (-0.57z)| norm 0.2864 (+0.07z)| lr 1.12e-04 | 323.08 ms | 52.2% bf16 MFU | 1623735 tok/s step 14206/19560 | loss 3.320024 (-0.32z)| norm 0.2826 (-0.15z)| lr 1.12e-04 | 322.25 ms | 52.4% bf16 MFU | 1623896 tok/s step 14207/19560 | loss 3.382220 (+1.28z)| norm 0.2995 (+0.85z)| lr 1.12e-04 | 322.12 ms | 52.4% bf16 MFU | 1624081 tok/s step 14208/19560 | loss 3.322162 (-0.27z)| norm 0.2836 (-0.11z)| lr 1.12e-04 | 323.22 ms | 52.2% bf16 MFU | 1623982 tok/s step 14209/19560 | loss 3.303398 (-0.74z)| norm 0.2703 (-0.89z)| lr 1.12e-04 | 322.63 ms | 52.3% bf16 MFU | 1624034 tok/s step 14210/19560 | loss 3.332067 (-0.01z)| norm 0.2808 (-0.25z)| lr 1.11e-04 | 322.22 ms | 52.4% bf16 MFU | 1624186 tok/s step 14211/19560 | loss 3.258658 (-1.89z)| norm 0.2535 (-1.88z)| lr 1.11e-04 | 322.70 ms | 52.3% bf16 MFU | 1624210 tok/s step 14212/19560 | loss 3.356395 (+0.61z)| norm 0.2726 (-0.73z)| lr 1.11e-04 | 322.94 ms | 52.3% bf16 MFU | 1624173 tok/s step 14213/19560 | loss 3.344914 (+0.31z)| norm 0.2778 (-0.41z)| lr 1.11e-04 | 322.97 ms | 52.3% bf16 MFU | 1624132 tok/s step 14214/19560 | loss 3.293135 
(-1.04z)| norm 0.2797 (-0.30z)| lr 1.11e-04 | 322.33 ms | 52.4% bf16 MFU | 1624252 tok/s step 14215/19560 | loss 3.286480 (-1.21z)| norm 0.2576 (-1.59z)| lr 1.11e-04 | 322.39 ms | 52.3% bf16 MFU | 1624351 tok/s step 14216/19560 | loss 3.343443 (+0.26z)| norm 0.2933 (+0.52z)| lr 1.11e-04 | 322.54 ms | 52.3% bf16 MFU | 1624409 tok/s step 14217/19560 | loss 3.329106 (-0.11z)| norm 0.2767 (-0.46z)| lr 1.11e-04 | 322.52 ms | 52.3% bf16 MFU | 1624470 tok/s step 14218/19560 | loss 3.351773 (+0.49z)| norm 0.2827 (-0.11z)| lr 1.11e-04 | 322.91 ms | 52.3% bf16 MFU | 1624428 tok/s step 14219/19560 | loss 3.277739 (-1.45z)| norm 0.2711 (-0.78z)| lr 1.11e-04 | 322.64 ms | 52.3% bf16 MFU | 1624455 tok/s step 14220/19560 | loss 3.364832 (+0.84z)| norm 0.2947 (+0.60z)| lr 1.11e-04 | 322.21 ms | 52.4% bf16 MFU | 1624592 tok/s step 14221/19560 | loss 3.342345 (+0.25z)| norm 0.3014 (+1.00z)| lr 1.11e-04 | 323.11 ms | 52.2% bf16 MFU | 1624494 tok/s step 14222/19560 | loss 3.356436 (+0.63z)| norm 0.3234 (+2.24z)| lr 1.11e-04 | 322.96 ms | 52.3% bf16 MFU | 1624439 tok/s step 14223/19560 | loss 3.346461 (+0.36z)| norm 0.2931 (+0.48z)| lr 1.11e-04 | 322.78 ms | 52.3% bf16 MFU | 1624430 tok/s step 14224/19560 | loss 3.353383 (+0.54z)| norm 0.2805 (-0.26z)| lr 1.11e-04 | 322.78 ms | 52.3% bf16 MFU | 1624423 tok/s step 14225/19560 | loss 3.301930 (-0.82z)| norm 0.2882 (+0.19z)| lr 1.11e-04 | 322.48 ms | 52.3% bf16 MFU | 1624492 tok/s step 14226/19560 | loss 3.280506 (-1.36z)| norm 0.2762 (-0.51z)| lr 1.11e-04 | 322.65 ms | 52.3% bf16 MFU | 1624514 tok/s step 14227/19560 | loss 3.357666 (+0.65z)| norm 0.3047 (+1.14z)| lr 1.11e-04 | 322.82 ms | 52.3% bf16 MFU | 1624492 tok/s step 14228/19560 | loss 3.386168 (+1.37z)| norm 0.2708 (-0.83z)| lr 1.11e-04 | 322.66 ms | 52.3% bf16 MFU | 1624512 tok/s step 14229/19560 | loss 3.393580 (+1.54z)| norm 0.2976 (+0.72z)| lr 1.11e-04 | 322.86 ms | 52.3% bf16 MFU | 1624482 tok/s step 14230/19560 | loss 3.309910 (-0.63z)| norm 0.2938 (+0.50z)| lr 1.11e-04 | 
322.74 ms | 52.3% bf16 MFU | 1624482 tok/s step 14231/19560 | loss 3.319287 (-0.38z)| norm 0.2757 (-0.55z)| lr 1.11e-04 | 322.78 ms | 52.3% bf16 MFU | 1624473 tok/s step 14232/19560 | loss 3.355098 (+0.54z)| norm 0.2846 (-0.03z)| lr 1.11e-04 | 322.55 ms | 52.3% bf16 MFU | 1624522 tok/s step 14233/19560 | loss 3.255206 (-2.01z)| norm 0.2736 (-0.66z)| lr 1.11e-04 | 322.76 ms | 52.3% bf16 MFU | 1624516 tok/s step 14234/19560 | loss 3.308106 (-0.65z)| norm 0.2719 (-0.76z)| lr 1.11e-04 | 322.51 ms | 52.3% bf16 MFU | 1624573 tok/s step 14235/19560 | loss 3.240275 (-2.31z)| norm 0.2919 (+0.41z)| lr 1.11e-04 | 323.02 ms | 52.2% bf16 MFU | 1624499 tok/s step 14236/19560 | loss 3.361682 (+0.73z)| norm 0.2706 (-0.83z)| lr 1.10e-04 | 323.31 ms | 52.2% bf16 MFU | 1624356 tok/s step 14237/19560 | loss 3.271196 (-1.51z)| norm 0.2990 (+0.83z)| lr 1.10e-04 | 323.08 ms | 52.2% bf16 MFU | 1624278 tok/s step 14238/19560 | loss 3.386790 (+1.34z)| norm 0.2871 (+0.12z)| lr 1.10e-04 | 322.25 ms | 52.4% bf16 MFU | 1624411 tok/s step 14239/19560 | loss 3.305180 (-0.68z)| norm 0.3057 (+1.19z)| lr 1.10e-04 | 322.80 ms | 52.3% bf16 MFU | 1624399 tok/s step 14240/19560 | loss 3.356334 (+0.58z)| norm 0.2750 (-0.60z)| lr 1.10e-04 | 322.82 ms | 52.3% bf16 MFU | 1624384 tok/s step 14241/19560 | loss 3.388227 (+1.36z)| norm 0.3170 (+1.82z)| lr 1.10e-04 | 323.15 ms | 52.2% bf16 MFU | 1624287 tok/s step 14242/19560 | loss 3.294613 (-0.93z)| norm 0.2901 (+0.25z)| lr 1.10e-04 | 322.63 ms | 52.3% bf16 MFU | 1624326 tok/s step 14243/19560 | loss 3.344197 (+0.28z)| norm 0.3068 (+1.20z)| lr 1.10e-04 | 323.44 ms | 52.2% bf16 MFU | 1624159 tok/s step 14244/19560 | loss 3.324678 (-0.20z)| norm 0.2831 (-0.20z)| lr 1.10e-04 | 323.11 ms | 52.2% bf16 MFU | 1624083 tok/s step 14245/19560 | loss 3.372317 (+0.96z)| norm 0.3154 (+1.70z)| lr 1.10e-04 | 322.61 ms | 52.3% bf16 MFU | 1624136 tok/s step 14246/19560 | loss 3.352541 (+0.46z)| norm 0.2887 (+0.12z)| lr 1.10e-04 | 322.61 ms | 52.3% bf16 MFU | 1624186 tok/s step 
14247/19560 | loss 3.285762 (-1.22z)| norm 0.2810 (-0.32z)| lr 1.10e-04 | 323.13 ms | 52.2% bf16 MFU | 1624103 tok/s step 14248/19560 | loss 3.320269 (-0.35z)| norm 0.2871 (+0.03z)| lr 1.10e-04 | 322.61 ms | 52.3% bf16 MFU | 1624154 tok/s step 14249/19560 | loss 3.404170 (+1.73z)| norm 0.3075 (+1.26z)| lr 1.10e-04 | 322.72 ms | 52.3% bf16 MFU | 1624177 tok/s step 14250/19560 | loss 3.339527 (+0.12z)| norm 0.2842 (-0.14z)| lr 1.10e-04 | 322.70 ms | 52.3% bf16 MFU | 1624204 tok/s val loss 3.314049 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating HellaSwag: 70/79 HellaSwag: 3001/10042 = 0.298845 step 14251/19560 | loss 3.360406 (+0.65z)| norm 0.2800 (-0.42z)| lr 1.10e-04 | 322.19 ms | 52.4% bf16 MFU | 1624356 tok/s step 14252/19560 | loss 3.287116 (-1.17z)| norm 0.2644 (-1.42z)| lr 1.10e-04 | 322.96 ms | 52.3% bf16 MFU | 1624307 tok/s step 14253/19560 | loss 3.316465 (-0.44z)| norm 0.3114 (+1.59z)| lr 1.10e-04 | 324.24 ms | 52.1% bf16 MFU | 1623941 tok/s step 14254/19560 | loss 3.340893 (+0.16z)| norm 0.2833 (-0.21z)| lr 1.10e-04 | 323.24 ms | 52.2% bf16 MFU | 1623843 tok/s step 14255/19560 | loss 3.307702 (-0.67z)| norm 0.2750 (-0.73z)| lr 1.10e-04 | 323.50 ms | 52.2% bf16 MFU | 1623684 tok/s step 14256/19560 | loss 3.320925 (-0.34z)| norm 0.3499 (+3.78z)| lr 1.10e-04 | 324.06 ms | 52.1% bf16 MFU | 1623393 tok/s step 14257/19560 | loss 3.312570 (-0.55z)| norm 0.2878 (+0.06z)| lr 1.10e-04 | 323.03 ms | 52.2% bf16 MFU | 1623374 tok/s step 14258/19560 | loss 3.290749 (-1.09z)| norm 0.3034 (+0.98z)| lr 1.10e-04 | 323.10 ms | 52.2% bf16 MFU | 1623338 tok/s step 14259/19560 | loss 3.286013 (-1.19z)| norm 0.3210 (+2.02z)| lr 1.10e-04 | 323.35 ms | 52.2% bf16 MFU | 1623242 tok/s step 14260/19560 | loss 3.392453 (+1.45z)| norm 0.3587 (+3.97z)| lr 1.10e-04 | 323.97 ms | 52.1% bf16 MFU | 1622996 tok/s step 14261/19560 | 
loss 3.327363 (-0.16z)| norm 0.3015 (+0.76z)| lr 1.10e-04 | 323.00 ms | 52.3% bf16 MFU | 1623005 tok/s step 14262/19560 | loss 3.337686 (+0.09z)| norm 0.3066 (+1.04z)| lr 1.09e-04 | 323.34 ms | 52.2% bf16 MFU | 1622930 tok/s step 14263/19560 | loss 3.314694 (-0.48z)| norm 0.3043 (+0.90z)| lr 1.09e-04 | 322.92 ms | 52.3% bf16 MFU | 1622961 tok/s step 14264/19560 | loss 3.286447 (-1.17z)| norm 0.2958 (+0.42z)| lr 1.09e-04 | 322.83 ms | 52.3% bf16 MFU | 1623014 tok/s step 14265/19560 | loss 3.316322 (-0.43z)| norm 0.2918 (+0.19z)| lr 1.09e-04 | 322.96 ms | 52.3% bf16 MFU | 1623034 tok/s step 14266/19560 | loss 3.290993 (-1.05z)| norm 0.3148 (+1.47z)| lr 1.09e-04 | 322.95 ms | 52.3% bf16 MFU | 1623054 tok/s step 14267/19560 | loss 3.358653 (+0.63z)| norm 0.2730 (-0.85z)| lr 1.09e-04 | 322.85 ms | 52.3% bf16 MFU | 1623098 tok/s step 14268/19560 | loss 3.280444 (-1.30z)| norm 0.2987 (+0.57z)| lr 1.09e-04 | 322.69 ms | 52.3% bf16 MFU | 1623181 tok/s step 14269/19560 | loss 3.300551 (-0.79z)| norm 0.2901 (+0.09z)| lr 1.09e-04 | 323.73 ms | 52.1% bf16 MFU | 1622999 tok/s step 14270/19560 | loss 3.357022 (+0.60z)| norm 0.3058 (+0.96z)| lr 1.09e-04 | 323.17 ms | 52.2% bf16 MFU | 1622965 tok/s step 14271/19560 | loss 3.325712 (-0.17z)| norm 0.2845 (-0.22z)| lr 1.09e-04 | 322.67 ms | 52.3% bf16 MFU | 1623058 tok/s step 14272/19560 | loss 3.300676 (-0.78z)| norm 0.3230 (+1.88z)| lr 1.09e-04 | 322.82 ms | 52.3% bf16 MFU | 1623110 tok/s step 14273/19560 | loss 3.284376 (-1.17z)| norm 0.2724 (-0.89z)| lr 1.09e-04 | 323.06 ms | 52.2% bf16 MFU | 1623098 tok/s step 14274/19560 | loss 3.307286 (-0.60z)| norm 0.2917 (+0.18z)| lr 1.09e-04 | 322.96 ms | 52.3% bf16 MFU | 1623111 tok/s step 14275/19560 | loss 3.335915 (+0.10z)| norm 0.3066 (+1.04z)| lr 1.09e-04 | 323.13 ms | 52.2% bf16 MFU | 1623083 tok/s step 14276/19560 | loss 3.261876 (-1.70z)| norm 0.2690 (-1.07z)| lr 1.09e-04 | 323.33 ms | 52.2% bf16 MFU | 1623006 tok/s step 14277/19560 | loss 3.286573 (-1.08z)| norm 0.2960 (+0.45z)| 
lr 1.09e-04 | 323.10 ms | 52.2% bf16 MFU | 1622990 tok/s step 14278/19560 | loss 3.376809 (+1.12z)| norm 0.2709 (-0.96z)| lr 1.09e-04 | 323.22 ms | 52.2% bf16 MFU | 1622946 tok/s step 14279/19560 | loss 3.321535 (-0.22z)| norm 0.2794 (-0.47z)| lr 1.09e-04 | 323.78 ms | 52.1% bf16 MFU | 1622762 tok/s step 14280/19560 | loss 3.224762 (-2.50z)| norm 0.2860 (-0.09z)| lr 1.09e-04 | 322.53 ms | 52.3% bf16 MFU | 1622900 tok/s step 14281/19560 | loss 3.380488 (+1.20z)| norm 0.2657 (-1.23z)| lr 1.09e-04 | 323.57 ms | 52.2% bf16 MFU | 1622771 tok/s step 14282/19560 | loss 3.302263 (-0.65z)| norm 0.2693 (-1.02z)| lr 1.09e-04 | 323.18 ms | 52.2% bf16 MFU | 1622747 tok/s step 14283/19560 | loss 3.314754 (-0.34z)| norm 0.2863 (-0.06z)| lr 1.09e-04 | 323.24 ms | 52.2% bf16 MFU | 1622708 tok/s step 14284/19560 | loss 3.350181 (+0.52z)| norm 0.2725 (-0.83z)| lr 1.09e-04 | 323.20 ms | 52.2% bf16 MFU | 1622682 tok/s step 14285/19560 | loss 3.273557 (-1.32z)| norm 0.2811 (-0.34z)| lr 1.09e-04 | 323.12 ms | 52.2% bf16 MFU | 1622678 tok/s step 14286/19560 | loss 3.358582 (+0.72z)| norm 0.2730 (-0.80z)| lr 1.09e-04 | 323.84 ms | 52.1% bf16 MFU | 1622493 tok/s step 14287/19560 | loss 3.368066 (+0.94z)| norm 0.2790 (-0.45z)| lr 1.09e-04 | 323.00 ms | 52.3% bf16 MFU | 1622527 tok/s step 14288/19560 | loss 3.355075 (+0.62z)| norm 0.3037 (+0.94z)| lr 1.08e-04 | 323.22 ms | 52.2% bf16 MFU | 1622505 tok/s step 14289/19560 | loss 3.277064 (-1.25z)| norm 0.2784 (-0.50z)| lr 1.08e-04 | 324.63 ms | 52.0% bf16 MFU | 1622130 tok/s step 14290/19560 | loss 3.352287 (+0.55z)| norm 0.2986 (+0.64z)| lr 1.08e-04 | 323.09 ms | 52.2% bf16 MFU | 1622161 tok/s step 14291/19560 | loss 3.261124 (-1.61z)| norm 0.2750 (-0.72z)| lr 1.08e-04 | 322.96 ms | 52.3% bf16 MFU | 1622221 tok/s step 14292/19560 | loss 3.347900 (+0.47z)| norm 0.2815 (-0.35z)| lr 1.08e-04 | 323.36 ms | 52.2% bf16 MFU | 1622178 tok/s step 14293/19560 | loss 3.314220 (-0.33z)| norm 0.2938 (+0.35z)| lr 1.08e-04 | 322.99 ms | 52.3% bf16 MFU | 
1622229 tok/s step 14294/19560 | loss 3.267564 (-1.43z)| norm 0.3009 (+0.75z)| lr 1.08e-04 | 322.73 ms | 52.3% bf16 MFU | 1622346 tok/s step 14295/19560 | loss 3.327742 (+0.02z)| norm 0.2809 (-0.40z)| lr 1.08e-04 | 322.66 ms | 52.3% bf16 MFU | 1622474 tok/s step 14296/19560 | loss 3.247298 (-1.92z)| norm 0.2898 (+0.10z)| lr 1.08e-04 | 322.66 ms | 52.3% bf16 MFU | 1622594 tok/s step 14297/19560 | loss 3.367953 (+1.04z)| norm 0.2967 (+0.50z)| lr 1.08e-04 | 322.94 ms | 52.3% bf16 MFU | 1622638 tok/s step 14298/19560 | loss 3.374215 (+1.18z)| norm 0.2828 (-0.33z)| lr 1.08e-04 | 322.55 ms | 52.3% bf16 MFU | 1622778 tok/s step 14299/19560 | loss 3.287570 (-0.93z)| norm 0.2930 (+0.27z)| lr 1.08e-04 | 323.22 ms | 52.2% bf16 MFU | 1622743 tok/s step 14300/19560 | loss 3.392877 (+1.61z)| norm 0.2780 (-0.63z)| lr 1.08e-04 | 322.99 ms | 52.3% bf16 MFU | 1622767 tok/s step 14301/19560 | loss 3.321683 (-0.13z)| norm 0.3131 (+1.43z)| lr 1.08e-04 | 323.47 ms | 52.2% bf16 MFU | 1622670 tok/s step 14302/19560 | loss 3.288426 (-0.93z)| norm 0.2949 (+0.34z)| lr 1.08e-04 | 323.52 ms | 52.2% bf16 MFU | 1622565 tok/s step 14303/19560 | loss 3.345513 (+0.46z)| norm 0.2796 (-0.56z)| lr 1.08e-04 | 322.94 ms | 52.3% bf16 MFU | 1622610 tok/s step 14304/19560 | loss 3.322509 (-0.11z)| norm 0.3092 (+1.18z)| lr 1.08e-04 | 323.25 ms | 52.2% bf16 MFU | 1622576 tok/s step 14305/19560 | loss 3.294992 (-0.77z)| norm 0.3235 (+1.98z)| lr 1.08e-04 | 323.15 ms | 52.2% bf16 MFU | 1622570 tok/s step 14306/19560 | loss 3.348263 (+0.55z)| norm 0.2846 (-0.28z)| lr 1.08e-04 | 323.14 ms | 52.2% bf16 MFU | 1622566 tok/s step 14307/19560 | loss 3.309325 (-0.41z)| norm 0.3003 (+0.67z)| lr 1.08e-04 | 322.55 ms | 52.3% bf16 MFU | 1622710 tok/s step 14308/19560 | loss 3.327873 (+0.05z)| norm 0.3102 (+1.28z)| lr 1.08e-04 | 323.79 ms | 52.1% bf16 MFU | 1622535 tok/s step 14309/19560 | loss 3.302672 (-0.57z)| norm 0.2883 (-0.06z)| lr 1.08e-04 | 322.99 ms | 52.3% bf16 MFU | 1622571 tok/s step 14310/19560 | loss 3.304492 
(-0.52z)| norm 0.3210 (+1.90z)| lr 1.08e-04 | 322.63 ms | 52.3% bf16 MFU | 1622696 tok/s step 14311/19560 | loss 3.306229 (-0.48z)| norm 0.2954 (+0.34z)| lr 1.08e-04 | 322.82 ms | 52.3% bf16 MFU | 1622765 tok/s step 14312/19560 | loss 3.320224 (-0.11z)| norm 0.3024 (+0.76z)| lr 1.08e-04 | 322.68 ms | 52.3% bf16 MFU | 1622866 tok/s step 14313/19560 | loss 3.268016 (-1.45z)| norm 0.3121 (+1.33z)| lr 1.08e-04 | 323.41 ms | 52.2% bf16 MFU | 1622779 tok/s step 14314/19560 | loss 3.356726 (+0.83z)| norm 0.3094 (+1.15z)| lr 1.07e-04 | 322.90 ms | 52.3% bf16 MFU | 1622824 tok/s step 14315/19560 | loss 3.299897 (-0.62z)| norm 0.2965 (+0.37z)| lr 1.07e-04 | 323.00 ms | 52.3% bf16 MFU | 1622842 tok/s step 14316/19560 | loss 3.316656 (-0.18z)| norm 0.2827 (-0.46z)| lr 1.07e-04 | 324.91 ms | 51.9% bf16 MFU | 1622382 tok/s step 14317/19560 | loss 3.274950 (-1.24z)| norm 0.3096 (+1.15z)| lr 1.07e-04 | 322.62 ms | 52.3% bf16 MFU | 1622517 tok/s step 14318/19560 | loss 3.345652 (+0.56z)| norm 0.2785 (-0.72z)| lr 1.07e-04 | 322.71 ms | 52.3% bf16 MFU | 1622624 tok/s step 14319/19560 | loss 3.305883 (-0.45z)| norm 0.2922 (+0.10z)| lr 1.07e-04 | 322.52 ms | 52.3% bf16 MFU | 1622774 tok/s step 14320/19560 | loss 3.296391 (-0.70z)| norm 0.2683 (-1.31z)| lr 1.07e-04 | 322.19 ms | 52.4% bf16 MFU | 1622998 tok/s step 14321/19560 | loss 3.343911 (+0.52z)| norm 0.2725 (-1.05z)| lr 1.07e-04 | 322.95 ms | 52.3% bf16 MFU | 1623020 tok/s step 14322/19560 | loss 3.355703 (+0.81z)| norm 0.2794 (-0.63z)| lr 1.07e-04 | 322.78 ms | 52.3% bf16 MFU | 1623084 tok/s step 14323/19560 | loss 3.358456 (+0.89z)| norm 0.2749 (-0.89z)| lr 1.07e-04 | 322.44 ms | 52.3% bf16 MFU | 1623230 tok/s step 14324/19560 | loss 3.316635 (-0.19z)| norm 0.2889 (-0.06z)| lr 1.07e-04 | 322.77 ms | 52.3% bf16 MFU | 1623286 tok/s step 14325/19560 | loss 3.317874 (-0.16z)| norm 0.2632 (-1.56z)| lr 1.07e-04 | 322.94 ms | 52.3% bf16 MFU | 1623296 tok/s step 14326/19560 | loss 3.331740 (+0.21z)| norm 0.2925 (+0.16z)| lr 1.07e-04 | 
323.00 ms | 52.3% bf16 MFU | 1623290 tok/s step 14327/19560 | loss 3.388166 (+1.67z)| norm 0.2745 (-0.90z)| lr 1.07e-04 | 322.84 ms | 52.3% bf16 MFU | 1623325 tok/s step 14328/19560 | loss 3.266845 (-1.46z)| norm 0.3071 (+1.00z)| lr 1.07e-04 | 323.30 ms | 52.2% bf16 MFU | 1623243 tok/s step 14329/19560 | loss 3.304924 (-0.50z)| norm 0.2883 (-0.10z)| lr 1.07e-04 | 323.37 ms | 52.2% bf16 MFU | 1623147 tok/s step 14330/19560 | loss 3.340621 (+0.44z)| norm 0.2774 (-0.74z)| lr 1.07e-04 | 322.55 ms | 52.3% bf16 MFU | 1623263 tok/s step 14331/19560 | loss 3.344281 (+0.58z)| norm 0.2818 (-0.47z)| lr 1.07e-04 | 322.77 ms | 52.3% bf16 MFU | 1623316 tok/s step 14332/19560 | loss 3.356808 (+0.90z)| norm 0.2760 (-0.82z)| lr 1.07e-04 | 322.91 ms | 52.3% bf16 MFU | 1623331 tok/s step 14333/19560 | loss 3.293360 (-0.81z)| norm 0.2544 (-2.05z)| lr 1.07e-04 | 323.02 ms | 52.2% bf16 MFU | 1623319 tok/s step 14334/19560 | loss 3.370165 (+1.25z)| norm 0.2775 (-0.70z)| lr 1.07e-04 | 322.80 ms | 52.3% bf16 MFU | 1623364 tok/s step 14335/19560 | loss 3.351778 (+0.77z)| norm 0.2848 (-0.28z)| lr 1.07e-04 | 322.74 ms | 52.3% bf16 MFU | 1623419 tok/s step 14336/19560 | loss 3.430868 (+2.80z)| norm 0.2967 (+0.41z)| lr 1.07e-04 | 323.01 ms | 52.2% bf16 MFU | 1623405 tok/s step 14337/19560 | loss 3.334655 (+0.26z)| norm 0.2843 (-0.32z)| lr 1.07e-04 | 322.63 ms | 52.3% bf16 MFU | 1623487 tok/s step 14338/19560 | loss 3.362367 (+0.98z)| norm 0.2906 (+0.05z)| lr 1.07e-04 | 322.90 ms | 52.3% bf16 MFU | 1623496 tok/s step 14339/19560 | loss 3.335188 (+0.26z)| norm 0.2918 (+0.10z)| lr 1.07e-04 | 322.58 ms | 52.3% bf16 MFU | 1623586 tok/s step 14340/19560 | loss 3.320690 (-0.12z)| norm 0.2913 (+0.06z)| lr 1.06e-04 | 322.81 ms | 52.3% bf16 MFU | 1623613 tok/s step 14341/19560 | loss 3.235624 (-2.31z)| norm 0.2902 (-0.01z)| lr 1.06e-04 | 322.89 ms | 52.3% bf16 MFU | 1623620 tok/s step 14342/19560 | loss 3.264931 (-1.53z)| norm 0.2671 (-1.37z)| lr 1.06e-04 | 322.84 ms | 52.3% bf16 MFU | 1623639 tok/s step 
14343/19560 | loss 3.287728 (-0.94z)| norm 0.2873 (-0.19z)| lr 1.06e-04 | 322.73 ms | 52.3% bf16 MFU | 1623684 tok/s step 14344/19560 | loss 3.351882 (+0.72z)| norm 0.2674 (-1.37z)| lr 1.06e-04 | 322.49 ms | 52.3% bf16 MFU | 1623788 tok/s step 14345/19560 | loss 3.278441 (-1.16z)| norm 0.2625 (-1.64z)| lr 1.06e-04 | 322.78 ms | 52.3% bf16 MFU | 1623814 tok/s step 14346/19560 | loss 3.266927 (-1.43z)| norm 0.2696 (-1.21z)| lr 1.06e-04 | 322.74 ms | 52.3% bf16 MFU | 1623846 tok/s step 14347/19560 | loss 3.308524 (-0.38z)| norm 0.2765 (-0.81z)| lr 1.06e-04 | 322.80 ms | 52.3% bf16 MFU | 1623863 tok/s step 14348/19560 | loss 3.319776 (-0.08z)| norm 0.2892 (-0.05z)| lr 1.06e-04 | 323.37 ms | 52.2% bf16 MFU | 1623735 tok/s step 14349/19560 | loss 3.284618 (-0.97z)| norm 0.2761 (-0.81z)| lr 1.06e-04 | 322.47 ms | 52.3% bf16 MFU | 1623840 tok/s step 14350/19560 | loss 3.289169 (-0.84z)| norm 0.2738 (-0.94z)| lr 1.06e-04 | 322.66 ms | 52.3% bf16 MFU | 1623894 tok/s step 14351/19560 | loss 3.316802 (-0.13z)| norm 0.2973 (+0.46z)| lr 1.06e-04 | 322.96 ms | 52.3% bf16 MFU | 1623868 tok/s step 14352/19560 | loss 3.344528 (+0.59z)| norm 0.2757 (-0.83z)| lr 1.06e-04 | 323.13 ms | 52.2% bf16 MFU | 1623800 tok/s step 14353/19560 | loss 3.274940 (-1.19z)| norm 0.2733 (-0.96z)| lr 1.06e-04 | 323.27 ms | 52.2% bf16 MFU | 1623702 tok/s step 14354/19560 | loss 3.318653 (-0.08z)| norm 0.2689 (-1.21z)| lr 1.06e-04 | 322.26 ms | 52.4% bf16 MFU | 1623862 tok/s step 14355/19560 | loss 3.371246 (+1.27z)| norm 0.2638 (-1.48z)| lr 1.06e-04 | 322.57 ms | 52.3% bf16 MFU | 1623936 tok/s step 14356/19560 | loss 3.353524 (+0.83z)| norm 0.2872 (-0.11z)| lr 1.06e-04 | 322.93 ms | 52.3% bf16 MFU | 1623916 tok/s step 14357/19560 | loss 3.262573 (-1.51z)| norm 0.2502 (-2.24z)| lr 1.06e-04 | 323.22 ms | 52.2% bf16 MFU | 1623824 tok/s step 14358/19560 | loss 3.292198 (-0.73z)| norm 0.2962 (+0.43z)| lr 1.06e-04 | 322.76 ms | 52.3% bf16 MFU | 1623852 tok/s step 14359/19560 | loss 3.293933 (-0.68z)| norm 
0.2625 (-1.51z)| lr 1.06e-04 | 322.42 ms | 52.3% bf16 MFU | 1623964 tok/s step 14360/19560 | loss 3.330583 (+0.27z)| norm 0.3005 (+0.67z)| lr 1.06e-04 | 323.44 ms | 52.2% bf16 MFU | 1623815 tok/s step 14361/19560 | loss 3.272102 (-1.26z)| norm 0.2843 (-0.26z)| lr 1.06e-04 | 322.81 ms | 52.3% bf16 MFU | 1623831 tok/s step 14362/19560 | loss 3.292515 (-0.72z)| norm 0.2834 (-0.32z)| lr 1.06e-04 | 322.47 ms | 52.3% bf16 MFU | 1623931 tok/s step 14363/19560 | loss 3.284548 (-0.95z)| norm 0.2799 (-0.52z)| lr 1.06e-04 | 322.99 ms | 52.3% bf16 MFU | 1623895 tok/s step 14364/19560 | loss 3.356411 (+0.95z)| norm 0.2855 (-0.21z)| lr 1.06e-04 | 322.66 ms | 52.3% bf16 MFU | 1623946 tok/s step 14365/19560 | loss 3.326279 (+0.15z)| norm 0.2979 (+0.52z)| lr 1.06e-04 | 322.93 ms | 52.3% bf16 MFU | 1623926 tok/s step 14366/19560 | loss 3.346641 (+0.70z)| norm 0.2953 (+0.36z)| lr 1.05e-04 | 322.38 ms | 52.4% bf16 MFU | 1624045 tok/s step 14367/19560 | loss 3.317018 (-0.10z)| norm 0.2749 (-0.81z)| lr 1.05e-04 | 322.64 ms | 52.3% bf16 MFU | 1624091 tok/s step 14368/19560 | loss 3.349662 (+0.79z)| norm 0.2807 (-0.47z)| lr 1.05e-04 | 322.95 ms | 52.3% bf16 MFU | 1624059 tok/s step 14369/19560 | loss 3.315687 (-0.12z)| norm 0.2849 (-0.22z)| lr 1.05e-04 | 322.72 ms | 52.3% bf16 MFU | 1624085 tok/s step 14370/19560 | loss 3.260443 (-1.61z)| norm 0.2657 (-1.33z)| lr 1.05e-04 | 322.78 ms | 52.3% bf16 MFU | 1624096 tok/s step 14371/19560 | loss 3.290702 (-0.78z)| norm 0.3065 (+1.06z)| lr 1.05e-04 | 322.77 ms | 52.3% bf16 MFU | 1624109 tok/s step 14372/19560 | loss 3.359242 (+1.07z)| norm 0.2837 (-0.28z)| lr 1.05e-04 | 322.56 ms | 52.3% bf16 MFU | 1624173 tok/s step 14373/19560 | loss 3.285109 (-0.92z)| norm 0.2867 (-0.09z)| lr 1.05e-04 | 322.88 ms | 52.3% bf16 MFU | 1624153 tok/s step 14374/19560 | loss 3.372221 (+1.43z)| norm 0.2783 (-0.58z)| lr 1.05e-04 | 322.70 ms | 52.3% bf16 MFU | 1624181 tok/s step 14375/19560 | loss 3.417495 (+2.57z)| norm 0.2923 (+0.24z)| lr 1.05e-04 | 322.52 ms | 
52.3% bf16 MFU | 1624251 tok/s step 14376/19560 | loss 3.267714 (-1.36z)| norm 0.3319 (+2.50z)| lr 1.05e-04 | 322.48 ms | 52.3% bf16 MFU | 1624328 tok/s step 14377/19560 | loss 3.311211 (-0.21z)| norm 0.2680 (-1.16z)| lr 1.05e-04 | 323.15 ms | 52.2% bf16 MFU | 1624233 tok/s step 14378/19560 | loss 3.361430 (+1.12z)| norm 0.3198 (+1.78z)| lr 1.05e-04 | 323.50 ms | 52.2% bf16 MFU | 1624056 tok/s step 14379/19560 | loss 3.355523 (+0.97z)| norm 0.3079 (+1.09z)| lr 1.05e-04 | 322.75 ms | 52.3% bf16 MFU | 1624076 tok/s step 14380/19560 | loss 3.333411 (+0.37z)| norm 0.2887 (-0.01z)| lr 1.05e-04 | 322.81 ms | 52.3% bf16 MFU | 1624079 tok/s step 14381/19560 | loss 3.314545 (-0.13z)| norm 0.2963 (+0.43z)| lr 1.05e-04 | 322.57 ms | 52.3% bf16 MFU | 1624143 tok/s step 14382/19560 | loss 3.343260 (+0.63z)| norm 0.3178 (+1.63z)| lr 1.05e-04 | 322.26 ms | 52.4% bf16 MFU | 1624281 tok/s step 14383/19560 | loss 3.323649 (+0.11z)| norm 0.3071 (+1.01z)| lr 1.05e-04 | 322.73 ms | 52.3% bf16 MFU | 1624294 tok/s step 14384/19560 | loss 3.320574 (+0.03z)| norm 0.2774 (-0.68z)| lr 1.05e-04 | 323.13 ms | 52.2% bf16 MFU | 1624205 tok/s step 14385/19560 | loss 3.323982 (+0.12z)| norm 0.2988 (+0.59z)| lr 1.05e-04 | 322.88 ms | 52.3% bf16 MFU | 1624184 tok/s step 14386/19560 | loss 3.317551 (-0.06z)| norm 0.2586 (-1.76z)| lr 1.05e-04 | 322.82 ms | 52.3% bf16 MFU | 1624179 tok/s step 14387/19560 | loss 3.326583 (+0.17z)| norm 0.2867 (-0.09z)| lr 1.05e-04 | 323.19 ms | 52.2% bf16 MFU | 1624081 tok/s step 14388/19560 | loss 3.311146 (-0.23z)| norm 0.2861 (-0.10z)| lr 1.05e-04 | 323.09 ms | 52.2% bf16 MFU | 1624014 tok/s step 14389/19560 | loss 3.332850 (+0.36z)| norm 0.2784 (-0.59z)| lr 1.05e-04 | 321.95 ms | 52.4% bf16 MFU | 1624238 tok/s step 14390/19560 | loss 3.309859 (-0.26z)| norm 0.2859 (-0.10z)| lr 1.05e-04 | 323.55 ms | 52.2% bf16 MFU | 1624047 tok/s step 14391/19560 | loss 3.269767 (-1.33z)| norm 0.2743 (-0.83z)| lr 1.05e-04 | 323.08 ms | 52.2% bf16 MFU | 1623985 tok/s step 14392/19560 
| loss 3.323276 (+0.11z)| norm 0.2849 (-0.14z)| lr 1.05e-04 | 322.83 ms | 52.3% bf16 MFU | 1623989 tok/s step 14393/19560 | loss 3.308395 (-0.29z)| norm 0.3038 (+1.08z)| lr 1.04e-04 | 322.72 ms | 52.3% bf16 MFU | 1624019 tok/s step 14394/19560 | loss 3.281097 (-1.03z)| norm 0.2612 (-1.65z)| lr 1.04e-04 | 323.28 ms | 52.2% bf16 MFU | 1623907 tok/s step 14395/19560 | loss 3.264449 (-1.45z)| norm 0.2800 (-0.44z)| lr 1.04e-04 | 322.89 ms | 52.3% bf16 MFU | 1623900 tok/s step 14396/19560 | loss 3.373941 (+1.47z)| norm 0.2872 (+0.03z)| lr 1.04e-04 | 322.38 ms | 52.4% bf16 MFU | 1624020 tok/s step 14397/19560 | loss 3.346158 (+0.71z)| norm 0.2882 (+0.10z)| lr 1.04e-04 | 322.53 ms | 52.3% bf16 MFU | 1624096 tok/s step 14398/19560 | loss 3.339394 (+0.54z)| norm 0.2778 (-0.56z)| lr 1.04e-04 | 322.97 ms | 52.3% bf16 MFU | 1624058 tok/s step 14399/19560 | loss 3.300462 (-0.50z)| norm 0.2672 (-1.24z)| lr 1.04e-04 | 322.24 ms | 52.4% bf16 MFU | 1624205 tok/s step 14400/19560 | loss 3.356399 (+0.98z)| norm 0.2858 (-0.02z)| lr 1.04e-04 | 323.21 ms | 52.2% bf16 MFU | 1624101 tok/s step 14401/19560 | loss 3.288258 (-0.84z)| norm 0.2757 (-0.69z)| lr 1.04e-04 | 322.48 ms | 52.3% bf16 MFU | 1624186 tok/s step 14402/19560 | loss 3.356069 (+0.96z)| norm 0.2677 (-1.20z)| lr 1.04e-04 | 322.86 ms | 52.3% bf16 MFU | 1624171 tok/s step 14403/19560 | loss 3.305538 (-0.38z)| norm 0.2788 (-0.45z)| lr 1.04e-04 | 323.14 ms | 52.2% bf16 MFU | 1624087 tok/s step 14404/19560 | loss 3.337225 (+0.45z)| norm 0.2731 (-0.84z)| lr 1.04e-04 | 322.14 ms | 52.4% bf16 MFU | 1624259 tok/s step 14405/19560 | loss 3.300649 (-0.54z)| norm 0.2834 (-0.14z)| lr 1.04e-04 | 323.29 ms | 52.2% bf16 MFU | 1624132 tok/s step 14406/19560 | loss 3.331198 (+0.30z)| norm 0.3005 (+0.98z)| lr 1.04e-04 | 322.77 ms | 52.3% bf16 MFU | 1624142 tok/s step 14407/19560 | loss 3.275123 (-1.21z)| norm 0.2851 (-0.05z)| lr 1.04e-04 | 322.52 ms | 52.3% bf16 MFU | 1624215 tok/s step 14408/19560 | loss 3.325336 (+0.13z)| norm 0.2811 (-0.31z)| 
lr 1.04e-04 | 322.70 ms | 52.3% bf16 MFU | 1624239 tok/s step 14409/19560 | loss 3.289037 (-0.86z)| norm 0.2718 (-0.94z)| lr 1.04e-04 | 323.19 ms | 52.2% bf16 MFU | 1624138 tok/s step 14410/19560 | loss 3.333980 (+0.39z)| norm 0.2722 (-0.92z)| lr 1.04e-04 | 323.28 ms | 52.2% bf16 MFU | 1624019 tok/s step 14411/19560 | loss 3.264474 (-1.53z)| norm 0.2788 (-0.47z)| lr 1.04e-04 | 322.80 ms | 52.3% bf16 MFU | 1624026 tok/s step 14412/19560 | loss 3.307504 (-0.33z)| norm 0.2702 (-1.05z)| lr 1.04e-04 | 322.13 ms | 52.4% bf16 MFU | 1624204 tok/s step 14413/19560 | loss 3.360250 (+1.12z)| norm 0.3041 (+1.21z)| lr 1.04e-04 | 322.89 ms | 52.3% bf16 MFU | 1624179 tok/s step 14414/19560 | loss 3.348565 (+0.80z)| norm 0.2758 (-0.68z)| lr 1.04e-04 | 322.48 ms | 52.3% bf16 MFU | 1624260 tok/s step 14415/19560 | loss 3.280521 (-1.09z)| norm 0.2794 (-0.44z)| lr 1.04e-04 | 322.76 ms | 52.3% bf16 MFU | 1624266 tok/s step 14416/19560 | loss 3.339507 (+0.57z)| norm 0.2750 (-0.72z)| lr 1.04e-04 | 322.97 ms | 52.3% bf16 MFU | 1624219 tok/s step 14417/19560 | loss 3.264318 (-1.53z)| norm 0.2805 (-0.35z)| lr 1.04e-04 | 322.03 ms | 52.4% bf16 MFU | 1624411 tok/s step 14418/19560 | loss 3.277450 (-1.15z)| norm 0.2898 (+0.28z)| lr 1.04e-04 | 322.50 ms | 52.3% bf16 MFU | 1624475 tok/s step 14419/19560 | loss 3.293281 (-0.72z)| norm 0.2756 (-0.68z)| lr 1.03e-04 | 322.89 ms | 52.3% bf16 MFU | 1624439 tok/s step 14420/19560 | loss 3.399044 (+2.21z)| norm 0.3014 (+1.04z)| lr 1.03e-04 | 322.93 ms | 52.3% bf16 MFU | 1624393 tok/s step 14421/19560 | loss 3.291202 (-0.77z)| norm 0.2619 (-1.58z)| lr 1.03e-04 | 322.37 ms | 52.4% bf16 MFU | 1624492 tok/s step 14422/19560 | loss 3.360342 (+1.13z)| norm 0.2854 (-0.01z)| lr 1.03e-04 | 323.32 ms | 52.2% bf16 MFU | 1624346 tok/s step 14423/19560 | loss 3.357445 (+1.04z)| norm 0.2670 (-1.22z)| lr 1.03e-04 | 323.31 ms | 52.2% bf16 MFU | 1624209 tok/s step 14424/19560 | loss 3.353523 (+0.92z)| norm 0.2811 (-0.28z)| lr 1.03e-04 | 322.43 ms | 52.3% bf16 MFU | 
1624302 tok/s step 14425/19560 | loss 3.295098 (-0.70z)| norm 0.2720 (-0.87z)| lr 1.03e-04 | 322.72 ms | 52.3% bf16 MFU | 1624316 tok/s step 14426/19560 | loss 3.345239 (+0.72z)| norm 0.2897 (+0.30z)| lr 1.03e-04 | 322.46 ms | 52.3% bf16 MFU | 1624395 tok/s step 14427/19560 | loss 3.361671 (+1.16z)| norm 0.2739 (-0.74z)| lr 1.03e-04 | 323.22 ms | 52.2% bf16 MFU | 1624279 tok/s step 14428/19560 | loss 3.303673 (-0.46z)| norm 0.2704 (-0.96z)| lr 1.03e-04 | 322.80 ms | 52.3% bf16 MFU | 1624274 tok/s step 14429/19560 | loss 3.260574 (-1.67z)| norm 0.3068 (+1.45z)| lr 1.03e-04 | 322.32 ms | 52.4% bf16 MFU | 1624389 tok/s step 14430/19560 | loss 3.314234 (-0.15z)| norm 0.2698 (-0.99z)| lr 1.03e-04 | 322.90 ms | 52.3% bf16 MFU | 1624354 tok/s step 14431/19560 | loss 3.231983 (-2.42z)| norm 0.2706 (-0.93z)| lr 1.03e-04 | 322.70 ms | 52.3% bf16 MFU | 1624369 tok/s step 14432/19560 | loss 3.342618 (+0.66z)| norm 0.2824 (-0.14z)| lr 1.03e-04 | 323.12 ms | 52.2% bf16 MFU | 1624280 tok/s step 14433/19560 | loss 3.349674 (+0.85z)| norm 0.2886 (+0.30z)| lr 1.03e-04 | 323.31 ms | 52.2% bf16 MFU | 1624146 tok/s step 14434/19560 | loss 3.297533 (-0.59z)| norm 0.3051 (+1.42z)| lr 1.03e-04 | 322.71 ms | 52.3% bf16 MFU | 1624170 tok/s step 14435/19560 | loss 3.336462 (+0.48z)| norm 0.2659 (-1.24z)| lr 1.03e-04 | 323.08 ms | 52.2% bf16 MFU | 1624101 tok/s step 14436/19560 | loss 3.380015 (+1.67z)| norm 0.2813 (-0.17z)| lr 1.03e-04 | 322.63 ms | 52.3% bf16 MFU | 1624147 tok/s step 14437/19560 | loss 3.328799 (+0.25z)| norm 0.2884 (+0.31z)| lr 1.03e-04 | 323.58 ms | 52.2% bf16 MFU | 1623953 tok/s step 14438/19560 | loss 3.303649 (-0.44z)| norm 0.2708 (-0.89z)| lr 1.03e-04 | 322.31 ms | 52.4% bf16 MFU | 1624089 tok/s step 14439/19560 | loss 3.290867 (-0.79z)| norm 0.2880 (+0.33z)| lr 1.03e-04 | 323.05 ms | 52.2% bf16 MFU | 1624032 tok/s step 14440/19560 | loss 3.329119 (+0.26z)| norm 0.3006 (+1.22z)| lr 1.03e-04 | 322.97 ms | 52.3% bf16 MFU | 1623997 tok/s step 14441/19560 | loss 3.332431 
(+0.34z)| norm 0.2915 (+0.59z)| lr 1.03e-04 | 323.02 ms | 52.2% bf16 MFU | 1623952 tok/s step 14442/19560 | loss 3.304494 (-0.42z)| norm 0.2823 (-0.05z)| lr 1.03e-04 | 322.76 ms | 52.3% bf16 MFU | 1623974 tok/s step 14443/19560 | loss 3.267978 (-1.42z)| norm 0.2702 (-0.92z)| lr 1.03e-04 | 323.21 ms | 52.2% bf16 MFU | 1623883 tok/s step 14444/19560 | loss 3.289337 (-0.82z)| norm 0.2771 (-0.41z)| lr 1.03e-04 | 323.22 ms | 52.2% bf16 MFU | 1623792 tok/s step 14445/19560 | loss 3.308694 (-0.30z)| norm 0.2518 (-2.21z)| lr 1.03e-04 | 323.06 ms | 52.2% bf16 MFU | 1623748 tok/s step 14446/19560 | loss 3.284437 (-0.95z)| norm 0.2665 (-1.13z)| lr 1.02e-04 | 323.28 ms | 52.2% bf16 MFU | 1623651 tok/s step 14447/19560 | loss 3.396113 (+2.07z)| norm 0.2729 (-0.66z)| lr 1.02e-04 | 322.82 ms | 52.3% bf16 MFU | 1623672 tok/s step 14448/19560 | loss 3.344575 (+0.66z)| norm 0.2844 (+0.16z)| lr 1.02e-04 | 322.87 ms | 52.3% bf16 MFU | 1623682 tok/s step 14449/19560 | loss 3.275890 (-1.18z)| norm 0.2770 (-0.38z)| lr 1.02e-04 | 322.95 ms | 52.3% bf16 MFU | 1623670 tok/s step 14450/19560 | loss 3.222790 (-2.53z)| norm 0.2591 (-1.65z)| lr 1.02e-04 | 323.13 ms | 52.2% bf16 MFU | 1623613 tok/s step 14451/19560 | loss 3.300548 (-0.47z)| norm 0.3154 (+2.32z)| lr 1.02e-04 | 322.80 ms | 52.3% bf16 MFU | 1623642 tok/s step 14452/19560 | loss 3.410833 (+2.39z)| norm 0.2717 (-0.74z)| lr 1.02e-04 | 322.98 ms | 52.3% bf16 MFU | 1623626 tok/s step 14453/19560 | loss 3.356586 (+0.97z)| norm 0.3192 (+2.52z)| lr 1.02e-04 | 323.32 ms | 52.2% bf16 MFU | 1623522 tok/s step 14454/19560 | loss 3.340551 (+0.55z)| norm 0.2764 (-0.42z)| lr 1.02e-04 | 322.73 ms | 52.3% bf16 MFU | 1623574 tok/s step 14455/19560 | loss 3.303500 (-0.39z)| norm 0.2920 (+0.64z)| lr 1.02e-04 | 323.54 ms | 52.2% bf16 MFU | 1623420 tok/s step 14456/19560 | loss 3.356874 (+0.99z)| norm 0.2651 (-1.20z)| lr 1.02e-04 | 322.69 ms | 52.3% bf16 MFU | 1623486 tok/s step 14457/19560 | loss 3.305791 (-0.35z)| norm 0.2749 (-0.51z)| lr 1.02e-04 | 
322.74 ms | 52.3% bf16 MFU | 1623535 tok/s step 14458/19560 | loss 3.307460 (-0.30z)| norm 0.2893 (+0.48z)| lr 1.02e-04 | 322.97 ms | 52.3% bf16 MFU | 1623524 tok/s step 14459/19560 | loss 3.311300 (-0.20z)| norm 0.2686 (-0.94z)| lr 1.02e-04 | 322.88 ms | 52.3% bf16 MFU | 1623538 tok/s step 14460/19560 | loss 3.298404 (-0.52z)| norm 0.2837 (+0.09z)| lr 1.02e-04 | 322.67 ms | 52.3% bf16 MFU | 1623603 tok/s step 14461/19560 | loss 3.282277 (-0.95z)| norm 0.2810 (-0.11z)| lr 1.02e-04 | 322.88 ms | 52.3% bf16 MFU | 1623612 tok/s step 14462/19560 | loss 3.316224 (-0.04z)| norm 0.2737 (-0.62z)| lr 1.02e-04 | 323.18 ms | 52.2% bf16 MFU | 1623546 tok/s step 14463/19560 | loss 3.271884 (-1.20z)| norm 0.2697 (-0.89z)| lr 1.02e-04 | 323.25 ms | 52.2% bf16 MFU | 1623465 tok/s step 14464/19560 | loss 3.286149 (-0.82z)| norm 0.2783 (-0.27z)| lr 1.02e-04 | 323.00 ms | 52.3% bf16 MFU | 1623451 tok/s step 14465/19560 | loss 3.355691 (+1.08z)| norm 0.2663 (-1.10z)| lr 1.02e-04 | 323.12 ms | 52.2% bf16 MFU | 1623407 tok/s step 14466/19560 | loss 3.341637 (+0.70z)| norm 0.2575 (-1.68z)| lr 1.02e-04 | 323.10 ms | 52.2% bf16 MFU | 1623372 tok/s step 14467/19560 | loss 3.297103 (-0.51z)| norm 0.2770 (-0.32z)| lr 1.02e-04 | 323.05 ms | 52.2% bf16 MFU | 1623351 tok/s step 14468/19560 | loss 3.308581 (-0.19z)| norm 0.2563 (-1.73z)| lr 1.02e-04 | 323.15 ms | 52.2% bf16 MFU | 1623304 tok/s step 14469/19560 | loss 3.294961 (-0.59z)| norm 0.2647 (-1.13z)| lr 1.02e-04 | 322.98 ms | 52.3% bf16 MFU | 1623304 tok/s step 14470/19560 | loss 3.405258 (+2.42z)| norm 0.3511 (+4.37z)| lr 1.02e-04 | 323.04 ms | 52.2% bf16 MFU | 1623287 tok/s step 14471/19560 | loss 3.327796 (+0.28z)| norm 0.2750 (-0.43z)| lr 1.02e-04 | 322.72 ms | 52.3% bf16 MFU | 1623353 tok/s step 14472/19560 | loss 3.372680 (+1.51z)| norm 0.2858 (+0.25z)| lr 1.01e-04 | 322.86 ms | 52.3% bf16 MFU | 1623380 tok/s step 14473/19560 | loss 3.251233 (-1.80z)| norm 0.2907 (+0.54z)| lr 1.01e-04 | 323.10 ms | 52.2% bf16 MFU | 1623346 tok/s step 
14474/19560 | loss 3.339286 (+0.58z)| norm 0.2823 (+0.00z)| lr 1.01e-04 | 323.06 ms | 52.2% bf16 MFU | 1623323 tok/s step 14475/19560 | loss 3.281777 (-0.98z)| norm 0.3221 (+2.46z)| lr 1.01e-04 | 322.81 ms | 52.3% bf16 MFU | 1623364 tok/s step 14476/19560 | loss 3.265044 (-1.41z)| norm 0.2885 (+0.37z)| lr 1.01e-04 | 322.84 ms | 52.3% bf16 MFU | 1623395 tok/s step 14477/19560 | loss 3.325948 (+0.22z)| norm 0.2859 (+0.20z)| lr 1.01e-04 | 322.94 ms | 52.3% bf16 MFU | 1623399 tok/s step 14478/19560 | loss 3.291656 (-0.71z)| norm 0.2776 (-0.32z)| lr 1.01e-04 | 322.76 ms | 52.3% bf16 MFU | 1623449 tok/s step 14479/19560 | loss 3.387683 (+1.86z)| norm 0.2870 (+0.27z)| lr 1.01e-04 | 323.06 ms | 52.2% bf16 MFU | 1623421 tok/s step 14480/19560 | loss 3.322519 (+0.12z)| norm 0.2787 (-0.25z)| lr 1.01e-04 | 322.82 ms | 52.3% bf16 MFU | 1623454 tok/s step 14481/19560 | loss 3.389024 (+1.86z)| norm 0.3029 (+1.25z)| lr 1.01e-04 | 323.01 ms | 52.2% bf16 MFU | 1623438 tok/s step 14482/19560 | loss 3.309615 (-0.25z)| norm 0.2806 (-0.15z)| lr 1.01e-04 | 322.61 ms | 52.3% bf16 MFU | 1623523 tok/s step 14483/19560 | loss 3.256525 (-1.63z)| norm 0.2898 (+0.41z)| lr 1.01e-04 | 322.98 ms | 52.3% bf16 MFU | 1623512 tok/s step 14484/19560 | loss 3.337569 (+0.52z)| norm 0.3199 (+2.25z)| lr 1.01e-04 | 323.30 ms | 52.2% bf16 MFU | 1623420 tok/s step 14485/19560 | loss 3.282506 (-0.95z)| norm 0.2894 (+0.36z)| lr 1.01e-04 | 323.06 ms | 52.2% bf16 MFU | 1623394 tok/s step 14486/19560 | loss 3.252466 (-1.72z)| norm 0.3090 (+1.57z)| lr 1.01e-04 | 323.55 ms | 52.2% bf16 MFU | 1623247 tok/s step 14487/19560 | loss 3.336040 (+0.47z)| norm 0.2902 (+0.39z)| lr 1.01e-04 | 323.09 ms | 52.2% bf16 MFU | 1623222 tok/s step 14488/19560 | loss 3.421390 (+2.64z)| norm 0.2902 (+0.39z)| lr 1.01e-04 | 322.36 ms | 52.4% bf16 MFU | 1623382 tok/s step 14489/19560 | loss 3.280060 (-1.00z)| norm 0.2963 (+0.77z)| lr 1.01e-04 | 323.07 ms | 52.2% bf16 MFU | 1623355 tok/s step 14490/19560 | loss 3.379148 (+1.52z)| norm 
0.3056 (+1.33z)| lr 1.01e-04 | 323.47 ms | 52.2% bf16 MFU | 1623228 tok/s step 14491/19560 | loss 3.305438 (-0.36z)| norm 0.2985 (+0.87z)| lr 1.01e-04 | 323.14 ms | 52.2% bf16 MFU | 1623190 tok/s step 14492/19560 | loss 3.310031 (-0.24z)| norm 0.2953 (+0.67z)| lr 1.01e-04 | 323.87 ms | 52.1% bf16 MFU | 1622971 tok/s step 14493/19560 | loss 3.376719 (+1.45z)| norm 0.3055 (+1.29z)| lr 1.01e-04 | 322.91 ms | 52.3% bf16 MFU | 1623004 tok/s step 14494/19560 | loss 3.336553 (+0.43z)| norm 0.3062 (+1.32z)| lr 1.01e-04 | 323.82 ms | 52.1% bf16 MFU | 1622807 tok/s step 14495/19560 | loss 3.377620 (+1.46z)| norm 0.3014 (+1.02z)| lr 1.01e-04 | 322.63 ms | 52.3% bf16 MFU | 1622919 tok/s step 14496/19560 | loss 3.347986 (+0.71z)| norm 0.2744 (-0.63z)| lr 1.01e-04 | 323.21 ms | 52.2% bf16 MFU | 1622880 tok/s step 14497/19560 | loss 3.306366 (-0.35z)| norm 0.2874 (+0.16z)| lr 1.01e-04 | 323.88 ms | 52.1% bf16 MFU | 1622675 tok/s step 14498/19560 | loss 3.273909 (-1.17z)| norm 0.2709 (-0.85z)| lr 1.01e-04 | 322.66 ms | 52.3% bf16 MFU | 1622786 tok/s step 14499/19560 | loss 3.330202 (+0.25z)| norm 0.2922 (+0.46z)| lr 1.00e-04 | 323.16 ms | 52.2% bf16 MFU | 1622765 tok/s step 14500/19560 | loss 3.368294 (+1.21z)| norm 0.2890 (+0.26z)| lr 1.00e-04 | 322.57 ms | 52.3% bf16 MFU | 1622895 tok/s
val loss 3.310410
HellaSwag: 2996/10042 = 0.298347
step 14501/19560 | loss 3.254345 (-1.66z)| norm 0.2705 (-0.87z)| lr 1.00e-04 | 323.01 ms | 52.3% bf16 MFU | 1622908 tok/s step 14502/19560 | loss 3.303481 (-0.41z)| norm 0.2886 (+0.24z)| lr 1.00e-04 | 322.62 ms | 52.3% bf16 MFU | 1623017 tok/s step 14503/19560 | loss 3.367205 (+1.24z)| norm 0.2987 (+0.86z)| lr 1.00e-04 | 323.01 ms | 52.2% bf16 MFU | 1623022 tok/s step 14504/19560 | loss 3.330722 (+0.28z)| norm 0.2914 (+0.44z)|
lr 1.00e-04 | 322.99 ms | 52.3% bf16 MFU | 1623033 tok/s step 14505/19560 | loss 3.295710 (-0.62z)| norm 0.2587 (-1.62z)| lr 1.00e-04 | 323.16 ms | 52.2% bf16 MFU | 1623000 tok/s step 14506/19560 | loss 3.311473 (-0.20z)| norm 0.3352 (+3.13z)| lr 1.00e-04 | 322.58 ms | 52.3% bf16 MFU | 1623115 tok/s step 14507/19560 | loss 3.297852 (-0.55z)| norm 0.2623 (-1.35z)| lr 1.00e-04 | 322.81 ms | 52.3% bf16 MFU | 1623166 tok/s step 14508/19560 | loss 3.315195 (-0.09z)| norm 0.2801 (-0.25z)| lr 1.00e-04 | 322.75 ms | 52.3% bf16 MFU | 1623229 tok/s step 14509/19560 | loss 3.356063 (+0.96z)| norm 0.3256 (+2.49z)| lr 1.00e-04 | 322.86 ms | 52.3% bf16 MFU | 1623263 tok/s step 14510/19560 | loss 3.332268 (+0.35z)| norm 0.2645 (-1.19z)| lr 1.00e-04 | 323.15 ms | 52.2% bf16 MFU | 1623222 tok/s step 14511/19560 | loss 3.362339 (+1.12z)| norm 0.2870 (+0.20z)| lr 1.00e-04 | 322.57 ms | 52.3% bf16 MFU | 1623328 tok/s step 14512/19560 | loss 3.279346 (-1.02z)| norm 0.2723 (-0.70z)| lr 1.00e-04 | 323.23 ms | 52.2% bf16 MFU | 1623264 tok/s step 14513/19560 | loss 3.351907 (+0.84z)| norm 0.2717 (-0.72z)| lr 1.00e-04 | 322.76 ms | 52.3% bf16 MFU | 1623320 tok/s step 14514/19560 | loss 3.309246 (-0.25z)| norm 0.2887 (+0.31z)| lr 9.99e-05 | 322.32 ms | 52.4% bf16 MFU | 1623483 tok/s step 14515/19560 | loss 3.346628 (+0.70z)| norm 0.2644 (-1.18z)| lr 9.99e-05 | 323.01 ms | 52.2% bf16 MFU | 1623466 tok/s step 14516/19560 | loss 3.284127 (-0.90z)| norm 0.2681 (-0.94z)| lr 9.98e-05 | 323.01 ms | 52.2% bf16 MFU | 1623449 tok/s step 14517/19560 | loss 3.292823 (-0.66z)| norm 0.2763 (-0.43z)| lr 9.98e-05 | 322.74 ms | 52.3% bf16 MFU | 1623501 tok/s step 14518/19560 | loss 3.335641 (+0.43z)| norm 0.2595 (-1.45z)| lr 9.98e-05 | 322.51 ms | 52.3% bf16 MFU | 1623608 tok/s step 14519/19560 | loss 3.320058 (+0.02z)| norm 0.2809 (-0.14z)| lr 9.97e-05 | 322.89 ms | 52.3% bf16 MFU | 1623615 tok/s step 14520/19560 | loss 3.347274 (+0.71z)| norm 0.2785 (-0.28z)| lr 9.97e-05 | 323.07 ms | 52.2% bf16 MFU | 
1623577 tok/s step 14521/19560 | loss 3.416597 (+2.42z)| norm 0.3540 (+4.04z)| lr 9.97e-05 | 322.58 ms | 52.3% bf16 MFU | 1623662 tok/s step 14522/19560 | loss 3.351929 (+0.78z)| norm 0.2826 (-0.07z)| lr 9.96e-05 | 322.71 ms | 52.3% bf16 MFU | 1623712 tok/s step 14523/19560 | loss 3.301274 (-0.50z)| norm 0.2634 (-1.16z)| lr 9.96e-05 | 322.72 ms | 52.3% bf16 MFU | 1623757 tok/s step 14524/19560 | loss 3.336997 (+0.41z)| norm 0.2825 (-0.06z)| lr 9.95e-05 | 322.95 ms | 52.3% bf16 MFU | 1623740 tok/s step 14525/19560 | loss 3.381130 (+1.52z)| norm 0.3093 (+1.46z)| lr 9.95e-05 | 323.05 ms | 52.2% bf16 MFU | 1623700 tok/s step 14526/19560 | loss 3.347576 (+0.67z)| norm 0.2906 (+0.38z)| lr 9.95e-05 | 322.33 ms | 52.4% bf16 MFU | 1623843 tok/s step 14527/19560 | loss 3.285291 (-0.90z)| norm 0.2720 (-0.68z)| lr 9.94e-05 | 323.19 ms | 52.2% bf16 MFU | 1623763 tok/s step 14528/19560 | loss 3.284036 (-0.92z)| norm 0.2971 (+0.75z)| lr 9.94e-05 | 323.35 ms | 52.2% bf16 MFU | 1623648 tok/s step 14529/19560 | loss 3.291179 (-0.74z)| norm 0.2823 (-0.10z)| lr 9.94e-05 | 322.43 ms | 52.3% bf16 MFU | 1623767 tok/s step 14530/19560 | loss 3.423788 (+2.53z)| norm 0.2801 (-0.23z)| lr 9.93e-05 | 323.28 ms | 52.2% bf16 MFU | 1623668 tok/s step 14531/19560 | loss 3.303305 (-0.44z)| norm 0.2548 (-1.65z)| lr 9.93e-05 | 322.56 ms | 52.3% bf16 MFU | 1623754 tok/s step 14532/19560 | loss 3.346347 (+0.62z)| norm 0.2777 (-0.36z)| lr 9.92e-05 | 322.83 ms | 52.3% bf16 MFU | 1623768 tok/s step 14533/19560 | loss 3.335580 (+0.35z)| norm 0.2861 (+0.12z)| lr 9.92e-05 | 323.17 ms | 52.2% bf16 MFU | 1623695 tok/s step 14534/19560 | loss 3.311791 (-0.23z)| norm 0.2855 (+0.09z)| lr 9.92e-05 | 322.88 ms | 52.3% bf16 MFU | 1623700 tok/s step 14535/19560 | loss 3.408856 (+2.11z)| norm 0.2901 (+0.35z)| lr 9.91e-05 | 323.18 ms | 52.2% bf16 MFU | 1623629 tok/s step 14536/19560 | loss 3.336654 (+0.35z)| norm 0.2788 (-0.29z)| lr 9.91e-05 | 322.20 ms | 52.4% bf16 MFU | 1623807 tok/s step 14537/19560 | loss 3.291463 
(-0.75z)| norm 0.2591 (-1.39z)| lr 9.91e-05 | 322.89 ms | 52.3% bf16 MFU | 1623805 tok/s step 14538/19560 | loss 3.421760 (+2.35z)| norm 0.3202 (+2.01z)| lr 9.90e-05 | 323.45 ms | 52.2% bf16 MFU | 1623659 tok/s step 14539/19560 | loss 3.353514 (+0.71z)| norm 0.2774 (-0.38z)| lr 9.90e-05 | 322.39 ms | 52.4% bf16 MFU | 1623790 tok/s step 14540/19560 | loss 3.325599 (+0.04z)| norm 0.2867 (+0.13z)| lr 9.90e-05 | 322.67 ms | 52.3% bf16 MFU | 1623844 tok/s step 14541/19560 | loss 3.237325 (-2.03z)| norm 0.2684 (-0.87z)| lr 9.89e-05 | 322.28 ms | 52.4% bf16 MFU | 1623992 tok/s step 14542/19560 | loss 3.280521 (-0.99z)| norm 0.2821 (-0.11z)| lr 9.89e-05 | 324.03 ms | 52.1% bf16 MFU | 1623692 tok/s step 14543/19560 | loss 3.430794 (+2.48z)| norm 0.3022 (+1.00z)| lr 9.88e-05 | 322.91 ms | 52.3% bf16 MFU | 1623689 tok/s step 14544/19560 | loss 3.314864 (-0.20z)| norm 0.2694 (-0.82z)| lr 9.88e-05 | 322.30 ms | 52.4% bf16 MFU | 1623839 tok/s step 14545/19560 | loss 3.387948 (+1.47z)| norm 0.2684 (-0.87z)| lr 9.88e-05 | 322.80 ms | 52.3% bf16 MFU | 1623856 tok/s step 14546/19560 | loss 3.349779 (+0.57z)| norm 0.2982 (+0.78z)| lr 9.87e-05 | 322.80 ms | 52.3% bf16 MFU | 1623872 tok/s step 14547/19560 | loss 3.323548 (-0.04z)| norm 0.2856 (+0.08z)| lr 9.87e-05 | 322.72 ms | 52.3% bf16 MFU | 1623909 tok/s step 14548/19560 | loss 3.297022 (-0.64z)| norm 0.3021 (+0.99z)| lr 9.87e-05 | 322.47 ms | 52.3% bf16 MFU | 1624005 tok/s step 14549/19560 | loss 3.293712 (-0.72z)| norm 0.3024 (+0.99z)| lr 9.86e-05 | 323.09 ms | 52.2% bf16 MFU | 1623941 tok/s step 14550/19560 | loss 3.363156 (+0.91z)| norm 0.2882 (+0.20z)| lr 9.86e-05 | 323.07 ms | 52.2% bf16 MFU | 1623886 tok/s step 14551/19560 | loss 3.306400 (-0.41z)| norm 0.3139 (+1.60z)| lr 9.85e-05 | 322.87 ms | 52.3% bf16 MFU | 1623884 tok/s step 14552/19560 | loss 3.336866 (+0.30z)| norm 0.2648 (-1.10z)| lr 9.85e-05 | 322.79 ms | 52.3% bf16 MFU | 1623902 tok/s step 14553/19560 | loss 3.293725 (-0.71z)| norm 0.2828 (-0.11z)| lr 9.85e-05 | 
322.85 ms | 52.3% bf16 MFU | 1623904 tok/s step 14554/19560 | loss 3.282551 (-0.96z)| norm 0.2824 (-0.13z)| lr 9.84e-05 | 322.92 ms | 52.3% bf16 MFU | 1623887 tok/s step 14555/19560 | loss 3.316817 (-0.15z)| norm 0.2637 (-1.15z)| lr 9.84e-05 | 323.10 ms | 52.2% bf16 MFU | 1623827 tok/s step 14556/19560 | loss 3.300203 (-0.54z)| norm 0.2772 (-0.42z)| lr 9.84e-05 | 322.13 ms | 52.4% bf16 MFU | 1624014 tok/s step 14557/19560 | loss 3.338936 (+0.36z)| norm 0.2718 (-0.70z)| lr 9.83e-05 | 323.42 ms | 52.2% bf16 MFU | 1623868 tok/s step 14558/19560 | loss 3.318869 (-0.11z)| norm 0.2717 (-0.71z)| lr 9.83e-05 | 323.32 ms | 52.2% bf16 MFU | 1623752 tok/s step 14559/19560 | loss 3.323725 (-0.02z)| norm 0.2898 (+0.28z)| lr 9.82e-05 | 322.36 ms | 52.4% bf16 MFU | 1623886 tok/s step 14560/19560 | loss 3.324794 (+0.01z)| norm 0.2707 (-0.77z)| lr 9.82e-05 | 322.55 ms | 52.3% bf16 MFU | 1623964 tok/s step 14561/19560 | loss 3.305248 (-0.45z)| norm 0.2648 (-1.08z)| lr 9.82e-05 | 322.80 ms | 52.3% bf16 MFU | 1623975 tok/s step 14562/19560 | loss 3.389652 (+1.56z)| norm 0.2701 (-0.77z)| lr 9.81e-05 | 323.07 ms | 52.2% bf16 MFU | 1623918 tok/s step 14563/19560 | loss 3.302207 (-0.53z)| norm 0.2725 (-0.64z)| lr 9.81e-05 | 322.70 ms | 52.3% bf16 MFU | 1623957 tok/s step 14564/19560 | loss 3.324370 (+0.01z)| norm 0.2537 (-1.65z)| lr 9.81e-05 | 322.70 ms | 52.3% bf16 MFU | 1623993 tok/s step 14565/19560 | loss 3.327768 (+0.09z)| norm 0.2705 (-0.73z)| lr 9.80e-05 | 322.63 ms | 52.3% bf16 MFU | 1624046 tok/s step 14566/19560 | loss 3.240842 (-1.97z)| norm 0.2604 (-1.27z)| lr 9.80e-05 | 322.70 ms | 52.3% bf16 MFU | 1624077 tok/s step 14567/19560 | loss 3.254075 (-1.63z)| norm 0.2683 (-0.83z)| lr 9.80e-05 | 323.09 ms | 52.2% bf16 MFU | 1624010 tok/s step 14568/19560 | loss 3.278509 (-1.04z)| norm 0.2748 (-0.47z)| lr 9.79e-05 | 322.93 ms | 52.3% bf16 MFU | 1623987 tok/s step 14569/19560 | loss 3.334594 (+0.28z)| norm 0.2639 (-1.04z)| lr 9.79e-05 | 322.55 ms | 52.3% bf16 MFU | 1624060 tok/s step 
14570/19560 | loss 3.356169 (+0.77z)| norm 0.2925 (+0.50z)| lr 9.78e-05 | 322.93 ms | 52.3% bf16 MFU | 1624035 tok/s step 14571/19560 | loss 3.310243 (-0.31z)| norm 0.2757 (-0.41z)| lr 9.78e-05 | 322.63 ms | 52.3% bf16 MFU | 1624086 tok/s step 14572/19560 | loss 3.343351 (+0.46z)| norm 0.2760 (-0.40z)| lr 9.78e-05 | 323.40 ms | 52.2% bf16 MFU | 1623942 tok/s step 14573/19560 | loss 3.384600 (+1.41z)| norm 0.2644 (-1.03z)| lr 9.77e-05 | 322.92 ms | 52.3% bf16 MFU | 1623925 tok/s step 14574/19560 | loss 3.278972 (-1.07z)| norm 0.2846 (+0.06z)| lr 9.77e-05 | 322.28 ms | 52.4% bf16 MFU | 1624070 tok/s step 14575/19560 | loss 3.391264 (+1.57z)| norm 0.2963 (+0.68z)| lr 9.77e-05 | 322.70 ms | 52.3% bf16 MFU | 1624100 tok/s step 14576/19560 | loss 3.276950 (-1.10z)| norm 0.2872 (+0.19z)| lr 9.76e-05 | 322.71 ms | 52.3% bf16 MFU | 1624126 tok/s step 14577/19560 | loss 3.325688 (+0.03z)| norm 0.2929 (+0.49z)| lr 9.76e-05 | 323.20 ms | 52.2% bf16 MFU | 1624029 tok/s step 14578/19560 | loss 3.282362 (-1.02z)| norm 0.2615 (-1.23z)| lr 9.75e-05 | 322.61 ms | 52.3% bf16 MFU | 1624086 tok/s step 14579/19560 | loss 3.300758 (-0.58z)| norm 0.3108 (+1.48z)| lr 9.75e-05 | 322.55 ms | 52.3% bf16 MFU | 1624153 tok/s step 14580/19560 | loss 3.309038 (-0.36z)| norm 0.2756 (-0.46z)| lr 9.75e-05 | 323.07 ms | 52.2% bf16 MFU | 1624087 tok/s step 14581/19560 | loss 3.400812 (+1.84z)| norm 0.3030 (+1.07z)| lr 9.74e-05 | 322.74 ms | 52.3% bf16 MFU | 1624106 tok/s step 14582/19560 | loss 3.308099 (-0.39z)| norm 0.2858 (+0.11z)| lr 9.74e-05 | 322.92 ms | 52.3% bf16 MFU | 1624079 tok/s step 14583/19560 | loss 3.335306 (+0.27z)| norm 0.2745 (-0.51z)| lr 9.74e-05 | 322.55 ms | 52.3% bf16 MFU | 1624147 tok/s step 14584/19560 | loss 3.362363 (+0.92z)| norm 0.2560 (-1.53z)| lr 9.73e-05 | 322.72 ms | 52.3% bf16 MFU | 1624169 tok/s step 14585/19560 | loss 3.328340 (+0.09z)| norm 0.2602 (-1.28z)| lr 9.73e-05 | 322.63 ms | 52.3% bf16 MFU | 1624214 tok/s step 14586/19560 | loss 3.359573 (+0.83z)| norm 
0.2918 (+0.45z)| lr 9.73e-05 | 323.27 ms | 52.2% bf16 MFU | 1624095 tok/s step 14587/19560 | loss 3.357725 (+0.78z)| norm 0.2645 (-1.04z)| lr 9.72e-05 | 322.95 ms | 52.3% bf16 MFU | 1624061 tok/s step 14588/19560 | loss 3.326423 (+0.02z)| norm 0.2871 (+0.19z)| lr 9.72e-05 | 322.35 ms | 52.4% bf16 MFU | 1624181 tok/s step 14589/19560 | loss 3.278386 (-1.13z)| norm 0.2873 (+0.20z)| lr 9.71e-05 | 322.71 ms | 52.3% bf16 MFU | 1624205 tok/s step 14590/19560 | loss 3.344240 (+0.45z)| norm 0.3193 (+1.91z)| lr 9.71e-05 | 322.69 ms | 52.3% bf16 MFU | 1624231 tok/s step 14591/19560 | loss 3.302583 (-0.56z)| norm 0.2987 (+0.79z)| lr 9.71e-05 | 323.14 ms | 52.2% bf16 MFU | 1624144 tok/s step 14592/19560 | loss 3.330182 (+0.09z)| norm 0.2699 (-0.77z)| lr 9.70e-05 | 323.08 ms | 52.2% bf16 MFU | 1624075 tok/s step 14593/19560 | loss 3.327936 (+0.04z)| norm 0.3039 (+1.05z)| lr 9.70e-05 | 322.43 ms | 52.3% bf16 MFU | 1624174 tok/s step 14594/19560 | loss 3.340404 (+0.35z)| norm 0.2896 (+0.27z)| lr 9.70e-05 | 322.33 ms | 52.4% bf16 MFU | 1624292 tok/s step 14595/19560 | loss 3.298463 (-0.67z)| norm 0.2933 (+0.46z)| lr 9.69e-05 | 323.27 ms | 52.2% bf16 MFU | 1624169 tok/s step 14596/19560 | loss 3.340210 (+0.34z)| norm 0.2960 (+0.60z)| lr 9.69e-05 | 323.11 ms | 52.2% bf16 MFU | 1624091 tok/s step 14597/19560 | loss 3.356422 (+0.72z)| norm 0.3077 (+1.22z)| lr 9.68e-05 | 322.76 ms | 52.3% bf16 MFU | 1624106 tok/s step 14598/19560 | loss 3.327675 (+0.04z)| norm 0.2888 (+0.22z)| lr 9.68e-05 | 322.58 ms | 52.3% bf16 MFU | 1624166 tok/s step 14599/19560 | loss 3.345803 (+0.48z)| norm 0.2740 (-0.63z)| lr 9.68e-05 | 322.66 ms | 52.3% bf16 MFU | 1624203 tok/s step 14600/19560 | loss 3.309329 (-0.41z)| norm 0.2764 (-0.49z)| lr 9.67e-05 | 323.12 ms | 52.2% bf16 MFU | 1624123 tok/s step 14601/19560 | loss 3.250314 (-1.87z)| norm 0.2716 (-0.76z)| lr 9.67e-05 | 322.66 ms | 52.3% bf16 MFU | 1624163 tok/s step 14602/19560 | loss 3.309770 (-0.39z)| norm 0.2881 (+0.19z)| lr 9.67e-05 | 323.30 ms | 
52.2% bf16 MFU | 1624040 tok/s step 14603/19560 | loss 3.298617 (-0.67z)| norm 0.3456 (+3.39z)| lr 9.66e-05 | 322.71 ms | 52.3% bf16 MFU | 1624069 tok/s step 14604/19560 | loss 3.328121 (+0.05z)| norm 0.2773 (-0.42z)| lr 9.66e-05 | 322.75 ms | 52.3% bf16 MFU | 1624087 tok/s step 14605/19560 | loss 3.284479 (-1.03z)| norm 0.2701 (-0.82z)| lr 9.66e-05 | 323.15 ms | 52.2% bf16 MFU | 1624004 tok/s step 14606/19560 | loss 3.279971 (-1.14z)| norm 0.3410 (+3.00z)| lr 9.65e-05 | 323.32 ms | 52.2% bf16 MFU | 1623883 tok/s step 14607/19560 | loss 3.364467 (+0.97z)| norm 0.3621 (+3.85z)| lr 9.65e-05 | 322.83 ms | 52.3% bf16 MFU | 1623891 tok/s step 14608/19560 | loss 3.339699 (+0.35z)| norm 0.2896 (+0.19z)| lr 9.64e-05 | 323.08 ms | 52.2% bf16 MFU | 1623835 tok/s step 14609/19560 | loss 3.307239 (-0.45z)| norm 0.2969 (+0.56z)| lr 9.64e-05 | 322.52 ms | 52.3% bf16 MFU | 1623924 tok/s step 14610/19560 | loss 3.372091 (+1.17z)| norm 0.3249 (+1.93z)| lr 9.64e-05 | 322.92 ms | 52.3% bf16 MFU | 1623908 tok/s step 14611/19560 | loss 3.365892 (+1.00z)| norm 0.3117 (+1.25z)| lr 9.63e-05 | 323.09 ms | 52.2% bf16 MFU | 1623849 tok/s step 14612/19560 | loss 3.362702 (+0.91z)| norm 0.2730 (-0.65z)| lr 9.63e-05 | 322.76 ms | 52.3% bf16 MFU | 1623877 tok/s step 14613/19560 | loss 3.324939 (-0.05z)| norm 0.2909 (+0.24z)| lr 9.63e-05 | 323.03 ms | 52.2% bf16 MFU | 1623834 tok/s step 14614/19560 | loss 3.320237 (-0.19z)| norm 0.2630 (-1.13z)| lr 9.62e-05 | 322.30 ms | 52.4% bf16 MFU | 1623977 tok/s step 14615/19560 | loss 3.456885 (+3.17z)| norm 0.3158 (+1.49z)| lr 9.62e-05 | 323.04 ms | 52.2% bf16 MFU | 1623928 tok/s step 14616/19560 | loss 3.290753 (-0.92z)| norm 0.2876 (+0.09z)| lr 9.61e-05 | 323.42 ms | 52.2% bf16 MFU | 1623785 tok/s step 14617/19560 | loss 3.283289 (-1.11z)| norm 0.2877 (+0.10z)| lr 9.61e-05 | 322.46 ms | 52.3% bf16 MFU | 1623892 tok/s step 14618/19560 | loss 3.302650 (-0.61z)| norm 0.2762 (-0.46z)| lr 9.61e-05 | 323.10 ms | 52.2% bf16 MFU | 1623832 tok/s step 14619/19560 
| loss 3.285702 (-1.04z)| norm 0.2897 (+0.21z)| lr 9.60e-05 | 322.95 ms | 52.3% bf16 MFU | 1623812 tok/s step 14620/19560 | loss 3.287811 (-0.97z)| norm 0.2809 (-0.22z)| lr 9.60e-05 | 323.19 ms | 52.2% bf16 MFU | 1623733 tok/s step 14621/19560 | loss 3.325985 (-0.00z)| norm 0.2598 (-1.25z)| lr 9.60e-05 | 323.03 ms | 52.2% bf16 MFU | 1623698 tok/s step 14622/19560 | loss 3.339548 (+0.34z)| norm 0.2871 (+0.11z)| lr 9.59e-05 | 323.07 ms | 52.2% bf16 MFU | 1623655 tok/s step 14623/19560 | loss 3.294035 (-0.80z)| norm 0.2797 (-0.25z)| lr 9.59e-05 | 322.77 ms | 52.3% bf16 MFU | 1623689 tok/s step 14624/19560 | loss 3.402920 (+1.93z)| norm 0.2921 (+0.37z)| lr 9.59e-05 | 322.80 ms | 52.3% bf16 MFU | 1623713 tok/s step 14625/19560 | loss 3.290045 (-0.90z)| norm 0.2831 (-0.08z)| lr 9.58e-05 | 322.81 ms | 52.3% bf16 MFU | 1623733 tok/s step 14626/19560 | loss 3.335213 (+0.22z)| norm 0.2623 (-1.12z)| lr 9.58e-05 | 323.34 ms | 52.2% bf16 MFU | 1623619 tok/s step 14627/19560 | loss 3.269559 (-1.41z)| norm 0.2850 (+0.02z)| lr 9.57e-05 | 322.78 ms | 52.3% bf16 MFU | 1623654 tok/s step 14628/19560 | loss 3.272354 (-1.31z)| norm 0.2646 (-0.99z)| lr 9.57e-05 | 322.76 ms | 52.3% bf16 MFU | 1623691 tok/s step 14629/19560 | loss 3.349859 (+0.61z)| norm 0.2525 (-1.57z)| lr 9.57e-05 | 322.96 ms | 52.3% bf16 MFU | 1623676 tok/s step 14630/19560 | loss 3.246512 (-1.96z)| norm 0.2794 (-0.24z)| lr 9.56e-05 | 322.77 ms | 52.3% bf16 MFU | 1623708 tok/s step 14631/19560 | loss 3.292756 (-0.80z)| norm 0.2989 (+0.73z)| lr 9.56e-05 | 323.39 ms | 52.2% bf16 MFU | 1623585 tok/s step 14632/19560 | loss 3.376809 (+1.28z)| norm 0.2697 (-0.71z)| lr 9.56e-05 | 323.13 ms | 52.2% bf16 MFU | 1623532 tok/s step 14633/19560 | loss 3.422136 (+2.33z)| norm 0.2699 (-0.71z)| lr 9.55e-05 | 323.75 ms | 52.1% bf16 MFU | 1623327 tok/s step 14634/19560 | loss 3.376843 (+1.21z)| norm 0.2722 (-0.59z)| lr 9.55e-05 | 322.30 ms | 52.4% bf16 MFU | 1623497 tok/s step 14635/19560 | loss 3.373631 (+1.11z)| norm 0.2763 (-0.38z)| 
lr 9.55e-05 | 323.00 ms | 52.3% bf16 MFU | 1623481 tok/s step 14636/19560 | loss 3.423398 (+2.25z)| norm 0.2748 (-0.46z)| lr 9.54e-05 | 322.73 ms | 52.3% bf16 MFU | 1623534 tok/s step 14637/19560 | loss 3.417484 (+2.06z)| norm 0.2751 (-0.43z)| lr 9.54e-05 | 322.60 ms | 52.3% bf16 MFU | 1623617 tok/s step 14638/19560 | loss 3.290550 (-0.87z)| norm 0.2663 (-0.89z)| lr 9.53e-05 | 322.84 ms | 52.3% bf16 MFU | 1623636 tok/s step 14639/19560 | loss 3.277784 (-1.15z)| norm 0.2699 (-0.69z)| lr 9.53e-05 | 323.09 ms | 52.2% bf16 MFU | 1623590 tok/s step 14640/19560 | loss 3.329502 (+0.03z)| norm 0.2676 (-0.81z)| lr 9.53e-05 | 323.33 ms | 52.2% bf16 MFU | 1623488 tok/s step 14641/19560 | loss 3.237856 (-2.04z)| norm 0.2708 (-0.64z)| lr 9.52e-05 | 323.34 ms | 52.2% bf16 MFU | 1623386 tok/s step 14642/19560 | loss 3.285822 (-0.94z)| norm 0.2802 (-0.15z)| lr 9.52e-05 | 322.47 ms | 52.3% bf16 MFU | 1623509 tok/s step 14643/19560 | loss 3.325940 (-0.02z)| norm 0.2873 (+0.21z)| lr 9.52e-05 | 322.67 ms | 52.3% bf16 MFU | 1623576 tok/s step 14644/19560 | loss 3.291300 (-0.81z)| norm 0.2664 (-0.88z)| lr 9.51e-05 | 323.70 ms | 52.1% bf16 MFU | 1623380 tok/s step 14645/19560 | loss 3.373269 (+1.04z)| norm 0.3090 (+1.31z)| lr 9.51e-05 | 323.68 ms | 52.1% bf16 MFU | 1623201 tok/s step 14646/19560 | loss 3.418014 (+2.01z)| norm 0.2898 (+0.31z)| lr 9.51e-05 | 322.66 ms | 52.3% bf16 MFU | 1623285 tok/s step 14647/19560 | loss 3.321961 (-0.14z)| norm 0.2913 (+0.39z)| lr 9.50e-05 | 323.08 ms | 52.2% bf16 MFU | 1623260 tok/s step 14648/19560 | loss 3.282185 (-1.01z)| norm 0.2727 (-0.58z)| lr 9.50e-05 | 323.02 ms | 52.2% bf16 MFU | 1623252 tok/s step 14649/19560 | loss 3.369435 (+0.95z)| norm 0.3054 (+1.20z)| lr 9.49e-05 | 323.45 ms | 52.2% bf16 MFU | 1623134 tok/s step 14650/19560 | loss 3.447999 (+2.64z)| norm 0.2862 (+0.15z)| lr 9.49e-05 | 322.81 ms | 52.3% bf16 MFU | 1623184 tok/s step 14651/19560 | loss 3.321913 (-0.14z)| norm 0.2973 (+0.74z)| lr 9.49e-05 | 323.02 ms | 52.2% bf16 MFU | 
1623180 tok/s step 14652/19560 | loss 3.350397 (+0.49z)| norm 0.2919 (+0.44z)| lr 9.48e-05 | 323.08 ms | 52.2% bf16 MFU | 1623160 tok/s step 14653/19560 | loss 3.363472 (+0.78z)| norm 0.2771 (-0.35z)| lr 9.48e-05 | 322.48 ms | 52.3% bf16 MFU | 1623293 tok/s step 14654/19560 | loss 3.315522 (-0.27z)| norm 0.3154 (+1.72z)| lr 9.48e-05 | 323.02 ms | 52.2% bf16 MFU | 1623282 tok/s step 14655/19560 | loss 3.258824 (-1.51z)| norm 0.3260 (+2.24z)| lr 9.47e-05 | 323.38 ms | 52.2% bf16 MFU | 1623183 tok/s step 14656/19560 | loss 3.477751 (+3.15z)| norm 0.3461 (+3.16z)| lr 9.47e-05 | 323.19 ms | 52.2% bf16 MFU | 1623136 tok/s step 14657/19560 | loss 3.330485 (+0.02z)| norm 0.2966 (+0.61z)| lr 9.47e-05 | 322.86 ms | 52.3% bf16 MFU | 1623173 tok/s step 14658/19560 | loss 3.318984 (-0.21z)| norm 0.2872 (+0.13z)| lr 9.46e-05 | 322.82 ms | 52.3% bf16 MFU | 1623220 tok/s step 14659/19560 | loss 3.339489 (+0.23z)| norm 0.2949 (+0.51z)| lr 9.46e-05 | 323.45 ms | 52.2% bf16 MFU | 1623105 tok/s step 14660/19560 | loss 3.351540 (+0.49z)| norm 0.2769 (-0.42z)| lr 9.45e-05 | 322.63 ms | 52.3% bf16 MFU | 1623202 tok/s step 14661/19560 | loss 3.321476 (-0.16z)| norm 0.2710 (-0.72z)| lr 9.45e-05 | 323.27 ms | 52.2% bf16 MFU | 1623134 tok/s step 14662/19560 | loss 3.325454 (-0.08z)| norm 0.2722 (-0.65z)| lr 9.45e-05 | 323.28 ms | 52.2% bf16 MFU | 1623067 tok/s step 14663/19560 | loss 3.318755 (-0.21z)| norm 0.2830 (-0.09z)| lr 9.44e-05 | 323.15 ms | 52.2% bf16 MFU | 1623034 tok/s step 14664/19560 | loss 3.309651 (-0.40z)| norm 0.2791 (-0.29z)| lr 9.44e-05 | 323.15 ms | 52.2% bf16 MFU | 1623004 tok/s step 14665/19560 | loss 3.292417 (-0.78z)| norm 0.2611 (-1.22z)| lr 9.44e-05 | 323.24 ms | 52.2% bf16 MFU | 1622952 tok/s step 14666/19560 | loss 3.384979 (+1.26z)| norm 0.3029 (+0.95z)| lr 9.43e-05 | 324.34 ms | 52.0% bf16 MFU | 1622629 tok/s step 14667/19560 | loss 3.306721 (-0.46z)| norm 0.2843 (-0.02z)| lr 9.43e-05 | 323.20 ms | 52.2% bf16 MFU | 1622606 tok/s step 14668/19560 | loss 3.268055 
(-1.29z)| norm 0.2704 (-0.73z)| lr 9.43e-05 | 322.77 ms | 52.3% bf16 MFU | 1622692 tok/s step 14669/19560 | loss 3.397589 (+1.53z)| norm 0.3009 (+0.84z)| lr 9.42e-05 | 323.85 ms | 52.1% bf16 MFU | 1622502 tok/s step 14670/19560 | loss 3.334118 (+0.12z)| norm 0.2749 (-0.51z)| lr 9.42e-05 | 323.79 ms | 52.1% bf16 MFU | 1622338 tok/s step 14671/19560 | loss 3.309590 (-0.41z)| norm 0.2774 (-0.37z)| lr 9.41e-05 | 323.07 ms | 52.2% bf16 MFU | 1622361 tok/s step 14672/19560 | loss 3.295210 (-0.73z)| norm 0.2741 (-0.55z)| lr 9.41e-05 | 322.04 ms | 52.4% bf16 MFU | 1622644 tok/s step 14673/19560 | loss 3.261425 (-1.47z)| norm 0.2666 (-0.94z)| lr 9.41e-05 | 323.08 ms | 52.2% bf16 MFU | 1622651 tok/s step 14674/19560 | loss 3.296983 (-0.66z)| norm 0.2802 (-0.22z)| lr 9.40e-05 | 322.28 ms | 52.4% bf16 MFU | 1622858 tok/s step 14675/19560 | loss 3.268457 (-1.28z)| norm 0.2908 (+0.33z)| lr 9.40e-05 | 323.88 ms | 52.1% bf16 MFU | 1622654 tok/s step 14676/19560 | loss 3.319182 (-0.15z)| norm 0.2823 (-0.11z)| lr 9.40e-05 | 322.77 ms | 52.3% bf16 MFU | 1622738 tok/s step 14677/19560 | loss 3.317216 (-0.20z)| norm 0.2758 (-0.43z)| lr 9.39e-05 | 322.52 ms | 52.3% bf16 MFU | 1622882 tok/s step 14678/19560 | loss 3.353005 (+0.61z)| norm 0.2680 (-0.84z)| lr 9.39e-05 | 323.23 ms | 52.2% bf16 MFU | 1622838 tok/s step 14679/19560 | loss 3.363497 (+0.83z)| norm 0.2601 (-1.23z)| lr 9.39e-05 | 322.79 ms | 52.3% bf16 MFU | 1622907 tok/s step 14680/19560 | loss 3.318771 (-0.17z)| norm 0.2729 (-0.57z)| lr 9.38e-05 | 323.67 ms | 52.1% bf16 MFU | 1622754 tok/s step 14681/19560 | loss 3.299824 (-0.60z)| norm 0.2605 (-1.20z)| lr 9.38e-05 | 323.07 ms | 52.2% bf16 MFU | 1622757 tok/s step 14682/19560 | loss 3.260604 (-1.47z)| norm 0.2817 (-0.09z)| lr 9.37e-05 | 323.02 ms | 52.2% bf16 MFU | 1622773 tok/s step 14683/19560 | loss 3.277892 (-1.07z)| norm 0.2740 (-0.50z)| lr 9.37e-05 | 322.97 ms | 52.3% bf16 MFU | 1622802 tok/s step 14684/19560 | loss 3.329655 (+0.08z)| norm 0.2797 (-0.20z)| lr 9.37e-05 | 
323.48 ms | 52.2% bf16 MFU | 1622701 tok/s step 14685/19560 | loss 3.341185 (+0.34z)| norm 0.2943 (+0.56z)| lr 9.36e-05 | 322.98 ms | 52.3% bf16 MFU | 1622730 tok/s step 14686/19560 | loss 3.218753 (-2.32z)| norm 0.2854 (+0.08z)| lr 9.36e-05 | 322.78 ms | 52.3% bf16 MFU | 1622809 tok/s step 14687/19560 | loss 3.448954 (+2.60z)| norm 0.3147 (+1.60z)| lr 9.36e-05 | 322.80 ms | 52.3% bf16 MFU | 1622879 tok/s step 14688/19560 | loss 3.354295 (+0.59z)| norm 0.2822 (-0.10z)| lr 9.35e-05 | 323.14 ms | 52.2% bf16 MFU | 1622859 tok/s step 14689/19560 | loss 3.255864 (-1.48z)| norm 0.2903 (+0.32z)| lr 9.35e-05 | 323.15 ms | 52.2% bf16 MFU | 1622837 tok/s step 14690/19560 | loss 3.331456 (+0.12z)| norm 0.2642 (-1.05z)| lr 9.35e-05 | 323.25 ms | 52.2% bf16 MFU | 1622793 tok/s step 14691/19560 | loss 3.302756 (-0.49z)| norm 0.2754 (-0.47z)| lr 9.34e-05 | 322.49 ms | 52.3% bf16 MFU | 1622940 tok/s step 14692/19560 | loss 3.316894 (-0.18z)| norm 0.2810 (-0.19z)| lr 9.34e-05 | 322.99 ms | 52.3% bf16 MFU | 1622956 tok/s step 14693/19560 | loss 3.320228 (-0.11z)| norm 0.2633 (-1.12z)| lr 9.33e-05 | 323.16 ms | 52.2% bf16 MFU | 1622927 tok/s step 14694/19560 | loss 3.330529 (+0.09z)| norm 0.2838 (-0.04z)| lr 9.33e-05 | 323.29 ms | 52.2% bf16 MFU | 1622866 tok/s step 14695/19560 | loss 3.255223 (-1.52z)| norm 0.2667 (-0.95z)| lr 9.33e-05 | 322.77 ms | 52.3% bf16 MFU | 1622941 tok/s step 14696/19560 | loss 3.299322 (-0.58z)| norm 0.2696 (-0.79z)| lr 9.32e-05 | 322.63 ms | 52.3% bf16 MFU | 1623045 tok/s step 14697/19560 | loss 3.274226 (-1.11z)| norm 0.2803 (-0.23z)| lr 9.32e-05 | 322.83 ms | 52.3% bf16 MFU | 1623094 tok/s step 14698/19560 | loss 3.358637 (+0.70z)| norm 0.2978 (+0.70z)| lr 9.32e-05 | 323.09 ms | 52.2% bf16 MFU | 1623075 tok/s step 14699/19560 | loss 3.275182 (-1.08z)| norm 0.2547 (-1.58z)| lr 9.31e-05 | 323.23 ms | 52.2% bf16 MFU | 1623022 tok/s step 14700/19560 | loss 3.314283 (-0.24z)| norm 0.3062 (+1.12z)| lr 9.31e-05 | 323.07 ms | 52.2% bf16 MFU | 1623012 tok/s step 
14701/19560 | loss 3.303664 (-0.45z)| norm 0.2961 (+0.58z)| lr 9.31e-05 | 322.13 ms | 52.4% bf16 MFU | 1623239 tok/s step 14702/19560 | loss 3.315542 (-0.21z)| norm 0.2918 (+0.35z)| lr 9.30e-05 | 323.38 ms | 52.2% bf16 MFU | 1623141 tok/s step 14703/19560 | loss 3.383028 (+1.25z)| norm 0.3296 (+2.28z)| lr 9.30e-05 | 323.22 ms | 52.2% bf16 MFU | 1623087 tok/s step 14704/19560 | loss 3.268076 (-1.22z)| norm 0.2751 (-0.52z)| lr 9.29e-05 | 323.29 ms | 52.2% bf16 MFU | 1623020 tok/s step 14705/19560 | loss 3.314040 (-0.23z)| norm 0.2893 (+0.21z)| lr 9.29e-05 | 323.37 ms | 52.2% bf16 MFU | 1622934 tok/s step 14706/19560 | loss 3.316637 (-0.18z)| norm 0.2907 (+0.27z)| lr 9.29e-05 | 322.83 ms | 52.3% bf16 MFU | 1622990 tok/s step 14707/19560 | loss 3.332049 (+0.14z)| norm 0.2998 (+0.75z)| lr 9.28e-05 | 322.96 ms | 52.3% bf16 MFU | 1623010 tok/s step 14708/19560 | loss 3.326202 (+0.01z)| norm 0.3240 (+1.97z)| lr 9.28e-05 | 323.56 ms | 52.2% bf16 MFU | 1622878 tok/s step 14709/19560 | loss 3.315637 (-0.20z)| norm 0.2740 (-0.59z)| lr 9.28e-05 | 322.97 ms | 52.3% bf16 MFU | 1622901 tok/s step 14710/19560 | loss 3.315759 (-0.20z)| norm 0.2922 (+0.34z)| lr 9.27e-05 | 322.54 ms | 52.3% bf16 MFU | 1623030 tok/s step 14711/19560 | loss 3.378990 (+1.17z)| norm 0.2849 (-0.04z)| lr 9.27e-05 | 322.82 ms | 52.3% bf16 MFU | 1623083 tok/s step 14712/19560 | loss 3.270624 (-1.17z)| norm 0.2861 (+0.01z)| lr 9.27e-05 | 322.60 ms | 52.3% bf16 MFU | 1623187 tok/s step 14713/19560 | loss 3.309302 (-0.33z)| norm 0.3101 (+1.24z)| lr 9.26e-05 | 322.83 ms | 52.3% bf16 MFU | 1623230 tok/s step 14714/19560 | loss 3.243169 (-1.72z)| norm 0.2966 (+0.53z)| lr 9.26e-05 | 323.01 ms | 52.2% bf16 MFU | 1623225 tok/s step 14715/19560 | loss 3.368210 (+0.96z)| norm 0.2899 (+0.18z)| lr 9.25e-05 | 322.51 ms | 52.3% bf16 MFU | 1623346 tok/s step 14716/19560 | loss 3.332047 (+0.18z)| norm 0.3135 (+1.39z)| lr 9.25e-05 | 322.27 ms | 52.4% bf16 MFU | 1623522 tok/s step 14717/19560 | loss 3.338883 (+0.32z)| norm 
0.2730 (-0.70z)| lr 9.25e-05 | 322.89 ms | 52.3% bf16 MFU | 1623532 tok/s step 14718/19560 | loss 3.312914 (-0.24z)| norm 0.2699 (-0.85z)| lr 9.24e-05 | 323.35 ms | 52.2% bf16 MFU | 1623426 tok/s step 14719/19560 | loss 3.318918 (-0.11z)| norm 0.2842 (-0.10z)| lr 9.24e-05 | 322.36 ms | 52.4% bf16 MFU | 1623575 tok/s step 14720/19560 | loss 3.341061 (+0.36z)| norm 0.2897 (+0.18z)| lr 9.24e-05 | 323.38 ms | 52.2% bf16 MFU | 1623460 tok/s step 14721/19560 | loss 3.262737 (-1.30z)| norm 0.3106 (+1.27z)| lr 9.23e-05 | 322.83 ms | 52.3% bf16 MFU | 1623490 tok/s step 14722/19560 | loss 3.301060 (-0.47z)| norm 0.3336 (+2.40z)| lr 9.23e-05 | 322.64 ms | 52.3% bf16 MFU | 1623566 tok/s step 14723/19560 | loss 3.278948 (-0.94z)| norm 0.3056 (+0.96z)| lr 9.23e-05 | 323.21 ms | 52.2% bf16 MFU | 1623493 tok/s step 14724/19560 | loss 3.297367 (-0.54z)| norm 0.3001 (+0.68z)| lr 9.22e-05 | 322.86 ms | 52.3% bf16 MFU | 1623514 tok/s step 14725/19560 | loss 3.347916 (+0.54z)| norm 0.3110 (+1.23z)| lr 9.22e-05 | 323.15 ms | 52.2% bf16 MFU | 1623461 tok/s step 14726/19560 | loss 3.291809 (-0.65z)| norm 0.2898 (+0.15z)| lr 9.22e-05 | 322.69 ms | 52.3% bf16 MFU | 1623524 tok/s step 14727/19560 | loss 3.370411 (+1.01z)| norm 0.2889 (+0.10z)| lr 9.21e-05 | 322.64 ms | 52.3% bf16 MFU | 1623598 tok/s step 14728/19560 | loss 3.278473 (-0.93z)| norm 0.3160 (+1.45z)| lr 9.21e-05 | 323.56 ms | 52.2% bf16 MFU | 1623437 tok/s step 14729/19560 | loss 3.320635 (-0.05z)| norm 0.2870 (-0.02z)| lr 9.20e-05 | 322.69 ms | 52.3% bf16 MFU | 1623502 tok/s step 14730/19560 | loss 3.347973 (+0.53z)| norm 0.2997 (+0.62z)| lr 9.20e-05 | 322.14 ms | 52.4% bf16 MFU | 1623704 tok/s step 14731/19560 | loss 3.270536 (-1.11z)| norm 0.2975 (+0.54z)| lr 9.20e-05 | 322.99 ms | 52.3% bf16 MFU | 1623679 tok/s step 14732/19560 | loss 3.265910 (-1.20z)| norm 0.3001 (+0.67z)| lr 9.19e-05 | 322.60 ms | 52.3% bf16 MFU | 1623754 tok/s step 14733/19560 | loss 3.416976 (+1.95z)| norm 0.2851 (-0.12z)| lr 9.19e-05 | 323.18 ms | 
52.2% bf16 MFU | 1623679 tok/s step 14734/19560 | loss 3.371188 (+0.98z)| norm 0.2938 (+0.37z)| lr 9.19e-05 | 322.75 ms | 52.3% bf16 MFU | 1623717 tok/s step 14735/19560 | loss 3.275900 (-0.99z)| norm 0.2823 (-0.24z)| lr 9.18e-05 | 323.12 ms | 52.2% bf16 MFU | 1623661 tok/s step 14736/19560 | loss 3.295782 (-0.57z)| norm 0.2880 (+0.10z)| lr 9.18e-05 | 322.54 ms | 52.3% bf16 MFU | 1623751 tok/s step 14737/19560 | loss 3.314758 (-0.18z)| norm 0.2751 (-0.64z)| lr 9.18e-05 | 322.43 ms | 52.3% bf16 MFU | 1623866 tok/s step 14738/19560 | loss 3.301896 (-0.44z)| norm 0.2782 (-0.45z)| lr 9.17e-05 | 323.01 ms | 52.2% bf16 MFU | 1623829 tok/s step 14739/19560 | loss 3.317125 (-0.11z)| norm 0.2636 (-1.29z)| lr 9.17e-05 | 323.16 ms | 52.2% bf16 MFU | 1623756 tok/s step 14740/19560 | loss 3.366692 (+0.93z)| norm 0.2869 (+0.08z)| lr 9.16e-05 | 322.72 ms | 52.3% bf16 MFU | 1623797 tok/s step 14741/19560 | loss 3.327028 (+0.10z)| norm 0.2796 (-0.35z)| lr 9.16e-05 | 323.31 ms | 52.2% bf16 MFU | 1623689 tok/s step 14742/19560 | loss 3.351406 (+0.60z)| norm 0.2774 (-0.49z)| lr 9.16e-05 | 322.44 ms | 52.3% bf16 MFU | 1623805 tok/s step 14743/19560 | loss 3.350744 (+0.62z)| norm 0.2627 (-1.35z)| lr 9.15e-05 | 322.72 ms | 52.3% bf16 MFU | 1623845 tok/s step 14744/19560 | loss 3.355334 (+0.71z)| norm 0.2706 (-0.87z)| lr 9.15e-05 | 322.66 ms | 52.3% bf16 MFU | 1623897 tok/s step 14745/19560 | loss 3.393557 (+1.51z)| norm 0.2704 (-0.87z)| lr 9.15e-05 | 323.25 ms | 52.2% bf16 MFU | 1623800 tok/s step 14746/19560 | loss 3.330378 (+0.15z)| norm 0.2810 (-0.24z)| lr 9.14e-05 | 322.56 ms | 52.3% bf16 MFU | 1623878 tok/s step 14747/19560 | loss 3.306025 (-0.38z)| norm 0.2596 (-1.49z)| lr 9.14e-05 | 322.95 ms | 52.3% bf16 MFU | 1623857 tok/s step 14748/19560 | loss 3.297245 (-0.57z)| norm 0.2769 (-0.46z)| lr 9.14e-05 | 322.70 ms | 52.3% bf16 MFU | 1623897 tok/s step 14749/19560 | loss 3.352581 (+0.62z)| norm 0.2870 (+0.12z)| lr 9.13e-05 | 322.76 ms | 52.3% bf16 MFU | 1623923 tok/s step 14750/19560 
| loss 3.274140 (-1.05z)| norm 0.2502 (-2.02z)| lr 9.13e-05 | 322.75 ms | 52.3% bf16 MFU | 1623948 tok/s val loss 3.306326 HellaSwag: 2968/10042 = 0.295559 step 14751/19560 | loss 3.315908 (-0.16z)| norm 0.2799 (-0.28z)| lr 9.13e-05 | 322.57 ms | 52.3% bf16 MFU | 1624019 tok/s step 14752/19560 | loss 3.355447 (+0.70z)| norm 0.2820 (-0.15z)| lr 9.12e-05 | 323.24 ms | 52.2% bf16 MFU | 1623916 tok/s step 14753/19560 | loss 3.268023 (-1.18z)| norm 0.2852 (+0.04z)| lr 9.12e-05 | 323.23 ms | 52.2% bf16 MFU | 1623821 tok/s step 14754/19560 | loss 3.317389 (-0.12z)| norm 0.2656 (-1.12z)| lr 
9.11e-05 | 322.91 ms | 52.3% bf16 MFU | 1623812 tok/s step 14755/19560 | loss 3.273330 (-1.07z)| norm 0.2697 (-0.87z)| lr 9.11e-05 | 322.57 ms | 52.3% bf16 MFU | 1623889 tok/s step 14756/19560 | loss 3.287201 (-0.77z)| norm 0.2724 (-0.71z)| lr 9.11e-05 | 322.72 ms | 52.3% bf16 MFU | 1623923 tok/s step 14757/19560 | loss 3.265180 (-1.23z)| norm 0.2946 (+0.58z)| lr 9.10e-05 | 323.50 ms | 52.2% bf16 MFU | 1623762 tok/s step 14758/19560 | loss 3.371528 (+1.04z)| norm 0.2878 (+0.17z)| lr 9.10e-05 | 322.73 ms | 52.3% bf16 MFU | 1623802 tok/s step 14759/19560 | loss 3.352093 (+0.61z)| norm 0.2684 (-0.97z)| lr 9.10e-05 | 322.09 ms | 52.4% bf16 MFU | 1624001 tok/s step 14760/19560 | loss 3.302168 (-0.46z)| norm 0.2828 (-0.12z)| lr 9.09e-05 | 323.01 ms | 52.2% bf16 MFU | 1623957 tok/s step 14761/19560 | loss 3.322095 (-0.01z)| norm 0.2915 (+0.40z)| lr 9.09e-05 | 322.86 ms | 52.3% bf16 MFU | 1623953 tok/s step 14762/19560 | loss 3.268325 (-1.18z)| norm 0.2710 (-0.83z)| lr 9.09e-05 | 323.06 ms | 52.2% bf16 MFU | 1623900 tok/s step 14763/19560 | loss 3.307783 (-0.30z)| norm 0.2722 (-0.76z)| lr 9.08e-05 | 323.35 ms | 52.2% bf16 MFU | 1623777 tok/s step 14764/19560 | loss 3.321353 (+0.03z)| norm 0.2862 (+0.07z)| lr 9.08e-05 | 322.60 ms | 52.3% bf16 MFU | 1623847 tok/s step 14765/19560 | loss 3.322289 (+0.06z)| norm 0.2806 (-0.27z)| lr 9.07e-05 | 323.12 ms | 52.2% bf16 MFU | 1623784 tok/s step 14766/19560 | loss 3.361911 (+0.97z)| norm 0.2929 (+0.46z)| lr 9.07e-05 | 323.02 ms | 52.2% bf16 MFU | 1623750 tok/s step 14767/19560 | loss 3.368946 (+1.11z)| norm 0.2617 (-1.41z)| lr 9.07e-05 | 322.55 ms | 52.3% bf16 MFU | 1623835 tok/s step 14768/19560 | loss 3.326156 (+0.13z)| norm 0.3015 (+0.97z)| lr 9.06e-05 | 323.09 ms | 52.2% bf16 MFU | 1623779 tok/s step 14769/19560 | loss 3.339064 (+0.41z)| norm 0.2768 (-0.52z)| lr 9.06e-05 | 323.05 ms | 52.2% bf16 MFU | 1623738 tok/s step 14770/19560 | loss 3.367064 (+1.05z)| norm 0.2710 (-0.87z)| lr 9.06e-05 | 323.23 ms | 52.2% bf16 MFU | 1623654 
tok/s step 14771/19560 | loss 3.356274 (+0.79z)| norm 0.3008 (+0.91z)| lr 9.05e-05 | 323.20 ms | 52.2% bf16 MFU | 1623578 tok/s step 14772/19560 | loss 3.481865 (+3.51z)| norm 0.2801 (-0.33z)| lr 9.05e-05 | 322.76 ms | 52.3% bf16 MFU | 1623618 tok/s step 14773/19560 | loss 3.438640 (+2.49z)| norm 0.2975 (+0.72z)| lr 9.05e-05 | 323.19 ms | 52.2% bf16 MFU | 1623549 tok/s step 14774/19560 | loss 3.298642 (-0.55z)| norm 0.2951 (+0.57z)| lr 9.04e-05 | 322.83 ms | 52.3% bf16 MFU | 1623573 tok/s step 14775/19560 | loss 3.299487 (-0.53z)| norm 0.2826 (-0.18z)| lr 9.04e-05 | 322.92 ms | 52.3% bf16 MFU | 1623573 tok/s step 14776/19560 | loss 3.445737 (+2.61z)| norm 0.4006 (+5.89z)| lr 9.04e-05 | 322.39 ms | 52.4% bf16 MFU | 1623707 tok/s step 14777/19560 | loss 3.349607 (+0.54z)| norm 0.2860 (-0.02z)| lr 9.03e-05 | 323.30 ms | 52.2% bf16 MFU | 1623606 tok/s step 14778/19560 | loss 3.312284 (-0.25z)| norm 0.2911 (+0.24z)| lr 9.03e-05 | 322.60 ms | 52.3% bf16 MFU | 1623686 tok/s step 14779/19560 | loss 3.324924 (+0.03z)| norm 0.2732 (-0.68z)| lr 9.02e-05 | 322.84 ms | 52.3% bf16 MFU | 1623701 tok/s step 14780/19560 | loss 3.320654 (-0.06z)| norm 0.2928 (+0.34z)| lr 9.02e-05 | 323.19 ms | 52.2% bf16 MFU | 1623626 tok/s step 14781/19560 | loss 3.306444 (-0.36z)| norm 0.2824 (-0.20z)| lr 9.02e-05 | 323.17 ms | 52.2% bf16 MFU | 1623562 tok/s step 14782/19560 | loss 3.254657 (-1.50z)| norm 0.2593 (-1.38z)| lr 9.01e-05 | 323.11 ms | 52.2% bf16 MFU | 1623515 tok/s step 14783/19560 | loss 3.322542 (-0.00z)| norm 0.2850 (-0.03z)| lr 9.01e-05 | 322.50 ms | 52.3% bf16 MFU | 1623625 tok/s step 14784/19560 | loss 3.338822 (+0.40z)| norm 0.2694 (-0.85z)| lr 9.01e-05 | 322.90 ms | 52.3% bf16 MFU | 1623628 tok/s step 14785/19560 | loss 3.331387 (+0.23z)| norm 0.3297 (+2.40z)| lr 9.00e-05 | 323.16 ms | 52.2% bf16 MFU | 1623566 tok/s step 14786/19560 | loss 3.336391 (+0.34z)| norm 0.2732 (-0.64z)| lr 9.00e-05 | 322.31 ms | 52.4% bf16 MFU | 1623721 tok/s step 14787/19560 | loss 3.357558 
(+0.83z)| norm 0.2889 (+0.21z)| lr 9.00e-05 | 323.01 ms | 52.3% bf16 MFU | 1623693 tok/s step 14788/19560 | loss 3.309794 (-0.28z)| norm 0.2844 (-0.03z)| lr 8.99e-05 | 323.32 ms | 52.2% bf16 MFU | 1623587 tok/s step 14789/19560 | loss 3.297402 (-0.56z)| norm 0.2864 (+0.07z)| lr 8.99e-05 | 322.68 ms | 52.3% bf16 MFU | 1623646 tok/s step 14790/19560 | loss 3.266138 (-1.27z)| norm 0.3149 (+1.57z)| lr 8.99e-05 | 322.95 ms | 52.3% bf16 MFU | 1623636 tok/s step 14791/19560 | loss 3.399016 (+1.78z)| norm 0.3003 (+0.78z)| lr 8.98e-05 | 323.06 ms | 52.2% bf16 MFU | 1623598 tok/s step 14792/19560 | loss 3.270177 (-1.17z)| norm 0.2974 (+0.62z)| lr 8.98e-05 | 322.63 ms | 52.3% bf16 MFU | 1623670 tok/s step 14793/19560 | loss 3.369782 (+1.09z)| norm 0.2821 (-0.21z)| lr 8.97e-05 | 323.00 ms | 52.3% bf16 MFU | 1623646 tok/s step 14794/19560 | loss 3.276211 (-1.02z)| norm 0.2892 (+0.18z)| lr 8.97e-05 | 323.00 ms | 52.3% bf16 MFU | 1623622 tok/s step 14795/19560 | loss 3.366523 (+1.03z)| norm 0.3126 (+1.41z)| lr 8.97e-05 | 323.58 ms | 52.2% bf16 MFU | 1623456 tok/s step 14796/19560 | loss 3.245254 (-1.72z)| norm 0.2686 (-0.93z)| lr 8.96e-05 | 322.65 ms | 52.3% bf16 MFU | 1623530 tok/s step 14797/19560 | loss 3.322327 (+0.04z)| norm 0.2718 (-0.75z)| lr 8.96e-05 | 322.71 ms | 52.3% bf16 MFU | 1623587 tok/s step 14798/19560 | loss 3.319678 (-0.02z)| norm 0.3340 (+2.49z)| lr 8.96e-05 | 322.94 ms | 52.3% bf16 MFU | 1623582 tok/s step 14799/19560 | loss 3.366070 (+1.03z)| norm 0.3003 (+0.72z)| lr 8.95e-05 | 323.16 ms | 52.2% bf16 MFU | 1623522 tok/s step 14800/19560 | loss 3.382704 (+1.38z)| norm 0.3406 (+2.71z)| lr 8.95e-05 | 323.47 ms | 52.2% bf16 MFU | 1623388 tok/s step 14801/19560 | loss 3.371898 (+1.12z)| norm 0.3146 (+1.37z)| lr 8.95e-05 | 322.77 ms | 52.3% bf16 MFU | 1623435 tok/s step 14802/19560 | loss 3.284118 (-0.87z)| norm 0.2716 (-0.79z)| lr 8.94e-05 | 322.97 ms | 52.3% bf16 MFU | 1623428 tok/s step 14803/19560 | loss 3.270307 (-1.19z)| norm 0.3056 (+0.91z)| lr 8.94e-05 | 
322.98 ms | 52.3% bf16 MFU | 1623422 tok/s step 14804/19560 | loss 3.354450 (+0.72z)| norm 0.2796 (-0.39z)| lr 8.94e-05 | 323.05 ms | 52.2% bf16 MFU | 1623398 tok/s step 14805/19560 | loss 3.260623 (-1.39z)| norm 0.2689 (-0.93z)| lr 8.93e-05 | 323.47 ms | 52.2% bf16 MFU | 1623269 tok/s step 14806/19560 | loss 3.320391 (-0.04z)| norm 0.2893 (+0.09z)| lr 8.93e-05 | 322.65 ms | 52.3% bf16 MFU | 1623352 tok/s step 14807/19560 | loss 3.292674 (-0.65z)| norm 0.2651 (-1.13z)| lr 8.93e-05 | 322.55 ms | 52.3% bf16 MFU | 1623457 tok/s step 14808/19560 | loss 3.319310 (-0.05z)| norm 0.2637 (-1.20z)| lr 8.92e-05 | 323.14 ms | 52.2% bf16 MFU | 1623409 tok/s step 14809/19560 | loss 3.238149 (-1.85z)| norm 0.2737 (-0.70z)| lr 8.92e-05 | 323.18 ms | 52.2% bf16 MFU | 1623352 tok/s step 14810/19560 | loss 3.299793 (-0.48z)| norm 0.2712 (-0.82z)| lr 8.91e-05 | 323.44 ms | 52.2% bf16 MFU | 1623233 tok/s step 14811/19560 | loss 3.225985 (-2.10z)| norm 0.2567 (-1.53z)| lr 8.91e-05 | 323.31 ms | 52.2% bf16 MFU | 1623151 tok/s step 14812/19560 | loss 3.327724 (+0.15z)| norm 0.2783 (-0.45z)| lr 8.91e-05 | 322.70 ms | 52.3% bf16 MFU | 1623227 tok/s step 14813/19560 | loss 3.351269 (+0.67z)| norm 0.2744 (-0.64z)| lr 8.90e-05 | 323.03 ms | 52.2% bf16 MFU | 1623218 tok/s step 14814/19560 | loss 3.248946 (-1.61z)| norm 0.2522 (-1.72z)| lr 8.90e-05 | 322.69 ms | 52.3% bf16 MFU | 1623295 tok/s step 14815/19560 | loss 3.259383 (-1.38z)| norm 0.2942 (+0.37z)| lr 8.90e-05 | 323.04 ms | 52.2% bf16 MFU | 1623279 tok/s step 14816/19560 | loss 3.325563 (+0.14z)| norm 0.2557 (-1.52z)| lr 8.89e-05 | 323.54 ms | 52.2% bf16 MFU | 1623140 tok/s step 14817/19560 | loss 3.322553 (+0.06z)| norm 0.2639 (-1.10z)| lr 8.89e-05 | 322.97 ms | 52.3% bf16 MFU | 1623150 tok/s step 14818/19560 | loss 3.313095 (-0.16z)| norm 0.2867 (+0.01z)| lr 8.89e-05 | 323.01 ms | 52.3% bf16 MFU | 1623150 tok/s step 14819/19560 | loss 3.318076 (-0.05z)| norm 0.2589 (-1.34z)| lr 8.88e-05 | 323.22 ms | 52.2% bf16 MFU | 1623096 tok/s step 
14820/19560 | loss 3.306306 (-0.32z)| norm 0.2785 (-0.39z)| lr 8.88e-05 | 323.07 ms | 52.2% bf16 MFU | 1623083 tok/s step 14821/19560 | loss 3.373919 (+1.23z)| norm 0.2638 (-1.11z)| lr 8.88e-05 | 322.93 ms | 52.3% bf16 MFU | 1623105 tok/s step 14822/19560 | loss 3.327048 (+0.15z)| norm 0.2639 (-1.09z)| lr 8.87e-05 | 322.86 ms | 52.3% bf16 MFU | 1623143 tok/s step 14823/19560 | loss 3.305273 (-0.36z)| norm 0.2824 (-0.19z)| lr 8.87e-05 | 322.67 ms | 52.3% bf16 MFU | 1623229 tok/s step 14824/19560 | loss 3.312473 (-0.19z)| norm 0.2720 (-0.70z)| lr 8.86e-05 | 323.14 ms | 52.2% bf16 MFU | 1623190 tok/s step 14825/19560 | loss 3.363021 (+0.97z)| norm 0.2930 (+0.32z)| lr 8.86e-05 | 322.51 ms | 52.3% bf16 MFU | 1623314 tok/s step 14826/19560 | loss 3.367828 (+1.07z)| norm 0.2953 (+0.43z)| lr 8.86e-05 | 323.24 ms | 52.2% bf16 MFU | 1623247 tok/s step 14827/19560 | loss 3.343507 (+0.50z)| norm 0.2507 (-1.74z)| lr 8.85e-05 | 322.96 ms | 52.3% bf16 MFU | 1623253 tok/s step 14828/19560 | loss 3.332286 (+0.23z)| norm 0.2687 (-0.85z)| lr 8.85e-05 | 323.11 ms | 52.2% bf16 MFU | 1623222 tok/s step 14829/19560 | loss 3.303221 (-0.44z)| norm 0.2933 (+0.35z)| lr 8.85e-05 | 322.60 ms | 52.3% bf16 MFU | 1623321 tok/s step 14830/19560 | loss 3.279265 (-0.99z)| norm 0.2706 (-0.75z)| lr 8.84e-05 | 323.16 ms | 52.2% bf16 MFU | 1623273 tok/s step 14831/19560 | loss 3.283644 (-0.88z)| norm 0.2631 (-1.10z)| lr 8.84e-05 | 323.12 ms | 52.2% bf16 MFU | 1623238 tok/s step 14832/19560 | loss 3.277079 (-1.03z)| norm 0.2732 (-0.60z)| lr 8.84e-05 | 322.61 ms | 52.3% bf16 MFU | 1623332 tok/s step 14833/19560 | loss 3.336979 (+0.36z)| norm 0.2688 (-0.81z)| lr 8.83e-05 | 322.95 ms | 52.3% bf16 MFU | 1623338 tok/s step 14834/19560 | loss 3.300973 (-0.48z)| norm 0.2760 (-0.45z)| lr 8.83e-05 | 322.60 ms | 52.3% bf16 MFU | 1623431 tok/s step 14835/19560 | loss 3.293193 (-0.65z)| norm 0.2571 (-1.36z)| lr 8.83e-05 | 323.21 ms | 52.2% bf16 MFU | 1623366 tok/s step 14836/19560 | loss 3.585188 (+5.37z)| norm 
0.3555 (+3.35z)| lr 8.82e-05 | 323.03 ms | 52.2% bf16 MFU | 1623348 tok/s step 14837/19560 | loss 3.306597 (-0.34z)| norm 0.2769 (-0.39z)| lr 8.82e-05 | 322.88 ms | 52.3% bf16 MFU | 1623369 tok/s step 14838/19560 | loss 3.357816 (+0.71z)| norm 0.2799 (-0.24z)| lr 8.82e-05 | 322.71 ms | 52.3% bf16 MFU | 1623432 tok/s step 14839/19560 | loss 3.307313 (-0.32z)| norm 0.2691 (-0.75z)| lr 8.81e-05 | 323.35 ms | 52.2% bf16 MFU | 1623331 tok/s step 14840/19560 | loss 3.345808 (+0.46z)| norm 0.3080 (+1.09z)| lr 8.81e-05 | 323.04 ms | 52.2% bf16 MFU | 1623314 tok/s step 14841/19560 | loss 3.293759 (-0.61z)| norm 0.2762 (-0.40z)| lr 8.80e-05 | 323.26 ms | 52.2% bf16 MFU | 1623242 tok/s step 14842/19560 | loss 3.319727 (-0.09z)| norm 0.2969 (+0.57z)| lr 8.80e-05 | 323.68 ms | 52.1% bf16 MFU | 1623069 tok/s step 14843/19560 | loss 3.328713 (+0.11z)| norm 0.2924 (+0.36z)| lr 8.80e-05 | 323.07 ms | 52.2% bf16 MFU | 1623058 tok/s step 14844/19560 | loss 3.261623 (-1.27z)| norm 0.2664 (-0.86z)| lr 8.79e-05 | 323.40 ms | 52.2% bf16 MFU | 1622964 tok/s step 14845/19560 | loss 3.279957 (-0.88z)| norm 0.2718 (-0.60z)| lr 8.79e-05 | 323.77 ms | 52.1% bf16 MFU | 1622782 tok/s step 14846/19560 | loss 3.294071 (-0.59z)| norm 0.2717 (-0.61z)| lr 8.79e-05 | 323.09 ms | 52.2% bf16 MFU | 1622779 tok/s step 14847/19560 | loss 3.327827 (+0.11z)| norm 0.2688 (-0.74z)| lr 8.78e-05 | 323.62 ms | 52.2% bf16 MFU | 1622643 tok/s step 14848/19560 | loss 3.338100 (+0.32z)| norm 0.2682 (-0.76z)| lr 8.78e-05 | 323.48 ms | 52.2% bf16 MFU | 1622550 tok/s step 14849/19560 | loss 3.367931 (+0.93z)| norm 0.3371 (+2.46z)| lr 8.78e-05 | 324.02 ms | 52.1% bf16 MFU | 1622327 tok/s step 14850/19560 | loss 3.341817 (+0.38z)| norm 0.2834 (-0.03z)| lr 8.77e-05 | 323.41 ms | 52.2% bf16 MFU | 1622268 tok/s step 14851/19560 | loss 3.319219 (-0.10z)| norm 0.2664 (-0.82z)| lr 8.77e-05 | 323.63 ms | 52.1% bf16 MFU | 1622155 tok/s step 14852/19560 | loss 3.322579 (-0.03z)| norm 0.2671 (-0.78z)| lr 8.77e-05 | 322.88 ms | 
52.3% bf16 MFU | 1622236 tok/s step 14853/19560 | loss 3.225204 (-2.01z)| norm 0.2838 (+0.03z)| lr 8.76e-05 | 322.86 ms | 52.3% bf16 MFU | 1622319 tok/s step 14854/19560 | loss 3.291073 (-0.66z)| norm 0.2632 (-0.95z)| lr 8.76e-05 | 323.38 ms | 52.2% bf16 MFU | 1622266 tok/s step 14855/19560 | loss 3.327419 (+0.10z)| norm 0.2684 (-0.69z)| lr 8.76e-05 | 323.87 ms | 52.1% bf16 MFU | 1622095 tok/s step 14856/19560 | loss 3.374954 (+1.06z)| norm 0.2565 (-1.24z)| lr 8.75e-05 | 323.95 ms | 52.1% bf16 MFU | 1621912 tok/s step 14857/19560 | loss 3.288100 (-0.72z)| norm 0.2822 (-0.00z)| lr 8.75e-05 | 323.29 ms | 52.2% bf16 MFU | 1621902 tok/s step 14858/19560 | loss 3.324928 (+0.04z)| norm 0.2573 (-1.18z)| lr 8.74e-05 | 323.13 ms | 52.2% bf16 MFU | 1621934 tok/s step 14859/19560 | loss 3.326027 (+0.05z)| norm 0.2660 (-0.75z)| lr 8.74e-05 | 323.66 ms | 52.1% bf16 MFU | 1621831 tok/s step 14860/19560 | loss 3.296196 (-0.57z)| norm 0.2490 (-1.54z)| lr 8.74e-05 | 323.60 ms | 52.2% bf16 MFU | 1621748 tok/s step 14861/19560 | loss 3.421993 (+2.03z)| norm 0.2708 (-0.49z)| lr 8.73e-05 | 323.68 ms | 52.1% bf16 MFU | 1621649 tok/s step 14862/19560 | loss 3.348094 (+0.51z)| norm 0.2840 (+0.14z)| lr 8.73e-05 | 323.47 ms | 52.2% bf16 MFU | 1621607 tok/s step 14863/19560 | loss 3.369999 (+0.95z)| norm 0.2633 (-0.84z)| lr 8.73e-05 | 322.59 ms | 52.3% bf16 MFU | 1621788 tok/s step 14864/19560 | loss 3.351243 (+0.55z)| norm 0.2940 (+0.61z)| lr 8.72e-05 | 323.58 ms | 52.2% bf16 MFU | 1621712 tok/s step 14865/19560 | loss 3.275653 (-1.01z)| norm 0.2664 (-0.69z)| lr 8.72e-05 | 323.26 ms | 52.2% bf16 MFU | 1621720 tok/s step 14866/19560 | loss 3.288883 (-0.73z)| norm 0.2707 (-0.48z)| lr 8.72e-05 | 323.11 ms | 52.2% bf16 MFU | 1621767 tok/s step 14867/19560 | loss 3.345703 (+0.44z)| norm 0.2728 (-0.39z)| lr 8.71e-05 | 322.81 ms | 52.3% bf16 MFU | 1621884 tok/s step 14868/19560 | loss 3.369045 (+0.92z)| norm 0.2468 (-1.59z)| lr 8.71e-05 | 323.19 ms | 52.2% bf16 MFU | 1621901 tok/s step 14869/19560 
| loss 3.257553 (-1.36z)| norm 0.2636 (-0.79z)| lr 8.71e-05 | 323.54 ms | 52.2% bf16 MFU | 1621830 tok/s step 14870/19560 | loss 3.344409 (+0.42z)| norm 0.2687 (-0.55z)| lr 8.70e-05 | 323.11 ms | 52.2% bf16 MFU | 1621871 tok/s step 14871/19560 | loss 3.356359 (+0.66z)| norm 0.2705 (-0.47z)| lr 8.70e-05 | 322.74 ms | 52.3% bf16 MFU | 1622002 tok/s step 14872/19560 | loss 3.331661 (+0.16z)| norm 0.2818 (+0.05z)| lr 8.70e-05 | 323.29 ms | 52.2% bf16 MFU | 1621988 tok/s step 14873/19560 | loss 3.336437 (+0.27z)| norm 0.2660 (-0.69z)| lr 8.69e-05 | 322.87 ms | 52.3% bf16 MFU | 1622080 tok/s step 14874/19560 | loss 3.305993 (-0.36z)| norm 0.2787 (-0.09z)| lr 8.69e-05 | 323.32 ms | 52.2% bf16 MFU | 1622055 tok/s step 14875/19560 | loss 3.285591 (-0.77z)| norm 0.2711 (-0.45z)| lr 8.68e-05 | 322.66 ms | 52.3% bf16 MFU | 1622197 tok/s step 14876/19560 | loss 3.294886 (-0.58z)| norm 0.2869 (+0.29z)| lr 8.68e-05 | 323.76 ms | 52.1% bf16 MFU | 1622056 tok/s step 14877/19560 | loss 3.326962 (+0.08z)| norm 0.2754 (-0.25z)| lr 8.68e-05 | 322.58 ms | 52.3% bf16 MFU | 1622217 tok/s step 14878/19560 | loss 3.263231 (-1.23z)| norm 0.2675 (-0.63z)| lr 8.67e-05 | 322.71 ms | 52.3% bf16 MFU | 1622338 tok/s step 14879/19560 | loss 3.356878 (+0.70z)| norm 0.3022 (+1.00z)| lr 8.67e-05 | 323.68 ms | 52.1% bf16 MFU | 1622211 tok/s step 14880/19560 | loss 3.383675 (+1.24z)| norm 0.2595 (-1.00z)| lr 8.67e-05 | 322.74 ms | 52.3% bf16 MFU | 1622324 tok/s step 14881/19560 | loss 3.321958 (-0.04z)| norm 0.2642 (-0.77z)| lr 8.66e-05 | 323.16 ms | 52.2% bf16 MFU | 1622327 tok/s step 14882/19560 | loss 3.326441 (+0.05z)| norm 0.2809 (+0.01z)| lr 8.66e-05 | 322.72 ms | 52.3% bf16 MFU | 1622440 tok/s step 14883/19560 | loss 3.295329 (-0.59z)| norm 0.2718 (-0.42z)| lr 8.66e-05 | 322.67 ms | 52.3% bf16 MFU | 1622560 tok/s step 14884/19560 | loss 3.339441 (+0.31z)| norm 0.2687 (-0.56z)| lr 8.65e-05 | 322.90 ms | 52.3% bf16 MFU | 1622617 tok/s step 14885/19560 | loss 3.297253 (-0.57z)| norm 0.2673 (-0.62z)| 
lr 8.65e-05 | 323.81 ms | 52.1% bf16 MFU | 1622442 tok/s step 14886/19560 | loss 3.323385 (-0.02z)| norm 0.2705 (-0.46z)| lr 8.65e-05 | 322.59 ms | 52.3% bf16 MFU | 1622582 tok/s step 14887/19560 | loss 3.315195 (-0.18z)| norm 0.2832 (+0.13z)| lr 8.64e-05 | 322.82 ms | 52.3% bf16 MFU | 1622656 tok/s step 14888/19560 | loss 3.303366 (-0.43z)| norm 0.2682 (-0.57z)| lr 8.64e-05 | 322.97 ms | 52.3% bf16 MFU | 1622690 tok/s step 14889/19560 | loss 3.335373 (+0.24z)| norm 0.2922 (+0.56z)| lr 8.64e-05 | 322.66 ms | 52.3% bf16 MFU | 1622799 tok/s step 14890/19560 | loss 3.272952 (-1.07z)| norm 0.2843 (+0.18z)| lr 8.63e-05 | 322.58 ms | 52.3% bf16 MFU | 1622925 tok/s step 14891/19560 | loss 3.367453 (+0.89z)| norm 0.2953 (+0.69z)| lr 8.63e-05 | 323.09 ms | 52.2% bf16 MFU | 1622916 tok/s step 14892/19560 | loss 3.292826 (-0.66z)| norm 0.3198 (+1.80z)| lr 8.62e-05 | 323.10 ms | 52.2% bf16 MFU | 1622905 tok/s step 14893/19560 | loss 3.283808 (-0.84z)| norm 0.2677 (-0.61z)| lr 8.62e-05 | 323.08 ms | 52.2% bf16 MFU | 1622898 tok/s step 14894/19560 | loss 3.292761 (-0.64z)| norm 0.3370 (+2.52z)| lr 8.62e-05 | 322.60 ms | 52.3% bf16 MFU | 1623013 tok/s step 14895/19560 | loss 3.294143 (-0.60z)| norm 0.2849 (+0.16z)| lr 8.61e-05 | 322.55 ms | 52.3% bf16 MFU | 1623134 tok/s step 14896/19560 | loss 3.372518 (+1.02z)| norm 0.3321 (+2.24z)| lr 8.61e-05 | 323.39 ms | 52.2% bf16 MFU | 1623040 tok/s step 14897/19560 | loss 3.320526 (-0.05z)| norm 0.3174 (+1.56z)| lr 8.61e-05 | 322.53 ms | 52.3% bf16 MFU | 1623166 tok/s step 14898/19560 | loss 3.334100 (+0.23z)| norm 0.2833 (+0.06z)| lr 8.60e-05 | 322.82 ms | 52.3% bf16 MFU | 1623213 tok/s step 14899/19560 | loss 3.320131 (-0.05z)| norm 0.3286 (+2.02z)| lr 8.60e-05 | 322.46 ms | 52.3% bf16 MFU | 1623347 tok/s step 14900/19560 | loss 3.333313 (+0.26z)| norm 0.3109 (+1.23z)| lr 8.60e-05 | 323.12 ms | 52.2% bf16 MFU | 1623308 tok/s step 14901/19560 | loss 3.279978 (-0.90z)| norm 0.2953 (+0.55z)| lr 8.59e-05 | 323.42 ms | 52.2% bf16 MFU | 
1623196 tok/s step 14902/19560 | loss 3.260215 (-1.33z)| norm 0.2966 (+0.61z)| lr 8.59e-05 | 322.70 ms | 52.3% bf16 MFU | 1623270 tok/s step 14903/19560 | loss 3.270532 (-1.09z)| norm 0.2785 (-0.17z)| lr 8.59e-05 | 322.81 ms | 52.3% bf16 MFU | 1623314 tok/s step 14904/19560 | loss 3.316763 (-0.04z)| norm 0.2845 (+0.14z)| lr 8.58e-05 | 323.03 ms | 52.2% bf16 MFU | 1623299 tok/s step 14905/19560 | loss 3.258182 (-1.36z)| norm 0.2918 (+0.49z)| lr 8.58e-05 | 322.86 ms | 52.3% bf16 MFU | 1623329 tok/s step 14906/19560 | loss 3.338679 (+0.47z)| norm 0.3006 (+0.92z)| lr 8.58e-05 | 322.76 ms | 52.3% bf16 MFU | 1623383 tok/s step 14907/19560 | loss 3.335790 (+0.40z)| norm 0.2938 (+0.58z)| lr 8.57e-05 | 322.94 ms | 52.3% bf16 MFU | 1623388 tok/s step 14908/19560 | loss 3.322053 (+0.09z)| norm 0.2624 (-0.92z)| lr 8.57e-05 | 322.73 ms | 52.3% bf16 MFU | 1623447 tok/s step 14909/19560 | loss 3.341867 (+0.53z)| norm 0.3017 (+0.96z)| lr 8.57e-05 | 323.58 ms | 52.2% bf16 MFU | 1623288 tok/s step 14910/19560 | loss 3.339721 (+0.47z)| norm 0.2928 (+0.52z)| lr 8.56e-05 | 322.39 ms | 52.3% bf16 MFU | 1623436 tok/s step 14911/19560 | loss 3.280644 (-0.87z)| norm 0.2528 (-1.38z)| lr 8.56e-05 | 322.77 ms | 52.3% bf16 MFU | 1623481 tok/s step 14912/19560 | loss 3.339982 (+0.48z)| norm 0.2903 (+0.40z)| lr 8.55e-05 | 322.91 ms | 52.3% bf16 MFU | 1623489 tok/s step 14913/19560 | loss 3.244850 (-1.65z)| norm 0.2757 (-0.28z)| lr 8.55e-05 | 323.02 ms | 52.2% bf16 MFU | 1623468 tok/s step 14914/19560 | loss 3.287596 (-0.68z)| norm 0.2871 (+0.27z)| lr 8.55e-05 | 322.75 ms | 52.3% bf16 MFU | 1623516 tok/s step 14915/19560 | loss 3.378326 (+1.35z)| norm 0.2937 (+0.59z)| lr 8.54e-05 | 322.67 ms | 52.3% bf16 MFU | 1623582 tok/s step 14916/19560 | loss 3.284620 (-0.74z)| norm 0.3028 (+1.02z)| lr 8.54e-05 | 322.76 ms | 52.3% bf16 MFU | 1623621 tok/s step 14917/19560 | loss 3.379394 (+1.35z)| norm 0.2679 (-0.67z)| lr 8.54e-05 | 322.90 ms | 52.3% bf16 MFU | 1623626 tok/s step 14918/19560 | loss 3.288945 
(-0.66z)| norm 0.2888 (+0.36z)| lr 8.53e-05 | 322.68 ms | 52.3% bf16 MFU | 1623685 tok/s step 14919/19560 | loss 3.264535 (-1.19z)| norm 0.2935 (+0.60z)| lr 8.53e-05 | 322.56 ms | 52.3% bf16 MFU | 1623772 tok/s step 14920/19560 | loss 3.326157 (+0.18z)| norm 0.2874 (+0.30z)| lr 8.53e-05 | 323.26 ms | 52.2% bf16 MFU | 1623676 tok/s step 14921/19560 | loss 3.344274 (+0.60z)| norm 0.2870 (+0.28z)| lr 8.52e-05 | 323.25 ms | 52.2% bf16 MFU | 1623588 tok/s step 14922/19560 | loss 3.321726 (+0.08z)| norm 0.3028 (+1.05z)| lr 8.52e-05 | 323.24 ms | 52.2% bf16 MFU | 1623507 tok/s step 14923/19560 | loss 3.345268 (+0.62z)| norm 0.2739 (-0.36z)| lr 8.52e-05 | 322.44 ms | 52.3% bf16 MFU | 1623632 tok/s step 14924/19560 | loss 3.286825 (-0.72z)| norm 0.3153 (+1.66z)| lr 8.51e-05 | 322.91 ms | 52.3% bf16 MFU | 1623632 tok/s step 14925/19560 | loss 3.335185 (+0.39z)| norm 0.2887 (+0.35z)| lr 8.51e-05 | 322.93 ms | 52.3% bf16 MFU | 1623627 tok/s step 14926/19560 | loss 3.318074 (-0.01z)| norm 0.2750 (-0.31z)| lr 8.51e-05 | 322.85 ms | 52.3% bf16 MFU | 1623641 tok/s step 14927/19560 | loss 3.344263 (+0.60z)| norm 0.3043 (+1.16z)| lr 8.50e-05 | 322.68 ms | 52.3% bf16 MFU | 1623699 tok/s step 14928/19560 | loss 3.291999 (-0.59z)| norm 0.3138 (+1.69z)| lr 8.50e-05 | 322.22 ms | 52.4% bf16 MFU | 1623869 tok/s step 14929/19560 | loss 3.243263 (-1.69z)| norm 0.2523 (-1.46z)| lr 8.50e-05 | 323.22 ms | 52.2% bf16 MFU | 1623780 tok/s step 14930/19560 | loss 3.295797 (-0.48z)| norm 0.2810 (+0.02z)| lr 8.49e-05 | 323.19 ms | 52.2% bf16 MFU | 1623703 tok/s step 14931/19560 | loss 3.322266 (+0.12z)| norm 0.2788 (-0.08z)| lr 8.49e-05 | 322.91 ms | 52.3% bf16 MFU | 1623699 tok/s step 14932/19560 | loss 3.363373 (+1.07z)| norm 0.2657 (-0.76z)| lr 8.49e-05 | 322.83 ms | 52.3% bf16 MFU | 1623717 tok/s step 14933/19560 | loss 3.335649 (+0.42z)| norm 0.2854 (+0.26z)| lr 8.48e-05 | 322.68 ms | 52.3% bf16 MFU | 1623772 tok/s step 14934/19560 | loss 3.285916 (-0.73z)| norm 0.2799 (-0.02z)| lr 8.48e-05 | 
322.99 ms | 52.3% bf16 MFU | 1623745 tok/s step 14935/19560 | loss 3.279597 (-0.88z)| norm 0.2669 (-0.70z)| lr 8.47e-05 | 322.96 ms | 52.3% bf16 MFU | 1623727 tok/s step 14936/19560 | loss 3.296430 (-0.48z)| norm 0.2808 (+0.02z)| lr 8.47e-05 | 322.86 ms | 52.3% bf16 MFU | 1623736 tok/s step 14937/19560 | loss 3.265659 (-1.21z)| norm 0.2634 (-0.89z)| lr 8.47e-05 | 323.08 ms | 52.2% bf16 MFU | 1623687 tok/s step 14938/19560 | loss 3.346330 (+0.67z)| norm 0.2649 (-0.80z)| lr 8.46e-05 | 322.28 ms | 52.4% bf16 MFU | 1623843 tok/s step 14939/19560 | loss 3.303365 (-0.36z)| norm 0.2924 (+0.62z)| lr 8.46e-05 | 323.18 ms | 52.2% bf16 MFU | 1623764 tok/s step 14940/19560 | loss 3.343558 (+0.60z)| norm 0.2726 (-0.41z)| lr 8.46e-05 | 322.69 ms | 52.3% bf16 MFU | 1623814 tok/s step 14941/19560 | loss 3.342022 (+0.56z)| norm 0.3155 (+1.78z)| lr 8.45e-05 | 322.73 ms | 52.3% bf16 MFU | 1623851 tok/s step 14942/19560 | loss 3.304346 (-0.35z)| norm 0.2922 (+0.57z)| lr 8.45e-05 | 322.50 ms | 52.3% bf16 MFU | 1623943 tok/s step 14943/19560 | loss 3.370504 (+1.23z)| norm 0.2758 (-0.27z)| lr 8.45e-05 | 323.11 ms | 52.2% bf16 MFU | 1623877 tok/s step 14944/19560 | loss 3.334793 (+0.36z)| norm 0.3170 (+1.83z)| lr 8.44e-05 | 323.19 ms | 52.2% bf16 MFU | 1623794 tok/s step 14945/19560 | loss 3.312448 (-0.17z)| norm 0.2661 (-0.80z)| lr 8.44e-05 | 322.50 ms | 52.3% bf16 MFU | 1623889 tok/s step 14946/19560 | loss 3.279186 (-0.97z)| norm 0.2885 (+0.36z)| lr 8.44e-05 | 323.32 ms | 52.2% bf16 MFU | 1623772 tok/s step 14947/19560 | loss 3.309161 (-0.24z)| norm 0.2768 (-0.25z)| lr 8.43e-05 | 322.74 ms | 52.3% bf16 MFU | 1623808 tok/s step 14948/19560 | loss 3.282634 (-0.87z)| norm 0.2665 (-0.78z)| lr 8.43e-05 | 322.27 ms | 52.4% bf16 MFU | 1623961 tok/s step 14949/19560 | loss 3.289162 (-0.71z)| norm 0.3164 (+1.77z)| lr 8.43e-05 | 323.17 ms | 52.2% bf16 MFU | 1623880 tok/s step 14950/19560 | loss 3.298436 (-0.48z)| norm 0.2638 (-0.93z)| lr 8.42e-05 | 322.36 ms | 52.4% bf16 MFU | 1624007 tok/s step 
14951/19560 | loss 3.272095 (-1.10z)| norm 0.2544 (-1.40z)| lr 8.42e-05 | 323.40 ms | 52.2% bf16 MFU | 1623866 tok/s step 14952/19560 | loss 3.317352 (-0.02z)| norm 0.3030 (+1.07z)| lr 8.42e-05 | 322.96 ms | 52.3% bf16 MFU | 1623841 tok/s step 14953/19560 | loss 3.258298 (-1.41z)| norm 0.2778 (-0.21z)| lr 8.41e-05 | 322.85 ms | 52.3% bf16 MFU | 1623846 tok/s step 14954/19560 | loss 3.320545 (+0.09z)| norm 0.2692 (-0.63z)| lr 8.41e-05 | 322.89 ms | 52.3% bf16 MFU | 1623840 tok/s step 14955/19560 | loss 3.413864 (+2.27z)| norm 0.3283 (+2.31z)| lr 8.41e-05 | 322.34 ms | 52.4% bf16 MFU | 1623974 tok/s step 14956/19560 | loss 3.370889 (+1.25z)| norm 0.2716 (-0.54z)| lr 8.40e-05 | 322.84 ms | 52.3% bf16 MFU | 1623974 tok/s step 14957/19560 | loss 3.317171 (-0.02z)| norm 0.2667 (-0.78z)| lr 8.40e-05 | 322.63 ms | 52.3% bf16 MFU | 1624028 tok/s step 14958/19560 | loss 3.358171 (+0.93z)| norm 0.2607 (-1.07z)| lr 8.39e-05 | 322.99 ms | 52.3% bf16 MFU | 1623988 tok/s step 14959/19560 | loss 3.300744 (-0.42z)| norm 0.2734 (-0.44z)| lr 8.39e-05 | 323.05 ms | 52.2% bf16 MFU | 1623934 tok/s step 14960/19560 | loss 3.321808 (+0.07z)| norm 0.2569 (-1.25z)| lr 8.39e-05 | 322.40 ms | 52.3% bf16 MFU | 1624048 tok/s step 14961/19560 | loss 3.343676 (+0.58z)| norm 0.2731 (-0.45z)| lr 8.38e-05 | 322.56 ms | 52.3% bf16 MFU | 1624115 tok/s step 14962/19560 | loss 3.296909 (-0.52z)| norm 0.2639 (-0.90z)| lr 8.38e-05 | 323.07 ms | 52.2% bf16 MFU | 1624051 tok/s step 14963/19560 | loss 3.240511 (-1.81z)| norm 0.2759 (-0.31z)| lr 8.38e-05 | 322.47 ms | 52.3% bf16 MFU | 1624142 tok/s step 14964/19560 | loss 3.315485 (-0.03z)| norm 0.2685 (-0.68z)| lr 8.37e-05 | 322.62 ms | 52.3% bf16 MFU | 1624190 tok/s step 14965/19560 | loss 3.380263 (+1.74z)| norm 0.2962 (+0.77z)| lr 8.37e-05 | 323.47 ms | 52.2% bf16 MFU | 1624020 tok/s step 14966/19560 | loss 3.301943 (-0.40z)| norm 0.2761 (-0.29z)| lr 8.37e-05 | 322.96 ms | 52.3% bf16 MFU | 1623989 tok/s step 14967/19560 | loss 3.272875 (-1.19z)| norm 
0.2859 (+0.22z)| lr 8.36e-05 | 322.78 ms | 52.3% bf16 MFU | 1624004 tok/s step 14968/19560 | loss 3.304837 (-0.31z)| norm 0.2569 (-1.29z)| lr 8.36e-05 | 322.35 ms | 52.4% bf16 MFU | 1624127 tok/s step 14969/19560 | loss 3.360367 (+1.20z)| norm 0.2923 (+0.58z)| lr 8.36e-05 | 323.19 ms | 52.2% bf16 MFU | 1624032 tok/s step 14970/19560 | loss 3.289155 (-0.74z)| norm 0.2863 (+0.26z)| lr 8.35e-05 | 322.52 ms | 52.3% bf16 MFU | 1624111 tok/s step 14971/19560 | loss 3.443458 (+3.31z)| norm 0.2648 (-0.86z)| lr 8.35e-05 | 322.60 ms | 52.3% bf16 MFU | 1624164 tok/s step 14972/19560 | loss 3.301742 (-0.41z)| norm 0.2740 (-0.38z)| lr 8.35e-05 | 322.79 ms | 52.3% bf16 MFU | 1624168 tok/s step 14973/19560 | loss 3.282782 (-0.92z)| norm 0.2685 (-0.67z)| lr 8.34e-05 | 322.83 ms | 52.3% bf16 MFU | 1624162 tok/s step 14974/19560 | loss 3.291063 (-0.70z)| norm 0.2661 (-0.79z)| lr 8.34e-05 | 322.74 ms | 52.3% bf16 MFU | 1624180 tok/s step 14975/19560 | loss 3.282278 (-0.92z)| norm 0.2685 (-0.66z)| lr 8.34e-05 | 323.00 ms | 52.3% bf16 MFU | 1624129 tok/s step 14976/19560 | loss 3.281315 (-0.93z)| norm 0.2761 (-0.27z)| lr 8.33e-05 | 322.56 ms | 52.3% bf16 MFU | 1624194 tok/s step 14977/19560 | loss 3.271656 (-1.16z)| norm 0.2563 (-1.32z)| lr 8.33e-05 | 322.09 ms | 52.4% bf16 MFU | 1624372 tok/s step 14978/19560 | loss 3.261695 (-1.40z)| norm 0.2714 (-0.49z)| lr 8.33e-05 | 322.83 ms | 52.3% bf16 MFU | 1624356 tok/s step 14979/19560 | loss 3.298573 (-0.43z)| norm 0.2538 (-1.44z)| lr 8.32e-05 | 323.31 ms | 52.2% bf16 MFU | 1624218 tok/s step 14980/19560 | loss 3.306345 (-0.23z)| norm 0.2551 (-1.35z)| lr 8.32e-05 | 323.04 ms | 52.2% bf16 MFU | 1624157 tok/s step 14981/19560 | loss 3.246754 (-1.80z)| norm 0.2736 (-0.35z)| lr 8.32e-05 | 323.07 ms | 52.2% bf16 MFU | 1624092 tok/s step 14982/19560 | loss 3.343922 (+0.75z)| norm 0.2813 (+0.05z)| lr 8.31e-05 | 322.23 ms | 52.4% bf16 MFU | 1624241 tok/s step 14983/19560 | loss 3.305102 (-0.27z)| norm 0.2893 (+0.48z)| lr 8.31e-05 | 322.88 ms | 
52.3% bf16 MFU | 1624218 tok/s step 14984/19560 | loss 3.325479 (+0.28z)| norm 0.2687 (-0.64z)| lr 8.30e-05 | 322.86 ms | 52.3% bf16 MFU | 1624201 tok/s step 14985/19560 | loss 3.291720 (-0.62z)| norm 0.2885 (+0.43z)| lr 8.30e-05 | 322.96 ms | 52.3% bf16 MFU | 1624160 tok/s step 14986/19560 | loss 3.284084 (-0.81z)| norm 0.3259 (+2.39z)| lr 8.30e-05 | 323.06 ms | 52.2% bf16 MFU | 1624097 tok/s step 14987/19560 | loss 3.345527 (+0.81z)| norm 0.2796 (-0.09z)| lr 8.29e-05 | 322.87 ms | 52.3% bf16 MFU | 1624084 tok/s step 14988/19560 | loss 3.238836 (-1.97z)| norm 0.2977 (+0.87z)| lr 8.29e-05 | 322.58 ms | 52.3% bf16 MFU | 1624144 tok/s step 14989/19560 | loss 3.279452 (-0.91z)| norm 0.3231 (+2.19z)| lr 8.29e-05 | 323.16 ms | 52.2% bf16 MFU | 1624055 tok/s step 14990/19560 | loss 3.441341 (+3.29z)| norm 0.3263 (+2.29z)| lr 8.28e-05 | 322.63 ms | 52.3% bf16 MFU | 1624105 tok/s step 14991/19560 | loss 3.259808 (-1.38z)| norm 0.3281 (+2.31z)| lr 8.28e-05 | 323.41 ms | 52.2% bf16 MFU | 1623956 tok/s step 14992/19560 | loss 3.341467 (+0.73z)| norm 0.3471 (+3.14z)| lr 8.28e-05 | 322.65 ms | 52.3% bf16 MFU | 1624006 tok/s step 14993/19560 | loss 3.319737 (+0.16z)| norm 0.2998 (+0.80z)| lr 8.27e-05 | 322.79 ms | 52.3% bf16 MFU | 1624019 tok/s step 14994/19560 | loss 3.301687 (-0.31z)| norm 0.3418 (+2.76z)| lr 8.27e-05 | 322.95 ms | 52.3% bf16 MFU | 1623989 tok/s step 14995/19560 | loss 3.359410 (+1.19z)| norm 0.3755 (+4.04z)| lr 8.27e-05 | 322.87 ms | 52.3% bf16 MFU | 1623980 tok/s step 14996/19560 | loss 3.363850 (+1.30z)| norm 0.3058 (+0.92z)| lr 8.26e-05 | 323.33 ms | 52.2% bf16 MFU | 1623856 tok/s step 14997/19560 | loss 3.345203 (+0.81z)| norm 0.3856 (+4.17z)| lr 8.26e-05 | 322.77 ms | 52.3% bf16 MFU | 1623880 tok/s step 14998/19560 | loss 3.252320 (-1.59z)| norm 0.3904 (+4.05z)| lr 8.26e-05 | 323.20 ms | 52.2% bf16 MFU | 1623795 tok/s step 14999/19560 | loss 3.263035 (-1.29z)| norm 0.3034 (+0.62z)| lr 8.25e-05 | 322.80 ms | 52.3% bf16 MFU | 1623813 tok/s step 15000/19560 
| loss 3.313325 (+0.02z)| norm 0.3154 (+1.08z)| lr 8.25e-05 | 323.24 ms | 52.2% bf16 MFU | 1623720 tok/s val loss 3.303314 Writing state to log124M/state_00015000_00003.bin Writing state to log124M/state_00015000_00001.bin Writing state to log124M/state_00015000_00007.bin Writing state to log124M/state_00015000_00004.bin 
Writing state to log124M/state_00015000_00002.bin Writing state to log124M/state_00015000_00006.bin HellaSwag: 3010/10042 = 0.299741 Writing state to log124M/state_00015000_00005.bin Writing checkpoint at step 15000 Writing model to log124M/model_00015000.bin Writing state to log124M/state_00015000_00000.bin step 15001/19560 | loss 3.339555 (+0.69z)| norm 0.3728 (+3.17z)| lr 8.25e-05 | 318.99 ms | 52.9% bf16 MFU | 1624714 tok/s step 15002/19560 | loss 3.275087 (-0.96z)| norm 0.2884 (-0.01z)| lr 8.24e-05 | 321.47 ms | 52.5% bf16 MFU | 1625023 tok/s step 15003/19560 | loss 3.335854 (+0.59z)| norm 0.2998 (+0.41z)| lr 8.24e-05 | 322.69 ms | 52.3% bf16 MFU | 1625009 tok/s step 15004/19560 | loss 3.363179 (+1.28z)| norm 0.3135 (+0.91z)| lr 8.24e-05 | 322.29 ms | 52.4% bf16 MFU | 1625096 tok/s step 15005/19560 | loss 3.270677 (-1.08z)| norm 0.2848 (-0.17z)| lr 8.23e-05 | 321.96 ms | 52.4% bf16 MFU | 
1625263 tok/s step 15006/19560 | loss 3.374072 (+1.53z)| norm 0.2822 (-0.27z)| lr 8.23e-05 | 321.82 ms | 52.4% bf16 MFU | 1625456 tok/s step 15007/19560 | loss 3.346160 (+0.83z)| norm 0.3188 (+1.10z)| lr 8.23e-05 | 322.82 ms | 52.3% bf16 MFU | 1625388 tok/s step 15008/19560 | loss 3.374310 (+1.55z)| norm 0.3255 (+1.33z)| lr 8.22e-05 | 321.41 ms | 52.5% bf16 MFU | 1625680 tok/s step 15009/19560 | loss 3.324368 (+0.27z)| norm 0.3014 (+0.42z)| lr 8.22e-05 | 322.43 ms | 52.3% bf16 MFU | 1625698 tok/s step 15010/19560 | loss 3.284075 (-0.75z)| norm 0.2799 (-0.39z)| lr 8.22e-05 | 323.29 ms | 52.2% bf16 MFU | 1625500 tok/s step 15011/19560 | loss 3.292171 (-0.54z)| norm 0.3025 (+0.45z)| lr 8.21e-05 | 323.09 ms | 52.2% bf16 MFU | 1625361 tok/s step 15012/19560 | loss 3.316826 (+0.09z)| norm 0.2956 (+0.18z)| lr 8.21e-05 | 321.72 ms | 52.5% bf16 MFU | 1625576 tok/s step 15013/19560 | loss 3.361753 (+1.22z)| norm 0.2631 (-1.04z)| lr 8.20e-05 | 322.30 ms | 52.4% bf16 MFU | 1625633 tok/s step 15014/19560 | loss 3.394211 (+2.01z)| norm 0.3094 (+0.70z)| lr 8.20e-05 | 322.38 ms | 52.4% bf16 MFU | 1625667 tok/s step 15015/19560 | loss 3.276082 (-0.95z)| norm 0.3101 (+0.71z)| lr 8.20e-05 | 321.68 ms | 52.5% bf16 MFU | 1625877 tok/s step 15016/19560 | loss 3.369907 (+1.38z)| norm 0.2880 (-0.12z)| lr 8.19e-05 | 322.18 ms | 52.4% bf16 MFU | 1625949 tok/s step 15017/19560 | loss 3.337893 (+0.58z)| norm 0.3157 (+0.91z)| lr 8.19e-05 | 322.80 ms | 52.3% bf16 MFU | 1625862 tok/s step 15018/19560 | loss 3.318864 (+0.10z)| norm 0.2901 (-0.05z)| lr 8.19e-05 | 321.95 ms | 52.4% bf16 MFU | 1625991 tok/s step 15019/19560 | loss 3.292777 (-0.54z)| norm 0.2750 (-0.62z)| lr 8.18e-05 | 322.29 ms | 52.4% bf16 MFU | 1626029 tok/s step 15020/19560 | loss 3.273915 (-1.00z)| norm 0.2854 (-0.22z)| lr 8.18e-05 | 322.61 ms | 52.3% bf16 MFU | 1625985 tok/s step 15021/19560 | loss 3.306954 (-0.18z)| norm 0.2884 (-0.11z)| lr 8.18e-05 | 322.81 ms | 52.3% bf16 MFU | 1625892 tok/s step 15022/19560 | loss 3.311782 
(-0.07z)| norm 0.3026 (+0.44z)| lr 8.17e-05 | 322.56 ms | 52.3% bf16 MFU | 1625866 tok/s step 15023/19560 | loss 3.259617 (-1.36z)| norm 0.2627 (-1.07z)| lr 8.17e-05 | 322.42 ms | 52.3% bf16 MFU | 1625877 tok/s step 15024/19560 | loss 3.324609 (+0.27z)| norm 0.2992 (+0.33z)| lr 8.17e-05 | 323.01 ms | 52.3% bf16 MFU | 1625740 tok/s step 15025/19560 | loss 3.282249 (-0.78z)| norm 0.3103 (+0.76z)| lr 8.16e-05 | 323.23 ms | 52.2% bf16 MFU | 1625554 tok/s step 15026/19560 | loss 3.533534 (+4.93z)| norm 0.3289 (+1.45z)| lr 8.16e-05 | 323.54 ms | 52.2% bf16 MFU | 1625300 tok/s step 15027/19560 | loss 3.379347 (+1.43z)| norm 0.3149 (+0.93z)| lr 8.16e-05 | 322.84 ms | 52.3% bf16 MFU | 1625234 tok/s step 15028/19560 | loss 3.292508 (-0.51z)| norm 0.2899 (-0.03z)| lr 8.15e-05 | 323.25 ms | 52.2% bf16 MFU | 1625070 tok/s step 15029/19560 | loss 3.301556 (-0.31z)| norm 0.2864 (-0.16z)| lr 8.15e-05 | 322.50 ms | 52.3% bf16 MFU | 1625100 tok/s step 15030/19560 | loss 3.322221 (+0.14z)| norm 0.2932 (+0.10z)| lr 8.15e-05 | 323.41 ms | 52.2% bf16 MFU | 1624901 tok/s step 15031/19560 | loss 3.261523 (-1.22z)| norm 0.2865 (-0.16z)| lr 8.14e-05 | 323.21 ms | 52.2% bf16 MFU | 1624763 tok/s step 15032/19560 | loss 3.302095 (-0.30z)| norm 0.3030 (+0.47z)| lr 8.14e-05 | 322.52 ms | 52.3% bf16 MFU | 1624806 tok/s step 15033/19560 | loss 3.340927 (+0.56z)| norm 0.3267 (+1.36z)| lr 8.14e-05 | 322.30 ms | 52.4% bf16 MFU | 1624900 tok/s step 15034/19560 | loss 3.268974 (-1.05z)| norm 0.2877 (-0.12z)| lr 8.13e-05 | 323.63 ms | 52.1% bf16 MFU | 1624656 tok/s step 15035/19560 | loss 3.268668 (-1.05z)| norm 0.2849 (-0.23z)| lr 8.13e-05 | 322.16 ms | 52.4% bf16 MFU | 1624795 tok/s step 15036/19560 | loss 3.358355 (+0.96z)| norm 0.2864 (-0.18z)| lr 8.13e-05 | 322.80 ms | 52.3% bf16 MFU | 1624764 tok/s step 15037/19560 | loss 3.360792 (+1.01z)| norm 0.2958 (+0.19z)| lr 8.12e-05 | 323.68 ms | 52.1% bf16 MFU | 1624514 tok/s step 15038/19560 | loss 3.280667 (-0.77z)| norm 0.2809 (-0.38z)| lr 8.12e-05 | 
322.71 ms | 52.3% bf16 MFU | 1624519 tok/s step 15039/19560 | loss 3.344721 (+0.65z)| norm 0.3268 (+1.36z)| lr 8.12e-05 | 322.89 ms | 52.3% bf16 MFU | 1624480 tok/s step 15040/19560 | loss 3.321173 (+0.13z)| norm 0.2930 (+0.06z)| lr 8.11e-05 | 322.68 ms | 52.3% bf16 MFU | 1624496 tok/s step 15041/19560 | loss 3.324608 (+0.19z)| norm 0.2718 (-0.75z)| lr 8.11e-05 | 322.82 ms | 52.3% bf16 MFU | 1624477 tok/s step 15042/19560 | loss 3.308671 (-0.17z)| norm 0.3156 (+0.91z)| lr 8.11e-05 | 321.87 ms | 52.4% bf16 MFU | 1624698 tok/s step 15043/19560 | loss 3.301446 (-0.32z)| norm 0.2899 (-0.07z)| lr 8.10e-05 | 322.82 ms | 52.3% bf16 MFU | 1624667 tok/s step 15044/19560 | loss 3.298476 (-0.40z)| norm 0.2910 (-0.02z)| lr 8.10e-05 | 322.86 ms | 52.3% bf16 MFU | 1624629 tok/s step 15045/19560 | loss 3.241920 (-1.65z)| norm 0.2626 (-1.10z)| lr 8.10e-05 | 322.88 ms | 52.3% bf16 MFU | 1624587 tok/s step 15046/19560 | loss 3.333664 (+0.42z)| norm 0.2812 (-0.39z)| lr 8.09e-05 | 322.08 ms | 52.4% bf16 MFU | 1624749 tok/s step 15047/19560 | loss 3.336635 (+0.48z)| norm 0.2883 (-0.12z)| lr 8.09e-05 | 322.12 ms | 52.4% bf16 MFU | 1624892 tok/s step 15048/19560 | loss 3.382780 (+1.51z)| norm 0.2753 (-0.61z)| lr 8.09e-05 | 323.03 ms | 52.2% bf16 MFU | 1624799 tok/s step 15049/19560 | loss 3.361168 (+1.01z)| norm 0.2616 (-1.12z)| lr 8.08e-05 | 322.54 ms | 52.3% bf16 MFU | 1624832 tok/s step 15050/19560 | loss 3.226642 (-1.98z)| norm 0.2859 (-0.19z)| lr 8.08e-05 | 322.14 ms | 52.4% bf16 MFU | 1624967 tok/s step 15051/19560 | loss 3.279371 (-0.79z)| norm 0.2916 (+0.02z)| lr 8.07e-05 | 322.39 ms | 52.3% bf16 MFU | 1625030 tok/s step 15052/19560 | loss 3.270129 (-0.99z)| norm 0.2762 (-0.56z)| lr 8.07e-05 | 323.71 ms | 52.1% bf16 MFU | 1624760 tok/s step 15053/19560 | loss 3.350386 (+0.78z)| norm 0.2699 (-0.79z)| lr 8.07e-05 | 322.43 ms | 52.3% bf16 MFU | 1624825 tok/s step 15054/19560 | loss 3.385578 (+1.54z)| norm 0.2972 (+0.24z)| lr 8.06e-05 | 322.73 ms | 52.3% bf16 MFU | 1624812 tok/s step 
15055/19560 | loss 3.329024 (+0.30z)| norm 0.2824 (-0.31z)| lr 8.06e-05 | 323.14 ms | 52.2% bf16 MFU | 1624696 tok/s step 15056/19560 | loss 3.286777 (-0.63z)| norm 0.2632 (-1.03z)| lr 8.06e-05 | 322.84 ms | 52.3% bf16 MFU | 1624661 tok/s step 15057/19560 | loss 3.261900 (-1.18z)| norm 0.2978 (+0.28z)| lr 8.05e-05 | 322.73 ms | 52.3% bf16 MFU | 1624654 tok/s step 15058/19560 | loss 3.388624 (+1.58z)| norm 0.2682 (-0.85z)| lr 8.05e-05 | 322.61 ms | 52.3% bf16 MFU | 1624679 tok/s step 15059/19560 | loss 3.313976 (-0.05z)| norm 0.2831 (-0.29z)| lr 8.05e-05 | 323.43 ms | 52.2% bf16 MFU | 1624496 tok/s step 15060/19560 | loss 3.288452 (-0.59z)| norm 0.2840 (-0.26z)| lr 8.04e-05 | 322.48 ms | 52.3% bf16 MFU | 1624560 tok/s step 15061/19560 | loss 3.322457 (+0.15z)| norm 0.2756 (-0.58z)| lr 8.04e-05 | 322.25 ms | 52.4% bf16 MFU | 1624682 tok/s step 15062/19560 | loss 3.449821 (+2.83z)| norm 0.2822 (-0.33z)| lr 8.04e-05 | 323.44 ms | 52.2% bf16 MFU | 1624496 tok/s step 15063/19560 | loss 3.351431 (+0.72z)| norm 0.2835 (-0.28z)| lr 8.03e-05 | 322.58 ms | 52.3% bf16 MFU | 1624536 tok/s step 15064/19560 | loss 3.299224 (-0.38z)| norm 0.2753 (-0.59z)| lr 8.03e-05 | 323.06 ms | 52.2% bf16 MFU | 1624453 tok/s step 15065/19560 | loss 3.341850 (+0.51z)| norm 0.2842 (-0.26z)| lr 8.03e-05 | 322.43 ms | 52.3% bf16 MFU | 1624532 tok/s step 15066/19560 | loss 3.322635 (+0.10z)| norm 0.2700 (-0.81z)| lr 8.02e-05 | 322.57 ms | 52.3% bf16 MFU | 1624573 tok/s step 15067/19560 | loss 3.272651 (-0.96z)| norm 0.2675 (-0.90z)| lr 8.02e-05 | 322.69 ms | 52.3% bf16 MFU | 1624582 tok/s step 15068/19560 | loss 3.330768 (+0.28z)| norm 0.2653 (-0.98z)| lr 8.02e-05 | 322.88 ms | 52.3% bf16 MFU | 1624541 tok/s step 15069/19560 | loss 3.378876 (+1.30z)| norm 0.2665 (-0.92z)| lr 8.01e-05 | 322.75 ms | 52.3% bf16 MFU | 1624535 tok/s step 15070/19560 | loss 3.301400 (-0.34z)| norm 0.3083 (+0.69z)| lr 8.01e-05 | 323.30 ms | 52.2% bf16 MFU | 1624392 tok/s step 15071/19560 | loss 3.320163 (+0.06z)| norm 
0.2729 (-0.67z)| lr 8.01e-05 | 322.91 ms | 52.3% bf16 MFU | 1624355 tok/s step 15072/19560 | loss 3.340259 (+0.49z)| norm 0.2922 (+0.08z)| lr 8.00e-05 | 322.24 ms | 52.4% bf16 MFU | 1624487 tok/s step 15073/19560 | loss 3.372362 (+1.16z)| norm 0.3312 (+1.55z)| lr 8.00e-05 | 322.53 ms | 52.3% bf16 MFU | 1624540 tok/s step 15074/19560 | loss 3.345758 (+0.58z)| norm 0.3003 (+0.36z)| lr 8.00e-05 | 322.60 ms | 52.3% bf16 MFU | 1624573 tok/s step 15075/19560 | loss 3.245854 (-1.51z)| norm 0.3191 (+1.07z)| lr 7.99e-05 | 322.68 ms | 52.3% bf16 MFU | 1624584 tok/s step 15076/19560 | loss 3.266935 (-1.06z)| norm 0.3075 (+0.61z)| lr 7.99e-05 | 322.73 ms | 52.3% bf16 MFU | 1624582 tok/s step 15077/19560 | loss 3.259395 (-1.21z)| norm 0.2903 (-0.04z)| lr 7.99e-05 | 322.27 ms | 52.4% bf16 MFU | 1624695 tok/s step 15078/19560 | loss 3.331757 (+0.29z)| norm 0.3206 (+1.11z)| lr 7.98e-05 | 322.57 ms | 52.3% bf16 MFU | 1624727 tok/s step 15079/19560 | loss 3.319355 (+0.03z)| norm 0.3085 (+0.63z)| lr 7.98e-05 | 323.26 ms | 52.2% bf16 MFU | 1624584 tok/s step 15080/19560 | loss 3.231566 (-1.78z)| norm 0.3125 (+0.78z)| lr 7.98e-05 | 322.26 ms | 52.4% bf16 MFU | 1624702 tok/s step 15081/19560 | loss 3.248032 (-1.43z)| norm 0.2979 (+0.21z)| lr 7.97e-05 | 322.61 ms | 52.3% bf16 MFU | 1624724 tok/s step 15082/19560 | loss 3.304242 (-0.27z)| norm 0.3032 (+0.41z)| lr 7.97e-05 | 324.12 ms | 52.1% bf16 MFU | 1624366 tok/s step 15083/19560 | loss 3.336103 (+0.41z)| norm 0.2988 (+0.25z)| lr 7.97e-05 | 322.68 ms | 52.3% bf16 MFU | 1624388 tok/s step 15084/19560 | loss 3.303766 (-0.26z)| norm 0.3332 (+1.56z)| lr 7.96e-05 | 322.41 ms | 52.3% bf16 MFU | 1624475 tok/s step 15085/19560 | loss 3.278398 (-0.79z)| norm 0.2880 (-0.19z)| lr 7.96e-05 | 322.56 ms | 52.3% bf16 MFU | 1624520 tok/s step 15086/19560 | loss 3.350621 (+0.73z)| norm 0.2821 (-0.44z)| lr 7.96e-05 | 322.26 ms | 52.4% bf16 MFU | 1624640 tok/s step 15087/19560 | loss 3.297533 (-0.38z)| norm 0.2756 (-0.69z)| lr 7.95e-05 | 322.73 ms | 
52.3% bf16 MFU | 1624634 tok/s step 15088/19560 | loss 3.316504 (+0.02z)| norm 0.2665 (-1.05z)| lr 7.95e-05 | 322.28 ms | 52.4% bf16 MFU | 1624743 tok/s step 15089/19560 | loss 3.285753 (-0.62z)| norm 0.2740 (-0.76z)| lr 7.95e-05 | 322.59 ms | 52.3% bf16 MFU | 1624768 tok/s step 15090/19560 | loss 3.258875 (-1.17z)| norm 0.2812 (-0.48z)| lr 7.94e-05 | 322.84 ms | 52.3% bf16 MFU | 1624729 tok/s step 15091/19560 | loss 3.305004 (-0.22z)| norm 0.2674 (-1.02z)| lr 7.94e-05 | 322.19 ms | 52.4% bf16 MFU | 1624855 tok/s step 15092/19560 | loss 3.240846 (-1.55z)| norm 0.2758 (-0.70z)| lr 7.94e-05 | 322.33 ms | 52.4% bf16 MFU | 1624941 tok/s step 15093/19560 | loss 3.319309 (+0.10z)| norm 0.3048 (+0.45z)| lr 7.93e-05 | 323.21 ms | 52.2% bf16 MFU | 1624800 tok/s step 15094/19560 | loss 3.280749 (-0.70z)| norm 0.2671 (-1.04z)| lr 7.93e-05 | 322.45 ms | 52.3% bf16 MFU | 1624858 tok/s step 15095/19560 | loss 3.402094 (+1.81z)| norm 0.3024 (+0.35z)| lr 7.93e-05 | 323.24 ms | 52.2% bf16 MFU | 1624713 tok/s step 15096/19560 | loss 3.281249 (-0.70z)| norm 0.2624 (-1.23z)| lr 7.92e-05 | 323.24 ms | 52.2% bf16 MFU | 1624576 tok/s step 15097/19560 | loss 3.286416 (-0.58z)| norm 0.2758 (-0.69z)| lr 7.92e-05 | 322.50 ms | 52.3% bf16 MFU | 1624631 tok/s step 15098/19560 | loss 3.317927 (+0.07z)| norm 0.2725 (-0.82z)| lr 7.92e-05 | 322.54 ms | 52.3% bf16 MFU | 1624674 tok/s step 15099/19560 | loss 3.268720 (-0.95z)| norm 0.2599 (-1.31z)| lr 7.91e-05 | 322.66 ms | 52.3% bf16 MFU | 1624685 tok/s step 15100/19560 | loss 3.286063 (-0.58z)| norm 0.2867 (-0.26z)| lr 7.91e-05 | 322.92 ms | 52.3% bf16 MFU | 1624629 tok/s step 15101/19560 | loss 3.328794 (+0.33z)| norm 0.2626 (-1.20z)| lr 7.91e-05 | 322.45 ms | 52.3% bf16 MFU | 1624694 tok/s step 15102/19560 | loss 3.288008 (-0.55z)| norm 0.2845 (-0.35z)| lr 7.90e-05 | 322.86 ms | 52.3% bf16 MFU | 1624654 tok/s step 15103/19560 | loss 3.315588 (+0.04z)| norm 0.2593 (-1.34z)| lr 7.90e-05 | 322.86 ms | 52.3% bf16 MFU | 1624614 tok/s step 15104/19560 
| loss 3.304869 (-0.19z)| norm 0.2936 (+0.00z)| lr 7.90e-05 | 322.78 ms | 52.3% bf16 MFU | 1624598 tok/s step 15105/19560 | loss 3.335330 (+0.45z)| norm 0.2721 (-0.85z)| lr 7.89e-05 | 322.92 ms | 52.3% bf16 MFU | 1624547 tok/s step 15106/19560 | loss 3.319139 (+0.09z)| norm 0.2747 (-0.75z)| lr 7.89e-05 | 322.47 ms | 52.3% bf16 MFU | 1624613 tok/s step 15107/19560 | loss 3.297646 (-0.37z)| norm 0.2724 (-0.86z)| lr 7.88e-05 | 322.19 ms | 52.4% bf16 MFU | 1624746 tok/s step 15108/19560 | loss 3.323203 (+0.18z)| norm 0.2678 (-1.05z)| lr 7.88e-05 | 322.81 ms | 52.3% bf16 MFU | 1624717 tok/s step 15109/19560 | loss 3.313651 (-0.04z)| norm 0.2812 (-0.52z)| lr 7.88e-05 | 322.57 ms | 52.3% bf16 MFU | 1624748 tok/s step 15110/19560 | loss 3.336625 (+0.46z)| norm 0.2952 (+0.04z)| lr 7.87e-05 | 322.99 ms | 52.3% bf16 MFU | 1624673 tok/s step 15111/19560 | loss 3.329089 (+0.29z)| norm 0.2745 (-0.78z)| lr 7.87e-05 | 323.22 ms | 52.2% bf16 MFU | 1624542 tok/s step 15112/19560 | loss 3.317773 (+0.05z)| norm 0.2906 (-0.14z)| lr 7.87e-05 | 322.56 ms | 52.3% bf16 MFU | 1624584 tok/s step 15113/19560 | loss 3.306150 (-0.21z)| norm 0.2683 (-1.03z)| lr 7.86e-05 | 322.91 ms | 52.3% bf16 MFU | 1624536 tok/s step 15114/19560 | loss 3.284643 (-0.68z)| norm 0.2550 (-1.53z)| lr 7.86e-05 | 322.38 ms | 52.4% bf16 MFU | 1624623 tok/s step 15115/19560 | loss 3.279013 (-0.79z)| norm 0.2954 (+0.07z)| lr 7.86e-05 | 322.39 ms | 52.4% bf16 MFU | 1624704 tok/s step 15116/19560 | loss 3.290862 (-0.55z)| norm 0.2780 (-0.62z)| lr 7.85e-05 | 323.88 ms | 52.1% bf16 MFU | 1624408 tok/s step 15117/19560 | loss 3.345053 (+0.64z)| norm 0.3037 (+0.42z)| lr 7.85e-05 | 322.75 ms | 52.3% bf16 MFU | 1624411 tok/s step 15118/19560 | loss 3.409974 (+2.11z)| norm 0.2910 (-0.08z)| lr 7.85e-05 | 322.44 ms | 52.3% bf16 MFU | 1624490 tok/s step 15119/19560 | loss 3.348497 (+0.72z)| norm 0.2779 (-0.60z)| lr 7.84e-05 | 322.22 ms | 52.4% bf16 MFU | 1624621 tok/s step 15120/19560 | loss 3.310935 (-0.12z)| norm 0.2739 (-0.75z)| 
lr 7.84e-05 | 322.90 ms | 52.3% bf16 MFU | 1624574 tok/s step 15121/19560 | loss 3.308015 (-0.19z)| norm 0.3044 (+0.51z)| lr 7.84e-05 | 322.77 ms | 52.3% bf16 MFU | 1624562 tok/s step 15122/19560 | loss 3.269627 (-1.04z)| norm 0.2890 (-0.11z)| lr 7.83e-05 | 322.85 ms | 52.3% bf16 MFU | 1624531 tok/s step 15123/19560 | loss 3.319485 (+0.09z)| norm 0.2662 (-1.08z)| lr 7.83e-05 | 322.94 ms | 52.3% bf16 MFU | 1624480 tok/s step 15124/19560 | loss 3.324730 (+0.21z)| norm 0.2822 (-0.37z)| lr 7.83e-05 | 323.17 ms | 52.2% bf16 MFU | 1624373 tok/s step 15125/19560 | loss 3.407395 (+2.04z)| norm 0.2601 (-1.39z)| lr 7.82e-05 | 322.31 ms | 52.4% bf16 MFU | 1624486 tok/s step 15126/19560 | loss 3.327479 (+0.25z)| norm 0.2723 (-0.85z)| lr 7.82e-05 | 322.57 ms | 52.3% bf16 MFU | 1624529 tok/s step 15127/19560 | loss 3.301542 (-0.34z)| norm 0.2982 (+0.49z)| lr 7.82e-05 | 322.45 ms | 52.3% bf16 MFU | 1624599 tok/s step 15128/19560 | loss 3.258536 (-1.29z)| norm 0.2541 (-1.76z)| lr 7.81e-05 | 322.64 ms | 52.3% bf16 MFU | 1624617 tok/s step 15129/19560 | loss 3.258567 (-1.27z)| norm 0.2674 (-1.11z)| lr 7.81e-05 | 322.73 ms | 52.3% bf16 MFU | 1624613 tok/s step 15130/19560 | loss 3.285120 (-0.68z)| norm 0.2701 (-0.95z)| lr 7.81e-05 | 323.17 ms | 52.2% bf16 MFU | 1624498 tok/s step 15131/19560 | loss 3.310804 (-0.11z)| norm 0.2744 (-0.70z)| lr 7.80e-05 | 323.20 ms | 52.2% bf16 MFU | 1624382 tok/s step 15132/19560 | loss 3.307254 (-0.18z)| norm 0.2523 (-1.89z)| lr 7.80e-05 | 322.53 ms | 52.3% bf16 MFU | 1624441 tok/s step 15133/19560 | loss 3.285039 (-0.68z)| norm 0.2742 (-0.68z)| lr 7.80e-05 | 322.34 ms | 52.4% bf16 MFU | 1624545 tok/s step 15134/19560 | loss 3.309337 (-0.12z)| norm 0.2578 (-1.56z)| lr 7.79e-05 | 323.13 ms | 52.2% bf16 MFU | 1624445 tok/s step 15135/19560 | loss 3.335919 (+0.48z)| norm 0.2926 (+0.36z)| lr 7.79e-05 | 322.49 ms | 52.3% bf16 MFU | 1624510 tok/s step 15136/19560 | loss 3.275663 (-0.87z)| norm 0.2717 (-0.79z)| lr 7.79e-05 | 322.67 ms | 52.3% bf16 MFU | 
1624527 tok/s step 15137/19560 | loss 3.288311 (-0.57z)| norm 0.2892 (+0.20z)| lr 7.78e-05 | 323.03 ms | 52.2% bf16 MFU | 1624452 tok/s step 15138/19560 | loss 3.376027 (+1.39z)| norm 0.2801 (-0.31z)| lr 7.78e-05 | 322.61 ms | 52.3% bf16 MFU | 1624486 tok/s step 15139/19560 | loss 3.254207 (-1.34z)| norm 0.2725 (-0.72z)| lr 7.78e-05 | 322.71 ms | 52.3% bf16 MFU | 1624495 tok/s step 15140/19560 | loss 3.215520 (-2.15z)| norm 0.2867 (+0.08z)| lr 7.77e-05 | 323.50 ms | 52.2% bf16 MFU | 1624304 tok/s step 15141/19560 | loss 3.345826 (+0.72z)| norm 0.2906 (+0.29z)| lr 7.77e-05 | 323.18 ms | 52.2% bf16 MFU | 1624203 tok/s step 15142/19560 | loss 3.241550 (-1.56z)| norm 0.2944 (+0.52z)| lr 7.77e-05 | 322.64 ms | 52.3% bf16 MFU | 1624244 tok/s step 15143/19560 | loss 3.332605 (+0.45z)| norm 0.2770 (-0.47z)| lr 7.76e-05 | 322.49 ms | 52.3% bf16 MFU | 1624319 tok/s step 15144/19560 | loss 3.334171 (+0.49z)| norm 0.2895 (+0.25z)| lr 7.76e-05 | 322.68 ms | 52.3% bf16 MFU | 1624343 tok/s step 15145/19560 | loss 3.268782 (-0.95z)| norm 0.2643 (-1.19z)| lr 7.76e-05 | 322.70 ms | 52.3% bf16 MFU | 1624361 tok/s step 15146/19560 | loss 3.249202 (-1.37z)| norm 0.2637 (-1.20z)| lr 7.75e-05 | 323.01 ms | 52.2% bf16 MFU | 1624298 tok/s step 15147/19560 | loss 3.307849 (-0.07z)| norm 0.3138 (+1.65z)| lr 7.75e-05 | 322.59 ms | 52.3% bf16 MFU | 1624345 tok/s step 15148/19560 | loss 3.390685 (+1.72z)| norm 0.2651 (-1.11z)| lr 7.75e-05 | 322.57 ms | 52.3% bf16 MFU | 1624396 tok/s step 15149/19560 | loss 3.314230 (+0.05z)| norm 0.2922 (+0.42z)| lr 7.74e-05 | 322.74 ms | 52.3% bf16 MFU | 1624400 tok/s step 15150/19560 | loss 3.285735 (-0.57z)| norm 0.3072 (+1.27z)| lr 7.74e-05 | 323.02 ms | 52.2% bf16 MFU | 1624334 tok/s step 15151/19560 | loss 3.295038 (-0.38z)| norm 0.2615 (-1.31z)| lr 7.74e-05 | 322.25 ms | 52.4% bf16 MFU | 1624464 tok/s step 15152/19560 | loss 3.256570 (-1.20z)| norm 0.2955 (+0.61z)| lr 7.73e-05 | 322.34 ms | 52.4% bf16 MFU | 1624566 tok/s step 15153/19560 | loss 3.265973 
(-0.99z)| norm 0.2586 (-1.45z)| lr 7.73e-05 | 322.68 ms | 52.3% bf16 MFU | 1624577 tok/s step 15154/19560 | loss 3.331362 (+0.52z)| norm 0.2819 (-0.12z)| lr 7.73e-05 | 322.77 ms | 52.3% bf16 MFU | 1624565 tok/s step 15155/19560 | loss 3.439355 (+3.03z)| norm 0.2892 (+0.32z)| lr 7.72e-05 | 321.94 ms | 52.4% bf16 MFU | 1624762 tok/s step 15156/19560 | loss 3.296359 (-0.33z)| norm 0.2724 (-0.66z)| lr 7.72e-05 | 322.45 ms | 52.3% bf16 MFU | 1624822 tok/s step 15157/19560 | loss 3.298110 (-0.29z)| norm 0.3023 (+1.09z)| lr 7.72e-05 | 322.65 ms | 52.3% bf16 MFU | 1624828 tok/s step 15158/19560 | loss 3.296868 (-0.31z)| norm 0.3122 (+1.64z)| lr 7.71e-05 | 322.45 ms | 52.3% bf16 MFU | 1624885 tok/s step 15159/19560 | loss 3.372080 (+1.43z)| norm 0.2768 (-0.41z)| lr 7.71e-05 | 322.94 ms | 52.3% bf16 MFU | 1624814 tok/s step 15160/19560 | loss 3.295197 (-0.37z)| norm 0.2992 (+0.90z)| lr 7.71e-05 | 322.41 ms | 52.3% bf16 MFU | 1624879 tok/s step 15161/19560 | loss 3.328921 (+0.42z)| norm 0.2833 (-0.01z)| lr 7.70e-05 | 322.81 ms | 52.3% bf16 MFU | 1624843 tok/s step 15162/19560 | loss 3.314336 (+0.07z)| norm 0.2802 (-0.19z)| lr 7.70e-05 | 323.10 ms | 52.2% bf16 MFU | 1624734 tok/s step 15163/19560 | loss 3.274830 (-0.86z)| norm 0.2651 (-1.08z)| lr 7.70e-05 | 322.75 ms | 52.3% bf16 MFU | 1624720 tok/s step 15164/19560 | loss 3.271286 (-0.93z)| norm 0.2760 (-0.42z)| lr 7.69e-05 | 322.75 ms | 52.3% bf16 MFU | 1624705 tok/s step 15165/19560 | loss 3.281482 (-0.68z)| norm 0.2638 (-1.13z)| lr 7.69e-05 | 322.70 ms | 52.3% bf16 MFU | 1624705 tok/s step 15166/19560 | loss 3.299334 (-0.26z)| norm 0.2646 (-1.07z)| lr 7.69e-05 | 323.38 ms | 52.2% bf16 MFU | 1624533 tok/s step 15167/19560 | loss 3.338740 (+0.68z)| norm 0.2687 (-0.82z)| lr 7.68e-05 | 322.61 ms | 52.3% bf16 MFU | 1624564 tok/s step 15168/19560 | loss 3.309043 (-0.02z)| norm 0.2811 (-0.07z)| lr 7.68e-05 | 323.19 ms | 52.2% bf16 MFU | 1624448 tok/s step 15169/19560 | loss 3.332043 (+0.52z)| norm 0.2741 (-0.49z)| lr 7.68e-05 | 
322.94 ms | 52.3% bf16 MFU | 1624400 tok/s step 15170/19560 | loss 3.305979 (-0.10z)| norm 0.2561 (-1.56z)| lr 7.67e-05 | 322.86 ms | 52.3% bf16 MFU | 1624375 tok/s step 15171/19560 | loss 3.266536 (-1.02z)| norm 0.2790 (-0.16z)| lr 7.67e-05 | 322.62 ms | 52.3% bf16 MFU | 1624410 tok/s step 15172/19560 | loss 3.298154 (-0.28z)| norm 0.2872 (+0.34z)| lr 7.67e-05 | 323.15 ms | 52.2% bf16 MFU | 1624311 tok/s step 15173/19560 | loss 3.310501 (+0.00z)| norm 0.2817 (-0.01z)| lr 7.66e-05 | 322.65 ms | 52.3% bf16 MFU | 1624344 tok/s step 15174/19560 | loss 3.336463 (+0.62z)| norm 0.2728 (-0.54z)| lr 7.66e-05 | 322.97 ms | 52.3% bf16 MFU | 1624293 tok/s step 15175/19560 | loss 3.275934 (-0.81z)| norm 0.3052 (+1.42z)| lr 7.66e-05 | 322.71 ms | 52.3% bf16 MFU | 1624311 tok/s step 15176/19560 | loss 3.271324 (-0.91z)| norm 0.2714 (-0.63z)| lr 7.65e-05 | 322.75 ms | 52.3% bf16 MFU | 1624318 tok/s step 15177/19560 | loss 3.297692 (-0.26z)| norm 0.2959 (+0.84z)| lr 7.65e-05 | 322.42 ms | 52.3% bf16 MFU | 1624408 tok/s step 15178/19560 | loss 3.338856 (+0.72z)| norm 0.2698 (-0.74z)| lr 7.65e-05 | 323.10 ms | 52.2% bf16 MFU | 1624323 tok/s step 15179/19560 | loss 3.337569 (+0.68z)| norm 0.2782 (-0.22z)| lr 7.64e-05 | 323.83 ms | 52.1% bf16 MFU | 1624059 tok/s step 15180/19560 | loss 3.382039 (+1.74z)| norm 0.2937 (+0.71z)| lr 7.64e-05 | 322.95 ms | 52.3% bf16 MFU | 1624028 tok/s step 15181/19560 | loss 3.321132 (+0.26z)| norm 0.2739 (-0.49z)| lr 7.64e-05 | 322.39 ms | 52.3% bf16 MFU | 1624138 tok/s step 15182/19560 | loss 3.321879 (+0.30z)| norm 0.2569 (-1.50z)| lr 7.63e-05 | 323.52 ms | 52.2% bf16 MFU | 1623961 tok/s step 15183/19560 | loss 3.374825 (+1.58z)| norm 0.2715 (-0.61z)| lr 7.63e-05 | 322.30 ms | 52.4% bf16 MFU | 1624098 tok/s step 15184/19560 | loss 3.316201 (+0.14z)| norm 0.2678 (-0.84z)| lr 7.63e-05 | 323.36 ms | 52.2% bf16 MFU | 1623961 tok/s step 15185/19560 | loss 3.306857 (-0.10z)| norm 0.2800 (-0.09z)| lr 7.62e-05 | 322.68 ms | 52.3% bf16 MFU | 1624002 tok/s step 
15186/19560 | loss 3.276575 (-0.84z)| norm 0.2561 (-1.53z)| lr 7.62e-05 | 322.68 ms | 52.3% bf16 MFU | 1624042 tok/s step 15187/19560 | loss 3.307976 (-0.05z)| norm 0.2642 (-1.02z)| lr 7.62e-05 | 323.64 ms | 52.1% bf16 MFU | 1623839 tok/s step 15188/19560 | loss 3.312163 (+0.05z)| norm 0.2710 (-0.61z)| lr 7.61e-05 | 322.57 ms | 52.3% bf16 MFU | 1623915 tok/s step 15189/19560 | loss 3.284129 (-0.64z)| norm 0.2712 (-0.60z)| lr 7.61e-05 | 322.71 ms | 52.3% bf16 MFU | 1623951 tok/s step 15190/19560 | loss 3.314688 (+0.15z)| norm 0.2687 (-0.74z)| lr 7.61e-05 | 322.83 ms | 52.3% bf16 MFU | 1623955 tok/s step 15191/19560 | loss 3.322936 (+0.38z)| norm 0.2670 (-0.83z)| lr 7.60e-05 | 323.83 ms | 52.1% bf16 MFU | 1623709 tok/s step 15192/19560 | loss 3.305765 (-0.08z)| norm 0.2611 (-1.17z)| lr 7.60e-05 | 323.20 ms | 52.2% bf16 MFU | 1623633 tok/s step 15193/19560 | loss 3.292844 (-0.41z)| norm 0.2728 (-0.47z)| lr 7.60e-05 | 322.67 ms | 52.3% bf16 MFU | 1623693 tok/s step 15194/19560 | loss 3.334220 (+0.68z)| norm 0.2558 (-1.46z)| lr 7.59e-05 | 322.66 ms | 52.3% bf16 MFU | 1623752 tok/s step 15195/19560 | loss 3.356357 (+1.25z)| norm 0.2661 (-0.85z)| lr 7.59e-05 | 323.17 ms | 52.2% bf16 MFU | 1623682 tok/s step 15196/19560 | loss 3.282118 (-0.70z)| norm 0.2701 (-0.62z)| lr 7.59e-05 | 322.68 ms | 52.3% bf16 MFU | 1623738 tok/s step 15197/19560 | loss 3.323629 (+0.41z)| norm 0.2759 (-0.28z)| lr 7.58e-05 | 322.81 ms | 52.3% bf16 MFU | 1623759 tok/s step 15198/19560 | loss 3.323187 (+0.40z)| norm 0.2844 (+0.23z)| lr 7.58e-05 | 322.90 ms | 52.3% bf16 MFU | 1623755 tok/s step 15199/19560 | loss 3.350101 (+1.10z)| norm 0.2807 (+0.01z)| lr 7.58e-05 | 322.91 ms | 52.3% bf16 MFU | 1623750 tok/s step 15200/19560 | loss 3.287691 (-0.55z)| norm 0.2657 (-0.87z)| lr 7.57e-05 | 323.15 ms | 52.2% bf16 MFU | 1623683 tok/s step 15201/19560 | loss 3.373653 (+1.74z)| norm 0.2817 (+0.11z)| lr 7.57e-05 | 323.12 ms | 52.2% bf16 MFU | 1623628 tok/s step 15202/19560 | loss 3.326251 (+0.49z)| norm 
0.2805 (+0.04z)| lr 7.57e-05 | 322.23 ms | 52.4% bf16 MFU | 1623799 tok/s step 15203/19560 | loss 3.363500 (+1.46z)| norm 0.2739 (-0.36z)| lr 7.56e-05 | 322.96 ms | 52.3% bf16 MFU | 1623779 tok/s step 15204/19560 | loss 3.318198 (+0.24z)| norm 0.2762 (-0.19z)| lr 7.56e-05 | 323.21 ms | 52.2% bf16 MFU | 1623696 tok/s step 15205/19560 | loss 3.279642 (-0.81z)| norm 0.2530 (-1.66z)| lr 7.56e-05 | 323.04 ms | 52.2% bf16 MFU | 1623662 tok/s step 15206/19560 | loss 3.345854 (+0.98z)| norm 0.3018 (+1.50z)| lr 7.55e-05 | 323.27 ms | 52.2% bf16 MFU | 1623571 tok/s step 15207/19560 | loss 3.311642 (+0.06z)| norm 0.2545 (-1.57z)| lr 7.55e-05 | 322.57 ms | 52.3% bf16 MFU | 1623659 tok/s step 15208/19560 | loss 3.281972 (-0.77z)| norm 0.2690 (-0.60z)| lr 7.55e-05 | 323.12 ms | 52.2% bf16 MFU | 1623605 tok/s step 15209/19560 | loss 3.331983 (+0.59z)| norm 0.3185 (+2.64z)| lr 7.54e-05 | 323.16 ms | 52.2% bf16 MFU | 1623542 tok/s step 15210/19560 | loss 3.316501 (+0.16z)| norm 0.2677 (-0.68z)| lr 7.54e-05 | 322.84 ms | 52.3% bf16 MFU | 1623565 tok/s step 15211/19560 | loss 3.391823 (+2.19z)| norm 0.3000 (+1.46z)| lr 7.54e-05 | 322.98 ms | 52.3% bf16 MFU | 1623550 tok/s step 15212/19560 | loss 3.366264 (+1.47z)| norm 0.3220 (+2.97z)| lr 7.53e-05 | 322.39 ms | 52.4% bf16 MFU | 1623685 tok/s step 15213/19560 | loss 3.368487 (+1.51z)| norm 0.2870 (+0.61z)| lr 7.53e-05 | 322.77 ms | 52.3% bf16 MFU | 1623717 tok/s step 15214/19560 | loss 3.306843 (-0.14z)| norm 0.2981 (+1.35z)| lr 7.53e-05 | 322.89 ms | 52.3% bf16 MFU | 1623719 tok/s step 15215/19560 | loss 3.344565 (+0.86z)| norm 0.2738 (-0.28z)| lr 7.52e-05 | 323.46 ms | 52.2% bf16 MFU | 1623577 tok/s step 15216/19560 | loss 3.242756 (-1.83z)| norm 0.2882 (+0.68z)| lr 7.52e-05 | 322.96 ms | 52.3% bf16 MFU | 1623566 tok/s step 15217/19560 | loss 3.330384 (+0.48z)| norm 0.2745 (-0.24z)| lr 7.52e-05 | 322.68 ms | 52.3% bf16 MFU | 1623627 tok/s step 15218/19560 | loss 3.373233 (+1.59z)| norm 0.2972 (+1.26z)| lr 7.51e-05 | 322.93 ms | 
52.3% bf16 MFU | 1623623 tok/s step 15219/19560 | loss 3.363741 (+1.32z)| norm 0.3035 (+1.65z)| lr 7.51e-05 | 322.71 ms | 52.3% bf16 MFU | 1623673 tok/s step 15220/19560 | loss 3.281473 (-0.86z)| norm 0.2885 (+0.65z)| lr 7.51e-05 | 323.46 ms | 52.2% bf16 MFU | 1623534 tok/s step 15221/19560 | loss 3.270804 (-1.13z)| norm 0.2998 (+1.40z)| lr 7.50e-05 | 322.32 ms | 52.4% bf16 MFU | 1623687 tok/s step 15222/19560 | loss 3.325529 (+0.31z)| norm 0.2769 (-0.12z)| lr 7.50e-05 | 322.89 ms | 52.3% bf16 MFU | 1623689 tok/s step 15223/19560 | loss 3.326130 (+0.35z)| norm 0.2930 (+0.96z)| lr 7.50e-05 | 323.30 ms | 52.2% bf16 MFU | 1623588 tok/s step 15224/19560 | loss 3.289474 (-0.65z)| norm 0.2677 (-0.73z)| lr 7.49e-05 | 322.65 ms | 52.3% bf16 MFU | 1623655 tok/s step 15225/19560 | loss 3.339760 (+0.71z)| norm 0.2750 (-0.24z)| lr 7.49e-05 | 323.47 ms | 52.2% bf16 MFU | 1623514 tok/s step 15226/19560 | loss 3.238889 (-1.98z)| norm 0.2885 (+0.65z)| lr 7.49e-05 | 323.12 ms | 52.2% bf16 MFU | 1623466 tok/s step 15227/19560 | loss 3.363068 (+1.32z)| norm 0.2872 (+0.55z)| lr 7.48e-05 | 322.31 ms | 52.4% bf16 MFU | 1623625 tok/s step 15228/19560 | loss 3.265866 (-1.27z)| norm 0.3028 (+1.58z)| lr 7.48e-05 | 322.71 ms | 52.3% bf16 MFU | 1623677 tok/s step 15229/19560 | loss 3.310460 (-0.08z)| norm 0.2925 (+0.88z)| lr 7.48e-05 | 322.90 ms | 52.3% bf16 MFU | 1623679 tok/s step 15230/19560 | loss 3.349427 (+0.94z)| norm 0.2794 (+0.01z)| lr 7.47e-05 | 323.11 ms | 52.2% bf16 MFU | 1623626 tok/s step 15231/19560 | loss 3.311642 (-0.06z)| norm 0.2813 (+0.12z)| lr 7.47e-05 | 322.59 ms | 52.3% bf16 MFU | 1623706 tok/s step 15232/19560 | loss 3.285033 (-0.76z)| norm 0.2822 (+0.19z)| lr 7.47e-05 | 322.91 ms | 52.3% bf16 MFU | 1623702 tok/s step 15233/19560 | loss 3.294700 (-0.50z)| norm 0.2734 (-0.40z)| lr 7.46e-05 | 323.44 ms | 52.2% bf16 MFU | 1623565 tok/s step 15234/19560 | loss 3.376242 (+1.64z)| norm 0.3065 (+1.79z)| lr 7.46e-05 | 322.66 ms | 52.3% bf16 MFU | 1623631 tok/s step 15235/19560 
| loss 3.347072 (+0.86z)| norm 0.2656 (-0.93z)| lr 7.46e-05 | 323.27 ms | 52.2% bf16 MFU | 1623540 tok/s step 15236/19560 | loss 3.363432 (+1.27z)| norm 0.3049 (+1.65z)| lr 7.45e-05 | 323.02 ms | 52.2% bf16 MFU | 1623518 tok/s step 15237/19560 | loss 3.333039 (+0.47z)| norm 0.2994 (+1.27z)| lr 7.45e-05 | 323.05 ms | 52.2% bf16 MFU | 1623488 tok/s step 15238/19560 | loss 3.273262 (-1.06z)| norm 0.2866 (+0.43z)| lr 7.45e-05 | 323.15 ms | 52.2% bf16 MFU | 1623435 tok/s step 15239/19560 | loss 3.308081 (-0.16z)| norm 0.2972 (+1.12z)| lr 7.44e-05 | 323.04 ms | 52.2% bf16 MFU | 1623412 tok/s step 15240/19560 | loss 3.343467 (+0.75z)| norm 0.2633 (-1.08z)| lr 7.44e-05 | 323.16 ms | 52.2% bf16 MFU | 1623362 tok/s step 15241/19560 | loss 3.290964 (-0.60z)| norm 0.2639 (-1.04z)| lr 7.44e-05 | 322.01 ms | 52.4% bf16 MFU | 1623602 tok/s step 15242/19560 | loss 3.280447 (-0.87z)| norm 0.2737 (-0.41z)| lr 7.43e-05 | 322.59 ms | 52.3% bf16 MFU | 1623684 tok/s step 15243/19560 | loss 3.273691 (-1.04z)| norm 0.2711 (-0.57z)| lr 7.43e-05 | 322.36 ms | 52.4% bf16 MFU | 1623819 tok/s step 15244/19560 | loss 3.328596 (+0.37z)| norm 0.2820 (+0.14z)| lr 7.43e-05 | 322.40 ms | 52.3% bf16 MFU | 1623937 tok/s step 15245/19560 | loss 3.302629 (-0.30z)| norm 0.2798 (+0.01z)| lr 7.42e-05 | 323.04 ms | 52.2% bf16 MFU | 1623890 tok/s step 15246/19560 | loss 3.361895 (+1.27z)| norm 0.2875 (+0.53z)| lr 7.42e-05 | 322.43 ms | 52.3% bf16 MFU | 1623998 tok/s step 15247/19560 | loss 3.343822 (+0.80z)| norm 0.2549 (-1.62z)| lr 7.42e-05 | 322.53 ms | 52.3% bf16 MFU | 1624076 tok/s step 15248/19560 | loss 3.348194 (+0.90z)| norm 0.2708 (-0.57z)| lr 7.41e-05 | 322.52 ms | 52.3% bf16 MFU | 1624152 tok/s step 15249/19560 | loss 3.325857 (+0.31z)| norm 0.2705 (-0.58z)| lr 7.41e-05 | 322.85 ms | 52.3% bf16 MFU | 1624141 tok/s step 15250/19560 | loss 3.290964 (-0.61z)| norm 0.2503 (-1.87z)| lr 7.41e-05 | 322.77 ms | 52.3% bf16 MFU | 1624151 tok/s val loss 3.299448
HellaSwag: 3018/10042 = 0.300538 step 15251/19560 | loss 3.294781 (-0.51z)| norm 0.2598 (-1.24z)| lr 7.41e-05 | 321.72 ms | 52.5% bf16 MFU | 1624425 tok/s step 15252/19560 | loss 3.298092 (-0.42z)| norm 0.2602 (-1.20z)| lr 7.40e-05 | 322.43 ms | 52.3% bf16 MFU | 1624507 tok/s step 15253/19560 | loss 3.388398 (+1.98z)| norm 0.2867 (+0.51z)| lr 7.40e-05 | 323.02 ms | 52.2% bf16 MFU | 1624436 tok/s step 15254/19560 | loss 3.274502 (-1.03z)| norm 0.2606 (-1.18z)| lr 7.40e-05 | 322.45 ms | 52.3% bf16 MFU | 1624512 tok/s step 15255/19560 | loss 3.289401 (-0.63z)| norm 0.2650 (-0.88z)| lr 7.39e-05 | 322.81 ms | 52.3% bf16 MFU | 1624494 tok/s step 15256/19560 | loss 3.330617 (+0.45z)| norm 0.2607 (-1.17z)| lr 7.39e-05 | 322.74 ms | 52.3% bf16 MFU | 1624495 tok/s step 15257/19560 | loss 3.316682 (+0.07z)| norm 0.3052 (+1.71z)| lr 7.39e-05 | 322.75 ms | 52.3% bf16 MFU | 1624492 tok/s step 15258/19560 | loss 3.295410 (-0.51z)| norm 0.2592 (-1.27z)| lr 7.38e-05 | 322.82 ms | 52.3% bf16 MFU | 1624473 tok/s step 15259/19560 | loss 3.291161 (-0.62z)| norm 0.3071 (+1.79z)| lr 7.38e-05 | 323.02 ms | 52.2% bf16 MFU | 1624404 tok/s step 15260/19560 | loss 3.282780 (-0.84z)| norm 0.2680 (-0.72z)| lr 7.38e-05 | 322.76 ms | 52.3% bf16 MFU | 1624404 tok/s step 15261/19560 | loss 3.365700 (+1.36z)| norm 0.2804 (+0.08z)| lr 7.37e-05 | 322.37 ms | 52.4% bf16 MFU | 1624502 tok/s step 15262/19560 | loss 3.262933 (-1.36z)| norm 0.2917 (+0.79z)| lr 7.37e-05 | 322.85 ms | 52.3% bf16 MFU | 1624474 tok/s step 15263/19560 | loss 3.258400 (-1.45z)| norm 0.2947 (+0.98z)| lr 7.37e-05 | 322.13 ms | 52.4% bf16 MFU | 1624628 tok/s step 15264/19560 | loss 3.269503 (-1.16z)| norm 0.2662 (-0.86z)| lr 7.36e-05 | 322.56 ms | 52.3% bf16 MFU | 1624668 tok/s step 15265/19560 | loss 3.315083 (+0.03z)| norm 0.3206 (+2.58z)| lr 
7.36e-05 | 323.47 ms | 52.2% bf16 MFU | 1624476 tok/s step 15266/19560 | loss 3.276855 (-0.96z)| norm 0.3029 (+1.44z)| lr 7.36e-05 | 322.78 ms | 52.3% bf16 MFU | 1624465 tok/s step 15267/19560 | loss 3.286198 (-0.72z)| norm 0.2918 (+0.73z)| lr 7.35e-05 | 322.20 ms | 52.4% bf16 MFU | 1624604 tok/s step 15268/19560 | loss 3.320119 (+0.17z)| norm 0.3174 (+2.27z)| lr 7.35e-05 | 322.56 ms | 52.3% bf16 MFU | 1624644 tok/s step 15269/19560 | loss 3.252386 (-1.66z)| norm 0.2658 (-0.87z)| lr 7.35e-05 | 322.75 ms | 52.3% bf16 MFU | 1624633 tok/s step 15270/19560 | loss 3.324053 (+0.28z)| norm 0.2596 (-1.23z)| lr 7.34e-05 | 322.72 ms | 52.3% bf16 MFU | 1624630 tok/s step 15271/19560 | loss 3.303825 (-0.27z)| norm 0.2754 (-0.27z)| lr 7.34e-05 | 322.17 ms | 52.4% bf16 MFU | 1624766 tok/s step 15272/19560 | loss 3.320745 (+0.20z)| norm 0.2683 (-0.69z)| lr 7.34e-05 | 322.54 ms | 52.3% bf16 MFU | 1624802 tok/s step 15273/19560 | loss 3.425535 (+2.97z)| norm 0.2515 (-1.69z)| lr 7.33e-05 | 322.49 ms | 52.3% bf16 MFU | 1624850 tok/s step 15274/19560 | loss 3.321769 (+0.17z)| norm 0.2703 (-0.56z)| lr 7.33e-05 | 323.06 ms | 52.2% bf16 MFU | 1624752 tok/s step 15275/19560 | loss 3.360687 (+1.21z)| norm 0.2780 (-0.08z)| lr 7.33e-05 | 322.93 ms | 52.3% bf16 MFU | 1624691 tok/s step 15276/19560 | loss 3.280678 (-0.94z)| norm 0.2553 (-1.46z)| lr 7.32e-05 | 322.67 ms | 52.3% bf16 MFU | 1624698 tok/s step 15277/19560 | loss 3.334030 (+0.52z)| norm 0.2730 (-0.37z)| lr 7.32e-05 | 322.45 ms | 52.3% bf16 MFU | 1624760 tok/s step 15278/19560 | loss 3.303944 (-0.31z)| norm 0.2831 (+0.26z)| lr 7.32e-05 | 323.19 ms | 52.2% bf16 MFU | 1624633 tok/s step 15279/19560 | loss 3.339145 (+0.64z)| norm 0.2825 (+0.22z)| lr 7.31e-05 | 322.56 ms | 52.3% bf16 MFU | 1624672 tok/s step 15280/19560 | loss 3.425552 (+2.91z)| norm 0.3045 (+1.57z)| lr 7.31e-05 | 322.57 ms | 52.3% bf16 MFU | 1624706 tok/s step 15281/19560 | loss 3.311158 (-0.17z)| norm 0.2836 (+0.27z)| lr 7.31e-05 | 322.84 ms | 52.3% bf16 MFU | 1624671 
tok/s step 15282/19560 | loss 3.278026 (-1.04z)| norm 0.2879 (+0.53z)| lr 7.30e-05 | 323.09 ms | 52.2% bf16 MFU | 1624574 tok/s step 15283/19560 | loss 3.317092 (+0.03z)| norm 0.2786 (-0.04z)| lr 7.30e-05 | 323.09 ms | 52.2% bf16 MFU | 1624482 tok/s step 15284/19560 | loss 3.284541 (-0.88z)| norm 0.2842 (+0.30z)| lr 7.30e-05 | 323.11 ms | 52.2% bf16 MFU | 1624388 tok/s step 15285/19560 | loss 3.373597 (+1.59z)| norm 0.3130 (+2.07z)| lr 7.29e-05 | 322.44 ms | 52.3% bf16 MFU | 1624469 tok/s step 15286/19560 | loss 3.345785 (+0.80z)| norm 0.2699 (-0.58z)| lr 7.29e-05 | 322.43 ms | 52.3% bf16 MFU | 1624548 tok/s step 15287/19560 | loss 3.245441 (-1.94z)| norm 0.3109 (+1.95z)| lr 7.29e-05 | 322.83 ms | 52.3% bf16 MFU | 1624522 tok/s step 15288/19560 | loss 3.302840 (-0.36z)| norm 0.2811 (+0.12z)| lr 7.28e-05 | 322.63 ms | 52.3% bf16 MFU | 1624548 tok/s step 15289/19560 | loss 3.311500 (-0.12z)| norm 0.2810 (+0.11z)| lr 7.28e-05 | 322.92 ms | 52.3% bf16 MFU | 1624499 tok/s step 15290/19560 | loss 3.318246 (+0.07z)| norm 0.3071 (+1.70z)| lr 7.28e-05 | 322.36 ms | 52.4% bf16 MFU | 1624594 tok/s step 15291/19560 | loss 3.336018 (+0.55z)| norm 0.3029 (+1.42z)| lr 7.27e-05 | 322.07 ms | 52.4% bf16 MFU | 1624758 tok/s step 15292/19560 | loss 3.322754 (+0.17z)| norm 0.3052 (+1.53z)| lr 7.27e-05 | 322.72 ms | 52.3% bf16 MFU | 1624751 tok/s step 15293/19560 | loss 3.328971 (+0.33z)| norm 0.2986 (+1.12z)| lr 7.27e-05 | 322.87 ms | 52.3% bf16 MFU | 1624705 tok/s step 15294/19560 | loss 3.324356 (+0.20z)| norm 0.2999 (+1.17z)| lr 7.26e-05 | 323.00 ms | 52.3% bf16 MFU | 1624628 tok/s step 15295/19560 | loss 3.291273 (-0.72z)| norm 0.3199 (+2.31z)| lr 7.26e-05 | 322.45 ms | 52.3% bf16 MFU | 1624694 tok/s step 15296/19560 | loss 3.295128 (-0.60z)| norm 0.2764 (-0.26z)| lr 7.26e-05 | 322.45 ms | 52.3% bf16 MFU | 1624758 tok/s step 15297/19560 | loss 3.398558 (+2.23z)| norm 0.2782 (-0.16z)| lr 7.25e-05 | 322.30 ms | 52.4% bf16 MFU | 1624855 tok/s step 15298/19560 | loss 3.262085 
(-1.49z)| norm 0.2743 (-0.40z)| lr 7.25e-05 | 323.11 ms | 52.2% bf16 MFU | 1624744 tok/s step 15299/19560 | loss 3.346103 (+0.78z)| norm 0.2873 (+0.37z)| lr 7.25e-05 | 322.57 ms | 52.3% bf16 MFU | 1624774 tok/s step 15300/19560 | loss 3.348770 (+0.84z)| norm 0.2587 (-1.32z)| lr 7.24e-05 | 322.54 ms | 52.3% bf16 MFU | 1624810 tok/s step 15301/19560 | loss 3.288762 (-0.79z)| norm 0.2543 (-1.55z)| lr 7.24e-05 | 322.48 ms | 52.3% bf16 MFU | 1624860 tok/s step 15302/19560 | loss 3.335927 (+0.49z)| norm 0.2761 (-0.27z)| lr 7.24e-05 | 322.70 ms | 52.3% bf16 MFU | 1624852 tok/s step 15303/19560 | loss 3.320034 (+0.05z)| norm 0.2904 (+0.58z)| lr 7.23e-05 | 322.77 ms | 52.3% bf16 MFU | 1624827 tok/s step 15304/19560 | loss 3.321013 (+0.07z)| norm 0.2683 (-0.72z)| lr 7.23e-05 | 322.60 ms | 52.3% bf16 MFU | 1624845 tok/s step 15305/19560 | loss 3.297588 (-0.58z)| norm 0.2767 (-0.22z)| lr 7.23e-05 | 323.22 ms | 52.2% bf16 MFU | 1624707 tok/s step 15306/19560 | loss 3.366534 (+1.31z)| norm 0.2847 (+0.25z)| lr 7.23e-05 | 322.56 ms | 52.3% bf16 MFU | 1624741 tok/s step 15307/19560 | loss 3.296649 (-0.60z)| norm 0.2874 (+0.40z)| lr 7.22e-05 | 322.68 ms | 52.3% bf16 MFU | 1624743 tok/s step 15308/19560 | loss 3.321570 (+0.10z)| norm 0.2862 (+0.34z)| lr 7.22e-05 | 323.36 ms | 52.2% bf16 MFU | 1624575 tok/s step 15309/19560 | loss 3.343668 (+0.71z)| norm 0.2467 (-1.97z)| lr 7.22e-05 | 323.36 ms | 52.2% bf16 MFU | 1624415 tok/s step 15310/19560 | loss 3.302859 (-0.42z)| norm 0.2757 (-0.28z)| lr 7.21e-05 | 322.87 ms | 52.3% bf16 MFU | 1624385 tok/s step 15311/19560 | loss 3.340312 (+0.63z)| norm 0.2853 (+0.28z)| lr 7.21e-05 | 322.33 ms | 52.4% bf16 MFU | 1624493 tok/s step 15312/19560 | loss 3.349220 (+0.87z)| norm 0.2649 (-0.92z)| lr 7.21e-05 | 323.02 ms | 52.2% bf16 MFU | 1624422 tok/s step 15313/19560 | loss 3.384783 (+1.82z)| norm 0.3088 (+1.64z)| lr 7.20e-05 | 322.78 ms | 52.3% bf16 MFU | 1624414 tok/s step 15314/19560 | loss 3.355747 (+1.00z)| norm 0.2873 (+0.37z)| lr 7.20e-05 | 
322.69 ms | 52.3% bf16 MFU | 1624431 tok/s step 15315/19560 | loss 3.301489 (-0.48z)| norm 0.3028 (+1.26z)| lr 7.20e-05 | 322.89 ms | 52.3% bf16 MFU | 1624395 tok/s step 15316/19560 | loss 3.293881 (-0.69z)| norm 0.2916 (+0.59z)| lr 7.19e-05 | 323.10 ms | 52.2% bf16 MFU | 1624311 tok/s step 15317/19560 | loss 3.296680 (-0.62z)| norm 0.2642 (-1.01z)| lr 7.19e-05 | 322.25 ms | 52.4% bf16 MFU | 1624444 tok/s step 15318/19560 | loss 3.281338 (-1.03z)| norm 0.2740 (-0.44z)| lr 7.19e-05 | 322.70 ms | 52.3% bf16 MFU | 1624457 tok/s step 15319/19560 | loss 3.293621 (-0.68z)| norm 0.2867 (+0.30z)| lr 7.18e-05 | 322.35 ms | 52.4% bf16 MFU | 1624557 tok/s step 15320/19560 | loss 3.292960 (-0.70z)| norm 0.3065 (+1.44z)| lr 7.18e-05 | 322.08 ms | 52.4% bf16 MFU | 1624720 tok/s step 15321/19560 | loss 3.292481 (-0.71z)| norm 0.2912 (+0.54z)| lr 7.18e-05 | 322.66 ms | 52.3% bf16 MFU | 1624728 tok/s step 15322/19560 | loss 3.307484 (-0.29z)| norm 0.3109 (+1.66z)| lr 7.17e-05 | 322.64 ms | 52.3% bf16 MFU | 1624741 tok/s step 15323/19560 | loss 3.353870 (+0.97z)| norm 0.3106 (+1.61z)| lr 7.17e-05 | 322.63 ms | 52.3% bf16 MFU | 1624756 tok/s step 15324/19560 | loss 3.346272 (+0.75z)| norm 0.2814 (-0.09z)| lr 7.17e-05 | 322.74 ms | 52.3% bf16 MFU | 1624742 tok/s step 15325/19560 | loss 3.371138 (+1.41z)| norm 0.3419 (+3.26z)| lr 7.16e-05 | 322.90 ms | 52.3% bf16 MFU | 1624689 tok/s step 15326/19560 | loss 3.316764 (-0.06z)| norm 0.2751 (-0.47z)| lr 7.16e-05 | 322.32 ms | 52.4% bf16 MFU | 1624786 tok/s step 15327/19560 | loss 3.343944 (+0.68z)| norm 0.2934 (+0.55z)| lr 7.16e-05 | 322.34 ms | 52.4% bf16 MFU | 1624872 tok/s step 15328/19560 | loss 3.284784 (-0.93z)| norm 0.2978 (+0.78z)| lr 7.15e-05 | 323.47 ms | 52.2% bf16 MFU | 1624670 tok/s step 15329/19560 | loss 3.313370 (-0.14z)| norm 0.2714 (-0.68z)| lr 7.15e-05 | 322.77 ms | 52.3% bf16 MFU | 1624654 tok/s step 15330/19560 | loss 3.311621 (-0.19z)| norm 0.2873 (+0.20z)| lr 7.15e-05 | 322.49 ms | 52.3% bf16 MFU | 1624709 tok/s step 
15331/19560 | loss 3.316537 (-0.04z)| norm 0.2944 (+0.59z)| lr 7.14e-05 | 322.85 ms | 52.3% bf16 MFU | 1624671 tok/s step 15332/19560 | loss 3.316746 (-0.04z)| norm 0.2688 (-0.84z)| lr 7.14e-05 | 322.44 ms | 52.3% bf16 MFU | 1624738 tok/s step 15333/19560 | loss 3.339746 (+0.59z)| norm 0.2972 (+0.73z)| lr 7.14e-05 | 322.86 ms | 52.3% bf16 MFU | 1624696 tok/s step 15334/19560 | loss 3.315047 (-0.09z)| norm 0.2680 (-0.89z)| lr 7.13e-05 | 323.40 ms | 52.2% bf16 MFU | 1624519 tok/s step 15335/19560 | loss 3.350718 (+0.89z)| norm 0.2904 (+0.35z)| lr 7.13e-05 | 322.78 ms | 52.3% bf16 MFU | 1624508 tok/s step 15336/19560 | loss 3.306879 (-0.33z)| norm 0.2853 (+0.06z)| lr 7.13e-05 | 322.68 ms | 52.3% bf16 MFU | 1624522 tok/s step 15337/19560 | loss 3.316825 (-0.05z)| norm 0.2778 (-0.36z)| lr 7.12e-05 | 322.43 ms | 52.3% bf16 MFU | 1624598 tok/s step 15338/19560 | loss 3.236669 (-2.21z)| norm 0.2797 (-0.26z)| lr 7.12e-05 | 323.36 ms | 52.2% bf16 MFU | 1624438 tok/s step 15339/19560 | loss 3.308762 (-0.24z)| norm 0.2875 (+0.20z)| lr 7.12e-05 | 322.48 ms | 52.3% bf16 MFU | 1624506 tok/s step 15340/19560 | loss 3.358623 (+1.14z)| norm 0.2792 (-0.27z)| lr 7.11e-05 | 322.49 ms | 52.3% bf16 MFU | 1624569 tok/s step 15341/19560 | loss 3.325711 (+0.24z)| norm 0.3491 (+3.64z)| lr 7.11e-05 | 322.49 ms | 52.3% bf16 MFU | 1624628 tok/s step 15342/19560 | loss 3.270584 (-1.28z)| norm 0.2803 (-0.21z)| lr 7.11e-05 | 323.35 ms | 52.2% bf16 MFU | 1624467 tok/s step 15343/19560 | loss 3.316060 (-0.01z)| norm 0.3113 (+1.50z)| lr 7.11e-05 | 322.71 ms | 52.3% bf16 MFU | 1624475 tok/s step 15344/19560 | loss 3.293383 (-0.66z)| norm 0.3127 (+1.56z)| lr 7.10e-05 | 322.77 ms | 52.3% bf16 MFU | 1624468 tok/s step 15345/19560 | loss 3.310471 (-0.18z)| norm 0.2802 (-0.24z)| lr 7.10e-05 | 322.69 ms | 52.3% bf16 MFU | 1624483 tok/s step 15346/19560 | loss 3.345662 (+0.83z)| norm 0.3088 (+1.33z)| lr 7.10e-05 | 323.46 ms | 52.2% bf16 MFU | 1624303 tok/s step 15347/19560 | loss 3.261568 (-1.53z)| norm 
0.3049 (+1.11z)| lr 7.09e-05 | 322.51 ms | 52.3% bf16 MFU | 1624370 tok/s step 15348/19560 | loss 3.322832 (+0.19z)| norm 0.2749 (-0.53z)| lr 7.09e-05 | 322.44 ms | 52.3% bf16 MFU | 1624452 tok/s step 15349/19560 | loss 3.284119 (-0.91z)| norm 0.2869 (+0.14z)| lr 7.09e-05 | 322.94 ms | 52.3% bf16 MFU | 1624404 tok/s step 15350/19560 | loss 3.337865 (+0.62z)| norm 0.2799 (-0.25z)| lr 7.08e-05 | 322.30 ms | 52.4% bf16 MFU | 1624520 tok/s step 15351/19560 | loss 3.307393 (-0.25z)| norm 0.2911 (+0.36z)| lr 7.08e-05 | 323.24 ms | 52.2% bf16 MFU | 1624392 tok/s step 15352/19560 | loss 3.295884 (-0.58z)| norm 0.3073 (+1.24z)| lr 7.08e-05 | 322.98 ms | 52.3% bf16 MFU | 1624337 tok/s step 15353/19560 | loss 3.333461 (+0.50z)| norm 0.2660 (-1.03z)| lr 7.07e-05 | 322.83 ms | 52.3% bf16 MFU | 1624321 tok/s step 15354/19560 | loss 3.263476 (-1.52z)| norm 0.2756 (-0.49z)| lr 7.07e-05 | 322.90 ms | 52.3% bf16 MFU | 1624289 tok/s step 15355/19560 | loss 3.355654 (+1.14z)| norm 0.2661 (-1.00z)| lr 7.07e-05 | 322.64 ms | 52.3% bf16 MFU | 1624324 tok/s step 15356/19560 | loss 3.323478 (+0.20z)| norm 0.2781 (-0.34z)| lr 7.06e-05 | 323.61 ms | 52.2% bf16 MFU | 1624115 tok/s step 15357/19560 | loss 3.300827 (-0.46z)| norm 0.2795 (-0.25z)| lr 7.06e-05 | 322.50 ms | 52.3% bf16 MFU | 1624194 tok/s step 15358/19560 | loss 3.343298 (+0.78z)| norm 0.2632 (-1.14z)| lr 7.06e-05 | 322.59 ms | 52.3% bf16 MFU | 1624245 tok/s step 15359/19560 | loss 3.300252 (-0.47z)| norm 0.2594 (-1.32z)| lr 7.05e-05 | 322.84 ms | 52.3% bf16 MFU | 1624232 tok/s step 15360/19560 | loss 3.371311 (+1.57z)| norm 0.2761 (-0.42z)| lr 7.05e-05 | 322.96 ms | 52.3% bf16 MFU | 1624189 tok/s step 15361/19560 | loss 3.288242 (-0.83z)| norm 0.2747 (-0.50z)| lr 7.05e-05 | 322.84 ms | 52.3% bf16 MFU | 1624178 tok/s step 15362/19560 | loss 3.357328 (+1.18z)| norm 0.3080 (+1.31z)| lr 7.04e-05 | 322.16 ms | 52.4% bf16 MFU | 1624340 tok/s step 15363/19560 | loss 3.240865 (-2.16z)| norm 0.2983 (+0.77z)| lr 7.04e-05 | 322.56 ms | 
52.3% bf16 MFU | 1624393 tok/s step 15364/19560 | loss 3.280631 (-1.00z)| norm 0.2779 (-0.33z)| lr 7.04e-05 | 323.72 ms | 52.1% bf16 MFU | 1624153 tok/s step 15365/19560 | loss 3.236044 (-2.22z)| norm 0.3185 (+1.86z)| lr 7.03e-05 | 322.19 ms | 52.4% bf16 MFU | 1624308 tok/s step 15366/19560 | loss 3.300853 (-0.40z)| norm 0.2812 (-0.15z)| lr 7.03e-05 | 322.96 ms | 52.3% bf16 MFU | 1624263 tok/s step 15367/19560 | loss 3.293277 (-0.61z)| norm 0.2996 (+0.84z)| lr 7.03e-05 | 323.21 ms | 52.2% bf16 MFU | 1624156 tok/s step 15368/19560 | loss 3.333365 (+0.53z)| norm 0.3006 (+0.88z)| lr 7.02e-05 | 322.59 ms | 52.3% bf16 MFU | 1624210 tok/s step 15369/19560 | loss 3.338366 (+0.66z)| norm 0.2991 (+0.78z)| lr 7.02e-05 | 323.00 ms | 52.3% bf16 MFU | 1624158 tok/s step 15370/19560 | loss 3.266900 (-1.36z)| norm 0.2861 (+0.08z)| lr 7.02e-05 | 322.98 ms | 52.3% bf16 MFU | 1624114 tok/s step 15371/19560 | loss 3.291215 (-0.68z)| norm 0.2785 (-0.34z)| lr 7.02e-05 | 323.45 ms | 52.2% bf16 MFU | 1623955 tok/s step 15372/19560 | loss 3.265333 (-1.39z)| norm 0.3162 (+1.67z)| lr 7.01e-05 | 322.72 ms | 52.3% bf16 MFU | 1623988 tok/s step 15373/19560 | loss 3.320567 (+0.16z)| norm 0.2836 (-0.08z)| lr 7.01e-05 | 322.51 ms | 52.3% bf16 MFU | 1624072 tok/s step 15374/19560 | loss 3.319616 (+0.15z)| norm 0.2814 (-0.19z)| lr 7.01e-05 | 323.29 ms | 52.2% bf16 MFU | 1623954 tok/s step 15375/19560 | loss 3.263851 (-1.41z)| norm 0.2927 (+0.40z)| lr 7.00e-05 | 322.77 ms | 52.3% bf16 MFU | 1623974 tok/s step 15376/19560 | loss 3.290117 (-0.66z)| norm 0.3078 (+1.20z)| lr 7.00e-05 | 323.93 ms | 52.1% bf16 MFU | 1623702 tok/s step 15377/19560 | loss 3.295142 (-0.51z)| norm 0.3046 (+1.01z)| lr 7.00e-05 | 322.37 ms | 52.4% bf16 MFU | 1623833 tok/s step 15378/19560 | loss 3.318438 (+0.14z)| norm 0.3019 (+0.85z)| lr 6.99e-05 | 322.54 ms | 52.3% bf16 MFU | 1623918 tok/s step 15379/19560 | loss 3.284743 (-0.81z)| norm 0.3651 (+4.01z)| lr 6.99e-05 | 323.17 ms | 52.2% bf16 MFU | 1623839 tok/s step 15380/19560 
| loss 3.333811 (+0.57z)| norm 0.3013 (+0.72z)| lr 6.99e-05 | 322.79 ms | 52.3% bf16 MFU | 1623859 tok/s step 15381/19560 | loss 3.262278 (-1.43z)| norm 0.2950 (+0.39z)| lr 6.98e-05 | 323.16 ms | 52.2% bf16 MFU | 1623784 tok/s step 15382/19560 | loss 3.297159 (-0.45z)| norm 0.3262 (+1.96z)| lr 6.98e-05 | 322.81 ms | 52.3% bf16 MFU | 1623802 tok/s step 15383/19560 | loss 3.264198 (-1.38z)| norm 0.2752 (-0.66z)| lr 6.98e-05 | 322.24 ms | 52.4% bf16 MFU | 1623962 tok/s step 15384/19560 | loss 3.269223 (-1.21z)| norm 0.3004 (+0.63z)| lr 6.97e-05 | 322.89 ms | 52.3% bf16 MFU | 1623951 tok/s step 15385/19560 | loss 3.353480 (+1.16z)| norm 0.3007 (+0.64z)| lr 6.97e-05 | 322.70 ms | 52.3% bf16 MFU | 1623988 tok/s step 15386/19560 | loss 3.320436 (+0.22z)| norm 0.2858 (-0.14z)| lr 6.97e-05 | 322.63 ms | 52.3% bf16 MFU | 1624040 tok/s step 15387/19560 | loss 3.333998 (+0.60z)| norm 0.3136 (+1.31z)| lr 6.96e-05 | 322.30 ms | 52.4% bf16 MFU | 1624172 tok/s step 15388/19560 | loss 3.277978 (-0.98z)| norm 0.2803 (-0.44z)| lr 6.96e-05 | 322.69 ms | 52.3% bf16 MFU | 1624201 tok/s step 15389/19560 | loss 3.393161 (+2.23z)| norm 0.3195 (+1.58z)| lr 6.96e-05 | 322.87 ms | 52.3% bf16 MFU | 1624182 tok/s step 15390/19560 | loss 3.400519 (+2.37z)| norm 0.2702 (-0.96z)| lr 6.95e-05 | 323.55 ms | 52.2% bf16 MFU | 1623995 tok/s step 15391/19560 | loss 3.308096 (-0.18z)| norm 0.2932 (+0.23z)| lr 6.95e-05 | 322.83 ms | 52.3% bf16 MFU | 1623998 tok/s step 15392/19560 | loss 3.394405 (+2.16z)| norm 0.3099 (+1.07z)| lr 6.95e-05 | 323.40 ms | 52.2% bf16 MFU | 1623858 tok/s step 15393/19560 | loss 3.351272 (+0.97z)| norm 0.2606 (-1.45z)| lr 6.94e-05 | 322.96 ms | 52.3% bf16 MFU | 1623833 tok/s step 15394/19560 | loss 3.287667 (-0.77z)| norm 0.2811 (-0.38z)| lr 6.94e-05 | 322.70 ms | 52.3% bf16 MFU | 1623877 tok/s step 15395/19560 | loss 3.277529 (-1.04z)| norm 0.2845 (-0.20z)| lr 6.94e-05 | 323.42 ms | 52.2% bf16 MFU | 1623738 tok/s step 15396/19560 | loss 3.348262 (+0.88z)| norm 0.2598 (-1.46z)| 
lr 6.94e-05 | 322.54 ms | 52.3% bf16 MFU | 1623826 tok/s step 15397/19560 | loss 3.285059 (-0.86z)| norm 0.2775 (-0.55z)| lr 6.93e-05 | 322.79 ms | 52.3% bf16 MFU | 1623847 tok/s step 15398/19560 | loss 3.335925 (+0.54z)| norm 0.2926 (+0.22z)| lr 6.93e-05 | 322.59 ms | 52.3% bf16 MFU | 1623917 tok/s step 15399/19560 | loss 3.327876 (+0.31z)| norm 0.2710 (-0.90z)| lr 6.93e-05 | 322.91 ms | 52.3% bf16 MFU | 1623903 tok/s step 15400/19560 | loss 3.276300 (-1.09z)| norm 0.2852 (-0.17z)| lr 6.92e-05 | 323.48 ms | 52.2% bf16 MFU | 1623746 tok/s step 15401/19560 | loss 3.328180 (+0.36z)| norm 0.2970 (+0.44z)| lr 6.92e-05 | 323.12 ms | 52.2% bf16 MFU | 1623686 tok/s step 15402/19560 | loss 3.276047 (-1.10z)| norm 0.2722 (-0.89z)| lr 6.92e-05 | 322.49 ms | 52.3% bf16 MFU | 1623789 tok/s step 15403/19560 | loss 3.286435 (-0.79z)| norm 0.3133 (+1.29z)| lr 6.91e-05 | 322.90 ms | 52.3% bf16 MFU | 1623783 tok/s step 15404/19560 | loss 3.358457 (+1.22z)| norm 0.2616 (-1.47z)| lr 6.91e-05 | 322.91 ms | 52.3% bf16 MFU | 1623776 tok/s step 15405/19560 | loss 3.317216 (+0.06z)| norm 0.2738 (-0.82z)| lr 6.91e-05 | 322.55 ms | 52.3% bf16 MFU | 1623858 tok/s step 15406/19560 | loss 3.289768 (-0.71z)| norm 0.3212 (+1.68z)| lr 6.90e-05 | 322.97 ms | 52.3% bf16 MFU | 1623831 tok/s step 15407/19560 | loss 3.286584 (-0.79z)| norm 0.2683 (-1.11z)| lr 6.90e-05 | 322.54 ms | 52.3% bf16 MFU | 1623915 tok/s step 15408/19560 | loss 3.314810 (+0.04z)| norm 0.2815 (-0.40z)| lr 6.90e-05 | 322.85 ms | 52.3% bf16 MFU | 1623916 tok/s step 15409/19560 | loss 3.264786 (-1.41z)| norm 0.2649 (-1.26z)| lr 6.89e-05 | 322.53 ms | 52.3% bf16 MFU | 1623999 tok/s step 15410/19560 | loss 3.318213 (+0.14z)| norm 0.2872 (-0.09z)| lr 6.89e-05 | 322.99 ms | 52.3% bf16 MFU | 1623961 tok/s step 15411/19560 | loss 3.353457 (+1.15z)| norm 0.2685 (-1.06z)| lr 6.89e-05 | 323.00 ms | 52.3% bf16 MFU | 1623921 tok/s step 15412/19560 | loss 3.287249 (-0.77z)| norm 0.2724 (-0.85z)| lr 6.88e-05 | 322.45 ms | 52.3% bf16 MFU | 
1624022 tok/s step 15413/19560 | loss 3.320418 (+0.21z)| norm 0.2732 (-0.80z)| lr 6.88e-05 | 323.03 ms | 52.2% bf16 MFU | 1623973 tok/s step 15414/19560 | loss 3.237660 (-2.17z)| norm 0.2782 (-0.54z)| lr 6.88e-05 | 323.56 ms | 52.2% bf16 MFU | 1623792 tok/s step 15415/19560 | loss 3.333554 (+0.60z)| norm 0.2740 (-0.75z)| lr 6.87e-05 | 323.00 ms | 52.3% bf16 MFU | 1623762 tok/s step 15416/19560 | loss 3.394025 (+2.30z)| norm 0.3097 (+1.11z)| lr 6.87e-05 | 323.03 ms | 52.2% bf16 MFU | 1623725 tok/s step 15417/19560 | loss 3.302297 (-0.33z)| norm 0.2870 (-0.08z)| lr 6.87e-05 | 322.72 ms | 52.3% bf16 MFU | 1623769 tok/s step 15418/19560 | loss 3.300204 (-0.39z)| norm 0.2755 (-0.67z)| lr 6.86e-05 | 322.48 ms | 52.3% bf16 MFU | 1623871 tok/s step 15419/19560 | loss 3.325223 (+0.33z)| norm 0.2717 (-0.86z)| lr 6.86e-05 | 322.81 ms | 52.3% bf16 MFU | 1623884 tok/s step 15420/19560 | loss 3.384106 (+1.98z)| norm 0.2892 (+0.07z)| lr 6.86e-05 | 322.77 ms | 52.3% bf16 MFU | 1623908 tok/s step 15421/19560 | loss 3.264338 (-1.39z)| norm 0.2741 (-0.71z)| lr 6.86e-05 | 322.91 ms | 52.3% bf16 MFU | 1623895 tok/s step 15422/19560 | loss 3.290356 (-0.65z)| norm 0.2973 (+0.51z)| lr 6.85e-05 | 322.15 ms | 52.4% bf16 MFU | 1624073 tok/s step 15423/19560 | loss 3.277060 (-1.02z)| norm 0.2867 (-0.04z)| lr 6.85e-05 | 322.86 ms | 52.3% bf16 MFU | 1624064 tok/s step 15424/19560 | loss 3.359601 (+1.28z)| norm 0.2967 (+0.48z)| lr 6.85e-05 | 323.01 ms | 52.2% bf16 MFU | 1624017 tok/s step 15425/19560 | loss 3.345845 (+0.92z)| norm 0.2837 (-0.21z)| lr 6.84e-05 | 321.90 ms | 52.4% bf16 MFU | 1624254 tok/s step 15426/19560 | loss 3.281983 (-0.90z)| norm 0.2836 (-0.22z)| lr 6.84e-05 | 323.10 ms | 52.2% bf16 MFU | 1624176 tok/s step 15427/19560 | loss 3.296087 (-0.49z)| norm 0.2815 (-0.33z)| lr 6.84e-05 | 322.44 ms | 52.3% bf16 MFU | 1624266 tok/s step 15428/19560 | loss 3.308825 (-0.12z)| norm 0.2740 (-0.74z)| lr 6.83e-05 | 322.76 ms | 52.3% bf16 MFU | 1624273 tok/s step 15429/19560 | loss 3.348823 
(+1.02z)| norm 0.2681 (-1.07z)| lr 6.83e-05 | 322.89 ms | 52.3% bf16 MFU | 1624245 tok/s step 15430/19560 | loss 3.370595 (+1.62z)| norm 0.2818 (-0.34z)| lr 6.83e-05 | 322.34 ms | 52.4% bf16 MFU | 1624358 tok/s step 15431/19560 | loss 3.326978 (+0.38z)| norm 0.2661 (-1.17z)| lr 6.82e-05 | 322.42 ms | 52.3% bf16 MFU | 1624446 tok/s step 15432/19560 | loss 3.289661 (-0.67z)| norm 0.2769 (-0.59z)| lr 6.82e-05 | 323.12 ms | 52.2% bf16 MFU | 1624353 tok/s step 15433/19560 | loss 3.336066 (+0.63z)| norm 0.2829 (-0.27z)| lr 6.82e-05 | 322.72 ms | 52.3% bf16 MFU | 1624366 tok/s step 15434/19560 | loss 3.328850 (+0.44z)| norm 0.2805 (-0.40z)| lr 6.81e-05 | 322.38 ms | 52.4% bf16 MFU | 1624462 tok/s step 15435/19560 | loss 3.303176 (-0.29z)| norm 0.2966 (+0.47z)| lr 6.81e-05 | 322.33 ms | 52.4% bf16 MFU | 1624567 tok/s step 15436/19560 | loss 3.297921 (-0.44z)| norm 0.2878 (-0.00z)| lr 6.81e-05 | 322.44 ms | 52.3% bf16 MFU | 1624639 tok/s step 15437/19560 | loss 3.368805 (+1.57z)| norm 0.2961 (+0.43z)| lr 6.80e-05 | 322.91 ms | 52.3% bf16 MFU | 1624588 tok/s step 15438/19560 | loss 3.309936 (-0.10z)| norm 0.2937 (+0.29z)| lr 6.80e-05 | 322.48 ms | 52.3% bf16 MFU | 1624648 tok/s step 15439/19560 | loss 3.301069 (-0.35z)| norm 0.3303 (+2.25z)| lr 6.80e-05 | 322.51 ms | 52.3% bf16 MFU | 1624697 tok/s step 15440/19560 | loss 3.352606 (+1.12z)| norm 0.3132 (+1.30z)| lr 6.80e-05 | 323.29 ms | 52.2% bf16 MFU | 1624549 tok/s step 15441/19560 | loss 3.277440 (-1.01z)| norm 0.2681 (-1.13z)| lr 6.79e-05 | 322.25 ms | 52.4% bf16 MFU | 1624670 tok/s step 15442/19560 | loss 3.356840 (+1.28z)| norm 0.5160 (+8.28z)| lr 6.79e-05 | 322.35 ms | 52.4% bf16 MFU | 1624758 tok/s step 15443/19560 | loss 3.350005 (+1.07z)| norm 0.3085 (+0.65z)| lr 6.79e-05 | 322.42 ms | 52.3% bf16 MFU | 1624825 tok/s step 15444/19560 | loss 3.343282 (+0.86z)| norm 0.3978 (+3.69z)| lr 6.78e-05 | 322.16 ms | 52.4% bf16 MFU | 1624953 tok/s step 15445/19560 | loss 3.338972 (+0.73z)| norm 0.2758 (-0.55z)| lr 6.78e-05 | 
323.44 ms | 52.2% bf16 MFU | 1624755 tok/s step 15446/19560 | loss 3.322905 (+0.26z)| norm 0.2919 (+0.00z)| lr 6.78e-05 | 322.23 ms | 52.4% bf16 MFU | 1624870 tok/s step 15447/19560 | loss 3.339322 (+0.72z)| norm 0.3115 (+0.68z)| lr 6.77e-05 | 322.92 ms | 52.3% bf16 MFU | 1624807 tok/s step 15448/19560 | loss 3.342025 (+0.79z)| norm 0.2604 (-1.08z)| lr 6.77e-05 | 322.47 ms | 52.3% bf16 MFU | 1624860 tok/s step 15449/19560 | loss 3.297905 (-0.48z)| norm 0.2879 (-0.13z)| lr 6.77e-05 | 322.61 ms | 52.3% bf16 MFU | 1624873 tok/s step 15450/19560 | loss 3.302294 (-0.35z)| norm 0.2748 (-0.57z)| lr 6.76e-05 | 322.10 ms | 52.4% bf16 MFU | 1625015 tok/s step 15451/19560 | loss 3.309471 (-0.14z)| norm 0.2779 (-0.46z)| lr 6.76e-05 | 322.94 ms | 52.3% bf16 MFU | 1624938 tok/s step 15452/19560 | loss 3.345998 (+0.91z)| norm 0.2867 (-0.15z)| lr 6.76e-05 | 322.47 ms | 52.3% bf16 MFU | 1624983 tok/s step 15453/19560 | loss 3.335218 (+0.62z)| norm 0.2661 (-0.85z)| lr 6.75e-05 | 322.57 ms | 52.3% bf16 MFU | 1625002 tok/s step 15454/19560 | loss 3.363934 (+1.43z)| norm 0.3170 (+0.92z)| lr 6.75e-05 | 322.63 ms | 52.3% bf16 MFU | 1625004 tok/s step 15455/19560 | loss 3.344882 (+0.88z)| norm 0.2801 (-0.37z)| lr 6.75e-05 | 322.75 ms | 52.3% bf16 MFU | 1624975 tok/s step 15456/19560 | loss 3.360657 (+1.31z)| norm 0.2777 (-0.45z)| lr 6.74e-05 | 322.62 ms | 52.3% bf16 MFU | 1624981 tok/s step 15457/19560 | loss 3.324382 (+0.27z)| norm 0.2745 (-0.56z)| lr 6.74e-05 | 322.49 ms | 52.3% bf16 MFU | 1625018 tok/s step 15458/19560 | loss 3.389022 (+2.07z)| norm 0.2832 (-0.26z)| lr 6.74e-05 | 322.64 ms | 52.3% bf16 MFU | 1625017 tok/s step 15459/19560 | loss 3.337989 (+0.63z)| norm 0.2889 (-0.06z)| lr 6.73e-05 | 323.29 ms | 52.2% bf16 MFU | 1624854 tok/s step 15460/19560 | loss 3.280763 (-0.97z)| norm 0.2562 (-1.19z)| lr 6.73e-05 | 322.84 ms | 52.3% bf16 MFU | 1624809 tok/s step 15461/19560 | loss 3.268225 (-1.30z)| norm 0.2742 (-0.56z)| lr 6.73e-05 | 322.23 ms | 52.4% bf16 MFU | 1624921 tok/s step 
15462/19560 | loss 3.291364 (-0.65z)| norm 0.2882 (-0.07z)| lr 6.73e-05 | 322.94 ms | 52.3% bf16 MFU | 1624850 tok/s step 15463/19560 | loss 3.262861 (-1.42z)| norm 0.2593 (-1.07z)| lr 6.72e-05 | 322.79 ms | 52.3% bf16 MFU | 1624820 tok/s step 15464/19560 | loss 3.279735 (-0.94z)| norm 0.2804 (-0.34z)| lr 6.72e-05 | 322.52 ms | 52.3% bf16 MFU | 1624859 tok/s step 15465/19560 | loss 3.302840 (-0.30z)| norm 0.2952 (+0.17z)| lr 6.72e-05 | 322.49 ms | 52.3% bf16 MFU | 1624904 tok/s step 15466/19560 | loss 3.320941 (+0.18z)| norm 0.2833 (-0.24z)| lr 6.71e-05 | 322.58 ms | 52.3% bf16 MFU | 1624923 tok/s step 15467/19560 | loss 3.313227 (-0.03z)| norm 0.2825 (-0.27z)| lr 6.71e-05 | 323.02 ms | 52.2% bf16 MFU | 1624830 tok/s step 15468/19560 | loss 3.246233 (-1.88z)| norm 0.2664 (-0.82z)| lr 6.71e-05 | 323.04 ms | 52.2% bf16 MFU | 1624739 tok/s step 15469/19560 | loss 3.302305 (-0.31z)| norm 0.2685 (-0.74z)| lr 6.70e-05 | 322.91 ms | 52.3% bf16 MFU | 1624684 tok/s step 15470/19560 | loss 3.383570 (+1.92z)| norm 0.2697 (-0.69z)| lr 6.70e-05 | 322.51 ms | 52.3% bf16 MFU | 1624733 tok/s step 15471/19560 | loss 3.257597 (-1.54z)| norm 0.2690 (-0.70z)| lr 6.70e-05 | 322.43 ms | 52.3% bf16 MFU | 1624800 tok/s step 15472/19560 | loss 3.257215 (-1.53z)| norm 0.2411 (-1.65z)| lr 6.69e-05 | 322.78 ms | 52.3% bf16 MFU | 1624775 tok/s step 15473/19560 | loss 3.356747 (+1.16z)| norm 0.2597 (-0.99z)| lr 6.69e-05 | 322.55 ms | 52.3% bf16 MFU | 1624808 tok/s step 15474/19560 | loss 3.311834 (-0.05z)| norm 0.2839 (-0.15z)| lr 6.69e-05 | 322.50 ms | 52.3% bf16 MFU | 1624851 tok/s step 15475/19560 | loss 3.311355 (-0.07z)| norm 0.2823 (-0.20z)| lr 6.68e-05 | 322.80 ms | 52.3% bf16 MFU | 1624818 tok/s step 15476/19560 | loss 3.304022 (-0.27z)| norm 0.2950 (+0.24z)| lr 6.68e-05 | 323.15 ms | 52.2% bf16 MFU | 1624698 tok/s step 15477/19560 | loss 3.338295 (+0.66z)| norm 0.2556 (-1.12z)| lr 6.68e-05 | 322.76 ms | 52.3% bf16 MFU | 1624682 tok/s step 15478/19560 | loss 3.265024 (-1.32z)| norm 
0.3248 (+1.26z)| lr 6.68e-05 | 322.70 ms | 52.3% bf16 MFU | 1624683 tok/s step 15479/19560 | loss 3.333060 (+0.52z)| norm 0.2738 (-0.49z)| lr 6.67e-05 | 322.50 ms | 52.3% bf16 MFU | 1624735 tok/s step 15480/19560 | loss 3.343031 (+0.78z)| norm 0.2779 (-0.34z)| lr 6.67e-05 | 322.17 ms | 52.4% bf16 MFU | 1624866 tok/s step 15481/19560 | loss 3.365903 (+1.39z)| norm 0.3489 (+2.05z)| lr 6.67e-05 | 322.70 ms | 52.3% bf16 MFU | 1624858 tok/s step 15482/19560 | loss 3.342864 (+0.75z)| norm 0.2605 (-0.95z)| lr 6.66e-05 | 322.27 ms | 52.4% bf16 MFU | 1624958 tok/s step 15483/19560 | loss 3.346300 (+0.85z)| norm 0.2989 (+0.35z)| lr 6.66e-05 | 322.34 ms | 52.4% bf16 MFU | 1625037 tok/s step 15484/19560 | loss 3.277851 (-1.00z)| norm 0.3057 (+0.57z)| lr 6.66e-05 | 322.59 ms | 52.3% bf16 MFU | 1625047 tok/s step 15485/19560 | loss 3.254693 (-1.60z)| norm 0.2796 (-0.31z)| lr 6.65e-05 | 322.50 ms | 52.3% bf16 MFU | 1625080 tok/s step 15486/19560 | loss 3.305305 (-0.23z)| norm 0.2853 (-0.13z)| lr 6.65e-05 | 322.82 ms | 52.3% bf16 MFU | 1625030 tok/s step 15487/19560 | loss 3.302796 (-0.30z)| norm 0.2995 (+0.34z)| lr 6.65e-05 | 322.48 ms | 52.3% bf16 MFU | 1625067 tok/s step 15488/19560 | loss 3.345236 (+0.85z)| norm 0.2894 (-0.00z)| lr 6.64e-05 | 322.33 ms | 52.4% bf16 MFU | 1625142 tok/s step 15489/19560 | loss 3.324184 (+0.27z)| norm 0.2678 (-0.74z)| lr 6.64e-05 | 322.62 ms | 52.3% bf16 MFU | 1625138 tok/s step 15490/19560 | loss 3.300237 (-0.37z)| norm 0.2665 (-0.77z)| lr 6.64e-05 | 322.64 ms | 52.3% bf16 MFU | 1625131 tok/s step 15491/19560 | loss 3.350171 (+0.98z)| norm 0.2916 (+0.09z)| lr 6.63e-05 | 322.75 ms | 52.3% bf16 MFU | 1625096 tok/s step 15492/19560 | loss 3.275062 (-1.09z)| norm 0.2658 (-0.79z)| lr 6.63e-05 | 322.48 ms | 52.3% bf16 MFU | 1625132 tok/s step 15493/19560 | loss 3.296583 (-0.52z)| norm 0.2882 (-0.02z)| lr 6.63e-05 | 322.76 ms | 52.3% bf16 MFU | 1625094 tok/s step 15494/19560 | loss 3.306101 (-0.25z)| norm 0.3202 (+1.06z)| lr 6.62e-05 | 322.50 ms | 
52.3% bf16 MFU | 1625123 tok/s step 15495/19560 | loss 3.347191 (+0.89z)| norm 0.2768 (-0.41z)| lr 6.62e-05 | 322.94 ms | 52.3% bf16 MFU | 1625040 tok/s step 15496/19560 | loss 3.237993 (-2.11z)| norm 0.3214 (+1.10z)| lr 6.62e-05 | 322.03 ms | 52.4% bf16 MFU | 1625190 tok/s step 15497/19560 | loss 3.284107 (-0.83z)| norm 0.2888 (-0.00z)| lr 6.62e-05 | 322.59 ms | 52.3% bf16 MFU | 1625194 tok/s step 15498/19560 | loss 3.318501 (+0.11z)| norm 0.2765 (-0.42z)| lr 6.61e-05 | 323.24 ms | 52.2% bf16 MFU | 1625032 tok/s step 15499/19560 | loss 3.306947 (-0.22z)| norm 0.2956 (+0.22z)| lr 6.61e-05 | 322.60 ms | 52.3% bf16 MFU | 1625041 tok/s step 15500/19560 | loss 3.315543 (+0.01z)| norm 0.2809 (-0.27z)| lr 6.61e-05 | 322.46 ms | 52.3% bf16 MFU | 1625085 tok/s val loss 3.296848 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating HellaSwag: 70/79 HellaSwag: 3033/10042 = 0.302031 step 15501/19560 | loss 3.282184 (-0.91z)| norm 0.2669 (-0.73z)| lr 6.60e-05 | 322.65 ms | 52.3% bf16 MFU | 1625078 tok/s step 15502/19560 | loss 3.244999 (-1.90z)| norm 0.2988 (+0.34z)| lr 6.60e-05 | 322.71 ms | 52.3% bf16 MFU | 1625057 tok/s step 15503/19560 | loss 3.313408 (-0.04z)| norm 0.2704 (-0.61z)| lr 6.60e-05 | 322.89 ms | 52.3% bf16 MFU | 1624992 tok/s step 15504/19560 | loss 3.330112 (+0.42z)| norm 0.2759 (-0.42z)| lr 6.59e-05 | 322.94 ms | 52.3% bf16 MFU | 1624916 tok/s step 15505/19560 | loss 3.231737 (-2.25z)| norm 0.3047 (+0.56z)| lr 6.59e-05 | 323.12 ms | 52.2% bf16 MFU | 1624799 tok/s step 15506/19560 | loss 3.348282 (+0.91z)| norm 0.2727 (-0.52z)| lr 6.59e-05 | 322.95 ms | 52.3% bf16 MFU | 1624730 tok/s step 15507/19560 | loss 3.338314 (+0.63z)| norm 0.3111 (+0.82z)| lr 6.58e-05 | 322.31 ms | 52.4% bf16 MFU | 1624826 tok/s step 15508/19560 | loss 3.315588 (+0.02z)| norm 0.2747 (-0.44z)| lr 6.58e-05 | 323.17 ms | 52.2% bf16 MFU | 
1624702 tok/s step 15509/19560 | loss 3.335839 (+0.55z)| norm 0.2996 (+0.42z)| lr 6.58e-05 | 322.79 ms | 52.3% bf16 MFU | 1624679 tok/s step 15510/19560 | loss 3.260414 (-1.49z)| norm 0.3032 (+0.56z)| lr 6.57e-05 | 322.21 ms | 52.4% bf16 MFU | 1624802 tok/s step 15511/19560 | loss 3.381966 (+1.77z)| norm 0.2948 (+0.26z)| lr 6.57e-05 | 323.23 ms | 52.2% bf16 MFU | 1624662 tok/s step 15512/19560 | loss 3.360322 (+1.17z)| norm 0.2846 (-0.09z)| lr 6.57e-05 | 322.89 ms | 52.3% bf16 MFU | 1624615 tok/s step 15513/19560 | loss 3.315341 (-0.04z)| norm 0.2852 (-0.07z)| lr 6.57e-05 | 322.90 ms | 52.3% bf16 MFU | 1624567 tok/s step 15514/19560 | loss 3.264465 (-1.39z)| norm 0.2674 (-0.69z)| lr 6.56e-05 | 322.20 ms | 52.4% bf16 MFU | 1624700 tok/s step 15515/19560 | loss 3.262929 (-1.41z)| norm 0.2898 (+0.10z)| lr 6.56e-05 | 322.92 ms | 52.3% bf16 MFU | 1624644 tok/s step 15516/19560 | loss 3.331639 (+0.42z)| norm 0.2581 (-1.00z)| lr 6.56e-05 | 322.83 ms | 52.3% bf16 MFU | 1624613 tok/s step 15517/19560 | loss 3.347591 (+0.87z)| norm 0.2727 (-0.48z)| lr 6.55e-05 | 322.23 ms | 52.4% bf16 MFU | 1624737 tok/s step 15518/19560 | loss 3.330818 (+0.43z)| norm 0.2995 (+0.46z)| lr 6.55e-05 | 322.86 ms | 52.3% bf16 MFU | 1624694 tok/s step 15519/19560 | loss 3.366271 (+1.39z)| norm 0.2555 (-1.07z)| lr 6.55e-05 | 323.35 ms | 52.2% bf16 MFU | 1624531 tok/s step 15520/19560 | loss 3.313786 (-0.03z)| norm 0.3003 (+0.49z)| lr 6.54e-05 | 322.76 ms | 52.3% bf16 MFU | 1624523 tok/s step 15521/19560 | loss 3.273305 (-1.15z)| norm 0.2599 (-0.92z)| lr 6.54e-05 | 322.48 ms | 52.3% bf16 MFU | 1624587 tok/s step 15522/19560 | loss 3.388946 (+2.04z)| norm 0.2747 (-0.40z)| lr 6.54e-05 | 322.46 ms | 52.3% bf16 MFU | 1624653 tok/s step 15523/19560 | loss 3.327550 (+0.33z)| norm 0.2984 (+0.43z)| lr 6.53e-05 | 323.06 ms | 52.2% bf16 MFU | 1624565 tok/s step 15524/19560 | loss 3.386227 (+1.93z)| norm 0.3130 (+0.92z)| lr 6.53e-05 | 322.53 ms | 52.3% bf16 MFU | 1624614 tok/s step 15525/19560 | loss 3.340866 
(+0.67z)| norm 0.2714 (-0.53z)| lr 6.53e-05 | 322.68 ms | 52.3% bf16 MFU | 1624624 tok/s step 15526/19560 | loss 3.306020 (-0.28z)| norm 0.2704 (-0.56z)| lr 6.53e-05 | 322.63 ms | 52.3% bf16 MFU | 1624645 tok/s step 15527/19560 | loss 3.359399 (+1.18z)| norm 0.3087 (+0.77z)| lr 6.52e-05 | 322.75 ms | 52.3% bf16 MFU | 1624634 tok/s step 15528/19560 | loss 3.330968 (+0.39z)| norm 0.3038 (+0.59z)| lr 6.52e-05 | 322.94 ms | 52.3% bf16 MFU | 1624576 tok/s step 15529/19560 | loss 3.357162 (+1.10z)| norm 0.3074 (+0.71z)| lr 6.52e-05 | 322.95 ms | 52.3% bf16 MFU | 1624519 tok/s step 15530/19560 | loss 3.256123 (-1.65z)| norm 0.3358 (+1.66z)| lr 6.51e-05 | 322.56 ms | 52.3% bf16 MFU | 1624563 tok/s step 15531/19560 | loss 3.299755 (-0.47z)| norm 0.2873 (+0.00z)| lr 6.51e-05 | 323.18 ms | 52.2% bf16 MFU | 1624449 tok/s step 15532/19560 | loss 3.282850 (-0.91z)| norm 0.2988 (+0.39z)| lr 6.51e-05 | 322.69 ms | 52.3% bf16 MFU | 1624465 tok/s step 15533/19560 | loss 3.303294 (-0.35z)| norm 0.2906 (+0.10z)| lr 6.50e-05 | 322.59 ms | 52.3% bf16 MFU | 1624504 tok/s step 15534/19560 | loss 3.345183 (+0.78z)| norm 0.2651 (-0.77z)| lr 6.50e-05 | 322.65 ms | 52.3% bf16 MFU | 1624527 tok/s step 15535/19560 | loss 3.337945 (+0.57z)| norm 0.3264 (+1.34z)| lr 6.50e-05 | 322.85 ms | 52.3% bf16 MFU | 1624496 tok/s step 15536/19560 | loss 3.323241 (+0.17z)| norm 0.3281 (+1.38z)| lr 6.49e-05 | 322.52 ms | 52.3% bf16 MFU | 1624550 tok/s step 15537/19560 | loss 3.294112 (-0.64z)| norm 0.2830 (-0.18z)| lr 6.49e-05 | 322.93 ms | 52.3% bf16 MFU | 1624498 tok/s step 15538/19560 | loss 3.284894 (-0.88z)| norm 0.3884 (+3.27z)| lr 6.49e-05 | 322.97 ms | 52.3% bf16 MFU | 1624440 tok/s step 15539/19560 | loss 3.327968 (+0.30z)| norm 0.3231 (+1.11z)| lr 6.48e-05 | 323.26 ms | 52.2% bf16 MFU | 1624312 tok/s step 15540/19560 | loss 3.309868 (-0.20z)| norm 0.2921 (+0.08z)| lr 6.48e-05 | 322.91 ms | 52.3% bf16 MFU | 1624278 tok/s step 15541/19560 | loss 3.277240 (-1.08z)| norm 0.3486 (+1.90z)| lr 6.48e-05 | 
322.77 ms | 52.3% bf16 MFU | 1624281 tok/s step 15542/19560 | loss 3.317564 (+0.01z)| norm 0.2881 (-0.07z)| lr 6.48e-05 | 322.87 ms | 52.3% bf16 MFU | 1624260 tok/s step 15543/19560 | loss 3.313968 (-0.09z)| norm 0.2870 (-0.11z)| lr 6.47e-05 | 323.07 ms | 52.2% bf16 MFU | 1624188 tok/s step 15544/19560 | loss 3.378313 (+1.72z)| norm 0.3662 (+2.40z)| lr 6.47e-05 | 323.43 ms | 52.2% bf16 MFU | 1624030 tok/s step 15545/19560 | loss 3.346446 (+0.81z)| norm 0.2697 (-0.66z)| lr 6.47e-05 | 323.25 ms | 52.2% bf16 MFU | 1623926 tok/s step 15546/19560 | loss 3.333279 (+0.44z)| norm 0.3013 (+0.33z)| lr 6.46e-05 | 322.49 ms | 52.3% bf16 MFU | 1624018 tok/s step 15547/19560 | loss 3.291310 (-0.73z)| norm 0.2948 (+0.12z)| lr 6.46e-05 | 322.81 ms | 52.3% bf16 MFU | 1624024 tok/s step 15548/19560 | loss 3.233962 (-2.29z)| norm 0.3058 (+0.46z)| lr 6.46e-05 | 322.93 ms | 52.3% bf16 MFU | 1624000 tok/s step 15549/19560 | loss 3.330435 (+0.38z)| norm 0.2884 (-0.09z)| lr 6.45e-05 | 323.02 ms | 52.2% bf16 MFU | 1623953 tok/s step 15550/19560 | loss 3.286788 (-0.84z)| norm 0.3091 (+0.56z)| lr 6.45e-05 | 323.58 ms | 52.2% bf16 MFU | 1623769 tok/s step 15551/19560 | loss 3.297940 (-0.53z)| norm 0.3009 (+0.30z)| lr 6.45e-05 | 322.45 ms | 52.3% bf16 MFU | 1623878 tok/s step 15552/19560 | loss 3.283747 (-0.92z)| norm 0.2816 (-0.31z)| lr 6.44e-05 | 322.80 ms | 52.3% bf16 MFU | 1623894 tok/s step 15553/19560 | loss 3.258406 (-1.60z)| norm 0.3148 (+0.74z)| lr 6.44e-05 | 323.00 ms | 52.3% bf16 MFU | 1623858 tok/s step 15554/19560 | loss 3.325418 (+0.27z)| norm 0.3005 (+0.28z)| lr 6.44e-05 | 322.93 ms | 52.3% bf16 MFU | 1623842 tok/s step 15555/19560 | loss 3.187225 (-3.42z)| norm 0.2581 (-1.05z)| lr 6.44e-05 | 322.73 ms | 52.3% bf16 MFU | 1623878 tok/s step 15556/19560 | loss 3.345971 (+0.82z)| norm 0.2852 (-0.20z)| lr 6.43e-05 | 322.91 ms | 52.3% bf16 MFU | 1623866 tok/s step 15557/19560 | loss 3.270319 (-1.18z)| norm 0.2629 (-0.90z)| lr 6.43e-05 | 322.80 ms | 52.3% bf16 MFU | 1623881 tok/s step 
15558/19560 | loss 3.279578 (-0.92z)| norm 0.2942 (+0.08z)| lr 6.43e-05 | 323.05 ms | 52.2% bf16 MFU | 1623834 tok/s step 15559/19560 | loss 3.299557 (-0.38z)| norm 0.2632 (-0.90z)| lr 6.42e-05 | 322.47 ms | 52.3% bf16 MFU | 1623936 tok/s step 15560/19560 | loss 3.259266 (-1.45z)| norm 0.3017 (+0.31z)| lr 6.42e-05 | 322.94 ms | 52.3% bf16 MFU | 1623912 tok/s step 15561/19560 | loss 3.333294 (+0.52z)| norm 0.2917 (-0.01z)| lr 6.42e-05 | 323.02 ms | 52.2% bf16 MFU | 1623871 tok/s step 15562/19560 | loss 3.347846 (+0.90z)| norm 0.2990 (+0.22z)| lr 6.41e-05 | 322.80 ms | 52.3% bf16 MFU | 1623887 tok/s step 15563/19560 | loss 3.275141 (-1.01z)| norm 0.3040 (+0.38z)| lr 6.41e-05 | 322.57 ms | 52.3% bf16 MFU | 1623960 tok/s step 15564/19560 | loss 3.323999 (+0.27z)| norm 0.2607 (-0.98z)| lr 6.41e-05 | 322.96 ms | 52.3% bf16 MFU | 1623930 tok/s step 15565/19560 | loss 3.345488 (+0.85z)| norm 0.3089 (+0.53z)| lr 6.40e-05 | 322.79 ms | 52.3% bf16 MFU | 1623947 tok/s step 15566/19560 | loss 3.287883 (-0.68z)| norm 0.2725 (-0.61z)| lr 6.40e-05 | 323.02 ms | 52.2% bf16 MFU | 1623905 tok/s step 15567/19560 | loss 3.258412 (-1.44z)| norm 0.2705 (-0.66z)| lr 6.40e-05 | 322.82 ms | 52.3% bf16 MFU | 1623913 tok/s step 15568/19560 | loss 3.296114 (-0.44z)| norm 0.2857 (-0.17z)| lr 6.39e-05 | 322.84 ms | 52.3% bf16 MFU | 1623917 tok/s step 15569/19560 | loss 3.441827 (+3.25z)| norm 0.3327 (+1.29z)| lr 6.39e-05 | 322.91 ms | 52.3% bf16 MFU | 1623903 tok/s step 15570/19560 | loss 3.316841 (+0.08z)| norm 0.2790 (-0.43z)| lr 6.39e-05 | 322.91 ms | 52.3% bf16 MFU | 1623889 tok/s step 15571/19560 | loss 3.299398 (-0.35z)| norm 0.3008 (+0.45z)| lr 6.39e-05 | 324.17 ms | 52.1% bf16 MFU | 1623559 tok/s step 15572/19560 | loss 3.258006 (-1.39z)| norm 0.2759 (-0.56z)| lr 6.38e-05 | 322.63 ms | 52.3% bf16 MFU | 1623634 tok/s step 15573/19560 | loss 3.301400 (-0.28z)| norm 0.2752 (-0.59z)| lr 6.38e-05 | 322.63 ms | 52.3% bf16 MFU | 1623705 tok/s step 15574/19560 | loss 3.339476 (+0.69z)| norm 
0.2937 (+0.21z)| lr 6.38e-05 | 322.82 ms | 52.3% bf16 MFU | 1623725 tok/s step 15575/19560 | loss 3.270438 (-1.05z)| norm 0.2755 (-0.57z)| lr 6.37e-05 | 323.22 ms | 52.2% bf16 MFU | 1623644 tok/s step 15576/19560 | loss 3.373793 (+1.56z)| norm 0.3357 (+2.02z)| lr 6.37e-05 | 323.03 ms | 52.2% bf16 MFU | 1623614 tok/s step 15577/19560 | loss 3.278562 (-0.84z)| norm 0.2575 (-1.35z)| lr 6.37e-05 | 322.64 ms | 52.3% bf16 MFU | 1623683 tok/s step 15578/19560 | loss 3.297365 (-0.36z)| norm 0.2953 (+0.27z)| lr 6.36e-05 | 322.75 ms | 52.3% bf16 MFU | 1623722 tok/s step 15579/19560 | loss 3.308578 (-0.08z)| norm 0.2836 (-0.24z)| lr 6.36e-05 | 322.92 ms | 52.3% bf16 MFU | 1623714 tok/s step 15580/19560 | loss 3.351128 (+0.99z)| norm 0.2953 (+0.26z)| lr 6.36e-05 | 323.16 ms | 52.2% bf16 MFU | 1623648 tok/s step 15581/19560 | loss 3.264936 (-1.16z)| norm 0.2738 (-0.66z)| lr 6.35e-05 | 323.01 ms | 52.3% bf16 MFU | 1623623 tok/s step 15582/19560 | loss 3.276300 (-0.86z)| norm 0.3006 (+0.50z)| lr 6.35e-05 | 322.64 ms | 52.3% bf16 MFU | 1623691 tok/s step 15583/19560 | loss 3.317218 (+0.17z)| norm 0.3227 (+1.43z)| lr 6.35e-05 | 322.87 ms | 52.3% bf16 MFU | 1623698 tok/s step 15584/19560 | loss 3.303051 (-0.18z)| norm 0.2728 (-0.71z)| lr 6.35e-05 | 322.36 ms | 52.4% bf16 MFU | 1623832 tok/s step 15585/19560 | loss 3.326629 (+0.42z)| norm 0.2910 (+0.06z)| lr 6.34e-05 | 323.31 ms | 52.2% bf16 MFU | 1623722 tok/s step 15586/19560 | loss 3.371260 (+1.57z)| norm 0.2887 (-0.04z)| lr 6.34e-05 | 322.83 ms | 52.3% bf16 MFU | 1623736 tok/s step 15587/19560 | loss 3.268580 (-1.04z)| norm 0.3081 (+0.79z)| lr 6.34e-05 | 322.85 ms | 52.3% bf16 MFU | 1623746 tok/s step 15588/19560 | loss 3.258643 (-1.28z)| norm 0.2707 (-0.82z)| lr 6.33e-05 | 322.88 ms | 52.3% bf16 MFU | 1623747 tok/s step 15589/19560 | loss 3.366686 (+1.43z)| norm 0.3020 (+0.52z)| lr 6.33e-05 | 322.69 ms | 52.3% bf16 MFU | 1623796 tok/s step 15590/19560 | loss 3.315500 (+0.14z)| norm 0.2767 (-0.57z)| lr 6.33e-05 | 323.49 ms | 
52.2% bf16 MFU | 1623643 tok/s step 15591/19560 | loss 3.298105 (-0.31z)| norm 0.2653 (-1.07z)| lr 6.32e-05 | 322.87 ms | 52.3% bf16 MFU | 1623652 tok/s step 15592/19560 | loss 3.308944 (-0.04z)| norm 0.2799 (-0.44z)| lr 6.32e-05 | 323.03 ms | 52.2% bf16 MFU | 1623622 tok/s step 15593/19560 | loss 3.314054 (+0.09z)| norm 0.3019 (+0.51z)| lr 6.32e-05 | 323.74 ms | 52.1% bf16 MFU | 1623414 tok/s step 15594/19560 | loss 3.248414 (-1.56z)| norm 0.2841 (-0.26z)| lr 6.31e-05 | 322.77 ms | 52.3% bf16 MFU | 1623459 tok/s step 15595/19560 | loss 3.279858 (-0.76z)| norm 0.2663 (-1.02z)| lr 6.31e-05 | 323.54 ms | 52.2% bf16 MFU | 1623309 tok/s step 15596/19560 | loss 3.232467 (-1.94z)| norm 0.2658 (-1.04z)| lr 6.31e-05 | 322.85 ms | 52.3% bf16 MFU | 1623341 tok/s step 15597/19560 | loss 3.273636 (-0.90z)| norm 0.2920 (+0.08z)| lr 6.31e-05 | 322.88 ms | 52.3% bf16 MFU | 1623362 tok/s step 15598/19560 | loss 3.246844 (-1.55z)| norm 0.2917 (+0.06z)| lr 6.30e-05 | 322.69 ms | 52.3% bf16 MFU | 1623430 tok/s step 15599/19560 | loss 3.306139 (-0.07z)| norm 0.2700 (-0.88z)| lr 6.30e-05 | 322.92 ms | 52.3% bf16 MFU | 1623438 tok/s step 15600/19560 | loss 3.367086 (+1.44z)| norm 0.2872 (-0.15z)| lr 6.30e-05 | 322.78 ms | 52.3% bf16 MFU | 1623482 tok/s step 15601/19560 | loss 3.287905 (-0.54z)| norm 0.2789 (-0.53z)| lr 6.29e-05 | 322.54 ms | 52.3% bf16 MFU | 1623583 tok/s step 15602/19560 | loss 3.278730 (-0.76z)| norm 0.2653 (-1.12z)| lr 6.29e-05 | 323.53 ms | 52.2% bf16 MFU | 1623429 tok/s step 15603/19560 | loss 3.363559 (+1.36z)| norm 0.2620 (-1.25z)| lr 6.29e-05 | 323.34 ms | 52.2% bf16 MFU | 1623331 tok/s step 15604/19560 | loss 3.360320 (+1.26z)| norm 0.2883 (-0.09z)| lr 6.28e-05 | 322.33 ms | 52.4% bf16 MFU | 1623493 tok/s step 15605/19560 | loss 3.246775 (-1.54z)| norm 0.2765 (-0.62z)| lr 6.28e-05 | 323.23 ms | 52.2% bf16 MFU | 1623420 tok/s step 15606/19560 | loss 3.312950 (+0.09z)| norm 0.2591 (-1.37z)| lr 6.28e-05 | 322.88 ms | 52.3% bf16 MFU | 1623439 tok/s step 15607/19560 
| loss 3.302143 (-0.18z)| norm 0.2656 (-1.08z)| lr 6.28e-05 | 323.03 ms | 52.2% bf16 MFU | 1623419 tok/s step 15608/19560 | loss 3.277745 (-0.77z)| norm 0.2592 (-1.35z)| lr 6.27e-05 | 322.39 ms | 52.3% bf16 MFU | 1623559 tok/s step 15609/19560 | loss 3.266619 (-1.03z)| norm 0.2765 (-0.58z)| lr 6.27e-05 | 322.92 ms | 52.3% bf16 MFU | 1623561 tok/s step 15610/19560 | loss 3.312412 (+0.12z)| norm 0.3163 (+1.20z)| lr 6.27e-05 | 323.32 ms | 52.2% bf16 MFU | 1623461 tok/s step 15611/19560 | loss 3.331258 (+0.60z)| norm 0.2820 (-0.34z)| lr 6.26e-05 | 322.70 ms | 52.3% bf16 MFU | 1623522 tok/s step 15612/19560 | loss 3.299199 (-0.21z)| norm 0.2749 (-0.65z)| lr 6.26e-05 | 322.84 ms | 52.3% bf16 MFU | 1623546 tok/s step 15613/19560 | loss 3.348742 (+1.02z)| norm 0.3069 (+0.78z)| lr 6.26e-05 | 322.81 ms | 52.3% bf16 MFU | 1623575 tok/s step 15614/19560 | loss 3.334333 (+0.65z)| norm 0.2860 (-0.16z)| lr 6.25e-05 | 322.88 ms | 52.3% bf16 MFU | 1623586 tok/s step 15615/19560 | loss 3.394048 (+2.10z)| norm 0.3271 (+1.67z)| lr 6.25e-05 | 322.74 ms | 52.3% bf16 MFU | 1623631 tok/s step 15616/19560 | loss 3.287886 (-0.52z)| norm 0.2938 (+0.18z)| lr 6.25e-05 | 322.57 ms | 52.3% bf16 MFU | 1623716 tok/s step 15617/19560 | loss 3.414430 (+2.53z)| norm 0.2816 (-0.37z)| lr 6.24e-05 | 322.87 ms | 52.3% bf16 MFU | 1623723 tok/s step 15618/19560 | loss 3.267411 (-1.01z)| norm 0.2996 (+0.42z)| lr 6.24e-05 | 322.64 ms | 52.3% bf16 MFU | 1623787 tok/s step 15619/19560 | loss 3.337363 (+0.68z)| norm 0.3014 (+0.50z)| lr 6.24e-05 | 322.68 ms | 52.3% bf16 MFU | 1623838 tok/s step 15620/19560 | loss 3.254951 (-1.30z)| norm 0.2641 (-1.18z)| lr 6.24e-05 | 322.64 ms | 52.3% bf16 MFU | 1623895 tok/s step 15621/19560 | loss 3.308987 (-0.01z)| norm 0.2977 (+0.33z)| lr 6.23e-05 | 322.49 ms | 52.3% bf16 MFU | 1623987 tok/s step 15622/19560 | loss 3.331757 (+0.54z)| norm 0.2764 (-0.61z)| lr 6.23e-05 | 322.80 ms | 52.3% bf16 MFU | 1623997 tok/s step 15623/19560 | loss 3.319377 (+0.24z)| norm 0.2808 (-0.42z)| 
lr 6.23e-05 | 322.94 ms | 52.3% bf16 MFU | 1623971 tok/s step 15624/19560 | loss 3.334444 (+0.60z)| norm 0.2629 (-1.21z)| lr 6.22e-05 | 323.08 ms | 52.2% bf16 MFU | 1623911 tok/s step 15625/19560 | loss 3.269130 (-0.99z)| norm 0.2728 (-0.75z)| lr 6.22e-05 | 322.63 ms | 52.3% bf16 MFU | 1623968 tok/s step 15626/19560 | loss 3.307322 (-0.06z)| norm 0.2780 (-0.52z)| lr 6.22e-05 | 322.39 ms | 52.3% bf16 MFU | 1624082 tok/s step 15627/19560 | loss 3.275832 (-0.82z)| norm 0.2732 (-0.72z)| lr 6.21e-05 | 323.12 ms | 52.2% bf16 MFU | 1624006 tok/s step 15628/19560 | loss 3.321856 (+0.30z)| norm 0.2846 (-0.21z)| lr 6.21e-05 | 322.80 ms | 52.3% bf16 MFU | 1624016 tok/s step 15629/19560 | loss 3.413215 (+2.43z)| norm 0.3381 (+2.15z)| lr 6.21e-05 | 322.47 ms | 52.3% bf16 MFU | 1624108 tok/s step 15630/19560 | loss 3.271474 (-0.94z)| norm 0.2668 (-1.01z)| lr 6.20e-05 | 322.91 ms | 52.3% bf16 MFU | 1624084 tok/s step 15631/19560 | loss 3.296614 (-0.33z)| norm 0.3024 (+0.56z)| lr 6.20e-05 | 322.44 ms | 52.3% bf16 MFU | 1624179 tok/s step 15632/19560 | loss 3.248388 (-1.46z)| norm 0.3102 (+0.89z)| lr 6.20e-05 | 322.48 ms | 52.3% bf16 MFU | 1624260 tok/s step 15633/19560 | loss 3.371887 (+1.44z)| norm 0.3043 (+0.63z)| lr 6.20e-05 | 322.92 ms | 52.3% bf16 MFU | 1624227 tok/s step 15634/19560 | loss 3.259087 (-1.22z)| norm 0.3074 (+0.75z)| lr 6.19e-05 | 322.46 ms | 52.3% bf16 MFU | 1624312 tok/s step 15635/19560 | loss 3.333103 (+0.54z)| norm 0.2565 (-1.48z)| lr 6.19e-05 | 322.89 ms | 52.3% bf16 MFU | 1624284 tok/s step 15636/19560 | loss 3.329679 (+0.45z)| norm 0.2943 (+0.18z)| lr 6.19e-05 | 322.22 ms | 52.4% bf16 MFU | 1624425 tok/s step 15637/19560 | loss 3.240993 (-1.62z)| norm 0.2666 (-1.02z)| lr 6.18e-05 | 322.98 ms | 52.3% bf16 MFU | 1624368 tok/s step 15638/19560 | loss 3.268625 (-0.97z)| norm 0.2766 (-0.58z)| lr 6.18e-05 | 322.71 ms | 52.3% bf16 MFU | 1624381 tok/s step 15639/19560 | loss 3.288544 (-0.49z)| norm 0.3092 (+0.86z)| lr 6.18e-05 | 322.73 ms | 52.3% bf16 MFU | 
1624390 tok/s step 15640/19560 | loss 3.303088 (-0.13z)| norm 0.2822 (-0.33z)| lr 6.17e-05 | 322.73 ms | 52.3% bf16 MFU | 1624397 tok/s step 15641/19560 | loss 3.537677 (+4.90z)| norm 0.3574 (+2.85z)| lr 6.17e-05 | 322.09 ms | 52.4% bf16 MFU | 1624566 tok/s step 15642/19560 | loss 3.303550 (-0.15z)| norm 0.3292 (+1.62z)| lr 6.17e-05 | 322.79 ms | 52.3% bf16 MFU | 1624550 tok/s step 15643/19560 | loss 3.293875 (-0.37z)| norm 0.2965 (+0.24z)| lr 6.17e-05 | 322.67 ms | 52.3% bf16 MFU | 1624564 tok/s step 15644/19560 | loss 3.305145 (-0.12z)| norm 0.3186 (+1.15z)| lr 6.16e-05 | 322.74 ms | 52.3% bf16 MFU | 1624560 tok/s step 15645/19560 | loss 3.385389 (+1.60z)| norm 0.3655 (+3.00z)| lr 6.16e-05 | 322.76 ms | 52.3% bf16 MFU | 1624552 tok/s step 15646/19560 | loss 3.296909 (-0.30z)| norm 0.2734 (-0.75z)| lr 6.16e-05 | 322.78 ms | 52.3% bf16 MFU | 1624539 tok/s step 15647/19560 | loss 3.296144 (-0.30z)| norm 0.2831 (-0.37z)| lr 6.15e-05 | 323.00 ms | 52.3% bf16 MFU | 1624471 tok/s step 15648/19560 | loss 3.356756 (+1.00z)| norm 0.3276 (+1.44z)| lr 6.15e-05 | 322.62 ms | 52.3% bf16 MFU | 1624502 tok/s step 15649/19560 | loss 3.304373 (-0.14z)| norm 0.2581 (-1.39z)| lr 6.15e-05 | 322.28 ms | 52.4% bf16 MFU | 1624618 tok/s step 15650/19560 | loss 3.415120 (+2.23z)| norm 0.2938 (+0.06z)| lr 6.14e-05 | 322.70 ms | 52.3% bf16 MFU | 1624622 tok/s step 15651/19560 | loss 3.354509 (+0.93z)| norm 0.2834 (-0.36z)| lr 6.14e-05 | 322.83 ms | 52.3% bf16 MFU | 1624592 tok/s step 15652/19560 | loss 3.314305 (+0.08z)| norm 0.2681 (-0.97z)| lr 6.14e-05 | 322.68 ms | 52.3% bf16 MFU | 1624603 tok/s step 15653/19560 | loss 3.344392 (+0.73z)| norm 0.2836 (-0.34z)| lr 6.14e-05 | 323.27 ms | 52.2% bf16 MFU | 1624464 tok/s step 15654/19560 | loss 3.280114 (-0.65z)| norm 0.2713 (-0.85z)| lr 6.13e-05 | 322.38 ms | 52.4% bf16 MFU | 1624555 tok/s step 15655/19560 | loss 3.322141 (+0.26z)| norm 0.2903 (-0.07z)| lr 6.13e-05 | 322.53 ms | 52.3% bf16 MFU | 1624603 tok/s step 15656/19560 | loss 3.345502 
(+0.76z)| norm 0.2801 (-0.48z)| lr 6.13e-05 | 322.09 ms | 52.4% bf16 MFU | 1624761 tok/s step 15657/19560 | loss 3.252713 (-1.23z)| norm 0.2679 (-0.96z)| lr 6.12e-05 | 322.62 ms | 52.3% bf16 MFU | 1624777 tok/s step 15658/19560 | loss 3.258253 (-1.11z)| norm 0.2699 (-0.87z)| lr 6.12e-05 | 323.10 ms | 52.2% bf16 MFU | 1624671 tok/s step 15659/19560 | loss 3.305300 (-0.09z)| norm 0.2887 (-0.09z)| lr 6.12e-05 | 322.50 ms | 52.3% bf16 MFU | 1624723 tok/s step 15660/19560 | loss 3.334657 (+0.53z)| norm 0.2983 (+0.31z)| lr 6.11e-05 | 322.69 ms | 52.3% bf16 MFU | 1624723 tok/s step 15661/19560 | loss 3.305428 (-0.10z)| norm 0.2520 (-1.58z)| lr 6.11e-05 | 322.57 ms | 52.3% bf16 MFU | 1624753 tok/s step 15662/19560 | loss 3.323383 (+0.29z)| norm 0.2812 (-0.39z)| lr 6.11e-05 | 322.74 ms | 52.3% bf16 MFU | 1624740 tok/s step 15663/19560 | loss 3.216258 (-1.98z)| norm 0.2560 (-1.40z)| lr 6.10e-05 | 323.03 ms | 52.2% bf16 MFU | 1624655 tok/s step 15664/19560 | loss 3.285901 (-0.48z)| norm 0.2756 (-0.59z)| lr 6.10e-05 | 322.16 ms | 52.4% bf16 MFU | 1624792 tok/s step 15665/19560 | loss 3.271678 (-0.78z)| norm 0.2767 (-0.54z)| lr 6.10e-05 | 322.90 ms | 52.3% bf16 MFU | 1624738 tok/s step 15666/19560 | loss 3.325727 (+0.36z)| norm 0.2712 (-0.78z)| lr 6.10e-05 | 322.48 ms | 52.3% bf16 MFU | 1624790 tok/s step 15667/19560 | loss 3.278411 (-0.64z)| norm 0.2918 (+0.15z)| lr 6.09e-05 | 323.13 ms | 52.2% bf16 MFU | 1624677 tok/s step 15668/19560 | loss 3.234324 (-1.55z)| norm 0.2747 (-0.61z)| lr 6.09e-05 | 322.78 ms | 52.3% bf16 MFU | 1624658 tok/s step 15669/19560 | loss 3.308251 (+0.01z)| norm 0.2959 (+0.36z)| lr 6.09e-05 | 322.66 ms | 52.3% bf16 MFU | 1624670 tok/s step 15670/19560 | loss 3.263224 (-0.93z)| norm 0.2653 (-1.03z)| lr 6.08e-05 | 322.74 ms | 52.3% bf16 MFU | 1624660 tok/s step 15671/19560 | loss 3.315331 (+0.16z)| norm 0.3127 (+1.12z)| lr 6.08e-05 | 322.39 ms | 52.4% bf16 MFU | 1624741 tok/s step 15672/19560 | loss 3.329542 (+0.47z)| norm 0.2887 (+0.06z)| lr 6.08e-05 | 
322.44 ms | 52.3% bf16 MFU | 1624803 tok/s step 15673/19560 | loss 3.378156 (+1.49z)| norm 0.2821 (-0.26z)| lr 6.07e-05 | 322.57 ms | 52.3% bf16 MFU | 1624830 tok/s step 15674/19560 | loss 3.350182 (+0.90z)| norm 0.2755 (-0.57z)| lr 6.07e-05 | 322.68 ms | 52.3% bf16 MFU | 1624827 tok/s step 15675/19560 | loss 3.244124 (-1.32z)| norm 0.2949 (+0.37z)| lr 6.07e-05 | 322.37 ms | 52.4% bf16 MFU | 1624904 tok/s step 15676/19560 | loss 3.321280 (+0.28z)| norm 0.2764 (-0.52z)| lr 6.07e-05 | 322.39 ms | 52.4% bf16 MFU | 1624972 tok/s step 15677/19560 | loss 3.299297 (-0.18z)| norm 0.2708 (-0.78z)| lr 6.06e-05 | 322.60 ms | 52.3% bf16 MFU | 1624983 tok/s step 15678/19560 | loss 3.328806 (+0.44z)| norm 0.2808 (-0.29z)| lr 6.06e-05 | 322.91 ms | 52.3% bf16 MFU | 1624915 tok/s step 15679/19560 | loss 3.300032 (-0.17z)| norm 0.3208 (+1.63z)| lr 6.06e-05 | 322.49 ms | 52.3% bf16 MFU | 1624956 tok/s step 15680/19560 | loss 3.286966 (-0.44z)| norm 0.2710 (-0.76z)| lr 6.05e-05 | 322.88 ms | 52.3% bf16 MFU | 1624897 tok/s step 15681/19560 | loss 3.370940 (+1.31z)| norm 0.3765 (+4.02z)| lr 6.05e-05 | 322.30 ms | 52.4% bf16 MFU | 1624987 tok/s step 15682/19560 | loss 3.249509 (-1.23z)| norm 0.2863 (-0.04z)| lr 6.05e-05 | 322.45 ms | 52.3% bf16 MFU | 1625035 tok/s step 15683/19560 | loss 3.321704 (+0.27z)| norm 0.2756 (-0.53z)| lr 6.04e-05 | 323.05 ms | 52.2% bf16 MFU | 1624930 tok/s step 15684/19560 | loss 3.273089 (-0.77z)| norm 0.3341 (+2.07z)| lr 6.04e-05 | 322.77 ms | 52.3% bf16 MFU | 1624900 tok/s step 15685/19560 | loss 3.314709 (+0.12z)| norm 0.2915 (+0.16z)| lr 6.04e-05 | 322.45 ms | 52.3% bf16 MFU | 1624953 tok/s step 15686/19560 | loss 3.305824 (-0.07z)| norm 0.2778 (-0.45z)| lr 6.04e-05 | 322.39 ms | 52.4% bf16 MFU | 1625018 tok/s step 15687/19560 | loss 3.282054 (-0.58z)| norm 0.3947 (+4.39z)| lr 6.03e-05 | 322.77 ms | 52.3% bf16 MFU | 1624983 tok/s step 15688/19560 | loss 3.354115 (+0.95z)| norm 0.3181 (+1.20z)| lr 6.03e-05 | 322.67 ms | 52.3% bf16 MFU | 1624976 tok/s step 
15689/19560 | loss 3.271398 (-0.82z)| norm 0.3056 (+0.68z)| lr 6.03e-05 | 322.60 ms | 52.3% bf16 MFU | 1624987 tok/s step 15690/19560 | loss 3.314296 (+0.11z)| norm 0.3490 (+2.40z)| lr 6.02e-05 | 323.17 ms | 52.2% bf16 MFU | 1624854 tok/s step 15691/19560 | loss 3.277748 (-0.68z)| norm 0.2980 (+0.35z)| lr 6.02e-05 | 323.40 ms | 52.2% bf16 MFU | 1624668 tok/s step 15692/19560 | loss 3.288063 (-0.45z)| norm 0.2950 (+0.22z)| lr 6.02e-05 | 322.84 ms | 52.3% bf16 MFU | 1624635 tok/s step 15693/19560 | loss 3.312018 (+0.07z)| norm 0.3580 (+2.68z)| lr 6.01e-05 | 322.62 ms | 52.3% bf16 MFU | 1624658 tok/s step 15694/19560 | loss 3.296770 (-0.26z)| norm 0.3383 (+1.86z)| lr 6.01e-05 | 322.92 ms | 52.3% bf16 MFU | 1624604 tok/s step 15695/19560 | loss 3.285254 (-0.51z)| norm 0.2673 (-0.91z)| lr 6.01e-05 | 323.10 ms | 52.2% bf16 MFU | 1624509 tok/s step 15696/19560 | loss 3.291440 (-0.38z)| norm 0.3454 (+2.08z)| lr 6.01e-05 | 322.32 ms | 52.4% bf16 MFU | 1624614 tok/s step 15697/19560 | loss 3.274178 (-0.75z)| norm 0.3218 (+1.19z)| lr 6.00e-05 | 322.82 ms | 52.3% bf16 MFU | 1624589 tok/s step 15698/19560 | loss 3.264045 (-0.96z)| norm 0.2632 (-1.06z)| lr 6.00e-05 | 322.22 ms | 52.4% bf16 MFU | 1624714 tok/s step 15699/19560 | loss 3.317694 (+0.23z)| norm 0.3437 (+1.98z)| lr 6.00e-05 | 322.75 ms | 52.3% bf16 MFU | 1624701 tok/s step 15700/19560 | loss 3.286458 (-0.47z)| norm 0.2721 (-0.72z)| lr 5.99e-05 | 322.61 ms | 52.3% bf16 MFU | 1624724 tok/s step 15701/19560 | loss 3.285599 (-0.49z)| norm 0.2700 (-0.79z)| lr 5.99e-05 | 322.23 ms | 52.4% bf16 MFU | 1624840 tok/s step 15702/19560 | loss 3.302894 (-0.09z)| norm 0.2991 (+0.30z)| lr 5.99e-05 | 322.68 ms | 52.3% bf16 MFU | 1624837 tok/s step 15703/19560 | loss 3.324744 (+0.39z)| norm 0.2960 (+0.18z)| lr 5.98e-05 | 322.86 ms | 52.3% bf16 MFU | 1624790 tok/s step 15704/19560 | loss 3.354576 (+1.07z)| norm 0.2621 (-1.08z)| lr 5.98e-05 | 323.44 ms | 52.2% bf16 MFU | 1624599 tok/s step 15705/19560 | loss 3.400114 (+2.05z)| norm 
0.2821 (-0.34z)| lr 5.98e-05 | 322.81 ms | 52.3% bf16 MFU | 1624576 tok/s step 15706/19560 | loss 3.208009 (-2.18z)| norm 0.2660 (-0.94z)| lr 5.98e-05 | 322.88 ms | 52.3% bf16 MFU | 1624537 tok/s step 15707/19560 | loss 3.240973 (-1.43z)| norm 0.2663 (-0.92z)| lr 5.97e-05 | 323.21 ms | 52.2% bf16 MFU | 1624416 tok/s step 15708/19560 | loss 3.327707 (+0.45z)| norm 0.2862 (-0.16z)| lr 5.97e-05 | 322.62 ms | 52.3% bf16 MFU | 1624451 tok/s step 15709/19560 | loss 3.318190 (+0.24z)| norm 0.2555 (-1.31z)| lr 5.97e-05 | 322.81 ms | 52.3% bf16 MFU | 1624436 tok/s step 15710/19560 | loss 3.323902 (+0.35z)| norm 0.2770 (-0.50z)| lr 5.96e-05 | 322.33 ms | 52.4% bf16 MFU | 1624541 tok/s step 15711/19560 | loss 3.239504 (-1.46z)| norm 0.2630 (-1.01z)| lr 5.96e-05 | 323.38 ms | 52.2% bf16 MFU | 1624378 tok/s step 15712/19560 | loss 3.288790 (-0.39z)| norm 0.2646 (-0.94z)| lr 5.96e-05 | 323.15 ms | 52.2% bf16 MFU | 1624281 tok/s step 15713/19560 | loss 3.334448 (+0.59z)| norm 0.2524 (-1.38z)| lr 5.95e-05 | 322.52 ms | 52.3% bf16 MFU | 1624346 tok/s step 15714/19560 | loss 3.341696 (+0.76z)| norm 0.2663 (-0.85z)| lr 5.95e-05 | 323.04 ms | 52.2% bf16 MFU | 1624277 tok/s step 15715/19560 | loss 3.300095 (-0.15z)| norm 0.2543 (-1.28z)| lr 5.95e-05 | 322.60 ms | 52.3% bf16 MFU | 1624323 tok/s step 15716/19560 | loss 3.364867 (+1.24z)| norm 0.2756 (-0.49z)| lr 5.95e-05 | 322.28 ms | 52.4% bf16 MFU | 1624447 tok/s step 15717/19560 | loss 3.341714 (+0.74z)| norm 0.2669 (-0.80z)| lr 5.94e-05 | 323.11 ms | 52.2% bf16 MFU | 1624356 tok/s step 15718/19560 | loss 3.278214 (-0.64z)| norm 0.2763 (-0.45z)| lr 5.94e-05 | 322.68 ms | 52.3% bf16 MFU | 1624378 tok/s step 15719/19560 | loss 3.272364 (-0.76z)| norm 0.2518 (-1.35z)| lr 5.94e-05 | 323.29 ms | 52.2% bf16 MFU | 1624244 tok/s step 15720/19560 | loss 3.288136 (-0.41z)| norm 0.2601 (-1.03z)| lr 5.93e-05 | 322.39 ms | 52.4% bf16 MFU | 1624346 tok/s step 15721/19560 | loss 3.391799 (+1.81z)| norm 0.2766 (-0.42z)| lr 5.93e-05 | 322.52 ms | 
52.3% bf16 MFU | 1624407 tok/s step 15722/19560 | loss 3.312927 (+0.10z)| norm 0.2658 (-0.81z)| lr 5.93e-05 | 323.43 ms | 52.2% bf16 MFU | 1624239 tok/s step 15723/19560 | loss 3.286472 (-0.47z)| norm 0.2578 (-1.10z)| lr 5.92e-05 | 322.21 ms | 52.4% bf16 MFU | 1624386 tok/s step 15724/19560 | loss 3.309852 (+0.02z)| norm 0.2932 (+0.19z)| lr 5.92e-05 | 323.31 ms | 52.2% bf16 MFU | 1624248 tok/s step 15725/19560 | loss 3.325547 (+0.36z)| norm 0.2496 (-1.39z)| lr 5.92e-05 | 323.05 ms | 52.2% bf16 MFU | 1624182 tok/s step 15726/19560 | loss 3.267305 (-0.93z)| norm 0.2703 (-0.62z)| lr 5.92e-05 | 322.39 ms | 52.3% bf16 MFU | 1624285 tok/s step 15727/19560 | loss 3.269846 (-0.86z)| norm 0.2719 (-0.57z)| lr 5.91e-05 | 322.56 ms | 52.3% bf16 MFU | 1624341 tok/s step 15728/19560 | loss 3.352263 (+0.95z)| norm 0.2623 (-0.91z)| lr 5.91e-05 | 322.49 ms | 52.3% bf16 MFU | 1624412 tok/s step 15729/19560 | loss 3.370954 (+1.34z)| norm 0.2661 (-0.76z)| lr 5.91e-05 | 322.77 ms | 52.3% bf16 MFU | 1624408 tok/s step 15730/19560 | loss 3.342689 (+0.71z)| norm 0.2756 (-0.42z)| lr 5.90e-05 | 323.02 ms | 52.2% bf16 MFU | 1624340 tok/s step 15731/19560 | loss 3.387934 (+1.69z)| norm 0.2652 (-0.80z)| lr 5.90e-05 | 323.16 ms | 52.2% bf16 MFU | 1624242 tok/s step 15732/19560 | loss 3.296551 (-0.29z)| norm 0.2655 (-0.78z)| lr 5.90e-05 | 322.56 ms | 52.3% bf16 MFU | 1624299 tok/s step 15733/19560 | loss 3.289237 (-0.46z)| norm 0.2866 (-0.02z)| lr 5.90e-05 | 322.82 ms | 52.3% bf16 MFU | 1624288 tok/s step 15734/19560 | loss 3.294855 (-0.33z)| norm 0.2787 (-0.32z)| lr 5.89e-05 | 323.16 ms | 52.2% bf16 MFU | 1624194 tok/s step 15735/19560 | loss 3.313340 (+0.07z)| norm 0.3183 (+1.11z)| lr 5.89e-05 | 322.84 ms | 52.3% bf16 MFU | 1624182 tok/s step 15736/19560 | loss 3.272177 (-0.83z)| norm 0.2627 (-0.91z)| lr 5.89e-05 | 323.19 ms | 52.2% bf16 MFU | 1624084 tok/s step 15737/19560 | loss 3.325016 (+0.32z)| norm 0.3693 (+2.84z)| lr 5.88e-05 | 323.09 ms | 52.2% bf16 MFU | 1624016 tok/s step 15738/19560 
| loss 3.282737 (-0.61z)| norm 0.2841 (-0.15z)| lr 5.88e-05 | 322.97 ms | 52.3% bf16 MFU | 1623982 tok/s step 15739/19560 | loss 3.339221 (+0.63z)| norm 0.3229 (+1.21z)| lr 5.88e-05 | 322.80 ms | 52.3% bf16 MFU | 1623993 tok/s step 15740/19560 | loss 3.321546 (+0.24z)| norm 0.2781 (-0.37z)| lr 5.87e-05 | 322.36 ms | 52.4% bf16 MFU | 1624114 tok/s step 15741/19560 | loss 3.274633 (-0.78z)| norm 0.2842 (-0.15z)| lr 5.87e-05 | 322.99 ms | 52.3% bf16 MFU | 1624069 tok/s step 15742/19560 | loss 3.206883 (-2.20z)| norm 0.2685 (-0.70z)| lr 5.87e-05 | 323.36 ms | 52.2% bf16 MFU | 1623934 tok/s step 15743/19560 | loss 3.322100 (+0.30z)| norm 0.2804 (-0.27z)| lr 5.87e-05 | 322.36 ms | 52.4% bf16 MFU | 1624056 tok/s step 15744/19560 | loss 3.329569 (+0.46z)| norm 0.3137 (+0.90z)| lr 5.86e-05 | 322.90 ms | 52.3% bf16 MFU | 1624037 tok/s step 15745/19560 | loss 3.311105 (+0.07z)| norm 0.2781 (-0.35z)| lr 5.86e-05 | 322.82 ms | 52.3% bf16 MFU | 1624040 tok/s step 15746/19560 | loss 3.268824 (-0.87z)| norm 0.3037 (+0.55z)| lr 5.86e-05 | 322.62 ms | 52.3% bf16 MFU | 1624092 tok/s step 15747/19560 | loss 3.303475 (-0.09z)| norm 0.2890 (+0.04z)| lr 5.85e-05 | 322.96 ms | 52.3% bf16 MFU | 1624056 tok/s step 15748/19560 | loss 3.374389 (+1.47z)| norm 0.2962 (+0.28z)| lr 5.85e-05 | 323.04 ms | 52.2% bf16 MFU | 1624002 tok/s step 15749/19560 | loss 3.373957 (+1.44z)| norm 0.4029 (+3.79z)| lr 5.85e-05 | 322.62 ms | 52.3% bf16 MFU | 1624056 tok/s step 15750/19560 | loss 3.281834 (-0.59z)| norm 0.2780 (-0.37z)| lr 5.84e-05 | 322.82 ms | 52.3% bf16 MFU | 1624057 tok/s val loss 3.294148 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating HellaSwag: 70/79 HellaSwag: 3026/10042 = 0.301334 step 15751/19560 | loss 3.232567 (-1.65z)| norm 0.3183 (+0.96z)| lr 5.84e-05 | 321.99 ms | 52.4% bf16 MFU | 1624267 tok/s step 15752/19560 | loss 3.250630 
(-1.24z)| norm 0.2962 (+0.22z)| lr 5.84e-05 | 323.21 ms | 52.2% bf16 MFU | 1624159 tok/s step 15753/19560 | loss 3.348643 (+0.89z)| norm 0.2829 (-0.23z)| lr 5.84e-05 | 322.86 ms | 52.3% bf16 MFU | 1624145 tok/s step 15754/19560 | loss 3.313804 (+0.13z)| norm 0.3265 (+1.21z)| lr 5.83e-05 | 322.89 ms | 52.3% bf16 MFU | 1624124 tok/s step 15755/19560 | loss 3.252082 (-1.21z)| norm 0.2523 (-1.24z)| lr 5.83e-05 | 323.27 ms | 52.2% bf16 MFU | 1624009 tok/s step 15756/19560 | loss 3.258589 (-1.05z)| norm 0.2898 (-0.01z)| lr 5.83e-05 | 322.95 ms | 52.3% bf16 MFU | 1623980 tok/s step 15757/19560 | loss 3.335629 (+0.64z)| norm 0.3082 (+0.61z)| lr 5.82e-05 | 322.93 ms | 52.3% bf16 MFU | 1623959 tok/s step 15758/19560 | loss 3.297456 (-0.21z)| norm 0.2991 (+0.30z)| lr 5.82e-05 | 323.15 ms | 52.2% bf16 MFU | 1623882 tok/s step 15759/19560 | loss 3.349145 (+0.92z)| norm 0.3048 (+0.49z)| lr 5.82e-05 | 322.93 ms | 52.3% bf16 MFU | 1623864 tok/s step 15760/19560 | loss 3.310382 (+0.06z)| norm 0.3175 (+0.91z)| lr 5.81e-05 | 322.83 ms | 52.3% bf16 MFU | 1623874 tok/s step 15761/19560 | loss 3.290837 (-0.37z)| norm 0.2819 (-0.27z)| lr 5.81e-05 | 322.59 ms | 52.3% bf16 MFU | 1623941 tok/s step 15762/19560 | loss 3.285006 (-0.50z)| norm 0.2959 (+0.20z)| lr 5.81e-05 | 323.09 ms | 52.2% bf16 MFU | 1623881 tok/s step 15763/19560 | loss 3.276655 (-0.68z)| norm 0.3163 (+0.87z)| lr 5.81e-05 | 322.76 ms | 52.3% bf16 MFU | 1623906 tok/s step 15764/19560 | loss 3.244814 (-1.37z)| norm 0.3266 (+1.20z)| lr 5.80e-05 | 323.76 ms | 52.1% bf16 MFU | 1623679 tok/s step 15765/19560 | loss 3.296903 (-0.22z)| norm 0.2938 (+0.10z)| lr 5.80e-05 | 322.98 ms | 52.3% bf16 MFU | 1623658 tok/s step 15766/19560 | loss 3.350874 (+0.97z)| norm 0.3205 (+0.98z)| lr 5.80e-05 | 322.64 ms | 52.3% bf16 MFU | 1623726 tok/s step 15767/19560 | loss 3.283719 (-0.53z)| norm 0.3203 (+0.97z)| lr 5.79e-05 | 323.08 ms | 52.2% bf16 MFU | 1623678 tok/s step 15768/19560 | loss 3.317320 (+0.22z)| norm 0.3234 (+1.05z)| lr 5.79e-05 | 
323.55 ms | 52.2% bf16 MFU | 1623515 tok/s step 15769/19560 | loss 3.333263 (+0.69z)| norm 0.3186 (+0.92z)| lr 5.79e-05 | 323.54 ms | 52.2% bf16 MFU | 1623362 tok/s step 15770/19560 | loss 3.315588 (+0.24z)| norm 0.2966 (+0.19z)| lr 5.79e-05 | 323.48 ms | 52.2% bf16 MFU | 1623232 tok/s step 15771/19560 | loss 3.281963 (-0.60z)| norm 0.2804 (-0.35z)| lr 5.78e-05 | 323.15 ms | 52.2% bf16 MFU | 1623193 tok/s step 15772/19560 | loss 3.295697 (-0.25z)| norm 0.3074 (+0.56z)| lr 5.78e-05 | 322.81 ms | 52.3% bf16 MFU | 1623239 tok/s step 15773/19560 | loss 3.288959 (-0.41z)| norm 0.3039 (+0.47z)| lr 5.78e-05 | 324.01 ms | 52.1% bf16 MFU | 1622984 tok/s step 15774/19560 | loss 3.270233 (-0.88z)| norm 0.2878 (-0.09z)| lr 5.77e-05 | 323.54 ms | 52.2% bf16 MFU | 1622858 tok/s step 15775/19560 | loss 3.319970 (+0.38z)| norm 0.2661 (-0.84z)| lr 5.77e-05 | 323.23 ms | 52.2% bf16 MFU | 1622816 tok/s step 15776/19560 | loss 3.315489 (+0.28z)| norm 0.2608 (-1.00z)| lr 5.77e-05 | 322.42 ms | 52.3% bf16 MFU | 1622980 tok/s step 15777/19560 | loss 3.307907 (+0.08z)| norm 0.2856 (-0.15z)| lr 5.76e-05 | 323.33 ms | 52.2% bf16 MFU | 1622907 tok/s step 15778/19560 | loss 3.291426 (-0.33z)| norm 0.2943 (+0.15z)| lr 5.76e-05 | 323.16 ms | 52.2% bf16 MFU | 1622880 tok/s step 15779/19560 | loss 3.247213 (-1.47z)| norm 0.2765 (-0.47z)| lr 5.76e-05 | 322.98 ms | 52.3% bf16 MFU | 1622900 tok/s step 15780/19560 | loss 3.283213 (-0.51z)| norm 0.2605 (-1.02z)| lr 5.76e-05 | 322.80 ms | 52.3% bf16 MFU | 1622964 tok/s step 15781/19560 | loss 3.348511 (+1.21z)| norm 0.2936 (+0.13z)| lr 5.75e-05 | 322.90 ms | 52.3% bf16 MFU | 1623000 tok/s step 15782/19560 | loss 3.337564 (+0.90z)| norm 0.2999 (+0.34z)| lr 5.75e-05 | 322.72 ms | 52.3% bf16 MFU | 1623081 tok/s step 15783/19560 | loss 3.331201 (+0.73z)| norm 0.2732 (-0.58z)| lr 5.75e-05 | 322.54 ms | 52.3% bf16 MFU | 1623201 tok/s step 15784/19560 | loss 3.308498 (+0.15z)| norm 0.2635 (-0.91z)| lr 5.74e-05 | 322.65 ms | 52.3% bf16 MFU | 1623289 tok/s step 
15785/19560 | loss 3.244336 (-1.54z)| norm 0.2853 (-0.16z)| lr 5.74e-05 | 322.89 ms | 52.3% bf16 MFU | 1623311 tok/s step 15786/19560 | loss 3.375936 (+1.88z)| norm 0.2858 (-0.15z)| lr 5.74e-05 | 323.26 ms | 52.2% bf16 MFU | 1623240 tok/s step 15787/19560 | loss 3.334582 (+0.80z)| norm 0.2594 (-1.06z)| lr 5.74e-05 | 322.21 ms | 52.4% bf16 MFU | 1623434 tok/s step 15788/19560 | loss 3.288586 (-0.39z)| norm 0.2790 (-0.37z)| lr 5.73e-05 | 322.98 ms | 52.3% bf16 MFU | 1623426 tok/s step 15789/19560 | loss 3.328195 (+0.63z)| norm 0.2794 (-0.37z)| lr 5.73e-05 | 322.66 ms | 52.3% bf16 MFU | 1623500 tok/s step 15790/19560 | loss 3.281745 (-0.57z)| norm 0.3077 (+0.61z)| lr 5.73e-05 | 322.54 ms | 52.3% bf16 MFU | 1623600 tok/s step 15791/19560 | loss 3.358877 (+1.43z)| norm 0.2925 (+0.07z)| lr 5.72e-05 | 323.46 ms | 52.2% bf16 MFU | 1623463 tok/s step 15792/19560 | loss 3.324285 (+0.51z)| norm 0.2856 (-0.17z)| lr 5.72e-05 | 322.41 ms | 52.3% bf16 MFU | 1623596 tok/s step 15793/19560 | loss 3.283091 (-0.58z)| norm 0.2994 (+0.31z)| lr 5.72e-05 | 322.88 ms | 52.3% bf16 MFU | 1623605 tok/s step 15794/19560 | loss 3.269310 (-0.93z)| norm 0.2651 (-0.89z)| lr 5.71e-05 | 322.56 ms | 52.3% bf16 MFU | 1623694 tok/s step 15795/19560 | loss 3.374408 (+1.80z)| norm 0.3102 (+0.68z)| lr 5.71e-05 | 322.46 ms | 52.3% bf16 MFU | 1623806 tok/s step 15796/19560 | loss 3.306547 (+0.02z)| norm 0.2916 (+0.03z)| lr 5.71e-05 | 323.18 ms | 52.2% bf16 MFU | 1623730 tok/s step 15797/19560 | loss 3.301663 (-0.11z)| norm 0.2827 (-0.28z)| lr 5.71e-05 | 322.46 ms | 52.3% bf16 MFU | 1623838 tok/s step 15798/19560 | loss 3.338618 (+0.85z)| norm 0.2696 (-0.74z)| lr 5.70e-05 | 322.71 ms | 52.3% bf16 MFU | 1623878 tok/s step 15799/19560 | loss 3.365335 (+1.53z)| norm 0.2698 (-0.73z)| lr 5.70e-05 | 322.74 ms | 52.3% bf16 MFU | 1623909 tok/s step 15800/19560 | loss 3.333829 (+0.71z)| norm 0.2721 (-0.64z)| lr 5.70e-05 | 322.20 ms | 52.4% bf16 MFU | 1624075 tok/s step 15801/19560 | loss 3.294943 (-0.30z)| norm 
0.3353 (+1.54z)| lr 5.69e-05 | 322.61 ms | 52.3% bf16 MFU | 1624128 tok/s step 15802/19560 | loss 3.381393 (+1.97z)| norm 0.2904 (-0.02z)| lr 5.69e-05 | 322.72 ms | 52.3% bf16 MFU | 1624150 tok/s step 15803/19560 | loss 3.326046 (+0.50z)| norm 0.2796 (-0.39z)| lr 5.69e-05 | 322.44 ms | 52.3% bf16 MFU | 1624244 tok/s step 15804/19560 | loss 3.248357 (-1.53z)| norm 0.3103 (+0.67z)| lr 5.69e-05 | 322.14 ms | 52.4% bf16 MFU | 1624408 tok/s step 15805/19560 | loss 3.332364 (+0.67z)| norm 0.2796 (-0.40z)| lr 5.68e-05 | 322.24 ms | 52.4% bf16 MFU | 1624538 tok/s step 15806/19560 | loss 3.315662 (+0.24z)| norm 0.3251 (+1.16z)| lr 5.68e-05 | 322.88 ms | 52.3% bf16 MFU | 1624500 tok/s step 15807/19560 | loss 3.333548 (+0.70z)| norm 0.2769 (-0.49z)| lr 5.68e-05 | 322.50 ms | 52.3% bf16 MFU | 1624561 tok/s step 15808/19560 | loss 3.265494 (-1.08z)| norm 0.2856 (-0.20z)| lr 5.67e-05 | 322.86 ms | 52.3% bf16 MFU | 1624527 tok/s step 15809/19560 | loss 3.308744 (+0.07z)| norm 0.3441 (+1.89z)| lr 5.67e-05 | 322.60 ms | 52.3% bf16 MFU | 1624561 tok/s step 15810/19560 | loss 3.295435 (-0.30z)| norm 0.2820 (-0.32z)| lr 5.67e-05 | 322.69 ms | 52.3% bf16 MFU | 1624569 tok/s step 15811/19560 | loss 3.285979 (-0.54z)| norm 0.2811 (-0.35z)| lr 5.67e-05 | 322.50 ms | 52.3% bf16 MFU | 1624627 tok/s step 15812/19560 | loss 3.337894 (+0.83z)| norm 0.2679 (-0.81z)| lr 5.66e-05 | 322.73 ms | 52.3% bf16 MFU | 1624622 tok/s step 15813/19560 | loss 3.290182 (-0.44z)| norm 0.2863 (-0.15z)| lr 5.66e-05 | 322.42 ms | 52.3% bf16 MFU | 1624696 tok/s step 15814/19560 | loss 3.350716 (+1.16z)| norm 0.3039 (+0.47z)| lr 5.66e-05 | 322.76 ms | 52.3% bf16 MFU | 1624682 tok/s step 15815/19560 | loss 3.380135 (+1.90z)| norm 0.2643 (-0.95z)| lr 5.65e-05 | 322.60 ms | 52.3% bf16 MFU | 1624708 tok/s step 15816/19560 | loss 3.359565 (+1.35z)| norm 0.2741 (-0.58z)| lr 5.65e-05 | 322.57 ms | 52.3% bf16 MFU | 1624740 tok/s step 15817/19560 | loss 3.351880 (+1.13z)| norm 0.3057 (+0.62z)| lr 5.65e-05 | 322.98 ms | 
52.3% bf16 MFU | 1624668 tok/s step 15818/19560 | loss 3.346225 (+0.98z)| norm 0.2595 (-1.12z)| lr 5.64e-05 | 322.66 ms | 52.3% bf16 MFU | 1624678 tok/s step 15819/19560 | loss 3.409121 (+2.53z)| norm 0.3157 (+1.03z)| lr 5.64e-05 | 322.62 ms | 52.3% bf16 MFU | 1624698 tok/s step 15820/19560 | loss 3.276043 (-0.85z)| norm 0.2728 (-0.60z)| lr 5.64e-05 | 322.53 ms | 52.3% bf16 MFU | 1624742 tok/s step 15821/19560 | loss 3.314806 (+0.13z)| norm 0.2666 (-0.84z)| lr 5.64e-05 | 322.65 ms | 52.3% bf16 MFU | 1624751 tok/s step 15822/19560 | loss 3.299803 (-0.25z)| norm 0.2965 (+0.36z)| lr 5.63e-05 | 323.31 ms | 52.2% bf16 MFU | 1624594 tok/s step 15823/19560 | loss 3.261367 (-1.22z)| norm 0.2687 (-0.75z)| lr 5.63e-05 | 321.98 ms | 52.4% bf16 MFU | 1624781 tok/s step 15824/19560 | loss 3.302497 (-0.18z)| norm 0.2696 (-0.70z)| lr 5.63e-05 | 322.86 ms | 52.3% bf16 MFU | 1624736 tok/s step 15825/19560 | loss 3.345087 (+0.88z)| norm 0.3022 (+0.63z)| lr 5.62e-05 | 323.13 ms | 52.2% bf16 MFU | 1624626 tok/s step 15826/19560 | loss 3.279610 (-0.78z)| norm 0.2801 (-0.28z)| lr 5.62e-05 | 322.46 ms | 52.3% bf16 MFU | 1624689 tok/s step 15827/19560 | loss 3.313918 (+0.09z)| norm 0.3701 (+3.33z)| lr 5.62e-05 | 322.40 ms | 52.3% bf16 MFU | 1624764 tok/s step 15828/19560 | loss 3.552087 (+5.36z)| norm 0.2851 (-0.09z)| lr 5.62e-05 | 322.90 ms | 52.3% bf16 MFU | 1624710 tok/s step 15829/19560 | loss 3.303560 (-0.20z)| norm 0.4378 (+5.30z)| lr 5.61e-05 | 322.64 ms | 52.3% bf16 MFU | 1624725 tok/s step 15830/19560 | loss 3.277649 (-0.77z)| norm 0.2862 (-0.08z)| lr 5.61e-05 | 321.58 ms | 52.5% bf16 MFU | 1625006 tok/s step 15831/19560 | loss 3.228375 (-1.83z)| norm 0.2979 (+0.34z)| lr 5.61e-05 | 322.73 ms | 52.3% bf16 MFU | 1624984 tok/s step 15832/19560 | loss 3.313575 (+0.05z)| norm 0.3584 (+2.41z)| lr 5.60e-05 | 322.66 ms | 52.3% bf16 MFU | 1624978 tok/s step 15833/19560 | loss 3.426693 (+2.52z)| norm 0.2975 (+0.29z)| lr 5.60e-05 | 322.45 ms | 52.3% bf16 MFU | 1625027 tok/s step 15834/19560 
| loss 3.351217 (+0.86z)| norm 0.2832 (-0.22z)| lr 5.60e-05 | 322.65 ms | 52.3% bf16 MFU | 1625022 tok/s step 15835/19560 | loss 3.291666 (-0.48z)| norm 0.3687 (+2.67z)| lr 5.60e-05 | 322.64 ms | 52.3% bf16 MFU | 1625020 tok/s step 15836/19560 | loss 3.298796 (-0.31z)| norm 0.3400 (+1.66z)| lr 5.59e-05 | 322.90 ms | 52.3% bf16 MFU | 1624953 tok/s step 15837/19560 | loss 3.333651 (+0.47z)| norm 0.2889 (-0.07z)| lr 5.59e-05 | 322.72 ms | 52.3% bf16 MFU | 1624936 tok/s step 15838/19560 | loss 3.385914 (+1.62z)| norm 0.3468 (+1.84z)| lr 5.59e-05 | 322.43 ms | 52.3% bf16 MFU | 1624992 tok/s step 15839/19560 | loss 3.337990 (+0.54z)| norm 0.3303 (+1.27z)| lr 5.58e-05 | 322.70 ms | 52.3% bf16 MFU | 1624976 tok/s step 15840/19560 | loss 3.266306 (-1.07z)| norm 0.3095 (+0.57z)| lr 5.58e-05 | 322.70 ms | 52.3% bf16 MFU | 1624962 tok/s step 15841/19560 | loss 3.319892 (+0.14z)| norm 0.3482 (+1.83z)| lr 5.58e-05 | 322.42 ms | 52.3% bf16 MFU | 1625020 tok/s step 15842/19560 | loss 3.265608 (-1.06z)| norm 0.3199 (+0.87z)| lr 5.57e-05 | 322.63 ms | 52.3% bf16 MFU | 1625021 tok/s step 15843/19560 | loss 3.345695 (+0.72z)| norm 0.2754 (-0.61z)| lr 5.57e-05 | 322.87 ms | 52.3% bf16 MFU | 1624963 tok/s step 15844/19560 | loss 3.285301 (-0.62z)| norm 0.3185 (+0.81z)| lr 5.57e-05 | 322.71 ms | 52.3% bf16 MFU | 1624946 tok/s step 15845/19560 | loss 3.342798 (+0.67z)| norm 0.3287 (+1.14z)| lr 5.57e-05 | 322.76 ms | 52.3% bf16 MFU | 1624918 tok/s step 15846/19560 | loss 3.305566 (-0.17z)| norm 0.2811 (-0.45z)| lr 5.56e-05 | 322.58 ms | 52.3% bf16 MFU | 1624937 tok/s step 15847/19560 | loss 3.240244 (-1.62z)| norm 0.3024 (+0.25z)| lr 5.56e-05 | 322.32 ms | 52.4% bf16 MFU | 1625020 tok/s step 15848/19560 | loss 3.319594 (+0.14z)| norm 0.2943 (-0.03z)| lr 5.56e-05 | 323.02 ms | 52.2% bf16 MFU | 1624923 tok/s step 15849/19560 | loss 3.263166 (-1.10z)| norm 0.2825 (-0.43z)| lr 5.55e-05 | 322.72 ms | 52.3% bf16 MFU | 1624906 tok/s step 15850/19560 | loss 3.263955 (-1.07z)| norm 0.2718 (-0.79z)| 
lr 5.55e-05 | 322.37 ms | 52.4% bf16 MFU | 1624978 tok/s step 15851/19560 | loss 3.295385 (-0.37z)| norm 0.2806 (-0.51z)| lr 5.55e-05 | 322.68 ms | 52.3% bf16 MFU | 1624968 tok/s step 15852/19560 | loss 3.490534 (+3.74z)| norm 0.2801 (-0.52z)| lr 5.55e-05 | 322.48 ms | 52.3% bf16 MFU | 1625009 tok/s step 15853/19560 | loss 3.277009 (-0.75z)| norm 0.2905 (-0.18z)| lr 5.54e-05 | 322.68 ms | 52.3% bf16 MFU | 1624999 tok/s step 15854/19560 | loss 3.225494 (-1.82z)| norm 0.2771 (-0.64z)| lr 5.54e-05 | 322.92 ms | 52.3% bf16 MFU | 1624927 tok/s step 15855/19560 | loss 3.316146 (+0.07z)| norm 0.3215 (+0.87z)| lr 5.54e-05 | 322.34 ms | 52.4% bf16 MFU | 1625007 tok/s step 15856/19560 | loss 3.244558 (-1.41z)| norm 0.2674 (-0.99z)| lr 5.53e-05 | 322.59 ms | 52.3% bf16 MFU | 1625018 tok/s step 15857/19560 | loss 3.242324 (-1.43z)| norm 0.2744 (-0.75z)| lr 5.53e-05 | 322.72 ms | 52.3% bf16 MFU | 1624996 tok/s step 15858/19560 | loss 3.276213 (-0.71z)| norm 0.2746 (-0.74z)| lr 5.53e-05 | 322.53 ms | 52.3% bf16 MFU | 1625023 tok/s step 15859/19560 | loss 3.305795 (-0.09z)| norm 0.2637 (-1.12z)| lr 5.53e-05 | 322.78 ms | 52.3% bf16 MFU | 1624987 tok/s step 15860/19560 | loss 3.277219 (-0.68z)| norm 0.2764 (-0.69z)| lr 5.52e-05 | 322.31 ms | 52.4% bf16 MFU | 1625070 tok/s step 15861/19560 | loss 3.356165 (+0.96z)| norm 0.2985 (+0.07z)| lr 5.52e-05 | 322.41 ms | 52.3% bf16 MFU | 1625123 tok/s step 15862/19560 | loss 3.245722 (-1.33z)| norm 0.2706 (-0.88z)| lr 5.52e-05 | 322.39 ms | 52.4% bf16 MFU | 1625180 tok/s step 15863/19560 | loss 3.294869 (-0.31z)| norm 0.2587 (-1.27z)| lr 5.51e-05 | 322.79 ms | 52.3% bf16 MFU | 1625133 tok/s step 15864/19560 | loss 3.360446 (+1.04z)| norm 0.3138 (+0.60z)| lr 5.51e-05 | 323.03 ms | 52.2% bf16 MFU | 1625028 tok/s step 15865/19560 | loss 3.296899 (-0.28z)| norm 0.2948 (-0.03z)| lr 5.51e-05 | 322.67 ms | 52.3% bf16 MFU | 1625019 tok/s step 15866/19560 | loss 3.314842 (+0.09z)| norm 0.2747 (-0.74z)| lr 5.51e-05 | 322.44 ms | 52.3% bf16 MFU | 
1625068 tok/s step 15867/19560 | loss 3.288740 (-0.44z)| norm 0.2745 (-0.73z)| lr 5.50e-05 | 322.67 ms | 52.3% bf16 MFU | 1625058 tok/s step 15868/19560 | loss 3.319323 (+0.19z)| norm 0.2767 (-0.65z)| lr 5.50e-05 | 323.01 ms | 52.2% bf16 MFU | 1624960 tok/s step 15869/19560 | loss 3.314276 (+0.08z)| norm 0.2766 (-0.65z)| lr 5.50e-05 | 322.57 ms | 52.3% bf16 MFU | 1624980 tok/s step 15870/19560 | loss 3.269123 (-0.88z)| norm 0.2544 (-1.42z)| lr 5.49e-05 | 322.72 ms | 52.3% bf16 MFU | 1624960 tok/s step 15871/19560 | loss 3.293546 (-0.36z)| norm 0.2632 (-1.11z)| lr 5.49e-05 | 322.60 ms | 52.3% bf16 MFU | 1624973 tok/s step 15872/19560 | loss 3.261534 (-1.02z)| norm 0.2816 (-0.46z)| lr 5.49e-05 | 322.76 ms | 52.3% bf16 MFU | 1624943 tok/s step 15873/19560 | loss 3.298521 (-0.24z)| norm 0.2605 (-1.18z)| lr 5.49e-05 | 322.83 ms | 52.3% bf16 MFU | 1624898 tok/s step 15874/19560 | loss 3.277667 (-0.68z)| norm 0.2795 (-0.52z)| lr 5.48e-05 | 322.62 ms | 52.3% bf16 MFU | 1624908 tok/s step 15875/19560 | loss 3.313507 (+0.07z)| norm 0.2630 (-1.08z)| lr 5.48e-05 | 322.80 ms | 52.3% bf16 MFU | 1624873 tok/s step 15876/19560 | loss 3.256763 (-1.11z)| norm 0.2812 (-0.44z)| lr 5.48e-05 | 322.74 ms | 52.3% bf16 MFU | 1624854 tok/s step 15877/19560 | loss 3.396717 (+1.83z)| norm 0.2788 (-0.52z)| lr 5.47e-05 | 322.47 ms | 52.3% bf16 MFU | 1624905 tok/s step 15878/19560 | loss 3.348796 (+0.82z)| norm 0.2668 (-0.96z)| lr 5.47e-05 | 322.37 ms | 52.4% bf16 MFU | 1624977 tok/s step 15879/19560 | loss 3.262887 (-1.00z)| norm 0.3188 (+0.94z)| lr 5.47e-05 | 322.71 ms | 52.3% bf16 MFU | 1624961 tok/s step 15880/19560 | loss 3.322671 (+0.25z)| norm 0.2534 (-1.42z)| lr 5.46e-05 | 322.64 ms | 52.3% bf16 MFU | 1624962 tok/s step 15881/19560 | loss 3.297833 (-0.27z)| norm 0.3028 (+0.36z)| lr 5.46e-05 | 322.51 ms | 52.3% bf16 MFU | 1624997 tok/s step 15882/19560 | loss 3.404725 (+1.96z)| norm 0.3007 (+0.30z)| lr 5.46e-05 | 322.70 ms | 52.3% bf16 MFU | 1624982 tok/s step 15883/19560 | loss 3.415526 
(+2.14z)| norm 0.2685 (-0.89z)| lr 5.46e-05 | 322.60 ms | 52.3% bf16 MFU | 1624992 tok/s step 15884/19560 | loss 3.297772 (-0.31z)| norm 0.2709 (-0.79z)| lr 5.45e-05 | 322.72 ms | 52.3% bf16 MFU | 1624972 tok/s step 15885/19560 | loss 3.333205 (+0.43z)| norm 0.2640 (-1.03z)| lr 5.45e-05 | 322.60 ms | 52.3% bf16 MFU | 1624982 tok/s step 15886/19560 | loss 3.270981 (-0.86z)| norm 0.2890 (-0.12z)| lr 5.45e-05 | 322.68 ms | 52.3% bf16 MFU | 1624973 tok/s step 15887/19560 | loss 3.291874 (-0.42z)| norm 0.2691 (-0.83z)| lr 5.44e-05 | 322.77 ms | 52.3% bf16 MFU | 1624942 tok/s step 15888/19560 | loss 3.360986 (+1.01z)| norm 0.3095 (+0.65z)| lr 5.44e-05 | 322.32 ms | 52.4% bf16 MFU | 1625026 tok/s step 15889/19560 | loss 3.347969 (+0.73z)| norm 0.2712 (-0.75z)| lr 5.44e-05 | 322.92 ms | 52.3% bf16 MFU | 1624954 tok/s step 15890/19560 | loss 3.283873 (-0.60z)| norm 0.3039 (+0.44z)| lr 5.44e-05 | 322.78 ms | 52.3% bf16 MFU | 1624921 tok/s step 15891/19560 | loss 3.244272 (-1.40z)| norm 0.2868 (-0.17z)| lr 5.43e-05 | 322.45 ms | 52.3% bf16 MFU | 1624974 tok/s step 15892/19560 | loss 3.289958 (-0.47z)| norm 0.3179 (+0.97z)| lr 5.43e-05 | 322.44 ms | 52.3% bf16 MFU | 1625026 tok/s step 15893/19560 | loss 3.310553 (-0.05z)| norm 0.3010 (+0.35z)| lr 5.43e-05 | 322.59 ms | 52.3% bf16 MFU | 1625037 tok/s step 15894/19560 | loss 3.322371 (+0.20z)| norm 0.3192 (+1.01z)| lr 5.42e-05 | 322.51 ms | 52.3% bf16 MFU | 1625069 tok/s step 15895/19560 | loss 3.321095 (+0.17z)| norm 0.2858 (-0.20z)| lr 5.42e-05 | 322.67 ms | 52.3% bf16 MFU | 1625058 tok/s step 15896/19560 | loss 3.353874 (+0.84z)| norm 0.3039 (+0.47z)| lr 5.42e-05 | 322.46 ms | 52.3% bf16 MFU | 1625101 tok/s step 15897/19560 | loss 3.297665 (-0.32z)| norm 0.3947 (+3.62z)| lr 5.42e-05 | 322.80 ms | 52.3% bf16 MFU | 1625056 tok/s step 15898/19560 | loss 3.226976 (-1.75z)| norm 0.2539 (-1.31z)| lr 5.41e-05 | 322.64 ms | 52.3% bf16 MFU | 1625052 tok/s step 15899/19560 | loss 3.374509 (+1.25z)| norm 0.2759 (-0.54z)| lr 5.41e-05 | 
322.48 ms | 52.3% bf16 MFU | 1625088 tok/s step 15900/19560 | loss 3.375721 (+1.26z)| norm 0.3185 (+0.94z)| lr 5.41e-05 | 322.77 ms | 52.3% bf16 MFU | 1625051 tok/s step 15901/19560 | loss 3.324793 (+0.22z)| norm 0.2656 (-0.88z)| lr 5.40e-05 | 322.54 ms | 52.3% bf16 MFU | 1625074 tok/s step 15902/19560 | loss 3.327718 (+0.27z)| norm 0.2614 (-1.02z)| lr 5.40e-05 | 322.64 ms | 52.3% bf16 MFU | 1625070 tok/s step 15903/19560 | loss 3.303598 (-0.22z)| norm 0.3152 (+0.83z)| lr 5.40e-05 | 322.57 ms | 52.3% bf16 MFU | 1625084 tok/s step 15904/19560 | loss 3.279373 (-0.70z)| norm 0.2596 (-1.10z)| lr 5.40e-05 | 322.71 ms | 52.3% bf16 MFU | 1625061 tok/s step 15905/19560 | loss 3.336605 (+0.45z)| norm 0.2788 (-0.43z)| lr 5.39e-05 | 322.71 ms | 52.3% bf16 MFU | 1625039 tok/s step 15906/19560 | loss 3.276645 (-0.76z)| norm 0.2816 (-0.33z)| lr 5.39e-05 | 322.59 ms | 52.3% bf16 MFU | 1625049 tok/s step 15907/19560 | loss 3.344423 (+0.60z)| norm 0.2847 (-0.22z)| lr 5.39e-05 | 322.87 ms | 52.3% bf16 MFU | 1624989 tok/s step 15908/19560 | loss 3.337332 (+0.45z)| norm 0.2928 (+0.05z)| lr 5.38e-05 | 323.04 ms | 52.2% bf16 MFU | 1624888 tok/s step 15909/19560 | loss 3.264651 (-1.02z)| norm 0.2961 (+0.16z)| lr 5.38e-05 | 322.29 ms | 52.4% bf16 MFU | 1624982 tok/s step 15910/19560 | loss 3.231267 (-1.66z)| norm 0.2683 (-0.79z)| lr 5.38e-05 | 322.83 ms | 52.3% bf16 MFU | 1624935 tok/s step 15911/19560 | loss 3.368084 (+1.08z)| norm 0.3098 (+0.63z)| lr 5.38e-05 | 323.15 ms | 52.2% bf16 MFU | 1624810 tok/s step 15912/19560 | loss 3.294597 (-0.39z)| norm 0.2607 (-1.06z)| lr 5.37e-05 | 322.81 ms | 52.3% bf16 MFU | 1624776 tok/s step 15913/19560 | loss 3.336380 (+0.44z)| norm 0.2748 (-0.57z)| lr 5.37e-05 | 323.33 ms | 52.2% bf16 MFU | 1624613 tok/s step 15914/19560 | loss 3.319272 (+0.10z)| norm 0.2859 (-0.19z)| lr 5.37e-05 | 322.91 ms | 52.3% bf16 MFU | 1624565 tok/s step 15915/19560 | loss 3.363168 (+0.98z)| norm 0.2839 (-0.27z)| lr 5.36e-05 | 322.52 ms | 52.3% bf16 MFU | 1624616 tok/s step 
15916/19560 | loss 3.296488 (-0.37z)| norm 0.2915 (-0.01z)| lr 5.36e-05 | 322.83 ms | 52.3% bf16 MFU | 1624588 tok/s step 15917/19560 | loss 3.318724 (+0.09z)| norm 0.2844 (-0.26z)| lr 5.36e-05 | 322.94 ms | 52.3% bf16 MFU | 1624534 tok/s step 15918/19560 | loss 3.309424 (-0.11z)| norm 0.3029 (+0.39z)| lr 5.36e-05 | 322.64 ms | 52.3% bf16 MFU | 1624558 tok/s step 15919/19560 | loss 3.341392 (+0.55z)| norm 0.2736 (-0.62z)| lr 5.35e-05 | 323.14 ms | 52.2% bf16 MFU | 1624454 tok/s step 15920/19560 | loss 3.245314 (-1.38z)| norm 0.2709 (-0.71z)| lr 5.35e-05 | 322.68 ms | 52.3% bf16 MFU | 1624470 tok/s step 15921/19560 | loss 3.312248 (-0.04z)| norm 0.2822 (-0.32z)| lr 5.35e-05 | 323.05 ms | 52.2% bf16 MFU | 1624394 tok/s step 15922/19560 | loss 3.274417 (-0.80z)| norm 0.2527 (-1.33z)| lr 5.34e-05 | 322.91 ms | 52.3% bf16 MFU | 1624355 tok/s step 15923/19560 | loss 3.322131 (+0.17z)| norm 0.2682 (-0.78z)| lr 5.34e-05 | 322.58 ms | 52.3% bf16 MFU | 1624403 tok/s step 15924/19560 | loss 3.292849 (-0.42z)| norm 0.2790 (-0.41z)| lr 5.34e-05 | 322.90 ms | 52.3% bf16 MFU | 1624366 tok/s step 15925/19560 | loss 3.339962 (+0.53z)| norm 0.2636 (-0.93z)| lr 5.34e-05 | 323.00 ms | 52.3% bf16 MFU | 1624307 tok/s step 15926/19560 | loss 3.303289 (-0.21z)| norm 0.2810 (-0.34z)| lr 5.33e-05 | 322.36 ms | 52.4% bf16 MFU | 1624412 tok/s step 15927/19560 | loss 3.468261 (+3.02z)| norm 0.2671 (-0.81z)| lr 5.33e-05 | 323.43 ms | 52.2% bf16 MFU | 1624242 tok/s step 15928/19560 | loss 3.259593 (-1.06z)| norm 0.2786 (-0.42z)| lr 5.33e-05 | 322.56 ms | 52.3% bf16 MFU | 1624300 tok/s step 15929/19560 | loss 3.230689 (-1.60z)| norm 0.2678 (-0.78z)| lr 5.32e-05 | 322.73 ms | 52.3% bf16 MFU | 1624313 tok/s step 15930/19560 | loss 3.324801 (+0.23z)| norm 0.2655 (-0.85z)| lr 5.32e-05 | 323.37 ms | 52.2% bf16 MFU | 1624163 tok/s step 15931/19560 | loss 3.314368 (+0.03z)| norm 0.2808 (-0.32z)| lr 5.32e-05 | 322.63 ms | 52.3% bf16 MFU | 1624208 tok/s step 15932/19560 | loss 3.296925 (-0.32z)| norm 
0.2636 (-0.90z)| lr 5.32e-05 | 322.89 ms | 52.3% bf16 MFU | 1624186 tok/s step 15933/19560 | loss 3.304350 (-0.17z)| norm 0.2811 (-0.30z)| lr 5.31e-05 | 323.32 ms | 52.2% bf16 MFU | 1624055 tok/s step 15934/19560 | loss 3.310516 (-0.05z)| norm 0.2968 (+0.25z)| lr 5.31e-05 | 323.09 ms | 52.2% bf16 MFU | 1623990 tok/s step 15935/19560 | loss 3.269867 (-0.84z)| norm 0.2693 (-0.70z)| lr 5.31e-05 | 322.76 ms | 52.3% bf16 MFU | 1624010 tok/s step 15936/19560 | loss 3.291560 (-0.42z)| norm 0.2781 (-0.39z)| lr 5.31e-05 | 322.77 ms | 52.3% bf16 MFU | 1624025 tok/s step 15937/19560 | loss 3.294729 (-0.35z)| norm 0.3037 (+0.52z)| lr 5.30e-05 | 322.94 ms | 52.3% bf16 MFU | 1623999 tok/s step 15938/19560 | loss 3.286427 (-0.51z)| norm 0.2982 (+0.32z)| lr 5.30e-05 | 322.58 ms | 52.3% bf16 MFU | 1624063 tok/s step 15939/19560 | loss 3.279420 (-0.65z)| norm 0.2714 (-0.62z)| lr 5.30e-05 | 323.24 ms | 52.2% bf16 MFU | 1623959 tok/s step 15940/19560 | loss 3.316958 (+0.09z)| norm 0.3213 (+1.11z)| lr 5.29e-05 | 322.39 ms | 52.4% bf16 MFU | 1624074 tok/s step 15941/19560 | loss 3.327137 (+0.29z)| norm 0.2911 (+0.05z)| lr 5.29e-05 | 323.12 ms | 52.2% bf16 MFU | 1624000 tok/s step 15942/19560 | loss 3.371433 (+1.15z)| norm 0.2599 (-1.02z)| lr 5.29e-05 | 323.05 ms | 52.2% bf16 MFU | 1623945 tok/s step 15943/19560 | loss 3.318101 (+0.11z)| norm 0.4217 (+4.26z)| lr 5.29e-05 | 322.30 ms | 52.4% bf16 MFU | 1624084 tok/s step 15944/19560 | loss 3.288884 (-0.45z)| norm 0.3230 (+1.04z)| lr 5.28e-05 | 323.09 ms | 52.2% bf16 MFU | 1624015 tok/s step 15945/19560 | loss 3.251914 (-1.17z)| norm 0.3100 (+0.62z)| lr 5.28e-05 | 323.03 ms | 52.2% bf16 MFU | 1623966 tok/s step 15946/19560 | loss 3.295333 (-0.30z)| norm 0.3902 (+3.07z)| lr 5.28e-05 | 322.87 ms | 52.3% bf16 MFU | 1623960 tok/s step 15947/19560 | loss 3.302118 (-0.15z)| norm 0.2894 (-0.07z)| lr 5.27e-05 | 322.73 ms | 52.3% bf16 MFU | 1623988 tok/s step 15948/19560 | loss 3.282860 (-0.54z)| norm 0.2750 (-0.52z)| lr 5.27e-05 | 322.91 ms | 
52.3% bf16 MFU | 1623970 tok/s step 15949/19560 | loss 3.289931 (-0.39z)| norm 0.3564 (+1.98z)| lr 5.27e-05 | 322.85 ms | 52.3% bf16 MFU | 1623968 tok/s step 15950/19560 | loss 3.276397 (-0.66z)| norm 0.2629 (-0.90z)| lr 5.27e-05 | 322.78 ms | 52.3% bf16 MFU | 1623984 tok/s step 15951/19560 | loss 3.331661 (+0.44z)| norm 0.2845 (-0.24z)| lr 5.26e-05 | 323.14 ms | 52.2% bf16 MFU | 1623908 tok/s step 15952/19560 | loss 3.310345 (+0.01z)| norm 0.3094 (+0.52z)| lr 5.26e-05 | 322.88 ms | 52.3% bf16 MFU | 1623902 tok/s step 15953/19560 | loss 3.395862 (+1.70z)| norm 0.2836 (-0.27z)| lr 5.26e-05 | 322.76 ms | 52.3% bf16 MFU | 1623926 tok/s step 15954/19560 | loss 3.301749 (-0.18z)| norm 0.2964 (+0.12z)| lr 5.25e-05 | 323.31 ms | 52.2% bf16 MFU | 1623811 tok/s step 15955/19560 | loss 3.267351 (-0.85z)| norm 0.3423 (+1.57z)| lr 5.25e-05 | 322.78 ms | 52.3% bf16 MFU | 1623835 tok/s step 15956/19560 | loss 3.328835 (+0.45z)| norm 0.2786 (-0.43z)| lr 5.25e-05 | 323.20 ms | 52.2% bf16 MFU | 1623752 tok/s step 15957/19560 | loss 3.342355 (+0.74z)| norm 0.3409 (+1.67z)| lr 5.25e-05 | 322.63 ms | 52.3% bf16 MFU | 1623815 tok/s step 15958/19560 | loss 3.281383 (-0.60z)| norm 0.3165 (+0.84z)| lr 5.24e-05 | 323.34 ms | 52.2% bf16 MFU | 1623699 tok/s step 15959/19560 | loss 3.389221 (+1.74z)| norm 0.2844 (-0.24z)| lr 5.24e-05 | 323.23 ms | 52.2% bf16 MFU | 1623616 tok/s step 15960/19560 | loss 3.279784 (-0.66z)| norm 0.3173 (+0.89z)| lr 5.24e-05 | 322.83 ms | 52.3% bf16 MFU | 1623636 tok/s step 15961/19560 | loss 3.313101 (+0.09z)| norm 0.3346 (+1.46z)| lr 5.23e-05 | 323.19 ms | 52.2% bf16 MFU | 1623566 tok/s step 15962/19560 | loss 3.275019 (-0.75z)| norm 0.2556 (-1.21z)| lr 5.23e-05 | 322.74 ms | 52.3% bf16 MFU | 1623612 tok/s step 15963/19560 | loss 3.337929 (+0.66z)| norm 0.2674 (-0.81z)| lr 5.23e-05 | 323.10 ms | 52.2% bf16 MFU | 1623565 tok/s step 15964/19560 | loss 3.327703 (+0.42z)| norm 0.2956 (+0.19z)| lr 5.23e-05 | 322.96 ms | 52.3% bf16 MFU | 1623557 tok/s step 15965/19560 
| loss 3.302185 (-0.15z)| norm 0.2789 (-0.39z)| lr 5.22e-05 | 322.45 ms | 52.3% bf16 MFU | 1623677 tok/s step 15966/19560 | loss 3.306412 (-0.04z)| norm 0.2915 (+0.06z)| lr 5.22e-05 | 323.08 ms | 52.2% bf16 MFU | 1623632 tok/s step 15967/19560 | loss 3.308930 (+0.03z)| norm 0.2954 (+0.21z)| lr 5.22e-05 | 322.78 ms | 52.3% bf16 MFU | 1623664 tok/s step 15968/19560 | loss 3.211644 (-2.15z)| norm 0.2741 (-0.54z)| lr 5.21e-05 | 323.04 ms | 52.2% bf16 MFU | 1623631 tok/s step 15969/19560 | loss 3.289784 (-0.39z)| norm 0.2902 (+0.05z)| lr 5.21e-05 | 323.11 ms | 52.2% bf16 MFU | 1623580 tok/s step 15970/19560 | loss 3.252625 (-1.22z)| norm 0.2910 (+0.09z)| lr 5.21e-05 | 322.72 ms | 52.3% bf16 MFU | 1623630 tok/s step 15971/19560 | loss 3.334761 (+0.62z)| norm 0.2887 (+0.00z)| lr 5.21e-05 | 322.78 ms | 52.3% bf16 MFU | 1623662 tok/s step 15972/19560 | loss 3.296592 (-0.23z)| norm 0.3049 (+0.61z)| lr 5.20e-05 | 323.40 ms | 52.2% bf16 MFU | 1623538 tok/s step 15973/19560 | loss 3.306613 (-0.00z)| norm 0.2989 (+0.40z)| lr 5.20e-05 | 322.62 ms | 52.3% bf16 MFU | 1623616 tok/s step 15974/19560 | loss 3.279340 (-0.61z)| norm 0.2812 (-0.26z)| lr 5.20e-05 | 322.98 ms | 52.3% bf16 MFU | 1623599 tok/s step 15975/19560 | loss 3.316048 (+0.20z)| norm 0.2922 (+0.15z)| lr 5.19e-05 | 323.01 ms | 52.3% bf16 MFU | 1623576 tok/s step 15976/19560 | loss 3.374909 (+1.51z)| norm 0.3123 (+0.89z)| lr 5.19e-05 | 323.04 ms | 52.2% bf16 MFU | 1623548 tok/s step 15977/19560 | loss 3.333577 (+0.57z)| norm 0.3067 (+0.68z)| lr 5.19e-05 | 322.49 ms | 52.3% bf16 MFU | 1623657 tok/s step 15978/19560 | loss 3.331626 (+0.52z)| norm 0.2568 (-1.17z)| lr 5.19e-05 | 323.03 ms | 52.2% bf16 MFU | 1623627 tok/s step 15979/19560 | loss 3.324714 (+0.36z)| norm 0.3094 (+0.77z)| lr 5.18e-05 | 322.60 ms | 52.3% bf16 MFU | 1623706 tok/s step 15980/19560 | loss 3.284124 (-0.56z)| norm 0.2963 (+0.28z)| lr 5.18e-05 | 322.45 ms | 52.3% bf16 MFU | 1623817 tok/s step 15981/19560 | loss 3.369151 (+1.47z)| norm 0.3291 (+1.47z)| 
lr 5.18e-05 | 322.95 ms | 52.3% bf16 MFU | 1623798 tok/s step 15982/19560 | loss 3.272496 (-0.87z)| norm 0.2732 (-0.58z)| lr 5.18e-05 | 322.86 ms | 52.3% bf16 MFU | 1623804 tok/s step 15983/19560 | loss 3.315534 (+0.17z)| norm 0.3174 (+1.04z)| lr 5.17e-05 | 322.47 ms | 52.3% bf16 MFU | 1623907 tok/s step 15984/19560 | loss 3.333272 (+0.60z)| norm 0.3421 (+1.90z)| lr 5.17e-05 | 322.74 ms | 52.3% bf16 MFU | 1623936 tok/s step 15985/19560 | loss 3.269085 (-0.99z)| norm 0.2778 (-0.43z)| lr 5.17e-05 | 323.36 ms | 52.2% bf16 MFU | 1623808 tok/s step 15986/19560 | loss 3.362432 (+1.29z)| norm 0.3419 (+1.85z)| lr 5.16e-05 | 322.54 ms | 52.3% bf16 MFU | 1623892 tok/s step 15987/19560 | loss 3.295207 (-0.36z)| norm 0.2781 (-0.43z)| lr 5.16e-05 | 322.34 ms | 52.4% bf16 MFU | 1624022 tok/s step 15988/19560 | loss 3.378078 (+1.65z)| norm 0.3077 (+0.62z)| lr 5.16e-05 | 322.76 ms | 52.3% bf16 MFU | 1624040 tok/s step 15989/19560 | loss 3.361683 (+1.24z)| norm 0.2969 (+0.23z)| lr 5.16e-05 | 323.52 ms | 52.2% bf16 MFU | 1623866 tok/s step 15990/19560 | loss 3.295175 (-0.39z)| norm 0.2651 (-0.91z)| lr 5.15e-05 | 322.40 ms | 52.3% bf16 MFU | 1623982 tok/s step 15991/19560 | loss 3.308168 (-0.07z)| norm 0.2608 (-1.06z)| lr 5.15e-05 | 322.62 ms | 52.3% bf16 MFU | 1624038 tok/s step 15992/19560 | loss 3.289668 (-0.52z)| norm 0.2924 (+0.08z)| lr 5.15e-05 | 322.17 ms | 52.4% bf16 MFU | 1624205 tok/s step 15993/19560 | loss 3.346703 (+0.88z)| norm 0.2739 (-0.58z)| lr 5.14e-05 | 322.52 ms | 52.3% bf16 MFU | 1624275 tok/s step 15994/19560 | loss 3.348096 (+0.91z)| norm 0.2705 (-0.70z)| lr 5.14e-05 | 322.62 ms | 52.3% bf16 MFU | 1624316 tok/s step 15995/19560 | loss 3.302676 (-0.21z)| norm 0.2554 (-1.23z)| lr 5.14e-05 | 322.67 ms | 52.3% bf16 MFU | 1624343 tok/s step 15996/19560 | loss 3.286743 (-0.60z)| norm 0.2873 (-0.10z)| lr 5.14e-05 | 322.43 ms | 52.3% bf16 MFU | 1624428 tok/s step 15997/19560 | loss 3.308393 (-0.07z)| norm 0.2833 (-0.24z)| lr 5.13e-05 | 322.76 ms | 52.3% bf16 MFU | 
1624425 tok/s step 15998/19560 | loss 3.356949 (+1.11z)| norm 0.2860 (-0.16z)| lr 5.13e-05 | 322.44 ms | 52.3% bf16 MFU | 1624503 tok/s step 15999/19560 | loss 3.315861 (+0.10z)| norm 0.2945 (+0.14z)| lr 5.13e-05 | 322.69 ms | 52.3% bf16 MFU | 1624515 tok/s step 16000/19560 | loss 3.338696 (+0.65z)| norm 0.2876 (-0.11z)| lr 5.12e-05 | 322.53 ms | 52.3% bf16 MFU | 1624568 tok/s val loss 3.290730 HellaSwag: 3030/10042 = 0.301733 step 16001/19560 | loss 3.344574 (+0.78z)| norm 0.3042 (+0.48z)| lr 5.12e-05 | 322.58 ms | 52.3% bf16 MFU | 1624603 tok/s step 16002/19560 | loss 3.352335 (+0.96z)| norm 0.2575 (-1.20z)| lr 5.12e-05 | 323.16 ms | 52.2% bf16 MFU | 1624491 tok/s step 16003/19560 | loss 3.291307 (-0.54z)| norm 0.3963 (+3.59z)| lr 5.12e-05 | 323.58 ms | 52.2% bf16 MFU | 1624280 tok/s step 16004/19560 | loss 3.246445 (-1.64z)| norm 0.2459 (-1.55z)| lr 5.11e-05 | 323.04 ms | 52.2% bf16 MFU | 1624215 tok/s step 16005/19560 | loss 3.306392 (-0.15z)| norm 0.2581 (-1.13z)| lr 5.11e-05 | 323.19 ms | 52.2% bf16 MFU | 1624116 tok/s step 16006/19560 | loss 3.260842 (-1.27z)| norm 0.3310 (+1.32z)| lr 5.11e-05 | 322.78 ms | 52.3% bf16 MFU | 1624125 tok/s step 16007/19560 | loss 3.293781 (-0.46z)| norm 0.2846 (-0.24z)| lr 5.11e-05 | 322.55 ms | 52.3% bf16 MFU | 1624190 tok/s step 16008/19560 | loss 3.374294 (+1.53z)| norm 0.3138 (+0.74z)| lr 5.10e-05 | 323.07 ms | 52.2% bf16 MFU | 1624122 tok/s step 16009/19560 | loss 3.377452 (+1.58z)| norm 0.3228 (+1.04z)| lr 5.10e-05 | 323.08 ms | 52.2% bf16 MFU | 1624055 tok/s step 16010/19560 | loss 3.296765 (-0.39z)| norm 0.2769 (-0.51z)| lr 5.10e-05 | 322.67 ms | 52.3% bf16 MFU | 1624095 tok/s step 16011/19560 | loss 3.377637 (+1.67z)| norm 0.2753 (-0.57z)| lr 5.09e-05 | 322.99 ms | 52.3% bf16 MFU | 1624053 tok/s 
step 16012/19560 | loss 3.342715 (+0.77z)| norm 0.2786 (-0.46z)| lr 5.09e-05 | 322.87 ms | 52.3% bf16 MFU | 1624042 tok/s step 16013/19560 | loss 3.303759 (-0.21z)| norm 0.2718 (-0.69z)| lr 5.09e-05 | 323.70 ms | 52.1% bf16 MFU | 1623824 tok/s step 16014/19560 | loss 3.374368 (+1.56z)| norm 0.3265 (+1.15z)| lr 5.09e-05 | 322.85 ms | 52.3% bf16 MFU | 1623828 tok/s step 16015/19560 | loss 3.332073 (+0.48z)| norm 0.3220 (+0.98z)| lr 5.08e-05 | 322.61 ms | 52.3% bf16 MFU | 1623894 tok/s step 16016/19560 | loss 3.292134 (-0.52z)| norm 0.2781 (-0.49z)| lr 5.08e-05 | 323.07 ms | 52.2% bf16 MFU | 1623842 tok/s step 16017/19560 | loss 3.355213 (+1.08z)| norm 0.3439 (+1.69z)| lr 5.08e-05 | 322.42 ms | 52.3% bf16 MFU | 1623956 tok/s step 16018/19560 | loss 3.358333 (+1.14z)| norm 0.3030 (+0.33z)| lr 5.07e-05 | 322.84 ms | 52.3% bf16 MFU | 1623956 tok/s step 16019/19560 | loss 3.310831 (-0.08z)| norm 0.2954 (+0.07z)| lr 5.07e-05 | 323.01 ms | 52.3% bf16 MFU | 1623916 tok/s step 16020/19560 | loss 3.309415 (-0.12z)| norm 0.3126 (+0.65z)| lr 5.07e-05 | 322.57 ms | 52.3% bf16 MFU | 1623987 tok/s step 16021/19560 | loss 3.296208 (-0.45z)| norm 0.2647 (-0.95z)| lr 5.07e-05 | 322.95 ms | 52.3% bf16 MFU | 1623959 tok/s step 16022/19560 | loss 3.371191 (+1.45z)| norm 0.3177 (+0.83z)| lr 5.06e-05 | 322.89 ms | 52.3% bf16 MFU | 1623948 tok/s step 16023/19560 | loss 3.308937 (-0.13z)| norm 0.2528 (-1.32z)| lr 5.06e-05 | 322.85 ms | 52.3% bf16 MFU | 1623948 tok/s step 16024/19560 | loss 3.345737 (+0.81z)| norm 0.2904 (-0.08z)| lr 5.06e-05 | 323.09 ms | 52.2% bf16 MFU | 1623888 tok/s step 16025/19560 | loss 3.355719 (+1.04z)| norm 0.3329 (+1.41z)| lr 5.06e-05 | 322.16 ms | 52.4% bf16 MFU | 1624064 tok/s step 16026/19560 | loss 3.347530 (+0.83z)| norm 0.2649 (-0.95z)| lr 5.05e-05 | 322.84 ms | 52.3% bf16 MFU | 1624060 tok/s step 16027/19560 | loss 3.271646 (-1.12z)| norm 0.2872 (-0.18z)| lr 5.05e-05 | 322.41 ms | 52.3% bf16 MFU | 1624166 tok/s step 16028/19560 | loss 3.284704 (-0.77z)| norm 
0.3387 (+1.59z)| lr 5.05e-05 | 322.58 ms | 52.3% bf16 MFU | 1624223 tok/s step 16029/19560 | loss 3.379614 (+1.69z)| norm 0.2747 (-0.62z)| lr 5.04e-05 | 322.90 ms | 52.3% bf16 MFU | 1624195 tok/s step 16030/19560 | loss 3.293390 (-0.54z)| norm 0.3272 (+1.18z)| lr 5.04e-05 | 322.52 ms | 52.3% bf16 MFU | 1624266 tok/s step 16031/19560 | loss 3.217081 (-2.43z)| norm 0.3031 (+0.35z)| lr 5.04e-05 | 322.74 ms | 52.3% bf16 MFU | 1624278 tok/s step 16032/19560 | loss 3.316861 (+0.08z)| norm 0.2598 (-1.15z)| lr 5.04e-05 | 322.95 ms | 52.3% bf16 MFU | 1624237 tok/s step 16033/19560 | loss 3.235640 (-1.93z)| norm 0.3222 (+1.00z)| lr 5.03e-05 | 322.68 ms | 52.3% bf16 MFU | 1624264 tok/s step 16034/19560 | loss 3.324781 (+0.29z)| norm 0.2792 (-0.48z)| lr 5.03e-05 | 322.83 ms | 52.3% bf16 MFU | 1624252 tok/s step 16035/19560 | loss 3.288124 (-0.62z)| norm 0.2844 (-0.31z)| lr 5.03e-05 | 322.55 ms | 52.3% bf16 MFU | 1624312 tok/s step 16036/19560 | loss 3.307736 (-0.12z)| norm 0.2772 (-0.55z)| lr 5.02e-05 | 322.93 ms | 52.3% bf16 MFU | 1624273 tok/s step 16037/19560 | loss 3.320660 (+0.19z)| norm 0.2728 (-0.69z)| lr 5.02e-05 | 322.64 ms | 52.3% bf16 MFU | 1624310 tok/s step 16038/19560 | loss 3.442259 (+3.14z)| norm 0.2849 (-0.28z)| lr 5.02e-05 | 322.86 ms | 52.3% bf16 MFU | 1624289 tok/s step 16039/19560 | loss 3.306398 (-0.20z)| norm 0.2668 (-0.89z)| lr 5.02e-05 | 322.95 ms | 52.3% bf16 MFU | 1624246 tok/s step 16040/19560 | loss 3.343848 (+0.72z)| norm 0.2660 (-0.92z)| lr 5.01e-05 | 322.24 ms | 52.4% bf16 MFU | 1624383 tok/s step 16041/19560 | loss 3.286994 (-0.68z)| norm 0.2803 (-0.43z)| lr 5.01e-05 | 322.65 ms | 52.3% bf16 MFU | 1624412 tok/s step 16042/19560 | loss 3.335349 (+0.52z)| norm 0.3042 (+0.39z)| lr 5.01e-05 | 322.71 ms | 52.3% bf16 MFU | 1624425 tok/s step 16043/19560 | loss 3.315477 (+0.04z)| norm 0.2989 (+0.20z)| lr 5.01e-05 | 322.77 ms | 52.3% bf16 MFU | 1624420 tok/s step 16044/19560 | loss 3.280830 (-0.82z)| norm 0.3033 (+0.35z)| lr 5.00e-05 | 322.80 ms | 
52.3% bf16 MFU | 1624409 tok/s step 16045/19560 | loss 3.332346 (+0.46z)| norm 0.3010 (+0.27z)| lr 5.00e-05 | 323.00 ms | 52.3% bf16 MFU | 1624348 tok/s step 16046/19560 | loss 3.273233 (-1.00z)| norm 0.3242 (+1.05z)| lr 5.00e-05 | 322.76 ms | 52.3% bf16 MFU | 1624350 tok/s step 16047/19560 | loss 3.280535 (-0.81z)| norm 0.2883 (-0.18z)| lr 4.99e-05 | 322.52 ms | 52.3% bf16 MFU | 1624413 tok/s step 16048/19560 | loss 3.259958 (-1.33z)| norm 0.2766 (-0.59z)| lr 4.99e-05 | 323.00 ms | 52.3% bf16 MFU | 1624351 tok/s step 16049/19560 | loss 3.322164 (+0.22z)| norm 0.3071 (+0.45z)| lr 4.99e-05 | 322.47 ms | 52.3% bf16 MFU | 1624427 tok/s step 16050/19560 | loss 3.318680 (+0.12z)| norm 0.3127 (+0.64z)| lr 4.99e-05 | 322.17 ms | 52.4% bf16 MFU | 1624574 tok/s step 16051/19560 | loss 3.266283 (-1.17z)| norm 0.2618 (-1.12z)| lr 4.98e-05 | 322.70 ms | 52.3% bf16 MFU | 1624580 tok/s step 16052/19560 | loss 3.310936 (-0.06z)| norm 0.2747 (-0.67z)| lr 4.98e-05 | 322.57 ms | 52.3% bf16 MFU | 1624619 tok/s step 16053/19560 | loss 3.309165 (-0.10z)| norm 0.3197 (+0.86z)| lr 4.98e-05 | 322.22 ms | 52.4% bf16 MFU | 1624743 tok/s step 16054/19560 | loss 3.365729 (+1.28z)| norm 0.2970 (+0.08z)| lr 4.97e-05 | 322.78 ms | 52.3% bf16 MFU | 1624719 tok/s step 16055/19560 | loss 3.303207 (-0.24z)| norm 0.2994 (+0.15z)| lr 4.97e-05 | 322.60 ms | 52.3% bf16 MFU | 1624743 tok/s step 16056/19560 | loss 3.310751 (-0.06z)| norm 0.2976 (+0.08z)| lr 4.97e-05 | 322.92 ms | 52.3% bf16 MFU | 1624684 tok/s step 16057/19560 | loss 3.332296 (+0.50z)| norm 0.3160 (+0.71z)| lr 4.97e-05 | 322.49 ms | 52.3% bf16 MFU | 1624736 tok/s step 16058/19560 | loss 3.335915 (+0.60z)| norm 0.3916 (+3.18z)| lr 4.96e-05 | 322.72 ms | 52.3% bf16 MFU | 1624730 tok/s step 16059/19560 | loss 3.307213 (-0.17z)| norm 0.2680 (-0.95z)| lr 4.96e-05 | 322.95 ms | 52.3% bf16 MFU | 1624666 tok/s step 16060/19560 | loss 3.353127 (+1.04z)| norm 0.3870 (+2.91z)| lr 4.96e-05 | 322.46 ms | 52.3% bf16 MFU | 1624729 tok/s step 16061/19560 
| loss 3.310510 (-0.10z)| norm 0.3557 (+1.85z)| lr 4.96e-05 | 322.49 ms | 52.3% bf16 MFU | 1624780 tok/s step 16062/19560 | loss 3.265234 (-1.29z)| norm 0.2666 (-1.00z)| lr 4.95e-05 | 322.70 ms | 52.3% bf16 MFU | 1624777 tok/s step 16063/19560 | loss 3.335639 (+0.57z)| norm 0.3248 (+0.85z)| lr 4.95e-05 | 322.67 ms | 52.3% bf16 MFU | 1624780 tok/s step 16064/19560 | loss 3.355290 (+1.08z)| norm 0.3081 (+0.31z)| lr 4.95e-05 | 322.59 ms | 52.3% bf16 MFU | 1624802 tok/s step 16065/19560 | loss 3.293238 (-0.57z)| norm 0.2953 (-0.10z)| lr 4.94e-05 | 322.60 ms | 52.3% bf16 MFU | 1624822 tok/s step 16066/19560 | loss 3.286665 (-0.75z)| norm 0.2783 (-0.64z)| lr 4.94e-05 | 323.09 ms | 52.2% bf16 MFU | 1624719 tok/s step 16067/19560 | loss 3.307889 (-0.19z)| norm 0.3379 (+1.25z)| lr 4.94e-05 | 323.04 ms | 52.2% bf16 MFU | 1624632 tok/s step 16068/19560 | loss 3.353800 (+1.02z)| norm 0.2745 (-0.76z)| lr 4.94e-05 | 322.50 ms | 52.3% bf16 MFU | 1624686 tok/s step 16069/19560 | loss 3.298265 (-0.45z)| norm 0.2437 (-1.71z)| lr 4.93e-05 | 322.41 ms | 52.3% bf16 MFU | 1624758 tok/s step 16070/19560 | loss 3.258660 (-1.48z)| norm 0.3878 (+2.73z)| lr 4.93e-05 | 322.37 ms | 52.4% bf16 MFU | 1624838 tok/s step 16071/19560 | loss 3.361066 (+1.23z)| norm 0.2931 (-0.16z)| lr 4.93e-05 | 323.22 ms | 52.2% bf16 MFU | 1624701 tok/s step 16072/19560 | loss 3.310133 (-0.12z)| norm 0.3038 (+0.19z)| lr 4.93e-05 | 322.78 ms | 52.3% bf16 MFU | 1624682 tok/s step 16073/19560 | loss 3.347182 (+0.85z)| norm 0.3778 (+2.53z)| lr 4.92e-05 | 323.20 ms | 52.2% bf16 MFU | 1624557 tok/s step 16074/19560 | loss 3.348867 (+0.88z)| norm 0.2823 (-0.50z)| lr 4.92e-05 | 322.82 ms | 52.3% bf16 MFU | 1624534 tok/s step 16075/19560 | loss 3.321297 (+0.14z)| norm 0.3268 (+0.95z)| lr 4.92e-05 | 323.00 ms | 52.3% bf16 MFU | 1624466 tok/s step 16076/19560 | loss 3.340135 (+0.63z)| norm 0.2976 (-0.01z)| lr 4.91e-05 | 322.38 ms | 52.4% bf16 MFU | 1624557 tok/s step 16077/19560 | loss 3.198574 (-3.02z)| norm 0.2578 (-1.31z)| 
lr 4.91e-05 | 322.87 ms | 52.3% bf16 MFU | 1624522 tok/s step 16078/19560 | loss 3.346881 (+0.79z)| norm 0.2869 (-0.35z)| lr 4.91e-05 | 322.60 ms | 52.3% bf16 MFU | 1624554 tok/s step 16079/19560 | loss 3.265962 (-1.28z)| norm 0.2625 (-1.15z)| lr 4.91e-05 | 322.67 ms | 52.3% bf16 MFU | 1624568 tok/s step 16080/19560 | loss 3.312431 (-0.09z)| norm 0.2647 (-1.07z)| lr 4.90e-05 | 323.35 ms | 52.2% bf16 MFU | 1624410 tok/s step 16081/19560 | loss 3.225749 (-2.27z)| norm 0.2979 (+0.03z)| lr 4.90e-05 | 322.68 ms | 52.3% bf16 MFU | 1624428 tok/s step 16082/19560 | loss 3.315031 (+0.01z)| norm 0.2856 (-0.37z)| lr 4.90e-05 | 322.96 ms | 52.3% bf16 MFU | 1624376 tok/s step 16083/19560 | loss 3.361091 (+1.17z)| norm 0.3415 (+1.47z)| lr 4.90e-05 | 322.79 ms | 52.3% bf16 MFU | 1624368 tok/s step 16084/19560 | loss 3.302012 (-0.34z)| norm 0.2782 (-0.62z)| lr 4.89e-05 | 323.14 ms | 52.2% bf16 MFU | 1624273 tok/s step 16085/19560 | loss 3.287635 (-0.69z)| norm 0.2579 (-1.27z)| lr 4.89e-05 | 322.51 ms | 52.3% bf16 MFU | 1624341 tok/s step 16086/19560 | loss 3.303934 (-0.28z)| norm 0.2688 (-0.90z)| lr 4.89e-05 | 322.65 ms | 52.3% bf16 MFU | 1624372 tok/s step 16087/19560 | loss 3.276783 (-0.97z)| norm 0.2860 (-0.33z)| lr 4.88e-05 | 323.68 ms | 52.1% bf16 MFU | 1624142 tok/s step 16088/19560 | loss 3.238891 (-1.92z)| norm 0.2839 (-0.39z)| lr 4.88e-05 | 322.68 ms | 52.3% bf16 MFU | 1624175 tok/s step 16089/19560 | loss 3.344459 (+0.78z)| norm 0.2793 (-0.53z)| lr 4.88e-05 | 323.06 ms | 52.2% bf16 MFU | 1624111 tok/s step 16090/19560 | loss 3.432699 (+2.92z)| norm 0.2833 (-0.41z)| lr 4.88e-05 | 322.97 ms | 52.3% bf16 MFU | 1624072 tok/s step 16091/19560 | loss 3.263029 (-1.27z)| norm 0.2652 (-1.01z)| lr 4.87e-05 | 322.37 ms | 52.4% bf16 MFU | 1624186 tok/s step 16092/19560 | loss 3.303225 (-0.28z)| norm 0.2644 (-1.03z)| lr 4.87e-05 | 323.35 ms | 52.2% bf16 MFU | 1624049 tok/s step 16093/19560 | loss 3.251394 (-1.53z)| norm 0.2594 (-1.18z)| lr 4.87e-05 | 323.26 ms | 52.2% bf16 MFU | 
1623941 tok/s step 16094/19560 | loss 3.331973 (+0.44z)| norm 0.2646 (-1.00z)| lr 4.87e-05 | 322.70 ms | 52.3% bf16 MFU | 1623978 tok/s step 16095/19560 | loss 3.223415 (-2.16z)| norm 0.4145 (+3.71z)| lr 4.86e-05 | 323.04 ms | 52.2% bf16 MFU | 1623929 tok/s step 16096/19560 | loss 3.245883 (-1.65z)| norm 0.2777 (-0.56z)| lr 4.86e-05 | 322.84 ms | 52.3% bf16 MFU | 1623931 tok/s step 16097/19560 | loss 3.346526 (+0.78z)| norm 0.2876 (-0.26z)| lr 4.86e-05 | 322.53 ms | 52.3% bf16 MFU | 1624011 tok/s step 16098/19560 | loss 3.349602 (+0.85z)| norm 0.2918 (-0.12z)| lr 4.85e-05 | 322.91 ms | 52.3% bf16 MFU | 1623993 tok/s step 16099/19560 | loss 3.354296 (+0.95z)| norm 0.2821 (-0.42z)| lr 4.85e-05 | 322.75 ms | 52.3% bf16 MFU | 1624016 tok/s step 16100/19560 | loss 3.285868 (-0.71z)| norm 0.2769 (-0.58z)| lr 4.85e-05 | 323.16 ms | 52.2% bf16 MFU | 1623935 tok/s step 16101/19560 | loss 3.235953 (-1.89z)| norm 0.2833 (-0.38z)| lr 4.85e-05 | 323.14 ms | 52.2% bf16 MFU | 1623863 tok/s step 16102/19560 | loss 3.352267 (+0.89z)| norm 0.2870 (-0.26z)| lr 4.84e-05 | 323.06 ms | 52.2% bf16 MFU | 1623815 tok/s step 16103/19560 | loss 3.317809 (+0.07z)| norm 0.2878 (-0.24z)| lr 4.84e-05 | 322.61 ms | 52.3% bf16 MFU | 1623881 tok/s step 16104/19560 | loss 3.329689 (+0.36z)| norm 0.2937 (-0.05z)| lr 4.84e-05 | 322.80 ms | 52.3% bf16 MFU | 1623897 tok/s step 16105/19560 | loss 3.311657 (-0.07z)| norm 0.2794 (-0.49z)| lr 4.84e-05 | 322.93 ms | 52.3% bf16 MFU | 1623879 tok/s step 16106/19560 | loss 3.240816 (-1.75z)| norm 0.2864 (-0.28z)| lr 4.83e-05 | 322.65 ms | 52.3% bf16 MFU | 1623932 tok/s step 16107/19560 | loss 3.326807 (+0.31z)| norm 0.2860 (-0.28z)| lr 4.83e-05 | 322.79 ms | 52.3% bf16 MFU | 1623948 tok/s step 16108/19560 | loss 3.350672 (+0.87z)| norm 0.3158 (+0.64z)| lr 4.83e-05 | 323.41 ms | 52.2% bf16 MFU | 1623808 tok/s step 16109/19560 | loss 3.337321 (+0.56z)| norm 0.2750 (-0.62z)| lr 4.82e-05 | 322.84 ms | 52.3% bf16 MFU | 1623818 tok/s step 16110/19560 | loss 3.375608 
(+1.45z)| norm 0.2699 (-0.78z)| lr 4.82e-05 | 323.00 ms | 52.3% bf16 MFU | 1623787 tok/s step 16111/19560 | loss 3.251741 (-1.49z)| norm 0.2999 (+0.17z)| lr 4.82e-05 | 322.68 ms | 52.3% bf16 MFU | 1623837 tok/s step 16112/19560 | loss 3.332036 (+0.42z)| norm 0.2855 (-0.28z)| lr 4.82e-05 | 322.90 ms | 52.3% bf16 MFU | 1623830 tok/s step 16113/19560 | loss 3.268591 (-1.09z)| norm 0.2585 (-1.12z)| lr 4.81e-05 | 322.56 ms | 52.3% bf16 MFU | 1623908 tok/s step 16114/19560 | loss 3.327411 (+0.32z)| norm 0.2618 (-1.01z)| lr 4.81e-05 | 322.55 ms | 52.3% bf16 MFU | 1623985 tok/s step 16115/19560 | loss 3.376813 (+1.47z)| norm 0.3216 (+0.88z)| lr 4.81e-05 | 322.83 ms | 52.3% bf16 MFU | 1623988 tok/s step 16116/19560 | loss 3.298031 (-0.38z)| norm 0.2852 (-0.26z)| lr 4.81e-05 | 323.43 ms | 52.2% bf16 MFU | 1623840 tok/s step 16117/19560 | loss 3.269676 (-1.04z)| norm 0.3157 (+0.70z)| lr 4.80e-05 | 323.05 ms | 52.2% bf16 MFU | 1623796 tok/s step 16118/19560 | loss 3.293563 (-0.47z)| norm 0.3479 (+1.68z)| lr 4.80e-05 | 322.95 ms | 52.3% bf16 MFU | 1623777 tok/s step 16119/19560 | loss 3.340594 (+0.64z)| norm 0.2776 (-0.53z)| lr 4.80e-05 | 322.95 ms | 52.3% bf16 MFU | 1623759 tok/s step 16120/19560 | loss 3.275745 (-0.90z)| norm 0.3161 (+0.67z)| lr 4.79e-05 | 323.08 ms | 52.2% bf16 MFU | 1623711 tok/s step 16121/19560 | loss 3.334311 (+0.50z)| norm 0.3140 (+0.60z)| lr 4.79e-05 | 323.55 ms | 52.2% bf16 MFU | 1623546 tok/s step 16122/19560 | loss 3.301463 (-0.28z)| norm 0.2661 (-0.91z)| lr 4.79e-05 | 322.76 ms | 52.3% bf16 MFU | 1623588 tok/s step 16123/19560 | loss 3.323359 (+0.24z)| norm 0.3069 (+0.36z)| lr 4.79e-05 | 322.33 ms | 52.4% bf16 MFU | 1623736 tok/s step 16124/19560 | loss 3.383227 (+1.64z)| norm 0.2701 (-0.79z)| lr 4.78e-05 | 323.79 ms | 52.1% bf16 MFU | 1623511 tok/s step 16125/19560 | loss 3.282588 (-0.74z)| norm 0.2585 (-1.15z)| lr 4.78e-05 | 322.48 ms | 52.3% bf16 MFU | 1623626 tok/s step 16126/19560 | loss 3.328521 (+0.35z)| norm 0.2720 (-0.72z)| lr 4.78e-05 | 
323.06 ms | 52.2% bf16 MFU | 1623589 tok/s step 16127/19560 | loss 3.308058 (-0.13z)| norm 0.2612 (-1.05z)| lr 4.78e-05 | 323.15 ms | 52.2% bf16 MFU | 1623530 tok/s step 16128/19560 | loss 3.276701 (-0.86z)| norm 0.2922 (-0.08z)| lr 4.77e-05 | 322.46 ms | 52.3% bf16 MFU | 1623649 tok/s step 16129/19560 | loss 3.378535 (+1.53z)| norm 0.2583 (-1.12z)| lr 4.77e-05 | 323.39 ms | 52.2% bf16 MFU | 1623528 tok/s step 16130/19560 | loss 3.311102 (-0.04z)| norm 0.2723 (-0.69z)| lr 4.77e-05 | 322.90 ms | 52.3% bf16 MFU | 1623537 tok/s step 16131/19560 | loss 3.326120 (+0.30z)| norm 0.2777 (-0.51z)| lr 4.76e-05 | 322.77 ms | 52.3% bf16 MFU | 1623576 tok/s step 16132/19560 | loss 3.317485 (+0.09z)| norm 0.2693 (-0.80z)| lr 4.76e-05 | 322.31 ms | 52.4% bf16 MFU | 1623729 tok/s step 16133/19560 | loss 3.311620 (-0.05z)| norm 0.2452 (-1.58z)| lr 4.76e-05 | 323.12 ms | 52.2% bf16 MFU | 1623673 tok/s step 16134/19560 | loss 3.302126 (-0.29z)| norm 0.2799 (-0.44z)| lr 4.76e-05 | 323.60 ms | 52.2% bf16 MFU | 1623498 tok/s step 16135/19560 | loss 3.317880 (+0.08z)| norm 0.3348 (+1.34z)| lr 4.75e-05 | 322.05 ms | 52.4% bf16 MFU | 1623721 tok/s step 16136/19560 | loss 3.286719 (-0.65z)| norm 0.2819 (-0.37z)| lr 4.75e-05 | 323.35 ms | 52.2% bf16 MFU | 1623607 tok/s step 16137/19560 | loss 3.279882 (-0.80z)| norm 0.3104 (+0.56z)| lr 4.75e-05 | 323.22 ms | 52.2% bf16 MFU | 1623532 tok/s step 16138/19560 | loss 3.253936 (-1.42z)| norm 0.3381 (+1.44z)| lr 4.75e-05 | 323.02 ms | 52.2% bf16 MFU | 1623508 tok/s step 16139/19560 | loss 3.296311 (-0.38z)| norm 0.3111 (+0.55z)| lr 4.74e-05 | 323.03 ms | 52.2% bf16 MFU | 1623485 tok/s step 16140/19560 | loss 3.347900 (+0.87z)| norm 0.3012 (+0.23z)| lr 4.74e-05 | 322.79 ms | 52.3% bf16 MFU | 1623522 tok/s step 16141/19560 | loss 3.324699 (+0.30z)| norm 0.3055 (+0.36z)| lr 4.74e-05 | 323.04 ms | 52.2% bf16 MFU | 1623495 tok/s step 16142/19560 | loss 3.271284 (-0.98z)| norm 0.2802 (-0.45z)| lr 4.74e-05 | 322.63 ms | 52.3% bf16 MFU | 1623572 tok/s step 
16143/19560 | loss 3.284717 (-0.64z)| norm 0.2636 (-0.98z)| lr 4.73e-05 | 323.44 ms | 52.2% bf16 MFU | 1623443 tok/s step 16144/19560 | loss 3.266397 (-1.08z)| norm 0.2736 (-0.65z)| lr 4.73e-05 | 322.74 ms | 52.3% bf16 MFU | 1623494 tok/s step 16145/19560 | loss 3.250939 (-1.43z)| norm 0.2775 (-0.51z)| lr 4.73e-05 | 322.74 ms | 52.3% bf16 MFU | 1623543 tok/s step 16146/19560 | loss 3.315921 (+0.15z)| norm 0.2741 (-0.62z)| lr 4.72e-05 | 322.86 ms | 52.3% bf16 MFU | 1623561 tok/s step 16147/19560 | loss 3.391201 (+1.94z)| norm 0.2871 (-0.19z)| lr 4.72e-05 | 322.09 ms | 52.4% bf16 MFU | 1623772 tok/s step 16148/19560 | loss 3.302616 (-0.18z)| norm 0.3029 (+0.33z)| lr 4.72e-05 | 322.93 ms | 52.3% bf16 MFU | 1623760 tok/s step 16149/19560 | loss 3.332452 (+0.53z)| norm 0.2624 (-1.00z)| lr 4.72e-05 | 323.32 ms | 52.2% bf16 MFU | 1623651 tok/s step 16150/19560 | loss 3.375373 (+1.55z)| norm 0.3486 (+1.81z)| lr 4.71e-05 | 323.10 ms | 52.2% bf16 MFU | 1623601 tok/s step 16151/19560 | loss 3.320153 (+0.23z)| norm 0.2879 (-0.18z)| lr 4.71e-05 | 323.00 ms | 52.3% bf16 MFU | 1623580 tok/s step 16152/19560 | loss 3.265921 (-1.05z)| norm 0.3293 (+1.16z)| lr 4.71e-05 | 322.65 ms | 52.3% bf16 MFU | 1623649 tok/s step 16153/19560 | loss 3.322642 (+0.31z)| norm 0.2809 (-0.40z)| lr 4.71e-05 | 323.33 ms | 52.2% bf16 MFU | 1623543 tok/s step 16154/19560 | loss 3.304676 (-0.11z)| norm 0.2814 (-0.39z)| lr 4.70e-05 | 323.35 ms | 52.2% bf16 MFU | 1623436 tok/s step 16155/19560 | loss 3.279438 (-0.72z)| norm 0.2678 (-0.83z)| lr 4.70e-05 | 322.91 ms | 52.3% bf16 MFU | 1623445 tok/s step 16156/19560 | loss 3.286697 (-0.55z)| norm 0.2710 (-0.71z)| lr 4.70e-05 | 323.22 ms | 52.2% bf16 MFU | 1623376 tok/s step 16157/19560 | loss 3.309126 (+0.00z)| norm 0.2780 (-0.48z)| lr 4.69e-05 | 322.91 ms | 52.3% bf16 MFU | 1623390 tok/s step 16158/19560 | loss 3.280148 (-0.70z)| norm 0.3190 (+0.87z)| lr 4.69e-05 | 322.81 ms | 52.3% bf16 MFU | 1623428 tok/s step 16159/19560 | loss 3.245413 (-1.57z)| norm 
0.2921 (-0.01z)| lr 4.69e-05 | 323.16 ms | 52.2% bf16 MFU | 1623377 tok/s step 16160/19560 | loss 3.273786 (-0.86z)| norm 0.2746 (-0.60z)| lr 4.69e-05 | 322.76 ms | 52.3% bf16 MFU | 1623426 tok/s step 16161/19560 | loss 3.295779 (-0.33z)| norm 0.3765 (+2.70z)| lr 4.68e-05 | 323.05 ms | 52.2% bf16 MFU | 1623400 tok/s step 16162/19560 | loss 3.434667 (+2.99z)| norm 0.3347 (+1.33z)| lr 4.68e-05 | 322.67 ms | 52.3% bf16 MFU | 1623472 tok/s step 16163/19560 | loss 3.309632 (-0.01z)| norm 0.3079 (+0.46z)| lr 4.68e-05 | 322.53 ms | 52.3% bf16 MFU | 1623576 tok/s step 16164/19560 | loss 3.375283 (+1.54z)| norm 0.3732 (+2.47z)| lr 4.68e-05 | 322.85 ms | 52.3% bf16 MFU | 1623595 tok/s step 16165/19560 | loss 3.277002 (-0.79z)| norm 0.3073 (+0.40z)| lr 4.67e-05 | 323.08 ms | 52.2% bf16 MFU | 1623555 tok/s step 16166/19560 | loss 3.335339 (+0.64z)| norm 0.2605 (-1.06z)| lr 4.67e-05 | 322.69 ms | 52.3% bf16 MFU | 1623614 tok/s step 16167/19560 | loss 3.258849 (-1.24z)| norm 0.2762 (-0.57z)| lr 4.67e-05 | 322.88 ms | 52.3% bf16 MFU | 1623623 tok/s step 16168/19560 | loss 3.292316 (-0.41z)| norm 0.3426 (+1.48z)| lr 4.67e-05 | 322.77 ms | 52.3% bf16 MFU | 1623658 tok/s step 16169/19560 | loss 3.329761 (+0.51z)| norm 0.2784 (-0.52z)| lr 4.66e-05 | 323.03 ms | 52.2% bf16 MFU | 1623626 tok/s step 16170/19560 | loss 3.398138 (+2.14z)| norm 0.3193 (+0.75z)| lr 4.66e-05 | 322.61 ms | 52.3% bf16 MFU | 1623702 tok/s step 16171/19560 | loss 3.320631 (+0.27z)| norm 0.3104 (+0.47z)| lr 4.66e-05 | 322.26 ms | 52.4% bf16 MFU | 1623861 tok/s step 16172/19560 | loss 3.250484 (-1.42z)| norm 0.2820 (-0.41z)| lr 4.65e-05 | 322.88 ms | 52.3% bf16 MFU | 1623857 tok/s step 16173/19560 | loss 3.290707 (-0.44z)| norm 0.3039 (+0.27z)| lr 4.65e-05 | 322.82 ms | 52.3% bf16 MFU | 1623869 tok/s step 16174/19560 | loss 3.267334 (-1.00z)| norm 0.2712 (-0.73z)| lr 4.65e-05 | 322.81 ms | 52.3% bf16 MFU | 1623883 tok/s step 16175/19560 | loss 3.290419 (-0.45z)| norm 0.3129 (+0.56z)| lr 4.65e-05 | 322.69 ms | 
52.3% bf16 MFU | 1623926 tok/s step 16176/19560 | loss 3.336328 (+0.64z)| norm 0.2866 (-0.26z)| lr 4.64e-05 | 322.47 ms | 52.3% bf16 MFU | 1624023 tok/s step 16177/19560 | loss 3.245009 (-1.53z)| norm 0.2783 (-0.51z)| lr 4.64e-05 | 323.02 ms | 52.2% bf16 MFU | 1623977 tok/s step 16178/19560 | loss 3.304717 (-0.10z)| norm 0.2648 (-0.92z)| lr 4.64e-05 | 323.02 ms | 52.2% bf16 MFU | 1623932 tok/s step 16179/19560 | loss 3.291798 (-0.42z)| norm 0.3059 (+0.35z)| lr 4.64e-05 | 322.80 ms | 52.3% bf16 MFU | 1623945 tok/s step 16180/19560 | loss 3.327510 (+0.44z)| norm 0.2970 (+0.06z)| lr 4.63e-05 | 322.76 ms | 52.3% bf16 MFU | 1623966 tok/s step 16181/19560 | loss 3.352556 (+1.02z)| norm 0.2681 (-0.82z)| lr 4.63e-05 | 322.34 ms | 52.4% bf16 MFU | 1624093 tok/s step 16182/19560 | loss 3.354105 (+1.07z)| norm 0.3086 (+0.44z)| lr 4.63e-05 | 322.46 ms | 52.3% bf16 MFU | 1624184 tok/s step 16183/19560 | loss 3.306854 (-0.07z)| norm 0.3336 (+1.20z)| lr 4.63e-05 | 322.50 ms | 52.3% bf16 MFU | 1624260 tok/s step 16184/19560 | loss 3.243268 (-1.56z)| norm 0.2843 (-0.32z)| lr 4.62e-05 | 322.87 ms | 52.3% bf16 MFU | 1624239 tok/s step 16185/19560 | loss 3.259757 (-1.15z)| norm 0.3819 (+2.61z)| lr 4.62e-05 | 322.58 ms | 52.3% bf16 MFU | 1624291 tok/s step 16186/19560 | loss 3.261718 (-1.09z)| norm 0.2875 (-0.22z)| lr 4.62e-05 | 322.86 ms | 52.3% bf16 MFU | 1624270 tok/s step 16187/19560 | loss 3.332895 (+0.58z)| norm 0.2659 (-0.89z)| lr 4.61e-05 | 322.48 ms | 52.3% bf16 MFU | 1624348 tok/s step 16188/19560 | loss 3.369525 (+1.44z)| norm 0.2793 (-0.46z)| lr 4.61e-05 | 322.52 ms | 52.3% bf16 MFU | 1624412 tok/s step 16189/19560 | loss 3.329545 (+0.49z)| norm 0.3026 (+0.31z)| lr 4.61e-05 | 322.84 ms | 52.3% bf16 MFU | 1624389 tok/s step 16190/19560 | loss 3.289575 (-0.45z)| norm 0.2790 (-0.47z)| lr 4.61e-05 | 322.24 ms | 52.4% bf16 MFU | 1624520 tok/s step 16191/19560 | loss 3.220770 (-2.01z)| norm 0.2922 (-0.03z)| lr 4.60e-05 | 322.54 ms | 52.3% bf16 MFU | 1624570 tok/s step 16192/19560 
| loss 3.312626 (+0.12z)| norm 0.2714 (-0.70z)| lr 4.60e-05 | 322.65 ms | 52.3% bf16 MFU | 1624589 tok/s step 16193/19560 | loss 3.205893 (-2.30z)| norm 0.3064 (+0.45z)| lr 4.60e-05 | 322.55 ms | 52.3% bf16 MFU | 1624633 tok/s step 16194/19560 | loss 3.264461 (-0.96z)| norm 0.3022 (+0.30z)| lr 4.60e-05 | 322.87 ms | 52.3% bf16 MFU | 1624594 tok/s step 16195/19560 | loss 3.320076 (+0.31z)| norm 0.2727 (-0.66z)| lr 4.59e-05 | 322.37 ms | 52.4% bf16 MFU | 1624683 tok/s step 16196/19560 | loss 3.316575 (+0.23z)| norm 0.3126 (+0.66z)| lr 4.59e-05 | 322.84 ms | 52.3% bf16 MFU | 1624647 tok/s step 16197/19560 | loss 3.307063 (+0.02z)| norm 0.2878 (-0.18z)| lr 4.59e-05 | 322.90 ms | 52.3% bf16 MFU | 1624599 tok/s step 16198/19560 | loss 3.332960 (+0.59z)| norm 0.2915 (-0.03z)| lr 4.59e-05 | 322.87 ms | 52.3% bf16 MFU | 1624560 tok/s step 16199/19560 | loss 3.299005 (-0.17z)| norm 0.2604 (-1.10z)| lr 4.58e-05 | 322.48 ms | 52.3% bf16 MFU | 1624622 tok/s step 16200/19560 | loss 3.293189 (-0.30z)| norm 0.2931 (+0.04z)| lr 4.58e-05 | 322.65 ms | 52.3% bf16 MFU | 1624638 tok/s step 16201/19560 | loss 3.293339 (-0.29z)| norm 0.2624 (-1.03z)| lr 4.58e-05 | 322.29 ms | 52.4% bf16 MFU | 1624743 tok/s step 16202/19560 | loss 3.316230 (+0.24z)| norm 0.2626 (-1.02z)| lr 4.57e-05 | 322.71 ms | 52.3% bf16 MFU | 1624737 tok/s step 16203/19560 | loss 3.429855 (+2.77z)| norm 0.3024 (+0.42z)| lr 4.57e-05 | 322.86 ms | 52.3% bf16 MFU | 1624695 tok/s step 16204/19560 | loss 3.320213 (+0.31z)| norm 0.2818 (-0.32z)| lr 4.57e-05 | 322.76 ms | 52.3% bf16 MFU | 1624679 tok/s step 16205/19560 | loss 3.277191 (-0.69z)| norm 0.2837 (-0.26z)| lr 4.57e-05 | 322.82 ms | 52.3% bf16 MFU | 1624648 tok/s step 16206/19560 | loss 3.367312 (+1.38z)| norm 0.2805 (-0.38z)| lr 4.56e-05 | 322.22 ms | 52.4% bf16 MFU | 1624771 tok/s step 16207/19560 | loss 3.317315 (+0.22z)| norm 0.3194 (+1.01z)| lr 4.56e-05 | 322.49 ms | 52.3% bf16 MFU | 1624819 tok/s step 16208/19560 | loss 3.321697 (+0.32z)| norm 0.2788 (-0.46z)| 
lr 4.56e-05 | 322.56 ms | 52.3% bf16 MFU | 1624848 tok/s step 16209/19560 | loss 3.344987 (+0.85z)| norm 0.2876 (-0.14z)| lr 4.56e-05 | 322.71 ms | 52.3% bf16 MFU | 1624838 tok/s step 16210/19560 | loss 3.345674 (+0.86z)| norm 0.2808 (-0.38z)| lr 4.55e-05 | 322.77 ms | 52.3% bf16 MFU | 1624814 tok/s step 16211/19560 | loss 3.321628 (+0.31z)| norm 0.2651 (-0.94z)| lr 4.55e-05 | 322.49 ms | 52.3% bf16 MFU | 1624862 tok/s step 16212/19560 | loss 3.354471 (+1.06z)| norm 0.2546 (-1.31z)| lr 4.55e-05 | 322.63 ms | 52.3% bf16 MFU | 1624871 tok/s step 16213/19560 | loss 3.323170 (+0.33z)| norm 0.2643 (-0.96z)| lr 4.55e-05 | 322.39 ms | 52.3% bf16 MFU | 1624939 tok/s step 16214/19560 | loss 3.279946 (-0.67z)| norm 0.2978 (+0.26z)| lr 4.54e-05 | 322.39 ms | 52.4% bf16 MFU | 1625005 tok/s step 16215/19560 | loss 3.302135 (-0.16z)| norm 0.2712 (-0.71z)| lr 4.54e-05 | 322.78 ms | 52.3% bf16 MFU | 1624970 tok/s step 16216/19560 | loss 3.288766 (-0.49z)| norm 0.2754 (-0.55z)| lr 4.54e-05 | 322.84 ms | 52.3% bf16 MFU | 1624921 tok/s step 16217/19560 | loss 3.293764 (-0.36z)| norm 0.2878 (-0.11z)| lr 4.54e-05 | 322.83 ms | 52.3% bf16 MFU | 1624878 tok/s step 16218/19560 | loss 3.322199 (+0.34z)| norm 0.2825 (-0.30z)| lr 4.53e-05 | 322.68 ms | 52.3% bf16 MFU | 1624873 tok/s step 16219/19560 | loss 3.251168 (-1.38z)| norm 0.2721 (-0.68z)| lr 4.53e-05 | 322.31 ms | 52.4% bf16 MFU | 1624961 tok/s step 16220/19560 | loss 3.363352 (+1.32z)| norm 0.3006 (+0.35z)| lr 4.53e-05 | 323.01 ms | 52.2% bf16 MFU | 1624869 tok/s step 16221/19560 | loss 3.270200 (-0.93z)| norm 0.2784 (-0.47z)| lr 4.52e-05 | 322.33 ms | 52.4% bf16 MFU | 1624954 tok/s step 16222/19560 | loss 3.308388 (-0.01z)| norm 0.2691 (-0.81z)| lr 4.52e-05 | 322.87 ms | 52.3% bf16 MFU | 1624898 tok/s step 16223/19560 | loss 3.284384 (-0.61z)| norm 0.2878 (-0.10z)| lr 4.52e-05 | 322.17 ms | 52.4% bf16 MFU | 1625021 tok/s step 16224/19560 | loss 3.300249 (-0.23z)| norm 0.2583 (-1.27z)| lr 4.52e-05 | 322.65 ms | 52.3% bf16 MFU | 
1625017 tok/s step 16225/19560 | loss 3.286579 (-0.56z)| norm 0.3391 (+1.92z)| lr 4.51e-05 | 322.81 ms | 52.3% bf16 MFU | 1624972 tok/s step 16226/19560 | loss 3.246285 (-1.53z)| norm 0.2529 (-1.46z)| lr 4.51e-05 | 322.78 ms | 52.3% bf16 MFU | 1624938 tok/s step 16227/19560 | loss 3.379344 (+1.74z)| norm 0.3018 (+0.45z)| lr 4.51e-05 | 322.77 ms | 52.3% bf16 MFU | 1624908 tok/s step 16228/19560 | loss 3.299297 (-0.23z)| norm 0.3464 (+2.13z)| lr 4.51e-05 | 322.52 ms | 52.3% bf16 MFU | 1624942 tok/s step 16229/19560 | loss 3.295477 (-0.34z)| norm 0.2821 (-0.34z)| lr 4.50e-05 | 322.54 ms | 52.3% bf16 MFU | 1624970 tok/s step 16230/19560 | loss 3.292104 (-0.41z)| norm 0.2638 (-1.03z)| lr 4.50e-05 | 322.71 ms | 52.3% bf16 MFU | 1624954 tok/s step 16231/19560 | loss 3.310723 (+0.06z)| norm 0.2701 (-0.78z)| lr 4.50e-05 | 322.41 ms | 52.3% bf16 MFU | 1625013 tok/s step 16232/19560 | loss 3.339365 (+0.77z)| norm 0.2802 (-0.39z)| lr 4.50e-05 | 322.65 ms | 52.3% bf16 MFU | 1625009 tok/s step 16233/19560 | loss 3.376180 (+1.66z)| norm 0.3183 (+1.05z)| lr 4.49e-05 | 322.62 ms | 52.3% bf16 MFU | 1625014 tok/s step 16234/19560 | loss 3.344656 (+0.87z)| norm 0.2620 (-1.08z)| lr 4.49e-05 | 322.48 ms | 52.3% bf16 MFU | 1625052 tok/s step 16235/19560 | loss 3.301631 (-0.20z)| norm 0.2720 (-0.69z)| lr 4.49e-05 | 322.64 ms | 52.3% bf16 MFU | 1625050 tok/s step 16236/19560 | loss 3.317511 (+0.20z)| norm 0.2760 (-0.53z)| lr 4.48e-05 | 322.70 ms | 52.3% bf16 MFU | 1625033 tok/s step 16237/19560 | loss 3.227728 (-2.00z)| norm 0.3140 (+0.89z)| lr 4.48e-05 | 322.95 ms | 52.3% bf16 MFU | 1624953 tok/s step 16238/19560 | loss 3.291444 (-0.41z)| norm 0.2900 (-0.02z)| lr 4.48e-05 | 322.38 ms | 52.4% bf16 MFU | 1625020 tok/s step 16239/19560 | loss 3.336348 (+0.70z)| norm 0.2936 (+0.12z)| lr 4.48e-05 | 322.47 ms | 52.3% bf16 MFU | 1625062 tok/s step 16240/19560 | loss 3.318377 (+0.25z)| norm 0.2880 (-0.10z)| lr 4.47e-05 | 322.69 ms | 52.3% bf16 MFU | 1625045 tok/s step 16241/19560 | loss 3.245157 
(-1.58z)| norm 0.2901 (-0.02z)| lr 4.47e-05 | 322.48 ms | 52.3% bf16 MFU | 1625083 tok/s step 16242/19560 | loss 3.279016 (-0.72z)| norm 0.3156 (+0.93z)| lr 4.47e-05 | 322.59 ms | 52.3% bf16 MFU | 1625090 tok/s step 16243/19560 | loss 3.337360 (+0.75z)| norm 0.2720 (-0.72z)| lr 4.47e-05 | 322.55 ms | 52.3% bf16 MFU | 1625108 tok/s step 16244/19560 | loss 3.309698 (+0.05z)| norm 0.3174 (+1.00z)| lr 4.46e-05 | 322.69 ms | 52.3% bf16 MFU | 1625090 tok/s step 16245/19560 | loss 3.329519 (+0.54z)| norm 0.2851 (-0.22z)| lr 4.46e-05 | 322.52 ms | 52.3% bf16 MFU | 1625115 tok/s step 16246/19560 | loss 3.206055 (-2.50z)| norm 0.2798 (-0.41z)| lr 4.46e-05 | 322.91 ms | 52.3% bf16 MFU | 1625041 tok/s step 16247/19560 | loss 3.290403 (-0.41z)| norm 0.2776 (-0.50z)| lr 4.46e-05 | 322.68 ms | 52.3% bf16 MFU | 1625029 tok/s step 16248/19560 | loss 3.328204 (+0.51z)| norm 0.2813 (-0.34z)| lr 4.45e-05 | 322.49 ms | 52.3% bf16 MFU | 1625065 tok/s step 16249/19560 | loss 3.241738 (-1.59z)| norm 0.2933 (+0.13z)| lr 4.45e-05 | 323.03 ms | 52.2% bf16 MFU | 1624964 tok/s step 16250/19560 | loss 3.304708 (-0.05z)| norm 0.2661 (-0.93z)| lr 4.45e-05 | 322.34 ms | 52.4% bf16 MFU | 1625042 tok/s val loss 3.288704 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating HellaSwag: 70/79 
HellaSwag: 3029/10042 = 0.301633 step 16251/19560 | loss 3.290507 (-0.39z)| norm 0.2924 (+0.10z)| lr 4.45e-05 | 322.58 ms | 52.3% bf16 MFU | 1625054 tok/s step 16252/19560 | loss 3.350054 (+1.08z)| norm 0.2923 (+0.09z)| lr 4.44e-05 | 321.98 ms | 52.4% bf16 MFU | 1625218 tok/s step 16253/19560 | loss 3.352694 (+1.13z)| norm 0.2944 (+0.17z)| lr 4.44e-05 | 322.54 ms | 52.3% bf16 MFU | 1625231 tok/s step 16254/19560 | loss 3.262135 (-1.09z)| norm 0.3248 (+1.35z)| lr 4.44e-05 | 322.80 ms | 52.3% bf16 MFU | 1625179 tok/s step 16255/19560 | loss 3.301381 (-0.12z)| norm 0.3266 (+1.40z)| lr 4.44e-05 | 323.15 ms | 52.2% bf16 MFU | 1625042 tok/s step 16256/19560 | loss 3.263274 (-1.05z)| norm 0.2769 (-0.55z)| lr 4.43e-05 | 322.34 ms | 52.4% bf16 MFU | 1625116 tok/s step 16257/19560 | loss 3.375211 (+1.69z)| norm 0.2609 (-1.18z)| lr 4.43e-05 | 323.02 ms | 52.2% bf16 MFU | 1625014 tok/s step 16258/19560 | loss 3.300582 (-0.13z)| norm 0.2901 (-0.04z)| lr 4.43e-05 | 322.44 ms | 52.3% bf16 MFU | 1625062 tok/s step 16259/19560 | loss 3.297259 (-0.21z)| norm 0.2828 (-0.33z)| lr 4.42e-05 | 322.52 ms | 52.3% bf16 MFU | 1625088 tok/s step 16260/19560 | loss 3.325230 (+0.47z)| norm 0.3433 (+2.01z)| lr 4.42e-05 | 323.08 ms | 52.2% bf16 MFU | 1624972 tok/s step 16261/19560 | loss 3.297262 (-0.21z)| norm 0.3001 (+0.31z)| lr 4.42e-05 | 322.94 ms | 52.3% bf16 MFU | 1624897 tok/s 
step 16262/19560 | loss 3.332978 (+0.66z)| norm 0.3010 (+0.34z)| lr 4.42e-05 | 322.33 ms | 52.4% bf16 MFU | 1624979 tok/s step 16263/19560 | loss 3.322524 (+0.40z)| norm 0.2693 (-0.90z)| lr 4.41e-05 | 322.47 ms | 52.3% bf16 MFU | 1625022 tok/s step 16264/19560 | loss 3.257812 (-1.17z)| norm 0.2643 (-1.09z)| lr 4.41e-05 | 322.95 ms | 52.3% bf16 MFU | 1624944 tok/s step 16265/19560 | loss 3.260560 (-1.10z)| norm 0.3082 (+0.65z)| lr 4.41e-05 | 322.74 ms | 52.3% bf16 MFU | 1624921 tok/s step 16266/19560 | loss 3.340619 (+0.84z)| norm 0.2753 (-0.64z)| lr 4.41e-05 | 323.19 ms | 52.2% bf16 MFU | 1624787 tok/s step 16267/19560 | loss 3.267001 (-0.95z)| norm 0.2956 (+0.18z)| lr 4.40e-05 | 322.87 ms | 52.3% bf16 MFU | 1624738 tok/s step 16268/19560 | loss 3.362984 (+1.37z)| norm 0.2893 (-0.07z)| lr 4.40e-05 | 322.57 ms | 52.3% bf16 MFU | 1624768 tok/s step 16269/19560 | loss 3.319015 (+0.31z)| norm 0.2902 (-0.03z)| lr 4.40e-05 | 322.91 ms | 52.3% bf16 MFU | 1624712 tok/s step 16270/19560 | loss 3.268167 (-0.92z)| norm 0.2896 (-0.06z)| lr 4.40e-05 | 323.36 ms | 52.2% bf16 MFU | 1624546 tok/s step 16271/19560 | loss 3.338973 (+0.78z)| norm 0.2484 (-1.70z)| lr 4.39e-05 | 322.53 ms | 52.3% bf16 MFU | 1624594 tok/s step 16272/19560 | loss 3.320966 (+0.34z)| norm 0.3192 (+1.12z)| lr 4.39e-05 | 322.58 ms | 52.3% bf16 MFU | 1624630 tok/s step 16273/19560 | loss 3.251131 (-1.36z)| norm 0.2613 (-1.18z)| lr 4.39e-05 | 322.68 ms | 52.3% bf16 MFU | 1624639 tok/s step 16274/19560 | loss 3.264035 (-1.03z)| norm 0.2830 (-0.32z)| lr 4.39e-05 | 322.51 ms | 52.3% bf16 MFU | 1624689 tok/s step 16275/19560 | loss 3.379508 (+1.77z)| norm 0.2655 (-1.01z)| lr 4.38e-05 | 323.00 ms | 52.3% bf16 MFU | 1624612 tok/s step 16276/19560 | loss 3.287273 (-0.46z)| norm 0.2796 (-0.44z)| lr 4.38e-05 | 322.52 ms | 52.3% bf16 MFU | 1624663 tok/s step 16277/19560 | loss 3.337662 (+0.76z)| norm 0.3058 (+0.58z)| lr 4.38e-05 | 322.71 ms | 52.3% bf16 MFU | 1624661 tok/s step 16278/19560 | loss 3.416054 (+2.60z)| norm 
0.3510 (+2.38z)| lr 4.38e-05 | 322.70 ms | 52.3% bf16 MFU | 1624663 tok/s step 16279/19560 | loss 3.317865 (+0.26z)| norm 0.2913 (+0.01z)| lr 4.37e-05 | 323.03 ms | 52.2% bf16 MFU | 1624581 tok/s step 16280/19560 | loss 3.380280 (+1.72z)| norm 0.2620 (-1.14z)| lr 4.37e-05 | 322.66 ms | 52.3% bf16 MFU | 1624598 tok/s step 16281/19560 | loss 3.264099 (-1.02z)| norm 0.3191 (+1.12z)| lr 4.37e-05 | 322.79 ms | 52.3% bf16 MFU | 1624581 tok/s step 16282/19560 | loss 3.469563 (+3.60z)| norm 0.3301 (+1.53z)| lr 4.36e-05 | 323.51 ms | 52.2% bf16 MFU | 1624384 tok/s step 16283/19560 | loss 3.293607 (-0.34z)| norm 0.2845 (-0.27z)| lr 4.36e-05 | 322.61 ms | 52.3% bf16 MFU | 1624422 tok/s step 16284/19560 | loss 3.270712 (-0.84z)| norm 0.2935 (+0.08z)| lr 4.36e-05 | 322.71 ms | 52.3% bf16 MFU | 1624432 tok/s step 16285/19560 | loss 3.360177 (+1.14z)| norm 0.3273 (+1.39z)| lr 4.36e-05 | 323.22 ms | 52.2% bf16 MFU | 1624315 tok/s step 16286/19560 | loss 3.300325 (-0.19z)| norm 0.3034 (+0.45z)| lr 4.35e-05 | 322.56 ms | 52.3% bf16 MFU | 1624368 tok/s step 16287/19560 | loss 3.369973 (+1.33z)| norm 0.3317 (+1.54z)| lr 4.35e-05 | 322.99 ms | 52.3% bf16 MFU | 1624312 tok/s step 16288/19560 | loss 3.306022 (-0.09z)| norm 0.3043 (+0.46z)| lr 4.35e-05 | 322.51 ms | 52.3% bf16 MFU | 1624378 tok/s step 16289/19560 | loss 3.255936 (-1.20z)| norm 0.2765 (-0.62z)| lr 4.35e-05 | 323.16 ms | 52.2% bf16 MFU | 1624278 tok/s step 16290/19560 | loss 3.255583 (-1.21z)| norm 0.3094 (+0.74z)| lr 4.34e-05 | 323.26 ms | 52.2% bf16 MFU | 1624159 tok/s step 16291/19560 | loss 3.326446 (+0.40z)| norm 0.3880 (+3.74z)| lr 4.34e-05 | 322.47 ms | 52.3% bf16 MFU | 1624243 tok/s step 16292/19560 | loss 3.345928 (+0.86z)| norm 0.2764 (-0.61z)| lr 4.34e-05 | 322.95 ms | 52.3% bf16 MFU | 1624202 tok/s step 16293/19560 | loss 3.310531 (+0.04z)| norm 0.3395 (+1.92z)| lr 4.34e-05 | 322.49 ms | 52.3% bf16 MFU | 1624278 tok/s step 16294/19560 | loss 3.249429 (-1.34z)| norm 0.3182 (+1.05z)| lr 4.33e-05 | 323.26 ms | 
52.2% bf16 MFU | 1624157 tok/s step 16295/19560 | loss 3.313797 (+0.12z)| norm 0.2747 (-0.69z)| lr 4.33e-05 | 323.01 ms | 52.2% bf16 MFU | 1624105 tok/s step 16296/19560 | loss 3.324781 (+0.37z)| norm 0.3167 (+1.01z)| lr 4.33e-05 | 322.96 ms | 52.3% bf16 MFU | 1624070 tok/s step 16297/19560 | loss 3.305362 (-0.07z)| norm 0.3556 (+2.50z)| lr 4.33e-05 | 323.07 ms | 52.2% bf16 MFU | 1624008 tok/s step 16298/19560 | loss 3.325523 (+0.41z)| norm 0.2795 (-0.50z)| lr 4.32e-05 | 322.26 ms | 52.4% bf16 MFU | 1624153 tok/s step 16299/19560 | loss 3.270936 (-0.85z)| norm 0.3326 (+1.59z)| lr 4.32e-05 | 323.02 ms | 52.2% bf16 MFU | 1624100 tok/s step 16300/19560 | loss 3.254243 (-1.24z)| norm 0.2945 (+0.08z)| lr 4.32e-05 | 323.20 ms | 52.2% bf16 MFU | 1624003 tok/s step 16301/19560 | loss 3.289152 (-0.43z)| norm 0.2500 (-1.64z)| lr 4.32e-05 | 322.74 ms | 52.3% bf16 MFU | 1624028 tok/s step 16302/19560 | loss 3.304726 (-0.07z)| norm 0.2841 (-0.31z)| lr 4.31e-05 | 322.91 ms | 52.3% bf16 MFU | 1624008 tok/s step 16303/19560 | loss 3.248821 (-1.36z)| norm 0.2895 (-0.09z)| lr 4.31e-05 | 324.03 ms | 52.1% bf16 MFU | 1623709 tok/s step 16304/19560 | loss 3.316968 (+0.22z)| norm 0.2599 (-1.24z)| lr 4.31e-05 | 323.01 ms | 52.2% bf16 MFU | 1623680 tok/s step 16305/19560 | loss 3.291976 (-0.37z)| norm 0.2562 (-1.37z)| lr 4.31e-05 | 322.61 ms | 52.3% bf16 MFU | 1623755 tok/s step 16306/19560 | loss 3.182935 (-2.81z)| norm 0.2823 (-0.36z)| lr 4.30e-05 | 323.10 ms | 52.2% bf16 MFU | 1623701 tok/s step 16307/19560 | loss 3.277169 (-0.67z)| norm 0.2526 (-1.49z)| lr 4.30e-05 | 322.78 ms | 52.3% bf16 MFU | 1623732 tok/s step 16308/19560 | loss 3.322623 (+0.36z)| norm 0.2583 (-1.25z)| lr 4.30e-05 | 322.68 ms | 52.3% bf16 MFU | 1623784 tok/s step 16309/19560 | loss 3.315260 (+0.20z)| norm 0.2688 (-0.85z)| lr 4.29e-05 | 323.83 ms | 52.1% bf16 MFU | 1623546 tok/s step 16310/19560 | loss 3.301584 (-0.10z)| norm 0.2656 (-0.96z)| lr 4.29e-05 | 323.12 ms | 52.2% bf16 MFU | 1623498 tok/s step 16311/19560 
| loss 3.310711 (+0.11z)| norm 0.2745 (-0.61z)| lr 4.29e-05 | 322.89 ms | 52.3% bf16 MFU | 1623511 tok/s step 16312/19560 | loss 3.370259 (+1.45z)| norm 0.2727 (-0.67z)| lr 4.29e-05 | 323.44 ms | 52.2% bf16 MFU | 1623385 tok/s step 16313/19560 | loss 3.308918 (+0.04z)| norm 0.2705 (-0.76z)| lr 4.28e-05 | 323.16 ms | 52.2% bf16 MFU | 1623333 tok/s step 16314/19560 | loss 3.290318 (-0.40z)| norm 0.2526 (-1.46z)| lr 4.28e-05 | 323.44 ms | 52.2% bf16 MFU | 1623216 tok/s step 16315/19560 | loss 3.245969 (-1.40z)| norm 0.3010 (+0.48z)| lr 4.28e-05 | 323.25 ms | 52.2% bf16 MFU | 1623151 tok/s step 16316/19560 | loss 3.306317 (-0.00z)| norm 0.2631 (-1.04z)| lr 4.28e-05 | 322.14 ms | 52.4% bf16 MFU | 1623369 tok/s step 16317/19560 | loss 3.262173 (-1.01z)| norm 0.2820 (-0.28z)| lr 4.27e-05 | 322.84 ms | 52.3% bf16 MFU | 1623398 tok/s step 16318/19560 | loss 3.307104 (+0.02z)| norm 0.2823 (-0.27z)| lr 4.27e-05 | 322.65 ms | 52.3% bf16 MFU | 1623474 tok/s step 16319/19560 | loss 3.306061 (-0.01z)| norm 0.2638 (-1.00z)| lr 4.27e-05 | 323.17 ms | 52.2% bf16 MFU | 1623416 tok/s step 16320/19560 | loss 3.343987 (+0.86z)| norm 0.2690 (-0.79z)| lr 4.27e-05 | 323.35 ms | 52.2% bf16 MFU | 1623318 tok/s step 16321/19560 | loss 3.368940 (+1.44z)| norm 0.3230 (+1.36z)| lr 4.26e-05 | 322.30 ms | 52.4% bf16 MFU | 1623487 tok/s step 16322/19560 | loss 3.300746 (-0.18z)| norm 0.2960 (+0.29z)| lr 4.26e-05 | 323.30 ms | 52.2% bf16 MFU | 1623395 tok/s step 16323/19560 | loss 3.331840 (+0.55z)| norm 0.3280 (+1.54z)| lr 4.26e-05 | 322.97 ms | 52.3% bf16 MFU | 1623392 tok/s step 16324/19560 | loss 3.335269 (+0.63z)| norm 0.2788 (-0.40z)| lr 4.26e-05 | 323.27 ms | 52.2% bf16 MFU | 1623313 tok/s step 16325/19560 | loss 3.273798 (-0.82z)| norm 0.3588 (+2.68z)| lr 4.25e-05 | 323.09 ms | 52.2% bf16 MFU | 1623284 tok/s step 16326/19560 | loss 3.308429 (+0.00z)| norm 0.3046 (+0.58z)| lr 4.25e-05 | 322.37 ms | 52.4% bf16 MFU | 1623438 tok/s step 16327/19560 | loss 3.291100 (-0.41z)| norm 0.2530 (-1.40z)| 
lr 4.25e-05 | 323.19 ms | 52.2% bf16 MFU | 1623377 tok/s step 16328/19560 | loss 3.309904 (+0.04z)| norm 0.3247 (+1.34z)| lr 4.25e-05 | 323.31 ms | 52.2% bf16 MFU | 1623288 tok/s step 16329/19560 | loss 3.386476 (+1.81z)| norm 0.3745 (+3.09z)| lr 4.24e-05 | 323.19 ms | 52.2% bf16 MFU | 1623236 tok/s step 16330/19560 | loss 3.288128 (-0.48z)| norm 0.2480 (-1.56z)| lr 4.24e-05 | 322.83 ms | 52.3% bf16 MFU | 1623276 tok/s step 16331/19560 | loss 3.328356 (+0.49z)| norm 0.2653 (-0.91z)| lr 4.24e-05 | 322.52 ms | 52.3% bf16 MFU | 1623393 tok/s step 16332/19560 | loss 3.431039 (+2.85z)| norm 0.3298 (+1.42z)| lr 4.24e-05 | 322.79 ms | 52.3% bf16 MFU | 1623436 tok/s step 16333/19560 | loss 3.340601 (+0.73z)| norm 0.2943 (+0.13z)| lr 4.23e-05 | 322.77 ms | 52.3% bf16 MFU | 1623481 tok/s step 16334/19560 | loss 3.276904 (-0.75z)| norm 0.3232 (+1.16z)| lr 4.23e-05 | 322.90 ms | 52.3% bf16 MFU | 1623492 tok/s step 16335/19560 | loss 3.391956 (+1.91z)| norm 0.2813 (-0.34z)| lr 4.23e-05 | 322.60 ms | 52.3% bf16 MFU | 1623577 tok/s step 16336/19560 | loss 3.318203 (+0.21z)| norm 0.2626 (-1.01z)| lr 4.23e-05 | 323.03 ms | 52.2% bf16 MFU | 1623549 tok/s step 16337/19560 | loss 3.280746 (-0.65z)| norm 0.2717 (-0.68z)| lr 4.22e-05 | 323.14 ms | 52.2% bf16 MFU | 1623495 tok/s step 16338/19560 | loss 3.339968 (+0.72z)| norm 0.2837 (-0.25z)| lr 4.22e-05 | 322.86 ms | 52.3% bf16 MFU | 1623514 tok/s step 16339/19560 | loss 3.323088 (+0.33z)| norm 0.2706 (-0.72z)| lr 4.22e-05 | 323.55 ms | 52.2% bf16 MFU | 1623360 tok/s step 16340/19560 | loss 3.290971 (-0.40z)| norm 0.2816 (-0.33z)| lr 4.22e-05 | 322.99 ms | 52.3% bf16 MFU | 1623353 tok/s step 16341/19560 | loss 3.278877 (-0.67z)| norm 0.2779 (-0.47z)| lr 4.21e-05 | 322.85 ms | 52.3% bf16 MFU | 1623383 tok/s step 16342/19560 | loss 3.349673 (+0.95z)| norm 0.2889 (-0.07z)| lr 4.21e-05 | 322.39 ms | 52.3% bf16 MFU | 1623526 tok/s step 16343/19560 | loss 3.251643 (-1.30z)| norm 0.2761 (-0.54z)| lr 4.21e-05 | 322.90 ms | 52.3% bf16 MFU | 
1623533 tok/s step 16344/19560 | loss 3.198152 (-2.45z)| norm 0.2801 (-0.39z)| lr 4.21e-05 | 322.87 ms | 52.3% bf16 MFU | 1623548 tok/s step 16345/19560 | loss 3.323901 (+0.37z)| norm 0.2665 (-0.88z)| lr 4.20e-05 | 322.65 ms | 52.3% bf16 MFU | 1623619 tok/s step 16346/19560 | loss 3.254014 (-1.19z)| norm 0.2814 (-0.34z)| lr 4.20e-05 | 322.97 ms | 52.3% bf16 MFU | 1623604 tok/s step 16347/19560 | loss 3.302856 (-0.10z)| norm 0.2953 (+0.16z)| lr 4.20e-05 | 322.50 ms | 52.3% bf16 MFU | 1623709 tok/s step 16348/19560 | loss 3.280661 (-0.59z)| norm 0.2806 (-0.37z)| lr 4.20e-05 | 322.57 ms | 52.3% bf16 MFU | 1623790 tok/s step 16349/19560 | loss 3.262933 (-0.99z)| norm 0.2651 (-0.93z)| lr 4.19e-05 | 323.03 ms | 52.2% bf16 MFU | 1623753 tok/s step 16350/19560 | loss 3.252929 (-1.20z)| norm 0.2815 (-0.33z)| lr 4.19e-05 | 322.27 ms | 52.4% bf16 MFU | 1623908 tok/s step 16351/19560 | loss 3.287290 (-0.43z)| norm 0.3449 (+1.93z)| lr 4.19e-05 | 322.42 ms | 52.3% bf16 MFU | 1624019 tok/s step 16352/19560 | loss 3.317332 (+0.24z)| norm 0.2521 (-1.40z)| lr 4.18e-05 | 322.94 ms | 52.3% bf16 MFU | 1623991 tok/s step 16353/19560 | loss 3.304743 (-0.04z)| norm 0.3268 (+1.29z)| lr 4.18e-05 | 323.24 ms | 52.2% bf16 MFU | 1623890 tok/s step 16354/19560 | loss 3.316360 (+0.21z)| norm 0.3337 (+1.51z)| lr 4.18e-05 | 322.63 ms | 52.3% bf16 MFU | 1623947 tok/s step 16355/19560 | loss 3.259843 (-1.06z)| norm 0.2839 (-0.27z)| lr 4.18e-05 | 322.65 ms | 52.3% bf16 MFU | 1623996 tok/s step 16356/19560 | loss 3.270217 (-0.81z)| norm 0.2792 (-0.43z)| lr 4.17e-05 | 322.61 ms | 52.3% bf16 MFU | 1624054 tok/s step 16357/19560 | loss 3.268197 (-0.85z)| norm 0.2897 (-0.05z)| lr 4.17e-05 | 322.52 ms | 52.3% bf16 MFU | 1624131 tok/s step 16358/19560 | loss 3.306005 (+0.00z)| norm 0.2556 (-1.29z)| lr 4.17e-05 | 322.90 ms | 52.3% bf16 MFU | 1624109 tok/s step 16359/19560 | loss 3.328443 (+0.51z)| norm 0.2989 (+0.28z)| lr 4.17e-05 | 322.64 ms | 52.3% bf16 MFU | 1624152 tok/s step 16360/19560 | loss 3.305084 
(-0.02z)| norm 0.2654 (-0.93z)| lr 4.16e-05 | 322.77 ms | 52.3% bf16 MFU | 1624160 tok/s step 16361/19560 | loss 3.354999 (+1.13z)| norm 0.2925 (+0.06z)| lr 4.16e-05 | 322.67 ms | 52.3% bf16 MFU | 1624196 tok/s step 16362/19560 | loss 3.353391 (+1.09z)| norm 0.3249 (+1.22z)| lr 4.16e-05 | 322.38 ms | 52.4% bf16 MFU | 1624302 tok/s step 16363/19560 | loss 3.286504 (-0.43z)| norm 0.3292 (+1.36z)| lr 4.16e-05 | 322.46 ms | 52.3% bf16 MFU | 1624383 tok/s step 16364/19560 | loss 3.310204 (+0.11z)| norm 0.2801 (-0.43z)| lr 4.15e-05 | 322.06 ms | 52.4% bf16 MFU | 1624560 tok/s step 16365/19560 | loss 3.274872 (-0.71z)| norm 0.3092 (+0.63z)| lr 4.15e-05 | 322.81 ms | 52.3% bf16 MFU | 1624538 tok/s step 16366/19560 | loss 3.269092 (-0.84z)| norm 0.2817 (-0.36z)| lr 4.15e-05 | 322.78 ms | 52.3% bf16 MFU | 1624525 tok/s step 16367/19560 | loss 3.276804 (-0.65z)| norm 0.2732 (-0.67z)| lr 4.15e-05 | 322.35 ms | 52.4% bf16 MFU | 1624621 tok/s step 16368/19560 | loss 3.366270 (+1.38z)| norm 0.2824 (-0.33z)| lr 4.14e-05 | 322.63 ms | 52.3% bf16 MFU | 1624643 tok/s step 16369/19560 | loss 3.320399 (+0.33z)| norm 0.4041 (+3.81z)| lr 4.14e-05 | 322.87 ms | 52.3% bf16 MFU | 1624602 tok/s step 16370/19560 | loss 3.304097 (-0.05z)| norm 0.2889 (-0.11z)| lr 4.14e-05 | 322.78 ms | 52.3% bf16 MFU | 1624587 tok/s step 16371/19560 | loss 3.349198 (+0.98z)| norm 0.2839 (-0.29z)| lr 4.14e-05 | 322.50 ms | 52.3% bf16 MFU | 1624644 tok/s step 16372/19560 | loss 3.332993 (+0.60z)| norm 0.3461 (+1.82z)| lr 4.13e-05 | 322.85 ms | 52.3% bf16 MFU | 1624608 tok/s step 16373/19560 | loss 3.397851 (+2.05z)| norm 0.2780 (-0.49z)| lr 4.13e-05 | 322.49 ms | 52.3% bf16 MFU | 1624664 tok/s step 16374/19560 | loss 3.316973 (+0.21z)| norm 0.2681 (-0.82z)| lr 4.13e-05 | 322.08 ms | 52.4% bf16 MFU | 1624822 tok/s step 16375/19560 | loss 3.341741 (+0.77z)| norm 0.3144 (+0.73z)| lr 4.13e-05 | 322.36 ms | 52.4% bf16 MFU | 1624901 tok/s step 16376/19560 | loss 3.287089 (-0.48z)| norm 0.2682 (-0.82z)| lr 4.12e-05 | 
322.95 ms | 52.3% bf16 MFU | 1624829 tok/s step 16377/19560 | loss 3.342067 (+0.77z)| norm 0.2838 (-0.29z)| lr 4.12e-05 | 322.84 ms | 52.3% bf16 MFU | 1624786 tok/s step 16378/19560 | loss 3.403408 (+2.13z)| norm 0.3817 (+2.89z)| lr 4.12e-05 | 322.34 ms | 52.4% bf16 MFU | 1624872 tok/s step 16379/19560 | loss 3.284225 (-0.58z)| norm 0.2622 (-1.01z)| lr 4.12e-05 | 322.49 ms | 52.3% bf16 MFU | 1624916 tok/s step 16380/19560 | loss 3.259474 (-1.12z)| norm 0.2998 (+0.21z)| lr 4.11e-05 | 322.66 ms | 52.3% bf16 MFU | 1624915 tok/s step 16381/19560 | loss 3.288079 (-0.46z)| norm 0.3322 (+1.25z)| lr 4.11e-05 | 322.16 ms | 52.4% bf16 MFU | 1625040 tok/s step 16382/19560 | loss 3.340208 (+0.71z)| norm 0.3197 (+0.85z)| lr 4.11e-05 | 322.36 ms | 52.4% bf16 MFU | 1625109 tok/s step 16383/19560 | loss 3.260351 (-1.10z)| norm 0.2702 (-0.74z)| lr 4.11e-05 | 323.23 ms | 52.2% bf16 MFU | 1624955 tok/s step 16384/19560 | loss 3.313259 (+0.09z)| norm 0.2952 (+0.07z)| lr 4.10e-05 | 322.34 ms | 52.4% bf16 MFU | 1625031 tok/s step 16385/19560 | loss 3.313835 (+0.12z)| norm 0.2823 (-0.36z)| lr 4.10e-05 | 322.51 ms | 52.3% bf16 MFU | 1625062 tok/s step 16386/19560 | loss 3.320512 (+0.27z)| norm 0.2863 (-0.23z)| lr 4.10e-05 | 322.57 ms | 52.3% bf16 MFU | 1625075 tok/s step 16387/19560 | loss 3.261185 (-1.08z)| norm 0.3691 (+2.39z)| lr 4.10e-05 | 322.06 ms | 52.4% bf16 MFU | 1625217 tok/s step 16388/19560 | loss 3.287934 (-0.46z)| norm 0.2667 (-0.86z)| lr 4.09e-05 | 322.53 ms | 52.3% bf16 MFU | 1625235 tok/s step 16389/19560 | loss 3.364692 (+1.27z)| norm 0.3583 (+2.03z)| lr 4.09e-05 | 323.28 ms | 52.2% bf16 MFU | 1625061 tok/s step 16390/19560 | loss 3.225638 (-1.85z)| norm 0.2924 (-0.04z)| lr 4.09e-05 | 322.47 ms | 52.3% bf16 MFU | 1625100 tok/s step 16391/19560 | loss 3.328859 (+0.47z)| norm 0.2596 (-1.08z)| lr 4.09e-05 | 323.16 ms | 52.2% bf16 MFU | 1624964 tok/s step 16392/19560 | loss 3.284273 (-0.54z)| norm 0.3065 (+0.40z)| lr 4.08e-05 | 322.83 ms | 52.3% bf16 MFU | 1624916 tok/s step 
16393/19560 | loss 3.242080 (-1.48z)| norm 0.2937 (-0.01z)| lr 4.08e-05 | 322.25 ms | 52.4% bf16 MFU | 1625019 tok/s step 16394/19560 | loss 3.219246 (-1.95z)| norm 0.2599 (-1.07z)| lr 4.08e-05 | 322.67 ms | 52.3% bf16 MFU | 1625010 tok/s step 16395/19560 | loss 3.277968 (-0.65z)| norm 0.2795 (-0.45z)| lr 4.08e-05 | 322.44 ms | 52.3% bf16 MFU | 1625059 tok/s step 16396/19560 | loss 3.275042 (-0.70z)| norm 0.3278 (+1.06z)| lr 4.07e-05 | 323.07 ms | 52.2% bf16 MFU | 1624947 tok/s step 16397/19560 | loss 3.274691 (-0.70z)| norm 0.2783 (-0.49z)| lr 4.07e-05 | 322.62 ms | 52.3% bf16 MFU | 1624955 tok/s step 16398/19560 | loss 3.295630 (-0.24z)| norm 0.2690 (-0.77z)| lr 4.07e-05 | 322.52 ms | 52.3% bf16 MFU | 1624988 tok/s step 16399/19560 | loss 3.311182 (+0.11z)| norm 0.2688 (-0.79z)| lr 4.07e-05 | 322.94 ms | 52.3% bf16 MFU | 1624914 tok/s step 16400/19560 | loss 3.242445 (-1.40z)| norm 0.2843 (-0.29z)| lr 4.06e-05 | 322.43 ms | 52.3% bf16 MFU | 1624970 tok/s step 16401/19560 | loss 3.324846 (+0.42z)| norm 0.2767 (-0.54z)| lr 4.06e-05 | 322.55 ms | 52.3% bf16 MFU | 1624994 tok/s step 16402/19560 | loss 3.316365 (+0.22z)| norm 0.3660 (+2.22z)| lr 4.06e-05 | 322.73 ms | 52.3% bf16 MFU | 1624973 tok/s step 16403/19560 | loss 3.315795 (+0.22z)| norm 0.2794 (-0.47z)| lr 4.06e-05 | 322.89 ms | 52.3% bf16 MFU | 1624910 tok/s step 16404/19560 | loss 3.326316 (+0.45z)| norm 0.2682 (-0.81z)| lr 4.05e-05 | 322.84 ms | 52.3% bf16 MFU | 1624863 tok/s step 16405/19560 | loss 3.307618 (+0.04z)| norm 0.2824 (-0.37z)| lr 4.05e-05 | 322.83 ms | 52.3% bf16 MFU | 1624822 tok/s step 16406/19560 | loss 3.392068 (+1.97z)| norm 0.2925 (-0.04z)| lr 4.05e-05 | 322.69 ms | 52.3% bf16 MFU | 1624818 tok/s step 16407/19560 | loss 3.260496 (-1.02z)| norm 0.2858 (-0.25z)| lr 4.05e-05 | 322.63 ms | 52.3% bf16 MFU | 1624830 tok/s step 16408/19560 | loss 3.364810 (+1.36z)| norm 0.3085 (+0.46z)| lr 4.04e-05 | 322.59 ms | 52.3% bf16 MFU | 1624851 tok/s step 16409/19560 | loss 3.294367 (-0.26z)| norm 
0.3013 (+0.23z)| lr 4.04e-05 | 322.60 ms | 52.3% bf16 MFU | 1624867 tok/s step 16410/19560 | loss 3.308979 (+0.11z)| norm 0.2593 (-1.08z)| lr 4.04e-05 | 322.68 ms | 52.3% bf16 MFU | 1624864 tok/s step 16411/19560 | loss 3.346279 (+1.01z)| norm 0.2735 (-0.62z)| lr 4.04e-05 | 322.62 ms | 52.3% bf16 MFU | 1624875 tok/s step 16412/19560 | loss 3.390920 (+2.04z)| norm 0.3069 (+0.42z)| lr 4.03e-05 | 322.76 ms | 52.3% bf16 MFU | 1624852 tok/s step 16413/19560 | loss 3.266790 (-0.92z)| norm 0.2601 (-1.03z)| lr 4.03e-05 | 322.66 ms | 52.3% bf16 MFU | 1624854 tok/s step 16414/19560 | loss 3.322020 (+0.41z)| norm 0.2571 (-1.11z)| lr 4.03e-05 | 322.82 ms | 52.3% bf16 MFU | 1624814 tok/s step 16415/19560 | loss 3.324240 (+0.47z)| norm 0.3413 (+1.52z)| lr 4.03e-05 | 322.77 ms | 52.3% bf16 MFU | 1624791 tok/s step 16416/19560 | loss 3.316280 (+0.28z)| norm 0.3250 (+1.01z)| lr 4.02e-05 | 322.52 ms | 52.3% bf16 MFU | 1624832 tok/s step 16417/19560 | loss 3.360333 (+1.32z)| norm 0.3058 (+0.40z)| lr 4.02e-05 | 322.41 ms | 52.3% bf16 MFU | 1624897 tok/s step 16418/19560 | loss 3.267096 (-0.94z)| norm 0.3035 (+0.33z)| lr 4.02e-05 | 322.81 ms | 52.3% bf16 MFU | 1624860 tok/s step 16419/19560 | loss 3.275811 (-0.71z)| norm 0.2799 (-0.39z)| lr 4.02e-05 | 322.53 ms | 52.3% bf16 MFU | 1624895 tok/s step 16420/19560 | loss 3.222396 (-1.96z)| norm 0.2895 (-0.09z)| lr 4.01e-05 | 322.17 ms | 52.4% bf16 MFU | 1625018 tok/s step 16421/19560 | loss 3.314935 (+0.25z)| norm 0.2784 (-0.43z)| lr 4.01e-05 | 322.84 ms | 52.3% bf16 MFU | 1624967 tok/s step 16422/19560 | loss 3.271247 (-0.80z)| norm 0.2819 (-0.31z)| lr 4.01e-05 | 322.31 ms | 52.4% bf16 MFU | 1625051 tok/s step 16423/19560 | loss 3.299649 (-0.12z)| norm 0.2766 (-0.49z)| lr 4.01e-05 | 322.73 ms | 52.3% bf16 MFU | 1625025 tok/s step 16424/19560 | loss 3.306262 (+0.05z)| norm 0.2720 (-0.63z)| lr 4.00e-05 | 322.72 ms | 52.3% bf16 MFU | 1625002 tok/s step 16425/19560 | loss 3.360282 (+1.33z)| norm 0.2755 (-0.50z)| lr 4.00e-05 | 322.45 ms | 
52.3% bf16 MFU | 1625050 tok/s step 16426/19560 | loss 3.317640 (+0.31z)| norm 0.2653 (-0.83z)| lr 4.00e-05 | 322.74 ms | 52.3% bf16 MFU | 1625020 tok/s step 16427/19560 | loss 3.321185 (+0.39z)| norm 0.2766 (-0.45z)| lr 4.00e-05 | 322.84 ms | 52.3% bf16 MFU | 1624968 tok/s step 16428/19560 | loss 3.282686 (-0.54z)| norm 0.2670 (-0.76z)| lr 3.99e-05 | 322.72 ms | 52.3% bf16 MFU | 1624950 tok/s step 16429/19560 | loss 3.396554 (+2.14z)| norm 0.3194 (+0.98z)| lr 3.99e-05 | 322.36 ms | 52.4% bf16 MFU | 1625022 tok/s step 16430/19560 | loss 3.196749 (-2.50z)| norm 0.2747 (-0.52z)| lr 3.99e-05 | 323.16 ms | 52.2% bf16 MFU | 1624890 tok/s step 16431/19560 | loss 3.342948 (+0.85z)| norm 0.2716 (-0.62z)| lr 3.99e-05 | 322.84 ms | 52.3% bf16 MFU | 1624845 tok/s step 16432/19560 | loss 3.281123 (-0.57z)| norm 0.2799 (-0.34z)| lr 3.98e-05 | 322.74 ms | 52.3% bf16 MFU | 1624827 tok/s step 16433/19560 | loss 3.390692 (+1.92z)| norm 0.3191 (+0.95z)| lr 3.98e-05 | 322.90 ms | 52.3% bf16 MFU | 1624770 tok/s step 16434/19560 | loss 3.327585 (+0.47z)| norm 0.2671 (-0.79z)| lr 3.98e-05 | 322.46 ms | 52.3% bf16 MFU | 1624827 tok/s step 16435/19560 | loss 3.397954 (+2.07z)| norm 0.2768 (-0.47z)| lr 3.98e-05 | 322.35 ms | 52.4% bf16 MFU | 1624908 tok/s step 16436/19560 | loss 3.275464 (-0.76z)| norm 0.3010 (+0.33z)| lr 3.97e-05 | 322.72 ms | 52.3% bf16 MFU | 1624892 tok/s step 16437/19560 | loss 3.282324 (-0.59z)| norm 0.2755 (-0.53z)| lr 3.97e-05 | 322.64 ms | 52.3% bf16 MFU | 1624897 tok/s step 16438/19560 | loss 3.280129 (-0.64z)| norm 0.2863 (-0.17z)| lr 3.97e-05 | 322.33 ms | 52.4% bf16 MFU | 1624979 tok/s step 16439/19560 | loss 3.297043 (-0.24z)| norm 0.3089 (+0.59z)| lr 3.97e-05 | 323.02 ms | 52.2% bf16 MFU | 1624885 tok/s step 16440/19560 | loss 3.293504 (-0.32z)| norm 0.2714 (-0.69z)| lr 3.96e-05 | 322.50 ms | 52.3% bf16 MFU | 1624925 tok/s step 16441/19560 | loss 3.256910 (-1.15z)| norm 0.2669 (-0.84z)| lr 3.96e-05 | 322.96 ms | 52.3% bf16 MFU | 1624849 tok/s step 16442/19560 
| loss 3.332617 (+0.59z)| norm 0.2945 (+0.09z)| lr 3.96e-05 | 322.78 ms | 52.3% bf16 MFU | 1624821 tok/s step 16443/19560 | loss 3.308899 (+0.03z)| norm 0.2820 (-0.33z)| lr 3.96e-05 | 322.53 ms | 52.3% bf16 MFU | 1624856 tok/s step 16444/19560 | loss 3.268499 (-0.90z)| norm 0.2859 (-0.20z)| lr 3.95e-05 | 322.48 ms | 52.3% bf16 MFU | 1624902 tok/s step 16445/19560 | loss 3.289650 (-0.41z)| norm 0.2723 (-0.67z)| lr 3.95e-05 | 322.35 ms | 52.4% bf16 MFU | 1624979 tok/s step 16446/19560 | loss 3.325938 (+0.43z)| norm 0.2838 (-0.28z)| lr 3.95e-05 | 322.40 ms | 52.3% bf16 MFU | 1625040 tok/s step 16447/19560 | loss 3.303209 (-0.10z)| norm 0.2664 (-0.87z)| lr 3.95e-05 | 322.97 ms | 52.3% bf16 MFU | 1624954 tok/s step 16448/19560 | loss 3.264659 (-0.98z)| norm 0.3180 (+0.88z)| lr 3.94e-05 | 322.72 ms | 52.3% bf16 MFU | 1624936 tok/s step 16449/19560 | loss 3.309604 (+0.07z)| norm 0.2963 (+0.15z)| lr 3.94e-05 | 322.94 ms | 52.3% bf16 MFU | 1624863 tok/s step 16450/19560 | loss 3.321473 (+0.35z)| norm 0.2604 (-1.08z)| lr 3.94e-05 | 322.25 ms | 52.4% bf16 MFU | 1624967 tok/s step 16451/19560 | loss 3.284626 (-0.51z)| norm 0.2522 (-1.33z)| lr 3.94e-05 | 322.56 ms | 52.3% bf16 MFU | 1624988 tok/s step 16452/19560 | loss 3.270386 (-0.83z)| norm 0.2720 (-0.65z)| lr 3.93e-05 | 322.93 ms | 52.3% bf16 MFU | 1624916 tok/s step 16453/19560 | loss 3.246535 (-1.37z)| norm 0.2769 (-0.47z)| lr 3.93e-05 | 322.97 ms | 52.3% bf16 MFU | 1624837 tok/s step 16454/19560 | loss 3.299519 (-0.14z)| norm 0.2918 (+0.05z)| lr 3.93e-05 | 322.53 ms | 52.3% bf16 MFU | 1624872 tok/s step 16455/19560 | loss 3.299080 (-0.15z)| norm 0.2639 (-0.93z)| lr 3.93e-05 | 322.82 ms | 52.3% bf16 MFU | 1624833 tok/s step 16456/19560 | loss 3.330820 (+0.58z)| norm 0.2971 (+0.24z)| lr 3.92e-05 | 322.54 ms | 52.3% bf16 MFU | 1624868 tok/s step 16457/19560 | loss 3.303781 (-0.03z)| norm 0.2571 (-1.17z)| lr 3.92e-05 | 323.33 ms | 52.2% bf16 MFU | 1624700 tok/s step 16458/19560 | loss 3.274190 (-0.72z)| norm 0.2607 (-1.05z)| 
lr 3.92e-05 | 322.40 ms | 52.3% bf16 MFU | 1624776 tok/s step 16459/19560 | loss 3.309096 (+0.10z)| norm 0.2691 (-0.75z)| lr 3.92e-05 | 322.69 ms | 52.3% bf16 MFU | 1624775 tok/s step 16460/19560 | loss 3.342642 (+0.94z)| norm 0.2507 (-1.39z)| lr 3.91e-05 | 322.46 ms | 52.3% bf16 MFU | 1624832 tok/s step 16461/19560 | loss 3.258626 (-1.09z)| norm 0.2508 (-1.37z)| lr 3.91e-05 | 322.71 ms | 52.3% bf16 MFU | 1624823 tok/s step 16462/19560 | loss 3.292964 (-0.26z)| norm 0.2549 (-1.20z)| lr 3.91e-05 | 322.95 ms | 52.3% bf16 MFU | 1624754 tok/s step 16463/19560 | loss 3.297112 (-0.14z)| norm 0.2590 (-1.04z)| lr 3.91e-05 | 322.61 ms | 52.3% bf16 MFU | 1624773 tok/s step 16464/19560 | loss 3.370128 (+1.64z)| norm 0.2733 (-0.53z)| lr 3.90e-05 | 322.99 ms | 52.3% bf16 MFU | 1624697 tok/s step 16465/19560 | loss 3.294873 (-0.21z)| norm 0.2661 (-0.79z)| lr 3.90e-05 | 322.70 ms | 52.3% bf16 MFU | 1624696 tok/s step 16466/19560 | loss 3.332860 (+0.73z)| norm 0.2813 (-0.24z)| lr 3.90e-05 | 322.47 ms | 52.3% bf16 MFU | 1624754 tok/s step 16467/19560 | loss 3.272244 (-0.75z)| norm 0.2691 (-0.68z)| lr 3.90e-05 | 322.89 ms | 52.3% bf16 MFU | 1624703 tok/s step 16468/19560 | loss 3.311790 (+0.21z)| norm 0.2576 (-1.08z)| lr 3.89e-05 | 322.96 ms | 52.3% bf16 MFU | 1624638 tok/s step 16469/19560 | loss 3.311468 (+0.20z)| norm 0.2560 (-1.13z)| lr 3.89e-05 | 322.56 ms | 52.3% bf16 MFU | 1624677 tok/s step 16470/19560 | loss 3.329661 (+0.65z)| norm 0.3165 (+1.03z)| lr 3.89e-05 | 322.24 ms | 52.4% bf16 MFU | 1624794 tok/s step 16471/19560 | loss 3.251581 (-1.27z)| norm 0.2852 (-0.09z)| lr 3.89e-05 | 323.20 ms | 52.2% bf16 MFU | 1624664 tok/s step 16472/19560 | loss 3.340192 (+0.91z)| norm 0.2760 (-0.42z)| lr 3.88e-05 | 323.08 ms | 52.2% bf16 MFU | 1624568 tok/s step 16473/19560 | loss 3.270460 (-0.84z)| norm 0.2541 (-1.19z)| lr 3.88e-05 | 322.74 ms | 52.3% bf16 MFU | 1624564 tok/s step 16474/19560 | loss 3.318864 (+0.37z)| norm 0.2535 (-1.20z)| lr 3.88e-05 | 322.55 ms | 52.3% bf16 MFU | 
1624608 tok/s step 16475/19560 | loss 3.354285 (+1.25z)| norm 0.2496 (-1.32z)| lr 3.88e-05 | 322.41 ms | 52.3% bf16 MFU | 1624685 tok/s step 16476/19560 | loss 3.325433 (+0.51z)| norm 0.2724 (-0.51z)| lr 3.87e-05 | 323.00 ms | 52.3% bf16 MFU | 1624609 tok/s step 16477/19560 | loss 3.296438 (-0.23z)| norm 0.2575 (-1.03z)| lr 3.87e-05 | 322.82 ms | 52.3% bf16 MFU | 1624584 tok/s step 16478/19560 | loss 3.331608 (+0.65z)| norm 0.2965 (+0.33z)| lr 3.87e-05 | 322.62 ms | 52.3% bf16 MFU | 1624609 tok/s step 16479/19560 | loss 3.328317 (+0.56z)| norm 0.2697 (-0.60z)| lr 3.87e-05 | 322.64 ms | 52.3% bf16 MFU | 1624627 tok/s step 16480/19560 | loss 3.288515 (-0.45z)| norm 0.3114 (+0.88z)| lr 3.86e-05 | 322.13 ms | 52.4% bf16 MFU | 1624775 tok/s step 16481/19560 | loss 3.252731 (-1.34z)| norm 0.2700 (-0.59z)| lr 3.86e-05 | 322.91 ms | 52.3% bf16 MFU | 1624719 tok/s step 16482/19560 | loss 3.286260 (-0.48z)| norm 0.2720 (-0.51z)| lr 3.86e-05 | 323.28 ms | 52.2% bf16 MFU | 1624571 tok/s step 16483/19560 | loss 3.308676 (+0.07z)| norm 0.2932 (+0.26z)| lr 3.86e-05 | 323.02 ms | 52.2% bf16 MFU | 1624496 tok/s step 16484/19560 | loss 3.271304 (-0.88z)| norm 0.3187 (+1.17z)| lr 3.86e-05 | 322.58 ms | 52.3% bf16 MFU | 1624535 tok/s step 16485/19560 | loss 3.330052 (+0.61z)| norm 0.2632 (-0.83z)| lr 3.85e-05 | 322.63 ms | 52.3% bf16 MFU | 1624560 tok/s step 16486/19560 | loss 3.228797 (-1.93z)| norm 0.2821 (-0.15z)| lr 3.85e-05 | 322.97 ms | 52.3% bf16 MFU | 1624498 tok/s step 16487/19560 | loss 3.362646 (+1.41z)| norm 0.3129 (+0.96z)| lr 3.85e-05 | 322.72 ms | 52.3% bf16 MFU | 1624502 tok/s step 16488/19560 | loss 3.294317 (-0.29z)| norm 0.2560 (-1.10z)| lr 3.85e-05 | 322.96 ms | 52.3% bf16 MFU | 1624446 tok/s step 16489/19560 | loss 3.327762 (+0.55z)| norm 0.2909 (+0.16z)| lr 3.84e-05 | 322.76 ms | 52.3% bf16 MFU | 1624443 tok/s step 16490/19560 | loss 3.266669 (-0.96z)| norm 0.2823 (-0.14z)| lr 3.84e-05 | 322.88 ms | 52.3% bf16 MFU | 1624409 tok/s step 16491/19560 | loss 3.307388 
(+0.06z)| norm 0.2510 (-1.26z)| lr 3.84e-05 | 323.04 ms | 52.2% bf16 MFU | 1624338 tok/s step 16492/19560 | loss 3.280939 (-0.60z)| norm 0.2671 (-0.66z)| lr 3.84e-05 | 323.17 ms | 52.2% bf16 MFU | 1624239 tok/s step 16493/19560 | loss 3.263504 (-1.03z)| norm 0.2531 (-1.15z)| lr 3.83e-05 | 323.35 ms | 52.2% bf16 MFU | 1624099 tok/s step 16494/19560 | loss 3.311908 (+0.17z)| norm 0.2369 (-1.71z)| lr 3.83e-05 | 322.98 ms | 52.3% bf16 MFU | 1624059 tok/s step 16495/19560 | loss 3.293109 (-0.31z)| norm 0.2797 (-0.18z)| lr 3.83e-05 | 322.82 ms | 52.3% bf16 MFU | 1624061 tok/s step 16496/19560 | loss 3.256456 (-1.21z)| norm 0.2773 (-0.26z)| lr 3.83e-05 | 323.02 ms | 52.2% bf16 MFU | 1624012 tok/s step 16497/19560 | loss 3.292933 (-0.29z)| norm 0.2609 (-0.87z)| lr 3.82e-05 | 322.78 ms | 52.3% bf16 MFU | 1624027 tok/s step 16498/19560 | loss 3.364423 (+1.49z)| norm 0.2619 (-0.82z)| lr 3.82e-05 | 322.89 ms | 52.3% bf16 MFU | 1624011 tok/s step 16499/19560 | loss 3.410125 (+2.57z)| norm 0.3765 (+3.41z)| lr 3.82e-05 | 322.88 ms | 52.3% bf16 MFU | 1624000 tok/s step 16500/19560 | loss 3.267131 (-0.92z)| norm 0.3113 (+1.03z)| lr 3.82e-05 | 323.15 ms | 52.2% bf16 MFU | 1623921 tok/s val loss 3.286502 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating HellaSwag: 70/79 HellaSwag: 3027/10042 = 0.301434 step 16501/19560 | loss 3.353061 (+1.21z)| norm 0.2855 (+0.07z)| lr 3.81e-05 | 321.89 ms | 52.4% bf16 MFU | 1624163 tok/s step 16502/19560 | loss 3.272374 (-0.78z)| norm 0.2941 (+0.38z)| lr 3.81e-05 | 322.46 ms | 52.3% bf16 MFU | 1624250 tok/s step 16503/19560 | loss 3.303399 (-0.01z)| norm 0.3133 (+1.10z)| lr 3.81e-05 | 322.19 ms | 52.4% bf16 MFU | 1624400 tok/s step 16504/19560 | loss 3.229595 (-1.81z)| norm 0.2539 (-1.12z)| lr 3.81e-05 | 323.01 ms | 52.3% bf16 MFU | 1624338 tok/s step 16505/19560 | loss 3.315353 (+0.30z)| norm 
0.3163 (+1.20z)| lr 3.80e-05 | 322.56 ms | 52.3% bf16 MFU | 1624391 tok/s step 16506/19560 | loss 3.274426 (-0.69z)| norm 0.2924 (+0.35z)| lr 3.80e-05 | 322.45 ms | 52.3% bf16 MFU | 1624469 tok/s step 16507/19560 | loss 3.313882 (+0.29z)| norm 0.3035 (+0.78z)| lr 3.80e-05 | 323.28 ms | 52.2% bf16 MFU | 1624336 tok/s step 16508/19560 | loss 3.358336 (+1.39z)| norm 0.2958 (+0.48z)| lr 3.80e-05 | 322.77 ms | 52.3% bf16 MFU | 1624335 tok/s step 16509/19560 | loss 3.305189 (+0.05z)| norm 0.2708 (-0.49z)| lr 3.79e-05 | 323.23 ms | 52.2% bf16 MFU | 1624219 tok/s step 16510/19560 | loss 3.261764 (-1.02z)| norm 0.3169 (+1.34z)| lr 3.79e-05 | 322.51 ms | 52.3% bf16 MFU | 1624290 tok/s step 16511/19560 | loss 3.218077 (-2.09z)| norm 0.2705 (-0.50z)| lr 3.79e-05 | 322.96 ms | 52.3% bf16 MFU | 1624245 tok/s step 16512/19560 | loss 3.318540 (+0.40z)| norm 0.3011 (+0.71z)| lr 3.79e-05 | 323.02 ms | 52.2% bf16 MFU | 1624188 tok/s step 16513/19560 | loss 3.266262 (-0.88z)| norm 0.2987 (+0.61z)| lr 3.78e-05 | 322.59 ms | 52.3% bf16 MFU | 1624241 tok/s step 16514/19560 | loss 3.296078 (-0.14z)| norm 0.3008 (+0.69z)| lr 3.78e-05 | 323.06 ms | 52.2% bf16 MFU | 1624173 tok/s step 16515/19560 | loss 3.296451 (-0.14z)| norm 0.3088 (+1.07z)| lr 3.78e-05 | 322.98 ms | 52.3% bf16 MFU | 1624128 tok/s step 16516/19560 | loss 3.311029 (+0.22z)| norm 0.2730 (-0.42z)| lr 3.78e-05 | 322.49 ms | 52.3% bf16 MFU | 1624209 tok/s step 16517/19560 | loss 3.256229 (-1.12z)| norm 0.2991 (+0.71z)| lr 3.77e-05 | 322.74 ms | 52.3% bf16 MFU | 1624222 tok/s step 16518/19560 | loss 3.306861 (+0.13z)| norm 0.3240 (+1.75z)| lr 3.77e-05 | 322.79 ms | 52.3% bf16 MFU | 1624222 tok/s step 16519/19560 | loss 3.294796 (-0.17z)| norm 0.2965 (+0.57z)| lr 3.77e-05 | 323.20 ms | 52.2% bf16 MFU | 1624120 tok/s step 16520/19560 | loss 3.318474 (+0.42z)| norm 0.3265 (+1.83z)| lr 3.77e-05 | 322.63 ms | 52.3% bf16 MFU | 1624167 tok/s step 16521/19560 | loss 3.280993 (-0.54z)| norm 0.2834 (+0.01z)| lr 3.76e-05 | 322.95 ms | 
52.3% bf16 MFU | 1624131 tok/s step 16522/19560 | loss 3.316628 (+0.36z)| norm 0.3271 (+1.82z)| lr 3.76e-05 | 322.72 ms | 52.3% bf16 MFU | 1624154 tok/s step 16523/19560 | loss 3.309124 (+0.15z)| norm 0.3294 (+1.88z)| lr 3.76e-05 | 322.83 ms | 52.3% bf16 MFU | 1624150 tok/s step 16524/19560 | loss 3.283027 (-0.53z)| norm 0.2637 (-0.83z)| lr 3.76e-05 | 323.23 ms | 52.2% bf16 MFU | 1624045 tok/s step 16525/19560 | loss 3.326499 (+0.60z)| norm 0.2812 (-0.10z)| lr 3.76e-05 | 322.84 ms | 52.3% bf16 MFU | 1624041 tok/s step 16526/19560 | loss 3.270221 (-0.87z)| norm 0.2696 (-0.58z)| lr 3.75e-05 | 322.83 ms | 52.3% bf16 MFU | 1624041 tok/s step 16527/19560 | loss 3.256829 (-1.20z)| norm 0.3315 (+1.96z)| lr 3.75e-05 | 322.88 ms | 52.3% bf16 MFU | 1624029 tok/s step 16528/19560 | loss 3.289734 (-0.36z)| norm 0.2644 (-0.81z)| lr 3.75e-05 | 324.50 ms | 52.0% bf16 MFU | 1623611 tok/s step 16529/19560 | loss 3.301583 (-0.04z)| norm 0.2732 (-0.44z)| lr 3.75e-05 | 322.39 ms | 52.4% bf16 MFU | 1623743 tok/s step 16530/19560 | loss 3.250686 (-1.35z)| norm 0.2420 (-1.74z)| lr 3.74e-05 | 322.76 ms | 52.3% bf16 MFU | 1623775 tok/s step 16531/19560 | loss 3.220983 (-2.07z)| norm 0.2517 (-1.31z)| lr 3.74e-05 | 323.74 ms | 52.1% bf16 MFU | 1623560 tok/s step 16532/19560 | loss 3.326568 (+0.63z)| norm 0.2878 (+0.21z)| lr 3.74e-05 | 322.75 ms | 52.3% bf16 MFU | 1623603 tok/s step 16533/19560 | loss 3.250680 (-1.29z)| norm 0.2558 (-1.13z)| lr 3.74e-05 | 322.74 ms | 52.3% bf16 MFU | 1623647 tok/s step 16534/19560 | loss 3.298177 (-0.07z)| norm 0.2713 (-0.47z)| lr 3.73e-05 | 322.93 ms | 52.3% bf16 MFU | 1623640 tok/s step 16535/19560 | loss 3.303713 (+0.07z)| norm 0.2724 (-0.42z)| lr 3.73e-05 | 323.00 ms | 52.3% bf16 MFU | 1623618 tok/s step 16536/19560 | loss 3.292826 (-0.20z)| norm 0.2630 (-0.80z)| lr 3.73e-05 | 322.70 ms | 52.3% bf16 MFU | 1623672 tok/s step 16537/19560 | loss 3.314643 (+0.37z)| norm 0.3177 (+1.49z)| lr 3.73e-05 | 322.88 ms | 52.3% bf16 MFU | 1623676 tok/s step 16538/19560 
| loss 3.260071 (-1.06z)| norm 0.3013 (+0.79z)| lr 3.72e-05 | 322.97 ms | 52.3% bf16 MFU | 1623660 tok/s step 16539/19560 | loss 3.246863 (-1.38z)| norm 0.2716 (-0.46z)| lr 3.72e-05 | 322.61 ms | 52.3% bf16 MFU | 1623734 tok/s step 16540/19560 | loss 3.241768 (-1.50z)| norm 0.3266 (+1.82z)| lr 3.72e-05 | 322.72 ms | 52.3% bf16 MFU | 1623777 tok/s step 16541/19560 | loss 3.353046 (+1.42z)| norm 0.2651 (-0.73z)| lr 3.72e-05 | 322.69 ms | 52.3% bf16 MFU | 1623825 tok/s step 16542/19560 | loss 3.285610 (-0.35z)| norm 0.2566 (-1.09z)| lr 3.71e-05 | 322.85 ms | 52.3% bf16 MFU | 1623831 tok/s step 16543/19560 | loss 3.373628 (+1.94z)| norm 0.3368 (+2.26z)| lr 3.71e-05 | 322.68 ms | 52.3% bf16 MFU | 1623879 tok/s step 16544/19560 | loss 3.249356 (-1.28z)| norm 0.3138 (+1.31z)| lr 3.71e-05 | 322.59 ms | 52.3% bf16 MFU | 1623948 tok/s step 16545/19560 | loss 3.315064 (+0.44z)| norm 0.2845 (+0.09z)| lr 3.71e-05 | 322.87 ms | 52.3% bf16 MFU | 1623942 tok/s step 16546/19560 | loss 3.268363 (-0.78z)| norm 0.4091 (+4.81z)| lr 3.70e-05 | 322.74 ms | 52.3% bf16 MFU | 1623969 tok/s step 16547/19560 | loss 3.269025 (-0.76z)| norm 0.2707 (-0.48z)| lr 3.70e-05 | 322.97 ms | 52.3% bf16 MFU | 1623937 tok/s step 16548/19560 | loss 3.202594 (-2.48z)| norm 0.3339 (+1.90z)| lr 3.70e-05 | 322.68 ms | 52.3% bf16 MFU | 1623979 tok/s step 16549/19560 | loss 3.365153 (+1.71z)| norm 0.3526 (+2.52z)| lr 3.70e-05 | 323.00 ms | 52.3% bf16 MFU | 1623940 tok/s step 16550/19560 | loss 3.316312 (+0.45z)| norm 0.2731 (-0.40z)| lr 3.69e-05 | 322.96 ms | 52.3% bf16 MFU | 1623911 tok/s step 16551/19560 | loss 3.291386 (-0.19z)| norm 0.3458 (+2.21z)| lr 3.69e-05 | 322.89 ms | 52.3% bf16 MFU | 1623902 tok/s step 16552/19560 | loss 3.264408 (-0.87z)| norm 0.3101 (+0.91z)| lr 3.69e-05 | 323.16 ms | 52.2% bf16 MFU | 1623826 tok/s step 16553/19560 | loss 3.277920 (-0.51z)| norm 0.2805 (-0.16z)| lr 3.69e-05 | 323.09 ms | 52.2% bf16 MFU | 1623771 tok/s step 16554/19560 | loss 3.293404 (-0.11z)| norm 0.3375 (+1.85z)| 
lr 3.69e-05 | 322.37 ms | 52.4% bf16 MFU | 1623900 tok/s step 16555/19560 | loss 3.311908 (+0.37z)| norm 0.2886 (+0.11z)| lr 3.68e-05 | 322.47 ms | 52.3% bf16 MFU | 1623999 tok/s step 16556/19560 | loss 3.301979 (+0.11z)| norm 0.2620 (-0.83z)| lr 3.68e-05 | 322.95 ms | 52.3% bf16 MFU | 1623969 tok/s step 16557/19560 | loss 3.316792 (+0.52z)| norm 0.2756 (-0.34z)| lr 3.68e-05 | 322.92 ms | 52.3% bf16 MFU | 1623949 tok/s step 16558/19560 | loss 3.317091 (+0.52z)| norm 0.3376 (+1.83z)| lr 3.68e-05 | 322.67 ms | 52.3% bf16 MFU | 1623994 tok/s step 16559/19560 | loss 3.246904 (-1.37z)| norm 0.2691 (-0.58z)| lr 3.67e-05 | 322.65 ms | 52.3% bf16 MFU | 1624041 tok/s step 16560/19560 | loss 3.297428 (+0.00z)| norm 0.3467 (+2.09z)| lr 3.67e-05 | 322.81 ms | 52.3% bf16 MFU | 1624046 tok/s step 16561/19560 | loss 3.256197 (-1.12z)| norm 0.2848 (-0.04z)| lr 3.67e-05 | 321.82 ms | 52.4% bf16 MFU | 1624301 tok/s step 16562/19560 | loss 3.356767 (+1.66z)| norm 0.2702 (-0.55z)| lr 3.67e-05 | 322.72 ms | 52.3% bf16 MFU | 1624316 tok/s step 16563/19560 | loss 3.273039 (-0.64z)| norm 0.2703 (-0.54z)| lr 3.66e-05 | 322.56 ms | 52.3% bf16 MFU | 1624370 tok/s step 16564/19560 | loss 3.323797 (+0.79z)| norm 0.3235 (+1.30z)| lr 3.66e-05 | 322.62 ms | 52.3% bf16 MFU | 1624405 tok/s step 16565/19560 | loss 3.220138 (-2.11z)| norm 0.2767 (-0.32z)| lr 3.66e-05 | 322.75 ms | 52.3% bf16 MFU | 1624406 tok/s step 16566/19560 | loss 3.308892 (+0.37z)| norm 0.3075 (+0.73z)| lr 3.66e-05 | 322.58 ms | 52.3% bf16 MFU | 1624450 tok/s step 16567/19560 | loss 3.319657 (+0.66z)| norm 0.3038 (+0.61z)| lr 3.65e-05 | 322.69 ms | 52.3% bf16 MFU | 1624465 tok/s step 16568/19560 | loss 3.334323 (+1.06z)| norm 0.2562 (-1.03z)| lr 3.65e-05 | 322.27 ms | 52.4% bf16 MFU | 1624583 tok/s step 16569/19560 | loss 3.259446 (-1.02z)| norm 0.2770 (-0.31z)| lr 3.65e-05 | 322.74 ms | 52.3% bf16 MFU | 1624578 tok/s step 16570/19560 | loss 3.321398 (+0.70z)| norm 0.2696 (-0.56z)| lr 3.65e-05 | 322.79 ms | 52.3% bf16 MFU | 
1624562 tok/s step 16571/19560 | loss 3.309790 (+0.38z)| norm 0.2840 (-0.07z)| lr 3.64e-05 | 322.99 ms | 52.3% bf16 MFU | 1624496 tok/s step 16572/19560 | loss 3.310837 (+0.40z)| norm 0.2499 (-1.23z)| lr 3.64e-05 | 322.77 ms | 52.3% bf16 MFU | 1624489 tok/s step 16573/19560 | loss 3.346661 (+1.38z)| norm 0.2662 (-0.67z)| lr 3.64e-05 | 322.81 ms | 52.3% bf16 MFU | 1624470 tok/s step 16574/19560 | loss 3.312372 (+0.43z)| norm 0.2635 (-0.75z)| lr 3.64e-05 | 322.69 ms | 52.3% bf16 MFU | 1624485 tok/s step 16575/19560 | loss 3.343003 (+1.26z)| norm 0.2571 (-0.96z)| lr 3.64e-05 | 322.56 ms | 52.3% bf16 MFU | 1624531 tok/s step 16576/19560 | loss 3.277262 (-0.55z)| norm 0.2550 (-1.02z)| lr 3.63e-05 | 322.69 ms | 52.3% bf16 MFU | 1624541 tok/s step 16577/19560 | loss 3.306430 (+0.25z)| norm 0.2638 (-0.71z)| lr 3.63e-05 | 322.78 ms | 52.3% bf16 MFU | 1624528 tok/s step 16578/19560 | loss 3.308053 (+0.30z)| norm 0.2751 (-0.33z)| lr 3.63e-05 | 323.05 ms | 52.2% bf16 MFU | 1624448 tok/s step 16579/19560 | loss 3.292293 (-0.14z)| norm 0.2607 (-0.82z)| lr 3.63e-05 | 322.68 ms | 52.3% bf16 MFU | 1624464 tok/s step 16580/19560 | loss 3.243430 (-1.47z)| norm 0.2729 (-0.41z)| lr 3.62e-05 | 322.51 ms | 52.3% bf16 MFU | 1624523 tok/s step 16581/19560 | loss 3.287862 (-0.26z)| norm 0.2701 (-0.50z)| lr 3.62e-05 | 323.17 ms | 52.2% bf16 MFU | 1624412 tok/s step 16582/19560 | loss 3.309161 (+0.33z)| norm 0.2702 (-0.49z)| lr 3.62e-05 | 322.67 ms | 52.3% bf16 MFU | 1624434 tok/s step 16583/19560 | loss 3.361148 (+1.73z)| norm 0.2642 (-0.70z)| lr 3.62e-05 | 322.60 ms | 52.3% bf16 MFU | 1624471 tok/s step 16584/19560 | loss 3.261962 (-0.97z)| norm 0.2864 (+0.06z)| lr 3.61e-05 | 322.89 ms | 52.3% bf16 MFU | 1624433 tok/s step 16585/19560 | loss 3.280417 (-0.46z)| norm 0.2716 (-0.45z)| lr 3.61e-05 | 322.47 ms | 52.3% bf16 MFU | 1624504 tok/s step 16586/19560 | loss 3.335760 (+1.04z)| norm 0.2596 (-0.86z)| lr 3.61e-05 | 322.59 ms | 52.3% bf16 MFU | 1624541 tok/s step 16587/19560 | loss 3.282659 
(-0.40z)| norm 0.2604 (-0.83z)| lr 3.61e-05 | 322.70 ms | 52.3% bf16 MFU | 1624549 tok/s step 16588/19560 | loss 3.312972 (+0.43z)| norm 0.3080 (+0.79z)| lr 3.60e-05 | 322.67 ms | 52.3% bf16 MFU | 1624563 tok/s step 16589/19560 | loss 3.285599 (-0.32z)| norm 0.2482 (-1.26z)| lr 3.60e-05 | 321.91 ms | 52.4% bf16 MFU | 1624769 tok/s step 16590/19560 | loss 3.337682 (+1.09z)| norm 0.2938 (+0.29z)| lr 3.60e-05 | 322.18 ms | 52.4% bf16 MFU | 1624897 tok/s step 16591/19560 | loss 3.251412 (-1.25z)| norm 0.3243 (+1.32z)| lr 3.60e-05 | 323.09 ms | 52.2% bf16 MFU | 1624788 tok/s step 16592/19560 | loss 3.264926 (-0.87z)| norm 0.2617 (-0.82z)| lr 3.59e-05 | 322.57 ms | 52.3% bf16 MFU | 1624817 tok/s step 16593/19560 | loss 3.276203 (-0.56z)| norm 0.3498 (+2.13z)| lr 3.59e-05 | 322.29 ms | 52.4% bf16 MFU | 1624914 tok/s step 16594/19560 | loss 3.289661 (-0.18z)| norm 0.2831 (-0.11z)| lr 3.59e-05 | 322.79 ms | 52.3% bf16 MFU | 1624882 tok/s step 16595/19560 | loss 3.354294 (+1.58z)| norm 0.2901 (+0.12z)| lr 3.59e-05 | 322.85 ms | 52.3% bf16 MFU | 1624835 tok/s step 16596/19560 | loss 3.329592 (+0.89z)| norm 0.3869 (+3.22z)| lr 3.59e-05 | 323.13 ms | 52.2% bf16 MFU | 1624720 tok/s step 16597/19560 | loss 3.259903 (-1.00z)| norm 0.2764 (-0.37z)| lr 3.58e-05 | 322.29 ms | 52.4% bf16 MFU | 1624821 tok/s step 16598/19560 | loss 3.273523 (-0.61z)| norm 0.2758 (-0.38z)| lr 3.58e-05 | 322.82 ms | 52.3% bf16 MFU | 1624785 tok/s step 16599/19560 | loss 3.338894 (+1.15z)| norm 0.3501 (+2.00z)| lr 3.58e-05 | 323.52 ms | 52.2% bf16 MFU | 1624573 tok/s step 16600/19560 | loss 3.306998 (+0.29z)| norm 0.3154 (+0.87z)| lr 3.58e-05 | 322.45 ms | 52.3% bf16 MFU | 1624642 tok/s step 16601/19560 | loss 3.319684 (+0.63z)| norm 0.2714 (-0.54z)| lr 3.57e-05 | 322.95 ms | 52.3% bf16 MFU | 1624582 tok/s step 16602/19560 | loss 3.311470 (+0.40z)| norm 0.3461 (+1.82z)| lr 3.57e-05 | 322.95 ms | 52.3% bf16 MFU | 1624525 tok/s step 16603/19560 | loss 3.306660 (+0.28z)| norm 0.2916 (+0.07z)| lr 3.57e-05 | 
322.70 ms | 52.3% bf16 MFU | 1624534 tok/s step 16604/19560 | loss 3.309899 (+0.38z)| norm 0.2882 (-0.04z)| lr 3.57e-05 | 322.87 ms | 52.3% bf16 MFU | 1624498 tok/s step 16605/19560 | loss 3.256585 (-1.09z)| norm 0.3326 (+1.36z)| lr 3.56e-05 | 322.72 ms | 52.3% bf16 MFU | 1624502 tok/s step 16606/19560 | loss 3.282332 (-0.37z)| norm 0.2964 (+0.20z)| lr 3.56e-05 | 322.51 ms | 52.3% bf16 MFU | 1624558 tok/s step 16607/19560 | loss 3.348253 (+1.45z)| norm 0.3025 (+0.39z)| lr 3.56e-05 | 322.69 ms | 52.3% bf16 MFU | 1624567 tok/s step 16608/19560 | loss 3.354650 (+1.60z)| norm 0.2966 (+0.20z)| lr 3.56e-05 | 322.65 ms | 52.3% bf16 MFU | 1624586 tok/s step 16609/19560 | loss 3.288611 (-0.22z)| norm 0.2851 (-0.17z)| lr 3.55e-05 | 322.82 ms | 52.3% bf16 MFU | 1624562 tok/s step 16610/19560 | loss 3.310118 (+0.37z)| norm 0.3178 (+0.87z)| lr 3.55e-05 | 322.75 ms | 52.3% bf16 MFU | 1624557 tok/s step 16611/19560 | loss 3.306739 (+0.28z)| norm 0.2724 (-0.58z)| lr 3.55e-05 | 323.12 ms | 52.2% bf16 MFU | 1624457 tok/s step 16612/19560 | loss 3.270181 (-0.73z)| norm 0.2642 (-0.83z)| lr 3.55e-05 | 322.63 ms | 52.3% bf16 MFU | 1624486 tok/s step 16613/19560 | loss 3.300500 (+0.11z)| norm 0.2848 (-0.18z)| lr 3.55e-05 | 322.37 ms | 52.4% bf16 MFU | 1624579 tok/s step 16614/19560 | loss 3.394133 (+2.62z)| norm 0.2734 (-0.54z)| lr 3.54e-05 | 322.96 ms | 52.3% bf16 MFU | 1624519 tok/s step 16615/19560 | loss 3.294441 (-0.08z)| norm 0.2615 (-0.91z)| lr 3.54e-05 | 323.30 ms | 52.2% bf16 MFU | 1624378 tok/s step 16616/19560 | loss 3.301679 (+0.12z)| norm 0.2564 (-1.07z)| lr 3.54e-05 | 322.31 ms | 52.4% bf16 MFU | 1624492 tok/s step 16617/19560 | loss 3.272069 (-0.68z)| norm 0.2530 (-1.17z)| lr 3.54e-05 | 322.94 ms | 52.3% bf16 MFU | 1624441 tok/s step 16618/19560 | loss 3.273871 (-0.64z)| norm 0.2750 (-0.46z)| lr 3.53e-05 | 322.53 ms | 52.3% bf16 MFU | 1624496 tok/s step 16619/19560 | loss 3.259847 (-1.01z)| norm 0.2812 (-0.27z)| lr 3.53e-05 | 322.55 ms | 52.3% bf16 MFU | 1624544 tok/s step 
16620/19560 | loss 3.375793 (+2.12z)| norm 0.3157 (+0.82z)| lr 3.53e-05 | 323.09 ms | 52.2% bf16 MFU | 1624454 tok/s step 16621/19560 | loss 3.315938 (+0.49z)| norm 0.2894 (-0.03z)| lr 3.53e-05 | 322.43 ms | 52.3% bf16 MFU | 1624535 tok/s step 16622/19560 | loss 3.249550 (-1.28z)| norm 0.2507 (-1.29z)| lr 3.52e-05 | 323.09 ms | 52.2% bf16 MFU | 1624446 tok/s step 16623/19560 | loss 3.267504 (-0.79z)| norm 0.2629 (-0.89z)| lr 3.52e-05 | 322.97 ms | 52.3% bf16 MFU | 1624390 tok/s step 16624/19560 | loss 3.291410 (-0.16z)| norm 0.3185 (+0.90z)| lr 3.52e-05 | 322.54 ms | 52.3% bf16 MFU | 1624444 tok/s step 16625/19560 | loss 3.306515 (+0.25z)| norm 0.2537 (-1.19z)| lr 3.52e-05 | 322.68 ms | 52.3% bf16 MFU | 1624462 tok/s step 16626/19560 | loss 3.392040 (+2.52z)| norm 0.2798 (-0.35z)| lr 3.51e-05 | 322.74 ms | 52.3% bf16 MFU | 1624462 tok/s step 16627/19560 | loss 3.302546 (+0.16z)| norm 0.2728 (-0.57z)| lr 3.51e-05 | 323.51 ms | 52.2% bf16 MFU | 1624269 tok/s step 16628/19560 | loss 3.273297 (-0.65z)| norm 0.2840 (-0.19z)| lr 3.51e-05 | 322.48 ms | 52.3% bf16 MFU | 1624346 tok/s step 16629/19560 | loss 3.355146 (+1.61z)| norm 0.2906 (+0.03z)| lr 3.51e-05 | 322.39 ms | 52.4% bf16 MFU | 1624441 tok/s step 16630/19560 | loss 3.308299 (+0.31z)| norm 0.3383 (+1.59z)| lr 3.51e-05 | 323.10 ms | 52.2% bf16 MFU | 1624354 tok/s step 16631/19560 | loss 3.354973 (+1.58z)| norm 0.3050 (+0.49z)| lr 3.50e-05 | 322.56 ms | 52.3% bf16 MFU | 1624406 tok/s step 16632/19560 | loss 3.307720 (+0.27z)| norm 0.2543 (-1.18z)| lr 3.50e-05 | 322.63 ms | 52.3% bf16 MFU | 1624437 tok/s step 16633/19560 | loss 3.251255 (-1.28z)| norm 0.2737 (-0.53z)| lr 3.50e-05 | 322.64 ms | 52.3% bf16 MFU | 1624465 tok/s step 16634/19560 | loss 3.331100 (+0.91z)| norm 0.2937 (+0.13z)| lr 3.50e-05 | 323.76 ms | 52.1% bf16 MFU | 1624211 tok/s step 16635/19560 | loss 3.308674 (+0.29z)| norm 0.2676 (-0.72z)| lr 3.49e-05 | 322.80 ms | 52.3% bf16 MFU | 1624211 tok/s step 16636/19560 | loss 3.237132 (-1.65z)| norm 
0.3027 (+0.44z)| lr 3.49e-05 | 323.03 ms | 52.2% bf16 MFU | 1624152 tok/s step 16637/19560 | loss 3.247070 (-1.36z)| norm 0.2687 (-0.69z)| lr 3.49e-05 | 323.12 ms | 52.2% bf16 MFU | 1624072 tok/s step 16638/19560 | loss 3.285245 (-0.32z)| norm 0.2570 (-1.06z)| lr 3.49e-05 | 322.73 ms | 52.3% bf16 MFU | 1624096 tok/s step 16639/19560 | loss 3.268892 (-0.79z)| norm 0.2433 (-1.49z)| lr 3.48e-05 | 322.90 ms | 52.3% bf16 MFU | 1624074 tok/s step 16640/19560 | loss 3.262554 (-0.95z)| norm 0.2899 (+0.04z)| lr 3.48e-05 | 323.09 ms | 52.2% bf16 MFU | 1624008 tok/s step 16641/19560 | loss 3.295419 (-0.04z)| norm 0.2874 (-0.04z)| lr 3.48e-05 | 322.25 ms | 52.4% bf16 MFU | 1624157 tok/s step 16642/19560 | loss 3.267302 (-0.82z)| norm 0.2637 (-0.81z)| lr 3.48e-05 | 322.65 ms | 52.3% bf16 MFU | 1624195 tok/s step 16643/19560 | loss 3.307353 (+0.29z)| norm 0.2592 (-0.94z)| lr 3.47e-05 | 322.88 ms | 52.3% bf16 MFU | 1624174 tok/s step 16644/19560 | loss 3.510494 (+5.24z)| norm 0.2808 (-0.24z)| lr 3.47e-05 | 322.78 ms | 52.3% bf16 MFU | 1624180 tok/s step 16645/19560 | loss 3.295579 (-0.08z)| norm 0.2781 (-0.32z)| lr 3.47e-05 | 322.81 ms | 52.3% bf16 MFU | 1624178 tok/s step 16646/19560 | loss 3.627929 (+6.58z)| norm 0.3651 (+2.47z)| lr 3.47e-05 | 322.93 ms | 52.3% bf16 MFU | 1624147 tok/s step 16647/19560 | loss 3.264340 (-0.74z)| norm 0.3140 (+0.82z)| lr 3.47e-05 | 322.87 ms | 52.3% bf16 MFU | 1624131 tok/s step 16648/19560 | loss 3.238213 (-1.24z)| norm 0.2874 (-0.02z)| lr 3.46e-05 | 323.68 ms | 52.1% bf16 MFU | 1623913 tok/s step 16649/19560 | loss 3.325856 (+0.50z)| norm 0.2690 (-0.61z)| lr 3.46e-05 | 322.57 ms | 52.3% bf16 MFU | 1623984 tok/s step 16650/19560 | loss 3.246291 (-1.07z)| norm 0.2541 (-1.07z)| lr 3.46e-05 | 322.90 ms | 52.3% bf16 MFU | 1623968 tok/s step 16651/19560 | loss 3.291696 (-0.17z)| norm 0.2857 (-0.04z)| lr 3.46e-05 | 322.91 ms | 52.3% bf16 MFU | 1623952 tok/s step 16652/19560 | loss 3.401072 (+1.96z)| norm 0.3258 (+1.24z)| lr 3.45e-05 | 322.97 ms | 
52.3% bf16 MFU | 1623922 tok/s step 16653/19560 | loss 3.348315 (+0.92z)| norm 0.2970 (+0.30z)| lr 3.45e-05 | 322.71 ms | 52.3% bf16 MFU | 1623957 tok/s step 16654/19560 | loss 3.320825 (+0.38z)| norm 0.3258 (+1.21z)| lr 3.45e-05 | 322.92 ms | 52.3% bf16 MFU | 1623940 tok/s step 16655/19560 | loss 3.309603 (+0.15z)| norm 0.2830 (-0.15z)| lr 3.45e-05 | 322.95 ms | 52.3% bf16 MFU | 1623915 tok/s step 16656/19560 | loss 3.318171 (+0.31z)| norm 0.2672 (-0.66z)| lr 3.44e-05 | 322.95 ms | 52.3% bf16 MFU | 1623892 tok/s step 16657/19560 | loss 3.276621 (-0.50z)| norm 0.2824 (-0.17z)| lr 3.44e-05 | 322.64 ms | 52.3% bf16 MFU | 1623947 tok/s step 16658/19560 | loss 3.288654 (-0.27z)| norm 0.2589 (-0.94z)| lr 3.44e-05 | 323.11 ms | 52.2% bf16 MFU | 1623880 tok/s step 16659/19560 | loss 3.250624 (-1.03z)| norm 0.2747 (-0.44z)| lr 3.44e-05 | 322.86 ms | 52.3% bf16 MFU | 1623881 tok/s step 16660/19560 | loss 3.262854 (-0.77z)| norm 0.2720 (-0.52z)| lr 3.44e-05 | 323.22 ms | 52.2% bf16 MFU | 1623790 tok/s step 16661/19560 | loss 3.273935 (-0.56z)| norm 0.2626 (-0.83z)| lr 3.43e-05 | 322.93 ms | 52.3% bf16 MFU | 1623777 tok/s step 16662/19560 | loss 3.261455 (-0.80z)| norm 0.2837 (-0.15z)| lr 3.43e-05 | 322.60 ms | 52.3% bf16 MFU | 1623847 tok/s step 16663/19560 | loss 3.265350 (-0.71z)| norm 0.2889 (+0.02z)| lr 3.43e-05 | 322.55 ms | 52.3% bf16 MFU | 1623928 tok/s step 16664/19560 | loss 3.291650 (-0.20z)| norm 0.2774 (-0.36z)| lr 3.43e-05 | 323.26 ms | 52.2% bf16 MFU | 1623826 tok/s step 16665/19560 | loss 3.275584 (-0.51z)| norm 0.2715 (-0.55z)| lr 3.42e-05 | 323.19 ms | 52.2% bf16 MFU | 1623745 tok/s step 16666/19560 | loss 3.274842 (-0.52z)| norm 0.3530 (+2.10z)| lr 3.42e-05 | 322.98 ms | 52.3% bf16 MFU | 1623722 tok/s step 16667/19560 | loss 3.239648 (-1.22z)| norm 0.3376 (+1.57z)| lr 3.42e-05 | 322.86 ms | 52.3% bf16 MFU | 1623729 tok/s step 16668/19560 | loss 3.292221 (-0.19z)| norm 0.3015 (+0.41z)| lr 3.42e-05 | 322.67 ms | 52.3% bf16 MFU | 1623785 tok/s step 16669/19560 
| loss 3.330357 (+0.57z)| norm 0.3118 (+0.74z)| lr 3.41e-05 | 322.91 ms | 52.3% bf16 MFU | 1623778 tok/s step 16670/19560 | loss 3.300941 (-0.01z)| norm 0.3413 (+1.66z)| lr 3.41e-05 | 322.72 ms | 52.3% bf16 MFU | 1623818 tok/s step 16671/19560 | loss 3.367404 (+1.31z)| norm 0.2659 (-0.75z)| lr 3.41e-05 | 323.15 ms | 52.2% bf16 MFU | 1623749 tok/s step 16672/19560 | loss 3.289483 (-0.25z)| norm 0.3614 (+2.29z)| lr 3.41e-05 | 322.43 ms | 52.3% bf16 MFU | 1623864 tok/s step 16673/19560 | loss 3.302633 (+0.02z)| norm 0.3148 (+0.79z)| lr 3.40e-05 | 323.25 ms | 52.2% bf16 MFU | 1623767 tok/s step 16674/19560 | loss 3.304492 (+0.05z)| norm 0.2588 (-1.00z)| lr 3.40e-05 | 323.34 ms | 52.2% bf16 MFU | 1623652 tok/s step 16675/19560 | loss 3.315919 (+0.27z)| norm 0.3235 (+1.15z)| lr 3.40e-05 | 323.35 ms | 52.2% bf16 MFU | 1623540 tok/s step 16676/19560 | loss 3.400908 (+1.95z)| norm 0.2914 (+0.09z)| lr 3.40e-05 | 322.10 ms | 52.4% bf16 MFU | 1623747 tok/s step 16677/19560 | loss 3.250982 (-1.05z)| norm 0.2632 (-0.85z)| lr 3.40e-05 | 323.17 ms | 52.2% bf16 MFU | 1623676 tok/s step 16678/19560 | loss 3.284535 (-0.37z)| norm 0.2642 (-0.81z)| lr 3.39e-05 | 322.90 ms | 52.3% bf16 MFU | 1623678 tok/s step 16679/19560 | loss 3.260796 (-0.84z)| norm 0.3062 (+0.65z)| lr 3.39e-05 | 323.09 ms | 52.2% bf16 MFU | 1623631 tok/s step 16680/19560 | loss 3.309076 (+0.12z)| norm 0.2653 (-0.76z)| lr 3.39e-05 | 322.81 ms | 52.3% bf16 MFU | 1623656 tok/s step 16681/19560 | loss 3.267884 (-0.70z)| norm 0.3142 (+0.92z)| lr 3.39e-05 | 322.28 ms | 52.4% bf16 MFU | 1623812 tok/s step 16682/19560 | loss 3.306868 (+0.08z)| norm 0.2807 (-0.22z)| lr 3.38e-05 | 322.65 ms | 52.3% bf16 MFU | 1623868 tok/s step 16683/19560 | loss 3.298943 (-0.08z)| norm 0.2752 (-0.41z)| lr 3.38e-05 | 322.87 ms | 52.3% bf16 MFU | 1623866 tok/s step 16684/19560 | loss 3.301015 (-0.04z)| norm 0.3067 (+0.68z)| lr 3.38e-05 | 323.01 ms | 52.2% bf16 MFU | 1623829 tok/s step 16685/19560 | loss 3.285632 (-0.34z)| norm 0.2837 (-0.13z)| 
lr 3.38e-05 | 323.03 ms | 52.2% bf16 MFU | 1623790 tok/s step 16686/19560 | loss 3.279464 (-0.46z)| norm 0.2905 (+0.12z)| lr 3.37e-05 | 322.80 ms | 52.3% bf16 MFU | 1623809 tok/s step 16687/19560 | loss 3.285048 (-0.36z)| norm 0.2928 (+0.20z)| lr 3.37e-05 | 322.76 ms | 52.3% bf16 MFU | 1623839 tok/s step 16688/19560 | loss 3.343126 (+0.81z)| norm 0.3173 (+1.09z)| lr 3.37e-05 | 322.91 ms | 52.3% bf16 MFU | 1623829 tok/s step 16689/19560 | loss 3.247020 (-1.12z)| norm 0.3341 (+1.66z)| lr 3.37e-05 | 323.26 ms | 52.2% bf16 MFU | 1623732 tok/s step 16690/19560 | loss 3.300536 (-0.04z)| norm 0.2570 (-1.07z)| lr 3.37e-05 | 323.10 ms | 52.2% bf16 MFU | 1623680 tok/s step 16691/19560 | loss 3.450850 (+2.87z)| norm 0.3558 (+2.36z)| lr 3.36e-05 | 322.88 ms | 52.3% bf16 MFU | 1623684 tok/s step 16692/19560 | loss 3.325626 (+0.42z)| norm 0.3349 (+1.62z)| lr 3.36e-05 | 322.52 ms | 52.3% bf16 MFU | 1623780 tok/s step 16693/19560 | loss 3.376954 (+1.41z)| norm 0.2857 (-0.08z)| lr 3.36e-05 | 322.82 ms | 52.3% bf16 MFU | 1623795 tok/s step 16694/19560 | loss 3.265018 (-0.78z)| norm 0.3093 (+0.73z)| lr 3.36e-05 | 323.21 ms | 52.2% bf16 MFU | 1623712 tok/s step 16695/19560 | loss 3.242266 (-1.20z)| norm 0.2892 (+0.04z)| lr 3.35e-05 | 323.07 ms | 52.2% bf16 MFU | 1623669 tok/s step 16696/19560 | loss 3.389545 (+1.64z)| norm 0.2748 (-0.46z)| lr 3.35e-05 | 322.83 ms | 52.3% bf16 MFU | 1623686 tok/s step 16697/19560 | loss 3.244742 (-1.15z)| norm 0.2802 (-0.28z)| lr 3.35e-05 | 322.81 ms | 52.3% bf16 MFU | 1623709 tok/s step 16698/19560 | loss 3.300840 (-0.07z)| norm 0.2520 (-1.25z)| lr 3.35e-05 | 322.89 ms | 52.3% bf16 MFU | 1623711 tok/s step 16699/19560 | loss 3.394560 (+1.70z)| norm 0.2663 (-0.75z)| lr 3.35e-05 | 323.03 ms | 52.2% bf16 MFU | 1623676 tok/s step 16700/19560 | loss 3.361934 (+1.07z)| norm 0.2943 (+0.21z)| lr 3.34e-05 | 322.65 ms | 52.3% bf16 MFU | 1623740 tok/s step 16701/19560 | loss 3.280881 (-0.46z)| norm 0.2581 (-1.04z)| lr 3.34e-05 | 323.31 ms | 52.2% bf16 MFU | 
1623635 tok/s step 16702/19560 | loss 3.271270 (-0.63z)| norm 0.2919 (+0.12z)| lr 3.34e-05 | 322.84 ms | 52.3% bf16 MFU | 1623652 tok/s step 16703/19560 | loss 3.314201 (+0.19z)| norm 0.3238 (+1.21z)| lr 3.34e-05 | 323.47 ms | 52.2% bf16 MFU | 1623511 tok/s step 16704/19560 | loss 3.260691 (-0.82z)| norm 0.2536 (-1.22z)| lr 3.33e-05 | 323.12 ms | 52.2% bf16 MFU | 1623464 tok/s step 16705/19560 | loss 3.260524 (-0.82z)| norm 0.2705 (-0.64z)| lr 3.33e-05 | 322.97 ms | 52.3% bf16 MFU | 1623459 tok/s step 16706/19560 | loss 3.423868 (+2.21z)| norm 0.2796 (-0.33z)| lr 3.33e-05 | 323.32 ms | 52.2% bf16 MFU | 1623364 tok/s step 16707/19560 | loss 3.305434 (+0.01z)| norm 0.2842 (-0.17z)| lr 3.33e-05 | 323.06 ms | 52.2% bf16 MFU | 1623341 tok/s step 16708/19560 | loss 3.374289 (+1.27z)| norm 0.2774 (-0.41z)| lr 3.32e-05 | 323.23 ms | 52.2% bf16 MFU | 1623274 tok/s step 16709/19560 | loss 3.288729 (-0.32z)| norm 0.2701 (-0.67z)| lr 3.32e-05 | 323.67 ms | 52.1% bf16 MFU | 1623101 tok/s step 16710/19560 | loss 3.305774 (-0.00z)| norm 0.2840 (-0.18z)| lr 3.32e-05 | 322.71 ms | 52.3% bf16 MFU | 1623177 tok/s step 16711/19560 | loss 3.265071 (-0.74z)| norm 0.2801 (-0.33z)| lr 3.32e-05 | 322.17 ms | 52.4% bf16 MFU | 1623387 tok/s step 16712/19560 | loss 3.258179 (-0.87z)| norm 0.2717 (-0.62z)| lr 3.32e-05 | 323.30 ms | 52.2% bf16 MFU | 1623301 tok/s step 16713/19560 | loss 3.311560 (+0.12z)| norm 0.2903 (+0.03z)| lr 3.31e-05 | 322.94 ms | 52.3% bf16 MFU | 1623311 tok/s step 16714/19560 | loss 3.293574 (-0.21z)| norm 0.3030 (+0.46z)| lr 3.31e-05 | 323.03 ms | 52.2% bf16 MFU | 1623296 tok/s step 16715/19560 | loss 3.262620 (-0.78z)| norm 0.2639 (-0.91z)| lr 3.31e-05 | 322.73 ms | 52.3% bf16 MFU | 1623358 tok/s step 16716/19560 | loss 3.315212 (+0.19z)| norm 0.2771 (-0.44z)| lr 3.31e-05 | 322.22 ms | 52.4% bf16 MFU | 1623546 tok/s step 16717/19560 | loss 3.303110 (-0.04z)| norm 0.2730 (-0.60z)| lr 3.30e-05 | 322.84 ms | 52.3% bf16 MFU | 1623568 tok/s step 16718/19560 | loss 3.296162 
(-0.16z)| norm 0.2820 (-0.27z)| lr 3.30e-05 | 322.61 ms | 52.3% bf16 MFU | 1623647 tok/s step 16719/19560 | loss 3.343365 (+0.71z)| norm 0.2689 (-0.73z)| lr 3.30e-05 | 322.06 ms | 52.4% bf16 MFU | 1623862 tok/s step 16720/19560 | loss 3.316480 (+0.20z)| norm 0.2833 (-0.22z)| lr 3.30e-05 | 322.27 ms | 52.4% bf16 MFU | 1624012 tok/s step 16721/19560 | loss 3.319405 (+0.25z)| norm 0.2654 (-0.85z)| lr 3.29e-05 | 322.52 ms | 52.3% bf16 MFU | 1624090 tok/s step 16722/19560 | loss 3.256915 (-0.91z)| norm 0.2923 (+0.13z)| lr 3.29e-05 | 321.79 ms | 52.4% bf16 MFU | 1624350 tok/s step 16723/19560 | loss 3.292310 (-0.24z)| norm 0.2994 (+0.38z)| lr 3.29e-05 | 322.16 ms | 52.4% bf16 MFU | 1624503 tok/s step 16724/19560 | loss 3.344808 (+0.73z)| norm 0.2721 (-0.61z)| lr 3.29e-05 | 322.85 ms | 52.3% bf16 MFU | 1624474 tok/s step 16725/19560 | loss 3.296625 (-0.17z)| norm 0.2991 (+0.42z)| lr 3.29e-05 | 322.11 ms | 52.4% bf16 MFU | 1624632 tok/s step 16726/19560 | loss 3.270389 (-0.66z)| norm 0.2551 (-1.26z)| lr 3.28e-05 | 322.84 ms | 52.3% bf16 MFU | 1624600 tok/s step 16727/19560 | loss 3.329332 (+0.44z)| norm 0.3370 (+1.88z)| lr 3.28e-05 | 323.42 ms | 52.2% bf16 MFU | 1624423 tok/s step 16728/19560 | loss 3.300608 (-0.09z)| norm 0.2761 (-0.44z)| lr 3.28e-05 | 322.49 ms | 52.3% bf16 MFU | 1624490 tok/s step 16729/19560 | loss 3.415763 (+2.01z)| norm 0.2933 (+0.21z)| lr 3.28e-05 | 322.78 ms | 52.3% bf16 MFU | 1624481 tok/s step 16730/19560 | loss 3.296687 (-0.18z)| norm 0.2648 (-0.88z)| lr 3.27e-05 | 322.52 ms | 52.3% bf16 MFU | 1624536 tok/s step 16731/19560 | loss 3.236039 (-1.27z)| norm 0.3132 (+1.01z)| lr 3.27e-05 | 322.27 ms | 52.4% bf16 MFU | 1624652 tok/s step 16732/19560 | loss 3.281040 (-0.45z)| norm 0.2472 (-1.54z)| lr 3.27e-05 | 322.84 ms | 52.3% bf16 MFU | 1624620 tok/s step 16733/19560 | loss 3.392183 (+1.55z)| norm 0.8303 (+9.95z)| lr 3.27e-05 | 322.28 ms | 52.4% bf16 MFU | 1624728 tok/s step 16734/19560 | loss 3.268112 (-0.70z)| norm 0.2857 (-0.09z)| lr 3.27e-05 | 
322.85 ms | 52.3% bf16 MFU | 1624689 tok/s step 16735/19560 | loss 3.302920 (-0.06z)| norm 0.2715 (-0.35z)| lr 3.26e-05 | 322.60 ms | 52.3% bf16 MFU | 1624715 tok/s step 16736/19560 | loss 3.258206 (-0.86z)| norm 0.2723 (-0.33z)| lr 3.26e-05 | 322.78 ms | 52.3% bf16 MFU | 1624692 tok/s step 16737/19560 | loss 3.318853 (+0.24z)| norm 0.2516 (-0.71z)| lr 3.26e-05 | 322.71 ms | 52.3% bf16 MFU | 1624690 tok/s step 16738/19560 | loss 3.270596 (-0.63z)| norm 0.2588 (-0.57z)| lr 3.26e-05 | 322.59 ms | 52.3% bf16 MFU | 1624717 tok/s step 16739/19560 | loss 3.302073 (-0.06z)| norm 0.2678 (-0.40z)| lr 3.25e-05 | 322.70 ms | 52.3% bf16 MFU | 1624715 tok/s step 16740/19560 | loss 3.297340 (-0.15z)| norm 0.2975 (+0.14z)| lr 3.25e-05 | 322.82 ms | 52.3% bf16 MFU | 1624683 tok/s step 16741/19560 | loss 3.252573 (-0.95z)| norm 0.2543 (-0.65z)| lr 3.25e-05 | 322.62 ms | 52.3% bf16 MFU | 1624703 tok/s step 16742/19560 | loss 3.375947 (+1.29z)| norm 0.2634 (-0.48z)| lr 3.25e-05 | 323.10 ms | 52.2% bf16 MFU | 1624601 tok/s step 16743/19560 | loss 3.270962 (-0.61z)| norm 0.3392 (+0.90z)| lr 3.24e-05 | 322.43 ms | 52.3% bf16 MFU | 1624673 tok/s step 16744/19560 | loss 3.264617 (-0.72z)| norm 0.2506 (-0.72z)| lr 3.24e-05 | 322.35 ms | 52.4% bf16 MFU | 1624763 tok/s step 16745/19560 | loss 3.409487 (+1.86z)| norm 0.2649 (-0.46z)| lr 3.24e-05 | 323.20 ms | 52.2% bf16 MFU | 1624633 tok/s step 16746/19560 | loss 3.311878 (+0.11z)| norm 0.2804 (-0.18z)| lr 3.24e-05 | 322.84 ms | 52.3% bf16 MFU | 1624600 tok/s step 16747/19560 | loss 3.325890 (+0.35z)| norm 0.3057 (+0.28z)| lr 3.24e-05 | 322.77 ms | 52.3% bf16 MFU | 1624587 tok/s step 16748/19560 | loss 3.329287 (+0.42z)| norm 0.2969 (+0.12z)| lr 3.23e-05 | 322.33 ms | 52.4% bf16 MFU | 1624686 tok/s step 16749/19560 | loss 3.253569 (-0.93z)| norm 0.2933 (+0.05z)| lr 3.23e-05 | 323.16 ms | 52.2% bf16 MFU | 1624570 tok/s step 16750/19560 | loss 3.347923 (+0.75z)| norm 0.3237 (+0.60z)| lr 3.23e-05 | 323.14 ms | 52.2% bf16 MFU | 1624465 tok/s val 
loss 3.284536 HellaSwag: 3029/10042 = 0.301633 step 16751/19560 | loss 3.269932 (-0.66z)| norm 0.2479 (-0.79z)| lr 3.23e-05 | 321.42 ms | 52.5% bf16 MFU | 1624800 tok/s step 16752/19560 | loss 3.199228 (-1.89z)| norm 0.2813 (-0.17z)| lr 3.22e-05 | 321.81 ms | 52.4% bf16 MFU | 1625018 tok/s step 16753/19560 | loss 3.343802 (+0.67z)| norm 0.3221 (+0.57z)| lr 3.22e-05 | 322.66 ms | 52.3% bf16 MFU | 1625011 tok/s step 16754/19560 | loss 3.253969 (-0.91z)| norm 0.2807 (-0.19z)| lr 3.22e-05 | 322.49 ms | 52.3% bf16 MFU | 1625049 tok/s step 16755/19560 | loss 3.322571 (+0.31z)| norm 0.3099 (+0.34z)| lr 3.22e-05 | 322.26 ms | 52.4% bf16 MFU | 1625141 tok/s step 16756/19560 | loss 3.311123 (+0.11z)| norm 0.2698 (-0.39z)| lr 3.22e-05 | 322.26 ms | 52.4% bf16 MFU | 1625230 tok/s step 16757/19560 | loss 3.285699 (-0.34z)| norm 0.3353 (+0.80z)| lr 3.21e-05 | 322.49 ms | 52.3% bf16 MFU | 1625256 tok/s step 16758/19560 | loss 3.245613 (-1.05z)| norm 0.2730 (-0.33z)| lr 3.21e-05 | 322.64 ms | 52.3% bf16 MFU | 1625242 tok/s step 16759/19560 | loss 3.255566 (-0.85z)| norm 0.2914 (+0.01z)| lr 3.21e-05 | 322.28 ms | 52.4% bf16 MFU | 1625321 tok/s step 16760/19560 | loss 3.260453 (-0.76z)| norm 0.2872 (-0.07z)| lr 3.21e-05 | 322.52 ms | 52.3% bf16 MFU | 1625334 tok/s step 16761/19560 | loss 3.388437 (+1.49z)| norm 0.2841 (-0.13z)| lr 3.20e-05 | 322.68 ms | 52.3% bf16 MFU | 1625306 tok/s step 16762/19560 | loss 3.291188 (-0.22z)| norm 0.2811 (-0.19z)| lr 3.20e-05 | 322.59 ms | 52.3% bf16 MFU | 1625304 tok/s step 16763/19560 | loss 3.269654 (-0.60z)| norm 0.2798 (-0.21z)| lr 3.20e-05 | 323.35 ms | 52.2% bf16 MFU | 1625110 tok/s step 16764/19560 | loss 3.257715 (-0.82z)| norm 0.2859 (-0.10z)| lr 3.20e-05 | 322.76 ms | 52.3% bf16 MFU | 1625074 tok/s step 16765/19560 |
loss 3.311861 (+0.14z)| norm 0.2763 (-0.27z)| lr 3.20e-05 | 322.69 ms | 52.3% bf16 MFU | 1625056 tok/s step 16766/19560 | loss 3.235816 (-1.20z)| norm 0.3225 (+0.57z)| lr 3.19e-05 | 322.61 ms | 52.3% bf16 MFU | 1625062 tok/s step 16767/19560 | loss 3.308638 (+0.08z)| norm 0.2815 (-0.19z)| lr 3.19e-05 | 323.02 ms | 52.2% bf16 MFU | 1624962 tok/s step 16768/19560 | loss 3.332074 (+0.49z)| norm 0.2583 (-0.62z)| lr 3.19e-05 | 322.77 ms | 52.3% bf16 MFU | 1624932 tok/s step 16769/19560 | loss 3.318044 (+0.24z)| norm 0.2781 (-0.25z)| lr 3.19e-05 | 322.33 ms | 52.4% bf16 MFU | 1625014 tok/s step 16770/19560 | loss 3.269161 (-0.63z)| norm 0.2451 (-0.85z)| lr 3.18e-05 | 322.79 ms | 52.3% bf16 MFU | 1624975 tok/s step 16771/19560 | loss 3.298409 (-0.11z)| norm 0.2643 (-0.50z)| lr 3.18e-05 | 322.57 ms | 52.3% bf16 MFU | 1624993 tok/s step 16772/19560 | loss 3.306854 (+0.07z)| norm 0.2519 (-0.72z)| lr 3.18e-05 | 322.72 ms | 52.3% bf16 MFU | 1624973 tok/s step 16773/19560 | loss 3.314536 (+0.21z)| norm 0.2851 (-0.12z)| lr 3.18e-05 | 322.81 ms | 52.3% bf16 MFU | 1624932 tok/s step 16774/19560 | loss 3.298230 (-0.06z)| norm 0.2876 (-0.06z)| lr 3.18e-05 | 322.58 ms | 52.3% bf16 MFU | 1624950 tok/s step 16775/19560 | loss 3.289515 (-0.26z)| norm 0.2866 (-0.07z)| lr 3.17e-05 | 322.41 ms | 52.3% bf16 MFU | 1625011 tok/s step 16776/19560 | loss 3.276062 (-0.57z)| norm 0.2521 (-0.71z)| lr 3.17e-05 | 322.88 ms | 52.3% bf16 MFU | 1624950 tok/s step 16777/19560 | loss 3.301143 (+0.00z)| norm 0.2442 (-0.85z)| lr 3.17e-05 | 322.11 ms | 52.4% bf16 MFU | 1625086 tok/s step 16778/19560 | loss 3.274077 (-0.62z)| norm 0.2587 (-0.58z)| lr 3.17e-05 | 322.49 ms | 52.3% bf16 MFU | 1625119 tok/s step 16779/19560 | loss 3.293022 (-0.19z)| norm 0.2774 (-0.23z)| lr 3.16e-05 | 322.59 ms | 52.3% bf16 MFU | 1625124 tok/s step 16780/19560 | loss 3.288032 (-0.29z)| norm 0.2847 (-0.09z)| lr 3.16e-05 | 322.47 ms | 52.3% bf16 MFU | 1625161 tok/s step 16781/19560 | loss 3.281570 (-0.42z)| norm 0.2391 (-0.92z)| 
lr 3.16e-05 | 322.93 ms | 52.3% bf16 MFU | 1625079 tok/s step 16782/19560 | loss 3.213964 (-1.95z)| norm 0.2564 (-0.60z)| lr 3.16e-05 | 323.36 ms | 52.2% bf16 MFU | 1624892 tok/s step 16783/19560 | loss 3.264927 (-0.77z)| norm 0.2400 (-0.89z)| lr 3.16e-05 | 322.79 ms | 52.3% bf16 MFU | 1624859 tok/s step 16784/19560 | loss 3.384070 (+1.91z)| norm 0.2552 (-0.61z)| lr 3.15e-05 | 322.69 ms | 52.3% bf16 MFU | 1624854 tok/s step 16785/19560 | loss 3.292271 (-0.16z)| norm 0.2889 (+0.01z)| lr 3.15e-05 | 323.03 ms | 52.2% bf16 MFU | 1624763 tok/s step 16786/19560 | loss 3.307551 (+0.18z)| norm 0.3131 (+0.44z)| lr 3.15e-05 | 322.85 ms | 52.3% bf16 MFU | 1624720 tok/s step 16787/19560 | loss 3.330037 (+0.68z)| norm 0.2595 (-0.54z)| lr 3.15e-05 | 322.51 ms | 52.3% bf16 MFU | 1624767 tok/s step 16788/19560 | loss 3.312823 (+0.28z)| norm 0.2824 (-0.12z)| lr 3.14e-05 | 323.26 ms | 52.2% bf16 MFU | 1624621 tok/s step 16789/19560 | loss 3.234319 (-1.48z)| norm 0.3226 (+0.61z)| lr 3.14e-05 | 322.60 ms | 52.3% bf16 MFU | 1624650 tok/s step 16790/19560 | loss 3.286140 (-0.32z)| norm 0.2601 (-0.53z)| lr 3.14e-05 | 322.57 ms | 52.3% bf16 MFU | 1624686 tok/s step 16791/19560 | loss 3.252568 (-1.07z)| norm 0.2605 (-0.52z)| lr 3.14e-05 | 322.74 ms | 52.3% bf16 MFU | 1624677 tok/s step 16792/19560 | loss 3.235292 (-1.44z)| norm 0.2721 (-0.31z)| lr 3.14e-05 | 322.47 ms | 52.3% bf16 MFU | 1624736 tok/s step 16793/19560 | loss 3.265966 (-0.75z)| norm 0.2871 (-0.03z)| lr 3.13e-05 | 323.09 ms | 52.2% bf16 MFU | 1624637 tok/s step 16794/19560 | loss 3.342288 (+0.94z)| norm 0.2635 (-0.46z)| lr 3.13e-05 | 322.93 ms | 52.3% bf16 MFU | 1624582 tok/s step 16795/19560 | loss 3.308746 (+0.18z)| norm 0.2765 (-0.21z)| lr 3.13e-05 | 322.65 ms | 52.3% bf16 MFU | 1624600 tok/s step 16796/19560 | loss 3.222619 (-1.72z)| norm 0.2897 (+0.04z)| lr 3.13e-05 | 323.27 ms | 52.2% bf16 MFU | 1624460 tok/s step 16797/19560 | loss 3.210274 (-1.95z)| norm 0.2645 (-0.42z)| lr 3.12e-05 | 322.86 ms | 52.3% bf16 MFU | 
1624431 tok/s step 16798/19560 | loss 3.330898 (+0.68z)| norm 0.2843 (-0.05z)| lr 3.12e-05 | 322.89 ms | 52.3% bf16 MFU | 1624397 tok/s step 16799/19560 | loss 3.303941 (+0.11z)| norm 0.2576 (-0.54z)| lr 3.12e-05 | 323.05 ms | 52.2% bf16 MFU | 1624324 tok/s step 16800/19560 | loss 3.293525 (-0.12z)| norm 0.3442 (+1.07z)| lr 3.12e-05 | 322.72 ms | 52.3% bf16 MFU | 1624338 tok/s step 16801/19560 | loss 3.212166 (-1.87z)| norm 0.2752 (-0.21z)| lr 3.12e-05 | 322.48 ms | 52.3% bf16 MFU | 1624410 tok/s step 16802/19560 | loss 3.291646 (-0.14z)| norm 0.2824 (-0.08z)| lr 3.11e-05 | 322.35 ms | 52.4% bf16 MFU | 1624511 tok/s step 16803/19560 | loss 3.247314 (-1.09z)| norm 0.3119 (+0.47z)| lr 3.11e-05 | 322.77 ms | 52.3% bf16 MFU | 1624502 tok/s step 16804/19560 | loss 3.403781 (+2.29z)| norm 0.2804 (-0.11z)| lr 3.11e-05 | 322.62 ms | 52.3% bf16 MFU | 1624530 tok/s step 16805/19560 | loss 3.233455 (-1.38z)| norm 0.2731 (-0.25z)| lr 3.11e-05 | 322.59 ms | 52.3% bf16 MFU | 1624567 tok/s step 16806/19560 | loss 3.256139 (-0.89z)| norm 0.3001 (+0.25z)| lr 3.10e-05 | 322.84 ms | 52.3% bf16 MFU | 1624539 tok/s step 16807/19560 | loss 3.303163 (+0.12z)| norm 0.2689 (-0.33z)| lr 3.10e-05 | 322.86 ms | 52.3% bf16 MFU | 1624507 tok/s step 16808/19560 | loss 3.279500 (-0.39z)| norm 0.3318 (+0.83z)| lr 3.10e-05 | 322.85 ms | 52.3% bf16 MFU | 1624477 tok/s step 16809/19560 | loss 3.430587 (+2.75z)| norm 0.2769 (-0.18z)| lr 3.10e-05 | 322.46 ms | 52.3% bf16 MFU | 1624549 tok/s step 16810/19560 | loss 3.313381 (+0.30z)| norm 0.2734 (-0.25z)| lr 3.10e-05 | 322.50 ms | 52.3% bf16 MFU | 1624606 tok/s step 16811/19560 | loss 3.278500 (-0.42z)| norm 0.2838 (-0.05z)| lr 3.09e-05 | 323.55 ms | 52.2% bf16 MFU | 1624398 tok/s step 16812/19560 | loss 3.323280 (+0.51z)| norm 0.3287 (+0.78z)| lr 3.09e-05 | 322.70 ms | 52.3% bf16 MFU | 1624411 tok/s step 16813/19560 | loss 3.355698 (+1.17z)| norm 0.2783 (-0.16z)| lr 3.09e-05 | 322.35 ms | 52.4% bf16 MFU | 1624513 tok/s step 16814/19560 | loss 3.277495 
(-0.45z)| norm 0.3251 (+0.71z)| lr 3.09e-05 | 322.65 ms | 52.3% bf16 MFU | 1624534 tok/s step 16815/19560 | loss 3.310220 (+0.22z)| norm 0.3000 (+0.24z)| lr 3.08e-05 | 322.86 ms | 52.3% bf16 MFU | 1624502 tok/s step 16816/19560 | loss 3.298986 (-0.01z)| norm 0.2608 (-0.48z)| lr 3.08e-05 | 322.79 ms | 52.3% bf16 MFU | 1624489 tok/s step 16817/19560 | loss 3.257123 (-0.88z)| norm 0.2568 (-0.54z)| lr 3.08e-05 | 323.42 ms | 52.2% bf16 MFU | 1624318 tok/s step 16818/19560 | loss 3.268685 (-0.63z)| norm 0.2511 (-0.65z)| lr 3.08e-05 | 322.38 ms | 52.4% bf16 MFU | 1624416 tok/s step 16819/19560 | loss 3.247325 (-1.08z)| norm 0.2842 (-0.02z)| lr 3.08e-05 | 322.74 ms | 52.3% bf16 MFU | 1624420 tok/s step 16820/19560 | loss 3.414057 (+2.45z)| norm 0.2759 (-0.17z)| lr 3.07e-05 | 322.68 ms | 52.3% bf16 MFU | 1624438 tok/s step 16821/19560 | loss 3.294058 (-0.07z)| norm 0.2675 (-0.33z)| lr 3.07e-05 | 323.24 ms | 52.2% bf16 MFU | 1624314 tok/s step 16822/19560 | loss 3.306897 (+0.19z)| norm 0.2835 (-0.02z)| lr 3.07e-05 | 322.76 ms | 52.3% bf16 MFU | 1624317 tok/s step 16823/19560 | loss 3.269547 (-0.61z)| norm 0.2928 (+0.15z)| lr 3.07e-05 | 323.01 ms | 52.3% bf16 MFU | 1624258 tok/s step 16824/19560 | loss 3.340307 (+0.93z)| norm 0.2595 (-0.47z)| lr 3.06e-05 | 322.50 ms | 52.3% bf16 MFU | 1624330 tok/s step 16825/19560 | loss 3.303787 (+0.12z)| norm 0.3263 (+0.77z)| lr 3.06e-05 | 323.00 ms | 52.3% bf16 MFU | 1624274 tok/s step 16826/19560 | loss 3.305335 (+0.16z)| norm 0.2628 (-0.42z)| lr 3.06e-05 | 323.17 ms | 52.2% bf16 MFU | 1624177 tok/s step 16827/19560 | loss 3.254234 (-0.95z)| norm 0.3080 (+0.42z)| lr 3.06e-05 | 322.43 ms | 52.3% bf16 MFU | 1624270 tok/s step 16828/19560 | loss 3.259471 (-0.82z)| norm 0.2675 (-0.33z)| lr 3.06e-05 | 322.61 ms | 52.3% bf16 MFU | 1624313 tok/s step 16829/19560 | loss 3.270185 (-0.58z)| norm 0.2595 (-0.48z)| lr 3.05e-05 | 322.99 ms | 52.3% bf16 MFU | 1624258 tok/s step 16830/19560 | loss 3.275399 (-0.46z)| norm 0.3472 (+1.15z)| lr 3.05e-05 | 
323.23 ms | 52.2% bf16 MFU | 1624146 tok/s step 16831/19560 | loss 3.260386 (-0.79z)| norm 0.2524 (-0.61z)| lr 3.05e-05 | 322.98 ms | 52.3% bf16 MFU | 1624102 tok/s step 16832/19560 | loss 3.293116 (-0.07z)| norm 0.2615 (-0.44z)| lr 3.05e-05 | 323.16 ms | 52.2% bf16 MFU | 1624017 tok/s step 16833/19560 | loss 3.277722 (-0.41z)| norm 0.2672 (-0.33z)| lr 3.04e-05 | 322.74 ms | 52.3% bf16 MFU | 1624040 tok/s step 16834/19560 | loss 3.382643 (+1.97z)| norm 0.3025 (+0.32z)| lr 3.04e-05 | 322.72 ms | 52.3% bf16 MFU | 1624066 tok/s step 16835/19560 | loss 3.222112 (-1.64z)| norm 0.2597 (-0.47z)| lr 3.04e-05 | 322.81 ms | 52.3% bf16 MFU | 1624071 tok/s step 16836/19560 | loss 3.271356 (-0.52z)| norm 0.2537 (-0.58z)| lr 3.04e-05 | 323.26 ms | 52.2% bf16 MFU | 1623961 tok/s step 16837/19560 | loss 3.273221 (-0.48z)| norm 0.2621 (-0.42z)| lr 3.04e-05 | 322.81 ms | 52.3% bf16 MFU | 1623969 tok/s step 16838/19560 | loss 3.325302 (+0.70z)| norm 0.2727 (-0.22z)| lr 3.03e-05 | 322.35 ms | 52.4% bf16 MFU | 1624093 tok/s step 16839/19560 | loss 3.258019 (-0.82z)| norm 0.3197 (+0.64z)| lr 3.03e-05 | 322.90 ms | 52.3% bf16 MFU | 1624072 tok/s step 16840/19560 | loss 3.267805 (-0.60z)| norm 0.2516 (-0.62z)| lr 3.03e-05 | 323.05 ms | 52.2% bf16 MFU | 1624015 tok/s step 16841/19560 | loss 3.272656 (-0.49z)| norm 0.2461 (-0.71z)| lr 3.03e-05 | 322.47 ms | 52.3% bf16 MFU | 1624106 tok/s step 16842/19560 | loss 3.294426 (+0.01z)| norm 0.2605 (-0.44z)| lr 3.02e-05 | 322.86 ms | 52.3% bf16 MFU | 1624094 tok/s step 16843/19560 | loss 3.298155 (+0.08z)| norm 0.2480 (-0.67z)| lr 3.02e-05 | 323.11 ms | 52.2% bf16 MFU | 1624020 tok/s step 16844/19560 | loss 3.337911 (+0.98z)| norm 0.2532 (-0.57z)| lr 3.02e-05 | 323.25 ms | 52.2% bf16 MFU | 1623916 tok/s step 16845/19560 | loss 3.245756 (-1.09z)| norm 0.2688 (-0.28z)| lr 3.02e-05 | 322.62 ms | 52.3% bf16 MFU | 1623974 tok/s step 16846/19560 | loss 3.186110 (-2.36z)| norm 0.2501 (-0.62z)| lr 3.02e-05 | 323.34 ms | 52.2% bf16 MFU | 1623848 tok/s step 
16847/19560 | loss 3.270825 (-0.49z)| norm 0.2401 (-0.79z)| lr 3.01e-05 | 322.37 ms | 52.4% bf16 MFU | 1623974 tok/s step 16848/19560 | loss 3.252874 (-0.87z)| norm 0.2735 (-0.18z)| lr 3.01e-05 | 323.08 ms | 52.2% bf16 MFU | 1623914 tok/s step 16849/19560 | loss 3.335304 (+0.95z)| norm 0.2501 (-0.61z)| lr 3.01e-05 | 323.18 ms | 52.2% bf16 MFU | 1623833 tok/s step 16850/19560 | loss 3.327900 (+0.77z)| norm 0.2752 (-0.14z)| lr 3.01e-05 | 322.47 ms | 52.3% bf16 MFU | 1623935 tok/s step 16851/19560 | loss 3.284239 (-0.19z)| norm 0.3743 (+1.65z)| lr 3.01e-05 | 322.52 ms | 52.3% bf16 MFU | 1624019 tok/s step 16852/19560 | loss 3.420101 (+2.72z)| norm 0.2675 (-0.29z)| lr 3.00e-05 | 322.46 ms | 52.3% bf16 MFU | 1624114 tok/s step 16853/19560 | loss 3.254483 (-0.83z)| norm 0.2645 (-0.34z)| lr 3.00e-05 | 323.25 ms | 52.2% bf16 MFU | 1624005 tok/s step 16854/19560 | loss 3.306162 (+0.27z)| norm 0.2830 (-0.01z)| lr 3.00e-05 | 322.91 ms | 52.3% bf16 MFU | 1623986 tok/s step 16855/19560 | loss 3.275909 (-0.37z)| norm 0.2692 (-0.25z)| lr 3.00e-05 | 323.12 ms | 52.2% bf16 MFU | 1623916 tok/s step 16856/19560 | loss 3.308731 (+0.34z)| norm 0.2616 (-0.39z)| lr 2.99e-05 | 322.95 ms | 52.3% bf16 MFU | 1623892 tok/s step 16857/19560 | loss 3.299145 (+0.15z)| norm 0.2892 (+0.12z)| lr 2.99e-05 | 322.56 ms | 52.3% bf16 MFU | 1623967 tok/s step 16858/19560 | loss 3.358472 (+1.44z)| norm 0.2739 (-0.16z)| lr 2.99e-05 | 322.70 ms | 52.3% bf16 MFU | 1624002 tok/s step 16859/19560 | loss 3.313369 (+0.44z)| norm 0.2838 (+0.02z)| lr 2.99e-05 | 322.83 ms | 52.3% bf16 MFU | 1624004 tok/s step 16860/19560 | loss 3.302667 (+0.20z)| norm 0.2960 (+0.24z)| lr 2.99e-05 | 323.31 ms | 52.2% bf16 MFU | 1623885 tok/s step 16861/19560 | loss 3.314776 (+0.49z)| norm 0.2518 (-1.06z)| lr 2.98e-05 | 322.67 ms | 52.3% bf16 MFU | 1623934 tok/s step 16862/19560 | loss 3.305834 (+0.29z)| norm 0.3243 (+1.78z)| lr 2.98e-05 | 323.24 ms | 52.2% bf16 MFU | 1623837 tok/s step 16863/19560 | loss 3.207859 (-1.87z)| norm 
0.3347 (+2.12z)| lr 2.98e-05 | 323.79 ms | 52.1% bf16 MFU | 1623606 tok/s step 16864/19560 | loss 3.244862 (-1.05z)| norm 0.2642 (-0.58z)| lr 2.98e-05 | 322.69 ms | 52.3% bf16 MFU | 1623664 tok/s step 16865/19560 | loss 3.330002 (+0.83z)| norm 0.3689 (+3.27z)| lr 2.97e-05 | 323.21 ms | 52.2% bf16 MFU | 1623586 tok/s step 16866/19560 | loss 3.296795 (+0.09z)| norm 0.3059 (+0.93z)| lr 2.97e-05 | 322.54 ms | 52.3% bf16 MFU | 1623683 tok/s step 16867/19560 | loss 3.351979 (+1.30z)| norm 0.2601 (-0.75z)| lr 2.97e-05 | 323.52 ms | 52.2% bf16 MFU | 1623528 tok/s step 16868/19560 | loss 3.250475 (-0.92z)| norm 0.3614 (+2.86z)| lr 2.97e-05 | 323.27 ms | 52.2% bf16 MFU | 1623442 tok/s step 16869/19560 | loss 3.267936 (-0.54z)| norm 0.2810 (-0.01z)| lr 2.97e-05 | 322.83 ms | 52.3% bf16 MFU | 1623471 tok/s step 16870/19560 | loss 3.300162 (+0.18z)| norm 0.2918 (+0.37z)| lr 2.96e-05 | 323.00 ms | 52.3% bf16 MFU | 1623457 tok/s step 16871/19560 | loss 3.253981 (-0.84z)| norm 0.3200 (+1.40z)| lr 2.96e-05 | 322.49 ms | 52.3% bf16 MFU | 1623571 tok/s step 16872/19560 | loss 3.285982 (-0.14z)| norm 0.2628 (-0.67z)| lr 2.96e-05 | 323.05 ms | 52.2% bf16 MFU | 1623539 tok/s step 16873/19560 | loss 3.326150 (+0.79z)| norm 0.2961 (+0.52z)| lr 2.96e-05 | 323.03 ms | 52.2% bf16 MFU | 1623513 tok/s step 16874/19560 | loss 3.296667 (+0.12z)| norm 0.3016 (+0.72z)| lr 2.96e-05 | 323.12 ms | 52.2% bf16 MFU | 1623466 tok/s step 16875/19560 | loss 3.366267 (+1.68z)| norm 0.2627 (-0.68z)| lr 2.95e-05 | 323.96 ms | 52.1% bf16 MFU | 1623212 tok/s step 16876/19560 | loss 3.305046 (+0.31z)| norm 0.2630 (-0.66z)| lr 2.95e-05 | 322.49 ms | 52.3% bf16 MFU | 1623337 tok/s step 16877/19560 | loss 3.333586 (+0.94z)| norm 0.3220 (+1.46z)| lr 2.95e-05 | 323.04 ms | 52.2% bf16 MFU | 1623318 tok/s step 16878/19560 | loss 3.337728 (+1.04z)| norm 0.2788 (-0.08z)| lr 2.95e-05 | 322.69 ms | 52.3% bf16 MFU | 1623390 tok/s step 16879/19560 | loss 3.304240 (+0.27z)| norm 0.3078 (+0.95z)| lr 2.94e-05 | 322.43 ms | 
52.3% bf16 MFU | 1623524 tok/s step 16880/19560 | loss 3.270445 (-0.52z)| norm 0.3071 (+0.92z)| lr 2.94e-05 | 323.37 ms | 52.2% bf16 MFU | 1623416 tok/s step 16881/19560 | loss 3.295683 (+0.07z)| norm 0.2600 (-0.77z)| lr 2.94e-05 | 322.89 ms | 52.3% bf16 MFU | 1623431 tok/s step 16882/19560 | loss 3.240759 (-1.19z)| norm 0.2584 (-0.82z)| lr 2.94e-05 | 322.89 ms | 52.3% bf16 MFU | 1623445 tok/s step 16883/19560 | loss 3.341768 (+1.14z)| norm 0.2862 (+0.19z)| lr 2.94e-05 | 323.44 ms | 52.2% bf16 MFU | 1623321 tok/s step 16884/19560 | loss 3.253980 (-0.88z)| norm 0.2625 (-0.67z)| lr 2.93e-05 | 323.62 ms | 52.2% bf16 MFU | 1623160 tok/s step 16885/19560 | loss 3.258138 (-0.77z)| norm 0.2869 (+0.24z)| lr 2.93e-05 | 322.87 ms | 52.3% bf16 MFU | 1623194 tok/s step 16886/19560 | loss 3.271490 (-0.47z)| norm 0.2857 (+0.19z)| lr 2.93e-05 | 322.90 ms | 52.3% bf16 MFU | 1623219 tok/s step 16887/19560 | loss 3.280939 (-0.26z)| norm 0.2773 (-0.12z)| lr 2.93e-05 | 323.10 ms | 52.2% bf16 MFU | 1623191 tok/s step 16888/19560 | loss 3.306782 (+0.33z)| norm 0.2964 (+0.59z)| lr 2.92e-05 | 322.99 ms | 52.3% bf16 MFU | 1623192 tok/s step 16889/19560 | loss 3.316284 (+0.57z)| norm 0.2965 (+0.59z)| lr 2.92e-05 | 322.96 ms | 52.3% bf16 MFU | 1623203 tok/s step 16890/19560 | loss 3.342093 (+1.16z)| norm 0.2786 (-0.07z)| lr 2.92e-05 | 323.07 ms | 52.2% bf16 MFU | 1623185 tok/s step 16891/19560 | loss 3.264574 (-0.65z)| norm 0.2680 (-0.46z)| lr 2.92e-05 | 322.73 ms | 52.3% bf16 MFU | 1623252 tok/s step 16892/19560 | loss 3.245242 (-1.10z)| norm 0.2636 (-0.62z)| lr 2.92e-05 | 323.01 ms | 52.2% bf16 MFU | 1623245 tok/s step 16893/19560 | loss 3.235922 (-1.30z)| norm 0.2709 (-0.35z)| lr 2.91e-05 | 322.80 ms | 52.3% bf16 MFU | 1623293 tok/s step 16894/19560 | loss 3.274128 (-0.42z)| norm 0.2878 (+0.29z)| lr 2.91e-05 | 322.93 ms | 52.3% bf16 MFU | 1623305 tok/s step 16895/19560 | loss 3.271613 (-0.47z)| norm 0.2604 (-0.72z)| lr 2.91e-05 | 323.12 ms | 52.2% bf16 MFU | 1623270 tok/s step 16896/19560 
| loss 3.342108 (+1.17z)| norm 0.2602 (-0.73z)| lr 2.91e-05 | 323.17 ms | 52.2% bf16 MFU | 1623222 tok/s step 16897/19560 | loss 3.330101 (+0.89z)| norm 0.2770 (-0.11z)| lr 2.91e-05 | 322.58 ms | 52.3% bf16 MFU | 1623325 tok/s step 16898/19560 | loss 3.278847 (-0.31z)| norm 0.2887 (+0.32z)| lr 2.90e-05 | 322.66 ms | 52.3% bf16 MFU | 1623403 tok/s step 16899/19560 | loss 3.285707 (-0.14z)| norm 0.2614 (-0.70z)| lr 2.90e-05 | 322.83 ms | 52.3% bf16 MFU | 1623433 tok/s step 16900/19560 | loss 3.340598 (+1.12z)| norm 0.2782 (-0.08z)| lr 2.90e-05 | 322.95 ms | 52.3% bf16 MFU | 1623434 tok/s step 16901/19560 | loss 3.302666 (+0.25z)| norm 0.2889 (+0.32z)| lr 2.90e-05 | 322.32 ms | 52.4% bf16 MFU | 1623593 tok/s step 16902/19560 | loss 3.271740 (-0.47z)| norm 0.2688 (-0.43z)| lr 2.89e-05 | 322.45 ms | 52.3% bf16 MFU | 1623710 tok/s step 16903/19560 | loss 3.269018 (-0.53z)| norm 0.2708 (-0.35z)| lr 2.89e-05 | 323.15 ms | 52.2% bf16 MFU | 1623646 tok/s step 16904/19560 | loss 3.286096 (-0.13z)| norm 0.3006 (+0.75z)| lr 2.89e-05 | 323.12 ms | 52.2% bf16 MFU | 1623594 tok/s step 16905/19560 | loss 3.213816 (-1.77z)| norm 0.2676 (-0.50z)| lr 2.89e-05 | 322.62 ms | 52.3% bf16 MFU | 1623668 tok/s step 16906/19560 | loss 3.309560 (+0.42z)| norm 0.2854 (+0.17z)| lr 2.89e-05 | 322.86 ms | 52.3% bf16 MFU | 1623680 tok/s step 16907/19560 | loss 3.281371 (-0.23z)| norm 0.2538 (-1.01z)| lr 2.88e-05 | 322.64 ms | 52.3% bf16 MFU | 1623746 tok/s step 16908/19560 | loss 3.283637 (-0.17z)| norm 0.2866 (+0.22z)| lr 2.88e-05 | 322.34 ms | 52.4% bf16 MFU | 1623884 tok/s step 16909/19560 | loss 3.296359 (+0.11z)| norm 0.3022 (+0.80z)| lr 2.88e-05 | 322.89 ms | 52.3% bf16 MFU | 1623875 tok/s step 16910/19560 | loss 3.244431 (-1.09z)| norm 0.2542 (-1.03z)| lr 2.88e-05 | 322.44 ms | 52.3% bf16 MFU | 1623982 tok/s step 16911/19560 | loss 3.295084 (+0.08z)| norm 0.4193 (+4.75z)| lr 2.88e-05 | 322.67 ms | 52.3% bf16 MFU | 1624024 tok/s step 16912/19560 | loss 3.262131 (-0.68z)| norm 0.2630 (-0.69z)| 
lr 2.87e-05 | 322.87 ms | 52.3% bf16 MFU | 1624015 tok/s step 16913/19560 | loss 3.332422 (+0.97z)| norm 0.2975 (+0.51z)| lr 2.87e-05 | 322.63 ms | 52.3% bf16 MFU | 1624067 tok/s step 16914/19560 | loss 3.265246 (-0.60z)| norm 0.2974 (+0.52z)| lr 2.87e-05 | 322.36 ms | 52.4% bf16 MFU | 1624183 tok/s step 16915/19560 | loss 3.307707 (+0.40z)| norm 0.3030 (+0.70z)| lr 2.87e-05 | 322.27 ms | 52.4% bf16 MFU | 1624316 tok/s step 16916/19560 | loss 3.229609 (-1.41z)| norm 0.2693 (-0.47z)| lr 2.86e-05 | 322.41 ms | 52.3% bf16 MFU | 1624408 tok/s step 16917/19560 | loss 3.283763 (-0.16z)| norm 0.2924 (+0.34z)| lr 2.86e-05 | 322.63 ms | 52.3% bf16 MFU | 1624440 tok/s step 16918/19560 | loss 3.282012 (-0.20z)| norm 0.2751 (-0.27z)| lr 2.86e-05 | 322.71 ms | 52.3% bf16 MFU | 1624451 tok/s step 16919/19560 | loss 3.243258 (-1.10z)| norm 0.2431 (-1.39z)| lr 2.86e-05 | 322.73 ms | 52.3% bf16 MFU | 1624455 tok/s step 16920/19560 | loss 3.288352 (-0.06z)| norm 0.2490 (-1.17z)| lr 2.86e-05 | 322.48 ms | 52.3% bf16 MFU | 1624522 tok/s step 16921/19560 | loss 3.276731 (-0.33z)| norm 0.2477 (-1.19z)| lr 2.85e-05 | 322.82 ms | 52.3% bf16 MFU | 1624501 tok/s step 16922/19560 | loss 3.317283 (+0.63z)| norm 0.2897 (+0.26z)| lr 2.85e-05 | 322.33 ms | 52.4% bf16 MFU | 1624603 tok/s step 16923/19560 | loss 3.264728 (-0.60z)| norm 0.2511 (-1.07z)| lr 2.85e-05 | 322.85 ms | 52.3% bf16 MFU | 1624569 tok/s step 16924/19560 | loss 3.281543 (-0.22z)| norm 0.3127 (+1.05z)| lr 2.85e-05 | 322.54 ms | 52.3% bf16 MFU | 1624616 tok/s step 16925/19560 | loss 3.287445 (-0.09z)| norm 0.2546 (-0.95z)| lr 2.85e-05 | 322.70 ms | 52.3% bf16 MFU | 1624620 tok/s step 16926/19560 | loss 3.201544 (-2.12z)| norm 0.2673 (-0.51z)| lr 2.84e-05 | 322.81 ms | 52.3% bf16 MFU | 1624597 tok/s step 16927/19560 | loss 3.231972 (-1.37z)| norm 0.2561 (-0.89z)| lr 2.84e-05 | 322.56 ms | 52.3% bf16 MFU | 1624637 tok/s step 16928/19560 | loss 3.326386 (+0.86z)| norm 0.3231 (+1.43z)| lr 2.84e-05 | 322.46 ms | 52.3% bf16 MFU | 
1624699 tok/s step 16929/19560 | loss 3.322318 (+0.75z)| norm 0.2659 (-0.55z)| lr 2.84e-05 | 322.87 ms | 52.3% bf16 MFU | 1624657 tok/s step 16930/19560 | loss 3.273261 (-0.42z)| norm 0.2846 (+0.10z)| lr 2.84e-05 | 322.98 ms | 52.3% bf16 MFU | 1624589 tok/s step 16931/19560 | loss 3.174431 (-2.70z)| norm 0.2881 (+0.22z)| lr 2.83e-05 | 322.74 ms | 52.3% bf16 MFU | 1624584 tok/s step 16932/19560 | loss 3.280870 (-0.20z)| norm 0.2707 (-0.38z)| lr 2.83e-05 | 322.58 ms | 52.3% bf16 MFU | 1624619 tok/s step 16933/19560 | loss 3.252079 (-0.90z)| norm 0.2591 (-0.78z)| lr 2.83e-05 | 323.05 ms | 52.2% bf16 MFU | 1624534 tok/s step 16934/19560 | loss 3.228416 (-1.45z)| norm 0.3278 (+1.58z)| lr 2.83e-05 | 322.64 ms | 52.3% bf16 MFU | 1624556 tok/s step 16935/19560 | loss 3.261342 (-0.66z)| norm 0.2750 (-0.23z)| lr 2.82e-05 | 322.97 ms | 52.3% bf16 MFU | 1624494 tok/s step 16936/19560 | loss 3.268888 (-0.47z)| norm 0.2768 (-0.16z)| lr 2.82e-05 | 322.60 ms | 52.3% bf16 MFU | 1624530 tok/s step 16937/19560 | loss 3.319880 (+0.80z)| norm 0.2912 (+0.34z)| lr 2.82e-05 | 322.69 ms | 52.3% bf16 MFU | 1624540 tok/s step 16938/19560 | loss 3.348206 (+1.49z)| norm 0.2740 (-0.26z)| lr 2.82e-05 | 322.70 ms | 52.3% bf16 MFU | 1624549 tok/s step 16939/19560 | loss 3.294881 (+0.16z)| norm 0.3045 (+0.79z)| lr 2.82e-05 | 322.92 ms | 52.3% bf16 MFU | 1624501 tok/s step 16940/19560 | loss 3.349789 (+1.51z)| norm 0.2438 (-1.30z)| lr 2.81e-05 | 322.90 ms | 52.3% bf16 MFU | 1624460 tok/s step 16941/19560 | loss 3.218041 (-1.71z)| norm 0.2854 (+0.15z)| lr 2.81e-05 | 322.53 ms | 52.3% bf16 MFU | 1624515 tok/s step 16942/19560 | loss 3.312440 (+0.61z)| norm 0.2581 (-0.78z)| lr 2.81e-05 | 322.63 ms | 52.3% bf16 MFU | 1624541 tok/s step 16943/19560 | loss 3.305122 (+0.43z)| norm 0.2514 (-1.00z)| lr 2.81e-05 | 323.04 ms | 52.2% bf16 MFU | 1624463 tok/s step 16944/19560 | loss 3.223235 (-1.56z)| norm 0.3512 (+2.41z)| lr 2.81e-05 | 322.81 ms | 52.3% bf16 MFU | 1624446 tok/s step 16945/19560 | loss 3.286501 
(-0.02z)| norm 0.2537 (-0.93z)| lr 2.80e-05 | 322.42 ms | 52.3% bf16 MFU | 1624530 tok/s step 16946/19560 | loss 3.260054 (-0.66z)| norm 0.2820 (+0.04z)| lr 2.80e-05 | 322.78 ms | 52.3% bf16 MFU | 1624519 tok/s step 16947/19560 | loss 3.263941 (-0.57z)| norm 0.3135 (+1.10z)| lr 2.80e-05 | 322.65 ms | 52.3% bf16 MFU | 1624541 tok/s step 16948/19560 | loss 3.286733 (+0.01z)| norm 0.2892 (+0.27z)| lr 2.80e-05 | 322.94 ms | 52.3% bf16 MFU | 1624487 tok/s step 16949/19560 | loss 3.304066 (+0.45z)| norm 0.2898 (+0.28z)| lr 2.80e-05 | 322.62 ms | 52.3% bf16 MFU | 1624518 tok/s step 16950/19560 | loss 3.265019 (-0.54z)| norm 0.2623 (-0.65z)| lr 2.79e-05 | 322.99 ms | 52.3% bf16 MFU | 1624454 tok/s step 16951/19560 | loss 3.251135 (-0.88z)| norm 0.2719 (-0.32z)| lr 2.79e-05 | 322.31 ms | 52.4% bf16 MFU | 1624564 tok/s step 16952/19560 | loss 3.294469 (+0.23z)| norm 0.2543 (-0.92z)| lr 2.79e-05 | 322.29 ms | 52.4% bf16 MFU | 1624674 tok/s step 16953/19560 | loss 3.275650 (-0.25z)| norm 0.3845 (+3.38z)| lr 2.79e-05 | 322.66 ms | 52.3% bf16 MFU | 1624684 tok/s step 16954/19560 | loss 3.327053 (+1.06z)| norm 0.2444 (-1.21z)| lr 2.78e-05 | 322.99 ms | 52.3% bf16 MFU | 1624611 tok/s step 16955/19560 | loss 3.303129 (+0.44z)| norm 0.3479 (+2.13z)| lr 2.78e-05 | 322.27 ms | 52.4% bf16 MFU | 1624724 tok/s step 16956/19560 | loss 3.305569 (+0.49z)| norm 0.2773 (-0.15z)| lr 2.78e-05 | 322.75 ms | 52.3% bf16 MFU | 1624709 tok/s step 16957/19560 | loss 3.241556 (-1.13z)| norm 0.2864 (+0.14z)| lr 2.78e-05 | 322.59 ms | 52.3% bf16 MFU | 1624736 tok/s step 16958/19560 | loss 3.356787 (+1.76z)| norm 0.3411 (+1.92z)| lr 2.78e-05 | 322.55 ms | 52.3% bf16 MFU | 1624772 tok/s step 16959/19560 | loss 3.265430 (-0.54z)| norm 0.3107 (+0.91z)| lr 2.77e-05 | 322.76 ms | 52.3% bf16 MFU | 1624752 tok/s step 16960/19560 | loss 3.265310 (-0.53z)| norm 0.2863 (+0.12z)| lr 2.77e-05 | 323.07 ms | 52.2% bf16 MFU | 1624656 tok/s step 16961/19560 | loss 3.342945 (+1.40z)| norm 0.3127 (+0.96z)| lr 2.77e-05 | 
322.68 ms | 52.3% bf16 MFU | 1624664 tok/s step 16962/19560 | loss 3.284608 (-0.04z)| norm 0.2500 (-1.05z)| lr 2.77e-05 | 322.40 ms | 52.3% bf16 MFU | 1624740 tok/s step 16963/19560 | loss 3.208146 (-1.98z)| norm 0.2494 (-1.07z)| lr 2.77e-05 | 322.67 ms | 52.3% bf16 MFU | 1624745 tok/s step 16964/19560 | loss 3.275563 (-0.27z)| norm 0.2722 (-0.34z)| lr 2.76e-05 | 322.86 ms | 52.3% bf16 MFU | 1624701 tok/s step 16965/19560 | loss 3.264453 (-0.55z)| norm 0.2545 (-0.91z)| lr 2.76e-05 | 322.97 ms | 52.3% bf16 MFU | 1624633 tok/s step 16966/19560 | loss 3.293572 (+0.20z)| norm 0.2436 (-1.25z)| lr 2.76e-05 | 322.67 ms | 52.3% bf16 MFU | 1624643 tok/s step 16967/19560 | loss 3.389595 (+2.56z)| norm 0.2678 (-0.46z)| lr 2.76e-05 | 322.54 ms | 52.3% bf16 MFU | 1624686 tok/s step 16968/19560 | loss 3.298540 (+0.28z)| norm 0.2463 (-1.15z)| lr 2.76e-05 | 322.74 ms | 52.3% bf16 MFU | 1624676 tok/s step 16969/19560 | loss 3.460010 (+3.99z)| norm 0.3144 (+1.03z)| lr 2.75e-05 | 322.59 ms | 52.3% bf16 MFU | 1624704 tok/s step 16970/19560 | loss 3.344507 (+1.29z)| norm 0.2476 (-1.12z)| lr 2.75e-05 | 322.73 ms | 52.3% bf16 MFU | 1624696 tok/s step 16971/19560 | loss 3.324063 (+0.81z)| norm 0.2506 (-1.02z)| lr 2.75e-05 | 322.58 ms | 52.3% bf16 MFU | 1624726 tok/s step 16972/19560 | loss 3.320098 (+0.72z)| norm 0.2687 (-0.45z)| lr 2.75e-05 | 322.68 ms | 52.3% bf16 MFU | 1624731 tok/s step 16973/19560 | loss 3.241669 (-1.10z)| norm 0.2413 (-1.32z)| lr 2.74e-05 | 322.69 ms | 52.3% bf16 MFU | 1624730 tok/s step 16974/19560 | loss 3.245759 (-1.03z)| norm 0.2815 (-0.03z)| lr 2.74e-05 | 322.71 ms | 52.3% bf16 MFU | 1624726 tok/s step 16975/19560 | loss 3.231895 (-1.34z)| norm 0.3164 (+1.07z)| lr 2.74e-05 | 322.39 ms | 52.3% bf16 MFU | 1624802 tok/s step 16976/19560 | loss 3.285879 (-0.08z)| norm 0.2442 (-1.24z)| lr 2.74e-05 | 322.81 ms | 52.3% bf16 MFU | 1624768 tok/s step 16977/19560 | loss 3.318507 (+0.69z)| norm 0.3158 (+1.04z)| lr 2.74e-05 | 322.63 ms | 52.3% bf16 MFU | 1624781 tok/s step 
16978/19560 | loss 3.321279 (+0.76z)| norm 0.2735 (-0.32z)| lr 2.73e-05 | 323.00 ms | 52.3% bf16 MFU | 1624701 tok/s step 16979/19560 | loss 3.288300 (-0.02z)| norm 0.2480 (-1.14z)| lr 2.73e-05 | 322.98 ms | 52.3% bf16 MFU | 1624629 tok/s step 16980/19560 | loss 3.249969 (-0.93z)| norm 0.3168 (+1.12z)| lr 2.73e-05 | 322.29 ms | 52.4% bf16 MFU | 1624735 tok/s step 16981/19560 | loss 3.324527 (+0.88z)| norm 0.3181 (+1.14z)| lr 2.73e-05 | 322.61 ms | 52.3% bf16 MFU | 1624754 tok/s step 16982/19560 | loss 3.258425 (-0.72z)| norm 0.2815 (-0.06z)| lr 2.73e-05 | 322.66 ms | 52.3% bf16 MFU | 1624762 tok/s step 16983/19560 | loss 3.284637 (-0.09z)| norm 0.3744 (+2.87z)| lr 2.72e-05 | 322.25 ms | 52.4% bf16 MFU | 1624871 tok/s step 16984/19560 | loss 3.384253 (+2.28z)| norm 0.2599 (-0.77z)| lr 2.72e-05 | 322.59 ms | 52.3% bf16 MFU | 1624891 tok/s step 16985/19560 | loss 3.295795 (+0.17z)| norm 0.3049 (+0.66z)| lr 2.72e-05 | 322.71 ms | 52.3% bf16 MFU | 1624879 tok/s step 16986/19560 | loss 3.351488 (+1.50z)| norm 0.2677 (-0.52z)| lr 2.72e-05 | 322.16 ms | 52.4% bf16 MFU | 1625007 tok/s step 16987/19560 | loss 3.268316 (-0.48z)| norm 0.2686 (-0.49z)| lr 2.72e-05 | 322.80 ms | 52.3% bf16 MFU | 1624966 tok/s step 16988/19560 | loss 3.336197 (+1.14z)| norm 0.2656 (-0.57z)| lr 2.71e-05 | 322.73 ms | 52.3% bf16 MFU | 1624944 tok/s step 16989/19560 | loss 3.253224 (-0.83z)| norm 0.2570 (-0.85z)| lr 2.71e-05 | 322.73 ms | 52.3% bf16 MFU | 1624925 tok/s step 16990/19560 | loss 3.252377 (-0.84z)| norm 0.2809 (-0.08z)| lr 2.71e-05 | 322.18 ms | 52.4% bf16 MFU | 1625043 tok/s step 16991/19560 | loss 3.275214 (-0.31z)| norm 0.2831 (+0.00z)| lr 2.71e-05 | 322.68 ms | 52.3% bf16 MFU | 1625031 tok/s step 16992/19560 | loss 3.241626 (-1.12z)| norm 0.2755 (-0.24z)| lr 2.71e-05 | 322.57 ms | 52.3% bf16 MFU | 1625048 tok/s step 16993/19560 | loss 3.317876 (+0.72z)| norm 0.3022 (+0.65z)| lr 2.70e-05 | 323.11 ms | 52.2% bf16 MFU | 1624926 tok/s step 16994/19560 | loss 3.337505 (+1.18z)| norm 
0.3471 (+2.10z)| lr 2.70e-05 | 322.76 ms | 52.3% bf16 MFU | 1624899 tok/s step 16995/19560 | loss 3.316359 (+0.68z)| norm 0.2777 (-0.18z)| lr 2.70e-05 | 322.76 ms | 52.3% bf16 MFU | 1624873 tok/s step 16996/19560 | loss 3.274448 (-0.34z)| norm 0.3371 (+1.80z)| lr 2.70e-05 | 322.09 ms | 52.4% bf16 MFU | 1625019 tok/s step 16997/19560 | loss 3.242675 (-1.10z)| norm 0.3492 (+2.15z)| lr 2.69e-05 | 322.31 ms | 52.4% bf16 MFU | 1625101 tok/s step 16998/19560 | loss 3.354915 (+1.59z)| norm 0.2872 (+0.13z)| lr 2.69e-05 | 322.77 ms | 52.3% bf16 MFU | 1625064 tok/s step 16999/19560 | loss 3.252008 (-0.88z)| norm 0.2812 (-0.06z)| lr 2.69e-05 | 322.77 ms | 52.3% bf16 MFU | 1625026 tok/s step 17000/19560 | loss 3.348119 (+1.40z)| norm 0.3150 (+1.03z)| lr 2.69e-05 | 322.61 ms | 52.3% bf16 MFU | 1625032 tok/s val loss 3.282649 HellaSwag: 3036/10042 = 0.302330 step 17001/19560 | loss 3.314213 (+0.60z)| norm 0.2971 (+0.45z)| lr 2.69e-05 | 322.33 ms | 52.4% bf16 MFU | 1625107 tok/s step 17002/19560 | loss 3.282849 (-0.14z)| norm 0.2668 (-0.54z)| lr 2.68e-05 | 322.47 ms | 52.3% bf16 MFU | 1625143 tok/s step 17003/19560 | loss 3.441029 (+3.48z)| norm 0.2616 (-0.71z)| lr 2.68e-05 | 322.88 ms | 52.3% bf16 MFU | 1625074 tok/s step 17004/19560 | loss 3.295593 (+0.14z)| norm 0.3483 (+2.08z)| lr 2.68e-05 | 322.86 ms | 52.3% bf16 MFU | 1625016 tok/s step 17005/19560 | loss 3.328510 (+0.90z)| norm 0.3072 (+0.76z)| lr 2.68e-05 | 322.86 ms | 52.3% bf16 MFU | 1624960 tok/s step 17006/19560 | loss 3.287755 (-0.03z)| norm 0.3012 (+0.56z)| lr 2.68e-05 | 322.95 ms | 52.3% bf16 MFU | 1624882 tok/s step 17007/19560 | loss 3.262697 (-0.60z)| norm 0.2901 (+0.20z)| lr 2.67e-05 | 322.78 ms | 52.3% bf16 MFU | 1624852 tok/s step 17008/19560 | loss 3.285554 (-0.07z)| norm 0.2923 (+0.28z)|
lr 2.67e-05 | 322.73 ms | 52.3% bf16 MFU | 1624838 tok/s step 17009/19560 | loss 3.238905 (-1.13z)| norm 0.3176 (+1.08z)| lr 2.67e-05 | 322.72 ms | 52.3% bf16 MFU | 1624825 tok/s step 17010/19560 | loss 3.305457 (+0.38z)| norm 0.2791 (-0.17z)| lr 2.67e-05 | 322.92 ms | 52.3% bf16 MFU | 1624762 tok/s step 17011/19560 | loss 3.254641 (-0.77z)| norm 0.2478 (-1.17z)| lr 2.67e-05 | 322.81 ms | 52.3% bf16 MFU | 1624730 tok/s step 17012/19560 | loss 3.258979 (-0.68z)| norm 0.2754 (-0.28z)| lr 2.66e-05 | 322.84 ms | 52.3% bf16 MFU | 1624692 tok/s step 17013/19560 | loss 3.299139 (+0.25z)| norm 0.2820 (-0.07z)| lr 2.66e-05 | 323.02 ms | 52.2% bf16 MFU | 1624610 tok/s step 17014/19560 | loss 3.239089 (-1.13z)| norm 0.2526 (-1.01z)| lr 2.66e-05 | 322.81 ms | 52.3% bf16 MFU | 1624588 tok/s step 17015/19560 | loss 3.276254 (-0.28z)| norm 0.2827 (-0.04z)| lr 2.66e-05 | 323.00 ms | 52.3% bf16 MFU | 1624517 tok/s step 17016/19560 | loss 3.295315 (+0.17z)| norm 0.2763 (-0.24z)| lr 2.66e-05 | 323.22 ms | 52.2% bf16 MFU | 1624395 tok/s step 17017/19560 | loss 3.305009 (+0.39z)| norm 0.3021 (+0.59z)| lr 2.65e-05 | 322.55 ms | 52.3% bf16 MFU | 1624448 tok/s step 17018/19560 | loss 3.275978 (-0.27z)| norm 0.2801 (-0.12z)| lr 2.65e-05 | 322.70 ms | 52.3% bf16 MFU | 1624460 tok/s step 17019/19560 | loss 3.267152 (-0.47z)| norm 0.3323 (+1.53z)| lr 2.65e-05 | 322.88 ms | 52.3% bf16 MFU | 1624425 tok/s step 17020/19560 | loss 3.279873 (-0.18z)| norm 0.2850 (+0.02z)| lr 2.65e-05 | 322.77 ms | 52.3% bf16 MFU | 1624420 tok/s step 17021/19560 | loss 3.334585 (+1.08z)| norm 0.2558 (-0.91z)| lr 2.65e-05 | 322.95 ms | 52.3% bf16 MFU | 1624371 tok/s step 17022/19560 | loss 3.274843 (-0.32z)| norm 0.3572 (+2.26z)| lr 2.64e-05 | 322.48 ms | 52.3% bf16 MFU | 1624442 tok/s step 17023/19560 | loss 3.372851 (+1.93z)| norm 0.3027 (+0.55z)| lr 2.64e-05 | 322.66 ms | 52.3% bf16 MFU | 1624464 tok/s step 17024/19560 | loss 3.253300 (-0.82z)| norm 0.2562 (-0.91z)| lr 2.64e-05 | 322.98 ms | 52.3% bf16 MFU | 
1624403 tok/s step 17025/19560 | loss 3.289426 (+0.03z)| norm 0.3408 (+1.70z)| lr 2.64e-05 | 323.00 ms | 52.3% bf16 MFU | 1624342 tok/s step 17026/19560 | loss 3.247413 (-0.94z)| norm 0.2811 (-0.14z)| lr 2.64e-05 | 323.10 ms | 52.2% bf16 MFU | 1624258 tok/s step 17027/19560 | loss 3.341336 (+1.22z)| norm 0.2596 (-0.80z)| lr 2.63e-05 | 322.85 ms | 52.3% bf16 MFU | 1624242 tok/s step 17028/19560 | loss 3.178657 (-2.45z)| norm 0.3291 (+1.33z)| lr 2.63e-05 | 322.36 ms | 52.4% bf16 MFU | 1624350 tok/s step 17029/19560 | loss 3.324472 (+0.83z)| norm 0.3520 (+1.98z)| lr 2.63e-05 | 323.00 ms | 52.3% bf16 MFU | 1624292 tok/s step 17030/19560 | loss 3.269155 (-0.41z)| norm 0.2855 (-0.03z)| lr 2.63e-05 | 323.09 ms | 52.2% bf16 MFU | 1624214 tok/s step 17031/19560 | loss 3.402013 (+2.50z)| norm 0.3099 (+0.70z)| lr 2.62e-05 | 322.68 ms | 52.3% bf16 MFU | 1624242 tok/s step 17032/19560 | loss 3.349601 (+1.32z)| norm 0.3325 (+1.36z)| lr 2.62e-05 | 322.34 ms | 52.4% bf16 MFU | 1624356 tok/s step 17033/19560 | loss 3.307264 (+0.39z)| norm 0.2566 (-0.91z)| lr 2.62e-05 | 322.67 ms | 52.3% bf16 MFU | 1624379 tok/s step 17034/19560 | loss 3.240538 (-1.07z)| norm 0.2583 (-0.85z)| lr 2.62e-05 | 322.88 ms | 52.3% bf16 MFU | 1624348 tok/s step 17035/19560 | loss 3.296966 (+0.17z)| norm 0.2900 (+0.09z)| lr 2.62e-05 | 323.68 ms | 52.1% bf16 MFU | 1624119 tok/s step 17036/19560 | loss 3.299164 (+0.22z)| norm 0.2809 (-0.18z)| lr 2.61e-05 | 322.73 ms | 52.3% bf16 MFU | 1624140 tok/s step 17037/19560 | loss 3.269113 (-0.44z)| norm 0.2750 (-0.36z)| lr 2.61e-05 | 322.53 ms | 52.3% bf16 MFU | 1624209 tok/s step 17038/19560 | loss 3.356388 (+1.45z)| norm 0.2453 (-1.24z)| lr 2.61e-05 | 323.29 ms | 52.2% bf16 MFU | 1624085 tok/s step 17039/19560 | loss 3.272774 (-0.37z)| norm 0.2835 (-0.07z)| lr 2.61e-05 | 322.70 ms | 52.3% bf16 MFU | 1624115 tok/s step 17040/19560 | loss 3.230104 (-1.29z)| norm 0.2623 (-0.75z)| lr 2.61e-05 | 323.44 ms | 52.2% bf16 MFU | 1623958 tok/s step 17041/19560 | loss 3.316009 
(+0.58z)| norm 0.2443 (-1.30z)| lr 2.60e-05 | 322.98 ms | 52.3% bf16 MFU | 1623925 tok/s step 17042/19560 | loss 3.281206 (-0.18z)| norm 0.2799 (-0.17z)| lr 2.60e-05 | 322.87 ms | 52.3% bf16 MFU | 1623921 tok/s step 17043/19560 | loss 3.297557 (+0.18z)| norm 0.2573 (-0.87z)| lr 2.60e-05 | 322.99 ms | 52.3% bf16 MFU | 1623887 tok/s step 17044/19560 | loss 3.290103 (+0.00z)| norm 0.2457 (-1.23z)| lr 2.60e-05 | 322.94 ms | 52.3% bf16 MFU | 1623867 tok/s step 17045/19560 | loss 3.315735 (+0.56z)| norm 0.2569 (-0.86z)| lr 2.60e-05 | 323.26 ms | 52.2% bf16 MFU | 1623768 tok/s step 17046/19560 | loss 3.255708 (-0.75z)| norm 0.2573 (-0.85z)| lr 2.59e-05 | 322.26 ms | 52.4% bf16 MFU | 1623924 tok/s step 17047/19560 | loss 3.266304 (-0.52z)| norm 0.2644 (-0.63z)| lr 2.59e-05 | 323.28 ms | 52.2% bf16 MFU | 1623817 tok/s step 17048/19560 | loss 3.273053 (-0.37z)| norm 0.2509 (-1.06z)| lr 2.59e-05 | 322.88 ms | 52.3% bf16 MFU | 1623816 tok/s step 17049/19560 | loss 3.253327 (-0.80z)| norm 0.2676 (-0.54z)| lr 2.59e-05 | 322.50 ms | 52.3% bf16 MFU | 1623910 tok/s step 17050/19560 | loss 3.263214 (-0.58z)| norm 0.2722 (-0.39z)| lr 2.59e-05 | 323.29 ms | 52.2% bf16 MFU | 1623801 tok/s step 17051/19560 | loss 3.272166 (-0.38z)| norm 0.2701 (-0.46z)| lr 2.58e-05 | 322.45 ms | 52.3% bf16 MFU | 1623908 tok/s step 17052/19560 | loss 3.244655 (-0.97z)| norm 0.2544 (-0.94z)| lr 2.58e-05 | 322.73 ms | 52.3% bf16 MFU | 1623938 tok/s step 17053/19560 | loss 3.293125 (+0.08z)| norm 0.2679 (-0.52z)| lr 2.58e-05 | 323.17 ms | 52.2% bf16 MFU | 1623857 tok/s step 17054/19560 | loss 3.313490 (+0.52z)| norm 0.2606 (-0.75z)| lr 2.58e-05 | 322.81 ms | 52.3% bf16 MFU | 1623871 tok/s step 17055/19560 | loss 3.264741 (-0.57z)| norm 0.2540 (-0.96z)| lr 2.58e-05 | 323.24 ms | 52.2% bf16 MFU | 1623777 tok/s step 17056/19560 | loss 3.274300 (-0.35z)| norm 0.2493 (-1.09z)| lr 2.57e-05 | 322.68 ms | 52.3% bf16 MFU | 1623827 tok/s step 17057/19560 | loss 3.283100 (-0.15z)| norm 0.2766 (-0.23z)| lr 2.57e-05 | 
323.00 ms | 52.3% bf16 MFU | 1623795 tok/s step 17058/19560 | loss 3.289454 (-0.01z)| norm 0.2515 (-1.01z)| lr 2.57e-05 | 323.55 ms | 52.2% bf16 MFU | 1623626 tok/s step 17059/19560 | loss 3.343736 (+1.20z)| norm 0.2578 (-0.81z)| lr 2.57e-05 | 322.69 ms | 52.3% bf16 MFU | 1623682 tok/s step 17060/19560 | loss 3.297201 (+0.13z)| norm 0.2797 (-0.11z)| lr 2.57e-05 | 322.95 ms | 52.3% bf16 MFU | 1623669 tok/s step 17061/19560 | loss 3.303451 (+0.27z)| norm 0.2723 (-0.35z)| lr 2.56e-05 | 322.80 ms | 52.3% bf16 MFU | 1623694 tok/s step 17062/19560 | loss 3.321055 (+0.66z)| norm 0.3425 (+1.87z)| lr 2.56e-05 | 323.22 ms | 52.2% bf16 MFU | 1623613 tok/s step 17063/19560 | loss 3.274750 (-0.41z)| norm 0.2583 (-0.79z)| lr 2.56e-05 | 323.13 ms | 52.2% bf16 MFU | 1623559 tok/s step 17064/19560 | loss 3.315272 (+0.52z)| norm 0.3117 (+0.88z)| lr 2.56e-05 | 323.43 ms | 52.2% bf16 MFU | 1623432 tok/s step 17065/19560 | loss 3.403197 (+2.47z)| norm 0.2898 (+0.19z)| lr 2.56e-05 | 323.42 ms | 52.2% bf16 MFU | 1623313 tok/s step 17066/19560 | loss 3.306365 (+0.30z)| norm 0.2789 (-0.15z)| lr 2.55e-05 | 322.75 ms | 52.3% bf16 MFU | 1623369 tok/s step 17067/19560 | loss 3.301741 (+0.19z)| norm 0.2633 (-0.63z)| lr 2.55e-05 | 322.88 ms | 52.3% bf16 MFU | 1623390 tok/s step 17068/19560 | loss 3.238950 (-1.21z)| norm 0.2598 (-0.74z)| lr 2.55e-05 | 323.23 ms | 52.2% bf16 MFU | 1623323 tok/s step 17069/19560 | loss 3.255427 (-0.85z)| norm 0.2620 (-0.67z)| lr 2.55e-05 | 323.31 ms | 52.2% bf16 MFU | 1623238 tok/s step 17070/19560 | loss 3.345162 (+1.19z)| norm 0.2532 (-0.94z)| lr 2.55e-05 | 323.23 ms | 52.2% bf16 MFU | 1623178 tok/s step 17071/19560 | loss 3.293936 (+0.02z)| norm 0.2733 (-0.32z)| lr 2.54e-05 | 322.91 ms | 52.3% bf16 MFU | 1623201 tok/s step 17072/19560 | loss 3.262367 (-0.71z)| norm 0.2595 (-0.74z)| lr 2.54e-05 | 323.54 ms | 52.2% bf16 MFU | 1623064 tok/s step 17073/19560 | loss 3.239794 (-1.21z)| norm 0.2555 (-0.87z)| lr 2.54e-05 | 322.75 ms | 52.3% bf16 MFU | 1623132 tok/s step 
17074/19560 | loss 3.281126 (-0.27z)| norm 0.2644 (-0.58z)| lr 2.54e-05 | 323.87 ms | 52.1% bf16 MFU | 1622917 tok/s step 17075/19560 | loss 3.241672 (-1.17z)| norm 0.3102 (+0.89z)| lr 2.54e-05 | 323.17 ms | 52.2% bf16 MFU | 1622888 tok/s step 17076/19560 | loss 3.292370 (-0.01z)| norm 0.2458 (-1.16z)| lr 2.53e-05 | 322.60 ms | 52.3% bf16 MFU | 1623004 tok/s step 17077/19560 | loss 3.242775 (-1.12z)| norm 0.2658 (-0.52z)| lr 2.53e-05 | 322.90 ms | 52.3% bf16 MFU | 1623038 tok/s step 17078/19560 | loss 3.278851 (-0.31z)| norm 0.3225 (+1.27z)| lr 2.53e-05 | 323.20 ms | 52.2% bf16 MFU | 1622996 tok/s step 17079/19560 | loss 3.262624 (-0.68z)| norm 0.2844 (+0.06z)| lr 2.53e-05 | 323.42 ms | 52.2% bf16 MFU | 1622899 tok/s step 17080/19560 | loss 3.245156 (-1.06z)| norm 0.2570 (-0.81z)| lr 2.53e-05 | 322.75 ms | 52.3% bf16 MFU | 1622975 tok/s step 17081/19560 | loss 3.318777 (+0.59z)| norm 0.3084 (+0.88z)| lr 2.52e-05 | 322.55 ms | 52.3% bf16 MFU | 1623098 tok/s step 17082/19560 | loss 3.265798 (-0.59z)| norm 0.2843 (+0.07z)| lr 2.52e-05 | 322.65 ms | 52.3% bf16 MFU | 1623190 tok/s step 17083/19560 | loss 3.269345 (-0.51z)| norm 0.2672 (-0.49z)| lr 2.52e-05 | 322.43 ms | 52.3% bf16 MFU | 1623334 tok/s step 17084/19560 | loss 3.287759 (-0.09z)| norm 0.3545 (+2.39z)| lr 2.52e-05 | 322.73 ms | 52.3% bf16 MFU | 1623393 tok/s step 17085/19560 | loss 3.320575 (+0.64z)| norm 0.2873 (+0.17z)| lr 2.52e-05 | 322.95 ms | 52.3% bf16 MFU | 1623394 tok/s step 17086/19560 | loss 3.270534 (-0.48z)| norm 0.2976 (+0.53z)| lr 2.51e-05 | 322.87 ms | 52.3% bf16 MFU | 1623418 tok/s step 17087/19560 | loss 3.260152 (-0.72z)| norm 0.2751 (-0.22z)| lr 2.51e-05 | 322.14 ms | 52.4% bf16 MFU | 1623622 tok/s step 17088/19560 | loss 3.225769 (-1.49z)| norm 0.2623 (-0.65z)| lr 2.51e-05 | 322.82 ms | 52.3% bf16 MFU | 1623644 tok/s step 17089/19560 | loss 3.302961 (+0.28z)| norm 0.2643 (-0.57z)| lr 2.51e-05 | 322.46 ms | 52.3% bf16 MFU | 1623758 tok/s step 17090/19560 | loss 3.269382 (-0.49z)| norm 
0.2591 (-0.75z)| lr 2.51e-05 | 322.47 ms | 52.3% bf16 MFU | 1623863 tok/s step 17091/19560 | loss 3.300880 (+0.22z)| norm 0.2590 (-0.75z)| lr 2.50e-05 | 323.12 ms | 52.2% bf16 MFU | 1623799 tok/s step 17092/19560 | loss 3.223337 (-1.55z)| norm 0.2805 (-0.03z)| lr 2.50e-05 | 322.46 ms | 52.3% bf16 MFU | 1623903 tok/s step 17093/19560 | loss 3.267263 (-0.55z)| norm 0.2439 (-1.26z)| lr 2.50e-05 | 322.65 ms | 52.3% bf16 MFU | 1623956 tok/s step 17094/19560 | loss 3.297924 (+0.16z)| norm 0.2647 (-0.57z)| lr 2.50e-05 | 322.53 ms | 52.3% bf16 MFU | 1624035 tok/s step 17095/19560 | loss 3.282973 (-0.17z)| norm 0.2731 (-0.28z)| lr 2.50e-05 | 322.65 ms | 52.3% bf16 MFU | 1624079 tok/s step 17096/19560 | loss 3.285148 (-0.12z)| norm 0.2602 (-0.73z)| lr 2.49e-05 | 322.61 ms | 52.3% bf16 MFU | 1624133 tok/s step 17097/19560 | loss 3.292131 (+0.08z)| norm 0.3368 (+1.87z)| lr 2.49e-05 | 322.87 ms | 52.3% bf16 MFU | 1624119 tok/s step 17098/19560 | loss 3.268462 (-0.50z)| norm 0.2667 (-0.51z)| lr 2.49e-05 | 322.60 ms | 52.3% bf16 MFU | 1624174 tok/s step 17099/19560 | loss 3.290037 (+0.05z)| norm 0.3175 (+1.19z)| lr 2.49e-05 | 322.52 ms | 52.3% bf16 MFU | 1624246 tok/s step 17100/19560 | loss 3.267101 (-0.52z)| norm 0.2852 (+0.09z)| lr 2.49e-05 | 322.40 ms | 52.3% bf16 MFU | 1624344 tok/s step 17101/19560 | loss 3.284747 (-0.08z)| norm 0.2840 (+0.04z)| lr 2.48e-05 | 323.21 ms | 52.2% bf16 MFU | 1624233 tok/s step 17102/19560 | loss 3.287952 (-0.01z)| norm 0.2695 (-0.45z)| lr 2.48e-05 | 322.43 ms | 52.3% bf16 MFU | 1624324 tok/s step 17103/19560 | loss 3.281151 (-0.19z)| norm 0.2509 (-1.07z)| lr 2.48e-05 | 322.70 ms | 52.3% bf16 MFU | 1624342 tok/s step 17104/19560 | loss 3.303829 (+0.39z)| norm 0.2954 (+0.44z)| lr 2.48e-05 | 322.55 ms | 52.3% bf16 MFU | 1624399 tok/s step 17105/19560 | loss 3.254376 (-0.87z)| norm 0.2664 (-0.55z)| lr 2.48e-05 | 322.49 ms | 52.3% bf16 MFU | 1624466 tok/s step 17106/19560 | loss 3.278880 (-0.24z)| norm 0.2884 (+0.21z)| lr 2.47e-05 | 322.47 ms | 
52.3% bf16 MFU | 1624536 tok/s step 17107/19560 | loss 3.264317 (-0.60z)| norm 0.2613 (-0.74z)| lr 2.47e-05 | 322.38 ms | 52.4% bf16 MFU | 1624624 tok/s step 17108/19560 | loss 3.284691 (-0.09z)| norm 0.2471 (-1.21z)| lr 2.47e-05 | 322.73 ms | 52.3% bf16 MFU | 1624620 tok/s step 17109/19560 | loss 3.283211 (-0.12z)| norm 0.2541 (-0.95z)| lr 2.47e-05 | 322.63 ms | 52.3% bf16 MFU | 1624640 tok/s step 17110/19560 | loss 3.286036 (-0.05z)| norm 0.2513 (-1.03z)| lr 2.47e-05 | 322.98 ms | 52.3% bf16 MFU | 1624573 tok/s step 17111/19560 | loss 3.280546 (-0.19z)| norm 0.2824 (+0.07z)| lr 2.46e-05 | 322.21 ms | 52.4% bf16 MFU | 1624702 tok/s step 17112/19560 | loss 3.246867 (-1.06z)| norm 0.2513 (-1.05z)| lr 2.46e-05 | 322.72 ms | 52.3% bf16 MFU | 1624696 tok/s step 17113/19560 | loss 3.285671 (-0.03z)| norm 0.2598 (-0.73z)| lr 2.46e-05 | 322.58 ms | 52.3% bf16 MFU | 1624728 tok/s step 17114/19560 | loss 3.319499 (+0.88z)| norm 0.2954 (+0.55z)| lr 2.46e-05 | 323.37 ms | 52.2% bf16 MFU | 1624558 tok/s step 17115/19560 | loss 3.272234 (-0.38z)| norm 0.2489 (-1.12z)| lr 2.46e-05 | 323.55 ms | 52.2% bf16 MFU | 1624352 tok/s step 17116/19560 | loss 3.323723 (+1.00z)| norm 0.3200 (+1.41z)| lr 2.45e-05 | 322.62 ms | 52.3% bf16 MFU | 1624390 tok/s step 17117/19560 | loss 3.319353 (+0.87z)| norm 0.2519 (-1.02z)| lr 2.45e-05 | 323.10 ms | 52.2% bf16 MFU | 1624304 tok/s step 17118/19560 | loss 3.278506 (-0.23z)| norm 0.2650 (-0.55z)| lr 2.45e-05 | 322.96 ms | 52.3% bf16 MFU | 1624258 tok/s step 17119/19560 | loss 3.297834 (+0.28z)| norm 0.2553 (-0.88z)| lr 2.45e-05 | 322.90 ms | 52.3% bf16 MFU | 1624230 tok/s step 17120/19560 | loss 3.308108 (+0.55z)| norm 0.2901 (+0.35z)| lr 2.45e-05 | 322.28 ms | 52.4% bf16 MFU | 1624360 tok/s step 17121/19560 | loss 3.247128 (-1.09z)| norm 0.2371 (-1.50z)| lr 2.44e-05 | 322.59 ms | 52.3% bf16 MFU | 1624404 tok/s step 17122/19560 | loss 3.244802 (-1.13z)| norm 0.2723 (-0.25z)| lr 2.44e-05 | 323.05 ms | 52.2% bf16 MFU | 1624329 tok/s step 17123/19560 
| loss 3.282825 (-0.10z)| norm 0.2967 (+0.63z)| lr 2.44e-05 | 322.36 ms | 52.4% bf16 MFU | 1624432 tok/s step 17124/19560 | loss 3.276876 (-0.26z)| norm 0.2758 (-0.11z)| lr 2.44e-05 | 322.93 ms | 52.3% bf16 MFU | 1624388 tok/s step 17125/19560 | loss 3.390455 (+2.73z)| norm 0.3304 (+1.92z)| lr 2.44e-05 | 322.50 ms | 52.3% bf16 MFU | 1624453 tok/s step 17126/19560 | loss 3.398291 (+2.87z)| norm 0.2694 (-0.34z)| lr 2.43e-05 | 322.24 ms | 52.4% bf16 MFU | 1624580 tok/s step 17127/19560 | loss 3.286615 (-0.04z)| norm 0.2604 (-0.67z)| lr 2.43e-05 | 323.04 ms | 52.2% bf16 MFU | 1624501 tok/s step 17128/19560 | loss 3.241783 (-1.19z)| norm 0.2604 (-0.65z)| lr 2.43e-05 | 322.61 ms | 52.3% bf16 MFU | 1624533 tok/s step 17129/19560 | loss 3.270835 (-0.42z)| norm 0.3205 (+1.56z)| lr 2.43e-05 | 322.89 ms | 52.3% bf16 MFU | 1624492 tok/s step 17130/19560 | loss 3.257074 (-0.78z)| norm 0.2573 (-0.77z)| lr 2.43e-05 | 322.52 ms | 52.3% bf16 MFU | 1624547 tok/s step 17131/19560 | loss 3.288800 (+0.09z)| norm 0.3112 (+1.20z)| lr 2.42e-05 | 322.92 ms | 52.3% bf16 MFU | 1624500 tok/s step 17132/19560 | loss 3.345719 (+1.65z)| norm 0.2569 (-0.79z)| lr 2.42e-05 | 322.93 ms | 52.3% bf16 MFU | 1624452 tok/s step 17133/19560 | loss 3.263709 (-0.60z)| norm 0.2631 (-0.54z)| lr 2.42e-05 | 322.61 ms | 52.3% bf16 MFU | 1624486 tok/s step 17134/19560 | loss 3.232534 (-1.45z)| norm 0.2748 (-0.09z)| lr 2.42e-05 | 322.64 ms | 52.3% bf16 MFU | 1624512 tok/s step 17135/19560 | loss 3.305189 (+0.55z)| norm 0.2618 (-0.58z)| lr 2.42e-05 | 322.36 ms | 52.4% bf16 MFU | 1624608 tok/s step 17136/19560 | loss 3.283975 (-0.04z)| norm 0.2969 (+0.75z)| lr 2.41e-05 | 322.76 ms | 52.3% bf16 MFU | 1624596 tok/s step 17137/19560 | loss 3.283215 (-0.07z)| norm 0.2607 (-0.60z)| lr 2.41e-05 | 322.61 ms | 52.3% bf16 MFU | 1624624 tok/s step 17138/19560 | loss 3.286959 (+0.04z)| norm 0.2588 (-0.67z)| lr 2.41e-05 | 322.42 ms | 52.3% bf16 MFU | 1624698 tok/s step 17139/19560 | loss 3.266698 (-0.53z)| norm 0.2604 (-0.61z)| 
lr 2.41e-05 | 322.61 ms | 52.3% bf16 MFU | 1624720 tok/s step 17140/19560 | loss 3.239151 (-1.28z)| norm 0.2806 (+0.15z)| lr 2.41e-05 | 323.09 ms | 52.2% bf16 MFU | 1624619 tok/s step 17141/19560 | loss 3.223115 (-1.69z)| norm 0.2515 (-0.95z)| lr 2.40e-05 | 322.38 ms | 52.4% bf16 MFU | 1624704 tok/s step 17142/19560 | loss 3.262325 (-0.63z)| norm 0.2707 (-0.22z)| lr 2.40e-05 | 323.02 ms | 52.2% bf16 MFU | 1624622 tok/s step 17143/19560 | loss 3.351125 (+1.78z)| norm 0.3273 (+1.90z)| lr 2.40e-05 | 322.54 ms | 52.3% bf16 MFU | 1624667 tok/s step 17144/19560 | loss 3.285269 (-0.01z)| norm 0.3432 (+2.42z)| lr 2.40e-05 | 322.55 ms | 52.3% bf16 MFU | 1624705 tok/s step 17145/19560 | loss 3.277074 (-0.23z)| norm 0.2536 (-0.86z)| lr 2.40e-05 | 322.65 ms | 52.3% bf16 MFU | 1624718 tok/s step 17146/19560 | loss 3.250323 (-0.95z)| norm 0.2911 (+0.52z)| lr 2.39e-05 | 323.29 ms | 52.2% bf16 MFU | 1624568 tok/s step 17147/19560 | loss 3.237112 (-1.29z)| norm 0.2757 (-0.03z)| lr 2.39e-05 | 322.70 ms | 52.3% bf16 MFU | 1624574 tok/s step 17148/19560 | loss 3.249430 (-0.95z)| norm 0.2467 (-1.10z)| lr 2.39e-05 | 322.81 ms | 52.3% bf16 MFU | 1624553 tok/s step 17149/19560 | loss 3.291858 (+0.20z)| norm 0.2577 (-0.69z)| lr 2.39e-05 | 322.73 ms | 52.3% bf16 MFU | 1624553 tok/s step 17150/19560 | loss 3.298610 (+0.38z)| norm 0.2854 (+0.37z)| lr 2.39e-05 | 322.97 ms | 52.3% bf16 MFU | 1624493 tok/s step 17151/19560 | loss 3.260866 (-0.63z)| norm 0.2536 (-0.84z)| lr 2.39e-05 | 322.78 ms | 52.3% bf16 MFU | 1624483 tok/s step 17152/19560 | loss 3.336012 (+1.42z)| norm 0.3049 (+1.13z)| lr 2.38e-05 | 322.38 ms | 52.4% bf16 MFU | 1624574 tok/s step 17153/19560 | loss 3.308275 (+0.65z)| norm 0.2565 (-0.73z)| lr 2.38e-05 | 322.56 ms | 52.3% bf16 MFU | 1624614 tok/s step 17154/19560 | loss 3.248439 (-0.99z)| norm 0.2484 (-1.04z)| lr 2.38e-05 | 322.97 ms | 52.3% bf16 MFU | 1624550 tok/s step 17155/19560 | loss 3.273117 (-0.30z)| norm 0.3380 (+2.41z)| lr 2.38e-05 | 322.96 ms | 52.3% bf16 MFU | 
1624492 tok/s step 17156/19560 | loss 3.232679 (-1.47z)| norm 0.2427 (-1.25z)| lr 2.38e-05 | 322.51 ms | 52.3% bf16 MFU | 1624550 tok/s step 17157/19560 | loss 3.281843 (-0.06z)| norm 0.2871 (+0.52z)| lr 2.37e-05 | 322.52 ms | 52.3% bf16 MFU | 1624604 tok/s step 17158/19560 | loss 3.310403 (+0.74z)| norm 0.2820 (+0.31z)| lr 2.37e-05 | 322.57 ms | 52.3% bf16 MFU | 1624640 tok/s step 17159/19560 | loss 3.307428 (+0.71z)| norm 0.2795 (+0.22z)| lr 2.37e-05 | 322.65 ms | 52.3% bf16 MFU | 1624655 tok/s step 17160/19560 | loss 3.255727 (-0.82z)| norm 0.2819 (+0.34z)| lr 2.37e-05 | 323.27 ms | 52.2% bf16 MFU | 1624512 tok/s step 17161/19560 | loss 3.291857 (+0.28z)| norm 0.3091 (+1.45z)| lr 2.37e-05 | 322.70 ms | 52.3% bf16 MFU | 1624522 tok/s step 17162/19560 | loss 3.234534 (-1.46z)| norm 0.2705 (-0.15z)| lr 2.36e-05 | 322.70 ms | 52.3% bf16 MFU | 1624532 tok/s step 17163/19560 | loss 3.316965 (+1.03z)| norm 0.2465 (-1.12z)| lr 2.36e-05 | 322.87 ms | 52.3% bf16 MFU | 1624496 tok/s step 17164/19560 | loss 3.240929 (-1.25z)| norm 0.2515 (-0.90z)| lr 2.36e-05 | 323.17 ms | 52.2% bf16 MFU | 1624388 tok/s step 17165/19560 | loss 3.305496 (+0.68z)| norm 0.2960 (+0.91z)| lr 2.36e-05 | 322.97 ms | 52.3% bf16 MFU | 1624335 tok/s step 17166/19560 | loss 3.229936 (-1.57z)| norm 0.2637 (-0.42z)| lr 2.36e-05 | 322.71 ms | 52.3% bf16 MFU | 1624351 tok/s step 17167/19560 | loss 3.297359 (+0.47z)| norm 0.2457 (-1.14z)| lr 2.35e-05 | 323.15 ms | 52.2% bf16 MFU | 1624256 tok/s step 17168/19560 | loss 3.274774 (-0.23z)| norm 0.2965 (+0.93z)| lr 2.35e-05 | 323.00 ms | 52.3% bf16 MFU | 1624203 tok/s step 17169/19560 | loss 3.318889 (+1.12z)| norm 0.2710 (-0.12z)| lr 2.35e-05 | 322.78 ms | 52.3% bf16 MFU | 1624208 tok/s step 17170/19560 | loss 3.344119 (+1.85z)| norm 0.2673 (-0.27z)| lr 2.35e-05 | 322.30 ms | 52.4% bf16 MFU | 1624334 tok/s step 17171/19560 | loss 3.261257 (-0.64z)| norm 0.3359 (+2.46z)| lr 2.35e-05 | 322.84 ms | 52.3% bf16 MFU | 1624317 tok/s step 17172/19560 | loss 3.307450 
(+0.75z)| norm 0.2458 (-1.15z)| lr 2.34e-05 | 323.17 ms | 52.2% bf16 MFU | 1624218 tok/s step 17173/19560 | loss 3.381819 (+2.88z)| norm 0.3314 (+2.22z)| lr 2.34e-05 | 322.52 ms | 52.3% bf16 MFU | 1624288 tok/s step 17174/19560 | loss 3.320676 (+1.08z)| norm 0.3017 (+1.03z)| lr 2.34e-05 | 322.81 ms | 52.3% bf16 MFU | 1624281 tok/s step 17175/19560 | loss 3.281988 (-0.05z)| norm 0.2590 (-0.65z)| lr 2.34e-05 | 322.44 ms | 52.3% bf16 MFU | 1624367 tok/s step 17176/19560 | loss 3.321718 (+1.09z)| norm 0.3007 (+0.98z)| lr 2.34e-05 | 322.47 ms | 52.3% bf16 MFU | 1624441 tok/s step 17177/19560 | loss 3.294046 (+0.28z)| norm 0.3250 (+1.89z)| lr 2.33e-05 | 322.91 ms | 52.3% bf16 MFU | 1624400 tok/s step 17178/19560 | loss 3.244299 (-1.16z)| norm 0.2886 (+0.47z)| lr 2.33e-05 | 322.90 ms | 52.3% bf16 MFU | 1624365 tok/s step 17179/19560 | loss 3.244701 (-1.14z)| norm 0.2411 (-1.35z)| lr 2.33e-05 | 322.77 ms | 52.3% bf16 MFU | 1624365 tok/s step 17180/19560 | loss 3.289114 (+0.14z)| norm 0.2735 (-0.11z)| lr 2.33e-05 | 322.66 ms | 52.3% bf16 MFU | 1624392 tok/s step 17181/19560 | loss 3.290807 (+0.18z)| norm 0.3260 (+1.87z)| lr 2.33e-05 | 323.19 ms | 52.2% bf16 MFU | 1624283 tok/s step 17182/19560 | loss 3.278129 (-0.17z)| norm 0.2548 (-0.83z)| lr 2.32e-05 | 322.63 ms | 52.3% bf16 MFU | 1624322 tok/s step 17183/19560 | loss 3.285267 (+0.03z)| norm 0.2435 (-1.25z)| lr 2.32e-05 | 322.44 ms | 52.3% bf16 MFU | 1624405 tok/s step 17184/19560 | loss 3.274307 (-0.29z)| norm 0.2645 (-0.47z)| lr 2.32e-05 | 323.03 ms | 52.2% bf16 MFU | 1624337 tok/s step 17185/19560 | loss 3.303912 (+0.56z)| norm 0.2681 (-0.33z)| lr 2.32e-05 | 322.44 ms | 52.3% bf16 MFU | 1624419 tok/s step 17186/19560 | loss 3.263403 (-0.61z)| norm 0.2492 (-1.04z)| lr 2.32e-05 | 323.26 ms | 52.2% bf16 MFU | 1624292 tok/s step 17187/19560 | loss 3.250624 (-0.96z)| norm 0.2595 (-0.65z)| lr 2.32e-05 | 323.07 ms | 52.2% bf16 MFU | 1624218 tok/s step 17188/19560 | loss 3.287797 (+0.13z)| norm 0.2587 (-0.68z)| lr 2.31e-05 | 
323.43 ms | 52.2% bf16 MFU | 1624058 tok/s step 17189/19560 | loss 3.291264 (+0.23z)| norm 0.2402 (-1.36z)| lr 2.31e-05 | 322.33 ms | 52.4% bf16 MFU | 1624184 tok/s step 17190/19560 | loss 3.255054 (-0.82z)| norm 0.2726 (-0.12z)| lr 2.31e-05 | 322.23 ms | 52.4% bf16 MFU | 1624328 tok/s step 17191/19560 | loss 3.305730 (+0.66z)| norm 0.2502 (-0.98z)| lr 2.31e-05 | 323.29 ms | 52.2% bf16 MFU | 1624196 tok/s step 17192/19560 | loss 3.256000 (-0.78z)| norm 0.2434 (-1.22z)| lr 2.31e-05 | 322.78 ms | 52.3% bf16 MFU | 1624201 tok/s step 17193/19560 | loss 3.320860 (+1.20z)| norm 0.3178 (+1.62z)| lr 2.30e-05 | 323.09 ms | 52.2% bf16 MFU | 1624127 tok/s step 17194/19560 | loss 3.258804 (-0.70z)| norm 0.2526 (-0.86z)| lr 2.30e-05 | 323.02 ms | 52.2% bf16 MFU | 1624075 tok/s step 17195/19560 | loss 3.334768 (+1.61z)| norm 0.2528 (-0.85z)| lr 2.30e-05 | 322.86 ms | 52.3% bf16 MFU | 1624064 tok/s step 17196/19560 | loss 3.306091 (+0.73z)| norm 0.2815 (+0.24z)| lr 2.30e-05 | 322.58 ms | 52.3% bf16 MFU | 1624126 tok/s step 17197/19560 | loss 3.314265 (+0.96z)| norm 0.2525 (-0.86z)| lr 2.30e-05 | 322.31 ms | 52.4% bf16 MFU | 1624251 tok/s step 17198/19560 | loss 3.367238 (+2.55z)| norm 0.2806 (+0.20z)| lr 2.29e-05 | 322.97 ms | 52.3% bf16 MFU | 1624206 tok/s step 17199/19560 | loss 3.331662 (+1.45z)| norm 0.2920 (+0.63z)| lr 2.29e-05 | 322.84 ms | 52.3% bf16 MFU | 1624194 tok/s step 17200/19560 | loss 3.271398 (-0.36z)| norm 0.2402 (-1.33z)| lr 2.29e-05 | 322.86 ms | 52.3% bf16 MFU | 1624179 tok/s step 17201/19560 | loss 3.280289 (-0.10z)| norm 0.2641 (-0.43z)| lr 2.29e-05 | 323.12 ms | 52.2% bf16 MFU | 1624099 tok/s step 17202/19560 | loss 3.257174 (-0.80z)| norm 0.2807 (+0.19z)| lr 2.29e-05 | 322.57 ms | 52.3% bf16 MFU | 1624163 tok/s step 17203/19560 | loss 3.329111 (+1.35z)| norm 0.3449 (+2.57z)| lr 2.28e-05 | 323.10 ms | 52.2% bf16 MFU | 1624088 tok/s step 17204/19560 | loss 3.296315 (+0.36z)| norm 0.2517 (-0.90z)| lr 2.28e-05 | 322.32 ms | 52.4% bf16 MFU | 1624214 tok/s step 
17205/19560 | loss 3.234856 (-1.48z)| norm 0.2619 (-0.52z)| lr 2.28e-05 | 322.91 ms | 52.3% bf16 MFU | 1624186 tok/s step 17206/19560 | loss 3.302272 (+0.54z)| norm 0.2790 (+0.13z)| lr 2.28e-05 | 323.28 ms | 52.2% bf16 MFU | 1624066 tok/s step 17207/19560 | loss 3.316230 (+0.94z)| norm 0.2468 (-1.07z)| lr 2.28e-05 | 322.61 ms | 52.3% bf16 MFU | 1624121 tok/s step 17208/19560 | loss 3.289887 (+0.14z)| norm 0.3026 (+1.01z)| lr 2.28e-05 | 322.99 ms | 52.3% bf16 MFU | 1624078 tok/s step 17209/19560 | loss 3.284473 (-0.01z)| norm 0.2682 (-0.26z)| lr 2.27e-05 | 322.97 ms | 52.3% bf16 MFU | 1624041 tok/s step 17210/19560 | loss 3.305777 (+0.62z)| norm 0.3178 (+1.58z)| lr 2.27e-05 | 322.91 ms | 52.3% bf16 MFU | 1624020 tok/s step 17211/19560 | loss 3.258027 (-0.82z)| norm 0.3608 (+3.04z)| lr 2.27e-05 | 322.46 ms | 52.3% bf16 MFU | 1624114 tok/s step 17212/19560 | loss 3.247927 (-1.11z)| norm 0.2597 (-0.58z)| lr 2.27e-05 | 322.78 ms | 52.3% bf16 MFU | 1624124 tok/s step 17213/19560 | loss 3.293298 (+0.26z)| norm 0.2748 (-0.02z)| lr 2.27e-05 | 323.00 ms | 52.3% bf16 MFU | 1624078 tok/s step 17214/19560 | loss 3.238417 (-1.38z)| norm 0.3166 (+1.51z)| lr 2.26e-05 | 322.87 ms | 52.3% bf16 MFU | 1624067 tok/s step 17215/19560 | loss 3.283051 (-0.04z)| norm 0.2747 (-0.03z)| lr 2.26e-05 | 322.86 ms | 52.3% bf16 MFU | 1624057 tok/s step 17216/19560 | loss 3.250876 (-1.02z)| norm 0.2770 (+0.05z)| lr 2.26e-05 | 322.90 ms | 52.3% bf16 MFU | 1624037 tok/s step 17217/19560 | loss 3.280858 (-0.11z)| norm 0.3254 (+1.79z)| lr 2.26e-05 | 323.21 ms | 52.2% bf16 MFU | 1623941 tok/s step 17218/19560 | loss 3.261027 (-0.71z)| norm 0.2665 (-0.35z)| lr 2.26e-05 | 323.41 ms | 52.2% bf16 MFU | 1623800 tok/s step 17219/19560 | loss 3.327583 (+1.29z)| norm 0.2739 (-0.09z)| lr 2.25e-05 | 322.86 ms | 52.3% bf16 MFU | 1623804 tok/s step 17220/19560 | loss 3.306469 (+0.65z)| norm 0.2662 (-0.37z)| lr 2.25e-05 | 322.90 ms | 52.3% bf16 MFU | 1623799 tok/s step 17221/19560 | loss 3.304398 (+0.57z)| norm 
0.2502 (-0.95z)| lr 2.25e-05 | 322.80 ms | 52.3% bf16 MFU | 1623819 tok/s step 17222/19560 | loss 3.321238 (+1.08z)| norm 0.2858 (+0.34z)| lr 2.25e-05 | 323.39 ms | 52.2% bf16 MFU | 1623689 tok/s step 17223/19560 | loss 3.272551 (-0.40z)| norm 0.2656 (-0.39z)| lr 2.25e-05 | 322.89 ms | 52.3% bf16 MFU | 1623690 tok/s step 17224/19560 | loss 3.289547 (+0.12z)| norm 0.2638 (-0.46z)| lr 2.24e-05 | 322.74 ms | 52.3% bf16 MFU | 1623730 tok/s step 17225/19560 | loss 3.346913 (+1.82z)| norm 0.2921 (+0.60z)| lr 2.24e-05 | 323.18 ms | 52.2% bf16 MFU | 1623658 tok/s step 17226/19560 | loss 3.261263 (-0.74z)| norm 0.2531 (-0.85z)| lr 2.24e-05 | 322.78 ms | 52.3% bf16 MFU | 1623689 tok/s step 17227/19560 | loss 3.314986 (+0.86z)| norm 0.2750 (-0.02z)| lr 2.24e-05 | 322.01 ms | 52.4% bf16 MFU | 1623914 tok/s step 17228/19560 | loss 3.331784 (+1.33z)| norm 0.2599 (-0.58z)| lr 2.24e-05 | 322.54 ms | 52.3% bf16 MFU | 1623994 tok/s step 17229/19560 | loss 3.334464 (+1.39z)| norm 0.2640 (-0.42z)| lr 2.24e-05 | 324.18 ms | 52.1% bf16 MFU | 1623657 tok/s step 17230/19560 | loss 3.275765 (-0.33z)| norm 0.2835 (+0.31z)| lr 2.23e-05 | 322.87 ms | 52.3% bf16 MFU | 1623665 tok/s step 17231/19560 | loss 3.244890 (-1.23z)| norm 0.2780 (+0.09z)| lr 2.23e-05 | 322.41 ms | 52.3% bf16 MFU | 1623791 tok/s step 17232/19560 | loss 3.283972 (-0.08z)| norm 0.3079 (+1.21z)| lr 2.23e-05 | 322.77 ms | 52.3% bf16 MFU | 1623817 tok/s step 17233/19560 | loss 3.224016 (-1.81z)| norm 0.2559 (-0.74z)| lr 2.23e-05 | 323.06 ms | 52.2% bf16 MFU | 1623769 tok/s step 17234/19560 | loss 3.281747 (-0.14z)| norm 0.2489 (-0.98z)| lr 2.23e-05 | 322.99 ms | 52.3% bf16 MFU | 1623742 tok/s step 17235/19560 | loss 3.288704 (+0.06z)| norm 0.2991 (+0.88z)| lr 2.22e-05 | 322.58 ms | 52.3% bf16 MFU | 1623820 tok/s step 17236/19560 | loss 3.238750 (-1.37z)| norm 0.2592 (-0.62z)| lr 2.22e-05 | 322.75 ms | 52.3% bf16 MFU | 1623852 tok/s step 17237/19560 | loss 3.351818 (+1.85z)| norm 0.2555 (-0.75z)| lr 2.22e-05 | 322.56 ms | 
52.3% bf16 MFU | 1623929 tok/s step 17238/19560 | loss 3.282562 (-0.12z)| norm 0.2823 (+0.24z)| lr 2.22e-05 | 322.99 ms | 52.3% bf16 MFU | 1623894 tok/s step 17239/19560 | loss 3.263607 (-0.65z)| norm 0.2770 (+0.04z)| lr 2.22e-05 | 322.95 ms | 52.3% bf16 MFU | 1623870 tok/s step 17240/19560 | loss 3.331879 (+1.26z)| norm 0.2566 (-0.73z)| lr 2.21e-05 | 322.99 ms | 52.3% bf16 MFU | 1623839 tok/s step 17241/19560 | loss 3.325632 (+1.07z)| norm 0.2537 (-0.83z)| lr 2.21e-05 | 322.74 ms | 52.3% bf16 MFU | 1623871 tok/s step 17242/19560 | loss 3.235048 (-1.46z)| norm 0.2579 (-0.66z)| lr 2.21e-05 | 322.78 ms | 52.3% bf16 MFU | 1623893 tok/s step 17243/19560 | loss 3.301885 (+0.41z)| norm 0.2965 (+0.77z)| lr 2.21e-05 | 322.91 ms | 52.3% bf16 MFU | 1623879 tok/s step 17244/19560 | loss 3.355959 (+1.90z)| norm 0.3084 (+1.23z)| lr 2.21e-05 | 323.93 ms | 52.1% bf16 MFU | 1623610 tok/s step 17245/19560 | loss 3.299246 (+0.33z)| norm 0.2740 (-0.08z)| lr 2.20e-05 | 322.69 ms | 52.3% bf16 MFU | 1623666 tok/s step 17246/19560 | loss 3.295013 (+0.21z)| norm 0.2582 (-0.67z)| lr 2.20e-05 | 322.98 ms | 52.3% bf16 MFU | 1623646 tok/s step 17247/19560 | loss 3.233780 (-1.47z)| norm 0.2740 (-0.08z)| lr 2.20e-05 | 323.30 ms | 52.2% bf16 MFU | 1623548 tok/s step 17248/19560 | loss 3.286995 (+0.01z)| norm 0.2616 (-0.54z)| lr 2.20e-05 | 322.74 ms | 52.3% bf16 MFU | 1623594 tok/s step 17249/19560 | loss 3.231017 (-1.53z)| norm 0.2887 (+0.47z)| lr 2.20e-05 | 323.00 ms | 52.3% bf16 MFU | 1623572 tok/s step 17250/19560 | loss 3.266960 (-0.55z)| norm 0.2545 (-0.83z)| lr 2.20e-05 | 322.36 ms | 52.4% bf16 MFU | 1623715 tok/s
val loss 3.280791
evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating HellaSwag: 70/79
HellaSwag: 3051/10042 = 0.303824
step 17251/19560 | loss 3.312043 (+0.69z)| norm 0.2523 (-0.90z)| lr 2.19e-05 | 322.23 ms | 52.4% bf16 MFU |
1623883 tok/s step 17252/19560 | loss 3.269660 (-0.48z)| norm 0.2889 (+0.49z)| lr 2.19e-05 | 322.47 ms | 52.3% bf16 MFU | 1623981 tok/s step 17253/19560 | loss 3.313327 (+0.77z)| norm 0.2804 (+0.19z)| lr 2.19e-05 | 322.74 ms | 52.3% bf16 MFU | 1624006 tok/s step 17254/19560 | loss 3.234772 (-1.48z)| norm 0.2911 (+0.59z)| lr 2.19e-05 | 322.69 ms | 52.3% bf16 MFU | 1624043 tok/s step 17255/19560 | loss 3.357088 (+2.07z)| norm 0.3004 (+0.94z)| lr 2.19e-05 | 322.44 ms | 52.3% bf16 MFU | 1624140 tok/s step 17256/19560 | loss 3.201001 (-2.40z)| norm 0.3232 (+1.78z)| lr 2.18e-05 | 322.84 ms | 52.3% bf16 MFU | 1624134 tok/s step 17257/19560 | loss 3.280097 (-0.15z)| norm 0.2738 (-0.09z)| lr 2.18e-05 | 322.97 ms | 52.3% bf16 MFU | 1624094 tok/s step 17258/19560 | loss 3.304004 (+0.52z)| norm 0.3311 (+2.07z)| lr 2.18e-05 | 322.73 ms | 52.3% bf16 MFU | 1624116 tok/s step 17259/19560 | loss 3.269796 (-0.45z)| norm 0.2330 (-1.63z)| lr 2.18e-05 | 322.81 ms | 52.3% bf16 MFU | 1624117 tok/s step 17260/19560 | loss 3.332763 (+1.35z)| norm 0.2594 (-0.63z)| lr 2.18e-05 | 322.56 ms | 52.3% bf16 MFU | 1624180 tok/s step 17261/19560 | loss 3.271514 (-0.40z)| norm 0.2873 (+0.42z)| lr 2.17e-05 | 322.48 ms | 52.3% bf16 MFU | 1624261 tok/s step 17262/19560 | loss 3.307771 (+0.62z)| norm 0.2982 (+0.82z)| lr 2.17e-05 | 322.83 ms | 52.3% bf16 MFU | 1624251 tok/s step 17263/19560 | loss 3.299664 (+0.39z)| norm 0.2695 (-0.27z)| lr 2.17e-05 | 322.65 ms | 52.3% bf16 MFU | 1624284 tok/s step 17264/19560 | loss 3.254400 (-0.91z)| norm 0.2936 (+0.65z)| lr 2.17e-05 | 322.55 ms | 52.3% bf16 MFU | 1624341 tok/s step 17265/19560 | loss 3.280265 (-0.16z)| norm 0.2885 (+0.44z)| lr 2.17e-05 | 322.26 ms | 52.4% bf16 MFU | 1624469 tok/s step 17266/19560 | loss 3.264002 (-0.62z)| norm 0.2554 (-0.81z)| lr 2.17e-05 | 322.69 ms | 52.3% bf16 MFU | 1624483 tok/s step 17267/19560 | loss 3.257506 (-0.81z)| norm 0.2951 (+0.68z)| lr 2.16e-05 | 322.75 ms | 52.3% bf16 MFU | 1624481 tok/s step 17268/19560 | loss 3.291147 
(+0.15z)| norm 0.2408 (-1.35z)| lr 2.16e-05 | 322.80 ms | 52.3% bf16 MFU | 1624467 tok/s step 17269/19560 | loss 3.328950 (+1.22z)| norm 0.2522 (-0.92z)| lr 2.16e-05 | 324.01 ms | 52.1% bf16 MFU | 1624148 tok/s step 17270/19560 | loss 3.261338 (-0.74z)| norm 0.2533 (-0.87z)| lr 2.16e-05 | 321.95 ms | 52.4% bf16 MFU | 1624366 tok/s step 17271/19560 | loss 3.276001 (-0.30z)| norm 0.2722 (-0.15z)| lr 2.16e-05 | 322.73 ms | 52.3% bf16 MFU | 1624374 tok/s step 17272/19560 | loss 3.329791 (+1.27z)| norm 0.2862 (+0.41z)| lr 2.15e-05 | 322.45 ms | 52.3% bf16 MFU | 1624452 tok/s step 17273/19560 | loss 3.566962 (+6.60z)| norm 0.3155 (+1.52z)| lr 2.15e-05 | 322.83 ms | 52.3% bf16 MFU | 1624432 tok/s step 17274/19560 | loss 3.268889 (-0.48z)| norm 0.2488 (-1.05z)| lr 2.15e-05 | 322.28 ms | 52.4% bf16 MFU | 1624550 tok/s step 17275/19560 | loss 3.372537 (+1.95z)| norm 0.3206 (+1.70z)| lr 2.15e-05 | 323.02 ms | 52.2% bf16 MFU | 1624478 tok/s step 17276/19560 | loss 3.283487 (-0.16z)| norm 0.2866 (+0.39z)| lr 2.15e-05 | 322.68 ms | 52.3% bf16 MFU | 1624495 tok/s step 17277/19560 | loss 3.373507 (+1.93z)| norm 0.2544 (-0.84z)| lr 2.15e-05 | 322.37 ms | 52.4% bf16 MFU | 1624587 tok/s step 17278/19560 | loss 3.241883 (-1.13z)| norm 0.2938 (+0.66z)| lr 2.14e-05 | 322.80 ms | 52.3% bf16 MFU | 1624568 tok/s step 17279/19560 | loss 3.231263 (-1.36z)| norm 0.2732 (-0.13z)| lr 2.14e-05 | 322.88 ms | 52.3% bf16 MFU | 1624528 tok/s step 17280/19560 | loss 3.232206 (-1.32z)| norm 0.2830 (+0.25z)| lr 2.14e-05 | 322.74 ms | 52.3% bf16 MFU | 1624526 tok/s step 17281/19560 | loss 3.267974 (-0.49z)| norm 0.2711 (-0.21z)| lr 2.14e-05 | 322.45 ms | 52.3% bf16 MFU | 1624598 tok/s step 17282/19560 | loss 3.275476 (-0.32z)| norm 0.2487 (-1.08z)| lr 2.14e-05 | 322.68 ms | 52.3% bf16 MFU | 1624608 tok/s step 17283/19560 | loss 3.249242 (-0.92z)| norm 0.2764 (+0.01z)| lr 2.13e-05 | 323.21 ms | 52.2% bf16 MFU | 1624483 tok/s step 17284/19560 | loss 3.265927 (-0.55z)| norm 0.2589 (-0.69z)| lr 2.13e-05 | 
322.53 ms | 52.3% bf16 MFU | 1624538 tok/s step 17285/19560 | loss 3.293086 (+0.08z)| norm 0.2545 (-0.85z)| lr 2.13e-05 | 322.39 ms | 52.3% bf16 MFU | 1624622 tok/s step 17286/19560 | loss 3.583408 (+5.80z)| norm 0.4008 (+4.50z)| lr 2.13e-05 | 322.35 ms | 52.4% bf16 MFU | 1624715 tok/s step 17287/19560 | loss 3.322418 (+0.61z)| norm 0.3046 (+0.99z)| lr 2.13e-05 | 323.07 ms | 52.2% bf16 MFU | 1624621 tok/s step 17288/19560 | loss 3.263273 (-0.57z)| norm 0.2610 (-0.58z)| lr 2.12e-05 | 322.80 ms | 52.3% bf16 MFU | 1624600 tok/s step 17289/19560 | loss 3.321045 (+0.57z)| norm 0.3286 (+1.85z)| lr 2.12e-05 | 322.74 ms | 52.3% bf16 MFU | 1624595 tok/s step 17290/19560 | loss 3.375431 (+1.62z)| norm 0.2899 (+0.45z)| lr 2.12e-05 | 322.58 ms | 52.3% bf16 MFU | 1624629 tok/s step 17291/19560 | loss 3.307062 (+0.28z)| norm 0.2624 (-0.54z)| lr 2.12e-05 | 322.78 ms | 52.3% bf16 MFU | 1624612 tok/s step 17292/19560 | loss 3.339741 (+0.91z)| norm 0.2533 (-0.87z)| lr 2.12e-05 | 322.29 ms | 52.4% bf16 MFU | 1624720 tok/s step 17293/19560 | loss 3.345605 (+1.01z)| norm 0.2837 (+0.23z)| lr 2.12e-05 | 322.48 ms | 52.3% bf16 MFU | 1624773 tok/s step 17294/19560 | loss 3.306923 (+0.24z)| norm 0.2756 (-0.07z)| lr 2.11e-05 | 323.08 ms | 52.2% bf16 MFU | 1624674 tok/s step 17295/19560 | loss 3.358721 (+1.25z)| norm 0.2658 (-0.43z)| lr 2.11e-05 | 323.16 ms | 52.2% bf16 MFU | 1624559 tok/s step 17296/19560 | loss 3.298840 (+0.07z)| norm 0.2518 (-0.92z)| lr 2.11e-05 | 322.29 ms | 52.4% bf16 MFU | 1624668 tok/s step 17297/19560 | loss 3.326853 (+0.62z)| norm 0.2484 (-1.03z)| lr 2.11e-05 | 322.35 ms | 52.4% bf16 MFU | 1624758 tok/s step 17298/19560 | loss 3.179762 (-2.22z)| norm 0.2628 (-0.51z)| lr 2.11e-05 | 322.62 ms | 52.3% bf16 MFU | 1624774 tok/s step 17299/19560 | loss 3.290376 (-0.08z)| norm 0.2714 (-0.19z)| lr 2.10e-05 | 322.86 ms | 52.3% bf16 MFU | 1624729 tok/s step 17300/19560 | loss 3.299519 (+0.10z)| norm 0.2612 (-0.57z)| lr 2.10e-05 | 322.47 ms | 52.3% bf16 MFU | 1624784 tok/s step 
17301/19560 | loss 3.245174 (-0.94z)| norm 0.2739 (-0.09z)| lr 2.10e-05 | 322.83 ms | 52.3% bf16 MFU | 1624747 tok/s step 17302/19560 | loss 3.313938 (+0.40z)| norm 0.2723 (-0.14z)| lr 2.10e-05 | 322.85 ms | 52.3% bf16 MFU | 1624705 tok/s step 17303/19560 | loss 3.306156 (+0.25z)| norm 0.2612 (-0.55z)| lr 2.10e-05 | 322.47 ms | 52.3% bf16 MFU | 1624762 tok/s step 17304/19560 | loss 3.279599 (-0.27z)| norm 0.3063 (+1.14z)| lr 2.10e-05 | 322.99 ms | 52.3% bf16 MFU | 1624685 tok/s step 17305/19560 | loss 3.260363 (-0.64z)| norm 0.2591 (-0.62z)| lr 2.09e-05 | 322.28 ms | 52.4% bf16 MFU | 1624792 tok/s step 17306/19560 | loss 3.490400 (+3.63z)| norm 0.2923 (+0.64z)| lr 2.09e-05 | 322.56 ms | 52.3% bf16 MFU | 1624822 tok/s step 17307/19560 | loss 3.295466 (+0.00z)| norm 0.2737 (-0.08z)| lr 2.09e-05 | 322.78 ms | 52.3% bf16 MFU | 1624795 tok/s step 17308/19560 | loss 3.270278 (-0.46z)| norm 0.2898 (+0.53z)| lr 2.09e-05 | 322.67 ms | 52.3% bf16 MFU | 1624797 tok/s step 17309/19560 | loss 3.224243 (-1.30z)| norm 0.2636 (-0.46z)| lr 2.09e-05 | 322.80 ms | 52.3% bf16 MFU | 1624766 tok/s step 17310/19560 | loss 3.320672 (+0.48z)| norm 0.2519 (-0.91z)| lr 2.08e-05 | 322.71 ms | 52.3% bf16 MFU | 1624759 tok/s step 17311/19560 | loss 3.243780 (-0.94z)| norm 0.2676 (-0.31z)| lr 2.08e-05 | 323.09 ms | 52.2% bf16 MFU | 1624659 tok/s step 17312/19560 | loss 3.356750 (+1.13z)| norm 0.3080 (+1.24z)| lr 2.08e-05 | 322.36 ms | 52.4% bf16 MFU | 1624747 tok/s step 17313/19560 | loss 3.283858 (-0.21z)| norm 0.2521 (-0.91z)| lr 2.08e-05 | 322.60 ms | 52.3% bf16 MFU | 1624768 tok/s step 17314/19560 | loss 3.305426 (+0.18z)| norm 0.2645 (-0.44z)| lr 2.08e-05 | 322.60 ms | 52.3% bf16 MFU | 1624791 tok/s step 17315/19560 | loss 3.293847 (-0.03z)| norm 0.2782 (+0.08z)| lr 2.08e-05 | 322.46 ms | 52.3% bf16 MFU | 1624847 tok/s step 17316/19560 | loss 3.314783 (+0.35z)| norm 0.2902 (+0.54z)| lr 2.07e-05 | 322.04 ms | 52.4% bf16 MFU | 1625005 tok/s step 17317/19560 | loss 3.320719 (+0.45z)| norm 
0.2970 (+0.79z)| lr 2.07e-05 | 322.63 ms | 52.3% bf16 MFU | 1625006 tok/s step 17318/19560 | loss 3.260878 (-0.65z)| norm 0.2904 (+0.53z)| lr 2.07e-05 | 323.32 ms | 52.2% bf16 MFU | 1624834 tok/s step 17319/19560 | loss 3.219458 (-1.39z)| norm 0.2732 (-0.15z)| lr 2.07e-05 | 322.30 ms | 52.4% bf16 MFU | 1624929 tok/s step 17320/19560 | loss 3.290286 (-0.10z)| norm 0.2873 (+0.39z)| lr 2.07e-05 | 322.88 ms | 52.3% bf16 MFU | 1624872 tok/s step 17321/19560 | loss 3.247434 (-0.87z)| norm 0.2829 (+0.23z)| lr 2.06e-05 | 323.11 ms | 52.2% bf16 MFU | 1624761 tok/s step 17322/19560 | loss 3.246071 (-0.89z)| norm 0.2408 (-1.43z)| lr 2.06e-05 | 322.49 ms | 52.3% bf16 MFU | 1624811 tok/s step 17323/19560 | loss 3.275111 (-0.36z)| norm 0.2621 (-0.60z)| lr 2.06e-05 | 322.94 ms | 52.3% bf16 MFU | 1624746 tok/s step 17324/19560 | loss 3.339834 (+0.82z)| norm 0.2834 (+0.25z)| lr 2.06e-05 | 322.64 ms | 52.3% bf16 MFU | 1624759 tok/s step 17325/19560 | loss 3.253412 (-0.75z)| norm 0.2417 (-1.39z)| lr 2.06e-05 | 322.13 ms | 52.4% bf16 MFU | 1624900 tok/s step 17326/19560 | loss 3.344206 (+0.91z)| norm 0.2695 (-0.29z)| lr 2.06e-05 | 323.02 ms | 52.2% bf16 MFU | 1624808 tok/s step 17327/19560 | loss 3.298612 (+0.08z)| norm 0.2705 (-0.25z)| lr 2.05e-05 | 322.26 ms | 52.4% bf16 MFU | 1624914 tok/s step 17328/19560 | loss 3.215868 (-1.41z)| norm 0.2611 (-0.63z)| lr 2.05e-05 | 322.84 ms | 52.3% bf16 MFU | 1624867 tok/s step 17329/19560 | loss 3.280323 (-0.24z)| norm 0.2605 (-0.65z)| lr 2.05e-05 | 322.62 ms | 52.3% bf16 MFU | 1624879 tok/s step 17330/19560 | loss 3.316357 (+0.41z)| norm 0.3225 (+1.77z)| lr 2.05e-05 | 322.73 ms | 52.3% bf16 MFU | 1624862 tok/s step 17331/19560 | loss 3.332293 (+0.70z)| norm 0.3110 (+1.36z)| lr 2.05e-05 | 322.56 ms | 52.3% bf16 MFU | 1624890 tok/s step 17332/19560 | loss 3.339767 (+0.82z)| norm 0.2639 (-0.53z)| lr 2.04e-05 | 323.15 ms | 52.2% bf16 MFU | 1624768 tok/s step 17333/19560 | loss 3.217152 (-1.39z)| norm 0.3095 (+1.28z)| lr 2.04e-05 | 323.05 ms | 
52.2% bf16 MFU | 1624675 tok/s step 17334/19560 | loss 3.298616 (+0.08z)| norm 0.3454 (+2.62z)| lr 2.04e-05 | 322.44 ms | 52.3% bf16 MFU | 1624741 tok/s step 17335/19560 | loss 3.355794 (+1.10z)| norm 0.2650 (-0.51z)| lr 2.04e-05 | 322.85 ms | 52.3% bf16 MFU | 1624700 tok/s step 17336/19560 | loss 3.302083 (+0.13z)| norm 0.3150 (+1.43z)| lr 2.04e-05 | 322.45 ms | 52.3% bf16 MFU | 1624764 tok/s step 17337/19560 | loss 3.244126 (-0.90z)| norm 0.3420 (+2.40z)| lr 2.04e-05 | 322.44 ms | 52.3% bf16 MFU | 1624825 tok/s step 17338/19560 | loss 3.305143 (+0.19z)| norm 0.2753 (-0.12z)| lr 2.03e-05 | 322.84 ms | 52.3% bf16 MFU | 1624783 tok/s step 17339/19560 | loss 3.294733 (+0.00z)| norm 0.3146 (+1.45z)| lr 2.03e-05 | 323.27 ms | 52.2% bf16 MFU | 1624635 tok/s step 17340/19560 | loss 3.290709 (-0.08z)| norm 0.2994 (+0.83z)| lr 2.03e-05 | 323.16 ms | 52.2% bf16 MFU | 1624522 tok/s step 17341/19560 | loss 3.268086 (-0.48z)| norm 0.2646 (-0.54z)| lr 2.03e-05 | 322.60 ms | 52.3% bf16 MFU | 1624555 tok/s step 17342/19560 | loss 3.330531 (+0.63z)| norm 0.2821 (+0.16z)| lr 2.03e-05 | 323.04 ms | 52.2% bf16 MFU | 1624476 tok/s step 17343/19560 | loss 3.371231 (+1.35z)| norm 0.3004 (+0.88z)| lr 2.02e-05 | 323.06 ms | 52.2% bf16 MFU | 1624396 tok/s step 17344/19560 | loss 3.257953 (-0.69z)| norm 0.2642 (-0.55z)| lr 2.02e-05 | 322.80 ms | 52.3% bf16 MFU | 1624387 tok/s step 17345/19560 | loss 3.324034 (+0.49z)| norm 0.2491 (-1.14z)| lr 2.02e-05 | 322.79 ms | 52.3% bf16 MFU | 1624381 tok/s step 17346/19560 | loss 3.283028 (-0.25z)| norm 0.3317 (+2.12z)| lr 2.02e-05 | 322.92 ms | 52.3% bf16 MFU | 1624340 tok/s step 17347/19560 | loss 3.219585 (-1.36z)| norm 0.3092 (+1.21z)| lr 2.02e-05 | 322.33 ms | 52.4% bf16 MFU | 1624452 tok/s step 17348/19560 | loss 3.264424 (-0.55z)| norm 0.2667 (-0.46z)| lr 2.02e-05 | 323.10 ms | 52.2% bf16 MFU | 1624365 tok/s step 17349/19560 | loss 3.280387 (-0.27z)| norm 0.3257 (+1.82z)| lr 2.01e-05 | 322.64 ms | 52.3% bf16 MFU | 1624395 tok/s step 17350/19560 
| loss 3.340703 (+0.80z)| norm 0.3155 (+1.40z)| lr 2.01e-05 | 322.91 ms | 52.3% bf16 MFU | 1624358 tok/s step 17351/19560 | loss 3.228322 (-1.18z)| norm 0.2635 (-0.60z)| lr 2.01e-05 | 323.20 ms | 52.2% bf16 MFU | 1624250 tok/s step 17352/19560 | loss 3.284482 (-0.19z)| norm 0.2627 (-0.64z)| lr 2.01e-05 | 322.23 ms | 52.4% bf16 MFU | 1624391 tok/s step 17353/19560 | loss 3.319625 (+0.44z)| norm 0.3076 (+1.09z)| lr 2.01e-05 | 322.49 ms | 52.3% bf16 MFU | 1624460 tok/s step 17354/19560 | loss 3.339591 (+0.78z)| norm 0.2988 (+0.74z)| lr 2.00e-05 | 322.80 ms | 52.3% bf16 MFU | 1624446 tok/s step 17355/19560 | loss 3.346192 (+0.89z)| norm 0.2901 (+0.40z)| lr 2.00e-05 | 322.09 ms | 52.4% bf16 MFU | 1624613 tok/s step 17356/19560 | loss 3.228894 (-1.16z)| norm 0.2573 (-0.86z)| lr 2.00e-05 | 322.74 ms | 52.3% bf16 MFU | 1624608 tok/s step 17357/19560 | loss 3.269889 (-0.43z)| norm 0.3847 (+3.78z)| lr 2.00e-05 | 322.71 ms | 52.3% bf16 MFU | 1624609 tok/s step 17358/19560 | loss 3.302448 (+0.14z)| norm 0.2781 (-0.09z)| lr 2.00e-05 | 322.92 ms | 52.3% bf16 MFU | 1624558 tok/s step 17359/19560 | loss 3.350692 (+0.97z)| norm 0.2948 (+0.51z)| lr 2.00e-05 | 322.75 ms | 52.3% bf16 MFU | 1624552 tok/s step 17360/19560 | loss 3.277126 (-0.32z)| norm 0.3006 (+0.72z)| lr 1.99e-05 | 322.91 ms | 52.3% bf16 MFU | 1624508 tok/s step 17361/19560 | loss 3.277946 (-0.32z)| norm 0.2601 (-0.75z)| lr 1.99e-05 | 323.27 ms | 52.2% bf16 MFU | 1624374 tok/s step 17362/19560 | loss 3.341189 (+0.79z)| norm 0.2552 (-0.93z)| lr 1.99e-05 | 323.40 ms | 52.2% bf16 MFU | 1624213 tok/s step 17363/19560 | loss 3.321651 (+0.44z)| norm 0.2893 (+0.31z)| lr 1.99e-05 | 322.08 ms | 52.4% bf16 MFU | 1624393 tok/s step 17364/19560 | loss 3.370754 (+1.29z)| norm 0.2763 (-0.17z)| lr 1.99e-05 | 322.57 ms | 52.3% bf16 MFU | 1624442 tok/s step 17365/19560 | loss 3.278710 (-0.32z)| norm 0.3062 (+0.91z)| lr 1.98e-05 | 322.90 ms | 52.3% bf16 MFU | 1624404 tok/s step 17366/19560 | loss 3.216239 (-1.41z)| norm 0.2771 (-0.15z)| 
lr 1.98e-05 | 322.88 ms | 52.3% bf16 MFU | 1624374 tok/s step 17367/19560 | loss 3.341904 (+0.79z)| norm 0.2703 (-0.40z)| lr 1.98e-05 | 322.21 ms | 52.4% bf16 MFU | 1624514 tok/s step 17368/19560 | loss 3.308670 (+0.21z)| norm 0.2624 (-0.69z)| lr 1.98e-05 | 322.78 ms | 52.3% bf16 MFU | 1624503 tok/s step 17369/19560 | loss 3.278512 (-0.32z)| norm 0.2581 (-0.85z)| lr 1.98e-05 | 323.52 ms | 52.2% bf16 MFU | 1624306 tok/s step 17370/19560 | loss 3.264959 (-0.56z)| norm 0.2628 (-0.68z)| lr 1.98e-05 | 322.84 ms | 52.3% bf16 MFU | 1624290 tok/s step 17371/19560 | loss 3.305696 (+0.16z)| norm 0.2687 (-0.45z)| lr 1.97e-05 | 322.57 ms | 52.3% bf16 MFU | 1624342 tok/s step 17372/19560 | loss 3.485733 (+3.19z)| norm 0.3115 (+1.12z)| lr 1.97e-05 | 323.06 ms | 52.2% bf16 MFU | 1624269 tok/s step 17373/19560 | loss 3.261024 (-0.62z)| norm 0.2699 (-0.41z)| lr 1.97e-05 | 322.41 ms | 52.3% bf16 MFU | 1624362 tok/s step 17374/19560 | loss 3.275580 (-0.37z)| norm 0.2600 (-0.77z)| lr 1.97e-05 | 323.42 ms | 52.2% bf16 MFU | 1624198 tok/s step 17375/19560 | loss 3.412144 (+1.90z)| norm 0.3230 (+1.51z)| lr 1.97e-05 | 323.11 ms | 52.2% bf16 MFU | 1624121 tok/s step 17376/19560 | loss 3.260322 (-0.64z)| norm 0.2511 (-1.10z)| lr 1.97e-05 | 322.78 ms | 52.3% bf16 MFU | 1624128 tok/s step 17377/19560 | loss 3.356745 (+0.96z)| norm 0.2495 (-1.14z)| lr 1.96e-05 | 322.40 ms | 52.3% bf16 MFU | 1624231 tok/s step 17378/19560 | loss 3.237842 (-1.03z)| norm 0.3289 (+1.69z)| lr 1.96e-05 | 323.31 ms | 52.2% bf16 MFU | 1624100 tok/s step 17379/19560 | loss 3.270169 (-0.48z)| norm 0.2520 (-1.06z)| lr 1.96e-05 | 322.80 ms | 52.3% bf16 MFU | 1624104 tok/s step 17380/19560 | loss 3.327871 (+0.47z)| norm 0.2937 (+0.43z)| lr 1.96e-05 | 322.63 ms | 52.3% bf16 MFU | 1624152 tok/s step 17381/19560 | loss 3.265551 (-0.56z)| norm 0.2541 (-0.98z)| lr 1.96e-05 | 322.66 ms | 52.3% bf16 MFU | 1624188 tok/s step 17382/19560 | loss 3.303807 (+0.07z)| norm 0.2981 (+0.59z)| lr 1.95e-05 | 322.32 ms | 52.4% bf16 MFU | 
1624310 tok/s step 17383/19560 | loss 3.286039 (-0.22z)| norm 0.2790 (-0.09z)| lr 1.95e-05 | 322.94 ms | 52.3% bf16 MFU | 1624269 tok/s step 17384/19560 | loss 3.346969 (+0.79z)| norm 0.2419 (-1.39z)| lr 1.95e-05 | 323.61 ms | 52.2% bf16 MFU | 1624061 tok/s step 17385/19560 | loss 3.274573 (-0.43z)| norm 0.2706 (-0.36z)| lr 1.95e-05 | 322.87 ms | 52.3% bf16 MFU | 1624049 tok/s step 17386/19560 | loss 3.338462 (+0.64z)| norm 0.2947 (+0.52z)| lr 1.95e-05 | 323.43 ms | 52.2% bf16 MFU | 1623898 tok/s step 17387/19560 | loss 3.313376 (+0.21z)| norm 0.2478 (-1.19z)| lr 1.95e-05 | 322.08 ms | 52.4% bf16 MFU | 1624095 tok/s step 17388/19560 | loss 3.415120 (+1.90z)| norm 0.2749 (-0.21z)| lr 1.94e-05 | 322.68 ms | 52.3% bf16 MFU | 1624129 tok/s step 17389/19560 | loss 3.263247 (-0.64z)| norm 0.3324 (+1.85z)| lr 1.94e-05 | 323.13 ms | 52.2% bf16 MFU | 1624048 tok/s step 17390/19560 | loss 3.251910 (-0.82z)| norm 0.3299 (+1.73z)| lr 1.94e-05 | 323.12 ms | 52.2% bf16 MFU | 1623973 tok/s step 17391/19560 | loss 3.337412 (+0.60z)| norm 0.2609 (-0.72z)| lr 1.94e-05 | 322.42 ms | 52.3% bf16 MFU | 1624079 tok/s step 17392/19560 | loss 3.313140 (+0.19z)| norm 0.3124 (+1.10z)| lr 1.94e-05 | 322.77 ms | 52.3% bf16 MFU | 1624093 tok/s step 17393/19560 | loss 3.273806 (-0.46z)| norm 0.2933 (+0.42z)| lr 1.94e-05 | 322.87 ms | 52.3% bf16 MFU | 1624079 tok/s step 17394/19560 | loss 3.272525 (-0.49z)| norm 0.2451 (-1.27z)| lr 1.93e-05 | 322.63 ms | 52.3% bf16 MFU | 1624126 tok/s step 17395/19560 | loss 3.238048 (-1.06z)| norm 0.2677 (-0.47z)| lr 1.93e-05 | 322.73 ms | 52.3% bf16 MFU | 1624146 tok/s step 17396/19560 | loss 3.302691 (+0.02z)| norm 0.2888 (+0.26z)| lr 1.93e-05 | 323.24 ms | 52.2% bf16 MFU | 1624038 tok/s step 17397/19560 | loss 3.313158 (+0.19z)| norm 0.2812 (-0.02z)| lr 1.93e-05 | 322.55 ms | 52.3% bf16 MFU | 1624109 tok/s step 17398/19560 | loss 3.294155 (-0.13z)| norm 0.2466 (-1.25z)| lr 1.93e-05 | 322.99 ms | 52.3% bf16 MFU | 1624064 tok/s step 17399/19560 | loss 3.248007 
(-0.89z)| norm 0.3003 (+0.66z)| lr 1.92e-05 | 322.65 ms | 52.3% bf16 MFU | 1624107 tok/s step 17400/19560 | loss 3.279714 (-0.36z)| norm 0.2550 (-0.94z)| lr 1.92e-05 | 322.92 ms | 52.3% bf16 MFU | 1624081 tok/s step 17401/19560 | loss 3.315958 (+0.30z)| norm 0.2903 (+0.32z)| lr 1.92e-05 | 323.33 ms | 52.2% bf16 MFU | 1623954 tok/s step 17402/19560 | loss 3.304014 (+0.08z)| norm 0.2730 (-0.31z)| lr 1.92e-05 | 322.94 ms | 52.3% bf16 MFU | 1623932 tok/s step 17403/19560 | loss 3.336856 (+0.69z)| norm 0.3388 (+2.03z)| lr 1.92e-05 | 322.81 ms | 52.3% bf16 MFU | 1623942 tok/s step 17404/19560 | loss 3.271593 (-0.50z)| norm 0.2547 (-0.95z)| lr 1.92e-05 | 322.47 ms | 52.3% bf16 MFU | 1624036 tok/s step 17405/19560 | loss 3.252200 (-0.84z)| norm 0.2572 (-0.86z)| lr 1.91e-05 | 323.30 ms | 52.2% bf16 MFU | 1623919 tok/s step 17406/19560 | loss 3.298379 (-0.00z)| norm 0.2711 (-0.36z)| lr 1.91e-05 | 322.11 ms | 52.4% bf16 MFU | 1624108 tok/s step 17407/19560 | loss 3.263458 (-0.65z)| norm 0.2897 (+0.29z)| lr 1.91e-05 | 323.35 ms | 52.2% bf16 MFU | 1623974 tok/s step 17408/19560 | loss 3.268142 (-0.57z)| norm 0.2460 (-1.24z)| lr 1.91e-05 | 323.20 ms | 52.2% bf16 MFU | 1623884 tok/s step 17409/19560 | loss 3.322747 (+0.43z)| norm 0.2574 (-0.83z)| lr 1.91e-05 | 322.55 ms | 52.3% bf16 MFU | 1623963 tok/s step 17410/19560 | loss 3.250529 (-0.90z)| norm 0.3065 (+0.88z)| lr 1.91e-05 | 322.65 ms | 52.3% bf16 MFU | 1624010 tok/s step 17411/19560 | loss 3.345227 (+0.84z)| norm 0.2836 (+0.07z)| lr 1.90e-05 | 322.51 ms | 52.3% bf16 MFU | 1624091 tok/s step 17412/19560 | loss 3.248379 (-0.95z)| norm 0.2594 (-0.78z)| lr 1.90e-05 | 323.03 ms | 52.2% bf16 MFU | 1624039 tok/s step 17413/19560 | loss 3.309551 (+0.17z)| norm 0.2773 (-0.16z)| lr 1.90e-05 | 322.60 ms | 52.3% bf16 MFU | 1624097 tok/s step 17414/19560 | loss 3.261245 (-0.76z)| norm 0.2925 (+0.44z)| lr 1.90e-05 | 322.95 ms | 52.3% bf16 MFU | 1624064 tok/s step 17415/19560 | loss 3.292654 (-0.10z)| norm 0.3025 (+0.82z)| lr 1.90e-05 | 
323.68 ms | 52.1% bf16 MFU | 1623849 tok/s step 17416/19560 | loss 3.322845 (+0.52z)| norm 0.2483 (-1.23z)| lr 1.89e-05 | 322.62 ms | 52.3% bf16 MFU | 1623910 tok/s step 17417/19560 | loss 3.273798 (-0.49z)| norm 0.3661 (+3.14z)| lr 1.89e-05 | 322.44 ms | 52.3% bf16 MFU | 1624015 tok/s step 17418/19560 | loss 3.324542 (+0.58z)| norm 0.2691 (-0.44z)| lr 1.89e-05 | 322.39 ms | 52.3% bf16 MFU | 1624127 tok/s step 17419/19560 | loss 3.313137 (+0.34z)| norm 0.2648 (-0.60z)| lr 1.89e-05 | 323.09 ms | 52.2% bf16 MFU | 1624056 tok/s step 17420/19560 | loss 3.271832 (-0.52z)| norm 0.2612 (-0.73z)| lr 1.89e-05 | 323.36 ms | 52.2% bf16 MFU | 1623923 tok/s step 17421/19560 | loss 3.267069 (-0.61z)| norm 0.2744 (-0.24z)| lr 1.89e-05 | 322.54 ms | 52.3% bf16 MFU | 1624000 tok/s step 17422/19560 | loss 3.329983 (+0.72z)| norm 0.2996 (+0.68z)| lr 1.88e-05 | 322.99 ms | 52.3% bf16 MFU | 1623962 tok/s step 17423/19560 | loss 3.334794 (+0.82z)| norm 0.2507 (-1.11z)| lr 1.88e-05 | 322.89 ms | 52.3% bf16 MFU | 1623950 tok/s step 17424/19560 | loss 3.294086 (-0.04z)| norm 0.2549 (-0.96z)| lr 1.88e-05 | 323.02 ms | 52.2% bf16 MFU | 1623906 tok/s step 17425/19560 | loss 3.329564 (+0.71z)| norm 0.3201 (+1.41z)| lr 1.88e-05 | 323.30 ms | 52.2% bf16 MFU | 1623795 tok/s step 17426/19560 | loss 3.254362 (-0.91z)| norm 0.2793 (-0.09z)| lr 1.88e-05 | 322.71 ms | 52.3% bf16 MFU | 1623838 tok/s step 17427/19560 | loss 3.287480 (-0.20z)| norm 0.2806 (-0.04z)| lr 1.88e-05 | 323.71 ms | 52.1% bf16 MFU | 1623628 tok/s step 17428/19560 | loss 3.309821 (+0.29z)| norm 0.2568 (-0.92z)| lr 1.87e-05 | 322.92 ms | 52.3% bf16 MFU | 1623626 tok/s step 17429/19560 | loss 3.302241 (+0.11z)| norm 0.2615 (-0.74z)| lr 1.87e-05 | 322.83 ms | 52.3% bf16 MFU | 1623648 tok/s step 17430/19560 | loss 3.235413 (-1.32z)| norm 0.3010 (+0.70z)| lr 1.87e-05 | 323.32 ms | 52.2% bf16 MFU | 1623544 tok/s step 17431/19560 | loss 3.353914 (+1.23z)| norm 0.2593 (-0.83z)| lr 1.87e-05 | 322.96 ms | 52.3% bf16 MFU | 1623536 tok/s step 
17432/19560 | loss 3.271202 (-0.55z)| norm 0.2538 (-1.01z)| lr 1.87e-05 | 323.25 ms | 52.2% bf16 MFU | 1623455 tok/s step 17433/19560 | loss 3.343831 (+1.00z)| norm 0.2664 (-0.55z)| lr 1.87e-05 | 323.28 ms | 52.2% bf16 MFU | 1623372 tok/s step 17434/19560 | loss 3.340173 (+1.01z)| norm 0.2585 (-0.83z)| lr 1.86e-05 | 322.83 ms | 52.3% bf16 MFU | 1623405 tok/s step 17435/19560 | loss 3.350822 (+1.24z)| norm 0.3120 (+1.11z)| lr 1.86e-05 | 322.96 ms | 52.3% bf16 MFU | 1623404 tok/s step 17436/19560 | loss 3.339499 (+0.96z)| norm 0.2625 (-0.68z)| lr 1.86e-05 | 323.36 ms | 52.2% bf16 MFU | 1623304 tok/s step 17437/19560 | loss 3.330556 (+0.75z)| norm 0.2676 (-0.50z)| lr 1.86e-05 | 322.85 ms | 52.3% bf16 MFU | 1623336 tok/s step 17438/19560 | loss 3.326123 (+0.65z)| norm 0.2877 (+0.22z)| lr 1.86e-05 | 323.00 ms | 52.3% bf16 MFU | 1623327 tok/s step 17439/19560 | loss 3.321723 (+0.53z)| norm 0.2561 (-0.93z)| lr 1.85e-05 | 323.20 ms | 52.2% bf16 MFU | 1623270 tok/s step 17440/19560 | loss 3.247792 (-1.16z)| norm 0.2467 (-1.25z)| lr 1.85e-05 | 323.26 ms | 52.2% bf16 MFU | 1623200 tok/s step 17441/19560 | loss 3.239688 (-1.33z)| norm 0.2412 (-1.44z)| lr 1.85e-05 | 323.17 ms | 52.2% bf16 MFU | 1623155 tok/s step 17442/19560 | loss 3.363706 (+1.50z)| norm 0.3855 (+3.56z)| lr 1.85e-05 | 322.67 ms | 52.3% bf16 MFU | 1623241 tok/s step 17443/19560 | loss 3.271386 (-0.60z)| norm 0.2660 (-0.54z)| lr 1.85e-05 | 323.66 ms | 52.1% bf16 MFU | 1623074 tok/s step 17444/19560 | loss 3.322494 (+0.56z)| norm 0.2626 (-0.65z)| lr 1.85e-05 | 322.21 ms | 52.4% bf16 MFU | 1623278 tok/s step 17445/19560 | loss 3.284584 (-0.29z)| norm 0.3268 (+1.53z)| lr 1.84e-05 | 322.53 ms | 52.3% bf16 MFU | 1623391 tok/s step 17446/19560 | loss 3.250668 (-1.06z)| norm 0.2664 (-0.52z)| lr 1.84e-05 | 323.04 ms | 52.2% bf16 MFU | 1623371 tok/s step 17447/19560 | loss 3.325049 (+0.62z)| norm 0.2631 (-0.63z)| lr 1.84e-05 | 322.43 ms | 52.3% bf16 MFU | 1623505 tok/s step 17448/19560 | loss 3.360921 (+1.42z)| norm 
0.3165 (+1.17z)| lr 1.84e-05 | 323.09 ms | 52.2% bf16 MFU | 1623466 tok/s step 17449/19560 | loss 3.278292 (-0.47z)| norm 0.2437 (-1.27z)| lr 1.84e-05 | 323.16 ms | 52.2% bf16 MFU | 1623412 tok/s step 17450/19560 | loss 3.274863 (-0.56z)| norm 0.2559 (-0.87z)| lr 1.84e-05 | 322.66 ms | 52.3% bf16 MFU | 1623486 tok/s step 17451/19560 | loss 3.235728 (-1.44z)| norm 0.3757 (+3.03z)| lr 1.83e-05 | 323.27 ms | 52.2% bf16 MFU | 1623404 tok/s step 17452/19560 | loss 3.257309 (-0.94z)| norm 0.2546 (-0.90z)| lr 1.83e-05 | 323.00 ms | 52.3% bf16 MFU | 1623392 tok/s step 17453/19560 | loss 3.368227 (+1.57z)| norm 0.3404 (+1.85z)| lr 1.83e-05 | 322.51 ms | 52.3% bf16 MFU | 1623505 tok/s step 17454/19560 | loss 3.284333 (-0.33z)| norm 0.3285 (+1.44z)| lr 1.83e-05 | 322.86 ms | 52.3% bf16 MFU | 1623524 tok/s step 17455/19560 | loss 3.326195 (+0.62z)| norm 0.2603 (-0.74z)| lr 1.83e-05 | 322.73 ms | 52.3% bf16 MFU | 1623574 tok/s step 17456/19560 | loss 3.312362 (+0.29z)| norm 0.2832 (-0.01z)| lr 1.83e-05 | 322.68 ms | 52.3% bf16 MFU | 1623636 tok/s step 17457/19560 | loss 3.321769 (+0.50z)| norm 0.3068 (+0.73z)| lr 1.82e-05 | 323.23 ms | 52.2% bf16 MFU | 1623556 tok/s step 17458/19560 | loss 3.222207 (-1.76z)| norm 0.2669 (-0.53z)| lr 1.82e-05 | 322.78 ms | 52.3% bf16 MFU | 1623593 tok/s step 17459/19560 | loss 3.412818 (+2.51z)| norm 0.3552 (+2.26z)| lr 1.82e-05 | 322.73 ms | 52.3% bf16 MFU | 1623639 tok/s step 17460/19560 | loss 3.296821 (-0.06z)| norm 0.2835 (-0.02z)| lr 1.82e-05 | 322.56 ms | 52.3% bf16 MFU | 1623726 tok/s step 17461/19560 | loss 3.274013 (-0.59z)| norm 0.2889 (+0.16z)| lr 1.82e-05 | 322.99 ms | 52.3% bf16 MFU | 1623702 tok/s step 17462/19560 | loss 3.270903 (-0.65z)| norm 0.3363 (+1.67z)| lr 1.82e-05 | 322.40 ms | 52.3% bf16 MFU | 1623827 tok/s step 17463/19560 | loss 3.267529 (-0.72z)| norm 0.2629 (-0.67z)| lr 1.81e-05 | 322.96 ms | 52.3% bf16 MFU | 1623804 tok/s step 17464/19560 | loss 3.428881 (+2.83z)| norm 0.2927 (+0.29z)| lr 1.81e-05 | 321.83 ms | 
52.4% bf16 MFU | 1624068 tok/s step 17465/19560 | loss 3.286092 (-0.32z)| norm 0.2919 (+0.28z)| lr 1.81e-05 | 322.68 ms | 52.3% bf16 MFU | 1624105 tok/s step 17466/19560 | loss 3.309715 (+0.20z)| norm 0.2947 (+0.37z)| lr 1.81e-05 | 323.34 ms | 52.2% bf16 MFU | 1623974 tok/s step 17467/19560 | loss 3.310405 (+0.22z)| norm 0.2886 (+0.18z)| lr 1.81e-05 | 322.53 ms | 52.3% bf16 MFU | 1624054 tok/s step 17468/19560 | loss 3.274753 (-0.57z)| norm 0.3048 (+0.70z)| lr 1.80e-05 | 322.38 ms | 52.4% bf16 MFU | 1624165 tok/s step 17469/19560 | loss 3.235530 (-1.42z)| norm 0.3322 (+1.56z)| lr 1.80e-05 | 322.45 ms | 52.3% bf16 MFU | 1624255 tok/s step 17470/19560 | loss 3.269600 (-0.66z)| norm 0.2691 (-0.47z)| lr 1.80e-05 | 323.17 ms | 52.2% bf16 MFU | 1624159 tok/s step 17471/19560 | loss 3.268202 (-0.68z)| norm 0.3156 (+1.02z)| lr 1.80e-05 | 322.61 ms | 52.3% bf16 MFU | 1624207 tok/s step 17472/19560 | loss 3.236461 (-1.37z)| norm 0.2842 (+0.01z)| lr 1.80e-05 | 322.52 ms | 52.3% bf16 MFU | 1624278 tok/s step 17473/19560 | loss 3.284055 (-0.32z)| norm 0.3013 (+0.55z)| lr 1.80e-05 | 322.79 ms | 52.3% bf16 MFU | 1624275 tok/s step 17474/19560 | loss 3.263245 (-0.77z)| norm 0.2671 (-0.54z)| lr 1.79e-05 | 322.56 ms | 52.3% bf16 MFU | 1624330 tok/s step 17475/19560 | loss 3.365033 (+1.45z)| norm 0.2750 (-0.28z)| lr 1.79e-05 | 322.48 ms | 52.3% bf16 MFU | 1624404 tok/s step 17476/19560 | loss 3.305406 (+0.12z)| norm 0.2650 (-0.60z)| lr 1.79e-05 | 322.94 ms | 52.3% bf16 MFU | 1624357 tok/s step 17477/19560 | loss 3.411692 (+2.40z)| norm 0.2773 (-0.19z)| lr 1.79e-05 | 322.52 ms | 52.3% bf16 MFU | 1624420 tok/s step 17478/19560 | loss 3.261408 (-0.84z)| norm 0.2749 (-0.26z)| lr 1.79e-05 | 322.93 ms | 52.3% bf16 MFU | 1624377 tok/s step 17479/19560 | loss 3.386892 (+1.84z)| norm 0.2793 (-0.12z)| lr 1.79e-05 | 322.61 ms | 52.3% bf16 MFU | 1624416 tok/s step 17480/19560 | loss 3.239920 (-1.31z)| norm 0.2561 (-0.88z)| lr 1.78e-05 | 323.02 ms | 52.2% bf16 MFU | 1624348 tok/s step 17481/19560 
| loss 3.274385 (-0.56z)| norm 0.2371 (-1.49z)| lr 1.78e-05 | 322.45 ms | 52.3% bf16 MFU | 1624427 tok/s step 17482/19560 | loss 3.265834 (-0.73z)| norm 0.3261 (+1.42z)| lr 1.78e-05 | 322.44 ms | 52.3% bf16 MFU | 1624507 tok/s step 17483/19560 | loss 3.326741 (+0.57z)| norm 0.2807 (-0.06z)| lr 1.78e-05 | 323.19 ms | 52.2% bf16 MFU | 1624393 tok/s step 17484/19560 | loss 3.270492 (-0.64z)| norm 0.2438 (-1.25z)| lr 1.78e-05 | 322.56 ms | 52.3% bf16 MFU | 1624443 tok/s step 17485/19560 | loss 3.236870 (-1.36z)| norm 0.2763 (-0.18z)| lr 1.78e-05 | 322.32 ms | 52.4% bf16 MFU | 1624552 tok/s step 17486/19560 | loss 3.304814 (+0.10z)| norm 0.2554 (-0.88z)| lr 1.77e-05 | 323.09 ms | 52.2% bf16 MFU | 1624461 tok/s step 17487/19560 | loss 3.236747 (-1.34z)| norm 0.3000 (+0.63z)| lr 1.77e-05 | 322.92 ms | 52.3% bf16 MFU | 1624418 tok/s step 17488/19560 | loss 3.231321 (-1.44z)| norm 0.3488 (+2.23z)| lr 1.77e-05 | 322.36 ms | 52.4% bf16 MFU | 1624518 tok/s step 17489/19560 | loss 3.370585 (+1.50z)| norm 0.2782 (-0.13z)| lr 1.77e-05 | 322.55 ms | 52.3% bf16 MFU | 1624564 tok/s step 17490/19560 | loss 3.309535 (+0.22z)| norm 0.2357 (-1.52z)| lr 1.77e-05 | 323.09 ms | 52.2% bf16 MFU | 1624473 tok/s step 17491/19560 | loss 3.268968 (-0.63z)| norm 0.2622 (-0.64z)| lr 1.77e-05 | 322.23 ms | 52.4% bf16 MFU | 1624602 tok/s step 17492/19560 | loss 3.261447 (-0.78z)| norm 0.2812 (-0.01z)| lr 1.76e-05 | 322.71 ms | 52.3% bf16 MFU | 1624605 tok/s step 17493/19560 | loss 3.335372 (+0.78z)| norm 0.3003 (+0.62z)| lr 1.76e-05 | 322.90 ms | 52.3% bf16 MFU | 1624559 tok/s step 17494/19560 | loss 3.324618 (+0.54z)| norm 0.2468 (-1.14z)| lr 1.76e-05 | 323.14 ms | 52.2% bf16 MFU | 1624456 tok/s step 17495/19560 | loss 3.245024 (-1.15z)| norm 0.2557 (-0.84z)| lr 1.76e-05 | 322.58 ms | 52.3% bf16 MFU | 1624497 tok/s step 17496/19560 | loss 3.253183 (-0.96z)| norm 0.2606 (-0.67z)| lr 1.76e-05 | 322.55 ms | 52.3% bf16 MFU | 1624546 tok/s step 17497/19560 | loss 3.291315 (-0.15z)| norm 0.3389 (+1.85z)| 
lr 1.76e-05 | 323.08 ms | 52.2% bf16 MFU | 1624457 tok/s step 17498/19560 | loss 3.276176 (-0.47z)| norm 0.2724 (-0.31z)| lr 1.75e-05 | 322.94 ms | 52.3% bf16 MFU | 1624407 tok/s step 17499/19560 | loss 3.315861 (+0.37z)| norm 0.2460 (-1.16z)| lr 1.75e-05 | 322.88 ms | 52.3% bf16 MFU | 1624376 tok/s step 17500/19560 | loss 3.281781 (-0.34z)| norm 0.2590 (-0.72z)| lr 1.75e-05 | 322.37 ms | 52.4% bf16 MFU | 1624474 tok/s val loss 3.279209 HellaSwag: 3056/10042 = 0.304322 step 17501/19560 | loss 3.276967 (-0.46z)| norm 0.2943 (+0.42z)| lr 1.75e-05 | 322.54 ms | 52.3% bf16 MFU | 1624526
tok/s step 17502/19560 | loss 3.321721 (+0.56z)| norm 0.3638 (+2.57z)| lr 1.75e-05 | 322.88 ms | 52.3% bf16 MFU | 1624488 tok/s step 17503/19560 | loss 3.277177 (-0.45z)| norm 0.2895 (+0.24z)| lr 1.75e-05 | 323.04 ms | 52.2% bf16 MFU | 1624412 tok/s step 17504/19560 | loss 3.280571 (-0.37z)| norm 0.2905 (+0.26z)| lr 1.74e-05 | 322.39 ms | 52.4% bf16 MFU | 1624504 tok/s step 17505/19560 | loss 3.229156 (-1.56z)| norm 0.2961 (+0.43z)| lr 1.74e-05 | 322.36 ms | 52.4% bf16 MFU | 1624600 tok/s step 17506/19560 | loss 3.291233 (-0.11z)| norm 0.3246 (+1.35z)| lr 1.74e-05 | 322.69 ms | 52.3% bf16 MFU | 1624607 tok/s step 17507/19560 | loss 3.284943 (-0.26z)| norm 0.2541 (-0.92z)| lr 1.74e-05 | 322.85 ms | 52.3% bf16 MFU | 1624574 tok/s step 17508/19560 | loss 3.435487 (+3.17z)| norm 0.2983 (+0.50z)| lr 1.74e-05 | 322.97 ms | 52.3% bf16 MFU | 1624511 tok/s step 17509/19560 | loss 3.364060 (+1.51z)| norm 0.3280 (+1.43z)| lr 1.74e-05 | 322.34 ms | 52.4% bf16 MFU | 1624612 tok/s step 17510/19560 | loss 3.259096 (-0.86z)| norm 0.2900 (+0.22z)| lr 1.73e-05 | 322.96 ms | 52.3% bf16 MFU | 1624550 tok/s step 17511/19560 | loss 3.288260 (-0.20z)| norm 0.2587 (-0.78z)| lr 1.73e-05 | 322.32 ms | 52.4% bf16 MFU | 1624653 tok/s step 17512/19560 | loss 3.344691 (+1.08z)| norm 0.3797 (+2.96z)| lr 1.73e-05 | 322.70 ms | 52.3% bf16 MFU | 1624654 tok/s step 17513/19560 | loss 3.235357 (-1.39z)| norm 0.2500 (-1.05z)| lr 1.73e-05 | 322.67 ms | 52.3% bf16 MFU | 1624663 tok/s step 17514/19560 | loss 3.272874 (-0.53z)| norm 0.2429 (-1.25z)| lr 1.73e-05 | 322.13 ms | 52.4% bf16 MFU | 1624809 tok/s step 17515/19560 | loss 3.227415 (-1.53z)| norm 0.2967 (+0.39z)| lr 1.73e-05 | 323.15 ms | 52.2% bf16 MFU | 1624689 tok/s step 17516/19560 | loss 3.303832 (+0.21z)| norm 0.3678 (+2.50z)| lr 1.72e-05 | 322.72 ms | 52.3% bf16 MFU | 1624684 tok/s step 17517/19560 | loss 3.306866 (+0.27z)| norm 0.2767 (-0.23z)| lr 1.72e-05 | 322.30 ms | 52.4% bf16 MFU | 1624786 tok/s step 17518/19560 | loss 3.277580 
(-0.41z)| norm 0.4015 (+3.39z)| lr 1.72e-05 | 323.07 ms | 52.2% bf16 MFU | 1624688 tok/s step 17519/19560 | loss 3.273755 (-0.49z)| norm 0.3433 (+1.67z)| lr 1.72e-05 | 322.60 ms | 52.3% bf16 MFU | 1624713 tok/s step 17520/19560 | loss 3.327467 (+0.75z)| norm 0.3005 (+0.44z)| lr 1.72e-05 | 322.33 ms | 52.4% bf16 MFU | 1624804 tok/s step 17521/19560 | loss 3.288753 (-0.15z)| norm 0.2848 (-0.01z)| lr 1.72e-05 | 323.09 ms | 52.2% bf16 MFU | 1624700 tok/s step 17522/19560 | loss 3.372236 (+1.75z)| norm 0.3184 (+0.94z)| lr 1.71e-05 | 322.68 ms | 52.3% bf16 MFU | 1624704 tok/s step 17523/19560 | loss 3.271168 (-0.58z)| norm 0.2631 (-0.66z)| lr 1.71e-05 | 322.62 ms | 52.3% bf16 MFU | 1624724 tok/s step 17524/19560 | loss 3.352932 (+1.29z)| norm 0.2902 (+0.13z)| lr 1.71e-05 | 322.27 ms | 52.4% bf16 MFU | 1624831 tok/s step 17525/19560 | loss 3.234721 (-1.39z)| norm 0.2716 (-0.41z)| lr 1.71e-05 | 322.39 ms | 52.4% bf16 MFU | 1624902 tok/s step 17526/19560 | loss 3.257600 (-0.86z)| norm 0.2764 (-0.28z)| lr 1.71e-05 | 322.61 ms | 52.3% bf16 MFU | 1624915 tok/s step 17527/19560 | loss 3.281824 (-0.32z)| norm 0.2397 (-1.32z)| lr 1.71e-05 | 322.50 ms | 52.3% bf16 MFU | 1624953 tok/s step 17528/19560 | loss 3.276271 (-0.45z)| norm 0.2529 (-0.94z)| lr 1.70e-05 | 322.46 ms | 52.3% bf16 MFU | 1625001 tok/s step 17529/19560 | loss 3.353963 (+1.31z)| norm 0.2561 (-0.84z)| lr 1.70e-05 | 322.42 ms | 52.3% bf16 MFU | 1625056 tok/s step 17530/19560 | loss 3.275576 (-0.46z)| norm 0.2706 (-0.42z)| lr 1.70e-05 | 322.88 ms | 52.3% bf16 MFU | 1624993 tok/s step 17531/19560 | loss 3.328554 (+0.74z)| norm 0.2388 (-1.32z)| lr 1.70e-05 | 322.62 ms | 52.3% bf16 MFU | 1624997 tok/s step 17532/19560 | loss 3.249938 (-1.03z)| norm 0.2544 (-0.87z)| lr 1.70e-05 | 322.71 ms | 52.3% bf16 MFU | 1624980 tok/s step 17533/19560 | loss 3.357216 (+1.37z)| norm 0.2617 (-0.66z)| lr 1.70e-05 | 322.58 ms | 52.3% bf16 MFU | 1624994 tok/s step 17534/19560 | loss 3.329868 (+0.74z)| norm 0.2462 (-1.10z)| lr 1.69e-05 | 
321.89 ms | 52.4% bf16 MFU | 1625184 tok/s step 17535/19560 | loss 3.249499 (-1.06z)| norm 0.2448 (-1.12z)| lr 1.69e-05 | 323.02 ms | 52.2% bf16 MFU | 1625080 tok/s step 17536/19560 | loss 3.320623 (+0.53z)| norm 0.2563 (-0.80z)| lr 1.69e-05 | 322.71 ms | 52.3% bf16 MFU | 1625057 tok/s step 17537/19560 | loss 3.343237 (+1.03z)| norm 0.2660 (-0.52z)| lr 1.69e-05 | 322.76 ms | 52.3% bf16 MFU | 1625024 tok/s step 17538/19560 | loss 3.297152 (-0.01z)| norm 0.2534 (-0.87z)| lr 1.69e-05 | 322.69 ms | 52.3% bf16 MFU | 1625010 tok/s step 17539/19560 | loss 3.255348 (-0.93z)| norm 0.2446 (-1.11z)| lr 1.69e-05 | 322.69 ms | 52.3% bf16 MFU | 1624996 tok/s step 17540/19560 | loss 3.301646 (+0.10z)| norm 0.2869 (+0.09z)| lr 1.68e-05 | 322.43 ms | 52.3% bf16 MFU | 1625050 tok/s step 17541/19560 | loss 3.189073 (-2.37z)| norm 0.2715 (-0.35z)| lr 1.68e-05 | 322.96 ms | 52.3% bf16 MFU | 1624967 tok/s step 17542/19560 | loss 3.299287 (+0.06z)| norm 0.2670 (-0.47z)| lr 1.68e-05 | 322.14 ms | 52.4% bf16 MFU | 1625095 tok/s step 17543/19560 | loss 3.283299 (-0.29z)| norm 0.2570 (-0.74z)| lr 1.68e-05 | 323.21 ms | 52.2% bf16 MFU | 1624946 tok/s step 17544/19560 | loss 3.283462 (-0.28z)| norm 0.2671 (-0.46z)| lr 1.68e-05 | 322.11 ms | 52.4% bf16 MFU | 1625082 tok/s step 17545/19560 | loss 3.307147 (+0.23z)| norm 0.2568 (-0.74z)| lr 1.68e-05 | 323.13 ms | 52.2% bf16 MFU | 1624954 tok/s step 17546/19560 | loss 3.265800 (-0.67z)| norm 0.2873 (+0.14z)| lr 1.67e-05 | 322.61 ms | 52.3% bf16 MFU | 1624964 tok/s step 17547/19560 | loss 3.284673 (-0.25z)| norm 0.2554 (-0.79z)| lr 1.67e-05 | 322.34 ms | 52.4% bf16 MFU | 1625042 tok/s step 17548/19560 | loss 3.411373 (+2.48z)| norm 0.3293 (+1.34z)| lr 1.67e-05 | 322.48 ms | 52.3% bf16 MFU | 1625080 tok/s step 17549/19560 | loss 3.294090 (-0.07z)| norm 0.2821 (-0.03z)| lr 1.67e-05 | 322.85 ms | 52.3% bf16 MFU | 1625022 tok/s step 17550/19560 | loss 3.247851 (-1.05z)| norm 0.3421 (+1.69z)| lr 1.67e-05 | 322.25 ms | 52.4% bf16 MFU | 1625118 tok/s step 
17551/19560 | loss 3.336198 (+0.86z)| norm 0.2544 (-0.83z)| lr 1.67e-05 | 322.58 ms | 52.3% bf16 MFU | 1625128 tok/s step 17552/19560 | loss 3.244819 (-1.10z)| norm 0.2493 (-0.98z)| lr 1.66e-05 | 322.51 ms | 52.3% bf16 MFU | 1625155 tok/s step 17553/19560 | loss 3.247010 (-1.04z)| norm 0.3563 (+2.06z)| lr 1.66e-05 | 322.68 ms | 52.3% bf16 MFU | 1625136 tok/s step 17554/19560 | loss 3.259838 (-0.77z)| norm 0.3286 (+1.26z)| lr 1.66e-05 | 322.54 ms | 52.3% bf16 MFU | 1625155 tok/s step 17555/19560 | loss 3.293670 (-0.04z)| norm 0.2892 (+0.15z)| lr 1.66e-05 | 322.45 ms | 52.3% bf16 MFU | 1625195 tok/s step 17556/19560 | loss 3.291685 (-0.08z)| norm 0.2956 (+0.32z)| lr 1.66e-05 | 322.52 ms | 52.3% bf16 MFU | 1625214 tok/s step 17557/19560 | loss 3.243031 (-1.11z)| norm 0.2937 (+0.26z)| lr 1.66e-05 | 322.44 ms | 52.3% bf16 MFU | 1625254 tok/s step 17558/19560 | loss 3.324006 (+0.61z)| norm 0.3252 (+1.14z)| lr 1.65e-05 | 323.22 ms | 52.2% bf16 MFU | 1625095 tok/s step 17559/19560 | loss 3.271109 (-0.52z)| norm 0.2361 (-1.36z)| lr 1.65e-05 | 323.00 ms | 52.3% bf16 MFU | 1624999 tok/s step 17560/19560 | loss 3.290243 (-0.11z)| norm 0.2640 (-0.58z)| lr 1.65e-05 | 322.77 ms | 52.3% bf16 MFU | 1624965 tok/s step 17561/19560 | loss 3.292993 (-0.04z)| norm 0.3229 (+1.06z)| lr 1.65e-05 | 323.05 ms | 52.2% bf16 MFU | 1624863 tok/s step 17562/19560 | loss 3.285933 (-0.18z)| norm 0.2571 (-0.78z)| lr 1.65e-05 | 322.94 ms | 52.3% bf16 MFU | 1624793 tok/s step 17563/19560 | loss 3.273129 (-0.45z)| norm 0.3024 (+0.49z)| lr 1.65e-05 | 323.37 ms | 52.2% bf16 MFU | 1624619 tok/s step 17564/19560 | loss 3.279817 (-0.30z)| norm 0.2699 (-0.42z)| lr 1.64e-05 | 322.54 ms | 52.3% bf16 MFU | 1624664 tok/s step 17565/19560 | loss 3.335787 (+0.93z)| norm 0.2554 (-0.83z)| lr 1.64e-05 | 322.61 ms | 52.3% bf16 MFU | 1624689 tok/s step 17566/19560 | loss 3.295144 (+0.05z)| norm 0.2613 (-0.66z)| lr 1.64e-05 | 322.06 ms | 52.4% bf16 MFU | 1624851 tok/s step 17567/19560 | loss 3.308891 (+0.35z)| norm 
0.2486 (-1.01z)| lr 1.64e-05 | 322.51 ms | 52.3% bf16 MFU | 1624891 tok/s step 17568/19560 | loss 3.296027 (+0.06z)| norm 0.2976 (+0.35z)| lr 1.64e-05 | 323.57 ms | 52.2% bf16 MFU | 1624662 tok/s step 17569/19560 | loss 3.301206 (+0.16z)| norm 0.2940 (+0.24z)| lr 1.64e-05 | 322.82 ms | 52.3% bf16 MFU | 1624634 tok/s step 17570/19560 | loss 3.349372 (+1.24z)| norm 0.2900 (+0.15z)| lr 1.63e-05 | 322.80 ms | 52.3% bf16 MFU | 1624611 tok/s step 17571/19560 | loss 3.265335 (-0.63z)| norm 0.2886 (+0.11z)| lr 1.63e-05 | 323.23 ms | 52.2% bf16 MFU | 1624482 tok/s step 17572/19560 | loss 3.290456 (-0.07z)| norm 0.2432 (-1.20z)| lr 1.63e-05 | 323.49 ms | 52.2% bf16 MFU | 1624295 tok/s step 17573/19560 | loss 3.250753 (-0.95z)| norm 0.3350 (+1.45z)| lr 1.63e-05 | 322.39 ms | 52.4% bf16 MFU | 1624394 tok/s step 17574/19560 | loss 3.322382 (+0.64z)| norm 0.3243 (+1.12z)| lr 1.63e-05 | 323.26 ms | 52.2% bf16 MFU | 1624267 tok/s step 17575/19560 | loss 3.241513 (-1.15z)| norm 0.2507 (-0.99z)| lr 1.63e-05 | 323.01 ms | 52.3% bf16 MFU | 1624211 tok/s step 17576/19560 | loss 3.265840 (-0.59z)| norm 0.3065 (+0.61z)| lr 1.63e-05 | 322.82 ms | 52.3% bf16 MFU | 1624204 tok/s step 17577/19560 | loss 3.363395 (+1.56z)| norm 0.2663 (-0.55z)| lr 1.62e-05 | 323.40 ms | 52.2% bf16 MFU | 1624052 tok/s step 17578/19560 | loss 3.236898 (-1.23z)| norm 0.2929 (+0.21z)| lr 1.62e-05 | 323.13 ms | 52.2% bf16 MFU | 1623977 tok/s step 17579/19560 | loss 3.223850 (-1.52z)| norm 0.3037 (+0.56z)| lr 1.62e-05 | 322.73 ms | 52.3% bf16 MFU | 1624004 tok/s step 17580/19560 | loss 3.244440 (-1.06z)| norm 0.2430 (-1.24z)| lr 1.62e-05 | 323.04 ms | 52.2% bf16 MFU | 1623952 tok/s step 17581/19560 | loss 3.274903 (-0.38z)| norm 0.2486 (-1.06z)| lr 1.62e-05 | 323.32 ms | 52.2% bf16 MFU | 1623833 tok/s step 17582/19560 | loss 3.284743 (-0.16z)| norm 0.2818 (-0.06z)| lr 1.62e-05 | 322.80 ms | 52.3% bf16 MFU | 1623850 tok/s step 17583/19560 | loss 3.288937 (-0.06z)| norm 0.2676 (-0.49z)| lr 1.61e-05 | 322.96 ms | 
52.3% bf16 MFU | 1623826 tok/s step 17584/19560 | loss 3.357127 (+1.44z)| norm 0.2580 (-0.77z)| lr 1.61e-05 | 322.70 ms | 52.3% bf16 MFU | 1623870 tok/s step 17585/19560 | loss 3.319142 (+0.60z)| norm 0.2840 (+0.01z)| lr 1.61e-05 | 323.03 ms | 52.2% bf16 MFU | 1623827 tok/s step 17586/19560 | loss 3.350982 (+1.29z)| norm 0.2812 (-0.07z)| lr 1.61e-05 | 322.82 ms | 52.3% bf16 MFU | 1623841 tok/s step 17587/19560 | loss 3.298217 (+0.14z)| norm 0.2704 (-0.38z)| lr 1.61e-05 | 322.72 ms | 52.3% bf16 MFU | 1623879 tok/s step 17588/19560 | loss 3.253621 (-0.86z)| norm 0.2518 (-0.94z)| lr 1.61e-05 | 322.58 ms | 52.3% bf16 MFU | 1623949 tok/s step 17589/19560 | loss 3.248000 (-0.98z)| norm 0.3176 (+1.05z)| lr 1.60e-05 | 322.75 ms | 52.3% bf16 MFU | 1623974 tok/s step 17590/19560 | loss 3.350189 (+1.31z)| norm 0.2929 (+0.32z)| lr 1.60e-05 | 323.55 ms | 52.2% bf16 MFU | 1623797 tok/s step 17591/19560 | loss 3.232245 (-1.33z)| norm 0.3316 (+1.48z)| lr 1.60e-05 | 323.24 ms | 52.2% bf16 MFU | 1623707 tok/s step 17592/19560 | loss 3.248604 (-0.97z)| norm 0.2553 (-0.84z)| lr 1.60e-05 | 323.17 ms | 52.2% bf16 MFU | 1623637 tok/s step 17593/19560 | loss 3.334223 (+1.01z)| norm 0.2708 (-0.36z)| lr 1.60e-05 | 322.86 ms | 52.3% bf16 MFU | 1623650 tok/s step 17594/19560 | loss 3.232740 (-1.32z)| norm 0.3616 (+2.33z)| lr 1.60e-05 | 322.93 ms | 52.3% bf16 MFU | 1623645 tok/s step 17595/19560 | loss 3.309077 (+0.44z)| norm 0.2902 (+0.21z)| lr 1.59e-05 | 323.12 ms | 52.2% bf16 MFU | 1623592 tok/s step 17596/19560 | loss 3.273287 (-0.39z)| norm 0.2425 (-1.19z)| lr 1.59e-05 | 323.06 ms | 52.2% bf16 MFU | 1623556 tok/s step 17597/19560 | loss 3.252698 (-0.87z)| norm 0.2913 (+0.26z)| lr 1.59e-05 | 322.88 ms | 52.3% bf16 MFU | 1623567 tok/s step 17598/19560 | loss 3.319443 (+0.66z)| norm 0.2624 (-0.60z)| lr 1.59e-05 | 322.54 ms | 52.3% bf16 MFU | 1623664 tok/s step 17599/19560 | loss 3.290865 (+0.00z)| norm 0.2819 (-0.01z)| lr 1.59e-05 | 323.21 ms | 52.2% bf16 MFU | 1623587 tok/s step 17600/19560 
| loss 3.326237 (+0.80z)| norm 0.2408 (-1.22z)| lr 1.59e-05 | 323.89 ms | 52.1% bf16 MFU | 1623343 tok/s step 17601/19560 | loss 3.293943 (+0.05z)| norm 0.2741 (-0.22z)| lr 1.58e-05 | 322.31 ms | 52.4% bf16 MFU | 1623509 tok/s step 17602/19560 | loss 3.311321 (+0.45z)| norm 0.3074 (+0.76z)| lr 1.58e-05 | 322.49 ms | 52.3% bf16 MFU | 1623620 tok/s step 17603/19560 | loss 3.303067 (+0.27z)| norm 0.2534 (-0.84z)| lr 1.58e-05 | 323.57 ms | 52.2% bf16 MFU | 1623456 tok/s step 17604/19560 | loss 3.300170 (+0.21z)| norm 0.2813 (-0.02z)| lr 1.58e-05 | 323.08 ms | 52.2% bf16 MFU | 1623423 tok/s step 17605/19560 | loss 3.267035 (-0.56z)| norm 0.2937 (+0.35z)| lr 1.58e-05 | 323.40 ms | 52.2% bf16 MFU | 1623310 tok/s step 17606/19560 | loss 3.280930 (-0.23z)| norm 0.3359 (+1.57z)| lr 1.58e-05 | 322.70 ms | 52.3% bf16 MFU | 1623378 tok/s step 17607/19560 | loss 3.299757 (+0.25z)| norm 0.3197 (+1.08z)| lr 1.58e-05 | 322.68 ms | 52.3% bf16 MFU | 1623449 tok/s step 17608/19560 | loss 3.315613 (+0.63z)| norm 0.2930 (+0.29z)| lr 1.57e-05 | 323.69 ms | 52.1% bf16 MFU | 1623262 tok/s step 17609/19560 | loss 3.310129 (+0.48z)| norm 0.2885 (+0.15z)| lr 1.57e-05 | 322.64 ms | 52.3% bf16 MFU | 1623350 tok/s step 17610/19560 | loss 3.269887 (-0.52z)| norm 0.3061 (+0.68z)| lr 1.57e-05 | 323.07 ms | 52.2% bf16 MFU | 1623325 tok/s step 17611/19560 | loss 3.328640 (+0.94z)| norm 0.3674 (+2.42z)| lr 1.57e-05 | 322.48 ms | 52.3% bf16 MFU | 1623449 tok/s step 17612/19560 | loss 3.262671 (-0.69z)| norm 0.2430 (-1.19z)| lr 1.57e-05 | 323.01 ms | 52.3% bf16 MFU | 1623433 tok/s step 17613/19560 | loss 3.275917 (-0.38z)| norm 0.2455 (-1.11z)| lr 1.57e-05 | 323.35 ms | 52.2% bf16 MFU | 1623334 tok/s step 17614/19560 | loss 3.251896 (-0.96z)| norm 0.3103 (+0.75z)| lr 1.56e-05 | 322.72 ms | 52.3% bf16 MFU | 1623396 tok/s step 17615/19560 | loss 3.276057 (-0.37z)| norm 0.2833 (-0.02z)| lr 1.56e-05 | 322.97 ms | 52.3% bf16 MFU | 1623392 tok/s step 17616/19560 | loss 3.286730 (-0.11z)| norm 0.2457 (-1.10z)| 
lr 1.56e-05 | 322.48 ms | 52.3% bf16 MFU | 1623512 tok/s step 17617/19560 | loss 3.216428 (-1.87z)| norm 0.2442 (-1.13z)| lr 1.56e-05 | 322.32 ms | 52.4% bf16 MFU | 1623666 tok/s step 17618/19560 | loss 3.270838 (-0.48z)| norm 0.3343 (+1.47z)| lr 1.56e-05 | 323.01 ms | 52.2% bf16 MFU | 1623638 tok/s step 17619/19560 | loss 3.275213 (-0.37z)| norm 0.2793 (-0.13z)| lr 1.56e-05 | 322.93 ms | 52.3% bf16 MFU | 1623632 tok/s step 17620/19560 | loss 3.243997 (-1.15z)| norm 0.2454 (-1.10z)| lr 1.55e-05 | 322.32 ms | 52.4% bf16 MFU | 1623782 tok/s step 17621/19560 | loss 3.358520 (+1.73z)| norm 0.2610 (-0.64z)| lr 1.55e-05 | 322.59 ms | 52.3% bf16 MFU | 1623854 tok/s step 17622/19560 | loss 3.267085 (-0.56z)| norm 0.2699 (-0.39z)| lr 1.55e-05 | 322.73 ms | 52.3% bf16 MFU | 1623889 tok/s step 17623/19560 | loss 3.301085 (+0.28z)| norm 0.2849 (+0.04z)| lr 1.55e-05 | 322.64 ms | 52.3% bf16 MFU | 1623945 tok/s step 17624/19560 | loss 3.333068 (+1.08z)| norm 0.2572 (-0.77z)| lr 1.55e-05 | 323.04 ms | 52.2% bf16 MFU | 1623897 tok/s step 17625/19560 | loss 3.283640 (-0.17z)| norm 0.2813 (-0.06z)| lr 1.55e-05 | 323.02 ms | 52.2% bf16 MFU | 1623856 tok/s step 17626/19560 | loss 3.350541 (+1.49z)| norm 0.3194 (+1.05z)| lr 1.54e-05 | 322.99 ms | 52.3% bf16 MFU | 1623826 tok/s step 17627/19560 | loss 3.361983 (+1.75z)| norm 0.2462 (-1.09z)| lr 1.54e-05 | 322.74 ms | 52.3% bf16 MFU | 1623858 tok/s step 17628/19560 | loss 3.314964 (+0.58z)| norm 0.2835 (-0.01z)| lr 1.54e-05 | 322.69 ms | 52.3% bf16 MFU | 1623902 tok/s step 17629/19560 | loss 3.271255 (-0.50z)| norm 0.2712 (-0.36z)| lr 1.54e-05 | 322.20 ms | 52.4% bf16 MFU | 1624068 tok/s step 17630/19560 | loss 3.237472 (-1.32z)| norm 0.2327 (-1.48z)| lr 1.54e-05 | 323.25 ms | 52.2% bf16 MFU | 1623960 tok/s step 17631/19560 | loss 3.433085 (+3.32z)| norm 0.3481 (+1.91z)| lr 1.54e-05 | 322.28 ms | 52.4% bf16 MFU | 1624102 tok/s step 17632/19560 | loss 3.315484 (+0.54z)| norm 0.2580 (-0.72z)| lr 1.54e-05 | 323.01 ms | 52.3% bf16 MFU | 
1624055 tok/s step 17633/19560 | loss 3.287192 (-0.13z)| norm 0.2620 (-0.60z)| lr 1.53e-05 | 322.39 ms | 52.3% bf16 MFU | 1624164 tok/s step 17634/19560 | loss 3.251552 (-0.97z)| norm 0.2585 (-0.69z)| lr 1.53e-05 | 322.22 ms | 52.4% bf16 MFU | 1624312 tok/s step 17635/19560 | loss 3.276973 (-0.37z)| norm 0.2643 (-0.52z)| lr 1.53e-05 | 322.85 ms | 52.3% bf16 MFU | 1624294 tok/s step 17636/19560 | loss 3.257225 (-0.84z)| norm 0.2649 (-0.50z)| lr 1.53e-05 | 322.49 ms | 52.3% bf16 MFU | 1624367 tok/s step 17637/19560 | loss 3.245711 (-1.11z)| norm 0.2446 (-1.08z)| lr 1.53e-05 | 322.50 ms | 52.3% bf16 MFU | 1624434 tok/s step 17638/19560 | loss 3.297360 (+0.17z)| norm 0.2705 (-0.31z)| lr 1.53e-05 | 322.63 ms | 52.3% bf16 MFU | 1624464 tok/s step 17639/19560 | loss 3.336101 (+1.12z)| norm 0.2541 (-0.79z)| lr 1.52e-05 | 322.75 ms | 52.3% bf16 MFU | 1624463 tok/s step 17640/19560 | loss 3.286094 (-0.11z)| norm 0.2548 (-0.77z)| lr 1.52e-05 | 322.60 ms | 52.3% bf16 MFU | 1624499 tok/s step 17641/19560 | loss 3.236786 (-1.34z)| norm 0.2557 (-0.74z)| lr 1.52e-05 | 322.83 ms | 52.3% bf16 MFU | 1624476 tok/s step 17642/19560 | loss 3.312523 (+0.55z)| norm 0.2422 (-1.15z)| lr 1.52e-05 | 322.63 ms | 52.3% bf16 MFU | 1624505 tok/s step 17643/19560 | loss 3.291771 (+0.01z)| norm 0.2406 (-1.18z)| lr 1.52e-05 | 322.46 ms | 52.3% bf16 MFU | 1624575 tok/s step 17644/19560 | loss 3.342881 (+1.29z)| norm 0.2462 (-1.01z)| lr 1.52e-05 | 322.56 ms | 52.3% bf16 MFU | 1624616 tok/s step 17645/19560 | loss 3.307642 (+0.41z)| norm 0.2628 (-0.49z)| lr 1.51e-05 | 322.59 ms | 52.3% bf16 MFU | 1624647 tok/s step 17646/19560 | loss 3.285892 (-0.14z)| norm 0.2854 (+0.25z)| lr 1.51e-05 | 322.51 ms | 52.3% bf16 MFU | 1624697 tok/s step 17647/19560 | loss 3.286219 (-0.14z)| norm 0.2471 (-1.00z)| lr 1.51e-05 | 322.55 ms | 52.3% bf16 MFU | 1624734 tok/s step 17648/19560 | loss 3.296699 (+0.13z)| norm 0.3315 (+1.80z)| lr 1.51e-05 | 322.81 ms | 52.3% bf16 MFU | 1624705 tok/s step 17649/19560 | loss 3.277162 
(-0.36z)| norm 0.2494 (-0.90z)| lr 1.51e-05 | 322.78 ms | 52.3% bf16 MFU | 1624685 tok/s step 17650/19560 | loss 3.285707 (-0.13z)| norm 0.2341 (-1.39z)| lr 1.51e-05 | 322.59 ms | 52.3% bf16 MFU | 1624714 tok/s step 17651/19560 | loss 3.268209 (-0.57z)| norm 0.2817 (+0.18z)| lr 1.51e-05 | 322.51 ms | 52.3% bf16 MFU | 1624760 tok/s step 17652/19560 | loss 3.271281 (-0.48z)| norm 0.2471 (-0.95z)| lr 1.50e-05 | 322.57 ms | 52.3% bf16 MFU | 1624788 tok/s step 17653/19560 | loss 3.263812 (-0.69z)| norm 0.2933 (+0.56z)| lr 1.50e-05 | 322.75 ms | 52.3% bf16 MFU | 1624771 tok/s step 17654/19560 | loss 3.378882 (+2.24z)| norm 0.2996 (+0.76z)| lr 1.50e-05 | 322.66 ms | 52.3% bf16 MFU | 1624778 tok/s step 17655/19560 | loss 3.257360 (-0.86z)| norm 0.2656 (-0.36z)| lr 1.50e-05 | 322.64 ms | 52.3% bf16 MFU | 1624788 tok/s step 17656/19560 | loss 3.270919 (-0.51z)| norm 0.2571 (-0.64z)| lr 1.50e-05 | 322.73 ms | 52.3% bf16 MFU | 1624775 tok/s step 17657/19560 | loss 3.292979 (+0.07z)| norm 0.2716 (-0.17z)| lr 1.50e-05 | 322.60 ms | 52.3% bf16 MFU | 1624796 tok/s step 17658/19560 | loss 3.478061 (+4.41z)| norm 0.2730 (-0.12z)| lr 1.49e-05 | 322.74 ms | 52.3% bf16 MFU | 1624782 tok/s step 17659/19560 | loss 3.256526 (-0.83z)| norm 0.2402 (-1.21z)| lr 1.49e-05 | 322.81 ms | 52.3% bf16 MFU | 1624750 tok/s step 17660/19560 | loss 3.290643 (-0.03z)| norm 0.3515 (+2.40z)| lr 1.49e-05 | 322.18 ms | 52.4% bf16 MFU | 1624879 tok/s step 17661/19560 | loss 3.339708 (+1.15z)| norm 0.2604 (-0.55z)| lr 1.49e-05 | 322.43 ms | 52.3% bf16 MFU | 1624937 tok/s step 17662/19560 | loss 3.271096 (-0.48z)| norm 0.2551 (-0.73z)| lr 1.49e-05 | 322.50 ms | 52.3% bf16 MFU | 1624975 tok/s step 17663/19560 | loss 3.265642 (-0.62z)| norm 0.2884 (+0.34z)| lr 1.49e-05 | 322.75 ms | 52.3% bf16 MFU | 1624948 tok/s step 17664/19560 | loss 3.254461 (-0.87z)| norm 0.3163 (+1.23z)| lr 1.49e-05 | 322.78 ms | 52.3% bf16 MFU | 1624916 tok/s step 17665/19560 | loss 3.245026 (-1.08z)| norm 0.2792 (+0.02z)| lr 1.48e-05 | 
322.67 ms | 52.3% bf16 MFU | 1624912 tok/s step 17666/19560 | loss 3.341995 (+1.23z)| norm 0.2389 (-1.28z)| lr 1.48e-05 | 322.92 ms | 52.3% bf16 MFU | 1624846 tok/s step 17667/19560 | loss 3.282856 (-0.19z)| norm 0.3022 (+0.76z)| lr 1.48e-05 | 322.18 ms | 52.4% bf16 MFU | 1624970 tok/s step 17668/19560 | loss 3.264770 (-0.61z)| norm 0.2300 (-1.55z)| lr 1.48e-05 | 322.90 ms | 52.3% bf16 MFU | 1624906 tok/s step 17669/19560 | loss 3.247829 (-1.05z)| norm 0.2456 (-1.04z)| lr 1.48e-05 | 322.36 ms | 52.4% bf16 MFU | 1624981 tok/s step 17670/19560 | loss 3.284600 (-0.15z)| norm 0.2499 (-0.90z)| lr 1.48e-05 | 322.52 ms | 52.3% bf16 MFU | 1625012 tok/s step 17671/19560 | loss 3.240991 (-1.20z)| norm 0.2637 (-0.46z)| lr 1.47e-05 | 322.13 ms | 52.4% bf16 MFU | 1625138 tok/s step 17672/19560 | loss 3.342175 (+1.24z)| norm 0.2560 (-0.70z)| lr 1.47e-05 | 322.63 ms | 52.3% bf16 MFU | 1625133 tok/s step 17673/19560 | loss 3.249489 (-0.98z)| norm 0.2884 (+0.33z)| lr 1.47e-05 | 322.84 ms | 52.3% bf16 MFU | 1625075 tok/s step 17674/19560 | loss 3.329531 (+0.93z)| norm 0.2608 (-0.55z)| lr 1.47e-05 | 322.75 ms | 52.3% bf16 MFU | 1625043 tok/s step 17675/19560 | loss 3.250445 (-0.96z)| norm 0.2594 (-0.60z)| lr 1.47e-05 | 322.56 ms | 52.3% bf16 MFU | 1625060 tok/s step 17676/19560 | loss 3.378227 (+2.14z)| norm 0.2883 (+0.34z)| lr 1.47e-05 | 322.48 ms | 52.3% bf16 MFU | 1625098 tok/s step 17677/19560 | loss 3.382422 (+2.18z)| norm 0.2544 (-0.74z)| lr 1.47e-05 | 322.40 ms | 52.3% bf16 MFU | 1625152 tok/s step 17678/19560 | loss 3.270936 (-0.48z)| norm 0.2450 (-1.04z)| lr 1.46e-05 | 322.48 ms | 52.3% bf16 MFU | 1625184 tok/s step 17679/19560 | loss 3.337233 (+1.10z)| norm 0.2814 (+0.15z)| lr 1.46e-05 | 322.50 ms | 52.3% bf16 MFU | 1625211 tok/s step 17680/19560 | loss 3.263586 (-0.67z)| norm 0.2731 (-0.13z)| lr 1.46e-05 | 323.28 ms | 52.2% bf16 MFU | 1625039 tok/s step 17681/19560 | loss 3.337601 (+1.09z)| norm 0.2464 (-1.00z)| lr 1.46e-05 | 322.33 ms | 52.4% bf16 MFU | 1625114 tok/s step 
17682/19560 | loss 3.337651 (+1.08z)| norm 0.2498 (-0.87z)| lr 1.46e-05 | 322.77 ms | 52.3% bf16 MFU | 1625074 tok/s step 17683/19560 | loss 3.276168 (-0.39z)| norm 0.2604 (-0.51z)| lr 1.46e-05 | 322.98 ms | 52.3% bf16 MFU | 1624983 tok/s step 17684/19560 | loss 3.300870 (+0.20z)| norm 0.2549 (-0.69z)| lr 1.45e-05 | 323.44 ms | 52.2% bf16 MFU | 1624784 tok/s step 17685/19560 | loss 3.271304 (-0.52z)| norm 0.3790 (+3.34z)| lr 1.45e-05 | 322.39 ms | 52.3% bf16 MFU | 1624856 tok/s step 17686/19560 | loss 3.252033 (-0.96z)| norm 0.2535 (-0.71z)| lr 1.45e-05 | 322.51 ms | 52.3% bf16 MFU | 1624895 tok/s step 17687/19560 | loss 3.327551 (+0.83z)| norm 0.2549 (-0.67z)| lr 1.45e-05 | 323.26 ms | 52.2% bf16 MFU | 1624743 tok/s step 17688/19560 | loss 3.269554 (-0.55z)| norm 0.3072 (+1.03z)| lr 1.45e-05 | 322.68 ms | 52.3% bf16 MFU | 1624745 tok/s step 17689/19560 | loss 3.236594 (-1.32z)| norm 0.3278 (+1.70z)| lr 1.45e-05 | 322.75 ms | 52.3% bf16 MFU | 1624731 tok/s step 17690/19560 | loss 3.242898 (-1.15z)| norm 0.2813 (+0.17z)| lr 1.45e-05 | 322.91 ms | 52.3% bf16 MFU | 1624676 tok/s step 17691/19560 | loss 3.323455 (+0.74z)| norm 0.2532 (-0.73z)| lr 1.44e-05 | 322.21 ms | 52.4% bf16 MFU | 1624800 tok/s step 17692/19560 | loss 3.296385 (+0.10z)| norm 0.2473 (-0.91z)| lr 1.44e-05 | 322.98 ms | 52.3% bf16 MFU | 1624725 tok/s step 17693/19560 | loss 3.276887 (-0.35z)| norm 0.2958 (+0.65z)| lr 1.44e-05 | 323.51 ms | 52.2% bf16 MFU | 1624518 tok/s step 17694/19560 | loss 3.317758 (+0.61z)| norm 0.3087 (+1.06z)| lr 1.44e-05 | 322.51 ms | 52.3% bf16 MFU | 1624576 tok/s step 17695/19560 | loss 3.253988 (-0.89z)| norm 0.2761 (-0.01z)| lr 1.44e-05 | 322.90 ms | 52.3% bf16 MFU | 1624530 tok/s step 17696/19560 | loss 3.275993 (-0.36z)| norm 0.2634 (-0.41z)| lr 1.44e-05 | 322.58 ms | 52.3% bf16 MFU | 1624568 tok/s step 17697/19560 | loss 3.308629 (+0.40z)| norm 0.3849 (+3.37z)| lr 1.43e-05 | 322.98 ms | 52.3% bf16 MFU | 1624505 tok/s step 17698/19560 | loss 3.321463 (+0.72z)| norm 
0.2451 (-0.97z)| lr 1.43e-05 | 322.68 ms | 52.3% bf16 MFU | 1624518 tok/s step 17699/19560 | loss 3.282357 (-0.21z)| norm 0.2607 (-0.48z)| lr 1.43e-05 | 322.93 ms | 52.3% bf16 MFU | 1624470 tok/s step 17700/19560 | loss 3.319180 (+0.65z)| norm 0.2523 (-0.75z)| lr 1.43e-05 | 322.73 ms | 52.3% bf16 MFU | 1624472 tok/s step 17701/19560 | loss 3.293121 (+0.03z)| norm 0.2880 (+0.38z)| lr 1.43e-05 | 322.75 ms | 52.3% bf16 MFU | 1624470 tok/s step 17702/19560 | loss 3.320235 (+0.67z)| norm 0.2567 (-0.59z)| lr 1.43e-05 | 322.69 ms | 52.3% bf16 MFU | 1624484 tok/s step 17703/19560 | loss 3.300719 (+0.20z)| norm 0.2294 (-1.44z)| lr 1.43e-05 | 322.30 ms | 52.4% bf16 MFU | 1624595 tok/s step 17704/19560 | loss 3.332115 (+0.94z)| norm 0.2943 (+0.61z)| lr 1.42e-05 | 322.91 ms | 52.3% bf16 MFU | 1624547 tok/s step 17705/19560 | loss 3.266531 (-0.62z)| norm 0.2530 (-0.69z)| lr 1.42e-05 | 323.05 ms | 52.2% bf16 MFU | 1624466 tok/s step 17706/19560 | loss 3.301092 (+0.21z)| norm 0.2586 (-0.51z)| lr 1.42e-05 | 322.94 ms | 52.3% bf16 MFU | 1624416 tok/s step 17707/19560 | loss 3.265207 (-0.68z)| norm 0.3049 (+0.95z)| lr 1.42e-05 | 323.10 ms | 52.2% bf16 MFU | 1624329 tok/s step 17708/19560 | loss 3.219154 (-1.79z)| norm 0.2578 (-0.54z)| lr 1.42e-05 | 322.10 ms | 52.4% bf16 MFU | 1624497 tok/s step 17709/19560 | loss 3.255950 (-0.89z)| norm 0.2494 (-0.81z)| lr 1.42e-05 | 323.31 ms | 52.2% bf16 MFU | 1624353 tok/s step 17710/19560 | loss 3.231789 (-1.45z)| norm 0.2938 (+0.60z)| lr 1.41e-05 | 322.77 ms | 52.3% bf16 MFU | 1624353 tok/s step 17711/19560 | loss 3.237514 (-1.30z)| norm 0.2613 (-0.43z)| lr 1.41e-05 | 322.85 ms | 52.3% bf16 MFU | 1624333 tok/s step 17712/19560 | loss 3.265784 (-0.61z)| norm 0.2451 (-0.94z)| lr 1.41e-05 | 322.49 ms | 52.3% bf16 MFU | 1624402 tok/s step 17713/19560 | loss 3.171853 (-2.76z)| norm 0.2407 (-1.06z)| lr 1.41e-05 | 322.78 ms | 52.3% bf16 MFU | 1624398 tok/s step 17714/19560 | loss 3.326299 (+0.86z)| norm 0.2697 (-0.15z)| lr 1.41e-05 | 323.27 ms | 
52.2% bf16 MFU | 1624268 tok/s step 17715/19560 | loss 3.295072 (+0.13z)| norm 0.2657 (-0.27z)| lr 1.41e-05 | 323.09 ms | 52.2% bf16 MFU | 1624192 tok/s step 17716/19560 | loss 3.250805 (-0.91z)| norm 0.2491 (-0.79z)| lr 1.41e-05 | 322.65 ms | 52.3% bf16 MFU | 1624230 tok/s step 17717/19560 | loss 3.207064 (-1.91z)| norm 0.2451 (-0.90z)| lr 1.40e-05 | 322.41 ms | 52.3% bf16 MFU | 1624327 tok/s step 17718/19560 | loss 3.283294 (-0.13z)| norm 0.2422 (-0.98z)| lr 1.40e-05 | 322.51 ms | 52.3% bf16 MFU | 1624392 tok/s step 17719/19560 | loss 3.246929 (-0.99z)| norm 0.2424 (-0.96z)| lr 1.40e-05 | 322.66 ms | 52.3% bf16 MFU | 1624417 tok/s step 17720/19560 | loss 3.449125 (+3.55z)| norm 0.2595 (-0.42z)| lr 1.40e-05 | 322.56 ms | 52.3% bf16 MFU | 1624466 tok/s step 17721/19560 | loss 3.270810 (-0.43z)| norm 0.2769 (+0.13z)| lr 1.40e-05 | 322.36 ms | 52.4% bf16 MFU | 1624563 tok/s step 17722/19560 | loss 3.284606 (-0.13z)| norm 0.2929 (+0.68z)| lr 1.40e-05 | 322.51 ms | 52.3% bf16 MFU | 1624617 tok/s step 17723/19560 | loss 3.256677 (-0.75z)| norm 0.2556 (-0.53z)| lr 1.40e-05 | 322.64 ms | 52.3% bf16 MFU | 1624637 tok/s step 17724/19560 | loss 3.245801 (-0.99z)| norm 0.2718 (-0.01z)| lr 1.39e-05 | 322.84 ms | 52.3% bf16 MFU | 1624605 tok/s step 17725/19560 | loss 3.284531 (-0.12z)| norm 0.2792 (+0.24z)| lr 1.39e-05 | 322.94 ms | 52.3% bf16 MFU | 1624548 tok/s step 17726/19560 | loss 3.249904 (-0.89z)| norm 0.2562 (-0.52z)| lr 1.39e-05 | 322.30 ms | 52.4% bf16 MFU | 1624656 tok/s step 17727/19560 | loss 3.310869 (+0.48z)| norm 0.2808 (+0.29z)| lr 1.39e-05 | 323.26 ms | 52.2% bf16 MFU | 1624518 tok/s step 17728/19560 | loss 3.257968 (-0.70z)| norm 0.2922 (+0.65z)| lr 1.39e-05 | 322.44 ms | 52.3% bf16 MFU | 1624592 tok/s step 17729/19560 | loss 3.294981 (+0.13z)| norm 0.2820 (+0.32z)| lr 1.39e-05 | 322.25 ms | 52.4% bf16 MFU | 1624710 tok/s step 17730/19560 | loss 3.278792 (-0.23z)| norm 0.2771 (+0.16z)| lr 1.38e-05 | 322.69 ms | 52.3% bf16 MFU | 1624712 tok/s step 17731/19560 
| loss 3.201198 (-1.93z)| norm 0.2498 (-0.74z)| lr 1.38e-05 | 322.49 ms | 52.3% bf16 MFU | 1624764 tok/s step 17732/19560 | loss 3.266198 (-0.48z)| norm 0.2478 (-0.80z)| lr 1.38e-05 | 322.83 ms | 52.3% bf16 MFU | 1624728 tok/s step 17733/19560 | loss 3.278836 (-0.20z)| norm 0.3155 (+1.43z)| lr 1.38e-05 | 322.73 ms | 52.3% bf16 MFU | 1624718 tok/s step 17734/19560 | loss 3.270021 (-0.40z)| norm 0.2846 (+0.43z)| lr 1.38e-05 | 322.51 ms | 52.3% bf16 MFU | 1624765 tok/s step 17735/19560 | loss 3.451030 (+3.43z)| norm 0.4371 (+4.98z)| lr 1.38e-05 | 322.69 ms | 52.3% bf16 MFU | 1624763 tok/s step 17736/19560 | loss 3.217350 (-1.49z)| norm 0.2449 (-0.83z)| lr 1.38e-05 | 322.36 ms | 52.4% bf16 MFU | 1624846 tok/s step 17737/19560 | loss 3.307559 (+0.41z)| norm 0.2911 (+0.57z)| lr 1.37e-05 | 322.40 ms | 52.3% bf16 MFU | 1624913 tok/s step 17738/19560 | loss 3.325979 (+0.78z)| norm 0.3001 (+0.85z)| lr 1.37e-05 | 322.86 ms | 52.3% bf16 MFU | 1624862 tok/s step 17739/19560 | loss 3.329241 (+0.85z)| norm 0.2600 (-0.35z)| lr 1.37e-05 | 322.53 ms | 52.3% bf16 MFU | 1624896 tok/s step 17740/19560 | loss 3.300809 (+0.25z)| norm 0.2447 (-0.84z)| lr 1.37e-05 | 322.50 ms | 52.3% bf16 MFU | 1624935 tok/s step 17741/19560 | loss 3.310499 (+0.45z)| norm 0.4655 (+5.33z)| lr 1.37e-05 | 322.87 ms | 52.3% bf16 MFU | 1624880 tok/s step 17742/19560 | loss 3.288117 (-0.03z)| norm 0.3248 (+1.42z)| lr 1.37e-05 | 322.40 ms | 52.3% bf16 MFU | 1624946 tok/s step 17743/19560 | loss 3.355586 (+1.37z)| norm 0.2932 (+0.55z)| lr 1.37e-05 | 322.94 ms | 52.3% bf16 MFU | 1624873 tok/s step 17744/19560 | loss 3.278975 (-0.23z)| norm 0.3282 (+1.49z)| lr 1.36e-05 | 322.86 ms | 52.3% bf16 MFU | 1624822 tok/s step 17745/19560 | loss 3.280557 (-0.21z)| norm 0.3196 (+1.23z)| lr 1.36e-05 | 322.66 ms | 52.3% bf16 MFU | 1624827 tok/s step 17746/19560 | loss 3.271191 (-0.41z)| norm 0.2651 (-0.24z)| lr 1.36e-05 | 322.67 ms | 52.3% bf16 MFU | 1624828 tok/s step 17747/19560 | loss 3.292369 (+0.03z)| norm 0.2642 (-0.27z)| 
lr 1.36e-05 | 322.41 ms | 52.3% bf16 MFU | 1624895 tok/s step 17748/19560 | loss 3.275579 (-0.33z)| norm 0.3446 (+1.90z)| lr 1.36e-05 | 323.04 ms | 52.2% bf16 MFU | 1624800 tok/s step 17749/19560 | loss 3.239341 (-1.08z)| norm 0.2845 (+0.26z)| lr 1.36e-05 | 322.91 ms | 52.3% bf16 MFU | 1624741 tok/s step 17750/19560 | loss 3.316895 (+0.56z)| norm 0.2708 (-0.11z)| lr 1.35e-05 | 322.99 ms | 52.3% bf16 MFU | 1624666 tok/s
val loss 3.278147
HellaSwag: 3053/10042 = 0.304023
step 17751/19560 | loss 3.338180 (+1.00z)| norm 0.2670 (-0.21z)| lr 1.35e-05 | 321.99 ms | 52.4% bf16 MFU | 1624848 tok/s step 17752/19560 | loss 3.301796 (+0.24z)| norm 0.2914 (+0.45z)| lr 1.35e-05 | 323.22 ms | 52.2% bf16 MFU | 1624709 tok/s step 17753/19560 | loss 3.319060 (+0.60z)| norm 0.3699 (+2.50z)| lr 1.35e-05 | 322.41 ms | 52.3% bf16 MFU | 1624781 tok/s step 17754/19560 | loss 3.273736 (-0.35z)| norm 0.3229 (+1.25z)| lr 1.35e-05 | 322.44 ms | 52.3% bf16 MFU | 1624842 tok/s step 17755/19560 | loss 3.278165 (-0.24z)| norm 0.2673 (-0.23z)| lr 1.35e-05 | 322.66 ms | 52.3% bf16 MFU | 1624846 tok/s step 17756/19560 | loss 3.331428 (+0.90z)| norm 0.2636 (-0.32z)| lr 1.35e-05 | 322.46 ms | 52.3% bf16 MFU | 1624899 tok/s step 17757/19560 | loss 3.260735 (-0.62z)| norm 0.2393 (-0.96z)| lr 1.34e-05 | 323.32 ms | 52.2% bf16 MFU | 1624734 tok/s step 17758/19560 | loss 3.366277 (+1.61z)| norm 0.2709 (-0.13z)| lr 1.34e-05 | 323.21 ms | 52.2% bf16 MFU | 1624603 tok/s step 17759/19560 | loss 3.248975 (-0.89z)| norm 0.2829 (+0.21z)| lr 1.34e-05 | 322.51 ms | 52.3% bf16 MFU | 1624655 tok/s step 17760/19560 | loss 3.205254 (-1.81z)| norm 0.2615 (-0.37z)| lr 1.34e-05 | 322.81 ms | 52.3% bf16 MFU | 1624629 tok/s step 17761/19560 | loss 3.292259 (+0.09z)| norm 0.2916 (+0.43z)| lr 1.34e-05 |
322.43 ms | 52.3% bf16 MFU | 1624699 tok/s step 17762/19560 | loss 3.342036 (+1.15z)| norm 0.2733 (-0.06z)| lr 1.34e-05 | 322.67 ms | 52.3% bf16 MFU | 1624706 tok/s step 17763/19560 | loss 3.313774 (+0.53z)| norm 0.2410 (-0.93z)| lr 1.34e-05 | 322.14 ms | 52.4% bf16 MFU | 1624845 tok/s step 17764/19560 | loss 3.311429 (+0.47z)| norm 0.2582 (-0.46z)| lr 1.33e-05 | 322.82 ms | 52.3% bf16 MFU | 1624806 tok/s step 17765/19560 | loss 3.232869 (-1.23z)| norm 0.2754 (-0.01z)| lr 1.33e-05 | 322.43 ms | 52.3% bf16 MFU | 1624868 tok/s step 17766/19560 | loss 3.283849 (-0.12z)| norm 0.2591 (-0.44z)| lr 1.33e-05 | 322.80 ms | 52.3% bf16 MFU | 1624833 tok/s step 17767/19560 | loss 3.358805 (+1.49z)| norm 0.2660 (-0.26z)| lr 1.33e-05 | 323.30 ms | 52.2% bf16 MFU | 1624675 tok/s step 17768/19560 | loss 3.324058 (+0.73z)| norm 0.2675 (-0.22z)| lr 1.33e-05 | 322.72 ms | 52.3% bf16 MFU | 1624671 tok/s step 17769/19560 | loss 3.262434 (-0.60z)| norm 0.2887 (+0.35z)| lr 1.33e-05 | 323.18 ms | 52.2% bf16 MFU | 1624551 tok/s step 17770/19560 | loss 3.267251 (-0.49z)| norm 0.2415 (-0.93z)| lr 1.33e-05 | 322.64 ms | 52.3% bf16 MFU | 1624574 tok/s step 17771/19560 | loss 3.269582 (-0.43z)| norm 0.2985 (+0.60z)| lr 1.32e-05 | 322.43 ms | 52.3% bf16 MFU | 1624647 tok/s step 17772/19560 | loss 3.263398 (-0.56z)| norm 0.2507 (-0.70z)| lr 1.32e-05 | 323.04 ms | 52.2% bf16 MFU | 1624563 tok/s step 17773/19560 | loss 3.193478 (-2.02z)| norm 0.2487 (-0.75z)| lr 1.32e-05 | 322.48 ms | 52.3% bf16 MFU | 1624626 tok/s step 17774/19560 | loss 3.322344 (+0.72z)| norm 0.2853 (+0.24z)| lr 1.32e-05 | 322.54 ms | 52.3% bf16 MFU | 1624669 tok/s step 17775/19560 | loss 3.295298 (+0.14z)| norm 0.2508 (-0.69z)| lr 1.32e-05 | 323.24 ms | 52.2% bf16 MFU | 1624534 tok/s step 17776/19560 | loss 3.313914 (+0.54z)| norm 0.3017 (+0.70z)| lr 1.32e-05 | 322.86 ms | 52.3% bf16 MFU | 1624500 tok/s step 17777/19560 | loss 3.319911 (+0.66z)| norm 0.3115 (+0.95z)| lr 1.31e-05 | 323.43 ms | 52.2% bf16 MFU | 1624326 tok/s step 
17778/19560 | loss 3.270663 (-0.39z)| norm 0.2653 (-0.32z)| lr 1.31e-05 | 322.69 ms | 52.3% bf16 MFU | 1624348 tok/s step 17779/19560 | loss 3.316943 (+0.59z)| norm 0.3692 (+2.45z)| lr 1.31e-05 | 322.63 ms | 52.3% bf16 MFU | 1624384 tok/s step 17780/19560 | loss 3.288775 (-0.01z)| norm 0.2956 (+0.47z)| lr 1.31e-05 | 322.88 ms | 52.3% bf16 MFU | 1624354 tok/s step 17781/19560 | loss 3.251430 (-0.80z)| norm 0.3140 (+0.96z)| lr 1.31e-05 | 322.73 ms | 52.3% bf16 MFU | 1624365 tok/s step 17782/19560 | loss 3.290065 (+0.03z)| norm 0.2822 (+0.11z)| lr 1.31e-05 | 323.31 ms | 52.2% bf16 MFU | 1624228 tok/s step 17783/19560 | loss 3.387444 (+2.07z)| norm 0.3212 (+1.14z)| lr 1.31e-05 | 322.52 ms | 52.3% bf16 MFU | 1624297 tok/s step 17784/19560 | loss 3.235301 (-1.14z)| norm 0.2481 (-0.80z)| lr 1.30e-05 | 322.63 ms | 52.3% bf16 MFU | 1624335 tok/s step 17785/19560 | loss 3.312220 (+0.48z)| norm 0.2672 (-0.30z)| lr 1.30e-05 | 322.32 ms | 52.4% bf16 MFU | 1624448 tok/s step 17786/19560 | loss 3.318604 (+0.68z)| norm 0.2659 (-0.33z)| lr 1.30e-05 | 323.17 ms | 52.2% bf16 MFU | 1624341 tok/s step 17787/19560 | loss 3.255044 (-0.75z)| norm 0.2514 (-0.72z)| lr 1.30e-05 | 322.91 ms | 52.3% bf16 MFU | 1624304 tok/s step 17788/19560 | loss 3.294236 (+0.13z)| norm 0.2472 (-0.82z)| lr 1.30e-05 | 322.26 ms | 52.4% bf16 MFU | 1624435 tok/s step 17789/19560 | loss 3.277074 (-0.24z)| norm 0.2733 (-0.12z)| lr 1.30e-05 | 323.31 ms | 52.2% bf16 MFU | 1624294 tok/s step 17790/19560 | loss 3.320145 (+0.72z)| norm 0.2654 (-0.33z)| lr 1.30e-05 | 322.47 ms | 52.3% bf16 MFU | 1624372 tok/s step 17791/19560 | loss 3.241089 (-1.05z)| norm 0.2330 (-1.19z)| lr 1.29e-05 | 322.69 ms | 52.3% bf16 MFU | 1624390 tok/s step 17792/19560 | loss 3.303762 (+0.35z)| norm 0.2446 (-0.86z)| lr 1.29e-05 | 322.06 ms | 52.4% bf16 MFU | 1624566 tok/s step 17793/19560 | loss 3.320782 (+0.72z)| norm 0.2497 (-0.72z)| lr 1.29e-05 | 322.62 ms | 52.3% bf16 MFU | 1624592 tok/s step 17794/19560 | loss 3.310741 (+0.50z)| norm 
0.2881 (+0.30z)| lr 1.29e-05 | 322.12 ms | 52.4% bf16 MFU | 1624742 tok/s step 17795/19560 | loss 3.322381 (+0.76z)| norm 0.3053 (+0.77z)| lr 1.29e-05 | 322.78 ms | 52.3% bf16 MFU | 1624720 tok/s step 17796/19560 | loss 3.306170 (+0.38z)| norm 0.2646 (-0.34z)| lr 1.29e-05 | 322.68 ms | 52.3% bf16 MFU | 1624723 tok/s step 17797/19560 | loss 3.260706 (-0.65z)| norm 0.2658 (-0.31z)| lr 1.29e-05 | 322.64 ms | 52.3% bf16 MFU | 1624738 tok/s step 17798/19560 | loss 3.383846 (+2.09z)| norm 0.2488 (-0.77z)| lr 1.28e-05 | 322.84 ms | 52.3% bf16 MFU | 1624699 tok/s step 17799/19560 | loss 3.284462 (-0.14z)| norm 0.2920 (+0.39z)| lr 1.28e-05 | 322.64 ms | 52.3% bf16 MFU | 1624714 tok/s step 17800/19560 | loss 3.229266 (-1.35z)| norm 0.2411 (-0.98z)| lr 1.28e-05 | 322.74 ms | 52.3% bf16 MFU | 1624703 tok/s step 17801/19560 | loss 3.307191 (+0.38z)| norm 0.3079 (+0.82z)| lr 1.28e-05 | 322.33 ms | 52.4% bf16 MFU | 1624796 tok/s step 17802/19560 | loss 3.351818 (+1.37z)| norm 0.2694 (-0.22z)| lr 1.28e-05 | 322.44 ms | 52.3% bf16 MFU | 1624855 tok/s step 17803/19560 | loss 3.338340 (+1.06z)| norm 0.2572 (-0.55z)| lr 1.28e-05 | 322.36 ms | 52.4% bf16 MFU | 1624933 tok/s step 17804/19560 | loss 3.336545 (+1.03z)| norm 0.2450 (-0.87z)| lr 1.28e-05 | 322.67 ms | 52.3% bf16 MFU | 1624929 tok/s step 17805/19560 | loss 3.297991 (+0.18z)| norm 0.2617 (-0.42z)| lr 1.27e-05 | 322.88 ms | 52.3% bf16 MFU | 1624872 tok/s step 17806/19560 | loss 3.221184 (-1.55z)| norm 0.2701 (-0.20z)| lr 1.27e-05 | 322.33 ms | 52.4% bf16 MFU | 1624956 tok/s step 17807/19560 | loss 3.340801 (+1.16z)| norm 0.2675 (-0.27z)| lr 1.27e-05 | 322.35 ms | 52.4% bf16 MFU | 1625031 tok/s step 17808/19560 | loss 3.244060 (-1.03z)| norm 0.2764 (-0.03z)| lr 1.27e-05 | 323.15 ms | 52.2% bf16 MFU | 1624901 tok/s step 17809/19560 | loss 3.298744 (+0.22z)| norm 0.2399 (-1.01z)| lr 1.27e-05 | 322.57 ms | 52.3% bf16 MFU | 1624923 tok/s step 17810/19560 | loss 3.286095 (-0.06z)| norm 0.2302 (-1.27z)| lr 1.27e-05 | 322.64 ms | 
52.3% bf16 MFU | 1624925 tok/s step 17811/19560 | loss 3.262868 (-0.59z)| norm 0.2802 (+0.08z)| lr 1.27e-05 | 322.57 ms | 52.3% bf16 MFU | 1624946 tok/s step 17812/19560 | loss 3.285469 (-0.07z)| norm 0.3443 (+1.76z)| lr 1.26e-05 | 322.66 ms | 52.3% bf16 MFU | 1624944 tok/s step 17813/19560 | loss 3.322846 (+0.77z)| norm 0.3356 (+1.57z)| lr 1.26e-05 | 322.67 ms | 52.3% bf16 MFU | 1624938 tok/s step 17814/19560 | loss 3.347728 (+1.32z)| norm 0.2416 (-0.98z)| lr 1.26e-05 | 322.53 ms | 52.3% bf16 MFU | 1624967 tok/s step 17815/19560 | loss 3.287562 (-0.04z)| norm 0.2662 (-0.31z)| lr 1.26e-05 | 322.59 ms | 52.3% bf16 MFU | 1624981 tok/s step 17816/19560 | loss 3.326620 (+0.84z)| norm 0.3361 (+1.57z)| lr 1.26e-05 | 322.74 ms | 52.3% bf16 MFU | 1624955 tok/s step 17817/19560 | loss 3.246335 (-0.99z)| norm 0.3061 (+0.77z)| lr 1.26e-05 | 322.77 ms | 52.3% bf16 MFU | 1624925 tok/s step 17818/19560 | loss 3.255397 (-0.79z)| norm 0.2777 (-0.00z)| lr 1.26e-05 | 322.55 ms | 52.3% bf16 MFU | 1624952 tok/s step 17819/19560 | loss 3.261069 (-0.65z)| norm 0.2561 (-0.59z)| lr 1.25e-05 | 322.44 ms | 52.3% bf16 MFU | 1625005 tok/s step 17820/19560 | loss 3.270202 (-0.44z)| norm 0.2411 (-0.99z)| lr 1.25e-05 | 322.70 ms | 52.3% bf16 MFU | 1624988 tok/s step 17821/19560 | loss 3.333224 (+0.99z)| norm 0.2480 (-0.79z)| lr 1.25e-05 | 322.54 ms | 52.3% bf16 MFU | 1625013 tok/s step 17822/19560 | loss 3.263144 (-0.59z)| norm 0.2789 (+0.05z)| lr 1.25e-05 | 322.86 ms | 52.3% bf16 MFU | 1624957 tok/s step 17823/19560 | loss 3.297463 (+0.18z)| norm 0.2430 (-0.91z)| lr 1.25e-05 | 323.05 ms | 52.2% bf16 MFU | 1624856 tok/s step 17824/19560 | loss 3.471181 (+3.86z)| norm 0.2546 (-0.60z)| lr 1.25e-05 | 322.88 ms | 52.3% bf16 MFU | 1624802 tok/s step 17825/19560 | loss 3.263049 (-0.59z)| norm 0.2522 (-0.66z)| lr 1.25e-05 | 322.49 ms | 52.3% bf16 MFU | 1624850 tok/s step 17826/19560 | loss 3.289001 (-0.03z)| norm 0.2579 (-0.50z)| lr 1.24e-05 | 322.76 ms | 52.3% bf16 MFU | 1624827 tok/s step 17827/19560 
| loss 3.259204 (-0.67z)| norm 0.2583 (-0.49z)| lr 1.24e-05 | 322.76 ms | 52.3% bf16 MFU | 1624804 tok/s step 17828/19560 | loss 3.261366 (-0.61z)| norm 0.2896 (+0.38z)| lr 1.24e-05 | 322.50 ms | 52.3% bf16 MFU | 1624849 tok/s step 17829/19560 | loss 3.330281 (+0.86z)| norm 0.2630 (-0.36z)| lr 1.24e-05 | 322.49 ms | 52.3% bf16 MFU | 1624894 tok/s step 17830/19560 | loss 3.317010 (+0.57z)| norm 0.2467 (-0.81z)| lr 1.24e-05 | 322.43 ms | 52.3% bf16 MFU | 1624952 tok/s step 17831/19560 | loss 3.356102 (+1.39z)| norm 0.2809 (+0.13z)| lr 1.24e-05 | 322.81 ms | 52.3% bf16 MFU | 1624910 tok/s step 17832/19560 | loss 3.321460 (+0.66z)| norm 0.3194 (+1.20z)| lr 1.24e-05 | 322.70 ms | 52.3% bf16 MFU | 1624900 tok/s step 17833/19560 | loss 3.328651 (+0.80z)| norm 0.2632 (-0.37z)| lr 1.23e-05 | 322.35 ms | 52.4% bf16 MFU | 1624976 tok/s step 17834/19560 | loss 3.348988 (+1.21z)| norm 0.2697 (-0.20z)| lr 1.23e-05 | 322.36 ms | 52.4% bf16 MFU | 1625048 tok/s step 17835/19560 | loss 3.301910 (+0.22z)| norm 0.2392 (-1.03z)| lr 1.23e-05 | 322.85 ms | 52.3% bf16 MFU | 1624992 tok/s step 17836/19560 | loss 3.342710 (+1.06z)| norm 0.3394 (+1.73z)| lr 1.23e-05 | 322.21 ms | 52.4% bf16 MFU | 1625102 tok/s step 17837/19560 | loss 3.263385 (-0.62z)| norm 0.3127 (+0.98z)| lr 1.23e-05 | 322.66 ms | 52.3% bf16 MFU | 1625090 tok/s step 17838/19560 | loss 3.302092 (+0.19z)| norm 0.2973 (+0.55z)| lr 1.23e-05 | 322.62 ms | 52.3% bf16 MFU | 1625092 tok/s step 17839/19560 | loss 3.324536 (+0.66z)| norm 0.2649 (-0.34z)| lr 1.23e-05 | 322.93 ms | 52.3% bf16 MFU | 1625014 tok/s step 17840/19560 | loss 3.323946 (+0.63z)| norm 0.3478 (+1.90z)| lr 1.22e-05 | 322.50 ms | 52.3% bf16 MFU | 1625048 tok/s step 17841/19560 | loss 3.285742 (-0.21z)| norm 0.3203 (+1.14z)| lr 1.22e-05 | 322.47 ms | 52.3% bf16 MFU | 1625089 tok/s step 17842/19560 | loss 3.271380 (-0.52z)| norm 0.2380 (-1.10z)| lr 1.22e-05 | 322.61 ms | 52.3% bf16 MFU | 1625092 tok/s step 17843/19560 | loss 3.254422 (-0.88z)| norm 0.2350 (-1.17z)| 
lr 1.22e-05 | 322.71 ms | 52.3% bf16 MFU | 1625070 tok/s step 17844/19560 | loss 3.266479 (-0.62z)| norm 0.2650 (-0.36z)| lr 1.22e-05 | 322.34 ms | 52.4% bf16 MFU | 1625141 tok/s step 17845/19560 | loss 3.297091 (+0.04z)| norm 0.3118 (+0.89z)| lr 1.22e-05 | 322.95 ms | 52.3% bf16 MFU | 1625055 tok/s step 17846/19560 | loss 3.319302 (+0.53z)| norm 0.2675 (-0.31z)| lr 1.22e-05 | 322.81 ms | 52.3% bf16 MFU | 1625008 tok/s step 17847/19560 | loss 3.347338 (+1.14z)| norm 0.2401 (-1.06z)| lr 1.21e-05 | 322.46 ms | 52.3% bf16 MFU | 1625054 tok/s step 17848/19560 | loss 3.349762 (+1.26z)| norm 0.3053 (+0.70z)| lr 1.21e-05 | 322.61 ms | 52.3% bf16 MFU | 1625058 tok/s step 17849/19560 | loss 3.313068 (+0.40z)| norm 0.2430 (-0.98z)| lr 1.21e-05 | 323.16 ms | 52.2% bf16 MFU | 1624923 tok/s step 17850/19560 | loss 3.204719 (-2.08z)| norm 0.2435 (-0.95z)| lr 1.21e-05 | 322.91 ms | 52.3% bf16 MFU | 1624860 tok/s step 17851/19560 | loss 3.276665 (-0.43z)| norm 0.2525 (-0.71z)| lr 1.21e-05 | 322.57 ms | 52.3% bf16 MFU | 1624883 tok/s step 17852/19560 | loss 3.270739 (-0.58z)| norm 0.2440 (-0.93z)| lr 1.21e-05 | 322.73 ms | 52.3% bf16 MFU | 1624866 tok/s step 17853/19560 | loss 3.311086 (+0.35z)| norm 0.2547 (-0.63z)| lr 1.21e-05 | 322.80 ms | 52.3% bf16 MFU | 1624832 tok/s step 17854/19560 | loss 3.255621 (-0.93z)| norm 0.2523 (-0.70z)| lr 1.20e-05 | 323.17 ms | 52.2% bf16 MFU | 1624706 tok/s step 17855/19560 | loss 3.344196 (+1.10z)| norm 0.2670 (-0.30z)| lr 1.20e-05 | 322.63 ms | 52.3% bf16 MFU | 1624724 tok/s step 17856/19560 | loss 3.347691 (+1.16z)| norm 0.2708 (-0.19z)| lr 1.20e-05 | 322.50 ms | 52.3% bf16 MFU | 1624773 tok/s step 17857/19560 | loss 3.294809 (-0.05z)| norm 0.2357 (-1.12z)| lr 1.20e-05 | 322.63 ms | 52.3% bf16 MFU | 1624787 tok/s step 17858/19560 | loss 3.265705 (-0.71z)| norm 0.2697 (-0.21z)| lr 1.20e-05 | 323.36 ms | 52.2% bf16 MFU | 1624618 tok/s step 17859/19560 | loss 3.281894 (-0.36z)| norm 0.2770 (-0.02z)| lr 1.20e-05 | 322.62 ms | 52.3% bf16 MFU | 
1624643 tok/s step 17860/19560 | loss 3.315075 (+0.40z)| norm 0.2648 (-0.35z)| lr 1.20e-05 | 323.00 ms | 52.3% bf16 MFU | 1624569 tok/s step 17861/19560 | loss 3.274794 (-0.54z)| norm 0.2645 (-0.35z)| lr 1.19e-05 | 322.38 ms | 52.4% bf16 MFU | 1624655 tok/s step 17862/19560 | loss 3.340943 (+0.99z)| norm 0.3064 (+0.77z)| lr 1.19e-05 | 322.66 ms | 52.3% bf16 MFU | 1624668 tok/s step 17863/19560 | loss 3.419839 (+2.89z)| norm 0.2783 (+0.05z)| lr 1.19e-05 | 323.13 ms | 52.2% bf16 MFU | 1624560 tok/s step 17864/19560 | loss 3.302798 (+0.10z)| norm 0.3008 (+0.69z)| lr 1.19e-05 | 322.09 ms | 52.4% bf16 MFU | 1624720 tok/s step 17865/19560 | loss 3.278864 (-0.47z)| norm 0.2908 (+0.40z)| lr 1.19e-05 | 322.89 ms | 52.3% bf16 MFU | 1624672 tok/s step 17866/19560 | loss 3.295128 (-0.08z)| norm 0.2775 (+0.02z)| lr 1.19e-05 | 322.52 ms | 52.3% bf16 MFU | 1624718 tok/s step 17867/19560 | loss 3.379851 (+1.93z)| norm 0.2575 (-0.56z)| lr 1.19e-05 | 322.78 ms | 52.3% bf16 MFU | 1624698 tok/s step 17868/19560 | loss 3.302192 (+0.08z)| norm 0.2991 (+0.64z)| lr 1.19e-05 | 322.40 ms | 52.3% bf16 MFU | 1624772 tok/s step 17869/19560 | loss 3.248747 (-1.17z)| norm 0.2709 (-0.16z)| lr 1.18e-05 | 322.63 ms | 52.3% bf16 MFU | 1624785 tok/s step 17870/19560 | loss 3.299838 (+0.04z)| norm 0.2489 (-0.88z)| lr 1.18e-05 | 322.77 ms | 52.3% bf16 MFU | 1624764 tok/s step 17871/19560 | loss 3.264743 (-0.78z)| norm 0.2621 (-0.43z)| lr 1.18e-05 | 322.79 ms | 52.3% bf16 MFU | 1624737 tok/s step 17872/19560 | loss 3.197434 (-2.32z)| norm 0.2651 (-0.31z)| lr 1.18e-05 | 322.69 ms | 52.3% bf16 MFU | 1624736 tok/s step 17873/19560 | loss 3.323654 (+0.61z)| norm 0.3075 (+1.14z)| lr 1.18e-05 | 322.52 ms | 52.3% bf16 MFU | 1624780 tok/s step 17874/19560 | loss 3.358548 (+1.40z)| norm 0.2794 (+0.17z)| lr 1.18e-05 | 322.62 ms | 52.3% bf16 MFU | 1624794 tok/s step 17875/19560 | loss 3.268689 (-0.67z)| norm 0.2389 (-1.20z)| lr 1.18e-05 | 322.75 ms | 52.3% bf16 MFU | 1624778 tok/s step 17876/19560 | loss 3.324906 
(+0.62z)| norm 0.2392 (-1.18z)| lr 1.17e-05 | 322.96 ms | 52.3% bf16 MFU | 1624707 tok/s step 17877/19560 | loss 3.330526 (+0.73z)| norm 0.2533 (-0.68z)| lr 1.17e-05 | 322.72 ms | 52.3% bf16 MFU | 1624701 tok/s step 17878/19560 | loss 3.255379 (-1.00z)| norm 0.2649 (-0.28z)| lr 1.17e-05 | 322.60 ms | 52.3% bf16 MFU | 1624727 tok/s step 17879/19560 | loss 3.320633 (+0.52z)| norm 0.2490 (-0.82z)| lr 1.17e-05 | 322.71 ms | 52.3% bf16 MFU | 1624722 tok/s step 17880/19560 | loss 3.288848 (-0.22z)| norm 0.2504 (-0.76z)| lr 1.17e-05 | 322.70 ms | 52.3% bf16 MFU | 1624719 tok/s step 17881/19560 | loss 3.293031 (-0.11z)| norm 0.2331 (-1.37z)| lr 1.17e-05 | 322.89 ms | 52.3% bf16 MFU | 1624669 tok/s step 17882/19560 | loss 3.366637 (+1.56z)| norm 0.2536 (-0.63z)| lr 1.17e-05 | 322.70 ms | 52.3% bf16 MFU | 1624669 tok/s step 17883/19560 | loss 3.354517 (+1.26z)| norm 0.2898 (+0.68z)| lr 1.16e-05 | 323.22 ms | 52.2% bf16 MFU | 1624540 tok/s step 17884/19560 | loss 3.307335 (+0.19z)| norm 0.2435 (-0.99z)| lr 1.16e-05 | 322.67 ms | 52.3% bf16 MFU | 1624555 tok/s step 17885/19560 | loss 3.235213 (-1.45z)| norm 0.2728 (+0.06z)| lr 1.16e-05 | 322.78 ms | 52.3% bf16 MFU | 1624540 tok/s step 17886/19560 | loss 3.350224 (+1.18z)| norm 0.2827 (+0.41z)| lr 1.16e-05 | 323.23 ms | 52.2% bf16 MFU | 1624414 tok/s step 17887/19560 | loss 3.308678 (+0.22z)| norm 0.2683 (-0.11z)| lr 1.16e-05 | 322.51 ms | 52.3% bf16 MFU | 1624477 tok/s step 17888/19560 | loss 3.344572 (+1.03z)| norm 0.2489 (-0.80z)| lr 1.16e-05 | 322.94 ms | 52.3% bf16 MFU | 1624426 tok/s step 17889/19560 | loss 3.316770 (+0.38z)| norm 0.3678 (+3.33z)| lr 1.16e-05 | 322.79 ms | 52.3% bf16 MFU | 1624417 tok/s step 17890/19560 | loss 3.445953 (+3.24z)| norm 0.2521 (-0.67z)| lr 1.15e-05 | 322.74 ms | 52.3% bf16 MFU | 1624421 tok/s step 17891/19560 | loss 3.292171 (-0.20z)| norm 0.2406 (-1.07z)| lr 1.15e-05 | 322.53 ms | 52.3% bf16 MFU | 1624477 tok/s step 17892/19560 | loss 3.326891 (+0.57z)| norm 0.2526 (-0.65z)| lr 1.15e-05 | 
323.06 ms | 52.2% bf16 MFU | 1624397 tok/s step 17893/19560 | loss 3.273166 (-0.64z)| norm 0.2809 (+0.33z)| lr 1.15e-05 | 322.53 ms | 52.3% bf16 MFU | 1624453 tok/s step 17894/19560 | loss 3.359850 (+1.29z)| norm 0.2405 (-1.06z)| lr 1.15e-05 | 322.49 ms | 52.3% bf16 MFU | 1624519 tok/s step 17895/19560 | loss 3.272506 (-0.65z)| norm 0.2364 (-1.19z)| lr 1.15e-05 | 323.04 ms | 52.2% bf16 MFU | 1624441 tok/s step 17896/19560 | loss 3.330136 (+0.64z)| norm 0.2529 (-0.62z)| lr 1.15e-05 | 322.44 ms | 52.3% bf16 MFU | 1624518 tok/s step 17897/19560 | loss 3.335424 (+0.75z)| norm 0.2444 (-0.90z)| lr 1.15e-05 | 323.13 ms | 52.2% bf16 MFU | 1624419 tok/s step 17898/19560 | loss 3.283427 (-0.43z)| norm 0.2546 (-0.55z)| lr 1.14e-05 | 322.97 ms | 52.3% bf16 MFU | 1624364 tok/s step 17899/19560 | loss 3.242138 (-1.35z)| norm 0.2596 (-0.37z)| lr 1.14e-05 | 322.75 ms | 52.3% bf16 MFU | 1624368 tok/s step 17900/19560 | loss 3.357640 (+1.23z)| norm 0.2407 (-1.02z)| lr 1.14e-05 | 322.86 ms | 52.3% bf16 MFU | 1624343 tok/s step 17901/19560 | loss 3.269984 (-0.77z)| norm 0.2321 (-1.30z)| lr 1.14e-05 | 322.64 ms | 52.3% bf16 MFU | 1624375 tok/s step 17902/19560 | loss 3.358168 (+1.24z)| norm 0.2906 (+0.69z)| lr 1.14e-05 | 322.97 ms | 52.3% bf16 MFU | 1624322 tok/s step 17903/19560 | loss 3.369128 (+1.47z)| norm 0.3796 (+3.51z)| lr 1.14e-05 | 323.09 ms | 52.2% bf16 MFU | 1624244 tok/s step 17904/19560 | loss 3.338104 (+0.76z)| norm 0.2544 (-0.54z)| lr 1.14e-05 | 322.48 ms | 52.3% bf16 MFU | 1624322 tok/s step 17905/19560 | loss 3.254267 (-1.11z)| norm 0.2942 (+0.76z)| lr 1.13e-05 | 323.13 ms | 52.2% bf16 MFU | 1624232 tok/s step 17906/19560 | loss 3.285481 (-0.42z)| norm 0.2674 (-0.11z)| lr 1.13e-05 | 322.78 ms | 52.3% bf16 MFU | 1624236 tok/s step 17907/19560 | loss 3.257020 (-1.04z)| norm 0.2573 (-0.43z)| lr 1.13e-05 | 323.16 ms | 52.2% bf16 MFU | 1624143 tok/s step 17908/19560 | loss 3.220546 (-1.82z)| norm 0.2509 (-0.64z)| lr 1.13e-05 | 322.76 ms | 52.3% bf16 MFU | 1624156 tok/s step 
17909/19560 | loss 3.393065 (+1.95z)| norm 0.3043 (+1.19z)| lr 1.13e-05 | 322.41 ms | 52.3% bf16 MFU | 1624256 tok/s step 17910/19560 | loss 3.336737 (+0.70z)| norm 0.2493 (-0.68z)| lr 1.13e-05 | 322.28 ms | 52.4% bf16 MFU | 1624383 tok/s step 17911/19560 | loss 3.335347 (+0.69z)| norm 0.2368 (-1.10z)| lr 1.13e-05 | 322.34 ms | 52.4% bf16 MFU | 1624490 tok/s step 17912/19560 | loss 3.282847 (-0.48z)| norm 0.2576 (-0.38z)| lr 1.12e-05 | 322.68 ms | 52.3% bf16 MFU | 1624504 tok/s step 17913/19560 | loss 3.316552 (+0.27z)| norm 0.3068 (+1.29z)| lr 1.12e-05 | 322.28 ms | 52.4% bf16 MFU | 1624618 tok/s step 17914/19560 | loss 3.245847 (-1.29z)| norm 0.2620 (-0.24z)| lr 1.12e-05 | 323.00 ms | 52.3% bf16 MFU | 1624546 tok/s step 17915/19560 | loss 3.286084 (-0.40z)| norm 0.3002 (+1.05z)| lr 1.12e-05 | 323.04 ms | 52.2% bf16 MFU | 1624468 tok/s step 17916/19560 | loss 3.372649 (+1.50z)| norm 0.3413 (+2.38z)| lr 1.12e-05 | 322.68 ms | 52.3% bf16 MFU | 1624483 tok/s step 17917/19560 | loss 3.251466 (-1.17z)| norm 0.2440 (-0.86z)| lr 1.12e-05 | 322.23 ms | 52.4% bf16 MFU | 1624613 tok/s step 17918/19560 | loss 3.276093 (-0.62z)| norm 0.2441 (-0.85z)| lr 1.12e-05 | 323.46 ms | 52.2% bf16 MFU | 1624427 tok/s step 17919/19560 | loss 3.265024 (-0.87z)| norm 0.2453 (-0.82z)| lr 1.12e-05 | 322.20 ms | 52.4% bf16 MFU | 1624565 tok/s step 17920/19560 | loss 3.304930 (+0.01z)| norm 0.2545 (-0.51z)| lr 1.11e-05 | 323.28 ms | 52.2% bf16 MFU | 1624425 tok/s step 17921/19560 | loss 3.230608 (-1.60z)| norm 0.2455 (-0.81z)| lr 1.11e-05 | 322.18 ms | 52.4% bf16 MFU | 1624569 tok/s step 17922/19560 | loss 3.449048 (+3.03z)| norm 0.2996 (+0.99z)| lr 1.11e-05 | 323.49 ms | 52.2% bf16 MFU | 1624378 tok/s step 17923/19560 | loss 3.326334 (+0.45z)| norm 0.2458 (-0.79z)| lr 1.11e-05 | 323.01 ms | 52.2% bf16 MFU | 1624314 tok/s step 17924/19560 | loss 3.230196 (-1.54z)| norm 0.2472 (-0.74z)| lr 1.11e-05 | 322.82 ms | 52.3% bf16 MFU | 1624302 tok/s step 17925/19560 | loss 3.325364 (+0.43z)| norm 
0.2603 (-0.30z)| lr 1.11e-05 | 322.99 ms | 52.3% bf16 MFU | 1624248 tok/s step 17926/19560 | loss 3.281018 (-0.48z)| norm 0.2617 (-0.26z)| lr 1.11e-05 | 322.83 ms | 52.3% bf16 MFU | 1624238 tok/s step 17927/19560 | loss 3.358731 (+1.14z)| norm 0.2421 (-0.90z)| lr 1.10e-05 | 322.81 ms | 52.3% bf16 MFU | 1624234 tok/s step 17928/19560 | loss 3.325657 (+0.43z)| norm 0.2422 (-0.90z)| lr 1.10e-05 | 323.00 ms | 52.3% bf16 MFU | 1624181 tok/s step 17929/19560 | loss 3.277217 (-0.59z)| norm 0.3385 (+2.28z)| lr 1.10e-05 | 322.78 ms | 52.3% bf16 MFU | 1624186 tok/s step 17930/19560 | loss 3.394292 (+1.86z)| norm 0.2807 (+0.38z)| lr 1.10e-05 | 322.75 ms | 52.3% bf16 MFU | 1624198 tok/s step 17931/19560 | loss 3.322435 (+0.36z)| norm 0.2470 (-0.73z)| lr 1.10e-05 | 322.59 ms | 52.3% bf16 MFU | 1624252 tok/s step 17932/19560 | loss 3.310272 (+0.11z)| norm 0.2580 (-0.37z)| lr 1.10e-05 | 323.30 ms | 52.2% bf16 MFU | 1624122 tok/s step 17933/19560 | loss 3.341747 (+0.76z)| norm 0.2478 (-0.70z)| lr 1.10e-05 | 322.89 ms | 52.3% bf16 MFU | 1624103 tok/s step 17934/19560 | loss 3.304604 (-0.03z)| norm 0.2474 (-0.71z)| lr 1.10e-05 | 323.26 ms | 52.2% bf16 MFU | 1623991 tok/s step 17935/19560 | loss 3.285777 (-0.42z)| norm 0.2840 (+0.49z)| lr 1.09e-05 | 323.01 ms | 52.2% bf16 MFU | 1623947 tok/s step 17936/19560 | loss 3.350989 (+0.95z)| norm 0.2709 (+0.06z)| lr 1.09e-05 | 323.29 ms | 52.2% bf16 MFU | 1623835 tok/s step 17937/19560 | loss 3.312602 (+0.13z)| norm 0.2629 (-0.21z)| lr 1.09e-05 | 322.90 ms | 52.3% bf16 MFU | 1623827 tok/s step 17938/19560 | loss 3.293305 (-0.28z)| norm 0.3123 (+1.39z)| lr 1.09e-05 | 322.79 ms | 52.3% bf16 MFU | 1623846 tok/s step 17939/19560 | loss 3.324832 (+0.38z)| norm 0.2828 (+0.42z)| lr 1.09e-05 | 322.75 ms | 52.3% bf16 MFU | 1623876 tok/s step 17940/19560 | loss 3.298346 (-0.19z)| norm 0.2373 (-1.07z)| lr 1.09e-05 | 322.72 ms | 52.3% bf16 MFU | 1623911 tok/s step 17941/19560 | loss 3.331439 (+0.52z)| norm 0.3429 (+2.45z)| lr 1.09e-05 | 323.34 ms | 
52.2% bf16 MFU | 1623790 tok/s step 17942/19560 | loss 3.327847 (+0.44z)| norm 0.3352 (+2.14z)| lr 1.08e-05 | 322.78 ms | 52.3% bf16 MFU | 1623815 tok/s step 17943/19560 | loss 3.253392 (-1.14z)| norm 0.2513 (-0.61z)| lr 1.08e-05 | 322.76 ms | 52.3% bf16 MFU | 1623844 tok/s step 17944/19560 | loss 3.302165 (-0.10z)| norm 0.2418 (-0.91z)| lr 1.08e-05 | 322.53 ms | 52.3% bf16 MFU | 1623929 tok/s step 17945/19560 | loss 3.249317 (-1.23z)| norm 0.2384 (-1.01z)| lr 1.08e-05 | 323.13 ms | 52.2% bf16 MFU | 1623859 tok/s step 17946/19560 | loss 3.286573 (-0.44z)| norm 0.3596 (+2.92z)| lr 1.08e-05 | 322.75 ms | 52.3% bf16 MFU | 1623887 tok/s step 17947/19560 | loss 3.305269 (-0.04z)| norm 0.2726 (+0.11z)| lr 1.08e-05 | 323.25 ms | 52.2% bf16 MFU | 1623790 tok/s step 17948/19560 | loss 3.267387 (-0.86z)| norm 0.2339 (-1.14z)| lr 1.08e-05 | 323.62 ms | 52.2% bf16 MFU | 1623603 tok/s step 17949/19560 | loss 3.296976 (-0.22z)| norm 0.2463 (-0.74z)| lr 1.08e-05 | 322.36 ms | 52.4% bf16 MFU | 1623743 tok/s step 17950/19560 | loss 3.317791 (+0.22z)| norm 0.2500 (-0.61z)| lr 1.07e-05 | 322.87 ms | 52.3% bf16 MFU | 1623748 tok/s step 17951/19560 | loss 3.464935 (+3.24z)| norm 0.2636 (-0.18z)| lr 1.07e-05 | 323.26 ms | 52.2% bf16 MFU | 1623654 tok/s step 17952/19560 | loss 3.282392 (-0.54z)| norm 0.2644 (-0.16z)| lr 1.07e-05 | 322.80 ms | 52.3% bf16 MFU | 1623681 tok/s step 17953/19560 | loss 3.227130 (-1.72z)| norm 0.2485 (-0.67z)| lr 1.07e-05 | 323.00 ms | 52.3% bf16 MFU | 1623656 tok/s step 17954/19560 | loss 3.333916 (+0.57z)| norm 0.2858 (+0.53z)| lr 1.07e-05 | 322.56 ms | 52.3% bf16 MFU | 1623743 tok/s step 17955/19560 | loss 3.323606 (+0.34z)| norm 0.2478 (-0.69z)| lr 1.07e-05 | 323.30 ms | 52.2% bf16 MFU | 1623640 tok/s step 17956/19560 | loss 3.286036 (-0.48z)| norm 0.2353 (-1.08z)| lr 1.07e-05 | 322.75 ms | 52.3% bf16 MFU | 1623680 tok/s step 17957/19560 | loss 3.270811 (-0.79z)| norm 0.4177 (+4.37z)| lr 1.06e-05 | 322.78 ms | 52.3% bf16 MFU | 1623710 tok/s step 17958/19560 
| loss 3.247013 (-1.29z)| norm 0.3072 (+1.08z)| lr 1.06e-05 | 323.09 ms | 52.2% bf16 MFU | 1623662 tok/s step 17959/19560 | loss 3.293774 (-0.27z)| norm 0.2847 (+0.41z)| lr 1.06e-05 | 322.78 ms | 52.3% bf16 MFU | 1623692 tok/s step 17960/19560 | loss 3.250335 (-1.19z)| norm 0.2759 (+0.17z)| lr 1.06e-05 | 322.68 ms | 52.3% bf16 MFU | 1623748 tok/s step 17961/19560 | loss 3.285566 (-0.43z)| norm 0.3187 (+1.41z)| lr 1.06e-05 | 322.88 ms | 52.3% bf16 MFU | 1623751 tok/s step 17962/19560 | loss 3.214221 (-1.92z)| norm 0.3263 (+1.61z)| lr 1.06e-05 | 323.25 ms | 52.2% bf16 MFU | 1623658 tok/s step 17963/19560 | loss 3.309553 (+0.10z)| norm 0.2681 (-0.10z)| lr 1.06e-05 | 322.46 ms | 52.3% bf16 MFU | 1623769 tok/s step 17964/19560 | loss 3.237374 (-1.40z)| norm 0.2661 (-0.14z)| lr 1.06e-05 | 322.73 ms | 52.3% bf16 MFU | 1623808 tok/s step 17965/19560 | loss 3.295509 (-0.18z)| norm 0.4014 (+3.68z)| lr 1.05e-05 | 323.26 ms | 52.2% bf16 MFU | 1623713 tok/s step 17966/19560 | loss 3.298268 (-0.12z)| norm 0.2589 (-0.35z)| lr 1.05e-05 | 323.01 ms | 52.2% bf16 MFU | 1623684 tok/s step 17967/19560 | loss 3.253103 (-1.06z)| norm 0.2550 (-0.46z)| lr 1.05e-05 | 322.90 ms | 52.3% bf16 MFU | 1623683 tok/s step 17968/19560 | loss 3.260074 (-0.90z)| norm 0.2758 (+0.15z)| lr 1.05e-05 | 322.67 ms | 52.3% bf16 MFU | 1623741 tok/s step 17969/19560 | loss 3.263219 (-0.83z)| norm 0.2743 (+0.12z)| lr 1.05e-05 | 323.34 ms | 52.2% bf16 MFU | 1623628 tok/s step 17970/19560 | loss 3.259051 (-0.91z)| norm 0.2797 (+0.27z)| lr 1.05e-05 | 322.08 ms | 52.4% bf16 MFU | 1623837 tok/s step 17971/19560 | loss 3.318632 (+0.32z)| norm 0.2364 (-1.00z)| lr 1.05e-05 | 323.12 ms | 52.2% bf16 MFU | 1623774 tok/s step 17972/19560 | loss 3.275654 (-0.58z)| norm 0.2467 (-0.69z)| lr 1.04e-05 | 323.10 ms | 52.2% bf16 MFU | 1623719 tok/s step 17973/19560 | loss 3.365904 (+1.30z)| norm 0.2579 (-0.36z)| lr 1.04e-05 | 322.94 ms | 52.3% bf16 MFU | 1623708 tok/s step 17974/19560 | loss 3.276387 (-0.57z)| norm 0.2959 (+0.75z)| 
lr 1.04e-05 | 323.24 ms | 52.2% bf16 MFU | 1623621 tok/s step 17975/19560 | loss 3.323956 (+0.43z)| norm 0.2469 (-0.68z)| lr 1.04e-05 | 322.92 ms | 52.3% bf16 MFU | 1623618 tok/s step 17976/19560 | loss 3.274805 (-0.59z)| norm 0.2666 (-0.10z)| lr 1.04e-05 | 322.40 ms | 52.3% bf16 MFU | 1623747 tok/s step 17977/19560 | loss 3.238397 (-1.33z)| norm 0.2563 (-0.41z)| lr 1.04e-05 | 323.52 ms | 52.2% bf16 MFU | 1623589 tok/s step 17978/19560 | loss 3.301306 (-0.03z)| norm 0.2351 (-1.03z)| lr 1.04e-05 | 323.41 ms | 52.2% bf16 MFU | 1623465 tok/s step 17979/19560 | loss 3.312302 (+0.19z)| norm 0.2423 (-0.81z)| lr 1.04e-05 | 322.96 ms | 52.3% bf16 MFU | 1623462 tok/s step 17980/19560 | loss 3.303219 (-0.00z)| norm 0.2789 (+0.26z)| lr 1.03e-05 | 323.18 ms | 52.2% bf16 MFU | 1623403 tok/s step 17981/19560 | loss 3.235128 (-1.43z)| norm 0.2381 (-0.94z)| lr 1.03e-05 | 322.65 ms | 52.3% bf16 MFU | 1623479 tok/s step 17982/19560 | loss 3.276394 (-0.56z)| norm 0.2659 (-0.13z)| lr 1.03e-05 | 323.69 ms | 52.1% bf16 MFU | 1623290 tok/s step 17983/19560 | loss 3.304863 (+0.05z)| norm 0.2956 (+0.74z)| lr 1.03e-05 | 323.19 ms | 52.2% bf16 MFU | 1623237 tok/s step 17984/19560 | loss 3.303125 (+0.02z)| norm 0.2387 (-0.92z)| lr 1.03e-05 | 323.76 ms | 52.1% bf16 MFU | 1623043 tok/s step 17985/19560 | loss 3.262596 (-0.84z)| norm 0.2391 (-0.91z)| lr 1.03e-05 | 323.09 ms | 52.2% bf16 MFU | 1623027 tok/s step 17986/19560 | loss 3.252680 (-1.04z)| norm 0.2335 (-1.06z)| lr 1.03e-05 | 322.53 ms | 52.3% bf16 MFU | 1623154 tok/s step 17987/19560 | loss 3.300690 (-0.03z)| norm 0.2401 (-0.86z)| lr 1.03e-05 | 323.49 ms | 52.2% bf16 MFU | 1623032 tok/s step 17988/19560 | loss 3.278435 (-0.50z)| norm 0.2407 (-0.83z)| lr 1.02e-05 | 322.74 ms | 52.3% bf16 MFU | 1623105 tok/s step 17989/19560 | loss 3.273777 (-0.59z)| norm 0.2416 (-0.80z)| lr 1.02e-05 | 322.37 ms | 52.4% bf16 MFU | 1623268 tok/s step 17990/19560 | loss 3.333389 (+0.67z)| norm 0.2571 (-0.34z)| lr 1.02e-05 | 322.71 ms | 52.3% bf16 MFU | 
1623336 tok/s step 17991/19560 | loss 3.240731 (-1.29z)| norm 0.2380 (-0.88z)| lr 1.02e-05 | 322.58 ms | 52.3% bf16 MFU | 1623435 tok/s step 17992/19560 | loss 3.320119 (+0.42z)| norm 0.3035 (+1.01z)| lr 1.02e-05 | 322.84 ms | 52.3% bf16 MFU | 1623463 tok/s step 17993/19560 | loss 3.303632 (+0.06z)| norm 0.2688 (+0.01z)| lr 1.02e-05 | 323.31 ms | 52.2% bf16 MFU | 1623372 tok/s step 17994/19560 | loss 3.283492 (-0.37z)| norm 0.2552 (-0.38z)| lr 1.02e-05 | 322.17 ms | 52.4% bf16 MFU | 1623572 tok/s step 17995/19560 | loss 3.318911 (+0.41z)| norm 0.3328 (+1.83z)| lr 1.01e-05 | 322.56 ms | 52.3% bf16 MFU | 1623662 tok/s step 17996/19560 | loss 3.327309 (+0.59z)| norm 0.2424 (-0.74z)| lr 1.01e-05 | 323.19 ms | 52.2% bf16 MFU | 1623590 tok/s step 17997/19560 | loss 3.294260 (-0.14z)| norm 0.2474 (-0.59z)| lr 1.01e-05 | 323.10 ms | 52.2% bf16 MFU | 1623545 tok/s step 17998/19560 | loss 3.239035 (-1.33z)| norm 0.2485 (-0.56z)| lr 1.01e-05 | 322.48 ms | 52.3% bf16 MFU | 1623658 tok/s step 17999/19560 | loss 3.321126 (+0.45z)| norm 0.2726 (+0.12z)| lr 1.01e-05 | 322.95 ms | 52.3% bf16 MFU | 1623646 tok/s step 18000/19560 | loss 3.277573 (-0.53z)| norm 0.2398 (-0.80z)| lr 1.01e-05 | 322.88 ms | 52.3% bf16 MFU | 1623654 tok/s val loss 3.276819 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating HellaSwag: 70/79 HellaSwag: 3058/10042 = 0.304521 step 18001/19560 | loss 3.250238 (-1.12z)| norm 0.3496 (+2.28z)| lr 1.01e-05 | 324.43 ms | 52.0% bf16 MFU | 1623271 tok/s step 18002/19560 | loss 3.299793 (-0.01z)| norm 0.2352 (-0.92z)| lr 1.01e-05 | 322.62 ms | 52.3% bf16 MFU | 1623363 tok/s step 18003/19560 | loss 3.260585 (-0.88z)| norm 0.2548 (-0.37z)| lr 1.00e-05 | 323.71 ms | 52.1% bf16 MFU | 1623176 tok/s step 18004/19560 | loss 3.242704 (-1.26z)| norm 0.3200 (+1.42z)| lr 1.00e-05 | 323.29 ms | 52.2% bf16 MFU | 1623104 tok/s 
step 18005/19560 | loss 3.291542 (-0.17z)| norm 0.3960 (+3.36z)| lr 1.00e-05 | 323.36 ms | 52.2% bf16 MFU | 1623017 tok/s step 18006/19560 | loss 3.279990 (-0.43z)| norm 0.2831 (+0.35z)| lr 1.00e-05 | 323.01 ms | 52.2% bf16 MFU | 1623022 tok/s step 18007/19560 | loss 3.265726 (-0.74z)| norm 0.2524 (-0.47z)| lr 1.00e-05 | 322.54 ms | 52.3% bf16 MFU | 1623144 tok/s step 18008/19560 | loss 3.297340 (-0.04z)| norm 0.3140 (+1.15z)| lr 9.98e-06 | 323.91 ms | 52.1% bf16 MFU | 1622919 tok/s step 18009/19560 | loss 3.274894 (-0.53z)| norm 0.4173 (+3.66z)| lr 9.97e-06 | 322.49 ms | 52.3% bf16 MFU | 1623060 tok/s step 18010/19560 | loss 3.313731 (+0.34z)| norm 0.3100 (+0.94z)| lr 9.96e-06 | 322.83 ms | 52.3% bf16 MFU | 1623108 tok/s step 18011/19560 | loss 3.260129 (-0.85z)| norm 0.2591 (-0.33z)| lr 9.94e-06 | 322.81 ms | 52.3% bf16 MFU | 1623160 tok/s step 18012/19560 | loss 3.341240 (+0.97z)| norm 0.2905 (+0.45z)| lr 9.93e-06 | 322.56 ms | 52.3% bf16 MFU | 1623273 tok/s step 18013/19560 | loss 3.270450 (-0.63z)| norm 0.2806 (+0.20z)| lr 9.92e-06 | 323.00 ms | 52.3% bf16 MFU | 1623269 tok/s step 18014/19560 | loss 3.208427 (-1.98z)| norm 0.3435 (+1.75z)| lr 9.91e-06 | 322.68 ms | 52.3% bf16 MFU | 1623347 tok/s step 18015/19560 | loss 3.315227 (+0.40z)| norm 0.2598 (-0.33z)| lr 9.89e-06 | 322.98 ms | 52.3% bf16 MFU | 1623342 tok/s step 18016/19560 | loss 3.310914 (+0.31z)| norm 0.2588 (-0.36z)| lr 9.88e-06 | 322.50 ms | 52.3% bf16 MFU | 1623461 tok/s step 18017/19560 | loss 3.295258 (-0.03z)| norm 0.2555 (-0.43z)| lr 9.87e-06 | 322.94 ms | 52.3% bf16 MFU | 1623462 tok/s step 18018/19560 | loss 3.105268 (-4.12z)| norm 0.4967 (+5.06z)| lr 9.85e-06 | 322.48 ms | 52.3% bf16 MFU | 1623578 tok/s step 18019/19560 | loss 3.351151 (+1.23z)| norm 0.2971 (+0.51z)| lr 9.84e-06 | 322.59 ms | 52.3% bf16 MFU | 1623662 tok/s step 18020/19560 | loss 3.306831 (+0.27z)| norm 0.2558 (-0.43z)| lr 9.83e-06 | 322.87 ms | 52.3% bf16 MFU | 1623672 tok/s step 18021/19560 | loss 3.292673 (-0.04z)| norm 
0.2767 (+0.05z)| lr 9.82e-06 | 322.70 ms | 52.3% bf16 MFU | 1623724 tok/s step 18022/19560 | loss 3.294436 (+0.01z)| norm 0.3113 (+0.82z)| lr 9.80e-06 | 323.03 ms | 52.2% bf16 MFU | 1623688 tok/s step 18023/19560 | loss 3.302835 (+0.19z)| norm 0.3160 (+0.92z)| lr 9.79e-06 | 323.10 ms | 52.2% bf16 MFU | 1623638 tok/s step 18024/19560 | loss 3.241401 (-1.14z)| norm 0.2564 (-0.44z)| lr 9.78e-06 | 322.30 ms | 52.4% bf16 MFU | 1623791 tok/s step 18025/19560 | loss 3.338908 (+0.99z)| norm 0.2855 (+0.21z)| lr 9.77e-06 | 322.81 ms | 52.3% bf16 MFU | 1623809 tok/s step 18026/19560 | loss 3.239223 (-1.18z)| norm 0.2471 (-0.66z)| lr 9.75e-06 | 322.95 ms | 52.3% bf16 MFU | 1623791 tok/s step 18027/19560 | loss 3.291780 (-0.04z)| norm 0.3638 (+1.95z)| lr 9.74e-06 | 322.68 ms | 52.3% bf16 MFU | 1623841 tok/s step 18028/19560 | loss 3.264989 (-0.62z)| norm 0.3240 (+1.04z)| lr 9.73e-06 | 322.63 ms | 52.3% bf16 MFU | 1623902 tok/s step 18029/19560 | loss 3.258021 (-0.77z)| norm 0.2804 (+0.05z)| lr 9.72e-06 | 322.52 ms | 52.3% bf16 MFU | 1623986 tok/s step 18030/19560 | loss 3.320369 (+0.61z)| norm 0.2467 (-0.69z)| lr 9.70e-06 | 322.85 ms | 52.3% bf16 MFU | 1623983 tok/s step 18031/19560 | loss 3.279403 (-0.28z)| norm 0.2531 (-0.54z)| lr 9.69e-06 | 322.66 ms | 52.3% bf16 MFU | 1624027 tok/s step 18032/19560 | loss 3.301585 (+0.22z)| norm 0.2632 (-0.31z)| lr 9.68e-06 | 322.38 ms | 52.4% bf16 MFU | 1624140 tok/s step 18033/19560 | loss 3.268613 (-0.52z)| norm 0.2635 (-0.30z)| lr 9.67e-06 | 322.68 ms | 52.3% bf16 MFU | 1624172 tok/s step 18034/19560 | loss 3.270647 (-0.47z)| norm 0.2607 (-0.36z)| lr 9.65e-06 | 322.77 ms | 52.3% bf16 MFU | 1624181 tok/s step 18035/19560 | loss 3.270873 (-0.47z)| norm 0.2461 (-0.69z)| lr 9.64e-06 | 322.72 ms | 52.3% bf16 MFU | 1624200 tok/s step 18036/19560 | loss 3.361184 (+1.54z)| norm 0.2598 (-0.38z)| lr 9.63e-06 | 322.87 ms | 52.3% bf16 MFU | 1624182 tok/s step 18037/19560 | loss 3.335819 (+0.99z)| norm 0.2941 (+0.41z)| lr 9.61e-06 | 322.60 ms | 
52.3% bf16 MFU | 1624233 tok/s step 18038/19560 | loss 3.413571 (+2.69z)| norm 0.3139 (+0.85z)| lr 9.60e-06 | 322.70 ms | 52.3% bf16 MFU | 1624257 tok/s step 18039/19560 | loss 3.334188 (+0.92z)| norm 0.2708 (-0.14z)| lr 9.59e-06 | 322.74 ms | 52.3% bf16 MFU | 1624268 tok/s step 18040/19560 | loss 3.311039 (+0.40z)| norm 0.2884 (+0.26z)| lr 9.58e-06 | 322.61 ms | 52.3% bf16 MFU | 1624312 tok/s step 18041/19560 | loss 3.260840 (-0.71z)| norm 0.2362 (-0.93z)| lr 9.56e-06 | 322.93 ms | 52.3% bf16 MFU | 1624272 tok/s step 18042/19560 | loss 3.289304 (-0.08z)| norm 0.2472 (-0.67z)| lr 9.55e-06 | 322.61 ms | 52.3% bf16 MFU | 1624315 tok/s step 18043/19560 | loss 3.330692 (+0.83z)| norm 0.2498 (-0.61z)| lr 9.54e-06 | 322.50 ms | 52.3% bf16 MFU | 1624385 tok/s step 18044/19560 | loss 3.247438 (-1.01z)| norm 0.3035 (+0.64z)| lr 9.53e-06 | 322.44 ms | 52.3% bf16 MFU | 1624466 tok/s step 18045/19560 | loss 3.354131 (+1.37z)| norm 0.2610 (-0.35z)| lr 9.51e-06 | 322.22 ms | 52.4% bf16 MFU | 1624599 tok/s step 18046/19560 | loss 3.296557 (+0.07z)| norm 0.2421 (-0.79z)| lr 9.50e-06 | 322.54 ms | 52.3% bf16 MFU | 1624643 tok/s step 18047/19560 | loss 3.315170 (+0.48z)| norm 0.2504 (-0.59z)| lr 9.49e-06 | 322.34 ms | 52.4% bf16 MFU | 1624736 tok/s step 18048/19560 | loss 3.290925 (-0.06z)| norm 0.2662 (-0.23z)| lr 9.48e-06 | 322.91 ms | 52.3% bf16 MFU | 1624681 tok/s step 18049/19560 | loss 3.273539 (-0.46z)| norm 0.2680 (-0.19z)| lr 9.46e-06 | 322.65 ms | 52.3% bf16 MFU | 1624696 tok/s step 18050/19560 | loss 3.250565 (-0.99z)| norm 0.3198 (+1.00z)| lr 9.45e-06 | 322.56 ms | 52.3% bf16 MFU | 1624730 tok/s step 18051/19560 | loss 3.255111 (-0.87z)| norm 0.2622 (-0.34z)| lr 9.44e-06 | 322.50 ms | 52.3% bf16 MFU | 1624778 tok/s step 18052/19560 | loss 3.300788 (+0.20z)| norm 0.2411 (-0.82z)| lr 9.43e-06 | 322.66 ms | 52.3% bf16 MFU | 1624783 tok/s step 18053/19560 | loss 3.330683 (+0.91z)| norm 0.2757 (-0.02z)| lr 9.42e-06 | 322.68 ms | 52.3% bf16 MFU | 1624783 tok/s step 18054/19560 
| loss 3.252488 (-0.95z)| norm 0.2354 (-0.95z)| lr 9.40e-06 | 322.49 ms | 52.3% bf16 MFU | 1624832 tok/s step 18055/19560 | loss 3.257799 (-0.81z)| norm 0.2501 (-0.61z)| lr 9.39e-06 | 322.79 ms | 52.3% bf16 MFU | 1624804 tok/s step 18056/19560 | loss 3.335398 (+1.05z)| norm 0.2499 (-0.62z)| lr 9.38e-06 | 323.61 ms | 52.2% bf16 MFU | 1624569 tok/s step 18057/19560 | loss 3.274988 (-0.40z)| norm 0.2603 (-0.37z)| lr 9.37e-06 | 322.24 ms | 52.4% bf16 MFU | 1624690 tok/s step 18058/19560 | loss 3.347419 (+1.37z)| norm 0.2620 (-0.32z)| lr 9.35e-06 | 322.59 ms | 52.3% bf16 MFU | 1624719 tok/s step 18059/19560 | loss 3.458013 (+3.81z)| norm 0.3071 (+0.72z)| lr 9.34e-06 | 322.68 ms | 52.3% bf16 MFU | 1624724 tok/s step 18060/19560 | loss 3.269989 (-0.50z)| norm 0.2983 (+0.50z)| lr 9.33e-06 | 322.70 ms | 52.3% bf16 MFU | 1624721 tok/s step 18061/19560 | loss 3.268250 (-0.53z)| norm 0.2583 (-0.43z)| lr 9.32e-06 | 323.05 ms | 52.2% bf16 MFU | 1624632 tok/s step 18062/19560 | loss 3.297654 (+0.15z)| norm 0.2701 (-0.16z)| lr 9.30e-06 | 322.40 ms | 52.3% bf16 MFU | 1624711 tok/s step 18063/19560 | loss 3.328208 (+0.84z)| norm 0.3313 (+1.25z)| lr 9.29e-06 | 322.81 ms | 52.3% bf16 MFU | 1624682 tok/s step 18064/19560 | loss 3.292306 (+0.03z)| norm 0.2320 (-1.04z)| lr 9.28e-06 | 322.71 ms | 52.3% bf16 MFU | 1624680 tok/s step 18065/19560 | loss 3.284553 (-0.15z)| norm 0.2641 (-0.30z)| lr 9.27e-06 | 322.52 ms | 52.3% bf16 MFU | 1624727 tok/s step 18066/19560 | loss 3.254292 (-0.84z)| norm 0.2612 (-0.36z)| lr 9.25e-06 | 323.07 ms | 52.2% bf16 MFU | 1624632 tok/s step 18067/19560 | loss 3.332141 (+0.96z)| norm 0.2534 (-0.53z)| lr 9.24e-06 | 322.30 ms | 52.4% bf16 MFU | 1624735 tok/s step 18068/19560 | loss 3.298731 (+0.18z)| norm 0.2403 (-0.84z)| lr 9.23e-06 | 322.58 ms | 52.3% bf16 MFU | 1624764 tok/s step 18069/19560 | loss 3.286163 (-0.10z)| norm 0.2497 (-0.61z)| lr 9.22e-06 | 323.10 ms | 52.2% bf16 MFU | 1624660 tok/s step 18070/19560 | loss 3.285571 (-0.10z)| norm 0.2584 (-0.39z)| 
lr 9.21e-06 | 322.42 ms | 52.3% bf16 MFU | 1624731 tok/s step 18071/19560 | loss 3.302575 (+0.28z)| norm 0.2785 (+0.08z)| lr 9.19e-06 | 322.49 ms | 52.3% bf16 MFU | 1624780 tok/s step 18072/19560 | loss 3.257066 (-0.77z)| norm 0.2538 (-0.51z)| lr 9.18e-06 | 322.96 ms | 52.3% bf16 MFU | 1624711 tok/s step 18073/19560 | loss 3.265426 (-0.58z)| norm 0.2890 (+0.31z)| lr 9.17e-06 | 322.33 ms | 52.4% bf16 MFU | 1624804 tok/s step 18074/19560 | loss 3.348826 (+1.35z)| norm 0.2542 (-0.50z)| lr 9.16e-06 | 322.80 ms | 52.3% bf16 MFU | 1624774 tok/s step 18075/19560 | loss 3.241889 (-1.11z)| norm 0.2447 (-0.72z)| lr 9.14e-06 | 322.90 ms | 52.3% bf16 MFU | 1624719 tok/s step 18076/19560 | loss 3.238966 (-1.17z)| norm 0.2372 (-0.90z)| lr 9.13e-06 | 323.01 ms | 52.2% bf16 MFU | 1624640 tok/s step 18077/19560 | loss 3.288430 (-0.03z)| norm 0.2472 (-0.66z)| lr 9.12e-06 | 322.62 ms | 52.3% bf16 MFU | 1624661 tok/s step 18078/19560 | loss 3.291446 (+0.04z)| norm 0.2427 (-0.76z)| lr 9.11e-06 | 322.87 ms | 52.3% bf16 MFU | 1624620 tok/s step 18079/19560 | loss 3.324018 (+0.87z)| norm 0.2785 (+0.09z)| lr 9.09e-06 | 322.92 ms | 52.3% bf16 MFU | 1624569 tok/s step 18080/19560 | loss 3.268722 (-0.48z)| norm 0.2569 (-0.43z)| lr 9.08e-06 | 322.83 ms | 52.3% bf16 MFU | 1624543 tok/s step 18081/19560 | loss 3.280964 (-0.20z)| norm 0.2352 (-0.94z)| lr 9.07e-06 | 322.68 ms | 52.3% bf16 MFU | 1624555 tok/s step 18082/19560 | loss 3.315452 (+0.66z)| norm 0.2543 (-0.48z)| lr 9.06e-06 | 322.54 ms | 52.3% bf16 MFU | 1624601 tok/s step 18083/19560 | loss 3.291688 (+0.08z)| norm 0.2464 (-0.67z)| lr 9.05e-06 | 322.99 ms | 52.3% bf16 MFU | 1624534 tok/s step 18084/19560 | loss 3.182864 (-2.54z)| norm 0.2337 (-0.97z)| lr 9.03e-06 | 322.46 ms | 52.3% bf16 MFU | 1624602 tok/s step 18085/19560 | loss 3.339252 (+1.23z)| norm 0.2681 (-0.13z)| lr 9.02e-06 | 322.71 ms | 52.3% bf16 MFU | 1624604 tok/s step 18086/19560 | loss 3.413209 (+2.89z)| norm 0.2550 (-0.45z)| lr 9.01e-06 | 322.23 ms | 52.4% bf16 MFU | 
1624726 tok/s step 18087/19560 | loss 3.285221 (-0.10z)| norm 0.2385 (-0.85z)| lr 9.00e-06 | 322.48 ms | 52.3% bf16 MFU | 1624779 tok/s step 18088/19560 | loss 3.250527 (-0.91z)| norm 0.2316 (-1.01z)| lr 8.99e-06 | 322.54 ms | 52.3% bf16 MFU | 1624814 tok/s step 18089/19560 | loss 3.331791 (+0.98z)| norm 0.2369 (-0.86z)| lr 8.97e-06 | 323.02 ms | 52.2% bf16 MFU | 1624727 tok/s step 18090/19560 | loss 3.317214 (+0.63z)| norm 0.2561 (-0.37z)| lr 8.96e-06 | 322.53 ms | 52.3% bf16 MFU | 1624767 tok/s step 18091/19560 | loss 3.276684 (-0.32z)| norm 0.2470 (-0.60z)| lr 8.95e-06 | 322.97 ms | 52.3% bf16 MFU | 1624695 tok/s step 18092/19560 | loss 3.352931 (+1.45z)| norm 0.2328 (-0.94z)| lr 8.94e-06 | 322.41 ms | 52.3% bf16 MFU | 1624768 tok/s step 18093/19560 | loss 3.305650 (+0.34z)| norm 0.2534 (-0.42z)| lr 8.92e-06 | 322.46 ms | 52.3% bf16 MFU | 1624824 tok/s step 18094/19560 | loss 3.258686 (-0.76z)| norm 0.2340 (-0.91z)| lr 8.91e-06 | 322.65 ms | 52.3% bf16 MFU | 1624830 tok/s step 18095/19560 | loss 3.254554 (-0.86z)| norm 0.2876 (+0.47z)| lr 8.90e-06 | 323.13 ms | 52.2% bf16 MFU | 1624716 tok/s step 18096/19560 | loss 3.286077 (-0.12z)| norm 0.2289 (-1.04z)| lr 8.89e-06 | 323.11 ms | 52.2% bf16 MFU | 1624612 tok/s step 18097/19560 | loss 3.279256 (-0.29z)| norm 0.2547 (-0.37z)| lr 8.88e-06 | 322.66 ms | 52.3% bf16 MFU | 1624625 tok/s step 18098/19560 | loss 3.364171 (+1.68z)| norm 0.2594 (-0.24z)| lr 8.86e-06 | 322.77 ms | 52.3% bf16 MFU | 1624610 tok/s step 18099/19560 | loss 3.236067 (-1.29z)| norm 0.2324 (-0.94z)| lr 8.85e-06 | 322.66 ms | 52.3% bf16 MFU | 1624625 tok/s step 18100/19560 | loss 3.305411 (+0.32z)| norm 0.2368 (-0.82z)| lr 8.84e-06 | 322.41 ms | 52.3% bf16 MFU | 1624702 tok/s step 18101/19560 | loss 3.275729 (-0.36z)| norm 0.2527 (-0.41z)| lr 8.83e-06 | 323.14 ms | 52.2% bf16 MFU | 1624590 tok/s step 18102/19560 | loss 3.338038 (+1.09z)| norm 0.2432 (-0.65z)| lr 8.82e-06 | 323.31 ms | 52.2% bf16 MFU | 1624442 tok/s step 18103/19560 | loss 3.273688 
(-0.41z)| norm 0.2461 (-0.57z)| lr 8.80e-06 | 323.09 ms | 52.2% bf16 MFU | 1624357 tok/s step 18104/19560 | loss 3.268604 (-0.53z)| norm 0.2445 (-0.61z)| lr 8.79e-06 | 323.27 ms | 52.2% bf16 MFU | 1624231 tok/s step 18105/19560 | loss 3.305219 (+0.32z)| norm 0.2585 (-0.25z)| lr 8.78e-06 | 321.92 ms | 52.4% bf16 MFU | 1624450 tok/s step 18106/19560 | loss 3.299850 (+0.19z)| norm 0.2347 (-0.86z)| lr 8.77e-06 | 323.34 ms | 52.2% bf16 MFU | 1624302 tok/s step 18107/19560 | loss 3.320006 (+0.67z)| norm 0.3078 (+1.01z)| lr 8.76e-06 | 323.03 ms | 52.2% bf16 MFU | 1624237 tok/s step 18108/19560 | loss 3.266597 (-0.58z)| norm 0.2385 (-0.76z)| lr 8.74e-06 | 322.33 ms | 52.4% bf16 MFU | 1624352 tok/s step 18109/19560 | loss 3.285963 (-0.14z)| norm 0.2506 (-0.45z)| lr 8.73e-06 | 322.66 ms | 52.3% bf16 MFU | 1624378 tok/s step 18110/19560 | loss 3.291953 (+0.00z)| norm 0.2469 (-0.55z)| lr 8.72e-06 | 322.46 ms | 52.3% bf16 MFU | 1624455 tok/s step 18111/19560 | loss 3.274716 (-0.40z)| norm 0.2392 (-0.73z)| lr 8.71e-06 | 323.41 ms | 52.2% bf16 MFU | 1624288 tok/s step 18112/19560 | loss 3.378399 (+2.01z)| norm 0.2590 (-0.23z)| lr 8.70e-06 | 322.53 ms | 52.3% bf16 MFU | 1624350 tok/s step 18113/19560 | loss 3.254946 (-0.87z)| norm 0.2628 (-0.14z)| lr 8.68e-06 | 322.63 ms | 52.3% bf16 MFU | 1624386 tok/s step 18114/19560 | loss 3.254602 (-0.88z)| norm 0.2411 (-0.70z)| lr 8.67e-06 | 322.95 ms | 52.3% bf16 MFU | 1624339 tok/s step 18115/19560 | loss 3.321892 (+0.68z)| norm 0.2669 (-0.04z)| lr 8.66e-06 | 323.18 ms | 52.2% bf16 MFU | 1624237 tok/s step 18116/19560 | loss 3.307480 (+0.34z)| norm 0.3122 (+1.12z)| lr 8.65e-06 | 323.43 ms | 52.2% bf16 MFU | 1624076 tok/s step 18117/19560 | loss 3.218666 (-1.69z)| norm 0.2487 (-0.52z)| lr 8.64e-06 | 322.49 ms | 52.3% bf16 MFU | 1624161 tok/s step 18118/19560 | loss 3.232972 (-1.34z)| norm 0.2463 (-0.58z)| lr 8.62e-06 | 322.91 ms | 52.3% bf16 MFU | 1624134 tok/s step 18119/19560 | loss 3.274617 (-0.39z)| norm 0.2778 (+0.22z)| lr 8.61e-06 | 
323.32 ms | 52.2% bf16 MFU | 1624007 tok/s step 18120/19560 | loss 3.226882 (-1.47z)| norm 0.2379 (-0.80z)| lr 8.60e-06 | 323.14 ms | 52.2% bf16 MFU | 1623930 tok/s step 18121/19560 | loss 3.265759 (-0.57z)| norm 0.2345 (-0.87z)| lr 8.59e-06 | 323.14 ms | 52.2% bf16 MFU | 1623856 tok/s step 18122/19560 | loss 3.348753 (+1.31z)| norm 0.2404 (-0.72z)| lr 8.58e-06 | 322.80 ms | 52.3% bf16 MFU | 1623873 tok/s step 18123/19560 | loss 3.255478 (-0.80z)| norm 0.2725 (+0.12z)| lr 8.57e-06 | 322.92 ms | 52.3% bf16 MFU | 1623860 tok/s step 18124/19560 | loss 3.244022 (-1.04z)| norm 0.2820 (+0.36z)| lr 8.55e-06 | 322.85 ms | 52.3% bf16 MFU | 1623864 tok/s step 18125/19560 | loss 3.263321 (-0.60z)| norm 0.2873 (+0.49z)| lr 8.54e-06 | 322.89 ms | 52.3% bf16 MFU | 1623858 tok/s step 18126/19560 | loss 3.281090 (-0.21z)| norm 0.2368 (-0.82z)| lr 8.53e-06 | 323.21 ms | 52.2% bf16 MFU | 1623771 tok/s step 18127/19560 | loss 3.270257 (-0.44z)| norm 0.3147 (+1.19z)| lr 8.52e-06 | 322.96 ms | 52.3% bf16 MFU | 1623752 tok/s step 18128/19560 | loss 3.243855 (-1.03z)| norm 0.2296 (-1.01z)| lr 8.51e-06 | 322.70 ms | 52.3% bf16 MFU | 1623799 tok/s step 18129/19560 | loss 3.277971 (-0.27z)| norm 0.2717 (+0.10z)| lr 8.49e-06 | 322.90 ms | 52.3% bf16 MFU | 1623792 tok/s step 18130/19560 | loss 3.312814 (+0.52z)| norm 0.2567 (-0.30z)| lr 8.48e-06 | 322.75 ms | 52.3% bf16 MFU | 1623824 tok/s step 18131/19560 | loss 3.304469 (+0.33z)| norm 0.2585 (-0.25z)| lr 8.47e-06 | 322.60 ms | 52.3% bf16 MFU | 1623892 tok/s step 18132/19560 | loss 3.283493 (-0.16z)| norm 0.2375 (-0.80z)| lr 8.46e-06 | 323.06 ms | 52.2% bf16 MFU | 1623843 tok/s step 18133/19560 | loss 3.293777 (+0.08z)| norm 0.2371 (-0.81z)| lr 8.45e-06 | 322.67 ms | 52.3% bf16 MFU | 1623893 tok/s step 18134/19560 | loss 3.315964 (+0.58z)| norm 0.2619 (-0.12z)| lr 8.44e-06 | 323.04 ms | 52.2% bf16 MFU | 1623847 tok/s step 18135/19560 | loss 3.248695 (-0.95z)| norm 0.2388 (-0.76z)| lr 8.42e-06 | 322.90 ms | 52.3% bf16 MFU | 1623839 tok/s step 
18136/19560 | loss 3.301317 (+0.24z)| norm 0.2418 (-0.66z)| lr 8.41e-06 | 322.98 ms | 52.3% bf16 MFU | 1623812 tok/s step 18137/19560 | loss 3.234095 (-1.27z)| norm 0.2655 (+0.04z)| lr 8.40e-06 | 322.73 ms | 52.3% bf16 MFU | 1623849 tok/s step 18138/19560 | loss 3.260365 (-0.67z)| norm 0.2649 (+0.03z)| lr 8.39e-06 | 322.81 ms | 52.3% bf16 MFU | 1623864 tok/s step 18139/19560 | loss 3.297543 (+0.17z)| norm 0.2652 (+0.04z)| lr 8.38e-06 | 322.93 ms | 52.3% bf16 MFU | 1623848 tok/s step 18140/19560 | loss 3.289175 (-0.01z)| norm 0.2757 (+0.36z)| lr 8.37e-06 | 322.72 ms | 52.3% bf16 MFU | 1623885 tok/s step 18141/19560 | loss 3.305904 (+0.36z)| norm 0.2533 (-0.31z)| lr 8.35e-06 | 322.96 ms | 52.3% bf16 MFU | 1623861 tok/s step 18142/19560 | loss 3.239938 (-1.16z)| norm 0.2471 (-0.49z)| lr 8.34e-06 | 322.76 ms | 52.3% bf16 MFU | 1623888 tok/s step 18143/19560 | loss 3.268791 (-0.49z)| norm 0.2390 (-0.74z)| lr 8.33e-06 | 322.94 ms | 52.3% bf16 MFU | 1623867 tok/s step 18144/19560 | loss 3.299971 (+0.23z)| norm 0.2448 (-0.55z)| lr 8.32e-06 | 322.53 ms | 52.3% bf16 MFU | 1623952 tok/s step 18145/19560 | loss 3.274762 (-0.34z)| norm 0.2539 (-0.27z)| lr 8.31e-06 | 322.75 ms | 52.3% bf16 MFU | 1623977 tok/s step 18146/19560 | loss 3.239751 (-1.26z)| norm 0.3118 (+2.03z)| lr 8.29e-06 | 322.79 ms | 52.3% bf16 MFU | 1623990 tok/s step 18147/19560 | loss 3.289799 (-0.01z)| norm 0.2348 (-1.04z)| lr 8.28e-06 | 322.61 ms | 52.3% bf16 MFU | 1624048 tok/s step 18148/19560 | loss 3.358533 (+1.67z)| norm 0.2614 (+0.03z)| lr 8.27e-06 | 323.58 ms | 52.2% bf16 MFU | 1623860 tok/s step 18149/19560 | loss 3.340536 (+1.21z)| norm 0.2592 (-0.05z)| lr 8.26e-06 | 322.62 ms | 52.3% bf16 MFU | 1623922 tok/s step 18150/19560 | loss 3.316361 (+0.61z)| norm 0.2416 (-0.75z)| lr 8.25e-06 | 322.76 ms | 52.3% bf16 MFU | 1623946 tok/s step 18151/19560 | loss 3.354456 (+1.52z)| norm 0.2767 (+0.71z)| lr 8.24e-06 | 322.77 ms | 52.3% bf16 MFU | 1623966 tok/s step 18152/19560 | loss 3.317811 (+0.62z)| norm 
0.2572 (-0.10z)| lr 8.22e-06 | 322.51 ms | 52.3% bf16 MFU | 1624050 tok/s step 18153/19560 | loss 3.248234 (-1.05z)| norm 0.2877 (+1.16z)| lr 8.21e-06 | 323.38 ms | 52.2% bf16 MFU | 1623912 tok/s step 18154/19560 | loss 3.343321 (+1.24z)| norm 0.2535 (-0.26z)| lr 8.20e-06 | 322.92 ms | 52.3% bf16 MFU | 1623894 tok/s step 18155/19560 | loss 3.277004 (-0.37z)| norm 0.2592 (+0.01z)| lr 8.19e-06 | 322.67 ms | 52.3% bf16 MFU | 1623943 tok/s step 18156/19560 | loss 3.265049 (-0.66z)| norm 0.2455 (-0.60z)| lr 8.18e-06 | 323.14 ms | 52.2% bf16 MFU | 1623868 tok/s step 18157/19560 | loss 3.337813 (+1.09z)| norm 0.2771 (+0.88z)| lr 8.17e-06 | 323.03 ms | 52.2% bf16 MFU | 1623826 tok/s step 18158/19560 | loss 3.248721 (-1.05z)| norm 0.2644 (+0.28z)| lr 8.16e-06 | 323.07 ms | 52.2% bf16 MFU | 1623775 tok/s step 18159/19560 | loss 3.309776 (+0.42z)| norm 0.2348 (-1.09z)| lr 8.14e-06 | 322.50 ms | 52.3% bf16 MFU | 1623873 tok/s step 18160/19560 | loss 3.264848 (-0.66z)| norm 0.2561 (-0.10z)| lr 8.13e-06 | 323.00 ms | 52.3% bf16 MFU | 1623837 tok/s step 18161/19560 | loss 3.266262 (-0.63z)| norm 0.2452 (-0.60z)| lr 8.12e-06 | 323.28 ms | 52.2% bf16 MFU | 1623735 tok/s step 18162/19560 | loss 3.228844 (-1.51z)| norm 0.2480 (-0.46z)| lr 8.11e-06 | 323.12 ms | 52.2% bf16 MFU | 1623678 tok/s step 18163/19560 | loss 3.294602 (+0.06z)| norm 0.2607 (+0.12z)| lr 8.10e-06 | 324.10 ms | 52.1% bf16 MFU | 1623378 tok/s step 18164/19560 | loss 3.355886 (+1.53z)| norm 0.2455 (-0.58z)| lr 8.09e-06 | 322.22 ms | 52.4% bf16 MFU | 1623565 tok/s step 18165/19560 | loss 3.271950 (-0.47z)| norm 0.2317 (-1.20z)| lr 8.07e-06 | 323.06 ms | 52.2% bf16 MFU | 1623532 tok/s step 18166/19560 | loss 3.306728 (+0.40z)| norm 0.2359 (-1.00z)| lr 8.06e-06 | 323.25 ms | 52.2% bf16 MFU | 1623453 tok/s step 18167/19560 | loss 3.345499 (+1.36z)| norm 0.2343 (-1.06z)| lr 8.05e-06 | 323.03 ms | 52.2% bf16 MFU | 1623431 tok/s step 18168/19560 | loss 3.375797 (+2.07z)| norm 0.2664 (+0.48z)| lr 8.04e-06 | 323.87 ms | 
52.1% bf16 MFU | 1623201 tok/s step 18169/19560 | loss 3.318507 (+0.66z)| norm 0.2463 (-0.49z)| lr 8.03e-06 | 322.88 ms | 52.3% bf16 MFU | 1623230 tok/s step 18170/19560 | loss 3.281429 (-0.25z)| norm 0.2457 (-0.52z)| lr 8.02e-06 | 322.69 ms | 52.3% bf16 MFU | 1623306 tok/s step 18171/19560 | loss 3.280353 (-0.27z)| norm 0.2516 (-0.24z)| lr 8.01e-06 | 322.35 ms | 52.4% bf16 MFU | 1623463 tok/s step 18172/19560 | loss 3.259138 (-0.79z)| norm 0.2742 (+0.87z)| lr 7.99e-06 | 323.27 ms | 52.2% bf16 MFU | 1623381 tok/s step 18173/19560 | loss 3.333387 (+1.04z)| norm 0.2472 (-0.44z)| lr 7.98e-06 | 323.67 ms | 52.1% bf16 MFU | 1623204 tok/s step 18174/19560 | loss 3.256294 (-0.85z)| norm 0.2394 (-0.82z)| lr 7.97e-06 | 322.96 ms | 52.3% bf16 MFU | 1623214 tok/s step 18175/19560 | loss 3.219366 (-1.73z)| norm 0.2460 (-0.50z)| lr 7.96e-06 | 323.27 ms | 52.2% bf16 MFU | 1623145 tok/s step 18176/19560 | loss 3.323337 (+0.80z)| norm 0.2574 (+0.06z)| lr 7.95e-06 | 322.65 ms | 52.3% bf16 MFU | 1623236 tok/s step 18177/19560 | loss 3.259176 (-0.76z)| norm 0.2743 (+0.89z)| lr 7.94e-06 | 323.47 ms | 52.2% bf16 MFU | 1623114 tok/s step 18178/19560 | loss 3.313140 (+0.55z)| norm 0.2736 (+0.90z)| lr 7.93e-06 | 322.74 ms | 52.3% bf16 MFU | 1623184 tok/s step 18179/19560 | loss 3.283214 (-0.19z)| norm 0.2514 (-0.21z)| lr 7.91e-06 | 322.79 ms | 52.3% bf16 MFU | 1623237 tok/s step 18180/19560 | loss 3.258051 (-0.80z)| norm 0.2384 (-0.87z)| lr 7.90e-06 | 322.70 ms | 52.3% bf16 MFU | 1623311 tok/s step 18181/19560 | loss 3.287686 (-0.06z)| norm 0.2542 (-0.07z)| lr 7.89e-06 | 322.69 ms | 52.3% bf16 MFU | 1623382 tok/s step 18182/19560 | loss 3.320388 (+0.72z)| norm 0.2656 (+0.50z)| lr 7.88e-06 | 322.76 ms | 52.3% bf16 MFU | 1623432 tok/s step 18183/19560 | loss 3.299257 (+0.20z)| norm 0.3046 (+2.41z)| lr 7.87e-06 | 323.32 ms | 52.2% bf16 MFU | 1623338 tok/s step 18184/19560 | loss 3.297302 (+0.16z)| norm 0.2328 (-1.15z)| lr 7.86e-06 | 323.09 ms | 52.2% bf16 MFU | 1623309 tok/s step 18185/19560 
| loss 3.294877 (+0.09z)| norm 0.2512 (-0.24z)| lr 7.85e-06 | 322.79 ms | 52.3% bf16 MFU | 1623356 tok/s step 18186/19560 | loss 3.251232 (-0.97z)| norm 0.2655 (+0.47z)| lr 7.83e-06 | 323.01 ms | 52.2% bf16 MFU | 1623344 tok/s step 18187/19560 | loss 3.324625 (+0.94z)| norm 0.3429 (+4.10z)| lr 7.82e-06 | 323.11 ms | 52.2% bf16 MFU | 1623310 tok/s step 18188/19560 | loss 3.277841 (-0.31z)| norm 0.2679 (+0.57z)| lr 7.81e-06 | 323.37 ms | 52.2% bf16 MFU | 1623211 tok/s step 18189/19560 | loss 3.342680 (+1.40z)| norm 0.2703 (+0.68z)| lr 7.80e-06 | 322.87 ms | 52.3% bf16 MFU | 1623242 tok/s step 18190/19560 | loss 3.274953 (-0.39z)| norm 0.3188 (+2.89z)| lr 7.79e-06 | 323.06 ms | 52.2% bf16 MFU | 1623223 tok/s step 18191/19560 | loss 3.304563 (+0.40z)| norm 0.3058 (+2.36z)| lr 7.78e-06 | 323.29 ms | 52.2% bf16 MFU | 1623148 tok/s step 18192/19560 | loss 3.309649 (+0.53z)| norm 0.2489 (-0.36z)| lr 7.77e-06 | 323.35 ms | 52.2% bf16 MFU | 1623062 tok/s step 18193/19560 | loss 3.325196 (+0.93z)| norm 0.2581 (+0.08z)| lr 7.76e-06 | 322.66 ms | 52.3% bf16 MFU | 1623154 tok/s step 18194/19560 | loss 3.306631 (+0.43z)| norm 0.2359 (-0.97z)| lr 7.74e-06 | 323.18 ms | 52.2% bf16 MFU | 1623111 tok/s step 18195/19560 | loss 3.273977 (-0.42z)| norm 0.2709 (+0.69z)| lr 7.73e-06 | 323.03 ms | 52.2% bf16 MFU | 1623106 tok/s step 18196/19560 | loss 3.301892 (+0.32z)| norm 0.2522 (-0.20z)| lr 7.72e-06 | 322.92 ms | 52.3% bf16 MFU | 1623130 tok/s step 18197/19560 | loss 3.245766 (-1.16z)| norm 0.2524 (-0.19z)| lr 7.71e-06 | 323.13 ms | 52.2% bf16 MFU | 1623101 tok/s step 18198/19560 | loss 3.279444 (-0.27z)| norm 0.2462 (-0.48z)| lr 7.70e-06 | 322.30 ms | 52.4% bf16 MFU | 1623281 tok/s step 18199/19560 | loss 3.261158 (-0.74z)| norm 0.2710 (+0.71z)| lr 7.69e-06 | 323.06 ms | 52.2% bf16 MFU | 1623262 tok/s step 18200/19560 | loss 3.270471 (-0.50z)| norm 0.2581 (+0.09z)| lr 7.68e-06 | 322.76 ms | 52.3% bf16 MFU | 1623318 tok/s step 18201/19560 | loss 3.262463 (-0.71z)| norm 0.2569 (+0.04z)| 
lr 7.67e-06 | 323.06 ms | 52.2% bf16 MFU | 1623297 tok/s step 18202/19560 | loss 3.246073 (-1.13z)| norm 0.2598 (+0.18z)| lr 7.65e-06 | 322.85 ms | 52.3% bf16 MFU | 1623330 tok/s step 18203/19560 | loss 3.291718 (+0.07z)| norm 0.2386 (-0.84z)| lr 7.64e-06 | 322.59 ms | 52.3% bf16 MFU | 1623426 tok/s step 18204/19560 | loss 3.309621 (+0.54z)| norm 0.3607 (+4.59z)| lr 7.63e-06 | 323.20 ms | 52.2% bf16 MFU | 1623362 tok/s step 18205/19560 | loss 3.276087 (-0.36z)| norm 0.2743 (+0.76z)| lr 7.62e-06 | 322.61 ms | 52.3% bf16 MFU | 1623452 tok/s step 18206/19560 | loss 3.335985 (+1.24z)| norm 0.2904 (+1.44z)| lr 7.61e-06 | 322.54 ms | 52.3% bf16 MFU | 1623555 tok/s step 18207/19560 | loss 3.278079 (-0.30z)| norm 0.2538 (-0.16z)| lr 7.60e-06 | 322.60 ms | 52.3% bf16 MFU | 1623636 tok/s step 18208/19560 | loss 3.300336 (+0.29z)| norm 0.3091 (+2.22z)| lr 7.59e-06 | 322.26 ms | 52.4% bf16 MFU | 1623799 tok/s step 18209/19560 | loss 3.248247 (-1.10z)| norm 0.3117 (+2.26z)| lr 7.58e-06 | 322.96 ms | 52.3% bf16 MFU | 1623778 tok/s step 18210/19560 | loss 3.357856 (+1.80z)| norm 0.2508 (-0.32z)| lr 7.56e-06 | 322.67 ms | 52.3% bf16 MFU | 1623831 tok/s step 18211/19560 | loss 3.300945 (+0.29z)| norm 0.2510 (-0.32z)| lr 7.55e-06 | 322.95 ms | 52.3% bf16 MFU | 1623812 tok/s step 18212/19560 | loss 3.308278 (+0.48z)| norm 0.2785 (+0.84z)| lr 7.54e-06 | 322.79 ms | 52.3% bf16 MFU | 1623834 tok/s step 18213/19560 | loss 3.302843 (+0.34z)| norm 0.2639 (+0.22z)| lr 7.53e-06 | 323.03 ms | 52.2% bf16 MFU | 1623794 tok/s step 18214/19560 | loss 3.286234 (-0.09z)| norm 0.3088 (+2.08z)| lr 7.52e-06 | 322.83 ms | 52.3% bf16 MFU | 1623806 tok/s step 18215/19560 | loss 3.470689 (+4.70z)| norm 0.3044 (+1.85z)| lr 7.51e-06 | 322.80 ms | 52.3% bf16 MFU | 1623826 tok/s step 18216/19560 | loss 3.311576 (+0.53z)| norm 0.2512 (-0.36z)| lr 7.50e-06 | 323.34 ms | 52.2% bf16 MFU | 1623709 tok/s step 18217/19560 | loss 3.260058 (-0.81z)| norm 0.2394 (-0.85z)| lr 7.49e-06 | 322.45 ms | 52.3% bf16 MFU | 
1623821 tok/s step 18218/19560 | loss 3.335699 (+1.17z)| norm 0.2490 (-0.45z)| lr 7.48e-06 | 322.75 ms | 52.3% bf16 MFU | 1623852 tok/s step 18219/19560 | loss 3.278137 (-0.34z)| norm 0.2472 (-0.52z)| lr 7.46e-06 | 323.15 ms | 52.2% bf16 MFU | 1623782 tok/s step 18220/19560 | loss 3.263429 (-0.71z)| norm 0.3226 (+2.53z)| lr 7.45e-06 | 322.84 ms | 52.3% bf16 MFU | 1623793 tok/s step 18221/19560 | loss 3.283839 (-0.17z)| norm 0.2731 (+0.51z)| lr 7.44e-06 | 322.59 ms | 52.3% bf16 MFU | 1623866 tok/s step 18222/19560 | loss 3.235214 (-1.44z)| norm 0.2247 (-1.46z)| lr 7.43e-06 | 322.99 ms | 52.3% bf16 MFU | 1623835 tok/s step 18223/19560 | loss 3.280475 (-0.25z)| norm 0.2589 (-0.06z)| lr 7.42e-06 | 322.62 ms | 52.3% bf16 MFU | 1623898 tok/s step 18224/19560 | loss 3.298440 (+0.22z)| norm 0.3006 (+1.61z)| lr 7.41e-06 | 322.91 ms | 52.3% bf16 MFU | 1623885 tok/s step 18225/19560 | loss 3.316673 (+0.69z)| norm 0.2378 (-0.93z)| lr 7.40e-06 | 322.81 ms | 52.3% bf16 MFU | 1623898 tok/s step 18226/19560 | loss 3.319988 (+0.79z)| norm 0.2567 (-0.16z)| lr 7.39e-06 | 322.77 ms | 52.3% bf16 MFU | 1623921 tok/s step 18227/19560 | loss 3.253360 (-0.99z)| norm 0.2560 (-0.20z)| lr 7.38e-06 | 322.77 ms | 52.3% bf16 MFU | 1623943 tok/s step 18228/19560 | loss 3.231151 (-1.55z)| norm 0.2617 (+0.02z)| lr 7.37e-06 | 322.22 ms | 52.4% bf16 MFU | 1624102 tok/s step 18229/19560 | loss 3.296801 (+0.18z)| norm 0.2496 (-0.47z)| lr 7.35e-06 | 323.00 ms | 52.3% bf16 MFU | 1624056 tok/s step 18230/19560 | loss 3.310152 (+0.55z)| norm 0.2409 (-0.82z)| lr 7.34e-06 | 322.37 ms | 52.4% bf16 MFU | 1624170 tok/s step 18231/19560 | loss 3.303604 (+0.36z)| norm 0.2383 (-0.93z)| lr 7.33e-06 | 322.72 ms | 52.3% bf16 MFU | 1624192 tok/s step 18232/19560 | loss 3.266025 (-0.64z)| norm 0.2386 (-0.91z)| lr 7.32e-06 | 323.13 ms | 52.2% bf16 MFU | 1624110 tok/s step 18233/19560 | loss 3.206771 (-2.16z)| norm 0.2510 (-0.40z)| lr 7.31e-06 | 322.32 ms | 52.4% bf16 MFU | 1624233 tok/s step 18234/19560 | loss 3.274297 
(-0.38z)| norm 0.2305 (-1.23z)| lr 7.30e-06 | 322.31 ms | 52.4% bf16 MFU | 1624354 tok/s step 18235/19560 | loss 3.254862 (-0.88z)| norm 0.2366 (-0.97z)| lr 7.29e-06 | 322.56 ms | 52.3% bf16 MFU | 1624404 tok/s step 18236/19560 | loss 3.256876 (-0.82z)| norm 0.2465 (-0.57z)| lr 7.28e-06 | 322.93 ms | 52.3% bf16 MFU | 1624362 tok/s step 18237/19560 | loss 3.298646 (+0.27z)| norm 0.2505 (-0.40z)| lr 7.27e-06 | 322.94 ms | 52.3% bf16 MFU | 1624318 tok/s step 18238/19560 | loss 3.338927 (+1.30z)| norm 0.2798 (+0.78z)| lr 7.26e-06 | 322.46 ms | 52.3% bf16 MFU | 1624398 tok/s step 18239/19560 | loss 3.275047 (-0.36z)| norm 0.2826 (+0.89z)| lr 7.24e-06 | 322.60 ms | 52.3% bf16 MFU | 1624436 tok/s step 18240/19560 | loss 3.304962 (+0.44z)| norm 0.2420 (-0.77z)| lr 7.23e-06 | 322.72 ms | 52.3% bf16 MFU | 1624445 tok/s step 18241/19560 | loss 3.280036 (-0.22z)| norm 0.2406 (-0.82z)| lr 7.22e-06 | 322.83 ms | 52.3% bf16 MFU | 1624424 tok/s step 18242/19560 | loss 3.248935 (-1.05z)| norm 0.2474 (-0.55z)| lr 7.21e-06 | 322.54 ms | 52.3% bf16 MFU | 1624478 tok/s step 18243/19560 | loss 3.265040 (-0.61z)| norm 0.2545 (-0.25z)| lr 7.20e-06 | 322.49 ms | 52.3% bf16 MFU | 1624541 tok/s step 18244/19560 | loss 3.268386 (-0.51z)| norm 0.2444 (-0.65z)| lr 7.19e-06 | 322.75 ms | 52.3% bf16 MFU | 1624537 tok/s step 18245/19560 | loss 3.311881 (+0.63z)| norm 0.2245 (-1.46z)| lr 7.18e-06 | 323.01 ms | 52.2% bf16 MFU | 1624467 tok/s step 18246/19560 | loss 3.295741 (+0.19z)| norm 0.2402 (-0.81z)| lr 7.17e-06 | 323.52 ms | 52.2% bf16 MFU | 1624271 tok/s step 18247/19560 | loss 3.343452 (+1.46z)| norm 0.2325 (-1.11z)| lr 7.16e-06 | 322.78 ms | 52.3% bf16 MFU | 1624273 tok/s step 18248/19560 | loss 3.370337 (+2.13z)| norm 0.2440 (-0.64z)| lr 7.15e-06 | 322.81 ms | 52.3% bf16 MFU | 1624267 tok/s step 18249/19560 | loss 3.249267 (-1.10z)| norm 0.2487 (-0.45z)| lr 7.14e-06 | 322.33 ms | 52.4% bf16 MFU | 1624380 tok/s step 18250/19560 | loss 3.349565 (+1.58z)| norm 0.2546 (-0.21z)| lr 7.13e-06 | 
322.68 ms | 52.3% bf16 MFU | 1624401 tok/s val loss 3.275970 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating HellaSwag: 70/79 HellaSwag: 3060/10042 = 0.304720 step 18251/19560 | loss 3.304297 (+0.36z)| norm 0.2367 (-0.94z)| lr 7.11e-06 | 322.25 ms | 52.4% bf16 MFU | 1624529 tok/s step 18252/19560 | loss 3.262317 (-0.77z)| norm 0.2693 (+0.41z)| lr 7.10e-06 | 322.38 ms | 52.4% bf16 MFU | 1624618 tok/s step 18253/19560 | loss 3.312063 (+0.56z)| norm 0.2401 (-0.78z)| lr 7.09e-06 | 322.85 ms | 52.3% bf16 MFU | 1624583 tok/s step 18254/19560 | loss 3.262503 (-0.77z)| norm 0.2463 (-0.53z)| lr 7.08e-06 | 322.53 ms | 52.3% bf16 MFU | 1624632 tok/s step 18255/19560 | loss 3.279706 (-0.31z)| norm 0.2416 (-0.72z)| lr 7.07e-06 | 323.14 ms | 52.2% bf16 MFU | 1624525 tok/s step 18256/19560 | loss 3.253023 (-1.03z)| norm 0.2529 (-0.25z)| lr 7.06e-06 | 322.65 ms | 52.3% bf16 MFU | 1624546 tok/s step 18257/19560 | loss 3.289636 (-0.05z)| norm 0.2374 (-0.90z)| lr 7.05e-06 | 322.82 ms | 52.3% bf16 MFU | 1624523 tok/s step 18258/19560 | loss 3.409735 (+3.05z)| norm 0.2540 (-0.19z)| lr 7.04e-06 | 322.33 ms | 52.4% bf16 MFU | 1624625 tok/s step 18259/19560 | loss 3.294162 (+0.05z)| norm 0.2342 (-1.02z)| lr 7.03e-06 | 322.85 ms | 52.3% bf16 MFU | 1624591 tok/s step 18260/19560 | loss 3.264549 (-0.71z)| norm 0.2268 (-1.32z)| lr 7.02e-06 | 322.48 ms | 52.3% bf16 MFU | 1624652 tok/s step 18261/19560 | loss 3.294351 (+0.06z)| norm 0.2423 (-0.67z)| lr 7.01e-06 | 322.63 ms | 52.3% bf16 MFU | 1624671 tok/s step 18262/19560 | loss 3.235911 (-1.43z)| norm 0.2594 (+0.05z)| lr 7.00e-06 | 322.51 ms | 52.3% bf16 MFU | 1624719 tok/s step 18263/19560 | loss 3.340103 (+1.24z)| norm 0.2967 (+1.60z)| lr 6.98e-06 | 322.96 ms | 52.3% bf16 MFU | 1624653 tok/s step 18264/19560 | loss 3.260531 (-0.80z)| norm 0.2734 (+0.61z)| lr 6.97e-06 | 322.77 ms | 52.3% 
bf16 MFU | 1624636 tok/s step 18265/19560 | loss 3.236544 (-1.42z)| norm 0.3096 (+2.08z)| lr 6.96e-06 | 322.69 ms | 52.3% bf16 MFU | 1624642 tok/s step 18266/19560 | loss 3.242791 (-1.25z)| norm 0.2679 (+0.36z)| lr 6.95e-06 | 323.34 ms | 52.2% bf16 MFU | 1624482 tok/s step 18267/19560 | loss 3.258122 (-0.85z)| norm 0.2376 (-0.88z)| lr 6.94e-06 | 322.50 ms | 52.3% bf16 MFU | 1624544 tok/s step 18268/19560 | loss 3.309216 (+0.45z)| norm 0.2639 (+0.21z)| lr 6.93e-06 | 323.13 ms | 52.2% bf16 MFU | 1624444 tok/s step 18269/19560 | loss 3.251060 (-1.01z)| norm 0.2509 (-0.33z)| lr 6.92e-06 | 322.68 ms | 52.3% bf16 MFU | 1624461 tok/s step 18270/19560 | loss 3.328261 (+0.93z)| norm 0.2524 (-0.27z)| lr 6.91e-06 | 322.94 ms | 52.3% bf16 MFU | 1624412 tok/s step 18271/19560 | loss 3.284261 (-0.19z)| norm 0.2436 (-0.63z)| lr 6.90e-06 | 322.77 ms | 52.3% bf16 MFU | 1624408 tok/s step 18272/19560 | loss 3.221688 (-1.75z)| norm 0.2465 (-0.52z)| lr 6.89e-06 | 323.06 ms | 52.2% bf16 MFU | 1624332 tok/s step 18273/19560 | loss 3.276875 (-0.36z)| norm 0.2532 (-0.24z)| lr 6.88e-06 | 322.07 ms | 52.4% bf16 MFU | 1624510 tok/s step 18274/19560 | loss 3.367501 (+1.88z)| norm 0.2454 (-0.55z)| lr 6.87e-06 | 322.52 ms | 52.3% bf16 MFU | 1624564 tok/s step 18275/19560 | loss 3.273562 (-0.46z)| norm 0.2630 (+0.18z)| lr 6.86e-06 | 322.99 ms | 52.3% bf16 MFU | 1624497 tok/s step 18276/19560 | loss 3.299218 (+0.19z)| norm 0.2676 (+0.37z)| lr 6.85e-06 | 322.66 ms | 52.3% bf16 MFU | 1624517 tok/s step 18277/19560 | loss 3.323662 (+0.81z)| norm 0.2609 (+0.09z)| lr 6.84e-06 | 322.60 ms | 52.3% bf16 MFU | 1624550 tok/s step 18278/19560 | loss 3.354879 (+1.58z)| norm 0.2651 (+0.26z)| lr 6.83e-06 | 322.76 ms | 52.3% bf16 MFU | 1624541 tok/s step 18279/19560 | loss 3.304588 (+0.33z)| norm 0.2568 (-0.09z)| lr 6.81e-06 | 322.98 ms | 52.3% bf16 MFU | 1624479 tok/s step 18280/19560 | loss 3.301710 (+0.26z)| norm 0.2864 (+1.15z)| lr 6.80e-06 | 322.44 ms | 52.3% bf16 MFU | 1624555 tok/s step 18281/19560 | 
loss 3.318192 (+0.67z)| norm 0.2675 (+0.37z)| lr 6.79e-06 | 322.85 ms | 52.3% bf16 MFU | 1624523 tok/s step 18282/19560 | loss 3.269699 (-0.55z)| norm 0.2779 (+0.80z)| lr 6.78e-06 | 322.53 ms | 52.3% bf16 MFU | 1624574 tok/s step 18283/19560 | loss 3.294365 (+0.08z)| norm 0.2525 (-0.28z)| lr 6.77e-06 | 323.23 ms | 52.2% bf16 MFU | 1624447 tok/s step 18284/19560 | loss 3.297517 (+0.15z)| norm 0.2310 (-1.17z)| lr 6.76e-06 | 322.58 ms | 52.3% bf16 MFU | 1624491 tok/s step 18285/19560 | loss 3.240027 (-1.30z)| norm 0.2763 (+0.73z)| lr 6.75e-06 | 322.73 ms | 52.3% bf16 MFU | 1624494 tok/s step 18286/19560 | loss 3.381496 (+2.26z)| norm 0.2433 (-0.65z)| lr 6.74e-06 | 322.98 ms | 52.3% bf16 MFU | 1624433 tok/s step 18287/19560 | loss 3.260010 (-0.79z)| norm 0.2439 (-0.63z)| lr 6.73e-06 | 322.81 ms | 52.3% bf16 MFU | 1624417 tok/s step 18288/19560 | loss 3.285838 (-0.15z)| norm 0.2328 (-1.08z)| lr 6.72e-06 | 322.78 ms | 52.3% bf16 MFU | 1624411 tok/s step 18289/19560 | loss 3.273460 (-0.46z)| norm 0.2884 (+1.23z)| lr 6.71e-06 | 322.78 ms | 52.3% bf16 MFU | 1624405 tok/s step 18290/19560 | loss 3.317614 (+0.64z)| norm 0.2406 (-0.76z)| lr 6.70e-06 | 322.63 ms | 52.3% bf16 MFU | 1624437 tok/s step 18291/19560 | loss 3.362110 (+1.74z)| norm 0.2587 (-0.01z)| lr 6.69e-06 | 322.72 ms | 52.3% bf16 MFU | 1624445 tok/s step 18292/19560 | loss 3.277931 (-0.37z)| norm 0.2708 (+0.49z)| lr 6.68e-06 | 322.59 ms | 52.3% bf16 MFU | 1624486 tok/s step 18293/19560 | loss 3.344641 (+1.31z)| norm 0.3226 (+2.56z)| lr 6.67e-06 | 323.00 ms | 52.3% bf16 MFU | 1624422 tok/s step 18294/19560 | loss 3.356292 (+1.58z)| norm 0.2315 (-1.15z)| lr 6.66e-06 | 323.23 ms | 52.2% bf16 MFU | 1624301 tok/s step 18295/19560 | loss 3.304217 (+0.28z)| norm 0.2722 (+0.49z)| lr 6.65e-06 | 322.50 ms | 52.3% bf16 MFU | 1624372 tok/s step 18296/19560 | loss 3.308629 (+0.41z)| norm 0.2436 (-0.66z)| lr 6.64e-06 | 322.74 ms | 52.3% bf16 MFU | 1624377 tok/s step 18297/19560 | loss 3.303639 (+0.29z)| norm 0.2372 (-0.92z)| 
lr 6.63e-06 | 322.81 ms | 52.3% bf16 MFU | 1624366 tok/s step 18298/19560 | loss 3.290740 (-0.04z)| norm 0.2495 (-0.42z)| lr 6.61e-06 | 322.68 ms | 52.3% bf16 MFU | 1624386 tok/s step 18299/19560 | loss 3.299921 (+0.19z)| norm 0.2658 (+0.24z)| lr 6.60e-06 | 322.42 ms | 52.3% bf16 MFU | 1624473 tok/s step 18300/19560 | loss 3.269761 (-0.59z)| norm 0.2480 (-0.48z)| lr 6.59e-06 | 322.86 ms | 52.3% bf16 MFU | 1624443 tok/s step 18301/19560 | loss 3.322317 (+0.77z)| norm 0.2325 (-1.10z)| lr 6.58e-06 | 322.72 ms | 52.3% bf16 MFU | 1624450 tok/s step 18302/19560 | loss 3.266093 (-0.69z)| norm 0.2351 (-0.99z)| lr 6.57e-06 | 322.83 ms | 52.3% bf16 MFU | 1624429 tok/s step 18303/19560 | loss 3.259406 (-0.88z)| norm 0.2318 (-1.12z)| lr 6.56e-06 | 323.07 ms | 52.2% bf16 MFU | 1624350 tok/s step 18304/19560 | loss 3.297732 (+0.13z)| norm 0.2504 (-0.36z)| lr 6.55e-06 | 322.66 ms | 52.3% bf16 MFU | 1624378 tok/s step 18305/19560 | loss 3.353915 (+1.57z)| norm 0.2623 (+0.12z)| lr 6.54e-06 | 322.47 ms | 52.3% bf16 MFU | 1624452 tok/s step 18306/19560 | loss 3.320873 (+0.71z)| norm 0.2419 (-0.70z)| lr 6.53e-06 | 322.92 ms | 52.3% bf16 MFU | 1624409 tok/s step 18307/19560 | loss 3.277966 (-0.40z)| norm 0.2285 (-1.22z)| lr 6.52e-06 | 322.91 ms | 52.3% bf16 MFU | 1624371 tok/s step 18308/19560 | loss 3.253462 (-1.04z)| norm 0.2481 (-0.44z)| lr 6.51e-06 | 322.46 ms | 52.3% bf16 MFU | 1624447 tok/s step 18309/19560 | loss 3.284571 (-0.23z)| norm 0.2272 (-1.26z)| lr 6.50e-06 | 322.70 ms | 52.3% bf16 MFU | 1624459 tok/s step 18310/19560 | loss 3.244888 (-1.24z)| norm 0.3052 (+1.82z)| lr 6.49e-06 | 322.59 ms | 52.3% bf16 MFU | 1624498 tok/s step 18311/19560 | loss 3.315924 (+0.59z)| norm 0.2449 (-0.55z)| lr 6.48e-06 | 322.11 ms | 52.4% bf16 MFU | 1624657 tok/s step 18312/19560 | loss 3.274022 (-0.48z)| norm 0.2460 (-0.51z)| lr 6.47e-06 | 323.30 ms | 52.2% bf16 MFU | 1624507 tok/s step 18313/19560 | loss 3.264357 (-0.73z)| norm 0.2555 (-0.13z)| lr 6.46e-06 | 322.87 ms | 52.3% bf16 MFU | 
1624474 tok/s step 18314/19560 | loss 3.268197 (-0.63z)| norm 0.2463 (-0.49z)| lr 6.45e-06 | 323.03 ms | 52.2% bf16 MFU | 1624400 tok/s step 18315/19560 | loss 3.270846 (-0.55z)| norm 0.2930 (+1.45z)| lr 6.44e-06 | 322.72 ms | 52.3% bf16 MFU | 1624410 tok/s step 18316/19560 | loss 3.256023 (-0.93z)| norm 0.2334 (-1.02z)| lr 6.43e-06 | 323.09 ms | 52.2% bf16 MFU | 1624325 tok/s step 18317/19560 | loss 3.362989 (+1.81z)| norm 0.2955 (+1.54z)| lr 6.42e-06 | 322.65 ms | 52.3% bf16 MFU | 1624357 tok/s step 18318/19560 | loss 3.343546 (+1.29z)| norm 0.2539 (-0.16z)| lr 6.41e-06 | 322.98 ms | 52.3% bf16 MFU | 1624303 tok/s step 18319/19560 | loss 3.353436 (+1.52z)| norm 0.2804 (+0.98z)| lr 6.40e-06 | 322.57 ms | 52.3% bf16 MFU | 1624354 tok/s step 18320/19560 | loss 3.361312 (+1.69z)| norm 0.2595 (+0.09z)| lr 6.39e-06 | 322.88 ms | 52.3% bf16 MFU | 1624325 tok/s step 18321/19560 | loss 3.315340 (+0.54z)| norm 0.2259 (-1.33z)| lr 6.38e-06 | 322.40 ms | 52.3% bf16 MFU | 1624418 tok/s step 18322/19560 | loss 3.249848 (-1.08z)| norm 0.2426 (-0.62z)| lr 6.37e-06 | 322.69 ms | 52.3% bf16 MFU | 1624434 tok/s step 18323/19560 | loss 3.269919 (-0.58z)| norm 0.2879 (+1.29z)| lr 6.36e-06 | 323.00 ms | 52.3% bf16 MFU | 1624371 tok/s step 18324/19560 | loss 3.218289 (-1.82z)| norm 0.2651 (+0.32z)| lr 6.35e-06 | 322.64 ms | 52.3% bf16 MFU | 1624403 tok/s step 18325/19560 | loss 3.267599 (-0.62z)| norm 0.2447 (-0.54z)| lr 6.34e-06 | 322.96 ms | 52.3% bf16 MFU | 1624353 tok/s step 18326/19560 | loss 3.268054 (-0.60z)| norm 0.2586 (+0.04z)| lr 6.33e-06 | 322.71 ms | 52.3% bf16 MFU | 1624367 tok/s step 18327/19560 | loss 3.279229 (-0.33z)| norm 0.2398 (-0.74z)| lr 6.32e-06 | 323.03 ms | 52.2% bf16 MFU | 1624301 tok/s step 18328/19560 | loss 3.277220 (-0.38z)| norm 0.2552 (-0.09z)| lr 6.31e-06 | 323.09 ms | 52.2% bf16 MFU | 1624222 tok/s step 18329/19560 | loss 3.316353 (+0.57z)| norm 0.2904 (+1.38z)| lr 6.30e-06 | 322.71 ms | 52.3% bf16 MFU | 1624243 tok/s step 18330/19560 | loss 3.249847 
(-1.07z)| norm 0.2611 (+0.15z)| lr 6.28e-06 | 323.14 ms | 52.2% bf16 MFU | 1624156 tok/s step 18331/19560 | loss 3.257603 (-0.87z)| norm 0.2592 (+0.06z)| lr 6.27e-06 | 322.65 ms | 52.3% bf16 MFU | 1624194 tok/s step 18332/19560 | loss 3.408097 (+2.74z)| norm 0.3483 (+3.88z)| lr 6.26e-06 | 323.14 ms | 52.2% bf16 MFU | 1624110 tok/s step 18333/19560 | loss 3.260533 (-0.79z)| norm 0.2319 (-1.09z)| lr 6.25e-06 | 323.13 ms | 52.2% bf16 MFU | 1624031 tok/s step 18334/19560 | loss 3.251353 (-0.99z)| norm 0.2361 (-0.89z)| lr 6.24e-06 | 322.39 ms | 52.3% bf16 MFU | 1624142 tok/s step 18335/19560 | loss 3.298623 (+0.13z)| norm 0.2322 (-1.05z)| lr 6.23e-06 | 322.98 ms | 52.3% bf16 MFU | 1624100 tok/s step 18336/19560 | loss 3.256927 (-0.85z)| norm 0.2402 (-0.70z)| lr 6.22e-06 | 322.98 ms | 52.3% bf16 MFU | 1624059 tok/s step 18337/19560 | loss 3.295744 (+0.06z)| norm 0.2651 (+0.41z)| lr 6.21e-06 | 323.45 ms | 52.2% bf16 MFU | 1623903 tok/s step 18338/19560 | loss 3.254398 (-0.91z)| norm 0.2820 (+1.15z)| lr 6.20e-06 | 322.80 ms | 52.3% bf16 MFU | 1623917 tok/s step 18339/19560 | loss 3.424999 (+3.05z)| norm 0.2435 (-0.55z)| lr 6.19e-06 | 322.52 ms | 52.3% bf16 MFU | 1624001 tok/s step 18340/19560 | loss 3.325910 (+0.75z)| norm 0.2517 (-0.18z)| lr 6.18e-06 | 322.82 ms | 52.3% bf16 MFU | 1624005 tok/s step 18341/19560 | loss 3.217882 (-1.71z)| norm 0.2359 (-0.87z)| lr 6.17e-06 | 323.25 ms | 52.2% bf16 MFU | 1623902 tok/s step 18342/19560 | loss 3.332915 (+0.91z)| norm 0.2277 (-1.23z)| lr 6.16e-06 | 322.99 ms | 52.3% bf16 MFU | 1623868 tok/s step 18343/19560 | loss 3.368410 (+1.84z)| norm 0.3411 (+3.72z)| lr 6.15e-06 | 323.24 ms | 52.2% bf16 MFU | 1623774 tok/s step 18344/19560 | loss 3.290692 (-0.03z)| norm 0.2500 (-0.23z)| lr 6.14e-06 | 322.86 ms | 52.3% bf16 MFU | 1623781 tok/s step 18345/19560 | loss 3.339978 (+1.14z)| norm 0.2449 (-0.45z)| lr 6.13e-06 | 323.26 ms | 52.2% bf16 MFU | 1623685 tok/s step 18346/19560 | loss 3.300889 (+0.20z)| norm 0.2603 (+0.21z)| lr 6.12e-06 | 
323.33 ms | 52.2% bf16 MFU | 1623577 tok/s step 18347/19560 | loss 3.304136 (+0.28z)| norm 0.2499 (-0.24z)| lr 6.11e-06 | 323.15 ms | 52.2% bf16 MFU | 1623520 tok/s step 18348/19560 | loss 3.267310 (-0.61z)| norm 0.2517 (-0.14z)| lr 6.10e-06 | 323.24 ms | 52.2% bf16 MFU | 1623442 tok/s step 18349/19560 | loss 3.287941 (-0.12z)| norm 0.2700 (+0.68z)| lr 6.09e-06 | 322.90 ms | 52.3% bf16 MFU | 1623453 tok/s step 18350/19560 | loss 3.284806 (-0.20z)| norm 0.2327 (-1.00z)| lr 6.08e-06 | 322.54 ms | 52.3% bf16 MFU | 1623555 tok/s step 18351/19560 | loss 3.299520 (+0.15z)| norm 0.2437 (-0.50z)| lr 6.07e-06 | 322.73 ms | 52.3% bf16 MFU | 1623605 tok/s step 18352/19560 | loss 3.271447 (-0.53z)| norm 0.2319 (-1.02z)| lr 6.06e-06 | 323.47 ms | 52.2% bf16 MFU | 1623466 tok/s step 18353/19560 | loss 3.360278 (+1.61z)| norm 0.2448 (-0.43z)| lr 6.05e-06 | 323.14 ms | 52.2% bf16 MFU | 1623415 tok/s step 18354/19560 | loss 3.275997 (-0.41z)| norm 0.2701 (+0.72z)| lr 6.04e-06 | 322.43 ms | 52.3% bf16 MFU | 1623548 tok/s step 18355/19560 | loss 3.322173 (+0.69z)| norm 0.2482 (-0.28z)| lr 6.03e-06 | 323.65 ms | 52.1% bf16 MFU | 1623366 tok/s step 18356/19560 | loss 3.301872 (+0.19z)| norm 0.2561 (+0.08z)| lr 6.02e-06 | 323.15 ms | 52.2% bf16 MFU | 1623320 tok/s step 18357/19560 | loss 3.270076 (-0.58z)| norm 0.2546 (+0.02z)| lr 6.01e-06 | 323.17 ms | 52.2% bf16 MFU | 1623271 tok/s step 18358/19560 | loss 3.322970 (+0.70z)| norm 0.3161 (+2.72z)| lr 6.00e-06 | 322.66 ms | 52.3% bf16 MFU | 1623352 tok/s step 18359/19560 | loss 3.298923 (+0.12z)| norm 0.2493 (-0.25z)| lr 5.99e-06 | 322.70 ms | 52.3% bf16 MFU | 1623420 tok/s step 18360/19560 | loss 3.288564 (-0.14z)| norm 0.2290 (-1.15z)| lr 5.98e-06 | 324.16 ms | 52.1% bf16 MFU | 1623118 tok/s step 18361/19560 | loss 3.314997 (+0.50z)| norm 0.2674 (+0.55z)| lr 5.97e-06 | 323.07 ms | 52.2% bf16 MFU | 1623103 tok/s step 18362/19560 | loss 3.355330 (+1.47z)| norm 0.2784 (+1.02z)| lr 5.96e-06 | 323.58 ms | 52.2% bf16 MFU | 1622961 tok/s step 
18363/19560 | loss 3.269561 (-0.65z)| norm 0.2396 (-0.70z)| lr 5.95e-06 | 323.01 ms | 52.2% bf16 MFU | 1622969 tok/s step 18364/19560 | loss 3.360429 (+1.57z)| norm 0.2628 (+0.32z)| lr 5.94e-06 | 323.09 ms | 52.2% bf16 MFU | 1622956 tok/s step 18365/19560 | loss 3.284696 (-0.29z)| norm 0.2349 (-0.91z)| lr 5.93e-06 | 322.74 ms | 52.3% bf16 MFU | 1623034 tok/s step 18366/19560 | loss 3.312004 (+0.39z)| norm 0.3296 (+3.14z)| lr 5.92e-06 | 323.22 ms | 52.2% bf16 MFU | 1622986 tok/s step 18367/19560 | loss 3.218547 (-1.88z)| norm 0.2498 (-0.25z)| lr 5.91e-06 | 323.09 ms | 52.2% bf16 MFU | 1622974 tok/s step 18368/19560 | loss 3.259454 (-0.87z)| norm 0.2263 (-1.24z)| lr 5.90e-06 | 322.73 ms | 52.3% bf16 MFU | 1623053 tok/s step 18369/19560 | loss 3.278793 (-0.40z)| norm 0.2312 (-1.03z)| lr 5.89e-06 | 323.03 ms | 52.2% bf16 MFU | 1623052 tok/s step 18370/19560 | loss 3.245700 (-1.20z)| norm 0.2369 (-0.78z)| lr 5.88e-06 | 322.75 ms | 52.3% bf16 MFU | 1623122 tok/s step 18371/19560 | loss 3.336880 (+0.99z)| norm 0.3105 (+2.28z)| lr 5.87e-06 | 322.98 ms | 52.3% bf16 MFU | 1623130 tok/s step 18372/19560 | loss 3.283385 (-0.31z)| norm 0.2393 (-0.68z)| lr 5.86e-06 | 323.04 ms | 52.2% bf16 MFU | 1623122 tok/s step 18373/19560 | loss 3.478818 (+4.09z)| norm 0.2874 (+1.30z)| lr 5.85e-06 | 322.91 ms | 52.3% bf16 MFU | 1623148 tok/s step 18374/19560 | loss 3.292300 (-0.11z)| norm 0.2467 (-0.40z)| lr 5.85e-06 | 322.90 ms | 52.3% bf16 MFU | 1623174 tok/s step 18375/19560 | loss 3.279347 (-0.40z)| norm 0.2587 (+0.09z)| lr 5.84e-06 | 322.78 ms | 52.3% bf16 MFU | 1623231 tok/s step 18376/19560 | loss 3.212135 (-1.88z)| norm 0.2686 (+0.50z)| lr 5.83e-06 | 322.98 ms | 52.3% bf16 MFU | 1623234 tok/s step 18377/19560 | loss 3.327749 (+0.71z)| norm 0.2677 (+0.46z)| lr 5.82e-06 | 322.76 ms | 52.3% bf16 MFU | 1623293 tok/s step 18378/19560 | loss 3.264473 (-0.71z)| norm 0.2501 (-0.28z)| lr 5.81e-06 | 322.93 ms | 52.3% bf16 MFU | 1623305 tok/s step 18379/19560 | loss 3.335409 (+0.90z)| norm 
0.2574 (+0.02z)| lr 5.80e-06 | 322.86 ms | 52.3% bf16 MFU | 1623333 tok/s step 18380/19560 | loss 3.282865 (-0.30z)| norm 0.3651 (+4.18z)| lr 5.79e-06 | 323.05 ms | 52.2% bf16 MFU | 1623314 tok/s step 18381/19560 | loss 3.266394 (-0.66z)| norm 0.3081 (+1.92z)| lr 5.78e-06 | 322.95 ms | 52.3% bf16 MFU | 1623319 tok/s step 18382/19560 | loss 3.303203 (+0.17z)| norm 0.2617 (+0.13z)| lr 5.77e-06 | 322.69 ms | 52.3% bf16 MFU | 1623389 tok/s step 18383/19560 | loss 3.279065 (-0.38z)| norm 0.2522 (-0.24z)| lr 5.76e-06 | 322.80 ms | 52.3% bf16 MFU | 1623430 tok/s step 18384/19560 | loss 3.306507 (+0.23z)| norm 0.2460 (-0.47z)| lr 5.75e-06 | 322.93 ms | 52.3% bf16 MFU | 1623435 tok/s step 18385/19560 | loss 3.321968 (+0.58z)| norm 0.2424 (-0.62z)| lr 5.74e-06 | 322.80 ms | 52.3% bf16 MFU | 1623473 tok/s step 18386/19560 | loss 3.314429 (+0.43z)| norm 0.2475 (-0.42z)| lr 5.73e-06 | 322.59 ms | 52.3% bf16 MFU | 1623561 tok/s step 18387/19560 | loss 3.283377 (-0.29z)| norm 0.2958 (+1.42z)| lr 5.72e-06 | 322.78 ms | 52.3% bf16 MFU | 1623598 tok/s step 18388/19560 | loss 3.271303 (-0.57z)| norm 0.2558 (-0.12z)| lr 5.71e-06 | 322.28 ms | 52.4% bf16 MFU | 1623759 tok/s step 18389/19560 | loss 3.251791 (-1.02z)| norm 0.2691 (+0.38z)| lr 5.70e-06 | 323.13 ms | 52.2% bf16 MFU | 1623697 tok/s step 18390/19560 | loss 3.274949 (-0.49z)| norm 0.2633 (+0.16z)| lr 5.69e-06 | 322.91 ms | 52.3% bf16 MFU | 1623694 tok/s step 18391/19560 | loss 3.297174 (+0.04z)| norm 0.2538 (-0.20z)| lr 5.68e-06 | 322.87 ms | 52.3% bf16 MFU | 1623702 tok/s step 18392/19560 | loss 3.247722 (-1.12z)| norm 0.3251 (+2.49z)| lr 5.67e-06 | 322.53 ms | 52.3% bf16 MFU | 1623793 tok/s step 18393/19560 | loss 3.338940 (+1.01z)| norm 0.2685 (+0.36z)| lr 5.66e-06 | 322.65 ms | 52.3% bf16 MFU | 1623850 tok/s step 18394/19560 | loss 3.285440 (-0.26z)| norm 0.2620 (+0.12z)| lr 5.65e-06 | 322.99 ms | 52.3% bf16 MFU | 1623818 tok/s step 18395/19560 | loss 3.281372 (-0.36z)| norm 0.2307 (-1.09z)| lr 5.64e-06 | 322.81 ms | 
52.3% bf16 MFU | 1623834 tok/s step 18396/19560 | loss 3.363778 (+1.57z)| norm 0.2421 (-0.64z)| lr 5.63e-06 | 322.76 ms | 52.3% bf16 MFU | 1623862 tok/s step 18397/19560 | loss 3.242887 (-1.27z)| norm 0.2867 (+1.06z)| lr 5.62e-06 | 322.84 ms | 52.3% bf16 MFU | 1623869 tok/s step 18398/19560 | loss 3.249028 (-1.11z)| norm 0.3093 (+1.88z)| lr 5.61e-06 | 322.97 ms | 52.3% bf16 MFU | 1623844 tok/s step 18399/19560 | loss 3.307799 (+0.26z)| norm 0.2645 (+0.19z)| lr 5.60e-06 | 322.58 ms | 52.3% bf16 MFU | 1623915 tok/s step 18400/19560 | loss 3.277786 (-0.46z)| norm 0.2697 (+0.37z)| lr 5.59e-06 | 322.78 ms | 52.3% bf16 MFU | 1623933 tok/s step 18401/19560 | loss 3.221102 (-1.77z)| norm 0.2281 (-1.18z)| lr 5.58e-06 | 322.70 ms | 52.3% bf16 MFU | 1623970 tok/s step 18402/19560 | loss 3.251251 (-1.05z)| norm 0.2498 (-0.37z)| lr 5.57e-06 | 323.17 ms | 52.2% bf16 MFU | 1623889 tok/s step 18403/19560 | loss 3.239295 (-1.32z)| norm 0.2930 (+1.24z)| lr 5.56e-06 | 322.54 ms | 52.3% bf16 MFU | 1623970 tok/s step 18404/19560 | loss 3.323065 (+0.64z)| norm 0.2975 (+1.39z)| lr 5.55e-06 | 322.44 ms | 52.3% bf16 MFU | 1624072 tok/s step 18405/19560 | loss 3.409681 (+2.59z)| norm 0.3177 (+2.08z)| lr 5.54e-06 | 323.00 ms | 52.3% bf16 MFU | 1624026 tok/s step 18406/19560 | loss 3.295321 (-0.01z)| norm 0.2631 (+0.09z)| lr 5.54e-06 | 322.46 ms | 52.3% bf16 MFU | 1624121 tok/s step 18407/19560 | loss 3.364992 (+1.56z)| norm 0.2476 (-0.47z)| lr 5.53e-06 | 322.95 ms | 52.3% bf16 MFU | 1624086 tok/s step 18408/19560 | loss 3.284407 (-0.27z)| norm 0.2704 (+0.37z)| lr 5.52e-06 | 322.10 ms | 52.4% bf16 MFU | 1624267 tok/s step 18409/19560 | loss 3.312747 (+0.38z)| norm 0.2639 (+0.13z)| lr 5.51e-06 | 322.67 ms | 52.3% bf16 MFU | 1624297 tok/s step 18410/19560 | loss 3.260920 (-0.80z)| norm 0.2712 (+0.40z)| lr 5.50e-06 | 322.66 ms | 52.3% bf16 MFU | 1624327 tok/s step 18411/19560 | loss 3.316608 (+0.46z)| norm 0.3339 (+2.60z)| lr 5.49e-06 | 322.92 ms | 52.3% bf16 MFU | 1624290 tok/s step 18412/19560 
| loss 3.276276 (-0.45z)| norm 0.2777 (+0.59z)| lr 5.48e-06 | 322.34 ms | 52.4% bf16 MFU | 1624400 tok/s step 18413/19560 | loss 3.282557 (-0.32z)| norm 0.2707 (+0.34z)| lr 5.47e-06 | 323.62 ms | 52.2% bf16 MFU | 1624184 tok/s step 18414/19560 | loss 3.276243 (-0.45z)| norm 0.2412 (-0.71z)| lr 5.46e-06 | 322.17 ms | 52.4% bf16 MFU | 1624344 tok/s step 18415/19560 | loss 3.279338 (-0.38z)| norm 0.2685 (+0.25z)| lr 5.45e-06 | 322.79 ms | 52.3% bf16 MFU | 1624339 tok/s step 18416/19560 | loss 3.269461 (-0.61z)| norm 0.3112 (+1.74z)| lr 5.44e-06 | 323.34 ms | 52.2% bf16 MFU | 1624197 tok/s step 18417/19560 | loss 3.397789 (+2.30z)| norm 0.3295 (+2.34z)| lr 5.43e-06 | 322.60 ms | 52.3% bf16 MFU | 1624247 tok/s step 18418/19560 | loss 3.290841 (-0.13z)| norm 0.2482 (-0.49z)| lr 5.42e-06 | 322.47 ms | 52.3% bf16 MFU | 1624326 tok/s step 18419/19560 | loss 3.255232 (-0.92z)| norm 0.2361 (-0.91z)| lr 5.41e-06 | 322.74 ms | 52.3% bf16 MFU | 1624334 tok/s step 18420/19560 | loss 3.275018 (-0.47z)| norm 0.2415 (-0.71z)| lr 5.40e-06 | 322.99 ms | 52.3% bf16 MFU | 1624278 tok/s step 18421/19560 | loss 3.256689 (-0.87z)| norm 0.2360 (-0.89z)| lr 5.39e-06 | 323.03 ms | 52.2% bf16 MFU | 1624216 tok/s step 18422/19560 | loss 3.327988 (+0.77z)| norm 0.3260 (+2.22z)| lr 5.38e-06 | 322.79 ms | 52.3% bf16 MFU | 1624216 tok/s step 18423/19560 | loss 3.311656 (+0.39z)| norm 0.2715 (+0.33z)| lr 5.37e-06 | 322.47 ms | 52.3% bf16 MFU | 1624298 tok/s step 18424/19560 | loss 3.264791 (-0.68z)| norm 0.2387 (-0.81z)| lr 5.36e-06 | 323.19 ms | 52.2% bf16 MFU | 1624196 tok/s step 18425/19560 | loss 3.275331 (-0.43z)| norm 0.2908 (+0.98z)| lr 5.36e-06 | 323.03 ms | 52.2% bf16 MFU | 1624137 tok/s step 18426/19560 | loss 3.314532 (+0.47z)| norm 0.2434 (-0.66z)| lr 5.35e-06 | 322.79 ms | 52.3% bf16 MFU | 1624142 tok/s step 18427/19560 | loss 3.207500 (-1.95z)| norm 0.2669 (+0.16z)| lr 5.34e-06 | 322.68 ms | 52.3% bf16 MFU | 1624174 tok/s step 18428/19560 | loss 3.289510 (-0.09z)| norm 0.2433 (-0.66z)| 
lr 5.33e-06 | 323.36 ms | 52.2% bf16 MFU | 1624034 tok/s step 18429/19560 | loss 3.286589 (-0.15z)| norm 0.2959 (+1.14z)| lr 5.32e-06 | 322.47 ms | 52.3% bf16 MFU | 1624126 tok/s step 18430/19560 | loss 3.305709 (+0.27z)| norm 0.2503 (-0.44z)| lr 5.31e-06 | 322.78 ms | 52.3% bf16 MFU | 1624133 tok/s step 18431/19560 | loss 3.300078 (+0.14z)| norm 0.2239 (-1.35z)| lr 5.30e-06 | 323.34 ms | 52.2% bf16 MFU | 1624000 tok/s step 18432/19560 | loss 3.303061 (+0.20z)| norm 0.2510 (-0.41z)| lr 5.29e-06 | 322.91 ms | 52.3% bf16 MFU | 1623981 tok/s step 18433/19560 | loss 3.269316 (-0.55z)| norm 0.2427 (-0.69z)| lr 5.28e-06 | 322.91 ms | 52.3% bf16 MFU | 1623963 tok/s step 18434/19560 | loss 3.290096 (-0.07z)| norm 0.2840 (+0.72z)| lr 5.27e-06 | 322.47 ms | 52.3% bf16 MFU | 1624058 tok/s step 18435/19560 | loss 3.387055 (+2.10z)| norm 0.2488 (-0.50z)| lr 5.26e-06 | 322.82 ms | 52.3% bf16 MFU | 1624058 tok/s step 18436/19560 | loss 3.312830 (+0.41z)| norm 0.2479 (-0.53z)| lr 5.25e-06 | 323.10 ms | 52.2% bf16 MFU | 1623989 tok/s step 18437/19560 | loss 3.264040 (-0.68z)| norm 0.2665 (+0.10z)| lr 5.24e-06 | 322.68 ms | 52.3% bf16 MFU | 1624029 tok/s step 18438/19560 | loss 3.220165 (-1.66z)| norm 0.2697 (+0.22z)| lr 5.23e-06 | 322.81 ms | 52.3% bf16 MFU | 1624033 tok/s step 18439/19560 | loss 3.239778 (-1.20z)| norm 0.2395 (-0.83z)| lr 5.22e-06 | 322.64 ms | 52.3% bf16 MFU | 1624082 tok/s step 18440/19560 | loss 3.237373 (-1.24z)| norm 0.2800 (+0.58z)| lr 5.22e-06 | 322.38 ms | 52.4% bf16 MFU | 1624192 tok/s step 18441/19560 | loss 3.303757 (+0.23z)| norm 0.2809 (+0.60z)| lr 5.21e-06 | 322.66 ms | 52.3% bf16 MFU | 1624228 tok/s step 18442/19560 | loss 3.330583 (+0.81z)| norm 0.2877 (+0.83z)| lr 5.20e-06 | 323.04 ms | 52.2% bf16 MFU | 1624166 tok/s step 18443/19560 | loss 3.285138 (-0.20z)| norm 0.2482 (-0.54z)| lr 5.19e-06 | 322.74 ms | 52.3% bf16 MFU | 1624183 tok/s step 18444/19560 | loss 3.246878 (-1.05z)| norm 0.2598 (-0.14z)| lr 5.18e-06 | 322.70 ms | 52.3% bf16 MFU | 
1624209 tok/s step 18445/19560 | loss 3.318507 (+0.55z)| norm 0.2691 (+0.19z)| lr 5.17e-06 | 322.57 ms | 52.3% bf16 MFU | 1624265 tok/s step 18446/19560 | loss 3.268420 (-0.56z)| norm 0.3006 (+1.29z)| lr 5.16e-06 | 322.82 ms | 52.3% bf16 MFU | 1624257 tok/s step 18447/19560 | loss 3.308509 (+0.36z)| norm 0.2432 (-0.72z)| lr 5.15e-06 | 322.87 ms | 52.3% bf16 MFU | 1624237 tok/s step 18448/19560 | loss 3.252399 (-0.90z)| norm 0.3121 (+1.67z)| lr 5.14e-06 | 322.28 ms | 52.4% bf16 MFU | 1624366 tok/s step 18449/19560 | loss 3.265692 (-0.59z)| norm 0.2405 (-0.83z)| lr 5.13e-06 | 322.57 ms | 52.3% bf16 MFU | 1624415 tok/s step 18450/19560 | loss 3.347256 (+1.25z)| norm 0.3066 (+1.45z)| lr 5.12e-06 | 322.32 ms | 52.4% bf16 MFU | 1624525 tok/s step 18451/19560 | loss 3.282042 (-0.24z)| norm 0.2342 (-1.05z)| lr 5.11e-06 | 322.94 ms | 52.3% bf16 MFU | 1624472 tok/s step 18452/19560 | loss 3.317499 (+0.56z)| norm 0.2923 (+0.96z)| lr 5.10e-06 | 322.93 ms | 52.3% bf16 MFU | 1624425 tok/s step 18453/19560 | loss 3.256329 (-0.84z)| norm 0.2788 (+0.48z)| lr 5.10e-06 | 322.73 ms | 52.3% bf16 MFU | 1624431 tok/s step 18454/19560 | loss 3.290169 (-0.07z)| norm 0.2442 (-0.71z)| lr 5.09e-06 | 322.80 ms | 52.3% bf16 MFU | 1624420 tok/s step 18455/19560 | loss 3.308051 (+0.33z)| norm 0.2361 (-0.99z)| lr 5.08e-06 | 322.67 ms | 52.3% bf16 MFU | 1624441 tok/s step 18456/19560 | loss 3.299426 (+0.13z)| norm 0.2513 (-0.46z)| lr 5.07e-06 | 322.40 ms | 52.3% bf16 MFU | 1624529 tok/s step 18457/19560 | loss 3.289407 (-0.09z)| norm 0.2534 (-0.38z)| lr 5.06e-06 | 322.82 ms | 52.3% bf16 MFU | 1624507 tok/s step 18458/19560 | loss 3.295923 (+0.05z)| norm 0.2654 (+0.04z)| lr 5.05e-06 | 322.49 ms | 52.3% bf16 MFU | 1624569 tok/s step 18459/19560 | loss 3.265481 (-0.66z)| norm 0.2667 (+0.08z)| lr 5.04e-06 | 322.40 ms | 52.3% bf16 MFU | 1624650 tok/s step 18460/19560 | loss 3.285783 (-0.17z)| norm 0.2448 (-0.67z)| lr 5.03e-06 | 322.48 ms | 52.3% bf16 MFU | 1624708 tok/s step 18461/19560 | loss 3.319207 
(+0.61z)| norm 0.2369 (-0.96z)| lr 5.02e-06 | 323.16 ms | 52.2% bf16 MFU | 1624592 tok/s step 18462/19560 | loss 3.283206 (-0.25z)| norm 0.2627 (-0.04z)| lr 5.01e-06 | 322.73 ms | 52.3% bf16 MFU | 1624590 tok/s step 18463/19560 | loss 3.338814 (+1.07z)| norm 0.2610 (-0.11z)| lr 5.00e-06 | 322.83 ms | 52.3% bf16 MFU | 1624563 tok/s step 18464/19560 | loss 3.262945 (-0.74z)| norm 0.2310 (-1.19z)| lr 4.99e-06 | 322.67 ms | 52.3% bf16 MFU | 1624576 tok/s step 18465/19560 | loss 3.288380 (-0.13z)| norm 0.2805 (+0.59z)| lr 4.99e-06 | 322.70 ms | 52.3% bf16 MFU | 1624583 tok/s step 18466/19560 | loss 3.304446 (+0.24z)| norm 0.2367 (-0.97z)| lr 4.98e-06 | 322.81 ms | 52.3% bf16 MFU | 1624561 tok/s step 18467/19560 | loss 3.312589 (+0.47z)| norm 0.2334 (-1.08z)| lr 4.97e-06 | 322.54 ms | 52.3% bf16 MFU | 1624607 tok/s step 18468/19560 | loss 3.341743 (+1.19z)| norm 0.2800 (+0.57z)| lr 4.96e-06 | 323.03 ms | 52.2% bf16 MFU | 1624527 tok/s step 18469/19560 | loss 3.302727 (+0.21z)| norm 0.2374 (-0.95z)| lr 4.95e-06 | 322.54 ms | 52.3% bf16 MFU | 1624576 tok/s step 18470/19560 | loss 3.302942 (+0.22z)| norm 0.2973 (+1.17z)| lr 4.94e-06 | 322.66 ms | 52.3% bf16 MFU | 1624591 tok/s step 18471/19560 | loss 3.251067 (-1.07z)| norm 0.2769 (+0.48z)| lr 4.93e-06 | 322.73 ms | 52.3% bf16 MFU | 1624588 tok/s step 18472/19560 | loss 3.357816 (+1.61z)| norm 0.2308 (-1.22z)| lr 4.92e-06 | 322.75 ms | 52.3% bf16 MFU | 1624580 tok/s step 18473/19560 | loss 3.189945 (-2.52z)| norm 0.2561 (-0.29z)| lr 4.91e-06 | 322.54 ms | 52.3% bf16 MFU | 1624627 tok/s step 18474/19560 | loss 3.265759 (-0.65z)| norm 0.2845 (+0.75z)| lr 4.90e-06 | 323.00 ms | 52.3% bf16 MFU | 1624554 tok/s step 18475/19560 | loss 3.257347 (-0.85z)| norm 0.2741 (+0.36z)| lr 4.90e-06 | 322.71 ms | 52.3% bf16 MFU | 1624559 tok/s step 18476/19560 | loss 3.274459 (-0.43z)| norm 0.2366 (-1.01z)| lr 4.89e-06 | 322.72 ms | 52.3% bf16 MFU | 1624562 tok/s step 18477/19560 | loss 3.280415 (-0.28z)| norm 0.2331 (-1.12z)| lr 4.88e-06 | 
322.72 ms | 52.3% bf16 MFU | 1624564 tok/s step 18478/19560 | loss 3.305898 (+0.34z)| norm 0.2587 (-0.20z)| lr 4.87e-06 | 322.46 ms | 52.3% bf16 MFU | 1624630 tok/s step 18479/19560 | loss 3.297201 (+0.13z)| norm 0.2814 (+0.62z)| lr 4.86e-06 | 323.09 ms | 52.2% bf16 MFU | 1624536 tok/s step 18480/19560 | loss 3.299361 (+0.18z)| norm 0.2419 (-0.83z)| lr 4.85e-06 | 322.71 ms | 52.3% bf16 MFU | 1624542 tok/s step 18481/19560 | loss 3.252047 (-0.97z)| norm 0.2458 (-0.69z)| lr 4.84e-06 | 322.69 ms | 52.3% bf16 MFU | 1624552 tok/s step 18482/19560 | loss 3.295097 (+0.09z)| norm 0.3022 (+1.37z)| lr 4.83e-06 | 323.04 ms | 52.2% bf16 MFU | 1624472 tok/s step 18483/19560 | loss 3.249091 (-1.03z)| norm 0.2380 (-0.97z)| lr 4.82e-06 | 322.80 ms | 52.3% bf16 MFU | 1624457 tok/s step 18484/19560 | loss 3.297015 (+0.15z)| norm 0.2530 (-0.42z)| lr 4.81e-06 | 322.42 ms | 52.3% bf16 MFU | 1624538 tok/s step 18485/19560 | loss 3.421580 (+3.08z)| norm 0.2653 (+0.02z)| lr 4.81e-06 | 322.87 ms | 52.3% bf16 MFU | 1624504 tok/s step 18486/19560 | loss 3.254136 (-0.89z)| norm 0.2405 (-0.87z)| lr 4.80e-06 | 322.76 ms | 52.3% bf16 MFU | 1624498 tok/s step 18487/19560 | loss 3.268581 (-0.54z)| norm 0.2486 (-0.57z)| lr 4.79e-06 | 322.69 ms | 52.3% bf16 MFU | 1624510 tok/s step 18488/19560 | loss 3.292357 (+0.02z)| norm 0.2368 (-1.01z)| lr 4.78e-06 | 322.85 ms | 52.3% bf16 MFU | 1624482 tok/s step 18489/19560 | loss 3.247189 (-1.03z)| norm 0.2606 (-0.13z)| lr 4.77e-06 | 323.17 ms | 52.2% bf16 MFU | 1624374 tok/s step 18490/19560 | loss 3.295195 (+0.12z)| norm 0.2534 (-0.39z)| lr 4.76e-06 | 322.35 ms | 52.4% bf16 MFU | 1624478 tok/s step 18491/19560 | loss 3.271513 (-0.45z)| norm 0.2442 (-0.73z)| lr 4.75e-06 | 322.67 ms | 52.3% bf16 MFU | 1624497 tok/s step 18492/19560 | loss 3.262744 (-0.65z)| norm 0.2782 (+0.52z)| lr 4.74e-06 | 323.56 ms | 52.2% bf16 MFU | 1624292 tok/s step 18493/19560 | loss 3.303837 (+0.34z)| norm 0.2401 (-0.89z)| lr 4.73e-06 | 323.21 ms | 52.2% bf16 MFU | 1624182 tok/s step 
18494/19560 | loss 3.247440 (-1.00z)| norm 0.2406 (-0.86z)| lr 4.73e-06 | 322.83 ms | 52.3% bf16 MFU | 1624176 tok/s step 18495/19560 | loss 3.265857 (-0.58z)| norm 0.2652 (+0.06z)| lr 4.72e-06 | 323.01 ms | 52.3% bf16 MFU | 1624125 tok/s step 18496/19560 | loss 3.305226 (+0.37z)| norm 0.2576 (-0.24z)| lr 4.71e-06 | 322.51 ms | 52.3% bf16 MFU | 1624201 tok/s step 18497/19560 | loss 3.332416 (+1.02z)| norm 0.2351 (-1.10z)| lr 4.70e-06 | 323.02 ms | 52.2% bf16 MFU | 1624145 tok/s step 18498/19560 | loss 3.311229 (+0.50z)| norm 0.3477 (+3.06z)| lr 4.69e-06 | 323.39 ms | 52.2% bf16 MFU | 1623998 tok/s step 18499/19560 | loss 3.284923 (-0.14z)| norm 0.2346 (-1.09z)| lr 4.68e-06 | 322.73 ms | 52.3% bf16 MFU | 1624026 tok/s step 18500/19560 | loss 3.218974 (-1.72z)| norm 0.2413 (-0.85z)| lr 4.67e-06 | 322.66 ms | 52.3% bf16 MFU | 1624069 tok/s val loss 3.275321
HellaSwag: 3046/10042 = 0.303326 step 18501/19560 | loss 3.295618 (+0.19z)| norm 0.2338 (-1.11z)| lr 4.66e-06 | 322.80 ms | 52.3% bf16 MFU | 1624074 tok/s step 18502/19560 | loss 3.267593 (-0.55z)| norm 0.2478 (-0.59z)| lr 4.66e-06 | 322.37 ms | 52.4% bf16 MFU | 1624188 tok/s step 18503/19560 | loss 3.271066 (-0.45z)| norm 0.2408 (-0.84z)| lr 4.65e-06 | 322.76 ms | 52.3% bf16 MFU | 1624198 tok/s step 18504/19560 | loss 3.266036 (-0.61z)| norm 0.2585 (-0.18z)| lr 4.64e-06 | 322.91 ms | 52.3% bf16 MFU | 1624170 tok/s step 18505/19560 | loss 3.250096 (-1.02z)| norm 0.2381 (-0.93z)| lr 4.63e-06 | 322.73 ms | 52.3% bf16 MFU | 1624188 tok/s step 18506/19560 | loss 3.265188 (-0.61z)| norm 0.2646 (+0.04z)| lr 4.62e-06 | 323.36 ms | 52.2% bf16 MFU | 1624046 tok/s step 18507/19560 | loss 3.306887 (+0.51z)| norm 0.2381 (-0.92z)| lr 4.61e-06 | 323.05 ms | 52.2% bf16 MFU | 1623990 tok/s step 18508/19560 | loss 3.249743 (-1.02z)| norm 0.2368 (-0.99z)| lr 4.60e-06 | 322.85 ms | 52.3% bf16 MFU | 1623987 tok/s step 18509/19560 | loss 3.255617 (-0.85z)| norm 0.2320 (-1.16z)| lr 4.59e-06 | 323.06 ms | 52.2% bf16 MFU | 1623932 tok/s step 18510/19560 | loss 3.522037 (+5.46z)| norm 0.3349 (+2.75z)| lr 4.59e-06 | 322.09 ms | 52.4% bf16 MFU | 1624122 tok/s step 18511/19560 | loss 3.233483 (-1.29z)| norm 0.2455 (-0.63z)| lr 4.58e-06 | 322.41 ms | 52.3% bf16 MFU | 1624223 tok/s step 18512/19560 | loss 3.252504 (-0.84z)| norm 0.2724 (+0.38z)| lr 4.57e-06 | 322.22 ms | 52.4% bf16 MFU | 1624367 tok/s step 18513/19560 | loss 3.307722 (+0.45z)| norm 0.2648 (+0.09z)| lr 4.56e-06 | 322.60 ms | 52.3% bf16 MFU | 1624409 tok/s step 18514/19560 | loss 3.306224 (+0.42z)| norm 0.2332 (-1.11z)| lr 4.55e-06 | 322.78 ms | 
52.3% bf16 MFU | 1624402 tok/s step 18515/19560 | loss 3.319851 (+0.73z)| norm 0.2442 (-0.68z)| lr 4.54e-06 | 322.60 ms | 52.3% bf16 MFU | 1624440 tok/s step 18516/19560 | loss 3.225246 (-1.45z)| norm 0.2308 (-1.17z)| lr 4.53e-06 | 322.37 ms | 52.4% bf16 MFU | 1624536 tok/s step 18517/19560 | loss 3.317980 (+0.68z)| norm 0.2734 (+0.44z)| lr 4.52e-06 | 322.47 ms | 52.3% bf16 MFU | 1624601 tok/s step 18518/19560 | loss 3.388556 (+2.24z)| norm 0.2610 (-0.03z)| lr 4.52e-06 | 323.32 ms | 52.2% bf16 MFU | 1624450 tok/s step 18519/19560 | loss 3.278266 (-0.25z)| norm 0.2480 (-0.52z)| lr 4.51e-06 | 323.30 ms | 52.2% bf16 MFU | 1624311 tok/s step 18520/19560 | loss 3.294351 (+0.10z)| norm 0.2405 (-0.79z)| lr 4.50e-06 | 322.93 ms | 52.3% bf16 MFU | 1624272 tok/s step 18521/19560 | loss 3.269547 (-0.45z)| norm 0.2854 (+0.93z)| lr 4.49e-06 | 322.87 ms | 52.3% bf16 MFU | 1624250 tok/s step 18522/19560 | loss 3.292658 (+0.08z)| norm 0.2900 (+1.09z)| lr 4.48e-06 | 322.55 ms | 52.3% bf16 MFU | 1624310 tok/s step 18523/19560 | loss 3.277957 (-0.26z)| norm 0.2447 (-0.65z)| lr 4.47e-06 | 322.92 ms | 52.3% bf16 MFU | 1624274 tok/s step 18524/19560 | loss 3.352420 (+1.45z)| norm 0.2459 (-0.61z)| lr 4.46e-06 | 323.26 ms | 52.2% bf16 MFU | 1624153 tok/s step 18525/19560 | loss 3.246663 (-0.98z)| norm 0.2528 (-0.33z)| lr 4.46e-06 | 322.89 ms | 52.3% bf16 MFU | 1624132 tok/s step 18526/19560 | loss 3.278426 (-0.25z)| norm 0.2406 (-0.79z)| lr 4.45e-06 | 323.47 ms | 52.2% bf16 MFU | 1623968 tok/s step 18527/19560 | loss 3.303609 (+0.33z)| norm 0.2669 (+0.23z)| lr 4.44e-06 | 322.71 ms | 52.3% bf16 MFU | 1624002 tok/s step 18528/19560 | loss 3.258716 (-0.70z)| norm 0.2350 (-0.99z)| lr 4.43e-06 | 323.10 ms | 52.2% bf16 MFU | 1623937 tok/s step 18529/19560 | loss 3.286252 (-0.08z)| norm 0.2310 (-1.15z)| lr 4.42e-06 | 323.23 ms | 52.2% bf16 MFU | 1623841 tok/s step 18530/19560 | loss 3.294682 (+0.11z)| norm 0.2361 (-0.95z)| lr 4.41e-06 | 323.01 ms | 52.2% bf16 MFU | 1623805 tok/s step 18531/19560 
| loss 3.306344 (+0.37z)| norm 0.2372 (-0.89z)| lr 4.40e-06 | 323.30 ms | 52.2% bf16 MFU | 1623699 tok/s step 18532/19560 | loss 3.306628 (+0.38z)| norm 0.2652 (+0.21z)| lr 4.40e-06 | 322.61 ms | 52.3% bf16 MFU | 1623772 tok/s step 18533/19560 | loss 3.229812 (-1.42z)| norm 0.2486 (-0.43z)| lr 4.39e-06 | 323.11 ms | 52.2% bf16 MFU | 1623716 tok/s step 18534/19560 | loss 3.286232 (-0.07z)| norm 0.2901 (+1.22z)| lr 4.38e-06 | 323.16 ms | 52.2% bf16 MFU | 1623648 tok/s step 18535/19560 | loss 3.241560 (-1.12z)| norm 0.2381 (-0.85z)| lr 4.37e-06 | 323.11 ms | 52.2% bf16 MFU | 1623597 tok/s step 18536/19560 | loss 3.266796 (-0.51z)| norm 0.3342 (+2.86z)| lr 4.36e-06 | 323.24 ms | 52.2% bf16 MFU | 1623516 tok/s step 18537/19560 | loss 3.287360 (-0.01z)| norm 0.3611 (+3.65z)| lr 4.35e-06 | 322.85 ms | 52.3% bf16 MFU | 1623536 tok/s step 18538/19560 | loss 3.258431 (-0.71z)| norm 0.2379 (-0.81z)| lr 4.35e-06 | 322.90 ms | 52.3% bf16 MFU | 1623545 tok/s step 18539/19560 | loss 3.377340 (+2.12z)| norm 0.2424 (-0.64z)| lr 4.34e-06 | 323.13 ms | 52.2% bf16 MFU | 1623495 tok/s step 18540/19560 | loss 3.200082 (-2.05z)| norm 0.2366 (-0.85z)| lr 4.33e-06 | 322.87 ms | 52.3% bf16 MFU | 1623513 tok/s step 18541/19560 | loss 3.270485 (-0.40z)| norm 0.2311 (-1.04z)| lr 4.32e-06 | 323.08 ms | 52.2% bf16 MFU | 1623476 tok/s step 18542/19560 | loss 3.325905 (+0.88z)| norm 0.2523 (-0.25z)| lr 4.31e-06 | 322.68 ms | 52.3% bf16 MFU | 1623543 tok/s step 18543/19560 | loss 3.264550 (-0.54z)| norm 0.2658 (+0.25z)| lr 4.30e-06 | 322.46 ms | 52.3% bf16 MFU | 1623661 tok/s step 18544/19560 | loss 3.373167 (+1.94z)| norm 0.2919 (+1.23z)| lr 4.29e-06 | 322.78 ms | 52.3% bf16 MFU | 1623693 tok/s step 18545/19560 | loss 3.317929 (+0.70z)| norm 0.2359 (-0.86z)| lr 4.29e-06 | 322.53 ms | 52.3% bf16 MFU | 1623786 tok/s step 18546/19560 | loss 3.267930 (-0.47z)| norm 0.2299 (-1.08z)| lr 4.28e-06 | 322.60 ms | 52.3% bf16 MFU | 1623855 tok/s step 18547/19560 | loss 3.269017 (-0.44z)| norm 0.2593 (+0.04z)| 
lr 4.27e-06 | 323.25 ms | 52.2% bf16 MFU | 1623758 tok/s step 18548/19560 | loss 3.322968 (+0.81z)| norm 0.2483 (-0.39z)| lr 4.26e-06 | 322.71 ms | 52.3% bf16 MFU | 1623804 tok/s step 18549/19560 | loss 3.318885 (+0.71z)| norm 0.2815 (+0.88z)| lr 4.25e-06 | 322.69 ms | 52.3% bf16 MFU | 1623852 tok/s step 18550/19560 | loss 3.319858 (+0.73z)| norm 0.2261 (-1.25z)| lr 4.24e-06 | 322.55 ms | 52.3% bf16 MFU | 1623931 tok/s step 18551/19560 | loss 3.323813 (+0.82z)| norm 0.2443 (-0.53z)| lr 4.24e-06 | 322.19 ms | 52.4% bf16 MFU | 1624098 tok/s step 18552/19560 | loss 3.189939 (-2.26z)| norm 0.2359 (-0.86z)| lr 4.23e-06 | 323.04 ms | 52.2% bf16 MFU | 1624042 tok/s step 18553/19560 | loss 3.288623 (+0.01z)| norm 0.2319 (-1.00z)| lr 4.22e-06 | 323.02 ms | 52.2% bf16 MFU | 1623994 tok/s step 18554/19560 | loss 3.311421 (+0.53z)| norm 0.2469 (-0.41z)| lr 4.21e-06 | 323.03 ms | 52.2% bf16 MFU | 1623945 tok/s step 18555/19560 | loss 3.286734 (-0.05z)| norm 0.2481 (-0.36z)| lr 4.20e-06 | 322.26 ms | 52.4% bf16 MFU | 1624092 tok/s step 18556/19560 | loss 3.265873 (-0.53z)| norm 0.2467 (-0.41z)| lr 4.19e-06 | 323.08 ms | 52.2% bf16 MFU | 1624028 tok/s step 18557/19560 | loss 3.288366 (-0.01z)| norm 0.2347 (-0.87z)| lr 4.19e-06 | 322.47 ms | 52.3% bf16 MFU | 1624120 tok/s step 18558/19560 | loss 3.282193 (-0.15z)| norm 0.3208 (+2.46z)| lr 4.18e-06 | 322.31 ms | 52.4% bf16 MFU | 1624246 tok/s step 18559/19560 | loss 3.316248 (+0.64z)| norm 0.2541 (-0.13z)| lr 4.17e-06 | 322.73 ms | 52.3% bf16 MFU | 1624261 tok/s step 18560/19560 | loss 3.333498 (+1.03z)| norm 0.2601 (+0.10z)| lr 4.16e-06 | 322.41 ms | 52.3% bf16 MFU | 1624355 tok/s step 18561/19560 | loss 3.268862 (-0.47z)| norm 0.2516 (-0.23z)| lr 4.15e-06 | 322.80 ms | 52.3% bf16 MFU | 1624347 tok/s step 18562/19560 | loss 3.288085 (-0.02z)| norm 0.2778 (+0.79z)| lr 4.14e-06 | 323.20 ms | 52.2% bf16 MFU | 1624239 tok/s step 18563/19560 | loss 3.323154 (+0.82z)| norm 0.2719 (+0.55z)| lr 4.14e-06 | 322.35 ms | 52.4% bf16 MFU | 
1624351 tok/s step 18564/19560 | loss 3.255582 (-0.76z)| norm 0.2241 (-1.30z)| lr 4.13e-06 | 322.68 ms | 52.3% bf16 MFU | 1624373 tok/s step 18565/19560 | loss 3.291715 (+0.08z)| norm 0.2551 (-0.09z)| lr 4.12e-06 | 323.11 ms | 52.2% bf16 MFU | 1624286 tok/s step 18566/19560 | loss 3.219470 (-1.62z)| norm 0.2367 (-0.79z)| lr 4.11e-06 | 322.66 ms | 52.3% bf16 MFU | 1624317 tok/s step 18567/19560 | loss 3.299308 (+0.25z)| norm 0.2483 (-0.35z)| lr 4.10e-06 | 322.99 ms | 52.3% bf16 MFU | 1624263 tok/s step 18568/19560 | loss 3.309853 (+0.49z)| norm 0.2296 (-1.05z)| lr 4.09e-06 | 322.93 ms | 52.3% bf16 MFU | 1624228 tok/s step 18569/19560 | loss 3.257128 (-0.76z)| norm 0.2530 (-0.14z)| lr 4.09e-06 | 322.53 ms | 52.3% bf16 MFU | 1624294 tok/s step 18570/19560 | loss 3.241935 (-1.10z)| norm 0.2287 (-1.07z)| lr 4.08e-06 | 322.63 ms | 52.3% bf16 MFU | 1624333 tok/s step 18571/19560 | loss 3.228639 (-1.39z)| norm 0.2253 (-1.19z)| lr 4.07e-06 | 322.53 ms | 52.3% bf16 MFU | 1624393 tok/s step 18572/19560 | loss 3.337428 (+1.16z)| norm 0.2525 (-0.13z)| lr 4.06e-06 | 322.81 ms | 52.3% bf16 MFU | 1624379 tok/s step 18573/19560 | loss 3.310782 (+0.53z)| norm 0.2637 (+0.30z)| lr 4.05e-06 | 322.70 ms | 52.3% bf16 MFU | 1624396 tok/s step 18574/19560 | loss 3.242663 (-1.07z)| norm 0.2232 (-1.25z)| lr 4.05e-06 | 322.99 ms | 52.3% bf16 MFU | 1624338 tok/s step 18575/19560 | loss 3.230506 (-1.33z)| norm 0.2420 (-0.52z)| lr 4.04e-06 | 322.51 ms | 52.3% bf16 MFU | 1624403 tok/s step 18576/19560 | loss 3.232754 (-1.27z)| norm 0.2379 (-0.66z)| lr 4.03e-06 | 322.55 ms | 52.3% bf16 MFU | 1624456 tok/s step 18577/19560 | loss 3.301935 (+0.33z)| norm 0.2467 (-0.32z)| lr 4.02e-06 | 322.70 ms | 52.3% bf16 MFU | 1624468 tok/s step 18578/19560 | loss 3.307203 (+0.47z)| norm 0.2488 (-0.22z)| lr 4.01e-06 | 322.77 ms | 52.3% bf16 MFU | 1624461 tok/s step 18579/19560 | loss 3.280180 (-0.17z)| norm 0.2424 (-0.48z)| lr 4.00e-06 | 322.56 ms | 52.3% bf16 MFU | 1624508 tok/s step 18580/19560 | loss 3.239376 
(-1.11z)| norm 0.2687 (+0.59z)| lr 4.00e-06 | 322.49 ms | 52.3% bf16 MFU | 1624570 tok/s step 18581/19560 | loss 3.262766 (-0.56z)| norm 0.2441 (-0.40z)| lr 3.99e-06 | 323.06 ms | 52.2% bf16 MFU | 1624485 tok/s step 18582/19560 | loss 3.360257 (+1.69z)| norm 0.2683 (+0.58z)| lr 3.98e-06 | 322.65 ms | 52.3% bf16 MFU | 1624507 tok/s step 18583/19560 | loss 3.272666 (-0.33z)| norm 0.2589 (+0.19z)| lr 3.97e-06 | 322.95 ms | 52.3% bf16 MFU | 1624454 tok/s step 18584/19560 | loss 3.250831 (-0.83z)| norm 0.2405 (-0.56z)| lr 3.96e-06 | 322.80 ms | 52.3% bf16 MFU | 1624440 tok/s step 18585/19560 | loss 3.381183 (+2.13z)| norm 0.2393 (-0.60z)| lr 3.96e-06 | 322.52 ms | 52.3% bf16 MFU | 1624498 tok/s step 18586/19560 | loss 3.254117 (-0.75z)| norm 0.2332 (-0.84z)| lr 3.95e-06 | 322.67 ms | 52.3% bf16 MFU | 1624516 tok/s step 18587/19560 | loss 3.278580 (-0.19z)| norm 0.2428 (-0.44z)| lr 3.94e-06 | 322.86 ms | 52.3% bf16 MFU | 1624484 tok/s step 18588/19560 | loss 3.318782 (+0.71z)| norm 0.2709 (+0.70z)| lr 3.93e-06 | 322.98 ms | 52.3% bf16 MFU | 1624424 tok/s step 18589/19560 | loss 3.278939 (-0.19z)| norm 0.2617 (+0.31z)| lr 3.92e-06 | 322.52 ms | 52.3% bf16 MFU | 1624483 tok/s step 18590/19560 | loss 3.257201 (-0.67z)| norm 0.2326 (-0.86z)| lr 3.92e-06 | 323.12 ms | 52.2% bf16 MFU | 1624388 tok/s step 18591/19560 | loss 3.302212 (+0.35z)| norm 0.2314 (-0.90z)| lr 3.91e-06 | 322.66 ms | 52.3% bf16 MFU | 1624414 tok/s step 18592/19560 | loss 3.322012 (+0.79z)| norm 0.2492 (-0.18z)| lr 3.90e-06 | 322.66 ms | 52.3% bf16 MFU | 1624437 tok/s step 18593/19560 | loss 3.350048 (+1.41z)| norm 0.2716 (+0.74z)| lr 3.89e-06 | 322.75 ms | 52.3% bf16 MFU | 1624437 tok/s step 18594/19560 | loss 3.258156 (-0.65z)| norm 0.2362 (-0.71z)| lr 3.88e-06 | 322.83 ms | 52.3% bf16 MFU | 1624416 tok/s step 18595/19560 | loss 3.271502 (-0.35z)| norm 0.2386 (-0.62z)| lr 3.88e-06 | 322.21 ms | 52.4% bf16 MFU | 1624554 tok/s step 18596/19560 | loss 3.298515 (+0.27z)| norm 0.2288 (-1.00z)| lr 3.87e-06 | 
322.96 ms | 52.3% bf16 MFU | 1624495 tok/s step 18597/19560 | loss 3.266744 (-0.44z)| norm 0.2516 (-0.07z)| lr 3.86e-06 | 322.50 ms | 52.3% bf16 MFU | 1624554 tok/s step 18598/19560 | loss 3.242206 (-0.98z)| norm 0.2347 (-0.75z)| lr 3.85e-06 | 322.54 ms | 52.3% bf16 MFU | 1624601 tok/s step 18599/19560 | loss 3.331973 (+1.03z)| norm 0.2412 (-0.47z)| lr 3.84e-06 | 323.08 ms | 52.2% bf16 MFU | 1624509 tok/s step 18600/19560 | loss 3.253447 (-0.73z)| norm 0.2386 (-0.59z)| lr 3.84e-06 | 322.56 ms | 52.3% bf16 MFU | 1624553 tok/s step 18601/19560 | loss 3.267745 (-0.43z)| norm 0.2848 (+1.32z)| lr 3.83e-06 | 322.42 ms | 52.3% bf16 MFU | 1624631 tok/s step 18602/19560 | loss 3.320376 (+0.78z)| norm 0.2293 (-0.96z)| lr 3.82e-06 | 322.96 ms | 52.3% bf16 MFU | 1624569 tok/s step 18603/19560 | loss 3.222794 (-1.46z)| norm 0.2549 (+0.11z)| lr 3.81e-06 | 322.79 ms | 52.3% bf16 MFU | 1624554 tok/s step 18604/19560 | loss 3.267420 (-0.43z)| norm 0.2375 (-0.62z)| lr 3.80e-06 | 322.76 ms | 52.3% bf16 MFU | 1624546 tok/s step 18605/19560 | loss 3.293388 (+0.16z)| norm 0.2403 (-0.50z)| lr 3.80e-06 | 322.45 ms | 52.3% bf16 MFU | 1624618 tok/s step 18606/19560 | loss 3.276943 (-0.21z)| norm 0.2782 (+1.07z)| lr 3.79e-06 | 322.70 ms | 52.3% bf16 MFU | 1624622 tok/s step 18607/19560 | loss 3.306937 (+0.47z)| norm 0.2418 (-0.43z)| lr 3.78e-06 | 323.05 ms | 52.2% bf16 MFU | 1624536 tok/s step 18608/19560 | loss 3.320903 (+0.79z)| norm 0.2431 (-0.38z)| lr 3.77e-06 | 322.57 ms | 52.3% bf16 MFU | 1624576 tok/s step 18609/19560 | loss 3.267303 (-0.44z)| norm 0.2337 (-0.77z)| lr 3.76e-06 | 322.83 ms | 52.3% bf16 MFU | 1624550 tok/s step 18610/19560 | loss 3.243651 (-0.97z)| norm 0.2486 (-0.13z)| lr 3.76e-06 | 322.65 ms | 52.3% bf16 MFU | 1624570 tok/s step 18611/19560 | loss 3.232377 (-1.22z)| norm 0.2377 (-0.59z)| lr 3.75e-06 | 322.49 ms | 52.3% bf16 MFU | 1624629 tok/s step 18612/19560 | loss 3.427091 (+3.07z)| norm 0.3696 (+4.54z)| lr 3.74e-06 | 322.77 ms | 52.3% bf16 MFU | 1624616 tok/s step 
18613/19560 | loss 3.308976 (+0.52z)| norm 0.2593 (+0.26z)| lr 3.73e-06 | 322.85 ms | 52.3% bf16 MFU | 1624581 tok/s step 18614/19560 | loss 3.317636 (+0.70z)| norm 0.2445 (-0.32z)| lr 3.72e-06 | 322.78 ms | 52.3% bf16 MFU | 1624566 tok/s step 18615/19560 | loss 3.218951 (-1.51z)| norm 0.2354 (-0.66z)| lr 3.72e-06 | 322.57 ms | 52.3% bf16 MFU | 1624606 tok/s step 18616/19560 | loss 3.268718 (-0.39z)| norm 0.2363 (-0.63z)| lr 3.71e-06 | 322.73 ms | 52.3% bf16 MFU | 1624604 tok/s step 18617/19560 | loss 3.341189 (+1.22z)| norm 0.2657 (+0.51z)| lr 3.70e-06 | 323.22 ms | 52.2% bf16 MFU | 1624478 tok/s step 18618/19560 | loss 3.225703 (-1.35z)| norm 0.2484 (-0.16z)| lr 3.69e-06 | 322.93 ms | 52.3% bf16 MFU | 1624431 tok/s step 18619/19560 | loss 3.531546 (+4.88z)| norm 0.2650 (+0.48z)| lr 3.69e-06 | 322.43 ms | 52.3% bf16 MFU | 1624512 tok/s step 18620/19560 | loss 3.268405 (-0.40z)| norm 0.3516 (+3.62z)| lr 3.68e-06 | 322.34 ms | 52.4% bf16 MFU | 1624612 tok/s step 18621/19560 | loss 3.277984 (-0.20z)| norm 0.2315 (-0.80z)| lr 3.67e-06 | 323.08 ms | 52.2% bf16 MFU | 1624521 tok/s step 18622/19560 | loss 3.279598 (-0.18z)| norm 0.2493 (-0.15z)| lr 3.66e-06 | 322.93 ms | 52.3% bf16 MFU | 1624472 tok/s step 18623/19560 | loss 3.285840 (-0.05z)| norm 0.2889 (+1.30z)| lr 3.65e-06 | 322.90 ms | 52.3% bf16 MFU | 1624433 tok/s step 18624/19560 | loss 3.252923 (-0.71z)| norm 0.2414 (-0.43z)| lr 3.65e-06 | 322.46 ms | 52.3% bf16 MFU | 1624507 tok/s step 18625/19560 | loss 3.331295 (+0.87z)| norm 0.2423 (-0.41z)| lr 3.64e-06 | 322.64 ms | 52.3% bf16 MFU | 1624532 tok/s step 18626/19560 | loss 3.304424 (+0.33z)| norm 0.2624 (+0.37z)| lr 3.63e-06 | 322.41 ms | 52.3% bf16 MFU | 1624614 tok/s step 18627/19560 | loss 3.305388 (+0.34z)| norm 0.2519 (-0.04z)| lr 3.62e-06 | 322.90 ms | 52.3% bf16 MFU | 1624566 tok/s step 18628/19560 | loss 3.435011 (+2.84z)| norm 0.2845 (+1.20z)| lr 3.62e-06 | 322.77 ms | 52.3% bf16 MFU | 1624555 tok/s step 18629/19560 | loss 3.221390 (-1.32z)| norm 
0.2536 (+0.01z)| lr 3.61e-06 | 322.78 ms | 52.3% bf16 MFU | 1624542 tok/s step 18630/19560 | loss 3.253282 (-0.70z)| norm 0.2439 (-0.36z)| lr 3.60e-06 | 322.60 ms | 52.3% bf16 MFU | 1624575 tok/s step 18631/19560 | loss 3.324502 (+0.68z)| norm 0.2422 (-0.43z)| lr 3.59e-06 | 322.18 ms | 52.4% bf16 MFU | 1624712 tok/s step 18632/19560 | loss 3.285215 (-0.09z)| norm 0.2360 (-0.65z)| lr 3.58e-06 | 322.91 ms | 52.3% bf16 MFU | 1624658 tok/s step 18633/19560 | loss 3.293650 (+0.07z)| norm 0.2449 (-0.32z)| lr 3.58e-06 | 322.77 ms | 52.3% bf16 MFU | 1624642 tok/s step 18634/19560 | loss 3.279930 (-0.20z)| norm 0.2428 (-0.39z)| lr 3.57e-06 | 323.42 ms | 52.2% bf16 MFU | 1624464 tok/s step 18635/19560 | loss 3.205989 (-1.61z)| norm 0.2343 (-0.72z)| lr 3.56e-06 | 322.79 ms | 52.3% bf16 MFU | 1624454 tok/s step 18636/19560 | loss 3.311542 (+0.42z)| norm 0.2336 (-0.74z)| lr 3.55e-06 | 322.48 ms | 52.3% bf16 MFU | 1624522 tok/s step 18637/19560 | loss 3.286820 (-0.07z)| norm 0.2437 (-0.36z)| lr 3.55e-06 | 322.73 ms | 52.3% bf16 MFU | 1624524 tok/s step 18638/19560 | loss 3.381895 (+1.93z)| norm 0.2688 (+0.65z)| lr 3.54e-06 | 322.76 ms | 52.3% bf16 MFU | 1624516 tok/s step 18639/19560 | loss 3.299620 (+0.21z)| norm 0.2334 (-0.76z)| lr 3.53e-06 | 323.08 ms | 52.2% bf16 MFU | 1624430 tok/s step 18640/19560 | loss 3.305620 (+0.33z)| norm 0.2499 (-0.09z)| lr 3.52e-06 | 322.38 ms | 52.4% bf16 MFU | 1624524 tok/s step 18641/19560 | loss 3.227004 (-1.30z)| norm 0.2459 (-0.25z)| lr 3.52e-06 | 322.84 ms | 52.3% bf16 MFU | 1624497 tok/s step 18642/19560 | loss 3.308572 (+0.40z)| norm 0.2608 (+0.34z)| lr 3.51e-06 | 323.02 ms | 52.2% bf16 MFU | 1624426 tok/s step 18643/19560 | loss 3.240538 (-1.00z)| norm 0.2202 (-1.27z)| lr 3.50e-06 | 322.54 ms | 52.3% bf16 MFU | 1624479 tok/s step 18644/19560 | loss 3.353965 (+1.33z)| norm 0.2360 (-0.64z)| lr 3.49e-06 | 322.97 ms | 52.3% bf16 MFU | 1624422 tok/s step 18645/19560 | loss 3.291396 (+0.04z)| norm 0.2492 (-0.11z)| lr 3.49e-06 | 322.71 ms | 
52.3% bf16 MFU | 1624433 tok/s step 18646/19560 | loss 3.194317 (-1.95z)| norm 0.2306 (-0.84z)| lr 3.48e-06 | 322.85 ms | 52.3% bf16 MFU | 1624408 tok/s step 18647/19560 | loss 3.316998 (+0.60z)| norm 0.2424 (-0.37z)| lr 3.47e-06 | 322.71 ms | 52.3% bf16 MFU | 1624418 tok/s step 18648/19560 | loss 3.304633 (+0.34z)| norm 0.2295 (-0.88z)| lr 3.46e-06 | 322.66 ms | 52.3% bf16 MFU | 1624442 tok/s step 18649/19560 | loss 3.227917 (-1.25z)| norm 0.2446 (-0.27z)| lr 3.46e-06 | 322.70 ms | 52.3% bf16 MFU | 1624455 tok/s step 18650/19560 | loss 3.270401 (-0.36z)| norm 0.2493 (-0.07z)| lr 3.45e-06 | 323.03 ms | 52.2% bf16 MFU | 1624385 tok/s step 18651/19560 | loss 3.297513 (+0.19z)| norm 0.2470 (-0.16z)| lr 3.44e-06 | 322.98 ms | 52.3% bf16 MFU | 1624330 tok/s step 18652/19560 | loss 3.378664 (+1.86z)| norm 0.2881 (+1.47z)| lr 3.43e-06 | 322.63 ms | 52.3% bf16 MFU | 1624366 tok/s step 18653/19560 | loss 3.305844 (+0.35z)| norm 0.2279 (-0.93z)| lr 3.42e-06 | 323.10 ms | 52.2% bf16 MFU | 1624281 tok/s step 18654/19560 | loss 3.446186 (+3.09z)| norm 0.3001 (+1.91z)| lr 3.42e-06 | 322.37 ms | 52.4% bf16 MFU | 1624385 tok/s step 18655/19560 | loss 3.215541 (-1.45z)| norm 0.2262 (-0.99z)| lr 3.41e-06 | 323.22 ms | 52.2% bf16 MFU | 1624270 tok/s step 18656/19560 | loss 3.214519 (-1.45z)| norm 0.2528 (+0.06z)| lr 3.40e-06 | 323.00 ms | 52.3% bf16 MFU | 1624217 tok/s step 18657/19560 | loss 3.248856 (-0.78z)| norm 0.3062 (+2.10z)| lr 3.39e-06 | 323.01 ms | 52.2% bf16 MFU | 1624161 tok/s step 18658/19560 | loss 3.417179 (+2.42z)| norm 0.2426 (-0.37z)| lr 3.39e-06 | 322.84 ms | 52.3% bf16 MFU | 1624152 tok/s step 18659/19560 | loss 3.234418 (-1.04z)| norm 0.2319 (-0.78z)| lr 3.38e-06 | 322.89 ms | 52.3% bf16 MFU | 1624131 tok/s step 18660/19560 | loss 3.247744 (-0.77z)| norm 0.2453 (-0.26z)| lr 3.37e-06 | 322.70 ms | 52.3% bf16 MFU | 1624158 tok/s step 18661/19560 | loss 3.255001 (-0.64z)| norm 0.2601 (+0.32z)| lr 3.36e-06 | 322.53 ms | 52.3% bf16 MFU | 1624229 tok/s step 18662/19560 
| loss 3.276276 (-0.24z)| norm 0.2591 (+0.29z)| lr 3.36e-06 | 323.40 ms | 52.2% bf16 MFU | 1624076 tok/s step 18663/19560 | loss 3.288877 (-0.01z)| norm 0.2196 (-1.24z)| lr 3.35e-06 | 322.48 ms | 52.3% bf16 MFU | 1624161 tok/s step 18664/19560 | loss 3.272985 (-0.31z)| norm 0.2582 (+0.29z)| lr 3.34e-06 | 322.91 ms | 52.3% bf16 MFU | 1624134 tok/s step 18665/19560 | loss 3.227859 (-1.15z)| norm 0.2629 (+0.56z)| lr 3.34e-06 | 322.60 ms | 52.3% bf16 MFU | 1624187 tok/s step 18666/19560 | loss 3.234786 (-1.02z)| norm 0.2344 (-0.69z)| lr 3.33e-06 | 322.57 ms | 52.3% bf16 MFU | 1624244 tok/s step 18667/19560 | loss 3.266875 (-0.40z)| norm 0.2650 (+0.64z)| lr 3.32e-06 | 323.01 ms | 52.2% bf16 MFU | 1624189 tok/s step 18668/19560 | loss 3.231194 (-1.09z)| norm 0.2849 (+1.49z)| lr 3.31e-06 | 322.93 ms | 52.3% bf16 MFU | 1624155 tok/s step 18669/19560 | loss 3.250777 (-0.71z)| norm 0.2217 (-1.26z)| lr 3.31e-06 | 322.79 ms | 52.3% bf16 MFU | 1624159 tok/s step 18670/19560 | loss 3.329657 (+0.80z)| norm 0.2253 (-1.09z)| lr 3.30e-06 | 323.04 ms | 52.2% bf16 MFU | 1624100 tok/s step 18671/19560 | loss 3.364568 (+1.44z)| norm 0.2323 (-0.77z)| lr 3.29e-06 | 322.43 ms | 52.3% bf16 MFU | 1624197 tok/s step 18672/19560 | loss 3.289660 (+0.03z)| norm 0.2470 (-0.12z)| lr 3.28e-06 | 322.79 ms | 52.3% bf16 MFU | 1624200 tok/s step 18673/19560 | loss 3.310361 (+0.43z)| norm 0.2372 (-0.56z)| lr 3.28e-06 | 322.97 ms | 52.3% bf16 MFU | 1624157 tok/s step 18674/19560 | loss 3.236815 (-0.97z)| norm 0.3203 (+2.96z)| lr 3.27e-06 | 323.09 ms | 52.2% bf16 MFU | 1624086 tok/s step 18675/19560 | loss 3.289266 (+0.03z)| norm 0.2423 (-0.35z)| lr 3.26e-06 | 322.50 ms | 52.3% bf16 MFU | 1624167 tok/s step 18676/19560 | loss 3.363450 (+1.43z)| norm 0.2668 (+0.69z)| lr 3.25e-06 | 322.99 ms | 52.3% bf16 MFU | 1624120 tok/s step 18677/19560 | loss 3.230859 (-1.07z)| norm 0.2350 (-0.65z)| lr 3.25e-06 | 322.87 ms | 52.3% bf16 MFU | 1624107 tok/s step 18678/19560 | loss 3.233205 (-1.01z)| norm 0.2401 (-0.44z)| 
lr 3.24e-06 | 322.77 ms | 52.3% bf16 MFU | 1624119 tok/s step 18679/19560 | loss 3.281142 (-0.10z)| norm 0.2317 (-0.79z)| lr 3.23e-06 | 322.58 ms | 52.3% bf16 MFU | 1624178 tok/s step 18680/19560 | loss 3.239777 (-0.90z)| norm 0.2743 (+1.01z)| lr 3.22e-06 | 322.91 ms | 52.3% bf16 MFU | 1624150 tok/s step 18681/19560 | loss 3.260943 (-0.49z)| norm 0.2239 (-1.13z)| lr 3.22e-06 | 322.95 ms | 52.3% bf16 MFU | 1624114 tok/s step 18682/19560 | loss 3.432793 (+2.70z)| norm 0.2642 (+0.58z)| lr 3.21e-06 | 322.91 ms | 52.3% bf16 MFU | 1624090 tok/s step 18683/19560 | loss 3.375807 (+1.61z)| norm 0.2435 (-0.30z)| lr 3.20e-06 | 323.32 ms | 52.2% bf16 MFU | 1623965 tok/s step 18684/19560 | loss 3.269985 (-0.34z)| norm 0.2363 (-0.60z)| lr 3.20e-06 | 323.00 ms | 52.3% bf16 MFU | 1623925 tok/s step 18685/19560 | loss 3.260004 (-0.52z)| norm 0.2572 (+0.28z)| lr 3.19e-06 | 322.37 ms | 52.4% bf16 MFU | 1624046 tok/s step 18686/19560 | loss 3.297478 (+0.17z)| norm 0.2490 (-0.05z)| lr 3.18e-06 | 323.02 ms | 52.2% bf16 MFU | 1623998 tok/s step 18687/19560 | loss 3.333886 (+0.84z)| norm 0.3263 (+3.19z)| lr 3.17e-06 | 324.72 ms | 52.0% bf16 MFU | 1623528 tok/s step 18688/19560 | loss 3.316261 (+0.52z)| norm 0.2322 (-0.77z)| lr 3.17e-06 | 322.88 ms | 52.3% bf16 MFU | 1623541 tok/s step 18689/19560 | loss 3.261399 (-0.49z)| norm 0.2466 (-0.16z)| lr 3.16e-06 | 322.42 ms | 52.3% bf16 MFU | 1623669 tok/s step 18690/19560 | loss 3.272787 (-0.28z)| norm 0.2540 (+0.16z)| lr 3.15e-06 | 322.64 ms | 52.3% bf16 MFU | 1623735 tok/s step 18691/19560 | loss 3.327722 (+0.73z)| norm 0.2655 (+0.65z)| lr 3.14e-06 | 322.67 ms | 52.3% bf16 MFU | 1623791 tok/s step 18692/19560 | loss 3.249255 (-0.71z)| norm 0.2207 (-1.24z)| lr 3.14e-06 | 322.75 ms | 52.3% bf16 MFU | 1623824 tok/s step 18693/19560 | loss 3.306618 (+0.34z)| norm 0.2363 (-0.58z)| lr 3.13e-06 | 322.91 ms | 52.3% bf16 MFU | 1623814 tok/s step 18694/19560 | loss 3.292875 (+0.08z)| norm 0.2261 (-1.00z)| lr 3.12e-06 | 322.58 ms | 52.3% bf16 MFU | 
1623889 tok/s step 18695/19560 | loss 3.276554 (-0.22z)| norm 0.2672 (+0.72z)| lr 3.12e-06 | 322.72 ms | 52.3% bf16 MFU | 1623924 tok/s step 18696/19560 | loss 3.293103 (+0.09z)| norm 0.2451 (-0.21z)| lr 3.11e-06 | 323.07 ms | 52.2% bf16 MFU | 1623870 tok/s step 18697/19560 | loss 3.295676 (+0.13z)| norm 0.2422 (-0.33z)| lr 3.10e-06 | 322.95 ms | 52.3% bf16 MFU | 1623850 tok/s step 18698/19560 | loss 3.346864 (+1.06z)| norm 0.2394 (-0.46z)| lr 3.09e-06 | 322.69 ms | 52.3% bf16 MFU | 1623895 tok/s step 18699/19560 | loss 3.339539 (+0.91z)| norm 0.2356 (-0.62z)| lr 3.09e-06 | 322.93 ms | 52.3% bf16 MFU | 1623878 tok/s step 18700/19560 | loss 3.129602 (-2.86z)| norm 0.2629 (+0.53z)| lr 3.08e-06 | 322.51 ms | 52.3% bf16 MFU | 1623966 tok/s step 18701/19560 | loss 3.267581 (-0.37z)| norm 0.2386 (-0.49z)| lr 3.07e-06 | 322.70 ms | 52.3% bf16 MFU | 1624002 tok/s step 18702/19560 | loss 3.304049 (+0.27z)| norm 0.2269 (-0.99z)| lr 3.07e-06 | 322.86 ms | 52.3% bf16 MFU | 1623995 tok/s step 18703/19560 | loss 3.298196 (+0.16z)| norm 0.2782 (+1.17z)| lr 3.06e-06 | 323.01 ms | 52.2% bf16 MFU | 1623952 tok/s step 18704/19560 | loss 3.321163 (+0.56z)| norm 0.2855 (+1.45z)| lr 3.05e-06 | 323.43 ms | 52.2% bf16 MFU | 1623806 tok/s step 18705/19560 | loss 3.321846 (+0.57z)| norm 0.2856 (+1.43z)| lr 3.04e-06 | 322.76 ms | 52.3% bf16 MFU | 1623836 tok/s step 18706/19560 | loss 3.246193 (-0.79z)| norm 0.2724 (+0.87z)| lr 3.04e-06 | 322.88 ms | 52.3% bf16 MFU | 1623833 tok/s step 18707/19560 | loss 3.352704 (+1.12z)| norm 0.2604 (+0.37z)| lr 3.03e-06 | 322.59 ms | 52.3% bf16 MFU | 1623902 tok/s step 18708/19560 | loss 3.254686 (-0.65z)| norm 0.2500 (-0.05z)| lr 3.02e-06 | 322.86 ms | 52.3% bf16 MFU | 1623901 tok/s step 18709/19560 | loss 3.263455 (-0.49z)| norm 0.2546 (+0.13z)| lr 3.02e-06 | 323.13 ms | 52.2% bf16 MFU | 1623834 tok/s step 18710/19560 | loss 3.250041 (-0.72z)| norm 0.2609 (+0.40z)| lr 3.01e-06 | 322.62 ms | 52.3% bf16 MFU | 1623897 tok/s step 18711/19560 | loss 3.302090 
(+0.22z)| norm 0.2435 (-0.32z)| lr 3.00e-06 | 322.42 ms | 52.3% bf16 MFU | 1624006 tok/s step 18712/19560 | loss 3.310364 (+0.36z)| norm 0.2668 (+0.64z)| lr 3.00e-06 | 323.11 ms | 52.2% bf16 MFU | 1623936 tok/s step 18713/19560 | loss 3.296244 (+0.12z)| norm 0.2387 (-0.53z)| lr 2.99e-06 | 323.09 ms | 52.2% bf16 MFU | 1623875 tok/s step 18714/19560 | loss 3.274742 (-0.28z)| norm 0.2573 (+0.24z)| lr 2.98e-06 | 322.94 ms | 52.3% bf16 MFU | 1623856 tok/s step 18715/19560 | loss 3.323050 (+0.60z)| norm 0.2364 (-0.63z)| lr 2.97e-06 | 322.49 ms | 52.3% bf16 MFU | 1623952 tok/s step 18716/19560 | loss 3.291443 (+0.03z)| norm 0.2554 (+0.17z)| lr 2.97e-06 | 322.82 ms | 52.3% bf16 MFU | 1623960 tok/s step 18717/19560 | loss 3.254056 (-0.65z)| norm 0.2372 (-0.58z)| lr 2.96e-06 | 322.61 ms | 52.3% bf16 MFU | 1624020 tok/s step 18718/19560 | loss 3.290553 (+0.01z)| norm 0.2351 (-0.67z)| lr 2.95e-06 | 323.33 ms | 52.2% bf16 MFU | 1623894 tok/s step 18719/19560 | loss 3.269330 (-0.37z)| norm 0.2392 (-0.51z)| lr 2.95e-06 | 322.89 ms | 52.3% bf16 MFU | 1623885 tok/s step 18720/19560 | loss 3.252903 (-0.67z)| norm 0.2507 (-0.02z)| lr 2.94e-06 | 322.63 ms | 52.3% bf16 MFU | 1623944 tok/s step 18721/19560 | loss 3.286333 (-0.04z)| norm 0.2428 (-0.35z)| lr 2.93e-06 | 322.91 ms | 52.3% bf16 MFU | 1623928 tok/s step 18722/19560 | loss 3.335832 (+0.85z)| norm 0.2368 (-0.60z)| lr 2.92e-06 | 323.12 ms | 52.2% bf16 MFU | 1623860 tok/s step 18723/19560 | loss 3.241826 (-0.87z)| norm 0.2309 (-0.84z)| lr 2.92e-06 | 323.97 ms | 52.1% bf16 MFU | 1623582 tok/s step 18724/19560 | loss 3.256489 (-0.59z)| norm 0.2798 (+1.18z)| lr 2.91e-06 | 323.12 ms | 52.2% bf16 MFU | 1623532 tok/s step 18725/19560 | loss 3.323108 (+0.62z)| norm 0.2378 (-0.56z)| lr 2.90e-06 | 322.60 ms | 52.3% bf16 MFU | 1623615 tok/s step 18726/19560 | loss 3.380808 (+1.64z)| norm 0.2520 (+0.02z)| lr 2.90e-06 | 322.60 ms | 52.3% bf16 MFU | 1623693 tok/s step 18727/19560 | loss 3.302096 (+0.22z)| norm 0.2302 (-0.88z)| lr 2.89e-06 | 
323.13 ms | 52.2% bf16 MFU | 1623636 tok/s step 18728/19560 | loss 3.250896 (-0.71z)| norm 0.2400 (-0.47z)| lr 2.88e-06 | 322.91 ms | 52.3% bf16 MFU | 1623636 tok/s step 18729/19560 | loss 3.333215 (+0.77z)| norm 0.2306 (-0.85z)| lr 2.88e-06 | 323.07 ms | 52.2% bf16 MFU | 1623595 tok/s step 18730/19560 | loss 3.267298 (-0.42z)| norm 0.2287 (-0.93z)| lr 2.87e-06 | 322.88 ms | 52.3% bf16 MFU | 1623604 tok/s step 18731/19560 | loss 3.234161 (-1.02z)| norm 0.2534 (+0.10z)| lr 2.86e-06 | 322.64 ms | 52.3% bf16 MFU | 1623673 tok/s step 18732/19560 | loss 3.359853 (+1.25z)| norm 0.2493 (-0.07z)| lr 2.86e-06 | 323.08 ms | 52.2% bf16 MFU | 1623628 tok/s step 18733/19560 | loss 3.334650 (+0.78z)| norm 0.2312 (-0.83z)| lr 2.85e-06 | 322.55 ms | 52.3% bf16 MFU | 1623720 tok/s step 18734/19560 | loss 3.231149 (-1.07z)| norm 0.2453 (-0.23z)| lr 2.84e-06 | 322.89 ms | 52.3% bf16 MFU | 1623720 tok/s step 18735/19560 | loss 3.311908 (+0.38z)| norm 0.2397 (-0.46z)| lr 2.84e-06 | 323.13 ms | 52.2% bf16 MFU | 1623660 tok/s step 18736/19560 | loss 3.248470 (-0.75z)| norm 0.2304 (-0.85z)| lr 2.83e-06 | 323.27 ms | 52.2% bf16 MFU | 1623569 tok/s step 18737/19560 | loss 3.251527 (-0.69z)| norm 0.2412 (-0.40z)| lr 2.82e-06 | 323.34 ms | 52.2% bf16 MFU | 1623466 tok/s step 18738/19560 | loss 3.254503 (-0.64z)| norm 0.2431 (-0.32z)| lr 2.81e-06 | 323.42 ms | 52.2% bf16 MFU | 1623346 tok/s step 18739/19560 | loss 3.288523 (-0.04z)| norm 0.2575 (+0.28z)| lr 2.81e-06 | 323.41 ms | 52.2% bf16 MFU | 1623236 tok/s step 18740/19560 | loss 3.310740 (+0.39z)| norm 0.2661 (+0.76z)| lr 2.80e-06 | 323.09 ms | 52.2% bf16 MFU | 1623211 tok/s step 18741/19560 | loss 3.276782 (-0.24z)| norm 0.2515 (+0.08z)| lr 2.79e-06 | 323.22 ms | 52.2% bf16 MFU | 1623154 tok/s step 18742/19560 | loss 3.295381 (+0.11z)| norm 0.2528 (+0.13z)| lr 2.79e-06 | 322.98 ms | 52.3% bf16 MFU | 1623160 tok/s step 18743/19560 | loss 3.334826 (+0.83z)| norm 0.2400 (-0.47z)| lr 2.78e-06 | 323.30 ms | 52.2% bf16 MFU | 1623086 tok/s step 
18744/19560 | loss 3.373391 (+1.51z)| norm 0.2406 (-0.44z)| lr 2.77e-06 | 322.70 ms | 52.3% bf16 MFU | 1623166 tok/s step 18745/19560 | loss 3.294518 (+0.07z)| norm 0.2650 (+0.70z)| lr 2.77e-06 | 322.83 ms | 52.3% bf16 MFU | 1623210 tok/s step 18746/19560 | loss 3.261997 (-0.54z)| norm 0.2714 (+0.99z)| lr 2.76e-06 | 322.78 ms | 52.3% bf16 MFU | 1623263 tok/s step 18747/19560 | loss 3.218379 (-1.41z)| norm 0.2356 (-0.67z)| lr 2.75e-06 | 322.99 ms | 52.3% bf16 MFU | 1623261 tok/s step 18748/19560 | loss 3.289619 (+0.02z)| norm 0.2584 (+0.47z)| lr 2.75e-06 | 323.12 ms | 52.2% bf16 MFU | 1623226 tok/s step 18749/19560 | loss 3.256525 (-0.64z)| norm 0.2481 (-0.07z)| lr 2.74e-06 | 322.87 ms | 52.3% bf16 MFU | 1623256 tok/s step 18750/19560 | loss 3.275263 (-0.27z)| norm 0.2332 (-0.83z)| lr 2.73e-06 | 322.40 ms | 52.3% bf16 MFU | 1623402 tok/s val loss 3.274832 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating HellaSwag: 70/79 HellaSwag: 3039/10042 = 0.302629 step 18751/19560 | loss 3.300226 (+0.23z)| norm 0.2537 (+0.25z)| lr 2.73e-06 | 324.21 ms | 52.1% bf16 MFU | 1623089 tok/s step 18752/19560 | loss 3.293430 (+0.09z)| norm 0.2479 (-0.06z)| lr 2.72e-06 | 322.42 ms | 52.3% bf16 MFU | 1623240 tok/s step 18753/19560 | loss 3.278530 (-0.20z)| norm 0.2242 (-1.29z)| lr 2.71e-06 | 323.99 ms | 52.1% bf16 MFU | 1622989 tok/s step 18754/19560 | loss 3.352822 (+1.28z)| norm 0.2311 (-0.91z)| lr 2.71e-06 | 322.14 ms | 52.4% bf16 MFU | 1623214 tok/s step 18755/19560 | loss 3.293490 (+0.09z)| norm 0.2326 (-0.82z)| lr 2.70e-06 | 322.36 ms | 52.4% bf16 MFU | 1623373 tok/s step 18756/19560 | loss 3.323222 (+0.73z)| norm 0.2390 (-0.48z)| lr 2.69e-06 | 323.22 ms | 52.2% bf16 MFU | 1623308 tok/s step 18757/19560 | loss 3.241441 (-0.97z)| norm 0.2466 (-0.08z)| lr 2.69e-06 | 322.32 ms | 52.4% bf16 MFU | 1623473 tok/s step 18758/19560 | 
loss 3.276377 (-0.25z)| norm 0.2392 (-0.46z)| lr 2.68e-06 | 322.35 ms | 52.4% bf16 MFU | 1623622 tok/s step 18759/19560 | loss 3.292942 (+0.10z)| norm 0.2225 (-1.32z)| lr 2.67e-06 | 322.84 ms | 52.3% bf16 MFU | 1623640 tok/s step 18760/19560 | loss 3.276381 (-0.24z)| norm 0.2292 (-0.97z)| lr 2.67e-06 | 322.96 ms | 52.3% bf16 MFU | 1623628 tok/s step 18761/19560 | loss 3.310692 (+0.47z)| norm 0.2361 (-0.61z)| lr 2.66e-06 | 322.67 ms | 52.3% bf16 MFU | 1623689 tok/s step 18762/19560 | loss 3.305079 (+0.35z)| norm 0.2702 (+1.15z)| lr 2.65e-06 | 322.78 ms | 52.3% bf16 MFU | 1623718 tok/s step 18763/19560 | loss 3.253973 (-0.73z)| norm 0.2875 (+1.99z)| lr 2.65e-06 | 322.93 ms | 52.3% bf16 MFU | 1623708 tok/s step 18764/19560 | loss 3.243468 (-0.94z)| norm 0.2579 (+0.48z)| lr 2.64e-06 | 322.26 ms | 52.4% bf16 MFU | 1623869 tok/s step 18765/19560 | loss 3.331790 (+0.90z)| norm 0.2508 (+0.11z)| lr 2.63e-06 | 323.15 ms | 52.2% bf16 MFU | 1623797 tok/s step 18766/19560 | loss 3.295171 (+0.15z)| norm 0.2364 (-0.61z)| lr 2.63e-06 | 322.08 ms | 52.4% bf16 MFU | 1623999 tok/s step 18767/19560 | loss 3.296134 (+0.17z)| norm 0.2379 (-0.54z)| lr 2.62e-06 | 322.06 ms | 52.4% bf16 MFU | 1624195 tok/s step 18768/19560 | loss 3.355132 (+1.41z)| norm 0.2536 (+0.26z)| lr 2.61e-06 | 322.51 ms | 52.3% bf16 MFU | 1624268 tok/s step 18769/19560 | loss 3.272763 (-0.34z)| norm 0.3089 (+2.97z)| lr 2.61e-06 | 322.45 ms | 52.3% bf16 MFU | 1624351 tok/s step 18770/19560 | loss 3.312985 (+0.51z)| norm 0.2647 (+0.78z)| lr 2.60e-06 | 323.25 ms | 52.2% bf16 MFU | 1624229 tok/s step 18771/19560 | loss 3.325457 (+0.77z)| norm 0.2431 (-0.30z)| lr 2.59e-06 | 323.29 ms | 52.2% bf16 MFU | 1624104 tok/s step 18772/19560 | loss 3.304826 (+0.34z)| norm 0.2351 (-0.70z)| lr 2.59e-06 | 322.62 ms | 52.3% bf16 MFU | 1624153 tok/s step 18773/19560 | loss 3.278815 (-0.21z)| norm 0.2340 (-0.75z)| lr 2.58e-06 | 321.84 ms | 52.4% bf16 MFU | 1624397 tok/s step 18774/19560 | loss 3.330766 (+0.88z)| norm 0.2394 (-0.48z)| 
lr 2.57e-06 | 322.90 ms | 52.3% bf16 MFU | 1624361 tok/s step 18775/19560 | loss 3.251463 (-0.82z)| norm 0.2287 (-1.01z)| lr 2.57e-06 | 322.51 ms | 52.3% bf16 MFU | 1624424 tok/s step 18776/19560 | loss 3.233802 (-1.18z)| norm 0.2666 (+0.86z)| lr 2.56e-06 | 322.26 ms | 52.4% bf16 MFU | 1624549 tok/s step 18777/19560 | loss 3.312859 (+0.50z)| norm 0.2318 (-0.86z)| lr 2.55e-06 | 322.84 ms | 52.3% bf16 MFU | 1624520 tok/s step 18778/19560 | loss 3.324988 (+0.76z)| norm 0.2341 (-0.74z)| lr 2.55e-06 | 323.06 ms | 52.2% bf16 MFU | 1624438 tok/s step 18779/19560 | loss 3.270796 (-0.41z)| norm 0.2451 (-0.20z)| lr 2.54e-06 | 322.53 ms | 52.3% bf16 MFU | 1624493 tok/s step 18780/19560 | loss 3.327932 (+0.84z)| norm 0.2737 (+1.23z)| lr 2.54e-06 | 322.27 ms | 52.4% bf16 MFU | 1624610 tok/s step 18781/19560 | loss 3.328646 (+0.85z)| norm 0.2659 (+0.83z)| lr 2.53e-06 | 322.39 ms | 52.4% bf16 MFU | 1624693 tok/s step 18782/19560 | loss 3.266966 (-0.48z)| norm 0.2279 (-1.06z)| lr 2.52e-06 | 323.26 ms | 52.2% bf16 MFU | 1624553 tok/s step 18783/19560 | loss 3.248856 (-0.91z)| norm 0.2327 (-0.82z)| lr 2.52e-06 | 322.74 ms | 52.3% bf16 MFU | 1624551 tok/s step 18784/19560 | loss 3.326703 (+0.87z)| norm 0.2353 (-0.68z)| lr 2.51e-06 | 322.42 ms | 52.3% bf16 MFU | 1624629 tok/s step 18785/19560 | loss 3.235108 (-1.25z)| norm 0.2244 (-1.24z)| lr 2.50e-06 | 322.57 ms | 52.3% bf16 MFU | 1624666 tok/s step 18786/19560 | loss 3.287960 (-0.00z)| norm 0.2703 (+1.16z)| lr 2.50e-06 | 323.04 ms | 52.2% bf16 MFU | 1624582 tok/s step 18787/19560 | loss 3.269600 (-0.45z)| norm 0.2367 (-0.60z)| lr 2.49e-06 | 322.53 ms | 52.3% bf16 MFU | 1624629 tok/s step 18788/19560 | loss 3.341702 (+1.27z)| norm 0.2632 (+0.78z)| lr 2.48e-06 | 323.30 ms | 52.2% bf16 MFU | 1624481 tok/s step 18789/19560 | loss 3.293993 (+0.11z)| norm 0.3066 (+2.93z)| lr 2.48e-06 | 322.55 ms | 52.3% bf16 MFU | 1624528 tok/s step 18790/19560 | loss 3.295661 (+0.15z)| norm 0.3427 (+4.35z)| lr 2.47e-06 | 322.20 ms | 52.4% bf16 MFU | 
1624663 tok/s step 18791/19560 | loss 3.326539 (+0.88z)| norm 0.2398 (-0.46z)| lr 2.46e-06 | 322.71 ms | 52.3% bf16 MFU | 1624662 tok/s step 18792/19560 | loss 3.339535 (+1.18z)| norm 0.2402 (-0.43z)| lr 2.46e-06 | 322.17 ms | 52.4% bf16 MFU | 1624797 tok/s step 18793/19560 | loss 3.264181 (-0.64z)| norm 0.2324 (-0.79z)| lr 2.45e-06 | 322.79 ms | 52.3% bf16 MFU | 1624768 tok/s step 18794/19560 | loss 3.256818 (-0.83z)| norm 0.2312 (-0.84z)| lr 2.45e-06 | 322.77 ms | 52.3% bf16 MFU | 1624746 tok/s step 18795/19560 | loss 3.343364 (+1.25z)| norm 0.2871 (+1.76z)| lr 2.44e-06 | 322.19 ms | 52.4% bf16 MFU | 1624871 tok/s step 18796/19560 | loss 3.298754 (+0.16z)| norm 0.2890 (+1.84z)| lr 2.43e-06 | 322.62 ms | 52.3% bf16 MFU | 1624883 tok/s step 18797/19560 | loss 3.292513 (+0.00z)| norm 0.2379 (-0.54z)| lr 2.43e-06 | 322.45 ms | 52.3% bf16 MFU | 1624938 tok/s step 18798/19560 | loss 3.299269 (+0.18z)| norm 0.2364 (-0.61z)| lr 2.42e-06 | 322.07 ms | 52.4% bf16 MFU | 1625085 tok/s step 18799/19560 | loss 3.250381 (-1.01z)| norm 0.2478 (-0.09z)| lr 2.41e-06 | 322.82 ms | 52.3% bf16 MFU | 1625034 tok/s step 18800/19560 | loss 3.489635 (+4.47z)| norm 0.2528 (+0.14z)| lr 2.41e-06 | 323.11 ms | 52.2% bf16 MFU | 1624914 tok/s step 18801/19560 | loss 3.283621 (-0.20z)| norm 0.2675 (+0.82z)| lr 2.40e-06 | 322.49 ms | 52.3% bf16 MFU | 1624956 tok/s step 18802/19560 | loss 3.286372 (-0.15z)| norm 0.2451 (-0.21z)| lr 2.39e-06 | 322.80 ms | 52.3% bf16 MFU | 1624917 tok/s step 18803/19560 | loss 3.297237 (+0.10z)| norm 0.2436 (-0.28z)| lr 2.39e-06 | 322.56 ms | 52.3% bf16 MFU | 1624940 tok/s step 18804/19560 | loss 3.268119 (-0.56z)| norm 0.2648 (+0.76z)| lr 2.38e-06 | 322.26 ms | 52.4% bf16 MFU | 1625039 tok/s step 18805/19560 | loss 3.271845 (-0.48z)| norm 0.2598 (+0.50z)| lr 2.38e-06 | 322.56 ms | 52.3% bf16 MFU | 1625058 tok/s step 18806/19560 | loss 3.304065 (+0.26z)| norm 0.2334 (-0.79z)| lr 2.37e-06 | 322.50 ms | 52.3% bf16 MFU | 1625091 tok/s step 18807/19560 | loss 3.285176 
(-0.19z)| norm 0.2319 (-0.86z)| lr 2.36e-06 | 323.18 ms | 52.2% bf16 MFU | 1624951 tok/s step 18808/19560 | loss 3.341722 (+1.12z)| norm 0.2916 (+2.03z)| lr 2.36e-06 | 322.19 ms | 52.4% bf16 MFU | 1625066 tok/s step 18809/19560 | loss 3.378070 (+1.92z)| norm 0.2369 (-0.62z)| lr 2.35e-06 | 322.79 ms | 52.3% bf16 MFU | 1625024 tok/s step 18810/19560 | loss 3.238729 (-1.31z)| norm 0.2366 (-0.63z)| lr 2.34e-06 | 323.18 ms | 52.2% bf16 MFU | 1624887 tok/s step 18811/19560 | loss 3.252499 (-0.97z)| norm 0.2515 (+0.09z)| lr 2.34e-06 | 322.10 ms | 52.4% bf16 MFU | 1625027 tok/s step 18812/19560 | loss 3.394250 (+2.39z)| norm 0.2658 (+0.78z)| lr 2.33e-06 | 322.18 ms | 52.4% bf16 MFU | 1625142 tok/s step 18813/19560 | loss 3.255625 (-0.89z)| norm 0.2304 (-0.93z)| lr 2.33e-06 | 322.52 ms | 52.3% bf16 MFU | 1625164 tok/s step 18814/19560 | loss 3.283633 (-0.23z)| norm 0.2366 (-0.62z)| lr 2.32e-06 | 322.39 ms | 52.4% bf16 MFU | 1625220 tok/s step 18815/19560 | loss 3.330753 (+0.89z)| norm 0.2983 (+2.45z)| lr 2.31e-06 | 322.68 ms | 52.3% bf16 MFU | 1625198 tok/s step 18816/19560 | loss 3.214871 (-1.82z)| norm 0.2580 (+0.43z)| lr 2.31e-06 | 322.51 ms | 52.3% bf16 MFU | 1625221 tok/s step 18817/19560 | loss 3.372562 (+1.84z)| norm 0.2296 (-0.98z)| lr 2.30e-06 | 322.59 ms | 52.3% bf16 MFU | 1625223 tok/s step 18818/19560 | loss 3.318274 (+0.57z)| norm 0.2368 (-0.62z)| lr 2.29e-06 | 323.26 ms | 52.2% bf16 MFU | 1625055 tok/s step 18819/19560 | loss 3.360260 (+1.53z)| norm 0.2257 (-1.16z)| lr 2.29e-06 | 322.32 ms | 52.4% bf16 MFU | 1625132 tok/s step 18820/19560 | loss 3.259814 (-0.79z)| norm 0.2523 (+0.16z)| lr 2.28e-06 | 322.19 ms | 52.4% bf16 MFU | 1625239 tok/s step 18821/19560 | loss 3.316622 (+0.52z)| norm 0.2394 (-0.49z)| lr 2.28e-06 | 322.51 ms | 52.3% bf16 MFU | 1625259 tok/s step 18822/19560 | loss 3.300522 (+0.15z)| norm 0.2480 (-0.07z)| lr 2.27e-06 | 322.93 ms | 52.3% bf16 MFU | 1625173 tok/s step 18823/19560 | loss 3.290457 (-0.09z)| norm 0.2338 (-0.77z)| lr 2.26e-06 | 
322.42 ms | 52.3% bf16 MFU | 1625220 tok/s step 18824/19560 | loss 3.262817 (-0.72z)| norm 0.2500 (+0.05z)| lr 2.26e-06 | 323.18 ms | 52.2% bf16 MFU | 1625074 tok/s step 18825/19560 | loss 3.258433 (-0.81z)| norm 0.2284 (-1.04z)| lr 2.25e-06 | 322.67 ms | 52.3% bf16 MFU | 1625063 tok/s step 18826/19560 | loss 3.303374 (+0.23z)| norm 0.2334 (-0.78z)| lr 2.25e-06 | 322.71 ms | 52.3% bf16 MFU | 1625043 tok/s step 18827/19560 | loss 3.334145 (+0.94z)| norm 0.2983 (+2.40z)| lr 2.24e-06 | 322.85 ms | 52.3% bf16 MFU | 1624987 tok/s step 18828/19560 | loss 3.340106 (+1.10z)| norm 0.2383 (-0.54z)| lr 2.23e-06 | 323.27 ms | 52.2% bf16 MFU | 1624829 tok/s step 18829/19560 | loss 3.269754 (-0.61z)| norm 0.2320 (-0.85z)| lr 2.23e-06 | 322.40 ms | 52.3% bf16 MFU | 1624897 tok/s step 18830/19560 | loss 3.248829 (-1.11z)| norm 0.2488 (-0.03z)| lr 2.22e-06 | 323.63 ms | 52.1% bf16 MFU | 1624653 tok/s step 18831/19560 | loss 3.303454 (+0.22z)| norm 0.2307 (-0.91z)| lr 2.22e-06 | 322.81 ms | 52.3% bf16 MFU | 1624627 tok/s step 18832/19560 | loss 3.309545 (+0.37z)| norm 0.2232 (-1.26z)| lr 2.21e-06 | 323.15 ms | 52.2% bf16 MFU | 1624518 tok/s step 18833/19560 | loss 3.263886 (-0.73z)| norm 0.2274 (-1.04z)| lr 2.20e-06 | 322.18 ms | 52.4% bf16 MFU | 1624659 tok/s step 18834/19560 | loss 3.341308 (+1.13z)| norm 0.2413 (-0.33z)| lr 2.20e-06 | 322.98 ms | 52.3% bf16 MFU | 1624589 tok/s step 18835/19560 | loss 3.288013 (-0.15z)| norm 0.2355 (-0.62z)| lr 2.19e-06 | 322.47 ms | 52.3% bf16 MFU | 1624652 tok/s step 18836/19560 | loss 3.358906 (+1.55z)| norm 0.2957 (+2.36z)| lr 2.19e-06 | 323.09 ms | 52.2% bf16 MFU | 1624557 tok/s step 18837/19560 | loss 3.311873 (+0.40z)| norm 0.2535 (+0.27z)| lr 2.18e-06 | 323.63 ms | 52.1% bf16 MFU | 1624330 tok/s step 18838/19560 | loss 3.335228 (+0.95z)| norm 0.2508 (+0.14z)| lr 2.17e-06 | 322.27 ms | 52.4% bf16 MFU | 1624456 tok/s step 18839/19560 | loss 3.271599 (-0.59z)| norm 0.2279 (-0.98z)| lr 2.17e-06 | 322.78 ms | 52.3% bf16 MFU | 1624448 tok/s step 
18840/19560 | loss 3.262029 (-0.81z)| norm 0.2281 (-0.96z)| lr 2.16e-06 | 323.77 ms | 52.1% bf16 MFU | 1624191 tok/s step 18841/19560 | loss 3.287608 (-0.19z)| norm 0.2345 (-0.64z)| lr 2.16e-06 | 322.84 ms | 52.3% bf16 MFU | 1624181 tok/s step 18842/19560 | loss 3.308968 (+0.32z)| norm 0.2362 (-0.54z)| lr 2.15e-06 | 322.66 ms | 52.3% bf16 MFU | 1624218 tok/s step 18843/19560 | loss 3.365084 (+1.66z)| norm 0.2429 (-0.22z)| lr 2.14e-06 | 323.08 ms | 52.2% bf16 MFU | 1624146 tok/s step 18844/19560 | loss 3.295907 (-0.00z)| norm 0.2783 (+1.51z)| lr 2.14e-06 | 323.31 ms | 52.2% bf16 MFU | 1624019 tok/s step 18845/19560 | loss 3.271472 (-0.60z)| norm 0.2424 (-0.25z)| lr 2.13e-06 | 322.74 ms | 52.3% bf16 MFU | 1624042 tok/s step 18846/19560 | loss 3.289859 (-0.15z)| norm 0.2305 (-0.83z)| lr 2.13e-06 | 321.99 ms | 52.4% bf16 MFU | 1624254 tok/s step 18847/19560 | loss 3.285199 (-0.27z)| norm 0.2536 (+0.29z)| lr 2.12e-06 | 322.64 ms | 52.3% bf16 MFU | 1624291 tok/s step 18848/19560 | loss 3.274688 (-0.53z)| norm 0.2325 (-0.73z)| lr 2.11e-06 | 323.08 ms | 52.2% bf16 MFU | 1624214 tok/s step 18849/19560 | loss 3.318478 (+0.53z)| norm 0.2362 (-0.55z)| lr 2.11e-06 | 322.70 ms | 52.3% bf16 MFU | 1624239 tok/s step 18850/19560 | loss 3.310767 (+0.35z)| norm 0.2544 (+0.33z)| lr 2.10e-06 | 322.75 ms | 52.3% bf16 MFU | 1624249 tok/s step 18851/19560 | loss 3.335711 (+0.94z)| norm 0.2324 (-0.74z)| lr 2.10e-06 | 322.61 ms | 52.3% bf16 MFU | 1624292 tok/s step 18852/19560 | loss 3.335258 (+0.91z)| norm 0.2362 (-0.54z)| lr 2.09e-06 | 323.34 ms | 52.2% bf16 MFU | 1624152 tok/s step 18853/19560 | loss 3.270185 (-0.67z)| norm 0.2319 (-0.75z)| lr 2.08e-06 | 322.78 ms | 52.3% bf16 MFU | 1624158 tok/s step 18854/19560 | loss 3.357802 (+1.49z)| norm 0.2435 (-0.18z)| lr 2.08e-06 | 322.56 ms | 52.3% bf16 MFU | 1624220 tok/s step 18855/19560 | loss 3.363469 (+1.60z)| norm 0.2420 (-0.26z)| lr 2.07e-06 | 323.31 ms | 52.2% bf16 MFU | 1624090 tok/s step 18856/19560 | loss 3.315926 (+0.43z)| norm 
0.2401 (-0.35z)| lr 2.07e-06 | 322.46 ms | 52.3% bf16 MFU | 1624182 tok/s step 18857/19560 | loss 3.315904 (+0.44z)| norm 0.2616 (+0.70z)| lr 2.06e-06 | 322.74 ms | 52.3% bf16 MFU | 1624198 tok/s step 18858/19560 | loss 3.265355 (-0.80z)| norm 0.2522 (+0.23z)| lr 2.05e-06 | 323.23 ms | 52.2% bf16 MFU | 1624090 tok/s step 18859/19560 | loss 3.274149 (-0.60z)| norm 0.2777 (+1.47z)| lr 2.05e-06 | 322.83 ms | 52.3% bf16 MFU | 1624087 tok/s step 18860/19560 | loss 3.281290 (-0.41z)| norm 0.2526 (+0.23z)| lr 2.04e-06 | 322.31 ms | 52.4% bf16 MFU | 1624216 tok/s step 18861/19560 | loss 3.270825 (-0.66z)| norm 0.2360 (-0.58z)| lr 2.04e-06 | 323.13 ms | 52.2% bf16 MFU | 1624131 tok/s step 18862/19560 | loss 3.326401 (+0.71z)| norm 0.2542 (+0.30z)| lr 2.03e-06 | 322.72 ms | 52.3% bf16 MFU | 1624155 tok/s step 18863/19560 | loss 3.313870 (+0.40z)| norm 0.2847 (+1.77z)| lr 2.03e-06 | 323.09 ms | 52.2% bf16 MFU | 1624084 tok/s step 18864/19560 | loss 3.280265 (-0.46z)| norm 0.2292 (-0.93z)| lr 2.02e-06 | 323.10 ms | 52.2% bf16 MFU | 1624014 tok/s step 18865/19560 | loss 3.369154 (+1.76z)| norm 0.2313 (-0.82z)| lr 2.01e-06 | 323.30 ms | 52.2% bf16 MFU | 1623896 tok/s step 18866/19560 | loss 3.295620 (-0.10z)| norm 0.2277 (-0.99z)| lr 2.01e-06 | 323.29 ms | 52.2% bf16 MFU | 1623788 tok/s step 18867/19560 | loss 3.292583 (-0.18z)| norm 0.2295 (-0.89z)| lr 2.00e-06 | 322.69 ms | 52.3% bf16 MFU | 1623834 tok/s step 18868/19560 | loss 3.367362 (+1.68z)| norm 0.2641 (+0.79z)| lr 2.00e-06 | 323.24 ms | 52.2% bf16 MFU | 1623741 tok/s step 18869/19560 | loss 3.287007 (-0.33z)| norm 0.2732 (+1.21z)| lr 1.99e-06 | 322.54 ms | 52.3% bf16 MFU | 1623828 tok/s step 18870/19560 | loss 3.292265 (-0.20z)| norm 0.2313 (-0.79z)| lr 1.99e-06 | 322.32 ms | 52.4% bf16 MFU | 1623967 tok/s step 18871/19560 | loss 3.323164 (+0.58z)| norm 0.2647 (+0.80z)| lr 1.98e-06 | 322.62 ms | 52.3% bf16 MFU | 1624023 tok/s step 18872/19560 | loss 3.309722 (+0.26z)| norm 0.2301 (-0.85z)| lr 1.97e-06 | 322.74 ms | 
52.3% bf16 MFU | 1624046 tok/s step 18873/19560 | loss 3.334164 (+0.87z)| norm 0.2502 (+0.11z)| lr 1.97e-06 | 322.64 ms | 52.3% bf16 MFU | 1624093 tok/s step 18874/19560 | loss 3.280023 (-0.51z)| norm 0.2319 (-0.75z)| lr 1.96e-06 | 323.26 ms | 52.2% bf16 MFU | 1623982 tok/s step 18875/19560 | loss 3.351855 (+1.30z)| norm 0.2278 (-0.94z)| lr 1.96e-06 | 323.12 ms | 52.2% bf16 MFU | 1623912 tok/s step 18876/19560 | loss 3.272285 (-0.73z)| norm 0.2346 (-0.61z)| lr 1.95e-06 | 322.56 ms | 52.3% bf16 MFU | 1623987 tok/s step 18877/19560 | loss 3.241423 (-1.51z)| norm 0.3087 (+2.82z)| lr 1.95e-06 | 323.07 ms | 52.2% bf16 MFU | 1623928 tok/s step 18878/19560 | loss 3.358942 (+1.46z)| norm 0.2339 (-0.64z)| lr 1.94e-06 | 322.81 ms | 52.3% bf16 MFU | 1623938 tok/s step 18879/19560 | loss 3.264940 (-0.91z)| norm 0.2455 (-0.10z)| lr 1.93e-06 | 323.43 ms | 52.2% bf16 MFU | 1623792 tok/s step 18880/19560 | loss 3.402499 (+2.48z)| norm 0.2843 (+1.66z)| lr 1.93e-06 | 322.12 ms | 52.4% bf16 MFU | 1623982 tok/s step 18881/19560 | loss 3.249750 (-1.28z)| norm 0.2391 (-0.42z)| lr 1.92e-06 | 322.96 ms | 52.3% bf16 MFU | 1623953 tok/s step 18882/19560 | loss 3.301140 (-0.01z)| norm 0.2336 (-0.67z)| lr 1.92e-06 | 322.39 ms | 52.4% bf16 MFU | 1624068 tok/s step 18883/19560 | loss 3.297466 (-0.10z)| norm 0.2965 (+2.16z)| lr 1.91e-06 | 322.54 ms | 52.3% bf16 MFU | 1624140 tok/s step 18884/19560 | loss 3.270291 (-0.76z)| norm 0.2528 (+0.18z)| lr 1.91e-06 | 323.96 ms | 52.1% bf16 MFU | 1623852 tok/s step 18885/19560 | loss 3.258807 (-1.05z)| norm 0.2529 (+0.18z)| lr 1.90e-06 | 323.02 ms | 52.2% bf16 MFU | 1623814 tok/s step 18886/19560 | loss 3.332834 (+0.77z)| norm 0.2293 (-0.88z)| lr 1.89e-06 | 322.59 ms | 52.3% bf16 MFU | 1623886 tok/s step 18887/19560 | loss 3.310992 (+0.23z)| norm 0.2674 (+0.83z)| lr 1.89e-06 | 322.51 ms | 52.3% bf16 MFU | 1623975 tok/s step 18888/19560 | loss 3.290231 (-0.29z)| norm 0.2365 (-0.57z)| lr 1.88e-06 | 323.87 ms | 52.1% bf16 MFU | 1623719 tok/s step 18889/19560 
| loss 3.319905 (+0.45z)| norm 0.2322 (-0.77z)| lr 1.88e-06 | 322.83 ms | 52.3% bf16 MFU | 1623736 tok/s step 18890/19560 | loss 3.282154 (-0.48z)| norm 0.2463 (-0.12z)| lr 1.87e-06 | 323.28 ms | 52.2% bf16 MFU | 1623637 tok/s step 18891/19560 | loss 3.306343 (+0.11z)| norm 0.2267 (-1.00z)| lr 1.87e-06 | 322.79 ms | 52.3% bf16 MFU | 1623668 tok/s step 18892/19560 | loss 3.265449 (-0.92z)| norm 0.2417 (-0.30z)| lr 1.86e-06 | 322.44 ms | 52.3% bf16 MFU | 1623786 tok/s step 18893/19560 | loss 3.349765 (+1.18z)| norm 0.2443 (-0.18z)| lr 1.86e-06 | 322.96 ms | 52.3% bf16 MFU | 1623767 tok/s step 18894/19560 | loss 3.269383 (-0.82z)| norm 0.2438 (-0.21z)| lr 1.85e-06 | 323.93 ms | 52.1% bf16 MFU | 1623504 tok/s step 18895/19560 | loss 3.282837 (-0.48z)| norm 0.2696 (+0.96z)| lr 1.84e-06 | 323.34 ms | 52.2% bf16 MFU | 1623403 tok/s step 18896/19560 | loss 3.334063 (+0.80z)| norm 0.2341 (-0.65z)| lr 1.84e-06 | 323.22 ms | 52.2% bf16 MFU | 1623336 tok/s step 18897/19560 | loss 3.306421 (+0.10z)| norm 0.2583 (+0.48z)| lr 1.83e-06 | 322.78 ms | 52.3% bf16 MFU | 1623382 tok/s step 18898/19560 | loss 3.304432 (+0.06z)| norm 0.2662 (+0.85z)| lr 1.83e-06 | 323.44 ms | 52.2% bf16 MFU | 1623262 tok/s step 18899/19560 | loss 3.341844 (+0.99z)| norm 0.2435 (-0.21z)| lr 1.82e-06 | 323.99 ms | 52.1% bf16 MFU | 1623010 tok/s step 18900/19560 | loss 3.335851 (+0.83z)| norm 0.2241 (-1.12z)| lr 1.82e-06 | 322.58 ms | 52.3% bf16 MFU | 1623125 tok/s step 18901/19560 | loss 3.234061 (-1.68z)| norm 0.2287 (-0.90z)| lr 1.81e-06 | 323.59 ms | 52.2% bf16 MFU | 1622980 tok/s step 18902/19560 | loss 3.360747 (+1.43z)| norm 0.2389 (-0.42z)| lr 1.81e-06 | 323.00 ms | 52.3% bf16 MFU | 1622989 tok/s step 18903/19560 | loss 3.260835 (-1.02z)| norm 0.2244 (-1.10z)| lr 1.80e-06 | 323.33 ms | 52.2% bf16 MFU | 1622916 tok/s step 18904/19560 | loss 3.324749 (+0.53z)| norm 0.2433 (-0.21z)| lr 1.79e-06 | 322.45 ms | 52.3% bf16 MFU | 1623068 tok/s step 18905/19560 | loss 3.281020 (-0.54z)| norm 0.2442 (-0.17z)| 
lr 1.79e-06 | 323.29 ms | 52.2% bf16 MFU | 1623001 tok/s step 18906/19560 | loss 3.295523 (-0.18z)| norm 0.2604 (+0.59z)| lr 1.78e-06 | 322.94 ms | 52.3% bf16 MFU | 1623027 tok/s step 18907/19560 | loss 3.278681 (-0.60z)| norm 0.2904 (+1.95z)| lr 1.78e-06 | 323.02 ms | 52.2% bf16 MFU | 1623029 tok/s step 18908/19560 | loss 3.272352 (-0.74z)| norm 0.2361 (-0.56z)| lr 1.77e-06 | 322.62 ms | 52.3% bf16 MFU | 1623132 tok/s step 18909/19560 | loss 3.304499 (+0.06z)| norm 0.2496 (+0.08z)| lr 1.77e-06 | 322.87 ms | 52.3% bf16 MFU | 1623167 tok/s step 18910/19560 | loss 3.298615 (-0.09z)| norm 0.2385 (-0.45z)| lr 1.76e-06 | 322.83 ms | 52.3% bf16 MFU | 1623211 tok/s step 18911/19560 | loss 3.338891 (+0.89z)| norm 0.2524 (+0.20z)| lr 1.76e-06 | 323.03 ms | 52.2% bf16 MFU | 1623201 tok/s step 18912/19560 | loss 3.266259 (-0.91z)| norm 0.2594 (+0.52z)| lr 1.75e-06 | 323.44 ms | 52.2% bf16 MFU | 1623090 tok/s step 18913/19560 | loss 3.317482 (+0.36z)| norm 0.2275 (-0.98z)| lr 1.75e-06 | 322.75 ms | 52.3% bf16 MFU | 1623159 tok/s step 18914/19560 | loss 3.312356 (+0.22z)| norm 0.2375 (-0.50z)| lr 1.74e-06 | 323.38 ms | 52.2% bf16 MFU | 1623064 tok/s step 18915/19560 | loss 3.263609 (-1.00z)| norm 0.2987 (+2.31z)| lr 1.74e-06 | 322.77 ms | 52.3% bf16 MFU | 1623129 tok/s step 18916/19560 | loss 3.246006 (-1.42z)| norm 0.2359 (-0.58z)| lr 1.73e-06 | 323.37 ms | 52.2% bf16 MFU | 1623039 tok/s step 18917/19560 | loss 3.316644 (+0.35z)| norm 0.2536 (+0.26z)| lr 1.72e-06 | 322.54 ms | 52.3% bf16 MFU | 1623161 tok/s step 18918/19560 | loss 3.413949 (+2.68z)| norm 0.2847 (+1.90z)| lr 1.72e-06 | 323.01 ms | 52.2% bf16 MFU | 1623158 tok/s step 18919/19560 | loss 3.271730 (-0.77z)| norm 0.2376 (-0.50z)| lr 1.71e-06 | 323.00 ms | 52.3% bf16 MFU | 1623160 tok/s step 18920/19560 | loss 3.318205 (+0.37z)| norm 0.2418 (-0.29z)| lr 1.71e-06 | 322.25 ms | 52.4% bf16 MFU | 1623350 tok/s step 18921/19560 | loss 3.213272 (-2.15z)| norm 0.2357 (-0.61z)| lr 1.70e-06 | 323.05 ms | 52.2% bf16 MFU | 
1623329 tok/s step 18922/19560 | loss 3.275551 (-0.66z)| norm 0.2248 (-1.16z)| lr 1.70e-06 | 323.07 ms | 52.2% bf16 MFU | 1623303 tok/s step 18923/19560 | loss 3.286259 (-0.39z)| norm 0.2383 (-0.46z)| lr 1.69e-06 | 323.01 ms | 52.2% bf16 MFU | 1623295 tok/s step 18924/19560 | loss 3.277746 (-0.59z)| norm 0.2513 (+0.24z)| lr 1.69e-06 | 322.97 ms | 52.3% bf16 MFU | 1623297 tok/s step 18925/19560 | loss 3.253678 (-1.16z)| norm 0.2718 (+1.30z)| lr 1.68e-06 | 322.27 ms | 52.4% bf16 MFU | 1623474 tok/s step 18926/19560 | loss 3.296340 (-0.14z)| norm 0.2527 (+0.29z)| lr 1.68e-06 | 323.35 ms | 52.2% bf16 MFU | 1623371 tok/s step 18927/19560 | loss 3.286576 (-0.38z)| norm 0.2521 (+0.26z)| lr 1.67e-06 | 322.60 ms | 52.3% bf16 MFU | 1623463 tok/s step 18928/19560 | loss 3.247656 (-1.38z)| norm 0.2210 (-1.35z)| lr 1.67e-06 | 322.90 ms | 52.3% bf16 MFU | 1623474 tok/s step 18929/19560 | loss 3.261998 (-1.00z)| norm 0.2300 (-0.87z)| lr 1.66e-06 | 323.45 ms | 52.2% bf16 MFU | 1623346 tok/s step 18930/19560 | loss 3.207069 (-2.36z)| norm 0.2506 (+0.20z)| lr 1.66e-06 | 322.67 ms | 52.3% bf16 MFU | 1623420 tok/s step 18931/19560 | loss 3.263300 (-0.92z)| norm 0.2294 (-0.90z)| lr 1.65e-06 | 323.17 ms | 52.2% bf16 MFU | 1623364 tok/s step 18932/19560 | loss 3.267941 (-0.80z)| norm 0.2605 (+0.72z)| lr 1.65e-06 | 322.96 ms | 52.3% bf16 MFU | 1623366 tok/s step 18933/19560 | loss 3.237557 (-1.55z)| norm 0.2544 (+0.41z)| lr 1.64e-06 | 323.03 ms | 52.2% bf16 MFU | 1623349 tok/s step 18934/19560 | loss 3.350679 (+1.29z)| norm 0.2622 (+0.81z)| lr 1.63e-06 | 323.76 ms | 52.1% bf16 MFU | 1623150 tok/s step 18935/19560 | loss 3.244780 (-1.35z)| norm 0.2306 (-0.84z)| lr 1.63e-06 | 323.00 ms | 52.3% bf16 MFU | 1623150 tok/s step 18936/19560 | loss 3.267953 (-0.76z)| norm 0.2385 (-0.42z)| lr 1.62e-06 | 322.78 ms | 52.3% bf16 MFU | 1623208 tok/s step 18937/19560 | loss 3.330532 (+0.82z)| norm 0.2819 (+1.85z)| lr 1.62e-06 | 322.91 ms | 52.3% bf16 MFU | 1623230 tok/s step 18938/19560 | loss 3.330638 
(+0.81z)| norm 0.2340 (-0.67z)| lr 1.61e-06 | 322.84 ms | 52.3% bf16 MFU | 1623266 tok/s step 18939/19560 | loss 3.242438 (-1.43z)| norm 0.2235 (-1.20z)| lr 1.61e-06 | 323.11 ms | 52.2% bf16 MFU | 1623235 tok/s step 18940/19560 | loss 3.238943 (-1.51z)| norm 0.2572 (+0.56z)| lr 1.60e-06 | 322.80 ms | 52.3% bf16 MFU | 1623282 tok/s step 18941/19560 | loss 3.377970 (+2.02z)| norm 0.2841 (+1.93z)| lr 1.60e-06 | 322.72 ms | 52.3% bf16 MFU | 1623347 tok/s step 18942/19560 | loss 3.281736 (-0.43z)| norm 0.2344 (-0.64z)| lr 1.59e-06 | 322.85 ms | 52.3% bf16 MFU | 1623376 tok/s step 18943/19560 | loss 3.235565 (-1.57z)| norm 0.2318 (-0.77z)| lr 1.59e-06 | 322.33 ms | 52.4% bf16 MFU | 1623534 tok/s step 18944/19560 | loss 3.309377 (+0.28z)| norm 0.2446 (-0.09z)| lr 1.58e-06 | 322.15 ms | 52.4% bf16 MFU | 1623732 tok/s step 18945/19560 | loss 3.294449 (-0.09z)| norm 0.2232 (-1.22z)| lr 1.58e-06 | 323.26 ms | 52.2% bf16 MFU | 1623640 tok/s step 18946/19560 | loss 3.293093 (-0.12z)| norm 0.2317 (-0.76z)| lr 1.57e-06 | 322.53 ms | 52.3% bf16 MFU | 1623734 tok/s step 18947/19560 | loss 3.207849 (-2.29z)| norm 0.2463 (+0.00z)| lr 1.57e-06 | 322.59 ms | 52.3% bf16 MFU | 1623810 tok/s step 18948/19560 | loss 3.264426 (-0.83z)| norm 0.2314 (-0.78z)| lr 1.56e-06 | 322.48 ms | 52.3% bf16 MFU | 1623909 tok/s step 18949/19560 | loss 3.339788 (+1.11z)| norm 0.2994 (+2.73z)| lr 1.56e-06 | 322.88 ms | 52.3% bf16 MFU | 1623902 tok/s step 18950/19560 | loss 3.216854 (-2.01z)| norm 0.2611 (+0.74z)| lr 1.55e-06 | 323.05 ms | 52.2% bf16 MFU | 1623854 tok/s step 18951/19560 | loss 3.287913 (-0.21z)| norm 0.2222 (-1.25z)| lr 1.55e-06 | 322.61 ms | 52.3% bf16 MFU | 1623919 tok/s step 18952/19560 | loss 3.280205 (-0.41z)| norm 0.2257 (-1.06z)| lr 1.54e-06 | 322.58 ms | 52.3% bf16 MFU | 1623989 tok/s step 18953/19560 | loss 3.274002 (-0.57z)| norm 0.2609 (+0.73z)| lr 1.54e-06 | 323.27 ms | 52.2% bf16 MFU | 1623881 tok/s step 18954/19560 | loss 3.319890 (+0.60z)| norm 0.2263 (-1.03z)| lr 1.53e-06 | 
322.73 ms | 52.3% bf16 MFU | 1623915 tok/s step 18955/19560 | loss 3.229926 (-1.66z)| norm 0.2918 (+2.32z)| lr 1.53e-06 | 322.62 ms | 52.3% bf16 MFU | 1623975 tok/s step 18956/19560 | loss 3.230926 (-1.61z)| norm 0.2328 (-0.70z)| lr 1.52e-06 | 322.47 ms | 52.3% bf16 MFU | 1624069 tok/s step 18957/19560 | loss 3.285321 (-0.24z)| norm 0.2435 (-0.16z)| lr 1.52e-06 | 323.03 ms | 52.2% bf16 MFU | 1624017 tok/s step 18958/19560 | loss 3.345006 (+1.24z)| norm 0.2496 (+0.15z)| lr 1.51e-06 | 322.57 ms | 52.3% bf16 MFU | 1624083 tok/s step 18959/19560 | loss 3.287722 (-0.20z)| norm 0.2516 (+0.25z)| lr 1.51e-06 | 322.62 ms | 52.3% bf16 MFU | 1624134 tok/s step 18960/19560 | loss 3.302962 (+0.19z)| norm 0.3265 (+3.84z)| lr 1.50e-06 | 323.10 ms | 52.2% bf16 MFU | 1624060 tok/s step 18961/19560 | loss 3.282878 (-0.32z)| norm 0.2523 (+0.22z)| lr 1.50e-06 | 322.18 ms | 52.4% bf16 MFU | 1624223 tok/s step 18962/19560 | loss 3.287015 (-0.21z)| norm 0.2417 (-0.30z)| lr 1.49e-06 | 322.79 ms | 52.3% bf16 MFU | 1624224 tok/s step 18963/19560 | loss 3.406940 (+2.72z)| norm 0.2429 (-0.24z)| lr 1.49e-06 | 322.46 ms | 52.3% bf16 MFU | 1624306 tok/s step 18964/19560 | loss 3.344846 (+1.20z)| norm 0.2273 (-0.99z)| lr 1.48e-06 | 322.54 ms | 52.3% bf16 MFU | 1624367 tok/s step 18965/19560 | loss 3.310990 (+0.37z)| norm 0.2583 (+0.54z)| lr 1.48e-06 | 322.90 ms | 52.3% bf16 MFU | 1624331 tok/s step 18966/19560 | loss 3.295127 (-0.01z)| norm 0.2399 (-0.36z)| lr 1.47e-06 | 322.84 ms | 52.3% bf16 MFU | 1624314 tok/s step 18967/19560 | loss 3.272367 (-0.58z)| norm 0.2234 (-1.18z)| lr 1.47e-06 | 322.24 ms | 52.4% bf16 MFU | 1624449 tok/s step 18968/19560 | loss 3.251211 (-1.10z)| norm 0.2371 (-0.50z)| lr 1.46e-06 | 322.45 ms | 52.3% bf16 MFU | 1624523 tok/s step 18969/19560 | loss 3.250452 (-1.10z)| norm 0.2449 (-0.12z)| lr 1.46e-06 | 322.59 ms | 52.3% bf16 MFU | 1624560 tok/s step 18970/19560 | loss 3.309643 (+0.35z)| norm 0.2416 (-0.29z)| lr 1.45e-06 | 322.69 ms | 52.3% bf16 MFU | 1624569 tok/s step 
18971/19560 | loss 3.275833 (-0.47z)| norm 0.2403 (-0.35z)| lr 1.45e-06 | 322.26 ms | 52.4% bf16 MFU | 1624687 tok/s step 18972/19560 | loss 3.427148 (+3.14z)| norm 0.3760 (+5.59z)| lr 1.44e-06 | 322.56 ms | 52.3% bf16 MFU | 1624724 tok/s step 18973/19560 | loss 3.301376 (+0.13z)| norm 0.2569 (+0.38z)| lr 1.44e-06 | 322.62 ms | 52.3% bf16 MFU | 1624742 tok/s step 18974/19560 | loss 3.272385 (-0.56z)| norm 0.2399 (-0.37z)| lr 1.43e-06 | 322.64 ms | 52.3% bf16 MFU | 1624756 tok/s step 18975/19560 | loss 3.349358 (+1.26z)| norm 0.2619 (+0.59z)| lr 1.43e-06 | 322.07 ms | 52.4% bf16 MFU | 1624911 tok/s step 18976/19560 | loss 3.259958 (-0.86z)| norm 0.2398 (-0.38z)| lr 1.42e-06 | 323.38 ms | 52.2% bf16 MFU | 1624729 tok/s step 18977/19560 | loss 3.304344 (+0.20z)| norm 0.2406 (-0.34z)| lr 1.42e-06 | 322.16 ms | 52.4% bf16 MFU | 1624864 tok/s step 18978/19560 | loss 3.279763 (-0.38z)| norm 0.2371 (-0.49z)| lr 1.41e-06 | 322.18 ms | 52.4% bf16 MFU | 1624986 tok/s step 18979/19560 | loss 3.191910 (-2.39z)| norm 0.2628 (+0.62z)| lr 1.41e-06 | 323.30 ms | 52.2% bf16 MFU | 1624820 tok/s step 18980/19560 | loss 3.308120 (+0.32z)| norm 0.2500 (+0.06z)| lr 1.40e-06 | 322.67 ms | 52.3% bf16 MFU | 1624821 tok/s step 18981/19560 | loss 3.249804 (-1.04z)| norm 0.2974 (+2.08z)| lr 1.40e-06 | 322.67 ms | 52.3% bf16 MFU | 1624822 tok/s step 18982/19560 | loss 3.222304 (-1.65z)| norm 0.2303 (-0.81z)| lr 1.39e-06 | 323.00 ms | 52.3% bf16 MFU | 1624740 tok/s step 18983/19560 | loss 3.285583 (-0.17z)| norm 0.2282 (-0.89z)| lr 1.39e-06 | 323.18 ms | 52.2% bf16 MFU | 1624616 tok/s step 18984/19560 | loss 3.239424 (-1.23z)| norm 0.2430 (-0.26z)| lr 1.38e-06 | 322.38 ms | 52.4% bf16 MFU | 1624701 tok/s step 18985/19560 | loss 3.210497 (-1.86z)| norm 0.2315 (-0.74z)| lr 1.38e-06 | 322.25 ms | 52.4% bf16 MFU | 1624813 tok/s step 18986/19560 | loss 3.322721 (+0.72z)| norm 0.2390 (-0.41z)| lr 1.38e-06 | 323.18 ms | 52.2% bf16 MFU | 1624686 tok/s step 18987/19560 | loss 3.213575 (-1.77z)| norm 
0.2236 (-1.06z)| lr 1.37e-06 | 322.68 ms | 52.3% bf16 MFU | 1624690 tok/s step 18988/19560 | loss 3.274154 (-0.39z)| norm 0.2352 (-0.55z)| lr 1.37e-06 | 322.31 ms | 52.4% bf16 MFU | 1624790 tok/s step 18989/19560 | loss 3.238815 (-1.18z)| norm 0.2348 (-0.57z)| lr 1.36e-06 | 322.57 ms | 52.3% bf16 MFU | 1624818 tok/s step 18990/19560 | loss 3.228454 (-1.39z)| norm 0.2298 (-0.77z)| lr 1.36e-06 | 322.64 ms | 52.3% bf16 MFU | 1624826 tok/s step 18991/19560 | loss 3.298103 (+0.18z)| norm 0.3034 (+2.34z)| lr 1.35e-06 | 322.90 ms | 52.3% bf16 MFU | 1624769 tok/s step 18992/19560 | loss 3.287405 (-0.06z)| norm 0.2537 (+0.23z)| lr 1.35e-06 | 322.90 ms | 52.3% bf16 MFU | 1624714 tok/s step 18993/19560 | loss 3.305640 (+0.37z)| norm 0.2880 (+1.65z)| lr 1.34e-06 | 322.49 ms | 52.3% bf16 MFU | 1624765 tok/s step 18994/19560 | loss 3.280278 (-0.21z)| norm 0.2309 (-0.75z)| lr 1.34e-06 | 322.25 ms | 52.4% bf16 MFU | 1624876 tok/s step 18995/19560 | loss 3.273921 (-0.35z)| norm 0.2379 (-0.46z)| lr 1.33e-06 | 322.65 ms | 52.3% bf16 MFU | 1624879 tok/s step 18996/19560 | loss 3.325275 (+0.84z)| norm 0.2322 (-0.69z)| lr 1.33e-06 | 323.06 ms | 52.2% bf16 MFU | 1624778 tok/s step 18997/19560 | loss 3.223436 (-1.49z)| norm 0.2383 (-0.42z)| lr 1.32e-06 | 322.42 ms | 52.3% bf16 MFU | 1624845 tok/s step 18998/19560 | loss 3.229519 (-1.33z)| norm 0.2357 (-0.53z)| lr 1.32e-06 | 322.44 ms | 52.3% bf16 MFU | 1624904 tok/s step 18999/19560 | loss 3.232114 (-1.25z)| norm 0.2327 (-0.65z)| lr 1.31e-06 | 322.93 ms | 52.3% bf16 MFU | 1624835 tok/s step 19000/19560 | loss 3.246706 (-0.90z)| norm 0.2653 (+0.72z)| lr 1.31e-06 | 322.89 ms | 52.3% bf16 MFU | 1624779 tok/s val loss 3.274507 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating HellaSwag: 70/79 HellaSwag: 3052/10042 = 0.303924 step 19001/19560 | loss 3.315334 (+0.65z)| norm 0.2624 (+0.59z)| 
lr 1.30e-06 | 322.95 ms | 52.3% bf16 MFU | 1624713 tok/s step 19002/19560 | loss 3.281774 (-0.11z)| norm 0.2259 (-0.95z)| lr 1.30e-06 | 322.70 ms | 52.3% bf16 MFU | 1624712 tok/s step 19003/19560 | loss 3.287724 (+0.04z)| norm 0.2454 (-0.13z)| lr 1.29e-06 | 322.55 ms | 52.3% bf16 MFU | 1624748 tok/s step 19004/19560 | loss 3.285656 (-0.01z)| norm 0.2290 (-0.82z)| lr 1.29e-06 | 322.81 ms | 52.3% bf16 MFU | 1624718 tok/s step 19005/19560 | loss 3.332619 (+1.04z)| norm 0.2357 (-0.53z)| lr 1.29e-06 | 322.92 ms | 52.3% bf16 MFU | 1624662 tok/s step 19006/19560 | loss 3.236593 (-1.13z)| norm 0.2459 (-0.09z)| lr 1.28e-06 | 322.78 ms | 52.3% bf16 MFU | 1624644 tok/s step 19007/19560 | loss 3.289782 (+0.08z)| norm 0.2276 (-0.88z)| lr 1.28e-06 | 323.01 ms | 52.2% bf16 MFU | 1624568 tok/s step 19008/19560 | loss 3.330597 (+1.06z)| norm 0.2599 (+0.53z)| lr 1.27e-06 | 322.59 ms | 52.3% bf16 MFU | 1624602 tok/s step 19009/19560 | loss 3.303335 (+0.41z)| norm 0.2725 (+1.07z)| lr 1.27e-06 | 323.03 ms | 52.2% bf16 MFU | 1624524 tok/s step 19010/19560 | loss 3.284701 (-0.03z)| norm 0.2597 (+0.51z)| lr 1.26e-06 | 322.85 ms | 52.3% bf16 MFU | 1624495 tok/s step 19011/19560 | loss 3.360579 (+1.73z)| norm 0.2402 (-0.33z)| lr 1.26e-06 | 323.04 ms | 52.2% bf16 MFU | 1624420 tok/s step 19012/19560 | loss 3.331506 (+1.04z)| norm 0.2374 (-0.45z)| lr 1.25e-06 | 322.53 ms | 52.3% bf16 MFU | 1624476 tok/s step 19013/19560 | loss 3.255437 (-0.73z)| norm 0.2247 (-1.00z)| lr 1.25e-06 | 322.98 ms | 52.3% bf16 MFU | 1624417 tok/s step 19014/19560 | loss 3.229300 (-1.32z)| norm 0.2260 (-0.94z)| lr 1.24e-06 | 322.96 ms | 52.3% bf16 MFU | 1624365 tok/s step 19015/19560 | loss 3.449984 (+3.59z)| norm 0.2525 (+0.23z)| lr 1.24e-06 | 322.32 ms | 52.4% bf16 MFU | 1624477 tok/s step 19016/19560 | loss 3.260325 (-0.58z)| norm 0.2284 (-0.83z)| lr 1.24e-06 | 322.96 ms | 52.3% bf16 MFU | 1624422 tok/s step 19017/19560 | loss 3.242328 (-0.97z)| norm 0.2310 (-0.71z)| lr 1.23e-06 | 322.70 ms | 52.3% bf16 MFU | 
1624436 tok/s step 19018/19560 | loss 3.263934 (-0.49z)| norm 0.2326 (-0.63z)| lr 1.23e-06 | 322.72 ms | 52.3% bf16 MFU | 1624444 tok/s step 19019/19560 | loss 3.318134 (+0.70z)| norm 0.2602 (+0.57z)| lr 1.22e-06 | 323.00 ms | 52.3% bf16 MFU | 1624382 tok/s step 19020/19560 | loss 3.198064 (-1.90z)| norm 0.3425 (+3.91z)| lr 1.22e-06 | 322.21 ms | 52.4% bf16 MFU | 1624522 tok/s step 19021/19560 | loss 3.202006 (-1.78z)| norm 0.2360 (-0.50z)| lr 1.21e-06 | 322.96 ms | 52.3% bf16 MFU | 1624464 tok/s step 19022/19560 | loss 3.270724 (-0.30z)| norm 0.2572 (+0.38z)| lr 1.21e-06 | 322.89 ms | 52.3% bf16 MFU | 1624428 tok/s step 19023/19560 | loss 3.287889 (+0.07z)| norm 0.2268 (-0.86z)| lr 1.20e-06 | 322.88 ms | 52.3% bf16 MFU | 1624396 tok/s step 19024/19560 | loss 3.380030 (+2.02z)| norm 0.2297 (-0.74z)| lr 1.20e-06 | 322.34 ms | 52.4% bf16 MFU | 1624501 tok/s step 19025/19560 | loss 3.400276 (+2.38z)| norm 0.2336 (-0.58z)| lr 1.19e-06 | 322.26 ms | 52.4% bf16 MFU | 1624621 tok/s step 19026/19560 | loss 3.340494 (+1.13z)| norm 0.2309 (-0.68z)| lr 1.19e-06 | 322.36 ms | 52.4% bf16 MFU | 1624710 tok/s step 19027/19560 | loss 3.336265 (+1.04z)| norm 0.2272 (-0.82z)| lr 1.19e-06 | 322.51 ms | 52.3% bf16 MFU | 1624757 tok/s step 19028/19560 | loss 3.307832 (+0.46z)| norm 0.2792 (+1.30z)| lr 1.18e-06 | 322.92 ms | 52.3% bf16 MFU | 1624698 tok/s step 19029/19560 | loss 3.271359 (-0.31z)| norm 0.2522 (+0.18z)| lr 1.18e-06 | 322.56 ms | 52.3% bf16 MFU | 1624734 tok/s step 19030/19560 | loss 3.285008 (-0.01z)| norm 0.2307 (-0.70z)| lr 1.17e-06 | 322.60 ms | 52.3% bf16 MFU | 1624757 tok/s step 19031/19560 | loss 3.196199 (-1.85z)| norm 0.2597 (+0.48z)| lr 1.17e-06 | 322.71 ms | 52.3% bf16 MFU | 1624752 tok/s step 19032/19560 | loss 3.258995 (-0.53z)| norm 0.2380 (-0.41z)| lr 1.16e-06 | 323.27 ms | 52.2% bf16 MFU | 1624606 tok/s step 19033/19560 | loss 3.301275 (+0.35z)| norm 0.2376 (-0.42z)| lr 1.16e-06 | 322.99 ms | 52.3% bf16 MFU | 1624538 tok/s step 19034/19560 | loss 3.269713 
(-0.31z)| norm 0.2416 (-0.25z)| lr 1.16e-06 | 322.99 ms | 52.3% bf16 MFU | 1624472 tok/s step 19035/19560 | loss 3.338879 (+1.12z)| norm 0.2645 (+0.71z)| lr 1.15e-06 | 322.40 ms | 52.3% bf16 MFU | 1624560 tok/s step 19036/19560 | loss 3.289355 (+0.09z)| norm 0.2330 (-0.60z)| lr 1.15e-06 | 322.63 ms | 52.3% bf16 MFU | 1624584 tok/s step 19037/19560 | loss 3.239436 (-0.93z)| norm 0.2373 (-0.42z)| lr 1.14e-06 | 323.19 ms | 52.2% bf16 MFU | 1624466 tok/s step 19038/19560 | loss 3.273980 (-0.21z)| norm 0.2336 (-0.57z)| lr 1.14e-06 | 323.27 ms | 52.2% bf16 MFU | 1624334 tok/s step 19039/19560 | loss 3.253624 (-0.62z)| norm 0.2268 (-0.84z)| lr 1.13e-06 | 322.73 ms | 52.3% bf16 MFU | 1624345 tok/s step 19040/19560 | loss 3.299861 (+0.33z)| norm 0.2613 (+0.59z)| lr 1.13e-06 | 322.72 ms | 52.3% bf16 MFU | 1624359 tok/s step 19041/19560 | loss 3.298908 (+0.32z)| norm 0.2379 (-0.39z)| lr 1.12e-06 | 322.86 ms | 52.3% bf16 MFU | 1624335 tok/s step 19042/19560 | loss 3.215261 (-1.40z)| norm 0.2414 (-0.24z)| lr 1.12e-06 | 322.99 ms | 52.3% bf16 MFU | 1624280 tok/s step 19043/19560 | loss 3.241434 (-0.85z)| norm 0.2343 (-0.53z)| lr 1.12e-06 | 322.96 ms | 52.3% bf16 MFU | 1624236 tok/s step 19044/19560 | loss 3.238705 (-0.91z)| norm 0.2308 (-0.67z)| lr 1.11e-06 | 322.82 ms | 52.3% bf16 MFU | 1624228 tok/s step 19045/19560 | loss 3.346843 (+1.31z)| norm 0.2417 (-0.21z)| lr 1.11e-06 | 322.95 ms | 52.3% bf16 MFU | 1624188 tok/s step 19046/19560 | loss 3.223727 (-1.21z)| norm 0.2281 (-0.77z)| lr 1.10e-06 | 322.76 ms | 52.3% bf16 MFU | 1624197 tok/s step 19047/19560 | loss 3.296319 (+0.31z)| norm 0.2305 (-0.66z)| lr 1.10e-06 | 323.00 ms | 52.3% bf16 MFU | 1624145 tok/s step 19048/19560 | loss 3.291643 (+0.21z)| norm 0.2514 (+0.22z)| lr 1.09e-06 | 322.79 ms | 52.3% bf16 MFU | 1624150 tok/s step 19049/19560 | loss 3.342994 (+1.27z)| norm 0.2363 (-0.42z)| lr 1.09e-06 | 322.99 ms | 52.3% bf16 MFU | 1624104 tok/s step 19050/19560 | loss 3.273520 (-0.19z)| norm 0.2393 (-0.30z)| lr 1.09e-06 | 
323.12 ms | 52.2% bf16 MFU | 1624029 tok/s step 19051/19560 | loss 3.309584 (+0.57z)| norm 0.2382 (-0.35z)| lr 1.08e-06 | 322.80 ms | 52.3% bf16 MFU | 1624036 tok/s step 19052/19560 | loss 3.244061 (-0.80z)| norm 0.2195 (-1.13z)| lr 1.08e-06 | 322.74 ms | 52.3% bf16 MFU | 1624059 tok/s step 19053/19560 | loss 3.229474 (-1.10z)| norm 0.2496 (+0.16z)| lr 1.07e-06 | 322.82 ms | 52.3% bf16 MFU | 1624061 tok/s step 19054/19560 | loss 3.354494 (+1.49z)| norm 0.2309 (-0.63z)| lr 1.07e-06 | 323.12 ms | 52.2% bf16 MFU | 1623986 tok/s step 19055/19560 | loss 3.281706 (-0.02z)| norm 0.2470 (+0.05z)| lr 1.06e-06 | 322.53 ms | 52.3% bf16 MFU | 1624064 tok/s step 19056/19560 | loss 3.265389 (-0.36z)| norm 0.2360 (-0.42z)| lr 1.06e-06 | 322.49 ms | 52.3% bf16 MFU | 1624147 tok/s step 19057/19560 | loss 3.297035 (+0.29z)| norm 0.2426 (-0.14z)| lr 1.06e-06 | 322.75 ms | 52.3% bf16 MFU | 1624162 tok/s step 19058/19560 | loss 3.292149 (+0.18z)| norm 0.2249 (-0.89z)| lr 1.05e-06 | 322.77 ms | 52.3% bf16 MFU | 1624171 tok/s step 19059/19560 | loss 3.302681 (+0.39z)| norm 0.2644 (+0.79z)| lr 1.05e-06 | 322.93 ms | 52.3% bf16 MFU | 1624138 tok/s step 19060/19560 | loss 3.227076 (-1.18z)| norm 0.2267 (-0.81z)| lr 1.04e-06 | 322.64 ms | 52.3% bf16 MFU | 1624181 tok/s step 19061/19560 | loss 3.256898 (-0.56z)| norm 0.2394 (-0.26z)| lr 1.04e-06 | 323.18 ms | 52.2% bf16 MFU | 1624085 tok/s step 19062/19560 | loss 3.387888 (+2.15z)| norm 0.2397 (-0.25z)| lr 1.04e-06 | 323.37 ms | 52.2% bf16 MFU | 1623946 tok/s step 19063/19560 | loss 3.252356 (-0.66z)| norm 0.2279 (-0.75z)| lr 1.03e-06 | 322.71 ms | 52.3% bf16 MFU | 1623982 tok/s step 19064/19560 | loss 3.263747 (-0.42z)| norm 0.2841 (+1.62z)| lr 1.03e-06 | 323.01 ms | 52.2% bf16 MFU | 1623938 tok/s step 19065/19560 | loss 3.277456 (-0.13z)| norm 0.2361 (-0.40z)| lr 1.02e-06 | 322.88 ms | 52.3% bf16 MFU | 1623931 tok/s step 19066/19560 | loss 3.256107 (-0.56z)| norm 0.2652 (+0.83z)| lr 1.02e-06 | 322.86 ms | 52.3% bf16 MFU | 1623929 tok/s step 
19067/19560 | loss 3.287451 (+0.08z)| norm 0.2632 (+0.73z)| lr 1.02e-06 | 323.26 ms | 52.2% bf16 MFU | 1623827 tok/s step 19068/19560 | loss 3.324170 (+0.84z)| norm 0.2412 (-0.20z)| lr 1.01e-06 | 323.36 ms | 52.2% bf16 MFU | 1623705 tok/s step 19069/19560 | loss 3.357842 (+1.55z)| norm 0.2509 (+0.23z)| lr 1.01e-06 | 323.24 ms | 52.2% bf16 MFU | 1623618 tok/s step 19070/19560 | loss 3.355707 (+1.48z)| norm 0.2751 (+1.25z)| lr 1.00e-06 | 322.96 ms | 52.3% bf16 MFU | 1623607 tok/s step 19071/19560 | loss 3.264307 (-0.43z)| norm 0.2383 (-0.33z)| lr 9.99e-07 | 322.47 ms | 52.3% bf16 MFU | 1623718 tok/s step 19072/19560 | loss 3.307116 (+0.47z)| norm 0.2285 (-0.74z)| lr 9.95e-07 | 323.41 ms | 52.2% bf16 MFU | 1623588 tok/s step 19073/19560 | loss 3.239667 (-0.93z)| norm 0.2526 (+0.28z)| lr 9.91e-07 | 323.11 ms | 52.2% bf16 MFU | 1623539 tok/s step 19074/19560 | loss 3.256722 (-0.57z)| norm 0.2410 (-0.22z)| lr 9.87e-07 | 322.67 ms | 52.3% bf16 MFU | 1623605 tok/s step 19075/19560 | loss 3.283804 (-0.02z)| norm 0.2313 (-0.63z)| lr 9.83e-07 | 322.73 ms | 52.3% bf16 MFU | 1623651 tok/s step 19076/19560 | loss 3.236456 (-1.01z)| norm 0.2390 (-0.31z)| lr 9.78e-07 | 323.40 ms | 52.2% bf16 MFU | 1623528 tok/s step 19077/19560 | loss 3.203093 (-1.67z)| norm 0.2271 (-0.80z)| lr 9.74e-07 | 322.87 ms | 52.3% bf16 MFU | 1623542 tok/s step 19078/19560 | loss 3.255949 (-0.58z)| norm 0.2250 (-0.88z)| lr 9.70e-07 | 323.30 ms | 52.2% bf16 MFU | 1623448 tok/s step 19079/19560 | loss 3.276030 (-0.16z)| norm 0.2211 (-1.06z)| lr 9.66e-07 | 322.81 ms | 52.3% bf16 MFU | 1623484 tok/s step 19080/19560 | loss 3.234917 (-1.01z)| norm 0.2319 (-0.59z)| lr 9.62e-07 | 322.22 ms | 52.4% bf16 MFU | 1623665 tok/s step 19081/19560 | loss 3.147818 (-2.73z)| norm 0.2491 (+0.17z)| lr 9.58e-07 | 323.39 ms | 52.2% bf16 MFU | 1623544 tok/s step 19082/19560 | loss 3.261972 (-0.40z)| norm 0.2360 (-0.41z)| lr 9.54e-07 | 323.04 ms | 52.2% bf16 MFU | 1623516 tok/s step 19083/19560 | loss 3.252669 (-0.60z)| norm 
0.2262 (-0.83z)| lr 9.50e-07 | 323.32 ms | 52.2% bf16 MFU | 1623418 tok/s step 19084/19560 | loss 3.248785 (-0.68z)| norm 0.2258 (-0.84z)| lr 9.46e-07 | 322.89 ms | 52.3% bf16 MFU | 1623434 tok/s step 19085/19560 | loss 3.256628 (-0.52z)| norm 0.2449 (+0.01z)| lr 9.43e-07 | 323.01 ms | 52.3% bf16 MFU | 1623420 tok/s step 19086/19560 | loss 3.287144 (+0.12z)| norm 0.2253 (-0.85z)| lr 9.39e-07 | 322.92 ms | 52.3% bf16 MFU | 1623429 tok/s step 19087/19560 | loss 3.262923 (-0.38z)| norm 0.2236 (-0.92z)| lr 9.35e-07 | 323.03 ms | 52.2% bf16 MFU | 1623408 tok/s step 19088/19560 | loss 3.301624 (+0.42z)| norm 0.2228 (-0.96z)| lr 9.31e-07 | 323.15 ms | 52.2% bf16 MFU | 1623358 tok/s step 19089/19560 | loss 3.273408 (-0.16z)| norm 0.2705 (+1.24z)| lr 9.27e-07 | 322.28 ms | 52.4% bf16 MFU | 1623531 tok/s step 19090/19560 | loss 3.293086 (+0.24z)| norm 0.2333 (-0.47z)| lr 9.23e-07 | 322.99 ms | 52.3% bf16 MFU | 1623516 tok/s step 19091/19560 | loss 3.269201 (-0.23z)| norm 0.2643 (+0.95z)| lr 9.19e-07 | 322.87 ms | 52.3% bf16 MFU | 1623532 tok/s step 19092/19560 | loss 3.306929 (+0.57z)| norm 0.2611 (+0.79z)| lr 9.15e-07 | 323.22 ms | 52.2% bf16 MFU | 1623459 tok/s step 19093/19560 | loss 3.248441 (-0.66z)| norm 0.2359 (-0.36z)| lr 9.11e-07 | 322.73 ms | 52.3% bf16 MFU | 1623512 tok/s step 19094/19560 | loss 3.256440 (-0.48z)| norm 0.2389 (-0.23z)| lr 9.07e-07 | 322.94 ms | 52.3% bf16 MFU | 1623510 tok/s step 19095/19560 | loss 3.210876 (-1.42z)| norm 0.2320 (-0.55z)| lr 9.03e-07 | 322.88 ms | 52.3% bf16 MFU | 1623523 tok/s step 19096/19560 | loss 3.256052 (-0.47z)| norm 0.2542 (+0.47z)| lr 8.99e-07 | 322.83 ms | 52.3% bf16 MFU | 1623550 tok/s step 19097/19560 | loss 3.276364 (-0.05z)| norm 0.2441 (+0.00z)| lr 8.96e-07 | 322.58 ms | 52.3% bf16 MFU | 1623636 tok/s step 19098/19560 | loss 3.237965 (-0.85z)| norm 0.2252 (-0.86z)| lr 8.92e-07 | 322.55 ms | 52.3% bf16 MFU | 1623726 tok/s step 19099/19560 | loss 3.246035 (-0.67z)| norm 0.2455 (+0.07z)| lr 8.88e-07 | 323.10 ms | 
52.2% bf16 MFU | 1623675 tok/s step 19100/19560 | loss 3.226894 (-1.08z)| norm 0.2473 (+0.24z)| lr 8.84e-07 | 323.14 ms | 52.2% bf16 MFU | 1623615 tok/s step 19101/19560 | loss 3.245245 (-0.67z)| norm 0.2215 (-1.16z)| lr 8.80e-07 | 322.97 ms | 52.3% bf16 MFU | 1623602 tok/s step 19102/19560 | loss 3.373526 (+2.07z)| norm 0.2831 (+2.15z)| lr 8.76e-07 | 322.42 ms | 52.3% bf16 MFU | 1623728 tok/s step 19103/19560 | loss 3.280020 (+0.08z)| norm 0.2320 (-0.58z)| lr 8.73e-07 | 322.66 ms | 52.3% bf16 MFU | 1623785 tok/s step 19104/19560 | loss 3.210567 (-1.40z)| norm 0.2765 (+1.78z)| lr 8.69e-07 | 322.53 ms | 52.3% bf16 MFU | 1623874 tok/s step 19105/19560 | loss 3.335319 (+1.26z)| norm 0.2270 (-0.84z)| lr 8.65e-07 | 322.99 ms | 52.3% bf16 MFU | 1623842 tok/s step 19106/19560 | loss 3.247234 (-0.61z)| norm 0.2335 (-0.50z)| lr 8.61e-07 | 322.48 ms | 52.3% bf16 MFU | 1623939 tok/s step 19107/19560 | loss 3.327574 (+1.09z)| norm 0.2370 (-0.30z)| lr 8.57e-07 | 322.50 ms | 52.3% bf16 MFU | 1624028 tok/s step 19108/19560 | loss 3.299099 (+0.48z)| norm 0.2460 (+0.18z)| lr 8.54e-07 | 322.70 ms | 52.3% bf16 MFU | 1624061 tok/s step 19109/19560 | loss 3.248340 (-0.61z)| norm 0.2437 (+0.08z)| lr 8.50e-07 | 322.49 ms | 52.3% bf16 MFU | 1624146 tok/s step 19110/19560 | loss 3.219116 (-1.24z)| norm 0.2270 (-0.83z)| lr 8.46e-07 | 323.01 ms | 52.2% bf16 MFU | 1624094 tok/s step 19111/19560 | loss 3.352340 (+1.59z)| norm 0.2897 (+2.52z)| lr 8.42e-07 | 323.08 ms | 52.2% bf16 MFU | 1624028 tok/s step 19112/19560 | loss 3.317619 (+0.84z)| norm 0.2530 (+0.55z)| lr 8.39e-07 | 322.80 ms | 52.3% bf16 MFU | 1624037 tok/s step 19113/19560 | loss 3.264261 (-0.30z)| norm 0.2409 (-0.10z)| lr 8.35e-07 | 323.10 ms | 52.2% bf16 MFU | 1623969 tok/s step 19114/19560 | loss 3.310109 (+0.68z)| norm 0.2384 (-0.24z)| lr 8.31e-07 | 322.68 ms | 52.3% bf16 MFU | 1624009 tok/s step 19115/19560 | loss 3.297171 (+0.39z)| norm 0.2338 (-0.49z)| lr 8.28e-07 | 322.46 ms | 52.3% bf16 MFU | 1624104 tok/s step 19116/19560 
| loss 3.302969 (+0.51z)| norm 0.2276 (-0.82z)| lr 8.24e-07 | 323.12 ms | 52.2% bf16 MFU | 1624027 tok/s step 19117/19560 | loss 3.321486 (+0.90z)| norm 0.2577 (+0.79z)| lr 8.20e-07 | 322.79 ms | 52.3% bf16 MFU | 1624038 tok/s step 19118/19560 | loss 3.372553 (+1.96z)| norm 0.2708 (+1.46z)| lr 8.16e-07 | 322.50 ms | 52.3% bf16 MFU | 1624122 tok/s step 19119/19560 | loss 3.334637 (+1.14z)| norm 0.2291 (-0.76z)| lr 8.13e-07 | 322.14 ms | 52.4% bf16 MFU | 1624292 tok/s step 19120/19560 | loss 3.274260 (-0.15z)| norm 0.2488 (+0.34z)| lr 8.09e-07 | 323.44 ms | 52.2% bf16 MFU | 1624127 tok/s step 19121/19560 | loss 3.255566 (-0.53z)| norm 0.2360 (-0.36z)| lr 8.05e-07 | 322.82 ms | 52.3% bf16 MFU | 1624126 tok/s step 19122/19560 | loss 3.286314 (+0.12z)| norm 0.2562 (+0.78z)| lr 8.02e-07 | 322.69 ms | 52.3% bf16 MFU | 1624156 tok/s step 19123/19560 | loss 3.324981 (+0.93z)| norm 0.2474 (+0.27z)| lr 7.98e-07 | 322.56 ms | 52.3% bf16 MFU | 1624218 tok/s step 19124/19560 | loss 3.291421 (+0.22z)| norm 0.2306 (-0.68z)| lr 7.94e-07 | 323.07 ms | 52.2% bf16 MFU | 1624148 tok/s step 19125/19560 | loss 3.356869 (+1.59z)| norm 0.2698 (+1.52z)| lr 7.91e-07 | 322.50 ms | 52.3% bf16 MFU | 1624225 tok/s step 19126/19560 | loss 3.277878 (-0.09z)| norm 0.2333 (-0.54z)| lr 7.87e-07 | 322.09 ms | 52.4% bf16 MFU | 1624402 tok/s step 19127/19560 | loss 3.333683 (+1.08z)| norm 0.2237 (-1.07z)| lr 7.84e-07 | 323.00 ms | 52.3% bf16 MFU | 1624341 tok/s step 19128/19560 | loss 3.308987 (+0.54z)| norm 0.2303 (-0.68z)| lr 7.80e-07 | 322.20 ms | 52.4% bf16 MFU | 1624486 tok/s step 19129/19560 | loss 3.272966 (-0.22z)| norm 0.2452 (+0.16z)| lr 7.76e-07 | 322.72 ms | 52.3% bf16 MFU | 1624490 tok/s step 19130/19560 | loss 3.264721 (-0.39z)| norm 0.2347 (-0.44z)| lr 7.73e-07 | 323.35 ms | 52.2% bf16 MFU | 1624337 tok/s step 19131/19560 | loss 3.221342 (-1.30z)| norm 0.3052 (+3.38z)| lr 7.69e-07 | 322.35 ms | 52.4% bf16 MFU | 1624442 tok/s step 19132/19560 | loss 3.253765 (-0.61z)| norm 0.2369 (-0.33z)| 
lr 7.66e-07 | 322.39 ms | 52.4% bf16 MFU | 1624533 tok/s step 19133/19560 | loss 3.300098 (+0.38z)| norm 0.2342 (-0.48z)| lr 7.62e-07 | 322.39 ms | 52.3% bf16 MFU | 1624618 tok/s step 19134/19560 | loss 3.279559 (-0.06z)| norm 0.2213 (-1.16z)| lr 7.59e-07 | 322.72 ms | 52.3% bf16 MFU | 1624617 tok/s step 19135/19560 | loss 3.303135 (+0.44z)| norm 0.2329 (-0.54z)| lr 7.55e-07 | 322.80 ms | 52.3% bf16 MFU | 1624596 tok/s step 19136/19560 | loss 3.203981 (-1.64z)| norm 0.2300 (-0.68z)| lr 7.51e-07 | 323.18 ms | 52.2% bf16 MFU | 1624482 tok/s step 19137/19560 | loss 3.297920 (+0.35z)| norm 0.2167 (-1.38z)| lr 7.48e-07 | 322.76 ms | 52.3% bf16 MFU | 1624476 tok/s step 19138/19560 | loss 3.271682 (-0.21z)| norm 0.2474 (+0.29z)| lr 7.44e-07 | 322.66 ms | 52.3% bf16 MFU | 1624496 tok/s step 19139/19560 | loss 3.281610 (+0.02z)| norm 0.2474 (+0.29z)| lr 7.41e-07 | 322.92 ms | 52.3% bf16 MFU | 1624452 tok/s step 19140/19560 | loss 3.277777 (-0.06z)| norm 0.2298 (-0.66z)| lr 7.37e-07 | 322.77 ms | 52.3% bf16 MFU | 1624446 tok/s step 19141/19560 | loss 3.246888 (-0.72z)| norm 0.2562 (+0.76z)| lr 7.34e-07 | 322.62 ms | 52.3% bf16 MFU | 1624477 tok/s step 19142/19560 | loss 3.274911 (-0.12z)| norm 0.2279 (-0.78z)| lr 7.30e-07 | 322.55 ms | 52.3% bf16 MFU | 1624526 tok/s step 19143/19560 | loss 3.330772 (+1.16z)| norm 0.2343 (-0.43z)| lr 7.27e-07 | 322.72 ms | 52.3% bf16 MFU | 1624529 tok/s step 19144/19560 | loss 3.254549 (-0.57z)| norm 0.2332 (-0.49z)| lr 7.23e-07 | 322.34 ms | 52.4% bf16 MFU | 1624627 tok/s step 19145/19560 | loss 3.244694 (-0.79z)| norm 0.2321 (-0.55z)| lr 7.20e-07 | 322.39 ms | 52.4% bf16 MFU | 1624708 tok/s step 19146/19560 | loss 3.324116 (+0.99z)| norm 0.2470 (+0.26z)| lr 7.17e-07 | 322.82 ms | 52.3% bf16 MFU | 1624678 tok/s step 19147/19560 | loss 3.304982 (+0.56z)| norm 0.2264 (-0.86z)| lr 7.13e-07 | 322.12 ms | 52.4% bf16 MFU | 1624825 tok/s step 19148/19560 | loss 3.214737 (-1.49z)| norm 0.2222 (-1.18z)| lr 7.10e-07 | 322.41 ms | 52.3% bf16 MFU | 
1624892 tok/s step 19149/19560 | loss 3.297832 (+0.39z)| norm 0.2268 (-0.88z)| lr 7.06e-07 | 322.82 ms | 52.3% bf16 MFU | 1624852 tok/s step 19150/19560 | loss 3.275377 (-0.13z)| norm 0.2700 (+1.78z)| lr 7.03e-07 | 322.62 ms | 52.3% bf16 MFU | 1624865 tok/s step 19151/19560 | loss 3.348152 (+1.52z)| norm 0.2510 (+0.60z)| lr 6.99e-07 | 322.70 ms | 52.3% bf16 MFU | 1624856 tok/s step 19152/19560 | loss 3.299596 (+0.44z)| norm 0.2263 (-0.92z)| lr 6.96e-07 | 322.57 ms | 52.3% bf16 MFU | 1624881 tok/s step 19153/19560 | loss 3.259091 (-0.50z)| norm 0.2387 (-0.16z)| lr 6.93e-07 | 322.53 ms | 52.3% bf16 MFU | 1624914 tok/s step 19154/19560 | loss 3.253928 (-0.61z)| norm 0.2931 (+3.05z)| lr 6.89e-07 | 322.06 ms | 52.4% bf16 MFU | 1625063 tok/s step 19155/19560 | loss 3.280468 (+0.04z)| norm 0.2371 (-0.28z)| lr 6.86e-07 | 322.74 ms | 52.3% bf16 MFU | 1625036 tok/s step 19156/19560 | loss 3.341593 (+1.52z)| norm 0.2406 (-0.06z)| lr 6.82e-07 | 322.56 ms | 52.3% bf16 MFU | 1625054 tok/s step 19157/19560 | loss 3.291271 (+0.30z)| norm 0.2248 (-1.01z)| lr 6.79e-07 | 322.26 ms | 52.4% bf16 MFU | 1625146 tok/s step 19158/19560 | loss 3.216537 (-1.49z)| norm 0.2275 (-0.84z)| lr 6.76e-07 | 322.47 ms | 52.3% bf16 MFU | 1625182 tok/s step 19159/19560 | loss 3.289860 (+0.26z)| norm 0.2291 (-0.73z)| lr 6.72e-07 | 322.43 ms | 52.3% bf16 MFU | 1625226 tok/s step 19160/19560 | loss 3.327106 (+1.15z)| norm 0.2231 (-1.08z)| lr 6.69e-07 | 322.59 ms | 52.3% bf16 MFU | 1625228 tok/s step 19161/19560 | loss 3.354599 (+1.78z)| norm 0.2441 (+0.18z)| lr 6.66e-07 | 322.85 ms | 52.3% bf16 MFU | 1625162 tok/s step 19162/19560 | loss 3.360396 (+1.88z)| norm 0.2468 (+0.35z)| lr 6.62e-07 | 322.55 ms | 52.3% bf16 MFU | 1625178 tok/s step 19163/19560 | loss 3.262548 (-0.42z)| norm 0.2377 (-0.19z)| lr 6.59e-07 | 322.70 ms | 52.3% bf16 MFU | 1625155 tok/s step 19164/19560 | loss 3.247655 (-0.77z)| norm 0.2385 (-0.14z)| lr 6.56e-07 | 322.49 ms | 52.3% bf16 MFU | 1625186 tok/s step 19165/19560 | loss 3.405233 
(+2.86z)| norm 0.2779 (+2.20z)| lr 6.52e-07 | 322.51 ms | 52.3% bf16 MFU | 1625208 tok/s step 19166/19560 | loss 3.274490 (-0.16z)| norm 0.2300 (-0.67z)| lr 6.49e-07 | 322.70 ms | 52.3% bf16 MFU | 1625183 tok/s step 19167/19560 | loss 3.334226 (+1.20z)| norm 0.2653 (+1.42z)| lr 6.46e-07 | 322.61 ms | 52.3% bf16 MFU | 1625182 tok/s step 19168/19560 | loss 3.255370 (-0.60z)| norm 0.2320 (-0.55z)| lr 6.43e-07 | 322.55 ms | 52.3% bf16 MFU | 1625195 tok/s step 19169/19560 | loss 3.269298 (-0.28z)| norm 0.2542 (+0.77z)| lr 6.39e-07 | 322.71 ms | 52.3% bf16 MFU | 1625166 tok/s step 19170/19560 | loss 3.290468 (+0.20z)| norm 0.2546 (+0.78z)| lr 6.36e-07 | 322.45 ms | 52.3% bf16 MFU | 1625205 tok/s step 19171/19560 | loss 3.266249 (-0.37z)| norm 0.2429 (+0.08z)| lr 6.33e-07 | 322.68 ms | 52.3% bf16 MFU | 1625185 tok/s step 19172/19560 | loss 3.216163 (-1.52z)| norm 0.2308 (-0.64z)| lr 6.30e-07 | 323.20 ms | 52.2% bf16 MFU | 1625034 tok/s step 19173/19560 | loss 3.295733 (+0.33z)| norm 0.2334 (-0.48z)| lr 6.26e-07 | 322.57 ms | 52.3% bf16 MFU | 1625050 tok/s step 19174/19560 | loss 3.244622 (-0.87z)| norm 0.2289 (-0.75z)| lr 6.23e-07 | 322.74 ms | 52.3% bf16 MFU | 1625021 tok/s step 19175/19560 | loss 3.337411 (+1.29z)| norm 0.2281 (-0.80z)| lr 6.20e-07 | 322.02 ms | 52.4% bf16 MFU | 1625175 tok/s step 19176/19560 | loss 3.274814 (-0.16z)| norm 0.2456 (+0.25z)| lr 6.17e-07 | 322.48 ms | 52.3% bf16 MFU | 1625208 tok/s step 19177/19560 | loss 3.261256 (-0.47z)| norm 0.2344 (-0.42z)| lr 6.14e-07 | 322.44 ms | 52.3% bf16 MFU | 1625247 tok/s step 19178/19560 | loss 3.274510 (-0.16z)| norm 0.2392 (-0.13z)| lr 6.10e-07 | 323.04 ms | 52.2% bf16 MFU | 1625134 tok/s step 19179/19560 | loss 3.322186 (+0.95z)| norm 0.2261 (-0.90z)| lr 6.07e-07 | 322.25 ms | 52.4% bf16 MFU | 1625224 tok/s step 19180/19560 | loss 3.264317 (-0.40z)| norm 0.2674 (+1.52z)| lr 6.04e-07 | 322.52 ms | 52.3% bf16 MFU | 1625244 tok/s step 19181/19560 | loss 3.330027 (+1.12z)| norm 0.2498 (+0.48z)| lr 6.01e-07 | 
322.59 ms | 52.3% bf16 MFU | 1625243 tok/s step 19182/19560 | loss 3.265554 (-0.38z)| norm 0.2452 (+0.20z)| lr 5.98e-07 | 322.44 ms | 52.3% bf16 MFU | 1625280 tok/s step 19183/19560 | loss 3.243413 (-0.90z)| norm 0.2550 (+0.77z)| lr 5.94e-07 | 322.72 ms | 52.3% bf16 MFU | 1625244 tok/s step 19184/19560 | loss 3.299675 (+0.43z)| norm 0.2371 (-0.29z)| lr 5.91e-07 | 322.28 ms | 52.4% bf16 MFU | 1625323 tok/s step 19185/19560 | loss 3.327176 (+1.07z)| norm 0.2232 (-1.09z)| lr 5.88e-07 | 322.94 ms | 52.3% bf16 MFU | 1625231 tok/s step 19186/19560 | loss 3.253609 (-0.66z)| norm 0.2249 (-0.99z)| lr 5.85e-07 | 322.67 ms | 52.3% bf16 MFU | 1625211 tok/s step 19187/19560 | loss 3.275143 (-0.14z)| norm 0.2187 (-1.34z)| lr 5.82e-07 | 323.08 ms | 52.2% bf16 MFU | 1625089 tok/s step 19188/19560 | loss 3.291121 (+0.22z)| norm 0.2264 (-0.88z)| lr 5.79e-07 | 322.83 ms | 52.3% bf16 MFU | 1625036 tok/s step 19189/19560 | loss 3.280879 (-0.03z)| norm 0.2404 (-0.06z)| lr 5.76e-07 | 322.70 ms | 52.3% bf16 MFU | 1625018 tok/s step 19190/19560 | loss 3.303783 (+0.55z)| norm 0.2298 (-0.67z)| lr 5.73e-07 | 322.29 ms | 52.4% bf16 MFU | 1625106 tok/s step 19191/19560 | loss 3.368860 (+2.07z)| norm 0.2946 (+3.00z)| lr 5.70e-07 | 322.64 ms | 52.3% bf16 MFU | 1625101 tok/s step 19192/19560 | loss 3.244087 (-0.91z)| norm 0.2329 (-0.49z)| lr 5.67e-07 | 322.26 ms | 52.4% bf16 MFU | 1625192 tok/s step 19193/19560 | loss 3.330049 (+1.13z)| norm 0.2247 (-0.97z)| lr 5.63e-07 | 322.30 ms | 52.4% bf16 MFU | 1625267 tok/s step 19194/19560 | loss 3.275425 (-0.17z)| norm 0.2342 (-0.40z)| lr 5.60e-07 | 322.60 ms | 52.3% bf16 MFU | 1625263 tok/s step 19195/19560 | loss 3.306133 (+0.55z)| norm 0.2692 (+1.63z)| lr 5.57e-07 | 322.56 ms | 52.3% bf16 MFU | 1625271 tok/s step 19196/19560 | loss 3.253018 (-0.70z)| norm 0.2355 (-0.32z)| lr 5.54e-07 | 322.67 ms | 52.3% bf16 MFU | 1625250 tok/s step 19197/19560 | loss 3.270958 (-0.26z)| norm 0.2297 (-0.65z)| lr 5.51e-07 | 322.31 ms | 52.4% bf16 MFU | 1625321 tok/s step 
19198/19560 | loss 3.297131 (+0.39z)| norm 0.2318 (-0.52z)| lr 5.48e-07 | 323.03 ms | 52.2% bf16 MFU | 1625208 tok/s step 19199/19560 | loss 3.326897 (+1.10z)| norm 0.2263 (-0.83z)| lr 5.45e-07 | 322.71 ms | 52.3% bf16 MFU | 1625180 tok/s step 19200/19560 | loss 3.304204 (+0.55z)| norm 0.2456 (+0.29z)| lr 5.42e-07 | 322.86 ms | 52.3% bf16 MFU | 1625115 tok/s step 19201/19560 | loss 3.208578 (-1.75z)| norm 0.2193 (-1.24z)| lr 5.39e-07 | 322.67 ms | 52.3% bf16 MFU | 1625101 tok/s step 19202/19560 | loss 3.279503 (-0.05z)| norm 0.2262 (-0.82z)| lr 5.36e-07 | 322.12 ms | 52.4% bf16 MFU | 1625228 tok/s step 19203/19560 | loss 3.355763 (+1.76z)| norm 0.2415 (+0.07z)| lr 5.33e-07 | 322.66 ms | 52.3% bf16 MFU | 1625211 tok/s step 19204/19560 | loss 3.333178 (+1.20z)| norm 0.2430 (+0.15z)| lr 5.30e-07 | 323.23 ms | 52.2% bf16 MFU | 1625051 tok/s step 19205/19560 | loss 3.288083 (+0.11z)| norm 0.2413 (+0.05z)| lr 5.27e-07 | 322.13 ms | 52.4% bf16 MFU | 1625178 tok/s step 19206/19560 | loss 3.314533 (+0.74z)| norm 0.2333 (-0.43z)| lr 5.24e-07 | 322.59 ms | 52.3% bf16 MFU | 1625182 tok/s step 19207/19560 | loss 3.298055 (+0.34z)| norm 0.2401 (-0.03z)| lr 5.21e-07 | 322.36 ms | 52.4% bf16 MFU | 1625244 tok/s step 19208/19560 | loss 3.240735 (-1.05z)| norm 0.2281 (-0.74z)| lr 5.18e-07 | 322.62 ms | 52.3% bf16 MFU | 1625236 tok/s step 19209/19560 | loss 3.311195 (+0.65z)| norm 0.2258 (-0.86z)| lr 5.16e-07 | 322.61 ms | 52.3% bf16 MFU | 1625233 tok/s step 19210/19560 | loss 3.273878 (-0.29z)| norm 0.2243 (-0.95z)| lr 5.13e-07 | 322.84 ms | 52.3% bf16 MFU | 1625170 tok/s step 19211/19560 | loss 3.268361 (-0.44z)| norm 0.2400 (-0.03z)| lr 5.10e-07 | 322.29 ms | 52.4% bf16 MFU | 1625249 tok/s step 19212/19560 | loss 3.257576 (-0.71z)| norm 0.2498 (+0.54z)| lr 5.07e-07 | 322.42 ms | 52.3% bf16 MFU | 1625291 tok/s step 19213/19560 | loss 3.342412 (+1.41z)| norm 0.2340 (-0.39z)| lr 5.04e-07 | 322.20 ms | 52.4% bf16 MFU | 1625387 tok/s step 19214/19560 | loss 3.281402 (-0.12z)| norm 
0.2383 (-0.14z)| lr 5.01e-07 | 322.50 ms | 52.3% bf16 MFU | 1625402 tok/s step 19215/19560 | loss 3.257895 (-0.72z)| norm 0.2191 (-1.27z)| lr 4.98e-07 | 323.07 ms | 52.2% bf16 MFU | 1625274 tok/s step 19216/19560 | loss 3.274011 (-0.30z)| norm 0.2462 (+0.32z)| lr 4.95e-07 | 322.51 ms | 52.3% bf16 MFU | 1625292 tok/s step 19217/19560 | loss 3.286561 (+0.01z)| norm 0.2269 (-0.82z)| lr 4.92e-07 | 322.58 ms | 52.3% bf16 MFU | 1625294 tok/s step 19218/19560 | loss 3.268508 (-0.44z)| norm 0.2337 (-0.41z)| lr 4.90e-07 | 322.95 ms | 52.3% bf16 MFU | 1625200 tok/s step 19219/19560 | loss 3.325065 (+0.97z)| norm 0.2448 (+0.27z)| lr 4.87e-07 | 322.52 ms | 52.3% bf16 MFU | 1625219 tok/s step 19220/19560 | loss 3.310286 (+0.60z)| norm 0.2297 (-0.63z)| lr 4.84e-07 | 323.06 ms | 52.2% bf16 MFU | 1625102 tok/s step 19221/19560 | loss 3.296464 (+0.24z)| norm 0.2443 (+0.25z)| lr 4.81e-07 | 322.67 ms | 52.3% bf16 MFU | 1625090 tok/s step 19222/19560 | loss 3.250954 (-0.90z)| norm 0.2605 (+1.21z)| lr 4.78e-07 | 322.82 ms | 52.3% bf16 MFU | 1625038 tok/s step 19223/19560 | loss 3.345524 (+1.46z)| norm 0.2351 (-0.32z)| lr 4.75e-07 | 322.77 ms | 52.3% bf16 MFU | 1625002 tok/s step 19224/19560 | loss 3.360904 (+1.81z)| norm 0.2342 (-0.36z)| lr 4.73e-07 | 322.45 ms | 52.3% bf16 MFU | 1625049 tok/s step 19225/19560 | loss 3.303524 (+0.37z)| norm 0.2903 (+2.90z)| lr 4.70e-07 | 322.46 ms | 52.3% bf16 MFU | 1625093 tok/s step 19226/19560 | loss 3.224030 (-1.61z)| norm 0.2472 (+0.38z)| lr 4.67e-07 | 323.33 ms | 52.2% bf16 MFU | 1624915 tok/s step 19227/19560 | loss 3.222858 (-1.63z)| norm 0.2684 (+1.59z)| lr 4.64e-07 | 322.66 ms | 52.3% bf16 MFU | 1624913 tok/s step 19228/19560 | loss 3.249607 (-0.98z)| norm 0.2366 (-0.24z)| lr 4.61e-07 | 322.26 ms | 52.4% bf16 MFU | 1625014 tok/s step 19229/19560 | loss 3.277113 (-0.30z)| norm 0.2650 (+1.38z)| lr 4.59e-07 | 323.20 ms | 52.2% bf16 MFU | 1624872 tok/s step 19230/19560 | loss 3.297790 (+0.24z)| norm 0.2445 (+0.22z)| lr 4.56e-07 | 322.78 ms | 
52.3% bf16 MFU | 1624843 tok/s step 19231/19560 | loss 3.295674 (+0.18z)| norm 0.2419 (+0.06z)| lr 4.53e-07 | 323.12 ms | 52.2% bf16 MFU | 1624730 tok/s step 19232/19560 | loss 3.410510 (+3.00z)| norm 0.2904 (+2.88z)| lr 4.50e-07 | 322.75 ms | 52.3% bf16 MFU | 1624714 tok/s step 19233/19560 | loss 3.471267 (+4.19z)| norm 0.3449 (+5.31z)| lr 4.48e-07 | 322.81 ms | 52.3% bf16 MFU | 1624686 tok/s step 19234/19560 | loss 3.312278 (+0.48z)| norm 0.2460 (+0.20z)| lr 4.45e-07 | 322.96 ms | 52.3% bf16 MFU | 1624622 tok/s step 19235/19560 | loss 3.384511 (+2.12z)| norm 0.2754 (+1.69z)| lr 4.42e-07 | 322.80 ms | 52.3% bf16 MFU | 1624600 tok/s step 19236/19560 | loss 3.308131 (+0.37z)| norm 0.2793 (+1.85z)| lr 4.40e-07 | 322.93 ms | 52.3% bf16 MFU | 1624545 tok/s step 19237/19560 | loss 3.378181 (+1.93z)| norm 0.2399 (-0.14z)| lr 4.37e-07 | 323.02 ms | 52.2% bf16 MFU | 1624472 tok/s step 19238/19560 | loss 3.276423 (-0.40z)| norm 0.2304 (-0.62z)| lr 4.34e-07 | 322.73 ms | 52.3% bf16 MFU | 1624476 tok/s step 19239/19560 | loss 3.301708 (+0.20z)| norm 0.2536 (+0.58z)| lr 4.31e-07 | 322.65 ms | 52.3% bf16 MFU | 1624500 tok/s step 19240/19560 | loss 3.258559 (-0.79z)| norm 0.2454 (+0.16z)| lr 4.29e-07 | 322.77 ms | 52.3% bf16 MFU | 1624491 tok/s step 19241/19560 | loss 3.291487 (-0.04z)| norm 0.2232 (-0.97z)| lr 4.26e-07 | 322.47 ms | 52.3% bf16 MFU | 1624560 tok/s step 19242/19560 | loss 3.299929 (+0.16z)| norm 0.2347 (-0.38z)| lr 4.23e-07 | 322.98 ms | 52.3% bf16 MFU | 1624494 tok/s step 19243/19560 | loss 3.288332 (-0.11z)| norm 0.2468 (+0.23z)| lr 4.21e-07 | 322.64 ms | 52.3% bf16 MFU | 1624518 tok/s step 19244/19560 | loss 3.265860 (-0.62z)| norm 0.2290 (-0.68z)| lr 4.18e-07 | 323.50 ms | 52.2% bf16 MFU | 1624326 tok/s step 19245/19560 | loss 3.303987 (+0.27z)| norm 0.2356 (-0.33z)| lr 4.16e-07 | 322.36 ms | 52.4% bf16 MFU | 1624429 tok/s step 19246/19560 | loss 3.372912 (+1.86z)| norm 0.2394 (-0.13z)| lr 4.13e-07 | 323.08 ms | 52.2% bf16 MFU | 1624346 tok/s step 19247/19560 
| loss 3.288667 (-0.08z)| norm 0.2418 (-0.01z)| lr 4.10e-07 | 322.57 ms | 52.3% bf16 MFU | 1624397 tok/s step 19248/19560 | loss 3.271493 (-0.48z)| norm 0.2298 (-0.62z)| lr 4.08e-07 | 323.06 ms | 52.2% bf16 MFU | 1624322 tok/s step 19249/19560 | loss 3.308930 (+0.38z)| norm 0.2206 (-1.09z)| lr 4.05e-07 | 323.41 ms | 52.2% bf16 MFU | 1624161 tok/s step 19250/19560 | loss 3.224555 (-1.56z)| norm 0.2409 (-0.03z)| lr 4.02e-07 | 323.06 ms | 52.2% bf16 MFU | 1624098 tok/s val loss 3.274337 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating HellaSwag: 70/79 HellaSwag: 3041/10042 = 0.302828 step 19251/19560 | loss 3.248930 (-0.98z)| norm 0.2272 (-0.73z)| lr 4.00e-07 | 323.01 ms | 52.3% bf16 MFU | 1624050 tok/s step 19252/19560 | loss 3.302523 (+0.25z)| norm 0.2417 (+0.01z)| lr 3.97e-07 | 322.19 ms | 52.4% bf16 MFU | 1624211 tok/s step 19253/19560 | loss 3.278228 (-0.29z)| norm 0.2425 (+0.07z)| lr 3.95e-07 | 322.48 ms | 52.3% bf16 MFU | 1624290 tok/s step 19254/19560 | loss 3.285057 (-0.14z)| norm 0.2426 (+0.06z)| lr 3.92e-07 | 323.49 ms | 52.2% bf16 MFU | 1624112 tok/s step 19255/19560 | loss 3.296559 (+0.14z)| norm 0.2344 (-0.37z)| lr 3.90e-07 | 323.10 ms | 52.2% bf16 MFU | 1624042 tok/s step 19256/19560 | loss 3.267822 (-0.52z)| norm 0.2449 (+0.18z)| lr 3.87e-07 | 322.62 ms | 52.3% bf16 MFU | 1624095 tok/s step 19257/19560 | loss 3.300872 (+0.24z)| norm 0.2243 (-0.89z)| lr 3.85e-07 | 322.57 ms | 52.3% bf16 MFU | 1624157 tok/s step 19258/19560 | loss 3.283847 (-0.16z)| norm 0.2284 (-0.68z)| lr 3.82e-07 | 322.67 ms | 52.3% bf16 MFU | 1624192 tok/s step 19259/19560 | loss 3.251215 (-0.93z)| norm 0.2361 (-0.26z)| lr 3.80e-07 | 323.31 ms | 52.2% bf16 MFU | 1624063 tok/s step 19260/19560 | loss 3.272517 (-0.44z)| norm 0.2286 (-0.66z)| lr 3.77e-07 | 323.44 ms | 52.2% bf16 MFU | 1623909 tok/s step 19261/19560 | loss 3.341210 
(+1.17z)| norm 0.2687 (+1.50z)| lr 3.75e-07 | 322.33 ms | 52.4% bf16 MFU | 1624042 tok/s step 19262/19560 | loss 3.254067 (-0.87z)| norm 0.2209 (-1.09z)| lr 3.72e-07 | 323.75 ms | 52.1% bf16 MFU | 1623812 tok/s step 19263/19560 | loss 3.245092 (-1.06z)| norm 0.2428 (+0.10z)| lr 3.70e-07 | 322.74 ms | 52.3% bf16 MFU | 1623846 tok/s step 19264/19560 | loss 3.288813 (-0.06z)| norm 0.2298 (-0.61z)| lr 3.67e-07 | 322.93 ms | 52.3% bf16 MFU | 1623830 tok/s step 19265/19560 | loss 3.284988 (-0.15z)| norm 0.2554 (+0.77z)| lr 3.65e-07 | 323.12 ms | 52.2% bf16 MFU | 1623768 tok/s step 19266/19560 | loss 3.356613 (+1.52z)| norm 0.2408 (-0.03z)| lr 3.62e-07 | 322.57 ms | 52.3% bf16 MFU | 1623847 tok/s step 19267/19560 | loss 3.273652 (-0.43z)| norm 0.2254 (-0.86z)| lr 3.60e-07 | 322.51 ms | 52.3% bf16 MFU | 1623938 tok/s step 19268/19560 | loss 3.262373 (-0.69z)| norm 0.2256 (-0.84z)| lr 3.57e-07 | 322.38 ms | 52.4% bf16 MFU | 1624057 tok/s step 19269/19560 | loss 3.318710 (+0.62z)| norm 0.2349 (-0.33z)| lr 3.55e-07 | 323.08 ms | 52.2% bf16 MFU | 1623993 tok/s step 19270/19560 | loss 3.288188 (-0.10z)| norm 0.2288 (-0.66z)| lr 3.52e-07 | 322.81 ms | 52.3% bf16 MFU | 1624000 tok/s step 19271/19560 | loss 3.333214 (+0.96z)| norm 0.2757 (+1.85z)| lr 3.50e-07 | 322.40 ms | 52.3% bf16 MFU | 1624109 tok/s step 19272/19560 | loss 3.296208 (+0.08z)| norm 0.2306 (-0.57z)| lr 3.48e-07 | 322.66 ms | 52.3% bf16 MFU | 1624148 tok/s step 19273/19560 | loss 3.287485 (-0.13z)| norm 0.2399 (-0.07z)| lr 3.45e-07 | 323.23 ms | 52.2% bf16 MFU | 1624042 tok/s step 19274/19560 | loss 3.295650 (+0.07z)| norm 0.2559 (+0.78z)| lr 3.43e-07 | 322.49 ms | 52.3% bf16 MFU | 1624128 tok/s step 19275/19560 | loss 3.272704 (-0.47z)| norm 0.2420 (+0.03z)| lr 3.40e-07 | 322.76 ms | 52.3% bf16 MFU | 1624141 tok/s step 19276/19560 | loss 3.218681 (-1.76z)| norm 0.2557 (+0.75z)| lr 3.38e-07 | 322.47 ms | 52.3% bf16 MFU | 1624226 tok/s step 19277/19560 | loss 3.317580 (+0.59z)| norm 0.2312 (-0.57z)| lr 3.36e-07 | 
323.25 ms | 52.2% bf16 MFU | 1624110 tok/s step 19278/19560 | loss 3.263622 (-0.69z)| norm 0.2503 (+0.47z)| lr 3.33e-07 | 322.60 ms | 52.3% bf16 MFU | 1624164 tok/s step 19279/19560 | loss 3.346094 (+1.27z)| norm 0.2398 (-0.10z)| lr 3.31e-07 | 322.97 ms | 52.3% bf16 MFU | 1624121 tok/s step 19280/19560 | loss 3.272525 (-0.47z)| norm 0.2393 (-0.13z)| lr 3.29e-07 | 322.84 ms | 52.3% bf16 MFU | 1624115 tok/s step 19281/19560 | loss 3.278217 (-0.34z)| norm 0.2254 (-0.88z)| lr 3.26e-07 | 323.09 ms | 52.2% bf16 MFU | 1624046 tok/s step 19282/19560 | loss 3.299853 (+0.16z)| norm 0.2236 (-0.97z)| lr 3.24e-07 | 322.63 ms | 52.3% bf16 MFU | 1624094 tok/s step 19283/19560 | loss 3.293855 (+0.02z)| norm 0.2750 (+1.86z)| lr 3.22e-07 | 323.17 ms | 52.2% bf16 MFU | 1624007 tok/s step 19284/19560 | loss 3.314177 (+0.51z)| norm 0.2618 (+1.12z)| lr 3.19e-07 | 322.86 ms | 52.3% bf16 MFU | 1624002 tok/s step 19285/19560 | loss 3.271951 (-0.50z)| norm 0.2440 (+0.13z)| lr 3.17e-07 | 322.61 ms | 52.3% bf16 MFU | 1624060 tok/s step 19286/19560 | loss 3.359468 (+1.57z)| norm 0.2666 (+1.35z)| lr 3.15e-07 | 322.66 ms | 52.3% bf16 MFU | 1624102 tok/s step 19287/19560 | loss 3.316138 (+0.53z)| norm 0.2409 (-0.06z)| lr 3.12e-07 | 323.15 ms | 52.2% bf16 MFU | 1624018 tok/s step 19288/19560 | loss 3.354184 (+1.43z)| norm 0.2361 (-0.33z)| lr 3.10e-07 | 322.62 ms | 52.3% bf16 MFU | 1624072 tok/s step 19289/19560 | loss 3.226495 (-1.59z)| norm 0.2192 (-1.25z)| lr 3.08e-07 | 322.75 ms | 52.3% bf16 MFU | 1624090 tok/s step 19290/19560 | loss 3.255301 (-0.89z)| norm 0.2488 (+0.38z)| lr 3.06e-07 | 322.85 ms | 52.3% bf16 MFU | 1624082 tok/s step 19291/19560 | loss 3.345230 (+1.24z)| norm 0.2347 (-0.40z)| lr 3.03e-07 | 322.20 ms | 52.4% bf16 MFU | 1624239 tok/s step 19292/19560 | loss 3.337065 (+1.03z)| norm 0.2534 (+0.62z)| lr 3.01e-07 | 322.70 ms | 52.3% bf16 MFU | 1624262 tok/s step 19293/19560 | loss 3.332549 (+0.96z)| norm 0.2357 (-0.33z)| lr 2.99e-07 | 322.67 ms | 52.3% bf16 MFU | 1624291 tok/s step 
19294/19560 | loss 3.320102 (+0.65z)| norm 0.2729 (+1.70z)| lr 2.97e-07 | 322.77 ms | 52.3% bf16 MFU | 1624294 tok/s step 19295/19560 | loss 3.278474 (-0.36z)| norm 0.2221 (-1.08z)| lr 2.94e-07 | 322.81 ms | 52.3% bf16 MFU | 1624286 tok/s step 19296/19560 | loss 3.336716 (+1.05z)| norm 0.2697 (+1.51z)| lr 2.92e-07 | 321.88 ms | 52.4% bf16 MFU | 1624512 tok/s step 19297/19560 | loss 3.488705 (+4.37z)| norm 0.2432 (+0.07z)| lr 2.90e-07 | 322.90 ms | 52.3% bf16 MFU | 1624469 tok/s step 19298/19560 | loss 3.233586 (-1.38z)| norm 0.2283 (-0.73z)| lr 2.88e-07 | 323.23 ms | 52.2% bf16 MFU | 1624346 tok/s step 19299/19560 | loss 3.356093 (+1.35z)| norm 0.2506 (+0.48z)| lr 2.86e-07 | 322.42 ms | 52.3% bf16 MFU | 1624435 tok/s step 19300/19560 | loss 3.303146 (+0.15z)| norm 0.2343 (-0.41z)| lr 2.83e-07 | 322.51 ms | 52.3% bf16 MFU | 1624497 tok/s step 19301/19560 | loss 3.294331 (-0.05z)| norm 0.2260 (-0.86z)| lr 2.81e-07 | 322.99 ms | 52.3% bf16 MFU | 1624435 tok/s step 19302/19560 | loss 3.329759 (+0.74z)| norm 0.2277 (-0.77z)| lr 2.79e-07 | 322.52 ms | 52.3% bf16 MFU | 1624493 tok/s step 19303/19560 | loss 3.265265 (-0.71z)| norm 0.2366 (-0.28z)| lr 2.77e-07 | 322.60 ms | 52.3% bf16 MFU | 1624528 tok/s step 19304/19560 | loss 3.349243 (+1.18z)| norm 0.2595 (+0.96z)| lr 2.75e-07 | 322.97 ms | 52.3% bf16 MFU | 1624469 tok/s step 19305/19560 | loss 3.272334 (-0.56z)| norm 0.2215 (-1.10z)| lr 2.73e-07 | 321.88 ms | 52.4% bf16 MFU | 1624687 tok/s step 19306/19560 | loss 3.339035 (+0.93z)| norm 0.2633 (+1.15z)| lr 2.71e-07 | 322.52 ms | 52.3% bf16 MFU | 1624732 tok/s step 19307/19560 | loss 3.326389 (+0.65z)| norm 0.2483 (+0.33z)| lr 2.68e-07 | 322.85 ms | 52.3% bf16 MFU | 1624693 tok/s step 19308/19560 | loss 3.239887 (-1.30z)| norm 0.2517 (+0.53z)| lr 2.66e-07 | 322.27 ms | 52.4% bf16 MFU | 1624801 tok/s step 19309/19560 | loss 3.290012 (-0.16z)| norm 0.2586 (+0.90z)| lr 2.64e-07 | 322.73 ms | 52.3% bf16 MFU | 1624787 tok/s step 19310/19560 | loss 3.229684 (-1.51z)| norm 
0.2256 (-0.89z)| lr 2.62e-07 | 323.16 ms | 52.2% bf16 MFU | 1624666 tok/s step 19311/19560 | loss 3.276692 (-0.46z)| norm 0.2206 (-1.14z)| lr 2.60e-07 | 322.74 ms | 52.3% bf16 MFU | 1624658 tok/s step 19312/19560 | loss 3.251194 (-1.02z)| norm 0.2497 (+0.43z)| lr 2.58e-07 | 322.45 ms | 52.3% bf16 MFU | 1624722 tok/s step 19313/19560 | loss 3.250871 (-1.01z)| norm 0.2559 (+0.75z)| lr 2.56e-07 | 322.19 ms | 52.4% bf16 MFU | 1624850 tok/s step 19314/19560 | loss 3.337910 (+0.92z)| norm 0.2324 (-0.53z)| lr 2.54e-07 | 322.45 ms | 52.3% bf16 MFU | 1624906 tok/s step 19315/19560 | loss 3.266329 (-0.68z)| norm 0.2269 (-0.83z)| lr 2.52e-07 | 322.13 ms | 52.4% bf16 MFU | 1625038 tok/s step 19316/19560 | loss 3.290651 (-0.14z)| norm 0.2392 (-0.17z)| lr 2.50e-07 | 322.91 ms | 52.3% bf16 MFU | 1624968 tok/s step 19317/19560 | loss 3.301366 (+0.10z)| norm 0.2353 (-0.38z)| lr 2.48e-07 | 322.93 ms | 52.3% bf16 MFU | 1624896 tok/s step 19318/19560 | loss 3.312407 (+0.34z)| norm 0.2438 (+0.08z)| lr 2.46e-07 | 322.51 ms | 52.3% bf16 MFU | 1624932 tok/s step 19319/19560 | loss 3.310884 (+0.32z)| norm 0.2264 (-0.87z)| lr 2.44e-07 | 322.89 ms | 52.3% bf16 MFU | 1624873 tok/s step 19320/19560 | loss 3.288950 (-0.18z)| norm 0.2193 (-1.25z)| lr 2.42e-07 | 322.49 ms | 52.3% bf16 MFU | 1624916 tok/s step 19321/19560 | loss 3.248623 (-1.08z)| norm 0.2287 (-0.73z)| lr 2.40e-07 | 322.46 ms | 52.3% bf16 MFU | 1624965 tok/s step 19322/19560 | loss 3.337935 (+0.93z)| norm 0.2635 (+1.20z)| lr 2.38e-07 | 322.38 ms | 52.4% bf16 MFU | 1625033 tok/s step 19323/19560 | loss 3.248996 (-1.06z)| norm 0.2209 (-1.16z)| lr 2.36e-07 | 322.97 ms | 52.3% bf16 MFU | 1624948 tok/s step 19324/19560 | loss 3.275689 (-0.47z)| norm 0.2556 (+0.78z)| lr 2.34e-07 | 322.86 ms | 52.3% bf16 MFU | 1624896 tok/s step 19325/19560 | loss 3.316733 (+0.45z)| norm 0.2467 (+0.27z)| lr 2.32e-07 | 322.53 ms | 52.3% bf16 MFU | 1624929 tok/s step 19326/19560 | loss 3.275325 (-0.48z)| norm 0.2417 (-0.02z)| lr 2.30e-07 | 322.64 ms | 
52.3% bf16 MFU | 1624931 tok/s step 19327/19560 | loss 3.257483 (-0.87z)| norm 0.2353 (-0.37z)| lr 2.28e-07 | 322.56 ms | 52.3% bf16 MFU | 1624955 tok/s step 19328/19560 | loss 3.358836 (+1.39z)| norm 0.2488 (+0.38z)| lr 2.26e-07 | 322.34 ms | 52.4% bf16 MFU | 1625033 tok/s step 19329/19560 | loss 3.320922 (+0.53z)| norm 0.2408 (-0.08z)| lr 2.24e-07 | 322.82 ms | 52.3% bf16 MFU | 1624984 tok/s step 19330/19560 | loss 3.324433 (+0.60z)| norm 0.2325 (-0.55z)| lr 2.22e-07 | 322.53 ms | 52.3% bf16 MFU | 1625013 tok/s step 19331/19560 | loss 3.335994 (+0.87z)| norm 0.2665 (+1.35z)| lr 2.20e-07 | 322.76 ms | 52.3% bf16 MFU | 1624982 tok/s step 19332/19560 | loss 3.241354 (-1.26z)| norm 0.2404 (-0.12z)| lr 2.18e-07 | 322.70 ms | 52.3% bf16 MFU | 1624966 tok/s step 19333/19560 | loss 3.209420 (-1.94z)| norm 0.2798 (+2.05z)| lr 2.16e-07 | 322.53 ms | 52.3% bf16 MFU | 1624995 tok/s step 19334/19560 | loss 3.302655 (+0.14z)| norm 0.2408 (-0.11z)| lr 2.14e-07 | 322.58 ms | 52.3% bf16 MFU | 1625011 tok/s step 19335/19560 | loss 3.311715 (+0.34z)| norm 0.2304 (-0.68z)| lr 2.13e-07 | 322.52 ms | 52.3% bf16 MFU | 1625040 tok/s step 19336/19560 | loss 3.239522 (-1.27z)| norm 0.2283 (-0.80z)| lr 2.11e-07 | 323.06 ms | 52.2% bf16 MFU | 1624932 tok/s step 19337/19560 | loss 3.339724 (+0.96z)| norm 0.2421 (-0.04z)| lr 2.09e-07 | 322.69 ms | 52.3% bf16 MFU | 1624922 tok/s step 19338/19560 | loss 3.266980 (-0.66z)| norm 0.2514 (+0.47z)| lr 2.07e-07 | 322.38 ms | 52.4% bf16 MFU | 1624991 tok/s step 19339/19560 | loss 3.373092 (+1.67z)| norm 0.2317 (-0.63z)| lr 2.05e-07 | 322.46 ms | 52.3% bf16 MFU | 1625035 tok/s step 19340/19560 | loss 3.279072 (-0.41z)| norm 0.2625 (+1.07z)| lr 2.03e-07 | 322.82 ms | 52.3% bf16 MFU | 1624989 tok/s step 19341/19560 | loss 3.337915 (+0.89z)| norm 0.2397 (-0.19z)| lr 2.01e-07 | 322.60 ms | 52.3% bf16 MFU | 1624999 tok/s step 19342/19560 | loss 3.308337 (+0.24z)| norm 0.2289 (-0.78z)| lr 2.00e-07 | 322.80 ms | 52.3% bf16 MFU | 1624959 tok/s step 19343/19560 
| loss 3.241470 (-1.23z)| norm 0.2268 (-0.91z)| lr 1.98e-07 | 323.03 ms | 52.2% bf16 MFU | 1624864 tok/s step 19344/19560 | loss 3.332412 (+0.76z)| norm 0.2374 (-0.31z)| lr 1.96e-07 | 322.64 ms | 52.3% bf16 MFU | 1624870 tok/s step 19345/19560 | loss 3.338848 (+0.89z)| norm 0.2296 (-0.75z)| lr 1.94e-07 | 323.14 ms | 52.2% bf16 MFU | 1624752 tok/s step 19346/19560 | loss 3.345453 (+1.02z)| norm 0.2481 (+0.28z)| lr 1.92e-07 | 322.76 ms | 52.3% bf16 MFU | 1624733 tok/s step 19347/19560 | loss 3.376701 (+1.67z)| norm 0.2333 (-0.54z)| lr 1.91e-07 | 322.40 ms | 52.3% bf16 MFU | 1624806 tok/s step 19348/19560 | loss 3.280707 (-0.40z)| norm 0.2300 (-0.73z)| lr 1.89e-07 | 322.49 ms | 52.3% bf16 MFU | 1624854 tok/s step 19349/19560 | loss 3.287130 (-0.26z)| norm 0.2276 (-0.85z)| lr 1.87e-07 | 323.11 ms | 52.2% bf16 MFU | 1624742 tok/s step 19350/19560 | loss 3.286341 (-0.28z)| norm 0.2432 (+0.02z)| lr 1.85e-07 | 322.30 ms | 52.4% bf16 MFU | 1624840 tok/s step 19351/19560 | loss 3.315761 (+0.36z)| norm 0.2318 (-0.61z)| lr 1.84e-07 | 322.13 ms | 52.4% bf16 MFU | 1624976 tok/s step 19352/19560 | loss 3.292512 (-0.13z)| norm 0.2537 (+0.60z)| lr 1.82e-07 | 322.78 ms | 52.3% bf16 MFU | 1624941 tok/s step 19353/19560 | loss 3.277290 (-0.46z)| norm 0.2371 (-0.31z)| lr 1.80e-07 | 322.81 ms | 52.3% bf16 MFU | 1624902 tok/s step 19354/19560 | loss 3.293319 (-0.13z)| norm 0.2455 (+0.17z)| lr 1.78e-07 | 322.44 ms | 52.3% bf16 MFU | 1624956 tok/s step 19355/19560 | loss 3.306525 (+0.15z)| norm 0.2412 (-0.06z)| lr 1.77e-07 | 322.64 ms | 52.3% bf16 MFU | 1624959 tok/s step 19356/19560 | loss 3.321712 (+0.48z)| norm 0.2239 (-1.05z)| lr 1.75e-07 | 322.43 ms | 52.3% bf16 MFU | 1625014 tok/s step 19357/19560 | loss 3.315708 (+0.34z)| norm 0.2241 (-1.02z)| lr 1.73e-07 | 322.76 ms | 52.3% bf16 MFU | 1624983 tok/s step 19358/19560 | loss 3.307049 (+0.15z)| norm 0.2318 (-0.58z)| lr 1.72e-07 | 322.22 ms | 52.4% bf16 MFU | 1625089 tok/s step 19359/19560 | loss 3.249009 (-1.15z)| norm 0.2290 (-0.73z)| 
lr 1.70e-07 | 322.42 ms | 52.3% bf16 MFU | 1625139 tok/s step 19360/19560 | loss 3.294883 (-0.10z)| norm 0.2292 (-0.71z)| lr 1.68e-07 | 322.23 ms | 52.4% bf16 MFU | 1625236 tok/s step 19361/19560 | loss 3.339671 (+1.01z)| norm 0.2376 (-0.20z)| lr 1.66e-07 | 322.69 ms | 52.3% bf16 MFU | 1625212 tok/s step 19362/19560 | loss 3.251104 (-1.14z)| norm 0.2293 (-0.77z)| lr 1.65e-07 | 323.09 ms | 52.2% bf16 MFU | 1625088 tok/s step 19363/19560 | loss 3.296405 (-0.02z)| norm 0.2249 (-1.07z)| lr 1.63e-07 | 322.97 ms | 52.3% bf16 MFU | 1624999 tok/s step 19364/19560 | loss 3.298488 (+0.04z)| norm 0.2225 (-1.24z)| lr 1.62e-07 | 323.24 ms | 52.2% bf16 MFU | 1624847 tok/s step 19365/19560 | loss 3.227999 (-1.69z)| norm 0.2282 (-0.81z)| lr 1.60e-07 | 322.85 ms | 52.3% bf16 MFU | 1624803 tok/s step 19366/19560 | loss 3.272878 (-0.57z)| norm 0.2548 (+1.12z)| lr 1.58e-07 | 322.25 ms | 52.4% bf16 MFU | 1624910 tok/s step 19367/19560 | loss 3.284221 (-0.29z)| norm 0.2252 (-1.03z)| lr 1.57e-07 | 322.64 ms | 52.3% bf16 MFU | 1624916 tok/s step 19368/19560 | loss 3.262299 (-0.83z)| norm 0.2418 (+0.18z)| lr 1.55e-07 | 323.02 ms | 52.2% bf16 MFU | 1624823 tok/s step 19369/19560 | loss 3.282937 (-0.32z)| norm 0.2259 (-0.98z)| lr 1.53e-07 | 322.86 ms | 52.3% bf16 MFU | 1624776 tok/s step 19370/19560 | loss 3.249227 (-1.14z)| norm 0.2266 (-0.92z)| lr 1.52e-07 | 322.42 ms | 52.3% bf16 MFU | 1624843 tok/s step 19371/19560 | loss 3.311473 (+0.40z)| norm 0.2802 (+2.88z)| lr 1.50e-07 | 322.92 ms | 52.3% bf16 MFU | 1624780 tok/s step 19372/19560 | loss 3.344348 (+1.19z)| norm 0.2685 (+2.00z)| lr 1.49e-07 | 322.42 ms | 52.3% bf16 MFU | 1624845 tok/s step 19373/19560 | loss 3.339947 (+1.07z)| norm 0.2435 (+0.25z)| lr 1.47e-07 | 322.61 ms | 52.3% bf16 MFU | 1624859 tok/s step 19374/19560 | loss 3.272487 (-0.57z)| norm 0.2246 (-1.05z)| lr 1.46e-07 | 322.96 ms | 52.3% bf16 MFU | 1624784 tok/s step 19375/19560 | loss 3.270739 (-0.61z)| norm 0.2337 (-0.41z)| lr 1.44e-07 | 322.61 ms | 52.3% bf16 MFU | 
1624803 tok/s step 19376/19560 | loss 3.291270 (-0.11z)| norm 0.2239 (-1.09z)| lr 1.42e-07 | 322.92 ms | 52.3% bf16 MFU | 1624742 tok/s step 19377/19560 | loss 3.289497 (-0.15z)| norm 0.2286 (-0.77z)| lr 1.41e-07 | 322.37 ms | 52.4% bf16 MFU | 1624823 tok/s step 19378/19560 | loss 3.301518 (+0.14z)| norm 0.2290 (-0.74z)| lr 1.39e-07 | 322.73 ms | 52.3% bf16 MFU | 1624809 tok/s step 19379/19560 | loss 3.298518 (+0.05z)| norm 0.2243 (-1.06z)| lr 1.38e-07 | 322.77 ms | 52.3% bf16 MFU | 1624786 tok/s step 19380/19560 | loss 3.302499 (+0.15z)| norm 0.2231 (-1.13z)| lr 1.36e-07 | 322.44 ms | 52.3% bf16 MFU | 1624847 tok/s step 19381/19560 | loss 3.490825 (+4.48z)| norm 0.5067 (+9.59z)| lr 1.35e-07 | 322.81 ms | 52.3% bf16 MFU | 1624811 tok/s step 19382/19560 | loss 3.228394 (-1.59z)| norm 0.2344 (-0.25z)| lr 1.33e-07 | 323.01 ms | 52.2% bf16 MFU | 1624727 tok/s step 19383/19560 | loss 3.298752 (+0.03z)| norm 0.2616 (+0.72z)| lr 1.32e-07 | 323.01 ms | 52.2% bf16 MFU | 1624648 tok/s step 19384/19560 | loss 3.332369 (+0.79z)| norm 0.2427 (+0.04z)| lr 1.30e-07 | 323.39 ms | 52.2% bf16 MFU | 1624477 tok/s step 19385/19560 | loss 3.317844 (+0.45z)| norm 0.2266 (-0.54z)| lr 1.29e-07 | 322.51 ms | 52.3% bf16 MFU | 1624537 tok/s step 19386/19560 | loss 3.371939 (+1.66z)| norm 0.2546 (+0.46z)| lr 1.27e-07 | 322.63 ms | 52.3% bf16 MFU | 1624563 tok/s step 19387/19560 | loss 3.220060 (-1.78z)| norm 0.2706 (+1.03z)| lr 1.26e-07 | 322.23 ms | 52.4% bf16 MFU | 1624688 tok/s step 19388/19560 | loss 3.258910 (-0.89z)| norm 0.2250 (-0.61z)| lr 1.25e-07 | 322.59 ms | 52.3% bf16 MFU | 1624716 tok/s step 19389/19560 | loss 3.273052 (-0.57z)| norm 0.2253 (-0.59z)| lr 1.23e-07 | 322.82 ms | 52.3% bf16 MFU | 1624684 tok/s step 19390/19560 | loss 3.297350 (-0.02z)| norm 0.2314 (-0.38z)| lr 1.22e-07 | 322.55 ms | 52.3% bf16 MFU | 1624722 tok/s step 19391/19560 | loss 3.297080 (-0.04z)| norm 0.2389 (-0.11z)| lr 1.20e-07 | 322.79 ms | 52.3% bf16 MFU | 1624698 tok/s step 19392/19560 | loss 3.268903 
(-0.68z)| norm 0.2333 (-0.31z)| lr 1.19e-07 | 322.41 ms | 52.3% bf16 MFU | 1624772 tok/s step 19393/19560 | loss 3.308504 (+0.22z)| norm 0.2738 (+1.14z)| lr 1.17e-07 | 322.52 ms | 52.3% bf16 MFU | 1624813 tok/s step 19394/19560 | loss 3.263911 (-0.78z)| norm 0.2353 (-0.24z)| lr 1.16e-07 | 322.42 ms | 52.3% bf16 MFU | 1624878 tok/s step 19395/19560 | loss 3.234714 (-1.43z)| norm 0.2164 (-0.91z)| lr 1.15e-07 | 322.98 ms | 52.3% bf16 MFU | 1624799 tok/s step 19396/19560 | loss 3.258687 (-0.89z)| norm 0.2720 (+1.07z)| lr 1.13e-07 | 322.92 ms | 52.3% bf16 MFU | 1624738 tok/s step 19397/19560 | loss 3.295298 (-0.05z)| norm 0.2248 (-0.62z)| lr 1.12e-07 | 322.25 ms | 52.4% bf16 MFU | 1624848 tok/s step 19398/19560 | loss 3.296153 (-0.03z)| norm 0.2535 (+0.40z)| lr 1.11e-07 | 322.76 ms | 52.3% bf16 MFU | 1624826 tok/s step 19399/19560 | loss 3.463237 (+3.56z)| norm 0.2638 (+0.77z)| lr 1.09e-07 | 323.30 ms | 52.2% bf16 MFU | 1624669 tok/s step 19400/19560 | loss 3.270394 (-0.61z)| norm 0.2635 (+0.75z)| lr 1.08e-07 | 322.25 ms | 52.4% bf16 MFU | 1624784 tok/s step 19401/19560 | loss 3.372547 (+1.57z)| norm 0.2283 (-0.50z)| lr 1.07e-07 | 322.44 ms | 52.3% bf16 MFU | 1624845 tok/s step 19402/19560 | loss 3.291061 (-0.17z)| norm 0.2245 (-0.63z)| lr 1.05e-07 | 323.25 ms | 52.2% bf16 MFU | 1624698 tok/s step 19403/19560 | loss 3.318369 (+0.40z)| norm 0.2268 (-0.54z)| lr 1.04e-07 | 322.47 ms | 52.3% bf16 MFU | 1624754 tok/s step 19404/19560 | loss 3.255044 (-0.96z)| norm 0.2695 (+0.97z)| lr 1.03e-07 | 322.63 ms | 52.3% bf16 MFU | 1624769 tok/s step 19405/19560 | loss 3.320327 (+0.44z)| norm 0.2267 (-0.55z)| lr 1.01e-07 | 322.76 ms | 52.3% bf16 MFU | 1624749 tok/s step 19406/19560 | loss 3.312657 (+0.27z)| norm 0.2427 (+0.02z)| lr 1.00e-07 | 322.69 ms | 52.3% bf16 MFU | 1624749 tok/s step 19407/19560 | loss 3.336913 (+0.80z)| norm 0.2309 (-0.39z)| lr 9.87e-08 | 322.75 ms | 52.3% bf16 MFU | 1624732 tok/s step 19408/19560 | loss 3.321363 (+0.45z)| norm 0.2238 (-0.64z)| lr 9.74e-08 | 
322.87 ms | 52.3% bf16 MFU | 1624687 tok/s step 19409/19560 | loss 3.230602 (-1.49z)| norm 0.2266 (-0.54z)| lr 9.61e-08 | 323.33 ms | 52.2% bf16 MFU | 1624530 tok/s step 19410/19560 | loss 3.319440 (+0.41z)| norm 0.2269 (-0.53z)| lr 9.49e-08 | 322.73 ms | 52.3% bf16 MFU | 1624531 tok/s step 19411/19560 | loss 3.219660 (-1.70z)| norm 0.2465 (+0.17z)| lr 9.36e-08 | 322.81 ms | 52.3% bf16 MFU | 1624511 tok/s step 19412/19560 | loss 3.301674 (+0.04z)| norm 0.2324 (-0.32z)| lr 9.24e-08 | 323.38 ms | 52.2% bf16 MFU | 1624350 tok/s step 19413/19560 | loss 3.251629 (-1.01z)| norm 0.2321 (-0.33z)| lr 9.12e-08 | 322.88 ms | 52.3% bf16 MFU | 1624321 tok/s step 19414/19560 | loss 3.277214 (-0.46z)| norm 0.2340 (-0.25z)| lr 8.99e-08 | 322.79 ms | 52.3% bf16 MFU | 1624316 tok/s step 19415/19560 | loss 3.236461 (-1.31z)| norm 0.2348 (-0.22z)| lr 8.87e-08 | 322.63 ms | 52.3% bf16 MFU | 1624352 tok/s step 19416/19560 | loss 3.300826 (+0.07z)| norm 0.2719 (+1.09z)| lr 8.75e-08 | 323.46 ms | 52.2% bf16 MFU | 1624178 tok/s step 19417/19560 | loss 3.260288 (-0.81z)| norm 0.2478 (+0.22z)| lr 8.63e-08 | 322.55 ms | 52.3% bf16 MFU | 1624240 tok/s step 19418/19560 | loss 3.282933 (-0.33z)| norm 0.2279 (-0.48z)| lr 8.51e-08 | 322.76 ms | 52.3% bf16 MFU | 1624247 tok/s step 19419/19560 | loss 3.323429 (+0.55z)| norm 0.2308 (-0.38z)| lr 8.39e-08 | 323.03 ms | 52.2% bf16 MFU | 1624185 tok/s step 19420/19560 | loss 3.239364 (-1.24z)| norm 0.2173 (-0.85z)| lr 8.27e-08 | 323.51 ms | 52.2% bf16 MFU | 1624006 tok/s step 19421/19560 | loss 3.243805 (-1.13z)| norm 0.2396 (-0.06z)| lr 8.16e-08 | 322.85 ms | 52.3% bf16 MFU | 1624002 tok/s step 19422/19560 | loss 3.284670 (-0.25z)| norm 0.2283 (-0.45z)| lr 8.04e-08 | 322.81 ms | 52.3% bf16 MFU | 1624009 tok/s step 19423/19560 | loss 3.205769 (-1.90z)| norm 0.2414 (+0.02z)| lr 7.93e-08 | 322.84 ms | 52.3% bf16 MFU | 1624009 tok/s step 19424/19560 | loss 3.260849 (-0.73z)| norm 0.2350 (-0.20z)| lr 7.81e-08 | 322.80 ms | 52.3% bf16 MFU | 1624019 tok/s step 
19425/19560 | loss 3.328660 (+0.79z)| norm 0.2454 (+0.17z)| lr 7.70e-08 | 322.89 ms | 52.3% bf16 MFU | 1624005 tok/s step 19426/19560 | loss 3.262365 (-0.72z)| norm 0.2470 (+0.22z)| lr 7.59e-08 | 322.70 ms | 52.3% bf16 MFU | 1624039 tok/s step 19427/19560 | loss 3.330067 (+0.83z)| norm 0.2322 (-0.31z)| lr 7.47e-08 | 322.78 ms | 52.3% bf16 MFU | 1624052 tok/s step 19428/19560 | loss 3.247211 (-1.05z)| norm 0.2379 (-0.10z)| lr 7.36e-08 | 322.94 ms | 52.3% bf16 MFU | 1624024 tok/s step 19429/19560 | loss 3.359724 (+1.48z)| norm 0.2613 (+0.73z)| lr 7.25e-08 | 322.41 ms | 52.3% bf16 MFU | 1624129 tok/s step 19430/19560 | loss 3.282366 (-0.25z)| norm 0.2552 (+0.50z)| lr 7.14e-08 | 322.51 ms | 52.3% bf16 MFU | 1624205 tok/s step 19431/19560 | loss 3.321820 (+0.63z)| norm 0.2441 (+0.10z)| lr 7.03e-08 | 322.71 ms | 52.3% bf16 MFU | 1624227 tok/s step 19432/19560 | loss 3.302001 (+0.19z)| norm 0.2765 (+1.26z)| lr 6.93e-08 | 322.72 ms | 52.3% bf16 MFU | 1624246 tok/s step 19433/19560 | loss 3.300609 (+0.15z)| norm 0.2248 (-0.59z)| lr 6.82e-08 | 322.56 ms | 52.3% bf16 MFU | 1624303 tok/s step 19434/19560 | loss 3.267169 (-0.60z)| norm 0.2371 (-0.15z)| lr 6.71e-08 | 323.39 ms | 52.2% bf16 MFU | 1624149 tok/s step 19435/19560 | loss 3.271062 (-0.50z)| norm 0.2227 (-0.66z)| lr 6.61e-08 | 322.73 ms | 52.3% bf16 MFU | 1624169 tok/s step 19436/19560 | loss 3.304510 (+0.26z)| norm 0.2355 (-0.19z)| lr 6.50e-08 | 322.55 ms | 52.3% bf16 MFU | 1624234 tok/s step 19437/19560 | loss 3.249028 (-1.01z)| norm 0.2412 (+0.02z)| lr 6.40e-08 | 322.44 ms | 52.3% bf16 MFU | 1624323 tok/s step 19438/19560 | loss 3.312217 (+0.43z)| norm 0.2575 (+0.59z)| lr 6.30e-08 | 322.64 ms | 52.3% bf16 MFU | 1624355 tok/s step 19439/19560 | loss 3.255569 (-0.88z)| norm 0.2282 (-0.46z)| lr 6.19e-08 | 323.43 ms | 52.2% bf16 MFU | 1624189 tok/s step 19440/19560 | loss 3.271836 (-0.51z)| norm 0.2253 (-0.56z)| lr 6.09e-08 | 322.59 ms | 52.3% bf16 MFU | 1624241 tok/s step 19441/19560 | loss 3.319942 (+0.59z)| norm 
0.2317 (-0.32z)| lr 5.99e-08 | 322.48 ms | 52.3% bf16 MFU | 1624318 tok/s step 19442/19560 | loss 3.312606 (+0.43z)| norm 0.2804 (+1.41z)| lr 5.89e-08 | 322.45 ms | 52.3% bf16 MFU | 1624400 tok/s step 19443/19560 | loss 3.242254 (-1.20z)| norm 0.2350 (-0.22z)| lr 5.80e-08 | 322.71 ms | 52.3% bf16 MFU | 1624411 tok/s step 19444/19560 | loss 3.297286 (+0.08z)| norm 0.2355 (-0.20z)| lr 5.70e-08 | 324.23 ms | 52.1% bf16 MFU | 1624042 tok/s step 19445/19560 | loss 3.362718 (+1.57z)| norm 0.2337 (-0.26z)| lr 5.60e-08 | 322.94 ms | 52.3% bf16 MFU | 1624016 tok/s step 19446/19560 | loss 3.370044 (+1.70z)| norm 0.2497 (+0.31z)| lr 5.50e-08 | 322.79 ms | 52.3% bf16 MFU | 1624028 tok/s step 19447/19560 | loss 3.289576 (-0.12z)| norm 0.2620 (+0.74z)| lr 5.41e-08 | 322.84 ms | 52.3% bf16 MFU | 1624026 tok/s step 19448/19560 | loss 3.445710 (+3.25z)| norm 0.3247 (+2.86z)| lr 5.31e-08 | 323.71 ms | 52.1% bf16 MFU | 1623806 tok/s step 19449/19560 | loss 3.247884 (-1.04z)| norm 0.2235 (-0.65z)| lr 5.22e-08 | 323.52 ms | 52.2% bf16 MFU | 1623643 tok/s step 19450/19560 | loss 3.319474 (+0.52z)| norm 0.2294 (-0.43z)| lr 5.13e-08 | 322.45 ms | 52.3% bf16 MFU | 1623758 tok/s step 19451/19560 | loss 3.304584 (+0.18z)| norm 0.2261 (-0.55z)| lr 5.04e-08 | 322.76 ms | 52.3% bf16 MFU | 1623791 tok/s step 19452/19560 | loss 3.339195 (+0.93z)| norm 0.2299 (-0.41z)| lr 4.94e-08 | 322.63 ms | 52.3% bf16 MFU | 1623855 tok/s step 19453/19560 | loss 3.277215 (-0.42z)| norm 0.2302 (-0.40z)| lr 4.85e-08 | 322.71 ms | 52.3% bf16 MFU | 1623894 tok/s step 19454/19560 | loss 3.354035 (+1.24z)| norm 0.2891 (+1.62z)| lr 4.77e-08 | 322.36 ms | 52.4% bf16 MFU | 1624018 tok/s step 19455/19560 | loss 3.332899 (+0.77z)| norm 0.2319 (-0.34z)| lr 4.68e-08 | 323.18 ms | 52.2% bf16 MFU | 1623931 tok/s step 19456/19560 | loss 3.296911 (-0.00z)| norm 0.2401 (-0.06z)| lr 4.59e-08 | 322.65 ms | 52.3% bf16 MFU | 1623982 tok/s step 19457/19560 | loss 3.270682 (-0.57z)| norm 0.2279 (-0.48z)| lr 4.50e-08 | 322.67 ms | 
52.3% bf16 MFU | 1624025 tok/s step 19458/19560 | loss 3.249633 (-1.01z)| norm 0.2278 (-0.48z)| lr 4.41e-08 | 322.85 ms | 52.3% bf16 MFU | 1624021 tok/s step 19459/19560 | loss 3.354621 (+1.27z)| norm 0.2241 (-0.59z)| lr 4.33e-08 | 322.84 ms | 52.3% bf16 MFU | 1624020 tok/s step 19460/19560 | loss 3.311810 (+0.33z)| norm 0.2528 (+0.39z)| lr 4.25e-08 | 322.68 ms | 52.3% bf16 MFU | 1624058 tok/s step 19461/19560 | loss 3.232205 (-1.42z)| norm 0.2358 (-0.18z)| lr 4.16e-08 | 322.39 ms | 52.4% bf16 MFU | 1624169 tok/s step 19462/19560 | loss 3.263106 (-0.74z)| norm 0.2236 (-0.60z)| lr 4.08e-08 | 323.18 ms | 52.2% bf16 MFU | 1624074 tok/s step 19463/19560 | loss 3.283748 (-0.28z)| norm 0.2328 (-0.28z)| lr 4.00e-08 | 322.44 ms | 52.3% bf16 MFU | 1624171 tok/s step 19464/19560 | loss 3.277028 (-0.44z)| norm 0.2331 (-0.28z)| lr 3.92e-08 | 323.32 ms | 52.2% bf16 MFU | 1624042 tok/s step 19465/19560 | loss 3.282331 (-0.31z)| norm 0.2381 (-0.10z)| lr 3.84e-08 | 323.22 ms | 52.2% bf16 MFU | 1623944 tok/s step 19466/19560 | loss 3.234975 (-1.34z)| norm 0.2203 (-0.71z)| lr 3.76e-08 | 323.07 ms | 52.2% bf16 MFU | 1623888 tok/s step 19467/19560 | loss 3.310495 (+0.33z)| norm 0.2458 (+0.17z)| lr 3.68e-08 | 322.92 ms | 52.3% bf16 MFU | 1623874 tok/s step 19468/19560 | loss 3.225099 (-1.55z)| norm 0.2346 (-0.21z)| lr 3.60e-08 | 322.73 ms | 52.3% bf16 MFU | 1623907 tok/s step 19469/19560 | loss 3.329664 (+0.76z)| norm 0.2698 (+0.99z)| lr 3.52e-08 | 323.14 ms | 52.2% bf16 MFU | 1623835 tok/s step 19470/19560 | loss 3.255013 (-0.87z)| norm 0.2329 (-0.28z)| lr 3.45e-08 | 322.98 ms | 52.3% bf16 MFU | 1623807 tok/s step 19471/19560 | loss 3.351332 (+1.23z)| norm 0.2359 (-0.18z)| lr 3.37e-08 | 323.00 ms | 52.3% bf16 MFU | 1623776 tok/s step 19472/19560 | loss 3.314158 (+0.41z)| norm 0.2564 (+0.53z)| lr 3.30e-08 | 322.98 ms | 52.3% bf16 MFU | 1623750 tok/s step 19473/19560 | loss 3.287467 (-0.17z)| norm 0.2201 (-0.72z)| lr 3.22e-08 | 322.71 ms | 52.3% bf16 MFU | 1623794 tok/s step 19474/19560 
| loss 3.283077 (-0.25z)| norm 0.2498 (+0.30z)| lr 3.15e-08 | 322.66 ms | 52.3% bf16 MFU | 1623848 tok/s step 19475/19560 | loss 3.292237 (-0.04z)| norm 0.2308 (-0.35z)| lr 3.08e-08 | 323.35 ms | 52.2% bf16 MFU | 1623727 tok/s step 19476/19560 | loss 3.288332 (-0.13z)| norm 0.2286 (-0.43z)| lr 3.01e-08 | 323.20 ms | 52.2% bf16 MFU | 1623648 tok/s step 19477/19560 | loss 3.287972 (-0.13z)| norm 0.2317 (-0.33z)| lr 2.94e-08 | 322.87 ms | 52.3% bf16 MFU | 1623658 tok/s step 19478/19560 | loss 3.254336 (-0.88z)| norm 0.2279 (-0.45z)| lr 2.87e-08 | 322.94 ms | 52.3% bf16 MFU | 1623649 tok/s step 19479/19560 | loss 3.307001 (+0.30z)| norm 0.2320 (-0.31z)| lr 2.80e-08 | 322.35 ms | 52.4% bf16 MFU | 1623788 tok/s step 19480/19560 | loss 3.290272 (-0.07z)| norm 0.2397 (-0.04z)| lr 2.73e-08 | 322.88 ms | 52.3% bf16 MFU | 1623788 tok/s step 19481/19560 | loss 3.278827 (-0.33z)| norm 0.2268 (-0.48z)| lr 2.66e-08 | 323.38 ms | 52.2% bf16 MFU | 1623663 tok/s step 19482/19560 | loss 3.279766 (-0.31z)| norm 0.2501 (+0.32z)| lr 2.60e-08 | 322.38 ms | 52.4% bf16 MFU | 1623796 tok/s step 19483/19560 | loss 3.294460 (+0.02z)| norm 0.2183 (-0.77z)| lr 2.53e-08 | 322.08 ms | 52.4% bf16 MFU | 1623996 tok/s step 19484/19560 | loss 3.294750 (+0.03z)| norm 0.2393 (-0.05z)| lr 2.47e-08 | 323.82 ms | 52.1% bf16 MFU | 1623750 tok/s step 19485/19560 | loss 3.206136 (-1.91z)| norm 0.2298 (-0.38z)| lr 2.40e-08 | 322.44 ms | 52.3% bf16 MFU | 1623862 tok/s step 19486/19560 | loss 3.305559 (+0.29z)| norm 0.2360 (-0.17z)| lr 2.34e-08 | 322.68 ms | 52.3% bf16 MFU | 1623910 tok/s step 19487/19560 | loss 3.205659 (-1.90z)| norm 0.2344 (-0.22z)| lr 2.28e-08 | 322.67 ms | 52.3% bf16 MFU | 1623957 tok/s step 19488/19560 | loss 3.295388 (+0.07z)| norm 0.2501 (+0.31z)| lr 2.22e-08 | 322.75 ms | 52.3% bf16 MFU | 1623981 tok/s step 19489/19560 | loss 3.259730 (-0.70z)| norm 0.2356 (-0.19z)| lr 2.16e-08 | 322.52 ms | 52.3% bf16 MFU | 1624061 tok/s step 19490/19560 | loss 3.376874 (+1.84z)| norm 0.2452 (+0.14z)| 
lr 2.10e-08 | 322.35 ms | 52.4% bf16 MFU | 1624182 tok/s step 19491/19560 | loss 3.276298 (-0.35z)| norm 0.2307 (-0.36z)| lr 2.04e-08 | 323.50 ms | 52.2% bf16 MFU | 1624007 tok/s step 19492/19560 | loss 3.333976 (+0.90z)| norm 0.2622 (+0.71z)| lr 1.98e-08 | 322.57 ms | 52.3% bf16 MFU | 1624075 tok/s step 19493/19560 | loss 3.334810 (+0.91z)| norm 0.2779 (+1.23z)| lr 1.92e-08 | 322.77 ms | 52.3% bf16 MFU | 1624088 tok/s step 19494/19560 | loss 3.282141 (-0.25z)| norm 0.2353 (-0.22z)| lr 1.87e-08 | 322.60 ms | 52.3% bf16 MFU | 1624145 tok/s step 19495/19560 | loss 3.358419 (+1.40z)| norm 0.2803 (+1.30z)| lr 1.81e-08 | 322.53 ms | 52.3% bf16 MFU | 1624216 tok/s step 19496/19560 | loss 3.172189 (-2.56z)| norm 0.2254 (-0.57z)| lr 1.76e-08 | 322.79 ms | 52.3% bf16 MFU | 1624218 tok/s step 19497/19560 | loss 3.271140 (-0.47z)| norm 0.2321 (-0.34z)| lr 1.70e-08 | 322.97 ms | 52.3% bf16 MFU | 1624175 tok/s step 19498/19560 | loss 3.234278 (-1.24z)| norm 0.2504 (+0.27z)| lr 1.65e-08 | 322.67 ms | 52.3% bf16 MFU | 1624208 tok/s step 19499/19560 | loss 3.247674 (-0.94z)| norm 0.2773 (+1.19z)| lr 1.60e-08 | 322.28 ms | 52.4% bf16 MFU | 1624338 tok/s step 19500/19560 | loss 3.259732 (-0.68z)| norm 0.2216 (-0.69z)| lr 1.55e-08 | 322.73 ms | 52.3% bf16 MFU | 1624348 tok/s val loss 3.274286 evaluating HellaSwag: 0/79 evaluating HellaSwag: 10/79 evaluating HellaSwag: 20/79 evaluating HellaSwag: 30/79 evaluating HellaSwag: 40/79 evaluating HellaSwag: 50/79 evaluating HellaSwag: 60/79 evaluating HellaSwag: 70/79 HellaSwag: 3053/10042 = 0.304023 step 19501/19560 | loss 3.259327 (-0.67z)| norm 0.2423 (+0.01z)| lr 1.50e-08 | 323.32 ms | 52.2% bf16 MFU | 1624210 tok/s step 19502/19560 | loss 3.267449 (-0.50z)| norm 0.2386 (-0.12z)| lr 1.45e-08 | 323.02 ms | 52.2% bf16 MFU | 1624154 tok/s step 19503/19560 | loss 3.249987 (-0.87z)| norm 0.2405 (-0.05z)| lr 1.40e-08 | 322.52 ms | 52.3% bf16 MFU | 1624226 tok/s step 19504/19560 | loss 3.235215 (-1.16z)| norm 0.2419 (-0.01z)| lr 1.35e-08 | 
322.47 ms | 52.3% bf16 MFU | 1624309 tok/s step 19505/19560 | loss 3.286049 (-0.10z)| norm 0.2334 (-0.30z)| lr 1.31e-08 | 322.34 ms | 52.4% bf16 MFU | 1624418 tok/s step 19506/19560 | loss 3.342950 (+1.09z)| norm 0.2419 (-0.01z)| lr 1.26e-08 | 322.50 ms | 52.3% bf16 MFU | 1624482 tok/s step 19507/19560 | loss 3.176334 (-2.32z)| norm 0.2349 (-0.26z)| lr 1.21e-08 | 322.53 ms | 52.3% bf16 MFU | 1624536 tok/s step 19508/19560 | loss 3.300934 (+0.22z)| norm 0.2415 (-0.04z)| lr 1.17e-08 | 322.45 ms | 52.3% bf16 MFU | 1624607 tok/s step 19509/19560 | loss 3.215183 (-1.58z)| norm 0.2244 (-0.93z)| lr 1.12e-08 | 322.77 ms | 52.3% bf16 MFU | 1624592 tok/s step 19510/19560 | loss 3.350995 (+1.35z)| norm 0.2495 (+0.52z)| lr 1.08e-08 | 322.13 ms | 52.4% bf16 MFU | 1624742 tok/s step 19511/19560 | loss 3.223073 (-1.40z)| norm 0.2278 (-0.72z)| lr 1.04e-08 | 323.42 ms | 52.2% bf16 MFU | 1624559 tok/s step 19512/19560 | loss 3.324532 (+0.78z)| norm 0.2270 (-0.76z)| lr 1.00e-08 | 322.18 ms | 52.4% bf16 MFU | 1624698 tok/s step 19513/19560 | loss 3.299175 (+0.24z)| norm 0.2300 (-0.59z)| lr 9.58e-09 | 322.53 ms | 52.3% bf16 MFU | 1624741 tok/s step 19514/19560 | loss 3.315346 (+0.61z)| norm 0.2231 (-0.97z)| lr 9.19e-09 | 322.23 ms | 52.4% bf16 MFU | 1624856 tok/s step 19515/19560 | loss 3.339180 (+1.11z)| norm 0.2434 (+0.22z)| lr 8.82e-09 | 322.30 ms | 52.4% bf16 MFU | 1624949 tok/s step 19516/19560 | loss 3.271267 (-0.38z)| norm 0.2632 (+1.36z)| lr 8.42e-09 | 323.23 ms | 52.2% bf16 MFU | 1624803 tok/s step 19517/19560 | loss 3.279777 (-0.19z)| norm 0.2204 (-1.14z)| lr 8.06e-09 | 322.65 ms | 52.3% bf16 MFU | 1624809 tok/s step 19518/19560 | loss 3.272051 (-0.36z)| norm 0.2416 (+0.09z)| lr 7.69e-09 | 322.47 ms | 52.3% bf16 MFU | 1624862 tok/s step 19519/19560 | loss 3.292563 (+0.09z)| norm 0.2457 (+0.33z)| lr 7.35e-09 | 322.02 ms | 52.4% bf16 MFU | 1625027 tok/s step 19520/19560 | loss 3.255743 (-0.71z)| norm 0.2337 (-0.37z)| lr 6.99e-09 | 322.48 ms | 52.3% bf16 MFU | 1625066 tok/s step 
19521/19560 | loss 3.306442 (+0.40z)| norm 0.4176 (+7.66z)| lr 6.65e-09 | 322.72 ms | 52.3% bf16 MFU | 1625043 tok/s step 19522/19560 | loss 3.265384 (-0.50z)| norm 0.2388 (-0.11z)| lr 6.33e-09 | 322.31 ms | 52.4% bf16 MFU | 1625123 tok/s step 19523/19560 | loss 3.287360 (-0.03z)| norm 0.2400 (-0.06z)| lr 6.01e-09 | 322.39 ms | 52.4% bf16 MFU | 1625180 tok/s step 19524/19560 | loss 3.348942 (+1.31z)| norm 0.2311 (-0.44z)| lr 5.70e-09 | 322.59 ms | 52.3% bf16 MFU | 1625184 tok/s step 19525/19560 | loss 3.288743 (-0.01z)| norm 0.2543 (+0.57z)| lr 5.40e-09 | 322.52 ms | 52.3% bf16 MFU | 1625204 tok/s step 19526/19560 | loss 3.253451 (-0.78z)| norm 0.2482 (+0.30z)| lr 5.10e-09 | 322.17 ms | 52.4% bf16 MFU | 1625312 tok/s step 19527/19560 | loss 3.297486 (+0.23z)| norm 0.2443 (+0.14z)| lr 4.81e-09 | 322.30 ms | 52.4% bf16 MFU | 1625381 tok/s step 19528/19560 | loss 3.344249 (+1.29z)| norm 0.2426 (+0.07z)| lr 4.52e-09 | 322.26 ms | 52.4% bf16 MFU | 1625458 tok/s step 19529/19560 | loss 3.320137 (+0.76z)| norm 0.2304 (-0.47z)| lr 4.26e-09 | 322.46 ms | 52.3% bf16 MFU | 1625480 tok/s step 19530/19560 | loss 3.211420 (-1.75z)| norm 0.2271 (-0.62z)| lr 4.01e-09 | 322.17 ms | 52.4% bf16 MFU | 1625575 tok/s step 19531/19560 | loss 3.282756 (-0.10z)| norm 0.2369 (-0.19z)| lr 3.74e-09 | 322.32 ms | 52.4% bf16 MFU | 1625627 tok/s step 19532/19560 | loss 3.322522 (+0.81z)| norm 0.2297 (-0.50z)| lr 3.50e-09 | 322.49 ms | 52.3% bf16 MFU | 1625634 tok/s step 19533/19560 | loss 3.183940 (-2.33z)| norm 0.2322 (-0.38z)| lr 3.25e-09 | 322.19 ms | 52.4% bf16 MFU | 1625715 tok/s step 19534/19560 | loss 3.261054 (-0.57z)| norm 0.2298 (-0.49z)| lr 3.04e-09 | 322.31 ms | 52.4% bf16 MFU | 1625763 tok/s step 19535/19560 | loss 3.285622 (+0.00z)| norm 0.2308 (-0.44z)| lr 2.81e-09 | 322.43 ms | 52.3% bf16 MFU | 1625777 tok/s step 19536/19560 | loss 3.226166 (-1.33z)| norm 0.2262 (-0.65z)| lr 2.59e-09 | 322.62 ms | 52.3% bf16 MFU | 1625743 tok/s step 19537/19560 | loss 3.296719 (+0.26z)| norm 
0.2256 (-0.68z)| lr 2.40e-09 | 322.54 ms | 52.3% bf16 MFU | 1625730 tok/s step 19538/19560 | loss 3.247455 (-0.85z)| norm 0.2376 (-0.14z)| lr 2.20e-09 | 322.36 ms | 52.4% bf16 MFU | 1625763 tok/s step 19539/19560 | loss 3.270198 (-0.35z)| norm 0.2475 (+0.30z)| lr 2.02e-09 | 322.67 ms | 52.3% bf16 MFU | 1625716 tok/s step 19540/19560 | loss 3.334180 (+1.12z)| norm 0.2611 (+0.89z)| lr 1.84e-09 | 322.56 ms | 52.3% bf16 MFU | 1625700 tok/s step 19541/19560 | loss 3.291686 (+0.14z)| norm 0.2329 (-0.36z)| lr 1.66e-09 | 322.43 ms | 52.3% bf16 MFU | 1625718 tok/s step 19542/19560 | loss 3.261523 (-0.55z)| norm 0.2602 (+0.84z)| lr 1.50e-09 | 322.77 ms | 52.3% bf16 MFU | 1625648 tok/s step 19543/19560 | loss 3.297180 (+0.25z)| norm 0.2302 (-0.49z)| lr 1.34e-09 | 322.52 ms | 52.3% bf16 MFU | 1625645 tok/s step 19544/19560 | loss 3.272327 (-0.31z)| norm 0.2410 (+0.00z)| lr 1.20e-09 | 322.53 ms | 52.3% bf16 MFU | 1625641 tok/s step 19545/19560 | loss 3.272689 (-0.31z)| norm 0.2415 (+0.03z)| lr 1.07e-09 | 322.30 ms | 52.4% bf16 MFU | 1625694 tok/s step 19546/19560 | loss 3.306659 (+0.47z)| norm 0.2433 (+0.10z)| lr 9.30e-10 | 322.30 ms | 52.4% bf16 MFU | 1625744 tok/s step 19547/19560 | loss 3.195958 (-2.03z)| norm 0.2625 (+0.95z)| lr 8.23e-10 | 322.80 ms | 52.3% bf16 MFU | 1625667 tok/s step 19548/19560 | loss 3.252087 (-0.76z)| norm 0.2371 (-0.20z)| lr 6.97e-10 | 322.88 ms | 52.3% bf16 MFU | 1625572 tok/s step 19549/19560 | loss 3.253364 (-0.73z)| norm 0.2426 (+0.05z)| lr 6.08e-10 | 322.82 ms | 52.3% bf16 MFU | 1625498 tok/s step 19550/19560 | loss 3.287394 (+0.05z)| norm 0.2722 (+1.36z)| lr 5.01e-10 | 322.11 ms | 52.4% bf16 MFU | 1625608 tok/s step 19551/19560 | loss 3.300835 (+0.34z)| norm 0.2235 (-0.81z)| lr 4.11e-10 | 322.39 ms | 52.4% bf16 MFU | 1625641 tok/s step 19552/19560 | loss 3.195441 (-2.06z)| norm 0.2398 (-0.08z)| lr 3.40e-10 | 322.58 ms | 52.3% bf16 MFU | 1625622 tok/s step 19553/19560 | loss 3.188078 (-2.17z)| norm 0.2430 (+0.06z)| lr 2.68e-10 | 323.04 ms | 
52.2% bf16 MFU | 1625492 tok/s step 19554/19560 | loss 3.274673 (-0.22z)| norm 0.2263 (-0.68z)| lr 1.97e-10 | 322.78 ms | 52.3% bf16 MFU | 1625431 tok/s step 19555/19560 | loss 3.320196 (+0.80z)| norm 0.2514 (+0.43z)| lr 1.43e-10 | 322.75 ms | 52.3% bf16 MFU | 1625381 tok/s step 19556/19560 | loss 3.252189 (-0.73z)| norm 0.2931 (+2.23z)| lr 1.07e-10 | 322.91 ms | 52.3% bf16 MFU | 1625293 tok/s step 19557/19560 | loss 3.257648 (-0.59z)| norm 0.2504 (+0.37z)| lr 7.15e-11 | 322.42 ms | 52.3% bf16 MFU | 1625334 tok/s step 19558/19560 | loss 3.282717 (-0.02z)| norm 0.2317 (-0.44z)| lr 3.58e-11 | 322.74 ms | 52.3% bf16 MFU | 1625292 tok/s step 19559/19560 | loss 3.315925 (+0.73z)| norm 0.2217 (-0.87z)| lr 1.79e-11 | 322.70 ms | 52.3% bf16 MFU | 1625262 tok/s step 19560/19560 | loss 3.223261 (-1.35z)| norm 0.2269 (-0.63z)| lr 0.00e+00 | 322.63 ms | 52.3% bf16 MFU | 1625252 tok/s
val loss 3.274314
HellaSwag: 3043/10042 = 0.303027
Writing state to log124M/state_00019560_00005.bin
Writing state to log124M/state_00019560_00007.bin
Writing state to log124M/state_00019560_00004.bin
Writing state to log124M/state_00019560_00006.bin
Writing state to log124M/state_00019560_00001.bin
Writing state to log124M/state_00019560_00003.bin
Writing state to log124M/state_00019560_00002.bin
Error: Token out of vocabulary at train_gpt2.cu:675
Error details:
File: train_gpt2.cu
Line: 675
Token: -1149026846
Position: 0
Vocab: 50257
generating:
---
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned a non-zero exit code.
Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
Process name: [[20376,1],0]
Exit code: 1
--------------------------------------------------------------------------