diff --git "a/training_log.txt" "b/training_log.txt" new file mode 100644--- /dev/null +++ "b/training_log.txt" @@ -0,0 +1,9436 @@ +WARNING: APEX not installed - defaulting to deepspeed's fused adam +Time to load fused_adam op: 19.47908115386963 seconds +Time to load utils op: 11.555307388305664 seconds +Rank: 1 partition count [2, 2] and sizes[(253755392, False), (160768, False)] +Time to load utils op: 0.0002818107604980469 seconds +WARNING: shuffle index length (191132) is not equal to sample index length (191133) +WARNING: shuffle index length (19112) is not equal to sample index length (19113) +WARNING: shuffle index length (19112) is not equal to sample index length (19113) +WARNING: shuffle index length (56921) is not equal to sample index length (56922) +WARNING: shuffle index length (642) is not equal to sample index length (643) +WARNING: shuffle index length (320) is not equal to sample index length (321) +WARNING: shuffle index length (330776) is not equal to sample index length (330777) +WARNING: shuffle index length (82693) is not equal to sample index length (82694) +WARNING: shuffle index length (82693) is not equal to sample index length (82694) +WARNING: shuffle index length (206225) is not equal to sample index length (206226) +WARNING: shuffle index length (18746) is not equal to sample index length (18747) +WARNING: shuffle index length (18746) is not equal to sample index length (18747) +WARNING: shuffle index length (239232) is not equal to sample index length (239233) +WARNING: shuffle index length (2522) is not equal to sample index length (2523) +WARNING: shuffle index length (503) is not equal to sample index length (504) +WARNING: shuffle index length (63680) is not equal to sample index length (63681) +WARNING: shuffle index length (678) is not equal to sample index length (679) +WARNING: shuffle index length (55) is not equal to sample index length (56) +WARNING: shuffle index length (155804) is not equal to sample index length 
(155805) +WARNING: shuffle index length (9736) is not equal to sample index length (9737) +WARNING: shuffle index length (9736) is not equal to sample index length (9737) +WARNING: shuffle index length (233684) is not equal to sample index length (233685) +WARNING: shuffle index length (29209) is not equal to sample index length (29210) +WARNING: shuffle index length (29209) is not equal to sample index length (29210) +WARNING: shuffle index length (356567) is not equal to sample index length (356568) +WARNING: shuffle index length (118855) is not equal to sample index length (118856) +WARNING: shuffle index length (118855) is not equal to sample index length (118856) +WARNING: shuffle index length (251228) is not equal to sample index length (251229) +WARNING: shuffle index length (35888) is not equal to sample index length (35889) +WARNING: shuffle index length (35888) is not equal to sample index length (35889) +WARNING: shuffle index length (97704) is not equal to sample index length (97705) +WARNING: shuffle index length (2170) is not equal to sample index length (2171) +WARNING: shuffle index length (2170) is not equal to sample index length (2171) +WARNING: shuffle index length (402219) is not equal to sample index length (402220) +WARNING: shuffle index length (134072) is not equal to sample index length (134073) +WARNING: shuffle index length (134072) is not equal to sample index length (134073) +WARNING: shuffle index length (81781) is not equal to sample index length (81782) +WARNING: shuffle index length (1167) is not equal to sample index length (1168) +WARNING: shuffle index length (1167) is not equal to sample index length (1168) +WARNING: shuffle index length (243192) is not equal to sample index length (243193) +WARNING: shuffle index length (30398) is not equal to sample index length (30399) +WARNING: shuffle index length (30398) is not equal to sample index length (30399) +WARNING: shuffle index length (170358) is not equal to sample index length 
(170359) +WARNING: shuffle index length (13103) is not equal to sample index length (13104) +WARNING: shuffle index length (13103) is not equal to sample index length (13104) +WARNING: shuffle index length (44351) is not equal to sample index length (44352) +WARNING: shuffle index length (455) is not equal to sample index length (456) +WARNING: shuffle index length (90) is not equal to sample index length (91) +WARNING: shuffle index length (319032) is not equal to sample index length (319033) +WARNING: shuffle index length (106343) is not equal to sample index length (106344) +WARNING: shuffle index length (106343) is not equal to sample index length (106344) +WARNING: shuffle index length (61012) is not equal to sample index length (61013) +WARNING: shuffle index length (852) is not equal to sample index length (853) +WARNING: shuffle index length (425) is not equal to sample index length (426) +WARNING: shuffle index length (439086) is not equal to sample index length (439087) +WARNING: shuffle index length (146361) is not equal to sample index length (146362) +WARNING: shuffle index length (146361) is not equal to sample index length (146362) +WARNING: shuffle index length (185084) is not equal to sample index length (185085) +WARNING: shuffle index length (16824) is not equal to sample index length (16825) +WARNING: shuffle index length (16824) is not equal to sample index length (16825) +WARNING: shuffle index length (241688) is not equal to sample index length (241689) +WARNING: shuffle index length (34526) is not equal to sample index length (34527) +WARNING: shuffle index length (34526) is not equal to sample index length (34527) +WARNING: shuffle index length (65414) is not equal to sample index length (65415) +WARNING: shuffle index length (970) is not equal to sample index length (971) +WARNING: shuffle index length (322) is not equal to sample index length (323) +WARNING: shuffle index length (45755) is not equal to sample index length (45756) 
+WARNING: shuffle index length (464) is not equal to sample index length (465) +WARNING: shuffle index length (17) is not equal to sample index length (18) +WARNING: shuffle index length (353487) is not equal to sample index length (353488) +WARNING: shuffle index length (176743) is not equal to sample index length (176744) +WARNING: shuffle index length (176743) is not equal to sample index length (176744) +WARNING: shuffle index length (244450) is not equal to sample index length (244451) +WARNING: shuffle index length (6789) is not equal to sample index length (6790) +WARNING: shuffle index length (6789) is not equal to sample index length (6790) +WARNING: shuffle index length (391188) is not equal to sample index length (391189) +WARNING: shuffle index length (48897) is not equal to sample index length (48898) +WARNING: shuffle index length (48897) is not equal to sample index length (48898) +WARNING: shuffle index length (110482) is not equal to sample index length (110483) +WARNING: shuffle index length (2123) is not equal to sample index length (2124) +WARNING: shuffle index length (2123) is not equal to sample index length (2124) +WARNING: shuffle index length (98343) is not equal to sample index length (98344) +WARNING: shuffle index length (1694) is not equal to sample index length (1695) +WARNING: shuffle index length (1694) is not equal to sample index length (1695) +> RANK 1 elapsed time for building blendable dataset indices: 0.24 (sec) +> RANK 1 elapsed time for building blendable dataset indices: 0.04 (sec) +> RANK 1 elapsed time for building blendable dataset indices: 0.04 (sec) +ters... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + valid_8: + no. of documents:119929 + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + test_8: + no. of documents:119929 + reading sizes... + reading pointers... 
+ reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + train_9: + no. of documents:36548 + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + valid_9: + no. of documents:36548 + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + test_9: + no. of documents:36548 + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + train_10: + no. of documents:2205 + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + valid_10: + no. of documents:2205 + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + test_10: + no. of documents:2205 + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + train_11: + no. of documents:142377 + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + valid_11: + no. of documents:142377 + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + test_11: + no. of documents:142377 + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + train_12: + no. of documents:1190 + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + valid_12: + no. of documents:1190 + reading sizes... 
+ reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + test_12: + no. of documents:1190 + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + train_13: + no. of documents:50089 + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + valid_13: + no. of documents:50089 + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + test_13: + no. of documents:50089 + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + train_14: + no. of documents:13106 + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + valid_14: + no. of documents:13106 + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + test_14: + no. of documents:13106 + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + train_15: + no. of documents:160 + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + valid_15: + no. of documents:160 + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + test_15: + no. of documents:160 + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + train_16: + no. 
of documents:122805 + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + valid_16: + no. of documents:122805 + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + test_16: + no. of documents:122805 + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + train_17: + no. of documents:462 + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + valid_17: + no. of documents:462 + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + test_17: + no. of documents:462 + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + train_18: + no. of documents:184580 + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + valid_18: + no. of documents:184580 + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + test_18: + no. of documents:184580 + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + train_19: + no. of documents:19128 + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + valid_19: + no. of documents:19128 + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... 
+ test_19: + no. of documents:19128 + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + train_20: + no. of documents:42349 + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + valid_20: + no. of documents:42349 + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + test_20: + no. of documents:42349 + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + train_21: + no. of documents:579 + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + valid_21: + no. of documents:579 + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + test_21: + no. of documents:579 + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + train_22: + no. of documents:178 + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + valid_22: + no. of documents:178 + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + test_22: + no. of documents:178 + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + train_23: + no. of documents:193509 + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... 
+ valid_23: + no. of documents:193509 + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + test_23: + no. of documents:193509 + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + train_24: + no. of documents:49149 + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + valid_24: + no. of documents:49149 + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + test_24: + no. of documents:49149 + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + train_25: + no. of documents:331407 + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + valid_25: + no. of documents:331407 + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + test_25: + no. of documents:331407 + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + train_26: + no. of documents:3185 + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + valid_26: + no. of documents:3185 + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + test_26: + no. of documents:3185 + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... 
+ creating memory view of numpy buffer... + train_27: + no. of documents:2285 + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + valid_27: + no. of documents:2285 + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + test_27: + no. of documents:2285 + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + train_0: + no. of documents:19427 + > WARNING: could not find index map files, building the indices on rank 0 ... + > elasped time to build and save doc-idx mapping (seconds): 0.005599 + > elapsed time to build and save sample-idx mapping (seconds): 0.001973 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.003966 + > loading doc-idx mapping from data-fcm/adl/adl_text_document_train_0_indexmap_184628ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/adl/adl_text_document_train_0_indexmap_184628ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/adl/adl_text_document_train_0_indexmap_184628ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 191134 + total number of epochs: 10 +WARNING: shuffle index length (191132) is not equal to sample index length (191133) + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + valid_0: + no. of documents:19427 + > WARNING: could not find index map files, building the indices on rank 0 ... 
+ > elasped time to build and save doc-idx mapping (seconds): 0.000627 + > elapsed time to build and save sample-idx mapping (seconds): 0.000205 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.000456 + > loading doc-idx mapping from data-fcm/adl/adl_text_document_valid_0_indexmap_1853ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/adl/adl_text_document_valid_0_indexmap_1853ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/adl/adl_text_document_valid_0_indexmap_1853ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 19114 + total number of epochs: 1 +WARNING: shuffle index length (19112) is not equal to sample index length (19113) + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + test_0: + no. of documents:19427 + > WARNING: could not find index map files, building the indices on rank 0 ... + > elasped time to build and save doc-idx mapping (seconds): 0.000589 + > elapsed time to build and save sample-idx mapping (seconds): 0.000185 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.000448 + > loading doc-idx mapping from data-fcm/adl/adl_text_document_test_0_indexmap_6ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/adl/adl_text_document_test_0_indexmap_6ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/adl/adl_text_document_test_0_indexmap_6ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.039 seconds + total number of samples: 19114 + total number of epochs: 1 +WARNING: shuffle index length (19112) is not equal to sample index length (19113) + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + train_1: + no. 
of documents:368 + > WARNING: could not find index map files, building the indices on rank 0 ... + > elasped time to build and save doc-idx mapping (seconds): 0.001673 + > elapsed time to build and save sample-idx mapping (seconds): 0.000503 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.001210 + > loading doc-idx mapping from data-fcm/botxt/botxt_text_document_train_1_indexmap_56882ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/botxt/botxt_text_document_train_1_indexmap_56882ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/botxt/botxt_text_document_train_1_indexmap_56882ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 56923 + total number of epochs: 177 +WARNING: shuffle index length (56921) is not equal to sample index length (56922) + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + valid_1: + no. of documents:368 + > WARNING: could not find index map files, building the indices on rank 0 ... + > elasped time to build and save doc-idx mapping (seconds): 0.000132 + > elapsed time to build and save sample-idx mapping (seconds): 0.000094 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.000070 + > loading doc-idx mapping from data-fcm/botxt/botxt_text_document_valid_1_indexmap_571ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/botxt/botxt_text_document_valid_1_indexmap_571ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/botxt/botxt_text_document_valid_1_indexmap_571ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 644 + total number of epochs: 2 +WARNING: shuffle index length (642) is not equal to sample index length (643) + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... 
+ creating memory view of numpy buffer... + test_1: + no. of documents:368 + > WARNING: could not find index map files, building the indices on rank 0 ... + > elasped time to build and save doc-idx mapping (seconds): 0.000146 + > elapsed time to build and save sample-idx mapping (seconds): 0.000079 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.000063 + > loading doc-idx mapping from data-fcm/botxt/botxt_text_document_test_1_indexmap_2ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/botxt/botxt_text_document_test_1_indexmap_2ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/botxt/botxt_text_document_test_1_indexmap_2ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 322 + total number of epochs: 1 +WARNING: shuffle index length (320) is not equal to sample index length (321) + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + train_2: + no. of documents:83897 + > WARNING: could not find index map files, building the indices on rank 0 ... + > elasped time to build and save doc-idx mapping (seconds): 0.009058 + > elapsed time to build and save sample-idx mapping (seconds): 0.002007 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.007053 + > loading doc-idx mapping from data-fcm/cc/cc_text_document_train_2_indexmap_274136ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/cc/cc_text_document_train_2_indexmap_274136ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/cc/cc_text_document_train_2_indexmap_274136ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 330778 + total number of epochs: 4 +WARNING: shuffle index length (330776) is not equal to sample index length (330777) + reading sizes... + reading pointers... + reading document index... 
+ creating numpy buffer of mmap... + creating memory view of numpy buffer... + valid_2: + no. of documents:83897 + > WARNING: could not find index map files, building the indices on rank 0 ... + > elasped time to build and save doc-idx mapping (seconds): 0.002280 + > elapsed time to build and save sample-idx mapping (seconds): 0.000575 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.001760 + > loading doc-idx mapping from data-fcm/cc/cc_text_document_valid_2_indexmap_2750ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/cc/cc_text_document_valid_2_indexmap_2750ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/cc/cc_text_document_valid_2_indexmap_2750ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 82695 + total number of epochs: 1 +WARNING: shuffle index length (82693) is not equal to sample index length (82694) + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + test_2: + no. of documents:83897 + > WARNING: could not find index map files, building the indices on rank 0 ... + > elasped time to build and save doc-idx mapping (seconds): 0.002356 + > elapsed time to build and save sample-idx mapping (seconds): 0.000577 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.001755 + > loading doc-idx mapping from data-fcm/cc/cc_text_document_test_2_indexmap_9ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/cc/cc_text_document_test_2_indexmap_9ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/cc/cc_text_document_test_2_indexmap_9ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 82695 + total number of epochs: 1 +WARNING: shuffle index length (82693) is not equal to sample index length (82694) + reading sizes... + reading pointers... 
+ reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + train_3: + no. of documents:20589 + > WARNING: could not find index map files, building the indices on rank 0 ... + > elasped time to build and save doc-idx mapping (seconds): 0.005772 + > elapsed time to build and save sample-idx mapping (seconds): 0.001461 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.004219 + > loading doc-idx mapping from data-fcm/danavis/danavis_text_document_train_3_indexmap_187729ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/danavis/danavis_text_document_train_3_indexmap_187729ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/danavis/danavis_text_document_train_3_indexmap_187729ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 206227 + total number of epochs: 11 +WARNING: shuffle index length (206225) is not equal to sample index length (206226) + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + valid_3: + no. of documents:20589 + > WARNING: could not find index map files, building the indices on rank 0 ... 
+ > elasped time to build and save doc-idx mapping (seconds): 0.000652 + > elapsed time to build and save sample-idx mapping (seconds): 0.000213 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.000437 + > loading doc-idx mapping from data-fcm/danavis/danavis_text_document_valid_3_indexmap_1884ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/danavis/danavis_text_document_valid_3_indexmap_1884ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/danavis/danavis_text_document_valid_3_indexmap_1884ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 18748 + total number of epochs: 1 +WARNING: shuffle index length (18746) is not equal to sample index length (18747) + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + test_3: + no. of documents:20589 + > WARNING: could not find index map files, building the indices on rank 0 ... + > elasped time to build and save doc-idx mapping (seconds): 0.000618 + > elapsed time to build and save sample-idx mapping (seconds): 0.000206 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.000448 + > loading doc-idx mapping from data-fcm/danavis/danavis_text_document_test_3_indexmap_6ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/danavis/danavis_text_document_test_3_indexmap_6ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/danavis/danavis_text_document_test_3_indexmap_6ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 18748 + total number of epochs: 1 +WARNING: shuffle index length (18746) is not equal to sample index length (18747) + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + train_4: + no. 
of documents:49040 + > WARNING: could not find index map files, building the indices on rank 0 ... + > elasped time to build and save doc-idx mapping (seconds): 1.138281 + > elapsed time to build and save sample-idx mapping (seconds): 0.021119 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.004794 + > loading doc-idx mapping from data-fcm/dannet/dannet_text_document_train_4_indexmap_238972ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/dannet/dannet_text_document_train_4_indexmap_238972ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/dannet/dannet_text_document_train_4_indexmap_238972ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 239234 + total number of epochs: 474 +WARNING: shuffle index length (239232) is not equal to sample index length (239233) + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + valid_4: + no. of documents:49040 + > WARNING: could not find index map files, building the indices on rank 0 ... + > elasped time to build and save doc-idx mapping (seconds): 0.006273 + > elapsed time to build and save sample-idx mapping (seconds): 0.000356 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.000119 + > loading doc-idx mapping from data-fcm/dannet/dannet_text_document_valid_4_indexmap_2398ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/dannet/dannet_text_document_valid_4_indexmap_2398ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/dannet/dannet_text_document_valid_4_indexmap_2398ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 2524 + total number of epochs: 5 +WARNING: shuffle index length (2522) is not equal to sample index length (2523) + reading sizes... + reading pointers... + reading document index... 
+ creating numpy buffer of mmap... + creating memory view of numpy buffer... + test_4: + no. of documents:49040 + > WARNING: could not find index map files, building the indices on rank 0 ... + > elasped time to build and save doc-idx mapping (seconds): 0.001288 + > elapsed time to build and save sample-idx mapping (seconds): 0.000140 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.000073 + > loading doc-idx mapping from data-fcm/dannet/dannet_text_document_test_4_indexmap_8ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/dannet/dannet_text_document_test_4_indexmap_8ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/dannet/dannet_text_document_test_4_indexmap_8ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 505 + total number of epochs: 1 +WARNING: shuffle index length (503) is not equal to sample index length (504) + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + train_5: + no. of documents:536 + > WARNING: could not find index map files, building the indices on rank 0 ... 
+ > elasped time to build and save doc-idx mapping (seconds): 0.016869 + > elapsed time to build and save sample-idx mapping (seconds): 0.001085 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.001311 + > loading doc-idx mapping from data-fcm/depbank/depbank_text_document_train_5_indexmap_63668ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/depbank/depbank_text_document_train_5_indexmap_63668ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/depbank/depbank_text_document_train_5_indexmap_63668ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 63682 + total number of epochs: 1124 +WARNING: shuffle index length (63680) is not equal to sample index length (63681) + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + valid_5: + no. of documents:536 + > WARNING: could not find index map files, building the indices on rank 0 ... + > elasped time to build and save doc-idx mapping (seconds): 0.000282 + > elapsed time to build and save sample-idx mapping (seconds): 0.000094 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.000074 + > loading doc-idx mapping from data-fcm/depbank/depbank_text_document_valid_5_indexmap_639ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/depbank/depbank_text_document_valid_5_indexmap_639ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/depbank/depbank_text_document_valid_5_indexmap_639ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 680 + total number of epochs: 12 +WARNING: shuffle index length (678) is not equal to sample index length (679) + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + test_5: + no. 
of documents:536 + > WARNING: could not find index map files, building the indices on rank 0 ... + > elasped time to build and save doc-idx mapping (seconds): 0.000123 + > elapsed time to build and save sample-idx mapping (seconds): 0.000078 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.000055 + > loading doc-idx mapping from data-fcm/depbank/depbank_text_document_test_5_indexmap_2ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/depbank/depbank_text_document_test_5_indexmap_2ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/depbank/depbank_text_document_test_5_indexmap_2ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 57 + total number of epochs: 1 +WARNING: shuffle index length (55) is not equal to sample index length (56) + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + train_6: + no. of documents:9738 + > WARNING: could not find index map files, building the indices on rank 0 ... + > elasped time to build and save doc-idx mapping (seconds): 0.003996 + > elapsed time to build and save sample-idx mapping (seconds): 0.000910 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.003172 + > loading doc-idx mapping from data-fcm/elrc-emea/elrc-emea_text_document_train_6_indexmap_151040ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/elrc-emea/elrc-emea_text_document_train_6_indexmap_151040ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/elrc-emea/elrc-emea_text_document_train_6_indexmap_151040ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 155806 + total number of epochs: 16 +WARNING: shuffle index length (155804) is not equal to sample index length (155805) + reading sizes... + reading pointers... + reading document index... 
+ creating numpy buffer of mmap... + creating memory view of numpy buffer... + valid_6: + no. of documents:9738 + > WARNING: could not find index map files, building the indices on rank 0 ... + > elasped time to build and save doc-idx mapping (seconds): 0.000346 + > elapsed time to build and save sample-idx mapping (seconds): 0.000140 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.000263 + > loading doc-idx mapping from data-fcm/elrc-emea/elrc-emea_text_document_valid_6_indexmap_1516ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/elrc-emea/elrc-emea_text_document_valid_6_indexmap_1516ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/elrc-emea/elrc-emea_text_document_valid_6_indexmap_1516ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 9738 + total number of epochs: 1 +WARNING: shuffle index length (9736) is not equal to sample index length (9737) + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + test_6: + no. of documents:9738 + > WARNING: could not find index map files, building the indices on rank 0 ... 
+ > elasped time to build and save doc-idx mapping (seconds): 0.000358 + > elapsed time to build and save sample-idx mapping (seconds): 0.000131 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.000264 + > loading doc-idx mapping from data-fcm/elrc-emea/elrc-emea_text_document_test_6_indexmap_5ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/elrc-emea/elrc-emea_text_document_test_6_indexmap_5ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/elrc-emea/elrc-emea_text_document_test_6_indexmap_5ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 9738 + total number of epochs: 1 +WARNING: shuffle index length (9736) is not equal to sample index length (9737) + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + train_7: + no. of documents:31944 + > WARNING: could not find index map files, building the indices on rank 0 ... + > elasped time to build and save doc-idx mapping (seconds): 0.006364 + > elapsed time to build and save sample-idx mapping (seconds): 0.001629 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.004630 + > loading doc-idx mapping from data-fcm/ep/ep_text_document_train_7_indexmap_212559ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/ep/ep_text_document_train_7_indexmap_212559ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/ep/ep_text_document_train_7_indexmap_212559ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 233686 + total number of epochs: 8 +WARNING: shuffle index length (233684) is not equal to sample index length (233685) + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + valid_7: + no. 
of documents:31944 + > WARNING: could not find index map files, building the indices on rank 0 ... + > elasped time to build and save doc-idx mapping (seconds): 0.000878 + > elapsed time to build and save sample-idx mapping (seconds): 0.000283 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.000644 + > loading doc-idx mapping from data-fcm/ep/ep_text_document_valid_7_indexmap_2133ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/ep/ep_text_document_valid_7_indexmap_2133ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/ep/ep_text_document_valid_7_indexmap_2133ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 29211 + total number of epochs: 1 +WARNING: shuffle index length (29209) is not equal to sample index length (29210) + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + test_7: + no. of documents:31944 + > WARNING: could not find index map files, building the indices on rank 0 ... + > elasped time to build and save doc-idx mapping (seconds): 0.000859 + > elapsed time to build and save sample-idx mapping (seconds): 0.000283 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.000649 + > loading doc-idx mapping from data-fcm/ep/ep_text_document_test_7_indexmap_7ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/ep/ep_text_document_test_7_indexmap_7ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/ep/ep_text_document_test_7_indexmap_7ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 29211 + total number of epochs: 1 +WARNING: shuffle index length (29209) is not equal to sample index length (29210) + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... 
+ train_8: + no. of documents:119929 + > WARNING: could not find index map files, building the indices on rank 0 ... + > elasped time to build and save doc-idx mapping (seconds): 0.009630 + > elapsed time to build and save sample-idx mapping (seconds): 0.002189 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.007513 + > loading doc-idx mapping from data-fcm/eubookshop/eubookshop_text_document_train_8_indexmap_297555ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/eubookshop/eubookshop_text_document_train_8_indexmap_297555ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/eubookshop/eubookshop_text_document_train_8_indexmap_297555ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 356569 + total number of epochs: 3 +WARNING: shuffle index length (356567) is not equal to sample index length (356568) + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + valid_8: + no. of documents:119929 + > WARNING: could not find index map files, building the indices on rank 0 ... 
+ > elasped time to build and save doc-idx mapping (seconds): 0.003088 + > elapsed time to build and save sample-idx mapping (seconds): 0.000823 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.002414 + > loading doc-idx mapping from data-fcm/eubookshop/eubookshop_text_document_valid_8_indexmap_2985ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/eubookshop/eubookshop_text_document_valid_8_indexmap_2985ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/eubookshop/eubookshop_text_document_valid_8_indexmap_2985ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 118857 + total number of epochs: 1 +WARNING: shuffle index length (118855) is not equal to sample index length (118856) + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + test_8: + no. of documents:119929 + > WARNING: could not find index map files, building the indices on rank 0 ... + > elasped time to build and save doc-idx mapping (seconds): 0.003100 + > elapsed time to build and save sample-idx mapping (seconds): 0.000821 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.002371 + > loading doc-idx mapping from data-fcm/eubookshop/eubookshop_text_document_test_8_indexmap_10ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/eubookshop/eubookshop_text_document_test_8_indexmap_10ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/eubookshop/eubookshop_text_document_test_8_indexmap_10ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 118857 + total number of epochs: 1 +WARNING: shuffle index length (118855) is not equal to sample index length (118856) + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... 
+ creating memory view of numpy buffer... + train_9: + no. of documents:36548 + > WARNING: could not find index map files, building the indices on rank 0 ... + > elasped time to build and save doc-idx mapping (seconds): 0.006415 + > elapsed time to build and save sample-idx mapping (seconds): 0.001517 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.004882 + > loading doc-idx mapping from data-fcm/ft/ft_text_document_train_9_indexmap_220641ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/ft/ft_text_document_train_9_indexmap_220641ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/ft/ft_text_document_train_9_indexmap_220641ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 251230 + total number of epochs: 7 +WARNING: shuffle index length (251228) is not equal to sample index length (251229) + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + valid_9: + no. of documents:36548 + > WARNING: could not find index map files, building the indices on rank 0 ... + > elasped time to build and save doc-idx mapping (seconds): 0.000975 + > elapsed time to build and save sample-idx mapping (seconds): 0.000296 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.000780 + > loading doc-idx mapping from data-fcm/ft/ft_text_document_valid_9_indexmap_2214ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/ft/ft_text_document_valid_9_indexmap_2214ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/ft/ft_text_document_valid_9_indexmap_2214ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 35890 + total number of epochs: 1 +WARNING: shuffle index length (35888) is not equal to sample index length (35889) + reading sizes... + reading pointers... + reading document index... 
+ creating numpy buffer of mmap... + creating memory view of numpy buffer... + test_9: + no. of documents:36548 + > WARNING: could not find index map files, building the indices on rank 0 ... + > elasped time to build and save doc-idx mapping (seconds): 0.000977 + > elapsed time to build and save sample-idx mapping (seconds): 0.000287 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.000781 + > loading doc-idx mapping from data-fcm/ft/ft_text_document_test_9_indexmap_7ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/ft/ft_text_document_test_9_indexmap_7ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/ft/ft_text_document_test_9_indexmap_7ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 35890 + total number of epochs: 1 +WARNING: shuffle index length (35888) is not equal to sample index length (35889) + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + train_10: + no. of documents:2205 + > WARNING: could not find index map files, building the indices on rank 0 ... 
+ > elasped time to build and save doc-idx mapping (seconds): 0.002587 + > elapsed time to build and save sample-idx mapping (seconds): 0.000624 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.002059 + > loading doc-idx mapping from data-fcm/gutenberg/gutenberg_text_document_train_10_indexmap_97212ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/gutenberg/gutenberg_text_document_train_10_indexmap_97212ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/gutenberg/gutenberg_text_document_train_10_indexmap_97212ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 97706 + total number of epochs: 45 +WARNING: shuffle index length (97704) is not equal to sample index length (97705) + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + valid_10: + no. of documents:2205 + > WARNING: could not find index map files, building the indices on rank 0 ... + > elasped time to build and save doc-idx mapping (seconds): 0.000156 + > elapsed time to build and save sample-idx mapping (seconds): 0.000085 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.000102 + > loading doc-idx mapping from data-fcm/gutenberg/gutenberg_text_document_valid_10_indexmap_976ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/gutenberg/gutenberg_text_document_valid_10_indexmap_976ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/gutenberg/gutenberg_text_document_valid_10_indexmap_976ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 2172 + total number of epochs: 1 +WARNING: shuffle index length (2170) is not equal to sample index length (2171) + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... 
+ test_10: + no. of documents:2205 + > WARNING: could not find index map files, building the indices on rank 0 ... + > elasped time to build and save doc-idx mapping (seconds): 0.000160 + > elapsed time to build and save sample-idx mapping (seconds): 0.000088 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.000099 + > loading doc-idx mapping from data-fcm/gutenberg/gutenberg_text_document_test_10_indexmap_4ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/gutenberg/gutenberg_text_document_test_10_indexmap_4ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/gutenberg/gutenberg_text_document_test_10_indexmap_4ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 2172 + total number of epochs: 1 +WARNING: shuffle index length (2170) is not equal to sample index length (2171) + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + train_11: + no. of documents:142377 + > WARNING: could not find index map files, building the indices on rank 0 ... + > elasped time to build and save doc-idx mapping (seconds): 0.011301 + > elapsed time to build and save sample-idx mapping (seconds): 0.002746 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.008439 + > loading doc-idx mapping from data-fcm/hest/hest_text_document_train_11_indexmap_308288ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/hest/hest_text_document_train_11_indexmap_308288ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/hest/hest_text_document_train_11_indexmap_308288ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 402221 + total number of epochs: 3 +WARNING: shuffle index length (402219) is not equal to sample index length (402220) + reading sizes... + reading pointers... 
+ reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + valid_11: + no. of documents:142377 + > WARNING: could not find index map files, building the indices on rank 0 ... + > elasped time to build and save doc-idx mapping (seconds): 0.003696 + > elapsed time to build and save sample-idx mapping (seconds): 0.001015 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.002676 + > loading doc-idx mapping from data-fcm/hest/hest_text_document_valid_11_indexmap_3093ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/hest/hest_text_document_valid_11_indexmap_3093ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/hest/hest_text_document_valid_11_indexmap_3093ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 134074 + total number of epochs: 1 +WARNING: shuffle index length (134072) is not equal to sample index length (134073) + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + test_11: + no. of documents:142377 + > WARNING: could not find index map files, building the indices on rank 0 ... 
+ > elasped time to build and save doc-idx mapping (seconds): 0.003754 + > elapsed time to build and save sample-idx mapping (seconds): 0.000971 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.002649 + > loading doc-idx mapping from data-fcm/hest/hest_text_document_test_11_indexmap_10ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/hest/hest_text_document_test_11_indexmap_10ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/hest/hest_text_document_test_11_indexmap_10ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 134074 + total number of epochs: 1 +WARNING: shuffle index length (134072) is not equal to sample index length (134073) + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + train_12: + no. of documents:1190 + > WARNING: could not find index map files, building the indices on rank 0 ... + > elasped time to build and save doc-idx mapping (seconds): 0.002114 + > elapsed time to build and save sample-idx mapping (seconds): 0.000537 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.001712 + > loading doc-idx mapping from data-fcm/jvj/jvj_text_document_train_12_indexmap_80844ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/jvj/jvj_text_document_train_12_indexmap_80844ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/jvj/jvj_text_document_train_12_indexmap_80844ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 81783 + total number of epochs: 70 +WARNING: shuffle index length (81781) is not equal to sample index length (81782) + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + valid_12: + no. 
of documents:1190 + > WARNING: could not find index map files, building the indices on rank 0 ... + > elasped time to build and save doc-idx mapping (seconds): 0.000140 + > elapsed time to build and save sample-idx mapping (seconds): 0.000100 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.000095 + > loading doc-idx mapping from data-fcm/jvj/jvj_text_document_valid_12_indexmap_811ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/jvj/jvj_text_document_valid_12_indexmap_811ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/jvj/jvj_text_document_valid_12_indexmap_811ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 1169 + total number of epochs: 1 +WARNING: shuffle index length (1167) is not equal to sample index length (1168) + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + test_12: + no. of documents:1190 + > WARNING: could not find index map files, building the indices on rank 0 ... + > elasped time to build and save doc-idx mapping (seconds): 0.000133 + > elapsed time to build and save sample-idx mapping (seconds): 0.000087 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.000080 + > loading doc-idx mapping from data-fcm/jvj/jvj_text_document_test_12_indexmap_3ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/jvj/jvj_text_document_test_12_indexmap_3ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/jvj/jvj_text_document_test_12_indexmap_3ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 1169 + total number of epochs: 1 +WARNING: shuffle index length (1167) is not equal to sample index length (1168) + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... 
+ creating memory view of numpy buffer... + train_13: + no. of documents:50089 + > WARNING: could not find index map files, building the indices on rank 0 ... + > elasped time to build and save doc-idx mapping (seconds): 0.010590 + > elapsed time to build and save sample-idx mapping (seconds): 0.003436 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.004810 + > loading doc-idx mapping from data-fcm/kb/kb_text_document_train_13_indexmap_240324ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/kb/kb_text_document_train_13_indexmap_240324ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/kb/kb_text_document_train_13_indexmap_240324ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 243194 + total number of epochs: 8 +WARNING: shuffle index length (243192) is not equal to sample index length (243193) + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + valid_13: + no. of documents:50089 + > WARNING: could not find index map files, building the indices on rank 0 ... + > elasped time to build and save doc-idx mapping (seconds): 0.001332 + > elapsed time to build and save sample-idx mapping (seconds): 0.000508 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.000669 + > loading doc-idx mapping from data-fcm/kb/kb_text_document_valid_13_indexmap_2411ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/kb/kb_text_document_valid_13_indexmap_2411ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/kb/kb_text_document_valid_13_indexmap_2411ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 30400 + total number of epochs: 1 +WARNING: shuffle index length (30398) is not equal to sample index length (30399) + reading sizes... + reading pointers... 
+ reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + test_13: + no. of documents:50089 + > WARNING: could not find index map files, building the indices on rank 0 ... + > elasped time to build and save doc-idx mapping (seconds): 0.001337 + > elapsed time to build and save sample-idx mapping (seconds): 0.000510 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.000653 + > loading doc-idx mapping from data-fcm/kb/kb_text_document_test_13_indexmap_8ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/kb/kb_text_document_test_13_indexmap_8ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/kb/kb_text_document_test_13_indexmap_8ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 30400 + total number of epochs: 1 +WARNING: shuffle index length (30398) is not equal to sample index length (30399) + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + train_14: + no. of documents:13106 + > WARNING: could not find index map files, building the indices on rank 0 ... 
+ > elasped time to build and save doc-idx mapping (seconds): 0.004550 + > elapsed time to build and save sample-idx mapping (seconds): 0.000996 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.003561 + > loading doc-idx mapping from data-fcm/korpus2000/korpus2000_text_document_train_14_indexmap_164752ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/korpus2000/korpus2000_text_document_train_14_indexmap_164752ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/korpus2000/korpus2000_text_document_train_14_indexmap_164752ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 170360 + total number of epochs: 13 +WARNING: shuffle index length (170358) is not equal to sample index length (170359) + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + valid_14: + no. of documents:13106 + > WARNING: could not find index map files, building the indices on rank 0 ... + > elasped time to build and save doc-idx mapping (seconds): 0.000419 + > elapsed time to build and save sample-idx mapping (seconds): 0.000155 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.000327 + > loading doc-idx mapping from data-fcm/korpus2000/korpus2000_text_document_valid_14_indexmap_1653ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/korpus2000/korpus2000_text_document_valid_14_indexmap_1653ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/korpus2000/korpus2000_text_document_valid_14_indexmap_1653ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 13105 + total number of epochs: 1 +WARNING: shuffle index length (13103) is not equal to sample index length (13104) + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... 
+ creating memory view of numpy buffer... + test_14: + no. of documents:13106 + > WARNING: could not find index map files, building the indices on rank 0 ... + > elasped time to build and save doc-idx mapping (seconds): 0.000422 + > elapsed time to build and save sample-idx mapping (seconds): 0.000149 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.000339 + > loading doc-idx mapping from data-fcm/korpus2000/korpus2000_text_document_test_14_indexmap_6ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/korpus2000/korpus2000_text_document_test_14_indexmap_6ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/korpus2000/korpus2000_text_document_test_14_indexmap_6ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 13105 + total number of epochs: 1 +WARNING: shuffle index length (13103) is not equal to sample index length (13104) + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + train_15: + no. of documents:160 + > WARNING: could not find index map files, building the indices on rank 0 ... + > elasped time to build and save doc-idx mapping (seconds): 0.001975 + > elapsed time to build and save sample-idx mapping (seconds): 0.000688 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.000975 + > loading doc-idx mapping from data-fcm/naat/naat_text_document_train_15_indexmap_44312ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/naat/naat_text_document_train_15_indexmap_44312ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/naat/naat_text_document_train_15_indexmap_44312ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 44353 + total number of epochs: 486 +WARNING: shuffle index length (44351) is not equal to sample index length (44352) + reading sizes... 
+ reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + valid_15: + no. of documents:160 + > WARNING: could not find index map files, building the indices on rank 0 ... + > elasped time to build and save doc-idx mapping (seconds): 0.000126 + > elapsed time to build and save sample-idx mapping (seconds): 0.000077 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.000064 + > loading doc-idx mapping from data-fcm/naat/naat_text_document_valid_15_indexmap_445ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/naat/naat_text_document_valid_15_indexmap_445ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/naat/naat_text_document_valid_15_indexmap_445ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 457 + total number of epochs: 5 +WARNING: shuffle index length (455) is not equal to sample index length (456) + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + test_15: + no. of documents:160 + > WARNING: could not find index map files, building the indices on rank 0 ... 
+ > elasped time to build and save doc-idx mapping (seconds): 0.000108 + > elapsed time to build and save sample-idx mapping (seconds): 0.000079 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.000054 + > loading doc-idx mapping from data-fcm/naat/naat_text_document_test_15_indexmap_2ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/naat/naat_text_document_test_15_indexmap_2ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/naat/naat_text_document_test_15_indexmap_2ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 92 + total number of epochs: 1 +WARNING: shuffle index length (90) is not equal to sample index length (91) + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + train_16: + no. of documents:122805 + > WARNING: could not find index map files, building the indices on rank 0 ... + > elasped time to build and save doc-idx mapping (seconds): 0.009905 + > elapsed time to build and save sample-idx mapping (seconds): 0.002648 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.006683 + > loading doc-idx mapping from data-fcm/opensub/opensub_text_document_train_16_indexmap_299067ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/opensub/opensub_text_document_train_16_indexmap_299067ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/opensub/opensub_text_document_train_16_indexmap_299067ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 319034 + total number of epochs: 3 +WARNING: shuffle index length (319032) is not equal to sample index length (319033) + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + valid_16: + no. 
of documents:122805 + > WARNING: could not find index map files, building the indices on rank 0 ... + > elasped time to build and save doc-idx mapping (seconds): 0.003158 + > elapsed time to build and save sample-idx mapping (seconds): 0.000970 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.002207 + > loading doc-idx mapping from data-fcm/opensub/opensub_text_document_valid_16_indexmap_3001ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/opensub/opensub_text_document_valid_16_indexmap_3001ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/opensub/opensub_text_document_valid_16_indexmap_3001ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 106345 + total number of epochs: 1 +WARNING: shuffle index length (106343) is not equal to sample index length (106344) + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + test_16: + no. of documents:122805 + > WARNING: could not find index map files, building the indices on rank 0 ... + > elasped time to build and save doc-idx mapping (seconds): 0.003124 + > elapsed time to build and save sample-idx mapping (seconds): 0.000943 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.002241 + > loading doc-idx mapping from data-fcm/opensub/opensub_text_document_test_16_indexmap_10ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/opensub/opensub_text_document_test_16_indexmap_10ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/opensub/opensub_text_document_test_16_indexmap_10ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 106345 + total number of epochs: 1 +WARNING: shuffle index length (106343) is not equal to sample index length (106344) + reading sizes... + reading pointers... + reading document index... 
+ creating numpy buffer of mmap... + creating memory view of numpy buffer... + train_17: + no. of documents:462 + > WARNING: could not find index map files, building the indices on rank 0 ... + > elasped time to build and save doc-idx mapping (seconds): 0.001564 + > elapsed time to build and save sample-idx mapping (seconds): 0.000476 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.001249 + > loading doc-idx mapping from data-fcm/relig/relig_text_document_train_17_indexmap_60896ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/relig/relig_text_document_train_17_indexmap_60896ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/relig/relig_text_document_train_17_indexmap_60896ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 61014 + total number of epochs: 143 +WARNING: shuffle index length (61012) is not equal to sample index length (61013) + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + valid_17: + no. of documents:462 + > WARNING: could not find index map files, building the indices on rank 0 ... 
+ > elasped time to build and save doc-idx mapping (seconds): 0.000127 + > elapsed time to build and save sample-idx mapping (seconds): 0.000093 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.000071 + > loading doc-idx mapping from data-fcm/relig/relig_text_document_valid_17_indexmap_611ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/relig/relig_text_document_valid_17_indexmap_611ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/relig/relig_text_document_valid_17_indexmap_611ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 854 + total number of epochs: 2 +WARNING: shuffle index length (852) is not equal to sample index length (853) + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + test_17: + no. of documents:462 + > WARNING: could not find index map files, building the indices on rank 0 ... + > elasped time to build and save doc-idx mapping (seconds): 0.000113 + > elapsed time to build and save sample-idx mapping (seconds): 0.000076 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.000061 + > loading doc-idx mapping from data-fcm/relig/relig_text_document_test_17_indexmap_2ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/relig/relig_text_document_test_17_indexmap_2ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/relig/relig_text_document_test_17_indexmap_2ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.030 seconds + total number of samples: 427 + total number of epochs: 1 +WARNING: shuffle index length (425) is not equal to sample index length (426) + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + train_18: + no. 
of documents:184580 + > WARNING: could not find index map files, building the indices on rank 0 ... + > elasped time to build and save doc-idx mapping (seconds): 0.014131 + > elapsed time to build and save sample-idx mapping (seconds): 0.004426 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.009193 + > loading doc-idx mapping from data-fcm/retsinformationdk/retsinformationdk_text_document_train_18_indexmap_323127ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/retsinformationdk/retsinformationdk_text_document_train_18_indexmap_323127ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/retsinformationdk/retsinformationdk_text_document_train_18_indexmap_323127ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 439088 + total number of epochs: 3 +WARNING: shuffle index length (439086) is not equal to sample index length (439087) + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + valid_18: + no. of documents:184580 + > WARNING: could not find index map files, building the indices on rank 0 ... 
+ > elasped time to build and save doc-idx mapping (seconds): 0.004959 + > elapsed time to build and save sample-idx mapping (seconds): 0.001535 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.002960 + > loading doc-idx mapping from data-fcm/retsinformationdk/retsinformationdk_text_document_valid_18_indexmap_3242ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/retsinformationdk/retsinformationdk_text_document_valid_18_indexmap_3242ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/retsinformationdk/retsinformationdk_text_document_valid_18_indexmap_3242ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 146363 + total number of epochs: 1 +WARNING: shuffle index length (146361) is not equal to sample index length (146362) + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + test_18: + no. of documents:184580 + > WARNING: could not find index map files, building the indices on rank 0 ... + > elasped time to build and save doc-idx mapping (seconds): 0.004951 + > elapsed time to build and save sample-idx mapping (seconds): 0.001531 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.002995 + > loading doc-idx mapping from data-fcm/retsinformationdk/retsinformationdk_text_document_test_18_indexmap_11ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/retsinformationdk/retsinformationdk_text_document_test_18_indexmap_11ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/retsinformationdk/retsinformationdk_text_document_test_18_indexmap_11ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 146363 + total number of epochs: 1 +WARNING: shuffle index length (146361) is not equal to sample index length (146362) + reading sizes... + reading pointers... 
+ reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + train_19: + no. of documents:19128 + > WARNING: could not find index map files, building the indices on rank 0 ... + > elasped time to build and save doc-idx mapping (seconds): 0.005524 + > elapsed time to build and save sample-idx mapping (seconds): 0.001399 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.003805 + > loading doc-idx mapping from data-fcm/retspraksis/retspraksis_text_document_train_19_indexmap_183807ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/retspraksis/retspraksis_text_document_train_19_indexmap_183807ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/retspraksis/retspraksis_text_document_train_19_indexmap_183807ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 185086 + total number of epochs: 11 +WARNING: shuffle index length (185084) is not equal to sample index length (185085) + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + valid_19: + no. of documents:19128 + > WARNING: could not find index map files, building the indices on rank 0 ... 
+ > elasped time to build and save doc-idx mapping (seconds): 0.000600 + > elapsed time to build and save sample-idx mapping (seconds): 0.000209 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.000394 + > loading doc-idx mapping from data-fcm/retspraksis/retspraksis_text_document_valid_19_indexmap_1844ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/retspraksis/retspraksis_text_document_valid_19_indexmap_1844ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/retspraksis/retspraksis_text_document_valid_19_indexmap_1844ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 16826 + total number of epochs: 1 +WARNING: shuffle index length (16824) is not equal to sample index length (16825) + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + test_19: + no. of documents:19128 + > WARNING: could not find index map files, building the indices on rank 0 ... + > elasped time to build and save doc-idx mapping (seconds): 0.000570 + > elapsed time to build and save sample-idx mapping (seconds): 0.000201 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.000388 + > loading doc-idx mapping from data-fcm/retspraksis/retspraksis_text_document_test_19_indexmap_6ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/retspraksis/retspraksis_text_document_test_19_indexmap_6ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/retspraksis/retspraksis_text_document_test_19_indexmap_6ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 16826 + total number of epochs: 1 +WARNING: shuffle index length (16824) is not equal to sample index length (16825) + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... 
+ creating memory view of numpy buffer... + train_20: + no. of documents:42349 + > WARNING: could not find index map files, building the indices on rank 0 ... + > elasped time to build and save doc-idx mapping (seconds): 0.007735 + > elapsed time to build and save sample-idx mapping (seconds): 0.002097 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.004830 + > loading doc-idx mapping from data-fcm/skat/skat_text_document_train_20_indexmap_229716ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/skat/skat_text_document_train_20_indexmap_229716ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/skat/skat_text_document_train_20_indexmap_229716ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 241690 + total number of epochs: 7 +WARNING: shuffle index length (241688) is not equal to sample index length (241689) + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + valid_20: + no. of documents:42349 + > WARNING: could not find index map files, building the indices on rank 0 ... + > elasped time to build and save doc-idx mapping (seconds): 0.001160 + > elapsed time to build and save sample-idx mapping (seconds): 0.000381 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.000742 + > loading doc-idx mapping from data-fcm/skat/skat_text_document_valid_20_indexmap_2305ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/skat/skat_text_document_valid_20_indexmap_2305ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/skat/skat_text_document_valid_20_indexmap_2305ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 34528 + total number of epochs: 1 +WARNING: shuffle index length (34526) is not equal to sample index length (34527) + reading sizes... 
+ reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + test_20: + no. of documents:42349 + > WARNING: could not find index map files, building the indices on rank 0 ... + > elasped time to build and save doc-idx mapping (seconds): 0.001156 + > elapsed time to build and save sample-idx mapping (seconds): 0.000375 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.000752 + > loading doc-idx mapping from data-fcm/skat/skat_text_document_test_20_indexmap_8ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/skat/skat_text_document_test_20_indexmap_8ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/skat/skat_text_document_test_20_indexmap_8ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 34528 + total number of epochs: 1 +WARNING: shuffle index length (34526) is not equal to sample index length (34527) + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + train_21: + no. of documents:579 + > WARNING: could not find index map files, building the indices on rank 0 ... 
+ > elasped time to build and save doc-idx mapping (seconds): 0.002893 + > elapsed time to build and save sample-idx mapping (seconds): 0.001014 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.001306 + > loading doc-idx mapping from data-fcm/spont/spont_text_document_train_21_indexmap_65157ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/spont/spont_text_document_train_21_indexmap_65157ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/spont/spont_text_document_train_21_indexmap_65157ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 65416 + total number of epochs: 202 +WARNING: shuffle index length (65414) is not equal to sample index length (65415) + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + valid_21: + no. of documents:579 + > WARNING: could not find index map files, building the indices on rank 0 ... + > elasped time to build and save doc-idx mapping (seconds): 0.000152 + > elapsed time to build and save sample-idx mapping (seconds): 0.000088 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.000074 + > loading doc-idx mapping from data-fcm/spont/spont_text_document_valid_21_indexmap_654ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/spont/spont_text_document_valid_21_indexmap_654ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/spont/spont_text_document_valid_21_indexmap_654ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 972 + total number of epochs: 3 +WARNING: shuffle index length (970) is not equal to sample index length (971) + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + test_21: + no. 
of documents:579 + > WARNING: could not find index map files, building the indices on rank 0 ... + > elasped time to build and save doc-idx mapping (seconds): 0.000115 + > elapsed time to build and save sample-idx mapping (seconds): 0.000077 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.000061 + > loading doc-idx mapping from data-fcm/spont/spont_text_document_test_21_indexmap_3ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/spont/spont_text_document_test_21_indexmap_3ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/spont/spont_text_document_test_21_indexmap_3ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 324 + total number of epochs: 1 +WARNING: shuffle index length (322) is not equal to sample index length (323) + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + train_22: + no. of documents:178 + > WARNING: could not find index map files, building the indices on rank 0 ... + > elasped time to build and save doc-idx mapping (seconds): 0.011418 + > elapsed time to build and save sample-idx mapping (seconds): 0.001091 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.001018 + > loading doc-idx mapping from data-fcm/synne/synne_text_document_train_22_indexmap_45751ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/synne/synne_text_document_train_22_indexmap_45751ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/synne/synne_text_document_train_22_indexmap_45751ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 45757 + total number of epochs: 2457 +WARNING: shuffle index length (45755) is not equal to sample index length (45756) + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... 
+ creating memory view of numpy buffer... + valid_22: + no. of documents:178 + > WARNING: could not find index map files, building the indices on rank 0 ... + > elasped time to build and save doc-idx mapping (seconds): 0.000223 + > elapsed time to build and save sample-idx mapping (seconds): 0.000096 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.000069 + > loading doc-idx mapping from data-fcm/synne/synne_text_document_valid_22_indexmap_459ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/synne/synne_text_document_valid_22_indexmap_459ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/synne/synne_text_document_valid_22_indexmap_459ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 466 + total number of epochs: 25 +WARNING: shuffle index length (464) is not equal to sample index length (465) + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + test_22: + no. of documents:178 + > WARNING: could not find index map files, building the indices on rank 0 ... + > elasped time to build and save doc-idx mapping (seconds): 0.000112 + > elapsed time to build and save sample-idx mapping (seconds): 0.000075 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.000054 + > loading doc-idx mapping from data-fcm/synne/synne_text_document_test_22_indexmap_2ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/synne/synne_text_document_test_22_indexmap_2ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/synne/synne_text_document_test_22_indexmap_2ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 19 + total number of epochs: 1 +WARNING: shuffle index length (17) is not equal to sample index length (18) + reading sizes... + reading pointers... + reading document index... 
+ creating numpy buffer of mmap... + creating memory view of numpy buffer... + train_23: + no. of documents:193509 + > WARNING: could not find index map files, building the indices on rank 0 ... + > elasped time to build and save doc-idx mapping (seconds): 0.010242 + > elapsed time to build and save sample-idx mapping (seconds): 0.002901 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.007478 + > loading doc-idx mapping from data-fcm/tidsskrift/tidsskrift_text_document_train_23_indexmap_325565ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/tidsskrift/tidsskrift_text_document_train_23_indexmap_325565ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/tidsskrift/tidsskrift_text_document_train_23_indexmap_325565ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 353489 + total number of epochs: 2 +WARNING: shuffle index length (353487) is not equal to sample index length (353488) + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + valid_23: + no. of documents:193509 + > WARNING: could not find index map files, building the indices on rank 0 ... 
+ > elasped time to build and save doc-idx mapping (seconds): 0.005174 + > elapsed time to build and save sample-idx mapping (seconds): 0.001490 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.003705 + > loading doc-idx mapping from data-fcm/tidsskrift/tidsskrift_text_document_valid_23_indexmap_3266ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/tidsskrift/tidsskrift_text_document_valid_23_indexmap_3266ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/tidsskrift/tidsskrift_text_document_valid_23_indexmap_3266ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 176745 + total number of epochs: 1 +WARNING: shuffle index length (176743) is not equal to sample index length (176744) + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + test_23: + no. of documents:193509 + > WARNING: could not find index map files, building the indices on rank 0 ... + > elasped time to build and save doc-idx mapping (seconds): 0.005237 + > elapsed time to build and save sample-idx mapping (seconds): 0.001507 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.003703 + > loading doc-idx mapping from data-fcm/tidsskrift/tidsskrift_text_document_test_23_indexmap_11ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/tidsskrift/tidsskrift_text_document_test_23_indexmap_11ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/tidsskrift/tidsskrift_text_document_test_23_indexmap_11ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 176745 + total number of epochs: 1 +WARNING: shuffle index length (176743) is not equal to sample index length (176744) + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... 
+ creating memory view of numpy buffer... + train_24: + no. of documents:49149 + > WARNING: could not find index map files, building the indices on rank 0 ... + > elasped time to build and save doc-idx mapping (seconds): 0.051964 + > elapsed time to build and save sample-idx mapping (seconds): 0.005176 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.004910 + > loading doc-idx mapping from data-fcm/tv2r/tv2r_text_document_train_24_indexmap_239113ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/tv2r/tv2r_text_document_train_24_indexmap_239113ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/tv2r/tv2r_text_document_train_24_indexmap_239113ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 244452 + total number of epochs: 36 +WARNING: shuffle index length (244450) is not equal to sample index length (244451) + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + valid_24: + no. of documents:49149 + > WARNING: could not find index map files, building the indices on rank 0 ... + > elasped time to build and save doc-idx mapping (seconds): 0.001363 + > elapsed time to build and save sample-idx mapping (seconds): 0.000247 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.000205 + > loading doc-idx mapping from data-fcm/tv2r/tv2r_text_document_valid_24_indexmap_2399ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/tv2r/tv2r_text_document_valid_24_indexmap_2399ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/tv2r/tv2r_text_document_valid_24_indexmap_2399ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 6791 + total number of epochs: 1 +WARNING: shuffle index length (6789) is not equal to sample index length (6790) + reading sizes... + reading pointers... 
+ reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + test_24: + no. of documents:49149 + > WARNING: could not find index map files, building the indices on rank 0 ... + > elasped time to build and save doc-idx mapping (seconds): 0.001289 + > elapsed time to build and save sample-idx mapping (seconds): 0.000236 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.000199 + > loading doc-idx mapping from data-fcm/tv2r/tv2r_text_document_test_24_indexmap_8ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/tv2r/tv2r_text_document_test_24_indexmap_8ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/tv2r/tv2r_text_document_test_24_indexmap_8ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 6791 + total number of epochs: 1 +WARNING: shuffle index length (6789) is not equal to sample index length (6790) + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + train_25: + no. of documents:331407 + > WARNING: could not find index map files, building the indices on rank 0 ... 
+ > elasped time to build and save doc-idx mapping (seconds): 0.083631 + > elapsed time to build and save sample-idx mapping (seconds): 0.010468 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.008210 + > loading doc-idx mapping from data-fcm/wiki/wiki_text_document_train_25_indexmap_343139ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/wiki/wiki_text_document_train_25_indexmap_343139ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/wiki/wiki_text_document_train_25_indexmap_343139ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 391190 + total number of epochs: 8 +WARNING: shuffle index length (391188) is not equal to sample index length (391189) + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + valid_25: + no. of documents:331407 + > WARNING: could not find index map files, building the indices on rank 0 ... + > elasped time to build and save doc-idx mapping (seconds): 0.008747 + > elapsed time to build and save sample-idx mapping (seconds): 0.001491 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.001085 + > loading doc-idx mapping from data-fcm/wiki/wiki_text_document_valid_25_indexmap_3443ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/wiki/wiki_text_document_valid_25_indexmap_3443ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/wiki/wiki_text_document_valid_25_indexmap_3443ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 48899 + total number of epochs: 1 +WARNING: shuffle index length (48897) is not equal to sample index length (48898) + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + test_25: + no. 
of documents:331407 + > WARNING: could not find index map files, building the indices on rank 0 ... + > elasped time to build and save doc-idx mapping (seconds): 0.008838 + > elapsed time to build and save sample-idx mapping (seconds): 0.001537 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.001080 + > loading doc-idx mapping from data-fcm/wiki/wiki_text_document_test_25_indexmap_11ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/wiki/wiki_text_document_test_25_indexmap_11ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/wiki/wiki_text_document_test_25_indexmap_11ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 48899 + total number of epochs: 1 +WARNING: shuffle index length (48897) is not equal to sample index length (48898) + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + train_26: + no. of documents:3185 + > WARNING: could not find index map files, building the indices on rank 0 ... + > elasped time to build and save doc-idx mapping (seconds): 0.004419 + > elapsed time to build and save sample-idx mapping (seconds): 0.001317 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.002282 + > loading doc-idx mapping from data-fcm/wikibooks/wikibooks_text_document_train_26_indexmap_108481ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/wikibooks/wikibooks_text_document_train_26_indexmap_108481ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/wikibooks/wikibooks_text_document_train_26_indexmap_108481ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 110484 + total number of epochs: 52 +WARNING: shuffle index length (110482) is not equal to sample index length (110483) + reading sizes... + reading pointers... + reading document index... 
+ creating numpy buffer of mmap... + creating memory view of numpy buffer... + valid_26: + no. of documents:3185 + > WARNING: could not find index map files, building the indices on rank 0 ... + > elasped time to build and save doc-idx mapping (seconds): 0.000200 + > elapsed time to build and save sample-idx mapping (seconds): 0.000112 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.000104 + > loading doc-idx mapping from data-fcm/wikibooks/wikibooks_text_document_valid_26_indexmap_1089ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/wikibooks/wikibooks_text_document_valid_26_indexmap_1089ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/wikibooks/wikibooks_text_document_valid_26_indexmap_1089ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 2125 + total number of epochs: 1 +WARNING: shuffle index length (2123) is not equal to sample index length (2124) + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + test_26: + no. of documents:3185 + > WARNING: could not find index map files, building the indices on rank 0 ... 
+ > elasped time to build and save doc-idx mapping (seconds): 0.000185 + > elapsed time to build and save sample-idx mapping (seconds): 0.000111 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.000101 + > loading doc-idx mapping from data-fcm/wikibooks/wikibooks_text_document_test_26_indexmap_4ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/wikibooks/wikibooks_text_document_test_26_indexmap_4ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/wikibooks/wikibooks_text_document_test_26_indexmap_4ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 2125 + total number of epochs: 1 +WARNING: shuffle index length (2123) is not equal to sample index length (2124) + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + train_27: + no. of documents:2285 + > WARNING: could not find index map files, building the indices on rank 0 ... + > elasped time to build and save doc-idx mapping (seconds): 0.003298 + > elapsed time to build and save sample-idx mapping (seconds): 0.000980 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.002065 + > loading doc-idx mapping from data-fcm/wikisource/wikisource_text_document_train_27_indexmap_98252ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/wikisource/wikisource_text_document_train_27_indexmap_98252ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/wikisource/wikisource_text_document_train_27_indexmap_98252ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 98345 + total number of epochs: 58 +WARNING: shuffle index length (98343) is not equal to sample index length (98344) + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... 
+ valid_27: + no. of documents:2285 + > WARNING: could not find index map files, building the indices on rank 0 ... + > elasped time to build and save doc-idx mapping (seconds): 0.000170 + > elapsed time to build and save sample-idx mapping (seconds): 0.000102 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.000091 + > loading doc-idx mapping from data-fcm/wikisource/wikisource_text_document_valid_27_indexmap_986ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/wikisource/wikisource_text_document_valid_27_indexmap_986ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/wikisource/wikisource_text_document_valid_27_indexmap_986ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 1696 + total number of epochs: 1 +WARNING: shuffle index length (1694) is not equal to sample index length (1695) + reading sizes... + reading pointers... + reading document index... + creating numpy buffer of mmap... + creating memory view of numpy buffer... + test_27: + no. of documents:2285 + > WARNING: could not find index map files, building the indices on rank 0 ... 
+ > elasped time to build and save doc-idx mapping (seconds): 0.000159 + > elapsed time to build and save sample-idx mapping (seconds): 0.000096 + > elapsed time to build and save shuffle-idx mapping (seconds): 0.000093 + > loading doc-idx mapping from data-fcm/wikisource/wikisource_text_document_test_27_indexmap_4ns_2048sl_1234s_doc_idx.npy + > loading sample-idx mapping from data-fcm/wikisource/wikisource_text_document_test_27_indexmap_4ns_2048sl_1234s_sample_idx.npy + > loading shuffle-idx mapping from data-fcm/wikisource/wikisource_text_document_test_27_indexmap_4ns_2048sl_1234s_shuffle_idx.npy + loaded indexed file in 0.001 seconds + total number of samples: 1696 + total number of epochs: 1 +WARNING: shuffle index length (1694) is not equal to sample index length (1695) +> RANK 0 elapsed time for building blendable dataset indices: 0.24 (sec) +> RANK 0 elapsed time for building blendable dataset indices: 0.04 (sec) +> RANK 0 elapsed time for building blendable dataset indices: 0.04 (sec) +setting training data start iteration to 0 +setting validation data start iteration to 0 +done with setups ... +time (ms) | model and optimizer: 32404.85 | train/valid/test data iterators: 2661.50 +training ... 
+ samples/sec: 6.551 | iteration 100/ 320000 | elapsed time per iteration (ms): 2442.3 | learning rate: 9.375E-06 | approx flops per GPU: 40.7TFLOPS | lm_loss: 9.544133E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +after 100 iterations memory (MB) | allocated: 3902.71630859375 | max allocated: 14147.748046875 | reserved: 17338.0 | max reserved: 17338.0 +time (ms) | forward: 580.57 | backward: 1805.47 | backward-backward: 1805.44 | backward-allreduce: 0.00 | optimizer: 55.79 | batch generator: 0.91 + samples/sec: 6.578 | iteration 200/ 320000 | elapsed time per iteration (ms): 2432.2 | learning rate: 1.875E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 8.121008E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.32 | backward: 1808.67 | backward-backward: 1808.65 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.87 + samples/sec: 6.578 | iteration 300/ 320000 | elapsed time per iteration (ms): 2432.2 | learning rate: 2.812E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 7.069163E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.71 | backward: 1807.72 | backward-backward: 1807.70 | backward-allreduce: 0.00 | optimizer: 56.30 | batch generator: 1.02 + samples/sec: 6.579 | iteration 400/ 320000 | elapsed time per iteration (ms): 2431.9 | learning rate: 3.750E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 6.643742E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.54 | backward: 1807.93 | backward-backward: 1807.90 | backward-allreduce: 0.00 | optimizer: 55.97 | batch generator: 0.94 + samples/sec: 6.580 | iteration 500/ 320000 | elapsed time per iteration (ms): 2431.6 | learning rate: 4.688E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 6.423395E+00 | loss scale: 65536.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.74 | backward: 1807.18 | backward-backward: 1807.15 | backward-allreduce: 0.00 | optimizer: 56.20 | batch generator: 0.96 + samples/sec: 6.577 | iteration 600/ 320000 | elapsed time per iteration (ms): 2432.8 | learning rate: 5.625E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 6.248856E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.94 | backward: 1808.54 | backward-backward: 1808.51 | backward-allreduce: 0.00 | optimizer: 55.80 | batch generator: 0.88 + samples/sec: 6.578 | iteration 700/ 320000 | elapsed time per iteration (ms): 2432.5 | learning rate: 6.562E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 6.075846E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.54 | backward: 1808.63 | backward-backward: 1808.60 | backward-allreduce: 0.00 | optimizer: 55.86 | batch generator: 0.86 + samples/sec: 6.582 | iteration 800/ 320000 | elapsed time per iteration (ms): 2431.1 | learning rate: 7.500E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 5.922576E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.43 | backward: 1807.24 | backward-backward: 1807.21 | backward-allreduce: 0.00 | optimizer: 55.91 | batch generator: 0.88 + samples/sec: 6.578 | iteration 900/ 320000 | elapsed time per iteration (ms): 2432.2 | learning rate: 8.437E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 5.804067E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 568.05 | backward: 1807.30 | backward-backward: 1807.27 | backward-allreduce: 0.00 | optimizer: 56.35 | batch generator: 0.93 + samples/sec: 6.567 | iteration 1000/ 320000 | elapsed time per iteration (ms): 2436.3 | learning rate: 9.375E-05 | approx flops per GPU: 40.8TFLOPS | lm_loss: 
5.711368E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 568.97 | backward: 1809.88 | backward-backward: 1809.85 | backward-allreduce: 0.00 | optimizer: 56.92 | batch generator: 0.96 +--------------------------------------------------------------------------------------------------------- + validation results at iteration 1000 | lm_loss value: 5.638843E+00 | lm_loss_ppl value: 2.811371E+02 | +--------------------------------------------------------------------------------------------------------- + samples/sec: 6.424 | iteration 1100/ 320000 | elapsed time per iteration (ms): 2490.5 | learning rate: 1.031E-04 | approx flops per GPU: 39.9TFLOPS | lm_loss: 5.595344E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.89 | backward: 1808.60 | backward-backward: 1808.57 | backward-allreduce: 0.00 | optimizer: 56.52 | batch generator: 0.96 + samples/sec: 6.576 | iteration 1200/ 320000 | elapsed time per iteration (ms): 2433.0 | learning rate: 1.124E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 5.535649E+00 | loss scale: 131072.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 567.77 | backward: 1808.85 | backward-backward: 1808.82 | backward-allreduce: 0.00 | optimizer: 55.87 | batch generator: 0.86 + samples/sec: 6.577 | iteration 1300/ 320000 | elapsed time per iteration (ms): 2432.7 | learning rate: 1.218E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 5.430936E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.44 | backward: 1808.60 | backward-backward: 1808.57 | backward-allreduce: 0.00 | optimizer: 56.20 | batch generator: 0.96 + samples/sec: 6.574 | iteration 1400/ 320000 | elapsed time per iteration (ms): 2434.0 | learning rate: 1.312E-04 | approx flops per GPU: 40.8TFLOPS | lm_loss: 5.340742E+00 | loss scale: 131072.0 
| number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.88 | backward: 1808.91 | backward-backward: 1808.88 | backward-allreduce: 0.00 | optimizer: 56.68 | batch generator: 0.90 + samples/sec: 6.577 | iteration 1500/ 320000 | elapsed time per iteration (ms): 2432.7 | learning rate: 1.405E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 5.316165E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.23 | backward: 1808.22 | backward-backward: 1808.19 | backward-allreduce: 0.00 | optimizer: 56.81 | batch generator: 0.85 + samples/sec: 6.582 | iteration 1600/ 320000 | elapsed time per iteration (ms): 2431.0 | learning rate: 1.499E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 5.255714E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.91 | backward: 1807.67 | backward-backward: 1807.64 | backward-allreduce: 0.00 | optimizer: 55.97 | batch generator: 0.84 + samples/sec: 6.588 | iteration 1700/ 320000 | elapsed time per iteration (ms): 2428.8 | learning rate: 1.593E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 5.182761E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.73 | backward: 1805.92 | backward-backward: 1805.89 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.83 + samples/sec: 6.594 | iteration 1800/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 1.687E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 5.139535E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.08 | backward: 1804.50 | backward-backward: 1804.48 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.79 + samples/sec: 6.588 | iteration 1900/ 320000 | elapsed time per iteration (ms): 2428.6 | learning rate: 1.780E-04 | approx flops per GPU: 
40.9TFLOPS | lm_loss: 5.075781E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.91 | backward: 1805.69 | backward-backward: 1805.66 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.88 + samples/sec: 6.588 | iteration 2000/ 320000 | elapsed time per iteration (ms): 2428.8 | learning rate: 1.874E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 5.038739E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.79 | backward: 1805.91 | backward-backward: 1805.88 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.83 +--------------------------------------------------------------------------------------------------------- + validation results at iteration 2000 | lm_loss value: 4.990431E+00 | lm_loss_ppl value: 1.469997E+02 | +--------------------------------------------------------------------------------------------------------- + samples/sec: 6.437 | iteration 2100/ 320000 | elapsed time per iteration (ms): 2485.6 | learning rate: 1.968E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 5.009710E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.52 | backward: 1805.67 | backward-backward: 1805.65 | backward-allreduce: 0.00 | optimizer: 56.12 | batch generator: 0.88 + samples/sec: 6.589 | iteration 2200/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 2.062E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 4.927545E+00 | loss scale: 262144.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.40 | backward: 1805.89 | backward-backward: 1805.86 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.80 + samples/sec: 6.583 | iteration 2300/ 320000 | elapsed time per iteration (ms): 2430.4 | learning rate: 2.155E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 4.905972E+00 | 
loss scale: 262144.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.46 | backward: 1806.63 | backward-backward: 1806.61 | backward-allreduce: 0.00 | optimizer: 55.93 | batch generator: 0.84 + samples/sec: 6.595 | iteration 2400/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 2.247E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 4.894602E+00 | loss scale: 131072.0 | number of skipped iterations: 2 | number of nan iterations: 0 | +time (ms) | forward: 566.41 | backward: 1804.98 | backward-backward: 1804.96 | backward-allreduce: 0.00 | optimizer: 54.37 | batch generator: 0.80 + samples/sec: 6.589 | iteration 2500/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 2.341E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 4.822763E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.62 | backward: 1805.56 | backward-backward: 1805.53 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.79 + samples/sec: 6.586 | iteration 2600/ 320000 | elapsed time per iteration (ms): 2429.5 | learning rate: 2.435E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 4.794308E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.83 | backward: 1806.45 | backward-backward: 1806.43 | backward-allreduce: 0.00 | optimizer: 55.79 | batch generator: 0.83 + samples/sec: 6.593 | iteration 2700/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 2.528E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 4.784180E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.22 | backward: 1804.96 | backward-backward: 1804.94 | backward-allreduce: 0.00 | optimizer: 55.37 | batch generator: 0.83 + samples/sec: 6.587 | iteration 2800/ 320000 | elapsed time per iteration (ms): 2429.1 | learning rate: 2.622E-04 | 
approx flops per GPU: 40.9TFLOPS | lm_loss: 4.717621E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.64 | backward: 1806.45 | backward-backward: 1806.42 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.78 + samples/sec: 6.589 | iteration 2900/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 2.716E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 4.724394E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.77 | backward: 1805.71 | backward-backward: 1805.68 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.83 + samples/sec: 6.587 | iteration 3000/ 320000 | elapsed time per iteration (ms): 2429.0 | learning rate: 2.810E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 4.677125E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.70 | backward: 1806.02 | backward-backward: 1805.99 | backward-allreduce: 0.00 | optimizer: 55.85 | batch generator: 0.89 +--------------------------------------------------------------------------------------------------------- + validation results at iteration 3000 | lm_loss value: 4.713490E+00 | lm_loss_ppl value: 1.114404E+02 | +--------------------------------------------------------------------------------------------------------- + samples/sec: 6.435 | iteration 3100/ 320000 | elapsed time per iteration (ms): 2486.3 | learning rate: 2.903E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 4.640468E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.98 | backward: 1805.80 | backward-backward: 1805.78 | backward-allreduce: 0.00 | optimizer: 56.31 | batch generator: 0.87 + samples/sec: 6.589 | iteration 3200/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 2.997E-04 | approx flops per GPU: 40.9TFLOPS | 
lm_loss: 4.629657E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.36 | backward: 1805.90 | backward-backward: 1805.87 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 0.80 + samples/sec: 6.583 | iteration 3300/ 320000 | elapsed time per iteration (ms): 2430.5 | learning rate: 3.000E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 4.587240E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.09 | backward: 1807.14 | backward-backward: 1807.12 | backward-allreduce: 0.00 | optimizer: 55.90 | batch generator: 0.79 + samples/sec: 6.591 | iteration 3400/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 3.000E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 4.560501E+00 | loss scale: 262144.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.51 | backward: 1805.00 | backward-backward: 1804.98 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.83 + samples/sec: 6.586 | iteration 3500/ 320000 | elapsed time per iteration (ms): 2429.3 | learning rate: 3.000E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 4.512705E+00 | loss scale: 262144.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.62 | backward: 1806.63 | backward-backward: 1806.60 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.79 + samples/sec: 6.586 | iteration 3600/ 320000 | elapsed time per iteration (ms): 2429.2 | learning rate: 3.000E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 4.500413E+00 | loss scale: 262144.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.95 | backward: 1806.23 | backward-backward: 1806.21 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.78 + samples/sec: 6.587 | iteration 3700/ 320000 | elapsed time per iteration (ms): 2429.2 | 
learning rate: 3.000E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 4.480738E+00 | loss scale: 262144.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.51 | backward: 1806.67 | backward-backward: 1806.65 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.84 + samples/sec: 6.585 | iteration 3800/ 320000 | elapsed time per iteration (ms): 2429.8 | learning rate: 3.000E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 4.431222E+00 | loss scale: 262144.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.11 | backward: 1806.57 | backward-backward: 1806.54 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.81 + samples/sec: 6.590 | iteration 3900/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 3.000E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 4.416038E+00 | loss scale: 262144.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.50 | backward: 1805.48 | backward-backward: 1805.46 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.88 + samples/sec: 6.582 | iteration 4000/ 320000 | elapsed time per iteration (ms): 2430.7 | learning rate: 3.000E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 4.393821E+00 | loss scale: 262144.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.39 | backward: 1807.20 | backward-backward: 1807.18 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.85 +--------------------------------------------------------------------------------------------------------- + validation results at iteration 4000 | lm_loss value: 4.366561E+00 | lm_loss_ppl value: 7.877230E+01 | +--------------------------------------------------------------------------------------------------------- + samples/sec: 6.440 | iteration 4100/ 320000 | elapsed time per iteration (ms): 2484.4 | learning rate: 3.000E-04 | approx 
flops per GPU: 40.0TFLOPS | lm_loss: 4.366091E+00 | loss scale: 262144.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.54 | backward: 1805.15 | backward-backward: 1805.13 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.86 + samples/sec: 6.587 | iteration 4200/ 320000 | elapsed time per iteration (ms): 2429.1 | learning rate: 3.000E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 4.326951E+00 | loss scale: 262144.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.68 | backward: 1806.30 | backward-backward: 1806.28 | backward-allreduce: 0.00 | optimizer: 55.78 | batch generator: 0.83 + samples/sec: 6.582 | iteration 4300/ 320000 | elapsed time per iteration (ms): 2430.9 | learning rate: 3.000E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 4.294877E+00 | loss scale: 262144.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.69 | backward: 1807.12 | backward-backward: 1807.09 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.80 + samples/sec: 6.589 | iteration 4400/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 3.000E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 4.280101E+00 | loss scale: 131072.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.64 | backward: 1805.66 | backward-backward: 1805.64 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.82 + samples/sec: 6.584 | iteration 4500/ 320000 | elapsed time per iteration (ms): 2430.1 | learning rate: 3.000E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 4.277866E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.16 | backward: 1806.32 | backward-backward: 1806.29 | backward-allreduce: 0.00 | optimizer: 56.21 | batch generator: 0.86 + samples/sec: 6.593 | iteration 4600/ 320000 | elapsed time per 
iteration (ms): 2426.9 | learning rate: 3.000E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 4.248083E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.42 | backward: 1804.65 | backward-backward: 1804.62 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.82 + samples/sec: 6.587 | iteration 4700/ 320000 | elapsed time per iteration (ms): 2429.2 | learning rate: 3.000E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 4.231368E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.04 | backward: 1806.01 | backward-backward: 1805.99 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.79 + samples/sec: 6.592 | iteration 4800/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 3.000E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 4.196505E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.63 | backward: 1804.60 | backward-backward: 1804.58 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.80 + samples/sec: 6.587 | iteration 4900/ 320000 | elapsed time per iteration (ms): 2429.1 | learning rate: 3.000E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 4.197580E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.03 | backward: 1805.99 | backward-backward: 1805.97 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.82 + samples/sec: 6.591 | iteration 5000/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 3.000E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 4.169177E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.98 | backward: 1804.64 | backward-backward: 1804.62 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.81 
+--------------------------------------------------------------------------------------------------------- + validation results at iteration 5000 | lm_loss value: 4.151481E+00 | lm_loss_ppl value: 6.352803E+01 | +--------------------------------------------------------------------------------------------------------- + samples/sec: 6.441 | iteration 5100/ 320000 | elapsed time per iteration (ms): 2484.2 | learning rate: 3.000E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 4.158211E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.56 | backward: 1804.87 | backward-backward: 1804.85 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.93 + samples/sec: 6.586 | iteration 5200/ 320000 | elapsed time per iteration (ms): 2429.4 | learning rate: 3.000E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 4.165546E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.04 | backward: 1805.90 | backward-backward: 1805.87 | backward-allreduce: 0.00 | optimizer: 56.10 | batch generator: 0.82 + samples/sec: 6.592 | iteration 5300/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 3.000E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 4.124736E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.41 | backward: 1804.06 | backward-backward: 1804.03 | backward-allreduce: 0.00 | optimizer: 56.16 | batch generator: 0.82 + samples/sec: 6.585 | iteration 5400/ 320000 | elapsed time per iteration (ms): 2429.9 | learning rate: 3.000E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 4.106201E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.94 | backward: 1807.17 | backward-backward: 1807.14 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.78 + samples/sec: 6.592 | iteration 5500/ 
320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 3.000E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 4.078767E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.13 | backward: 1804.02 | backward-backward: 1803.99 | backward-allreduce: 0.00 | optimizer: 55.42 | batch generator: 0.89 + samples/sec: 6.589 | iteration 5600/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 3.000E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 4.068387E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.55 | backward: 1805.68 | backward-backward: 1805.65 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.79 + samples/sec: 6.586 | iteration 5700/ 320000 | elapsed time per iteration (ms): 2429.6 | learning rate: 3.000E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 4.056988E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.09 | backward: 1805.98 | backward-backward: 1805.95 | backward-allreduce: 0.00 | optimizer: 56.10 | batch generator: 0.80 + samples/sec: 6.589 | iteration 5800/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 3.000E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 4.043163E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.66 | backward: 1805.21 | backward-backward: 1805.18 | backward-allreduce: 0.00 | optimizer: 56.02 | batch generator: 0.86 + samples/sec: 6.586 | iteration 5900/ 320000 | elapsed time per iteration (ms): 2429.4 | learning rate: 2.999E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 4.037292E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.48 | backward: 1806.00 | backward-backward: 1805.98 | backward-allreduce: 0.00 | optimizer: 55.56 | batch 
generator: 0.80 + samples/sec: 6.597 | iteration 6000/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 2.999E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 4.011004E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.31 | backward: 1803.89 | backward-backward: 1803.86 | backward-allreduce: 0.00 | optimizer: 54.89 | batch generator: 0.77 +--------------------------------------------------------------------------------------------------------- + validation results at iteration 6000 | lm_loss value: 4.032863E+00 | lm_loss_ppl value: 5.642222E+01 | +--------------------------------------------------------------------------------------------------------- + samples/sec: 6.437 | iteration 6100/ 320000 | elapsed time per iteration (ms): 2485.5 | learning rate: 2.999E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 4.038445E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.07 | backward: 1805.65 | backward-backward: 1805.62 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.86 + samples/sec: 6.590 | iteration 6200/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 2.999E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 4.017003E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.05 | backward: 1804.40 | backward-backward: 1804.38 | backward-allreduce: 0.00 | optimizer: 55.97 | batch generator: 0.78 + samples/sec: 6.590 | iteration 6300/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 2.999E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 4.010419E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.53 | backward: 1805.17 | backward-backward: 1805.14 | backward-allreduce: 0.00 | optimizer: 55.91 | batch generator: 0.80 + samples/sec: 6.591 | 
iteration 6400/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 2.999E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.992961E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.87 | backward: 1804.18 | backward-backward: 1804.16 | backward-allreduce: 0.00 | optimizer: 56.25 | batch generator: 0.84 + samples/sec: 6.593 | iteration 6500/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 2.999E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.974695E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.44 | backward: 1804.66 | backward-backward: 1804.63 | backward-allreduce: 0.00 | optimizer: 55.31 | batch generator: 0.76 + samples/sec: 6.589 | iteration 6600/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 2.999E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.959271E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.97 | backward: 1805.56 | backward-backward: 1805.54 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.77 + samples/sec: 6.596 | iteration 6700/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 2.999E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.965064E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.36 | backward: 1803.77 | backward-backward: 1803.75 | backward-allreduce: 0.00 | optimizer: 55.39 | batch generator: 0.79 + samples/sec: 6.587 | iteration 6800/ 320000 | elapsed time per iteration (ms): 2429.0 | learning rate: 2.999E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.935625E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.45 | backward: 1805.37 | backward-backward: 1805.35 | backward-allreduce: 0.00 | optimizer: 
55.80 | batch generator: 0.79 + samples/sec: 6.594 | iteration 6900/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 2.999E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.919808E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.86 | backward: 1803.86 | backward-backward: 1803.83 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.80 + samples/sec: 6.589 | iteration 7000/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 2.999E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.906917E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.66 | backward: 1806.04 | backward-backward: 1806.01 | backward-allreduce: 0.00 | optimizer: 55.10 | batch generator: 0.78 +--------------------------------------------------------------------------------------------------------- + validation results at iteration 7000 | lm_loss value: 3.875342E+00 | lm_loss_ppl value: 4.819919E+01 | +--------------------------------------------------------------------------------------------------------- + samples/sec: 6.438 | iteration 7100/ 320000 | elapsed time per iteration (ms): 2485.3 | learning rate: 2.999E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.912306E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.07 | backward: 1805.18 | backward-backward: 1805.16 | backward-allreduce: 0.00 | optimizer: 55.84 | batch generator: 0.91 + samples/sec: 6.587 | iteration 7200/ 320000 | elapsed time per iteration (ms): 2428.9 | learning rate: 2.999E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.892912E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.69 | backward: 1805.81 | backward-backward: 1805.78 | backward-allreduce: 0.00 | optimizer: 56.06 | batch generator: 0.81 + 
samples/sec: 6.589 | iteration 7300/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 2.999E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.888095E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.11 | backward: 1805.08 | backward-backward: 1805.05 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.77 + samples/sec: 6.591 | iteration 7400/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 2.999E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.866493E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.44 | backward: 1804.83 | backward-backward: 1804.81 | backward-allreduce: 0.00 | optimizer: 55.80 | batch generator: 0.89 + samples/sec: 6.586 | iteration 7500/ 320000 | elapsed time per iteration (ms): 2429.2 | learning rate: 2.999E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.865265E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.58 | backward: 1805.42 | backward-backward: 1805.40 | backward-allreduce: 0.00 | optimizer: 56.84 | batch generator: 0.80 + samples/sec: 6.592 | iteration 7600/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 2.999E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.855328E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.40 | backward: 1804.70 | backward-backward: 1804.68 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.80 + samples/sec: 6.587 | iteration 7700/ 320000 | elapsed time per iteration (ms): 2429.2 | learning rate: 2.999E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.855052E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.91 | backward: 1805.91 | backward-backward: 1805.89 | backward-allreduce: 
0.00 | optimizer: 55.99 | batch generator: 0.85 + samples/sec: 6.594 | iteration 7800/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 2.998E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.820201E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.70 | backward: 1803.82 | backward-backward: 1803.80 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.80 + samples/sec: 6.592 | iteration 7900/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 2.998E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.823879E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.55 | backward: 1804.87 | backward-backward: 1804.84 | backward-allreduce: 0.00 | optimizer: 55.35 | batch generator: 0.79 + samples/sec: 6.589 | iteration 8000/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 2.998E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.811602E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.01 | backward: 1805.47 | backward-backward: 1805.44 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.80 +--------------------------------------------------------------------------------------------------------- + validation results at iteration 8000 | lm_loss value: 3.792802E+00 | lm_loss_ppl value: 4.438056E+01 | +--------------------------------------------------------------------------------------------------------- + samples/sec: 6.442 | iteration 8100/ 320000 | elapsed time per iteration (ms): 2483.7 | learning rate: 2.998E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.813032E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.20 | backward: 1804.86 | backward-backward: 1804.83 | backward-allreduce: 0.00 | optimizer: 55.47 | batch 
generator: 0.84 + samples/sec: 6.585 | iteration 8200/ 320000 | elapsed time per iteration (ms): 2429.7 | learning rate: 2.998E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.812372E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.16 | backward: 1806.33 | backward-backward: 1806.30 | backward-allreduce: 0.00 | optimizer: 55.82 | batch generator: 0.81 + samples/sec: 6.594 | iteration 8300/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 2.998E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.821699E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.50 | backward: 1804.06 | backward-backward: 1804.04 | backward-allreduce: 0.00 | optimizer: 55.34 | batch generator: 0.80 + samples/sec: 6.587 | iteration 8400/ 320000 | elapsed time per iteration (ms): 2428.9 | learning rate: 2.998E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.800127E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.99 | backward: 1805.89 | backward-backward: 1805.86 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.98 + samples/sec: 6.589 | iteration 8500/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 2.998E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.781032E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.98 | backward: 1805.38 | backward-backward: 1805.35 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.80 + samples/sec: 6.591 | iteration 8600/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 2.998E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.778197E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.34 | backward: 1804.84 | backward-backward: 
1804.82 | backward-allreduce: 0.00 | optimizer: 56.04 | batch generator: 0.80 + samples/sec: 6.585 | iteration 8700/ 320000 | elapsed time per iteration (ms): 2429.9 | learning rate: 2.998E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.769279E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.22 | backward: 1806.82 | backward-backward: 1806.79 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.83 + samples/sec: 6.590 | iteration 8800/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 2.998E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.754744E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.59 | backward: 1804.78 | backward-backward: 1804.76 | backward-allreduce: 0.00 | optimizer: 56.22 | batch generator: 0.82 + samples/sec: 6.586 | iteration 8900/ 320000 | elapsed time per iteration (ms): 2429.3 | learning rate: 2.998E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.751784E+00 | loss scale: 131072.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 567.15 | backward: 1806.66 | backward-backward: 1806.64 | backward-allreduce: 0.00 | optimizer: 55.15 | batch generator: 0.80 + samples/sec: 6.593 | iteration 9000/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 2.998E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.745650E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.57 | backward: 1804.35 | backward-backward: 1804.32 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.85 +--------------------------------------------------------------------------------------------------------- + validation results at iteration 9000 | lm_loss value: 3.757839E+00 | lm_loss_ppl value: 4.285572E+01 | 
+--------------------------------------------------------------------------------------------------------- + samples/sec: 6.438 | iteration 9100/ 320000 | elapsed time per iteration (ms): 2485.3 | learning rate: 2.997E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.748583E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.75 | backward: 1805.90 | backward-backward: 1805.87 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.89 + samples/sec: 6.586 | iteration 9200/ 320000 | elapsed time per iteration (ms): 2429.4 | learning rate: 2.997E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.746035E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.64 | backward: 1805.42 | backward-backward: 1805.40 | backward-allreduce: 0.00 | optimizer: 55.95 | batch generator: 0.94 + samples/sec: 6.587 | iteration 9300/ 320000 | elapsed time per iteration (ms): 2428.9 | learning rate: 2.997E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.730315E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.64 | backward: 1805.97 | backward-backward: 1805.95 | backward-allreduce: 0.00 | optimizer: 55.87 | batch generator: 0.80 + samples/sec: 6.587 | iteration 9400/ 320000 | elapsed time per iteration (ms): 2429.0 | learning rate: 2.997E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.730369E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.15 | backward: 1805.59 | backward-backward: 1805.57 | backward-allreduce: 0.00 | optimizer: 55.91 | batch generator: 0.81 + samples/sec: 6.590 | iteration 9500/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 2.997E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.706880E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | +time (ms) | forward: 566.24 | backward: 1805.49 | backward-backward: 1805.46 | backward-allreduce: 0.00 | optimizer: 55.79 | batch generator: 0.79 + samples/sec: 6.585 | iteration 9600/ 320000 | elapsed time per iteration (ms): 2429.8 | learning rate: 2.997E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.732579E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.03 | backward: 1806.73 | backward-backward: 1806.70 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.78 + samples/sec: 6.588 | iteration 9700/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 2.997E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.699149E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.55 | backward: 1805.63 | backward-backward: 1805.61 | backward-allreduce: 0.00 | optimizer: 55.94 | batch generator: 0.81 + samples/sec: 6.584 | iteration 9800/ 320000 | elapsed time per iteration (ms): 2430.1 | learning rate: 2.997E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.699451E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.01 | backward: 1807.26 | backward-backward: 1807.23 | backward-allreduce: 0.00 | optimizer: 55.41 | batch generator: 0.76 + samples/sec: 6.590 | iteration 9900/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 2.997E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.711613E+00 | loss scale: 262144.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 567.12 | backward: 1805.22 | backward-backward: 1805.20 | backward-allreduce: 0.00 | optimizer: 55.10 | batch generator: 0.78 + samples/sec: 6.588 | iteration 10000/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 2.997E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.692773E+00 | loss scale: 
131072.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.63 | backward: 1806.22 | backward-backward: 1806.20 | backward-allreduce: 0.00 | optimizer: 55.31 | batch generator: 0.80 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 10000 | lm_loss value: 3.738776E+00 | lm_loss_ppl value: 4.204651E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.219 | iteration 10100/ 320000 | elapsed time per iteration (ms): 2572.7 | learning rate: 2.997E-04 | approx flops per GPU: 38.6TFLOPS | lm_loss: 3.708925E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.28 | backward: 1805.40 | backward-backward: 1805.37 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.88 + samples/sec: 6.590 | iteration 10200/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 2.996E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.696192E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.53 | backward: 1805.14 | backward-backward: 1805.12 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.77 + samples/sec: 6.587 | iteration 10300/ 320000 | elapsed time per iteration (ms): 2429.0 | learning rate: 2.996E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.660418E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.99 | backward: 1806.08 | backward-backward: 1806.06 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.80 + samples/sec: 6.592 | iteration 10400/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 2.996E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.659210E+00 | loss scale: 131072.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.27 | backward: 1804.80 | backward-backward: 1804.77 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.79 + samples/sec: 6.585 | iteration 10500/ 320000 | elapsed time per iteration (ms): 2429.9 | learning rate: 2.996E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.663712E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.92 | backward: 1807.14 | backward-backward: 1807.12 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.79 + samples/sec: 6.592 | iteration 10600/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 2.996E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.689550E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.70 | backward: 1804.40 | backward-backward: 1804.38 | backward-allreduce: 0.00 | optimizer: 55.82 | batch generator: 0.78 + samples/sec: 6.590 | iteration 10700/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 2.996E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.674955E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.85 | backward: 1805.42 | backward-backward: 1805.40 | backward-allreduce: 0.00 | optimizer: 55.33 | batch generator: 0.82 + samples/sec: 6.586 | iteration 10800/ 320000 | elapsed time per iteration (ms): 2429.3 | learning rate: 2.996E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.678895E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.27 | backward: 1805.32 | backward-backward: 1805.30 | backward-allreduce: 0.00 | optimizer: 56.29 | batch generator: 0.82 + samples/sec: 6.594 | iteration 10900/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 2.996E-04 | approx flops per GPU: 41.0TFLOPS | 
lm_loss: 3.650595E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.30 | backward: 1804.58 | backward-backward: 1804.56 | backward-allreduce: 0.00 | optimizer: 55.35 | batch generator: 0.79 + samples/sec: 6.587 | iteration 11000/ 320000 | elapsed time per iteration (ms): 2429.1 | learning rate: 2.996E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.637246E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.04 | backward: 1806.10 | backward-backward: 1806.08 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.81 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 11000 | lm_loss value: 3.656139E+00 | lm_loss_ppl value: 3.871160E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.444 | iteration 11100/ 320000 | elapsed time per iteration (ms): 2482.8 | learning rate: 2.996E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.647494E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.74 | backward: 1803.35 | backward-backward: 1803.33 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.92 + samples/sec: 6.589 | iteration 11200/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 2.995E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.653591E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.94 | backward: 1805.48 | backward-backward: 1805.45 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.80 + samples/sec: 6.593 | iteration 11300/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 2.995E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.625738E+00 | loss 
scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.65 | backward: 1804.04 | backward-backward: 1804.02 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.77 + samples/sec: 6.589 | iteration 11400/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 2.995E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.629917E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.51 | backward: 1805.54 | backward-backward: 1805.51 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.77 + samples/sec: 6.591 | iteration 11500/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 2.995E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.634222E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.15 | backward: 1804.55 | backward-backward: 1804.53 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.78 + samples/sec: 6.589 | iteration 11600/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 2.995E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.626231E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.57 | backward: 1805.41 | backward-backward: 1805.39 | backward-allreduce: 0.00 | optimizer: 55.79 | batch generator: 0.76 + samples/sec: 6.589 | iteration 11700/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 2.995E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.644237E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.12 | backward: 1804.98 | backward-backward: 1804.96 | backward-allreduce: 0.00 | optimizer: 55.80 | batch generator: 0.95 + samples/sec: 6.594 | iteration 11800/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 2.995E-04 | approx 
flops per GPU: 41.0TFLOPS | lm_loss: 3.592169E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.15 | backward: 1804.41 | backward-backward: 1804.38 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.82 + samples/sec: 6.583 | iteration 11900/ 320000 | elapsed time per iteration (ms): 2430.6 | learning rate: 2.995E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.610683E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.22 | backward: 1807.00 | backward-backward: 1806.97 | backward-allreduce: 0.00 | optimizer: 56.04 | batch generator: 0.79 + samples/sec: 6.593 | iteration 12000/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 2.994E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.599871E+00 | loss scale: 131072.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.62 | backward: 1804.18 | backward-backward: 1804.15 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.79 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 12000 | lm_loss value: 3.582617E+00 | lm_loss_ppl value: 3.596753E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.438 | iteration 12100/ 320000 | elapsed time per iteration (ms): 2485.3 | learning rate: 2.994E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.595412E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.60 | backward: 1805.80 | backward-backward: 1805.78 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.85 + samples/sec: 6.589 | iteration 12200/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 2.994E-04 | approx flops per GPU: 40.9TFLOPS | 
lm_loss: 3.601770E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.13 | backward: 1804.98 | backward-backward: 1804.96 | backward-allreduce: 0.00 | optimizer: 55.87 | batch generator: 0.78 + samples/sec: 6.592 | iteration 12300/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 2.994E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.589620E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.14 | backward: 1805.30 | backward-backward: 1805.28 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.77 + samples/sec: 6.585 | iteration 12400/ 320000 | elapsed time per iteration (ms): 2429.8 | learning rate: 2.994E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.606794E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.03 | backward: 1806.38 | backward-backward: 1806.35 | backward-allreduce: 0.00 | optimizer: 55.91 | batch generator: 0.79 + samples/sec: 6.593 | iteration 12500/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 2.994E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.598042E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.23 | backward: 1804.38 | backward-backward: 1804.35 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.78 + samples/sec: 6.586 | iteration 12600/ 320000 | elapsed time per iteration (ms): 2429.3 | learning rate: 2.994E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.585042E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.81 | backward: 1806.23 | backward-backward: 1806.21 | backward-allreduce: 0.00 | optimizer: 55.92 | batch generator: 0.80 + samples/sec: 6.593 | iteration 12700/ 320000 | elapsed time per iteration (ms): 2426.9 | 
learning rate: 2.993E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.601317E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.64 | backward: 1804.43 | backward-backward: 1804.40 | backward-allreduce: 0.00 | optimizer: 55.42 | batch generator: 0.79 + samples/sec: 6.590 | iteration 12800/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 2.993E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.563917E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.37 | backward: 1805.35 | backward-backward: 1805.32 | backward-allreduce: 0.00 | optimizer: 55.77 | batch generator: 0.77 + samples/sec: 6.588 | iteration 12900/ 320000 | elapsed time per iteration (ms): 2428.7 | learning rate: 2.993E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.566060E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.06 | backward: 1805.50 | backward-backward: 1805.48 | backward-allreduce: 0.00 | optimizer: 55.76 | batch generator: 0.82 + samples/sec: 6.592 | iteration 13000/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 2.993E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.580006E+00 | loss scale: 131072.0 | number of skipped iterations: 2 | number of nan iterations: 0 | +time (ms) | forward: 566.22 | backward: 1805.60 | backward-backward: 1805.57 | backward-allreduce: 0.00 | optimizer: 55.05 | batch generator: 0.79 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 13000 | lm_loss value: 3.596840E+00 | lm_loss_ppl value: 3.648276E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.436 | iteration 13100/ 320000 | elapsed time per iteration (ms): 2486.1 | learning rate: 2.993E-04 | 
approx flops per GPU: 40.0TFLOPS | lm_loss: 3.556089E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.00 | backward: 1806.38 | backward-backward: 1806.35 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.86 + samples/sec: 6.594 | iteration 13200/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 2.993E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.572833E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.33 | backward: 1804.42 | backward-backward: 1804.39 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.84 + samples/sec: 6.587 | iteration 13300/ 320000 | elapsed time per iteration (ms): 2428.9 | learning rate: 2.993E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.574725E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.82 | backward: 1806.40 | backward-backward: 1806.38 | backward-allreduce: 0.00 | optimizer: 55.30 | batch generator: 0.83 + samples/sec: 6.593 | iteration 13400/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 2.993E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.595098E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.83 | backward: 1804.29 | backward-backward: 1804.26 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.78 + samples/sec: 6.595 | iteration 13500/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 2.992E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.568260E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.39 | backward: 1804.34 | backward-backward: 1804.32 | backward-allreduce: 0.00 | optimizer: 55.12 | batch generator: 0.81 + samples/sec: 6.587 | iteration 13600/ 320000 | elapsed 
time per iteration (ms): 2429.1 | learning rate: 2.992E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.568923E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.10 | backward: 1805.76 | backward-backward: 1805.73 | backward-allreduce: 0.00 | optimizer: 55.83 | batch generator: 0.77 + samples/sec: 6.596 | iteration 13700/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 2.992E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.551802E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.43 | backward: 1803.57 | backward-backward: 1803.55 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.77 + samples/sec: 6.589 | iteration 13800/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 2.992E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.564131E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.84 | backward: 1805.31 | backward-backward: 1805.29 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.80 + samples/sec: 6.593 | iteration 13900/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 2.992E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.572117E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.73 | backward: 1803.94 | backward-backward: 1803.91 | backward-allreduce: 0.00 | optimizer: 55.77 | batch generator: 0.86 + samples/sec: 6.592 | iteration 14000/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 2.992E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.547695E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.41 | backward: 1804.18 | backward-backward: 1804.15 | backward-allreduce: 0.00 | optimizer: 56.08 | batch generator: 0.77 
+---------------------------------------------------------------------------------------------------------- + validation results at iteration 14000 | lm_loss value: 3.490722E+00 | lm_loss_ppl value: 3.280963E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.438 | iteration 14100/ 320000 | elapsed time per iteration (ms): 2485.2 | learning rate: 2.991E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.559055E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.99 | backward: 1805.39 | backward-backward: 1805.36 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.96 + samples/sec: 6.596 | iteration 14200/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 2.991E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.570592E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.14 | backward: 1803.55 | backward-backward: 1803.53 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.75 + samples/sec: 6.587 | iteration 14300/ 320000 | elapsed time per iteration (ms): 2428.9 | learning rate: 2.991E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.554870E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.04 | backward: 1805.61 | backward-backward: 1805.58 | backward-allreduce: 0.00 | optimizer: 55.93 | batch generator: 0.83 + samples/sec: 6.597 | iteration 14400/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 2.991E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.523307E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.17 | backward: 1803.18 | backward-backward: 1803.16 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.78 + samples/sec: 6.589 | iteration 
14500/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 2.991E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.546764E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.80 | backward: 1805.32 | backward-backward: 1805.29 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.82 + samples/sec: 6.591 | iteration 14600/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 2.991E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.547617E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.90 | backward: 1804.59 | backward-backward: 1804.57 | backward-allreduce: 0.00 | optimizer: 55.79 | batch generator: 0.92 + samples/sec: 6.595 | iteration 14700/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 2.990E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.539000E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.30 | backward: 1804.08 | backward-backward: 1804.06 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.83 + samples/sec: 6.588 | iteration 14800/ 320000 | elapsed time per iteration (ms): 2428.6 | learning rate: 2.990E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.519862E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.96 | backward: 1805.74 | backward-backward: 1805.71 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.78 + samples/sec: 6.595 | iteration 14900/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 2.990E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.510806E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.39 | backward: 1803.90 | backward-backward: 1803.88 | backward-allreduce: 0.00 | optimizer: 55.48 | 
batch generator: 0.79 + samples/sec: 6.589 | iteration 15000/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 2.990E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.549343E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.82 | backward: 1805.59 | backward-backward: 1805.57 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.79 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 15000 | lm_loss value: 3.519748E+00 | lm_loss_ppl value: 3.377593E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.440 | iteration 15100/ 320000 | elapsed time per iteration (ms): 2484.6 | learning rate: 2.990E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.526826E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.64 | backward: 1804.56 | backward-backward: 1804.54 | backward-allreduce: 0.00 | optimizer: 56.10 | batch generator: 0.88 + samples/sec: 6.590 | iteration 15200/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 2.990E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.528263E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.62 | backward: 1804.89 | backward-backward: 1804.86 | backward-allreduce: 0.00 | optimizer: 56.03 | batch generator: 0.82 + samples/sec: 6.587 | iteration 15300/ 320000 | elapsed time per iteration (ms): 2429.0 | learning rate: 2.989E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.517517E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.95 | backward: 1805.77 | backward-backward: 1805.74 | backward-allreduce: 0.00 | optimizer: 55.87 | batch generator: 0.77 + 
samples/sec: 6.596 | iteration 15400/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 2.989E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.530598E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.25 | backward: 1803.79 | backward-backward: 1803.77 | backward-allreduce: 0.00 | optimizer: 55.35 | batch generator: 0.79 + samples/sec: 6.586 | iteration 15500/ 320000 | elapsed time per iteration (ms): 2429.3 | learning rate: 2.989E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.531173E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.78 | backward: 1806.52 | backward-backward: 1806.49 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.75 + samples/sec: 6.594 | iteration 15600/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 2.989E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.544355E+00 | loss scale: 131072.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.52 | backward: 1804.40 | backward-backward: 1804.37 | backward-allreduce: 0.00 | optimizer: 55.12 | batch generator: 0.80 + samples/sec: 6.591 | iteration 15700/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 2.989E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.533088E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.41 | backward: 1805.00 | backward-backward: 1804.97 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.84 + samples/sec: 6.589 | iteration 15800/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 2.989E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.528536E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.90 | backward: 1805.74 | backward-backward: 1805.72 | 
backward-allreduce: 0.00 | optimizer: 55.27 | batch generator: 0.79 + samples/sec: 6.595 | iteration 15900/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 2.988E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.472824E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.16 | backward: 1803.94 | backward-backward: 1803.92 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.80 + samples/sec: 6.590 | iteration 16000/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 2.988E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.506108E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.68 | backward: 1805.47 | backward-backward: 1805.45 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.77 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 16000 | lm_loss value: 3.442317E+00 | lm_loss_ppl value: 3.125929E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.441 | iteration 16100/ 320000 | elapsed time per iteration (ms): 2484.0 | learning rate: 2.988E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.496479E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.81 | backward: 1804.30 | backward-backward: 1804.27 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.87 + samples/sec: 6.591 | iteration 16200/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 2.988E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.508001E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.28 | backward: 1804.30 | backward-backward: 1804.28 | backward-allreduce: 0.00 | 
optimizer: 56.46 | batch generator: 0.79 + samples/sec: 6.584 | iteration 16300/ 320000 | elapsed time per iteration (ms): 2430.2 | learning rate: 2.988E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.505143E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.41 | backward: 1806.31 | backward-backward: 1806.28 | backward-allreduce: 0.00 | optimizer: 56.06 | batch generator: 0.81 + samples/sec: 6.595 | iteration 16400/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 2.987E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.482912E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.28 | backward: 1803.93 | backward-backward: 1803.91 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.82 + samples/sec: 6.590 | iteration 16500/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 2.987E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.499284E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.77 | backward: 1804.94 | backward-backward: 1804.91 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.80 + samples/sec: 6.590 | iteration 16600/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 2.987E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.511786E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.96 | backward: 1804.92 | backward-backward: 1804.89 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.80 + samples/sec: 6.595 | iteration 16700/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 2.987E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.496960E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.08 | backward: 1803.83 | 
backward-backward: 1803.81 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.79 + samples/sec: 6.586 | iteration 16800/ 320000 | elapsed time per iteration (ms): 2429.5 | learning rate: 2.987E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.485474E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.23 | backward: 1806.29 | backward-backward: 1806.26 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.81 + samples/sec: 6.596 | iteration 16900/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 2.986E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.507872E+00 | loss scale: 131072.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.47 | backward: 1803.98 | backward-backward: 1803.96 | backward-allreduce: 0.00 | optimizer: 54.97 | batch generator: 0.83 + samples/sec: 6.589 | iteration 17000/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 2.986E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.482399E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.93 | backward: 1805.72 | backward-backward: 1805.69 | backward-allreduce: 0.00 | optimizer: 55.38 | batch generator: 0.77 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 17000 | lm_loss value: 3.480448E+00 | lm_loss_ppl value: 3.247428E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.437 | iteration 17100/ 320000 | elapsed time per iteration (ms): 2485.5 | learning rate: 2.986E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.489208E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.98 | backward: 1805.42 | backward-backward: 1805.39 | 
backward-allreduce: 0.00 | optimizer: 55.84 | batch generator: 0.87 + samples/sec: 6.595 | iteration 17200/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 2.986E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.474105E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.10 | backward: 1804.71 | backward-backward: 1804.68 | backward-allreduce: 0.00 | optimizer: 55.02 | batch generator: 0.80 + samples/sec: 6.584 | iteration 17300/ 320000 | elapsed time per iteration (ms): 2430.0 | learning rate: 2.986E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.476275E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.12 | backward: 1806.27 | backward-backward: 1806.25 | backward-allreduce: 0.00 | optimizer: 56.21 | batch generator: 0.78 + samples/sec: 6.591 | iteration 17400/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 2.985E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.474101E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.92 | backward: 1804.82 | backward-backward: 1804.80 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.77 + samples/sec: 6.595 | iteration 17500/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 2.985E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.468087E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.99 | backward: 1803.75 | backward-backward: 1803.72 | backward-allreduce: 0.00 | optimizer: 55.08 | batch generator: 0.82 + samples/sec: 6.589 | iteration 17600/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 2.985E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.496052E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 
567.12 | backward: 1804.90 | backward-backward: 1804.88 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.82 + samples/sec: 6.598 | iteration 17700/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 2.985E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.461133E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.05 | backward: 1803.12 | backward-backward: 1803.09 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.85 + samples/sec: 6.588 | iteration 17800/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 2.985E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.462828E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.98 | backward: 1805.32 | backward-backward: 1805.30 | backward-allreduce: 0.00 | optimizer: 55.82 | batch generator: 0.92 + samples/sec: 6.596 | iteration 17900/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 2.984E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.479768E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.51 | backward: 1803.42 | backward-backward: 1803.39 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.77 + samples/sec: 6.595 | iteration 18000/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 2.984E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.444852E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.29 | backward: 1803.76 | backward-backward: 1803.74 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.79 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 18000 | lm_loss value: 3.425723E+00 | lm_loss_ppl value: 3.074488E+01 | 
+---------------------------------------------------------------------------------------------------------- + samples/sec: 6.438 | iteration 18100/ 320000 | elapsed time per iteration (ms): 2485.4 | learning rate: 2.984E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.474619E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.89 | backward: 1805.57 | backward-backward: 1805.54 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 0.86 + samples/sec: 6.598 | iteration 18200/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 2.984E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.473020E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.19 | backward: 1803.24 | backward-backward: 1803.21 | backward-allreduce: 0.00 | optimizer: 55.29 | batch generator: 0.79 + samples/sec: 6.593 | iteration 18300/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 2.984E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.483026E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.62 | backward: 1804.43 | backward-backward: 1804.41 | backward-allreduce: 0.00 | optimizer: 55.41 | batch generator: 0.79 + samples/sec: 6.591 | iteration 18400/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 2.983E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.475246E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.00 | backward: 1804.09 | backward-backward: 1804.07 | backward-allreduce: 0.00 | optimizer: 56.12 | batch generator: 0.78 + samples/sec: 6.595 | iteration 18500/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 2.983E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.452444E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | +time (ms) | forward: 566.04 | backward: 1804.36 | backward-backward: 1804.33 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.88 + samples/sec: 6.586 | iteration 18600/ 320000 | elapsed time per iteration (ms): 2429.5 | learning rate: 2.983E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.475268E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.97 | backward: 1806.18 | backward-backward: 1806.16 | backward-allreduce: 0.00 | optimizer: 55.99 | batch generator: 0.79 + samples/sec: 6.596 | iteration 18700/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 2.983E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.481735E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.44 | backward: 1803.59 | backward-backward: 1803.56 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.78 + samples/sec: 6.596 | iteration 18800/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 2.982E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.466931E+00 | loss scale: 32768.0 | number of skipped iterations: 2 | number of nan iterations: 0 | +time (ms) | forward: 566.55 | backward: 1804.31 | backward-backward: 1804.28 | backward-allreduce: 0.00 | optimizer: 54.64 | batch generator: 0.80 + samples/sec: 6.590 | iteration 18900/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 2.982E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.455374E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.87 | backward: 1804.84 | backward-backward: 1804.82 | backward-allreduce: 0.00 | optimizer: 55.77 | batch generator: 0.80 + samples/sec: 6.598 | iteration 19000/ 320000 | elapsed time per iteration (ms): 2424.9 | learning rate: 2.982E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.452114E+00 | loss scale: 
32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.00 | backward: 1802.94 | backward-backward: 1802.92 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.78 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 19000 | lm_loss value: 3.560665E+00 | lm_loss_ppl value: 3.518659E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.440 | iteration 19100/ 320000 | elapsed time per iteration (ms): 2484.3 | learning rate: 2.982E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.443787E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.87 | backward: 1804.46 | backward-backward: 1804.44 | backward-allreduce: 0.00 | optimizer: 55.82 | batch generator: 0.92 + samples/sec: 6.593 | iteration 19200/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 2.982E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.454551E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.10 | backward: 1803.89 | backward-backward: 1803.87 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.98 + samples/sec: 6.596 | iteration 19300/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 2.981E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.449123E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.25 | backward: 1803.32 | backward-backward: 1803.30 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.80 + samples/sec: 6.589 | iteration 19400/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 2.981E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.448017E+00 | loss scale: 32768.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.04 | backward: 1805.11 | backward-backward: 1805.09 | backward-allreduce: 0.00 | optimizer: 55.83 | batch generator: 0.78 + samples/sec: 6.593 | iteration 19500/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 2.981E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.462814E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.47 | backward: 1803.83 | backward-backward: 1803.81 | backward-allreduce: 0.00 | optimizer: 56.22 | batch generator: 0.76 + samples/sec: 6.593 | iteration 19600/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 2.981E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.456854E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.41 | backward: 1804.50 | backward-backward: 1804.48 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.79 + samples/sec: 6.590 | iteration 19700/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 2.980E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.466067E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.12 | backward: 1804.56 | backward-backward: 1804.54 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.84 + samples/sec: 6.597 | iteration 19800/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 2.980E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.428347E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.88 | backward: 1803.69 | backward-backward: 1803.67 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.76 + samples/sec: 6.588 | iteration 19900/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 2.980E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 
3.445447E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.84 | backward: 1805.47 | backward-backward: 1805.45 | backward-allreduce: 0.00 | optimizer: 55.81 | batch generator: 0.79 + samples/sec: 6.598 | iteration 20000/ 320000 | elapsed time per iteration (ms): 2425.0 | learning rate: 2.980E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.443718E+00 | loss scale: 32768.0 | number of skipped iterations: 2 | number of nan iterations: 0 | +time (ms) | forward: 566.59 | backward: 1803.56 | backward-backward: 1803.53 | backward-allreduce: 0.00 | optimizer: 54.50 | batch generator: 0.81 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 20000 | lm_loss value: 3.356253E+00 | lm_loss_ppl value: 2.868152E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.194 | iteration 20100/ 320000 | elapsed time per iteration (ms): 2583.2 | learning rate: 2.979E-04 | approx flops per GPU: 38.5TFLOPS | lm_loss: 3.448316E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 568.09 | backward: 1808.88 | backward-backward: 1808.85 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.87 + samples/sec: 6.588 | iteration 20200/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 2.979E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.429279E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.82 | backward: 1805.53 | backward-backward: 1805.50 | backward-allreduce: 0.00 | optimizer: 55.77 | batch generator: 0.84 + samples/sec: 6.596 | iteration 20300/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 2.979E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.419334E+00 | loss scale: 
32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.15 | backward: 1803.35 | backward-backward: 1803.33 | backward-allreduce: 0.00 | optimizer: 55.87 | batch generator: 0.82 + samples/sec: 6.592 | iteration 20400/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 2.979E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.430520E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.68 | backward: 1804.35 | backward-backward: 1804.33 | backward-allreduce: 0.00 | optimizer: 55.84 | batch generator: 0.81 + samples/sec: 6.592 | iteration 20500/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 2.978E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.435808E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.89 | backward: 1804.26 | backward-backward: 1804.23 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.86 + samples/sec: 6.596 | iteration 20600/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 2.978E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.428703E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.06 | backward: 1803.30 | backward-backward: 1803.28 | backward-allreduce: 0.00 | optimizer: 56.09 | batch generator: 0.79 + samples/sec: 6.591 | iteration 20700/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 2.978E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.408939E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.82 | backward: 1804.77 | backward-backward: 1804.74 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.80 + samples/sec: 6.594 | iteration 20800/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 2.978E-04 | approx flops 
per GPU: 41.0TFLOPS | lm_loss: 3.422203E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.60 | backward: 1803.58 | backward-backward: 1803.55 | backward-allreduce: 0.00 | optimizer: 55.87 | batch generator: 0.84 + samples/sec: 6.596 | iteration 20900/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 2.977E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.407698E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.48 | backward: 1803.56 | backward-backward: 1803.53 | backward-allreduce: 0.00 | optimizer: 55.42 | batch generator: 0.79 + samples/sec: 6.588 | iteration 21000/ 320000 | elapsed time per iteration (ms): 2428.8 | learning rate: 2.977E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.414617E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.10 | backward: 1805.91 | backward-backward: 1805.88 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.78 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 21000 | lm_loss value: 3.483074E+00 | lm_loss_ppl value: 3.255966E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.446 | iteration 21100/ 320000 | elapsed time per iteration (ms): 2482.1 | learning rate: 2.977E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.408804E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.21 | backward: 1803.43 | backward-backward: 1803.40 | backward-allreduce: 0.00 | optimizer: 55.32 | batch generator: 0.84 + samples/sec: 6.590 | iteration 21200/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 2.977E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 
3.422194E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.51 | backward: 1805.05 | backward-backward: 1805.03 | backward-allreduce: 0.00 | optimizer: 55.96 | batch generator: 0.81 + samples/sec: 6.591 | iteration 21300/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 2.976E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.441313E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.66 | backward: 1804.90 | backward-backward: 1804.87 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.78 + samples/sec: 6.598 | iteration 21400/ 320000 | elapsed time per iteration (ms): 2424.9 | learning rate: 2.976E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.411677E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 565.95 | backward: 1803.72 | backward-backward: 1803.70 | backward-allreduce: 0.00 | optimizer: 54.88 | batch generator: 0.79 + samples/sec: 6.589 | iteration 21500/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 2.976E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.426009E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.86 | backward: 1805.53 | backward-backward: 1805.51 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.80 + samples/sec: 6.596 | iteration 21600/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 2.976E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.421942E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.36 | backward: 1803.66 | backward-backward: 1803.63 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.79 + samples/sec: 6.591 | iteration 21700/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 
2.975E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.401697E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.21 | backward: 1804.81 | backward-backward: 1804.79 | backward-allreduce: 0.00 | optimizer: 56.19 | batch generator: 0.80 + samples/sec: 6.586 | iteration 21800/ 320000 | elapsed time per iteration (ms): 2429.5 | learning rate: 2.975E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.419076E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.94 | backward: 1806.29 | backward-backward: 1806.26 | backward-allreduce: 0.00 | optimizer: 55.87 | batch generator: 0.78 + samples/sec: 6.598 | iteration 21900/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 2.975E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.403380E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.17 | backward: 1803.45 | backward-backward: 1803.43 | backward-allreduce: 0.00 | optimizer: 55.16 | batch generator: 0.77 + samples/sec: 6.593 | iteration 22000/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 2.975E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.420613E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.50 | backward: 1804.39 | backward-backward: 1804.36 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.77 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 22000 | lm_loss value: 3.399085E+00 | lm_loss_ppl value: 2.993668E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.442 | iteration 22100/ 320000 | elapsed time per iteration (ms): 2483.9 | learning rate: 2.974E-04 | approx flops per GPU: 
40.0TFLOPS | lm_loss: 3.414502E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.70 | backward: 1804.71 | backward-backward: 1804.69 | backward-allreduce: 0.00 | optimizer: 55.29 | batch generator: 0.83 + samples/sec: 6.598 | iteration 22200/ 320000 | elapsed time per iteration (ms): 2424.9 | learning rate: 2.974E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.405916E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.07 | backward: 1802.99 | backward-backward: 1802.97 | backward-allreduce: 0.00 | optimizer: 55.41 | batch generator: 0.90 + samples/sec: 6.590 | iteration 22300/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 2.974E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.417830E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.95 | backward: 1804.94 | backward-backward: 1804.91 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.85 + samples/sec: 6.595 | iteration 22400/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 2.974E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.427811E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.38 | backward: 1803.64 | backward-backward: 1803.61 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.78 + samples/sec: 6.595 | iteration 22500/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 2.973E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.401518E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.29 | backward: 1803.62 | backward-backward: 1803.60 | backward-allreduce: 0.00 | optimizer: 55.82 | batch generator: 0.79 + samples/sec: 6.588 | iteration 22600/ 320000 | elapsed time per iteration (ms): 
2428.6 | learning rate: 2.973E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.428861E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.86 | backward: 1805.81 | backward-backward: 1805.79 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.79 + samples/sec: 6.597 | iteration 22700/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 2.973E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.396415E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.89 | backward: 1803.31 | backward-backward: 1803.28 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.77 + samples/sec: 6.588 | iteration 22800/ 320000 | elapsed time per iteration (ms): 2428.6 | learning rate: 2.972E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.390486E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.75 | backward: 1805.08 | backward-backward: 1805.06 | backward-allreduce: 0.00 | optimizer: 56.38 | batch generator: 0.91 + samples/sec: 6.592 | iteration 22900/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 2.972E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.403435E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.71 | backward: 1804.75 | backward-backward: 1804.73 | backward-allreduce: 0.00 | optimizer: 55.32 | batch generator: 0.80 + samples/sec: 6.595 | iteration 23000/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 2.972E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.392015E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.03 | backward: 1803.87 | backward-backward: 1803.85 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.81 
+---------------------------------------------------------------------------------------------------------- + validation results at iteration 23000 | lm_loss value: 3.272032E+00 | lm_loss_ppl value: 2.636485E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.438 | iteration 23100/ 320000 | elapsed time per iteration (ms): 2485.1 | learning rate: 2.972E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.376038E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.89 | backward: 1805.64 | backward-backward: 1805.62 | backward-allreduce: 0.00 | optimizer: 55.36 | batch generator: 0.86 + samples/sec: 6.595 | iteration 23200/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 2.971E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.421747E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.66 | backward: 1804.13 | backward-backward: 1804.11 | backward-allreduce: 0.00 | optimizer: 54.93 | batch generator: 0.78 + samples/sec: 6.595 | iteration 23300/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 2.971E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.399332E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.89 | backward: 1804.31 | backward-backward: 1804.29 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.79 + samples/sec: 6.588 | iteration 23400/ 320000 | elapsed time per iteration (ms): 2428.8 | learning rate: 2.971E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.423455E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.84 | backward: 1805.76 | backward-backward: 1805.74 | backward-allreduce: 0.00 | optimizer: 55.81 | batch generator: 0.77 + samples/sec: 6.597 | iteration 
23500/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 2.970E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.404994E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.19 | backward: 1803.38 | backward-backward: 1803.36 | backward-allreduce: 0.00 | optimizer: 55.24 | batch generator: 0.78 + samples/sec: 6.595 | iteration 23600/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 2.970E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.393764E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.26 | backward: 1804.12 | backward-backward: 1804.10 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.77 + samples/sec: 6.589 | iteration 23700/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 2.970E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.408431E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.84 | backward: 1805.24 | backward-backward: 1805.21 | backward-allreduce: 0.00 | optimizer: 55.93 | batch generator: 0.80 + samples/sec: 6.594 | iteration 23800/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 2.970E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.368959E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.03 | backward: 1803.93 | backward-backward: 1803.91 | backward-allreduce: 0.00 | optimizer: 56.16 | batch generator: 0.80 + samples/sec: 6.588 | iteration 23900/ 320000 | elapsed time per iteration (ms): 2428.6 | learning rate: 2.969E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.400615E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.51 | backward: 1806.28 | backward-backward: 1806.26 | backward-allreduce: 0.00 | optimizer: 55.47 | 
batch generator: 0.76 + samples/sec: 6.590 | iteration 24000/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 2.969E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.398347E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.89 | backward: 1804.94 | backward-backward: 1804.92 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.80 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 24000 | lm_loss value: 3.331427E+00 | lm_loss_ppl value: 2.797824E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.444 | iteration 24100/ 320000 | elapsed time per iteration (ms): 2482.9 | learning rate: 2.969E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.387838E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.11 | backward: 1803.86 | backward-backward: 1803.84 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.87 + samples/sec: 6.591 | iteration 24200/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 2.968E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.372586E+00 | loss scale: 65536.0 | number of skipped iterations: 2 | number of nan iterations: 0 | +time (ms) | forward: 566.78 | backward: 1805.57 | backward-backward: 1805.55 | backward-allreduce: 0.00 | optimizer: 54.73 | batch generator: 0.79 + samples/sec: 6.593 | iteration 24300/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 2.968E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.394916E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.78 | backward: 1804.37 | backward-backward: 1804.34 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.80 + 
samples/sec: 6.595 | iteration 24400/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 2.968E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.380905E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.23 | backward: 1803.85 | backward-backward: 1803.82 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.80 + samples/sec: 6.589 | iteration 24500/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 2.967E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.396344E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.85 | backward: 1806.02 | backward-backward: 1806.00 | backward-allreduce: 0.00 | optimizer: 55.06 | batch generator: 0.78 + samples/sec: 6.598 | iteration 24600/ 320000 | elapsed time per iteration (ms): 2424.9 | learning rate: 2.967E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.393440E+00 | loss scale: 16384.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.21 | backward: 1803.02 | backward-backward: 1803.00 | backward-allreduce: 0.00 | optimizer: 55.31 | batch generator: 0.81 + samples/sec: 6.593 | iteration 24700/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 2.967E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.370731E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.46 | backward: 1803.88 | backward-backward: 1803.85 | backward-allreduce: 0.00 | optimizer: 55.87 | batch generator: 0.82 + samples/sec: 6.592 | iteration 24800/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 2.967E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.387773E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.85 | backward: 1804.46 | backward-backward: 1804.43 | 
backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.78 + samples/sec: 6.598 | iteration 24900/ 320000 | elapsed time per iteration (ms): 2425.0 | learning rate: 2.966E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.374792E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.21 | backward: 1802.21 | backward-backward: 1802.18 | backward-allreduce: 0.00 | optimizer: 56.15 | batch generator: 0.95 + samples/sec: 6.590 | iteration 25000/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 2.966E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.391769E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.63 | backward: 1805.21 | backward-backward: 1805.18 | backward-allreduce: 0.00 | optimizer: 55.80 | batch generator: 0.79 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 25000 | lm_loss value: 3.416953E+00 | lm_loss_ppl value: 3.047641E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.444 | iteration 25100/ 320000 | elapsed time per iteration (ms): 2482.9 | learning rate: 2.966E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.398478E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.64 | backward: 1803.32 | backward-backward: 1803.30 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.87 + samples/sec: 6.599 | iteration 25200/ 320000 | elapsed time per iteration (ms): 2424.6 | learning rate: 2.965E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.359407E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.04 | backward: 1802.92 | backward-backward: 1802.89 | backward-allreduce: 0.00 | 
optimizer: 55.30 | batch generator: 0.77 + samples/sec: 6.591 | iteration 25300/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 2.965E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.380056E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.90 | backward: 1804.82 | backward-backward: 1804.80 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.76 + samples/sec: 6.597 | iteration 25400/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 2.965E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.377707E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.39 | backward: 1803.07 | backward-backward: 1803.04 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.74 + samples/sec: 6.596 | iteration 25500/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 2.964E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.361227E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.17 | backward: 1803.31 | backward-backward: 1803.29 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.79 + samples/sec: 6.590 | iteration 25600/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 2.964E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.371947E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.98 | backward: 1804.97 | backward-backward: 1804.94 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.79 + samples/sec: 6.598 | iteration 25700/ 320000 | elapsed time per iteration (ms): 2424.9 | learning rate: 2.964E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.361889E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.17 | backward: 1802.98 | 
backward-backward: 1802.96 | backward-allreduce: 0.00 | optimizer: 55.40 | batch generator: 0.80 + samples/sec: 6.591 | iteration 25800/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 2.963E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.358741E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.08 | backward: 1804.57 | backward-backward: 1804.54 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.89 + samples/sec: 6.593 | iteration 25900/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 2.963E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.334632E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.92 | backward: 1803.73 | backward-backward: 1803.71 | backward-allreduce: 0.00 | optimizer: 55.77 | batch generator: 0.80 + samples/sec: 6.595 | iteration 26000/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 2.963E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.389368E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.02 | backward: 1803.50 | backward-backward: 1803.47 | backward-allreduce: 0.00 | optimizer: 56.16 | batch generator: 0.82 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 26000 | lm_loss value: 3.361182E+00 | lm_loss_ppl value: 2.882323E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.438 | iteration 26100/ 320000 | elapsed time per iteration (ms): 2485.1 | learning rate: 2.962E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.346477E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.79 | backward: 1805.49 | backward-backward: 1805.47 | 
backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.88 + samples/sec: 6.595 | iteration 26200/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 2.962E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.362013E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.41 | backward: 1803.67 | backward-backward: 1803.65 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.79 + samples/sec: 6.596 | iteration 26300/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 2.962E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.365816E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.15 | backward: 1803.58 | backward-backward: 1803.55 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.79 + samples/sec: 6.590 | iteration 26400/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 2.961E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.350198E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.88 | backward: 1805.32 | backward-backward: 1805.29 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.77 + samples/sec: 6.597 | iteration 26500/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 2.961E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.364846E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.41 | backward: 1802.98 | backward-backward: 1802.95 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.79 + samples/sec: 6.594 | iteration 26600/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 2.961E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.353881E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 
566.61 | backward: 1803.71 | backward-backward: 1803.69 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.82 + samples/sec: 6.589 | iteration 26700/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 2.960E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.360777E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.02 | backward: 1805.17 | backward-backward: 1805.15 | backward-allreduce: 0.00 | optimizer: 55.81 | batch generator: 0.80 + samples/sec: 6.596 | iteration 26800/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 2.960E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.374823E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.54 | backward: 1803.22 | backward-backward: 1803.20 | backward-allreduce: 0.00 | optimizer: 55.41 | batch generator: 0.81 + samples/sec: 6.590 | iteration 26900/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 2.960E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.342791E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.44 | backward: 1805.36 | backward-backward: 1805.34 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.77 + samples/sec: 6.593 | iteration 27000/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 2.959E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.348108E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.72 | backward: 1804.45 | backward-backward: 1804.43 | backward-allreduce: 0.00 | optimizer: 55.17 | batch generator: 0.75 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 27000 | lm_loss value: 3.362614E+00 | lm_loss_ppl value: 2.886456E+01 | 
+---------------------------------------------------------------------------------------------------------- + samples/sec: 6.443 | iteration 27100/ 320000 | elapsed time per iteration (ms): 2483.2 | learning rate: 2.959E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.357746E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.98 | backward: 1803.92 | backward-backward: 1803.89 | backward-allreduce: 0.00 | optimizer: 56.16 | batch generator: 0.82 + samples/sec: 6.587 | iteration 27200/ 320000 | elapsed time per iteration (ms): 2428.9 | learning rate: 2.959E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.346404E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.83 | backward: 1806.19 | backward-backward: 1806.17 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.77 + samples/sec: 6.596 | iteration 27300/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 2.958E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.379757E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.39 | backward: 1803.71 | backward-backward: 1803.68 | backward-allreduce: 0.00 | optimizer: 55.40 | batch generator: 0.77 + samples/sec: 6.596 | iteration 27400/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 2.958E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.358697E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.17 | backward: 1803.75 | backward-backward: 1803.73 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.80 + samples/sec: 6.588 | iteration 27500/ 320000 | elapsed time per iteration (ms): 2428.7 | learning rate: 2.958E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.359658E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | +time (ms) | forward: 566.90 | backward: 1805.65 | backward-backward: 1805.62 | backward-allreduce: 0.00 | optimizer: 55.80 | batch generator: 0.81 + samples/sec: 6.597 | iteration 27600/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 2.957E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.359825E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.32 | backward: 1803.13 | backward-backward: 1803.10 | backward-allreduce: 0.00 | optimizer: 55.36 | batch generator: 0.85 + samples/sec: 6.593 | iteration 27700/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 2.957E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.367246E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.38 | backward: 1804.53 | backward-backward: 1804.50 | backward-allreduce: 0.00 | optimizer: 55.41 | batch generator: 0.79 + samples/sec: 6.591 | iteration 27800/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 2.957E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.367898E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.84 | backward: 1805.16 | backward-backward: 1805.14 | backward-allreduce: 0.00 | optimizer: 55.05 | batch generator: 0.78 + samples/sec: 6.597 | iteration 27900/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 2.956E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.359609E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.54 | backward: 1803.07 | backward-backward: 1803.05 | backward-allreduce: 0.00 | optimizer: 55.42 | batch generator: 0.83 + samples/sec: 6.593 | iteration 28000/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 2.956E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.333290E+00 | loss scale: 
32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.53 | backward: 1804.53 | backward-backward: 1804.50 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.78 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 28000 | lm_loss value: 3.314011E+00 | lm_loss_ppl value: 2.749520E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.440 | iteration 28100/ 320000 | elapsed time per iteration (ms): 2484.6 | learning rate: 2.956E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.349198E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.10 | backward: 1804.84 | backward-backward: 1804.82 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.85 + samples/sec: 6.596 | iteration 28200/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 2.955E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.351786E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.14 | backward: 1803.07 | backward-backward: 1803.04 | backward-allreduce: 0.00 | optimizer: 55.97 | batch generator: 0.79 + samples/sec: 6.588 | iteration 28300/ 320000 | elapsed time per iteration (ms): 2428.7 | learning rate: 2.955E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.335667E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.65 | backward: 1805.74 | backward-backward: 1805.72 | backward-allreduce: 0.00 | optimizer: 55.93 | batch generator: 0.77 + samples/sec: 6.592 | iteration 28400/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 2.954E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.356590E+00 | loss scale: 32768.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.30 | backward: 1803.78 | backward-backward: 1803.76 | backward-allreduce: 0.00 | optimizer: 55.89 | batch generator: 0.93 + samples/sec: 6.598 | iteration 28500/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 2.954E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.348547E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.01 | backward: 1803.26 | backward-backward: 1803.23 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.78 + samples/sec: 6.587 | iteration 28600/ 320000 | elapsed time per iteration (ms): 2428.8 | learning rate: 2.954E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.352981E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.87 | backward: 1806.02 | backward-backward: 1805.99 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.78 + samples/sec: 6.596 | iteration 28700/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 2.953E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.352268E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.78 | backward: 1803.15 | backward-backward: 1803.13 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.79 + samples/sec: 6.593 | iteration 28800/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 2.953E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.352967E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.18 | backward: 1804.46 | backward-backward: 1804.44 | backward-allreduce: 0.00 | optimizer: 55.87 | batch generator: 0.79 + samples/sec: 6.589 | iteration 28900/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 2.953E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 
3.353538E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.05 | backward: 1805.36 | backward-backward: 1805.33 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.79 + samples/sec: 6.597 | iteration 29000/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 2.952E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.339279E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.43 | backward: 1803.25 | backward-backward: 1803.22 | backward-allreduce: 0.00 | optimizer: 55.15 | batch generator: 0.82 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 29000 | lm_loss value: 3.425852E+00 | lm_loss_ppl value: 3.074882E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.441 | iteration 29100/ 320000 | elapsed time per iteration (ms): 2483.9 | learning rate: 2.952E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.345793E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.54 | backward: 1804.72 | backward-backward: 1804.70 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.85 + samples/sec: 6.591 | iteration 29200/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 2.952E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.336689E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.01 | backward: 1804.61 | backward-backward: 1804.59 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.77 + samples/sec: 6.593 | iteration 29300/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 2.951E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.342727E+00 | loss scale: 
65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.95 | backward: 1804.16 | backward-backward: 1804.13 | backward-allreduce: 0.00 | optimizer: 56.34 | batch generator: 0.80 + samples/sec: 6.590 | iteration 29400/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 2.951E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.332814E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.98 | backward: 1805.12 | backward-backward: 1805.09 | backward-allreduce: 0.00 | optimizer: 55.30 | batch generator: 0.82 + samples/sec: 6.595 | iteration 29500/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 2.950E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.349707E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.45 | backward: 1803.42 | backward-backward: 1803.39 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.81 + samples/sec: 6.596 | iteration 29600/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 2.950E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.355996E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.18 | backward: 1803.70 | backward-backward: 1803.68 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.79 + samples/sec: 6.589 | iteration 29700/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 2.950E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.343875E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.82 | backward: 1805.35 | backward-backward: 1805.33 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.73 + samples/sec: 6.599 | iteration 29800/ 320000 | elapsed time per iteration (ms): 2424.7 | learning rate: 2.949E-04 | approx flops 
per GPU: 41.0TFLOPS | lm_loss: 3.348473E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.36 | backward: 1802.64 | backward-backward: 1802.62 | backward-allreduce: 0.00 | optimizer: 55.37 | batch generator: 0.81 + samples/sec: 6.595 | iteration 29900/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 2.949E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.334255E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.51 | backward: 1803.65 | backward-backward: 1803.63 | backward-allreduce: 0.00 | optimizer: 55.37 | batch generator: 0.82 + samples/sec: 6.591 | iteration 30000/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 2.949E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.324513E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.91 | backward: 1804.72 | backward-backward: 1804.70 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.85 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 30000 | lm_loss value: 3.317941E+00 | lm_loss_ppl value: 2.760347E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.214 | iteration 30100/ 320000 | elapsed time per iteration (ms): 2574.7 | learning rate: 2.948E-04 | approx flops per GPU: 38.6TFLOPS | lm_loss: 3.333587E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.18 | backward: 1802.75 | backward-backward: 1802.73 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.87 + samples/sec: 6.592 | iteration 30200/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 2.948E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 
3.342040E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.66 | backward: 1804.55 | backward-backward: 1804.52 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.80 + samples/sec: 6.589 | iteration 30300/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 2.947E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.333147E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.21 | backward: 1804.86 | backward-backward: 1804.84 | backward-allreduce: 0.00 | optimizer: 55.94 | batch generator: 0.80 + samples/sec: 6.593 | iteration 30400/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 2.947E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.319361E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.15 | backward: 1804.13 | backward-backward: 1804.10 | backward-allreduce: 0.00 | optimizer: 56.04 | batch generator: 0.82 + samples/sec: 6.588 | iteration 30500/ 320000 | elapsed time per iteration (ms): 2428.6 | learning rate: 2.947E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.347233E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 567.03 | backward: 1805.86 | backward-backward: 1805.83 | backward-allreduce: 0.00 | optimizer: 55.30 | batch generator: 0.84 + samples/sec: 6.594 | iteration 30600/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 2.946E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.336950E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.66 | backward: 1804.07 | backward-backward: 1804.05 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.78 + samples/sec: 6.594 | iteration 30700/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 
2.946E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.329894E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.24 | backward: 1804.01 | backward-backward: 1803.98 | backward-allreduce: 0.00 | optimizer: 55.86 | batch generator: 0.78 + samples/sec: 6.588 | iteration 30800/ 320000 | elapsed time per iteration (ms): 2428.7 | learning rate: 2.945E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.336179E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.33 | backward: 1805.26 | backward-backward: 1805.24 | backward-allreduce: 0.00 | optimizer: 55.76 | batch generator: 0.81 + samples/sec: 6.598 | iteration 30900/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 2.945E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.314680E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.41 | backward: 1803.35 | backward-backward: 1803.32 | backward-allreduce: 0.00 | optimizer: 54.98 | batch generator: 0.77 + samples/sec: 6.594 | iteration 31000/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 2.945E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.341254E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.34 | backward: 1804.11 | backward-backward: 1804.08 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.77 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 31000 | lm_loss value: 3.399720E+00 | lm_loss_ppl value: 2.995571E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.438 | iteration 31100/ 320000 | elapsed time per iteration (ms): 2485.1 | learning rate: 2.944E-04 | approx flops per GPU: 
40.0TFLOPS | lm_loss: 3.309352E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.13 | backward: 1804.92 | backward-backward: 1804.90 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 0.89 + samples/sec: 6.598 | iteration 31200/ 320000 | elapsed time per iteration (ms): 2424.8 | learning rate: 2.944E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.354530E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.10 | backward: 1803.05 | backward-backward: 1803.03 | backward-allreduce: 0.00 | optimizer: 55.31 | batch generator: 0.80 + samples/sec: 6.590 | iteration 31300/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 2.943E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.340010E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.87 | backward: 1804.93 | backward-backward: 1804.91 | backward-allreduce: 0.00 | optimizer: 55.80 | batch generator: 0.89 + samples/sec: 6.594 | iteration 31400/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 2.943E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.337470E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.73 | backward: 1803.69 | backward-backward: 1803.67 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.77 + samples/sec: 6.593 | iteration 31500/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 2.943E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.336060E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.10 | backward: 1804.11 | backward-backward: 1804.08 | backward-allreduce: 0.00 | optimizer: 56.12 | batch generator: 0.77 + samples/sec: 6.587 | iteration 31600/ 320000 | elapsed time per iteration (ms): 
2429.1 | learning rate: 2.942E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.339204E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.41 | backward: 1805.31 | backward-backward: 1805.29 | backward-allreduce: 0.00 | optimizer: 56.03 | batch generator: 0.80 + samples/sec: 6.597 | iteration 31700/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 2.942E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.322900E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.40 | backward: 1803.05 | backward-backward: 1803.02 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.82 + samples/sec: 6.592 | iteration 31800/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 2.941E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.316378E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.69 | backward: 1804.20 | backward-backward: 1804.18 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.82 + samples/sec: 6.591 | iteration 31900/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 2.941E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.323075E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.11 | backward: 1804.45 | backward-backward: 1804.42 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.85 + samples/sec: 6.597 | iteration 32000/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 2.941E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.335767E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.98 | backward: 1803.50 | backward-backward: 1803.48 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.76 
+---------------------------------------------------------------------------------------------------------- + validation results at iteration 32000 | lm_loss value: 3.313113E+00 | lm_loss_ppl value: 2.747050E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.437 | iteration 32100/ 320000 | elapsed time per iteration (ms): 2485.7 | learning rate: 2.940E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.321289E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.80 | backward: 1805.90 | backward-backward: 1805.87 | backward-allreduce: 0.00 | optimizer: 55.80 | batch generator: 0.87 + samples/sec: 6.592 | iteration 32200/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 2.940E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.334215E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.84 | backward: 1804.66 | backward-backward: 1804.63 | backward-allreduce: 0.00 | optimizer: 55.11 | batch generator: 0.78 + samples/sec: 6.590 | iteration 32300/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 2.939E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.319738E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.85 | backward: 1805.17 | backward-backward: 1805.14 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.87 + samples/sec: 6.592 | iteration 32400/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 2.939E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.309179E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.09 | backward: 1804.01 | backward-backward: 1803.98 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.80 + samples/sec: 6.590 | iteration 
32500/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 2.939E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.302378E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.38 | backward: 1805.24 | backward-backward: 1805.22 | backward-allreduce: 0.00 | optimizer: 55.77 | batch generator: 0.78 + samples/sec: 6.588 | iteration 32600/ 320000 | elapsed time per iteration (ms): 2428.7 | learning rate: 2.938E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.294450E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.91 | backward: 1805.13 | backward-backward: 1805.10 | backward-allreduce: 0.00 | optimizer: 56.29 | batch generator: 0.78 + samples/sec: 6.593 | iteration 32700/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 2.938E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.330324E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.29 | backward: 1804.09 | backward-backward: 1804.07 | backward-allreduce: 0.00 | optimizer: 56.00 | batch generator: 0.82 + samples/sec: 6.588 | iteration 32800/ 320000 | elapsed time per iteration (ms): 2428.7 | learning rate: 2.937E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.323299E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.14 | backward: 1805.44 | backward-backward: 1805.41 | backward-allreduce: 0.00 | optimizer: 55.76 | batch generator: 0.80 + samples/sec: 6.598 | iteration 32900/ 320000 | elapsed time per iteration (ms): 2425.0 | learning rate: 2.937E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.313532E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.30 | backward: 1803.28 | backward-backward: 1803.26 | backward-allreduce: 0.00 | optimizer: 55.05 | 
batch generator: 0.82 + samples/sec: 6.590 | iteration 33000/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 2.936E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.322302E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.71 | backward: 1805.29 | backward-backward: 1805.27 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.78 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 33000 | lm_loss value: 3.265535E+00 | lm_loss_ppl value: 2.619412E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.444 | iteration 33100/ 320000 | elapsed time per iteration (ms): 2483.0 | learning rate: 2.936E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.341927E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.72 | backward: 1803.48 | backward-backward: 1803.46 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.87 + samples/sec: 6.593 | iteration 33200/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 2.936E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.327797E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.44 | backward: 1804.17 | backward-backward: 1804.14 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.80 + samples/sec: 6.590 | iteration 33300/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 2.935E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.317283E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.28 | backward: 1804.62 | backward-backward: 1804.60 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.86 + 
samples/sec: 6.595 | iteration 33400/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 2.935E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.314185E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.61 | backward: 1803.77 | backward-backward: 1803.75 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.79 + samples/sec: 6.589 | iteration 33500/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 2.934E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.299505E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.28 | backward: 1805.00 | backward-backward: 1804.98 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.78 + samples/sec: 6.593 | iteration 33600/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 2.934E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.338174E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.30 | backward: 1803.78 | backward-backward: 1803.76 | backward-allreduce: 0.00 | optimizer: 56.30 | batch generator: 0.86 + samples/sec: 6.587 | iteration 33700/ 320000 | elapsed time per iteration (ms): 2429.2 | learning rate: 2.933E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.327827E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.07 | backward: 1806.02 | backward-backward: 1806.00 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.78 + samples/sec: 6.595 | iteration 33800/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 2.933E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.332490E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.63 | backward: 1803.38 | backward-backward: 1803.35 | 
backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.77 + samples/sec: 6.588 | iteration 33900/ 320000 | elapsed time per iteration (ms): 2428.7 | learning rate: 2.933E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.347032E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.86 | backward: 1805.43 | backward-backward: 1805.40 | backward-allreduce: 0.00 | optimizer: 56.02 | batch generator: 0.81 + samples/sec: 6.594 | iteration 34000/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 2.932E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.310110E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.89 | backward: 1803.55 | backward-backward: 1803.53 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.81 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 34000 | lm_loss value: 3.285975E+00 | lm_loss_ppl value: 2.673504E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.441 | iteration 34100/ 320000 | elapsed time per iteration (ms): 2483.9 | learning rate: 2.932E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.296806E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.32 | backward: 1804.65 | backward-backward: 1804.63 | backward-allreduce: 0.00 | optimizer: 55.81 | batch generator: 0.85 + samples/sec: 6.591 | iteration 34200/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 2.931E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.312100E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.65 | backward: 1805.00 | backward-backward: 1804.98 | backward-allreduce: 0.00 | 
optimizer: 55.66 | batch generator: 0.78 + samples/sec: 6.593 | iteration 34300/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 2.931E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.303693E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.49 | backward: 1804.14 | backward-backward: 1804.11 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.98 + samples/sec: 6.586 | iteration 34400/ 320000 | elapsed time per iteration (ms): 2429.2 | learning rate: 2.930E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.308622E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.10 | backward: 1805.70 | backward-backward: 1805.68 | backward-allreduce: 0.00 | optimizer: 56.05 | batch generator: 0.78 + samples/sec: 6.595 | iteration 34500/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 2.930E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.317644E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.63 | backward: 1803.39 | backward-backward: 1803.36 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.79 + samples/sec: 6.589 | iteration 34600/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 2.929E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.304094E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.85 | backward: 1805.16 | backward-backward: 1805.14 | backward-allreduce: 0.00 | optimizer: 55.76 | batch generator: 0.79 + samples/sec: 6.589 | iteration 34700/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 2.929E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.292383E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.89 | backward: 1804.78 | 
backward-backward: 1804.76 | backward-allreduce: 0.00 | optimizer: 56.31 | batch generator: 0.78 + samples/sec: 6.590 | iteration 34800/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 2.929E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.299163E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.49 | backward: 1805.77 | backward-backward: 1805.75 | backward-allreduce: 0.00 | optimizer: 55.32 | batch generator: 0.80 + samples/sec: 6.595 | iteration 34900/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 2.928E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.300848E+00 | loss scale: 65536.0 | number of skipped iterations: 2 | number of nan iterations: 0 | +time (ms) | forward: 566.74 | backward: 1804.25 | backward-backward: 1804.22 | backward-allreduce: 0.00 | optimizer: 54.58 | batch generator: 0.76 + samples/sec: 6.595 | iteration 35000/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 2.928E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.310663E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.10 | backward: 1803.98 | backward-backward: 1803.96 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.81 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 35000 | lm_loss value: 3.304226E+00 | lm_loss_ppl value: 2.722746E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.436 | iteration 35100/ 320000 | elapsed time per iteration (ms): 2486.1 | learning rate: 2.927E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.313981E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.06 | backward: 1805.87 | backward-backward: 1805.84 | 
backward-allreduce: 0.00 | optimizer: 55.95 | batch generator: 0.85 + samples/sec: 6.593 | iteration 35200/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 2.927E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.280543E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.68 | backward: 1804.15 | backward-backward: 1804.13 | backward-allreduce: 0.00 | optimizer: 55.42 | batch generator: 0.81 + samples/sec: 6.586 | iteration 35300/ 320000 | elapsed time per iteration (ms): 2429.5 | learning rate: 2.926E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.293574E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.13 | backward: 1806.15 | backward-backward: 1806.12 | backward-allreduce: 0.00 | optimizer: 55.86 | batch generator: 0.80 + samples/sec: 6.595 | iteration 35400/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 2.926E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.275975E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.68 | backward: 1803.32 | backward-backward: 1803.29 | backward-allreduce: 0.00 | optimizer: 55.82 | batch generator: 0.82 + samples/sec: 6.590 | iteration 35500/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 2.925E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.312343E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.37 | backward: 1805.47 | backward-backward: 1805.45 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.77 + samples/sec: 6.590 | iteration 35600/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 2.925E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.309229E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 
567.13 | backward: 1804.16 | backward-backward: 1804.14 | backward-allreduce: 0.00 | optimizer: 56.10 | batch generator: 0.96 + samples/sec: 6.595 | iteration 35700/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 2.924E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.297406E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.39 | backward: 1803.65 | backward-backward: 1803.63 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.80 + samples/sec: 6.589 | iteration 35800/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 2.924E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.295615E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.00 | backward: 1804.91 | backward-backward: 1804.88 | backward-allreduce: 0.00 | optimizer: 56.13 | batch generator: 0.79 + samples/sec: 6.597 | iteration 35900/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 2.924E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.311627E+00 | loss scale: 65536.0 | number of skipped iterations: 2 | number of nan iterations: 0 | +time (ms) | forward: 566.33 | backward: 1804.07 | backward-backward: 1804.05 | backward-allreduce: 0.00 | optimizer: 54.57 | batch generator: 0.83 + samples/sec: 6.588 | iteration 36000/ 320000 | elapsed time per iteration (ms): 2428.7 | learning rate: 2.923E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.311508E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.83 | backward: 1805.76 | backward-backward: 1805.74 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 0.79 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 36000 | lm_loss value: 3.247501E+00 | lm_loss_ppl value: 2.572596E+01 | 
+---------------------------------------------------------------------------------------------------------- + samples/sec: 6.445 | iteration 36100/ 320000 | elapsed time per iteration (ms): 2482.7 | learning rate: 2.923E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.296137E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.42 | backward: 1803.55 | backward-backward: 1803.52 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.98 + samples/sec: 6.590 | iteration 36200/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 2.922E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.303869E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.81 | backward: 1805.26 | backward-backward: 1805.23 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.75 + samples/sec: 6.596 | iteration 36300/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 2.922E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.299466E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.56 | backward: 1803.53 | backward-backward: 1803.50 | backward-allreduce: 0.00 | optimizer: 55.41 | batch generator: 0.81 + samples/sec: 6.592 | iteration 36400/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 2.921E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.300033E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.47 | backward: 1804.98 | backward-backward: 1804.96 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.78 + samples/sec: 6.597 | iteration 36500/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 2.921E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.284456E+00 | loss scale: 16384.0 | number of skipped iterations: 1 | number of nan 
iterations: 0 | +time (ms) | forward: 566.49 | backward: 1803.31 | backward-backward: 1803.28 | backward-allreduce: 0.00 | optimizer: 55.29 | batch generator: 0.78 + samples/sec: 6.596 | iteration 36600/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 2.920E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.306037E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.40 | backward: 1803.23 | backward-backward: 1803.20 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.82 + samples/sec: 6.589 | iteration 36700/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 2.920E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.276359E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.15 | backward: 1804.85 | backward-backward: 1804.82 | backward-allreduce: 0.00 | optimizer: 55.90 | batch generator: 0.83 + samples/sec: 6.596 | iteration 36800/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 2.919E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.316312E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.61 | backward: 1803.39 | backward-backward: 1803.36 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.87 + samples/sec: 6.587 | iteration 36900/ 320000 | elapsed time per iteration (ms): 2428.9 | learning rate: 2.919E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.287717E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.34 | backward: 1804.84 | backward-backward: 1804.81 | backward-allreduce: 0.00 | optimizer: 56.31 | batch generator: 0.80 + samples/sec: 6.596 | iteration 37000/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 2.918E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.297647E+00 | loss scale: 
16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.42 | backward: 1803.53 | backward-backward: 1803.51 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.77 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 37000 | lm_loss value: 3.269347E+00 | lm_loss_ppl value: 2.629417E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.441 | iteration 37100/ 320000 | elapsed time per iteration (ms): 2483.9 | learning rate: 2.918E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.318200E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.80 | backward: 1804.17 | backward-backward: 1804.15 | backward-allreduce: 0.00 | optimizer: 55.79 | batch generator: 0.95 + samples/sec: 6.592 | iteration 37200/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 2.917E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.308899E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.27 | backward: 1803.85 | backward-backward: 1803.83 | backward-allreduce: 0.00 | optimizer: 55.78 | batch generator: 0.79 + samples/sec: 6.593 | iteration 37300/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 2.917E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.301365E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.55 | backward: 1804.05 | backward-backward: 1804.02 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.80 + samples/sec: 6.589 | iteration 37400/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 2.916E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.295309E+00 | loss scale: 16384.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.73 | backward: 1804.66 | backward-backward: 1804.63 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.79 + samples/sec: 6.597 | iteration 37500/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 2.916E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.289921E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.37 | backward: 1802.92 | backward-backward: 1802.90 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.82 + samples/sec: 6.587 | iteration 37600/ 320000 | elapsed time per iteration (ms): 2429.2 | learning rate: 2.915E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.292361E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.05 | backward: 1805.99 | backward-backward: 1805.97 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.79 + samples/sec: 6.596 | iteration 37700/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 2.915E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.306301E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.60 | backward: 1803.05 | backward-backward: 1803.03 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.78 + samples/sec: 6.588 | iteration 37800/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 2.915E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.306010E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.70 | backward: 1805.65 | backward-backward: 1805.62 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.76 + samples/sec: 6.591 | iteration 37900/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 2.914E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 
3.277820E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.23 | backward: 1804.35 | backward-backward: 1804.32 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.85 + samples/sec: 6.590 | iteration 38000/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 2.914E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.281407E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.64 | backward: 1804.55 | backward-backward: 1804.53 | backward-allreduce: 0.00 | optimizer: 56.33 | batch generator: 0.83 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 38000 | lm_loss value: 3.283272E+00 | lm_loss_ppl value: 2.666286E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.435 | iteration 38100/ 320000 | elapsed time per iteration (ms): 2486.5 | learning rate: 2.913E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.279669E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.68 | backward: 1805.99 | backward-backward: 1805.96 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.88 + samples/sec: 6.594 | iteration 38200/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 2.913E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.294324E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.77 | backward: 1803.73 | backward-backward: 1803.70 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.96 + samples/sec: 6.591 | iteration 38300/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 2.912E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.302659E+00 | loss scale: 
32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.77 | backward: 1805.13 | backward-backward: 1805.11 | backward-allreduce: 0.00 | optimizer: 55.37 | batch generator: 0.79 + samples/sec: 6.595 | iteration 38400/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 2.912E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.279781E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.53 | backward: 1803.76 | backward-backward: 1803.74 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.77 + samples/sec: 6.588 | iteration 38500/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 2.911E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.270660E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.88 | backward: 1805.55 | backward-backward: 1805.53 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.81 + samples/sec: 6.595 | iteration 38600/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 2.911E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.285378E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.38 | backward: 1803.95 | backward-backward: 1803.92 | backward-allreduce: 0.00 | optimizer: 55.38 | batch generator: 0.79 + samples/sec: 6.587 | iteration 38700/ 320000 | elapsed time per iteration (ms): 2428.9 | learning rate: 2.910E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.273495E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.52 | backward: 1806.17 | backward-backward: 1806.15 | backward-allreduce: 0.00 | optimizer: 55.81 | batch generator: 0.81 + samples/sec: 6.592 | iteration 38800/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 2.910E-04 | approx flops 
per GPU: 40.9TFLOPS | lm_loss: 3.297963E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.92 | backward: 1804.65 | backward-backward: 1804.62 | backward-allreduce: 0.00 | optimizer: 55.42 | batch generator: 0.81 + samples/sec: 6.590 | iteration 38900/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 2.909E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.290584E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.52 | backward: 1805.97 | backward-backward: 1805.95 | backward-allreduce: 0.00 | optimizer: 55.22 | batch generator: 0.78 + samples/sec: 6.589 | iteration 39000/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 2.909E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.278093E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.03 | backward: 1805.04 | backward-backward: 1805.01 | backward-allreduce: 0.00 | optimizer: 55.81 | batch generator: 0.82 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 39000 | lm_loss value: 3.267773E+00 | lm_loss_ppl value: 2.625280E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.440 | iteration 39100/ 320000 | elapsed time per iteration (ms): 2484.4 | learning rate: 2.908E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.308775E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.35 | backward: 1805.42 | backward-backward: 1805.40 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.86 + samples/sec: 6.585 | iteration 39200/ 320000 | elapsed time per iteration (ms): 2429.8 | learning rate: 2.908E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 
3.292415E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.47 | backward: 1805.97 | backward-backward: 1805.95 | backward-allreduce: 0.00 | optimizer: 55.92 | batch generator: 0.80 + samples/sec: 6.596 | iteration 39300/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 2.907E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.282884E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.47 | backward: 1803.27 | backward-backward: 1803.24 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.79 + samples/sec: 6.584 | iteration 39400/ 320000 | elapsed time per iteration (ms): 2430.0 | learning rate: 2.907E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.283084E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.24 | backward: 1806.37 | backward-backward: 1806.34 | backward-allreduce: 0.00 | optimizer: 55.92 | batch generator: 0.77 + samples/sec: 6.594 | iteration 39500/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 2.906E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.299297E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.83 | backward: 1803.65 | backward-backward: 1803.63 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.81 + samples/sec: 6.587 | iteration 39600/ 320000 | elapsed time per iteration (ms): 2429.1 | learning rate: 2.905E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.271527E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.93 | backward: 1806.03 | backward-backward: 1806.01 | backward-allreduce: 0.00 | optimizer: 55.76 | batch generator: 0.86 + samples/sec: 6.588 | iteration 39700/ 320000 | elapsed time per iteration (ms): 2428.7 | learning rate: 
2.905E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.264084E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.62 | backward: 1805.01 | backward-backward: 1804.98 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.81 + samples/sec: 6.593 | iteration 39800/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 2.904E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.290982E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.65 | backward: 1804.29 | backward-backward: 1804.27 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.91 + samples/sec: 6.585 | iteration 39900/ 320000 | elapsed time per iteration (ms): 2429.7 | learning rate: 2.904E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.266765E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.32 | backward: 1806.03 | backward-backward: 1806.01 | backward-allreduce: 0.00 | optimizer: 55.95 | batch generator: 0.81 + samples/sec: 6.596 | iteration 40000/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 2.903E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.280063E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.50 | backward: 1803.38 | backward-backward: 1803.35 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.80 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 40000 | lm_loss value: 3.302979E+00 | lm_loss_ppl value: 2.719351E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.227 | iteration 40100/ 320000 | elapsed time per iteration (ms): 2569.3 | learning rate: 2.903E-04 | approx flops per GPU: 
38.7TFLOPS | lm_loss: 3.288339E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.30 | backward: 1805.14 | backward-backward: 1805.12 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.88 + samples/sec: 6.580 | iteration 40200/ 320000 | elapsed time per iteration (ms): 2431.7 | learning rate: 2.902E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.292398E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 568.09 | backward: 1806.80 | backward-backward: 1806.78 | backward-allreduce: 0.00 | optimizer: 56.43 | batch generator: 0.84 + samples/sec: 6.596 | iteration 40300/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 2.902E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.287114E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.46 | backward: 1803.35 | backward-backward: 1803.33 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.81 + samples/sec: 6.585 | iteration 40400/ 320000 | elapsed time per iteration (ms): 2429.6 | learning rate: 2.901E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.310957E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.88 | backward: 1806.47 | backward-backward: 1806.44 | backward-allreduce: 0.00 | optimizer: 55.86 | batch generator: 0.80 + samples/sec: 6.590 | iteration 40500/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 2.901E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.274646E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.41 | backward: 1804.65 | backward-backward: 1804.63 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.81 + samples/sec: 6.589 | iteration 40600/ 320000 | elapsed time per iteration (ms): 
2428.3 | learning rate: 2.900E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.259879E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.75 | backward: 1805.66 | backward-backward: 1805.64 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.78 + samples/sec: 6.585 | iteration 40700/ 320000 | elapsed time per iteration (ms): 2429.6 | learning rate: 2.900E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.281866E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.56 | backward: 1806.05 | backward-backward: 1806.03 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.77 + samples/sec: 6.597 | iteration 40800/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 2.899E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.275925E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.34 | backward: 1803.67 | backward-backward: 1803.65 | backward-allreduce: 0.00 | optimizer: 55.03 | batch generator: 0.85 + samples/sec: 6.587 | iteration 40900/ 320000 | elapsed time per iteration (ms): 2429.1 | learning rate: 2.899E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.279842E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.29 | backward: 1805.69 | backward-backward: 1805.67 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.79 + samples/sec: 6.591 | iteration 41000/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 2.898E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.262370E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.97 | backward: 1804.86 | backward-backward: 1804.84 | backward-allreduce: 0.00 | optimizer: 55.21 | batch generator: 0.78 
+---------------------------------------------------------------------------------------------------------- + validation results at iteration 41000 | lm_loss value: 3.149135E+00 | lm_loss_ppl value: 2.331588E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.442 | iteration 41100/ 320000 | elapsed time per iteration (ms): 2483.7 | learning rate: 2.898E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.303012E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.64 | backward: 1804.09 | backward-backward: 1804.07 | backward-allreduce: 0.00 | optimizer: 55.83 | batch generator: 0.87 + samples/sec: 6.594 | iteration 41200/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 2.897E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.279580E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.62 | backward: 1803.87 | backward-backward: 1803.85 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.82 + samples/sec: 6.590 | iteration 41300/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 2.897E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.263164E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.44 | backward: 1804.48 | backward-backward: 1804.46 | backward-allreduce: 0.00 | optimizer: 56.62 | batch generator: 0.79 + samples/sec: 6.593 | iteration 41400/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 2.896E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.259227E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.64 | backward: 1804.11 | backward-backward: 1804.09 | backward-allreduce: 0.00 | optimizer: 55.78 | batch generator: 0.84 + samples/sec: 6.589 | iteration 
41500/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 2.895E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.276266E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.70 | backward: 1805.37 | backward-backward: 1805.35 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.80 + samples/sec: 6.592 | iteration 41600/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 2.895E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.260909E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.79 | backward: 1804.41 | backward-backward: 1804.39 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.80 + samples/sec: 6.592 | iteration 41700/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 2.894E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.300620E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.49 | backward: 1804.72 | backward-backward: 1804.70 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.78 + samples/sec: 6.587 | iteration 41800/ 320000 | elapsed time per iteration (ms): 2428.9 | learning rate: 2.894E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.266699E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.16 | backward: 1805.29 | backward-backward: 1805.26 | backward-allreduce: 0.00 | optimizer: 56.04 | batch generator: 0.78 + samples/sec: 6.595 | iteration 41900/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 2.893E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.259993E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.35 | backward: 1803.72 | backward-backward: 1803.69 | backward-allreduce: 0.00 | optimizer: 55.55 | 
batch generator: 0.78 + samples/sec: 6.586 | iteration 42000/ 320000 | elapsed time per iteration (ms): 2429.5 | learning rate: 2.893E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.256994E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.13 | backward: 1806.30 | backward-backward: 1806.27 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.79 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 42000 | lm_loss value: 3.222668E+00 | lm_loss_ppl value: 2.509499E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.446 | iteration 42100/ 320000 | elapsed time per iteration (ms): 2482.1 | learning rate: 2.892E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.267379E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.60 | backward: 1803.27 | backward-backward: 1803.25 | backward-allreduce: 0.00 | optimizer: 55.05 | batch generator: 0.85 + samples/sec: 6.583 | iteration 42200/ 320000 | elapsed time per iteration (ms): 2430.4 | learning rate: 2.892E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.274846E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 567.28 | backward: 1806.97 | backward-backward: 1806.94 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.81 + samples/sec: 6.592 | iteration 42300/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 2.891E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.267472E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.89 | backward: 1803.97 | backward-backward: 1803.94 | backward-allreduce: 0.00 | optimizer: 55.96 | batch generator: 0.80 + 
samples/sec: 6.587 | iteration 42400/ 320000 | elapsed time per iteration (ms): 2429.1 | learning rate: 2.891E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.279415E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.56 | backward: 1805.80 | backward-backward: 1805.77 | backward-allreduce: 0.00 | optimizer: 56.33 | batch generator: 0.88 + samples/sec: 6.588 | iteration 42500/ 320000 | elapsed time per iteration (ms): 2428.7 | learning rate: 2.890E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.276018E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.05 | backward: 1805.55 | backward-backward: 1805.52 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.76 + samples/sec: 6.595 | iteration 42600/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 2.889E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.267664E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.22 | backward: 1804.00 | backward-backward: 1803.98 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.75 + samples/sec: 6.585 | iteration 42700/ 320000 | elapsed time per iteration (ms): 2429.7 | learning rate: 2.889E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.283136E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.10 | backward: 1806.34 | backward-backward: 1806.31 | backward-allreduce: 0.00 | optimizer: 55.85 | batch generator: 0.79 + samples/sec: 6.594 | iteration 42800/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 2.888E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.270929E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.02 | backward: 1803.26 | backward-backward: 1803.23 | 
backward-allreduce: 0.00 | optimizer: 55.76 | batch generator: 0.82 + samples/sec: 6.585 | iteration 42900/ 320000 | elapsed time per iteration (ms): 2429.8 | learning rate: 2.888E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.263475E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.11 | backward: 1806.29 | backward-backward: 1806.26 | backward-allreduce: 0.00 | optimizer: 56.04 | batch generator: 0.81 + samples/sec: 6.595 | iteration 43000/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 2.887E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.273058E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.89 | backward: 1803.16 | backward-backward: 1803.14 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.80 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 43000 | lm_loss value: 3.251161E+00 | lm_loss_ppl value: 2.582030E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.436 | iteration 43100/ 320000 | elapsed time per iteration (ms): 2485.9 | learning rate: 2.887E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.283583E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.92 | backward: 1806.01 | backward-backward: 1805.98 | backward-allreduce: 0.00 | optimizer: 55.81 | batch generator: 0.92 + samples/sec: 6.590 | iteration 43200/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 2.886E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.258838E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.98 | backward: 1804.55 | backward-backward: 1804.52 | backward-allreduce: 0.00 | 
optimizer: 55.86 | batch generator: 0.80 + samples/sec: 6.593 | iteration 43300/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 2.886E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.275416E+00 | loss scale: 32768.0 | number of skipped iterations: 2 | number of nan iterations: 0 | +time (ms) | forward: 566.52 | backward: 1805.05 | backward-backward: 1805.03 | backward-allreduce: 0.00 | optimizer: 54.85 | batch generator: 0.84 + samples/sec: 6.584 | iteration 43400/ 320000 | elapsed time per iteration (ms): 2430.1 | learning rate: 2.885E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.265067E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.34 | backward: 1806.02 | backward-backward: 1806.00 | backward-allreduce: 0.00 | optimizer: 56.42 | batch generator: 0.78 + samples/sec: 6.594 | iteration 43500/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 2.884E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.258336E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.30 | backward: 1804.06 | backward-backward: 1804.04 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.79 + samples/sec: 6.584 | iteration 43600/ 320000 | elapsed time per iteration (ms): 2430.1 | learning rate: 2.884E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.273667E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.27 | backward: 1806.67 | backward-backward: 1806.65 | backward-allreduce: 0.00 | optimizer: 55.78 | batch generator: 0.77 + samples/sec: 6.593 | iteration 43700/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 2.883E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.260529E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.70 | backward: 1803.85 | 
backward-backward: 1803.82 | backward-allreduce: 0.00 | optimizer: 55.97 | batch generator: 0.82 + samples/sec: 6.588 | iteration 43800/ 320000 | elapsed time per iteration (ms): 2428.8 | learning rate: 2.883E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.292771E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.68 | backward: 1806.12 | backward-backward: 1806.09 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.80 + samples/sec: 6.591 | iteration 43900/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 2.882E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.252132E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.01 | backward: 1804.68 | backward-backward: 1804.66 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.80 + samples/sec: 6.594 | iteration 44000/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 2.882E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.274056E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.31 | backward: 1803.99 | backward-backward: 1803.97 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.78 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 44000 | lm_loss value: 3.269602E+00 | lm_loss_ppl value: 2.630088E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.435 | iteration 44100/ 320000 | elapsed time per iteration (ms): 2486.4 | learning rate: 2.881E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.247450E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.28 | backward: 1806.16 | backward-backward: 1806.14 | 
backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.87 + samples/sec: 6.594 | iteration 44200/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 2.880E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.276892E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.06 | backward: 1803.48 | backward-backward: 1803.46 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.80 + samples/sec: 6.585 | iteration 44300/ 320000 | elapsed time per iteration (ms): 2429.8 | learning rate: 2.880E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.256718E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.09 | backward: 1806.68 | backward-backward: 1806.65 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.79 + samples/sec: 6.592 | iteration 44400/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 2.879E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.271162E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.72 | backward: 1803.91 | backward-backward: 1803.89 | backward-allreduce: 0.00 | optimizer: 56.00 | batch generator: 0.81 + samples/sec: 6.588 | iteration 44500/ 320000 | elapsed time per iteration (ms): 2428.6 | learning rate: 2.879E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.276075E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.49 | backward: 1805.53 | backward-backward: 1805.50 | backward-allreduce: 0.00 | optimizer: 56.18 | batch generator: 0.78 + samples/sec: 6.588 | iteration 44600/ 320000 | elapsed time per iteration (ms): 2428.7 | learning rate: 2.878E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.243927E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 
566.95 | backward: 1805.70 | backward-backward: 1805.67 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.84 + samples/sec: 6.591 | iteration 44700/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 2.878E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.275547E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.38 | backward: 1805.05 | backward-backward: 1805.03 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.79 + samples/sec: 6.588 | iteration 44800/ 320000 | elapsed time per iteration (ms): 2428.6 | learning rate: 2.877E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.247580E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.99 | backward: 1806.15 | backward-backward: 1806.13 | backward-allreduce: 0.00 | optimizer: 55.07 | batch generator: 0.81 + samples/sec: 6.594 | iteration 44900/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 2.876E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.227776E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.49 | backward: 1804.01 | backward-backward: 1803.98 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.76 + samples/sec: 6.584 | iteration 45000/ 320000 | elapsed time per iteration (ms): 2430.0 | learning rate: 2.876E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.266314E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.96 | backward: 1806.79 | backward-backward: 1806.77 | backward-allreduce: 0.00 | optimizer: 55.86 | batch generator: 0.79 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 45000 | lm_loss value: 3.314288E+00 | lm_loss_ppl value: 2.750280E+01 | 
+---------------------------------------------------------------------------------------------------------- + samples/sec: 6.442 | iteration 45100/ 320000 | elapsed time per iteration (ms): 2483.7 | learning rate: 2.875E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.254685E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.69 | backward: 1804.11 | backward-backward: 1804.08 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.87 + samples/sec: 6.584 | iteration 45200/ 320000 | elapsed time per iteration (ms): 2430.2 | learning rate: 2.875E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.265681E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.05 | backward: 1806.63 | backward-backward: 1806.60 | backward-allreduce: 0.00 | optimizer: 56.12 | batch generator: 0.79 + samples/sec: 6.592 | iteration 45300/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 2.874E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.267998E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.90 | backward: 1804.55 | backward-backward: 1804.53 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.77 + samples/sec: 6.592 | iteration 45400/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 2.873E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.273783E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.51 | backward: 1804.79 | backward-backward: 1804.76 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.78 + samples/sec: 6.589 | iteration 45500/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 2.873E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.268258E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan 
iterations: 0 | +time (ms) | forward: 566.95 | backward: 1805.62 | backward-backward: 1805.59 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.78 + samples/sec: 6.593 | iteration 45600/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 2.872E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.247079E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.48 | backward: 1803.77 | backward-backward: 1803.75 | backward-allreduce: 0.00 | optimizer: 56.36 | batch generator: 0.77 + samples/sec: 6.583 | iteration 45700/ 320000 | elapsed time per iteration (ms): 2430.6 | learning rate: 2.872E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.263759E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.23 | backward: 1807.14 | backward-backward: 1807.11 | backward-allreduce: 0.00 | optimizer: 55.89 | batch generator: 0.82 + samples/sec: 6.597 | iteration 45800/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 2.871E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.264288E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.62 | backward: 1802.88 | backward-backward: 1802.85 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.79 + samples/sec: 6.589 | iteration 45900/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 2.870E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.234799E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.61 | backward: 1805.80 | backward-backward: 1805.78 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.80 + samples/sec: 6.592 | iteration 46000/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 2.870E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.269709E+00 | loss scale: 
32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.84 | backward: 1804.26 | backward-backward: 1804.24 | backward-allreduce: 0.00 | optimizer: 55.77 | batch generator: 0.79 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 46000 | lm_loss value: 3.267737E+00 | lm_loss_ppl value: 2.625186E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.437 | iteration 46100/ 320000 | elapsed time per iteration (ms): 2485.5 | learning rate: 2.869E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.250376E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.68 | backward: 1805.92 | backward-backward: 1805.89 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.85 + samples/sec: 6.589 | iteration 46200/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 2.869E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.278786E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.26 | backward: 1804.89 | backward-backward: 1804.87 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.78 + samples/sec: 6.596 | iteration 46300/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 2.868E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.281036E+00 | loss scale: 16384.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.34 | backward: 1803.97 | backward-backward: 1803.95 | backward-allreduce: 0.00 | optimizer: 55.17 | batch generator: 0.79 + samples/sec: 6.588 | iteration 46400/ 320000 | elapsed time per iteration (ms): 2428.6 | learning rate: 2.867E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.252454E+00 | loss scale: 16384.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.88 | backward: 1805.76 | backward-backward: 1805.74 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.78 + samples/sec: 6.594 | iteration 46500/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 2.867E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.260141E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.55 | backward: 1803.23 | backward-backward: 1803.21 | backward-allreduce: 0.00 | optimizer: 56.10 | batch generator: 0.79 + samples/sec: 6.594 | iteration 46600/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 2.866E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.260497E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.71 | backward: 1803.79 | backward-backward: 1803.77 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.97 + samples/sec: 6.591 | iteration 46700/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 2.866E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.254757E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.70 | backward: 1803.84 | backward-backward: 1803.82 | backward-allreduce: 0.00 | optimizer: 56.47 | batch generator: 0.82 + samples/sec: 6.594 | iteration 46800/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 2.865E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.259379E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.50 | backward: 1804.22 | backward-backward: 1804.20 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.80 + samples/sec: 6.589 | iteration 46900/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 2.864E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 
3.237976E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.82 | backward: 1805.44 | backward-backward: 1805.42 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.80 + samples/sec: 6.594 | iteration 47000/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 2.864E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.264636E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.78 | backward: 1803.87 | backward-backward: 1803.84 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.78 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 47000 | lm_loss value: 3.222121E+00 | lm_loss_ppl value: 2.508127E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.444 | iteration 47100/ 320000 | elapsed time per iteration (ms): 2482.9 | learning rate: 2.863E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.266078E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.42 | backward: 1803.48 | backward-backward: 1803.45 | backward-allreduce: 0.00 | optimizer: 55.85 | batch generator: 0.86 + samples/sec: 6.588 | iteration 47200/ 320000 | elapsed time per iteration (ms): 2428.7 | learning rate: 2.863E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.255427E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.89 | backward: 1805.52 | backward-backward: 1805.50 | backward-allreduce: 0.00 | optimizer: 55.81 | batch generator: 0.81 + samples/sec: 6.594 | iteration 47300/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 2.862E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.267843E+00 | loss scale: 
32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.38 | backward: 1804.12 | backward-backward: 1804.10 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.79 + samples/sec: 6.584 | iteration 47400/ 320000 | elapsed time per iteration (ms): 2430.0 | learning rate: 2.861E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.256222E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.15 | backward: 1806.55 | backward-backward: 1806.52 | backward-allreduce: 0.00 | optimizer: 55.89 | batch generator: 0.81 + samples/sec: 6.594 | iteration 47500/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 2.861E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.263672E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.81 | backward: 1803.57 | backward-backward: 1803.54 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.79 + samples/sec: 6.591 | iteration 47600/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 2.860E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.287526E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.44 | backward: 1805.00 | backward-backward: 1804.98 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.79 + samples/sec: 6.590 | iteration 47700/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 2.859E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.233950E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.06 | backward: 1804.96 | backward-backward: 1804.93 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.81 + samples/sec: 6.589 | iteration 47800/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 2.859E-04 | approx flops 
per GPU: 40.9TFLOPS | lm_loss: 3.259295E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.86 | backward: 1804.73 | backward-backward: 1804.70 | backward-allreduce: 0.00 | optimizer: 56.29 | batch generator: 0.86 + samples/sec: 6.586 | iteration 47900/ 320000 | elapsed time per iteration (ms): 2429.5 | learning rate: 2.858E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.255516E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.23 | backward: 1806.26 | backward-backward: 1806.24 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.77 + samples/sec: 6.592 | iteration 48000/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 2.858E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.273506E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.98 | backward: 1803.95 | backward-backward: 1803.93 | backward-allreduce: 0.00 | optimizer: 55.76 | batch generator: 0.81 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 48000 | lm_loss value: 3.316089E+00 | lm_loss_ppl value: 2.755239E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.434 | iteration 48100/ 320000 | elapsed time per iteration (ms): 2486.6 | learning rate: 2.857E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.271648E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.46 | backward: 1806.51 | backward-backward: 1806.48 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.87 + samples/sec: 6.594 | iteration 48200/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 2.856E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 
3.245812E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.64 | backward: 1803.41 | backward-backward: 1803.39 | backward-allreduce: 0.00 | optimizer: 55.85 | batch generator: 0.79 + samples/sec: 6.586 | iteration 48300/ 320000 | elapsed time per iteration (ms): 2429.6 | learning rate: 2.856E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.249074E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 567.10 | backward: 1806.22 | backward-backward: 1806.19 | backward-allreduce: 0.00 | optimizer: 55.86 | batch generator: 0.82 + samples/sec: 6.594 | iteration 48400/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 2.855E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.257038E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.69 | backward: 1803.53 | backward-backward: 1803.51 | backward-allreduce: 0.00 | optimizer: 55.77 | batch generator: 0.80 + samples/sec: 6.594 | iteration 48500/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 2.854E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.242788E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.42 | backward: 1804.13 | backward-backward: 1804.11 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.79 + samples/sec: 6.590 | iteration 48600/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 2.854E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.233116E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.87 | backward: 1805.23 | backward-backward: 1805.20 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.79 + samples/sec: 6.592 | iteration 48700/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 
2.853E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.224156E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.42 | backward: 1804.47 | backward-backward: 1804.45 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.78 + samples/sec: 6.585 | iteration 48800/ 320000 | elapsed time per iteration (ms): 2429.7 | learning rate: 2.853E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.256394E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.25 | backward: 1806.36 | backward-backward: 1806.33 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.79 + samples/sec: 6.593 | iteration 48900/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 2.852E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.216051E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.42 | backward: 1803.71 | backward-backward: 1803.69 | backward-allreduce: 0.00 | optimizer: 56.14 | batch generator: 0.77 + samples/sec: 6.584 | iteration 49000/ 320000 | elapsed time per iteration (ms): 2430.1 | learning rate: 2.851E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.253940E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.05 | backward: 1806.84 | backward-backward: 1806.81 | backward-allreduce: 0.00 | optimizer: 55.85 | batch generator: 0.80 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 49000 | lm_loss value: 3.201081E+00 | lm_loss_ppl value: 2.455907E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.445 | iteration 49100/ 320000 | elapsed time per iteration (ms): 2482.7 | learning rate: 2.851E-04 | approx flops per GPU: 
40.0TFLOPS | lm_loss: 3.234465E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.61 | backward: 1803.32 | backward-backward: 1803.30 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.84 + samples/sec: 6.590 | iteration 49200/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 2.850E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.253250E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.58 | backward: 1805.07 | backward-backward: 1805.05 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.81 + samples/sec: 6.592 | iteration 49300/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 2.849E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.226408E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.84 | backward: 1804.14 | backward-backward: 1804.11 | backward-allreduce: 0.00 | optimizer: 55.89 | batch generator: 0.97 + samples/sec: 5.582 | iteration 49400/ 320000 | elapsed time per iteration (ms): 2866.6 | learning rate: 2.849E-04 | approx flops per GPU: 34.7TFLOPS | lm_loss: 3.241888E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 656.49 | backward: 2108.31 | backward-backward: 2108.28 | backward-allreduce: 0.00 | optimizer: 101.24 | batch generator: 1.05 + samples/sec: 5.324 | iteration 49500/ 320000 | elapsed time per iteration (ms): 3005.4 | learning rate: 2.848E-04 | approx flops per GPU: 33.1TFLOPS | lm_loss: 3.261207E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 687.80 | backward: 2187.52 | backward-backward: 2187.49 | backward-allreduce: 0.00 | optimizer: 129.41 | batch generator: 1.36 + samples/sec: 6.585 | iteration 49600/ 320000 | elapsed time per iteration (ms): 
2429.8 | learning rate: 2.847E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.251077E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.90 | backward: 1806.21 | backward-backward: 1806.19 | backward-allreduce: 0.00 | optimizer: 56.27 | batch generator: 0.87 + samples/sec: 6.590 | iteration 49700/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 2.847E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.260913E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.91 | backward: 1804.84 | backward-backward: 1804.82 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.82 + samples/sec: 6.591 | iteration 49800/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 2.846E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.216264E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.68 | backward: 1804.68 | backward-backward: 1804.66 | backward-allreduce: 0.00 | optimizer: 55.93 | batch generator: 0.80 + samples/sec: 6.588 | iteration 49900/ 320000 | elapsed time per iteration (ms): 2428.8 | learning rate: 2.845E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.254234E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 567.16 | backward: 1805.85 | backward-backward: 1805.82 | backward-allreduce: 0.00 | optimizer: 55.38 | batch generator: 0.79 + samples/sec: 6.584 | iteration 50000/ 320000 | elapsed time per iteration (ms): 2430.2 | learning rate: 2.845E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.252521E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.19 | backward: 1806.30 | backward-backward: 1806.28 | backward-allreduce: 0.00 | optimizer: 56.31 | batch generator: 0.82 
+---------------------------------------------------------------------------------------------------------- + validation results at iteration 50000 | lm_loss value: 3.218520E+00 | lm_loss_ppl value: 2.499110E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.225 | iteration 50100/ 320000 | elapsed time per iteration (ms): 2570.3 | learning rate: 2.844E-04 | approx flops per GPU: 38.7TFLOPS | lm_loss: 3.227791E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 568.40 | backward: 1808.32 | backward-backward: 1808.29 | backward-allreduce: 0.00 | optimizer: 56.17 | batch generator: 0.91 + samples/sec: 6.591 | iteration 50200/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 2.844E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.237579E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.50 | backward: 1804.70 | backward-backward: 1804.68 | backward-allreduce: 0.00 | optimizer: 55.80 | batch generator: 0.81 + samples/sec: 6.582 | iteration 50300/ 320000 | elapsed time per iteration (ms): 2431.0 | learning rate: 2.843E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.234201E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.17 | backward: 1807.17 | backward-backward: 1807.14 | backward-allreduce: 0.00 | optimizer: 56.22 | batch generator: 0.81 + samples/sec: 6.232 | iteration 50400/ 320000 | elapsed time per iteration (ms): 2567.2 | learning rate: 2.842E-04 | approx flops per GPU: 38.7TFLOPS | lm_loss: 3.253631E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 596.10 | backward: 1899.58 | backward-backward: 1899.56 | backward-allreduce: 0.00 | optimizer: 71.00 | batch generator: 1.02 + samples/sec: 6.237 | iteration 
50500/ 320000 | elapsed time per iteration (ms): 2565.5 | learning rate: 2.842E-04 | approx flops per GPU: 38.7TFLOPS | lm_loss: 3.234631E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 599.46 | backward: 1896.86 | backward-backward: 1896.83 | backward-allreduce: 0.00 | optimizer: 68.65 | batch generator: 0.96 + samples/sec: 6.537 | iteration 50600/ 320000 | elapsed time per iteration (ms): 2447.4 | learning rate: 2.841E-04 | approx flops per GPU: 40.6TFLOPS | lm_loss: 3.251991E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 572.42 | backward: 1815.65 | backward-backward: 1815.60 | backward-allreduce: 0.00 | optimizer: 58.64 | batch generator: 1.03 + samples/sec: 6.513 | iteration 50700/ 320000 | elapsed time per iteration (ms): 2456.6 | learning rate: 2.840E-04 | approx flops per GPU: 40.5TFLOPS | lm_loss: 3.245840E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 577.56 | backward: 1816.76 | backward-backward: 1816.67 | backward-allreduce: 0.00 | optimizer: 61.45 | batch generator: 1.22 + samples/sec: 6.480 | iteration 50800/ 320000 | elapsed time per iteration (ms): 2469.2 | learning rate: 2.840E-04 | approx flops per GPU: 40.3TFLOPS | lm_loss: 3.233462E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 581.13 | backward: 1822.97 | backward-backward: 1822.88 | backward-allreduce: 0.00 | optimizer: 64.11 | batch generator: 1.51 + samples/sec: 6.475 | iteration 50900/ 320000 | elapsed time per iteration (ms): 2471.1 | learning rate: 2.839E-04 | approx flops per GPU: 40.2TFLOPS | lm_loss: 3.239586E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 581.72 | backward: 1823.81 | backward-backward: 1823.73 | backward-allreduce: 0.00 | optimizer: 64.53 | 
batch generator: 1.54 + samples/sec: 6.470 | iteration 51000/ 320000 | elapsed time per iteration (ms): 2472.9 | learning rate: 2.838E-04 | approx flops per GPU: 40.2TFLOPS | lm_loss: 3.254668E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 582.63 | backward: 1824.29 | backward-backward: 1824.19 | backward-allreduce: 0.00 | optimizer: 64.99 | batch generator: 1.51 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 51000 | lm_loss value: 3.225231E+00 | lm_loss_ppl value: 2.515939E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.321 | iteration 51100/ 320000 | elapsed time per iteration (ms): 2531.3 | learning rate: 2.838E-04 | approx flops per GPU: 39.3TFLOPS | lm_loss: 3.237078E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 580.90 | backward: 1825.12 | backward-backward: 1825.03 | backward-allreduce: 0.00 | optimizer: 66.51 | batch generator: 1.69 + samples/sec: 6.474 | iteration 51200/ 320000 | elapsed time per iteration (ms): 2471.3 | learning rate: 2.837E-04 | approx flops per GPU: 40.2TFLOPS | lm_loss: 3.260124E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 581.55 | backward: 1824.21 | backward-backward: 1824.12 | backward-allreduce: 0.00 | optimizer: 64.54 | batch generator: 1.42 + samples/sec: 6.474 | iteration 51300/ 320000 | elapsed time per iteration (ms): 2471.4 | learning rate: 2.836E-04 | approx flops per GPU: 40.2TFLOPS | lm_loss: 3.258935E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 581.24 | backward: 1824.21 | backward-backward: 1824.12 | backward-allreduce: 0.00 | optimizer: 64.96 | batch generator: 1.47 + 
samples/sec: 6.482 | iteration 51400/ 320000 | elapsed time per iteration (ms): 2468.4 | learning rate: 2.836E-04 | approx flops per GPU: 40.3TFLOPS | lm_loss: 3.257245E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 581.84 | backward: 1821.22 | backward-backward: 1821.13 | backward-allreduce: 0.00 | optimizer: 64.30 | batch generator: 1.53 + samples/sec: 6.468 | iteration 51500/ 320000 | elapsed time per iteration (ms): 2473.8 | learning rate: 2.835E-04 | approx flops per GPU: 40.2TFLOPS | lm_loss: 3.232244E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 581.51 | backward: 1825.32 | backward-backward: 1825.23 | backward-allreduce: 0.00 | optimizer: 65.88 | batch generator: 1.45 + samples/sec: 6.480 | iteration 51600/ 320000 | elapsed time per iteration (ms): 2469.3 | learning rate: 2.834E-04 | approx flops per GPU: 40.3TFLOPS | lm_loss: 3.242898E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 580.28 | backward: 1823.02 | backward-backward: 1822.93 | backward-allreduce: 0.00 | optimizer: 65.02 | batch generator: 1.41 + samples/sec: 6.467 | iteration 51700/ 320000 | elapsed time per iteration (ms): 2473.9 | learning rate: 2.834E-04 | approx flops per GPU: 40.2TFLOPS | lm_loss: 3.243681E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 582.68 | backward: 1824.89 | backward-backward: 1824.79 | backward-allreduce: 0.00 | optimizer: 65.21 | batch generator: 1.45 + samples/sec: 6.482 | iteration 51800/ 320000 | elapsed time per iteration (ms): 2468.2 | learning rate: 2.833E-04 | approx flops per GPU: 40.3TFLOPS | lm_loss: 3.229672E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 580.69 | backward: 1822.65 | backward-backward: 1822.55 | 
backward-allreduce: 0.00 | optimizer: 63.92 | batch generator: 1.47 + samples/sec: 6.479 | iteration 51900/ 320000 | elapsed time per iteration (ms): 2469.6 | learning rate: 2.832E-04 | approx flops per GPU: 40.2TFLOPS | lm_loss: 3.251024E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 581.89 | backward: 1823.32 | backward-backward: 1823.23 | backward-allreduce: 0.00 | optimizer: 63.39 | batch generator: 1.41 + samples/sec: 6.483 | iteration 52000/ 320000 | elapsed time per iteration (ms): 2468.0 | learning rate: 2.832E-04 | approx flops per GPU: 40.3TFLOPS | lm_loss: 3.221794E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 580.81 | backward: 1822.45 | backward-backward: 1822.37 | backward-allreduce: 0.00 | optimizer: 63.79 | batch generator: 1.42 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 52000 | lm_loss value: 3.265073E+00 | lm_loss_ppl value: 2.618201E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.341 | iteration 52100/ 320000 | elapsed time per iteration (ms): 2523.2 | learning rate: 2.831E-04 | approx flops per GPU: 39.4TFLOPS | lm_loss: 3.234552E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 579.51 | backward: 1820.97 | backward-backward: 1820.88 | backward-allreduce: 0.00 | optimizer: 63.42 | batch generator: 1.58 + samples/sec: 6.478 | iteration 52200/ 320000 | elapsed time per iteration (ms): 2470.0 | learning rate: 2.830E-04 | approx flops per GPU: 40.2TFLOPS | lm_loss: 3.247197E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 579.59 | backward: 1824.23 | backward-backward: 1824.14 | backward-allreduce: 0.00 | 
optimizer: 65.19 | batch generator: 1.46 + samples/sec: 6.483 | iteration 52300/ 320000 | elapsed time per iteration (ms): 2467.8 | learning rate: 2.830E-04 | approx flops per GPU: 40.3TFLOPS | lm_loss: 3.228024E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 580.43 | backward: 1822.01 | backward-backward: 1821.92 | backward-allreduce: 0.00 | optimizer: 64.40 | batch generator: 1.52 + samples/sec: 6.475 | iteration 52400/ 320000 | elapsed time per iteration (ms): 2470.9 | learning rate: 2.829E-04 | approx flops per GPU: 40.2TFLOPS | lm_loss: 3.209339E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 580.00 | backward: 1824.99 | backward-backward: 1824.90 | backward-allreduce: 0.00 | optimizer: 64.87 | batch generator: 1.44 + samples/sec: 6.476 | iteration 52500/ 320000 | elapsed time per iteration (ms): 2470.7 | learning rate: 2.828E-04 | approx flops per GPU: 40.2TFLOPS | lm_loss: 3.240155E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 579.82 | backward: 1824.54 | backward-backward: 1824.45 | backward-allreduce: 0.00 | optimizer: 65.26 | batch generator: 1.48 + samples/sec: 6.472 | iteration 52600/ 320000 | elapsed time per iteration (ms): 2472.1 | learning rate: 2.827E-04 | approx flops per GPU: 40.2TFLOPS | lm_loss: 3.239853E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 582.73 | backward: 1823.71 | backward-backward: 1823.62 | backward-allreduce: 0.00 | optimizer: 64.65 | batch generator: 1.53 + samples/sec: 6.471 | iteration 52700/ 320000 | elapsed time per iteration (ms): 2472.8 | learning rate: 2.827E-04 | approx flops per GPU: 40.2TFLOPS | lm_loss: 3.226451E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 582.13 | backward: 1824.72 | 
backward-backward: 1824.62 | backward-allreduce: 0.00 | optimizer: 64.89 | batch generator: 1.51 + samples/sec: 6.530 | iteration 52800/ 320000 | elapsed time per iteration (ms): 2450.2 | learning rate: 2.826E-04 | approx flops per GPU: 40.6TFLOPS | lm_loss: 3.243336E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 575.43 | backward: 1813.27 | backward-backward: 1813.20 | backward-allreduce: 0.00 | optimizer: 60.73 | batch generator: 1.17 + samples/sec: 6.588 | iteration 52900/ 320000 | elapsed time per iteration (ms): 2428.7 | learning rate: 2.825E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.220443E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.24 | backward: 1805.25 | backward-backward: 1805.23 | backward-allreduce: 0.00 | optimizer: 55.86 | batch generator: 0.79 + samples/sec: 6.593 | iteration 53000/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 2.825E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.264384E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.51 | backward: 1804.13 | backward-backward: 1804.11 | backward-allreduce: 0.00 | optimizer: 55.83 | batch generator: 0.80 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 53000 | lm_loss value: 3.237565E+00 | lm_loss_ppl value: 2.547162E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.442 | iteration 53100/ 320000 | elapsed time per iteration (ms): 2483.7 | learning rate: 2.824E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.226793E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.38 | backward: 1804.49 | backward-backward: 1804.46 | 
backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.87 + samples/sec: 6.588 | iteration 53200/ 320000 | elapsed time per iteration (ms): 2428.6 | learning rate: 2.823E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.219220E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.89 | backward: 1805.87 | backward-backward: 1805.84 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.80 + samples/sec: 6.592 | iteration 53300/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 2.823E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.232660E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.36 | backward: 1804.01 | backward-backward: 1803.99 | backward-allreduce: 0.00 | optimizer: 56.33 | batch generator: 0.79 + samples/sec: 6.589 | iteration 53400/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 2.822E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.215894E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.61 | backward: 1805.49 | backward-backward: 1805.47 | backward-allreduce: 0.00 | optimizer: 55.96 | batch generator: 0.78 + samples/sec: 6.591 | iteration 53500/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 2.821E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.247238E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.77 | backward: 1804.69 | backward-backward: 1804.67 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.81 + samples/sec: 6.599 | iteration 53600/ 320000 | elapsed time per iteration (ms): 2424.7 | learning rate: 2.821E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.237043E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 
566.23 | backward: 1803.17 | backward-backward: 1803.14 | backward-allreduce: 0.00 | optimizer: 54.93 | batch generator: 0.85 + samples/sec: 6.587 | iteration 53700/ 320000 | elapsed time per iteration (ms): 2429.1 | learning rate: 2.820E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.220175E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.90 | backward: 1805.94 | backward-backward: 1805.91 | backward-allreduce: 0.00 | optimizer: 55.87 | batch generator: 0.80 + samples/sec: 6.595 | iteration 53800/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 2.819E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.232050E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.62 | backward: 1803.52 | backward-backward: 1803.50 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.81 + samples/sec: 6.594 | iteration 53900/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 2.818E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.260413E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.46 | backward: 1803.90 | backward-backward: 1803.88 | backward-allreduce: 0.00 | optimizer: 55.81 | batch generator: 0.78 + samples/sec: 6.588 | iteration 54000/ 320000 | elapsed time per iteration (ms): 2428.7 | learning rate: 2.818E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.238575E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.87 | backward: 1805.80 | backward-backward: 1805.77 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.79 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 54000 | lm_loss value: 3.296471E+00 | lm_loss_ppl value: 2.701712E+01 | 
+---------------------------------------------------------------------------------------------------------- + samples/sec: 6.445 | iteration 54100/ 320000 | elapsed time per iteration (ms): 2482.5 | learning rate: 2.817E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.252889E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.58 | backward: 1803.30 | backward-backward: 1803.28 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.86 + samples/sec: 6.587 | iteration 54200/ 320000 | elapsed time per iteration (ms): 2429.1 | learning rate: 2.816E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.223928E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.82 | backward: 1805.81 | backward-backward: 1805.79 | backward-allreduce: 0.00 | optimizer: 56.08 | batch generator: 0.80 + samples/sec: 6.593 | iteration 54300/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 2.816E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.206589E+00 | loss scale: 16384.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.72 | backward: 1804.12 | backward-backward: 1804.09 | backward-allreduce: 0.00 | optimizer: 55.79 | batch generator: 0.79 + samples/sec: 6.596 | iteration 54400/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 2.815E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.249595E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.01 | backward: 1803.65 | backward-backward: 1803.63 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.79 + samples/sec: 6.589 | iteration 54500/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 2.814E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.214820E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | +time (ms) | forward: 566.86 | backward: 1805.30 | backward-backward: 1805.28 | backward-allreduce: 0.00 | optimizer: 55.78 | batch generator: 0.80 + samples/sec: 6.597 | iteration 54600/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 2.814E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.244579E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.27 | backward: 1803.07 | backward-backward: 1803.04 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.79 + samples/sec: 6.595 | iteration 54700/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 2.813E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.217836E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.25 | backward: 1804.07 | backward-backward: 1804.05 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.77 + samples/sec: 6.588 | iteration 54800/ 320000 | elapsed time per iteration (ms): 2428.7 | learning rate: 2.812E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.228810E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.08 | backward: 1805.40 | backward-backward: 1805.38 | backward-allreduce: 0.00 | optimizer: 55.84 | batch generator: 0.80 + samples/sec: 6.597 | iteration 54900/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 2.811E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.214406E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.75 | backward: 1802.57 | backward-backward: 1802.55 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.86 + samples/sec: 6.595 | iteration 55000/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 2.811E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.225255E+00 | loss scale: 
16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.36 | backward: 1803.77 | backward-backward: 1803.75 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.77 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 55000 | lm_loss value: 3.173585E+00 | lm_loss_ppl value: 2.389298E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.438 | iteration 55100/ 320000 | elapsed time per iteration (ms): 2485.2 | learning rate: 2.810E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.221766E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.92 | backward: 1804.94 | backward-backward: 1804.91 | backward-allreduce: 0.00 | optimizer: 56.05 | batch generator: 0.88 + samples/sec: 6.598 | iteration 55200/ 320000 | elapsed time per iteration (ms): 2425.0 | learning rate: 2.809E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.218560E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.54 | backward: 1802.36 | backward-backward: 1802.33 | backward-allreduce: 0.00 | optimizer: 55.77 | batch generator: 0.81 + samples/sec: 6.594 | iteration 55300/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 2.808E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.224995E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.61 | backward: 1803.86 | backward-backward: 1803.84 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.80 + samples/sec: 6.584 | iteration 55400/ 320000 | elapsed time per iteration (ms): 2430.0 | learning rate: 2.808E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.238816E+00 | loss scale: 32768.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.89 | backward: 1806.40 | backward-backward: 1806.38 | backward-allreduce: 0.00 | optimizer: 56.30 | batch generator: 0.79 + samples/sec: 6.595 | iteration 55500/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 2.807E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.209662E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.44 | backward: 1803.81 | backward-backward: 1803.78 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.84 + samples/sec: 6.593 | iteration 55600/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 2.806E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.247446E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.44 | backward: 1804.38 | backward-backward: 1804.36 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.80 + samples/sec: 6.587 | iteration 55700/ 320000 | elapsed time per iteration (ms): 2429.2 | learning rate: 2.806E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.220916E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.85 | backward: 1805.99 | backward-backward: 1805.96 | backward-allreduce: 0.00 | optimizer: 56.00 | batch generator: 0.79 + samples/sec: 6.595 | iteration 55800/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 2.805E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.237684E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.66 | backward: 1803.39 | backward-backward: 1803.36 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.76 + samples/sec: 6.593 | iteration 55900/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 2.804E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 
3.223248E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.27 | backward: 1804.32 | backward-backward: 1804.29 | backward-allreduce: 0.00 | optimizer: 55.82 | batch generator: 0.76 + samples/sec: 6.588 | iteration 56000/ 320000 | elapsed time per iteration (ms): 2428.7 | learning rate: 2.803E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.226805E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.73 | backward: 1805.76 | backward-backward: 1805.74 | backward-allreduce: 0.00 | optimizer: 55.85 | batch generator: 0.77 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 56000 | lm_loss value: 3.181808E+00 | lm_loss_ppl value: 2.409027E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.444 | iteration 56100/ 320000 | elapsed time per iteration (ms): 2483.0 | learning rate: 2.803E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.229417E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.39 | backward: 1803.47 | backward-backward: 1803.44 | backward-allreduce: 0.00 | optimizer: 55.92 | batch generator: 0.90 + samples/sec: 6.593 | iteration 56200/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 2.802E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.223441E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.45 | backward: 1804.26 | backward-backward: 1804.24 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.80 + samples/sec: 6.588 | iteration 56300/ 320000 | elapsed time per iteration (ms): 2428.8 | learning rate: 2.801E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.217790E+00 | loss scale: 
65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.74 | backward: 1805.62 | backward-backward: 1805.59 | backward-allreduce: 0.00 | optimizer: 56.02 | batch generator: 0.79 + samples/sec: 6.593 | iteration 56400/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 2.801E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.255292E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.54 | backward: 1804.16 | backward-backward: 1804.14 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.81 + samples/sec: 6.593 | iteration 56500/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 2.800E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.225774E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.39 | backward: 1804.20 | backward-backward: 1804.17 | backward-allreduce: 0.00 | optimizer: 55.91 | batch generator: 0.91 + samples/sec: 6.584 | iteration 56600/ 320000 | elapsed time per iteration (ms): 2430.0 | learning rate: 2.799E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.222697E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.80 | backward: 1806.97 | backward-backward: 1806.94 | backward-allreduce: 0.00 | optimizer: 55.90 | batch generator: 0.78 + samples/sec: 6.589 | iteration 56700/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 2.798E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.233300E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.96 | backward: 1804.97 | backward-backward: 1804.95 | backward-allreduce: 0.00 | optimizer: 56.13 | batch generator: 0.88 + samples/sec: 6.594 | iteration 56800/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 2.798E-04 | approx flops 
per GPU: 41.0TFLOPS | lm_loss: 3.225939E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.25 | backward: 1804.37 | backward-backward: 1804.35 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.89 + samples/sec: 6.587 | iteration 56900/ 320000 | elapsed time per iteration (ms): 2428.9 | learning rate: 2.797E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.210602E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.85 | backward: 1806.02 | backward-backward: 1805.99 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.80 + samples/sec: 6.591 | iteration 57000/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 2.796E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.224691E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.64 | backward: 1804.84 | backward-backward: 1804.82 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.83 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 57000 | lm_loss value: 3.298335E+00 | lm_loss_ppl value: 2.706754E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.446 | iteration 57100/ 320000 | elapsed time per iteration (ms): 2482.2 | learning rate: 2.795E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.228019E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.21 | backward: 1803.84 | backward-backward: 1803.82 | backward-allreduce: 0.00 | optimizer: 54.99 | batch generator: 0.89 + samples/sec: 6.588 | iteration 57200/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 2.795E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 
3.215078E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.75 | backward: 1805.87 | backward-backward: 1805.84 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.79 + samples/sec: 6.595 | iteration 57300/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 2.794E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.225081E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.52 | backward: 1803.58 | backward-backward: 1803.56 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.78 + samples/sec: 6.595 | iteration 57400/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 2.793E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.246505E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.09 | backward: 1804.01 | backward-backward: 1803.99 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.80 + samples/sec: 6.588 | iteration 57500/ 320000 | elapsed time per iteration (ms): 2428.8 | learning rate: 2.792E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.235474E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.80 | backward: 1805.97 | backward-backward: 1805.95 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.80 + samples/sec: 6.590 | iteration 57600/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 2.792E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.239914E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.67 | backward: 1804.51 | backward-backward: 1804.49 | backward-allreduce: 0.00 | optimizer: 56.29 | batch generator: 0.77 + samples/sec: 6.594 | iteration 57700/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 
2.791E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.216964E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.36 | backward: 1803.98 | backward-backward: 1803.95 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.93 + samples/sec: 6.586 | iteration 57800/ 320000 | elapsed time per iteration (ms): 2429.3 | learning rate: 2.790E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.193616E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.72 | backward: 1806.33 | backward-backward: 1806.30 | backward-allreduce: 0.00 | optimizer: 55.93 | batch generator: 0.78 + samples/sec: 6.587 | iteration 57900/ 320000 | elapsed time per iteration (ms): 2429.0 | learning rate: 2.789E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.211708E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.31 | backward: 1805.25 | backward-backward: 1805.23 | backward-allreduce: 0.00 | optimizer: 56.08 | batch generator: 0.98 + samples/sec: 6.596 | iteration 58000/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 2.789E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.219455E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.24 | backward: 1803.48 | backward-backward: 1803.46 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.80 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 58000 | lm_loss value: 3.137465E+00 | lm_loss_ppl value: 2.304536E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.438 | iteration 58100/ 320000 | elapsed time per iteration (ms): 2485.1 | learning rate: 2.788E-04 | approx flops per GPU: 
40.0TFLOPS | lm_loss: 3.231240E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.78 | backward: 1805.62 | backward-backward: 1805.59 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.87 + samples/sec: 6.589 | iteration 58200/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 2.787E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.226377E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.76 | backward: 1805.30 | backward-backward: 1805.28 | backward-allreduce: 0.00 | optimizer: 55.80 | batch generator: 0.78 + samples/sec: 6.596 | iteration 58300/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 2.786E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.204754E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.16 | backward: 1803.79 | backward-backward: 1803.76 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.80 + samples/sec: 6.588 | iteration 58400/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 2.786E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.190145E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.58 | backward: 1805.55 | backward-backward: 1805.52 | backward-allreduce: 0.00 | optimizer: 55.99 | batch generator: 0.80 + samples/sec: 6.591 | iteration 58500/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 2.785E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.218693E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.88 | backward: 1805.20 | backward-backward: 1805.17 | backward-allreduce: 0.00 | optimizer: 55.25 | batch generator: 0.88 + samples/sec: 6.596 | iteration 58600/ 320000 | elapsed time per iteration (ms): 
2425.7 | learning rate: 2.784E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.202021E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.19 | backward: 1803.46 | backward-backward: 1803.44 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.83 + samples/sec: 6.587 | iteration 58700/ 320000 | elapsed time per iteration (ms): 2428.9 | learning rate: 2.783E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.220266E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.52 | backward: 1805.52 | backward-backward: 1805.50 | backward-allreduce: 0.00 | optimizer: 56.47 | batch generator: 0.77 + samples/sec: 6.590 | iteration 58800/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 2.783E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.222621E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.74 | backward: 1805.05 | backward-backward: 1805.02 | backward-allreduce: 0.00 | optimizer: 55.90 | batch generator: 0.77 + samples/sec: 6.595 | iteration 58900/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 2.782E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.215608E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.12 | backward: 1803.62 | backward-backward: 1803.60 | backward-allreduce: 0.00 | optimizer: 55.86 | batch generator: 0.79 + samples/sec: 6.590 | iteration 59000/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 2.781E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.206369E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.46 | backward: 1804.92 | backward-backward: 1804.90 | backward-allreduce: 0.00 | optimizer: 56.27 | batch generator: 0.79 
+---------------------------------------------------------------------------------------------------------- + validation results at iteration 59000 | lm_loss value: 3.227249E+00 | lm_loss_ppl value: 2.521021E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.437 | iteration 59100/ 320000 | elapsed time per iteration (ms): 2485.5 | learning rate: 2.780E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.198093E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.98 | backward: 1805.27 | backward-backward: 1805.25 | backward-allreduce: 0.00 | optimizer: 56.01 | batch generator: 0.88 + samples/sec: 6.596 | iteration 59200/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 2.779E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.217461E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.51 | backward: 1803.20 | backward-backward: 1803.18 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.85 + samples/sec: 6.595 | iteration 59300/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 2.779E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.232116E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.57 | backward: 1803.91 | backward-backward: 1803.89 | backward-allreduce: 0.00 | optimizer: 55.40 | batch generator: 0.80 + samples/sec: 6.590 | iteration 59400/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 2.778E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.195463E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.68 | backward: 1805.22 | backward-backward: 1805.20 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.78 + samples/sec: 6.590 | iteration 
59500/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 2.777E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.215424E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.85 | backward: 1804.93 | backward-backward: 1804.91 | backward-allreduce: 0.00 | optimizer: 55.80 | batch generator: 0.78 + samples/sec: 6.596 | iteration 59600/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 2.776E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.186445E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.16 | backward: 1803.74 | backward-backward: 1803.71 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.80 + samples/sec: 6.590 | iteration 59700/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 2.776E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.229471E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.42 | backward: 1805.45 | backward-backward: 1805.42 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.77 + samples/sec: 6.585 | iteration 59800/ 320000 | elapsed time per iteration (ms): 2429.8 | learning rate: 2.775E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.222042E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.90 | backward: 1806.10 | backward-backward: 1806.07 | backward-allreduce: 0.00 | optimizer: 56.43 | batch generator: 0.82 + samples/sec: 6.596 | iteration 59900/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 2.774E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.221949E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.33 | backward: 1803.52 | backward-backward: 1803.50 | backward-allreduce: 0.00 | optimizer: 55.52 | 
batch generator: 0.84 + samples/sec: 6.591 | iteration 60000/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 2.773E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.208536E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.91 | backward: 1804.94 | backward-backward: 1804.91 | backward-allreduce: 0.00 | optimizer: 55.21 | batch generator: 0.92 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 60000 | lm_loss value: 3.135985E+00 | lm_loss_ppl value: 2.301130E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.236 | iteration 60100/ 320000 | elapsed time per iteration (ms): 2565.8 | learning rate: 2.773E-04 | approx flops per GPU: 38.7TFLOPS | lm_loss: 3.212286E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.87 | backward: 1805.23 | backward-backward: 1805.21 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.93 + samples/sec: 6.594 | iteration 60200/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 2.772E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.192006E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.77 | backward: 1803.05 | backward-backward: 1803.02 | backward-allreduce: 0.00 | optimizer: 56.08 | batch generator: 0.85 + samples/sec: 6.593 | iteration 60300/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 2.771E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.209075E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.44 | backward: 1804.33 | backward-backward: 1804.30 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.78 + 
samples/sec: 6.589 | iteration 60400/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 2.770E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.208252E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.07 | backward: 1805.30 | backward-backward: 1805.28 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.80 + samples/sec: 6.598 | iteration 60500/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 2.769E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.205007E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.23 | backward: 1803.23 | backward-backward: 1803.21 | backward-allreduce: 0.00 | optimizer: 55.31 | batch generator: 0.79 + samples/sec: 6.594 | iteration 60600/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 2.769E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.201516E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.29 | backward: 1804.37 | backward-backward: 1804.35 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.77 + samples/sec: 6.589 | iteration 60700/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 2.768E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.204899E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.45 | backward: 1805.06 | backward-backward: 1805.03 | backward-allreduce: 0.00 | optimizer: 55.40 | batch generator: 0.78 + samples/sec: 6.595 | iteration 60800/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 2.767E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.207003E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.65 | backward: 1803.65 | backward-backward: 1803.62 | 
backward-allreduce: 0.00 | optimizer: 55.38 | batch generator: 0.79 + samples/sec: 6.590 | iteration 60900/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 2.766E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.207615E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.15 | backward: 1804.66 | backward-backward: 1804.64 | backward-allreduce: 0.00 | optimizer: 56.60 | batch generator: 0.85 + samples/sec: 6.588 | iteration 61000/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 2.765E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.235678E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.77 | backward: 1805.80 | backward-backward: 1805.78 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.79 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 61000 | lm_loss value: 3.174518E+00 | lm_loss_ppl value: 2.391529E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.444 | iteration 61100/ 320000 | elapsed time per iteration (ms): 2482.8 | learning rate: 2.765E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.207857E+00 | loss scale: 32768.0 | number of skipped iterations: 2 | number of nan iterations: 0 | +time (ms) | forward: 566.61 | backward: 1804.27 | backward-backward: 1804.25 | backward-allreduce: 0.00 | optimizer: 54.74 | batch generator: 0.81 + samples/sec: 6.595 | iteration 61200/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 2.764E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.204748E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.65 | backward: 1803.80 | backward-backward: 1803.78 | backward-allreduce: 0.00 | 
optimizer: 55.31 | batch generator: 0.75 + samples/sec: 6.589 | iteration 61300/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 2.763E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.216393E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.04 | backward: 1805.38 | backward-backward: 1805.35 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.78 + samples/sec: 6.592 | iteration 61400/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 2.762E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.214134E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.09 | backward: 1804.33 | backward-backward: 1804.31 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.79 + samples/sec: 6.597 | iteration 61500/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 2.762E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.199415E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.07 | backward: 1803.30 | backward-backward: 1803.27 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.81 + samples/sec: 6.588 | iteration 61600/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 2.761E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.201023E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.99 | backward: 1805.56 | backward-backward: 1805.53 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.78 + samples/sec: 6.590 | iteration 61700/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 2.760E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.206952E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.16 | backward: 1804.85 | 
backward-backward: 1804.83 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.78 + samples/sec: 6.595 | iteration 61800/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 2.759E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.216651E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.05 | backward: 1804.03 | backward-backward: 1804.00 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.86 + samples/sec: 6.590 | iteration 61900/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 2.758E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.214512E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.63 | backward: 1805.36 | backward-backward: 1805.33 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.79 + samples/sec: 6.586 | iteration 62000/ 320000 | elapsed time per iteration (ms): 2429.3 | learning rate: 2.758E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.247897E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.03 | backward: 1805.83 | backward-backward: 1805.80 | backward-allreduce: 0.00 | optimizer: 56.04 | batch generator: 0.76 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 62000 | lm_loss value: 3.189103E+00 | lm_loss_ppl value: 2.426664E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.440 | iteration 62100/ 320000 | elapsed time per iteration (ms): 2484.5 | learning rate: 2.757E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.220407E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.09 | backward: 1804.74 | backward-backward: 1804.71 | 
backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.84 + samples/sec: 6.591 | iteration 62200/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 2.756E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.199751E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.93 | backward: 1804.82 | backward-backward: 1804.80 | backward-allreduce: 0.00 | optimizer: 55.25 | batch generator: 0.84 + samples/sec: 6.592 | iteration 62300/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 2.755E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.209976E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.75 | backward: 1804.88 | backward-backward: 1804.85 | backward-allreduce: 0.00 | optimizer: 55.10 | batch generator: 0.79 + samples/sec: 6.597 | iteration 62400/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 2.754E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.188441E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.22 | backward: 1803.01 | backward-backward: 1802.99 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.82 + samples/sec: 6.594 | iteration 62500/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 2.754E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.195771E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.24 | backward: 1804.09 | backward-backward: 1804.07 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.78 + samples/sec: 6.591 | iteration 62600/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 2.753E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.203958E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 
566.51 | backward: 1805.00 | backward-backward: 1804.97 | backward-allreduce: 0.00 | optimizer: 55.80 | batch generator: 0.81 + samples/sec: 6.592 | iteration 62700/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 2.752E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.212514E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.78 | backward: 1804.17 | backward-backward: 1804.15 | backward-allreduce: 0.00 | optimizer: 55.78 | batch generator: 0.79 + samples/sec: 6.595 | iteration 62800/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 2.751E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.225213E+00 | loss scale: 16384.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.75 | backward: 1803.97 | backward-backward: 1803.94 | backward-allreduce: 0.00 | optimizer: 54.95 | batch generator: 0.79 + samples/sec: 6.592 | iteration 62900/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 2.750E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.209078E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.80 | backward: 1804.03 | backward-backward: 1804.01 | backward-allreduce: 0.00 | optimizer: 55.88 | batch generator: 0.80 + samples/sec: 6.594 | iteration 63000/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 2.749E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.200561E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.48 | backward: 1803.74 | backward-backward: 1803.71 | backward-allreduce: 0.00 | optimizer: 55.78 | batch generator: 0.79 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 63000 | lm_loss value: 3.125797E+00 | lm_loss_ppl value: 2.277805E+01 | 
+---------------------------------------------------------------------------------------------------------- + samples/sec: 6.440 | iteration 63100/ 320000 | elapsed time per iteration (ms): 2484.5 | learning rate: 2.749E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.214180E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.60 | backward: 1804.48 | backward-backward: 1804.46 | backward-allreduce: 0.00 | optimizer: 56.14 | batch generator: 0.89 + samples/sec: 6.594 | iteration 63200/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 2.748E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.214072E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.51 | backward: 1803.91 | backward-backward: 1803.89 | backward-allreduce: 0.00 | optimizer: 55.77 | batch generator: 0.83 + samples/sec: 6.600 | iteration 63300/ 320000 | elapsed time per iteration (ms): 2424.2 | learning rate: 2.747E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.188936E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.83 | backward: 1802.76 | backward-backward: 1802.73 | backward-allreduce: 0.00 | optimizer: 55.26 | batch generator: 0.80 + samples/sec: 6.595 | iteration 63400/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 2.746E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.209179E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.47 | backward: 1803.94 | backward-backward: 1803.92 | backward-allreduce: 0.00 | optimizer: 55.40 | batch generator: 0.77 + samples/sec: 6.592 | iteration 63500/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 2.745E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.198951E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | +time (ms) | forward: 566.38 | backward: 1804.46 | backward-backward: 1804.44 | backward-allreduce: 0.00 | optimizer: 55.82 | batch generator: 0.80 + samples/sec: 6.593 | iteration 63600/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 2.745E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.219892E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.64 | backward: 1804.15 | backward-backward: 1804.12 | backward-allreduce: 0.00 | optimizer: 55.82 | batch generator: 0.80 + samples/sec: 6.592 | iteration 63700/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 2.744E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.183545E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.33 | backward: 1804.11 | backward-backward: 1804.08 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.79 + samples/sec: 6.593 | iteration 63800/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 2.743E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.202458E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.71 | backward: 1804.61 | backward-backward: 1804.58 | backward-allreduce: 0.00 | optimizer: 55.14 | batch generator: 0.80 + samples/sec: 6.592 | iteration 63900/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 2.742E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.181086E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.51 | backward: 1804.68 | backward-backward: 1804.65 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.88 + samples/sec: 6.590 | iteration 64000/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 2.741E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.192711E+00 | loss scale: 
32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.51 | backward: 1804.92 | backward-backward: 1804.90 | backward-allreduce: 0.00 | optimizer: 56.11 | batch generator: 0.99 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 64000 | lm_loss value: 3.178851E+00 | lm_loss_ppl value: 2.401914E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.441 | iteration 64100/ 320000 | elapsed time per iteration (ms): 2484.0 | learning rate: 2.740E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.182695E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.44 | backward: 1804.34 | backward-backward: 1804.32 | backward-allreduce: 0.00 | optimizer: 56.06 | batch generator: 0.86 + samples/sec: 6.592 | iteration 64200/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 2.740E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.203844E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.56 | backward: 1804.93 | backward-backward: 1804.90 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.79 + samples/sec: 6.595 | iteration 64300/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 2.739E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.173289E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.21 | backward: 1803.88 | backward-backward: 1803.86 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.79 + samples/sec: 6.598 | iteration 64400/ 320000 | elapsed time per iteration (ms): 2424.8 | learning rate: 2.738E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.226259E+00 | loss scale: 32768.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.21 | backward: 1802.98 | backward-backward: 1802.95 | backward-allreduce: 0.00 | optimizer: 55.25 | batch generator: 0.76 + samples/sec: 6.591 | iteration 64500/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 2.737E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.212171E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.29 | backward: 1805.06 | backward-backward: 1805.04 | backward-allreduce: 0.00 | optimizer: 55.77 | batch generator: 0.80 + samples/sec: 6.593 | iteration 64600/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 2.736E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.197030E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.42 | backward: 1804.36 | backward-backward: 1804.33 | backward-allreduce: 0.00 | optimizer: 55.77 | batch generator: 0.80 + samples/sec: 6.592 | iteration 64700/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 2.735E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.215208E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.65 | backward: 1804.75 | backward-backward: 1804.73 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.78 + samples/sec: 6.591 | iteration 64800/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 2.735E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.203166E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.01 | backward: 1804.72 | backward-backward: 1804.70 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.78 + samples/sec: 6.592 | iteration 64900/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 2.734E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 
3.205803E+00 | loss scale: 32768.0 | number of skipped iterations: 2 | number of nan iterations: 0 | +time (ms) | forward: 566.83 | backward: 1805.23 | backward-backward: 1805.20 | backward-allreduce: 0.00 | optimizer: 54.85 | batch generator: 0.95 + samples/sec: 6.591 | iteration 65000/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 2.733E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.206436E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.56 | backward: 1804.64 | backward-backward: 1804.62 | backward-allreduce: 0.00 | optimizer: 55.82 | batch generator: 0.79 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 65000 | lm_loss value: 3.243041E+00 | lm_loss_ppl value: 2.561148E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.448 | iteration 65100/ 320000 | elapsed time per iteration (ms): 2481.3 | learning rate: 2.732E-04 | approx flops per GPU: 40.1TFLOPS | lm_loss: 3.193721E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.74 | backward: 1802.88 | backward-backward: 1802.86 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.90 + samples/sec: 6.589 | iteration 65200/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 2.731E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.207208E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.70 | backward: 1804.71 | backward-backward: 1804.69 | backward-allreduce: 0.00 | optimizer: 56.37 | batch generator: 0.82 + samples/sec: 6.589 | iteration 65300/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 2.730E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.202784E+00 | loss scale: 
32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.32 | backward: 1805.96 | backward-backward: 1805.94 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.78 + samples/sec: 6.588 | iteration 65400/ 320000 | elapsed time per iteration (ms): 2428.8 | learning rate: 2.730E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.196872E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.19 | backward: 1805.35 | backward-backward: 1805.32 | backward-allreduce: 0.00 | optimizer: 55.85 | batch generator: 0.81 + samples/sec: 6.591 | iteration 65500/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 2.729E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.181535E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.81 | backward: 1804.74 | backward-backward: 1804.72 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.81 + samples/sec: 6.591 | iteration 65600/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 2.728E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.212550E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.71 | backward: 1804.73 | backward-backward: 1804.71 | backward-allreduce: 0.00 | optimizer: 55.83 | batch generator: 0.83 + samples/sec: 6.588 | iteration 65700/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 2.727E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.205415E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.10 | backward: 1805.27 | backward-backward: 1805.25 | backward-allreduce: 0.00 | optimizer: 55.76 | batch generator: 0.88 + samples/sec: 6.591 | iteration 65800/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 2.726E-04 | approx flops 
per GPU: 40.9TFLOPS | lm_loss: 3.195410E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.48 | backward: 1804.65 | backward-backward: 1804.62 | backward-allreduce: 0.00 | optimizer: 56.13 | batch generator: 0.80 + samples/sec: 6.597 | iteration 65900/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 2.725E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.204755E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.18 | backward: 1803.22 | backward-backward: 1803.20 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.80 + samples/sec: 6.591 | iteration 66000/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 2.725E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.175911E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.35 | backward: 1805.00 | backward-backward: 1804.98 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.77 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 66000 | lm_loss value: 3.188969E+00 | lm_loss_ppl value: 2.426340E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.440 | iteration 66100/ 320000 | elapsed time per iteration (ms): 2484.5 | learning rate: 2.724E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.195547E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.41 | backward: 1805.52 | backward-backward: 1805.49 | backward-allreduce: 0.00 | optimizer: 55.40 | batch generator: 0.87 + samples/sec: 6.593 | iteration 66200/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 2.723E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 
3.193299E+00 | loss scale: 32768.0 | number of skipped iterations: 2 | number of nan iterations: 0 | +time (ms) | forward: 566.68 | backward: 1805.08 | backward-backward: 1805.06 | backward-allreduce: 0.00 | optimizer: 54.70 | batch generator: 0.77 + samples/sec: 6.590 | iteration 66300/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 2.722E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.205566E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.03 | backward: 1804.63 | backward-backward: 1804.60 | backward-allreduce: 0.00 | optimizer: 56.01 | batch generator: 0.78 + samples/sec: 6.589 | iteration 66400/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 2.721E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.220756E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.96 | backward: 1805.62 | backward-backward: 1805.60 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.80 + samples/sec: 6.598 | iteration 66500/ 320000 | elapsed time per iteration (ms): 2425.0 | learning rate: 2.720E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.178600E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.06 | backward: 1802.98 | backward-backward: 1802.95 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.81 + samples/sec: 6.590 | iteration 66600/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 2.719E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.206245E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.78 | backward: 1804.82 | backward-backward: 1804.79 | backward-allreduce: 0.00 | optimizer: 55.82 | batch generator: 0.80 + samples/sec: 6.590 | iteration 66700/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 
2.719E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.217511E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.62 | backward: 1805.26 | backward-backward: 1805.23 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.81 + samples/sec: 6.591 | iteration 66800/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 2.718E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.205421E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.85 | backward: 1804.77 | backward-backward: 1804.75 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.79 + samples/sec: 6.590 | iteration 66900/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 2.717E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.193539E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.26 | backward: 1804.33 | backward-backward: 1804.30 | backward-allreduce: 0.00 | optimizer: 55.86 | batch generator: 0.84 + samples/sec: 6.590 | iteration 67000/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 2.716E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.187225E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.85 | backward: 1804.49 | backward-backward: 1804.47 | backward-allreduce: 0.00 | optimizer: 56.06 | batch generator: 0.78 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 67000 | lm_loss value: 3.239478E+00 | lm_loss_ppl value: 2.552039E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.445 | iteration 67100/ 320000 | elapsed time per iteration (ms): 2482.4 | learning rate: 2.715E-04 | approx flops per GPU: 
40.0TFLOPS | lm_loss: 3.213927E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.06 | backward: 1803.56 | backward-backward: 1803.54 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.86 + samples/sec: 6.591 | iteration 67200/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 2.714E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.200579E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.70 | backward: 1805.00 | backward-backward: 1804.98 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.88 + samples/sec: 6.589 | iteration 67300/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 2.713E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.179279E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.10 | backward: 1805.18 | backward-backward: 1805.16 | backward-allreduce: 0.00 | optimizer: 55.79 | batch generator: 0.82 + samples/sec: 6.587 | iteration 67400/ 320000 | elapsed time per iteration (ms): 2428.9 | learning rate: 2.713E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.194961E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.09 | backward: 1805.47 | backward-backward: 1805.44 | backward-allreduce: 0.00 | optimizer: 56.01 | batch generator: 0.82 + samples/sec: 6.589 | iteration 67500/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 2.712E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.191185E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.16 | backward: 1805.35 | backward-backward: 1805.32 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.78 + samples/sec: 6.596 | iteration 67600/ 320000 | elapsed time per iteration (ms): 
2425.8 | learning rate: 2.711E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.201562E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.03 | backward: 1803.72 | backward-backward: 1803.70 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.79 + samples/sec: 6.587 | iteration 67700/ 320000 | elapsed time per iteration (ms): 2429.0 | learning rate: 2.710E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.205323E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 567.48 | backward: 1805.77 | backward-backward: 1805.75 | backward-allreduce: 0.00 | optimizer: 55.40 | batch generator: 0.79 + samples/sec: 6.585 | iteration 67800/ 320000 | elapsed time per iteration (ms): 2429.9 | learning rate: 2.709E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.203084E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.44 | backward: 1806.23 | backward-backward: 1806.21 | backward-allreduce: 0.00 | optimizer: 55.84 | batch generator: 0.80 + samples/sec: 6.591 | iteration 67900/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 2.708E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.173824E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.02 | backward: 1804.70 | backward-backward: 1804.68 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.80 + samples/sec: 6.597 | iteration 68000/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 2.707E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.188416E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.91 | backward: 1803.71 | backward-backward: 1803.68 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.80 
+---------------------------------------------------------------------------------------------------------- + validation results at iteration 68000 | lm_loss value: 3.257170E+00 | lm_loss_ppl value: 2.597593E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.438 | iteration 68100/ 320000 | elapsed time per iteration (ms): 2485.2 | learning rate: 2.706E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.166176E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.98 | backward: 1805.38 | backward-backward: 1805.36 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 0.90 + samples/sec: 6.590 | iteration 68200/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 2.706E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.191464E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 567.24 | backward: 1805.07 | backward-backward: 1805.04 | backward-allreduce: 0.00 | optimizer: 55.33 | batch generator: 0.78 + samples/sec: 6.597 | iteration 68300/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 2.705E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.185811E+00 | loss scale: 16384.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.59 | backward: 1802.75 | backward-backward: 1802.72 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.81 + samples/sec: 6.596 | iteration 68400/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 2.704E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.179256E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.36 | backward: 1803.58 | backward-backward: 1803.56 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.80 + samples/sec: 6.586 | iteration 
68500/ 320000 | elapsed time per iteration (ms): 2429.4 | learning rate: 2.703E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.185805E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.03 | backward: 1805.91 | backward-backward: 1805.88 | backward-allreduce: 0.00 | optimizer: 56.11 | batch generator: 0.80 + samples/sec: 6.591 | iteration 68600/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 2.702E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.191183E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.56 | backward: 1804.16 | backward-backward: 1804.13 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.86 + samples/sec: 6.598 | iteration 68700/ 320000 | elapsed time per iteration (ms): 2425.0 | learning rate: 2.701E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.191865E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.05 | backward: 1803.02 | backward-backward: 1802.99 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.80 + samples/sec: 6.591 | iteration 68800/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 2.700E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.221637E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.77 | backward: 1804.79 | backward-backward: 1804.77 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.77 + samples/sec: 6.591 | iteration 68900/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 2.699E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.173362E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.12 | backward: 1804.62 | backward-backward: 1804.60 | backward-allreduce: 0.00 | optimizer: 55.47 | 
batch generator: 0.79 + samples/sec: 6.597 | iteration 69000/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 2.699E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.174951E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.79 | backward: 1802.72 | backward-backward: 1802.70 | backward-allreduce: 0.00 | optimizer: 55.39 | batch generator: 0.82 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 69000 | lm_loss value: 3.191154E+00 | lm_loss_ppl value: 2.431648E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.444 | iteration 69100/ 320000 | elapsed time per iteration (ms): 2483.0 | learning rate: 2.698E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.206825E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.37 | backward: 1803.72 | backward-backward: 1803.69 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.84 + samples/sec: 6.590 | iteration 69200/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 2.697E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.190428E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.80 | backward: 1805.18 | backward-backward: 1805.16 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.78 + samples/sec: 6.592 | iteration 69300/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 2.696E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.213215E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.62 | backward: 1804.30 | backward-backward: 1804.28 | backward-allreduce: 0.00 | optimizer: 56.03 | batch generator: 0.87 + 
samples/sec: 6.595 | iteration 69400/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 2.695E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.190154E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.22 | backward: 1803.77 | backward-backward: 1803.74 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.80 + samples/sec: 6.587 | iteration 69500/ 320000 | elapsed time per iteration (ms): 2428.9 | learning rate: 2.694E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.199632E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.18 | backward: 1805.48 | backward-backward: 1805.46 | backward-allreduce: 0.00 | optimizer: 55.83 | batch generator: 0.81 + samples/sec: 6.588 | iteration 69600/ 320000 | elapsed time per iteration (ms): 2428.8 | learning rate: 2.693E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.158188E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.10 | backward: 1805.13 | backward-backward: 1805.11 | backward-allreduce: 0.00 | optimizer: 56.16 | batch generator: 0.80 + samples/sec: 6.598 | iteration 69700/ 320000 | elapsed time per iteration (ms): 2424.9 | learning rate: 2.692E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.183073E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.02 | backward: 1803.09 | backward-backward: 1803.07 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.81 + samples/sec: 6.591 | iteration 69800/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 2.691E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.206797E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.78 | backward: 1804.78 | backward-backward: 1804.76 | 
backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.79 + samples/sec: 6.587 | iteration 69900/ 320000 | elapsed time per iteration (ms): 2428.9 | learning rate: 2.690E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.205674E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.06 | backward: 1805.40 | backward-backward: 1805.38 | backward-allreduce: 0.00 | optimizer: 56.06 | batch generator: 0.81 + samples/sec: 6.594 | iteration 70000/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 2.690E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.183416E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.79 | backward: 1803.65 | backward-backward: 1803.63 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.78 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 70000 | lm_loss value: 3.205048E+00 | lm_loss_ppl value: 2.465669E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.244 | iteration 70100/ 320000 | elapsed time per iteration (ms): 2562.5 | learning rate: 2.689E-04 | approx flops per GPU: 38.8TFLOPS | lm_loss: 3.182544E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.01 | backward: 1804.51 | backward-backward: 1804.48 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.88 + samples/sec: 6.587 | iteration 70200/ 320000 | elapsed time per iteration (ms): 2429.1 | learning rate: 2.688E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.206421E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.93 | backward: 1805.93 | backward-backward: 1805.90 | backward-allreduce: 0.00 | 
optimizer: 55.86 | batch generator: 0.81 + samples/sec: 6.589 | iteration 70300/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 2.687E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.192715E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.28 | backward: 1804.77 | backward-backward: 1804.75 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 0.81 + samples/sec: 6.596 | iteration 70400/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 2.686E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.197077E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.90 | backward: 1803.70 | backward-backward: 1803.68 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 0.82 + samples/sec: 6.590 | iteration 70500/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 2.685E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.186804E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.74 | backward: 1805.40 | backward-backward: 1805.38 | backward-allreduce: 0.00 | optimizer: 55.33 | batch generator: 0.80 + samples/sec: 6.588 | iteration 70600/ 320000 | elapsed time per iteration (ms): 2428.8 | learning rate: 2.684E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.182904E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.93 | backward: 1805.76 | backward-backward: 1805.73 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.81 + samples/sec: 6.591 | iteration 70700/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 2.683E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.194614E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.92 | backward: 1804.10 | 
backward-backward: 1804.08 | backward-allreduce: 0.00 | optimizer: 55.97 | batch generator: 0.98 + samples/sec: 6.595 | iteration 70800/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 2.682E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.189717E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.97 | backward: 1803.64 | backward-backward: 1803.62 | backward-allreduce: 0.00 | optimizer: 55.76 | batch generator: 0.78 + samples/sec: 6.588 | iteration 70900/ 320000 | elapsed time per iteration (ms): 2428.6 | learning rate: 2.681E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.199714E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.62 | backward: 1805.82 | backward-backward: 1805.80 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.80 + samples/sec: 6.590 | iteration 71000/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 2.681E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.169664E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.91 | backward: 1805.11 | backward-backward: 1805.08 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.76 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 71000 | lm_loss value: 3.219360E+00 | lm_loss_ppl value: 2.501211E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.444 | iteration 71100/ 320000 | elapsed time per iteration (ms): 2482.9 | learning rate: 2.680E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.189048E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.60 | backward: 1803.56 | backward-backward: 1803.53 | 
backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.87 + samples/sec: 6.598 | iteration 71200/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 2.679E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.176649E+00 | loss scale: 16384.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.08 | backward: 1803.58 | backward-backward: 1803.55 | backward-allreduce: 0.00 | optimizer: 55.05 | batch generator: 0.77 + samples/sec: 6.590 | iteration 71300/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 2.678E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.192383E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.64 | backward: 1805.28 | backward-backward: 1805.26 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.79 + samples/sec: 6.592 | iteration 71400/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 2.677E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.187941E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.69 | backward: 1804.11 | backward-backward: 1804.08 | backward-allreduce: 0.00 | optimizer: 55.86 | batch generator: 0.80 + samples/sec: 6.598 | iteration 71500/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 2.676E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.179403E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.82 | backward: 1803.01 | backward-backward: 1802.98 | backward-allreduce: 0.00 | optimizer: 55.85 | batch generator: 0.79 + samples/sec: 6.590 | iteration 71600/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 2.675E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.189167E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 
566.74 | backward: 1805.07 | backward-backward: 1805.04 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.81 + samples/sec: 6.591 | iteration 71700/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 2.674E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.176305E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.11 | backward: 1804.73 | backward-backward: 1804.70 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.86 + samples/sec: 6.592 | iteration 71800/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 2.673E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.183004E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.43 | backward: 1803.61 | backward-backward: 1803.58 | backward-allreduce: 0.00 | optimizer: 56.67 | batch generator: 0.79 + samples/sec: 6.596 | iteration 71900/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 2.672E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.206879E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.44 | backward: 1803.27 | backward-backward: 1803.24 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.81 + samples/sec: 6.588 | iteration 72000/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 2.671E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.179352E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.88 | backward: 1805.36 | backward-backward: 1805.33 | backward-allreduce: 0.00 | optimizer: 55.89 | batch generator: 0.81 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 72000 | lm_loss value: 3.116561E+00 | lm_loss_ppl value: 2.256863E+01 | 
+---------------------------------------------------------------------------------------------------------- + samples/sec: 6.444 | iteration 72100/ 320000 | elapsed time per iteration (ms): 2482.9 | learning rate: 2.671E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.171815E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.48 | backward: 1803.79 | backward-backward: 1803.77 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.87 + samples/sec: 6.592 | iteration 72200/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 2.670E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.165262E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.98 | backward: 1803.48 | backward-backward: 1803.46 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.78 + samples/sec: 6.588 | iteration 72300/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 2.669E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.185670E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.88 | backward: 1805.82 | backward-backward: 1805.80 | backward-allreduce: 0.00 | optimizer: 55.42 | batch generator: 0.81 + samples/sec: 6.590 | iteration 72400/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 2.668E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.172154E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.92 | backward: 1804.72 | backward-backward: 1804.70 | backward-allreduce: 0.00 | optimizer: 55.86 | batch generator: 0.76 + samples/sec: 6.599 | iteration 72500/ 320000 | elapsed time per iteration (ms): 2424.6 | learning rate: 2.667E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.176308E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | +time (ms) | forward: 565.87 | backward: 1803.22 | backward-backward: 1803.20 | backward-allreduce: 0.00 | optimizer: 55.17 | batch generator: 0.79 + samples/sec: 6.588 | iteration 72600/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 2.666E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.188140E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.19 | backward: 1805.15 | backward-backward: 1805.13 | backward-allreduce: 0.00 | optimizer: 55.76 | batch generator: 0.79 + samples/sec: 6.590 | iteration 72700/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 2.665E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.170534E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.02 | backward: 1804.91 | backward-backward: 1804.89 | backward-allreduce: 0.00 | optimizer: 55.79 | batch generator: 0.83 + samples/sec: 6.597 | iteration 72800/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 2.664E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.191211E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.28 | backward: 1802.95 | backward-backward: 1802.93 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.83 + samples/sec: 6.589 | iteration 72900/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 2.663E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.176812E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.67 | backward: 1805.35 | backward-backward: 1805.33 | backward-allreduce: 0.00 | optimizer: 55.97 | batch generator: 0.79 + samples/sec: 6.590 | iteration 73000/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 2.662E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.175191E+00 | loss scale: 
32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.96 | backward: 1805.28 | backward-backward: 1805.26 | backward-allreduce: 0.00 | optimizer: 55.18 | batch generator: 0.78 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 73000 | lm_loss value: 3.174797E+00 | lm_loss_ppl value: 2.392196E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.444 | iteration 73100/ 320000 | elapsed time per iteration (ms): 2483.1 | learning rate: 2.661E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.190165E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.50 | backward: 1803.77 | backward-backward: 1803.74 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.84 + samples/sec: 6.596 | iteration 73200/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 2.660E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.178788E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.14 | backward: 1803.57 | backward-backward: 1803.54 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.81 + samples/sec: 6.588 | iteration 73300/ 320000 | elapsed time per iteration (ms): 2428.7 | learning rate: 2.659E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.175868E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.98 | backward: 1805.62 | backward-backward: 1805.59 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.78 + samples/sec: 6.593 | iteration 73400/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 2.658E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.179723E+00 | loss scale: 32768.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.01 | backward: 1803.78 | backward-backward: 1803.76 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 0.79 + samples/sec: 6.597 | iteration 73500/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 2.658E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.202552E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.82 | backward: 1803.30 | backward-backward: 1803.27 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.76 + samples/sec: 6.590 | iteration 73600/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 2.657E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.182404E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.80 | backward: 1805.42 | backward-backward: 1805.39 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.81 + samples/sec: 6.590 | iteration 73700/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 2.656E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.189167E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.00 | backward: 1805.00 | backward-backward: 1804.98 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.77 + samples/sec: 6.597 | iteration 73800/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 2.655E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.174146E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.22 | backward: 1803.46 | backward-backward: 1803.44 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.80 + samples/sec: 6.592 | iteration 73900/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 2.654E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 
3.161929E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.70 | backward: 1804.27 | backward-backward: 1804.25 | backward-allreduce: 0.00 | optimizer: 55.88 | batch generator: 0.84 + samples/sec: 6.585 | iteration 74000/ 320000 | elapsed time per iteration (ms): 2429.8 | learning rate: 2.653E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.183853E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.96 | backward: 1806.78 | backward-backward: 1806.75 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.85 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 74000 | lm_loss value: 3.216368E+00 | lm_loss_ppl value: 2.493739E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.440 | iteration 74100/ 320000 | elapsed time per iteration (ms): 2484.6 | learning rate: 2.652E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.180345E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 567.51 | backward: 1804.53 | backward-backward: 1804.51 | backward-allreduce: 0.00 | optimizer: 55.28 | batch generator: 0.86 + samples/sec: 6.597 | iteration 74200/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 2.651E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.165345E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.34 | backward: 1802.63 | backward-backward: 1802.61 | backward-allreduce: 0.00 | optimizer: 56.16 | batch generator: 0.80 + samples/sec: 6.591 | iteration 74300/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 2.650E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.188603E+00 | loss scale: 
32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.53 | backward: 1804.40 | backward-backward: 1804.38 | backward-allreduce: 0.00 | optimizer: 56.08 | batch generator: 0.82 + samples/sec: 6.590 | iteration 74400/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 2.649E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.153294E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.67 | backward: 1805.12 | backward-backward: 1805.09 | backward-allreduce: 0.00 | optimizer: 55.86 | batch generator: 0.83 + samples/sec: 6.592 | iteration 74500/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 2.648E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.173661E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.82 | backward: 1804.14 | backward-backward: 1804.12 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.79 + samples/sec: 6.598 | iteration 74600/ 320000 | elapsed time per iteration (ms): 2424.8 | learning rate: 2.647E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.157567E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.75 | backward: 1803.15 | backward-backward: 1803.13 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.79 + samples/sec: 6.590 | iteration 74700/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 2.646E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.177882E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.71 | backward: 1805.08 | backward-backward: 1805.05 | backward-allreduce: 0.00 | optimizer: 55.79 | batch generator: 0.78 + samples/sec: 6.591 | iteration 74800/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 2.645E-04 | approx flops 
per GPU: 40.9TFLOPS | lm_loss: 3.167096E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.78 | backward: 1804.73 | backward-backward: 1804.70 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.79 + samples/sec: 6.598 | iteration 74900/ 320000 | elapsed time per iteration (ms): 2425.0 | learning rate: 2.644E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.171949E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.78 | backward: 1803.42 | backward-backward: 1803.40 | backward-allreduce: 0.00 | optimizer: 55.40 | batch generator: 0.79 + samples/sec: 6.587 | iteration 75000/ 320000 | elapsed time per iteration (ms): 2429.0 | learning rate: 2.643E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.179185E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.74 | backward: 1805.11 | backward-backward: 1805.08 | backward-allreduce: 0.00 | optimizer: 56.75 | batch generator: 0.82 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 75000 | lm_loss value: 3.142713E+00 | lm_loss_ppl value: 2.316663E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.437 | iteration 75100/ 320000 | elapsed time per iteration (ms): 2485.5 | learning rate: 2.642E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.180854E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.17 | backward: 1805.59 | backward-backward: 1805.57 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.87 + samples/sec: 6.598 | iteration 75200/ 320000 | elapsed time per iteration (ms): 2425.0 | learning rate: 2.641E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 
3.166694E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.87 | backward: 1803.39 | backward-backward: 1803.37 | backward-allreduce: 0.00 | optimizer: 55.40 | batch generator: 0.81 + samples/sec: 6.590 | iteration 75300/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 2.641E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.165671E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.36 | backward: 1804.92 | backward-backward: 1804.90 | backward-allreduce: 0.00 | optimizer: 55.33 | batch generator: 0.88 + samples/sec: 6.587 | iteration 75400/ 320000 | elapsed time per iteration (ms): 2429.0 | learning rate: 2.640E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.192868E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.15 | backward: 1805.66 | backward-backward: 1805.63 | backward-allreduce: 0.00 | optimizer: 55.82 | batch generator: 0.80 + samples/sec: 6.594 | iteration 75500/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 2.639E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.167572E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.78 | backward: 1804.32 | backward-backward: 1804.30 | backward-allreduce: 0.00 | optimizer: 54.92 | batch generator: 0.82 + samples/sec: 6.597 | iteration 75600/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 2.638E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.169545E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.08 | backward: 1803.66 | backward-backward: 1803.64 | backward-allreduce: 0.00 | optimizer: 55.25 | batch generator: 0.76 + samples/sec: 6.588 | iteration 75700/ 320000 | elapsed time per iteration (ms): 2428.6 | learning rate: 
2.637E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.205864E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.70 | backward: 1805.89 | backward-backward: 1805.86 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.76 + samples/sec: 6.587 | iteration 75800/ 320000 | elapsed time per iteration (ms): 2429.1 | learning rate: 2.636E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.183188E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.57 | backward: 1804.96 | backward-backward: 1804.94 | backward-allreduce: 0.00 | optimizer: 56.25 | batch generator: 0.81 + samples/sec: 6.597 | iteration 75900/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 2.635E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.161022E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.56 | backward: 1802.94 | backward-backward: 1802.91 | backward-allreduce: 0.00 | optimizer: 55.42 | batch generator: 0.98 + samples/sec: 6.591 | iteration 76000/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 2.634E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.158288E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.61 | backward: 1804.91 | backward-backward: 1804.88 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.80 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 76000 | lm_loss value: 3.144618E+00 | lm_loss_ppl value: 2.321081E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.435 | iteration 76100/ 320000 | elapsed time per iteration (ms): 2486.2 | learning rate: 2.633E-04 | approx flops per GPU: 
40.0TFLOPS | lm_loss: 3.177100E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.41 | backward: 1804.97 | backward-backward: 1804.94 | backward-allreduce: 0.00 | optimizer: 56.62 | batch generator: 0.86 + samples/sec: 6.596 | iteration 76200/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 2.632E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.196291E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.04 | backward: 1803.54 | backward-backward: 1803.52 | backward-allreduce: 0.00 | optimizer: 55.83 | batch generator: 0.81 + samples/sec: 6.591 | iteration 76300/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 2.631E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.174299E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.62 | backward: 1804.98 | backward-backward: 1804.95 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.80 + samples/sec: 6.591 | iteration 76400/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 2.630E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.168564E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.02 | backward: 1804.88 | backward-backward: 1804.86 | backward-allreduce: 0.00 | optimizer: 55.40 | batch generator: 0.78 + samples/sec: 6.597 | iteration 76500/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 2.629E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.169187E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.11 | backward: 1803.39 | backward-backward: 1803.37 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.80 + samples/sec: 6.593 | iteration 76600/ 320000 | elapsed time per iteration (ms): 
2426.7 | learning rate: 2.628E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.173500E+00 | loss scale: 16384.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.60 | backward: 1804.75 | backward-backward: 1804.73 | backward-allreduce: 0.00 | optimizer: 54.93 | batch generator: 0.77 + samples/sec: 6.593 | iteration 76700/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 2.627E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.180978E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.25 | backward: 1803.51 | backward-backward: 1803.48 | backward-allreduce: 0.00 | optimizer: 55.82 | batch generator: 0.84 + samples/sec: 6.599 | iteration 76800/ 320000 | elapsed time per iteration (ms): 2424.7 | learning rate: 2.626E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.150291E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.00 | backward: 1802.90 | backward-backward: 1802.88 | backward-allreduce: 0.00 | optimizer: 55.39 | batch generator: 0.81 + samples/sec: 6.589 | iteration 76900/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 2.625E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.168539E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.24 | backward: 1804.76 | backward-backward: 1804.74 | backward-allreduce: 0.00 | optimizer: 55.93 | batch generator: 0.78 + samples/sec: 6.598 | iteration 77000/ 320000 | elapsed time per iteration (ms): 2425.0 | learning rate: 2.624E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.179750E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.23 | backward: 1803.05 | backward-backward: 1803.02 | backward-allreduce: 0.00 | optimizer: 55.37 | batch generator: 0.79 
+---------------------------------------------------------------------------------------------------------- + validation results at iteration 77000 | lm_loss value: 3.154337E+00 | lm_loss_ppl value: 2.343750E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.444 | iteration 77100/ 320000 | elapsed time per iteration (ms): 2483.0 | learning rate: 2.623E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.182405E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.59 | backward: 1803.56 | backward-backward: 1803.53 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.84 + samples/sec: 6.590 | iteration 77200/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 2.622E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.175907E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.74 | backward: 1804.44 | backward-backward: 1804.42 | backward-allreduce: 0.00 | optimizer: 56.19 | batch generator: 0.78 + samples/sec: 6.598 | iteration 77300/ 320000 | elapsed time per iteration (ms): 2425.0 | learning rate: 2.621E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.164641E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.03 | backward: 1803.00 | backward-backward: 1802.98 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.78 + samples/sec: 6.591 | iteration 77400/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 2.620E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.164340E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.83 | backward: 1804.87 | backward-backward: 1804.84 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.77 + samples/sec: 6.598 | iteration 
77500/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 2.619E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.163936E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.48 | backward: 1802.61 | backward-backward: 1802.59 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.81 + samples/sec: 6.595 | iteration 77600/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 2.618E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.170905E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.35 | backward: 1803.66 | backward-backward: 1803.64 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.80 + samples/sec: 6.590 | iteration 77700/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 2.617E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.156612E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.68 | backward: 1805.20 | backward-backward: 1805.17 | backward-allreduce: 0.00 | optimizer: 55.78 | batch generator: 0.84 + samples/sec: 6.597 | iteration 77800/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 2.616E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.176494E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.38 | backward: 1803.33 | backward-backward: 1803.31 | backward-allreduce: 0.00 | optimizer: 55.39 | batch generator: 0.78 + samples/sec: 6.588 | iteration 77900/ 320000 | elapsed time per iteration (ms): 2428.7 | learning rate: 2.615E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.178545E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.89 | backward: 1805.58 | backward-backward: 1805.56 | backward-allreduce: 0.00 | optimizer: 55.76 | 
batch generator: 0.80 + samples/sec: 6.596 | iteration 78000/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 2.614E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.169757E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.57 | backward: 1803.32 | backward-backward: 1803.29 | backward-allreduce: 0.00 | optimizer: 55.37 | batch generator: 0.85 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 78000 | lm_loss value: 3.239316E+00 | lm_loss_ppl value: 2.551626E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.443 | iteration 78100/ 320000 | elapsed time per iteration (ms): 2483.4 | learning rate: 2.613E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.169123E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.46 | backward: 1804.25 | backward-backward: 1804.23 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.84 + samples/sec: 6.589 | iteration 78200/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 2.612E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.167933E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.73 | backward: 1805.37 | backward-backward: 1805.35 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.78 + samples/sec: 6.596 | iteration 78300/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 2.611E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.189564E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.10 | backward: 1803.40 | backward-backward: 1803.37 | backward-allreduce: 0.00 | optimizer: 55.84 | batch generator: 0.78 + 
samples/sec: 6.588 | iteration 78400/ 320000 | elapsed time per iteration (ms): 2428.7 | learning rate: 2.610E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.165418E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.82 | backward: 1805.80 | backward-backward: 1805.77 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.78 + samples/sec: 6.595 | iteration 78500/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 2.609E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.155845E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.29 | backward: 1803.56 | backward-backward: 1803.54 | backward-allreduce: 0.00 | optimizer: 55.87 | batch generator: 0.80 + samples/sec: 6.593 | iteration 78600/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 2.608E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.162360E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.66 | backward: 1804.16 | backward-backward: 1804.14 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.79 + samples/sec: 6.591 | iteration 78700/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 2.607E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.198617E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.70 | backward: 1805.01 | backward-backward: 1804.98 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.81 + samples/sec: 6.598 | iteration 78800/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 2.606E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.166940E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.14 | backward: 1803.59 | backward-backward: 1803.57 | 
backward-allreduce: 0.00 | optimizer: 55.01 | batch generator: 0.80 + samples/sec: 6.589 | iteration 78900/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 2.605E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.163468E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.91 | backward: 1805.12 | backward-backward: 1805.10 | backward-allreduce: 0.00 | optimizer: 55.82 | batch generator: 0.87 + samples/sec: 6.597 | iteration 79000/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 2.604E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.179945E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.33 | backward: 1803.40 | backward-backward: 1803.37 | backward-allreduce: 0.00 | optimizer: 55.37 | batch generator: 0.78 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 79000 | lm_loss value: 3.169758E+00 | lm_loss_ppl value: 2.380172E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.442 | iteration 79100/ 320000 | elapsed time per iteration (ms): 2483.9 | learning rate: 2.603E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.172415E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.50 | backward: 1804.60 | backward-backward: 1804.58 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.85 + samples/sec: 6.590 | iteration 79200/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 2.602E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.174207E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.15 | backward: 1804.86 | backward-backward: 1804.84 | backward-allreduce: 0.00 | 
optimizer: 55.62 | batch generator: 0.79 + samples/sec: 6.596 | iteration 79300/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 2.601E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.163448E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.06 | backward: 1803.57 | backward-backward: 1803.54 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.80 + samples/sec: 6.588 | iteration 79400/ 320000 | elapsed time per iteration (ms): 2428.8 | learning rate: 2.600E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.181931E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.90 | backward: 1805.08 | backward-backward: 1805.06 | backward-allreduce: 0.00 | optimizer: 56.49 | batch generator: 0.83 + samples/sec: 6.593 | iteration 79500/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 2.599E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.163660E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.67 | backward: 1803.90 | backward-backward: 1803.87 | backward-allreduce: 0.00 | optimizer: 55.77 | batch generator: 0.81 + samples/sec: 6.595 | iteration 79600/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 2.598E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.153124E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.18 | backward: 1804.02 | backward-backward: 1803.99 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.80 + samples/sec: 6.590 | iteration 79700/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 2.597E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.164102E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.03 | backward: 1804.82 | 
backward-backward: 1804.79 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.83 + samples/sec: 6.598 | iteration 79800/ 320000 | elapsed time per iteration (ms): 2425.0 | learning rate: 2.596E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.142712E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.19 | backward: 1803.15 | backward-backward: 1803.12 | backward-allreduce: 0.00 | optimizer: 55.33 | batch generator: 0.81 + samples/sec: 6.588 | iteration 79900/ 320000 | elapsed time per iteration (ms): 2428.7 | learning rate: 2.595E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.195369E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.74 | backward: 1805.28 | backward-backward: 1805.25 | backward-allreduce: 0.00 | optimizer: 56.27 | batch generator: 0.82 + samples/sec: 6.593 | iteration 80000/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 2.594E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.150693E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.69 | backward: 1804.56 | backward-backward: 1804.53 | backward-allreduce: 0.00 | optimizer: 55.27 | batch generator: 0.78 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 80000 | lm_loss value: 3.174204E+00 | lm_loss_ppl value: 2.390779E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.245 | iteration 80100/ 320000 | elapsed time per iteration (ms): 2561.9 | learning rate: 2.593E-04 | approx flops per GPU: 38.8TFLOPS | lm_loss: 3.152026E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.32 | backward: 1804.09 | backward-backward: 1804.07 | 
backward-allreduce: 0.00 | optimizer: 54.99 | batch generator: 0.85 + samples/sec: 6.590 | iteration 80200/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 2.592E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.168804E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.55 | backward: 1805.08 | backward-backward: 1805.05 | backward-allreduce: 0.00 | optimizer: 55.96 | batch generator: 0.78 + samples/sec: 6.596 | iteration 80300/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 2.591E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.161660E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.57 | backward: 1802.85 | backward-backward: 1802.83 | backward-allreduce: 0.00 | optimizer: 55.96 | batch generator: 0.82 + samples/sec: 6.590 | iteration 80400/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 2.590E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.155317E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.88 | backward: 1805.05 | backward-backward: 1805.03 | backward-allreduce: 0.00 | optimizer: 55.76 | batch generator: 0.80 + samples/sec: 6.592 | iteration 80500/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 2.589E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.186330E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.30 | backward: 1804.13 | backward-backward: 1804.11 | backward-allreduce: 0.00 | optimizer: 56.24 | batch generator: 0.80 + samples/sec: 6.594 | iteration 80600/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 2.588E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.175036E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 
566.32 | backward: 1804.19 | backward-backward: 1804.17 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.77 + samples/sec: 6.591 | iteration 80700/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 2.587E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.171649E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.72 | backward: 1804.97 | backward-backward: 1804.94 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.81 + samples/sec: 6.598 | iteration 80800/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 2.586E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.164346E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.96 | backward: 1803.24 | backward-backward: 1803.22 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.79 + samples/sec: 6.591 | iteration 80900/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 2.585E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.172136E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.94 | backward: 1804.78 | backward-backward: 1804.75 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.77 + samples/sec: 6.598 | iteration 81000/ 320000 | elapsed time per iteration (ms): 2425.0 | learning rate: 2.584E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.171423E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.61 | backward: 1802.70 | backward-backward: 1802.68 | backward-allreduce: 0.00 | optimizer: 55.29 | batch generator: 0.78 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 81000 | lm_loss value: 3.206590E+00 | lm_loss_ppl value: 2.469474E+01 | 
+---------------------------------------------------------------------------------------------------------- + samples/sec: 6.442 | iteration 81100/ 320000 | elapsed time per iteration (ms): 2483.8 | learning rate: 2.583E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.179810E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.40 | backward: 1804.58 | backward-backward: 1804.56 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.83 + samples/sec: 6.593 | iteration 81200/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 2.582E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.139134E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.60 | backward: 1804.16 | backward-backward: 1804.13 | backward-allreduce: 0.00 | optimizer: 55.84 | batch generator: 0.81 + samples/sec: 6.597 | iteration 81300/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 2.581E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.169472E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.03 | backward: 1803.18 | backward-backward: 1803.16 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.81 + samples/sec: 6.588 | iteration 81400/ 320000 | elapsed time per iteration (ms): 2428.6 | learning rate: 2.580E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.184665E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.86 | backward: 1805.63 | backward-backward: 1805.60 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.80 + samples/sec: 6.597 | iteration 81500/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 2.579E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.149417E+00 | loss scale: 32768.0 | number of skipped iterations: 2 | number of nan 
iterations: 0 | +time (ms) | forward: 566.38 | backward: 1804.15 | backward-backward: 1804.12 | backward-allreduce: 0.00 | optimizer: 54.60 | batch generator: 0.78 + samples/sec: 6.592 | iteration 81600/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 2.578E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.141171E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.39 | backward: 1804.27 | backward-backward: 1804.25 | backward-allreduce: 0.00 | optimizer: 56.07 | batch generator: 0.79 + samples/sec: 6.590 | iteration 81700/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 2.577E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.155536E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.75 | backward: 1805.12 | backward-backward: 1805.09 | backward-allreduce: 0.00 | optimizer: 55.76 | batch generator: 0.79 + samples/sec: 6.598 | iteration 81800/ 320000 | elapsed time per iteration (ms): 2425.0 | learning rate: 2.576E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.180943E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.08 | backward: 1803.12 | backward-backward: 1803.09 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.80 + samples/sec: 6.590 | iteration 81900/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 2.575E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.128937E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.83 | backward: 1804.80 | backward-backward: 1804.78 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.80 + samples/sec: 6.596 | iteration 82000/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 2.574E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.167978E+00 | loss scale: 
32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.41 | backward: 1803.38 | backward-backward: 1803.36 | backward-allreduce: 0.00 | optimizer: 55.37 | batch generator: 0.80 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 82000 | lm_loss value: 3.122636E+00 | lm_loss_ppl value: 2.270616E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.445 | iteration 82100/ 320000 | elapsed time per iteration (ms): 2482.4 | learning rate: 2.573E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.152297E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.43 | backward: 1803.52 | backward-backward: 1803.49 | backward-allreduce: 0.00 | optimizer: 55.34 | batch generator: 0.85 + samples/sec: 6.590 | iteration 82200/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 2.572E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.159345E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.03 | backward: 1804.86 | backward-backward: 1804.83 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.81 + samples/sec: 6.597 | iteration 82300/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 2.571E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.148669E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.10 | backward: 1803.13 | backward-backward: 1803.11 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.79 + samples/sec: 6.591 | iteration 82400/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 2.570E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.145943E+00 | loss scale: 32768.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.84 | backward: 1804.60 | backward-backward: 1804.57 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.81 + samples/sec: 6.595 | iteration 82500/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 2.569E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.145288E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.54 | backward: 1803.76 | backward-backward: 1803.74 | backward-allreduce: 0.00 | optimizer: 55.38 | batch generator: 0.74 + samples/sec: 6.594 | iteration 82600/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 2.568E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.142193E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.27 | backward: 1804.30 | backward-backward: 1804.28 | backward-allreduce: 0.00 | optimizer: 55.36 | batch generator: 0.78 + samples/sec: 6.584 | iteration 82700/ 320000 | elapsed time per iteration (ms): 2430.3 | learning rate: 2.567E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.176481E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.86 | backward: 1806.39 | backward-backward: 1806.36 | backward-allreduce: 0.00 | optimizer: 56.65 | batch generator: 0.78 + samples/sec: 6.595 | iteration 82800/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 2.566E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.147565E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.39 | backward: 1803.97 | backward-backward: 1803.94 | backward-allreduce: 0.00 | optimizer: 55.18 | batch generator: 0.78 + samples/sec: 6.593 | iteration 82900/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 2.565E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 
3.150814E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.40 | backward: 1804.31 | backward-backward: 1804.29 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 0.77 + samples/sec: 6.587 | iteration 83000/ 320000 | elapsed time per iteration (ms): 2428.9 | learning rate: 2.564E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.157881E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.87 | backward: 1805.49 | backward-backward: 1805.47 | backward-allreduce: 0.00 | optimizer: 56.20 | batch generator: 0.80 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 83000 | lm_loss value: 3.179853E+00 | lm_loss_ppl value: 2.404322E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.447 | iteration 83100/ 320000 | elapsed time per iteration (ms): 2482.0 | learning rate: 2.563E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.175343E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.07 | backward: 1803.27 | backward-backward: 1803.25 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.88 + samples/sec: 6.592 | iteration 83200/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 2.562E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.174682E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.68 | backward: 1804.89 | backward-backward: 1804.87 | backward-allreduce: 0.00 | optimizer: 55.35 | batch generator: 0.80 + samples/sec: 6.592 | iteration 83300/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 2.561E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.165422E+00 | loss scale: 
65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.75 | backward: 1804.37 | backward-backward: 1804.35 | backward-allreduce: 0.00 | optimizer: 55.77 | batch generator: 0.86 + samples/sec: 6.596 | iteration 83400/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 2.560E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.168488E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.92 | backward: 1804.17 | backward-backward: 1804.15 | backward-allreduce: 0.00 | optimizer: 55.29 | batch generator: 0.77 + samples/sec: 6.588 | iteration 83500/ 320000 | elapsed time per iteration (ms): 2428.7 | learning rate: 2.559E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.166048E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.86 | backward: 1805.70 | backward-backward: 1805.68 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.80 + samples/sec: 6.595 | iteration 83600/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 2.558E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.140807E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.70 | backward: 1803.64 | backward-backward: 1803.62 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.83 + samples/sec: 6.596 | iteration 83700/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 2.557E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.143196E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.22 | backward: 1804.16 | backward-backward: 1804.13 | backward-allreduce: 0.00 | optimizer: 54.81 | batch generator: 0.77 + samples/sec: 6.586 | iteration 83800/ 320000 | elapsed time per iteration (ms): 2429.4 | learning rate: 2.555E-04 | approx flops 
per GPU: 40.9TFLOPS | lm_loss: 3.154346E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.72 | backward: 1805.86 | backward-backward: 1805.84 | backward-allreduce: 0.00 | optimizer: 56.40 | batch generator: 0.78 + samples/sec: 6.597 | iteration 83900/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 2.554E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.150609E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.09 | backward: 1803.35 | backward-backward: 1803.33 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.77 + samples/sec: 6.593 | iteration 84000/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 2.553E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.160883E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.41 | backward: 1804.41 | backward-backward: 1804.38 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.78 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 84000 | lm_loss value: 3.047928E+00 | lm_loss_ppl value: 2.107165E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.439 | iteration 84100/ 320000 | elapsed time per iteration (ms): 2484.7 | learning rate: 2.552E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.159553E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.99 | backward: 1804.43 | backward-backward: 1804.40 | backward-allreduce: 0.00 | optimizer: 56.10 | batch generator: 0.88 + samples/sec: 6.597 | iteration 84200/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 2.551E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 
3.156771E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.00 | backward: 1803.35 | backward-backward: 1803.32 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.78 + samples/sec: 6.587 | iteration 84300/ 320000 | elapsed time per iteration (ms): 2429.0 | learning rate: 2.550E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.131565E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.95 | backward: 1805.91 | backward-backward: 1805.89 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.81 + samples/sec: 6.595 | iteration 84400/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 2.549E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.166202E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.47 | backward: 1803.53 | backward-backward: 1803.51 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.82 + samples/sec: 6.594 | iteration 84500/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 2.548E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.134816E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.40 | backward: 1803.66 | backward-backward: 1803.63 | backward-allreduce: 0.00 | optimizer: 55.95 | batch generator: 0.84 + samples/sec: 6.589 | iteration 84600/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 2.547E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.167708E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.32 | backward: 1804.80 | backward-backward: 1804.77 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.85 + samples/sec: 6.597 | iteration 84700/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 
2.546E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.145014E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.25 | backward: 1803.31 | backward-backward: 1803.29 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.84 + samples/sec: 6.586 | iteration 84800/ 320000 | elapsed time per iteration (ms): 2429.5 | learning rate: 2.545E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.141770E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.83 | backward: 1805.66 | backward-backward: 1805.64 | backward-allreduce: 0.00 | optimizer: 56.63 | batch generator: 0.80 + samples/sec: 6.592 | iteration 84900/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 2.544E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.137883E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.90 | backward: 1804.47 | backward-backward: 1804.44 | backward-allreduce: 0.00 | optimizer: 55.38 | batch generator: 0.80 + samples/sec: 6.600 | iteration 85000/ 320000 | elapsed time per iteration (ms): 2424.4 | learning rate: 2.543E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.148779E+00 | loss scale: 32768.0 | number of skipped iterations: 2 | number of nan iterations: 0 | +time (ms) | forward: 566.19 | backward: 1803.38 | backward-backward: 1803.36 | backward-allreduce: 0.00 | optimizer: 54.44 | batch generator: 0.76 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 85000 | lm_loss value: 3.227161E+00 | lm_loss_ppl value: 2.520798E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.438 | iteration 85100/ 320000 | elapsed time per iteration (ms): 2485.3 | learning rate: 2.542E-04 | approx flops per GPU: 
40.0TFLOPS | lm_loss: 3.176615E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.00 | backward: 1805.37 | backward-backward: 1805.35 | backward-allreduce: 0.00 | optimizer: 55.81 | batch generator: 0.86 + samples/sec: 6.596 | iteration 85200/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 2.541E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.181065E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.36 | backward: 1803.22 | backward-backward: 1803.19 | backward-allreduce: 0.00 | optimizer: 55.76 | batch generator: 0.81 + samples/sec: 6.594 | iteration 85300/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 2.540E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.138063E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.54 | backward: 1803.87 | backward-backward: 1803.84 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.78 + samples/sec: 6.592 | iteration 85400/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 2.539E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.175978E+00 | loss scale: 16384.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.76 | backward: 1804.75 | backward-backward: 1804.72 | backward-allreduce: 0.00 | optimizer: 55.14 | batch generator: 0.78 + samples/sec: 6.600 | iteration 85500/ 320000 | elapsed time per iteration (ms): 2424.2 | learning rate: 2.538E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.137490E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.93 | backward: 1802.31 | backward-backward: 1802.29 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.76 + samples/sec: 6.592 | iteration 85600/ 320000 | elapsed time per iteration (ms): 
2427.3 | learning rate: 2.536E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.163027E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.11 | backward: 1804.21 | backward-backward: 1804.18 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.80 + samples/sec: 6.596 | iteration 85700/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 2.535E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.147916E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.45 | backward: 1803.24 | backward-backward: 1803.21 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.83 + samples/sec: 6.596 | iteration 85800/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 2.534E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.144922E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.03 | backward: 1803.34 | backward-backward: 1803.32 | backward-allreduce: 0.00 | optimizer: 55.87 | batch generator: 0.75 + samples/sec: 6.590 | iteration 85900/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 2.533E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.174689E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.43 | backward: 1805.07 | backward-backward: 1805.05 | backward-allreduce: 0.00 | optimizer: 56.09 | batch generator: 0.73 + samples/sec: 6.596 | iteration 86000/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 2.532E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.154162E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.29 | backward: 1803.27 | backward-backward: 1803.25 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.80 
+---------------------------------------------------------------------------------------------------------- + validation results at iteration 86000 | lm_loss value: 3.167988E+00 | lm_loss_ppl value: 2.375963E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.445 | iteration 86100/ 320000 | elapsed time per iteration (ms): 2482.6 | learning rate: 2.531E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.150589E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.40 | backward: 1803.30 | backward-backward: 1803.28 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 0.87 + samples/sec: 6.591 | iteration 86200/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 2.530E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.130128E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.92 | backward: 1804.33 | backward-backward: 1804.31 | backward-allreduce: 0.00 | optimizer: 55.97 | batch generator: 0.82 + samples/sec: 6.599 | iteration 86300/ 320000 | elapsed time per iteration (ms): 2424.7 | learning rate: 2.529E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.170512E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.24 | backward: 1802.48 | backward-backward: 1802.45 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.80 + samples/sec: 6.595 | iteration 86400/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 2.528E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.151414E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.32 | backward: 1803.50 | backward-backward: 1803.48 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.78 + samples/sec: 6.591 | iteration 
86500/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 2.527E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.167277E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.79 | backward: 1804.78 | backward-backward: 1804.75 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.77 + samples/sec: 6.602 | iteration 86600/ 320000 | elapsed time per iteration (ms): 2423.5 | learning rate: 2.526E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.140055E+00 | loss scale: 16384.0 | number of skipped iterations: 2 | number of nan iterations: 0 | +time (ms) | forward: 566.00 | backward: 1802.53 | backward-backward: 1802.51 | backward-allreduce: 0.00 | optimizer: 54.64 | batch generator: 0.76 + samples/sec: 6.593 | iteration 86700/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 2.525E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.146365E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.60 | backward: 1804.06 | backward-backward: 1804.03 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.79 + samples/sec: 6.593 | iteration 86800/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 2.524E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.131602E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.85 | backward: 1803.91 | backward-backward: 1803.89 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.79 + samples/sec: 6.599 | iteration 86900/ 320000 | elapsed time per iteration (ms): 2424.5 | learning rate: 2.523E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.131111E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.05 | backward: 1802.36 | backward-backward: 1802.33 | backward-allreduce: 0.00 | optimizer: 55.74 | 
batch generator: 0.78 + samples/sec: 6.591 | iteration 87000/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 2.522E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.145404E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.75 | backward: 1803.71 | backward-backward: 1803.69 | backward-allreduce: 0.00 | optimizer: 56.77 | batch generator: 0.79 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 87000 | lm_loss value: 3.088933E+00 | lm_loss_ppl value: 2.195365E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.440 | iteration 87100/ 320000 | elapsed time per iteration (ms): 2484.6 | learning rate: 2.520E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.172390E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.89 | backward: 1804.81 | backward-backward: 1804.78 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.89 + samples/sec: 6.599 | iteration 87200/ 320000 | elapsed time per iteration (ms): 2424.6 | learning rate: 2.519E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.141731E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.40 | backward: 1802.31 | backward-backward: 1802.28 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.78 + samples/sec: 6.595 | iteration 87300/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 2.518E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.154305E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.35 | backward: 1803.41 | backward-backward: 1803.38 | backward-allreduce: 0.00 | optimizer: 55.99 | batch generator: 0.78 + 
samples/sec: 6.592 | iteration 87400/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 2.517E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.135766E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.83 | backward: 1804.33 | backward-backward: 1804.30 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.81 + samples/sec: 6.599 | iteration 87500/ 320000 | elapsed time per iteration (ms): 2424.8 | learning rate: 2.516E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.179732E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.36 | backward: 1802.50 | backward-backward: 1802.47 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.78 + samples/sec: 6.596 | iteration 87600/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 2.515E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.142006E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.58 | backward: 1803.32 | backward-backward: 1803.30 | backward-allreduce: 0.00 | optimizer: 55.35 | batch generator: 0.92 + samples/sec: 6.591 | iteration 87700/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 2.514E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.135703E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.62 | backward: 1804.94 | backward-backward: 1804.91 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.75 + samples/sec: 6.596 | iteration 87800/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 2.513E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.144582E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.55 | backward: 1803.30 | backward-backward: 1803.27 | 
backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.76 + samples/sec: 6.596 | iteration 87900/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 2.512E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.141505E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.56 | backward: 1803.37 | backward-backward: 1803.34 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.77 + samples/sec: 6.591 | iteration 88000/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 2.511E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.163436E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.89 | backward: 1804.59 | backward-backward: 1804.57 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.79 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 88000 | lm_loss value: 3.129302E+00 | lm_loss_ppl value: 2.285801E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.441 | iteration 88100/ 320000 | elapsed time per iteration (ms): 2484.2 | learning rate: 2.510E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.140759E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.83 | backward: 1803.48 | backward-backward: 1803.45 | backward-allreduce: 0.00 | optimizer: 56.70 | batch generator: 0.87 + samples/sec: 6.597 | iteration 88200/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 2.509E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.151631E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.98 | backward: 1803.45 | backward-backward: 1803.43 | backward-allreduce: 0.00 | 
optimizer: 55.52 | batch generator: 0.78 + samples/sec: 6.591 | iteration 88300/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 2.507E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.160329E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.85 | backward: 1805.02 | backward-backward: 1805.00 | backward-allreduce: 0.00 | optimizer: 55.14 | batch generator: 0.77 + samples/sec: 6.594 | iteration 88400/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 2.506E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.132620E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.65 | backward: 1803.86 | backward-backward: 1803.84 | backward-allreduce: 0.00 | optimizer: 55.39 | batch generator: 0.81 + samples/sec: 6.597 | iteration 88500/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 2.505E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.153297E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.33 | backward: 1803.04 | backward-backward: 1803.01 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 1.02 + samples/sec: 6.588 | iteration 88600/ 320000 | elapsed time per iteration (ms): 2428.6 | learning rate: 2.504E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.159380E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.15 | backward: 1805.19 | backward-backward: 1805.17 | backward-allreduce: 0.00 | optimizer: 55.87 | batch generator: 0.78 + samples/sec: 6.592 | iteration 88700/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 2.503E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.137291E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.71 | backward: 1804.24 | 
backward-backward: 1804.21 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.77 + samples/sec: 6.600 | iteration 88800/ 320000 | elapsed time per iteration (ms): 2424.4 | learning rate: 2.502E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.132334E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.75 | backward: 1802.83 | backward-backward: 1802.81 | backward-allreduce: 0.00 | optimizer: 55.40 | batch generator: 0.77 + samples/sec: 6.592 | iteration 88900/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 2.501E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.133323E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.55 | backward: 1804.41 | backward-backward: 1804.38 | backward-allreduce: 0.00 | optimizer: 55.80 | batch generator: 0.79 + samples/sec: 6.592 | iteration 89000/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 2.500E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.128971E+00 | loss scale: 16384.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.71 | backward: 1804.95 | backward-backward: 1804.92 | backward-allreduce: 0.00 | optimizer: 55.22 | batch generator: 0.80 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 89000 | lm_loss value: 3.187423E+00 | lm_loss_ppl value: 2.422593E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.443 | iteration 89100/ 320000 | elapsed time per iteration (ms): 2483.3 | learning rate: 2.499E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.152600E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.72 | backward: 1803.88 | backward-backward: 1803.85 | 
backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.83 + samples/sec: 6.597 | iteration 89200/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 2.498E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.136109E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.90 | backward: 1802.78 | backward-backward: 1802.75 | backward-allreduce: 0.00 | optimizer: 56.25 | batch generator: 0.81 + samples/sec: 6.594 | iteration 89300/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 2.496E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.150872E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.52 | backward: 1803.88 | backward-backward: 1803.85 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.78 + samples/sec: 6.593 | iteration 89400/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 2.495E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.165646E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.75 | backward: 1803.85 | backward-backward: 1803.82 | backward-allreduce: 0.00 | optimizer: 55.86 | batch generator: 0.86 + samples/sec: 6.600 | iteration 89500/ 320000 | elapsed time per iteration (ms): 2424.4 | learning rate: 2.494E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.146589E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.84 | backward: 1802.56 | backward-backward: 1802.54 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.78 + samples/sec: 6.593 | iteration 89600/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 2.493E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.152248E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 
566.77 | backward: 1804.01 | backward-backward: 1803.99 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.78 + samples/sec: 6.593 | iteration 89700/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 2.492E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.135556E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.64 | backward: 1804.17 | backward-backward: 1804.14 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.75 + samples/sec: 6.595 | iteration 89800/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 2.491E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.137495E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.52 | backward: 1803.31 | backward-backward: 1803.28 | backward-allreduce: 0.00 | optimizer: 55.85 | batch generator: 0.81 + samples/sec: 6.599 | iteration 89900/ 320000 | elapsed time per iteration (ms): 2424.5 | learning rate: 2.490E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.154153E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.89 | backward: 1802.68 | backward-backward: 1802.66 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.75 + samples/sec: 6.591 | iteration 90000/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 2.489E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.139104E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.73 | backward: 1804.52 | backward-backward: 1804.50 | backward-allreduce: 0.00 | optimizer: 55.98 | batch generator: 0.77 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 90000 | lm_loss value: 3.137366E+00 | lm_loss_ppl value: 2.304309E+01 | 
+---------------------------------------------------------------------------------------------------------- + samples/sec: 6.238 | iteration 90100/ 320000 | elapsed time per iteration (ms): 2564.8 | learning rate: 2.488E-04 | approx flops per GPU: 38.8TFLOPS | lm_loss: 3.149582E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.60 | backward: 1804.81 | backward-backward: 1804.79 | backward-allreduce: 0.00 | optimizer: 55.93 | batch generator: 0.85 + samples/sec: 6.592 | iteration 90200/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 2.487E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.138968E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.74 | backward: 1804.09 | backward-backward: 1804.06 | backward-allreduce: 0.00 | optimizer: 55.86 | batch generator: 0.79 + samples/sec: 6.595 | iteration 90300/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 2.485E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.128783E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.82 | backward: 1803.77 | backward-backward: 1803.74 | backward-allreduce: 0.00 | optimizer: 56.04 | batch generator: 0.79 + samples/sec: 6.593 | iteration 90400/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 2.484E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.141105E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.48 | backward: 1804.35 | backward-backward: 1804.33 | backward-allreduce: 0.00 | optimizer: 55.79 | batch generator: 0.78 + samples/sec: 6.591 | iteration 90500/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 2.483E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.134490E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | +time (ms) | forward: 566.53 | backward: 1804.51 | backward-backward: 1804.49 | backward-allreduce: 0.00 | optimizer: 56.12 | batch generator: 0.76 + samples/sec: 6.592 | iteration 90600/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 2.482E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.144060E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.73 | backward: 1804.39 | backward-backward: 1804.37 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.81 + samples/sec: 6.598 | iteration 90700/ 320000 | elapsed time per iteration (ms): 2424.9 | learning rate: 2.481E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.158707E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.16 | backward: 1802.87 | backward-backward: 1802.85 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.77 + samples/sec: 6.596 | iteration 90800/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 2.480E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.138225E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.20 | backward: 1803.79 | backward-backward: 1803.76 | backward-allreduce: 0.00 | optimizer: 55.39 | batch generator: 0.76 + samples/sec: 6.593 | iteration 90900/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 2.479E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.123425E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.73 | backward: 1804.11 | backward-backward: 1804.09 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.87 + samples/sec: 6.591 | iteration 91000/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 2.478E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.135434E+00 | loss scale: 
65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.71 | backward: 1804.70 | backward-backward: 1804.68 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.80 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 91000 | lm_loss value: 3.195013E+00 | lm_loss_ppl value: 2.441049E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.442 | iteration 91100/ 320000 | elapsed time per iteration (ms): 2483.6 | learning rate: 2.477E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.136804E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.74 | backward: 1804.32 | backward-backward: 1804.30 | backward-allreduce: 0.00 | optimizer: 55.38 | batch generator: 0.87 + samples/sec: 6.599 | iteration 91200/ 320000 | elapsed time per iteration (ms): 2424.7 | learning rate: 2.475E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.128068E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.22 | backward: 1803.22 | backward-backward: 1803.20 | backward-allreduce: 0.00 | optimizer: 54.93 | batch generator: 0.80 + samples/sec: 6.591 | iteration 91300/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 2.474E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.131796E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.61 | backward: 1804.67 | backward-backward: 1804.64 | backward-allreduce: 0.00 | optimizer: 55.76 | batch generator: 0.82 + samples/sec: 6.588 | iteration 91400/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 2.473E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.151454E+00 | loss scale: 32768.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.56 | backward: 1805.16 | backward-backward: 1805.14 | backward-allreduce: 0.00 | optimizer: 56.44 | batch generator: 0.77 + samples/sec: 6.592 | iteration 91500/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 2.472E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.138353E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.60 | backward: 1804.32 | backward-backward: 1804.30 | backward-allreduce: 0.00 | optimizer: 55.95 | batch generator: 0.79 + samples/sec: 6.593 | iteration 91600/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 2.471E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.153589E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.86 | backward: 1804.01 | backward-backward: 1803.99 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.82 + samples/sec: 6.599 | iteration 91700/ 320000 | elapsed time per iteration (ms): 2424.8 | learning rate: 2.470E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.128453E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.99 | backward: 1802.89 | backward-backward: 1802.86 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.78 + samples/sec: 6.594 | iteration 91800/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 2.469E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.135032E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.49 | backward: 1803.84 | backward-backward: 1803.81 | backward-allreduce: 0.00 | optimizer: 55.78 | batch generator: 0.82 + samples/sec: 6.594 | iteration 91900/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 2.468E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 
3.141266E+00 | loss scale: 16384.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.40 | backward: 1804.30 | backward-backward: 1804.28 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.77 + samples/sec: 6.595 | iteration 92000/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 2.466E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.128524E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.42 | backward: 1803.84 | backward-backward: 1803.81 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.78 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 92000 | lm_loss value: 3.137675E+00 | lm_loss_ppl value: 2.305021E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.446 | iteration 92100/ 320000 | elapsed time per iteration (ms): 2482.1 | learning rate: 2.465E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.130575E+00 | loss scale: 8192.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.50 | backward: 1803.16 | backward-backward: 1803.13 | backward-allreduce: 0.00 | optimizer: 55.27 | batch generator: 0.87 + samples/sec: 6.595 | iteration 92200/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 2.464E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.127632E+00 | loss scale: 8192.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.87 | backward: 1803.06 | backward-backward: 1803.04 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.96 + samples/sec: 6.603 | iteration 92300/ 320000 | elapsed time per iteration (ms): 2423.2 | learning rate: 2.463E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.149431E+00 | loss scale: 8192.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.93 | backward: 1801.35 | backward-backward: 1801.32 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.80 + samples/sec: 6.597 | iteration 92400/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 2.462E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.154929E+00 | loss scale: 8192.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.58 | backward: 1802.62 | backward-backward: 1802.59 | backward-allreduce: 0.00 | optimizer: 55.78 | batch generator: 0.85 + samples/sec: 6.593 | iteration 92500/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 2.461E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.143011E+00 | loss scale: 8192.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.47 | backward: 1803.58 | backward-backward: 1803.55 | backward-allreduce: 0.00 | optimizer: 56.34 | batch generator: 0.81 + samples/sec: 6.596 | iteration 92600/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 2.460E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.173962E+00 | loss scale: 8192.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.55 | backward: 1802.96 | backward-backward: 1802.93 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.84 + samples/sec: 6.596 | iteration 92700/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 2.459E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.148442E+00 | loss scale: 8192.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.56 | backward: 1803.19 | backward-backward: 1803.16 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.81 + samples/sec: 6.597 | iteration 92800/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 2.457E-04 | approx flops per GPU: 
41.0TFLOPS | lm_loss: 3.139879E+00 | loss scale: 8192.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.51 | backward: 1803.11 | backward-backward: 1803.08 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.78 + samples/sec: 6.598 | iteration 92900/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 2.456E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.129960E+00 | loss scale: 8192.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.55 | backward: 1802.57 | backward-backward: 1802.54 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.80 + samples/sec: 6.602 | iteration 93000/ 320000 | elapsed time per iteration (ms): 2423.6 | learning rate: 2.455E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.143069E+00 | loss scale: 8192.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.05 | backward: 1801.46 | backward-backward: 1801.44 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.80 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 93000 | lm_loss value: 3.201482E+00 | lm_loss_ppl value: 2.456890E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.442 | iteration 93100/ 320000 | elapsed time per iteration (ms): 2483.7 | learning rate: 2.454E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.107549E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.65 | backward: 1803.81 | backward-backward: 1803.78 | backward-allreduce: 0.00 | optimizer: 56.05 | batch generator: 0.89 + samples/sec: 6.595 | iteration 93200/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 2.453E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.156127E+00 | 
loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.54 | backward: 1803.67 | backward-backward: 1803.64 | backward-allreduce: 0.00 | optimizer: 55.38 | batch generator: 0.77 + samples/sec: 6.595 | iteration 93300/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 2.452E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.137556E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.68 | backward: 1803.38 | backward-backward: 1803.36 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.79 + samples/sec: 6.594 | iteration 93400/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 2.451E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.130562E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.70 | backward: 1803.54 | backward-backward: 1803.51 | backward-allreduce: 0.00 | optimizer: 55.85 | batch generator: 0.76 + samples/sec: 6.592 | iteration 93500/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 2.449E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.151012E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.52 | backward: 1804.45 | backward-backward: 1804.43 | backward-allreduce: 0.00 | optimizer: 55.86 | batch generator: 0.84 + samples/sec: 6.592 | iteration 93600/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 2.448E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.135820E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.39 | backward: 1804.12 | backward-backward: 1804.10 | backward-allreduce: 0.00 | optimizer: 56.15 | batch generator: 0.76 + samples/sec: 6.596 | iteration 93700/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 2.447E-04 | 
approx flops per GPU: 41.0TFLOPS | lm_loss: 3.123899E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.41 | backward: 1803.44 | backward-backward: 1803.41 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.81 + samples/sec: 6.596 | iteration 93800/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 2.446E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.122383E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.29 | backward: 1803.51 | backward-backward: 1803.48 | backward-allreduce: 0.00 | optimizer: 55.32 | batch generator: 0.76 + samples/sec: 6.597 | iteration 93900/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 2.445E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.141813E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.50 | backward: 1803.19 | backward-backward: 1803.17 | backward-allreduce: 0.00 | optimizer: 55.40 | batch generator: 0.84 + samples/sec: 6.595 | iteration 94000/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 2.444E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.141225E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.29 | backward: 1803.70 | backward-backward: 1803.68 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.79 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 94000 | lm_loss value: 3.155014E+00 | lm_loss_ppl value: 2.345337E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.442 | iteration 94100/ 320000 | elapsed time per iteration (ms): 2483.7 | learning rate: 2.443E-04 | approx flops per GPU: 40.0TFLOPS 
| lm_loss: 3.134548E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.46 | backward: 1804.36 | backward-backward: 1804.33 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.86 + samples/sec: 6.600 | iteration 94200/ 320000 | elapsed time per iteration (ms): 2424.3 | learning rate: 2.441E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.123069E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.28 | backward: 1802.00 | backward-backward: 1801.97 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.79 + samples/sec: 6.595 | iteration 94300/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 2.440E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.153562E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.27 | backward: 1803.73 | backward-backward: 1803.70 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.78 + samples/sec: 6.591 | iteration 94400/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 2.439E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.145496E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.86 | backward: 1804.66 | backward-backward: 1804.64 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.79 + samples/sec: 6.593 | iteration 94500/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 2.438E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.127617E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.32 | backward: 1804.48 | backward-backward: 1804.46 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.83 + samples/sec: 6.592 | iteration 94600/ 320000 | elapsed time per iteration (ms): 2427.3 | 
learning rate: 2.437E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.136671E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.49 | backward: 1803.94 | backward-backward: 1803.91 | backward-allreduce: 0.00 | optimizer: 56.51 | batch generator: 0.81 + samples/sec: 6.597 | iteration 94700/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 2.436E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.137934E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.42 | backward: 1803.53 | backward-backward: 1803.50 | backward-allreduce: 0.00 | optimizer: 54.91 | batch generator: 0.85 + samples/sec: 6.596 | iteration 94800/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 2.435E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.135816E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.29 | backward: 1803.45 | backward-backward: 1803.43 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.75 + samples/sec: 6.592 | iteration 94900/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 2.433E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.120821E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.40 | backward: 1804.70 | backward-backward: 1804.67 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 0.79 + samples/sec: 6.596 | iteration 95000/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 2.432E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.138617E+00 | loss scale: 16384.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.52 | backward: 1803.74 | backward-backward: 1803.72 | backward-allreduce: 0.00 | optimizer: 55.08 | batch generator: 0.81 
+---------------------------------------------------------------------------------------------------------- + validation results at iteration 95000 | lm_loss value: 3.155421E+00 | lm_loss_ppl value: 2.346292E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.448 | iteration 95100/ 320000 | elapsed time per iteration (ms): 2481.5 | learning rate: 2.431E-04 | approx flops per GPU: 40.1TFLOPS | lm_loss: 3.133061E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.18 | backward: 1802.58 | backward-backward: 1802.55 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.84 + samples/sec: 6.596 | iteration 95200/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 2.430E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.110448E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.66 | backward: 1802.63 | backward-backward: 1802.61 | backward-allreduce: 0.00 | optimizer: 56.05 | batch generator: 0.96 + samples/sec: 6.593 | iteration 95300/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 2.429E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.118309E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.49 | backward: 1804.17 | backward-backward: 1804.14 | backward-allreduce: 0.00 | optimizer: 55.94 | batch generator: 0.79 + samples/sec: 6.595 | iteration 95400/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 2.428E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.099602E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.40 | backward: 1803.57 | backward-backward: 1803.54 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.84 + samples/sec: 6.600 | iteration 
95500/ 320000 | elapsed time per iteration (ms): 2424.1 | learning rate: 2.426E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.123442E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.09 | backward: 1802.12 | backward-backward: 1802.10 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.80 + samples/sec: 6.594 | iteration 95600/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 2.425E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.134792E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.40 | backward: 1804.12 | backward-backward: 1804.10 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.79 + samples/sec: 6.592 | iteration 95700/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 2.424E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.134197E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.73 | backward: 1803.58 | backward-backward: 1803.56 | backward-allreduce: 0.00 | optimizer: 56.46 | batch generator: 0.81 + samples/sec: 6.599 | iteration 95800/ 320000 | elapsed time per iteration (ms): 2424.6 | learning rate: 2.423E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.112358E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.09 | backward: 1802.72 | backward-backward: 1802.70 | backward-allreduce: 0.00 | optimizer: 55.42 | batch generator: 0.80 + samples/sec: 6.593 | iteration 95900/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 2.422E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.123313E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.52 | backward: 1804.30 | backward-backward: 1804.28 | backward-allreduce: 0.00 | optimizer: 55.54 | 
batch generator: 0.77 + samples/sec: 6.593 | iteration 96000/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 2.421E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.132708E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.03 | backward: 1803.57 | backward-backward: 1803.55 | backward-allreduce: 0.00 | optimizer: 55.92 | batch generator: 0.81 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 96000 | lm_loss value: 3.085776E+00 | lm_loss_ppl value: 2.188445E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.447 | iteration 96100/ 320000 | elapsed time per iteration (ms): 2481.7 | learning rate: 2.420E-04 | approx flops per GPU: 40.1TFLOPS | lm_loss: 3.150459E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.21 | backward: 1802.73 | backward-backward: 1802.71 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.87 + samples/sec: 6.597 | iteration 96200/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 2.418E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.133239E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.51 | backward: 1803.17 | backward-backward: 1803.15 | backward-allreduce: 0.00 | optimizer: 55.36 | batch generator: 0.75 + samples/sec: 6.595 | iteration 96300/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 2.417E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.143748E+00 | loss scale: 16384.0 | number of skipped iterations: 2 | number of nan iterations: 0 | +time (ms) | forward: 566.41 | backward: 1804.31 | backward-backward: 1804.28 | backward-allreduce: 0.00 | optimizer: 54.88 | batch generator: 0.86 + 
samples/sec: 6.599 | iteration 96400/ 320000 | elapsed time per iteration (ms): 2424.7 | learning rate: 2.416E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.146658E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.25 | backward: 1802.22 | backward-backward: 1802.20 | backward-allreduce: 0.00 | optimizer: 55.81 | batch generator: 0.88 + samples/sec: 6.597 | iteration 96500/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 2.415E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.115298E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.41 | backward: 1802.57 | backward-backward: 1802.54 | backward-allreduce: 0.00 | optimizer: 55.91 | batch generator: 0.78 + samples/sec: 6.595 | iteration 96600/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 2.414E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.130354E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.52 | backward: 1803.60 | backward-backward: 1803.58 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.79 + samples/sec: 6.600 | iteration 96700/ 320000 | elapsed time per iteration (ms): 2424.3 | learning rate: 2.413E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.142833E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.08 | backward: 1802.39 | backward-backward: 1802.37 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.80 + samples/sec: 6.591 | iteration 96800/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 2.411E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.129725E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.25 | backward: 1804.60 | backward-backward: 1804.58 | 
backward-allreduce: 0.00 | optimizer: 56.43 | batch generator: 0.73 + samples/sec: 6.597 | iteration 96900/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 2.410E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.140675E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.15 | backward: 1803.31 | backward-backward: 1803.29 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.77 + samples/sec: 6.599 | iteration 97000/ 320000 | elapsed time per iteration (ms): 2424.7 | learning rate: 2.409E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.114683E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.30 | backward: 1802.42 | backward-backward: 1802.40 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.78 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 97000 | lm_loss value: 3.075826E+00 | lm_loss_ppl value: 2.166777E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.442 | iteration 97100/ 320000 | elapsed time per iteration (ms): 2483.8 | learning rate: 2.408E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.112626E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.63 | backward: 1804.17 | backward-backward: 1804.14 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.85 + samples/sec: 6.601 | iteration 97200/ 320000 | elapsed time per iteration (ms): 2423.8 | learning rate: 2.407E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.091358E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.07 | backward: 1801.84 | backward-backward: 1801.82 | backward-allreduce: 0.00 | 
optimizer: 55.47 | batch generator: 0.82 + samples/sec: 6.594 | iteration 97300/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 2.406E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.119040E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.51 | backward: 1803.81 | backward-backward: 1803.79 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.80 + samples/sec: 6.597 | iteration 97400/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 2.404E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.139204E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.44 | backward: 1803.09 | backward-backward: 1803.06 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.80 + samples/sec: 6.599 | iteration 97500/ 320000 | elapsed time per iteration (ms): 2424.6 | learning rate: 2.403E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.112606E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.32 | backward: 1802.49 | backward-backward: 1802.47 | backward-allreduce: 0.00 | optimizer: 55.40 | batch generator: 0.81 + samples/sec: 6.590 | iteration 97600/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 2.402E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.121924E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.93 | backward: 1804.89 | backward-backward: 1804.87 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.79 + samples/sec: 6.601 | iteration 97700/ 320000 | elapsed time per iteration (ms): 2424.0 | learning rate: 2.401E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.142015E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 565.96 | backward: 1802.85 | 
backward-backward: 1802.82 | backward-allreduce: 0.00 | optimizer: 54.86 | batch generator: 0.78 + samples/sec: 6.593 | iteration 97800/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 2.400E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.125328E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.53 | backward: 1804.25 | backward-backward: 1804.23 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.78 + samples/sec: 6.592 | iteration 97900/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 2.398E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.132265E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.37 | backward: 1803.77 | backward-backward: 1803.75 | backward-allreduce: 0.00 | optimizer: 56.46 | batch generator: 0.81 + samples/sec: 6.597 | iteration 98000/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 2.397E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.142570E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.33 | backward: 1803.11 | backward-backward: 1803.09 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.82 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 98000 | lm_loss value: 3.032373E+00 | lm_loss_ppl value: 2.074640E+01 | +---------------------------------------------------------------------------------------------------------- + samples/sec: 6.441 | iteration 98100/ 320000 | elapsed time per iteration (ms): 2484.0 | learning rate: 2.396E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.124872E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.47 | backward: 1804.76 | backward-backward: 1804.73 | 
backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.88 + samples/sec: 6.599 | iteration 98200/ 320000 | elapsed time per iteration (ms): 2424.5 | learning rate: 2.395E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.143596E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.12 | backward: 1802.55 | backward-backward: 1802.53 | backward-allreduce: 0.00 | optimizer: 55.41 | batch generator: 0.83 + samples/sec: 6.593 | iteration 98300/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 2.394E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.137769E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.57 | backward: 1804.10 | backward-backward: 1804.08 | backward-allreduce: 0.00 | optimizer: 55.79 | batch generator: 0.80 + samples/sec: 6.596 | iteration 98400/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 2.393E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.122183E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.26 | backward: 1803.87 | backward-backward: 1803.85 | backward-allreduce: 0.00 | optimizer: 55.31 | batch generator: 0.78 + samples/sec: 6.596 | iteration 98500/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 2.391E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.130582E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.49 | backward: 1803.17 | backward-backward: 1803.15 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.81 + samples/sec: 6.592 | iteration 98600/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 2.390E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.136077E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 
566.42 | backward: 1804.79 | backward-backward: 1804.77 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.77 + samples/sec: 6.600 | iteration 98700/ 320000 | elapsed time per iteration (ms): 2424.2 | learning rate: 2.389E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.141614E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.24 | backward: 1802.50 | backward-backward: 1802.47 | backward-allreduce: 0.00 | optimizer: 55.08 | batch generator: 0.81 + samples/sec: 6.595 | iteration 98800/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 2.388E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.121660E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.58 | backward: 1803.98 | backward-backward: 1803.96 | backward-allreduce: 0.00 | optimizer: 55.24 | batch generator: 0.78 + samples/sec: 6.595 | iteration 98900/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 2.387E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.121409E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.36 | backward: 1803.79 | backward-backward: 1803.76 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 0.81 + samples/sec: 6.595 | iteration 99000/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 2.385E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.097149E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.90 | backward: 1803.76 | backward-backward: 1803.74 | backward-allreduce: 0.00 | optimizer: 56.03 | batch generator: 0.77 +---------------------------------------------------------------------------------------------------------- + validation results at iteration 99000 | lm_loss value: 3.158209E+00 | lm_loss_ppl value: 2.352843E+01 | 
+---------------------------------------------------------------------------------------------------------- + samples/sec: 6.440 | iteration 99100/ 320000 | elapsed time per iteration (ms): 2484.3 | learning rate: 2.384E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.138651E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.66 | backward: 1804.76 | backward-backward: 1804.74 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.84 + samples/sec: 6.598 | iteration 99200/ 320000 | elapsed time per iteration (ms): 2425.0 | learning rate: 2.383E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.123643E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.89 | backward: 1803.05 | backward-backward: 1803.02 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.78 + samples/sec: 6.596 | iteration 99300/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 2.382E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.126285E+00 | loss scale: 16384.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.52 | backward: 1803.58 | backward-backward: 1803.55 | backward-allreduce: 0.00 | optimizer: 55.13 | batch generator: 0.78 + samples/sec: 6.595 | iteration 99400/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 2.381E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.117166E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.62 | backward: 1803.37 | backward-backward: 1803.35 | backward-allreduce: 0.00 | optimizer: 55.78 | batch generator: 0.88 + samples/sec: 6.599 | iteration 99500/ 320000 | elapsed time per iteration (ms): 2424.5 | learning rate: 2.380E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.147147E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | +time (ms) | forward: 566.78 | backward: 1801.86 | backward-backward: 1801.84 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.77 + samples/sec: 6.592 | iteration 99600/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 2.378E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.137367E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.49 | backward: 1804.53 | backward-backward: 1804.50 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.79 + samples/sec: 6.595 | iteration 99700/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 2.377E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.136395E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.46 | backward: 1803.55 | backward-backward: 1803.53 | backward-allreduce: 0.00 | optimizer: 55.76 | batch generator: 0.76 + samples/sec: 6.595 | iteration 99800/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 2.376E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.129498E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.61 | backward: 1803.58 | backward-backward: 1803.56 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.78 + samples/sec: 6.601 | iteration 99900/ 320000 | elapsed time per iteration (ms): 2424.0 | learning rate: 2.375E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.138706E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.12 | backward: 1802.29 | backward-backward: 1802.27 | backward-allreduce: 0.00 | optimizer: 55.25 | batch generator: 0.83 + samples/sec: 6.596 | iteration 100000/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 2.374E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.097448E+00 | loss scale: 
16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.44 | backward: 1802.97 | backward-backward: 1802.95 | backward-allreduce: 0.00 | optimizer: 55.78 | batch generator: 0.83 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 100000 | lm_loss value: 3.164404E+00 | lm_loss_ppl value: 2.367464E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.235 | iteration 100100/ 320000 | elapsed time per iteration (ms): 2566.0 | learning rate: 2.372E-04 | approx flops per GPU: 38.7TFLOPS | lm_loss: 3.115140E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 568.11 | backward: 1805.29 | backward-backward: 1805.27 | backward-allreduce: 0.00 | optimizer: 56.18 | batch generator: 0.88 + samples/sec: 6.596 | iteration 100200/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 2.371E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.118538E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.57 | backward: 1803.30 | backward-backward: 1803.27 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.80 + samples/sec: 6.593 | iteration 100300/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 2.370E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.124969E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.46 | backward: 1803.95 | backward-backward: 1803.93 | backward-allreduce: 0.00 | optimizer: 56.16 | batch generator: 0.81 + samples/sec: 6.599 | iteration 100400/ 320000 | elapsed time per iteration (ms): 2424.7 | learning rate: 2.369E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.131427E+00 | loss scale: 32768.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.89 | backward: 1802.77 | backward-backward: 1802.75 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.79 + samples/sec: 6.593 | iteration 100500/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 2.368E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.109353E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.43 | backward: 1804.11 | backward-backward: 1804.09 | backward-allreduce: 0.00 | optimizer: 55.78 | batch generator: 0.80 + samples/sec: 6.593 | iteration 100600/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 2.366E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.126189E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.46 | backward: 1804.06 | backward-backward: 1804.03 | backward-allreduce: 0.00 | optimizer: 55.80 | batch generator: 0.79 + samples/sec: 6.593 | iteration 100700/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 2.365E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.109403E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.51 | backward: 1804.26 | backward-backward: 1804.23 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.76 + samples/sec: 6.594 | iteration 100800/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 2.364E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.113102E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.33 | backward: 1803.58 | backward-backward: 1803.55 | backward-allreduce: 0.00 | optimizer: 56.00 | batch generator: 0.79 + samples/sec: 6.599 | iteration 100900/ 320000 | elapsed time per iteration (ms): 2424.6 | learning rate: 2.363E-04 | approx flops per GPU: 
41.0TFLOPS | lm_loss: 3.136154E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.17 | backward: 1802.59 | backward-backward: 1802.56 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.82 + samples/sec: 6.591 | iteration 101000/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 2.362E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.120721E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.24 | backward: 1805.17 | backward-backward: 1805.15 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 0.78 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 101000 | lm_loss value: 3.127241E+00 | lm_loss_ppl value: 2.281096E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.443 | iteration 101100/ 320000 | elapsed time per iteration (ms): 2483.3 | learning rate: 2.360E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.107505E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.16 | backward: 1804.51 | backward-backward: 1804.49 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.82 + samples/sec: 6.590 | iteration 101200/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 2.359E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.138663E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.46 | backward: 1804.56 | backward-backward: 1804.53 | backward-allreduce: 0.00 | optimizer: 56.55 | batch generator: 0.83 + samples/sec: 6.593 | iteration 101300/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 2.358E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 
3.120774E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.41 | backward: 1804.77 | backward-backward: 1804.75 | backward-allreduce: 0.00 | optimizer: 55.24 | batch generator: 0.80 + samples/sec: 6.590 | iteration 101400/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 2.357E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.104075E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.36 | backward: 1805.23 | backward-backward: 1805.20 | backward-allreduce: 0.00 | optimizer: 55.82 | batch generator: 0.78 + samples/sec: 6.596 | iteration 101500/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 2.356E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.128416E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.29 | backward: 1804.00 | backward-backward: 1803.98 | backward-allreduce: 0.00 | optimizer: 55.16 | batch generator: 0.82 + samples/sec: 6.594 | iteration 101600/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 2.354E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.134644E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.17 | backward: 1804.30 | backward-backward: 1804.27 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.78 + samples/sec: 6.595 | iteration 101700/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 2.353E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.108694E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.16 | backward: 1803.78 | backward-backward: 1803.75 | backward-allreduce: 0.00 | optimizer: 55.88 | batch generator: 0.83 + samples/sec: 6.594 | iteration 101800/ 320000 | elapsed time per iteration (ms): 2426.3 | learning 
rate: 2.352E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.121245E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.36 | backward: 1803.98 | backward-backward: 1803.95 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.83 + samples/sec: 6.592 | iteration 101900/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 2.351E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.134745E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.71 | backward: 1804.09 | backward-backward: 1804.07 | backward-allreduce: 0.00 | optimizer: 55.98 | batch generator: 0.81 + samples/sec: 6.595 | iteration 102000/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 2.349E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.119676E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.23 | backward: 1804.14 | backward-backward: 1804.12 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.80 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 102000 | lm_loss value: 3.127235E+00 | lm_loss_ppl value: 2.281081E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.444 | iteration 102100/ 320000 | elapsed time per iteration (ms): 2483.1 | learning rate: 2.348E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.105589E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.31 | backward: 1803.73 | backward-backward: 1803.71 | backward-allreduce: 0.00 | optimizer: 55.85 | batch generator: 0.95 + samples/sec: 6.595 | iteration 102200/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 2.347E-04 | approx 
flops per GPU: 41.0TFLOPS | lm_loss: 3.094782E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.36 | backward: 1803.64 | backward-backward: 1803.61 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.81 + samples/sec: 6.591 | iteration 102300/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 2.346E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.132429E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.21 | backward: 1804.43 | backward-backward: 1804.40 | backward-allreduce: 0.00 | optimizer: 56.59 | batch generator: 0.83 + samples/sec: 6.595 | iteration 102400/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 2.345E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.109857E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.23 | backward: 1803.92 | backward-backward: 1803.90 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.78 + samples/sec: 6.595 | iteration 102500/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 2.343E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.124293E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.29 | backward: 1803.50 | backward-backward: 1803.47 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.79 + samples/sec: 6.597 | iteration 102600/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 2.342E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.121631E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.39 | backward: 1802.88 | backward-backward: 1802.86 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.86 + samples/sec: 6.591 | iteration 102700/ 320000 | elapsed time 
per iteration (ms): 2427.6 | learning rate: 2.341E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.115215E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.39 | backward: 1805.41 | backward-backward: 1805.38 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.80 + samples/sec: 6.591 | iteration 102800/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 2.340E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.140429E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.51 | backward: 1805.27 | backward-backward: 1805.24 | backward-allreduce: 0.00 | optimizer: 55.37 | batch generator: 0.78 + samples/sec: 6.591 | iteration 102900/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 2.339E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.097981E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.73 | backward: 1804.86 | backward-backward: 1804.84 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.97 + samples/sec: 6.589 | iteration 103000/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 2.337E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.100786E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.78 | backward: 1805.27 | backward-backward: 1805.24 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.80 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 103000 | lm_loss value: 3.120894E+00 | lm_loss_ppl value: 2.266664E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.439 | iteration 103100/ 320000 | elapsed time per iteration (ms): 2484.7 
| learning rate: 2.336E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.119419E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.63 | backward: 1805.05 | backward-backward: 1805.02 | backward-allreduce: 0.00 | optimizer: 55.85 | batch generator: 0.87 + samples/sec: 6.592 | iteration 103200/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 2.335E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.124850E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.33 | backward: 1804.77 | backward-backward: 1804.74 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.80 + samples/sec: 6.591 | iteration 103300/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 2.334E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.137056E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.81 | backward: 1804.56 | backward-backward: 1804.53 | backward-allreduce: 0.00 | optimizer: 55.84 | batch generator: 0.83 + samples/sec: 6.590 | iteration 103400/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 2.332E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.119116E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.43 | backward: 1804.81 | backward-backward: 1804.79 | backward-allreduce: 0.00 | optimizer: 56.28 | batch generator: 0.80 + samples/sec: 6.592 | iteration 103500/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 2.331E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.121685E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.70 | backward: 1804.55 | backward-backward: 1804.53 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.80 + samples/sec: 6.591 | 
iteration 103600/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 2.330E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.108497E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.69 | backward: 1804.91 | backward-backward: 1804.89 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.80 + samples/sec: 6.592 | iteration 103700/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 2.329E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.137293E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.48 | backward: 1804.92 | backward-backward: 1804.89 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.81 + samples/sec: 6.590 | iteration 103800/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 2.328E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.106420E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.70 | backward: 1804.93 | backward-backward: 1804.90 | backward-allreduce: 0.00 | optimizer: 55.92 | batch generator: 0.81 + samples/sec: 6.592 | iteration 103900/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 2.326E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.124749E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.57 | backward: 1804.44 | backward-backward: 1804.41 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.86 + samples/sec: 6.592 | iteration 104000/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 2.325E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.119502E+00 | loss scale: 32768.0 | number of skipped iterations: 2 | number of nan iterations: 0 | +time (ms) | forward: 566.64 | backward: 1805.22 | backward-backward: 1805.20 | backward-allreduce: 0.00 | 
optimizer: 54.74 | batch generator: 0.82 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 104000 | lm_loss value: 3.102231E+00 | lm_loss_ppl value: 2.224754E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.440 | iteration 104100/ 320000 | elapsed time per iteration (ms): 2484.7 | learning rate: 2.324E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.126498E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.05 | backward: 1804.90 | backward-backward: 1804.87 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 1.02 + samples/sec: 6.589 | iteration 104200/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 2.323E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.099198E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.80 | backward: 1805.03 | backward-backward: 1805.00 | backward-allreduce: 0.00 | optimizer: 56.20 | batch generator: 0.83 + samples/sec: 6.591 | iteration 104300/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 2.321E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.105303E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.57 | backward: 1804.80 | backward-backward: 1804.78 | backward-allreduce: 0.00 | optimizer: 55.80 | batch generator: 0.80 + samples/sec: 6.592 | iteration 104400/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 2.320E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.126521E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.48 | backward: 1804.02 | backward-backward: 1803.99 | backward-allreduce: 0.00 | optimizer: 56.28 | batch 
generator: 0.84 + samples/sec: 6.594 | iteration 104500/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 2.319E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.114849E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.41 | backward: 1804.14 | backward-backward: 1804.11 | backward-allreduce: 0.00 | optimizer: 55.37 | batch generator: 0.83 + samples/sec: 6.587 | iteration 104600/ 320000 | elapsed time per iteration (ms): 2428.8 | learning rate: 2.318E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.105490E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.61 | backward: 1805.26 | backward-backward: 1805.24 | backward-allreduce: 0.00 | optimizer: 56.52 | batch generator: 0.88 + samples/sec: 6.592 | iteration 104700/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 2.316E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.102055E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.41 | backward: 1804.86 | backward-backward: 1804.84 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.80 + samples/sec: 6.589 | iteration 104800/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 2.315E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.107657E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.09 | backward: 1804.99 | backward-backward: 1804.96 | backward-allreduce: 0.00 | optimizer: 55.84 | batch generator: 0.96 + samples/sec: 6.589 | iteration 104900/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 2.314E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.121022E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.93 | backward: 1805.21 | backward-backward: 
1805.18 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.81 + samples/sec: 6.591 | iteration 105000/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 2.313E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.113618E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.62 | backward: 1805.53 | backward-backward: 1805.50 | backward-allreduce: 0.00 | optimizer: 55.06 | batch generator: 0.83 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 105000 | lm_loss value: 3.117446E+00 | lm_loss_ppl value: 2.258861E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.442 | iteration 105100/ 320000 | elapsed time per iteration (ms): 2483.9 | learning rate: 2.312E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.103168E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.48 | backward: 1805.19 | backward-backward: 1805.17 | backward-allreduce: 0.00 | optimizer: 54.99 | batch generator: 0.89 + samples/sec: 6.589 | iteration 105200/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 2.310E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.117788E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.75 | backward: 1805.38 | backward-backward: 1805.36 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.85 + samples/sec: 6.588 | iteration 105300/ 320000 | elapsed time per iteration (ms): 2428.6 | learning rate: 2.309E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.116443E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.82 | backward: 1805.60 | backward-backward: 1805.57 | 
backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.81 + samples/sec: 6.591 | iteration 105400/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 2.308E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.096557E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.93 | backward: 1804.60 | backward-backward: 1804.57 | backward-allreduce: 0.00 | optimizer: 55.76 | batch generator: 0.83 + samples/sec: 6.582 | iteration 105500/ 320000 | elapsed time per iteration (ms): 2430.7 | learning rate: 2.307E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.106275E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.82 | backward: 1807.09 | backward-backward: 1807.07 | backward-allreduce: 0.00 | optimizer: 56.42 | batch generator: 0.83 + samples/sec: 6.590 | iteration 105600/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 2.305E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.120505E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.33 | backward: 1805.46 | backward-backward: 1805.44 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.79 + samples/sec: 6.592 | iteration 105700/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 2.304E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.116129E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.54 | backward: 1804.52 | backward-backward: 1804.49 | backward-allreduce: 0.00 | optimizer: 55.85 | batch generator: 0.79 + samples/sec: 6.592 | iteration 105800/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 2.303E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.137099E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | 
forward: 566.63 | backward: 1804.57 | backward-backward: 1804.54 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.82 + samples/sec: 6.593 | iteration 105900/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 2.302E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.114636E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.23 | backward: 1804.76 | backward-backward: 1804.74 | backward-allreduce: 0.00 | optimizer: 55.33 | batch generator: 0.81 + samples/sec: 6.590 | iteration 106000/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 2.300E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.095711E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.55 | backward: 1804.48 | backward-backward: 1804.46 | backward-allreduce: 0.00 | optimizer: 56.38 | batch generator: 0.83 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 106000 | lm_loss value: 3.082507E+00 | lm_loss_ppl value: 2.181302E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.443 | iteration 106100/ 320000 | elapsed time per iteration (ms): 2483.2 | learning rate: 2.299E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.116894E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.72 | backward: 1803.78 | backward-backward: 1803.75 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.88 + samples/sec: 6.592 | iteration 106200/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 2.298E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.093417E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.64 | 
backward: 1804.29 | backward-backward: 1804.26 | backward-allreduce: 0.00 | optimizer: 55.84 | batch generator: 0.82 + samples/sec: 6.555 | iteration 106300/ 320000 | elapsed time per iteration (ms): 2441.0 | learning rate: 2.297E-04 | approx flops per GPU: 40.7TFLOPS | lm_loss: 3.140315E+00 | loss scale: 32768.0 | number of skipped iterations: 2 | number of nan iterations: 0 | +time (ms) | forward: 570.58 | backward: 1812.26 | backward-backward: 1812.21 | backward-allreduce: 0.00 | optimizer: 57.59 | batch generator: 0.95 + samples/sec: 6.544 | iteration 106400/ 320000 | elapsed time per iteration (ms): 2445.1 | learning rate: 2.295E-04 | approx flops per GPU: 40.7TFLOPS | lm_loss: 3.105305E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 572.94 | backward: 1812.75 | backward-backward: 1812.69 | backward-allreduce: 0.00 | optimizer: 58.76 | batch generator: 1.07 + samples/sec: 6.594 | iteration 106500/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 2.294E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.111346E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.12 | backward: 1804.29 | backward-backward: 1804.26 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.78 + samples/sec: 6.595 | iteration 106600/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 2.293E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.120341E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.06 | backward: 1803.35 | backward-backward: 1803.32 | backward-allreduce: 0.00 | optimizer: 56.37 | batch generator: 0.79 + samples/sec: 6.590 | iteration 106700/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 2.292E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.119551E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | +time (ms) | forward: 566.51 | backward: 1805.47 | backward-backward: 1805.45 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.87 + samples/sec: 6.591 | iteration 106800/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 2.290E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.097495E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.47 | backward: 1804.84 | backward-backward: 1804.81 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.80 + samples/sec: 6.594 | iteration 106900/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 2.289E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.117767E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.34 | backward: 1804.19 | backward-backward: 1804.16 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.76 + samples/sec: 6.598 | iteration 107000/ 320000 | elapsed time per iteration (ms): 2424.9 | learning rate: 2.288E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.104033E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.88 | backward: 1803.28 | backward-backward: 1803.26 | backward-allreduce: 0.00 | optimizer: 55.32 | batch generator: 0.79 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 107000 | lm_loss value: 3.139050E+00 | lm_loss_ppl value: 2.308194E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.440 | iteration 107100/ 320000 | elapsed time per iteration (ms): 2484.5 | learning rate: 2.287E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.106603E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 
0 | +time (ms) | forward: 566.45 | backward: 1805.26 | backward-backward: 1805.24 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.83 + samples/sec: 6.595 | iteration 107200/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 2.285E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.114632E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.28 | backward: 1803.71 | backward-backward: 1803.68 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.81 + samples/sec: 6.594 | iteration 107300/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 2.284E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.109815E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.20 | backward: 1804.31 | backward-backward: 1804.29 | backward-allreduce: 0.00 | optimizer: 55.41 | batch generator: 0.80 + samples/sec: 6.594 | iteration 107400/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 2.283E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.107421E+00 | loss scale: 16384.0 | number of skipped iterations: 3 | number of nan iterations: 0 | +time (ms) | forward: 566.84 | backward: 1805.09 | backward-backward: 1805.07 | backward-allreduce: 0.00 | optimizer: 53.98 | batch generator: 0.81 + samples/sec: 6.599 | iteration 107500/ 320000 | elapsed time per iteration (ms): 2424.6 | learning rate: 2.282E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.100229E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.10 | backward: 1802.67 | backward-backward: 1802.65 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.95 + samples/sec: 6.593 | iteration 107600/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 2.280E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.110644E+00 | loss scale: 16384.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.53 | backward: 1804.34 | backward-backward: 1804.32 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.77 + samples/sec: 6.593 | iteration 107700/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 2.279E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.106526E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.58 | backward: 1803.60 | backward-backward: 1803.57 | backward-allreduce: 0.00 | optimizer: 56.16 | batch generator: 0.78 + samples/sec: 6.595 | iteration 107800/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 2.278E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.111256E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.08 | backward: 1804.05 | backward-backward: 1804.03 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.78 + samples/sec: 6.590 | iteration 107900/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 2.277E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.118774E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.92 | backward: 1804.76 | backward-backward: 1804.74 | backward-allreduce: 0.00 | optimizer: 55.85 | batch generator: 0.87 + samples/sec: 6.599 | iteration 108000/ 320000 | elapsed time per iteration (ms): 2424.5 | learning rate: 2.275E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.105188E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.12 | backward: 1802.58 | backward-backward: 1802.55 | backward-allreduce: 0.00 | optimizer: 55.40 | batch generator: 0.86 +----------------------------------------------------------------------------------------------------------- + validation results at 
iteration 108000 | lm_loss value: 3.050287E+00 | lm_loss_ppl value: 2.112142E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.442 | iteration 108100/ 320000 | elapsed time per iteration (ms): 2483.6 | learning rate: 2.274E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.124265E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.47 | backward: 1804.10 | backward-backward: 1804.07 | backward-allreduce: 0.00 | optimizer: 55.84 | batch generator: 0.87 + samples/sec: 6.592 | iteration 108200/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 2.273E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.091519E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.61 | backward: 1804.45 | backward-backward: 1804.42 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.78 + samples/sec: 6.594 | iteration 108300/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 2.272E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.098810E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.25 | backward: 1803.86 | backward-backward: 1803.84 | backward-allreduce: 0.00 | optimizer: 55.87 | batch generator: 0.82 + samples/sec: 6.587 | iteration 108400/ 320000 | elapsed time per iteration (ms): 2429.1 | learning rate: 2.270E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.093645E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.07 | backward: 1805.19 | backward-backward: 1805.17 | backward-allreduce: 0.00 | optimizer: 56.48 | batch generator: 0.85 + samples/sec: 6.598 | iteration 108500/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 2.269E-04 | approx flops per GPU: 41.0TFLOPS | 
lm_loss: 3.125229E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.34 | backward: 1803.11 | backward-backward: 1803.08 | backward-allreduce: 0.00 | optimizer: 55.31 | batch generator: 0.80 + samples/sec: 6.592 | iteration 108600/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 2.268E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.128337E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.52 | backward: 1805.23 | backward-backward: 1805.21 | backward-allreduce: 0.00 | optimizer: 55.00 | batch generator: 0.79 + samples/sec: 6.592 | iteration 108700/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 2.267E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.098552E+00 | loss scale: 16384.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.89 | backward: 1804.31 | backward-backward: 1804.29 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.80 + samples/sec: 6.595 | iteration 108800/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 2.265E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.109172E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.35 | backward: 1803.31 | backward-backward: 1803.29 | backward-allreduce: 0.00 | optimizer: 56.08 | batch generator: 0.80 + samples/sec: 6.590 | iteration 108900/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 2.264E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.106265E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.86 | backward: 1804.79 | backward-backward: 1804.77 | backward-allreduce: 0.00 | optimizer: 55.95 | batch generator: 0.82 + samples/sec: 6.595 | iteration 109000/ 320000 | elapsed time per iteration (ms): 2426.2 | 
learning rate: 2.263E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.081758E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.93 | backward: 1803.27 | backward-backward: 1803.25 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.80 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 109000 | lm_loss value: 3.168851E+00 | lm_loss_ppl value: 2.378014E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.448 | iteration 109100/ 320000 | elapsed time per iteration (ms): 2481.5 | learning rate: 2.261E-04 | approx flops per GPU: 40.1TFLOPS | lm_loss: 3.103164E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.09 | backward: 1802.85 | backward-backward: 1802.82 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.88 + samples/sec: 6.594 | iteration 109200/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 2.260E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.136463E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.59 | backward: 1803.73 | backward-backward: 1803.71 | backward-allreduce: 0.00 | optimizer: 55.84 | batch generator: 0.80 + samples/sec: 6.596 | iteration 109300/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 2.259E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.105170E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.27 | backward: 1803.44 | backward-backward: 1803.41 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.77 + samples/sec: 6.598 | iteration 109400/ 320000 | elapsed time per iteration (ms): 2424.9 | learning rate: 2.258E-04 | 
approx flops per GPU: 41.0TFLOPS | lm_loss: 3.117061E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.40 | backward: 1802.77 | backward-backward: 1802.74 | backward-allreduce: 0.00 | optimizer: 55.38 | batch generator: 0.78 + samples/sec: 6.599 | iteration 109500/ 320000 | elapsed time per iteration (ms): 2424.5 | learning rate: 2.256E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.080365E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.42 | backward: 1802.05 | backward-backward: 1802.02 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.81 + samples/sec: 6.593 | iteration 109600/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 2.255E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.112474E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.34 | backward: 1804.45 | backward-backward: 1804.42 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.81 + samples/sec: 6.595 | iteration 109700/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 2.254E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.097740E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.27 | backward: 1803.79 | backward-backward: 1803.76 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.77 + samples/sec: 6.593 | iteration 109800/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 2.253E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.107850E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.66 | backward: 1803.98 | backward-backward: 1803.96 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 0.77 + samples/sec: 6.590 | iteration 109900/ 320000 | elapsed 
time per iteration (ms): 2428.0 | learning rate: 2.251E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.084129E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.51 | backward: 1804.56 | backward-backward: 1804.54 | backward-allreduce: 0.00 | optimizer: 56.53 | batch generator: 0.81 + samples/sec: 6.598 | iteration 110000/ 320000 | elapsed time per iteration (ms): 2425.0 | learning rate: 2.250E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.104342E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.14 | backward: 1803.09 | backward-backward: 1803.07 | backward-allreduce: 0.00 | optimizer: 55.35 | batch generator: 0.79 +WARNING: Deleting old checkpoints: + checkpoints-fcm/global_step10000 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 110000 | lm_loss value: 3.049305E+00 | lm_loss_ppl value: 2.110068E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.237 | iteration 110100/ 320000 | elapsed time per iteration (ms): 2565.4 | learning rate: 2.249E-04 | approx flops per GPU: 38.7TFLOPS | lm_loss: 3.080419E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.06 | backward: 1804.64 | backward-backward: 1804.62 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.84 + samples/sec: 6.593 | iteration 110200/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 2.247E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.088086E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.66 | backward: 1804.26 | backward-backward: 1804.24 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.79 + 
samples/sec: 6.597 | iteration 110300/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 2.246E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.114011E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.19 | backward: 1803.79 | backward-backward: 1803.77 | backward-allreduce: 0.00 | optimizer: 55.13 | batch generator: 0.78 + samples/sec: 6.592 | iteration 110400/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 2.245E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.091231E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.23 | backward: 1804.23 | backward-backward: 1804.21 | backward-allreduce: 0.00 | optimizer: 56.24 | batch generator: 0.83 + samples/sec: 6.592 | iteration 110500/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 2.244E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.111566E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.79 | backward: 1804.12 | backward-backward: 1804.09 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.80 + samples/sec: 6.593 | iteration 110600/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 2.242E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.097313E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.79 | backward: 1803.92 | backward-backward: 1803.90 | backward-allreduce: 0.00 | optimizer: 55.77 | batch generator: 0.77 + samples/sec: 6.594 | iteration 110700/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 2.241E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.122801E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.27 | backward: 1803.86 | backward-backward: 1803.83 | 
backward-allreduce: 0.00 | optimizer: 55.81 | batch generator: 0.80 + samples/sec: 6.598 | iteration 110800/ 320000 | elapsed time per iteration (ms): 2424.9 | learning rate: 2.240E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.107351E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.05 | backward: 1803.20 | backward-backward: 1803.17 | backward-allreduce: 0.00 | optimizer: 55.29 | batch generator: 0.76 + samples/sec: 6.591 | iteration 110900/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 2.239E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.093178E+00 | loss scale: 16384.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.37 | backward: 1805.06 | backward-backward: 1805.03 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.86 + samples/sec: 6.592 | iteration 111000/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 2.237E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.095140E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.46 | backward: 1804.31 | backward-backward: 1804.28 | backward-allreduce: 0.00 | optimizer: 56.23 | batch generator: 0.86 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 111000 | lm_loss value: 3.111984E+00 | lm_loss_ppl value: 2.246558E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.446 | iteration 111100/ 320000 | elapsed time per iteration (ms): 2482.2 | learning rate: 2.236E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.088494E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.04 | backward: 1803.26 | backward-backward: 1803.24 | backward-allreduce: 0.00 | 
optimizer: 55.60 | batch generator: 0.86 + samples/sec: 6.595 | iteration 111200/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 2.235E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.103481E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.37 | backward: 1804.01 | backward-backward: 1803.99 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.81 + samples/sec: 6.591 | iteration 111300/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 2.233E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.107906E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.82 | backward: 1804.61 | backward-backward: 1804.59 | backward-allreduce: 0.00 | optimizer: 55.77 | batch generator: 0.81 + samples/sec: 6.598 | iteration 111400/ 320000 | elapsed time per iteration (ms): 2424.8 | learning rate: 2.232E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.100751E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.44 | backward: 1802.62 | backward-backward: 1802.60 | backward-allreduce: 0.00 | optimizer: 55.36 | batch generator: 0.78 + samples/sec: 6.595 | iteration 111500/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 2.231E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.094886E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.36 | backward: 1803.53 | backward-backward: 1803.51 | backward-allreduce: 0.00 | optimizer: 55.85 | batch generator: 0.78 + samples/sec: 6.588 | iteration 111600/ 320000 | elapsed time per iteration (ms): 2428.7 | learning rate: 2.230E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.105632E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.24 | backward: 
1805.08 | backward-backward: 1805.06 | backward-allreduce: 0.00 | optimizer: 55.98 | batch generator: 0.79 + samples/sec: 6.592 | iteration 111700/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 2.228E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.084484E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.29 | backward: 1803.57 | backward-backward: 1803.55 | backward-allreduce: 0.00 | optimizer: 55.98 | batch generator: 0.83 + samples/sec: 6.593 | iteration 111800/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 2.227E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.112347E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.47 | backward: 1803.85 | backward-backward: 1803.83 | backward-allreduce: 0.00 | optimizer: 56.10 | batch generator: 0.81 + samples/sec: 6.587 | iteration 111900/ 320000 | elapsed time per iteration (ms): 2429.0 | learning rate: 2.226E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.107197E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.30 | backward: 1805.28 | backward-backward: 1805.26 | backward-allreduce: 0.00 | optimizer: 56.03 | batch generator: 0.77 + samples/sec: 6.598 | iteration 112000/ 320000 | elapsed time per iteration (ms): 2425.0 | learning rate: 2.224E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.091616E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 565.92 | backward: 1803.51 | backward-backward: 1803.49 | backward-allreduce: 0.00 | optimizer: 55.15 | batch generator: 0.76 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 112000 | lm_loss value: 3.145288E+00 | lm_loss_ppl value: 2.322637E+01 | 
+----------------------------------------------------------------------------------------------------------- + samples/sec: 6.439 | iteration 112100/ 320000 | elapsed time per iteration (ms): 2484.8 | learning rate: 2.223E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.073200E+00 | loss scale: 16384.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.47 | backward: 1805.34 | backward-backward: 1805.32 | backward-allreduce: 0.00 | optimizer: 55.76 | batch generator: 0.84 + samples/sec: 6.591 | iteration 112200/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 2.222E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.064874E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.85 | backward: 1804.01 | backward-backward: 1803.99 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.80 + samples/sec: 6.599 | iteration 112300/ 320000 | elapsed time per iteration (ms): 2424.6 | learning rate: 2.221E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.095600E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.08 | backward: 1802.81 | backward-backward: 1802.79 | backward-allreduce: 0.00 | optimizer: 55.33 | batch generator: 0.80 + samples/sec: 6.590 | iteration 112400/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 2.219E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.073251E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.82 | backward: 1805.09 | backward-backward: 1805.07 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.79 + samples/sec: 6.590 | iteration 112500/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 2.218E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.089146E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | +time (ms) | forward: 567.28 | backward: 1803.79 | backward-backward: 1803.77 | backward-allreduce: 0.00 | optimizer: 56.31 | batch generator: 0.80 + samples/sec: 6.597 | iteration 112600/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 2.217E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.084083E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.99 | backward: 1802.99 | backward-backward: 1802.97 | backward-allreduce: 0.00 | optimizer: 55.91 | batch generator: 0.83 + samples/sec: 6.590 | iteration 112700/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 2.215E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.114701E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.87 | backward: 1805.18 | backward-backward: 1805.16 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.78 + samples/sec: 6.593 | iteration 112800/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 2.214E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.104523E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.96 | backward: 1803.70 | backward-backward: 1803.68 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.78 + samples/sec: 6.596 | iteration 112900/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 2.213E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.094032E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.17 | backward: 1803.51 | backward-backward: 1803.49 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.79 + samples/sec: 6.588 | iteration 113000/ 320000 | elapsed time per iteration (ms): 2428.8 | learning rate: 2.212E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.101813E+00 | 
loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.14 | backward: 1805.50 | backward-backward: 1805.47 | backward-allreduce: 0.00 | optimizer: 55.78 | batch generator: 0.74 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 113000 | lm_loss value: 3.058391E+00 | lm_loss_ppl value: 2.129328E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.424 | iteration 113100/ 320000 | elapsed time per iteration (ms): 2490.6 | learning rate: 2.210E-04 | approx flops per GPU: 39.9TFLOPS | lm_loss: 3.093242E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 568.53 | backward: 1808.58 | backward-backward: 1808.55 | backward-allreduce: 0.00 | optimizer: 56.25 | batch generator: 0.88 + samples/sec: 6.555 | iteration 113200/ 320000 | elapsed time per iteration (ms): 2441.0 | learning rate: 2.209E-04 | approx flops per GPU: 40.7TFLOPS | lm_loss: 3.099180E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 569.10 | backward: 1814.73 | backward-backward: 1814.70 | backward-allreduce: 0.00 | optimizer: 56.81 | batch generator: 0.81 + samples/sec: 6.591 | iteration 113300/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 2.208E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.106698E+00 | loss scale: 16384.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 567.19 | backward: 1805.00 | backward-backward: 1804.97 | backward-allreduce: 0.00 | optimizer: 55.06 | batch generator: 0.77 + samples/sec: 6.597 | iteration 113400/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 2.206E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.092045E+00 | loss scale: 16384.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.38 | backward: 1802.97 | backward-backward: 1802.94 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.80 + samples/sec: 6.589 | iteration 113500/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 2.205E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.087215E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.80 | backward: 1805.20 | backward-backward: 1805.18 | backward-allreduce: 0.00 | optimizer: 55.90 | batch generator: 0.77 + samples/sec: 6.593 | iteration 113600/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 2.204E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.074151E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.43 | backward: 1803.40 | backward-backward: 1803.38 | backward-allreduce: 0.00 | optimizer: 55.79 | batch generator: 1.09 + samples/sec: 6.594 | iteration 113700/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 2.202E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.095883E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.59 | backward: 1804.06 | backward-backward: 1804.04 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.91 + samples/sec: 6.589 | iteration 113800/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 2.201E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.081171E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.45 | backward: 1804.89 | backward-backward: 1804.87 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.95 + samples/sec: 6.600 | iteration 113900/ 320000 | elapsed time per iteration (ms): 2424.3 | learning rate: 2.200E-04 | approx flops per 
GPU: 41.0TFLOPS | lm_loss: 3.092673E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.01 | backward: 1802.53 | backward-backward: 1802.51 | backward-allreduce: 0.00 | optimizer: 55.37 | batch generator: 0.77 + samples/sec: 6.591 | iteration 114000/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 2.199E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.078889E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.69 | backward: 1804.81 | backward-backward: 1804.79 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.80 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 114000 | lm_loss value: 3.143751E+00 | lm_loss_ppl value: 2.319069E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.440 | iteration 114100/ 320000 | elapsed time per iteration (ms): 2484.5 | learning rate: 2.197E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.091160E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.49 | backward: 1803.99 | backward-backward: 1803.97 | backward-allreduce: 0.00 | optimizer: 55.87 | batch generator: 0.92 + samples/sec: 6.599 | iteration 114200/ 320000 | elapsed time per iteration (ms): 2424.4 | learning rate: 2.196E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.109534E+00 | loss scale: 8192.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 565.85 | backward: 1803.01 | backward-backward: 1802.99 | backward-allreduce: 0.00 | optimizer: 55.10 | batch generator: 0.76 + samples/sec: 6.589 | iteration 114300/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 2.195E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 
3.117061E+00 | loss scale: 8192.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.01 | backward: 1804.97 | backward-backward: 1804.95 | backward-allreduce: 0.00 | optimizer: 56.11 | batch generator: 0.78 + samples/sec: 6.596 | iteration 114400/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 2.193E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.083881E+00 | loss scale: 8192.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.83 | backward: 1802.54 | backward-backward: 1802.52 | backward-allreduce: 0.00 | optimizer: 55.88 | batch generator: 0.89 + samples/sec: 6.596 | iteration 114500/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 2.192E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.096150E+00 | loss scale: 8192.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.20 | backward: 1803.79 | backward-backward: 1803.77 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.77 + samples/sec: 6.590 | iteration 114600/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 2.191E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.101366E+00 | loss scale: 8192.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.65 | backward: 1804.01 | backward-backward: 1803.99 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.82 + samples/sec: 6.596 | iteration 114700/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 2.189E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.099032E+00 | loss scale: 8192.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.55 | backward: 1802.63 | backward-backward: 1802.61 | backward-allreduce: 0.00 | optimizer: 55.92 | batch generator: 0.80 + samples/sec: 6.590 | iteration 114800/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 
2.188E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.098110E+00 | loss scale: 8192.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.20 | backward: 1804.58 | backward-backward: 1804.56 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.79 + samples/sec: 6.595 | iteration 114900/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 2.187E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.087171E+00 | loss scale: 8192.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.91 | backward: 1802.70 | backward-backward: 1802.68 | backward-allreduce: 0.00 | optimizer: 55.94 | batch generator: 0.84 + samples/sec: 6.596 | iteration 115000/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 2.185E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.078266E+00 | loss scale: 8192.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.62 | backward: 1803.04 | backward-backward: 1803.02 | backward-allreduce: 0.00 | optimizer: 55.78 | batch generator: 0.79 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 115000 | lm_loss value: 3.122589E+00 | lm_loss_ppl value: 2.270510E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.439 | iteration 115100/ 320000 | elapsed time per iteration (ms): 2485.0 | learning rate: 2.184E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.094203E+00 | loss scale: 8192.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.15 | backward: 1804.63 | backward-backward: 1804.61 | backward-allreduce: 0.00 | optimizer: 55.99 | batch generator: 0.85 + samples/sec: 6.599 | iteration 115200/ 320000 | elapsed time per iteration (ms): 2424.5 | learning rate: 2.183E-04 | approx flops per 
GPU: 41.0TFLOPS | lm_loss: 3.090450E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.10 | backward: 1802.46 | backward-backward: 1802.44 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.76 + samples/sec: 6.589 | iteration 115300/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 2.182E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.094834E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.62 | backward: 1804.90 | backward-backward: 1804.87 | backward-allreduce: 0.00 | optimizer: 56.33 | batch generator: 0.78 + samples/sec: 6.593 | iteration 115400/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 2.180E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.075201E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.83 | backward: 1804.07 | backward-backward: 1804.04 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.78 + samples/sec: 6.596 | iteration 115500/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 2.179E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.077558E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.28 | backward: 1803.25 | backward-backward: 1803.22 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.92 + samples/sec: 6.589 | iteration 115600/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 2.178E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.074727E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.81 | backward: 1805.34 | backward-backward: 1805.32 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.73 + samples/sec: 6.595 | iteration 115700/ 320000 | elapsed time per 
iteration (ms): 2426.2 | learning rate: 2.176E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.097809E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.53 | backward: 1803.71 | backward-backward: 1803.69 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.78 + samples/sec: 6.591 | iteration 115800/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 2.175E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.074508E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.63 | backward: 1804.75 | backward-backward: 1804.72 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.76 + samples/sec: 6.594 | iteration 115900/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 2.174E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.102458E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.10 | backward: 1803.50 | backward-backward: 1803.47 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.78 + samples/sec: 6.595 | iteration 116000/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 2.172E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.089418E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.31 | backward: 1803.61 | backward-backward: 1803.59 | backward-allreduce: 0.00 | optimizer: 55.80 | batch generator: 0.86 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 116000 | lm_loss value: 3.161323E+00 | lm_loss_ppl value: 2.360180E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.439 | iteration 116100/ 320000 | elapsed time per iteration (ms): 2484.8 | 
learning rate: 2.171E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.087432E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.03 | backward: 1804.93 | backward-backward: 1804.90 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.85 + samples/sec: 6.598 | iteration 116200/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 2.170E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.074887E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.47 | backward: 1802.87 | backward-backward: 1802.85 | backward-allreduce: 0.00 | optimizer: 55.37 | batch generator: 0.80 + samples/sec: 6.590 | iteration 116300/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 2.168E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.089257E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.84 | backward: 1804.77 | backward-backward: 1804.74 | backward-allreduce: 0.00 | optimizer: 56.00 | batch generator: 0.77 + samples/sec: 6.591 | iteration 116400/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 2.167E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.083437E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.89 | backward: 1804.08 | backward-backward: 1804.06 | backward-allreduce: 0.00 | optimizer: 56.06 | batch generator: 0.79 + samples/sec: 6.591 | iteration 116500/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 2.166E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.062031E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.85 | backward: 1804.81 | backward-backward: 1804.79 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.86 + samples/sec: 6.589 | 
iteration 116600/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 2.164E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.074667E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.81 | backward: 1804.38 | backward-backward: 1804.36 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 1.11 + samples/sec: 6.595 | iteration 116700/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 2.163E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.083087E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.24 | backward: 1803.80 | backward-backward: 1803.78 | backward-allreduce: 0.00 | optimizer: 55.84 | batch generator: 0.85 + samples/sec: 6.588 | iteration 116800/ 320000 | elapsed time per iteration (ms): 2428.8 | learning rate: 2.162E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.100735E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.24 | backward: 1805.59 | backward-backward: 1805.57 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.80 + samples/sec: 6.596 | iteration 116900/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 2.160E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.063961E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.45 | backward: 1803.79 | backward-backward: 1803.77 | backward-allreduce: 0.00 | optimizer: 55.13 | batch generator: 0.76 + samples/sec: 6.590 | iteration 117000/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 2.159E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.101981E+00 | loss scale: 16384.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.82 | backward: 1805.27 | backward-backward: 1805.25 | backward-allreduce: 0.00 | 
optimizer: 55.39 | batch generator: 0.82 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 117000 | lm_loss value: 3.008284E+00 | lm_loss_ppl value: 2.025262E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.445 | iteration 117100/ 320000 | elapsed time per iteration (ms): 2482.4 | learning rate: 2.158E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.084163E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.49 | backward: 1803.26 | backward-backward: 1803.24 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.86 + samples/sec: 6.593 | iteration 117200/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 2.157E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.073850E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.63 | backward: 1804.22 | backward-backward: 1804.20 | backward-allreduce: 0.00 | optimizer: 55.41 | batch generator: 0.79 + samples/sec: 6.591 | iteration 117300/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 2.155E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.079782E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.22 | backward: 1804.17 | backward-backward: 1804.15 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.87 + samples/sec: 6.595 | iteration 117400/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 2.154E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.082406E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.59 | backward: 1803.34 | backward-backward: 1803.32 | backward-allreduce: 0.00 | optimizer: 55.68 | batch 
generator: 0.86 + samples/sec: 6.592 | iteration 117500/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 2.153E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.061526E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.51 | backward: 1804.02 | backward-backward: 1804.00 | backward-allreduce: 0.00 | optimizer: 56.44 | batch generator: 0.79 + samples/sec: 6.596 | iteration 117600/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 2.151E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.067720E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.74 | backward: 1803.27 | backward-backward: 1803.25 | backward-allreduce: 0.00 | optimizer: 55.28 | batch generator: 0.78 + samples/sec: 6.593 | iteration 117700/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 2.150E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.063403E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.54 | backward: 1804.40 | backward-backward: 1804.38 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.78 + samples/sec: 6.594 | iteration 117800/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 2.149E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.047973E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.71 | backward: 1803.87 | backward-backward: 1803.84 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.80 + samples/sec: 6.599 | iteration 117900/ 320000 | elapsed time per iteration (ms): 2424.8 | learning rate: 2.147E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.077684E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.07 | backward: 1802.95 | backward-backward: 
1802.93 | backward-allreduce: 0.00 | optimizer: 55.38 | batch generator: 0.75 + samples/sec: 6.589 | iteration 118000/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 2.146E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.103731E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.08 | backward: 1804.94 | backward-backward: 1804.91 | backward-allreduce: 0.00 | optimizer: 55.76 | batch generator: 0.80 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 118000 | lm_loss value: 3.035818E+00 | lm_loss_ppl value: 2.081800E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.446 | iteration 118100/ 320000 | elapsed time per iteration (ms): 2482.3 | learning rate: 2.145E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.079916E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.47 | backward: 1803.39 | backward-backward: 1803.37 | backward-allreduce: 0.00 | optimizer: 55.21 | batch generator: 0.85 + samples/sec: 6.594 | iteration 118200/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 2.143E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.083196E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.46 | backward: 1804.29 | backward-backward: 1804.26 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.82 + samples/sec: 6.590 | iteration 118300/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 2.142E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.087509E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.07 | backward: 1804.81 | backward-backward: 1804.79 | 
backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 0.79 + samples/sec: 6.596 | iteration 118400/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 2.141E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.089990E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.33 | backward: 1803.49 | backward-backward: 1803.47 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.91 + samples/sec: 6.589 | iteration 118500/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 2.139E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.082739E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.77 | backward: 1805.61 | backward-backward: 1805.58 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.75 + samples/sec: 6.590 | iteration 118600/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 2.138E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.086632E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.75 | backward: 1804.51 | backward-backward: 1804.48 | backward-allreduce: 0.00 | optimizer: 56.30 | batch generator: 0.77 + samples/sec: 6.592 | iteration 118700/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 2.137E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.073419E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.86 | backward: 1804.34 | backward-backward: 1804.32 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.79 + samples/sec: 6.589 | iteration 118800/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 2.135E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.081941E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | 
forward: 566.94 | backward: 1805.04 | backward-backward: 1805.01 | backward-allreduce: 0.00 | optimizer: 55.84 | batch generator: 0.82 + samples/sec: 6.594 | iteration 118900/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 2.134E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.072159E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.94 | backward: 1803.59 | backward-backward: 1803.56 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.80 + samples/sec: 6.588 | iteration 119000/ 320000 | elapsed time per iteration (ms): 2428.8 | learning rate: 2.133E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.087855E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.57 | backward: 1805.21 | backward-backward: 1805.19 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.81 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 119000 | lm_loss value: 3.109795E+00 | lm_loss_ppl value: 2.241644E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.444 | iteration 119100/ 320000 | elapsed time per iteration (ms): 2483.1 | learning rate: 2.131E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.094319E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.29 | backward: 1804.29 | backward-backward: 1804.26 | backward-allreduce: 0.00 | optimizer: 55.38 | batch generator: 0.84 + samples/sec: 6.591 | iteration 119200/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 2.130E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.085832E+00 | loss scale: 32768.0 | number of skipped iterations: 2 | number of nan iterations: 0 | +time (ms) | forward: 566.92 | 
backward: 1805.52 | backward-backward: 1805.50 | backward-allreduce: 0.00 | optimizer: 54.78 | batch generator: 0.80 + samples/sec: 6.592 | iteration 119300/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 2.129E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.075969E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.08 | backward: 1803.91 | backward-backward: 1803.89 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.80 + samples/sec: 6.593 | iteration 119400/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 2.127E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.081794E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.40 | backward: 1804.38 | backward-backward: 1804.36 | backward-allreduce: 0.00 | optimizer: 55.81 | batch generator: 0.78 + samples/sec: 6.589 | iteration 119500/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 2.126E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.076626E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.90 | backward: 1805.18 | backward-backward: 1805.16 | backward-allreduce: 0.00 | optimizer: 55.90 | batch generator: 0.77 + samples/sec: 6.594 | iteration 119600/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 2.125E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.086751E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.44 | backward: 1803.85 | backward-backward: 1803.83 | backward-allreduce: 0.00 | optimizer: 55.77 | batch generator: 0.79 + samples/sec: 6.585 | iteration 119700/ 320000 | elapsed time per iteration (ms): 2429.6 | learning rate: 2.123E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.084556E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | +time (ms) | forward: 567.26 | backward: 1805.83 | backward-backward: 1805.81 | backward-allreduce: 0.00 | optimizer: 56.15 | batch generator: 0.79 + samples/sec: 6.595 | iteration 119800/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 2.122E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.075686E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.54 | backward: 1803.21 | backward-backward: 1803.18 | backward-allreduce: 0.00 | optimizer: 56.06 | batch generator: 0.86 + samples/sec: 6.590 | iteration 119900/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 2.121E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.088111E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.75 | backward: 1805.17 | backward-backward: 1805.14 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.79 + samples/sec: 6.593 | iteration 120000/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 2.119E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.078523E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.90 | backward: 1804.27 | backward-backward: 1804.25 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.74 +WARNING: Deleting old checkpoints: + checkpoints-fcm/global_step20000 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 120000 | lm_loss value: 3.025698E+00 | lm_loss_ppl value: 2.060839E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.240 | iteration 120100/ 320000 | elapsed time per iteration (ms): 2564.3 | learning rate: 2.118E-04 | approx flops per GPU: 38.8TFLOPS | lm_loss: 3.060322E+00 | loss scale: 
32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.67 | backward: 1804.27 | backward-backward: 1804.24 | backward-allreduce: 0.00 | optimizer: 55.76 | batch generator: 0.87 + samples/sec: 6.587 | iteration 120200/ 320000 | elapsed time per iteration (ms): 2428.9 | learning rate: 2.117E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.086046E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.96 | backward: 1805.49 | backward-backward: 1805.46 | backward-allreduce: 0.00 | optimizer: 56.12 | batch generator: 0.78 + samples/sec: 6.594 | iteration 120300/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 2.115E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.083078E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.48 | backward: 1804.38 | backward-backward: 1804.35 | backward-allreduce: 0.00 | optimizer: 55.05 | batch generator: 0.79 + samples/sec: 6.590 | iteration 120400/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 2.114E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.078393E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 567.11 | backward: 1805.15 | backward-backward: 1805.13 | backward-allreduce: 0.00 | optimizer: 55.16 | batch generator: 0.96 + samples/sec: 6.600 | iteration 120500/ 320000 | elapsed time per iteration (ms): 2424.2 | learning rate: 2.113E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.075702E+00 | loss scale: 16384.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.41 | backward: 1802.60 | backward-backward: 1802.58 | backward-allreduce: 0.00 | optimizer: 54.83 | batch generator: 0.79 + samples/sec: 6.592 | iteration 120600/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 2.111E-04 | approx 
flops per GPU: 41.0TFLOPS | lm_loss: 3.048943E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.75 | backward: 1804.53 | backward-backward: 1804.51 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.77 + samples/sec: 6.591 | iteration 120700/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 2.110E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.103765E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.23 | backward: 1803.79 | backward-backward: 1803.77 | backward-allreduce: 0.00 | optimizer: 56.17 | batch generator: 0.88 + samples/sec: 6.594 | iteration 120800/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 2.109E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.079125E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.24 | backward: 1803.66 | backward-backward: 1803.64 | backward-allreduce: 0.00 | optimizer: 56.06 | batch generator: 0.81 + samples/sec: 6.590 | iteration 120900/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 2.107E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.107979E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.01 | backward: 1804.96 | backward-backward: 1804.93 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.78 + samples/sec: 6.597 | iteration 121000/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 2.106E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.063886E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.38 | backward: 1802.83 | backward-backward: 1802.81 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.80 
+----------------------------------------------------------------------------------------------------------- + validation results at iteration 121000 | lm_loss value: 3.076452E+00 | lm_loss_ppl value: 2.168133E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.442 | iteration 121100/ 320000 | elapsed time per iteration (ms): 2483.8 | learning rate: 2.105E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.065402E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.56 | backward: 1804.45 | backward-backward: 1804.43 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.85 + samples/sec: 6.595 | iteration 121200/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 2.103E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.101442E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.59 | backward: 1803.45 | backward-backward: 1803.43 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.78 + samples/sec: 6.597 | iteration 121300/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 2.102E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.054083E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.18 | backward: 1803.46 | backward-backward: 1803.44 | backward-allreduce: 0.00 | optimizer: 55.37 | batch generator: 0.75 + samples/sec: 6.590 | iteration 121400/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 2.100E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.069442E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.31 | backward: 1804.49 | backward-backward: 1804.47 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.76 + samples/sec: 6.595 | 
iteration 121500/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 2.099E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.075626E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.27 | backward: 1803.91 | backward-backward: 1803.89 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.78 + samples/sec: 6.592 | iteration 121600/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 2.098E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.067014E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.68 | backward: 1804.56 | backward-backward: 1804.53 | backward-allreduce: 0.00 | optimizer: 55.40 | batch generator: 0.82 + samples/sec: 6.594 | iteration 121700/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 2.096E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.071282E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.43 | backward: 1803.75 | backward-backward: 1803.73 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.80 + samples/sec: 6.596 | iteration 121800/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 2.095E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.042370E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.32 | backward: 1803.38 | backward-backward: 1803.36 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.79 + samples/sec: 6.590 | iteration 121900/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 2.094E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.068823E+00 | loss scale: 16384.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.57 | backward: 1805.27 | backward-backward: 1805.25 | backward-allreduce: 0.00 | 
optimizer: 55.79 | batch generator: 0.75 + samples/sec: 6.598 | iteration 122000/ 320000 | elapsed time per iteration (ms): 2424.9 | learning rate: 2.092E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.063108E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.90 | backward: 1802.17 | backward-backward: 1802.15 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.78 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 122000 | lm_loss value: 3.022027E+00 | lm_loss_ppl value: 2.053287E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.444 | iteration 122100/ 320000 | elapsed time per iteration (ms): 2483.0 | learning rate: 2.091E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.064274E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.52 | backward: 1803.45 | backward-backward: 1803.43 | backward-allreduce: 0.00 | optimizer: 55.90 | batch generator: 0.86 + samples/sec: 6.590 | iteration 122200/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 2.090E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.055159E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.04 | backward: 1804.53 | backward-backward: 1804.50 | backward-allreduce: 0.00 | optimizer: 55.82 | batch generator: 0.81 + samples/sec: 6.597 | iteration 122300/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 2.088E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.052821E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.44 | backward: 1802.92 | backward-backward: 1802.90 | backward-allreduce: 0.00 | optimizer: 55.67 | batch 
generator: 0.76 + samples/sec: 6.595 | iteration 122400/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 2.087E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.068882E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.29 | backward: 1803.46 | backward-backward: 1803.44 | backward-allreduce: 0.00 | optimizer: 55.90 | batch generator: 0.77 + samples/sec: 6.590 | iteration 122500/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 2.086E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.060616E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.83 | backward: 1804.86 | backward-backward: 1804.83 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.78 + samples/sec: 6.593 | iteration 122600/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 2.084E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.071339E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.06 | backward: 1803.76 | backward-backward: 1803.74 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.76 + samples/sec: 6.599 | iteration 122700/ 320000 | elapsed time per iteration (ms): 2424.7 | learning rate: 2.083E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.073759E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.09 | backward: 1802.55 | backward-backward: 1802.52 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.81 + samples/sec: 6.591 | iteration 122800/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 2.082E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.083247E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.76 | backward: 1804.89 | backward-backward: 
1804.86 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.75 + samples/sec: 6.592 | iteration 122900/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 2.080E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.060766E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.18 | backward: 1804.07 | backward-backward: 1804.05 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.76 + samples/sec: 6.594 | iteration 123000/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 2.079E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.068777E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.37 | backward: 1803.64 | backward-backward: 1803.62 | backward-allreduce: 0.00 | optimizer: 55.89 | batch generator: 0.79 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 123000 | lm_loss value: 3.033765E+00 | lm_loss_ppl value: 2.077530E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.440 | iteration 123100/ 320000 | elapsed time per iteration (ms): 2484.5 | learning rate: 2.077E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.069836E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.56 | backward: 1804.97 | backward-backward: 1804.95 | backward-allreduce: 0.00 | optimizer: 55.86 | batch generator: 0.87 + samples/sec: 6.591 | iteration 123200/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 2.076E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.083291E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.92 | backward: 1804.47 | backward-backward: 1804.45 | 
backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.76 + samples/sec: 6.599 | iteration 123300/ 320000 | elapsed time per iteration (ms): 2424.7 | learning rate: 2.075E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.044280E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.83 | backward: 1803.15 | backward-backward: 1803.12 | backward-allreduce: 0.00 | optimizer: 55.31 | batch generator: 0.73 + samples/sec: 6.590 | iteration 123400/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 2.073E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.058128E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.97 | backward: 1805.04 | backward-backward: 1805.02 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 0.78 + samples/sec: 6.590 | iteration 123500/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 2.072E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.055041E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.21 | backward: 1804.98 | backward-backward: 1804.96 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.77 + samples/sec: 6.594 | iteration 123600/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 2.071E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.084484E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.75 | backward: 1803.69 | backward-backward: 1803.66 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.80 + samples/sec: 6.592 | iteration 123700/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 2.069E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.062847E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | 
forward: 566.83 | backward: 1804.29 | backward-backward: 1804.26 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.81 + samples/sec: 6.588 | iteration 123800/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 2.068E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.056530E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.88 | backward: 1805.51 | backward-backward: 1805.48 | backward-allreduce: 0.00 | optimizer: 55.77 | batch generator: 0.77 + samples/sec: 6.595 | iteration 123900/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 2.067E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.060414E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.90 | backward: 1803.57 | backward-backward: 1803.55 | backward-allreduce: 0.00 | optimizer: 55.36 | batch generator: 0.81 + samples/sec: 6.597 | iteration 124000/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 2.065E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.068764E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.00 | backward: 1803.67 | backward-backward: 1803.65 | backward-allreduce: 0.00 | optimizer: 55.23 | batch generator: 0.80 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 124000 | lm_loss value: 3.087045E+00 | lm_loss_ppl value: 2.191223E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.435 | iteration 124100/ 320000 | elapsed time per iteration (ms): 2486.4 | learning rate: 2.064E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.069670E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.80 | 
backward: 1805.94 | backward-backward: 1805.92 | backward-allreduce: 0.00 | optimizer: 56.32 | batch generator: 0.89 + samples/sec: 6.589 | iteration 124200/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 2.062E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.081858E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.91 | backward: 1805.51 | backward-backward: 1805.48 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.78 + samples/sec: 6.597 | iteration 124300/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 2.061E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.061830E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.13 | backward: 1803.57 | backward-backward: 1803.54 | backward-allreduce: 0.00 | optimizer: 55.37 | batch generator: 0.78 + samples/sec: 6.593 | iteration 124400/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 2.060E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.082648E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.64 | backward: 1804.92 | backward-backward: 1804.90 | backward-allreduce: 0.00 | optimizer: 55.00 | batch generator: 0.84 + samples/sec: 6.590 | iteration 124500/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 2.058E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.062227E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 567.01 | backward: 1805.47 | backward-backward: 1805.45 | backward-allreduce: 0.00 | optimizer: 55.07 | batch generator: 0.81 + samples/sec: 6.594 | iteration 124600/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 2.057E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.080071E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | +time (ms) | forward: 566.70 | backward: 1803.65 | backward-backward: 1803.63 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 0.75 + samples/sec: 6.597 | iteration 124700/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 2.056E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.080854E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.05 | backward: 1803.29 | backward-backward: 1803.26 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.80 + samples/sec: 6.590 | iteration 124800/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 2.054E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.068296E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.86 | backward: 1805.14 | backward-backward: 1805.11 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.79 + samples/sec: 6.592 | iteration 124900/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 2.053E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.067428E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.94 | backward: 1803.97 | backward-backward: 1803.95 | backward-allreduce: 0.00 | optimizer: 55.80 | batch generator: 0.78 + samples/sec: 6.598 | iteration 125000/ 320000 | elapsed time per iteration (ms): 2424.9 | learning rate: 2.052E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.058988E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.98 | backward: 1802.75 | backward-backward: 1802.72 | backward-allreduce: 0.00 | optimizer: 55.79 | batch generator: 0.82 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 125000 | lm_loss value: 
2.992376E+00 | lm_loss_ppl value: 1.993298E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.439 | iteration 125100/ 320000 | elapsed time per iteration (ms): 2485.0 | learning rate: 2.050E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.066073E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.85 | backward: 1805.12 | backward-backward: 1805.10 | backward-allreduce: 0.00 | optimizer: 55.78 | batch generator: 0.88 + samples/sec: 6.588 | iteration 125200/ 320000 | elapsed time per iteration (ms): 2428.6 | learning rate: 2.049E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.073398E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.94 | backward: 1805.26 | backward-backward: 1805.24 | backward-allreduce: 0.00 | optimizer: 56.00 | batch generator: 0.76 + samples/sec: 6.596 | iteration 125300/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 2.047E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.054430E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.50 | backward: 1803.36 | backward-backward: 1803.33 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.78 + samples/sec: 6.595 | iteration 125400/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 2.046E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.061958E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.20 | backward: 1803.89 | backward-backward: 1803.87 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.78 + samples/sec: 6.590 | iteration 125500/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 2.045E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.056128E+00 | loss scale: 
65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.87 | backward: 1804.94 | backward-backward: 1804.91 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.79 + samples/sec: 6.592 | iteration 125600/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 2.043E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.048734E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 567.48 | backward: 1804.20 | backward-backward: 1804.18 | backward-allreduce: 0.00 | optimizer: 55.22 | batch generator: 0.77 + samples/sec: 6.600 | iteration 125700/ 320000 | elapsed time per iteration (ms): 2424.4 | learning rate: 2.042E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.070765E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 565.89 | backward: 1803.20 | backward-backward: 1803.18 | backward-allreduce: 0.00 | optimizer: 54.88 | batch generator: 0.78 + samples/sec: 6.589 | iteration 125800/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 2.041E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.074206E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.84 | backward: 1804.98 | backward-backward: 1804.96 | backward-allreduce: 0.00 | optimizer: 55.87 | batch generator: 0.85 + samples/sec: 6.591 | iteration 125900/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 2.039E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.066527E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.09 | backward: 1804.19 | backward-backward: 1804.17 | backward-allreduce: 0.00 | optimizer: 55.88 | batch generator: 0.82 + samples/sec: 6.599 | iteration 126000/ 320000 | elapsed time per iteration (ms): 2424.5 | learning rate: 2.038E-04 | approx 
flops per GPU: 41.0TFLOPS | lm_loss: 3.045248E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.99 | backward: 1802.89 | backward-backward: 1802.87 | backward-allreduce: 0.00 | optimizer: 55.28 | batch generator: 0.79 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 126000 | lm_loss value: 3.037052E+00 | lm_loss_ppl value: 2.084370E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.441 | iteration 126100/ 320000 | elapsed time per iteration (ms): 2484.2 | learning rate: 2.037E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.067670E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.74 | backward: 1804.57 | backward-backward: 1804.54 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 1.00 + samples/sec: 6.591 | iteration 126200/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 2.035E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.038206E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.93 | backward: 1804.04 | backward-backward: 1804.02 | backward-allreduce: 0.00 | optimizer: 56.25 | batch generator: 0.78 + samples/sec: 6.597 | iteration 126300/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 2.034E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.063153E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.92 | backward: 1803.37 | backward-backward: 1803.35 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.75 + samples/sec: 6.592 | iteration 126400/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 2.032E-04 | approx flops per GPU: 40.9TFLOPS 
| lm_loss: 3.055527E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.73 | backward: 1804.74 | backward-backward: 1804.72 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.78 + samples/sec: 6.594 | iteration 126500/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 2.031E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.078688E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.72 | backward: 1803.75 | backward-backward: 1803.72 | backward-allreduce: 0.00 | optimizer: 55.78 | batch generator: 0.79 + samples/sec: 6.597 | iteration 126600/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 2.030E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.050456E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.20 | backward: 1803.23 | backward-backward: 1803.21 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.78 + samples/sec: 6.589 | iteration 126700/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 2.028E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.054198E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.08 | backward: 1805.29 | backward-backward: 1805.26 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.79 + samples/sec: 6.591 | iteration 126800/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 2.027E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.036754E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.74 | backward: 1804.89 | backward-backward: 1804.86 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.79 + samples/sec: 6.598 | iteration 126900/ 320000 | elapsed time per iteration (ms): 2424.9 | 
learning rate: 2.025E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.065831E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.27 | backward: 1803.26 | backward-backward: 1803.23 | backward-allreduce: 0.00 | optimizer: 54.99 | batch generator: 0.78 + samples/sec: 6.591 | iteration 127000/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 2.024E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.060044E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.57 | backward: 1804.89 | backward-backward: 1804.86 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.78 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 127000 | lm_loss value: 3.090858E+00 | lm_loss_ppl value: 2.199595E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.442 | iteration 127100/ 320000 | elapsed time per iteration (ms): 2483.7 | learning rate: 2.023E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.056682E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.77 | backward: 1804.28 | backward-backward: 1804.26 | backward-allreduce: 0.00 | optimizer: 55.42 | batch generator: 0.84 + samples/sec: 6.597 | iteration 127200/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 2.021E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.059341E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.05 | backward: 1803.28 | backward-backward: 1803.25 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.79 + samples/sec: 6.590 | iteration 127300/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 2.020E-04 | 
approx flops per GPU: 40.9TFLOPS | lm_loss: 3.051319E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.50 | backward: 1804.94 | backward-backward: 1804.91 | backward-allreduce: 0.00 | optimizer: 56.11 | batch generator: 0.80 + samples/sec: 6.590 | iteration 127400/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 2.019E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.059857E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.66 | backward: 1805.22 | backward-backward: 1805.20 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.76 + samples/sec: 6.598 | iteration 127500/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 2.017E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.066541E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.86 | backward: 1803.41 | backward-backward: 1803.38 | backward-allreduce: 0.00 | optimizer: 55.37 | batch generator: 0.75 + samples/sec: 6.592 | iteration 127600/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 2.016E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.037567E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.52 | backward: 1804.50 | backward-backward: 1804.47 | backward-allreduce: 0.00 | optimizer: 55.78 | batch generator: 0.80 + samples/sec: 6.590 | iteration 127700/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 2.014E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.067426E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.63 | backward: 1805.07 | backward-backward: 1805.05 | backward-allreduce: 0.00 | optimizer: 55.99 | batch generator: 0.78 + samples/sec: 6.597 | iteration 127800/ 320000 | elapsed 
time per iteration (ms): 2425.3 | learning rate: 2.013E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.054425E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.28 | backward: 1803.14 | backward-backward: 1803.11 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.82 + samples/sec: 6.592 | iteration 127900/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 2.012E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.052233E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.74 | backward: 1804.19 | backward-backward: 1804.16 | backward-allreduce: 0.00 | optimizer: 55.94 | batch generator: 0.79 + samples/sec: 6.592 | iteration 128000/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 2.010E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.060955E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.67 | backward: 1804.72 | backward-backward: 1804.70 | backward-allreduce: 0.00 | optimizer: 55.41 | batch generator: 0.76 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 128000 | lm_loss value: 3.089863E+00 | lm_loss_ppl value: 2.197407E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.446 | iteration 128100/ 320000 | elapsed time per iteration (ms): 2482.2 | learning rate: 2.009E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.049107E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.09 | backward: 1803.40 | backward-backward: 1803.37 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.84 + samples/sec: 6.594 | iteration 128200/ 320000 | elapsed time per iteration (ms): 
2426.3 | learning rate: 2.008E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.039107E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.54 | backward: 1804.52 | backward-backward: 1804.50 | backward-allreduce: 0.00 | optimizer: 54.89 | batch generator: 0.77 + samples/sec: 6.589 | iteration 128300/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 2.006E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.062849E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.15 | backward: 1805.09 | backward-backward: 1805.07 | backward-allreduce: 0.00 | optimizer: 55.80 | batch generator: 0.77 + samples/sec: 6.594 | iteration 128400/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 2.005E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.036470E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.36 | backward: 1803.62 | backward-backward: 1803.60 | backward-allreduce: 0.00 | optimizer: 56.04 | batch generator: 0.80 + samples/sec: 6.595 | iteration 128500/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 2.003E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.056804E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.51 | backward: 1804.11 | backward-backward: 1804.08 | backward-allreduce: 0.00 | optimizer: 55.09 | batch generator: 0.80 + samples/sec: 6.592 | iteration 128600/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 2.002E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.050488E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.87 | backward: 1804.20 | backward-backward: 1804.17 | backward-allreduce: 0.00 | optimizer: 55.87 | batch generator: 0.80 + samples/sec: 6.598 | 
iteration 128700/ 320000 | elapsed time per iteration (ms): 2424.9 | learning rate: 2.001E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.046310E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.27 | backward: 1802.83 | backward-backward: 1802.80 | backward-allreduce: 0.00 | optimizer: 55.41 | batch generator: 0.83 + samples/sec: 6.591 | iteration 128800/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 1.999E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.054642E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.83 | backward: 1804.48 | backward-backward: 1804.45 | backward-allreduce: 0.00 | optimizer: 55.78 | batch generator: 0.80 + samples/sec: 6.595 | iteration 128900/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 1.998E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.068001E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.46 | backward: 1803.95 | backward-backward: 1803.93 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.77 + samples/sec: 6.597 | iteration 129000/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 1.996E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.049932E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.16 | backward: 1803.12 | backward-backward: 1803.10 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.78 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 129000 | lm_loss value: 3.049594E+00 | lm_loss_ppl value: 2.110677E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.441 | iteration 129100/ 320000 | 
elapsed time per iteration (ms): 2484.2 | learning rate: 1.995E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.080033E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.64 | backward: 1804.90 | backward-backward: 1804.88 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.85 + samples/sec: 6.597 | iteration 129200/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 1.994E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.037728E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.27 | backward: 1802.93 | backward-backward: 1802.91 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.90 + samples/sec: 6.592 | iteration 129300/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 1.992E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.035256E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.48 | backward: 1804.49 | backward-backward: 1804.47 | backward-allreduce: 0.00 | optimizer: 55.78 | batch generator: 0.80 + samples/sec: 6.593 | iteration 129400/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 1.991E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.035027E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.42 | backward: 1804.32 | backward-backward: 1804.30 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.78 + samples/sec: 6.594 | iteration 129500/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 1.990E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.062029E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.90 | backward: 1804.15 | backward-backward: 1804.13 | backward-allreduce: 0.00 | optimizer: 55.86 | batch 
generator: 0.76 + samples/sec: 6.592 | iteration 129600/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 1.988E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.071824E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.40 | backward: 1805.36 | backward-backward: 1805.34 | backward-allreduce: 0.00 | optimizer: 55.11 | batch generator: 0.75 + samples/sec: 6.600 | iteration 129700/ 320000 | elapsed time per iteration (ms): 2424.2 | learning rate: 1.987E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.044484E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 565.92 | backward: 1802.86 | backward-backward: 1802.84 | backward-allreduce: 0.00 | optimizer: 55.00 | batch generator: 0.79 + samples/sec: 6.595 | iteration 129800/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 1.985E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.064375E+00 | loss scale: 16384.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.65 | backward: 1803.96 | backward-backward: 1803.93 | backward-allreduce: 0.00 | optimizer: 55.26 | batch generator: 0.76 + samples/sec: 6.596 | iteration 129900/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 1.984E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.054696E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.23 | backward: 1803.34 | backward-backward: 1803.32 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.76 + samples/sec: 6.600 | iteration 130000/ 320000 | elapsed time per iteration (ms): 2424.2 | learning rate: 1.983E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.057232E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.33 | backward: 1801.97 | backward-backward: 
1801.95 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.81 +WARNING: Deleting old checkpoints: + checkpoints-fcm/global_step30000 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 130000 | lm_loss value: 3.032578E+00 | lm_loss_ppl value: 2.075066E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.210 | iteration 130100/ 320000 | elapsed time per iteration (ms): 2576.3 | learning rate: 1.981E-04 | approx flops per GPU: 38.6TFLOPS | lm_loss: 3.060234E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 568.47 | backward: 1805.64 | backward-backward: 1805.61 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.84 + samples/sec: 6.597 | iteration 130200/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 1.980E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.045363E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.42 | backward: 1803.01 | backward-backward: 1802.98 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.85 + samples/sec: 6.600 | iteration 130300/ 320000 | elapsed time per iteration (ms): 2424.3 | learning rate: 1.978E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.050321E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.05 | backward: 1802.49 | backward-backward: 1802.46 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.79 + samples/sec: 6.595 | iteration 130400/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 1.977E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.043405E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 
566.50 | backward: 1803.77 | backward-backward: 1803.75 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.78 + samples/sec: 6.599 | iteration 130500/ 320000 | elapsed time per iteration (ms): 2424.8 | learning rate: 1.976E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.061244E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.22 | backward: 1802.84 | backward-backward: 1802.81 | backward-allreduce: 0.00 | optimizer: 55.37 | batch generator: 0.75 + samples/sec: 6.596 | iteration 130600/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 1.974E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.063245E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.30 | backward: 1802.84 | backward-backward: 1802.82 | backward-allreduce: 0.00 | optimizer: 56.23 | batch generator: 0.80 + samples/sec: 6.592 | iteration 130700/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 1.973E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.022270E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.65 | backward: 1804.27 | backward-backward: 1804.24 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.80 + samples/sec: 6.595 | iteration 130800/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 1.971E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.041055E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.67 | backward: 1803.51 | backward-backward: 1803.48 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.87 + samples/sec: 6.598 | iteration 130900/ 320000 | elapsed time per iteration (ms): 2424.9 | learning rate: 1.970E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.051273E+00 | loss scale: 32768.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.81 | backward: 1803.23 | backward-backward: 1803.21 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.76 + samples/sec: 6.590 | iteration 131000/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 1.969E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.058353E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.89 | backward: 1804.78 | backward-backward: 1804.75 | backward-allreduce: 0.00 | optimizer: 55.79 | batch generator: 0.78 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 131000 | lm_loss value: 3.012972E+00 | lm_loss_ppl value: 2.034778E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.444 | iteration 131100/ 320000 | elapsed time per iteration (ms): 2483.0 | learning rate: 1.967E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.040150E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.42 | backward: 1803.94 | backward-backward: 1803.92 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.88 + samples/sec: 6.597 | iteration 131200/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 1.966E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.031024E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.25 | backward: 1803.14 | backward-backward: 1803.12 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.89 + samples/sec: 6.590 | iteration 131300/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 1.964E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.038213E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of 
nan iterations: 0 | +time (ms) | forward: 566.69 | backward: 1805.08 | backward-backward: 1805.05 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 0.80 + samples/sec: 6.595 | iteration 131400/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 1.963E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.060768E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.19 | backward: 1803.60 | backward-backward: 1803.58 | backward-allreduce: 0.00 | optimizer: 55.77 | batch generator: 0.80 + samples/sec: 6.595 | iteration 131500/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 1.962E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.047241E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.68 | backward: 1803.37 | backward-backward: 1803.34 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.79 + samples/sec: 6.590 | iteration 131600/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 1.960E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.035491E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.73 | backward: 1805.13 | backward-backward: 1805.11 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.79 + samples/sec: 6.588 | iteration 131700/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 1.959E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.071375E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.75 | backward: 1805.19 | backward-backward: 1805.16 | backward-allreduce: 0.00 | optimizer: 56.22 | batch generator: 0.80 + samples/sec: 6.595 | iteration 131800/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 1.957E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.058180E+00 | loss 
scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.18 | backward: 1803.46 | backward-backward: 1803.43 | backward-allreduce: 0.00 | optimizer: 55.85 | batch generator: 0.82 + samples/sec: 6.594 | iteration 131900/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 1.956E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.021861E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.41 | backward: 1804.66 | backward-backward: 1804.64 | backward-allreduce: 0.00 | optimizer: 54.88 | batch generator: 0.80 + samples/sec: 6.593 | iteration 132000/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 1.955E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.056813E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.50 | backward: 1804.74 | backward-backward: 1804.72 | backward-allreduce: 0.00 | optimizer: 55.13 | batch generator: 0.85 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 132000 | lm_loss value: 3.081576E+00 | lm_loss_ppl value: 2.179272E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.445 | iteration 132100/ 320000 | elapsed time per iteration (ms): 2482.4 | learning rate: 1.953E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.048577E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.46 | backward: 1803.38 | backward-backward: 1803.36 | backward-allreduce: 0.00 | optimizer: 55.42 | batch generator: 0.83 + samples/sec: 6.596 | iteration 132200/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 1.952E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.024102E+00 | loss scale: 32768.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.20 | backward: 1803.35 | backward-backward: 1803.33 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.76 + samples/sec: 6.592 | iteration 132300/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 1.950E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.017845E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.33 | backward: 1804.72 | backward-backward: 1804.70 | backward-allreduce: 0.00 | optimizer: 55.83 | batch generator: 0.77 + samples/sec: 6.593 | iteration 132400/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 1.949E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.051430E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.53 | backward: 1804.09 | backward-backward: 1804.06 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.78 + samples/sec: 6.599 | iteration 132500/ 320000 | elapsed time per iteration (ms): 2424.6 | learning rate: 1.948E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.054457E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.92 | backward: 1802.99 | backward-backward: 1802.96 | backward-allreduce: 0.00 | optimizer: 55.33 | batch generator: 0.78 + samples/sec: 6.592 | iteration 132600/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 1.946E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.021662E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.44 | backward: 1804.34 | backward-backward: 1804.32 | backward-allreduce: 0.00 | optimizer: 55.87 | batch generator: 0.78 + samples/sec: 6.592 | iteration 132700/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 1.945E-04 | approx flops per GPU: 
41.0TFLOPS | lm_loss: 3.047520E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.56 | backward: 1804.46 | backward-backward: 1804.43 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.77 + samples/sec: 6.592 | iteration 132800/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 1.943E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.025841E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.45 | backward: 1804.29 | backward-backward: 1804.26 | backward-allreduce: 0.00 | optimizer: 56.07 | batch generator: 0.80 + samples/sec: 6.597 | iteration 132900/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 1.942E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.065359E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.20 | backward: 1803.38 | backward-backward: 1803.35 | backward-allreduce: 0.00 | optimizer: 55.25 | batch generator: 0.78 + samples/sec: 6.592 | iteration 133000/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 1.941E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.047146E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.63 | backward: 1804.74 | backward-backward: 1804.71 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.78 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 133000 | lm_loss value: 3.063661E+00 | lm_loss_ppl value: 2.140578E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.444 | iteration 133100/ 320000 | elapsed time per iteration (ms): 2483.0 | learning rate: 1.939E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 
3.037458E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.49 | backward: 1804.29 | backward-backward: 1804.27 | backward-allreduce: 0.00 | optimizer: 55.07 | batch generator: 0.90 + samples/sec: 6.598 | iteration 133200/ 320000 | elapsed time per iteration (ms): 2425.0 | learning rate: 1.938E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.035414E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.73 | backward: 1803.28 | backward-backward: 1803.26 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.78 + samples/sec: 6.593 | iteration 133300/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 1.936E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.032924E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.46 | backward: 1804.96 | backward-backward: 1804.93 | backward-allreduce: 0.00 | optimizer: 55.08 | batch generator: 0.78 + samples/sec: 6.592 | iteration 133400/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 1.935E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.031680E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.52 | backward: 1804.43 | backward-backward: 1804.41 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.81 + samples/sec: 6.596 | iteration 133500/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 1.934E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.031467E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.34 | backward: 1803.35 | backward-backward: 1803.33 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.79 + samples/sec: 6.597 | iteration 133600/ 320000 | elapsed time per iteration (ms): 2425.3 | learning 
rate: 1.932E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.043866E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.12 | backward: 1802.83 | backward-backward: 1802.80 | backward-allreduce: 0.00 | optimizer: 55.84 | batch generator: 0.82 + samples/sec: 6.591 | iteration 133700/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 1.931E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.048725E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.69 | backward: 1805.08 | backward-backward: 1805.05 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.91 + samples/sec: 6.594 | iteration 133800/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 1.929E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.039785E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.85 | backward: 1803.96 | backward-backward: 1803.93 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.82 + samples/sec: 6.590 | iteration 133900/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 1.928E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.043007E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.42 | backward: 1804.90 | backward-backward: 1804.87 | backward-allreduce: 0.00 | optimizer: 56.34 | batch generator: 0.77 + samples/sec: 6.598 | iteration 134000/ 320000 | elapsed time per iteration (ms): 2425.0 | learning rate: 1.927E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.065591E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.88 | backward: 1802.96 | backward-backward: 1802.94 | backward-allreduce: 0.00 | optimizer: 55.80 | batch generator: 0.79 
+----------------------------------------------------------------------------------------------------------- + validation results at iteration 134000 | lm_loss value: 3.069824E+00 | lm_loss_ppl value: 2.153812E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.443 | iteration 134100/ 320000 | elapsed time per iteration (ms): 2483.3 | learning rate: 1.925E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.043590E+00 | loss scale: 16384.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.57 | backward: 1804.14 | backward-backward: 1804.11 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.90 + samples/sec: 6.595 | iteration 134200/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 1.924E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.038817E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.59 | backward: 1803.64 | backward-backward: 1803.62 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.81 + samples/sec: 6.596 | iteration 134300/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 1.922E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.045693E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.35 | backward: 1803.75 | backward-backward: 1803.72 | backward-allreduce: 0.00 | optimizer: 55.40 | batch generator: 0.76 + samples/sec: 6.595 | iteration 134400/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 1.921E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.022903E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.52 | backward: 1804.02 | backward-backward: 1804.00 | backward-allreduce: 0.00 | optimizer: 55.35 | batch generator: 0.73 + samples/sec: 6.597 | 
iteration 134500/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 1.919E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.022929E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.17 | backward: 1802.76 | backward-backward: 1802.74 | backward-allreduce: 0.00 | optimizer: 55.96 | batch generator: 0.80 + samples/sec: 6.596 | iteration 134600/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 1.918E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.047475E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.32 | backward: 1803.39 | backward-backward: 1803.37 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.75 + samples/sec: 6.595 | iteration 134700/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 1.917E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.033488E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.49 | backward: 1803.69 | backward-backward: 1803.67 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.82 + samples/sec: 6.596 | iteration 134800/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 1.915E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.042204E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.46 | backward: 1803.62 | backward-backward: 1803.60 | backward-allreduce: 0.00 | optimizer: 55.31 | batch generator: 0.81 + samples/sec: 6.597 | iteration 134900/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 1.914E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.027991E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.53 | backward: 1802.84 | backward-backward: 1802.81 | backward-allreduce: 0.00 | 
optimizer: 55.47 | batch generator: 0.81 + samples/sec: 6.596 | iteration 135000/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 1.912E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.041668E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.91 | backward: 1803.25 | backward-backward: 1803.22 | backward-allreduce: 0.00 | optimizer: 56.28 | batch generator: 0.76 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 135000 | lm_loss value: 3.059825E+00 | lm_loss_ppl value: 2.132382E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.441 | iteration 135100/ 320000 | elapsed time per iteration (ms): 2484.0 | learning rate: 1.911E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.052133E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.72 | backward: 1804.47 | backward-backward: 1804.44 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 1.00 + samples/sec: 6.592 | iteration 135200/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 1.910E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.044903E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.76 | backward: 1804.29 | backward-backward: 1804.27 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.78 + samples/sec: 6.592 | iteration 135300/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 1.908E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.040572E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.67 | backward: 1803.91 | backward-backward: 1803.89 | backward-allreduce: 0.00 | optimizer: 56.27 | batch 
generator: 0.78 + samples/sec: 6.598 | iteration 135400/ 320000 | elapsed time per iteration (ms): 2424.8 | learning rate: 1.907E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.033973E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.96 | backward: 1803.07 | backward-backward: 1803.04 | backward-allreduce: 0.00 | optimizer: 55.42 | batch generator: 0.78 + samples/sec: 6.594 | iteration 135500/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 1.905E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.023548E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.67 | backward: 1804.05 | backward-backward: 1804.03 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.80 + samples/sec: 6.594 | iteration 135600/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 1.904E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.028306E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.48 | backward: 1804.21 | backward-backward: 1804.18 | backward-allreduce: 0.00 | optimizer: 55.34 | batch generator: 0.75 + samples/sec: 6.594 | iteration 135700/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 1.902E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.067569E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.76 | backward: 1803.79 | backward-backward: 1803.77 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.86 + samples/sec: 6.596 | iteration 135800/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 1.901E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.058576E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.46 | backward: 1803.24 | backward-backward: 
1803.22 | backward-allreduce: 0.00 | optimizer: 55.79 | batch generator: 0.79 + samples/sec: 6.597 | iteration 135900/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 1.900E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.044962E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.19 | backward: 1803.31 | backward-backward: 1803.29 | backward-allreduce: 0.00 | optimizer: 55.41 | batch generator: 0.75 + samples/sec: 6.591 | iteration 136000/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 1.898E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.036035E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.53 | backward: 1805.06 | backward-backward: 1805.04 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.76 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 136000 | lm_loss value: 3.030420E+00 | lm_loss_ppl value: 2.070592E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.440 | iteration 136100/ 320000 | elapsed time per iteration (ms): 2484.4 | learning rate: 1.897E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.062454E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.61 | backward: 1804.85 | backward-backward: 1804.83 | backward-allreduce: 0.00 | optimizer: 55.80 | batch generator: 0.89 + samples/sec: 6.595 | iteration 136200/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 1.895E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.039156E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.58 | backward: 1804.14 | backward-backward: 1804.12 | 
backward-allreduce: 0.00 | optimizer: 54.96 | batch generator: 0.79 + samples/sec: 6.599 | iteration 136300/ 320000 | elapsed time per iteration (ms): 2424.8 | learning rate: 1.894E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.020565E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.03 | backward: 1802.99 | backward-backward: 1802.96 | backward-allreduce: 0.00 | optimizer: 55.37 | batch generator: 0.78 + samples/sec: 6.595 | iteration 136400/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 1.893E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.032160E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.46 | backward: 1803.75 | backward-backward: 1803.72 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.81 + samples/sec: 6.593 | iteration 136500/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 1.891E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.034215E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.40 | backward: 1804.22 | backward-backward: 1804.19 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.79 + samples/sec: 6.593 | iteration 136600/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 1.890E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.018666E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.89 | backward: 1804.08 | backward-backward: 1804.05 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.72 + samples/sec: 6.591 | iteration 136700/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 1.888E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.021796E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | 
forward: 566.95 | backward: 1804.18 | backward-backward: 1804.15 | backward-allreduce: 0.00 | optimizer: 55.89 | batch generator: 0.85 + samples/sec: 6.599 | iteration 136800/ 320000 | elapsed time per iteration (ms): 2424.7 | learning rate: 1.887E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.033137E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.84 | backward: 1803.03 | backward-backward: 1803.01 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.79 + samples/sec: 6.594 | iteration 136900/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 1.885E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.066820E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.65 | backward: 1803.82 | backward-backward: 1803.79 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.80 + samples/sec: 6.593 | iteration 137000/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 1.884E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.056705E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.27 | backward: 1803.98 | backward-backward: 1803.96 | backward-allreduce: 0.00 | optimizer: 56.12 | batch generator: 0.75 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 137000 | lm_loss value: 3.068589E+00 | lm_loss_ppl value: 2.151152E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.449 | iteration 137100/ 320000 | elapsed time per iteration (ms): 2481.1 | learning rate: 1.883E-04 | approx flops per GPU: 40.1TFLOPS | lm_loss: 3.038867E+00 | loss scale: 16384.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 565.97 | 
backward: 1802.49 | backward-backward: 1802.46 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.86 + samples/sec: 6.591 | iteration 137200/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 1.881E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.027907E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.70 | backward: 1805.12 | backward-backward: 1805.10 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.75 + samples/sec: 6.599 | iteration 137300/ 320000 | elapsed time per iteration (ms): 2424.5 | learning rate: 1.880E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.013692E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.29 | backward: 1802.46 | backward-backward: 1802.43 | backward-allreduce: 0.00 | optimizer: 55.37 | batch generator: 0.80 + samples/sec: 6.597 | iteration 137400/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 1.878E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.021676E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.39 | backward: 1803.02 | backward-backward: 1803.00 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.86 + samples/sec: 6.592 | iteration 137500/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 1.877E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.062826E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.43 | backward: 1804.83 | backward-backward: 1804.80 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.81 + samples/sec: 6.593 | iteration 137600/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 1.875E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.044134E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | +time (ms) | forward: 567.09 | backward: 1803.40 | backward-backward: 1803.38 | backward-allreduce: 0.00 | optimizer: 56.00 | batch generator: 0.81 + samples/sec: 6.598 | iteration 137700/ 320000 | elapsed time per iteration (ms): 2424.8 | learning rate: 1.874E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.018692E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.15 | backward: 1802.61 | backward-backward: 1802.58 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.79 + samples/sec: 6.595 | iteration 137800/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 1.873E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.050996E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.48 | backward: 1803.59 | backward-backward: 1803.57 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.83 + samples/sec: 6.593 | iteration 137900/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 1.871E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.037458E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.59 | backward: 1804.32 | backward-backward: 1804.29 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.77 + samples/sec: 6.598 | iteration 138000/ 320000 | elapsed time per iteration (ms): 2424.9 | learning rate: 1.870E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.025712E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.32 | backward: 1802.88 | backward-backward: 1802.86 | backward-allreduce: 0.00 | optimizer: 55.38 | batch generator: 0.85 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 138000 | lm_loss value: 
3.044087E+00 | lm_loss_ppl value: 2.099086E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.445 | iteration 138100/ 320000 | elapsed time per iteration (ms): 2482.7 | learning rate: 1.868E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.028942E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.34 | backward: 1803.08 | backward-backward: 1803.05 | backward-allreduce: 0.00 | optimizer: 56.12 | batch generator: 1.05 + samples/sec: 6.587 | iteration 138200/ 320000 | elapsed time per iteration (ms): 2429.2 | learning rate: 1.867E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.042545E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.94 | backward: 1805.56 | backward-backward: 1805.53 | backward-allreduce: 0.00 | optimizer: 56.32 | batch generator: 0.80 + samples/sec: 6.595 | iteration 138300/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 1.866E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.049267E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.67 | backward: 1803.97 | backward-backward: 1803.95 | backward-allreduce: 0.00 | optimizer: 55.21 | batch generator: 0.80 + samples/sec: 6.599 | iteration 138400/ 320000 | elapsed time per iteration (ms): 2424.5 | learning rate: 1.864E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.031348E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.91 | backward: 1802.83 | backward-backward: 1802.80 | backward-allreduce: 0.00 | optimizer: 55.36 | batch generator: 0.77 + samples/sec: 6.587 | iteration 138500/ 320000 | elapsed time per iteration (ms): 2429.0 | learning rate: 1.863E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.063171E+00 | loss scale: 
32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.40 | backward: 1805.08 | backward-backward: 1805.05 | backward-allreduce: 0.00 | optimizer: 56.09 | batch generator: 0.82 + samples/sec: 6.590 | iteration 138600/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 1.861E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.042833E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.77 | backward: 1804.93 | backward-backward: 1804.91 | backward-allreduce: 0.00 | optimizer: 55.88 | batch generator: 0.79 + samples/sec: 6.597 | iteration 138700/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 1.860E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.019696E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.92 | backward: 1803.40 | backward-backward: 1803.37 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.83 + samples/sec: 6.591 | iteration 138800/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 1.858E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.040004E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.76 | backward: 1804.66 | backward-backward: 1804.64 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.76 + samples/sec: 6.591 | iteration 138900/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 1.857E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.016422E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.68 | backward: 1805.10 | backward-backward: 1805.08 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.77 + samples/sec: 6.595 | iteration 139000/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 1.856E-04 | approx 
flops per GPU: 41.0TFLOPS | lm_loss: 3.019999E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.33 | backward: 1803.87 | backward-backward: 1803.85 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.79 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 139000 | lm_loss value: 2.944004E+00 | lm_loss_ppl value: 1.899174E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.440 | iteration 139100/ 320000 | elapsed time per iteration (ms): 2484.4 | learning rate: 1.854E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.037854E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.67 | backward: 1804.94 | backward-backward: 1804.91 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.85 + samples/sec: 6.593 | iteration 139200/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 1.853E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.026535E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.74 | backward: 1804.12 | backward-backward: 1804.09 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.83 + samples/sec: 6.596 | iteration 139300/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 1.851E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.037317E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 565.88 | backward: 1803.97 | backward-backward: 1803.94 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.80 + samples/sec: 6.590 | iteration 139400/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 1.850E-04 | approx flops per GPU: 40.9TFLOPS 
| lm_loss: 3.032645E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 567.24 | backward: 1804.94 | backward-backward: 1804.92 | backward-allreduce: 0.00 | optimizer: 55.35 | batch generator: 0.78 + samples/sec: 6.588 | iteration 139500/ 320000 | elapsed time per iteration (ms): 2428.6 | learning rate: 1.848E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.031006E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.48 | backward: 1804.80 | backward-backward: 1804.78 | backward-allreduce: 0.00 | optimizer: 55.90 | batch generator: 0.80 + samples/sec: 6.590 | iteration 139600/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 1.847E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.031002E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.16 | backward: 1804.61 | backward-backward: 1804.58 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.78 + samples/sec: 6.597 | iteration 139700/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 1.846E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.024561E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.43 | backward: 1803.35 | backward-backward: 1803.33 | backward-allreduce: 0.00 | optimizer: 55.26 | batch generator: 0.80 + samples/sec: 6.596 | iteration 139800/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 1.844E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.030312E+00 | loss scale: 16384.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.63 | backward: 1803.88 | backward-backward: 1803.85 | backward-allreduce: 0.00 | optimizer: 54.90 | batch generator: 0.76 + samples/sec: 6.592 | iteration 139900/ 320000 | elapsed time per iteration (ms): 2427.1 | 
learning rate: 1.843E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.012854E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.74 | backward: 1804.58 | backward-backward: 1804.56 | backward-allreduce: 0.00 | optimizer: 55.41 | batch generator: 0.76 + samples/sec: 6.593 | iteration 140000/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 1.841E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.015888E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.80 | backward: 1803.98 | backward-backward: 1803.95 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 0.78 +WARNING: Deleting old checkpoints: + checkpoints-fcm/global_step40000 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 140000 | lm_loss value: 3.063662E+00 | lm_loss_ppl value: 2.140579E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.222 | iteration 140100/ 320000 | elapsed time per iteration (ms): 2571.6 | learning rate: 1.840E-04 | approx flops per GPU: 38.7TFLOPS | lm_loss: 3.029729E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.24 | backward: 1803.22 | backward-backward: 1803.20 | backward-allreduce: 0.00 | optimizer: 55.41 | batch generator: 0.85 + samples/sec: 6.595 | iteration 140200/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 1.838E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.024015E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.48 | backward: 1803.70 | backward-backward: 1803.67 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.77 + samples/sec: 6.591 | iteration 140300/ 320000 
| elapsed time per iteration (ms): 2427.4 | learning rate: 1.837E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.014704E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.77 | backward: 1804.68 | backward-backward: 1804.66 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.81 + samples/sec: 6.587 | iteration 140400/ 320000 | elapsed time per iteration (ms): 2428.9 | learning rate: 1.835E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.022416E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.37 | backward: 1804.89 | backward-backward: 1804.87 | backward-allreduce: 0.00 | optimizer: 56.27 | batch generator: 0.80 + samples/sec: 6.597 | iteration 140500/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 1.834E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.043169E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.49 | backward: 1802.80 | backward-backward: 1802.77 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.81 + samples/sec: 6.596 | iteration 140600/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 1.833E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.027946E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.46 | backward: 1803.45 | backward-backward: 1803.43 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.82 + samples/sec: 6.587 | iteration 140700/ 320000 | elapsed time per iteration (ms): 2429.0 | learning rate: 1.831E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.029113E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.89 | backward: 1805.49 | backward-backward: 1805.47 | backward-allreduce: 0.00 | optimizer: 56.20 | batch 
generator: 0.79 + samples/sec: 6.591 | iteration 140800/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 1.830E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.017930E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.25 | backward: 1804.09 | backward-backward: 1804.06 | backward-allreduce: 0.00 | optimizer: 55.81 | batch generator: 0.82 + samples/sec: 6.592 | iteration 140900/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 1.828E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.033524E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.77 | backward: 1804.21 | backward-backward: 1804.19 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.79 + samples/sec: 6.594 | iteration 141000/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 1.827E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.022070E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.35 | backward: 1803.82 | backward-backward: 1803.79 | backward-allreduce: 0.00 | optimizer: 55.92 | batch generator: 0.80 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 141000 | lm_loss value: 3.080925E+00 | lm_loss_ppl value: 2.177853E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.437 | iteration 141100/ 320000 | elapsed time per iteration (ms): 2485.5 | learning rate: 1.825E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.031105E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.00 | backward: 1805.48 | backward-backward: 1805.46 | backward-allreduce: 0.00 | optimizer: 55.81 | batch generator: 0.89 + 
samples/sec: 6.590 | iteration 141200/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 1.824E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.025312E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.95 | backward: 1804.81 | backward-backward: 1804.79 | backward-allreduce: 0.00 | optimizer: 55.83 | batch generator: 0.80 + samples/sec: 6.590 | iteration 141300/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 1.823E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.019339E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.11 | backward: 1804.87 | backward-backward: 1804.85 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.97 + samples/sec: 6.598 | iteration 141400/ 320000 | elapsed time per iteration (ms): 2424.9 | learning rate: 1.821E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.009781E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.15 | backward: 1803.12 | backward-backward: 1803.10 | backward-allreduce: 0.00 | optimizer: 55.23 | batch generator: 0.81 + samples/sec: 6.586 | iteration 141500/ 320000 | elapsed time per iteration (ms): 2429.5 | learning rate: 1.820E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.035902E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.39 | backward: 1805.31 | backward-backward: 1805.29 | backward-allreduce: 0.00 | optimizer: 56.37 | batch generator: 0.86 + samples/sec: 6.589 | iteration 141600/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 1.818E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.037529E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.00 | backward: 1804.98 | backward-backward: 1804.96 | 
backward-allreduce: 0.00 | optimizer: 55.79 | batch generator: 0.81 + samples/sec: 6.593 | iteration 141700/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 1.817E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.019633E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.70 | backward: 1804.19 | backward-backward: 1804.16 | backward-allreduce: 0.00 | optimizer: 55.41 | batch generator: 0.75 + samples/sec: 6.599 | iteration 141800/ 320000 | elapsed time per iteration (ms): 2424.8 | learning rate: 1.815E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.007230E+00 | loss scale: 32768.0 | number of skipped iterations: 2 | number of nan iterations: 0 | +time (ms) | forward: 566.01 | backward: 1803.98 | backward-backward: 1803.96 | backward-allreduce: 0.00 | optimizer: 54.43 | batch generator: 0.79 + samples/sec: 6.588 | iteration 141900/ 320000 | elapsed time per iteration (ms): 2428.8 | learning rate: 1.814E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.010744E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.86 | backward: 1805.91 | backward-backward: 1805.88 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.80 + samples/sec: 6.590 | iteration 142000/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 1.813E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.999482E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.20 | backward: 1804.63 | backward-backward: 1804.61 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.80 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 142000 | lm_loss value: 3.006179E+00 | lm_loss_ppl value: 2.021004E+01 | 
+----------------------------------------------------------------------------------------------------------- + samples/sec: 6.447 | iteration 142100/ 320000 | elapsed time per iteration (ms): 2481.9 | learning rate: 1.811E-04 | approx flops per GPU: 40.1TFLOPS | lm_loss: 3.021129E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.11 | backward: 1803.06 | backward-backward: 1803.04 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.85 + samples/sec: 6.590 | iteration 142200/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 1.810E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.043835E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.75 | backward: 1805.04 | backward-backward: 1805.01 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.81 + samples/sec: 6.592 | iteration 142300/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 1.808E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.025631E+00 | loss scale: 16384.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 567.44 | backward: 1804.30 | backward-backward: 1804.28 | backward-allreduce: 0.00 | optimizer: 55.12 | batch generator: 0.78 + samples/sec: 6.597 | iteration 142400/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 1.807E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.026989E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.22 | backward: 1803.11 | backward-backward: 1803.09 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.78 + samples/sec: 6.592 | iteration 142500/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 1.805E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.001243E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | +time (ms) | forward: 566.48 | backward: 1804.51 | backward-backward: 1804.49 | backward-allreduce: 0.00 | optimizer: 55.87 | batch generator: 0.76 + samples/sec: 6.588 | iteration 142600/ 320000 | elapsed time per iteration (ms): 2428.7 | learning rate: 1.804E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.033384E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.20 | backward: 1804.89 | backward-backward: 1804.86 | backward-allreduce: 0.00 | optimizer: 56.24 | batch generator: 0.77 + samples/sec: 6.599 | iteration 142700/ 320000 | elapsed time per iteration (ms): 2424.7 | learning rate: 1.802E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.033478E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.40 | backward: 1802.58 | backward-backward: 1802.56 | backward-allreduce: 0.00 | optimizer: 55.39 | batch generator: 0.81 + samples/sec: 6.595 | iteration 142800/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 1.801E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.022153E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.58 | backward: 1803.60 | backward-backward: 1803.58 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.86 + samples/sec: 6.589 | iteration 142900/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 1.800E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.023239E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.31 | backward: 1804.48 | backward-backward: 1804.45 | backward-allreduce: 0.00 | optimizer: 55.95 | batch generator: 0.82 + samples/sec: 6.593 | iteration 143000/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 1.798E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.011572E+00 | 
loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.44 | backward: 1803.47 | backward-backward: 1803.45 | backward-allreduce: 0.00 | optimizer: 55.40 | batch generator: 0.77 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 143000 | lm_loss value: 3.135687E+00 | lm_loss_ppl value: 2.300443E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.447 | iteration 143100/ 320000 | elapsed time per iteration (ms): 2481.9 | learning rate: 1.797E-04 | approx flops per GPU: 40.1TFLOPS | lm_loss: 2.992381E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.00 | backward: 1803.09 | backward-backward: 1803.06 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.85 + samples/sec: 6.591 | iteration 143200/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 1.795E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.013780E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.73 | backward: 1804.61 | backward-backward: 1804.58 | backward-allreduce: 0.00 | optimizer: 55.93 | batch generator: 0.77 + samples/sec: 6.587 | iteration 143300/ 320000 | elapsed time per iteration (ms): 2428.9 | learning rate: 1.794E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.024452E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.47 | backward: 1805.17 | backward-backward: 1805.15 | backward-allreduce: 0.00 | optimizer: 55.92 | batch generator: 0.89 + samples/sec: 6.596 | iteration 143400/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 1.792E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.001596E+00 | loss scale: 32768.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.24 | backward: 1803.25 | backward-backward: 1803.22 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.85 + samples/sec: 6.593 | iteration 143500/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 1.791E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.017137E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.58 | backward: 1804.60 | backward-backward: 1804.58 | backward-allreduce: 0.00 | optimizer: 55.39 | batch generator: 0.76 + samples/sec: 6.588 | iteration 143600/ 320000 | elapsed time per iteration (ms): 2428.8 | learning rate: 1.789E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.019846E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.31 | backward: 1804.97 | backward-backward: 1804.94 | backward-allreduce: 0.00 | optimizer: 56.14 | batch generator: 0.79 + samples/sec: 6.592 | iteration 143700/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 1.788E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.024855E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.52 | backward: 1803.97 | backward-backward: 1803.95 | backward-allreduce: 0.00 | optimizer: 56.23 | batch generator: 0.76 + samples/sec: 6.595 | iteration 143800/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 1.787E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.014076E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.33 | backward: 1804.21 | backward-backward: 1804.18 | backward-allreduce: 0.00 | optimizer: 55.35 | batch generator: 0.78 + samples/sec: 6.587 | iteration 143900/ 320000 | elapsed time per iteration (ms): 2428.8 | learning rate: 1.785E-04 | approx flops per 
GPU: 40.9TFLOPS | lm_loss: 3.011089E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.40 | backward: 1805.22 | backward-backward: 1805.20 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.94 + samples/sec: 6.594 | iteration 144000/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 1.784E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.013037E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.06 | backward: 1802.97 | backward-backward: 1802.94 | backward-allreduce: 0.00 | optimizer: 55.93 | batch generator: 0.92 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 144000 | lm_loss value: 2.872026E+00 | lm_loss_ppl value: 1.767279E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.444 | iteration 144100/ 320000 | elapsed time per iteration (ms): 2482.9 | learning rate: 1.782E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.026293E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.21 | backward: 1804.15 | backward-backward: 1804.12 | backward-allreduce: 0.00 | optimizer: 55.36 | batch generator: 0.86 + samples/sec: 6.589 | iteration 144200/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 1.781E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.016819E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.16 | backward: 1805.35 | backward-backward: 1805.32 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.78 + samples/sec: 6.597 | iteration 144300/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 1.779E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 
3.023470E+00 | loss scale: 32768.0 | number of skipped iterations: 2 | number of nan iterations: 0 | +time (ms) | forward: 567.02 | backward: 1803.72 | backward-backward: 1803.69 | backward-allreduce: 0.00 | optimizer: 54.35 | batch generator: 0.78 + samples/sec: 6.596 | iteration 144400/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 1.778E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.005642E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.03 | backward: 1803.92 | backward-backward: 1803.89 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.78 + samples/sec: 6.589 | iteration 144500/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 1.776E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.014860E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.29 | backward: 1804.97 | backward-backward: 1804.94 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.80 + samples/sec: 6.593 | iteration 144600/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 1.775E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.008758E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.08 | backward: 1803.95 | backward-backward: 1803.92 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.80 + samples/sec: 6.591 | iteration 144700/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 1.774E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.006990E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.35 | backward: 1804.99 | backward-backward: 1804.96 | backward-allreduce: 0.00 | optimizer: 55.82 | batch generator: 0.79 + samples/sec: 6.587 | iteration 144800/ 320000 | elapsed time per iteration (ms): 2428.9 | learning 
rate: 1.772E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.014591E+00 | loss scale: 16384.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 567.53 | backward: 1805.36 | backward-backward: 1805.33 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.81 + samples/sec: 6.597 | iteration 144900/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 1.771E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.008408E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.09 | backward: 1803.31 | backward-backward: 1803.29 | backward-allreduce: 0.00 | optimizer: 55.41 | batch generator: 0.78 + samples/sec: 6.592 | iteration 145000/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 1.769E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.003885E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.76 | backward: 1804.66 | backward-backward: 1804.64 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.81 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 145000 | lm_loss value: 2.967851E+00 | lm_loss_ppl value: 1.945007E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.440 | iteration 145100/ 320000 | elapsed time per iteration (ms): 2484.5 | learning rate: 1.768E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.013310E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.26 | backward: 1803.90 | backward-backward: 1803.88 | backward-allreduce: 0.00 | optimizer: 56.09 | batch generator: 0.84 + samples/sec: 6.597 | iteration 145200/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 1.766E-04 | approx 
flops per GPU: 41.0TFLOPS | lm_loss: 3.009355E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.02 | backward: 1803.11 | backward-backward: 1803.08 | backward-allreduce: 0.00 | optimizer: 55.84 | batch generator: 0.77 + samples/sec: 6.589 | iteration 145300/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 1.765E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.012231E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.98 | backward: 1805.07 | backward-backward: 1805.04 | backward-allreduce: 0.00 | optimizer: 55.81 | batch generator: 0.79 + samples/sec: 6.592 | iteration 145400/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 1.763E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.969508E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.35 | backward: 1804.00 | backward-backward: 1803.98 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.80 + samples/sec: 6.595 | iteration 145500/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 1.762E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.035185E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.59 | backward: 1803.69 | backward-backward: 1803.67 | backward-allreduce: 0.00 | optimizer: 55.37 | batch generator: 0.78 + samples/sec: 6.589 | iteration 145600/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 1.761E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.998504E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.27 | backward: 1804.99 | backward-backward: 1804.97 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.77 + samples/sec: 6.598 | iteration 145700/ 320000 | elapsed time 
per iteration (ms): 2425.0 | learning rate: 1.759E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.014991E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.57 | backward: 1802.45 | backward-backward: 1802.42 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.80 + samples/sec: 6.591 | iteration 145800/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 1.758E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.009021E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.65 | backward: 1804.80 | backward-backward: 1804.77 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.80 + samples/sec: 6.586 | iteration 145900/ 320000 | elapsed time per iteration (ms): 2429.5 | learning rate: 1.756E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.037935E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.36 | backward: 1805.66 | backward-backward: 1805.64 | backward-allreduce: 0.00 | optimizer: 56.09 | batch generator: 0.73 + samples/sec: 6.595 | iteration 146000/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 1.755E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.009698E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.65 | backward: 1803.27 | backward-backward: 1803.24 | backward-allreduce: 0.00 | optimizer: 55.92 | batch generator: 0.81 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 146000 | lm_loss value: 3.021375E+00 | lm_loss_ppl value: 2.051949E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.437 | iteration 146100/ 320000 | elapsed time per iteration (ms): 2485.5 
| learning rate: 1.753E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.006284E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.07 | backward: 1805.82 | backward-backward: 1805.79 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.79 + samples/sec: 6.595 | iteration 146200/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 1.752E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.007090E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.76 | backward: 1803.32 | backward-backward: 1803.29 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.79 + samples/sec: 6.593 | iteration 146300/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 1.750E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.013248E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.27 | backward: 1804.75 | backward-backward: 1804.73 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.74 + samples/sec: 6.586 | iteration 146400/ 320000 | elapsed time per iteration (ms): 2429.2 | learning rate: 1.749E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.028515E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.65 | backward: 1805.65 | backward-backward: 1805.63 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.80 + samples/sec: 6.594 | iteration 146500/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 1.747E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.998268E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.56 | backward: 1803.71 | backward-backward: 1803.68 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.80 + samples/sec: 6.586 | 
iteration 146600/ 320000 | elapsed time per iteration (ms): 2429.3 | learning rate: 1.746E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.006394E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.22 | backward: 1805.74 | backward-backward: 1805.72 | backward-allreduce: 0.00 | optimizer: 55.96 | batch generator: 0.79 + samples/sec: 6.592 | iteration 146700/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 1.745E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.983523E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.15 | backward: 1804.08 | backward-backward: 1804.06 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.77 + samples/sec: 6.596 | iteration 146800/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 1.743E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.991919E+00 | loss scale: 32768.0 | number of skipped iterations: 2 | number of nan iterations: 0 | +time (ms) | forward: 566.15 | backward: 1804.42 | backward-backward: 1804.39 | backward-allreduce: 0.00 | optimizer: 54.91 | batch generator: 0.80 + samples/sec: 6.585 | iteration 146900/ 320000 | elapsed time per iteration (ms): 2429.7 | learning rate: 1.742E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.008527E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.21 | backward: 1805.96 | backward-backward: 1805.93 | backward-allreduce: 0.00 | optimizer: 56.19 | batch generator: 0.75 + samples/sec: 6.595 | iteration 147000/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 1.740E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.024461E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.39 | backward: 1803.65 | backward-backward: 1803.62 | backward-allreduce: 0.00 | 
optimizer: 55.58 | batch generator: 0.76 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 147000 | lm_loss value: 3.012671E+00 | lm_loss_ppl value: 2.034166E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.439 | iteration 147100/ 320000 | elapsed time per iteration (ms): 2485.0 | learning rate: 1.739E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.026111E+00 | loss scale: 16384.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.83 | backward: 1805.77 | backward-backward: 1805.74 | backward-allreduce: 0.00 | optimizer: 55.19 | batch generator: 0.89 + samples/sec: 6.594 | iteration 147200/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 1.737E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.011507E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.19 | backward: 1803.28 | backward-backward: 1803.25 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.78 + samples/sec: 6.597 | iteration 147300/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 1.736E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.996978E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.13 | backward: 1803.68 | backward-backward: 1803.66 | backward-allreduce: 0.00 | optimizer: 55.29 | batch generator: 0.75 + samples/sec: 6.590 | iteration 147400/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 1.734E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.014136E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.16 | backward: 1805.06 | backward-backward: 1805.03 | backward-allreduce: 0.00 | optimizer: 55.42 | batch 
generator: 0.76 + samples/sec: 6.596 | iteration 147500/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 1.733E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.041214E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.85 | backward: 1802.92 | backward-backward: 1802.90 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.79 + samples/sec: 6.594 | iteration 147600/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 1.732E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.034612E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.48 | backward: 1804.10 | backward-backward: 1804.07 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.79 + samples/sec: 6.587 | iteration 147700/ 320000 | elapsed time per iteration (ms): 2429.0 | learning rate: 1.730E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.028787E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.78 | backward: 1805.01 | backward-backward: 1804.98 | backward-allreduce: 0.00 | optimizer: 55.79 | batch generator: 0.87 + samples/sec: 6.597 | iteration 147800/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 1.729E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.008867E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.44 | backward: 1802.98 | backward-backward: 1802.96 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.79 + samples/sec: 6.593 | iteration 147900/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 1.727E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.997423E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.77 | backward: 1804.16 | backward-backward: 
1804.14 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.79 + samples/sec: 6.586 | iteration 148000/ 320000 | elapsed time per iteration (ms): 2429.3 | learning rate: 1.726E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.017035E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.49 | backward: 1805.33 | backward-backward: 1805.30 | backward-allreduce: 0.00 | optimizer: 56.15 | batch generator: 0.78 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 148000 | lm_loss value: 2.969936E+00 | lm_loss_ppl value: 1.949067E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.446 | iteration 148100/ 320000 | elapsed time per iteration (ms): 2482.0 | learning rate: 1.724E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.008131E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.20 | backward: 1803.31 | backward-backward: 1803.28 | backward-allreduce: 0.00 | optimizer: 55.41 | batch generator: 0.90 + samples/sec: 6.586 | iteration 148200/ 320000 | elapsed time per iteration (ms): 2429.5 | learning rate: 1.723E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.009712E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.54 | backward: 1805.35 | backward-backward: 1805.32 | backward-allreduce: 0.00 | optimizer: 56.09 | batch generator: 0.80 + samples/sec: 6.591 | iteration 148300/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 1.721E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.013130E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.35 | backward: 1803.93 | backward-backward: 1803.91 | 
backward-allreduce: 0.00 | optimizer: 55.83 | batch generator: 0.79 + samples/sec: 6.595 | iteration 148400/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 1.720E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.008856E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.17 | backward: 1803.68 | backward-backward: 1803.65 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.79 + samples/sec: 6.586 | iteration 148500/ 320000 | elapsed time per iteration (ms): 2429.5 | learning rate: 1.718E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.990135E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.53 | backward: 1805.73 | backward-backward: 1805.71 | backward-allreduce: 0.00 | optimizer: 55.87 | batch generator: 0.86 + samples/sec: 6.596 | iteration 148600/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 1.717E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.011000E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.57 | backward: 1803.38 | backward-backward: 1803.35 | backward-allreduce: 0.00 | optimizer: 55.23 | batch generator: 0.76 + samples/sec: 6.592 | iteration 148700/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 1.716E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.012153E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.55 | backward: 1804.96 | backward-backward: 1804.94 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.77 + samples/sec: 6.592 | iteration 148800/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 1.714E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.026171E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | 
forward: 567.53 | backward: 1804.27 | backward-backward: 1804.25 | backward-allreduce: 0.00 | optimizer: 55.13 | batch generator: 0.78 + samples/sec: 6.593 | iteration 148900/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 1.713E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.999431E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.33 | backward: 1804.67 | backward-backward: 1804.65 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.75 + samples/sec: 6.587 | iteration 149000/ 320000 | elapsed time per iteration (ms): 2429.1 | learning rate: 1.711E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.011778E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.58 | backward: 1805.35 | backward-backward: 1805.32 | backward-allreduce: 0.00 | optimizer: 55.76 | batch generator: 0.80 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 149000 | lm_loss value: 2.981685E+00 | lm_loss_ppl value: 1.972102E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.443 | iteration 149100/ 320000 | elapsed time per iteration (ms): 2483.4 | learning rate: 1.710E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.991318E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.31 | backward: 1803.67 | backward-backward: 1803.65 | backward-allreduce: 0.00 | optimizer: 56.31 | batch generator: 0.89 + samples/sec: 6.588 | iteration 149200/ 320000 | elapsed time per iteration (ms): 2428.7 | learning rate: 1.708E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.013188E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.21 | 
backward: 1805.36 | backward-backward: 1805.34 | backward-allreduce: 0.00 | optimizer: 55.80 | batch generator: 0.80 + samples/sec: 6.592 | iteration 149300/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 1.707E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.021961E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.86 | backward: 1804.13 | backward-backward: 1804.10 | backward-allreduce: 0.00 | optimizer: 55.96 | batch generator: 0.82 + samples/sec: 6.593 | iteration 149400/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 1.705E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.993239E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.36 | backward: 1804.33 | backward-backward: 1804.31 | backward-allreduce: 0.00 | optimizer: 55.78 | batch generator: 0.79 + samples/sec: 6.593 | iteration 149500/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 1.704E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.006541E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.52 | backward: 1804.14 | backward-backward: 1804.11 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.77 + samples/sec: 6.595 | iteration 149600/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 1.702E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.998258E+00 | loss scale: 16384.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.30 | backward: 1804.29 | backward-backward: 1804.26 | backward-allreduce: 0.00 | optimizer: 54.98 | batch generator: 0.79 + samples/sec: 6.592 | iteration 149700/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 1.701E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.007240E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | +time (ms) | forward: 566.82 | backward: 1804.41 | backward-backward: 1804.39 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.78 + samples/sec: 6.598 | iteration 149800/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 1.699E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.993647E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.18 | backward: 1803.12 | backward-backward: 1803.09 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.82 + samples/sec: 6.590 | iteration 149900/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 1.698E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.020144E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.89 | backward: 1804.89 | backward-backward: 1804.86 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.87 + samples/sec: 6.598 | iteration 150000/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 1.697E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.035073E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.45 | backward: 1802.83 | backward-backward: 1802.80 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.79 +WARNING: Deleting old checkpoints: + checkpoints-fcm/global_step50000 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 150000 | lm_loss value: 2.916648E+00 | lm_loss_ppl value: 1.847924E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.207 | iteration 150100/ 320000 | elapsed time per iteration (ms): 2577.9 | learning rate: 1.695E-04 | approx flops per GPU: 38.6TFLOPS | lm_loss: 2.976295E+00 | loss scale: 
16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 568.85 | backward: 1807.80 | backward-backward: 1807.78 | backward-allreduce: 0.00 | optimizer: 55.78 | batch generator: 0.87 + samples/sec: 6.589 | iteration 150200/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 1.694E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.003804E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.91 | backward: 1804.97 | backward-backward: 1804.94 | backward-allreduce: 0.00 | optimizer: 56.11 | batch generator: 0.79 + samples/sec: 6.596 | iteration 150300/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 1.692E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.010716E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.28 | backward: 1803.76 | backward-backward: 1803.73 | backward-allreduce: 0.00 | optimizer: 55.36 | batch generator: 0.75 + samples/sec: 6.594 | iteration 150400/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 1.691E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.992361E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.41 | backward: 1803.92 | backward-backward: 1803.90 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.76 + samples/sec: 6.596 | iteration 150500/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 1.689E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.986776E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.16 | backward: 1803.46 | backward-backward: 1803.43 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.78 + samples/sec: 6.591 | iteration 150600/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 1.688E-04 | approx 
flops per GPU: 40.9TFLOPS | lm_loss: 2.983868E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.68 | backward: 1804.84 | backward-backward: 1804.82 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.76 + samples/sec: 6.596 | iteration 150700/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 1.686E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.002856E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.23 | backward: 1803.39 | backward-backward: 1803.37 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.75 + samples/sec: 6.591 | iteration 150800/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 1.685E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.982844E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.60 | backward: 1805.06 | backward-backward: 1805.04 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.76 + samples/sec: 6.593 | iteration 150900/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 1.683E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.999709E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.67 | backward: 1804.27 | backward-backward: 1804.24 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.78 + samples/sec: 6.592 | iteration 151000/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 1.682E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.013692E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.30 | backward: 1804.95 | backward-backward: 1804.93 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.76 
+----------------------------------------------------------------------------------------------------------- + validation results at iteration 151000 | lm_loss value: 2.923672E+00 | lm_loss_ppl value: 1.860949E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.445 | iteration 151100/ 320000 | elapsed time per iteration (ms): 2482.4 | learning rate: 1.681E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.011578E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.53 | backward: 1803.65 | backward-backward: 1803.62 | backward-allreduce: 0.00 | optimizer: 55.01 | batch generator: 0.85 + samples/sec: 6.595 | iteration 151200/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 1.679E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.995788E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.50 | backward: 1803.47 | backward-backward: 1803.45 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.96 + samples/sec: 6.585 | iteration 151300/ 320000 | elapsed time per iteration (ms): 2429.6 | learning rate: 1.678E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.999903E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.71 | backward: 1806.22 | backward-backward: 1806.19 | backward-allreduce: 0.00 | optimizer: 56.30 | batch generator: 0.81 + samples/sec: 6.596 | iteration 151400/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 1.676E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.978781E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.31 | backward: 1803.24 | backward-backward: 1803.21 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 0.78 + samples/sec: 6.594 | 
iteration 151500/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 1.675E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.977940E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.55 | backward: 1804.04 | backward-backward: 1804.02 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.81 + samples/sec: 6.592 | iteration 151600/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 1.673E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.004445E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.66 | backward: 1804.70 | backward-backward: 1804.67 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.80 + samples/sec: 6.595 | iteration 151700/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 1.672E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.009492E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.28 | backward: 1803.79 | backward-backward: 1803.77 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.80 + samples/sec: 6.589 | iteration 151800/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 1.670E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.999218E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.00 | backward: 1805.17 | backward-backward: 1805.14 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.90 + samples/sec: 6.598 | iteration 151900/ 320000 | elapsed time per iteration (ms): 2425.0 | learning rate: 1.669E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.016407E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.34 | backward: 1803.00 | backward-backward: 1802.97 | backward-allreduce: 0.00 | 
optimizer: 55.28 | batch generator: 0.81 + samples/sec: 6.590 | iteration 152000/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 1.667E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.990901E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.68 | backward: 1805.00 | backward-backward: 1804.98 | backward-allreduce: 0.00 | optimizer: 56.00 | batch generator: 0.80 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 152000 | lm_loss value: 3.033347E+00 | lm_loss_ppl value: 2.076662E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.442 | iteration 152100/ 320000 | elapsed time per iteration (ms): 2483.8 | learning rate: 1.666E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.994393E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.72 | backward: 1804.28 | backward-backward: 1804.26 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.87 + samples/sec: 6.590 | iteration 152200/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 1.664E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.000322E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.69 | backward: 1805.47 | backward-backward: 1805.44 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.78 + samples/sec: 6.589 | iteration 152300/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 1.663E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.005586E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.17 | backward: 1805.16 | backward-backward: 1805.14 | backward-allreduce: 0.00 | optimizer: 55.57 | batch 
generator: 0.79 + samples/sec: 6.594 | iteration 152400/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 1.662E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.012498E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.05 | backward: 1804.47 | backward-backward: 1804.45 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.79 + samples/sec: 6.588 | iteration 152500/ 320000 | elapsed time per iteration (ms): 2428.7 | learning rate: 1.660E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.006141E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.80 | backward: 1805.75 | backward-backward: 1805.73 | backward-allreduce: 0.00 | optimizer: 55.78 | batch generator: 0.77 + samples/sec: 6.595 | iteration 152600/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 1.659E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.989402E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.65 | backward: 1804.05 | backward-backward: 1804.03 | backward-allreduce: 0.00 | optimizer: 55.16 | batch generator: 0.79 + samples/sec: 6.593 | iteration 152700/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 1.657E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.001414E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.35 | backward: 1804.56 | backward-backward: 1804.54 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.77 + samples/sec: 6.588 | iteration 152800/ 320000 | elapsed time per iteration (ms): 2428.7 | learning rate: 1.656E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.002898E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.19 | backward: 1805.29 | backward-backward: 
1805.27 | backward-allreduce: 0.00 | optimizer: 55.81 | batch generator: 0.78 + samples/sec: 6.598 | iteration 152900/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 1.654E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.979635E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.17 | backward: 1803.09 | backward-backward: 1803.07 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.79 + samples/sec: 6.588 | iteration 153000/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 1.653E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.983808E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.91 | backward: 1805.62 | backward-backward: 1805.59 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.80 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 153000 | lm_loss value: 2.927353E+00 | lm_loss_ppl value: 1.867813E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.443 | iteration 153100/ 320000 | elapsed time per iteration (ms): 2483.4 | learning rate: 1.651E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.988497E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.71 | backward: 1803.75 | backward-backward: 1803.72 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.86 + samples/sec: 6.589 | iteration 153200/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 1.650E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.976262E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.55 | backward: 1805.10 | backward-backward: 1805.08 | 
backward-allreduce: 0.00 | optimizer: 56.14 | batch generator: 0.80 + samples/sec: 6.590 | iteration 153300/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 1.648E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.014687E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.96 | backward: 1804.59 | backward-backward: 1804.57 | backward-allreduce: 0.00 | optimizer: 55.88 | batch generator: 0.81 + samples/sec: 6.595 | iteration 153400/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 1.647E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.992180E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.28 | backward: 1803.54 | backward-backward: 1803.52 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.80 + samples/sec: 6.585 | iteration 153500/ 320000 | elapsed time per iteration (ms): 2429.9 | learning rate: 1.645E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.983005E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.05 | backward: 1806.40 | backward-backward: 1806.37 | backward-allreduce: 0.00 | optimizer: 56.06 | batch generator: 0.78 + samples/sec: 6.597 | iteration 153600/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 1.644E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.999695E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.49 | backward: 1803.18 | backward-backward: 1803.16 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.78 + samples/sec: 6.591 | iteration 153700/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 1.642E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.990131E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | 
forward: 566.72 | backward: 1805.07 | backward-backward: 1805.05 | backward-allreduce: 0.00 | optimizer: 55.35 | batch generator: 0.78 + samples/sec: 6.586 | iteration 153800/ 320000 | elapsed time per iteration (ms): 2429.3 | learning rate: 1.641E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.969124E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.26 | backward: 1805.95 | backward-backward: 1805.93 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.79 + samples/sec: 6.596 | iteration 153900/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 1.640E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.981948E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.45 | backward: 1803.36 | backward-backward: 1803.33 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.79 + samples/sec: 6.588 | iteration 154000/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 1.638E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.992834E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.52 | backward: 1805.57 | backward-backward: 1805.55 | backward-allreduce: 0.00 | optimizer: 56.02 | batch generator: 0.78 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 154000 | lm_loss value: 3.005297E+00 | lm_loss_ppl value: 2.019222E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.442 | iteration 154100/ 320000 | elapsed time per iteration (ms): 2483.9 | learning rate: 1.637E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.987995E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.93 | 
backward: 1804.78 | backward-backward: 1804.75 | backward-allreduce: 0.00 | optimizer: 54.96 | batch generator: 0.85 + samples/sec: 6.594 | iteration 154200/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 1.635E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.963529E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.42 | backward: 1804.10 | backward-backward: 1804.07 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.86 + samples/sec: 6.588 | iteration 154300/ 320000 | elapsed time per iteration (ms): 2428.8 | learning rate: 1.634E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.973472E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.19 | backward: 1805.71 | backward-backward: 1805.69 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.79 + samples/sec: 6.597 | iteration 154400/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 1.632E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.007917E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.25 | backward: 1803.08 | backward-backward: 1803.05 | backward-allreduce: 0.00 | optimizer: 55.77 | batch generator: 0.78 + samples/sec: 6.589 | iteration 154500/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 1.631E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.014982E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.76 | backward: 1805.32 | backward-backward: 1805.30 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 0.79 + samples/sec: 6.591 | iteration 154600/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 1.629E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.002899E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | +time (ms) | forward: 566.65 | backward: 1804.05 | backward-backward: 1804.03 | backward-allreduce: 0.00 | optimizer: 56.65 | batch generator: 0.76 + samples/sec: 6.592 | iteration 154700/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 1.628E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.968603E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.76 | backward: 1804.20 | backward-backward: 1804.17 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.82 + samples/sec: 6.585 | iteration 154800/ 320000 | elapsed time per iteration (ms): 2429.7 | learning rate: 1.626E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.996223E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.32 | backward: 1805.96 | backward-backward: 1805.94 | backward-allreduce: 0.00 | optimizer: 56.02 | batch generator: 0.80 + samples/sec: 6.597 | iteration 154900/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 1.625E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.989323E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.22 | backward: 1803.20 | backward-backward: 1803.18 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.81 + samples/sec: 6.589 | iteration 155000/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 1.623E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.963140E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.57 | backward: 1805.60 | backward-backward: 1805.58 | backward-allreduce: 0.00 | optimizer: 55.82 | batch generator: 0.80 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 155000 | lm_loss value: 
2.972611E+00 | lm_loss_ppl value: 1.954288E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.444 | iteration 155100/ 320000 | elapsed time per iteration (ms): 2483.0 | learning rate: 1.622E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.984459E+00 | loss scale: 32768.0 | number of skipped iterations: 2 | number of nan iterations: 0 | +time (ms) | forward: 566.83 | backward: 1804.37 | backward-backward: 1804.34 | backward-allreduce: 0.00 | optimizer: 54.55 | batch generator: 0.99 + samples/sec: 6.595 | iteration 155200/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 1.621E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.986625E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.07 | backward: 1804.00 | backward-backward: 1803.97 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.82 + samples/sec: 6.590 | iteration 155300/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 1.619E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.991208E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.76 | backward: 1805.37 | backward-backward: 1805.34 | backward-allreduce: 0.00 | optimizer: 55.40 | batch generator: 0.77 + samples/sec: 6.592 | iteration 155400/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 1.618E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.986336E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.74 | backward: 1804.53 | backward-backward: 1804.51 | backward-allreduce: 0.00 | optimizer: 55.40 | batch generator: 0.78 + samples/sec: 6.595 | iteration 155500/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 1.616E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.982070E+00 | loss scale: 
32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.38 | backward: 1803.92 | backward-backward: 1803.90 | backward-allreduce: 0.00 | optimizer: 55.40 | batch generator: 0.79 + samples/sec: 6.589 | iteration 155600/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 1.615E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.991624E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.84 | backward: 1805.42 | backward-backward: 1805.40 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.79 + samples/sec: 6.591 | iteration 155700/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 1.613E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.972868E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.73 | backward: 1804.35 | backward-backward: 1804.33 | backward-allreduce: 0.00 | optimizer: 55.93 | batch generator: 0.79 + samples/sec: 6.594 | iteration 155800/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 1.612E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.009465E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.69 | backward: 1804.02 | backward-backward: 1804.00 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.81 + samples/sec: 6.589 | iteration 155900/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 1.610E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.970847E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.80 | backward: 1805.50 | backward-backward: 1805.48 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.78 + samples/sec: 6.596 | iteration 156000/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 1.609E-04 | approx 
flops per GPU: 41.0TFLOPS | lm_loss: 2.990421E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.64 | backward: 1803.04 | backward-backward: 1803.02 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.81 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 156000 | lm_loss value: 3.028934E+00 | lm_loss_ppl value: 2.067517E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.440 | iteration 156100/ 320000 | elapsed time per iteration (ms): 2484.6 | learning rate: 1.607E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.962716E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.27 | backward: 1805.30 | backward-backward: 1805.27 | backward-allreduce: 0.00 | optimizer: 55.88 | batch generator: 0.85 + samples/sec: 6.586 | iteration 156200/ 320000 | elapsed time per iteration (ms): 2429.4 | learning rate: 1.606E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.959265E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.92 | backward: 1805.75 | backward-backward: 1805.72 | backward-allreduce: 0.00 | optimizer: 56.35 | batch generator: 0.82 + samples/sec: 6.596 | iteration 156300/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 1.604E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.972907E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.71 | backward: 1803.31 | backward-backward: 1803.29 | backward-allreduce: 0.00 | optimizer: 55.28 | batch generator: 0.77 + samples/sec: 6.591 | iteration 156400/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 1.603E-04 | approx flops per GPU: 40.9TFLOPS 
| lm_loss: 2.974516E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.66 | backward: 1804.91 | backward-backward: 1804.89 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.81 + samples/sec: 6.589 | iteration 156500/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 1.601E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.971215E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 567.20 | backward: 1805.89 | backward-backward: 1805.86 | backward-allreduce: 0.00 | optimizer: 55.02 | batch generator: 0.80 + samples/sec: 6.598 | iteration 156600/ 320000 | elapsed time per iteration (ms): 2425.0 | learning rate: 1.600E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.991467E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.16 | backward: 1803.09 | backward-backward: 1803.07 | backward-allreduce: 0.00 | optimizer: 55.36 | batch generator: 0.75 + samples/sec: 6.591 | iteration 156700/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 1.599E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.978375E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.53 | backward: 1804.43 | backward-backward: 1804.40 | backward-allreduce: 0.00 | optimizer: 56.33 | batch generator: 0.80 + samples/sec: 6.590 | iteration 156800/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 1.597E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.001807E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.89 | backward: 1805.18 | backward-backward: 1805.16 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.80 + samples/sec: 6.599 | iteration 156900/ 320000 | elapsed time per iteration (ms): 2424.7 | 
learning rate: 1.596E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.976177E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.08 | backward: 1802.92 | backward-backward: 1802.90 | backward-allreduce: 0.00 | optimizer: 55.34 | batch generator: 0.79 + samples/sec: 6.590 | iteration 157000/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 1.594E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.974887E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.58 | backward: 1804.98 | backward-backward: 1804.95 | backward-allreduce: 0.00 | optimizer: 55.84 | batch generator: 0.81 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 157000 | lm_loss value: 2.918399E+00 | lm_loss_ppl value: 1.851162E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.440 | iteration 157100/ 320000 | elapsed time per iteration (ms): 2484.3 | learning rate: 1.593E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.999412E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.86 | backward: 1804.51 | backward-backward: 1804.49 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.83 + samples/sec: 6.598 | iteration 157200/ 320000 | elapsed time per iteration (ms): 2425.0 | learning rate: 1.591E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.979333E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.00 | backward: 1803.23 | backward-backward: 1803.20 | backward-allreduce: 0.00 | optimizer: 55.42 | batch generator: 0.79 + samples/sec: 6.591 | iteration 157300/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 1.590E-04 | 
approx flops per GPU: 40.9TFLOPS | lm_loss: 2.991652E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.77 | backward: 1804.81 | backward-backward: 1804.79 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.83 + samples/sec: 6.590 | iteration 157400/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 1.588E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.989381E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.03 | backward: 1804.57 | backward-backward: 1804.55 | backward-allreduce: 0.00 | optimizer: 55.98 | batch generator: 0.83 + samples/sec: 6.598 | iteration 157500/ 320000 | elapsed time per iteration (ms): 2424.9 | learning rate: 1.587E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.969291E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.06 | backward: 1803.09 | backward-backward: 1803.06 | backward-allreduce: 0.00 | optimizer: 55.33 | batch generator: 0.79 + samples/sec: 6.591 | iteration 157600/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 1.585E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.979987E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.64 | backward: 1805.05 | backward-backward: 1805.03 | backward-allreduce: 0.00 | optimizer: 55.40 | batch generator: 0.78 + samples/sec: 6.587 | iteration 157700/ 320000 | elapsed time per iteration (ms): 2429.0 | learning rate: 1.584E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.989767E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.39 | backward: 1805.42 | backward-backward: 1805.39 | backward-allreduce: 0.00 | optimizer: 55.86 | batch generator: 0.94 + samples/sec: 6.595 | iteration 157800/ 320000 | elapsed 
time per iteration (ms): 2426.2 | learning rate: 1.582E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.959209E+00 | loss scale: 32768.0 | number of skipped iterations: 2 | number of nan iterations: 0 | +time (ms) | forward: 566.20 | backward: 1804.42 | backward-backward: 1804.39 | backward-allreduce: 0.00 | optimizer: 55.23 | batch generator: 0.78 + samples/sec: 6.594 | iteration 157900/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 1.581E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.988754E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.39 | backward: 1804.22 | backward-backward: 1804.20 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.80 + samples/sec: 6.590 | iteration 158000/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 1.579E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.978794E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.15 | backward: 1804.79 | backward-backward: 1804.76 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.83 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 158000 | lm_loss value: 3.023163E+00 | lm_loss_ppl value: 2.055620E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.446 | iteration 158100/ 320000 | elapsed time per iteration (ms): 2482.2 | learning rate: 1.578E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.991219E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.41 | backward: 1802.92 | backward-backward: 1802.90 | backward-allreduce: 0.00 | optimizer: 55.41 | batch generator: 1.01 + samples/sec: 6.592 | iteration 158200/ 320000 | elapsed time per iteration (ms): 
2427.2 | learning rate: 1.577E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.955492E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.60 | backward: 1804.26 | backward-backward: 1804.24 | backward-allreduce: 0.00 | optimizer: 55.95 | batch generator: 0.78 + samples/sec: 6.588 | iteration 158300/ 320000 | elapsed time per iteration (ms): 2428.6 | learning rate: 1.575E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.970529E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.57 | backward: 1805.04 | backward-backward: 1805.01 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.78 + samples/sec: 6.600 | iteration 158400/ 320000 | elapsed time per iteration (ms): 2424.4 | learning rate: 1.574E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.979223E+00 | loss scale: 16384.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.48 | backward: 1802.48 | backward-backward: 1802.45 | backward-allreduce: 0.00 | optimizer: 55.06 | batch generator: 0.81 + samples/sec: 6.594 | iteration 158500/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 1.572E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.973608E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.42 | backward: 1803.77 | backward-backward: 1803.75 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.81 + samples/sec: 6.592 | iteration 158600/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 1.571E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.968325E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.86 | backward: 1804.41 | backward-backward: 1804.39 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.77 + samples/sec: 6.598 | 
iteration 158700/ 320000 | elapsed time per iteration (ms): 2424.9 | learning rate: 1.569E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.981199E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.25 | backward: 1802.50 | backward-backward: 1802.48 | backward-allreduce: 0.00 | optimizer: 55.82 | batch generator: 0.82 + samples/sec: 6.594 | iteration 158800/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 1.568E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.969596E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.52 | backward: 1803.94 | backward-backward: 1803.92 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.81 + samples/sec: 6.589 | iteration 158900/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 1.566E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.982292E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.67 | backward: 1805.19 | backward-backward: 1805.16 | backward-allreduce: 0.00 | optimizer: 56.24 | batch generator: 0.80 + samples/sec: 6.599 | iteration 159000/ 320000 | elapsed time per iteration (ms): 2424.6 | learning rate: 1.565E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.979471E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.31 | backward: 1802.33 | backward-backward: 1802.31 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.80 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 159000 | lm_loss value: 3.026112E+00 | lm_loss_ppl value: 2.061692E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.446 | iteration 159100/ 320000 | 
elapsed time per iteration (ms): 2482.1 | learning rate: 1.563E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.971291E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.19 | backward: 1803.23 | backward-backward: 1803.21 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.83 + samples/sec: 6.592 | iteration 159200/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 1.562E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.974822E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.79 | backward: 1804.53 | backward-backward: 1804.51 | backward-allreduce: 0.00 | optimizer: 55.40 | batch generator: 0.83 + samples/sec: 6.596 | iteration 159300/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 1.560E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.985768E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.76 | backward: 1802.91 | backward-backward: 1802.88 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.82 + samples/sec: 6.599 | iteration 159400/ 320000 | elapsed time per iteration (ms): 2424.7 | learning rate: 1.559E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.993650E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.98 | backward: 1802.89 | backward-backward: 1802.87 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.77 + samples/sec: 6.590 | iteration 159500/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 1.557E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.966135E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.84 | backward: 1805.16 | backward-backward: 1805.13 | backward-allreduce: 0.00 | optimizer: 55.74 | batch 
generator: 0.79 + samples/sec: 6.592 | iteration 159600/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 1.556E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.984546E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.05 | backward: 1804.03 | backward-backward: 1804.01 | backward-allreduce: 0.00 | optimizer: 55.89 | batch generator: 0.80 + samples/sec: 6.599 | iteration 159700/ 320000 | elapsed time per iteration (ms): 2424.7 | learning rate: 1.554E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.959957E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.06 | backward: 1802.66 | backward-backward: 1802.63 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.81 + samples/sec: 6.591 | iteration 159800/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 1.553E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.982831E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.67 | backward: 1804.81 | backward-backward: 1804.78 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.76 + samples/sec: 6.590 | iteration 159900/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 1.552E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.960102E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.18 | backward: 1804.55 | backward-backward: 1804.53 | backward-allreduce: 0.00 | optimizer: 55.84 | batch generator: 0.79 + samples/sec: 6.596 | iteration 160000/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 1.550E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.955252E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.37 | backward: 1802.91 | backward-backward: 
1802.89 | backward-allreduce: 0.00 | optimizer: 56.04 | batch generator: 0.80 +WARNING: Deleting old checkpoints: + checkpoints-fcm/global_step60000 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 160000 | lm_loss value: 2.854709E+00 | lm_loss_ppl value: 1.736939E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.214 | iteration 160100/ 320000 | elapsed time per iteration (ms): 2575.0 | learning rate: 1.549E-04 | approx flops per GPU: 38.6TFLOPS | lm_loss: 2.980813E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 568.20 | backward: 1808.66 | backward-backward: 1808.64 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.86 + samples/sec: 6.590 | iteration 160200/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 1.547E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.962712E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.77 | backward: 1805.03 | backward-backward: 1805.01 | backward-allreduce: 0.00 | optimizer: 55.84 | batch generator: 0.81 + samples/sec: 6.598 | iteration 160300/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 1.546E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.969044E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.36 | backward: 1803.13 | backward-backward: 1803.11 | backward-allreduce: 0.00 | optimizer: 55.28 | batch generator: 0.80 + samples/sec: 6.591 | iteration 160400/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 1.544E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.955201E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 
566.95 | backward: 1804.53 | backward-backward: 1804.51 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.79 + samples/sec: 6.591 | iteration 160500/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 1.543E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.957273E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.80 | backward: 1805.46 | backward-backward: 1805.44 | backward-allreduce: 0.00 | optimizer: 54.78 | batch generator: 0.76 + samples/sec: 6.596 | iteration 160600/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 1.541E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.972009E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.26 | backward: 1803.37 | backward-backward: 1803.34 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.87 + samples/sec: 6.593 | iteration 160700/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 1.540E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.977228E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.51 | backward: 1804.48 | backward-backward: 1804.46 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.79 + samples/sec: 6.591 | iteration 160800/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 1.538E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.965793E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.85 | backward: 1805.11 | backward-backward: 1805.08 | backward-allreduce: 0.00 | optimizer: 55.07 | batch generator: 0.80 + samples/sec: 6.598 | iteration 160900/ 320000 | elapsed time per iteration (ms): 2424.9 | learning rate: 1.537E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.976277E+00 | loss scale: 32768.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.25 | backward: 1802.89 | backward-backward: 1802.86 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.78 + samples/sec: 6.595 | iteration 161000/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 1.535E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.974176E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.37 | backward: 1803.72 | backward-backward: 1803.69 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.79 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 161000 | lm_loss value: 2.988014E+00 | lm_loss_ppl value: 1.984622E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.436 | iteration 161100/ 320000 | elapsed time per iteration (ms): 2486.2 | learning rate: 1.534E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.970132E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.75 | backward: 1805.65 | backward-backward: 1805.62 | backward-allreduce: 0.00 | optimizer: 56.56 | batch generator: 0.86 + samples/sec: 6.599 | iteration 161200/ 320000 | elapsed time per iteration (ms): 2424.6 | learning rate: 1.532E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.938765E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.05 | backward: 1802.76 | backward-backward: 1802.73 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.79 + samples/sec: 6.594 | iteration 161300/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 1.531E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.969193E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of 
nan iterations: 0 | +time (ms) | forward: 566.57 | backward: 1803.74 | backward-backward: 1803.72 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.78 + samples/sec: 6.593 | iteration 161400/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 1.529E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.965111E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.54 | backward: 1804.45 | backward-backward: 1804.42 | backward-allreduce: 0.00 | optimizer: 55.40 | batch generator: 0.77 + samples/sec: 6.598 | iteration 161500/ 320000 | elapsed time per iteration (ms): 2424.9 | learning rate: 1.528E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.982143E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.93 | backward: 1802.93 | backward-backward: 1802.91 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.83 + samples/sec: 6.592 | iteration 161600/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 1.527E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.963119E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.45 | backward: 1804.72 | backward-backward: 1804.70 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.79 + samples/sec: 6.592 | iteration 161700/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 1.525E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.976133E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.68 | backward: 1804.55 | backward-backward: 1804.52 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.82 + samples/sec: 6.598 | iteration 161800/ 320000 | elapsed time per iteration (ms): 2425.0 | learning rate: 1.524E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.967398E+00 | loss 
scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 565.90 | backward: 1803.57 | backward-backward: 1803.54 | backward-allreduce: 0.00 | optimizer: 55.18 | batch generator: 0.81 + samples/sec: 6.592 | iteration 161900/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 1.522E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.977384E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.58 | backward: 1804.42 | backward-backward: 1804.39 | backward-allreduce: 0.00 | optimizer: 55.83 | batch generator: 0.80 + samples/sec: 6.590 | iteration 162000/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 1.521E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.955462E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.69 | backward: 1804.89 | backward-backward: 1804.87 | backward-allreduce: 0.00 | optimizer: 55.82 | batch generator: 0.80 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 162000 | lm_loss value: 2.924352E+00 | lm_loss_ppl value: 1.862216E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.449 | iteration 162100/ 320000 | elapsed time per iteration (ms): 2481.0 | learning rate: 1.519E-04 | approx flops per GPU: 40.1TFLOPS | lm_loss: 2.973094E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 565.95 | backward: 1802.88 | backward-backward: 1802.86 | backward-allreduce: 0.00 | optimizer: 54.95 | batch generator: 0.85 + samples/sec: 6.592 | iteration 162200/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 1.518E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.937202E+00 | loss scale: 16384.0 | number of 
skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.59 | backward: 1804.60 | backward-backward: 1804.57 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.79 + samples/sec: 6.595 | iteration 162300/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 1.516E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.954445E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.61 | backward: 1803.61 | backward-backward: 1803.58 | backward-allreduce: 0.00 | optimizer: 55.32 | batch generator: 0.78 + samples/sec: 6.600 | iteration 162400/ 320000 | elapsed time per iteration (ms): 2424.2 | learning rate: 1.515E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.966685E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.89 | backward: 1802.41 | backward-backward: 1802.38 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.79 + samples/sec: 6.590 | iteration 162500/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 1.513E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.981096E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.98 | backward: 1804.65 | backward-backward: 1804.62 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.79 + samples/sec: 6.595 | iteration 162600/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 1.512E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.982485E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.66 | backward: 1803.22 | backward-backward: 1803.20 | backward-allreduce: 0.00 | optimizer: 55.92 | batch generator: 0.85 + samples/sec: 6.600 | iteration 162700/ 320000 | elapsed time per iteration (ms): 2424.4 | learning rate: 1.510E-04 | approx flops per GPU: 
41.0TFLOPS | lm_loss: 2.962750E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.93 | backward: 1802.47 | backward-backward: 1802.45 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.80 + samples/sec: 6.592 | iteration 162800/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 1.509E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.953845E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.61 | backward: 1804.64 | backward-backward: 1804.61 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.81 + samples/sec: 6.594 | iteration 162900/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 1.507E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.955799E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.65 | backward: 1803.74 | backward-backward: 1803.71 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.80 + samples/sec: 6.599 | iteration 163000/ 320000 | elapsed time per iteration (ms): 2424.4 | learning rate: 1.506E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.960954E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.86 | backward: 1802.42 | backward-backward: 1802.39 | backward-allreduce: 0.00 | optimizer: 55.78 | batch generator: 0.79 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 163000 | lm_loss value: 2.926718E+00 | lm_loss_ppl value: 1.866626E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.441 | iteration 163100/ 320000 | elapsed time per iteration (ms): 2484.2 | learning rate: 1.504E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 
2.969215E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.64 | backward: 1804.63 | backward-backward: 1804.60 | backward-allreduce: 0.00 | optimizer: 55.76 | batch generator: 0.91 + samples/sec: 6.596 | iteration 163200/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 1.503E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.965139E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.88 | backward: 1802.85 | backward-backward: 1802.82 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.89 + samples/sec: 6.594 | iteration 163300/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 1.502E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.956763E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.99 | backward: 1803.66 | backward-backward: 1803.63 | backward-allreduce: 0.00 | optimizer: 56.31 | batch generator: 0.79 + samples/sec: 6.589 | iteration 163400/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 1.500E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.966851E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.47 | backward: 1805.61 | backward-backward: 1805.59 | backward-allreduce: 0.00 | optimizer: 55.93 | batch generator: 0.79 + samples/sec: 6.595 | iteration 163500/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 1.499E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.982045E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.68 | backward: 1804.02 | backward-backward: 1803.99 | backward-allreduce: 0.00 | optimizer: 55.15 | batch generator: 0.81 + samples/sec: 6.598 | iteration 163600/ 320000 | elapsed time per iteration (ms): 2425.0 | learning 
rate: 1.497E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.960123E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.99 | backward: 1803.06 | backward-backward: 1803.04 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.79 + samples/sec: 6.590 | iteration 163700/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 1.496E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.965246E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.41 | backward: 1805.28 | backward-backward: 1805.26 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.75 + samples/sec: 6.595 | iteration 163800/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 1.494E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.959731E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.45 | backward: 1803.93 | backward-backward: 1803.91 | backward-allreduce: 0.00 | optimizer: 55.35 | batch generator: 0.77 + samples/sec: 6.597 | iteration 163900/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 1.493E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.940274E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.06 | backward: 1803.10 | backward-backward: 1803.08 | backward-allreduce: 0.00 | optimizer: 55.91 | batch generator: 0.80 + samples/sec: 6.592 | iteration 164000/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 1.491E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.956307E+00 | loss scale: 16384.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.53 | backward: 1805.05 | backward-backward: 1805.03 | backward-allreduce: 0.00 | optimizer: 55.12 | batch generator: 0.80 
+----------------------------------------------------------------------------------------------------------- + validation results at iteration 164000 | lm_loss value: 2.915514E+00 | lm_loss_ppl value: 1.845829E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.445 | iteration 164100/ 320000 | elapsed time per iteration (ms): 2482.5 | learning rate: 1.490E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.979049E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.57 | backward: 1803.12 | backward-backward: 1803.09 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.86 + samples/sec: 6.598 | iteration 164200/ 320000 | elapsed time per iteration (ms): 2424.9 | learning rate: 1.488E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.976490E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.18 | backward: 1802.89 | backward-backward: 1802.86 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.79 + samples/sec: 6.593 | iteration 164300/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 1.487E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.968833E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.54 | backward: 1804.50 | backward-backward: 1804.48 | backward-allreduce: 0.00 | optimizer: 55.36 | batch generator: 0.78 + samples/sec: 6.592 | iteration 164400/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 1.485E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.965778E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.48 | backward: 1803.88 | backward-backward: 1803.86 | backward-allreduce: 0.00 | optimizer: 56.29 | batch generator: 0.81 + samples/sec: 6.598 | 
iteration 164500/ 320000 | elapsed time per iteration (ms): 2424.9 | learning rate: 1.484E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.959665E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.07 | backward: 1802.91 | backward-backward: 1802.89 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.77 + samples/sec: 6.589 | iteration 164600/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 1.482E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.952462E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.69 | backward: 1805.13 | backward-backward: 1805.11 | backward-allreduce: 0.00 | optimizer: 55.89 | batch generator: 0.81 + samples/sec: 6.594 | iteration 164700/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 1.481E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.985143E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.69 | backward: 1803.67 | backward-backward: 1803.65 | backward-allreduce: 0.00 | optimizer: 55.79 | batch generator: 0.89 + samples/sec: 6.599 | iteration 164800/ 320000 | elapsed time per iteration (ms): 2424.7 | learning rate: 1.479E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.962080E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.06 | backward: 1802.84 | backward-backward: 1802.81 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.77 + samples/sec: 6.593 | iteration 164900/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 1.478E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.953167E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.51 | backward: 1804.35 | backward-backward: 1804.32 | backward-allreduce: 0.00 | 
optimizer: 55.64 | batch generator: 0.83 + samples/sec: 6.598 | iteration 165000/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 1.477E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.950101E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.44 | backward: 1802.85 | backward-backward: 1802.83 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.77 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 165000 | lm_loss value: 2.951279E+00 | lm_loss_ppl value: 1.913040E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.447 | iteration 165100/ 320000 | elapsed time per iteration (ms): 2481.9 | learning rate: 1.475E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.962470E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.19 | backward: 1802.98 | backward-backward: 1802.96 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.86 + samples/sec: 6.589 | iteration 165200/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 1.474E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.965554E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.68 | backward: 1805.24 | backward-backward: 1805.22 | backward-allreduce: 0.00 | optimizer: 55.83 | batch generator: 0.82 + samples/sec: 6.595 | iteration 165300/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 1.472E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.962333E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.50 | backward: 1803.42 | backward-backward: 1803.40 | backward-allreduce: 0.00 | optimizer: 55.80 | batch 
generator: 0.85 + samples/sec: 6.598 | iteration 165400/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 1.471E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.934336E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.93 | backward: 1803.16 | backward-backward: 1803.14 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.76 + samples/sec: 6.587 | iteration 165500/ 320000 | elapsed time per iteration (ms): 2429.0 | learning rate: 1.469E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.932898E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.90 | backward: 1805.63 | backward-backward: 1805.61 | backward-allreduce: 0.00 | optimizer: 56.06 | batch generator: 0.78 + samples/sec: 6.594 | iteration 165600/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 1.468E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.950616E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.63 | backward: 1803.77 | backward-backward: 1803.74 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.80 + samples/sec: 6.597 | iteration 165700/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 1.466E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.966582E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.16 | backward: 1802.78 | backward-backward: 1802.75 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.79 + samples/sec: 6.591 | iteration 165800/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 1.465E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.932215E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.57 | backward: 1804.80 | backward-backward: 
1804.78 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.77 + samples/sec: 6.595 | iteration 165900/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 1.463E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.949272E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.41 | backward: 1803.63 | backward-backward: 1803.60 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.79 + samples/sec: 6.601 | iteration 166000/ 320000 | elapsed time per iteration (ms): 2424.0 | learning rate: 1.462E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.961177E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.21 | backward: 1802.67 | backward-backward: 1802.64 | backward-allreduce: 0.00 | optimizer: 54.78 | batch generator: 0.78 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 166000 | lm_loss value: 2.964847E+00 | lm_loss_ppl value: 1.939173E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.441 | iteration 166100/ 320000 | elapsed time per iteration (ms): 2484.0 | learning rate: 1.460E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.973555E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.63 | backward: 1804.63 | backward-backward: 1804.61 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.86 + samples/sec: 6.594 | iteration 166200/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 1.459E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.958238E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.68 | backward: 1803.74 | backward-backward: 1803.71 | 
backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.82 + samples/sec: 6.599 | iteration 166300/ 320000 | elapsed time per iteration (ms): 2424.7 | learning rate: 1.457E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.969126E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.24 | backward: 1802.60 | backward-backward: 1802.57 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.81 + samples/sec: 6.590 | iteration 166400/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 1.456E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.955402E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.57 | backward: 1805.01 | backward-backward: 1804.98 | backward-allreduce: 0.00 | optimizer: 56.04 | batch generator: 0.80 + samples/sec: 6.595 | iteration 166500/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 1.454E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.969364E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.48 | backward: 1803.87 | backward-backward: 1803.85 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.78 + samples/sec: 6.594 | iteration 166600/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 1.453E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.959689E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.94 | backward: 1803.71 | backward-backward: 1803.69 | backward-allreduce: 0.00 | optimizer: 56.27 | batch generator: 0.81 + samples/sec: 6.592 | iteration 166700/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 1.452E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.962515E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | 
forward: 566.44 | backward: 1804.99 | backward-backward: 1804.97 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.75 + samples/sec: 6.592 | iteration 166800/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 1.450E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.950001E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.64 | backward: 1804.74 | backward-backward: 1804.71 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.91 + samples/sec: 6.600 | iteration 166900/ 320000 | elapsed time per iteration (ms): 2424.4 | learning rate: 1.449E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.948041E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.84 | backward: 1802.64 | backward-backward: 1802.61 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.76 + samples/sec: 6.592 | iteration 167000/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 1.447E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.935975E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.43 | backward: 1804.83 | backward-backward: 1804.81 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.84 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 167000 | lm_loss value: 2.962843E+00 | lm_loss_ppl value: 1.935292E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.444 | iteration 167100/ 320000 | elapsed time per iteration (ms): 2482.8 | learning rate: 1.446E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.947067E+00 | loss scale: 32768.0 | number of skipped iterations: 2 | number of nan iterations: 0 | +time (ms) | forward: 566.68 | 
backward: 1804.35 | backward-backward: 1804.33 | backward-allreduce: 0.00 | optimizer: 54.49 | batch generator: 0.84 + samples/sec: 6.599 | iteration 167200/ 320000 | elapsed time per iteration (ms): 2424.7 | learning rate: 1.444E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.950988E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.94 | backward: 1802.99 | backward-backward: 1802.97 | backward-allreduce: 0.00 | optimizer: 55.37 | batch generator: 0.80 + samples/sec: 6.592 | iteration 167300/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 1.443E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.975178E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.68 | backward: 1804.51 | backward-backward: 1804.49 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.86 + samples/sec: 6.593 | iteration 167400/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 1.441E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.957060E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.59 | backward: 1804.16 | backward-backward: 1804.14 | backward-allreduce: 0.00 | optimizer: 55.80 | batch generator: 0.83 + samples/sec: 6.599 | iteration 167500/ 320000 | elapsed time per iteration (ms): 2424.8 | learning rate: 1.440E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.928706E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.78 | backward: 1802.90 | backward-backward: 1802.87 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.77 + samples/sec: 6.588 | iteration 167600/ 320000 | elapsed time per iteration (ms): 2428.7 | learning rate: 1.438E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.957769E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | +time (ms) | forward: 566.72 | backward: 1805.15 | backward-backward: 1805.13 | backward-allreduce: 0.00 | optimizer: 56.48 | batch generator: 0.78 + samples/sec: 6.592 | iteration 167700/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 1.437E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.945198E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.71 | backward: 1804.74 | backward-backward: 1804.72 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.82 + samples/sec: 6.598 | iteration 167800/ 320000 | elapsed time per iteration (ms): 2424.9 | learning rate: 1.435E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.947784E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.95 | backward: 1803.11 | backward-backward: 1803.08 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.79 + samples/sec: 6.594 | iteration 167900/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 1.434E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.947057E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.49 | backward: 1804.09 | backward-backward: 1804.07 | backward-allreduce: 0.00 | optimizer: 55.39 | batch generator: 0.77 + samples/sec: 6.590 | iteration 168000/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 1.432E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.938679E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.04 | backward: 1804.99 | backward-backward: 1804.96 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.77 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 168000 | lm_loss value: 
2.920211E+00 | lm_loss_ppl value: 1.854520E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.445 | iteration 168100/ 320000 | elapsed time per iteration (ms): 2482.5 | learning rate: 1.431E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.937378E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.01 | backward: 1804.03 | backward-backward: 1804.01 | backward-allreduce: 0.00 | optimizer: 55.21 | batch generator: 0.85 + samples/sec: 6.595 | iteration 168200/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 1.430E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.937378E+00 | loss scale: 32768.0 | number of skipped iterations: 2 | number of nan iterations: 0 | +time (ms) | forward: 566.69 | backward: 1804.25 | backward-backward: 1804.22 | backward-allreduce: 0.00 | optimizer: 54.75 | batch generator: 0.86 + samples/sec: 6.592 | iteration 168300/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 1.428E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.953859E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.68 | backward: 1804.64 | backward-backward: 1804.62 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.79 + samples/sec: 6.600 | iteration 168400/ 320000 | elapsed time per iteration (ms): 2424.3 | learning rate: 1.427E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.943263E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.89 | backward: 1802.81 | backward-backward: 1802.78 | backward-allreduce: 0.00 | optimizer: 55.27 | batch generator: 0.76 + samples/sec: 6.594 | iteration 168500/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 1.425E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.938681E+00 | loss scale: 
32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.58 | backward: 1804.20 | backward-backward: 1804.18 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.79 + samples/sec: 6.591 | iteration 168600/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 1.424E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.947435E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.75 | backward: 1804.85 | backward-backward: 1804.83 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.79 + samples/sec: 6.594 | iteration 168700/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 1.422E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.951412E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.93 | backward: 1804.10 | backward-backward: 1804.07 | backward-allreduce: 0.00 | optimizer: 56.22 | batch generator: 0.77 + samples/sec: 6.594 | iteration 168800/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 1.421E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.957281E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.50 | backward: 1804.19 | backward-backward: 1804.17 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.81 + samples/sec: 6.591 | iteration 168900/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 1.419E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.958852E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.86 | backward: 1804.59 | backward-backward: 1804.56 | backward-allreduce: 0.00 | optimizer: 55.79 | batch generator: 0.82 + samples/sec: 6.596 | iteration 169000/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 1.418E-04 | approx 
flops per GPU: 41.0TFLOPS | lm_loss: 2.951962E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.25 | backward: 1803.28 | backward-backward: 1803.26 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.89 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 169000 | lm_loss value: 2.960931E+00 | lm_loss_ppl value: 1.931594E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.443 | iteration 169100/ 320000 | elapsed time per iteration (ms): 2483.5 | learning rate: 1.416E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.929684E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.76 | backward: 1803.96 | backward-backward: 1803.94 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.85 + samples/sec: 6.587 | iteration 169200/ 320000 | elapsed time per iteration (ms): 2429.1 | learning rate: 1.415E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.934854E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.40 | backward: 1805.61 | backward-backward: 1805.59 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.79 + samples/sec: 6.597 | iteration 169300/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 1.413E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.937015E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.39 | backward: 1803.15 | backward-backward: 1803.12 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.88 + samples/sec: 6.595 | iteration 169400/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 1.412E-04 | approx flops per GPU: 41.0TFLOPS 
| lm_loss: 2.946413E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.54 | backward: 1803.97 | backward-backward: 1803.95 | backward-allreduce: 0.00 | optimizer: 55.05 | batch generator: 0.75 + samples/sec: 6.590 | iteration 169500/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 1.410E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.926391E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.80 | backward: 1805.54 | backward-backward: 1805.52 | backward-allreduce: 0.00 | optimizer: 55.12 | batch generator: 0.79 + samples/sec: 6.595 | iteration 169600/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 1.409E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.950253E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.50 | backward: 1803.61 | backward-backward: 1803.58 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.79 + samples/sec: 6.598 | iteration 169700/ 320000 | elapsed time per iteration (ms): 2424.8 | learning rate: 1.407E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.937620E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.25 | backward: 1802.65 | backward-backward: 1802.63 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.78 + samples/sec: 6.589 | iteration 169800/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 1.406E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.943432E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.71 | backward: 1805.13 | backward-backward: 1805.10 | backward-allreduce: 0.00 | optimizer: 56.09 | batch generator: 0.80 + samples/sec: 6.596 | iteration 169900/ 320000 | elapsed time per iteration (ms): 2425.8 | 
learning rate: 1.405E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.937280E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.33 | backward: 1803.30 | backward-backward: 1803.28 | backward-allreduce: 0.00 | optimizer: 55.77 | batch generator: 0.82 + samples/sec: 6.597 | iteration 170000/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 1.403E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.954183E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.06 | backward: 1803.24 | backward-backward: 1803.22 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.80 +WARNING: Deleting old checkpoints: + checkpoints-fcm/global_step70000 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 170000 | lm_loss value: 2.915996E+00 | lm_loss_ppl value: 1.846720E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.232 | iteration 170100/ 320000 | elapsed time per iteration (ms): 2567.5 | learning rate: 1.402E-04 | approx flops per GPU: 38.7TFLOPS | lm_loss: 2.934737E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.80 | backward: 1805.79 | backward-backward: 1805.76 | backward-allreduce: 0.00 | optimizer: 55.79 | batch generator: 0.81 + samples/sec: 6.594 | iteration 170200/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 1.400E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.943683E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.64 | backward: 1803.85 | backward-backward: 1803.83 | backward-allreduce: 0.00 | optimizer: 55.41 | batch generator: 0.90 + samples/sec: 6.600 | iteration 170300/ 320000 
| elapsed time per iteration (ms): 2424.3 | learning rate: 1.399E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.941815E+00 | loss scale: 16384.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.00 | backward: 1802.93 | backward-backward: 1802.90 | backward-allreduce: 0.00 | optimizer: 54.95 | batch generator: 0.77 + samples/sec: 6.592 | iteration 170400/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 1.397E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.950764E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.44 | backward: 1804.91 | backward-backward: 1804.88 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.78 + samples/sec: 6.594 | iteration 170500/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 1.396E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.950182E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.73 | backward: 1803.50 | backward-backward: 1803.48 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.77 + samples/sec: 6.601 | iteration 170600/ 320000 | elapsed time per iteration (ms): 2424.0 | learning rate: 1.394E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.946375E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.07 | backward: 1801.87 | backward-backward: 1801.85 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.81 + samples/sec: 6.595 | iteration 170700/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 1.393E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.926508E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.56 | backward: 1803.60 | backward-backward: 1803.58 | backward-allreduce: 0.00 | optimizer: 55.55 | batch 
generator: 0.76 + samples/sec: 6.592 | iteration 170800/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 1.391E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.930347E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.11 | backward: 1804.01 | backward-backward: 1803.98 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.83 + samples/sec: 6.593 | iteration 170900/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 1.390E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.922127E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.34 | backward: 1803.66 | backward-backward: 1803.64 | backward-allreduce: 0.00 | optimizer: 56.52 | batch generator: 0.80 + samples/sec: 6.598 | iteration 171000/ 320000 | elapsed time per iteration (ms): 2424.9 | learning rate: 1.388E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.948191E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.01 | backward: 1802.78 | backward-backward: 1802.75 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.78 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 171000 | lm_loss value: 2.914586E+00 | lm_loss_ppl value: 1.844117E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.440 | iteration 171100/ 320000 | elapsed time per iteration (ms): 2484.5 | learning rate: 1.387E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.942807E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.65 | backward: 1805.01 | backward-backward: 1804.98 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.89 + 
samples/sec: 6.595 | iteration 171200/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 1.385E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.953337E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.76 | backward: 1803.62 | backward-backward: 1803.59 | backward-allreduce: 0.00 | optimizer: 55.36 | batch generator: 0.76 + samples/sec: 6.599 | iteration 171300/ 320000 | elapsed time per iteration (ms): 2424.6 | learning rate: 1.384E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.945087E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.80 | backward: 1802.97 | backward-backward: 1802.94 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.74 + samples/sec: 6.593 | iteration 171400/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 1.383E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.918628E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.64 | backward: 1804.32 | backward-backward: 1804.29 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.87 + samples/sec: 6.591 | iteration 171500/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 1.381E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.935267E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.67 | backward: 1804.69 | backward-backward: 1804.66 | backward-allreduce: 0.00 | optimizer: 55.78 | batch generator: 0.77 + samples/sec: 6.597 | iteration 171600/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 1.380E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.943466E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.33 | backward: 1803.31 | backward-backward: 1803.28 | 
backward-allreduce: 0.00 | optimizer: 55.40 | batch generator: 0.79 + samples/sec: 6.596 | iteration 171700/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 1.378E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.937442E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.32 | backward: 1803.40 | backward-backward: 1803.38 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.86 + samples/sec: 6.589 | iteration 171800/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 1.377E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.949361E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.61 | backward: 1805.51 | backward-backward: 1805.49 | backward-allreduce: 0.00 | optimizer: 55.79 | batch generator: 0.77 + samples/sec: 6.593 | iteration 171900/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 1.375E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.941507E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.88 | backward: 1804.12 | backward-backward: 1804.10 | backward-allreduce: 0.00 | optimizer: 55.37 | batch generator: 0.77 + samples/sec: 6.597 | iteration 172000/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 1.374E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.939223E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.04 | backward: 1803.00 | backward-backward: 1802.98 | backward-allreduce: 0.00 | optimizer: 55.98 | batch generator: 0.82 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 172000 | lm_loss value: 2.860545E+00 | lm_loss_ppl value: 1.747105E+01 | 
+----------------------------------------------------------------------------------------------------------- + samples/sec: 6.440 | iteration 172100/ 320000 | elapsed time per iteration (ms): 2484.4 | learning rate: 1.372E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.918576E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.46 | backward: 1804.94 | backward-backward: 1804.92 | backward-allreduce: 0.00 | optimizer: 55.84 | batch generator: 0.85 + samples/sec: 6.592 | iteration 172200/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 1.371E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.925965E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.59 | backward: 1804.77 | backward-backward: 1804.75 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.78 + samples/sec: 6.596 | iteration 172300/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 1.369E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.933926E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.61 | backward: 1803.17 | backward-backward: 1803.15 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.81 + samples/sec: 6.594 | iteration 172400/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 1.368E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.929806E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.62 | backward: 1804.15 | backward-backward: 1804.12 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.75 + samples/sec: 6.588 | iteration 172500/ 320000 | elapsed time per iteration (ms): 2428.6 | learning rate: 1.366E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.949124E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number 
of nan iterations: 0 | +time (ms) | forward: 566.82 | backward: 1805.89 | backward-backward: 1805.87 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.82 + samples/sec: 6.591 | iteration 172600/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 1.365E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.965960E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.14 | backward: 1804.60 | backward-backward: 1804.57 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.79 + samples/sec: 6.598 | iteration 172700/ 320000 | elapsed time per iteration (ms): 2425.0 | learning rate: 1.363E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.946551E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.35 | backward: 1803.27 | backward-backward: 1803.24 | backward-allreduce: 0.00 | optimizer: 54.99 | batch generator: 0.80 + samples/sec: 6.596 | iteration 172800/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 1.362E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.918060E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.40 | backward: 1803.16 | backward-backward: 1803.14 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.79 + samples/sec: 6.590 | iteration 172900/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 1.361E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.921382E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.49 | backward: 1805.66 | backward-backward: 1805.63 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.79 + samples/sec: 6.591 | iteration 173000/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 1.359E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.938403E+00 | 
loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.60 | backward: 1804.80 | backward-backward: 1804.78 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.72 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 173000 | lm_loss value: 2.951116E+00 | lm_loss_ppl value: 1.912729E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.440 | iteration 173100/ 320000 | elapsed time per iteration (ms): 2484.4 | learning rate: 1.358E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.915018E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.92 | backward: 1804.23 | backward-backward: 1804.20 | backward-allreduce: 0.00 | optimizer: 56.07 | batch generator: 0.87 + samples/sec: 6.596 | iteration 173200/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 1.356E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.924438E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.27 | backward: 1803.37 | backward-backward: 1803.34 | backward-allreduce: 0.00 | optimizer: 55.79 | batch generator: 0.77 + samples/sec: 6.588 | iteration 173300/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 1.355E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.933195E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.85 | backward: 1805.13 | backward-backward: 1805.10 | backward-allreduce: 0.00 | optimizer: 56.16 | batch generator: 0.74 + samples/sec: 6.591 | iteration 173400/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 1.353E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.941678E+00 | loss scale: 32768.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.96 | backward: 1804.70 | backward-backward: 1804.67 | backward-allreduce: 0.00 | optimizer: 55.36 | batch generator: 0.78 + samples/sec: 6.596 | iteration 173500/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 1.352E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.948639E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.71 | backward: 1803.35 | backward-backward: 1803.32 | backward-allreduce: 0.00 | optimizer: 55.42 | batch generator: 0.78 + samples/sec: 6.595 | iteration 173600/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 1.350E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.922421E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.22 | backward: 1803.90 | backward-backward: 1803.88 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.80 + samples/sec: 6.589 | iteration 173700/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 1.349E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.943115E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.54 | backward: 1805.53 | backward-backward: 1805.50 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 0.80 + samples/sec: 6.591 | iteration 173800/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 1.347E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.939687E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.89 | backward: 1805.20 | backward-backward: 1805.17 | backward-allreduce: 0.00 | optimizer: 55.01 | batch generator: 0.78 + samples/sec: 6.592 | iteration 173900/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 1.346E-04 | approx flops per 
GPU: 41.0TFLOPS | lm_loss: 2.930507E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.49 | backward: 1804.59 | backward-backward: 1804.57 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.76 + samples/sec: 6.597 | iteration 174000/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 1.344E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.943428E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.94 | backward: 1803.70 | backward-backward: 1803.67 | backward-allreduce: 0.00 | optimizer: 55.19 | batch generator: 0.80 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 174000 | lm_loss value: 2.865651E+00 | lm_loss_ppl value: 1.756048E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.439 | iteration 174100/ 320000 | elapsed time per iteration (ms): 2484.9 | learning rate: 1.343E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.914919E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.54 | backward: 1805.75 | backward-backward: 1805.73 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.84 + samples/sec: 6.586 | iteration 174200/ 320000 | elapsed time per iteration (ms): 2429.3 | learning rate: 1.342E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.949487E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.70 | backward: 1805.93 | backward-backward: 1805.91 | backward-allreduce: 0.00 | optimizer: 56.26 | batch generator: 0.78 + samples/sec: 6.598 | iteration 174300/ 320000 | elapsed time per iteration (ms): 2425.0 | learning rate: 1.340E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 
2.932097E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.26 | backward: 1803.54 | backward-backward: 1803.52 | backward-allreduce: 0.00 | optimizer: 54.88 | batch generator: 0.79 + samples/sec: 6.595 | iteration 174400/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 1.339E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.941950E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.44 | backward: 1803.74 | backward-backward: 1803.71 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.80 + samples/sec: 6.589 | iteration 174500/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 1.337E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.933343E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.68 | backward: 1805.54 | backward-backward: 1805.52 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.79 + samples/sec: 6.592 | iteration 174600/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 1.336E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.914772E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.79 | backward: 1804.62 | backward-backward: 1804.60 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.80 + samples/sec: 6.595 | iteration 174700/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 1.334E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.921107E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.55 | backward: 1803.90 | backward-backward: 1803.88 | backward-allreduce: 0.00 | optimizer: 55.40 | batch generator: 0.78 + samples/sec: 6.597 | iteration 174800/ 320000 | elapsed time per iteration (ms): 2425.3 | learning 
rate: 1.333E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.930250E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.93 | backward: 1803.42 | backward-backward: 1803.40 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.81 + samples/sec: 6.592 | iteration 174900/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 1.331E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.911957E+00 | loss scale: 16384.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.75 | backward: 1804.53 | backward-backward: 1804.51 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.88 + samples/sec: 6.592 | iteration 175000/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 1.330E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.919750E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.42 | backward: 1804.04 | backward-backward: 1804.01 | backward-allreduce: 0.00 | optimizer: 55.40 | batch generator: 0.79 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 175000 | lm_loss value: 2.942545E+00 | lm_loss_ppl value: 1.896405E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.443 | iteration 175100/ 320000 | elapsed time per iteration (ms): 2483.1 | learning rate: 1.328E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.939513E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.70 | backward: 1803.61 | backward-backward: 1803.59 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.86 + samples/sec: 6.599 | iteration 175200/ 320000 | elapsed time per iteration (ms): 2424.5 | learning rate: 1.327E-04 | approx 
flops per GPU: 41.0TFLOPS | lm_loss: 2.917258E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.19 | backward: 1802.52 | backward-backward: 1802.49 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.84 + samples/sec: 6.589 | iteration 175300/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 1.325E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.922204E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.63 | backward: 1805.14 | backward-backward: 1805.12 | backward-allreduce: 0.00 | optimizer: 56.07 | batch generator: 0.78 + samples/sec: 6.594 | iteration 175400/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 1.324E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.926256E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.72 | backward: 1803.97 | backward-backward: 1803.94 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.76 + samples/sec: 6.596 | iteration 175500/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 1.323E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.926103E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.52 | backward: 1803.18 | backward-backward: 1803.16 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.78 + samples/sec: 6.598 | iteration 175600/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 1.321E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.934100E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.22 | backward: 1803.07 | backward-backward: 1803.05 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.83 + samples/sec: 6.592 | iteration 175700/ 320000 | elapsed time 
per iteration (ms): 2427.1 | learning rate: 1.320E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.912630E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.52 | backward: 1804.74 | backward-backward: 1804.72 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.77 + samples/sec: 6.591 | iteration 175800/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 1.318E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.932989E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.96 | backward: 1804.04 | backward-backward: 1804.01 | backward-allreduce: 0.00 | optimizer: 56.21 | batch generator: 0.82 + samples/sec: 6.599 | iteration 175900/ 320000 | elapsed time per iteration (ms): 2424.7 | learning rate: 1.317E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.928060E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.52 | backward: 1802.48 | backward-backward: 1802.46 | backward-allreduce: 0.00 | optimizer: 55.37 | batch generator: 0.94 + samples/sec: 6.593 | iteration 176000/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 1.315E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.909318E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.62 | backward: 1804.55 | backward-backward: 1804.53 | backward-allreduce: 0.00 | optimizer: 55.39 | batch generator: 0.79 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 176000 | lm_loss value: 2.928627E+00 | lm_loss_ppl value: 1.870193E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.441 | iteration 176100/ 320000 | elapsed time per iteration (ms): 2484.0 
| learning rate: 1.314E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.936281E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.58 | backward: 1804.59 | backward-backward: 1804.57 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.81 + samples/sec: 6.593 | iteration 176200/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 1.312E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.907831E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.74 | backward: 1804.02 | backward-backward: 1804.00 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.78 + samples/sec: 6.596 | iteration 176300/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 1.311E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.911938E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.14 | backward: 1803.42 | backward-backward: 1803.40 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.81 + samples/sec: 6.586 | iteration 176400/ 320000 | elapsed time per iteration (ms): 2429.3 | learning rate: 1.309E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.919104E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.97 | backward: 1805.81 | backward-backward: 1805.79 | backward-allreduce: 0.00 | optimizer: 56.11 | batch generator: 1.02 + samples/sec: 6.591 | iteration 176500/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 1.308E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.916021E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.84 | backward: 1804.61 | backward-backward: 1804.59 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.79 + samples/sec: 6.596 | 
iteration 176600/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 1.306E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.918903E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.27 | backward: 1803.48 | backward-backward: 1803.45 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.77 + samples/sec: 6.595 | iteration 176700/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 1.305E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.908832E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.20 | backward: 1804.20 | backward-backward: 1804.17 | backward-allreduce: 0.00 | optimizer: 55.41 | batch generator: 0.80 + samples/sec: 6.590 | iteration 176800/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 1.304E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.904807E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.93 | backward: 1805.21 | backward-backward: 1805.18 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.77 + samples/sec: 6.594 | iteration 176900/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 1.302E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.926237E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.67 | backward: 1804.38 | backward-backward: 1804.36 | backward-allreduce: 0.00 | optimizer: 54.90 | batch generator: 0.76 + samples/sec: 6.598 | iteration 177000/ 320000 | elapsed time per iteration (ms): 2425.0 | learning rate: 1.301E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.930586E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.02 | backward: 1803.48 | backward-backward: 1803.45 | backward-allreduce: 0.00 | 
optimizer: 55.09 | batch generator: 0.79 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 177000 | lm_loss value: 2.915743E+00 | lm_loss_ppl value: 1.846252E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.436 | iteration 177100/ 320000 | elapsed time per iteration (ms): 2485.8 | learning rate: 1.299E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.904016E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.30 | backward: 1805.40 | backward-backward: 1805.38 | backward-allreduce: 0.00 | optimizer: 55.90 | batch generator: 0.87 + samples/sec: 6.590 | iteration 177200/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 1.298E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.941609E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.61 | backward: 1804.43 | backward-backward: 1804.41 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.77 + samples/sec: 6.596 | iteration 177300/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 1.296E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.945775E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.53 | backward: 1803.19 | backward-backward: 1803.16 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.76 + samples/sec: 6.591 | iteration 177400/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 1.295E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.916497E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.36 | backward: 1804.74 | backward-backward: 1804.72 | backward-allreduce: 0.00 | optimizer: 56.13 | batch 
generator: 0.80 + samples/sec: 6.589 | iteration 177500/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 1.293E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.927205E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.18 | backward: 1805.40 | backward-backward: 1805.38 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.76 + samples/sec: 6.592 | iteration 177600/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 1.292E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.920486E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.26 | backward: 1804.13 | backward-backward: 1804.10 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.78 + samples/sec: 6.598 | iteration 177700/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 1.290E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.908813E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.12 | backward: 1802.86 | backward-backward: 1802.83 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.79 + samples/sec: 6.594 | iteration 177800/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 1.289E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.919378E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.74 | backward: 1803.78 | backward-backward: 1803.76 | backward-allreduce: 0.00 | optimizer: 55.39 | batch generator: 0.82 + samples/sec: 6.590 | iteration 177900/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 1.287E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.928415E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.10 | backward: 1804.58 | backward-backward: 
1804.56 | backward-allreduce: 0.00 | optimizer: 55.81 | batch generator: 0.77 + samples/sec: 6.591 | iteration 178000/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 1.286E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.938339E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.40 | backward: 1804.25 | backward-backward: 1804.22 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.78 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 178000 | lm_loss value: 2.928427E+00 | lm_loss_ppl value: 1.869819E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.447 | iteration 178100/ 320000 | elapsed time per iteration (ms): 2482.0 | learning rate: 1.285E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.906533E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.03 | backward: 1803.55 | backward-backward: 1803.53 | backward-allreduce: 0.00 | optimizer: 55.25 | batch generator: 0.84 + samples/sec: 6.588 | iteration 178200/ 320000 | elapsed time per iteration (ms): 2428.6 | learning rate: 1.283E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.915801E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.87 | backward: 1805.18 | backward-backward: 1805.16 | backward-allreduce: 0.00 | optimizer: 56.17 | batch generator: 0.80 + samples/sec: 6.592 | iteration 178300/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 1.282E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.898726E+00 | loss scale: 32768.0 | number of skipped iterations: 2 | number of nan iterations: 0 | +time (ms) | forward: 567.33 | backward: 1804.78 | backward-backward: 1804.76 | 
backward-allreduce: 0.00 | optimizer: 54.70 | batch generator: 0.80 + samples/sec: 6.598 | iteration 178400/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 1.280E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.898359E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.11 | backward: 1802.90 | backward-backward: 1802.87 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.83 + samples/sec: 6.589 | iteration 178500/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 1.279E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.954476E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.80 | backward: 1804.60 | backward-backward: 1804.58 | backward-allreduce: 0.00 | optimizer: 56.36 | batch generator: 0.79 + samples/sec: 6.590 | iteration 178600/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 1.277E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.919369E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.08 | backward: 1804.69 | backward-backward: 1804.67 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.79 + samples/sec: 6.597 | iteration 178700/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 1.276E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.940635E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.13 | backward: 1802.96 | backward-backward: 1802.94 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.76 + samples/sec: 6.592 | iteration 178800/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 1.274E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.905472E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | 
forward: 566.91 | backward: 1804.44 | backward-backward: 1804.41 | backward-allreduce: 0.00 | optimizer: 55.37 | batch generator: 0.79 + samples/sec: 6.591 | iteration 178900/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 1.273E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.923934E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.81 | backward: 1804.62 | backward-backward: 1804.59 | backward-allreduce: 0.00 | optimizer: 55.81 | batch generator: 0.81 + samples/sec: 6.598 | iteration 179000/ 320000 | elapsed time per iteration (ms): 2424.9 | learning rate: 1.272E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.912248E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.10 | backward: 1803.16 | backward-backward: 1803.14 | backward-allreduce: 0.00 | optimizer: 55.31 | batch generator: 0.78 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 179000 | lm_loss value: 2.899158E+00 | lm_loss_ppl value: 1.815884E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.442 | iteration 179100/ 320000 | elapsed time per iteration (ms): 2483.6 | learning rate: 1.270E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.929440E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.48 | backward: 1804.18 | backward-backward: 1804.15 | backward-allreduce: 0.00 | optimizer: 55.76 | batch generator: 0.85 + samples/sec: 6.590 | iteration 179200/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 1.269E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.914893E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.83 | 
backward: 1804.99 | backward-backward: 1804.97 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 0.83 + samples/sec: 6.593 | iteration 179300/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 1.267E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.912498E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.87 | backward: 1803.97 | backward-backward: 1803.94 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.83 + samples/sec: 6.598 | iteration 179400/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 1.266E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.916761E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.96 | backward: 1803.34 | backward-backward: 1803.31 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.77 + samples/sec: 6.592 | iteration 179500/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 1.264E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.915110E+00 | loss scale: 32768.0 | number of skipped iterations: 2 | number of nan iterations: 0 | +time (ms) | forward: 566.74 | backward: 1805.42 | backward-backward: 1805.40 | backward-allreduce: 0.00 | optimizer: 54.81 | batch generator: 0.75 + samples/sec: 6.589 | iteration 179600/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 1.263E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.899152E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.84 | backward: 1804.82 | backward-backward: 1804.80 | backward-allreduce: 0.00 | optimizer: 56.22 | batch generator: 0.77 + samples/sec: 6.598 | iteration 179700/ 320000 | elapsed time per iteration (ms): 2424.9 | learning rate: 1.261E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.899507E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | +time (ms) | forward: 566.18 | backward: 1802.99 | backward-backward: 1802.97 | backward-allreduce: 0.00 | optimizer: 55.36 | batch generator: 0.77 + samples/sec: 6.590 | iteration 179800/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 1.260E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.907039E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.74 | backward: 1805.03 | backward-backward: 1805.01 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.76 + samples/sec: 6.590 | iteration 179900/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 1.258E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.923088E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.10 | backward: 1804.80 | backward-backward: 1804.77 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.76 + samples/sec: 6.596 | iteration 180000/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 1.257E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.897910E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.61 | backward: 1803.48 | backward-backward: 1803.46 | backward-allreduce: 0.00 | optimizer: 55.29 | batch generator: 0.75 +WARNING: Deleting old checkpoints: + checkpoints-fcm/global_step80000 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 180000 | lm_loss value: 2.942339E+00 | lm_loss_ppl value: 1.896014E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.210 | iteration 180100/ 320000 | elapsed time per iteration (ms): 2576.6 | learning rate: 1.256E-04 | approx flops per GPU: 38.6TFLOPS | lm_loss: 2.909508E+00 | loss scale: 
32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 568.26 | backward: 1808.38 | backward-backward: 1808.36 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.87 + samples/sec: 6.588 | iteration 180200/ 320000 | elapsed time per iteration (ms): 2428.8 | learning rate: 1.254E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.897804E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.01 | backward: 1805.66 | backward-backward: 1805.64 | backward-allreduce: 0.00 | optimizer: 55.76 | batch generator: 0.79 + samples/sec: 6.597 | iteration 180300/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 1.253E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.908450E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.18 | backward: 1803.03 | backward-backward: 1803.01 | backward-allreduce: 0.00 | optimizer: 55.80 | batch generator: 0.81 + samples/sec: 6.591 | iteration 180400/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 1.251E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.903730E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.64 | backward: 1804.73 | backward-backward: 1804.70 | backward-allreduce: 0.00 | optimizer: 55.97 | batch generator: 0.80 + samples/sec: 6.595 | iteration 180500/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 1.250E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.889267E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.27 | backward: 1804.31 | backward-backward: 1804.29 | backward-allreduce: 0.00 | optimizer: 55.12 | batch generator: 0.78 + samples/sec: 6.598 | iteration 180600/ 320000 | elapsed time per iteration (ms): 2425.0 | learning rate: 1.248E-04 | approx 
flops per GPU: 41.0TFLOPS | lm_loss: 2.918279E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.32 | backward: 1803.42 | backward-backward: 1803.40 | backward-allreduce: 0.00 | optimizer: 54.90 | batch generator: 0.76 + samples/sec: 6.586 | iteration 180700/ 320000 | elapsed time per iteration (ms): 2429.5 | learning rate: 1.247E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.901935E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.63 | backward: 1806.21 | backward-backward: 1806.19 | backward-allreduce: 0.00 | optimizer: 56.28 | batch generator: 0.78 + samples/sec: 6.595 | iteration 180800/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 1.245E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.895624E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.38 | backward: 1803.32 | backward-backward: 1803.30 | backward-allreduce: 0.00 | optimizer: 55.83 | batch generator: 0.82 + samples/sec: 6.598 | iteration 180900/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 1.244E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.898376E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.02 | backward: 1803.29 | backward-backward: 1803.26 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.78 + samples/sec: 6.588 | iteration 181000/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 1.242E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.912653E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.62 | backward: 1805.92 | backward-backward: 1805.89 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.78 
+----------------------------------------------------------------------------------------------------------- + validation results at iteration 181000 | lm_loss value: 2.842064E+00 | lm_loss_ppl value: 1.715113E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.442 | iteration 181100/ 320000 | elapsed time per iteration (ms): 2483.5 | learning rate: 1.241E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.912485E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.65 | backward: 1804.15 | backward-backward: 1804.13 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.86 + samples/sec: 6.598 | iteration 181200/ 320000 | elapsed time per iteration (ms): 2425.0 | learning rate: 1.240E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.881777E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.26 | backward: 1802.89 | backward-backward: 1802.86 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.79 + samples/sec: 6.590 | iteration 181300/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 1.238E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.910804E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.69 | backward: 1805.42 | backward-backward: 1805.40 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.76 + samples/sec: 6.595 | iteration 181400/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 1.237E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.914162E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.36 | backward: 1803.49 | backward-backward: 1803.47 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.82 + samples/sec: 6.595 | 
iteration 181500/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 1.235E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.896360E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.27 | backward: 1803.47 | backward-backward: 1803.45 | backward-allreduce: 0.00 | optimizer: 55.81 | batch generator: 0.76 + samples/sec: 6.588 | iteration 181600/ 320000 | elapsed time per iteration (ms): 2428.6 | learning rate: 1.234E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.927159E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 567.13 | backward: 1805.91 | backward-backward: 1805.88 | backward-allreduce: 0.00 | optimizer: 55.20 | batch generator: 0.77 + samples/sec: 6.592 | iteration 181700/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 1.232E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.903326E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.62 | backward: 1804.65 | backward-backward: 1804.63 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.78 + samples/sec: 6.595 | iteration 181800/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 1.231E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.916656E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 565.92 | backward: 1804.36 | backward-backward: 1804.34 | backward-allreduce: 0.00 | optimizer: 55.42 | batch generator: 0.79 + samples/sec: 6.589 | iteration 181900/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 1.229E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.908867E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.69 | backward: 1805.20 | backward-backward: 1805.17 | backward-allreduce: 0.00 | 
optimizer: 56.06 | batch generator: 0.80 + samples/sec: 6.594 | iteration 182000/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 1.228E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.902429E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.55 | backward: 1803.75 | backward-backward: 1803.73 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.76 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 182000 | lm_loss value: 2.851450E+00 | lm_loss_ppl value: 1.731286E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.448 | iteration 182100/ 320000 | elapsed time per iteration (ms): 2481.5 | learning rate: 1.227E-04 | approx flops per GPU: 40.1TFLOPS | lm_loss: 2.889535E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.87 | backward: 1803.10 | backward-backward: 1803.08 | backward-allreduce: 0.00 | optimizer: 55.35 | batch generator: 0.79 + samples/sec: 6.589 | iteration 182200/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 1.225E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.894985E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.75 | backward: 1805.63 | backward-backward: 1805.60 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.82 + samples/sec: 6.596 | iteration 182300/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 1.224E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.902799E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.27 | backward: 1803.20 | backward-backward: 1803.18 | backward-allreduce: 0.00 | optimizer: 55.87 | batch 
generator: 0.88 + samples/sec: 6.590 | iteration 182400/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 1.222E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.925697E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.42 | backward: 1805.62 | backward-backward: 1805.60 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.74 + samples/sec: 6.592 | iteration 182500/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 1.221E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.904127E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.08 | backward: 1804.24 | backward-backward: 1804.21 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.83 + samples/sec: 6.597 | iteration 182600/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 1.219E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.896524E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.03 | backward: 1803.18 | backward-backward: 1803.15 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.80 + samples/sec: 6.589 | iteration 182700/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 1.218E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.912169E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.52 | backward: 1805.89 | backward-backward: 1805.87 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.78 + samples/sec: 6.599 | iteration 182800/ 320000 | elapsed time per iteration (ms): 2424.8 | learning rate: 1.216E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.910268E+00 | loss scale: 32768.0 | number of skipped iterations: 2 | number of nan iterations: 0 | +time (ms) | forward: 566.31 | backward: 1803.28 | backward-backward: 
1803.26 | backward-allreduce: 0.00 | optimizer: 54.84 | batch generator: 0.81 + samples/sec: 6.592 | iteration 182900/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 1.215E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.892001E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.33 | backward: 1804.58 | backward-backward: 1804.55 | backward-allreduce: 0.00 | optimizer: 56.01 | batch generator: 0.78 + samples/sec: 6.591 | iteration 183000/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 1.214E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.889170E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.40 | backward: 1805.15 | backward-backward: 1805.12 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.76 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 183000 | lm_loss value: 2.877123E+00 | lm_loss_ppl value: 1.776310E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.447 | iteration 183100/ 320000 | elapsed time per iteration (ms): 2481.9 | learning rate: 1.212E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.894171E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.98 | backward: 1803.07 | backward-backward: 1803.05 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.91 + samples/sec: 6.588 | iteration 183200/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 1.211E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.897486E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.70 | backward: 1805.55 | backward-backward: 1805.53 | 
backward-allreduce: 0.00 | optimizer: 55.85 | batch generator: 0.89 + samples/sec: 6.597 | iteration 183300/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 1.209E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.887001E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.28 | backward: 1802.98 | backward-backward: 1802.96 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.79 + samples/sec: 6.591 | iteration 183400/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 1.208E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.934471E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.59 | backward: 1804.74 | backward-backward: 1804.72 | backward-allreduce: 0.00 | optimizer: 55.86 | batch generator: 0.78 + samples/sec: 6.594 | iteration 183500/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 1.206E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.890047E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.50 | backward: 1804.12 | backward-backward: 1804.10 | backward-allreduce: 0.00 | optimizer: 55.34 | batch generator: 0.78 + samples/sec: 6.594 | iteration 183600/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 1.205E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.880782E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.28 | backward: 1804.11 | backward-backward: 1804.09 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.76 + samples/sec: 6.590 | iteration 183700/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 1.203E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.907233E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | 
forward: 566.42 | backward: 1805.24 | backward-backward: 1805.22 | backward-allreduce: 0.00 | optimizer: 55.98 | batch generator: 0.76 + samples/sec: 6.599 | iteration 183800/ 320000 | elapsed time per iteration (ms): 2424.7 | learning rate: 1.202E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.901257E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.23 | backward: 1802.61 | backward-backward: 1802.59 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.79 + samples/sec: 6.591 | iteration 183900/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 1.201E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.891435E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.47 | backward: 1805.73 | backward-backward: 1805.70 | backward-allreduce: 0.00 | optimizer: 55.03 | batch generator: 0.76 + samples/sec: 6.592 | iteration 184000/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 1.199E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.889539E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.04 | backward: 1804.54 | backward-backward: 1804.52 | backward-allreduce: 0.00 | optimizer: 56.07 | batch generator: 0.78 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 184000 | lm_loss value: 2.894803E+00 | lm_loss_ppl value: 1.807993E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.441 | iteration 184100/ 320000 | elapsed time per iteration (ms): 2484.1 | learning rate: 1.198E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.888818E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.35 | 
backward: 1804.87 | backward-backward: 1804.85 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.86 + samples/sec: 6.591 | iteration 184200/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 1.196E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.914987E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.62 | backward: 1804.99 | backward-backward: 1804.97 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.81 + samples/sec: 6.597 | iteration 184300/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 1.195E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.905977E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.08 | backward: 1804.17 | backward-backward: 1804.15 | backward-allreduce: 0.00 | optimizer: 54.84 | batch generator: 0.71 + samples/sec: 6.591 | iteration 184400/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 1.193E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.880615E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.53 | backward: 1805.19 | backward-backward: 1805.16 | backward-allreduce: 0.00 | optimizer: 55.39 | batch generator: 0.76 + samples/sec: 6.598 | iteration 184500/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 1.192E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.895813E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.05 | backward: 1803.22 | backward-backward: 1803.19 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.80 + samples/sec: 6.588 | iteration 184600/ 320000 | elapsed time per iteration (ms): 2428.6 | learning rate: 1.191E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.910818E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | +time (ms) | forward: 567.07 | backward: 1804.96 | backward-backward: 1804.93 | backward-allreduce: 0.00 | optimizer: 56.23 | batch generator: 0.79 + samples/sec: 6.600 | iteration 184700/ 320000 | elapsed time per iteration (ms): 2424.4 | learning rate: 1.189E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.887367E+00 | loss scale: 16384.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.07 | backward: 1802.81 | backward-backward: 1802.79 | backward-allreduce: 0.00 | optimizer: 55.13 | batch generator: 0.77 + samples/sec: 6.593 | iteration 184800/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 1.188E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.902870E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.73 | backward: 1804.19 | backward-backward: 1804.17 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.77 + samples/sec: 6.596 | iteration 184900/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 1.186E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.881776E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.32 | backward: 1803.39 | backward-backward: 1803.37 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.80 + samples/sec: 6.596 | iteration 185000/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 1.185E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.882634E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.45 | backward: 1803.59 | backward-backward: 1803.56 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.77 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 185000 | lm_loss value: 
2.899109E+00 | lm_loss_ppl value: 1.815797E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.441 | iteration 185100/ 320000 | elapsed time per iteration (ms): 2484.2 | learning rate: 1.183E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.893860E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.46 | backward: 1804.43 | backward-backward: 1804.41 | backward-allreduce: 0.00 | optimizer: 56.00 | batch generator: 0.85 + samples/sec: 6.597 | iteration 185200/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 1.182E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.909367E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.21 | backward: 1802.64 | backward-backward: 1802.62 | backward-allreduce: 0.00 | optimizer: 55.99 | batch generator: 0.79 + samples/sec: 6.591 | iteration 185300/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 1.180E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.884349E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.66 | backward: 1805.00 | backward-backward: 1804.97 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.78 + samples/sec: 6.600 | iteration 185400/ 320000 | elapsed time per iteration (ms): 2424.4 | learning rate: 1.179E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.907739E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.02 | backward: 1802.46 | backward-backward: 1802.44 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.77 + samples/sec: 6.592 | iteration 185500/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 1.178E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.878329E+00 | loss scale: 
16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.62 | backward: 1804.67 | backward-backward: 1804.64 | backward-allreduce: 0.00 | optimizer: 55.39 | batch generator: 0.77 + samples/sec: 6.597 | iteration 185600/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 1.176E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.884004E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.08 | backward: 1803.58 | backward-backward: 1803.56 | backward-allreduce: 0.00 | optimizer: 55.40 | batch generator: 0.76 + samples/sec: 6.596 | iteration 185700/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 1.175E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.894629E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.35 | backward: 1803.65 | backward-backward: 1803.63 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.79 + samples/sec: 6.591 | iteration 185800/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 1.173E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.898392E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.54 | backward: 1804.89 | backward-backward: 1804.87 | backward-allreduce: 0.00 | optimizer: 55.85 | batch generator: 0.79 + samples/sec: 6.595 | iteration 185900/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 1.172E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.894128E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.59 | backward: 1803.46 | backward-backward: 1803.43 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.93 + samples/sec: 6.591 | iteration 186000/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 1.170E-04 | approx 
flops per GPU: 40.9TFLOPS | lm_loss: 2.861195E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.70 | backward: 1804.91 | backward-backward: 1804.89 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.81 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 186000 | lm_loss value: 2.909348E+00 | lm_loss_ppl value: 1.834484E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.448 | iteration 186100/ 320000 | elapsed time per iteration (ms): 2481.4 | learning rate: 1.169E-04 | approx flops per GPU: 40.1TFLOPS | lm_loss: 2.867545E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.14 | backward: 1802.76 | backward-backward: 1802.73 | backward-allreduce: 0.00 | optimizer: 55.41 | batch generator: 0.86 + samples/sec: 6.585 | iteration 186200/ 320000 | elapsed time per iteration (ms): 2429.8 | learning rate: 1.168E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.900770E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.80 | backward: 1806.09 | backward-backward: 1806.06 | backward-allreduce: 0.00 | optimizer: 56.57 | batch generator: 0.77 + samples/sec: 6.599 | iteration 186300/ 320000 | elapsed time per iteration (ms): 2424.7 | learning rate: 1.166E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.906981E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.94 | backward: 1802.94 | backward-backward: 1802.92 | backward-allreduce: 0.00 | optimizer: 55.41 | batch generator: 0.77 + samples/sec: 6.593 | iteration 186400/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 1.165E-04 | approx flops per GPU: 41.0TFLOPS 
| lm_loss: 2.892109E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.47 | backward: 1804.48 | backward-backward: 1804.46 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.77 + samples/sec: 6.592 | iteration 186500/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 1.163E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.904184E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.50 | backward: 1804.68 | backward-backward: 1804.66 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 0.81 + samples/sec: 6.598 | iteration 186600/ 320000 | elapsed time per iteration (ms): 2425.0 | learning rate: 1.162E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.882749E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.10 | backward: 1802.95 | backward-backward: 1802.92 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.78 + samples/sec: 6.589 | iteration 186700/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 1.160E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.883213E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 567.02 | backward: 1805.81 | backward-backward: 1805.79 | backward-allreduce: 0.00 | optimizer: 55.20 | batch generator: 0.89 + samples/sec: 6.596 | iteration 186800/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 1.159E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.882681E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.29 | backward: 1803.16 | backward-backward: 1803.14 | backward-allreduce: 0.00 | optimizer: 55.94 | batch generator: 0.76 + samples/sec: 6.594 | iteration 186900/ 320000 | elapsed time per iteration (ms): 2426.5 | 
learning rate: 1.157E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.896065E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.41 | backward: 1804.34 | backward-backward: 1804.31 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.77 + samples/sec: 6.593 | iteration 187000/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 1.156E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.898166E+00 | loss scale: 16384.0 | number of skipped iterations: 2 | number of nan iterations: 0 | +time (ms) | forward: 566.61 | backward: 1805.05 | backward-backward: 1805.02 | backward-allreduce: 0.00 | optimizer: 54.74 | batch generator: 0.73 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 187000 | lm_loss value: 2.930801E+00 | lm_loss_ppl value: 1.874263E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.446 | iteration 187100/ 320000 | elapsed time per iteration (ms): 2482.1 | learning rate: 1.155E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.908065E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.25 | backward: 1803.02 | backward-backward: 1803.00 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.87 + samples/sec: 6.588 | iteration 187200/ 320000 | elapsed time per iteration (ms): 2428.6 | learning rate: 1.153E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.869083E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.07 | backward: 1804.68 | backward-backward: 1804.66 | backward-allreduce: 0.00 | optimizer: 56.46 | batch generator: 0.78 + samples/sec: 6.598 | iteration 187300/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 1.152E-04 | 
approx flops per GPU: 41.0TFLOPS | lm_loss: 2.881102E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.97 | backward: 1803.11 | backward-backward: 1803.08 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.79 + samples/sec: 6.591 | iteration 187400/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 1.150E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.875036E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.80 | backward: 1804.23 | backward-backward: 1804.20 | backward-allreduce: 0.00 | optimizer: 55.99 | batch generator: 0.80 + samples/sec: 6.595 | iteration 187500/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 1.149E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.858645E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.47 | backward: 1803.42 | backward-backward: 1803.39 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.77 + samples/sec: 6.593 | iteration 187600/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 1.147E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.886241E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.78 | backward: 1803.52 | backward-backward: 1803.50 | backward-allreduce: 0.00 | optimizer: 56.10 | batch generator: 0.85 + samples/sec: 6.591 | iteration 187700/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 1.146E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.870358E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.84 | backward: 1804.17 | backward-backward: 1804.15 | backward-allreduce: 0.00 | optimizer: 56.17 | batch generator: 0.88 + samples/sec: 6.598 | iteration 187800/ 320000 | elapsed 
time per iteration (ms): 2425.0 | learning rate: 1.145E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.882787E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.36 | backward: 1802.92 | backward-backward: 1802.89 | backward-allreduce: 0.00 | optimizer: 55.39 | batch generator: 0.79 + samples/sec: 6.591 | iteration 187900/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 1.143E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.872924E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.23 | backward: 1804.15 | backward-backward: 1804.12 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.94 + samples/sec: 6.599 | iteration 188000/ 320000 | elapsed time per iteration (ms): 2424.5 | learning rate: 1.142E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.881431E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.13 | backward: 1802.40 | backward-backward: 1802.38 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.79 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 188000 | lm_loss value: 2.836677E+00 | lm_loss_ppl value: 1.705898E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.439 | iteration 188100/ 320000 | elapsed time per iteration (ms): 2484.7 | learning rate: 1.140E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.902876E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.90 | backward: 1805.09 | backward-backward: 1805.07 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.87 + samples/sec: 6.595 | iteration 188200/ 320000 | elapsed time per iteration (ms): 
2426.1 | learning rate: 1.139E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.880020E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.30 | backward: 1803.83 | backward-backward: 1803.80 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.76 + samples/sec: 6.594 | iteration 188300/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 1.137E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.877081E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.26 | backward: 1803.48 | backward-backward: 1803.45 | backward-allreduce: 0.00 | optimizer: 56.43 | batch generator: 0.79 + samples/sec: 6.587 | iteration 188400/ 320000 | elapsed time per iteration (ms): 2429.2 | learning rate: 1.136E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.864098E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.89 | backward: 1806.08 | backward-backward: 1806.05 | backward-allreduce: 0.00 | optimizer: 55.82 | batch generator: 0.80 + samples/sec: 6.598 | iteration 188500/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 1.135E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.877446E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.20 | backward: 1802.89 | backward-backward: 1802.86 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.83 + samples/sec: 6.592 | iteration 188600/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 1.133E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.897515E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.57 | backward: 1804.67 | backward-backward: 1804.65 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.77 + samples/sec: 6.595 | 
iteration 188700/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 1.132E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.890860E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.56 | backward: 1803.46 | backward-backward: 1803.43 | backward-allreduce: 0.00 | optimizer: 55.82 | batch generator: 0.78 + samples/sec: 6.597 | iteration 188800/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 1.130E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.875083E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.18 | backward: 1803.25 | backward-backward: 1803.22 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.81 + samples/sec: 6.593 | iteration 188900/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 1.129E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.875239E+00 | loss scale: 16384.0 | number of skipped iterations: 2 | number of nan iterations: 0 | +time (ms) | forward: 566.76 | backward: 1805.10 | backward-backward: 1805.07 | backward-allreduce: 0.00 | optimizer: 54.62 | batch generator: 0.76 + samples/sec: 6.600 | iteration 189000/ 320000 | elapsed time per iteration (ms): 2424.1 | learning rate: 1.128E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.865639E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.98 | backward: 1802.26 | backward-backward: 1802.24 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.77 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 189000 | lm_loss value: 2.787803E+00 | lm_loss_ppl value: 1.624529E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.443 | iteration 189100/ 320000 | 
elapsed time per iteration (ms): 2483.2 | learning rate: 1.126E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.887470E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.55 | backward: 1803.99 | backward-backward: 1803.96 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.83 + samples/sec: 6.595 | iteration 189200/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 1.125E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.869974E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.59 | backward: 1803.55 | backward-backward: 1803.53 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.79 + samples/sec: 6.595 | iteration 189300/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 1.123E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.891342E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.44 | backward: 1803.49 | backward-backward: 1803.47 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.79 + samples/sec: 6.588 | iteration 189400/ 320000 | elapsed time per iteration (ms): 2428.7 | learning rate: 1.122E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.879682E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.93 | backward: 1805.02 | backward-backward: 1805.00 | backward-allreduce: 0.00 | optimizer: 56.33 | batch generator: 0.81 + samples/sec: 6.599 | iteration 189500/ 320000 | elapsed time per iteration (ms): 2424.6 | learning rate: 1.120E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.878406E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.12 | backward: 1802.27 | backward-backward: 1802.24 | backward-allreduce: 0.00 | optimizer: 55.86 | batch 
generator: 0.93 + samples/sec: 6.591 | iteration 189600/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 1.119E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.854060E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.78 | backward: 1804.82 | backward-backward: 1804.80 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.82 + samples/sec: 6.599 | iteration 189700/ 320000 | elapsed time per iteration (ms): 2424.8 | learning rate: 1.118E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.878698E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.22 | backward: 1802.50 | backward-backward: 1802.48 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.79 + samples/sec: 6.597 | iteration 189800/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 1.116E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.873597E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.30 | backward: 1802.97 | backward-backward: 1802.95 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.76 + samples/sec: 6.592 | iteration 189900/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 1.115E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.878015E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.61 | backward: 1804.65 | backward-backward: 1804.62 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.78 + samples/sec: 6.598 | iteration 190000/ 320000 | elapsed time per iteration (ms): 2424.8 | learning rate: 1.113E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.865634E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.96 | backward: 1802.74 | backward-backward: 
1802.72 | backward-allreduce: 0.00 | optimizer: 55.76 | batch generator: 0.77 +WARNING: Deleting old checkpoints: + checkpoints-fcm/global_step90000 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 190000 | lm_loss value: 2.860255E+00 | lm_loss_ppl value: 1.746597E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.203 | iteration 190100/ 320000 | elapsed time per iteration (ms): 2579.3 | learning rate: 1.112E-04 | approx flops per GPU: 38.5TFLOPS | lm_loss: 2.886305E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 569.37 | backward: 1808.65 | backward-backward: 1808.63 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.85 + samples/sec: 6.597 | iteration 190200/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 1.110E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.871734E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.58 | backward: 1802.98 | backward-backward: 1802.96 | backward-allreduce: 0.00 | optimizer: 55.30 | batch generator: 0.80 + samples/sec: 6.597 | iteration 190300/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 1.109E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.875904E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.36 | backward: 1803.16 | backward-backward: 1803.14 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.77 + samples/sec: 6.590 | iteration 190400/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 1.108E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.861574E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 
566.63 | backward: 1805.49 | backward-backward: 1805.46 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.83 + samples/sec: 6.595 | iteration 190500/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 1.106E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.871764E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.08 | backward: 1803.33 | backward-backward: 1803.31 | backward-allreduce: 0.00 | optimizer: 56.46 | batch generator: 0.79 + samples/sec: 6.593 | iteration 190600/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 1.105E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.895243E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.43 | backward: 1804.18 | backward-backward: 1804.15 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.75 + samples/sec: 6.592 | iteration 190700/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 1.103E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.891291E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.62 | backward: 1804.79 | backward-backward: 1804.76 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.78 + samples/sec: 6.599 | iteration 190800/ 320000 | elapsed time per iteration (ms): 2424.6 | learning rate: 1.102E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.874852E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.98 | backward: 1802.79 | backward-backward: 1802.77 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.79 + samples/sec: 6.594 | iteration 190900/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 1.100E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.855860E+00 | loss scale: 65536.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.54 | backward: 1804.12 | backward-backward: 1804.10 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.78 + samples/sec: 6.587 | iteration 191000/ 320000 | elapsed time per iteration (ms): 2428.9 | learning rate: 1.099E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.870822E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.69 | backward: 1806.28 | backward-backward: 1806.26 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.80 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 191000 | lm_loss value: 2.875153E+00 | lm_loss_ppl value: 1.772814E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.447 | iteration 191100/ 320000 | elapsed time per iteration (ms): 2481.8 | learning rate: 1.098E-04 | approx flops per GPU: 40.1TFLOPS | lm_loss: 2.882393E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.11 | backward: 1803.15 | backward-backward: 1803.12 | backward-allreduce: 0.00 | optimizer: 55.40 | batch generator: 0.85 + samples/sec: 6.593 | iteration 191200/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 1.096E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.879959E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.79 | backward: 1804.32 | backward-backward: 1804.29 | backward-allreduce: 0.00 | optimizer: 55.29 | batch generator: 0.76 + samples/sec: 6.590 | iteration 191300/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 1.095E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.874664E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of 
nan iterations: 0 | +time (ms) | forward: 566.61 | backward: 1805.77 | backward-backward: 1805.74 | backward-allreduce: 0.00 | optimizer: 55.34 | batch generator: 0.77 + samples/sec: 6.598 | iteration 191400/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 1.093E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.877599E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.95 | backward: 1803.30 | backward-backward: 1803.28 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.76 + samples/sec: 6.594 | iteration 191500/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 1.092E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.844525E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.95 | backward: 1804.26 | backward-backward: 1804.23 | backward-allreduce: 0.00 | optimizer: 54.95 | batch generator: 0.76 + samples/sec: 6.588 | iteration 191600/ 320000 | elapsed time per iteration (ms): 2428.7 | learning rate: 1.091E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.863393E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.70 | backward: 1805.69 | backward-backward: 1805.67 | backward-allreduce: 0.00 | optimizer: 55.95 | batch generator: 0.81 + samples/sec: 6.597 | iteration 191700/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 1.089E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.856668E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.27 | backward: 1803.21 | backward-backward: 1803.19 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.79 + samples/sec: 6.595 | iteration 191800/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 1.088E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.871125E+00 | loss 
scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.31 | backward: 1803.78 | backward-backward: 1803.75 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.78 + samples/sec: 6.591 | iteration 191900/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 1.086E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.861248E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.82 | backward: 1804.92 | backward-backward: 1804.89 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.90 + samples/sec: 6.597 | iteration 192000/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 1.085E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.883492E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.44 | backward: 1803.22 | backward-backward: 1803.19 | backward-allreduce: 0.00 | optimizer: 55.39 | batch generator: 0.80 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 192000 | lm_loss value: 2.783086E+00 | lm_loss_ppl value: 1.616885E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.443 | iteration 192100/ 320000 | elapsed time per iteration (ms): 2483.2 | learning rate: 1.084E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.889257E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.20 | backward: 1804.13 | backward-backward: 1804.10 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.86 + samples/sec: 6.589 | iteration 192200/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 1.082E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.857142E+00 | loss scale: 32768.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.96 | backward: 1805.27 | backward-backward: 1805.24 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.79 + samples/sec: 6.596 | iteration 192300/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 1.081E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.856524E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.52 | backward: 1803.07 | backward-backward: 1803.04 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.81 + samples/sec: 6.597 | iteration 192400/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 1.079E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.884305E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.32 | backward: 1802.95 | backward-backward: 1802.92 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.81 + samples/sec: 6.588 | iteration 192500/ 320000 | elapsed time per iteration (ms): 2428.8 | learning rate: 1.078E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.875736E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.48 | backward: 1806.16 | backward-backward: 1806.13 | backward-allreduce: 0.00 | optimizer: 55.76 | batch generator: 0.78 + samples/sec: 6.592 | iteration 192600/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 1.076E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.877444E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.74 | backward: 1804.61 | backward-backward: 1804.59 | backward-allreduce: 0.00 | optimizer: 55.36 | batch generator: 0.81 + samples/sec: 6.594 | iteration 192700/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 1.075E-04 | approx flops per GPU: 
41.0TFLOPS | lm_loss: 2.875116E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.90 | backward: 1804.09 | backward-backward: 1804.07 | backward-allreduce: 0.00 | optimizer: 56.14 | batch generator: 0.80 + samples/sec: 6.590 | iteration 192800/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 1.074E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.863660E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.67 | backward: 1805.18 | backward-backward: 1805.16 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.81 + samples/sec: 6.589 | iteration 192900/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 1.072E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.885876E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.86 | backward: 1805.75 | backward-backward: 1805.72 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.85 + samples/sec: 6.597 | iteration 193000/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 1.071E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.866650E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.03 | backward: 1803.30 | backward-backward: 1803.27 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.78 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 193000 | lm_loss value: 2.868802E+00 | lm_loss_ppl value: 1.761590E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.443 | iteration 193100/ 320000 | elapsed time per iteration (ms): 2483.4 | learning rate: 1.069E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 
2.857432E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.69 | backward: 1804.12 | backward-backward: 1804.10 | backward-allreduce: 0.00 | optimizer: 55.42 | batch generator: 0.83 + samples/sec: 6.590 | iteration 193200/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 1.068E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.865851E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.55 | backward: 1805.51 | backward-backward: 1805.48 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.78 + samples/sec: 6.593 | iteration 193300/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 1.067E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.876440E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.57 | backward: 1804.92 | backward-backward: 1804.90 | backward-allreduce: 0.00 | optimizer: 55.13 | batch generator: 0.78 + samples/sec: 6.598 | iteration 193400/ 320000 | elapsed time per iteration (ms): 2425.0 | learning rate: 1.065E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.863687E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.95 | backward: 1803.17 | backward-backward: 1803.15 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.77 + samples/sec: 6.591 | iteration 193500/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 1.064E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.884990E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.58 | backward: 1805.19 | backward-backward: 1805.16 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.78 + samples/sec: 6.590 | iteration 193600/ 320000 | elapsed time per iteration (ms): 2427.8 | learning 
rate: 1.062E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.859315E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.64 | backward: 1804.98 | backward-backward: 1804.95 | backward-allreduce: 0.00 | optimizer: 55.78 | batch generator: 0.76 + samples/sec: 6.598 | iteration 193700/ 320000 | elapsed time per iteration (ms): 2425.0 | learning rate: 1.061E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.859491E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.13 | backward: 1803.07 | backward-backward: 1803.05 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.79 + samples/sec: 6.590 | iteration 193800/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 1.060E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.873080E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.60 | backward: 1803.89 | backward-backward: 1803.87 | backward-allreduce: 0.00 | optimizer: 56.92 | batch generator: 0.95 + samples/sec: 6.590 | iteration 193900/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 1.058E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.876517E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.51 | backward: 1805.40 | backward-backward: 1805.37 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.77 + samples/sec: 6.591 | iteration 194000/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 1.057E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.861474E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.72 | backward: 1804.64 | backward-backward: 1804.62 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.80 
+----------------------------------------------------------------------------------------------------------- + validation results at iteration 194000 | lm_loss value: 2.882784E+00 | lm_loss_ppl value: 1.786393E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.443 | iteration 194100/ 320000 | elapsed time per iteration (ms): 2483.2 | learning rate: 1.055E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.862131E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.55 | backward: 1803.89 | backward-backward: 1803.87 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.85 + samples/sec: 6.597 | iteration 194200/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 1.054E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.875278E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.12 | backward: 1803.14 | backward-backward: 1803.11 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.76 + samples/sec: 6.592 | iteration 194300/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 1.053E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.881512E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.47 | backward: 1805.08 | backward-backward: 1805.06 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.77 + samples/sec: 6.592 | iteration 194400/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 1.051E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.868011E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.58 | backward: 1804.86 | backward-backward: 1804.84 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.76 + samples/sec: 6.596 | 
iteration 194500/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 1.050E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.875050E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.27 | backward: 1803.74 | backward-backward: 1803.72 | backward-allreduce: 0.00 | optimizer: 55.40 | batch generator: 0.77 + samples/sec: 6.597 | iteration 194600/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 1.048E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.852160E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.26 | backward: 1803.67 | backward-backward: 1803.64 | backward-allreduce: 0.00 | optimizer: 54.99 | batch generator: 0.79 + samples/sec: 6.589 | iteration 194700/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 1.047E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.865165E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.54 | backward: 1805.61 | backward-backward: 1805.59 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.77 + samples/sec: 6.591 | iteration 194800/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 1.046E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.844009E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.70 | backward: 1804.88 | backward-backward: 1804.86 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.78 + samples/sec: 6.594 | iteration 194900/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 1.044E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.856657E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.03 | backward: 1804.02 | backward-backward: 1803.99 | backward-allreduce: 0.00 | 
optimizer: 56.03 | batch generator: 0.82 + samples/sec: 6.593 | iteration 195000/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 1.043E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.866370E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.60 | backward: 1804.25 | backward-backward: 1804.22 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.77 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 195000 | lm_loss value: 2.781148E+00 | lm_loss_ppl value: 1.613753E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.440 | iteration 195100/ 320000 | elapsed time per iteration (ms): 2484.5 | learning rate: 1.041E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.872111E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.88 | backward: 1804.77 | backward-backward: 1804.74 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.86 + samples/sec: 6.593 | iteration 195200/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 1.040E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.843492E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.88 | backward: 1804.16 | backward-backward: 1804.13 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.78 + samples/sec: 6.598 | iteration 195300/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 1.039E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.872704E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.03 | backward: 1802.82 | backward-backward: 1802.80 | backward-allreduce: 0.00 | optimizer: 55.87 | batch 
generator: 0.82 + samples/sec: 6.592 | iteration 195400/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 1.037E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.855403E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.80 | backward: 1804.26 | backward-backward: 1804.23 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.80 + samples/sec: 6.590 | iteration 195500/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 1.036E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.853541E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.85 | backward: 1804.72 | backward-backward: 1804.70 | backward-allreduce: 0.00 | optimizer: 55.83 | batch generator: 0.79 + samples/sec: 6.598 | iteration 195600/ 320000 | elapsed time per iteration (ms): 2425.0 | learning rate: 1.034E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.878503E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.35 | backward: 1802.76 | backward-backward: 1802.73 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.77 + samples/sec: 6.597 | iteration 195700/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 1.033E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.875824E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.42 | backward: 1803.59 | backward-backward: 1803.56 | backward-allreduce: 0.00 | optimizer: 55.10 | batch generator: 0.85 + samples/sec: 6.588 | iteration 195800/ 320000 | elapsed time per iteration (ms): 2428.6 | learning rate: 1.032E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.854630E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.53 | backward: 1805.93 | backward-backward: 
1805.90 | backward-allreduce: 0.00 | optimizer: 55.76 | batch generator: 0.77 + samples/sec: 6.591 | iteration 195900/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 1.030E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.854791E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.87 | backward: 1804.91 | backward-backward: 1804.89 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.87 + samples/sec: 6.593 | iteration 196000/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 1.029E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.839808E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.15 | backward: 1804.12 | backward-backward: 1804.10 | backward-allreduce: 0.00 | optimizer: 56.17 | batch generator: 0.79 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 196000 | lm_loss value: 2.808303E+00 | lm_loss_ppl value: 1.658175E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.438 | iteration 196100/ 320000 | elapsed time per iteration (ms): 2485.4 | learning rate: 1.027E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.855288E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.75 | backward: 1805.39 | backward-backward: 1805.37 | backward-allreduce: 0.00 | optimizer: 56.05 | batch generator: 0.87 + samples/sec: 6.590 | iteration 196200/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 1.026E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.864797E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.43 | backward: 1805.13 | backward-backward: 1805.11 | 
backward-allreduce: 0.00 | optimizer: 55.87 | batch generator: 0.82 + samples/sec: 6.599 | iteration 196300/ 320000 | elapsed time per iteration (ms): 2424.6 | learning rate: 1.025E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.855809E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.16 | backward: 1802.67 | backward-backward: 1802.64 | backward-allreduce: 0.00 | optimizer: 55.40 | batch generator: 0.80 + samples/sec: 6.590 | iteration 196400/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 1.023E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.835857E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.84 | backward: 1805.25 | backward-backward: 1805.23 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.78 + samples/sec: 6.592 | iteration 196500/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 1.022E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.849824E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.62 | backward: 1804.76 | backward-backward: 1804.73 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.79 + samples/sec: 6.599 | iteration 196600/ 320000 | elapsed time per iteration (ms): 2424.7 | learning rate: 1.020E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.878375E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.19 | backward: 1802.92 | backward-backward: 1802.90 | backward-allreduce: 0.00 | optimizer: 55.18 | batch generator: 0.79 + samples/sec: 6.592 | iteration 196700/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 1.019E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.862002E+00 | loss scale: 65536.0 | number of skipped iterations: 2 | number of nan iterations: 0 | +time (ms) | 
forward: 566.55 | backward: 1805.60 | backward-backward: 1805.57 | backward-allreduce: 0.00 | optimizer: 54.50 | batch generator: 0.79 + samples/sec: 6.591 | iteration 196800/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 1.018E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.842774E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.54 | backward: 1805.02 | backward-backward: 1804.99 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.78 + samples/sec: 6.599 | iteration 196900/ 320000 | elapsed time per iteration (ms): 2424.5 | learning rate: 1.016E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.861841E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.09 | backward: 1802.83 | backward-backward: 1802.80 | backward-allreduce: 0.00 | optimizer: 55.24 | batch generator: 0.80 + samples/sec: 6.591 | iteration 197000/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 1.015E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.847816E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.51 | backward: 1805.02 | backward-backward: 1805.00 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.76 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 197000 | lm_loss value: 2.800873E+00 | lm_loss_ppl value: 1.645901E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.436 | iteration 197100/ 320000 | elapsed time per iteration (ms): 2486.1 | learning rate: 1.013E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.862067E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.87 | 
backward: 1805.65 | backward-backward: 1805.62 | backward-allreduce: 0.00 | optimizer: 56.39 | batch generator: 0.90 + samples/sec: 6.591 | iteration 197200/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 1.012E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.841813E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.09 | backward: 1804.19 | backward-backward: 1804.16 | backward-allreduce: 0.00 | optimizer: 56.03 | batch generator: 0.84 + samples/sec: 6.261 | iteration 197300/ 320000 | elapsed time per iteration (ms): 2555.6 | learning rate: 1.011E-04 | approx flops per GPU: 38.9TFLOPS | lm_loss: 2.847807E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 595.27 | backward: 1887.43 | backward-backward: 1887.40 | backward-allreduce: 0.00 | optimizer: 72.52 | batch generator: 0.84 + samples/sec: 5.419 | iteration 197400/ 320000 | elapsed time per iteration (ms): 2952.4 | learning rate: 1.009E-04 | approx flops per GPU: 33.7TFLOPS | lm_loss: 2.842638E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 684.85 | backward: 2140.67 | backward-backward: 2140.65 | backward-allreduce: 0.00 | optimizer: 126.38 | batch generator: 1.01 + samples/sec: 5.394 | iteration 197500/ 320000 | elapsed time per iteration (ms): 2966.5 | learning rate: 1.008E-04 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.849199E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 685.32 | backward: 2153.56 | backward-backward: 2153.54 | backward-allreduce: 0.00 | optimizer: 127.15 | batch generator: 0.98 + samples/sec: 5.392 | iteration 197600/ 320000 | elapsed time per iteration (ms): 2967.6 | learning rate: 1.006E-04 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.875533E+00 | loss scale: 16384.0 | number of skipped iterations: 1 
| number of nan iterations: 0 | +time (ms) | forward: 688.46 | backward: 2153.28 | backward-backward: 2153.25 | backward-allreduce: 0.00 | optimizer: 125.31 | batch generator: 1.04 + samples/sec: 5.395 | iteration 197700/ 320000 | elapsed time per iteration (ms): 2965.9 | learning rate: 1.005E-04 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.860453E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 686.86 | backward: 2150.27 | backward-backward: 2150.25 | backward-allreduce: 0.00 | optimizer: 128.27 | batch generator: 0.98 + samples/sec: 5.382 | iteration 197800/ 320000 | elapsed time per iteration (ms): 2972.6 | learning rate: 1.004E-04 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.832885E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 687.30 | backward: 2157.01 | backward-backward: 2156.99 | backward-allreduce: 0.00 | optimizer: 127.81 | batch generator: 1.04 + samples/sec: 5.398 | iteration 197900/ 320000 | elapsed time per iteration (ms): 2964.2 | learning rate: 1.002E-04 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.853011E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 685.03 | backward: 2151.55 | backward-backward: 2151.53 | backward-allreduce: 0.00 | optimizer: 127.06 | batch generator: 0.99 + samples/sec: 5.386 | iteration 198000/ 320000 | elapsed time per iteration (ms): 2970.8 | learning rate: 1.001E-04 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.845185E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 688.10 | backward: 2153.12 | backward-backward: 2153.09 | backward-allreduce: 0.00 | optimizer: 129.01 | batch generator: 1.05 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 198000 | lm_loss value: 
2.777985E+00 | lm_loss_ppl value: 1.608657E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 5.265 | iteration 198100/ 320000 | elapsed time per iteration (ms): 3039.1 | learning rate: 9.995E-05 | approx flops per GPU: 32.7TFLOPS | lm_loss: 2.845104E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 686.79 | backward: 2155.11 | backward-backward: 2155.09 | backward-allreduce: 0.00 | optimizer: 127.98 | batch generator: 1.18 + samples/sec: 5.406 | iteration 198200/ 320000 | elapsed time per iteration (ms): 2959.9 | learning rate: 9.981E-05 | approx flops per GPU: 33.6TFLOPS | lm_loss: 2.826981E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 686.59 | backward: 2145.43 | backward-backward: 2145.40 | backward-allreduce: 0.00 | optimizer: 127.31 | batch generator: 1.07 + samples/sec: 5.366 | iteration 198300/ 320000 | elapsed time per iteration (ms): 2981.9 | learning rate: 9.967E-05 | approx flops per GPU: 33.3TFLOPS | lm_loss: 2.864411E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 688.84 | backward: 2163.03 | backward-backward: 2163.01 | backward-allreduce: 0.00 | optimizer: 129.54 | batch generator: 1.03 + samples/sec: 5.408 | iteration 198400/ 320000 | elapsed time per iteration (ms): 2958.7 | learning rate: 9.954E-05 | approx flops per GPU: 33.6TFLOPS | lm_loss: 2.842406E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 684.14 | backward: 2146.72 | backward-backward: 2146.69 | backward-allreduce: 0.00 | optimizer: 127.30 | batch generator: 0.96 + samples/sec: 5.383 | iteration 198500/ 320000 | elapsed time per iteration (ms): 2972.5 | learning rate: 9.940E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.843913E+00 | loss scale: 
16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 685.80 | backward: 2156.48 | backward-backward: 2156.46 | backward-allreduce: 0.00 | optimizer: 129.70 | batch generator: 1.03 + samples/sec: 5.387 | iteration 198600/ 320000 | elapsed time per iteration (ms): 2970.3 | learning rate: 9.926E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.856050E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 686.52 | backward: 2156.56 | backward-backward: 2156.54 | backward-allreduce: 0.00 | optimizer: 126.69 | batch generator: 1.01 + samples/sec: 5.394 | iteration 198700/ 320000 | elapsed time per iteration (ms): 2966.1 | learning rate: 9.912E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.869107E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 686.22 | backward: 2152.07 | backward-backward: 2152.05 | backward-allreduce: 0.00 | optimizer: 127.22 | batch generator: 1.04 + samples/sec: 5.395 | iteration 198800/ 320000 | elapsed time per iteration (ms): 2965.5 | learning rate: 9.898E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.862235E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 687.58 | backward: 2150.23 | backward-backward: 2150.21 | backward-allreduce: 0.00 | optimizer: 127.02 | batch generator: 1.02 + samples/sec: 5.397 | iteration 198900/ 320000 | elapsed time per iteration (ms): 2964.6 | learning rate: 9.884E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.853870E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 686.45 | backward: 2151.83 | backward-backward: 2151.81 | backward-allreduce: 0.00 | optimizer: 125.82 | batch generator: 0.98 + samples/sec: 5.383 | iteration 199000/ 320000 | elapsed time per iteration (ms): 2972.2 | learning rate: 9.870E-05 | 
approx flops per GPU: 33.4TFLOPS | lm_loss: 2.850110E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 687.30 | backward: 2156.01 | backward-backward: 2155.98 | backward-allreduce: 0.00 | optimizer: 128.31 | batch generator: 0.99 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 199000 | lm_loss value: 2.828424E+00 | lm_loss_ppl value: 1.691877E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 5.289 | iteration 199100/ 320000 | elapsed time per iteration (ms): 3025.4 | learning rate: 9.857E-05 | approx flops per GPU: 32.9TFLOPS | lm_loss: 2.844522E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 683.60 | backward: 2147.60 | backward-backward: 2147.58 | backward-allreduce: 0.00 | optimizer: 124.90 | batch generator: 1.16 + samples/sec: 5.367 | iteration 199200/ 320000 | elapsed time per iteration (ms): 2981.3 | learning rate: 9.843E-05 | approx flops per GPU: 33.3TFLOPS | lm_loss: 2.851793E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 688.64 | backward: 2161.55 | backward-backward: 2161.52 | backward-allreduce: 0.00 | optimizer: 130.60 | batch generator: 0.99 + samples/sec: 5.402 | iteration 199300/ 320000 | elapsed time per iteration (ms): 2961.8 | learning rate: 9.829E-05 | approx flops per GPU: 33.6TFLOPS | lm_loss: 2.829368E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 687.85 | backward: 2145.86 | backward-backward: 2145.84 | backward-allreduce: 0.00 | optimizer: 127.59 | batch generator: 1.09 + samples/sec: 5.394 | iteration 199400/ 320000 | elapsed time per iteration (ms): 2966.5 | learning rate: 9.815E-05 | approx flops per GPU: 
33.5TFLOPS | lm_loss: 2.874377E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 686.71 | backward: 2151.78 | backward-backward: 2151.76 | backward-allreduce: 0.00 | optimizer: 127.45 | batch generator: 1.00 + samples/sec: 5.382 | iteration 199500/ 320000 | elapsed time per iteration (ms): 2972.9 | learning rate: 9.801E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.847173E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 688.20 | backward: 2154.82 | backward-backward: 2154.80 | backward-allreduce: 0.00 | optimizer: 129.40 | batch generator: 0.94 + samples/sec: 5.402 | iteration 199600/ 320000 | elapsed time per iteration (ms): 2961.7 | learning rate: 9.788E-05 | approx flops per GPU: 33.6TFLOPS | lm_loss: 2.808427E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 683.60 | backward: 2150.58 | backward-backward: 2150.56 | backward-allreduce: 0.00 | optimizer: 127.05 | batch generator: 1.07 + samples/sec: 5.391 | iteration 199700/ 320000 | elapsed time per iteration (ms): 2968.1 | learning rate: 9.774E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.828392E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 685.17 | backward: 2156.48 | backward-backward: 2156.45 | backward-allreduce: 0.00 | optimizer: 125.90 | batch generator: 1.03 + samples/sec: 5.394 | iteration 199800/ 320000 | elapsed time per iteration (ms): 2966.5 | learning rate: 9.760E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.827758E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 686.06 | backward: 2151.51 | backward-backward: 2151.48 | backward-allreduce: 0.00 | optimizer: 128.45 | batch generator: 1.03 + samples/sec: 5.392 | iteration 199900/ 320000 | elapsed time per 
iteration (ms): 2967.1 | learning rate: 9.746E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.844318E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 686.86 | backward: 2153.29 | backward-backward: 2153.26 | backward-allreduce: 0.00 | optimizer: 126.42 | batch generator: 1.05 + samples/sec: 5.403 | iteration 200000/ 320000 | elapsed time per iteration (ms): 2961.1 | learning rate: 9.732E-05 | approx flops per GPU: 33.6TFLOPS | lm_loss: 2.851229E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 685.35 | backward: 2148.20 | backward-backward: 2148.17 | backward-allreduce: 0.00 | optimizer: 127.10 | batch generator: 0.98 +WARNING: Deleting old checkpoints: + checkpoints-fcm/global_step100000 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 200000 | lm_loss value: 2.782051E+00 | lm_loss_ppl value: 1.615212E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 5.099 | iteration 200100/ 320000 | elapsed time per iteration (ms): 3138.2 | learning rate: 9.719E-05 | approx flops per GPU: 31.7TFLOPS | lm_loss: 2.851010E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 690.16 | backward: 2156.23 | backward-backward: 2156.21 | backward-allreduce: 0.00 | optimizer: 129.78 | batch generator: 1.27 + samples/sec: 5.397 | iteration 200200/ 320000 | elapsed time per iteration (ms): 2964.3 | learning rate: 9.705E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.831878E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 685.66 | backward: 2149.98 | backward-backward: 2149.96 | backward-allreduce: 0.00 | optimizer: 128.14 | batch generator: 1.04 + samples/sec: 
5.401 | iteration 200300/ 320000 | elapsed time per iteration (ms): 2962.3 | learning rate: 9.691E-05 | approx flops per GPU: 33.6TFLOPS | lm_loss: 2.845733E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 684.87 | backward: 2149.54 | backward-backward: 2149.51 | backward-allreduce: 0.00 | optimizer: 127.36 | batch generator: 1.05 + samples/sec: 5.380 | iteration 200400/ 320000 | elapsed time per iteration (ms): 2974.1 | learning rate: 9.677E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.846103E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 687.83 | backward: 2158.45 | backward-backward: 2158.43 | backward-allreduce: 0.00 | optimizer: 127.25 | batch generator: 1.03 + samples/sec: 5.405 | iteration 200500/ 320000 | elapsed time per iteration (ms): 2960.0 | learning rate: 9.664E-05 | approx flops per GPU: 33.6TFLOPS | lm_loss: 2.843140E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 683.88 | backward: 2146.82 | backward-backward: 2146.80 | backward-allreduce: 0.00 | optimizer: 128.82 | batch generator: 0.99 + samples/sec: 5.379 | iteration 200600/ 320000 | elapsed time per iteration (ms): 2974.5 | learning rate: 9.650E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.831102E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 688.10 | backward: 2158.81 | backward-backward: 2158.78 | backward-allreduce: 0.00 | optimizer: 127.04 | batch generator: 1.04 + samples/sec: 5.396 | iteration 200700/ 320000 | elapsed time per iteration (ms): 2965.2 | learning rate: 9.636E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.859192E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 684.71 | backward: 2152.49 | backward-backward: 2152.47 | 
backward-allreduce: 0.00 | optimizer: 127.47 | batch generator: 1.02 + samples/sec: 5.389 | iteration 200800/ 320000 | elapsed time per iteration (ms): 2969.0 | learning rate: 9.622E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.835936E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 684.71 | backward: 2152.58 | backward-backward: 2152.56 | backward-allreduce: 0.00 | optimizer: 131.18 | batch generator: 0.98 + samples/sec: 5.387 | iteration 200900/ 320000 | elapsed time per iteration (ms): 2970.1 | learning rate: 9.609E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.835904E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 685.02 | backward: 2155.97 | backward-backward: 2155.95 | backward-allreduce: 0.00 | optimizer: 128.63 | batch generator: 0.95 + samples/sec: 5.402 | iteration 201000/ 320000 | elapsed time per iteration (ms): 2961.9 | learning rate: 9.595E-05 | approx flops per GPU: 33.6TFLOPS | lm_loss: 2.845550E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 685.73 | backward: 2147.13 | backward-backward: 2147.11 | backward-allreduce: 0.00 | optimizer: 128.51 | batch generator: 1.05 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 201000 | lm_loss value: 2.846431E+00 | lm_loss_ppl value: 1.722619E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 5.254 | iteration 201100/ 320000 | elapsed time per iteration (ms): 3045.4 | learning rate: 9.581E-05 | approx flops per GPU: 32.6TFLOPS | lm_loss: 2.844910E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 688.42 | backward: 2157.92 | backward-backward: 2157.90 | backward-allreduce: 
0.00 | optimizer: 129.90 | batch generator: 1.17 + samples/sec: 5.394 | iteration 201200/ 320000 | elapsed time per iteration (ms): 2966.5 | learning rate: 9.568E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.821596E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 686.04 | backward: 2152.39 | backward-backward: 2152.36 | backward-allreduce: 0.00 | optimizer: 127.56 | batch generator: 1.02 + samples/sec: 5.414 | iteration 201300/ 320000 | elapsed time per iteration (ms): 2955.5 | learning rate: 9.554E-05 | approx flops per GPU: 33.6TFLOPS | lm_loss: 2.836929E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 683.16 | backward: 2144.03 | backward-backward: 2144.01 | backward-allreduce: 0.00 | optimizer: 127.84 | batch generator: 1.05 + samples/sec: 5.371 | iteration 201400/ 320000 | elapsed time per iteration (ms): 2979.1 | learning rate: 9.540E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.824103E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 689.13 | backward: 2161.45 | backward-backward: 2161.43 | backward-allreduce: 0.00 | optimizer: 128.01 | batch generator: 1.03 + samples/sec: 5.407 | iteration 201500/ 320000 | elapsed time per iteration (ms): 2959.3 | learning rate: 9.526E-05 | approx flops per GPU: 33.6TFLOPS | lm_loss: 2.830170E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 685.13 | backward: 2146.34 | backward-backward: 2146.32 | backward-allreduce: 0.00 | optimizer: 127.31 | batch generator: 1.02 + samples/sec: 5.395 | iteration 201600/ 320000 | elapsed time per iteration (ms): 2965.5 | learning rate: 9.513E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.834677E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 687.57 | 
backward: 2150.62 | backward-backward: 2150.59 | backward-allreduce: 0.00 | optimizer: 126.75 | batch generator: 1.04 + samples/sec: 5.380 | iteration 201700/ 320000 | elapsed time per iteration (ms): 2974.0 | learning rate: 9.499E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.833290E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 687.54 | backward: 2159.36 | backward-backward: 2159.33 | backward-allreduce: 0.00 | optimizer: 126.51 | batch generator: 1.07 + samples/sec: 5.406 | iteration 201800/ 320000 | elapsed time per iteration (ms): 2959.7 | learning rate: 9.486E-05 | approx flops per GPU: 33.6TFLOPS | lm_loss: 2.835665E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 684.55 | backward: 2147.82 | backward-backward: 2147.80 | backward-allreduce: 0.00 | optimizer: 126.75 | batch generator: 0.97 + samples/sec: 5.380 | iteration 201900/ 320000 | elapsed time per iteration (ms): 2973.9 | learning rate: 9.472E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.831448E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 686.66 | backward: 2155.49 | backward-backward: 2155.46 | backward-allreduce: 0.00 | optimizer: 131.22 | batch generator: 1.03 + samples/sec: 5.384 | iteration 202000/ 320000 | elapsed time per iteration (ms): 2971.9 | learning rate: 9.458E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.831192E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 687.00 | backward: 2154.82 | backward-backward: 2154.80 | backward-allreduce: 0.00 | optimizer: 129.56 | batch generator: 1.00 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 202000 | lm_loss value: 2.830135E+00 | lm_loss_ppl value: 1.694775E+01 | 
+----------------------------------------------------------------------------------------------------------- + samples/sec: 5.272 | iteration 202100/ 320000 | elapsed time per iteration (ms): 3034.6 | learning rate: 9.444E-05 | approx flops per GPU: 32.8TFLOPS | lm_loss: 2.833895E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 685.16 | backward: 2152.22 | backward-backward: 2152.20 | backward-allreduce: 0.00 | optimizer: 127.96 | batch generator: 1.11 + samples/sec: 5.399 | iteration 202200/ 320000 | elapsed time per iteration (ms): 2963.5 | learning rate: 9.431E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.832726E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 686.67 | backward: 2149.94 | backward-backward: 2149.91 | backward-allreduce: 0.00 | optimizer: 126.34 | batch generator: 1.03 + samples/sec: 5.397 | iteration 202300/ 320000 | elapsed time per iteration (ms): 2964.4 | learning rate: 9.417E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.826690E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 686.77 | backward: 2149.16 | backward-backward: 2149.14 | backward-allreduce: 0.00 | optimizer: 127.91 | batch generator: 1.04 + samples/sec: 5.387 | iteration 202400/ 320000 | elapsed time per iteration (ms): 2970.1 | learning rate: 9.403E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.855530E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 686.65 | backward: 2154.64 | backward-backward: 2154.62 | backward-allreduce: 0.00 | optimizer: 128.35 | batch generator: 0.98 + samples/sec: 5.403 | iteration 202500/ 320000 | elapsed time per iteration (ms): 2961.0 | learning rate: 9.390E-05 | approx flops per GPU: 33.6TFLOPS | lm_loss: 2.832136E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | +time (ms) | forward: 683.55 | backward: 2149.51 | backward-backward: 2149.48 | backward-allreduce: 0.00 | optimizer: 127.46 | batch generator: 0.98 + samples/sec: 5.378 | iteration 202600/ 320000 | elapsed time per iteration (ms): 2975.0 | learning rate: 9.376E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.836259E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 687.27 | backward: 2156.19 | backward-backward: 2156.16 | backward-allreduce: 0.00 | optimizer: 131.02 | batch generator: 1.02 + samples/sec: 5.391 | iteration 202700/ 320000 | elapsed time per iteration (ms): 2967.7 | learning rate: 9.362E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.846094E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 685.77 | backward: 2154.48 | backward-backward: 2154.45 | backward-allreduce: 0.00 | optimizer: 126.97 | batch generator: 0.98 + samples/sec: 5.405 | iteration 202800/ 320000 | elapsed time per iteration (ms): 2960.0 | learning rate: 9.349E-05 | approx flops per GPU: 33.6TFLOPS | lm_loss: 2.819505E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 684.90 | backward: 2147.38 | backward-backward: 2147.36 | backward-allreduce: 0.00 | optimizer: 127.15 | batch generator: 1.04 + samples/sec: 5.369 | iteration 202900/ 320000 | elapsed time per iteration (ms): 2980.2 | learning rate: 9.335E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.825245E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 689.51 | backward: 2159.92 | backward-backward: 2159.90 | backward-allreduce: 0.00 | optimizer: 130.26 | batch generator: 1.01 + samples/sec: 5.401 | iteration 203000/ 320000 | elapsed time per iteration (ms): 2962.4 | learning rate: 9.322E-05 | approx flops per GPU: 33.6TFLOPS | lm_loss: 
2.841751E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 685.67 | backward: 2149.14 | backward-backward: 2149.12 | backward-allreduce: 0.00 | optimizer: 127.11 | batch generator: 1.01 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 203000 | lm_loss value: 2.838610E+00 | lm_loss_ppl value: 1.709199E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 5.274 | iteration 203100/ 320000 | elapsed time per iteration (ms): 3034.0 | learning rate: 9.308E-05 | approx flops per GPU: 32.8TFLOPS | lm_loss: 2.831543E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 687.05 | backward: 2149.28 | backward-backward: 2149.26 | backward-allreduce: 0.00 | optimizer: 128.38 | batch generator: 1.24 + samples/sec: 5.382 | iteration 203200/ 320000 | elapsed time per iteration (ms): 2973.1 | learning rate: 9.295E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.835450E+00 | loss scale: 32768.0 | number of skipped iterations: 2 | number of nan iterations: 0 | +time (ms) | forward: 687.28 | backward: 2157.18 | backward-backward: 2157.16 | backward-allreduce: 0.00 | optimizer: 128.09 | batch generator: 0.98 + samples/sec: 5.405 | iteration 203300/ 320000 | elapsed time per iteration (ms): 2960.5 | learning rate: 9.281E-05 | approx flops per GPU: 33.6TFLOPS | lm_loss: 2.850287E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 684.75 | backward: 2147.75 | backward-backward: 2147.73 | backward-allreduce: 0.00 | optimizer: 127.43 | batch generator: 1.02 + samples/sec: 5.386 | iteration 203400/ 320000 | elapsed time per iteration (ms): 2970.9 | learning rate: 9.267E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.815852E+00 | loss 
scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 687.85 | backward: 2154.82 | backward-backward: 2154.79 | backward-allreduce: 0.00 | optimizer: 127.71 | batch generator: 1.04 + samples/sec: 5.383 | iteration 203500/ 320000 | elapsed time per iteration (ms): 2972.3 | learning rate: 9.254E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.836983E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 688.71 | backward: 2154.68 | backward-backward: 2154.65 | backward-allreduce: 0.00 | optimizer: 128.43 | batch generator: 0.98 + samples/sec: 5.403 | iteration 203600/ 320000 | elapsed time per iteration (ms): 2961.5 | learning rate: 9.240E-05 | approx flops per GPU: 33.6TFLOPS | lm_loss: 2.833971E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 681.02 | backward: 2150.54 | backward-backward: 2150.51 | backward-allreduce: 0.00 | optimizer: 129.41 | batch generator: 0.99 + samples/sec: 5.381 | iteration 203700/ 320000 | elapsed time per iteration (ms): 2973.3 | learning rate: 9.227E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.820074E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 685.57 | backward: 2158.19 | backward-backward: 2158.17 | backward-allreduce: 0.00 | optimizer: 129.05 | batch generator: 1.15 + samples/sec: 5.374 | iteration 203800/ 320000 | elapsed time per iteration (ms): 2977.0 | learning rate: 9.213E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.825303E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 689.80 | backward: 2158.02 | backward-backward: 2158.00 | backward-allreduce: 0.00 | optimizer: 128.65 | batch generator: 1.08 + samples/sec: 5.420 | iteration 203900/ 320000 | elapsed time per iteration (ms): 2951.8 | learning rate: 9.199E-05 
| approx flops per GPU: 33.7TFLOPS | lm_loss: 2.820193E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 684.68 | backward: 2140.85 | backward-backward: 2140.82 | backward-allreduce: 0.00 | optimizer: 125.76 | batch generator: 1.03 + samples/sec: 5.373 | iteration 204000/ 320000 | elapsed time per iteration (ms): 2977.9 | learning rate: 9.186E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.820435E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 689.93 | backward: 2158.73 | backward-backward: 2158.71 | backward-allreduce: 0.00 | optimizer: 128.75 | batch generator: 1.08 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 204000 | lm_loss value: 2.748774E+00 | lm_loss_ppl value: 1.562346E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 5.268 | iteration 204100/ 320000 | elapsed time per iteration (ms): 3037.0 | learning rate: 9.172E-05 | approx flops per GPU: 32.7TFLOPS | lm_loss: 2.823125E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 685.76 | backward: 2154.58 | backward-backward: 2154.56 | backward-allreduce: 0.00 | optimizer: 127.45 | batch generator: 1.10 + samples/sec: 5.413 | iteration 204200/ 320000 | elapsed time per iteration (ms): 2956.1 | learning rate: 9.159E-05 | approx flops per GPU: 33.6TFLOPS | lm_loss: 2.830126E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 684.75 | backward: 2144.19 | backward-backward: 2144.16 | backward-allreduce: 0.00 | optimizer: 126.61 | batch generator: 1.01 + samples/sec: 5.369 | iteration 204300/ 320000 | elapsed time per iteration (ms): 2979.9 | learning rate: 9.145E-05 | approx flops per 
GPU: 33.4TFLOPS | lm_loss: 2.820817E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 688.26 | backward: 2162.35 | backward-backward: 2162.32 | backward-allreduce: 0.00 | optimizer: 128.81 | batch generator: 1.08 + samples/sec: 5.383 | iteration 204400/ 320000 | elapsed time per iteration (ms): 2972.3 | learning rate: 9.132E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.814146E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 687.33 | backward: 2154.31 | backward-backward: 2154.28 | backward-allreduce: 0.00 | optimizer: 130.13 | batch generator: 1.01 + samples/sec: 5.402 | iteration 204500/ 320000 | elapsed time per iteration (ms): 2961.7 | learning rate: 9.118E-05 | approx flops per GPU: 33.6TFLOPS | lm_loss: 2.821695E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 685.92 | backward: 2147.68 | backward-backward: 2147.66 | backward-allreduce: 0.00 | optimizer: 127.59 | batch generator: 0.99 + samples/sec: 5.378 | iteration 204600/ 320000 | elapsed time per iteration (ms): 2974.9 | learning rate: 9.104E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.805771E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 687.36 | backward: 2158.28 | backward-backward: 2158.25 | backward-allreduce: 0.00 | optimizer: 128.75 | batch generator: 1.02 + samples/sec: 5.388 | iteration 204700/ 320000 | elapsed time per iteration (ms): 2969.5 | learning rate: 9.091E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.814683E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 686.27 | backward: 2156.17 | backward-backward: 2156.14 | backward-allreduce: 0.00 | optimizer: 126.52 | batch generator: 1.11 + samples/sec: 5.396 | iteration 204800/ 320000 | elapsed time per 
iteration (ms): 2965.2 | learning rate: 9.077E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.822700E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 685.84 | backward: 2149.99 | backward-backward: 2149.96 | backward-allreduce: 0.00 | optimizer: 128.84 | batch generator: 1.04 + samples/sec: 5.382 | iteration 204900/ 320000 | elapsed time per iteration (ms): 2972.9 | learning rate: 9.064E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.835482E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 686.94 | backward: 2156.39 | backward-backward: 2156.36 | backward-allreduce: 0.00 | optimizer: 129.09 | batch generator: 1.02 + samples/sec: 5.386 | iteration 205000/ 320000 | elapsed time per iteration (ms): 2970.6 | learning rate: 9.051E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.821820E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 685.16 | backward: 2157.51 | backward-backward: 2157.48 | backward-allreduce: 0.00 | optimizer: 127.41 | batch generator: 0.98 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 205000 | lm_loss value: 2.794773E+00 | lm_loss_ppl value: 1.635891E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 5.270 | iteration 205100/ 320000 | elapsed time per iteration (ms): 3036.2 | learning rate: 9.037E-05 | approx flops per GPU: 32.7TFLOPS | lm_loss: 2.849200E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 685.10 | backward: 2153.18 | backward-backward: 2153.16 | backward-allreduce: 0.00 | optimizer: 128.42 | batch generator: 1.08 + samples/sec: 5.390 | iteration 205200/ 320000 | elapsed time per iteration (ms): 2968.3 
| learning rate: 9.024E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.824833E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 685.89 | backward: 2153.15 | backward-backward: 2153.12 | backward-allreduce: 0.00 | optimizer: 128.78 | batch generator: 1.04 + samples/sec: 5.381 | iteration 205300/ 320000 | elapsed time per iteration (ms): 2973.4 | learning rate: 9.010E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.814316E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 688.22 | backward: 2156.68 | backward-backward: 2156.65 | backward-allreduce: 0.00 | optimizer: 127.96 | batch generator: 1.02 + samples/sec: 5.379 | iteration 205400/ 320000 | elapsed time per iteration (ms): 2974.4 | learning rate: 8.997E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.818949E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 686.91 | backward: 2156.75 | backward-backward: 2156.73 | backward-allreduce: 0.00 | optimizer: 130.23 | batch generator: 0.99 + samples/sec: 5.399 | iteration 205500/ 320000 | elapsed time per iteration (ms): 2963.6 | learning rate: 8.983E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.826435E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 686.07 | backward: 2150.03 | backward-backward: 2150.00 | backward-allreduce: 0.00 | optimizer: 127.00 | batch generator: 1.04 + samples/sec: 5.388 | iteration 205600/ 320000 | elapsed time per iteration (ms): 2969.3 | learning rate: 8.970E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.806267E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 686.86 | backward: 2154.38 | backward-backward: 2154.35 | backward-allreduce: 0.00 | optimizer: 127.57 | batch generator: 1.00 + samples/sec: 5.380 | 
iteration 205700/ 320000 | elapsed time per iteration (ms): 2974.1 | learning rate: 8.956E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.823875E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 687.77 | backward: 2157.10 | backward-backward: 2157.08 | backward-allreduce: 0.00 | optimizer: 128.75 | batch generator: 0.99 + samples/sec: 5.411 | iteration 205800/ 320000 | elapsed time per iteration (ms): 2956.8 | learning rate: 8.943E-05 | approx flops per GPU: 33.6TFLOPS | lm_loss: 2.821635E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 683.52 | backward: 2146.90 | backward-backward: 2146.88 | backward-allreduce: 0.00 | optimizer: 125.86 | batch generator: 0.99 + samples/sec: 5.389 | iteration 205900/ 320000 | elapsed time per iteration (ms): 2969.1 | learning rate: 8.929E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.819472E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 687.09 | backward: 2154.47 | backward-backward: 2154.45 | backward-allreduce: 0.00 | optimizer: 127.04 | batch generator: 1.05 + samples/sec: 5.375 | iteration 206000/ 320000 | elapsed time per iteration (ms): 2976.8 | learning rate: 8.916E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.823921E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 689.52 | backward: 2157.56 | backward-backward: 2157.54 | backward-allreduce: 0.00 | optimizer: 129.14 | batch generator: 1.03 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 206000 | lm_loss value: 2.880498E+00 | lm_loss_ppl value: 1.782315E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 5.286 | iteration 206100/ 
320000 | elapsed time per iteration (ms): 3027.1 | learning rate: 8.902E-05 | approx flops per GPU: 32.8TFLOPS | lm_loss: 2.816317E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 684.62 | backward: 2146.65 | backward-backward: 2146.62 | backward-allreduce: 0.00 | optimizer: 126.66 | batch generator: 1.13 + samples/sec: 5.384 | iteration 206200/ 320000 | elapsed time per iteration (ms): 2971.9 | learning rate: 8.889E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.806292E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 687.12 | backward: 2155.26 | backward-backward: 2155.23 | backward-allreduce: 0.00 | optimizer: 128.96 | batch generator: 1.09 + samples/sec: 5.385 | iteration 206300/ 320000 | elapsed time per iteration (ms): 2971.0 | learning rate: 8.876E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.809778E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 686.82 | backward: 2157.14 | backward-backward: 2157.12 | backward-allreduce: 0.00 | optimizer: 126.44 | batch generator: 1.02 + samples/sec: 5.378 | iteration 206400/ 320000 | elapsed time per iteration (ms): 2975.0 | learning rate: 8.862E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.827455E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 687.02 | backward: 2157.84 | backward-backward: 2157.81 | backward-allreduce: 0.00 | optimizer: 129.52 | batch generator: 1.04 + samples/sec: 5.380 | iteration 206500/ 320000 | elapsed time per iteration (ms): 2973.9 | learning rate: 8.849E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.802441E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 687.43 | backward: 2156.52 | backward-backward: 2156.50 | backward-allreduce: 0.00 | optimizer: 129.40 
| batch generator: 0.99 + samples/sec: 5.398 | iteration 206600/ 320000 | elapsed time per iteration (ms): 2963.9 | learning rate: 8.835E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.831519E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 686.42 | backward: 2151.11 | backward-backward: 2151.08 | backward-allreduce: 0.00 | optimizer: 125.89 | batch generator: 1.03 + samples/sec: 5.395 | iteration 206700/ 320000 | elapsed time per iteration (ms): 2965.7 | learning rate: 8.822E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.839472E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 687.18 | backward: 2150.72 | backward-backward: 2150.70 | backward-allreduce: 0.00 | optimizer: 127.35 | batch generator: 1.10 + samples/sec: 5.387 | iteration 206800/ 320000 | elapsed time per iteration (ms): 2970.3 | learning rate: 8.808E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.815447E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 686.51 | backward: 2155.66 | backward-backward: 2155.64 | backward-allreduce: 0.00 | optimizer: 127.59 | batch generator: 1.07 + samples/sec: 5.373 | iteration 206900/ 320000 | elapsed time per iteration (ms): 2978.1 | learning rate: 8.795E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.786901E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 687.58 | backward: 2161.89 | backward-backward: 2161.87 | backward-allreduce: 0.00 | optimizer: 128.07 | batch generator: 1.03 + samples/sec: 5.410 | iteration 207000/ 320000 | elapsed time per iteration (ms): 2957.5 | learning rate: 8.782E-05 | approx flops per GPU: 33.6TFLOPS | lm_loss: 2.813261E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 684.24 | backward: 2146.09 | 
backward-backward: 2146.07 | backward-allreduce: 0.00 | optimizer: 126.67 | batch generator: 1.06 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 207000 | lm_loss value: 2.868265E+00 | lm_loss_ppl value: 1.760644E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 5.259 | iteration 207100/ 320000 | elapsed time per iteration (ms): 3042.5 | learning rate: 8.768E-05 | approx flops per GPU: 32.7TFLOPS | lm_loss: 2.801133E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 687.88 | backward: 2158.25 | backward-backward: 2158.22 | backward-allreduce: 0.00 | optimizer: 127.09 | batch generator: 1.16 + samples/sec: 5.389 | iteration 207200/ 320000 | elapsed time per iteration (ms): 2969.0 | learning rate: 8.755E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.834846E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 687.42 | backward: 2152.78 | backward-backward: 2152.75 | backward-allreduce: 0.00 | optimizer: 128.23 | batch generator: 1.07 + samples/sec: 5.379 | iteration 207300/ 320000 | elapsed time per iteration (ms): 2974.3 | learning rate: 8.742E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.802786E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 686.40 | backward: 2158.22 | backward-backward: 2158.19 | backward-allreduce: 0.00 | optimizer: 129.11 | batch generator: 0.97 + samples/sec: 5.379 | iteration 207400/ 320000 | elapsed time per iteration (ms): 2974.4 | learning rate: 8.728E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.812039E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 688.62 | backward: 2157.45 | backward-backward: 
2157.42 | backward-allreduce: 0.00 | optimizer: 127.75 | batch generator: 1.20 + samples/sec: 5.395 | iteration 207500/ 320000 | elapsed time per iteration (ms): 2965.8 | learning rate: 8.715E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.821615E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 685.83 | backward: 2151.26 | backward-backward: 2151.24 | backward-allreduce: 0.00 | optimizer: 128.17 | batch generator: 0.99 + samples/sec: 5.378 | iteration 207600/ 320000 | elapsed time per iteration (ms): 2975.2 | learning rate: 8.701E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.812168E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 687.31 | backward: 2159.14 | backward-backward: 2159.12 | backward-allreduce: 0.00 | optimizer: 128.13 | batch generator: 1.08 + samples/sec: 5.384 | iteration 207700/ 320000 | elapsed time per iteration (ms): 2971.6 | learning rate: 8.688E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.803253E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 684.73 | backward: 2156.91 | backward-backward: 2156.88 | backward-allreduce: 0.00 | optimizer: 129.50 | batch generator: 1.05 + samples/sec: 5.416 | iteration 207800/ 320000 | elapsed time per iteration (ms): 2954.1 | learning rate: 8.675E-05 | approx flops per GPU: 33.6TFLOPS | lm_loss: 2.823365E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 684.56 | backward: 2143.76 | backward-backward: 2143.74 | backward-allreduce: 0.00 | optimizer: 125.24 | batch generator: 1.08 + samples/sec: 5.368 | iteration 207900/ 320000 | elapsed time per iteration (ms): 2980.6 | learning rate: 8.661E-05 | approx flops per GPU: 33.3TFLOPS | lm_loss: 2.814229E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
+time (ms) | forward: 688.68 | backward: 2161.38 | backward-backward: 2161.35 | backward-allreduce: 0.00 | optimizer: 130.02 | batch generator: 1.06 + samples/sec: 5.375 | iteration 208000/ 320000 | elapsed time per iteration (ms): 2976.5 | learning rate: 8.648E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.796286E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 688.35 | backward: 2156.27 | backward-backward: 2156.24 | backward-allreduce: 0.00 | optimizer: 131.37 | batch generator: 0.98 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 208000 | lm_loss value: 2.721405E+00 | lm_loss_ppl value: 1.520167E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 5.281 | iteration 208100/ 320000 | elapsed time per iteration (ms): 3030.0 | learning rate: 8.635E-05 | approx flops per GPU: 32.8TFLOPS | lm_loss: 2.783540E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 685.93 | backward: 2153.34 | backward-backward: 2153.32 | backward-allreduce: 0.00 | optimizer: 129.14 | batch generator: 1.10 + samples/sec: 5.386 | iteration 208200/ 320000 | elapsed time per iteration (ms): 2970.4 | learning rate: 8.621E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.806720E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 687.16 | backward: 2155.21 | backward-backward: 2155.19 | backward-allreduce: 0.00 | optimizer: 127.48 | batch generator: 1.03 + samples/sec: 5.386 | iteration 208300/ 320000 | elapsed time per iteration (ms): 2970.8 | learning rate: 8.608E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.804016E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 
688.77 | backward: 2153.29 | backward-backward: 2153.26 | backward-allreduce: 0.00 | optimizer: 128.26 | batch generator: 0.97 + samples/sec: 5.391 | iteration 208400/ 320000 | elapsed time per iteration (ms): 2967.6 | learning rate: 8.595E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.814471E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 687.59 | backward: 2149.88 | backward-backward: 2149.85 | backward-allreduce: 0.00 | optimizer: 129.56 | batch generator: 1.02 + samples/sec: 5.388 | iteration 208500/ 320000 | elapsed time per iteration (ms): 2969.4 | learning rate: 8.581E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.817336E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 688.16 | backward: 2153.75 | backward-backward: 2153.72 | backward-allreduce: 0.00 | optimizer: 127.00 | batch generator: 1.01 + samples/sec: 5.390 | iteration 208600/ 320000 | elapsed time per iteration (ms): 2968.4 | learning rate: 8.568E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.779337E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 684.29 | backward: 2154.01 | backward-backward: 2153.99 | backward-allreduce: 0.00 | optimizer: 129.63 | batch generator: 1.02 + samples/sec: 5.388 | iteration 208700/ 320000 | elapsed time per iteration (ms): 2969.4 | learning rate: 8.555E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.829275E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 686.36 | backward: 2153.33 | backward-backward: 2153.30 | backward-allreduce: 0.00 | optimizer: 129.16 | batch generator: 1.03 + samples/sec: 5.390 | iteration 208800/ 320000 | elapsed time per iteration (ms): 2968.4 | learning rate: 8.542E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.793587E+00 | loss scale: 32768.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 685.54 | backward: 2154.38 | backward-backward: 2154.36 | backward-allreduce: 0.00 | optimizer: 127.89 | batch generator: 1.01 + samples/sec: 5.396 | iteration 208900/ 320000 | elapsed time per iteration (ms): 2965.0 | learning rate: 8.528E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.796477E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 686.48 | backward: 2149.13 | backward-backward: 2149.10 | backward-allreduce: 0.00 | optimizer: 128.79 | batch generator: 1.09 + samples/sec: 5.387 | iteration 209000/ 320000 | elapsed time per iteration (ms): 2970.2 | learning rate: 8.515E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.789373E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 686.03 | backward: 2157.24 | backward-backward: 2157.21 | backward-allreduce: 0.00 | optimizer: 126.42 | batch generator: 0.94 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 209000 | lm_loss value: 2.856432E+00 | lm_loss_ppl value: 1.739933E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 5.288 | iteration 209100/ 320000 | elapsed time per iteration (ms): 3025.5 | learning rate: 8.502E-05 | approx flops per GPU: 32.9TFLOPS | lm_loss: 2.812476E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 683.66 | backward: 2145.75 | backward-backward: 2145.72 | backward-allreduce: 0.00 | optimizer: 126.81 | batch generator: 1.12 + samples/sec: 5.379 | iteration 209200/ 320000 | elapsed time per iteration (ms): 2974.8 | learning rate: 8.489E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.799283E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | +time (ms) | forward: 688.63 | backward: 2158.21 | backward-backward: 2158.19 | backward-allreduce: 0.00 | optimizer: 127.44 | batch generator: 1.00 + samples/sec: 5.398 | iteration 209300/ 320000 | elapsed time per iteration (ms): 2964.3 | learning rate: 8.475E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.815770E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 685.61 | backward: 2150.37 | backward-backward: 2150.34 | backward-allreduce: 0.00 | optimizer: 127.86 | batch generator: 0.99 + samples/sec: 5.397 | iteration 209400/ 320000 | elapsed time per iteration (ms): 2964.9 | learning rate: 8.462E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.804220E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 684.60 | backward: 2152.58 | backward-backward: 2152.56 | backward-allreduce: 0.00 | optimizer: 127.15 | batch generator: 1.01 + samples/sec: 5.386 | iteration 209500/ 320000 | elapsed time per iteration (ms): 2970.8 | learning rate: 8.449E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.797120E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 687.68 | backward: 2153.94 | backward-backward: 2153.92 | backward-allreduce: 0.00 | optimizer: 128.62 | batch generator: 1.08 + samples/sec: 5.395 | iteration 209600/ 320000 | elapsed time per iteration (ms): 2965.8 | learning rate: 8.436E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.808018E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 685.53 | backward: 2151.53 | backward-backward: 2151.50 | backward-allreduce: 0.00 | optimizer: 128.27 | batch generator: 1.01 + samples/sec: 5.394 | iteration 209700/ 320000 | elapsed time per iteration (ms): 2966.2 | learning rate: 8.422E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.794668E+00 
| loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 685.24 | backward: 2153.66 | backward-backward: 2153.63 | backward-allreduce: 0.00 | optimizer: 126.77 | batch generator: 1.05 + samples/sec: 5.388 | iteration 209800/ 320000 | elapsed time per iteration (ms): 2969.5 | learning rate: 8.409E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.825013E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 687.19 | backward: 2152.75 | backward-backward: 2152.73 | backward-allreduce: 0.00 | optimizer: 129.08 | batch generator: 1.02 + samples/sec: 5.386 | iteration 209900/ 320000 | elapsed time per iteration (ms): 2970.7 | learning rate: 8.396E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.802656E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 687.65 | backward: 2154.13 | backward-backward: 2154.10 | backward-allreduce: 0.00 | optimizer: 128.46 | batch generator: 1.07 + samples/sec: 5.403 | iteration 210000/ 320000 | elapsed time per iteration (ms): 2961.1 | learning rate: 8.383E-05 | approx flops per GPU: 33.6TFLOPS | lm_loss: 2.786638E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 685.78 | backward: 2148.64 | backward-backward: 2148.62 | backward-allreduce: 0.00 | optimizer: 126.14 | batch generator: 1.10 +WARNING: Deleting old checkpoints: + checkpoints-fcm/global_step110000 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 210000 | lm_loss value: 2.820584E+00 | lm_loss_ppl value: 1.678665E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 5.118 | iteration 210100/ 320000 | elapsed time per iteration (ms): 3126.0 | learning rate: 8.369E-05 | approx 
flops per GPU: 31.8TFLOPS | lm_loss: 2.790081E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 688.62 | backward: 2157.92 | backward-backward: 2157.90 | backward-allreduce: 0.00 | optimizer: 128.36 | batch generator: 1.14 + samples/sec: 5.387 | iteration 210200/ 320000 | elapsed time per iteration (ms): 2970.1 | learning rate: 8.356E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.791583E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 684.67 | backward: 2157.91 | backward-backward: 2157.88 | backward-allreduce: 0.00 | optimizer: 126.99 | batch generator: 0.97 + samples/sec: 5.393 | iteration 210300/ 320000 | elapsed time per iteration (ms): 2966.8 | learning rate: 8.343E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.806267E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 689.26 | backward: 2149.74 | backward-backward: 2149.71 | backward-allreduce: 0.00 | optimizer: 127.28 | batch generator: 1.04 + samples/sec: 5.378 | iteration 210400/ 320000 | elapsed time per iteration (ms): 2975.0 | learning rate: 8.330E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.804817E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 686.50 | backward: 2158.60 | backward-backward: 2158.58 | backward-allreduce: 0.00 | optimizer: 129.35 | batch generator: 1.11 + samples/sec: 5.400 | iteration 210500/ 320000 | elapsed time per iteration (ms): 2963.2 | learning rate: 8.317E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.824088E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 683.74 | backward: 2151.15 | backward-backward: 2151.13 | backward-allreduce: 0.00 | optimizer: 127.73 | batch generator: 0.95 + samples/sec: 5.380 | iteration 210600/ 320000 | elapsed 
time per iteration (ms): 2974.0 | learning rate: 8.304E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.804763E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 689.16 | backward: 2156.14 | backward-backward: 2156.11 | backward-allreduce: 0.00 | optimizer: 128.26 | batch generator: 1.03 + samples/sec: 5.390 | iteration 210700/ 320000 | elapsed time per iteration (ms): 2968.6 | learning rate: 8.291E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.805580E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 686.19 | backward: 2154.91 | backward-backward: 2154.89 | backward-allreduce: 0.00 | optimizer: 126.97 | batch generator: 0.98 + samples/sec: 5.401 | iteration 210800/ 320000 | elapsed time per iteration (ms): 2962.2 | learning rate: 8.277E-05 | approx flops per GPU: 33.6TFLOPS | lm_loss: 2.787581E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 687.54 | backward: 2147.05 | backward-backward: 2147.03 | backward-allreduce: 0.00 | optimizer: 127.11 | batch generator: 0.98 + samples/sec: 5.370 | iteration 210900/ 320000 | elapsed time per iteration (ms): 2979.6 | learning rate: 8.264E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.790507E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 690.14 | backward: 2160.19 | backward-backward: 2160.17 | backward-allreduce: 0.00 | optimizer: 128.74 | batch generator: 1.05 + samples/sec: 5.407 | iteration 211000/ 320000 | elapsed time per iteration (ms): 2959.2 | learning rate: 8.251E-05 | approx flops per GPU: 33.6TFLOPS | lm_loss: 2.791617E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 683.36 | backward: 2148.63 | backward-backward: 2148.60 | backward-allreduce: 0.00 | optimizer: 126.71 | batch 
generator: 1.00 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 211000 | lm_loss value: 2.734311E+00 | lm_loss_ppl value: 1.539913E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 5.263 | iteration 211100/ 320000 | elapsed time per iteration (ms): 3039.8 | learning rate: 8.238E-05 | approx flops per GPU: 32.7TFLOPS | lm_loss: 2.783865E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 686.17 | backward: 2156.46 | backward-backward: 2156.44 | backward-allreduce: 0.00 | optimizer: 127.65 | batch generator: 1.11 + samples/sec: 5.379 | iteration 211200/ 320000 | elapsed time per iteration (ms): 2974.8 | learning rate: 8.225E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.792607E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 688.44 | backward: 2158.33 | backward-backward: 2158.31 | backward-allreduce: 0.00 | optimizer: 127.47 | batch generator: 0.96 + samples/sec: 5.393 | iteration 211300/ 320000 | elapsed time per iteration (ms): 2966.7 | learning rate: 8.212E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.786385E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 686.44 | backward: 2152.42 | backward-backward: 2152.39 | backward-allreduce: 0.00 | optimizer: 127.26 | batch generator: 1.10 + samples/sec: 5.393 | iteration 211400/ 320000 | elapsed time per iteration (ms): 2966.7 | learning rate: 8.199E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.797976E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 684.81 | backward: 2154.78 | backward-backward: 2154.75 | backward-allreduce: 0.00 | optimizer: 126.68 | batch generator: 0.98 + 
samples/sec: 5.386 | iteration 211500/ 320000 | elapsed time per iteration (ms): 2970.8 | learning rate: 8.186E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.790131E+00 | loss scale: 16384.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 688.22 | backward: 2154.43 | backward-backward: 2154.40 | backward-allreduce: 0.00 | optimizer: 127.57 | batch generator: 1.05 + samples/sec: 5.386 | iteration 211600/ 320000 | elapsed time per iteration (ms): 2970.7 | learning rate: 8.172E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.775264E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 687.34 | backward: 2154.79 | backward-backward: 2154.77 | backward-allreduce: 0.00 | optimizer: 128.00 | batch generator: 1.11 + samples/sec: 5.393 | iteration 211700/ 320000 | elapsed time per iteration (ms): 2966.8 | learning rate: 8.159E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.794841E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 686.36 | backward: 2152.66 | backward-backward: 2152.64 | backward-allreduce: 0.00 | optimizer: 127.23 | batch generator: 1.08 + samples/sec: 5.383 | iteration 211800/ 320000 | elapsed time per iteration (ms): 2972.6 | learning rate: 8.146E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.770604E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 688.62 | backward: 2156.51 | backward-backward: 2156.48 | backward-allreduce: 0.00 | optimizer: 126.97 | batch generator: 1.08 + samples/sec: 5.406 | iteration 211900/ 320000 | elapsed time per iteration (ms): 2959.8 | learning rate: 8.133E-05 | approx flops per GPU: 33.6TFLOPS | lm_loss: 2.790223E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 686.41 | backward: 2146.05 | backward-backward: 2146.02 | 
backward-allreduce: 0.00 | optimizer: 126.77 | batch generator: 1.07 + samples/sec: 5.361 | iteration 212000/ 320000 | elapsed time per iteration (ms): 2984.3 | learning rate: 8.120E-05 | approx flops per GPU: 33.3TFLOPS | lm_loss: 2.795227E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 689.54 | backward: 2163.30 | backward-backward: 2163.28 | backward-allreduce: 0.00 | optimizer: 130.90 | batch generator: 1.10 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 212000 | lm_loss value: 2.771917E+00 | lm_loss_ppl value: 1.598925E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 5.284 | iteration 212100/ 320000 | elapsed time per iteration (ms): 3028.2 | learning rate: 8.107E-05 | approx flops per GPU: 32.8TFLOPS | lm_loss: 2.821415E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 684.22 | backward: 2147.66 | backward-backward: 2147.64 | backward-allreduce: 0.00 | optimizer: 127.09 | batch generator: 1.11 + samples/sec: 5.364 | iteration 212200/ 320000 | elapsed time per iteration (ms): 2983.0 | learning rate: 8.094E-05 | approx flops per GPU: 33.3TFLOPS | lm_loss: 2.787452E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 690.66 | backward: 2162.32 | backward-backward: 2162.29 | backward-allreduce: 0.00 | optimizer: 129.43 | batch generator: 1.06 + samples/sec: 5.376 | iteration 212300/ 320000 | elapsed time per iteration (ms): 2976.3 | learning rate: 8.081E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.786998E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 689.28 | backward: 2156.98 | backward-backward: 2156.96 | backward-allreduce: 
0.00 | optimizer: 129.49 | batch generator: 1.06 + samples/sec: 5.117 | iteration 212400/ 320000 | elapsed time per iteration (ms): 3127.0 | learning rate: 8.068E-05 | approx flops per GPU: 31.8TFLOPS | lm_loss: 2.793729E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 689.02 | backward: 2274.06 | backward-backward: 2274.03 | backward-allreduce: 0.00 | optimizer: 163.21 | batch generator: 1.00 + samples/sec: 4.952 | iteration 212500/ 320000 | elapsed time per iteration (ms): 3231.1 | learning rate: 8.055E-05 | approx flops per GPU: 30.8TFLOPS | lm_loss: 2.793296E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 688.20 | backward: 2354.51 | backward-backward: 2354.47 | backward-allreduce: 0.00 | optimizer: 187.58 | batch generator: 1.07 + samples/sec: 4.932 | iteration 212600/ 320000 | elapsed time per iteration (ms): 3244.1 | learning rate: 8.042E-05 | approx flops per GPU: 30.6TFLOPS | lm_loss: 2.791596E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 690.17 | backward: 2363.38 | backward-backward: 2363.34 | backward-allreduce: 0.00 | optimizer: 189.66 | batch generator: 1.04 + samples/sec: 4.954 | iteration 212700/ 320000 | elapsed time per iteration (ms): 3229.6 | learning rate: 8.029E-05 | approx flops per GPU: 30.8TFLOPS | lm_loss: 2.788089E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 690.29 | backward: 2349.72 | backward-backward: 2349.69 | backward-allreduce: 0.00 | optimizer: 188.79 | batch generator: 1.12 + samples/sec: 4.938 | iteration 212800/ 320000 | elapsed time per iteration (ms): 3240.5 | learning rate: 8.016E-05 | approx flops per GPU: 30.7TFLOPS | lm_loss: 2.800580E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 690.18 | 
backward: 2359.77 | backward-backward: 2359.74 | backward-allreduce: 0.00 | optimizer: 189.77 | batch generator: 1.12 + samples/sec: 4.955 | iteration 212900/ 320000 | elapsed time per iteration (ms): 3229.0 | learning rate: 8.003E-05 | approx flops per GPU: 30.8TFLOPS | lm_loss: 2.794990E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 687.27 | backward: 2352.08 | backward-backward: 2352.04 | backward-allreduce: 0.00 | optimizer: 188.80 | batch generator: 1.08 + samples/sec: 4.934 | iteration 213000/ 320000 | elapsed time per iteration (ms): 3242.5 | learning rate: 7.990E-05 | approx flops per GPU: 30.7TFLOPS | lm_loss: 2.776247E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 688.98 | backward: 2360.06 | backward-backward: 2360.02 | backward-allreduce: 0.00 | optimizer: 192.72 | batch generator: 1.07 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 213000 | lm_loss value: 2.719934E+00 | lm_loss_ppl value: 1.517932E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 4.854 | iteration 213100/ 320000 | elapsed time per iteration (ms): 3296.3 | learning rate: 7.976E-05 | approx flops per GPU: 30.2TFLOPS | lm_loss: 2.792831E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 689.46 | backward: 2346.60 | backward-backward: 2346.56 | backward-allreduce: 0.00 | optimizer: 186.24 | batch generator: 1.22 + samples/sec: 4.935 | iteration 213200/ 320000 | elapsed time per iteration (ms): 3241.8 | learning rate: 7.963E-05 | approx flops per GPU: 30.7TFLOPS | lm_loss: 2.790498E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 691.71 | backward: 2358.10 | 
backward-backward: 2358.07 | backward-allreduce: 0.00 | optimizer: 191.27 | batch generator: 1.06 + samples/sec: 4.957 | iteration 213300/ 320000 | elapsed time per iteration (ms): 3227.8 | learning rate: 7.950E-05 | approx flops per GPU: 30.8TFLOPS | lm_loss: 2.791912E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 688.75 | backward: 2350.67 | backward-backward: 2350.64 | backward-allreduce: 0.00 | optimizer: 187.70 | batch generator: 1.17 + samples/sec: 4.928 | iteration 213400/ 320000 | elapsed time per iteration (ms): 3246.7 | learning rate: 7.937E-05 | approx flops per GPU: 30.6TFLOPS | lm_loss: 2.817937E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 688.98 | backward: 2365.06 | backward-backward: 2365.02 | backward-allreduce: 0.00 | optimizer: 191.86 | batch generator: 1.08 + samples/sec: 4.942 | iteration 213500/ 320000 | elapsed time per iteration (ms): 3237.4 | learning rate: 7.925E-05 | approx flops per GPU: 30.7TFLOPS | lm_loss: 2.777730E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 689.69 | backward: 2357.68 | backward-backward: 2357.65 | backward-allreduce: 0.00 | optimizer: 189.17 | batch generator: 1.08 + samples/sec: 4.950 | iteration 213600/ 320000 | elapsed time per iteration (ms): 3232.2 | learning rate: 7.912E-05 | approx flops per GPU: 30.8TFLOPS | lm_loss: 2.765374E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 688.85 | backward: 2354.49 | backward-backward: 2354.46 | backward-allreduce: 0.00 | optimizer: 188.06 | batch generator: 1.02 + samples/sec: 4.940 | iteration 213700/ 320000 | elapsed time per iteration (ms): 3238.5 | learning rate: 7.899E-05 | approx flops per GPU: 30.7TFLOPS | lm_loss: 2.809930E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | +time (ms) | forward: 689.04 | backward: 2360.86 | backward-backward: 2360.83 | backward-allreduce: 0.00 | optimizer: 187.80 | batch generator: 1.07 + samples/sec: 4.969 | iteration 213800/ 320000 | elapsed time per iteration (ms): 3219.7 | learning rate: 7.886E-05 | approx flops per GPU: 30.9TFLOPS | lm_loss: 2.817702E+00 | loss scale: 32768.0 | number of skipped iterations: 2 | number of nan iterations: 0 | +time (ms) | forward: 689.39 | backward: 2344.29 | backward-backward: 2344.25 | backward-allreduce: 0.00 | optimizer: 185.29 | batch generator: 1.05 + samples/sec: 4.931 | iteration 213900/ 320000 | elapsed time per iteration (ms): 3244.8 | learning rate: 7.873E-05 | approx flops per GPU: 30.6TFLOPS | lm_loss: 2.793356E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 688.23 | backward: 2363.36 | backward-backward: 2363.33 | backward-allreduce: 0.00 | optimizer: 192.41 | batch generator: 1.06 + samples/sec: 4.966 | iteration 214000/ 320000 | elapsed time per iteration (ms): 3222.2 | learning rate: 7.860E-05 | approx flops per GPU: 30.8TFLOPS | lm_loss: 2.794073E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 688.16 | backward: 2346.75 | backward-backward: 2346.71 | backward-allreduce: 0.00 | optimizer: 186.44 | batch generator: 1.02 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 214000 | lm_loss value: 2.692934E+00 | lm_loss_ppl value: 1.477496E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 4.828 | iteration 214100/ 320000 | elapsed time per iteration (ms): 3313.9 | learning rate: 7.847E-05 | approx flops per GPU: 30.0TFLOPS | lm_loss: 2.803437E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time 
(ms) | forward: 690.79 | backward: 2359.75 | backward-backward: 2359.72 | backward-allreduce: 0.00 | optimizer: 189.63 | batch generator: 1.20 + samples/sec: 4.961 | iteration 214200/ 320000 | elapsed time per iteration (ms): 3224.9 | learning rate: 7.834E-05 | approx flops per GPU: 30.8TFLOPS | lm_loss: 2.795512E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 689.76 | backward: 2347.21 | backward-backward: 2347.18 | backward-allreduce: 0.00 | optimizer: 187.27 | batch generator: 1.07 + samples/sec: 4.927 | iteration 214300/ 320000 | elapsed time per iteration (ms): 3247.3 | learning rate: 7.821E-05 | approx flops per GPU: 30.6TFLOPS | lm_loss: 2.805640E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 690.33 | backward: 2364.98 | backward-backward: 2364.95 | backward-allreduce: 0.00 | optimizer: 191.06 | batch generator: 1.09 + samples/sec: 4.957 | iteration 214400/ 320000 | elapsed time per iteration (ms): 3227.4 | learning rate: 7.808E-05 | approx flops per GPU: 30.8TFLOPS | lm_loss: 2.775006E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 686.68 | backward: 2351.18 | backward-backward: 2351.15 | backward-allreduce: 0.00 | optimizer: 188.89 | batch generator: 1.00 + samples/sec: 4.926 | iteration 214500/ 320000 | elapsed time per iteration (ms): 3248.3 | learning rate: 7.795E-05 | approx flops per GPU: 30.6TFLOPS | lm_loss: 2.770289E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 690.22 | backward: 2365.01 | backward-backward: 2364.98 | backward-allreduce: 0.00 | optimizer: 192.27 | batch generator: 1.06 + samples/sec: 4.964 | iteration 214600/ 320000 | elapsed time per iteration (ms): 3223.0 | learning rate: 7.782E-05 | approx flops per GPU: 30.8TFLOPS | lm_loss: 2.787204E+00 | loss scale: 32768.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 688.46 | backward: 2345.78 | backward-backward: 2345.75 | backward-allreduce: 0.00 | optimizer: 187.97 | batch generator: 0.97 + samples/sec: 4.947 | iteration 214700/ 320000 | elapsed time per iteration (ms): 3234.6 | learning rate: 7.769E-05 | approx flops per GPU: 30.7TFLOPS | lm_loss: 2.774275E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 688.54 | backward: 2356.05 | backward-backward: 2356.02 | backward-allreduce: 0.00 | optimizer: 189.11 | batch generator: 1.03 + samples/sec: 4.944 | iteration 214800/ 320000 | elapsed time per iteration (ms): 3236.4 | learning rate: 7.757E-05 | approx flops per GPU: 30.7TFLOPS | lm_loss: 2.756038E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 688.54 | backward: 2359.30 | backward-backward: 2359.27 | backward-allreduce: 0.00 | optimizer: 187.73 | batch generator: 1.09 + samples/sec: 4.945 | iteration 214900/ 320000 | elapsed time per iteration (ms): 3235.7 | learning rate: 7.744E-05 | approx flops per GPU: 30.7TFLOPS | lm_loss: 2.809406E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 690.58 | backward: 2352.78 | backward-backward: 2352.75 | backward-allreduce: 0.00 | optimizer: 191.62 | batch generator: 1.09 + samples/sec: 4.937 | iteration 215000/ 320000 | elapsed time per iteration (ms): 3241.1 | learning rate: 7.731E-05 | approx flops per GPU: 30.7TFLOPS | lm_loss: 2.767548E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 691.14 | backward: 2358.76 | backward-backward: 2358.73 | backward-allreduce: 0.00 | optimizer: 190.41 | batch generator: 1.06 +----------------------------------------------------------------------------------------------------------- + validation results at 
iteration 215000 | lm_loss value: 2.755270E+00 | lm_loss_ppl value: 1.572529E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 4.851 | iteration 215100/ 320000 | elapsed time per iteration (ms): 3298.0 | learning rate: 7.718E-05 | approx flops per GPU: 30.1TFLOPS | lm_loss: 2.779788E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 688.69 | backward: 2348.49 | backward-backward: 2348.46 | backward-allreduce: 0.00 | optimizer: 186.03 | batch generator: 1.18 + samples/sec: 4.928 | iteration 215200/ 320000 | elapsed time per iteration (ms): 3246.9 | learning rate: 7.705E-05 | approx flops per GPU: 30.6TFLOPS | lm_loss: 2.781298E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 690.30 | backward: 2365.24 | backward-backward: 2365.21 | backward-allreduce: 0.00 | optimizer: 190.48 | batch generator: 1.09 + samples/sec: 4.970 | iteration 215300/ 320000 | elapsed time per iteration (ms): 3219.3 | learning rate: 7.692E-05 | approx flops per GPU: 30.9TFLOPS | lm_loss: 2.796955E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 689.47 | backward: 2345.87 | backward-backward: 2345.84 | backward-allreduce: 0.00 | optimizer: 183.10 | batch generator: 1.04 + samples/sec: 4.927 | iteration 215400/ 320000 | elapsed time per iteration (ms): 3247.2 | learning rate: 7.680E-05 | approx flops per GPU: 30.6TFLOPS | lm_loss: 2.794391E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 691.04 | backward: 2363.07 | backward-backward: 2363.03 | backward-allreduce: 0.00 | optimizer: 192.31 | batch generator: 1.00 + samples/sec: 4.957 | iteration 215500/ 320000 | elapsed time per iteration (ms): 3227.8 | learning rate: 7.667E-05 | approx flops per GPU: 30.8TFLOPS | 
lm_loss: 2.797202E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 689.17 | backward: 2350.80 | backward-backward: 2350.77 | backward-allreduce: 0.00 | optimizer: 187.09 | batch generator: 1.00 + samples/sec: 4.924 | iteration 215600/ 320000 | elapsed time per iteration (ms): 3249.1 | learning rate: 7.654E-05 | approx flops per GPU: 30.6TFLOPS | lm_loss: 2.781793E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 689.80 | backward: 2367.43 | backward-backward: 2367.40 | backward-allreduce: 0.00 | optimizer: 191.05 | batch generator: 0.99 + samples/sec: 4.949 | iteration 215700/ 320000 | elapsed time per iteration (ms): 3233.3 | learning rate: 7.641E-05 | approx flops per GPU: 30.7TFLOPS | lm_loss: 2.797812E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 688.37 | backward: 2353.61 | backward-backward: 2353.57 | backward-allreduce: 0.00 | optimizer: 190.62 | batch generator: 0.98 + samples/sec: 4.916 | iteration 215800/ 320000 | elapsed time per iteration (ms): 3254.9 | learning rate: 7.628E-05 | approx flops per GPU: 30.5TFLOPS | lm_loss: 2.777245E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 695.26 | backward: 2367.57 | backward-backward: 2367.54 | backward-allreduce: 0.00 | optimizer: 191.20 | batch generator: 1.19 + samples/sec: 5.404 | iteration 215900/ 320000 | elapsed time per iteration (ms): 2960.9 | learning rate: 7.615E-05 | approx flops per GPU: 33.6TFLOPS | lm_loss: 2.776783E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 686.20 | backward: 2145.43 | backward-backward: 2145.41 | backward-allreduce: 0.00 | optimizer: 128.67 | batch generator: 1.07 + samples/sec: 5.386 | iteration 216000/ 320000 | elapsed time per iteration (ms): 
2970.6 | learning rate: 7.603E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.787635E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 687.09 | backward: 2156.89 | backward-backward: 2156.86 | backward-allreduce: 0.00 | optimizer: 126.06 | batch generator: 1.02 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 216000 | lm_loss value: 2.823956E+00 | lm_loss_ppl value: 1.684336E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 5.270 | iteration 216100/ 320000 | elapsed time per iteration (ms): 3036.1 | learning rate: 7.590E-05 | approx flops per GPU: 32.7TFLOPS | lm_loss: 2.783008E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 687.17 | backward: 2151.89 | backward-backward: 2151.86 | backward-allreduce: 0.00 | optimizer: 127.64 | batch generator: 1.21 + samples/sec: 5.383 | iteration 216200/ 320000 | elapsed time per iteration (ms): 2972.2 | learning rate: 7.577E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.774137E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 687.11 | backward: 2157.56 | backward-backward: 2157.54 | backward-allreduce: 0.00 | optimizer: 127.02 | batch generator: 1.01 + samples/sec: 5.402 | iteration 216300/ 320000 | elapsed time per iteration (ms): 2961.6 | learning rate: 7.564E-05 | approx flops per GPU: 33.6TFLOPS | lm_loss: 2.774546E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 684.30 | backward: 2150.57 | backward-backward: 2150.55 | backward-allreduce: 0.00 | optimizer: 126.27 | batch generator: 1.01 + samples/sec: 5.368 | iteration 216400/ 320000 | elapsed time per iteration (ms): 2980.5 | learning 
rate: 7.551E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.779686E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 688.31 | backward: 2162.49 | backward-backward: 2162.47 | backward-allreduce: 0.00 | optimizer: 129.13 | batch generator: 1.08 + samples/sec: 5.403 | iteration 216500/ 320000 | elapsed time per iteration (ms): 2961.5 | learning rate: 7.539E-05 | approx flops per GPU: 33.6TFLOPS | lm_loss: 2.782499E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 687.23 | backward: 2146.80 | backward-backward: 2146.78 | backward-allreduce: 0.00 | optimizer: 126.89 | batch generator: 1.03 + samples/sec: 5.381 | iteration 216600/ 320000 | elapsed time per iteration (ms): 2973.6 | learning rate: 7.526E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.773435E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 686.78 | backward: 2157.55 | backward-backward: 2157.52 | backward-allreduce: 0.00 | optimizer: 128.71 | batch generator: 1.01 + samples/sec: 5.380 | iteration 216700/ 320000 | elapsed time per iteration (ms): 2974.3 | learning rate: 7.513E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.786795E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 686.69 | backward: 2158.88 | backward-backward: 2158.86 | backward-allreduce: 0.00 | optimizer: 128.17 | batch generator: 1.08 + samples/sec: 5.389 | iteration 216800/ 320000 | elapsed time per iteration (ms): 2968.9 | learning rate: 7.501E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.785215E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 687.92 | backward: 2152.83 | backward-backward: 2152.80 | backward-allreduce: 0.00 | optimizer: 127.64 | batch generator: 1.01 + samples/sec: 5.400 | iteration 
216900/ 320000 | elapsed time per iteration (ms): 2962.8 | learning rate: 7.488E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.788842E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 686.67 | backward: 2149.18 | backward-backward: 2149.16 | backward-allreduce: 0.00 | optimizer: 126.39 | batch generator: 1.14 + samples/sec: 5.376 | iteration 217000/ 320000 | elapsed time per iteration (ms): 2976.0 | learning rate: 7.475E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.761383E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 689.01 | backward: 2158.14 | backward-backward: 2158.11 | backward-allreduce: 0.00 | optimizer: 128.36 | batch generator: 1.04 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 217000 | lm_loss value: 2.824639E+00 | lm_loss_ppl value: 1.685486E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 5.268 | iteration 217100/ 320000 | elapsed time per iteration (ms): 3037.1 | learning rate: 7.462E-05 | approx flops per GPU: 32.7TFLOPS | lm_loss: 2.804192E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 686.54 | backward: 2153.46 | backward-backward: 2153.43 | backward-allreduce: 0.00 | optimizer: 127.84 | batch generator: 1.10 + samples/sec: 5.404 | iteration 217200/ 320000 | elapsed time per iteration (ms): 2960.8 | learning rate: 7.450E-05 | approx flops per GPU: 33.6TFLOPS | lm_loss: 2.789897E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 684.64 | backward: 2148.85 | backward-backward: 2148.83 | backward-allreduce: 0.00 | optimizer: 126.79 | batch generator: 1.01 + samples/sec: 5.374 | iteration 217300/ 320000 | 
elapsed time per iteration (ms): 2977.3 | learning rate: 7.437E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.777922E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 687.63 | backward: 2160.32 | backward-backward: 2160.30 | backward-allreduce: 0.00 | optimizer: 128.80 | batch generator: 1.05 + samples/sec: 5.400 | iteration 217400/ 320000 | elapsed time per iteration (ms): 2963.2 | learning rate: 7.424E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.776813E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 686.12 | backward: 2149.35 | backward-backward: 2149.32 | backward-allreduce: 0.00 | optimizer: 127.23 | batch generator: 0.98 + samples/sec: 5.388 | iteration 217500/ 320000 | elapsed time per iteration (ms): 2969.5 | learning rate: 7.412E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.800643E+00 | loss scale: 16384.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 686.49 | backward: 2155.88 | backward-backward: 2155.86 | backward-allreduce: 0.00 | optimizer: 126.63 | batch generator: 1.05 + samples/sec: 5.385 | iteration 217600/ 320000 | elapsed time per iteration (ms): 2971.5 | learning rate: 7.399E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.785008E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 686.83 | backward: 2154.76 | backward-backward: 2154.74 | backward-allreduce: 0.00 | optimizer: 129.38 | batch generator: 1.01 + samples/sec: 5.382 | iteration 217700/ 320000 | elapsed time per iteration (ms): 2972.8 | learning rate: 7.386E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.790739E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 689.71 | backward: 2152.86 | backward-backward: 2152.84 | backward-allreduce: 0.00 | optimizer: 129.71 | batch 
generator: 1.07 + samples/sec: 5.404 | iteration 217800/ 320000 | elapsed time per iteration (ms): 2960.6 | learning rate: 7.374E-05 | approx flops per GPU: 33.6TFLOPS | lm_loss: 2.750977E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 684.63 | backward: 2149.69 | backward-backward: 2149.66 | backward-allreduce: 0.00 | optimizer: 125.75 | batch generator: 0.94 + samples/sec: 5.379 | iteration 217900/ 320000 | elapsed time per iteration (ms): 2974.8 | learning rate: 7.361E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.794588E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 688.24 | backward: 2157.31 | backward-backward: 2157.29 | backward-allreduce: 0.00 | optimizer: 128.74 | batch generator: 1.04 + samples/sec: 5.385 | iteration 218000/ 320000 | elapsed time per iteration (ms): 2971.2 | learning rate: 7.348E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.779672E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 685.96 | backward: 2156.79 | backward-backward: 2156.76 | backward-allreduce: 0.00 | optimizer: 127.96 | batch generator: 1.06 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 218000 | lm_loss value: 2.787642E+00 | lm_loss_ppl value: 1.624267E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 5.276 | iteration 218100/ 320000 | elapsed time per iteration (ms): 3032.8 | learning rate: 7.335E-05 | approx flops per GPU: 32.8TFLOPS | lm_loss: 2.781361E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 685.84 | backward: 2147.98 | backward-backward: 2147.95 | backward-allreduce: 0.00 | optimizer: 129.71 | batch generator: 1.12 + 
samples/sec: 5.372 | iteration 218200/ 320000 | elapsed time per iteration (ms): 2978.2 | learning rate: 7.323E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.769239E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 688.61 | backward: 2160.32 | backward-backward: 2160.29 | backward-allreduce: 0.00 | optimizer: 128.76 | batch generator: 0.97 + samples/sec: 5.395 | iteration 218300/ 320000 | elapsed time per iteration (ms): 2965.9 | learning rate: 7.310E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.770122E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 686.36 | backward: 2151.60 | backward-backward: 2151.58 | backward-allreduce: 0.00 | optimizer: 127.43 | batch generator: 1.00 + samples/sec: 5.396 | iteration 218400/ 320000 | elapsed time per iteration (ms): 2965.3 | learning rate: 7.298E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.770840E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 686.73 | backward: 2150.67 | backward-backward: 2150.65 | backward-allreduce: 0.00 | optimizer: 127.38 | batch generator: 1.08 + samples/sec: 5.378 | iteration 218500/ 320000 | elapsed time per iteration (ms): 2975.3 | learning rate: 7.285E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.775885E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 688.87 | backward: 2156.74 | backward-backward: 2156.71 | backward-allreduce: 0.00 | optimizer: 129.19 | batch generator: 1.10 + samples/sec: 5.390 | iteration 218600/ 320000 | elapsed time per iteration (ms): 2968.7 | learning rate: 7.272E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.772968E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 686.65 | backward: 2154.02 | backward-backward: 2154.00 | 
backward-allreduce: 0.00 | optimizer: 127.50 | batch generator: 0.99 + samples/sec: 5.386 | iteration 218700/ 320000 | elapsed time per iteration (ms): 2970.9 | learning rate: 7.260E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.792061E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 687.34 | backward: 2155.30 | backward-backward: 2155.27 | backward-allreduce: 0.00 | optimizer: 127.77 | batch generator: 1.01 + samples/sec: 5.385 | iteration 218800/ 320000 | elapsed time per iteration (ms): 2971.0 | learning rate: 7.247E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.777117E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 686.67 | backward: 2155.39 | backward-backward: 2155.36 | backward-allreduce: 0.00 | optimizer: 128.46 | batch generator: 1.02 + samples/sec: 5.388 | iteration 218900/ 320000 | elapsed time per iteration (ms): 2969.5 | learning rate: 7.234E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.769380E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 687.78 | backward: 2153.65 | backward-backward: 2153.62 | backward-allreduce: 0.00 | optimizer: 127.54 | batch generator: 1.05 + samples/sec: 5.383 | iteration 219000/ 320000 | elapsed time per iteration (ms): 2972.4 | learning rate: 7.222E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.783018E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 686.89 | backward: 2157.46 | backward-backward: 2157.44 | backward-allreduce: 0.00 | optimizer: 127.51 | batch generator: 1.04 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 219000 | lm_loss value: 2.726906E+00 | lm_loss_ppl value: 1.528552E+01 | 
+----------------------------------------------------------------------------------------------------------- + samples/sec: 5.262 | iteration 219100/ 320000 | elapsed time per iteration (ms): 3040.8 | learning rate: 7.209E-05 | approx flops per GPU: 32.7TFLOPS | lm_loss: 2.783333E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 687.13 | backward: 2161.38 | backward-backward: 2161.35 | backward-allreduce: 0.00 | optimizer: 129.27 | batch generator: 1.15 + samples/sec: 5.400 | iteration 219200/ 320000 | elapsed time per iteration (ms): 2962.8 | learning rate: 7.197E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.760797E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 686.23 | backward: 2148.06 | backward-backward: 2148.03 | backward-allreduce: 0.00 | optimizer: 127.93 | batch generator: 1.06 + samples/sec: 5.375 | iteration 219300/ 320000 | elapsed time per iteration (ms): 2977.0 | learning rate: 7.184E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.754707E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 688.09 | backward: 2160.90 | backward-backward: 2160.88 | backward-allreduce: 0.00 | optimizer: 127.45 | batch generator: 1.06 + samples/sec: 5.398 | iteration 219400/ 320000 | elapsed time per iteration (ms): 2963.8 | learning rate: 7.172E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.754947E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 687.06 | backward: 2149.05 | backward-backward: 2149.03 | backward-allreduce: 0.00 | optimizer: 127.20 | batch generator: 1.05 + samples/sec: 5.365 | iteration 219500/ 320000 | elapsed time per iteration (ms): 2982.3 | learning rate: 7.159E-05 | approx flops per GPU: 33.3TFLOPS | lm_loss: 2.760689E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | +time (ms) | forward: 690.63 | backward: 2160.80 | backward-backward: 2160.77 | backward-allreduce: 0.00 | optimizer: 130.33 | batch generator: 0.99 + samples/sec: 5.408 | iteration 219600/ 320000 | elapsed time per iteration (ms): 2958.4 | learning rate: 7.147E-05 | approx flops per GPU: 33.6TFLOPS | lm_loss: 2.777003E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 684.89 | backward: 2145.49 | backward-backward: 2145.46 | backward-allreduce: 0.00 | optimizer: 127.49 | batch generator: 0.94 + samples/sec: 5.386 | iteration 219700/ 320000 | elapsed time per iteration (ms): 2970.6 | learning rate: 7.134E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.773775E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 687.13 | backward: 2154.01 | backward-backward: 2153.98 | backward-allreduce: 0.00 | optimizer: 128.76 | batch generator: 1.07 + samples/sec: 5.383 | iteration 219800/ 320000 | elapsed time per iteration (ms): 2972.3 | learning rate: 7.122E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.779684E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 687.31 | backward: 2156.60 | backward-backward: 2156.58 | backward-allreduce: 0.00 | optimizer: 127.89 | batch generator: 1.05 + samples/sec: 5.384 | iteration 219900/ 320000 | elapsed time per iteration (ms): 2971.6 | learning rate: 7.109E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.749551E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 686.77 | backward: 2155.79 | backward-backward: 2155.77 | backward-allreduce: 0.00 | optimizer: 128.49 | batch generator: 1.03 + samples/sec: 5.406 | iteration 220000/ 320000 | elapsed time per iteration (ms): 2959.5 | learning rate: 7.096E-05 | approx flops per GPU: 33.6TFLOPS | lm_loss: 
2.778635E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 684.04 | backward: 2149.03 | backward-backward: 2149.00 | backward-allreduce: 0.00 | optimizer: 125.93 | batch generator: 1.04 +WARNING: Deleting old checkpoints: + checkpoints-fcm/global_step120000 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 220000 | lm_loss value: 2.793975E+00 | lm_loss_ppl value: 1.634586E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 5.113 | iteration 220100/ 320000 | elapsed time per iteration (ms): 3129.3 | learning rate: 7.084E-05 | approx flops per GPU: 31.8TFLOPS | lm_loss: 2.756482E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 687.21 | backward: 2162.69 | backward-backward: 2162.66 | backward-allreduce: 0.00 | optimizer: 129.43 | batch generator: 1.08 + samples/sec: 5.402 | iteration 220200/ 320000 | elapsed time per iteration (ms): 2961.7 | learning rate: 7.071E-05 | approx flops per GPU: 33.6TFLOPS | lm_loss: 2.788323E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 685.00 | backward: 2151.25 | backward-backward: 2151.22 | backward-allreduce: 0.00 | optimizer: 124.96 | batch generator: 0.98 + samples/sec: 5.381 | iteration 220300/ 320000 | elapsed time per iteration (ms): 2973.3 | learning rate: 7.059E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.757963E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 688.79 | backward: 2154.88 | backward-backward: 2154.86 | backward-allreduce: 0.00 | optimizer: 129.17 | batch generator: 0.98 + samples/sec: 5.388 | iteration 220400/ 320000 | elapsed time per iteration (ms): 2969.8 | learning rate: 
7.046E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.765295E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 687.61 | backward: 2152.63 | backward-backward: 2152.61 | backward-allreduce: 0.00 | optimizer: 129.01 | batch generator: 1.03 + samples/sec: 5.398 | iteration 220500/ 320000 | elapsed time per iteration (ms): 2964.2 | learning rate: 7.034E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.746312E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 684.99 | backward: 2153.56 | backward-backward: 2153.54 | backward-allreduce: 0.00 | optimizer: 125.11 | batch generator: 1.03 + samples/sec: 5.380 | iteration 220600/ 320000 | elapsed time per iteration (ms): 2974.0 | learning rate: 7.022E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.766339E+00 | loss scale: 32768.0 | number of skipped iterations: 2 | number of nan iterations: 0 | +time (ms) | forward: 687.55 | backward: 2158.40 | backward-backward: 2158.38 | backward-allreduce: 0.00 | optimizer: 127.52 | batch generator: 1.06 + samples/sec: 5.397 | iteration 220700/ 320000 | elapsed time per iteration (ms): 2964.7 | learning rate: 7.009E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.773086E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 685.79 | backward: 2151.03 | backward-backward: 2151.01 | backward-allreduce: 0.00 | optimizer: 127.29 | batch generator: 1.01 + samples/sec: 5.379 | iteration 220800/ 320000 | elapsed time per iteration (ms): 2974.6 | learning rate: 6.997E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.762758E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 686.18 | backward: 2158.46 | backward-backward: 2158.44 | backward-allreduce: 0.00 | optimizer: 129.51 | batch generator: 1.00 + samples/sec: 5.388 | iteration 220900/ 
320000 | elapsed time per iteration (ms): 2969.8 | learning rate: 6.984E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.762352E+00 | loss scale: 16384.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 687.95 | backward: 2154.53 | backward-backward: 2154.50 | backward-allreduce: 0.00 | optimizer: 126.78 | batch generator: 1.02 + samples/sec: 5.378 | iteration 221000/ 320000 | elapsed time per iteration (ms): 2975.3 | learning rate: 6.972E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.756641E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 686.79 | backward: 2160.42 | backward-backward: 2160.40 | backward-allreduce: 0.00 | optimizer: 127.64 | batch generator: 0.98 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 221000 | lm_loss value: 2.783574E+00 | lm_loss_ppl value: 1.617674E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 5.277 | iteration 221100/ 320000 | elapsed time per iteration (ms): 3032.1 | learning rate: 6.960E-05 | approx flops per GPU: 32.8TFLOPS | lm_loss: 2.754333E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 685.85 | backward: 2150.09 | backward-backward: 2150.06 | backward-allreduce: 0.00 | optimizer: 127.00 | batch generator: 1.12 + samples/sec: 5.365 | iteration 221200/ 320000 | elapsed time per iteration (ms): 2982.2 | learning rate: 6.947E-05 | approx flops per GPU: 33.3TFLOPS | lm_loss: 2.760701E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 687.88 | backward: 2164.48 | backward-backward: 2164.46 | backward-allreduce: 0.00 | optimizer: 129.31 | batch generator: 1.04 + samples/sec: 5.401 | iteration 221300/ 320000 | elapsed time 
per iteration (ms): 2962.3 | learning rate: 6.935E-05 | approx flops per GPU: 33.6TFLOPS | lm_loss: 2.754572E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 682.16 | backward: 2151.44 | backward-backward: 2151.41 | backward-allreduce: 0.00 | optimizer: 128.17 | batch generator: 0.98 + samples/sec: 5.384 | iteration 221400/ 320000 | elapsed time per iteration (ms): 2971.6 | learning rate: 6.922E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.769144E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 688.12 | backward: 2153.61 | backward-backward: 2153.59 | backward-allreduce: 0.00 | optimizer: 129.41 | batch generator: 1.04 + samples/sec: 5.380 | iteration 221500/ 320000 | elapsed time per iteration (ms): 2973.9 | learning rate: 6.910E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.761921E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 686.94 | backward: 2157.54 | backward-backward: 2157.51 | backward-allreduce: 0.00 | optimizer: 128.86 | batch generator: 1.01 + samples/sec: 5.386 | iteration 221600/ 320000 | elapsed time per iteration (ms): 2970.5 | learning rate: 6.898E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.754366E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 686.08 | backward: 2156.30 | backward-backward: 2156.27 | backward-allreduce: 0.00 | optimizer: 127.55 | batch generator: 1.10 + samples/sec: 5.405 | iteration 221700/ 320000 | elapsed time per iteration (ms): 2960.1 | learning rate: 6.885E-05 | approx flops per GPU: 33.6TFLOPS | lm_loss: 2.767308E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 685.34 | backward: 2147.10 | backward-backward: 2147.08 | backward-allreduce: 0.00 | optimizer: 127.11 | batch generator: 
1.03 + samples/sec: 5.377 | iteration 221800/ 320000 | elapsed time per iteration (ms): 2975.5 | learning rate: 6.873E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.757051E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 689.04 | backward: 2159.01 | backward-backward: 2158.98 | backward-allreduce: 0.00 | optimizer: 126.93 | batch generator: 1.02 + samples/sec: 5.393 | iteration 221900/ 320000 | elapsed time per iteration (ms): 2966.6 | learning rate: 6.860E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.750736E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 687.27 | backward: 2149.97 | backward-backward: 2149.94 | backward-allreduce: 0.00 | optimizer: 128.78 | batch generator: 1.01 + samples/sec: 5.386 | iteration 222000/ 320000 | elapsed time per iteration (ms): 2970.6 | learning rate: 6.848E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.746020E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 688.49 | backward: 2154.42 | backward-backward: 2154.40 | backward-allreduce: 0.00 | optimizer: 127.15 | batch generator: 1.01 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 222000 | lm_loss value: 2.735972E+00 | lm_loss_ppl value: 1.542473E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 5.261 | iteration 222100/ 320000 | elapsed time per iteration (ms): 3041.2 | learning rate: 6.836E-05 | approx flops per GPU: 32.7TFLOPS | lm_loss: 2.753961E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 687.33 | backward: 2156.08 | backward-backward: 2156.06 | backward-allreduce: 0.00 | optimizer: 128.09 | batch generator: 1.08 + samples/sec: 
5.383 | iteration 222200/ 320000 | elapsed time per iteration (ms): 2972.3 | learning rate: 6.823E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.766174E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 686.32 | backward: 2156.52 | backward-backward: 2156.50 | backward-allreduce: 0.00 | optimizer: 128.97 | batch generator: 1.03 + samples/sec: 5.402 | iteration 222300/ 320000 | elapsed time per iteration (ms): 2962.1 | learning rate: 6.811E-05 | approx flops per GPU: 33.6TFLOPS | lm_loss: 2.764917E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 685.95 | backward: 2146.77 | backward-backward: 2146.74 | backward-allreduce: 0.00 | optimizer: 128.89 | batch generator: 1.04 + samples/sec: 5.376 | iteration 222400/ 320000 | elapsed time per iteration (ms): 2976.2 | learning rate: 6.799E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.767605E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 687.79 | backward: 2159.82 | backward-backward: 2159.79 | backward-allreduce: 0.00 | optimizer: 128.12 | batch generator: 0.99 + samples/sec: 5.392 | iteration 222500/ 320000 | elapsed time per iteration (ms): 2967.4 | learning rate: 6.786E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.728195E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 685.81 | backward: 2152.60 | backward-backward: 2152.57 | backward-allreduce: 0.00 | optimizer: 128.51 | batch generator: 1.08 + samples/sec: 5.403 | iteration 222600/ 320000 | elapsed time per iteration (ms): 2961.1 | learning rate: 6.774E-05 | approx flops per GPU: 33.6TFLOPS | lm_loss: 2.772704E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 683.03 | backward: 2149.86 | backward-backward: 2149.83 | 
backward-allreduce: 0.00 | optimizer: 127.67 | batch generator: 0.98 + samples/sec: 5.367 | iteration 222700/ 320000 | elapsed time per iteration (ms): 2981.1 | learning rate: 6.762E-05 | approx flops per GPU: 33.3TFLOPS | lm_loss: 2.767272E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 689.97 | backward: 2160.77 | backward-backward: 2160.75 | backward-allreduce: 0.00 | optimizer: 129.80 | batch generator: 1.08 + samples/sec: 5.396 | iteration 222800/ 320000 | elapsed time per iteration (ms): 2965.1 | learning rate: 6.749E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.756275E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 686.93 | backward: 2151.03 | backward-backward: 2151.01 | backward-allreduce: 0.00 | optimizer: 126.63 | batch generator: 1.02 + samples/sec: 5.396 | iteration 222900/ 320000 | elapsed time per iteration (ms): 2965.3 | learning rate: 6.737E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.769426E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 685.95 | backward: 2151.98 | backward-backward: 2151.95 | backward-allreduce: 0.00 | optimizer: 126.89 | batch generator: 1.06 + samples/sec: 5.369 | iteration 223000/ 320000 | elapsed time per iteration (ms): 2980.2 | learning rate: 6.725E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.748542E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 687.60 | backward: 2161.49 | backward-backward: 2161.47 | backward-allreduce: 0.00 | optimizer: 130.55 | batch generator: 1.05 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 223000 | lm_loss value: 2.752218E+00 | lm_loss_ppl value: 1.567737E+01 | 
+----------------------------------------------------------------------------------------------------------- + samples/sec: 5.290 | iteration 223100/ 320000 | elapsed time per iteration (ms): 3024.6 | learning rate: 6.713E-05 | approx flops per GPU: 32.9TFLOPS | lm_loss: 2.756127E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 686.09 | backward: 2143.16 | backward-backward: 2143.14 | backward-allreduce: 0.00 | optimizer: 125.86 | batch generator: 1.15 + samples/sec: 5.389 | iteration 223200/ 320000 | elapsed time per iteration (ms): 2968.8 | learning rate: 6.700E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.763326E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 686.45 | backward: 2154.65 | backward-backward: 2154.62 | backward-allreduce: 0.00 | optimizer: 127.19 | batch generator: 1.01 + samples/sec: 5.379 | iteration 223300/ 320000 | elapsed time per iteration (ms): 2974.8 | learning rate: 6.688E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.761823E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 687.63 | backward: 2158.03 | backward-backward: 2158.01 | backward-allreduce: 0.00 | optimizer: 128.61 | batch generator: 1.03 + samples/sec: 5.400 | iteration 223400/ 320000 | elapsed time per iteration (ms): 2963.0 | learning rate: 6.676E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.770928E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 682.95 | backward: 2149.16 | backward-backward: 2149.13 | backward-allreduce: 0.00 | optimizer: 130.44 | batch generator: 1.02 + samples/sec: 5.370 | iteration 223500/ 320000 | elapsed time per iteration (ms): 2979.4 | learning rate: 6.664E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.725544E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | +time (ms) | forward: 690.21 | backward: 2159.05 | backward-backward: 2159.03 | backward-allreduce: 0.00 | optimizer: 129.53 | batch generator: 1.08 + samples/sec: 5.390 | iteration 223600/ 320000 | elapsed time per iteration (ms): 2968.5 | learning rate: 6.652E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.774220E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 685.20 | backward: 2155.48 | backward-backward: 2155.45 | backward-allreduce: 0.00 | optimizer: 127.26 | batch generator: 1.13 + samples/sec: 5.402 | iteration 223700/ 320000 | elapsed time per iteration (ms): 2961.9 | learning rate: 6.639E-05 | approx flops per GPU: 33.6TFLOPS | lm_loss: 2.744409E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 684.30 | backward: 2150.14 | backward-backward: 2150.11 | backward-allreduce: 0.00 | optimizer: 126.94 | batch generator: 1.08 + samples/sec: 5.375 | iteration 223800/ 320000 | elapsed time per iteration (ms): 2977.0 | learning rate: 6.627E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.761261E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 689.34 | backward: 2158.84 | backward-backward: 2158.81 | backward-allreduce: 0.00 | optimizer: 128.24 | batch generator: 1.13 + samples/sec: 5.389 | iteration 223900/ 320000 | elapsed time per iteration (ms): 2969.0 | learning rate: 6.615E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.716960E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 687.02 | backward: 2153.46 | backward-backward: 2153.43 | backward-allreduce: 0.00 | optimizer: 127.98 | batch generator: 1.14 + samples/sec: 5.409 | iteration 224000/ 320000 | elapsed time per iteration (ms): 2957.8 | learning rate: 6.603E-05 | approx flops per GPU: 33.6TFLOPS | lm_loss: 
2.754958E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 685.05 | backward: 2144.58 | backward-backward: 2144.55 | backward-allreduce: 0.00 | optimizer: 127.65 | batch generator: 1.04 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 224000 | lm_loss value: 2.744051E+00 | lm_loss_ppl value: 1.554986E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 5.248 | iteration 224100/ 320000 | elapsed time per iteration (ms): 3048.7 | learning rate: 6.590E-05 | approx flops per GPU: 32.6TFLOPS | lm_loss: 2.770806E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 688.55 | backward: 2161.17 | backward-backward: 2161.14 | backward-allreduce: 0.00 | optimizer: 129.80 | batch generator: 1.07 + samples/sec: 5.388 | iteration 224200/ 320000 | elapsed time per iteration (ms): 2969.6 | learning rate: 6.578E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.776890E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 685.33 | backward: 2155.47 | backward-backward: 2155.45 | backward-allreduce: 0.00 | optimizer: 128.32 | batch generator: 0.98 + samples/sec: 5.395 | iteration 224300/ 320000 | elapsed time per iteration (ms): 2965.5 | learning rate: 6.566E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.760507E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 685.60 | backward: 2151.74 | backward-backward: 2151.71 | backward-allreduce: 0.00 | optimizer: 127.63 | batch generator: 1.07 + samples/sec: 5.387 | iteration 224400/ 320000 | elapsed time per iteration (ms): 2970.3 | learning rate: 6.554E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.754520E+00 | loss 
scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 686.90 | backward: 2152.75 | backward-backward: 2152.73 | backward-allreduce: 0.00 | optimizer: 130.10 | batch generator: 1.08 + samples/sec: 5.383 | iteration 224500/ 320000 | elapsed time per iteration (ms): 2972.5 | learning rate: 6.542E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.743267E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 686.94 | backward: 2156.16 | backward-backward: 2156.13 | backward-allreduce: 0.00 | optimizer: 128.83 | batch generator: 1.01 + samples/sec: 5.392 | iteration 224600/ 320000 | elapsed time per iteration (ms): 2967.1 | learning rate: 6.530E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.756674E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 686.49 | backward: 2152.22 | backward-backward: 2152.19 | backward-allreduce: 0.00 | optimizer: 127.89 | batch generator: 1.12 + samples/sec: 5.389 | iteration 224700/ 320000 | elapsed time per iteration (ms): 2968.9 | learning rate: 6.517E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.751501E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 687.77 | backward: 2152.67 | backward-backward: 2152.64 | backward-allreduce: 0.00 | optimizer: 127.84 | batch generator: 1.06 + samples/sec: 5.385 | iteration 224800/ 320000 | elapsed time per iteration (ms): 2971.1 | learning rate: 6.506E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.767052E+00 | loss scale: 32768.0 | number of skipped iterations: 2 | number of nan iterations: 0 | +time (ms) | forward: 688.02 | backward: 2156.47 | backward-backward: 2156.44 | backward-allreduce: 0.00 | optimizer: 126.10 | batch generator: 1.06 + samples/sec: 5.387 | iteration 224900/ 320000 | elapsed time per iteration (ms): 2970.2 | learning rate: 6.493E-05 
| approx flops per GPU: 33.5TFLOPS | lm_loss: 2.742213E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 687.28 | backward: 2153.05 | backward-backward: 2153.03 | backward-allreduce: 0.00 | optimizer: 129.33 | batch generator: 1.02 + samples/sec: 5.371 | iteration 225000/ 320000 | elapsed time per iteration (ms): 2979.1 | learning rate: 6.481E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.754933E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 687.78 | backward: 2160.10 | backward-backward: 2160.08 | backward-allreduce: 0.00 | optimizer: 130.74 | batch generator: 1.01 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 225000 | lm_loss value: 2.757601E+00 | lm_loss_ppl value: 1.576198E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 5.267 | iteration 225100/ 320000 | elapsed time per iteration (ms): 3037.9 | learning rate: 6.469E-05 | approx flops per GPU: 32.7TFLOPS | lm_loss: 2.740681E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 685.79 | backward: 2155.33 | backward-backward: 2155.30 | backward-allreduce: 0.00 | optimizer: 127.58 | batch generator: 1.12 + samples/sec: 5.387 | iteration 225200/ 320000 | elapsed time per iteration (ms): 2970.4 | learning rate: 6.457E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.748837E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 688.10 | backward: 2152.19 | backward-backward: 2152.16 | backward-allreduce: 0.00 | optimizer: 129.54 | batch generator: 1.04 + samples/sec: 5.374 | iteration 225300/ 320000 | elapsed time per iteration (ms): 2977.5 | learning rate: 6.445E-05 | approx flops per 
GPU: 33.4TFLOPS | lm_loss: 2.728548E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 687.78 | backward: 2160.31 | backward-backward: 2160.28 | backward-allreduce: 0.00 | optimizer: 128.85 | batch generator: 1.00 + samples/sec: 5.387 | iteration 225400/ 320000 | elapsed time per iteration (ms): 2970.0 | learning rate: 6.433E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.749515E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 685.38 | backward: 2157.68 | backward-backward: 2157.65 | backward-allreduce: 0.00 | optimizer: 126.45 | batch generator: 1.07 + samples/sec: 5.388 | iteration 225500/ 320000 | elapsed time per iteration (ms): 2969.7 | learning rate: 6.421E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.719767E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 686.72 | backward: 2153.80 | backward-backward: 2153.78 | backward-allreduce: 0.00 | optimizer: 128.69 | batch generator: 1.04 + samples/sec: 5.374 | iteration 225600/ 320000 | elapsed time per iteration (ms): 2977.4 | learning rate: 6.409E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.732435E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 687.48 | backward: 2160.03 | backward-backward: 2160.00 | backward-allreduce: 0.00 | optimizer: 129.37 | batch generator: 1.01 + samples/sec: 5.390 | iteration 225700/ 320000 | elapsed time per iteration (ms): 2968.2 | learning rate: 6.397E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.733448E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 688.18 | backward: 2152.02 | backward-backward: 2152.00 | backward-allreduce: 0.00 | optimizer: 127.48 | batch generator: 1.06 + samples/sec: 5.393 | iteration 225800/ 320000 | elapsed time per 
iteration (ms): 2966.9 | learning rate: 6.385E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.762674E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 685.27 | backward: 2153.35 | backward-backward: 2153.32 | backward-allreduce: 0.00 | optimizer: 127.70 | batch generator: 1.07 + samples/sec: 5.377 | iteration 225900/ 320000 | elapsed time per iteration (ms): 2975.5 | learning rate: 6.373E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.748378E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 687.82 | backward: 2157.59 | backward-backward: 2157.57 | backward-allreduce: 0.00 | optimizer: 129.53 | batch generator: 1.14 + samples/sec: 5.392 | iteration 226000/ 320000 | elapsed time per iteration (ms): 2967.1 | learning rate: 6.360E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.736000E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 686.73 | backward: 2152.59 | backward-backward: 2152.57 | backward-allreduce: 0.00 | optimizer: 127.24 | batch generator: 1.01 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 226000 | lm_loss value: 2.787629E+00 | lm_loss_ppl value: 1.624246E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 5.265 | iteration 226100/ 320000 | elapsed time per iteration (ms): 3039.0 | learning rate: 6.348E-05 | approx flops per GPU: 32.7TFLOPS | lm_loss: 2.757868E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 684.96 | backward: 2155.70 | backward-backward: 2155.67 | backward-allreduce: 0.00 | optimizer: 128.99 | batch generator: 1.14 + samples/sec: 5.372 | iteration 226200/ 320000 | elapsed time per iteration (ms): 2978.3 
| learning rate: 6.336E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.745549E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 689.26 | backward: 2159.85 | backward-backward: 2159.83 | backward-allreduce: 0.00 | optimizer: 128.65 | batch generator: 1.06 + samples/sec: 5.395 | iteration 226300/ 320000 | elapsed time per iteration (ms): 2965.6 | learning rate: 6.325E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.740940E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 685.73 | backward: 2152.18 | backward-backward: 2152.15 | backward-allreduce: 0.00 | optimizer: 127.20 | batch generator: 1.02 + samples/sec: 5.416 | iteration 226400/ 320000 | elapsed time per iteration (ms): 2954.5 | learning rate: 6.313E-05 | approx flops per GPU: 33.6TFLOPS | lm_loss: 2.735605E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 685.43 | backward: 2142.71 | backward-backward: 2142.68 | backward-allreduce: 0.00 | optimizer: 125.82 | batch generator: 1.10 + samples/sec: 5.371 | iteration 226500/ 320000 | elapsed time per iteration (ms): 2979.1 | learning rate: 6.301E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.726845E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 687.51 | backward: 2161.84 | backward-backward: 2161.81 | backward-allreduce: 0.00 | optimizer: 129.21 | batch generator: 0.99 + samples/sec: 5.396 | iteration 226600/ 320000 | elapsed time per iteration (ms): 2964.9 | learning rate: 6.289E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.740463E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 685.88 | backward: 2152.84 | backward-backward: 2152.81 | backward-allreduce: 0.00 | optimizer: 125.63 | batch generator: 1.04 + samples/sec: 5.375 | 
iteration 226700/ 320000 | elapsed time per iteration (ms): 2976.7 | learning rate: 6.277E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.745323E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 689.00 | backward: 2157.73 | backward-backward: 2157.70 | backward-allreduce: 0.00 | optimizer: 129.45 | batch generator: 0.99 + samples/sec: 5.393 | iteration 226800/ 320000 | elapsed time per iteration (ms): 2966.6 | learning rate: 6.265E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.741299E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 687.95 | backward: 2150.50 | backward-backward: 2150.47 | backward-allreduce: 0.00 | optimizer: 127.64 | batch generator: 1.00 + samples/sec: 5.387 | iteration 226900/ 320000 | elapsed time per iteration (ms): 2970.4 | learning rate: 6.253E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.740632E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 686.16 | backward: 2155.88 | backward-backward: 2155.86 | backward-allreduce: 0.00 | optimizer: 127.81 | batch generator: 0.98 + samples/sec: 5.374 | iteration 227000/ 320000 | elapsed time per iteration (ms): 2977.4 | learning rate: 6.241E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.764804E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 688.81 | backward: 2160.07 | backward-backward: 2160.05 | backward-allreduce: 0.00 | optimizer: 128.05 | batch generator: 1.05 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 227000 | lm_loss value: 2.667482E+00 | lm_loss_ppl value: 1.440366E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 5.264 | iteration 227100/ 
320000 | elapsed time per iteration (ms): 3039.4 | learning rate: 6.229E-05 | approx flops per GPU: 32.7TFLOPS | lm_loss: 2.742462E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 683.90 | backward: 2155.24 | backward-backward: 2155.21 | backward-allreduce: 0.00 | optimizer: 130.77 | batch generator: 1.15 + samples/sec: 5.373 | iteration 227200/ 320000 | elapsed time per iteration (ms): 2977.9 | learning rate: 6.217E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.724138E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 688.04 | backward: 2159.24 | backward-backward: 2159.21 | backward-allreduce: 0.00 | optimizer: 130.13 | batch generator: 1.02 + samples/sec: 5.388 | iteration 227300/ 320000 | elapsed time per iteration (ms): 2969.6 | learning rate: 6.205E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.726558E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 685.46 | backward: 2156.09 | backward-backward: 2156.06 | backward-allreduce: 0.00 | optimizer: 127.49 | batch generator: 0.96 + samples/sec: 5.390 | iteration 227400/ 320000 | elapsed time per iteration (ms): 2968.7 | learning rate: 6.193E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.744682E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 685.94 | backward: 2155.45 | backward-backward: 2155.42 | backward-allreduce: 0.00 | optimizer: 126.78 | batch generator: 1.01 + samples/sec: 5.374 | iteration 227500/ 320000 | elapsed time per iteration (ms): 2977.2 | learning rate: 6.181E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.729781E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 688.05 | backward: 2159.84 | backward-backward: 2159.81 | backward-allreduce: 0.00 | optimizer: 128.71 
| batch generator: 0.99 + samples/sec: 5.395 | iteration 227600/ 320000 | elapsed time per iteration (ms): 2965.7 | learning rate: 6.169E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.726253E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 685.52 | backward: 2152.28 | backward-backward: 2152.25 | backward-allreduce: 0.00 | optimizer: 127.33 | batch generator: 1.05 + samples/sec: 5.415 | iteration 227700/ 320000 | elapsed time per iteration (ms): 2954.5 | learning rate: 6.157E-05 | approx flops per GPU: 33.6TFLOPS | lm_loss: 2.726212E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 683.94 | backward: 2144.00 | backward-backward: 2143.98 | backward-allreduce: 0.00 | optimizer: 126.02 | batch generator: 1.06 + samples/sec: 5.371 | iteration 227800/ 320000 | elapsed time per iteration (ms): 2979.0 | learning rate: 6.145E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.738534E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 688.22 | backward: 2162.07 | backward-backward: 2162.04 | backward-allreduce: 0.00 | optimizer: 128.21 | batch generator: 1.02 + samples/sec: 5.385 | iteration 227900/ 320000 | elapsed time per iteration (ms): 2971.3 | learning rate: 6.133E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.719239E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 688.46 | backward: 2155.81 | backward-backward: 2155.78 | backward-allreduce: 0.00 | optimizer: 126.45 | batch generator: 1.02 + samples/sec: 5.384 | iteration 228000/ 320000 | elapsed time per iteration (ms): 2971.8 | learning rate: 6.122E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.712909E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 687.88 | backward: 2157.32 | 
backward-backward: 2157.29 | backward-allreduce: 0.00 | optimizer: 125.98 | batch generator: 1.10 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 228000 | lm_loss value: 2.765807E+00 | lm_loss_ppl value: 1.589185E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 5.173 | iteration 228100/ 320000 | elapsed time per iteration (ms): 3093.1 | learning rate: 6.110E-05 | approx flops per GPU: 32.1TFLOPS | lm_loss: 2.746743E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 689.29 | backward: 2193.21 | backward-backward: 2193.19 | backward-allreduce: 0.00 | optimizer: 140.88 | batch generator: 1.20 + samples/sec: 4.946 | iteration 228200/ 320000 | elapsed time per iteration (ms): 3234.7 | learning rate: 6.098E-05 | approx flops per GPU: 30.7TFLOPS | lm_loss: 2.713227E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 689.89 | backward: 2353.63 | backward-backward: 2353.60 | backward-allreduce: 0.00 | optimizer: 190.35 | batch generator: 1.08 + samples/sec: 4.929 | iteration 228300/ 320000 | elapsed time per iteration (ms): 3246.0 | learning rate: 6.086E-05 | approx flops per GPU: 30.6TFLOPS | lm_loss: 2.723352E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 690.25 | backward: 2362.49 | backward-backward: 2362.46 | backward-allreduce: 0.00 | optimizer: 192.48 | batch generator: 0.97 + samples/sec: 4.929 | iteration 228400/ 320000 | elapsed time per iteration (ms): 3245.9 | learning rate: 6.074E-05 | approx flops per GPU: 30.6TFLOPS | lm_loss: 2.728743E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 689.15 | backward: 2363.12 | backward-backward: 
2363.09 | backward-allreduce: 0.00 | optimizer: 192.87 | batch generator: 1.17 + samples/sec: 4.818 | iteration 228500/ 320000 | elapsed time per iteration (ms): 3320.8 | learning rate: 6.062E-05 | approx flops per GPU: 29.9TFLOPS | lm_loss: 2.723033E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 689.67 | backward: 2421.70 | backward-backward: 2421.66 | backward-allreduce: 0.00 | optimizer: 208.48 | batch generator: 1.01 + samples/sec: 5.219 | iteration 228600/ 320000 | elapsed time per iteration (ms): 3065.6 | learning rate: 6.051E-05 | approx flops per GPU: 32.4TFLOPS | lm_loss: 2.761258E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 694.19 | backward: 2222.72 | backward-backward: 2222.69 | backward-allreduce: 0.00 | optimizer: 147.97 | batch generator: 1.05 + samples/sec: 5.363 | iteration 228700/ 320000 | elapsed time per iteration (ms): 2983.6 | learning rate: 6.039E-05 | approx flops per GPU: 33.3TFLOPS | lm_loss: 2.732124E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 688.15 | backward: 2163.51 | backward-backward: 2163.48 | backward-allreduce: 0.00 | optimizer: 131.33 | batch generator: 1.04 + samples/sec: 5.373 | iteration 228800/ 320000 | elapsed time per iteration (ms): 2977.9 | learning rate: 6.027E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.714806E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 687.36 | backward: 2159.93 | backward-backward: 2159.90 | backward-allreduce: 0.00 | optimizer: 129.95 | batch generator: 1.04 + samples/sec: 5.386 | iteration 228900/ 320000 | elapsed time per iteration (ms): 2970.5 | learning rate: 6.015E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.715573E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
+time (ms) | forward: 686.47 | backward: 2156.81 | backward-backward: 2156.78 | backward-allreduce: 0.00 | optimizer: 126.71 | batch generator: 1.07 + samples/sec: 5.362 | iteration 229000/ 320000 | elapsed time per iteration (ms): 2984.1 | learning rate: 6.004E-05 | approx flops per GPU: 33.3TFLOPS | lm_loss: 2.737422E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 690.23 | backward: 2162.72 | backward-backward: 2162.69 | backward-allreduce: 0.00 | optimizer: 130.59 | batch generator: 1.09 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 229000 | lm_loss value: 2.717387E+00 | lm_loss_ppl value: 1.514070E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 5.257 | iteration 229100/ 320000 | elapsed time per iteration (ms): 3043.8 | learning rate: 5.992E-05 | approx flops per GPU: 32.7TFLOPS | lm_loss: 2.731638E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 686.34 | backward: 2159.48 | backward-backward: 2159.45 | backward-allreduce: 0.00 | optimizer: 128.43 | batch generator: 1.17 + samples/sec: 5.370 | iteration 229200/ 320000 | elapsed time per iteration (ms): 2979.8 | learning rate: 5.980E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.713497E+00 | loss scale: 16384.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 687.41 | backward: 2163.25 | backward-backward: 2163.22 | backward-allreduce: 0.00 | optimizer: 128.49 | batch generator: 1.12 + samples/sec: 5.385 | iteration 229300/ 320000 | elapsed time per iteration (ms): 2971.3 | learning rate: 5.968E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.739091E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 
687.56 | backward: 2156.34 | backward-backward: 2156.31 | backward-allreduce: 0.00 | optimizer: 126.84 | batch generator: 1.16 + samples/sec: 5.291 | iteration 229400/ 320000 | elapsed time per iteration (ms): 3023.8 | learning rate: 5.957E-05 | approx flops per GPU: 32.9TFLOPS | lm_loss: 2.713980E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 697.61 | backward: 2190.52 | backward-backward: 2190.49 | backward-allreduce: 0.00 | optimizer: 135.09 | batch generator: 1.06 + samples/sec: 5.382 | iteration 229500/ 320000 | elapsed time per iteration (ms): 2972.6 | learning rate: 5.945E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.707762E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 686.66 | backward: 2156.87 | backward-backward: 2156.84 | backward-allreduce: 0.00 | optimizer: 128.53 | batch generator: 1.05 + samples/sec: 5.319 | iteration 229600/ 320000 | elapsed time per iteration (ms): 3007.9 | learning rate: 5.933E-05 | approx flops per GPU: 33.0TFLOPS | lm_loss: 2.712317E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 694.74 | backward: 2179.23 | backward-backward: 2179.20 | backward-allreduce: 0.00 | optimizer: 133.31 | batch generator: 1.10 + samples/sec: 5.294 | iteration 229700/ 320000 | elapsed time per iteration (ms): 3022.2 | learning rate: 5.921E-05 | approx flops per GPU: 32.9TFLOPS | lm_loss: 2.724073E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 698.16 | backward: 2188.33 | backward-backward: 2188.30 | backward-allreduce: 0.00 | optimizer: 135.12 | batch generator: 1.16 + samples/sec: 5.379 | iteration 229800/ 320000 | elapsed time per iteration (ms): 2974.3 | learning rate: 5.910E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.718882E+00 | loss scale: 16384.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 687.33 | backward: 2158.04 | backward-backward: 2158.01 | backward-allreduce: 0.00 | optimizer: 128.29 | batch generator: 1.08 + samples/sec: 5.307 | iteration 229900/ 320000 | elapsed time per iteration (ms): 3015.1 | learning rate: 5.898E-05 | approx flops per GPU: 33.0TFLOPS | lm_loss: 2.726010E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 697.06 | backward: 2183.82 | backward-backward: 2183.79 | backward-allreduce: 0.00 | optimizer: 133.57 | batch generator: 1.13 + samples/sec: 5.378 | iteration 230000/ 320000 | elapsed time per iteration (ms): 2974.8 | learning rate: 5.886E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.717605E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 689.03 | backward: 2156.53 | backward-backward: 2156.50 | backward-allreduce: 0.00 | optimizer: 128.72 | batch generator: 1.08 +WARNING: Deleting old checkpoints: + checkpoints-fcm/global_step130000 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 230000 | lm_loss value: 2.757775E+00 | lm_loss_ppl value: 1.576473E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 5.111 | iteration 230100/ 320000 | elapsed time per iteration (ms): 3130.2 | learning rate: 5.875E-05 | approx flops per GPU: 31.8TFLOPS | lm_loss: 2.727872E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 687.06 | backward: 2153.62 | backward-backward: 2153.60 | backward-allreduce: 0.00 | optimizer: 126.79 | batch generator: 1.17 + samples/sec: 5.376 | iteration 230200/ 320000 | elapsed time per iteration (ms): 2976.1 | learning rate: 5.863E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 
2.722316E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 687.13 | backward: 2159.80 | backward-backward: 2159.77 | backward-allreduce: 0.00 | optimizer: 128.62 | batch generator: 1.01 + samples/sec: 5.393 | iteration 230300/ 320000 | elapsed time per iteration (ms): 2967.0 | learning rate: 5.851E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.734355E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 686.76 | backward: 2153.05 | backward-backward: 2153.02 | backward-allreduce: 0.00 | optimizer: 126.66 | batch generator: 1.08 + samples/sec: 5.395 | iteration 230400/ 320000 | elapsed time per iteration (ms): 2965.6 | learning rate: 5.840E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.722501E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 684.99 | backward: 2154.34 | backward-backward: 2154.32 | backward-allreduce: 0.00 | optimizer: 125.76 | batch generator: 1.01 + samples/sec: 5.383 | iteration 230500/ 320000 | elapsed time per iteration (ms): 2972.0 | learning rate: 5.828E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.723625E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 687.85 | backward: 2155.50 | backward-backward: 2155.47 | backward-allreduce: 0.00 | optimizer: 128.18 | batch generator: 1.01 + samples/sec: 5.388 | iteration 230600/ 320000 | elapsed time per iteration (ms): 2969.7 | learning rate: 5.816E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.717625E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 685.10 | backward: 2154.75 | backward-backward: 2154.72 | backward-allreduce: 0.00 | optimizer: 129.28 | batch generator: 1.05 + samples/sec: 5.398 | iteration 230700/ 320000 | elapsed time per iteration (ms): 2964.1 | 
learning rate: 5.805E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.729171E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 685.58 | backward: 2151.84 | backward-backward: 2151.82 | backward-allreduce: 0.00 | optimizer: 126.19 | batch generator: 0.99 + samples/sec: 5.377 | iteration 230800/ 320000 | elapsed time per iteration (ms): 2975.6 | learning rate: 5.793E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.719929E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 688.58 | backward: 2158.49 | backward-backward: 2158.46 | backward-allreduce: 0.00 | optimizer: 128.00 | batch generator: 0.98 + samples/sec: 5.393 | iteration 230900/ 320000 | elapsed time per iteration (ms): 2967.0 | learning rate: 5.781E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.715138E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 687.70 | backward: 2152.52 | backward-backward: 2152.50 | backward-allreduce: 0.00 | optimizer: 126.18 | batch generator: 1.05 + samples/sec: 5.393 | iteration 231000/ 320000 | elapsed time per iteration (ms): 2966.7 | learning rate: 5.770E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.721133E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 683.95 | backward: 2154.52 | backward-backward: 2154.50 | backward-allreduce: 0.00 | optimizer: 127.70 | batch generator: 1.02 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 231000 | lm_loss value: 2.752268E+00 | lm_loss_ppl value: 1.567815E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 5.273 | iteration 231100/ 320000 | elapsed time per iteration (ms): 3034.1 | learning rate: 
5.758E-05 | approx flops per GPU: 32.8TFLOPS | lm_loss: 2.705823E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 684.65 | backward: 2153.68 | backward-backward: 2153.66 | backward-allreduce: 0.00 | optimizer: 126.41 | batch generator: 1.14 + samples/sec: 5.387 | iteration 231200/ 320000 | elapsed time per iteration (ms): 2970.3 | learning rate: 5.747E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.727592E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 686.85 | backward: 2155.24 | backward-backward: 2155.21 | backward-allreduce: 0.00 | optimizer: 127.68 | batch generator: 1.06 + samples/sec: 5.393 | iteration 231300/ 320000 | elapsed time per iteration (ms): 2966.8 | learning rate: 5.735E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.713584E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 686.10 | backward: 2152.99 | backward-backward: 2152.96 | backward-allreduce: 0.00 | optimizer: 127.19 | batch generator: 1.01 + samples/sec: 5.396 | iteration 231400/ 320000 | elapsed time per iteration (ms): 2965.4 | learning rate: 5.724E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.713705E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 686.37 | backward: 2152.56 | backward-backward: 2152.53 | backward-allreduce: 0.00 | optimizer: 125.96 | batch generator: 1.13 + samples/sec: 5.380 | iteration 231500/ 320000 | elapsed time per iteration (ms): 2973.8 | learning rate: 5.712E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.739738E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 687.47 | backward: 2156.95 | backward-backward: 2156.92 | backward-allreduce: 0.00 | optimizer: 128.85 | batch generator: 1.02 + samples/sec: 5.400 | iteration 231600/ 
320000 | elapsed time per iteration (ms): 2962.8 | learning rate: 5.701E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.726812E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 685.30 | backward: 2149.23 | backward-backward: 2149.20 | backward-allreduce: 0.00 | optimizer: 127.69 | batch generator: 1.05 + samples/sec: 5.396 | iteration 231700/ 320000 | elapsed time per iteration (ms): 2965.4 | learning rate: 5.689E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.717003E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 687.25 | backward: 2150.99 | backward-backward: 2150.97 | backward-allreduce: 0.00 | optimizer: 126.63 | batch generator: 0.95 + samples/sec: 5.383 | iteration 231800/ 320000 | elapsed time per iteration (ms): 2972.4 | learning rate: 5.677E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.726895E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 689.80 | backward: 2154.21 | backward-backward: 2154.19 | backward-allreduce: 0.00 | optimizer: 127.91 | batch generator: 1.04 + samples/sec: 5.395 | iteration 231900/ 320000 | elapsed time per iteration (ms): 2965.6 | learning rate: 5.666E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.730956E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 686.05 | backward: 2151.64 | backward-backward: 2151.62 | backward-allreduce: 0.00 | optimizer: 127.36 | batch generator: 1.04 + samples/sec: 5.400 | iteration 232000/ 320000 | elapsed time per iteration (ms): 2963.0 | learning rate: 5.654E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.705334E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 687.12 | backward: 2149.82 | backward-backward: 2149.79 | backward-allreduce: 0.00 | optimizer: 125.59 
| batch generator: 1.00 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 232000 | lm_loss value: 2.729872E+00 | lm_loss_ppl value: 1.533093E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 5.255 | iteration 232100/ 320000 | elapsed time per iteration (ms): 3044.9 | learning rate: 5.643E-05 | approx flops per GPU: 32.6TFLOPS | lm_loss: 2.711451E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 688.17 | backward: 2158.16 | backward-backward: 2158.13 | backward-allreduce: 0.00 | optimizer: 129.13 | batch generator: 1.12 + samples/sec: 5.399 | iteration 232200/ 320000 | elapsed time per iteration (ms): 2963.7 | learning rate: 5.631E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.735332E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 687.46 | backward: 2149.92 | backward-backward: 2149.89 | backward-allreduce: 0.00 | optimizer: 125.81 | batch generator: 1.01 + samples/sec: 5.392 | iteration 232300/ 320000 | elapsed time per iteration (ms): 2967.5 | learning rate: 5.620E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.710974E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 686.05 | backward: 2153.57 | backward-backward: 2153.55 | backward-allreduce: 0.00 | optimizer: 127.34 | batch generator: 1.06 + samples/sec: 5.375 | iteration 232400/ 320000 | elapsed time per iteration (ms): 2976.8 | learning rate: 5.608E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.691380E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 687.94 | backward: 2159.44 | backward-backward: 2159.42 | backward-allreduce: 0.00 | optimizer: 128.95 | batch generator: 
1.04 + samples/sec: 5.393 | iteration 232500/ 320000 | elapsed time per iteration (ms): 2967.1 | learning rate: 5.597E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.687236E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 685.43 | backward: 2154.01 | backward-backward: 2153.98 | backward-allreduce: 0.00 | optimizer: 127.11 | batch generator: 1.03 + samples/sec: 5.393 | iteration 232600/ 320000 | elapsed time per iteration (ms): 2966.9 | learning rate: 5.585E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.722927E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 686.58 | backward: 2153.24 | backward-backward: 2153.22 | backward-allreduce: 0.00 | optimizer: 126.52 | batch generator: 1.02 + samples/sec: 5.376 | iteration 232700/ 320000 | elapsed time per iteration (ms): 2976.2 | learning rate: 5.574E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.713356E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 687.26 | backward: 2159.95 | backward-backward: 2159.93 | backward-allreduce: 0.00 | optimizer: 128.42 | batch generator: 1.13 + samples/sec: 5.395 | iteration 232800/ 320000 | elapsed time per iteration (ms): 2965.9 | learning rate: 5.563E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.684577E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 684.41 | backward: 2154.12 | backward-backward: 2154.10 | backward-allreduce: 0.00 | optimizer: 126.75 | batch generator: 1.05 + samples/sec: 5.395 | iteration 232900/ 320000 | elapsed time per iteration (ms): 2965.9 | learning rate: 5.551E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.709340E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 685.47 | backward: 2152.98 | backward-backward: 2152.96 
| backward-allreduce: 0.00 | optimizer: 126.92 | batch generator: 0.99 + samples/sec: 5.378 | iteration 233000/ 320000 | elapsed time per iteration (ms): 2974.8 | learning rate: 5.540E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.725794E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 686.08 | backward: 2160.53 | backward-backward: 2160.51 | backward-allreduce: 0.00 | optimizer: 127.67 | batch generator: 1.06 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 233000 | lm_loss value: 2.753324E+00 | lm_loss_ppl value: 1.569471E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 5.271 | iteration 233100/ 320000 | elapsed time per iteration (ms): 3035.4 | learning rate: 5.528E-05 | approx flops per GPU: 32.7TFLOPS | lm_loss: 2.701660E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 687.61 | backward: 2149.71 | backward-backward: 2149.68 | backward-allreduce: 0.00 | optimizer: 128.70 | batch generator: 1.23 + samples/sec: 5.397 | iteration 233200/ 320000 | elapsed time per iteration (ms): 2964.5 | learning rate: 5.517E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.727024E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 685.52 | backward: 2152.93 | backward-backward: 2152.91 | backward-allreduce: 0.00 | optimizer: 125.59 | batch generator: 1.00 + samples/sec: 5.378 | iteration 233300/ 320000 | elapsed time per iteration (ms): 2975.3 | learning rate: 5.506E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.716054E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 688.47 | backward: 2158.63 | backward-backward: 2158.60 | backward-allreduce: 
0.00 | optimizer: 127.66 | batch generator: 1.04 + samples/sec: 5.395 | iteration 233400/ 320000 | elapsed time per iteration (ms): 2966.0 | learning rate: 5.494E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.697893E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 685.34 | backward: 2151.37 | backward-backward: 2151.35 | backward-allreduce: 0.00 | optimizer: 128.73 | batch generator: 1.00 + samples/sec: 5.392 | iteration 233500/ 320000 | elapsed time per iteration (ms): 2967.4 | learning rate: 5.483E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.714317E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 687.19 | backward: 2150.33 | backward-backward: 2150.30 | backward-allreduce: 0.00 | optimizer: 129.34 | batch generator: 1.06 + samples/sec: 5.377 | iteration 233600/ 320000 | elapsed time per iteration (ms): 2975.7 | learning rate: 5.471E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.717295E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 687.85 | backward: 2158.83 | backward-backward: 2158.81 | backward-allreduce: 0.00 | optimizer: 128.54 | batch generator: 1.04 + samples/sec: 5.392 | iteration 233700/ 320000 | elapsed time per iteration (ms): 2967.2 | learning rate: 5.460E-05 | approx flops per GPU: 33.5TFLOPS | lm_loss: 2.713669E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 686.99 | backward: 2152.02 | backward-backward: 2152.00 | backward-allreduce: 0.00 | optimizer: 127.69 | batch generator: 1.07 + samples/sec: 5.384 | iteration 233800/ 320000 | elapsed time per iteration (ms): 2972.0 | learning rate: 5.449E-05 | approx flops per GPU: 33.4TFLOPS | lm_loss: 2.723139E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 688.16 | 
backward: 2155.96 | backward-backward: 2155.94 | backward-allreduce: 0.00 | optimizer: 127.41 | batch generator: 1.02 + samples/sec: 6.382 | iteration 233900/ 320000 | elapsed time per iteration (ms): 2506.9 | learning rate: 5.437E-05 | approx flops per GPU: 39.7TFLOPS | lm_loss: 2.716029E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 584.25 | backward: 1855.84 | backward-backward: 1855.82 | backward-allreduce: 0.00 | optimizer: 66.40 | batch generator: 0.80 + samples/sec: 6.595 | iteration 234000/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 5.426E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.713698E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.55 | backward: 1803.71 | backward-backward: 1803.69 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.79 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 234000 | lm_loss value: 2.720017E+00 | lm_loss_ppl value: 1.518059E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.438 | iteration 234100/ 320000 | elapsed time per iteration (ms): 2485.1 | learning rate: 5.415E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.689934E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.45 | backward: 1805.12 | backward-backward: 1805.09 | backward-allreduce: 0.00 | optimizer: 56.26 | batch generator: 0.86 + samples/sec: 6.594 | iteration 234200/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 5.403E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.689576E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.36 | backward: 1804.26 | 
backward-backward: 1804.23 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.85 + samples/sec: 6.592 | iteration 234300/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 5.392E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.708188E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.34 | backward: 1804.76 | backward-backward: 1804.73 | backward-allreduce: 0.00 | optimizer: 55.77 | batch generator: 0.78 + samples/sec: 6.594 | iteration 234400/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 5.381E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.708034E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.53 | backward: 1804.27 | backward-backward: 1804.24 | backward-allreduce: 0.00 | optimizer: 55.12 | batch generator: 0.79 + samples/sec: 6.593 | iteration 234500/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 5.370E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.693375E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.41 | backward: 1804.27 | backward-backward: 1804.24 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.76 + samples/sec: 6.596 | iteration 234600/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 5.358E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.729903E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.44 | backward: 1803.72 | backward-backward: 1803.70 | backward-allreduce: 0.00 | optimizer: 55.06 | batch generator: 0.76 + samples/sec: 6.593 | iteration 234700/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 5.347E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.702033E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | +time (ms) | forward: 566.58 | backward: 1804.08 | backward-backward: 1804.06 | backward-allreduce: 0.00 | optimizer: 55.80 | batch generator: 0.80 + samples/sec: 6.594 | iteration 234800/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 5.336E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.723404E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.36 | backward: 1804.02 | backward-backward: 1803.99 | backward-allreduce: 0.00 | optimizer: 55.87 | batch generator: 0.79 + samples/sec: 6.594 | iteration 234900/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 5.325E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.707123E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.70 | backward: 1804.03 | backward-backward: 1804.00 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.78 + samples/sec: 6.593 | iteration 235000/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 5.313E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.719583E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.56 | backward: 1803.97 | backward-backward: 1803.94 | backward-allreduce: 0.00 | optimizer: 55.96 | batch generator: 0.79 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 235000 | lm_loss value: 2.694426E+00 | lm_loss_ppl value: 1.479702E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.442 | iteration 235100/ 320000 | elapsed time per iteration (ms): 2483.9 | learning rate: 5.302E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.698541E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) 
| forward: 567.02 | backward: 1803.87 | backward-backward: 1803.84 | backward-allreduce: 0.00 | optimizer: 55.81 | batch generator: 0.93 + samples/sec: 6.592 | iteration 235200/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 5.291E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.706649E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.47 | backward: 1804.25 | backward-backward: 1804.23 | backward-allreduce: 0.00 | optimizer: 56.14 | batch generator: 0.79 + samples/sec: 6.594 | iteration 235300/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 5.280E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.715706E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.44 | backward: 1803.81 | backward-backward: 1803.79 | backward-allreduce: 0.00 | optimizer: 55.81 | batch generator: 0.78 + samples/sec: 6.594 | iteration 235400/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 5.268E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.709256E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.55 | backward: 1803.59 | backward-backward: 1803.56 | backward-allreduce: 0.00 | optimizer: 56.01 | batch generator: 0.79 + samples/sec: 6.593 | iteration 235500/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 5.257E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.712444E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.78 | backward: 1804.04 | backward-backward: 1804.02 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.83 + samples/sec: 6.593 | iteration 235600/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 5.246E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.698996E+00 | loss scale: 65536.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.41 | backward: 1804.48 | backward-backward: 1804.46 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.79 + samples/sec: 6.592 | iteration 235700/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 5.235E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.700073E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.42 | backward: 1804.53 | backward-backward: 1804.51 | backward-allreduce: 0.00 | optimizer: 55.80 | batch generator: 0.85 + samples/sec: 6.593 | iteration 235800/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 5.224E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.706134E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.40 | backward: 1804.32 | backward-backward: 1804.29 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.80 + samples/sec: 6.594 | iteration 235900/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 5.213E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.688730E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.34 | backward: 1804.49 | backward-backward: 1804.46 | backward-allreduce: 0.00 | optimizer: 55.11 | batch generator: 0.81 + samples/sec: 6.591 | iteration 236000/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 5.202E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.681030E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.26 | backward: 1804.73 | backward-backward: 1804.71 | backward-allreduce: 0.00 | optimizer: 55.97 | batch generator: 0.73 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 236000 
| lm_loss value: 2.701216E+00 | lm_loss_ppl value: 1.489783E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.442 | iteration 236100/ 320000 | elapsed time per iteration (ms): 2483.6 | learning rate: 5.190E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.683914E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.38 | backward: 1804.26 | backward-backward: 1804.23 | backward-allreduce: 0.00 | optimizer: 55.83 | batch generator: 0.85 + samples/sec: 6.594 | iteration 236200/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 5.179E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.734350E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.32 | backward: 1804.28 | backward-backward: 1804.25 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.78 + samples/sec: 6.591 | iteration 236300/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 5.168E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.680621E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.57 | backward: 1804.36 | backward-backward: 1804.34 | backward-allreduce: 0.00 | optimizer: 56.24 | batch generator: 0.90 + samples/sec: 6.594 | iteration 236400/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 5.157E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.711476E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.26 | backward: 1804.05 | backward-backward: 1804.02 | backward-allreduce: 0.00 | optimizer: 55.88 | batch generator: 0.79 + samples/sec: 6.594 | iteration 236500/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 5.146E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.695771E+00 
| loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.53 | backward: 1803.84 | backward-backward: 1803.82 | backward-allreduce: 0.00 | optimizer: 55.76 | batch generator: 0.90 + samples/sec: 6.595 | iteration 236600/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 5.135E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.693739E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.43 | backward: 1803.72 | backward-backward: 1803.70 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.86 + samples/sec: 6.594 | iteration 236700/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 5.124E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.695582E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.27 | backward: 1804.07 | backward-backward: 1804.04 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.77 + samples/sec: 6.595 | iteration 236800/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 5.113E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.701288E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.22 | backward: 1803.92 | backward-backward: 1803.90 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.75 + samples/sec: 6.594 | iteration 236900/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 5.102E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.700592E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.33 | backward: 1804.20 | backward-backward: 1804.17 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.78 + samples/sec: 6.594 | iteration 237000/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 
5.091E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.719792E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.51 | backward: 1803.70 | backward-backward: 1803.67 | backward-allreduce: 0.00 | optimizer: 55.99 | batch generator: 0.81 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 237000 | lm_loss value: 2.685239E+00 | lm_loss_ppl value: 1.466171E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.443 | iteration 237100/ 320000 | elapsed time per iteration (ms): 2483.3 | learning rate: 5.080E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.695771E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.25 | backward: 1804.39 | backward-backward: 1804.37 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.86 + samples/sec: 6.595 | iteration 237200/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 5.069E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.695527E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.22 | backward: 1804.18 | backward-backward: 1804.15 | backward-allreduce: 0.00 | optimizer: 55.42 | batch generator: 0.80 + samples/sec: 6.591 | iteration 237300/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 5.057E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.715274E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.42 | backward: 1805.03 | backward-backward: 1805.01 | backward-allreduce: 0.00 | optimizer: 55.87 | batch generator: 0.89 + samples/sec: 6.591 | iteration 237400/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 5.046E-05 | approx flops 
per GPU: 40.9TFLOPS | lm_loss: 2.676222E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.31 | backward: 1804.49 | backward-backward: 1804.46 | backward-allreduce: 0.00 | optimizer: 56.34 | batch generator: 0.77 + samples/sec: 6.595 | iteration 237500/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 5.035E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.715969E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.72 | backward: 1803.57 | backward-backward: 1803.54 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.78 + samples/sec: 6.596 | iteration 237600/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 5.024E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.698774E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.14 | backward: 1803.36 | backward-backward: 1803.33 | backward-allreduce: 0.00 | optimizer: 55.96 | batch generator: 0.79 + samples/sec: 6.590 | iteration 237700/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 5.013E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.701888E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.45 | backward: 1805.37 | backward-backward: 1805.35 | backward-allreduce: 0.00 | optimizer: 55.76 | batch generator: 0.81 + samples/sec: 6.593 | iteration 237800/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 5.002E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.699966E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.54 | backward: 1804.50 | backward-backward: 1804.48 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.78 + samples/sec: 6.593 | iteration 237900/ 320000 | elapsed time per 
iteration (ms): 2426.8 | learning rate: 4.992E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.693690E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.60 | backward: 1804.60 | backward-backward: 1804.58 | backward-allreduce: 0.00 | optimizer: 55.23 | batch generator: 0.80 + samples/sec: 6.591 | iteration 238000/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 4.981E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.698171E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.57 | backward: 1804.59 | backward-backward: 1804.56 | backward-allreduce: 0.00 | optimizer: 56.13 | batch generator: 0.80 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 238000 | lm_loss value: 2.593055E+00 | lm_loss_ppl value: 1.337056E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.442 | iteration 238100/ 320000 | elapsed time per iteration (ms): 2483.9 | learning rate: 4.970E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.665128E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.34 | backward: 1804.43 | backward-backward: 1804.40 | backward-allreduce: 0.00 | optimizer: 55.88 | batch generator: 0.88 + samples/sec: 6.594 | iteration 238200/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 4.959E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.699159E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.19 | backward: 1804.42 | backward-backward: 1804.40 | backward-allreduce: 0.00 | optimizer: 55.29 | batch generator: 0.74 + samples/sec: 6.595 | iteration 238300/ 320000 | elapsed time per iteration (ms): 2426.3 | 
learning rate: 4.948E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.682807E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.35 | backward: 1803.81 | backward-backward: 1803.78 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.80 + samples/sec: 6.593 | iteration 238400/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 4.937E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.649428E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.30 | backward: 1804.14 | backward-backward: 1804.12 | backward-allreduce: 0.00 | optimizer: 56.10 | batch generator: 0.79 + samples/sec: 6.596 | iteration 238500/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 4.926E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.709008E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.26 | backward: 1803.77 | backward-backward: 1803.74 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.77 + samples/sec: 6.595 | iteration 238600/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 4.915E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.718856E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.34 | backward: 1803.95 | backward-backward: 1803.93 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.78 + samples/sec: 6.595 | iteration 238700/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 4.904E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.696378E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.29 | backward: 1804.01 | backward-backward: 1803.99 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.78 + samples/sec: 6.594 | 
iteration 238800/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 4.893E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.703105E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.30 | backward: 1803.93 | backward-backward: 1803.90 | backward-allreduce: 0.00 | optimizer: 55.77 | batch generator: 0.79 + samples/sec: 6.594 | iteration 238900/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 4.883E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.688863E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.39 | backward: 1803.98 | backward-backward: 1803.95 | backward-allreduce: 0.00 | optimizer: 55.80 | batch generator: 0.90 + samples/sec: 6.594 | iteration 239000/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 4.872E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.670611E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.22 | backward: 1804.08 | backward-backward: 1804.06 | backward-allreduce: 0.00 | optimizer: 55.89 | batch generator: 0.76 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 239000 | lm_loss value: 2.589211E+00 | lm_loss_ppl value: 1.331926E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.438 | iteration 239100/ 320000 | elapsed time per iteration (ms): 2485.1 | learning rate: 4.861E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.683011E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.88 | backward: 1804.85 | backward-backward: 1804.82 | backward-allreduce: 0.00 | optimizer: 56.11 | batch generator: 0.88 + samples/sec: 6.591 | iteration 239200/ 320000 | 
elapsed time per iteration (ms): 2427.7 | learning rate: 4.850E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.684034E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.93 | backward: 1804.61 | backward-backward: 1804.59 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.96 + samples/sec: 6.592 | iteration 239300/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 4.839E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.688093E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.45 | backward: 1804.22 | backward-backward: 1804.20 | backward-allreduce: 0.00 | optimizer: 56.06 | batch generator: 0.80 + samples/sec: 6.597 | iteration 239400/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 4.828E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.685385E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.10 | backward: 1803.47 | backward-backward: 1803.44 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.81 + samples/sec: 6.588 | iteration 239500/ 320000 | elapsed time per iteration (ms): 2428.7 | learning rate: 4.817E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.694173E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.55 | backward: 1805.45 | backward-backward: 1805.42 | backward-allreduce: 0.00 | optimizer: 56.31 | batch generator: 0.81 + samples/sec: 6.592 | iteration 239600/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 4.807E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.701760E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.48 | backward: 1804.62 | backward-backward: 1804.60 | backward-allreduce: 0.00 | optimizer: 55.55 | batch 
generator: 0.81 + samples/sec: 6.592 | iteration 239700/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 4.796E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.676217E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.47 | backward: 1804.73 | backward-backward: 1804.70 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.78 + samples/sec: 6.590 | iteration 239800/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 4.785E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.691137E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.49 | backward: 1805.06 | backward-backward: 1805.04 | backward-allreduce: 0.00 | optimizer: 55.97 | batch generator: 0.82 + samples/sec: 6.597 | iteration 239900/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 4.774E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.701331E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.01 | backward: 1803.53 | backward-backward: 1803.51 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.80 + samples/sec: 6.591 | iteration 240000/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 4.764E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.687798E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.65 | backward: 1804.83 | backward-backward: 1804.81 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.85 +WARNING: Deleting old checkpoints: + checkpoints-fcm/global_step140000 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 240000 | lm_loss value: 2.664314E+00 | lm_loss_ppl value: 1.435810E+01 | 
+----------------------------------------------------------------------------------------------------------- + samples/sec: 6.015 | iteration 240100/ 320000 | elapsed time per iteration (ms): 2660.2 | learning rate: 4.753E-05 | approx flops per GPU: 37.4TFLOPS | lm_loss: 2.676460E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 586.89 | backward: 1870.73 | backward-backward: 1870.71 | backward-allreduce: 0.00 | optimizer: 62.73 | batch generator: 0.89 + samples/sec: 6.573 | iteration 240200/ 320000 | elapsed time per iteration (ms): 2434.3 | learning rate: 4.742E-05 | approx flops per GPU: 40.8TFLOPS | lm_loss: 2.686707E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 568.55 | backward: 1809.86 | backward-backward: 1809.84 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.81 + samples/sec: 6.600 | iteration 240300/ 320000 | elapsed time per iteration (ms): 2424.1 | learning rate: 4.731E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.699761E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.07 | backward: 1802.59 | backward-backward: 1802.56 | backward-allreduce: 0.00 | optimizer: 55.09 | batch generator: 0.81 + samples/sec: 6.590 | iteration 240400/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 4.721E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.662097E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.82 | backward: 1804.91 | backward-backward: 1804.89 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.79 + samples/sec: 6.593 | iteration 240500/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 4.710E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.677684E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | +time (ms) | forward: 566.80 | backward: 1804.28 | backward-backward: 1804.26 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.79 + samples/sec: 6.596 | iteration 240600/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 4.699E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.696271E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.95 | backward: 1803.32 | backward-backward: 1803.30 | backward-allreduce: 0.00 | optimizer: 56.20 | batch generator: 0.77 + samples/sec: 6.591 | iteration 240700/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 4.689E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.700434E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.59 | backward: 1804.85 | backward-backward: 1804.83 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.77 + samples/sec: 6.595 | iteration 240800/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 4.678E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.666339E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.42 | backward: 1803.72 | backward-backward: 1803.69 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.78 + samples/sec: 6.595 | iteration 240900/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 4.667E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.673617E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.25 | backward: 1803.90 | backward-backward: 1803.88 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.77 + samples/sec: 6.590 | iteration 241000/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 4.657E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.666004E+00 | 
loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.93 | backward: 1805.06 | backward-backward: 1805.04 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.79 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 241000 | lm_loss value: 2.679104E+00 | lm_loss_ppl value: 1.457203E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.446 | iteration 241100/ 320000 | elapsed time per iteration (ms): 2482.2 | learning rate: 4.646E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.690795E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.12 | backward: 1803.42 | backward-backward: 1803.39 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.87 + samples/sec: 6.590 | iteration 241200/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 4.635E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.666360E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.81 | backward: 1805.15 | backward-backward: 1805.12 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.80 + samples/sec: 6.597 | iteration 241300/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 4.625E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.682762E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.11 | backward: 1803.20 | backward-backward: 1803.18 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.80 + samples/sec: 6.590 | iteration 241400/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 4.614E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.666826E+00 | loss scale: 65536.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.56 | backward: 1805.23 | backward-backward: 1805.20 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.77 + samples/sec: 6.590 | iteration 241500/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 4.603E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.690869E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.68 | backward: 1805.34 | backward-backward: 1805.32 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.79 + samples/sec: 6.596 | iteration 241600/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 4.593E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.697332E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.01 | backward: 1803.58 | backward-backward: 1803.55 | backward-allreduce: 0.00 | optimizer: 55.89 | batch generator: 0.81 + samples/sec: 6.585 | iteration 241700/ 320000 | elapsed time per iteration (ms): 2429.8 | learning rate: 4.582E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.680399E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.87 | backward: 1806.36 | backward-backward: 1806.34 | backward-allreduce: 0.00 | optimizer: 56.20 | batch generator: 0.77 + samples/sec: 6.596 | iteration 241800/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 4.572E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.676748E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.29 | backward: 1803.57 | backward-backward: 1803.54 | backward-allreduce: 0.00 | optimizer: 55.42 | batch generator: 0.79 + samples/sec: 6.593 | iteration 241900/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 4.561E-05 | approx flops per 
GPU: 41.0TFLOPS | lm_loss: 2.665471E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.71 | backward: 1804.40 | backward-backward: 1804.37 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.80 + samples/sec: 6.589 | iteration 242000/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 4.550E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.659221E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.86 | backward: 1805.32 | backward-backward: 1805.29 | backward-allreduce: 0.00 | optimizer: 55.87 | batch generator: 0.79 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 242000 | lm_loss value: 2.695219E+00 | lm_loss_ppl value: 1.480876E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.447 | iteration 242100/ 320000 | elapsed time per iteration (ms): 2481.8 | learning rate: 4.540E-05 | approx flops per GPU: 40.1TFLOPS | lm_loss: 2.690719E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.07 | backward: 1803.40 | backward-backward: 1803.37 | backward-allreduce: 0.00 | optimizer: 55.18 | batch generator: 0.95 + samples/sec: 6.584 | iteration 242200/ 320000 | elapsed time per iteration (ms): 2430.2 | learning rate: 4.529E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.672768E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.26 | backward: 1806.27 | backward-backward: 1806.25 | backward-allreduce: 0.00 | optimizer: 56.33 | batch generator: 0.77 + samples/sec: 6.596 | iteration 242300/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 4.519E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 
2.677471E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.23 | backward: 1803.64 | backward-backward: 1803.62 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.77 + samples/sec: 6.595 | iteration 242400/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 4.508E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.670738E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.33 | backward: 1804.26 | backward-backward: 1804.24 | backward-allreduce: 0.00 | optimizer: 55.22 | batch generator: 0.76 + samples/sec: 6.588 | iteration 242500/ 320000 | elapsed time per iteration (ms): 2428.6 | learning rate: 4.498E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.674655E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.87 | backward: 1805.44 | backward-backward: 1805.42 | backward-allreduce: 0.00 | optimizer: 55.89 | batch generator: 0.78 + samples/sec: 6.595 | iteration 242600/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 4.487E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.672916E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.36 | backward: 1803.39 | backward-backward: 1803.36 | backward-allreduce: 0.00 | optimizer: 55.77 | batch generator: 0.79 + samples/sec: 6.594 | iteration 242700/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 4.477E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.653176E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.27 | backward: 1804.36 | backward-backward: 1804.34 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.80 + samples/sec: 6.584 | iteration 242800/ 320000 | elapsed time per iteration (ms): 2430.1 | learning 
rate: 4.466E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.698283E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.28 | backward: 1805.99 | backward-backward: 1805.96 | backward-allreduce: 0.00 | optimizer: 56.44 | batch generator: 0.79 + samples/sec: 6.598 | iteration 242900/ 320000 | elapsed time per iteration (ms): 2425.0 | learning rate: 4.456E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.680118E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.17 | backward: 1802.97 | backward-backward: 1802.94 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.78 + samples/sec: 6.594 | iteration 243000/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 4.445E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.675104E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.47 | backward: 1804.34 | backward-backward: 1804.31 | backward-allreduce: 0.00 | optimizer: 55.42 | batch generator: 0.79 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 243000 | lm_loss value: 2.589995E+00 | lm_loss_ppl value: 1.332971E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.438 | iteration 243100/ 320000 | elapsed time per iteration (ms): 2485.2 | learning rate: 4.435E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.702460E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.13 | backward: 1805.27 | backward-backward: 1805.25 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.86 + samples/sec: 6.599 | iteration 243200/ 320000 | elapsed time per iteration (ms): 2424.6 | learning rate: 4.425E-05 | approx 
flops per GPU: 41.0TFLOPS | lm_loss: 2.678313E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.92 | backward: 1802.74 | backward-backward: 1802.71 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.79 + samples/sec: 6.590 | iteration 243300/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 4.414E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.695881E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.80 | backward: 1804.72 | backward-backward: 1804.69 | backward-allreduce: 0.00 | optimizer: 55.85 | batch generator: 0.79 + samples/sec: 6.594 | iteration 243400/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 4.404E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.675390E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.34 | backward: 1803.95 | backward-backward: 1803.93 | backward-allreduce: 0.00 | optimizer: 55.81 | batch generator: 0.78 + samples/sec: 6.596 | iteration 243500/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 4.393E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.655071E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.86 | backward: 1803.69 | backward-backward: 1803.67 | backward-allreduce: 0.00 | optimizer: 54.87 | batch generator: 0.78 + samples/sec: 6.587 | iteration 243600/ 320000 | elapsed time per iteration (ms): 2429.2 | learning rate: 4.383E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.675841E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.65 | backward: 1806.15 | backward-backward: 1806.12 | backward-allreduce: 0.00 | optimizer: 56.03 | batch generator: 0.79 + samples/sec: 6.597 | iteration 243700/ 320000 | elapsed time 
per iteration (ms): 2425.3 | learning rate: 4.373E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.665742E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.83 | backward: 1803.44 | backward-backward: 1803.41 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.79 + samples/sec: 6.588 | iteration 243800/ 320000 | elapsed time per iteration (ms): 2428.6 | learning rate: 4.362E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.668053E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.62 | backward: 1805.90 | backward-backward: 1805.88 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.79 + samples/sec: 6.591 | iteration 243900/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 4.352E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.667437E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.26 | backward: 1804.86 | backward-backward: 1804.84 | backward-allreduce: 0.00 | optimizer: 56.15 | batch generator: 0.77 + samples/sec: 6.594 | iteration 244000/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 4.341E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.675365E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.19 | backward: 1804.51 | backward-backward: 1804.48 | backward-allreduce: 0.00 | optimizer: 55.41 | batch generator: 0.82 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 244000 | lm_loss value: 2.705753E+00 | lm_loss_ppl value: 1.496558E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.437 | iteration 244100/ 320000 | elapsed time per iteration (ms): 2485.7 
| learning rate: 4.331E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.662417E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.76 | backward: 1806.00 | backward-backward: 1805.97 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.83 + samples/sec: 6.597 | iteration 244200/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 4.321E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.660334E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.99 | backward: 1803.41 | backward-backward: 1803.38 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.81 + samples/sec: 6.589 | iteration 244300/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 4.310E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.688346E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.65 | backward: 1805.71 | backward-backward: 1805.69 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.83 + samples/sec: 6.593 | iteration 244400/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 4.300E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.679316E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.40 | backward: 1804.17 | backward-backward: 1804.14 | backward-allreduce: 0.00 | optimizer: 55.97 | batch generator: 0.79 + samples/sec: 6.594 | iteration 244500/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 4.290E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.647599E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.08 | backward: 1804.30 | backward-backward: 1804.28 | backward-allreduce: 0.00 | optimizer: 55.86 | batch generator: 0.78 + samples/sec: 6.586 | 
iteration 244600/ 320000 | elapsed time per iteration (ms): 2429.5 | learning rate: 4.279E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.670952E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.68 | backward: 1806.82 | backward-backward: 1806.79 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.77 + samples/sec: 6.595 | iteration 244700/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 4.269E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.664006E+00 | loss scale: 65536.0 | number of skipped iterations: 2 | number of nan iterations: 0 | +time (ms) | forward: 566.39 | backward: 1804.64 | backward-backward: 1804.61 | backward-allreduce: 0.00 | optimizer: 54.81 | batch generator: 0.82 + samples/sec: 6.589 | iteration 244800/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 4.259E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.683183E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.58 | backward: 1805.52 | backward-backward: 1805.50 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.82 + samples/sec: 6.593 | iteration 244900/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 4.249E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.675310E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.52 | backward: 1804.28 | backward-backward: 1804.25 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.80 + samples/sec: 6.589 | iteration 245000/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 4.239E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.650290E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.43 | backward: 1804.99 | backward-backward: 1804.97 | backward-allreduce: 0.00 | 
optimizer: 56.39 | batch generator: 0.88 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 245000 | lm_loss value: 2.603283E+00 | lm_loss_ppl value: 1.350801E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.436 | iteration 245100/ 320000 | elapsed time per iteration (ms): 2485.9 | learning rate: 4.228E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.664464E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.96 | backward: 1806.00 | backward-backward: 1805.98 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.89 + samples/sec: 6.594 | iteration 245200/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 4.218E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.652548E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.08 | backward: 1804.14 | backward-backward: 1804.11 | backward-allreduce: 0.00 | optimizer: 55.80 | batch generator: 0.81 + samples/sec: 6.588 | iteration 245300/ 320000 | elapsed time per iteration (ms): 2428.7 | learning rate: 4.208E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.645935E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.72 | backward: 1806.04 | backward-backward: 1806.01 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.79 + samples/sec: 6.595 | iteration 245400/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 4.198E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.669272E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.46 | backward: 1803.92 | backward-backward: 1803.89 | backward-allreduce: 0.00 | optimizer: 55.45 | batch 
generator: 0.78 + samples/sec: 6.591 | iteration 245500/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 4.187E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.666593E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.39 | backward: 1804.99 | backward-backward: 1804.96 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.77 + samples/sec: 6.590 | iteration 245600/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 4.177E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.665948E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.75 | backward: 1805.04 | backward-backward: 1805.01 | backward-allreduce: 0.00 | optimizer: 55.79 | batch generator: 0.79 + samples/sec: 6.596 | iteration 245700/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 4.167E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.665713E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.92 | backward: 1804.17 | backward-backward: 1804.14 | backward-allreduce: 0.00 | optimizer: 55.28 | batch generator: 0.83 + samples/sec: 6.589 | iteration 245800/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 4.157E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.667712E+00 | loss scale: 65536.0 | number of skipped iterations: 2 | number of nan iterations: 0 | +time (ms) | forward: 566.69 | backward: 1806.48 | backward-backward: 1806.45 | backward-allreduce: 0.00 | optimizer: 54.83 | batch generator: 0.76 + samples/sec: 6.596 | iteration 245900/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 4.147E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.658289E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.50 | backward: 1803.44 | backward-backward: 
1803.42 | backward-allreduce: 0.00 | optimizer: 55.36 | batch generator: 0.87 + samples/sec: 6.592 | iteration 246000/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 4.137E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.678812E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.52 | backward: 1804.41 | backward-backward: 1804.38 | backward-allreduce: 0.00 | optimizer: 55.80 | batch generator: 0.80 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 246000 | lm_loss value: 2.643543E+00 | lm_loss_ppl value: 1.406293E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.440 | iteration 246100/ 320000 | elapsed time per iteration (ms): 2484.6 | learning rate: 4.127E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.658771E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.58 | backward: 1804.74 | backward-backward: 1804.72 | backward-allreduce: 0.00 | optimizer: 56.12 | batch generator: 0.86 + samples/sec: 6.596 | iteration 246200/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 4.117E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.651397E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.11 | backward: 1803.60 | backward-backward: 1803.58 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.79 + samples/sec: 6.588 | iteration 246300/ 320000 | elapsed time per iteration (ms): 2428.6 | learning rate: 4.106E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.672918E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.20 | backward: 1805.54 | backward-backward: 1805.51 | 
backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.90 + samples/sec: 6.596 | iteration 246400/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 4.096E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.640948E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.18 | backward: 1803.68 | backward-backward: 1803.66 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.82 + samples/sec: 6.586 | iteration 246500/ 320000 | elapsed time per iteration (ms): 2429.4 | learning rate: 4.086E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.654687E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.12 | backward: 1805.83 | backward-backward: 1805.80 | backward-allreduce: 0.00 | optimizer: 56.08 | batch generator: 0.87 + samples/sec: 6.597 | iteration 246600/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 4.076E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.656640E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.15 | backward: 1803.05 | backward-backward: 1803.02 | backward-allreduce: 0.00 | optimizer: 55.78 | batch generator: 0.77 + samples/sec: 6.590 | iteration 246700/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 4.066E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.652758E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.67 | backward: 1805.09 | backward-backward: 1805.06 | backward-allreduce: 0.00 | optimizer: 55.86 | batch generator: 0.79 + samples/sec: 6.592 | iteration 246800/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 4.056E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.672903E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | 
forward: 566.80 | backward: 1804.50 | backward-backward: 1804.48 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.81 + samples/sec: 6.593 | iteration 246900/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 4.046E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.659283E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.20 | backward: 1804.04 | backward-backward: 1804.02 | backward-allreduce: 0.00 | optimizer: 56.02 | batch generator: 0.77 + samples/sec: 6.591 | iteration 247000/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 4.036E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.659344E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.40 | backward: 1805.22 | backward-backward: 1805.20 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.78 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 247000 | lm_loss value: 2.636668E+00 | lm_loss_ppl value: 1.396660E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.441 | iteration 247100/ 320000 | elapsed time per iteration (ms): 2484.2 | learning rate: 4.026E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.649926E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.34 | backward: 1804.61 | backward-backward: 1804.58 | backward-allreduce: 0.00 | optimizer: 55.99 | batch generator: 0.90 + samples/sec: 6.590 | iteration 247200/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 4.016E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.640680E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.60 | 
backward: 1804.93 | backward-backward: 1804.91 | backward-allreduce: 0.00 | optimizer: 56.06 | batch generator: 0.81 + samples/sec: 6.591 | iteration 247300/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 4.006E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.657646E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.65 | backward: 1804.74 | backward-backward: 1804.72 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.78 + samples/sec: 6.595 | iteration 247400/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 3.996E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.658389E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.84 | backward: 1804.45 | backward-backward: 1804.42 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.78 + samples/sec: 6.587 | iteration 247500/ 320000 | elapsed time per iteration (ms): 2429.0 | learning rate: 3.986E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.670272E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.93 | backward: 1805.94 | backward-backward: 1805.91 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.86 + samples/sec: 6.595 | iteration 247600/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 3.976E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.671517E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.47 | backward: 1804.10 | backward-backward: 1804.08 | backward-allreduce: 0.00 | optimizer: 55.27 | batch generator: 0.81 + samples/sec: 6.590 | iteration 247700/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 3.966E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.649530E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | +time (ms) | forward: 566.46 | backward: 1805.25 | backward-backward: 1805.23 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.80 + samples/sec: 6.591 | iteration 247800/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 3.956E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.661805E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.93 | backward: 1804.68 | backward-backward: 1804.66 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.88 + samples/sec: 6.595 | iteration 247900/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 3.946E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.662375E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.93 | backward: 1804.36 | backward-backward: 1804.34 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.78 + samples/sec: 6.585 | iteration 248000/ 320000 | elapsed time per iteration (ms): 2429.7 | learning rate: 3.936E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.649843E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.09 | backward: 1806.22 | backward-backward: 1806.19 | backward-allreduce: 0.00 | optimizer: 56.02 | batch generator: 0.81 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 248000 | lm_loss value: 2.631509E+00 | lm_loss_ppl value: 1.389472E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.443 | iteration 248100/ 320000 | elapsed time per iteration (ms): 2483.5 | learning rate: 3.926E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.648994E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 
0 | +time (ms) | forward: 566.32 | backward: 1804.13 | backward-backward: 1804.11 | backward-allreduce: 0.00 | optimizer: 55.86 | batch generator: 0.85 + samples/sec: 6.586 | iteration 248200/ 320000 | elapsed time per iteration (ms): 2429.3 | learning rate: 3.916E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.639947E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.94 | backward: 1806.30 | backward-backward: 1806.28 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.88 + samples/sec: 6.589 | iteration 248300/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 3.906E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.643942E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.66 | backward: 1804.87 | backward-backward: 1804.84 | backward-allreduce: 0.00 | optimizer: 56.54 | batch generator: 0.77 + samples/sec: 6.589 | iteration 248400/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 3.896E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.649456E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.48 | backward: 1805.56 | backward-backward: 1805.53 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 0.81 + samples/sec: 6.589 | iteration 248500/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 3.886E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.651009E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.97 | backward: 1804.97 | backward-backward: 1804.94 | backward-allreduce: 0.00 | optimizer: 55.95 | batch generator: 0.81 + samples/sec: 6.591 | iteration 248600/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 3.876E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.643083E+00 | loss scale: 131072.0 
| number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.33 | backward: 1804.98 | backward-backward: 1804.96 | backward-allreduce: 0.00 | optimizer: 55.94 | batch generator: 0.79 + samples/sec: 6.590 | iteration 248700/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 3.867E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.645793E+00 | loss scale: 65536.0 | number of skipped iterations: 2 | number of nan iterations: 0 | +time (ms) | forward: 566.66 | backward: 1805.88 | backward-backward: 1805.85 | backward-allreduce: 0.00 | optimizer: 54.93 | batch generator: 0.81 + samples/sec: 6.596 | iteration 248800/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 3.857E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.673068E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.20 | backward: 1803.94 | backward-backward: 1803.91 | backward-allreduce: 0.00 | optimizer: 55.37 | batch generator: 0.80 + samples/sec: 6.594 | iteration 248900/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 3.847E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.652620E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.60 | backward: 1804.16 | backward-backward: 1804.13 | backward-allreduce: 0.00 | optimizer: 55.30 | batch generator: 0.80 + samples/sec: 6.590 | iteration 249000/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 3.837E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.641789E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.69 | backward: 1804.89 | backward-backward: 1804.86 | backward-allreduce: 0.00 | optimizer: 55.94 | batch generator: 0.79 +----------------------------------------------------------------------------------------------------------- + validation results at 
iteration 249000 | lm_loss value: 2.642820E+00 | lm_loss_ppl value: 1.405278E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.447 | iteration 249100/ 320000 | elapsed time per iteration (ms): 2481.8 | learning rate: 3.827E-05 | approx flops per GPU: 40.1TFLOPS | lm_loss: 2.635435E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.83 | backward: 1803.33 | backward-backward: 1803.30 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.83 + samples/sec: 6.584 | iteration 249200/ 320000 | elapsed time per iteration (ms): 2430.1 | learning rate: 3.818E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.659580E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.34 | backward: 1806.39 | backward-backward: 1806.37 | backward-allreduce: 0.00 | optimizer: 55.96 | batch generator: 0.82 + samples/sec: 6.590 | iteration 249300/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 3.808E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.652184E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.61 | backward: 1804.65 | backward-backward: 1804.63 | backward-allreduce: 0.00 | optimizer: 56.25 | batch generator: 0.78 + samples/sec: 6.595 | iteration 249400/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 3.798E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.648759E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.13 | backward: 1803.74 | backward-backward: 1803.71 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.74 + samples/sec: 6.590 | iteration 249500/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 3.788E-05 | approx flops per GPU: 40.9TFLOPS | 
lm_loss: 2.661034E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.87 | backward: 1805.02 | backward-backward: 1804.99 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.80 + samples/sec: 6.598 | iteration 249600/ 320000 | elapsed time per iteration (ms): 2424.9 | learning rate: 3.778E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.644722E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.04 | backward: 1803.06 | backward-backward: 1803.03 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.79 + samples/sec: 6.592 | iteration 249700/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 3.769E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.627822E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.62 | backward: 1804.82 | backward-backward: 1804.80 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.77 + samples/sec: 6.593 | iteration 249800/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 3.759E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.665895E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.60 | backward: 1803.99 | backward-backward: 1803.96 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.79 + samples/sec: 6.597 | iteration 249900/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 3.749E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.649741E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.04 | backward: 1803.58 | backward-backward: 1803.55 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.78 + samples/sec: 6.589 | iteration 250000/ 320000 | elapsed time per iteration (ms): 2428.2 | 
learning rate: 3.740E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.645483E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.88 | backward: 1805.72 | backward-backward: 1805.70 | backward-allreduce: 0.00 | optimizer: 55.21 | batch generator: 0.80 +WARNING: Deleting old checkpoints: + checkpoints-fcm/global_step150000 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 250000 | lm_loss value: 2.648905E+00 | lm_loss_ppl value: 1.413855E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.229 | iteration 250100/ 320000 | elapsed time per iteration (ms): 2568.7 | learning rate: 3.730E-05 | approx flops per GPU: 38.7TFLOPS | lm_loss: 2.640982E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.12 | backward: 1803.15 | backward-backward: 1803.12 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.86 + samples/sec: 6.589 | iteration 250200/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 3.720E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.657554E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.87 | backward: 1805.26 | backward-backward: 1805.24 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.79 + samples/sec: 6.594 | iteration 250300/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 3.710E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.648656E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.50 | backward: 1803.90 | backward-backward: 1803.87 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.81 + samples/sec: 6.590 | iteration 250400/ 
320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 3.701E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.621968E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.13 | backward: 1805.08 | backward-backward: 1805.06 | backward-allreduce: 0.00 | optimizer: 56.19 | batch generator: 0.80 + samples/sec: 6.589 | iteration 250500/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 3.691E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.650269E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.61 | backward: 1805.27 | backward-backward: 1805.25 | backward-allreduce: 0.00 | optimizer: 55.90 | batch generator: 0.76 + samples/sec: 6.595 | iteration 250600/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 3.681E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.633920E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.22 | backward: 1803.96 | backward-backward: 1803.94 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.79 + samples/sec: 6.588 | iteration 250700/ 320000 | elapsed time per iteration (ms): 2428.7 | learning rate: 3.672E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.640640E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.16 | backward: 1805.58 | backward-backward: 1805.56 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.79 + samples/sec: 6.597 | iteration 250800/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 3.662E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.642979E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.08 | backward: 1803.38 | backward-backward: 1803.36 | backward-allreduce: 0.00 | optimizer: 55.42 | 
batch generator: 0.78 + samples/sec: 6.590 | iteration 250900/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 3.652E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.669044E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.74 | backward: 1805.10 | backward-backward: 1805.07 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 0.86 + samples/sec: 6.594 | iteration 251000/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 3.643E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.645857E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.66 | backward: 1803.95 | backward-backward: 1803.92 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.95 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 251000 | lm_loss value: 2.626937E+00 | lm_loss_ppl value: 1.383134E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.445 | iteration 251100/ 320000 | elapsed time per iteration (ms): 2482.5 | learning rate: 3.633E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.656037E+00 | loss scale: 65536.0 | number of skipped iterations: 2 | number of nan iterations: 0 | +time (ms) | forward: 566.50 | backward: 1804.04 | backward-backward: 1804.02 | backward-allreduce: 0.00 | optimizer: 54.83 | batch generator: 0.92 + samples/sec: 6.590 | iteration 251200/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 3.624E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.631071E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.78 | backward: 1805.07 | backward-backward: 1805.05 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.79 + 
samples/sec: 6.595 | iteration 251300/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 3.614E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.647612E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.86 | backward: 1804.02 | backward-backward: 1804.00 | backward-allreduce: 0.00 | optimizer: 55.77 | batch generator: 0.78 + samples/sec: 6.586 | iteration 251400/ 320000 | elapsed time per iteration (ms): 2429.3 | learning rate: 3.605E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.650176E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.97 | backward: 1806.08 | backward-backward: 1806.05 | backward-allreduce: 0.00 | optimizer: 55.90 | batch generator: 0.78 + samples/sec: 6.590 | iteration 251500/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 3.595E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.644917E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.62 | backward: 1804.27 | backward-backward: 1804.25 | backward-allreduce: 0.00 | optimizer: 56.59 | batch generator: 0.80 + samples/sec: 6.592 | iteration 251600/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 3.586E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.625000E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.86 | backward: 1804.74 | backward-backward: 1804.71 | backward-allreduce: 0.00 | optimizer: 55.23 | batch generator: 0.92 + samples/sec: 6.597 | iteration 251700/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 3.576E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.612652E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.51 | backward: 1803.18 | backward-backward: 1803.16 | 
backward-allreduce: 0.00 | optimizer: 55.40 | batch generator: 0.81 + samples/sec: 6.594 | iteration 251800/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 3.567E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.637385E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.26 | backward: 1804.30 | backward-backward: 1804.27 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.78 + samples/sec: 6.590 | iteration 251900/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 3.557E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.644746E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.69 | backward: 1805.30 | backward-backward: 1805.27 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 0.85 + samples/sec: 6.597 | iteration 252000/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 3.547E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.653837E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.97 | backward: 1803.22 | backward-backward: 1803.19 | backward-allreduce: 0.00 | optimizer: 55.80 | batch generator: 0.76 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 252000 | lm_loss value: 2.572616E+00 | lm_loss_ppl value: 1.310005E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.440 | iteration 252100/ 320000 | elapsed time per iteration (ms): 2484.4 | learning rate: 3.538E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.641029E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.61 | backward: 1805.12 | backward-backward: 1805.10 | backward-allreduce: 0.00 | 
optimizer: 55.47 | batch generator: 0.85 + samples/sec: 6.597 | iteration 252200/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 3.528E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.642251E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.61 | backward: 1802.97 | backward-backward: 1802.95 | backward-allreduce: 0.00 | optimizer: 55.37 | batch generator: 0.76 + samples/sec: 6.592 | iteration 252300/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 3.519E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.644322E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.36 | backward: 1804.70 | backward-backward: 1804.67 | backward-allreduce: 0.00 | optimizer: 55.91 | batch generator: 0.75 + samples/sec: 6.595 | iteration 252400/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 3.510E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.657974E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.56 | backward: 1803.72 | backward-backward: 1803.70 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.81 + samples/sec: 6.597 | iteration 252500/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 3.500E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.633268E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.24 | backward: 1803.14 | backward-backward: 1803.11 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.84 + samples/sec: 6.585 | iteration 252600/ 320000 | elapsed time per iteration (ms): 2429.7 | learning rate: 3.491E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.645548E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.90 | backward: 
1806.20 | backward-backward: 1806.18 | backward-allreduce: 0.00 | optimizer: 56.23 | batch generator: 1.02 + samples/sec: 6.595 | iteration 252700/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 3.481E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.627713E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.47 | backward: 1803.55 | backward-backward: 1803.52 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.75 + samples/sec: 6.591 | iteration 252800/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 3.472E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.631666E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.09 | backward: 1804.81 | backward-backward: 1804.78 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.77 + samples/sec: 6.595 | iteration 252900/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 3.462E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.649607E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.23 | backward: 1803.79 | backward-backward: 1803.77 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.76 + samples/sec: 6.593 | iteration 253000/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 3.453E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.639254E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.28 | backward: 1804.43 | backward-backward: 1804.41 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.77 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 253000 | lm_loss value: 2.661034E+00 | lm_loss_ppl value: 1.431108E+01 | 
+----------------------------------------------------------------------------------------------------------- + samples/sec: 6.440 | iteration 253100/ 320000 | elapsed time per iteration (ms): 2484.3 | learning rate: 3.444E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.626949E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.77 | backward: 1804.90 | backward-backward: 1804.88 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.86 + samples/sec: 6.595 | iteration 253200/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 3.434E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.620427E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.33 | backward: 1804.16 | backward-backward: 1804.14 | backward-allreduce: 0.00 | optimizer: 55.32 | batch generator: 0.77 + samples/sec: 6.591 | iteration 253300/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 3.425E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.640664E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.99 | backward: 1804.86 | backward-backward: 1804.84 | backward-allreduce: 0.00 | optimizer: 55.25 | batch generator: 0.79 + samples/sec: 6.597 | iteration 253400/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 3.416E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.619961E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.02 | backward: 1803.39 | backward-backward: 1803.36 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.80 + samples/sec: 6.589 | iteration 253500/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 3.406E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.650542E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | +time (ms) | forward: 566.62 | backward: 1805.50 | backward-backward: 1805.47 | backward-allreduce: 0.00 | optimizer: 55.86 | batch generator: 0.77 + samples/sec: 6.595 | iteration 253600/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 3.397E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.636511E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.37 | backward: 1803.86 | backward-backward: 1803.83 | backward-allreduce: 0.00 | optimizer: 55.38 | batch generator: 0.78 + samples/sec: 6.589 | iteration 253700/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 3.387E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.648095E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.38 | backward: 1805.07 | backward-backward: 1805.05 | backward-allreduce: 0.00 | optimizer: 56.33 | batch generator: 0.80 + samples/sec: 6.591 | iteration 253800/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 3.378E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.619680E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.73 | backward: 1804.56 | backward-backward: 1804.53 | backward-allreduce: 0.00 | optimizer: 55.99 | batch generator: 0.81 + samples/sec: 6.597 | iteration 253900/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 3.369E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.650222E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.88 | backward: 1803.37 | backward-backward: 1803.34 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.77 + samples/sec: 6.590 | iteration 254000/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 3.360E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.625244E+00 | 
loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.81 | backward: 1805.03 | backward-backward: 1805.01 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.83 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 254000 | lm_loss value: 2.575245E+00 | lm_loss_ppl value: 1.313453E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.444 | iteration 254100/ 320000 | elapsed time per iteration (ms): 2482.8 | learning rate: 3.350E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.626592E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.49 | backward: 1803.68 | backward-backward: 1803.66 | backward-allreduce: 0.00 | optimizer: 55.41 | batch generator: 0.84 + samples/sec: 6.598 | iteration 254200/ 320000 | elapsed time per iteration (ms): 2424.9 | learning rate: 3.341E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.632697E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.92 | backward: 1803.35 | backward-backward: 1803.32 | backward-allreduce: 0.00 | optimizer: 55.24 | batch generator: 0.77 + samples/sec: 6.588 | iteration 254300/ 320000 | elapsed time per iteration (ms): 2428.8 | learning rate: 3.332E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.639116E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.81 | backward: 1805.84 | backward-backward: 1805.82 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.79 + samples/sec: 6.599 | iteration 254400/ 320000 | elapsed time per iteration (ms): 2424.8 | learning rate: 3.323E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.626821E+00 | loss scale: 131072.0 | 
number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 565.94 | backward: 1803.56 | backward-backward: 1803.54 | backward-allreduce: 0.00 | optimizer: 54.90 | batch generator: 0.78 + samples/sec: 6.592 | iteration 254500/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 3.313E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.641582E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.63 | backward: 1804.72 | backward-backward: 1804.69 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.86 + samples/sec: 6.596 | iteration 254600/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 3.304E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.629340E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.48 | backward: 1803.78 | backward-backward: 1803.76 | backward-allreduce: 0.00 | optimizer: 55.24 | batch generator: 0.80 + samples/sec: 6.599 | iteration 254700/ 320000 | elapsed time per iteration (ms): 2424.6 | learning rate: 3.295E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.623354E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.05 | backward: 1802.69 | backward-backward: 1802.66 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.82 + samples/sec: 6.586 | iteration 254800/ 320000 | elapsed time per iteration (ms): 2429.4 | learning rate: 3.286E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.622038E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.71 | backward: 1805.77 | backward-backward: 1805.74 | backward-allreduce: 0.00 | optimizer: 56.47 | batch generator: 0.77 + samples/sec: 6.597 | iteration 254900/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 3.277E-05 | approx flops per 
GPU: 41.0TFLOPS | lm_loss: 2.626596E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.09 | backward: 1803.06 | backward-backward: 1803.03 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.80 + samples/sec: 6.596 | iteration 255000/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 3.268E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.614948E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.18 | backward: 1803.71 | backward-backward: 1803.69 | backward-allreduce: 0.00 | optimizer: 55.34 | batch generator: 0.78 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 255000 | lm_loss value: 2.609419E+00 | lm_loss_ppl value: 1.359116E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.440 | iteration 255100/ 320000 | elapsed time per iteration (ms): 2484.5 | learning rate: 3.258E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.619958E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.88 | backward: 1804.69 | backward-backward: 1804.67 | backward-allreduce: 0.00 | optimizer: 55.76 | batch generator: 0.90 + samples/sec: 6.598 | iteration 255200/ 320000 | elapsed time per iteration (ms): 2424.8 | learning rate: 3.249E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.646533E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.54 | backward: 1802.44 | backward-backward: 1802.42 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.82 + samples/sec: 6.592 | iteration 255300/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 3.240E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 
2.629885E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.68 | backward: 1804.50 | backward-backward: 1804.48 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.79 + samples/sec: 6.595 | iteration 255400/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 3.231E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.621731E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.57 | backward: 1803.37 | backward-backward: 1803.34 | backward-allreduce: 0.00 | optimizer: 55.89 | batch generator: 0.81 + samples/sec: 6.598 | iteration 255500/ 320000 | elapsed time per iteration (ms): 2424.9 | learning rate: 3.222E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.652376E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.29 | backward: 1802.69 | backward-backward: 1802.66 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.87 + samples/sec: 6.589 | iteration 255600/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 3.213E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.606530E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.76 | backward: 1805.42 | backward-backward: 1805.39 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 0.83 + samples/sec: 6.595 | iteration 255700/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 3.204E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.627387E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.78 | backward: 1803.14 | backward-backward: 1803.11 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.80 + samples/sec: 6.594 | iteration 255800/ 320000 | elapsed time per iteration (ms): 2426.3 | learning 
rate: 3.194E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.637518E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.31 | backward: 1804.02 | backward-backward: 1803.99 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.79 + samples/sec: 6.588 | iteration 255900/ 320000 | elapsed time per iteration (ms): 2428.7 | learning rate: 3.185E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.621518E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.53 | backward: 1805.31 | backward-backward: 1805.28 | backward-allreduce: 0.00 | optimizer: 56.47 | batch generator: 0.76 + samples/sec: 6.596 | iteration 256000/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 3.176E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.644682E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.05 | backward: 1803.47 | backward-backward: 1803.45 | backward-allreduce: 0.00 | optimizer: 55.77 | batch generator: 0.76 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 256000 | lm_loss value: 2.598944E+00 | lm_loss_ppl value: 1.344953E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.440 | iteration 256100/ 320000 | elapsed time per iteration (ms): 2484.5 | learning rate: 3.167E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.625229E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.67 | backward: 1804.70 | backward-backward: 1804.67 | backward-allreduce: 0.00 | optimizer: 56.04 | batch generator: 0.93 + samples/sec: 6.593 | iteration 256200/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 3.158E-05 | approx 
flops per GPU: 41.0TFLOPS | lm_loss: 2.652459E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.55 | backward: 1804.27 | backward-backward: 1804.24 | backward-allreduce: 0.00 | optimizer: 55.76 | batch generator: 0.82 + samples/sec: 6.598 | iteration 256300/ 320000 | elapsed time per iteration (ms): 2425.0 | learning rate: 3.149E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.601204E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.91 | backward: 1803.31 | backward-backward: 1803.29 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.74 + samples/sec: 6.588 | iteration 256400/ 320000 | elapsed time per iteration (ms): 2428.6 | learning rate: 3.140E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.625504E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.78 | backward: 1805.67 | backward-backward: 1805.64 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.80 + samples/sec: 6.600 | iteration 256500/ 320000 | elapsed time per iteration (ms): 2424.3 | learning rate: 3.131E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.612731E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 565.97 | backward: 1803.07 | backward-backward: 1803.05 | backward-allreduce: 0.00 | optimizer: 54.89 | batch generator: 0.78 + samples/sec: 6.595 | iteration 256600/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 3.122E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.602907E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.36 | backward: 1804.07 | backward-backward: 1804.05 | backward-allreduce: 0.00 | optimizer: 55.42 | batch generator: 0.79 + samples/sec: 6.591 | iteration 256700/ 320000 | elapsed time 
per iteration (ms): 2427.5 | learning rate: 3.113E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.612759E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.48 | backward: 1804.91 | backward-backward: 1804.88 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.80 + samples/sec: 6.596 | iteration 256800/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 3.104E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.615480E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.87 | backward: 1803.41 | backward-backward: 1803.39 | backward-allreduce: 0.00 | optimizer: 55.85 | batch generator: 0.88 + samples/sec: 6.591 | iteration 256900/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 3.095E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.626535E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.59 | backward: 1804.92 | backward-backward: 1804.89 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 0.78 + samples/sec: 6.592 | iteration 257000/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 3.086E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.621671E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.22 | backward: 1804.60 | backward-backward: 1804.58 | backward-allreduce: 0.00 | optimizer: 56.00 | batch generator: 0.77 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 257000 | lm_loss value: 2.574606E+00 | lm_loss_ppl value: 1.312615E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.445 | iteration 257100/ 320000 | elapsed time per iteration (ms): 2482.5 
| learning rate: 3.077E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.631593E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.10 | backward: 1803.93 | backward-backward: 1803.90 | backward-allreduce: 0.00 | optimizer: 55.42 | batch generator: 0.85 + samples/sec: 6.591 | iteration 257200/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 3.069E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.626030E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.38 | backward: 1805.45 | backward-backward: 1805.43 | backward-allreduce: 0.00 | optimizer: 55.25 | batch generator: 0.76 + samples/sec: 6.597 | iteration 257300/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 3.060E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.605012E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.16 | backward: 1803.17 | backward-backward: 1803.15 | backward-allreduce: 0.00 | optimizer: 55.77 | batch generator: 0.80 + samples/sec: 6.598 | iteration 257400/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 3.051E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.614577E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.13 | backward: 1803.32 | backward-backward: 1803.30 | backward-allreduce: 0.00 | optimizer: 55.31 | batch generator: 0.72 + samples/sec: 6.593 | iteration 257500/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 3.042E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.625495E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.31 | backward: 1804.73 | backward-backward: 1804.71 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.77 + samples/sec: 6.600 | 
iteration 257600/ 320000 | elapsed time per iteration (ms): 2424.3 | learning rate: 3.033E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.631676E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.99 | backward: 1802.48 | backward-backward: 1802.46 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.82 + samples/sec: 6.594 | iteration 257700/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 3.024E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.607763E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.43 | backward: 1804.06 | backward-backward: 1804.03 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.82 + samples/sec: 6.593 | iteration 257800/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 3.015E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.618421E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.56 | backward: 1804.32 | backward-backward: 1804.29 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.77 + samples/sec: 6.598 | iteration 257900/ 320000 | elapsed time per iteration (ms): 2424.9 | learning rate: 3.006E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.622983E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.02 | backward: 1802.98 | backward-backward: 1802.96 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.77 + samples/sec: 6.591 | iteration 258000/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.623854E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.54 | backward: 1805.02 | backward-backward: 1805.00 | backward-allreduce: 0.00 | 
optimizer: 55.72 | batch generator: 0.80 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 258000 | lm_loss value: 2.687788E+00 | lm_loss_ppl value: 1.469913E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.442 | iteration 258100/ 320000 | elapsed time per iteration (ms): 2483.7 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.630604E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.40 | backward: 1803.98 | backward-backward: 1803.96 | backward-allreduce: 0.00 | optimizer: 56.16 | batch generator: 0.87 + samples/sec: 6.598 | iteration 258200/ 320000 | elapsed time per iteration (ms): 2424.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.610776E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.88 | backward: 1803.01 | backward-backward: 1802.98 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.82 + samples/sec: 6.589 | iteration 258300/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.604983E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.55 | backward: 1805.31 | backward-backward: 1805.29 | backward-allreduce: 0.00 | optimizer: 55.99 | batch generator: 0.80 + samples/sec: 6.595 | iteration 258400/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.618324E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.18 | backward: 1803.81 | backward-backward: 1803.79 | backward-allreduce: 0.00 | optimizer: 55.59 | batch 
generator: 0.79 + samples/sec: 6.598 | iteration 258500/ 320000 | elapsed time per iteration (ms): 2424.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.616711E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.09 | backward: 1802.84 | backward-backward: 1802.81 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.80 + samples/sec: 6.591 | iteration 258600/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.627692E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.31 | backward: 1805.09 | backward-backward: 1805.07 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.80 + samples/sec: 6.597 | iteration 258700/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.614549E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.72 | backward: 1803.45 | backward-backward: 1803.42 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.80 + samples/sec: 6.590 | iteration 258800/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.616523E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.69 | backward: 1804.72 | backward-backward: 1804.70 | backward-allreduce: 0.00 | optimizer: 56.01 | batch generator: 0.78 + samples/sec: 6.587 | iteration 258900/ 320000 | elapsed time per iteration (ms): 2429.1 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.601995E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.86 | backward: 1806.03 | backward-backward: 
1806.01 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.81 + samples/sec: 6.596 | iteration 259000/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.621006E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.10 | backward: 1803.34 | backward-backward: 1803.31 | backward-allreduce: 0.00 | optimizer: 55.86 | batch generator: 0.80 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 259000 | lm_loss value: 2.592148E+00 | lm_loss_ppl value: 1.335843E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.444 | iteration 259100/ 320000 | elapsed time per iteration (ms): 2482.8 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.610283E+00 | loss scale: 32768.0 | number of skipped iterations: 2 | number of nan iterations: 0 | +time (ms) | forward: 566.30 | backward: 1804.21 | backward-backward: 1804.18 | backward-allreduce: 0.00 | optimizer: 55.17 | batch generator: 0.86 + samples/sec: 6.593 | iteration 259200/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.603488E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.45 | backward: 1804.50 | backward-backward: 1804.48 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.79 + samples/sec: 6.597 | iteration 259300/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.621116E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.67 | backward: 1802.73 | backward-backward: 1802.70 | 
backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.83 + samples/sec: 6.597 | iteration 259400/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.626152E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.03 | backward: 1803.31 | backward-backward: 1803.28 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.79 + samples/sec: 6.591 | iteration 259500/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.613482E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.39 | backward: 1805.05 | backward-backward: 1805.03 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.76 + samples/sec: 6.593 | iteration 259600/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.599255E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.41 | backward: 1803.99 | backward-backward: 1803.97 | backward-allreduce: 0.00 | optimizer: 56.02 | batch generator: 0.80 + samples/sec: 6.600 | iteration 259700/ 320000 | elapsed time per iteration (ms): 2424.4 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.614606E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.01 | backward: 1802.31 | backward-backward: 1802.28 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.79 + samples/sec: 6.591 | iteration 259800/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.605081E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | 
forward: 566.68 | backward: 1804.71 | backward-backward: 1804.68 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.80 + samples/sec: 6.595 | iteration 259900/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.599055E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.37 | backward: 1803.89 | backward-backward: 1803.87 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.81 + samples/sec: 6.599 | iteration 260000/ 320000 | elapsed time per iteration (ms): 2424.6 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.614030E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.31 | backward: 1802.47 | backward-backward: 1802.45 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.98 +WARNING: Deleting old checkpoints: + checkpoints-fcm/global_step160000 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 260000 | lm_loss value: 2.583860E+00 | lm_loss_ppl value: 1.324818E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.208 | iteration 260100/ 320000 | elapsed time per iteration (ms): 2577.3 | learning rate: 3.000E-05 | approx flops per GPU: 38.6TFLOPS | lm_loss: 2.620095E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 568.22 | backward: 1806.73 | backward-backward: 1806.71 | backward-allreduce: 0.00 | optimizer: 56.12 | batch generator: 0.90 + samples/sec: 6.589 | iteration 260200/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.602898E+00 | loss scale: 65536.0 | number of skipped iterations: 0 
| number of nan iterations: 0 | +time (ms) | forward: 566.55 | backward: 1805.01 | backward-backward: 1804.99 | backward-allreduce: 0.00 | optimizer: 56.51 | batch generator: 0.81 + samples/sec: 6.600 | iteration 260300/ 320000 | elapsed time per iteration (ms): 2424.4 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.610881E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.87 | backward: 1802.53 | backward-backward: 1802.50 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.80 + samples/sec: 6.591 | iteration 260400/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.593410E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.38 | backward: 1805.25 | backward-backward: 1805.22 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.79 + samples/sec: 6.592 | iteration 260500/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.597302E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.46 | backward: 1804.62 | backward-backward: 1804.60 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.85 + samples/sec: 6.599 | iteration 260600/ 320000 | elapsed time per iteration (ms): 2424.7 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.619071E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.82 | backward: 1803.17 | backward-backward: 1803.14 | backward-allreduce: 0.00 | optimizer: 55.33 | batch generator: 0.76 + samples/sec: 6.592 | iteration 260700/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 
2.597772E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.36 | backward: 1805.28 | backward-backward: 1805.26 | backward-allreduce: 0.00 | optimizer: 55.26 | batch generator: 0.77 + samples/sec: 6.591 | iteration 260800/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.608621E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.96 | backward: 1804.57 | backward-backward: 1804.54 | backward-allreduce: 0.00 | optimizer: 55.79 | batch generator: 0.98 + samples/sec: 6.599 | iteration 260900/ 320000 | elapsed time per iteration (ms): 2424.5 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.618550E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 565.75 | backward: 1803.23 | backward-backward: 1803.21 | backward-allreduce: 0.00 | optimizer: 55.10 | batch generator: 0.76 + samples/sec: 6.593 | iteration 261000/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.621675E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.38 | backward: 1804.36 | backward-backward: 1804.33 | backward-allreduce: 0.00 | optimizer: 55.80 | batch generator: 0.78 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 261000 | lm_loss value: 2.619267E+00 | lm_loss_ppl value: 1.372566E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.442 | iteration 261100/ 320000 | elapsed time per iteration (ms): 2483.7 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.624486E+00 | loss scale: 
32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.57 | backward: 1804.03 | backward-backward: 1804.00 | backward-allreduce: 0.00 | optimizer: 55.91 | batch generator: 0.93 + samples/sec: 6.599 | iteration 261200/ 320000 | elapsed time per iteration (ms): 2424.5 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.615477E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.93 | backward: 1802.69 | backward-backward: 1802.66 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.80 + samples/sec: 6.592 | iteration 261300/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.611198E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.44 | backward: 1804.07 | backward-backward: 1804.04 | backward-allreduce: 0.00 | optimizer: 56.28 | batch generator: 0.85 + samples/sec: 6.592 | iteration 261400/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.610055E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.46 | backward: 1804.49 | backward-backward: 1804.47 | backward-allreduce: 0.00 | optimizer: 55.98 | batch generator: 0.83 + samples/sec: 6.598 | iteration 261500/ 320000 | elapsed time per iteration (ms): 2425.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.606979E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.02 | backward: 1802.96 | backward-backward: 1802.94 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.81 + samples/sec: 6.597 | iteration 261600/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 3.000E-05 | approx 
flops per GPU: 41.0TFLOPS | lm_loss: 2.615345E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.21 | backward: 1802.95 | backward-backward: 1802.92 | backward-allreduce: 0.00 | optimizer: 55.76 | batch generator: 0.79 + samples/sec: 6.591 | iteration 261700/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.626125E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.49 | backward: 1804.87 | backward-backward: 1804.85 | backward-allreduce: 0.00 | optimizer: 55.90 | batch generator: 0.79 + samples/sec: 6.591 | iteration 261800/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.591528E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.74 | backward: 1804.65 | backward-backward: 1804.63 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.79 + samples/sec: 6.599 | iteration 261900/ 320000 | elapsed time per iteration (ms): 2424.6 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.614474E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.84 | backward: 1803.04 | backward-backward: 1803.02 | backward-allreduce: 0.00 | optimizer: 55.32 | batch generator: 0.79 + samples/sec: 6.593 | iteration 262000/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.602982E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.34 | backward: 1804.59 | backward-backward: 1804.57 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.77 
+----------------------------------------------------------------------------------------------------------- + validation results at iteration 262000 | lm_loss value: 2.522773E+00 | lm_loss_ppl value: 1.246310E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.439 | iteration 262100/ 320000 | elapsed time per iteration (ms): 2484.7 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.604284E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.76 | backward: 1804.96 | backward-backward: 1804.93 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.86 + samples/sec: 6.598 | iteration 262200/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.623667E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.23 | backward: 1803.10 | backward-backward: 1803.08 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.80 + samples/sec: 6.596 | iteration 262300/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.602213E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.35 | backward: 1803.33 | backward-backward: 1803.31 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.78 + samples/sec: 6.587 | iteration 262400/ 320000 | elapsed time per iteration (ms): 2429.1 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.623807E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.38 | backward: 1806.09 | backward-backward: 1806.07 | backward-allreduce: 0.00 | optimizer: 56.28 | batch generator: 0.79 + samples/sec: 6.592 | 
iteration 262500/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.603672E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.58 | backward: 1804.77 | backward-backward: 1804.74 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.79 + samples/sec: 6.597 | iteration 262600/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.619221E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.13 | backward: 1803.39 | backward-backward: 1803.36 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.80 + samples/sec: 6.595 | iteration 262700/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.618969E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.23 | backward: 1803.83 | backward-backward: 1803.80 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.77 + samples/sec: 6.588 | iteration 262800/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.597853E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.81 | backward: 1805.50 | backward-backward: 1805.47 | backward-allreduce: 0.00 | optimizer: 55.83 | batch generator: 0.87 + samples/sec: 6.594 | iteration 262900/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.616864E+00 | loss scale: 131072.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.50 | backward: 1804.45 | backward-backward: 1804.42 | backward-allreduce: 0.00 | 
optimizer: 55.06 | batch generator: 0.80 + samples/sec: 6.600 | iteration 263000/ 320000 | elapsed time per iteration (ms): 2424.4 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.591732E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 565.89 | backward: 1803.25 | backward-backward: 1803.23 | backward-allreduce: 0.00 | optimizer: 54.86 | batch generator: 0.76 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 263000 | lm_loss value: 2.639654E+00 | lm_loss_ppl value: 1.400835E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.439 | iteration 263100/ 320000 | elapsed time per iteration (ms): 2484.9 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.592583E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.57 | backward: 1805.32 | backward-backward: 1805.29 | backward-allreduce: 0.00 | optimizer: 55.86 | batch generator: 1.03 + samples/sec: 6.589 | iteration 263200/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.608283E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.92 | backward: 1805.10 | backward-backward: 1805.08 | backward-allreduce: 0.00 | optimizer: 56.00 | batch generator: 0.93 + samples/sec: 6.595 | iteration 263300/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.589778E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.40 | backward: 1803.87 | backward-backward: 1803.85 | backward-allreduce: 0.00 | optimizer: 55.46 | batch 
generator: 0.78 + samples/sec: 6.597 | iteration 263400/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.618102E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.07 | backward: 1803.40 | backward-backward: 1803.38 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.79 + samples/sec: 6.584 | iteration 263500/ 320000 | elapsed time per iteration (ms): 2430.1 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.603841E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.52 | backward: 1806.57 | backward-backward: 1806.55 | backward-allreduce: 0.00 | optimizer: 56.61 | batch generator: 0.95 + samples/sec: 6.591 | iteration 263600/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.595547E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.61 | backward: 1804.96 | backward-backward: 1804.94 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.77 + samples/sec: 6.598 | iteration 263700/ 320000 | elapsed time per iteration (ms): 2425.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.612872E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.87 | backward: 1803.29 | backward-backward: 1803.27 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.78 + samples/sec: 6.594 | iteration 263800/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.605225E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.44 | backward: 1804.20 | backward-backward: 
1804.18 | backward-allreduce: 0.00 | optimizer: 55.41 | batch generator: 0.79 + samples/sec: 6.591 | iteration 263900/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.586590E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.64 | backward: 1804.59 | backward-backward: 1804.56 | backward-allreduce: 0.00 | optimizer: 56.08 | batch generator: 0.79 + samples/sec: 6.590 | iteration 264000/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.581813E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.68 | backward: 1805.10 | backward-backward: 1805.08 | backward-allreduce: 0.00 | optimizer: 55.85 | batch generator: 0.78 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 264000 | lm_loss value: 2.607442E+00 | lm_loss_ppl value: 1.356430E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.443 | iteration 264100/ 320000 | elapsed time per iteration (ms): 2483.5 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.622679E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.46 | backward: 1804.06 | backward-backward: 1804.03 | backward-allreduce: 0.00 | optimizer: 55.81 | batch generator: 0.85 + samples/sec: 6.596 | iteration 264200/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.596290E+00 | loss scale: 131072.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.45 | backward: 1803.98 | backward-backward: 1803.95 | 
backward-allreduce: 0.00 | optimizer: 54.85 | batch generator: 0.80 + samples/sec: 6.593 | iteration 264300/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.586753E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.39 | backward: 1804.95 | backward-backward: 1804.93 | backward-allreduce: 0.00 | optimizer: 55.03 | batch generator: 0.82 + samples/sec: 6.592 | iteration 264400/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.620049E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.41 | backward: 1804.67 | backward-backward: 1804.64 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.78 + samples/sec: 6.595 | iteration 264500/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.591642E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.64 | backward: 1803.45 | backward-backward: 1803.43 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.78 + samples/sec: 6.593 | iteration 264600/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.604002E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.13 | backward: 1804.17 | backward-backward: 1804.15 | backward-allreduce: 0.00 | optimizer: 55.98 | batch generator: 0.79 + samples/sec: 6.589 | iteration 264700/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.602683E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | 
forward: 566.61 | backward: 1805.26 | backward-backward: 1805.24 | backward-allreduce: 0.00 | optimizer: 55.89 | batch generator: 0.91 + samples/sec: 6.592 | iteration 264800/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.614737E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.61 | backward: 1804.47 | backward-backward: 1804.44 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.77 + samples/sec: 6.595 | iteration 264900/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.596077E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.59 | backward: 1803.62 | backward-backward: 1803.59 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.75 + samples/sec: 6.600 | iteration 265000/ 320000 | elapsed time per iteration (ms): 2424.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.624794E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 565.87 | backward: 1802.84 | backward-backward: 1802.82 | backward-allreduce: 0.00 | optimizer: 55.02 | batch generator: 0.77 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 265000 | lm_loss value: 2.574054E+00 | lm_loss_ppl value: 1.311890E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.442 | iteration 265100/ 320000 | elapsed time per iteration (ms): 2483.9 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.630855E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.49 | 
backward: 1804.50 | backward-backward: 1804.48 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.88 + samples/sec: 6.593 | iteration 265200/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.575031E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.69 | backward: 1803.89 | backward-backward: 1803.87 | backward-allreduce: 0.00 | optimizer: 55.91 | batch generator: 0.80 + samples/sec: 6.596 | iteration 265300/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.632822E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.29 | backward: 1803.61 | backward-backward: 1803.58 | backward-allreduce: 0.00 | optimizer: 55.28 | batch generator: 0.77 + samples/sec: 6.600 | iteration 265400/ 320000 | elapsed time per iteration (ms): 2424.4 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.601362E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.86 | backward: 1802.76 | backward-backward: 1802.73 | backward-allreduce: 0.00 | optimizer: 55.41 | batch generator: 0.78 + samples/sec: 6.590 | iteration 265500/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.612367E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.26 | backward: 1805.51 | backward-backward: 1805.49 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 0.80 + samples/sec: 6.590 | iteration 265600/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.602141E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | +time (ms) | forward: 567.13 | backward: 1804.51 | backward-backward: 1804.48 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.95 + samples/sec: 6.589 | iteration 265700/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.614171E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.93 | backward: 1804.66 | backward-backward: 1804.63 | backward-allreduce: 0.00 | optimizer: 56.28 | batch generator: 0.79 + samples/sec: 6.597 | iteration 265800/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.616324E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.39 | backward: 1802.87 | backward-backward: 1802.85 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.84 + samples/sec: 6.598 | iteration 265900/ 320000 | elapsed time per iteration (ms): 2424.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.601422E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.20 | backward: 1802.82 | backward-backward: 1802.79 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.78 + samples/sec: 6.591 | iteration 266000/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.609008E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.50 | backward: 1805.10 | backward-backward: 1805.08 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.88 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 266000 | lm_loss value: 
2.692730E+00 | lm_loss_ppl value: 1.477194E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.443 | iteration 266100/ 320000 | elapsed time per iteration (ms): 2483.3 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.615878E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.55 | backward: 1804.71 | backward-backward: 1804.68 | backward-allreduce: 0.00 | optimizer: 54.89 | batch generator: 0.87 + samples/sec: 6.589 | iteration 266200/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.601028E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.65 | backward: 1805.30 | backward-backward: 1805.27 | backward-allreduce: 0.00 | optimizer: 55.86 | batch generator: 0.81 + samples/sec: 6.590 | iteration 266300/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.599534E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.70 | backward: 1804.86 | backward-backward: 1804.84 | backward-allreduce: 0.00 | optimizer: 55.83 | batch generator: 0.79 + samples/sec: 6.591 | iteration 266400/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.614991E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.94 | backward: 1804.54 | backward-backward: 1804.52 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.78 + samples/sec: 6.591 | iteration 266500/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.611108E+00 | loss scale: 
65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.65 | backward: 1804.88 | backward-backward: 1804.85 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.79 + samples/sec: 6.592 | iteration 266600/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.592737E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.57 | backward: 1804.65 | backward-backward: 1804.63 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.76 + samples/sec: 6.598 | iteration 266700/ 320000 | elapsed time per iteration (ms): 2424.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.599157E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.00 | backward: 1803.17 | backward-backward: 1803.14 | backward-allreduce: 0.00 | optimizer: 55.28 | batch generator: 0.80 + samples/sec: 6.596 | iteration 266800/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.578285E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.24 | backward: 1803.58 | backward-backward: 1803.56 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.81 + samples/sec: 6.594 | iteration 266900/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.599718E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.60 | backward: 1804.00 | backward-backward: 1803.98 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.78 + samples/sec: 6.592 | iteration 267000/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 3.000E-05 | approx 
flops per GPU: 41.0TFLOPS | lm_loss: 2.605284E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.33 | backward: 1804.79 | backward-backward: 1804.76 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.76 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 267000 | lm_loss value: 2.593359E+00 | lm_loss_ppl value: 1.337462E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.441 | iteration 267100/ 320000 | elapsed time per iteration (ms): 2484.0 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.602115E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.46 | backward: 1804.57 | backward-backward: 1804.54 | backward-allreduce: 0.00 | optimizer: 55.76 | batch generator: 0.86 + samples/sec: 6.589 | iteration 267200/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.616777E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.07 | backward: 1804.80 | backward-backward: 1804.77 | backward-allreduce: 0.00 | optimizer: 56.07 | batch generator: 0.79 + samples/sec: 6.597 | iteration 267300/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.606584E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.67 | backward: 1802.88 | backward-backward: 1802.86 | backward-allreduce: 0.00 | optimizer: 55.38 | batch generator: 0.79 + samples/sec: 6.596 | iteration 267400/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS 
| lm_loss: 2.625355E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.21 | backward: 1803.48 | backward-backward: 1803.46 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.76 + samples/sec: 6.590 | iteration 267500/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.588954E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.40 | backward: 1804.94 | backward-backward: 1804.91 | backward-allreduce: 0.00 | optimizer: 56.06 | batch generator: 0.77 + samples/sec: 6.593 | iteration 267600/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.605844E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.46 | backward: 1804.46 | backward-backward: 1804.43 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.78 + samples/sec: 6.592 | iteration 267700/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.591768E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.66 | backward: 1804.46 | backward-backward: 1804.44 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.79 + samples/sec: 6.591 | iteration 267800/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.601549E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.63 | backward: 1804.63 | backward-backward: 1804.61 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.77 + samples/sec: 6.591 | iteration 267900/ 320000 | elapsed time per iteration (ms): 2427.7 | 
learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.604124E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.55 | backward: 1804.49 | backward-backward: 1804.47 | backward-allreduce: 0.00 | optimizer: 56.25 | batch generator: 0.77 + samples/sec: 6.591 | iteration 268000/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.581911E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.95 | backward: 1804.28 | backward-backward: 1804.25 | backward-allreduce: 0.00 | optimizer: 55.87 | batch generator: 0.83 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 268000 | lm_loss value: 2.610209E+00 | lm_loss_ppl value: 1.360190E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.440 | iteration 268100/ 320000 | elapsed time per iteration (ms): 2484.6 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.593893E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.54 | backward: 1804.93 | backward-backward: 1804.91 | backward-allreduce: 0.00 | optimizer: 55.95 | batch generator: 0.85 + samples/sec: 6.600 | iteration 268200/ 320000 | elapsed time per iteration (ms): 2424.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.602023E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.88 | backward: 1802.67 | backward-backward: 1802.65 | backward-allreduce: 0.00 | optimizer: 55.37 | batch generator: 0.79 + samples/sec: 6.599 | iteration 268300/ 320000 | elapsed time per iteration (ms): 2424.7 | learning rate: 3.000E-05 | 
approx flops per GPU: 41.0TFLOPS | lm_loss: 2.601357E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.19 | backward: 1802.86 | backward-backward: 1802.83 | backward-allreduce: 0.00 | optimizer: 55.34 | batch generator: 0.78 + samples/sec: 6.595 | iteration 268400/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.616700E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.24 | backward: 1804.11 | backward-backward: 1804.09 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.76 + samples/sec: 6.592 | iteration 268500/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.603208E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.38 | backward: 1804.72 | backward-backward: 1804.69 | backward-allreduce: 0.00 | optimizer: 55.84 | batch generator: 0.80 + samples/sec: 6.591 | iteration 268600/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.598653E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.59 | backward: 1804.73 | backward-backward: 1804.70 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.80 + samples/sec: 6.587 | iteration 268700/ 320000 | elapsed time per iteration (ms): 2429.2 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.602831E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.13 | backward: 1805.81 | backward-backward: 1805.78 | backward-allreduce: 0.00 | optimizer: 55.84 | batch generator: 0.90 + samples/sec: 6.595 | iteration 268800/ 320000 | elapsed 
time per iteration (ms): 2426.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.617875E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.39 | backward: 1803.86 | backward-backward: 1803.84 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.79 + samples/sec: 6.596 | iteration 268900/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.603242E+00 | loss scale: 65536.0 | number of skipped iterations: 2 | number of nan iterations: 0 | +time (ms) | forward: 566.22 | backward: 1804.73 | backward-backward: 1804.70 | backward-allreduce: 0.00 | optimizer: 54.52 | batch generator: 0.76 + samples/sec: 6.589 | iteration 269000/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.609007E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.28 | backward: 1805.78 | backward-backward: 1805.76 | backward-allreduce: 0.00 | optimizer: 55.76 | batch generator: 0.78 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 269000 | lm_loss value: 2.630586E+00 | lm_loss_ppl value: 1.388191E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.441 | iteration 269100/ 320000 | elapsed time per iteration (ms): 2484.3 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.598631E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.72 | backward: 1804.55 | backward-backward: 1804.53 | backward-allreduce: 0.00 | optimizer: 55.82 | batch generator: 0.91 + samples/sec: 6.592 | iteration 269200/ 320000 | elapsed time per iteration (ms): 
2427.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.599990E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.83 | backward: 1804.11 | backward-backward: 1804.08 | backward-allreduce: 0.00 | optimizer: 55.81 | batch generator: 0.79 + samples/sec: 6.592 | iteration 269300/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.592730E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.13 | backward: 1804.02 | backward-backward: 1803.99 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.78 + samples/sec: 6.594 | iteration 269400/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.583311E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.43 | backward: 1804.12 | backward-backward: 1804.10 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.78 + samples/sec: 6.594 | iteration 269500/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.581136E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.39 | backward: 1804.18 | backward-backward: 1804.16 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.81 + samples/sec: 6.594 | iteration 269600/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.582234E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.46 | backward: 1803.71 | backward-backward: 1803.68 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.88 + samples/sec: 6.594 | 
iteration 269700/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.592635E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.47 | backward: 1803.65 | backward-backward: 1803.63 | backward-allreduce: 0.00 | optimizer: 55.87 | batch generator: 0.79 + samples/sec: 6.595 | iteration 269800/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.622112E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.13 | backward: 1804.04 | backward-backward: 1804.01 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.77 + samples/sec: 6.594 | iteration 269900/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.569817E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.33 | backward: 1804.10 | backward-backward: 1804.08 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.80 + samples/sec: 6.597 | iteration 270000/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.595494E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.70 | backward: 1803.32 | backward-backward: 1803.30 | backward-allreduce: 0.00 | optimizer: 56.02 | batch generator: 0.78 +WARNING: Deleting old checkpoints: + checkpoints-fcm/global_step170000 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 270000 | lm_loss value: 2.537112E+00 | lm_loss_ppl value: 1.264310E+01 | 
+----------------------------------------------------------------------------------------------------------- + samples/sec: 6.221 | iteration 270100/ 320000 | elapsed time per iteration (ms): 2571.8 | learning rate: 3.000E-05 | approx flops per GPU: 38.7TFLOPS | lm_loss: 2.585039E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.28 | backward: 1806.21 | backward-backward: 1806.18 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.86 + samples/sec: 6.594 | iteration 270200/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.606370E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.17 | backward: 1804.62 | backward-backward: 1804.59 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.78 + samples/sec: 6.593 | iteration 270300/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.591503E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.34 | backward: 1804.52 | backward-backward: 1804.49 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.82 + samples/sec: 6.591 | iteration 270400/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.602534E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.34 | backward: 1804.81 | backward-backward: 1804.79 | backward-allreduce: 0.00 | optimizer: 55.87 | batch generator: 0.78 + samples/sec: 6.593 | iteration 270500/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.583860E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | +time (ms) | forward: 566.48 | backward: 1803.99 | backward-backward: 1803.96 | backward-allreduce: 0.00 | optimizer: 56.16 | batch generator: 0.80 + samples/sec: 6.594 | iteration 270600/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.590646E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.48 | backward: 1804.02 | backward-backward: 1804.00 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.78 + samples/sec: 6.593 | iteration 270700/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.591197E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.31 | backward: 1804.53 | backward-backward: 1804.50 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.82 + samples/sec: 6.593 | iteration 270800/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.606285E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.31 | backward: 1804.31 | backward-backward: 1804.29 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.80 + samples/sec: 6.592 | iteration 270900/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.606219E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.32 | backward: 1804.49 | backward-backward: 1804.47 | backward-allreduce: 0.00 | optimizer: 55.99 | batch generator: 0.80 + samples/sec: 6.595 | iteration 271000/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.574531E+00 | 
loss scale: 131072.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.27 | backward: 1804.54 | backward-backward: 1804.51 | backward-allreduce: 0.00 | optimizer: 55.06 | batch generator: 0.81 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 271000 | lm_loss value: 2.598959E+00 | lm_loss_ppl value: 1.344973E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.441 | iteration 271100/ 320000 | elapsed time per iteration (ms): 2484.1 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.594412E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.28 | backward: 1805.05 | backward-backward: 1805.03 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.87 + samples/sec: 6.595 | iteration 271200/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.605198E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.34 | backward: 1804.07 | backward-backward: 1804.04 | backward-allreduce: 0.00 | optimizer: 55.42 | batch generator: 0.81 + samples/sec: 6.594 | iteration 271300/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.595547E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.49 | backward: 1804.40 | backward-backward: 1804.37 | backward-allreduce: 0.00 | optimizer: 55.35 | batch generator: 0.80 + samples/sec: 6.592 | iteration 271400/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.605363E+00 | loss scale: 65536.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.31 | backward: 1804.96 | backward-backward: 1804.93 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.78 + samples/sec: 6.594 | iteration 271500/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.601333E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.35 | backward: 1804.35 | backward-backward: 1804.32 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.79 + samples/sec: 6.595 | iteration 271600/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.604191E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.20 | backward: 1804.22 | backward-backward: 1804.19 | backward-allreduce: 0.00 | optimizer: 55.30 | batch generator: 0.76 + samples/sec: 6.593 | iteration 271700/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.597053E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.13 | backward: 1804.74 | backward-backward: 1804.71 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.79 + samples/sec: 6.593 | iteration 271800/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.585255E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.31 | backward: 1804.75 | backward-backward: 1804.73 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.80 + samples/sec: 6.595 | iteration 271900/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 3.000E-05 | approx flops per 
GPU: 41.0TFLOPS | lm_loss: 2.587095E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.15 | backward: 1804.33 | backward-backward: 1804.31 | backward-allreduce: 0.00 | optimizer: 55.21 | batch generator: 0.77 + samples/sec: 6.595 | iteration 272000/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.581040E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.46 | backward: 1803.69 | backward-backward: 1803.66 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.80 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 272000 | lm_loss value: 2.550083E+00 | lm_loss_ppl value: 1.280817E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.446 | iteration 272100/ 320000 | elapsed time per iteration (ms): 2482.2 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.598873E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.05 | backward: 1803.55 | backward-backward: 1803.53 | backward-allreduce: 0.00 | optimizer: 55.40 | batch generator: 0.87 + samples/sec: 6.594 | iteration 272200/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.609453E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.18 | backward: 1803.67 | backward-backward: 1803.64 | backward-allreduce: 0.00 | optimizer: 56.16 | batch generator: 0.77 + samples/sec: 6.591 | iteration 272300/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 
2.598477E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.47 | backward: 1804.95 | backward-backward: 1804.93 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.78 + samples/sec: 6.595 | iteration 272400/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.600622E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.13 | backward: 1804.30 | backward-backward: 1804.27 | backward-allreduce: 0.00 | optimizer: 55.36 | batch generator: 0.77 + samples/sec: 6.595 | iteration 272500/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.601382E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.20 | backward: 1804.13 | backward-backward: 1804.10 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.75 + samples/sec: 6.594 | iteration 272600/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.579329E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.35 | backward: 1804.10 | backward-backward: 1804.07 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.79 + samples/sec: 6.595 | iteration 272700/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.577090E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.38 | backward: 1803.83 | backward-backward: 1803.80 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.77 + samples/sec: 6.595 | iteration 272800/ 320000 | elapsed time per iteration (ms): 2426.0 | learning 
rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.593535E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.24 | backward: 1803.88 | backward-backward: 1803.85 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.79 + samples/sec: 6.595 | iteration 272900/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.602124E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.30 | backward: 1804.06 | backward-backward: 1804.03 | backward-allreduce: 0.00 | optimizer: 55.41 | batch generator: 0.80 + samples/sec: 6.593 | iteration 273000/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.609122E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.49 | backward: 1804.20 | backward-backward: 1804.17 | backward-allreduce: 0.00 | optimizer: 55.77 | batch generator: 0.94 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 273000 | lm_loss value: 2.637447E+00 | lm_loss_ppl value: 1.397748E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.443 | iteration 273100/ 320000 | elapsed time per iteration (ms): 2483.3 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.611186E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.21 | backward: 1804.27 | backward-backward: 1804.25 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.86 + samples/sec: 6.592 | iteration 273200/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 3.000E-05 | approx 
flops per GPU: 41.0TFLOPS | lm_loss: 2.595046E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.69 | backward: 1804.42 | backward-backward: 1804.39 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.79 + samples/sec: 6.593 | iteration 273300/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.604245E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.20 | backward: 1804.67 | backward-backward: 1804.64 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.77 + samples/sec: 6.594 | iteration 273400/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.604232E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.27 | backward: 1804.37 | backward-backward: 1804.34 | backward-allreduce: 0.00 | optimizer: 55.41 | batch generator: 0.79 + samples/sec: 6.591 | iteration 273500/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.589096E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.49 | backward: 1804.81 | backward-backward: 1804.79 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 1.02 + samples/sec: 6.594 | iteration 273600/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.595087E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.42 | backward: 1804.29 | backward-backward: 1804.27 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.83 + samples/sec: 6.593 | iteration 273700/ 320000 | elapsed time 
per iteration (ms): 2426.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.595475E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.27 | backward: 1804.43 | backward-backward: 1804.40 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.77 + samples/sec: 6.592 | iteration 273800/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.600183E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.27 | backward: 1805.13 | backward-backward: 1805.11 | backward-allreduce: 0.00 | optimizer: 55.33 | batch generator: 0.75 + samples/sec: 6.593 | iteration 273900/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.589167E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.21 | backward: 1804.96 | backward-backward: 1804.93 | backward-allreduce: 0.00 | optimizer: 55.42 | batch generator: 0.77 + samples/sec: 6.592 | iteration 274000/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.585200E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.32 | backward: 1805.03 | backward-backward: 1805.00 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.81 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 274000 | lm_loss value: 2.565356E+00 | lm_loss_ppl value: 1.300529E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.442 | iteration 274100/ 320000 | elapsed time per iteration (ms): 2483.5 
| learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.579419E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.36 | backward: 1804.25 | backward-backward: 1804.23 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.91 + samples/sec: 6.592 | iteration 274200/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.587298E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.46 | backward: 1804.64 | backward-backward: 1804.61 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.94 + samples/sec: 6.596 | iteration 274300/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.594007E+00 | loss scale: 65536.0 | number of skipped iterations: 2 | number of nan iterations: 0 | +time (ms) | forward: 566.25 | backward: 1804.39 | backward-backward: 1804.37 | backward-allreduce: 0.00 | optimizer: 54.66 | batch generator: 0.79 + samples/sec: 6.589 | iteration 274400/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.600653E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.73 | backward: 1804.95 | backward-backward: 1804.93 | backward-allreduce: 0.00 | optimizer: 56.21 | batch generator: 0.88 + samples/sec: 6.591 | iteration 274500/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.575322E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.24 | backward: 1805.17 | backward-backward: 1805.14 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.76 + samples/sec: 6.592 | 
iteration 274600/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.588051E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.29 | backward: 1804.73 | backward-backward: 1804.71 | backward-allreduce: 0.00 | optimizer: 55.91 | batch generator: 0.79 + samples/sec: 6.596 | iteration 274700/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.600712E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.92 | backward: 1803.86 | backward-backward: 1803.84 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.87 + samples/sec: 6.591 | iteration 274800/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.605598E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.52 | backward: 1804.91 | backward-backward: 1804.88 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.91 + samples/sec: 6.591 | iteration 274900/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.589680E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.51 | backward: 1804.83 | backward-backward: 1804.81 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.79 + samples/sec: 6.594 | iteration 275000/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.593440E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.49 | backward: 1804.29 | backward-backward: 1804.26 | backward-allreduce: 0.00 | 
optimizer: 55.42 | batch generator: 0.79 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 275000 | lm_loss value: 2.593900E+00 | lm_loss_ppl value: 1.338186E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.442 | iteration 275100/ 320000 | elapsed time per iteration (ms): 2483.7 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.593439E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.35 | backward: 1804.64 | backward-backward: 1804.62 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.84 + samples/sec: 6.591 | iteration 275200/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.570428E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.20 | backward: 1804.23 | backward-backward: 1804.20 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.81 + samples/sec: 6.592 | iteration 275300/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.591719E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.29 | backward: 1804.82 | backward-backward: 1804.79 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.77 + samples/sec: 6.596 | iteration 275400/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.587010E+00 | loss scale: 65536.0 | number of skipped iterations: 2 | number of nan iterations: 0 | +time (ms) | forward: 566.16 | backward: 1804.54 | backward-backward: 1804.51 | backward-allreduce: 0.00 | optimizer: 54.58 | batch 
generator: 0.79 + samples/sec: 6.591 | iteration 275500/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.589157E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.26 | backward: 1804.59 | backward-backward: 1804.56 | backward-allreduce: 0.00 | optimizer: 56.29 | batch generator: 0.88 + samples/sec: 6.598 | iteration 275600/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.598239E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.16 | backward: 1803.75 | backward-backward: 1803.72 | backward-allreduce: 0.00 | optimizer: 54.87 | batch generator: 0.77 + samples/sec: 6.596 | iteration 275700/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.611268E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.44 | backward: 1803.50 | backward-backward: 1803.47 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.82 + samples/sec: 6.595 | iteration 275800/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.605826E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.74 | backward: 1803.54 | backward-backward: 1803.36 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.97 + samples/sec: 6.591 | iteration 275900/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.593297E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.05 | backward: 1804.26 | backward-backward: 
1804.23 | backward-allreduce: 0.00 | optimizer: 55.81 | batch generator: 0.82 + samples/sec: 6.593 | iteration 276000/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.592466E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.57 | backward: 1803.92 | backward-backward: 1803.89 | backward-allreduce: 0.00 | optimizer: 55.96 | batch generator: 0.81 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 276000 | lm_loss value: 2.558896E+00 | lm_loss_ppl value: 1.292154E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.444 | iteration 276100/ 320000 | elapsed time per iteration (ms): 2482.9 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.588635E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.32 | backward: 1803.78 | backward-backward: 1803.76 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.85 + samples/sec: 6.594 | iteration 276200/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.600748E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.29 | backward: 1804.21 | backward-backward: 1804.18 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.77 + samples/sec: 6.593 | iteration 276300/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.588538E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.62 | backward: 1804.08 | backward-backward: 1804.05 | 
backward-allreduce: 0.00 | optimizer: 55.77 | batch generator: 0.86 + samples/sec: 6.594 | iteration 276400/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.591726E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.29 | backward: 1804.18 | backward-backward: 1804.16 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.79 + samples/sec: 6.593 | iteration 276500/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.606253E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.48 | backward: 1804.31 | backward-backward: 1804.29 | backward-allreduce: 0.00 | optimizer: 55.76 | batch generator: 0.80 + samples/sec: 6.590 | iteration 276600/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.580291E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.45 | backward: 1804.93 | backward-backward: 1804.90 | backward-allreduce: 0.00 | optimizer: 56.23 | batch generator: 0.82 + samples/sec: 6.592 | iteration 276700/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.606938E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.49 | backward: 1804.69 | backward-backward: 1804.66 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.80 + samples/sec: 6.592 | iteration 276800/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.579174E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | 
forward: 566.36 | backward: 1804.84 | backward-backward: 1804.81 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.78 + samples/sec: 6.593 | iteration 276900/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.594486E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.28 | backward: 1804.27 | backward-backward: 1804.24 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.79 + samples/sec: 6.597 | iteration 277000/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.588912E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.98 | backward: 1803.35 | backward-backward: 1803.33 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.79 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 277000 | lm_loss value: 2.624652E+00 | lm_loss_ppl value: 1.379977E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.439 | iteration 277100/ 320000 | elapsed time per iteration (ms): 2485.0 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.592674E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.39 | backward: 1805.82 | backward-backward: 1805.79 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.87 + samples/sec: 6.591 | iteration 277200/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.583754E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.31 | 
backward: 1805.36 | backward-backward: 1805.33 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.79 + samples/sec: 6.592 | iteration 277300/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.594052E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.53 | backward: 1804.61 | backward-backward: 1804.59 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.83 + samples/sec: 6.591 | iteration 277400/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.618977E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.44 | backward: 1804.88 | backward-backward: 1804.86 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.79 + samples/sec: 6.591 | iteration 277500/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.593315E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.42 | backward: 1804.78 | backward-backward: 1804.75 | backward-allreduce: 0.00 | optimizer: 55.91 | batch generator: 0.77 + samples/sec: 6.591 | iteration 277600/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.578139E+00 | loss scale: 131072.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.42 | backward: 1805.54 | backward-backward: 1805.52 | backward-allreduce: 0.00 | optimizer: 55.04 | batch generator: 0.81 + samples/sec: 6.593 | iteration 277700/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.593006E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | 
number of nan iterations: 0 | +time (ms) | forward: 566.29 | backward: 1804.60 | backward-backward: 1804.57 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.78 + samples/sec: 6.597 | iteration 277800/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.604224E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.19 | backward: 1803.43 | backward-backward: 1803.41 | backward-allreduce: 0.00 | optimizer: 55.27 | batch generator: 0.82 + samples/sec: 6.590 | iteration 277900/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.602259E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.33 | backward: 1805.21 | backward-backward: 1805.18 | backward-allreduce: 0.00 | optimizer: 55.82 | batch generator: 0.79 + samples/sec: 6.592 | iteration 278000/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.585715E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.34 | backward: 1804.97 | backward-backward: 1804.95 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.83 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 278000 | lm_loss value: 2.580792E+00 | lm_loss_ppl value: 1.320760E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.446 | iteration 278100/ 320000 | elapsed time per iteration (ms): 2482.3 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.591070E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 
0 | +time (ms) | forward: 566.10 | backward: 1803.69 | backward-backward: 1803.66 | backward-allreduce: 0.00 | optimizer: 55.36 | batch generator: 0.85 + samples/sec: 6.592 | iteration 278200/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.592062E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.31 | backward: 1804.58 | backward-backward: 1804.55 | backward-allreduce: 0.00 | optimizer: 55.81 | batch generator: 0.78 + samples/sec: 6.590 | iteration 278300/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.606917E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.34 | backward: 1805.43 | backward-backward: 1805.41 | backward-allreduce: 0.00 | optimizer: 55.83 | batch generator: 0.81 + samples/sec: 6.594 | iteration 278400/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.582318E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.53 | backward: 1803.84 | backward-backward: 1803.82 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.91 + samples/sec: 6.587 | iteration 278500/ 320000 | elapsed time per iteration (ms): 2428.9 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.586949E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.77 | backward: 1805.83 | backward-backward: 1805.80 | backward-allreduce: 0.00 | optimizer: 55.95 | batch generator: 0.80 + samples/sec: 6.598 | iteration 278600/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.598830E+00 | loss scale: 65536.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.96 | backward: 1803.51 | backward-backward: 1803.49 | backward-allreduce: 0.00 | optimizer: 55.26 | batch generator: 0.80 + samples/sec: 6.589 | iteration 278700/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.596179E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.72 | backward: 1806.19 | backward-backward: 1806.17 | backward-allreduce: 0.00 | optimizer: 55.01 | batch generator: 0.75 + samples/sec: 6.594 | iteration 278800/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.590851E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.44 | backward: 1803.31 | backward-backward: 1803.28 | backward-allreduce: 0.00 | optimizer: 56.43 | batch generator: 0.88 + samples/sec: 6.595 | iteration 278900/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.584703E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.24 | backward: 1803.88 | backward-backward: 1803.86 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 0.81 + samples/sec: 6.590 | iteration 279000/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.602173E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.00 | backward: 1805.30 | backward-backward: 1805.28 | backward-allreduce: 0.00 | optimizer: 55.39 | batch generator: 0.81 +----------------------------------------------------------------------------------------------------------- + validation results at 
iteration 279000 | lm_loss value: 2.569028E+00 | lm_loss_ppl value: 1.305313E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.446 | iteration 279100/ 320000 | elapsed time per iteration (ms): 2482.2 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.598403E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.63 | backward: 1803.01 | backward-backward: 1802.99 | backward-allreduce: 0.00 | optimizer: 55.34 | batch generator: 0.86 + samples/sec: 6.596 | iteration 279200/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.587807E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.00 | backward: 1804.15 | backward-backward: 1804.13 | backward-allreduce: 0.00 | optimizer: 55.37 | batch generator: 0.75 + samples/sec: 6.588 | iteration 279300/ 320000 | elapsed time per iteration (ms): 2428.8 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.589725E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.93 | backward: 1805.50 | backward-backward: 1805.48 | backward-allreduce: 0.00 | optimizer: 55.97 | batch generator: 0.80 + samples/sec: 6.595 | iteration 279400/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.596143E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.15 | backward: 1803.60 | backward-backward: 1803.57 | backward-allreduce: 0.00 | optimizer: 55.89 | batch generator: 0.77 + samples/sec: 6.588 | iteration 279500/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | 
lm_loss: 2.604671E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.86 | backward: 1805.45 | backward-backward: 1805.42 | backward-allreduce: 0.00 | optimizer: 55.79 | batch generator: 0.80 + samples/sec: 6.590 | iteration 279600/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.578893E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.91 | backward: 1804.89 | backward-backward: 1804.87 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.77 + samples/sec: 6.598 | iteration 279700/ 320000 | elapsed time per iteration (ms): 2424.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.585543E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 565.88 | backward: 1803.62 | backward-backward: 1803.60 | backward-allreduce: 0.00 | optimizer: 54.92 | batch generator: 0.75 + samples/sec: 6.583 | iteration 279800/ 320000 | elapsed time per iteration (ms): 2430.3 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.601567E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.99 | backward: 1806.79 | backward-backward: 1806.77 | backward-allreduce: 0.00 | optimizer: 56.19 | batch generator: 0.79 + samples/sec: 6.592 | iteration 279900/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.593410E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.78 | backward: 1803.94 | backward-backward: 1803.91 | backward-allreduce: 0.00 | optimizer: 55.99 | batch generator: 0.81 + samples/sec: 6.593 | iteration 280000/ 320000 | elapsed time per iteration (ms): 2426.8 | 
learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.604921E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.12 | backward: 1804.85 | backward-backward: 1804.83 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.78 +WARNING: Deleting old checkpoints: + checkpoints-fcm/global_step180000 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 280000 | lm_loss value: 2.568557E+00 | lm_loss_ppl value: 1.304698E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.197 | iteration 280100/ 320000 | elapsed time per iteration (ms): 2581.9 | learning rate: 3.000E-05 | approx flops per GPU: 38.5TFLOPS | lm_loss: 2.588251E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 568.45 | backward: 1807.66 | backward-backward: 1807.64 | backward-allreduce: 0.00 | optimizer: 55.92 | batch generator: 0.88 + samples/sec: 6.597 | iteration 280200/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.593265E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.24 | backward: 1803.00 | backward-backward: 1802.98 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.79 + samples/sec: 6.589 | iteration 280300/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.606221E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.57 | backward: 1805.50 | backward-backward: 1805.48 | backward-allreduce: 0.00 | optimizer: 55.98 | batch generator: 0.80 + samples/sec: 6.594 | iteration 280400/ 
320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.580633E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.48 | backward: 1803.56 | backward-backward: 1803.54 | backward-allreduce: 0.00 | optimizer: 55.97 | batch generator: 0.77 + samples/sec: 6.593 | iteration 280500/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.597323E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.31 | backward: 1804.42 | backward-backward: 1804.40 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.80 + samples/sec: 6.587 | iteration 280600/ 320000 | elapsed time per iteration (ms): 2429.1 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.584219E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.64 | backward: 1806.28 | backward-backward: 1806.25 | backward-allreduce: 0.00 | optimizer: 55.78 | batch generator: 0.74 + samples/sec: 6.596 | iteration 280700/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.577074E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.35 | backward: 1803.20 | backward-backward: 1803.18 | backward-allreduce: 0.00 | optimizer: 55.85 | batch generator: 0.78 + samples/sec: 6.589 | iteration 280800/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.587099E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.53 | backward: 1805.40 | backward-backward: 1805.38 | backward-allreduce: 0.00 | optimizer: 56.16 | 
batch generator: 0.79 + samples/sec: 6.590 | iteration 280900/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.601322E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.58 | backward: 1804.79 | backward-backward: 1804.76 | backward-allreduce: 0.00 | optimizer: 56.25 | batch generator: 0.78 + samples/sec: 6.596 | iteration 281000/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.594797E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.00 | backward: 1803.56 | backward-backward: 1803.53 | backward-allreduce: 0.00 | optimizer: 55.85 | batch generator: 0.80 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 281000 | lm_loss value: 2.582062E+00 | lm_loss_ppl value: 1.322438E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.434 | iteration 281100/ 320000 | elapsed time per iteration (ms): 2486.8 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.582225E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.63 | backward: 1806.66 | backward-backward: 1806.64 | backward-allreduce: 0.00 | optimizer: 56.35 | batch generator: 0.87 + samples/sec: 6.596 | iteration 281200/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.557579E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.08 | backward: 1803.84 | backward-backward: 1803.82 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.78 + 
samples/sec: 6.589 | iteration 281300/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.605967E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.38 | backward: 1806.10 | backward-backward: 1806.08 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.78 + samples/sec: 6.592 | iteration 281400/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.575834E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.51 | backward: 1804.62 | backward-backward: 1804.60 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.80 + samples/sec: 6.591 | iteration 281500/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.598694E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.23 | backward: 1804.82 | backward-backward: 1804.80 | backward-allreduce: 0.00 | optimizer: 56.04 | batch generator: 0.79 + samples/sec: 6.589 | iteration 281600/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.593457E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.69 | backward: 1805.70 | backward-backward: 1805.67 | backward-allreduce: 0.00 | optimizer: 55.28 | batch generator: 0.80 + samples/sec: 6.597 | iteration 281700/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.572950E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.84 | backward: 1803.55 | backward-backward: 1803.53 | 
backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.81 + samples/sec: 6.587 | iteration 281800/ 320000 | elapsed time per iteration (ms): 2429.1 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.605240E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.46 | backward: 1806.21 | backward-backward: 1806.18 | backward-allreduce: 0.00 | optimizer: 56.09 | batch generator: 0.79 + samples/sec: 6.594 | iteration 281900/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.580633E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.55 | backward: 1803.92 | backward-backward: 1803.90 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.95 + samples/sec: 6.587 | iteration 282000/ 320000 | elapsed time per iteration (ms): 2429.0 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.577351E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.32 | backward: 1806.12 | backward-backward: 1806.09 | backward-allreduce: 0.00 | optimizer: 56.13 | batch generator: 0.80 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 282000 | lm_loss value: 2.557621E+00 | lm_loss_ppl value: 1.290507E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.441 | iteration 282100/ 320000 | elapsed time per iteration (ms): 2484.0 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.594499E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.46 | backward: 1804.69 | backward-backward: 1804.67 | backward-allreduce: 0.00 | 
optimizer: 55.65 | batch generator: 0.84 + samples/sec: 6.592 | iteration 282200/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.588862E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.51 | backward: 1804.59 | backward-backward: 1804.56 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.79 + samples/sec: 6.587 | iteration 282300/ 320000 | elapsed time per iteration (ms): 2429.0 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.595400E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.94 | backward: 1805.78 | backward-backward: 1805.76 | backward-allreduce: 0.00 | optimizer: 55.90 | batch generator: 0.94 + samples/sec: 6.595 | iteration 282400/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.582902E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.88 | backward: 1804.07 | backward-backward: 1804.05 | backward-allreduce: 0.00 | optimizer: 55.80 | batch generator: 0.82 + samples/sec: 6.585 | iteration 282500/ 320000 | elapsed time per iteration (ms): 2429.7 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.578719E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.22 | backward: 1806.32 | backward-backward: 1806.29 | backward-allreduce: 0.00 | optimizer: 55.81 | batch generator: 0.95 + samples/sec: 6.593 | iteration 282600/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.599272E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.52 | backward: 
1804.16 | backward-backward: 1804.14 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.82 + samples/sec: 6.594 | iteration 282700/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.603904E+00 | loss scale: 65536.0 | number of skipped iterations: 2 | number of nan iterations: 0 | +time (ms) | forward: 566.23 | backward: 1804.55 | backward-backward: 1804.53 | backward-allreduce: 0.00 | optimizer: 55.10 | batch generator: 0.86 + samples/sec: 6.587 | iteration 282800/ 320000 | elapsed time per iteration (ms): 2429.1 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.595749E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 567.39 | backward: 1806.15 | backward-backward: 1806.13 | backward-allreduce: 0.00 | optimizer: 55.15 | batch generator: 0.80 + samples/sec: 6.591 | iteration 282900/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.590564E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.97 | backward: 1804.57 | backward-backward: 1804.55 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.76 + samples/sec: 6.594 | iteration 283000/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.582762E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.27 | backward: 1803.64 | backward-backward: 1803.62 | backward-allreduce: 0.00 | optimizer: 56.02 | batch generator: 0.79 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 283000 | lm_loss value: 2.548509E+00 | lm_loss_ppl value: 1.278802E+01 | 
+----------------------------------------------------------------------------------------------------------- + samples/sec: 6.431 | iteration 283100/ 320000 | elapsed time per iteration (ms): 2487.8 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.582612E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.94 | backward: 1806.90 | backward-backward: 1806.87 | backward-allreduce: 0.00 | optimizer: 56.72 | batch generator: 0.87 + samples/sec: 6.598 | iteration 283200/ 320000 | elapsed time per iteration (ms): 2424.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.560205E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.08 | backward: 1802.74 | backward-backward: 1802.72 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.78 + samples/sec: 6.592 | iteration 283300/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.589982E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.47 | backward: 1804.90 | backward-backward: 1804.88 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.78 + samples/sec: 6.589 | iteration 283400/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.590927E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.79 | backward: 1805.36 | backward-backward: 1805.34 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.76 + samples/sec: 6.598 | iteration 283500/ 320000 | elapsed time per iteration (ms): 2424.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.590951E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | +time (ms) | forward: 565.90 | backward: 1803.08 | backward-backward: 1803.06 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.78 + samples/sec: 6.588 | iteration 283600/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.587251E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.56 | backward: 1805.50 | backward-backward: 1805.48 | backward-allreduce: 0.00 | optimizer: 56.05 | batch generator: 0.77 + samples/sec: 6.596 | iteration 283700/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.584675E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.40 | backward: 1803.73 | backward-backward: 1803.71 | backward-allreduce: 0.00 | optimizer: 55.38 | batch generator: 0.77 + samples/sec: 6.591 | iteration 283800/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.580924E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.62 | backward: 1804.85 | backward-backward: 1804.83 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.79 + samples/sec: 6.589 | iteration 283900/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.584463E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.79 | backward: 1805.64 | backward-backward: 1805.61 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.78 + samples/sec: 6.596 | iteration 284000/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.584109E+00 | 
loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.02 | backward: 1803.78 | backward-backward: 1803.75 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.79 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 284000 | lm_loss value: 2.544537E+00 | lm_loss_ppl value: 1.273732E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.436 | iteration 284100/ 320000 | elapsed time per iteration (ms): 2486.1 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.584502E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.77 | backward: 1806.21 | backward-backward: 1806.18 | backward-allreduce: 0.00 | optimizer: 55.96 | batch generator: 0.90 + samples/sec: 6.587 | iteration 284200/ 320000 | elapsed time per iteration (ms): 2429.1 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.573124E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.14 | backward: 1805.14 | backward-backward: 1805.12 | backward-allreduce: 0.00 | optimizer: 56.44 | batch generator: 0.94 + samples/sec: 6.594 | iteration 284300/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.601697E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.25 | backward: 1804.18 | backward-backward: 1804.16 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.76 + samples/sec: 6.585 | iteration 284400/ 320000 | elapsed time per iteration (ms): 2429.8 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.591836E+00 | loss scale: 65536.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.86 | backward: 1806.48 | backward-backward: 1806.46 | backward-allreduce: 0.00 | optimizer: 56.05 | batch generator: 0.85 + samples/sec: 6.595 | iteration 284500/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.582543E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.42 | backward: 1803.73 | backward-backward: 1803.70 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.95 + samples/sec: 6.587 | iteration 284600/ 320000 | elapsed time per iteration (ms): 2428.9 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.597455E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 567.14 | backward: 1806.27 | backward-backward: 1806.24 | backward-allreduce: 0.00 | optimizer: 55.15 | batch generator: 0.80 + samples/sec: 6.596 | iteration 284700/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.604841E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.26 | backward: 1803.43 | backward-backward: 1803.40 | backward-allreduce: 0.00 | optimizer: 55.82 | batch generator: 0.78 + samples/sec: 6.590 | iteration 284800/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.586515E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.67 | backward: 1805.22 | backward-backward: 1805.20 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.80 + samples/sec: 6.590 | iteration 284900/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 3.000E-05 | approx flops per 
GPU: 40.9TFLOPS | lm_loss: 2.578307E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.86 | backward: 1804.35 | backward-backward: 1804.32 | backward-allreduce: 0.00 | optimizer: 56.17 | batch generator: 0.80 + samples/sec: 6.596 | iteration 285000/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.588697E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.27 | backward: 1803.57 | backward-backward: 1803.54 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.81 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 285000 | lm_loss value: 2.659253E+00 | lm_loss_ppl value: 1.428561E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.436 | iteration 285100/ 320000 | elapsed time per iteration (ms): 2485.8 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.580221E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.01 | backward: 1805.95 | backward-backward: 1805.93 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.87 + samples/sec: 6.591 | iteration 285200/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.580815E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.85 | backward: 1804.29 | backward-backward: 1804.27 | backward-allreduce: 0.00 | optimizer: 55.92 | batch generator: 0.93 + samples/sec: 6.590 | iteration 285300/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 
2.571772E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.39 | backward: 1805.12 | backward-backward: 1805.09 | backward-allreduce: 0.00 | optimizer: 56.18 | batch generator: 0.80 + samples/sec: 6.588 | iteration 285400/ 320000 | elapsed time per iteration (ms): 2428.7 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.593694E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.93 | backward: 1805.68 | backward-backward: 1805.66 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.77 + samples/sec: 6.595 | iteration 285500/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.571395E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.32 | backward: 1803.07 | backward-backward: 1803.05 | backward-allreduce: 0.00 | optimizer: 56.22 | batch generator: 0.96 + samples/sec: 6.589 | iteration 285600/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.580911E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.63 | backward: 1805.71 | backward-backward: 1805.68 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.79 + samples/sec: 6.594 | iteration 285700/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.581367E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.72 | backward: 1803.81 | backward-backward: 1803.79 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.81 + samples/sec: 6.590 | iteration 285800/ 320000 | elapsed time per iteration (ms): 2428.1 | learning 
rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.591391E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.29 | backward: 1805.84 | backward-backward: 1805.82 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.78 + samples/sec: 6.591 | iteration 285900/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.595282E+00 | loss scale: 32768.0 | number of skipped iterations: 2 | number of nan iterations: 0 | +time (ms) | forward: 566.62 | backward: 1805.37 | backward-backward: 1805.35 | backward-allreduce: 0.00 | optimizer: 55.06 | batch generator: 0.76 + samples/sec: 6.594 | iteration 286000/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.587415E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.18 | backward: 1804.38 | backward-backward: 1804.35 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.79 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 286000 | lm_loss value: 2.512641E+00 | lm_loss_ppl value: 1.233748E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.439 | iteration 286100/ 320000 | elapsed time per iteration (ms): 2485.0 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.598772E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.67 | backward: 1805.34 | backward-backward: 1805.32 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.85 + samples/sec: 6.595 | iteration 286200/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 3.000E-05 | approx 
flops per GPU: 41.0TFLOPS | lm_loss: 2.604676E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.15 | backward: 1803.51 | backward-backward: 1803.49 | backward-allreduce: 0.00 | optimizer: 55.85 | batch generator: 0.82 + samples/sec: 6.584 | iteration 286300/ 320000 | elapsed time per iteration (ms): 2430.3 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.601910E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.62 | backward: 1806.24 | backward-backward: 1806.22 | backward-allreduce: 0.00 | optimizer: 56.03 | batch generator: 0.82 + samples/sec: 6.592 | iteration 286400/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.609488E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.49 | backward: 1804.27 | backward-backward: 1804.25 | backward-allreduce: 0.00 | optimizer: 56.22 | batch generator: 0.80 + samples/sec: 6.590 | iteration 286500/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.584210E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.55 | backward: 1805.39 | backward-backward: 1805.37 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.80 + samples/sec: 6.595 | iteration 286600/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.594548E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.43 | backward: 1803.44 | backward-backward: 1803.41 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.81 + samples/sec: 6.591 | iteration 286700/ 320000 | elapsed time 
per iteration (ms): 2427.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.623878E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.45 | backward: 1805.00 | backward-backward: 1804.97 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.78 + samples/sec: 6.593 | iteration 286800/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.583266E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.91 | backward: 1804.00 | backward-backward: 1803.97 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.82 + samples/sec: 6.596 | iteration 286900/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.581289E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.17 | backward: 1803.53 | backward-backward: 1803.50 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.82 + samples/sec: 6.584 | iteration 287000/ 320000 | elapsed time per iteration (ms): 2430.0 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.571180E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.76 | backward: 1806.71 | backward-backward: 1806.68 | backward-allreduce: 0.00 | optimizer: 56.11 | batch generator: 0.78 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 287000 | lm_loss value: 2.540795E+00 | lm_loss_ppl value: 1.268976E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.445 | iteration 287100/ 320000 | elapsed time per iteration (ms): 2482.6 
| learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.582874E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.12 | backward: 1803.76 | backward-backward: 1803.74 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.85 + samples/sec: 6.588 | iteration 287200/ 320000 | elapsed time per iteration (ms): 2428.6 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.598633E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.57 | backward: 1805.70 | backward-backward: 1805.68 | backward-allreduce: 0.00 | optimizer: 55.99 | batch generator: 0.80 + samples/sec: 6.589 | iteration 287300/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.597862E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.65 | backward: 1805.40 | backward-backward: 1805.37 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.78 + samples/sec: 6.596 | iteration 287400/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.610846E+00 | loss scale: 32768.0 | number of skipped iterations: 2 | number of nan iterations: 0 | +time (ms) | forward: 566.18 | backward: 1804.76 | backward-backward: 1804.74 | backward-allreduce: 0.00 | optimizer: 54.52 | batch generator: 0.80 + samples/sec: 6.587 | iteration 287500/ 320000 | elapsed time per iteration (ms): 2429.1 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.567429E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.77 | backward: 1805.53 | backward-backward: 1805.51 | backward-allreduce: 0.00 | optimizer: 56.41 | batch generator: 0.79 + samples/sec: 6.596 | 
iteration 287600/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.595957E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.07 | backward: 1803.72 | backward-backward: 1803.70 | backward-allreduce: 0.00 | optimizer: 55.40 | batch generator: 0.76 + samples/sec: 6.586 | iteration 287700/ 320000 | elapsed time per iteration (ms): 2429.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.588932E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.89 | backward: 1806.30 | backward-backward: 1806.28 | backward-allreduce: 0.00 | optimizer: 55.83 | batch generator: 0.79 + samples/sec: 6.595 | iteration 287800/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.593685E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.69 | backward: 1803.39 | backward-backward: 1803.36 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.77 + samples/sec: 6.593 | iteration 287900/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.581220E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.38 | backward: 1804.11 | backward-backward: 1804.09 | backward-allreduce: 0.00 | optimizer: 55.85 | batch generator: 0.79 + samples/sec: 6.588 | iteration 288000/ 320000 | elapsed time per iteration (ms): 2428.8 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.570217E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.67 | backward: 1806.22 | backward-backward: 1806.19 | backward-allreduce: 0.00 | 
optimizer: 55.53 | batch generator: 0.77 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 288000 | lm_loss value: 2.536442E+00 | lm_loss_ppl value: 1.263464E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.445 | iteration 288100/ 320000 | elapsed time per iteration (ms): 2482.5 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.580319E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.17 | backward: 1803.72 | backward-backward: 1803.70 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.87 + samples/sec: 6.586 | iteration 288200/ 320000 | elapsed time per iteration (ms): 2429.3 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.599134E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.75 | backward: 1806.13 | backward-backward: 1806.11 | backward-allreduce: 0.00 | optimizer: 56.01 | batch generator: 0.78 + samples/sec: 6.598 | iteration 288300/ 320000 | elapsed time per iteration (ms): 2424.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.593875E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.09 | backward: 1802.93 | backward-backward: 1802.90 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.78 + samples/sec: 6.591 | iteration 288400/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.580563E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.43 | backward: 1804.94 | backward-backward: 1804.91 | backward-allreduce: 0.00 | optimizer: 55.87 | batch 
generator: 0.77 + samples/sec: 6.586 | iteration 288500/ 320000 | elapsed time per iteration (ms): 2429.5 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.573350E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.74 | backward: 1806.44 | backward-backward: 1806.42 | backward-allreduce: 0.00 | optimizer: 55.86 | batch generator: 0.79 + samples/sec: 6.594 | iteration 288600/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.597731E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.90 | backward: 1803.93 | backward-backward: 1803.90 | backward-allreduce: 0.00 | optimizer: 56.11 | batch generator: 0.79 + samples/sec: 6.586 | iteration 288700/ 320000 | elapsed time per iteration (ms): 2429.5 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.585589E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.09 | backward: 1806.21 | backward-backward: 1806.18 | backward-allreduce: 0.00 | optimizer: 55.85 | batch generator: 0.86 + samples/sec: 6.593 | iteration 288800/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.568887E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.67 | backward: 1804.13 | backward-backward: 1804.10 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.84 + samples/sec: 6.593 | iteration 288900/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.582439E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.31 | backward: 1804.40 | backward-backward: 
1804.37 | backward-allreduce: 0.00 | optimizer: 55.87 | batch generator: 0.80 + samples/sec: 6.586 | iteration 289000/ 320000 | elapsed time per iteration (ms): 2429.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.589034E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.89 | backward: 1805.98 | backward-backward: 1805.96 | backward-allreduce: 0.00 | optimizer: 56.12 | batch generator: 0.83 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 289000 | lm_loss value: 2.561521E+00 | lm_loss_ppl value: 1.295550E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.445 | iteration 289100/ 320000 | elapsed time per iteration (ms): 2482.6 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.601500E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.94 | backward: 1803.77 | backward-backward: 1803.74 | backward-allreduce: 0.00 | optimizer: 55.79 | batch generator: 0.88 + samples/sec: 6.589 | iteration 289200/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.586255E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.68 | backward: 1806.03 | backward-backward: 1806.00 | backward-allreduce: 0.00 | optimizer: 55.16 | batch generator: 0.80 + samples/sec: 6.590 | iteration 289300/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.582396E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.96 | backward: 1804.80 | backward-backward: 1804.78 | 
backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.79 + samples/sec: 6.598 | iteration 289400/ 320000 | elapsed time per iteration (ms): 2424.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.593286E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.94 | backward: 1803.16 | backward-backward: 1803.14 | backward-allreduce: 0.00 | optimizer: 55.37 | batch generator: 0.78 + samples/sec: 6.586 | iteration 289500/ 320000 | elapsed time per iteration (ms): 2429.2 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.578886E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.69 | backward: 1806.19 | backward-backward: 1806.17 | backward-allreduce: 0.00 | optimizer: 55.97 | batch generator: 0.78 + samples/sec: 6.592 | iteration 289600/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.588047E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.34 | backward: 1804.86 | backward-backward: 1804.84 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.78 + samples/sec: 6.594 | iteration 289700/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.588441E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.86 | backward: 1804.15 | backward-backward: 1804.13 | backward-allreduce: 0.00 | optimizer: 56.10 | batch generator: 0.76 + samples/sec: 6.584 | iteration 289800/ 320000 | elapsed time per iteration (ms): 2430.0 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.612880E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | 
forward: 566.60 | backward: 1806.97 | backward-backward: 1806.95 | backward-allreduce: 0.00 | optimizer: 56.07 | batch generator: 0.76 + samples/sec: 6.596 | iteration 289900/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.576718E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.22 | backward: 1803.38 | backward-backward: 1803.36 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.80 + samples/sec: 6.595 | iteration 290000/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.584130E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.34 | backward: 1804.50 | backward-backward: 1804.47 | backward-allreduce: 0.00 | optimizer: 55.03 | batch generator: 0.85 +WARNING: Deleting old checkpoints: + checkpoints-fcm/global_step190000 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 290000 | lm_loss value: 2.569811E+00 | lm_loss_ppl value: 1.306336E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.194 | iteration 290100/ 320000 | elapsed time per iteration (ms): 2583.2 | learning rate: 3.000E-05 | approx flops per GPU: 38.5TFLOPS | lm_loss: 2.562439E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 568.74 | backward: 1808.04 | backward-backward: 1808.01 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.88 + samples/sec: 6.600 | iteration 290200/ 320000 | elapsed time per iteration (ms): 2424.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.591728E+00 | loss scale: 16384.0 | number of skipped iterations: 1 
| number of nan iterations: 0 | +time (ms) | forward: 566.27 | backward: 1802.92 | backward-backward: 1802.89 | backward-allreduce: 0.00 | optimizer: 54.79 | batch generator: 0.81 + samples/sec: 6.594 | iteration 290300/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.593714E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.51 | backward: 1803.64 | backward-backward: 1803.62 | backward-allreduce: 0.00 | optimizer: 55.86 | batch generator: 0.80 + samples/sec: 6.586 | iteration 290400/ 320000 | elapsed time per iteration (ms): 2429.3 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.577190E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.32 | backward: 1805.95 | backward-backward: 1805.93 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.79 + samples/sec: 6.597 | iteration 290500/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.594441E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.58 | backward: 1802.76 | backward-backward: 1802.74 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.84 + samples/sec: 6.593 | iteration 290600/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.575117E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.78 | backward: 1803.97 | backward-backward: 1803.94 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.81 + samples/sec: 6.582 | iteration 290700/ 320000 | elapsed time per iteration (ms): 2430.8 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 
2.596682E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.78 | backward: 1806.49 | backward-backward: 1806.46 | backward-allreduce: 0.00 | optimizer: 56.99 | batch generator: 0.85 + samples/sec: 6.598 | iteration 290800/ 320000 | elapsed time per iteration (ms): 2425.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.576186E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.56 | backward: 1802.72 | backward-backward: 1802.69 | backward-allreduce: 0.00 | optimizer: 55.38 | batch generator: 0.79 + samples/sec: 6.598 | iteration 290900/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.558732E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.17 | backward: 1802.95 | backward-backward: 1802.93 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.75 + samples/sec: 6.589 | iteration 291000/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.582702E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.47 | backward: 1805.42 | backward-backward: 1805.40 | backward-allreduce: 0.00 | optimizer: 55.90 | batch generator: 0.76 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 291000 | lm_loss value: 2.556042E+00 | lm_loss_ppl value: 1.288472E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.442 | iteration 291100/ 320000 | elapsed time per iteration (ms): 2483.8 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.579847E+00 | loss scale: 
16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.63 | backward: 1803.50 | backward-backward: 1803.47 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.87 + samples/sec: 6.600 | iteration 291200/ 320000 | elapsed time per iteration (ms): 2424.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.576108E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.91 | backward: 1802.41 | backward-backward: 1802.39 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.80 + samples/sec: 6.592 | iteration 291300/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.584022E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.56 | backward: 1804.81 | backward-backward: 1804.79 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.77 + samples/sec: 6.592 | iteration 291400/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.592813E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.91 | backward: 1804.30 | backward-backward: 1804.28 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.80 + samples/sec: 6.598 | iteration 291500/ 320000 | elapsed time per iteration (ms): 2425.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.570919E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.98 | backward: 1802.93 | backward-backward: 1802.91 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.79 + samples/sec: 6.593 | iteration 291600/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 3.000E-05 | approx 
flops per GPU: 41.0TFLOPS | lm_loss: 2.584808E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.43 | backward: 1804.35 | backward-backward: 1804.33 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.77 + samples/sec: 6.589 | iteration 291700/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.609595E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.75 | backward: 1805.53 | backward-backward: 1805.50 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.78 + samples/sec: 6.594 | iteration 291800/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.590774E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.54 | backward: 1803.33 | backward-backward: 1803.30 | backward-allreduce: 0.00 | optimizer: 56.11 | batch generator: 0.80 + samples/sec: 6.598 | iteration 291900/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.595960E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.11 | backward: 1802.97 | backward-backward: 1802.94 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.81 + samples/sec: 6.590 | iteration 292000/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.607883E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.59 | backward: 1805.09 | backward-backward: 1805.06 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.80 
+----------------------------------------------------------------------------------------------------------- + validation results at iteration 292000 | lm_loss value: 2.525019E+00 | lm_loss_ppl value: 1.249113E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.441 | iteration 292100/ 320000 | elapsed time per iteration (ms): 2484.2 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.574488E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.71 | backward: 1803.52 | backward-backward: 1803.49 | backward-allreduce: 0.00 | optimizer: 56.77 | batch generator: 0.87 + samples/sec: 6.598 | iteration 292200/ 320000 | elapsed time per iteration (ms): 2424.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.591834E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.10 | backward: 1802.90 | backward-backward: 1802.88 | backward-allreduce: 0.00 | optimizer: 55.42 | batch generator: 0.84 + samples/sec: 6.589 | iteration 292300/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.571451E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.45 | backward: 1805.89 | backward-backward: 1805.86 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 0.86 + samples/sec: 6.591 | iteration 292400/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.567086E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.79 | backward: 1804.65 | backward-backward: 1804.63 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 0.78 + samples/sec: 6.599 | 
iteration 292500/ 320000 | elapsed time per iteration (ms): 2424.5 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.578585E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.03 | backward: 1802.96 | backward-backward: 1802.93 | backward-allreduce: 0.00 | optimizer: 55.14 | batch generator: 0.86 + samples/sec: 6.589 | iteration 292600/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.588503E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.42 | backward: 1805.71 | backward-backward: 1805.69 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.79 + samples/sec: 6.589 | iteration 292700/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.565816E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.73 | backward: 1805.29 | backward-backward: 1805.27 | backward-allreduce: 0.00 | optimizer: 55.79 | batch generator: 0.79 + samples/sec: 6.597 | iteration 292800/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.571192E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.22 | backward: 1803.21 | backward-backward: 1803.18 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.80 + samples/sec: 6.589 | iteration 292900/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.609095E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.25 | backward: 1805.59 | backward-backward: 1805.56 | backward-allreduce: 0.00 | 
optimizer: 56.09 | batch generator: 0.77 + samples/sec: 6.587 | iteration 293000/ 320000 | elapsed time per iteration (ms): 2429.0 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.586135E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.95 | backward: 1805.65 | backward-backward: 1805.62 | backward-allreduce: 0.00 | optimizer: 56.00 | batch generator: 0.79 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 293000 | lm_loss value: 2.550811E+00 | lm_loss_ppl value: 1.281750E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.443 | iteration 293100/ 320000 | elapsed time per iteration (ms): 2483.1 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.592721E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.50 | backward: 1803.61 | backward-backward: 1803.59 | backward-allreduce: 0.00 | optimizer: 55.79 | batch generator: 0.86 + samples/sec: 6.597 | iteration 293200/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.600385E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.16 | backward: 1803.49 | backward-backward: 1803.47 | backward-allreduce: 0.00 | optimizer: 55.29 | batch generator: 0.79 + samples/sec: 6.590 | iteration 293300/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.586445E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.54 | backward: 1805.36 | backward-backward: 1805.33 | backward-allreduce: 0.00 | optimizer: 55.60 | batch 
generator: 0.79 + samples/sec: 6.594 | iteration 293400/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.598104E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.77 | backward: 1803.97 | backward-backward: 1803.95 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.79 + samples/sec: 6.598 | iteration 293500/ 320000 | elapsed time per iteration (ms): 2424.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.572474E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.92 | backward: 1802.87 | backward-backward: 1802.85 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.81 + samples/sec: 6.591 | iteration 293600/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.601382E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.47 | backward: 1804.69 | backward-backward: 1804.66 | backward-allreduce: 0.00 | optimizer: 55.83 | batch generator: 0.79 + samples/sec: 6.591 | iteration 293700/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.595194E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.63 | backward: 1804.66 | backward-backward: 1804.64 | backward-allreduce: 0.00 | optimizer: 55.76 | batch generator: 0.80 + samples/sec: 6.598 | iteration 293800/ 320000 | elapsed time per iteration (ms): 2424.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.580907E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.08 | backward: 1803.08 | backward-backward: 
1803.06 | backward-allreduce: 0.00 | optimizer: 55.29 | batch generator: 0.79 + samples/sec: 6.596 | iteration 293900/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.583817E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.30 | backward: 1803.22 | backward-backward: 1803.19 | backward-allreduce: 0.00 | optimizer: 55.90 | batch generator: 0.80 + samples/sec: 6.587 | iteration 294000/ 320000 | elapsed time per iteration (ms): 2429.2 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.575426E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.59 | backward: 1805.87 | backward-backward: 1805.85 | backward-allreduce: 0.00 | optimizer: 56.37 | batch generator: 0.81 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 294000 | lm_loss value: 2.566291E+00 | lm_loss_ppl value: 1.301745E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.443 | iteration 294100/ 320000 | elapsed time per iteration (ms): 2483.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.596994E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.64 | backward: 1803.78 | backward-backward: 1803.76 | backward-allreduce: 0.00 | optimizer: 55.78 | batch generator: 0.88 + samples/sec: 6.598 | iteration 294200/ 320000 | elapsed time per iteration (ms): 2424.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.571056E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.00 | backward: 1803.53 | backward-backward: 1803.51 | 
backward-allreduce: 0.00 | optimizer: 54.90 | batch generator: 0.78 + samples/sec: 6.588 | iteration 294300/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.586458E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.45 | backward: 1805.85 | backward-backward: 1805.83 | backward-allreduce: 0.00 | optimizer: 55.84 | batch generator: 0.81 + samples/sec: 6.592 | iteration 294400/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.564326E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.88 | backward: 1804.65 | backward-backward: 1804.63 | backward-allreduce: 0.00 | optimizer: 55.25 | batch generator: 0.79 + samples/sec: 6.599 | iteration 294500/ 320000 | elapsed time per iteration (ms): 2424.7 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.582556E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.28 | backward: 1802.61 | backward-backward: 1802.59 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.93 + samples/sec: 6.593 | iteration 294600/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.565732E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.39 | backward: 1804.52 | backward-backward: 1804.49 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.80 + samples/sec: 6.589 | iteration 294700/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.566773E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | 
forward: 566.90 | backward: 1805.01 | backward-backward: 1804.98 | backward-allreduce: 0.00 | optimizer: 55.81 | batch generator: 0.90 + samples/sec: 6.597 | iteration 294800/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.592103E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.34 | backward: 1803.27 | backward-backward: 1803.24 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.79 + samples/sec: 6.595 | iteration 294900/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.598363E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.38 | backward: 1803.83 | backward-backward: 1803.81 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.79 + samples/sec: 6.589 | iteration 295000/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.568962E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.57 | backward: 1805.40 | backward-backward: 1805.38 | backward-allreduce: 0.00 | optimizer: 56.09 | batch generator: 0.79 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 295000 | lm_loss value: 2.616446E+00 | lm_loss_ppl value: 1.368699E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.437 | iteration 295100/ 320000 | elapsed time per iteration (ms): 2485.7 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.586259E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.03 | 
backward: 1805.08 | backward-backward: 1805.06 | backward-allreduce: 0.00 | optimizer: 56.39 | batch generator: 0.87 + samples/sec: 6.594 | iteration 295200/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.594816E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.70 | backward: 1804.00 | backward-backward: 1803.97 | backward-allreduce: 0.00 | optimizer: 55.42 | batch generator: 0.77 + samples/sec: 6.592 | iteration 295300/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.591442E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.78 | backward: 1804.36 | backward-backward: 1804.34 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.80 + samples/sec: 6.594 | iteration 295400/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.590534E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.44 | backward: 1804.05 | backward-backward: 1804.03 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.79 + samples/sec: 6.593 | iteration 295500/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.572427E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.43 | backward: 1804.61 | backward-backward: 1804.59 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.80 + samples/sec: 6.593 | iteration 295600/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.592373E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | +time (ms) | forward: 566.34 | backward: 1804.67 | backward-backward: 1804.64 | backward-allreduce: 0.00 | optimizer: 55.36 | batch generator: 0.74 + samples/sec: 6.592 | iteration 295700/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.589048E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.37 | backward: 1804.80 | backward-backward: 1804.78 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.87 + samples/sec: 6.595 | iteration 295800/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.575469E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.64 | backward: 1803.50 | backward-backward: 1803.48 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.86 + samples/sec: 6.597 | iteration 295900/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.581730E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.31 | backward: 1802.56 | backward-backward: 1802.53 | backward-allreduce: 0.00 | optimizer: 56.07 | batch generator: 0.89 + samples/sec: 6.592 | iteration 296000/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.572227E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.48 | backward: 1804.96 | backward-backward: 1804.93 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.81 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 296000 | lm_loss value: 
2.559632E+00 | lm_loss_ppl value: 1.293106E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.441 | iteration 296100/ 320000 | elapsed time per iteration (ms): 2484.0 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.587202E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.38 | backward: 1804.91 | backward-backward: 1804.88 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.87 + samples/sec: 6.589 | iteration 296200/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.577685E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.34 | backward: 1805.58 | backward-backward: 1805.55 | backward-allreduce: 0.00 | optimizer: 55.99 | batch generator: 0.80 + samples/sec: 6.591 | iteration 296300/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.571272E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.41 | backward: 1805.03 | backward-backward: 1805.01 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.79 + samples/sec: 6.594 | iteration 296400/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.587740E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.38 | backward: 1804.69 | backward-backward: 1804.67 | backward-allreduce: 0.00 | optimizer: 55.17 | batch generator: 0.79 + samples/sec: 6.593 | iteration 296500/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.577248E+00 | loss scale: 
65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.25 | backward: 1804.47 | backward-backward: 1804.45 | backward-allreduce: 0.00 | optimizer: 55.78 | batch generator: 0.78 + samples/sec: 6.595 | iteration 296600/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.569757E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.35 | backward: 1804.25 | backward-backward: 1804.22 | backward-allreduce: 0.00 | optimizer: 55.23 | batch generator: 0.76 + samples/sec: 6.596 | iteration 296700/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.582570E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.18 | backward: 1803.47 | backward-backward: 1803.44 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.80 + samples/sec: 6.601 | iteration 296800/ 320000 | elapsed time per iteration (ms): 2423.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.593772E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.01 | backward: 1802.13 | backward-backward: 1802.11 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.80 + samples/sec: 6.591 | iteration 296900/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.586198E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.43 | backward: 1804.92 | backward-backward: 1804.90 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.78 + samples/sec: 6.594 | iteration 297000/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 3.000E-05 | approx 
flops per GPU: 41.0TFLOPS | lm_loss: 2.584472E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.32 | backward: 1804.13 | backward-backward: 1804.11 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.77 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 297000 | lm_loss value: 2.553304E+00 | lm_loss_ppl value: 1.284949E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.443 | iteration 297100/ 320000 | elapsed time per iteration (ms): 2483.5 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.588483E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.39 | backward: 1804.16 | backward-backward: 1804.13 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.83 + samples/sec: 6.592 | iteration 297200/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.570759E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.48 | backward: 1804.84 | backward-backward: 1804.82 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.80 + samples/sec: 6.592 | iteration 297300/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.584105E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.53 | backward: 1804.07 | backward-backward: 1804.04 | backward-allreduce: 0.00 | optimizer: 56.30 | batch generator: 0.83 + samples/sec: 6.595 | iteration 297400/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS 
| lm_loss: 2.585806E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.45 | backward: 1803.83 | backward-backward: 1803.81 | backward-allreduce: 0.00 | optimizer: 55.37 | batch generator: 0.77 + samples/sec: 6.594 | iteration 297500/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.573740E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.31 | backward: 1804.07 | backward-backward: 1804.05 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.79 + samples/sec: 6.600 | iteration 297600/ 320000 | elapsed time per iteration (ms): 2424.4 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.559928E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.16 | backward: 1802.24 | backward-backward: 1802.22 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.79 + samples/sec: 6.592 | iteration 297700/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.570872E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.41 | backward: 1804.86 | backward-backward: 1804.83 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.79 + samples/sec: 6.587 | iteration 297800/ 320000 | elapsed time per iteration (ms): 2428.9 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.589442E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.21 | backward: 1805.43 | backward-backward: 1805.40 | backward-allreduce: 0.00 | optimizer: 55.89 | batch generator: 0.81 + samples/sec: 6.590 | iteration 297900/ 320000 | elapsed time per iteration (ms): 2427.8 | 
learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.579499E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.56 | backward: 1804.74 | backward-backward: 1804.71 | backward-allreduce: 0.00 | optimizer: 56.01 | batch generator: 0.81 + samples/sec: 6.591 | iteration 298000/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.572627E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.70 | backward: 1804.61 | backward-backward: 1804.58 | backward-allreduce: 0.00 | optimizer: 55.83 | batch generator: 0.83 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 298000 | lm_loss value: 2.572649E+00 | lm_loss_ppl value: 1.310048E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.440 | iteration 298100/ 320000 | elapsed time per iteration (ms): 2484.3 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.569278E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.45 | backward: 1805.04 | backward-backward: 1805.02 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.85 + samples/sec: 6.594 | iteration 298200/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.584540E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.53 | backward: 1803.97 | backward-backward: 1803.95 | backward-allreduce: 0.00 | optimizer: 55.42 | batch generator: 0.81 + samples/sec: 6.589 | iteration 298300/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 3.000E-05 | 
approx flops per GPU: 40.9TFLOPS | lm_loss: 2.595460E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.55 | backward: 1805.59 | backward-backward: 1805.57 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.83 + samples/sec: 6.594 | iteration 298400/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.580022E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 565.95 | backward: 1803.73 | backward-backward: 1803.70 | backward-allreduce: 0.00 | optimizer: 56.35 | batch generator: 0.82 + samples/sec: 6.476 | iteration 298500/ 320000 | elapsed time per iteration (ms): 2470.8 | learning rate: 3.000E-05 | approx flops per GPU: 40.2TFLOPS | lm_loss: 2.580487E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 576.44 | backward: 1835.26 | backward-backward: 1835.24 | backward-allreduce: 0.00 | optimizer: 58.67 | batch generator: 0.79 + samples/sec: 6.591 | iteration 298600/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.583209E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.54 | backward: 1805.34 | backward-backward: 1805.32 | backward-allreduce: 0.00 | optimizer: 55.29 | batch generator: 0.82 + samples/sec: 6.592 | iteration 298700/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.564612E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.62 | backward: 1804.66 | backward-backward: 1804.64 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.82 + samples/sec: 6.592 | iteration 298800/ 320000 | elapsed 
time per iteration (ms): 2427.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.582383E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.51 | backward: 1804.62 | backward-backward: 1804.60 | backward-allreduce: 0.00 | optimizer: 55.79 | batch generator: 0.79 + samples/sec: 6.591 | iteration 298900/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.575585E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.73 | backward: 1804.84 | backward-backward: 1804.81 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 0.86 + samples/sec: 6.590 | iteration 299000/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.567465E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.69 | backward: 1805.10 | backward-backward: 1805.08 | backward-allreduce: 0.00 | optimizer: 55.84 | batch generator: 0.82 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 299000 | lm_loss value: 2.520248E+00 | lm_loss_ppl value: 1.243168E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.438 | iteration 299100/ 320000 | elapsed time per iteration (ms): 2485.2 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.586096E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.70 | backward: 1805.18 | backward-backward: 1805.16 | backward-allreduce: 0.00 | optimizer: 55.97 | batch generator: 0.90 + samples/sec: 6.598 | iteration 299200/ 320000 | elapsed time per iteration (ms): 
2425.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.570072E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.42 | backward: 1803.00 | backward-backward: 1802.98 | backward-allreduce: 0.00 | optimizer: 55.28 | batch generator: 0.80 + samples/sec: 6.596 | iteration 299300/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.576024E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.59 | backward: 1803.27 | backward-backward: 1803.25 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.82 + samples/sec: 6.584 | iteration 299400/ 320000 | elapsed time per iteration (ms): 2430.1 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.574923E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.00 | backward: 1806.19 | backward-backward: 1806.16 | backward-allreduce: 0.00 | optimizer: 56.51 | batch generator: 0.81 + samples/sec: 6.589 | iteration 299500/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.582512E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.80 | backward: 1804.46 | backward-backward: 1804.43 | backward-allreduce: 0.00 | optimizer: 56.50 | batch generator: 0.79 + samples/sec: 6.590 | iteration 299600/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.577043E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.87 | backward: 1804.83 | backward-backward: 1804.80 | backward-allreduce: 0.00 | optimizer: 55.95 | batch generator: 0.84 + samples/sec: 6.591 | 
iteration 299700/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.566732E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.75 | backward: 1804.75 | backward-backward: 1804.73 | backward-allreduce: 0.00 | optimizer: 55.82 | batch generator: 0.81 + samples/sec: 6.593 | iteration 299800/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.580646E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.68 | backward: 1803.99 | backward-backward: 1803.96 | backward-allreduce: 0.00 | optimizer: 55.95 | batch generator: 0.78 + samples/sec: 6.591 | iteration 299900/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.586088E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.69 | backward: 1804.61 | backward-backward: 1804.58 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.81 + samples/sec: 6.590 | iteration 300000/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.582769E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.91 | backward: 1804.38 | backward-backward: 1804.35 | backward-allreduce: 0.00 | optimizer: 56.10 | batch generator: 0.80 +WARNING: Deleting old checkpoints: + checkpoints-fcm/global_step200000 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 300000 | lm_loss value: 2.548677E+00 | lm_loss_ppl value: 1.279017E+01 | 
+----------------------------------------------------------------------------------------------------------- + samples/sec: 6.578 | iteration 300100/ 320000 | elapsed time per iteration (ms): 2432.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.586772E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +after 300100 iterations memory (MB) | allocated: 3902.71630859375 | max allocated: 14147.748046875 | reserved: 17282.0 | max reserved: 17282.0 +time (ms) | forward: 569.57 | backward: 1805.81 | backward-backward: 1805.78 | backward-allreduce: 0.00 | optimizer: 56.61 | batch generator: 1.20 + samples/sec: 6.588 | iteration 300200/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.592097E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.88 | backward: 1804.60 | backward-backward: 1804.58 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.80 + samples/sec: 6.599 | iteration 300300/ 320000 | elapsed time per iteration (ms): 2424.5 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.567657E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.89 | backward: 1801.59 | backward-backward: 1801.57 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.82 + samples/sec: 6.593 | iteration 300400/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.597082E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.31 | backward: 1803.36 | backward-backward: 1803.33 | backward-allreduce: 0.00 | optimizer: 55.87 | batch generator: 0.82 + samples/sec: 6.590 | iteration 300500/ 320000 | elapsed time per iteration (ms): 2427.8 | 
learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.598162E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 567.74 | backward: 1804.47 | backward-backward: 1804.44 | backward-allreduce: 0.00 | optimizer: 55.25 | batch generator: 0.78 + samples/sec: 6.597 | iteration 300600/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.567004E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.77 | backward: 1801.31 | backward-backward: 1801.28 | backward-allreduce: 0.00 | optimizer: 55.82 | batch generator: 0.81 + samples/sec: 6.595 | iteration 300700/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.575433E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.51 | backward: 1802.32 | backward-backward: 1802.30 | backward-allreduce: 0.00 | optimizer: 55.84 | batch generator: 0.80 + samples/sec: 6.588 | iteration 300800/ 320000 | elapsed time per iteration (ms): 2428.7 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.586744E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.95 | backward: 1804.02 | backward-backward: 1803.99 | backward-allreduce: 0.00 | optimizer: 56.39 | batch generator: 0.92 + samples/sec: 6.585 | iteration 300900/ 320000 | elapsed time per iteration (ms): 2429.8 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.600375E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 568.12 | backward: 1804.21 | backward-backward: 1804.18 | backward-allreduce: 0.00 | optimizer: 57.03 | batch generator: 0.83 + samples/sec: 6.596 | 
iteration 301000/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.608182E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.48 | backward: 1802.16 | backward-backward: 1802.14 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.82 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 301000 | lm_loss value: 2.534282E+00 | lm_loss_ppl value: 1.260737E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.447 | iteration 301100/ 320000 | elapsed time per iteration (ms): 2481.8 | learning rate: 3.000E-05 | approx flops per GPU: 40.1TFLOPS | lm_loss: 2.560940E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.29 | backward: 1801.72 | backward-backward: 1801.70 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.97 + samples/sec: 6.595 | iteration 301200/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.604364E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 567.33 | backward: 1802.96 | backward-backward: 1802.94 | backward-allreduce: 0.00 | optimizer: 55.27 | batch generator: 0.87 + samples/sec: 6.593 | iteration 301300/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.569098E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 568.20 | backward: 1802.09 | backward-backward: 1802.07 | backward-allreduce: 0.00 | optimizer: 56.05 | batch generator: 0.84 + samples/sec: 6.593 | iteration 301400/ 320000 | 
elapsed time per iteration (ms): 2426.7 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.560752E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.93 | backward: 1802.46 | backward-backward: 1802.43 | backward-allreduce: 0.00 | optimizer: 55.87 | batch generator: 0.85 + samples/sec: 6.595 | iteration 301500/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.602783E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.77 | backward: 1802.35 | backward-backward: 1802.33 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.81 + samples/sec: 6.597 | iteration 301600/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.558203E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.35 | backward: 1801.97 | backward-backward: 1801.95 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.78 + samples/sec: 6.597 | iteration 301700/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.586405E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.69 | backward: 1801.93 | backward-backward: 1801.90 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.80 + samples/sec: 6.597 | iteration 301800/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.564309E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.22 | backward: 1801.82 | backward-backward: 1801.79 | backward-allreduce: 0.00 | optimizer: 55.82 | batch 
generator: 0.80 + samples/sec: 6.597 | iteration 301900/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.573896E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.84 | backward: 1801.25 | backward-backward: 1801.22 | backward-allreduce: 0.00 | optimizer: 55.76 | batch generator: 0.85 + samples/sec: 6.598 | iteration 302000/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.569040E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.22 | backward: 1801.29 | backward-backward: 1801.27 | backward-allreduce: 0.00 | optimizer: 56.16 | batch generator: 0.83 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 302000 | lm_loss value: 2.484907E+00 | lm_loss_ppl value: 1.200001E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.448 | iteration 302100/ 320000 | elapsed time per iteration (ms): 2481.3 | learning rate: 3.000E-05 | approx flops per GPU: 40.1TFLOPS | lm_loss: 2.590094E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.22 | backward: 1801.16 | backward-backward: 1801.13 | backward-allreduce: 0.00 | optimizer: 55.77 | batch generator: 0.94 + samples/sec: 6.600 | iteration 302200/ 320000 | elapsed time per iteration (ms): 2424.4 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.583546E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.10 | backward: 1801.08 | backward-backward: 1801.06 | backward-allreduce: 0.00 | optimizer: 55.84 | batch generator: 0.83 + 
samples/sec: 6.600 | iteration 302300/ 320000 | elapsed time per iteration (ms): 2424.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.561049E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.02 | backward: 1801.29 | backward-backward: 1801.27 | backward-allreduce: 0.00 | optimizer: 55.40 | batch generator: 0.81 + samples/sec: 6.600 | iteration 302400/ 320000 | elapsed time per iteration (ms): 2424.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.578712E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.06 | backward: 1800.98 | backward-backward: 1800.96 | backward-allreduce: 0.00 | optimizer: 55.91 | batch generator: 0.82 + samples/sec: 6.600 | iteration 302500/ 320000 | elapsed time per iteration (ms): 2424.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.566684E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 567.34 | backward: 1801.41 | backward-backward: 1801.38 | backward-allreduce: 0.00 | optimizer: 54.95 | batch generator: 0.78 + samples/sec: 6.600 | iteration 302600/ 320000 | elapsed time per iteration (ms): 2424.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.589220E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.11 | backward: 1801.19 | backward-backward: 1801.17 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.77 + samples/sec: 6.599 | iteration 302700/ 320000 | elapsed time per iteration (ms): 2424.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.581410E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.00 | backward: 1801.62 | backward-backward: 1801.60 | 
backward-allreduce: 0.00 | optimizer: 55.76 | batch generator: 0.80 + samples/sec: 6.600 | iteration 302800/ 320000 | elapsed time per iteration (ms): 2424.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.585008E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.97 | backward: 1801.39 | backward-backward: 1801.36 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.80 + samples/sec: 6.599 | iteration 302900/ 320000 | elapsed time per iteration (ms): 2424.7 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.563148E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.38 | backward: 1801.50 | backward-backward: 1801.47 | backward-allreduce: 0.00 | optimizer: 55.42 | batch generator: 0.79 + samples/sec: 6.598 | iteration 303000/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.590488E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.52 | backward: 1801.65 | backward-backward: 1801.63 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.77 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 303000 | lm_loss value: 2.590759E+00 | lm_loss_ppl value: 1.333989E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.447 | iteration 303100/ 320000 | elapsed time per iteration (ms): 2481.9 | learning rate: 3.000E-05 | approx flops per GPU: 40.1TFLOPS | lm_loss: 2.587387E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.07 | backward: 1801.46 | backward-backward: 1801.43 | backward-allreduce: 0.00 | 
optimizer: 56.23 | batch generator: 0.88 + samples/sec: 6.598 | iteration 303200/ 320000 | elapsed time per iteration (ms): 2424.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.564921E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.01 | backward: 1801.86 | backward-backward: 1801.83 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.80 + samples/sec: 6.600 | iteration 303300/ 320000 | elapsed time per iteration (ms): 2424.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.579545E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.07 | backward: 1801.11 | backward-backward: 1801.09 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.88 + samples/sec: 6.602 | iteration 303400/ 320000 | elapsed time per iteration (ms): 2423.4 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.563951E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 567.03 | backward: 1800.51 | backward-backward: 1800.48 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.81 + samples/sec: 6.602 | iteration 303500/ 320000 | elapsed time per iteration (ms): 2423.5 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.589158E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.91 | backward: 1800.70 | backward-backward: 1800.68 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.84 + samples/sec: 6.602 | iteration 303600/ 320000 | elapsed time per iteration (ms): 2423.6 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.576689E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.84 | backward: 
1800.86 | backward-backward: 1800.84 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.77 + samples/sec: 6.601 | iteration 303700/ 320000 | elapsed time per iteration (ms): 2424.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.597867E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.96 | backward: 1800.84 | backward-backward: 1800.82 | backward-allreduce: 0.00 | optimizer: 55.78 | batch generator: 0.80 + samples/sec: 6.602 | iteration 303800/ 320000 | elapsed time per iteration (ms): 2423.4 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.599957E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.00 | backward: 1800.43 | backward-backward: 1800.41 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.81 + samples/sec: 6.602 | iteration 303900/ 320000 | elapsed time per iteration (ms): 2423.6 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.593430E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.99 | backward: 1800.67 | backward-backward: 1800.64 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.81 + samples/sec: 6.601 | iteration 304000/ 320000 | elapsed time per iteration (ms): 2423.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.560087E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.03 | backward: 1801.02 | backward-backward: 1800.99 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.80 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 304000 | lm_loss value: 2.470732E+00 | lm_loss_ppl value: 1.183111E+01 | 
+----------------------------------------------------------------------------------------------------------- + samples/sec: 6.449 | iteration 304100/ 320000 | elapsed time per iteration (ms): 2480.9 | learning rate: 3.000E-05 | approx flops per GPU: 40.1TFLOPS | lm_loss: 2.582133E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.11 | backward: 1800.47 | backward-backward: 1800.45 | backward-allreduce: 0.00 | optimizer: 56.07 | batch generator: 0.86 + samples/sec: 6.598 | iteration 304200/ 320000 | elapsed time per iteration (ms): 2424.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.584517E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.02 | backward: 1801.17 | backward-backward: 1801.14 | backward-allreduce: 0.00 | optimizer: 56.32 | batch generator: 0.87 + samples/sec: 6.599 | iteration 304300/ 320000 | elapsed time per iteration (ms): 2424.5 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.569611E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.90 | backward: 1801.96 | backward-backward: 1801.94 | backward-allreduce: 0.00 | optimizer: 55.30 | batch generator: 0.78 + samples/sec: 6.602 | iteration 304400/ 320000 | elapsed time per iteration (ms): 2423.7 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.580088E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.90 | backward: 1801.14 | backward-backward: 1801.11 | backward-allreduce: 0.00 | optimizer: 55.26 | batch generator: 0.80 + samples/sec: 6.600 | iteration 304500/ 320000 | elapsed time per iteration (ms): 2424.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.546669E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | +time (ms) | forward: 567.05 | backward: 1801.21 | backward-backward: 1801.19 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.78 + samples/sec: 6.601 | iteration 304600/ 320000 | elapsed time per iteration (ms): 2423.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.579716E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 567.16 | backward: 1801.13 | backward-backward: 1801.10 | backward-allreduce: 0.00 | optimizer: 55.09 | batch generator: 0.78 + samples/sec: 6.601 | iteration 304700/ 320000 | elapsed time per iteration (ms): 2423.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.573281E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.89 | backward: 1801.25 | backward-backward: 1801.22 | backward-allreduce: 0.00 | optimizer: 55.32 | batch generator: 0.79 + samples/sec: 6.602 | iteration 304800/ 320000 | elapsed time per iteration (ms): 2423.4 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.583604E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 567.00 | backward: 1800.89 | backward-backward: 1800.87 | backward-allreduce: 0.00 | optimizer: 55.12 | batch generator: 0.78 + samples/sec: 6.602 | iteration 304900/ 320000 | elapsed time per iteration (ms): 2423.5 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.580747E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.89 | backward: 1800.83 | backward-backward: 1800.80 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.77 + samples/sec: 6.601 | iteration 305000/ 320000 | elapsed time per iteration (ms): 2423.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.569039E+00 | 
loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.95 | backward: 1800.72 | backward-backward: 1800.70 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.83 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 305000 | lm_loss value: 2.552590E+00 | lm_loss_ppl value: 1.284032E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.451 | iteration 305100/ 320000 | elapsed time per iteration (ms): 2480.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.1TFLOPS | lm_loss: 2.578621E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.91 | backward: 1800.72 | backward-backward: 1800.69 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 0.84 + samples/sec: 6.602 | iteration 305200/ 320000 | elapsed time per iteration (ms): 2423.4 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.603097E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.96 | backward: 1800.49 | backward-backward: 1800.47 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.78 + samples/sec: 6.601 | iteration 305300/ 320000 | elapsed time per iteration (ms): 2424.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.586406E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.70 | backward: 1800.75 | backward-backward: 1800.72 | backward-allreduce: 0.00 | optimizer: 56.21 | batch generator: 0.74 + samples/sec: 6.599 | iteration 305400/ 320000 | elapsed time per iteration (ms): 2424.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.604205E+00 | loss scale: 32768.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.26 | backward: 1801.29 | backward-backward: 1801.26 | backward-allreduce: 0.00 | optimizer: 55.78 | batch generator: 0.77 + samples/sec: 6.601 | iteration 305500/ 320000 | elapsed time per iteration (ms): 2423.7 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.568032E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.75 | backward: 1800.90 | backward-backward: 1800.87 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.78 + samples/sec: 6.599 | iteration 305600/ 320000 | elapsed time per iteration (ms): 2424.5 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.563498E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.87 | backward: 1801.78 | backward-backward: 1801.76 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.78 + samples/sec: 6.601 | iteration 305700/ 320000 | elapsed time per iteration (ms): 2424.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.578777E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.87 | backward: 1801.14 | backward-backward: 1801.12 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.79 + samples/sec: 6.601 | iteration 305800/ 320000 | elapsed time per iteration (ms): 2423.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.583675E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.87 | backward: 1801.14 | backward-backward: 1801.11 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.83 + samples/sec: 6.601 | iteration 305900/ 320000 | elapsed time per iteration (ms): 2423.9 | learning rate: 3.000E-05 | approx flops per 
GPU: 41.0TFLOPS | lm_loss: 2.573953E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.90 | backward: 1801.07 | backward-backward: 1801.05 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.79 + samples/sec: 6.601 | iteration 306000/ 320000 | elapsed time per iteration (ms): 2424.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.606926E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.88 | backward: 1801.24 | backward-backward: 1801.22 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.82 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 306000 | lm_loss value: 2.574776E+00 | lm_loss_ppl value: 1.312838E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.449 | iteration 306100/ 320000 | elapsed time per iteration (ms): 2480.9 | learning rate: 3.000E-05 | approx flops per GPU: 40.1TFLOPS | lm_loss: 2.585966E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.24 | backward: 1801.14 | backward-backward: 1801.12 | backward-allreduce: 0.00 | optimizer: 55.38 | batch generator: 0.86 + samples/sec: 6.597 | iteration 306200/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.572408E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.53 | backward: 1801.83 | backward-backward: 1801.80 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.86 + samples/sec: 6.600 | iteration 306300/ 320000 | elapsed time per iteration (ms): 2424.4 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 
2.580695E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.85 | backward: 1801.46 | backward-backward: 1801.44 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.84 + samples/sec: 6.598 | iteration 306400/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.569867E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 567.72 | backward: 1801.17 | backward-backward: 1801.15 | backward-allreduce: 0.00 | optimizer: 55.81 | batch generator: 0.77 + samples/sec: 6.597 | iteration 306500/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.569416E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.28 | backward: 1802.03 | backward-backward: 1802.01 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.82 + samples/sec: 6.600 | iteration 306600/ 320000 | elapsed time per iteration (ms): 2424.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.589483E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.79 | backward: 1801.49 | backward-backward: 1801.46 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.80 + samples/sec: 6.600 | iteration 306700/ 320000 | elapsed time per iteration (ms): 2424.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.600679E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.74 | backward: 1801.57 | backward-backward: 1801.54 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.77 + samples/sec: 6.599 | iteration 306800/ 320000 | elapsed time per iteration (ms): 2424.5 | learning 
rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.573402E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.78 | backward: 1801.74 | backward-backward: 1801.71 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.77 + samples/sec: 6.600 | iteration 306900/ 320000 | elapsed time per iteration (ms): 2424.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.590955E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.18 | backward: 1801.28 | backward-backward: 1801.26 | backward-allreduce: 0.00 | optimizer: 55.41 | batch generator: 0.78 + samples/sec: 6.602 | iteration 307000/ 320000 | elapsed time per iteration (ms): 2423.7 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.544929E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.95 | backward: 1801.01 | backward-backward: 1800.99 | backward-allreduce: 0.00 | optimizer: 55.35 | batch generator: 0.79 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 307000 | lm_loss value: 2.591101E+00 | lm_loss_ppl value: 1.334445E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.449 | iteration 307100/ 320000 | elapsed time per iteration (ms): 2481.1 | learning rate: 3.000E-05 | approx flops per GPU: 40.1TFLOPS | lm_loss: 2.591664E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.25 | backward: 1801.19 | backward-backward: 1801.17 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.87 + samples/sec: 6.600 | iteration 307200/ 320000 | elapsed time per iteration (ms): 2424.1 | learning rate: 3.000E-05 | approx 
flops per GPU: 41.0TFLOPS | lm_loss: 2.591277E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.87 | backward: 1801.07 | backward-backward: 1801.05 | backward-allreduce: 0.00 | optimizer: 55.78 | batch generator: 0.79 + samples/sec: 6.599 | iteration 307300/ 320000 | elapsed time per iteration (ms): 2424.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.577951E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.28 | backward: 1801.47 | backward-backward: 1801.45 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.79 + samples/sec: 6.602 | iteration 307400/ 320000 | elapsed time per iteration (ms): 2423.4 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.562906E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.93 | backward: 1800.81 | backward-backward: 1800.79 | backward-allreduce: 0.00 | optimizer: 55.26 | batch generator: 0.82 + samples/sec: 6.598 | iteration 307500/ 320000 | elapsed time per iteration (ms): 2424.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.564366E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.82 | backward: 1801.66 | backward-backward: 1801.64 | backward-allreduce: 0.00 | optimizer: 56.04 | batch generator: 0.81 + samples/sec: 6.600 | iteration 307600/ 320000 | elapsed time per iteration (ms): 2424.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.568801E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.75 | backward: 1801.25 | backward-backward: 1801.23 | backward-allreduce: 0.00 | optimizer: 55.81 | batch generator: 0.77 + samples/sec: 6.602 | iteration 307700/ 320000 | elapsed time 
per iteration (ms): 2423.4 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.583191E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.75 | backward: 1800.75 | backward-backward: 1800.73 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.75 + samples/sec: 6.600 | iteration 307800/ 320000 | elapsed time per iteration (ms): 2424.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.580722E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.07 | backward: 1801.07 | backward-backward: 1801.05 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.83 + samples/sec: 6.596 | iteration 307900/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.588373E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.66 | backward: 1801.88 | backward-backward: 1801.85 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 0.77 + samples/sec: 6.602 | iteration 308000/ 320000 | elapsed time per iteration (ms): 2423.7 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.578865E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.93 | backward: 1800.74 | backward-backward: 1800.72 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.78 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 308000 | lm_loss value: 2.497848E+00 | lm_loss_ppl value: 1.215631E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.450 | iteration 308100/ 320000 | elapsed time per iteration (ms): 2480.4 
| learning rate: 3.000E-05 | approx flops per GPU: 40.1TFLOPS | lm_loss: 2.568755E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.78 | backward: 1801.12 | backward-backward: 1801.09 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.82 + samples/sec: 6.601 | iteration 308200/ 320000 | elapsed time per iteration (ms): 2423.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.571722E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.78 | backward: 1801.01 | backward-backward: 1800.98 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.83 + samples/sec: 6.602 | iteration 308300/ 320000 | elapsed time per iteration (ms): 2423.7 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.570897E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.06 | backward: 1800.68 | backward-backward: 1800.65 | backward-allreduce: 0.00 | optimizer: 55.41 | batch generator: 0.78 + samples/sec: 6.600 | iteration 308400/ 320000 | elapsed time per iteration (ms): 2424.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.568437E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.84 | backward: 1801.48 | backward-backward: 1801.45 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.79 + samples/sec: 6.598 | iteration 308500/ 320000 | elapsed time per iteration (ms): 2424.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.564536E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.03 | backward: 1801.74 | backward-backward: 1801.71 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.83 + samples/sec: 6.596 | 
iteration 308600/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.577878E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.25 | backward: 1801.99 | backward-backward: 1801.96 | backward-allreduce: 0.00 | optimizer: 56.01 | batch generator: 0.78 + samples/sec: 6.602 | iteration 308700/ 320000 | elapsed time per iteration (ms): 2423.4 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.575379E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 567.18 | backward: 1800.84 | backward-backward: 1800.81 | backward-allreduce: 0.00 | optimizer: 55.02 | batch generator: 0.82 + samples/sec: 6.602 | iteration 308800/ 320000 | elapsed time per iteration (ms): 2423.5 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.574867E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.87 | backward: 1800.55 | backward-backward: 1800.53 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.76 + samples/sec: 6.600 | iteration 308900/ 320000 | elapsed time per iteration (ms): 2424.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.577022E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.48 | backward: 1800.69 | backward-backward: 1800.67 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.78 + samples/sec: 6.601 | iteration 309000/ 320000 | elapsed time per iteration (ms): 2423.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.582946E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.81 | backward: 1801.19 | backward-backward: 1801.17 | backward-allreduce: 0.00 | 
optimizer: 55.53 | batch generator: 0.78 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 309000 | lm_loss value: 2.642396E+00 | lm_loss_ppl value: 1.404682E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.452 | iteration 309100/ 320000 | elapsed time per iteration (ms): 2480.0 | learning rate: 3.000E-05 | approx flops per GPU: 40.1TFLOPS | lm_loss: 2.565827E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.90 | backward: 1800.42 | backward-backward: 1800.40 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.87 + samples/sec: 6.602 | iteration 309200/ 320000 | elapsed time per iteration (ms): 2423.4 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.582941E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.72 | backward: 1800.59 | backward-backward: 1800.57 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.79 + samples/sec: 6.603 | iteration 309300/ 320000 | elapsed time per iteration (ms): 2423.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.563870E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.83 | backward: 1800.60 | backward-backward: 1800.57 | backward-allreduce: 0.00 | optimizer: 55.40 | batch generator: 0.79 + samples/sec: 6.601 | iteration 309400/ 320000 | elapsed time per iteration (ms): 2423.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.577794E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.02 | backward: 1800.85 | backward-backward: 1800.82 | backward-allreduce: 0.00 | optimizer: 55.55 | batch 
generator: 0.79 + samples/sec: 6.601 | iteration 309500/ 320000 | elapsed time per iteration (ms): 2423.7 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.577366E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.78 | backward: 1800.96 | backward-backward: 1800.94 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.79 + samples/sec: 6.600 | iteration 309600/ 320000 | elapsed time per iteration (ms): 2424.4 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.551892E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.63 | backward: 1800.47 | backward-backward: 1800.45 | backward-allreduce: 0.00 | optimizer: 55.94 | batch generator: 0.86 + samples/sec: 6.598 | iteration 309700/ 320000 | elapsed time per iteration (ms): 2424.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.582050E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.89 | backward: 1801.75 | backward-backward: 1801.72 | backward-allreduce: 0.00 | optimizer: 55.77 | batch generator: 0.79 + samples/sec: 6.601 | iteration 309800/ 320000 | elapsed time per iteration (ms): 2423.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.562390E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.80 | backward: 1801.04 | backward-backward: 1801.02 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.79 + samples/sec: 6.601 | iteration 309900/ 320000 | elapsed time per iteration (ms): 2424.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.582082E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.86 | backward: 1801.54 | backward-backward: 
1801.51 | backward-allreduce: 0.00 | optimizer: 55.28 | batch generator: 0.79 + samples/sec: 6.603 | iteration 310000/ 320000 | elapsed time per iteration (ms): 2423.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.585386E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.74 | backward: 1801.07 | backward-backward: 1801.05 | backward-allreduce: 0.00 | optimizer: 54.91 | batch generator: 0.81 +WARNING: Deleting old checkpoints: + checkpoints-fcm/global_step210000 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 310000 | lm_loss value: 2.591755E+00 | lm_loss_ppl value: 1.335319E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.251 | iteration 310100/ 320000 | elapsed time per iteration (ms): 2559.6 | learning rate: 3.000E-05 | approx flops per GPU: 38.8TFLOPS | lm_loss: 2.562512E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.86 | backward: 1800.96 | backward-backward: 1800.94 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.88 + samples/sec: 6.603 | iteration 310200/ 320000 | elapsed time per iteration (ms): 2423.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.566831E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.10 | backward: 1800.42 | backward-backward: 1800.40 | backward-allreduce: 0.00 | optimizer: 55.28 | batch generator: 0.80 + samples/sec: 6.602 | iteration 310300/ 320000 | elapsed time per iteration (ms): 2423.7 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.603387E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 
566.86 | backward: 1800.99 | backward-backward: 1800.97 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.86 + samples/sec: 6.601 | iteration 310400/ 320000 | elapsed time per iteration (ms): 2423.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.565114E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.89 | backward: 1800.76 | backward-backward: 1800.74 | backward-allreduce: 0.00 | optimizer: 55.86 | batch generator: 0.76 + samples/sec: 6.602 | iteration 310500/ 320000 | elapsed time per iteration (ms): 2423.7 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.584702E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.86 | backward: 1800.79 | backward-backward: 1800.77 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.84 + samples/sec: 6.601 | iteration 310600/ 320000 | elapsed time per iteration (ms): 2424.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.577973E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.89 | backward: 1801.11 | backward-backward: 1801.09 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.80 + samples/sec: 6.599 | iteration 310700/ 320000 | elapsed time per iteration (ms): 2424.4 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.579903E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.04 | backward: 1800.76 | backward-backward: 1800.74 | backward-allreduce: 0.00 | optimizer: 56.26 | batch generator: 0.80 + samples/sec: 6.599 | iteration 310800/ 320000 | elapsed time per iteration (ms): 2424.5 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.607326E+00 | loss scale: 32768.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.98 | backward: 1801.34 | backward-backward: 1801.32 | backward-allreduce: 0.00 | optimizer: 55.79 | batch generator: 0.81 + samples/sec: 6.598 | iteration 310900/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.576089E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.10 | backward: 1802.01 | backward-backward: 1801.99 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.79 + samples/sec: 6.599 | iteration 311000/ 320000 | elapsed time per iteration (ms): 2424.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.571494E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.27 | backward: 1801.48 | backward-backward: 1801.46 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.90 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 311000 | lm_loss value: 2.495144E+00 | lm_loss_ppl value: 1.212348E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.450 | iteration 311100/ 320000 | elapsed time per iteration (ms): 2480.7 | learning rate: 3.000E-05 | approx flops per GPU: 40.1TFLOPS | lm_loss: 2.579734E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.84 | backward: 1801.33 | backward-backward: 1801.31 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.87 + samples/sec: 6.601 | iteration 311200/ 320000 | elapsed time per iteration (ms): 2423.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.575441E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of 
nan iterations: 0 | +time (ms) | forward: 566.81 | backward: 1801.54 | backward-backward: 1801.52 | backward-allreduce: 0.00 | optimizer: 55.03 | batch generator: 0.79 + samples/sec: 6.600 | iteration 311300/ 320000 | elapsed time per iteration (ms): 2424.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.588563E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.85 | backward: 1801.31 | backward-backward: 1801.29 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.79 + samples/sec: 6.602 | iteration 311400/ 320000 | elapsed time per iteration (ms): 2423.6 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.578690E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.76 | backward: 1801.44 | backward-backward: 1801.41 | backward-allreduce: 0.00 | optimizer: 55.06 | batch generator: 0.79 + samples/sec: 6.601 | iteration 311500/ 320000 | elapsed time per iteration (ms): 2423.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.595948E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.85 | backward: 1801.19 | backward-backward: 1801.16 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.78 + samples/sec: 6.601 | iteration 311600/ 320000 | elapsed time per iteration (ms): 2423.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.549503E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.99 | backward: 1800.87 | backward-backward: 1800.84 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.80 + samples/sec: 6.601 | iteration 311700/ 320000 | elapsed time per iteration (ms): 2423.7 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.561166E+00 | loss 
scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.89 | backward: 1801.09 | backward-backward: 1801.06 | backward-allreduce: 0.00 | optimizer: 55.37 | batch generator: 0.77 + samples/sec: 6.598 | iteration 311800/ 320000 | elapsed time per iteration (ms): 2424.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.555129E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.69 | backward: 1801.35 | backward-backward: 1801.32 | backward-allreduce: 0.00 | optimizer: 56.40 | batch generator: 0.78 + samples/sec: 6.599 | iteration 311900/ 320000 | elapsed time per iteration (ms): 2424.5 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.581909E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.77 | backward: 1801.47 | backward-backward: 1801.45 | backward-allreduce: 0.00 | optimizer: 55.88 | batch generator: 0.79 + samples/sec: 6.600 | iteration 312000/ 320000 | elapsed time per iteration (ms): 2424.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.577600E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.40 | backward: 1801.01 | backward-backward: 1800.98 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.80 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 312000 | lm_loss value: 2.600553E+00 | lm_loss_ppl value: 1.347118E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.451 | iteration 312100/ 320000 | elapsed time per iteration (ms): 2480.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.1TFLOPS | lm_loss: 2.558724E+00 | loss scale: 32768.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.71 | backward: 1801.05 | backward-backward: 1801.03 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.85 + samples/sec: 6.600 | iteration 312200/ 320000 | elapsed time per iteration (ms): 2424.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.572047E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.43 | backward: 1800.76 | backward-backward: 1800.74 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.78 + samples/sec: 6.600 | iteration 312300/ 320000 | elapsed time per iteration (ms): 2424.4 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.555141E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.81 | backward: 1801.18 | backward-backward: 1801.15 | backward-allreduce: 0.00 | optimizer: 56.00 | batch generator: 0.79 + samples/sec: 6.603 | iteration 312400/ 320000 | elapsed time per iteration (ms): 2423.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.572513E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.72 | backward: 1800.88 | backward-backward: 1800.86 | backward-allreduce: 0.00 | optimizer: 55.30 | batch generator: 0.80 + samples/sec: 6.600 | iteration 312500/ 320000 | elapsed time per iteration (ms): 2424.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.578261E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.63 | backward: 1801.41 | backward-backward: 1801.38 | backward-allreduce: 0.00 | optimizer: 55.78 | batch generator: 0.80 + samples/sec: 6.599 | iteration 312600/ 320000 | elapsed time per iteration (ms): 2424.5 | learning rate: 3.000E-05 | approx flops per GPU: 
41.0TFLOPS | lm_loss: 2.561309E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.76 | backward: 1801.75 | backward-backward: 1801.73 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.81 + samples/sec: 6.600 | iteration 312700/ 320000 | elapsed time per iteration (ms): 2424.4 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.558634E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.26 | backward: 1801.10 | backward-backward: 1801.08 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.83 + samples/sec: 6.600 | iteration 312800/ 320000 | elapsed time per iteration (ms): 2424.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.541095E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.92 | backward: 1801.55 | backward-backward: 1801.52 | backward-allreduce: 0.00 | optimizer: 55.29 | batch generator: 0.92 + samples/sec: 6.599 | iteration 312900/ 320000 | elapsed time per iteration (ms): 2424.5 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.575343E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.89 | backward: 1801.81 | backward-backward: 1801.79 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.81 + samples/sec: 6.600 | iteration 313000/ 320000 | elapsed time per iteration (ms): 2424.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.567305E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.69 | backward: 1801.23 | backward-backward: 1801.20 | backward-allreduce: 0.00 | optimizer: 56.01 | batch generator: 0.80 
+----------------------------------------------------------------------------------------------------------- + validation results at iteration 313000 | lm_loss value: 2.583228E+00 | lm_loss_ppl value: 1.323980E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.451 | iteration 313100/ 320000 | elapsed time per iteration (ms): 2480.2 | learning rate: 3.000E-05 | approx flops per GPU: 40.1TFLOPS | lm_loss: 2.581078E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.92 | backward: 1800.68 | backward-backward: 1800.66 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.82 + samples/sec: 6.602 | iteration 313200/ 320000 | elapsed time per iteration (ms): 2423.5 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.583327E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.89 | backward: 1800.60 | backward-backward: 1800.58 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.76 + samples/sec: 6.601 | iteration 313300/ 320000 | elapsed time per iteration (ms): 2423.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.571817E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.15 | backward: 1800.78 | backward-backward: 1800.76 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.85 + samples/sec: 6.601 | iteration 313400/ 320000 | elapsed time per iteration (ms): 2423.7 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.577246E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.71 | backward: 1800.94 | backward-backward: 1800.91 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 0.78 + samples/sec: 6.602 | 
iteration 313500/ 320000 | elapsed time per iteration (ms): 2423.6 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.569091E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.11 | backward: 1800.66 | backward-backward: 1800.64 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.78 + samples/sec: 6.601 | iteration 313600/ 320000 | elapsed time per iteration (ms): 2423.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.563548E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.84 | backward: 1801.07 | backward-backward: 1801.05 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.77 + samples/sec: 6.602 | iteration 313700/ 320000 | elapsed time per iteration (ms): 2423.6 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.580389E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.73 | backward: 1800.86 | backward-backward: 1800.84 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.78 + samples/sec: 6.600 | iteration 313800/ 320000 | elapsed time per iteration (ms): 2424.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.558928E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.36 | backward: 1800.91 | backward-backward: 1800.88 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.82 + samples/sec: 6.600 | iteration 313900/ 320000 | elapsed time per iteration (ms): 2424.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.556955E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.22 | backward: 1800.90 | backward-backward: 1800.88 | backward-allreduce: 0.00 | 
optimizer: 55.69 | batch generator: 0.84 + samples/sec: 6.601 | iteration 314000/ 320000 | elapsed time per iteration (ms): 2424.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.576515E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.84 | backward: 1801.17 | backward-backward: 1801.14 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.87 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 314000 | lm_loss value: 2.588885E+00 | lm_loss_ppl value: 1.331492E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.445 | iteration 314100/ 320000 | elapsed time per iteration (ms): 2482.5 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.576083E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.79 | backward: 1803.02 | backward-backward: 1803.00 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.85 + samples/sec: 6.599 | iteration 314200/ 320000 | elapsed time per iteration (ms): 2424.7 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.600493E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.89 | backward: 1801.73 | backward-backward: 1801.71 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.79 + samples/sec: 6.599 | iteration 314300/ 320000 | elapsed time per iteration (ms): 2424.6 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.588044E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.80 | backward: 1801.80 | backward-backward: 1801.77 | backward-allreduce: 0.00 | optimizer: 55.64 | batch 
generator: 0.78 + samples/sec: 6.600 | iteration 314400/ 320000 | elapsed time per iteration (ms): 2424.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.584691E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.73 | backward: 1801.55 | backward-backward: 1801.53 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.77 + samples/sec: 6.600 | iteration 314500/ 320000 | elapsed time per iteration (ms): 2424.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.559850E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.77 | backward: 1801.16 | backward-backward: 1801.14 | backward-allreduce: 0.00 | optimizer: 55.83 | batch generator: 0.77 + samples/sec: 6.600 | iteration 314600/ 320000 | elapsed time per iteration (ms): 2424.4 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.577471E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.16 | backward: 1801.12 | backward-backward: 1801.09 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.99 + samples/sec: 6.601 | iteration 314700/ 320000 | elapsed time per iteration (ms): 2423.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.579653E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.89 | backward: 1801.14 | backward-backward: 1801.11 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.83 + samples/sec: 6.599 | iteration 314800/ 320000 | elapsed time per iteration (ms): 2424.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.555415E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.33 | backward: 1801.34 | backward-backward: 
1801.31 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.90 + samples/sec: 6.600 | iteration 314900/ 320000 | elapsed time per iteration (ms): 2424.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.563956E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.85 | backward: 1800.92 | backward-backward: 1800.89 | backward-allreduce: 0.00 | optimizer: 55.98 | batch generator: 0.84 + samples/sec: 6.602 | iteration 315000/ 320000 | elapsed time per iteration (ms): 2423.5 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.575382E+00 | loss scale: 65536.0 | number of skipped iterations: 2 | number of nan iterations: 0 | +time (ms) | forward: 566.77 | backward: 1801.89 | backward-backward: 1801.87 | backward-allreduce: 0.00 | optimizer: 54.49 | batch generator: 0.78 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 315000 | lm_loss value: 2.583545E+00 | lm_loss_ppl value: 1.324400E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.447 | iteration 315100/ 320000 | elapsed time per iteration (ms): 2481.9 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.579786E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.83 | backward: 1801.86 | backward-backward: 1801.83 | backward-allreduce: 0.00 | optimizer: 56.15 | batch generator: 0.86 + samples/sec: 6.598 | iteration 315200/ 320000 | elapsed time per iteration (ms): 2424.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.585227E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.72 | backward: 1802.31 | backward-backward: 1802.28 | 
backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.78 + samples/sec: 6.599 | iteration 315300/ 320000 | elapsed time per iteration (ms): 2424.5 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.557683E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.79 | backward: 1801.48 | backward-backward: 1801.46 | backward-allreduce: 0.00 | optimizer: 55.86 | batch generator: 0.78 + samples/sec: 6.598 | iteration 315400/ 320000 | elapsed time per iteration (ms): 2425.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.569447E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.82 | backward: 1801.82 | backward-backward: 1801.80 | backward-allreduce: 0.00 | optimizer: 55.98 | batch generator: 0.82 + samples/sec: 6.599 | iteration 315500/ 320000 | elapsed time per iteration (ms): 2424.4 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.568206E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.85 | backward: 1801.62 | backward-backward: 1801.59 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.80 + samples/sec: 6.600 | iteration 315600/ 320000 | elapsed time per iteration (ms): 2424.4 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.568232E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.78 | backward: 1801.52 | backward-backward: 1801.50 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.78 + samples/sec: 6.599 | iteration 315700/ 320000 | elapsed time per iteration (ms): 2424.5 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.571207E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | 
forward: 566.85 | backward: 1801.36 | backward-backward: 1801.33 | backward-allreduce: 0.00 | optimizer: 55.86 | batch generator: 0.79 + samples/sec: 6.595 | iteration 315800/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.554875E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.50 | backward: 1802.33 | backward-backward: 1802.30 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.86 + samples/sec: 6.598 | iteration 315900/ 320000 | elapsed time per iteration (ms): 2424.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.577215E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.82 | backward: 1802.17 | backward-backward: 1802.15 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.79 + samples/sec: 6.600 | iteration 316000/ 320000 | elapsed time per iteration (ms): 2424.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.546910E+00 | loss scale: 131072.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.99 | backward: 1801.57 | backward-backward: 1801.54 | backward-allreduce: 0.00 | optimizer: 55.17 | batch generator: 0.79 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 316000 | lm_loss value: 2.533452E+00 | lm_loss_ppl value: 1.259692E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.449 | iteration 316100/ 320000 | elapsed time per iteration (ms): 2481.0 | learning rate: 3.000E-05 | approx flops per GPU: 40.1TFLOPS | lm_loss: 2.587105E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.76 | 
backward: 1801.66 | backward-backward: 1801.64 | backward-allreduce: 0.00 | optimizer: 55.35 | batch generator: 0.84 + samples/sec: 6.600 | iteration 316200/ 320000 | elapsed time per iteration (ms): 2424.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.572547E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.75 | backward: 1801.09 | backward-backward: 1801.06 | backward-allreduce: 0.00 | optimizer: 55.96 | batch generator: 0.78 + samples/sec: 6.599 | iteration 316300/ 320000 | elapsed time per iteration (ms): 2424.6 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.583526E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.79 | backward: 1801.80 | backward-backward: 1801.78 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.79 + samples/sec: 6.600 | iteration 316400/ 320000 | elapsed time per iteration (ms): 2424.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.587612E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.79 | backward: 1801.53 | backward-backward: 1801.51 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.79 + samples/sec: 6.600 | iteration 316500/ 320000 | elapsed time per iteration (ms): 2424.4 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.576551E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.89 | backward: 1801.39 | backward-backward: 1801.37 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.81 + samples/sec: 6.600 | iteration 316600/ 320000 | elapsed time per iteration (ms): 2424.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.575765E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | +time (ms) | forward: 566.76 | backward: 1801.53 | backward-backward: 1801.51 | backward-allreduce: 0.00 | optimizer: 55.41 | batch generator: 0.80 + samples/sec: 6.599 | iteration 316700/ 320000 | elapsed time per iteration (ms): 2424.5 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.584073E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.82 | backward: 1801.50 | backward-backward: 1801.47 | backward-allreduce: 0.00 | optimizer: 55.79 | batch generator: 0.85 + samples/sec: 6.601 | iteration 316800/ 320000 | elapsed time per iteration (ms): 2423.7 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.567396E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | +time (ms) | forward: 566.91 | backward: 1801.31 | backward-backward: 1801.29 | backward-allreduce: 0.00 | optimizer: 55.12 | batch generator: 0.79 + samples/sec: 6.599 | iteration 316900/ 320000 | elapsed time per iteration (ms): 2424.5 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.567840E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.87 | backward: 1801.41 | backward-backward: 1801.38 | backward-allreduce: 0.00 | optimizer: 55.82 | batch generator: 0.79 + samples/sec: 6.600 | iteration 317000/ 320000 | elapsed time per iteration (ms): 2424.4 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.573543E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.84 | backward: 1801.11 | backward-backward: 1801.08 | backward-allreduce: 0.00 | optimizer: 56.08 | batch generator: 0.83 +----------------------------------------------------------------------------------------------------------- + validation results at iteration 317000 | lm_loss value: 
2.481098E+00 | lm_loss_ppl value: 1.195438E+01 | +----------------------------------------------------------------------------------------------------------- + samples/sec: 6.451 | iteration 317100/ 320000 | elapsed time per iteration (ms): 2480.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.1TFLOPS | lm_loss: 2.562123E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.94 | backward: 1800.85 | backward-backward: 1800.83 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.87 + samples/sec: 6.601 | iteration 317200/ 320000 | elapsed time per iteration (ms): 2424.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.563852E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.98 | backward: 1801.09 | backward-backward: 1801.07 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.77 + samples/sec: 6.601 | iteration 317300/ 320000 | elapsed time per iteration (ms): 2424.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.562842E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.89 | backward: 1800.44 | backward-backward: 1800.42 | backward-allreduce: 0.00 | optimizer: 56.33 | batch generator: 0.87 + samples/sec: 6.601 | iteration 317400/ 320000 | elapsed time per iteration (ms): 2423.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.560191E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.76 | backward: 1801.12 | backward-backward: 1801.10 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.80 + samples/sec: 6.603 | iteration 317500/ 320000 | elapsed time per iteration (ms): 2423.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.554124E+00 | loss scale: 
32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.85 | backward: 1800.17 | backward-backward: 1800.14 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.82 + samples/sec: 6.603 | iteration 317600/ 320000 | elapsed time per iteration (ms): 2423.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.581624E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.72 | backward: 1800.46 | backward-backward: 1800.43 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.78 + samples/sec: 6.602 | iteration 317700/ 320000 | elapsed time per iteration (ms): 2423.6 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.576177E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.88 | backward: 1800.61 | backward-backward: 1800.58 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.77 + samples/sec: 6.600 | iteration 317800/ 320000 | elapsed time per iteration (ms): 2424.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.544089E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 567.05 | backward: 1800.94 | backward-backward: 1800.92 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 0.75 + samples/sec: 6.600 | iteration 317900/ 320000 | elapsed time per iteration (ms): 2424.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.553655E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | +time (ms) | forward: 566.85 | backward: 1801.46 | backward-backward: 1801.44 | backward-allreduce: 0.00 | optimizer: 55.38 | batch generator: 0.77 + samples/sec: 6.603 | iteration 318000/ 320000 | elapsed time per iteration (ms): 2423.3 | learning rate: 3.000E-05 | approx 
flops per GPU: 41.0TFLOPS | lm_loss: 2.585588E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 |
+time (ms) | forward: 566.78 | backward: 1800.83 | backward-backward: 1800.81 | backward-allreduce: 0.00 | optimizer: 55.33 | batch generator: 0.79
+-----------------------------------------------------------------------------------------------------------
+ validation results at iteration 318000 | lm_loss value: 2.545193E+00 | lm_loss_ppl value: 1.274569E+01 |
+-----------------------------------------------------------------------------------------------------------
+ samples/sec: 6.451 | iteration 318100/ 320000 | elapsed time per iteration (ms): 2480.2 | learning rate: 3.000E-05 | approx flops per GPU: 40.1TFLOPS | lm_loss: 2.576874E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 |
+time (ms) | forward: 567.11 | backward: 1801.07 | backward-backward: 1801.05 | backward-allreduce: 0.00 | optimizer: 54.96 | batch generator: 0.87
+ samples/sec: 6.600 | iteration 318200/ 320000 | elapsed time per iteration (ms): 2424.4 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.587859E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+time (ms) | forward: 567.07 | backward: 1801.13 | backward-backward: 1801.10 | backward-allreduce: 0.00 | optimizer: 55.80 | batch generator: 0.83
+ samples/sec: 6.602 | iteration 318300/ 320000 | elapsed time per iteration (ms): 2423.4 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.569983E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+time (ms) | forward: 566.88 | backward: 1800.53 | backward-backward: 1800.50 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.84
+ samples/sec: 6.599 | iteration 318400/ 320000 | elapsed time per iteration (ms): 2424.5 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.574200E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+time (ms) | forward: 566.95 | backward: 1801.18 | backward-backward: 1801.15 | backward-allreduce: 0.00 | optimizer: 55.94 | batch generator: 0.76
+ samples/sec: 6.602 | iteration 318500/ 320000 | elapsed time per iteration (ms): 2423.5 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.560100E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+time (ms) | forward: 566.90 | backward: 1800.71 | backward-backward: 1800.68 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.78
+ samples/sec: 6.602 | iteration 318600/ 320000 | elapsed time per iteration (ms): 2423.6 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.596654E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+time (ms) | forward: 566.92 | backward: 1800.76 | backward-backward: 1800.73 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.86
+ samples/sec: 6.602 | iteration 318700/ 320000 | elapsed time per iteration (ms): 2423.4 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.593483E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+time (ms) | forward: 566.75 | backward: 1800.60 | backward-backward: 1800.58 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.80
+ samples/sec: 6.602 | iteration 318800/ 320000 | elapsed time per iteration (ms): 2423.5 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.560842E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+time (ms) | forward: 566.80 | backward: 1800.46 | backward-backward: 1800.43 | backward-allreduce: 0.00 | optimizer: 55.87 | batch generator: 0.79
+ samples/sec: 6.603 | iteration 318900/ 320000 | elapsed time per iteration (ms): 2423.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.576042E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+time (ms) | forward: 566.70 | backward: 1800.42 | backward-backward: 1800.40 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.78
+ samples/sec: 6.601 | iteration 319000/ 320000 | elapsed time per iteration (ms): 2423.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.562984E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+time (ms) | forward: 566.72 | backward: 1800.86 | backward-backward: 1800.84 | backward-allreduce: 0.00 | optimizer: 55.79 | batch generator: 0.78
+-----------------------------------------------------------------------------------------------------------
+ validation results at iteration 319000 | lm_loss value: 2.530998E+00 | lm_loss_ppl value: 1.256603E+01 |
+-----------------------------------------------------------------------------------------------------------
+ samples/sec: 6.450 | iteration 319100/ 320000 | elapsed time per iteration (ms): 2480.7 | learning rate: 3.000E-05 | approx flops per GPU: 40.1TFLOPS | lm_loss: 2.569254E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+time (ms) | forward: 566.85 | backward: 1800.79 | backward-backward: 1800.77 | backward-allreduce: 0.00 | optimizer: 55.98 | batch generator: 0.90
+ samples/sec: 6.601 | iteration 319200/ 320000 | elapsed time per iteration (ms): 2424.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.566165E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 |
+time (ms) | forward: 567.21 | backward: 1801.30 | backward-backward: 1801.28 | backward-allreduce: 0.00 | optimizer: 55.13 | batch generator: 0.81
+ samples/sec: 6.603 | iteration 319300/ 320000 | elapsed time per iteration (ms): 2423.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.544725E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 |
+time (ms) | forward: 566.72 | backward: 1800.83 | backward-backward: 1800.81 | backward-allreduce: 0.00 | optimizer: 55.17 | batch generator: 0.76
+ samples/sec: 6.601 | iteration 319400/ 320000 | elapsed time per iteration (ms): 2423.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.561673E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+time (ms) | forward: 566.87 | backward: 1800.34 | backward-backward: 1800.31 | backward-allreduce: 0.00 | optimizer: 56.24 | batch generator: 0.79
+ samples/sec: 6.603 | iteration 319500/ 320000 | elapsed time per iteration (ms): 2423.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.574023E+00 | loss scale: 16384.0 | number of skipped iterations: 1 | number of nan iterations: 0 |
+time (ms) | forward: 566.77 | backward: 1800.89 | backward-backward: 1800.86 | backward-allreduce: 0.00 | optimizer: 55.11 | batch generator: 0.81
+ samples/sec: 6.603 | iteration 319600/ 320000 | elapsed time per iteration (ms): 2423.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.580322E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+time (ms) | forward: 566.81 | backward: 1800.20 | backward-backward: 1800.17 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.84
+ samples/sec: 6.603 | iteration 319700/ 320000 | elapsed time per iteration (ms): 2423.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.560445E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+time (ms) | forward: 566.71 | backward: 1800.43 | backward-backward: 1800.41 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.77
+ samples/sec: 6.604 | iteration 319800/ 320000 | elapsed time per iteration (ms): 2422.7 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.567856E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+time (ms) | forward: 566.79 | backward: 1799.80 | backward-backward: 1799.78 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.78
+ samples/sec: 6.603 | iteration 319900/ 320000 | elapsed time per iteration (ms): 2423.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.558972E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+time (ms) | forward: 566.63 | backward: 1800.13 | backward-backward: 1800.11 | backward-allreduce: 0.00 | optimizer: 55.86 | batch generator: 0.79
+ samples/sec: 6.602 | iteration 320000/ 320000 | elapsed time per iteration (ms): 2423.4 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.586820E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+time (ms) | forward: 566.71 | backward: 1800.85 | backward-backward: 1800.83 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.81
+-----------------------------------------------------------------------------------------------------------
+ validation results at iteration 320000 | lm_loss value: 2.620824E+00 | lm_loss_ppl value: 1.374705E+01 |
+-----------------------------------------------------------------------------------------------------------
+---------------------------------------------------------------------------------------------------------------------------
+ validation results at the end of training for val data | lm_loss value: 2.607056E+00 | lm_loss_ppl value: 1.355907E+01 |
+---------------------------------------------------------------------------------------------------------------------------
+WARNING: Deleting old checkpoints:
+ checkpoints-fcm/global_step220000
+----------------------------------------------------------------------------------------------------------------------
+ test results at the end of training for test data | lm_loss value: 2.521412E+00 | lm_loss_ppl value: 1.244616E+01 |
+----------------------------------------------------------------------------------------------------------------------