Training on TPU with TensorFlow
If you don't need long explanations and just want TPU code samples to get started with, check out our TPU example notebook!
What is a TPU?
A TPU is a Tensor Processing Unit. They are hardware designed by Google, used to greatly speed up the tensor computations within neural networks, much like GPUs. They can be used for both network training and inference. They are generally accessed through Google's cloud services, but small TPUs can also be accessed directly for free through Google Colab and Kaggle Kernels.
Because all TensorFlow models in 🤗 Transformers are Keras models, most of the methods in this document are generally applicable to TPU training for any Keras model! However, there are a few points that are specific to the HuggingFace ecosystem (hug-o-system?) of Transformers and Datasets, and we'll make sure to flag them up when we get to them.
What kinds of TPU are available?
New users are often very confused by the range of TPUs available and the different ways to access them. The first key distinction to understand is the difference between TPU Nodes and TPU VMs.
When you use a TPU Node, you are effectively indirectly accessing a remote TPU. You will need a separate VM, which will initialize your network and data pipeline and then forward them to the remote node. When you use a TPU on Google Colab, you are accessing it in the TPU Node style.
Using TPU Nodes can have some quite unexpected behaviour for people who aren't used to them! In particular, because the TPU is located on a physically different system to the machine you're running your Python code on, your data cannot be local to your machine - any data pipeline that loads from your machine's local storage will totally fail! Instead, data must be stored in Google Cloud Storage, where your data pipeline can still access it even when the pipeline is running on the remote TPU Node.
If you can fit all your data in memory as `np.ndarray` or `tf.Tensor`, then you can `fit()` on that data even when using Colab or a TPU Node, without needing to upload it to Google Cloud Storage.
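As a minimal sketch of that in-memory approach (the checkpoint, texts, and hyperparameters below are just placeholders), you can tokenize everything to fixed-length NumPy arrays and call `fit()` directly:

```python
import numpy as np
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased")

texts = ["A positive example", "A negative example"]
labels = np.array([1, 0])

# padding="max_length" gives every sample the same static shape, and
# return_tensors="np" keeps the whole dataset in local memory
inputs = tokenizer(texts, padding="max_length", truncation=True, max_length=128, return_tensors="np")

model.compile(optimizer="adam")  # with no loss argument, the model uses its internal loss
model.fit(dict(inputs), labels, batch_size=2)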
🤗 Specific Hugging Face Tip 🤗: The methods `Dataset.to_tf_dataset()` and its higher-level wrapper `model.prepare_tf_dataset()`, which you will see throughout our TF code examples, will both fail on a TPU Node. The reason for this is that even though they create a `tf.data.Dataset`, it is not a "pure" `tf.data` pipeline - it uses `tf.numpy_function` or `Dataset.from_generator()` to stream data from the underlying HuggingFace `Dataset`. This HuggingFace `Dataset` is backed by data that is on a local disc, which the remote TPU Node will not be able to read.
The second way to access a TPU is via a TPU VM. When using a TPU VM, you connect directly to the machine that the TPU is attached to, much like training on a GPU VM. TPU VMs are generally easier to work with, particularly when it comes to your data pipeline. All of the above warnings do not apply to TPU VMs!
This is an opinionated document, so here's our opinion: avoid using TPU Node if possible. It is more confusing and more difficult to debug than a TPU VM, and it is likely to be unsupported in the future - Google's latest TPU, TPUv4, can only be accessed as a TPU VM, so TPU Nodes are likely to become a "legacy" access method. However, the only free TPU access is on Colab and Kaggle Kernels, which use TPU Nodes - so we'll try to explain how to handle them if you have to! Check the TPU example notebook for code samples that explain this in more detail.
What sizes of TPU are available?
A single TPU (a v2-8/v3-8/v4-8) runs 8 replicas. TPUs also exist in pods that can run hundreds or thousands of replicas simultaneously. When you use more than a single TPU but less than a whole pod (for example, a v3-32), your TPU fleet is referred to as a pod slice.
When you access a free TPU via Colab, you generally get a single v2-8 TPU.
I keep hearing about this XLA thing. What's XLA, and how does it relate to TPUs?
XLA is an optimizing compiler, used by both TensorFlow and JAX. In JAX it is the only compiler, whereas in TensorFlow it is optional (but mandatory on TPU!). The easiest way to enable it when training a Keras model is to pass the argument `jit_compile=True` to `model.compile()`. If you don't get any errors and performance is good, that's a great sign that you're ready to move to TPU!
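For example, enabling XLA on any Keras model looks like this (the toy model and random data below are just placeholders):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(2),
])

# jit_compile=True asks Keras to compile the train/predict steps with XLA
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    jit_compile=True,
)

x = tf.random.normal((256, 16))
y = tf.random.uniform((256,), maxval=2, dtype=tf.int32)
model.fit(x, y, epochs=1)  # a few steps is enough to check XLA compatibility
```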
Debugging on TPU is generally a bit harder than on CPU/GPU, so we recommend getting your code running on CPU/GPU with XLA first before trying it on TPU. You don't have to train for long, of course - just for a few steps to make sure that your model and data pipeline are working as you expect.
XLA-compiled code is usually faster - so even if you're not planning to run on TPU, adding `jit_compile=True` can improve your performance. Be sure to note the caveats below about XLA compatibility, though!
Tip born of bitter experience: Although using `jit_compile=True` is a good way to get a speed boost and check that your CPU/GPU code is XLA-compatible, it can actually cause a lot of problems if you leave it in when actually training on TPU. XLA compilation happens implicitly on TPU, so remember to remove that line before actually running your code on a TPU!
How do I make my model XLA compatible?
In many cases, your code is probably XLA-compatible already! However, there are a few things that work in normal TensorFlow that don't work in XLA. We've distilled them into the three core rules below:
🤗 Specific HuggingFace Tip 🤗: We've put a lot of effort into rewriting our TensorFlow models and loss functions to be XLA-compatible. Our models and loss functions generally obey rules #1 and #2 by default, so you can skip over them if you're using `transformers` models. Don't forget these rules when writing your own models and loss functions, though!
XLA Rule #1: Your code cannot have "data-dependent conditionals"
What that means is that no `if` statement can depend on values inside a `tf.Tensor`. For example, this code block cannot be compiled with XLA!
```python
if tf.reduce_sum(tensor) > 10:
    tensor = tensor / 2.0
```
This might seem very restrictive at first, but most neural net code doesn't need to do this. You can often get around the restriction by using `tf.cond` (see its documentation; a short sketch follows after the next example) or by removing the conditional and finding a clever math trick with indicator variables instead, like so:
```python
sum_over_10 = tf.cast(tf.reduce_sum(tensor) > 10, tf.float32)
tensor = tensor / (1.0 + sum_over_10)
```
This code has exactly the same effect as the code above, but by avoiding a conditional, we ensure it will compile with XLA without problems!
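For completeness, here's a sketch of the `tf.cond` route for the same example (reusing the `tensor` variable from above) - both branches are traced ahead of time, so XLA can handle it:

```python
tensor = tf.cond(
    tf.reduce_sum(tensor) > 10,   # the condition, as a scalar boolean tensor
    true_fn=lambda: tensor / 2.0,  # branch taken when the condition holds
    false_fn=lambda: tensor,       # otherwise, pass the tensor through unchanged
)
```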
XLA Rule #2: Your code cannot have "data-dependent shapes"
What this means is that the shape of every `tf.Tensor` object in your code cannot depend on its values. For example, the function `tf.unique` cannot be compiled with XLA, because it returns a tensor containing one instance of each unique value in the input. The shape of this output obviously differs depending on how repetitive the input `Tensor` was, so XLA refuses to handle it!
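To make that concrete, here's a quick illustration of why `tf.unique`'s output shape is data-dependent:

```python
import tensorflow as tf

# Same input shape (3,), but different values...
y1, _ = tf.unique(tf.constant([1, 2, 3]))
y2, _ = tf.unique(tf.constant([1, 1, 1]))

print(y1.shape)  # (3,) - three unique values
print(y2.shape)  # (1,) - only one unique value
```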
In general, most neural network code obeys rule #2 by default. However, there are a few common cases where it becomes a problem. One very common one is when you use label masking, setting your labels to a negative value to indicate that those positions should be ignored when computing the loss. If you look at NumPy or PyTorch loss functions that support label masking, you will often see code like this that uses boolean indexing:
```python
label_mask = labels >= 0
masked_outputs = outputs[label_mask]
masked_labels = labels[label_mask]
loss = compute_loss(masked_outputs, masked_labels)
mean_loss = torch.mean(loss)
```
This code is totally fine in NumPy or PyTorch, but it breaks in XLA! Why? Because the shapes of `masked_outputs` and `masked_labels` depend on how many positions are masked - that makes them data-dependent shapes. However, just as with rule #1, we can often rewrite this code to yield exactly the same output without any data-dependent shapes.
```python
label_mask = tf.cast(labels >= 0, tf.float32)
loss = compute_loss(outputs, labels)
loss = loss * label_mask  # Set negative label positions to 0
mean_loss = tf.reduce_sum(loss) / tf.reduce_sum(label_mask)
```
Here, we avoid data-dependent shapes by computing the loss at every position, but zeroing out the masked positions in both the numerator and denominator when we calculate the mean, which yields exactly the same result as the first block while maintaining XLA compatibility. Note that we use the same trick as in rule #1 - converting a `tf.bool` to `tf.float32` and using it as an indicator variable. This is a really useful trick, so remember it if you need to convert your own code to XLA!
XLA Rule #3: XLA will need to recompile your model for every different input shape it sees
This is the big one. What it means is that if your input shapes are very variable, XLA will have to recompile your model over and over, which can create huge performance problems. This commonly arises in NLP models, where input texts have variable lengths after tokenization. In other modalities, static shapes are more common and this rule is much less of a problem.
How can you get around rule #3? The key is padding - if you pad all your inputs to the same length and then use an `attention_mask`, you can get the same results as you would from variable shapes, but without any XLA issues. However, excessive padding can cause severe slowdown too - if you pad all your samples to the maximum length in the whole dataset, you might waste a lot of compute and memory!
There isn't a perfect solution to this problem, but there are some tricks you can try. One very useful trick is to pad batches of samples up to a multiple of a number like 32 or 64 tokens. This often only increases the number of tokens by a small amount, but it hugely reduces the number of unique input shapes, because every input shape now has to be a multiple of 32 or 64. Fewer unique input shapes means fewer XLA recompilations!
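As a rough sketch of doing this by hand (assuming a 2D batch of token IDs; `pad_to_multiple` and `pad_value` are hypothetical names, not a library API):

```python
import tensorflow as tf

def pad_to_multiple(batch, multiple=64, pad_value=0):
    # Round the sequence length up to the next multiple of `multiple`
    seq_len = tf.shape(batch)[1]
    target_len = ((seq_len + multiple - 1) // multiple) * multiple
    # Pad only on the right of the sequence dimension
    return tf.pad(batch, [[0, 0], [0, target_len - seq_len]], constant_values=pad_value)

batch = tf.constant([[101, 2023, 2003, 102]])  # shape (1, 4)
print(pad_to_multiple(batch).shape)  # (1, 64)
```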
🤗 Specific HuggingFace Tip 🤗: Our tokenizers and data collators have methods that can help you here. You can use `padding="max_length"` or `padding="longest"` when calling tokenizers to get them to output padded data, and our tokenizers and data collators also have a `pad_to_multiple_of` argument that you can use to reduce the number of unique input shapes you see!
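For example, a tokenizer call along these lines (the texts and checkpoint are placeholders) pads each batch to a multiple of 64 tokens:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
batch = tokenizer(
    ["A short text", "A somewhat longer text in the same batch"],
    padding="longest",       # pad to the longest sample in the batch...
    pad_to_multiple_of=64,   # ...then round that length up to a multiple of 64
    return_tensors="np",
)
print(batch["input_ids"].shape)  # (2, 64)
```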
How do I actually train my model on TPU?
Once your training is XLA-compatible and (if you're using TPU Node / Colab) your dataset has been prepared appropriately, running on TPU is surprisingly easy! All you really need to change in your code is to add a few lines to initialize your TPU, and to ensure that your model and dataset are created inside a `TPUStrategy` scope. Take a look at our TPU example notebook to see this in action!
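The initialization boilerplate is standard TensorFlow; a minimal sketch (assuming you're on Colab or a TPU VM where the resolver can locate the TPU automatically, and using a placeholder checkpoint) looks like this:

```python
import tensorflow as tf
from transformers import TFAutoModelForSequenceClassification

# Connect to the TPU and initialize it
resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

# Model (and optimizer) creation must happen inside the strategy scope
with strategy.scope():
    model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased")
    model.compile(optimizer="adam")

# model.fit(...) can then be called as usual, outside the scope
```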
Summary
There was a lot in here, so let's summarize with a quick checklist you can follow when you want to get your model ready for TPU training:
- Make sure your code follows the three rules of XLA
- Compile your model with `jit_compile=True` on CPU/GPU and confirm that you can train it with XLA
- Either load your dataset into memory or use a TPU-compatible dataset loading approach (see the notebook)
- Migrate your code either to Colab (with the accelerator set to "TPU") or a TPU VM on Google Cloud
- Add TPU initializer code (see the notebook)
- Create your `TPUStrategy` and make sure dataset loading and model creation are inside `strategy.scope()` (see the notebook)
- Don't forget to take `jit_compile=True` out again when you move to TPU!
- 🙏🙏🙏🥺🥺🥺
- Call `model.fit()`
- You did it!