The Transformer model family
Since its introduction in 2017, the original Transformer model has inspired many new and exciting models that extend beyond natural language processing (NLP) tasks. There are models for predicting the folded structure of proteins, for training a cheetah to run, and for time series forecasting. With so many Transformer variants available, it is easy to miss the bigger picture. What all of these models have in common is that they are based on the original Transformer architecture. Some models use only the encoder or only the decoder, while others use both. This gives a useful taxonomy for categorizing and examining the high-level differences between models in the Transformer family, and it helps you understand Transformers you have not encountered before.
If you are not familiar with the original Transformer model or need a refresher, check out the chapter on how Transformers work in the Hugging Face course.
Computer vision
Convolutional network
For a long time, convolutional networks (CNNs) were the dominant paradigm for computer vision tasks, until the Vision Transformer demonstrated its scalability and efficiency. Even then, some of a CNN's best qualities, such as translation invariance, are so powerful (especially for certain tasks) that some Transformers incorporate convolutions into their architecture. ConvNeXt turned this exchange around and adopted design choices from Transformers to modernize a CNN. For example, ConvNeXt uses non-overlapping sliding windows to split an image into patches and a larger kernel to increase its global receptive field. ConvNeXt also makes several layer design choices that improve memory efficiency and performance, so it competes favorably with Transformers.
Encoder
The Vision Transformer (ViT) opened the door to computer vision tasks without convolutions. ViT uses a standard Transformer encoder, but its main breakthrough was how it treats an image. It splits an image into fixed-size patches and uses them to create embeddings, much as a sentence is split into tokens. ViT capitalizes on the Transformer's efficient architecture to achieve results competitive with the CNNs of the time while requiring fewer resources to train. ViT was soon followed by other vision models that could also handle dense vision tasks such as segmentation and detection.
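The patch-splitting step can be illustrated with a minimal sketch. It assumes a single-channel image stored as a list of rows and a hypothetical helper name `patchify`; the learned linear projection and the position embeddings that a real ViT applies to each patch are left out.

```python
def patchify(image, patch_size):
    """Split a 2D image (list of rows) into flattened, non-overlapping patches.

    Patches are returned in row-major order, one flat list per patch.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[top + dy][left + dx]
                for dy in range(patch_size)
                for dx in range(patch_size)
            ]
            patches.append(patch)
    return patches


# A 4x4 single-channel "image" whose pixel values are 0..15.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
patches = patchify(image, patch_size=2)
```

Here `patches` holds four flat patches of four pixels each, and the first one is `[0, 1, 4, 5]`. Each flat patch plays the role that a token plays in a text model.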
One of these models is the Swin Transformer. It builds hierarchical feature maps (like a CNN and unlike ViT) from smaller patches and merges them with neighboring patches in deeper layers. Attention is computed only within a local window, and the window is shifted between attention layers to create connections across windows that help the model learn better. Because the Swin Transformer can produce hierarchical feature maps, it is a good candidate for dense prediction tasks such as segmentation and detection. SegFormer also uses a Transformer encoder to build hierarchical feature maps, but it adds a simple multilayer perceptron (MLP) decoder on top to combine all the feature maps and make a prediction.
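To see why shifting the window creates connections between neighboring windows, the sketch below labels every cell of a small patch grid with the window it belongs to, with and without a cyclic shift. The function name `window_ids` and the grid sizes are illustrative assumptions, and the attention masking that a real shifted-window layer applies to patches that wrap around the border is omitted.

```python
def window_ids(grid_size, window_size, shift=0):
    """Return a grid where each cell holds the id of its attention window.

    With shift > 0 the grid is cyclically rolled before it is partitioned.
    """
    per_side = grid_size // window_size
    ids = []
    for row in range(grid_size):
        ids_row = []
        for col in range(grid_size):
            r = (row + shift) % grid_size
            c = (col + shift) % grid_size
            ids_row.append((r // window_size) * per_side + c // window_size)
        ids.append(ids_row)
    return ids


regular = window_ids(grid_size=4, window_size=2)
shifted = window_ids(grid_size=4, window_size=2, shift=1)
```

In `regular`, the patches at columns 1 and 2 of the first row sit in different windows and cannot attend to each other. In `shifted`, they share a window, so alternating the two layouts lets information flow across window borders.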
Other vision models, such as BeIT and ViTMAE, drew inspiration from BERT's pretraining objective. BeIT is pretrained by masked image modeling (MIM): the image patches are randomly masked, and the image is also tokenized into visual tokens. BeIT is trained to predict the visual tokens that correspond to the masked patches. ViTMAE has a similar pretraining objective, except that it must predict pixels instead of visual tokens. What is unusual is that 75% of the image patches are masked. The decoder reconstructs the pixels from the masked tokens and the encoded patches. After pretraining, the decoder is discarded and the encoder is ready to be used in downstream tasks.
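A rough sketch of the 75% masking step follows. The helper name `random_patch_mask`, the fixed seed, and the 16-patch example are assumptions made for illustration; only the visible indices would be passed to the encoder.

```python
import random


def random_patch_mask(num_patches, mask_ratio=0.75, seed=0):
    """Randomly choose which patch indices are masked and which stay visible."""
    rng = random.Random(seed)
    num_masked = int(num_patches * mask_ratio)
    order = list(range(num_patches))
    rng.shuffle(order)
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked


visible, masked = random_patch_mask(num_patches=16)
```

With 16 patches, 12 are masked and only 4 are left for the encoder to process, which is a large part of why this style of pretraining is cheap per image.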
Decoder
Decoder-only vision models are rare because most vision models rely on an encoder to learn an image representation. For use cases such as image generation, however, the decoder is a natural fit, as text generation models like GPT-2 have shown. ImageGPT uses the same architecture as GPT-2, but instead of predicting the next token in a sequence, it predicts the next pixel in an image. In addition to image generation, ImageGPT can also be fine-tuned for image classification.
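Next-pixel prediction can be sketched by flattening an image in raster order and pairing every prefix with the pixel that follows it. The helper name `next_pixel_pairs` and the tiny 2x2 image are illustrative assumptions; preprocessing details such as resolution reduction or color quantization are not modeled.

```python
def next_pixel_pairs(image):
    """Flatten an image row by row and pair each prefix with the next pixel."""
    pixels = [value for row in image for value in row]
    return [(pixels[:i], pixels[i]) for i in range(1, len(pixels))]


pairs = next_pixel_pairs([[3, 7], [1, 9]])
```

The result is `[([3], 7), ([3, 7], 1), ([3, 7, 1], 9)]`: each training example asks the model to predict one pixel from everything that came before it.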
Encoder-decoder
Vision models commonly use an encoder (also known as a backbone) to extract important image features before passing them to a Transformer decoder. DETR has a pretrained backbone, but it also uses the complete Transformer encoder-decoder architecture for object detection. The encoder learns image representations and combines them in the decoder with object queries (each object query is a learned embedding that focuses on a region or object in the image). DETR predicts the bounding box coordinates and the class label for each object query.
Natural language processing
Encoder
BERT is an encoder-only Transformer that randomly masks certain tokens in the input so that it cannot see them, which would otherwise let it "cheat". The pretraining objective is to predict the masked tokens from their context. This lets BERT make full use of the left and right context to learn a deeper and richer representation of the input. However, there was still room for improvement in BERT's pretraining strategy. RoBERTa improved on it with a new pretraining recipe that trains for longer and on larger batches, randomly masks tokens at every epoch instead of only once during preprocessing, and removes the next-sentence prediction objective.
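The difference between masking once during preprocessing and masking afresh at each epoch can be sketched as follows. The helper `mask_tokens`, the 0.3 masking probability, and the seeds are illustrative assumptions; the actual BERT recipe has further details (for example, selected tokens are not always replaced by the mask token) that are omitted here.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, rng, mask_prob=0.15):
    """Replace each token with [MASK] with probability mask_prob."""
    inputs, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(token)  # the model must recover this token
        else:
            inputs.append(token)
            labels.append(None)  # no loss on unmasked positions
    return inputs, labels


tokens = "the quick brown fox jumps over the lazy dog".split()

# Static masking: mask once during preprocessing and reuse it every epoch.
static_inputs, _ = mask_tokens(tokens, random.Random(0), mask_prob=0.3)
static_epochs = [static_inputs for _ in range(3)]

# Dynamic masking: draw a fresh mask each epoch from one running generator.
rng = random.Random(0)
dynamic_epochs = [mask_tokens(tokens, rng, mask_prob=0.3)[0] for _ in range(3)]
```

With static masking the model sees the same corrupted sentence in every epoch, whereas dynamic masking keeps drawing new masks from the running generator, so the model is asked to predict different tokens over time.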
The dominant strategy for improving performance is to increase model size, but training large models is computationally expensive. One way to reduce the computational cost is to use a smaller model such as DistilBERT. DistilBERT uses knowledge distillation, a compression technique, to create a smaller version of BERT that keeps nearly all of its language understanding capabilities.
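As a minimal sketch of the idea behind knowledge distillation, the code below measures how far a student's temperature-softened predictions are from a teacher's. The function names, the temperature of 2.0, and the logits are assumptions for illustration; this shows only a soft-target term, not the full objective used to train DistilBERT.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) between temperature-softened distributions."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return sum(t * math.log(t / s) for t, s in zip(teacher, student))


teacher_logits = [4.0, 1.0, 0.5]
matched = distillation_loss([4.0, 1.0, 0.5], teacher_logits)
mismatched = distillation_loss([0.5, 1.0, 4.0], teacher_logits)
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge, so minimizing it pushes the smaller model toward the larger model's behavior.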
However, most Transformer models continued to trend toward more parameters, which led to new models focused on improving training efficiency. ALBERT reduces memory consumption by lowering the number of parameters in two ways: it splits the large vocabulary embedding into two smaller matrices, and it lets layers share parameters. DeBERTa added a disentangled attention mechanism in which a word and its position are encoded separately in two vectors. Attention is computed from these separate vectors rather than from a single vector containing both the word and position embeddings. Longformer also focused on making attention more efficient, especially for processing documents with long sequence lengths. It combines local windowed attention (attention computed only within a fixed-size window around each token) with global attention (used only for specific task tokens, such as [CLS] for classification) to create a sparse attention matrix instead of a full one.
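The combination of local windowed attention and global attention can be sketched as a boolean mask. The function name `sparse_attention_mask`, the window size, and the choice of position 0 as the global token are illustrative assumptions, not Longformer's actual implementation.

```python
def sparse_attention_mask(seq_len, window, global_positions=()):
    """Return a seq_len x seq_len mask; True means query i may attend to key j."""
    half = window // 2
    is_global = set(global_positions)
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            local = abs(i - j) <= half
            # Global tokens attend everywhere and are attended to by everyone.
            row.append(local or i in is_global or j in is_global)
        mask.append(row)
    return mask


mask = sparse_attention_mask(seq_len=8, window=2, global_positions=[0])
allowed = sum(sum(row) for row in mask)
```

For this 8-token example, 34 of the 64 query-key pairs are allowed. The local band grows linearly with sequence length, while a full attention matrix grows quadratically, which is where the savings on long documents come from.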
Decoder
GPT-2 is a decoder-only Transformer that predicts the next word in a sequence. It masks the tokens to the right so the model cannot "cheat" by looking ahead. By pretraining on a massive body of text, GPT-2 became very good at generating text, even if the text is only sometimes accurate or true. However, GPT-2 lacked the bidirectional context of BERT's pretraining, which made it unsuitable for certain tasks. XLNET combines the best of the BERT and GPT-2 pretraining objectives by using a permutation language modeling (PLM) objective that lets it learn bidirectionally.
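Masking tokens to the right is usually expressed as a lower-triangular (causal) attention mask. The following is a small sketch with hypothetical helper names; it applies the mask to one row of made-up attention scores.

```python
import math


def causal_mask(seq_len):
    """Lower-triangular mask: position i may attend to positions 0..i only."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def masked_softmax(scores, allowed):
    """Softmax over the allowed positions; disallowed positions get weight 0."""
    peak = max(s for s, ok in zip(scores, allowed) if ok)
    exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]


mask = causal_mask(4)
# Attention weights for the token at position 1, given equal raw scores.
weights = masked_softmax([0.0, 0.0, 0.0, 0.0], mask[1])
```

The token at position 1 spreads its attention over positions 0 and 1 only, giving `[0.5, 0.5, 0.0, 0.0]`; positions 2 and 3 lie in the future and receive no weight.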
After GPT-2, language models grew even larger and are now known as large language models (LLMs). LLMs can demonstrate few-shot or even zero-shot learning when pretrained on a large enough dataset. GPT-J is an LLM with 6B parameters trained on 400B tokens. GPT-J was followed by OPT, a family of decoder-only models, the largest of which has 175B parameters and was trained on 180B tokens. BLOOM was released around the same time; the largest model in that family has 176B parameters and was trained on 366B tokens in 46 languages and 13 programming languages.
Encoder-decoder
BART keeps the original Transformer architecture but modifies the pretraining objective with text infilling corruption, in which some text spans are replaced with a single mask token. The decoder predicts the uncorrupted tokens (future tokens are masked) and uses the encoder's hidden states to help it. Pegasus is similar to BART, but it masks entire sentences instead of text spans. In addition to masked language modeling, Pegasus is pretrained by gap sentence generation (GSG). The GSG objective masks whole sentences that are important to a document and replaces them with a mask token. The decoder must generate the output from the remaining sentences. T5 is a more unusual model that casts every NLP task as a text-to-text problem using specific prefixes. For example, the prefix Summarize: indicates a summarization task. T5 is pretrained with supervised training (GLUE and SuperGLUE) and self-supervised training (randomly sampling and dropping out 15% of tokens).
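Text infilling can be sketched as replacing a whole span with one mask token, which means the model also has to work out how many tokens are missing. The helper name `infill_span` and the `<mask>` string are illustrative assumptions, and the span position and length are passed in explicitly rather than sampled.

```python
def infill_span(tokens, start, length, mask_token="<mask>"):
    """Replace tokens[start:start + length] with a single mask token."""
    corrupted = tokens[:start] + [mask_token] + tokens[start + length:]
    return corrupted, tokens  # (encoder input, decoder target)


tokens = "the cat sat on the mat".split()
corrupted, target = infill_span(tokens, start=1, length=3)
```

The encoder input becomes `["the", "<mask>", "the", "mat"]`, while the decoder target is the full original sentence.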
Audio
Encoder
Wav2Vec2 uses a Transformer encoder to learn speech representations directly from raw audio waveforms. It is pretrained with a contrastive task that requires identifying the true speech representation among a set of false ones. HuBERT is similar to Wav2Vec2 but has a different training process. Target labels are created by a clustering step in which similar audio segments are assigned to a cluster, and each cluster becomes a hidden unit. The hidden unit is mapped to an embedding to make a prediction.
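The contrastive task can be sketched as a classification over candidates: the loss is low when the context vector is closest to the true target and high when a distractor is closer. The function names, the two-dimensional vectors, and the temperature of 0.1 are illustrative assumptions rather than the model's actual configuration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def contrastive_loss(context, true_target, distractors, temperature=0.1):
    """Negative log softmax probability of the true target among all candidates."""
    candidates = [true_target] + list(distractors)
    logits = [cosine(context, c) / temperature for c in candidates]
    peak = max(logits)
    log_total = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_total - logits[0]


context = [1.0, 0.0]
good = contrastive_loss(context, [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]])
bad = contrastive_loss(context, [0.0, 1.0], [[0.9, 0.1], [-1.0, 0.0]])
```

Here `good` is close to zero because the true target nearly points in the same direction as the context, and `bad` is much larger because a distractor matches the context better than the true target does.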
Encoder-decoder
Speech2Text is a speech model designed for automatic speech recognition (ASR) and speech translation. The model accepts log mel-filter bank features extracted from the audio waveform and is pretrained to generate a transcript or translation autoregressively. Whisper is also an ASR model, but unlike many other speech models, it is pretrained on a massive amount of labeled audio transcription data to achieve zero-shot performance. A large part of the dataset contains non-English languages, so Whisper can also be used for low-resource languages. Structurally, Whisper is similar to Speech2Text. The audio signal is converted to a log-mel spectrogram, which is encoded by the encoder. The decoder generates the transcript autoregressively from the encoder's hidden states and the previous tokens.
Multimodal
Encoder
VisualBERT is a multimodal model for vision-language tasks that was released shortly after BERT. It combines BERT with a pretrained object detection system that extracts image features into visual embeddings, which are passed to BERT together with the text embeddings. VisualBERT predicts the masked text from the unmasked text and the visual embeddings, and it must also predict whether the text is aligned with the image. When ViT was released, ViLT adopted it in its architecture because it made obtaining image embeddings easier. The image embeddings are processed jointly with the text embeddings. From there, ViLT is pretrained by image-text matching, masked language modeling, and whole word masking.
CLIP takes a different approach and makes a pair prediction of (image, text). An image encoder (ViT) and a text encoder (Transformer) are jointly trained on a dataset of (image, text) pairs to maximize the similarity between the image and text embeddings of each (image, text) pair. After pretraining, you can use CLIP to predict the text given an image, or vice versa. OWL-ViT builds on CLIP by using it as the backbone for zero-shot object detection. After pretraining, an object detection head is added to make a set prediction over the (class, bounding box) pairs.
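Once images and texts live in the same embedding space, matching them comes down to comparing similarities. The sketch below uses made-up two-dimensional embeddings and hypothetical helper names; real encoders produce much larger vectors, and the training loss itself is not shown.

```python
import math


def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def similarity_matrix(image_embeddings, text_embeddings):
    """Cosine similarity between every image and every text embedding."""
    images = [normalize(v) for v in image_embeddings]
    texts = [normalize(v) for v in text_embeddings]
    return [[sum(a * b for a, b in zip(img, txt)) for txt in texts] for img in images]


def best_match(similarities):
    """For each image (row), return the index of the most similar text."""
    return [max(range(len(row)), key=row.__getitem__) for row in similarities]


# Hypothetical embeddings for two images and two candidate captions.
image_embeddings = [[2.0, 0.1], [0.2, 3.0]]
text_embeddings = [[0.1, 1.0], [1.0, 0.0]]
sims = similarity_matrix(image_embeddings, text_embeddings)
matches = best_match(sims)
```

The first image is matched to the second caption and the second image to the first, so `matches` is `[1, 0]`. Pretraining aims to make the similarity of true pairs high relative to all the mismatched pairs in such a matrix.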
Encoder-decoder
Optical character recognition (OCR) is a text recognition task that typically involves several components to understand the image and generate the text. TrOCR simplifies the process with an end-to-end Transformer. The encoder is a ViT-style model that processes the image as fixed-size patches. The decoder accepts the encoder's hidden states and generates text autoregressively. Donut is a more general visual document understanding model that does not rely on OCR-based approaches; it uses a Swin Transformer as the encoder and multilingual BART as the decoder. Donut is pretrained to read text by predicting the next word based on the image and text annotations. The decoder generates a token sequence given a prompt. The prompt is represented by a special token for each downstream task. For example, document parsing has a special parsing token that is combined with the encoder's hidden states to parse the document into a structured output format (JSON).
Reinforcement learning
Decoder
The Decision Transformer and the Trajectory Transformer cast states, actions, and rewards as a sequence modeling problem. The Decision Transformer generates a series of actions that lead to a desired future return, based on returns-to-go, past states, and past actions. For the last K timesteps, each of the three modalities is converted into token embeddings and processed by a GPT-like model to predict a future action token. The Trajectory Transformer also tokenizes the states, actions, and rewards and processes them with a GPT architecture. Unlike the Decision Transformer, which focuses on reward conditioning, the Trajectory Transformer generates future actions with beam search.
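Returns-to-go can be sketched as suffix sums of the reward sequence, interleaved with states and actions to form the model's input. The helper names are hypothetical, the rewards are made up, and the returns are assumed to be undiscounted.

```python
def returns_to_go(rewards):
    """Suffix sums: entry t is the total reward collected from timestep t onward."""
    totals, running = [], 0.0
    for reward in reversed(rewards):
        running += reward
        totals.append(running)
    return totals[::-1]


def interleave(returns, states, actions):
    """Build the (return-to-go, state, action) token order for each timestep."""
    sequence = []
    for rtg, state, action in zip(returns, states, actions):
        sequence.extend([("rtg", rtg), ("state", state), ("action", action)])
    return sequence


rewards = [1.0, 0.0, 2.0]
rtg = returns_to_go(rewards)
sequence = interleave(rtg, ["s0", "s1", "s2"], ["a0", "a1", "a2"])
```

For these rewards, `rtg` is `[3.0, 2.0, 2.0]`. Conditioning on the return still to be collected is what lets the model be asked for actions that lead to a chosen target return.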