[Figure: architecture of SeamlessM4T (v1) compared with SeamlessM4T v2. Component labels recovered from the diagram, in the order they appear:]

SeamlessM4T (v1):
- Inputs: x_text (s); x_speech (s) via a Mel-filterbanks extractor (bins = 80)
- Transformer text encoder (SeamlessM4T-NLLB)
- Conformer speech encoder (w2v-BERT 2.0), followed by a length adaptor
- Transformer text decoder (SeamlessM4T-NLLB), producing y_text (t) and the continuous decoder output
- Subword-length T2U encoder
- AR unit decoder, producing u_dedup
- HiFi-GAN unit vocoder, producing the speech output (t)

SeamlessM4T v2:
- Inputs: x_text (s); x_speech (s) via a Mel-filterbanks extractor (bins = 80)
- Transformer text encoder (SeamlessM4T-NLLB)
- Conformer speech encoder (w2v-BERT 2.0), followed by a length adaptor
- Transformer text decoder (SeamlessM4T-NLLB), producing y_text (t) and the continuous decoder output
- Subword-length T2U encoder
- Subword-to-character upsampler
- Unit duration predictor
- Character-to-unit upsampler
- Aligner over u_dup and y_text-char (training supervision)
- NAR unit decoder, producing u_dup
- HiFi-GAN unit vocoder, producing the speech output (t)