Speech2Text2
-----------------------------------------------------------------------------------------------------------------------

Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The Speech2Text2 model is used together with :doc:`Wav2Vec2 ` for Speech Translation models proposed in `Large-Scale
Self- and Semi-Supervised Learning for Speech Translation `__ by Changhan Wang, Anne Wu, Juan Pino, Alexei Baevski,
Michael Auli, Alexis Conneau.

Speech2Text2 is a *decoder-only* transformer model that can be used with any speech *encoder-only* model, such as
:doc:`Wav2Vec2 ` or :doc:`HuBERT `, for Speech-to-Text tasks. Please refer to the :doc:`SpeechEncoderDecoder ` class
for how to combine Speech2Text2 with any speech *encoder-only* model.

This model was contributed by `Patrick von Platen `__.

The original code can be found `here `__.

Tips:

- Speech2Text2 achieves state-of-the-art results on the CoVoST Speech Translation dataset. For more information, see
  the `official models `__.
- Speech2Text2 is always used within the :doc:`SpeechEncoderDecoder ` framework.
- Speech2Text2's tokenizer currently supports only inference, not training.

Inference
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Speech2Text2's :class:`~transformers.SpeechEncoderDecoderModel` accepts raw waveform input values from speech and
makes use of :func:`~transformers.generation_utils.GenerationMixin.generate` to translate the input speech
autoregressively to the target language.

The :class:`~transformers.Wav2Vec2FeatureExtractor` class is responsible for preprocessing the input speech and
:class:`~transformers.Speech2Text2Tokenizer` decodes the generated target tokens to the target string.
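
To make "autoregressively" concrete, the following is a minimal, library-free sketch of greedy autoregressive
decoding. It is not the implementation behind ``generate``: the vocabulary, the token ids, and the scoring function
``toy_next_token_scores`` are hypothetical stand-ins for a real decoder, which would also attend to the encoder's
speech representations.

```python
from typing import Callable, List

# Hypothetical toy vocabulary and special token ids (assumptions for illustration only).
VOCAB = ["<s>", "</s>", "guten", "morgen"]
BOS_ID, EOS_ID = 0, 1


def toy_next_token_scores(prefix: List[int]) -> List[float]:
    """Return one score per vocabulary entry, given the tokens generated so far.

    A real decoder would compute these scores from the prefix and the encoded speech.
    Here a fixed rule is used: <s> -> guten -> morgen -> </s>.
    """
    next_id = {0: 2, 2: 3, 3: 1}.get(prefix[-1], EOS_ID)
    return [1.0 if i == next_id else 0.0 for i in range(len(VOCAB))]


def greedy_decode(score_fn: Callable[[List[int]], List[float]], max_new_tokens: int = 10) -> List[int]:
    """Generate one token at a time, feeding each prediction back in as input."""
    tokens = [BOS_ID]
    for _ in range(max_new_tokens):
        scores = score_fn(tokens)
        # Greedy choice: take the highest-scoring token at this step.
        next_id = max(range(len(scores)), key=scores.__getitem__)
        tokens.append(next_id)
        if next_id == EOS_ID:
            break
    return tokens


ids = greedy_decode(toy_next_token_scores)
# Decoding to a string drops the special tokens, loosely mirroring what a tokenizer's decode step does.
text = " ".join(VOCAB[i] for i in ids if i not in (BOS_ID, EOS_ID))
```

Each step conditions on everything generated so far, which is why generation proceeds token by token and stops once
the end-of-sequence token is produced or the token budget is exhausted.
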
The :class:`~transformers.Speech2Text2Processor` wraps :class:`~transformers.Wav2Vec2FeatureExtractor` and
:class:`~transformers.Speech2Text2Tokenizer` into a single instance to both extract the input features and decode the
predicted token ids.

- Step-by-step Speech Translation

.. code-block::

    >>> import torch
    >>> from transformers import Speech2Text2Processor, SpeechEncoderDecoderModel
    >>> from datasets import load_dataset
    >>> import soundfile as sf

    >>> model = SpeechEncoderDecoderModel.from_pretrained("facebook/s2t-wav2vec2-large-en-de")
    >>> processor = Speech2Text2Processor.from_pretrained("facebook/s2t-wav2vec2-large-en-de")

    >>> # read each audio file into a raw waveform array
    >>> def map_to_array(batch):
    ...     speech, _ = sf.read(batch["file"])
    ...     batch["speech"] = speech
    ...     return batch

    >>> ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
    >>> ds = ds.map(map_to_array)

    >>> inputs = processor(ds["speech"][0], sampling_rate=16_000, return_tensors="pt")
    >>> # the raw waveform values are passed as the first positional argument of generate
    >>> generated_ids = model.generate(inputs["input_values"], attention_mask=inputs["attention_mask"])

    >>> translation = processor.batch_decode(generated_ids)

- Speech Translation via Pipelines

  The automatic speech recognition pipeline can also be used to translate speech in just a couple of lines of code:

.. code-block::

    >>> from datasets import load_dataset
    >>> from transformers import pipeline

    >>> librispeech_en = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
    >>> asr = pipeline("automatic-speech-recognition", model="facebook/s2t-wav2vec2-large-en-de", feature_extractor="facebook/s2t-wav2vec2-large-en-de")

    >>> translation_de = asr(librispeech_en[0]["file"])

See `model hub `__ to look for Speech2Text2 checkpoints.

Speech2Text2Config
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.Speech2Text2Config
    :members:

Speech2Text2Tokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.Speech2Text2Tokenizer
    :members: batch_decode, decode, save_vocabulary

Speech2Text2Processor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.Speech2Text2Processor
    :members: __call__, from_pretrained, save_pretrained, batch_decode, decode, as_target_processor

Speech2Text2ForCausalLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.Speech2Text2ForCausalLM
    :members: forward