Conformer Encoder/Decoder for Speech Translation

This model was trained with SpeechBrain and is based on the Fisher Callhome recipe. Its performance is as follows:

Release	CoVoSTv2 JA->EN Test BLEU	Custom Dataset Validation BLEU	Custom Dataset Test BLEU	GPUs
01-13-21	9.73	8.38	12.01	1xRTX 3090

This model was trained on subtitled audio downloaded from YouTube, and was not fine-tuned on the CoVoSTv2 training set. To calculate the BLEU score for CoVoSTv2, the utterances were first passed through the same pipeline that preprocessed the model's original training data. That pipeline removes all punctuation except apostrophes and removes capitalization, similar to the preprocessing of the Fisher Callhome dataset in the SpeechBrain recipe.
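The text normalization described above can be sketched roughly as follows. This is not the actual preprocessing code used for the model; the function name normalize_text and the exact rules (for example, replacing punctuation with a space and collapsing whitespace) are assumptions made for illustration.

```python
import re


def normalize_text(text: str) -> str:
    """Lowercase text and strip punctuation except apostrophes (illustrative only)."""
    text = text.lower()
    # Replace anything that is not a word character, whitespace, or apostrophe.
    text = re.sub(r"[^\w\s']", " ", text)
    # \w also matches underscores, so treat them as punctuation too (assumption).
    text = text.replace("_", " ")
    # Collapse the runs of whitespace left behind by the removed punctuation.
    return " ".join(text.split())
```

Applying the same normalization to both the reference translations and the model outputs keeps the BLEU comparison consistent.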

Pipeline description

The system is trained with recordings sampled at 16 kHz (single channel). When you call transcribe_file, the code automatically normalizes your audio if needed (i.e., resampling and mono channel selection).
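Because transcribe_file normalizes audio on its own, checking a file beforehand is optional. If you want to know whether a file already matches the training format, a minimal sketch using Python's standard wave module is shown below. The helper name wav_matches_training_format is hypothetical, and the check only works for uncompressed PCM WAV files.

```python
import wave


def wav_matches_training_format(path, expected_rate=16000, expected_channels=1):
    """Return True if the WAV file is already 16 kHz mono (hypothetical helper)."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getframerate() == expected_rate
            and wav.getnchannels() == expected_channels
        )
```

A file that fails this check can still be passed to transcribe_file, which resamples and selects a single channel as described above.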

Install SpeechBrain

First of all, install SpeechBrain with the following command:

pip install speechbrain

Transcribing your own audio files (spoken Japanese to written English)

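from speechbrain.pretrained import EncoderDecoderASR

# Download the model from the Hugging Face Hub and load it
st_model = EncoderDecoderASR.from_hparams(source="bob80333/speechbrain_ja2en_st_63M_yt600h")
# Translate the spoken Japanese audio and print the English text
print(st_model.transcribe_file("your_file_here.wav"))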

Inference on GPU

To perform inference on the GPU, add run_opts={"device":"cuda"} when calling the from_hparams method.
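As a sketch, the GPU option can be wrapped in a small helper. The name load_st_model is hypothetical. The import follows the speechbrain.pretrained path used elsewhere in this document; newer SpeechBrain releases may expose the same class under a different module.

```python
def load_st_model(device="cuda"):
    """Load the translation model on the given device (hypothetical helper)."""
    # Imported lazily so that defining this helper does not require SpeechBrain.
    from speechbrain.pretrained import EncoderDecoderASR

    return EncoderDecoderASR.from_hparams(
        source="bob80333/speechbrain_ja2en_st_63M_yt600h",
        run_opts={"device": device},  # e.g. "cuda" for GPU, "cpu" for CPU
    )
```

Calling load_st_model("cuda") returns a model on which transcribe_file can be used as in the example above; pass "cpu" on machines without a CUDA-capable GPU.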

Limitations:

The model is likely to get caught in repetitions. It is not very good at translation, as its low BLEU scores reflect. Its outputs are unlikely to be correct; do not rely on them for any serious purpose. The model was trained on data from YouTube, and has inherited whatever biases can be found in YouTube audio and subtitles. The creator of this model doesn't actually know Japanese.
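One rough way to flag outputs that may be caught in a repetition loop is to count how often the most frequent word n-gram occurs. The sketch below is a heuristic and not part of the model or of SpeechBrain; the function name max_ngram_count and any threshold you apply to its result are assumptions.

```python
from collections import Counter


def max_ngram_count(text, n=3):
    """Return how many times the most frequent word n-gram occurs in text."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0  # text is shorter than n words
    return max(Counter(ngrams).values())
```

A count well above 1 on a short utterance suggests the translation has looped and should be treated with extra suspicion.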