bob80333's picture
Update README.md
2bcc79f
---
language: "en"
thumbnail:
tags:
- speech-translation
- CTC
- Attention
- Transformer
- pytorch
- speechbrain
- automatic-speech-recognition
metrics:
- BLEU
---
# Conformer Encoder/Decoder for Speech Translation
This model was trained with [SpeechBrain](https://speechbrain.github.io), and is based on the Fisher Callhome recipie.
The performance of the model is the following:
| Release | CoVoSTv2 JA->EN Test BLEU | Custom Dataset Validation BLEU | Custom Dataset Test BLEU | GPUs |
|:-------------:|:--------------:|:--------------:|:--------------:|:--------:|
| 01-13-21 | 9.73 | 8.38 | 12.01 | 1xRTX 3090 |
This model was trained on subtitled audio downloaded from YouTube, and was not fine-tuned on the CoVoSTv2 training set.
When calculating the BLEU score for CoVoSTv2, the utterances were first preprocessed by the same pipeline that preprocessed the original data for the model, which includes removing all punctuation outside of apostrophes, and removing capitalization, similar to the data preprocessing done for the Fisher Callhome dataset in the speechbrain recipe.
## Pipeline description
The system is trained with recordings sampled at 16kHz (single channel).
The code will automatically normalize your audio (i.e., resampling + mono channel selection) when calling *transcribe_file* if needed.
## Install SpeechBrain
First of all, install SpeechBrain with the following command:
```
pip install speechbrain
```
### Transcribing your own audio files (Spoken Japanese, to written English)
```python
from speechbrain.pretrained import EncoderDecoderASR
st_model = EncoderDecoderASR.from_hparams(source="bob80333/speechbrain_ja2en_st_63M_yt600h")
st_model.transcribe_file("your_file_here.wav")
```
### Inference on GPU
To perform inference on the GPU, add `run_opts={"device":"cuda"}` when calling the `from_hparams` method.
### Limitations:
The model is likely to get caught in repetitions. The model is not very good at translation, which is reflected by its low BLEU scores.
The outputs of this model are unlikely to be correct, do not rely on it for any serious purpose.
This model was trained on data from Youtube, and has inherited whatever biases can be found in Youtube audio/subtitles.
The creator of this model doesn't actually know Japanese.