---
library_name: transformers
license: apache-2.0
pipeline_tag: automatic-speech-recognition
tags:
- audio
---
# Cascaded English Speech2Text Translation

This is a pipeline for speech-to-text translation from English speech into text in any target language, based on a cascaded approach that consists of ASR followed by text translation.
The pipeline employs [distil-whisper/distil-large-v3](https://huggingface.co/distil-whisper/distil-large-v3) for ASR (English speech -> English text)
and [facebook/nllb-200-3.3B](https://huggingface.co/facebook/nllb-200-3.3B) for text translation.
The input must be English speech, while the translation can be in any language NLLB was trained on. All available languages and their language codes are listed
[here](https://github.com/facebookresearch/flores/blob/main/flores200/README.md#languages-in-flores-200).
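For illustration, the cascaded idea can be reproduced by chaining two stock `transformers` pipelines. The following is a minimal sketch of that concept, not this repository's own pipeline code (which wraps both steps behind a single `pipeline` call); the file name `sample_en.wav` is a placeholder.

```python3
from transformers import pipeline

# step 1: ASR (English speech -> English text)
asr = pipeline("automatic-speech-recognition", model="distil-whisper/distil-large-v3")

# step 2: text translation (English text -> target language text) with an NLLB model
translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="eng_Latn",
    tgt_lang="jpn_Jpan",  # any FLORES-200 language code supported by NLLB
)

english_text = asr("sample_en.wav")["text"]
japanese_text = translator(english_text)[0]["translation_text"]
print(japanese_text)
```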
**The model for Japanese speech translation is available at [ja-cascaded-s2t-translation](https://huggingface.co/japanese-asr/ja-cascaded-s2t-translation).**
## Benchmark

The following table shows the CER computed between the reference and predicted translations for the task of translating English speech into Japanese text
(subsets of [CoVoST2 and Fleurs](https://huggingface.co/datasets/japanese-asr/en2ja.s2t_translation)), comparing different sizes of NLLB with the OpenAI Whisper models.

| model | [CoVoST2 (En->Ja)](https://huggingface.co/datasets/japanese-asr/en2ja.s2t_translation) | [Fleurs (En->Ja)](https://huggingface.co/datasets/japanese-asr/en2ja.s2t_translation) |
|:------|------:|------:|
| [japanese-asr/en-cascaded-s2t-translation](https://huggingface.co/japanese-asr/en-cascaded-s2t-translation) ([facebook/nllb-200-3.3B](https://huggingface.co/facebook/nllb-200-3.3B)) | 62.4 | 63.5 |
| [japanese-asr/en-cascaded-s2t-translation](https://huggingface.co/japanese-asr/en-cascaded-s2t-translation) ([facebook/nllb-200-1.3B](https://huggingface.co/facebook/nllb-200-1.3B)) | 64.4 | 67.2 |
| [japanese-asr/en-cascaded-s2t-translation](https://huggingface.co/japanese-asr/en-cascaded-s2t-translation) ([facebook/nllb-200-distilled-1.3B](https://huggingface.co/facebook/nllb-200-distilled-1.3B)) | 62.4 | 62.9 |
| [japanese-asr/en-cascaded-s2t-translation](https://huggingface.co/japanese-asr/en-cascaded-s2t-translation) ([facebook/nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M)) | 63.4 | 66.2 |
| [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) | 178.9 | 209.5 |
| [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2) | 179.6 | 201.8 |
| [openai/whisper-large](https://huggingface.co/openai/whisper-large) | 178.7 | 201.8 |
| [openai/whisper-medium](https://huggingface.co/openai/whisper-medium) | 178.7 | 202.0 |
| [openai/whisper-small](https://huggingface.co/openai/whisper-small) | 178.9 | 206.8 |
| [openai/whisper-base](https://huggingface.co/openai/whisper-base) | 179.5 | 214.2 |
| [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny) | 185.2 | 200.5 |

See [https://github.com/kotoba-tech/kotoba-whisper](https://github.com/kotoba-tech/kotoba-whisper) for the evaluation details.
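As a rough illustration of the metric (not the exact evaluation script, which lives in the linked repository and includes its own normalization and segmentation), CER between a reference and a predicted translation can be computed with the `evaluate` library; the strings below are hypothetical placeholders.

```python3
import evaluate

# character error rate between reference and predicted Japanese translations
cer_metric = evaluate.load("cer")
score = cer_metric.compute(
    predictions=["これは予測された翻訳です。"],  # hypothetical predicted translation
    references=["これは参照翻訳です。"],  # hypothetical reference translation
)
print(score)  # CER as a fraction; the table above reports it as a percentage
```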
### Inference Speed

Due to the nature of the cascaded approach, the pipeline is more complex than a single end-to-end OpenAI Whisper model, trading some speed for higher accuracy.
The following table shows the mean inference time in seconds, averaged over 10 trials, on audio samples of different durations.

| model | 10 s | 30 s | 60 s | 300 s |
|:------|------:|------:|------:|------:|
| [japanese-asr/en-cascaded-s2t-translation](https://huggingface.co/japanese-asr/en-cascaded-s2t-translation) ([facebook/nllb-200-3.3B](https://huggingface.co/facebook/nllb-200-3.3B)) | 0.173 | 0.247 | 0.352 | 1.772 |
| [japanese-asr/en-cascaded-s2t-translation](https://huggingface.co/japanese-asr/en-cascaded-s2t-translation) ([facebook/nllb-200-1.3B](https://huggingface.co/facebook/nllb-200-1.3B)) | 0.173 | 0.240 | 0.348 | 1.515 |
| [japanese-asr/en-cascaded-s2t-translation](https://huggingface.co/japanese-asr/en-cascaded-s2t-translation) ([facebook/nllb-200-distilled-1.3B](https://huggingface.co/facebook/nllb-200-distilled-1.3B)) | 0.170 | 0.245 | 0.348 | 1.882 |
| [japanese-asr/en-cascaded-s2t-translation](https://huggingface.co/japanese-asr/en-cascaded-s2t-translation) ([facebook/nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M)) | 0.108 | 0.179 | 0.283 | 1.330 |
| [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) | 0.061 | 0.184 | 0.372 | 1.804 |
| [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2) | 0.062 | 0.199 | 0.415 | 1.854 |
| [openai/whisper-large](https://huggingface.co/openai/whisper-large) | 0.062 | 0.183 | 0.363 | 1.899 |
| [openai/whisper-medium](https://huggingface.co/openai/whisper-medium) | 0.045 | 0.132 | 0.266 | 1.368 |
| [openai/whisper-small](https://huggingface.co/openai/whisper-small) | 0.135 | 0.376 | 0.631 | 3.495 |
| [openai/whisper-base](https://huggingface.co/openai/whisper-base) | 0.054 | 0.108 | 0.231 | 1.019 |
| [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny) | 0.045 | 0.124 | 0.208 | 0.838 |
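The numbers above come from the linked evaluation repository. As a rough sketch of how such timings could be reproduced, the snippet below averages wall-clock time over repeated runs using the same pipeline call shown in the Usage section; the audio path is a placeholder, and warm-up, device placement, and batching will change the absolute values.

```python3
import time
from transformers import pipeline

pipe = pipeline(
    model="japanese-asr/en-cascaded-s2t-translation",
    model_translation="facebook/nllb-200-distilled-600M",
    tgt_lang="jpn_Jpan",
    chunk_length_s=15,
    trust_remote_code=True,
)

# average wall-clock time over repeated runs on the same audio file
n_trials = 10
start = time.perf_counter()
for _ in range(n_trials):
    pipe("./sample_en.wav")  # placeholder audio file
elapsed = (time.perf_counter() - start) / n_trials
print(f"mean inference time: {elapsed:.3f} s")
```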
## Usage

Here is an example of translating English speech into Japanese text.
First, download a sample speech file.
```bash
wget https://huggingface.co/datasets/japanese-asr/en_asr.esb_eval/resolve/main/sample.wav -O sample_en.wav
```
Then, run the pipeline as below.
```python3
from transformers import pipeline

# load the cascaded speech-to-text translation pipeline (ASR + NLLB translation)
pipe = pipeline(
    model="japanese-asr/en-cascaded-s2t-translation",
    model_translation="facebook/nllb-200-distilled-600M",
    tgt_lang="jpn_Jpan",
    model_kwargs={"attn_implementation": "sdpa"},
    chunk_length_s=15,
    trust_remote_code=True,
)

# translate the downloaded sample
output = pipe("./sample_en.wav")
print(output)
```

Other NLLB models can be used by setting `model_translation` to one of the following (see the sketch after this list):
- [facebook/nllb-200-3.3B](https://huggingface.co/facebook/nllb-200-3.3B)
- [facebook/nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M)
- [facebook/nllb-200-distilled-1.3B](https://huggingface.co/facebook/nllb-200-distilled-1.3B)
- [facebook/nllb-200-1.3B](https://huggingface.co/facebook/nllb-200-1.3B)
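For example, a hypothetical configuration pairing the largest NLLB model with French output could look like the sketch below; `fra_Latn` is the FLORES-200 code for French, and any other code from the list linked above should work the same way.

```python3
from transformers import pipeline

# same pipeline, with a larger translation model and a different target language
pipe = pipeline(
    model="japanese-asr/en-cascaded-s2t-translation",
    model_translation="facebook/nllb-200-3.3B",
    tgt_lang="fra_Latn",  # FLORES-200 code for French
    chunk_length_s=15,
    trust_remote_code=True,
)
output = pipe("./sample_en.wav")
```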