patrickvonplaten committed 9883158 (parent: fc8b3e0): Update README.md

README.md CHANGED
@@ -12,4 +12,113 @@ pipeline_tag: automatic-speech-recognition
license: apache-2.0
---

# Wav2Vec2-XLS-R-2B-22-16

Facebook's Wav2Vec2 XLS-R fine-tuned for **Speech Translation**.

![model image](https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/xls_r.png)

This is a [SpeechEncoderDecoderModel](https://huggingface.co/transformers/model_doc/speechencoderdecoder.html) model.
The encoder was warm-started from the [**`facebook/wav2vec2-xls-r-2b`**](https://huggingface.co/facebook/wav2vec2-xls-r-2b) checkpoint and
the decoder from the [**`facebook/mbart-large-50`**](https://huggingface.co/facebook/mbart-large-50) checkpoint.
The encoder-decoder model was then fine-tuned on `{input_lang}` -> `{output_lang}` translation pairs
from the [Covost2 dataset](https://huggingface.co/datasets/covost2).

The model can translate from the following spoken languages `{input_lang}` to the following written languages `{output_lang}`:

`{input_lang}` -> `{output_lang}`

with `{input_lang}` one of:

{`en`, `fr`, `de`, `es`, `ca`, `it`, `ru`, `zh-CN`, `pt`, `fa`, `et`, `mn`, `nl`, `tr`, `ar`, `sv-SE`, `lv`, `sl`, `ta`, `ja`, `id`, `cy`}

and `{output_lang}` one of:

{`en`, `de`, `tr`, `fa`, `sv-SE`, `mn`, `zh-CN`, `cy`, `ca`, `sl`, `et`, `id`, `ar`, `ta`, `lv`, `ja`}
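For quick sanity checks in your own scripts, the two lists above can be expressed as Python sets. This is only a sketch: `is_listed_lang_pair` is a hypothetical helper, not part of the model's API, and being in both lists does not by itself guarantee that a given combination was a trained translation pair.

```python
# Input and output language sets, copied from the lists above.
INPUT_LANGS = {"en", "fr", "de", "es", "ca", "it", "ru", "zh-CN", "pt", "fa", "et",
               "mn", "nl", "tr", "ar", "sv-SE", "lv", "sl", "ta", "ja", "id", "cy"}
OUTPUT_LANGS = {"en", "de", "tr", "fa", "sv-SE", "mn", "zh-CN", "cy", "ca", "sl",
                "et", "id", "ar", "ta", "lv", "ja"}

def is_listed_lang_pair(input_lang: str, output_lang: str) -> bool:
    """Check that both languages appear in the lists from the model card."""
    return input_lang in INPUT_LANGS and output_lang in OUTPUT_LANGS
```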
## Usage

## Usage

### Demo

The model can be tested on [this space](https://huggingface.co/spaces/facebook/XLS-R-2B-22-16).
You can select the target language, record some audio in any of the above-mentioned input languages,
and then sit back and see how well the checkpoint can translate the input.

### Example

As this is a standard sequence-to-sequence transformer model, you can use the `generate` method to generate the
translations by passing the speech features to the model.

You can use the model directly via the ASR pipeline. By default, the checkpoint will
translate spoken English to written German. To change the written target language,
you need to pass the correct `forced_bos_token_id` to `generate(...)` to condition
the decoder on the correct target language.

To select the correct `forced_bos_token_id` for your chosen target language, please make use
of the following mapping:

```python
MAPPING = {
    "en": 250004,
    "de": 250003,
    "tr": 250023,
    "fa": 250029,
    "sv": 250042,
    "mn": 250037,
    "zh": 250025,
    "cy": 250007,
    "ca": 250005,
    "sl": 250052,
    "et": 250006,
    "id": 250032,
    "ar": 250001,
    "ta": 250044,
    "lv": 250017,
    "ja": 250012,
}
```
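A small convenience wrapper around this mapping can fail loudly for unsupported target languages instead of raising a bare `KeyError`. This is a sketch: `forced_bos_for` is a hypothetical helper, not part of `transformers`.

```python
def forced_bos_for(lang: str, mapping: dict) -> int:
    """Return the `forced_bos_token_id` for `lang`, or raise with the valid choices."""
    try:
        return mapping[lang]
    except KeyError:
        raise ValueError(
            f"unsupported target language {lang!r}; choose one of {sorted(mapping)}"
        ) from None

# illustrative excerpt of the MAPPING above
assert forced_bos_for("sv", {"sv": 250042, "ja": 250012}) == 250042
```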

As an example, if you would like to translate to Swedish, you can do the following:

```python
from datasets import load_dataset
from transformers import pipeline

# select the correct `forced_bos_token_id` from the MAPPING above
forced_bos_token_id = MAPPING["sv"]

# replace the following lines to load an audio file of your choice
librispeech_en = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
audio_file = librispeech_en[0]["file"]

asr = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-xls-r-2b-22-to-16", feature_extractor="facebook/wav2vec2-xls-r-2b-22-to-16")

translation = asr(audio_file, forced_bos_token_id=forced_bos_token_id)
```

or step-by-step as follows:

```python
import torch
from datasets import load_dataset
from transformers import SpeechEncoderDecoderModel, Speech2Text2Processor

model = SpeechEncoderDecoderModel.from_pretrained("facebook/wav2vec2-xls-r-2b-22-to-16")
processor = Speech2Text2Processor.from_pretrained("facebook/wav2vec2-xls-r-2b-22-to-16")

ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")

# select the correct `forced_bos_token_id` from the MAPPING above
forced_bos_token_id = MAPPING["sv"]

inputs = processor(ds[0]["audio"]["array"], sampling_rate=ds[0]["audio"]["sampling_rate"], return_tensors="pt")
with torch.no_grad():
    generated_ids = model.generate(inputs["input_values"], attention_mask=inputs["attention_mask"], forced_bos_token_id=forced_bos_token_id)
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)
```
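Note that wav2vec2-style encoders such as XLS-R consume raw 16 kHz waveforms, so audio recorded at other sampling rates should be resampled before it is passed to the processor. A minimal guard, with illustrative names that are not part of the model card's API:

```python
TARGET_SAMPLING_RATE = 16_000  # XLS-R was pretrained on 16 kHz audio

def check_sampling_rate(sampling_rate: int) -> None:
    """Raise early instead of silently feeding wrongly sampled audio to the model."""
    if sampling_rate != TARGET_SAMPLING_RATE:
        raise ValueError(
            f"expected {TARGET_SAMPLING_RATE} Hz audio, got {sampling_rate} Hz; "
            "resample first (e.g. with torchaudio.functional.resample or librosa.resample)"
        )
```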

## More XLS-R models for `en` -> `{lang}` Speech Translation

- [Wav2Vec2-XLS-R-300M-EN-15](https://huggingface.co/facebook/wav2vec2-xls-r-300m-en-to-15)
- [Wav2Vec2-XLS-R-1B-EN-15](https://huggingface.co/facebook/wav2vec2-xls-r-1b-en-to-15)
- [Wav2Vec2-XLS-R-2B-EN-15](https://huggingface.co/facebook/wav2vec2-xls-r-2b-en-to-15)
- [Wav2Vec2-XLS-R-2B-22-16](https://huggingface.co/facebook/wav2vec2-xls-r-2b-22-to-16)