patrickvonplaten committed 9883158 (parent fc8b3e0): Update README.md
pipeline_tag: automatic-speech-recognition
license: apache-2.0
---

# Wav2Vec2-XLS-R-2B-22-16

Facebook's Wav2Vec2 XLS-R fine-tuned for **Speech Translation**.

![model image](https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/xls_r.png)

This is a [SpeechEncoderDecoderModel](https://huggingface.co/transformers/model_doc/speechencoderdecoder.html) model.
The encoder was warm-started from the [**`facebook/wav2vec2-xls-r-2b`**](https://huggingface.co/facebook/wav2vec2-xls-r-2b) checkpoint and
the decoder from the [**`facebook/mbart-large-50`**](https://huggingface.co/facebook/mbart-large-50) checkpoint.
The encoder-decoder model was then fine-tuned on `{input_lang}` -> `{output_lang}` translation pairs
of the [Covost2 dataset](https://huggingface.co/datasets/covost2).

The model can translate from the following spoken languages `{input_lang}` to the following written languages `{output_lang}`:

`{input_lang}` -> `{output_lang}`

with `{input_lang}` one of:

{`en`, `fr`, `de`, `es`, `ca`, `it`, `ru`, `zh-CN`, `pt`, `fa`, `et`, `mn`, `nl`, `tr`, `ar`, `sv-SE`, `lv`, `sl`, `ta`, `ja`, `id`, `cy`}

and `{output_lang}` one of:

{`en`, `de`, `tr`, `fa`, `sv-SE`, `mn`, `zh-CN`, `cy`, `ca`, `sl`, `et`, `id`, `ar`, `ta`, `lv`, `ja`}

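For programmatic checks, the supported directions can be represented as plain Python sets; the `is_supported` helper below is a hypothetical convenience for illustration, not part of the model's API:

```python
# Supported language sets, copied from the lists above.
INPUT_LANGS = {"en", "fr", "de", "es", "ca", "it", "ru", "zh-CN", "pt", "fa", "et",
               "mn", "nl", "tr", "ar", "sv-SE", "lv", "sl", "ta", "ja", "id", "cy"}
OUTPUT_LANGS = {"en", "de", "tr", "fa", "sv-SE", "mn", "zh-CN", "cy", "ca", "sl",
                "et", "id", "ar", "ta", "lv", "ja"}

def is_supported(input_lang: str, output_lang: str) -> bool:
    """Hypothetical helper: check whether a spoken->written direction is covered."""
    return input_lang in INPUT_LANGS and output_lang in OUTPUT_LANGS
```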
## Usage

### Demo

The model can be tested on [this space](https://huggingface.co/spaces/facebook/XLS-R-2B-22-16).
You can select the target language, record some audio in any of the above-mentioned input languages,
and then sit back and see how well the checkpoint can translate the input.

### Example

As this is a standard sequence-to-sequence transformer model, you can use the `generate` method to generate
transcripts by passing the speech features to the model.

You can use the model directly via the ASR pipeline. By default, the checkpoint will
translate spoken English to written German. To change the written target language,
you need to pass the correct `forced_bos_token_id` to `generate(...)` to condition
the decoder on the correct target language.

To select the correct `forced_bos_token_id` for your chosen language id, please make use
of the following mapping:

```python
MAPPING = {
    "en": 250004,
    "de": 250003,
    "tr": 250023,
    "fa": 250029,
    "sv": 250042,
    "mn": 250037,
    "zh": 250025,
    "cy": 250007,
    "ca": 250005,
    "sl": 250052,
    "et": 250006,
    "id": 250032,
    "ar": 250001,
    "ta": 250044,
    "lv": 250017,
    "ja": 250012,
}
```
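A guarded lookup can fail loudly on typos instead of raising a bare `KeyError`. This is a minimal sketch; the helper name is hypothetical, and the mapping is passed in as an argument so the snippet stays self-contained:

```python
def get_forced_bos_token_id(mapping: dict, lang: str) -> int:
    """Hypothetical guarded lookup for the target-language token id."""
    try:
        return mapping[lang]
    except KeyError as err:
        raise ValueError(
            f"unsupported target language {lang!r}; choose one of {sorted(mapping)}"
        ) from err

# With the MAPPING above:
# forced_bos_token_id = get_forced_bos_token_id(MAPPING, "sv")
```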

As an example, if you would like to translate to Swedish, you can do the following:

```python
from datasets import load_dataset
from transformers import pipeline

# select the correct `forced_bos_token_id` (see the mapping above)
forced_bos_token_id = MAPPING["sv"]

# replace the following lines to load an audio file of your choice
librispeech_en = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
audio_file = librispeech_en[0]["file"]

asr = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-xls-r-2b-22-to-16", feature_extractor="facebook/wav2vec2-xls-r-2b-22-to-16")

translation = asr(audio_file, forced_bos_token_id=forced_bos_token_id)
```
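XLS-R checkpoints expect 16 kHz mono audio. If your recording uses a different sampling rate, you can resample it before calling the pipeline; a minimal sketch using `scipy` (the rates and the `to_16k` helper here are illustrative, not part of the model card's API):

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

def to_16k(audio: np.ndarray, orig_sr: int) -> np.ndarray:
    """Resample a mono waveform to the 16 kHz rate the model expects."""
    if orig_sr == 16_000:
        return audio
    g = gcd(16_000, orig_sr)
    return resample_poly(audio, 16_000 // g, orig_sr // g)

# illustrative: one second of 44.1 kHz audio becomes 16,000 samples
waveform = np.zeros(44_100, dtype=np.float32)
resampled = to_16k(waveform, 44_100)
```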

or step-by-step as follows:

```python
import torch
from datasets import load_dataset
from transformers import Speech2Text2Processor, SpeechEncoderDecoderModel

model = SpeechEncoderDecoderModel.from_pretrained("facebook/wav2vec2-xls-r-2b-22-to-16")
processor = Speech2Text2Processor.from_pretrained("facebook/wav2vec2-xls-r-2b-22-to-16")

ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")

# select the correct `forced_bos_token_id` (see the mapping above)
forced_bos_token_id = MAPPING["sv"]

inputs = processor(ds[0]["audio"]["array"], sampling_rate=ds[0]["audio"]["sampling_rate"], return_tensors="pt")
generated_ids = model.generate(inputs["input_values"], attention_mask=inputs["attention_mask"], forced_bos_token_id=forced_bos_token_id)
transcription = processor.batch_decode(generated_ids)
```

## More XLS-R models for `{lang}` -> `en` Speech Translation

- [Wav2Vec2-XLS-R-300M-EN-15](https://huggingface.co/facebook/wav2vec2-xls-r-300m-en-to-15)
- [Wav2Vec2-XLS-R-1B-EN-15](https://huggingface.co/facebook/wav2vec2-xls-r-1b-en-to-15)
- [Wav2Vec2-XLS-R-2B-EN-15](https://huggingface.co/facebook/wav2vec2-xls-r-2b-en-to-15)
- [Wav2Vec2-XLS-R-2B-22-16](https://huggingface.co/facebook/wav2vec2-xls-r-2b-22-to-16)