patrickvonplaten committed on
Commit 2e79ab5
1 Parent(s): 1856b7e

Update README.md

Files changed (1): README.md +108 -1
pipeline_tag: automatic-speech-recognition
license: apache-2.0
---

# Wav2Vec2-XLS-R-1B-EN-15

Facebook's Wav2Vec2 XLS-R fine-tuned for **Speech Translation**.

![model image](https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/xls_r.png)

This is a [SpeechEncoderDecoderModel](https://huggingface.co/transformers/model_doc/speechencoderdecoder.html) model.
The encoder was warm-started from the [**`facebook/wav2vec2-xls-r-1b`**](https://huggingface.co/facebook/wav2vec2-xls-r-1b) checkpoint and
the decoder from the [**`facebook/mbart-large-50`**](https://huggingface.co/facebook/mbart-large-50) checkpoint.
The encoder-decoder model was then fine-tuned on 15 `en` -> `{lang}` translation pairs of the [Covost2 dataset](https://huggingface.co/datasets/covost2).
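
The warm-starting step described above can be sketched with Transformers' `SpeechEncoderDecoderModel.from_encoder_decoder_pretrained`. This is only an illustration of how such a model is composed, not the exact training script; the released checkpoint was additionally fine-tuned on Covost2 after this initialization:

```python
from transformers import SpeechEncoderDecoderModel

# Compose a pretrained speech encoder with a pretrained text decoder.
# The composed model still needs fine-tuning (here: on Covost2) before
# it is useful as a speech translator.
model = SpeechEncoderDecoderModel.from_encoder_decoder_pretrained(
    "facebook/wav2vec2-xls-r-1b", "facebook/mbart-large-50"
)
```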

The model can translate from spoken `en` (English) to the following written languages `{lang}`:

`en` -> {`de`, `tr`, `fa`, `sv-SE`, `mn`, `zh-CN`, `cy`, `ca`, `sl`, `et`, `id`, `ar`, `ta`, `lv`, `ja`}

For more information, please refer to Section *5.1.1* of the [official XLS-R paper](https://arxiv.org/abs/2111.09296).

## Usage

### Demo

The model can be tested on [this space](https://huggingface.co/spaces/facebook/XLS-R-1B-EN-15).
You can select the target language, record some audio in English,
and then sit back and see how well the checkpoint translates the input.

### Example

As this is a standard sequence-to-sequence transformer model, you can use the `generate` method to generate
transcripts by passing the speech features to the model.

You can use the model directly via the ASR pipeline. By default, the checkpoint
translates spoken English to written German. To change the written target language,
you need to pass the correct `forced_bos_token_id` to `generate(...)` to condition
the decoder on the correct target language.

To select the correct `forced_bos_token_id` for your chosen language id, please make use
of the following mapping:

```python
MAPPING = {
    "de": 250003,
    "tr": 250023,
    "fa": 250029,
    "sv": 250042,
    "mn": 250037,
    "zh": 250025,
    "cy": 250007,
    "ca": 250005,
    "sl": 250052,
    "et": 250006,
    "id": 250032,
    "ar": 250001,
    "ta": 250044,
    "lv": 250017,
    "ja": 250012,
}
```
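
If it helps, the lookup can be wrapped in a small helper that also accepts the region-qualified codes listed earlier (such as `sv-SE` or `zh-CN`). This is just a sketch; `get_forced_bos_token_id` is a hypothetical name, not part of the model's API:

```python
def get_forced_bos_token_id(lang: str, mapping: dict) -> int:
    # Hypothetical helper: normalize region-qualified codes such as
    # "sv-SE" or "zh-CN" to the bare codes used as keys in the mapping,
    # and fail loudly for unsupported target languages.
    base = lang.split("-")[0]
    if base not in mapping:
        raise ValueError(f"unsupported target language: {lang!r}")
    return mapping[base]
```

For example, `get_forced_bos_token_id("sv-SE", MAPPING)` returns `250042`, the same value as the `sv` entry above.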

As an example, if you would like to translate to Swedish, you can do the following:

```python
from datasets import load_dataset
from transformers import pipeline

# select the correct `forced_bos_token_id` from the mapping above
forced_bos_token_id = MAPPING["sv"]

# replace the following lines to load an audio file of your choice
librispeech_en = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
audio_file = librispeech_en[0]["file"]

asr = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-xls-r-1b-en-to-15", feature_extractor="facebook/wav2vec2-xls-r-1b-en-to-15")

translation = asr(audio_file, forced_bos_token_id=forced_bos_token_id)
```

or step-by-step as follows:

```python
import torch
from datasets import load_dataset
from transformers import SpeechEncoderDecoderModel, Speech2Text2Processor

model = SpeechEncoderDecoderModel.from_pretrained("facebook/wav2vec2-xls-r-1b-en-to-15")
processor = Speech2Text2Processor.from_pretrained("facebook/wav2vec2-xls-r-1b-en-to-15")

ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")

# select the correct `forced_bos_token_id` from the mapping above
forced_bos_token_id = MAPPING["sv"]

inputs = processor(ds[0]["audio"]["array"], sampling_rate=ds[0]["audio"]["sampling_rate"], return_tensors="pt")
generated_ids = model.generate(inputs["input_values"], attention_mask=inputs["attention_mask"], forced_bos_token_id=forced_bos_token_id)
transcription = processor.batch_decode(generated_ids)
```

## Results `en` -> `{lang}`

See the row of **XLS-R (1B)** for the performance of this model on [Covost2](https://huggingface.co/datasets/covost2).

![results image](https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/English-%3EX.png)

## More XLS-R models for `en` -> `{lang}` Speech Translation

- [Wav2Vec2-XLS-R-300M-EN-15](https://huggingface.co/facebook/wav2vec2-xls-r-300m-en-to-15)
- [Wav2Vec2-XLS-R-1B-EN-15](https://huggingface.co/facebook/wav2vec2-xls-r-1b-en-to-15)
- [Wav2Vec2-XLS-R-2B-EN-15](https://huggingface.co/facebook/wav2vec2-xls-r-2b-en-to-15)
- [Wav2Vec2-XLS-R-2B-22-16](https://huggingface.co/facebook/wav2vec2-xls-r-2b-22-to-16)