valhalla committed on
Commit
6a381ad
1 Parent(s): 844c1e6

update readme

  1. README.md +145 -0
README.md CHANGED
@@ -0,0 +1,145 @@
+ ---
+ language:
+ - en
+ - de
+ - nl
+ - es
+ - fr
+ - it
+ - pt
+ - ro
+ - ru
+ datasets:
+ - mustc
+ tags:
+ - audio
+ - speech-translation
+ - automatic-speech-recognition
+ license: MIT
+ ---
+
+
+ # S2T-MEDIUM-MUSTC-MULTILINGUAL-ST
+
+ `s2t-medium-mustc-multilingual-st` is a Speech to Text Transformer (S2T) model trained for end-to-end Multilingual Speech Translation (ST).
+ The S2T model was proposed in [this paper](https://arxiv.org/abs/2010.05171) and released in
+ [this repository](https://github.com/pytorch/fairseq/tree/master/examples/speech_to_text).
+
+
+ ## Model description
+
+ S2T is a transformer-based seq2seq (encoder-decoder) model designed for end-to-end Automatic Speech Recognition (ASR) and Speech
+ Translation (ST). It uses a convolutional downsampler to reduce the length of speech inputs by three quarters before they are
+ fed into the encoder. The model is trained with standard autoregressive cross-entropy loss and generates the
+ transcripts/translations autoregressively.
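
To make the downsampling factor concrete, the sketch below computes how many frames reach the encoder. It assumes the downsampler is a stack of two 1D convolutions with stride 2, kernel size 5, and padding 2. That configuration is not stated in this README, so treat `num_layers`, `kernel_size`, and `stride` as illustrative defaults rather than the model's confirmed settings.

```python
def subsampled_length(num_frames: int, num_layers: int = 2,
                      kernel_size: int = 5, stride: int = 2) -> int:
    """Return the sequence length after a stack of strided 1D convolutions.

    The layer count, kernel size, and stride are assumptions made for
    illustration; they are not taken from this README.
    """
    padding = kernel_size // 2
    length = num_frames
    for _ in range(num_layers):
        # Standard output-length formula for a padded, strided convolution.
        length = (length + 2 * padding - kernel_size) // stride + 1
    return length


# 1000 input frames shrink to one quarter of the original length.
print(subsampled_length(1000))  # 250
```

With two stride-2 layers, each layer halves the sequence, which matches the three-quarter reduction described above.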
+
+ ## Intended uses & limitations
+
+ This model can be used for end-to-end translation of English speech into German, Dutch, Spanish, French, Italian, Portuguese, Romanian, or Russian text.
+ See the [model hub](https://huggingface.co/models?filter=speech_to_text_transformer) to look for other S2T checkpoints.
+
+
+ ### How to use
+
+ As this is a standard sequence-to-sequence transformer model, you can use the `generate` method to generate the
+ translations by passing the speech features to the model.
+
+ For multilingual speech translation models, `eos_token_id` is used as the `decoder_start_token_id` and
+ the target language id is forced as the first generated token. To force the target language id as the first
+ generated token, pass the `forced_bos_token_id` parameter to the `generate()` method. The following
+ example shows how to translate English speech to French and German text using the `facebook/s2t-medium-mustc-multilingual-st`
+ checkpoint.
+
+ *Note: The `Speech2TextProcessor` object uses [torchaudio](https://github.com/pytorch/audio) to extract the
+ filter bank features. Make sure to install the `torchaudio` package before running this example.*
+
+ You can either install those as extra speech dependencies with
+ `pip install "transformers[speech,sentencepiece]"` or install the packages separately
+ with `pip install torchaudio sentencepiece`.
+
+
+ ```python
+ from transformers import Speech2TextProcessor, Speech2TextForConditionalGeneration
+ from datasets import load_dataset
+ import soundfile as sf
+
+ model = Speech2TextForConditionalGeneration.from_pretrained("facebook/s2t-medium-mustc-multilingual-st")
+ processor = Speech2TextProcessor.from_pretrained("facebook/s2t-medium-mustc-multilingual-st")
+
+ def map_to_array(batch):
+     # Read the raw waveform from the audio file referenced by each example.
+     speech, _ = sf.read(batch["file"])
+     batch["speech"] = speech
+     return batch
+
+ ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
+ ds = ds.map(map_to_array)
+
+ # Extract filter bank features from the first utterance (16 kHz audio).
+ inputs = processor(ds["speech"][0], sampling_rate=16_000, return_tensors="pt")
+
+ # Translate English speech to French text.
+ generated_ids = model.generate(
+     inputs["input_features"],
+     attention_mask=inputs["attention_mask"],
+     forced_bos_token_id=processor.tokenizer.lang_code_to_id["fr"],
+ )
+ translation_fr = processor.batch_decode(generated_ids, skip_special_tokens=True)
+
+ # Translate English speech to German text.
+ generated_ids = model.generate(
+     inputs["input_features"],
+     attention_mask=inputs["attention_mask"],
+     forced_bos_token_id=processor.tokenizer.lang_code_to_id["de"],
+ )
+ translation_de = processor.batch_decode(generated_ids, skip_special_tokens=True)
+ ```
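
The role of `decoder_start_token_id` and `forced_bos_token_id` can be illustrated with a toy greedy decoder. This is only a sketch of the decoding order described in this section, not the `generate()` implementation: the token ids and the scripted `next_token_fn` are hypothetical stand-ins for the real tokenizer and model.

```python
from typing import Callable, List


def greedy_decode(next_token_fn: Callable[[List[int]], int],
                  decoder_start_token_id: int,
                  forced_bos_token_id: int,
                  eos_token_id: int,
                  max_length: int = 20) -> List[int]:
    """Toy greedy loop: the first generated token is forced, the rest are predicted."""
    tokens = [decoder_start_token_id]
    while len(tokens) < max_length:
        if len(tokens) == 1:
            # First generation step: force the target language id.
            next_token = forced_bos_token_id
        else:
            next_token = next_token_fn(tokens)
        tokens.append(next_token)
        if next_token == eos_token_id:
            break
    return tokens


# Hypothetical ids, chosen only for this illustration.
EOS_ID = 2
LANG_FR_ID = 100

# A scripted "model" that emits two tokens and then EOS.
scripted = iter([11, 12, EOS_ID])
output = greedy_decode(lambda prefix: next(scripted), EOS_ID, LANG_FR_ID, EOS_ID)
print(output)  # [2, 100, 11, 12, 2]
```

The sequence starts with the EOS id (used as the decoder start token), followed by the forced language id, and only then by predicted tokens.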
+
+
+ ## Training data
+
+ `s2t-medium-mustc-multilingual-st` is trained on [MuST-C](https://ict.fbk.eu/must-c/).
+ MuST-C is a multilingual speech translation corpus whose size and quality facilitate the training of end-to-end systems
+ for speech translation from English into several languages. For each target language, MuST-C comprises several hundred
+ hours of audio recordings from English TED Talks, which are automatically aligned at the sentence level with their manual
+ transcriptions and translations.
+
+
+ ## Training procedure
+
+ ### Preprocessing
+
+ The speech data is pre-processed by extracting Kaldi-compliant 80-channel log mel-filter bank features automatically from
+ WAV/FLAC audio files via PyKaldi or torchaudio. Utterance-level CMVN (cepstral mean and variance normalization)
+ is then applied to each example.
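
As a rough illustration of utterance-level CMVN, the sketch below normalizes each filter bank channel to zero mean and unit variance across the frames of a single utterance. It is a dependency-free approximation written for clarity, not the preprocessing code used for training; the function name and the `eps` floor are assumptions.

```python
from statistics import fmean, pstdev
from typing import List


def utterance_cmvn(features: List[List[float]], eps: float = 1e-10) -> List[List[float]]:
    """Normalize each feature channel over the frames of one utterance.

    `features` is a list of frames, each a list of filter bank values.
    """
    num_bins = len(features[0])
    # Gather each channel's values across time to compute its statistics.
    columns = [[frame[i] for frame in features] for i in range(num_bins)]
    means = [fmean(column) for column in columns]
    # Floor the standard deviation to avoid dividing by zero on constant channels.
    stds = [max(pstdev(column), eps) for column in columns]
    return [
        [(frame[i] - means[i]) / stds[i] for i in range(num_bins)]
        for frame in features
    ]


# Two frames with two channels each.
normalized = utterance_cmvn([[1.0, 10.0], [3.0, 30.0]])
```

Because the statistics are computed per utterance, no global normalization file is needed at inference time.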
+
+ The texts are lowercased and tokenized using SentencePiece with a vocabulary size of 10,000.
+
+
+ ### Training
+
+ The model is trained with standard autoregressive cross-entropy loss and [SpecAugment](https://arxiv.org/abs/1904.08779).
+ The encoder receives speech features, and the decoder generates the transcripts autoregressively. To accelerate
+ model training and improve performance, the encoder is pre-trained for multilingual ASR. For multilingual models, the target language ID token
+ is used as the target BOS.
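
For readers unfamiliar with SpecAugment, here is a simplified sketch of its frequency and time masking on a feature matrix. It omits time warping, masks with zeros (reasonable for mean-normalized features), and uses hypothetical mask counts and widths; none of these settings are taken from the actual training configuration.

```python
import random
from typing import List, Optional


def spec_augment(features: List[List[float]],
                 num_freq_masks: int = 1, max_freq_width: int = 27,
                 num_time_masks: int = 1, max_time_width: int = 100,
                 rng: Optional[random.Random] = None) -> List[List[float]]:
    """Return a copy of `features` (frames x channels) with random bands zeroed."""
    rng = rng or random.Random()
    num_frames, num_bins = len(features), len(features[0])
    augmented = [list(frame) for frame in features]

    # Frequency masking: zero a random band of consecutive channels in every frame.
    for _ in range(num_freq_masks):
        width = rng.randint(0, min(max_freq_width, num_bins))
        start = rng.randint(0, num_bins - width)
        for frame in augmented:
            for i in range(start, start + width):
                frame[i] = 0.0

    # Time masking: zero a random run of consecutive frames.
    for _ in range(num_time_masks):
        width = rng.randint(0, min(max_time_width, num_frames))
        start = rng.randint(0, num_frames - width)
        for t in range(start, start + width):
            augmented[t] = [0.0] * num_bins

    return augmented


# Example: 20 frames of 8 channels, with small illustrative mask widths.
example = [[1.0] * 8 for _ in range(20)]
masked = spec_augment(example, max_freq_width=3, max_time_width=5, rng=random.Random(0))
```

Masking is applied only during training; evaluation uses the unmodified features.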
+
+ ## Evaluation results
+
+ MuST-C test results (BLEU score):
+
+ | En-De | En-Nl | En-Es | En-Fr | En-It | En-Pt | En-Ro | En-Ru |
+ |:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|
+ | 24.5 | 28.6 | 28.2 | 34.9 | 24.6 | 31.1 | 23.8 | 16.0 |
+
+
+ ### BibTeX entry and citation info
+
+ ```bibtex
+ @inproceedings{wang2020fairseqs2t,
+   title = {fairseq S2T: Fast Speech-to-Text Modeling with fairseq},
+   author = {Changhan Wang and Yun Tang and Xutai Ma and Anne Wu and Dmytro Okhonko and Juan Pino},
+   booktitle = {Proceedings of the 2020 Conference of the Asian Chapter of the Association for Computational Linguistics (AACL): System Demonstrations},
+   year = {2020},
+ }
+ ```