README.md ADDED
---
language: en
datasets:
- librispeech_asr
tags:
- audio
- automatic-speech-recognition
license: mit
---


# S2T-SMALL-LIBRISPEECH-ASR

`s2t-small-librispeech-asr` is a Speech to Text Transformer (S2T) model trained for automatic speech recognition (ASR).
The S2T model was proposed in [this paper](https://arxiv.org/abs/2010.05171) and released in
[this repository](https://github.com/pytorch/fairseq/tree/master/examples/speech_to_text).


## Model description

S2T is an end-to-end sequence-to-sequence transformer model. It is trained with standard
autoregressive cross-entropy loss and generates the transcripts autoregressively.

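As a quick illustration of this objective, the sketch below runs a single teacher-forced forward pass and reads off the loss. The random features and the dummy transcript are placeholders, not real data.

```python
import torch
from transformers import Speech2TextForConditionalGeneration, Speech2TextProcessor

model = Speech2TextForConditionalGeneration.from_pretrained("facebook/s2t-small-librispeech-asr")
processor = Speech2TextProcessor.from_pretrained("facebook/s2t-small-librispeech-asr")

# Placeholder batch: one utterance of 100 frames with 80 filter-bank features each.
input_features = torch.randn(1, 100, 80)
labels = processor.tokenizer("a dummy transcript", return_tensors="pt").input_ids

# With `labels` provided, the forward pass returns the autoregressive
# cross-entropy loss over the target tokens.
outputs = model(input_features=input_features, labels=labels)
print(outputs.loss)
```
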
## Intended uses & limitations

This model can be used for end-to-end speech recognition (ASR).
See the [model hub](https://huggingface.co/models?filter=speech_to_text) to look for other S2T checkpoints.


### How to use

As this is a standard sequence-to-sequence transformer model, you can use the `generate` method to generate the
transcripts by passing the speech features to the model.

*Note: The `Speech2TextProcessor` object uses [torchaudio](https://github.com/pytorch/audio) to extract the
filter bank features, and the tokenizer depends on [sentencepiece](https://github.com/google/sentencepiece),
so be sure to install both packages before running the examples.*

You can either install them as extra speech dependencies with
`pip install "transformers[speech,sentencepiece]"` or install the packages separately
with `pip install torchaudio sentencepiece`.

```python
import torch
import soundfile as sf
from datasets import load_dataset
from transformers import Speech2TextProcessor, Speech2TextForConditionalGeneration

model = Speech2TextForConditionalGeneration.from_pretrained("facebook/s2t-small-librispeech-asr")
processor = Speech2TextProcessor.from_pretrained("facebook/s2t-small-librispeech-asr")

def map_to_array(batch):
    # Read the raw audio samples for each example.
    speech, _ = sf.read(batch["file"])
    batch["speech"] = speech
    return batch

ds = load_dataset(
    "patrickvonplaten/librispeech_asr_dummy",
    "clean",
    split="validation"
)
ds = ds.map(map_to_array)

input_features = processor(
    ds["speech"][0],
    sampling_rate=16_000,
    return_tensors="pt"
).input_features  # Batch size 1
generated_ids = model.generate(input_ids=input_features)

transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)
```

#### Evaluation on LibriSpeech Test

The following script shows how to evaluate this model on the [LibriSpeech](https://huggingface.co/datasets/librispeech_asr)
*"clean"* and *"other"* test sets.

```python
import soundfile as sf
from datasets import load_dataset, load_metric
from transformers import Speech2TextForConditionalGeneration, Speech2TextProcessor

librispeech_eval = load_dataset("librispeech_asr", "clean", split="test")  # change to "other" for the other test set
wer = load_metric("wer")

model = Speech2TextForConditionalGeneration.from_pretrained("facebook/s2t-small-librispeech-asr").to("cuda")
processor = Speech2TextProcessor.from_pretrained("facebook/s2t-small-librispeech-asr", do_upper_case=True)

def map_to_array(batch):
    speech, _ = sf.read(batch["file"])
    batch["speech"] = speech
    return batch

librispeech_eval = librispeech_eval.map(map_to_array)

def map_to_pred(batch):
    features = processor(batch["speech"], sampling_rate=16000, padding=True, return_tensors="pt")
    input_features = features.input_features.to("cuda")
    attention_mask = features.attention_mask.to("cuda")

    gen_tokens = model.generate(input_ids=input_features, attention_mask=attention_mask)
    batch["transcription"] = processor.batch_decode(gen_tokens, skip_special_tokens=True)
    return batch

result = librispeech_eval.map(map_to_pred, batched=True, batch_size=8, remove_columns=["speech"])

print("WER:", wer.compute(predictions=result["transcription"], references=result["text"]))
```

*Result (WER)*:

| "clean" | "other" |
|:-------:|:-------:|
|   4.3   |   9.0   |


## Training data

The S2T-SMALL-LIBRISPEECH-ASR is trained on [LibriSpeech ASR Corpus](https://www.openslr.org/12), a dataset consisting of
approximately 1000 hours of 16kHz read English speech.


## Training procedure

### Preprocessing

The speech data is pre-processed by extracting Kaldi-compliant 80-channel log mel-filter bank features automatically from
WAV/FLAC audio files via PyKaldi or torchaudio. Utterance-level CMVN (cepstral mean and variance normalization)
is then applied to each example.

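A minimal sketch of that pipeline with torchaudio is shown below; the path `sample.wav` is a hypothetical 16kHz mono recording, and the small epsilon is added here for numerical safety (the bundled `Speech2TextProcessor` performs these steps for you).

```python
import torchaudio
import torchaudio.compliance.kaldi as kaldi

# Load a 16kHz mono recording (hypothetical path).
waveform, sample_rate = torchaudio.load("sample.wav")

# Kaldi-compliant 80-channel log mel filter bank features: (num_frames, 80).
features = kaldi.fbank(waveform, num_mel_bins=80, sample_frequency=sample_rate)

# Utterance-level CMVN: zero mean and unit variance per feature dimension.
features = (features - features.mean(dim=0)) / (features.std(dim=0) + 1e-5)
```
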
The texts are lowercased and tokenized using SentencePiece with a vocabulary size of 10,000.

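A quick illustrative check of the target-side tokenization using the bundled tokenizer (the lowercasing matches `do_lower_case: true` in `tokenizer_config.json` below):

```python
from transformers import Speech2TextTokenizer

tokenizer = Speech2TextTokenizer.from_pretrained("facebook/s2t-small-librispeech-asr")
print(tokenizer.vocab_size)               # 10000
print(tokenizer.tokenize("Hello World"))  # lowercased SentencePiece pieces
```
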
### Training

The model is trained with standard autoregressive cross-entropy loss and uses [SpecAugment](https://arxiv.org/abs/1904.08779).
The encoder receives speech features, and the decoder generates the transcripts autoregressively.

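The sketch below illustrates SpecAugment-style frequency and time masking on a batch of filter-bank features using torchaudio's transforms; the mask widths are placeholder values, not the exact fairseq training configuration.

```python
import torch
from torchaudio.transforms import FrequencyMasking, TimeMasking

# Placeholder mask widths; the actual training recipe has its own settings.
freq_mask = FrequencyMasking(freq_mask_param=27)  # mask up to 27 mel channels
time_mask = TimeMasking(time_mask_param=100)      # mask up to 100 frames

features = torch.randn(4, 80, 500)  # (batch, num_mel_bins, num_frames)
augmented = time_mask(freq_mask(features))
```
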
### BibTeX entry and citation info

```bibtex
@inproceedings{wang2020fairseqs2t,
  title = {fairseq S2T: Fast Speech-to-Text Modeling with fairseq},
  author = {Changhan Wang and Yun Tang and Xutai Ma and Anne Wu and Dmytro Okhonko and Juan Pino},
  booktitle = {Proceedings of the 2020 Conference of the Asian Chapter of the Association for Computational Linguistics (AACL): System Demonstrations},
  year = {2020},
}
```
config.json ADDED
{
  "_name_or_path": "hf_models_fb/s2t-small-librispeech-asr",
  "activation_dropout": 0.1,
  "activation_function": "relu",
  "architectures": [
    "Speech2TextForConditionalGeneration"
  ],
  "attention_dropout": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": 0.0,
  "conv_channels": 1024,
  "conv_kernel_sizes": [
    5,
    5
  ],
  "d_model": 256,
  "decoder_attention_heads": 4,
  "decoder_ffn_dim": 2048,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 6,
  "decoder_start_token_id": 2,
  "dropout": 0.1,
  "early_stopping": true,
  "encoder_attention_heads": 4,
  "encoder_ffn_dim": 2048,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 12,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "init_std": 0.02,
  "input_channels": 1,
  "input_feat_per_channel": 80,
  "is_encoder_decoder": true,
  "max_length": 200,
  "max_source_positions": 6000,
  "max_target_positions": 1024,
  "model_type": "speech_to_text",
  "num_beams": 5,
  "num_conv_layers": 2,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "scale_embedding": true,
  "transformers_version": "4.4.0.dev0",
  "use_cache": true,
  "vocab_size": 10000
}
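For reference, the hyperparameters above can be inspected programmatically once the checkpoint is downloaded; this is just an illustrative sanity check, not a required step.

```python
from transformers import Speech2TextConfig

config = Speech2TextConfig.from_pretrained("facebook/s2t-small-librispeech-asr")
# d_model=256 with 12 encoder layers and 6 decoder layers, as listed above.
print(config.d_model, config.encoder_layers, config.decoder_layers)
```
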
preprocessor_config.json ADDED
{
  "do_ceptral_normalize": true,
  "feature_size": 80,
  "normalize_means": true,
  "normalize_vars": true,
  "num_mel_bins": 80,
  "padding_side": "right",
  "padding_value": 0.0,
  "return_attention_mask": true,
  "sampling_rate": 16000
}
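Likewise, these settings map onto the feature-extractor half of `Speech2TextProcessor`; the sketch below is an illustrative check only.

```python
from transformers import Speech2TextFeatureExtractor

feature_extractor = Speech2TextFeatureExtractor.from_pretrained("facebook/s2t-small-librispeech-asr")
# 80 mel bins at 16kHz, with utterance-level mean/variance normalization enabled.
print(feature_extractor.num_mel_bins, feature_extractor.sampling_rate)
```
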
pytorch_model.bin ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:95be85b800e626fa6063bf30bd40874b3a426fc12b0393b7046546e470fcc535
size 118267196
sentencepiece.bpe.model ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:052a168787a9160b4b2ba54e4995e9600298812c34191ca3f70cea51cd4f5c1e
size 416684
special_tokens_map.json ADDED
{"bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>", "pad_token": "<pad>"}
tokenizer_config.json ADDED
{"bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>", "pad_token": "<pad>", "do_upper_case": false, "do_lower_case": true, "tgt_lang": null, "lang_codes": null, "special_tokens_map_file": "/home/suraj/.cache/huggingface/transformers/f39f1499e9c4d2b3e803e3cad8a31c4cf3e626e1c69197d4cd6921e5c07007f9.9d6cd81ef646692fb1c169a880161ea1cb95f49694f220aced9b704b457e51dd", "tokenizer_file": null, "name_or_path": "hf_models_fb/s2t-small-librispeech-asr"}
vocab.json ADDED
The diff for this file is too large to render. See raw diff