huzy0 committed
Commit 43fe950 · verified · 1 Parent(s): ea50f21

Update README.md

Files changed (1):
1. README.md +16 -9
README.md CHANGED
@@ -11,7 +11,7 @@ language:
 
 # MERaLiON-SpeechEncoder-v1
 
-The MERaLiON-SpeechEncoder is a speech foundation model designed to support a wide range of downstream speech applications, like speech recognition, intent classification and speaker identification, among others. This version was trained on 200,000 hours of predominantly English data including 10,000 hours of Singapore-based speech, to cater to the speech processing needs in Singapore and beyond. Gradual support for other languages, starting with major South-East Asian ones are planned for subsequent releases.
+The MERaLiON-SpeechEncoder is a speech foundation model designed to support a wide range of downstream speech applications, such as speech recognition, intent classification and speaker identification. This version was trained on 200,000 hours of predominantly English data, including 10,000 hours of Singapore-based speech, to cater to the speech processing needs in Singapore and beyond. Gradual support for other languages, starting with major Southeast Asian ones, is planned for subsequent releases.
 
 - **Developed by:** I<sup>2</sup>R, A\*STAR
 - **Funded by:** Singapore NRF
@@ -31,9 +31,9 @@ For more details on background, pre-training, tuning experiments and evaluation,
 
 We have evaluated the MERaLiON-SpeechEncoder extensively on several speech recognition datasets, and fine-tuned the model on ten different tasks encompassing the SUPERB benchmark: `automatic speech recognition` (ASR), `automatic phoneme recognition` (PR), `keyword spotting` (KS), `query by example spoken term detection` (QbE), `intent classification` (IC), `slot filling` (SF), `speaker identification` (SID), `automatic speaker verification` (ASV), `speaker diarization` (SD), and `emotion recognition` (ER).
 
-This version of MERaLiON-SpeechEncoder is specifically tailored for English, both global and Singapore-specific, including Singlish. Although the encoder was trained on a portion of multilingual data, this has not been substantially evaluated.
+This version of the MERaLiON-SpeechEncoder is specifically tailored for English, both global and Singapore-specific, including Singlish. Although the encoder was trained on a portion of multilingual data, its performance on other languages has not been substantially evaluated.
 
-We provide a code snippet below for the direct usage of retrieving latent features from the model, and an example of how to set up the model for ASR fine-tuning. Speech input should be sampled at 16kHz.
+We provide a code snippet below for retrieving latent features directly from the model, followed by an example of how to set it up for ASR fine-tuning. Speech input should be sampled at 16 kHz.
 
 
 ### Direct Use
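Note on the 16 kHz requirement mentioned in the hunk above: one way to obtain correctly sampled waveforms is to let `datasets` resample on load. A minimal sketch; the dummy dataset name is illustrative, not from the README:

```python
from datasets import load_dataset, Audio

# Illustrative dataset only; any speech dataset with an 'audio' column works.
ds = load_dataset('hf-internal-testing/librispeech_asr_dummy', 'clean', split='validation')
# cast_column makes `datasets` resample on the fly to the 16 kHz the encoder expects.
ds = ds.cast_column('audio', Audio(sampling_rate=16_000))
data = [example['array'] for example in ds[:4]['audio']]  # list of 1-D float waveforms
```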
@@ -68,7 +68,8 @@ def batch_collater(data):
     return tensors
 
 audio_array = batch_collater(data)
-inputs = feature_extractor(audio_array, sampling_rate=16_000, return_attention_mask=True, return_tensors='pt', do_normalize=False)
+inputs = feature_extractor(audio_array, sampling_rate=16_000, return_attention_mask=True,
+                           return_tensors='pt', do_normalize=False)
 input_values = inputs['input_values']
 input_lengths = torch.sum(inputs['attention_mask'], dim=-1)
 
@@ -77,7 +78,8 @@ input_values, input_lengths = input_values.to(device), input_lengths.to(device)
 # model inference to obtain features
 with torch.no_grad():
     model.eval()
-    output = model(input_values=input_values, input_lengths=input_lengths, output_hidden_states=True)
+    output = model(input_values=input_values,
+                   input_lengths=input_lengths, output_hidden_states=True)
 ```
 
 ### Downstream Use
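For orientation beyond the fragments shown in the hunks, the Direct Use snippet plausibly assembles end-to-end as below. This is a sketch, not the README verbatim: the repo id, the `AutoModel` entry point, and the random stand-in waveforms are assumptions, while the `feature_extractor` call and the `input_lengths` keyword mirror the lines in the diff.

```python
import numpy as np
import torch
from transformers import AutoModel, AutoFeatureExtractor

repo_id = 'MERaLiON/MERaLiON-SpeechEncoder-v1'  # assumed repo id
device = 'cuda' if torch.cuda.is_available() else 'cpu'

model = AutoModel.from_pretrained(repo_id, trust_remote_code=True).to(device)
feature_extractor = AutoFeatureExtractor.from_pretrained(repo_id, trust_remote_code=True)

# Stand-in for the collated dataset audio: two mono 16 kHz waveforms.
audio_array = [np.random.randn(16_000).astype(np.float32),
               np.random.randn(24_000).astype(np.float32)]

inputs = feature_extractor(audio_array, sampling_rate=16_000, return_attention_mask=True,
                           return_tensors='pt', do_normalize=False)
input_values = inputs['input_values'].to(device)
input_lengths = torch.sum(inputs['attention_mask'], dim=-1).to(device)

# model inference to obtain per-layer latent features
model.eval()
with torch.no_grad():
    output = model(input_values=input_values, input_lengths=input_lengths,
                   output_hidden_states=True)
```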
@@ -109,8 +111,12 @@ librispeech100h_test = librispeech100h_test.remove_columns(['file', 'speaker_id'
 librispeech100h_train = librispeech100h_train.map(pre_processing)
 librispeech100h_test = librispeech100h_test.map(pre_processing)
 
-vocab_train = librispeech100h_train.map(extract_all_chars, batched=True, batch_size=-1, keep_in_memory=True, remove_columns=librispeech100h_train.column_names)
-vocab_test = librispeech100h_test.map(extract_all_chars, batched=True, batch_size=-1, keep_in_memory=True, remove_columns=librispeech100h_test.column_names)
+vocab_train = librispeech100h_train.map(extract_all_chars, batched=True,
+                                        batch_size=-1, keep_in_memory=True,
+                                        remove_columns=librispeech100h_train.column_names)
+vocab_test = librispeech100h_test.map(extract_all_chars, batched=True,
+                                      batch_size=-1, keep_in_memory=True,
+                                      remove_columns=librispeech100h_test.column_names)
 vocab_list = list(set(vocab_train["vocab"][0]) | set(vocab_test["vocab"][0]))
 vocab_dict = {v: k for k, v in enumerate(sorted(vocab_list))}
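The `extract_all_chars` helper and the `ls_vocab.json` file referenced by this hunk and the next sit outside the diff. Under the standard Hugging Face CTC recipe they would look roughly like this; the `'text'` column name is an assumption about what `pre_processing` produces:

```python
import json

def extract_all_chars(batch):
    # Join every transcript in the batch and return the set of unique characters.
    all_text = ' '.join(batch['text'])
    return {'vocab': [list(set(all_text))], 'all_text': [all_text]}

# Continuing from vocab_dict above: CTC conventions replace the space character
# with the word-delimiter token and append unknown/padding tokens.
vocab_dict['|'] = vocab_dict[' ']
del vocab_dict[' ']
vocab_dict['[UNK]'] = len(vocab_dict)
vocab_dict['[PAD]'] = len(vocab_dict)

with open('ls_vocab.json', 'w') as vocab_file:
    json.dump(vocab_dict, vocab_file)
```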
@@ -128,7 +134,8 @@ feature_extractor = AutoFeatureExtractor.from_pretrained(
     trust_remote_code = True,
 )
 
-tokenizer = Wav2Vec2CTCTokenizer("./ls_vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|")
+tokenizer = Wav2Vec2CTCTokenizer("./ls_vocab.json",
+                                 unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|")
 
 model = AutoModelForCTC.from_pretrained(
     repo_id,
@@ -144,7 +151,7 @@ model = AutoModelForCTC.from_pretrained(
 )
 model = model.to(device)
 ```
-Consult this [blog](https://huggingface.co/blog/fine-tune-w2v2-bert) for further training recipe with Huggingface Trainer. Alternatively, the Huggingface model can be loaded to any other frameworks such as Pytorch or ESPnet for custom fine-tuning loops.
+Refer to this [blog](https://huggingface.co/blog/fine-tune-w2v2-bert) for a complete ASR fine-tuning recipe using the Hugging Face Trainer. Alternatively, the model can be loaded into other frameworks such as PyTorch or ESPnet for custom fine-tuning loops.
 
 ## Technical Specifications
 
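Past the end of the diff: the linked blog pairs the CTC model above with a padding collator and the Hugging Face Trainer. A compressed sketch, assuming `model`, `feature_extractor`, `tokenizer`, and LibriSpeech splits already carrying `input_values`/`labels` columns; hyperparameters are illustrative only:

```python
from dataclasses import dataclass
import torch
from transformers import Trainer, TrainingArguments, Wav2Vec2Processor

# Bundle the feature extractor and tokenizer defined above, as in the blog recipe.
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

@dataclass
class DataCollatorCTCWithPadding:
    # Audio inputs and label sequences have different lengths, so pad them separately.
    processor: Wav2Vec2Processor

    def __call__(self, features):
        input_features = [{'input_values': f['input_values']} for f in features]
        label_features = [{'input_ids': f['labels']} for f in features]
        batch = self.processor.feature_extractor.pad(input_features, padding=True,
                                                     return_tensors='pt')
        labels_batch = self.processor.tokenizer.pad(label_features, padding=True,
                                                    return_tensors='pt')
        # Mask label padding with -100 so the CTC loss ignores it.
        batch['labels'] = labels_batch['input_ids'].masked_fill(
            labels_batch['attention_mask'].ne(1), -100)
        return batch

training_args = TrainingArguments(
    output_dir='./meralion-asr-ft',        # hypothetical output path
    per_device_train_batch_size=8,
    learning_rate=3e-4,
    num_train_epochs=5,
    fp16=torch.cuda.is_available(),
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=DataCollatorCTCWithPadding(processor=processor),
    train_dataset=librispeech100h_train,   # assumes prepared 'input_values'/'labels'
    eval_dataset=librispeech100h_test,
)
trainer.train()
```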
 