Update README.md

README.md (changed)
# MERaLiON-SpeechEncoder-v1

The MERaLiON-SpeechEncoder is a speech foundation model designed to support a wide range of downstream speech applications, such as speech recognition, intent classification, and speaker identification. This version was trained on 200,000 hours of predominantly English data, including 10,000 hours of Singapore-based speech, to cater to the speech processing needs of Singapore and beyond. Gradual support for other languages, starting with major Southeast Asian ones, is planned for subsequent releases.

- **Developed by:** I<sup>2</sup>R, A\*STAR
- **Funded by:** Singapore NRF

We have evaluated the MERaLiON-SpeechEncoder extensively on several speech recognition datasets and fine-tuned the model on ten different tasks encompassing the SUPERB benchmark: `automatic speech recognition` (ASR), `automatic phoneme recognition` (PR), `keyword spotting` (KS), `query by example spoken term detection` (QbE), `intent classification` (IC), `slot filling` (SF), `speaker identification` (SID), `automatic speaker verification` (ASV), `speaker diarization` (SD), and `emotion recognition` (ER).

This version of the MERaLiON-SpeechEncoder is specifically tailored for English, both global and Singapore-specific, including Singlish. Although the encoder was trained on a portion of multilingual data, its multilingual performance has not been substantially evaluated.

We provide a code snippet below for directly retrieving latent features from the model, followed by an example of how to set up the model for ASR fine-tuning. Speech input should be sampled at 16 kHz.
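
As a minimal sketch of meeting the 16 kHz requirement (the use of torchaudio and the file path are illustrative assumptions, not part of this model card):

```python
import torchaudio

# Hypothetical input file; any audio source works the same way.
waveform, sr = torchaudio.load("example.wav")

# Resample to the 16 kHz rate the encoder expects, if needed.
if sr != 16_000:
    waveform = torchaudio.functional.resample(waveform, orig_freq=sr, new_freq=16_000)
```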

### Direct Use

```python
import torch

# ... (device setup, data loading, and model/feature extractor loading
#      are elided in this diff)

def batch_collater(data):
    # ... (function body elided in this diff)
    return tensors

audio_array = batch_collater(data)
inputs = feature_extractor(audio_array, sampling_rate=16_000, return_attention_mask=True,
                           return_tensors='pt', do_normalize=False)
input_values = inputs['input_values']
input_lengths = torch.sum(inputs['attention_mask'], dim=-1)

input_values, input_lengths = input_values.to(device), input_lengths.to(device)

# model inference to obtain features
with torch.no_grad():
    model.eval()
    output = model(input_values=input_values,
                   input_lengths=input_lengths, output_hidden_states=True)
```
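
When `output_hidden_states=True`, the forward pass returns the outputs of every encoder layer. A minimal sketch of picking out features, assuming the model follows the standard Hugging Face convention in which `hidden_states` is a tuple of `(batch, frames, hidden_dim)` tensors:

```python
# Assumed output layout (standard Hugging Face encoder outputs); verify against
# the model's actual output class before relying on it.
last_hidden = output.hidden_states[-1]   # features from the final transformer layer
print(last_hidden.shape)                 # (batch, frames, hidden_dim)
```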

### Downstream Use

```python
# ... (dataset loading, column removal, and the pre_processing /
#      extract_all_chars definitions are elided in this diff)

librispeech100h_train = librispeech100h_train.map(pre_processing)
librispeech100h_test = librispeech100h_test.map(pre_processing)

vocab_train = librispeech100h_train.map(extract_all_chars, batched=True,
                                        batch_size=-1, keep_in_memory=True,
                                        remove_columns=librispeech100h_train.column_names)
vocab_test = librispeech100h_test.map(extract_all_chars, batched=True,
                                      batch_size=-1, keep_in_memory=True,
                                      remove_columns=librispeech100h_test.column_names)
vocab_list = list(set(vocab_train["vocab"][0]) | set(vocab_test["vocab"][0]))
vocab_dict = {v: k for k, v in enumerate(sorted(vocab_list))}
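
# The lines below are an assumed bridge, not shown in this diff: the standard
# wav2vec2-style CTC recipe remaps the space character to the word delimiter
# "|", appends [UNK] and [PAD], and writes the vocabulary to the ls_vocab.json
# file that the tokenizer below loads.
import json

vocab_dict["|"] = vocab_dict[" "]
del vocab_dict[" "]
vocab_dict["[UNK]"] = len(vocab_dict)
vocab_dict["[PAD]"] = len(vocab_dict)

with open("ls_vocab.json", "w") as vocab_file:
    json.dump(vocab_dict, vocab_file)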

# ... (feature extractor arguments elided in this diff)
feature_extractor = AutoFeatureExtractor.from_pretrained(
    # ...
    trust_remote_code = True,
)

tokenizer = Wav2Vec2CTCTokenizer("./ls_vocab.json",
                                 unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|")

model = AutoModelForCTC.from_pretrained(
    repo_id,
    # ... (remaining arguments elided in this diff)
)
model = model.to(device)
```

Refer to this [blog](https://huggingface.co/blog/fine-tune-w2v2-bert) for a complete ASR fine-tuning recipe with the Hugging Face Trainer. Alternatively, the Hugging Face model can be loaded into other frameworks, such as PyTorch or ESPnet, for custom fine-tuning loops.
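
A minimal sketch of wiring the pieces above into the Trainer is shown below; the hyperparameter values and output path are illustrative assumptions rather than settings from this model card, and a CTC data collator (covered in the linked blog) still has to be supplied:

```python
from transformers import Trainer, TrainingArguments, Wav2Vec2Processor

# Bundle the feature extractor and tokenizer so batches can be prepared in one call.
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

training_args = TrainingArguments(
    output_dir="./meralion-asr-ft",    # hypothetical output path
    per_device_train_batch_size=8,     # illustrative values; tune for your setup
    learning_rate=3e-5,
    num_train_epochs=10,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=librispeech100h_train,
    eval_dataset=librispeech100h_test,
    # data_collator=...  # padding collator for CTC goes here
)
```

Calling `trainer.train()` then launches fine-tuning.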

## Technical Specifications