sumedh committed on
Commit
a34a2d0
1 Parent(s): fbeb9ca

Update README.md

Files changed (1)
  1. README.md +18 -17
README.md CHANGED
@@ -26,7 +26,8 @@ model-index:
  ---

  # Wav2Vec2-Large-XLSR-53-Marathi
- Fine-tuned [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on Marathi using the [OpenSLR SLR64](http://openslr.org/64/) dataset. When using this model, make sure that your speech input is sampled at 16kHz. This data contains only female voices, although it works well for male voice too.
  ## Usage
  The model can be used directly without a language model as follows, given that your dataset has Marathi `actual_text` and `path_in_folder` columns:
  ```python
@@ -42,13 +43,13 @@ model = Wav2Vec2ForCTC.from_pretrained("sumedh/wav2vec2-large-xlsr-marathi")
  resampler = torchaudio.transforms.Resample(48_000, 16_000) #first arg - input sample, second arg - output sample
  # Preprocessing the datasets. We need to read the aduio files as arrays
  def speech_file_to_array_fn(batch):
- \\tspeech_array, sampling_rate = torchaudio.load(batch["path_in_folder"])
- \\tbatch["speech"] = resampler(speech_array).squeeze().numpy()
- \\treturn batch
  mr_test_dataset_new = mr_test_dataset_new.map(speech_file_to_array_fn)
  inputs = processor(mr_test_dataset_new["speech"][:5], sampling_rate=16_000, return_tensors="pt", padding=True)
  with torch.no_grad():
- \\tlogits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
  predicted_ids = torch.argmax(logits, dim=-1)
  print("Prediction:", processor.batch_decode(predicted_ids))
  print("Reference:", mr_test_dataset_new["actual_text"][:5])
@@ -67,26 +68,26 @@ processor = Wav2Vec2Processor.from_pretrained("sumedh/wav2vec2-large-xlsr-marath
  model = Wav2Vec2ForCTC.from_pretrained("sumedh/wav2vec2-large-xlsr-marathi")
  model.to("cuda")

- chars_to_ignore_regex = '[\\\\,\\\\?\\\\.\\\\!\\\\-\\\\;\\\\:\\\\"\\\\“]'
  resampler = torchaudio.transforms.Resample(48_000, 16_000)
  # Preprocessing the datasets. We need to read the aduio files as arrays
  def speech_file_to_array_fn(batch):
- \\tbatch["actual_text"] = re.sub(chars_to_ignore_regex, '', batch["actual_text"]).lower()
- \\tspeech_array, sampling_rate = torchaudio.load(batch["path_in_folder"])
- \\tbatch["speech"] = resampler(speech_array).squeeze().numpy()
- \\treturn batch
  mr_test_dataset_new = mr_test_dataset_new.map(speech_file_to_array_fn)
  def evaluate(batch):
- \\tinputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
- \\twith torch.no_grad():
- \\t\\tlogits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits
- \\t\\tpred_ids = torch.argmax(logits, dim=-1)
- \\t\\tbatch["pred_strings"] = processor.batch_decode(pred_ids)
- \\treturn batch
  result = mr_test_dataset_new.map(evaluate, batched=True, batch_size=8)
  print("WER: {:2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["actual_text"])))
  ```
- **WER on the Test Set**: 12.70 %
  ## Training
  Train-Test ratio was 90:10.
  The colab notebook used for training can be found [here](https://colab.research.google.com/drive/1wX46fjExcgU5t3AsWhSPTipWg_aMDg2f?usp=sharing).
 
  ---

  # Wav2Vec2-Large-XLSR-53-Marathi
+ Fine-tuned [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on Marathi using the [OpenSLR SLR64](http://openslr.org/64/) dataset. When using this model, make sure that your speech input is sampled at 16kHz. This data contains only female voices but it works well for male voices too.
+ **WER on the Test Set**: 12.70 %
  ## Usage
  The model can be used directly without a language model as follows, given that your dataset has Marathi `actual_text` and `path_in_folder` columns:
  ```python
 
  resampler = torchaudio.transforms.Resample(48_000, 16_000) #first arg - input sample, second arg - output sample
  # Preprocessing the datasets. We need to read the aduio files as arrays
  def speech_file_to_array_fn(batch):
+     speech_array, sampling_rate = torchaudio.load(batch["path_in_folder"])
+     batch["speech"] = resampler(speech_array).squeeze().numpy()
+     return batch
  mr_test_dataset_new = mr_test_dataset_new.map(speech_file_to_array_fn)
  inputs = processor(mr_test_dataset_new["speech"][:5], sampling_rate=16_000, return_tensors="pt", padding=True)
  with torch.no_grad():
+     logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
  predicted_ids = torch.argmax(logits, dim=-1)
  print("Prediction:", processor.batch_decode(predicted_ids))
  print("Reference:", mr_test_dataset_new["actual_text"][:5])
 
  model = Wav2Vec2ForCTC.from_pretrained("sumedh/wav2vec2-large-xlsr-marathi")
  model.to("cuda")

+ chars_to_ignore_regex = '[\\\\\\\\,\\\\\\\\?\\\\\\\\.\\\\\\\\!\\\\\\\\-\\\\\\\\;\\\\\\\\:\\\\\\\\"\\\\\\\\“]'
  resampler = torchaudio.transforms.Resample(48_000, 16_000)
  # Preprocessing the datasets. We need to read the aduio files as arrays
  def speech_file_to_array_fn(batch):
+     batch["actual_text"] = re.sub(chars_to_ignore_regex, '', batch["actual_text"]).lower()
+     speech_array, sampling_rate = torchaudio.load(batch["path_in_folder"])
+     batch["speech"] = resampler(speech_array).squeeze().numpy()
+     return batch
  mr_test_dataset_new = mr_test_dataset_new.map(speech_file_to_array_fn)
  def evaluate(batch):
+     inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
+     with torch.no_grad():
+         logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits
+         pred_ids = torch.argmax(logits, dim=-1)
+         batch["pred_strings"] = processor.batch_decode(pred_ids)
+     return batch
  result = mr_test_dataset_new.map(evaluate, batched=True, batch_size=8)
  print("WER: {:2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["actual_text"])))
  ```
+
  ## Training
  Train-Test ratio was 90:10.
  The colab notebook used for training can be found [here](https://colab.research.google.com/drive/1wX46fjExcgU5t3AsWhSPTipWg_aMDg2f?usp=sharing).
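
For readers unfamiliar with the metric printed by the evaluation snippet, here is a minimal sketch of a corpus-level word error rate in plain Python. It assumes the `wer` object in the README is a standard WER metric (its definition falls outside the hunks shown), and `word_error_rate` and the English sample sentences are hypothetical placeholders, not part of the model card. Note that `{:2f}` sets a minimum field width rather than a precision, so the sketch uses `{:.2f}` to print two decimals.

```python
def word_error_rate(references, predictions):
    """Corpus-level WER: total word edits divided by total reference words."""
    total_edits = 0
    total_words = 0
    for ref, hyp in zip(references, predictions):
        ref_words, hyp_words = ref.split(), hyp.split()
        # Word-level Levenshtein distance, keeping one DP row at a time.
        prev = list(range(len(hyp_words) + 1))
        for i, r in enumerate(ref_words, start=1):
            curr = [i]
            for j, h in enumerate(hyp_words, start=1):
                cost = 0 if r == h else 1
                curr.append(min(
                    prev[j] + 1,         # reference word missing from hypothesis
                    curr[j - 1] + 1,     # extra word in hypothesis
                    prev[j - 1] + cost,  # substitution, or match when cost is 0
                ))
            prev = curr
        total_edits += prev[-1]
        total_words += len(ref_words)
    return total_edits / total_words


# Illustrative placeholder data (not from the Marathi test set).
references = ["the cat sat on the mat", "hello world"]
predictions = ["the cat sit on mat", "hello world"]
score = word_error_rate(references, predictions)
print("WER: {:.2f}".format(100 * score))  # prints "WER: 25.00"
```

In this example the first pair needs one substitution and one deletion against six reference words, and the second pair matches exactly, giving 2 edits over 8 reference words.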