elgeish/wav2vec2-base-timit-asr

@@ -5,10 +5,8 @@ datasets:
 tags:
 - audio
 - automatic-speech-recognition
 license: apache-2.0
-widget:
-- label: Sample 1 (from LibriSpeech)
-  src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
 ---
 # Wav2Vec2-Base-TIMIT
@@ -22,18 +20,24 @@ When using this model, make sure that your speech input is sampled at 16kHz.
 The model can be used directly (without a language model) as follows:
 ```python
 import torch
 from datasets import load_dataset
-import soundfile as sf
 from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
-model_name = "elgeish/wav2vec2-base-timit"
 processor = Wav2Vec2Processor.from_pretrained(model_name, do_lower_case=True)
 model = Wav2Vec2ForCTC.from_pretrained(model_name)
-dataset = load_dataset("timit_asr", split="test[:10]")
 def prepare_example(example):
     example["speech"], _ = sf.read(example["file"])
     return example
 dataset = dataset.map(prepare_example, remove_columns=["file"])
@@ -41,6 +45,7 @@ inputs = processor(dataset["speech"], sampling_rate=16000, return_tensors="pt",
 with torch.no_grad():
     predicted_ids = torch.argmax(model(inputs.input_values).logits, dim=-1)
 predicted_transcripts = processor.tokenizer.batch_decode(predicted_ids)
 for reference, predicted in zip(dataset["text"], predicted_transcripts):
     print("reference:", reference)
@@ -51,39 +56,38 @@ for reference, predicted in zip(dataset["text"], predicted_transcripts):
 Here's the output:
 ```
-reference: The bungalow was pleasantly situated near the shore.
-predicted: the bunglow was plesntly situated near the shor
---
-reference: Don't ask me to carry an oily rag like that.
-predicted: don't ask me to carry an oily rag like that
 --
-reference: Are you looking for employment?
-predicted: are you oking for employment
 --
-reference: She had your dark suit in greasy wash water all year.
-predicted: she had your dark suit in greasy wash water all year
 --
-reference: At twilight on the twelfth day we'll have Chablis.
-predicted: at twilight on the twelfth day we'll have shiple
 --
-reference: Eating spinach nightly increases strength miraculously.
-predicted: eating spanage nightly increases strength moraculously
 --
-reference: Got a heck of a buy on this, dirt cheap.
-predicted: got a heck of a by on this dert cheep
 --
-reference: The scalloped edge is particularly appealing.
-predicted: the scaliped edge iuse particularly appeling
 --
-reference: A big goat idly ambled through the farmyard.
-predicted: a big goat idely ambled through the farmyard
 --
-reference: This group is secularist and their program tends to be technological.
-predicted: this croup is secularist and their program tens to be technological
 --
 ```
 ## Fine-Tuning Script
 You can find the script used to produce this model
-[here](https://github.com/elgeish/transformers/blob/f2b98f876b040bab3c3db8561ec39c1abb2c733c/examples/research_projects/wav2vec2/finetune_base_timit_asr.sh).

 tags:
 - audio
 - automatic-speech-recognition
+- speech
 license: apache-2.0
 ---
 # Wav2Vec2-Base-TIMIT
 The model can be used directly (without a language model) as follows:
 ```python
+import soundfile as sf
 import torch
 from datasets import load_dataset
 from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
+model_name = "elgeish/wav2vec2-base-timit-asr"
 processor = Wav2Vec2Processor.from_pretrained(model_name, do_lower_case=True)
 model = Wav2Vec2ForCTC.from_pretrained(model_name)
+model.eval()
+dataset = load_dataset("timit_asr", split="test").shuffle().select(range(10))
+char_translations = str.maketrans({"-": " ", ".": "", "?": ""})
 def prepare_example(example):
     example["speech"], _ = sf.read(example["file"])
+    example["text"] = example["text"].translate(char_translations)
+    example["text"] = " ".join(example["text"].split())  # clean up whitespaces
+    example["text"] = example["text"].lower()
     return example
 dataset = dataset.map(prepare_example, remove_columns=["file"])
 with torch.no_grad():
     predicted_ids = torch.argmax(model(inputs.input_values).logits, dim=-1)
+predicted_ids[predicted_ids == -100] = processor.tokenizer.pad_token_id  # see fine-tuning script
 predicted_transcripts = processor.tokenizer.batch_decode(predicted_ids)
 for reference, predicted in zip(dataset["text"], predicted_transcripts):
     print("reference:", reference)
 Here's the output:
 ```
+reference: she had your dark suit in greasy wash water all year
+predicted: she had your dark suit in greasy wash water all year
 --
+reference: where were you while we were away
+predicted: where were you while we were away
 --
+reference: cory and trish played tag with beach balls for hours
+predicted: tcory and trish played tag with beach balls for hours
 --
+reference: tradition requires parental approval for under age marriage
+predicted: tradition requires parrental proval for under age marrage
 --
+reference: objects made of pewter are beautiful
+predicted: objects made of puder are bautiful
 --
+reference: don't ask me to carry an oily rag like that
+predicted: don't o ask me to carry an oily rag like that
 --
+reference: cory and trish played tag with beach balls for hours
+predicted: cory and trish played tag with beach balls for ours
 --
+reference: don't ask me to carry an oily rag like that
+predicted: don't ask me to carry an oily rag like that
 --
+reference: don't do charlie's dirty dishes
+predicted: don't  do chawly's tirty dishes
 --
+reference: only those story tellers will remain who can imitate the style of the virtuous
+predicted: only those story tillaers will remain who can imvitate the style the virtuous
 ```
 ## Fine-Tuning Script
 You can find the script used to produce this model
+[here](https://github.com/elgeish/transformers/blob/cfc0bd01f2ac2ea3a5acc578ef2e204bf4304de7/examples/research_projects/wav2vec2/finetune_base_timit_asr.sh).
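
The updated usage snippet normalizes each reference transcript before comparing it with the model output: hyphens become spaces, periods and question marks are dropped, whitespace is collapsed, and the text is lowercased. The following is a minimal, dependency-free sketch of that step in isolation. The helper name `normalize_transcript` is hypothetical (the model card performs these steps inline in `prepare_example`), and the hyphenated spelling in the third sentence is an assumption about the raw TIMIT text.

```python
# Same translation table as in the updated usage snippet:
# hyphens become spaces; periods and question marks are dropped.
char_translations = str.maketrans({"-": " ", ".": "", "?": ""})


def normalize_transcript(text: str) -> str:
    """Hypothetical helper mirroring the text cleanup in prepare_example."""
    text = text.translate(char_translations)
    text = " ".join(text.split())  # collapse runs of whitespace into single spaces
    return text.lower()


examples = [
    "She had your dark suit in greasy wash water all year.",
    "Are you looking for employment?",
    # Assumed raw spelling; the card only shows the normalized form.
    "Tradition requires parental approval for under-age marriage.",
]
for raw in examples:
    print(normalize_transcript(raw))
```

Note that the table leaves commas and apostrophes untouched, so a reference such as "Got a heck of a buy on this, dirt cheap." would keep its comma after normalization.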