harshit345/wav2vec2-large-lv60-timit

+---
+language: en
+datasets:
+- timit_asr
+tags:
+- audio
+- automatic-speech-recognition
+- speech
+license: apache-2.0
+---
+# Wav2Vec2-Large-LV60-TIMIT
+This model is [facebook/wav2vec2-large-lv60](https://huggingface.co/facebook/wav2vec2-large-lv60)
+fine-tuned on the [timit_asr dataset](https://huggingface.co/datasets/timit_asr).
+When using this model, make sure that your speech input is sampled at 16 kHz.
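If a recording was captured at a different rate, it has to be resampled before it is passed to the processor. The following is a minimal sketch, assuming NumPy and SciPy are installed; the helper `to_16khz` and the synthetic tone are illustrative and are not part of the original model card or its fine-tuning script.

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

TARGET_RATE = 16_000  # sampling rate the model expects


def to_16khz(speech, source_rate):
    """Resample a 1-D waveform to 16 kHz with a polyphase filter (hypothetical helper)."""
    if source_rate == TARGET_RATE:
        return speech
    # Reduce the rate ratio to the smallest integer up/down factors.
    divisor = gcd(TARGET_RATE, source_rate)
    return resample_poly(speech, TARGET_RATE // divisor, source_rate // divisor)


# One second of a 440 Hz tone at 44.1 kHz stands in for a real recording.
source_rate = 44_100
t = np.arange(source_rate) / source_rate
speech = np.sin(2 * np.pi * 440 * t)

resampled = to_16khz(speech, source_rate)
print(len(speech), "->", len(resampled))
```

With real audio, `soundfile.read` returns the sampling rate as its second value, so that value can be passed as `source_rate` instead of being discarded.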
+## Usage
+The model can be used directly (without a language model) as follows:
+```python
+import soundfile as sf
+import torch
+from datasets import load_dataset
+from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
+model_name = "elgeish/wav2vec2-large-lv60-timit-asr"
+processor = Wav2Vec2Processor.from_pretrained(model_name)
+model = Wav2Vec2ForCTC.from_pretrained(model_name)
+model.eval()
+# Pick 10 random test utterances; shuffle() is unseeded, so the selection changes between runs
+dataset = load_dataset("timit_asr", split="test").shuffle().select(range(10))
+# Map hyphens to spaces and drop commas, periods, and question marks
+char_translations = str.maketrans({"-": " ", ",": "", ".": "", "?": ""})
+def prepare_example(example):
+    # Load the waveform; the sampling rate returned by sf.read is discarded
+    example["speech"], _ = sf.read(example["file"])
+    example["text"] = example["text"].translate(char_translations)
+    example["text"] = " ".join(example["text"].split())  # collapse repeated whitespace
+    example["text"] = example["text"].lower()
+    return example
+dataset = dataset.map(prepare_example, remove_columns=["file"])
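+# Pad all waveforms in the batch to the length of the longest one
+inputs = processor(dataset["speech"], sampling_rate=16000, return_tensors="pt", padding="longest")
+# Greedy decoding: take the most likely token at each frame
+with torch.no_grad():
+    predicted_ids = torch.argmax(model(inputs.input_values).logits, dim=-1)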
+predicted_ids[predicted_ids == -100] = processor.tokenizer.pad_token_id  # see fine-tuning script
+predicted_transcripts = processor.tokenizer.batch_decode(predicted_ids)
+for reference, predicted in zip(dataset["text"], predicted_transcripts):
+    print("reference:", reference)
+    print("predicted:", predicted)
+    print("--")
+```
+Here's sample output (the 10 utterances differ between runs because `shuffle()` is called without a seed):
+```
+reference: the emblem depicts the acropolis all aglow
+predicted: the amblum depicts the acropolis all a glo
+--
+reference: don't ask me to carry an oily rag like that
+predicted: don't ask me to carry an oily rag like that
+--
+reference: they enjoy it when i audition
+predicted: they enjoy it when i addition
+--
+reference: set aside to dry with lid on sugar bowl
+predicted: set aside to dry with a litt on shoogerbowl
+--
+reference: a boring novel is a superb sleeping pill
+predicted: a bor and novel is a suberb sleeping peel
+--
+reference: only the most accomplished artists obtain popularity
+predicted: only the most accomplished artists obtain popularity
+--
+reference: he has never himself done anything for which to be hated which of us has
+predicted: he has never himself done anything for which to be hated which of us has
+--
+reference: the fish began to leap frantically on the surface of the small lake
+predicted: the fish began to leap frantically on the surface of the small lake
+--
+reference: or certain words or rituals that child and adult go through may do the trick
+predicted: or certain words or rituals that child an adult go through may do the trick
+--
+reference: are your grades higher or lower than nancy's
+predicted: are your grades higher or lower than nancies
+--
+```
+## Fine-Tuning Script
+The script used to produce this model is available
+[here](https://github.com/elgeish/transformers/blob/8ee49e09c91ffd5d23034ce32ed630d988c50ddf/examples/research_projects/wav2vec2/finetune_large_lv60_timit_asr.sh).
+**Note:** This model can be fine-tuned further;
+[trainer_state.json](https://huggingface.co/elgeish/wav2vec2-large-lv60-timit-asr/blob/main/trainer_state.json)
+records training details, including the last evaluation state (this checkpoint):
+```json
+{
+    "epoch": 29.51,
+    "eval_loss": 25.424150466918945,
+    "eval_runtime": 182.9499,
+    "eval_samples_per_second": 9.183,
+    "eval_wer": 0.1351704233095107,
+    "step": 8500
+}
+```
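The `eval_wer` value above is a word error rate, so 0.135 corresponds to roughly 13.5 word errors per 100 reference words. The sketch below shows how such a figure can be computed, assuming the standard definition (word-level edit distance divided by the number of reference words, pooled over all utterances). The helper names `word_edit_distance` and `corpus_wer` are illustrative and not taken from the fine-tuning script; the two sentence pairs are copied from the sample output in the Usage section.

```python
def word_edit_distance(reference, hypothesis):
    """Levenshtein distance between two lists of words."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion: reference word missing from hypothesis
                current[j - 1] + 1,      # insertion: extra word in hypothesis
                previous[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        previous = current
    return previous[-1]


def corpus_wer(references, hypotheses):
    """Total word errors divided by total reference words."""
    errors = sum(
        word_edit_distance(ref.split(), hyp.split())
        for ref, hyp in zip(references, hypotheses)
    )
    words = sum(len(ref.split()) for ref in references)
    return errors / words


references = [
    "they enjoy it when i audition",
    "don't ask me to carry an oily rag like that",
]
hypotheses = [
    "they enjoy it when i addition",
    "don't ask me to carry an oily rag like that",
]
print(f"WER: {corpus_wer(references, hypotheses):.4f}")  # 1 error / 16 words -> WER: 0.0625
```

Pooling errors over the whole set, rather than averaging per-utterance rates, keeps short utterances from dominating the score.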