---
language: en
datasets:
- timit_asr
tags:
- audio
- automatic-speech-recognition
- speech
license: apache-2.0
---

# Wav2Vec2-Large-LV60-TIMIT

Fine-tuned [facebook/wav2vec2-large-lv60](https://huggingface.co/facebook/wav2vec2-large-lv60)
on the [timit_asr dataset](https://huggingface.co/datasets/timit_asr).
When using this model, make sure that your speech input is sampled at 16 kHz.
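
TIMIT audio is already sampled at 16 kHz, but audio from other sources may need resampling first.
Here is a minimal sketch using `librosa` (an assumption; any resampler works) with a hypothetical input path:

```python
import librosa

# Hypothetical input file; replace with your own audio.
speech, sample_rate = librosa.load("speech.wav", sr=None)  # keep the native rate
if sample_rate != 16000:
    # Resample to the 16 kHz rate this model expects.
    speech = librosa.resample(speech, orig_sr=sample_rate, target_sr=16000)
```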

## Usage

The model can be used directly (without a language model) as follows:

```python
import soundfile as sf
import torch
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

model_name = "elgeish/wav2vec2-large-lv60-timit-asr"
processor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2ForCTC.from_pretrained(model_name)
model.eval()

dataset = load_dataset("timit_asr", split="test").shuffle().select(range(10))
# map hyphens to spaces and drop punctuation
char_translations = str.maketrans({"-": " ", ",": "", ".": "", "?": ""})

# load the audio and normalize the reference transcript
def prepare_example(example):
    example["speech"], _ = sf.read(example["file"])
    example["text"] = example["text"].translate(char_translations)
    example["text"] = " ".join(example["text"].split())  # clean up whitespace
    example["text"] = example["text"].lower()
    return example

dataset = dataset.map(prepare_example, remove_columns=["file"])
inputs = processor(dataset["speech"], sampling_rate=16000, return_tensors="pt", padding="longest")

with torch.no_grad():
    predicted_ids = torch.argmax(model(inputs.input_values).logits, dim=-1)
predicted_ids[predicted_ids == -100] = processor.tokenizer.pad_token_id  # see fine-tuning script
predicted_transcripts = processor.tokenizer.batch_decode(predicted_ids)

for reference, predicted in zip(dataset["text"], predicted_transcripts):
    print("reference:", reference)
    print("predicted:", predicted)
    print("--")
```
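
Since `wav2vec2-large-lv60` was pre-trained with attention masks, it is generally better to pass `inputs.attention_mask` on padded batches (the processor returns one when its feature extractor sets `return_attention_mask=True`, as lv60-style checkpoints do). A minimal variant of the forward pass above:

```python
# Variant of the forward pass that masks out padded positions.
with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
predicted_ids = torch.argmax(logits, dim=-1)
```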

Here's the output:

```
reference: the emblem depicts the acropolis all aglow
predicted: the amblum depicts the acropolis all a glo
--
reference: don't ask me to carry an oily rag like that
predicted: don't ask me to carry an oily rag like that
--
reference: they enjoy it when i audition
predicted: they enjoy it when i addition
--
reference: set aside to dry with lid on sugar bowl
predicted: set aside to dry with a litt on shoogerbowl
--
reference: a boring novel is a superb sleeping pill
predicted: a bor and novel is a suberb sleeping peel
--
reference: only the most accomplished artists obtain popularity
predicted: only the most accomplished artists obtain popularity
--
reference: he has never himself done anything for which to be hated which of us has
predicted: he has never himself done anything for which to be hated which of us has
--
reference: the fish began to leap frantically on the surface of the small lake
predicted: the fish began to leap frantically on the surface of the small lake
--
reference: or certain words or rituals that child and adult go through may do the trick
predicted: or certain words or rituals that child an adult go through may do the trick
--
reference: are your grades higher or lower than nancy's
predicted: are your grades higher or lower than nancies
--
```
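
As a rough quality check on this small sample, you can compute the word error rate over the same batch.
A minimal sketch assuming the `jiwer` package (not a dependency of this card):

```python
import jiwer

# Continues the snippet above: dataset["text"] holds the normalized references,
# predicted_transcripts the model's greedy CTC decodes.
wer = jiwer.wer(list(dataset["text"]), predicted_transcripts)
print(f"WER on these 10 samples: {wer:.2%}")
```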

## Fine-Tuning Script

You can find the script used to produce this model
[here](https://github.com/elgeish/transformers/blob/8ee49e09c91ffd5d23034ce32ed630d988c50ddf/examples/research_projects/wav2vec2/finetune_large_lv60_timit_asr.sh).

**Note:** This model can be fine-tuned further;
[trainer_state.json](https://huggingface.co/elgeish/wav2vec2-large-lv60-timit-asr/blob/main/trainer_state.json)
shows useful details, namely the last state (this checkpoint):

```json
{
    "epoch": 29.51,
    "eval_loss": 25.424150466918945,
    "eval_runtime": 182.9499,
    "eval_samples_per_second": 9.183,
    "eval_wer": 0.1351704233095107,
    "step": 8500
}
```
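
To inspect that state programmatically, here is a minimal sketch that fetches `trainer_state.json` from the model repo, assuming the `huggingface_hub` package:

```python
import json

from huggingface_hub import hf_hub_download

# Download trainer_state.json from the model repo (cached locally).
path = hf_hub_download("elgeish/wav2vec2-large-lv60-timit-asr", "trainer_state.json")
with open(path) as f:
    state = json.load(f)

# The last entry of log_history is the checkpoint state shown above.
print(state["log_history"][-1])
```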