File size: 3,800 Bytes
f54a7f0
 
 
 
 
 
 
 
 
 
 
cf01f28
f54a7f0
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
---
language: en
datasets:
- timit_asr
tags:
- audio
- automatic-speech-recognition
- speech
license: apache-2.0
---

# Wav2Vec2-Large-LV60-TIMIT

Fine-tuned [facebook/wav2vec2-large-lv60](https://huggingface.co/facebook/wav2vec2-large-lv60)
on the [timit_asr dataset](https://huggingface.co/datasets/timit_asr).
When using this model, make sure that your speech input is sampled at 16kHz.

## Usage

The model can be used directly (without a language model) as follows:

```python
import soundfile as sf
import torch
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

model_name = "elgeish/wav2vec2-large-lv60-timit-asr"
processor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2ForCTC.from_pretrained(model_name)
model.eval()

dataset = load_dataset("timit_asr", split="test").shuffle().select(range(10))
char_translations = str.maketrans({"-": " ", ",": "", ".": "", "?": ""})

def prepare_example(example):
    example["speech"], _ = sf.read(example["file"])
    example["text"] = example["text"].translate(char_translations)
    example["text"] = " ".join(example["text"].split())  # clean up whitespaces
    example["text"] = example["text"].lower()
    return example

dataset = dataset.map(prepare_example, remove_columns=["file"])
inputs = processor(dataset["speech"], sampling_rate=16000, return_tensors="pt", padding="longest")

with torch.no_grad():
    predicted_ids = torch.argmax(model(inputs.input_values).logits, dim=-1)
predicted_ids[predicted_ids == -100] = processor.tokenizer.pad_token_id  # see fine-tuning script
predicted_transcripts = processor.tokenizer.batch_decode(predicted_ids)

for reference, predicted in zip(dataset["text"], predicted_transcripts):
    print("reference:", reference)
    print("predicted:", predicted)
    print("--")
```

Here's the output:

```
reference: the emblem depicts the acropolis all aglow
predicted: the amblum depicts the acropolis all a glo
--
reference: don't ask me to carry an oily rag like that
predicted: don't ask me to carry an oily rag like that
--
reference: they enjoy it when i audition
predicted: they enjoy it when i addition
--
reference: set aside to dry with lid on sugar bowl
predicted: set aside to dry with a litt on shoogerbowl
--
reference: a boring novel is a superb sleeping pill
predicted: a bor and novel is a suberb sleeping peel
--
reference: only the most accomplished artists obtain popularity
predicted: only the most accomplished artists obtain popularity
--
reference: he has never himself done anything for which to be hated which of us has
predicted: he has never himself done anything for which to be hated which of us has
--
reference: the fish began to leap frantically on the surface of the small lake
predicted: the fish began to leap frantically on the surface of the small lake
--
reference: or certain words or rituals that child and adult go through may do the trick
predicted: or certain words or rituals that child an adult go through may do the trick
--
reference: are your grades higher or lower than nancy's
predicted: are your grades higher or lower than nancies
--
```

## Fine-Tuning Script

You can find the script used to produce this model
[here](https://github.com/elgeish/transformers/blob/8ee49e09c91ffd5d23034ce32ed630d988c50ddf/examples/research_projects/wav2vec2/finetune_large_lv60_timit_asr.sh).

**Note:** This model can be fine-tuned further;
[trainer_state.json](https://huggingface.co/elgeish/wav2vec2-large-lv60-timit-asr/blob/main/trainer_state.json)
shows useful details, namely the last state (this checkpoint):

```json
{
    "epoch": 29.51,
    "eval_loss": 25.424150466918945,
    "eval_runtime": 182.9499,
    "eval_samples_per_second": 9.183,
    "eval_wer": 0.1351704233095107,
    "step": 8500
}
```