---
language: en
datasets:
- patrickvonplaten/librispeech_asr_dummy
metrics:
- wer
tags:
- audio
- automatic-speech-recognition
- en
- speech
---

# Fine-tuned facebook/wav2vec2-base model for speech recognition in English

Fine-tuned [facebook/wav2vec2-base](https://huggingface.co/facebook/wav2vec2-base) on English using the train and validation splits of [zodata](https://www.kaggle.com/datasets/mohamedk0emad/zodata).
The dataset contains 307,912 transcribed voice samples; we used 6,158 samples for training and 6,036 samples for testing.
The word error rate (WER) on the test set is:
Test WER: 0.340
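
Note that WER is an error rate, not an accuracy: it counts word-level substitutions, deletions, and insertions against the reference transcript, so lower is better. A quick sanity check with [jiwer](https://github.com/jitsi/jiwer) on toy strings (not taken from the test set):

```python
from jiwer import wer

# one substitution ("world" -> "word") out of three reference words: WER = 1/3
print(wer("hello world again", "hello word again"))  # ~0.333
```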

When using this model, make sure that your speech input is sampled at 16 kHz.
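
If your recordings use a different sampling rate, resample them first. A minimal sketch using torchaudio (the file path is a placeholder; any resampling library works):

```python
import torchaudio

# load an arbitrary recording (placeholder path) and resample it to 16 kHz
waveform, sample_rate = torchaudio.load("my_recording.wav")
if sample_rate != 16_000:
    resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16_000)
    waveform = resampler(waveform)
```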

This model was fine-tuned using GPU credits provided by [Kaggle](https://www.kaggle.com/).


# Usage

To transcribe audio files, the model can be used as a standalone acoustic model as follows:

```python
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
from datasets import load_dataset
import torch

# load model and processor
processor = Wav2Vec2Processor.from_pretrained("souzan/zomodel")
model = Wav2Vec2ForCTC.from_pretrained("souzan/zomodel")

# load dummy dataset and read sound files
ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")

# tokenize
input_values = processor(ds[0]["audio"]["array"], sampling_rate=16_000, return_tensors="pt", padding="longest").input_values  # Batch size 1

# retrieve logits
logits = model(input_values).logits

# take argmax and decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
```
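
The same pipeline works on a local recording instead of the dummy dataset. A hedged sketch reusing the `processor` and `model` loaded above, assuming a 16 kHz mono file read with soundfile (the path is a placeholder):

```python
import soundfile as sf
import torch

# read a local 16 kHz mono recording (placeholder path; resample first if needed)
speech, sample_rate = sf.read("my_recording.wav")

input_values = processor(speech, sampling_rate=sample_rate, return_tensors="pt").input_values
with torch.no_grad():
    logits = model(input_values).logits

predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])
```
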
## Evaluation

This code snippet shows how to evaluate **souzan/zomodel** on LibriSpeech's "clean" test data.
 
```python
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch
from jiwer import wer


librispeech_eval = load_dataset("librispeech_asr", "clean", split="test")

model = Wav2Vec2ForCTC.from_pretrained("souzan/zomodel").to("cuda")
processor = Wav2Vec2Processor.from_pretrained("souzan/zomodel")

def map_to_pred(batch):
    # with batched=True, batch["audio"] is a list of audio dicts
    audio_arrays = [audio["array"] for audio in batch["audio"]]
    input_values = processor(audio_arrays, sampling_rate=16_000, return_tensors="pt", padding="longest").input_values
    with torch.no_grad():
        logits = model(input_values.to("cuda")).logits

    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(predicted_ids)
    batch["transcription"] = transcription
    return batch

result = librispeech_eval.map(map_to_pred, batched=True, batch_size=1, remove_columns=["audio"])

print("WER:", wer(result["text"], result["transcription"]))
```
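
Note that jiwer's `wer` is case-sensitive and LibriSpeech references are uppercase, so if the fine-tuned vocabulary emits a different casing the score will be inflated. A hedged normalization sketch (assumes punctuation is already absent on both sides):

```python
from jiwer import wer

# lowercase both sides before scoring so casing differences don't count as errors
references = [text.lower() for text in result["text"]]
hypotheses = [text.lower() for text in result["transcription"]]
print("case-insensitive WER:", wer(references, hypotheses))
```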