---
language: en
datasets:
- LIUM/tedlium
tags:
- speech
license: apache-2.0
---
# Wav2Vec2-Large-Tedlium
The Wav2Vec2 large model fine-tuned on the TEDLIUM corpus.

The model is initialised with Facebook's [Wav2Vec2 large LV-60k](https://huggingface.co/facebook/wav2vec2-large-lv60) checkpoint, pre-trained on 60,000h of audiobooks from the LibriVox project. It is fine-tuned on 452h of TED talks from the [TEDLIUM](https://huggingface.co/datasets/LIUM/tedlium) corpus (Release 3). When using the model, make sure that your speech input is sampled at 16 kHz.
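
If your audio was recorded at a different rate, it needs to be resampled before it reaches the processor. The following is a minimal sketch of that step using plain linear interpolation in NumPy; the helper name `resample_linear` and the 44.1 kHz test tone are assumptions made for illustration and are not part of this repository. Linear interpolation does no anti-alias filtering, so for real data a band-limited resampler from a dedicated audio library is likely the better choice.

```python
import numpy as np

TARGET_SR = 16_000  # sampling rate the model expects


def resample_linear(audio: np.ndarray, orig_sr: int, target_sr: int = TARGET_SR) -> np.ndarray:
    """Naively resample a mono waveform by linear interpolation (hypothetical helper)."""
    if orig_sr == target_sr:
        return audio.astype(np.float32)
    # keep the duration fixed and change the number of samples
    n_target = int(round(len(audio) / orig_sr * target_sr))
    old_times = np.arange(len(audio)) / orig_sr
    new_times = np.arange(n_target) / target_sr
    return np.interp(new_times, old_times, audio).astype(np.float32)


# one second of a 440 Hz tone "recorded" at 44.1 kHz
sr = 44_100
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t).astype(np.float32)

resampled = resample_linear(tone, sr)
print(resampled.shape)  # one second of audio at 16 kHz
```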

The model achieves a word error rate (WER) of 8.4% on the dev set and 8.2% on the test set. [Training logs](https://wandb.ai/sanchit-gandhi/tedlium/runs/10c85yc4?workspace=user-sanchit-gandhi) document the training and evaluation progress over 50k steps of fine-tuning.
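
For readers unfamiliar with the metric, WER is the word-level edit distance between the reference and the predicted transcript, divided by the number of reference words: WER = (S + D + I) / N, where S, D and I count substitutions, deletions and insertions. The sketch below computes it in pure Python; the function name `word_error_rate` and the example sentences are illustrative assumptions, and it is not the code that produced the numbers above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# one substitution ("sat" -> "sit") and one deletion ("the") over six reference words
score = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER: {score:.3f}")
```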

See [this notebook](https://colab.research.google.com/drive/1FjTsqbYKphl9kL-eILgUc-bl4zVThL8F?usp=sharing) for more information on how this model was fine-tuned.


# Usage
To transcribe audio files, the model can be used as a standalone acoustic model as follows:
```python
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
from datasets import load_dataset
import torch

# load model and processor
processor = Wav2Vec2Processor.from_pretrained("sanchit-gandhi/wav2vec2-large-tedlium")
model = Wav2Vec2ForCTC.from_pretrained("sanchit-gandhi/wav2vec2-large-tedlium")

# load dummy dataset
ds = load_dataset("sanchit-gandhi/tedlium_dummy", split="validation")
sample = ds[0]

# process audio inputs (batch size 1); the audio must be sampled at 16 kHz
input_values = processor(
    sample["audio"]["array"],
    sampling_rate=sample["audio"]["sampling_rate"],
    return_tensors="pt",
    padding="longest",
).input_values

# retrieve logits without tracking gradients
with torch.no_grad():
    logits = model(input_values).logits

# take argmax and decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
print("Target: ", sample["text"])
print("Transcription: ", transcription[0])
```
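
The "take argmax and decode" step is greedy CTC decoding: pick the most likely token for each frame, merge consecutive repeats, then drop the blank token. The sketch below illustrates that collapse on a toy example. The vocabulary, the blank id and the `|` word delimiter are hypothetical stand-ins chosen for illustration, not this model's actual vocabulary, and `processor.batch_decode` should be used in practice.

```python
from itertools import groupby

# Hypothetical toy vocabulary; the real one ships with the model's tokenizer.
BLANK_ID = 0
ID_TO_CHAR = {0: "<pad>", 1: "|", 2: "h", 3: "e", 4: "l", 5: "o"}


def greedy_ctc_decode(frame_ids):
    """Collapse per-frame argmax ids into text (illustrative helper)."""
    # 1. merge runs of identical ids: [2, 2, 0, 3] -> [2, 0, 3]
    collapsed = [token for token, _ in groupby(frame_ids)]
    # 2. drop blanks; a blank between two equal ids keeps a genuine double letter
    chars = [ID_TO_CHAR[token] for token in collapsed if token != BLANK_ID]
    # 3. turn the word delimiter into a space
    return "".join(chars).replace("|", " ").strip()


# per-frame argmax ids, as produced by torch.argmax(logits, dim=-1) for one utterance
frame_ids = [2, 2, 0, 3, 4, 4, 0, 4, 5, 5, 0, 1, 1, 2, 3, 3]
decoded = greedy_ctc_decode(frame_ids)
print(decoded)
```

Note how the blank between the two runs of `l` is what allows the double letter to survive the merge step.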
 
## Evaluation
 
This code snippet shows how to evaluate **Wav2Vec2-Large-Tedlium** on the TEDLIUM test data.
 
```python
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch
from jiwer import wer

tedlium_eval = load_dataset("LIUM/tedlium", "release3", split="test")

model = Wav2Vec2ForCTC.from_pretrained("sanchit-gandhi/wav2vec2-large-tedlium").to("cuda")
processor = Wav2Vec2Processor.from_pretrained("sanchit-gandhi/wav2vec2-large-tedlium")


def map_to_pred(batch):
    # with batched=True, batch["audio"] is a list of audio dicts, one per example
    arrays = [audio["array"] for audio in batch["audio"]]
    input_values = processor(arrays, sampling_rate=16_000, return_tensors="pt", padding="longest").input_values
    with torch.no_grad():
        logits = model(input_values.to("cuda")).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    # batch_decode returns one string per example, matching the batched column layout
    batch["transcription"] = processor.batch_decode(predicted_ids)
    return batch


# drop the decoded audio from the result to keep it small
result = tedlium_eval.map(map_to_pred, batched=True, batch_size=1, remove_columns=["audio"])

print("WER:", wer(result["text"], result["transcription"]))
```