patrickvonplaten committed
Commit 95ce0ea (1 parent: c85b633)

Create README.md
---
language: en
datasets:
- librispeech_asr
tags:
- speech
- audio
- automatic-speech-recognition
- hf-asr-leaderboard
license: apache-2.0
model-index:
- name: wav2vec2-conformer-rope-large-960h-ft
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: LibriSpeech (clean)
      type: librispeech_asr
      args: en
    metrics:
    - name: Test WER
      type: wer
      value: 1.96
---

# Wav2Vec2-Conformer-Large-960h with Rotary Position Embeddings

[Facebook's Wav2Vec2 Conformer (TODO-add link)]()

Wav2Vec2 Conformer with rotary position embeddings, pretrained and fine-tuned on 960 hours of LibriSpeech on 16kHz sampled speech audio. When using the model, make sure that your speech input is also sampled at 16kHz.
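
If your recordings are stored at a different sampling rate, one way to resample them to 16kHz is the `Audio` feature of the datasets library (a minimal sketch; the dataset name and split are placeholders for your own data):

```python
from datasets import load_dataset, Audio

# hypothetical dataset; replace with your own audio data
ds = load_dataset("your_dataset", split="test")

# decode and resample the audio column to 16kHz on access
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))
```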
32
+
33
+ [Paper (TODO)](https://arxiv.org/abs/2006.11477)
34
+
35
+ Authors: ...
36
+
37
+ **Abstract**
38
+
39
+ ...
40
+
41
+ The original model can be found under https://github.com/pytorch/fairseq/tree/master/examples/wav2vec#wav2vec-20.
42
+
43
+

# Usage

To transcribe audio files, the model can be used as a standalone acoustic model as follows:

```python
from transformers import Wav2Vec2Processor, Wav2Vec2ConformerForCTC
from datasets import load_dataset
import torch

# load model and processor
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-conformer-rope-large-960h-ft")
model = Wav2Vec2ConformerForCTC.from_pretrained("facebook/wav2vec2-conformer-rope-large-960h-ft")

# load dummy dataset and read soundfiles
ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")

# preprocess: pass the sampling rate explicitly so the feature extractor can verify it
input_values = processor(ds[0]["audio"]["array"], sampling_rate=16_000, return_tensors="pt", padding="longest").input_values

# retrieve logits
logits = model(input_values).logits

# take argmax and decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
```
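
For quick transcription, the same checkpoint can also be used through the high-level `pipeline` API, which bundles feature extraction, the forward pass, and CTC decoding (a minimal sketch; the audio file path is a placeholder and should point to a 16kHz recording or a format ffmpeg can decode):

```python
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-conformer-rope-large-960h-ft")

# "sample.flac" is a hypothetical local file
print(asr("sample.flac")["text"])
```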

## Evaluation

This code snippet shows how to evaluate **facebook/wav2vec2-conformer-rope-large-960h-ft** on LibriSpeech's "clean" and "other" test data.

```python
from datasets import load_dataset
from transformers import Wav2Vec2ConformerForCTC, Wav2Vec2Processor
import torch
from jiwer import wer


librispeech_eval = load_dataset("librispeech_asr", "clean", split="test")

model = Wav2Vec2ConformerForCTC.from_pretrained("facebook/wav2vec2-conformer-rope-large-960h-ft").to("cuda")
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-conformer-rope-large-960h-ft")

def map_to_pred(batch):
    inputs = processor(batch["audio"]["array"], sampling_rate=16_000, return_tensors="pt", padding="longest")
    input_values = inputs.input_values.to("cuda")
    attention_mask = inputs.attention_mask.to("cuda")

    with torch.no_grad():
        logits = model(input_values, attention_mask=attention_mask).logits

    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(predicted_ids)
    # batch_decode returns a list; keep the single decoded string
    batch["transcription"] = transcription[0]
    return batch

result = librispeech_eval.map(map_to_pred, remove_columns=["audio"])

print("WER:", wer(result["text"], result["transcription"]))
```
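
The "other" column of the table below can be reproduced with the same `map_to_pred` function by switching the dataset configuration (a sketch reusing the names defined in the snippet above):

```python
# noisier "other" test set; everything else stays the same
librispeech_other = load_dataset("librispeech_asr", "other", split="test")
result_other = librispeech_other.map(map_to_pred, remove_columns=["audio"])

print("WER (other):", wer(result_other["text"], result_other["transcription"]))
```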

*Result (WER)*:

| "clean" | "other" |
|---|---|
| 1.96 | 3.98 |