patrickvonplaten committed on
Commit e8e5b69
1 Parent(s): 1e9cbae

Update README.md

Files changed (1)
  1. README.md +20 -59
README.md CHANGED
@@ -9,7 +9,7 @@ tags:
 - hf-asr-leaderboard
 license: apache-2.0
 model-index:
-- name: wav2vec2-conformer-rel-pos-large-960h-ft
+- name: wav2vec2-conformer-rel-pos-large-960h-ft-4-gram
   results:
   - task:
       name: Automatic Speech Recognition
@@ -21,89 +21,50 @@ model-index:
     metrics:
     - name: Test WER
       type: wer
-      value: 1.85
+      value: --
 ---
 
-# Wav2Vec2-Conformer-Large-960h with Relative Position Embeddings
+# Wav2Vec2-Conformer-Large-960h with Relative Position Embeddings + 4-gram
 
-[Facebook's Wav2Vec2 Conformer (TODO-add link)]()
-
-Wav2Vec2 Conformer with relative position embeddings, pretrained and fine-tuned on 960 hours of Librispeech on 16kHz sampled speech audio. When using the model, make sure that your speech input is also sampled at 16kHz.
-
-[Paper (TODO)](https://arxiv.org/abs/2006.11477)
-
-Authors: ...
-
-**Abstract**
-
-...
-
-The original model can be found under https://github.com/pytorch/fairseq/tree/master/examples/wav2vec#wav2vec-20.
-
-
-# Usage
-
-To transcribe audio files, the model can be used as a standalone acoustic model as follows:
-
-```python
-from transformers import Wav2Vec2Processor, Wav2Vec2ConformerForCTC
-from datasets import load_dataset
-import torch
-
-# load model and processor
-processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-conformer-rel-pos-large-960h-ft")
-model = Wav2Vec2ConformerForCTC.from_pretrained("facebook/wav2vec2-conformer-rel-pos-large-960h-ft")
-
-# load dummy dataset and read soundfiles
-ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
-
-# tokenize
-input_values = processor(ds[0]["audio"]["array"], return_tensors="pt", padding="longest").input_values
-
-# retrieve logits
-logits = model(input_values).logits
-
-# take argmax and decode
-predicted_ids = torch.argmax(logits, dim=-1)
-transcription = processor.batch_decode(predicted_ids)
-```
-
-## Evaluation
-
-This code snippet shows how to evaluate **facebook/wav2vec2-conformer-rel-pos-large-960h-ft** on LibriSpeech's "clean" and "other" test data.
-
-
-```python
-from datasets import load_dataset
-from transformers import Wav2Vec2ConformerForCTC, Wav2Vec2Processor
-import torch
-from jiwer import wer
-
-librispeech_eval = load_dataset("librispeech_asr", "clean", split="test")
-
-model = Wav2Vec2ConformerForCTC.from_pretrained("facebook/wav2vec2-conformer-rel-pos-large-960h-ft").to("cuda")
-processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-conformer-rel-pos-large-960h-ft")
-
-def map_to_pred(batch):
-    inputs = processor(batch["audio"]["array"], return_tensors="pt", padding="longest")
-    input_values = inputs.input_values.to("cuda")
-    attention_mask = inputs.attention_mask.to("cuda")
-
-    with torch.no_grad():
-        logits = model(input_values, attention_mask=attention_mask).logits
-
-    predicted_ids = torch.argmax(logits, dim=-1)
-    transcription = processor.batch_decode(predicted_ids)[0]
-    batch["transcription"] = transcription
-    return batch
-
-result = librispeech_eval.map(map_to_pred, remove_columns=["audio"])
-
-print("WER:", wer(result["text"], result["transcription"]))
-```
-
-*Result (WER)*:
-
-| "clean" | "other" |
-|---|---|
-| 1.85 | 3.82 |
+This model is identical to [Facebook's wav2vec2-conformer-rel-pos-large-960h-ft](https://huggingface.co/facebook/wav2vec2-conformer-rel-pos-large-960h-ft), but is
+augmented with an English 4-gram language model. The `4-gram.arpa.gz` from [Librispeech's official ngrams](https://www.openslr.org/11) is used.
+
+## Evaluation
+
+This code snippet shows how to evaluate **patrickvonplaten/wav2vec2-conformer-rel-pos-large-960h-ft-4-gram** on LibriSpeech's "clean" and "other" test data.
+
+```python
+from datasets import load_dataset
+from transformers import AutoModelForCTC, AutoProcessor
+import torch
+from jiwer import wer
+
+model_id = "patrickvonplaten/wav2vec2-conformer-rel-pos-large-960h-ft-4-gram"
+
+librispeech_eval = load_dataset("librispeech_asr", "other", split="test")
+
+model = AutoModelForCTC.from_pretrained(model_id).to("cuda")
+processor = AutoProcessor.from_pretrained(model_id)
+
+def map_to_pred(batch):
+    inputs = processor(batch["audio"]["array"], sampling_rate=16_000, return_tensors="pt")
+
+    inputs = {k: v.to("cuda") for k, v in inputs.items()}
+
+    with torch.no_grad():
+        logits = model(**inputs).logits
+
+    transcription = processor.batch_decode(logits.cpu().numpy()).text[0]
+    batch["transcription"] = transcription
+    return batch
+
+result = librispeech_eval.map(map_to_pred, remove_columns=["audio"])
+
+print(wer(result["text"], result["transcription"]))
+```
+
+*Result (WER)*:
+
+| "clean" | "other" |
+|---|---|
+| -- | -- |