---
language: mt
datasets:
- common_voice
tags:
- audio
- automatic-speech-recognition
- speech
- xlsr-fine-tuning-week
license: apache-2.0
model-index:
- name: XLSR Wav2Vec2 Maltese by Akash PB
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice mt
      type: common_voice
      args: mt
    metrics:
    - name: Test WER
      type: wer
      value: 29.42
---
# Wav2Vec2-Large-XLSR-53-Maltese

Fine-tuned [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on Maltese using the [Common Voice](https://huggingface.co/datasets/common_voice) dataset.

When using this model, make sure that your speech input is sampled at 16 kHz.
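
One quick way to check the rate before running inference is to read it from the file header. The sketch below assumes uncompressed WAV input and uses only the standard-library `wave` module; the helper name `needs_resampling` is illustrative and not part of the model's API.

```python
import wave

TARGET_SAMPLE_RATE = 16_000  # rate the model expects, per the note above


def needs_resampling(wav_path, target_rate=TARGET_SAMPLE_RATE):
    """Return True if the WAV file at `wav_path` is not sampled at `target_rate`."""
    with wave.open(wav_path, "rb") as wav_file:
        return wav_file.getframerate() != target_rate
```

Files for which this returns `True` should be resampled first, for example with `torchaudio.transforms.Resample`, which the script in the Usage section applies to 48 kHz Common Voice clips.
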
## Usage

The model can be used directly (without a language model). The following script evaluates it on the Common Voice Maltese test split and prints the word error rate (WER):

```python
import re

import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

model_name = "Akashpb13/xlsr_maltese_wav2vec2"
# Use the GPU when one is available, otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
# Punctuation and symbols stripped from the reference transcripts before scoring.
chars_to_ignore_regex = r'[,?.!\-;:"“%‘”�()*]'

model = Wav2Vec2ForCTC.from_pretrained(model_name).to(device)
processor = Wav2Vec2Processor.from_pretrained(model_name)

ds = load_dataset("common_voice", "mt", split="test", data_dir="./cv-corpus-6.1-2020-12-11")

# The script assumes 48 kHz source audio; the model expects 16 kHz.
resampler = torchaudio.transforms.Resample(orig_freq=48_000, new_freq=16_000)


def map_to_array(batch):
    """Load one clip, resample it to 16 kHz, and normalize its transcript."""
    speech, _ = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech.squeeze(0)).numpy()
    batch["sampling_rate"] = resampler.new_freq
    batch["sentence"] = re.sub(chars_to_ignore_regex, "", batch["sentence"]).lower() + " "
    return batch


ds = ds.map(map_to_array)


def map_to_pred(batch):
    """Run the model on a batch and greedily decode the CTC output."""
    features = processor(
        batch["speech"],
        sampling_rate=batch["sampling_rate"][0],
        padding=True,
        return_tensors="pt",
    )
    input_values = features.input_values.to(device)
    attention_mask = features.attention_mask.to(device)
    with torch.no_grad():
        logits = model(input_values, attention_mask=attention_mask).logits
    pred_ids = torch.argmax(logits, dim=-1)
    batch["predicted"] = processor.batch_decode(pred_ids)
    batch["target"] = batch["sentence"]
    return batch


result = ds.map(map_to_pred, batched=True, batch_size=1, remove_columns=list(ds.features.keys()))

wer = load_metric("wer")
# wer.compute returns a fraction; scale it to a percentage to match the reported figure.
wer_score = wer.compute(predictions=result["predicted"], references=result["target"])
print(f"WER: {100 * wer_score:.2f} %")
```
**Test Result**: 29.42 %
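
For reference, word error rate is the word-level edit distance (substitutions, deletions and insertions) between prediction and reference, divided by the number of reference words. The following is a minimal pure-Python sketch of that definition for a single sentence pair. The function name `word_error_rate` and the example sentences are illustrative; the figure above was computed with the `wer` metric from `datasets`, not with this code.

```python
def word_error_rate(reference, prediction):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = prediction.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous_row[j]: edit distance between the reference words processed
    # so far and the first j predicted words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row
    return previous_row[-1] / len(ref_words)


# Illustrative sentences: one substituted word out of four reference words.
print(word_error_rate("il kelb qed jiġri", "il kelb qed jigri"))  # 0.25
```
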
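
The decoding step in the usage script (`torch.argmax` followed by `processor.batch_decode`) is greedy CTC decoding: take the most likely token per frame, merge consecutive repeats, then drop the blank token. Below is a toy sketch of that collapse rule. The four-symbol vocabulary and the blank id are made up for illustration; the real tokenizer's vocabulary and ids differ.

```python
def ctc_greedy_collapse(frame_ids, id_to_char, blank_id=0):
    """Merge consecutive duplicate ids, then drop blanks and map ids to characters."""
    chars = []
    previous_id = None
    for token_id in frame_ids:
        # Emit a character only when the id changes and is not the blank.
        if token_id != previous_id and token_id != blank_id:
            chars.append(id_to_char[token_id])
        previous_id = token_id
    return "".join(chars)


# Hypothetical vocabulary; id 0 plays the role of the CTC blank.
toy_vocab = {1: "i", 2: "l", 3: " "}
frames = [1, 1, 0, 2, 2, 2, 0, 0, 3, 1, 0, 2]
print(ctc_greedy_collapse(frames, toy_vocab))  # il il
```

Note that a blank between two identical ids keeps both characters, which is how CTC represents genuinely doubled letters.
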