cahya committed on
Commit
f33e29f
1 Parent(s): 238fcb5

Update README.md

Files changed (1)
  1. README.md +46 -1
README.md CHANGED
@@ -1,3 +1,4 @@
+ ---
  language: id
  datasets:
  - common_voice
@@ -22,4 +23,48 @@ model-index:
  metrics:
  - name: Test WER
  type: wer
- value: 0.40
+ value: 0.40
+ ---
+
+ # Wav2Vec2-Large-XLSR-Indonesian
+
+ Fine-tuned [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53)
+ on the [Indonesian Common Voice dataset](https://huggingface.co/datasets/common_voice).
+ When using this model, make sure that your speech input is sampled at 16kHz.
+
+ ## Usage
+ The model can be used directly (without a language model) as follows:
+ ```python
+ import librosa
+ import torch
+ from datasets import load_dataset
+ from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
+
+ dataset = load_dataset("common_voice", "id", split="test") # "test[:n]" for n examples
+ processor = Wav2Vec2Processor.from_pretrained("cahya/wav2vec2-large-xlsr-indonesian")
+ model = Wav2Vec2ForCTC.from_pretrained("cahya/wav2vec2-large-xlsr-indonesian")
+ model.eval()
+
+ def prepare_example(example):
+     example["speech"], _ = librosa.load(example["file"], sr=16000)
+     example["text"] = example["text"].replace("-", " ").replace('! ', '')
+     example["text"] = " ".join(w for w in example["text"].split() if w != "sil")
+     return example
+
+ dataset = dataset.map(prepare_example, remove_columns=["file", "orthographic", "phonetic"])
+
+ def predict(batch):
+     inputs = processor(batch["speech"], sampling_rate=16000, return_tensors="pt", padding="longest")
+     with torch.no_grad():
+         predicted = torch.argmax(model(inputs.input_values).logits, dim=-1)
+     predicted[predicted == -100] = processor.tokenizer.pad_token_id # see fine-tuning script
+     batch["predicted"] = processor.tokenizer.batch_decode(predicted)
+     return batch
+ dataset = dataset.map(predict, batched=True, batch_size=1, remove_columns=["speech"])
+ for reference, predicted in zip(dataset["text"], dataset["predicted"]):
+     print("reference:", reference)
+     print("predicted:", predicted)
+     #print("reference (untransliterated):", buckwalter.untrans(reference))
+     #print("predicted (untransliterated):", buckwalter.untrans(predicted))
+     print("--")
+ ```
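
The metadata in this change reports a `Test WER` value, and the usage snippet prints reference/predicted pairs, but the diff does not show how the score was computed. As a rough illustration only, word error rate can be computed as the word-level edit distance between each prediction and its reference, summed over the corpus and divided by the total number of reference words. The helper name `word_error_rate` below is hypothetical and not part of the commit, and this sketch is not the author's evaluation script.

```python
def word_error_rate(references, predictions):
    """Corpus-level WER: total word edits / total reference words.

    Hypothetical helper for illustration; not part of the commit.
    """
    total_edits = 0
    total_words = 0
    for ref, hyp in zip(references, predictions):
        ref_words, hyp_words = ref.split(), hyp.split()
        # Word-level Levenshtein distance, keeping one rolling row.
        prev = list(range(len(hyp_words) + 1))
        for i, ref_word in enumerate(ref_words, start=1):
            curr = [i]
            for j, hyp_word in enumerate(hyp_words, start=1):
                cost = 0 if ref_word == hyp_word else 1
                curr.append(min(prev[j] + 1,          # deletion
                                curr[j - 1] + 1,      # insertion
                                prev[j - 1] + cost))  # substitution or match
            prev = curr
        total_edits += prev[-1]
        total_words += len(ref_words)
    # Avoid division by zero when there are no reference words.
    return total_edits / total_words if total_words else 0.0


# One missing word out of four reference words.
print(word_error_rate(["saya suka makan nasi"], ["saya suka nasi"]))  # 0.25
```

With the columns produced by the usage snippet, this could presumably be called as `word_error_rate(dataset["text"], dataset["predicted"])`. The result is a fraction; the diff does not say whether the reported value is a fraction or a percentage.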