vitouphy committed on
Commit
b72f955
•
1 Parent(s): b37a8c0

Update README.md

Files changed (1): README.md +46 -6
README.md CHANGED
@@ -57,17 +57,57 @@ It achieves the following results on the evaluation set:
  - WER: 0.257040856802856
  - CER: 0.07025001801282513

- ## Model description

- More information needed

- ## Intended uses & limitations

- More information needed

- ## Training and evaluation data

- More information needed

+ ## Installation
+ Install the following libraries on top of HuggingFace Transformers to support decoding with a language model:
+ ```
+ pip install pyctcdecode
+ pip install https://github.com/kpu/kenlm/archive/master.zip
+ ```
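As a quick sanity check (not part of the original card), you can confirm the optional decoding dependencies are importable before running inference; `check_deps` is a hypothetical helper name used only for this illustration:

```python
from importlib.util import find_spec

def check_deps(mods=("pyctcdecode", "kenlm")):
    """Report which optional decoding dependencies are importable."""
    return {m: find_spec(m) is not None for m in mods}

print(check_deps())  # e.g. {'pyctcdecode': True, 'kenlm': True} after installation
```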

+ ## Usage

+ **Approach 1:** Use HuggingFace's pipeline, which handles everything end-to-end, from raw audio input to text output.
+ ```python
+ from transformers import pipeline
+
+ # Load the model
+ pipe = pipeline(model="vitouphy/wav2vec2-xls-r-300m-khmer")
+
+ # Process raw audio
+ output = pipe("sound_file.wav", chunk_length_s=10, stride_length_s=(4, 2))
+ ```
+
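The `chunk_length_s` and `stride_length_s` arguments let the pipeline transcribe long files in overlapping windows: each 10 s chunk shares 4 s with the previous window and 2 s with the next, so each window contributes 10 - 4 - 2 = 4 s of fresh audio. A rough sketch of that windowing arithmetic (an illustration, not the pipeline's actual implementation):

```python
def chunk_spans(total_s, chunk_s=10, stride_left_s=4, stride_right_s=2):
    """Illustrative windowing: (start, end) spans in seconds over the audio."""
    step = chunk_s - stride_left_s - stride_right_s  # fresh audio per window
    spans = []
    start = 0
    while True:
        spans.append((start, min(start + chunk_s, total_s)))
        if start + chunk_s >= total_s:
            break
        start += step
    return spans

print(chunk_spans(25))  # [(0, 10), (4, 14), (8, 18), (12, 22), (16, 25)]
```

The overlapping logits at the chunk edges are discarded during decoding, which reduces boundary artifacts on long recordings.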
+ **Approach 2:** A more customizable approach, running the model and processor directly to predict the phonemes.
+ ```python
+ from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
+ import librosa
+ import torch

+ # Load model and processor
+ processor = Wav2Vec2Processor.from_pretrained("vitouphy/wav2vec2-xls-r-300m-khmer")
+ model = Wav2Vec2ForCTC.from_pretrained("vitouphy/wav2vec2-xls-r-300m-khmer")

+ # Read the input and resample it to 16 kHz
+ speech_array, sampling_rate = librosa.load("sound_file.wav", sr=16_000)
+ inputs = processor(speech_array, sampling_rate=16_000, return_tensors="pt", padding=True)

+ # Run inference and take the most likely token at each frame
+ with torch.no_grad():
+     logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

+ predicted_ids = torch.argmax(logits, dim=-1)
+ predicted_sentences = processor.batch_decode(predicted_ids)
+ print(predicted_sentences)
+ ```

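In this approach, `processor.batch_decode` turns the per-frame argmax ids into text by applying the greedy CTC collapse rule. A toy sketch of that rule (the blank id and vocabulary here are hypothetical, not the model's actual ones):

```python
BLANK = 0                          # CTC blank id (assumed 0 for this sketch)
VOCAB = {1: "ក", 2: "ខ", 3: "គ"}   # toy vocabulary, not the model's

def greedy_ctc_decode(frame_ids):
    """Collapse repeated ids, then drop blanks (greedy CTC decoding)."""
    out, prev = [], None
    for i in frame_ids:
        if i != prev and i != BLANK:  # keep only new, non-blank ids
            out.append(VOCAB[i])
        prev = i
    return "".join(out)

print(greedy_ctc_decode([1, 1, 0, 1, 2, 2]))  # "កកខ"
```

Note how the blank separates two genuine repetitions of the same character, while consecutive duplicates without a blank collapse to one.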
+ ## Intended uses & limitations

+ The data used to train this model amounts to only about 4 hours of recordings.
+ - We split it 80/10/10, so the training set is only about 3.2 hours, which is very small.
+ - Yet its performance is not bad. Quite interesting for such a small dataset, actually. You can try it out.
+ - Its limitations are:
+   - Rare characters, e.g. ឬស្សី ឪឡឹក
+   - Speech needs to be clear and articulate.
+ - More data to cover more vocabulary and characters may help improve this system.

  ## Training procedure