vitouphy committed
Commit dece104 • 1 Parent(s): 995ee44

Update README.md

Files changed (1):
  1. README.md +48 -6
README.md CHANGED
@@ -62,17 +62,59 @@ It achieves the following results on the evaluation set:
 - Since this dataset is small (4 hours of voice recordings), we decided not to train for too long, to avoid overfitting and under-generalization.
 - This model performs worse than its 300M variant, probably because we did not explore the hyperparameters enough.
 
- ## Model description
-
- More information needed
-
- ## Intended uses & limitations
-
- More information needed
-
- ## Training and evaluation data
-
- More information needed
+
+ ## Installation
+ Install the following libraries on top of HuggingFace Transformers for language model support (see the language-model decoding sketch after Approach 2 below):
+ ```
+ pip install pyctcdecode
+ pip install https://github.com/kpu/kenlm/archive/master.zip
+ ```
+
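+ A quick, optional sanity check that both decoding dependencies were built and import cleanly (a minimal sketch; simply importing `kenlm` verifies its compiled extension):
+ ```python
+ # Verify the optional LM-decoding dependencies are installed
+ import kenlm
+ import pyctcdecode
+
+ print("pyctcdecode and kenlm imported successfully")
+ ```
+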
+ ## Usage
+
+ **Approach 1:** Use HuggingFace's pipeline; it covers everything end-to-end, from raw audio input to text output.
+ ```python
+ from transformers import pipeline
+
+ # Load the model
+ pipe = pipeline(model="vitouphy/wav2vec2-xls-r-300m-khmer")
+
+ # Process raw audio; chunk_length_s/stride_length_s split long audio into
+ # overlapping windows so files of any length can be transcribed
+ output = pipe("sound_file.wav", chunk_length_s=10, stride_length_s=(4, 2))
+ ```
+
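+ The pipeline also accepts in-memory audio. A minimal sketch, assuming a mono 16 kHz signal loaded with librosa (the dict form with `raw` and `sampling_rate` keys is the ASR pipeline's documented input format; the file name is a placeholder):
+ ```python
+ import librosa
+ from transformers import pipeline
+
+ pipe = pipeline(model="vitouphy/wav2vec2-xls-r-300m-khmer")
+
+ # Load and resample to 16 kHz, then pass the raw array with its rate
+ audio, _ = librosa.load("sound_file.wav", sr=16_000)
+ output = pipe({"raw": audio, "sampling_rate": 16_000})
+ ```
+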
+ **Approach 2:** A more customizable way to predict phonemes.
+ ```python
+ from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
+ import librosa
+ import torch
+
+ # Load model and processor
+ processor = Wav2Vec2Processor.from_pretrained("vitouphy/wav2vec2-xls-r-300m-khmer")
+ model = Wav2Vec2ForCTC.from_pretrained("vitouphy/wav2vec2-xls-r-300m-khmer")
+
+ # Read the input and resample it to 16 kHz
+ speech_array, sampling_rate = librosa.load("sound_file.wav", sr=16_000)
+ inputs = processor(speech_array, sampling_rate=16_000, return_tensors="pt", padding=True)
+
+ with torch.no_grad():
+     logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
+
+ # Greedy CTC decoding: take the most likely token at each frame
+ predicted_ids = torch.argmax(logits, dim=-1)
+ predicted_sentences = processor.batch_decode(predicted_ids)
+ print(predicted_sentences)
+ ```
+
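+ With pyctcdecode and kenlm installed (see Installation), the CTC output can be rescored with an n-gram language model. A minimal sketch, assuming this repository ships the language-model files that `Wav2Vec2ProcessorWithLM` expects (not verified here); it reuses the `logits` from Approach 2:
+ ```python
+ from transformers import Wav2Vec2ProcessorWithLM
+
+ # Assumption: the checkpoint bundles pyctcdecode/kenlm artifacts;
+ # from_pretrained will fail if it does not.
+ processor_with_lm = Wav2Vec2ProcessorWithLM.from_pretrained("vitouphy/wav2vec2-xls-r-300m-khmer")
+
+ # batch_decode here takes raw logits (as numpy), not argmax ids,
+ # and returns the LM-rescored transcriptions in .text
+ transcription = processor_with_lm.batch_decode(logits.numpy()).text
+ print(transcription)
+ ```
+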
+ ## Intended uses & limitations
+
+ The data used for this model is only around 4 hours of recordings.
+ - We split it 80/10/10, so the training set is about 3.2 hours, which is very small.
+ - Yet its performance is not too bad; quite interesting for such a small dataset, actually. You can try it out.
+ - Its limitations are:
+   - Rare characters, e.g. ឬស្សី ឪឡឹក
+   - Speech needs to be clear and articulate.
+ - More data covering more vocabulary and characters may help improve this system.
 
 ## Training procedure