Update README.md
README.md
CHANGED
@@ -70,6 +70,87 @@ The architecture is based on [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m)

More information needed

## How to use

Make sure you have installed the correct dependencies for the language model-boosted version to work. You can run this command to install the `kenlm` and `pyctcdecode` libraries:

```
pip install https://github.com/kpu/kenlm/archive/master.zip pyctcdecode
```
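
To quickly check that the decoder dependencies were installed correctly, a minimal sanity check (not part of the original card):

```
# Both imports should succeed if the install command above worked.
import kenlm
import pyctcdecode
```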

With the `transformers` framework, you can load the model with the following code:

```
from transformers import AutoProcessor, AutoModelForCTC

processor = AutoProcessor.from_pretrained("gigant/romanian-wav2vec2")
model = AutoModelForCTC.from_pretrained("gigant/romanian-wav2vec2")
```
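
If you would rather run inference yourself than use the pipeline, here is a minimal sketch (an illustration, not from the original card; it assumes `audio_16` is a 1-D float array sampled at 16 kHz, such as the one built in the resampling example below):

```
import torch

# Turn the raw waveform into model inputs (the model expects 16 kHz audio).
inputs = processor(audio_16, sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Greedy CTC decoding: pick the most likely token at each frame.
# (If AutoProcessor returned a Wav2Vec2ProcessorWithLM, decode from the
# logits instead: processor.batch_decode(logits.numpy()).text)
predicted_ids = torch.argmax(logits, dim=-1)
predicted_text = processor.batch_decode(predicted_ids)[0]
```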

Or, if you want to test the model, you can load the automatic speech recognition pipeline from `transformers` with:

```
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="gigant/romanian-wav2vec2")
```
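
The pipeline also accepts a path to an audio file; for example, with a hypothetical local file:

```
# "speech.wav" is a placeholder filename, used only for illustration.
prediction = asr("speech.wav")
print(prediction["text"])
```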

## Example use with the `datasets` library

First, you need to load your data. We will use the [Romanian Speech Synthesis](https://huggingface.co/datasets/gigant/romanian_speech_synthesis_0_8_1) dataset in this example.

```
from datasets import load_dataset

dataset = load_dataset("gigant/romanian_speech_synthesis_0_8_1")
```
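
To see which splits and fields are available, you can inspect the loaded dataset (the field names below match those used later in this example):

```
print(dataset)                     # available splits and row counts
print(dataset["train"][0].keys())  # fields should include "audio" and "sentence"
```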

You can listen to the samples with the `IPython.display` library:

```
from IPython.display import Audio

i = 0
sample = dataset["train"][i]
Audio(sample["audio"]["array"], rate=sample["audio"]["sampling_rate"])
```

The model is trained to work with audio sampled at 16 kHz, so if the sampling rate of the audio in the dataset is different, we will have to resample it. In this example, the audio is sampled at 48 kHz, which we can see by checking `dataset["train"][0]["audio"]["sampling_rate"]`.

The following code resamples the audio using the `torchaudio` library:

```
import torchaudio
import torch

audio = sample["audio"]["array"]
rate = sample["audio"]["sampling_rate"]
resampler = torchaudio.transforms.Resample(rate, 16_000)
audio_16 = resampler(torch.Tensor(audio)).numpy()
```
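
Alternatively, the `datasets` library can resample on the fly; a sketch of the same step using `cast_column`:

```
from datasets import Audio as AudioFeature  # aliased to avoid shadowing IPython's Audio

# Decode every sample of the "audio" column at 16 kHz from now on.
dataset = dataset.cast_column("audio", AudioFeature(sampling_rate=16_000))
audio_16 = dataset["train"][i]["audio"]["array"]
```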

To listen to the resampled sample:

```
Audio(audio_16, rate=16_000)
```

Now you can get the model prediction by running:

```
# The ASR pipeline returns a dict with a "text" key.
predicted_text = asr(audio_16)["text"]
ground_truth = dataset["train"][i]["sentence"]

print(f"Predicted text: {predicted_text}")
print(f"Ground truth: {ground_truth}")
```
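
To put a number on the prediction quality, one could compute the word error rate, for instance with the `jiwer` library (an extra dependency, not mentioned in the original card):

```
from jiwer import wer

# Word error rate between ground truth and prediction (0.0 means a perfect match).
error = wer(ground_truth, predicted_text)
print(f"WER: {error:.3f}")
```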

## Training and evaluation data

Training data: