devasheeshG committed
Commit 013bf1c
1 Parent(s): c960644

edit ReadMe.md

Files changed (3):
  1. .gitignore +2 -1
  2. README.md +53 -30
  3. __init__.py +49 -49
.gitignore CHANGED
@@ -1 +1,2 @@
-__pycache__
+__pycache__
+.vscode
README.md CHANGED
@@ -75,7 +75,7 @@ model-index:
 value: 0
 name: Test CER
 description: Character Error Rate
-
+
 - task:
 type: automatic-speech-recognition
 name: Automatic Speech Recognition
@@ -216,7 +216,6 @@ language:
 - ba
 - jw
 - su
-
 ---
 ## Versions:

@@ -233,17 +232,18 @@ language:
 - RAM: 2.8 GB (Original_Model: 5.5GB)
 - VRAM: 1812 MB (Original_Model: 6GB)
 - test.wav: 23 s (Multilingual Speech i.e. English+Hindi)
+
 - **Time in seconds for Processing by each device**

-| Device Name | float32 (Original) | float16 | CudaCores | TensorCores |
-| ----------------- | -------------------- | ------- | --------- | ----------- |
-| 3060 | 1.7 | 1.1 | 3,584 | 112 |
-| 1660 Super | OOM | 3.3 | 1,408 | N/A |
-| Collab (Tesla T4) | 2.8 | 2.2 | 2,560 | 320 |
-| Collab (CPU) | 35 | N/A | N/A | N/A |
-| M1 (CPU) | - | - | - | - |
-| M1 (GPU -> 'mps') | - | - | - | - |
-
+| Device Name | float32 (Original) | float16 | CudaCores | TensorCores |
+| ----------------- | ------------------ | ------- | --------- | ----------- |
+| 3060 | 1.7 | 1.1 | 3,584 | 112 |
+| 1660 Super | OOM | 3.3 | 1,408 | N/A |
+| Collab (Tesla T4) | 2.8 | 2.2 | 2,560 | 320 |
+| Collab (CPU) | 35 | N/A | N/A | N/A |
+| M1 (CPU) | - | - | - | - |
+| M1 (GPU -> 'mps') | - | - | - | - |
+

 - **NOTE: TensorCores are efficient in mixed-precision calculations**
 - **CPU -> torch.float16 not supported on CPU (AMD Ryzen 5 3600 or Collab CPU)**
@@ -257,46 +257,66 @@ language:
 - **WIP: Word Information Preserved**
 - **CER: Character Error Rate**

-### Hindi (test.tsv) [Common Voice 14.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_14_0)
+### Hindi to Hindi (test.tsv) [Common Voice 14.0](https://commonvoice.mozilla.org/en/datasets)
+
 **Test done on RTX 3060 on 2557 Samples**
-| | WER | MER | WIL | WIP | CER |
-| ----------------------- | -------------------- | ------- | --------- | ----------- | ----- |
-| Original_Model (54 min) | 52.02 | 47.86 | 66.82 | 33.17 | 23.76 |
-| This_Model (38 min) | 54.97 | 47.86 | 66.83 | 33.16 | 30.23 |

-### English ([LibriSpeech](https://huggingface.co/datasets/librispeech_asr) -> test-clean)
+| | WER | MER | WIL | WIP | CER |
+| ----------------------- | ----- | ----- | ----- | ----- | ----- |
+| Original_Model (54 min) | 52.02 | 47.86 | 66.82 | 33.17 | 23.76 |
+| This_Model (38 min) | 54.97 | 47.86 | 66.83 | 33.16 | 30.23 |
+
+### Hindi to English (test.tsv) [Common Voice 14.0](https://commonvoice.mozilla.org/en/datasets)
+
+**Test done on RTX 3060 on 1000 Samples**
+
+| | WER | MER | WIL | WIP | CER |
+| ----------------------- | --- | --- | --- | --- | --- |
+| Original_Model (30 min) | - | - | - | - | - |
+| This_Model (20 min) | - | - | - | - | - |
+
+### English ([LibriSpeech](https://huggingface.co/datasets/librispeech_asr) -> test-clean)
+
 **Test done on RTX 3060 on __ Samples**
-| | WER | MER | WIL | WIP | CER |
-| ----------------- | -------------------- | ------- | --------- | ----------- | --- |
-| Original_Model | - | - | - | - | - |
-| This_Model | - | - | - | - | - |

-### English ([LibriSpeech](https://huggingface.co/datasets/librispeech_asr) -> test-other)
+| | WER | MER | WIL | WIP | CER |
+| -------------- | --- | --- | --- | --- | --- |
+| Original_Model | - | - | - | - | - |
+| This_Model | - | - | - | - | - |
+
+### English ([LibriSpeech](https://huggingface.co/datasets/librispeech_asr) -> test-other)
+
 **Test done on RTX 3060 on __ Samples**
-| | WER | MER | WIL | WIP | CER |
-| ----------------- | -------------------- | ------- | --------- | ----------- | --- |
-| Original_Model | - | - | - | - | - |
-| This_Model | - | - | - | - | - |
+
+| | WER | MER | WIL | WIP | CER |
+| -------------- | --- | --- | --- | --- | --- |
+| Original_Model | - | - | - | - | - |
+| This_Model | - | - | - | - | - |

 - **'jiwer' library is used for calculations**

 ## Code for conversion:
-- ### [Will be soon Uploaded on Github](https://github.com/devasheeshG)
+
+- ### [Will be soon Uploaded on Github](https://github.com/devasheeshG)

 ## Usage
+
 A file ``__init__.py`` is contained inside this repo which contains all the code to use this model.

 Firstly, clone this repo and place all the files inside a folder.
+
 ### Make sure you have git-lfs installed (https://git-lfs.com)
+
 ```bash
 git lfs install
 git clone https://huggingface.co/devasheeshG/whisper_medium_fp16_transformers
 ```
+
 **Please try in jupyter notebook**

 ```python
 # Import the Model
-from whisper_medium_fp16_transformers import Model
+from whisper_medium_fp16_transformers import Model, load_audio, pad_or_trim
 ```

 ```python
@@ -310,12 +330,15 @@ model = Model(

 ```python
 # Load Audio
-audio = model.load_audio('whisper_medium_fp16_transformers/test.wav')
+audio = load_audio('whisper_medium_fp16_transformers/test.wav')
+audio = pad_or_trim(audio)
 ```

 ```python
 # Transcribe (First transcription takes time)
 model.transcribe(audio)
 ```
+
 ## Credits
-It is fp16 version of ```openai/whisper-medium```
+
+It is fp16 version of ``openai/whisper-medium``
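The README states that the WER/MER/WIL/WIP/CER figures above are computed with the `jiwer` library. A minimal sketch of how such numbers can be reproduced follows; the sample transcripts are made up for illustration and are not taken from the test sets.

```python
# Sketch: computing the metrics reported in the README tables with jiwer.
# `references` are ground-truth transcripts, `hypotheses` are model outputs
# for the same audio clips (both lists are hypothetical examples here).
import jiwer

references = ["hello world this is a test", "the quick brown fox"]
hypotheses = ["hello word this is test", "the quick brown fox"]

print("WER:", jiwer.wer(references, hypotheses) * 100)  # Word Error Rate (%)
print("MER:", jiwer.mer(references, hypotheses) * 100)  # Match Error Rate (%)
print("WIL:", jiwer.wil(references, hypotheses) * 100)  # Word Information Lost (%)
print("WIP:", jiwer.wip(references, hypotheses) * 100)  # Word Information Preserved (%)
print("CER:", jiwer.cer(references, hypotheses) * 100)  # Character Error Rate (%)
```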
__init__.py CHANGED
@@ -1,5 +1,5 @@
 from transformers import (
-    WhisperForConditionalGeneration, WhisperProcessor, WhisperConfig
+    WhisperForConditionalGeneration, WhisperProcessor, WhisperConfig,
 )
 import torch
 import ffmpeg
@@ -13,6 +13,52 @@ SAMPLE_RATE = 16000
 CHUNK_LENGTH = 30 # 30-second chunks
 N_SAMPLES = CHUNK_LENGTH * SAMPLE_RATE # 480000 samples in a 30-second chunk

+# audio = whisper.load_audio('test.wav')
+def load_audio(file: str, sr: int = SAMPLE_RATE, start_time: int = 0, dtype=np.float16):
+    """
+    Load an audio file into a numpy array at the specified sampling rate.
+    """
+    try:
+        # This launches a subprocess to decode audio while down-mixing and resampling as necessary.
+        # Requires the ffmpeg CLI and `ffmpeg-python` package to be installed.
+        out, _ = (
+            ffmpeg.input(file, ss=start_time, threads=0)
+            .output("-", format="s16le", acodec="pcm_s16le", ac=1, ar=sr)
+            .run(cmd=["ffmpeg", "-nostdin"], capture_stdout=True, capture_stderr=True)
+        )
+    except ffmpeg.Error as e:
+        raise RuntimeError(f"Failed to load audio: {e.stderr.decode()}") from e
+
+    # return np.frombuffer(out, np.int16).flatten().astype(np.float32) / 32768.0
+    return np.frombuffer(out, np.int16).flatten().astype(dtype) / 32768.0
+
+
+# audio = whisper.pad_or_trim(audio)
+def pad_or_trim(array, length: int = N_SAMPLES, *, axis: int = -1):
+    """
+    Pad or trim the audio array to N_SAMPLES, as expected by the encoder.
+    """
+    if torch.is_tensor(array):
+        if array.shape[axis] > length:
+            array = array.index_select(
+                dim=axis, index=torch.arange(length, device=array.device)
+            )
+
+        if array.shape[axis] < length:
+            pad_widths = [(0, 0)] * array.ndim
+            pad_widths[axis] = (0, length - array.shape[axis])
+            array = F.pad(array, [pad for sizes in pad_widths[::-1] for pad in sizes])
+    else:
+        if array.shape[axis] > length:
+            array = array.take(indices=range(length), axis=axis)
+
+        if array.shape[axis] < length:
+            pad_widths = [(0, 0)] * array.ndim
+            pad_widths[axis] = (0, length - array.shape[axis])
+            array = np.pad(array, pad_widths)
+
+    return array
+
 class Model:
     def __init__(self,
         model_name_or_path: str,
@@ -49,54 +95,8 @@ class Model:

         print('dtype of model acc to config: ', self.config.torch_dtype)
         print('dtype of loaded model: ', self.model.dtype)
-
-
-    # audio = whisper.load_audio('test.wav')
-    def load_audio(self, file: str, sr: int = SAMPLE_RATE, start_time: int = 0, dtype=np.float16):
-        try:
-            # This launches a subprocess to decode audio while down-mixing and resampling as necessary.
-            # Requires the ffmpeg CLI and `ffmpeg-python` package to be installed.
-            out, _ = (
-                ffmpeg.input(file, ss=start_time, threads=0)
-                .output("-", format="s16le", acodec="pcm_s16le", ac=1, ar=sr)
-                .run(cmd=["ffmpeg", "-nostdin"], capture_stdout=True, capture_stderr=True)
-            )
-        except ffmpeg.Error as e:
-            raise RuntimeError(f"Failed to load audio: {e.stderr.decode()}") from e
-
-        # return np.frombuffer(out, np.int16).flatten().astype(np.float32) / 32768.0
-        return np.frombuffer(out, np.int16).flatten().astype(dtype) / 32768.0
-
-
-    # audio = whisper.pad_or_trim(audio)
-    def _pad_or_trim(self, array, length: int = N_SAMPLES, *, axis: int = -1):
-        """
-        Pad or trim the audio array to N_SAMPLES, as expected by the encoder.
-        """
-        if torch.is_tensor(array):
-            if array.shape[axis] > length:
-                array = array.index_select(
-                    dim=axis, index=torch.arange(length, device=array.device)
-                )
-
-            if array.shape[axis] < length:
-                pad_widths = [(0, 0)] * array.ndim
-                pad_widths[axis] = (0, length - array.shape[axis])
-                array = F.pad(array, [pad for sizes in pad_widths[::-1] for pad in sizes])
-        else:
-            if array.shape[axis] > length:
-                array = array.take(indices=range(length), axis=axis)
-
-            if array.shape[axis] < length:
-                pad_widths = [(0, 0)] * array.ndim
-                pad_widths[axis] = (0, length - array.shape[axis])
-                array = np.pad(array, pad_widths)
-
-        return array

-    def transcribe(self, audio: np.ndarray, language: str = "english"):
-        # audio = load_audio(audio)
-        audio = self._pad_or_trim(audio)
+    def transcribe(self, audio, language: str = "english", skip_special_tokens: bool = True) -> str:
         input_features = self.processor(audio, sampling_rate=SAMPLE_RATE, return_tensors="pt").input_features.half().to(self.DEVICE)
         with torch.no_grad():
             predicted_ids = self.model.generate(
@@ -109,5 +109,5 @@ class Model:
                 return_timestamps=True,
             )

-        transcription = self.tokenizer.batch_decode(predicted_ids, skip_special_tokens=True)[0]
+        transcription = self.tokenizer.batch_decode(predicted_ids, skip_special_tokens=skip_special_tokens)[0]
         return transcription.strip()
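Putting the pieces of this commit together, a minimal usage sketch of the updated API follows. It assumes the cloned `whisper_medium_fp16_transformers` folder is importable from the working directory; the `Model` constructor takes more keyword arguments than this diff shows, so everything beyond `model_name_or_path` is omitted here and should be filled in from the full README.

```python
# Sketch of the post-commit usage flow: module-level load_audio/pad_or_trim
# plus Model.transcribe. Constructor arguments other than `model_name_or_path`
# are not visible in this diff and are intentionally left out.
from whisper_medium_fp16_transformers import Model, load_audio, pad_or_trim

model = Model(model_name_or_path='whisper_medium_fp16_transformers')

audio = load_audio('whisper_medium_fp16_transformers/test.wav')  # ffmpeg decode to 16 kHz mono float16
audio = pad_or_trim(audio)                                        # fit the 30-second encoder window

# First transcription is slower; `language` and `skip_special_tokens` are optional.
print(model.transcribe(audio, language='english'))
```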