---
license: apache-2.0
pipeline_tag: automatic-speech-recognition
tags:
  - pytorch
  - audio
  - speech
  - automatic-speech-recognition
  - whisper
  - wav2vec2

model-index:
  - name: whisper_large_v2_fp16_transformers
    results:
      - task:
          type: automatic-speech-recognition
          name: Automatic Speech Recognition
        dataset:
          type: librispeech_asr
          name: LibriSpeech (clean)
          config: clean
          split: test
          args:
            language: en
        metrics:
          - type: wer
            value: 0
            name: Test WER
            description: Word Error Rate
          - type: mer
            value: 0
            name: Test MER
            description: Match Error Rate
          - type: wil
            value: 0
            name: Test WIL
            description: Word Information Lost
          - type: wip
            value: 0
            name: Test WIP
            description: Word Information Preserved
          - type: cer
            value: 0
            name: Test CER
            description: Character Error Rate

      - task:
          type: automatic-speech-recognition
          name: Automatic Speech Recognition
        dataset:
          type: librispeech_asr
          name: LibriSpeech (other)
          config: other
          split: test
          args:
            language: en
        metrics:
          - type: wer
            value: 0
            name: Test WER
            description: Word Error Rate
          - type: mer
            value: 0
            name: Test MER
            description: Match Error Rate
          - type: wil
            value: 0
            name: Test WIL
            description: Word Information Lost
          - type: wip
            value: 0
            name: Test WIP
            description: Word Information Preserved
          - type: cer
            value: 0
            name: Test CER
            description: Character Error Rate

      - task:
          type: automatic-speech-recognition
          name: Automatic Speech Recognition
        dataset:
          type: mozilla-foundation/common_voice_14_0
          name: Common Voice (14.0) (Hindi)
          config: hi
          split: test
          args:
            language: hi
        metrics:
          - type: wer
            value: 44.64
            name: Test WER
            description: Word Error Rate
          - type: mer
            value: 41.69
            name: Test MER
            description: Match Error Rate
          - type: wil
            value: 59.53
            name: Test WIL
            description: Word Information Lost
          - type: wip
            value: 40.46
            name: Test WIP
            description: Word Information Preserved
          - type: cer
            value: 16.80
            name: Test CER
            description: Character Error Rate

widget:
  - example_title: Hinglish Sample
    src: https://huggingface.co/devasheeshG/whisper_large_v2_fp16_transformers/resolve/main/test.wav
  - example_title: Librispeech sample 1
    src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
  - example_title: Librispeech sample 2
    src: https://cdn-media.huggingface.co/speech_samples/sample2.flac

language:
  - en
  - zh
  - de
  - es
  - ru
  - ko
  - fr
  - ja
  - pt
  - tr
  - pl
  - ca
  - nl
  - ar
  - sv
  - it
  - id
  - hi
  - fi
  - vi
  - he
  - uk
  - el
  - ms
  - cs
  - ro
  - da
  - hu
  - ta
  - "no"
  - th
  - ur
  - hr
  - bg
  - lt
  - la
  - mi
  - ml
  - cy
  - sk
  - te
  - fa
  - lv
  - bn
  - sr
  - az
  - sl
  - kn
  - et
  - mk
  - br
  - eu
  - is
  - hy
  - ne
  - mn
  - bs
  - kk
  - sq
  - sw
  - gl
  - mr
  - pa
  - si
  - km
  - sn
  - yo
  - so
  - af
  - oc
  - ka
  - be
  - tg
  - sd
  - gu
  - am
  - yi
  - lo
  - uz
  - fo
  - ht
  - ps
  - tk
  - nn
  - mt
  - sa
  - lb
  - my
  - bo
  - tl
  - mg
  - as
  - tt
  - haw
  - ln
  - ha
  - ba
  - jw
  - su
---
## Versions:

- CUDA: 12.1
- cuDNN: 8.9.2.26_1.0-1_amd64
- tensorflow: 2.12.0
- torch: 2.1.0.dev20230606+cu12135
- transformers: 4.30.2
- accelerate: 0.20.3

## Model Benchmarks:

- RAM: 3 GB (Original_Model: 6 GB)
- VRAM: 3.7 GB (Original_Model: 11 GB)
- test.wav: 23 s (multilingual speech, i.e. English + Hindi)

  - **Processing time in seconds on each device**

  | Device Name       | float32 (Original) | float16 | CUDA Cores | Tensor Cores |
  | ----------------- | ------------------ | ------- | ---------- | ------------ |
  | RTX 3060          | 2.2                | 1.3     | 3,584      | 112          |
  | GTX 1660 Super    | OOM                | 6       | 1,408      | N/A          |
  | Colab (Tesla T4)  | -                  | -       | 2,560      | 320          |
  | Colab (CPU)       | -                  | N/A     | N/A        | N/A          |
  | M1 (CPU)          | -                  | -       | N/A        | N/A          |
  | M1 (GPU -> 'mps') | -                  | -       | N/A        | N/A          |


  - **NOTE: Tensor Cores accelerate mixed-precision calculations**
  - **CPU: `torch.float16` is not supported on CPU (AMD Ryzen 5 3600 or Colab CPU)**
- Punctuation: sometimes incorrect (I don't know the exact reason why this happens)
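
The RAM figures above are close to what parameter storage alone predicts, and the CPU note suggests choosing the dtype per device. A minimal sketch of both ideas, assuming roughly 1.55 billion parameters for Whisper large-v2; `weight_memory_gb` and `pick_dtype_name` are hypothetical helpers and not part of this repo:

```python
# Assumption: Whisper large-v2 has roughly 1.55 billion parameters.
WHISPER_LARGE_PARAMS = 1_550_000_000

# Storage cost of a single parameter for each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weight_memory_gb(num_params: int, dtype_name: str) -> float:
    """Approximate size of the weights alone, in GB (10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype_name] / 1e9


def pick_dtype_name(device: str) -> str:
    """Hypothetical helper: use float16 only on CUDA devices.

    Every other device (including 'cpu') falls back to float32, a
    conservative choice given the CPU note above.
    """
    return "float16" if device.startswith("cuda") else "float32"


for device in ("cuda", "cuda:0", "cpu"):
    name = pick_dtype_name(device)
    size = weight_memory_gb(WHISPER_LARGE_PARAMS, name)
    print(f"{device}: {name}, ~{size:.1f} GB of weights")
```

With PyTorch installed, `getattr(torch, pick_dtype_name(device))` turns the name into a dtype object. The estimate covers weights only, so measured VRAM is higher once activations and framework overhead are included.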

## Model Error Benchmarks:

- **WER: Word Error Rate**
- **MER: Match Error Rate**
- **WIL: Word Information Lost**
- **WIP: Word Information Preserved**
- **CER: Character Error Rate**

### Hindi to Hindi (test.tsv) [Common Voice 14.0](https://commonvoice.mozilla.org/en/datasets)

**Tested on an RTX 3060 with 1,000 samples**

|                         | WER   | MER   | WIL   | WIP   | CER   |
| ----------------------- | ----- | ----- | ----- | ----- | ----- |
| Original_Model (30 min) | 43.99 | 41.65 | 59.47 | 40.52 | 16.23 |
| This_Model (20 min)     | 44.64 | 41.69 | 59.53 | 40.46 | 16.80 |

### Hindi to English (test.csv) [Custom Dataset](https://huggingface.co/datasets/devasheeshG/common_voices_14_0_hi2en_hi2hi)

**Tested on an RTX 3060 with 1,000 samples**

|                         | WER | MER | WIL | WIP | CER |
| ----------------------- | --- | --- | --- | --- | --- |
| Original_Model (30 min) | -   | -   | -   | -   | -   |
| This_Model (20 min)     | -   | -   | -   | -   | -   |

### English ([LibriSpeech](https://huggingface.co/datasets/librispeech_asr) -> test-clean)

**Tested on an RTX 3060 with \_\_\_ samples**

|                | WER | MER | WIL | WIP | CER |
| -------------- | --- | --- | --- | --- | --- |
| Original_Model | -   | -   | -   | -   | -   |
| This_Model     | -   | -   | -   | -   | -   |

### English ([LibriSpeech](https://huggingface.co/datasets/librispeech_asr) -> test-other)

**Tested on an RTX 3060 with \_\_\_ samples**

|                | WER | MER | WIL | WIP | CER |
| -------------- | --- | --- | --- | --- | --- |
| Original_Model | -   | -   | -   | -   | -   |
| This_Model     | -   | -   | -   | -   | -   |

- **The `jiwer` library is used for these calculations**
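
For reference, all five metrics can be derived from the hits, substitutions, deletions, and insertions of a minimum-edit alignment between reference and hypothesis. The sketch below is a dependency-free illustration of those definitions, not the `jiwer` code used for the tables. The function names are hypothetical, and when several optimal alignments exist the counts may differ slightly from a library's.

```python
def align_counts(ref, hyp):
    """Return (hits, substitutions, deletions, insertions) for one
    minimum-edit alignment of two token sequences."""
    n, m = len(ref), len(hyp)
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost,  # hit or substitution
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
            )
    # Walk back from the end to count each operation type.
    hits = subs = dels = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        same = i > 0 and j > 0 and ref[i - 1] == hyp[j - 1]
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (0 if same else 1):
            hits, subs = hits + same, subs + (not same)
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            dels, i = dels + 1, i - 1
        else:
            ins, j = ins + 1, j - 1
    return hits, subs, dels, ins


def error_measures(reference: str, hypothesis: str) -> dict:
    """Compute WER, MER, WIL, WIP (word level) and CER (character level)."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    if not ref_words or not hyp_words:
        raise ValueError("reference and hypothesis must be non-empty")
    h, s, d, n_ins = align_counts(ref_words, hyp_words)
    wip = (h / len(ref_words)) * (h / len(hyp_words))
    ch, cs, cd, ci = align_counts(list(reference), list(hypothesis))
    return {
        "wer": (s + d + n_ins) / (h + s + d),
        "mer": (s + d + n_ins) / (h + s + d + n_ins),
        "wil": 1 - wip,
        "wip": wip,
        "cer": (cs + cd + ci) / (ch + cs + cd),
    }


scores = error_measures("the cat sat on the mat", "the cat sit on mat")
print({name: round(value * 100, 2) for name, value in scores.items()})
```

The values are fractions; multiplying by 100 gives percentages like those reported in the tables above.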

## Code for conversion:

- ### [Will soon be uploaded to GitHub](https://github.com/devasheeshG)

## Usage

This repo contains an `__init__.py` file with all the code needed to use this model.

First, clone this repo and place all the files inside a folder.

### Make sure you have git-lfs installed (https://git-lfs.com)

```bash
git lfs install
git clone https://huggingface.co/devasheeshG/whisper_large_v2_fp16_transformers
```

**Please try this in a Jupyter notebook**

```python
# Import the Model
from whisper_large_v2_fp16_transformers import Model, load_audio, pad_or_trim
```

```python
# Initialize the model
model = Model(
    model_name_or_path='whisper_large_v2_fp16_transformers',
    cuda_visible_device="0",
    device='cuda',
)
```

```python
# Load Audio
audio = load_audio('whisper_large_v2_fp16_transformers/test.wav')
audio = pad_or_trim(audio)
```

```python
# Transcribe (the first transcription takes longer)
model.transcribe(audio)
```
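
Because the first call includes one-time setup, timings are easier to compare when warm-up runs are excluded. A minimal sketch of such a harness; `time_call` is a hypothetical helper, not part of this repo, and the stand-in workload replaces `model.transcribe(audio)` so the snippet runs without a GPU:

```python
import time


def time_call(fn, *args, warmup=1, repeats=3):
    """Run fn(*args) `warmup` times untimed, then return the mean
    wall-clock seconds per call over `repeats` timed runs."""
    for _ in range(warmup):
        fn(*args)
    start = time.perf_counter()
    for _ in range(repeats):
        fn(*args)
    return (time.perf_counter() - start) / repeats


def fake_transcribe(audio):
    # Stand-in workload; swap in model.transcribe for a real measurement.
    return sum(audio)


mean_seconds = time_call(fake_transcribe, [0.0] * 100_000)
print(f"mean time per call: {mean_seconds:.4f} s")
```

In a real run this would be `time_call(model.transcribe, audio)`. If the timed function can return before GPU work has finished, calling `torch.cuda.synchronize()` before reading the clock keeps the measurement honest.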

## Credits

This is an fp16 version of `openai/whisper-large-v2`.