aotrih committed
Commit d72acf0 (1 parent: cfe51aa)

whisperkittools generated README.md

Files changed (1): README.md (+30, -33)
### Quality Evaluation

| | WER | QoI (%) | File Size (MB) |
|:----|------:|----------:|-----------------:|
| [WhisperOpenAIAPI/openai_whisper-large-v2](https://huggingface.co/argmaxinc/whisperkit-coreml/tree/main/WhisperOpenAIAPI/openai_whisper-large-v2) | 2.85 | 100 | 3100 |
| [WhisperKit/openai_whisper-large-v2](https://huggingface.co/argmaxinc/whisperkit-coreml/tree/main/WhisperKit/openai_whisper-large-v2) | 3.28 | 96.6 | 3100 |
| [WhisperKit/openai_whisper-large-v2_1050MB](https://huggingface.co/argmaxinc/whisperkit-coreml/tree/main/WhisperKit/openai_whisper-large-v2_1050MB) | 3.32 | 95 | 1050 |
| [WhisperKit/openai_whisper-large-v2_turbo](https://huggingface.co/argmaxinc/whisperkit-coreml/tree/main/WhisperKit/openai_whisper-large-v2_turbo) | 3.24 | 96.6 | 3100 |
| [WhisperKit/openai_whisper-large-v2_turbo_1022MB](https://huggingface.co/argmaxinc/whisperkit-coreml/tree/main/WhisperKit/openai_whisper-large-v2_turbo_1022MB) | 3.33 | 94.9 | 1022 |
| [WhisperKit/openai_whisper-small](https://huggingface.co/argmaxinc/whisperkit-coreml/tree/main/WhisperKit/openai_whisper-small) | 3.98 | 82.9 | 483 |
| [WhisperKit/openai_whisper-base](https://huggingface.co/argmaxinc/whisperkit-coreml/tree/main/WhisperKit/openai_whisper-base) | 6.11 | 67.1 | 145 |
| [WhisperKit/openai_whisper-tiny](https://huggingface.co/argmaxinc/whisperkit-coreml/tree/main/WhisperKit/openai_whisper-tiny) | 8.94 | 52.4 | 66 |
| [WhisperKit/openai_whisper-large-v3](https://huggingface.co/argmaxinc/whisperkit-coreml/tree/main/WhisperKit/openai_whisper-large-v3) | 2.48 | 95.2 | 3100 |
| [WhisperKit/openai_whisper-large-v3_turbo](https://huggingface.co/argmaxinc/whisperkit-coreml/tree/main/WhisperKit/openai_whisper-large-v3_turbo) | 2.44 | 95.4 | 3100 |
| [WhisperKit/openai_whisper-large-v3_turbo_1018MB](https://huggingface.co/argmaxinc/whisperkit-coreml/tree/main/WhisperKit/openai_whisper-large-v3_turbo_1018MB) | 2.49 | 94.8 | 1018 |

### Quality-of-Inference (QoI) Certification
We believe that rigorously measuring the quality of inference is necessary for developers and
enterprises to make informed decisions when opting to use optimized or compressed variants of
any machine learning model in production. For WhisperKit, we benchmark the following implementations
using consistent evaluation harnesses:

- `WhisperOpenAIAPI`: [OpenAI's Whisper API](https://platform.openai.com/docs/guides/speech-to-text) ($0.36/hour as of 02/29/24, 25MB max file size)
- `WhisperKit`: Argmax's Core ML implementation [[Eval Harness]](https://github.com/argmaxinc/whisperkittools/blob/main/whisperkit/pipelines.py#L100) [[Repo]](https://github.com/argmaxinc/WhisperKit)
- `whisper.cpp`: A C++ implementation from ggerganov [[Eval Harness]](https://github.com/argmaxinc/whisperkittools/blob/main/whisperkit/pipelines.py#L212) [[Repo]](https://github.com/ggerganov/whisper.cpp)
- `WhisperMLX`: A Python implementation from Apple MLX [[Eval Harness]](https://github.com/argmaxinc/whisperkittools/blob/main/whisperkit/pipelines.py#L338) [[Repo]](https://github.com/ml-explore/mlx-examples/blob/main/whisper/whisper/transcribe.py)

`WhisperOpenAIAPI` is the reference, and we assume that it uses the equivalent of
[openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2) in float16 precision.
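The eval harness links above point to the exact pipeline classes we use. As a simplified illustration of the consistent-harness idea, the sketch below wraps two implementations behind one `transcribe()` interface so the metric code never cares which backend produced the text; the class and method names are illustrative assumptions, not the whisperkittools API.

```python
# Simplified sketch of a consistent evaluation harness: every implementation
# is hidden behind the same transcribe() interface. Names are illustrative.
from abc import ABC, abstractmethod

class TranscriptionPipeline(ABC):
    @abstractmethod
    def transcribe(self, audio_path: str) -> str:
        """Return the transcript for a single audio file."""

class WhisperOpenAIAPIPipeline(TranscriptionPipeline):
    def transcribe(self, audio_path: str) -> str:
        from openai import OpenAI  # pip install openai
        with open(audio_path, "rb") as f:
            result = OpenAI().audio.transcriptions.create(
                model="whisper-1", file=f
            )
        return result.text

class WhisperCppPipeline(TranscriptionPipeline):
    def transcribe(self, audio_path: str) -> str:
        import subprocess
        # Assumes a locally built whisper.cpp `main` binary and model file.
        out = subprocess.run(
            ["./main", "-m", "ggml-large-v3.bin", "-nt", "-f", audio_path],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()

def evaluate(pipeline: TranscriptionPipeline, dataset) -> list[str]:
    """Run one pipeline over (audio_path, reference_text) pairs."""
    return [pipeline.transcribe(audio) for audio, _ in dataset]
```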
In all measurements, we care primarily about per-example no-regressions (quantified as `qoi` below),
which is a stricter metric compared to dataset average WER. A 100% `qoi` preserves perfect
backwards-compatibility on the test distribution and avoids "perceived regressions", the phenomenon

```
qoi = (sum(qoi) / len(qoi)) * 100.
```
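The loop that populates `qoi` is abridged above. The following is a minimal sketch of one plausible per-example formalization: `dataset`, `reference_model`, and `optimized_model` are hypothetical stand-ins, and scoring no-regressions via per-example WER is an illustrative assumption rather than the exact harness logic.

```python
# Minimal sketch of a per-example QoI computation. `reference_model` and
# `optimized_model` are hypothetical callables mapping audio to text.
import string

def normalize(text: str) -> str:
    """Lowercase and strip punctuation so formatting differences don't count."""
    return text.lower().translate(str.maketrans("", "", string.punctuation)).strip()

def word_error_rate(reference: str, prediction: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = normalize(reference).split(), normalize(prediction).split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

qoi = []
for audio, reference_text in dataset:  # (audio, ground-truth transcript) pairs
    reference_wer = word_error_rate(reference_text, reference_model(audio))
    optimized_wer = word_error_rate(reference_text, optimized_model(audio))
    qoi.append(optimized_wer <= reference_wer)  # no-regression on this example
qoi = (sum(qoi) / len(qoi)) * 100.
```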
We use `librispeech/test.clean` (~5 hours of short English audio clips) and `earnings22` (~120 hours of long English audio clips with various accents) as our test sets.
We anticipate that developers who use Whisper (or similar models) in production will have their own Quality Assurance test sets, and whisperkittools offers
the tooling necessary to run the same measurements on such custom test sets. Please see [Model Evaluation on Custom Dataset](#evaluate-on-custom-dataset) for details.
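As an illustration of what a custom QA test set could look like, the sketch below uses a hypothetical JSON Lines manifest; the schema and field names are invented for this example and are not a whisperkittools requirement (see the linked section for the actual interface).

```python
# Hypothetical custom QA test set: a JSON Lines manifest where each row
# points at an audio file and its reference transcript. This schema is an
# illustrative assumption, not a format whisperkittools prescribes.
import json
from pathlib import Path

def load_manifest(path: str):
    """Yield (audio_path, reference_text) pairs from a JSONL manifest."""
    for line in Path(path).read_text().splitlines():
        if not line.strip():
            continue
        row = json.loads(line)
        yield row["audio_path"], row["reference_text"]

# Example manifest row:
# {"audio_path": "clips/call_0001.wav", "reference_text": "thanks for calling"}
dataset = list(load_manifest("qa_testset.jsonl"))
```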
 
 
 
 
 
### Reproducing Results
Results on this page are generated by our cluster of Apple Silicon Macs, which we use as self-hosted runners on
GitHub Actions as our CI infrastructure. Due to [security concerns](https://docs.github.com/en/actions/security-guides/security-hardening-for-github-actions#hardening-for-self-hosted-runners),
we are unable to open up the cluster to the public. However, any Apple Silicon Mac (even with 8GB RAM) can be used to
run identical [evaluation jobs](#evaluation) locally. For reference, our M2 Ultra devices complete a `librispeech` + `openai/whisper-large-v3`
evaluation in under 1 hour regardless of the Whisper implementation. Older Apple Silicon Macs should take less than 1 day to complete the same evaluation.
 
 
Glossary:

- `_turbo`: Indicates the presence of additional optimizations (not compression) to unlock streaming transcription
as described in our [Blog Post](https://www.takeargmax.com/blog/whisperkit).

- `_*MB`: Indicates the presence of model compression. Instead of cluttering the filename with details like
`_AudioEncoder-5.8bits_TextDecoder-6.1bits_QLoRA-rank=16`, we choose to summarize the compression spec as the
resulting total file size, since this is what matters to developers in production.
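To make the `_*MB` naming concrete, here is a back-of-the-envelope estimate of how per-module bit widths translate into a total file size. The parameter counts are rough public figures for Whisper large-v2 (assumptions, not numbers from this repo), and real artifacts include metadata and unquantized layers, so the result is only approximate.

```python
# Back-of-the-envelope file size estimate for a mixed-bit compressed model.
# Parameter counts are rough public figures for Whisper large-v2 (~1.55B
# total) and are assumptions for illustration only.
PARAMS = {
    "AudioEncoder": 635_000_000,   # approximate
    "TextDecoder": 907_000_000,    # approximate
}
BITS = {
    "AudioEncoder": 5.8,  # average bits/parameter after mixed-bit quantization
    "TextDecoder": 6.1,
}

total_bytes = sum(PARAMS[m] * BITS[m] / 8 for m in PARAMS)
print(f"~{total_bytes / 1e6:.0f} MB")  # ~1152 MB; the same modules in float16
# (16 bits/parameter) would be ~3084 MB, matching the ~3100 MB table entries.
```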
83