aotrih committed on
Commit
6f3d506
1 Parent(s): 51c21d2

whisperkittools generated README.md

Files changed (1)
  1. README.md +83 -1
README.md CHANGED
@@ -1,3 +1,85 @@
 
  ---
- license: mit
  ---
+
  ---
+ pretty_name: "WhisperKit ASR Evaluation Results"
+ tags:
+ - whisper
+ - whisperkit
+ - coreml
+ - asr
+ - quantized
  ---
+ # WhisperKit Evaluation Results
+
+
+
+ ## Dataset: `librispeech`
+
+ ### Quality Evaluation
+
+ | Model | WER (%) | QoI (%) | File Size (MB) |
+ |:------|--------:|--------:|---------------:|
+ | [WhisperOpenAIAPI/openai_whisper-large-v2](https://huggingface.co/argmaxinc/whisperkit-coreml-staging/tree/main/WhisperOpenAIAPI/openai_whisper-large-v2) | 2.85 | 100 | 3100 |
+ | [WhisperKit/openai_whisper-large-v2](https://huggingface.co/argmaxinc/whisperkit-coreml-staging/tree/main/WhisperKit/openai_whisper-large-v2) | 3.28 | 96.6 | 3100 |
+ | [WhisperKit/openai_whisper-large-v2_1050MB](https://huggingface.co/argmaxinc/whisperkit-coreml-staging/tree/main/WhisperKit/openai_whisper-large-v2_1050MB) | 3.32 | 95 | 1050 |
+ | [WhisperKit/openai_whisper-large-v2_turbo](https://huggingface.co/argmaxinc/whisperkit-coreml-staging/tree/main/WhisperKit/openai_whisper-large-v2_turbo) | 3.24 | 96.6 | 3100 |
+ | [WhisperKit/openai_whisper-large-v2_turbo_1022MB](https://huggingface.co/argmaxinc/whisperkit-coreml-staging/tree/main/WhisperKit/openai_whisper-large-v2_turbo_1022MB) | 3.33 | 94.9 | 1022 |
+ | [whisper.cpp/openai_whisper-large-v2-q5_0](https://huggingface.co/argmaxinc/whisperkit-coreml-staging/tree/main/whisper.cpp/openai_whisper-large-v2-q5_0) | 2.8 | 96.6 | 1080 |
+ | [WhisperKit/openai_whisper-small](https://huggingface.co/argmaxinc/whisperkit-coreml-staging/tree/main/WhisperKit/openai_whisper-small) | 3.98 | 82.9 | 483 |
+ | [WhisperKit/openai_whisper-base](https://huggingface.co/argmaxinc/whisperkit-coreml-staging/tree/main/WhisperKit/openai_whisper-base) | 6.11 | 67.1 | 145 |
+ | [WhisperKit/openai_whisper-tiny](https://huggingface.co/argmaxinc/whisperkit-coreml-staging/tree/main/WhisperKit/openai_whisper-tiny) | 8.94 | 52.4 | 66 |
+ | [WhisperKit/openai_whisper-large-v3](https://huggingface.co/argmaxinc/whisperkit-coreml-staging/tree/main/WhisperKit/openai_whisper-large-v3) | 2.48 | 95.2 | 3100 |
+ | [WhisperKit/openai_whisper-large-v3_turbo](https://huggingface.co/argmaxinc/whisperkit-coreml-staging/tree/main/WhisperKit/openai_whisper-large-v3_turbo) | 2.44 | 95.4 | 3100 |
+ | [openai_whisper-large-v3_turbo_1018MB](https://huggingface.co/argmaxinc/whisperkit-coreml-staging/tree/main/openai_whisper-large-v3_turbo_1018MB) | 2.49 | 94.8 | 1018 |
+
+
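One way to read the table is to compare each size-suffixed variant with the uncompressed row it appears to derive from. The sketch below does that arithmetic for three such pairs. The WER and file-size values are copied from the table; the pairing by name is an assumption, and `PAIRS` and `summarize` are hypothetical names used only for illustration.

```python
# (WER, file size in MB) copied from the Quality Evaluation table.
# Assumption: each size-suffixed variant is paired with the same-named
# uncompressed WhisperKit row.
PAIRS = {
    "large-v2": ((3.28, 3100), (3.32, 1050)),
    "large-v2_turbo": ((3.24, 3100), (3.33, 1022)),
    "large-v3_turbo": ((2.44, 3100), (2.49, 1018)),
}


def summarize(full, compressed):
    """Return (file size reduction in percent, WER change in table units)."""
    (full_wer, full_mb), (comp_wer, comp_mb) = full, compressed
    size_reduction_pct = 100.0 * (1 - comp_mb / full_mb)
    wer_delta = comp_wer - full_wer
    return size_reduction_pct, wer_delta


for name, (full, compressed) in PAIRS.items():
    reduction, delta = summarize(full, compressed)
    print(f"{name}: {reduction:.1f}% smaller, WER change {delta:+.2f}")
```

Under that pairing, each compressed variant is roughly two thirds smaller while WER moves by less than 0.1.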
+ ### Quality-of-Inference (QoI) Certification
+ We believe that rigorously measuring the quality of inference is necessary for developers and
+ enterprises to make informed decisions when opting to use optimized or compressed variants of
+ Whisper models in production. The current measurements compare reference and optimized
+ WhisperKit models. We plan to extend the scope of this measurement to other Whisper
+ implementations soon so that developers can certify the behavior change (if any) caused by
+ alternating between WhisperKit and these implementations, or by migrating from them.
+
+ In all measurements, we care primarily about per-example no-regressions (quantified as `qoi` below),
+ which is a stricter metric than dataset-average WER. A 100% `qoi` preserves perfect
+ backwards-compatibility on the test distribution and avoids "perceived regressions", the phenomenon
+ where known per-example behavior changes after a code/model update and causes divergence in
+ downstream code or breaks the user experience itself (even if dataset averages stay flat
+ across updates). Pseudocode for `qoi`:
+
+ ```python
+ qoi = []
+ for example in dataset:
+     no_regression = wer(optimized_model(example)) <= wer(reference_model(example))
+     qoi.append(no_regression)
+ qoi = (sum(qoi) / len(qoi)) * 100.
+ ```
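For readers who want to run the pseudocode, here is a minimal self-contained sketch. It assumes transcripts are already available as plain strings and computes per-example WER with a word-level edit distance, with no text normalization. The helper names (`word_errors`, `wer`, `qoi`) and the toy transcripts are hypothetical and are not part of whisperkittools.

```python
def word_errors(reference: str, hypothesis: str) -> int:
    """Word-level Levenshtein distance (substitutions + insertions + deletions)."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Per-example word error rate; assumes a non-empty reference transcript."""
    return word_errors(reference, hypothesis) / len(reference.split())


def qoi(references, reference_outputs, optimized_outputs) -> float:
    """Percentage of examples where the optimized model is no worse than the reference model."""
    no_regressions = [
        wer(truth, opt) <= wer(truth, base)
        for truth, base, opt in zip(references, reference_outputs, optimized_outputs)
    ]
    return 100.0 * sum(no_regressions) / len(no_regressions)


# Toy transcripts, made up for illustration only.
references = ["the cat sat", "hello world", "good morning"]
reference_outputs = ["the cat sat", "hello word", "good morning"]
optimized_outputs = ["the cat sat", "hello world", "good mourning"]
print(qoi(references, reference_outputs, optimized_outputs))
```

In this toy data both models make exactly one word error across the three examples, so their dataset-level WER is identical, yet the optimized model regresses on the third example and `qoi` comes out at about 66.7%. This is the "perceived regression" case that a dataset average hides.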
+
+ We define the reference model as the default float16 precision Core ML model that is generated by
+ whisperkittools. This reference model matches the accuracy of the original PyTorch model
+ on the specified test sets. We use `librispeech/test.clean` (5 hours of short English audio clips)
+ as our testing set for Whisper. We are actively expanding our test set coverage to `earnings22`
+ (120 hours of long English audio clips with various accents). We expect developers who use Whisper in production to have
+ their own Quality Assurance test sets, and whisperkittools offers the tooling necessary to run the
+ same measurements on such custom test sets. Please see [Model Evaluation on Custom Dataset](#evaluate-on-custom-dataset)
+ for details.
+
+ ### Reproducing Results
+ Results on this page are generated by our cluster of Apple Silicon Macs. We use them as self-hosted runners on
+ GitHub Actions as our CI infrastructure. Due to [security concerns](https://docs.github.com/en/actions/security-guides/security-hardening-for-github-actions#hardening-for-self-hosted-runners),
+ we are unable to open up the cluster to the public. However, any Apple Silicon Mac (even with 8GB RAM) can be used to
+ run identical [evaluation jobs](#evaluation)
+ locally. For reference, our M2 Ultra devices complete a `librispeech` + `openai/whisper-large-v3`
+ evaluation in under 1 hour regardless of the Whisper implementation. Older Apple Silicon Macs should take less than
+ 1 day to complete the same evaluation.
+
+
+
+ Glossary:
+
+ - `_turbo`: Indicates the presence of additional optimizations (not compression) to unlock streaming transcription
+ as described in our [Blog Post](https://www.takeargmax.com/blog/whisperkit).
+
+ - `_*MB`: Indicates the presence of mixed-bit quantization. Instead of cluttering the filename with details like
+ `_AudioEncoder-5.8bits_TextDecoder-6.1bits`, we summarize the compression spec as the resulting total file size, since this is what matters to developers in production.
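The two suffixes in the glossary can be read off a model name mechanically. The sketch below is a hypothetical helper, not part of whisperkittools. It also gives a rough average bits-per-weight estimate, which assumes that file size is dominated by weights and that the uncompressed float16 model stores 16 bits per weight.

```python
import re
from typing import NamedTuple, Optional


class Variant(NamedTuple):
    base: str               # model name without the glossary suffixes
    turbo: bool             # True if the `_turbo` suffix is present
    size_mb: Optional[int]  # total file size from the `_*MB` suffix, if any


# Optional `_turbo`, then optional `_<digits>MB`, anchored at the end of the name.
_SUFFIXES = re.compile(r"^(?P<base>.+?)(?P<turbo>_turbo)?(?:_(?P<size>\d+)MB)?$")


def parse_variant(name: str) -> Variant:
    """Split a model name into its base name and glossary suffixes."""
    match = _SUFFIXES.match(name)
    if match is None:
        raise ValueError(f"cannot parse model name: {name!r}")
    size = match.group("size")
    return Variant(match.group("base"),
                   match.group("turbo") is not None,
                   int(size) if size is not None else None)


def approx_bits_per_weight(size_mb: float, float16_size_mb: float) -> float:
    """Rough average bit width, assuming the float16 file stores 16 bits per weight."""
    return 16.0 * size_mb / float16_size_mb


variant = parse_variant("openai_whisper-large-v2_turbo_1022MB")
print(variant)
print(round(approx_bits_per_weight(variant.size_mb, 3100), 2))
```

The estimate is only a back-of-the-envelope figure; the actual per-component bit widths are the ones recorded in the compression spec that the file-size suffix summarizes.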
+