---
pretty_name: "WhisperKit ASR Evaluation Results"
viewer: false
tags:
- whisper
- whisperkit
- coreml
- asr
- quantized
---
# WhisperKit Evaluation Results

## Dataset: `librispeech`

### Quality Evaluation

| | WER | QoI (%) | File Size (MB) |
|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------:|----------:|-----------------:|
| [WhisperOpenAIAPI/openai_whisper-large-v2](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperOpenAIAPI/openai_whisper-large-v2/librispeech) | 2.85 | 100 | 3100 |
| [WhisperKit/openai_whisper-large-v3](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-large-v3/librispeech) | 2.48 | 95.2 | 3100 |
| [WhisperKit/openai_whisper-large-v3_turbo](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-large-v3_turbo/librispeech) | 2.44 | 95.4 | 3100 |
| [WhisperKit/openai_whisper-large-v3_turbo_1018MB](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-large-v3_turbo_1018MB/librispeech) | 2.49 | 94.8 | 1018 |
| [WhisperKit/openai_whisper-large-v2](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-large-v2/librispeech) | 3.28 | 96.6 | 3100 |
| [WhisperKit/openai_whisper-large-v2_1050MB](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-large-v2_1050MB/librispeech) | 3.32 | 95 | 1050 |
| [WhisperKit/openai_whisper-large-v2_turbo](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-large-v2_turbo/librispeech) | 3.24 | 96.6 | 3100 |
| [WhisperKit/openai_whisper-large-v2_turbo_1022MB](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-large-v2_turbo_1022MB/librispeech) | 3.33 | 94.9 | 1022 |
| [WhisperKit/openai_whisper-small.en](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-small.en/librispeech) | 4.31 | 85.9 | 483 |
| [WhisperKit/openai_whisper-small](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-small/librispeech) | 3.98 | 82.9 | 483 |
| [WhisperKit/openai_whisper-base.en](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-base.en/librispeech) | 4.76 | 75.5 | 145 |
| [WhisperKit/openai_whisper-base](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-base/librispeech) | 6.11 | 67.1 | 145 |
| [WhisperKit/openai_whisper-tiny.en](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-tiny.en/librispeech) | 6.72 | 64 | 66 |
| [WhisperKit/openai_whisper-tiny](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-tiny/librispeech) | 8.94 | 52.4 | 66 |

### Explanation of Evaluation Metrics

We believe that rigorously measuring the "quality of inference" is necessary for developers and
enterprises to make informed decisions when opting to use optimized or compressed variants of
any machine learning model in production. To contextualize `WhisperKit`, we take the following Whisper
implementations and benchmark them using a consistent evaluation harness:

Server-side Implementations:
- `WhisperOpenAIAPI`: [OpenAI's Whisper API](https://platform.openai.com/docs/guides/speech-to-text) ($0.36/hour as of 02/29/24, 25MB max file size)

On-device Implementations:
- `WhisperMLX`: A Python implementation from Apple MLX [[Eval Harness]](https://github.com/argmaxinc/whisperkittools/blob/main/whisperkit/pipelines.py#L338) [[Repo]](https://github.com/ml-explore/mlx-examples/blob/main/whisper/whisper/transcribe.py)

`WhisperOpenAIAPI` sets the reference and we assume that it is using the equivalent of [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2)
in float16 precision along with additional undisclosed optimizations from OpenAI. In all measurements, we care primarily about per-example no-regressions (quantified as `qoi` below),
which is a stricter metric than the dataset-average WER. A 100% `qoi` preserves perfect backwards-compatibility on the test distribution and avoids "perceived regressions", the phenomenon
where per-example known behavior changes after a code/model update and causes divergence in downstream code or breaks the user experience itself (even if dataset averages might stay flat
across updates). Pseudocode for `qoi`:

```python
qoi = []
for example in dataset:
    # Per-example no-regression: the optimized model matches or beats the reference WER
    no_regression = wer(optimized_model(example)) <= wer(reference_model(example))
    qoi.append(no_regression)
qoi = (sum(qoi) / len(qoi)) * 100.
```

Note that the ordering of models with respect to `WER` does not necessarily match the ordering with respect to `QoI`. This is because the reference model gets assigned
a QoI of 100% by definition. Any per-example regression by other implementations gets penalized while per-example improvements are not rewarded. `QoI` (higher is better) matters
where the production behavior is established by the reference results and the goal is to not regress when switching to an optimized or compressed model. On the other hand,
`WER` (lower is better) matters when there is no established production behavior and one is picking the best quality-versus-model-size trade-off point.
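
For illustration, here is a minimal, runnable sketch of the `qoi` computation above, assuming per-example transcripts have already been collected and using the third-party `jiwer` package for WER; the variable names are hypothetical and not part of whisperkittools:

```python
# QoI sketch: fraction of examples where the optimized model's WER does not
# regress relative to the reference model's WER (assumes `pip install jiwer`).
from jiwer import wer

ground_truth     = ["the quick brown fox", "jumped over the lazy dog"]
reference_output = ["the quick brown fox", "jumped over the lazy dog"]
optimized_output = ["the quick brown fox", "jumped over a lazy dog"]

qoi = []
for truth, ref_hyp, opt_hyp in zip(ground_truth, reference_output, optimized_output):
    # No-regression check: optimized WER must be <= reference WER on this example
    qoi.append(wer(truth, opt_hyp) <= wer(truth, ref_hyp))

qoi = (sum(qoi) / len(qoi)) * 100.
print(f"QoI: {qoi:.1f}%")  # 50.0 for this toy pair of examples
```

In a real evaluation, the two hypothesis lists would come from transcribing the same audio with the reference and optimized implementations.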

We anticipate developers that use Whisper (or similar models) in production to have their own Quality Assurance test sets and [whisperkittools](https://github.com/argmaxinc/whisperkittools) offers
the tooling necessary to run the same measurements on such custom test sets; please see [Model Evaluation on Custom Dataset](https://github.com/argmaxinc/whisperkittools) for details.

### Datasets
- [librispeech](https://huggingface.co/datasets/argmaxinc/librispeech): ~5 hours of short English audio clips, tests short-form transcription quality
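
As a quick way to inspect an evaluation dataset before running a full benchmark, the audio and reference transcripts can be pulled from the Hugging Face Hub with the `datasets` library. This is only a sketch; the split name below is an assumption, so check the dataset card for the actual splits:

```python
# Sketch: load the librispeech eval set from the Hub (assumes `pip install datasets`).
from datasets import load_dataset

# "test" is an assumed split name, not documented above.
librispeech = load_dataset("argmaxinc/librispeech", split="test")
print(librispeech[0])  # one example with its audio and reference transcript
```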

### Glossary

- `_turbo`: Indicates the presence of additional optimizations (not compression) to unlock streaming transcription
as described in our [Blog Post](https://www.takeargmax.com/blog/whisperkit).