Commit 1543414 by ardaatahan
Parent(s): 0

initial commit

Files changed:
- .gitignore +58 -0
- .pre-commit-config.yaml +18 -0
- Makefile +12 -0
- README.md +85 -0
- constants.py +254 -0
- dashboard_data/config.json +136 -0
- dashboard_data/device_map.json +14 -0
- dashboard_data/diff_checker_data.json +0 -0
- dashboard_data/multilingual_confusion_matrices.json +0 -0
- dashboard_data/multilingual_results.csv +17 -0
- dashboard_data/performance_data.json +0 -0
- dashboard_data/quality_data.json +23 -0
- dashboard_data/support_data.csv +23 -0
- main.py +1302 -0
- multilingual_generate.py +132 -0
- performance_generate.py +465 -0
- quality_generate.py +186 -0
- requirements.txt +122 -0
- static/Zwizz-Medium.woff +0 -0
- static/Zwizz-Regular.woff +0 -0
- static/Zwizz-SemiBold.woff +0 -0
- text_normalizer.py +2374 -0
- utils.py +991 -0
.gitignore
ADDED
@@ -0,0 +1,58 @@
# OS generated files
.DS_Store
Thumbs.db

# Environment files
*.env
.env

# Python virtual environment
venv/
env/
*.pyc
__pycache__/

# Hugging Face related
.huggingface

# Project specific
argmaxinc/
table_data.json

# Jupyter Notebook
.ipynb_checkpoints

# PyCharm
.idea/

# VS Code
.vscode/

# Gradio temporary files
gradio_cached_examples/

# Logs
*.log

# Dependency directories
node_modules/

# Distribution / packaging
dist/
build/
*.egg-info/

# Temporary files
*.tmp
*.bak
*.swp

# Dataset files (if you don't want to track them)
*.jsonl

# Model files (if you don't want to track them)
*.pth
*.h5
*.ckpt

.gradio/
.pre-commit-config.yaml
ADDED
@@ -0,0 +1,18 @@
repos:
  - repo: https://github.com/pycqa/isort
    rev: 5.13.2
    hooks:
      - id: isort
        args: ["--profile", "black"]

  - repo: https://github.com/psf/black
    rev: 23.3.0
    hooks:
      - id: black
        name: black
        language: python

  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.5.0
    hooks:
      - id: end-of-file-fixer
Makefile
ADDED
@@ -0,0 +1,12 @@
.PHONY: format use-huggingface-data use-local-data

format:
	@pre-commit run --all-files

use-huggingface-data:
	@python multilingual_generate.py download
	@python performance_generate.py download
	@python quality_generate.py

use-local-data:
	@python performance_generate.py
README.md
ADDED
@@ -0,0 +1,85 @@
---
title: WhisperKit Benchmarks
emoji: 🏆
colorFrom: green
colorTo: indigo
sdk: gradio
app_file: main.py
license: mit
---

## Prerequisites

Ensure you have the following software installed:

- Python 3.10 or higher
- pip (Python package installer)

## Installation

1. **Clone the repository**:

   ```sh
   git clone https://github.com/argmaxinc/model-performance-dashboard.git
   cd model-performance-dashboard
   ```

2. **Create a virtual environment**:

   ```sh
   python -m venv venv
   source venv/bin/activate
   ```

3. **Install required packages**:
   ```sh
   pip install -r requirements.txt
   ```

## Usage

1. **Run the application**:

   ```sh
   gradio main.py
   ```

2. **Access the application**:
   After running main.py, a local server will start, and you will see an interface URL in the terminal. Open the URL in your web browser to interact with the Argmax Benchmark dashboard.

## Data Generation

The data generation process involves three main scripts: performance_generate.py, multilingual_generate.py, and quality_generate.py. Each script is responsible for updating a specific aspect of the benchmark data.

1. **Performance Data Update (performance_generate.py)**:

   - Downloads benchmark data from [WhisperKit Evals Dataset](https://huggingface.co/datasets/argmaxinc/whisperkit-evals-dataset).
   - Processes the data to extract performance metrics for various models, devices, and operating systems.
   - Calculates metrics such as speed and tokens per second for long- and short-form data.
   - Saves the results in `performance_data.json` and `support_data.csv`.

2. **Multilingual Data Update (multilingual_generate.py)**:

   - Downloads multilingual evaluation data from [WhisperKit Multilingual Evals Dataset](https://huggingface.co/datasets/argmaxinc/whisperkit-evals-multilingual).
   - Processes the data to generate confusion matrices for language detection.
   - Calculates metrics for both forced and unforced language detection scenarios.
   - Saves the results in `multilingual_confusion_matrices.json` and `multilingual_results.csv`.

3. **Quality Data Update (quality_generate.py)**:
   - Downloads quality evaluation data from [WhisperKit Evals](https://huggingface.co/datasets/argmaxinc/whisperkit-evals).
   - Processes the data to calculate Word Error Rate (WER) and Quality of Inference (QoI) metrics for each dataset.
   - Saves the results in `quality_data.json`.
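
As a rough illustration of the two quality metrics named above, here is a minimal Python sketch. It is not taken from `quality_generate.py`: the function names are hypothetical, the tokenization is a plain whitespace split (the repository ships a `text_normalizer.py`, so the real pipeline presumably normalizes text first), and QoI is interpreted as the percentage of examples whose WhisperKit WER is no higher than the reference WER.

```python
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words consumed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,  # deletion (reference word missing)
                    curr[j - 1] + 1,  # insertion (extra hypothesis word)
                    prev[j - 1] + cost,  # substitution or match
                )
            )
        prev = curr
    return prev[-1] / len(ref)


def quality_of_inference(
    candidate_wers: Sequence[float], reference_wers: Sequence[float]
) -> float:
    """Percentage of examples where the candidate is no worse than the reference."""
    if not candidate_wers or len(candidate_wers) != len(reference_wers):
        raise ValueError("need two non-empty sequences of equal length")
    no_worse = sum(c <= r for c, r in zip(candidate_wers, reference_wers))
    return 100.0 * no_worse / len(candidate_wers)
```

Because QoI only counts per-example regressions, a model can have a slightly higher average WER than the reference and still score a high QoI, or the reverse.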

## Data Update

To update the dashboard with the latest data from our Hugging Face datasets, run:

```sh
make use-huggingface-data
```

Alternatively, you can use our on-device testing code [TODO:INSERT_LINK_TO_OS_TEST_CODE] on your device to update the dashboard with your own data. After generating the Xcode data, place the resulting `.json` files in the `whisperkit-evals/xcresults/benchmark_data` directory, then run:

```sh
make use-local-data
```
constants.py
ADDED
@@ -0,0 +1,254 @@
from textwrap import dedent

from iso639 import Lang

BANNER_TEXT = """
<div style="text-align: center;">
<h1><a href='https://github.com/argmaxinc/WhisperKit'>WhisperKit Benchmarks</a></h1>
</div>
"""


INTRO_LABEL = """We present comprehensive benchmarks for WhisperKit, our on-device ASR solution, compared against a reference implementation. These benchmarks aim to help developers and enterprises make informed decisions when choosing optimized or compressed variants of machine learning models for production use. Show more."""


INTRO_TEXT = """
<h3 style="display: flex;
justify-content: center;
align-items: center;
"></h3>
\n📈 Key Metrics:
Word Error Rate (WER) (⬇️): The percentage of words incorrectly transcribed. Lower is better.
Quality of Inference (QoI) (⬆️): Percentage of examples where WhisperKit performs no worse than the reference model. Higher is better.
Tokens per Second (⬆️): The number of output tokens generated per second. Higher is better.
Speed (⬆️): Input audio seconds transcribed per second. Higher is better.

🎯 WhisperKit is evaluated across different datasets, with a focus on per-example no-regressions (QoI) and overall accuracy (WER).
\n💻 Our benchmarks include:
Reference: <a href='https://platform.openai.com/docs/guides/speech-to-text'>WhisperOpenAIAPI</a> (OpenAI's Whisper API)
On-device: <a href='https://github.com/argmaxinc/WhisperKit'>WhisperKit</a> (various versions and optimizations)

ℹ️ Reference Implementation:
<a href='https://platform.openai.com/docs/guides/speech-to-text'>WhisperOpenAIAPI</a> sets the reference standard. We assume it uses the equivalent of openai/whisper-large-v2 in float16 precision, along with additional undisclosed optimizations from OpenAI. As of 02/29/24, it costs $0.36 per hour of audio and has a 25MB file size limit per request.
\n🔍 We use two primary datasets:
<a href='https://huggingface.co/datasets/argmaxinc/librispeech'>LibriSpeech</a>: ~5 hours of short English audio clips
<a href='https://huggingface.co/datasets/argmaxinc/earnings22'>Earnings22</a>: ~120 hours of English audio from earnings calls

🌐 Multilingual Benchmarks:
These benchmarks aim to demonstrate WhisperKit's capabilities across diverse languages, helping developers assess its suitability for multilingual applications.
\nDataset:
<a href='https://huggingface.co/datasets/argmaxinc/whisperkit-evals-multilingual'>Common Voice 17.0</a>: Short-form audio files (<30s/clip) for a maximum of 400 samples per language from Common Voice 17.0. The test set covers a wide range of languages to test the model's versatility.

\nMetrics:
Average WER: Provides an overall measure of model performance across all languages.
Language-specific WER: Allows for detailed analysis of model performance for each supported language.
Language Detection Accuracy: Measured using a confusion matrix, showing the model's ability to identify the correct language.
Results are shown for both forced (correct language given as input) and unforced (model detects language) scenarios.

🔄 Results are periodically updated using our automated evaluation pipeline on Apple Silicon Macs.
\n🛠️ Developers can use <a href='https://github.com/argmaxinc/WhisperKit'>WhisperKit</a> to reproduce these results or run evaluations on their own custom datasets.

🔗 Links:
- <a href='https://github.com/argmaxinc/WhisperKit'>WhisperKit</a>
- <a href='https://github.com/argmaxinc/whisperkittools'>whisperkittools</a>
- <a href='https://huggingface.co/datasets/argmaxinc/librispeech'>LibriSpeech</a>
- <a href='https://huggingface.co/datasets/argmaxinc/earnings22'>Earnings22</a>
- <a href='https://huggingface.co/datasets/argmaxinc/whisperkit-evals-multilingual'>Common Voice 17.0</a>
- <a href='https://platform.openai.com/docs/guides/speech-to-text'>WhisperOpenAIAPI</a>
"""


METHODOLOGY_TEXT = dedent(
    """
    # Methodology

    ## Overview
    WhisperKit Benchmarks is the one-stop shop for on-device performance and quality testing of WhisperKit models across supported devices, OS versions and audio datasets.

    ## Metrics

    - **Speed factor** (⬆️): Computed as the ratio of input audio length to end-to-end WhisperKit latency for transcribing that audio. A speed factor of N means N seconds of input audio was transcribed in 1 second.
    - **Tok/s (Tokens per second)** (⬆️): Total number of text decoder forward passes divided by the end-to-end processing time.
      - This metric varies with input data because the pace of speech changes the text decoder's share of overall latency. It should not be confused with the reciprocal of the text decoder latency, which is constant across input files.
    - **WER (Word Error Rate)** (⬇️): The ratio of words incorrectly transcribed when comparing the model's output to reference transcriptions, with lower values indicating better accuracy.
    - **QoI (Quality of Inference)** (⬆️): The ratio of examples where WhisperKit performs no worse than the reference model.
      - This metric does not capture improvements to the reference. It only measures potential regressions.
    - **Parity %**: The percentage difference between a model's Average WER on a given device and its Average WER on the Apple M2 Ultra, where a negative value indicates worse performance compared to the M2 Ultra.
    - **Multilingual results**: Separated into "language hinted" and "language predicted" categories to evaluate performance with and without prior knowledge of the input language.

    ## Data

    - **Short-form**: 5 hours of English audiobook clips with 30s/clip comprising the [librispeech test set](https://huggingface.co/datasets/argmaxinc/librispeech). Proxy for average streaming performance.
    - **Long-form**: 12 hours of earnings call recordings with ~1hr/clip in English with various accents. Built by randomly selecting 10% of the [earnings22 test set](https://huggingface.co/datasets/argmaxinc/earnings22-12hours). Proxy for average from-file performance.
    - Full datasets are used for English Quality tests and random 10-minute subsets are used for Performance tests.
    - **Multilingual**: Max 400 samples per language with <30s/clip from [Common Voice 17.0 Test Set](https://huggingface.co/datasets/argmaxinc/common_voice_17_0-argmax_subset-400). Common Voice covers 77 of the 99 languages supported by Whisper.

    ## Performance Measurement

    1. On-device testing is conducted with [WhisperKit Regression Test Automations](https://github.com/argmaxinc/WhisperKit/blob/main/BENCHMARKS.md) on iPhones, iPads, and Macs, across different iOS and macOS versions.
    2. Performance is recorded on the 10-minute datasets described above for short- and long-form audio.
    3. Quality metrics are recorded on full datasets on Apple M2 Ultra Mac Studios to allow for fast processing of many configurations and to provide a consistent, high-performance baseline for all evaluations displayed in the English Quality tab.
    4. Quality is also sanity-checked on 10-minute datasets in order to catch potential correctness regressions across different device and OS combinations despite running the same version of WhisperKit.
    5. Results are aggregated and presented in the dashboard, allowing for easy comparison and analysis.

    ## Dashboard Features

    - Performance: Interactive filtering by model, device, OS, and performance metrics
    - Timeline: Visualizations of performance trends
    - English Quality: English transcription quality on short- and long-form audio
    - Multilingual Quality: Multilingual (77 languages) transcription quality on short-form audio with and without language prediction
    - Device Support: Matrix of supported device, OS and model version combinations. Unsupported combinations are marked with :warning:.
    - This methodology ensures a comprehensive and fair evaluation of speech recognition models supported by WhisperKit across a wide range of scenarios and use cases.
    """
)

PERFORMANCE_TEXT = dedent(
    """
    ## Metrics
    - **Speed factor** (⬆️): Computed as the ratio of input audio length to end-to-end WhisperKit latency for transcribing that audio. A speed factor of N means N seconds of input audio was transcribed in 1 second.
    - **Tok/s (Tokens per second)** (⬆️): Total number of text decoder forward passes divided by the end-to-end processing time.
    - **Parity %**: The percentage difference between a model's Average WER on a given device and its Average WER on the Apple M2 Ultra, where a negative value indicates worse performance compared to the M2 Ultra.

    ## Data

    - **Short-form**: 5 hours of English audiobook clips with 30s/clip comprising the [librispeech test set](https://huggingface.co/datasets/argmaxinc/librispeech).
    - **Long-form**: 12 hours of earnings call recordings with ~1hr/clip in English with various accents. Built by randomly selecting 10% of the [earnings22 test set](https://huggingface.co/datasets/argmaxinc/earnings22-12hours).
    """
)

QUALITY_TEXT = dedent(
    """
    ## Metrics
    - **WER (Word Error Rate)** (⬇️): The ratio of words incorrectly transcribed when comparing the model's output to reference transcriptions, with lower values indicating better accuracy.
    - **QoI (Quality of Inference)** (⬆️): The ratio of examples where WhisperKit performs no worse than the reference model.
      - This metric does not capture improvements to the reference. It only measures potential regressions.
    """
)

COL_NAMES = {
    "model.model_version": "Model",
    "device.product_name": "Device",
    "device.os": "OS",
    "average_wer": "Average WER",
    "qoi": "QoI",
    "speed": "Speed",
    "tokens_per_second": "Tok / s",
    "model": "Model",
    "device": "Device",
    "os": "OS",
    "parity": "Parity %",
}


CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"


CITATION_BUTTON_TEXT = r"""@misc{whisperkit-argmax,
title = {WhisperKit},
author = {Argmax, Inc.},
year = {2024},
URL = {https://github.com/argmaxinc/WhisperKit}
}"""


HEADER = """<div align="center">
<div style="position: relative;">
<img
src=""
style="display:block;width:7%;height:auto;"
/>
</div>
</div>"""


EARNINGS22_URL = (
    "https://huggingface.co/datasets/argmaxinc/earnings22-debug/resolve/main/{0}"
)
LIBRISPEECH_URL = (
    "https://huggingface.co/datasets/argmaxinc/librispeech-debug/resolve/main/{0}"
)

AUDIO_URL = (
    "https://huggingface.co/datasets/argmaxinc/whisperkit-test-data/resolve/main/"
)

WHISPER_OPEN_AI_LINK = "https://huggingface.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/{}/{}"

BASE_WHISPERKIT_BENCHMARK_URL = "https://huggingface.co/datasets/argmaxinc/whisperkit-evals-dataset/blob/main/benchmark_data"

AVAILABLE_LANGUAGES = [
    "af",
    "am",
    "ar",
    "as",
    "az",
    "ba",
    "be",
    "bg",
    "bn",
    "br",
    "ca",
    "cs",
    "cy",
    "da",
    "de",
    "el",
    "en",
    "es",
    "et",
    "eu",
    "fa",
    "fi",
    "fr",
    "gl",
    "ha",
    "he",
    "hi",
    "hu",
    "hy",
    "id",
    "it",
    "ja",
    "ka",
    "kk",
    "ko",
    "lo",
    "lt",
    "lv",
    "mk",
    "ml",
    "mn",
    "mr",
    "mt",
    "ne",
    "nl",
    "nn",
    "oc",
    "pa",
    "pl",
    "ps",
    "pt",
    "ro",
    "ru",
    "sk",
    "sl",
    "sq",
    "sr",
    "sv",
    "sw",
    "ta",
    "te",
    "th",
    "tk",
    "tr",
    "tt",
    "uk",
    "ur",
    "uz",
    "vi",
    "yi",
    "yo",
    "yue",
    "zh",
]
LANGUAGE_MAP = {lang: Lang(lang).name for lang in AVAILABLE_LANGUAGES}
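
The performance metrics defined in `METHODOLOGY_TEXT` and `PERFORMANCE_TEXT` can be sketched in a few lines of Python. This is a hedged illustration, not code from this commit: the function names are hypothetical, and the parity formula in particular is an assumption. The text only says that parity is a percentage difference against the Apple M2 Ultra and that negative means worse, so the sketch normalizes by the M2 Ultra WER and orients the sign accordingly.

```python
def speed_factor(audio_seconds: float, latency_seconds: float) -> float:
    """Seconds of input audio transcribed per second of end-to-end latency."""
    if latency_seconds <= 0:
        raise ValueError("latency_seconds must be positive")
    return audio_seconds / latency_seconds


def tokens_per_second(decoder_forward_passes: int, latency_seconds: float) -> float:
    """Text decoder forward passes divided by end-to-end processing time."""
    if latency_seconds <= 0:
        raise ValueError("latency_seconds must be positive")
    return decoder_forward_passes / latency_seconds


def parity_percent(device_wer: float, m2_ultra_wer: float) -> float:
    """Assumed formula: relative WER difference against the M2 Ultra baseline.

    Negative when the device's WER is higher (worse) than the baseline.
    """
    if m2_ultra_wer <= 0:
        raise ValueError("m2_ultra_wer must be positive")
    return 100.0 * (m2_ultra_wer - device_wer) / m2_ultra_wer
```

For example, transcribing a 600-second clip in 20 seconds gives a speed factor of 30, and a device WER of 10.0 against an M2 Ultra WER of 8.0 gives a parity of -25 under the assumed formula.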
dashboard_data/config.json
ADDED
@@ -0,0 +1,136 @@
{
  "name": "whisperkit-coreml",
  "version": "0.2",
  "device_support": [
    {
      "identifiers": ["iPhone11", "iPhone12", "Watch7", "Watch8"],
      "models": {
        "default": "openai_whisper-tiny",
        "supported": [
          "openai_whisper-tiny",
          "openai_whisper-tiny.en",
          "openai_whisper-base",
          "openai_whisper-base.en"
        ]
      }
    },
    {
      "identifiers": ["iPhone13", "iPad13,18", "iPad13,1"],
      "models": {
        "default": "openai_whisper-base",
        "supported": [
          "openai_whisper-tiny",
          "openai_whisper-tiny.en",
          "openai_whisper-base",
          "openai_whisper-base.en",
          "openai_whisper-small",
          "openai_whisper-small.en"
        ]
      }
    },
    {
      "identifiers": [
        "iPhone14",
        "iPhone15",
        "iPhone16",
        "iPhone17",
        "iPad14,1",
        "iPad14,2"
      ],
      "models": {
        "default": "openai_whisper-base",
        "supported": [
          "openai_whisper-tiny",
          "openai_whisper-tiny.en",
          "openai_whisper-base",
          "openai_whisper-base.en",
          "openai_whisper-small",
          "openai_whisper-small.en",
          "openai_whisper-large-v2_949MB",
          "openai_whisper-large-v2_turbo_955MB",
          "openai_whisper-large-v3_947MB",
          "openai_whisper-large-v3_turbo_954MB",
          "distil-whisper_distil-large-v3_594MB",
          "distil-whisper_distil-large-v3_turbo_600MB",
          "openai_whisper-large-v3-v20240930_626MB",
          "openai_whisper-large-v3-v20240930_turbo_632MB"
        ]
      }
    },
    {
      "identifiers": [
        "Mac13",
        "iMac21",
        "MacBookAir10,1",
        "MacBookPro17",
        "MacBookPro18",
        "Macmini9",
        "iPad13,16",
        "iPad13,4",
        "iPad13,8"
      ],
      "models": {
        "default": "openai_whisper-large-v3-v20240930",
        "supported": [
          "openai_whisper-tiny",
          "openai_whisper-tiny.en",
          "openai_whisper-base",
          "openai_whisper-base.en",
          "openai_whisper-small",
          "openai_whisper-small.en",
          "openai_whisper-large-v2",
          "openai_whisper-large-v2_949MB",
          "openai_whisper-large-v3",
          "openai_whisper-large-v3_947MB",
          "distil-whisper_distil-large-v3",
          "distil-whisper_distil-large-v3_594MB",
          "openai_whisper-large-v3-v20240930",
          "openai_whisper-large-v3-v20240930_626MB"
        ]
      }
    },
    {
      "identifiers": [
        "Mac14",
        "Mac15",
        "Mac16",
        "iPad14,3",
        "iPad14,4",
        "iPad14,5",
        "iPad14,6",
        "iPad14,8",
        "iPad14,9",
        "iPad14,10",
        "iPad14,11",
        "iPad16"
      ],
      "models": {
        "default": "openai_whisper-large-v3-v20240930",
        "supported": [
          "openai_whisper-tiny",
          "openai_whisper-tiny.en",
          "openai_whisper-base",
          "openai_whisper-base.en",
          "openai_whisper-small",
          "openai_whisper-small.en",
          "openai_whisper-large-v2",
          "openai_whisper-large-v2_949MB",
          "openai_whisper-large-v2_turbo",
          "openai_whisper-large-v2_turbo_955MB",
          "openai_whisper-large-v3",
          "openai_whisper-large-v3_947MB",
          "openai_whisper-large-v3_turbo",
          "openai_whisper-large-v3_turbo_954MB",
          "distil-whisper_distil-large-v3",
          "distil-whisper_distil-large-v3_594MB",
          "distil-whisper_distil-large-v3_turbo",
          "distil-whisper_distil-large-v3_turbo_600MB",
          "openai_whisper-large-v3-v20240930",
          "openai_whisper-large-v3-v20240930_turbo",
          "openai_whisper-large-v3-v20240930_626MB",
          "openai_whisper-large-v3-v20240930_turbo_632MB"
        ]
      }
    }
  ]
}
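
One plausible way to read `config.json` is as a prefix table: a hardware identifier such as `iPad13,16` is matched against the `identifiers` of each `device_support` entry. The sketch below is an assumption about how such a lookup could work, not the resolution logic used by WhisperKit or by `main.py`, which this commit excerpt does not show. The function name `supported_models` is hypothetical, and `CONFIG_EXCERPT` is a trimmed copy of two entries from the file above.

```python
# Trimmed excerpt of dashboard_data/config.json (two entries, shortened lists).
CONFIG_EXCERPT = {
    "device_support": [
        {
            "identifiers": ["iPhone13", "iPad13,18", "iPad13,1"],
            "models": {
                "default": "openai_whisper-base",
                "supported": ["openai_whisper-tiny", "openai_whisper-base"],
            },
        },
        {
            "identifiers": ["Mac13", "iPad13,16", "iPad13,4", "iPad13,8"],
            "models": {
                "default": "openai_whisper-large-v3-v20240930",
                "supported": [
                    "openai_whisper-tiny",
                    "openai_whisper-large-v3-v20240930",
                ],
            },
        },
    ]
}


def supported_models(config: dict, device_identifier: str) -> list:
    """Return the supported models of the entry with the longest matching prefix.

    Longest-prefix matching is an assumption: it keeps "iPad13,16" from
    being captured by the shorter "iPad13,1" prefix in another entry.
    """
    best_length, best_models = -1, []
    for entry in config["device_support"]:
        for prefix in entry["identifiers"]:
            if device_identifier.startswith(prefix) and len(prefix) > best_length:
                best_length = len(prefix)
                best_models = entry["models"]["supported"]
    return best_models
```

With a first-match rule instead of longest-match, `iPad13,16` would resolve to the smaller model set of the `iPad13,1` entry, which is why the sketch compares prefix lengths.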
dashboard_data/device_map.json
ADDED
@@ -0,0 +1,14 @@
{
  "Mac14,12": "Apple M2 Pro",
  "Mac14,14": "Apple M2 Ultra",
  "Mac15,3": "Apple M3",
  "Mac15,9": "Apple M3 Max",
  "iPad14,8": "iPad Air 11-inch (M2)",
  "iPad16,1": "iPad mini (A17 Pro)",
  "iPad16,3": "iPad Pro 11-inch (M4)",
  "iPhone12,1": "iPhone 11",
  "iPhone14,2": "iPhone 13 Pro",
  "iPhone14,5": "iPhone 13",
  "iPhone14,7": "iPhone 14",
  "iPhone17,1": "iPhone 16 Pro"
}
dashboard_data/diff_checker_data.json
ADDED
File without changes
dashboard_data/multilingual_confusion_matrices.json
ADDED
The diff for this file is too large to render.
dashboard_data/multilingual_results.csv
ADDED
@@ -0,0 +1,17 @@
Model,Forced Tokens,Average WER,WER_sl,WER_sk,WER_ur,WER_sw,WER_uz,WER_pl,WER_vi,WER_sq,WER_sv,WER_he,WER_mt,WER_hy,WER_am,WER_nn,WER_be,WER_da,WER_mr,WER_kk,WER_mn,WER_ja,WER_el,WER_lv,WER_oc,WER_it,WER_ca,WER_cs,WER_te,WER_ru,WER_tk,WER_ro,WER_yo,WER_yue,WER_yi,WER_pt,WER_ps,WER_zh,WER_uk,WER_sr,WER_pa,WER_ml,WER_mk,WER_ba,WER_ha,WER_ar,WER_gl,WER_hu,WER_nl,WER_bg,WER_bn,WER_ne,WER_af,WER_hi,WER_ka,WER_de,WER_as,WER_az,WER_br,WER_ko,WER_fi,WER_id,WER_fr,WER_es,WER_et,WER_en,WER_fa,WER_lt,WER_cy,WER_eu,WER_lo,WER_tt,WER_ta,WER_th,WER_tr
openai_whisper-large-v3-v20240930,False,51.57,36.9,29.71,46.48,64.04,110.02,14.74,14.89,69.25,18.09,29.11,86.41,74.32,145.83,50.03,79.08,19.43,67.19,43.57,116.51,26.33,21.75,32.92,73.51,14.39,20.36,14.41,140.14,15.81,112.64,15.2,95.06,51.16,103.7,16.37,111.73,27.24,24.08,62.2,104.28,121.81,48.07,102.63,104.87,40.62,18.12,16.39,11.46,21.95,98.71,86.28,37.8,43.31,137.87,14.01,103.2,38.1,100.68,20.79,16.62,12.28,16.31,5.94,32.21,12.54,60.73,35.6,57.45,42.35,103.14,98.21,44.83,31.0,31.24
openai_whisper-large-v3-v20240930,True,46.09,27.13,24.61,25.59,61.29,98.84,12.12,16.92,65.69,12.97,26.85,84.04,73.95,128.9,39.97,61.51,17.63,48.26,41.87,97.08,21.97,17.73,30.78,71.01,12.83,18.25,12.85,75.43,13.28,104.35,11.41,89.71,64.28,100.0,14.93,95.78,25.34,19.14,54.07,120.4,112.94,34.52,100.0,96.64,31.45,15.0,15.3,8.91,20.42,79.7,63.89,36.54,26.14,132.26,12.26,105.14,33.33,95.96,20.75,15.42,11.11,15.51,6.1,31.51,12.13,55.96,32.84,54.92,40.65,114.11,98.39,41.54,23.3,24.29
openai_whisper-tiny,False,105.22,121.79,133.13,113.57,119.78,118.34,103.44,99.27,119.73,82.19,122.66,112.51,132.53,120.31,103.18,115.1,99.88,101.13,125.86,114.12,82.61,130.16,112.75,100.2,89.71,82.85,125.93,113.15,109.31,117.09,118.71,109.16,88.8,120.37,81.94,115.21,79.77,115.73,114.84,103.05,105.04,117.77,116.19,109.58,159.43,78.61,129.6,76.43,122.17,100.32,104.44,118.43,102.01,140.29,79.95,100.86,146.83,110.82,110.13,93.9,124.4,67.17,68.41,113.76,33.14,122.06,133.54,112.59,132.9,106.52,123.61,100.96,110.41,125.91
openai_whisper-tiny,True,86.1,81.42,92.88,70.33,112.75,122.16,56.82,50.52,99.17,64.45,72.15,103.81,133.47,140.93,102.1,98.9,79.55,102.76,179.77,128.57,53.32,66.33,89.27,93.19,60.12,59.02,81.79,133.22,58.43,124.62,66.43,111.99,90.36,102.78,65.71,105.43,65.2,69.07,80.42,104.57,133.29,83.84,110.69,97.86,97.63,54.0,85.5,54.06,83.5,106.27,103.34,93.39,102.17,140.86,49.53,112.36,90.67,102.29,62.34,72.56,54.08,59.57,34.99,101.75,33.4,130.66,101.05,93.62,97.05,113.15,107.44,80.94,42.42,66.07
openai_whisper-small,False,96.89,116.31,109.59,106.84,110.05,117.73,69.24,74.63,110.83,52.47,97.58,109.76,138.13,118.51,93.33,113.81,66.04,101.28,127.1,115.44,89.93,120.81,101.47,88.06,67.04,51.78,117.66,120.07,92.0,116.14,86.09,106.41,48.07,105.56,37.94,105.77,117.15,101.7,108.27,101.54,102.57,115.41,118.11,104.62,151.92,65.17,66.19,60.49,118.61,100.46,102.89,99.37,100.76,141.61,49.17,100.37,121.83,107.77,137.18,68.85,104.92,54.88,47.03,96.83,18.2,119.04,121.42,111.48,116.31,118.1,120.3,100.63,115.33,108.8
openai_whisper-small,True,69.14,49.09,51.74,40.93,96.1,115.21,23.74,25.43,89.94,23.97,43.29,96.2,120.55,130.06,164.5,78.06,37.18,89.86,82.95,262.79,30.25,31.49,62.47,114.14,25.02,30.35,37.7,311.76,26.09,174.55,26.99,161.95,48.61,100.0,35.7,94.66,42.22,40.02,60.5,160.3,115.92,50.81,118.57,97.46,47.01,30.45,44.66,19.94,49.16,129.67,107.33,71.02,45.58,,23.87,131.95,62.3,,34.7,30.07,23.81,27.11,11.94,72.39,17.35,97.5,75.61,67.42,77.08,102.07,103.03,42.18,21.52,33.32
openai_whisper-large-v3,False,54.77,41.01,32.74,44.39,66.07,110.74,17.82,14.19,64.45,13.59,36.31,96.11,70.79,134.23,52.71,85.41,16.63,60.48,58.79,122.59,33.66,28.76,27.08,78.1,13.55,17.32,20.67,123.88,16.13,107.71,10.99,110.75,53.95,105.56,14.51,103.6,42.68,32.28,64.23,101.2,102.62,68.19,100.46,99.97,36.03,23.57,13.44,12.17,30.2,98.44,101.1,40.79,75.87,149.84,14.75,100.54,35.32,106.08,20.94,15.9,11.86,15.51,6.36,30.06,12.7,62.68,32.37,51.06,45.38,104.29,100.73,81.72,38.07,26.73
openai_whisper-large-v3,True,34.23,18.87,18.44,21.24,58.02,90.52,10.13,12.32,53.97,9.81,23.79,78.78,54.56,,29.37,45.53,13.89,42.37,48.61,87.75,20.38,12.35,21.06,65.39,11.11,14.69,12.04,61.25,13.0,99.39,5.39,97.25,14.27,101.85,13.75,88.95,25.41,15.59,41.4,57.1,107.34,20.59,99.25,91.39,23.08,13.06,12.44,7.03,17.37,,52.77,36.38,20.33,,9.89,,21.43,86.38,20.37,10.32,9.47,13.67,4.93,28.43,12.21,45.43,27.63,35.05,40.65,102.76,90.45,28.97,6.11,17.88
openai_whisper-large-v3-v20240930_626MB,False,52.29,39.68,29.99,49.08,66.59,107.43,15.31,15.95,71.18,17.19,32.01,88.37,79.06,135.02,51.08,80.09,20.74,71.26,47.37,105.47,25.78,22.21,34.77,74.12,15.26,20.99,15.98,139.45,16.29,106.18,16.59,95.23,51.42,101.85,16.46,107.76,29.67,27.49,64.5,103.61,115.9,47.55,100.79,103.61,38.22,19.62,17.52,11.63,24.46,98.93,85.04,39.69,47.4,133.75,14.69,104.02,38.49,101.45,22.74,16.62,12.28,17.04,6.02,34.3,13.39,62.2,38.5,60.13,45.51,103.6,98.12,48.42,35.09,31.2
openai_whisper-large-v3-v20240930_626MB,True,47.64,30.62,25.67,26.93,62.36,97.82,13.11,17.36,67.47,12.72,29.16,84.89,77.31,111.23,39.77,63.57,18.94,50.51,45.76,97.71,22.33,18.71,31.72,72.64,13.16,19.39,14.49,84.78,14.37,102.51,12.24,93.42,66.1,100.0,14.85,95.84,27.18,21.48,56.17,134.5,123.72,36.74,98.12,95.73,32.09,15.48,17.05,9.05,22.25,81.89,63.49,40.0,27.3,128.28,13.21,102.9,34.52,96.28,22.01,15.17,12.66,15.68,5.97,33.98,13.03,56.96,35.97,57.0,43.62,121.47,99.17,42.74,23.09,24.11
openai_whisper-large-v3-v20240930_547MB,False,61.3,56.47,43.16,61.91,88.9,109.85,24.17,22.93,88.74,26.09,45.97,96.7,107.38,134.76,57.25,85.1,27.18,70.85,71.7,109.68,30.21,32.19,50.95,80.91,20.97,30.91,27.66,137.02,20.13,112.37,27.76,99.08,65.2,116.67,21.24,103.29,38.97,40.09,73.1,103.36,116.51,67.78,109.58,103.61,53.89,27.16,30.42,17.39,39.13,102.58,86.68,51.97,58.32,132.81,19.31,103.2,59.92,103.96,26.14,23.24,18.05,23.09,8.1,47.03,16.56,84.54,55.51,75.58,63.11,111.89,103.54,61.24,58.17,42.66
openai_whisper-large-v3-v20240930_547MB,True,54.61,40.12,35.54,35.4,78.38,102.05,19.6,25.97,81.39,19.25,38.72,89.86,109.32,146.41,46.71,69.67,25.3,60.25,66.07,101.55,25.79,26.23,45.55,75.4,18.77,27.16,23.73,106.23,18.63,108.87,18.26,97.33,74.97,101.85,18.91,95.65,34.74,30.15,63.01,,120.76,47.96,104.64,100.75,41.63,20.54,27.11,14.18,33.31,96.72,74.91,48.5,38.49,129.01,17.62,101.41,47.22,99.32,26.25,20.31,17.25,22.0,7.84,45.12,16.12,77.07,49.49,71.63,57.09,114.95,101.79,53.56,39.58,33.54
openai_whisper-large-v2,False,94.09,119.27,112.44,106.95,110.77,122.16,75.3,61.28,112.19,43.62,91.08,112.01,137.54,118.3,90.82,118.11,25.33,100.79,152.88,115.67,79.34,113.99,69.07,91.12,50.45,40.5,112.0,113.84,99.36,123.07,96.59,110.1,52.67,100.0,47.38,106.7,125.66,95.49,,101.13,102.34,118.31,124.76,105.15,143.76,63.7,44.82,48.04,119.44,100.18,102.64,101.1,100.38,154.52,65.08,100.59,85.71,105.79,97.02,48.57,92.39,31.95,46.74,99.3,13.74,116.06,137.91,74.42,107.11,111.27,131.83,100.18,119.35,113.86
openai_whisper-large-v2,True,47.14,25.76,25.84,25.24,67.14,100.99,12.51,17.69,65.57,12.16,24.01,83.34,62.4,176.79,47.12,49.05,16.72,48.12,58.01,136.6,22.6,15.04,28.69,72.69,14.34,16.2,17.14,165.74,15.11,115.56,7.86,95.93,53.06,105.56,15.23,99.75,36.59,20.95,43.09,105.46,114.5,25.32,107.07,115.23,26.39,16.27,16.72,8.93,21.52,103.9,62.0,47.24,25.92,150.19,11.7,107.19,29.37,106.33,24.84,13.13,12.2,16.21,6.93,35.96,12.7,53.38,38.94,32.85,49.09,103.76,105.51,28.08,8.76,19.55
openai_whisper-base,False,104.18,125.5,143.16,112.16,113.79,122.43,99.57,99.07,123.02,75.03,98.01,114.12,138.03,114.66,99.56,122.48,81.21,101.78,137.07,115.04,91.47,131.68,117.3,96.63,78.16,69.85,128.78,132.18,103.59,125.21,114.33,106.87,72.69,125.93,59.81,114.59,74.81,113.57,119.31,103.18,105.38,123.9,123.21,109.11,160.77,70.55,110.22,80.49,122.28,100.7,104.89,108.98,101.32,143.37,61.29,100.7,134.52,111.44,136.38,102.17,126.88,58.74,58.79,115.46,25.71,122.69,149.93,110.1,126.44,116.79,132.15,101.37,112.61,122.57
openai_whisper-base,True,79.92,72.07,76.57,59.1,106.54,171.77,43.44,40.15,100.62,45.53,61.22,102.6,208.4,165.98,83.98,92.45,61.96,103.31,99.05,201.0,42.73,55.22,81.5,82.95,46.45,48.59,67.24,117.99,44.21,151.1,54.19,105.42,70.49,111.11,48.98,98.51,53.88,58.31,76.9,100.38,119.75,74.21,134.5,116.11,72.17,47.63,71.07,37.01,73.54,100.79,101.9,87.4,102.24,117.93,38.09,109.8,84.13,106.76,48.87,56.32,43.04,45.09,24.55,91.31,25.11,104.21,91.07,87.41,98.64,106.13,108.77,60.25,32.91,51.87
dashboard_data/performance_data.json
ADDED
The diff for this file is too large to render.
dashboard_data/quality_data.json
ADDED
@@ -0,0 +1,23 @@
{"model": "openai/whisper-large-v3/947MB", "timestamp": "2024-10-18_16:59:10_GMT-0700", "average_wer": 9.74, "dataset_wer": {"librispeech": 2.41, "earnings22-12hours": 17.08}, "qoi": 0.94}
{"model": "openai/whisper-large-v2/turbo/955MB", "timestamp": "2024-10-18_16:52:35_GMT-0700", "average_wer": 7.27, "dataset_wer": {"librispeech": 2.4, "earnings22-12hours": 12.14}, "qoi": 0.94}
{"model": "openai/whisper-tiny.en", "timestamp": "2024-10-19_15:40:06_GMT-0700", "average_wer": 12.23, "dataset_wer": {"librispeech": 5.61, "earnings22-12hours": 18.86}, "qoi": 0.63}
{"model": "distil-whisper/distil-large-v3/594MB", "timestamp": "2024-10-20_13:02:33_GMT-0700", "average_wer": 8.96, "dataset_wer": {"librispeech": 2.87, "earnings22-12hours": 15.06}, "qoi": 0.86}
{"model": "openai/whisper-large-v2/949MB", "timestamp": "2024-10-18_19:51:30_GMT-0400", "average_wer": 7.88, "dataset_wer": {"librispeech": 2.38, "earnings22-12hours": 13.39}, "qoi": 0.94}
{"model": "openai/whisper-large-v3/turbo/954MB", "timestamp": "2024-10-20_13:49:26_GMT-0700", "average_wer": 22.75, "dataset_wer": {"librispeech": 2.51, "earnings22-12hours": 43.0}, "qoi": 0.93}
{"model": "distil-whisper/distil-large-v3", "timestamp": "2024-10-20_20:32:22_GMT-0700", "average_wer": 7.2, "dataset_wer": {"librispeech": 2.38, "earnings22-12hours": 12.02}, "qoi": 0.9}
{"model": "openai/whisper-large-v3-v20240930", "timestamp": "2024-10-18_18:35:46_GMT-0700", "average_wer": 6.74, "dataset_wer": {"librispeech": 1.93, "earnings22-12hours": 11.55}, "qoi": 0.94}
{"model": "openai/whisper-tiny", "timestamp": "2024-10-20_20:19:04_GMT-0700", "average_wer": 14.21, "dataset_wer": {"librispeech": 7.46, "earnings22-12hours": 20.97}, "qoi": 0.52}
{"model": "openai/whisper-large-v3-v20240930/turbo/632MB", "timestamp": "2024-10-18_20:10:30_GMT-0700", "average_wer": 6.86, "dataset_wer": {"librispeech": 1.95, "earnings22-12hours": 11.77}, "qoi": 0.93}
{"model": "openai/whisper-large-v2/turbo", "timestamp": "2024-10-18_14:58:38_GMT-0700", "average_wer": 7.25, "dataset_wer": {"librispeech": 2.4, "earnings22-12hours": 12.1}, "qoi": 0.96}
{"model": "openai/whisper-small", "timestamp": "2024-10-18_12:40:03_GMT-0700", "average_wer": 8.11, "dataset_wer": {"librispeech": 3.21, "earnings22-12hours": 13.0}, "qoi": 0.83}
{"model": "openai/whisper-large-v3-v20240930/turbo", "timestamp": "2024-10-18_19:37:26_GMT-0700", "average_wer": 6.72, "dataset_wer": {"librispeech": 1.92, "earnings22-12hours": 11.52}, "qoi": 0.94}
{"model": "openai/whisper-large-v3", "timestamp": "2024-10-18_18:01:14_GMT-0400", "average_wer": 6.85, "dataset_wer": {"librispeech": 2.02, "earnings22-12hours": 11.69}, "qoi": 0.95}
{"model": "openai/whisper-large-v3-v20240930/626MB", "timestamp": "2024-10-18_19:21:06_GMT-0700", "average_wer": 7.15, "dataset_wer": {"librispeech": 1.96, "earnings22-12hours": 12.35}, "qoi": 0.93}
{"model": "openai/whisper-base.en", "timestamp": "2024-10-20_12:31:44_GMT-0700", "average_wer": 9.59, "dataset_wer": {"librispeech": 3.98, "earnings22-12hours": 15.2}, "qoi": 0.75}
{"model": "openai/whisper-large-v3-v20240930/547MB", "timestamp": "2024-10-18_21:59:11_GMT-0400", "average_wer": 16.82, "dataset_wer": {"librispeech": 2.16, "earnings22-12hours": 31.49}, "qoi": 0.92}
{"model": "distil-whisper/distil-large-v3/turbo/600MB", "timestamp": "2024-10-18_17:50:17_GMT-0700", "average_wer": 8.33, "dataset_wer": {"librispeech": 2.8, "earnings22-12hours": 13.87}, "qoi": 0.86}
{"model": "openai/whisper-large-v2", "timestamp": "2024-10-18_17:07:15_GMT-0400", "average_wer": 7.32, "dataset_wer": {"librispeech": 2.36, "earnings22-12hours": 12.28}, "qoi": 0.97}
{"model": "openai/whisper-small.en", "timestamp": "2024-10-18_15:39:48_GMT-0400", "average_wer": 7.85, "dataset_wer": {"librispeech": 2.88, "earnings22-12hours": 12.82}, "qoi": 0.86}
{"model": "distil-whisper/distil-large-v3/turbo", "timestamp": "2024-10-20_12:45:20_GMT-0700", "average_wer": 7.2, "dataset_wer": {"librispeech": 2.35, "earnings22-12hours": 12.05}, "qoi": 0.9}
{"model": "openai/whisper-base", "timestamp": "2024-10-18_20:25:50_GMT-0700", "average_wer": 10.67, "dataset_wer": {"librispeech": 4.94, "earnings22-12hours": 16.4}, "qoi": 0.67}
{"model": "openai/whisper-large-v3/turbo", "timestamp": "2024-10-20_16:58:25_GMT-0400", "average_wer": 6.86, "dataset_wer": {"librispeech": 1.97, "earnings22-12hours": 11.74}, "qoi": 0.95}
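The `timestamp` values in the records above use a `YYYY-MM-DD_HH:MM:SS_GMT±hhmm` layout with differing UTC offsets, so comparing them as plain strings may not match chronological order. A minimal sketch of parsing them with the standard library, assuming every record follows the layout shown above; the helper name `parse_quality_timestamp` is hypothetical and not part of the repository:

```python
from datetime import datetime

# Layout inferred from the records above, e.g. "2024-10-18_16:59:10_GMT-0700".
QUALITY_TS_FORMAT = "%Y-%m-%d_%H:%M:%S_GMT%z"


def parse_quality_timestamp(raw: str) -> datetime:
    """Parse a quality_data.json timestamp into a timezone-aware datetime.

    Hypothetical helper for illustration only.
    """
    return datetime.strptime(raw, QUALITY_TS_FORMAT)


raw_a = "2024-10-18_16:59:10_GMT-0700"  # 23:59:10 UTC
raw_b = "2024-10-18_19:51:30_GMT-0400"  # 23:51:30 UTC
ts_a = parse_quality_timestamp(raw_a)
ts_b = parse_quality_timestamp(raw_b)

# String order and chronological order disagree for this pair.
print(raw_a > raw_b)  # False
print(ts_a > ts_b)    # True
```

Timezone-aware `datetime` objects compare by absolute time, which is why the parsed values order differently from the raw strings here.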
dashboard_data/support_data.csv
ADDED
@@ -0,0 +1,23 @@
,Model,Apple M2 Pro,Apple M2 Ultra,Apple M3,Apple M3 Max,iPad Air 11-inch (M2),iPad mini (A17 Pro),iPad Pro 11-inch (M4),iPhone 11,iPhone 13 Pro,iPhone 13,iPhone 14,iPhone 16 Pro
distil-whisper_distil-large-v3,distil-whisper_distil-large-v3,✅ macOS 15.0.1,✅ macOS 15.0.1,✅ macOS 15.0.1,✅ macOS 15.1,✅ iPadOS 17.6.1,✅ iPadOS 18.0.1,✅ iPadOS 18.1,Not Supported,Not Supported,Not Supported,Not Supported,Not Supported
distil-whisper_distil-large-v3_594MB,distil-whisper_distil-large-v3_594MB,✅ macOS 15.0.1,✅ macOS 15.0.1,✅ macOS 15.0.1,✅ macOS 15.1,✅ iPadOS 17.6.1,✅ iPadOS 18.0.1,✅ iPadOS 18.1,Not Supported,✅ iOS 18.0,✅ iOS 17.3,✅ iOS 17.3,✅ iOS 18.0.1<p>✅ iOS 18.0</p>
distil-whisper_distil-large-v3_turbo,distil-whisper_distil-large-v3_turbo,✅ macOS 15.0.1,✅ macOS 15.0.1,✅ macOS 15.0.1,✅ macOS 15.1,⚠️ <a style='color: #3B82F6; text-decoration: underline; text-decoration-style: dotted;' href=https://huggingface.co/datasets/argmaxinc/whisperkit-evals-dataset/blob/main/benchmark_data/2024-10-25T012729_6962d0d/iPad14%2C8_summary_2024-10-25T032747.json>iPadOS 17.6.1</a>,⚠️ <a style='color: #3B82F6; text-decoration: underline; text-decoration-style: dotted;' href=https://huggingface.co/datasets/argmaxinc/whisperkit-evals-dataset/blob/main/benchmark_data/2024-10-25T012729_6962d0d/iPad16%2C1_summary_2024-10-25T054749.json>iPadOS 18.0.1</a>,⚠️ <a style='color: #3B82F6; text-decoration: underline; text-decoration-style: dotted;' href=https://huggingface.co/datasets/argmaxinc/whisperkit-evals-dataset/blob/main/benchmark_data/2024-10-25T012729_6962d0d/iPad16%2C3_summary_2024-10-25T032747.json>iPadOS 18.1</a>,Not Supported,Not Supported,Not Supported,Not Supported,Not Supported
distil-whisper_distil-large-v3_turbo_600MB,distil-whisper_distil-large-v3_turbo_600MB,⚠️ <a style='color: #3B82F6; text-decoration: underline; text-decoration-style: dotted;' href=https://huggingface.co/datasets/argmaxinc/whisperkit-evals-dataset/blob/main/benchmark_data/2024-10-25T012729_6962d0d/Mac14%2C12_summary_2024-10-25T031359.json>macOS 15.0.1</a>,✅ macOS 15.0.1,✅ macOS 15.0.1,✅ macOS 15.1,✅ iPadOS 17.6.1,✅ iPadOS 18.0.1,✅ iPadOS 18.1,Not Supported,✅ iOS 18.0,✅ iOS 17.3,✅ iOS 17.3,✅ iOS 18.0.1<p>✅ iOS 18.0</p>
openai_whisper-base,openai_whisper-base,⚠️ <a style='color: #3B82F6; text-decoration: underline; text-decoration-style: dotted;' href=https://huggingface.co/datasets/argmaxinc/whisperkit-evals-dataset/blob/main/benchmark_data/2024-10-25T012729_6962d0d/Mac14%2C12_summary_2024-10-25T031359.json>macOS 15.0.1</a>,✅ macOS 15.0.1,✅ macOS 15.0.1,✅ macOS 15.1,✅ iPadOS 17.6.1,✅ iPadOS 18.0.1,✅ iPadOS 18.1,✅ iOS 17.6.1,✅ iOS 18.0,✅ iOS 17.3,✅ iOS 17.3,✅ iOS 18.0.1<p>✅ iOS 18.0</p>
openai_whisper-base.en,openai_whisper-base.en,✅ macOS 15.0.1,✅ macOS 15.0.1,✅ macOS 15.0.1,✅ macOS 15.1,✅ iPadOS 17.6.1,✅ iPadOS 18.0.1,✅ iPadOS 18.1,✅ iOS 17.6.1,✅ iOS 18.0,✅ iOS 17.3,✅ iOS 17.3,✅ iOS 18.0.1<p>✅ iOS 18.0</p>
openai_whisper-large-v2,openai_whisper-large-v2,⚠️ <a style='color: #3B82F6; text-decoration: underline; text-decoration-style: dotted;' href=https://huggingface.co/datasets/argmaxinc/whisperkit-evals-dataset/blob/main/benchmark_data/2024-10-25T012729_6962d0d/Mac14%2C12_summary_2024-10-25T031359.json>macOS 15.0.1</a>,✅ macOS 15.0.1,✅ macOS 15.0.1,✅ macOS 15.1,Not Supported,Not Supported,Not Supported,Not Supported,Not Supported,Not Supported,Not Supported,Not Supported
openai_whisper-large-v2_949MB,openai_whisper-large-v2_949MB,⚠️ <a style='color: #3B82F6; text-decoration: underline; text-decoration-style: dotted;' href=https://huggingface.co/datasets/argmaxinc/whisperkit-evals-dataset/blob/main/benchmark_data/2024-10-25T012729_6962d0d/Mac14%2C12_summary_2024-10-25T031359.json>macOS 15.0.1</a>,✅ macOS 15.0.1,✅ macOS 15.0.1,✅ macOS 15.1,✅ iPadOS 17.6.1,✅ iPadOS 18.0.1,✅ iPadOS 18.1,Not Supported,✅ iOS 18.0,⚠️ <a style='color: #3B82F6; text-decoration: underline; text-decoration-style: dotted;' href=https://huggingface.co/datasets/argmaxinc/whisperkit-evals-dataset/blob/main/benchmark_data/2024-10-25T012729_6962d0d/iPhone14%2C5_summary_2024-10-25T032747.json>iOS 17.3</a>,✅ iOS 17.3,✅ iOS 18.0.1<p>✅ iOS 18.0</p>
openai_whisper-large-v2_turbo,openai_whisper-large-v2_turbo,✅ macOS 15.0.1,✅ macOS 15.0.1,✅ macOS 15.0.1,✅ macOS 15.1,⚠️ <a style='color: #3B82F6; text-decoration: underline; text-decoration-style: dotted;' href=https://huggingface.co/datasets/argmaxinc/whisperkit-evals-dataset/blob/main/benchmark_data/2024-10-25T012729_6962d0d/iPad14%2C8_summary_2024-10-25T032747.json>iPadOS 17.6.1</a>,⚠️ <a style='color: #3B82F6; text-decoration: underline; text-decoration-style: dotted;' href=https://huggingface.co/datasets/argmaxinc/whisperkit-evals-dataset/blob/main/benchmark_data/2024-10-25T012729_6962d0d/iPad16%2C1_summary_2024-10-25T054749.json>iPadOS 18.0.1</a>,Not Supported,Not Supported,Not Supported,Not Supported,Not Supported,Not Supported
openai_whisper-large-v2_turbo_955MB,openai_whisper-large-v2_turbo_955MB,✅ macOS 15.0.1,✅ macOS 15.0.1,✅ macOS 15.0.1,✅ macOS 15.1,✅ iPadOS 17.6.1,✅ iPadOS 18.0.1,✅ iPadOS 18.1,Not Supported,✅ iOS 18.0,✅ iOS 17.3,✅ iOS 17.3,✅ iOS 18.0.1<p>✅ iOS 18.0</p>
openai_whisper-large-v3,openai_whisper-large-v3,✅ macOS 15.0.1,✅ macOS 15.0.1,✅ macOS 15.0.1,✅ macOS 15.1,Not Supported,Not Supported,Not Supported,Not Supported,Not Supported,Not Supported,Not Supported,Not Supported
openai_whisper-large-v3-v20240930,openai_whisper-large-v3-v20240930,✅ macOS 15.0.1,✅ macOS 15.0.1,✅ macOS 15.0.1,✅ macOS 15.1,⚠️ <a style='color: #3B82F6; text-decoration: underline; text-decoration-style: dotted;' href=https://huggingface.co/datasets/argmaxinc/whisperkit-evals-dataset/blob/main/benchmark_data/2024-10-25T012729_6962d0d/iPad14%2C8_summary_2024-10-25T032747.json>iPadOS 17.6.1</a>,⚠️ <a style='color: #3B82F6; text-decoration: underline; text-decoration-style: dotted;' href=https://huggingface.co/datasets/argmaxinc/whisperkit-evals-dataset/blob/main/benchmark_data/2024-10-25T012729_6962d0d/iPad16%2C1_summary_2024-10-25T054749.json>iPadOS 18.0.1</a>,⚠️ <a style='color: #3B82F6; text-decoration: underline; text-decoration-style: dotted;' href=https://huggingface.co/datasets/argmaxinc/whisperkit-evals-dataset/blob/main/benchmark_data/2024-10-25T012729_6962d0d/iPad16%2C3_summary_2024-10-25T032747.json>iPadOS 18.1</a>,Not Supported,Not Supported,Not Supported,Not Supported,Not Supported
openai_whisper-large-v3-v20240930_626MB,openai_whisper-large-v3-v20240930_626MB,✅ macOS 15.0.1,✅ macOS 15.0.1,✅ macOS 15.0.1,✅ macOS 15.1,✅ iPadOS 17.6.1,✅ iPadOS 18.0.1,✅ iPadOS 18.1,Not Supported,✅ iOS 18.0,✅ iOS 17.3,✅ iOS 17.3,✅ iOS 18.0.1<p>✅ iOS 18.0</p>
openai_whisper-large-v3-v20240930_turbo,openai_whisper-large-v3-v20240930_turbo,✅ macOS 15.0.1,✅ macOS 15.0.1,✅ macOS 15.0.1,✅ macOS 15.1,Not Supported,⚠️ <a style='color: #3B82F6; text-decoration: underline; text-decoration-style: dotted;' href=https://huggingface.co/datasets/argmaxinc/whisperkit-evals-dataset/blob/main/benchmark_data/2024-10-25T012729_6962d0d/iPad16%2C1_summary_2024-10-25T054749.json>iPadOS 18.0.1</a>,⚠️ <a style='color: #3B82F6; text-decoration: underline; text-decoration-style: dotted;' href=https://huggingface.co/datasets/argmaxinc/whisperkit-evals-dataset/blob/main/benchmark_data/2024-10-25T012729_6962d0d/iPad16%2C3_summary_2024-10-25T032747.json>iPadOS 18.1</a>,Not Supported,Not Supported,Not Supported,Not Supported,Not Supported
openai_whisper-large-v3-v20240930_turbo_632MB,openai_whisper-large-v3-v20240930_turbo_632MB,✅ macOS 15.0.1,✅ macOS 15.0.1,✅ macOS 15.0.1,✅ macOS 15.1,✅ iPadOS 17.6.1,✅ iPadOS 18.0.1,✅ iPadOS 18.1,Not Supported,✅ iOS 18.0,⚠️ <a style='color: #3B82F6; text-decoration: underline; text-decoration-style: dotted;' href=https://huggingface.co/datasets/argmaxinc/whisperkit-evals-dataset/blob/main/benchmark_data/2024-10-25T012729_6962d0d/iPhone14%2C5_summary_2024-10-25T032747.json>iOS 17.3</a>,⚠️ <a style='color: #3B82F6; text-decoration: underline; text-decoration-style: dotted;' href=https://huggingface.co/datasets/argmaxinc/whisperkit-evals-dataset/blob/main/benchmark_data/2024-10-25T012729_6962d0d/iPhone14%2C7_summary_2024-10-25T032747.json>iOS 17.3</a>,✅ iOS 18.0.1<p>✅ iOS 18.0</p>
openai_whisper-large-v3_947MB,openai_whisper-large-v3_947MB,✅ macOS 15.0.1,✅ macOS 15.0.1,✅ macOS 15.0.1,✅ macOS 15.1,✅ iPadOS 17.6.1,✅ iPadOS 18.0.1,✅ iPadOS 18.1,Not Supported,✅ iOS 18.0,⚠️ <a style='color: #3B82F6; text-decoration: underline; text-decoration-style: dotted;' href=https://huggingface.co/datasets/argmaxinc/whisperkit-evals-dataset/blob/main/benchmark_data/2024-10-25T012729_6962d0d/iPhone14%2C5_summary_2024-10-25T032747.json>iOS 17.3</a>,✅ iOS 17.3,✅ iOS 18.0.1<p>✅ iOS 18.0</p>
openai_whisper-large-v3_turbo,openai_whisper-large-v3_turbo,✅ macOS 15.0.1,✅ macOS 15.0.1,✅ macOS 15.0.1,✅ macOS 15.1,Not Supported,Not Supported,Not Supported,Not Supported,Not Supported,Not Supported,Not Supported,Not Supported
openai_whisper-large-v3_turbo_954MB,openai_whisper-large-v3_turbo_954MB,✅ macOS 15.0.1,✅ macOS 15.0.1,✅ macOS 15.0.1,✅ macOS 15.1,✅ iPadOS 17.6.1,✅ iPadOS 18.0.1,✅ iPadOS 18.1,Not Supported,✅ iOS 18.0,✅ iOS 17.3,✅ iOS 17.3,✅ iOS 18.0.1<p>✅ iOS 18.0</p>
openai_whisper-small,openai_whisper-small,✅ macOS 15.0.1,✅ macOS 15.0.1,✅ macOS 15.0.1,✅ macOS 15.1,✅ iPadOS 17.6.1,✅ iPadOS 18.0.1,✅ iPadOS 18.1,Not Supported,✅ iOS 18.0,✅ iOS 17.3,✅ iOS 17.3,✅ iOS 18.0.1<p>✅ iOS 18.0</p>
openai_whisper-small.en,openai_whisper-small.en,⚠️ <a style='color: #3B82F6; text-decoration: underline; text-decoration-style: dotted;' href=https://huggingface.co/datasets/argmaxinc/whisperkit-evals-dataset/blob/main/benchmark_data/2024-10-25T012729_6962d0d/Mac14%2C12_summary_2024-10-25T031359.json>macOS 15.0.1</a>,✅ macOS 15.0.1,✅ macOS 15.0.1,✅ macOS 15.1,✅ iPadOS 17.6.1,✅ iPadOS 18.0.1,✅ iPadOS 18.1,Not Supported,✅ iOS 18.0,✅ iOS 17.3,✅ iOS 17.3,✅ iOS 18.0.1<p>✅ iOS 18.0</p>
openai_whisper-tiny,openai_whisper-tiny,✅ macOS 15.0.1,✅ macOS 15.0.1,✅ macOS 15.0.1,✅ macOS 15.1,✅ iPadOS 17.6.1,✅ iPadOS 18.0.1,✅ iPadOS 18.1,Not Supported,✅ iOS 18.0,✅ iOS 17.3,✅ iOS 17.3,✅ iOS 18.0.1<p>✅ iOS 18.0</p>
openai_whisper-tiny.en,openai_whisper-tiny.en,✅ macOS 15.0.1,✅ macOS 15.0.1,✅ macOS 15.0.1,✅ macOS 15.1,✅ iPadOS 17.6.1,✅ iPadOS 18.0.1,✅ iPadOS 18.1,✅ iOS 17.6.1,✅ iOS 18.0,✅ iOS 17.3,✅ iOS 17.3,✅ iOS 18.0.1<p>✅ iOS 18.0</p>
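Each device cell in the support table above is either `Not Supported`, a `✅`-prefixed OS version, or a `⚠️`-prefixed OS version wrapped in an HTML link; some cells carry a second version inside `<p>` tags. A minimal sketch of turning one cell into a status and a list of OS versions follows. The function name, the status labels, and the reading of `⚠️` as a generic "warning" are assumptions for illustration; the table itself does not define what the warning means.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")
OS_VERSION_RE = re.compile(r"(?:macOS|iPadOS|iOS) [\d.]+")


def classify_support_cell(cell: str) -> tuple[str, list[str]]:
    """Return (status, os_versions) for one cell of support_data.csv.

    Hypothetical helper; status labels are illustrative, not from the repo.
    """
    if cell.strip() == "Not Supported":
        return "not_supported", []
    # Replace HTML tags with spaces so "18.0.1<p>... 18.0</p>" splits cleanly.
    text = TAG_RE.sub(" ", cell)
    status = "warning" if "\u26a0" in text else "ok"
    return status, OS_VERSION_RE.findall(text)


print(classify_support_cell("\u2705 iOS 18.0.1<p>\u2705 iOS 18.0</p>"))
print(classify_support_cell("Not Supported"))
```

Stripping tags before matching keeps device identifiers inside the link URLs (for example `iPad14%2C8`) from being mistaken for OS names.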
main.py
ADDED
@@ -0,0 +1,1302 @@
"""
Main module for the WhisperKit Evaluation Dashboard.
This module sets up and runs the Gradio interface for the WhisperKit Evaluation Dashboard,
allowing users to explore and compare speech recognition model performance across different
devices, operating systems, and datasets.
"""

import json
import os
import re
from math import ceil, floor

import gradio as gr
import pandas as pd
from argmax_gradio_components import RangeSlider
from dotenv import load_dotenv
from huggingface_hub import login

# Import custom constants and utility functions
from constants import (
    BANNER_TEXT,
    CITATION_BUTTON_LABEL,
    CITATION_BUTTON_TEXT,
    COL_NAMES,
    HEADER,
    LANGUAGE_MAP,
    METHODOLOGY_TEXT,
    PERFORMANCE_TEXT,
    QUALITY_TEXT,
)
from utils import (
    add_datasets_to_performance_columns,
    add_datasets_to_quality_columns,
    calculate_parity,
    create_confusion_matrix_plot,
    create_initial_performance_column_dict,
    create_initial_quality_column_dict,
    css,
    fields,
    get_os_name_and_version,
    make_dataset_wer_clickable_link,
    make_model_name_clickable_link,
    make_multilingual_model_clickable_link,
    plot_metric,
    read_json_line_by_line,
)

# Load environment variables
load_dotenv()

# Get the Hugging Face token from the environment variable
HF_TOKEN = os.getenv("HF_TOKEN")

# Use the token for login
login(token=HF_TOKEN, add_to_git_credential=True)

# Define repository and directory information
repo_id = "argmaxinc/whisperkit-evals-dataset"
directory = "xcresults/benchmark_results"
local_dir = ""

# Load benchmark data from JSON files
PERFORMANCE_DATA = read_json_line_by_line("dashboard_data/performance_data.json")
QUALITY_DATA = read_json_line_by_line("dashboard_data/quality_data.json")

# Convert JSON data to pandas DataFrames
quality_df = pd.json_normalize(QUALITY_DATA)
benchmark_df = pd.json_normalize(PERFORMANCE_DATA)

# Process timestamp data: parse the strings and drop timezone information
benchmark_df["timestamp"] = pd.to_datetime(benchmark_df["timestamp"]).dt.tz_localize(
    None
)

# Sort by model-name length (temporary model_len column), then model name, with the
# newest timestamp first, and keep only the latest row per model
sorted_quality_df = (
    quality_df.assign(model_len=quality_df["model"].str.len())
    .sort_values(
        by=["model_len", "model", "timestamp"],
        ascending=[True, True, False],
    )
    .drop(columns=["model_len"])
    .drop_duplicates(subset=["model"], keep="first")
    .reset_index(drop=True)
)

# Same ordering for performance data, keeping the latest row per (model, device, os)
sorted_performance_df = (
    benchmark_df.assign(model_len=benchmark_df["model"].str.len())
    .sort_values(
        by=["model_len", "model", "device", "os", "timestamp"],
        ascending=[True, True, True, True, False],
    )
    .drop(columns=["model_len"])
    .drop_duplicates(subset=["model", "device", "os"], keep="first")
    .reset_index(drop=True)
)

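The sort-then-deduplicate chain used for both DataFrames is a common pandas idiom for keeping the most recent row per key. A minimal sketch on made-up data (the model names and values below are illustrative only, not taken from the benchmark files):

```python
import pandas as pd

# Illustrative rows: two runs for "tiny", one for "base".
runs = pd.DataFrame(
    {
        "model": ["tiny", "tiny", "base"],
        "timestamp": ["2024-10-18", "2024-10-20", "2024-10-19"],
        "average_wer": [14.5, 14.21, 10.67],
    }
)

latest = (
    # Newest timestamp first within each model ...
    runs.sort_values(by=["model", "timestamp"], ascending=[True, False])
    # ... so keep="first" retains the most recent run per model.
    .drop_duplicates(subset=["model"], keep="first")
    .reset_index(drop=True)
)
print(latest)
```

Because `drop_duplicates` keeps rows by position, the result depends entirely on the preceding sort order; reversing the `ascending` flag for `timestamp` would keep the oldest run instead.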
# Identify dataset-specific columns
dataset_wer_columns = [
    col for col in sorted_quality_df.columns if col.startswith("dataset_wer.")
]
dataset_speed_columns = [
    col for col in sorted_performance_df.columns if col.startswith("dataset_speed.")
]
dataset_toks_columns = [
    col
    for col in sorted_performance_df.columns
    if col.startswith("dataset_tokens_per_second.")
]

# Extract dataset names
QUALITY_DATASETS = [col.split(".")[-1] for col in dataset_wer_columns]
PERFORMANCE_DATASETS = [col.split(".")[-1] for col in dataset_speed_columns]

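The `dataset_wer.` prefix filter works because `pd.json_normalize` flattens nested dictionaries into dot-separated column names by default. A minimal sketch using one record shaped like the entries in `dashboard_data/quality_data.json`:

```python
import pandas as pd

# One record in the same shape as the quality_data.json entries.
records = [
    {
        "model": "openai/whisper-tiny",
        "average_wer": 14.21,
        "dataset_wer": {"librispeech": 7.46, "earnings22-12hours": 20.97},
        "qoi": 0.52,
    }
]

df = pd.json_normalize(records)

# Nested keys become "parent.child" columns.
wer_columns = [col for col in df.columns if col.startswith("dataset_wer.")]
datasets = [col.split(".")[-1] for col in wer_columns]

print(wer_columns)
print(datasets)
```

Note that `col.split(".")[-1]` assumes dataset names contain no dots; a name with a dot in it would be truncated to the part after the last dot.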
# Prepare DataFrames for display
model_df = sorted_quality_df[
    ["model", "average_wer", "qoi", "timestamp"] + dataset_wer_columns
]
performance_df = sorted_performance_df[
    [
        "model",
        "device",
        "os",
        "average_wer",
        "qoi",
        "speed",
        "tokens_per_second",
        "timestamp",
    ]
    + dataset_speed_columns
    + dataset_toks_columns
].copy()

# Rename columns for clarity
performance_df = performance_df.rename(
    lambda x: COL_NAMES[x] if x in COL_NAMES else x, axis="columns"
)
model_df = model_df.rename(
    lambda x: COL_NAMES[x] if x in COL_NAMES else x, axis="columns"
)

# Process dataset-specific columns
for col in dataset_wer_columns:
    dataset_name = col.split(".")[-1]
    model_df = model_df.rename(columns={col: dataset_name})
    model_df[dataset_name] = model_df.apply(
        lambda x: make_dataset_wer_clickable_link(x, dataset_name), axis=1
    )

for col in dataset_speed_columns:
    dataset_name = col.split(".")[-1]
    performance_df = performance_df.rename(
        columns={
            col: f"{'Short-Form' if dataset_name == 'librispeech-10mins' else 'Long-Form'} Speed"
        }
    )

for col in dataset_toks_columns:
    dataset_name = col.split(".")[-1]
    performance_df = performance_df.rename(
        columns={
            col: f"{'Short-Form' if dataset_name == 'librispeech-10mins' else 'Long-Form'} Tok/s"
        }
    )

# Calculate parity with M2 Ultra
m2_ultra_wer = (
    performance_df[performance_df["Device"] == "Apple M2 Ultra"]
    .groupby("Model")["Average WER"]
    .first()
)
performance_df["Parity %"] = performance_df.apply(
    lambda row: calculate_parity(m2_ultra_wer, row), axis=1
)

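The parity step builds a per-model baseline from the Apple M2 Ultra rows and then compares every row against it. `calculate_parity` lives in `utils.py` and is not shown here, so the sketch below uses a hypothetical stand-in, `relative_wer_gap`, whose formula is an assumption and may differ from the real function. The data values are made up; only the baseline-lookup pattern mirrors the code above.

```python
import pandas as pd

# Illustrative rows only; not real benchmark results.
perf = pd.DataFrame(
    {
        "Model": ["tiny", "tiny", "base"],
        "Device": ["Apple M2 Ultra", "iPhone 14", "Apple M2 Ultra"],
        "Average WER": [14.0, 14.7, 10.0],
    }
)

# Series indexed by model name holding the reference device's WER.
baseline = (
    perf[perf["Device"] == "Apple M2 Ultra"]
    .groupby("Model")["Average WER"]
    .first()
)


def relative_wer_gap(baseline: pd.Series, row: pd.Series):
    """HYPOTHETICAL stand-in for calculate_parity (formula is an assumption).

    Returns the WER difference to the baseline as a percentage of the baseline,
    or None when the model has no usable baseline row.
    """
    ref = baseline.get(row["Model"])
    if ref is None or ref == 0:
        return None
    return round(100 * (row["Average WER"] - ref) / ref, 2)


perf["WER gap %"] = perf.apply(lambda row: relative_wer_gap(baseline, row), axis=1)
print(perf)
```

Using `Series.get` avoids a `KeyError` for models that were never benchmarked on the reference device.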
# Process model names for display
model_df["model_raw"] = model_df["Model"].copy()
performance_df["model_raw"] = performance_df["Model"].copy()
model_df["Model"] = model_df["Model"].apply(make_model_name_clickable_link)
performance_df["Model"] = performance_df["Model"].apply(
    make_model_name_clickable_link
)

# Extract unique devices and OS versions
PERFORMANCE_DEVICES = performance_df["Device"].unique().tolist()
PERFORMANCE_OS = performance_df["OS"].apply(get_os_name_and_version).unique().tolist()
PERFORMANCE_OS.sort()

# Create initial column dictionaries and update with dataset information
initial_performance_column_dict = create_initial_performance_column_dict()
initial_quality_column_dict = create_initial_quality_column_dict()

performance_column_info = add_datasets_to_performance_columns(
    initial_performance_column_dict, PERFORMANCE_DATASETS
)
quality_column_info = add_datasets_to_quality_columns(
    initial_quality_column_dict, QUALITY_DATASETS
)

# Unpack the returned dictionaries
updated_performance_column_dict = performance_column_info["column_dict"]
updated_quality_column_dict = quality_column_info["column_dict"]

PerformanceAutoEvalColumn = performance_column_info["AutoEvalColumn"]
QualityAutoEvalColumn = quality_column_info["AutoEvalColumn"]

# Define column sets for different views
PERFORMANCE_COLS = performance_column_info["COLS"]
QUALITY_COLS = quality_column_info["COLS"]
PERFORMANCE_TYPES = performance_column_info["TYPES"]
QUALITY_TYPES = quality_column_info["TYPES"]
PERFORMANCE_ALWAYS_HERE_COLS = performance_column_info["ALWAYS_HERE_COLS"]
QUALITY_ALWAYS_HERE_COLS = quality_column_info["ALWAYS_HERE_COLS"]
PERFORMANCE_TOGGLE_COLS = performance_column_info["TOGGLE_COLS"]
QUALITY_TOGGLE_COLS = quality_column_info["TOGGLE_COLS"]
PERFORMANCE_SELECTED_COLS = performance_column_info["SELECTED_COLS"]
QUALITY_SELECTED_COLS = quality_column_info["SELECTED_COLS"]


def performance_filter(
    df,
    columns,
    model_query,
    exclude_models,
    devices,
    os,
    short_speed_slider,
    long_speed_slider,
    short_toks_slider,
    long_toks_slider,
):
    """
    Filters the performance DataFrame based on specified criteria.
    :param df: The DataFrame to be filtered.
    :param columns: The columns to be included in the filtered DataFrame.
    :param model_query: The query string to filter the 'Model' column.
    :param exclude_models: Models to exclude from the results.
    :param devices: The devices to filter the 'Device' column.
    :param os: The list of operating systems to filter the 'OS' column.
    :param short_speed_slider: The range of values to filter the 'Short-Form Speed' column.
    :param long_speed_slider: The range of values to filter the 'Long-Form Speed' column.
    :param short_toks_slider: The range of values to filter the 'Short-Form Tok/s' column.
    :param long_toks_slider: The range of values to filter the 'Long-Form Tok/s' column.
    :return: The filtered DataFrame.
    """
    # Select columns based on input and always-present columns
    filtered_df = df[
        PERFORMANCE_ALWAYS_HERE_COLS
        + [c for c in PERFORMANCE_COLS if c in df.columns and c in columns]
    ]

    # Filter models based on query
    if model_query:
        filtered_df = filtered_df[
            filtered_df["Model"].str.contains(
                "|".join(q.strip() for q in model_query.split(";")), case=False
            )
        ]

    # Exclude specified models
    if exclude_models:
        exclude_list = [m.strip() for m in exclude_models.split(";")]
        filtered_df = filtered_df[
            ~filtered_df["Model"].str.contains("|".join(exclude_list), case=False)
        ]

    # Filter by devices
    filtered_df = (
        filtered_df[
            (
                filtered_df["Device"].str.contains(
                    "|".join(re.escape(q.strip()) for q in devices), case=False
                )
            )
        ]
        if devices
        else pd.DataFrame(columns=filtered_df.columns)
    )

    # Filter by operating systems
    # (names are escaped so that characters such as "." match literally)
    filtered_df = (
        filtered_df[
            (
                filtered_df["OS"].str.contains(
                    "|".join(re.escape(q.strip()) for q in os), case=False
                )
            )
        ]
        if os
        else pd.DataFrame(columns=filtered_df.columns)
    )

    # Apply short-form and long-form speed and tokens per second filters
    min_short_speed, max_short_speed = short_speed_slider
    min_long_speed, max_long_speed = long_speed_slider
    min_short_toks, max_short_toks = short_toks_slider
    min_long_toks, max_long_toks = long_toks_slider

    filtered_df = filtered_df[
        (filtered_df["Short-Form Speed"] >= min_short_speed)
        & (filtered_df["Short-Form Speed"] <= max_short_speed)
        & (filtered_df["Long-Form Speed"] >= min_long_speed)
        & (filtered_df["Long-Form Speed"] <= max_long_speed)
        & (filtered_df["Short-Form Tok/s"] >= min_short_toks)
        & (filtered_df["Short-Form Tok/s"] <= max_short_toks)
        & (filtered_df["Long-Form Tok/s"] >= min_long_toks)
        & (filtered_df["Long-Form Tok/s"] <= max_long_toks)
    ]

    return filtered_df
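
# Illustrative sketch (not part of the commit): the semicolon-separated query
# handling used by the filter above, shown in isolation with only the standard
# library. The helper names `build_query_pattern` and `filter_names` and the
# sample model names are hypothetical. Escaping every term and skipping empty
# terms are assumptions of this sketch; the model queries above are passed to
# `str.contains` unescaped, which pandas treats as a regular expression by default.

```python
import re
from typing import List


def build_query_pattern(query: str) -> str:
    """Turn 'a; b' into the regex 'a|b', escaping each term and skipping empty ones."""
    terms = (term.strip() for term in query.split(";"))
    return "|".join(re.escape(term) for term in terms if term)


def filter_names(names: List[str], query: str) -> List[str]:
    """Keep the names that match any query term, ignoring case."""
    pattern = build_query_pattern(query)
    if not pattern:
        return list(names)  # an empty query filters nothing out
    return [name for name in names if re.search(pattern, name, flags=re.IGNORECASE)]


# Illustrative model names only.
names = ["openai/whisper-tiny", "openai/whisper-large-v3", "distil-whisper/small.en"]
print(filter_names(names, "TINY; small.en;"))
```

# Skipping empty terms matters because a trailing ';' would otherwise add an
# empty alternative to the pattern, and an empty alternative matches every row.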
|
314 |
+
|
315 |
+
|
316 |
+
def quality_filter(df, columns, model_query, wer_slider, qoi_slider, exclude_models):
    """
    Filters the quality DataFrame based on specified criteria.
    :param df: The DataFrame to be filtered.
    :param columns: The columns to be included in the filtered DataFrame.
    :param model_query: The query string to filter the 'Model' column.
    :param wer_slider: The range of values to filter the 'Average WER' column.
    :param qoi_slider: The range of values to filter the 'QoI' column.
    :param exclude_models: Models to exclude from the results.
    :return: The filtered DataFrame.
    """
    # Select columns based on input and always-present columns
    filtered_df = df[
        QUALITY_ALWAYS_HERE_COLS
        + [c for c in QUALITY_COLS if c in df.columns and c in columns]
    ]

    # Filter models based on query
    if model_query:
        filtered_df = filtered_df[
            filtered_df["Model"].str.contains(
                "|".join(q.strip() for q in model_query.split(";")), case=False
            )
        ]

    # Exclude specified models
    if exclude_models:
        exclude_list = [m.strip() for m in exclude_models.split(";")]
        filtered_df = filtered_df[
            ~filtered_df["Model"].str.contains("|".join(exclude_list), case=False)
        ]

    # Apply WER and QoI filters
    min_wer_slider, max_wer_slider = wer_slider
    min_qoi_slider, max_qoi_slider = qoi_slider
    if "Average WER" in filtered_df.columns:
        filtered_df = filtered_df[
            (filtered_df["Average WER"] >= min_wer_slider)
            & (filtered_df["Average WER"] <= max_wer_slider)
        ]
    if "QoI" in filtered_df.columns:
        filtered_df = filtered_df[
            (filtered_df["QoI"] >= min_qoi_slider)
            & (filtered_df["QoI"] <= max_qoi_slider)
        ]

    return filtered_df
|
363 |
+
|
364 |
+
|
365 |
+
diff_tab = gr.TabItem("Difference Checker", elem_id="diff_checker", id=2)
|
366 |
+
text_diff_elems = []
|
367 |
+
|
368 |
+
tabs = gr.Tabs(elem_id="tab-elems")
|
369 |
+
|
370 |
+
multilingual_df = pd.read_csv("dashboard_data/multilingual_results.csv")
|
371 |
+
multilingual_models_df = multilingual_df[["Model"]].drop_duplicates()
|
372 |
+
multilingual_models_buttons = []
|
373 |
+
for model in multilingual_models_df["Model"]:
|
374 |
+
elem_id = (
|
375 |
+
f"{model}".replace(" ", "_").replace('"', "").replace("'", "").replace(",", "")
|
376 |
+
)
|
377 |
+
multilingual_models_buttons.append(
|
378 |
+
gr.Button(value=model, elem_id=elem_id, visible=False)
|
379 |
+
)
|
380 |
+
multilingual_models_df["Model"] = multilingual_models_df["Model"].apply(
|
381 |
+
lambda x: make_multilingual_model_clickable_link(x)
|
382 |
+
)
|
383 |
+
|
384 |
+
with open("dashboard_data/multilingual_confusion_matrices.json", "r") as file:
|
385 |
+
confusion_matrix_map = dict(json.load(file))
|
386 |
+
|
387 |
+
|
388 |
+
def update_multilingual_results(selected_model):
    """
    Updates the multilingual results display based on the selected model.

    This function processes the multilingual data for the chosen model,
    calculates average WER for different scenarios (language hinted vs. predicted),
    and prepares language-specific WER data for display.

    :param selected_model: The name of the selected model
    :return: A list containing updated components for the Gradio interface
    """
    if selected_model is None:
        # Return one update per output component, matching the normal path
        return [
            gr.update(value="# Select a model from the table on the left to view results."),
            gr.update(visible=False),
            gr.update(visible=False),
            gr.update(visible=False),
        ]

    # Filter data for the selected model
    model_data = multilingual_df[multilingual_df["Model"] == selected_model]

    if model_data.empty:
        # Return one update per output component, matching the normal path
        return [
            gr.update(value=f"# No data available for model: {selected_model}"),
            gr.update(visible=False),
            gr.update(visible=False),
            gr.update(visible=False),
        ]

    # Separate data for forced and not forced scenarios
    forced_data = model_data[model_data["Forced Tokens"] == True]
    not_forced_data = model_data[model_data["Forced Tokens"] == False]

    result_text = f"# Model: {selected_model}\n\n"

    # Prepare average WER data
    average_wer_data = []
    if not forced_data.empty:
        average_wer_data.append(
            {
                "Scenario": "Language Hinted",
                "Average WER": forced_data.iloc[0]["Average WER"],
            }
        )
    if not not_forced_data.empty:
        average_wer_data.append(
            {
                "Scenario": "Language Predicted",
                "Average WER": not_forced_data.iloc[0]["Average WER"],
            }
        )
    average_wer_df = pd.DataFrame(average_wer_data)
    average_wer_df["Average WER"] = average_wer_df["Average WER"].apply(
        lambda x: round(x, 2)
    )

    # Prepare language-specific WER data
    lang_columns = [col for col in model_data.columns if col.startswith("WER_")]
    lang_wer_data = []
    for column in lang_columns:
        lang = column.split("_")[1]
        forced_wer = forced_data[column].iloc[0] if not forced_data.empty else None
        not_forced_wer = (
            not_forced_data[column].iloc[0] if not not_forced_data.empty else None
        )
        if forced_wer is not None or not_forced_wer is not None:
            lang_wer_data.append(
                {
                    "Language": LANGUAGE_MAP[lang],
                    "Language Hinted WER": round(forced_wer, 2)
                    if forced_wer is not None
                    else "N/A",
                    "Language Predicted WER": round(not_forced_wer, 2)
                    if not_forced_wer is not None
                    else "N/A",
                }
            )
    lang_wer_df = pd.DataFrame(lang_wer_data)
    lang_wer_df = lang_wer_df.fillna("No Data")

    # Create confusion matrix plot for unforced scenario
    unforced_plot = None
    if selected_model in confusion_matrix_map:
        if "not_forced" in confusion_matrix_map[selected_model]:
            unforced_plot = create_confusion_matrix_plot(
                confusion_matrix_map[selected_model]["not_forced"]["matrix"],
                confusion_matrix_map[selected_model]["not_forced"]["labels"],
                False,
            )

    # Return updated components for Gradio interface
    return [
        gr.update(value=result_text),
        gr.update(visible=True, value=average_wer_df),
        gr.update(visible=True, value=lang_wer_df),
        gr.update(visible=unforced_plot is not None, value=unforced_plot),
    ]
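
# Illustrative sketch (not part of the commit): how the per-language table in
# the function above is assembled, using plain dictionaries instead of a pandas
# DataFrame so that it runs with the standard library alone. The sample rows,
# the two-entry LANGUAGE_MAP, and the helpers `first_row` and
# `language_wer_table` are hypothetical; only the column naming (`WER_<code>`,
# "Forced Tokens") and the "N/A" fallback follow the code above.

```python
from typing import Dict, List, Optional

# Hypothetical subset of a language-code map; the real LANGUAGE_MAP is defined elsewhere.
LANGUAGE_MAP = {"en": "English", "de": "German"}

# One row per scenario: language hinted (forced) and language predicted (not forced).
rows = [
    {"Model": "demo-model", "Forced Tokens": True, "WER_en": 4.126, "WER_de": 7.891},
    {"Model": "demo-model", "Forced Tokens": False, "WER_en": 5.304, "WER_de": 9.257},
]


def first_row(rows: List[Dict], forced: bool) -> Optional[Dict]:
    """Return the first row for the given scenario, or None if there is none."""
    return next((row for row in rows if row["Forced Tokens"] is forced), None)


def language_wer_table(rows: List[Dict]) -> List[Dict]:
    """Build one entry per WER_<code> column with both scenarios side by side."""
    forced, predicted = first_row(rows, True), first_row(rows, False)
    columns = [key for key in rows[0] if key.startswith("WER_")]
    table = []
    for column in columns:
        code = column.split("_")[1]
        table.append(
            {
                "Language": LANGUAGE_MAP[code],
                "Language Hinted WER": round(forced[column], 2) if forced else "N/A",
                "Language Predicted WER": round(predicted[column], 2) if predicted else "N/A",
            }
        )
    return table


table = language_wer_table(rows)
for entry in table:
    print(entry)
```

# When a scenario has no rows at all, its column falls back to "N/A" for every
# language, mirroring the conditional expressions in the function above.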
|
476 |
+
|
477 |
+
|
478 |
+
font = [
|
479 |
+
"Zwizz Regular", # Local font
|
480 |
+
"IBM Plex Mono", # Monospace font
|
481 |
+
"ui-sans-serif",
|
482 |
+
"system-ui",
|
483 |
+
"sans-serif",
|
484 |
+
]
|
485 |
+
|
486 |
+
# Define the Gradio interface
|
487 |
+
with gr.Blocks(css=css, theme=gr.themes.Base(font=font)) as demo:
|
488 |
+
# Add header and banner to the interface
|
489 |
+
gr.HTML(HEADER)
|
490 |
+
gr.HTML(BANNER_TEXT, elem_classes="markdown-text")
|
491 |
+
|
492 |
+
# Create tabs for different sections of the dashboard
|
493 |
+
with tabs.render():
|
494 |
+
# Performance Tab
|
495 |
+
with gr.TabItem("Performance", elem_id="benchmark", id=0):
|
496 |
+
with gr.Row():
|
497 |
+
with gr.Column(scale=1):
|
498 |
+
with gr.Row():
|
499 |
+
with gr.Column(scale=6, elem_classes="filter_models_column"):
|
500 |
+
filter_performance_models = gr.Textbox(
|
501 |
+
placeholder="🔍 Filter Model (separate multiple queries with ';')",
|
502 |
+
label="Filter Models",
|
503 |
+
)
|
504 |
+
with gr.Column(scale=4, elem_classes="exclude_models_column"):
|
505 |
+
exclude_performance_models = gr.Textbox(
|
506 |
+
placeholder="🔍 Exclude (separate multiple queries with ';')",
|
507 |
+
label="Exclude Models",
|
508 |
+
)
|
509 |
+
with gr.Row():
|
510 |
+
with gr.Accordion("See All Columns", open=False):
|
511 |
+
with gr.Row():
|
512 |
+
with gr.Column(scale=9, elem_id="performance_columns"):
|
513 |
+
performance_shown_columns = gr.CheckboxGroup(
|
514 |
+
choices=PERFORMANCE_TOGGLE_COLS,
|
515 |
+
value=PERFORMANCE_SELECTED_COLS,
|
516 |
+
label="Toggle Columns",
|
517 |
+
elem_id="column-select",
|
518 |
+
interactive=True,
|
519 |
+
)
|
520 |
+
with gr.Column(
|
521 |
+
scale=1,
|
522 |
+
min_width=200,
|
523 |
+
elem_id="performance_select_columns",
|
524 |
+
):
|
525 |
+
with gr.Row():
|
526 |
+
select_all_button = gr.Button(
|
527 |
+
"Select All",
|
528 |
+
elem_id="select-all-button",
|
529 |
+
interactive=True,
|
530 |
+
)
|
531 |
+
deselect_all_button = gr.Button(
|
532 |
+
"Deselect All",
|
533 |
+
elem_id="deselect-all-button",
|
534 |
+
interactive=True,
|
535 |
+
)
|
536 |
+
|
537 |
+
def select_all_columns():
|
538 |
+
return PERFORMANCE_TOGGLE_COLS
|
539 |
+
|
540 |
+
def deselect_all_columns():
|
541 |
+
return []
|
542 |
+
|
543 |
+
select_all_button.click(
|
544 |
+
select_all_columns,
|
545 |
+
inputs=[],
|
546 |
+
outputs=performance_shown_columns,
|
547 |
+
)
|
548 |
+
deselect_all_button.click(
|
549 |
+
deselect_all_columns,
|
550 |
+
inputs=[],
|
551 |
+
outputs=performance_shown_columns,
|
552 |
+
)
|
553 |
+
|
554 |
+
with gr.Row():
|
555 |
+
with gr.Accordion("Filter Devices", open=False):
|
556 |
+
with gr.Row():
|
557 |
+
with gr.Column(
|
558 |
+
scale=9, elem_id="filter_devices_column"
|
559 |
+
):
|
560 |
+
performance_shown_devices = gr.CheckboxGroup(
|
561 |
+
choices=PERFORMANCE_DEVICES,
|
562 |
+
value=PERFORMANCE_DEVICES,
|
563 |
+
label="Filter Devices",
|
564 |
+
interactive=True,
|
565 |
+
)
|
566 |
+
with gr.Column(
|
567 |
+
scale=1,
|
568 |
+
min_width=200,
|
569 |
+
elem_id="filter_select_devices",
|
570 |
+
):
|
571 |
+
with gr.Row():
|
572 |
+
select_all_devices_button = gr.Button(
|
573 |
+
"Select All",
|
574 |
+
elem_id="select-all-devices-button",
|
575 |
+
interactive=True,
|
576 |
+
)
|
577 |
+
deselect_all_devices_button = gr.Button(
|
578 |
+
"Deselect All",
|
579 |
+
elem_id="deselect-all-devices-button",
|
580 |
+
interactive=True,
|
581 |
+
)
|
582 |
+
|
583 |
+
def select_all_devices():
|
584 |
+
return PERFORMANCE_DEVICES
|
585 |
+
|
586 |
+
def deselect_all_devices():
|
587 |
+
return []
|
588 |
+
|
589 |
+
select_all_devices_button.click(
|
590 |
+
select_all_devices,
|
591 |
+
inputs=[],
|
592 |
+
outputs=performance_shown_devices,
|
593 |
+
)
|
594 |
+
deselect_all_devices_button.click(
|
595 |
+
deselect_all_devices,
|
596 |
+
inputs=[],
|
597 |
+
outputs=performance_shown_devices,
|
598 |
+
)
|
599 |
+
with gr.Row():
|
600 |
+
performance_shown_os = gr.CheckboxGroup(
|
601 |
+
choices=PERFORMANCE_OS,
|
602 |
+
value=PERFORMANCE_OS,
|
603 |
+
label="Filter OS",
|
604 |
+
interactive=True,
|
605 |
+
)
|
606 |
+
with gr.Column(scale=1):
|
607 |
+
with gr.Accordion("See Performance Filters"):
|
608 |
+
with gr.Row():
|
609 |
+
with gr.Row():
|
610 |
+
min_short_speed, max_short_speed = floor(
|
611 |
+
min(performance_df["Short-Form Speed"])
|
612 |
+
), ceil(max(performance_df["Short-Form Speed"]))
|
613 |
+
short_speed_slider = RangeSlider(
|
614 |
+
value=[min_short_speed, max_short_speed],
|
615 |
+
minimum=min_short_speed,
|
616 |
+
maximum=max_short_speed,
|
617 |
+
step=0.001,
|
618 |
+
label="Short-Form Speed",
|
619 |
+
)
|
620 |
+
with gr.Row():
|
621 |
+
min_long_speed, max_long_speed = floor(
|
622 |
+
min(performance_df["Long-Form Speed"])
|
623 |
+
), ceil(max(performance_df["Long-Form Speed"]))
|
624 |
+
long_speed_slider = RangeSlider(
|
625 |
+
value=[min_long_speed, max_long_speed],
|
626 |
+
minimum=min_long_speed,
|
627 |
+
maximum=max_long_speed,
|
628 |
+
step=0.001,
|
629 |
+
label="Long-Form Speed",
|
630 |
+
)
|
631 |
+
with gr.Row():
|
632 |
+
with gr.Row():
|
633 |
+
min_short_toks, max_short_toks = floor(
|
634 |
+
min(performance_df["Short-Form Tok/s"])
|
635 |
+
), ceil(max(performance_df["Short-Form Tok/s"]))
|
636 |
+
short_toks_slider = RangeSlider(
|
637 |
+
value=[min_short_toks, max_short_toks],
|
638 |
+
minimum=min_short_toks,
|
639 |
+
maximum=max_short_toks,
|
640 |
+
step=0.001,
|
641 |
+
label="Short-Form Tok/s",
|
642 |
+
)
|
643 |
+
with gr.Row():
|
644 |
+
min_long_toks, max_long_toks = floor(
|
645 |
+
min(performance_df["Long-Form Tok/s"])
|
646 |
+
), ceil(max(performance_df["Long-Form Tok/s"]))
|
647 |
+
long_toks_slider = RangeSlider(
|
648 |
+
value=[min_long_toks, max_long_toks],
|
649 |
+
minimum=min_long_toks,
|
650 |
+
maximum=max_long_toks,
|
651 |
+
step=0.001,
|
652 |
+
label="Long-Form Tok/s",
|
653 |
+
)
|
654 |
+
with gr.Row():
|
655 |
+
gr.Markdown(PERFORMANCE_TEXT, elem_classes="markdown-text")
|
656 |
+
with gr.Row():
|
657 |
+
leaderboard_df = gr.components.Dataframe(
|
658 |
+
value=performance_df[
|
659 |
+
PERFORMANCE_ALWAYS_HERE_COLS + performance_shown_columns.value
|
660 |
+
],
|
661 |
+
                        headers=PERFORMANCE_ALWAYS_HERE_COLS
                        + performance_shown_columns.value,
|
664 |
+
datatype=[
|
665 |
+
c.type
|
666 |
+
for c in fields(PerformanceAutoEvalColumn)
|
667 |
+
if c.name in PERFORMANCE_COLS
|
668 |
+
],
|
669 |
+
elem_id="leaderboard-table",
|
670 |
+
elem_classes="large-table",
|
671 |
+
interactive=False,
|
672 |
+
)
|
673 |
+
|
674 |
+
# Copy of the leaderboard dataframe to apply filters to
|
675 |
+
hidden_leaderboard_df = gr.components.Dataframe(
|
676 |
+
value=performance_df,
|
677 |
+
headers=PERFORMANCE_COLS,
|
678 |
+
datatype=[
|
679 |
+
c.type
|
680 |
+
for c in fields(PerformanceAutoEvalColumn)
|
681 |
+
if c.name in PERFORMANCE_COLS
|
682 |
+
],
|
683 |
+
visible=False,
|
684 |
+
)
|
685 |
+
|
686 |
+
# Inputs for the dataframe filter function
|
687 |
+
performance_filter_inputs = [
|
688 |
+
hidden_leaderboard_df,
|
689 |
+
performance_shown_columns,
|
690 |
+
filter_performance_models,
|
691 |
+
exclude_performance_models,
|
692 |
+
performance_shown_devices,
|
693 |
+
performance_shown_os,
|
694 |
+
short_speed_slider,
|
695 |
+
long_speed_slider,
|
696 |
+
short_toks_slider,
|
697 |
+
long_toks_slider,
|
698 |
+
]
|
699 |
+
|
700 |
+
filter_output = leaderboard_df
|
701 |
+
filter_performance_models.change(
|
702 |
+
performance_filter, performance_filter_inputs, filter_output
|
703 |
+
)
|
704 |
+
exclude_performance_models.change(
|
705 |
+
performance_filter, performance_filter_inputs, filter_output
|
706 |
+
)
|
707 |
+
performance_shown_columns.change(
|
708 |
+
performance_filter, performance_filter_inputs, filter_output
|
709 |
+
)
|
710 |
+
performance_shown_devices.change(
|
711 |
+
performance_filter, performance_filter_inputs, filter_output
|
712 |
+
)
|
713 |
+
performance_shown_os.change(
|
714 |
+
performance_filter, performance_filter_inputs, filter_output
|
715 |
+
)
|
716 |
+
short_speed_slider.change(
|
717 |
+
performance_filter, performance_filter_inputs, filter_output
|
718 |
+
)
|
719 |
+
long_speed_slider.change(
|
720 |
+
performance_filter, performance_filter_inputs, filter_output
|
721 |
+
)
|
722 |
+
short_toks_slider.change(
|
723 |
+
performance_filter, performance_filter_inputs, filter_output
|
724 |
+
)
|
725 |
+
long_toks_slider.change(
|
726 |
+
performance_filter, performance_filter_inputs, filter_output
|
727 |
+
)
|
728 |
+
|
729 |
+
# English Quality Tab
|
730 |
+
with gr.TabItem("English Quality", elem_id="timeline", id=1):
|
731 |
+
with gr.Row():
|
732 |
+
with gr.Column(scale=1):
|
733 |
+
with gr.Row():
|
734 |
+
with gr.Column(scale=6, elem_classes="filter_models_column"):
|
735 |
+
filter_quality_models = gr.Textbox(
|
736 |
+
placeholder="🔍 Filter Model (separate multiple queries with ';')",
|
737 |
+
label="Filter Models",
|
738 |
+
)
|
739 |
+
with gr.Column(scale=4, elem_classes="exclude_models_column"):
|
740 |
+
exclude_quality_models = gr.Textbox(
|
741 |
+
placeholder="🔍 Exclude Model (separate multiple models with ';')",
|
742 |
+
label="Exclude Models",
|
743 |
+
)
|
744 |
+
with gr.Row():
|
745 |
+
with gr.Accordion("See All Columns", open=False):
|
746 |
+
quality_shown_columns = gr.CheckboxGroup(
|
747 |
+
choices=QUALITY_TOGGLE_COLS,
|
748 |
+
value=QUALITY_SELECTED_COLS,
|
749 |
+
label="Toggle Columns",
|
750 |
+
elem_id="column-select",
|
751 |
+
interactive=True,
|
752 |
+
)
|
753 |
+
with gr.Column(scale=1):
|
754 |
+
with gr.Accordion("See Quality Filters"):
|
755 |
+
with gr.Row():
|
756 |
+
with gr.Row():
|
757 |
+
quality_min_avg_wer, quality_max_avg_wer = (
|
758 |
+
floor(min(model_df["Average WER"])),
|
759 |
+
ceil(max(model_df["Average WER"])) + 1,
|
760 |
+
)
|
761 |
+
wer_slider = RangeSlider(
|
762 |
+
value=[quality_min_avg_wer, quality_max_avg_wer],
|
763 |
+
minimum=quality_min_avg_wer,
|
764 |
+
maximum=quality_max_avg_wer,
|
765 |
+
label="Average WER",
|
766 |
+
)
|
767 |
+
with gr.Row():
|
768 |
+
quality_min_qoi, quality_max_qoi = floor(
|
769 |
+
min(model_df["QoI"])
|
770 |
+
                                    ), ceil(max(model_df["QoI"])) + 1
|
771 |
+
qoi_slider = RangeSlider(
|
772 |
+
value=[quality_min_qoi, quality_max_qoi],
|
773 |
+
minimum=quality_min_qoi,
|
774 |
+
maximum=quality_max_qoi,
|
775 |
+
label="QoI",
|
776 |
+
)
|
777 |
+
with gr.Row():
|
778 |
+
gr.Markdown(QUALITY_TEXT)
|
779 |
+
with gr.Row():
|
780 |
+
quality_leaderboard_df = gr.components.Dataframe(
|
781 |
+
value=model_df[
|
782 |
+
QUALITY_ALWAYS_HERE_COLS + quality_shown_columns.value
|
783 |
+
],
|
784 |
+
                        headers=QUALITY_ALWAYS_HERE_COLS + quality_shown_columns.value,
|
785 |
+
datatype=[
|
786 |
+
c.type
|
787 |
+
for c in fields(QualityAutoEvalColumn)
|
788 |
+
if c.name in QUALITY_COLS
|
789 |
+
],
|
790 |
+
elem_id="leaderboard-table",
|
791 |
+
elem_classes="large-table",
|
792 |
+
interactive=False,
|
793 |
+
)
|
794 |
+
|
795 |
+
# Copy of the leaderboard dataframe to apply filters to
|
796 |
+
hidden_quality_leaderboard_df = gr.components.Dataframe(
|
797 |
+
value=model_df,
|
798 |
+
headers=QUALITY_COLS,
|
799 |
+
datatype=[
|
800 |
+
c.type
|
801 |
+
for c in fields(QualityAutoEvalColumn)
|
802 |
+
if c.name in QUALITY_COLS
|
803 |
+
],
|
804 |
+
visible=False,
|
805 |
+
)
|
806 |
+
|
807 |
+
# Inputs for the dataframe filter function
|
808 |
+
filter_inputs = [
|
809 |
+
hidden_quality_leaderboard_df,
|
810 |
+
quality_shown_columns,
|
811 |
+
filter_quality_models,
|
812 |
+
wer_slider,
|
813 |
+
qoi_slider,
|
814 |
+
exclude_quality_models,
|
815 |
+
]
|
816 |
+
filter_output = quality_leaderboard_df
|
817 |
+
filter_quality_models.change(
|
818 |
+
quality_filter, filter_inputs, filter_output
|
819 |
+
)
|
820 |
+
exclude_quality_models.change(
|
821 |
+
quality_filter, filter_inputs, filter_output
|
822 |
+
)
|
823 |
+
quality_shown_columns.change(
|
824 |
+
quality_filter, filter_inputs, filter_output
|
825 |
+
)
|
826 |
+
wer_slider.change(quality_filter, filter_inputs, filter_output)
|
827 |
+
qoi_slider.change(quality_filter, filter_inputs, filter_output)
|
828 |
+
|
829 |
+
# Timeline Tab
|
830 |
+
with gr.TabItem("Timeline", elem_id="timeline", id=4):
|
831 |
+
# Create subtabs for different metrics
|
832 |
+
with gr.Tabs():
|
833 |
+
with gr.TabItem("QoI", id=0):
|
834 |
+
with gr.Row():
|
835 |
+
with gr.Column(scale=6):
|
836 |
+
filter_qoi = gr.Textbox(
|
837 |
+
placeholder="🔍 Filter Model-Device-OS (separate multiple queries with ';')",
|
838 |
+
label="Filter",
|
839 |
+
)
|
840 |
+
with gr.Column(scale=4):
|
841 |
+
exclude_qoi = gr.Textbox(
|
842 |
+
placeholder="🔍 Exclude Model-Device-OS (separate multiple with ';')",
|
843 |
+
label="Exclude",
|
844 |
+
)
|
845 |
+
with gr.Row():
|
846 |
+
with gr.Column():
|
847 |
+
qoi_plot = gr.Plot(container=True)
|
848 |
+
demo.load(
|
849 |
+
lambda x, y, z: plot_metric(
|
850 |
+
x,
|
851 |
+
"qoi",
|
852 |
+
"QoI",
|
853 |
+
"QoI Over Time for Model-Device-OS Combinations",
|
854 |
+
y,
|
855 |
+
z,
|
856 |
+
),
|
857 |
+
[
|
858 |
+
gr.Dataframe(benchmark_df, visible=False),
|
859 |
+
filter_qoi,
|
860 |
+
exclude_qoi,
|
861 |
+
],
|
862 |
+
qoi_plot,
|
863 |
+
)
|
864 |
+
filter_qoi.change(
|
865 |
+
lambda x, y, z: plot_metric(
|
866 |
+
x,
|
867 |
+
"qoi",
|
868 |
+
"QoI",
|
869 |
+
"QoI Over Time for Model-Device-OS Combinations",
|
870 |
+
y,
|
871 |
+
z,
|
872 |
+
),
|
873 |
+
[
|
874 |
+
gr.Dataframe(benchmark_df, visible=False),
|
875 |
+
filter_qoi,
|
876 |
+
exclude_qoi,
|
877 |
+
],
|
878 |
+
qoi_plot,
|
879 |
+
)
|
880 |
+
exclude_qoi.change(
|
881 |
+
lambda x, y, z: plot_metric(
|
882 |
+
x,
|
883 |
+
"qoi",
|
884 |
+
"QoI",
|
885 |
+
"QoI Over Time for Model-Device-OS Combinations",
|
886 |
+
y,
|
887 |
+
z,
|
888 |
+
),
|
889 |
+
[
|
890 |
+
gr.Dataframe(benchmark_df, visible=False),
|
891 |
+
filter_qoi,
|
892 |
+
exclude_qoi,
|
893 |
+
],
|
894 |
+
qoi_plot,
|
895 |
+
)
|
896 |
+
|
897 |
+
with gr.TabItem("Average WER", id=1):
|
898 |
+
with gr.Row():
|
899 |
+
with gr.Column(scale=6):
|
900 |
+
filter_average_wer = gr.Textbox(
|
901 |
+
placeholder="🔍 Filter Model-Device-OS (separate multiple queries with ';')",
|
902 |
+
label="Filter",
|
903 |
+
)
|
904 |
+
with gr.Column(scale=4):
|
905 |
+
exclude_average_wer = gr.Textbox(
|
906 |
+
placeholder="🔍 Exclude Model-Device-OS (separate multiple with ';')",
|
907 |
+
label="Exclude",
|
908 |
+
)
|
909 |
+
with gr.Row():
|
910 |
+
with gr.Column():
|
911 |
+
average_wer_plot = gr.Plot(container=True)
|
912 |
+
demo.load(
|
913 |
+
lambda x, y, z: plot_metric(
|
914 |
+
x,
|
915 |
+
"average_wer",
|
916 |
+
"Average WER",
|
917 |
+
"Average WER Over Time for Model-Device-OS Combinations",
|
918 |
+
y,
|
919 |
+
z,
|
920 |
+
),
|
921 |
+
[
|
922 |
+
gr.Dataframe(benchmark_df, visible=False),
|
923 |
+
filter_average_wer,
|
924 |
+
exclude_average_wer,
|
925 |
+
],
|
926 |
+
average_wer_plot,
|
927 |
+
)
|
928 |
+
filter_average_wer.change(
|
929 |
+
lambda x, y, z: plot_metric(
|
930 |
+
x,
|
931 |
+
"average_wer",
|
932 |
+
"Average WER",
|
933 |
+
"Average WER Over Time for Model-Device-OS Combinations",
|
934 |
+
y,
|
935 |
+
z,
|
936 |
+
),
|
937 |
+
[
|
938 |
+
gr.Dataframe(benchmark_df, visible=False),
|
939 |
+
filter_average_wer,
|
940 |
+
exclude_average_wer,
|
941 |
+
],
|
942 |
+
average_wer_plot,
|
943 |
+
)
|
944 |
+
exclude_average_wer.change(
|
945 |
+
lambda x, y, z: plot_metric(
|
946 |
+
x,
|
947 |
+
"average_wer",
|
948 |
+
"Average WER",
|
949 |
+
"Average WER Over Time for Model-Device-OS Combinations",
|
950 |
+
y,
|
951 |
+
z,
|
952 |
+
),
|
953 |
+
[
|
954 |
+
gr.Dataframe(benchmark_df, visible=False),
|
955 |
+
filter_average_wer,
|
956 |
+
exclude_average_wer,
|
957 |
+
],
|
958 |
+
average_wer_plot,
|
959 |
+
)
|
960 |
+
|
961 |
+
with gr.TabItem("Speed", id=2):
|
962 |
+
with gr.Row():
|
963 |
+
with gr.Column(scale=6):
|
964 |
+
filter_speed = gr.Textbox(
|
965 |
+
placeholder="🔍 Filter Model-Device-OS (separate multiple queries with ';')",
|
966 |
+
label="Filter",
|
967 |
+
)
|
968 |
+
with gr.Column(scale=4):
|
969 |
+
exclude_speed = gr.Textbox(
|
970 |
+
placeholder="🔍 Exclude Model-Device-OS (separate multiple with ';')",
|
971 |
+
label="Exclude",
|
972 |
+
)
|
973 |
+
with gr.Row():
|
974 |
+
with gr.Column():
|
975 |
+
speed_plot = gr.Plot(container=True)
|
976 |
+
demo.load(
|
977 |
+
lambda x, y, z: plot_metric(
|
978 |
+
x,
|
979 |
+
"speed",
|
980 |
+
"Speed",
|
981 |
+
"Speed Over Time for Model-Device-OS Combinations",
|
982 |
+
y,
|
983 |
+
z,
|
984 |
+
),
|
985 |
+
[
|
986 |
+
gr.Dataframe(benchmark_df, visible=False),
|
987 |
+
filter_speed,
|
988 |
+
exclude_speed,
|
989 |
+
],
|
990 |
+
speed_plot,
|
991 |
+
)
|
992 |
+
filter_speed.change(
|
993 |
+
lambda x, y, z: plot_metric(
|
994 |
+
x,
|
995 |
+
"speed",
|
996 |
+
"Speed",
|
997 |
+
"Speed Over Time for Model-Device-OS Combinations",
|
998 |
+
y,
|
999 |
+
z,
|
1000 |
+
),
|
1001 |
+
[
|
1002 |
+
gr.Dataframe(benchmark_df, visible=False),
|
1003 |
+
filter_speed,
|
1004 |
+
exclude_speed,
|
1005 |
+
],
|
1006 |
+
speed_plot,
|
1007 |
+
)
|
1008 |
+
exclude_speed.change(
|
1009 |
+
lambda x, y, z: plot_metric(
|
1010 |
+
x,
|
1011 |
+
"speed",
|
1012 |
+
"Speed",
|
1013 |
+
"Speed Over Time for Model-Device-OS Combinations",
|
1014 |
+
y,
|
1015 |
+
z,
|
1016 |
+
),
|
1017 |
+
[
|
1018 |
+
gr.Dataframe(benchmark_df, visible=False),
|
1019 |
+
filter_speed,
|
1020 |
+
exclude_speed,
|
1021 |
+
],
|
1022 |
+
speed_plot,
|
1023 |
+
)
|
1024 |
+
|
1025 |
+
with gr.TabItem("Tok/s", id=3):
|
1026 |
+
with gr.Row():
|
1027 |
+
with gr.Column(scale=6):
|
1028 |
+
filter_toks = gr.Textbox(
|
1029 |
+
placeholder="🔍 Filter Model-Device-OS (separate multiple queries with ';')",
|
1030 |
+
label="Filter",
|
1031 |
+
)
|
1032 |
+
with gr.Column(scale=4):
|
1033 |
+
exclude_toks = gr.Textbox(
|
1034 |
+
placeholder="🔍 Exclude Model-Device-OS (separate multiple with ';')",
|
1035 |
+
label="Exclude",
|
1036 |
+
)
|
1037 |
+
with gr.Row():
|
1038 |
+
with gr.Column():
|
1039 |
+
toks_plot = gr.Plot(container=True)
|
1040 |
+
demo.load(
|
1041 |
+
lambda x, y, z: plot_metric(
|
1042 |
+
x,
|
1043 |
+
"tokens_per_second",
|
1044 |
+
"Tok/s",
|
1045 |
+
"Tok/s Over Time for Model-Device-OS Combinations",
|
1046 |
+
y,
|
1047 |
+
z,
|
1048 |
+
),
|
1049 |
+
[
|
1050 |
+
gr.Dataframe(benchmark_df, visible=False),
|
1051 |
+
filter_toks,
|
1052 |
+
exclude_toks,
|
1053 |
+
],
|
1054 |
+
toks_plot,
|
1055 |
+
)
|
1056 |
+
filter_toks.change(
|
1057 |
+
lambda x, y, z: plot_metric(
|
1058 |
+
x,
|
1059 |
+
"tokens_per_second",
|
1060 |
+
"Tok/s",
|
1061 |
+
"Tok/s Over Time for Model-Device-OS Combinations",
|
1062 |
+
y,
|
1063 |
+
z,
|
1064 |
+
),
|
1065 |
+
[
|
1066 |
+
gr.Dataframe(benchmark_df, visible=False),
|
1067 |
+
filter_toks,
|
1068 |
+
exclude_toks,
|
1069 |
+
],
|
1070 |
+
toks_plot,
|
1071 |
+
)
|
1072 |
+
exclude_toks.change(
|
1073 |
+
lambda x, y, z: plot_metric(
|
1074 |
+
x,
|
1075 |
+
"tokens_per_second",
|
1076 |
+
"Tok/s",
|
1077 |
+
"Tok/s Over Time for Model-Device-OS Combinations",
|
1078 |
+
y,
|
1079 |
+
z,
|
1080 |
+
),
|
1081 |
+
[
|
1082 |
+
gr.Dataframe(benchmark_df, visible=False),
|
1083 |
+
filter_toks,
|
1084 |
+
exclude_toks,
|
1085 |
+
],
|
1086 |
+
toks_plot,
|
1087 |
+
)
|
1088 |
+
|
1089 |
+
# Multilingual Quality Tab
|
1090 |
+
with gr.TabItem("Multilingual Quality", elem_id="multilingual", id=5):
|
1091 |
+
if multilingual_df is not None:
|
1092 |
+
with gr.Row():
|
1093 |
+
with gr.Column(scale=1):
|
1094 |
+
# Display table of multilingual models
|
1095 |
+
model_table = gr.Dataframe(
|
1096 |
+
value=multilingual_models_df,
|
1097 |
+
headers=["Model"],
|
1098 |
+
datatype=["html"],
|
1099 |
+
elem_classes="left-side-table",
|
1100 |
+
)
|
1101 |
+
# Placeholders for confusion matrix plots
|
1102 |
+
with gr.Row():
|
1103 |
+
unforced_confusion_matrix = gr.Plot(visible=False)
|
1104 |
+
with gr.Row():
|
1105 |
+
forced_confusion_matrix = gr.Plot(visible=False)
|
1106 |
+
|
1107 |
+
with gr.Column(scale=1):
|
1108 |
+
# Display area for selected model results
|
1109 |
+
results_markdown = gr.Markdown(
|
1110 |
+
"# Select a model from the table on the left to view results.",
|
1111 |
+
elem_id="multilingual-results",
|
1112 |
+
)
|
1113 |
+
# Tables for displaying average WER and language-specific WER
|
1114 |
+
average_wer_table = gr.Dataframe(
|
1115 |
+
value=None, elem_id="average-wer-table", visible=False
|
1116 |
+
)
|
1117 |
+
language_wer_table = gr.Dataframe(
|
1118 |
+
value=None, elem_id="general-wer-table", visible=False
|
1119 |
+
)
|
1120 |
+
|
1121 |
+
# Set up click event to update results when a model is selected
|
1122 |
+
for button in multilingual_models_buttons:
|
1123 |
+
button.render()
|
1124 |
+
button.click(
|
1125 |
+
fn=lambda x: update_multilingual_results(x),
|
1126 |
+
inputs=[button],
|
1127 |
+
outputs=[
|
1128 |
+
results_markdown,
|
1129 |
+
average_wer_table,
|
1130 |
+
language_wer_table,
|
1131 |
+
unforced_confusion_matrix,
|
1132 |
+
],
|
1133 |
+
)
|
1134 |
+
else:
|
1135 |
+
# Display message if no multilingual data is available
|
1136 |
+
gr.Markdown("No multilingual benchmark results available.")
|
1137 |
+
|
1138 |
+
# Device Support Tab
|
1139 |
+
with gr.TabItem("Device Support", elem_id="device_support", id=6):
|
1140 |
+
# Load device support data from CSV
|
1141 |
+
support_data = pd.read_csv("dashboard_data/support_data.csv")
|
1142 |
+
support_data.set_index(support_data.columns[0], inplace=True)
|
1143 |
+
support_data["Model"] = support_data["Model"].apply(
|
1144 |
+
lambda x: x.replace("_", "/")
|
1145 |
+
)
|
1146 |
+
support_data["Model"] = support_data["Model"].apply(
|
1147 |
+
lambda x: make_model_name_clickable_link(x)
|
1148 |
+
)
|
1149 |
+
support_data = (
|
1150 |
+
support_data.assign(model_len=support_data["Model"].str.len())
|
1151 |
+
.sort_values(
|
1152 |
+
by=["model_len"],
|
1153 |
+
ascending=[True],
|
1154 |
+
)
|
1155 |
+
.drop(columns=["model_len"])
|
1156 |
+
)
|
1157 |
+
|
1158 |
+
with gr.Row():
|
1159 |
+
with gr.Column(scale=1):
|
1160 |
+
with gr.Row():
|
1161 |
+
with gr.Column(scale=6, elem_id="filter_models_column"):
|
1162 |
+
filter_support_models = gr.Textbox(
|
1163 |
+
placeholder="🔍 Filter Model (separate multiple queries with ';')",
|
1164 |
+
label="Filter Models",
|
1165 |
+
)
|
1166 |
+
with gr.Column(scale=4, elem_classes="exclude_models_column"):
|
1167 |
+
exclude_support_models = gr.Textbox(
|
1168 |
+
placeholder="🔍 Exclude Model (separate multiple models with ';')",
|
1169 |
+
label="Exclude Models",
|
1170 |
+
)
|
1171 |
+
with gr.Row():
|
1172 |
+
with gr.Accordion("See All Columns", open=False):
|
1173 |
+
with gr.Row():
|
1174 |
+
with gr.Column(scale=9):
|
1175 |
+
support_shown_columns = gr.CheckboxGroup(
|
1176 |
+
choices=support_data.columns.tolist()[
|
1177 |
+
1:
|
1178 |
+
], # Exclude 'Model' column
|
1179 |
+
value=support_data.columns.tolist()[1:],
|
1180 |
+
label="Toggle Columns",
|
1181 |
+
elem_id="support-column-select",
|
1182 |
+
interactive=True,
|
1183 |
+
)
|
1184 |
+
with gr.Column(scale=1, min_width=200):
|
1185 |
+
with gr.Row():
|
1186 |
+
select_all_support_button = gr.Button(
|
1187 |
+
"Select All",
|
1188 |
+
elem_id="select-all-support-button",
|
1189 |
+
interactive=True,
|
1190 |
+
)
|
1191 |
+
deselect_all_support_button = gr.Button(
|
1192 |
+
"Deselect All",
|
1193 |
+
elem_id="deselect-all-support-button",
|
1194 |
+
interactive=True,
|
1195 |
+
)
|
1196 |
+
with gr.Column():
|
1197 |
+
gr.Markdown(
|
1198 |
+
"""
|
1199 |
+
### Legend
|
1200 |
+
- ✅ Supported: The model is supported and tested on this device.
|
1201 |
+
- ⚠️ Failed: Either the model's tests failed on this device or the Speed Factor for the test is less than 1.
|
1202 |
+
- ? Not Tested: The model is supported on this device but no test information is available.
|
1203 |
+
- Not Supported: The model is not supported on this device as per the [WhisperKit configuration](https://huggingface.co/argmaxinc/whisperkit-coreml/blob/main/config.json).
|
1204 |
+
"""
|
1205 |
+
)
|
1206 |
+
|
1207 |
+
# Display device support data in a table
|
1208 |
+
device_support_table = gr.Dataframe(
|
1209 |
+
value=support_data,
|
1210 |
+
headers=support_data.columns.tolist(),
|
1211 |
+
datatype=["html" for _ in support_data.columns],
|
1212 |
+
elem_id="device-support-table",
|
1213 |
+
elem_classes="large-table",
|
1214 |
+
interactive=False,
|
1215 |
+
)
|
1216 |
+
|
1217 |
+
# Hidden dataframe to store the original data
|
1218 |
+
hidden_support_df = gr.Dataframe(value=support_data, visible=False)
|
1219 |
+
|
1220 |
+
def filter_support_data(df, columns, model_query, exclude_models):
|
1221 |
+
filtered_df = df.copy()
|
1222 |
+
|
1223 |
+
# Filter models based on query
|
1224 |
+
if model_query:
|
1225 |
+
filtered_df = filtered_df[
|
1226 |
+
filtered_df["Model"].str.contains(
|
1227 |
+
"|".join(q.strip() for q in model_query.split(";") if q.strip()),
|
1228 |
+
case=False,
|
1229 |
+
regex=True,
|
1230 |
+
)
|
1231 |
+
]
|
1232 |
+
|
1233 |
+
# Exclude specified models
|
1234 |
+
if exclude_models:
|
1235 |
+
exclude_list = [
|
1236 |
+
re.escape(m.strip()) for m in exclude_models.split(";") if m.strip()
|
1237 |
+
]
|
1238 |
+
filtered_df = filtered_df[
|
1239 |
+
~filtered_df["Model"].str.contains(
|
1240 |
+
"|".join(exclude_list), case=False, regex=True
|
1241 |
+
)
|
1242 |
+
]
|
1243 |
+
|
1244 |
+
# Select columns
|
1245 |
+
selected_columns = ["Model"] + [
|
1246 |
+
col for col in columns if col in df.columns
|
1247 |
+
]
|
1248 |
+
filtered_df = filtered_df[selected_columns]
|
1249 |
+
|
1250 |
+
return filtered_df
|
1251 |
+
|
1252 |
+
def select_all_support_columns():
|
1253 |
+
return support_data.columns.tolist()[1:] # Exclude 'Model' column
|
1254 |
+
|
1255 |
+
def deselect_all_support_columns():
|
1256 |
+
return []
|
1257 |
+
|
1258 |
+
# Connect the filter function to the input components
|
1259 |
+
filter_inputs = [
|
1260 |
+
hidden_support_df,
|
1261 |
+
support_shown_columns,
|
1262 |
+
filter_support_models,
|
1263 |
+
exclude_support_models,
|
1264 |
+
]
|
1265 |
+
filter_support_models.change(
|
1266 |
+
filter_support_data, filter_inputs, device_support_table
|
1267 |
+
)
|
1268 |
+
exclude_support_models.change(
|
1269 |
+
filter_support_data, filter_inputs, device_support_table
|
1270 |
+
)
|
1271 |
+
support_shown_columns.change(
|
1272 |
+
filter_support_data, filter_inputs, device_support_table
|
1273 |
+
)
|
1274 |
+
|
1275 |
+
# Connect select all and deselect all buttons
|
1276 |
+
select_all_support_button.click(
|
1277 |
+
select_all_support_columns,
|
1278 |
+
inputs=[],
|
1279 |
+
outputs=support_shown_columns,
|
1280 |
+
)
|
1281 |
+
deselect_all_support_button.click(
|
1282 |
+
deselect_all_support_columns,
|
1283 |
+
inputs=[],
|
1284 |
+
outputs=support_shown_columns,
|
1285 |
+
)
|
1286 |
+
|
1287 |
+
# Methodology Tab
|
1288 |
+
with gr.TabItem("Methodology", elem_id="methodology", id=7):
|
1289 |
+
gr.Markdown(METHODOLOGY_TEXT, elem_id="methodology-text")
|
1290 |
+
|
1291 |
+
# Citation section
|
1292 |
+
with gr.Accordion("📙 Citation", open=False):
|
1293 |
+
citation_button = gr.Textbox(
|
1294 |
+
value=CITATION_BUTTON_TEXT,
|
1295 |
+
label=CITATION_BUTTON_LABEL,
|
1296 |
+
lines=7,
|
1297 |
+
elem_id="citation-button",
|
1298 |
+
show_copy_button=True,
|
1299 |
+
)
|
1300 |
+
|
1301 |
+
# Launch the Gradio interface
|
1302 |
+
demo.launch(debug=True, share=True, ssr_mode=False)
|
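The Device Support tab's `filter_support_data` joins `;`-separated terms into a single regular expression. A minimal standalone sketch of that matching follows, using only the standard library instead of pandas. `matches_any` and `filter_models` are hypothetical helper names, not functions from `main.py`. The sketch assumes every term should be treated as literal text and that empty terms (for example, from a trailing `;`) should be ignored.

```python
import re


def matches_any(name, query):
    """Return True if `name` contains any ';'-separated term in `query` (case-insensitive)."""
    # Drop empty terms so a trailing ';' does not become a match-everything pattern.
    terms = [re.escape(t.strip()) for t in query.split(";") if t.strip()]
    if not terms:
        return False
    return re.search("|".join(terms), name, flags=re.IGNORECASE) is not None


def filter_models(names, include_query="", exclude_query=""):
    """Keep names that match include_query (if given) and do not match exclude_query."""
    kept = []
    for name in names:
        # An include query made only of spaces and ';' is treated as "no filter".
        if include_query.strip(" ;") and not matches_any(name, include_query):
            continue
        if matches_any(name, exclude_query):
            continue
        kept.append(name)
    return kept


# Hypothetical model names, used only for illustration.
models = ["openai/whisper-tiny", "openai/whisper-base", "distil-whisper/distil-large-v3"]
print(filter_models(models, include_query="tiny; base"))
print(filter_models(models, exclude_query="openai;"))
```

Escaping both queries is a design choice of this sketch. The code in the diff escapes only the exclude terms, so its include box accepts regular-expression syntax.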
multilingual_generate.py
ADDED
@@ -0,0 +1,132 @@
1 |
+
import json
|
2 |
+
import os
|
3 |
+
import shutil
|
4 |
+
import sys
|
5 |
+
from collections import defaultdict
|
6 |
+
|
7 |
+
import numpy as np
|
8 |
+
import pandas as pd
|
9 |
+
from sklearn.metrics import confusion_matrix
|
10 |
+
|
11 |
+
from utils import compute_average_wer, download_dataset
|
12 |
+
|
13 |
+
|
14 |
+
def main():
|
15 |
+
"""
|
16 |
+
Main function to orchestrate the multilingual data generation process.
|
17 |
+
|
18 |
+
This function performs the following steps:
|
19 |
+
1. Downloads multilingual evaluation data if requested.
|
20 |
+
2. Processes multilingual evaluation files.
|
21 |
+
3. Calculates and saves results, including Word Error Rate (WER) and
|
22 |
+
language detection confusion matrices.
|
23 |
+
"""
|
24 |
+
source_repo = "argmaxinc/whisperkit-evals-multilingual"
|
25 |
+
source_subfolder = "WhisperKit"
|
26 |
+
source_directory = f"{source_repo}/{source_subfolder}"
|
27 |
+
if len(sys.argv) > 1 and sys.argv[1] == "download":
|
28 |
+
try:
|
29 |
+
shutil.rmtree(source_repo)
|
30 |
+
except OSError:
|
31 |
+
print("Nothing to remove.")
|
32 |
+
download_dataset(source_repo, source_repo, source_subfolder)
|
33 |
+
|
34 |
+
results = defaultdict(
|
35 |
+
lambda: {
|
36 |
+
"average_wer": [],
|
37 |
+
"language_wer": defaultdict(list),
|
38 |
+
"language_detection": [],
|
39 |
+
}
|
40 |
+
)
|
41 |
+
|
42 |
+
confusion_matrices = {}
|
43 |
+
|
44 |
+
for subdir, _, files in os.walk(source_directory):
|
45 |
+
for filename in files:
|
46 |
+
if not filename.endswith(".json") or "summary" in filename:
|
47 |
+
continue
|
48 |
+
|
49 |
+
file_path = os.path.join(subdir, filename)
|
50 |
+
with open(file_path, "r") as f:
|
51 |
+
data = json.load(f)
|
52 |
+
|
53 |
+
subdir_components = subdir.split(os.path.sep)
|
54 |
+
is_forced = "forced" in subdir_components
|
55 |
+
model = subdir_components[-3] if not is_forced else subdir_components[-4]
|
56 |
+
|
57 |
+
key = f"{model}/{'forced' if is_forced else 'not_forced'}"
|
58 |
+
|
59 |
+
for item in data["results"]:
|
60 |
+
if "reference_language" not in item:
|
61 |
+
continue
|
62 |
+
reference_language = item["reference_language"]
|
63 |
+
wer = item["wer"]
|
64 |
+
detected_language = item["predicted_language"]
|
65 |
+
|
66 |
+
result = {
|
67 |
+
"reference": item["reference"],
|
68 |
+
"prediction": item["prediction"],
|
69 |
+
}
|
70 |
+
|
71 |
+
results[key]["average_wer"].append(result)
|
72 |
+
results[key]["language_wer"][reference_language].append(result)
|
73 |
+
results[key]["language_detection"].append(
|
74 |
+
(reference_language, detected_language)
|
75 |
+
)
|
76 |
+
|
77 |
+
calculate_and_save_results(results, confusion_matrices)
|
78 |
+
|
79 |
+
|
80 |
+
def calculate_and_save_results(results, confusion_matrices):
|
81 |
+
"""
|
82 |
+
Calculates final multilingual metrics and saves them to CSV and JSON files.
|
83 |
+
|
84 |
+
:param results: Dictionary containing raw multilingual evaluation data.
|
85 |
+
:param confusion_matrices: Dictionary to store confusion matrices for language detection.
|
86 |
+
|
87 |
+
This function processes the raw multilingual data, calculates average metrics,
|
88 |
+
creates confusion matrices for language detection, and saves the results to:
|
89 |
+
1. A CSV file with WER data for each model and language.
|
90 |
+
2. A JSON file with confusion matrices for language detection.
|
91 |
+
"""
|
92 |
+
wer_data = []
|
93 |
+
for key, data in results.items():
|
94 |
+
model, forced = key.rsplit("/", 1)
|
95 |
+
row = {
|
96 |
+
"Model": model,
|
97 |
+
"Forced Tokens": forced == "forced",
|
98 |
+
"Average WER": compute_average_wer(data["average_wer"]),
|
99 |
+
}
|
100 |
+
for lang, wers in data["language_wer"].items():
|
101 |
+
row[f"WER_{lang}"] = compute_average_wer(wers)
|
102 |
+
wer_data.append(row)
|
103 |
+
|
104 |
+
true_languages, detected_languages = zip(*data["language_detection"])
|
105 |
+
unique_languages = sorted(set(true_languages))
|
106 |
+
cm = confusion_matrix(
|
107 |
+
true_languages, detected_languages, labels=unique_languages
|
108 |
+
)
|
109 |
+
|
110 |
+
row_sums = cm.sum(axis=1)
|
111 |
+
cm_normalized = np.zeros_like(cm, dtype=float)
|
112 |
+
non_zero_rows = row_sums != 0
|
113 |
+
cm_normalized[non_zero_rows] = (
|
114 |
+
cm[non_zero_rows] / row_sums[non_zero_rows, np.newaxis]
|
115 |
+
)
|
116 |
+
|
117 |
+
if model not in confusion_matrices:
|
118 |
+
confusion_matrices[model] = {}
|
119 |
+
confusion_matrices[model][forced] = {
|
120 |
+
"matrix": cm_normalized.tolist(),
|
121 |
+
"labels": unique_languages,
|
122 |
+
}
|
123 |
+
|
124 |
+
df = pd.DataFrame(wer_data)
|
125 |
+
df.to_csv("dashboard_data/multilingual_results.csv", index=False)
|
126 |
+
|
127 |
+
with open("dashboard_data/multilingual_confusion_matrices.json", "w") as f:
|
128 |
+
json.dump(confusion_matrices, f, indent=2)
|
129 |
+
|
130 |
+
|
131 |
+
if __name__ == "__main__":
|
132 |
+
main()
|
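The row normalization in `calculate_and_save_results` divides each confusion-matrix row by its total and leaves all-zero rows at zero. A plain-Python sketch of the same idea follows, without scikit-learn or NumPy. `normalized_confusion` is a hypothetical name, and the sample pairs are made up for illustration. The sketch assumes that detections outside the set of reference languages are simply not counted.

```python
def normalized_confusion(pairs):
    """Build a row-normalized confusion matrix from (reference, detected) pairs."""
    # Labels come from the reference languages only, sorted for a stable order.
    labels = sorted({ref for ref, _ in pairs})
    index = {label: i for i, label in enumerate(labels)}
    counts = [[0] * len(labels) for _ in labels]
    for ref, detected in pairs:
        # Assumption: a detected language that never appears as a reference is skipped.
        if detected in index:
            counts[index[ref]][index[detected]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # Rows with no counted samples stay all zeros instead of dividing by zero.
        matrix.append([c / total if total else 0.0 for c in row])
    return labels, matrix


# Hypothetical (reference_language, predicted_language) pairs.
pairs = [("en", "en"), ("en", "de"), ("de", "de"), ("fr", "es")]
labels, matrix = normalized_confusion(pairs)
print(labels)
print(matrix)
```

Each row then reads as the distribution of detected languages for one reference language, which is the form a heatmap can display directly.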
performance_generate.py
ADDED
@@ -0,0 +1,465 @@
1 |
+
import glob
|
2 |
+
import json
|
3 |
+
import os
|
4 |
+
import shutil
|
5 |
+
import sys
|
6 |
+
import urllib.parse
|
7 |
+
from collections import defaultdict
|
8 |
+
from datetime import datetime
|
9 |
+
from statistics import mean
|
10 |
+
|
11 |
+
import pandas as pd
|
12 |
+
import requests
|
13 |
+
|
14 |
+
from constants import BASE_WHISPERKIT_BENCHMARK_URL
|
15 |
+
from text_normalizer import text_normalizer
|
16 |
+
from utils import compute_average_wer, dir_to_json, download_dataset
|
17 |
+
|
18 |
+
|
19 |
+
def fetch_evaluation_data(url):
|
20 |
+
"""
|
21 |
+
Fetches evaluation data from the given URL.
|
22 |
+
:param url: The URL to fetch the evaluation data from.
|
23 |
+
:returns: The evaluation data as a dictionary.
|
24 |
+
:raises SystemExit: If the request fails.
|
25 |
+
"""
|
26 |
+
response = requests.get(url)
|
27 |
+
if response.status_code == 200:
|
28 |
+
return json.loads(response.text)
|
29 |
+
else:
|
30 |
+
sys.exit(f"Failed to fetch WhisperKit evals: {response.text}")
|
31 |
+
|
32 |
+
|
33 |
+
def generate_device_map(base_dir):
|
34 |
+
"""
|
35 |
+
Generates a mapping of device identifiers to their corresponding device models.
|
36 |
+
|
37 |
+
This function iterates through all summary files in the specified base directory and its subdirectories,
|
38 |
+
extracting device identifier and device model information. It stores this information in a dictionary,
|
39 |
+
where the keys are device identifiers and the values are device models.
|
40 |
+
|
41 |
+
:param base_dir: The base directory to search for summary files.
|
42 |
+
:returns: A dictionary mapping device identifiers to device models.
|
43 |
+
"""
|
44 |
+
device_map = {}
|
45 |
+
|
46 |
+
# Find all summary files recursively
|
47 |
+
summary_files = glob.glob(f"{base_dir}/**/*summary*.json", recursive=True)
|
48 |
+
|
49 |
+
for file_path in summary_files:
|
50 |
+
try:
|
51 |
+
with open(file_path, "r") as f:
|
52 |
+
data = json.load(f)
|
53 |
+
|
54 |
+
# Extract device information and create simple mapping
|
55 |
+
if "deviceModel" in data and "deviceIdentifier" in data:
|
56 |
+
device_map[data["deviceIdentifier"]] = data["deviceModel"]
|
57 |
+
|
58 |
+
except json.JSONDecodeError:
|
59 |
+
print(f"Error reading {file_path}")
|
60 |
+
except Exception as e:
|
61 |
+
print(f"Error processing {file_path}: {e}")
|
62 |
+
|
63 |
+
# Save the device map to dashboard_data/device_map.json
|
64 |
+
output_path = "dashboard_data/device_map.json"
|
65 |
+
|
66 |
+
with open(output_path, "w") as f:
|
67 |
+
json.dump(device_map, f, indent=4, sort_keys=True)
|
68 |
+
|
69 |
+
return device_map
|
70 |
+
|
71 |
+
|
72 |
+
def get_device_name(device):
|
73 |
+
"""
|
74 |
+
Gets the device name from the device map if it exists.
|
75 |
+
:param device: String representing the device identifier.
|
76 |
+
:returns: The mapped device name if one exists, otherwise the input string, with spaces replaced by underscores.
|
77 |
+
"""
|
78 |
+
with open("dashboard_data/device_map.json", "r") as f:
|
79 |
+
device_map = json.load(f)
|
80 |
+
return device_map.get(device, device).replace(" ", "_")
|
81 |
+
|
82 |
+
|
83 |
+
def process_benchmark_file(file_path, dataset_dfs, results):
|
84 |
+
"""
|
85 |
+
Processes a single benchmark file and updates the results dictionary.
|
86 |
+
|
87 |
+
:param file_path: Path to the benchmark JSON file.
|
88 |
+
:param dataset_dfs: Dictionary of DataFrames containing dataset information.
|
89 |
+
:param results: Dictionary to store the processed results.
|
90 |
+
|
91 |
+
This function reads a benchmark JSON file, extracts relevant information,
|
92 |
+
and updates the results dictionary with various metrics including WER,
|
93 |
+
speed, tokens per second, and quality of inference (QoI).
|
94 |
+
"""
|
95 |
+
with open(file_path, "r") as file:
|
96 |
+
test_results = json.load(file)
|
97 |
+
|
98 |
+
if len(test_results) == 0:
|
99 |
+
return
|
100 |
+
|
101 |
+
first_test_result = test_results[0]
|
102 |
+
model = first_test_result["testInfo"]["model"]
|
103 |
+
device = first_test_result["testInfo"]["device"]
|
104 |
+
dataset_dir = first_test_result["testInfo"]["datasetDir"]
|
105 |
+
if "iPhone" in device or "iPad" in device:
|
106 |
+
version_numbers = first_test_result["staticAttributes"]["osVersion"].split(".")
|
107 |
+
if len(version_numbers) == 3 and version_numbers[-1] == "0":
|
108 |
+
version_numbers.pop()
|
109 |
+
os_info = f"""{'iOS' if 'iPhone' in device else 'iPadOS'}_{".".join(version_numbers)}"""
|
110 |
+
else:
|
111 |
+
os_info = f"macOS_{first_test_result['staticAttributes']['osVersion']}"
|
112 |
+
timestamp = first_test_result["testInfo"]["date"]
|
113 |
+
commit_hash_timestamp = file_path.split("/")[-2]
|
114 |
+
commit_timestamp, commit_hash = commit_hash_timestamp.split("_")
|
115 |
+
|
116 |
+
key = (model, device, os_info, commit_timestamp)
|
117 |
+
dataset_name = dataset_dir
|
118 |
+
for test_result in test_results:
|
119 |
+
test_info = test_result["testInfo"]
|
120 |
+
audio_file_name = test_info["audioFile"]
|
121 |
+
|
122 |
+
dataset_df = dataset_dfs[dataset_name]
|
123 |
+
|
124 |
+
wer_entry = {
|
125 |
+
"prediction": text_normalizer(test_info["prediction"]),
|
126 |
+
"reference": text_normalizer(test_info["reference"]),
|
127 |
+
}
|
128 |
+
results[key]["timestamp"] = timestamp
|
129 |
+
results[key]["average_wer"].append(wer_entry)
|
130 |
+
results[key]["dataset_wer"][dataset_name].append(wer_entry)
|
131 |
+
|
132 |
+
input_audio_seconds = test_info["timings"]["inputAudioSeconds"]
|
133 |
+
full_pipeline = test_info["timings"]["fullPipeline"]
|
134 |
+
total_decoding_loops = test_info["timings"]["totalDecodingLoops"]
|
135 |
+
|
136 |
+
results[key]["dataset_speed"][dataset_name][
|
137 |
+
"inputAudioSeconds"
|
138 |
+
] += input_audio_seconds
|
139 |
+
results[key]["dataset_speed"][dataset_name]["fullPipeline"] += full_pipeline
|
140 |
+
|
141 |
+
results[key]["speed"]["inputAudioSeconds"] += input_audio_seconds
|
142 |
+
results[key]["speed"]["fullPipeline"] += full_pipeline
|
143 |
+
|
144 |
+
results[key]["commit_hash"] = commit_hash
|
145 |
+
results[key]["commit_timestamp"] = commit_timestamp
|
146 |
+
|
147 |
+
results[key]["dataset_tokens_per_second"][dataset_name][
|
148 |
+
"totalDecodingLoops"
|
149 |
+
] += total_decoding_loops
|
150 |
+
results[key]["dataset_tokens_per_second"][dataset_name][
|
151 |
+
"fullPipeline"
|
152 |
+
] += full_pipeline
|
153 |
+
results[key]["tokens_per_second"]["totalDecodingLoops"] += total_decoding_loops
|
154 |
+
results[key]["tokens_per_second"]["fullPipeline"] += full_pipeline
|
155 |
+
|
156 |
+
audio = audio_file_name.split(".")[0]
|
157 |
+
if dataset_name == "earnings22-10mins":
|
158 |
+
audio = audio.split("-")[0]
|
159 |
+
|
160 |
+
dataset_row = dataset_df.loc[dataset_df["file"].str.contains(audio)].iloc[0]
|
161 |
+
reference_wer = dataset_row["wer"]
|
162 |
+
prediction_wer = test_info["wer"]
|
163 |
+
|
164 |
+
results[key]["qoi"].append(1 if prediction_wer <= reference_wer else 0)
|
165 |
+
|
166 |
+
return key, dataset_name
|
167 |
+
|
168 |
+
|
169 |
+
def process_summary_file(file_path, results):
|
170 |
+
"""
|
171 |
+
Processes a summary file and updates the results dictionary with device support information.
|
172 |
+
|
173 |
+
:param file_path: Path to the summary JSON file.
|
174 |
+
:param results: Dictionary to store the processed results.
|
175 |
+
|
176 |
+
This function reads a summary JSON file, extracts information about supported
|
177 |
+
and failed models for a specific device and OS combination, and updates the
|
178 |
+
results dictionary accordingly.
|
179 |
+
"""
|
180 |
+
with open(file_path, "r") as file:
|
181 |
+
summary_data = json.load(file)
|
182 |
+
|
183 |
+
device = summary_data["deviceIdentifier"]
|
184 |
+
os = f"{'iPadOS' if 'iPad' in device else summary_data['osType']} {summary_data['osVersion']}"
|
185 |
+
commit_timestamp = summary_data["commitTimestamp"]
|
186 |
+
|
187 |
+
key = (device, os)
|
188 |
+
if key in results:
|
189 |
+
existing_timestamp = results[key]["commitTimestamp"]
|
190 |
+
|
191 |
+
existing_dt = datetime.strptime(existing_timestamp, "%Y-%m-%dT%H%M%S")
|
192 |
+
new_dt = datetime.strptime(commit_timestamp, "%Y-%m-%dT%H%M%S")
|
193 |
+
|
194 |
+
if new_dt <= existing_dt:
|
195 |
+
return
|
196 |
+
else:
|
197 |
+
results[key] = {}
|
198 |
+
|
199 |
+
supported_models = set(summary_data["modelsTested"])
|
200 |
+
failed_models = set()
|
201 |
+
|
202 |
+
dataset_count = 2
|
203 |
+
for model, value in summary_data["testResults"].items():
|
204 |
+
if model not in summary_data["failureInfo"]:
|
205 |
+
dataset_count = len(value)
|
206 |
+
break
|
207 |
+
|
208 |
+
for failed_model in summary_data["failureInfo"]:
|
209 |
+
if (
|
210 |
+
failed_model in summary_data["testResults"]
|
211 |
+
and len(summary_data["testResults"][failed_model]) == dataset_count
|
212 |
+
):
|
213 |
+
continue
|
214 |
+
supported_models.discard(failed_model)
|
215 |
+
failed_models.add(failed_model)
|
216 |
+
|
217 |
+
results[key]["supportedModels"] = supported_models
|
218 |
+
results[key]["commitTimestamp"] = commit_timestamp
|
219 |
+
results[key]["failedModels"] = (failed_models, file_path)
|
220 |
+
results["modelsTested"] |= supported_models
|
221 |
+
results["devices"].add(device)
|
222 |
+
|
223 |
+
|
224 |
+
def calculate_and_save_performance_results(
|
225 |
+
performance_results, performance_output_path
|
226 |
+
):
|
227 |
+
"""
|
228 |
+
Calculates final performance metrics and saves them to a JSON file.
|
229 |
+
|
230 |
+
:param performance_results: Dictionary containing raw performance data.
|
231 |
+
:param performance_output_path: Path to save the processed performance results.
|
232 |
+
|
233 |
+
This function processes the raw performance data, calculates average metrics,
|
234 |
+
and writes the final results to a JSON file, with each entry representing
|
235 |
+
a unique combination of model, device, and OS.
|
236 |
+
"""
|
237 |
+
not_supported = []
|
238 |
+
with open(performance_output_path, "w") as performance_file:
|
239 |
+
for key, data in performance_results.items():
|
240 |
+
model, device, os_info, timestamp = key
|
241 |
+
speed = round(
|
242 |
+
data["speed"]["inputAudioSeconds"] / data["speed"]["fullPipeline"], 2
|
243 |
+
)
|
244 |
+
|
245 |
+
if speed < 1.0:
|
246 |
+
not_supported.append((model, device, os_info))
|
247 |
+
continue
|
248 |
+
|
249 |
+
performance_entry = {
|
250 |
+
"model": model.replace("_", "/"),
|
251 |
+
"device": get_device_name(device).replace("_", " "),
|
252 |
+
"os": os_info.replace("_", " "),
|
253 |
+
"timestamp": data["timestamp"],
|
254 |
+
"speed": speed,
|
255 |
+
"tokens_per_second": round(
|
256 |
+
data["tokens_per_second"]["totalDecodingLoops"]
|
257 |
+
/ data["tokens_per_second"]["fullPipeline"],
|
258 |
+
2,
|
259 |
+
),
|
260 |
+
"dataset_speed": {
|
261 |
+
dataset: round(
|
262 |
+
speed_info["inputAudioSeconds"] / speed_info["fullPipeline"], 2
|
263 |
+
)
|
264 |
+
for dataset, speed_info in data["dataset_speed"].items()
|
265 |
+
},
|
266 |
+
"dataset_tokens_per_second": {
|
267 |
+
dataset: round(
|
268 |
+
tps_info["totalDecodingLoops"] / tps_info["fullPipeline"], 2
|
269 |
+
)
|
270 |
+
for dataset, tps_info in data["dataset_tokens_per_second"].items()
|
271 |
+
},
|
272 |
+
"average_wer": compute_average_wer(data["average_wer"]),
|
273 |
+
"dataset_wer": {
|
274 |
+
dataset: compute_average_wer(wer)
|
275 |
+
for dataset, wer in data["dataset_wer"].items()
|
276 |
+
},
|
277 |
+
"qoi": round(mean(data["qoi"]), 2),
|
278 |
+
"commit_hash": data["commit_hash"],
|
279 |
+
"commit_timestamp": data["commit_timestamp"],
|
280 |
+
}
|
281 |
+
|
282 |
+
json.dump(performance_entry, performance_file)
|
283 |
+
performance_file.write("\n")
|
284 |
+
return not_supported
|
285 |
+
|
286 |
+
|
287 |
+
def calculate_and_save_support_results(
|
288 |
+
support_results, not_supported, support_output_path
|
289 |
+
):
|
290 |
+
"""
|
291 |
+
Calculates device support results and saves them to a CSV file.
|
292 |
+
|
293 |
+
:param support_results: Dictionary containing device support information.
|
294 |
+
:param support_output_path: Path to save the processed support results.
|
295 |
+
|
296 |
+
This function processes the device support data and creates a CSV file
|
297 |
+
showing which models are supported on different devices and OS versions,
|
298 |
+
using checkmarks, warning signs, question marks, or "Not Supported" to
|
299 |
+
indicate support status.
|
300 |
+
"""
|
301 |
+
all_models = sorted(support_results["modelsTested"])
|
302 |
+
all_devices = sorted(set(support_results["devices"]))
|
303 |
+
|
304 |
+
df = pd.DataFrame(index=all_models, columns=["Model"] + all_devices)
|
305 |
+
|
306 |
+
for model in all_models:
|
307 |
+
row = {"Model": model}
|
308 |
+
for device in all_devices:
|
309 |
+
row[device] = ""
|
310 |
+
|
311 |
+
for key, data in support_results.items():
|
312 |
+
if key in ["modelsTested", "devices"]:
|
313 |
+
continue
|
314 |
+
(device, os) = key
|
315 |
+
supported_models = data["supportedModels"]
|
316 |
+
failed_models, file_path = data["failedModels"]
|
317 |
+
directories = file_path.split("/")
|
318 |
+
commit_file, summary_file = directories[-2], directories[-1]
|
319 |
+
url = f"{BASE_WHISPERKIT_BENCHMARK_URL}/{commit_file}/{urllib.parse.quote(summary_file)}"
|
320 |
+
|
321 |
+
if model in supported_models:
|
322 |
+
current_value = row[device]
|
323 |
+
new_value = (
|
324 |
+
f"✅ {os}"
|
325 |
+
if current_value == ""
|
326 |
+
else f"{current_value}<p>✅ {os}</p>"
|
327 |
+
)
|
328 |
+
elif model in failed_models:
|
329 |
+
current_value = row[device]
|
330 |
+
new_value = (
|
331 |
+
f"""⚠️ <a style='color: #3B82F6; text-decoration: underline; text-decoration-style: dotted;' href={url}>{os}</a>"""
|
332 |
+
if current_value == ""
|
333 |
+
else f"""{current_value}<p>⚠️ <a style='color: #3B82F6; text-decoration: underline; text-decoration-style: dotted;' href={url}>{os}</a></p>"""
|
334 |
+
)
|
335 |
+
else:
|
336 |
+
current_value = row[device]
|
337 |
+
new_value = (
|
338 |
+
f"? {os}"
|
339 |
+
if current_value == ""
|
340 |
+
else f"{current_value}<p>? {os}</p>"
|
341 |
+
)
|
342 |
+
row[device] = new_value
|
343 |
+
|
344 |
+
df.loc[model] = row
|
345 |
+
|
346 |
+
remove_unsupported_cells(df, not_supported)
|
347 |
+
|
348 |
+
cols = df.columns.tolist()
|
349 |
+
cols = ["Model"] + [
|
350 |
+
get_device_name(col).replace("_", " ") for col in cols if col != "Model"
|
351 |
+
]
|
352 |
+
df.columns = cols
|
353 |
+
|
354 |
+
df.to_csv(support_output_path, index=True)
|
355 |
+
|
356 |
+
|
357 |
+
def remove_unsupported_cells(df, not_supported):
|
358 |
+
"""
|
359 |
+
Updates the DataFrame to mark unsupported model-device combinations.
|
360 |
+
|
361 |
+
This function reads a configuration file to determine which models are supported
|
362 |
+
on which devices. It then iterates over the DataFrame and sets the value to "Not Supported"
|
363 |
+
for any model-device combination that is not supported according to the configuration.
|
364 |
+
|
365 |
+
:param df: A Pandas DataFrame where the index represents models and columns represent devices.
|
366 |
+
"""
|
367 |
+
with open("dashboard_data/config.json", "r") as file:
|
368 |
+
config_data = json.load(file)
|
369 |
+
|
370 |
+
device_support = config_data["device_support"]
|
371 |
+
for info in device_support:
|
372 |
+
identifiers = set(info["identifiers"])
|
373 |
+
supported = set(info["models"]["supported"])
|
374 |
+
|
375 |
+
for model in df.index:
|
376 |
+
for device in df.columns:
|
377 |
+
if (
|
378 |
+
any(identifier in device for identifier in identifiers)
|
379 |
+
and model not in supported
|
380 |
+
):
|
381 |
+
df.at[model, device] = "Not Supported"
|
382 |
+
|
383 |
+
for model, device, os in not_supported:
|
384 |
+
df.at[model, device] = "Not Supported"
|
385 |
+
|
386 |
+
|
387 |
+
def main():
|
388 |
+
"""
|
389 |
+
Main function to orchestrate the performance data generation process.
|
390 |
+
|
391 |
+
This function performs the following steps:
|
392 |
+
1. Downloads benchmark data if requested.
|
393 |
+
2. Fetches evaluation data for various datasets.
|
394 |
+
3. Processes benchmark files and summary files.
|
395 |
+
4. Calculates and saves performance and support results.
|
396 |
+
"""
|
397 |
+
source_xcresult_repo = "argmaxinc/whisperkit-evals-dataset"
|
398 |
+
source_xcresult_subfolder = "benchmark_data/"
|
399 |
+
source_xcresult_directory = f"{source_xcresult_repo}/{source_xcresult_subfolder}"
|
400 |
+
if len(sys.argv) > 1 and sys.argv[1] == "download":
|
401 |
+
try:
|
402 |
+
shutil.rmtree(source_xcresult_repo)
|
403 |
+
except OSError:
|
404 |
+
print("Nothing to remove.")
|
405 |
+
download_dataset(
|
406 |
+
source_xcresult_repo, source_xcresult_repo, source_xcresult_subfolder
|
407 |
+
)
|
408 |
+
|
409 |
+
datasets = {
|
410 |
+
"Earnings-22": "https://huggingface.co/datasets/argmaxinc/whisperkit-evals/resolve/main/WhisperOpenAIAPI/openai_whisper-large-v2/earnings22/2024-03-04_13%3A39%3A42_GMT-0800.json",
|
411 |
+
"LibriSpeech": "https://huggingface.co/datasets/argmaxinc/whisperkit-evals/resolve/main/WhisperOpenAIAPI/openai_whisper-large-v2/librispeech/2024-02-28_18%3A45%3A02_GMT-0800.json?download=true",
|
412 |
+
"earnings22-10mins": "https://huggingface.co/datasets/argmaxinc/whisperkit-evals/resolve/main/WhisperOpenAIAPI/openai_whisper-large-v2/earnings22/2024-03-04_13%3A39%3A42_GMT-0800.json",
|
413 |
+
"librispeech-10mins": "https://huggingface.co/datasets/argmaxinc/whisperkit-evals/resolve/main/WhisperOpenAIAPI/openai_whisper-large-v2/librispeech/2024-02-28_18%3A45%3A02_GMT-0800.json?download=true",
|
414 |
+
"earnings22-12hours": "https://huggingface.co/datasets/argmaxinc/whisperkit-evals/resolve/main/WhisperOpenAIAPI/openai_whisper-large-v2/earnings22/2024-03-04_13%3A39%3A42_GMT-0800.json",
|
415 |
+
"librispeech": "https://huggingface.co/datasets/argmaxinc/whisperkit-evals/resolve/main/WhisperOpenAIAPI/openai_whisper-large-v2/librispeech/2024-02-28_18%3A45%3A02_GMT-0800.json?download=true",
|
416 |
+
}
|
417 |
+
|
418 |
+
dataset_dfs = {}
|
419 |
+
for dataset_name, url in datasets.items():
|
420 |
+
evals = fetch_evaluation_data(url)
|
421 |
+
dataset_dfs[dataset_name] = pd.json_normalize(evals["results"])
|
422 |
+
|
423 |
+
performance_results = defaultdict(
|
424 |
+
lambda: {
|
425 |
+
"average_wer": [],
|
426 |
+
"dataset_wer": defaultdict(list),
|
427 |
+
"qoi": [],
|
428 |
+
"speed": {"inputAudioSeconds": 0, "fullPipeline": 0},
|
429 |
+
"tokens_per_second": {"totalDecodingLoops": 0, "fullPipeline": 0},
|
430 |
+
"dataset_speed": defaultdict(
|
431 |
+
lambda: {"inputAudioSeconds": 0, "fullPipeline": 0}
|
432 |
+
),
|
433 |
+
"dataset_tokens_per_second": defaultdict(
|
434 |
+
lambda: {"totalDecodingLoops": 0, "fullPipeline": 0}
|
435 |
+
),
|
436 |
+
"timestamp": None,
|
437 |
+
"commit_hash": None,
|
438 |
+
"commit_timestamp": None,
|
439 |
+
}
|
440 |
+
)
|
441 |
+
|
442 |
+
support_results = {"modelsTested": set(), "devices": set()}
|
443 |
+
|
444 |
+
generate_device_map(source_xcresult_directory)
|
445 |
+
|
446 |
+
for subdir, _, files in os.walk(source_xcresult_directory):
|
447 |
+
for filename in files:
|
448 |
+
file_path = os.path.join(subdir, filename)
|
449 |
+
if not filename.endswith(".json"):
|
450 |
+
continue
|
451 |
+
elif "summary" in filename:
|
452 |
+
process_summary_file(file_path, support_results)
|
453 |
+
else:
|
454 |
+
process_benchmark_file(file_path, dataset_dfs, performance_results)
|
455 |
+
|
456 |
+
not_supported = calculate_and_save_performance_results(
|
457 |
+
performance_results, "dashboard_data/performance_data.json"
|
458 |
+
)
|
459 |
+
calculate_and_save_support_results(
|
460 |
+
support_results, not_supported, "dashboard_data/support_data.csv"
|
461 |
+
)
|
462 |
+
|
463 |
+
|
464 |
+
if __name__ == "__main__":
|
465 |
+
main()
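
The nested `defaultdict` accumulator that `main()` builds can be tried in isolation. The following is a minimal sketch rather than the repository's code: the `make_results` helper, the example keys, and the numbers are hypothetical, and only a subset of the fields from `performance_results` is reproduced. It shows that the `lambda` factory gives every new key its own fresh entry, so counters for different models never share state.

```python
from collections import defaultdict


def make_results():
    """Build an accumulator shaped like a subset of `performance_results` (hypothetical helper)."""
    return defaultdict(
        lambda: {
            "qoi": [],
            "speed": {"inputAudioSeconds": 0, "fullPipeline": 0},
            "dataset_speed": defaultdict(
                lambda: {"inputAudioSeconds": 0, "fullPipeline": 0}
            ),
        }
    )


results = make_results()

# Hypothetical keys, for illustration only.
first_key = "example-model-a"
second_key = "example-model-b"

# The first access to a missing key creates a default entry via the lambda.
entry = results[first_key]
entry["speed"]["inputAudioSeconds"] += 30.0
entry["speed"]["fullPipeline"] += 6.0
entry["dataset_speed"]["librispeech"]["inputAudioSeconds"] += 30.0
entry["qoi"].append(1)

# A different key gets an independent, untouched entry.
other = results[second_key]
print(entry["speed"], other["speed"])
```

Using a factory function instead of a shared default value matters here: a single shared dict would be mutated by every key that touched it.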
quality_generate.py
ADDED
@@ -0,0 +1,186 @@
import json
import os
import shutil
import sys
from collections import defaultdict
from statistics import mean

import pandas as pd
import requests

from text_normalizer import text_normalizer
from utils import compute_average_wer, download_dataset


def fetch_evaluation_data(url):
    """
    Fetches evaluation data from the given URL.
    :param url: The URL to fetch the evaluation data from.
    :returns: The evaluation data as a dictionary.
    :raises: SystemExit (via sys.exit) if the request fails.
    """
    response = requests.get(url)
    if response.status_code == 200:
        return json.loads(response.text)
    else:
        sys.exit(f"Failed to fetch WhisperKit evals: {response.text}")


def get_device_name(device):
    """
    Gets the device name from the device map if it exists.
    :param device: String representing the device name.
    :returns: The device name from the device map if it exists, otherwise the input device name.
    """
    with open("dashboard_data/device_map.json", "r") as f:
        device_map = json.load(f)
    return device_map.get(device, device).replace(" ", "_")


def process_quality_file(file_path, dataset_dfs, quality_results):
    """
    Processes a single quality file and updates the quality_results dictionary.

    :param file_path: Path to the quality JSON file.
    :param dataset_dfs: Dictionary of DataFrames containing dataset information.
    :param quality_results: Dictionary to store the processed quality results.

    This function reads a quality JSON file, extracts relevant information,
    and updates the quality_results dictionary with various metrics including WER
    and Quality of Inference (QoI) for different datasets.
    """
    with open(file_path, "r") as file:
        test_results = json.load(file)

    if len(test_results) == 0:
        return

    metadata = test_results["metadata"]
    test_results = test_results["results"]
    model = file_path.split("/")[-3].replace("_", "/")
    device = metadata["inference_context"]["device_spec"]["product_name"]
    device = get_device_name(device)
    timestamp = file_path.split("/")[-1].split(".")[0]
    key = model
    dataset_name = metadata["dataset_name"]

    for test_result in test_results:
        audio_file_name = test_result["file"]

        dataset_key = "Earnings-22" if "earnings22" in dataset_name else "LibriSpeech"
        dataset_df = dataset_dfs[dataset_key]

        wer_entry = {
            "prediction": text_normalizer(test_result["prediction"]),
            "reference": text_normalizer(test_result["reference"]),
        }
        quality_results[key]["timestamp"] = timestamp
        quality_results[key]["dataset_wer"][dataset_name].append(wer_entry)

        # Look up the reference model's WER for the same audio file.
        audio = audio_file_name.split(".")[0]
        dataset_row = dataset_df.loc[dataset_df["file"].str.contains(audio)].iloc[0]
        reference_wer = dataset_row["wer"]
        prediction_wer = test_result["wer"]

        # QoI counts a file as a success when the WER is no worse than the reference.
        quality_results[key]["qoi"].append(1 if prediction_wer <= reference_wer else 0)


def calculate_and_save_quality_results(quality_results, quality_output_path):
    """
    Calculates final quality metrics and saves them to a JSON file.

    :param quality_results: Dictionary containing raw quality data.
    :param quality_output_path: Path to save the processed quality results.

    This function processes the raw quality data, calculates average metrics,
    and writes the final results to a JSON file, with each entry representing
    a unique model's quality metrics across different datasets, including
    Word Error Rate (WER) and Quality of Inference (QoI).
    """
    with open(quality_output_path, "w") as quality_file:
        for key, data in quality_results.items():
            model = key

            dataset_wers = {
                dataset: compute_average_wer(wer)
                for dataset, wer in data["dataset_wer"].items()
            }
            average_wer = (
                sum(dataset_wers.values()) / len(dataset_wers)
                if len(dataset_wers) != 0
                else 0
            )

            quality_entry = {
                "model": model.replace("_", "/"),
                "timestamp": data["timestamp"],
                "average_wer": round(average_wer, 2),
                "dataset_wer": dataset_wers,
                "qoi": round(mean(data["qoi"]), 2),
            }

            # One JSON object per line (JSON Lines).
            json.dump(quality_entry, quality_file)
            quality_file.write("\n")


def main():
    """
    Main function to orchestrate the quality data generation process.

    This function performs the following steps:
    1. Downloads quality data if requested.
    2. Fetches evaluation data for various datasets.
    3. Processes quality files for specific datasets.
    4. Calculates and saves quality results, including WER and QoI metrics.
    """
    if len(sys.argv) > 1 and sys.argv[1] == "download":
        try:
            shutil.rmtree("english")
        except FileNotFoundError:
            # Only a missing directory is expected here; other errors should surface.
            print("Nothing to remove.")
        download_dataset("argmaxinc/whisperkit-evals", "english", "WhisperKit")

    datasets = {
        "Earnings-22": "https://huggingface.co/datasets/argmaxinc/whisperkit-evals/resolve/main/WhisperOpenAIAPI/openai_whisper-large-v2/earnings22/2024-03-04_13%3A39%3A42_GMT-0800.json",
        "LibriSpeech": "https://huggingface.co/datasets/argmaxinc/whisperkit-evals/resolve/main/WhisperOpenAIAPI/openai_whisper-large-v2/librispeech/2024-02-28_18%3A45%3A02_GMT-0800.json?download=true",
        "earnings22-10mins": "https://huggingface.co/datasets/argmaxinc/whisperkit-evals/resolve/main/WhisperOpenAIAPI/openai_whisper-large-v2/earnings22/2024-03-04_13%3A39%3A42_GMT-0800.json",
        "librispeech-10mins": "https://huggingface.co/datasets/argmaxinc/whisperkit-evals/resolve/main/WhisperOpenAIAPI/openai_whisper-large-v2/librispeech/2024-02-28_18%3A45%3A02_GMT-0800.json?download=true",
        "earnings22-12hours": "https://huggingface.co/datasets/argmaxinc/whisperkit-evals/resolve/main/WhisperOpenAIAPI/openai_whisper-large-v2/earnings22/2024-03-04_13%3A39%3A42_GMT-0800.json",
        "librispeech": "https://huggingface.co/datasets/argmaxinc/whisperkit-evals/resolve/main/WhisperOpenAIAPI/openai_whisper-large-v2/librispeech/2024-02-28_18%3A45%3A02_GMT-0800.json?download=true",
    }

    dataset_dfs = {}
    for dataset_name, url in datasets.items():
        evals = fetch_evaluation_data(url)
        dataset_dfs[dataset_name] = pd.json_normalize(evals["results"])

    source_quality_directory = "argmaxinc/english/WhisperKit/"

    quality_results = defaultdict(
        lambda: {
            "average_wer": [],
            "dataset_wer": defaultdict(list),
            "qoi": [],
            "timestamp": None,
        }
    )

    for subdir, _, files in os.walk(source_quality_directory):
        dataset = subdir.split("/")[-1]
        if dataset not in ["earnings22-12hours", "librispeech"]:
            continue

        for filename in files:
            if not filename.endswith(".json"):
                continue

            file_path = os.path.join(subdir, filename)
            process_quality_file(file_path, dataset_dfs, quality_results)

    calculate_and_save_quality_results(
        quality_results, "dashboard_data/quality_data.json"
    )


if __name__ == "__main__":
    main()
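
The Quality of Inference (QoI) metric used in `process_quality_file` and `calculate_and_save_quality_results` reduces to a per-file comparison followed by a mean. The sketch below illustrates that arithmetic only; the `qoi_flags` helper and the WER values are hypothetical and not taken from the evaluation data.

```python
from statistics import mean


def qoi_flags(prediction_wers, reference_wers):
    """Return 1 for each file whose WER is no worse than the reference, else 0."""
    return [1 if p <= r else 0 for p, r in zip(prediction_wers, reference_wers)]


# Hypothetical per-file WER values, for illustration only.
prediction_wers = [0.10, 0.25, 0.05, 0.40]
reference_wers = [0.12, 0.20, 0.05, 0.35]

flags = qoi_flags(prediction_wers, reference_wers)

# The saved metric is the share of files that pass, rounded to two decimals.
qoi = round(mean(flags), 2)
print(flags, qoi)
```

Note that a tie counts as a pass because the comparison is `<=`, which matches the condition in `process_quality_file`.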
requirements.txt
ADDED
@@ -0,0 +1,122 @@
aiofiles
aiohttp
aiosignal
altair
annotated-types
anyio
argmax_gradio_components
async-timeout
attrs
backports.tarfile
build
certifi
cffi
cfgv
charset-normalizer
click
contourpy
cycler
datasets
dill
distlib
dnspython
docutils
email_validator
exceptiongroup
fastapi
fastapi-cli
ffmpy
filelock
fonttools
frozenlist
fsspec
gradio==5.0.1
h11
httpcore
httptools
httpx
huggingface-hub
identify
idna
importlib_metadata
importlib_resources
jaraco.classes
jaraco.context
jaraco.functools
Jinja2
jsonschema
jsonschema-specifications
keyring
kiwisolver
markdown-it-py
MarkupSafe
matplotlib
mdurl
more-itertools
multidict
multiprocess
nh3
nodeenv
numpy
orjson
packaging
pandas
pillow
pkginfo
platformdirs
plotly
pre-commit
pyarrow
pyarrow-hotfix
pycparser
pydantic
pydantic_core
pydub
Pygments
pyparsing
pyproject_hooks
python-dateutil
python-dotenv
python-multipart
pytz
PyYAML
readme_renderer
referencing
requests
requests-toolbelt
rfc3986
rich
rpds-py
ruff
scipy
scikit-learn
semantic-version
shellingham
six
sniffio
soundfile
starlette
tenacity
text_normalizer
tomli
tomlkit
toolz
tqdm
twine
typer
typing_extensions
tzdata
ujson
urllib3
uvicorn
uvloop
virtualenv
watchfiles
websockets
xxhash
yarl
zipp
iso639-lang
evaluate
jiwer
regex
static/Zwizz-Medium.woff
ADDED
Binary file (28.7 kB).
static/Zwizz-Regular.woff
ADDED
Binary file (28.4 kB).
static/Zwizz-SemiBold.woff
ADDED
Binary file (28.7 kB).
text_normalizer.py
ADDED
@@ -0,0 +1,2374 @@
1 |
+
# Copyright 2022 The OpenAI team and The HuggingFace Team. All rights reserved.
|
2 |
+
# Most of the code is copy pasted from the original whisper repository
|
3 |
+
#
|
4 |
+
# Licensed under the Apache License, Version 2.0 (the "License");
|
5 |
+
# you may not use this file except in compliance with the License.
|
6 |
+
# You may obtain a copy of the License at
|
7 |
+
#
|
8 |
+
# http://www.apache.org/licenses/LICENSE-2.0
|
9 |
+
#
|
10 |
+
# Unless required by applicable law or agreed to in writing, software
|
11 |
+
# distributed under the License is distributed on an "AS IS" BASIS,
|
12 |
+
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
13 |
+
# See the License for the specific language governing permissions and
|
14 |
+
# limitations under the License.
|
15 |
+
|
16 |
+
import re
|
17 |
+
import unicodedata
|
18 |
+
from fractions import Fraction
|
19 |
+
from typing import Iterator, List, Match, Optional, Union
|
20 |
+
|
21 |
+
import regex
|
22 |
+
|
23 |
+
abbr = {
|
24 |
+
"accessorise": "accessorize",
|
25 |
+
"accessorised": "accessorized",
|
26 |
+
"accessorises": "accessorizes",
|
27 |
+
"accessorising": "accessorizing",
|
28 |
+
"acclimatisation": "acclimatization",
|
29 |
+
"acclimatise": "acclimatize",
|
30 |
+
"acclimatised": "acclimatized",
|
31 |
+
"acclimatises": "acclimatizes",
|
32 |
+
"acclimatising": "acclimatizing",
|
33 |
+
"accoutrements": "accouterments",
|
34 |
+
"aeon": "eon",
|
35 |
+
"aeons": "eons",
|
36 |
+
"aerogramme": "aerogram",
|
37 |
+
"aerogrammes": "aerograms",
|
38 |
+
"aeroplane": "airplane",
|
39 |
+
"aeroplanes": "airplanes",
|
40 |
+
"aesthete": "esthete",
|
41 |
+
"aesthetes": "esthetes",
|
42 |
+
"aesthetic": "esthetic",
|
43 |
+
"aesthetically": "esthetically",
|
44 |
+
"aesthetics": "esthetics",
|
45 |
+
"aetiology": "etiology",
|
46 |
+
"ageing": "aging",
|
47 |
+
"aggrandisement": "aggrandizement",
|
48 |
+
"agonise": "agonize",
|
49 |
+
"agonised": "agonized",
|
50 |
+
"agonises": "agonizes",
|
51 |
+
"agonising": "agonizing",
|
52 |
+
"agonisingly": "agonizingly",
|
53 |
+
"almanack": "almanac",
|
54 |
+
"almanacks": "almanacs",
|
55 |
+
"aluminium": "aluminum",
|
56 |
+
"amortisable": "amortizable",
|
57 |
+
"amortisation": "amortization",
|
58 |
+
"amortisations": "amortizations",
|
59 |
+
"amortise": "amortize",
|
60 |
+
"amortised": "amortized",
|
61 |
+
"amortises": "amortizes",
|
62 |
+
"amortising": "amortizing",
|
63 |
+
"amphitheatre": "amphitheater",
|
64 |
+
"amphitheatres": "amphitheaters",
|
65 |
+
"anaemia": "anemia",
|
66 |
+
"anaemic": "anemic",
|
67 |
+
"anaesthesia": "anesthesia",
|
68 |
+
"anaesthetic": "anesthetic",
|
69 |
+
"anaesthetics": "anesthetics",
|
70 |
+
"anaesthetise": "anesthetize",
|
71 |
+
"anaesthetised": "anesthetized",
|
72 |
+
"anaesthetises": "anesthetizes",
|
73 |
+
"anaesthetising": "anesthetizing",
|
74 |
+
"anaesthetist": "anesthetist",
|
75 |
+
"anaesthetists": "anesthetists",
|
76 |
+
"anaesthetize": "anesthetize",
|
77 |
+
"anaesthetized": "anesthetized",
|
78 |
+
"anaesthetizes": "anesthetizes",
|
79 |
+
"anaesthetizing": "anesthetizing",
|
80 |
+
"analogue": "analog",
|
81 |
+
"analogues": "analogs",
|
82 |
+
"analyse": "analyze",
|
83 |
+
"analysed": "analyzed",
|
84 |
+
"analyses": "analyzes",
|
85 |
+
"analysing": "analyzing",
|
86 |
+
"anglicise": "anglicize",
|
87 |
+
"anglicised": "anglicized",
|
88 |
+
"anglicises": "anglicizes",
|
89 |
+
"anglicising": "anglicizing",
|
90 |
+
"annualised": "annualized",
|
91 |
+
"antagonise": "antagonize",
|
92 |
+
"antagonised": "antagonized",
|
93 |
+
"antagonises": "antagonizes",
|
94 |
+
"antagonising": "antagonizing",
|
95 |
+
"apologise": "apologize",
|
96 |
+
"apologised": "apologized",
|
97 |
+
"apologises": "apologizes",
|
98 |
+
"apologising": "apologizing",
|
99 |
+
"appal": "appall",
|
100 |
+
"appals": "appalls",
|
101 |
+
"appetiser": "appetizer",
|
102 |
+
"appetisers": "appetizers",
|
103 |
+
"appetising": "appetizing",
|
104 |
+
"appetisingly": "appetizingly",
|
105 |
+
"arbour": "arbor",
|
106 |
+
"arbours": "arbors",
|
107 |
+
"archaeologically": "archeologically",
|
108 |
+
"archaeologist": "archeologist",
|
109 |
+
"archaeologists": "archeologists",
|
110 |
+
"archaeology": "archeology</span>",
|
111 |
+
"archeological": "archaeological",
|
112 |
+
"ardour": "ardor",
|
113 |
+
"armour": "armor",
|
114 |
+
"armoured": "armored",
|
115 |
+
"armourer": "armorer",
|
116 |
+
"armourers": "armorers",
|
117 |
+
"armouries": "armories",
|
118 |
+
"armoury": "armory",
|
119 |
+
"artefact": "artifact",
|
120 |
+
"artefacts": "artifacts",
|
121 |
+
"authorise": "authorize",
|
122 |
+
"authorised": "authorized",
|
123 |
+
"authorises": "authorizes",
|
124 |
+
"authorising": "authorizing",
|
125 |
+
"axe": "ax",
|
126 |
+
"backpedalled": "backpedaled",
|
127 |
+
"backpedalling": "backpedaling",
|
128 |
+
"bannister": "banister",
|
129 |
+
"bannisters": "banisters",
|
130 |
+
"baptise": "baptize",
|
131 |
+
"baptised": "baptized",
|
132 |
+
"baptises": "baptizes",
|
133 |
+
"baptising": "baptizing",
|
134 |
+
"bastardise": "bastardize",
|
135 |
+
"bastardised": "bastardized",
|
136 |
+
"bastardises": "bastardizes",
|
137 |
+
"bastardising": "bastardizing",
|
138 |
+
"battleax": "battleaxe",
|
139 |
+
"baulk": "balk",
|
140 |
+
"baulked": "balked",
|
141 |
+
"baulking": "balking",
|
142 |
+
"baulks": "balks",
|
143 |
+
"bedevilled": "bedeviled",
|
144 |
+
"bedevilling": "bedeviling",
|
145 |
+
"behaviour": "behavior",
|
146 |
+
"behavioural": "behavioral",
|
147 |
+
"behaviourism": "behaviorism",
|
148 |
+
"behaviourist": "behaviorist",
|
149 |
+
"behaviourists": "behaviorists",
|
150 |
+
"behaviours": "behaviors",
|
151 |
+
"behove": "behoove",
|
152 |
+
"behoved": "behooved",
|
153 |
+
"behoves": "behooves",
|
154 |
+
"bejewelled": "bejeweled",
|
155 |
+
"belabour": "belabor",
|
156 |
+
"belaboured": "belabored",
|
157 |
+
"belabouring": "belaboring",
|
158 |
+
"belabours": "belabors",
|
159 |
+
"bevelled": "beveled",
|
160 |
+
"bevvies": "bevies",
|
161 |
+
"bevvy": "bevy",
|
162 |
+
"biassed": "biased",
|
163 |
+
"biassing": "biasing",
|
164 |
+
"bingeing": "binging",
|
165 |
+
"bougainvillaea": "bougainvillea",
|
166 |
+
"bougainvillaeas": "bougainvilleas",
|
167 |
+
"bowdlerise": "bowdlerize",
|
168 |
+
"bowdlerised": "bowdlerized",
|
169 |
+
"bowdlerises": "bowdlerizes",
|
170 |
+
"bowdlerising": "bowdlerizing",
|
171 |
+
"breathalyse": "breathalyze",
|
172 |
+
"breathalysed": "breathalyzed",
"breathalyser": "breathalyzer",
"breathalysers": "breathalyzers",
"breathalyses": "breathalyzes",
"breathalysing": "breathalyzing",
"brutalise": "brutalize",
"brutalised": "brutalized",
"brutalises": "brutalizes",
"brutalising": "brutalizing",
"busses": "buses",
"bussing": "busing",
"caesarean": "cesarean",
"caesareans": "cesareans",
"calibre": "caliber",
"calibres": "calibers",
"calliper": "caliper",
"callipers": "calipers",
"callisthenics": "calisthenics",
"canalise": "canalize",
"canalised": "canalized",
"canalises": "canalizes",
"canalising": "canalizing",
"cancelation": "cancellation",
"cancelations": "cancellations",
"cancelled": "canceled",
"cancelling": "canceling",
"candour": "candor",
"cannibalise": "cannibalize",
"cannibalised": "cannibalized",
"cannibalises": "cannibalizes",
"cannibalising": "cannibalizing",
"canonise": "canonize",
"canonised": "canonized",
"canonises": "canonizes",
"canonising": "canonizing",
"capitalise": "capitalize",
"capitalised": "capitalized",
"capitalises": "capitalizes",
"capitalising": "capitalizing",
"caramelise": "caramelize",
"caramelised": "caramelized",
"caramelises": "caramelizes",
"caramelising": "caramelizing",
"carbonise": "carbonize",
"carbonised": "carbonized",
"carbonises": "carbonizes",
"carbonising": "carbonizing",
"carolled": "caroled",
"carolling": "caroling",
"catalogue": "catalog",
"catalogued": "cataloged",
"catalogues": "catalogs",
"cataloguing": "cataloging",
"catalyse": "catalyze",
"catalysed": "catalyzed",
"catalyses": "catalyzes",
"catalysing": "catalyzing",
"categorise": "categorize",
"categorised": "categorized",
"categorises": "categorizes",
"categorising": "categorizing",
"cauterise": "cauterize",
"cauterised": "cauterized",
"cauterises": "cauterizes",
"cauterising": "cauterizing",
"cavilled": "caviled",
"cavilling": "caviling",
"centigramme": "centigram",
"centigrammes": "centigrams",
"centilitre": "centiliter",
"centilitres": "centiliters",
"centimetre": "centimeter",
"centimetres": "centimeters",
"centralise": "centralize",
"centralised": "centralized",
"centralises": "centralizes",
"centralising": "centralizing",
"centre": "center",
"centred": "centered",
"centrefold": "centerfold",
"centrefolds": "centerfolds",
"centrepiece": "centerpiece",
"centrepieces": "centerpieces",
"centres": "centers",
"channelled": "channeled",
"channelling": "channeling",
"characterise": "characterize",
"characterised": "characterized",
"characterises": "characterizes",
"characterising": "characterizing",
"cheque": "check",
"chequebook": "checkbook",
"chequebooks": "checkbooks",
"chequered": "checkered",
"cheques": "checks",
"chilli": "chili",
"chimaera": "chimera",
"chimaeras": "chimeras",
"chiselled": "chiseled",
"chiselling": "chiseling",
"circularise": "circularize",
"circularised": "circularized",
"circularises": "circularizes",
"circularising": "circularizing",
"civilise": "civilize",
"civilised": "civilized",
"civilises": "civilizes",
"civilising": "civilizing",
"clamour": "clamor",
"clamoured": "clamored",
"clamouring": "clamoring",
"clamours": "clamors",
"clangour": "clangor",
"clarinettist": "clarinetist",
"clarinettists": "clarinetists",
"collectivise": "collectivize",
"collectivised": "collectivized",
"collectivises": "collectivizes",
"collectivising": "collectivizing",
"colonisation": "colonization",
"colonise": "colonize",
"colonised": "colonized",
"coloniser": "colonizer",
"colonisers": "colonizers",
"colonises": "colonizes",
"colonising": "colonizing",
"colour": "color",
"colourant": "colorant",
"colourants": "colorants",
"coloured": "colored",
"coloureds": "coloreds",
"colourful": "colorful",
"colourfully": "colorfully",
"colouring": "coloring",
"colourize": "colorize",
"colourized": "colorized",
"colourizes": "colorizes",
"colourizing": "colorizing",
"colourless": "colorless",
"colours": "colors",
"commercialise": "commercialize",
"commercialised": "commercialized",
"commercialises": "commercializes",
"commercialising": "commercializing",
"compartmentalise": "compartmentalize",
"compartmentalised": "compartmentalized",
"compartmentalises": "compartmentalizes",
"compartmentalising": "compartmentalizing",
"computerise": "computerize",
"computerised": "computerized",
"computerises": "computerizes",
"computerising": "computerizing",
"conceptualise": "conceptualize",
"conceptualised": "conceptualized",
"conceptualises": "conceptualizes",
"conceptualising": "conceptualizing",
"connexion": "connection",
"connexions": "connections",
"contextualise": "contextualize",
"contextualised": "contextualized",
"contextualises": "contextualizes",
"contextualising": "contextualizing",
"cosier": "cozier",
"cosies": "cozies",
"cosiest": "coziest",
"cosily": "cozily",
"cosiness": "coziness",
"cosy": "cozy",
"councillor": "councilor",
"councillors": "councilors",
"counselled": "counseled",
"counselling": "counseling",
"counsellor": "counselor",
"counsellors": "counselors",
"crenelated": "crenellated",
"criminalise": "criminalize",
"criminalised": "criminalized",
"criminalises": "criminalizes",
"criminalising": "criminalizing",
"criticise": "criticize",
"criticised": "criticized",
"criticises": "criticizes",
"criticising": "criticizing",
"crueller": "crueler",
"cruellest": "cruelest",
"crystallisation": "crystallization",
"crystallise": "crystallize",
"crystallised": "crystallized",
"crystallises": "crystallizes",
"crystallising": "crystallizing",
"cudgelled": "cudgeled",
"cudgelling": "cudgeling",
"customise": "customize",
"customised": "customized",
"customises": "customizes",
"customising": "customizing",
"cypher": "cipher",
"cyphers": "ciphers",
"decentralisation": "decentralization",
"decentralise": "decentralize",
"decentralised": "decentralized",
"decentralises": "decentralizes",
"decentralising": "decentralizing",
"decriminalisation": "decriminalization",
"decriminalise": "decriminalize",
"decriminalised": "decriminalized",
"decriminalises": "decriminalizes",
"decriminalising": "decriminalizing",
"defence": "defense",
"defenceless": "defenseless",
"defences": "defenses",
"dehumanisation": "dehumanization",
"dehumanise": "dehumanize",
"dehumanised": "dehumanized",
"dehumanises": "dehumanizes",
"dehumanising": "dehumanizing",
"demeanour": "demeanor",
"demilitarisation": "demilitarization",
"demilitarise": "demilitarize",
"demilitarised": "demilitarized",
"demilitarises": "demilitarizes",
"demilitarising": "demilitarizing",
"demobilisation": "demobilization",
"demobilise": "demobilize",
"demobilised": "demobilized",
"demobilises": "demobilizes",
"demobilising": "demobilizing",
"democratisation": "democratization",
"democratise": "democratize",
"democratised": "democratized",
"democratises": "democratizes",
"democratising": "democratizing",
"demonise": "demonize",
"demonised": "demonized",
"demonises": "demonizes",
"demonising": "demonizing",
"demoralisation": "demoralization",
"demoralise": "demoralize",
"demoralised": "demoralized",
"demoralises": "demoralizes",
"demoralising": "demoralizing",
"denationalisation": "denationalization",
"denationalise": "denationalize",
"denationalised": "denationalized",
"denationalises": "denationalizes",
"denationalising": "denationalizing",
"deodorise": "deodorize",
"deodorised": "deodorized",
"deodorises": "deodorizes",
"deodorising": "deodorizing",
"depersonalise": "depersonalize",
"depersonalised": "depersonalized",
"depersonalises": "depersonalizes",
"depersonalising": "depersonalizing",
"deputise": "deputize",
"deputised": "deputized",
"deputises": "deputizes",
"deputising": "deputizing",
"desensitisation": "desensitization",
"desensitise": "desensitize",
"desensitised": "desensitized",
"desensitises": "desensitizes",
"desensitising": "desensitizing",
"destabilisation": "destabilization",
"destabilise": "destabilize",
"destabilised": "destabilized",
"destabilises": "destabilizes",
"destabilising": "destabilizing",
"dialled": "dialed",
"dialling": "dialing",
"dialogue": "dialog",
"dialogues": "dialogs",
"diarrhoea": "diarrhea",
"digitise": "digitize",
"digitised": "digitized",
"digitises": "digitizes",
"digitising": "digitizing",
"disc": "disk",
"discolour": "discolor",
"discoloured": "discolored",
"discolouring": "discoloring",
"discolours": "discolors",
"discs": "disks",
"disembowelled": "disemboweled",
"disembowelling": "disemboweling",
"disfavour": "disfavor",
"dishevelled": "disheveled",
"dishonour": "dishonor",
"dishonourable": "dishonorable",
"dishonourably": "dishonorably",
"dishonoured": "dishonored",
"dishonouring": "dishonoring",
"dishonours": "dishonors",
"disorganisation": "disorganization",
"disorganised": "disorganized",
"distil": "distill",
"distils": "distills",
"dramatisation": "dramatization",
"dramatisations": "dramatizations",
"dramatise": "dramatize",
"dramatised": "dramatized",
"dramatises": "dramatizes",
"dramatising": "dramatizing",
"draught": "draft",
"draughtboard": "draftboard",
"draughtboards": "draftboards",
"draughtier": "draftier",
"draughtiest": "draftiest",
"draughts": "drafts",
"draughtsman": "draftsman",
"draughtsmanship": "draftsmanship",
"draughtsmen": "draftsmen",
"draughtswoman": "draftswoman",
"draughtswomen": "draftswomen",
"draughty": "drafty",
"drivelled": "driveled",
"drivelling": "driveling",
"duelled": "dueled",
"duelling": "dueling",
"economise": "economize",
"economised": "economized",
"economises": "economizes",
"economising": "economizing",
"editorialise": "editorialize",
"editorialised": "editorialized",
"editorialises": "editorializes",
"editorialising": "editorializing",
"edoema": "edema",
"empathise": "empathize",
"empathised": "empathized",
"empathises": "empathizes",
"empathising": "empathizing",
"emphasise": "emphasize",
"emphasised": "emphasized",
"emphasises": "emphasizes",
"emphasising": "emphasizing",
"enamelled": "enameled",
"enamelling": "enameling",
"enamoured": "enamored",
"encyclopaedia": "encyclopedia",
"encyclopaedias": "encyclopedias",
"encyclopaedic": "encyclopedic",
"endeavour": "endeavor",
"endeavoured": "endeavored",
"endeavouring": "endeavoring",
"endeavours": "endeavors",
"energise": "energize",
"energised": "energized",
"energises": "energizes",
"energising": "energizing",
"enrol": "enroll",
"enrols": "enrolls",
"enthral": "enthrall",
"enthrals": "enthralls",
"epaulette": "epaulet",
"epaulettes": "epaulets",
"epicentre": "epicenter",
"epicentres": "epicenters",
"epilogue": "epilog",
"epilogues": "epilogs",
"epitomise": "epitomize",
"epitomised": "epitomized",
"epitomises": "epitomizes",
"epitomising": "epitomizing",
"equalisation": "equalization",
"equalise": "equalize",
"equalised": "equalized",
"equaliser": "equalizer",
"equalisers": "equalizers",
"equalises": "equalizes",
"equalising": "equalizing",
"eulogise": "eulogize",
"eulogised": "eulogized",
"eulogises": "eulogizes",
"eulogising": "eulogizing",
"evangelise": "evangelize",
"evangelised": "evangelized",
"evangelises": "evangelizes",
"evangelising": "evangelizing",
"exorcise": "exorcize",
"exorcised": "exorcized",
"exorcises": "exorcizes",
"exorcising": "exorcizing",
"extemporisation": "extemporization",
"extemporise": "extemporize",
"extemporised": "extemporized",
"extemporises": "extemporizes",
"extemporising": "extemporizing",
"externalisation": "externalization",
"externalisations": "externalizations",
"externalise": "externalize",
"externalised": "externalized",
"externalises": "externalizes",
"externalising": "externalizing",
"factorise": "factorize",
"factorised": "factorized",
"factorises": "factorizes",
"factorising": "factorizing",
"faecal": "fecal",
"faeces": "feces",
"familiarisation": "familiarization",
"familiarise": "familiarize",
"familiarised": "familiarized",
"familiarises": "familiarizes",
"familiarising": "familiarizing",
"fantasise": "fantasize",
"fantasised": "fantasized",
"fantasises": "fantasizes",
"fantasising": "fantasizing",
"favour": "favor",
"favourable": "favorable",
"favourably": "favorably",
"favoured": "favored",
"favouring": "favoring",
"favourite": "favorite",
"favourites": "favorites",
"favouritism": "favoritism",
"favours": "favors",
"feminise": "feminize",
"feminised": "feminized",
"feminises": "feminizes",
"feminising": "feminizing",
"fertilisation": "fertilization",
"fertilise": "fertilize",
"fertilised": "fertilized",
"fertiliser": "fertilizer",
"fertilisers": "fertilizers",
"fertilises": "fertilizes",
"fertilising": "fertilizing",
"fervour": "fervor",
"fibre": "fiber",
"fibreglass": "fiberglass",
"fibres": "fibers",
"fictionalisation": "fictionalization",
"fictionalisations": "fictionalizations",
"fictionalise": "fictionalize",
"fictionalised": "fictionalized",
"fictionalises": "fictionalizes",
"fictionalising": "fictionalizing",
"fillet": "filet",
"filleted": "fileted",
"filleting": "fileting",
"fillets": "filets",
"finalisation": "finalization",
"finalise": "finalize",
"finalised": "finalized",
"finalises": "finalizes",
"finalising": "finalizing",
"flautist": "flutist",
"flautists": "flutists",
"flavour": "flavor",
"flavoured": "flavored",
"flavouring": "flavoring",
"flavourings": "flavorings",
"flavourless": "flavorless",
"flavours": "flavors",
"flavoursome": "flavorsome",
"flyer / flier": "flier / flyer",
"foetal": "fetal",
"foetid": "fetid",
"foetus": "fetus",
"foetuses": "fetuses",
"formalisation": "formalization",
"formalise": "formalize",
"formalised": "formalized",
"formalises": "formalizes",
"formalising": "formalizing",
"fossilisation": "fossilization",
"fossilise": "fossilize",
"fossilised": "fossilized",
"fossilises": "fossilizes",
"fossilising": "fossilizing",
"fraternisation": "fraternization",
"fraternise": "fraternize",
"fraternised": "fraternized",
"fraternises": "fraternizes",
"fraternising": "fraternizing",
"fulfil": "fulfill",
"fulfilment": "fulfillment",
"fulfils": "fulfills",
"funnelled": "funneled",
"funnelling": "funneling",
"gage": "gauge",
"gaged": "gauged",
"gages": "gauges",
"gaging": "gauging",
"galvanise": "galvanize",
"galvanised": "galvanized",
"galvanises": "galvanizes",
"galvanising": "galvanizing",
"gambolled": "gamboled",
"gambolling": "gamboling",
"gaol": "jail",
"gaolbird": "jailbird",
"gaolbirds": "jailbirds",
"gaolbreak": "jailbreak",
"gaolbreaks": "jailbreaks",
"gaoled": "jailed",
"gaoler": "jailer",
"gaolers": "jailers",
"gaoling": "jailing",
"gaols": "jails",
"gasses": "gases",
"generalisation": "generalization",
"generalisations": "generalizations",
"generalise": "generalize",
"generalised": "generalized",
"generalises": "generalizes",
"generalising": "generalizing",
"ghettoise": "ghettoize",
"ghettoised": "ghettoized",
"ghettoises": "ghettoizes",
"ghettoising": "ghettoizing",
"gipsies": "gypsies",
"glamor": "glamour",
"glamorise": "glamorize",
"glamorised": "glamorized",
"glamorises": "glamorizes",
"glamorising": "glamorizing",
"globalisation": "globalization",
"globalise": "globalize",
"globalised": "globalized",
"globalises": "globalizes",
"globalising": "globalizing",
"glueing": "gluing",
"goitre": "goiter",
"goitres": "goiters",
"gonorrhoea": "gonorrhea",
"gramme": "gram",
"grammes": "grams",
"gravelled": "graveled",
"grey": "gray",
"greyed": "grayed",
"greying": "graying",
"greyish": "grayish",
"greyness": "grayness",
"greys": "grays",
"grovelled": "groveled",
"grovelling": "groveling",
"groyne": "groin",
"groynes": "groins",
"gruelling": "grueling",
"gruellingly": "gruelingly",
"gryphon": "griffin",
"gryphons": "griffins",
"gynaecological": "gynecological",
"gynaecologist": "gynecologist",
"gynaecologists": "gynecologists",
"gynaecology": "gynecology",
"haematological": "hematological",
"haematologist": "hematologist",
"haematologists": "hematologists",
"haematology": "hematology",
"haemoglobin": "hemoglobin",
"haemophilia": "hemophilia",
"haemophiliac": "hemophiliac",
"haemophiliacs": "hemophiliacs",
"haemorrhage": "hemorrhage",
"haemorrhaged": "hemorrhaged",
"haemorrhages": "hemorrhages",
"haemorrhaging": "hemorrhaging",
"haemorrhoids": "hemorrhoids",
"harbour": "harbor",
"harboured": "harbored",
"harbouring": "harboring",
"harbours": "harbors",
"harmonisation": "harmonization",
"harmonise": "harmonize",
"harmonised": "harmonized",
"harmonises": "harmonizes",
"harmonising": "harmonizing",
"homoeopath": "homeopath",
"homoeopathic": "homeopathic",
"homoeopaths": "homeopaths",
"homoeopathy": "homeopathy",
"homogenise": "homogenize",
"homogenised": "homogenized",
"homogenises": "homogenizes",
"homogenising": "homogenizing",
"honour": "honor",
"honourable": "honorable",
"honourably": "honorably",
"honoured": "honored",
"honouring": "honoring",
"honours": "honors",
"hospitalisation": "hospitalization",
"hospitalise": "hospitalize",
"hospitalised": "hospitalized",
"hospitalises": "hospitalizes",
"hospitalising": "hospitalizing",
"humanise": "humanize",
"humanised": "humanized",
"humanises": "humanizes",
"humanising": "humanizing",
"humour": "humor",
"humoured": "humored",
"humouring": "humoring",
"humourless": "humorless",
"humours": "humors",
"hybridise": "hybridize",
"hybridised": "hybridized",
"hybridises": "hybridizes",
"hybridising": "hybridizing",
"hypnotise": "hypnotize",
"hypnotised": "hypnotized",
"hypnotises": "hypnotizes",
"hypnotising": "hypnotizing",
"hypothesise": "hypothesize",
"hypothesised": "hypothesized",
"hypothesises": "hypothesizes",
"hypothesising": "hypothesizing",
"idealisation": "idealization",
"idealise": "idealize",
"idealised": "idealized",
"idealises": "idealizes",
"idealising": "idealizing",
"idolise": "idolize",
"idolised": "idolized",
"idolises": "idolizes",
"idolising": "idolizing",
"immobilisation": "immobilization",
"immobilise": "immobilize",
"immobilised": "immobilized",
"immobiliser": "immobilizer",
"immobilisers": "immobilizers",
"immobilises": "immobilizes",
"immobilising": "immobilizing",
"immortalise": "immortalize",
"immortalised": "immortalized",
"immortalises": "immortalizes",
"immortalising": "immortalizing",
"immunisation": "immunization",
"immunise": "immunize",
"immunised": "immunized",
"immunises": "immunizes",
"immunising": "immunizing",
"impanelled": "impaneled",
"impanelling": "impaneling",
"imperilled": "imperiled",
"imperilling": "imperiling",
"individualise": "individualize",
"individualised": "individualized",
"individualises": "individualizes",
"individualising": "individualizing",
"industrialise": "industrialize",
"industrialised": "industrialized",
"industrialises": "industrializes",
"industrialising": "industrializing",
"inflexion": "inflection",
"inflexions": "inflections",
"initialise": "initialize",
"initialised": "initialized",
"initialises": "initializes",
"initialising": "initializing",
"initialled": "initialed",
"initialling": "initialing",
"instal": "install",
"instalment": "installment",
"instalments": "installments",
"instals": "installs",
"instil": "instill",
"instils": "instills",
"institutionalisation": "institutionalization",
"institutionalise": "institutionalize",
"institutionalised": "institutionalized",
"institutionalises": "institutionalizes",
"institutionalising": "institutionalizing",
"intellectualise": "intellectualize",
"intellectualised": "intellectualized",
"intellectualises": "intellectualizes",
"intellectualising": "intellectualizing",
"internalisation": "internalization",
"internalise": "internalize",
"internalised": "internalized",
"internalises": "internalizes",
"internalising": "internalizing",
"internationalisation": "internationalization",
"internationalise": "internationalize",
"internationalised": "internationalized",
"internationalises": "internationalizes",
"internationalising": "internationalizing",
"ionisation": "ionization",
"ionise": "ionize",
"ionised": "ionized",
"ioniser": "ionizer",
"ionisers": "ionizers",
"ionises": "ionizes",
"ionising": "ionizing",
"italicise": "italicize",
"italicised": "italicized",
"italicises": "italicizes",
"italicising": "italicizing",
"itemise": "itemize",
"itemised": "itemized",
"itemises": "itemizes",
"itemising": "itemizing",
"jeopardise": "jeopardize",
"jeopardised": "jeopardized",
"jeopardises": "jeopardizes",
"jeopardising": "jeopardizing",
"jewelled": "jeweled",
"jeweller": "jeweler",
"jewellers": "jewelers",
"jewellery": "jewelry",
"judgement": "judgment",
"kilogramme": "kilogram",
"kilogrammes": "kilograms",
"kilometre": "kilometer",
"kilometres": "kilometers",
"labelled": "labeled",
"labelling": "labeling",
"labour": "labor",
"laboured": "labored",
"labourer": "laborer",
"labourers": "laborers",
"labouring": "laboring",
"labours": "labors",
"lacklustre": "lackluster",
"legalisation": "legalization",
"legalise": "legalize",
"legalised": "legalized",
"legalises": "legalizes",
"legalising": "legalizing",
"legitimise": "legitimize",
"legitimised": "legitimized",
"legitimises": "legitimizes",
"legitimising": "legitimizing",
"leukaemia": "leukemia",
"levelled": "leveled",
"leveller": "leveler",
"levellers": "levelers",
"levelling": "leveling",
"libelled": "libeled",
"libelling": "libeling",
"libellous": "libelous",
"liberalisation": "liberalization",
"liberalise": "liberalize",
"liberalised": "liberalized",
"liberalises": "liberalizes",
"liberalising": "liberalizing",
"licence": "license",
"licenced": "licensed",
"licences": "licenses",
"licencing": "licensing",
"likeable": "likable",
"lionisation": "lionization",
"lionise": "lionize",
"lionised": "lionized",
"lionises": "lionizes",
"lionising": "lionizing",
"liquidise": "liquidize",
"liquidised": "liquidized",
"liquidiser": "liquidizer",
"liquidisers": "liquidizers",
"liquidises": "liquidizes",
"liquidising": "liquidizing",
"litre": "liter",
"litres": "liters",
"localise": "localize",
"localised": "localized",
"localises": "localizes",
"localising": "localizing",
"louvre": "louver",
"louvred": "louvered",
"louvres": "louvers",
"lustre": "luster",
"magnetise": "magnetize",
"magnetised": "magnetized",
"magnetises": "magnetizes",
"magnetising": "magnetizing",
"manoeuvrability": "maneuverability",
"manoeuvrable": "maneuverable",
"manoeuvre": "maneuver",
"manoeuvred": "maneuvered",
"manoeuvres": "maneuvers",
"manoeuvring": "maneuvering",
"manoeuvrings": "maneuverings",
"marginalisation": "marginalization",
"marginalise": "marginalize",
"marginalised": "marginalized",
"marginalises": "marginalizes",
"marginalising": "marginalizing",
"marshalled": "marshaled",
"marshalling": "marshaling",
"marvelled": "marveled",
"marvelling": "marveling",
"marvellous": "marvelous",
"marvellously": "marvelously",
"materialisation": "materialization",
"materialise": "materialize",
"materialised": "materialized",
"materialises": "materializes",
"materialising": "materializing",
"maximisation": "maximization",
"maximise": "maximize",
"maximised": "maximized",
"maximises": "maximizes",
"maximising": "maximizing",
"meagre": "meager",
"mechanisation": "mechanization",
"mechanise": "mechanize",
"mechanised": "mechanized",
"mechanises": "mechanizes",
"mechanising": "mechanizing",
"mediaeval": "medieval",
"memorialise": "memorialize",
"memorialised": "memorialized",
"memorialises": "memorializes",
"memorialising": "memorializing",
"memorise": "memorize",
"memorised": "memorized",
"memorises": "memorizes",
"memorising": "memorizing",
"mesmerise": "mesmerize",
"mesmerised": "mesmerized",
"mesmerises": "mesmerizes",
"mesmerising": "mesmerizing",
"metabolise": "metabolize",
"metabolised": "metabolized",
"metabolises": "metabolizes",
"metabolising": "metabolizing",
"metre": "meter",
"metres": "meters",
"mhm": "hmm",
"micrometre": "micrometer",
"micrometres": "micrometers",
"militarise": "militarize",
"militarised": "militarized",
"militarises": "militarizes",
"militarising": "militarizing",
"milligramme": "milligram",
"milligrammes": "milligrams",
"millilitre": "milliliter",
"millilitres": "milliliters",
"millimetre": "millimeter",
"millimetres": "millimeters",
"miniaturisation": "miniaturization",
"miniaturise": "miniaturize",
"miniaturised": "miniaturized",
"miniaturises": "miniaturizes",
"miniaturising": "miniaturizing",
"minibusses": "minibuses",
"minimise": "minimize",
"minimised": "minimized",
"minimises": "minimizes",
"minimising": "minimizing",
"misbehaviour": "misbehavior",
"misdemeanour": "misdemeanor",
"misdemeanours": "misdemeanors",
"misspelt": "misspelled",
"mitre": "miter",
"mitres": "miters",
"mm": "hmm",
"mmm": "hmm",
"mobilisation": "mobilization",
"mobilise": "mobilize",
"mobilised": "mobilized",
"mobilises": "mobilizes",
"mobilising": "mobilizing",
"modelled": "modeled",
"modeller": "modeler",
"modellers": "modelers",
"modelling": "modeling",
"modernise": "modernize",
"modernised": "modernized",
"modernises": "modernizes",
"modernising": "modernizing",
"moisturise": "moisturize",
"moisturised": "moisturized",
"moisturiser": "moisturizer",
"moisturisers": "moisturizers",
"moisturises": "moisturizes",
"moisturising": "moisturizing",
"monologue": "monolog",
"monologues": "monologs",
"monopolisation": "monopolization",
"monopolise": "monopolize",
"monopolised": "monopolized",
"monopolises": "monopolizes",
"monopolising": "monopolizing",
"moralise": "moralize",
"moralised": "moralized",
"moralises": "moralizes",
"moralising": "moralizing",
"motorised": "motorized",
"mould": "mold",
"moulded": "molded",
"moulder": "molder",
"mouldered": "moldered",
"mouldering": "moldering",
"moulders": "molders",
"mouldier": "moldier",
"mouldiest": "moldiest",
"moulding": "molding",
"mouldings": "moldings",
"moulds": "molds",
"mouldy": "moldy",
"moult": "molt",
"moulted": "molted",
"moulting": "molting",
"moults": "molts",
"moustache": "mustache",
"moustached": "mustached",
"moustaches": "mustaches",
"moustachioed": "mustachioed",
"multicoloured": "multicolored",
"nationalisation": "nationalization",
|
1079 |
+
"nationalisations": "nationalizations",
|
1080 |
+
"nationalise": "nationalize",
|
1081 |
+
"nationalised": "nationalized",
|
1082 |
+
"nationalises": "nationalizes",
|
1083 |
+
"nationalising": "nationalizing",
|
1084 |
+
"naturalisation": "naturalization",
|
1085 |
+
"naturalise": "naturalize",
|
1086 |
+
"naturalised": "naturalized",
|
1087 |
+
"naturalises": "naturalizes",
|
1088 |
+
"naturalising": "naturalizing",
|
1089 |
+
"neighbour": "neighbor",
|
1090 |
+
"neighbourhood": "neighborhood",
|
1091 |
+
"neighbourhoods": "neighborhoods",
|
1092 |
+
"neighbouring": "neighboring",
|
1093 |
+
"neighbourliness": "neighborliness",
|
1094 |
+
"neighbourly": "neighborly",
|
1095 |
+
"neighbours": "neighbors",
|
1096 |
+
"neutralisation": "neutralization",
|
1097 |
+
"neutralise": "neutralize",
|
1098 |
+
"neutralised": "neutralized",
|
1099 |
+
"neutralises": "neutralizes",
|
1100 |
+
"neutralising": "neutralizing",
|
1101 |
+
"normalisation": "normalization",
|
1102 |
+
"normalise": "normalize",
|
1103 |
+
"normalised": "normalized",
|
1104 |
+
"normalises": "normalizes",
|
1105 |
+
"normalising": "normalizing",
|
1106 |
+
"odour": "odor",
|
1107 |
+
"odourless": "odorless",
|
1108 |
+
"odours": "odors",
|
1109 |
+
"oesophagus": "esophagus",
|
1110 |
+
"oesophaguses": "esophaguses",
|
1111 |
+
"oestrogen": "estrogen",
|
1112 |
+
"offence": "offense",
|
1113 |
+
"offences": "offenses",
|
1114 |
+
"omelette": "omelet",
|
1115 |
+
"omelettes": "omelets",
|
1116 |
+
"optimise": "optimize",
|
1117 |
+
"optimised": "optimized",
|
1118 |
+
"optimises": "optimizes",
|
1119 |
+
"optimising": "optimizing",
|
1120 |
+
"organisation": "organization",
|
1121 |
+
"organisational": "organizational",
|
1122 |
+
"organisations": "organizations",
|
1123 |
+
"organise": "organize",
|
1124 |
+
"organised": "organized",
|
1125 |
+
"organiser": "organizer",
|
1126 |
+
"organisers": "organizers",
|
1127 |
+
"organises": "organizes",
|
1128 |
+
"organising": "organizing",
|
1129 |
+
"orthopaedic": "orthopedic",
|
1130 |
+
"orthopaedics": "orthopedics",
|
1131 |
+
"ostracise": "ostracize",
|
1132 |
+
"ostracised": "ostracized",
|
1133 |
+
"ostracises": "ostracizes",
|
1134 |
+
"ostracising": "ostracizing",
|
1135 |
+
"outmanoeuvre": "outmaneuver",
|
1136 |
+
"outmanoeuvred": "outmaneuvered",
|
1137 |
+
"outmanoeuvres": "outmaneuvers",
|
1138 |
+
"outmanoeuvring": "outmaneuvering",
|
1139 |
+
"overemphasise": "overemphasize",
|
1140 |
+
"overemphasised": "overemphasized",
|
1141 |
+
"overemphasises": "overemphasizes",
|
1142 |
+
"overemphasising": "overemphasizing",
|
1143 |
+
"oxidisation": "oxidization",
|
1144 |
+
"oxidise": "oxidize",
|
1145 |
+
"oxidised": "oxidized",
|
1146 |
+
"oxidises": "oxidizes",
|
1147 |
+
"oxidising": "oxidizing",
|
1148 |
+
"paederast": "pederast",
|
1149 |
+
"paederasts": "pederasts",
|
1150 |
+
"paediatric": "pediatric",
|
1151 |
+
"paediatrician": "pediatrician",
|
1152 |
+
"paediatricians": "pediatricians",
|
1153 |
+
"paediatrics": "pediatrics",
|
1154 |
+
"paedophile": "pedophile",
|
1155 |
+
"paedophiles": "pedophiles",
|
1156 |
+
"paedophilia": "pedophilia",
|
1157 |
+
"palaeolithic": "paleolithic",
|
1158 |
+
"palaeontologist": "paleontologist",
|
1159 |
+
"palaeontologists": "paleontologists",
|
1160 |
+
"palaeontology": "paleontology",
|
1161 |
+
"panelled": "paneled",
|
1162 |
+
"panelling": "paneling",
|
1163 |
+
"panellist": "panelist",
|
1164 |
+
"panellists": "panelists",
|
1165 |
+
"paralyse": "paralyze",
|
1166 |
+
"paralysed": "paralyzed",
|
1167 |
+
"paralyses": "paralyzes",
|
1168 |
+
"paralysing": "paralyzing",
|
1169 |
+
"parcelled": "parceled",
|
1170 |
+
"parcelling": "parceling",
|
1171 |
+
"parlour": "parlor",
|
1172 |
+
"parlours": "parlors",
|
1173 |
+
"particularise": "particularize",
|
1174 |
+
"particularised": "particularized",
|
1175 |
+
"particularises": "particularizes",
|
1176 |
+
"particularising": "particularizing",
|
1177 |
+
"passivisation": "passivization",
|
1178 |
+
"passivise": "passivize",
|
1179 |
+
"passivised": "passivized",
|
1180 |
+
"passivises": "passivizes",
|
1181 |
+
"passivising": "passivizing",
|
1182 |
+
"pasteurisation": "pasteurization",
|
1183 |
+
"pasteurise": "pasteurize",
|
1184 |
+
"pasteurised": "pasteurized",
|
1185 |
+
"pasteurises": "pasteurizes",
|
1186 |
+
"pasteurising": "pasteurizing",
|
1187 |
+
"patronise": "patronize",
|
1188 |
+
"patronised": "patronized",
|
1189 |
+
"patronises": "patronizes",
|
1190 |
+
"patronising": "patronizing",
|
1191 |
+
"patronisingly": "patronizingly",
|
1192 |
+
"pedalled": "pedaled",
|
1193 |
+
"pedalling": "pedaling",
|
1194 |
+
"pedestrianisation": "pedestrianization",
|
1195 |
+
"pedestrianise": "pedestrianize",
|
1196 |
+
"pedestrianised": "pedestrianized",
|
1197 |
+
"pedestrianises": "pedestrianizes",
|
1198 |
+
"pedestrianising": "pedestrianizing",
|
1199 |
+
"penalise": "penalize",
|
1200 |
+
"penalised": "penalized",
|
1201 |
+
"penalises": "penalizes",
|
1202 |
+
"penalising": "penalizing",
|
1203 |
+
"pencilled": "penciled",
|
1204 |
+
"pencilling": "penciling",
|
1205 |
+
"personalise": "personalize",
|
1206 |
+
"personalised": "personalized",
|
1207 |
+
"personalises": "personalizes",
|
1208 |
+
"personalising": "personalizing",
|
1209 |
+
"pharmacopoeia": "pharmacopeia",
|
1210 |
+
"pharmacopoeias": "pharmacopeias",
|
1211 |
+
"philosophise": "philosophize",
|
1212 |
+
"philosophised": "philosophized",
|
1213 |
+
"philosophises": "philosophizes",
|
1214 |
+
"philosophising": "philosophizing",
|
1215 |
+
"philtre": "filter",
|
1216 |
+
"philtres": "filters",
|
1217 |
+
"phoney": "phony",
|
1218 |
+
"plagiarise": "plagiarize",
|
1219 |
+
"plagiarised": "plagiarized",
|
1220 |
+
"plagiarises": "plagiarizes",
|
1221 |
+
"plagiarising": "plagiarizing",
|
1222 |
+
"plough": "plow",
|
1223 |
+
"ploughed": "plowed",
|
1224 |
+
"ploughing": "plowing",
|
1225 |
+
"ploughman": "plowman",
|
1226 |
+
"ploughmen": "plowmen",
|
1227 |
+
"ploughs": "plows",
|
1228 |
+
"ploughshare": "plowshare",
|
1229 |
+
"ploughshares": "plowshares",
|
1230 |
+
"polarisation": "polarization",
|
1231 |
+
"polarise": "polarize",
|
1232 |
+
"polarised": "polarized",
|
1233 |
+
"polarises": "polarizes",
|
1234 |
+
"polarising": "polarizing",
|
1235 |
+
"politicisation": "politicization",
|
1236 |
+
"politicise": "politicize",
|
1237 |
+
"politicised": "politicized",
|
1238 |
+
"politicises": "politicizes",
|
1239 |
+
"politicising": "politicizing",
|
1240 |
+
"popularisation": "popularization",
|
1241 |
+
"popularise": "popularize",
|
1242 |
+
"popularised": "popularized",
|
1243 |
+
"popularises": "popularizes",
|
1244 |
+
"popularising": "popularizing",
|
1245 |
+
"pouffe": "pouf",
|
1246 |
+
"pouffes": "poufs",
|
1247 |
+
"practise": "practice",
|
1248 |
+
"practised": "practiced",
|
1249 |
+
"practises": "practices",
|
1250 |
+
"practising": "practicing",
|
1251 |
+
"praesidium": "presidium",
|
1252 |
+
"praesidiums": "presidiums",
|
1253 |
+
"pressurisation": "pressurization",
|
1254 |
+
"pressurise": "pressurize",
|
1255 |
+
"pressurised": "pressurized",
|
1256 |
+
"pressurises": "pressurizes",
|
1257 |
+
"pressurising": "pressurizing",
|
1258 |
+
"pretence": "pretense",
|
1259 |
+
"pretences": "pretenses",
|
1260 |
+
"primaeval": "primeval",
|
1261 |
+
"prioritisation": "prioritization",
|
1262 |
+
"prioritise": "prioritize",
|
1263 |
+
"prioritised": "prioritized",
|
1264 |
+
"prioritises": "prioritizes",
|
1265 |
+
"prioritising": "prioritizing",
|
1266 |
+
"privatisation": "privatization",
|
1267 |
+
"privatisations": "privatizations",
|
1268 |
+
"privatise": "privatize",
|
1269 |
+
"privatised": "privatized",
|
1270 |
+
"privatises": "privatizes",
|
1271 |
+
"privatising": "privatizing",
|
1272 |
+
"professionalisation": "professionalization",
|
1273 |
+
"professionalise": "professionalize",
|
1274 |
+
"professionalised": "professionalized",
|
1275 |
+
"professionalises": "professionalizes",
|
1276 |
+
"professionalising": "professionalizing",
|
1277 |
+
"programme": "program",
|
1278 |
+
"programmes": "programs",
|
1279 |
+
"prologue": "prolog",
|
1280 |
+
"prologues": "prologs",
|
1281 |
+
"propagandise": "propagandize",
|
1282 |
+
"propagandised": "propagandized",
|
1283 |
+
"propagandises": "propagandizes",
|
1284 |
+
"propagandising": "propagandizing",
|
1285 |
+
"proselytise": "proselytize",
|
1286 |
+
"proselytised": "proselytized",
|
1287 |
+
"proselytiser": "proselytizer",
|
1288 |
+
"proselytisers": "proselytizers",
|
1289 |
+
"proselytises": "proselytizes",
|
1290 |
+
"proselytising": "proselytizing",
|
1291 |
+
"psychoanalyse": "psychoanalyze",
|
1292 |
+
"psychoanalysed": "psychoanalyzed",
|
1293 |
+
"psychoanalyses": "psychoanalyzes",
|
1294 |
+
"psychoanalysing": "psychoanalyzing",
|
1295 |
+
"publicise": "publicize",
|
1296 |
+
"publicised": "publicized",
|
1297 |
+
"publicises": "publicizes",
|
1298 |
+
"publicising": "publicizing",
|
1299 |
+
"pulverisation": "pulverization",
|
1300 |
+
"pulverise": "pulverize",
|
1301 |
+
"pulverised": "pulverized",
|
1302 |
+
"pulverises": "pulverizes",
|
1303 |
+
"pulverising": "pulverizing",
|
1304 |
+
"pummelled": "pummeled",
|
1305 |
+
"pummelling": "pummeling",
|
1306 |
+
"pyjama": "pajama",
|
1307 |
+
"pyjamas": "pajamas",
|
1308 |
+
"pzazz": "pizzazz",
|
1309 |
+
"quarrelled": "quarreled",
|
1310 |
+
"quarrelling": "quarreling",
|
1311 |
+
"radicalise": "radicalize",
|
1312 |
+
"radicalised": "radicalized",
|
1313 |
+
"radicalises": "radicalizes",
|
1314 |
+
"radicalising": "radicalizing",
|
1315 |
+
"rancour": "rancor",
|
1316 |
+
"randomise": "randomize",
|
1317 |
+
"randomised": "randomized",
|
1318 |
+
"randomises": "randomizes",
|
1319 |
+
"randomising": "randomizing",
|
1320 |
+
"rationalisation": "rationalization",
|
1321 |
+
"rationalisations": "rationalizations",
|
1322 |
+
"rationalise": "rationalize",
|
1323 |
+
"rationalised": "rationalized",
|
1324 |
+
"rationalises": "rationalizes",
|
1325 |
+
"rationalising": "rationalizing",
|
1326 |
+
"ravelled": "raveled",
|
1327 |
+
"ravelling": "raveling",
|
1328 |
+
"realisable": "realizable",
|
1329 |
+
"realisation": "realization",
|
1330 |
+
"realisations": "realizations",
|
1331 |
+
"realise": "realize",
|
1332 |
+
"realised": "realized",
|
1333 |
+
"realises": "realizes",
|
1334 |
+
"realising": "realizing",
|
1335 |
+
"recognisable": "recognizable",
|
1336 |
+
"recognisably": "recognizably",
|
1337 |
+
"recognisance": "recognizance",
|
1338 |
+
"recognise": "recognize",
|
1339 |
+
"recognised": "recognized",
|
1340 |
+
"recognises": "recognizes",
|
1341 |
+
"recognising": "recognizing",
|
1342 |
+
"reconnoitre": "reconnoiter",
|
1343 |
+
"reconnoitred": "reconnoitered",
|
1344 |
+
"reconnoitres": "reconnoiters",
|
1345 |
+
"reconnoitring": "reconnoitering",
|
1346 |
+
"refuelled": "refueled",
|
1347 |
+
"refuelling": "refueling",
|
1348 |
+
"regularisation": "regularization",
|
1349 |
+
"regularise": "regularize",
|
1350 |
+
"regularised": "regularized",
|
1351 |
+
"regularises": "regularizes",
|
1352 |
+
"regularising": "regularizing",
|
1353 |
+
"remodelled": "remodeled",
|
1354 |
+
"remodelling": "remodeling",
|
1355 |
+
"remould": "remold",
|
1356 |
+
"remoulded": "remolded",
|
1357 |
+
"remoulding": "remolding",
|
1358 |
+
"remoulds": "remolds",
|
1359 |
+
"reorganisation": "reorganization",
|
1360 |
+
"reorganisations": "reorganizations",
|
1361 |
+
"reorganise": "reorganize",
|
1362 |
+
"reorganised": "reorganized",
|
1363 |
+
"reorganises": "reorganizes",
|
1364 |
+
"reorganising": "reorganizing",
|
1365 |
+
"revelled": "reveled",
|
1366 |
+
"reveller": "reveler",
|
1367 |
+
"revellers": "revelers",
|
1368 |
+
"revelling": "reveling",
|
1369 |
+
"revitalise": "revitalize",
|
1370 |
+
"revitalised": "revitalized",
|
1371 |
+
"revitalises": "revitalizes",
|
1372 |
+
"revitalising": "revitalizing",
|
1373 |
+
"revolutionise": "revolutionize",
|
1374 |
+
"revolutionised": "revolutionized",
|
1375 |
+
"revolutionises": "revolutionizes",
|
1376 |
+
"revolutionising": "revolutionizing",
|
1377 |
+
"rhapsodise": "rhapsodize",
|
1378 |
+
"rhapsodised": "rhapsodized",
|
1379 |
+
"rhapsodises": "rhapsodizes",
|
1380 |
+
"rhapsodising": "rhapsodizing",
|
1381 |
+
"rigour": "rigor",
|
1382 |
+
"rigours": "rigors",
|
1383 |
+
"ritualised": "ritualized",
|
1384 |
+
"rivalled": "rivaled",
|
1385 |
+
"rivalling": "rivaling",
|
1386 |
+
"romanticise": "romanticize",
|
1387 |
+
"romanticised": "romanticized",
|
1388 |
+
"romanticises": "romanticizes",
|
1389 |
+
"romanticising": "romanticizing",
|
1390 |
+
"rumour": "rumor",
|
1391 |
+
"rumoured": "rumored",
|
1392 |
+
"rumours": "rumors",
|
1393 |
+
"sabre": "saber",
|
1394 |
+
"sabres": "sabers",
|
1395 |
+
"saltpetre": "saltpeter",
|
1396 |
+
"sanitise": "sanitize",
|
1397 |
+
"sanitised": "sanitized",
|
1398 |
+
"sanitises": "sanitizes",
|
1399 |
+
"sanitising": "sanitizing",
|
1400 |
+
"satirise": "satirize",
|
1401 |
+
"satirised": "satirized",
|
1402 |
+
"satirises": "satirizes",
|
1403 |
+
"satirising": "satirizing",
|
1404 |
+
"saviour": "savior",
|
1405 |
+
"saviours": "saviors",
|
1406 |
+
"savour": "savor",
|
1407 |
+
"savoured": "savored",
|
1408 |
+
"savouries": "savories",
|
1409 |
+
"savouring": "savoring",
|
1410 |
+
"savours": "savors",
|
1411 |
+
"savoury": "savory",
|
1412 |
+
"scandalise": "scandalize",
|
1413 |
+
"scandalised": "scandalized",
|
1414 |
+
"scandalises": "scandalizes",
|
1415 |
+
"scandalising": "scandalizing",
|
1416 |
+
"sceptic": "skeptic",
|
1417 |
+
"sceptical": "skeptical",
|
1418 |
+
"sceptically": "skeptically",
|
1419 |
+
"scepticism": "skepticism",
|
1420 |
+
"sceptics": "skeptics",
|
1421 |
+
"sceptre": "scepter",
|
1422 |
+
"sceptres": "scepters",
|
1423 |
+
"scrutinise": "scrutinize",
|
1424 |
+
"scrutinised": "scrutinized",
|
1425 |
+
"scrutinises": "scrutinizes",
|
1426 |
+
"scrutinising": "scrutinizing",
|
1427 |
+
"secularisation": "secularization",
|
1428 |
+
"secularise": "secularize",
|
1429 |
+
"secularised": "secularized",
|
1430 |
+
"secularises": "secularizes",
|
1431 |
+
"secularising": "secularizing",
|
1432 |
+
"sensationalise": "sensationalize",
|
1433 |
+
"sensationalised": "sensationalized",
|
1434 |
+
"sensationalises": "sensationalizes",
|
1435 |
+
"sensationalising": "sensationalizing",
|
1436 |
+
"sensitise": "sensitize",
|
1437 |
+
"sensitised": "sensitized",
|
1438 |
+
"sensitises": "sensitizes",
|
1439 |
+
"sensitising": "sensitizing",
|
1440 |
+
"sentimentalise": "sentimentalize",
|
1441 |
+
"sentimentalised": "sentimentalized",
|
1442 |
+
"sentimentalises": "sentimentalizes",
|
1443 |
+
"sentimentalising": "sentimentalizing",
|
1444 |
+
"sepulchre": "sepulcher",
|
1445 |
+
"sepulchres": "sepulchers",
|
1446 |
+
"serialisation": "serialization",
|
1447 |
+
"serialisations": "serializations",
|
1448 |
+
"serialise": "serialize",
|
1449 |
+
"serialised": "serialized",
|
1450 |
+
"serialises": "serializes",
|
1451 |
+
"serialising": "serializing",
|
1452 |
+
"sermonise": "sermonize",
|
1453 |
+
"sermonised": "sermonized",
|
1454 |
+
"sermonises": "sermonizes",
|
1455 |
+
"sermonising": "sermonizing",
|
1456 |
+
"sheikh": "sheik",
|
1457 |
+
"shovelled": "shoveled",
|
1458 |
+
"shovelling": "shoveling",
|
1459 |
+
"shrivelled": "shriveled",
|
1460 |
+
"shrivelling": "shriveling",
|
1461 |
+
"signalise": "signalize",
|
1462 |
+
"signalised": "signalized",
|
1463 |
+
"signalises": "signalizes",
|
1464 |
+
"signalising": "signalizing",
|
1465 |
+
"signalled": "signaled",
|
1466 |
+
"signalling": "signaling",
|
1467 |
+
"smoulder": "smolder",
|
1468 |
+
"smouldered": "smoldered",
|
1469 |
+
"smouldering": "smoldering",
|
1470 |
+
"smoulders": "smolders",
|
1471 |
+
"snivelled": "sniveled",
|
1472 |
+
"snivelling": "sniveling",
|
1473 |
+
"snorkelled": "snorkeled",
|
1474 |
+
"snorkelling": "snorkeling",
|
1475 |
+
"snowplough": "snowplow",
|
1476 |
+
"snowploughs": "snowplows",
|
1477 |
+
"socialisation": "socialization",
|
1478 |
+
"socialise": "socialize",
|
1479 |
+
"socialised": "socialized",
|
1480 |
+
"socialises": "socializes",
|
1481 |
+
"socialising": "socializing",
|
1482 |
+
"sodomise": "sodomize",
|
1483 |
+
"sodomised": "sodomized",
|
1484 |
+
"sodomises": "sodomizes",
|
1485 |
+
"sodomising": "sodomizing",
|
1486 |
+
"solemnise": "solemnize",
|
1487 |
+
"solemnised": "solemnized",
|
1488 |
+
"solemnises": "solemnizes",
|
1489 |
+
"solemnising": "solemnizing",
|
1490 |
+
"sombre": "somber",
|
1491 |
+
"specialisation": "specialization",
|
1492 |
+
"specialisations": "specializations",
|
1493 |
+
"specialise": "specialize",
|
1494 |
+
"specialised": "specialized",
|
1495 |
+
"specialises": "specializes",
|
1496 |
+
"specialising": "specializing",
|
1497 |
+
"spectre": "specter",
|
1498 |
+
"spectres": "specters",
|
1499 |
+
"spiralled": "spiraled",
|
1500 |
+
"spiralling": "spiraling",
|
1501 |
+
"splendour": "splendor",
|
1502 |
+
"splendours": "splendors",
|
1503 |
+
"squirrelled": "squirreled",
|
1504 |
+
"squirrelling": "squirreling",
|
1505 |
+
"stabilisation": "stabilization",
|
1506 |
+
"stabilise": "stabilize",
|
1507 |
+
"stabilised": "stabilized",
|
1508 |
+
"stabiliser": "stabilizer",
|
1509 |
+
"stabilisers": "stabilizers",
|
1510 |
+
"stabilises": "stabilizes",
|
1511 |
+
"stabilising": "stabilizing",
|
1512 |
+
"standardisation": "standardization",
|
1513 |
+
"standardise": "standardize",
|
1514 |
+
"standardised": "standardized",
|
1515 |
+
"standardises": "standardizes",
|
1516 |
+
"standardising": "standardizing",
|
1517 |
+
"stencilled": "stenciled",
|
1518 |
+
"stencilling": "stenciling",
|
1519 |
+
"sterilisation": "sterilization",
|
1520 |
+
"sterilisations": "sterilizations",
|
1521 |
+
"sterilise": "sterilize",
|
1522 |
+
"sterilised": "sterilized",
|
1523 |
+
"steriliser": "sterilizer",
|
1524 |
+
"sterilisers": "sterilizers",
|
1525 |
+
"sterilises": "sterilizes",
|
1526 |
+
"sterilising": "sterilizing",
|
1527 |
+
"stigmatisation": "stigmatization",
|
1528 |
+
"stigmatise": "stigmatize",
|
1529 |
+
"stigmatised": "stigmatized",
|
1530 |
+
"stigmatises": "stigmatizes",
|
1531 |
+
"stigmatising": "stigmatizing",
|
1532 |
+
"storey": "story",
|
1533 |
+
"storeys": "stories",
|
1534 |
+
"subsidisation": "subsidization",
|
1535 |
+
"subsidise": "subsidize",
|
1536 |
+
"subsidised": "subsidized",
|
1537 |
+
"subsidiser": "subsidizer",
|
1538 |
+
"subsidisers": "subsidizers",
|
1539 |
+
"subsidises": "subsidizes",
|
1540 |
+
"subsidising": "subsidizing",
|
1541 |
+
"succour": "succor",
|
1542 |
+
"succoured": "succored",
|
1543 |
+
"succouring": "succoring",
|
1544 |
+
"succours": "succors",
|
1545 |
+
"sulphate": "sulfate",
|
1546 |
+
"sulphates": "sulfates",
|
1547 |
+
"sulphide": "sulfide",
|
1548 |
+
"sulphides": "sulfides",
|
1549 |
+
"sulphur": "sulfur",
|
1550 |
+
"sulphurous": "sulfurous",
|
1551 |
+
"summarise": "summarize",
|
1552 |
+
"summarised": "summarized",
|
1553 |
+
"summarises": "summarizes",
|
1554 |
+
"summarising": "summarizing",
|
1555 |
+
"swivelled": "swiveled",
|
1556 |
+
"swivelling": "swiveling",
|
1557 |
+
"symbolise": "symbolize",
|
1558 |
+
"symbolised": "symbolized",
|
1559 |
+
"symbolises": "symbolizes",
|
1560 |
+
"symbolising": "symbolizing",
|
1561 |
+
"sympathise": "sympathize",
|
1562 |
+
"sympathised": "sympathized",
|
1563 |
+
"sympathiser": "sympathizer",
|
1564 |
+
"sympathisers": "sympathizers",
|
1565 |
+
"sympathises": "sympathizes",
|
1566 |
+
"sympathising": "sympathizing",
|
1567 |
+
"synchronisation": "synchronization",
|
1568 |
+
"synchronise": "synchronize",
|
1569 |
+
"synchronised": "synchronized",
|
1570 |
+
"synchronises": "synchronizes",
|
1571 |
+
"synchronising": "synchronizing",
|
1572 |
+
"synthesise": "synthesize",
|
1573 |
+
"synthesised": "synthesized",
|
1574 |
+
"synthesiser": "synthesizer",
|
1575 |
+
"synthesisers": "synthesizers",
|
1576 |
+
"synthesises": "synthesizes",
|
1577 |
+
"synthesising": "synthesizing",
|
1578 |
+
"syphon": "siphon",
|
1579 |
+
"syphoned": "siphoned",
|
1580 |
+
"syphoning": "siphoning",
|
1581 |
+
"syphons": "siphons",
|
1582 |
+
"systematisation": "systematization",
|
1583 |
+
"systematise": "systematize",
|
1584 |
+
"systematised": "systematized",
|
1585 |
+
"systematises": "systematizes",
|
1586 |
+
"systematising": "systematizing",
|
1587 |
+
"tantalise": "tantalize",
|
1588 |
+
"tantalised": "tantalized",
|
1589 |
+
"tantalises": "tantalizes",
|
1590 |
+
"tantalising": "tantalizing",
|
1591 |
+
"tantalisingly": "tantalizingly",
|
1592 |
+
"tasselled": "tasseled",
|
1593 |
+
"technicolour": "technicolor",
|
1594 |
+
"temporise": "temporize",
|
1595 |
+
"temporised": "temporized",
|
1596 |
+
"temporises": "temporizes",
|
1597 |
+
"temporising": "temporizing",
|
1598 |
+
"tenderise": "tenderize",
|
1599 |
+
"tenderised": "tenderized",
|
1600 |
+
"tenderises": "tenderizes",
|
1601 |
+
"tenderising": "tenderizing",
|
1602 |
+
"terrorise": "terrorize",
|
1603 |
+
"terrorised": "terrorized",
|
1604 |
+
"terrorises": "terrorizes",
|
1605 |
+
"terrorising": "terrorizing",
|
1606 |
+
"theatre": "theater",
|
1607 |
+
"theatregoer": "theatergoer",
|
1608 |
+
"theatregoers": "theatergoers",
|
1609 |
+
"theatres": "theaters",
|
1610 |
+
"theorise": "theorize",
|
1611 |
+
"theorised": "theorized",
|
1612 |
+
"theorises": "theorizes",
|
1613 |
+
"theorising": "theorizing",
|
1614 |
+
"tonne": "ton",
|
1615 |
+
"tonnes": "tons",
|
1616 |
+
"towelled": "toweled",
|
1617 |
+
"towelling": "toweling",
|
1618 |
+
"toxaemia": "toxemia",
|
1619 |
+
"tranquillise": "tranquilize",
|
1620 |
+
"tranquillised": "tranquilized",
|
1621 |
+
"tranquilliser": "tranquilizer",
|
1622 |
+
"tranquillisers": "tranquilizers",
|
1623 |
+
"tranquillises": "tranquilizes",
|
1624 |
+
"tranquillising": "tranquilizing",
|
1625 |
+
"tranquillity": "tranquility",
|
1626 |
+
"tranquillize": "tranquilize",
|
1627 |
+
"tranquillized": "tranquilized",
|
1628 |
+
"tranquillizer": "tranquilizer",
|
1629 |
+
"tranquillizers": "tranquilizers",
|
1630 |
+
"tranquillizes": "tranquilizes",
|
1631 |
+
"tranquillizing": "tranquilizing",
|
1632 |
+
"tranquilly": "tranquility",
|
1633 |
+
"transistorised": "transistorized",
|
1634 |
+
"traumatise": "traumatize",
|
1635 |
+
"traumatised": "traumatized",
|
1636 |
+
"traumatises": "traumatizes",
|
1637 |
+
"traumatising": "traumatizing",
|
1638 |
+
"travelled": "traveled",
|
1639 |
+
"traveller": "traveler",
|
1640 |
+
"travellers": "travelers",
|
1641 |
+
"travelling": "traveling",
|
1642 |
+
"travelog": "travelogue",
|
1643 |
+
"travelogs": "travelogues",
|
1644 |
+
"trialled": "trialed",
|
1645 |
+
"trialling": "trialing",
|
1646 |
+
"tricolour": "tricolor",
|
1647 |
+
"tricolours": "tricolors",
|
1648 |
+
"trivialise": "trivialize",
|
1649 |
+
"trivialised": "trivialized",
|
1650 |
+
"trivialises": "trivializes",
|
1651 |
+
"trivialising": "trivializing",
|
1652 |
+
"tumour": "tumor",
|
1653 |
+
"tumours": "tumors",
|
1654 |
+
"tunnelled": "tunneled",
|
1655 |
+
"tunnelling": "tunneling",
|
1656 |
+
"tyrannise": "tyrannize",
|
1657 |
+
"tyrannised": "tyrannized",
|
1658 |
+
"tyrannises": "tyrannizes",
|
1659 |
+
"tyrannising": "tyrannizing",
|
1660 |
+
"tyre": "tire",
|
1661 |
+
"tyres": "tires",
|
1662 |
+
"unauthorised": "unauthorized",
|
1663 |
+
"uncivilised": "uncivilized",
|
1664 |
+
"underutilised": "underutilized",
|
1665 |
+
"unequalled": "unequaled",
|
1666 |
+
"unfavourable": "unfavorable",
|
1667 |
+
"unfavourably": "unfavorably",
|
1668 |
+
"unionisation": "unionization",
|
1669 |
+
"unionise": "unionize",
|
1670 |
+
"unionised": "unionized",
|
1671 |
+
"unionises": "unionizes",
|
1672 |
+
"unionising": "unionizing",
|
1673 |
+
"unorganised": "unorganized",
|
1674 |
+
"unravelled": "unraveled",
|
1675 |
+
"unravelling": "unraveling",
|
1676 |
+
"unrecognisable": "unrecognizable",
|
1677 |
+
"unrecognised": "unrecognized",
|
1678 |
+
"unrivalled": "unrivaled",
|
1679 |
+
"unsavoury": "unsavory",
|
1680 |
+
"untrammelled": "untrammeled",
|
1681 |
+
"urbanisation": "urbanization",
|
1682 |
+
"urbanise": "urbanize",
|
1683 |
+
"urbanised": "urbanized",
|
1684 |
+
"urbanises": "urbanizes",
|
1685 |
+
"urbanising": "urbanizing",
|
1686 |
+
"utilisable": "utilizable",
|
1687 |
+
"utilisation": "utilization",
|
1688 |
+
"utilise": "utilize",
|
1689 |
+
"utilised": "utilized",
|
1690 |
+
"utilises": "utilizes",
|
1691 |
+
"utilising": "utilizing",
|
1692 |
+
"valour": "valor",
|
1693 |
+
"vandalise": "vandalize",
|
1694 |
+
"vandalised": "vandalized",
|
1695 |
+
"vandalises": "vandalizes",
|
1696 |
+
"vandalising": "vandalizing",
|
1697 |
+
"vaporisation": "vaporization",
|
1698 |
+
"vaporise": "vaporize",
|
1699 |
+
"vaporised": "vaporized",
|
1700 |
+
"vaporises": "vaporizes",
|
1701 |
+
"vaporising": "vaporizing",
|
1702 |
+
"vapour": "vapor",
|
1703 |
+
"vapours": "vapors",
|
1704 |
+
"verbalise": "verbalize",
|
1705 |
+
"verbalised": "verbalized",
|
1706 |
+
"verbalises": "verbalizes",
|
1707 |
+
"verbalising": "verbalizing",
|
1708 |
+
"victimisation": "victimization",
|
1709 |
+
"victimise": "victimize",
|
1710 |
+
"victimised": "victimized",
|
1711 |
+
"victimises": "victimizes",
|
1712 |
+
"victimising": "victimizing",
|
1713 |
+
"videodisc": "videodisk",
|
1714 |
+
"videodiscs": "videodisks",
|
1715 |
+
"vigour": "vigor",
|
1716 |
+
"visualisation": "visualization",
|
1717 |
+
"visualisations": "visualizations",
|
1718 |
+
"visualise": "visualize",
|
1719 |
+
"visualised": "visualized",
|
1720 |
+
"visualises": "visualizes",
|
1721 |
+
"visualising": "visualizing",
|
1722 |
+
"vocalisation": "vocalization",
|
1723 |
+
"vocalisations": "vocalizations",
|
1724 |
+
"vocalise": "vocalize",
|
1725 |
+
"vocalised": "vocalized",
|
1726 |
+
"vocalises": "vocalizes",
|
1727 |
+
"vocalising": "vocalizing",
|
1728 |
+
"vulcanised": "vulcanized",
|
1729 |
+
"vulgarisation": "vulgarization",
|
1730 |
+
"vulgarise": "vulgarize",
|
1731 |
+
"vulgarised": "vulgarized",
|
1732 |
+
"vulgarises": "vulgarizes",
|
1733 |
+
"vulgarising": "vulgarizing",
|
1734 |
+
"waggon": "wagon",
|
1735 |
+
"waggons": "wagons",
|
1736 |
+
"watercolour": "watercolor",
|
1737 |
+
"watercolours": "watercolors",
|
1738 |
+
"weaselled": "weaseled",
|
1739 |
+
"weaselling": "weaseling",
|
1740 |
+
"westernisation": "westernization",
|
1741 |
+
"westernise": "westernize",
|
1742 |
+
"westernised": "westernized",
|
1743 |
+
"westernises": "westernizes",
|
1744 |
+
"westernising": "westernizing",
|
1745 |
+
"womanise": "womanize",
|
1746 |
+
"womanised": "womanized",
|
1747 |
+
"womaniser": "womanizer",
|
1748 |
+
"womanisers": "womanizers",
|
1749 |
+
"womanises": "womanizes",
|
1750 |
+
"womanising": "womanizing",
|
1751 |
+
"woollen": "woolen",
|
1752 |
+
"woollens": "woolens",
|
1753 |
+
"woollies": "woolies",
|
1754 |
+
"woolly": "wooly",
|
1755 |
+
"worshipped": "worshiped",
|
1756 |
+
"worshipper": "worshiper",
|
1757 |
+
"worshipping": "worshiping",
|
1758 |
+
"yodelled": "yodeled",
|
1759 |
+
"yodelling": "yodeling",
|
1760 |
+
"yoghourt": "yogurt",
|
1761 |
+
"yoghourts": "yogurts",
|
1762 |
+
"yoghurt": "yogurt",
|
1763 |
+
"yoghurts": "yogurts",
|
1764 |
+
}
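A minimal sketch, not part of the original module, of how a British-to-American table like the one above could be applied word by word. `sample_mapping` and `americanize` are hypothetical names introduced for illustration; the four entries are copied from the table, and the whitespace-based word splitting is an assumption.

```python
# Hypothetical subset of the spelling table above; the real variable name
# is not visible in this excerpt.
sample_mapping = {
    "neighbour": "neighbor",
    "organise": "organize",
    "theatre": "theater",
    "tyres": "tires",
}


def americanize(text: str, mapping: dict) -> str:
    """Replace every whitespace-separated word that has an entry in `mapping`."""
    return " ".join(mapping.get(word, word) for word in text.split())


result = americanize("my neighbour will organise the theatre trip", sample_mapping)
print(result)  # my neighbor will organize the theater trip
```

Words without an entry pass through unchanged, so the table only needs to list spellings that actually differ.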
|
1765 |
+
|
1766 |
+
# non-ASCII letters that are not separated by "NFKD" normalization
|
1767 |
+
ADDITIONAL_DIACRITICS = {
|
1768 |
+
"œ": "oe",
|
1769 |
+
"Œ": "OE",
|
1770 |
+
"ø": "o",
|
1771 |
+
"Ø": "O",
|
1772 |
+
"æ": "ae",
|
1773 |
+
"Æ": "AE",
|
1774 |
+
"ß": "ss",
|
1775 |
+
"ẞ": "SS",
|
1776 |
+
"đ": "d",
|
1777 |
+
"Đ": "D",
|
1778 |
+
"ð": "d",
|
1779 |
+
"Ð": "D",
|
1780 |
+
"þ": "th",
|
1781 |
+
"Þ": "TH",
|
1782 |
+
"ł": "l",
|
1783 |
+
"Ł": "L",
|
1784 |
+
}
|
1785 |
+
|
1786 |
+
|
1787 |
+
def remove_symbols_and_diacritics(s: str, keep=""):
|
1788 |
+
"""
|
1789 |
+
Replace any other markers, symbols, and punctuation with a space, and drop any diacritics
|
1790 |
+
(category 'Mn' and some manual mappings)
|
1791 |
+
"""
|
1792 |
+
|
1793 |
+
def replace_character(char):
|
1794 |
+
if char in keep:
|
1795 |
+
return char
|
1796 |
+
elif char in ADDITIONAL_DIACRITICS:
|
1797 |
+
return ADDITIONAL_DIACRITICS[char]
|
1798 |
+
|
1799 |
+
elif unicodedata.category(char) == "Mn":
|
1800 |
+
return ""
|
1801 |
+
|
1802 |
+
elif unicodedata.category(char)[0] in "MSP":
|
1803 |
+
return " "
|
1804 |
+
|
1805 |
+
return char
|
1806 |
+
|
1807 |
+
return "".join(replace_character(c) for c in unicodedata.normalize("NFKD", s))
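A short sketch of why the manual `ADDITIONAL_DIACRITICS` table is likely needed alongside the `Mn` check. NFKD splits letters such as `é` into a base letter plus a combining mark (category `Mn`), but letters such as `ø` have no decomposition and survive unchanged. `strip_combining_marks` is a hypothetical helper written for this illustration only.

```python
import unicodedata


def strip_combining_marks(s: str) -> str:
    """Decompose with NFKD, then drop combining marks (category 'Mn')."""
    decomposed = unicodedata.normalize("NFKD", s)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")


# "é" and "ï" decompose into a base letter plus a combining mark.
accented = strip_combining_marks("caf\u00e9 na\u00efve")
print(accented)  # cafe naive

# "ø" has no canonical or compatibility decomposition, so it is left as-is.
stroked = strip_combining_marks("sm\u00f8rrebr\u00f8d")
print(stroked == "sm\u00f8rrebr\u00f8d")  # True
```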
|
1808 |
+
|
1809 |
+
|
1810 |
+
def remove_symbols(s: str):
|
1811 |
+
"""
|
1812 |
+
Replace any other markers, symbols, and punctuation with a space, keeping diacritics
|
1813 |
+
"""
|
1814 |
+
return "".join(
|
1815 |
+
" " if unicodedata.category(c)[0] in "MSP" else c
|
1816 |
+
for c in unicodedata.normalize("NFKC", s)
|
1817 |
+
)
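A self-contained sketch of the same idea, restated under the hypothetical name `symbols_to_spaces` so that it runs on its own. It shows one plausible reason for using NFKC here: a decomposed `e` + U+0301 is first composed into a single letter (category `Ll`), so the accent is kept instead of being replaced by a space by the `M` check.

```python
import unicodedata


def symbols_to_spaces(s: str) -> str:
    """Replace marks (M*), symbols (S*) and punctuation (P*) with a space after NFKC."""
    return "".join(
        " " if unicodedata.category(c)[0] in "MSP" else c
        for c in unicodedata.normalize("NFKC", s)
    )


# The input uses a decomposed "e" + combining acute accent (U+0301).
result = symbols_to_spaces("cafe\u0301, $5!")
print(repr(result))  # 'café   5 '
```

The comma, the dollar sign, and the exclamation mark each become a space; the original space is kept, which is why three spaces follow `café`.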
|
1818 |
+
|
1819 |
+
|
1820 |
+
class BasicTextNormalizer:
|
1821 |
+
def __init__(self, remove_diacritics: bool = False, split_letters: bool = False):
|
1822 |
+
self.clean = (
|
1823 |
+
remove_symbols_and_diacritics if remove_diacritics else remove_symbols
|
1824 |
+
)
|
1825 |
+
self.split_letters = split_letters
|
1826 |
+
|
1827 |
+
def __call__(self, s: str):
|
1828 |
+
s = s.lower()
|
1829 |
+
s = re.sub(r"[<\[][^>\]]*[>\]]", "", s) # remove words between brackets
|
1830 |
+
s = re.sub(r"\(([^)]+?)\)", "", s) # remove words between parentheses
|
1831 |
+
s = self.clean(s).lower()
|
1832 |
+
|
1833 |
+
if self.split_letters:
|
1834 |
+
s = " ".join(regex.findall(r"\X", s, regex.U))
|
1835 |
+
|
1836 |
+
s = re.sub(
|
1837 |
+
r"\s+", " ", s
|
1838 |
+
) # replace any successive whitespace characters with a space
|
1839 |
+
|
1840 |
+
return s
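A minimal walk-through of the regular-expression steps in `__call__`, using only the standard `re` module. The sample sentence is made up for illustration, and the symbol-cleaning and `split_letters` steps are left out.

```python
import re

s = "Hello [laughs] World (applause) <noise>  again"

s = s.lower()
# Drop anything enclosed in [...] or <...>.
s = re.sub(r"[<\[][^>\]]*[>\]]", "", s)
# Drop anything enclosed in (...).
s = re.sub(r"\(([^)]+?)\)", "", s)
# Collapse the runs of whitespace left behind by the removals.
s = re.sub(r"\s+", " ", s)

print(s)  # hello world again
```

Note that the whitespace step only collapses runs; leading or trailing spaces are not stripped.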
|
1841 |
+
|
1842 |
+
|
1843 |
+
class EnglishNumberNormalizer:
|
1844 |
+
"""
|
1845 |
+
Convert any spelled-out numbers into arabic numbers, while handling:
|
1846 |
+
|
1847 |
+
- remove any commas
|
1848 |
+
- keep the suffixes such as: `1960s`, `274th`, `32nd`, etc.
|
1849 |
+
- spell out currency symbols after the number. e.g. `$20 million` -> `20000000 dollars`
|
1850 |
+
- spell out `one` and `ones`
|
1851 |
+
- interpret successive single-digit numbers as nominal: `one oh one` -> `101`
|
1852 |
+
"""
|
1853 |
+
|
1854 |
+
def __init__(self):
|
1855 |
+
super().__init__()
|
1856 |
+
|
1857 |
+
self.zeros = {"o", "oh", "zero"}
|
1858 |
+
# fmt: off
|
1859 |
+
self.ones = {
|
1860 |
+
name: i
|
1861 |
+
for i, name in enumerate(
|
1862 |
+
[
|
1863 |
+
"one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten",
|
1864 |
+
"eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen", "seventeen",
|
1865 |
+
"eighteen", "nineteen"],
|
1866 |
+
start=1,
|
1867 |
+
)
|
1868 |
+
}
|
1869 |
+
# fmt: on
|
1870 |
+
self.ones_plural = {
|
1871 |
+
"sixes" if name == "six" else name + "s": (value, "s")
|
1872 |
+
for name, value in self.ones.items()
|
1873 |
+
}
|
1874 |
+
self.ones_ordinal = {
|
1875 |
+
"zeroth": (0, "th"),
|
1876 |
+
"first": (1, "st"),
|
1877 |
+
"second": (2, "nd"),
|
1878 |
+
"third": (3, "rd"),
|
1879 |
+
"fifth": (5, "th"),
|
1880 |
+
"twelfth": (12, "th"),
|
1881 |
+
**{
|
1882 |
+
("ninth" if name == "nine" else name + ("h" if name.endswith("t") else "th")): (value, "th")
|
1883 |
+
for name, value in self.ones.items()
|
1884 |
+
if value > 3 and value != 5 and value != 12
|
1885 |
+
},
|
1886 |
+
}
|
1887 |
+
self.ones_suffixed = {**self.ones_plural, **self.ones_ordinal}
|
1888 |
+
|
1889 |
+
self.tens = {
|
1890 |
+
"twenty": 20,
|
1891 |
+
"thirty": 30,
|
1892 |
+
"forty": 40,
|
1893 |
+
"fifty": 50,
|
1894 |
+
"sixty": 60,
|
1895 |
+
"seventy": 70,
|
1896 |
+
"eighty": 80,
|
1897 |
+
"ninety": 90,
|
1898 |
+
}
|
1899 |
+
self.tens_plural = {
|
1900 |
+
name.replace("y", "ies"): (value, "s") for name, value in self.tens.items()
|
1901 |
+
}
|
1902 |
+
self.tens_ordinal = {
|
1903 |
+
name.replace("y", "ieth"): (value, "th")
|
1904 |
+
for name, value in self.tens.items()
|
1905 |
+
}
|
1906 |
+
self.tens_suffixed = {**self.tens_plural, **self.tens_ordinal}
|
1907 |
+
|
1908 |
+
self.multipliers = {
|
1909 |
+
"hundred": 100,
|
1910 |
+
"thousand": 1_000,
|
1911 |
+
"million": 1_000_000,
|
1912 |
+
"billion": 1_000_000_000,
|
1913 |
+
"trillion": 1_000_000_000_000,
|
1914 |
+
"quadrillion": 1_000_000_000_000_000,
|
1915 |
+
"quintillion": 1_000_000_000_000_000_000,
|
1916 |
+
"sextillion": 1_000_000_000_000_000_000_000,
|
1917 |
+
"septillion": 1_000_000_000_000_000_000_000_000,
|
1918 |
+
"octillion": 1_000_000_000_000_000_000_000_000_000,
|
1919 |
+
"nonillion": 1_000_000_000_000_000_000_000_000_000_000,
|
1920 |
+
"decillion": 1_000_000_000_000_000_000_000_000_000_000_000,
|
1921 |
+
}
|
1922 |
+
self.multipliers_plural = {
|
1923 |
+
name + "s": (value, "s") for name, value in self.multipliers.items()
|
1924 |
+
}
|
1925 |
+
self.multipliers_ordinal = {
|
1926 |
+
name + "th": (value, "th") for name, value in self.multipliers.items()
|
1927 |
+
}
|
1928 |
+
self.multipliers_suffixed = {
|
1929 |
+
**self.multipliers_plural,
|
1930 |
+
**self.multipliers_ordinal,
|
1931 |
+
}
|
1932 |
+
self.decimals = {*self.ones, *self.tens, *self.zeros}
|
1933 |
+
|
1934 |
+
self.preceding_prefixers = {
|
1935 |
+
"minus": "-",
|
1936 |
+
"negative": "-",
|
1937 |
+
"plus": "+",
|
1938 |
+
"positive": "+",
|
1939 |
+
}
|
1940 |
+
self.following_prefixers = {
|
1941 |
+
"pound": "£",
|
1942 |
+
"pounds": "£",
|
1943 |
+
"euro": "€",
|
1944 |
+
"euros": "€",
|
1945 |
+
"dollar": "$",
|
1946 |
+
"dollars": "$",
|
1947 |
+
"cent": "¢",
|
1948 |
+
"cents": "¢",
|
1949 |
+
}
|
1950 |
+
self.prefixes = set(
|
1951 |
+
list(self.preceding_prefixers.values())
|
1952 |
+
+ list(self.following_prefixers.values())
|
1953 |
+
)
|
1954 |
+
self.suffixers = {
|
1955 |
+
"per": {"cent": "%"},
|
1956 |
+
"percent": "%",
|
1957 |
+
}
|
1958 |
+
self.specials = {"and", "double", "triple", "point"}
|
1959 |
+
|
1960 |
+
self.words = {
|
1961 |
+
key
|
1962 |
+
for mapping in [
|
1963 |
+
self.zeros,
|
1964 |
+
self.ones,
|
1965 |
+
self.ones_suffixed,
|
1966 |
+
self.tens,
|
1967 |
+
self.tens_suffixed,
|
1968 |
+
self.multipliers,
|
1969 |
+
self.multipliers_suffixed,
|
1970 |
+
self.preceding_prefixers,
|
1971 |
+
self.following_prefixers,
|
1972 |
+
self.suffixers,
|
1973 |
+
self.specials,
|
1974 |
+
]
|
1975 |
+
for key in mapping
|
1976 |
+
}
|
1977 |
+
self.literal_words = {"one", "ones"}
|
1978 |
+
|
1979 |
+
def process_words(self, words: List[str]) -> Iterator[str]:
|
1980 |
+
prefix: Optional[str] = None
|
1981 |
+
value: Optional[Union[str, int]] = None
|
1982 |
+
skip = False
|
1983 |
+
|
1984 |
+
def to_fraction(s: str):
|
1985 |
+
try:
|
1986 |
+
return Fraction(s)
|
1987 |
+
except ValueError:
|
1988 |
+
return None
|
1989 |
+
|
1990 |
+
def output(result: Union[str, int]):
|
1991 |
+
nonlocal prefix, value
|
1992 |
+
result = str(result)
|
1993 |
+
if prefix is not None:
|
1994 |
+
result = prefix + result
|
1995 |
+
value = None
|
1996 |
+
prefix = None
|
1997 |
+
return result
|
1998 |
+
|
1999 |
+
if len(words) == 0:
|
2000 |
+
return
|
2001 |
+
|
2002 |
+
for i, current in enumerate(words):
|
2003 |
+
prev = words[i - 1] if i != 0 else None
|
2004 |
+
next = words[i + 1] if i != len(words) - 1 else None
|
2005 |
+
if skip:
|
2006 |
+
skip = False
|
2007 |
+
continue
|
2008 |
+
|
2009 |
+
next_is_numeric = next is not None and re.match(r"^\d+(\.\d+)?$", next)
|
2010 |
+
has_prefix = current[0] in self.prefixes
|
2011 |
+
current_without_prefix = current[1:] if has_prefix else current
|
2012 |
+
if re.match(r"^\d+(\.\d+)?$", current_without_prefix):
|
2013 |
+
# arabic numbers (potentially with signs and fractions)
|
2014 |
+
f = to_fraction(current_without_prefix)
|
2015 |
+
if f is None:
|
2016 |
+
raise ValueError("Converting the fraction failed")
|
2017 |
+
|
2018 |
+
if value is not None:
|
2019 |
+
if isinstance(value, str) and value.endswith("."):
|
2020 |
+
# concatenate decimals / ip address components
|
2021 |
+
value = str(value) + str(current)
|
2022 |
+
continue
|
2023 |
+
else:
|
2024 |
+
yield output(value)
|
2025 |
+
|
2026 |
+
prefix = current[0] if has_prefix else prefix
|
2027 |
+
if f.denominator == 1:
|
2028 |
+
value = f.numerator # store integers as int
|
2029 |
+
else:
|
2030 |
+
value = current_without_prefix
|
2031 |
+
elif current not in self.words:
|
2032 |
+
# non-numeric words
|
2033 |
+
if value is not None:
|
2034 |
+
yield output(value)
|
2035 |
+
yield output(current)
|
2036 |
+
elif current in self.zeros:
|
2037 |
+
value = str(value or "") + "0"
|
2038 |
+
elif current in self.ones:
|
2039 |
+
ones = self.ones[current]
|
2040 |
+
|
2041 |
+
if value is None:
|
2042 |
+
value = ones
|
2043 |
+
elif isinstance(value, str) or prev in self.ones:
|
2044 |
+
if (
|
2045 |
+
prev in self.tens and ones < 10
|
2046 |
+
): # replace the last zero with the digit
|
2047 |
+
value = value[:-1] + str(ones)
|
2048 |
+
else:
|
2049 |
+
value = str(value) + str(ones)
|
2050 |
+
elif ones < 10:
|
2051 |
+
if value % 10 == 0:
|
2052 |
+
value += ones
|
2053 |
+
else:
|
2054 |
+
value = str(value) + str(ones)
|
2055 |
+
else: # eleven to nineteen
|
2056 |
+
if value % 100 == 0:
|
2057 |
+
value += ones
|
2058 |
+
else:
|
2059 |
+
value = str(value) + str(ones)
|
2060 |
+
elif current in self.ones_suffixed:
|
2061 |
+
# ordinal or cardinal; yield the number right away
|
2062 |
+
ones, suffix = self.ones_suffixed[current]
|
2063 |
+
if value is None:
|
2064 |
+
yield output(str(ones) + suffix)
|
2065 |
+
elif isinstance(value, str) or prev in self.ones:
|
2066 |
+
if prev in self.tens and ones < 10:
|
2067 |
+
yield output(value[:-1] + str(ones) + suffix)
|
2068 |
+
else:
|
2069 |
+
yield output(str(value) + str(ones) + suffix)
|
2070 |
+
elif ones < 10:
|
2071 |
+
if value % 10 == 0:
|
2072 |
+
yield output(str(value + ones) + suffix)
|
2073 |
+
else:
|
2074 |
+
yield output(str(value) + str(ones) + suffix)
|
2075 |
+
else: # eleven to nineteen
|
2076 |
+
if value % 100 == 0:
|
2077 |
+
yield output(str(value + ones) + suffix)
|
2078 |
+
else:
|
2079 |
+
yield output(str(value) + str(ones) + suffix)
|
2080 |
+
value = None
|
2081 |
+
elif current in self.tens:
|
2082 |
+
tens = self.tens[current]
|
2083 |
+
if value is None:
|
2084 |
+
value = tens
|
2085 |
+
elif isinstance(value, str):
|
2086 |
+
value = str(value) + str(tens)
|
2087 |
+
else:
|
2088 |
+
if value % 100 == 0:
|
2089 |
+
value += tens
|
2090 |
+
else:
|
2091 |
+
value = str(value) + str(tens)
|
2092 |
+
elif current in self.tens_suffixed:
|
2093 |
+
# ordinal or cardinal; yield the number right away
|
2094 |
+
tens, suffix = self.tens_suffixed[current]
|
2095 |
+
if value is None:
|
2096 |
+
yield output(str(tens) + suffix)
|
2097 |
+
elif isinstance(value, str):
|
2098 |
+
yield output(str(value) + str(tens) + suffix)
|
2099 |
+
else:
|
2100 |
+
if value % 100 == 0:
|
2101 |
+
yield output(str(value + tens) + suffix)
|
2102 |
+
else:
|
2103 |
+
yield output(str(value) + str(tens) + suffix)
|
2104 |
+
elif current in self.multipliers:
|
2105 |
+
multiplier = self.multipliers[current]
|
2106 |
+
if value is None:
|
2107 |
+
value = multiplier
|
2108 |
+
elif isinstance(value, str) or value == 0:
|
2109 |
+
f = to_fraction(value)
|
2110 |
+
p = f * multiplier if f is not None else None
|
2111 |
+
if f is not None and p.denominator == 1:
|
2112 |
+
value = p.numerator
|
2113 |
+
else:
|
2114 |
+
yield output(value)
|
2115 |
+
value = multiplier
|
2116 |
+
else:
|
2117 |
+
before = value // 1000 * 1000
|
2118 |
+
residual = value % 1000
|
2119 |
+
value = before + residual * multiplier
|
2120 |
+
elif current in self.multipliers_suffixed:
|
2121 |
+
multiplier, suffix = self.multipliers_suffixed[current]
|
2122 |
+
if value is None:
|
2123 |
+
yield output(str(multiplier) + suffix)
|
2124 |
+
elif isinstance(value, str):
|
2125 |
+
f = to_fraction(value)
|
2126 |
+
p = f * multiplier if f is not None else None
|
2127 |
+
if f is not None and p.denominator == 1:
|
2128 |
+
yield output(str(p.numerator) + suffix)
|
2129 |
+
else:
|
2130 |
+
yield output(value)
|
2131 |
+
yield output(str(multiplier) + suffix)
|
2132 |
+
else: # int
|
2133 |
+
before = value // 1000 * 1000
|
2134 |
+
residual = value % 1000
|
2135 |
+
value = before + residual * multiplier
|
2136 |
+
yield output(str(value) + suffix)
|
2137 |
+
value = None
|
2138 |
+
elif current in self.preceding_prefixers:
|
2139 |
+
# apply prefix (positive, minus, etc.) if it precedes a number
|
2140 |
+
if value is not None:
|
2141 |
+
yield output(value)
|
2142 |
+
|
2143 |
+
if next in self.words or next_is_numeric:
|
2144 |
+
prefix = self.preceding_prefixers[current]
|
2145 |
+
else:
|
2146 |
+
yield output(current)
|
2147 |
+
elif current in self.following_prefixers:
|
2148 |
+
# apply prefix (dollars, cents, etc.) only after a number
|
2149 |
+
if value is not None:
|
2150 |
+
prefix = self.following_prefixers[current]
|
2151 |
+
yield output(value)
|
2152 |
+
else:
|
2153 |
+
yield output(current)
|
2154 |
+
elif current in self.suffixers:
|
2155 |
+
# apply suffix symbols (percent -> '%')
|
2156 |
+
if value is not None:
|
2157 |
+
suffix = self.suffixers[current]
|
2158 |
+
if isinstance(suffix, dict):
|
2159 |
+
if next in suffix:
|
2160 |
+
yield output(str(value) + suffix[next])
|
2161 |
+
skip = True
|
2162 |
+
else:
|
2163 |
+
yield output(value)
|
2164 |
+
yield output(current)
|
2165 |
+
else:
|
2166 |
+
yield output(str(value) + suffix)
|
2167 |
+
else:
|
2168 |
+
yield output(current)
|
2169 |
+
elif current in self.specials:
|
2170 |
+
if next not in self.words and not next_is_numeric:
|
2171 |
+
# apply special handling only if the next word can be numeric
|
2172 |
+
if value is not None:
|
2173 |
+
yield output(value)
|
2174 |
+
yield output(current)
|
2175 |
+
elif current == "and":
|
2176 |
+
# ignore "and" after hundreds, thousands, etc.
|
2177 |
+
if prev not in self.multipliers:
|
2178 |
+
if value is not None:
|
2179 |
+
yield output(value)
|
2180 |
+
yield output(current)
|
2181 |
+
elif current == "double" or current == "triple":
|
2182 |
+
if next in self.ones or next in self.zeros:
|
2183 |
+
repeats = 2 if current == "double" else 3
|
2184 |
+
ones = self.ones.get(next, 0)
|
2185 |
+
value = str(value or "") + str(ones) * repeats
|
2186 |
+
skip = True
|
2187 |
+
else:
|
2188 |
+
if value is not None:
|
2189 |
+
yield output(value)
|
2190 |
+
yield output(current)
|
2191 |
+
elif current == "point":
|
2192 |
+
if next in self.decimals or next_is_numeric:
|
2193 |
+
value = str(value or "") + "."
|
2194 |
+
else:
|
2195 |
+
# should all have been covered at this point
|
2196 |
+
raise ValueError(f"Unexpected token: {current}")
|
2197 |
+
else:
|
2198 |
+
# all should have been covered at this point
|
2199 |
+
raise ValueError(f"Unexpected token: {current}")
|
2200 |
+
|
2201 |
+
if value is not None:
|
2202 |
+
yield output(value)
|
2203 |
+
|
2204 |
+
def preprocess(self, s: str):
|
2205 |
+
# replace "<number> and a half" with "<number> point five"
|
2206 |
+
results = []
|
2207 |
+
|
2208 |
+
segments = re.split(r"\band\s+a\s+half\b", s)
|
2209 |
+
for i, segment in enumerate(segments):
|
2210 |
+
if len(segment.strip()) == 0:
|
2211 |
+
continue
|
2212 |
+
if i == len(segments) - 1:
|
2213 |
+
results.append(segment)
|
2214 |
+
else:
|
2215 |
+
results.append(segment)
|
2216 |
+
last_word = segment.rsplit(maxsplit=2)[-1]
|
2217 |
+
if last_word in self.decimals or last_word in self.multipliers:
|
2218 |
+
results.append("point five")
|
2219 |
+
else:
|
2220 |
+
results.append("and a half")
|
2221 |
+
|
2222 |
+
s = " ".join(results)
|
2223 |
+
|
2224 |
+
# put a space at number/letter boundary
|
2225 |
+
s = re.sub(r"([a-z])([0-9])", r"\1 \2", s)
|
2226 |
+
s = re.sub(r"([0-9])([a-z])", r"\1 \2", s)
|
2227 |
+
|
2228 |
+
# but remove spaces which could be a suffix
|
2229 |
+
s = re.sub(r"([0-9])\s+(st|nd|rd|th|s)\b", r"\1\2", s)
|
2230 |
+
|
2231 |
+
return s
|
2232 |
+
|
2233 |
+
def postprocess(self, s: str):
|
2234 |
+
def combine_cents(m: Match):
|
2235 |
+
try:
|
2236 |
+
currency = m.group(1)
|
2237 |
+
integer = m.group(2)
|
2238 |
+
cents = int(m.group(3))
|
2239 |
+
return f"{currency}{integer}.{cents:02d}"
|
2240 |
+
except ValueError:
|
2241 |
+
return m.string
|
2242 |
+
|
2243 |
+
def extract_cents(m: Match):
|
2244 |
+
try:
|
2245 |
+
return f"¢{int(m.group(1))}"
|
2246 |
+
except ValueError:
|
2247 |
+
return m.string
|
2248 |
+
|
2249 |
+
# apply currency postprocessing; "$2 and ¢7" -> "$2.07"
|
2250 |
+
s = re.sub(r"([€£$])([0-9]+) (?:and )?¢([0-9]{1,2})\b", combine_cents, s)
|
2251 |
+
        s = re.sub(r"[€£$]0\.([0-9]{1,2})\b", extract_cents, s)
|
2252 |
+
|
2253 |
+
        # write "one(s)" instead of "1(s)", just for readability
|
2254 |
+
s = re.sub(r"\b1(s?)\b", r"one\1", s)
|
2255 |
+
|
2256 |
+
return s
|
2257 |
+
|
2258 |
+
def __call__(self, s: str):
|
2259 |
+
s = self.preprocess(s)
|
2260 |
+
s = " ".join(word for word in self.process_words(s.split()) if word is not None)
|
2261 |
+
s = self.postprocess(s)
|
2262 |
+
|
2263 |
+
return s
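The currency step in `postprocess` above can be tried in isolation. The following is a minimal sketch that reuses the same regular expression as the diff; `merge_currency` and `CURRENCY_WITH_CENTS` are hypothetical names introduced here for illustration and are not part of `text_normalizer.py`.

```python
import re

# Same pattern as in the diff: currency symbol, integer part,
# an optional "and", then a cents token of one or two digits.
CURRENCY_WITH_CENTS = re.compile(r"([€£$])([0-9]+) (?:and )?¢([0-9]{1,2})\b")


def merge_currency(s: str) -> str:
    """Fold a trailing cents token into the preceding currency amount."""

    def combine(m: re.Match) -> str:
        currency, integer, cents = m.group(1), m.group(2), int(m.group(3))
        # Zero-pad so that 7 cents becomes ".07" rather than ".7".
        return f"{currency}{integer}.{cents:02d}"

    return CURRENCY_WITH_CENTS.sub(combine, s)


print(merge_currency("$2 and ¢7"))                  # $2.07
print(merge_currency("it cost £10 ¢50 in total"))  # it cost £10.50 in total
```

Text without a matching currency/cents pair passes through unchanged, which is why this step can safely run on every transcript.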
|
2264 |
+
|
2265 |
+
|
2266 |
+
class EnglishSpellingNormalizer:
|
2267 |
+
"""
|
2268 |
+
Applies British-American spelling mappings as listed in [1].
|
2269 |
+
|
2270 |
+
[1] https://www.tysto.com/uk-us-spelling-list.html
|
2271 |
+
"""
|
2272 |
+
|
2273 |
+
def __init__(self, english_spelling_mapping):
|
2274 |
+
self.mapping = english_spelling_mapping
|
2275 |
+
|
2276 |
+
def __call__(self, s: str):
|
2277 |
+
return " ".join(self.mapping.get(word, word) for word in s.split())
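As a rough illustration of the word-level lookup used by `EnglishSpellingNormalizer`, the sketch below applies a tiny mapping to a sentence. The two-entry `sample_mapping` and the class name `SpellingNormalizerSketch` are assumptions made for this example; the real mapping is supplied by the caller.

```python
class SpellingNormalizerSketch:
    """Replace each whitespace-separated word via a dictionary lookup."""

    def __init__(self, mapping):
        self.mapping = mapping

    def __call__(self, s: str) -> str:
        # Words missing from the mapping are kept as they are.
        return " ".join(self.mapping.get(word, word) for word in s.split())


# Hypothetical two-entry mapping, for illustration only.
sample_mapping = {"colour": "color", "centre": "center"}
normalize_spelling = SpellingNormalizerSketch(sample_mapping)

print(normalize_spelling("the colour of the town centre"))
# the color of the town center
```

Because the lookup is on whole tokens, punctuation attached to a word (for example `colour,`) would not match; in the pipeline shown in this file, symbols are stripped before the spelling step runs.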
|
2278 |
+
|
2279 |
+
|
2280 |
+
class EnglishTextNormalizer:
|
2281 |
+
def __init__(self, english_spelling_mapping=abbr):
|
2282 |
+
self.ignore_patterns = r"\b(hmm|mm|mhm|mmm|uh|um)\b"
|
2283 |
+
self.replacers = {
|
2284 |
+
# common contractions
|
2285 |
+
r"\bwon't\b": "will not",
|
2286 |
+
r"\bcan't\b": "can not",
|
2287 |
+
r"\blet's\b": "let us",
|
2288 |
+
r"\bain't\b": "aint",
|
2289 |
+
r"\by'all\b": "you all",
|
2290 |
+
r"\bwanna\b": "want to",
|
2291 |
+
r"\bgotta\b": "got to",
|
2292 |
+
r"\bgonna\b": "going to",
|
2293 |
+
r"\bi'ma\b": "i am going to",
|
2294 |
+
r"\bimma\b": "i am going to",
|
2295 |
+
r"\bwoulda\b": "would have",
|
2296 |
+
r"\bcoulda\b": "could have",
|
2297 |
+
r"\bshoulda\b": "should have",
|
2298 |
+
r"\bma'am\b": "madam",
|
2299 |
+
# contractions in titles/prefixes
|
2300 |
+
r"\bmr\b": "mister ",
|
2301 |
+
r"\bmrs\b": "missus ",
|
2302 |
+
r"\bst\b": "saint ",
|
2303 |
+
r"\bdr\b": "doctor ",
|
2304 |
+
r"\bprof\b": "professor ",
|
2305 |
+
r"\bcapt\b": "captain ",
|
2306 |
+
r"\bgov\b": "governor ",
|
2307 |
+
r"\bald\b": "alderman ",
|
2308 |
+
r"\bgen\b": "general ",
|
2309 |
+
r"\bsen\b": "senator ",
|
2310 |
+
r"\brep\b": "representative ",
|
2311 |
+
r"\bpres\b": "president ",
|
2312 |
+
r"\brev\b": "reverend ",
|
2313 |
+
r"\bhon\b": "honorable ",
|
2314 |
+
r"\basst\b": "assistant ",
|
2315 |
+
r"\bassoc\b": "associate ",
|
2316 |
+
r"\blt\b": "lieutenant ",
|
2317 |
+
r"\bcol\b": "colonel ",
|
2318 |
+
r"\bjr\b": "junior ",
|
2319 |
+
r"\bsr\b": "senior ",
|
2320 |
+
r"\besq\b": "esquire ",
|
2321 |
+
            # perfect tenses; ideally this would cover any past participle, but that is harder
|
2322 |
+
r"'d been\b": " had been",
|
2323 |
+
r"'s been\b": " has been",
|
2324 |
+
r"'d gone\b": " had gone",
|
2325 |
+
r"'s gone\b": " has gone",
|
2326 |
+
r"'d done\b": " had done", # "'s done" is ambiguous
|
2327 |
+
r"'s got\b": " has got",
|
2328 |
+
# general contractions
|
2329 |
+
r"n't\b": " not",
|
2330 |
+
r"'re\b": " are",
|
2331 |
+
r"'s\b": " is",
|
2332 |
+
r"'d\b": " would",
|
2333 |
+
r"'ll\b": " will",
|
2334 |
+
r"'t\b": " not",
|
2335 |
+
r"'ve\b": " have",
|
2336 |
+
r"'m\b": " am",
|
2337 |
+
}
|
2338 |
+
self.standardize_numbers = EnglishNumberNormalizer()
|
2339 |
+
self.standardize_spellings = EnglishSpellingNormalizer(english_spelling_mapping)
|
2340 |
+
|
2341 |
+
def __call__(self, s: str):
|
2342 |
+
s = s.lower()
|
2343 |
+
|
2344 |
+
s = re.sub(r"[<\[][^>\]]*[>\]]", "", s) # remove words between brackets
|
2345 |
+
        s = re.sub(r"\(([^)]+?)\)", "", s)  # remove words between parentheses
|
2346 |
+
s = re.sub(self.ignore_patterns, "", s)
|
2347 |
+
s = re.sub(
|
2348 |
+
r"\s+'", "'", s
|
2349 |
+
) # standardize when there's a space before an apostrophe
|
2350 |
+
|
2351 |
+
for pattern, replacement in self.replacers.items():
|
2352 |
+
s = re.sub(pattern, replacement, s)
|
2353 |
+
|
2354 |
+
s = re.sub(r"(\d),(\d)", r"\1\2", s) # remove commas between digits
|
2355 |
+
s = re.sub(r"\.([^0-9]|$)", r" \1", s) # remove periods not followed by numbers
|
2356 |
+
s = remove_symbols_and_diacritics(
|
2357 |
+
s, keep=".%$¢€£"
|
2358 |
+
) # keep some symbols for numerics
|
2359 |
+
|
2360 |
+
s = self.standardize_numbers(s)
|
2361 |
+
s = self.standardize_spellings(s)
|
2362 |
+
|
2363 |
+
# now remove prefix/suffix symbols that are not preceded/followed by numbers
|
2364 |
+
s = re.sub(r"[.$¢€£]([^0-9])", r" \1", s)
|
2365 |
+
s = re.sub(r"([^0-9])%", r"\1 ", s)
|
2366 |
+
|
2367 |
+
s = re.sub(
|
2368 |
+
r"\s+", " ", s
|
2369 |
+
) # replace any successive whitespace characters with a space
|
2370 |
+
|
2371 |
+
return s
|
2372 |
+
|
2373 |
+
|
2374 |
+
text_normalizer = EnglishTextNormalizer()
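The `replacers` dictionary in `EnglishTextNormalizer` relies on insertion order: specific contractions such as `won't` are listed before the general `n't` rule, so they are rewritten first. A minimal sketch of that ordering, using a small subset of the patterns from the diff (`expand_contractions` is a hypothetical helper name):

```python
import re

# Subset of the patterns above; specific rules come before general ones.
replacers = {
    r"\bwon't\b": "will not",
    r"\bcan't\b": "can not",
    r"n't\b": " not",
    r"'re\b": " are",
}


def expand_contractions(s: str) -> str:
    s = s.lower()
    # dicts preserve insertion order, so "won't" is handled
    # before the generic "n't" rule can turn it into "wo not".
    for pattern, replacement in replacers.items():
        s = re.sub(pattern, replacement, s)
    return s


print(expand_contractions("We won't go, they're late and I didn't"))
# we will not go, they are late and i did not
```

If the general rule were listed first, `won't` would become `wo not`, which is why the order of the entries matters here.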
|
utils.py
ADDED
@@ -0,0 +1,991 @@
1 |
+
import colorsys
|
2 |
+
import json
|
3 |
+
import os
|
4 |
+
import random
|
5 |
+
from concurrent.futures import ThreadPoolExecutor
|
6 |
+
from dataclasses import dataclass, make_dataclass
|
7 |
+
from datetime import datetime
|
8 |
+
from io import BytesIO
|
9 |
+
|
10 |
+
import aiohttp
|
11 |
+
import evaluate
|
12 |
+
import numpy as np
|
13 |
+
import pandas as pd
|
14 |
+
import plotly.graph_objects as go
|
15 |
+
from huggingface_hub import hf_hub_download, list_repo_files
|
16 |
+
from pydub import AudioSegment
|
17 |
+
|
18 |
+
from constants import WHISPER_OPEN_AI_LINK
|
19 |
+
|
20 |
+
# Load the Word Error Rate (WER) metric from the evaluate library
|
21 |
+
wer_metric = evaluate.load("wer")
|
22 |
+
|
23 |
+
|
24 |
+
def compute_average_wer(results):
|
25 |
+
"""
|
26 |
+
Compute the average Word Error Rate (WER) for a list of transcription results.
|
27 |
+
|
28 |
+
:param results: List of dictionaries, each containing 'reference' and 'prediction' keys
|
29 |
+
:return: Average WER as a percentage, rounded to 2 decimal places
|
30 |
+
|
31 |
+
This function calculates the WER for each reference-prediction pair and returns
|
32 |
+
the average. If no predictions are provided, it returns 100% WER.
|
33 |
+
"""
|
34 |
+
references = [result["reference"] for result in results]
|
35 |
+
predictions = [result["prediction"] for result in results]
|
36 |
+
if len(predictions) == 0:
|
37 |
+
        return 100.0
|
38 |
+
return round(
|
39 |
+
wer_metric.compute(references=references, predictions=predictions) * 100.0,
|
40 |
+
2,
|
41 |
+
)
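To make the metric concrete, here is a dependency-free sketch of a corpus-level word error rate. It assumes that errors are pooled over all pairs (total word edits divided by total reference words), which is how the `evaluate` WER metric is understood to aggregate; `word_edit_distance` and `corpus_wer_percent` are hypothetical names and not part of `utils.py`.

```python
def word_edit_distance(ref_words, hyp_words):
    """Levenshtein distance over words (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def corpus_wer_percent(results):
    """Pooled WER in percent for dicts with 'reference' and 'prediction' keys."""
    total_words = sum(len(r["reference"].split()) for r in results)
    if not results or total_words == 0:
        return 100.0  # same convention as the docstring: no data means 100% WER
    errors = sum(
        word_edit_distance(r["reference"].split(), r["prediction"].split())
        for r in results
    )
    return round(errors / total_words * 100.0, 2)


sample = [
    {"reference": "the cat sat", "prediction": "the cat sat"},
    {"reference": "hello world", "prediction": "hello there world"},
]
print(corpus_wer_percent(sample))  # 20.0
```

One inserted word against five reference words gives 20%, which shows why pooled WER differs from averaging per-utterance rates (0% and 50% would average to 25%).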
|
42 |
+
|
43 |
+
|
44 |
+
def read_json_line_by_line(file_path):
|
45 |
+
"""
|
46 |
+
Read a JSON file line by line, parsing each line as a separate JSON object.
|
47 |
+
|
48 |
+
:param file_path: Path to the JSON file
|
49 |
+
:return: List of parsed JSON objects
|
50 |
+
|
51 |
+
This function is useful for reading large JSON files that contain one JSON object
|
52 |
+
per line. It handles JSON parsing errors gracefully, skipping invalid lines.
|
53 |
+
"""
|
54 |
+
data = []
|
55 |
+
with open(file_path, "r") as f:
|
56 |
+
for line in f:
|
57 |
+
try:
|
58 |
+
item = json.loads(line.strip())
|
59 |
+
data.append(item)
|
60 |
+
except json.JSONDecodeError:
|
61 |
+
print(f"Skipping invalid JSON in {file_path}: {line}")
|
62 |
+
return data
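The line-by-line reading pattern can be checked against a small temporary file. This is a sketch under stated assumptions: `read_json_lines` is a hypothetical variant that skips blank lines silently and counts invalid lines instead of printing them.

```python
import json
import os
import tempfile


def read_json_lines(path):
    """Parse one JSON object per line; return (records, number of bad lines)."""
    records, skipped = [], 0
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # blank lines are not treated as errors here
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError:
                skipped += 1
    return records, skipped


# Write a small sample file: two valid lines, one blank, one invalid.
with tempfile.NamedTemporaryFile(
    "w", suffix=".jsonl", delete=False, encoding="utf-8"
) as tmp:
    tmp.write('{"wer": 0.12}\n\nnot json\n{"wer": 0.30}\n')
    tmp_path = tmp.name

records, skipped = read_json_lines(tmp_path)
os.remove(tmp_path)
print(records, skipped)
```

Returning the skip count rather than printing lets a caller decide whether a noisy input file should abort processing.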
|
63 |
+
|
64 |
+
|
65 |
+
def group_wer(group):
|
66 |
+
"""
|
67 |
+
Calculate the Word Error Rate (WER) for a group of transcriptions.
|
68 |
+
|
69 |
+
:param group: DataFrame group containing 'normalized_reference' and 'normalized_prediction' columns
|
70 |
+
:return: Average WER for the group
|
71 |
+
|
72 |
+
This function is typically used with DataFrame groupby operations to calculate
|
73 |
+
WER for specific groups of transcriptions.
|
74 |
+
"""
|
75 |
+
return compute_average_wer(
|
76 |
+
group[["normalized_reference", "normalized_prediction"]]
|
77 |
+
.rename(
|
78 |
+
columns={
|
79 |
+
"normalized_reference": "reference",
|
80 |
+
"normalized_prediction": "prediction",
|
81 |
+
}
|
82 |
+
)
|
83 |
+
.to_dict("records")
|
84 |
+
)
|
85 |
+
|
86 |
+
|
87 |
+
def load_multilingual_results(csv_file):
|
88 |
+
"""
|
89 |
+
Load multilingual results from a CSV file into a pandas DataFrame.
|
90 |
+
|
91 |
+
:param csv_file: Path to the CSV file containing multilingual results
|
92 |
+
:return: DataFrame with the loaded results, or None if the file is not found
|
93 |
+
|
94 |
+
This function attempts to load a CSV file using pandas, handling potential
|
95 |
+
FileNotFoundError exceptions.
|
96 |
+
"""
|
97 |
+
try:
|
98 |
+
        df = pd.read_csv(csv_file)
|
99 |
+
return df
|
100 |
+
except FileNotFoundError:
|
101 |
+
return None
|
102 |
+
|
103 |
+
|
104 |
+
def download_dataset(repo_id, local_dir, remote_dir, path_includes=""):
|
105 |
+
"""
|
106 |
+
Download benchmark result files from a specified Hugging Face repository to a local directory.
|
107 |
+
|
108 |
+
:param repo_id: ID of the Hugging Face repository
|
109 |
+
:param local_dir: Local directory where downloaded files will be saved
|
110 |
+
:param remote_dir: Remote directory within the repository to download from
|
111 |
+
|
112 |
+
This function uses the Hugging Face Hub API to list and download files from a
|
113 |
+
specific directory in a repository. It forces the download to ensure up-to-date files.
|
114 |
+
"""
|
115 |
+
files = list_repo_files(repo_id, repo_type="dataset")
|
116 |
+
directory_files = [
|
117 |
+
file for file in files if file.startswith(remote_dir) and path_includes in file
|
118 |
+
]
|
119 |
+
with ThreadPoolExecutor() as executor:
|
120 |
+
executor.map(
|
121 |
+
lambda file: hf_hub_download(
|
122 |
+
repo_id=repo_id,
|
123 |
+
repo_type="dataset",
|
124 |
+
filename=file,
|
125 |
+
local_dir=local_dir,
|
126 |
+
force_download=True,
|
127 |
+
),
|
128 |
+
directory_files,
|
129 |
+
)
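The selection and fan-out steps of `download_dataset` can be sketched without network access. In the example below, `fetch` is a stand-in for `hf_hub_download`, and `select_files` and `fetch_all` are hypothetical helper names. One design point it illustrates: consuming the iterator returned by `executor.map` surfaces any exception raised in a worker, whereas an unconsumed `map` call leaves such failures unnoticed.

```python
from concurrent.futures import ThreadPoolExecutor


def select_files(files, remote_dir, path_includes=""):
    """Keep files under remote_dir whose path contains path_includes."""
    return [f for f in files if f.startswith(remote_dir) and path_includes in f]


def fetch_all(files, fetch):
    """Run fetch(file) in a thread pool and return results in input order."""
    with ThreadPoolExecutor() as executor:
        # list(...) forces evaluation, so worker exceptions are re-raised here.
        return list(executor.map(fetch, files))


# Hypothetical repository listing, for illustration only.
listing = [
    "benchmark_data/run1/a.json",
    "benchmark_data/run2/b.json",
    "other/c.json",
]
selected = select_files(listing, "benchmark_data", path_includes="run2")
fetched = fetch_all(selected, fetch=str.upper)  # str.upper stands in for a download
print(selected, fetched)
```

With the default `path_includes=""`, every file under `remote_dir` is selected, because the empty string is contained in any path.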
|
130 |
+
|
131 |
+
|
132 |
+
def process_file(file_path):
|
133 |
+
"""
|
134 |
+
Process a file containing JSON objects delimited by new lines.
|
135 |
+
|
136 |
+
:param file_path: Path to the file to be processed
|
137 |
+
:return: List of dictionaries, each representing a parsed JSON object
|
138 |
+
|
139 |
+
This function reads the file line by line, parsing each line as a JSON object.
|
140 |
+
It handles potential JSON decoding errors, printing error messages for invalid lines.
|
141 |
+
"""
|
142 |
+
data = []
|
143 |
+
with open(file_path, "r") as file:
|
144 |
+
for line in file:
|
145 |
+
line = line.strip()
|
146 |
+
if not line:
|
147 |
+
continue
|
148 |
+
try:
|
149 |
+
json_obj = json.loads(line)
|
150 |
+
data.append(json_obj)
|
151 |
+
except json.JSONDecodeError as e:
|
152 |
+
print(f"Error decoding JSON in line: {line}")
|
153 |
+
print(f"Error message: {str(e)}")
|
154 |
+
return data
|
155 |
+
|
156 |
+
|
157 |
+
def dir_to_json(root_dir, output_file):
|
158 |
+
"""
|
159 |
+
Convert a directory of benchmark result files to a single JSON file.
|
160 |
+
|
161 |
+
:param root_dir: Root directory containing the benchmark result files
|
162 |
+
:param output_file: Output file where the JSON data will be saved
|
163 |
+
|
164 |
+
This function walks through the directory structure, processes each file,
|
165 |
+
and writes the combined data to a single JSON file. It extracts metadata
|
166 |
+
from the file path and includes it in the JSON output.
|
167 |
+
"""
|
168 |
+
with open(output_file, "w") as outfile:
|
169 |
+
for subdir, _, files in os.walk(root_dir):
|
170 |
+
for file in files:
|
171 |
+
file_path = os.path.join(subdir, file)
|
172 |
+
# ignore .DS_Store and summary files
|
173 |
+
if file_path.endswith(".DS_Store") or "summary" in file_path:
|
174 |
+
continue
|
175 |
+
parts = file_path.split(os.sep)
|
176 |
+
print(parts)
|
177 |
+
model_version = parts[2]
|
178 |
+
device_name = parts[3].replace("_", " ")
|
179 |
+
os_type_version = parts[4]
|
180 |
+
dataset_name = parts[5]
|
181 |
+
timestamp_commit = parts[6].replace(".json", "")
|
182 |
+
timestamp, commit_hash, commit_timestamp = timestamp_commit.split("_")
|
183 |
+
|
184 |
+
data_list = process_file(file_path)
|
185 |
+
for data in data_list:
|
186 |
+
original_entry = {
|
187 |
+
"model": model_version.replace("_", "/"),
|
188 |
+
"device": device_name,
|
189 |
+
"os": os_type_version.replace("_", " "),
|
190 |
+
"wer": data["wer"],
|
191 |
+
"dataset_name": dataset_name,
|
192 |
+
"reference_transcription": data["reference_transcription"],
|
193 |
+
"prediction_transcription": data["prediction_transcription"],
|
194 |
+
"difference_transcription": data["difference_transcription"],
|
195 |
+
"audio_file_url": data["audio_file_url"],
|
196 |
+
"timestamp": timestamp.replace("-", ":").replace(":", "-", 2),
|
197 |
+
"commit_hash": commit_hash,
|
198 |
+
"commit_timestamp": commit_timestamp,
|
199 |
+
}
|
200 |
+
|
201 |
+
outfile.write(json.dumps(original_entry) + "\n")
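The metadata extraction in `dir_to_json` depends on a fixed directory layout. The sketch below applies the same index positions and the same timestamp rewrite to one path; the path itself, the helper name `parse_result_path`, and the use of `/` as the separator are assumptions made for this example.

```python
def parse_result_path(file_path: str) -> dict:
    """Extract run metadata from <root>/<root>/<model>/<device>/<os>/<dataset>/<file>."""
    parts = file_path.split("/")
    model_version, device_name, os_version, dataset_name = parts[2:6]
    stem = parts[6].replace(".json", "")
    timestamp, commit_hash, commit_timestamp = stem.split("_")
    return {
        "model": model_version.replace("_", "/"),
        "device": device_name.replace("_", " "),
        "os": os_version.replace("_", " "),
        "dataset_name": dataset_name,
        # Turn every "-" into ":", then restore the first two (the date part).
        "timestamp": timestamp.replace("-", ":").replace(":", "-", 2),
        "commit_hash": commit_hash,
        "commit_timestamp": commit_timestamp,
    }


# Hypothetical path, for illustration only.
example = parse_result_path(
    "data/results/openai_whisper-tiny/Apple_M1/macOS_14.5/librispeech/"
    "2024-06-17T21-04-15_abc1234_1718658255.json"
)
print(example["model"], example["timestamp"])
# openai/whisper-tiny 2024-06-17T21:04:15
```

The layout is positional, so a root directory with a different depth would shift every field; the indices would need adjusting in that case.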
|
202 |
+
|
203 |
+
|
204 |
+
async def download_audio_to_ndarray(url):
|
205 |
+
"""
|
206 |
+
Downloads an audio file from a URL and converts it to a NumPy array.
|
207 |
+
|
208 |
+
:param url: The URL of the audio file to download
|
209 |
+
:return: A tuple containing the sample rate and audio data as a NumPy array
|
210 |
+
|
211 |
+
This asynchronous function uses aiohttp to download the audio file,
|
212 |
+
converts it to an AudioSegment, and then to a NumPy array. It handles
|
213 |
+
both mono and stereo audio files.
|
214 |
+
"""
|
215 |
+
async with aiohttp.ClientSession() as session:
|
216 |
+
async with session.get(url) as response:
|
217 |
+
if response.status == 200:
|
218 |
+
audio_bytes = BytesIO(await response.read())
|
219 |
+
audio = AudioSegment.from_file(audio_bytes, format="mp3")
|
220 |
+
audio_data = np.array(audio.get_array_of_samples())
|
221 |
+
if audio.channels == 2:
|
222 |
+
audio_data = audio_data.reshape((-1, 2))
|
223 |
+
return audio.frame_rate, audio_data
|
224 |
+
else:
|
225 |
+
return None, None
|
226 |
+
|
227 |
+
|
228 |
+
async def play_audio(url):
|
229 |
+
"""
|
230 |
+
Wrapper function for Gradio to play audio from a URL.
|
231 |
+
|
232 |
+
:param url: The URL of the audio file to play
|
233 |
+
:return: A tuple of sample rate and audio data, or an error message
|
234 |
+
|
235 |
+
This function uses download_audio_to_ndarray to get the audio data
|
236 |
+
and returns it in a format suitable for Gradio's audio player.
|
237 |
+
"""
|
238 |
+
sample_rate, audio_data = await download_audio_to_ndarray(url)
|
239 |
+
if audio_data is None:
|
240 |
+
return "Error downloading the file"
|
241 |
+
else:
|
242 |
+
return sample_rate, audio_data


def get_filter_cond(df, model, device, os, dataset, timestamp=None):
    """
    Creates a filter condition for a DataFrame based on specified parameters.

    :param df: DataFrame containing the transcription data
    :param model: String representing the model name
    :param device: String representing the device name
    :param os: String representing the OS name
    :param dataset: String representing the dataset name
    :param timestamp: Optional timestamp for filtering (default: None)
    :return: A boolean mask for filtering the DataFrame

    This function constructs a complex boolean condition for filtering
    the DataFrame based on the provided parameters.
    """
    filter_cond = (
        (df["model"] == model)
        & (df["device"] == device)
        & (df["os"] == os)
        & (df["dataset_name"] == dataset)
    )
    return filter_cond & (df["timestamp"] == timestamp) if timestamp else filter_cond


def get_filtered_transcript(df, model, device, os, dataset, timestamp):
    """
    Retrieves filtered transcription data from a DataFrame.

    :param df: DataFrame containing the transcription data
    :param model: String representing the model name
    :param device: String representing the device name
    :param os: String representing the OS name
    :param dataset: String representing the dataset name
    :param timestamp: String representing the timestamp
    :return: A filtered DataFrame with transcription data

    This function applies a filter to the input DataFrame and returns
    relevant columns for transcription analysis.
    """
    filter_cond = get_filter_cond(df, model, device, os, dataset, timestamp)
    df = df[filter_cond][
        [
            "reference_transcription",
            "prediction_transcription",
            "difference_transcription",
            "audio_file_url",
        ]
    ]
    return df


def get_filtered_timestamps(df, model, device, os, dataset):
    """
    Retrieves unique timestamps for a specific model, device, OS, and dataset combination.

    :param df: DataFrame containing the transcription data
    :param model: String representing the model name
    :param device: String representing the device name
    :param os: String representing the OS name
    :param dataset: String representing the dataset name
    :return: A filtered DataFrame containing unique timestamps

    This function is useful for getting a list of available timestamps
    for a specific configuration, which can be used for further analysis or UI elements.
    """
    filter_cond = get_filter_cond(df, model, device, os, dataset)
    df = df[filter_cond][["timestamp"]].drop_duplicates()
    return df


def make_model_name_clickable_link(model):
    """
    Creates an HTML link to the Hugging Face model page.

    :param model: String representing the model name
    :return: An HTML string containing a clickable link to the model page

    This function generates a formatted HTML link that can be used in
    web interfaces to provide direct access to the model's page on Hugging Face.
    """
    return f"""<a style="color: #3B82F6; text-decoration: underline; text-decoration-style: dotted;" href="https://huggingface.co/argmaxinc/whisperkit-coreml/tree/main/{model.replace('/', '_')}" target="_blank">{model}</a>"""


def make_dataset_wer_clickable_link(row, dataset):
    """
    Creates a clickable link for the WER value of a dataset.

    :param row: Row containing the dataset WER value
    :param dataset: String representing the dataset name
    :return: An HTML string containing a clickable link to the dataset's WER details

    This function generates a formatted HTML link that can be used in
    web interfaces to provide access to detailed WER information for a specific dataset.
    """
    dataset_column = f"{dataset}"
    href = WHISPER_OPEN_AI_LINK.format(
        row["Model"].replace("/", "_"),
        dataset,
    )
    return f'<a style="color: #3B82F6; text-decoration: underline; text-decoration-style: dotted;" href="{href}">{row[dataset_column]}</a>'


def make_timestamp_clickable_link(model, dataset, timestamp):
    """
    Creates a clickable link for a timestamp.

    :param model: String representing the model name
    :param dataset: String representing the dataset name
    :param timestamp: Timestamp to be displayed and used in the link
    :return: An HTML string containing a clickable div for the timestamp

    This function generates a formatted HTML div that can be used as a clickable
    element in web interfaces, typically for displaying and interacting with specific timestamps.
    """
    elem_id = (
        f"{dataset}-{model}-{timestamp}".replace(" ", "_")
        .replace('"', "")
        .replace("'", "")
        .replace(",", "")
    )
    onclick = f"onclick=\"document.getElementById('{elem_id}').click();\""
    return f'<div style="color: #3B82F6; text-decoration: underline; text-decoration-style: dotted;" {onclick} href="#">{timestamp}</div>'
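
The element ID built in `make_timestamp_clickable_link` strips only spaces, double quotes, single quotes, and commas. As a hedged illustration, the sketch below pulls that step into a standalone helper so the mapping from a dataset/model/timestamp triple to an ID can be checked in isolation. `make_elem_id` is a hypothetical name that does not appear in the original file, and the sample values are made up.

```python
def make_elem_id(dataset, model, timestamp):
    """Mirror the replace chain used for the clickable timestamp element."""
    raw = f"{dataset}-{model}-{timestamp}"
    # Spaces become underscores; double quotes, single quotes and commas are dropped.
    return raw.replace(" ", "_").replace('"', "").replace("'", "").replace(",", "")


if __name__ == "__main__":
    # Hypothetical inputs, for illustration only.
    print(make_elem_id("librispeech", "openai_whisper-tiny", "2024-05-01 12:00"))
```

Characters such as `/` or `:` pass through unchanged, so whatever hidden element is meant to receive the click presumably has to be created with exactly the same string.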


def make_multilingual_model_clickable_link(model):
    """
    Creates a clickable link for a multilingual model name.

    :param model: String representing the model name
    :return: An HTML string containing a clickable div for the model name

    This function generates a formatted HTML div that can be used as a clickable
    element in web interfaces, typically for displaying and interacting with multilingual model names.
    """
    elem_id = (
        f"{model}".replace(" ", "_").replace('"', "").replace("'", "").replace(",", "")
    )
    onclick = f"onclick=\"document.getElementById('{elem_id}').click();\""
    return f'<div style="color: #3B82F6; text-decoration: underline; text-decoration-style: dotted;" {onclick} href="#">{model}</div>'


def plot_metric(
    df, y_axis_col, y_axis_title, fig_title, filter_input=None, exclude_input=None
):
    """
    Plots a metric for each model-device-OS group in a DataFrame.

    :param df: DataFrame containing the benchmark data
    :param y_axis_col: DataFrame column to use as the y-axis
    :param y_axis_title: Display name for the y-axis
    :param fig_title: Display title for the figure
    :param filter_input: Optional string to filter the model-device-OS combinations
    :param exclude_input: Optional string to exclude model-device-OS combinations
    :return: A Plotly figure object
    """
    grouped = df.groupby(["model", "device", "os"])
    sorted_groups = [group.sort_values("commit_timestamp") for _, group in grouped]

    if filter_input:
        filters = [f.strip().lower() for f in filter_input.split(";")]
        sorted_groups = [
            group
            for group in sorted_groups
            if any(
                f
                in f"{group['model'].iloc[0]}-{group['device'].iloc[0]}-{group['os'].iloc[0]}".lower()
                for f in filters
            )
        ]

    if exclude_input:
        excludes = [e.strip().lower() for e in exclude_input.split(";")]
        sorted_groups = [
            group
            for group in sorted_groups
            if not any(
                e
                in f"{group['model'].iloc[0]}-{group['device'].iloc[0]}-{group['os'].iloc[0]}".lower()
                for e in excludes
            )
        ]

    base_colors = ["#4542f4", "#0e0c06", "#ccf0a7", "#ff7f4e", "#ffd15a"]
    num_colors = len(sorted_groups)
    random_colors = generate_random_colors(base_colors, num_colors)
    fig = go.Figure()
    for i, group in enumerate(sorted_groups):
        model_device_os = (
            f"{group['model'].iloc[0]}-{group['device'].iloc[0]}-{group['os'].iloc[0]}"
        )
        fig.add_trace(
            go.Scatter(
                x=group["commit_timestamp"].apply(
                    lambda x: datetime.strptime(x, "%Y-%m-%dT%H%M%S").strftime(
                        "%Y-%m-%d %H:%M:%S"
                    )
                ),
                y=group[y_axis_col],
                mode="lines+markers",
                name=model_device_os,
                line=dict(color=random_colors[i % len(random_colors)]),
                marker=dict(color=random_colors[i % len(random_colors)]),
                hovertemplate=(
                    f"<b>{model_device_os}</b><br>"
                    "Timestamp: %{x}<br>"
                    f"{y_axis_title}: %{{y:.2f}}<br>"
                    "<extra></extra>"
                ),
            )
        )
    fig.update_layout(
        title=fig_title,
        xaxis_title="Commit Timestamp",
        yaxis_title=y_axis_title,
        legend_title="Model-Device-OS",
        width=1100,
        height=600,
        plot_bgcolor="rgb(250,249,244)",
    )
    return fig
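
The filter and exclude handling in `plot_metric` comes down to case-insensitive substring matching of `;`-separated terms against a `model-device-os` label, plus a reformatting of the compact commit timestamp. The sketch below reproduces just that logic on plain strings, without pandas or Plotly, so it can be tried in isolation. The helper names (`matches_any`, `select_labels`, `format_commit_timestamp`) and the sample labels are assumptions made for illustration; they are not part of the original file.

```python
from datetime import datetime


def matches_any(label, raw_terms):
    """Return True if any ';'-separated term is a substring of the label."""
    terms = [t.strip().lower() for t in raw_terms.split(";")]
    return any(t in label.lower() for t in terms)


def select_labels(labels, filter_input=None, exclude_input=None):
    """Keep labels that match the filter and do not match the exclusion."""
    if filter_input:
        labels = [lab for lab in labels if matches_any(lab, filter_input)]
    if exclude_input:
        labels = [lab for lab in labels if not matches_any(lab, exclude_input)]
    return labels


def format_commit_timestamp(stamp):
    """Convert e.g. '2024-05-01T123456' into '2024-05-01 12:34:56'."""
    return datetime.strptime(stamp, "%Y-%m-%dT%H%M%S").strftime("%Y-%m-%d %H:%M:%S")


if __name__ == "__main__":
    # Hypothetical labels in the model-device-os format.
    labels = [
        "openai_whisper-tiny-iPhone15-iOS 17",
        "openai_whisper-base-M2-macOS 14",
    ]
    print(select_labels(labels, filter_input="tiny; m2", exclude_input="macos"))
    print(format_commit_timestamp("2024-05-01T123456"))
```

One consequence of this matching style worth keeping in mind: a trailing `;` yields an empty term, and the empty string is a substring of every label, so a filter of `tiny;` would keep everything and an exclusion of `tiny;` would drop everything.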


def fields(raw_class):
    """
    Returns the fields of a dataclass.

    :param raw_class: The dataclass to inspect
    :return: List of fields in the dataclass

    This utility function extracts and returns all the fields defined in a dataclass,
    excluding special methods and attributes.
    """
    return [
        v for k, v in raw_class.__dict__.items() if k[:2] != "__" and k[-2:] != "__"
    ]


def get_os_name_and_version(os_string):
    """
    Extracts the OS name and major version from a string.

    :param os_string: String representing the OS name and version
    :return: Formatted string with OS name and major version

    This function splits the input string into OS name and version,
    then returns a formatted string with just the major version number.
    """
    os_name, os_version = os_string.split()
    os_version = os_version.split(".")[0]
    return f"{os_name} {os_version}"
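
`get_os_name_and_version` unpacks `os_string.split()` into exactly two names, so any input that is not two whitespace-separated tokens raises `ValueError`. If the benchmark data could ever contain other shapes, a more tolerant variant might look like the sketch below. `major_os_version` is a hypothetical helper, and it assumes the version is always the last token; neither is taken from the original file.

```python
def major_os_version(os_string):
    """Return '<name> <major>' and tolerate inputs that are not exactly two tokens."""
    # Split once from the right, assuming the version is the last token.
    parts = os_string.rsplit(maxsplit=1)
    if len(parts) != 2:
        # No separate version token: return the input unchanged (minus padding).
        return os_string.strip()
    name, version = parts
    return f"{name} {version.split('.')[0]}"


if __name__ == "__main__":
    for sample in ["iOS 17.5.1", "macOS 14", "Mac OS X 10.15", "visionOS"]:
        print(repr(sample), "->", repr(major_os_version(sample)))
```

For ordinary two-token inputs such as `iOS 17.5.1` this gives the same result as the original function.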


def create_initial_quality_column_dict():
    """
    Creates the initial column dictionary for the quality table.

    :return: A list of column dictionaries

    This function defines the basic structure of the quality table,
    including columns for model, average WER, and QoI (Quality of Implementation).
    """
    return [
        [
            "model",
            ColumnContent,
            ColumnContent("Model", "html", True, never_hidden=True),
        ],
        ["average_wer", ColumnContent, ColumnContent("Average WER", "html", True)],
        ["qoi", ColumnContent, ColumnContent("QoI", "html", True)],
    ]


def calculate_parity(m2_ultra_wer, row):
    """
    Calculates the WER parity between the M2 Ultra reference and the current row.

    :param m2_ultra_wer: Series of M2 Ultra WER values, indexed by model name
    :param row: Current row being processed
    :return: WER difference between M2 Ultra and the current row, or None if not applicable

    This function subtracts the row's average WER from the M2 Ultra WER for the
    same model and rounds the result to two decimal places, giving a measure of
    how far the row deviates from the M2 Ultra reference.
    """
    if row["Model"] in m2_ultra_wer.index:
        return round(m2_ultra_wer[row["Model"]] - row["Average WER"], 2)
    return None


def create_initial_performance_column_dict():
    """
    Creates the initial column dictionary for the performance table.

    :return: A list of column dictionaries

    This function defines the basic structure of the performance table,
    including columns for model, device, OS, parity, average WER, QoI, speed, and tokens per second.
    """
    return [
        [
            "model",
            ColumnContent,
            ColumnContent("Model", "html", True, never_hidden=True),
        ],
        [
            "device",
            ColumnContent,
            ColumnContent("Device", "html", True, never_hidden=True),
        ],
        ["os", ColumnContent, ColumnContent("OS", "html", True, never_hidden=True)],
        ["parity", ColumnContent, ColumnContent("Parity %", "html", False)],
        ["average_wer", ColumnContent, ColumnContent("Average WER", "html", False)],
        ["qoi", ColumnContent, ColumnContent("QoI", "html", False)],
        ["speed", ColumnContent, ColumnContent("Speed", "html", False)],
        ["toks", ColumnContent, ColumnContent("Tok / s", "html", False)],
    ]


def add_datasets_to_quality_columns(column_dict, datasets):
    """
    Adds dataset-specific columns to the quality table column dictionary.

    :param column_dict: The initial column dictionary
    :param datasets: List of dataset names to add
    :return: A dictionary containing the updated column dictionary and related metadata

    This function extends the quality table structure with columns for each dataset,
    and creates a dataclass to represent the table structure. It also generates
    metadata about the columns for use in the UI.
    """
    updated_column_dict = column_dict.copy()

    for dataset in datasets:
        field_name = dataset.replace("-", "")
        updated_column_dict.append(
            [field_name, ColumnContent, ColumnContent(dataset, "html", True)]
        )

    AutoEvalColumn = make_dataclass("AutoEvalColumn", updated_column_dict, frozen=True)

    COLS = [c.name for c in fields(AutoEvalColumn) if not c.hidden]
    TYPES = [c.type for c in fields(AutoEvalColumn) if not c.hidden]
    ALWAYS_HERE_COLS = [c.name for c in fields(AutoEvalColumn) if c.never_hidden]
    TOGGLE_COLS = [c.name for c in fields(AutoEvalColumn) if not c.never_hidden]
    SELECTED_COLS = [
        c.name
        for c in fields(AutoEvalColumn)
        if not c.never_hidden and c.displayed_by_default
    ]

    return {
        "column_dict": updated_column_dict,
        "AutoEvalColumn": AutoEvalColumn,
        "COLS": COLS,
        "TYPES": TYPES,
        "ALWAYS_HERE_COLS": ALWAYS_HERE_COLS,
        "TOGGLE_COLS": TOGGLE_COLS,
        "SELECTED_COLS": SELECTED_COLS,
    }
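
The pattern used here is easy to miss on a first read: `make_dataclass` receives `[name, type, default]` entries, the defaults (each a `ColumnContent` instance) end up as class attributes, and `fields` then recovers them by scanning the class `__dict__` for non-dunder keys. The sketch below shows that round trip using only the standard library. The trimmed `ColumnContent`, the class name `Columns`, and the three column entries are simplified stand-ins chosen for illustration, not the values the dashboard actually uses.

```python
from dataclasses import dataclass, make_dataclass


@dataclass(frozen=True)
class ColumnContent:
    # Trimmed copy of the column descriptor, enough for this illustration.
    name: str
    type: str
    displayed_by_default: bool
    hidden: bool = False
    never_hidden: bool = False


def fields(raw_class):
    # Non-dunder class attributes are the per-field default values.
    return [
        v for k, v in raw_class.__dict__.items() if k[:2] != "__" and k[-2:] != "__"
    ]


# Hypothetical column spec in the same [name, type, default] shape.
column_spec = [
    ["model", ColumnContent, ColumnContent("Model", "html", True, never_hidden=True)],
    ["average_wer", ColumnContent, ColumnContent("Average WER", "html", True)],
    ["librispeech", ColumnContent, ColumnContent("librispeech", "html", False)],
]
Columns = make_dataclass("Columns", column_spec, frozen=True)

visible = [c.name for c in fields(Columns) if not c.hidden]
always_shown = [c.name for c in fields(Columns) if c.never_hidden]
selected = [
    c.name for c in fields(Columns) if not c.never_hidden and c.displayed_by_default
]

if __name__ == "__main__":
    print(visible, always_shown, selected)
```

Because the first element of each entry becomes a dataclass field name, it has to be a valid Python identifier, which is presumably why the dataset names have their hyphens removed before being appended.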


def add_datasets_to_performance_columns(column_dict, datasets):
    """
    Adds dataset-specific columns to the performance table column dictionary.

    :param column_dict: The initial column dictionary
    :param datasets: List of dataset names to add
    :return: A dictionary containing the updated column dictionary and related metadata

    This function extends the performance table structure with columns for each dataset,
    adding both speed and tokens per second metrics. It also creates a dataclass to
    represent the table structure and generates metadata about the columns for use in the UI.
    """
    updated_column_dict = column_dict.copy()

    for dataset in datasets:
        field_name = dataset.replace("-", "")
        updated_column_dict.append(
            [
                f"{field_name}_speed",
                ColumnContent,
                ColumnContent(
                    f"{'Short-Form' if dataset == 'librispeech-10mins' else 'Long-Form'} Speed",
                    "html",
                    True,
                ),
            ]
        )
        updated_column_dict.append(
            [
                f"{field_name}_toks",
                ColumnContent,
                ColumnContent(
                    f"{'Short-Form' if dataset == 'librispeech-10mins' else 'Long-Form'} Tok/s",
                    "html",
                    True,
                ),
            ]
        )

    AutoEvalColumn = make_dataclass("AutoEvalColumn", updated_column_dict, frozen=True)

    COLS = [c.name for c in fields(AutoEvalColumn) if not c.hidden]
    TYPES = [c.type for c in fields(AutoEvalColumn) if not c.hidden]
    ALWAYS_HERE_COLS = [c.name for c in fields(AutoEvalColumn) if c.never_hidden]
    TOGGLE_COLS = [c.name for c in fields(AutoEvalColumn) if not c.never_hidden]
    SELECTED_COLS = [
        c.name
        for c in fields(AutoEvalColumn)
        if not c.never_hidden and c.displayed_by_default
    ]

    return {
        "column_dict": updated_column_dict,
        "AutoEvalColumn": AutoEvalColumn,
        "COLS": COLS,
        "TYPES": TYPES,
        "ALWAYS_HERE_COLS": ALWAYS_HERE_COLS,
        "TOGGLE_COLS": TOGGLE_COLS,
        "SELECTED_COLS": SELECTED_COLS,
    }


def create_confusion_matrix_plot(matrix, labels, is_forced):
    """
    Creates a confusion matrix plot for language detection.

    :param matrix: 2D numpy array representing the confusion matrix
    :param labels: List of language labels
    :param is_forced: Boolean indicating whether language hint was used
    :return: A Plotly figure object representing the confusion matrix

    This function generates a heatmap visualization of the confusion matrix
    for language detection, with customized layout and hover information.
    """
    fig = go.Figure(
        data=go.Heatmap(
            z=matrix,
            x=labels,
            y=labels,
            colorscale=[
                [0, "rgb(250,249,244)"],
                [0.5, "rgb(69,66,244)"],
                [1.0, "rgb(14,12,6)"],
            ],
            hoverongaps=False,
            hovertemplate="True: %{y}<br>Predicted: %{x}<br>Value: %{z}<extra></extra>",
        )
    )
    fig.update_layout(
        title=f'Language Detection Confusion Matrix with {"Language Hint" if is_forced else "Language Prediction by Model"}',
        xaxis_title="Predicted Language",
        yaxis_title="True Language",
        xaxis=dict(tickangle=-45),
        width=600,
        height=600,
        margin=dict(l=50, r=50, t=50, b=50),
    )
    return fig


def hex_to_rgb(hex_color):
    """
    Converts a hexadecimal color code to RGB values.

    :param hex_color: String representing a color in hexadecimal format
    :return: Tuple of three integers representing RGB values

    This function takes a hex color code and returns the corresponding
    RGB values as a tuple of integers.
    """
    hex_color = hex_color.lstrip("#")
    return tuple(int(hex_color[i : i + 2], 16) for i in (0, 2, 4))


def rgb_to_hex(rgb):
    """
    Converts RGB values to a hexadecimal color code.

    :param rgb: Tuple of three integers representing RGB values
    :return: String representing the color in hexadecimal format

    This function takes RGB values as a tuple and returns the corresponding
    hex color code as a string.
    """
    return "#{:02x}{:02x}{:02x}".format(*rgb)
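
A quick round trip makes the contract of the two converters concrete. The block below copies their bodies so that it runs on its own and uses `#4542f4`, one of the base colors from `plot_metric`, as the sample value. It assumes six-digit hex input; three-digit shorthand such as `#fff` is not handled by this slicing approach.

```python
def hex_to_rgb(hex_color):
    """Convert '#rrggbb' (or 'rrggbb') into an (r, g, b) tuple of ints."""
    hex_color = hex_color.lstrip("#")
    return tuple(int(hex_color[i : i + 2], 16) for i in (0, 2, 4))


def rgb_to_hex(rgb):
    """Convert an (r, g, b) tuple of ints into a lowercase '#rrggbb' string."""
    return "#{:02x}{:02x}{:02x}".format(*rgb)


if __name__ == "__main__":
    rgb = hex_to_rgb("#4542f4")
    print(rgb)              # (69, 66, 244)
    print(rgb_to_hex(rgb))  # #4542f4
```

The output is always lowercase, so an uppercase input such as `#FF7F4E` comes back as `#ff7f4e`.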


def interpolate_colors(color1, color2, factor):
    """
    Interpolates between two colors in HSV space.

    :param color1: First color in hexadecimal format
    :param color2: Second color in hexadecimal format
    :param factor: Float between 0 and 1, representing the interpolation factor
    :return: Interpolated color in hexadecimal format

    This function performs color interpolation in HSV color space, which can
    produce more visually pleasing results than simple RGB interpolation.
    """
    rgb1 = hex_to_rgb(color1)
    rgb2 = hex_to_rgb(color2)

    hsv1 = colorsys.rgb_to_hsv(*[x / 255.0 for x in rgb1])
    hsv2 = colorsys.rgb_to_hsv(*[x / 255.0 for x in rgb2])

    h = (hsv1[0] + factor * (hsv2[0] - hsv1[0])) % 1.0
    s = hsv1[1] + factor * (hsv2[1] - hsv1[1])
    v = hsv1[2] + factor * (hsv2[2] - hsv1[2])

    rgb = colorsys.hsv_to_rgb(h, s, v)
    return rgb_to_hex(tuple(int(x * 255) for x in rgb))
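
`interpolate_colors` moves linearly between the two hue values, so the path does not necessarily follow the shorter arc of the hue circle: halfway between pure red (hue 0) and pure blue (hue 2/3) it lands on hue 1/3, which is green. The sketch below is an alternative and not what the file does. It folds the hue difference into the range [-0.5, 0.5) so the interpolation goes the short way round, and it uses `round()` rather than `int()` when converting back to 8-bit channels. `interpolate_hsv_shortest` is a hypothetical name introduced for this example.

```python
import colorsys


def hex_to_rgb(hex_color):
    hex_color = hex_color.lstrip("#")
    return tuple(int(hex_color[i : i + 2], 16) for i in (0, 2, 4))


def interpolate_hsv_shortest(color1, color2, factor):
    """Interpolate in HSV, moving along the shorter arc of the hue circle."""
    h1, s1, v1 = colorsys.rgb_to_hsv(*(c / 255.0 for c in hex_to_rgb(color1)))
    h2, s2, v2 = colorsys.rgb_to_hsv(*(c / 255.0 for c in hex_to_rgb(color2)))

    # Signed hue difference folded into [-0.5, 0.5).
    dh = ((h2 - h1 + 0.5) % 1.0) - 0.5
    h = (h1 + factor * dh) % 1.0
    s = s1 + factor * (s2 - s1)
    v = v1 + factor * (v2 - v1)

    r, g, b = colorsys.hsv_to_rgb(h, s, v)
    # round() avoids truncating values such as 254.999... down to 254.
    return "#{:02x}{:02x}{:02x}".format(round(r * 255), round(g * 255), round(b * 255))


if __name__ == "__main__":
    # Red to blue: the short arc passes through magenta rather than green.
    print(interpolate_hsv_shortest("#ff0000", "#0000ff", 0.5))
```

Whether the shorter arc is preferable depends on the palette being aimed for; for generating visually distinct trace colors, the wider sweep of the original may well be intentional.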


def color_distance(color1, color2):
    """
    Calculates the Euclidean distance between two colors in RGB space.

    :param color1: First color in hexadecimal format
    :param color2: Second color in hexadecimal format
    :return: Float representing the distance between the two colors

    This function computes the Euclidean distance between two colors in RGB space,
    which can be used as a measure of color similarity.
    """
    rgb1 = hex_to_rgb(color1)
    rgb2 = hex_to_rgb(color2)
    return sum((a - b) ** 2 for a, b in zip(rgb1, rgb2)) ** 0.5


def generate_random_colors(base_colors, num_colors, min_distance=30):
    """
    Generates a list of random colors based on a set of base colors.

    :param base_colors: List of base colors in hexadecimal format
    :param num_colors: Number of colors to generate
    :param min_distance: Minimum distance between generated colors (default: 30)
    :return: List of generated colors in hexadecimal format

    This function creates a list of random colors by interpolating between
    the provided base colors. It attempts to maintain a minimum distance
    between colors to ensure visual distinctiveness.
    """
    generated_colors = []
    attempts = 0
    max_attempts = 1000

    while len(generated_colors) < num_colors and attempts < max_attempts:
        color1, color2 = random.sample(base_colors, 2)
        factor = random.random()
        new_color = interpolate_colors(color1, color2, factor)

        if all(color_distance(new_color, c) >= min_distance for c in generated_colors):
            generated_colors.append(new_color)
            attempts = 0
        else:
            attempts += 1

        if attempts > 100:
            if random.random() < 0.1:
                generated_colors.append(new_color)
                attempts = 0

    return generated_colors
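
Since `generate_random_colors` draws from the module-level `random` generator, the colors assigned to each trace will generally differ from one call to the next. If stable colors were wanted, one option is to pass in a seeded `random.Random` instance. The sketch below shows the same rejection-sampling idea (keep a candidate only if it is at least `min_distance` away from every color already accepted) with a seed. It is a simplified stand-in: it samples uniformly in RGB instead of interpolating between base colors, it has no forced-accept fallback, and `sample_distinct_colors` with its `seed` parameter is a hypothetical name that does not appear in the original file.

```python
import random


def hex_to_rgb(hex_color):
    hex_color = hex_color.lstrip("#")
    return tuple(int(hex_color[i : i + 2], 16) for i in (0, 2, 4))


def color_distance(color1, color2):
    """Euclidean distance between two hex colors in RGB space."""
    return sum((a - b) ** 2 for a, b in zip(hex_to_rgb(color1), hex_to_rgb(color2))) ** 0.5


def sample_distinct_colors(num_colors, min_distance=30, seed=0, max_attempts=1000):
    """Rejection-sample hex colors that are pairwise at least min_distance apart."""
    rng = random.Random(seed)  # local generator, so results repeat for a given seed
    colors = []
    attempts = 0
    while len(colors) < num_colors and attempts < max_attempts:
        candidate = "#{:02x}{:02x}{:02x}".format(*(rng.randrange(256) for _ in range(3)))
        # Accept only candidates far enough from everything chosen so far.
        if all(color_distance(candidate, c) >= min_distance for c in colors):
            colors.append(candidate)
        attempts += 1
    return colors


if __name__ == "__main__":
    print(sample_distinct_colors(5, seed=1))
```

Calling the function twice with the same seed returns the same list, which keeps a given model-device-OS trace on the same color across page reloads, provided the group order is also stable.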


@dataclass
class Task:
    """
    Dataclass representing a benchmark task.

    :param benchmark: String representing the benchmark name
    :param metric: String representing the metric used for evaluation
    :param col_name: String representing the column name in the results DataFrame
    """

    benchmark: str
    metric: str
    col_name: str


@dataclass(frozen=True)
class ColumnContent:
    """
    Dataclass representing a column in the results table.

    :param name: String representing the column name
    :param type: String representing the data type of the column
    :param displayed_by_default: Boolean indicating if the column should be displayed by default
    :param hidden: Boolean indicating if the column should be hidden (default: False)
    :param never_hidden: Boolean indicating if the column should never be hidden (default: False)
    :param dummy: Boolean indicating if this is a dummy column (default: False)
    """

    name: str
    type: str
    displayed_by_default: bool
    hidden: bool = False
    never_hidden: bool = False
    dummy: bool = False


css = """
/* @import rules must precede all other rules to take effect */
@import url('https://fonts.googleapis.com/css2?family=Noto+Color+Emoji&display=swap');
@import url('https://fonts.googleapis.com/css2?family=Sora:wght@300..400&display=swap');

@font-face {
    font-family: 'Zwizz Regular';
    font-style: normal;
    font-weight: normal;
    src: local('Zwizz Regular'), url('static/Zwizz-Regular.woff') format('woff');
}

@font-face {
    font-family: 'Zwizz Medium';
    font-style: normal;
    font-weight: normal;
    src: local('Zwizz Medium'), url('static/Zwizz-Medium.woff') format('woff');
}

@font-face {
    font-family: 'Zwizz SemiBold';
    font-style: normal;
    font-weight: normal;
    src: local('Zwizz SemiBold'), url('static/Zwizz-SemiBold.woff') format('woff');
}

/* Typography Scale */
h1, .h1 {
    font-family: 'Sora', sans-serif;
    font-weight: 300;
    font-size: 2em;
    letter-spacing: -0.05em;
}

h2, .h2 {
    font-family: 'Sora', sans-serif;
    font-weight: 400;
    letter-spacing: -0.05em;
}

h3, h4, h5, .h3, .h4, .h5 {
    font-family: 'Sora', sans-serif;
    font-weight: 400;
    letter-spacing: -0.05em;
}

h6, .h6, pre, code, .monospace {
    font-family: 'IBM Plex Mono', monospace;
    font-weight: 400;
    letter-spacing: 0.01em;
}

/* Add strong tag styling */
strong, b {
    font-family: 'Zwizz SemiBold', -apple-system, BlinkMacSystemFont, system-ui, sans-serif;
    letter-spacing: -0.02em;
}

/* Global Zwizz styles */
:root {
    --zwizz-spacing: -0.02em;
}

/* All Gradio elements should have Zwizz spacing */
.gradio-container * {
    letter-spacing: var(--zwizz-spacing);
    line-height: 1.7;
}

/* UI Elements */
.tab-buttons button, #models-to-add-text, .gradio-button {
    font-family: 'Sora', sans-serif;
    font-weight: 400;
    letter-spacing: -0.05em;
}

/* Specific Table Styling */
table, .table, th, td {
    font-family: 'IBM Plex Mono', 'Noto Color Emoji', sans-serif, monospace !important;
    font-weight: 400;
    letter-spacing: 0.01em;
}

/* Technical/Code Elements */
.code-block, .technical-text {
    font-family: 'IBM Plex Mono', monospace;
    font-weight: 400;
    letter-spacing: 0.01em;
}

/* Additional Elements */
#methodology-text p, #methodology-text li, .markdown-text {
    font-family: 'Zwizz Regular', -apple-system, BlinkMacSystemFont, system-ui, sans-serif;
    font-size: 16px !important;
    letter-spacing: var(--zwizz-spacing);
    line-height: 1.7;
}

/* Font weight utilities */
.zwizz-medium {
    font-family: 'Zwizz Medium', -apple-system, BlinkMacSystemFont, system-ui, sans-serif;
}

.zwizz-semibold {
    font-family: 'Zwizz SemiBold', -apple-system, BlinkMacSystemFont, system-ui, sans-serif;
}

/* Maintaining Original Layout Rules */
.gradio-container {
    max-width: 95% !important;
}

/* Table Layouts */
.large-table,
.large-table .table-wrap,
#multilingual-model-table .table-wrap,
#lookup-table .table-wrap {
    height: 35em !important;
    overflow-y: scroll !important;
}

/* SVG Container Rules */
.svg-container,
.main-svg {
    width: 100% !important;
}

.left-side-table .table-wrap {
    height: 15em !important;
    overflow-y: scroll !important;
}

#average-wer-table .table-wrap {
    height: 8em !important;
    overflow-y: scroll !important;
}

#general-wer-table .table-wrap {
    height: 35em !important;
    overflow-y: scroll !important;
}
"""