Introducing the Polish ASR Leaderboard (PAL) and Benchmark Intended Grouping of Open Speech (BIGOS) Corpora

Community Article · Published July 10, 2024

Introduction

In the rapidly evolving field of Automatic Speech Recognition (ASR), a robust ecosystem is essential for monitoring advancements and comparing the efficacy of various solutions. Inspired by initiatives like the Open ASR Leaderboard and the emerging science of benchmarks, AMU CAI (Adam Mickiewicz University Centre of Artificial Intelligence) is launching the BIGOS (Benchmark Intended Grouping of Open Speech) corpora and the Polish ASR Leaderboard (PAL).

Polish ASR Leaderboard Purpose

The mission of the Polish ASR Leaderboard (PAL) is to provide a dynamic evaluation ecosystem for Polish ASR. The platform levels the playing field by benchmarking commercial and openly accessible systems alike. Our vision is for PAL to serve as a comprehensive resource that informs potential ASR users about the advantages, limitations, and expected performance of ASR technology in various practical scenarios. We aim to bridge the gap between benchmarks conducted in controlled settings, typically reported in scientific publications, and the continuous, multi-aspect evaluations of real-world applications often conducted privately by Big Tech companies.

We aspire for the PAL leaderboard to become the go-to resource for anyone considering ASR technology for the Polish language (and other languages in the future). To achieve this, it is crucial to use comprehensive evaluation data that accurately represents specific use cases and language characteristics. This is achieved through the BIGOS (Benchmark Intended Grouping of Open Speech) corpora.

BIGOS Corpora Purpose

BIGOS aims to make open speech data usable for ASR by discovering, organizing, and refining existing datasets, making them more accessible and valuable for speech recognition development and evaluation. We aim to save the precious time of ASR researchers and developers by providing unified data formats and convenient management tools, leveraging industry best practices like the Hugging Face datasets framework (see the loading sketch below the list of collections).

Currently, the BIGOS curation process has been applied to two major collections:

  • PL ASR BIGOS V2: A collection of 12 well-known ASR speech datasets for Polish ASR development, including Google FLEURS, Facebook MLS, Mozilla Common Voice, and CLARIN-PL. Learn more here.
  • PL ASR PELCRA for BIGOS: A collection of annotated conversational speech data for linguistic research and ASR development created by the University of Łódź PELCRA group, including SpokesMix, SpokesBiz, and DiaBiz. Learn more here.
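To make the unified-format promise concrete, here is a minimal loading sketch using the Hugging Face `datasets` library. The repository id `amu-cai/pl-asr-bigos-v2`, the `test` split, and the schema details are assumptions based on the collection names above; check the dataset cards for the exact identifiers and fields.

```python
from datasets import load_dataset

# Stream the corpus so nothing has to be downloaded up front.
# The repository id and split are assumptions; see the dataset card.
bigos = load_dataset("amu-cai/pl-asr-bigos-v2", split="test", streaming=True)

# The unified schema should expose audio plus a reference transcription;
# print the keys of the first example instead of hard-coding field names.
first = next(iter(bigos))
print(sorted(first.keys()))
```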

Contributions

AMU BIGOS and the Polish ASR Leaderboard provide the community with:

  • The largest unified collection of open Polish speech datasets, curated for maximum evaluation utility and ease of use.
  • The most extensive benchmark of available ASR systems for Polish, covering both commercial and freely available systems.
  • An extendable data management framework for cataloging and curating ASR speech data. Learn more here.
  • An extendable evaluation framework for benchmarking new ASR systems; an illustrative adapter sketch follows this list. Learn more here.
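To illustrate the extendability idea, the sketch below shows what a plug-in adapter for a new ASR system could look like. This is a hypothetical interface for illustration only, not the actual BIGOS framework API; the open `openai-whisper` package is used as one possible backend.

```python
from abc import ABC, abstractmethod


class ASRSystem(ABC):
    """Hypothetical adapter interface a new system would implement
    to join the benchmark; NOT the actual BIGOS framework API."""

    @abstractmethod
    def transcribe(self, audio_path: str) -> str:
        """Return the hypothesis transcription for one audio file."""


class LocalWhisper(ASRSystem):
    """Example adapter backed by the open `openai-whisper` package."""

    def __init__(self, size: str = "base"):
        import whisper  # pip install openai-whisper
        self.model = whisper.load_model(size)

    def transcribe(self, audio_path: str) -> str:
        # Force Polish decoding to match the leaderboard's target language.
        return self.model.transcribe(audio_path, language="pl")["text"].strip()
```

An adapter like this keeps the benchmark loop agnostic to whether the system behind it is a local model or a commercial API.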

Evaluation Data, Scenarios, and Systems

The Polish ASR Leaderboard currently supports:

  • 25 ASR systems (10 commercial and 15 freely available), including state-of-the-art systems from OpenAI (Whisper), Google, Microsoft, Meta (MMS, wav2vec2), AssemblyAI, and more. The full list is available here.
  • Over 4,000 recordings sampled from 24 subsets of the BIGOS and PELCRA corpora, resulting in a linguistically and acoustically diverse evaluation set (a minimal benchmarking sketch follows this list).
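For readers who want to reproduce this kind of evaluation at small scale, here is a minimal sketch that runs one open model over a few streamed samples and scores it with WER. The dataset id, split, and field names are assumptions, and `jiwer` is just one common WER implementation, not necessarily the leaderboard's scorer.

```python
import jiwer  # pip install jiwer
from datasets import load_dataset
from transformers import pipeline

# A small open model keeps the sketch cheap; the leaderboard covers far more systems.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")

# Dataset id, split, and field names are assumptions; check the dataset card.
data = load_dataset("amu-cai/pl-asr-bigos-v2", split="test", streaming=True)

refs, hyps = [], []
for example in data.take(10):  # a handful of samples, for illustration only
    audio = example["audio"]
    out = asr(
        {"raw": audio["array"], "sampling_rate": audio["sampling_rate"]},
        generate_kwargs={"language": "polish"},
    )
    hyps.append(out["text"].lower().strip())
    refs.append(example["text"].lower().strip())  # "text" is an assumed field name

print(f"WER over 10 samples: {jiwer.wer(refs, hyps):.2%}")
```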

Major Findings

  • Whisper Large and AssemblyAI demonstrate the strongest performance on both the BIGOS and PELCRA test sets.
  • Median WER (Word Error Rate; see the definition below this list) for read speech in the BIGOS corpora (14%) is lower than for conversational speech in the PELCRA corpora (31%).
  • Average accuracy was comparable between free and commercial systems (a difference of 2 percentage points on BIGOS and 4 percentage points on PELCRA).
  • The best free model is Whisper Large, followed by NVIDIA NeMo, MMS, and wav2vec2; Whisper Base and Tiny scored lowest.
  • Larger models generally perform better, with the exception of NVIDIA NeMo, which delivers strong performance relative to its small size of only 20–120 million parameters.
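For reference, WER counts the word-level substitutions (S), deletions (D), and insertions (I) a system makes against a reference transcription of N words:

$$
\mathrm{WER} = \frac{S + D + I}{N}
$$

For example, a 10-word reference transcribed with one substitution and one deletion yields a WER of 2/10 = 20%.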

Visit the leaderboard for the full results.

Known Limitations

  • Data Quality: Despite efforts to curate open data, some recordings and transcriptions are of subpar quality. We continuously refine the BIGOS corpora to eliminate such examples.
  • Data Representativeness: Open datasets can become outdated. To keep the ASR leaderboard representative of real-world capabilities, it is essential to systematically add new datasets and analyze ASR performance across various sociodemographic dimensions.
  • Risk of Leakage: Since the BIGOS corpora originate from public resources, there is a risk that evaluated systems were trained on the test data. Including undisclosed, held-out test sets in the future can mitigate this; the leaderboard already supports adding new private test sets to ensure fair comparison.
  • Limited Language Support: Currently, BIGOS and PAL are limited to Polish. Expanding this data curation process to other languages could reduce the cost of delivering comprehensive ASR benchmarks, although data preparation remains resource-intensive.

End Vision

We aim to bridge the gap between academic research and practical applications by incorporating a variety of benchmarks that correspond to real-world use cases. We are also organizing an open Polish ASR challenge for the community, and the best scores from the challenge will be incorporated into the leaderboard. Through these efforts, we hope to advance the field by providing a platform that accurately measures and drives the progress of ASR for the Polish language and beyond.

Call to Action

We welcome feedback from the research community and industry practitioners to ensure that the benchmarks remain rigorous, comprehensive, and up-to-date. If you develop ASR systems or speech datasets and would like to collaborate, please contact michal.junczyk@amu.edu.pl.