Introducing the Polish ASR Leaderboard (PAL) and Benchmark Intended Grouping of Open Speech (BIGOS) Corpora

Community Article · Published July 10, 2024

Introduction

In the rapidly evolving field of Automatic Speech Recognition (ASR), a robust ecosystem is essential for monitoring advancements and comparing the efficacy of various solutions. Inspired by initiatives like the Open ASR Leaderboard and the emerging science of benchmarks, AMU CAI (Adam Mickiewicz University Centre of Artificial Intelligence) is launching the BIGOS (Benchmark Intended Grouping of Open Speech) corpora and the Polish ASR Leaderboard (PAL).

Polish ASR Leaderboard Purpose

The mission of the Polish ASR Leaderboard (PAL) is to provide a dynamic evaluation ecosystem for Polish ASR. The platform levels the playing field by benchmarking commercial and openly accessible systems alike. Our vision is for PAL to serve as a comprehensive resource that informs potential ASR users about the advantages, limitations, and expected performance of ASR technology in various practical scenarios. We aim to bridge the gap between benchmarks conducted in controlled settings, typically reported in scientific publications, and the continuous, multi-aspect evaluations of real-world applications often conducted privately by Big Tech companies.

We aspire for the PAL leaderboard to become the go-to resource for anyone considering ASR technology for the Polish language (and other languages in the future). To achieve this, it is crucial to use comprehensive evaluation data that accurately represents specific use cases and language characteristics. This is achieved through the BIGOS (Benchmark Intended Grouping of Open Speech) corpora.

BIGOS Corpora Purpose

BIGOS aims to make open speech data usable for ASR by discovering, organizing, and refining existing datasets, making them more accessible and valuable for speech recognition development and evaluation. We aim to save the precious time of ASR researchers and developers by providing unified data formats and convenient management tools, leveraging industry best practices like the Hugging Face datasets framework (see the loading sketch below the list of collections).

Currently, the BIGOS curation process has been applied to two major collections:

  • PL ASR BIGOS V2: A collection of 12 well-known ASR speech datasets for Polish ASR development, including Google FLEURS, Facebook MLS, Mozilla Common Voice, and CLARIN-PL. Learn more here.
  • PL ASR PELCRA for BIGOS: A collection of annotated conversational speech data for linguistic research and ASR development created by the University of Łódź PELCRA group, including SpokesMix, SpokesBiz, and DiaBiz. Learn more here.
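To make the unified-format promise concrete, here is a minimal loading sketch using the Hugging Face `datasets` library. The repository id `amu-cai/pl-asr-bigos-v2`, the `test` split, and the schema details are assumptions based on the collection names above; check the dataset cards for the exact identifiers and fields.

```python
from datasets import load_dataset

# Stream the corpus so nothing has to be downloaded up front.
# The repository id and split are assumptions; see the dataset card.
bigos = load_dataset("amu-cai/pl-asr-bigos-v2", split="test", streaming=True)

# The unified schema should expose audio plus a reference transcription;
# print the keys of the first example instead of hard-coding field names.
first = next(iter(bigos))
print(sorted(first.keys()))
```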

Contributions

AMU BIGOS and the Polish ASR Leaderboard provide the community with:

  • The largest unified collection of open Polish speech datasets, curated for maximum evaluation utility and ease of use.
  • The most extensive benchmark of available ASR systems for Polish, covering both commercial and freely available systems.
  • An extendable data management framework for cataloging and curating ASR speech data. Learn more here.
  • An extendable evaluation framework for benchmarking new ASR systems; an illustrative adapter sketch follows this list. Learn more here.
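To illustrate the extendability idea, the sketch below shows what a plug-in adapter for a new ASR system could look like. This is a hypothetical interface for illustration only, not the actual BIGOS framework API; the open `openai-whisper` package is used as one possible backend.

```python
from abc import ABC, abstractmethod


class ASRSystem(ABC):
    """Hypothetical adapter interface a new system would implement
    to join the benchmark; NOT the actual BIGOS framework API."""

    @abstractmethod
    def transcribe(self, audio_path: str) -> str:
        """Return the hypothesis transcription for one audio file."""


class LocalWhisper(ASRSystem):
    """Example adapter backed by the open `openai-whisper` package."""

    def __init__(self, size: str = "base"):
        import whisper  # pip install openai-whisper
        self.model = whisper.load_model(size)

    def transcribe(self, audio_path: str) -> str:
        # Force Polish decoding to match the leaderboard's target language.
        return self.model.transcribe(audio_path, language="pl")["text"].strip()
```

An adapter like this keeps the benchmark loop agnostic to whether the system behind it is a local model or a commercial API.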

Evaluation Data, Scenarios, and Systems

The Polish ASR Leaderboard currently supports:

  • 25 ASR systems (10 commercial and 15 freely available), including state-of-the-art systems from OpenAI (Whisper), Google, Microsoft, Meta (MMS, wav2vec2), AssemblyAI, and more. The full list is available here.
  • Over 4,000 recordings sampled from 24 subsets of the BIGOS and PELCRA corpora, resulting in a linguistically and acoustically diverse evaluation set (a minimal benchmarking sketch follows this list).
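For readers who want to reproduce this kind of evaluation at small scale, here is a minimal sketch that runs one open model over a few streamed samples and scores it with WER. The dataset id, split, and field names are assumptions, and `jiwer` is just one common WER implementation, not necessarily the leaderboard's scorer.

```python
import jiwer  # pip install jiwer
from datasets import load_dataset
from transformers import pipeline

# A small open model keeps the sketch cheap; the leaderboard covers far more systems.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")

# Dataset id, split, and field names are assumptions; check the dataset card.
data = load_dataset("amu-cai/pl-asr-bigos-v2", split="test", streaming=True)

refs, hyps = [], []
for example in data.take(10):  # a handful of samples, for illustration only
    audio = example["audio"]
    out = asr(
        {"raw": audio["array"], "sampling_rate": audio["sampling_rate"]},
        generate_kwargs={"language": "polish"},
    )
    hyps.append(out["text"].lower().strip())
    refs.append(example["text"].lower().strip())  # "text" is an assumed field name

print(f"WER over 10 samples: {jiwer.wer(refs, hyps):.2%}")
```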

Major Findings

  • Whisper Large and AssemblyAI demonstrate the strongest performance on both the BIGOS and PELCRA test sets.
  • Median WER (Word Error Rate; see the definition below this list) for read speech in the BIGOS corpora (14%) is lower than for conversational speech in the PELCRA corpora (31%).
  • Average accuracy was comparable between free and commercial systems (a difference of 2 percentage points on BIGOS and 4 percentage points on PELCRA).
  • The best free model is Whisper Large, followed by NVIDIA NeMo, MMS, and wav2vec2; Whisper Base and Tiny scored lowest.
  • Larger models generally perform better, with the exception of NVIDIA NeMo, which delivers strong performance relative to its small size of only 20–120 million parameters.
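For reference, WER counts the word-level substitutions (S), deletions (D), and insertions (I) a system makes against a reference transcription of N words:

$$
\mathrm{WER} = \frac{S + D + I}{N}
$$

For example, a 10-word reference transcribed with one substitution and one deletion yields a WER of 2/10 = 20%.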

Visit the leaderboard for the full results.

Known Limitations

  • Data Quality: Despite efforts to curate open data, some recordings and transcriptions are of subpar quality. We continuously refine the BIGOS corpora to eliminate such examples.
  • Data Representativeness: Open datasets can become outdated. To keep the ASR leaderboard representative of real-world capabilities, it is essential to systematically add new datasets and analyze ASR performance across various sociodemographic dimensions.
  • Risk of Leakage: Since the BIGOS corpora originate from public resources, there is a risk that evaluated systems were trained on the test data. Including undisclosed, held-out test sets in the future can mitigate this; the leaderboard already supports adding new private test sets to ensure fair comparison.
  • Limited Language Support: Currently, BIGOS and PAL are limited to Polish. Expanding this data curation process to other languages could reduce the cost of delivering comprehensive ASR benchmarks, although data preparation remains resource-intensive.

End Vision

We aim to bridge the gap between academic research and practical applications by incorporating a variety of benchmarks that correspond to real-world use cases. We are also organizing an open Polish ASR challenge for the community, and the best scores from the challenge will be incorporated into the leaderboard. Through these efforts, we hope to advance the field by providing a platform that accurately measures and drives the progress of ASR for the Polish language and beyond.

Call to Action

We welcome feedback from the research community and industry practitioners to ensure that the benchmarks remain rigorous, comprehensive, and up-to-date. If you develop ASR systems or speech datasets and would like to collaborate, please contact michal.junczyk@amu.edu.pl.