🚩 Report: Ethical issue(s)

#2
by wasertech - opened

It's crucial to address a fundamental concern when comparing ASR models on this leaderboard: many of the models were trained on custom splits, which unfortunately don't allow the community to verify their results or ensure that they haven't encountered the data used for testing. This creates an uneven playing field for comparing architectures and models effectively.

Before models are even considered for testing and inclusion on the leaderboard, there should be a rigorous evaluation process in place to scrutinize their training procedures and data handling practices to maintain transparency and fairness.

I invite you to review my comprehensive analysis, available here on my fork of this leaderboard, where I've delved into these crucial issues and their impact on ASR model evaluations. Your insights and feedback are highly appreciated as we work together to improve the integrity and reliability of ASR model comparisons.

I second the points made by @wasertech (here and in the given link). Dataset cards, and any related papers, must document the dataset joining and splitting practices, or better yet, make the splits and/or the splitting code publicly available (open source) so that the experiments can be reproduced, which is a must in the scientific approach. Simply saying "custom splits" does not make the results scientific or reliable.
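
For instance, here is a minimal sketch of how a custom split could be made reproducible and shareable, assuming the Hugging Face `datasets` library (the dataset id, config, and target repo below are placeholders, not a reference to any specific leaderboard entry):

```python
from datasets import load_dataset

# Placeholder dataset and config; any merged ASR corpus would work the same way
# (gated datasets may require accepting their terms and authenticating first).
ds = load_dataset("mozilla-foundation/common_voice_13_0", "tr", split="train+validation")

# A fixed seed makes the split deterministic, so anyone can regenerate it
# from this script alone.
splits = ds.train_test_split(test_size=0.1, seed=42)

# Publishing the resulting splits (or just this script with its seed) lets
# reviewers verify that no test utterance ever reached the training set.
splits.push_to_hub("my-org/asr-reproducible-split")  # placeholder repo id
```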

Since these models can be trained on any combination of merged datasets, testing them against each other in such a leaderboard scenario seems to require a dedicated test set that is never used in any training. Otherwise, we will be comparing apples with oranges.
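
Even a crude text-level check could flag part of this leakage. A rough sketch, assuming transcripts are available as plain strings (the function names and toy data are purely illustrative, and a real audit would also have to compare the audio itself, not just the text):

```python
import re

def normalize(text: str) -> str:
    # Lowercase and drop punctuation so formatting differences don't hide a match.
    return re.sub(r"[^\w\s]", "", text.lower()).strip()

def find_leaked(train_transcripts, test_transcripts):
    # Return test utterances whose normalized transcript also appears in the training data.
    seen = {normalize(t) for t in train_transcripts}
    return [t for t in test_transcripts if normalize(t) in seen]

# Toy usage with made-up sentences.
leaked = find_leaked(
    ["Merhaba dünya!", "Bugün hava çok güzel."],
    ["merhaba dünya", "Yarın yağmur yağacak."],
)
print(f"{len(leaked)} potentially leaked test utterance(s)")
```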

As @bozden rightly pointed out, when evaluating ASR models in a leaderboard scenario, the potential for comparing 'apples with oranges' due to variations in training data poses a significant challenge. Creating a specialized test set that is never used in any training is a sensible step to ensure fair model-to-model comparisons.

However, it's also essential to maintain transparency and open participation within the ASR community. Open-source test sets play a pivotal role in promoting transparency, enabling external validation, and fostering community collaboration. Striking a balance between an open-source test set and a confidential, specialized test set may be a viable solution.

Ultimately, the decision between these approaches should align with the goals and values of the ASR community and the organizers of the leaderboard. Finding common ground that addresses the concerns of all stakeholders can lead to a more robust and trustworthy evaluation process, which is why everyone needs to be involved. Collaboration and open dialogue within the ASR community will be instrumental in crafting a solution that ensures both fairness in model comparisons and the transparency we all value.
