open_asr_leaderboard

wasertech committed on Sep 9, 2023

Commit 1ef2735 (1 parent: 7abdbe6)

update custom message

constants.py +17 -0

@@ -115,4 +115,21 @@ The CommonVoice Test provides a Word Error Rate (WER) within a 20-point margin o
 Moreover, selecting the model with the lowest WER on Common Voice aligns with selecting the model with the lowest average WER, so this approach ranks the best-performing models reliably. However, as the average WER increases, the spread of results widens, which makes it harder to identify the worst-performing models. The size of the Common Voice test split for a given language is a crucial factor here. ASR model selection therefore calls for a nuanced approach that takes dataset characteristics, among other factors, into account.
 Additionally, it has been brought to our attention that Nvidia's models, trained using NeMo with custom splits from common datasets, including Common Voice, may have had an advantage because they may have seen parts of the Common Voice test set during training. This could explain their strong performance in the results. Transparency in model training and dataset usage is crucial for fair comparisons in the ASR field and for ensuring that results reflect real-world scenarios.
+Custom splits and data leakage during training can produce misleading results and make it hard to compare architectures fairly.
+To address these concerns and keep the metrics on the leaderboard reliable:
+1. **Transparency in Training Data**: Model submissions should state which training data was used, including whether the model has seen any of the test sets used for evaluation. This lets the community assess the validity of the results.
+2. **Standardized Evaluation**: Use the same standardized evaluation datasets and testing procedures for all models. This limits the room for data leakage and keeps comparisons fair.
+3. **Verification and Validation**: Implement verification processes to check the integrity of submitted models. These could include cross-checks of a model's training data against the evaluation test sets to catch problems with custom splits or data leakage.
+4. **Community Engagement**: Encourage active participation and feedback from the ASR community. Regular discussions and collaborations can help identify and address issues with data integrity and model evaluation.
+5. **Documentation**: Models added to the leaderboard should come with comprehensive documentation covering dataset usage, preprocessing steps, and any custom splits employed during training.
+By focusing on these aspects, we can strengthen trust in the leaderboard's metrics and ensure that the reported results accurately represent each model's performance. Maintaining transparency and data integrity requires the community to work together.
 """
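
One way the verification step in item 3 might be approached is an overlap check between a model's training transcripts and the evaluation test set. The sketch below is not part of this commit or of the leaderboard code; the helper names (`normalize`, `fingerprint`, `leaked_fraction`) and the sample data are hypothetical. It also assumes that matching transcripts are a usable signal: Common Voice prompts can be read by several speakers, so a transcript match only flags a candidate for closer inspection, and comparing clip identifiers or audio hashes would be stronger evidence where they are available.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def fingerprint(text: str) -> str:
    """Hash the normalized transcript so large sets compare cheaply."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def leaked_fraction(train_transcripts, test_transcripts) -> float:
    """Return the share of test transcripts that also occur in training."""
    train_hashes = {fingerprint(t) for t in train_transcripts}
    test_list = list(test_transcripts)
    if not test_list:
        return 0.0
    hits = sum(fingerprint(t) in train_hashes for t in test_list)
    return hits / len(test_list)


# Hypothetical data, for illustration only.
train = ["Hello, world!", "The quick brown fox."]
test = ["hello world", "A sentence never seen in training."]
print(leaked_fraction(train, test))  # 0.5
```

A fraction well above zero would suggest that the custom training split overlaps the test set and that the reported WER deserves a second look.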