Fix TypeError during data collection

#11
by jkassemi - opened

The dataset args in a model's metadata can be a plain string (for example, 'language hi') rather than the expected dict with a "language" key. When parsing such a record, the application raises "TypeError: string indices must be integers" and fails to load. This fix checks that args is a dict before reading from it. If it is not, it falls back to the existing default behavior: using the "language" value from the model's metadata.
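As a rough illustration of the check described above (not the exact diff in this branch), the logic could look like the following. The function name resolve_language and the trimmed sample record are hypothetical and used only for illustration; the field names come from the record shown in this PR.

```python
from typing import Any, Optional


def resolve_language(dataset: dict, meta: dict) -> Optional[Any]:
    """Pick the language for one result's dataset entry (hypothetical helper)."""
    args = dataset.get("args")
    # Only index into args when it really is a dict. Indexing a string such as
    # "language hi" with a key raises "TypeError: string indices must be integers".
    if isinstance(args, dict) and "language" in args:
        return args["language"]
    # Fallback: the "language" value from the model's metadata.
    return meta.get("language")


# Trimmed version of the failing record: args is a string, not a dict.
meta = {"language": ["hi"]}
dataset = {
    "name": "Common Voice 11.0",
    "config": "hi",
    "split": "test",
    "args": "language hi",
}
print(resolve_language(dataset, meta))
```

With the string-valued args, the lookup is skipped and the value from the model metadata is returned instead of raising.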

I'm happy to reach out to the one model owner with the non-standard configuration, though it does look like it may have been generated by 🤗 Trainer: https://huggingface.co/sanchit-gandhi/whisper-small-hi/edit/main/README.md.

Here's the record that causes the error in production:

meta: {'language': ['hi'], 'license': 'apache-2.0', 'tags': ['hf-asr-leaderboard', 'generated_from_trainer'], 'datasets': ['mozilla-foundation/common_voice_11_0'], 'metrics': ['wer'], 'model-index': [{'name': 'Whisper Small Hi - Sanchit Gandhi', 'results': [{'task': {'name': 'Automatic Speech Recognition', 'type': 'automatic-speech-recognition'}, 'dataset': {'name': 'Common Voice 11.0', 'type': 'mozilla-foundation/common_voice_11_0', 'config': 'hi', 'split': 'test', 'args': 'language hi'}, 'metrics': [{'name': 'Wer', 'type': 'wer', 'value': 32.09599593667993}]}]}]}
result["dataset"]: {'name': 'Common Voice 11.0', 'type': 'mozilla-foundation/common_voice_11_0', 'config': 'hi', 'split': 'test', 'args': 'language hi'}

Fixes #10 and possibly fixes #9. According to huggingface/hf-speech-bench#8, users were reporting two months ago that the leaderboard has moved, but this repository is still receiving staff contributions, so I'm submitting this fix for review regardless.

