Dataset map is always failing

#2
by betim - opened

Hi @sanchit-gandhi ,

Following your amazing blog post on Whisper fine-tuning, we are trying to fine-tune Whisper for a new language.

We're using a good machine (8x A100 40GB GPUs, 120-core CPU), but we keep failing on dataset.map, i.e. this line:

```python
common_voice = common_voice.map(prepare_dataset, remove_columns=common_voice.column_names["train"], num_proc=4)
```

Our train dataset consists of 300,000 audio files with their respective sentences.

Is there any way to instruct dataset.map to use the GPU, or otherwise make it work?

p.s. We have tried num_proc=10, 5, and None, but it still fails.

Appreciate your help!

Hey @betim! Super glad to hear the blog post was helpful πŸ€— Sorry for the late reply! How is .map(...) failing? Is it just hanging during preprocessing? Feel free to share the stack trace (if there is one) and I can advise accordingly!

What you might need to do is run the pre-processing on a single worker first, in a non-distributed manner, and then load the dataset from cache when you use the distributed setup (see https://github.com/huggingface/transformers/blob/57f25f4b7fb85ff069f8701372710b2a3207bf2d/examples/pytorch/speech-recognition/run_speech_recognition_seq2seq.py#L181 for details).
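For concreteness, here's a minimal sketch of that single-worker approach, assuming the setup from the blog post (Common Voice Hindi and openai/whisper-small; swap in your own dataset and checkpoint). Running this once as a standalone script populates the datasets cache, so the subsequent distributed run loads the processed dataset from cache instead of recomputing it:

```python
# preprocess_once.py -- run once, single-process, before launching distributed training
from datasets import load_dataset, Audio
from transformers import WhisperFeatureExtractor, WhisperTokenizer

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")
tokenizer = WhisperTokenizer.from_pretrained(
    "openai/whisper-small", language="Hindi", task="transcribe"
)

common_voice = load_dataset("mozilla-foundation/common_voice_11_0", "hi")
common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16_000))

def prepare_dataset(batch):
    audio = batch["audio"]
    # compute log-Mel input features from the raw waveform
    batch["input_features"] = feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    # encode the target transcription to label ids
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch

# the processed dataset is written to the datasets cache; re-running the same
# .map call in the training script will then be a cache hit rather than a recompute
common_voice = common_voice.map(
    prepare_dataset, remove_columns=common_voice.column_names["train"], num_proc=1
)
```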

Alternatively, you could try using the dataset in streaming mode (if it supports this); see this notebook for details: https://github.com/huggingface/community-events/blob/main/whisper-fine-tuning-event/fine_tune_whisper_streaming_colab.ipynb
This bypasses the need to preprocess your entire dataset in one go at the start.
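In code, streaming is just a flag on load_dataset. A minimal sketch (dataset name again illustrative, reusing prepare_dataset from the sketch above):

```python
from datasets import load_dataset, Audio

common_voice = load_dataset(
    "mozilla-foundation/common_voice_11_0", "hi", split="train", streaming=True
)
common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16_000))

# on a streamed (iterable) dataset, .map is applied lazily: each example is
# preprocessed on the fly as the trainer pulls it, so there is no upfront
# pass over all 300k files (prepare_dataset as defined in the sketch above)
common_voice = common_voice.map(prepare_dataset)
```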

Also worth checking this forum post to make sure you've not got the same issue: https://discuss.huggingface.co/t/map-function-freezes-on-large-dataset/33224

You can also open an issue on the datasets repo to get dedicated help if this persists: https://github.com/huggingface/datasets
If possible, include a reproducible code snippet with the issue!
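Something along these lines usually works as a minimal repro (entirely illustrative: a tiny slice of the data plus a stand-in preprocessing function, trimmed to the smallest case that still reproduces the hang):

```python
from datasets import load_dataset, Audio

# small slice so maintainers can run it end-to-end
ds = load_dataset("mozilla-foundation/common_voice_11_0", "hi", split="train[:100]")
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

def prepare_dataset(batch):
    # stand-in for the real preprocessing
    batch["len"] = len(batch["audio"]["array"])
    return batch

ds = ds.map(prepare_dataset, num_proc=4)  # the call that fails/hangs
```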
