Dataset map is always failing

#2
by betim - opened

Hi @sanchit-gandhi ,

Following your amazing blog post on Whisper fine-tuning, we are trying to fine-tune Whisper for a new language.

We're using a good machine (8x A100 40GB GPUs, 120-core CPU), but we keep failing on dataset.map, i.e. this line:

```python
common_voice = common_voice.map(prepare_dataset, remove_columns=common_voice.column_names["train"], num_proc=4)
```

Our train dataset consists of 300,000 audio files with their respective sentences.

Is there any way to instruct dataset.map to use the GPU, or otherwise make it work?

p.s. We have tried num_proc=10, 5, and None, but it still fails.

Appreciate your help!

Hey @betim! Super glad to hear the blog post was helpful πŸ€— Sorry for the late reply! How is .map(...) failing? Is it just hanging during preprocessing? Feel free to share the stack trace (if there is one) and I can advise accordingly!

What you might need to do is run the pre-processing on a single worker first, in a non-distributed manner, and then load the dataset from cache when you use the distributed setup (see https://github.com/huggingface/transformers/blob/57f25f4b7fb85ff069f8701372710b2a3207bf2d/examples/pytorch/speech-recognition/run_speech_recognition_seq2seq.py#L181 for details).
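For concreteness, here's a minimal sketch of that single-worker approach, assuming the setup from the blog post (Common Voice Hindi and openai/whisper-small; swap in your own dataset and checkpoint). Running this once as a standalone script populates the datasets cache, so the subsequent distributed run loads the processed dataset from cache instead of recomputing it:

```python
# preprocess_once.py -- run once, single-process, before launching distributed training
from datasets import load_dataset, Audio
from transformers import WhisperFeatureExtractor, WhisperTokenizer

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")
tokenizer = WhisperTokenizer.from_pretrained(
    "openai/whisper-small", language="Hindi", task="transcribe"
)

common_voice = load_dataset("mozilla-foundation/common_voice_11_0", "hi")
common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16_000))

def prepare_dataset(batch):
    audio = batch["audio"]
    # compute log-Mel input features from the raw waveform
    batch["input_features"] = feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    # encode the target transcription to label ids
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch

# the processed dataset is written to the datasets cache; re-running the same
# .map call in the training script will then be a cache hit rather than a recompute
common_voice = common_voice.map(
    prepare_dataset, remove_columns=common_voice.column_names["train"], num_proc=1
)
```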

Alternatively, you could try using the dataset in streaming mode (if it supports this); see this notebook for details: https://github.com/huggingface/community-events/blob/main/whisper-fine-tuning-event/fine_tune_whisper_streaming_colab.ipynb
This bypasses the need to preprocess your entire dataset in one go at the start.
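In code, streaming is just a flag on load_dataset. A minimal sketch (dataset name again illustrative, reusing prepare_dataset from the sketch above):

```python
from datasets import load_dataset, Audio

common_voice = load_dataset(
    "mozilla-foundation/common_voice_11_0", "hi", split="train", streaming=True
)
common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16_000))

# on a streamed (iterable) dataset, .map is applied lazily: each example is
# preprocessed on the fly as the trainer pulls it, so there is no upfront
# pass over all 300k files (prepare_dataset as defined in the sketch above)
common_voice = common_voice.map(prepare_dataset)
```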

Also worth checking this forum post to make sure you've not got the same issue: https://discuss.huggingface.co/t/map-function-freezes-on-large-dataset/33224

You can also open an issue on the datasets repo to get dedicated help if this persists: https://github.com/huggingface/datasets
If possible, include a reproducible code snippet with the issue!
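Something along these lines usually works as a minimal repro (entirely illustrative: a tiny slice of the data plus a stand-in preprocessing function, trimmed to the smallest case that still reproduces the hang):

```python
from datasets import load_dataset, Audio

# small slice so maintainers can run it end-to-end
ds = load_dataset("mozilla-foundation/common_voice_11_0", "hi", split="train[:100]")
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

def prepare_dataset(batch):
    # stand-in for the real preprocessing
    batch["len"] = len(batch["audio"]["array"])
    return batch

ds = ds.map(prepare_dataset, num_proc=4)  # the call that fails/hangs
```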
