Audio Dataset
This guide will show you how to configure your dataset repository with audio files. You can find accompanying examples of repositories in this Audio datasets examples collection.
A dataset with a supported structure and file formats automatically has a Dataset Viewer on its page on the Hub.
Additional information about your audio files - such as transcriptions - is automatically loaded as long as you include this information in a metadata file (metadata.csv/metadata.jsonl/metadata.parquet).
Alternatively, audio files can be in Parquet files or in TAR archives following the WebDataset format.
Only audio files
If your dataset only consists of one column with audio, you can simply store your audio files at the root:
my_dataset_repository/
├── 1.wav
├── 2.wav
├── 3.wav
└── 4.wav
or in a subdirectory:
my_dataset_repository/
└── audio
    ├── 1.wav
    ├── 2.wav
    ├── 3.wav
    └── 4.wav

Multiple formats are supported at the same time, including AIFF, FLAC, MP3, OGG and WAV.
my_dataset_repository/
└── audio
    ├── 1.aiff
    ├── 2.ogg
    ├── 3.mp3
    └── 4.flac

If you have several splits, you can put your audio files into directories named accordingly:
my_dataset_repository/
├── train
│   ├── 1.wav
│   └── 2.wav
└── test
    ├── 3.wav
    └── 4.wav

See File names and splits for more information and other ways to organize data by splits.
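Once your files are organized this way, you can load them directly with the datasets library. A minimal sketch, assuming a hypothetical repository id:

```python
from datasets import load_dataset

# "username/my_dataset_repository" is a placeholder; use your own repo id.
dataset = load_dataset("username/my_dataset_repository")
print(dataset)              # DatasetDict with "train" and "test" splits
print(dataset["train"][0])  # the "audio" column decodes to array, path and sampling_rate
```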
Additional columns
If there is additional information you'd like to include about your dataset, like the transcription, add it as a metadata.csv file in your repository. This lets you quickly create datasets for different audio tasks like text-to-speech or automatic speech recognition.
my_dataset_repository/
├── 1.wav
├── 2.wav
├── 3.wav
├── 4.wav
└── metadata.csv
Your metadata.csv file must have a file_name column which links audio files with their metadata:
file_name,animal
1.wav,cat
2.wav,cat
3.wav,dog
4.wav,dog

You can also use a JSONL file metadata.jsonl:
{"file_name": "1.wav","text": "cat"}
{"file_name": "2.wav","text": "cat"}
{"file_name": "3.wav","text": "dog"}
{"file_name": "4.wav","text": "dog"}And for bigger datasets or if you are interested in advanced data retrieval features, you can use a Parquet file metadata.parquet.
Relative paths
The metadata file must be located either in the same directory as the audio files it is linked to, or in any parent directory, as in this example:
my_dataset_repository/
└── test
    ├── audio
    │   ├── 1.wav
    │   ├── 2.wav
    │   ├── 3.wav
    │   └── 4.wav
    └── metadata.csv

In this case, the file_name column must be a full relative path to the audio files, not just the filename:
file_name,animal
audio/1.wav,cat
audio/2.wav,cat
audio/3.wav,dog
audio/4.wav,dog

Metadata files cannot be put in subdirectories of the directory containing the audio files.
More generally, any column named file_name or *_file_name should contain the full relative path to the audio files.
In this example, the test directory is used to set the name of the test split. See File names and splits for more information.
Audio classification
For audio classification datasets, you can also use a simple setup: use directories to name the audio classes. Store your audio files in a directory structure like:
my_dataset_repository/
├── cat
│   ├── 1.wav
│   └── 2.wav
└── dog
    ├── 3.wav
    └── 4.wav

The dataset created with this structure contains two columns: audio and label (with values cat and dog).
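You can verify the inferred columns by loading the dataset. A minimal sketch, assuming a hypothetical repository id:

```python
from datasets import load_dataset

# "username/my_dataset_repository" is a placeholder; use your own repo id.
dataset = load_dataset("username/my_dataset_repository", split="train")
print(dataset.features)     # "audio" is an Audio feature, "label" a ClassLabel with names ["cat", "dog"]
print(dataset[0]["label"])  # labels are stored as integer class ids
```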
You can also provide multiple splits. To do so, your dataset directory should have the following structure (see File names and splits for more information):
my_dataset_repository/
├── test
│   ├── cat
│   │   └── 2.wav
│   └── dog
│       └── 4.wav
└── train
    ├── cat
    │   └── 1.wav
    └── dog
        └── 3.wav

You can disable this automatic addition of the label column in the YAML configuration. If your directory names have no special meaning, set drop_labels: true in the README header:
configs:
- config_name: default # Name of the dataset subset, if applicable.
  drop_labels: true

Large scale datasets
WebDataset format
The WebDataset format is well suited for large scale audio datasets (see AlienKevin/sbs_cantonese for example). It consists of TAR archives containing audio files and their metadata and is optimized for streaming. It is useful if you have a large number of audio files and want streaming data loaders for large-scale training.
my_dataset_repository/
├── train-0000.tar
├── train-0001.tar
├── ...
└── train-1023.tar
To make a WebDataset TAR archive, put the audio files and their metadata files in a directory, then create the TAR archive from it, e.g. with the tar command.
Each archive is generally around 1GB in size.
Make sure each audio file and metadata pair share the same file prefix, for example:
train-0000/
├── 000.flac
├── 000.json
├── 001.flac
├── 001.json
├── ...
├── 999.flac
└── 999.json
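As a sketch of how such an archive could be built, here is a version using Python's tarfile module on the example directory above; sorting by name keeps each audio/metadata pair adjacent, which sequential WebDataset readers rely on:

```python
import tarfile
from pathlib import Path

# Archive the example directory from above into train-0000.tar.
# Sorting keeps 000.flac next to 000.json, 001.flac next to 001.json, etc.
with tarfile.open("train-0000.tar", "w") as tar:
    for path in sorted(Path("train-0000").iterdir()):
        tar.add(path, arcname=path.name)
```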
Note that for user convenience and to enable the Dataset Viewer, every dataset hosted on the Hub is automatically converted to Parquet format (up to 5GB). Read more about it in the Parquet format documentation.
Parquet format
Instead of uploading the audio files and metadata as individual files, you can embed everything inside a Parquet file. This is useful if you have a large number of audio files, if you want to embed multiple audio columns, or if you want to store additional information about the audio in the same file. Parquet is also useful for storing data such as raw bytes, which is not supported by JSON/CSV.
my_dataset_repository/
└── train.parquet
Parquet files with audio data can be created using pandas or the datasets library. To create Parquet files with audio data in pandas, you can use pandas-audio-methods and df.to_parquet(). In datasets, you can set the column type to Audio() and use the ds.to_parquet(...) method or ds.push_to_hub(...). You can find a guide on loading audio datasets in datasets here.
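For example, here is a minimal sketch with the datasets library; the file names, the caption column, and the repository id are placeholders:

```python
from datasets import Audio, Dataset

# Paths to local audio files plus an illustrative caption column.
ds = Dataset.from_dict({
    "audio": ["1.wav", "2.wav", "3.wav", "4.wav"],
    "caption": ["cat", "cat", "dog", "dog"],
}).cast_column("audio", Audio())  # interpret the paths as an audio column

ds.to_parquet("train.parquet")  # embeds the audio bytes in the Parquet file
# or upload directly: ds.push_to_hub("username/my_dataset_repository")
```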
Alternatively, you can manually set the audio type of Parquet files created using other tools. First, make sure your audio columns are of type struct, with a binary field "bytes" for the audio data and a string field "path" for the audio file name or path. Then you should specify the feature types of the columns directly in YAML in the README header, for example:
dataset_info:
  features:
    - name: audio
      dtype: audio
    - name: caption
      dtype: string

Note that Parquet is recommended for small audio files (<1MB per audio file) and small row groups (100 rows per row group, which is what datasets uses for audio). For larger audio files, it is recommended to use the WebDataset format or to share the original audio files (optionally with metadata files).
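To illustrate the struct layout described above, here is a minimal pyarrow sketch; the file name and the caption column are placeholders:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Read the raw bytes of a hypothetical audio file.
with open("1.wav", "rb") as f:
    audio_bytes = f.read()

# Audio columns are structs with a binary "bytes" field and a string "path" field.
audio_type = pa.struct([("bytes", pa.binary()), ("path", pa.string())])
table = pa.table({
    "audio": pa.array([{"bytes": audio_bytes, "path": "1.wav"}], type=audio_type),
    "caption": pa.array(["cat"], type=pa.string()),
})

# Small row groups keep access to individual rows fast.
pq.write_table(table, "train.parquet", row_group_size=100)
```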