Datasets documentation

Create a dataset

You are viewing v2.20.0 version. A newer version v3.2.0 is available.
Hugging Face's logo
Join the Hugging Face community

and get access to the augmented documentation experience

to get started

Create a dataset

Sometimes, you may need to create a dataset if you’re working with your own data. Creating a dataset with 🤗 Datasets confers all the advantages of the library to your dataset: fast loading and processing, stream enormous datasets, memory-mapping, and more. You can easily and rapidly create a dataset with 🤗 Datasets low-code approaches, reducing the time it takes to start training a model. In many cases, it is as easy as dragging and dropping your data files into a dataset repository on the Hub.

In this tutorial, you’ll learn how to use 🤗 Datasets low-code methods for creating all types of datasets:

  • Folder-based builders for quickly creating an image or audio dataset
  • from_ methods for creating datasets from local files

Folder-based builders

There are two folder-based builders, ImageFolder and AudioFolder. These are low-code methods for quickly creating an image or speech and audio dataset with several thousand examples. They are great for rapidly prototyping computer vision and speech models before scaling to a larger dataset. Folder-based builders takes your data and automatically generates the dataset’s features, splits, and labels. Under the hood:

  • ImageFolder uses the Image feature to decode an image file. Many image extension formats are supported, such as jpg and png, but other formats are also supported. You can check the complete list of supported image extensions.
  • AudioFolder uses the Audio feature to decode an audio file. Audio extensions such as wav and mp3 are supported, and you can check the complete list of supported audio extensions.

The dataset splits are generated from the repository structure, and the label names are automatically inferred from the directory name.

For example, if your image dataset (it is the same for an audio dataset) is stored like this:

pokemon/train/grass/bulbasaur.png
pokemon/train/fire/charmander.png
pokemon/train/water/squirtle.png

pokemon/test/grass/ivysaur.png
pokemon/test/fire/charmeleon.png
pokemon/test/water/wartortle.png

Then this is how the folder-based builder generates an example:

Create the image dataset by specifying imagefolder in load_dataset():

>>> from datasets import load_dataset

>>> dataset = load_dataset("imagefolder", data_dir="/path/to/pokemon")

An audio dataset is created in the same way, except you specify audiofolder in load_dataset() instead:

>>> from datasets import load_dataset

>>> dataset = load_dataset("audiofolder", data_dir="/path/to/folder")

Any additional information about your dataset, such as text captions or transcriptions, can be included with a metadata.csv file in the folder containing your dataset. The metadata file needs to have a file_name column that links the image or audio file to its corresponding metadata:

file_name, text
bulbasaur.png, There is a plant seed on its back right from the day this Pokémon is born.
charmander.png, It has a preference for hot things.
squirtle.png, When it retracts its long neck into its shell, it squirts out water with vigorous force.

To learn more about each of these folder-based builders, check out the and ImageFolder or AudioFolder guides.

From local files

You can also create a dataset from local files by specifying the path to the data files. There are two ways you can create a dataset using the from_ methods:

  • The from_generator() method is the most memory-efficient way to create a dataset from a generator due to a generators iterative behavior. This is especially useful when you’re working with a really large dataset that may not fit in memory, since the dataset is generated on disk progressively and then memory-mapped.

    >>> from datasets import Dataset
    >>> def gen():
    ...     yield {"pokemon": "bulbasaur", "type": "grass"}
    ...     yield {"pokemon": "squirtle", "type": "water"}
    >>> ds = Dataset.from_generator(gen)
    >>> ds[0]
    {"pokemon": "bulbasaur", "type": "grass"}

    A generator-based IterableDataset needs to be iterated over with a for loop for example:

    >>> from datasets import IterableDataset
    >>> ds = IterableDataset.from_generator(gen)
    >>> for example in ds:
    ...     print(example)
    {"pokemon": "bulbasaur", "type": "grass"}
    {"pokemon": "squirtle", "type": "water"}
  • The from_dict() method is a straightforward way to create a dataset from a dictionary:

    >>> from datasets import Dataset
    >>> ds = Dataset.from_dict({"pokemon": ["bulbasaur", "squirtle"], "type": ["grass", "water"]})
    >>> ds[0]
    {"pokemon": "bulbasaur", "type": "grass"}

    To create an image or audio dataset, chain the cast_column() method with from_dict() and specify the column and feature type. For example, to create an audio dataset:

    >>> audio_dataset = Dataset.from_dict({"audio": ["path/to/audio_1", ..., "path/to/audio_n"]}).cast_column("audio", Audio())

Next steps

We didn’t mention this in the tutorial, but you can also create a dataset with a loading script. A loading script is a more manual and code-intensive method for creating a dataset, but it also gives you the most flexibility and control over how a dataset is generated. It lets you configure additional options such as creating multiple configurations within a dataset, or enabling your dataset to be streamed.

To learn more about how to write loading scripts, take a look at the image loading script, audio loading script, and text loading script guides.

Now that you know how to create a dataset, consider sharing it on the Hub so the community can also benefit from your work! Go on to the next section to learn how to share your dataset.

< > Update on GitHub