Load a dataset from the Hub

Finding high-quality datasets that are reproducible and accessible can be difficult. One of 🤗 Datasets' main goals is to provide a simple way to load a dataset of any format or type. The easiest way to get started is to discover an existing dataset on the Hugging Face Hub - a community-driven collection of datasets for tasks in NLP, computer vision, and audio - and use 🤗 Datasets to download and generate the dataset.

This tutorial uses the rotten_tomatoes and MInDS-14 datasets, but feel free to load any dataset you want and follow along. Head over to the Hub now and find a dataset for your task!

Load a dataset

Before you take the time to download a dataset, it’s often helpful to get some general information about it first. A dataset’s information is stored inside DatasetInfo and can include the dataset description, features, and dataset size.

Use the load_dataset_builder() function to load a dataset builder and inspect a dataset’s attributes without committing to downloading it:

>>> from datasets import load_dataset_builder
>>> ds_builder = load_dataset_builder("cornell-movie-review-data/rotten_tomatoes")

# Inspect dataset description
>>> ds_builder.info.description
Movie Review Dataset. This is a dataset of containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. This data was first used in Bo Pang and Lillian Lee, ``Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales.'', Proceedings of the ACL, 2005.

# Inspect dataset features
>>> ds_builder.info.features
{'label': ClassLabel(names=['neg', 'pos'], id=None),
 'text': Value(dtype='string', id=None)}
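The other metadata fields mentioned above can be read from the same builder. As a minimal sketch, the hypothetical helper `summarize_builder_info` below gathers the description, feature names, and dataset size into a plain dict. The `builder_factory` parameter is an assumption added for illustration so the helper can be exercised without network access; when omitted, it falls back to `load_dataset_builder`.

```python
def summarize_builder_info(repo_id, builder_factory=None):
    """Collect a few DatasetInfo fields without downloading the data files."""
    if builder_factory is None:
        # Imported lazily so the helper can be tried with a stand-in factory.
        from datasets import load_dataset_builder

        builder_factory = load_dataset_builder

    info = builder_factory(repo_id).info
    return {
        "description": info.description,
        # Features behaves like a dict that maps column names to feature types.
        "feature_names": sorted(info.features) if info.features else [],
        # Can be None when the repository does not publish size metadata.
        "dataset_size": info.dataset_size,
    }
```

With 🤗 Datasets installed, `summarize_builder_info("cornell-movie-review-data/rotten_tomatoes")` would query the Hub for this metadata.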

If you’re happy with the dataset, then load it with load_dataset():

>>> from datasets import load_dataset

>>> dataset = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train")

Splits

A split is a specific subset of a dataset like train and test. List a dataset’s split names with the get_dataset_split_names() function:

>>> from datasets import get_dataset_split_names

>>> get_dataset_split_names("cornell-movie-review-data/rotten_tomatoes")
['train', 'validation', 'test']

Then you can load a specific split with the split parameter. Loading a dataset split returns a Dataset object:

>>> from datasets import load_dataset

>>> dataset = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train")
>>> dataset
Dataset({
    features: ['text', 'label'],
    num_rows: 8530
})
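One way to catch a mistyped split name early is to check it against the list returned by `get_dataset_split_names()` before loading anything. The sketch below uses a hypothetical helper, `require_split`, which is not part of 🤗 Datasets; the split names are the ones listed for rotten_tomatoes in this tutorial.

```python
def require_split(requested, available):
    """Return `requested` if it is one of `available`, otherwise raise ValueError."""
    if requested not in available:
        raise ValueError(
            f"Unknown split {requested!r}; choose one of {sorted(available)}"
        )
    return requested


# Split names as listed for rotten_tomatoes in this tutorial.
splits = ["train", "validation", "test"]
print(require_split("validation", splits))  # validation
```

The validated name can then be passed on as `split=require_split("train", splits)` in the `load_dataset()` call.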

If you don’t specify a split, 🤗 Datasets returns a DatasetDict object instead:

>>> from datasets import load_dataset

>>> dataset = load_dataset("cornell-movie-review-data/rotten_tomatoes")
>>> dataset
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
})
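A DatasetDict behaves like a regular Python dictionary keyed by split name, so a single split can be retrieved with `dataset["train"]`. The sketch below relies only on that mapping interface and on `len()`, which is why plain lists can stand in for the splits here. `split_sizes` is a hypothetical helper name, and the toy data is made up for illustration.

```python
def split_sizes(dataset_dict):
    """Map each split name to its number of rows."""
    return {name: len(split) for name, split in dataset_dict.items()}


# Plain lists stand in for Dataset objects (an assumption for illustration).
toy = {"train": ["a", "b", "c"], "validation": ["d"], "test": ["e"]}
print(split_sizes(toy))  # {'train': 3, 'validation': 1, 'test': 1}
```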

Configurations

Some datasets contain several sub-datasets. For example, the MInDS-14 dataset has several sub-datasets, each one containing audio data in a different language. These sub-datasets are known as configurations or subsets, and you must explicitly select one when loading the dataset. If you don’t provide a configuration name, 🤗 Datasets will raise a ValueError and remind you to choose a configuration.

Use the get_dataset_config_names() function to retrieve a list of all the possible configurations available to your dataset:

>>> from datasets import get_dataset_config_names

>>> configs = get_dataset_config_names("PolyAI/minds14")
>>> print(configs)
['cs-CZ', 'de-DE', 'en-AU', 'en-GB', 'en-US', 'es-ES', 'fr-FR', 'it-IT', 'ko-KR', 'nl-NL', 'pl-PL', 'pt-PT', 'ru-RU', 'zh-CN', 'all']

Then load the configuration you want:

>>> from datasets import load_dataset

>>> mindsFR = load_dataset("PolyAI/minds14", "fr-FR", split="train")
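Because the configuration names come back as a plain list of strings, ordinary list operations are enough to narrow them down. The sketch below takes the MInDS-14 names listed above and selects every English variant. `configs_for_language` is a hypothetical helper, and it assumes the `language-REGION` naming pattern seen in that list.

```python
# Configuration names as printed for PolyAI/minds14 in this tutorial.
MINDS14_CONFIGS = [
    "cs-CZ", "de-DE", "en-AU", "en-GB", "en-US", "es-ES", "fr-FR", "it-IT",
    "ko-KR", "nl-NL", "pl-PL", "pt-PT", "ru-RU", "zh-CN", "all",
]


def configs_for_language(configs, language):
    """Return the configurations whose name starts with `<language>-`."""
    prefix = f"{language}-"
    return [name for name in configs if name.startswith(prefix)]


print(configs_for_language(MINDS14_CONFIGS, "en"))  # ['en-AU', 'en-GB', 'en-US']
```

Each returned name can be passed as the second argument to `load_dataset()`, in the same way as "fr-FR" above.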

Remote code

Certain dataset repositories contain a loading script with the Python code used to generate the dataset. All files and code uploaded to the Hub are scanned for malware (refer to the Hub security documentation for more information), but you should still review the dataset loading scripts and authors to avoid executing malicious code on your machine. You must set trust_remote_code=True to use a dataset with a loading script, or you will get an error:

>>> from datasets import get_dataset_config_names, get_dataset_split_names, load_dataset

>>> c4 = load_dataset("c4", "en", split="train", trust_remote_code=True)
>>> get_dataset_config_names("c4", trust_remote_code=True)
['en', 'realnewslike', 'en.noblocklist', 'en.noclean']
>>> get_dataset_split_names("c4", "en", trust_remote_code=True)
['train', 'validation']

For security reasons, 🤗 Datasets does not allow running dataset loading scripts by default, and you have to pass trust_remote_code=True to load datasets that require running a dataset script.
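Since loading scripts should be reviewed before they are trusted, one way to keep that step explicit is to enable trust_remote_code only for repositories you have already inspected. The sketch below builds the keyword arguments from an allowlist. `REVIEWED_REPOS` and `remote_code_kwargs` are hypothetical names, and the allowlist contents are an example only.

```python
# Hypothetical allowlist: repositories whose loading scripts you have read.
REVIEWED_REPOS = frozenset({"c4"})


def remote_code_kwargs(repo_id, reviewed=REVIEWED_REPOS):
    """Return keyword arguments that opt in to remote code only for reviewed repos."""
    if repo_id in reviewed:
        return {"trust_remote_code": True}
    return {}


print(remote_code_kwargs("c4"))                    # {'trust_remote_code': True}
print(remote_code_kwargs("some-user/unreviewed"))  # {}
```

The result can be unpacked into the loading call, for example `load_dataset("c4", "en", split="train", **remote_code_kwargs("c4"))`, so that an unreviewed repository falls back to the default behavior.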
