Loading a Dataset

A datasets.Dataset can be created from various sources of data:

  • from the HuggingFace Hub,

  • from local files, e.g. CSV/JSON/text/pandas files, or

  • from in-memory data like a Python dict or a pandas DataFrame.

In this section we study each option.

From the HuggingFace Hub

Over 1,000 datasets for many NLP tasks like text classification, question answering, language modeling, etc., are provided on the HuggingFace Hub and can be viewed and explored online with the 🤗 Datasets viewer.

Note

You can also add a new dataset to the Hub to share with the community as detailed in the guide on adding a new dataset.

All the datasets currently available on the Hub can be listed using datasets.list_datasets():

>>> from datasets import list_datasets
>>> datasets_list = list_datasets()
>>> len(datasets_list)
1067
>>> print(', '.join(dataset for dataset in datasets_list))
acronym_identification, ade_corpus_v2, adversarial_qa, aeslc, afrikaans_ner_corpus, ag_news, ai2_arc, air_dialogue, ajgt_twitter_ar,
allegro_reviews, allocine, alt, amazon_polarity, amazon_reviews_multi, amazon_us_reviews, ambig_qa, amttl, anli, app_reviews, aqua_rat,
aquamuse, ar_cov19, ar_res_reviews, ar_sarcasm, arabic_billion_words, arabic_pos_dialect, arabic_speech_corpus, arcd, arsentd_lev, art,
arxiv_dataset, ascent_kb, aslg_pc12, asnq, asset, assin, assin2, atomic, autshumato, babi_qa, banking77, bbaw_egyptian, bbc_hindi_nli,
bc2gm_corpus, best2009, bianet, bible_para, big_patent, billsum, bing_coronavirus_query_set, biomrc, blended_skill_talk, blimp,
blog_authorship_corpus, bn_hate_speech [...]

To load a dataset from the Hub, use the datasets.load_dataset() command and give it the short name of the dataset you would like to load, as listed above or on the Hub.

Let’s load the SQuAD dataset for Question Answering. You can explore this dataset and find more details about it on the online viewer here (which is actually just a wrapper on top of the datasets.Dataset we will now create):

>>> from datasets import load_dataset
>>> dataset = load_dataset('squad', split='train')

This call to datasets.load_dataset() does the following steps under the hood:

  1. Download and import into the library the SQuAD Python processing script from the HuggingFace GitHub repository or AWS bucket, if it’s not already stored in the library.

Note

Processing scripts are small Python scripts which define the info (citation, description) and format of the dataset, and contain the URL to the original SQuAD JSON files as well as the code to load examples from them. You can find the SQuAD processing script here for instance.

  2. Run the SQuAD Python processing script, which downloads the SQuAD dataset from the original URL (if it’s not already downloaded and cached) and processes and caches SQuAD in an Arrow table for each standard split stored on disk.

Note

An Apache Arrow Table is the internal storing format for 🤗 Datasets. It allows you to store an arbitrarily long dataframe, typed with potentially complex nested types that can be mapped to numpy/pandas/python types. Apache Arrow allows you to map blobs of data on-drive without doing any deserialization. So caching the dataset directly on disk can use memory-mapping and pay effectively zero cost with O(1) random access. Alternatively, you can copy it into CPU memory (RAM) by setting the keep_in_memory argument of datasets.load_dataset() to True. The default in 🤗 Datasets is to memory-map the dataset on disk unless you set datasets.config.IN_MEMORY_MAX_SIZE to something different from 0 bytes (the default). In that case, the dataset will be copied in-memory if its size is smaller than datasets.config.IN_MEMORY_MAX_SIZE bytes, and memory-mapped otherwise. This behavior can be enabled by setting either the configuration option datasets.config.IN_MEMORY_MAX_SIZE (higher precedence) or the environment variable HF_DATASETS_IN_MEMORY_MAX_SIZE (lower precedence) to nonzero.

  3. Return a dataset built from the splits requested by the user (default: all); in the above example we create a dataset with the train split.
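
For instance, to copy the resulting Arrow table into RAM rather than memory-map it from disk, you can pass the keep_in_memory argument mentioned in the note above; a minimal sketch:

>>> from datasets import load_dataset
>>> # Copy the cached Arrow data into RAM instead of memory-mapping it from disk
>>> dataset = load_dataset('squad', split='train', keep_in_memory=True)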

Selecting a split

If you don’t provide a split argument to datasets.load_dataset(), this method will return a dictionary (a datasets.DatasetDict) containing a dataset for each split in the dataset.

>>> from datasets import load_dataset
>>> datasets = load_dataset('squad')
>>> print(datasets)
DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

The split argument can actually be used to control the generated dataset split quite extensively. You can use this argument to build a split from only a portion of a split, in absolute number of examples or in proportion (e.g. split='train[:10%]' will load only the first 10% of the train split), or to mix splits (e.g. split='train[:100]+validation[:100]' will create a split from the first 100 examples of the train split and the first 100 examples of the validation split).
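
For instance, a quick sketch of both patterns on SQuAD (the slicing strings are the ones from the examples above):

>>> from datasets import load_dataset
>>> # First 10% of the train split
>>> train_subset = load_dataset('squad', split='train[:10%]')
>>> # First 100 train examples followed by the first 100 validation examples, merged into one split
>>> mixed_split = load_dataset('squad', split='train[:100]+validation[:100]')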

You can find more details on the syntax for using split on the dedicated tutorial on split.

Selecting a configuration

Some datasets comprise several configurations. A configuration defines a sub-part of a dataset which can be selected. Unlike split, you have to select a single configuration for the dataset; you cannot mix several configurations. Examples of datasets with several configurations are:

  • the GLUE dataset, which is an aggregated benchmark comprised of 10 subsets: COLA, SST2, MRPC, QQP, STSB, MNLI, QNLI, RTE, WNLI and the diagnostic subset AX.

  • the wikipedia dataset which is provided for several languages.

When a dataset is provided with more than one configuration, you will be requested to explicitly select a configuration among the possibilities.

Selecting a configuration is done by providing datasets.load_dataset() with a name argument. Here is an example for GLUE:

>>> from datasets import load_dataset

>>> dataset = load_dataset('glue')
ValueError: Config name is missing.
Please pick one among the available configs: ['cola', 'sst2', 'mrpc', 'qqp', 'stsb', 'mnli', 'mnli_mismatched', 'mnli_matched', 'qnli', 'rte', 'wnli', 'ax']
Example of usage:
        `load_dataset('glue', 'cola')`

>>> dataset = load_dataset('glue', 'sst2')
Downloading and preparing dataset glue/sst2 (download: 7.09 MiB, generated: 4.81 MiB, total: 11.90 MiB) to /Users/thomwolf/.cache/huggingface/datasets/glue/sst2/1.0.0...
Downloading: 100%|██████████████████████████████████████████████████████████████| 7.44M/7.44M [00:01<00:00, 7.03MB/s]
Dataset glue downloaded and prepared to /Users/thomwolf/.cache/huggingface/datasets/glue/sst2/1.0.0. Subsequent calls will reuse this data.
>>> print(dataset)
DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 67349
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 872
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1821
    })
})

Manually downloading files

Some datasets require you to download some files manually, usually because of licensing issues or because these files are behind a login page.

In this case, specific instructions for downloading the missing files will be provided when running the script with datasets.load_dataset() for the first time, explaining where and how you can get the files.

After you’ve downloaded the files, you can point to the folder hosting them locally with the data_dir argument as follows:

>>> dataset = load_dataset("xtreme", "PAN-X.fr")
Downloading and preparing dataset xtreme/PAN-X.fr (download: Unknown size, generated: 5.80 MiB, total: 5.80 MiB) to /Users/thomwolf/.cache/huggingface/datasets/xtreme/PAN-X.fr/1.0.0...
AssertionError: The dataset xtreme with config PAN-X.fr requires manual data.
Please follow the manual download instructions: You need to manually download the AmazonPhotos.zip file on Amazon Cloud Drive (https://www.amazon.com/clouddrive/share/d3KGCRCIYwhKJF0H3eWA26hjg2ZCRhjpEQtDL70FSBN). The folder containing the saved file can be used to load the dataset via 'datasets.load_dataset("xtreme", data_dir="<path/to/folder>")'

Apart from name and split, the datasets.load_dataset() method provides a few arguments which can be used to control where the data is cached (cache_dir), as well as some options for the download process itself, like proxies and whether the download cache should be used (download_config, download_mode).

The use of these arguments is discussed in the Cache management and integrity verifications section below. You can also find the full details on these arguments on the package reference page for datasets.load_dataset().

From local files

It’s also possible to create a dataset from local files.

Generic loading scripts are provided for:

  • CSV files (with the csv script),

  • JSON files (with the json script),

  • text files (read as a line-by-line dataset with the text script),

  • pandas pickled dataframe (with the pandas script).

If you want better control over how your files are loaded, or if you have a file format that exactly reproduces the file format of one of the datasets provided on the HuggingFace Hub, it can be more flexible and simpler to create your own loading script, from scratch or by adapting one of the provided loading scripts. In this case, please check the Writing a dataset loading script chapter.

The data_files argument in datasets.load_dataset() is used to provide paths to one or several files. This argument currently accepts three types of inputs:

  • str: a single string as the path to a single file (considered to constitute the train split by default)

  • List[str]: a list of strings as paths to a list of files (also considered to constitute the train split by default)

  • Dict[str, Union[str, List[str]]]: a dictionary mapping split names to a single file or a list of files.

Let’s see an example of all the various ways you can provide files to datasets.load_dataset():

>>> from datasets import load_dataset
>>> dataset = load_dataset('csv', data_files='my_file.csv')
>>> dataset = load_dataset('csv', data_files=['my_file_1.csv', 'my_file_2.csv', 'my_file_3.csv'])
>>> dataset = load_dataset('csv', data_files={'train': ['my_train_file_1.csv', 'my_train_file_2.csv'],
                                              'test': 'my_test_file.csv'})

Note

The split argument works similarly to what we detailed above for the datasets on the Hub, and you can find more details on the syntax for using split on the dedicated tutorial on split. The only behavior specific to loading local files is that if you don’t indicate which split each file is related to, the provided files are assumed to belong to the train split.

CSV files

🤗 Datasets can read a dataset made of one or several CSV files.

All the CSV files in the dataset should have the same organization and in particular the same datatypes for the columns.

A few interesting features are provided out-of-the-box by the Apache Arrow backend:

  • multi-threaded or single-threaded reading

  • automatic decompression of input files (based on the filename extension, such as my_data.csv.gz)

  • fetching column names from the first row in the CSV file

  • column-wise type inference and conversion to one of null, int64, float64, timestamp[s], string or binary data

  • detecting various spellings of null values such as NaN or #N/A

Here is an example loading two CSV files to create a train split (the default split unless specified otherwise):

>>> from datasets import load_dataset
>>> dataset = load_dataset('csv', data_files=['my_file_1.csv', 'my_file_2.csv'])

The csv loading script provides a few simple access options to control parsing and reading the CSV files:

  • skiprows (int) – Number of first rows in the file to skip (default is 0).

  • column_names (list, optional) – The column names of the target table. If empty, fall back on autogenerate_column_names (default: empty).

  • delimiter (1-character string) – The character delimiting individual cells in the CSV data (default ,).

  • quotechar (1-character string) – The character used optionally for quoting CSV values (default ").

  • quoting (int) – Control quoting behavior (default 0, setting this to 3 disables quoting; refer to the pandas.read_csv documentation at https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html for more details).
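
For instance, here is a minimal sketch combining a few of these options (my_file.csv is a placeholder path, and the semicolon delimiter is just an illustration):

>>> from datasets import load_dataset
>>> # Skip the first row of each file and parse semicolon-separated values
>>> dataset = load_dataset('csv', data_files='my_file.csv', delimiter=';', skiprows=1)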

If you want more control, the csv script provides full control over reading, parsing and converting through the Apache Arrow pyarrow.csv.ReadOptions, pyarrow.csv.ParseOptions and pyarrow.csv.ConvertOptions:

  • read_options — Can be provided with a pyarrow.csv.ReadOptions to control all the reading options. If skiprows, column_names or autogenerate_column_names are also provided (see above), they will take priority over the attributes in read_options.

  • parse_options — Can be provided with a pyarrow.csv.ParseOptions to control all the parsing options. If delimiter or quote_char are also provided (see above), they will take priority over the attributes in parse_options.

  • convert_options — Can be provided with a pyarrow.csv.ConvertOptions to control all the conversion options.
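
As a sketch of this finer-grained route (assuming, as described above, that the csv script forwards these objects to the Arrow CSV reader; the file and column names are placeholders):

>>> import pyarrow.csv as pac
>>> from datasets import load_dataset
>>> # Parse tab-separated files and only keep the listed columns
>>> parse_options = pac.ParseOptions(delimiter='\t')
>>> convert_options = pac.ConvertOptions(include_columns=['id', 'text'])
>>> dataset = load_dataset('csv', data_files='my_file.tsv',
...                        parse_options=parse_options,
...                        convert_options=convert_options)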

JSON files

🤗 Datasets supports building a dataset from JSON files in various formats.

The most efficient format is to have JSON files consisting of multiple JSON objects, one per line, representing individual data rows:

{"a": 1, "b": 2.0, "c": "foo", "d": false}
{"a": 4, "b": -5.5, "c": null, "d": true}

In this case, interesting features are provided out-of-the-box by the Apache Arrow backend:

  • multi-threaded reading

  • automatic decompression of input files (based on the filename extension, such as my_data.json.gz)

  • sophisticated type inference (see below)

You can load such a dataset directly with:

>>> from datasets import load_dataset
>>> dataset = load_dataset('json', data_files='my_file.json')

In real life though, JSON files can have diverse formats, and the json script will accordingly fall back on using Python JSON loading methods to handle the various JSON file formats.

One common occurrence is to have a JSON file with a single root dictionary where the dataset is contained in a specific field, as a list of dicts or a dict of lists:

{"version": "0.1.0",
 "data": [{"a": 1, "b": 2.0, "c": "foo", "d": false},
          {"a": 4, "b": -5.5, "c": null, "d": true}]
}

In this case you will need to specify which field contains the dataset using the field argument as follows:

>>> from datasets import load_dataset
>>> dataset = load_dataset('json', data_files='my_file.json', field='data')

Text files

🤗 Datasets also supports building a dataset from text files read line by line (each line will be a row in the dataset).

This is simply done using the text loading script which will generate a dataset with a single column called text containing all the text lines of the input files as strings.

>>> from datasets import load_dataset
>>> dataset = load_dataset('text', data_files={'train': ['my_text_1.txt', 'my_text_2.txt'], 'test': 'my_test_file.txt'})

Specifying the features of the dataset

When you create a dataset from local files, the datasets.Features of the dataset are guessed automatically, using a type inference system based on Apache Arrow Automatic Type Inference.

However, sometimes you may want to define the features of the dataset yourself, for instance to control the names and indices of labels using a datasets.ClassLabel.

In this case you can use the features argument of datasets.load_dataset() to supply a datasets.Features instance defining the features of your dataset and overriding the default pre-computed features.
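
For example, here is a minimal sketch forcing a fixed label vocabulary on a hypothetical two-column CSV file (the file name, column names and label names are placeholders, and the label column is assumed to contain class indices):

>>> from datasets import load_dataset, Features, Value, ClassLabel
>>> # Declare the schema explicitly instead of letting Arrow infer it:
>>> # 'text' stays a plain string, 'label' becomes a ClassLabel with fixed names and indices
>>> features = Features({'text': Value('string'),
...                      'label': ClassLabel(names=['negative', 'positive'])})
>>> dataset = load_dataset('csv', data_files='my_file.csv', features=features)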

From in-memory data

Finally, it’s also possible to instantiate a datasets.Dataset directly from in-memory data, currently:

  • a Python dict, or

  • a pandas DataFrame.

From a python dictionary

Let’s say that you have already loaded some data in an in-memory object in your Python session:

>>> my_dict = {'id': [0, 1, 2],
...            'name': ['mary', 'bob', 'eve'],
...            'age': [24, 53, 19]}

You can then directly create a datasets.Dataset object using the datasets.Dataset.from_dict() or datasets.Dataset.from_pandas() class methods:

>>> from datasets import Dataset
>>> dataset = Dataset.from_dict(my_dict)

From a pandas dataframe

You can similarly instantiate a Dataset object from a pandas DataFrame:

>>> from datasets import Dataset
>>> import pandas as pd
>>> df = pd.DataFrame({"a": [1, 2, 3]})
>>> dataset = Dataset.from_pandas(df)

Note

The column types in the resulting Arrow Table are inferred from the dtypes of the pandas.Series in the DataFrame. In the case of non-object Series, the NumPy dtype is translated to its Arrow equivalent. In the case of object, we need to guess the datatype by looking at the Python objects in this Series.

Be aware that Series of the object dtype don’t carry enough information to always lead to a meaningful Arrow type. In the case that we cannot infer a type, e.g. because the DataFrame is of length 0 or the Series only contains None/nan objects, the type is set to null. This behavior can be avoided by constructing an explicit schema and passing it to this function.

To be sure that the schema and type of the instantiated datasets.Dataset are as intended, you can explicitly provide the features of the dataset as a datasets.Features object to the from_dict and from_pandas methods.
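
A short sketch, reusing the my_dict example from above (the chosen dtypes are just an illustration):

>>> from datasets import Dataset, Features, Value
>>> # Spell out the column types instead of relying on type inference
>>> features = Features({'id': Value('int32'),
...                      'name': Value('string'),
...                      'age': Value('int8')})
>>> dataset = Dataset.from_dict(my_dict, features=features)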

Using a custom dataset loading script

If the provided loading scripts for Hub datasets or for local files are not suited to your use case, you can also easily write and use your own dataset loading script.

You can use a local loading script by providing its path instead of the usual shortcut name:

>>> from datasets import load_dataset
>>> dataset = load_dataset('PATH/TO/MY/LOADING/SCRIPT', data_files='PATH/TO/MY/FILE')

We provide more details on how to create your own dataset generation script on the Writing a dataset loading script page and you can also find some inspiration in all the already provided loading scripts on the GitHub repository.

Loading datasets in streaming mode

When a dataset is in streaming mode, you can iterate over it directly without having to download the entire dataset. The data are downloaded progressively as you iterate over the dataset. You can enable dataset streaming by passing streaming=True in the load_dataset() function to get an iterable dataset.

For example, you can start iterating over big datasets like OSCAR without having to download terabytes of data using this code:

>>> from datasets import load_dataset
>>> dataset = load_dataset('oscar', "unshuffled_deduplicated_en", split='train', streaming=True)
>>> print(next(iter(dataset)))
{'text': 'Mtendere Village was inspired by the vision of Chief Napoleon Dzombe, which he shared with John Blanchard during his first visit to Malawi. Chief Napoleon conveyed the desperate need for a program to intervene and care for the orphans and vulnerable children (OVC) in Malawi, and John committed to help...

Note

A dataset in streaming mode is not a datasets.Dataset object, but a datasets.IterableDataset object. You can find more information about iterable datasets in the dataset streaming documentation.
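
Since the returned object is iterable, you can for instance grab a handful of examples with the standard library; a minimal sketch using itertools rather than any datasets-specific method:

>>> from itertools import islice
>>> from datasets import load_dataset
>>> dataset = load_dataset('oscar', 'unshuffled_deduplicated_en', split='train', streaming=True)
>>> # Print the beginning of the first three examples without downloading the whole corpus
>>> for example in islice(dataset, 3):
...     print(example['text'][:80])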

Cache management and integrity verifications

Cache directory

To avoid re-downloading the whole dataset every time you use it, the datasets library caches the data on your computer.

By default, the datasets library caches the datasets and the downloaded data files under the following directory: ~/.cache/huggingface/datasets.

If you want to change the location where the datasets cache is stored, simply set the HF_DATASETS_CACHE environment variable. For example, if you’re using Linux:

$ export HF_DATASETS_CACHE="/path/to/another/directory"

In addition, you can control where the data is cached when invoking the loading script, by setting the cache_dir parameter:

>>> from datasets import load_dataset
>>> dataset = load_dataset('LOADING_SCRIPT', cache_dir="PATH/TO/MY/CACHE/DIR")

Download mode

You can control the way the datasets.load_dataset() function handles already downloaded data by setting its download_mode parameter.

By default, download_mode is set to "reuse_dataset_if_exists". The datasets.load_dataset() function will reuse both raw downloads and the prepared dataset, if they exist in the cache directory.

The following table describes the three available modes for download:

Behavior of datasets.load_dataset() depending on download_mode

  download_mode parameter value          Downloaded files (raw data)   Dataset object
  "reuse_dataset_if_exists" (default)    Reuse                         Reuse
  "reuse_cache_if_exists"                Reuse                         Fresh
  "force_redownload"                     Fresh                         Fresh

For example, you can run the following if you want to force the re-download of the SQuAD raw data files:

>>> from datasets import load_dataset
>>> dataset = load_dataset('squad', download_mode="force_redownload")

Integrity verifications

When downloading a dataset from the 🤗 Datasets Hub, the datasets.load_dataset() function performs by default a number of verifications on the downloaded files. These verifications include:

  • Verifying the list of downloaded files

  • Verifying the number of bytes of the downloaded files

  • Verifying the SHA256 checksums of the downloaded files

  • Verifying the number of splits in the generated DatasetDict

  • Verifying the number of samples in each split of the generated DatasetDict

You can disable these verifications by setting the ignore_verifications parameter to True.

You can also locally override the information used to perform the integrity verifications by setting the save_infos parameter to True.

For example, run the following to skip integrity verifications when loading the IMDB dataset:

>>> from datasets import load_dataset
>>> dataset = load_dataset('imdb', ignore_verifications=True)
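
Similarly, a minimal sketch of recording the verification data locally with the save_infos parameter mentioned above:

>>> from datasets import load_dataset
>>> # Recompute the split sizes and checksums and save them locally for future verifications
>>> dataset = load_dataset('imdb', save_infos=True)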

Loading datasets offline

Each dataset builder (e.g. “squad”) is a Python script that is downloaded and cached either from the 🤗 Datasets GitHub repository or from the HuggingFace Hub. Only the text, csv, json and pandas builders are included in datasets without requiring external downloads.

Therefore if you don’t have an internet connection you can’t load a dataset that is not packaged with datasets, unless the dataset is already cached. Indeed, if you’ve already loaded the dataset once before (when you had an internet connection), then the dataset is reloaded from the cache and you can use it offline.

You can even set the environment variable HF_DATASETS_OFFLINE to 1 to tell datasets to run in full offline mode. This mode disables all the network calls of the library. This way, instead of waiting for a dataset builder download to time out, the library looks directly at the cache.
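
For example, on Linux you can enable full offline mode the same way as the cache directory variable above:

$ export HF_DATASETS_OFFLINE=1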

Loading a dataset builder

You can use datasets.load_dataset_builder() to inspect metadata (cache directory, configs, dataset info, etc.) that is required to build a dataset without downloading the dataset itself.

For example, run the following to get the path to the cache directory of the IMDB dataset:

>>> from datasets import load_dataset_builder
>>> dataset_builder = load_dataset_builder('imdb')
>>> print(dataset_builder.cache_dir)
/Users/thomwolf/.cache/huggingface/datasets/imdb/plain_text/1.0.0/fdc76b18d5506f14b0646729b8d371880ef1bc48a26d00835a7f3da44004b676
>>> print(dataset_builder.info.features)
{'text': Value(dtype='string', id=None), 'label': ClassLabel(num_classes=2, names=['neg', 'pos'], names_file=None, id=None)}
>>> print(dataset_builder.info.splits)
{'train': SplitInfo(name='train', num_bytes=33432835, num_examples=25000, dataset_name='imdb'), 'test': SplitInfo(name='test', num_bytes=32650697, num_examples=25000, dataset_name='imdb'), 'unsupervised': SplitInfo(name='unsupervised', num_bytes=67106814, num_examples=50000, dataset_name='imdb')}

You can see all the attributes of dataset_builder.info in the documentation of datasets.DatasetInfo.

Enhancing performance

If you would like to speed up dataset operations, you can disable caching and copy the dataset in-memory by setting datasets.config.IN_MEMORY_MAX_SIZE to a nonzero size (in bytes) that fits in your RAM. In that case, the dataset will be copied in-memory if its size is smaller than datasets.config.IN_MEMORY_MAX_SIZE bytes, and memory-mapped otherwise. This behavior can be enabled by setting either the configuration option datasets.config.IN_MEMORY_MAX_SIZE (higher precedence) or the environment variable HF_DATASETS_IN_MEMORY_MAX_SIZE (lower precedence) to nonzero.
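
As a minimal sketch, the following copies datasets smaller than roughly 20 GB into RAM while leaving larger ones memory-mapped (the exact threshold is just an illustration):

>>> import datasets
>>> # Datasets smaller than this number of bytes are copied into RAM; larger ones stay memory-mapped
>>> datasets.config.IN_MEMORY_MAX_SIZE = 20 * 1024 ** 3

Setting the HF_DATASETS_IN_MEMORY_MAX_SIZE environment variable to the same number of bytes before starting Python has the same effect, with lower precedence.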