Loading a Dataset¶
A datasets.Dataset can be created from various sources of data:
from the HuggingFace Hub,
from local files, e.g. CSV/JSON/text/pandas files, or
from in-memory data like a Python dict or a pandas dataframe.
In this section we study each option.
From the HuggingFace Hub¶
Hundreds of datasets for many NLP tasks, such as text classification, question answering and language modeling, are provided on the HuggingFace Hub and can be viewed and explored online with the 🤗 Datasets viewer.
Note
You can also add a new dataset to the Hub to share with the community, as detailed in the guide on adding a new dataset.
All the datasets currently available on the Hub can be listed using datasets.list_datasets():
>>> from datasets import list_datasets
>>> datasets_list = list_datasets()
>>> len(datasets_list)
656
>>> print(', '.join(dataset for dataset in datasets_list))
aeslc, ag_news, ai2_arc, allocine, anli, arcd, art, billsum, blended_skill_talk, blimp, blog_authorship_corpus, bookcorpus, boolq, break_data,
c4, cfq, civil_comments, cmrc2018, cnn_dailymail, coarse_discourse, com_qa, commonsense_qa, compguesswhat, coqa, cornell_movie_dialog, cos_e,
cosmos_qa, crime_and_punish, csv, definite_pronoun_resolution, discofuse, docred, drop, eli5, empathetic_dialogues, eraser_multi_rc, esnli,
event2Mind, fever, flores, fquad, gap, germeval_14, ghomasHudson/cqc, gigaword, glue, hansards, hellaswag, hyperpartisan_news_detection,
imdb, jeopardy, json, k-halid/ar, kor_nli, lc_quad, lhoestq/c4, librispeech_lm, lm1b, math_dataset, math_qa, mlqa, movie_rationales,
multi_news, multi_nli, multi_nli_mismatch, mwsc, natural_questions, newsroom, openbookqa, opinosis, pandas, para_crawl, pg19, piaf, qa4mre,
qa_zre, qangaroo, qanta, qasc, quarel, quartz, quoref, race, reclor, reddit, reddit_tifu, rotten_tomatoes, scan, scicite, scientific_papers,
scifact, sciq, scitail, sentiment140, snli, social_i_qa, squad, squad_es, squad_it, squad_v1_pt, squad_v2, squadshifts, super_glue, ted_hrlr,
ted_multi, tiny_shakespeare, trivia_qa, tydiqa, ubuntu_dialogs_corpus, webis/tl_dr, wiki40b, wiki_dpr, wiki_qa, wiki_snippets, wiki_split,
wikihow, wikipedia, wikisql, wikitext, winogrande, wiqa, wmt14, wmt15, wmt16, wmt17, wmt18, wmt19, wmt_t2t, wnut_17, x_stance, xcopa, xnli,
xquad, xsum, xtreme, yelp_polarity
To load a dataset from the Hub, call datasets.load_dataset() with the short name of the dataset you would like to load, as listed above or on the Hub.
Let’s load the SQuAD dataset for Question Answering. You can explore this dataset and find more details about it on the online viewer (which is actually just a wrapper on top of the datasets.Dataset we will now create):
>>> from datasets import load_dataset
>>> dataset = load_dataset('squad', split='train')
This call to datasets.load_dataset() performs the following steps under the hood:
Download the SQuAD Python processing script from the HuggingFace GitHub repository or AWS bucket and import it into the library, if it’s not already stored there.
Note
Processing scripts are small python scripts which define the info (citation, description) and format of the dataset and contain the URL to the original SQuAD JSON files and the code to load examples from the original SQuAD JSON files. You can find the SQuAD processing script here for instance.
Run the SQuAD Python processing script, which downloads the SQuAD dataset from the original URL (if it’s not already downloaded and cached), then processes it and caches each standard split as an Arrow table on the drive.
Note
An Apache Arrow Table is the internal storage format for 🤗 Datasets. It can store arbitrarily long dataframes, typed with potentially complex nested types that can be mapped to numpy/pandas/python types. Apache Arrow allows you to map blobs of data on-drive without doing any deserialization. So caching the dataset directly on disk can use memory-mapping and pay effectively zero cost with O(1) random access. Alternatively, you can copy it in CPU memory (RAM) by setting the keep_in_memory argument of datasets.load_dataset() to True.
The default in 🤗 Datasets is to memory-map the dataset on disk, which is what happens while datasets.config.IN_MEMORY_MAX_SIZE keeps its default value of 0 bytes. If you set it to a nonzero value, the dataset will be copied in-memory if its size is smaller than datasets.config.IN_MEMORY_MAX_SIZE bytes, and memory-mapped otherwise. This behavior can be enabled by setting either the configuration option datasets.config.IN_MEMORY_MAX_SIZE (higher precedence) or the environment variable HF_DATASETS_IN_MEMORY_MAX_SIZE (lower precedence) to nonzero.
Return a dataset built from the splits asked by the user (default: all); in the above example we create a dataset with the train split.
Selecting a split¶
If you don’t provide a split argument to datasets.load_dataset(), this method will return a dictionary containing a dataset for each split in the dataset.
>>> from datasets import load_dataset
>>> datasets = load_dataset('squad')
>>> print(datasets)
{'train': Dataset(schema: {'id': 'string', 'title': 'string', 'context': 'string', 'question': 'string', 'answers': 'struct<text: list<item: string>, answer_start: list<item: int32>>'}, num_rows: 87599),
'validation': Dataset(schema: {'id': 'string', 'title': 'string', 'context': 'string', 'question': 'string', 'answers': 'struct<text: list<item: string>, answer_start: list<item: int32>>'}, num_rows: 10570)
}
The split argument gives fine-grained control over the generated dataset split. You can use it to build a split from only a portion of a split, given as an absolute number of examples or as a proportion (e.g. split='train[:10%]' will load only the first 10% of the train split), or to mix splits (e.g. split='train[:100]+validation[:100]' will create a split from the first 100 examples of the train split and the first 100 examples of the validation split).
You can find more details on the syntax for using split in the dedicated tutorial on split.
Selecting a configuration¶
Some datasets comprise several configurations. A configuration defines a sub-part of a dataset which can be selected. Unlike splits, you have to select a single configuration for the dataset; you cannot mix several configurations. Examples of datasets with several configurations are:
the GLUE dataset, which is an aggregated benchmark comprised of 10 subsets: COLA, SST2, MRPC, QQP, STSB, MNLI, QNLI, RTE, WNLI and the diagnostic subset AX.
the wikipedia dataset, which is provided for several languages.
When a dataset is provided with more than one configuration, you will be asked to explicitly select a configuration among the possibilities.
Selecting a configuration is done by providing datasets.load_dataset() with a name argument. Here is an example for GLUE:
>>> from datasets import load_dataset
>>> dataset = load_dataset('glue')
ValueError: Config name is missing.
Please pick one among the available configs: ['cola', 'sst2', 'mrpc', 'qqp', 'stsb', 'mnli', 'mnli_mismatched', 'mnli_matched', 'qnli', 'rte', 'wnli', 'ax']
Example of usage:
`load_dataset('glue', 'cola')`
>>> dataset = load_dataset('glue', 'sst2')
Downloading and preparing dataset glue/sst2 (download: 7.09 MiB, generated: 4.81 MiB, total: 11.90 MiB) to /Users/thomwolf/.cache/huggingface/datasets/glue/sst2/1.0.0...
Downloading: 100%|██████████████████████████████████████████████████████████████| 7.44M/7.44M [00:01<00:00, 7.03MB/s]
Dataset glue downloaded and prepared to /Users/huggignface/.cache/huggingface/datasets/glue/sst2/1.0.0. Subsequent calls will reuse this data.
>>> print(dataset)
{'train': Dataset(schema: {'sentence': 'string', 'label': 'int64', 'idx': 'int32'}, num_rows: 67349),
'validation': Dataset(schema: {'sentence': 'string', 'label': 'int64', 'idx': 'int32'}, num_rows: 872),
'test': Dataset(schema: {'sentence': 'string', 'label': 'int64', 'idx': 'int32'}, num_rows: 1821)
}
Manually downloading files¶
Some datasets require you to manually download some files, usually because of licensing issues or because these files are behind a login page.
In this case, specific instructions for downloading the missing files will be provided when running the script with datasets.load_dataset() for the first time, explaining where and how you can get the files.
After you’ve downloaded the files, you can point to the folder hosting them locally with the data_dir argument as follows:
>>> dataset = load_dataset("xtreme", "PAN-X.fr")
Downloading and preparing dataset xtreme/PAN-X.fr (download: Unknown size, generated: 5.80 MiB, total: 5.80 MiB) to /Users/thomwolf/.cache/huggingface/datasets/xtreme/PAN-X.fr/1.0.0...
AssertionError: The dataset xtreme with config PAN-X.fr requires manual data.
Please follow the manual download instructions: You need to manually download the AmazonPhotos.zip file on Amazon Cloud Drive (https://www.amazon.com/clouddrive/share/d3KGCRCIYwhKJF0H3eWA26hjg2ZCRhjpEQtDL70FSBN). The folder containing the saved file can be used to load the dataset via 'datasets.load_dataset("xtreme", data_dir="<path/to/folder>")'
Apart from name and split, the datasets.load_dataset() method provides a few arguments which can be used to control where the data is cached (cache_dir) and some options for the download process itself, like the proxies and whether the download cache should be used (download_config, download_mode).
The use of these arguments is discussed in the Cache management and integrity verifications section below. You can also find the full details on these arguments on the package reference page for datasets.load_dataset().
From local files¶
It’s also possible to create a dataset from local files.
Generic loading scripts are provided for:
CSV files (with the csv script),
JSON files (with the json script),
text files (read as a line-by-line dataset with the text script),
pandas pickled dataframes (with the pandas script).
If you want better control over how your files are loaded, or if you have a file format that exactly reproduces the format of one of the datasets provided on the HuggingFace Hub, it can be more flexible and simpler to create your own loading script, from scratch or by adapting one of the provided loading scripts. In this case, please check the Writing a dataset loading script chapter.
The data_files argument in datasets.load_dataset() is used to provide paths to one or several files. This argument currently accepts three types of inputs:
str: a single string as the path to a single file (considered to constitute the train split by default)
List[str]: a list of strings as paths to a list of files (also considered to constitute the train split by default)
Dict[str, Union[str, List[str]]]: a dictionary mapping split names to a single file or a list of files.
Let’s see an example of all the various ways you can provide files to datasets.load_dataset():
>>> from datasets import load_dataset
>>> dataset = load_dataset('csv', data_files='my_file.csv')
>>> dataset = load_dataset('csv', data_files=['my_file_1.csv', 'my_file_2.csv', 'my_file_3.csv'])
>>> dataset = load_dataset('csv', data_files={'train': ['my_train_file_1.csv', 'my_train_file_2.csv'],
...                                               'test': 'my_test_file.csv'})
Note
The split argument will work similarly to what we detailed above for the datasets on the Hub and you can find more details on the syntax for using split in the dedicated tutorial on split. The only specific behavior related to loading local files is that if you don’t indicate which split each file is related to, the provided files are assumed to belong to the train split.
CSV files¶
🤗 Datasets can read a dataset made of one or several CSV files.
All the CSV files in the dataset should have the same organization and in particular the same datatypes for the columns.
A few interesting features are provided out-of-the-box by the Apache Arrow backend:
multi-threaded or single-threaded reading
automatic decompression of input files (based on the filename extension, such as my_data.csv.gz)
fetching column names from the first row in the CSV file
column-wise type inference and conversion to one of null, int64, float64, timestamp[s], string or binary data
detecting various spellings of null values such as NaN or #N/A
Here is an example loading two CSV files to create a train split (the default split unless specified otherwise):
>>> from datasets import load_dataset
>>> dataset = load_dataset('csv', data_files=['my_file_1.csv', 'my_file_2.csv'])
The csv loading script provides a few simple access options to control parsing and reading the CSV files:
skiprows (int) – Number of first rows in the file to skip (default is 0).
column_names (list, optional) – The column names of the target table. If empty, fall back on autogenerate_column_names (default: empty).
delimiter (1-character string) – The character delimiting individual cells in the CSV data (default ',').
quotechar (1-character string) – The character used optionally for quoting CSV values (default '"').
quoting (int) – Control quoting behavior (default 0, setting this to 3 disables quoting, refer to pandas.read_csv documentation for more details).
If you want more control, the csv script provides full control over reading, parsing and conversion through the Apache Arrow pyarrow.csv.ReadOptions, pyarrow.csv.ParseOptions and pyarrow.csv.ConvertOptions:
read_options – Can be provided with a pyarrow.csv.ReadOptions to control all the reading options. If skiprows, column_names or autogenerate_column_names are also provided (see above), they will take priority over the attributes in read_options.
parse_options – Can be provided with a pyarrow.csv.ParseOptions to control all the parsing options. If delimiter or quotechar are also provided (see above), they will take priority over the attributes in parse_options.
convert_options – Can be provided with a pyarrow.csv.ConvertOptions to control all the conversion options.
JSON files¶
🤗 Datasets supports building a dataset from JSON files in various formats.
The most efficient format is to have JSON files consisting of multiple JSON objects, one per line, representing individual data rows:
{"a": 1, "b": 2.0, "c": "foo", "d": false}
{"a": 4, "b": -5.5, "c": null, "d": true}
In this case, interesting features are provided out-of-the-box by the Apache Arrow backend:
multi-threaded reading
automatic decompression of input files (based on the filename extension, such as my_data.json.gz)
sophisticated type inference (see below)
You can load such a dataset directly with:
>>> from datasets import load_dataset
>>> dataset = load_dataset('json', data_files='my_file.json')
In real life, though, JSON files can have diverse formats, and the json script will accordingly fall back on Python JSON loading methods to handle them.
One common occurrence is a JSON file with a single root dictionary where the dataset is contained in a specific field, as a list of dicts or a dict of lists.
{"version": "0.1.0",
"data": [{"a": 1, "b": 2.0, "c": "foo", "d": false},
{"a": 4, "b": -5.5, "c": null, "d": true}]
}
In this case you will need to specify which field contains the dataset using the field argument as follows:
>>> from datasets import load_dataset
>>> dataset = load_dataset('json', data_files='my_file.json', field='data')
Text files¶
🤗 Datasets also supports building a dataset from text files read line by line (each line will be a row in the dataset).
This is simply done using the text
loading script which will generate a dataset with a single column called text
containing all the text lines of the input files as strings.
>>> from datasets import load_dataset
>>> dataset = load_dataset('text', data_files={'train': ['my_text_1.txt', 'my_text_2.txt'], 'test': 'my_test_file.txt'})
Specifying the features of the dataset¶
When you create a dataset from local files, the datasets.Features of the dataset are automatically guessed using an automatic type inference system based on Apache Arrow Automatic Type Inference.
However, sometimes you may want to define the features of the dataset yourself, for instance to control the names and indices of labels using a datasets.ClassLabel.
In this case you can use the features argument of datasets.load_dataset() to supply a datasets.Features instance defining the features of your dataset and overriding the default pre-computed features.
From in-memory data¶
Finally, it’s also possible to instantiate a datasets.Dataset directly from in-memory data, currently one of:
a python dict, or
a pandas dataframe.
From a python dictionary¶
Let’s say that you have already loaded some data into an in-memory object in your Python session:
>>> my_dict = {'id': [0, 1, 2],
...            'name': ['mary', 'bob', 'eve'],
...            'age': [24, 53, 19]}
You can then directly create a datasets.Dataset
object using the datasets.Dataset.from_dict()
or the datasets.Dataset.from_pandas()
class methods of the datasets.Dataset
class:
>>> from datasets import Dataset
>>> dataset = Dataset.from_dict(my_dict)
From a pandas dataframe¶
You can similarly instantiate a Dataset object from a pandas
DataFrame:
>>> from datasets import Dataset
>>> import pandas as pd
>>> df = pd.DataFrame({"a": [1, 2, 3]})
>>> dataset = Dataset.from_pandas(df)
Note
The column types in the resulting Arrow Table are inferred from the dtypes of the pandas.Series in the DataFrame. In the case of non-object Series, the NumPy dtype is translated to its Arrow equivalent. In the case of object, we need to guess the datatype by looking at the Python objects in this Series.
Be aware that Series of the object dtype don’t carry enough information to always lead to a meaningful Arrow type. In the case that we cannot infer a type, e.g. because the DataFrame is of length 0 or the Series only contains None/nan objects, the type is set to null. This behavior can be avoided by constructing an explicit schema and passing it to this function.
To be sure that the schema and type of the instantiated datasets.Dataset are as intended, you can explicitly provide the features of the dataset as a datasets.Features object to the from_dict and from_pandas methods.
Using a custom dataset loading script¶
If the provided loading scripts for Hub datasets or for local files do not fit your use case, you can also easily write and use your own dataset loading script.
You can use a local loading script just by providing its path instead of the usual shortcut name:
>>> from datasets import load_dataset
>>> dataset = load_dataset('PATH/TO/MY/LOADING/SCRIPT', data_files='PATH/TO/MY/FILE')
We provide more details on how to create your own dataset generation script on the Writing a dataset loading script page and you can also find some inspiration in all the already provided loading scripts on the GitHub repository.
Loading datasets in streaming mode¶
When a dataset is in streaming mode, you can iterate over it directly without having to download the entire dataset.
The data are downloaded progressively as you iterate over the dataset.
You can enable dataset streaming by passing streaming=True to the load_dataset() function to get an iterable dataset.
For example, you can start iterating over big datasets like OSCAR without having to download terabytes of data using this code:
>>> from datasets import load_dataset
>>> dataset = load_dataset('oscar', "unshuffled_deduplicated_en", split='train', streaming=True)
>>> print(next(iter(dataset)))
{'text': 'Mtendere Village was inspired by the vision of Chief Napoleon Dzombe, which he shared with John Blanchard during his first visit to Malawi. Chief Napoleon conveyed the desperate need for a program to intervene and care for the orphans and vulnerable children (OVC) in Malawi, and John committed to help...
Note
A dataset in streaming mode is not a datasets.Dataset object, but a datasets.IterableDataset object. You can find more information about iterable datasets in the dataset streaming documentation.
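Since a streamed dataset is consumed by iteration, standard iterator tools are enough to peek at a handful of examples. The sketch below uses a made-up generator, fake_stream, as a stand-in for the iterable returned by load_dataset(..., streaming=True), so that it runs without network access. The same islice pattern should carry over to a real streamed dataset.

```python
from itertools import islice


def fake_stream():
    """Hypothetical stand-in for a streamed dataset: yields examples one by one, forever."""
    index = 0
    while True:
        yield {"id": index, "text": f"example {index}"}
        index += 1


# Take only the first 3 examples; nothing beyond them is ever produced.
first_examples = list(islice(fake_stream(), 3))
for example in first_examples:
    print(example)
```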
Cache management and integrity verifications¶
Cache directory¶
To avoid re-downloading the whole dataset every time you use it, the datasets library caches the data on your computer.
By default, the datasets library caches the datasets and the downloaded data files under the following directory: ~/.cache/huggingface/datasets.
If you want to change the location where the datasets cache is stored, simply set the HF_DATASETS_CACHE environment variable. For example, if you’re using Linux:
$ export HF_DATASETS_CACHE="/path/to/another/directory"
In addition, you can control where the data is cached when invoking the loading script, by setting the cache_dir
parameter:
>>> from datasets import load_dataset
>>> dataset = load_dataset('LOADING_SCRIPT', cache_dir="PATH/TO/MY/CACHE/DIR")
Download mode¶
You can control the way the datasets.load_dataset() function handles already downloaded data by setting its download_mode parameter.
By default, download_mode is set to "reuse_dataset_if_exists". The datasets.load_dataset() function will reuse both raw downloads and the prepared dataset, if they exist in the cache directory.
The following table describes the three available modes for download:
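| download_mode | Downloaded files (raw data) | Dataset object |
|---|---|---|
| "reuse_dataset_if_exists" (default) | Reuse | Reuse |
| "reuse_cache_if_exists" | Reuse | Fresh |
| "force_redownload" | Fresh | Fresh |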
For example, you can run the following if you want to force the re-download of the SQuAD raw data files:
>>> from datasets import load_dataset
>>> dataset = load_dataset('squad', download_mode="force_redownload")
Integrity verifications¶
When downloading a dataset from the 🤗 Datasets Hub, the datasets.load_dataset()
function performs by default a number of verifications on the downloaded files. These verifications include:
Verifying the list of downloaded files
Verifying the number of bytes of the downloaded files
Verifying the SHA256 checksums of the downloaded files
Verifying the number of splits in the generated DatasetDict
Verifying the number of samples in each split of the generated DatasetDict
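To give an idea of what the size and checksum checks listed above involve, here is a standalone sketch using only the Python standard library. It is not the library's own implementation: the helper size_and_sha256 and the file name are hypothetical, and the expected values are computed on the spot instead of being read from recorded dataset metadata.

```python
import hashlib
import os
import tempfile


def size_and_sha256(path, chunk_size=1 << 20):
    """Return (number of bytes, SHA256 hex digest) for the file at `path`."""
    digest = hashlib.sha256()
    num_bytes = 0
    with open(path, "rb") as f:
        # Read the file in chunks so large downloads do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
            num_bytes += len(chunk)
    return num_bytes, digest.hexdigest()


# Simulate a downloaded file (hypothetical content).
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "downloaded_file.bin")
payload = b"hello datasets"
with open(path, "wb") as f:
    f.write(payload)

# Expected values would normally be recorded ahead of time; here they come from the payload.
expected = (len(payload), hashlib.sha256(payload).hexdigest())
actual = size_and_sha256(path)
print("verified" if actual == expected else "mismatch")
```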
You can disable these verifications by setting the ignore_verifications parameter to True.
You can also locally override the information used to perform the integrity verifications by setting the save_infos parameter to True.
For example, run the following to skip integrity verifications when loading the IMDB dataset:
>>> from datasets import load_dataset
>>> dataset = load_dataset('imdb', ignore_verifications=True)
Loading datasets offline¶
Each dataset builder (e.g. “squad”) is a Python script that is downloaded and cached from either the 🤗 Datasets GitHub repository or the HuggingFace Hub.
Only the text, csv, json and pandas builders are included in datasets without requiring external downloads.
Therefore, if you don’t have an internet connection, you can’t load a dataset that is not packaged with datasets, unless the dataset is already cached.
Indeed, if you’ve already loaded the dataset once before (when you had an internet connection), then the dataset is reloaded from the cache and you can use it offline.
You can even set the environment variable HF_DATASETS_OFFLINE to 1 to tell datasets to run in full offline mode.
This mode disables all the network calls of the library.
This way, instead of waiting for a dataset builder download to time out, the library looks directly at the cache.
Loading a dataset builder¶
You can use datasets.load_dataset_builder()
to inspect metadata (cache directory, configs, dataset info, etc.) that is required to build a dataset without downloading the dataset itself.
For example, run the following to get the path to the cache directory of the IMDB dataset:
>>> from datasets import load_dataset_builder
>>> dataset_builder = load_dataset_builder('imdb')
>>> print(dataset_builder.cache_dir)
/Users/thomwolf/.cache/huggingface/datasets/imdb/plain_text/1.0.0/fdc76b18d5506f14b0646729b8d371880ef1bc48a26d00835a7f3da44004b676
>>> print(dataset_builder.info.features)
{'text': Value(dtype='string', id=None), 'label': ClassLabel(num_classes=2, names=['neg', 'pos'], names_file=None, id=None)}
>>> print(dataset_builder.info.splits)
{'train': SplitInfo(name='train', num_bytes=33432835, num_examples=25000, dataset_name='imdb'), 'test': SplitInfo(name='test', num_bytes=32650697, num_examples=25000, dataset_name='imdb'), 'unsupervised': SplitInfo(name='unsupervised', num_bytes=67106814, num_examples=50000, dataset_name='imdb')}
You can see all the attributes of dataset_builder.info in the documentation of datasets.DatasetInfo.
Enhancing performance¶
If you would like to speed up dataset operations, you can copy the dataset in-memory instead of memory-mapping it from the on-disk cache by setting datasets.config.IN_MEMORY_MAX_SIZE to a nonzero size (in bytes) that fits in your RAM. In that case, the dataset will be copied in-memory if its size is smaller than datasets.config.IN_MEMORY_MAX_SIZE bytes, and memory-mapped otherwise. This behavior can be enabled by setting either the configuration option datasets.config.IN_MEMORY_MAX_SIZE (higher precedence) or the environment variable HF_DATASETS_IN_MEMORY_MAX_SIZE (lower precedence) to nonzero.