Load
Your data can be stored in various places: on your local machine’s disk, in a GitHub repository, or in in-memory data structures like Python dictionaries and Pandas DataFrames. Wherever a dataset is stored, 🤗 Datasets can help you load it.
This guide will show you how to load a dataset from:
- The Hub without a dataset loading script
- Local loading script
- Local files
- In-memory data
- Offline
- A specific slice of a split
For more details specific to loading other dataset modalities, take a look at the load audio dataset guide, the load image dataset guide, or the load text dataset guide.
Hugging Face Hub
Datasets are loaded from a dataset loading script that downloads and generates the dataset. However, you can also load a dataset from any dataset repository on the Hub without a loading script! Begin by creating a dataset repository and uploading your data files. Now you can use the load_dataset() function to load the dataset.
For example, try loading the files from this demo repository by providing the repository namespace and dataset name. This dataset repository contains CSV files, and the code below loads the dataset from the CSV files:
>>> from datasets import load_dataset
>>> dataset = load_dataset("lhoestq/demo1")
Some datasets may have more than one version based on Git tags, branches, or commits. Use the revision parameter to specify the dataset version you want to load:
>>> dataset = load_dataset(
... "lhoestq/custom_squad",
... revision="main" # tag name, or branch name, or commit hash
... )
Refer to the Upload a dataset to the Hub tutorial for more details on how to create a dataset repository on the Hub, and how to upload your data files.
A dataset without a loading script by default loads all the data into the train split. Use the data_files parameter to map data files to splits like train, validation and test:
>>> data_files = {"train": "train.csv", "test": "test.csv"}
>>> dataset = load_dataset("namespace/your_dataset_name", data_files=data_files)
If you don’t specify which data files to use, load_dataset() will return all the data files. This can take a long time if you load a large dataset like C4, which is approximately 13TB of data.
You can also load a specific subset of the files with the data_files parameter. The example below only loads files that match the glob pattern:
>>> from datasets import load_dataset
>>> c4_subset = load_dataset("allenai/c4", data_files="en/c4-train.0000*-of-01024.json.gz")
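To get a feel for which file names a wildcard pattern like this selects, the standard-library fnmatch module gives a rough approximation. This is only a sketch: the shard names are generated here for illustration from the naming scheme visible in the pattern, and 🤗 Datasets resolves patterns with its own logic, which may differ from fnmatch in edge cases.

```python
from fnmatch import fnmatchcase

pattern = "en/c4-train.0000*-of-01024.json.gz"

# Illustrative listing (an assumption): the first 12 shard names,
# built from the naming scheme that appears in the pattern.
candidates = [f"en/c4-train.{i:05d}-of-01024.json.gz" for i in range(12)]

# Keep only the names that the wildcard pattern matches.
matched = [name for name in candidates if fnmatchcase(name, pattern)]

print(len(matched))  # shards 00000 through 00009 match
print(matched[-1])
```

Shards numbered 00010 and above do not match, because the pattern fixes the first four digits to 0000.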
The split parameter can also map a data file to a specific split:
>>> data_files = {"validation": "en/c4-validation.*.json.gz"}
>>> c4_validation = load_dataset("allenai/c4", data_files=data_files, split="validation")
Local loading script
You may have a 🤗 Datasets loading script locally on your computer. In this case, load the dataset by passing one of the following paths to load_dataset():
- The local path to the loading script file.
- The local path to the directory containing the loading script file (only if the script file has the same name as the directory).
>>> dataset = load_dataset("path/to/local/loading_script/loading_script.py", split="train")
>>> dataset = load_dataset("path/to/local/loading_script", split="train") # equivalent because the file has the same name as the directory
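The directory form only works when the script is named after the directory that contains it. A minimal sketch of that naming rule, using a hypothetical helper called expected_script_path (it is not part of 🤗 Datasets):

```python
from pathlib import PurePosixPath


def expected_script_path(directory: str) -> str:
    """Return the script path implied by a directory path (hypothetical helper)."""
    directory_path = PurePosixPath(directory)
    # The script must share its name with the directory that contains it.
    return str(directory_path / f"{directory_path.name}.py")


script_path = expected_script_path("path/to/local/loading_script")
print(script_path)  # path/to/local/loading_script/loading_script.py
```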
Local and remote files
Datasets can be loaded from local files stored on your computer and from remote files. The datasets are most likely stored as a csv, json, txt or parquet file. The load_dataset() function can load each of these file types.
CSV
🤗 Datasets can read a dataset made up of one or several CSV files:
>>> from datasets import load_dataset
>>> dataset = load_dataset("csv", data_files="my_file.csv")
If you have more than one CSV file:
>>> dataset = load_dataset("csv", data_files=["my_file_1.csv", "my_file_2.csv", "my_file_3.csv"])
You can also map the training and test splits to specific CSV files:
>>> dataset = load_dataset("csv", data_files={"train": ["my_train_file_1.csv", "my_train_file_2.csv"], "test": "my_test_file.csv"})
To load remote CSV files via HTTP, pass the URLs instead:
>>> base_url = "https://huggingface.co/datasets/lhoestq/demo1/resolve/main/data/"
>>> dataset = load_dataset('csv', data_files={'train': base_url + 'train.csv', 'test': base_url + 'test.csv'})
To load zipped CSV files:
>>> url = "https://domain.org/train_data.zip"
>>> data_files = {"train": url}
>>> dataset = load_dataset("csv", data_files=data_files)
JSON
JSON files are loaded directly with load_dataset() as shown below:
>>> from datasets import load_dataset
>>> dataset = load_dataset("json", data_files="my_file.json")
JSON files have diverse formats, but we think the most efficient format is to have multiple JSON objects; each line represents an individual row of data. For example:
{"a": 1, "b": 2.0, "c": "foo", "d": false}
{"a": 4, "b": -5.5, "c": null, "d": true}
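If your rows start out as Python dictionaries, a file in this one-object-per-line layout can be produced with the standard json module. The sketch below writes the two example rows to a temporary file; the file name my_file.json is reused from the example above, and the temporary directory is an assumption made to keep the snippet self-contained.

```python
import json
import os
import tempfile

rows = [
    {"a": 1, "b": 2.0, "c": "foo", "d": False},
    {"a": 4, "b": -5.5, "c": None, "d": True},
]

# Write one JSON object per line.
path = os.path.join(tempfile.mkdtemp(), "my_file.json")
with open(path, "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# Read the file back to check that each line is a standalone JSON object.
with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()

print(lines[0])
```

The resulting path can then be passed as data_files to load_dataset("json", ...).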
Another JSON format you may encounter is a nested field, in which case you’ll need to specify the field argument as shown in the following:
{"version": "0.1.0",
"data": [{"a": 1, "b": 2.0, "c": "foo", "d": false},
{"a": 4, "b": -5.5, "c": null, "d": true}]
}
>>> from datasets import load_dataset
>>> dataset = load_dataset("json", data_files="my_file.json", field="data")
To load remote JSON files via HTTP, pass the URLs instead:
>>> base_url = "https://rajpurkar.github.io/SQuAD-explorer/dataset/"
>>> dataset = load_dataset("json", data_files={"train": base_url + "train-v1.1.json", "validation": base_url + "dev-v1.1.json"}, field="data")
While these are the most common JSON formats, you’ll see other datasets that are formatted differently. 🤗 Datasets recognizes these other formats and will fall back accordingly on the Python JSON loading methods to handle them.
Parquet
Parquet files are stored in a columnar format, unlike row-based files like a CSV. Large datasets may be stored in a Parquet file because it is more efficient and returns query results faster.
To load a Parquet file:
>>> from datasets import load_dataset
>>> dataset = load_dataset("parquet", data_files={'train': 'train.parquet', 'test': 'test.parquet'})
To load remote Parquet files via HTTP, pass the URLs instead:
>>> base_url = "https://storage.googleapis.com/huggingface-nlp/cache/datasets/wikipedia/20200501.en/1.0.0/"
>>> data_files = {"train": base_url + "wikipedia-train.parquet"}
>>> wiki = load_dataset("parquet", data_files=data_files, split="train")
In-memory data
🤗 Datasets will also allow you to create a Dataset directly from in-memory data structures like Python dictionaries and Pandas DataFrames.
Python dictionary
Load Python dictionaries with from_dict():
>>> from datasets import Dataset
>>> my_dict = {"a": [1, 2, 3]}
>>> dataset = Dataset.from_dict(my_dict)
Pandas DataFrame
Load Pandas DataFrames with from_pandas():
>>> from datasets import Dataset
>>> import pandas as pd
>>> df = pd.DataFrame({"a": [1, 2, 3]})
>>> dataset = Dataset.from_pandas(df)
An object data type in pandas.Series doesn’t always carry enough information for Arrow to automatically infer a data type. For example, if a DataFrame is of length 0 or the Series only contains None/NaN objects, the type is set to null. Avoid potential errors by constructing an explicit schema with Features using the from_dict or from_pandas methods. See the troubleshoot section for more details on how to explicitly specify your own features.
Offline
Even if you don’t have an internet connection, it is still possible to load a dataset. As long as you’ve downloaded a dataset from the Hub or 🤗 Datasets GitHub repository before, it should be cached. This means you can reload the dataset from the cache and use it offline.
If you know you won’t have internet access, you can run 🤗 Datasets in full offline mode. This saves time because instead of waiting for the Dataset builder download to time out, 🤗 Datasets will look directly in the cache. Set the environment variable HF_DATASETS_OFFLINE to 1 to enable full offline mode.
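The variable can be exported in your shell, or set from Python. The sketch below takes the Python route and assumes the variable is read when the library is imported, so it sets the value before any import of datasets.

```python
import os

# Assumption: the flag is read when `datasets` is imported,
# so set it before the first import of the library.
os.environ["HF_DATASETS_OFFLINE"] = "1"

# import datasets  # import only after the variable is set

offline_flag = os.environ["HF_DATASETS_OFFLINE"]
print(offline_flag)  # 1
```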
Slice splits
You can also choose only to load specific slices of a split. There are two options for slicing a split: using strings or the ReadInstruction API. Strings are more compact and readable for simple cases, while ReadInstruction is easier to use with variable slicing parameters.
Concatenate a train and test split by:
>>> import datasets
>>> train_test_ds = datasets.load_dataset("bookcorpus", split="train+test")
Select specific rows of the train split:
>>> train_10_20_ds = datasets.load_dataset("bookcorpus", split="train[10:20]")
Or select a percentage of a split with:
>>> train_10pct_ds = datasets.load_dataset("bookcorpus", split="train[:10%]")
Select a combination of percentages from each split:
>>> train_10_80pct_ds = datasets.load_dataset("bookcorpus", split="train[:10%]+train[-80%:]")
Finally, you can even create cross-validated splits. The example below creates 10-fold cross-validated splits. Each validation dataset is a 10% chunk, and the training dataset makes up the remaining complementary 90% chunk:
>>> val_ds = datasets.load_dataset("bookcorpus", split=[f"train[{k}%:{k+10}%]" for k in range(0, 100, 10)])
>>> train_ds = datasets.load_dataset("bookcorpus", split=[f"train[:{k}%]+train[{k+10}%:]" for k in range(0, 100, 10)])
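It can help to look at the split strings these list comprehensions generate before passing them to load_dataset(). The sketch below only builds the strings and loads nothing.

```python
# Build the same split strings as the cross-validation example, without loading data.
val_splits = [f"train[{k}%:{k+10}%]" for k in range(0, 100, 10)]
train_splits = [f"train[:{k}%]+train[{k+10}%:]" for k in range(0, 100, 10)]

# Each validation slice is paired with its complementary training slice.
for val, train in zip(val_splits[:2], train_splits[:2]):
    print(val, "|", train)
# train[0%:10%] | train[:0%]+train[10%:]
# train[10%:20%] | train[:10%]+train[20%:]
```

Each of the 10 folds pairs a 10% validation window with the two training slices on either side of it.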
Percent slicing and rounding
The default behavior is to round the boundaries to the nearest integer for datasets where the requested slice boundaries do not divide evenly by 100. As shown below, some slices may contain more examples than others. For instance, if the following train split includes 999 records, then:
# 19 records, from 500 (included) to 519 (excluded).
>>> train_50_52_ds = datasets.load_dataset("bookcorpus", split="train[50%:52%]")
# 20 records, from 519 (included) to 539 (excluded).
>>> train_52_54_ds = datasets.load_dataset("bookcorpus", split="train[52%:54%]")
If you want equal sized splits, use pct1_dropremainder rounding instead. This treats the specified percentage boundaries as multiples of 1%.
# 18 records, from 450 (included) to 468 (excluded).
>>> train_50_52pct1_ds = datasets.load_dataset("bookcorpus", split=datasets.ReadInstruction("train", from_=50, to=52, unit="%", rounding="pct1_dropremainder"))
# 18 records, from 468 (included) to 486 (excluded).
>>> train_52_54pct1_ds = datasets.load_dataset("bookcorpus", split=datasets.ReadInstruction("train",from_=52, to=54, unit="%", rounding="pct1_dropremainder"))
# Or equivalently:
>>> train_50_52pct1_ds = datasets.load_dataset("bookcorpus", split="train[50%:52%](pct1_dropremainder)")
>>> train_52_54pct1_ds = datasets.load_dataset("bookcorpus", split="train[52%:54%](pct1_dropremainder)")
pct1_dropremainder rounding may truncate the last examples in a dataset if the number of examples in your dataset doesn’t divide evenly by 100.
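The record counts quoted above for a 999-record split can be reproduced with plain arithmetic. The two helper functions below are hypothetical and written only to mirror the numbers in this section. They are a sketch of the two rounding rules, not the library's implementation.

```python
num_examples = 999


def closest_boundary(percent: int, total: int) -> int:
    """Default rule: round the boundary to the nearest integer (hypothetical helper)."""
    # Python's round() sends ties to the even neighbour, so 499.5 becomes 500.
    return round(percent * total / 100)


def pct1_boundary(percent: int, total: int) -> int:
    """pct1_dropremainder rule: every 1% holds total // 100 records (hypothetical helper)."""
    return percent * (total // 100)


# Default rounding: slices of unequal size.
print(closest_boundary(50, num_examples), closest_boundary(52, num_examples))  # 500 519
print(closest_boundary(52, num_examples), closest_boundary(54, num_examples))  # 519 539

# pct1_dropremainder: equal slices of 18 records each.
print(pct1_boundary(50, num_examples), pct1_boundary(52, num_examples))  # 450 468
print(pct1_boundary(52, num_examples), pct1_boundary(54, num_examples))  # 468 486

# 100% ends at record 900, so the last 99 records are never selected.
dropped = num_examples - pct1_boundary(100, num_examples)
print(dropped)  # 99
```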
Troubleshooting
Sometimes, you may get unexpected results when you load a dataset. Two of the most common issues you may encounter are manually downloading a dataset and specifying features of a dataset.
Manual download
Certain datasets require you to manually download the dataset files due to licensing incompatibility or if the files are hidden behind a login page. This causes load_dataset() to throw an AssertionError. But 🤗 Datasets provides detailed instructions for downloading the missing files. After you’ve downloaded the files, use the data_dir argument to specify the path to the files you just downloaded.
For example, if you try to download a configuration from the MATINF dataset:
>>> dataset = load_dataset("matinf", "summarization")
Downloading and preparing dataset matinf/summarization (download: Unknown size, generated: 246.89 MiB, post-processed: Unknown size, total: 246.89 MiB) to /root/.cache/huggingface/datasets/matinf/summarization/1.0.0/82eee5e71c3ceaf20d909bca36ff237452b4e4ab195d3be7ee1c78b53e6f540e...
AssertionError: The dataset matinf with config summarization requires manual data.
Please follow the manual download instructions: To use MATINF you have to download it manually. Please fill this google form (https://forms.gle/nkH4LVE4iNQeDzsc9). You will receive a download link and a password once you complete the form. Please extract all files in one folder and load the dataset with: *datasets.load_dataset('matinf', data_dir='path/to/folder/folder_name')*.
Manual data can be loaded with `datasets.load_dataset(matinf, data_dir='<path/to/manual/data>')
Specify features
When you create a dataset from local files, the Features are automatically inferred by Apache Arrow. However, the dataset’s features may not always align with your expectations, or you may want to define the features yourself. The following example shows how you can add custom labels with the ClassLabel feature.
Start by defining your own labels with the Features class:
>>> from datasets import ClassLabel, Features, Value
>>> class_names = ["sadness", "joy", "love", "anger", "fear", "surprise"]
>>> emotion_features = Features({'text': Value('string'), 'label': ClassLabel(names=class_names)})
Next, specify the features parameter in load_dataset() with the features you just created:
>>> dataset = load_dataset('csv', data_files=file_dict, delimiter=';', column_names=['text', 'label'], features=emotion_features)
Now when you look at your dataset features, you can see it uses the custom labels you defined:
>>> dataset['train'].features
{'text': Value(dtype='string', id=None),
'label': ClassLabel(num_classes=6, names=['sadness', 'joy', 'love', 'anger', 'fear', 'surprise'], names_file=None, id=None)}
Metrics
Metrics will soon be deprecated in 🤗 Datasets. To learn more about how to use metrics, take a look at our newest library 🤗 Evaluate! In addition to metrics, we’ve also added more tools for evaluating models and datasets.
When the metric you want to use is not supported by 🤗 Datasets, you can write and use your own metric script. Load your metric by providing the path to your local metric loading script:
>>> from datasets import load_metric
>>> metric = load_metric('PATH/TO/MY/METRIC/SCRIPT')
>>> # Example of typical usage
>>> for batch in dataset:
... inputs, references = batch
... predictions = model(inputs)
... metric.add_batch(predictions=predictions, references=references)
>>> score = metric.compute()
See the Metrics guide for more details on how to write your own metric loading script.
Load configurations
A metric can have different configurations. The configuration is stored in the config_name parameter of the MetricInfo attribute. When you load a metric, provide the configuration name as shown in the following:
>>> from datasets import load_metric
>>> metric = load_metric('bleurt', name='bleurt-base-128')
>>> metric = load_metric('bleurt', name='bleurt-base-512')
Distributed setup
When working in a distributed or parallel processing environment, loading and computing a metric can be tricky because these processes are executed in parallel on separate subsets of the data. 🤗 Datasets supports distributed usage with a few additional arguments when you load a metric.
For example, imagine you are training and evaluating on eight parallel processes. Here’s how you would load a metric in this distributed setting:
- Define the total number of processes with the num_process argument.
- Set the process rank as an integer between zero and num_process - 1.
- Load your metric with load_metric() with these arguments:
>>> from datasets import load_metric
>>> metric = load_metric('glue', 'mrpc', num_process=num_process, process_id=rank)
Once you’ve loaded a metric for distributed usage, you can compute the metric as usual. Behind the scenes, Metric.compute() gathers all the predictions and references from the nodes, and computes the final metric.
In some instances, you may be simultaneously running multiple independent distributed evaluations on the same server and files. To avoid any conflicts, it is important to provide an experiment_id to distinguish the separate evaluations:
>>> from datasets import load_metric
>>> metric = load_metric('glue', 'mrpc', num_process=num_process, process_id=process_id, experiment_id="My_experiment_10")