Structure your repository
To host and share your dataset, create a dataset repository on the Hugging Face Hub and upload your data files.
This guide will show you how to structure your dataset repository when you upload it.
A dataset with a supported structure and file format (.txt, .csv, .parquet, .jsonl, .mp3, .jpg, .zip, etc.) is loaded automatically with load_dataset(), and it'll have a dataset viewer on its dataset page on the Hub.
Main use-case
The simplest dataset structure has two files: train.csv and test.csv (this works with any supported file format).
Your repository will also contain a README.md file, the dataset card displayed on your dataset page.
my_dataset_repository/
├── README.md
├── train.csv
└── test.csv
In this simple case, you'll get a dataset with two splits: train (containing examples from train.csv) and test (containing examples from test.csv).
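To try this layout before uploading, it can be recreated locally. The sketch below is a minimal illustration: the helper names (make_minimal_repository, load_local_repository) and the sample rows are hypothetical, and the loading helper assumes the datasets package is installed and that load_dataset() is pointed at a local directory.

```python
import tempfile
from pathlib import Path


def make_minimal_repository(root: Path) -> Path:
    """Create README.md, train.csv and test.csv under root/my_dataset_repository."""
    repo = root / "my_dataset_repository"
    repo.mkdir(parents=True, exist_ok=True)
    (repo / "README.md").write_text("# My dataset\n", encoding="utf-8")
    # Hypothetical sample rows, only there so the files are not empty.
    (repo / "train.csv").write_text("text,label\nhello,0\nworld,1\n", encoding="utf-8")
    (repo / "test.csv").write_text("text,label\nbye,0\n", encoding="utf-8")
    return repo


def load_local_repository(repo: Path):
    """Load the directory with load_dataset(); needs the `datasets` package."""
    from datasets import load_dataset  # imported lazily so the rest runs without it

    return load_dataset(str(repo))


repo = make_minimal_repository(Path(tempfile.mkdtemp()))
print(sorted(p.name for p in repo.iterdir()))
# ['README.md', 'test.csv', 'train.csv']
```

Calling load_local_repository(repo) should then return a dataset with a train and a test split, matching the description above.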
Define your splits and subsets in YAML
Splits
If you have multiple files and want to define which file goes into which split, you can use the YAML configs field at the top of your README.md.
For example, given a repository like this one:
my_dataset_repository/
├── README.md
├── data.csv
└── holdout.csv
You can define your splits by adding the configs field in the YAML block at the top of your README.md:
---
configs:
- config_name: default
  data_files:
  - split: train
    path: "data.csv"
  - split: test
    path: "holdout.csv"
---
You can select multiple files per split using a list of paths:
my_dataset_repository/
├── README.md
├── data/
│   ├── abc.csv
│   └── def.csv
└── holdout/
    └── ghi.csv
---
configs:
- config_name: default
  data_files:
  - split: train
    path:
    - "data/abc.csv"
    - "data/def.csv"
  - split: test
    path: "holdout/ghi.csv"
---
Or you can use glob patterns to automatically list all the files you need:
---
configs:
- config_name: default
  data_files:
  - split: train
    path: "data/*.csv"
  - split: test
    path: "holdout/*.csv"
---
Note that the config_name field is required even if you have a single configuration.
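If the split-to-file mapping is produced by a script, the YAML block can be generated rather than typed by hand. The following is a rough sketch using only string formatting; build_configs_front_matter is a hypothetical helper, not part of any library, and it handles just the single-path and list-of-paths forms described in this section.

```python
def build_configs_front_matter(splits, config_name="default"):
    """Render a `configs` YAML block for one configuration.

    `splits` maps a split name to either one path (str) or a list of paths.
    """
    lines = ["---", "configs:", f"- config_name: {config_name}", "  data_files:"]
    for split, paths in splits.items():
        lines.append(f"  - split: {split}")
        if isinstance(paths, str):
            # A single path or glob pattern goes on the same line.
            lines.append(f'    path: "{paths}"')
        else:
            # Several paths become a nested YAML list.
            lines.append("    path:")
            lines.extend(f'    - "{p}"' for p in paths)
    lines.append("---")
    return "\n".join(lines) + "\n"


front_matter = build_configs_front_matter(
    {"train": ["data/abc.csv", "data/def.csv"], "test": "holdout/*.csv"}
)
print(front_matter)
```

The returned text is meant to be placed at the very top of README.md, before the rest of the dataset card.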
Configurations
Your dataset might have several subsets of data that you want to be able to load separately. In that case you can define a list of configurations inside the configs field in YAML:
my_dataset_repository/
├── README.md
├── main_data.csv
└── additional_data.csv
---
configs:
- config_name: main_data
  data_files: "main_data.csv"
- config_name: additional_data
  data_files: "additional_data.csv"
---
Each configuration is shown separately on the Hugging Face Hub, and can be loaded by passing its name as a second parameter:
from datasets import load_dataset
main_data = load_dataset("my_dataset_repository", "main_data")
additional_data = load_dataset("my_dataset_repository", "additional_data")
Builder parameters
Besides data_files, other builder-specific parameters can be passed via YAML, which gives more flexibility in how the data is loaded without requiring any custom code. For example, define which separator to use in which configuration to load your csv files:
---
configs:
- config_name: tab
  data_files: "main_data.csv"
  sep: "\t"
- config_name: comma
  data_files: "additional_data.csv"
  sep: ","
---
Refer to specific builders' documentation to see what configuration parameters they have.
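For comparison, the same kind of builder parameter can be supplied from Python when loading files directly. This is a minimal sketch, assuming the datasets package is installed and that the csv builder accepts sep as a keyword argument; load_csv_with_separator is a hypothetical wrapper name.

```python
def load_csv_with_separator(data_file: str, sep: str):
    """Load one CSV file with an explicit column separator."""
    from datasets import load_dataset  # imported lazily; requires `datasets`

    # "csv" selects the CSV builder; extra keyword arguments such as `sep`
    # are forwarded to that builder's configuration.
    return load_dataset("csv", data_files=data_file, sep=sep)


# Example calls (not executed here, since they need the files to exist):
# tab_data = load_csv_with_separator("main_data.csv", "\t")
# comma_data = load_csv_with_separator("additional_data.csv", ",")
```

Declaring the parameter in YAML instead keeps the choice with the repository, so users of the dataset do not need to know which separator each file uses.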
You can set a default configuration using default: true. For example, you can run main_data = load_dataset("my_dataset_repository") if you set:
- config_name: main_data
  data_files: "main_data.csv"
  default: true
Automatic splits detection
If no YAML is provided, 🤗 Datasets searches for certain patterns in the dataset repository to automatically infer the dataset splits. The patterns are tried in order, from the custom filename split format down to treating all files as a single split if no pattern is found.
Directory name
Your data files may also be placed into different directories named train, test, and validation where each directory contains the data files for that split:
my_dataset_repository/
├── README.md
└── data/
    ├── train/
    │   └── bees.csv
    ├── test/
    │   └── more_bees.csv
    └── validation/
        └── even_more_bees.csv
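The directory rule can be pictured as grouping files by the name of their parent directory. The snippet below is a simplified illustration of that idea, not the library's actual matching code; group_by_split_directory is a hypothetical helper and only looks at the immediate parent directory.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_by_split_directory(paths, split_names=("train", "validation", "test")):
    """Group file paths by the split named by their parent directory."""
    groups = defaultdict(list)
    for path in paths:
        parent = PurePosixPath(path).parent.name
        if parent in split_names:
            groups[parent].append(path)
    return dict(groups)


files = [
    "data/train/bees.csv",
    "data/test/more_bees.csv",
    "data/validation/even_more_bees.csv",
]
splits = group_by_split_directory(files)
print(splits)
```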
Filename splits
If you don't have any non-traditional splits, you can place the split name anywhere in the data file name and it is inferred automatically. The only rule is that the split name must be delimited by non-word characters, as in test-file.csv rather than testfile.csv. Supported delimiters include underscores, dashes, spaces, dots, and numbers.
For example, the following file names are all acceptable:
- train split: train.csv, my_train_file.csv, train1.csv
- validation split: validation.csv, my_validation_file.csv, validation1.csv
- test split: test.csv, my_test_file.csv, test1.csv
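The delimiter rule can be approximated with a regular expression. This is a simplified sketch for checking file names by hand, not the pattern the library itself uses; split_from_filename and the DELIMITERS character class are assumptions built from the delimiters listed above.

```python
import re

# Underscores, dashes, spaces, dots, and digits count as delimiters.
DELIMITERS = r"[-._ 0-9]"
SPLIT_NAMES = ("train", "validation", "test")


def split_from_filename(filename):
    """Return the split whose name appears delimited in `filename`, else None."""
    for split in SPLIT_NAMES:
        # The name must sit at the start of the file name or after a delimiter,
        # and be followed by a delimiter or the end of the file name.
        pattern = rf"(?:^|{DELIMITERS}){split}(?:{DELIMITERS}|$)"
        if re.search(pattern, filename):
            return split
    return None


for name in ["train.csv", "my_validation_file.csv", "test1.csv", "testfile.csv"]:
    print(name, "->", split_from_filename(name))
```

In this sketch testfile.csv yields None because test is followed directly by a letter rather than a delimiter.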
Here is an example where all the files are placed into a directory named data:
my_dataset_repository/
├── README.md
└── data/
    ├── train.csv
    ├── test.csv
    └── validation.csv
Custom filename split
If your dataset splits have custom names that aren't train, test, or validation, then you can name your data files like data/<split_name>-xxxxx-of-xxxxx.csv.
Here is an example with three splits, train, test, and random:
my_dataset_repository/
├── README.md
└── data/
    ├── train-00000-of-00003.csv
    ├── train-00001-of-00003.csv
    ├── train-00002-of-00003.csv
    ├── test-00000-of-00001.csv
    ├── random-00000-of-00003.csv
    ├── random-00001-of-00003.csv
    └── random-00002-of-00003.csv
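To check that a set of shard names follows this convention, the split name can be parsed out of each path. The sketch below assumes five-digit shard indices, as in the example; SHARD_PATTERN and group_shards are hypothetical names, and the regular expression is an approximation rather than the library's own pattern.

```python
import re
from collections import defaultdict

# data/<split_name>-xxxxx-of-xxxxx.<extension>
SHARD_PATTERN = re.compile(
    r"^data/(?P<split>[^/]+)-(?P<index>\d{5})-of-(?P<total>\d{5})\.[^/]+$"
)


def group_shards(paths):
    """Group shard paths by the split name encoded in the file name."""
    groups = defaultdict(list)
    for path in paths:
        match = SHARD_PATTERN.match(path)
        if match:
            groups[match.group("split")].append(path)
    return dict(groups)


files = [
    "data/train-00000-of-00003.csv",
    "data/train-00001-of-00003.csv",
    "data/train-00002-of-00003.csv",
    "data/test-00000-of-00001.csv",
    "data/random-00000-of-00003.csv",
    "data/random-00001-of-00003.csv",
    "data/random-00002-of-00003.csv",
]
shards = group_shards(files)
print({split: len(paths) for split, paths in shards.items()})
```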
Single split
When 🤗 Datasets can't find any of the above patterns, then it'll treat all the files as a single train split. If your dataset splits aren't loading as expected, it may be due to an incorrect pattern.
Split name keywords
There are several ways to name splits. Validation splits are sometimes called "dev", and test splits may be referred to as "eval". These other split names are also supported, and the following keywords are equivalent:
- train, training
- validation, valid, val, dev
- test, testing, eval, evaluation
The structure below is a valid repository:
my_dataset_repository/
├── README.md
└── data/
    ├── training.csv
    ├── eval.csv
    └── valid.csv
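The keyword equivalences amount to a lookup from alias to canonical split name. The following sketch applies that lookup to the repository above; canonical_split is a hypothetical helper that only handles files named exactly after a keyword, which is a deliberate simplification.

```python
from pathlib import PurePosixPath

# Equivalent keywords, as listed above.
SPLIT_KEYWORDS = {
    "train": ["train", "training"],
    "validation": ["validation", "valid", "val", "dev"],
    "test": ["test", "testing", "eval", "evaluation"],
}
ALIAS_TO_SPLIT = {
    alias: split for split, aliases in SPLIT_KEYWORDS.items() for alias in aliases
}


def canonical_split(path):
    """Map a file such as data/eval.csv to its canonical split name."""
    stem = PurePosixPath(path).stem  # file name without directory or extension
    return ALIAS_TO_SPLIT.get(stem)


files = ["data/training.csv", "data/eval.csv", "data/valid.csv"]
print({path: canonical_split(path) for path in files})
```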
Multiple files per split
If one of your splits comprises several files, 🤗 Datasets can still infer whether it is the train, validation, or test split from the file name. For example, if your train and test splits span several files:
my_dataset_repository/
├── README.md
├── train_0.csv
├── train_1.csv
├── train_2.csv
├── train_3.csv
├── test_0.csv
└── test_1.csv
Make sure all the files of your train set have train in their names (same for test and validation).
Even if you add a prefix or suffix to train in the file name (like my_train_file_00001.csv for example), 🤗 Datasets can still infer the appropriate split.
For convenience, you can also place your data files into different directories. In this case, the split name is inferred from the directory name.
my_dataset_repository/
├── README.md
└── data/
    ├── train/
    │   ├── shard_0.csv
    │   ├── shard_1.csv
    │   ├── shard_2.csv
    │   └── shard_3.csv
    └── test/
        ├── shard_0.csv
        └── shard_1.csv
For more flexibility over how to load and generate a dataset, you can also write a dataset loading script.