Structure your repository
To host and share your dataset, you can create a dataset repository on the Hugging Face Dataset Hub and upload your data files.
This guide will show you how to structure your dataset repository when you upload it. A dataset with a supported structure can be loaded automatically with load_dataset(), and it will have a preview on its dataset page on the Hub.
Note that you can also include a Python script to define your dataset if you need more flexibility.
The following examples use CSV files, but you can use any supported format (text, JSON, JSON Lines, CSV, Parquet).
Main use-case
The simplest dataset structure has two files: train.csv and test.csv.
Your repository will also contain a README.md file, the dataset card displayed on your dataset page.
my_dataset_repository/
├── README.md
├── train.csv
└── test.csv
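As a minimal sketch of this layout, the snippet below writes a two-file repository to a temporary directory using only the Python standard library. The helper name `make_minimal_repo`, the column names (`text`, `label`), and the rows are invented for illustration; they are not prescribed by this guide.

```python
import csv
import tempfile
from pathlib import Path


def make_minimal_repo(root: Path) -> Path:
    """Create README.md, train.csv and test.csv under root/my_dataset_repository."""
    repo = root / "my_dataset_repository"
    repo.mkdir(parents=True, exist_ok=True)
    (repo / "README.md").write_text("# My dataset\n", encoding="utf-8")

    # Hypothetical rows: the columns are placeholders, not a required schema.
    rows = {
        "train.csv": [("a first example", 0), ("a second example", 1)],
        "test.csv": [("a held-out example", 1)],
    }
    for file_name, examples in rows.items():
        with open(repo / file_name, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(["text", "label"])  # header row
            writer.writerows(examples)
    return repo


repo_path = make_minimal_repo(Path(tempfile.mkdtemp()))
created_files = sorted(p.name for p in repo_path.iterdir())
print(created_files)
```

Assuming the 🤗 Datasets library is installed, a directory like this can typically be loaded with `load_dataset(str(repo_path))`, or explicitly with `load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})`.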
Splits and file names
🤗 Datasets automatically infers the train/validation/test splits of your dataset from the file names. All the files that contain train in their names are considered part of the train split. The same idea applies to the test and validation splits:
- All the files that contain test in their names are considered part of the test split.
- All the files that contain valid in their names are considered part of the validation split.
Here is an example where all the files are placed into a directory named data:
my_dataset_repository/
├── README.md
└── data/
    ├── train.csv
    ├── test.csv
    └── valid.csv
Multiple files per split
If one of your splits comprises several files, 🤗 Datasets can still infer whether it is the train/validation/test split from the file name. For example, if your train and test splits span several files:
my_dataset_repository/
├── README.md
├── train_0.csv
├── train_1.csv
├── train_2.csv
├── train_3.csv
├── test_0.csv
└── test_1.csv
Just make sure that all the files of your train set have train in their names (same for test and validation). It doesn't matter if you add a prefix or suffix to train in the file name (like my_train_file_00001.csv, for example). 🤗 Datasets can still infer the appropriate split.
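The naming rule described above can be approximated in a few lines. This is a simplified sketch, not the library's implementation: it uses plain substring matching on the three keywords named in this guide, whereas the actual matching in 🤗 Datasets may be stricter or recognize more keywords. The names `infer_split` and `group_by_split` are hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Keyword -> split name, checked in this order. Only the three keywords
# described in this guide are covered.
SPLIT_KEYWORDS = (("train", "train"), ("test", "test"), ("valid", "validation"))


def infer_split(file_path):
    """Return the split suggested by the file name, or None if no keyword matches."""
    name = PurePosixPath(file_path).name.lower()
    for keyword, split in SPLIT_KEYWORDS:
        if keyword in name:
            return split
    return None


def group_by_split(file_paths):
    """Group file paths by inferred split, skipping files that match no keyword."""
    groups = defaultdict(list)
    for path in sorted(file_paths):
        split = infer_split(path)
        if split is not None:
            groups[split].append(path)
    return dict(groups)


files = [
    "README.md",
    "train_0.csv",
    "train_1.csv",
    "test_0.csv",
    "my_train_file_00001.csv",
    "data/valid.csv",
]
splits = group_by_split(files)
print(splits)
```

Files whose names contain none of the keywords, such as README.md, are simply left out of the grouping in this sketch.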
For convenience, you can also place your data files into different directories. In this case, the split name is inferred from the directory name.
my_dataset_repository/
├── README.md
└── data/
    ├── train/
    │   ├── shard_0.csv
    │   ├── shard_1.csv
    │   ├── shard_2.csv
    │   └── shard_3.csv
    └── test/
        ├── shard_0.csv
        └── shard_1.csv
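To make the directory-based rule concrete, the sketch below recreates this layout in a temporary directory and maps each sub-directory of data/ to the shards it contains. It assumes the directory name is used directly as the split name; the helper `splits_from_directories` and the CSV header are hypothetical and only serve the illustration.

```python
import tempfile
from pathlib import Path


def splits_from_directories(data_dir: Path) -> dict:
    """Map each immediate sub-directory name of data_dir to its sorted file names."""
    return {
        sub.name: sorted(f.name for f in sub.iterdir() if f.is_file())
        for sub in sorted(data_dir.iterdir())
        if sub.is_dir()
    }


# Recreate the layout shown above: 4 train shards and 2 test shards.
root = Path(tempfile.mkdtemp()) / "my_dataset_repository"
for split, n_shards in {"train": 4, "test": 2}.items():
    (root / "data" / split).mkdir(parents=True)
    for i in range(n_shards):
        shard = root / "data" / split / f"shard_{i}.csv"
        shard.write_text("text,label\n", encoding="utf-8")  # placeholder header
(root / "README.md").write_text("# My dataset\n", encoding="utf-8")

directory_splits = splits_from_directories(root / "data")
print(directory_splits)
```

Because the split comes from the directory here, the shard files themselves do not need train or test in their names.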
Custom split names
If you have other data files in addition to the traditional train/validation/test sets, you must use the following structure. Follow the file name format exactly for this type of structure: data/<split_name>-xxxxx-of-xxxxx.csv. Here is an example with three splits: train, test, and random:
my_dataset_repository/
├── README.md
└── data/
    ├── train-00000-of-00003.csv
    ├── train-00001-of-00003.csv
    ├── train-00002-of-00003.csv
    ├── test-00000-of-00001.csv
    ├── random-00000-of-00003.csv
    ├── random-00001-of-00003.csv
    └── random-00002-of-00003.csv
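Since this format must be followed exactly, it can help to validate file names before uploading. The sketch below is one possible check, under two assumptions that this guide does not state explicitly: each xxxxx is exactly five digits, and a split name consists of letters, digits, or underscores. The names `SHARD_PATTERN`, `parse_shard_name`, and `missing_shards` are hypothetical.

```python
import re

# Assumed reading of data/<split_name>-xxxxx-of-xxxxx.csv:
# five-digit shard index, five-digit shard count.
SHARD_PATTERN = re.compile(
    r"^data/(?P<split>\w+)-(?P<index>\d{5})-of-(?P<total>\d{5})\.csv$"
)


def parse_shard_name(path):
    """Return (split, index, total) for a well-formed shard path, else None."""
    match = SHARD_PATTERN.match(path)
    if match is None:
        return None
    return match["split"], int(match["index"]), int(match["total"])


def missing_shards(paths):
    """Return {split: [missing indices]} for splits with incomplete shard sets."""
    seen, totals = {}, {}
    for path in paths:
        parsed = parse_shard_name(path)
        if parsed is None:
            continue  # ignore files that do not follow the format
        split, index, total = parsed
        seen.setdefault(split, set()).add(index)
        totals[split] = total
    return {
        split: sorted(set(range(totals[split])) - indices)
        for split, indices in seen.items()
        if set(range(totals[split])) - indices
    }


example_paths = [
    "data/train-00000-of-00003.csv",
    "data/train-00002-of-00003.csv",
    "data/test-00000-of-00001.csv",
    "data/random-00000-of-00003.csv",
]
print(parse_shard_name(example_paths[0]))
print(missing_shards(example_paths))
```

The second call reports which shard indices are absent for each split, which is a quick way to catch an incomplete upload.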
Multiple configurations (WIP)
You can specify different configurations of your dataset (for example, if a dataset contains different languages) with one directory per configuration.
These structures are not supported yet, but are a work in progress:
my_dataset_repository/
├── README.md
├── en/
│   ├── train.csv
│   └── test.csv
└── fr/
    ├── train.csv
    └── test.csv
Or with one directory per split:
my_dataset_repository/
├── README.md
├── en/
│   ├── train/
│   │   ├── shard_0.csv
│   │   └── shard_1.csv
│   └── test/
│       ├── shard_0.csv
│       └── shard_1.csv
└── fr/
    ├── train/
    │   ├── shard_0.csv
    │   └── shard_1.csv
    └── test/
        ├── shard_0.csv
        └── shard_1.csv
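Because these layouts are not supported yet, the following is only an illustration of the grouping they imply, not a description of how 🤗 Datasets handles them. It assumes the top-level directory names the configuration and the next path component (a file stem or a directory) names the split; `group_by_config` is a hypothetical helper.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_by_config(file_paths):
    """Return {config: {split: [paths]}} for paths shaped like config/split[...]."""
    grouped = defaultdict(lambda: defaultdict(list))
    for path in sorted(file_paths):
        parts = PurePosixPath(path).parts
        if len(parts) < 2:
            continue  # top-level files such as README.md belong to no configuration
        config = parts[0]
        # "train.csv" and "train" both yield the split name "train".
        split = PurePosixPath(parts[1]).stem
        grouped[config][split].append(path)
    return {config: dict(splits) for config, splits in grouped.items()}


flat_layout = ["README.md", "en/train.csv", "en/test.csv", "fr/train.csv", "fr/test.csv"]
nested_layout = ["en/train/shard_0.csv", "en/train/shard_1.csv", "fr/test/shard_0.csv"]

print(group_by_config(flat_layout))
print(group_by_config(nested_layout))
```

Both the one-file-per-split and the one-directory-per-split variants reduce to the same configuration and split mapping under these assumptions.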