To host and share your dataset, you can create a dataset repository on the Hugging Face Hub and upload your data files.
This guide will show you how to structure your dataset repository when you upload it. A dataset with a supported structure can be loaded automatically with load_dataset(), and it will have a preview on its dataset page on the Hub.
Note that you can also include a Python script to define your dataset, for more flexibility.
The following examples use CSV files, but you can use any supported format (text, JSON, JSON Lines, CSV, Parquet).
The simplest dataset structure has two files: train.csv and test.csv.
Your repository will also contain a README.md file, the dataset card displayed on your dataset page.
```
my_dataset_repository/
├── README.md
├── train.csv
└── test.csv
```
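With this structure in place, load_dataset() can load the repository directly and resolve the splits for you. A minimal sketch, assuming the repository lives at the placeholder ID username/my_dataset_repository:

```python
from datasets import load_dataset

# "username/my_dataset_repository" is a placeholder repository ID.
dataset = load_dataset("username/my_dataset_repository")

print(dataset["train"][0])  # first row of train.csv
print(dataset["test"][0])   # first row of test.csv
```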
🤗 Datasets automatically infers the train/validation/test splits of your dataset from the file names. All the files that contain train in their names are considered part of the train split. The same idea applies to the test and validation splits:
- All the files that contain test in their names are considered part of the test split.
- All the files that contain valid in their names are considered part of the validation split.
Here is an example where all the files are placed into a directory named data:
```
my_dataset_repository/
├── README.md
└── data/
    ├── train.csv
    ├── test.csv
    └── valid.csv
```
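Loading this repository yields all three splits; files matching valid are exposed under the split name validation. A quick check, again with a placeholder repository ID:

```python
from datasets import load_dataset

dataset = load_dataset("username/my_dataset_repository")

# Files containing "valid" map to the "validation" split.
print(list(dataset.keys()))  # ['train', 'test', 'validation']
```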
If one of your splits comprises several files, 🤗 Datasets can still infer whether it is the train, validation, or test split from the file names. For example, if your train and test splits span several files:
```
my_dataset_repository/
├── README.md
├── train_0.csv
├── train_1.csv
├── train_2.csv
├── train_3.csv
├── test_0.csv
└── test_1.csv
```
Just make sure that all the files of your train set have train in their names (same for test and validation). It doesn’t matter if you add a prefix or suffix to train in the file name (my_train_file_00001.csv, for example); 🤗 Datasets can still infer the appropriate split.
For convenience, you can also place your data files into different directories. In this case, the split name is inferred from the directory name.
```
my_dataset_repository/
├── README.md
└── data/
    ├── train/
    │   ├── shard_0.csv
    │   ├── shard_1.csv
    │   ├── shard_2.csv
    │   └── shard_3.csv
    └── test/
        ├── shard_0.csv
        └── shard_1.csv
```
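Either way, all the files belonging to a split are concatenated into a single dataset when loaded. A short sketch, with the same placeholder repository ID:

```python
from datasets import load_dataset

dataset = load_dataset("username/my_dataset_repository")

# Every shard of the train split is concatenated into one dataset,
# so num_rows covers all shard_*.csv files under data/train/.
print(dataset["train"].num_rows)
```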
If you have splits in addition to the traditional train/validation/test sets, you must use the following structure, and follow the file name format exactly: data/<split_name>-xxxxx-of-xxxxx.csv. Here is an example with three splits: train, test, and random:
```
my_dataset_repository/
├── README.md
└── data/
    ├── train-00000-of-00003.csv
    ├── train-00001-of-00003.csv
    ├── train-00002-of-00003.csv
    ├── test-00000-of-00001.csv
    ├── random-00000-of-00003.csv
    ├── random-00001-of-00003.csv
    └── random-00002-of-00003.csv
```
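A custom split declared this way can then be loaded by name like any built-in split. A sketch, assuming the placeholder repository ID used above:

```python
from datasets import load_dataset

# Load only the custom "random" split; the repository ID is a placeholder.
random_split = load_dataset("username/my_dataset_repository", split="random")
print(random_split.num_rows)
```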
You can specify different configurations of your dataset (for example, if a dataset contains different languages) with one directory per configuration.
These structures are not supported yet, but are a work in progress:
```
my_dataset_repository/
├── README.md
├── en/
│   ├── train.csv
│   └── test.csv
└── fr/
    ├── train.csv
    └── test.csv
```
Or with one directory per split:
```
my_dataset_repository/
├── README.md
├── en/
│   ├── train/
│   │   ├── shard_0.csv
│   │   └── shard_1.csv
│   └── test/
│       ├── shard_0.csv
│       └── shard_1.csv
└── fr/
    ├── train/
    │   ├── shard_0.csv
    │   └── shard_1.csv
    └── test/
        ├── shard_0.csv
        └── shard_1.csv
```