Structure your repository
=========================

To host and share your dataset, you can create a dataset repository on the Hugging Face Dataset Hub and upload your data files.

This guide shows you how to structure your dataset repository when you upload it.
A dataset with a supported structure can be loaded automatically with ``load_dataset``, and it gets a preview on its dataset page on the Hub.

.. Note::

    The following examples use CSV files, but you can use any supported format (text, JSON, JSON Lines, CSV, Parquet).

Main use-case
-------------

The simplest dataset structure has two files: `train.csv` and `test.csv`.

Your repository will also contain a `README.md` file, the dataset card displayed on your dataset page.

::

    my_dataset_repository/
    ├── README.md
    ├── train.csv
    └── test.csv

Splits and file names
---------------------

🤗 Datasets automatically infers the train/validation/test splits of your dataset from the file names.

All the files that contain **train** in their names are considered part of the train split. The same idea applies to the test and validation splits:

* All the files that contain **test** in their names are considered part of the test split.
* All the files that contain **valid** in their names are considered part of the validation split.

Here is an example where all the files are placed into a directory named `data`:

::

    my_dataset_repository/
    ├── README.md
    └── data/
        ├── train.csv
        ├── test.csv
        └── valid.csv

Multiple files per split
------------------------

If one of your splits comprises several files, 🤗 Datasets can still infer whether it is the train/validation/test split from the file name.
For example, if your **train** and **test** splits span several files:

::

    my_dataset_repository/
    ├── README.md
    ├── train_0.csv
    ├── train_1.csv
    ├── train_2.csv
    ├── train_3.csv
    ├── test_0.csv
    └── test_1.csv

Just make sure that all the files of your **train** set have **train** in their names (same for test and validation).
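
The keyword rule described above can be illustrated with a small sketch. The helpers ``infer_split`` and ``group_by_split`` are hypothetical names introduced here for illustration; they are not part of 🤗 Datasets and do not reproduce its actual implementation. The sketch assumes that only the file name (not its directory) is inspected.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Keyword -> split name, following the naming rules described above.
# Illustrative only: this is NOT the library's own inference code.
SPLIT_KEYWORDS = {"train": "train", "test": "test", "valid": "validation"}


def infer_split(path):
    """Return the split whose keyword appears in the file name, or None."""
    name = PurePosixPath(path).name.lower()
    for keyword, split in SPLIT_KEYWORDS.items():
        if keyword in name:
            return split
    return None


def group_by_split(paths):
    """Group file paths by inferred split, skipping files with no match."""
    groups = defaultdict(list)
    for path in paths:
        split = infer_split(path)
        if split is not None:
            groups[split].append(path)
    return dict(groups)


files = ["README.md", "train_0.csv", "train_1.csv", "test_0.csv", "my_valid_file.csv"]
splits = group_by_split(files)
```

Here ``README.md`` matches no keyword and is left out, while every other file is grouped under the split its name mentions.
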
It doesn't matter if you add a prefix or suffix to **train** in the file name (like `my_train_file_00001.csv`, for example).
🤗 Datasets can still infer the appropriate split.

For convenience, you can also place your data files into different directories.
In this case, the split name is inferred from the directory name.

::

    my_dataset_repository/
    ├── README.md
    └── data/
        ├── train/
        │   ├── shard_0.csv
        │   ├── shard_1.csv
        │   ├── shard_2.csv
        │   └── shard_3.csv
        └── test/
            ├── shard_0.csv
            └── shard_1.csv

Custom split names
------------------

If you have other data files in addition to the traditional train/validation/test sets, you must use the following structure.

Follow the file name format exactly for this type of structure: ``data/<split_name>-xxxxx-of-xxxxx.csv``.

Here is an example with three splits: **train**, **test**, and **random**:

::

    my_dataset_repository/
    ├── README.md
    └── data/
        ├── train-00000-of-00003.csv
        ├── train-00001-of-00003.csv
        ├── train-00002-of-00003.csv
        ├── test-00000-of-00001.csv
        ├── random-00000-of-00003.csv
        ├── random-00001-of-00003.csv
        └── random-00002-of-00003.csv

Multiple configurations (WIP)
-----------------------------

You can specify different configurations of your dataset (for example, if a dataset contains different languages) with one directory per configuration.

These structures are not supported yet, but are a work in progress:

::

    my_dataset_repository/
    ├── README.md
    ├── en/
    │   ├── train.csv
    │   └── test.csv
    └── fr/
        ├── train.csv
        └── test.csv

Or with one directory per split:

::

    my_dataset_repository/
    ├── README.md
    ├── en/
    │   ├── train/
    │   │   ├── shard_0.csv
    │   │   └── shard_1.csv
    │   └── test/
    │       ├── shard_0.csv
    │       └── shard_1.csv
    └── fr/
        ├── train/
        │   ├── shard_0.csv
        │   └── shard_1.csv
        └── test/
            ├── shard_0.csv
            └── shard_1.csv
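
Because these per-configuration layouts are not loaded automatically yet, one possible workaround is to select the files of a single configuration yourself. The sketch below is only an illustration under stated assumptions: ``data_files_for_config`` is a hypothetical helper (not part of 🤗 Datasets), paths are assumed to be relative to the repository root, and the split name is taken from the file stem or from the subdirectory name, as in the two layouts above.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def data_files_for_config(paths, config):
    """Map split name -> sorted CSV paths for one configuration directory.

    Handles both layouts: <config>/<split>.csv and <config>/<split>/<shard>.csv.
    Hypothetical helper for illustration; not a library function.
    """
    data_files = defaultdict(list)
    for path in paths:
        parts = PurePosixPath(path).parts
        # Skip files outside the requested configuration directory and non-CSV files.
        if len(parts) < 2 or parts[0] != config or not path.endswith(".csv"):
            continue
        # With two parts the split is the file stem; otherwise it is the subdirectory.
        split = PurePosixPath(parts[1]).stem if len(parts) == 2 else parts[1]
        data_files[split].append(path)
    return {split: sorted(files) for split, files in data_files.items()}


repo_files = [
    "README.md",
    "en/train/shard_0.csv",
    "en/train/shard_1.csv",
    "en/test/shard_0.csv",
    "fr/train.csv",
    "fr/test.csv",
]
en_files = data_files_for_config(repo_files, "en")
fr_files = data_files_for_config(repo_files, "fr")
```

The resulting dictionary maps each split to a list of files, which is the shape accepted by the ``data_files`` argument of ``load_dataset`` (for example ``load_dataset("csv", data_files=en_files)`` once the files are available locally).
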