To host and share your dataset, you can create a dataset repository on the Hugging Face Dataset Hub and upload your data files.
This guide will show you how to structure your dataset repository when you upload it. A dataset with a supported structure and file format (text, JSON, JSON Lines, CSV, Parquet) can be loaded automatically with load_dataset(), and it’ll have a preview on its dataset page on the Hub.
For more flexibility over how to load and generate a dataset, you can also write a dataset loading script.
The simplest dataset structure has two files:
Your repository will also contain a
README.md file, the dataset card displayed on your dataset page.
my_dataset_repository/ ├── README.md ├── train.csv └── test.csv
🤗 Datasets automatically infer a dataset’s train, validation, and test splits from the file names.
All the files that contain a split name in their names (delimited by non-word characters, see below) are considered part of that split:
- train split:
- validation split:
- test split:
Here is an example where all the files are placed into a directory named
my_dataset_repository/ ├── README.md └── data/ ├── train.csv ├── test.csv └── validation.csv
Note that if a file contains test but is embedded in another word (e.g.
testfile.csv), it’s not counted as a test file.
It must be delimited by non-word characters, e.g.
Supported delimiters are underscores, dashes, spaces, dots and numbers.
If one of your splits comprises several files, 🤗 Datasets can still infer whether it is the train, validation, and test split from the file name. For example, if your train and test splits span several files:
my_dataset_repository/ ├── README.md ├── train_0.csv ├── train_1.csv ├── train_2.csv ├── train_3.csv ├── test_0.csv └── test_1.csv
Make sure all the files of your
train set have train in their names (same for test and validation).
Even if you add a prefix or suffix to
train in the file name (like
my_train_file_00001.csv for example),
🤗 Datasets can still infer the appropriate split.
For convenience, you can also place your data files into different directories. In this case, the split name is inferred from the directory name.
my_dataset_repository/ ├── README.md └── data/ ├── train/ │ ├── shard_0.csv │ ├── shard_1.csv │ ├── shard_2.csv │ └── shard_3.csv └── test/ ├── shard_0.csv └── shard_1.csv
Eventually, you’ll also be able to structure your repository to specify different dataset configurations. Stay tuned on this issue for the latest updates!
Validation splits are sometimes called “dev”, and test splits are called “eval”. These other names are also supported. In particular, these keywords are equivalent:
- train, training
- validation, valid, val, dev
- test, testing, eval, evaluation
Therefore this is also a valid repository:
my_dataset_repository/ ├── README.md └── data/ ├── training.csv ├── eval.csv └── valid.csv
If you have other data files in addition to the traditional train, validation, and test sets, you must use a different structure.
Use this exact file name format for this structure type:
Here is an example with three splits:
my_dataset_repository/ ├── README.md └── data/ ├── train-00000-of-00003.csv ├── train-00001-of-00003.csv ├── train-00002-of-00003.csv ├── test-00000-of-00001.csv ├── random-00000-of-00003.csv ├── random-00001-of-00003.csv └── random-00002-of-00003.csv