Datasets documentation

Structure your repository

You are viewing v2.12.0 version. A newer version v3.2.0 is available.
Hugging Face's logo
Join the Hugging Face community

and get access to the augmented documentation experience

to get started

Structure your repository

To host and share your dataset, create a dataset repository on the Hugging Face Hub and upload your data files.

This guide will show you how to structure your dataset repository when you upload it. A dataset with a supported structure and file format (.txt, .csv, .mp3, .jpg, .zip) are loaded automatically with load_dataset(), and it’ll have a preview on its dataset page on the Hub.

For more flexibility over how to load and generate a dataset, you can also write a dataset loading script.

Main use-case

The simplest dataset structure has two files: train.csv and test.csv.

Your repository will also contain a README.md file, the dataset card displayed on your dataset page.

my_dataset_repository/
β”œβ”€β”€ README.md
β”œβ”€β”€ train.csv
└── test.csv

Split pattern hierarchy

πŸ€— Datasets searches for certain patterns in the dataset repository to automatically infer the dataset splits. There is an order to the patterns, beginning with the custom filename split format to treating all files as a single split if no pattern is found.

Custom filename split

If your dataset splits have custom names that aren’t train, test, or validation, then you should name your data files like data/<split_name>-xxxxx-of-xxxxx.csv.

Here is an example with three splits, train, test, and random:

my_dataset_repository/
β”œβ”€β”€ README.md
└── data/
    β”œβ”€β”€ train-00000-of-00003.csv
    β”œβ”€β”€ train-00001-of-00003.csv
    β”œβ”€β”€ train-00002-of-00003.csv
    β”œβ”€β”€ test-00000-of-00001.csv
    β”œβ”€β”€ random-00000-of-00003.csv
    β”œβ”€β”€ random-00001-of-00003.csv
    └── random-00002-of-00003.csv

Directory name

Your data files may also be placed into different directories named train, test, and validation where each directory contains the data files for that split:

my_dataset_repository/
β”œβ”€β”€ README.md
└── data/
    β”œβ”€β”€ train/
    β”‚   └── bees.csv
    β”œβ”€β”€ test/
    β”‚   └── more_bees.csv
    └── validation/
        └── even_more_bees.csv

Filename splits

If you don’t have any non-traditional splits, then you can place the split name anywhere in the data file and it is automatically inferred. The only rule is that the split name must be delimited by non-word characters, like test-file.csv for example instead of testfile.csv. Supported delimiters include underscores, dashes, spaces, dots, and numbers.

For example, the following file names are all acceptable:

  • train split: train.csv, my_train_file.csv, train1.csv
  • validation split: validation.csv, my_validation_file.csv, validation1.csv
  • test split: test.csv, my_test_file.csv, test1.csv

Here is an example where all the files are placed into a directory named data:

my_dataset_repository/
β”œβ”€β”€ README.md
└── data/
    β”œβ”€β”€ train.csv
    β”œβ”€β”€ test.csv
    └── validation.csv

Single split

When πŸ€— Datasets can’t find any of the above patterns, then it’ll treat all the files as a single train split. If your dataset splits aren’t loading as expected, it may be due to an incorrect pattern.

Split name keywords

There are several ways to name splits. Validation splits are sometimes called β€œdev”, and test splits may be referred to as β€œeval”. These other split names are also supported, and the following keywords are equivalent:

  • train, training
  • validation, valid, val, dev
  • test, testing, eval, evaluation

The structure below is a valid repository:

my_dataset_repository/
β”œβ”€β”€ README.md
└── data/
    β”œβ”€β”€ training.csv
    β”œβ”€β”€ eval.csv
    └── valid.csv

Multiple files per split

If one of your splits comprises several files, πŸ€— Datasets can still infer whether it is the train, validation, and test split from the file name. For example, if your train and test splits span several files:

my_dataset_repository/
β”œβ”€β”€ README.md
β”œβ”€β”€ train_0.csv
β”œβ”€β”€ train_1.csv
β”œβ”€β”€ train_2.csv
β”œβ”€β”€ train_3.csv
β”œβ”€β”€ test_0.csv
└── test_1.csv

Make sure all the files of your train set have train in their names (same for test and validation). Even if you add a prefix or suffix to train in the file name (like my_train_file_00001.csv for example), πŸ€— Datasets can still infer the appropriate split.

For convenience, you can also place your data files into different directories. In this case, the split name is inferred from the directory name.

my_dataset_repository/
β”œβ”€β”€ README.md
└── data/
    β”œβ”€β”€ train/
    β”‚   β”œβ”€β”€ shard_0.csv
    β”‚   β”œβ”€β”€ shard_1.csv
    β”‚   β”œβ”€β”€ shard_2.csv
    β”‚   └── shard_3.csv
    └── test/
        β”œβ”€β”€ shard_0.csv
        └── shard_1.csv

Eventually, you’ll also be able to structure your repository to specify different dataset configurations. Stay tuned on this issue for the latest updates!