Dataset viewer documentation

Splits and subsets


Machine learning datasets are commonly organized in splits, and they may also have subsets (also called configurations). These internal structures provide the scaffolding for building out a dataset and determine how it is divided and organized. Understanding a dataset’s structure can help you create your own dataset and know which part of the data to use at each stage of model training and evaluation.


Splits

Every processed and cleaned dataset contains splits, specific parts of the data reserved for specific needs. The most common splits are:

  • train: data used to train a model; this data is exposed to the model
  • validation: data reserved for evaluating the model and tuning its hyperparameters; this data is hidden from the model
  • test: data reserved for evaluation only; this data is completely hidden from the model and ourselves

The validation and test sets are especially important for checking that a model is actually learning rather than overfitting, that is, merely memorizing the training data.
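The three splits described above can be sketched in a few lines of plain Python. This is a minimal illustration, not how any particular dataset was built: the function name `split_rows`, the default fractions, and the seed are all assumptions chosen for the example.

```python
import random


def split_rows(rows, validation_fraction=0.1, test_fraction=0.1, seed=0):
    """Partition rows into disjoint train/validation/test lists.

    Hypothetical helper for illustration; the fractions and seed are
    arbitrary defaults, not values prescribed by any dataset.
    """
    if validation_fraction < 0 or test_fraction < 0:
        raise ValueError("fractions must be non-negative")
    if validation_fraction + test_fraction >= 1:
        raise ValueError("fractions must sum to less than 1")

    shuffled = list(rows)  # copy so the caller's data is left untouched
    random.Random(seed).shuffle(shuffled)  # seeded for reproducible splits

    n_test = int(len(shuffled) * test_fraction)
    n_validation = int(len(shuffled) * validation_fraction)

    # Each row lands in exactly one split, so no split leaks into another.
    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_validation]
    train = shuffled[n_test + n_validation:]
    return {"train": train, "validation": validation, "test": test}


splits = split_rows(range(100))
print({name: len(part) for name, part in splits.items()})
```

Because the slices do not overlap, the model never sees validation or test rows during training, which is what makes those splits usable for detecting overfitting.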

Subsets

A subset (also called a configuration) is a higher-level internal structure than a split: each subset contains its own splits. You can think of a subset as a sub-dataset contained within a larger dataset. It is a useful structure for adding layers of organization to a dataset. For example, if you take a look at the Multilingual LibriSpeech (MLS) dataset, you’ll notice there are eight different languages. While you can create a single dataset containing all eight languages, it’s probably neater to make each language its own subset. This way, users can directly load the data for their language of interest instead of preprocessing the whole dataset to filter for that language.

Subsets are flexible and can be used to organize a dataset along whatever criterion you’d like. For example, the SceneParse150 dataset uses subsets to organize the dataset by task: one subset is dedicated to segmenting the whole image, while the other is for instance segmentation.
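The nesting of splits inside subsets can be pictured as a two-level mapping. The sketch below is a toy model only: `TOY_DATASET`, its rows, and the helper functions are hypothetical names invented for illustration, not part of any Hugging Face API.

```python
# Hypothetical toy dataset: subset name -> split name -> rows.
# The subset names mimic language subsets; the rows are made up.
TOY_DATASET = {
    "german": {
        "train": ["guten morgen", "danke"],
        "test": ["auf wiedersehen"],
    },
    "french": {
        "train": ["bonjour", "merci beaucoup"],
        "test": ["au revoir"],
    },
}


def subset_names(dataset):
    """Return the names of all subsets, sorted for stable output."""
    return sorted(dataset)


def split_names(dataset, subset):
    """Return the names of the splits inside one subset."""
    return sorted(dataset[subset])


def select_rows(dataset, subset, split):
    """Return the rows of a single split within a single subset."""
    return dataset[subset][split]


print(subset_names(TOY_DATASET))
print(split_names(TOY_DATASET, "french"))
print(select_rows(TOY_DATASET, "french", "test"))
```

Selecting by subset first and split second mirrors how a user would pick one language and then its training or test data. In the `datasets` library, the analogous call is `load_dataset(path, name, split=...)`, where `name` selects the subset.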
