Dataset features

datasets.Features defines the internal structure of a dataset and specifies the underlying serialization format. More interesting to you, though, is that datasets.Features holds high-level information about everything from the column names and types to the datasets.ClassLabel. You can think of datasets.Features as the backbone of a dataset.

The datasets.Features format is simple: dict[column_name, column_type], a dictionary that maps each column name to its column type. The column type offers a wide range of options for describing the kind of data you have.

Let’s have a look at the features of the MRPC dataset from the GLUE benchmark:

>>> from datasets import load_dataset
>>> dataset = load_dataset('glue', 'mrpc', split='train')
>>> dataset.features
{'idx': Value(dtype='int32', id=None),
 'label': ClassLabel(num_classes=2, names=['not_equivalent', 'equivalent'], names_file=None, id=None),
 'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None)}

The datasets.Value feature tells 🤗 Datasets:

  • The idx data type is int32.

  • The sentence1 and sentence2 data types are string.

🤗 Datasets supports many other data types, such as bool, float32, and binary, to name just a few.

See also

Refer to datasets.Value for a full list of supported data types.

The datasets.ClassLabel feature informs 🤗 Datasets that the label column contains two classes, labeled not_equivalent and equivalent. Labels are stored as integers in the dataset. When you retrieve the labels, datasets.ClassLabel.int2str() converts an integer value to its label name, and datasets.ClassLabel.str2int() converts a label name back to its integer value.

If a column contains a list of objects, use the datasets.Sequence feature. Remember the SQuAD dataset?

>>> from datasets import load_dataset
>>> dataset = load_dataset('squad', split='train')
>>> dataset.features
{'answers': Sequence(feature={'text': Value(dtype='string', id=None), 'answer_start': Value(dtype='int32', id=None)}, length=-1, id=None),
'context': Value(dtype='string', id=None),
'id': Value(dtype='string', id=None),
'question': Value(dtype='string', id=None),
'title': Value(dtype='string', id=None)}

The answers field is constructed using the datasets.Sequence feature because it contains two subfields, text and answer_start, which are lists of string and int32 values, respectively.

Tip

See the Flatten section to learn how you can extract the nested subfields as their own independent columns.