Dataset features

datasets.Features define the internal structure and typings for each example in the dataset. Features are used to specify the underlying serailization format but also contain high-level informations regarding the fields, e.g. conversion methods from names to integer values for a class label field.

Here is a brief presentation of the various types of features which can be used to define the dataset fields (aka columns):

  • datasets.Features is the base class and should be only called once and instantiated with a dictionary of field names and field sub-features as detailed in the rest of this list,

  • a python dict specifies that the field is a nested field containing a mapping of sub-fields to sub-fields features. It’s possible to have nested fields of nested fields in an arbitrary manner.

  • a python list or a datasets.Sequence specifies that the field contains a list of objects. The python list or datasets.Sequence should be provided with a single sub-feature as an example of the feature type hosted in this list. Python list are simplest to define and write while datasets.Sequence provide a few more specific behaviors like the possibility to specify a fixed length for the list (slightly more efficient).

Note

A datasets.Sequence with a internal dictionary feature will be automatically converted into a dictionary of lists. This behavior is implemented to have a compatilbity layer with the TensorFlow Datasets library but may be un-wanted in some cases. If you don’t want this behavior, you can use a python list instead of the datasets.Sequence.