Dataset features

datasets.Features defines the internal structure of a dataset. It specifies the underlying serialization format and also carries high-level information about the fields: column names, types, and, for a class label field, the conversion between label names and integer values.

A brief summary of how to use this class:

  • datasets.Features should be instantiated only once, with a dict[str, FieldType] whose keys are your desired column names and whose values are the types of those columns.

FieldType can be one of a few possibilities:

  • a datasets.Value feature specifies a single typed value, e.g. int64 or string. The dtypes supported are as follows:
    • null
    • bool
    • int8
    • int16
    • int32
    • int64
    • uint8
    • uint16
    • uint32
    • uint64
    • float16
    • float32 (alias float)
    • float64 (alias double)
    • timestamp[(s|ms|us|ns)]
    • timestamp[(s|ms|us|ns), tz=(tzstring)]
    • binary
    • large_binary
    • string
    • large_string

  • a Python dict specifies that the field is a nested field, mapping each sub-field name to that sub-field's feature. Nested fields can themselves contain nested fields, to any depth.

  • a Python list or a datasets.Sequence specifies that the field contains a list of objects. Either one should be given a single sub-feature, which serves as an example of the feature type held in the list. A Python list is the simplest to define and write, while datasets.Sequence offers a few more specific behaviors, such as the option to specify a fixed length for the list (slightly more efficient).

Note

A datasets.Sequence with an internal dictionary feature is automatically converted into a dictionary of lists. This behavior provides a compatibility layer with the TensorFlow Datasets library, but it may be unwanted in some cases. If you don't want this behavior, use a Python list instead of the datasets.Sequence.