Dataset features
================

:class:`datasets.Features` defines the internal structure of a dataset and specifies the underlying serialization format. What's more interesting to you, though, is that :class:`datasets.Features` contains high-level information about every column: its name, its type, and, for label columns, the :class:`datasets.ClassLabel` definition. You can think of :class:`datasets.Features` as the backbone of a dataset.

The :class:`datasets.Features` format is simple: ``dict[column_name, column_type]``, a dictionary that maps each column name to its column type. The column type offers a wide range of options for describing the kind of data you have.
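
To make the ``dict[column_name, column_type]`` idea concrete, here is a minimal sketch that builds a :class:`datasets.Features` object by hand. The column names (``review_id``, ``review``, ``sentiment``) and label names are hypothetical and chosen only for illustration:

.. code-block:: python

   from datasets import ClassLabel, Features, Value

   # Hypothetical schema: these column and label names are made up for illustration.
   features = Features(
       {
           "review_id": Value("int32"),
           "review": Value("string"),
           "sentiment": ClassLabel(names=["negative", "positive"]),
       }
   )

   # Features can be indexed by column name like a regular dictionary.
   print(features["review"].dtype)           # string
   print(features["sentiment"].num_classes)  # 2

Because the schema is an ordinary mapping, you can inspect a single column type without loading any data.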

Let's have a look at the features of the MRPC dataset from the GLUE benchmark:

.. code-block::

   >>> from datasets import load_dataset
   >>> dataset = load_dataset('glue', 'mrpc', split='train')
   >>> dataset.features
   {'idx': Value(dtype='int32', id=None),
    'label': ClassLabel(num_classes=2, names=['not_equivalent', 'equivalent'], names_file=None, id=None),
    'sentence1': Value(dtype='string', id=None),
    'sentence2': Value(dtype='string', id=None)}

The :class:`datasets.Value` feature tells 🤗 Datasets:

* The ``idx`` data type is ``int32``.
* The ``sentence1`` and ``sentence2`` data types are ``string``.

🤗 Datasets supports many other data types such as ``bool``, ``float32`` and ``binary`` to name just a few.

.. seealso::

   Refer to :class:`datasets.Value` for a full list of supported data types.
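
As a rough sketch of how :class:`datasets.Value` types apply to real data, the example below builds a tiny in-memory dataset with ``bool``, ``float32`` and ``binary`` columns. The column names and values are hypothetical; only the data types come from the text above:

.. code-block:: python

   from datasets import Dataset, Features, Value

   # Hypothetical columns, one per data type mentioned above.
   features = Features(
       {
           "is_duplicate": Value("bool"),
           "score": Value("float32"),
           "payload": Value("binary"),
       }
   )

   # Passing features explicitly stores each column with the declared type
   # instead of letting the types be inferred from the Python values.
   dataset = Dataset.from_dict(
       {
           "is_duplicate": [True, False],
           "score": [0.5, 1.5],
           "payload": [b"ab", b"cd"],
       },
       features=features,
   )

   print(dataset.features["score"].dtype)  # float32

Without the ``features`` argument, the types would be inferred from the data, so the ``score`` column would not necessarily be stored as ``float32``.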

The :class:`datasets.ClassLabel` feature informs 🤗 Datasets that the ``label`` column contains two classes, labeled ``not_equivalent`` and ``equivalent``. Labels are stored as integers in the dataset. When you retrieve the labels, :func:`datasets.ClassLabel.int2str` and :func:`datasets.ClassLabel.str2int` carry out the conversion from integer value to label name, and vice versa.
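
The conversion can be tried without downloading anything. This short sketch recreates the MRPC label feature by hand, assuming the same two class names shown in the output above:

.. code-block:: python

   from datasets import ClassLabel

   # Rebuild the label feature by hand, using the class names from the MRPC output above.
   label = ClassLabel(names=["not_equivalent", "equivalent"])

   # Integer id -> label name.
   name = label.int2str(1)
   print(name)   # equivalent

   # Label name -> integer id.
   index = label.str2int("not_equivalent")
   print(index)  # 0

The integer ids follow the order of the ``names`` list, so ``not_equivalent`` maps to ``0`` and ``equivalent`` maps to ``1``.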

If a column contains a list of objects, use the :class:`datasets.Sequence` feature. Remember the SQuAD dataset?

.. code-block::

   >>> from datasets import load_dataset
   >>> dataset = load_dataset('squad', split='train')
   >>> dataset.features
   {'answers': Sequence(feature={'text': Value(dtype='string', id=None), 'answer_start': Value(dtype='int32', id=None)}, length=-1, id=None),
    'context': Value(dtype='string', id=None),
    'id': Value(dtype='string', id=None),
    'question': Value(dtype='string', id=None),
    'title': Value(dtype='string', id=None)}

The ``answers`` field is constructed using the :class:`datasets.Sequence` feature because it contains two subfields, ``text`` and ``answer_start``, which are lists of ``string`` and ``int32`` values, respectively.
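
To see this layout on a small scale, the sketch below defines a SQuAD-like schema by hand and loads a single made-up example. The context, question, and answers are hypothetical; only the shape of the ``answers`` field mirrors the output above:

.. code-block:: python

   from datasets import Dataset, Features, Sequence, Value

   # SQuAD-like schema, reduced to three columns for illustration.
   features = Features(
       {
           "context": Value("string"),
           "question": Value("string"),
           "answers": Sequence(
               {"text": Value("string"), "answer_start": Value("int32")}
           ),
       }
   )

   # One made-up example: each subfield of "answers" is a list,
   # with one entry per answer.
   data = {
       "context": ["The Eiffel Tower is in Paris."],
       "question": ["Where is the Eiffel Tower?"],
       "answers": [{"text": ["Paris", "in Paris"], "answer_start": [23, 20]}],
   }
   dataset = Dataset.from_dict(data, features=features)

   row = dataset[0]
   print(row["answers"]["text"])          # ['Paris', 'in Paris']
   print(row["answers"]["answer_start"])  # [23, 20]

Each example therefore holds one dictionary for ``answers``, and the two lists inside it line up position by position.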

.. tip::

   See the :ref:`flatten` section to learn how you can extract the nested subfields as their own independent columns.