Splits and slicing
===========================

Similarly to TensorFlow Datasets, all :class:`DatasetBuilder` s expose various data subsets defined as splits (e.g. ``train``, ``test``). When constructing a :class:`datasets.Dataset` instance using either :func:`datasets.load_dataset()` or :func:`datasets.DatasetBuilder.as_dataset()`, one can specify which split(s) to retrieve. It is also possible to retrieve slice(s) of split(s) as well as combinations of those.

Slicing API
---------------------------------------------------

Slicing instructions are specified in :obj:`datasets.load_dataset` or :obj:`datasets.DatasetBuilder.as_dataset`. Instructions can be provided as either strings or :obj:`ReadInstruction` objects. Strings are more compact and readable for simple cases, while :obj:`ReadInstruction` might be easier to use with variable slicing parameters.

Examples
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Examples using the string API:

.. code-block:: python

    # The full `train` split.
    train_ds = datasets.load_dataset('bookcorpus', split='train')

    # The full `train` split and the full `test` split as two distinct datasets.
    train_ds, test_ds = datasets.load_dataset('bookcorpus', split=['train', 'test'])

    # The full `train` and `test` splits, concatenated together.
    train_test_ds = datasets.load_dataset('bookcorpus', split='train+test')

    # From record 10 (included) to record 20 (excluded) of the `train` split.
    train_10_20_ds = datasets.load_dataset('bookcorpus', split='train[10:20]')

    # The first 10% of the `train` split.
    train_10pct_ds = datasets.load_dataset('bookcorpus', split='train[:10%]')

    # The first 10% of `train` + the last 80% of `train`.
    train_10_80pct_ds = datasets.load_dataset('bookcorpus', split='train[:10%]+train[-80%:]')

    # 10-fold cross-validation (see also next section on rounding behavior):
    # The validation datasets are each going to be 10%:
    # [0%:10%], [10%:20%], ..., [90%:100%].
    # And the training datasets are each going to be the complementary 90%:
    # [10%:100%] (for a corresponding validation set of [0%:10%]),
    # [0%:10%] + [20%:100%] (for a validation set of [10%:20%]), ...,
    # [0%:90%] (for a validation set of [90%:100%]).
    vals_ds = datasets.load_dataset('bookcorpus', split=[
        f'train[{k}%:{k+10}%]' for k in range(0, 100, 10)
    ])
    trains_ds = datasets.load_dataset('bookcorpus', split=[
        f'train[:{k}%]+train[{k+10}%:]' for k in range(0, 100, 10)
    ])

Examples using the ``ReadInstruction`` API (equivalent to the above):

.. code-block:: python

    # The full `train` split.
    train_ds = datasets.load_dataset('bookcorpus', split=datasets.ReadInstruction('train'))

    # The full `train` split and the full `test` split as two distinct datasets.
    train_ds, test_ds = datasets.load_dataset('bookcorpus', split=[
        datasets.ReadInstruction('train'),
        datasets.ReadInstruction('test'),
    ])

    # The full `train` and `test` splits, concatenated together.
    ri = datasets.ReadInstruction('train') + datasets.ReadInstruction('test')
    train_test_ds = datasets.load_dataset('bookcorpus', split=ri)

    # From record 10 (included) to record 20 (excluded) of the `train` split.
    train_10_20_ds = datasets.load_dataset('bookcorpus', split=datasets.ReadInstruction(
        'train', from_=10, to=20, unit='abs'))

    # The first 10% of the `train` split.
    train_10pct_ds = datasets.load_dataset('bookcorpus', split=datasets.ReadInstruction(
        'train', to=10, unit='%'))

    # The first 10% of `train` + the last 80% of `train`.
    ri = (datasets.ReadInstruction('train', to=10, unit='%') +
          datasets.ReadInstruction('train', from_=-80, unit='%'))
    train_10_80pct_ds = datasets.load_dataset('bookcorpus', split=ri)

    # 10-fold cross-validation (see also next section on rounding behavior):
    # The validation datasets are each going to be 10%:
    # [0%:10%], [10%:20%], ..., [90%:100%].
    # And the training datasets are each going to be the complementary 90%:
    # [10%:100%] (for a corresponding validation set of [0%:10%]),
    # [0%:10%] + [20%:100%] (for a validation set of [10%:20%]), ...,
    # [0%:90%] (for a validation set of [90%:100%]).
    vals_ds = datasets.load_dataset('bookcorpus', split=[
        datasets.ReadInstruction('train', from_=k, to=k+10, unit='%')
        for k in range(0, 100, 10)])
    trains_ds = datasets.load_dataset('bookcorpus', split=[
        (datasets.ReadInstruction('train', to=k, unit='%') +
         datasets.ReadInstruction('train', from_=k+10, unit='%'))
        for k in range(0, 100, 10)])
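The two notations can also be mixed. As a small sketch (assuming the :func:`datasets.ReadInstruction.from_spec` helper is available in your version of the library), a string spec can be parsed into the equivalent :obj:`ReadInstruction` and then extended programmatically, which is convenient when part of a spec comes from configuration and part is computed at run time:

.. code-block:: python

    import datasets

    # A minimal sketch, assuming `ReadInstruction.from_spec` is available:
    # parse a string spec into the equivalent ReadInstruction, then extend
    # it with the `+` operator before loading.
    ri = datasets.ReadInstruction.from_spec('train[:10%]')
    ri = ri + datasets.ReadInstruction('train', from_=-80, unit='%')

    # Same result as split='train[:10%]+train[-80%:]'.
    train_10_80pct_ds = datasets.load_dataset('bookcorpus', split=ri)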
Percent slicing and rounding
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If a slice of a split is requested using the percent (``%``) unit, and the requested slice boundaries do not divide evenly by 100, then the default behavior is to round boundaries to the nearest integer (``closest``). This means that some slices may contain more examples than others. For example:

.. code-block:: python

    # Assuming the `train` split contains 999 records.

    # 19 records, from 500 (included) to 519 (excluded).
    train_50_52_ds = datasets.load_dataset('bookcorpus', split='train[50%:52%]')

    # 20 records, from 519 (included) to 539 (excluded).
    train_52_54_ds = datasets.load_dataset('bookcorpus', split='train[52%:54%]')

Alternatively, the ``pct1_dropremainder`` rounding can be used, so that the specified percentage boundaries are treated as multiples of 1%. This option should be used when consistency is needed (e.g. ``len(5%) == 5 * len(1%)``). It means the last examples may be truncated if ``info.splits[split_name].num_examples % 100 != 0``.

.. code-block:: python

    # 18 records, from 450 (included) to 468 (excluded).
    train_50_52pct1_ds = datasets.load_dataset('bookcorpus', split=datasets.ReadInstruction(
        'train', from_=50, to=52, unit='%', rounding='pct1_dropremainder'))

    # 18 records, from 468 (included) to 486 (excluded).
    train_52_54pct1_ds = datasets.load_dataset('bookcorpus', split=datasets.ReadInstruction(
        'train', from_=52, to=54, unit='%', rounding='pct1_dropremainder'))

    # Or equivalently:
    train_50_52pct1_ds = datasets.load_dataset('bookcorpus', split='train[50%:52%](pct1_dropremainder)')
    train_52_54pct1_ds = datasets.load_dataset('bookcorpus', split='train[52%:54%](pct1_dropremainder)')
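To make the boundary arithmetic concrete, here is a standalone sketch (plain Python, using the same hypothetical 999-record ``train`` split as above) that reproduces the record counts shown in the examples:

.. code-block:: python

    # A minimal sketch of the two rounding modes, assuming a hypothetical
    # `train` split of 999 records (no library calls involved).
    num_examples = 999

    def closest(pct):
        # `closest` (the default): round pct% of the split size to the
        # nearest integer. Python's round() matches the documented
        # boundaries here, though the library's exact tie-breaking for
        # .5 cases may differ.
        return round(pct / 100 * num_examples)

    def pct1_dropremainder(pct):
        # `pct1_dropremainder`: boundaries are multiples of
        # len(1%) == num_examples // 100, so trailing records are dropped.
        return pct * (num_examples // 100)

    # `closest`: train[50%:52%] -> [500:519] (19 records),
    #            train[52%:54%] -> [519:539] (20 records).
    print(closest(50), closest(52), closest(54))  # 500 519 539

    # `pct1_dropremainder`: train[50%:52%] -> [450:468] (18 records),
    #                       train[52%:54%] -> [468:486] (18 records).
    print(pct1_dropremainder(50), pct1_dropremainder(52),
          pct1_dropremainder(54))  # 450 468 486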