Splits and slicing
Similarly to TensorFlow Datasets, all DatasetBuilders expose various data subsets defined as splits (e.g. train, test). When constructing a datasets.Dataset instance using either datasets.load_dataset() or datasets.DatasetBuilder.as_dataset(), one can specify which split(s) to retrieve. It is also possible to retrieve slice(s) of split(s) as well as combinations of those.
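For a quick illustration of what split selection returns (a minimal sketch; when no split is specified, datasets.load_dataset() returns a dictionary-like object mapping split names to datasets):

import datasets

# Passing `split` returns a single datasets.Dataset.
train_ds = datasets.load_dataset('bookcorpus', split='train')

# Omitting `split` returns a dict-like object keyed by split name.
dataset_dict = datasets.load_dataset('bookcorpus')
train_ds = dataset_dict['train']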
Slicing API
Slicing instructions are specified in datasets.load_dataset() or datasets.DatasetBuilder.as_dataset(). Instructions can be provided either as strings or as ReadInstruction instances. Strings are more compact and readable for simple cases, while ReadInstruction might be easier to use with variable slicing parameters.
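For instance, when slice boundaries come from variables, constructing a ReadInstruction directly can be clearer than assembling a spec string. A small sketch (the pct_slice helper is illustrative only, not part of the library):

# Illustrative helper (not part of the library): build an instruction
# from variable percentage boundaries instead of formatting a string.
def pct_slice(split_name, start, end):
    return datasets.ReadInstruction(split_name, from_=start, to=end, unit='%')

ri = pct_slice('train', 25, 75)  # same selection as the string 'train[25%:75%]'
ds = datasets.load_dataset('bookcorpus', split=ri)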
Examples
Examples using the string API:
# The full `train` split.
train_ds = datasets.load_dataset('bookcorpus', split='train')
# The full `train` split and the full `test` split as two distinct datasets.
train_ds, test_ds = datasets.load_dataset('bookcorpus', split=['train', 'test'])
# The full `train` and `test` splits, concatenated together.
train_test_ds = datasets.load_dataset('bookcorpus', split='train+test')
# From record 10 (included) to record 20 (excluded) of `train` split.
train_10_20_ds = datasets.load_dataset('bookcorpus', split='train[10:20]')
# The first 10% of `train` split.
train_10pct_ds = datasets.load_dataset('bookcorpus', split='train[:10%]')
# The first 10% of `train` + the last 80% of `train`.
train_10_80pct_ds = datasets.load_dataset('bookcorpus', split='train[:10%]+train[-80%:]')
# 10-fold cross-validation (see also next section on rounding behavior):
# The validation datasets are each going to be 10%:
# [0%:10%], [10%:20%], ..., [90%:100%].
# And the training datasets are each going to be the complementary 90%:
# [10%:100%] (for a corresponding validation set of [0%:10%]),
# [0%:10%] + [20%:100%] (for a validation set of [10%:20%]), ...,
# [0%:90%] (for a validation set of [90%:100%]).
vals_ds = datasets.load_dataset('bookcorpus', split=[
    f'train[{k}%:{k+10}%]' for k in range(0, 100, 10)
])
trains_ds = datasets.load_dataset('bookcorpus', split=[
    f'train[:{k}%]+train[{k+10}%:]' for k in range(0, 100, 10)
])
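Since both list comprehensions iterate over the same range(0, 100, 10), the two lists line up index by index, so the folds can be paired directly, for example:

# Pair each training fold with its complementary validation fold.
for fold, (train_split, val_split) in enumerate(zip(trains_ds, vals_ds)):
    print(f'fold {fold}: {len(train_split)} train / {len(val_split)} validation examples')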
Examples using the ReadInstruction API (equivalent to the above):
# The full `train` split.
train_ds = datasets.load_dataset('bookcorpus', split=datasets.ReadInstruction('train'))
# The full `train` split and the full `test` split as two distinct datasets.
train_ds, test_ds = datasets.load_dataset('bookcorpus', split=[
    datasets.ReadInstruction('train'),
    datasets.ReadInstruction('test'),
])
# The full `train` and `test` splits, concatenated together.
ri = datasets.ReadInstruction('train') + datasets.ReadInstruction('test')
train_test_ds = datasets.load_dataset('bookcorpus', split=ri)
# From record 10 (included) to record 20 (excluded) of `train` split.
train_10_20_ds = datasets.load_dataset('bookcorpus', split=datasets.ReadInstruction(
    'train', from_=10, to=20, unit='abs'))
# The first 10% of `train` split.
train_10pct_ds = datasets.load_dataset('bookcorpus', split=datasets.ReadInstruction(
    'train', to=10, unit='%'))
# The first 10% of `train` + the last 80% of `train`.
ri = (datasets.ReadInstruction('train', to=10, unit='%') +
      datasets.ReadInstruction('train', from_=-80, unit='%'))
train_10_80pct_ds = datasets.load_dataset('bookcorpus', split=ri)
# 10-fold cross-validation (see also next section on rounding behavior):
# The validation datasets are each going to be 10%:
# [0%:10%], [10%:20%], ..., [90%:100%].
# And the training datasets are each going to be the complementary 90%:
# [10%:100%] (for a corresponding validation set of [0%:10%]),
# [0%:10%] + [20%:100%] (for a validation set of [10%:20%]), ...,
# [0%:90%] (for a validation set of [90%:100%]).
vals_ds = datasets.load_dataset('bookcorpus', split=[
    datasets.ReadInstruction('train', from_=k, to=k+10, unit='%')
    for k in range(0, 100, 10)])
trains_ds = datasets.load_dataset('bookcorpus', split=[
    (datasets.ReadInstruction('train', to=k, unit='%') +
     datasets.ReadInstruction('train', from_=k+10, unit='%'))
    for k in range(0, 100, 10)])
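The ReadInstruction class also has a from_spec() classmethod that parses the string form into an equivalent instruction, which allows mixing the two styles (assuming the installed version of datasets exposes it):

# Parse a string spec into the equivalent ReadInstruction.
# (from_spec is assumed to be available in the installed version.)
ri = datasets.ReadInstruction.from_spec('train[:10%]+train[-80%:]')
train_10_80pct_ds = datasets.load_dataset('bookcorpus', split=ri)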
Percent slicing and rounding
If a slice of a split is requested using the percent (%) unit, and the requested slice boundaries do not divide evenly by 100, then the default behaviour is to round boundaries to the nearest integer (closest). This means that some slices may contain more examples than others. For example:
# Assuming `train` split contains 999 records.
# 19 records, from 500 (included) to 519 (excluded).
train_50_52_ds = datasets.load_dataset('bookcorpus', split='train[50%:52%]')
# 20 records, from 519 (included) to 539 (excluded).
train_52_54_ds = datasets.load_dataset('bookcorpus', split='train[52%:54%]')
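The boundary arithmetic above can be sanity-checked in plain Python; the built-in round() is only a stand-in for the library's internal rounding, but it reproduces these particular boundaries under the 999-record assumption:

# Check the `closest` boundaries, assuming 999 records in `train`.
# Built-in round() is a stand-in here, not the library's internal code.
num_examples = 999
assert round(0.50 * num_examples) == 500
assert round(0.52 * num_examples) == 519
assert round(0.54 * num_examples) == 539
assert 519 - 500 == 19 and 539 - 519 == 20  # slice sizes differ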
Alternatively, the pct1_dropremainder rounding can be used, so that the specified percentage boundaries are treated as multiples of 1%. This option should be used when consistency is needed (e.g. len(5%) == 5 * len(1%)). This means the last examples may be truncated if info.splits[split_name].num_examples % 100 != 0.
# 18 records, from 450 (included) to 468 (excluded).
train_50_52pct1_ds = datasets.load_dataset('bookcorpus', split=datasets.ReadInstruction(
    'train', from_=50, to=52, unit='%', rounding='pct1_dropremainder'))
# 18 records, from 468 (included) to 486 (excluded).
train_52_54pct1_ds = datasets.load_dataset('bookcorpus', split=datasets.ReadInstruction(
    'train', from_=52, to=54, unit='%', rounding='pct1_dropremainder'))
# Or equivalently:
train_50_52pct1_ds = datasets.load_dataset('bookcorpus', split='train[50%:52%](pct1_dropremainder)')
train_52_54pct1_ds = datasets.load_dataset('bookcorpus', split='train[52%:54%](pct1_dropremainder)')
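Under pct1_dropremainder, each 1% bucket holds num_examples // 100 examples, which makes the counts above easy to verify (plain Python, again assuming the 999-record train split from the comments):

# Arithmetic behind `pct1_dropremainder`, assuming 999 records in `train`.
num_examples = 999
one_pct = num_examples // 100               # 9 examples per 1% bucket
assert one_pct * (52 - 50) == 18            # matches the 18-record slices above
assert num_examples - 100 * one_pct == 99   # the trailing 99 examples are dropped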