Splits and slicing¶
Similarly to Tensorfow Datasets, all DatasetBuilder
s expose various data subsets defined as splits (eg:
train
, test
). When constructing a nlp.Dataset
instance using either
nlp.load_dataset()
or nlp.DatasetBuilder.as_dataset()
, one can specify which
split(s) to retrieve. It is also possible to retrieve slice(s) of split(s)
as well as combinations of those.
Slicing API¶
Slicing instructions are specified in nlp.load_dataset
or nlp.DatasetBuilder.as_dataset
.
Instructions can be provided as either strings or ReadInstruction
. Strings
are more compact and readable for simple cases, while ReadInstruction
provide
more options and might be easier to use with variable slicing parameters.
Examples¶
Examples using the string API:
# The full `train` split.
train_ds = nlp.load_dataset('bookcorpus', split='train')
# The full `train` split and the full `test` split as two distinct datasets.
train_ds, test_ds = nlp.load_dataset('bookcorpus', split=['train', 'test'])
# The full `train` and `test` splits, concatenated together.
train_test_ds = nlp.load_dataset('bookcorpus', split='train+test')
# From record 10 (included) to record 20 (excluded) of `train` split.
train_10_20_ds = nlp.load_dataset('bookcorpus', split='train[10:20]')
# The first 10% of train split.
train_10pct_ds = nlp.load_dataset('bookcorpus', split='train[:10%]')
# The first 10% of train + the last 80% of train.
train_10_80pct_ds = nlp.load_dataset('bookcorpus', split='train[:10%]+train[-80%:]')
# 10-fold cross-validation (see also next section on rounding behavior):
# The validation datasets are each going to be 10%:
# [0%:10%], [10%:20%], ..., [90%:100%].
# And the training datasets are each going to be the complementary 90%:
# [10%:100%] (for a corresponding validation set of [0%:10%]),
# [0%:10%] + [20%:100%] (for a validation set of [10%:20%]), ...,
# [0%:90%] (for a validation set of [90%:100%]).
vals_ds = nlp.load_dataset('bookcorpus', split=[
f'train[{k}%:{k+10}%]' for k in range(0, 100, 10)
])
trains_ds = nlp.load_dataset('bookcorpus', split=[
f'train[:{k}%]+train[{k+10}%:]' for k in range(0, 100, 10)
])
Examples using the ReadInstruction
API (equivalent as above):
# The full `train` split.
train_ds = nlp.load_dataset('bookcorpus', split=nlp.ReadInstruction('train'))
# The full `train` split and the full `test` split as two distinct datasets.
train_ds, test_ds = nlp.load_dataset('bookcorpus', split=[
nlp.ReadInstruction('train'),
nlp.ReadInstruction('test'),
])
# The full `train` and `test` splits, concatenated together.
ri = nlp.ReadInstruction('train') + nlp.ReadInstruction('test')
train_test_ds = nlp.load_dataset('bookcorpus', split=ri)
# From record 10 (included) to record 20 (excluded) of `train` split.
train_10_20_ds = nlp.load_dataset('bookcorpus', split=nlp.ReadInstruction(
'train', from_=10, to=20, unit='abs'))
# The first 10% of train split.
train_10_20_ds = nlp.load_dataset('bookcorpus', split=nlp.ReadInstruction(
'train', to=10, unit='%'))
# The first 10% of train + the last 80% of train.
ri = (nlp.ReadInstruction('train', to=10, unit='%') +
nlp.ReadInstruction('train', from_=-80, unit='%'))
train_10_80pct_ds = nlp.load_dataset('bookcorpus', split=ri)
# 10-fold cross-validation (see also next section on rounding behavior):
# The validation datasets are each going to be 10%:
# [0%:10%], [10%:20%], ..., [90%:100%].
# And the training datasets are each going to be the complementary 90%:
# [10%:100%] (for a corresponding validation set of [0%:10%]),
# [0%:10%] + [20%:100%] (for a validation set of [10%:20%]), ...,
# [0%:90%] (for a validation set of [90%:100%]).
vals_ds = nlp.load_dataset('bookcorpus', [
nlp.ReadInstruction('train', from_=k, to=k+10, unit='%')
for k in range(0, 100, 10)])
trains_ds = nlp.load_dataset('bookcorpus', [
(nlp.ReadInstruction('train', to=k, unit='%') +
nlp.ReadInstruction('train', from_=k+10, unit='%'))
for k in range(0, 100, 10)])