Classes used during the dataset building process
Two main classes are used for most of the dataset building process.
- class datasets.DatasetBuilder(cache_dir: Optional[str] = None, name: Optional[str] = None, hash: Optional[str] = None, features: Optional[datasets.features.Features] = None, **config_kwargs)
Abstract base class for all datasets.
DatasetBuilder has 3 key methods:
- datasets.DatasetBuilder.info(): Documents the dataset, including feature names, types, and shapes, version, splits, citation, etc.
- datasets.DatasetBuilder.download_and_prepare(): Downloads the source data and writes it to disk.
- datasets.DatasetBuilder.as_dataset(): Generates a Dataset.
Configuration: Some DatasetBuilders expose multiple variants of the dataset by defining a datasets.BuilderConfig subclass and accepting a config object (or name) on construction. Configurable datasets expose a pre-defined set of configurations in datasets.DatasetBuilder.builder_configs().
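As a rough sketch of how these three methods fit together (assuming your installed version exposes datasets.load_dataset_builder; "squad" is just an example dataset name and requires network access on first use):

import datasets

# Obtain a builder for an existing dataset (example name; any Hub dataset works).
builder = datasets.load_dataset_builder("squad")

# 1. info: metadata about features, splits, citation, etc.
print(builder.info.features)

# 2. download_and_prepare: fetch the source data and write Arrow files to the cache.
builder.download_and_prepare()

# 3. as_dataset: materialize a Dataset object from the prepared files.
train_dataset = builder.as_dataset(split="train")
print(train_dataset[0])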
- class datasets.GeneratorBasedBuilder(*args, writer_batch_size=None, **kwargs)
Base class for datasets with data generation based on dict generators.
GeneratorBasedBuilder is a convenience class that abstracts away much of the data writing and reading of DatasetBuilder. It expects subclasses to implement generators of feature dictionaries across the dataset splits (_split_generators). See the method docstrings for details.
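A minimal sketch of such a subclass (the file name my_data.txt and the feature schema are hypothetical; only the _info / _split_generators / _generate_examples hooks described above are shown):

import datasets

class MyTextDataset(datasets.GeneratorBasedBuilder):
    """Toy builder yielding one feature dict per line of a local text file."""

    def _info(self):
        return datasets.DatasetInfo(
            description="Toy line-by-line text dataset.",
            features=datasets.Features({"text": datasets.Value("string")}),
        )

    def _split_generators(self, dl_manager):
        # A single TRAIN split reading from a hypothetical local file.
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={"filepath": "my_data.txt"},
            )
        ]

    def _generate_examples(self, filepath):
        # Yield (key, feature_dict) pairs, one per example.
        with open(filepath, encoding="utf-8") as f:
            for idx, line in enumerate(f):
                yield idx, {"text": line.strip()}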
- class datasets.ArrowBasedBuilder(cache_dir: Optional[str] = None, name: Optional[str] = None, hash: Optional[str] = None, features: Optional[datasets.features.Features] = None, **config_kwargs)
Base class for datasets with data generation based on Arrow loading functions (CSV/JSON/Parquet).
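A hedged sketch of such a builder, assuming the _generate_tables hook used by the packaged CSV/JSON/Parquet builders (the file name my_data.csv is hypothetical):

import pyarrow.csv as pac
import datasets

class MyCsvDataset(datasets.ArrowBasedBuilder):
    """Toy builder that loads CSV files directly as Arrow tables."""

    def _info(self):
        return datasets.DatasetInfo(description="Toy CSV dataset.")

    def _split_generators(self, dl_manager):
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={"files": ["my_data.csv"]},
            )
        ]

    def _generate_tables(self, files):
        # Yield (key, pyarrow.Table) pairs instead of per-example dicts.
        for idx, file in enumerate(files):
            yield idx, pac.read_csv(file)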
- class datasets.BuilderConfig(name: str = 'default', version: Optional[Union[str, datasets.utils.version.Version]] = '0.0.0', data_dir: Optional[str] = None, data_files: Optional[Union[str, Dict, List, Tuple]] = None, description: Optional[str] = None)
Base class for DatasetBuilder data configuration.
DatasetBuilder subclasses with data configuration options should subclass BuilderConfig and add their own properties.
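A minimal sketch of that pattern (the config class, its language field, and the dataset itself are hypothetical):

import datasets

class MyDatasetConfig(datasets.BuilderConfig):
    """Config adding a language option on top of the base fields."""

    def __init__(self, language="en", **kwargs):
        super().__init__(**kwargs)
        self.language = language

class MyDataset(datasets.GeneratorBasedBuilder):
    BUILDER_CONFIG_CLASS = MyDatasetConfig
    # Pre-defined variants that users can select by name.
    BUILDER_CONFIGS = [
        MyDatasetConfig(name="en", version="1.0.0", language="en"),
        MyDatasetConfig(name="fr", version="1.0.0", language="fr"),
    ]
    # _info / _split_generators / _generate_examples omitted from this sketch.

Inside the builder, the selected variant is then available as self.config (e.g. self.config.language).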
- class datasets.DownloadManager(dataset_name: Optional[str] = None, data_dir: Optional[str] = None, download_config: Optional[datasets.utils.file_utils.DownloadConfig] = None, base_path: Optional[str] = None)
- class datasets.SplitGenerator(name: str, gen_kwargs: Dict = <factory>)
Defines the split information for the generator.
This should be used as the return value of GeneratorBasedBuilder._split_generators(). See GeneratorBasedBuilder._split_generators() for more info and an example of usage.
Parameters:
- name (str) – Name of the Split for which the generator will create the examples.
- **gen_kwargs – Keyword arguments to forward to the DatasetBuilder._generate_examples() method of the builder.
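Putting SplitGenerator and DownloadManager together, a hedged sketch of a _split_generators that downloads remote files and returns one SplitGenerator per split (the URLs are hypothetical; gen_kwargs are forwarded verbatim to _generate_examples):

import datasets

class MyRemoteDataset(datasets.GeneratorBasedBuilder):
    def _info(self):
        return datasets.DatasetInfo(
            features=datasets.Features({"text": datasets.Value("string")}),
        )

    def _split_generators(self, dl_manager):
        # dl_manager is the DownloadManager created by download_and_prepare().
        # download_and_extract accepts nested structures (here a dict of URLs)
        # and returns the matching structure of local paths.
        files = dl_manager.download_and_extract({
            "train": "https://example.com/train.txt",
            "test": "https://example.com/test.txt",
        })
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={"filepath": files["train"]},
            ),
            datasets.SplitGenerator(
                name=datasets.Split.TEST,
                gen_kwargs={"filepath": files["test"]},
            ),
        ]

    def _generate_examples(self, filepath):
        # Each SplitGenerator's gen_kwargs end up here, one call per split.
        with open(filepath, encoding="utf-8") as f:
            for idx, line in enumerate(f):
                yield idx, {"text": line.strip()}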
- class datasets.Split(name)
Enum for dataset splits.
Datasets are typically split into different subsets to be used at various stages of training and evaluation.
- TRAIN: the training data.
- VALIDATION: the validation data. If present, this is typically used as evaluation data while iterating on a model (e.g. changing hyperparameters, model architecture, etc.).
- TEST: the testing data. This is the data to report metrics on. Typically you do not want to use this during model iteration as you may overfit to it.
Note: All splits, including compositions, inherit from datasets.SplitBase.
See the guide on splits for more information.
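As a short illustration (assuming network access and that the public "squad" dataset is available on the Hub), these enum members can be passed wherever a split is expected, interchangeably with the equivalent string:

import datasets

# The enum member and the plain string select the same split.
train_ds = datasets.load_dataset("squad", split=datasets.Split.TRAIN)
valid_ds = datasets.load_dataset("squad", split="validation")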
- class datasets.NamedSplit(name)
Descriptor corresponding to a named split (train, test, …).
Examples
Each descriptor can be composed with others using addition or slicing. For example:
split = datasets.Split.TRAIN.subsplit(datasets.percent[0:25]) + datasets.Split.TEST
The resulting split will correspond to 25% of the train split merged with 100% of the test split.
Warning
A split cannot be added twice, so the following will fail:
split = (
    datasets.Split.TRAIN.subsplit(datasets.percent[:25])
    + datasets.Split.TRAIN.subsplit(datasets.percent[75:])
)  # Error
split = datasets.Split.TEST + datasets.Split.ALL  # Error
Warning
The slices can be applied only once, so the following are valid:
split = (
    datasets.Split.TRAIN.subsplit(datasets.percent[:25])
    + datasets.Split.TEST.subsplit(datasets.percent[:50])
)
split = (datasets.Split.TRAIN + datasets.Split.TEST).subsplit(datasets.percent[:50])
But not:
train = datasets.Split.TRAIN
test = datasets.Split.TEST
split = train.subsplit(datasets.percent[:25]).subsplit(datasets.percent[:25])
split = (train.subsplit(datasets.percent[:25]) + test).subsplit(datasets.percent[:50])
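As a small additional sketch (the equality behaviour shown is an assumption about NamedSplit, worth checking against your installed version), a named split can be referred to either through the Split enum or by constructing the descriptor directly:

import datasets

train = datasets.NamedSplit("train")

# Assumption: descriptors with the same name compare equal, and the
# descriptor's string form is the plain split name.
assert train == datasets.Split.TRAIN
assert str(train) == "train"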