Classes used during the dataset building process

Two main classes are mostly used during the dataset building process.

class datasets.DatasetBuilder(cache_dir: Optional[str] = None, name: Optional[str] = None, hash: Optional[str] = None, features: Optional[datasets.features.Features] = None, **config_kwargs)[source]

Abstract base class for all datasets.

DatasetBuilder has 3 key methods:

  • datasets.DatasetBuilder.info: documents the dataset, including feature names, types, shapes, version, splits, citation, etc.

  • datasets.DatasetBuilder.download_and_prepare: downloads the source data and writes it to disk.

  • datasets.DatasetBuilder.as_dataset: generates a Dataset.

Configuration: Some DatasetBuilders expose multiple variants of the dataset by defining a datasets.BuilderConfig subclass and accepting a config object (or name) on construction. Configurable datasets expose a pre-defined set of configurations in datasets.DatasetBuilder.builder_configs.
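The configuration mechanism can be illustrated with a minimal sketch. It deliberately does not import the library: ToyBuilderConfig, LanguageConfig, and ToyBuilder are hypothetical stand-ins that only mirror the pattern described above (a config subclass with its own property, a pre-defined set of configurations, and construction from either a config object or a name). The real classes do considerably more.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyBuilderConfig:
    """Stand-in base config; field names follow the BuilderConfig signature."""
    name: str = "default"
    version: str = "0.0.0"
    description: Optional[str] = None


@dataclass
class LanguageConfig(ToyBuilderConfig):
    """Hypothetical subclass that adds one property of its own."""
    language: str = "en"


class ToyBuilder:
    """Stand-in builder, not the real DatasetBuilder."""

    # Pre-defined configurations, keyed by name.
    builder_configs = {
        cfg.name: cfg
        for cfg in (
            LanguageConfig(name="english", language="en"),
            LanguageConfig(name="french", language="fr"),
        )
    }

    def __init__(self, config):
        # Accept either a config object or the name of a pre-defined one.
        if isinstance(config, str):
            config = self.builder_configs[config]
        self.config = config


by_name = ToyBuilder("french")
by_object = ToyBuilder(LanguageConfig(name="german", language="de"))
```

Looking a configuration up by name keeps the common variants discoverable, while passing an object leaves room for variants that were not pre-defined.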

class datasets.GeneratorBasedBuilder(*args, writer_batch_size=None, **kwargs)[source]

Base class for datasets with data generation based on dict generators.

GeneratorBasedBuilder is a convenience class that abstracts away much of the data writing and reading of DatasetBuilder. It expects subclasses to implement generators of feature dictionaries across the dataset splits (_split_generators). See the method docstrings for details.
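The division of labour can be sketched with a toy base class. This is an assumption-laden illustration, not the library's implementation: ToyGeneratorBuilder and SquaresBuilder are hypothetical, JSON Lines files stand in for the real on-disk format, and a plain dict stands in for the list of split generators. The point is only that the base class owns writing and reading, while the subclass supplies generators of feature dictionaries.

```python
import json
import tempfile
from pathlib import Path


class ToyGeneratorBuilder:
    """Stand-in base class that owns all writing and reading."""

    def __init__(self, cache_dir):
        self.cache_dir = Path(cache_dir)

    def _split_generators(self):
        # Subclasses return {split_name: kwargs for _generate_examples}.
        raise NotImplementedError

    def _generate_examples(self, **kwargs):
        # Subclasses yield (key, feature_dict) pairs.
        raise NotImplementedError

    def download_and_prepare(self):
        # Write one file per split from whatever the generator yields.
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        for split, kwargs in self._split_generators().items():
            with open(self.cache_dir / f"{split}.jsonl", "w") as f:
                for _key, example in self._generate_examples(**kwargs):
                    f.write(json.dumps(example) + "\n")

    def as_dataset(self, split):
        # Read the prepared split back from disk.
        with open(self.cache_dir / f"{split}.jsonl") as f:
            return [json.loads(line) for line in f]


class SquaresBuilder(ToyGeneratorBuilder):
    """Hypothetical subclass: only the two generator hooks are implemented."""

    def _split_generators(self):
        return {
            "train": {"start": 0, "stop": 3},
            "test": {"start": 3, "stop": 5},
        }

    def _generate_examples(self, start, stop):
        for i in range(start, stop):
            yield i, {"x": i, "y": i * i}


with tempfile.TemporaryDirectory() as tmp:
    builder = SquaresBuilder(tmp)
    builder.download_and_prepare()
    train = builder.as_dataset("train")
    test = builder.as_dataset("test")
```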

class datasets.BeamBasedBuilder(*args, **kwargs)[source]

Beam-based builder, for datasets whose generation runs as an Apache Beam pipeline.

class datasets.ArrowBasedBuilder(cache_dir: Optional[str] = None, name: Optional[str] = None, hash: Optional[str] = None, features: Optional[datasets.features.Features] = None, **config_kwargs)[source]

Base class for datasets with data generation based on Arrow loading functions (CSV/JSON/Parquet).

class datasets.BuilderConfig(name: str = 'default', version: Optional[Union[str, datasets.utils.version.Version]] = '0.0.0', data_dir: str = None, data_files: Union[Dict, List] = None, description: str = None)[source]

Base class for DatasetBuilder data configuration.

DatasetBuilder subclasses with data configuration options should subclass BuilderConfig and add their own properties.

class datasets.DownloadManager(dataset_name: Optional[str] = None, data_dir: Optional[str] = None, download_config: Optional[datasets.utils.file_utils.DownloadConfig] = None, base_path: Optional[str] = None)[source]

class datasets.SplitGenerator(name: str, gen_kwargs: Dict = <factory>)[source]

Defines the split information for the generator.

This should be used as the return value of GeneratorBasedBuilder._split_generators. See GeneratorBasedBuilder._split_generators for more information and a usage example.

Parameters
  • name (str): name of the Split for which the generator will create the examples.

  • gen_kwargs (dict): kwargs to forward to the _generate_examples() method of the builder.
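How the two parameters fit together can be shown with a short sketch. ToySplitGenerator is a hypothetical stand-in that mirrors only the documented fields, and generate_examples, its filepath and limit arguments, and the file names are invented for illustration. The sketch assumes nothing about the real class beyond what is stated above: each split carries a name plus the keyword arguments to forward.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ToySplitGenerator:
    """Stand-in with the two documented fields; not the real SplitGenerator."""
    name: str
    # default_factory gives every instance its own dict.
    gen_kwargs: Dict = field(default_factory=dict)


def generate_examples(filepath, limit):
    # Hypothetical generator: yields (key, feature_dict) pairs.
    for i in range(limit):
        yield f"{filepath}:{i}", {"source": filepath, "index": i}


split_generators = [
    ToySplitGenerator(name="train", gen_kwargs={"filepath": "train.txt", "limit": 2}),
    ToySplitGenerator(name="test", gen_kwargs={"filepath": "test.txt", "limit": 1}),
]

# Forward each split's gen_kwargs to the example generator.
examples_per_split = {
    sg.name: list(generate_examples(**sg.gen_kwargs)) for sg in split_generators
}
```

Keeping the per-split arguments in gen_kwargs lets a single example generator serve every split.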

class datasets.Split(name)[source]

Enum for dataset splits.

Datasets are typically split into different subsets to be used at various stages of training and evaluation.

  • TRAIN: the training data.

  • VALIDATION: the validation data. If present, this is typically used as evaluation data while iterating on a model (e.g. changing hyperparameters, model architecture, etc.).

  • TEST: the testing data. This is the data to report metrics on. Typically you do not want to use this during model iteration as you may overfit to it.

Note: All splits, including compositions, inherit from datasets.SplitBase.

See the [guide on splits](https://github.com/huggingface/datasets/blob/master/docs/source/splits.rst) for more information.