Classes used during the dataset building process

Two classes do most of the work during the dataset building process.

class datasets.DatasetBuilder(cache_dir: Optional[str] = None, name: Optional[str] = None, hash: Optional[str] = None, features: Optional[datasets.features.Features] = None, **config_kwargs)

Abstract base class for all datasets.

DatasetBuilder has 3 key methods:

  • datasets.DatasetBuilder.info: Documents the dataset, including feature names, types, shapes, version, splits, and citation.

  • datasets.DatasetBuilder.download_and_prepare(): Downloads the source data and writes it to disk.

  • datasets.DatasetBuilder.as_dataset(): Generates a Dataset.

Configuration: Some DatasetBuilders expose multiple variants of the dataset by defining a datasets.BuilderConfig subclass and accepting a config object (or name) on construction. Configurable datasets expose a pre-defined set of configurations in datasets.DatasetBuilder.builder_configs().

class datasets.GeneratorBasedBuilder(*args, writer_batch_size=None, **kwargs)

Base class for datasets with data generation based on dict generators.

GeneratorBasedBuilder is a convenience class that abstracts away much of the data writing and reading of DatasetBuilder. It expects subclasses to implement generators of feature dictionaries across the dataset splits (_split_generators). See the method docstrings for details.

class datasets.BeamBasedBuilder(*args, **kwargs)

Beam-based builder.

class datasets.ArrowBasedBuilder(cache_dir: Optional[str] = None, name: Optional[str] = None, hash: Optional[str] = None, features: Optional[datasets.features.Features] = None, **config_kwargs)

Base class for datasets with data generation based on Arrow loading functions (CSV/JSON/Parquet).

class datasets.BuilderConfig(name: str = 'default', version: Optional[Union[str, datasets.utils.version.Version]] = '0.0.0', data_dir: Optional[str] = None, data_files: Optional[Union[str, Dict, List, Tuple]] = None, description: Optional[str] = None)

Base class for DatasetBuilder data configuration.

DatasetBuilder subclasses with data configuration options should subclass BuilderConfig and add their own properties.

  • name (str, default "default") –

  • version (Version or str, optional) –

  • data_dir (str, optional) –

  • data_files (str or dict or list or tuple, optional) –

  • description (str, optional) –
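
The subclassing pattern described above can be sketched without importing the library. The snippet below is only an illustration: `StandInBuilderConfig` mirrors the field names from the signature above but is not the real class, and `CsvLikeConfig`, its `delimiter` property, `CONFIGS`, and `find_config` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StandInBuilderConfig:
    """Stand-in mirroring the BuilderConfig fields listed above (not the real class)."""
    name: str = "default"
    version: Optional[str] = "0.0.0"
    data_dir: Optional[str] = None
    data_files: Optional[object] = None
    description: Optional[str] = None


@dataclass
class CsvLikeConfig(StandInBuilderConfig):
    """Hypothetical subclass that adds its own property to the base fields."""
    delimiter: str = ","


# A pre-defined set of configurations, one per dataset variant.
CONFIGS: List[CsvLikeConfig] = [
    CsvLikeConfig(name="comma", description="Comma-separated variant"),
    CsvLikeConfig(name="tab", delimiter="\t", description="Tab-separated variant"),
]


def find_config(name: str) -> CsvLikeConfig:
    """Return the pre-defined configuration with the given name."""
    for config in CONFIGS:
        if config.name == name:
            return config
    raise ValueError(f"Unknown config name: {name!r}")


selected = find_config("tab")
print(selected.name, repr(selected.delimiter), selected.version)
```

Looking a configuration up by name is one plausible reading of "accepting a config object (or name) on construction"; the real builder may resolve names differently.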

class datasets.DownloadManager(dataset_name: Optional[str] = None, data_dir: Optional[str] = None, download_config: Optional[datasets.utils.file_utils.DownloadConfig] = None, base_path: Optional[str] = None)

class datasets.GenerateMode(value)

Enum for how to treat pre-existing downloads and data.

The default mode is REUSE_DATASET_IF_EXISTS, which will reuse both raw downloads and the prepared dataset if they exist.

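The generation modes:

  • REUSE_DATASET_IF_EXISTS (default): reuses existing downloads and reuses the existing prepared dataset.

  • REUSE_CACHE_IF_EXISTS: reuses existing downloads but prepares the dataset afresh.

  • FORCE_REDOWNLOAD: downloads the raw data again and prepares the dataset afresh.
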
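As a rough illustration of how a generation mode decides what gets reused, the sketch below models the choice with a stand-in enum. Only REUSE_DATASET_IF_EXISTS is named in the description above; the other two member names are assumed to match the library's enum. `StandInGenerateMode` and `plan` are hypothetical names, not part of datasets.

```python
from enum import Enum
from typing import Tuple


class StandInGenerateMode(Enum):
    """Stand-in enum; member names are assumed to follow datasets.GenerateMode."""
    REUSE_DATASET_IF_EXISTS = "reuse_dataset_if_exists"
    REUSE_CACHE_IF_EXISTS = "reuse_cache_if_exists"
    FORCE_REDOWNLOAD = "force_redownload"


def plan(mode: StandInGenerateMode) -> Tuple[bool, bool]:
    """Return (reuse_downloads, reuse_prepared_dataset) for a mode."""
    # Raw downloads are reused unless a full redownload is forced.
    reuse_downloads = mode is not StandInGenerateMode.FORCE_REDOWNLOAD
    # The prepared dataset is reused only in the default mode.
    reuse_dataset = mode is StandInGenerateMode.REUSE_DATASET_IF_EXISTS
    return reuse_downloads, reuse_dataset


for mode in StandInGenerateMode:
    print(mode.name, plan(mode))
```
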
class datasets.SplitGenerator(name: str, gen_kwargs: Dict = <factory>)

Defines the split information for the generator.

This should be used as the return value of GeneratorBasedBuilder._split_generators(). See GeneratorBasedBuilder._split_generators() for more information and a usage example.

  • name (str) – Name of the Split for which the generator will create the examples.

  • **gen_kwargs – Keyword arguments to forward to the DatasetBuilder._generate_examples() method of the builder.
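
The hand-off from split generators to example generation can be sketched in plain Python. This is a simplified model, not the library's implementation: `StandInSplitGenerator` copies only the two documented fields, `ToyBuilder` and its `build` method are hypothetical, and the stand-in `_split_generators` takes no download manager argument.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Tuple


@dataclass
class StandInSplitGenerator:
    """Stand-in with the two fields documented above (not the real class)."""
    name: str
    gen_kwargs: Dict = field(default_factory=dict)


class ToyBuilder:
    """Hypothetical builder showing only the generator hand-off."""

    def _split_generators(self) -> List[StandInSplitGenerator]:
        # One entry per split; gen_kwargs carries whatever the generator needs.
        return [
            StandInSplitGenerator(name="train", gen_kwargs={"lines": ["a", "b", "c"]}),
            StandInSplitGenerator(name="test", gen_kwargs={"lines": ["d"]}),
        ]

    def _generate_examples(self, lines: List[str]) -> Iterator[Tuple[int, Dict]]:
        # Yield (key, feature dictionary) pairs for one split.
        for index, text in enumerate(lines):
            yield index, {"text": text}

    def build(self) -> Dict[str, List[Dict]]:
        # Forward each split's gen_kwargs to _generate_examples as keyword arguments.
        return {
            split.name: [example for _, example in self._generate_examples(**split.gen_kwargs)]
            for split in self._split_generators()
        }


splits = ToyBuilder().build()
print(splits)
```

Because gen_kwargs is unpacked with `**`, its keys must match the parameter names of `_generate_examples`.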

class datasets.Split(name)

Enum for dataset splits.

Datasets are typically split into different subsets to be used at various stages of training and evaluation.

  • TRAIN: the training data.

  • VALIDATION: the validation data. If present, this is typically used as evaluation data while iterating on a model (e.g. changing hyperparameters, model architecture, etc.).

  • TEST: the testing data. This is the data to report metrics on. Typically you do not want to use this during model iteration as you may overfit to it.

Note: All splits, including compositions, inherit from datasets.SplitBase.

See the guide on splits for more information.

class datasets.NamedSplit(name)

Descriptor corresponding to a named split (train, test, …).


Each descriptor can be composed with others using addition or slicing. For example:

split = datasets.Split.TRAIN.subsplit(datasets.percent[0:25]) + datasets.Split.TEST

The resulting split will correspond to 25% of the train split merged with 100% of the test split.


A split cannot be added twice, so the following will fail:

split = (
        datasets.Split.TRAIN.subsplit(datasets.percent[:25]) +
        datasets.Split.TRAIN.subsplit(datasets.percent[75:])
)  # Error
split = datasets.Split.TEST + datasets.Split.ALL  # Error


A slice can be applied only once, so the following are valid:

split = (
        datasets.Split.TRAIN.subsplit(datasets.percent[:25]) +
        datasets.Split.TEST.subsplit(datasets.percent[:50])
)
split = (datasets.Split.TRAIN + datasets.Split.TEST).subsplit(datasets.percent[:50])

But not:

train = datasets.Split.TRAIN
test = datasets.Split.TEST
split = train.subsplit(datasets.percent[:25]).subsplit(datasets.percent[:25])
split = (train.subsplit(datasets.percent[:25]) + test).subsplit(datasets.percent[:50])

class datasets.ReadInstruction(split_name, rounding=None, from_=None, to=None, unit=None)

Reading instruction for a dataset.

Examples of usage:

# The following lines are equivalent:
ds = datasets.load_dataset('mnist', split='test[:33%]')
ds = datasets.load_dataset('mnist', split=datasets.ReadInstruction.from_spec('test[:33%]'))
ds = datasets.load_dataset('mnist', split=datasets.ReadInstruction('test', to=33, unit='%'))
ds = datasets.load_dataset('mnist', split=datasets.ReadInstruction(
    'test', from_=0, to=33, unit='%'))

# The following lines are equivalent:
ds = datasets.load_dataset('mnist', split='test[:33%]+train[1:-1]')
ds = datasets.load_dataset('mnist', split=datasets.ReadInstruction.from_spec(
    'test[:33%]+train[1:-1]'))
ds = datasets.load_dataset('mnist', split=(
    datasets.ReadInstruction('test', to=33, unit='%') +
    datasets.ReadInstruction('train', from_=1, to=-1, unit='abs')))

# The following lines are equivalent:
ds = datasets.load_dataset('mnist', split='test[:33%](pct1_dropremainder)')
ds = datasets.load_dataset('mnist', split=datasets.ReadInstruction.from_spec(
    'test[:33%](pct1_dropremainder)'))
ds = datasets.load_dataset('mnist', split=datasets.ReadInstruction(
    'test', from_=0, to=33, unit='%', rounding="pct1_dropremainder"))

# 10-fold validation:
tests = datasets.load_dataset('mnist', split=[
    datasets.ReadInstruction('train', from_=k, to=k+10, unit='%')
    for k in range(0, 100, 10)])
trains = datasets.load_dataset('mnist', split=[
    datasets.ReadInstruction('train', to=k, unit='%') + datasets.ReadInstruction('train', from_=k+10, unit='%')
    for k in range(0, 100, 10)])
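
To make the percent-based instructions above more concrete, here is a library-free sketch of how percent boundaries could map to absolute example indices. The rounding behaviour is an assumption: "closest" is taken to round to the nearest index, and "pct1_dropremainder" is taken to give every 1% slice the same size and drop leftover examples. `pct_to_abs` and `ten_fold` are hypothetical helpers, and negative boundaries are not handled.

```python
import math
from typing import List, Tuple

Range = Tuple[int, int]


def pct_to_abs(boundary: int, num_examples: int, rounding: str = "closest") -> int:
    """Convert a percent boundary into an absolute example index (assumed semantics)."""
    if rounding == "closest":
        return int(round(boundary * num_examples / 100.0))
    if rounding == "pct1_dropremainder":
        # Every 1% slice gets the same size; leftover examples are dropped.
        return boundary * math.trunc(num_examples / 100.0)
    raise ValueError(f"Unknown rounding: {rounding!r}")


def ten_fold(num_examples: int) -> List[Tuple[Range, List[Range]]]:
    """Return (test_range, train_ranges) index pairs for each of the 10 folds."""
    folds = []
    for k in range(0, 100, 10):
        start = pct_to_abs(k, num_examples)
        stop = pct_to_abs(k + 10, num_examples)
        # The test fold is [start, stop); training uses everything around it.
        folds.append(((start, stop), [(0, start), (stop, num_examples)]))
    return folds


print(pct_to_abs(33, 1099))
print(pct_to_abs(33, 1099, "pct1_dropremainder"))
print(ten_fold(1000)[1])
```

With 1099 examples, the two rounding modes place the 33% boundary at different indices, which is why the choice matters when slices need to be the same size.
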

class datasets.utils.DownloadConfig(cache_dir: Optional[Union[pathlib.Path, str]] = None, force_download: bool = False, resume_download: bool = False, local_files_only: bool = False, proxies: Optional[Dict] = None, user_agent: Optional[str] = None, extract_compressed_file: bool = False, force_extract: bool = False, use_etag: bool = True, num_proc: Optional[int] = None, max_retries: int = 1, use_auth_token: Optional[Union[bool, str]] = None)

Configuration for our cached path manager.

  • cache_dir (str or Path, optional) – Specify a cache directory to save the file to (overwrite the default cache dir).

  • force_download (bool, default False) – If True, re-download the file even if it’s already cached in the cache dir.

  • resume_download (bool, default False) – If True, resume the download if an incompletely received file is found.

  • proxies (dict, optional) –

  • user_agent (str, optional) – Optional string or dict that will be appended to the user-agent on remote requests.

  • extract_compressed_file (bool, default False) – If True and the path points to a zip or tar file, extract the compressed file into a folder alongside the archive.

  • force_extract (bool, default False) – If True when extract_compressed_file is True and the archive was already extracted, re-extract the archive and overwrite the folder where it was extracted.

  • use_etag (bool, default True) –

  • num_proc (int, optional) –

  • max_retries (int, default 1) – The number of times to retry an HTTP request if it fails.

  • use_auth_token (str or bool, optional) – Optional string or boolean to use as Bearer token for remote files on the Datasets Hub. If True, will get token from ~/.huggingface.

class datasets.utils.Version(version_str: str, description: Optional[str] = None, major: Optional[Union[str, int]] = None, minor: Optional[Union[str, int]] = None, patch: Optional[Union[str, int]] = None)

Dataset version MAJOR.MINOR.PATCH.

  • version_str (str) – The version string, e.g. “1.2.3”.

  • description (str) – A description of what is new in this version.

  • major (str) –

  • minor (str) –

  • patch (str) –
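
The MAJOR.MINOR.PATCH format can be illustrated with a small parser. This is a minimal sketch, not the library's Version class: `parse_version` is a hypothetical helper, and it assumes each component is a plain non-negative integer.

```python
import re
from typing import Tuple

# Three dot-separated integer components, e.g. "1.2.3".
_VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)$")


def parse_version(version_str: str) -> Tuple[int, int, int]:
    """Split a MAJOR.MINOR.PATCH string into a tuple of integers."""
    match = _VERSION_RE.match(version_str)
    if match is None:
        raise ValueError(f"Invalid version string: {version_str!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return major, minor, patch


print(parse_version("1.2.3"))
# Tuples compare component by component, so 1.10.0 sorts after 1.9.5.
print(parse_version("1.10.0") > parse_version("1.9.5"))
```

Comparing the parsed tuples rather than the raw strings avoids the lexicographic trap where "1.10.0" would sort before "1.9.5".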